Center for Plant Cell Biology at UC Riverside
 

 

CEPCEB Members


Stefano Lonardi
Assistant Professor
Department of Computer Science and Engineering
320 Surge
University of California
Riverside, CA 92521
Phone: (951) 827-2203
Fax: (951) 827-4643


Areas of Expertise
  • Computational Molecular Biology
  • Data Compression
  • Data Mining
  • Information Hiding


 

Background

I received a Laurea degree cum laude in Computer Science from the CS Department of the University of Pisa, Italy. While I was pursuing a doctoral degree in Computer Science and Electronic Engineering at the Department of EE and CS, University of Padua, Italy, I joined the graduate program of the Computer Science Department of Purdue University.

During 1999, I spent the summer working for Celera Genomics, Rockville, MD. I contributed to the code for the Celera Assembler that assembled the genome of Drosophila melanogaster, and then made history by assembling the genome of Homo sapiens. I joined the Department of Computer Science and Engineering, University of California, Riverside in the summer of 2001.

My research interests are aimed towards the computational challenges of molecular biology and, in particular, the statistical analysis of sequences. An unprecedented wealth of data is being generated by genome sequencing projects and other efforts to determine the structures and functions of biological molecules. The demands and opportunities for interpreting these data are increasing at the same rate as biological databases are expanding. At the current pace at which biological information becomes available in the form of new DNA sequences, proteins, 3D structures, gene expression profiles, protein interactions, etc., the ability to automatically analyze, cluster, classify, and annotate it is becoming an increasingly critical need.

Ever faster and more sophisticated algorithms have to be sought, while the scientific exploration of the underlying computational problems is still in its infancy. Nonetheless, a basic repertoire of techniques for searching, matching, comparing, and analyzing discrete structures -- such as strings, arrays, trees, automata, etc. -- has been developed over the past decades, in response to needs coming from a variety of applications. My research has mostly focused on the design of algorithms to find, exploit, and analyze regularities and repetitions in texts. In the domain of data compression, these repetitive structures are regarded as redundancies to be removed. Conversely, in the context of learning, classification, and pattern discovery, repeated patterns are unveiled as carriers of information and structure.

With biological data accumulating on a massive scale, a critical issue is to design strategies to limit and filter what the analysis returns. Otherwise, the limited bandwidth of our perception will become an insurmountable bottleneck in the analysis pipeline. We have to develop effective synthetic descriptions of our data, generate succinct characterizations of the problem at hand, and enhance prominent features of the analysis. Hints could come from data compression and scientific visualization, as well as from information retrieval and machine learning. Understanding and characterizing combinatorial structures and properties can also help to reduce the amount of data at the outset. An orthogonal research direction that also addresses these needs would be to study sampling techniques and devise algorithms to compute an "approximate analysis" of the data.

In addition to sequencing projects, we can also expect a massive amount of data on gene expression to be generated in the near future, mainly by DNA micro-array technology. Databases and tools which can organize and visualize these data in useful ways need to be developed. Appropriate statistical and mathematical models are required for analysis and comparison of expression and functional data. One of the steps of these studies uses pattern discovery tools to find promoters and enhancers of transcription.


 

Research Interests

Computational molecular biology, data mining, data compression, statistical analysis of sequences, information hiding, advanced data structures and algorithms, scientific visualization



Ongoing Current Projects


Verbumculus is a suite of software tools for the detection of over- or under-represented words that might appear as consecutive substrings in nucleotide and amino acid sequences. The inner core of Verbumculus rests on subtly interwoven properties of statistics, pattern matching, and combinatorics on words. These properties enable us to limit drastically and a priori the set of over- or under-represented candidate words of all lengths in a given sequence, thereby making it more feasible both to detect and to visualize such words in a fast and practically useful way. (joint work with A. Apostolico, M.E. Bock)
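To make the notion of an over- or under-represented word concrete, here is a minimal Python sketch. It is not the Verbumculus algorithm: it naively scores every word of one fixed length against an i.i.d. symbol model, and the function name `word_zscores` and the binomial variance approximation (which ignores word self-overlap) are assumptions made purely for illustration.

```python
from collections import Counter
from math import sqrt


def word_zscores(seq, k):
    """Score each length-k word occurring in seq against an i.i.d. symbol model.

    Hypothetical helper for illustration only. A positive score suggests the
    word occurs more often than expected, a negative score less often.
    """
    n = len(seq) - k + 1                      # number of length-k windows
    if n <= 0:
        return {}
    symbol_freq = Counter(seq)                # empirical symbol frequencies
    total = len(seq)
    observed = Counter(seq[i:i + k] for i in range(n))

    scores = {}
    for word, count in observed.items():
        # Probability of the word if symbols were drawn independently.
        p = 1.0
        for ch in word:
            p *= symbol_freq[ch] / total
        expected = n * p
        # Assumption: binomial variance, ignoring overlaps between occurrences.
        variance = n * p * (1.0 - p)
        scores[word] = (count - expected) / sqrt(variance) if variance > 0 else 0.0
    return scores


if __name__ == "__main__":
    for word, z in sorted(word_zscores("AAAAACCCCC", 2).items()):
        print(word, round(z, 2))
```

This brute-force version only scores words that actually occur and handles one length at a time; the a priori pruning of candidate words across all lengths described above is exactly what it lacks.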
DNA microarray technology is a powerful method to monitor the activity of thousands of genes simultaneously in a cell. An impressive amount of data is being collected in "wet labs" all over the world. A large microarray instrument can easily analyze and record in a day the expression profiles of, say, tens of thousands of genes. The output is typically in the tens of MBytes, and can easily reach hundreds of MBytes of raw data. It is considered crucial to keep the raw data on some permanent storage medium. Repeating the experiment is usually not viable because of the high costs associated with this new technology. Some form of data compression is therefore required. (joint work with Yu Luo)
Greedy off-line textual substitution refers to the following approach to lossless compression. Given a long textstring x, a substring w is identified such that replacing all instances of w in x except one by a suitable pair of pointers yields the highest possible contraction of x; the process is then repeated on the contracted textstring, until substrings capable of producing contractions can no longer be found. (joint work with A. Apostolico)
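The greedy loop described above can be sketched in a few lines of Python. This is a brute-force illustration under stated assumptions, not the implementation from this project: each pointer is modeled as a single placeholder symbol drawn from the Unicode private-use area (a simplification of the pair of pointers), the input is assumed not to contain such symbols, and the names `best_word`, `contract`, and `expand` are hypothetical.

```python
def best_word(text):
    """Return (gain, word) for the substring whose substitution contracts text most."""
    best = (0, "")
    seen = set()
    n = len(text)
    for length in range(2, n // 2 + 1):
        for start in range(n - length + 1):
            w = text[start:start + length]
            if w in seen:
                continue
            seen.add(w)
            f = text.count(w)                 # non-overlapping occurrences
            # All occurrences but one shrink from `length` symbols to 1.
            gain = (f - 1) * (length - 1)
            if gain > best[0]:
                best = (gain, w)
    return best


def contract(text):
    """Repeatedly substitute the best word until no contraction is possible."""
    rules = []
    while True:
        gain, w = best_word(text)
        if gain <= 0:
            break
        placeholder = chr(0xE000 + len(rules))    # assumed absent from the input
        first_end = text.index(w) + len(w)        # keep the first occurrence
        text = text[:first_end] + text[first_end:].replace(w, placeholder)
        rules.append((placeholder, w))
    return text, rules


def expand(text, rules):
    """Undo contract() by expanding placeholders in reverse order."""
    for placeholder, w in reversed(rules):
        text = text.replace(placeholder, w)
    return text


if __name__ == "__main__":
    original = "abcabcabcabc"
    contracted, rules = contract(original)
    print(len(original), len(contracted), expand(contracted, rules) == original)
```

The candidate search here takes roughly cubic time in the length of the text, so it is only suitable for short strings; a practical encoder would also need a realistic cost for each pointer pair.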
Pattern Matching Pointers is a webpage containing an extensive collection of resources on combinatorial pattern matching. Links to home pages of researchers, books, software, and journals are constantly updated. Currently, the page receives about two hundred hits per day, and it is ranked first (or second, depending on the day) in the category "Computer > Algorithms" by Google.
gzipS is a version of gzip that can hide within the compressed text enough information to warrant the authenticity of the compressed document. The design is based on cryptographically secure pseudo-random generators, in such a way that the hidden data cannot be retrieved in a reasonable amount of time by an attacker unless the secret key is known. The embedding is totally transparent: since the file can still be decompressed by the original LZ-77 algorithm, to a "casual eavesdropper" the augmented compressed file appears perfectly normal. Preliminary experiments also show that the degradation in compression due to the embedding is almost negligible. (joint work with M.J. Atallah)
  
Source coding and channel coding are two opposing forces that present significant challenges in error-resilient adaptive lossless data compression. Source coding tries to decorrelate the input sequence as much as possible by removing redundant information, while channel coding introduces additional correlation by adding information in order to protect against errors. Because of their devastating effects, errors in adaptive data compression have been a long-standing open problem. As a result, the non-resilience of adaptive data compression has hindered its use in many applications. However, joint source-channel coding has emerged as a possible solution to the problem. We have developed a novel joint source-channel coding algorithm capable of correcting errors in the popular Lempel-Ziv'77 scheme without losing any practical compression power. (joint work with Wojciech Szpankowski)
  
HarvEST EST analysis (joint work with Tim Close, Tao Jiang, Steve Wanamaker, Jie Zheng)


Current Laboratory Personnel

  • Qiaofeng Yang, Ph.D. student, Genetics program
  • Yu Luo, Ph.D. student, Computer Science
  • Jie Zheng, Ph.D. student, Computer Science
  • Kun Yan, M.S. student, Computer Science

Selected Recent Publications (Bibliography page)

 



University of California, Riverside
page created by:
rtz media