Tuesday, September 20, 2011

SIMAP (Similarity Matrix of Proteins)

SIMAP (Similarity Matrix of Proteins) is a public database of pre-calculated protein similarities that plays a key role in many bioinformatics methods. Protein sequence comparison is the most powerful tool in computational biology for characterizing protein sequences because of the enormous amount of information that is preserved throughout the evolutionary process.

About SIMAP (Similarity Matrix of Proteins) 
The SIMAP database contains all currently published protein sequences and is continuously updated. The computational effort for keeping SIMAP up-to-date is constantly increasing. Please help to update SIMAP by calculating protein similarities on your computer. The computing power you donate supports the many biological research projects that make use of SIMAP data.

Protein similarities are computed using the FASTA algorithm, which offers a good balance of speed and sensitivity. Protein domains are calculated using the InterPro methods and databases. SIMAP is, to our knowledge, the only project that combines comprehensive coverage of all known proteins with incremental update capabilities.

What is SIMAP used for?
Because of the huge number of known protein sequences in public databases, it has become clear that most of them will not be experimentally characterized in the near future. Nevertheless, proteins that have evolved from a common ancestor (so-called orthologs) often share the same functions. It is therefore possible to infer the function of an uncharacterized protein from an ortholog with known function. A well-known example is research on mouse genes and proteins, whose results in many cases also hold for the orthologous human genes and proteins. Protein similarities provide information about relationships between proteins and are necessary for the prediction of orthologs.
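
To make the link between similarity data and ortholog prediction more concrete, here is a minimal sketch of one common heuristic, the reciprocal best hit: two proteins from different organisms are treated as candidate orthologs if each is the other's highest-scoring match. This is only an illustration and not necessarily how SIMAP data is used in practice. The function names, protein names and scores are all made up for the example.

```python
def best_hits(scores):
    """Map each query protein to its highest-scoring target.

    scores: dict mapping (query, target) pairs to a similarity score.
    """
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs in which a and b are each other's best hit."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)


# Hypothetical similarity scores (not real SIMAP data).
mouse_vs_human = {
    ("mouseA", "humanA"): 950,
    ("mouseA", "humanB"): 300,
    ("mouseB", "humanB"): 700,
    ("mouseC", "humanA"): 400,
}
human_vs_mouse = {
    ("humanA", "mouseA"): 940,
    ("humanA", "mouseC"): 410,
    ("humanB", "mouseB"): 690,
    ("humanB", "mouseA"): 310,
}

candidate_orthologs = reciprocal_best_hits(mouse_vs_human, human_vs_mouse)
print(candidate_orthologs)
```

In this toy data set mouseC is not reported, because its best human match, humanA, has a better match in mouseA.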

Protein domains (often called functional domains) are the structural building blocks of proteins. They are responsible for the activities of a given protein, e.g. binding small molecules, catalyzing reactions or binding other proteins in large complexes. The knowledge about protein domains is stored in huge repositories like the InterPro databases. The prediction of domains in newly sequenced proteins is based on those databases and provides a fully automatic functional annotation of these proteins. Therefore we calculate protein domains for all proteins in SIMAP, thus providing the largest system for protein function prediction worldwide.

Many more bioinformatics methods rely on protein similarities and domains. Our protein similarity database provides pre-computed similarity and domain data and represents the known protein space. This opens completely new perspectives compared with the common practice of re-calculating this kind of data again and again. SIMAP is updated regularly: the similarity matrix is simply extended incrementally when new sequences appear. The use of SIMAP is completely free for education and public research.

Why do we need distributed computing for SIMAP?
The computational cost of calculating the similarity data grows with the square of the number of sequences in the database, so the effort for keeping the matrix up-to-date is constantly increasing. Our internal resources, which have performed the calculations for SIMAP over the last several years, are no longer sufficient to keep up with all new sequences. That is why we implemented a SIMAP client for the BOINC platform (Berkeley Open Infrastructure for Network Computing), which uses the FASTA algorithm to detect sequence similarities.

The situation for protein domains is different but of similar complexity. The computational costs are proportional to the number of sequences and the number of domain models. Due to the growth of the sequence space and the frequent updates of the domain databases, the computational effort for keeping the domain predictions up-to-date is constantly increasing.
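
The scaling described above can be put into numbers with a short sketch. It assumes, for simplicity, that every unordered pair of sequences is compared exactly once and that every sequence is scanned against every domain model. The function names and the example sizes are hypothetical and do not reflect the actual size of SIMAP.

```python
def pairwise_comparisons(n):
    """Number of unordered sequence pairs in an all-against-all comparison."""
    return n * (n - 1) // 2


def incremental_comparisons(existing, new):
    """Comparisons needed to extend an existing matrix by `new` sequences:
    every new sequence against every existing one, plus new against new."""
    return existing * new + new * (new - 1) // 2


def domain_scans(sequences, models):
    """Number of sequence-versus-model scans for domain prediction."""
    return sequences * models


# Ten times more sequences means roughly a hundred times more comparisons.
for n in (1_000, 10_000, 100_000):
    print(n, pairwise_comparisons(n))

# Extending the matrix incrementally costs exactly the difference between
# the full matrix after the update and the full matrix before it.
before, added = 100_000, 1_000
assert (pairwise_comparisons(before) + incremental_comparisons(before, added)
        == pairwise_comparisons(before + added))
```

Note that the cost of adding a fixed number of new sequences still grows with the size of the existing database, which is why even an incrementally updated matrix needs more computing power over time.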

What are the institutions behind SIMAP?
SIMAP is a joint project of the GSF National Research Center for Environment and Health, Neuherberg, and the Technical University Munich, Center of Life and Food Science Weihenstephan (both in Germany). For further information, please contact Thomas Rattei (Department of Genome Oriented Bioinformatics, TU Munich).
