Saturday, June 28, 2008

Capturing biology one model at a time

Mathematical and computational modeling is (I hope) a well accepted requirement in biology. These tools allow us to formalize and study systems of higher complexity that are hard to conceptualize with logic thinking. There have been great advances in our capacity to model different biological systems, from single components to cellular functions and tissues. Many of these efforts have been ongoing separately, each one dealing with a particular layer of abstraction (atoms, interactions, cells, etc) and some of them are now reaching a level of accuracy that rivals some experimental methods. I will try to summarize, in a series of blog posts, the main advances behind some of these models and examples of integration between them with particular emphasis on proteins and cellular networks. I invite others to post about models in their areas of interest to be collected for a review.

From sequence to fold
RNA and proteins once produced adopt structures that have different functional roles. In principle all information required to determine the structure is in the DNA sequence that encodes for the RNA/protein. Although there has been some success in the prediction of RNA structure from sequence ab-initio protein folding remains a difficult challenge (see review by R.Das and D.Baker). A more pragmatic approach has been to use the increasing structural and sequence data made available in public databases to develop sequence based models for protein domains. In this way, for well studied protein folds it is possible to ask the reverse question, what sequences are likely to fold this way.
(To be expanded in a future post, volunteers welcome)

Protein binding models

I am particularly interested in how proteins interact with other components (mainly other proteins and DNA) and in trying to model these interactions from sequence to function. I will leave protein-compound interactions and metabolic networks for more knowledge people.
As mentioned above even without a complete ab-initio folding model, it is possible to predict for some sequences what is their structure or determine to what protein/domain family the sequence belongs from comparative genomics analysis. This by itself might not be very informative from a cellular perspective. We need to know how cellular components interact and hwo these interconnected components create useful functions in a cell.

Docking
Trying to understand and predict how two proteins interact in a complex has been the challenge of structural computational biology for more than two decades . The initial attempt to understand protein-interaction from computational analysis of structural data (what is known today as docking) was published by Wodak and Janin in 1978. In this seminal study, the authors established a computational procedure to reconstitute a protein complex from simplified models of the two interacting proteins. In the twenty-years that have followed the complexity and accuracy of docking methods has steadily increased but still faces difficult hurdles (see reviews Bonvin et al. 2006, Gray, 2006). Docking methods start from the knowledge that two proteins interact and aim at predicting the most likely binding interfaces and conformation of these proteins in a 3D model of the complex. Ultimately, docking approaches might one day also predict new interactions for a protein by exhaustively docking all other proteins in the proteome of the species, but at the moment this is still not feasible.

Interaction types
It should still be possible to use the 3D structures of protein complexes to understand at least particular interactions types. In a recent study, Russel and Aloy have shown that it is possible to transfer structural information on protein-protein interactions by homology to other proteins with identical sequences (Aloy and Russell 2002). In this approach the homologous proteins are aligned to the sequences of the proteins in the 3D complex structure. Mutations in the homologous sequences are evaluated with an empirical potential to determine the likelihood of binding. A similar approach was described soon after by Lu and colleagues and both have been applied on large scale genomic studies (Aloy and Russell 2003 ; Lu et al. 2003). As any other functional annotation by homology this method is limited by how much the target proteins have diverged from the templates. Alloy and Rusell estimated that interaction modeling is reliable above 30% sequence identity (Aloy et al. 2003). Substitutions can also be evaluated with more sophisticated energy potentials after an homology model of the interface under study is created. Examples of tools that can be used to evaluate the impact of mutations on binding propensity include Rosetta and FoldX.
Althougt the methods described above were mostly developed for domain-domain protein interactions similar aproaches have been developed for protein-peptide interactions (see for example McLaughlin et al. 2006) and protein-DNA interactions (see for example Kaplan et al. 2005) .

In summary the accumulation of protein-protein and protein-DNA interaction information along with structures of complexes and the ever increase coverage of sequence space allow us to develop models that describe binding for some domain families. In a future blog post I will try to review the different domain families that are well covered by these binding models.

Previous mini-reviews
Protein sequence evolution