October 28, 2014
Here is some information on how our study of Protein-Protein Interactions (PPI) is evolving within the framework of the MAPPING project (Investissement d'Avenir en Bioinformatique, funded by the French Ministry of Research). We are moving in several directions, all aimed at refining information on protein partnership within the cell that has previously been obtained [Lopes et al 2013; Sacquin-Mora et al. 2008]. The different directions taken by the project are explained below:
DOCKING COMBINED WITH EXPERIMENTAL INTERFACES
We have previously shown that molecular docking simulations ("docking"), combined with knowledge of experimental interfaces, improve the discrimination between proteins that are partners in the cell and those that do not interact [Lopes et al 2013, Sacquin-Mora et al. 2008]. The computational cost of docking simulations is high, even with the coarse-grain molecular representation that we used in our WCG computations. We tested replacing these simulations with a very efficient approach to rigid docking, which uses a scoring function based only on the geometric complementarity of protein surfaces [Ritchie 2002]. Our results for a set of 186 protein complexes from different functional classes [Mintseris et al. 2005] indicate that, used alone, this scoring function fails: its score does not accurately reflect binding affinities, and the conformations that it generates and selects include very few that are near the native complexes. However, the evaluation of a large number of docking conformations combined with the knowledge of experimental interfaces still allows us to identify protein partners with a precision that is almost equivalent to that obtained with coarse-grain simulations [Laine & Carbone 2013].
Fig. Prediction of partners among 46 proteins, by combining rigid docking based on geometrical complementarity and knowledge of experimental interfaces. 500 (pink, black), 1000 (blue) and 2000 (cyan) docking conformations were analyzed. The docking was performed starting from the original PDB positions (S) or from random positions (M).
WHEN DOCKING IS CROSSED WITH PREDICTIONS
As argued in [Sacquin-Mora et al. 2008, Lopes et al 2013], docking can be successfully crossed with the prediction of interaction surfaces. We recall that the HCMD2 calculations were heavily based on this result: we used JET predictions [Engelen et al. 2009] to reduce the search space for the 2200 proteins launched on WCG. We have pointed out that more accurate predictions will improve partner identification scores, and for this reason we are working on improving JET. An account of our new results is given below.
MORE ACCURATE PREDICTIONS OF INTERACTION SITES
We previously developed a tool, Joint Evolutionary Tree (JET), for the prediction of protein-protein interfaces [Engelen et al. 2009]. JET assumes that interaction sites are formed by a highly conserved core, surrounded by multiple concentric layers of residues that are less conserved but have specific physico-chemical properties. Although this assumption is verified in the majority of interfaces, there are some cases where the interaction site only extends in a preferred direction from the conserved core. In order to refine the definition of interaction sites predicted by JET, we have introduced a geometric criterion to describe the protein surface and have coupled this with conservation and physico-chemical properties. The new version of JET encourages extensions of interaction sites to protruding areas in the protein that appear to be highly exposed to the solvent. Depending on the system studied, JET is now able to automatically determine the combination of the most relevant criteria. These changes have significantly improved the performance of JET on different types of protein complexes [Laine et al. 2013- post Berlin]. We are currently setting up a database for all PDB structures, including those structures studied on the WCG grid in the HCMD2 project.
Fig. Prediction of interaction sites for a set of proteins from the Huang database. From left to right: results from the old JET version, results from the new version, and the experimental interface with the trace (associated with the conservation of amino acids in sequences and with the conservation of physico-chemical properties expected at the interface). Proteins in the blue square correspond to those where a small molecule appears.
COEVOLUTION ANALYSIS AND PROTEIN-PROTEIN INTERACTIONS
We planned to obtain extra evidence of protein-protein interactions, and a better understanding of their nature, by exploiting information coming from co-evolution of residues at the interface of the two proteins. Novel kinds of interaction between potential partners, identified on interaction surfaces by coevolution analysis, are currently under investigation. We also developed a very fast version of the algorithm for the analysis of co-evolution "Blocks In Sequences" (BIS) [Dib & Carbone 2012]. Today, this implementation allows us to analyse long genomic sequences, such as complete viral genomes and to study the interaction of sets of proteins. This was previously impossible and it allows us to identify contacts among multiple proteins for which no interaction information was available. Using this approach, we considered the entire coding portion of the genome of hepatitis C virus (HCV) and identified, with BIS, all potential points of contact between the 10 proteins of this virus. In collaboration with F. Penin we are currently analyzing the interaction data.
Fig. HCV protein-protein interaction network [Champeimont et al. unpublished results]. The blue network reports the result of coevolution analysis performed on the full HCV polyprotein (for three genotypes). The width of the lines is proportional to the number of predicted direct and indirect interactions (at the domain level), while the circles representing the proteins have an area proportional to the protein length. The red network reports all experimentally known HCV protein-protein interactions. Coevolution analysis was performed with a new version of the BIS tool [Dib & Carbone 2012].
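To give a feeling for what a coevolution signal between two alignment positions looks like, here is a minimal sketch. It does not reproduce BIS, which is a block-based combinatorial method; as a stand-in it uses mutual information between two alignment columns, a different and much simpler measure. The function name and the toy alignment are hypothetical and only serve as an illustration.

```python
from collections import Counter
from math import log2


def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns.

    This is NOT the BIS algorithm; it is a simpler, generic coevolution
    measure used here only as an illustration.
    """
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    return sum(
        (c / n) * log2(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in count_ab.items()
    )


# Hypothetical toy alignment: four sequences, three positions.
alignment = ["AKL", "AKL", "GEL", "GEL"]
columns = list(zip(*alignment))  # one tuple of residues per position

mi_01 = mutual_information(columns[0], columns[1])  # positions vary together
mi_02 = mutual_information(columns[0], columns[2])  # position 2 is constant
print(mi_01, mi_02)
```

In this toy case the first two positions change together in every sequence, so they carry a coevolution signal, while the fully conserved third position carries none.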
PROTEIN-PROTEIN INTERACTION AND DOCKING
Our collaborators in Lyon (R. Lavery's team) have made progress in simulation as well as in the prediction of protein interactions. Protein-protein recognition has been studied by advanced all-atom molecular dynamics techniques. The separation of complexes by these methods provides access to the energy profile and to the role of different factors such as conformational flexibility and the impact of the water and ions surrounding the protein partners. The separation of the complex formed by ubiquitin (a small protein that marks other proteins for degradation) and its recognition domain has helped shed light on the recognition mechanism of this particular system [Bouvier 2014].
In parallel, R. Lavery's team has developed a new coarse-grain docking approach, using a simplified representation of protein structures, based on the PaLaCe model [Pasi et al. 2013]. This model provides very encouraging results for the prediction of binding affinity constants for the formation of binary protein complexes [N Ceres, unpublished results]. The development of more refined energy terms representing both electrostatics and interactions with the solvent [Ceres et al. 2012] is underway and should lead to further improved affinity predictions. In terms of docking, the team has developed a multiple minimization approach that can deal with the flexibility of the interacting proteins (either limiting movements to side chains, to protein loops at the interface, or treating the complete protein as flexible). The use of internal coordinates (notably torsion angles) makes minimization much more efficient than it would be using Cartesian coordinates.
Fig. Comparison between experimentally measured binding affinities of different types of protein-protein complexes (x-axis, data from the "Affinity Benchmark" [Kastritis et al. Protein Science 2013]) and values predicted by PaLaCe (y-axis). The global correlation coefficient is 0.8.
Fig. Docking study between the Alpha Chymotrypsin (left) and the Eglin C (right) proteins, minimized (from 50000 randomly chosen starting points) with the PaLaCe potential. The interaction surface is colored according to the probability that a residue will participate in the protein-protein interface (red: strong, white: medium, blue: weak). The favored surfaces are in excellent agreement with the experimental structure (shown at the bottom).
A.Lopes, S.Sacquin-Mora, V.Dimitrova, E.Laine, Y.Ponty, A.Carbone, Protein-protein interactions in a crowded environment: an analysis via cross-docking simulations and evolutionary information, PLoS Computational Biology, 2013.
E. Laine, A. Carbone, Identification of Protein Interaction Partners from Shape Complementarity Molecular Cross-Docking. In A. Petrosino, L. Maddalena, P. Pala (Eds.), IEEE International Conference on Image Analysis and Processing (ICIAP) 2013 Workshops, LNCS 8158, pp. 318–325. Springer, Heidelberg, 2013.
A.Carbone, Extracting co-evolving characters from a tree of species. In Discrete and Topological Models in Molecular Biology, N.Jonoska, M.Saito, G.Rozenberg (eds.), Springer, 2013.
Pasi, M., Lavery, R., & Ceres, N. PaLaCe: A Coarse-Grain Protein Model for Studying Mechanical Properties. Journal of Chemical Theory and Computation, 9, 785–793, 2013.
N. Ceres, M. Pasi, R. Lavery. A Protein Solvation Model Based on Residue Burial. Journal of Chemical Theory and Computation 2012 8:2141-2144.
A webLecture was organized by World Community Grid in February 2014. The video is on YouTube.
November 13, 2012
We are at the end of HCMD2 and I would like to thank you for your patience and persistence in running our docking program on your machines. A cross-docking dataset of this size has been produced for the first time, thanks to you (!). It is a mine of information for our research on protein-protein interactions, and it will also be a precious resource for our colleagues around the world who are interested in molecular docking.
We have finished analyzing the data on the 168 protein complexes run on HCMD1 and we now know what has to be done next. We shall integrate novel, quantitative experimental data on protein binding to predict not only the conformation of interacting proteins, but also which proteins will interact and how strongly. This involves four specific challenges:
1) Obtain quantitative experimental data on protein interactions with a wide range of binding affinities. We will use surface plasmon resonance (SPR), followed by isothermal titration calorimetry (ITC) to fully characterize the thermodynamics of protein interactions over a wide range of affinities and physical conditions (concentration, pH, temperature, …). These methods constitute ideal tools for our purpose. They will be used to quantify interactions between a set of commercially available proteins, including known interacting partners. However, we will also characterize nominally non-functional "cross-interactions" within this set to test, for the first time, the common assumption that choosing single proteins from known binary complexes, or choosing proteins from different cellular compartments, implies the absence of interaction.
2) Use evolutionary sequence data to detect protein residues involved in interaction interfaces and pairs of interacting proteins. We will identify key residues within interaction sites and co-evolution signals between pairs of interaction sites in order to predict interacting partners and integrate this information into a refined molecular docking approach, with the aim of identifying binary interactions within a large set of proteins. This goal will include constructing an automated pipeline for co-evolution analysis of single proteins and protein pairs.
3) Formulate new protein-protein interaction potentials using experimental data, molecular simulations and existing structural data. Molecular simulations coupled with free energy calculations will be used to obtain an atomic-scale view of the dissociation of a limited number of the weak and strong protein interactions studied by microcalorimetry. We will determine the extent to which complexes have well-defined conformations and fully desolvated interfaces. This data will be used to formulate and iteratively refine new interaction potentials within a coarse-grain model, which will be sensitive to binding affinity.
4) Carry out a refined analysis of the large database of protein interactions that you generated (!) to characterize interaction networks and binding promiscuity. During stage two of the Help Cure Muscular Dystrophy project (HCMD2), the resources of the World Community Grid (WCG) were used to dock all possible protein pairs within a set of 2200 proteins, potentially important for understanding and treating neurodegenerative diseases. This data will be analyzed to characterize key “hub” proteins and network structures, first, with the existing energetic and residue conservation data and then with the new methods resulting from 1-3.
The methods and interaction data derived from our studies will be made freely available to the scientific community through web servers and web databases.
We will do all this with four years of funding from the French Ministry of Research, which was awarded to our group this year. We shall devote this grant to developing the new tools (in biophysics and bioinformatics) mentioned above, and to analyzing the HCMD2 dataset, in order to arrive at the best possible prediction of the human protein-protein interaction network from the data that you generated over these last two years.
To keep you informed on the development of the project, I shall post news on our progress on my webpage. Pointers to the publications will be given there. If by any chance I do not post news for more than 6 months, send me a reminder!
THANK YOU again to all of you from all the scientists of the HCMD1 and HCMD2 projects.
Best regards to all,
December 21, 2011
Hi to all, with Sophie and Richard we have written up an account of the docking analysis of the 168 protein complexes of the Mintseris dataset tested in phase 1. The paper is under review right now and we will give you the link as soon as it is published.
The analysis of the dataset of 168 protein complexes is not finished yet! In fact, we are trying to improve the signals for the detection of partnership. There are two main points to keep in mind. In phase 2 we do not know the real partners, and we had to use predictions of interaction sites to run MaXDO, because the search space on a protein surface could not be exhaustively explored, even with the help of WCG: it would be far too big! This means that we need to understand, on a pool of proteins that we know (that is, the 168 protein complexes), how the predictions of protein interaction sites will impact the prediction of partners. This is what we are carefully investigating right now. It takes time! There are a number of intermediate results that you might like to know about:
1. The analysis carried out in [Sacquin-Mora et al. 2008] on 12 complexes has been scaled to 168 complexes, and it showed a predictive power for protein-protein interactions of AUC=0.84 (see Figure A below) when using knowledge of the real interaction surfaces and when exploring the whole protein surface. It is important to stress that this successful scaling of the analysis in [Sacquin-Mora et al. 2008] to 168 proteins was not an obvious guess! Why successful? The AUC (area under the ROC curve) is a probability measure used to evaluate the accuracy of a test. Values vary from 0 to 1, where 1 represents a perfect test and 0.5 represents a worthless test. Roughly speaking, one can think of the following ranking:
.90-1 = excellent, .80-.90 = good, .70-.80 = fair, .60-.70 = poor, .50-.60 = fail.
2. We also observed that amongst the 168 protein complexes several had the tendency to bind to nearly all other proteins and others showed very few strong interactions. Both these families of proteins negatively contribute to partnership prediction, and, when eliminated, enable the predictive power to be increased to an AUC=0.98 (see Figure B).
3. When experimental information on interaction surfaces is replaced with data from JET [Engelen 2009] (the tool for conservation analysis developed within our consortium) the predictive power only decreases slightly, with an AUC=0.82. This suggests that coupling protein interface predictions with docking is a very promising approach.
4. Nevertheless, improvements are still required: when JET predictions are used to delimit the docking area, as well as to compute the numerical index that discriminates partners, the predictive power falls to an AUC=0.59. This implies that better interaction patch detection has to be developed. However, we note that a subgroup of 20 complexes was identified where JET predictions already yielded very good results (AUC=0.97; Figure C below), suggesting that generating subgroups by categorizing protein interaction proclivities could improve performance.
5. Lastly, we systematically analyzed complexes in terms of the functional classes of the interacting proteins. The complexes could be grouped by function into Enzyme-Inhibitors (46 proteins), Antigen-Antibody (20), Antigen-Antibody Bound (24) and Others (78), and by docking difficulty into Rigid Body (126), Medium (26) and Difficult (16). Interactions within certain classes, such as Enzyme-Inhibitors, were clearly easier to predict, suggesting that such classifications should be considered in partnership prediction.
Figure. Matrices of pairwise interaction indexes for different subsets of proteins. High interaction scores (between 0.7 and 1, blue and black in the color scale) indicate a high probability of interaction. Proteins are ordered in the matrix such that true interacting partners lie on the diagonal. A: full dataset of 168 protein complexes. Interaction scores were computed using knowledge of the experimental interfaces (AUC=0.84). B: subset of 44 protein complexes leading to an AUC=0.98. Interaction scores were computed using knowledge of the experimental interfaces. C: subset of 20 protein complexes leading to an AUC=0.97. Interaction scores were computed using interfaces predicted by JET.
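For readers who would like to see how an AUC can be obtained from a matrix like the ones in the figure, here is a minimal sketch. It assumes that true partners lie on the diagonal and that every off-diagonal pair counts as non-interacting; the function names and the toy 3x3 matrix are hypothetical, and this is not the code used for our analysis. The AUC is computed as the probability that a randomly chosen true pair gets a higher score than a randomly chosen non-partner pair.

```python
import numpy as np


def auc_from_matrix(scores):
    """AUC for partner prediction from a square matrix of interaction indexes.

    Assumption: true partners lie on the diagonal, and all off-diagonal
    pairs are treated as non-interacting.
    """
    scores = np.asarray(scores, dtype=float)
    diagonal = np.eye(scores.shape[0], dtype=bool)
    positives = scores[diagonal]    # scores of the true pairs
    negatives = scores[~diagonal]   # scores of all other pairs
    # Compare every true pair with every non-partner pair; ties count one half.
    wins = (positives[:, None] > negatives[None, :]).sum()
    ties = (positives[:, None] == negatives[None, :]).sum()
    return (wins + 0.5 * ties) / (positives.size * negatives.size)


def grade(auc):
    """Rough qualitative ranking of an AUC value, following the scale above."""
    if auc >= 0.9:
        return "excellent"
    if auc >= 0.8:
        return "good"
    if auc >= 0.7:
        return "fair"
    if auc >= 0.6:
        return "poor"
    return "fail"


# Hypothetical toy matrix of interaction indexes for three complexes.
toy = np.array([
    [0.90, 0.20, 0.10],
    [0.30, 0.80, 0.40],
    [0.20, 0.85, 0.60],
])
toy_auc = auc_from_matrix(toy)
print(round(toy_auc, 3), grade(toy_auc))
```

In the toy matrix, one non-partner pair (score 0.85) outranks two of the three true pairs, which is what lowers the AUC below 1.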
At the moment we are working on coevolution between protein interfaces and on improving JET interaction predictions. With both advances we expect to improve the identification of new partners and to increase the AUC values above. We have done a lot of work on this already. A new approach to coevolution analysis, designed especially for conserved sites like protein interfaces, has recently been developed at the lab. More on this soon.
Merry Christmas and a Happy New Year to all!
July 31, 2010
Hi to all! I thought I would try to explain what we are doing right now before you take some vacation, like the scientists here. I hope it will help you feel that things are improving and that the project is very active on this side!! Actually, someone new will join the group in September, Anne Lopes. Anne is an assistant professor in structural bioinformatics and has a background in physical chemistry. She is very interested in working on the protein partnership problem with the numerical approach we developed and on the analysis of the huge amount of data you are producing!
The current state of our work is the following.
In the paper [S. Sacquin-Mora, A. Carbone and R. Lavery (2008), Identification of protein interaction partners and protein-protein interaction sites, J. Mol. Biol. 382, p1276-1289] we developed a numerical method to detect protein partners. The method was presented and tested on a small number of known protein complexes. As you can imagine, as soon as the data from HCMD Phase 1 arrived (THANKS TO YOUR CONTRIBUTION!!) we retested the approach to verify whether we could confirm the results on a larger dataset. This is indeed the case: the method works, and we can distinguish protein partners among the roughly 150 proteins tested. We observed, though, that the signal is much less sharp with 150 proteins than with 12 proteins (as in the paper), and that some extra work is needed to improve the numerical method. Remember that for HCMD Phase 2 we shall search for partners among about 2200 proteins.
At the moment, we have improved the formula introduced in the paper and we are developing an "intelligent" approach to identify, quickly and reliably, a small number of potential partners for any protein.
Let me give you an insight into the complexity underlying the problem. It has to do with understanding a protein population. This is an important point to assimilate if you would like to understand a bit more of our analysis. When we consider a protein, we do not just study that one protein (that is, its geometry and its physico-chemical properties: this is already taken into account in the docking algorithm running on your computers and in JET, the program that allowed us to predict protein binding sites); rather, we study its behaviour within the population of proteins that are around it (in the cell; for HCMD phase 2, population means the 2200 proteins analyzed on your computers). In other words, when we look at a protein we hope to get a signal on its partnership by looking at the way it interacts with all other proteins in the population. This means that we hope to learn from bad interactions as well as from good interactions. The information that YOU are giving us provides some insight into what is bad and what is good! But this is not enough, and we shall also use some extra observations on the interaction of the protein within a population.
Some proteins are slippery, meaning that they do not seem to glue to any partner. Some others are gluing, meaning that they glue to essentially everybody. Then there are many other proteins (about half) that seem to stick in the right place with some specificity. They are the easiest to study. When we use, in our calculations, contributions coming from the entire population, these contributions come, in principle, from slippery proteins, gluing proteins and many other proteins whose behaviour is less sharply characterizable. "Noise" might enter into the calculation and we wish to reduce it. Learning from the whole set of interactions of a protein means learning which group the protein belongs to. Once this is determined, the numerical criteria that we developed can be adjusted to accurately predict a partner or a small set of potential partners, whenever possible. Our goal today is to understand the whole set of behaviours that we need to take into account in order to evaluate the data coming from WCG correctly.
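The idea of sorting proteins into groups can be sketched as follows. This is only an illustration, not our actual numerical criteria: it assumes a square matrix of interaction indexes between 0 and 1, calls an interaction "strong" when its index is at least 0.7, and uses hypothetical cut-offs on the fraction of strong interactions to label a protein as gluing, slippery or specific.

```python
import numpy as np


def classify_proteins(scores, strong=0.7, gluing_fraction=0.75,
                      slippery_fraction=0.0):
    """Label each protein by how often it interacts strongly with the others.

    All thresholds are hypothetical choices made for this illustration.
    """
    scores = np.asarray(scores, dtype=float)
    labels = []
    for i in range(scores.shape[0]):
        others = np.delete(scores[i], i)         # ignore the self-pair
        fraction = np.mean(others >= strong)     # share of strong interactions
        if fraction >= gluing_fraction:
            labels.append("gluing")      # sticks to essentially everybody
        elif fraction <= slippery_fraction:
            labels.append("slippery")    # does not stick to anybody
        else:
            labels.append("specific")    # sticks to a few partners only
    return labels


# Hypothetical toy population of five proteins (symmetric interaction indexes).
population = np.array([
    [0.00, 0.90, 0.80, 0.75, 0.20],
    [0.90, 0.00, 0.10, 0.20, 0.10],
    [0.80, 0.10, 0.00, 0.30, 0.20],
    [0.75, 0.20, 0.30, 0.00, 0.10],
    [0.20, 0.10, 0.20, 0.10, 0.00],
])
labels = classify_proteins(population)
print(labels)
```

Once each protein has a label, the gluing and slippery ones can be set aside or treated with adjusted criteria, so that they contribute less noise to the prediction of partners for the remaining proteins.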
There are a few other concerns in our analysis, and they have to do with:
1. the algorithmic aspects of handling the large amount of information that has to be combined for the "learning" approach I mentioned above.
2. the fact that in the HCMD phase 2 data analysis, we use JET predictions of protein interaction sites in our numerical criteria instead of the actual interfaces used in the paper cited above. This implies a loss of precision that we should take into account in our numerical evaluations of the interactions.
This information should give you some insight into the complexity of the question we face today. I hope that everyone will feel that we are advancing, together, on a project that is very much alive and that will hopefully hold exciting surprises for all. We expect it.
Have a good summer! Alessandra