HELP CURE MUSCULAR DYSTROPHY
|
We are investigating protein-protein
interactions for more than 2000 human proteins whose structures
are known, with particular focus on those proteins that play a role
in neuromuscular diseases. The database of information that will
be produced will help researchers design molecules to inhibit or
enhance binding of particular macromolecules, hopefully leading
to better treatments for muscular dystrophy and other neuromuscular
diseases.
Phase 1 of Help Cure Muscular Dystrophy
ended in June 2007 and Phase 2 was launched in May 2009.
The project is supported by World
Community Grid and by Decrypthon (a partnership between AFM (French
Muscular Dystrophy Association), CNRS (French National Center for
Scientific Research) and IBM).
|
| UPDATES for the project |
July 31, 2010
Hi to all! I thought I would try to explain what we are doing right now before you take some vacation, like the scientists here. I hope it will help you feel that things are improving and that the project is very active on this side!! Actually, someone new will join the group in September: Anne Lopes. Anne is an assistant professor in structural bioinformatics and has a background in physical chemistry. She is very interested in working on the protein partnership problem with the numerical approach we developed and on the analysis of the huge amount of data you are producing!
The current state of our work is the following.
In the paper [S. Sacquin-Mora, A. Carbone and
R. Lavery (2008), Identification of protein interaction partners
and protein-protein interaction sites, J. Mol. Biol. 382,
p1276-1289] we developed a numerical method to detect protein partners. The method was presented and tested on a small number of known protein complexes. As you can imagine, as soon as the data from HCMD Phase 1 arrived (THANKS TO YOUR CONTRIBUTION!!) we retested the approach to check whether the results held on a larger dataset. They do: the method works, and we can distinguish protein partners among the roughly 150 proteins tested. We observed, though, that the signal is much less sharp with 150 proteins than with 12 proteins (as in the paper), and that some extra work is needed to improve the numerical method. Remember that for HCMD Phase 2 we shall search for partners among about 2200 proteins.
At the moment, we have improved the formula introduced in the paper and we are developing an "intelligent" approach to identify, quickly and reliably, a small number of potential partners for any protein.
Let me give you some insight into the complexity underlying the problem. It has to do with understanding the protein population. This is an important point to assimilate if you would like to understand a bit more of our analysis. When we consider a protein, we do not just study that one protein (that is, its geometry and its physico-chemical properties: these are already taken into account in the docking algorithm running on your computers and in JET, the program that allowed us to predict protein binding sites); rather, we study its behaviour within the population of proteins around it (in the cell; for HCMD Phase 2, the population means the 2200 proteins analyzed on your computers). In other words, when we look at a protein we hope to get a signal about its partnerships by looking at the way it interacts with all the other proteins in the population. This means that we hope to learn from bad interactions as well as from good ones. The information that YOU are giving us provides some insight into what is bad and what is good! But this is not enough, and we shall also use some extra observations on how the protein interacts within a population.
Some proteins are slippery, meaning that they do not seem to glue to any partner. Others are gluing, meaning that they glue to essentially everybody. Then there are many other proteins (about half) that seem to stick in the right place with some specificity. They are the easiest to study. When we use contributions from the entire population in our calculations, we should keep in mind that these contributions come, in principle, from slippery proteins, gluing proteins and many other proteins whose behaviour is less sharply characterizable. "Noise" may enter the calculation, and we wish to reduce it. Learning from the whole set of interactions of a protein means learning which group the protein belongs to. Once this is determined, the numerical criteria that we developed can be adjusted to accurately predict a partner, or a small set of potential partners, whenever possible. Our goal today is to understand the whole set of behaviours that we need to take into account in order to correctly evaluate the data coming from WCG.
A few other concerns are present in our analysis. They have to do with:
1. the algorithmic aspects of handling the large amounts of information to be combined for the "learning" approach I mentioned above.
2. the fact that in the HCMD Phase 2 data analysis, our numerical criteria use JET predictions of protein interaction sites instead of the actual interfaces used in the paper cited above. This implies a loss of precision that we should account for in our numerical evaluations of the interactions.
This information should give you some insight into the complexity of the question we face today. I hope everyone feels that we are advancing, together, on a project that is very much alive and that will hopefully hold exciting surprises for all. We expect it.
Have a good summer! Alessandra
|
| Help Cure Muscular Dystrophy Project and the World
Community Grid |
Computational grids are emerging
as a new paradigm for sharing and aggregation of geographically
distributed resources with the aim of solving large-scale computational
and data intensive problems in science. This project proposes to
apply this powerful computational schema to the detection of protein-protein
interactions. Identifying pairs or larger complexes
of functionally interacting proteins, or determining the binding
of a protein to a DNA sequence or to a ligand are fundamental problems
in biology with immediate consequences in drug design. This multidisciplinary
project directly addresses this question by setting the goal of
screening a database containing thousands of proteins, predicting functional
sites involved in binding to other proteins or ligand targets, and
determining whether two proteins are potential interacting partners
in the cell. The project will determine information on the structure
of macromolecular complexes which is important not only for identifying
functionally important partners, but also for determining how such
interactions will be perturbed by natural or engineered site mutations
in either of the interacting partners, or as the result of exogenous
molecules, and, notably, pharmacophores. A database of such information
would be of significant medical interest since, while it now becomes
feasible to design a small molecule to inhibit or enhance the binding
of a given macromolecule to a given partner, it is much more difficult
to know how the same small molecule could directly or indirectly
influence other existing interactions.

Given n protein structures, each one is docked against every other, that is, they are cross-docked.
Note that n × n interactions are tested.
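The cross-docking enumeration described above can be sketched in a few lines of Python. This is only an illustration: the function name `cross_docking_pairs` and the placeholder protein identifiers are hypothetical, and the sketch assumes that every ordered (receptor, ligand) pair is counted, including a protein paired with itself, which is what gives n × n.

```python
from itertools import product

def cross_docking_pairs(proteins):
    """Return every ordered (receptor, ligand) pair, self-pairs included."""
    return list(product(proteins, repeat=2))

# Hypothetical identifiers; any list of structure names would do.
proteins = ["P1", "P2", "P3"]
pairs = cross_docking_pairs(proteins)
print(len(pairs))  # 3 x 3 ordered pairs
```

Applied to the 168 proteins of Phase I, the same count gives 168 × 168 = 28,224 pairs.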
Molecular modeling refers to theoretical
methods and computational techniques to model or mimic the behavior
of molecules. These methods and techniques are used to investigate
the structure of biological systems such as protein folding or molecular
recognition of protein-ligand binding, ranging from small chemical
systems to large biological molecules and assemblies of material
(protein complexes). Protein-ligand docking is a molecular modeling
technique to predict the position and orientation (the 3D-structure)
of a protein in relation to a ligand (another protein, DNA, drug,
etc.). Docking methods are based on purely physical principles;
even proteins of unknown function (or which have been studied relatively
little) may be docked. The only prerequisite is that their 3D-structure
has been either determined experimentally, or can be estimated by
some theoretical technique. The docking approach generally starts
with a database of known molecules and attempts to find pairs of
molecules which have an affinity to bind to one another. The affinity
is estimated using a so-called scoring function. In the end, a list
of the best-binding molecules for a targeted protein is returned.
The quality of fit has a geometric and a chemical component. The
geometric component measures how well the surface shapes (the 3D-structures)
complement each other like a hand in glove. The chemical component
measures the quality of the atomic interactions between the partner
molecules (i.e. are the interactions strong or weak?). For complex
structures like proteins (the smallest are composed of hundreds
of atoms), it takes considerable computer time to determine the
fit of correct protein-protein interactions. Without World Community
Grid, the computations required to conduct the docking would be prohibitively
time consuming. For the first 168 selected proteins, the estimated
CPU time on a single 2 GHz PC would have been about 8,000 years.
A solution to this computational
barrier is to use evolutionary information to predict potential
binding sites and perform localized docking only on the surfaces that
are most likely to interact. Using predictions of protein
binding sites based on protein evolution reduces computational
time by a factor of 100 and therefore allows us to extend the analysis
to a large scale with the crucial help of World Community Grid. Without
World Community Grid, the computations required to conduct the (localized)
docking at large scale would also be prohibitively time consuming.
Volunteers donating their computer
time to World Community Grid searched (in phase I) and will search
(in phase II) for the best protein-protein partners.
The main pool of calculations
runs on World Community Grid:
Phase I, with cross-docking of 168 proteins, has been completed.
Phase II, with targeted cross-docking of about 2200 human proteins
whose three-dimensional structures are known (and stored in the
Protein Data Bank, www.rcsb.org/pdb), will be launched at the end
of March or beginning of April 2009. The set of proteins includes those
found to be mutated in neuromuscular disorders.
Preliminary calculations
and a posteriori analysis run on Decrypthon Grid and on Grid'5000:
Preliminary calculations for Phase I and Phase II and analysis of data for phase I have been done on French University grids.
|
| Participants: teams and computing infrastructure |
Computational biology: Alessandra
Carbone (PI) team, Analytical Genomics, FRE 3214
CNRS-UPMC, Université Pierre et Marie Curie, Paris.
Grid computing: Jean-Marie
Chesneaux team, Laboratory of Computer Science (LIP6), UMR 7606
CNRS-UPMC, Université Pierre et Marie Curie, Paris.
Genetic analysis of myopathies: Pascale Guicheney
team, INSERM Laboratory U582, Myology Institute, "Pitié-Salpêtrière"
Hospital, Paris.
Molecular modelling: Richard
Lavery team, Institut de Biologie et Chimie des Protéines, UMR 5086 CNRS-Université de Lyon, Lyon.
Thanks to funding from AFM and CNRS, three postdocs actively participated from 2005 to 2008
in the advancement of the project:
Stefan Engelen - IR Genoscope, Evry
Postdoc AFM in 2006-2008
Prediction of protein binding sites and development of JET
Yann Ponty - Postdoc at Laboratoire
d'Informatique de Paris 6, CNRS-UPMC
Postdoc AFM/CNRS in 2008
Analysis of data from phase 1, interface between JET and MAXDo and
criteria for protein partnership prediction
Sophie Sacquin-Mora - CR CNRS - Laboratoire
de Biochimie Théorique, UMR 9080 CNRS, Institut de Biologie Physico-Chimique,
Paris
Postdoc AFM in 2005-2006
Development of MAXDo, criteria for protein partnership prediction
Sophie Sacquin-Mora and Yann Ponty
are still working actively on the project, although they are now
affiliated with other institutions.
The Decrypthon Program provided the infrastructure
necessary to ensure the portability of the software to the Decrypthon Grid and to WCG. The Laboratoire d'Informatique du Parallélisme (ENS, Lyon), a partner of AFM/CNRS, gave us the possibility to run tests
and a posteriori analysis on the French university grid Grid'5000.
For this, we acknowledge the work of Raphael Bolze (LIP, ENS Lyon). Nicolas Bard (LIP, ENS Lyon) and Michael Heymann (LIP, ENS Lyon) will soon replace Raphael on the same tasks.
The project runs on WCG. Thousands
of Internet users have offered their computer time to the project. To them
we are most grateful.
|
| What we did in Phase 1 |
In Phase
1 we tested the feasibility of the docking algorithm MAXDo on a database of 168 proteins for which we had crystallographic information and experimental evidence that these protein complexes interact in the cell. We performed cross-docking on all protein pairs (see the figure below on the left, where the receptor is fixed at the center of the sphere and the ligand moves all around the ideal sphere indicated with red dots) and collected energy maps after docking (with the Euler angles θ and φ along the vertical and horizontal axes; see an example of the maps issued from our calculations on the right below). For each map, the experimental binding site of the receptor in its complexed form with the ligand is located at the center (see the complex in the middle of the figure below). Blue and red areas correspond to the most negative and the least negative energies, respectively. These data were then analyzed and combined with a numerical criterion that we had developed for discriminating protein partners. In the figures below you can see the docking scheme, a complex and the energy map associated with the complex, computed on WCG.
 
In Phase II, MAXDo will run only on a part of the sphere. That part will be determined by a prediction of the protein binding sites for receptor and ligand (using the JET approach): only the part of the sphere that corresponds to the region of interaction will be docked. In this way we can cross-dock approximately 2200 protein structures instead of the 168 docked in Phase I.
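The idea of docking only on part of the sphere can be illustrated with a minimal Python sketch. It is not MAXDo's actual sampling scheme: the Fibonacci lattice used to place starting positions, the single `site_direction` vector standing in for a predicted binding site, and the 45° angular cutoff are all assumptions made for the example.

```python
import math

def fibonacci_sphere(n):
    """Return n roughly evenly spaced unit vectors on a sphere."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for i in range(n):
        z = 1.0 - 2.0 * (i + 0.5) / n
        r = math.sqrt(1.0 - z * z)
        phi = i * golden_angle
        points.append((r * math.cos(phi), r * math.sin(phi), z))
    return points

def restrict_to_cone(points, site_direction, max_angle_deg):
    """Keep points whose angle to site_direction is at most max_angle_deg."""
    norm = math.sqrt(sum(c * c for c in site_direction))
    unit = tuple(c / norm for c in site_direction)
    cos_cutoff = math.cos(math.radians(max_angle_deg))
    return [p for p in points
            if sum(a * b for a, b in zip(p, unit)) >= cos_cutoff]

# Hypothetical setup: 1000 starting positions, predicted site along +z.
starts = fibonacci_sphere(1000)
kept = restrict_to_cone(starts, site_direction=(0.0, 0.0, 1.0), max_angle_deg=45.0)
print(len(kept), "of", len(starts), "starting positions kept")
```

Only the starting positions inside the cone around the predicted site would then be passed on to docking; widening or narrowing the cutoff trades coverage against computation time.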
|
| HCMD status report, February 2009 |
Since the end of Phase
1 our efforts have been concentrated in five directions: the finalisation
and testing of JET, the analysis of data gathered in phase 1, the
interface between JET and MAXDo, the algorithmic improvement of
MAXDo, and the constitution of a database of proteins that will be analysed
in phase 2. We have completed the essential parts of all steps and we
are ready to start phase 2 of the project.
|
| 1.
Finalisation of JET and test of its performance |
Information
obtained on the structure of macromolecular complexes is important
for identifying functionally important partners, but also for determining
how such interactions will be perturbed by natural or engineered
site mutations. Hence, to fully understand or to control biological
processes we need to predict in the most accurate manner protein
interfaces for a protein structure, possibly without knowing its
partners. Joint Evolutionary Trees (JET) is a method designed to
detect very different types of interactions of a protein with another
protein, ligands, DNA and RNA. JET uses evolutionary information,
namely how certain positions within the primary sequence of a protein
are conserved in a given protein family, and also physico-chemical
properties of residues to finally predict the location of protein-protein
interface sites on the protein surface. It uses a carefully designed
sampling method making sequence analysis more sensitive to the functional
and structural importance of individual residues, and a clustering
method parameterized on the target structure for the detection of
patches on protein surfaces and their extension into predicted interaction
sites. JET is a large-scale method, highly accurate and applicable
to the search for protein partners.
JET has been conceived and
developed in different stages over 4 years of work, which has now been finalized
in a publicly available software package and a publication [Engelen
et al. 2009]. The information obtained with JET is extremely interesting
for the HCMD project because we can save a large amount of computational
time by restricting the search of the conformational space between
the two proteins to the area surrounding the predicted interface.
The data resulting from phase 1 of our project were used to adjust
various parameters of the JET program and see how far we could go
in filtering the starting positions in the docking calculations
without losing relevant information on protein interactions.
|
| -
Read more on JET |
The Joint Evolutionary Trees (JET) method detects protein interfaces,
the core of residues involved in the folding process, residues
susceptible to be relevant to site-directed mutagenesis and to
molecular recognition. The approach, based on the Evolutionary
Trace (ET) method, introduces a novel way to treat evolutionary
information. Families of homologous sequences are analyzed through
a Gibbs-like sampling of distance trees to reduce effects of erroneous
multiple alignment and impacts of weakly homologous sequences
on distance tree construction. The sampling method makes sequence
analysis more sensitive to functional and structural importance
of individual residues by avoiding effects of the overrepresentation
of highly homologous sequences and it improves computational efficiency.
A carefully designed clustering method is parameterized on the
target structure to detect and extend patches on protein surfaces
into predicted interaction sites. Clustering takes into account
residues' physico-chemical properties as well as conservation.
Large-scale application of JET requires the system to be adjustable
for different datasets and to guarantee predictions even if the
signal is low. Flexibility was achieved by a careful treatment
of the number of retrieved sequences, the amino acid distance
between sequences, and the selective thresholds for cluster identification.
An iterative version of JET (iJET) that guarantees finding the
most likely interface residues is proposed as the appropriate
tool for large scale predictions. Tests are carried out on the
Huang database of 62 heterodimer, homodimer and transient complexes,
and on 265 interfaces belonging to signal transduction proteins,
enzymes, inhibitors, antibodies, antigens and others. A specific
set of proteins chosen for their special functional and structural
properties illustrates JET behavior on a large variety of interactions
covering proteins, ligands, DNA and RNA. JET is compared at large
scale to ET, and to Consurf, Rate4Site, siteFiNDER|3D, SCORECONS
on specific structures. A significant improvement in performance
and computational efficiency is shown.

Fig. 0: JET prediction on the interaction between an RNA strand and a protein complex: the crystal structure of a complex of TRP RNA-binding attenuation protein with a 53-base single-stranded RNA containing eleven GAG triplets separated by AU dinucleotides (Antson, A.A., Dodson, E.J., Dodson, G., Greaves, R.B., Chen, X., Gollnick, P. (1999) Structure of the trp RNA-binding attenuation protein, TRAP, bound to RNA. Nature 401: 235-242). Residues are colored from red (most conserved) to blue (no conservation), and red-like residues indicate predicted interacting residues. Notice the red color of an entire face of the complex (left) and the blue color of the opposite face (middle). The RNA is in contact with the red surface (see right view).
|
| 2. Analysis
of data coming from Phase 1 |
Phase 1 generated a huge amount
of data on protein-protein interactions, which we are still working
on. This data gives us valuable information on the way proteins
interact. |
| - Identification of protein interaction
partners |
In order to evaluate the quality
of the protein interactions modelled with the MAXDo program (which
ran on WCG during phase 1 of the HCMD project) we developed an Interaction
Index (II), based on the interaction energies between proteins and
the residues found at the protein interface. The II value ranges
from 0 (no complex formation) to 1 (excellent interaction between
the two proteins). If we compare the distribution of the II value
for all protein pairs (over 28,000) and only pairs of experimental
partners (168) we see that "real partners" usually form complexes
with a significantly larger II than randomly chosen partners: The
average II value is only 0.18 for the whole database, while the
"real" couples yield an average II of 0.32. There are also some
very important variations depending on the type of complex considered.
For example, in a reduced dataset excluding enzyme-inhibitor complexes,
the experimental partners had a remarkable average II value of 0.91,
which makes them very easily distinguishable from the random partners
(see Fig. 1).

Fig. 1: Normalized
Interaction Index matrix for a reduced dataset of ten proteins.
The rows and columns are ordered so that experimentally observed
complexes lie on the trailing diagonal of the matrix. These complexes
clearly distinguish themselves, with a higher II, from the incorrect
off-diagonal complexes.
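To show how a matrix of this kind can be read, here is a small Python sketch. The 3 × 3 values are invented for illustration and are not project data; the layout (true partner of row i in column i) is an assumption of the example, and picking the highest-scoring ligand per receptor is a simplification, not the published partner-prediction criterion.

```python
def best_partner_per_row(ii_matrix):
    """For each receptor (row), return the column index with the highest II."""
    return [max(range(len(row)), key=row.__getitem__) for row in ii_matrix]

# Invented toy values in [0, 1]. In this toy layout the true partner of
# receptor i is assumed to be ligand i.
toy_ii = [
    [0.91, 0.20, 0.15],
    [0.10, 0.85, 0.30],
    [0.25, 0.05, 0.78],
]
predicted = best_partner_per_row(toy_ii)
recovered = sum(1 for i, j in enumerate(predicted) if i == j)
print(f"{recovered} of {len(toy_ii)} partners recovered")
```

On real data the signal is less clean than in this toy matrix, which is why additional criteria are combined with the II value.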
|
| 3. Interface
between JET and MAXDo |
Since JET is used
to restrict docking interaction targets, it was critical to assess
the sensitivity of our tool in order to ensure a reasonable coverage
and not overlook entire interactions. To that end, we tested
the software on two large sets of proteins originally used for benchmarking
in silico docking approaches (Mintseris 2.0 + Kanamori). In some
cases, high sensitivity/specificity couples can be observed and JET
pinpoints the interaction site (see Fig. 2). More generally, we
observed an average sensitivity of 38% with a remarkable typical specificity
of 80%, under the settings used for restricting MAXDo's conformational
space. This means that a bad prediction is, in most cases, an absence
of prediction, which can be detected and corrected prior to passing
the protein to our docking software MAXDo. Furthermore, large discrepancies
are observed in the sensitivities achieved on proteins having different
functions. For instance, JET performs very well on enzyme substrates/inhibitors
(55.3% sens., 75% spec.) and rather poorly on antigens/antibodies
(17% sens., 81% spec.). We will use this information to adapt our
restriction of MAXDo's conformational space, being more stringent
for easy families and more relaxed for hard ones. Despite these corrections,
we expect the joint MAXDo/JET approach to ab initio cross-docking
to perform better on enzymes than on antigens/antibodies.
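For readers unfamiliar with these two measures, the following Python sketch computes them for a set of predicted interface residues. It assumes the standard definitions (sensitivity = TP / (TP + FN), specificity = TN / (TN + FP)); the exact residue sets JET is evaluated on are not given here, and the residue numbers below are hypothetical.

```python
def sensitivity_specificity(predicted, actual, surface):
    """Return (sensitivity, specificity) for predicted interface residues.

    predicted, actual: sets of residue identifiers, both subsets of surface.
    surface: set of all residues considered.
    """
    tp = len(predicted & actual)            # predicted and truly at the interface
    fn = len(actual - predicted)            # interface residues that were missed
    fp = len(predicted - actual)            # predicted but not at the interface
    tn = len(surface - predicted - actual)  # correctly left out
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return sens, spec

surface = set(range(1, 21))   # 20 hypothetical surface residues
actual = {1, 2, 3, 4, 5}      # hypothetical experimental interface
predicted = {1, 2, 9}         # hypothetical prediction
sens, spec = sensitivity_specificity(predicted, actual, surface)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

A prediction like this one, which misses part of the interface but rarely marks a wrong residue, gives low sensitivity with high specificity, the pattern reported above.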

Fig. 2: For
both the Ras-RasGAP complex (PDB:1WQ1, left) and the DNA glycosylase
bound to its inhibitor (PDB:1UDI, right), JET accurately targets
the experimentally-observed interfaces. More precisely, the (sensitivity,specificity)
couples achieved by JET on individual chains are: (74%,89%) for
6Q21:D, (58%,95%) for 1WER, (52%,77%) for 2UGI:B and (65%,89%) for
1UDH. Such figures are not only suitable for restricting MAXDo search
space, but will also greatly help directing our search for the real
docking sites during a posteriori analyses.
|
| -
A posteriori analysis |
A preliminary analysis of the data produced during phase 1 showed
that Interaction Indices, recalculated using JET-predicted data
instead of experimental ones, still carry some discriminative
power. We are hopeful that taking into account more features
of the candidate interfaces, combined with automated machine learning
approaches, will greatly enhance the quality of our analysis of
the second phase results. More specifically, supervised learning
procedures such as Support Vector Machines (SVM) will be fine-tuned
on the second phase results to automatically distinguish between
positive (interacting proteins) and negative (non-interacting
proteins) examples.
|
| 4. A new version
of MAXDo |
We also worked on a faster, more
efficient version of MAXDo. This new version takes into account
the information produced by the JET program concerning the location
of protein interfaces, so that we now need to test only around 15%
of the starting positions for docking that we previously studied
with MAXDo (see Fig. 3). Combined with other improvements made to
MAXDo, the docking of a protein pair should take only 3% of the
computation time that was necessary during Phase 1.
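The two percentages quoted above can be related with a short calculation. This Python sketch assumes that computation time scales linearly with the number of starting positions, which is a simplification; under that assumption, whatever is not explained by the position filtering is attributed to the other improvements.

```python
position_fraction = 0.15   # share of starting positions still tested (from the text)
time_fraction = 0.03       # share of Phase 1 computation time per pair (from the text)

# Overall speedup per protein pair.
speedup = 1.0 / time_fraction

# Time per remaining starting position relative to Phase 1, under the
# linear-scaling assumption stated above.
residual_fraction = time_fraction / position_fraction

print(f"overall speedup: about {speedup:.0f}x")
print(f"time per remaining position: about {residual_fraction:.0%} of Phase 1")
```

Read this way, the JET filtering accounts for a bit more than a sixfold gain and the other improvements for roughly another fivefold gain.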

Fig. 3: Cartoon
representation of the Rac GTPase/p67phox complex (PDB code 1E96).
The systematically generated starting positions for the receptor
and ligand proteins are plotted as blue and red points. After filtration
using JET information, docking calculations are only performed for
those starting points that are located in the vicinity of JET predicted
interface residues (plotted as green spheres on the proteins), thus
considerably reducing the computation time.
|
| 5. List of
proteins that will be analysed in phase 2 of the HCMD project |
In phase 2, we will profit from this speedup to work on a larger,
but also more targeted protein database. It will include around
2200 human proteins, 200 of which are of interest for neuromuscular
diseases and have been proposed as targets for study by the medical
research groups working with us in this project. About 200 more
are structural models with potential interest for muscular dystrophy.
The massive docking experiment performed during phase 2 will give
us precious information on how these target proteins interact
within the human body (about 1800 protein structures with yet
unknown involvement in muscular dystrophy will be analysed), helping
biomedical researchers design new strategies to neutralize
them and develop therapies for neuromuscular diseases.
|
| - To know
more on the list of proteins |
The list of proteins considered
for phase 2 was initially selected by experts in three major classes:
Neuromuscular diseases, experimentally-determined (Guicheney and
Carbone labs) and predicted structures (Laboratoire de Bioinformatique
et Genomique Integratives, IGBMC); others (like heart and brain-related)
experimental structures (Carbone lab). This initial list was later
semi-automatically filtered to weed out very similar proteins. To
that purpose, we used the ASTRAL structurally non-redundant database,
and added a final filter based on sequence similarity (PDBSelect95
database) in order to cover as much of the structural and functional
diversity as possible. Overall, 2263 single-chain proteins of lengths
ranging from 25 to 800 residues will be run on the grid (See Fig.
4). Since our ability to run MAXDo on such a large dataset relies
on an initial restriction of the conformation spaces, we have installed
JET on the Decrypthon grid in Lyon. At the moment, JET is being
run simultaneously on the machines of the grid and has already yielded
results for more than half of the proteins. Full results should
be available during the upcoming weeks. These results will be carefully
post-processed, using conclusions from our analysis of the first
phase, before we feed the JET-annotated PDB files to MAXDo.

Fig.
4: Distribution of the number of residues (x-axis) for the proteins
(y-axis) in the list that will be investigated during phase 2 of
the HCMD project.
|
| Publications |
S. Sacquin-Mora, A. Carbone and
R. Lavery (2008), Identification of protein interaction partners
and protein-protein interaction sites, J. Mol. Biol. 382,
p1276-1289.
S. Engelen, L.A. Trojan, S. Sacquin-Mora,
R. Lavery and A. Carbone (2009), Joint Evolutionary Trees: a large
scale method to predict protein interfaces based on sequence sampling,
PLoS Comp. Biol. 5(1): e1000267. doi:10.1371/journal.pcbi.1000267.
|
| Some numbers |
HCMD2 studies the cross-docking of 2246 human protein structures. In total we shall dock 2,466,753 pairs of proteins among the 2246^2 possible ones. A full cross-docking would require computing 913,627,781,945 initial docking positions. By using JET, we can reduce the docking space by 85%, so that only 137,652,178,995 conformations remain to be analyzed.
Phase 1 of the project studied 28,224 pairs (168 proteins) and explored a total of 10,391,124,240 conformations, that is, 13.25 times fewer than what we shall have to compute in Phase II (in terms of initial docking positions).
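The figures above can be cross-checked with a few lines of Python. The numbers are copied from the text; only the variable names are introduced here.

```python
full_positions = 913_627_781_945    # full cross-docking, Phase 2
kept_positions = 137_652_178_995    # after JET filtering, Phase 2
phase1_positions = 10_391_124_240   # explored in Phase 1
phase1_proteins = 168

reduction = 1 - kept_positions / full_positions
ratio = kept_positions / phase1_positions

print(f"Phase 1 pairs: {phase1_proteins ** 2}")
print(f"JET reduction: {reduction:.1%}")
print(f"Phase 2 / Phase 1 positions: {ratio:.2f}")
```

The results agree with the quoted values: 28,224 pairs in Phase 1, a reduction of about 85%, and a Phase 2 workload about 13.25 times that of Phase 1.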