Analytical Genomics

Analytical Genomics Team (Equipe de Génomique Analytique)
Université Pierre et Marie Curie, INSERM U511
Head: Alessandra Carbone, carbone AT ihes DOT fr

A.Carbone, M.Gromov, Mathematical slices of molecular biology, La Gazette des Mathématiciens, special issue 88:11-80, Société Mathématique de France, 2001.

This paper contains a brief overview of molecular biology with indications and speculations on mathematical approaches to biological problems.

Download the PS Reprint

A.Carbone, N.C.Seeman, Circuits and Programmable Self-Assembling DNA Structures, Proceedings of the National Academy of Sciences USA, 99:12577-12582, 2002.

Self-assembly is beginning to be seen as a practical vehicle for computation. We investigate how basic ideas on tiling can be applied to the assembly and evaluation of circuits. We suggest that these procedures can be realized on the molecular scale through the medium of self-assembled DNA tiles. One layer of self-assembled DNA tiles will be used as the program or circuit that leads to the computation of a particular Boolean expression. This layer templates the assembly of tiles, and their associations then lead to the actual evaluation involving the input data. We describe DNA motifs that can be used for this purpose; we show how the template layer can be programmed, in much the way that a general-purpose computer can run programs for a variety of applications. The molecular system that we describe is fundamentally a pair of two-dimensional layers, but it seems possible to extend this system to multiple layers.

Download the PDF Reprint

A.Carbone, N.C.Seeman, A Route to Fractal DNA Assembly, Natural Computing, 1:469-480, 2002.

Crystallization is periodic self-assembly on the molecular scale. Individual DNA components have been used several times to achieve self-assembled crystalline arrangements in two dimensions. The design of a fractal system is a much more difficult goal to achieve with molecular components. We present DNA components whose cohesive portions are compatible with a fractal assembly. These components are DNA parallelograms that have been used previously to produce two-dimensional arrays. To obtain a fractal arrangement, however, we find it necessary to combine these parallelograms with glue-like constructs. The assembly of the individual parallelograms and a series of glues and protecting groups appears to ensure the fractal growth of the system in two dimensions. Synthetic protocols are suggested for the implementation of this approach to fractal assembly.

Download the PDF Reprint

A.Carbone. Cooperativity and symmetry at biological scales, in GROUP-24: Physical and Mathematical aspects of symmetries (Proceedings of the 24th International Colloquium on Group Theoretical Methods in Physics, Paris, 15-20 July 2002), Institute of Physics, Conference Series Number 173, J-P.Gazeau, R.Kerner, J-P.Antoine, S.Métens and J-Y.Thibon Eds., Institute of Physics Publishing, Bristol and Philadelphia, 51-60, 2003.

Facts and ideas presented in this paper have been written mostly as a guideline to orient the reader through some references in the field. Biological scales are intended in a very broad sense; they can refer to bio-molecular structures as well as to supra-molecular organisation.

Download the PS Reprint

A.Carbone, M.Gromov, Functional labels and syntactic entropy on DNA strings and proteins, Theoretical Computer Science, 303:35-51, 2003.

The DNA of a cell is an object which admits a simple mathematical description and a convenient representation in a computer (it is given by an easily manipulable list, a finite sequence in four letters, typically of length between one million and 10 billion). In contrast, there is no simple way of describing the cell either statically or, even less so, temporally (dynamically). We shall indicate here a possible formalism of combinatorial and numerical (entropic) structures on spaces of sequences which reflect, up to some degree, the organization and functions of DNA and proteins.

Download the PDF Reprint

A.Carbone, A.Zinovyev, F.Képès, Codon Adaptation Index as a measure of dominating codon bias, Bioinformatics, 19:2005-2015, 2003.

We propose a simple algorithm to detect dominating synonymous codon usage bias in genomes. The algorithm is based on a precise mathematical formulation of the problem that leads us to use the Codon Adaptation Index (CAI) as a ‘universal’ measure of codon bias. This measure has been previously employed in the specific context of translational bias. With the set of coding sequences as the sole source of biological information, the algorithm provides a reference set of genes which is highly representative of the bias. This set can be used to compute the CAI of genes of prokaryotic and eukaryotic organisms, including those whose functional annotation is not yet available. An important application concerns the detection of a reference set characterizing translational bias, which is known to correlate with expression levels; in this case, the algorithm becomes a key tool to predict gene expression levels, to guide regulatory circuit reconstruction, and to compare species. The algorithm also detects leading–lagging strand bias, GC-content bias, GC3 bias, and horizontal gene transfer. The approach is validated on 12 slow-growing and fast-growing bacteria, Saccharomyces cerevisiae, Caenorhabditis elegans and Drosophila melanogaster.
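For readers unfamiliar with the CAI itself, the following minimal Python sketch shows how the index of a gene can be computed once a reference set of genes is given. It follows the usual textbook definition (geometric mean of relative adaptiveness weights, with Met, Trp and stop codons excluded) and does not reproduce the reference-set detection algorithm of the paper. The function names, and the choice to skip codons that never occur in the reference set, are assumptions made for illustration.

```python
import math
from collections import Counter

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def codons(seq):
    """Split a coding sequence into codons, ignoring a trailing partial codon."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def relative_adaptiveness(reference_genes):
    """Weight of each codon: its count divided by the count of the most
    frequent synonymous codon in the reference set."""
    counts = Counter(c for g in reference_genes for c in codons(g)
                     if c in CODON_TABLE)
    best = {}
    for codon, n in counts.items():
        aa = CODON_TABLE[codon]
        best[aa] = max(best.get(aa, 0), n)
    return {codon: n / best[CODON_TABLE[codon]] for codon, n in counts.items()}


def cai(gene, weights):
    """Geometric mean of the weights of the codons of `gene`.

    Assumption: codons absent from the reference set are skipped, and
    Met, Trp and stop codons are excluded, as in the usual definition.
    """
    logs = [math.log(weights[c]) for c in codons(gene)
            if c in weights and CODON_TABLE[c] not in "MW*"]
    return math.exp(sum(logs) / len(logs)) if logs else 0.0
```

With a toy reference set in which GCT is three times as frequent as GCC, a gene using only GCT scores 1.0 and a gene using only GCC scores 1/3.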

Download the PDF Reprint

A.Carbone, N.C.Seeman, Coding and Geometrical Shapes in Nanostructures: fractal DNA assemblies, Natural Computing, 2:133-151, 2003.

Fractal patterns represent an important class of aperiodic arrangements. Generating fractal structures by self-assembly is a major challenge for nanotechnology. The specificity of DNA sticky-ended interactions and the well-behaved structural nature of DNA parallelogram motifs have previously led to a protocol that appears likely to be capable of producing fractal constructions [A. Carbone and N.C. Seeman, A route to fractal DNA assembly, Natural Computing 1, 469–480, 2002]. That protocol depends on gluing the set of tiles with special ‘glue tiles’ to produce the fractal structure. It is possible to develop a fractal-assembly protocol that does not require the participation of gluing components. When designed with similar DNA parallelogram motifs, the protocol involves sixteen specific tiles, sixteen closely related tiles, and a series of protecting groups that are designed to be removed by the introduction of specific strands into the solution. One novel aspect of the construction on the theoretical level is the interplay of both geometry and coding in tile design. A second feature, related to the implementation, is the notion of generalized protecting groups.

Download the PDF Reprint

A.Carbone, N.C.Seeman. Molecular Tiling and DNA self-assembly, in "Aspects of Molecular Computing", N.Jonoska, G.Paun, G.Rozenberg (Eds), Lecture Notes in Computer Science 2950, Springer, 2003.

We examine hypotheses coming from the physical world and address new mathematical issues on tiling. We hope to bring to the attention of mathematicians the way that chemists use tiling in nanotechnology, where the aim is to propose building blocks and experimental protocols suitable for the construction of 1D, 2D and 3D macromolecular assemblies. We shall especially concentrate on DNA nanotechnology, which has been demonstrated in recent years to be the most effective programmable self-assembly system. Here, the controlled construction of supramolecular assemblies containing components of fixed sizes and shapes is the principal objective. We shall spell out the algorithmic properties and combinatorial constraints of "physical protocols", to bring the working hypotheses of chemists closer to a mathematical formulation.

Download the PDF Reprint

B. Mishra, R. Daruwala, Y. Zhou, N. Ugel, A. Policriti, M. Antoniotti, S. Paxia, M. Rejali, A. Rudra, V. Cherepinsky, N. Silver, W. Casey, C. Piazza, M. Simeoni, P. Barbano, M. Spivak, J-W. Feng, O. Gill, M. Venkatesh, F. Cheng, B. Sun, I. Ioniata, T.S. Anantharaman, E.J.A. Hubbard, A. Pnueli, D. Harel, V. Chandru, R. Hariharan, M. Wigler, F. Park, S.-C. Lin, Y. Lazebnik, F. Winkler, C. Cantor, A. Carbone, and M. Gromov. A Sense of Life: computational and experimental investigations with models of biochemical and evolutionary processes, OMICS - A Journal of Integrative Biology, Special Issue on BioCOMP, S.Kumar Ed., 7(3):253-268, 2003.

We collaborate in a research program aimed at creating a rigorous framework, experimental infrastructure, and computational environment for understanding, experimenting with, manipulating, and modifying a diverse set of fundamental biological processes at multiple scales and spatio-temporal modes. The novelty of our research is based on an approach that (i) requires coevolution of experimental science and theoretical techniques and (ii) exploits a certain universality in biology guided by a parsimonious model of evolutionary mechanisms operating at the genomic level and manifesting at the proteomic, transcriptomic, phylogenic, and other higher levels. Our current program in “systems biology” endeavors to marry large-scale biological experiments with the tools to ponder and reason about large, complex, and subtle natural systems. To achieve this ambitious goal, ideas and concepts are combined from many different fields: biological experimentation, applied mathematical modeling, computational reasoning schemes, and large-scale numerical and symbolic simulations. From a biological viewpoint, the basic issues are many: (i) understanding common and shared structural motifs among biological processes; (ii) modeling biological noise due to interactions among a small number of key molecules or loss of synchrony; (iii) explaining the robustness of these systems in spite of such noise; and (iv) cataloging multistatic behavior and adaptation exhibited by many biological processes.

Download the PDF Reprint

A.Carbone, F.Képès, A.Zinovyev, Codon bias signatures, organisation of microorganisms in codon space and lifestyle, Molecular Biology and Evolution, 22(3):547–561, 2004.

New and simple numerical criteria based on a codon adaptation index are applied to the complete genomic sequences of 80 Eubacteria and 16 Archaea, to infer weak and strong genome tendencies toward content bias, translational bias, and strand bias. These criteria can be applied to all microbial genomes, even those for which little biological information is known, and a codon bias signature, that is the collection of strong biases displayed by a genome, can be automatically derived. A codon bias space, where genomes are identified by their preferred codons, is proposed as a novel formal framework to interpret genomic relationships. Principal component analysis confirms that although GC content has a dominant effect on codon bias space, thermophilic and mesophilic species can be identified and separated by codon preferences. Two more examples concerning lifestyle are studied with linear discriminant analysis: suitable separating functions characterized by sets of preferred codons are provided to discriminate: translationally biased (hyper)thermophiles from mesophiles, and organisms with different respiratory characteristics, aerobic, anaerobic, facultative aerobic and facultative anaerobic. These results suggest that codon bias space might reflect the geometry of a prokaryotic "physiology space." Evolutionary perspectives are noted, numerical criteria and distances among organisms are validated on known cases, and various results and predictions are discussed both on methodological and biological grounds.

Download the PDF Reprint

A.Carbone, C.Mao, P.E.Constantinou, B.Ding, J.Kopatsch, W.B.Sherman, N.C.Seeman, 3D Fractal DNA Assembly from Coding, Geometry and Protection, Natural Computing, 3:235-252, 2004.

We present DNA components whose 3D geometry and cohesive portions are compatible with a fractal 3D assembly. DNA parallelograms have been proposed in Carbone and Seeman [(2002b) Natural Computing 1: 469–480; (2003) Natural Computing 2: 133–151] as suitable building blocks for a 2D fractal assembly of the Sierpinski carpet. Here we use Mao 3D triangles, which are 3D geometrically trigonal molecules, to construct basic building blocks and we obtain a simplified version of the 2D assembly design. As in the previous 2D construction, we utilize the interplay of coding in the form of cohesive ends, geometrical complementarity and protection of potentially undesirable sites of reactivity. The schema we propose works for trigonal symmetries and the Mao triangle is one example of a possible DNA trigonal tile.

Download the PDF Reprint

A.Carbone, Revisiting the codon adaptation index from a whole-genome perspective: gene expression, codon bias, and metabolic networks in the context of genomes comparison, Proceedings of the Belgian Royal Academy of Sciences, Mathematics and Genomics, 18 October 2003, 29-35, 2005.

Facts and ideas presented in this short review concern some recent developments at the interface between sequence analysis, gene expression prediction and genome comparison carried out in our group. The guiding principle behind all results presented here is to derive biological information from genome sequences by means of a purely statistical analysis and an appropriate design of algorithms.

Download the PDF Reprint

A.Carbone, R.Madden, Insights on the evolution of metabolic networks of unicellular translationally biased organisms from transcriptomic data and sequence analysis, Journal of Molecular Evolution, 61:456–469, 2005.

Codon bias is related to metabolic functions in translationally biased organisms, and we argue two points. First, genes with high codon bias describe in meaningful ways the metabolic characteristics of the organism; important metabolic pathways corresponding to crucial characteristics of the lifestyle of an organism, such as photosynthesis, nitrification, anaerobic versus aerobic respiration, sulfate reduction, methanogenesis, and others, happen to involve especially biased genes. Second, gene transcriptional levels of sets of experiments representing a significant variation of biological conditions strikingly confirm, in the case of Saccharomyces cerevisiae, that metabolic preferences are detectable by purely statistical analysis: the high metabolic activity of yeast during fermentation is encoded in the high bias of enzymes involved in the associated pathways, suggesting that this genome was affected by a strong evolutionary pressure that favored a predominantly fermentative metabolism of yeast in the wild. The ensemble of metabolic pathways involving enzymes with high codon bias is rather well defined and remains consistent across many species; even genomes that have not been considered translationally biased, such as that of Helicobacter pylori, reveal some weak form of translational bias. We provide numerical evidence, supported by experimental data, of these facts and conclude that the metabolic networks of translationally biased genomes, observable today as projections of eons of evolutionary pressure, can be analyzed numerically and predictions of the role of specific pathways during evolution can be derived. The new concepts of Comparative Pathway Index, used to compare organisms with respect to their metabolic networks, and Evolutionary Pathway Index, used to detect evolutionarily meaningful bias in the genetic code from transcriptional data, are introduced.

Download the PDF Reprint

A.Carbone, Computational prediction of genomic functional cores specific to different microbes. Journal of Molecular Evolution, 63(6):733-746, 2006.

Computational and experimental attempts have been made to characterize a universal core of genes representing the minimal set of functional needs for an organism. Based on the increasing number of available complete genomes, comparative genomics has concluded that the universal core contains <50 genes. In contrast, experiments suggest a much larger set of essential genes (certainly more than several hundred, even under the most restrictive hypotheses) that is dependent on the biological complexity and environmental specificity of the organism. Highly biased genes, which are generally also the most expressed in translationally biased organisms, tend to be overrepresented in the class of genes deemed to be essential for any given bacterial species. This association is far from perfect; nevertheless, it allows us to propose a new computational method to detect, to a certain extent, ubiquitous genes, nonorthologous genes, environment-specific genes, genes involved in the stress response, and genes with no identified function but highly likely to be essential for the cell. Most of these groups of genes cannot be identified with previously attempted computational and experimental approaches. The large variety of lifestyles and the unusually detectable functional signals characterizing translationally biased organisms suggest using them as reference organisms to infer essentiality in other microbial species. The case of small parasitic genomes is discussed. Data issued by the analysis are compared with previous computational and experimental studies. Results are discussed both on methodological and biological grounds.

Download the PDF Reprint

J.Baussand, C.Deremble, A.Carbone, Periodic distributions of hydrophobic amino acids allows to define fundamental building blocks to align distantly related proteins, Proteins: Structure, Function, and Bioinformatics, 67(3):695-708, 2007.

Several studies on large and small families of proteins have shown in a general manner that hydrophobic amino acids are globally conserved even if they are subjected to a high rate of substitution. Statistical analysis of amino acid evolution within blocks of hydrophobic amino acids detected in sequences suggests their usage as a basic structural pattern to align pairs of proteins of less than 25% sequence identity, with no need to know their 3D structure. We present a new global alignment method and an automatic tool for Proteins with HYdrophobic Blocks ALignment (PHYBAL) based on the combinatorics of overlapping hydrophobic blocks. Two substitution matrices modeling a different selective pressure inside and outside of hydrophobic blocks are constructed, the Inside Hydrophobic Blocks Matrix (IHBM) and the Outside Hydrophobic Blocks Matrix (OHBM), and a 4-dimensional space of gap values is explored. PHYBAL performance is evaluated against the Needleman and Wunsch algorithm run with Blosum 30, Blosum 45, Blosum 62, Gonnet, HSDM, PAM250, Johnson and Remote Homo matrices. PHYBAL behavior is analyzed on 8 randomly selected pairs of proteins of < 30% sequence identity which cover a large spectrum of structural properties. It is also validated on two large datasets, the 127 pairs of the Domingues dataset with < 30% sequence identity, and 181 pairs issued from BAliBASE 2.0 and ranked by percentage of identity from 7 to 25%. Results confirm the importance of considering substitution matrices modeling hydrophobic contexts and a 4-dimensional space of gap values in aligning distantly related proteins. Two new notions of local and global stability are defined to assess the robustness of an alignment algorithm and the accuracy of PHYBAL. A new notion, the SAD-coefficient, to assess the difficulty of structural alignment is also introduced. PHYBAL has been compared to Hydrophobic Cluster Analysis and HMMSUM methods.
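As a point of reference for the baseline mentioned in this abstract, here is a minimal Python sketch of the Needleman and Wunsch global alignment score with a linear gap penalty. It is not PHYBAL: the two context-dependent matrices (IHBM, OHBM) and the 4-dimensional gap space are not modeled. The function names, the identity scoring function and the gap value are assumptions chosen only to keep the example small.

```python
def needleman_wunsch_score(a, b, score, gap=-2):
    """Optimal global alignment score of sequences a and b.

    `score(x, y)` returns the substitution score of residues x and y;
    `gap` is a linear (per-position) gap penalty, an assumption made
    here for simplicity.
    """
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            curr.append(max(
                prev[j - 1] + score(a[i - 1], b[j - 1]),  # align two residues
                prev[j] + gap,                            # gap in b
                curr[j - 1] + gap,                        # gap in a
            ))
        prev = curr
    return prev[-1]


def identity_score(x, y):
    """Toy substitution score: +1 for a match, -1 for a mismatch."""
    return 1 if x == y else -1
```

In practice `identity_score` would be replaced by a lookup in a substitution matrix such as one of those listed in the abstract.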

Download the PDF Reprint

J.Baussand, A.Carbone, Metagénomique bactérienne et virale - nouvelles définitions d'espace microbiale et nouveaux défis algorithmiques [Bacterial and viral metagenomics: new definitions of microbial space and new algorithmic challenges]. Techniques et Sciences Informatiques, Hermès, 245-255, 2007.

Several problems in metagenomics are discussed concerning genome assembly, environmental species classification, phylogenetic reconstruction, community specificity and community quantification. The challenge is great because of the difficulties due to the incomplete sequences available for uncultured microbial communities, the lack of tools for detecting homology between divergent proteins, and the missing concepts on which to base environmental classification, to cite just a few hurdles. The paper gives a concise overview of current projects and available data, and sets out some algorithmic questions.

Download the PDF Reprint

J.Breton, E.Bart-Delabesse, S.Biligui, A.Carbone, X.Sellier, M.Okome-Nkoumou, C.Nzamba, M.Kombila, I.Accoceberry, M.Thellier. Genotypic analysis of Enterocytozoon bieneusi isolates from Gabon and Cameroon: reporting a new highly divergent sequence and a wide distribution of genotypes, Journal of Clinical Microbiology, 45(8):2580–2589, 2007.

Intestinal microsporidiosis due to Enterocytozoon bieneusi is a leading cause of chronic diarrhea in severely immunocompromised HIV-positive patients. It may be a public health problem in Africa due to the magnitude of the HIV pandemic and poor sanitary conditions. We designed two prevalence studies of E. bieneusi in Central Africa, the first in HIV-positive patients from an urban setting in Gabon and the second in an unselected rural population in Cameroon. Stool samples were analyzed by IFAT and PCR. Twenty-five out of 822 HIV-positive patients from Gabon and 22 out of 758 villagers in Cameroon were found positive for E. bieneusi. The prevalence rates were surprisingly similar in both studies (3.0% and 2.9%). Genotypic analysis of the ITS region of the rRNA gene showed a high degree of diversity in samples from both countries. In Gabon, 15 isolates showed 7 different genotypes: the previously reported genotypes A, D, and K along with 4 new genotypes referred to as CAF1, CAF2, CAF3 and CAF4, respectively. In Cameroon, five genotypes were found in 20 isolates, the known genotypes A, B, D and K and the new genotype CAF4. Genotypes A and CAF4 predominated in Cameroon, whereas K, CAF4 and CAF1 were more frequent in Gabon, suggesting that different genotypes present differing risks of infection associated with immune status and living conditions. Phylogenetic analysis of the new genotype CAF4, identified in both HIV-negative and HIV-positive subjects, indicates that it represents a highly divergent strain.

Download the PDF Reprint

A.Carbone, Adaptation studied with the Self-Consistent Codon Index: genomic spaces, metabolic network comparison, minimal gene sets and viral classification. Proceedings of the Evry Spring School on Modelling Complex Biological Systems in the Context of Genomics, Genopole, Evry, May 2007.

Facts and ideas presented in this short review concern some recent developments at the interface between microbial spaces, metabolic network comparison, minimal gene sets and viral classification. The guiding principle behind all results presented here is to derive biological information from genome sequences by means of a purely statistical analysis and an appropriate design of algorithms. The paper is an updated version of (Carbone 2005).

Download the PDF Reprint

A.Carbone, Codon bias is a major factor explaining phage evolution in translationally biased hosts, Journal of Molecular Evolution, 66(3):210-23, 2008.

The size and diversity of bacteriophage populations require methodologies to quantitatively study the landscape of phage differences. Statistical approaches are confronted with small genome sizes that prevent significant single-phage analysis; comparative methods analyzing full phage genomes represent an alternative, but they are difficult to interpret because of lateral gene transfer, which creates a mosaic spectrum of related phage species. Based on a large-scale codon bias analysis of 116 DNA phages hosted by 11 translationally biased bacteria belonging to different phylogenetic families, we observe that phage genomes are almost always under codon selective pressure imposed by translationally biased hosts, and we propose a classification of phages with translationally biased hosts which is based on adaptation patterns. We introduce a computational method for comparing phages sharing homologous proteins, possibly accepted by different hosts. We observe that across phages, independently of the host, capsid genes appear to be the most affected by host translational bias. For coliphages, genes involved in virion morphogenesis, host interaction and ssDNA binding are also affected by adaptive pressure. Adaptation affects long and small phages in a significant way. We analyze in more detail the Microviridae phage space to illustrate the potential of the approach. The small number of directions in adaptation observed in phages grouped around phiX174 is discussed in the light of functional bias. The adaptation analysis of the set of Microviridae phages defined around phiMH2K shows that phage classification based on adaptation does not reflect bacterial phylogeny.

Download the PDF Reprint

A.Carbone, S.Engelen, Information content of sets of biological sequences revisited, in "Algorithmic Bioprocesses", edited by A.Condon, D.Harel, J.N.Kok, A.Salomaa, E.Winfree, Natural Computing Series, Springer, 2008. In press.

To analyze the information included in a pool of amino acid sequences, a first approach is to align the sequences, to estimate the probability of each amino acid occurring within columns of the aligned sequences and to combine these values through an "entropy" function whose minimum corresponds to absence of information, that is, to the case where each amino acid has the same probability of occurring. An alternative is to construct a distance tree between sequences (issued by the alignment) based on sequence similarity and to properly interpret the tree topology so as to model the evolutionary property of residue conservation. We introduce the concept of "evolutionary content" of a tree of sequences, and demonstrate to what extent the more classical notion of "information content" on sequences approximates the new measure and in what manner tree topology contributes sharper information for the detection of protein binding sites.
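The first, column-based approach can be sketched in a few lines of Python. This is a minimal illustration of the classical per-column information content (log2 of the alphabet size minus the Shannon entropy of the column), which is zero when all amino acids are equally probable; it does not implement the tree-based "evolutionary content" introduced in the paper. The function name and the decision to ignore gap characters are assumptions.

```python
import math
from collections import Counter


def column_information(alignment, alphabet_size=20):
    """Information content (in bits) of each column of an alignment.

    `alignment` is a list of equal-length strings. Gap characters '-'
    are ignored (an assumption); an all-gap column scores 0.
    """
    n_cols = len(alignment[0])
    if any(len(seq) != n_cols for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    result = []
    for j in range(n_cols):
        residues = [seq[j] for seq in alignment if seq[j] != "-"]
        if not residues:
            result.append(0.0)
            continue
        total = len(residues)
        # Shannon entropy of the observed residue frequencies.
        entropy = -sum((n / total) * math.log2(n / total)
                       for n in Counter(residues).values())
        result.append(math.log2(alphabet_size) - entropy)
    return result
```

A fully conserved column reaches the maximum of log2(20) bits, while a column with four equally frequent residues loses exactly 2 bits relative to that maximum.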

Download the PDF Reprint

A.Carbone, A.Mathelier, Environmental and physiological insights from microbial genome sequences, Elements of Computational Systems Biology, Huma Lodhi and Stephen Muggleton (eds.), Wiley Book Series in Bioinformatics, 2008. In press.

Facts and ideas presented in this short review are addressed to those computer scientists and mathematicians who want to learn about some open questions on the bioinformatics of microbial organisms. We present some recent results of our group on the statistical analysis of microbial genomes involving the formalization of microbial spaces, metabolic network comparison, minimal gene sets, host-phage adaptation and gene chromosomal organization. The guiding principle behind all results presented here is to derive insights on microbial physiology and habitat directly from genome sequences by means of a purely statistical analysis and an appropriate design of algorithms.

Download the PDF Reprint

J.Baussand, A.Carbone, Inconsistent distances in substitution matrices can be avoided by properly handling hydrophobic residues, Evolutionary Bioinformatics, 1-6, 2008. In press.

The adequacy of substitution matrices to model evolutionary relationships between amino acid sequences can be numerically evaluated by checking the mathematical property of triangle inequality for all triplets of residues. By converting substitution scores into distances, one can verify that a direct path between two amino acids is shorter than a path passing through a third amino acid in the amino acid space modeled by the matrix. If the triangle inequality is not satisfied, the intuition is that the evolutionary signal is not well modeled by the matrix, that the space is locally inconsistent and that the matrix construction was probably based on insufficient biological data. Previous analyses of several substitution matrices revealed that the number of triplets violating the triangle inequality increases with sequence divergence. Here, we compare matrices which are dedicated to the alignment of highly divergent proteins. The triangle inequality is tested on several classical substitution matrices as well as on a pair of "complementary" substitution matrices recording the evolutionary pressures inside and outside hydrophobic blocks in protein sequences. The analysis demonstrates the crucial role of hydrophobic residues in substitution matrices dedicated to the alignment of distantly related proteins.
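The triangle-inequality test described here can be sketched as follows in Python. The abstract does not state how scores are converted into distances, so the conversion used below, d(a,b) = (s(a,a) + s(b,b))/2 - s(a,b), is a hypothetical choice made only so that the example runs; the function names are likewise illustrative, and a symmetric score matrix is assumed.

```python
from itertools import permutations


def scores_to_distances(scores):
    """Convert a symmetric score dict {(a, b): s} into distances.

    Hypothetical conversion (not taken from the paper):
    d(a, b) = (s(a, a) + s(b, b)) / 2 - s(a, b), so that d(a, a) = 0.
    """
    residues = sorted({a for a, _ in scores})
    return {(a, b): (scores[(a, a)] + scores[(b, b)]) / 2 - scores[(a, b)]
            for a in residues for b in residues}


def triangle_violations(dist):
    """Return triplets (a, b, c) for which the direct path a-c is longer
    than the path through b, i.e. d(a, c) > d(a, b) + d(b, c).

    Only triplets with a < c are reported, since (a, b, c) and (c, b, a)
    describe the same violation when the distances are symmetric.
    """
    residues = sorted({a for a, _ in dist})
    return [(a, b, c) for a, b, c in permutations(residues, 3)
            if a < c and dist[(a, c)] > dist[(a, b)] + dist[(b, c)]]
```

Counting the length of the list returned by `triangle_violations` for each matrix gives the kind of per-matrix violation count that the abstract refers to.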

Download the PDF Reprint