Exploration of the large-scale structure of triplet distributions in whole genomes

Suppose we have the following:

A set of three-letter words (triplets) that can be used to encode information. Each triplet has its own frequency, which is more or less maintained throughout the encoding; for example, some triplets always have a frequency close to zero (the bias phenomenon).

A text consisting of subsequences of two types. Coding regions are composed of triplets that follow one another without any delimiters; these regions must be strictly conserved during evolution. Non-coding regions (or junk) tolerate mutations and therefore have no special structure; for example, we can assume that they are composed of the same set of triplets but have undergone a number of random insertions and deletions of individual letters.
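A model text of this kind can be generated in a few lines. The following is only a minimal sketch under stated assumptions: the four-letter alphabet, the way the biased frequencies are drawn, and the function names random_code, coding_region and junk_region are all illustrative choices, not taken from the paper or its program.

```python
import itertools
import random

ALPHABET = "ACGT"  # assumed alphabet; any set of letters would do
TRIPLETS = ["".join(t) for t in itertools.product(ALPHABET, repeat=3)]  # 64 words


def random_code(rng):
    """Draw a hypothetical biased frequency for each of the 64 triplets."""
    # Raising uniform numbers to a power pushes many weights toward zero,
    # which imitates the bias phenomenon.
    weights = [rng.random() ** 4 for _ in TRIPLETS]
    total = sum(weights)
    return [w / total for w in weights]


def coding_region(freqs, n_triplets, rng):
    """Concatenate triplets drawn with the given frequencies, no delimiters."""
    return "".join(rng.choices(TRIPLETS, weights=freqs, k=n_triplets))


def junk_region(freqs, n_triplets, n_indels, rng):
    """Start from a coding-like text, then apply random single-letter indels."""
    letters = list(coding_region(freqs, n_triplets, rng))
    for _ in range(n_indels):
        pos = rng.randrange(len(letters) + 1)
        if pos < len(letters) and rng.random() < 0.5:
            del letters[pos]                          # deletion
        else:
            letters.insert(pos, rng.choice(ALPHABET))  # insertion
    return "".join(letters)
```

A text is then obtained by alternating the two kinds of regions, for example coding_region(f, 200, rng) + junk_region(f, 100, 60, rng) with f = random_code(rng) and rng = random.Random(0).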

Now suppose we are given such a text and we do not know anything about the code.

The question is: is it possible to find the locations of the coding regions and the code itself?

The answer: it is possible, if the entropy of the code (or, more precisely, the mutual information) is high enough. Then, in the 64-dimensional space of triplet frequencies, the distribution of local triplet frequencies looks like this real picture, taken from the complete genome of Caulobacter crescentus:

Here black points correspond to non-coding regions and colored points to coding regions: red points are on the direct strand, yellow points on the complementary one.
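Each point of such a picture is the vector of triplet frequencies in one window of the text. A minimal sketch of that computation follows, assuming non-overlapping triplets read from the first letter of every window; the window width, the step and the function names are illustrative assumptions, not the settings of the paper.

```python
import itertools

ALPHABET = "ACGT"
TRIPLETS = ["".join(t) for t in itertools.product(ALPHABET, repeat=3)]
INDEX = {t: i for i, t in enumerate(TRIPLETS)}  # triplet -> coordinate 0..63


def triplet_frequencies(window):
    """Frequencies of non-overlapping triplets read from position 0 of the window."""
    counts = [0] * 64
    n = 0
    for i in range(0, len(window) - 2, 3):
        idx = INDEX.get(window[i:i + 3])  # triplets with unknown letters are skipped
        if idx is not None:
            counts[idx] += 1
            n += 1
    return [c / n for c in counts] if n else counts


def window_points(text, width=300, step=10):
    """One 64-dimensional point per window; width and step are assumed values."""
    return [triplet_frequencies(text[s:s + width])
            for s in range(0, len(text) - width + 1, step)]
```

The step is deliberately not a multiple of three here, so that successive windows start at different positions relative to the triplet boundaries. The resulting points can then be projected to two or three dimensions for viewing.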

A rough explanation of the structure is rather clear: a window that contains coding information on the direct strand can read it with one of three possible phase shifts. The phase shift is not known in advance, so approximately one third of the windows fall into the vicinity of the point corresponding to the triplet distribution f_ijk of the code (0-shift), one third are close to the same distribution read with a shift of one letter (1-shift), and the last third to the distribution read with a shift of two letters (2-shift). The same holds for the complementary strand, but with centers corresponding to the complementary distributions.
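The positions of these centers can be estimated from the code's own distribution. The sketch below assumes that consecutive triplets are statistically independent, so that a triplet read across a boundary factorizes into the tail of one codon and the head of the next; the function names and the dictionary representation are illustrative assumptions.

```python
import itertools

ALPHABET = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
TRIPLETS = ["".join(t) for t in itertools.product(ALPHABET, repeat=3)]


def shifted(freqs, shift):
    """Expected triplet distribution when reading with a phase shift of 1 or 2.

    freqs maps each of the 64 triplets to its frequency in the correct frame.
    Assumes consecutive triplets are independent.
    """
    out = {}
    for i, j, k in TRIPLETS:
        if shift == 1:
            # letters i, j end one codon; letter k starts the next one
            tail = sum(freqs[l + i + j] for l in ALPHABET)
            head = sum(freqs[k + m + n] for m in ALPHABET for n in ALPHABET)
        elif shift == 2:
            # letter i ends one codon; letters j, k start the next one
            tail = sum(freqs[l + m + i] for l in ALPHABET for m in ALPHABET)
            head = sum(freqs[j + k + n] for n in ALPHABET)
        else:
            raise ValueError("shift must be 1 or 2")
        out[i + j + k] = tail * head
    return out


def complementary(freqs):
    """Distribution seen on the complementary strand (reverse complement)."""
    return {t: freqs["".join(COMPLEMENT[c] for c in reversed(t))] for t in TRIPLETS}
```

Together with the original distribution, shifted(f, 1), shifted(f, 2) and their complementary counterparts give six candidate centers; non-coding windows would be expected to gather around a seventh point.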

We can also show the trajectory of the local triplet frequencies in this space. It looks like a random walk in the vicinity of the coding phase centers:

In fact, this makes it possible to segment the sequence into regions of the same coding phase. The method does not use any training dataset, yet the results of gene recognition are quite convincing.
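One simple way to turn the points into a segmentation is to label every window by its nearest center and merge consecutive windows that share a label. This is only an illustrative sketch, not necessarily the procedure used in the paper; the centers are assumed to be already known (for example, from a clustering step), and the names nearest and segments are hypothetical.

```python
def nearest(point, centers):
    """Label of the center closest to the point (squared Euclidean distance)."""
    best_label, best_dist = None, float("inf")
    for label, center in centers.items():
        dist = sum((p - q) ** 2 for p, q in zip(point, center))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def segments(points, centers):
    """Merge consecutive windows with the same label.

    Returns a list of (label, first_window_index, last_window_index).
    """
    runs = []
    for i, point in enumerate(points):
        label = nearest(point, centers)
        if runs and runs[-1][0] == label:
            runs[-1] = (label, runs[-1][1], i)  # extend the current run
        else:
            runs.append((label, i, i))          # start a new run
    return runs
```

Window indices can be converted back to positions in the text by multiplying by the step used when the windows were cut.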

You can read more about the method in this paper: VISUALIZING THE SPATIAL STRUCTURE OF TRIPLET DISTRIBUTIONS IN GENETIC TEXTS (Andrey Zinovyev).

Examples of the model sequences used in the paper are here:

The program, written in Java, for the calculations described in the paper is here.

Here is the link to the ViDaExpert data visualization tool.