7-clusters scatter

To visualize the 7-cluster structure, a data set was prepared for each of the 143 genomes in the following way:

1) A GenBank file with the full genome was downloaded from the GenBank FTP site. The complete sequence and the annotation were parsed using the BioJava package. When a genome had two chromosomes, the sequences of both were concatenated. The short sequences of plasmids were ignored.

2) Let $S$ be a sequence of length $N$, with $S_i$ the letter in the $i$th position of $S$. We define a constant step $p$ and a window size $W$, and in every position $i=W/2+pk$, $k=0..[\frac{N-W}{p}]$, we open a window in the sequence, centered at position $i$. In every window we count words of length $k$ (here $k$ denotes the word length, not the window index), starting from every third position:

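\begin{displaymath}
count(c_{i_1}c_{i_2}..c_{i_k}) = \sum_{j=0}^{[\frac{W-k}{3}]} comp(S_{i-W/2+3j}S_{i-W/2+3j+1}..S_{i-W/2+3j+k-1},\; c_{i_1}c_{i_2}..c_{i_k})
\end{displaymath} (1)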

where $comp(word1,word2)$ is a string comparison function equal to 1 if $word1$ equals $word2$ and 0 otherwise, and $c_i$ is a letter of the genetic alphabet ($c_1=A$, $c_2=C$, $c_3=G$, $c_4=T$).

For every window a frequency vector is defined:

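\begin{displaymath}
X_i^{c_{i_1}c_{i_2}..c_{i_k}} = \frac{count(c_{i_1}c_{i_2}..c_{i_k})}{\sum_{j_1,j_2,..,j_k} count(c_{j_1}c_{j_2}..c_{j_k})}
\end{displaymath} (2)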

All words containing non-standard letters like N, S, W are ignored.

The data set $\{X_i^j\}$, $j=1..[\frac{N-W}{p}]$ is normalized to have zero mean and unit standard deviation.

3) Using the annotation from the GenBank file, every window is assigned a label according to whether the center of the window lies inside a marked CDS feature (including hypothetical ones) or not. In the first case the reading frame and the strand of the CDS feature are determined and the window is assigned one of the labels F0, F1, F2, B0, B1, B2. In the second case the label is J.

4) A standard PCA is performed and the first three principal components are calculated. They form a 3D orthonormal basis in the $4^k$-dimensional space. Every point is projected onto this basis, so every point is assigned three coordinates.
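Steps 2 and 4 can be sketched in Python with NumPy. This is a minimal sketch and not the original BioJava-based pipeline: the function names `window_frequencies`, `standardize` and `pca_project`, the default values of `W`, `p` and `k`, per-coordinate normalization, and leaving constant coordinates unscaled are all assumptions made for illustration.

```python
from itertools import product

import numpy as np


def window_frequencies(seq, W=300, p=12, k=3):
    """Word-frequency vectors for sliding windows (W, p, k defaults are assumed)."""
    words = ["".join(t) for t in product("ACGT", repeat=k)]
    index = {w: j for j, w in enumerate(words)}
    rows = []
    # A window centered at W/2 + p*m starts at p*m; keep only windows that fit.
    for start in range(0, len(seq) - W + 1, p):
        counts = np.zeros(len(words))
        # Words are read starting from every third position of the window.
        for pos in range(start, start + W - k + 1, 3):
            j = index.get(seq[pos:pos + k])
            if j is not None:  # words with non-standard letters are ignored
                counts[j] += 1
        total = counts.sum()
        rows.append(counts / total if total > 0 else counts)
    return np.array(rows)


def standardize(X):
    """Zero mean and unit standard deviation for every coordinate."""
    std = X.std(axis=0)
    std[std == 0] = 1.0  # assumption: constant coordinates are only centered
    return (X - X.mean(axis=0)) / std


def pca_project(X, n_components=3):
    """Project the data onto the first principal components (via SVD)."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    basis = vt[:n_components]  # orthonormal rows spanning the PCA subspace
    return Xc @ basis.T, basis
```

A call such as `pca_project(standardize(window_frequencies(sequence)))` would return the three coordinates of every window together with the basis vectors.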

7-cluster schema

To create the schema of the 7-cluster structure, the following method was used. We calculated the mean point $y^L$ for every subset with a given label $L$.

For the set of centroids $y^{F0}$, $y^{F1}$, $y^{F2}$, $y^{B0}$, $y^{B1}$, $y^{B2}$, $y^{J}$ a matrix of Euclidean distances was calculated and visualized using classical MDS.
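The centroid and MDS step can be illustrated as follows. This is a sketch under assumptions: the helper names `centroids`, `distance_matrix` and `classical_mds` are hypothetical, and classical MDS is implemented here in its usual Torgerson form (double centering of squared distances followed by an eigendecomposition), which may differ in detail from the software actually used.

```python
import numpy as np


def centroids(Y, labels):
    """Mean point of every labelled subset; returns (label names, centroids)."""
    names = sorted(set(labels))
    labels = np.asarray(labels)
    return names, np.array([Y[labels == name].mean(axis=0) for name in names])


def distance_matrix(C):
    """Matrix of pairwise Euclidean distances between the rows of C."""
    diff = C[:, None, :] - C[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def classical_mds(D, dim=2):
    """Classical (Torgerson) MDS of a Euclidean distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    B = -0.5 * J @ (D ** 2) @ J               # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)            # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:dim]      # indices of the largest ones
    # Small negative eigenvalues caused by rounding are clipped to zero.
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0.0))
```

The rows of `classical_mds(distance_matrix(C))` are the 2D positions of the centroids; they are defined only up to rotation and reflection.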

To visualize the "radii" of the subsets, the mean squared distance $d^p$ of the points of class $p$ to their centroid was calculated (intraclass dispersion). To show this value on a 2D plane, a dimension correction factor has to be introduced, so the radius drawn in the picture equals

\end{displaymath} (3)

The shape of a cluster is not always spherical, and the intersection of radii often does not reflect the real overlap of classes in the high-dimensional space. To show how well the classes are actually separated, we developed the following method for cluster contour visualization. To create a contour for class $p$, we calculate the averages of all positive and of all negative projections on the vectors connecting centroid $p$ with the 6 other centroids $i=1..6$.

\begin{displaymath}
{\bf n}_i^p = \frac{y^i-y^p}{\vert\vert y^i-y^p\vert\vert}, \quad n_i^p(X_k)=(X_k-y^p,{\bf n}^p_i)
\end{displaymath} (4)

\begin{displaymath}
f^p_i=\frac{\sum_{n_i^p(X_k)>0}{n_i^p(X_k)}}{\sum_{n_i^p(X_k)>0}{1}}, \quad b^p_i=\frac{\sum_{n_i^p(X_k)<0}{n_i^p(X_k)}}{\sum_{n_i^p(X_k)<0}{1}}
\end{displaymath} (5)
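Equations (4) and (5) amount to averaging signed projections, which can be sketched as below. The function name `mean_projections` is hypothetical. Two further assumptions are that the sums run over the points $X_k$ of class $p$ only, and that an average is set to 0 when no projection has the required sign.

```python
import numpy as np


def mean_projections(X_p, y_p, y_i):
    """Average positive (f) and negative (b) projections of the points X_p
    on the unit vector pointing from centroid y_p to centroid y_i."""
    n = (y_i - y_p) / np.linalg.norm(y_i - y_p)   # unit direction, eq. (4)
    proj = (X_p - y_p) @ n                        # signed projections, eq. (4)
    pos, neg = proj[proj > 0], proj[proj < 0]
    f = pos.mean() if pos.size else 0.0           # assumption: 0 if empty
    b = neg.mean() if neg.size else 0.0           # b is negative or zero
    return f, b
```

Calling it once for each of the 6 other centroids gives the 6 pairs $(f_i^p, b_i^p)$ for class $p$.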

Then, on the 2D MDS plot, where every centroid $(y^p)'$ has 2 coordinates, we place 12 points $t^f_i$, $t^b_i$ analogously.

\begin{displaymath}
({\bf n}_i^p)' = \frac{(y^i)'-(y^p)'}{\vert\vert(y^i)'-(y^p)'\vert\vert},
\end{displaymath} (6)

\begin{displaymath}
t^f_i = (y^p)'+f_i^p({\bf n}_i^p)', \quad t^b_i = (y^p)'+b_i^p({\bf n}_i^p)', \quad i=1..6
\end{displaymath} (7)
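The placement of the 12 contour points on the MDS plot can be sketched as follows. The function name `contour_points` and its argument layout are hypothetical. The sketch offsets the points from the 2D centroid of class $p$ itself, so that they surround that class; this anchoring is an assumption of the sketch. The final smoothing in polar coordinates is not shown, since its details are not specified.

```python
import numpy as np


def contour_points(y2d, p, f, b):
    """Contour points for class p on the 2D MDS plot.

    y2d  : (m, 2) array of MDS coordinates of the centroids
    p    : row index of the class whose contour is built
    f, b : length-m sequences of mean positive / negative projections
           towards every other centroid (entry p is unused)
    """
    pts_f, pts_b = [], []
    for i in range(len(y2d)):
        if i == p:
            continue
        # 2D unit vector from centroid p towards centroid i, eq. (6)
        n2 = (y2d[i] - y2d[p]) / np.linalg.norm(y2d[i] - y2d[p])
        pts_f.append(y2d[p] + f[i] * n2)   # point on the side facing class i
        pts_b.append(y2d[p] + b[i] * n2)   # b[i] <= 0: point on the far side
    return np.array(pts_f), np.array(pts_b)
```

With 7 centroids this returns two arrays of 6 points each, i.e. the 12 points to be approximated by the smooth contour.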

Using a smoothing procedure in polar coordinates we create a smooth contour approximating these 12 points.