Discovery of Bacterial Relationships through Latent


Discovery of Prokaryotic
Relationships through Latent
Structure
Natalya Muzinich
Advisors:
Dr. John Paolillo
Dr. Sun Kim
Importance of Studying
Bacterial Relationships

Identification of regions
– distinguishing pathogenic from non-pathogenic strains
– important for general cell functioning
  Ex. minimal gene set
  Ex. non-coding RNA

Better understanding of evolution
– lost genes
– rearranged genes
– congruent evolution
7/16/2015
2
Common Computational
Methodology in Relating
Biological Organisms

Alignment
– BLAST

Phylogenies
– Maximum likelihood
– Neighbor joining
– Bayesian
– Maximum parsimony
Problems with Existing
Methods

Based on a subset of available genetic
information
– Potentially leads to inaccurate results

Regions that do not code for proteins are typically
excluded from analysis
– They have been shown to be important:

“for decades the functions of RNA in the cells was grossly
underestimated. It is difficult to assess the number of
noncoding RNAs yet to be discovered. Unlike protein-coding
genes…, RNA-coding genes are much more difficult to find
within the genomic sequences.” (Szymanski et al, 2003)
Problems with Existing
Methods


Computationally expensive
Problems
– Sequence length for BLAST
– Number of genomes for many methods of phylogenetic tree construction
  ex. runtime is in O(n!) for maximum parsimony

Very limited number of genomes can be classified
– ex. around 10 with maximum parsimony
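One way to see why tree-based methods stall at around ten genomes is to count the candidate trees. The sketch below assumes an exhaustive search over unrooted bifurcating trees, which the slides do not state; the function name is hypothetical. For n taxa the number of such trees is (2n - 5)!!, a standard combinatorial result.

```python
def unrooted_tree_count(n_taxa):
    """Number of distinct unrooted bifurcating trees: (2n - 5)!! for n >= 3."""
    if n_taxa < 3:
        return 1
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5.
    for odd in range(3, 2 * n_taxa - 4, 2):
        count *= odd
    return count


# Growth of the search space for a few genome counts.
growth = {n: unrooted_tree_count(n) for n in (4, 5, 10, 20)}
```

Ten genomes already give about two million candidate trees, and every added genome multiplies that count again, so an exhaustive search quickly becomes impractical.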
Project Goals


Utilize information in the whole
genomic sequence
Scalability
– Be able to compare a large number of
genomes quickly
Proposal: Divide-and-Conquer


Divide a complete genomic sequence
into overlapping subsequences
Construct genomic representation
from the subsequences
(cf. shotgun sequencing)
Representation Construction




Select subsequence length (7)
Count all occurrences of every possible
(4^7 = 16384) subsequence within a unit of
genetic information
Arrange the pairings of subsequences
with their frequencies of occurrence into a
named vector
Collect all vectors into a matrix
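The construction steps above can be sketched in a few lines of Python. This is a minimal illustration, not the Patdist or C programs used in the project; the function names, the lowercase acgt alphabet, and the toy sequences are all assumptions.

```python
from collections import Counter
from itertools import product

K = 7  # subsequence length used in the slides


def all_kmers(k=K, alphabet="acgt"):
    """Every possible k-mer over the alphabet, in a fixed order (4**k entries)."""
    return ["".join(p) for p in product(alphabet, repeat=k)]


def kmer_counts(sequence, k=K):
    """Count overlapping k-mers; windows with non-acgt symbols are dropped."""
    seq = sequence.lower()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {kmer: counts.get(kmer, 0) for kmer in all_kmers(k)}


def build_matrix(units, k=K):
    """Rows are k-mers, columns are genomic units (dict of name -> sequence)."""
    names = list(units)
    vectors = [kmer_counts(units[name], k) for name in names]
    rows = all_kmers(k)
    matrix = [[vec[kmer] for vec in vectors] for kmer in rows]
    return rows, names, matrix


# Toy input with k = 3 so the matrix stays small (made-up sequences).
toy_units = {"unit_A": "acgtacgtacgta", "unit_B": "aaaaaaaaaa"}
rows, names, matrix = build_matrix(toy_units, k=3)
```

Every genomic unit gets a vector over the same fixed set of subsequences, so the vectors line up as columns of one matrix regardless of sequence length.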
Data Matrix
            Arp_.     A_69     A_70     A_71     A_72      A_5     A_67    Arc_. Bclls_h. Bclls_s.      Br.
aaaaata    842690    14991    10884     7809     8606     6950     6652     6186     4882     3902     3360
aaaaaca        11       12       13       14       15       16       17       18       19       20       21
aaaaact      3166     3160     2958     2623     2495     2387     2237     2267     2150     2074     1869
aaaaacc        22       23       24       25       26       27       28       29       30       31       32
aaaataa      1786     1730     2041     1980     1734     1771     1764     1601     1584     1558     1616
aaaatat        33       34       35       36       37       38       39       40       41       42       43
aaaatag      1487     1426     1437     1447     1411     1421     1345     1406     1388     1331     1320
aaaatac        44       45       46       47       48       49       50       51       52       53       54
aaaagct      1274     1315     1251     1284     1297     1306     1227     1307     1203     1291     1286
aaaagcc        55       56       57       58       59       60       61       62       63       64       65
aaaacaa      1294     1288     1232     1163     1238     1305     1299     1292     1232     1152     1175
aaaacag        66       67       68       69       70       71       72       73       74       75       76
aaaacac      1268     1266     1234     1251     1360     1328     1385     1439     1349     1347     1315
aaaacta        77       78       79       80       81       82       83       84       85       86       87
aaaactt      1281     1388     1451     1338     1332     1372     1351     1356     1380     1317     1276
aaaacct        88       89       90       91       92       93       94       95       96       97       98
aaaaccc      1425     1349     1263     1316     1301     1323     1324     1308     1260     1277     1269
aaataaa        99      100      101      102      103      104      105      106      107      108      109
aaataat      1197     1157     1201     1241     1272     1320     1225     1272     1313     1318     1270
aaatata       110      111      112      113      114      115      116      117      118      119      120
Genomes


Represented by the same
subsequences
Differ in the distribution of the
subsequences
Proposal: Latent
Structure Modeling


Reasons behind correlations?
Statistical method capturing
correlations in the observed data
through unobserved (latent) variables
– Principal Component Analysis (PCA)

Singular value decomposition (SVD)
detects these correlations
SVD



Requires data arranged in a matrix
Values are correlations between the
observed data (rows) and their
representation (columns)
In this project rows are heptamers,
columns are genomic units
(chromosomes and plasmids)
SVD


Decomposes any matrix into a product
of three matrices: X = U S V^T
Derives the system of coordinates in a
high dimensional space for
– the observed data (matrix U)
– its representation (matrix V^T)

S measures variance along each
dimension
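The decomposition can be checked numerically on a small matrix. The sketch below uses NumPy as a stand-in (the project itself used R), and the matrix values are made up for illustration.

```python
import numpy as np

# Toy stand-in for the heptamer-by-genomic-unit matrix (values are made up).
X = np.array([[3.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 4.0],
              [2.0, 0.0, 1.0]])

# Thin SVD: U is rows x r, s holds the singular values, Vt is r x columns.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# The product of the three factors reproduces X up to rounding error.
X_rebuilt = U @ np.diag(s) @ Vt

# Share of the total sum of squares carried by each dimension.
share = s**2 / np.sum(s**2)
```

The singular values come back sorted from largest to smallest, so the leading dimensions carry the largest share of the structure in X.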
Measuring Similarity:
Coordinate System


Allows vectors to be used to measure
distance among the data points
Distance measures
– Euclidean
– cosine similarity
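Both measures are short to write down for coordinate vectors. A minimal sketch using only the Python standard library; the function names and example vectors are illustrative.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two coordinate vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Two vectors pointing the same way but with different lengths.
p = [1.0, 2.0, 3.0]
q = [2.0, 4.0, 6.0]
dist_pq = euclidean_distance(p, q)
cos_pq = cosine_similarity(p, q)
```

The two measures can disagree: p and q are far apart in Euclidean terms but have a cosine similarity of 1, because cosine similarity ignores vector length.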
Proposal: Hierarchical
Clustering


Input: factor scores computed from the
results of SVD to derive a matrix
F = (√S · V^T)^T
Method: Ward’s
– Minimizes variance within each cluster
– Uses the error sum-of-squares (ESS) criterion
– Principal components minimize ESS
– Clustering correlates with PCA results
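The factor-score formula and Ward's merging criterion from the hierarchical clustering slide can be sketched as follows. This is an illustrative, naive implementation in NumPy, not the R routine used in the project; the function names and the toy matrix are assumptions. At each step it merges the two clusters whose union adds the least to the error sum of squares, which for clusters of sizes n_a and n_b with centroids c_a and c_b is n_a n_b / (n_a + n_b) · ||c_a − c_b||².

```python
import numpy as np


def factor_scores(X):
    """F = (sqrt(S) V^T)^T: one row of scores per column (genomic unit) of X."""
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (np.sqrt(np.diag(s)) @ Vt).T


def ward_merges(points):
    """Naive Ward clustering: repeatedly merge the pair of clusters whose
    union gives the smallest increase in the error sum of squares (ESS)."""
    clusters = {i: [i] for i in range(len(points))}
    merges = []
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                na, nb = len(clusters[a]), len(clusters[b])
                centroid_a = points[clusters[a]].mean(axis=0)
                centroid_b = points[clusters[b]].mean(axis=0)
                gap = centroid_a - centroid_b
                increase = na * nb / (na + nb) * float(gap @ gap)
                if best is None or increase < best[0]:
                    best = (increase, a, b)
        increase, a, b = best
        clusters[a] = clusters[a] + clusters.pop(b)
        merges.append((a, b, increase))
    return merges


# Toy matrix: columns 0 and 1 are similar, as are columns 2 and 3 (made up).
X = np.array([[5.0, 5.1, 1.0, 1.2],
              [4.0, 4.2, 0.9, 1.0],
              [1.0, 0.9, 6.0, 6.1],
              [0.5, 0.6, 5.0, 5.2]])
F = factor_scores(X)
merges = ward_merges(F)
```

The pairwise scan is cubic in the number of units, which is acceptable for a few dozen genomic units; a library routine would be the practical choice at larger scale.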
Data and Related Tools

Data
– 58 bacterial genomes from NCBI GenBank
– 83 total (plasmids, second chromosomes)
– log-transformed
– normalized (by row, Z-scores)
Patdist program of Dr. Kim
– a list of the heptamers at every position in a genome

C program
– subsequences paired with their frequencies

R programming language for statistical analysis
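The two preprocessing steps listed under Data (log transform, then row-wise Z-scores) might look like this in outline. The slides do not give the exact formulas, so the +1 offset inside the logarithm and the use of the population standard deviation are assumptions, as are the function names and the sample counts.

```python
import math
from statistics import mean, pstdev


def log_transform(matrix):
    """log(1 + count); the +1 offset is an assumption that keeps zero counts finite."""
    return [[math.log1p(v) for v in row] for row in matrix]


def zscore_rows(matrix):
    """Rescale each row to mean 0 and standard deviation 1.
    Uses the population standard deviation (an assumption); constant rows map to 0."""
    scaled = []
    for row in matrix:
        mu, sigma = mean(row), pstdev(row)
        scaled.append([(v - mu) / sigma if sigma > 0 else 0.0 for v in row])
    return scaled


# Made-up counts: one row per heptamer, one column per genomic unit.
counts = [[1274, 1315, 1251, 1284],
          [0, 3, 1, 0],
          [7, 7, 7, 7]]
prepared = zscore_rows(log_transform(counts))
```

Normalizing by row puts frequent and rare heptamers on the same scale, so the later decomposition is not dominated by a handful of very common subsequences.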
Related Work

G.W. Stuart and M.W. Berry,
Comprehensive Whole Genome
Bacterial Phylogeny Using Correlated
Peptide Motifs Defined in a High
Dimensional Vector Space, J. of
Bioinformatics and Computational
Biology 1(3) (2003), pp. 475-493.
Stuart and Berry (S&B)
vs. Current Approach

S&B represented genomes as tetrads of
peptides
– Numrows(S&B) = 20^4 = 160000
– Numrows(current) = 4^7 = 16384
– S&B ≈ 10 × current

S&B built a phylogenetic tree using the
PHYLIP v3.6 package
Singular Values
Classification – NOT a Phylogeny
Principal Component Dimensions 1 & 2
Principal Component Dimensions 2 & 4
Principal Component Dimensions 3 & 4
Principal Component Dimensions 5 & 6
Principal Component Dimensions 7 & 8
Principal Component Dimensions 8 & 9
Results

7 major groupings
– Most plasmids
– Mycobacteria
– Rhizobiales
– Enterobacteriales
– Archaea
– Gamma-Epsilon
– Bacilli and Mycoplasma
Plasmids
Mycobacteria, Rhizobiales and
Enterobacteria
Archaea
Gamma-Epsilon
Bacilli and Mycoplasma
Future Work


Analysis and interpretation of
heptamer clusters
Establishing the relationship between
heptamer and genetic clusters
– Principal Components
Conclusion

Flexible method:
– can utilize
  all available genetic information
  a subset of it
– choice of the unit of representation

Scalable
References









G.W. Stuart and M.W. Berry, Comprehensive Whole Genome Bacterial Phylogeny
Using Correlated Peptide Motifs Defined in a High Dimensional Vector Space, J. of
Bioinformatics and Computational Biology 1(3) (2003), pp. 475-493.
Wall, Michael E., Andreas Rechtsteiner, Luis M. Rocha. "Singular value decomposition
and principal component analysis". In A Practical Approach to Microarray Data
Analysis. D.P. Berrar, W. Dubitzky, M. Granzow, eds. pp. 91-109, Kluwer: Norwell, MA
(2003).
Basilevsky, A. Statistical factor analysis and related methods: theory and applications.
New York: J. Wiley (1994).
Ward, Joe H. Hierarchical Grouping to optimize an objective function. Journal of
the American Statistical Association, 58(301) (1963), 236-244.
http://www.uwlax.edu/faculty/will/svd/svd/index.html
http://www.resample.com/xlminer/help/HClst/HClst_intro.htm
Rogozin et al. Congruent evolution of different classes of non-coding DNA in
prokaryotic genomes. Nucleic Acids Research, 2002, Vol. 30, No. 19 pp.4264-4271
Szymanski et al. Noncoding regulatory RNAs database. Nucleic Acids Research, 2003,
Vol. 31, No. 1 pp. 429-431
http://www.genomics.arizona.edu/553/documents
Acknowledgements




Dr John Paolillo
Dr Sun Kim
Dr Haixu Tang
Informatics Colleagues
Any Questions?

“A three-way genome comparison of
the CFT073, enterohemorrhagic E.coli
EDL933, and laboratory strain MG1655
reveals that, amazingly, only 39.2% of
their combined (non-redundant) set of
proteins actually are common to all
three strains.” (N.Moran, 2004)