A Framework for Sequence Cluster Merging

(Also showing importance of domain knowledge)
Arvind Gopu
Masters student, Computer Science & Bioinformatics
Indiana University, Bloomington
http://biokdd.informatics.indiana.edu/~agopu
Email: [email protected]
Introduction

Sequence clustering is a very important research topic.

Bottom-up approach: merge elements recursively up to a certain specificity
Top-down approach: split elements until the desired specificity is achieved
Two important issues: selectivity and sensitivity
The sequence clustering problem is unique

No “observable” attributes, unlike most clustering problems
Example:

Supermarket: soda, fruit juice, frozen foods, clothing, etc.
Demographic: height, race, etc.
Sequence clustering: just a string of amino acid characters (with accompanying well-studied sequence comparison/alignment programs)
Introduction …

Getting back to sequence clustering…




Fragmentation problem: well known in sequence clustering algorithms.
Example: BAG (Sun Kim)
99% accuracy (selective), but at the cost of ~40-50% fragmentation (over-sensitive)
Solution?

Bottom-up merging of the fragmented clusters
Need for framework

The suggested bottom-up approach is possible using various sub-methods

Framework: do common and unique tasks seamlessly
Insert new sub-methods easily, with very little hassle
Implemented primarily in Perl, with supporting C programs and Unix shell scripts
Framework Schematic
[Schematic: boxes in the order shown on the slide]
Merge Suggestions from Clustering Algorithm
Test Scaffold
Prepare Sequence Data
Generate Combined Profile for Two Fragment Clusters
Test Merge’bility
Post-process New Clustering Result
Enhanced Clustering Result
Framework – Profile Generation
[Framework schematic repeated, with the “Generate Combined Profile for Two Fragment Clusters” box highlighted]
Profile Generation – MSA

MSA = Multiple Sequence Alignment
[Diagram: C1 gives MSA (C1); C2 gives MSA (C2); MSA (C1) and MSA (C2) are combined into MSA (C1, C2), the combined profile]
Profile Generation – MSA

Common first step: MSA profile generation for the two fragment clusters C1 and C2 (Clustalw)

MSA (C1) and MSA (C2)
Most expensive step in the framework
Common second step: combined profile generation (Clustalw)

Prof_Align [MSA (C1), MSA (C2)]
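The two common steps above can be sketched as three Clustalw invocations. This is only an illustration in Python, not the original Perl wrapper, which the slides do not show. The helper names and file names are hypothetical, and the option names (-INFILE, -ALIGN, -OUTFILE, -PROFILE1, -PROFILE2, -PROFILE) are taken from Clustalw's command-line help as I understand it, so check them against the installed version.

```python
from pathlib import Path


def msa_command(fasta_path, aln_path, clustalw="clustalw"):
    """Command line for a multiple alignment of one fragment cluster."""
    return [clustalw, f"-INFILE={fasta_path}", "-ALIGN", f"-OUTFILE={aln_path}"]


def profile_align_command(aln1, aln2, out_path, clustalw="clustalw"):
    """Command line that merges two existing alignments by profile alignment."""
    return [
        clustalw,
        f"-PROFILE1={aln1}",
        f"-PROFILE2={aln2}",
        "-PROFILE",
        f"-OUTFILE={out_path}",
    ]


def combined_profile_commands(c1_fasta, c2_fasta, combined="combined.aln",
                              clustalw="clustalw"):
    """Return the commands for MSA (C1), MSA (C2) and Prof_Align, in that order.

    Output names are derived from the input names (hypothetical convention).
    """
    c1_aln = str(Path(c1_fasta).with_suffix(".aln"))
    c2_aln = str(Path(c2_fasta).with_suffix(".aln"))
    return [
        msa_command(c1_fasta, c1_aln, clustalw),
        msa_command(c2_fasta, c2_aln, clustalw),
        profile_align_command(c1_aln, c2_aln, combined, clustalw),
    ]
```

Each returned list can be handed to subprocess.run(cmd, check=True) where Clustalw is installed. Building the commands separately from running them keeps the expensive MSA step easy to cache or skip.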
Profile Generation – MSA explained..


All of the implemented techniques depend on MSA profiles
MSA profile: align more than 2 sequences simultaneously
Image from http://bioinformatics.weizmann.ac.il/~pietro/Making_and_using_protein_MA/
Profile Generation – MSA explained..
Image from http://www.mscs.mu.edu/~cstruble/class/mscs230/fall2002/notes/3
Framework – Merge’bility Test
[Framework schematic repeated, with the “Test Merge’bility” box highlighted]
Model Comparison based Merge Test

Statistics/machine learning based method:

Uses relative entropy and statistical measures from the runs test
Drawbacks

Almost impossible to pin down threshold values for the z-score or any other statistical measure
Extremely dependent on equal sample sizes: does not work well when the two fragment sizes differ
Model Comparison based Merge Test


Each column in an MSA profile is a probabilistic model (details of construction are beyond the scope of this talk)
Compute similarity between corresponding columns in the two fragments using the Kullback-Leibler distance

Need to consider gaps while matching up columns, which is a challenging task
Also need to screen for random “good” distances; this is taken care of by using a random model in the distance computation
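A minimal sketch of the column comparison, under several assumptions that are not in the talk: each column is turned into an amino acid frequency distribution with add-one pseudocounts, gaps are simply skipped, the distance is the symmetrised Kullback-Leibler divergence, and the “random model” is taken to be a uniform background. The function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumed random model: every amino acid equally likely.
UNIFORM = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}


def column_distribution(column, pseudocount=1.0):
    """Probability distribution over amino acids for one alignment column.

    `column` is a string with one character per sequence; gap characters and
    anything outside the 20 standard residues are ignored (an assumption).
    """
    counts = Counter(ch for ch in column.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


def kl_divergence(p, q):
    """Relative entropy D(p || q) in nats; pseudocounts keep q[aa] > 0."""
    return sum(p[aa] * math.log(p[aa] / q[aa]) for aa in AMINO_ACIDS)


def symmetric_kl(p, q):
    """Symmetrised version, so the result does not depend on argument order."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))
```

One way to use the background is to compare symmetric_kl(p, q) for a column pair against kl_divergence(p, UNIFORM) and kl_divergence(q, UNIFORM); a pair of near-random columns can look close to each other without carrying any signal.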
Model Comparison based Merge Test

Using column-wise comparison distance scores, compute a “distance vector”

Symbolic representation for “good”, “bad” and “don’t care” distances (details abstracted)
Do a standard statistical test, the runs test, to check how random the distance vector is

Nice pattern:

y|y|y|n|n|y|y|y|n|n|y|y|y
Random pattern:

y|n|y|y|n|n|n|y|n|y|n|n|y|y
Model Comparison based Merge Test
4) Do Runs test
Model Comparison based Merge Test

Compute the mean, standard deviation and subsequently the z-score

Threshold to separate “good” and “bad” merges
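The z-score computation can be sketched with the standard Wald-Wolfowitz runs test formulas. This sketch assumes the distance vector has already been reduced to exactly two symbols (for example, with “don’t care” entries dropped beforehand); that reduction and the function names are assumptions, not details from the talk.

```python
import math


def count_runs(symbols):
    """Number of maximal blocks of identical consecutive symbols."""
    runs = 0
    previous = None
    for s in symbols:
        if s != previous:
            runs += 1
            previous = s
    return runs


def runs_test_z(symbols):
    """z-score of the observed number of runs under the randomness hypothesis."""
    symbols = list(symbols)
    kinds = sorted(set(symbols))
    if len(kinds) != 2:
        raise ValueError("runs test needs exactly two distinct symbols")
    n1 = symbols.count(kinds[0])
    n2 = symbols.count(kinds[1])
    n = n1 + n2
    mean = 2.0 * n1 * n2 / n + 1.0
    variance = 2.0 * n1 * n2 * (2.0 * n1 * n2 - n) / (n * n * (n - 1))
    return (count_runs(symbols) - mean) / math.sqrt(variance)


def parse_pattern(text):
    """Turn a slide-style pattern such as 'y|y|n' into a list of symbols."""
    return [token.strip() for token in text.split("|")]
```

A negative z-score means fewer runs than expected by chance, i.e. like symbols are clumped together; a score near zero is what a random ordering would give. Where to place the cut-off between the two is exactly the threshold problem the slides describe.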
Drawbacks again…


Threshold will be sample specific; hard to have one threshold for the entire dataset (illustrated in the test results)
Failure rate is high if the sample sizes are unequal
Phylogenetic Tree based Merge Test
Merge’bility Test – Techniques …

Phylogenetic tree based method:

Evolutionary distance based method

Drawback: too strict; many false negatives possible; also hard to pin down a threshold
Evolutionary Least Common Ancestor (LCA) based method

Improved performance on both of the previously mentioned issues
Phylogenetic Tree Evolutionary Distance based Merge Test
Phylogenetic Tree Distance based method


Clustalw (or other tree generation tools) provides an NJ tree of an MSA profile
Sequence-length-normalized distance from the root for each sequence

0 < distance < 1
Define some threshold for the distance that separates intra-cluster from inter-cluster distances
Phylogenetic Tree Distance based method

Distance between sequences from…

Two clusters will be closer to:

‘1’ if the two clusters are not merge’ble: call these “bad distances”
‘0’ if the two clusters are actually part of the same super cluster
The same cluster will obviously be closer to ‘0’: these constitute “good distances”; don’t care in our case
Count the number of “bad distances”

Gives a good idea of how good a merge is
Phylogenetic Tree Distance based method

Good enough? Not yet: the “bad distance” count needs to be normalized.

Why?
The number of edges between vertices of the same/different clusters is proportional to the size of the clusters!
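A small sketch of why normalization matters. The talk does not say which normalizer was used, so dividing by the number of cross-cluster pairs, |C1| x |C2|, is an assumption made purely for illustration, as are the function name, the 0.5 threshold and the dictionary layout for the distances.

```python
from itertools import product


def normalized_bad_distance(dist, cluster_a, cluster_b, threshold=0.5):
    """Fraction of cross-cluster sequence pairs whose distance is 'bad'.

    `dist` maps frozenset({seq1, seq2}) to a distance in (0, 1).
    A distance above `threshold` counts as bad (hypothetical cut-off).
    Dividing by the number of pairs makes results comparable across
    cluster sizes, since the raw count grows with |A| * |B|.
    """
    pairs = list(product(cluster_a, cluster_b))
    bad = sum(1 for a, b in pairs if dist[frozenset((a, b))] > threshold)
    return bad / len(pairs)
```

With this choice the score always lies between 0 and 1, so a 10 x 10 comparison and a 3 x 3 comparison can share one cut-off, whereas raw counts of up to 100 and up to 9 cannot.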
Phylogenetic Tree Distance based method

Once the number of “bad distances” was normalized, this method produced decent results

Normalizing factor? Contentious. What is a good normalizer?
Method is too strict for unequally sized clusters. Most merges are rejected, leading to an appreciable number of false negatives

Due to the inherent nature of MSA programs and unequally sized profiles (cluster sizes)
Phylogenetic Tree LCA Coverage based Merge Test
Phy.Tree LCA coverage based method


Clustalw, Phylip (or other tree generation tools) provide a rooted phylogenetic tree for an MSA profile
Looking at the tree, one can easily make out whether a pair of clusters should be merged or not

How?
Parse the tree into a standard tree data structure and look for the common ancestor of the sequences of each cluster
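The parsing step can be sketched as follows, assuming the tree file is in Newick format (the parenthesised text format that tree tools commonly write). The parser is a hypothetical minimal one: it keeps only the topology, returning each leaf as its name and each internal node as a tuple of children, and it discards branch lengths and internal labels.

```python
def parse_newick(text):
    """Parse a Newick string into nested tuples of leaf names."""
    text = "".join(text.split())  # drop line breaks and spaces
    pos = 0

    def read_label():
        # Read "name:length" up to the next structural character and
        # return only the name part.
        nonlocal pos
        start = pos
        while pos < len(text) and text[pos] not in ",();":
            pos += 1
        return text[start:pos].split(":", 1)[0]

    def parse_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # consume '('
            children = [parse_node()]
            while text[pos] == ",":
                pos += 1  # consume ','
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # consume ')'
            read_label()  # internal label and branch length are ignored
            return tuple(children)
        return read_label()

    return parse_node()
```

For example, parse_newick("((A:0.1,B:0.2):0.05,(C:0.3,D:0.4):0.1);") gives (("A", "B"), ("C", "D")), a structure in which the common ancestor of a set of sequences is simply the smallest nested tuple that contains them all.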
Example…
Phy.Tree LCA coverage based method

Good Merge

Sequences of the two clusters (shaded blue and red) are from the same super cluster
Phy.Tree LCA coverage based method

Bad Merge

Sequences of the two clusters (shaded blue and red) are from different super clusters
Phy.Tree LCA coverage based method


Same LCA for both clusters? Good merge!
If not … Bad merge?

Not quite. The LCAs may be different and yet cover sequences from either cluster to a considerable extent
Better to use the coverage of the LCAs instead
Example…
Phy.Tree LCA coverage based method

Why LCA Coverage?

Second cluster has three sequences, but its LCA covers four more sequences from the other cluster
Phy.Tree LCA coverage based method

Coverage test:

For clusters Ci and Ck, choose the smaller cluster, say Ci, i.e. | Ci | < | Ck |
Define Cov (LCA[Ci]) as the number of sequences that the LCA of Ci covers.
The merge is treated as good if Cov (LCA[Ci]) > | Ci |, where | Ci | < | Ck |

i.e. { Cov (LCA[Ci]) / | Ci | } > 1
Or { Cross Coverage (LCA[Ci]) } > 0
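The coverage test above can be sketched on a toy tree. The tree representation (nested tuples with sequence names as leaves) and the function names are assumptions for illustration. Cross coverage is read here as the number of sequences from the larger cluster that sit under the LCA of the smaller one.

```python
def leaves(tree):
    """Set of leaf names under a node (a leaf is a plain string)."""
    if isinstance(tree, str):
        return {tree}
    names = set()
    for child in tree:
        names |= leaves(child)
    return names


def lca_leaves(tree, targets):
    """Leaves under the deepest node that still contains every target."""
    children = () if isinstance(tree, str) else tree
    for child in children:
        if targets <= leaves(child):
            return lca_leaves(child, targets)
    return leaves(tree)


def coverage_test(tree, cluster_i, cluster_k):
    """Return (coverage ratio, cross coverage, merge looks good?)."""
    small, large = sorted((set(cluster_i), set(cluster_k)), key=len)
    covered = lca_leaves(tree, small)
    ratio = len(covered) / len(small)   # Cov(LCA[Ci]) / |Ci|
    cross = len(covered & large)        # sequences of Ck under LCA[Ci]
    return ratio, cross, cross > 0
```

When the tree contains only sequences from the two clusters, a ratio above 1 and a cross coverage above 0 are the same condition, which matches the two equivalent forms on the slide. The sketch also reproduces the stated weakness: with two sequences per fragment, almost any tree puts a foreign sequence under the LCA.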
Phy.Tree LCA coverage based method

Advantages:

Sample size difference does not play a big role
Demarcating between “good” and “bad” merges is much simpler and more straightforward
Shown to work really well on a variety of data sizes and difficulty levels (see test results)
Possible weakness:

Bound to fail for extremely small fragments (say 2 sequences each): hard not to have a common LCA!
Test Results – 4 datasets (from COG database)
Test Results – Data set 1
DATA: COG {0001, 0005} (Real Size: 35, 30)
Fragment cluster sizes, expected outcome (Good / Bad), and observed outcome per merge’bility test method

n (F1)  n (F2)  Expected  Model Comparison (0.0001)  Phy.tree Distance  Phy.tree LCA coverage
10      10      Good      Good                       Good               Good
10      10      Bad       Bad                        Bad                Bad
10      5       Good      Good                       Good               Good
10      5       Bad       Bad                        Bad                Bad
10      3       Good      Good                       Good               Good
10      3       Bad       Bad                        Bad                Bad
4       2       Good      Good                       Good               Good
4       2       Bad       Bad                        Bad                Bad
3       3       Good      Good                       Good               Good
3       3       Bad       Bad                        Bad                Bad
Test Results – Data set 2
DATA: COG {0142, 0183} (Real Size: 74, 116)
Fragment cluster sizes, expected outcome (Good / Bad), and observed outcome per merge’bility test method

n (F1)  n (F2)  Expected  Model Comparison (0.001)  Phy.tree Distance  Phy.tree LCA coverage
10      10      Good      Good                      Good               Good
10      10      Bad       Bad                       Bad                Bad
10      5       Good      Good                      Bad                Good
10      5       Bad       Bad                       Bad                Bad
10      3       Good      Good                      Bad                Good
10      3       Bad       Good                      Bad                Bad
4       2       Good      Good                      Bad                Good
4       2       Bad       Bad                       Bad                Bad
3       3       Good      Good                      Bad                Bad
3       3       Bad       Bad                       Bad                Bad
Test Results – Data set 3
DATA: COG {0380, 0383} (Real Size: 15, 13)
Fragment cluster sizes, expected outcome (Good / Bad), and observed outcome per merge’bility test method

n (F1)  n (F2)  Expected  Model Comparison (0.001 / 0.0005)  Phy.tree Distance  Phy.tree LCA coverage
10      10      Good      Good / Bad                         Good               Good
10      10      Bad       Good / Bad                         Bad                Bad
10      5       Good      Good / Bad                         Bad                Good
10      5       Bad       Good / Bad                         Bad                Bad
10      3       Good      Good / Bad                         Bad                Good
10      3       Bad       Good / Bad                         Bad                Bad
4       2       Good      Good / Good                        Good               Bad
4       2       Bad       Bad / Good                         Bad                Bad
3       3       Good      Good / Bad                         Good               Good
3       3       Bad       Bad / Bad                          Bad                Good
Test Results – Data set 4
DATA: COG {0160, 0161} (Real Size: 79, 49)
Fragment cluster sizes, expected outcome (Good / Bad), and observed outcome per merge’bility test method

n (F1)  n (F2)  Expected  Model Comparison (0.0001)  Phy.tree Distance  Phy.tree LCA coverage
10      10      Good      Bad                        Good               Good
10      10      Bad       Good                       Good               Bad
10      5       Good      Bad                        Good               Good
10      5       Bad       Good                       Good               Bad
10      3       Good      Good                       Good               Good
10      3       Bad       Good                       Bad                Bad
4       2       Good      Good                       Good               Good
4       2       Bad       Good                       Good               Bad
3       3       Good      Good                       Good               Good
3       3       Bad       Good                       Good               Bad
2       2       Good      Good                       Good               Good
2       2       Bad       Good                       Good               Good
Acknowledgements!

A big thank you to:







Prof. Sun Kim, advisor
My parents, brother, grandparents!
All my colleagues and friends: JH, Zhiping, Scott Martin, SR, Raj, Anshul, Pat Hayes and everyone else!
Folks at CS & Informatics: CS Systems staff, Lucy, Linda, Wendy, Cheryl, Errissa, Bob!
Profs. Marty Siegel and Gary Wiggins – GPC.
RATS folks!
Did I forget someone?! Sorry if I did…