Progressive MSA - Academia Sinica


Multiple Sequence Alignment Based
on Compact Set
Department of Computer Science
National Tsing Hua University
Chuan Yi Tang
Multiple Sequence Alignment


Given a set of sequences, the MSA
problem is to find an alignment of the
sequences such that some objective
function is minimized,
e.g. the Sum-of-Pairs (SP) score.
S1:ATTCG
S2:AGTCG
S3:ATCAG
MSA
S'1:A T - T C - G
S'2:A - G T C - G
S'3:A T - - C A G
Pairwise costs: (S'1,S'2) = 2, (S'2,S'3) = 4, (S'1,S'3) = 2
Cost = 2 + 4 + 2 = 8
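The cost of 8 in the example above can be reproduced with a short sketch. It assumes a unit cost for every column in which two rows differ (including residue against gap) and zero otherwise; the slide does not state the cost scheme, so this is only the simplest scheme consistent with the numbers shown. The function names are illustrative.

```python
from itertools import combinations

def pair_cost(row_a, row_b):
    """Count the columns in which two aligned rows differ (assumed unit cost)."""
    return sum(1 for x, y in zip(row_a, row_b) if x != y)

def sp_cost(alignment):
    """Sum-of-pairs cost: add the pairwise cost over every pair of rows."""
    return sum(pair_cost(a, b) for a, b in combinations(alignment, 2))

# The three aligned rows S'1, S'2, S'3 from the slide, with '-' as the gap.
alignment = ["AT-TC-G", "A-GTC-G", "AT--CAG"]
pairwise = [pair_cost(a, b) for a, b in combinations(alignment, 2)]
total = sp_cost(alignment)
```

Here `pairwise` lists the costs in the order (S'1,S'2), (S'1,S'3), (S'2,S'3).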
MSA with SP-Score: Exact Algorithm and Heuristics

k: number of sequences; n: length of the sequences
Exact (using Dynamic Programming)

O((2n)^k): D. Sankoff, Simultaneous solution of the RNA folding,
alignment and protosequence problems, SIAM J. Appl.
Math. (1985)
Heuristics

D.F. Feng, R.F. Doolittle, Progressive sequence alignment as a
prerequisite to correct phylogenetic trees, J. Mol. Evol. 25,
351-360 (1987)
S.F. Altschul, D.J. Lipman, Trees, stars, and multiple biological
sequence alignment, SIAM J. Appl. Math. (1989)
D.J. Lipman, S.F. Altschul, A tool for multiple sequence
alignment, Proc. Nat. Acad. Sci. U.S.A. (1989)
S.C. Chan, A.K.C. Wong, D.K.Y. Chiu, A survey of multiple
sequence comparison methods, Bull. Math. Biol. (1992)
MSA with SP-Score: Complexity

J Comput Biol 1994 Winter;1(4):337-48
On the complexity of multiple sequence alignment.
Wang L., Jiang T.
McMaster University, Hamilton, Ontario, Canada.
We study the computational complexity of two popular problems in multiple
sequence alignment:
1. multiple alignment with SP-Score => NP-complete (non-metric)
2. multiple tree alignment => MAX SNP-hard

Theoretical Computer Science 259 (2001) 63-79
The complexity of multiple sequence alignment with SP-score that is a metric
Paola Bonizzoni, Gianluca Della Vedova
1. multiple alignment with SP-Score => NP-complete (metric)
MSA with SP-Score: Approximation

Approximation Algorithm:

Performance ratio of 2-2/k: D. Gusfield, Efficient methods for
multiple sequence alignment with guaranteed error bounds, Bull.
Math. Biol. (1993)
Performance ratio of 2-3/k: P. Pevzner, Multiple
alignment, communication cost, and graph matching, SIAM J. Appl.
Math. (1992)
Performance ratio of 2-l/k (assembling l-way alignments, l < k):
V. Bafna, E.L. Lawler and P. Pevzner, Approximation algorithms for
multiple sequence alignment, Theor. Comput. Sci. (1997)
Polynomial Time Approximation Scheme (PTAS):

MSA within a constant band, allowing only a constant number of
insertion and deletion gaps of arbitrary length per sequence on
average: M. Li, B. Ma and L. Wang, Near optimal alignment within
a band in polynomial time, STOC 2000.
Compact Set Definition



Let S be a set of n objects {S1, S2, S3, ..., Sn}, and let
D(Si,Sj) denote the distance between Si and Sj in the
distance matrix D.
Consider any subset C of S. If every distance between an
element in C and an element not in C is larger than the
longest distance between two elements in C, then C is
called a compact set.
Property :


The entire set S is a compact set.
Each set consisting of a single object is also a compact set.
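The definition and its two properties can be checked directly from a distance matrix. The following is a minimal sketch, not the optimal algorithm of the cited reference; the function name `is_compact` and the 0-based indexing are assumptions, and the matrix is the five-object example that appears later in these slides.

```python
def is_compact(D, subset):
    """Return True if `subset` (0-based indices) is a compact set under D."""
    inside = set(subset)
    outside = [x for x in range(len(D)) if x not in inside]
    # Singletons and the entire set are compact by the stated properties.
    if len(inside) <= 1 or not outside:
        return True
    longest_inside = max(D[a][b] for a in inside for b in inside if a != b)
    shortest_border = min(D[a][x] for a in inside for x in outside)
    return shortest_border > longest_inside

# Symmetric distance matrix for S1..S5 (indices 0..4).
D = [
    [0, 8, 10, 13, 16],
    [8, 0, 9, 14, 15],
    [10, 9, 0, 12, 13],
    [13, 14, 12, 0, 9],
    [16, 15, 13, 9, 0],
]
checks = {
    "S1,S2": is_compact(D, [0, 1]),
    "S1,S2,S3": is_compact(D, [0, 1, 2]),
    "S4,S5": is_compact(D, [3, 4]),
    "S2,S3": is_compact(D, [1, 2]),
}
```

In this matrix {S2,S3} is not compact because D(S1,S2) = 8 is a border edge shorter than the inside edge D(S2,S3) = 9.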
Compact Set Example
[Figure: distance matrix D over S1 to S6 and the corresponding graph, with Compact Set 1, Compact Set 2 and Compact Set 3 marked. For Compact Set 3, the maximal inside edge is 10 and the minimal border edge is 11.]
Compact Set Example (cont'd)

Compact sets are hierarchical: any two compact sets are either disjoint or nested
MSA & Compact Set













Consider an example with 12 protein sequences:
S1 :MAPSAPAKTAKALDAKKKVVKGKRTTHRRQVRTSVHFRRPVTLKTARQARFPRKSAPKTSKMDHFRIIQHPLTTESAMKKIEEHNTLVFIVSN
DANKYQIKDAVHKLYNVQALKVNTLITPLQQKKAYVRLTADYDALDVANKIGVI
S2 :SSIIDYPLVTEKAMDEMDFQNKLQFIVDIDAAKPEIRDVVESEYDVTVVDVNTQITPEAEKKATVKLSAEDDAQDVASRIGVF
S3 :SWDVIKHPHVTEKAMNDMDFQNKLQFAVDDRASKGEVADAVEEQYDVTVEQVNTQNTMDGEKKAVVRLSEDDDAQEVASRIGVF
S4 :MAPKAKKEAPAPPKAEAKAKALKAKKAVLKGVHSHKKKKIRTSPTFRRPKTLRLRRQPKYPRKSAPRRNKLDHYAIIKFPLTTESAMKKIEDNN
TLVFIVDVKANKHQIKQAVKKLYDIDVAKVNTLIRPDGEKKAYVRLAPDYDALDVANKIGII
S5 :MAPSTKATAAKKAVVKGTNGKKALKVRTSASFRLPKTLKLARSPKYATKAVPHYNRLDSYKVIEQPITSETAMKKVEDGNTLVFKVSLKANKY
QIKKAVKELYEVDVLSVNTLVRPNGTKKAYVRLTADFDALDIANRIGYI
S6 : MDAFDVIKTPIVSEKTMKLIEEENRLVFYVERKATKEDIKEAIKQLFNAEVAEVNTNITPKGQKKAYIKLKDEYNAGEVAASLGIY
S7 :MAPAKADPSKKSDPKAQAAKVAKAVKSGSTLKKKSQKIRTKVTFHRPKTLKKDRNPKYPRISAPGRNKLDQYGILKYPLTTESAMKKIEDNNT
LVFIVDIKADKKKIKDAVKKMYDIQTKKVNTLIRPDGTKKAYVRLTPDYDALDVANKIGII
S8 :MAPSTKAASAKKAVVKGSNGSKALKVRTSTTFRLPKTLKLTRAPKYARKAVPHYQRLDNYKVIVAPIASETAMKKVEDGNTLVFQVDIKANKH
QIKQAVKDLYEVDVLAVNTLIRPNGTKKAYVRLTADHDALDIANKIGYI
S9 :MPPKSSTKAEPKASSAKTQVAKAKSAKKAVVKGTSSKTQRRIRTSVTFRRPKTLRLSRKPKYPRTSVPHAPRMDAYRTLVRPLNTESAMKKIE
DNNTLLFIVDLKANKRQIADAVKKLYDVTPLRVNTLIRPDGKKKAFVRLTPEVDALDIANKIGFI
S10 :MAPKAKKEAPAPPKAEAKAKALKAKKAVLKGVHSHKKKKIRTSPTFRRPKTLRLRRQPKYPRKSAPRRNKLDHYAIIKFPLTTESAMKKIEDN
NTLVFIVDVKANKHQIKQAVKKLYDIDVAKVNTLIRPDGEKKAYVRLAPDYDALDVANKIGII
S11 :APSAKATAAKKAVVKGTNGKKALKVRTSATFRLPKTLKLARAPKYASKAVPHYNRLDSYKVIEQPITSETAMKKVEDGNILVFQVSMKANKY
QIKKAVKELYEVDVLKVNTLVRPNGTKKAYVRLTADYDALDIANRIGYI
S12 :MPAKAASAAASKKNSAPKSAVSKKVAKKGAPAAAAKPTKVVKVTKRKAYTRPQFRRPHTYRRPATVKPSSNVSAIKNKWDAFRIIRYPLTTD
KAMKKIEENNTLTFIVDSRANKTEIKKAIRKLYQVKTVKVNTLIRPDGLKKAYIRLSASYDALDTANKMGLV
Original sequences
MSA & Compact Set (cont'd)
D      S1   S2   S3   S4   S5   S6   S7   S8   S9  S10  S11  S12
S1      0  900  906  352  412  867  384  426  410  352  416  509
S2           0  525  950  892  644  938  887  954  950  884 1007
S3                0  936  890  635  929  899  962  936  882 1002
S4                     0  390  905  254  375  295    0  390  438
S5                          0  860  415  188  411  390  132  538
S6                               0  902  864  946  905  857  968
S7                                    0  418  344  254  422  437
S8                                         0  404  375  198  550
S9                                              0  295  424  442
S10                                                  0  390  438
S11                                                       0  539
S12                                                            0
Original distance matrix
Original Compact Set Tree
A good MSA should preserve the compact sets as well
MSA & Compact Set (cont'd)












S1’ :-----------------MAPSAPAKTAKALDAKKKVVKGKRTTHRRQVRTSVHFRRPVTLKTARQARFPRKSAPKTSKMDHFRIIQHPLTTESA…
S2’ :---------------------------------------------------------------------------------SSIIDYPLVTEKAMDEMDFQNKLQFIVDIDAAKPEIRDV…
S3’ :--------------------------------------------------------------------------------SWDVIKHPHVTEKAMNDMDFQNKLQFAVDDRASKGEV…
S4’ :--------MAPKAKKEAPAPPKAEAKAKALKAKKAVLKGVHSHKKKKIRTSPTFRRPKTLRLRRQPKYPRKSAPRRNKLDHYAIIKFPLTTES…
S5’ :----------------------MAPSTKATAAKKAVVKGTNGKKALKVRTSASFRLPKTLKLARSPKYATKAVPHYNRLDSYKVIEQPITSETAMKK…
S6’ :------------------------------------------------------------------------------MDAFDVIKTPIVSEKTMKLIEEENRLVFYVERKATKEDIKEA…
S7’ : ----------MAPAKADPSKKSDPKAQAAKVAKAVKSGSTLKKKSQKIRTKVTFHRPKTLKKDRNPKYPRISAPGRNKLDQYGILKYPLTTE…
S8’ :----------------------MAPSTKAASAKKAVVKGSNGSKALKVRTSTTFRLPKTLKLTRAPKYARKAVPHYQRLDNYKVIVAPIASETAMKK…
S9’ :------MPPKSSTKAEPKASSAKTQVAKAKSAKKAVVKGTSSKTQRRIRTSVTFRRPKTLRLSRKPKYPRTSVPHAPRMDAYRTLVRPLN…
S10’ :--------MAPKAKKEAPAPPKAEAKAKALKAKKAVLKGVHSHKKKKIRTSPTFRRPKTLRLRRQPKYPRKSAPRRNKLDHYAIIKFPLTTE…
S11’ : -----------------------APSAKATAAKKAVVKGTNGKKALKVRTSATFRLPKTLKLARAPKYASKAVPHYNRLDSYKVIEQPITSETAMKK…
S12’ :MPAKAASAAASKKNSAPKSAVSKKVAKKGAPAAAAKPTKVVKVTKRKAYTRPQFRRPHTYRRPATVKPSSNVSAIKNKWDAFRIIRYP…
MSA by MSA1
MSA & Compact Set (cont'd)












S1’ : ------------MAPSAPAKTA-KALDAKKKVVKGK-RTTHR--R--QV--R---TSVHFRRPVTLKTARQARFPRKSAPK-TSKMDHFR-IIQHPL…
S2’ : ---------------------------------------------------------------------------------------S--SIIDYPLVTEKAMDEMDFQNKLQFIVDID- AAK…
S3’ : ---------------------------------------------------------------------------------------SW-DVIKHPHVTEKAMNDMDFQNKLQFAVD-DRA…
S4’ : MAPKA--KKEAPAPPKAEAK-A-KALKAKKAVLKGV-HSHKK--K--KI--R---TSPTFRRPKTLRLRRQPKYPRKSAPR-RNKLDHY-AIIKFP…
S5’ : -----------------MAPST-KATAAKKAVVKGT-NG--K--KALKV--R---TSASFRLPKTLKLARSPKYATKAVPH-YNRLDSYK-VIEQPITSET…
S6’ : -------------------------------------------------------------------------------------MDAF-DVIKTPIVSEKTMKLIEEENRLVFYVER-KATK…
S7’ : MAP-A--KAD-PS-KKSDPK-A-QAAKVAKAVKSG--STLKK--KSQKI--R---TKVTFHRPKTLKKDRNPKYPRISAPG-RNKLDQY-GILKYP…
S8’ : -----------------MAPST-KAASAKKAVVKGS-NG--S--KALKV--R---TSTTFRLPKTLKLTRAPKYARKAVPH-YQRLDNYK-VIVAPIASET…
S9’ : MPPKSSTKAE-PKASSAKTQVA-KAKSAKKAVVKGT-SS--K--TQRRI--R---TSVTFRRPKTLRLSRKPKYPRTSVPH-APRMDAYRTLVR…
S10’ : MAPKA--KKEAPAPPKAEAK-A-KALKAKKAVLKGV-HSHKK--K--KI--R---TSPTFRRPKTLRLRRQPKYPRKSAPR-RNKLDHY-AIIKF…
S11’ : ------------------APSA-KATAAKKAVVKGT-NG--K--KALKV--R---TSATFRLPKTLKLARAPKYASKAVPH-YNRLDSYK-VIEQPITSET…
S12’ : ------MPAKAASAAASKKNSAPKSAVSKKVAKKGAPAAAAKPTKVVKVTKRKAYTRPQFRRPHTYRRPATVK-PSSNVSAIKNKWDAFR…
MSA by MSA2
MSA & Compact Set (cont'd)
D      S1   S2   S3   S4   S5   S6   S7   S8   S9  S10  S11  S12
S1      0  939  949  382  427  895  457  440  449  382  425  615
S2           0  524  980  940  652  973  942 1011  980  937 1028
S3                0  972  929  635  961  933 1001  972  924 1034
S4                     0  410  927  256  395  298    0  404  508
S5                          0  879  476  188  447  410  132  652
S6                               0  932  881  981  927  869  986
S7                                    0  468  361  256  473  557
S8                                         0  444  395  198  671
S9                                              0  298  449  565
S10                                                  0  404  508
S11                                                       0  645
S12                                                            0
Distance Matrix by MSA1
Compact Set Tree by MSA1
MSA & Compact Set (cont'd)
D      S1   S2   S3   S4   S5   S6   S7   S8   S9  S10  S11  S12
S1      0  867  883  381  422  823  455  440  440  381  419  459
S2           0  444  969  919  573  971  920  985  969  918  959
S3                0  974  919  563  973  924  987  974  915  972
S4                     0  412  852  246  402  286    0  405  418
S5                          0  808  483  173  436  422  117  498
S6                               0  942  870  966  927  858  919
S7                                    0  455  330  246  453  429
S8                                         0  433  412  183  520
S9                                              0  285  433  454
S10                                                  0  405  418
S11                                                       0  493
S12                                                            0
Distance Matrix by MSA2
Compact Set Tree by MSA2
Measure of Compact Set Preservation

How can we quantify compact set preservation?
N1: # of the original Compact Set relations
N2: # of the relations preserved after MSA
Estimate by Compact Set Preservation = N2 / N1
Measure of Compact Set Preservation (cont'd)
D   S1  S2  S3  S4  S5
S1   0   8  10  13  16
S2       0   9  14  15
S3           0  12  13
S4               0   9
S5                   0
Distance Matrix
Compact Set Tree
Original Compact Set relations:
124, 125, 134, 135, 234, 235, 123, 451, 452, 453
N1 = 10
Measure of Compact Set Preservation (cont'd)
D   S1  S2  S3  S4  S5
S1   0  15  26  18  23
S2       0  24  19  25
S3           0  21  16
S4               0  20
S5                   0
Distance Matrix After MSA
Compact Set Tree after MSA
The relations preserved after MSA
Original   After MSA   Preserved
124        124         yes
125        125         yes
134        143         no
135        351         no
234        243         no
235        352         no
123        123         yes
451        145         no
452        245         no
453        354         no
N2 = 10 - 7 = 3 =>
Estimate by Compact Set Preservation = 3/10
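The worked example can be reproduced by brute force. The sketch below reads a relation "abc" as "a and b lie together in some compact set that excludes c", which is an interpretation inferred from the listed relations rather than a definition given on the slides. It enumerates all subsets, so it is only suitable for small n; all function names are illustrative.

```python
from itertools import combinations

def symmetric(upper_rows):
    """Build a full symmetric matrix from upper-triangular rows."""
    n = len(upper_rows)
    D = [[0] * n for _ in range(n)]
    for i, row in enumerate(upper_rows):
        for offset, value in enumerate(row):
            D[i][i + offset] = D[i + offset][i] = value
    return D

def compact_sets(D):
    """All compact sets with at least 2 and fewer than n elements."""
    n = len(D)
    found = []
    for size in range(2, n):
        for C in combinations(range(n), size):
            inside = max(D[a][b] for a, b in combinations(C, 2))
            border = min(D[a][x] for a in C for x in range(n) if x not in C)
            if border > inside:
                found.append(C)
    return found

def relations(D):
    """Triples (a, b, c): a < b share a compact set that excludes c."""
    rel = set()
    for C in compact_sets(D):
        for a, b in combinations(C, 2):
            rel.update((a, b, c) for c in range(len(D)) if c not in C)
    return rel

before = symmetric([[0, 8, 10, 13, 16], [0, 9, 14, 15], [0, 12, 13], [0, 9], [0]])
after = symmetric([[0, 15, 26, 18, 23], [0, 24, 19, 25], [0, 21, 16], [0, 20], [0]])
original, aligned = relations(before), relations(after)
n1 = len(original)            # number of original relations
n2 = len(original & aligned)  # relations still present after MSA
```

With 0-based indices, the three preserved relations are (0,1,2), (0,1,3) and (0,1,4), i.e. 123, 124 and 125 in the slide's notation.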
Why Pair Wise Compact Set?



The evolutionary tree is the real judge.
An evolutionary tree tends to minimize the total length of
its evolutionary edges (the tree size) with respect to the
pairwise distances, so it appears to be compact.
This holds in experiments.
Compact Set Relation Preserved Rate for Evolutionary Tree
# of relations preserved in Evolutionary Tree / # of Compact
Set relations of Pair Wise Distance
Tree_Method        Protein 12
Neighbor Joining   143/208
UPGMA              190/208
Minimum            208/208
Larger is better
Compact Set Evaluation Algorithm





Step 1: Construct the original Compact Set Tree T and the Compact Set
Tree after MSA T' [1].
Step 2: Traverse T' in preorder to generate the Compact Set relations
after MSA, R', and mark the corresponding entries in the hash table H'.
Step 3: Traverse T in preorder to generate the original Compact Set
relations R, and check for each relation in R whether its entry is
marked in H'.
Total Time Complexity = O(n^3), where n is the number of sequences
Reference:

1. E. Dekel, J. Hu and W. Ouyang, An optimal algorithm for finding
compact sets, Inform. Process. Lett. 44 (1992) 285-289
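Steps 2 and 3 can be sketched as a tree traversal that fills a hash set. This is an illustrative sketch under several assumptions: trees are written as nested tuples with integer leaves, a relation (a, b, c) means a and b share a subtree that excludes c, and the two trees below were derived by hand from the five-object example matrices in these slides. Constructing the trees themselves (Step 1) is left to the cited algorithm [1].

```python
from itertools import combinations

def leaves(node):
    """Set of leaf labels below a node (a leaf is a plain int)."""
    if isinstance(node, int):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))

def tree_relations(tree):
    """Preorder traversal collecting relations (a, b, c) into a hash set."""
    universe = leaves(tree)
    found = set()

    def visit(node):
        if isinstance(node, int):
            return
        outside = universe - leaves(node)
        child_sets = [leaves(child) for child in node]
        # Pairs whose lowest common ancestor is this node.
        for left, right in combinations(child_sets, 2):
            for a in left:
                for b in right:
                    for c in outside:
                        found.add((min(a, b), max(a, b), c))
        for child in node:
            visit(child)

    visit(tree)
    return found

T = (((1, 2), 3), (4, 5))        # compact set tree before MSA (hand-derived)
T_msa = (((1, 2), 4), (3, 5))    # compact set tree after MSA (hand-derived)
H_msa = tree_relations(T_msa)    # Step 2: marked entries
R = tree_relations(T)            # Step 3: original relations
preserved = sum(1 for r in R if r in H_msa)
```

Recomputing `leaves` at every node keeps the sketch short; caching the leaf sets would be needed to stay within the stated O(n^3) bound.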
Our Strategy for MSA


Progressive alignment (Feng and Doolittle, 1987)
with neighbor first (using the Kruskal merging order of the
Minimal Spanning Tree (MST))
Set-to-set alignment. Once a gap, always a gap.
[Figure: Kruskal merging order tree over S1, S2, S3, S4 with merge steps 1, 2, 3]
set1
S1:AACAGACTTA-
S2:-ACAGACTTAA
set2
S3:----ACAGACTCCA
S4:TTTAAAAGTC----
set1 aligned with set2
S1:---AACAGACTT-A-
S2:----ACAGACTT-AA
S3:----ACAGACTCCA-
S4:TTTAAAAGTC-----
Q: Why do we use MST Kruskal Order?
A1: It has a structure similar to the compact set tree
[Figure: MST Order Merge Tree compared with Compact Tree]
A2: The MST Kruskal order is easy to obtain
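The Kruskal merging order can be obtained with a few lines of union-find. The sketch below is one possible way to do it, not the authors' implementation: it breaks ties between equal distances by index order (an assumption), uses 0-based indices, and takes as input the five-object distance matrix from the preservation example earlier in these slides.

```python
def kruskal_merge_order(D):
    """Return the list of (set_a, set_b) merges in Kruskal order."""
    n = len(D)
    edges = sorted((D[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    parent = list(range(n))
    members = {i: [i] for i in range(n)}

    def find(x):
        # Path-halving find for the union-find structure.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    merges = []
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri == rj:
            continue  # edge would close a cycle, skip it
        merges.append((tuple(sorted(members[ri])), tuple(sorted(members[rj]))))
        parent[rj] = ri
        members[ri] = members[ri] + members.pop(rj)
    return merges

D = [
    [0, 8, 10, 13, 16],
    [8, 0, 9, 14, 15],
    [10, 9, 0, 12, 13],
    [13, 14, 12, 0, 9],
    [16, 15, 13, 9, 0],
]
order = kruskal_merge_order(D)
```

Each merge in `order` corresponds to one set-to-set alignment in the progressive strategy. For this matrix the merged groups coincide with its compact sets {S1,S2}, {S1,S2,S3} and {S4,S5}.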
Score function
[Figure: example alignment annotated with the cases handled by the score function: begin-gap, match, mismatch, gap-open, gap-extended, end-gap]
---AACAGACTT-A-
----ACAGAC---AA
----ACAGACTCCA-
TTTAAAAGTC-C---
Strategy of set-to-set alignment
Score(8, 8) = Max{ Score(7, 7) + (α8:β8),
                   Score(7, 8) + (α8:G3),
                   Score(8, 7) + (G2:β8) }
*(α8:β8) = (G,C)+(G,-)+(G,G)+(-,C)+(-,-)+(-,G)
= (-10)+(-15)+(10)+(-15)+(0)+(-15) = -45
Column   1  2  3  4  5  6  7  8
α1       A  A  T  T  A  T  C  G
α2       A  A  T  T  A  T  C  -
β1       A  A  T  T  A  T  C  C
β2       T  A  T  T  A  T  C  -
β3       T  A  T  T  A  T  C  G
G2 = (-,-): a gap column for set α; G3 = (-,-,-): a gap column for set β
Time Complexity of set α to set β alignment = O(sα*sβ*lα*lβ), here 2*3*8*8,
where sα, sβ are the numbers of sequences in set α and set β respectively,
and lα, lβ are the lengths of the resulting sequences in set α and set β respectively.
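The recurrence can be sketched as a standard dynamic program over alignment columns. The pair scores below (match 10, mismatch -10, residue against gap -15, gap against gap 0) are read off the worked sum for (α8:β8); treating every gap with the same linear penalty, including begin and end gaps, is a simplifying assumption, since the slides name gap-open and gap-extended cases without giving their values.

```python
MATCH, MISMATCH, GAP = 10, -10, -15

def pair_score(x, y):
    """Score one residue pair; '-' is the gap symbol."""
    if x == "-" and y == "-":
        return 0
    if x == "-" or y == "-":
        return GAP
    return MATCH if x == y else MISMATCH

def column_score(col_a, col_b):
    """Sum of pair scores between a column of set alpha and one of set beta."""
    return sum(pair_score(x, y) for x in col_a for y in col_b)

def set_to_set_score(alpha, beta):
    """Best score of aligning two already-aligned sets, column against column."""
    cols_a, cols_b = list(zip(*alpha)), list(zip(*beta))
    gap_a, gap_b = ("-",) * len(alpha), ("-",) * len(beta)
    la, lb = len(cols_a), len(cols_b)
    score = [[0] * (lb + 1) for _ in range(la + 1)]
    for i in range(1, la + 1):
        score[i][0] = score[i - 1][0] + column_score(cols_a[i - 1], gap_b)
    for j in range(1, lb + 1):
        score[0][j] = score[0][j - 1] + column_score(gap_a, cols_b[j - 1])
    for i in range(1, la + 1):
        for j in range(1, lb + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + column_score(cols_a[i - 1], cols_b[j - 1]),
                score[i - 1][j] + column_score(cols_a[i - 1], gap_b),
                score[i][j - 1] + column_score(gap_a, cols_b[j - 1]),
            )
    return score[la][lb]

col8 = column_score(("G", "-"), ("C", "-", "G"))
identical = set_to_set_score(["ACGT"], ["ACGT"])
```

Each cell costs sα*sβ pair lookups and there are lα*lβ cells, which matches the stated time complexity.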
Time Complexity of our strategy

The worst case happens when the binary tree is balanced.

Total set-to-set time complexity is bounded by

\[
l^2\left(1\cdot 1\cdot\frac{n}{2} + 2\cdot 2\cdot\frac{n}{4} + 4\cdot 4\cdot\frac{n}{8} + \cdots + \frac{n}{2}\cdot\frac{n}{2}\cdot 1\right)
= \frac{n}{2}\,l^2\left(1 + 2 + 4 + \cdots + \frac{n}{2}\right)
= \frac{n(n-1)}{2}\,l^2
\]

where l is the length of the resulting sequences and n is the
number of sequences.
The worst case time complexity = O(n^2 l^2)
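The closed form for the balanced merge tree can be checked numerically. This small sketch counts only the sα*sβ factor of each merge (the common l^2 factor is left out) and assumes n is a power of two; the function name is illustrative.

```python
def balanced_pair_work(n):
    """Sum of s_alpha * s_beta over all merges of a balanced tree with n leaves."""
    if n == 1:
        return 0
    half = n // 2
    # One merge of two halves at this level, plus the work inside each half.
    return half * half + 2 * balanced_pair_work(half)

results = {n: balanced_pair_work(n) for n in (2, 4, 8, 16, 64)}
```

For each tested n the value equals n(n-1)/2, in agreement with the bound above.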
MSA Useful tools

GCG (Genetics Computer Group) : PileUp


http://gcg.nhri.org.tw:8003/gcg-bin/seqweb.cgi
Clustalw

http://clustalw.genome.ad.jp/
Clustal W

Pairwise alignment
  Calculate distance matrix
Construct the unrooted Neighbor Joining (NJ) tree
Construct the rooted NJ tree
  rooted at "mid-point"
Progressive alignment
  Align following the rooted NJ tree
  set-to-set alignment
Experiment
Test data
  Amino acid sequences: 12*(80~160) residues
  Fruit fly DNA sequences: 28*(800~900) nucleotides
  Mitochondrion sequences: 136*(660~690) nucleotides
Multiple sequence alignment
  GCG, Clustalw, Our_MSA
Measurement
  SP Score, Compact Set, Three-Point Relation
SP Score Result
The Clustalw results and ours are better than GCG's
MSA_Method   Protein 12   DNA28    Human136
GCG          6410         293113   56817190
Clustalw     9532         458182   56830150
Our_MSA      9868         454397   57132720
Larger is better
Compact Set Relation Failure rate Result
# of relations not preserved / # of source compact set relations
MSA_Method   Protein 12   DNA28     Human136
GCG          19/208       120/230   803/9798
Clustalw     0/208        120/230   668/9798
Our_MSA      0/208        0/230     1/9798
Smaller is better
Three-point Relative Scale Preserved Rate
For every three species A, B, C, we check whether their relative distance
relation is identical in the original distance matrix and in the MSA
distance matrix.
MSA_Method   Data set    Preserved (%)     UnPre. (%)
GCG          Protein12   159 (72.27)       61 (27.73)
GCG          DNA28       1095 (33.42)      2181 (66.58)
GCG          Human136    355438 (86.68)    54602 (13.32)
Clustalw     Protein12   187 (85)          33 (15)
Clustalw     DNA28       2007 (61.26)      1269 (38.74)
Clustalw     Human136    336173 (81.99)    73867 (18.01)
Our_MSA      Protein12   196 (89.09)       24 (10.91)
Our_MSA      DNA28       2645 (80.74)      631 (19.26)
Our_MSA      Human136    384896 (93.87)    25144 (6.13)
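One possible reading of the three-point measure is sketched below. The slides do not define "relative distance relation" precisely, so this sketch assumes it means the ordering of the three pairwise distances within each triple; the function names and the tiny three-object matrices are made up for illustration and are not data from the experiments.

```python
from itertools import combinations

def sign(x):
    return (x > 0) - (x < 0)

def triple_signature(D, a, b, c):
    """Ordering of the three pairwise distances of a triple (assumed meaning)."""
    ab, ac, bc = D[a][b], D[a][c], D[b][c]
    return (sign(ab - ac), sign(ab - bc), sign(ac - bc))

def three_point(D_orig, D_msa):
    """Return (preserved, unpreserved) counts over all triples."""
    triples = list(combinations(range(len(D_orig)), 3))
    preserved = sum(
        1 for t in triples
        if triple_signature(D_orig, *t) == triple_signature(D_msa, *t)
    )
    return preserved, len(triples) - preserved

# Hypothetical matrices: the second one reverses the distance ordering.
D1 = [[0, 1, 2], [1, 0, 3], [2, 3, 0]]
D2 = [[0, 3, 2], [3, 0, 1], [2, 1, 0]]
same = three_point(D1, D1)
flipped = three_point(D1, D2)
```

The total number of triples is C(n,3), which agrees with the row sums in the table above (220 for 12 sequences, 3276 for 28, 410040 for 136).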
I Believe Tree Only


One might still not believe that the original pairwise
distance is a good judge
One believes only the true evolutionary tree
Compact Set Relation Failure Rate
Take Protein 12 for example
# of relations not preserved / # of source Compact Set relations
                          Distance
MSA_Method   Original   Neighbor Joining   UPGMA    Minimum
GCG          19/208     1/144              49/220   31/220
Clustalw     0/208      1/144              30/220   12/220
Our_MSA      0/208      1/144              30/220   12/220
Smaller is better
Future Work

Are our measurement and algorithms really good?
  Simulations and Web service

Does our MSA by set-to-set alignment satisfy
some approximation property?
  Theoretical proof

How can we reduce the time?
  Hardwired Dynamic Programming
  e.g. PARACEL http://www.paracel.com/