Multi-alignment of Genomes


Comparison of large sequences
First part:
Alignment of large sequences
Dynamic programming
[Figure: dynamic-programming matrix, with accaccacaccacaacgagcata … acctgagcgatat along the top and a c c … t down the side]
acc.................................agt
| | |.................................|xx
acc.................................a--
• Quadratic cost in space and time.
• Short sequences (up to 10.000 bps) can be aligned using dynamic programming.
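The quadratic cost comes from filling one matrix cell per pair of positions. A minimal sketch of a global alignment score, assuming a simple scoring scheme (match +1, mismatch -1, gap -1) that the slides do not specify:

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of a and b (scoring values are assumptions)."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap          # b[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diagonal,
                              score[i - 1][j] + gap,   # gap in b
                              score[i][j - 1] + gap)   # gap in a
    return score[-1][-1]


print(global_alignment_score("accagt", "acca"))
```

Both the table and the double loop have len(a) x len(b) entries, which is where the quadratic space and time come from.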
What about genomes?
Genomic sequences
• Genomic sequences have millions of base pairs.
• The sequences are 1000 times longer.
• The running time is 1.000.000 times higher!
(1 second becomes 11 days)
(1 minute becomes 2 years)
In which cases can dynamic programming be applied?
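The figures above follow directly from the quadratic cost. A small sketch of the arithmetic (the function names are only illustrative):

```python
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY


def quadratic_slowdown(length_factor):
    """Factor by which a quadratic algorithm slows down when inputs grow."""
    return length_factor ** 2


def scaled_runtime_seconds(base_seconds, length_factor):
    """Running time after the sequences become length_factor times longer."""
    return base_seconds * quadratic_slowdown(length_factor)


# Sequences 1000 times longer: 1 second and 1 minute of original running time.
print(scaled_runtime_seconds(1, 1000) / SECONDS_PER_DAY)    # about 11.6 days
print(scaled_runtime_seconds(60, 1000) / SECONDS_PER_YEAR)  # about 1.9 years
```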
First assumption
Genome A ………………………….………………...…………...….
Genome B ……………………………………………………………….
Genome B
……………………………………
Realistic assumption?
[Figure: Genome A and Genome B] Unrealistic assumption!
[Figure: Genome A and Genome B] More realistic assumption
Preview in a real case
Chlamydia muridarum: 1.084.689 bps
Chlamydia trachomatis: 1.057.413 bps
Preview in a real case
Pyrococcus abyssi: 1.790.334 bps
Pyrococcus horikoshii: 1.763.341 bps
Methodology of an alignment
1st: Make a preview.
2nd: Identify the portions that can be aligned.
3rd: Make the alignment.
Preview-Revisited
… a a t g….c t g...
… c g t g….c c c ...
MUM: Maximal Unique Matching
Connect to MALGEN
Methodology of an alignment
1st: Make a preview.
How can MUMs be found?
2nd: Identify the portions that can be aligned.
How can these portions be determined?
3rd: Make the alignment.
With CLUSTALW, TCOFFEE, …
Bioinformatics PhD. Course
Second part:
Introducing Suffix trees
Suffix trees
Given string ababaas:
Suffixes:
1: ababaas
2: babaas
3: abaas
4: baas
5: aas
6: as
7: s
Suffix tree (edge labels, with the suffix number at each leaf):
root
  a
    as,5
    s,6
    ba
      as,3
      baas,1
  ba
    as,4
    baas,2
  s,7
What kind of queries?
Applications of Suffix trees
1. Exact string matching
• Does the sequence ababaas contain any occurrence of the patterns abab, aab, and ab?
[Figure: suffix tree of ababaas]
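The question can be answered by walking down from the root, one pattern character at a time. A minimal sketch, assuming an uncompressed suffix trie (one character per edge) instead of the compressed tree drawn on the slide; the function names are only illustrative:

```python
def build_suffix_trie(text):
    """Nested dictionaries: every path from the root spells a substring of text."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def contains(trie, pattern):
    """True if pattern can be spelt from the root, i.e. it occurs in the text."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("ababaas")
for pattern in ("abab", "aab", "ab"):
    print(pattern, contains(trie, pattern))
```

Each query costs time proportional to the pattern length, whatever the length of the text. The trie itself takes quadratic space, which the compressed suffix tree avoids.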
Quadratic insertion algorithm
Invariant Properties:
Given the string …………………………......

and the suffix-tree
…...
P1: the leaves of suffixes from  have been inserted
Quadratic insertion algorithm
Given the string ababaabbs
The suffixes are inserted one at a time, from ababaabbs,1 to s,9. Each suffix is matched from the root as far as possible; where it diverges, the edge is split and a new leaf is added.
Resulting suffix tree:
root
  a
    abbs,5
    b
      bs,6
      a
        abbs,3
        baabbs,1
  b
    bs,7
    s,8
    a
      abbs,4
      baabbs,2
  s,9
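The insertion steps can be sketched as follows. This is a minimal version under two assumptions: the text ends with a terminator that occurs nowhere else (here s), and edge labels are stored as plain strings. The class and function names are only illustrative.

```python
class Node:
    def __init__(self):
        self.children = {}   # first character -> (edge label, child Node)
        self.suffix = None   # 1-based suffix number, set only on leaves


def insert_suffix(root, text, start):
    """Insert text[start:] by matching from the root and splitting where it diverges."""
    node, rest = root, text[start:]
    while True:
        first = rest[0]
        if first not in node.children:
            leaf = Node()
            leaf.suffix = start + 1
            node.children[first] = (rest, leaf)
            return
        label, child = node.children[first]
        k = 0   # length of the common prefix of the edge label and the rest
        while k < len(label) and k < len(rest) and label[k] == rest[k]:
            k += 1
        if k == len(label):          # whole edge matched: continue below it
            node, rest = child, rest[k:]
            continue
        middle = Node()              # split the edge after k characters
        middle.children[label[k]] = (label[k:], child)
        leaf = Node()
        leaf.suffix = start + 1
        middle.children[rest[k]] = (rest[k:], leaf)
        node.children[first] = (label[:k], middle)
        return


def build_suffix_tree(text):
    """Quadratic construction: insert every suffix, longest first."""
    root = Node()
    for start in range(len(text)):
        insert_suffix(root, text, start)
    return root


def leaves(node, prefix=""):
    """Yield (path label, suffix number) for every leaf below node."""
    if node.suffix is not None:
        yield prefix, node.suffix
    for label, child in node.children.values():
        yield from leaves(child, prefix + label)


tree = build_suffix_tree("ababaabbs")
for path, number in sorted(leaves(tree), key=lambda item: item[1]):
    print(number, path)
```

Each insertion may scan a whole suffix, so the total cost is quadratic in the length of the text.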
Generalized suffix tree
The suffix tree of many strings …
is called the generalized suffix tree …
and it is the suffix tree of the concatenation
of the strings, each followed by its own terminator symbol.
For instance,
the generalized suffix tree of ababaabb and aabaa …
is the suffix tree of ababaabbαaabaaβ:
Generalized suffix tree
Construction of the suffix tree of ababaabbαaabaaβ:
Given the suffix tree of ababaabbα:
root
  a
    abbα,5
    b
      bα,6
      a
        abbα,3
        baabbα,1
  b
    bα,7
    α,8
    a
      abbα,4
      baabbα,2
  α,9
the suffixes of aabaaβ are inserted one at a time, from aabaaβ,1 to β,6.
Generalized suffix tree of ababaabbαaabaaβ (leaves ending in α belong to the first string, leaves ending in β to the second):
root
  a
    β,5
    a
      β,4
      b
        aaβ,1
        bα,5
    b
      bα,6
      a
        baabbα,1
        a
          β,2
          bbα,3
  b
    bα,7
    α,8
    a
      baabbα,2
      a
        β,3
        bbα,4
  α,9
  β,6
Applications of Suffix trees
2. The substring problem for a database of strings DB
• Does the DB contain any occurrence of the patterns abab, aab, and ab?
[Figure: generalized suffix tree of ababaabbαaabaaβ]
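A minimal sketch of the database query, assuming the DB holds the two strings of the running example and using an uncompressed generalized suffix trie in which every node records which strings pass through it (a simplification of the tree on the slide; all names are illustrative):

```python
def build_generalized_suffix_trie(strings):
    """Trie of all suffixes of all strings; each node stores the ids of its strings."""
    root = {"children": {}, "ids": set()}
    for string_id, s in enumerate(strings):
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node["children"].setdefault(ch, {"children": {}, "ids": set()})
                node["ids"].add(string_id)
    return root


def strings_containing(root, pattern):
    """Ids of the database strings that contain pattern (empty set if none)."""
    node = root
    for ch in pattern:
        node = node["children"].get(ch)
        if node is None:
            return set()
    return set(node["ids"])


db = ["ababaabb", "aabaa"]
trie = build_generalized_suffix_trie(db)
for pattern in ("abab", "aab", "ab"):
    print(pattern, sorted(strings_containing(trie, pattern)))
```

A single walk from the root answers the query for the whole database at once.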
Applications of Suffix trees
3. The longest common substring of two strings
[Figure: generalized suffix tree of ababaabbαaabaaβ]
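In a generalized suffix tree, the longest common substring is the path label of the deepest node that has leaves of both strings below it. A minimal sketch of that idea, assuming an uncompressed trie with an owner set on every edge in place of the compressed tree (names are illustrative):

```python
def longest_common_substring(s1, s2):
    """Deepest trie path that is reached by suffixes of both strings."""
    trie = {}   # character -> (child trie, set of owners: 0 for s1, 1 for s2)
    for owner, s in enumerate((s1, s2)):
        for start in range(len(s)):
            node = trie
            for ch in s[start:]:
                child, owners = node.setdefault(ch, ({}, set()))
                owners.add(owner)
                node = child
    best = ""
    stack = [(trie, "")]
    while stack:
        node, path = stack.pop()
        for ch, (child, owners) in node.items():
            if owners == {0, 1}:          # substring occurs in both strings
                label = path + ch
                if len(label) > len(best):
                    best = label
                stack.append((child, label))
    return best


print(longest_common_substring("ababaabb", "aabaa"))
```

For the two strings of the running example the result is abaa, the path that ends at the node with leaves β,2 and bbα,3 below it.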
Applications of Suffix trees
5. Finding MUMs.
[Figure: generalized suffix tree of ababaabbαaabaaβ]
Third part:
Suffix links
Suffix links
[Figure: the suffix tree of ababaabbα, with the suffix links of its internal nodes drawn in one at a time]
root
  a
    abbα,5
    b
      bα,6
      a
        abbα,3
        baabbα,1
  b
    bα,7
    α,8
    a
      abbα,4
      baabbα,2
  α,9
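A suffix link goes from the internal node whose path label is xw (x a single character) to the node whose path label is w. The sketch below is a brute-force check of that definition, not the way links are set during construction. It assumes the text ends with a unique terminator and uses the fact that internal nodes are exactly the substrings followed by at least two different characters; the function names are illustrative.

```python
from collections import defaultdict


def internal_nodes(text):
    """Path labels of the internal nodes (the root is the empty string)."""
    followers = defaultdict(set)
    for start in range(len(text)):
        for end in range(start, len(text)):
            # text[start:end] is followed by the character text[end]
            followers[text[start:end]].add(text[end])
    return {label for label, chars in followers.items() if len(chars) >= 2}


def suffix_links(text):
    """Map each internal node label xw to the label w its suffix link points to."""
    nodes = internal_nodes(text)
    links = {label: label[1:] for label in nodes if label}
    # The target of every suffix link is itself an internal node.
    assert all(target in nodes for target in links.values())
    return links


for source, target in sorted(suffix_links("ababaabbα").items()):
    print(repr(source), "->", repr(target))
```

For ababaabbα the links are a -> root, b -> root, ab -> b, ba -> a and aba -> ba.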
Traversal using Suffix links
[Figure: the suffix tree of S1 = a b a b a a b b α, traversed with S2 by following suffix links]
Given S2 = a a b a a
Unique matchings
aa in S2 [1]
aab in S2 [1] = S1[5..6-7] in S2 [1]
Given S2 = a a b a a b b a
Unique matchings
S1[3..6-8] in S2 [2]
S1[4..6-8] in S2 [3]
S1[5..8] in S2 [4]
S1[6..8] in S2 [5]
S1[7..8] in S2 [6]
From UMs to MUMs
Given S2 = a a b a a b b a
and S1 = a b a b a a b b α
Array of UMs (one entry per position of S1; . = no UM starts here)
S1 position:  1  2  3    4    5  6  7  8  9
UM:           .  .  6-8  6-8  8  8  8  .  .
Unique matchings
S1[3..6-8] in S2 [2]
S1[4..6-8] in S2 [3]
S1[5..8] in S2 [4]
S1[6..8] in S2 [5]
S1[7..8] in S2 [6]
MUM: S1[3..6-8] in S2[2]
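The MUM above can be cross-checked by brute force. The sketch assumes the usual definition of a MUM (a substring that occurs exactly once in each sequence and cannot be extended to the left or to the right); it is far too slow for genomes and only illustrates the definition. Function names are illustrative.

```python
def count_occurrences(text, word):
    """Number of (possibly overlapping) occurrences of word in text."""
    return sum(1 for i in range(len(text) - len(word) + 1) if text.startswith(word, i))


def find_mums(s1, s2, min_len=1):
    """List of (start in s1, start in s2, substring), with 1-based positions."""
    mums = []
    n = len(s1)
    for i in range(n):
        for j in range(i + min_len, n + 1):
            word = s1[i:j]
            if count_occurrences(s1, word) != 1 or count_occurrences(s2, word) != 1:
                continue                      # not unique in both sequences
            p = s2.find(word)                 # its single occurrence in s2
            q = p + len(word)
            extends_left = i > 0 and p > 0 and s1[i - 1] == s2[p - 1]
            extends_right = j < n and q < len(s2) and s1[j] == s2[q]
            if not extends_left and not extends_right:
                mums.append((i + 1, p + 1, word))
    return mums


print(find_mums("ababaabb", "aabaabba"))
```

On the example it reports the single match abaabb, starting at S1[3] and S2[2].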
Fourth part:
Linear insertion algorithm
Linear insertion algorithm
Invariant Properties:

Given the string …………………………......

and the suffix-tree

…...
P1: the leaves of suffixes from  have been inserted
P2: the string  is the longest string that can be
spelt through the tree.
Linear insertion algorithm: example
Given the string ababaababb...
Tree after inserting the leaves of suffixes 1 to 5 (suffixes 6, 7, 8 and 9 are still pending):
root
  a
    ababb...,5
    ba
      ababb...,3
      baababb...,1
  ba
    ababb...,4
    baababb...,2
The pending suffixes are then inserted in turn, each one splitting an edge and adding a leaf: b...,6, then b...,7, then b...,8, then b...,9.
Tree after inserting the leaves of suffixes 1 to 9:
root
  a
    ababb...,5
    b
      b...,8
      a
        ababb...,3
        b
          aababb...,1
          b...,6
  b
    b...,9
    a
      ababb...,4
      b
        aababb...,2
        b...,7
Index
Suffix arrays
Suffix-arrays: a new method for on-line
string searches,
G. Myers, U. Manber
Suffix arrays
Given string ababaa#:
Suffixes:
1: ababaa#
2: babaa#
3: abaa#
4: baa#
5: aa#
6: a#
7: #
… but lexicographically sorted:
1   7: #
2   6: a#
3   5: aa#
4   3: abaa#
5   1: ababaa#
6   4: baa#
7   2: babaa#
Which is the cost?
O(n log(n))
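A minimal sketch that reproduces the sorted table, assuming that # sorts before every letter (true for its ASCII code). It sorts the suffixes directly, so each comparison may cost up to O(n) and the total is O(n² log n) in the worst case; reaching O(n log(n)) needs a dedicated construction that is not shown here.

```python
def build_suffix_array(text):
    """1-based start positions of the suffixes of text, in lexicographic order."""
    return sorted(range(1, len(text) + 1), key=lambda start: text[start - 1:])


text = "ababaa#"
for rank, start in enumerate(build_suffix_array(text), start=1):
    print(rank, f"{start}: {text[start - 1:]}")
```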
Applications of suffix arrays
1. Exact string matching
• Does the sequence ababaa contain any occurrence of the patterns abab, aab, and ab?
1   7: #
2   6: a#
3   5: aa#
4   3: abaa#
5   1: ababaa#
6   4: baa#
7   2: babaa#
Binary search … which is the cost?
O(log(n) |P|)
Can it be improved to …
O(log(n)+|P|) ?
Fast search with cost O(log(n)+|P|)
Suffix array: 1, 2, …, α, …, γ, …, β, …, n
Query:
Invariant Properties:
P1: suff(α) < query ≤ suff(β)
P2: matches pref( query)
Algorithm:
If suff(γ) < query then α = γ
else β = γ
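The bisection step can be sketched as follows, with lo, hi and mid playing the roles of α, β and γ. This version compares whole strings at every step, so it runs in O(log(n) |P|); the bookkeeping of matched prefixes needed for the O(log(n)+|P|) bound is not shown. The suffix array is built by plain sorting, and all names are illustrative.

```python
def build_suffix_array(text):
    """1-based start positions of the suffixes, in lexicographic order."""
    return sorted(range(1, len(text) + 1), key=lambda start: text[start - 1:])


def find_occurrences(text, suffix_array, pattern):
    """Sorted 1-based positions where pattern occurs in text."""
    # Invariant: suffix at lo < pattern <= suffix at hi (with sentinels -1 and n).
    lo, hi = -1, len(suffix_array)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if text[suffix_array[mid] - 1:] < pattern:
            lo = mid
        else:
            hi = mid
    # Suffixes starting with pattern are contiguous, beginning at index hi.
    positions = []
    k = hi
    while k < len(suffix_array) and text.startswith(pattern, suffix_array[k] - 1):
        positions.append(suffix_array[k])
        k += 1
    return sorted(positions)


text = "ababaa#"
suffix_array = build_suffix_array(text)
for pattern in ("abab", "aab", "ab"):
    print(pattern, find_occurrences(text, suffix_array, pattern))
```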