Multi-alignment of Genomes

Master Course

MSc Bioinformatics for Health Sciences

H15: Algorithms on strings and sequences

Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen)
Dep. de Llenguatges i Sistemes Informàtics
CEPBA-IBM Research Institute
Universitat Politècnica de Catalunya

Contents

1. (Exact) String matching of one pattern
2. (Exact) String matching of many patterns
3. Approximate string matching (Dynamic programming)
4. Pairwise and multiple alignment
5. Suffix trees

Bibliography

G. Navarro and M. Raffinot, Flexible Pattern Matching in Strings, Cambridge University Press, 2002

D. Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge University Press, 1997

String matching

Definition: given a long text T and a set of k patterns p1, p2, …, pk, the string matching problem is to find all the occurrences of all the patterns in the text T.

On-line algorithms: the patterns are known.

• Only one pattern (exact and approximate)
• Five, ten, a hundred, a thousand, ... patterns (exact)
• Extended patterns

Off-line algorithms: the text is known.

• Suffix trees

First lecture: First part: (Exact) string matching of one pattern

String matching: one pattern

How do string algorithms perform the search?

For instance, given the sequence CTACTACTACGTCTATACTGATCGTAGCTACTACATGC search for the pattern ACTGA.

and for the pattern TACTACGGTATGACTAA

String Matching: Brute force algorithm

Example: given the pattern ATGTA, the search is

G T A C T A G A G G A C G T A T G T A C T G ...
A T G T A
  A T G T A
    A T G T A
      A T G T A
        A T G T A
          A T G T A
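The search above can be sketched in a few lines of Python. This is only an illustrative sketch, not the C code referenced in the course; the function name `brute_force_search` is an assumption made for this example.

```python
def brute_force_search(text, pattern):
    """Return every start index at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    hits = []
    for pos in range(n - m + 1):      # the window moves one cell at a time
        j = 0
        # compare from left to right (prefix search)
        while j < m and text[pos + j] == pattern[j]:
            j += 1
        if j == m:                    # every cell of the window matched
            hits.append(pos)
    return hits


text = "GTACTAGAGGACGTATGTACTG"       # the text of the example, without spaces
print(brute_force_search(text, "ATGTA"))   # [14]
```

Each of the n - m + 1 windows may need up to m comparisons, which gives the O(nm) worst case.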

String Matching: Brute force algorithm

Connect to http://www-igm.univ-mlv.fr/~lecroq/string/index.html

and open the Brute Force algorithm (C code of the running file). Connect to http://www.lsi.upc.edu/~peypoch

String Matching of one pattern

The cost of Brute Force algorithm is O(nm). Can the search be made with lower cost?

Text: CTACTACTACGTCTATACTGATCGTAGCTACTACATGC
Pattern: TACTACGGTATGACTAA

Three approaches: prefix search, suffix search and factor search.

String matching of one pattern

How do string algorithms perform the search?

A window slides along the text, and the pattern is compared against it. At each step the comparison is made and the window is shifted to the right.

What differentiates the algorithms?

1. How the comparison is made.

2. The length of the shift.

String Matching: Brute force algorithm

• How is the comparison made?

From left to right: prefix search.

• Which is the next position of the window?

The window is shifted only one cell. The cost is O(mn).

String Matching: one pattern

Most efficient algorithms (Navarro & Raffinot)

[Figure: most efficient algorithm as a function of the length of the pattern; regions labeled Horspool, BNDM and BOM; axis ticks 2, 4, 8, 16, 32, 64.]

BNDM: Backward Nondeterministic Dawg Matching
BOM: Backward Oracle Matching

String Matching: Horspool algorithm

• How is the comparison made?

From right to left: suffix search.

• Which is the next position of the window?

It depends on where the last letter of the text window, say 'a', appears in the pattern: the window is shifted so that the rightmost 'a' of the pattern (its last cell excluded) lines up with that letter of the text. A preprocessing step is therefore needed to determine the length of the shift for each letter.

String Matching: Horspool algorithm

Example: given the pattern ATGTA, the shift table is

A 4
C 5
G 2
T 1

and the search is

G T A C T A G A G G A C G T A T G T A C T G ...
A T G T A
  A T G T A
          A T G T A
              A T G T A
                        A T G T A
                            A T G T A
                                    A T G T A …
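The shift table and the search of this example can be reproduced with the following sketch. The function names are assumptions made for illustration, and the code is not the C implementation linked from the course.

```python
def horspool_shift_table(pattern):
    """Shift for each letter: distance from its rightmost occurrence
    (last cell of the pattern excluded) to the end of the pattern."""
    m = len(pattern)
    # later occurrences overwrite earlier ones, so the rightmost one wins
    return {c: m - 1 - i for i, c in enumerate(pattern[:-1])}


def horspool_search(text, pattern):
    n, m = len(text), len(pattern)
    if m == 0:
        return []
    shift = horspool_shift_table(pattern)
    hits, pos = [], 0
    while pos <= n - m:
        j = m - 1
        # compare from right to left (suffix search)
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1
        if j < 0:
            hits.append(pos)
        # letters absent from the table get the maximal shift m
        pos += shift.get(text[pos + m - 1], m)
    return hits


print(horspool_shift_table("ATGTA"))                       # {'A': 4, 'T': 1, 'G': 2}
print(horspool_search("GTACTAGAGGACGTATGTACTG", "ATGTA"))  # [14]
```

The letter C does not appear in the table because it is absent from the pattern; `shift.get(..., m)` supplies its shift of 5.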

String Matching: Horspool algorithm

Connect to http://www-igm.univ-mlv.fr/~lecroq/string/index.html

and open the Horspool algorithm (C code). Connect to http://www.lsi.upc.edu/~peypoch

BNDM algorithm

• How is the comparison made?

BNDM searches for suffixes of the text window T that are factors of the pattern P. After each character x is read, the state of the search is expressed with an array D of bits, for instance D2 = 1 0 0 0 1 0 0.

How can the next state be obtained? Given the mask B(x) of x, which marks the cells where the character x appears in the pattern:

D = (D << 1) & B(x)

If B(x) = ( 0 0 1 1 0 0 0 ), then D3 = ( 0 0 0 1 0 0 0 ) & ( 0 0 1 1 0 0 0 ) = ( 0 0 0 1 0 0 0 ).

• How is the shift determined?

BNDM algorithm: example

Given the pattern ATGTA, the masks of the characters are:

B(A) = ( 1 0 0 0 1 )
B(C) = ( 0 0 0 0 0 )
B(G) = ( 0 0 1 0 0 )
B(T) = ( 0 1 0 1 0 )

and the text is: G T A C T A G A G G A C G T A T G T A C T G ...

First window (G T A C T):
D1 = ( 0 1 0 1 0 )
D2 = ( 1 0 1 0 0 ) & ( 0 0 0 0 0 ) = ( 0 0 0 0 0 )

Second window (A G A G G):
D1 = ( 0 0 1 0 0 )
D2 = ( 0 1 0 0 0 ) & ( 0 0 1 0 0 ) = ( 0 0 0 0 0 )

Third window (A C G T A):
D1 = ( 1 0 0 0 1 )
D2 = ( 0 0 0 1 0 ) & ( 0 1 0 1 0 ) = ( 0 0 0 1 0 )
D3 = ( 0 0 1 0 0 ) & ( 0 0 1 0 0 ) = ( 0 0 1 0 0 )
D4 = ( 0 1 0 0 0 ) & ( 0 0 0 0 0 ) = ( 0 0 0 0 0 )

Fourth window (A T G T A):
D1 = ( 1 0 0 0 1 )
D2 = ( 0 0 0 1 0 ) & ( 0 1 0 1 0 ) = ( 0 0 0 1 0 )
D3 = ( 0 0 1 0 0 ) & ( 0 0 1 0 0 ) = ( 0 0 1 0 0 )
D4 = ( 0 1 0 0 0 ) & ( 0 1 0 1 0 ) = ( 0 1 0 0 0 )
D5 = ( 1 0 0 0 0 ) & ( 1 0 0 0 1 ) = ( 1 0 0 0 0 )
D6 = ( 0 0 0 0 0 ) & ( * * * * * ) = ( 0 0 0 0 0 )
Pattern found!

BNDM algorithm

• How is the shift determined?

If the leftmost bit of D is set to one in step i (with i < m), a prefix of P of length i is equal to a suffix of the window, so the window is shifted m - i cells, taking the largest such i. If the leftmost bit is never set before the whole window is read, the window is shifted m cells.
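The update D = (D << 1) & B(x) and the shift rule can be put together as in the sketch below. It stores D in a Python integer whose highest bit corresponds to the leftmost cell of the pattern, so the masks read like the bit arrays on the slides. The name `bndm_search` and the variable `last` (the pending shift) are assumptions of this sketch.

```python
def bndm_search(text, pattern):
    n, m = len(text), len(pattern)
    if m == 0:
        return []
    # B[c]: bit (m-1-i) is set when pattern[i] == c, e.g. B['A'] = 10001 for ATGTA
    B = {}
    for i, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << (m - 1 - i))
    full = (1 << m) - 1          # m ones
    top = 1 << (m - 1)           # the leftmost bit
    hits, pos = [], 0
    while pos <= n - m:
        j, last = m, m           # j: unread cells of the window; last: shift to apply
        D = full
        while D != 0 and j > 0:
            D &= B.get(text[pos + j - 1], 0)   # read the window right to left
            j -= 1
            if D & top:
                if j > 0:
                    last = j     # a prefix of length m - j matches: shift j cells
                else:
                    hits.append(pos)           # the whole window was read
            D = (D << 1) & full
        pos += last
    return hits


print(bndm_search("GTACTAGAGGACGTATGTACTG", "ATGTA"))   # [14]
```

On the example text the windows start at positions 0, 5, 10 and 14, with shifts 5, 5 and 4, as in the trace above. Python integers are unbounded, so this sketch does not have the machine-word limit on m that a C implementation would.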

BOM (Backward Oracle Matching)

• How is the comparison made?

With an automaton, the Factor Oracle (1999), which checks whether the suffix read so far is a factor of the pattern.

• How is the shift determined?

Automaton Factor Oracle: properties

[Figure: Factor Oracle of the word GTATGTA.]

The automaton recognizes every factor of the word, but it also recognizes other strings, such as GTG. It is therefore useful only for discarding words that are not factors!
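The property can be checked on the word of the slide with a small sketch. It uses the standard on-line construction of the factor oracle with a supply function; the function names and the list-of-dictionaries representation are assumptions made for this example.

```python
def build_factor_oracle(word):
    """Return the transitions as a list of dicts; state 0 is initial
    and every state is accepting."""
    delta = [{} for _ in range(len(word) + 1)]
    supply = [None] * (len(word) + 1)      # supply[0] stays undefined
    for i, c in enumerate(word, start=1):
        delta[i - 1][c] = i                # the "spine" transition
        k = supply[i - 1]
        while k is not None and c not in delta[k]:
            delta[k][c] = i                # external transition
            k = supply[k]
        supply[i] = 0 if k is None else delta[k][c]
    return delta


def oracle_accepts(delta, s):
    state = 0
    for c in s:
        state = delta[state].get(c)
        if state is None:
            return False
    return True


word = "GTATGTA"
oracle = build_factor_oracle(word)
factors = {word[i:j] for i in range(len(word)) for j in range(i + 1, len(word) + 1)}
print(all(oracle_accepts(oracle, f) for f in factors))   # True: no factor is rejected
print(oracle_accepts(oracle, "GTG"), "GTG" in word)      # True False
```

A string rejected by the oracle is certainly not a factor, which is the only direction BOM relies on.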

BOM: example

• How is the comparison made?

• The Factor Oracle of the inverted pattern is built. Given the pattern ATGTATG, the oracle is built for GTATGTA.

[Figure: Factor Oracle of GTATGTA.]

• Search:

G T A C T A G A A T G T G T A G A C A T G T A T G G T G ...
A T G T A T G
            A T G T A T G
                A T G T A T G
                      A T G T A T G
                                    A T G T A T G
                                      A T G T A T G …

BOM (Backward Oracle Matching)

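• How is the comparison made?

With the Factor Oracle automaton, which checks whether the suffix read so far is a factor of the pattern.

• How is the shift determined?

If a is the first mismatch, that is, the first character for which the oracle has no transition, the window is shifted so that it starts just after a.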
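A minimal sketch of the whole BOM search follows, assuming the standard on-line factor oracle construction. The function names are illustrative, and the explicit comparison before reporting a match is a safeguard added in this sketch rather than part of the slides.

```python
def build_factor_oracle(word):
    """Transitions of the factor oracle of word; state 0 is initial."""
    delta = [{} for _ in range(len(word) + 1)]
    supply = [None] * (len(word) + 1)
    for i, c in enumerate(word, start=1):
        delta[i - 1][c] = i
        k = supply[i - 1]
        while k is not None and c not in delta[k]:
            delta[k][c] = i
            k = supply[k]
        supply[i] = 0 if k is None else delta[k][c]
    return delta


def bom_search(text, pattern):
    n, m = len(text), len(pattern)
    if m == 0:
        return []
    delta = build_factor_oracle(pattern[::-1])   # oracle of the inverted pattern
    hits, pos = [], 0
    while pos <= n - m:
        state, j = 0, m
        # read the window from right to left through the oracle
        while j > 0 and state is not None:
            state = delta[state].get(text[pos + j - 1])
            j -= 1
        if state is not None and text.startswith(pattern, pos):
            hits.append(pos)                     # all m characters were read
        # on a mismatch at window cell j (0-based), restart just after it;
        # after a full read j == 0, so the shift is one cell
        pos += j + 1
    return hits


text = "GTACTAGAATGTGTAGACATGTATGGTG"            # the text of the example
print(bom_search(text, "ATGTATG"))               # [18]
```

On this text the windows start at positions 0, 6, 8, 11, 18 and 19, which is the alignment shown in the example.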

String Matching: BNDM and BOM

Connect to http://www-igm.univ-mlv.fr/~lecroq/string/index.html

and open the BNDM and BOM algorithms (C code of BNDM, C code of BOM).

First lecture: Second part: (Exact) string matching of many patterns

String matching: many patterns

Given the sequence CTACTACTACGTCTATACTGATCGTAGCTACTACATGC, search for the patterns ACTGACT, GTCT, AATT, ACTGATCTTT, GTAGC, AATACT, ACATGC and ACTGA.

Trie

[Figure: trie of the words GTATGTA, GTAT, TAATA and GTGTA.]

What is the cost?
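As a rough illustration of the cost question, a trie can be built with nested dictionaries in time proportional to the total length of the words. The helper names and the `"$"` end-of-word marker (assumed not to occur in the alphabet) are choices made for this sketch.

```python
def build_trie(words):
    root = {}
    for w in words:
        node = root
        for c in w:                       # one step per character of the word
            node = node.setdefault(c, {})
        node["$"] = w                     # end-of-word marker, assumed absent from the alphabet
    return root


def trie_contains(root, w):
    node = root
    for c in w:
        node = node.get(c)
        if node is None:
            return False
    return "$" in node


def count_nodes(node):
    return 1 + sum(count_nodes(child) for key, child in node.items() if key != "$")


trie = build_trie(["GTATGTA", "GTAT", "TAATA", "GTGTA"])
print(trie_contains(trie, "GTAT"), trie_contains(trie, "GTA"))   # True False
print(count_nodes(trie))                                         # 16
```

The 21 characters of the four words produce only 16 nodes (root included) because common prefixes such as GT are shared.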

Horspool for many patterns

Search for ATGTATG, TATG, ATAAT, ATGTG

1. Build the trie of the inverted patterns [figure: trie of GTATGTA, GTAT, TAATA and GTGTA]
2. Determine lmin = 4
3. Build the table of shifts: A 1, C 4 (lmin), G 2, T 1
4. Start the search

Horspool for many patterns

Search for ATGTATG, TATG, ATAAT, ATGTG in the text ACATGCTATGTGACA…

[Figure: trie of the inverted patterns, with the window sliding along the text.]

Table of shifts: A 1, C 4 (lmin), G 2, T 1

Short shifts!

Horspool to Wu-Manber

How can we increase the length of the shifts?

With a shift table of l-mers. For the patterns ATGTATG, TATG, ATAAT, ATGTG:

1 symbol: A 1, C 4 (lmin), G 2, T 1

2 symbols: AA 1, AC 3 (lmin - L + 1), AG 3, AT 1, CA 3, CC 3, CG 3, …

Keeping only the entries below the default value lmin - L + 1 = 3: AA 1, AT 1, GT 1, TA 2, TG 2
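The table of 2-symbol blocks can be recomputed with the sketch below. It mirrors the convention of the table on this slide, where a block ending exactly at the end of a pattern is not recorded; it is not taken from any particular published variant of Wu-Manber. The function name and the parameter `L` (block length) are assumptions.

```python
def block_shift_table(patterns, L=2):
    """Return (table, default): table maps each block of L symbols to its
    shift, and blocks missing from the table shift by default = lmin - L + 1."""
    lmin = min(len(p) for p in patterns)
    default = lmin - L + 1
    table = {}
    for p in patterns:
        m = len(p)
        for i in range(m - L):             # the block ending at the pattern end is skipped
            block = p[i:i + L]
            d = m - (i + L)                # distance from the block end to the pattern end
            table[block] = min(table.get(block, default), d)
    # keep only the entries that improve on the default, as on the slide
    return {b: d for b, d in table.items() if d < default}, default


table, default = block_shift_table(["ATGTATG", "TATG", "ATAAT", "ATGTG"])
print(sorted(table.items()), default)
# [('AA', 1), ('AT', 1), ('GT', 1), ('TA', 2), ('TG', 2)] 3
```

Eleven of the sixteen possible 2-symbol blocks now get the maximal shift of 3, whereas with single symbols only C had the maximal shift.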

Wu-Manber algorithm

Search for ATGTATG, TATG, ATAAT, ATGTG in the text ACATGCTATGTGACATAATA

[Figure: trie of the inverted patterns, with the window sliding along the text.]

Table of shifts: AA 1, AT 1, GT 1, TA 2, TG 2, …

Experimental length: log_|Σ| (2 · lmin · r)
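As a worked instance of the experimental length, assume that r denotes the number of patterns (the slide does not define it) and take the DNA alphabet, so |Σ| = 4, with the four example patterns, so lmin = 4 and r = 4:

```latex
L = \log_{|\Sigma|}\left(2 \cdot l_{min} \cdot r\right)
  = \log_{4}\left(2 \cdot 4 \cdot 4\right)
  = \log_{4} 32
  = 2.5
```

Under these assumptions the formula suggests blocks of 2 or 3 symbols, which is in line with the 2-symbol table used in the example.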

String matching of many patterns

[Figure: most efficient algorithm (Wu-Manber or SBOM) as a function of Lmin (axis from 5 to 45), in four panels: 5, 10, 100 and 1000 patterns; the other axis has ticks 2, 4 and 8.]

SBOM

• How is the comparison made?

With the Factor Oracle automaton, which checks whether the suffix read so far is a factor of any pattern.

• How is the shift determined?

Factor Oracle of many patterns

[Figure: the AFO of GTATGTA, GTAA, TAATA and GTGTA; states labeled 2, 3 and 1,4.]

SBOM algorithm

• How is the comparison made?

[Figure: text, patterns and automaton, with a window of length lmin.]

• How is the shift determined?

The window is shifted in either of two cases:

• If the character a just read does not appear in the AFO
• If lmin characters have been read

SBOM algorithm: example

Search for the patterns ATGTATG, TAATG, TAATAAT and AATGTG in the text ACATGCTAGCTATAATAATGTATG

[Figure: AFO of the patterns, with states labeled 1 to 4, and the window sliding along the text.]
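The example can be reproduced with the sketch below. It assumes the usual construction of the factor oracle of a set of words (a trie traversed level by level with a supply function), built over the inverted prefixes of length lmin of the patterns. As a simplification, every pattern is verified against the text whenever lmin characters have been read, instead of only the patterns attached to the reached state. All names are illustrative.

```python
def build_set_oracle(words):
    """Factor oracle of a set of words; state 0 is the root of the trie."""
    delta, parent, label, depth = [{}], [None], [None], [0]
    for w in words:                                   # 1. build the trie
        state = 0
        for c in w:
            if c not in delta[state]:
                delta.append({})
                parent.append(state)
                label.append(c)
                depth.append(depth[state] + 1)
                delta[state][c] = len(delta) - 1
            state = delta[state][c]
    supply = [None] * len(delta)
    # 2. visit the states level by level and add the external transitions
    for cur in sorted(range(1, len(delta)), key=lambda s: depth[s]):
        c, down = label[cur], supply[parent[cur]]
        while down is not None and c not in delta[down]:
            delta[down][c] = cur
            down = supply[down]
        supply[cur] = 0 if down is None else delta[down][c]
    return delta


def sbom_search(text, patterns):
    lmin = min(len(p) for p in patterns)
    delta = build_set_oracle([p[:lmin][::-1] for p in patterns])
    hits, pos = [], 0
    while pos <= len(text) - lmin:
        state, j = 0, lmin
        while j > 0 and state is not None:            # read the window right to left
            state = delta[state].get(text[pos + j - 1])
            j -= 1
        if state is not None:                         # lmin characters have been read
            hits.extend((pos, p) for p in patterns if text.startswith(p, pos))
        pos += j + 1                                  # restart just after the failing character
    return hits


patterns = ["ATGTATG", "TAATG", "TAATAAT", "AATGTG"]
print(sbom_search("ACATGCTAGCTATAATAATGTATG", patterns))
# [(12, 'TAATAAT'), (15, 'TAATG'), (17, 'ATGTATG')]
```

The explicit verification is needed because the oracle also accepts strings that are not factors, so reading lmin characters does not by itself prove an occurrence.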

Exact search algorithms for many patterns

[Figure: most efficient algorithm (Wu-Manber, SBOM or Ad AC) as a function of the minimum length (axis from 5 to 45), in four panels: 5, 10, 100 and 1000 words; the other axis has ticks 2, 4 and 8.]