CS790 – Introduction to Bioinformatics

Download Report

Transcript CS790 – Introduction to Bioinformatics

BIO/CS 471 – Algorithms for Bioinformatics
Sequence Alignments
and Dynamic Programming
Module II: Sequence Alignments
 Problem: SequenceAlignment
• Input: Two or more strings of characters
• Output: The optimal alignment of the input strings,
possibly including gaps, and an alignment score.
 Example
• Input: The calico cat was very sleepy.
The cat was sleepy.
• Output: The calico cat was very sleepy.
The ------ cat was ---- sleepy.
Sequence Alignments
2
Biological Sequence Alignment
 Input: DNA or Amino Acid sequence
 Output: Optimal alignment
 Example
• Input:
ACTATAGCTATAGCTATAGCGTATCG
ACGATTACAGGTCGTGTCG
• Output: ACTATAGCTATAGCTATAGCGTATCG
|| ||
||*|| |
|||*|||
ACGAT---TACAGGT----CGTGTCG

Score: +8
Sequence Alignments
3
Why align sequences?
 Why would we want to align two sequences?
• Compare two genes/proteins, e.g. to infer function
• Database searches
• Protein structure prediction
•
 Why would we want to align multiple
sequences?
• Conservation analysis
• Molecular evolution
•
Sequence Alignments
4
How do we decide on a scoring function?
 Objective: Sequences that are homologous or
evolutionarily related, should align with high
scores. Less related sequences should score
lower.
 Question: How do homologous sequences
arise?
Sequence Alignments
5
Transcription
The Central
Dogma
DNA
transcription

RNA
translation

Proteins
Sequence Alignments
6
DNA Replication
 Prior to cell division, all the
genetic instructions must be
“copied” so that each new cell
will have a complete set
 DNA polymerase is the enzyme
that copies DNA
• Reads the old strand in the 3´ to 5´
direction
Sequence Alignments
7
Over time, genes
accumulate mutations
 Environmental factors
• Radiation
• Oxidation
 Mistakes in replication or repair
 Deletions, Duplications
 Insertions
 Inversions
 Point mutations
Deletions
 Codon deletion:
ACG ATA GCG TAT GTA TAG CCG…
• Effect depends on the protein, position, etc.
• Almost always deleterious
• Sometimes lethal
 Frame shift mutation:
ACG ATA GCG TAT GTA TAG CCG…
ACG ATA GCG ATG TAT AGC CG?…
• Almost always lethal
Sequence Alignments
9
Indels
 Comparing two genes it is generally impossible
to tell if an indel is an insertion in one gene, or
a deletion in another, unless ancestry is known:
ACGTCTGATACGCCGTATCGTCTATCT
ACGTCTGAT---CCGTATCGTCTATCT
Sequence Alignments
10
The Genetic Code
Substitutions are
mutations
accepted by
natural selection.
Synonymous:
CGC  CGA
Non-synonymous:
GAU  GAA
Sequence Alignments
11
Two kinds of homologs
 Orthologs
• Species A has a particular
gene
• Species A diverges into
species B and C, each
with the same gene
• The two genes
accumulate substitutions
independently
Sequence Alignments
12
Orthologs and Paralogs
 Paralogs
• A gene duplication event
within a species results in
two copies of a gene
• One of the copies is now
under less selective
constraint
• The copies can
accumulate substitutions
independently, before and
after speciation
Sequence Alignments
13
Scoring a sequence alignment
 Match score:
+1
 Mismatch score:
+0
 Gap penalty:
–1
ACGTCTGATACGCCGTATAGTCTATCT
||||| |||
|| ||||||||
----CTGATTCGC---ATCGTCTATCT
 Matches: 18 × (+1)
 Mismatches: 2 × 0
 Gaps: 7 × (– 1)
Sequence Alignments
Score = +11
14
Origination and length penalties
 We want to find alignments that are
evolutionarily likely.
 Which of the following alignments seems more
likely to you?
ACGTCTGATACGCCGTATAGTCTATCT
ACGTCTGAT-------ATAGTCTATCT

ACGTCTGATACGCCGTATAGTCTATCT
AC-T-TGA--CG-CGT-TA-TCTATCT

 We can achieve this by penalizing more for a
new gap, than for extending an existing gap
Sequence Alignments
15
Scoring a sequence alignment (2)
 Match/mismatch score:
+1/+0
 Origination/length penalty:
–2/–1
ACGTCTGATACGCCGTATAGTCTATCT
||||| |||
|| ||||||||
----CTGATTCGC---ATCGTCTATCT




Matches: 18 × (+1)
Mismatches: 2 × 0
Origination: 2 × (–2)
Length: 7 × (–1)
Sequence Alignments
Score = +7
16
Optimal Substructure in Alignments
 Consider the alignment:
ACGTCTGATACGCCGTATAGTCTATCT
||||| |||
|| ||||||||
----CTGATTCGC---ATCGTCTATCT
 Is it true that the alignment in the boxed region
must be optimal?
Sequence Alignments
17
A Greedy Strategy
 Consider this pair of sequences
GAGC
GAP = 1
CAGC
 Greedy Approach:
G or G or
C
 Leads to
GAGC-----CAGC
Sequence Alignments
Match = +1
G
Better:
Mismatch = 2
GACG
CACG
18
Breaking apart the problem
 Suppose we are aligning:
ACTCG
ACAGTAG
 First position choices:
Sequence Alignments
A
A
+1
CTCG
CAGTAG
A
-
-1
CTCG
ACAGTAG
A
-1
ACTCG
CAGTAG
19
A Recursive Approach to Alignment
 Choose the best alignment based on these three
possibilities:
align(seq1, seq2) {
if (both sequences empty) {return 0;}
if (one string empty) {
return(gapscore * num chars in nonempty seq);
else {
score1 = score(firstchar(seq1),firstchar(seq2))
+ align(tail(seq1), tail(seq2));
score2 = align(tail(seq1), seq2) + gapscore;
score3 = align(seq1, tail(seq2) + gapscore;
return(min(score1, score2, score3));
}
}
}
Sequence Alignments
20
Time Complexity of RecurseAlign
 What is the recurrence equation for the time
needed by RecurseAlign?
T (n)  3T (n  1)  3
3
n
3
3
3
…
3
3
9
3
27
n
3
Sequence Alignments
21
RecurseAlign repeats its work
A
C
G
T
A
T
C
G
C
G
T
A
T
A
G
A
T
G
C
T
C
T
C
G
Sequence Alignments
22
How can we find an optimal alignment?
 Finding the alignment is computationally hard:
ACGTCTGATACGCCGTATAGTCTATCT
CTGAT---TCG—CATCGTC--T-ATCT
 C(27,7) gap positions = ~888,000 possibilities
 It’s possible, as long as we don’t repeat our
work!
 Dynamic programming: The Needleman &
Wunsch algorithm
Sequence Alignments
23
What is the optimal alignment?
 ACTCG
ACAGTAG
 Match: +1
 Mismatch: 0
 Gap: –1
Sequence Alignments
24
Needleman-Wunsch: Step 1
 Each sequence along one axis
 Mismatch penalty multiples in first row/column
 0 in [1,1] (or [0,0] for the CS-minded)
A
C
A
G
T
A
G
0
-1
-2
-3
-4
-5
-6
-7
Sequence Alignments
A
-1
1
C
-2
T
-3
C
-4
G
-5
25
Needleman-Wunsch: Step 2
 Vertical/Horiz. move: Score + (simple) gap penalty
 Diagonal move: Score + match/mismatch score
 Take the MAX of the three possibilities
A
C
A
G
T
A
G
0
-1
-2
-3
-4
-5
-6
-7
Sequence Alignments
A
-1
1
C
-2
T
-3
C
-4
G
-5
26
Needleman-Wunsch: Step 2 (cont’d)
 Fill out the rest of the table likewise…
a
a
c
a
g
t
a
g
Sequence Alignments
0
-1
-2
-3
-4
-5
-6
-7
c
-1
1
t
-2
0
c
-3
-1
g
-4
-2
-5
-3
27
Needleman-Wunsch: Step 2 (cont’d)
 Fill out the rest of the table likewise…
a
a
c
a
g
t
a
g
0
-1
-2
-3
-4
-5
-6
-7
c
-1
1
0
-1
-2
-3
-4
-5
t
-2
0
2
1
0
-1
-2
-3
c
-3
-1
1
2
1
1
0
-1
g
-4
-2
0
1
2
1
1
0
-5
-3
-1
0
2
2
1
2
 The optimal alignment score is calculated in the
lower-right corner
Sequence Alignments
28
But what is the optimal alignment
 To reconstruct the optimal alignment, we must
determine of where the MAX at each step came
from…
a
a
c
a
g
t
a
g
Sequence Alignments
0
-1
-2
-3
-4
-5
-6
-7
c
-1
1
0
-1
-2
-3
-4
-5
t
-2
0
2
1
0
-1
-2
-3
c
-3
-1
1
2
1
1
0
-1
g
-4
-2
0
1
2
1
1
0
-5
-3
-1
0
2
2
1
2
29
A path corresponds to an alignment

= GAP in top sequence

= GAP in left sequence

= ALIGN both positions
 One path from the previous table:
 Corresponding alignment (start at the end):
AC--TCG
ACAGTAG
Sequence Alignments
Score = +2
30
Practice Problem
 Find an optimal alignment for these two
sequences:
GCGGTT
GCGT
 Match: +1
 Mismatch: 0
 Gap: –1 g
c
g
t
Sequence Alignments
g
0
-1
-2
-3
-4
c
-1
g
-2
g
-3
t
-4
t
-5
-6
31
Practice Problem
 Find an optimal alignment for these two
sequences:
GCGGTT
GCGT
g
c
g
g
t
t
g
c
g
t
0
-1
-2
-3
-4
-1
1
0
-1
-2
-2
0
2
1
0
-3
-1
1
3
2
GCGGTT
GCG-TSequence Alignments
-4
-2
0
2
3
-5
-3
-1
1
3
-6
-4
-2
0
2
Score = +2
32
Semi-global alignment
 Suppose we are aligning:
GCG
GGCG
 Which do you prefer?
G-CG
-GCG
GGCG
GGCG
g
g
g
c
g
0
-1
-2
-3
-4
c
-1
1
0
-1
-2
g
-2
0
1
1
0
-3
-1
1
1
2
 Semi-global alignment allows gaps at the ends
for free.
Sequence Alignments
33
Semi-global alignment
 Semi-global alignment allows gaps at the ends
for free.
g
g
g
c
g
0
0
0
0
0
c
0
1
1
0
1
g
0
0
1
2
1
0
1
1
1
3
 Initialize first row and column to all 0’s
 Allow free horizontal/vertical moves in last
row and column
Sequence Alignments
34
Local alignment
 Global alignments – score the entire alignment
 Semi-global alignments – allow unscored gaps
at the beginning or end of either sequence
 Local alignment – find the best matching
subsequence
 CGATG
AAATGGA
 This is achieved by allowing a 4th alternative at
each position in the table: zero.
Sequence Alignments
35
Local alignment
 Mismatch = –1 this time
c
a
a
a
t
g
g
a
0
-1
-2
-3
-4
-5
-6
-7
g
-1
0
0
0
0
0
0
0
a
-2
0
0
0
0
1
1
0
t
-3
0
1
1
0
0
0
2
g
-4
0
0
0
2
1
0
1
-5
0
0
0
1
3
2
1
CGATG
AAATGGA
Sequence Alignments
36
Food for thought…
 What is the asymptotic time complexity of
the Needleman & Wunsch algorithm?
• Is this good or bad?
 How about the space complexity?
• Why might this be a problem?
Sequence Alignments
37
Saving Space
 Note that we can throw away the previous rows
of the table as we fill it in:
a
This row
is based
only on
this one
Sequence Alignments
a
c
a
g
t
a
g
0
-1
-2
-3
-4
-5
-6
-7
c
-1
1
0
-1
-2
-3
-4
-5
t
-2
0
2
1
0
-1
-2
-3
c
-3
-1
1
2
1
1
0
-1
g
-4
-2
0
1
2
1
1
0
-5
-3
-1
0
2
2
1
2
38
Saving Space (2)
 Each row of the table contains the scores for
aligning a prefix of the left-hand sequence with
all prefixes of the top sequence:
a
Scores for
aligning
aca with
all
prefixes of
actcg
Sequence Alignments
a
c
a
g
t
a
g
0
-1
-2
-3
-4
-5
-6
-7
c
-1
1
0
-1
-2
-3
-4
-5
t
-2
0
2
1
0
-1
-2
-3
c
-3
-1
1
2
1
1
0
-1
g
-4
-2
0
1
2
1
1
0
-5
-3
-1
0
2
2
1
2
39
Divide and Conquer
 By using a recursive approach, we can use only
two rows of the matrix at a time:
• Choose the middle character of the top sequence, i
• Find out where i aligns to the bottom sequence

Needs two vectors of scores
• Recursively align the sequences before and after the
fixed positions
i
ACGCTATGCTCATAG
CGACGCTCATCG
Sequence Alignments
40
Finding where i lines up
 Find out where i aligns to the bottom sequence

Needs two vectors of scores
i
s: ACGCTATGCTCATAG
t: CGACGCTCATCG
 Assuming i lines up with a character:
alignscore = align(ACGCTAT, prefix(t)) + score(G, char from t)
+ align(CTCATAG, suffix(t))
 Which character is best?
• Can quickly find out the score for aligning ACGCTAT with
every prefix of t.
Sequence Alignments
41
Finding where i lines up
 But, i may also line up with a gap
i
s: ACGCTATGCTCATAG
t: CGACGCTCATCG
 Assuming i lines up with a gap:
alignscore = align(ACGCTAT, prefix(t)) + gapscore
+ align(CTCATAG, suffix(t))
Sequence Alignments
42
Recursive Call
 Fix the best position for I
 Call align recursively for the prefixes and
suffixes:
i
s: ACGCTATGCTCATAG
t:
Sequence Alignments
CGACGCTCATCG
43
Complexity
 Let len(s) = m and len(t) = n
i
 Space: 2m
s: ACGCTATGCTCATAG
 Time:
t:
CGACGCTCATCG
• Each call to build
similarity vector = m´n´
• First call + recursive call:
mn mn
m
T m, n  

T ,
2
2
2
j

m
j T ,n 

2

j

 m n  m j  m( n  j )
 2m n
Sequence Alignments
44
General Gap Penalties
 Suppose we are no longer using simple gap
penalties:
• Origination = −2
• Length = −1
 Consider the last position of the alignment for
ACGTA with ACG
 We can’t determine the score for
G
or
G
unless we know the previous positions!
Sequence Alignments
45
Scoring Blocks
 Now we must score a block at a time
A A C --- A TATCCG A C T AC
A C T ACC T ------ C G C --
 A block is a pair of characters, or a maximal
group of gaps paired with characters
 To score a position, we need to either start a
new block or add it to a previous block
Sequence Alignments
46
The Algorithm
 Three tables
• a – scores for alignments ending in char-char blocks
• b – scores for alignments ending in gaps in the top
sequence (s)
• c – scores for alignments ending in gaps in the left
sequence (t)
 Scores no longer depend on only three
positions, because we can put any number of
gaps into the last block
Sequence Alignments
47
The Recurrences
ai  1, j  1

ai, j   p i, j   maxbi  1, j  1
ci  1, j  1

ai, j  k   wk , for1  k  j
bi, j   max
ci, j  k   wk , for1  k  j
ai  k , j   wk , for1  k  i
ci, j   max
bi  k , j   wk , for1  k  i
Sequence Alignments
48
The Optimal Alignment
 The optimal alignment is found by looking at
the maximal value in the lower right of all three
arrays
 The algorithm runs in O(n3) time
• Uses O(n2) space
Sequence Alignments
49