CS790 – Introduction to Bioinformatics
Download
Report
Transcript CS790 – Introduction to Bioinformatics
BIO/CS 471 – Algorithms for Bioinformatics
Sequence Alignments
and Dynamic Programming
Module II: Sequence Alignments
Problem: SequenceAlignment
• Input: Two or more strings of characters
• Output: The optimal alignment of the input strings,
possibly including gaps, and an alignment score.
Example
• Input: The calico cat was very sleepy.
The cat was sleepy.
• Output: The calico cat was very sleepy.
The ------ cat was ---- sleepy.
Sequence Alignments
2
Biological Sequence Alignment
Input: DNA or Amino Acid sequence
Output: Optimal alignment
Example
• Input:
ACTATAGCTATAGCTATAGCGTATCG
ACGATTACAGGTCGTGTCG
• Output: ACTATAGCTATAGCTATAGCGTATCG
|| ||
||*|| |
|||*|||
ACGAT---TACAGGT----CGTGTCG
Score: +8
Sequence Alignments
3
Why align sequences?
Why would we want to align two sequences?
• Compare two genes/proteins, e.g. to infer function
• Database searches
• Protein structure prediction
•
Why would we want to align multiple
sequences?
• Conservation analysis
• Molecular evolution
•
Sequence Alignments
4
How do we decide on a scoring function?
Objective: Sequences that are homologous or
evolutionarily related, should align with high
scores. Less related sequences should score
lower.
Question: How do homologous sequences
arise?
Sequence Alignments
5
Transcription
The Central
Dogma
DNA
transcription
RNA
translation
Proteins
Sequence Alignments
6
DNA Replication
Prior to cell division, all the
genetic instructions must be
“copied” so that each new cell
will have a complete set
DNA polymerase is the enzyme
that copies DNA
• Reads the old strand in the 3´ to 5´
direction
Sequence Alignments
7
Over time, genes
accumulate mutations
Environmental factors
• Radiation
• Oxidation
Mistakes in replication or repair
Deletions, Duplications
Insertions
Inversions
Point mutations
Deletions
Codon deletion:
ACG ATA GCG TAT GTA TAG CCG…
• Effect depends on the protein, position, etc.
• Almost always deleterious
• Sometimes lethal
Frame shift mutation:
ACG ATA GCG TAT GTA TAG CCG…
ACG ATA GCG ATG TAT AGC CG?…
• Almost always lethal
Sequence Alignments
9
Indels
Comparing two genes it is generally impossible
to tell if an indel is an insertion in one gene, or
a deletion in another, unless ancestry is known:
ACGTCTGATACGCCGTATCGTCTATCT
ACGTCTGAT---CCGTATCGTCTATCT
Sequence Alignments
10
The Genetic Code
Substitutions are
mutations
accepted by
natural selection.
Synonymous:
CGC CGA
Non-synonymous:
GAU GAA
Sequence Alignments
11
Two kinds of homologs
Orthologs
• Species A has a particular
gene
• Species A diverges into
species B and C, each
with the same gene
• The two genes
accumulate substitutions
independently
Sequence Alignments
12
Orthologs and Paralogs
Paralogs
• A gene duplication event
within a species results in
two copies of a gene
• One of the copies is now
under less selective
constraint
• The copies can
accumulate substitutions
independently, before and
after speciation
Sequence Alignments
13
Scoring a sequence alignment
Match score:
+1
Mismatch score:
+0
Gap penalty:
–1
ACGTCTGATACGCCGTATAGTCTATCT
||||| |||
|| ||||||||
----CTGATTCGC---ATCGTCTATCT
Matches: 18 × (+1)
Mismatches: 2 × 0
Gaps: 7 × (– 1)
Sequence Alignments
Score = +11
14
Origination and length penalties
We want to find alignments that are
evolutionarily likely.
Which of the following alignments seems more
likely to you?
ACGTCTGATACGCCGTATAGTCTATCT
ACGTCTGAT-------ATAGTCTATCT
ACGTCTGATACGCCGTATAGTCTATCT
AC-T-TGA--CG-CGT-TA-TCTATCT
We can achieve this by penalizing more for a
new gap, than for extending an existing gap
Sequence Alignments
15
Scoring a sequence alignment (2)
Match/mismatch score:
+1/+0
Origination/length penalty:
–2/–1
ACGTCTGATACGCCGTATAGTCTATCT
||||| |||
|| ||||||||
----CTGATTCGC---ATCGTCTATCT
Matches: 18 × (+1)
Mismatches: 2 × 0
Origination: 2 × (–2)
Length: 7 × (–1)
Sequence Alignments
Score = +7
16
Optimal Substructure in Alignments
Consider the alignment:
ACGTCTGATACGCCGTATAGTCTATCT
||||| |||
|| ||||||||
----CTGATTCGC---ATCGTCTATCT
Is it true that the alignment in the boxed region
must be optimal?
Sequence Alignments
17
A Greedy Strategy
Consider this pair of sequences
GAGC
GAP = 1
CAGC
Greedy Approach:
G or G or
C
Leads to
GAGC-----CAGC
Sequence Alignments
Match = +1
G
Better:
Mismatch = 2
GACG
CACG
18
Breaking apart the problem
Suppose we are aligning:
ACTCG
ACAGTAG
First position choices:
Sequence Alignments
A
A
+1
CTCG
CAGTAG
A
-
-1
CTCG
ACAGTAG
A
-1
ACTCG
CAGTAG
19
A Recursive Approach to Alignment
Choose the best alignment based on these three
possibilities:
align(seq1, seq2) {
if (both sequences empty) {return 0;}
if (one string empty) {
return(gapscore * num chars in nonempty seq);
else {
score1 = score(firstchar(seq1),firstchar(seq2))
+ align(tail(seq1), tail(seq2));
score2 = align(tail(seq1), seq2) + gapscore;
score3 = align(seq1, tail(seq2) + gapscore;
return(min(score1, score2, score3));
}
}
}
Sequence Alignments
20
Time Complexity of RecurseAlign
What is the recurrence equation for the time
needed by RecurseAlign?
T (n) 3T (n 1) 3
3
n
3
3
3
…
3
3
9
3
27
n
3
Sequence Alignments
21
RecurseAlign repeats its work
A
C
G
T
A
T
C
G
C
G
T
A
T
A
G
A
T
G
C
T
C
T
C
G
Sequence Alignments
22
How can we find an optimal alignment?
Finding the alignment is computationally hard:
ACGTCTGATACGCCGTATAGTCTATCT
CTGAT---TCG—CATCGTC--T-ATCT
C(27,7) gap positions = ~888,000 possibilities
It’s possible, as long as we don’t repeat our
work!
Dynamic programming: The Needleman &
Wunsch algorithm
Sequence Alignments
23
What is the optimal alignment?
ACTCG
ACAGTAG
Match: +1
Mismatch: 0
Gap: –1
Sequence Alignments
24
Needleman-Wunsch: Step 1
Each sequence along one axis
Mismatch penalty multiples in first row/column
0 in [1,1] (or [0,0] for the CS-minded)
A
C
A
G
T
A
G
0
-1
-2
-3
-4
-5
-6
-7
Sequence Alignments
A
-1
1
C
-2
T
-3
C
-4
G
-5
25
Needleman-Wunsch: Step 2
Vertical/Horiz. move: Score + (simple) gap penalty
Diagonal move: Score + match/mismatch score
Take the MAX of the three possibilities
A
C
A
G
T
A
G
0
-1
-2
-3
-4
-5
-6
-7
Sequence Alignments
A
-1
1
C
-2
T
-3
C
-4
G
-5
26
Needleman-Wunsch: Step 2 (cont’d)
Fill out the rest of the table likewise…
a
a
c
a
g
t
a
g
Sequence Alignments
0
-1
-2
-3
-4
-5
-6
-7
c
-1
1
t
-2
0
c
-3
-1
g
-4
-2
-5
-3
27
Needleman-Wunsch: Step 2 (cont’d)
Fill out the rest of the table likewise…
a
a
c
a
g
t
a
g
0
-1
-2
-3
-4
-5
-6
-7
c
-1
1
0
-1
-2
-3
-4
-5
t
-2
0
2
1
0
-1
-2
-3
c
-3
-1
1
2
1
1
0
-1
g
-4
-2
0
1
2
1
1
0
-5
-3
-1
0
2
2
1
2
The optimal alignment score is calculated in the
lower-right corner
Sequence Alignments
28
But what is the optimal alignment
To reconstruct the optimal alignment, we must
determine of where the MAX at each step came
from…
a
a
c
a
g
t
a
g
Sequence Alignments
0
-1
-2
-3
-4
-5
-6
-7
c
-1
1
0
-1
-2
-3
-4
-5
t
-2
0
2
1
0
-1
-2
-3
c
-3
-1
1
2
1
1
0
-1
g
-4
-2
0
1
2
1
1
0
-5
-3
-1
0
2
2
1
2
29
A path corresponds to an alignment
= GAP in top sequence
= GAP in left sequence
= ALIGN both positions
One path from the previous table:
Corresponding alignment (start at the end):
AC--TCG
ACAGTAG
Sequence Alignments
Score = +2
30
Practice Problem
Find an optimal alignment for these two
sequences:
GCGGTT
GCGT
Match: +1
Mismatch: 0
Gap: –1 g
c
g
t
Sequence Alignments
g
0
-1
-2
-3
-4
c
-1
g
-2
g
-3
t
-4
t
-5
-6
31
Practice Problem
Find an optimal alignment for these two
sequences:
GCGGTT
GCGT
g
c
g
g
t
t
g
c
g
t
0
-1
-2
-3
-4
-1
1
0
-1
-2
-2
0
2
1
0
-3
-1
1
3
2
GCGGTT
GCG-TSequence Alignments
-4
-2
0
2
3
-5
-3
-1
1
3
-6
-4
-2
0
2
Score = +2
32
Semi-global alignment
Suppose we are aligning:
GCG
GGCG
Which do you prefer?
G-CG
-GCG
GGCG
GGCG
g
g
g
c
g
0
-1
-2
-3
-4
c
-1
1
0
-1
-2
g
-2
0
1
1
0
-3
-1
1
1
2
Semi-global alignment allows gaps at the ends
for free.
Sequence Alignments
33
Semi-global alignment
Semi-global alignment allows gaps at the ends
for free.
g
g
g
c
g
0
0
0
0
0
c
0
1
1
0
1
g
0
0
1
2
1
0
1
1
1
3
Initialize first row and column to all 0’s
Allow free horizontal/vertical moves in last
row and column
Sequence Alignments
34
Local alignment
Global alignments – score the entire alignment
Semi-global alignments – allow unscored gaps
at the beginning or end of either sequence
Local alignment – find the best matching
subsequence
CGATG
AAATGGA
This is achieved by allowing a 4th alternative at
each position in the table: zero.
Sequence Alignments
35
Local alignment
Mismatch = –1 this time
c
a
a
a
t
g
g
a
0
-1
-2
-3
-4
-5
-6
-7
g
-1
0
0
0
0
0
0
0
a
-2
0
0
0
0
1
1
0
t
-3
0
1
1
0
0
0
2
g
-4
0
0
0
2
1
0
1
-5
0
0
0
1
3
2
1
CGATG
AAATGGA
Sequence Alignments
36
Food for thought…
What is the asymptotic time complexity of
the Needleman & Wunsch algorithm?
• Is this good or bad?
How about the space complexity?
• Why might this be a problem?
Sequence Alignments
37
Saving Space
Note that we can throw away the previous rows
of the table as we fill it in:
a
This row
is based
only on
this one
Sequence Alignments
a
c
a
g
t
a
g
0
-1
-2
-3
-4
-5
-6
-7
c
-1
1
0
-1
-2
-3
-4
-5
t
-2
0
2
1
0
-1
-2
-3
c
-3
-1
1
2
1
1
0
-1
g
-4
-2
0
1
2
1
1
0
-5
-3
-1
0
2
2
1
2
38
Saving Space (2)
Each row of the table contains the scores for
aligning a prefix of the left-hand sequence with
all prefixes of the top sequence:
a
Scores for
aligning
aca with
all
prefixes of
actcg
Sequence Alignments
a
c
a
g
t
a
g
0
-1
-2
-3
-4
-5
-6
-7
c
-1
1
0
-1
-2
-3
-4
-5
t
-2
0
2
1
0
-1
-2
-3
c
-3
-1
1
2
1
1
0
-1
g
-4
-2
0
1
2
1
1
0
-5
-3
-1
0
2
2
1
2
39
Divide and Conquer
By using a recursive approach, we can use only
two rows of the matrix at a time:
• Choose the middle character of the top sequence, i
• Find out where i aligns to the bottom sequence
Needs two vectors of scores
• Recursively align the sequences before and after the
fixed positions
i
ACGCTATGCTCATAG
CGACGCTCATCG
Sequence Alignments
40
Finding where i lines up
Find out where i aligns to the bottom sequence
Needs two vectors of scores
i
s: ACGCTATGCTCATAG
t: CGACGCTCATCG
Assuming i lines up with a character:
alignscore = align(ACGCTAT, prefix(t)) + score(G, char from t)
+ align(CTCATAG, suffix(t))
Which character is best?
• Can quickly find out the score for aligning ACGCTAT with
every prefix of t.
Sequence Alignments
41
Finding where i lines up
But, i may also line up with a gap
i
s: ACGCTATGCTCATAG
t: CGACGCTCATCG
Assuming i lines up with a gap:
alignscore = align(ACGCTAT, prefix(t)) + gapscore
+ align(CTCATAG, suffix(t))
Sequence Alignments
42
Recursive Call
Fix the best position for I
Call align recursively for the prefixes and
suffixes:
i
s: ACGCTATGCTCATAG
t:
Sequence Alignments
CGACGCTCATCG
43
Complexity
Let len(s) = m and len(t) = n
i
Space: 2m
s: ACGCTATGCTCATAG
Time:
t:
CGACGCTCATCG
• Each call to build
similarity vector = m´n´
• First call + recursive call:
mn mn
m
T m, n
T ,
2
2
2
j
m
j T ,n
2
j
m n m j m( n j )
2m n
Sequence Alignments
44
General Gap Penalties
Suppose we are no longer using simple gap
penalties:
• Origination = −2
• Length = −1
Consider the last position of the alignment for
ACGTA with ACG
We can’t determine the score for
G
or
G
unless we know the previous positions!
Sequence Alignments
45
Scoring Blocks
Now we must score a block at a time
A A C --- A TATCCG A C T AC
A C T ACC T ------ C G C --
A block is a pair of characters, or a maximal
group of gaps paired with characters
To score a position, we need to either start a
new block or add it to a previous block
Sequence Alignments
46
The Algorithm
Three tables
• a – scores for alignments ending in char-char blocks
• b – scores for alignments ending in gaps in the top
sequence (s)
• c – scores for alignments ending in gaps in the left
sequence (t)
Scores no longer depend on only three
positions, because we can put any number of
gaps into the last block
Sequence Alignments
47
The Recurrences
ai 1, j 1
ai, j p i, j maxbi 1, j 1
ci 1, j 1
ai, j k wk , for1 k j
bi, j max
ci, j k wk , for1 k j
ai k , j wk , for1 k i
ci, j max
bi k , j wk , for1 k i
Sequence Alignments
48
The Optimal Alignment
The optimal alignment is found by looking at
the maximal value in the lower right of all three
arrays
The algorithm runs in O(n3) time
• Uses O(n2) space
Sequence Alignments
49