Algorithms in Bioinformatics Lawrence D’Antonio Ramapo College of New Jersey

Download Report

Transcript Algorithms in Bioinformatics Lawrence D’Antonio Ramapo College of New Jersey

Algorithms in
Bioinformatics
Lawrence D’Antonio
Ramapo College of New Jersey
Bioinformatics Workshop, Fall 2003
Topics
• Algorithm basics
• Types of algorithms in bioinformatics
• Sequence alignment
• Database Searches
Bioinformatics Workshop, Fall 2003
Algorithm basics
• What is an algorithm?
• Algorithm complexity
• P vs. NP
• NP completeness
Bioinformatics Workshop, Fall 2003
What is an algorithm?
• An algorithm is a step-by-step procedure to
solve a problem
• The word “algorithm” comes from the 9th
century Islamic mathematician alKhwarizmi
Bioinformatics Workshop, Fall 2003
Algorithm Complexity
• If the algorithm works with n pieces of data
and the number of steps is proportional to n,
then we say that the running time is O(n).
• If the number of steps is proportional to
log n, then the running time is O(log n).
Bioinformatics Workshop, Fall 2003
Example
• Problem: find the largest element in a
sequence of n elements.
• Solution idea: Iteratively compare size of
elements in sequence.
Bioinformatics Workshop, Fall 2003
Algorithm:
1. Initialize first element as largest.
2. For each remaining element.
If current element larger than largest, make
that element largest.
Running time: O(n)
Bioinformatics Workshop, Fall 2003
Polynomial Time
• An algorithm is said to run in polynomial
time if its running time can be written in the
form O(nk) for some power k.
• The underlying problem is said to be of
class P.
Bioinformatics Workshop, Fall 2003
Polynomial Time Examples
• Searching
Binary Search: O(log n)
• Sorting
Quick Sort: O(n log n)
Bioinformatics Workshop, Fall 2003
NP Algorithms
• An algorithm is nondeterministic if it begins
with guessing a solution to the problem and
then verifies the guess.
• A problem is of category NP if there is a
nondeterministic algorithm for that problem
which runs in polynomial time.
Bioinformatics Workshop, Fall 2003
NP Complete
• A problem is NP-complete if it has an NP
algorithm, and solutions to this problem can
be used to solve all other NP problems.
• A problem is NP-hard if it is at least as hard
as the NP-complete problems
Bioinformatics Workshop, Fall 2003
NP Complete Examples
• Traveling salesman
• Knapsack problem
• Partition problem
• Graph coloring
Bioinformatics Workshop, Fall 2003
P = NP ?
• P  NP
• If P  NP then NP-complete problems have
exponential running time.
Bioinformatics Workshop, Fall 2003
Polynomial vs. Exponential
Bioinformatics Workshop, Fall 2003
Algorithms in Bioinformatics
• Algorithms to compare DNA, RNA, or
protein sequences
• Database searches to find homologous
sequences
• Sequence assembly
• Construction of evolutionary trees
• Structure prediction
Bioinformatics Workshop, Fall 2003
Edit operations on sequences
Substitution
Insertion
Deletion
AATAAGC
AAT-AAGC
AATAAGC
ATTAAGC
AATTAAGC
AA-AAGC
Bioinformatics Workshop, Fall 2003
What is sequence alignment?
• Compare two sequences using matches,
substitutions and indels.
G AA - - T C AT
G - TGG -CA • 3 matches, 1 substitution, 5 indels
Bioinformatics Workshop, Fall 2003
Complexity of DNA Problems
• 3 billion base pairs in human genome
• Many NP complete problems
• 10600 possible alignments for two 1000
character sequences
Bioinformatics Workshop, Fall 2003
Types of sequence alignment
• Determine the alignment of two sequences
that maximizes similarity (global
alignment)
• Determine substrings of two sequences with
maximum similarity (local alignment)
• Determine the alignment for several
sequences that maximizes the sum of pairs
similarity (multiple alignment)
Bioinformatics Workshop, Fall 2003
Significance of Alignment
• Functional similarity
• Structural similarity
• Homology
Bioinformatics Workshop, Fall 2003
Scoring System
• Assign a score for each possible match,
substitution and indel
• Distance functions – Find alignment to
minimize distance between sequences
• Similarity functions – Find alignment to
maximize similarity between sequences
Bioinformatics Workshop, Fall 2003
Edit Distance
G AA - - T C AT
G - TGG -CA • Similarity function: 1 for match, -1 for
substitution, -2 for indel
• Score: -8
Bioinformatics Workshop, Fall 2003
Dynamic Programming
• Used on optimization problems
• Bottom-up approach
• Recursively builds up solution from
subproblem optimal solutions
Bioinformatics Workshop, Fall 2003
Dynamic Programming Alignment
Algorithm (Needleman-Wunsch)
• Given sequences a1,a2,…,an and b1,b2,…,bm
to be aligned:
• Initialize alignment matrix (aligning with
spaces)
• Entry [i,j] gives optimal alignment score for
sequences a1,a2,…,ai and b1,b2,…,bj (where
1  i  n, 1  j  m)
Bioinformatics Workshop, Fall 2003
Computing Alignment Matrix
If a1,a2,…,ai and b1,b2,…,bj have been aligned,
there are three possible next moves:
• Match ai+1 with bj+1
• Match ai+1 with a space —
• Match bj+1 with a space —
Choose the move that maximizes the similarity
of the two sequences
Bioinformatics Workshop, Fall 2003
Global Alignment Matrix
—
G
G
G
C
A
T
—
0
-2
-4
-6
-8
-10
-12
G
-2
1
-1
-3
-5
-7
-9
G
-4
-1
2
0
-2
-4
-6
A
-6
-3
0
1
-1
-1
-3
Bioinformatics Workshop, Fall 2003
C
-8
-5
-2
-1
2
0
-2
A
-10
-7
-4
-3
0
3
1
Optimal Global Alignment
G
G
G
C
A
T
G
G
A
C
A
—
Bioinformatics Workshop, Fall 2003
Alignment Running Time
• Assuming two sequences n characters each
• Running time is O(n2) (each entry of matrix
must be calculated)
Bioinformatics Workshop, Fall 2003
Variations of Alignment
Algorithm
• Gap penalty
• Local alignment
• Multiple alignment
Bioinformatics Workshop, Fall 2003
Gap Penalty
• A gap is a number k of consecutive spaces
• k consecutive spaces are more probable than
k isolated spaces
• Typical gap penalty function: a + b·k
(affine gap penalty)
• Here the first space in a gap is penalized
a+b, further spaces are penalized b each.
Bioinformatics Workshop, Fall 2003
Gap Penalty Example
• Use penalty, 1 + k
A- A- C - A
AC TAT CA
• Score: -6
AA C - - - A
AC TAT CA
• Score: -4
Bioinformatics Workshop, Fall 2003
Local Alignment
• Find conserved regions in otherwise
dissimilar sequences (e.g., viral and host
DNA)
• Smith-Waterman algorithm
• Includes a fourth possibility at each step
(don’t align)
Bioinformatics Workshop, Fall 2003
Local Alignment Example
• Align the following
G C T C T G C G AATA
C G TT GAGATAC T
Bioinformatics Workshop, Fall 2003
Optimal Local Alignment
G C T C T G C G AATA
C G TT GAGATAC T
(G C T C) T G C G A A T A
(C G T) T G A G - A T A (C T)
Bioinformatics Workshop, Fall 2003
Multiple Alignment
• Find the alignment among a set of
sequences that maximizes the sum of scores
for all pairs of sequences
• Dynamic programming run-time for k
sequences of length n: O(k2 2k nk)
• Multiple alignment is NP-complete
Bioinformatics Workshop, Fall 2003
Other Features
• Usually used for protein alignment
• Can be used for global or local alignment
Bioinformatics Workshop, Fall 2003
Multiple Alignment Example
P E A A L Y G R F T - - - I K S D VW
P E S L A Y N K F - - - S I K S D VW
P E A L N Y G R Y - - - S S E S D VW
P E A L N Y GWY - - - S S E S D VW
P E V I RMQ D D N P F S F Q S D V Y
Bioinformatics Workshop, Fall 2003
Multiple vs. Pairwise Alignment
• Optimal multiple alignment does not imply
optimal pairwise alignment
AT
A-T
A-T
Bioinformatics Workshop, Fall 2003
Substitution Matrices
• In homologous sequences certain amino
acid substitutions are more likely to occur
than others
• Types of substitution matrices
* PAM
* BLOSUM
Bioinformatics Workshop, Fall 2003
PAM Matrices
• Defines units of evolutionary distance
• 1 PAM unit represents an average of one
mutation per 100 amino acids
• Start with a set of highly similar sequences
and compute
* pa = probability of occurrence of amino acid a
* Mab = probability of a mutating to b
Bioinformatics Workshop, Fall 2003
PAM Matrix Formula
• Entries in a k-PAM matrix
k
ab
M
10 log10
pb
Bioinformatics Workshop, Fall 2003
PAM250 Matrix
C
S
T
P
A
G
N
D
E
Q
H
R
K
M
I
L
V
F
Y
C
12
S
0
2
T
-2
1
3
P
-3
1
0
6
A
-2
1
1
1
2
G
-3
1
0
-1
1
5
N
-4
1
0
-1
0
0
2
D
-5
0
0
-1
0
1
2
4
E
-5
0
0
-1
0
0
1
3
4
Q
-5
-1
-1
0
0
-1
1
2
2
4
H
-3
-1
-1
0
-1
-2
2
1
1
3
6
R
-4
0
-1
0
-2
-3
0
-1
-1
1
2
6
K
-5
0
0
-1
-1
-2
1
0
0
1
0
3
5
M
-5
-2
-1
-2
-1
-3
-2
-3
-2
-1
-2
0
0
6
I
-2
-1
0
-2
-1
-3
-2
-2
-2
-2
-2
-2
-2
2
5
L
-6
-3
-2
-3
-2
-4
-3
-4
-3
-2
-2
-3
-3
4
2
6
V
-2
-1
0
-1
0
-1
-2
-2
-2
-2
-2
-2
-2
2
4
2
4
F
-4
-3
-3
-5
-4
-5
-4
-6
-5
-5
-2
-4
-5
0
1
2
-1
9
Y
0
-3
-3
-5
-3
-5
-2
-4
-4
-4
0
-4
-4
-2
-1
-1
-2
7
10
W
-8
-2
-5
-6
-6
-7
-4
-7
-7
-5
-3
2
-3
-4
-5
-2
-6
0
0
Bioinformatics Workshop, Fall 2003
W
17
BLOSUM Matrices (Omit)
• Uses log-odds ratio similar to PAM
• Uses short highly conserved sequences
• BLOSUM x matrices created after removing
sequences that are more than x percent identical
• Better at local alignments
Bioinformatics Workshop, Fall 2003
BLOSUM Matrices
• A motif is a conserved amino acid pattern
found in a group of proteins with similar
biological meaning (PROSITE)
• A block is a conserved amino acid pattern in
a group of proteins (no spaces allowed in
the pattern) (BLOCKS)
Bioinformatics Workshop, Fall 2003
Motif Example
• Motif obtained from a group of 34 tubulin
proteins
M[FYW] . . F[VLI]H . [FYW] . . EGM
Bioinformatics Workshop, Fall 2003
Defining BLOSUM (I)
• BLOSUMn uses blocks that are n%
identical (BLOSUM62 is most common)
• Consider all pairs of amino acids appearing
in the same column in the blocks
Bioinformatics Workshop, Fall 2003
Defining BLOSUM (II)
• Define n(i,j) to be the frequency that amino
acids i,j appear in a column pair
• Define e(i,j) to be the frequency that amino
acids i,j appear in any pair
• Define BLOSUM entry
n( i , j )
s( i , j )  log 2
e( i, j )
Bioinformatics Workshop, Fall 2003
PAM vs. BLOSUM
• PAM derived from highly similar sequences
(evolutionary model)
• BLOSUM derived from protein families
sharing a common ancestor (conserved
domain model)
Bioinformatics Workshop, Fall 2003
Database Searches
• FASTA
• BLAST
Bioinformatics Workshop, Fall 2003
FASTA
• Looks for sequences in a database similar to
a query sequence
• Heuristic, exclusion method
• Compares query sequence to each database
sequence (called the text)
Bioinformatics Workshop, Fall 2003
FASTA Algorithm (I)
• Look for small substrings in query and text
that exactly match (“hot spots”)
• Find ten best “diagonal runs” of hot spots
Bioinformatics Workshop, Fall 2003
Hot Spot Example
E K L A S R K L
H
A
S
H
K
L
*
*
*
*
Bioinformatics Workshop, Fall 2003
FASTA Algorithm (II)
• Find best local alignment for each run
• Combine these into larger alignment
• Do multiple alignment on query and texts
having highest score in last step
Bioinformatics Workshop, Fall 2003
BLAST
• Basic Local Alignment Search Tool
• Heuristic, exclusion method
• Computes statistical significance of
alignment scores
Bioinformatics Workshop, Fall 2003
BLAST Algorithm
• Find all w-length substrings in text that align to
some w-length substring in query with score above
a given threshold (called “hits”)
• Extend these hits as far as possible (“segment
pairs”)
• Report the highest scoring segment pairs
Bioinformatics Workshop, Fall 2003
Other Bioinformatics Algorithms
•
•
•
•
•
Palindromes
Tandem Repeats
Longest Common Subsequence
Double Digest (NP complete)
Shortest Common Superstring (NP
complete)
Bioinformatics Workshop, Fall 2003
References
• Clote and Backofen, Computational Molecular
Biology, Wiley
• Gusfield, Algorithms on Strings, Trees, and
Sequences, Cambridge University Press
• Mount, Bioinformatics, Cold Spring Harbor Press
• Setubal and Meidanis, Introduction to
Computational Molecular Biology, PWS
• Waterman, Introduction to Computational
Biology, CRC Press
Bioinformatics Workshop, Fall 2003