Application of Algorithm Research to Molecular Biology

R. C. T. Lee
Dept. of Computer Science
National Chi Nan University
1
• All living organisms share one peculiar
characteristic: we can reproduce
ourselves.
• Yet it is important that what we reproduce
is the same as we are.
• That is, wild flowers produce the same kind
of wild flowers and birds reproduce the
same kind of birds.
2
• Information about ourselves must be passed
to our descendants.
• Question: How is this done?
• Answer: Through DNA.
3
• DNA (deoxyribonucleic acid) can
be viewed as two strands of nucleotides
that form a double helix.
4
5
• There are only four types of nucleotide bases in
every DNA molecule:
• A: Adenine
• G: Guanine
• C: Cytosine
• T: Thymine
6
• Each strand of DNA is a sequence of A, G,
C and T.
• Each A in one strand is paired with a T in the
other strand.
• Similarly, G is paired with C.
7
Human Mitochondrial
DNA Control Region
TTCTTTCATGGGGAAGCAAA
AAGAAAGTACCCCTTCGTTT
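The pairing rule can be checked on the two strands shown above. This is a minimal sketch; the names `PAIR` and `complement` are chosen here for illustration and do not come from the slides.

```python
# Base-pairing rule from the slides: A pairs with T, G pairs with C.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand):
    """Return the strand that pairs, base by base, with the given strand."""
    return "".join(PAIR[base] for base in strand)


# The two strands of the control-region example pair up exactly.
print(complement("TTCTTTCATGGGGAAGCAAA"))  # AAGAAAGTACCCCTTCGTTT
```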
8
• DNA exists in cells.
• Each living organism has many
different kinds of cells. For instance,
human beings have muscle cells, blood
cells, nerve cells, etc.
• How can different cells perform different
functions?
9
Genes
• In each DNA sequence, there are
subsequences which are called genes.
• Each gene corresponds to a distinct protein
and it is the protein which determines the
function of the cell.
• For instance, red blood cells must contain
the oxygen-carrying protein haemoglobin,
and the production of this protein is
controlled by a certain gene.
10
Proteins
• Each protein consists of amino acids.
• There are 20 different amino acids
11
12
The Relationship between a Gene
and its Corresponding Protein
13
• As shown above, each amino acid is coded
by a triplet. For instance, TTC denotes
Phe (phenylalanine).
• Each triplet is called a codon.
• There are three codons, namely TAA, TGA
and TAG, which represent “end of gene”.
14
• Protein RNase A:
KETAAAKFER
• Its corresponding DNA sequence is:
AAA GAA ACT GCT GCT GCT AAA TTT
GAA CGT
15
How Is a Protein Produced?
• RNA (Ribonucleic Acid)
• Each cell is able to recognize all of the
starting points of genes relevant to the
proteins important to the functions of the
cell.
16
• The RNA system scans a gene. For each
codon being scanned, it produces a
corresponding amino acid.
• After all codons have been scanned, the
corresponding protein is produced.
17
18
• AAA GAA ACT GCT GCT GCT AAA TTT
GAA CGT
• KETAAAKFER
• Note that codon AAA corresponds to amino
acid K and CGT corresponds to R.
• Remember TAA, TGA and TAG signify
“end of gene”.
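The scanning process described above can be sketched as a simple codon lookup. This is only an illustration of the codon-to-amino-acid mapping, not a model of the cell's actual machinery; it assumes the standard genetic code written in DNA letters, and the names `CODON_TABLE` and `translate` are chosen here for illustration.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order.
# "*" marks the three "end of gene" codons TAA, TAG and TGA.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(gene):
    """Scan the gene codon by codon and stop at the first end-of-gene codon."""
    gene = gene.replace(" ", "").upper()
    protein = []
    for i in range(0, len(gene) - 2, 3):
        amino_acid = CODON_TABLE[gene[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("AAA GAA ACT GCT GCT GCT AAA TTT GAA CGT"))  # KETAAAKFER
```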
19
Problems
1. String Matching Problem
2. Sequence Alignment Problem
3. Evolution Tree Problem
4. RNA Secondary Structure Prediction Problem
5. Protein Structure Problem
6. Physical Mapping Problem
20
Exact String Matching Problems
• Exact String Matching Problems
– Instance: A text T of length n and a pattern P of length m,
where n > m.
– Question: Find all occurrences of P in T.
– Example: If T = “ttaptaap” and P = “ap”, then P occurs in T
starting at positions 3 and 7.
• Linear time (O(n+m) time) Algorithms
– Knuth-Morris-Pratt (KMP) algorithm
– Boyer-Moore algorithm (linear in the worst case when extended with Galil's rule)
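A minimal sketch of the KMP algorithm follows, reporting 1-based positions as in the example above. The function name `kmp_search` is chosen here for illustration.

```python
def kmp_search(text, pattern):
    """Return the 1-based start positions of all occurrences of pattern in text."""
    if not pattern:
        return []
    # failure[i]: length of the longest proper prefix of pattern[:i + 1]
    # that is also a suffix of it.
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    positions = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = failure[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            positions.append(i - k + 2)  # convert 0-based end to 1-based start
            k = failure[k - 1]
    return positions


print(kmp_search("ttaptaap", "ap"))  # [3, 7]
```

Each text character is examined once, and the failure table lets the scan continue without moving backwards in the text, which gives the O(n+m) bound.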
21
Approximate String Matching
Problems
• Approximate String Matching Problems
– Instance: A text T of length n, a pattern P of length m and a
maximal number of errors allowed k
– Question: Find all text positions where the pattern matches
the text with at most k errors, where an error is the substitution,
deletion, or insertion of a character.
– Example:
• Let T = “pttapa”, P = “patt” and k = 2.
• The substrings T[1..2], T[1..3], T[1..4] and T[5..6] each
match P with at most 2 errors.
• Algorithms
– Dynamic Programming approach
– NFA approach
22
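The dynamic programming approach can be sketched as follows. The slides do not say which formulation is meant; this sketch assumes the common column-by-column edit-distance recurrence in which a match may start anywhere in the text, and it reports the 1-based positions where a match ends. The name `approximate_matches` is chosen here for illustration.

```python
def approximate_matches(text, pattern, k):
    """Return 1-based text positions where a match with at most k errors ends."""
    m = len(pattern)
    # column[i]: fewest errors needed to match pattern[:i] against some
    # substring of the text that ends at the current position.
    column = list(range(m + 1))
    ends = []
    for j, ch in enumerate(text, start=1):
        previous = column
        column = [0] * (m + 1)  # column[0] = 0: a match may start anywhere
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            column[i] = min(
                previous[i - 1] + cost,  # match or substitution
                previous[i] + 1,         # extra character in the text
                column[i - 1] + 1,       # pattern character missing from the text
            )
        if column[m] <= k:
            ends.append(j)
    return ends


print(approximate_matches("pttapa", "patt", 2))  # [2, 3, 4, 6]
```

The reported end positions 2, 3, 4 and 6 are the end points of the substrings T[1..2], T[1..3], T[1..4] and T[5..6] in the example.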
Sequence Alignment Problem
• ATTCATTACAACCGCTATG
ACCCATCAACAACCGCTATG
• It appears that these two sequences are quite
different.
• An alignment will produce the following:
ATTCATTA-CAACCGCTATG
ACCCATCAACAACCGCTATG
23
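• Given two sequences, any alignment will
have a corresponding score.
• For each exact match, the score is equal to 2.
• For each mismatch, the score is equal to -1.
In the example below, a position aligned with a gap is also scored as -1.
• Two alignments of AGC and AAAC:
AGC-
AAAC
Score: 2 - 3 = -1
AG-C
AAAC
Score: 2 x 2 + 2 x (-1) = 2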
24
• The sequence alignment problem: Given
two sequences, find an alignment which
produces the highest score.
• Approach: Dynamic Programming
• The multiple sequence alignment problem is
NP-hard
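A minimal sketch of the dynamic programming approach for two sequences follows. It uses the scores from the slides (2 for a match, -1 for a mismatch) and assumes that a gap also costs -1, which is what the worked score 2 - 3 = -1 implies; the name `alignment_score` is chosen here for illustration.

```python
def alignment_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the highest score over all global alignments of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j]: best score for aligning a[:i] with b[:j].
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch
            )
            score[i][j] = max(
                diagonal,                 # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,    # align a[i-1] with a gap
                score[i][j - 1] + gap,    # align b[j-1] with a gap
            )
    return score[-1][-1]


print(alignment_score("AGC", "AAAC"))  # 2
```

The table has (len(a) + 1) x (len(b) + 1) entries and each is filled in constant time, so the running time is proportional to the product of the two lengths.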
25
The Evolution Tree Problem
26
27
• The evolution tree problem: Given a
distance matrix of n species, find an
evolution tree under some criterion.
• Usually, the criteria are such that all of the
tree distances reflect the original distances.
• That is, when two species are close to each
other in the distance matrix, they should be
close in the evolution tree.
28
• Each criterion corresponds to a distinct
evolution tree problem.
• Most of them are NP-complete.
• Algorithms which produce optimal
evolution trees in polynomial time are
mostly based upon the minimal spanning
tree approach.
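The spanning tree building block mentioned above can be sketched with Prim's algorithm on a distance matrix. This shows only that building block, not a complete evolution tree algorithm, and the 4-species distance matrix is hypothetical, invented for this example.

```python
def minimum_spanning_tree(dist):
    """Prim's algorithm on a symmetric distance matrix.

    Returns a list of edges (u, v, distance) in the order they are added.
    """
    n = len(dist)
    in_tree = [False] * n
    best = [float("inf")] * n  # cheapest known connection to the tree
    parent = [-1] * n
    if n:
        best[0] = 0
    edges = []
    for _ in range(n):
        # Pick the species outside the tree that is closest to it.
        u = min((v for v in range(n) if not in_tree[v]), key=lambda v: best[v])
        in_tree[u] = True
        if parent[u] != -1:
            edges.append((parent[u], u, dist[parent[u]][u]))
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v] = dist[u][v]
                parent[v] = u
    return edges


# Hypothetical distances between four species (not from the slides).
distances = [
    [0, 2, 6, 7],
    [2, 0, 5, 6],
    [6, 5, 0, 3],
    [7, 6, 3, 0],
]
print(minimum_spanning_tree(distances))  # [(0, 1, 2), (1, 2, 5), (2, 3, 3)]
```

Species that are close in the matrix (0 and 1, 2 and 3) end up joined directly in the tree.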
29
A Partial Evolution Tree of Homo sapiens
(Intelligent Human Beings, also Modern Men)
Our ancestors are from Africa.
30
Secondary Structure of RNA
• Due to hydrogen bonds, the primary
structure of an RNA can fold back on
itself to form its secondary structure.
• Base pairs (formed by hydrogen bonds):
1. AU (Watson-Crick base pair)
2. CG (Watson-Crick base pair)
3. GU (Wobble base pair)
31
AGGCCUUCCU
32
2D & 3D Structures of Yeast
Phenylalanyl-Transfer RNA
2D Structure
3D Structure
33
Secondary Structure Prediction
Problem
• Given an RNA sequence, determine the
secondary structure with the minimum
free energy for this sequence.
• Approach: Dynamic Programming
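A simplified sketch of the dynamic programming idea follows. It does not minimize free energy; instead it maximizes the number of base pairs (the Nussinov-style recurrence), which is a common stand-in when no energy model is given. The allowed pairs are the ones listed earlier (AU, CG, GU); the `min_loop` parameter, the minimum number of unpaired bases enclosed by a pair, is an assumption introduced here.

```python
ALLOWED_PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("G", "U"), ("U", "G")}


def max_base_pairs(rna, min_loop=3):
    """Return the largest number of non-crossing base pairs in rna."""
    n = len(rna)
    if n == 0:
        return 0
    # best[i][j]: most pairs that can be formed within rna[i..j].
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i + 1][j]  # case 1: base i stays unpaired
            for k in range(i + min_loop + 1, j + 1):
                if (rna[i], rna[k]) in ALLOWED_PAIRS:
                    # case 2: base i pairs with base k, splitting the interval
                    inside = best[i + 1][k - 1]
                    after = best[k + 1][j] if k + 1 <= j else 0
                    score = max(score, 1 + inside + after)
            best[i][j] = score
    return best[0][n - 1]


print(max_base_pairs("GGGAAACCC"))  # 3
```

A real predictor would replace the pair count with an energy model, but the interval-splitting structure of the recurrence stays the same.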
34
Protein Structure Problem
• Each amino acid of a protein can be classified
into one of the following two types:
– H (hydrophobic, non-polar) (hating water)
– P (hydrophilic, polar) (loving water)
• Then the amino acid sequence of a protein can
be viewed as a binary sequence of H’s (1’s) and
P’s (0’s).
35
Example
• Instance: 011001001110010
[Figure: two lattice embeddings of this instance, one with Score = 3 and one with Score = 5.]
36
H-P Model
• Instance: A sequence of 1’s (H’s) and 0’s
(P’s).
• Question: Find a self-avoiding path
embedded in either a 2D or 3D lattice which
maximizes the score, where the score is the
number of pairs of 1’s that are adjacent in
the lattice without being adjacent in the
sequence.
• NP-complete even for the 2D lattice.
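The score of a given 2D embedding can be computed directly from the definition above. This sketch only evaluates one fold supplied by the caller; it does not search for the best fold. The function name `hp_score` and the coordinate representation are assumptions made here.

```python
def hp_score(sequence, path):
    """Score a 2D lattice embedding of a 0/1 sequence.

    sequence: string of '1' (H) and '0' (P).
    path: list of (x, y) lattice points, one per character, in sequence order.
    """
    if len(sequence) != len(path):
        raise ValueError("sequence and path must have the same length")
    if len(set(path)) != len(path):
        raise ValueError("path is not self-avoiding")
    for (x1, y1), (x2, y2) in zip(path, path[1:]):
        if abs(x1 - x2) + abs(y1 - y2) != 1:
            raise ValueError("consecutive characters must be lattice neighbours")
    index_at = {point: i for i, point in enumerate(path)}
    score = 0
    for i, (x, y) in enumerate(path):
        if sequence[i] != "1":
            continue
        # Look only right and up so that every contact is counted once.
        for neighbour in ((x + 1, y), (x, y + 1)):
            j = index_at.get(neighbour)
            if j is not None and sequence[j] == "1" and abs(i - j) > 1:
                score += 1
    return score


# "1001" folded around a unit square: the two 1's touch, so the score is 1.
print(hp_score("1001", [(0, 0), (1, 0), (1, 1), (0, 1)]))  # 1
```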
37
Physical Mapping Problem
• C: Full DNA (10^8 bp).
• Physical mapping: Cut C and clone into overlapping YAC clones (10^6 bp).
• Physical mapping: Cut the DNA in each YAC clone and clone into overlapping cosmid clones (10^4 bp). Select a subset of cosmid clones of minimum total length that covers the YAC DNA.
• Fragment assembling: Duplicate the cosmid and then cut the copies randomly. Select and sequence short fragments (10^2 bp) and then reassemble them into a deduced cosmid string.
38
Shortest Common Superstring
• Input: A collection F of strings.
• Output: A shortest possible string S such
that for every f ∈ F, S is a superstring of f.
• For example: F = {ACT, CTA, AGT}, S = ACTAGT.
• NP-complete
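For very small inputs the problem can be solved exactly by trying every ordering of the strings and merging neighbours with their largest overlap. The sketch below does this; it takes exponential time, in line with the NP-completeness noted above, and the function names are chosen here for illustration.

```python
from itertools import permutations


def overlap(left, right):
    """Length of the longest suffix of left that is also a prefix of right."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def shortest_common_superstring(strings):
    """Exhaustive search over all orderings; only practical for a few strings."""
    unique = set(strings)
    # A string contained in another one is covered automatically.
    kept = sorted(s for s in unique if not any(s != t and s in t for t in unique))
    if not kept:
        return ""
    best = None
    for order in permutations(kept):
        merged = order[0]
        for s in order[1:]:
            merged += s[overlap(merged, s):]
        if best is None or len(merged) < len(best):
            best = merged
    return best


print(shortest_common_superstring(["ACT", "CTA", "AGT"]))  # ACTAGT
```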
39
• Suppose the target is too long and its
contents are unknown.
• What can we do?
• Enzyme A → {6, 8, 3, 10}
Enzyme B → {7, 11, 4, 5}
Enzymes A and B → {1, 5, 2, 6, 7, 3, 3}
40
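[Figure: an ordering of the fragments that is consistent with all three digests]
A: 3, 8, 6, 10
B: 4, 5, 11, 7
AB: 3, 1, 5, 2, 6, 3, 7
This problem is called the two digest problem
(also known as the double digest problem), which is NP-complete.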
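For an instance this small, an ordering can be found by exhaustive search over the fragment orders of the two single digests. The sketch below is a brute-force illustration only, exponential in the number of fragments; the function names are chosen here and do not come from the slides.

```python
from itertools import accumulate, permutations


def double_fragments(order_a, order_b):
    """Fragments produced when the cut sites of both orderings are combined."""
    total = sum(order_a)
    cuts = set(list(accumulate(order_a))[:-1]) | set(list(accumulate(order_b))[:-1])
    points = [0] + sorted(cuts) + [total]
    return [right - left for left, right in zip(points, points[1:])]


def double_digest(a, b, ab):
    """Return orderings of a and b whose combined cuts give the multiset ab."""
    if not (sum(a) == sum(b) == sum(ab)):
        return None
    target = sorted(ab)
    for order_a in sorted(set(permutations(a))):
        for order_b in sorted(set(permutations(b))):
            if sorted(double_fragments(order_a, order_b)) == target:
                return order_a, order_b
    return None


# The ordering A: 3, 8, 6, 10 and B: 4, 5, 11, 7 reproduces the AB fragments.
print(double_fragments((3, 8, 6, 10), (4, 5, 11, 7)))  # [3, 1, 5, 2, 6, 3, 7]
```

Calling `double_digest([6, 8, 3, 10], [7, 11, 4, 5], [1, 5, 2, 6, 7, 3, 3])` returns some consistent pair of orderings; a solution is generally not unique, since reversing both orderings, for example, gives another one.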
41
• TAA, TGA, or TAG.
• Do you know what they mean?
• End of Gene.
• Thank you for your patience. Have a good
conference.
42