Human Origins

Chapter 6
The Computational Foundations of
Genomics
Applying algorithms to analyze
genomics data
© 2005 Prentice Hall Inc. / A Pearson Education Company / Upper Saddle River, New Jersey 07458
Contents
 What are computational biology and
bioinformatics?
 Understanding computers and algorithms
 Sequence alignment
 Gene prediction
 Algorithms for analysis of phylogeny
 Analysis of microarray data
 Computer simulation
Computational Biology and
Bioinformatics
 Computational biology
 Development of computational methods to solve
problems in biology
 Bioinformatics
 Application of computational biology to analysis and
management of real molecular biology data
 Why do molecular biologists need computer science?
 Discrete nature of sequence data is ideal for analysis
using digital computers
 Size and complexity of genomics data make the data
impossible to analyze without computers
Algorithm
 An algorithm is a procedure (a finite set of
well-defined instructions) for accomplishing
some task
 A recipe to perform a task
 Algorithms often have steps that repeat
(iterate) or require decisions (such as logic or
comparison). Algorithms can be composed to
create more complex algorithms.
 Concept formalized in 1936 (Turing, Church)
A historical perspective
 The 1960s: the birth of
bioinformatics
 High-level computer
languages
 Protein sequence data
 Academic access to
computers
 Margaret Oakley
Dayhoff
 First protein database
 First program for
sequence assembly
IBM 7090 computer
Solving problems in computer science
 Necessary parameters for assessing the
difficulty of a computer science problem
 Algorithmic complexity
 Is the problem theoretically solvable?
 If so, what is the most efficient solution?
 Current state of computer technology
 Memory
 CPU speed
 Cost
Algorithmic problems
 Example: searching for a number in an unordered list
 If the list has N numbers, the average amount of time
the search will take will be proportional to N
 A more clever approach
 Place the numbers in order
 Do a binary search
 Step 1: Pick a number in the middle of the list
 Step 2: Restrict the search to the half that contains your
number
 Return to Step 1 until you find your number
 Time for this approach is proportional to log2(N)
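The two search strategies can be sketched as follows. This is a minimal illustration, not code from the chapter; the function names and the comparison counters are assumptions added to make the N versus log2(N) difference visible.

```python
def linear_search(values, target):
    """Scan a list front to back; returns (index, comparisons made)."""
    for i, v in enumerate(values):
        if v == target:
            return i, i + 1
    return -1, len(values)


def binary_search(sorted_values, target):
    """Repeatedly halve an ordered list; returns (index, comparisons made)."""
    lo, hi = 0, len(sorted_values) - 1
    steps = 0
    while lo <= hi:
        mid = (lo + hi) // 2           # Step 1: pick the middle of the range
        steps += 1
        if sorted_values[mid] == target:
            return mid, steps
        if sorted_values[mid] < target:
            lo = mid + 1               # Step 2: keep only the upper half
        else:
            hi = mid - 1               # Step 2: keep only the lower half
    return -1, steps


data = list(range(0, 2000, 2))         # 1000 numbers, already in order
print(linear_search(data, 1998))
print(binary_search(data, 1998))
```

For 1000 ordered numbers the binary search needs at most 10 comparisons, while the linear scan may need all 1000.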
The digital computer
 Represents everything in
a code of zeros and ones
 Computer architecture
 CPU
 Memory
 Input / Output
 Advantages of digital
computer
 Deterministic
 Minimization of noise
[Figure: computer architecture with Input, CPU, Memory, and Output]
The limitations of digital computers
 The limitations of digital computers are
conceptual, not just technological
 Digital computers are deterministic
 Incapable of truly random behavior
 Digital computers deal with strictly discrete
values
 Can only approximate continuous behavior
 Many interesting biological phenomena occur
in the continuous realm of space and time
Sequence databases
 What is a database?
 An indexed set of records
 Records retrieved using a query language
 Database technology is well established
 Examples of sequence databases
 GenBank
 Encompasses all publicly available nucleotide
sequences and their protein translations
 Protein Data Bank
 Contains 3-D structures of proteins
The client-server model
 The clients and servers
are software processes
 Clients request data
from servers
 Servers and clients can
reside on the same or
different machines
 Clients can act as
servers to other
processes and vice versa
[Figure: Web Browser, Web Server, BLAST Search Engine, Database]
Sequence alignment
 Sequence alignments
search for matches
between sequences
 Two broad classes of
sequence alignments
 Global (wide)
 Local (narrow)
 Alignment can be
performed between two
or more sequences
[Figure: QKESGPSSSYC and VQQESGLVRTTC compared by global alignment (full length) and by local alignment (ESG matched with ESG)]
The biological importance of sequence
alignment
 Sequence alignments assess the degree of
similarity between sequences
 Similar sequences suggest similar function
 Proteins with similar sequences are likely to
play similar biochemical roles
 Regulatory DNA sequences that are similar will
likely have similar roles in gene regulation
 Sequence similarity suggests evolutionary
history
 Fewer differences mean more recent divergence
The algorithmic problem of aligning
sequences
 Comparison of similar
sequences of similar
length is straightforward
 How does one deal with
insertions and gaps that
may hide true similarity?
 How does one interpret
minimal similarity?
 Are sequences actually
related?
 Is alignment by chance?
[Figure: three example pairs: QKESGPSRSYC / QQESGPVRSTC; RQQEPVRSTC / QQESGPVRSTC; QKGSYQEKGYC / QQESGPVRSTC]
Methods of sequence alignment
 Graphical methods: visual
 Dynamic-programming methods:
mathematically optimal but computationally slow
 Heuristic methods: fast and approximate, but usually
close to the optimal answer
Dot matrix analysis
 A graphical method
 Shows all possible
alignments
 Caveats
 Some guesswork in
picking parameters
 Window size
 Stringency
 Not as rigorous or
quantitative as other
methods
[Figure: dot matrix comparing RQQEPVRSTC (horizontal axis) with QQESGPVRSTC (vertical axis)]
Dot matrix analysis: a real example
[Figure: two dot plots. Panel 1: window size 1, stringency 1. Panel 2: window size 23, stringency 15. The panels differ in noise-to-signal ratio.]
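The roles of window size and stringency can be illustrated with a small text-based dot matrix. This is a sketch under assumptions: the function names are hypothetical, and a dot is drawn at the start position of each window in which at least `stringency` residues are identical.

```python
def dot_matrix(seq_a, seq_b, window=1, stringency=1):
    """Return the set of (i, j) start positions where a dot is drawn.

    A dot is placed at (i, j) when at least `stringency` of the `window`
    residues starting at seq_a[i] and seq_b[j] are identical.
    """
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(
                1 for k in range(window) if seq_a[i + k] == seq_b[j + k]
            )
            if matches >= stringency:
                dots.add((i, j))
    return dots


def render(seq_a, seq_b, dots):
    """Draw the matrix as text: seq_a across the top, seq_b down the side."""
    lines = ["  " + " ".join(seq_a)]
    for j, residue in enumerate(seq_b):
        row = ["*" if (i, j) in dots else "." for i in range(len(seq_a))]
        lines.append(residue + " " + " ".join(row))
    return "\n".join(lines)


a, b = "RQQEPVRSTC", "QQESGPVRSTC"
print(render(a, b, dot_matrix(a, b)))                          # window 1, stringency 1
print(render(a, b, dot_matrix(a, b, window=3, stringency=3)))  # stricter filter
```

With the stricter setting only the dots on the two diagonals (QQE and PVRSTC) survive, which is the noise reduction the parameters are meant to achieve.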
Devising a scoring system
 Scoring matrices allow biologists to quantify the
quality of sequence alignments
 Use different scoring matrices for different purposes
 Score for similar structural domains in proteins
 Score for evolutionary relationship
 Some popular scoring matrices
 PAM for evolutionary studies (Point Accepted
Mutation)
 BLOSUM for finding common motifs (BLOcks
SUbstitution Matrix)
An example of scoring
A sequence comparison

Sequence 1:   A   D   D   R   Q   C   E   R   A   D
Sequence 2:   A   Q   E   R   Q   E   C   Q   A   Q
Score:        4   0   2   5   5  -4  -4   1   4   0

Total score: 13

BLOSUM62 (excerpt)

      A   R   N   D   C   Q   E
A     4  -1  -2  -2   0  -1  -1
R    -1   5   0  -2  -3   1   0
N    -2   0   6   1  -3   0   0
D    -2  -2   1   6  -3   0   2
C     0  -3  -3  -3   9  -3  -4
Q    -1   1   0   0  -3   5   2
E    -1   0   0   2  -4   2   5
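The scoring of the example comparison can be reproduced with a short script. This is a sketch only: it hard-codes the seven BLOSUM62 rows shown on the slide (A, R, N, D, C, Q, E), and the names `BLOSUM62_SUBSET` and `column_scores` are hypothetical.

```python
# Rows and columns A, R, N, D, C, Q, E of BLOSUM62, as shown on the slide.
RESIDUES = "ARNDCQE"
ROWS = [
    [4, -1, -2, -2, 0, -1, -1],    # A
    [-1, 5, 0, -2, -3, 1, 0],      # R
    [-2, 0, 6, 1, -3, 0, 0],       # N
    [-2, -2, 1, 6, -3, 0, 2],      # D
    [0, -3, -3, -3, 9, -3, -4],    # C
    [-1, 1, 0, 0, -3, 5, 2],       # Q
    [-1, 0, 0, 2, -4, 2, 5],       # E
]
BLOSUM62_SUBSET = {
    (x, y): ROWS[i][j]
    for i, x in enumerate(RESIDUES)
    for j, y in enumerate(RESIDUES)
}


def column_scores(seq_a, seq_b):
    """Look up the substitution score for each aligned pair (no gaps)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("ungapped comparison needs equal-length sequences")
    return [BLOSUM62_SUBSET[(x, y)] for x, y in zip(seq_a, seq_b)]


scores = column_scores("ADDRQCERAD", "AQERQECQAQ")
print(scores, sum(scores))
```

The total score of an ungapped alignment is simply the sum of the per-column lookups.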
Dynamic programming (DP)
 Possibility of gaps (or insertions) makes number of
possible sequence alignments astronomical
 Dynamic programming makes sequence alignment
possible by abandoning low scoring alignments
among subsequences as the algorithm progresses
 Mathematically proven to provide optimal alignments
 DP algorithms for sequence alignment
 Needleman-Wunsch-Gotoh algorithm for global
alignments
 Smith-Waterman algorithm for local alignments
 DP alignment algorithms still too slow for searching
an entire sequence database
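A minimal sketch of the two DP recurrences follows. It makes simplifying assumptions that are not from the slide: identity scoring (match +1, mismatch -1) instead of a scoring matrix, and a linear gap penalty rather than the affine gap penalty of the Gotoh refinement. Only the optimal score is returned, not the alignment itself.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Score of the best global alignment (all of both sequences is used)."""
    prev = [j * gap for j in range(len(b) + 1)]       # first row: leading gaps
    for i in range(1, len(a) + 1):
        curr = [i * gap]                              # first column: leading gaps
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Score of the best local alignment (best-matching pair of substrings)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0, so an alignment can restart anywhere.
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
            best = max(best, curr[j])
        prev = curr
    return best


print(needleman_wunsch("QKESGPSSSYC", "VQQESGLVRTTC"))
print(smith_waterman("QKESGPSSSYC", "VQQESGLVRTTC"))
```

Each cell is computed once from three neighbors, so the work grows with the product of the two sequence lengths. That is manageable for one pair but, as noted, too slow for scanning an entire database.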
Heuristic methods with k-tuples
 Example: BLAST
 Using query sequence,
derive a list of words
(tuples) of length w
(e.g., 3)
 Keep high-scoring
matching words
 High-scoring words are
compared with
database sequences
 Sequences with many
matches to high-scoring
words are used
for final alignments
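The word-lookup idea can be sketched as below. This is a deliberate simplification and not the BLAST algorithm itself: it counts only exact word matches, whereas BLAST also scores similar words with a scoring matrix and then extends the hits. The database names `seq1` to `seq3` are hypothetical labels for sequences that appear earlier in this chapter.

```python
from collections import defaultdict


def words(seq, w=3):
    """All overlapping words (k-tuples) of length w in seq."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


def build_index(database, w=3):
    """Map each word to the (sequence name, offset) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for offset, word in enumerate(words(seq, w)):
            index[word].append((name, offset))
    return index


def rank_hits(query, index, w=3):
    """Count exact word matches per database sequence, best first."""
    counts = defaultdict(int)
    for word in words(query, w):
        for name, _offset in index.get(word, []):
            counts[name] += 1
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


database = {
    "seq1": "QQESGPVRSTC",
    "seq2": "QKGSYQEKGYC",
    "seq3": "RQQEPVRSTC",
}
index = build_index(database)
print(rank_hits("QKESGPSRSYC", index))
```

Because the index is built once, each query costs a handful of dictionary lookups instead of a full DP comparison against every database sequence; only the top-ranked sequences would go on to a final alignment.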
Statistical significance
 Chance alignments have no biological significance
 Statistical significance implies low probability of
generating a chance alignment
 Probability of long alignments increases with longer
sequences
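 The extreme-value distribution
 Describes the scores expected from chance alignments; used to
calculate the probability of a chance alignment (BLAST reports the
related E value, the expected number of chance hits)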
 Generated by calculating the scores resulting from
repeatedly scrambling one of the sequences being
compared
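The scrambling procedure can be sketched as an empirical test. This is an illustration under assumptions: it uses a simple identity-scored local alignment, a fixed random seed, and reports the fraction of shuffles that score at least as well as the real comparison, rather than fitting an extreme-value distribution to the shuffled scores.

```python
import random


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score (Smith-Waterman recurrence, linear gaps)."""
    best, prev = 0, [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
            best = max(best, curr[j])
        prev = curr
    return best


def shuffle_p_value(a, b, trials=200, seed=0):
    """Return (observed score, fraction of shuffles scoring at least as high)."""
    rng = random.Random(seed)
    observed = local_score(a, b)
    residues = list(b)
    at_least = 0
    for _ in range(trials):
        rng.shuffle(residues)              # scramble one of the two sequences
        if local_score(a, "".join(residues)) >= observed:
            at_least += 1
    # Add 1 to numerator and denominator so the estimate is never exactly 0.
    return observed, (at_least + 1) / (trials + 1)


print(shuffle_p_value("QKESGPSRSYC", "QQESGPVRSTC"))
print(shuffle_p_value("QKGSYQEKGYC", "QQESGPVRSTC"))
```

Shuffling preserves the composition and length of the sequence, so any score that the real order achieves but the shuffles rarely reach is unlikely to be due to chance.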
A practical example of sequence alignment
MASH-1, a transcription factor
BLAST results
Detailed BLAST results
A pairwise alignment with MASH-1
 HASH-2, a human homolog of MASH-1
 “+” indicates conservative amino acid substitution
 “–” indicates gap/insertion
 XXXX… shows areas of low complexity (common occurrence)
Multiple-sequence alignments
 Uses of multiple-sequence alignments
 Automated reconstruction of sequence fragments
 Phylogenetic analysis
 Identification of sequence families
 The problem of multiple-sequence alignment
 O(N^M) where N is the average sequence length and M is
the number of sequences being aligned (optimal
methods)
 Dynamic programming will work only for small M
 Heuristic methods are required
Some methods for global
multiple-sequence alignment
 Progressive methods
 Align most closely related sequences, and then less
related ones
 Use phylogenetic trees to quantify similarities
 Downside: poor results with distantly related sequences
 Iterative methods
 Start with progressive alignment
 Realign sequences after leaving one sequence out
 Add left-out sequence
 Repeat until acceptable alignment is achieved
 Probabilistic methods
 Hidden Markov models (discussed later)
Phylogenetic analysis
 Phylogenetic trees
 Describe evolutionary
relationships between
sequences
 Three common methods
 Maximum parsimony
 Distance
 Maximum likelihood
[Figure: human immunodeficiency viruses from around the world]
Comparison of methods for
phylogenetic analysis
 Maximum parsimony (machine input) (closely related sequences)
 Finds optimal tree (or trees) requiring minimum
number of substitutions to explain sequence variation
 Maximum likelihood (user input) (distantly related)
 Finds most probable tree
 Similar to maximum parsimony
 Distance (mix of close and distantly related)
 Compare pairs of sequences for number of differences
between them
 Use many methods to get consensus tree
Algorithmic complexity and
phylogenetic analysis
 Four steps
 Sequence alignment
 Substitution model (scoring matrix)
 Tree building
 Tree evaluation
 Tree building and evaluation are
computationally expensive
 Heuristic methods required in most cases
Gene prediction
 A problem of pattern recognition
 Algorithms look for features of genes:
 E.g., Splice sites, ORFs, starting methionine
 Identification of regulatory regions is difficult
 Statistical understanding of genes is ongoing
 Problems of this type require machine learning
algorithms, which learn the pattern from a
small training dataset
Central Dogma in Molecular Biology
Artificial neural networks
 Machine learning algorithms
that mimic the brain
 Connections between
“neurons” vary in strength
 Connection weights (wij)
(strength) change while
network is exposed to
training set
 Fully trained network
recognizes pattern in novel
input
 GRAIL
[Figure: a feed-forward neural network]
Hidden Markov models
 Can be used for machine
learning
 Units constitute
transition states
 Transitions depend only on the
current state, not on earlier history
 Many uses in genomics
 Gene prediction
 Multiple sequence
alignment
 Finding periodic
patterns
HMMs
 The example of a dishonest gambler is often used to illustrate this point.
The gambler may carry a loaded die that he or she occasionally substitutes
for a fair die, but not so often that the other players would notice. The fair
die has a one-in-six chance of showing any particular number. When using
the loaded die, a player will have a 50% chance of rolling a one and a 10%
chance of rolling any other number. It is in these types of situations that
stochastic models called hidden Markov models (HMMs) are useful,
because they take into account unknown (or hidden) states. For example,
exactly when the cheating gambler is using a fair or loaded die is hidden
from the other players, but insight may still be gained by looking at the
outcome of the cheater’s rolls. Three ones in a row are far more probable with
the loaded die (0.5^3 = 12.5%) than with the fair die ((1/6)^3, about 0.5%), so
such a run suggests that the loaded die is in use. Hidden Markov models
describe the probability of transitions
between hidden states, as well as the probabilities associated with each
state. In the example of the cheating gambler, an HMM would describe the
probabilities of rolling particular numbers given the loaded or fair die and
the probability that the dishonest gambler would switch from one die to
another.
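The three-ones comparison in the gambler example can be checked directly. The sketch below uses only the emission probabilities stated above (fair die: 1/6 per face; loaded die: 1/2 for a one, 1/10 for each other face); the names are hypothetical, and it ignores switching between dice.

```python
from fractions import Fraction

FAIR = {face: Fraction(1, 6) for face in range(1, 7)}
LOADED = {1: Fraction(1, 2), **{face: Fraction(1, 10) for face in range(2, 7)}}


def likelihood(rolls, die):
    """Probability of a run of rolls if the same die is used throughout."""
    p = Fraction(1)
    for roll in rolls:
        p *= die[roll]
    return p


rolls = [1, 1, 1]
p_loaded = likelihood(rolls, LOADED)
p_fair = likelihood(rolls, FAIR)
print(float(p_loaded), float(p_fair), p_loaded / p_fair)
```

The ratio of the two likelihoods says how strongly the rolls favor one die over the other; turning that into a probability that the loaded die is in use would also require the switching probabilities, which is what the full HMM supplies.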
HMMs continued

Hidden Markov models can be used to answer three types of questions. The first
type is the likelihood question: Given a particular HMM, what is the probability of
obtaining a particular outcome (e.g., rolling three ones)? The second type is the
decoding question: Given a particular HMM, what is the most likely sequence of
transitions between states for a particular outcome? In the case of the cheating
gambler, this sequence would be the order in which he or she transitioned from one
die to another. The third type is the learning question: Given a particular outcome
and set of assumptions about possible transition states, what are the best model
parameters (e.g., probabilities between transition states)? This third question allows
HMMs to be used for machine learning. The figure in the slide shows a simple
example of a hidden Markov model being used to account for the DNA sequence at
the bottom. Every HMM has a start and end state, denoted by the S and E,
respectively, in the slide. Hidden states lie between the start and end states. In the
figure, the squares are states, and the lines between them indicate the probability of
one state transitioning to another. The loops on the upper and lower states show the
probabilities associated with the state remaining the same. States transition back
and forth until the HMM reaches the end state. In this HMM, the top square
represents a state that has equal probabilities of generating A, G, C, or T. The
bottom state has probabilities of 0.1, 0.1, 0.1, and 0.7 of generating A, G, C, and T,
respectively.
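The likelihood and decoding questions can be sketched for the two-state DNA model described above. The emission probabilities are those given in the text; the start and transition probabilities are assumptions chosen for illustration (the slide's values are not reproduced here), and the explicit end state is omitted for brevity. The learning question (for example, Baum-Welch training) is not shown.

```python
STATES = ("top", "bottom")
START = {"top": 0.5, "bottom": 0.5}                      # assumed
TRANS = {                                                # assumed
    "top": {"top": 0.9, "bottom": 0.1},
    "bottom": {"top": 0.1, "bottom": 0.9},
}
EMIT = {                                                 # from the text
    "top": {"A": 0.25, "G": 0.25, "C": 0.25, "T": 0.25},
    "bottom": {"A": 0.1, "G": 0.1, "C": 0.1, "T": 0.7},
}


def forward(seq):
    """Likelihood question: P(seq | model), summed over all state paths."""
    alpha = {s: START[s] * EMIT[s][seq[0]] for s in STATES}
    for symbol in seq[1:]:
        alpha = {
            s: EMIT[s][symbol] * sum(alpha[r] * TRANS[r][s] for r in STATES)
            for s in STATES
        }
    return sum(alpha.values())


def viterbi(seq):
    """Decoding question: the single most probable state path for seq."""
    score = {s: START[s] * EMIT[s][seq[0]] for s in STATES}
    back = []                                 # back-pointers, one dict per step
    for symbol in seq[1:]:
        pointers, new_score = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda r: score[r] * TRANS[r][s])
            pointers[s] = best_prev
            new_score[s] = score[best_prev] * TRANS[best_prev][s] * EMIT[s][symbol]
        back.append(pointers)
        score = new_score
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):           # trace the best path backwards
        state = pointers[state]
        path.append(state)
    return path[::-1]


print(forward("ACGTTTTT"))
print(viterbi("ACGTTTTT"))
```

For long sequences the products become very small, so practical implementations usually work with log probabilities; that refinement is left out here to keep the recurrences readable.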
HMMs for gene prediction
 HMMs are trained on
sequences that are members
of known gene class
 HMM gives probability that
a particular sequence
belongs to the gene class
 Length of each bar indicates
probability
 The longer the bar, the higher
the probability
 Genscan
2000 human introns
Algorithms for secondary-structure
determination
 Chou-Fasman / GOR method
 Based on experimentally determined frequency
of amino acids in secondary structures
 Machine learning algorithms
 Neural networks
 Nearest-neighbor methods
 Trained on previously deduced structures to
detect amino acid patterns in secondary
structures
Analysis of microarray data
 Microarrays can measure the expression of
thousands of genes simultaneously
 Vast amounts of data require computers
 Types of analysis
 Gene-by-gene
 Method: Statistical techniques
 Categorizing groups of genes
 Method: Clustering algorithms
 Deducing patterns of gene regulation
 Method: Under development
Unsupervised techniques
 Make no assumptions
about how the data
should behave
 Cluster genes based on
similar patterns of gene
expression
 Examples
 Hierarchical clustering
 Principal components
analysis (PCA)
[Figure panels: hierarchical clustering; PCA]
Metrics for gene expression
 Need a method to
measure how similar
genes are based on
expression
 Examples
 Euclidean distance
 Pearson correlation
coefficient
[Figure panels: Euclidean distance; Pearson correlation coefficient]
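The two metrics can be sketched as follows. The expression profiles are made-up toy values (hypothetical, for illustration only), chosen so that the two metrics disagree about which genes are "similar".

```python
import math


def euclidean_distance(x, y):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_correlation(x, y):
    """Correlation of two profiles: +1 same shape, -1 opposite shape."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical expression levels across four conditions.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]   # same trend as gene_a, larger amplitude
gene_c = [4.0, 3.0, 2.0, 1.0]       # opposite trend, similar amplitude

for name, other in (("b", gene_b), ("c", gene_c)):
    print(name, euclidean_distance(gene_a, other), pearson_correlation(gene_a, other))
```

Euclidean distance is sensitive to absolute expression level, while the Pearson coefficient compares only the shape of the profile, so the choice of metric changes which genes end up clustered together.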
Supervised techniques
 Divide groups of genes
based on sample
properties
 Can predict sample
condition based on gene
expression pattern
 Examples
 Support vector
machine
 Nearest neighbor
[Figure panels: support vector machine; nearest neighbor]
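A nearest-neighbor classifier is the simpler of the two examples and can be sketched in a few lines. The training profiles and the labels "tumor" and "normal" are hypothetical toy data, and Euclidean distance is an assumed choice of metric.

```python
import math


def nearest_neighbor(training, sample):
    """Return the label of the training profile closest to `sample`.

    training: list of (expression_profile, label) pairs.
    """
    def distance(x, y):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

    _profile, label = min(training, key=lambda item: distance(item[0], sample))
    return label


# Hypothetical expression profiles (three genes) with known sample conditions.
training = [
    ([5.1, 0.2, 3.3], "tumor"),
    ([4.8, 0.4, 3.0], "tumor"),
    ([1.0, 2.9, 0.5], "normal"),
    ([1.2, 3.1, 0.7], "normal"),
]

print(nearest_neighbor(training, [5.0, 0.3, 3.1]))
print(nearest_neighbor(training, [1.1, 3.0, 0.6]))
```

Unlike the unsupervised methods, the sample labels are supplied in advance; the algorithm only has to decide which labeled group a new expression pattern resembles most.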
The usefulness of simulation
 Why simulate when you can experiment?
 Models involving many parameters may be
difficult to conceptualize without simulations
 A simulation may suggest ways of testing a
hypothesis
 Some experiments cannot be done in vivo, or in
vitro, and must therefore be done in silico
 If a simulation is good, it can be used in place
of more expensive or time-consuming
experiments, as with nuclear tests simulated by the US
Numerical methods
 Numerical methods are needed because of the
discrete nature of computers
 Differential equations are turned into
difference equations that deal with discrete
rather than continuous quantities
 Smaller steps lead to greater simulation
accuracy
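The conversion from a differential equation to a difference equation can be sketched with the forward Euler method. The example equation dx/dt = -k*x (simple exponential decay) is an assumption chosen because its exact solution, x(t) = x0*exp(-k*t), is known and the error can be measured.

```python
import math


def euler_decay(x0, k, t_end, steps):
    """Integrate dx/dt = -k*x from t = 0 to t_end in `steps` equal time steps."""
    dt = t_end / steps
    x = x0
    for _ in range(steps):
        x += dt * (-k * x)        # difference equation: x[n+1] = x[n] - k*x[n]*dt
    return x


exact = math.exp(-1.0)            # exact value for x0 = 1, k = 1, t_end = 1
for steps in (10, 100, 1000):
    approx = euler_decay(1.0, 1.0, 1.0, steps)
    print(steps, approx, abs(approx - exact))
```

The error shrinks roughly in proportion to the step size, but the number of calculations grows by the same factor, which is the trade-off behind the cost of accurate simulation.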
Examples of computer simulations in biology
 Gene regulatory
networks
 Simulations of cells
 Networks of neurons
 Population genetics
A model of gene regulation
Prospects for a fully simulated cell
Limitations of computer
simulation
 Algorithmic
 Computers can process only discrete values
 Simulating continuous behavior accurately often
requires an infeasible number of calculations
 Experimental
 Simulation only as good as data it is based on
 Critical data often missing from simulation
 Conceptual
 Overly complex simulations do not contribute to
understanding of a biological system
Summary
 Vast amounts of data require bioinformatics
 Bioinformatics methods are limited by the following:
 Algorithmic complexity of bioinformatics problems
 Computer hardware performance
 Heuristic methods used to get around these limitations
 Bioinformatics methods used in the following areas:
 Sequence alignment
 Phylogenetic-tree construction
 Gene prediction
 Secondary-structure determination
 Analysis of microarray data
 Simulation of biological systems