Human Origins - Bilkent University

Chapter 6
The Computational Foundations of
Genomics
Applying algorithms to analyze
genomics data
© 2005 Prentice Hall Inc. / A Pearson Education Company / Upper Saddle River, New Jersey 07458
Contents
 Sequence alignment
 Gene prediction
 Algorithms for analysis of phylogeny
 Analysis of microarray data
Computational Biology and
Bioinformatics
 Computational biology
 Development of computational methods to solve
problems in biology
 Bioinformatics
 Application of computational biology to analysis and
management of real data
 Why do biologists need computer science?
 Discrete nature of sequence data is ideal for analysis
using digital computers
 Size and complexity of genomics data make the data
impossible to analyze without computers
Algorithmic problems
 Example: searching for a number in an unordered list
 If the list has N numbers, the average amount of time
the search will take will be proportional to N
 A more clever approach
 Place the numbers in order
 Do a binary search
 Step 1: Pick a number in the middle of the list
 Step 2: Restrict the search to the half that contains your
number
 Return to Step 1 until you find your number
 Time for this approach is proportional to log2(N)
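The two search strategies above can be sketched in Python. This is a minimal illustration, not part of the original slides; the function names and the test list are assumptions chosen for the example.

```python
def linear_search(values, target):
    """Scan an unordered list; return (index, number of comparisons)."""
    comparisons = 0
    for i, value in enumerate(values):
        comparisons += 1
        if value == target:
            return i, comparisons
    return -1, comparisons


def binary_search(sorted_values, target):
    """Search a sorted list by halving; return (index, number of steps)."""
    lo, hi = 0, len(sorted_values) - 1
    steps = 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2              # Step 1: pick the middle element
        if sorted_values[mid] == target:
            return mid, steps
        if sorted_values[mid] < target:   # Step 2: keep the half that can hold the target
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, steps


numbers = list(range(0, 2048, 2))         # 1024 sorted even numbers (hypothetical data)
index, steps = binary_search(numbers, 1500)
print(index, steps)                        # steps stays near log2(1024) = 10
print(linear_search(numbers, 1500))        # comparisons grow with the position in the list
```

Binary search only works once the list is in order, which is why the slide lists sorting as the first step.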
The digital computer
 Represents everything in
a code of zeros and ones
 Computer architecture
 CPU
 Memory
 Input / Output
 Advantages of digital computer
 Deterministic
 Minimization of noise
[Figure: block diagram showing Input, CPU, Memory, and Output]
Sequence databases
 What is a database?
 An indexed set of records
 Records retrieved using a query language
 Database technology is well established
 Examples of sequence databases
 GenBank
 Encompasses all publicly available nucleotide sequences
and their protein translations
 Protein Data Bank
 Contains 3-D structures of proteins
The client-server model
 The clients and servers
are software processes
 Clients request data
from servers
 Servers and clients can
reside on the same or
different machines
 Clients can act as
servers to other
processes and vice versa
[Figure: chain of processes: Web Browser, Web Server, BLAST Search Engine, Database]
Sequence alignment
 Sequence alignments
search for matches
between sequences
 Two broad classes of
sequence alignments
 Global
 Local
 Alignment can be
performed between two
or more sequences
Global alignment:
QKESGPSSSYC
VQQESGLVRTTC
Local alignment:
ESG
ESG
The biological importance of sequence
alignment
 Sequence alignments assess the degree of
similarity between sequences
 Similar sequences suggest similar function
 Proteins with similar sequences are likely to
play similar biochemical roles
 Regulatory DNA sequences that are similar will
likely have similar roles in gene regulation
 Sequence similarity suggests evolutionary
history
 Fewer differences mean more recent divergence
The algorithmic problem of aligning
sequences
 Comparison of similar
sequences of similar
length is straightforward
 How does one deal with
insertions and gaps that
may hide true similarity?
 How does one interpret
minimal similarity?
 Are sequences actually
related?
 Is alignment by chance?
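Example pairs:
QKESGPSRSYC / QQESGPVRSTC (similar sequences of similar length)
RQQEPVRSTC / QQESGPVRSTC (insertions and gaps)
QKGSYQEKGYC / QQESGPVRSTC (minimal similarity)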
Methods of sequence alignment
 Graphical methods
 Dynamic-programming methods
 Heuristic methods
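As an illustration of the dynamic-programming class, the sketch below scores a global alignment (Needleman-Wunsch style) and a local alignment (Smith-Waterman style). The slides do not name these algorithms or a scoring scheme; the match, mismatch, and gap values here are assumptions picked for simplicity.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best score for aligning all of a with all of b (Needleman-Wunsch style)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best score over all pairs of substrings (Smith-Waterman style)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
            best = max(best, curr[j])
        prev = curr
    return best


print(global_score("QKESGPSSSYC", "VQQESGLVRTTC"))
print(local_score("QKESGPSSSYC", "VQQESGLVRTTC"))
```

Both functions fill a table one row at a time, so the running time is proportional to the product of the two sequence lengths.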
Dot matrix analysis
 A graphical method
 Shows all possible alignments
 Caveats
 Some guesswork in picking parameters
 Window size
 Stringency
 Not as rigorous or quantitative as other methods
[Figure: dot matrix with RQQEPVRSTC on the horizontal axis and QQESGPVRSTC on the vertical axis]
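A dot matrix with the two parameters named above can be sketched as follows. This is an illustrative reading of "window size" and "stringency" (a dot is drawn when at least `stringency` of the `window` residues match); the function names are hypothetical.

```python
def dot_matrix(seq_x, seq_y, window=1, stringency=1):
    """Return the set of (row, col) positions that receive a dot.

    A dot is placed at (i, j) when at least `stringency` of the `window`
    residues starting at seq_y[i] and seq_x[j] are identical.
    """
    dots = set()
    for i in range(len(seq_y) - window + 1):
        for j in range(len(seq_x) - window + 1):
            matches = sum(seq_y[i + k] == seq_x[j + k] for k in range(window))
            if matches >= stringency:
                dots.add((i, j))
    return dots


def render(seq_x, seq_y, dots):
    """Draw the matrix as text: seq_x across the top, seq_y down the side."""
    lines = ["  " + " ".join(seq_x)]
    for i, residue in enumerate(seq_y):
        row = " ".join("*" if (i, j) in dots else "." for j in range(len(seq_x)))
        lines.append(residue + " " + row)
    return "\n".join(lines)


horizontal = "RQQEPVRSTC"
vertical = "QQESGPVRSTC"
print(render(horizontal, vertical, dot_matrix(horizontal, vertical)))
# A larger window with a high stringency removes isolated chance matches.
strict_dots = dot_matrix(horizontal, vertical, window=3, stringency=3)
print(render(horizontal, vertical, strict_dots))
```

Diagonal runs of dots correspond to aligned segments; a break or shift in the diagonal suggests a gap.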
Dot matrix analysis: a real example
[Figure: two dot plots, one with window size 1 and stringency 1, the other with window size 23 and stringency 15]
Devising a scoring system
 Scoring matrices allow biologists to quantify
the quality of sequence alignments
 Use different scoring matrices for different
purposes
 Score for similar structural domains in proteins
 Score for evolutionary relationship
 Some popular scoring matrices
 PAM for evolutionary studies
 BLOSUM for finding common motifs
An example of scoring
BLOSUM62 (excerpt)
     A   R   N   D   C   Q   E
A    4  -1  -2  -2   0  -1  -1
R   -1   5   0  -2  -3   1   0
N   -2   0   6   1  -3   0   0
D   -2  -2   1   6  -3   0   2
C    0  -3  -3  -3   9  -3  -4
Q   -1   1   0   0  -3   5   2
E   -1   0   0   2  -4   2   5
A sequence comparison
A   D   D   R   Q   C   E   R   A   D
A   Q   E   R   Q   E   C   Q   A   Q
4   0   2   5   5  -4  -4   1   4   0
Total score: 13
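The scoring of the sequence comparison can be reproduced with a short Python sketch. Only the seven residues of the BLOSUM62 excerpt on this slide are included; the variable and function names are assumptions for illustration.

```python
RESIDUES = "ARNDCQE"
ROWS = [
    [4, -1, -2, -2, 0, -1, -1],   # A
    [-1, 5, 0, -2, -3, 1, 0],     # R
    [-2, 0, 6, 1, -3, 0, 0],      # N
    [-2, -2, 1, 6, -3, 0, 2],     # D
    [0, -3, -3, -3, 9, -3, -4],   # C
    [-1, 1, 0, 0, -3, 5, 2],      # Q
    [-1, 0, 0, 2, -4, 2, 5],      # E
]
# Look-up table keyed by residue pair, built from the excerpt above.
SCORES = {
    (a, b): ROWS[i][j]
    for i, a in enumerate(RESIDUES)
    for j, b in enumerate(RESIDUES)
}


def position_scores(seq1, seq2):
    """Score an ungapped comparison of two equal-length sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return [SCORES[(a, b)] for a, b in zip(seq1, seq2)]


per_position = position_scores("ADDRQCERAD", "AQERQECQAQ")
total = sum(per_position)
print(per_position, total)
```

Summing the ten per-position scores of this comparison gives a total of 13.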
Heuristic methods with k-tuples
 Example: BLAST
 Using query sequence,
derive a list of words
of length w (e.g., 3)
 Keep high-scoring
words
 High-scoring words are
compared with
database sequences
 Sequences with many
matches to high-scoring words are used
for final alignments
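The word-based filtering can be sketched as below. This is a simplification and not BLAST itself: it keeps every exact word of length w from the query instead of scoring words against a substitution matrix, and the three-entry database is hypothetical.

```python
from collections import defaultdict


def words(sequence, w=3):
    """All overlapping words (k-tuples) of length w in a sequence."""
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]


def build_index(database, w=3):
    """Map each word to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos, word in enumerate(words(seq, w)):
            index[word].append((name, pos))
    return index


def count_word_hits(query, database, w=3):
    """Count, per database sequence, how many query-word occurrences it contains."""
    index = build_index(database, w)
    hits = defaultdict(int)
    for word in set(words(query, w)):
        for name, _pos in index.get(word, []):
            hits[name] += 1
    return dict(hits)


# Hypothetical toy database; the first two entries reuse sequences from earlier slides.
database = {
    "seq1": "VQQESGLVRTTC",
    "seq2": "RQQEPVRSTC",
    "seq3": "MKKLLPTAAA",
}
hits = count_word_hits("QQESGPVRSTC", database)
print(hits)
```

Sequences with the most word hits would then be passed to a slower, more rigorous alignment step.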
Statistical significance
 Chance alignments have no biological significance
 Statistical significance implies low probability of
generating a chance alignment
 Probability of long alignments increases with longer
sequences
 The extreme-value distribution
 Used to calculate the probability of chance alignment
 Generated by calculating the scores resulting from
repeatedly scrambling one of the sequences being
compared
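The scrambling procedure can be sketched as a permutation test. Note the simplification: instead of fitting an extreme-value distribution to the shuffled scores, this sketch reports the fraction of shuffles that score at least as well as the real comparison. The scoring values, trial count, and seed are assumptions.

```python
import random


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score (Smith-Waterman style, simple scoring)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
            best = max(best, curr[j])
        prev = curr
    return best


def shuffle_p_value(a, b, trials=200, seed=0):
    """Estimate how often a scrambled copy of b scores as well as b itself."""
    rng = random.Random(seed)          # fixed seed so the sketch is repeatable
    observed = local_score(a, b)
    residues = list(b)
    at_least_as_good = 0
    for _ in range(trials):
        rng.shuffle(residues)          # scrambling keeps composition, destroys order
        if local_score(a, "".join(residues)) >= observed:
            at_least_as_good += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return observed, (at_least_as_good + 1) / (trials + 1)


observed, p_value = shuffle_p_value("QQESGPVRSTC", "QQESGPVRSTC")
print(observed, p_value)
```

A small estimated probability indicates that the observed score is unlikely to arise from sequence composition alone.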
A practical example of sequence alignment
BLAST results
Detailed BLAST results
A pairwise alignment with MASH-1
 HASH-2, a human homolog of MASH-1
 “+” indicates conservative amino acid substitution
 “–” indicates gap/insertion
 XXXX… shows areas of low complexity
Phylogenetic analysis
 Phylogenetic trees
 Describe evolutionary
relationships between
sequences
 Three common methods
 Maximum parsimony
 Distance
 Maximum likelihood
Gene prediction
 A problem of pattern recognition
 Algorithms look for features of genes:
 E.g., Splice sites, ORFs, starting methionine
 Identification of regulatory regions is difficult
 Statistical understanding of genes is ongoing
 Problems of this type require machine learning
algorithms
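One of the gene features listed above, the open reading frame (ORF), can be located with a simple scan. The sketch below is a deliberately naive illustration, not a gene predictor: it checks only the forward strand, ignores splicing, and uses a made-up DNA string.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end, sequence) for forward-strand ORFs.

    An ORF here runs from an ATG (starting methionine) to the first
    in-frame stop codon; min_codons counts both of those codons.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None
    return orfs


example = "GGATGAAACCCTAGTT"   # hypothetical sequence
print(find_orfs(example))
```

Real gene finders combine several such signals statistically, which is where the pattern-recognition methods mentioned on the slide come in.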
Analysis of microarray data
 Microarrays can measure the expression of
thousands of genes simultaneously
 Vast amounts of data require computers
 Types of analysis
 Gene-by-gene
 Method: Statistical techniques
 Categorizing groups of genes
 Method: Clustering algorithms
 Deducing patterns of gene regulation
 Method: Under development
Normalization of Microarray Data
 To make arrays comparable:
 Assume the total intensity from one RNA pool is the
same as that from another (e.g., growth-arrested cells vs.
dividing cells).
 Take the median value of all the spot intensities on an array
and subtract it from each spot’s own intensity.
 This is known as global normalization.
Example Data
Gene                          Arrest1    Arrest2    Arrest3    Control1   Control2   Control3
FSTL1                         10.128     10.055     10.036     9.7809     9.8175     9.887
AACS                          8.548685   8.491826   8.548685   8.70205    8.835956   8.776186
RPS11                         10.694     10.951     10.578     10.858     11.028     11.034
ELMO2                         9.191341   9.103312   9.044376   9.107896   9.126183   8.935463
PNMA1                         9.019924   8.801906   8.81503    8.654899   8.838787   8.957624
C6orf162                      6.1991     6.1832     6.2037     6.1943     6.3391     6.254
RPS17                         13.165     13.137     13.242     13.273     13.317     13.362
MMP2                          11.90253   11.90142   11.94042   11.51725   11.56317   11.63513
EIF2B5                        7.822947   8.082848   7.864379   8.260006   8.200409   8.347466
PAX7                          5.76221    5.442757   5.867109   5.624901   5.543204   5.782179
VPS33A                        6.872039   6.881928   6.962075   6.809616   6.872039   7.025566
TFIP11                        6.5346     6.6533     6.7281     6.5774     6.7161     6.7118
PKNOX2                        6.5961     6.6745     6.7709     6.4398     6.3031     6.358
POLR2G                        10.61103   10.59466   10.58321   10.43322   10.63613   10.60713
MedianIntensity (14 genes)    8.784305   8.646866   8.681858   8.678474   8.837371   8.855825
MedianIntensity (3269 genes)  7.70135    7.699      7.7101     7.7232     7.79113    7.824586
Log data normalized to total median intensity (Log2Ratio normalized)
Gene       Arrest1    Arrest2    Arrest3    Control1   Control2   Control3
FSTL1      2.428      2.355      2.326      2.0609     1.8175     2.087
AACS       0.848685   0.791826   0.838685   0.98205    0.835956   0.976186
RPS11      2.994      3.251      2.868      3.138      3.028      3.234
ELMO2      1.491341   1.403312   1.334376   1.387896   1.126183   1.135463
PNMA1      1.319924   1.101906   1.10503    0.934899   0.838787   1.157624
C6orf162   -1.5009    -1.5168    -1.5063    -1.5257    -1.6609    -1.546
RPS17      5.465      5.437      5.532      5.553      5.317      5.562
MMP2       4.202527   4.201419   4.230415   3.797249   3.563173   3.835128
EIF2B5     0.122947   0.382848   0.154379   0.540006   0.200409   0.547466
PAX7       -1.93779   -2.257243  -1.842891  -2.095099  -2.456796  -2.017821
VPS33A     -0.827961  -0.818072  -0.747925  -0.910384  -1.127961  -0.774434
TFIP11     -1.1654    -1.0467    -0.9819    -1.1426    -1.2839    -1.0882
PKNOX2     -1.1039    -1.0255    -0.9391    -1.2802    -1.6969    -1.442
POLR2G     2.911029   2.894659   2.873206   2.713224   2.636129   2.807134
Worked examples (Arrest1, median rounded to 7.7):
FSTL1: 10.128 - 7.7 = 2.428
PKNOX2: 6.5961 - 7.7 = -1.1039
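The median subtraction can be sketched in Python using the Arrest1 column of the example data. The function name is an assumption; the reference value 7.7 is the rounded whole-array median (3269 genes) that the worked examples use, since the full array is not listed here.

```python
from statistics import median


def global_normalize(intensities, reference=None):
    """Subtract a reference median from every log intensity on one array.

    If no reference is given, the median of the supplied values is used.
    """
    if reference is None:
        reference = median(intensities.values())
    return {gene: value - reference for gene, value in intensities.items()}


# Arrest1 log intensities for the 14 example genes.
arrest1 = {
    "FSTL1": 10.128, "AACS": 8.548685, "RPS11": 10.694, "ELMO2": 9.191341,
    "PNMA1": 9.019924, "C6orf162": 6.1991, "RPS17": 13.165, "MMP2": 11.90253,
    "EIF2B5": 7.822947, "PAX7": 5.76221, "VPS33A": 6.872039, "TFIP11": 6.5346,
    "PKNOX2": 6.5961, "POLR2G": 10.61103,
}

median_14_genes = median(arrest1.values())            # median of the 14 listed genes
normalized = global_normalize(arrest1, reference=7.7)  # whole-array median, rounded
print(round(median_14_genes, 6))
print(round(normalized["FSTL1"], 3), round(normalized["PKNOX2"], 4))
```

Because the values are already on a log scale, subtracting the median corresponds to dividing the raw intensities by a common factor.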
Differentially Expressed Genes (DEGs)
 The difference between two groups of samples
(arrays from tumor vs. healthy tissue,
or arrays from growth-arrested cells vs. those
from asynchronously dividing cells) can be
estimated, and the genes whose mRNA
expression differs significantly can be
determined statistically.
Log2 Ratio
 1/2 = 0.5
 2/1 = 2
 log2(1/2) = -1
 log2(1) - log2(2) = -1
 log2(2/1) = 1
 log2(2) - log2(1) = 1
Average(Arrest)-Average(Control)
 Which genes are upregulated with respect to
control in the arrest phenotype?
 Which genes are downregulated with respect to
control in the arrest phenotype?
Log2Ratio = Average(Arrest) - Average(Control), computed on the log-normalized values; FoldChange = 2^Log2Ratio
Gene       Log2Ratio   FoldChange
FSTL1      0.3812      1.30
AACS       -0.104998   0.93
RPS11      -0.095667   0.94
ELMO2      0.193162    1.14
PNMA1      0.198517    1.15
C6orf162   0.069533    1.05
RPS17      0.000667    1.00
MMP2       0.479603    1.39
EIF2B5     -0.209236   0.86
PAX7       0.177264    1.13
VPS33A     0.139607    1.10
TFIP11     0.1069      1.08
PKNOX2     0.4502      1.37
POLR2G     0.174136    1.13
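The Log2Ratio and FoldChange columns can be reproduced in a few lines of Python. This is a minimal sketch using the normalized FSTL1 values; the function names are assumptions.

```python
from statistics import mean


def log2_ratio(arrest, control):
    """Difference of group means; valid because the inputs are log2 values."""
    return mean(arrest) - mean(control)


def fold_change(ratio):
    """Convert a log2 ratio back to a linear fold change."""
    return 2 ** ratio


fstl1_arrest = [2.428, 2.355, 2.326]      # Arrest1-3, normalized
fstl1_control = [2.0609, 1.8175, 2.087]   # Control1-3, normalized

ratio = log2_ratio(fstl1_arrest, fstl1_control)
fc = fold_change(ratio)
print(round(ratio, 4), round(fc, 2))
```

A positive Log2Ratio (FoldChange above 1) means higher expression in the arrest samples; a negative one (FoldChange below 1) means lower expression.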
Are these FoldChanges Significant?
 Very basic statistics: t-test between two groups
Gene       Log2Ratio   FoldChange   p-val
FSTL1      0.3812      1.30         0.013829
AACS       -0.104998   0.93         0.107932
RPS11      -0.095667   0.94         0.494612
ELMO2      0.193162    1.14         0.11733
PNMA1      0.198517    1.15         0.170156
C6orf162   0.069533    1.05         0.175971
RPS17      0.000667    1.00         0.994119
MMP2       0.479603    1.39         0.004977
EIF2B5     -0.209236   0.86         0.211312
PAX7       0.177264    1.13         0.3909
VPS33A     0.139607    1.10         0.258185
TFIP11     0.1069      1.08         0.248911
PKNOX2     0.4502      1.37         0.025928
POLR2G     0.174136    1.13         0.026329
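For readers without a spreadsheet, the two-group t-test can be sketched with the Python standard library alone. The assumptions are a two-tailed test with equal (pooled) variances, and a p-value obtained by numerically integrating the Student t density with Simpson's rule; a statistics package would normally be used instead.

```python
import math
from statistics import mean, variance


def t_pdf(x, df):
    """Density of the Student t distribution with df degrees of freedom."""
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coeff * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sample_t_test(a, b, steps=10000):
    """Return (t statistic, two-tailed p-value) assuming equal variances."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    # Simpson's rule for the area under the density between 0 and |t|
    # (steps must be even).
    upper = abs(t)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return t, 1 - 2 * area


fstl1_arrest = [2.428, 2.355, 2.326]
fstl1_control = [2.0609, 1.8175, 2.087]
t_stat, p_value = two_sample_t_test(fstl1_arrest, fstl1_control)
print(round(t_stat, 3), round(p_value, 4))
```

With only three arrays per group the test has little power, so a moderate fold change can still yield a large p-value.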
How to calculate a Log2Ratio in Excel?
 Type in =AVERAGE(I2:K2)-AVERAGE(L2:N2) for FSTL1
 Drag the cell from the bottom right corner down to fill in for the other
rows.
How to calculate FoldChange in Excel?
 Raise 2 to the power of the Log2Ratio column (=2^O2 for the FSTL1 gene)
 Drag the cell from the bottom right corner down to fill in for the other
rows.
How to do a t-test in Excel?
 Use the TTEST function from the statistical function library:
 Type in =TTEST(I2:K2,L2:N2,2,2) for FSTL1 (third argument 2 = two-tailed; fourth argument 2 = two-sample, equal variance)
Metrics for Gene Expression
 Euclidean distance
Calculation of Euclidean distance
 Calculate the Euclidean distance between
FSTL1 and AACS
Calculation of Euclidean distance
 The larger the Euclidean distance between two expression
profiles, the more different they are from each other
                    Arrest1    Arrest2    Arrest3    Control1   Control2   Control3   Sum
FSTL1               2.428      2.355      2.326      2.0609     1.8175     2.087
AACS                0.848685   0.791826   0.838685   0.98205    0.835956   0.976186
Squared difference  2.494234   2.443512   2.212104   1.163917   0.963428   1.233907   10.5111
FSTL1               2.428      2.355      2.326      2.0609     1.8175     2.087
RPS11               2.994      3.251      2.868      3.138      3.028      3.234
Squared difference  0.320356   0.802816   0.293764   1.160144   1.46531    1.315609   5.358
Euclidean distance = square root of the sum: about 3.242 (FSTL1 vs. AACS) and 2.315 (FSTL1 vs. RPS11)
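The calculation can be checked with a short Python sketch over the six normalized values per gene. The function name is an assumption.

```python
import math


def euclidean_distance(x, y):
    """Square root of the summed squared differences between two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Normalized expression across Arrest1-3 and Control1-3.
fstl1 = [2.428, 2.355, 2.326, 2.0609, 1.8175, 2.087]
aacs = [0.848685, 0.791826, 0.838685, 0.98205, 0.835956, 0.976186]
rps11 = [2.994, 3.251, 2.868, 3.138, 3.028, 3.234]

d_aacs = euclidean_distance(fstl1, aacs)
d_rps11 = euclidean_distance(fstl1, rps11)
print(round(d_aacs ** 2, 4), round(d_aacs, 3))     # summed squares, then distance
print(round(d_rps11 ** 2, 4), round(d_rps11, 3))
```

By this metric FSTL1 lies closer to RPS11 than to AACS, mostly because FSTL1 and RPS11 have similar overall expression levels.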
Correlation Coefficient
Gene pair          Sum of squared differences   Correlation
FSTL1 vs. AACS     10.5111                      -0.37
FSTL1 vs. RPS11    5.358                        -0.14
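The correlation values can be reproduced with a plain-Python Pearson correlation. This is a minimal sketch; the function name is an assumption.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Normalized expression across Arrest1-3 and Control1-3.
fstl1 = [2.428, 2.355, 2.326, 2.0609, 1.8175, 2.087]
aacs = [0.848685, 0.791826, 0.838685, 0.98205, 0.835956, 0.976186]
rps11 = [2.994, 3.251, 2.868, 3.138, 3.028, 3.234]

r_aacs = pearson(fstl1, aacs)
r_rps11 = pearson(fstl1, rps11)
print(round(r_aacs, 2), round(r_rps11, 2))
```

Unlike Euclidean distance, the correlation ignores overall expression level and compares only the shape of the two profiles across arrays.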
Plot of Genes Across Conditions
[Figure: line plot of normalized expression (y axis from -3 to 6) for the 14 example genes across Arrest1, Arrest2, Arrest3, Control1, Control2, Control3]
Plot of Highly Significant Genes
Across Conditions
[Figure: line plot of 17 series (Series1 to Series17; y axis from 0 to 14) across Arrest1, Arrest2, Arrest3, Control1, Control2, Control3]
Plot of Highly Significant Gene
Clusters Across Conditions
Metrics for gene expression
 Need a method to
measure how similar
genes are based on
expression
 Examples
 Euclidean distance
 Pearson correlation
coefficient
[Figure: illustrations of Euclidean distance and Pearson correlation coefficient]
Unsupervised techniques
 Make no assumptions
about how the data
should behave
 Cluster genes based on
similar patterns of gene
expression
 Examples
 Hierarchical clustering
 Principal components
analysis (PCA)
[Figure: examples of hierarchical clustering and PCA]
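Hierarchical clustering can be sketched as repeated merging of the two closest clusters. The version below is one of several possible variants: it uses Euclidean distance with average linkage, both of which are assumptions, and it is applied to four genes from the normalized example data.

```python
import math
from itertools import combinations


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def average_linkage(c1, c2, profiles):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(profiles[a], profiles[b]) for a in c1 for b in c2)
    return total / (len(c1) * len(c2))


def hierarchical_clustering(profiles, n_clusters):
    """Merge the closest pair of clusters until n_clusters remain."""
    clusters = [(name,) for name in profiles]
    merges = []                      # record of merges, i.e. the tree structure
    while len(clusters) > n_clusters:
        c1, c2 = min(
            combinations(clusters, 2),
            key=lambda pair: average_linkage(pair[0], pair[1], profiles),
        )
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
        merges.append((c1, c2))
    return clusters, merges


# Normalized expression across Arrest1-3 and Control1-3.
profiles = {
    "RPS17": [5.465, 5.437, 5.532, 5.553, 5.317, 5.562],
    "MMP2": [4.202527, 4.201419, 4.230415, 3.797249, 3.563173, 3.835128],
    "PAX7": [-1.93779, -2.257243, -1.842891, -2.095099, -2.456796, -2.017821],
    "C6orf162": [-1.5009, -1.5168, -1.5063, -1.5257, -1.6609, -1.546],
}
clusters, merges = hierarchical_clustering(profiles, n_clusters=2)
print(clusters)
```

No sample labels are used at any point; the grouping emerges from the expression values alone, which is what makes the technique unsupervised.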
Supervised techniques
 Divide groups of genes
based on sample
properties
 Can predict sample
condition based on gene
expression pattern
 Examples
 Support vector
machine
 Nearest neighbor
[Figure: examples of a support vector machine and nearest neighbor classification]
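A nearest-neighbor classifier is the simplest of the supervised techniques to sketch. In this illustration each array is described by the normalized expression of two genes (FSTL1, MMP2); the choice of features, the split into training and query arrays, and the function names are assumptions.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def nearest_neighbor(query, training):
    """Return the label of the training sample closest to the query."""
    label, _vector = min(training, key=lambda item: euclidean(query, item[1]))
    return label


# (label, (FSTL1, MMP2)) for four arrays with known condition.
training = [
    ("Arrest", (2.428, 4.202527)),    # Arrest1
    ("Arrest", (2.355, 4.201419)),    # Arrest2
    ("Control", (2.0609, 3.797249)),  # Control1
    ("Control", (1.8175, 3.563173)),  # Control2
]

prediction_arrest3 = nearest_neighbor((2.326, 4.230415), training)
prediction_control3 = nearest_neighbor((2.087, 3.835128), training)
print(prediction_arrest3, prediction_control3)
```

Here the known sample conditions guide the analysis, in contrast to clustering, where they are never consulted.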
Summary
 Vast amounts of data require bioinformatics
 Bioinformatics methods are limited by the following:
 Algorithmic complexity of bioinformatics problems
 Computer hardware performance
 Heuristic methods are used to get around these limitations
 Bioinformatics methods are used in the following areas:
 Sequence alignment
 Phylogenetic-tree construction
 Gene prediction
 Secondary-structure determination
 Analysis of microarray data
 Simulation of biological systems