Co-Modelling and Conditional Modelling
[Figure: co-modelling of an observable (the sequence) together with an unobservable (the structure), including an RNA structure drawn over the bases A, C, G, U. Cited work: Goldman, Thorne & Jones, 96; Knudsen.., 99; Eddy & co.; Meyer and Durbin 02; Pedersen …, 03; Siepel & Haussler 03; Pedersen, Meyer, Forsberg…, Simmonds 2004a,b]
AGGTATATAATGCG..... Pcoding{ATG-->GTG} or
AGCCATTTAGTGCG..... Pnon-coding{ATG-->GTG}
• Conditional Modelling
$P(\mathrm{Sequence} \mid \mathrm{Structure})\, P(\mathrm{Structure}) = P(\mathrm{Structure} \mid \mathrm{Sequence})\, P(\mathrm{Sequence})$
The sequence is observable; the structure is unobservable.
McCauley …., Firth & Brown
Footprinting - Signals (Blanchette)
Needs:
i. P(Sequence | Structure)
ii. P(Structure)
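The two ingredients listed under "Needs" are enough to compute a posterior over structures by normalising their product. A minimal Python sketch, assuming only two candidate structures (coding and non-coding); the function name `posterior` and all numbers are made up for illustration and do not come from the slides:

```python
def posterior(prior, likelihood):
    """Return P(structure | sequence) from P(structure) and P(sequence | structure)."""
    joint = {s: prior[s] * likelihood[s] for s in prior}
    evidence = sum(joint.values())  # P(sequence), summed over all structures
    return {s: joint[s] / evidence for s in joint}

# Hypothetical numbers: coding sequence is rare a priori, but the observed
# substitution pattern is more probable under the coding model.
prior = {"coding": 0.02, "non-coding": 0.98}
likelihood = {"coding": 0.009, "non-coding": 0.001}
post = posterior(prior, likelihood)
print(post)
```

Even with a ninefold likelihood advantage, the low prior keeps the coding posterior modest, which is why P(Structure) matters as much as P(Sequence | Structure).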
Grammars: Finite Set of Rules for Generating Strings
A grammar works on strings of ordinary letters and variables. It consists of:
i. A starting symbol.
ii. A set of substitution rules applied to variables in the present string.
A string is finished when it contains no variables.
Classes of grammar, from most to least general: General (also erasing), Context Sensitive, Context Free, Regular.
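The definition above (a starting symbol plus substitution rules applied to variables until none remain) can be sketched in a few lines of Python. The grammar below is a hypothetical context-free example chosen for illustration: upper-case characters are variables, lower-case characters are ordinary letters, and the rules generate a base-paired stem around a fixed loop. None of these rules come from the slides.

```python
# Hypothetical context-free grammar: S builds complementary pairs, L emits a loop.
RULES = {
    "S": ["aSu", "uSa", "gSc", "cSg", "L"],
    "L": ["gaaa"],
}

def is_finished(string):
    """A string is finished when it contains no variables."""
    return not any(ch in RULES for ch in string)

def derive(start, choices):
    """Apply one rule per step to the leftmost variable; return every intermediate string."""
    string = start
    history = [string]
    for choice in choices:
        pos = next((i for i, ch in enumerate(string) if ch in RULES), None)
        if pos is None:
            raise ValueError("no variable left to substitute")
        replacement = RULES[string[pos]][choice]
        string = string[:pos] + replacement + string[pos + 1:]
        history.append(string)
    return history

history = derive("S", [2, 0, 4, 0])
print(" -> ".join(history))
```

Because each rule rewrites a single variable regardless of its neighbours, this example sits in the context-free class; restricting rules to the forms `X -> aY` or `X -> a` would make it regular.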
Ab Initio Gene prediction
Ab initio gene prediction: prediction of the location of genes (and the
amino acid sequences they encode) given a raw DNA sequence.
....tttttgcagtactcccgggccctctgttggggcctccccttcctctccagggtggagtcgaggaggcggggtgcgggcctccttatctctagagccggccctggctctctggcgcg
gggccccttagtccgggctttttgccatggggtctctgttccctctgtcgctgctgttttttttggcggccgcctacccgggagttgggagcgcgctgggacgccggactaagcgggcgc
aaagccccaagggtagccctctcgcgccctccgggacctcagtgcccttctgggtgcgcatgagcccggagttcgtggctgtgcagccggggaagtcagtgcagctcaattgcagcaaca
gctgtccccagccgcagaattccagcctccgcaccccgctgcggcaaggcaagacgctcagagggccgggttgggtgtcttaccagctgctcgacgtgagggcctggagctccctcgcgc
actgcctcgtgacctgcgcaggaaaaacacgctgggccacctccaggatcaccgcctacagtgagggacaggggctcggtcccggctggggtgaggggagggggctggaagaggtggggg
aagggtagttgacagtcgctctatagggagcgcccgcggacctcactcagaggctcccccttgccttagaaccgccccacagcgtgattttggagcctccggtcttaaagggcaggaaat
acactttgcgctgccacgtgacgcaggtgttcccggtgggctacttggtggtgaccctgaggcatggaagccgggtcatctattccgaaagcctggagcgcttcaccggcctggatctgg
ccaacgtgaccttgacctacgagtttgctgctggaccccgcgacttctggcagcccgtgatctgccacgcgcgcctcaatctcgacggcctggtggtccgcaacagctcggcacccatta
cactgatgctcggtgaggcacccctgtaaccctggggactaggaggaagggggcagagagagttatgaccccgagagggcgcacagaccaagcgtgagctccacgcgggtcgacagacct
ccctgtgttccgttcctaattctcgccttctgctcccagcttggagccccgcgcccacagctttggcctccggttccatcgctgcccttgtagggatcctcctcactgtgggcgctgcgt
acctatgcaagtgcctagctatgaagtcccaggcgtaaagggggatgttctatgccggctgagcgagaaaaagaggaatatgaaacaatctggggaaatggccatacatggtgg....
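Input data: the raw DNA sequence above.
Output: the same sequence, 5' to 3', annotated as exon, intron, or UTR and intergenic sequence. In the listing below, exons are in upper case; introns, UTR and intergenic sequence are in lower case.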
5'....tttttgcagtactcccgggccctctgttggggcctccccttcctctccagggtggagtcgaggaggcggggctgcgggcctccttatctctagagccggccctggctctctggcgcggggccccttagtccgggctttttgccATGGGGTCTCTGTTC
CCTCTGTCGCTGCTGTTTTTTTTGGCGGCCGCCTACCCGGGAGTTGGGAGCGCGCTGGGACGCCGGACTAAGCGGGCGCAAAGCCCCAAGGGTAGCCCTCTCGCG
CCCTCCGGGACCTCAGTGCCCTTCTGGGTGCGCATGAGCCCGGAGTTCGTGGCTGTGCAGCCGGGGAAGTCAGTGCAGCTCAATTGCAGCAACAGCTGTCCCCAG
CCGCAGAATTCCAGCCTCCGCACCCCGCTGCGGCAAGGCAAGACGCTCAGAGGGCCGGGTTGGGTGTCTTACCAGCTGCTCGACGTGAGGGCCTGGAGCTCCCTC
GCGCACTGCCTCGTGACCTGCGCAGGAAAAACACGCTGGGCCACCTCCAGGATCACCGCCTACAgtgagggacaggggctcggtcccggctggggtgaggggagggggctggaagaggtggggaa
gggtagttgacagtcgctctatagggagcgcccgcggacctcactcagaggctcccccttgccttagAACCGCCCCACAGCGTGATTTTGGAGCCTCCGGTCTTAAAGGGCAGGAAATACACTTTGCGCT
GCCACGTGACGCAGGTGTTCCCGGTGGGCTACTTGGTGGTGACCCTGAGGCATGGAAGCCGGGTCATCTATTCCGAAAGCCTGGAGCGCTTCACCGGCCTGGATC
TGGCCAACGTGACCTTGACCTACGAGTTTGCTGCTGGACCCCGCGACTTCTGGCAGCCCGTGATCTGCCACGCGCGCCTCAATCTCGACGGCCTGGTGGTCCGCAA
CAGCTCGGCACCCATTACACTGATGCTCGgtgaggcacccctgtaaccctggggactaggaggaagggggcagagagagttatgaccccgagagggcgcacagaccaagcgtgagctccacgcgggtcgacagacctccctgtgtt
ccgttcctaattctcgccttctgctcccagCTTGGAGCCCCGCGCCCACAGCTTTGGCCTCCGGTTCCATCGCTGCCCTTGTAGGGATCCTCCTCACTGTGGGCG
CTGCGTACCTATGCAAGTGCCTAGCTATGAAGTCCCAGGCGTAAagggggatgttctatgccggctgagcgagaaaaagaggaatatgaaacaatctgg
ggaaatggccatacatggtgg.... 3'
Annotating genes
Despite all difficulties, protein-coding genes are among the easiest functional
elements to annotate. Several sources of information:
• Sequence features (ab-initio approaches)
– Coding exon contains no stop codons (open reading frame, ORF)
– Coding exons tend to reside in GC-rich regions
• Comparative information
– Similarity to known proteins in databases
– Similarity to other species; reduced mutation rates
• Experimental evidence for transcription
– cDNA sequences (complementary copy of spliced mRNA)
– ESTs (copies, a few hundred base pairs long, of the 5’ end of a (spliced) mRNA transcript)
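The first sequence feature in the list, that a coding exon contains no stop codons, is the easiest one to check mechanically. A minimal Python sketch that scans the three forward reading frames for stretches running from ATG to the next in-frame stop codon; the function name `find_orfs` and the toy sequence are assumptions for illustration, and a real gene finder would also handle the reverse strand and splicing:

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=1):
    """Return (start, end) pairs of ATG...stop open reading frames on the forward strand.

    `end` is exclusive and includes the stop codon; `min_codons` counts the
    codons before the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                   # look for the next ATG
    return orfs

toy = "ccATGAAATTTGGGTAAcc"
orfs = find_orfs(toy)
print(orfs, [toy[s:e] for s, e in orfs])
```

In eukaryotic sequence the ORF signal alone is weak, because exons can be as short as 50 nt and such short stop-free stretches occur by chance; this is one reason the other sources of information in the list are combined with it.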
Annotating genes
What makes annotating protein-coding genes so difficult?
• Gene density in human genome is low
– 1-2% are coding exons, some of which are small (50 nt)
– Introns may be very large (100 kb)
• Alternative splicing
– Several promoters
– Several alternative transcripts
• Pseudogenes
– Genes may lose functionality (e.g. after duplication). Recently degenerated genes are especially hard to spot.
– Mature (spliced) transcript may be reverse transcribed. These are often easy to spot (no introns; poly-A tail).
HMM Examples
Gene Finding: Simple Prokaryotic
Gene Finding: Simple Eukaryotic
• Intron length > 50 bp required for splicing
• Length distribution is not geometric
Genscan (Burge and Karlin, 1996)
[Figure: Genscan state diagram; the legend marks states with a length distribution]
States:
• Initial exon
• Exons of phase 0, 1 or 2
• Introns of phase 0, 1 or 2
• Terminal exon
• Exon of single exon gene
• 5' UTR
• Promoter
• 3' UTR
• Poly-A signal
• Intergenic sequence
Omitted: reverse strand part of the HMM
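The remark that the length distribution is not geometric matters because an ordinary HMM state with a self-transition always produces geometric lengths, so the shortest lengths are the most probable. A minimal Python sketch of that implied distribution; the function name `geometric_length_pmf` and the value 0.9 for the self-transition probability are assumptions for illustration:

```python
def geometric_length_pmf(p_stay, max_len):
    """P(L = k), k = 1..max_len, for a state with self-transition probability p_stay."""
    return [p_stay ** (k - 1) * (1.0 - p_stay) for k in range(1, max_len + 1)]

pmf = geometric_length_pmf(0.9, 2000)
mean_length = sum(k * p for k, p in enumerate(pmf, start=1))  # close to 1 / (1 - p_stay)
prob_short = sum(pmf[:49])                                    # P(L < 50)
print(mean_length, prob_short)
```

Whatever `p_stay` is chosen, length 1 has the highest probability, which conflicts with a minimum intron length of about 50 bp. States with their own explicit length distribution, as in the Genscan diagram, avoid this.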
Gene Finding & Protein Homology
(Gelfand, Mironov & Pevzner, 1996)
Spliced Alignment:
1. Define set of potential exons in new genome.
2. Make exon ordering graph - EOG.
3. Align EOG to protein database.
[Figure: exon ordering graph aligned against a protein database; example alignment TYGHLP / TY--LPM]
Comparative Gene Annotation
AGGTATATAATGCG..... Pcoding{ATG-->GTG} or
AGCCATTTAGTGCG..... Pnon-coding{ATG-->GTG}
Simultaneous Alignment & Gene Finding
Bafna & Huson, 2000; T. Scharling, 2001; Blayo, 2002.
Align by minimizing distance / maximizing similarity.
Align genes with known/unknown structure.
~5% of the Human genome is under conservation (Chiaromonte et al.)
[Figure: curves for the Whole Genome, the ARs (ancestral repeats), and (Whole Genome – ARs)]
Due to this work, people often say that 5% of the genome is constrained.
From Caleb Webber & Gerton Lunter
Percentage of Genome under Purifying Selection
CGACATTAA--ATAGGCATAGCAGGACCAGATACCAGATCAAAGGCTTCAGGCGCA
CGACGTTAACGATTGGC---GCAGTATCAGATACCCGATCAAAG----CAGACGCA
Consider the lengths of inter-gap segments: do they follow a geometric distribution?
[Figure: two panels of log10 counts against inter-gap distance (nucleotides); weighted regression, R2 > 0.9995]
Overrepresentation of long inter-gap distances: reduced indel rate due to indel-purifying selection.
At most, only 0.09% of all ARs are under selection.
From Caleb Webber & Gerton Lunter
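The inter-gap segments of the two-row alignment shown on this slide can be extracted directly. A minimal Python sketch, assuming a gap in either row ends a segment; the function names are hypothetical, and the geometric fit is a plain maximum-likelihood estimate rather than the weighted regression used for the figure:

```python
def inter_gap_lengths(row_a, row_b, gap="-"):
    """Lengths of maximal runs of alignment columns with no gap in either row."""
    lengths, run = [], 0
    for x, y in zip(row_a, row_b):
        if x == gap or y == gap:
            if run:
                lengths.append(run)
            run = 0
        else:
            run += 1
    if run:
        lengths.append(run)
    return lengths

def geometric_mle(lengths):
    """MLE of p under P(L = k) = (1 - p)^(k - 1) * p, i.e. 1 / mean length."""
    return len(lengths) / sum(lengths)

row_a = "CGACATTAA--ATAGGCATAGCAGGACCAGATACCAGATCAAAGGCTTCAGGCGCA"
row_b = "CGACGTTAACGATTGGC---GCAGTATCAGATACCCGATCAAAG----CAGACGCA"
lengths = inter_gap_lengths(row_a, row_b)
print(lengths, geometric_mle(lengths))
```

Under a geometric model the log of the counts falls linearly with segment length, so an excess of long segments relative to the fitted line is the signature of reduced indel rates. Note that the first and last segments are cut off by the ends of the alignment, so a careful analysis would treat them separately.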
Finding Regulatory Signals in Genomes
• Searching for a known signal in one sequence
• Searching for an unknown signal common to a set of unrelated sequences
• Searching for conserved segments in homologous sequences
Challenges
• Combining homologous and non-homologous analysis
• Merging annotations [Figure: mouse, pig, human]
• Predicting signal-regulatory protein relationships
Weight Matrices & Sequence Logos
Set of signal sequences (N = 8 sites, w = 14 positions):
     1  2  3  4  5  6  7  8  9 10 11 12 13 14
1    G  A  C  C  A  A  A  T  A  A  G  G  C  A
2    G  A  C  C  A  A  A  T  A  A  G  G  C  A
3    T  G  A  C  T  A  T  A  A  A  A  G  G  A
4    T  G  A  C  T  A  T  A  A  A  A  G  G  A
5    T  G  C  C  A  A  A  A  G  T  G  G  T  C
6    C  A  A  C  T  A  T  C  T  T  G  G  G  C
7    C  A  A  C  T  A  T  C  T  T  G  G  G  C
8    C  T  C  C  T  T  A  C  A  T  G  G  G  C
Position Frequency Matrix - PFM
$f_{b,i}$: number of b's in position i; $s(b)$: pseudo count.
     1  2  3  4  5  6  7  8  9 10 11 12 13 14
A    0  4  4  0  3  7  4  3  5  4  2  0  0  4
C    3  0  4  8  0  0  0  3  0  0  0  0  2  4
G    2  3  0  0  0  0  0  0  1  0  6  8  5  0
T    3  1  0  0  5  1  4  2  2  4  0  0  1  0
Consensus sequence: B R M C W A W H R W G G B M
Corrected probability: $p(b,i) = \frac{f_{b,i} + s(b)}{N + \sum_{b' \in \mathrm{nucleotides}} s(b')}$
Weight: $W_{b,i} = \log_2 \frac{p(b,i)}{p(b)}$
Score for new sequence $b_1 \dots b_w$: $S = \sum_{l=1}^{w} W_{b_l,l}$
Sequence Logo & Information content
$D_i = 2 + \sum_{b} p_{b,i} \log_2 p_{b,i}$
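The information content D_i that sets the column heights of a sequence logo can be computed per position from the eight signal sequences of this slide. A minimal Python sketch, assuming raw frequencies without pseudocounts and the convention 0 log 0 = 0; the function names are hypothetical:

```python
import math

SITES = [
    "GACCAAATAAGGCA", "GACCAAATAAGGCA", "TGACTATAAAAGGA", "TGACTATAAAAGGA",
    "TGCCAAAAGTGGTC", "CAACTATCTTGGGC", "CAACTATCTTGGGC", "CTCCTTACATGGGC",
]

def column_frequencies(sites, i):
    """Raw frequency of each base at position i (0-based)."""
    n = len(sites)
    return {b: sum(s[i] == b for s in sites) / n for b in "ACGT"}

def information_content(freqs):
    """D_i = 2 + sum_b p log2 p, in bits; terms with p = 0 contribute nothing."""
    return 2.0 + sum(p * math.log2(p) for p in freqs.values() if p > 0)

profile = [information_content(column_frequencies(SITES, i))
           for i in range(len(SITES[0]))]
print([round(d, 2) for d in profile])
```

A fully conserved position such as position 4 (all C) reaches the maximum of 2 bits, while a position split evenly between two bases carries 1 bit.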
Position Weight Matrix - PWM
PWM: $W_{b,i}$
         1      2      3      4      5      6      7      8      9     10     11     12     13     14
A    -1.93   0.79   0.79  -1.93   0.45   1.50   0.79   0.45   1.07   0.79   0.00  -1.93  -1.93   0.79
C     0.45  -1.93   0.79   1.68  -1.93  -1.93  -1.93   0.45  -1.93  -1.93  -1.93  -1.93   0.00   0.79
G     0.00   0.45  -1.93  -1.93  -1.93  -1.93  -1.93  -1.93  -0.66  -1.93   1.30   1.68   1.07  -1.93
T     0.45  -0.66  -1.93  -1.93   1.07  -0.66   0.79   0.00   0.00   0.79  -1.93  -1.93  -0.66  -1.93
Scoring a new sequence, position by position:
    T      T      A      C      A      T      A      A      G      T      A      G      T      C
 0.45  -0.66   0.79   1.68   0.45  -0.66   0.79   0.45  -0.66   0.79   0.00   1.68  -0.66   0.79
S = 5.23
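The chain from signal sequences to counts, to pseudocount-corrected probabilities, to log-odds weights and finally to a score can be sketched in Python as follows. The pseudocount s(b) = sqrt(N)/4 and the uniform background p(b) = 0.25 are assumptions: the slides do not state them, but they reproduce weights such as -1.93 and 1.68. The function names and the candidate sequence are chosen for illustration.

```python
import math

BASES = "ACGT"
SITES = [
    "GACCAAATAAGGCA", "GACCAAATAAGGCA", "TGACTATAAAAGGA", "TGACTATAAAAGGA",
    "TGCCAAAAGTGGTC", "CAACTATCTTGGGC", "CAACTATCTTGGGC", "CTCCTTACATGGGC",
]

def count_matrix(sites):
    """PFM: f[b][i] = number of times base b occurs at position i."""
    w = len(sites[0])
    return {b: [sum(s[i] == b for s in sites) for i in range(w)] for b in BASES}

def weight_matrix(counts, n_sites, background=0.25):
    """PWM: W[b][i] = log2(p(b,i) / p(b)), with assumed pseudocount sqrt(N)/4 per base."""
    s = math.sqrt(n_sites) / 4
    total = n_sites + 4 * s
    return {b: [math.log2((f + s) / total / background) for f in counts[b]]
            for b in BASES}

def score(pwm, seq):
    """Sum of the weights picked out by each base of seq."""
    return sum(pwm[b][i] for i, b in enumerate(seq))

counts = count_matrix(SITES)
pwm = weight_matrix(counts, len(SITES))
site_score = score(pwm, "TTACATAAGTAGTC")
print(round(site_score, 2))
```

Summing unrounded weights gives a slightly higher total than adding the two-decimal table entries by hand, so small differences from hand-computed scores are expected.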
Motifs in Biological Sequences
1990 Lawrence & Reilly “An Expectation Maximisation (EM) Algorithm for the Identification and Characterization of Common Sites in Unaligned Biopolymer Sequences” Proteins 7, 41-51.
1992 Cardon and Stormo “Expectation Maximisation Algorithm for Identifying Protein-binding Sites with Variable Lengths from Unaligned DNA Fragments” J. Mol. Biol. 223, 159-170.
1993 Lawrence… Liu “Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment” Science 262, 208-214.
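Model for K sequences $R = (R_1, \dots, R_K)$, each containing one window of width w:
$\Theta = (\theta_{1,A}, \dots, \theta_{w,T})$: probability of different bases in the window
$A = (a_1, \dots, a_K)$: positions of the windows
$\theta_0 = (\theta_A, \dots, \theta_T)$: background frequencies of nucleotides
$p(R \mid \theta_0, \Theta, A) = \theta_0^{h(R_{\{A\}^c})} \prod_{j=1}^{w} \theta_j^{h(R_{A+j-1})} = \theta_0^{h(R)} \prod_{j=1}^{w} \left( \frac{\theta_j}{\theta_0} \right)^{h(R_{A+j-1})}$
Here $h(\cdot)$ is the vector of nucleotide counts in the indicated set of positions, and $\theta^{h} = \prod_b \theta_b^{h_b}$.
Priors
A has uniform prior
$\theta_j$ has Dirichlet($N_0 \alpha$) prior, where $\alpha$ is the base frequency in the genome and $N_0$ is the number of pseudocounts.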
Natural Extensions to Basic Model I
Multiple Pattern Occurrences in the same sequences:
Liu, J. “The collapsed Gibbs sampler with applications to a gene regulation problem,” Journal of the American Statistical Association 89, 958-966.
Prior: any position i has a small probability $p_0$ to start a binding site:
$A = (a_1, \dots, a_k)$
$P(A) = p_0^{k} (1 - p_0)^{N-k}$ (with nonoverlapping constraints)
[Figure: a sequence of length n_L with a site of width w starting at a_k]
Composite Patterns:
BioOptimizer: the Bayesian Scoring Function Approach to Motif Discovery. Bioinformatics
Modified from Liu
Natural Extensions to Basic Model II
Correlation in Nucleotide Occurrence within a Motif:
Modeling within-motif dependence for transcription factor binding site predictions. Bioinformatics, 6, 909-916.
Insertion-Deletion:
BALSA: Bayesian algorithm for local sequence alignment. Nucl. Acids Res., 30, 1268-77.
[Figure: sequences 1 to K with motif windows of widths w1, w2, w3, w4]
Regulatory Modules:
De novo cis-regulatory module elicitation for eukaryotic genomes. Proc Nat’l Acad Sci USA, 102, 7079-84
[Figure: module HMM with states Start, M1, M2, M3 and Stop and transition probabilities such as p12 and p21, placed upstream of Gene A and Gene B]
Combining Signals and other Data
[Figure: motifs located relative to coding regions]
Expression and Motif Regression:
Integrating Motif Discovery and Expression Analysis. Proc. Natl. Acad. Sci. 100, 3339-44
1. Rank genes by E = log2(expression fold change)
2. Find “many” (hundreds) candidate motifs
3. For each motif pattern m, compute the vector $S_m$ of matching scores for genes with the pattern
4. Regress E on $S_m$:
$Y_g = \alpha + \beta_m S_{mg} + \epsilon_g$
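Step 4, regressing the expression values on the matching scores of one motif, is an ordinary least-squares fit with a single predictor. A minimal Python sketch under that assumption; the function name `fit_motif` and the toy numbers are hypothetical, and the published method goes on to select among many motifs, which is not shown here:

```python
def fit_motif(expression, scores):
    """Least-squares fit of Y_g = alpha + beta * S_mg + eps_g; returns (alpha, beta, r2)."""
    n = len(expression)
    mean_s = sum(scores) / n
    mean_e = sum(expression) / n
    sxx = sum((s - mean_s) ** 2 for s in scores)
    syy = sum((e - mean_e) ** 2 for e in expression)
    sxy = sum((s - mean_s) * (e - mean_e) for s, e in zip(scores, expression))
    beta = sxy / sxx
    alpha = mean_e - beta * mean_s
    r2 = sxy ** 2 / (sxx * syy)      # fraction of variance in E explained by S_m
    return alpha, beta, r2

# Toy data: log2 fold changes that rise exactly linearly with the motif score.
alpha, beta, r2 = fit_motif([1.0, 3.0, 5.0, 7.0], [0.0, 1.0, 2.0, 3.0])
print(alpha, beta, r2)
```

A motif whose scores explain a large share of the expression variance (high r2, slope clearly different from zero) is a candidate regulator of the ranked genes.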
ChIP-on-chip - 1-2 kb information on protein/DNA interaction:
An Algorithm for Finding Protein-DNA Interaction Sites with Applications to Chromatin Immunoprecipitation Microarray Experiments. Nature Biotechnology, 20, 835-39
[Figure: protein binding in the neighborhood of coding regions]
Modified from Liu
Phylogenetic Footprinting (homologous detection)
The term originated in 1988 in Tagle et al. Blanchette et al.: for unaligned sequences related by a phylogenetic tree, find all segments of length k with a history costing less than d. Motif loss is an option.
Recursions over the states begin, signal and end:
$D_i^{begin} = \min\{ D_{i,\cdot}^{begin} + d(i,\cdot) \}$
$D_i^{signal,1} = \min\{ D_{i,\cdot}^{begin} + d(i,\cdot) \}$
$D_i^{signal,j} = \min\{ D_{i,\cdot}^{signal,j-1} + d(i,\cdot) \}$
...
$D_i^{end} = \min\{ D_{i,\cdot}^{end} + d(i,\cdot) \}$
[Figure: state sequence begin, signal, end]
The Basics of Footprinting
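• Many aligned sequences related by a known phylogeny:
HMM over alignment positions 1 to n for sequences 1 to k, with a slow state (rate r_s) and a fast state (rate r_f).
• Two un-aligned sequences:
HMM: [Figure: two sequences with example alignment ATG / A-C]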
Statistical Alignment and Footprinting.
• Many un-aligned sequences related by a known phylogeny:
– Conceptually simple, computationally hard
– Dependent on a single alignment/no measure of uncertainty
Solution: Cartesian Product of HMMs
[Figure: sequences 1 to k, e.g. acgtttgaaccgag----, annotated by the product HMM]
SAPF - Statistical Alignment and Phylogenetic Footprinting
[Figure: sequences 1 and 2; labels: Target, Sum out, Annotate]
BigFoot
• Dynamic programming is too slow for more than 4-6 sequences
• MCMC integration is used instead; it works for up to 10-15 sequences
• For more sequences other methods are needed.
http://www.stats.ox.ac.uk/research/genome/software
FSA - Fast Statistical Alignment
Pachter, Holmes & Co
Data – k genomes/sequences: 1, 2, ..., k
[Figure: graph on the sequences showing a spanning tree and additional edges; an edge is a pairwise alignment, e.g. (1,2), (1,3), (2,3), (3,4), (3,k), (2,k), (1,4), (4,k)]
Iterative addition of homology statements to shrinking alignment:
Add the most certain homology statement from a pairwise alignment that is compatible with the present multiple alignment.
i. Conflicting homology statements cannot be added
ii. Some scoring on multiple sequence homology statements is used.
http://math.berkeley.edu/~rbradley/papers/manual.pdf
Rate of Molecular Evolution versus estimated Selective Deceleration
Neutral Process (rate matrix):
       A       C       G       T
A      -       qA,C    qA,G    qA,T
C      qC,A    -       qC,G    qC,T
G      qG,A    qG,C    -       qG,T
T      qT,A    qT,C    qT,G    -
Neutral Equilibrium: (pA,pC,pG,pT)
Selected Process (rate matrix):
       A       C       G       T
A      -       q’A,C   q’A,G   q’A,T
C      q’C,A   -       q’C,G   q’C,T
G      q’G,A   q’G,C   -       q’G,T
T      q’T,A   q’T,C   q’T,G   -
Observed Equilibrium: (pA,pC,pG,pT)’
How much selection? Selection => deceleration
Halpern and Bruno (1998) “Evolutionary Distances for Protein-Coding Sequences” MBE 15.7.910- & Moses et al. (2003) “Position specific variation in the rate of evolution of transcription binding sites” BMC Evolutionary Biology 3.19-
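The step from the neutral rates q and the observed equilibrium to the selected rates q’ can be sketched with the Halpern-Bruno relation, which is not written out on the slide and is stated here from the cited paper: q’(a,b) = q(a,b) · ln(x) / (1 − 1/x), with x = p’(b) q(b,a) / (p’(a) q(a,b)). The Python below is a minimal illustration; the Jukes-Cantor neutral rates, the equilibrium frequencies and the function name are assumptions:

```python
import math

BASES = "ACGT"

def halpern_bruno(q, pi_sel):
    """Selected rates q'[(a, b)] from neutral rates q[a][b] and observed equilibrium pi_sel."""
    q_sel = {}
    for a in BASES:
        for b in BASES:
            if a == b:
                continue
            x = (pi_sel[b] * q[b][a]) / (pi_sel[a] * q[a][b])
            # The factor tends to 1 as x -> 1 (no selective difference between a and b).
            factor = 1.0 if abs(x - 1.0) < 1e-12 else math.log(x) / (1.0 - 1.0 / x)
            q_sel[(a, b)] = q[a][b] * factor
    return q_sel

# Assumed neutral process: all substitution rates equal (uniform equilibrium).
q = {a: {b: 1.0 for b in BASES if b != a} for a in BASES}
# Assumed observed equilibrium at one motif position, strongly favouring A.
pi_sel = {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}
q_sel = halpern_bruno(q, pi_sel)

neutral_flux = sum(0.25 * q[a][b] for a in BASES for b in BASES if a != b)
selected_flux = sum(pi_sel[a] * rate for (a, b), rate in q_sel.items())
print(neutral_flux, selected_flux)
```

In this example the substitution flux at equilibrium drops from 3.0 under neutrality to about 1.96 under selection, which is the deceleration the slide refers to. The selected rates also satisfy detailed balance with respect to the observed equilibrium.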
Signal Factor Prediction
• Given a set of homologous sequences and a set of transcription factors (TFs), find signals and which TFs they bind to.
• Use PWM and the Bruno-Halpern (BH) method to make TF-specific evolutionary models
• Drawback: BH only uses rates and equilibrium distribution
• Superior method: infer a TF-specific, position-specific evolutionary model
• Drawback: cannot be done without large-scale data on TF-signal binding.
http://jaspar.cgb.ki.se/
http://www.gene-regulation.com/
Knowledge Transfer and Combining Annotations
[Figure: experimental observations in mouse, related to pig and human, used as a prior]
• Annotation Transfer
• Observed Evolution
Must be solvable by Bayesian Priors:
• Each position i has a probability p_i of being the j’th position in the k’th TFBS
• If no experiment, low probability for being in TFBS
• 1 experimentally annotated genome (Mouse)
(Homologous + Non-homologous) detection
• Unrelated genes - similar expression
• Related genes - similar expression
[Figure: promoter and gene for each case]
Combine above approaches
Combine “profiles”
Wang and Stormo (2003) “Combining phylogenetic data with co-regulated genes to identify regulatory motifs” Bioinformatics 19.18.2369-80
Zhou and Wong (2007) “Coupling Hidden Markov Models for discovery of cis-regulatory signals in multiple species” Annals of Applied Statistics 1.1.36-65