T_Coffee.Workshop.CRG - T
Using the T-Coffee Multiple Sequence
Alignment Package
I - Overview
Cédric Notredame
Comparative Bioinformatics Group
Bioinformatics and Genomics Program
What is T-Coffee?
Tree-Based Consistency Objective Function for Alignment Evaluation
– Progressive Alignment
– Consistency
Progressive Alignment
Feng and Doolittle, 1988; Taylor, 1989
Clustering
Progressive Alignment
Dynamic Programming Using A Substitution Matrix
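The dynamic programming step can be illustrated with a minimal sketch of a global (Needleman-Wunsch style) alignment score. The scoring values below (match, mismatch and a linear gap penalty) are assumptions chosen for illustration; they stand in for a real substitution matrix and for the Gop/Gep penalties, and are not T-Coffee's defaults.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of strings a and b.

    match/mismatch act as a toy substitution matrix and gap is a linear
    gap penalty; all three values are illustrative assumptions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column: a prefix aligned against gaps only.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Each cell keeps the best of: substitution, gap in b, gap in a.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
    return score[-1][-1]


print(global_alignment_score("FAT", "FAST"))
```

In a progressive aligner the same recursion is applied to pairs of sequences first, then to pairs of already aligned groups, following the order given by the guide tree.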
Progressive Alignment
-Depends on the CHOICE of the sequences.
-Depends on the ORDER of the sequences (Tree).
-Depends on the PARAMETERS:
•Substitution Matrix.
•Penalties (Gop, Gep).
•Sequence Weight.
•Tree making Algorithm.
Consistency?
Consistency is an attempt to use alignment
information at very early stages
T-Coffee and Consistency…
SeqA GARFIELD THE LAST FAT CAT
SeqB GARFIELD THE FAST CAT ---
Prim. Weight =88
SeqA GARFIELD THE LAST FA-T CAT
SeqC GARFIELD THE VERY FAST CAT
Prim. Weight =77
SeqA GARFIELD THE LAST FAT CAT
SeqD -------- THE ---- FAT CAT
Prim. Weight =100
SeqB GARFIELD THE ---- FAST CAT
SeqC GARFIELD THE VERY FAST CAT
Prim. Weight =100
SeqC GARFIELD THE VERY FAST CAT
SeqD -------- THE ---- FA-T CAT
Prim. Weight =100
T-Coffee and Consistency…
SeqA GARFIELD THE LAST FAT CAT
SeqB GARFIELD THE FAST CAT ---
Weight =88
SeqA GARFIELD THE LAST FA-T CAT
SeqC GARFIELD THE VERY FAST CAT
SeqB GARFIELD THE ---- FAST CAT
Weight =77
SeqA GARFIELD THE LAST FA-T CAT
SeqD -------- THE ---- FA-T CAT
SeqB GARFIELD THE ---- FAST CAT
Weight =100
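The weights shown for the A/B pair (88 directly, 77 through SeqC, 100 through SeqD) are consistent with taking the minimum of the two primary weights along each path. The sketch below works at the level of whole sequence pairs, as on the slide, whereas the actual library stores weights for pairs of residues. The B/D primary weight is not shown on the slides and is an assumption here, as are all function names.

```python
# Primary weights from the pairwise alignments shown above.
primary = {
    ("A", "B"): 88,
    ("A", "C"): 77,
    ("A", "D"): 100,
    ("B", "C"): 100,
    ("C", "D"): 100,
    ("B", "D"): 100,  # assumption: this pair is not shown on the slides
}


def weight(lib, x, y):
    """Primary weight of a pair, whatever the order of the two names."""
    return lib.get((x, y), lib.get((y, x), 0))


def triplet_weight(lib, x, z, y):
    """Support for aligning x with y through the intermediate sequence z."""
    return min(weight(lib, x, z), weight(lib, z, y))


def extended_weight(lib, x, y, sequences):
    """Direct weight plus the support contributed by every other sequence."""
    total = weight(lib, x, y)
    for z in sequences:
        if z not in (x, y):
            total += triplet_weight(lib, x, z, y)
    return total


print(triplet_weight(primary, "A", "C", "B"))          # path A-C-B
print(triplet_weight(primary, "A", "D", "B"))          # path A-D-B
print(extended_weight(primary, "A", "B", "ABCD"))      # all evidence combined
```

Summing the direct and indirect contributions is how information from all the sequences reaches each pairwise comparison before the progressive alignment starts.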
Where Do The Primary Alignments Come From?
Primary Alignments
– Primary Library
Source
– Any valid Third Party Method
Using the T-Coffee Multiple Sequence
Alignment Package
II – M-Coffee
What is the Best MSA method?
More than 50 MSA methods
Some methods are fast and inaccurate
– Mafft, muscle, kalign
Some methods are slow and accurate
– T-Coffee, ProbCons
Some methods are slow and inaccurate…
– ClustalW
Why Not Combine Them?
All methods give different alignments
Their agreement is an indication of accuracy
t_coffee -method mafft_msa,muscle_msa
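To make "agreement" concrete, here is a minimal sketch that measures how many aligned residue pairs of one alignment are also found in another. It does not call t_coffee; the function names and the two toy alignments are assumptions made for illustration.

```python
from itertools import combinations


def aligned_pairs(msa):
    """Set of residue pairs placed in the same column of an alignment.

    msa maps a sequence name to its gapped string; a residue is identified
    by (name, index in the ungapped sequence).
    """
    names = sorted(msa)
    counters = {name: 0 for name in names}
    pairs = set()
    for col in range(len(msa[names[0]])):
        present = []
        for name in names:
            if msa[name][col] != "-":
                present.append((name, counters[name]))
                counters[name] += 1
        pairs.update(combinations(present, 2))
    return pairs


def agreement(msa_a, msa_b):
    """Fraction of the aligned pairs of msa_a that msa_b also contains."""
    pairs_a, pairs_b = aligned_pairs(msa_a), aligned_pairs(msa_b)
    return len(pairs_a & pairs_b) / len(pairs_a) if pairs_a else 0.0


# Two hypothetical alignments of the same two sequences.
aln1 = {"s1": "FA-T", "s2": "FAST"}
aln2 = {"s1": "FAT-", "s2": "FAST"}
print(agreement(aln1, aln2))
```

Pairs supported by several methods are the ones a combined alignment can trust most; pairs found by a single method are the doubtful regions.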
Combining Many MSAs into ONE
ClustalW
MAFFT
T-Coffee
MUSCLE
???????
Where to Trust Your Alignments
Most Methods Disagree
Most Methods Agree
What To Do Without Structures
Using the T-Coffee Multiple Sequence
Alignment Package
III – Template Based Alignments
Sometimes Sequences are Not Enough
Sequence-based alignments are limited in accuracy
– 30% for proteins
– 70% for DNA
It is hard to correctly align sequences whose similarity is below these values
– Twilight zone
One Solution: Template Based Alignment
Replace the sequence with something more informative
– PDB Structure: Expresso
– Profile: PSI-Coffee
– RNA Structure: R-Coffee
Template Based Multiple Sequence Alignments
[Figure: Sources → Templates (Structure, Profile, …) → Template Aligner → Template Alignment → Source Template Alignment → Remove Templates → Library]
Expresso: Finding the Right Structure
[Figure: Sources → BLAST → Templates → SAP → Template Alignment → Source Template Alignment → Remove Templates → Library]
PSI-Coffee: Homology Extension
[Figure: Sources → BLAST → Templates → Profile Aligner → Template Alignment → Source Template Alignment → Remove Templates → Library]
What is Homology Extension ?
-Simple scoring schemes result in alignment ambiguities
[Figure: matching single L residues is ambiguous ("?"); with Profile 1 and Profile 2, whole columns of residues (mostly L, some I and V) are compared instead.]
Method       Type          Template   Score   Comment
----------------------------------------------------------
ClustalW-2   Progressive   NO         22.74
PRANK        Gap           NO         26.18   Science2008
MAFFT        Iterative     NO         26.18
Muscle       Iterative     NO         31.37
ProbCons     Consistency   NO         40.80
ProbCons     MonoPhasic    NO         37.53
T-Coffee     Consistency   NO         42.30
M-Coffee4    Consistency   NO         43.60
PSI-Coffee   Consistency   Profile    53.71
PROMAL       Consistency   Profile    55.08
PROMAL-3D    Consistency   PDB        57.60
3D-Coffee    Consistency   PDB        61.00   Expresso
----------------------------------------------------------
Score: fraction of correct columns when compared with a structure-based
reference (BB11 of BaliBase).
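The score definition can be sketched as follows: a column of the reference counts as correct when the test alignment contains exactly the same column. This is a simplified reading of the definition; the actual benchmark may restrict the count to selected reference regions, and the names and toy alignments below are assumptions.

```python
def columns(msa):
    """List the columns of an alignment as tuples of residue indices.

    msa maps a sequence name to its gapped string; a gap is recorded as None.
    """
    names = sorted(msa)
    counters = {name: 0 for name in names}
    result = []
    for col in range(len(msa[names[0]])):
        column = []
        for name in names:
            if msa[name][col] == "-":
                column.append(None)
            else:
                column.append(counters[name])
                counters[name] += 1
        result.append(tuple(column))
    return result


def column_score(test, reference):
    """Fraction of reference columns reproduced exactly in the test alignment."""
    reference_columns = columns(reference)
    test_columns = set(columns(test))
    correct = sum(col in test_columns for col in reference_columns)
    return correct / len(reference_columns)


# Hypothetical reference and test alignments of the same two sequences.
reference = {"s1": "FA-T", "s2": "FAST"}
test = {"s1": "FAT-", "s2": "FAST"}
print(column_score(test, reference))
```

Multiplying the fraction by 100 gives values on the same scale as the table.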
[Figure: TARGET sequences + Templates (Experimental Data, …) → Template Aligner → Template Alignment → Template-Sequence Alignment → Template based Alignment of the Sequences → Primary Library]
Using the T-Coffee Multiple Sequence
Alignment Package
IV – RNA Alignments
ncRNAs Comparison
And ENCODE said…
“nearly the entire genome may be represented in primary transcripts
that extensively overlap and include many non-protein-coding regions”
Who Are They?
– tRNA, rRNA, snoRNAs
– microRNAs, siRNAs
– piRNAs
– long ncRNAs (Xist, Evf, Air, CTN, PINK…)
How Many of Them?
– Open question
– 30,000 is a common guess
– Harder to detect than proteins
ncRNAs Can Evolve Rapidly
[Figure: two RNA hairpins drawn as base-paired stems; the corresponding sequences are aligned below]
CCAGGCAAGACGGGACGAGAGTTGCCTGG
CCTCCGTTCAGAGGTGCATAGAACGGAGG
**-------*--**---*-**------**
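The two sequences of this slide can be checked with a short sketch: few positions are identical, yet in both sequences the 5' end is the reverse complement of the 3' end, so each one can still fold into a stem. The helper names are assumptions, and only Watson-Crick pairs are considered (the slide writes the sequences with T rather than U).

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

SEQ1 = "CCAGGCAAGACGGGACGAGAGTTGCCTGG"
SEQ2 = "CCTCCGTTCAGAGGTGCATAGAACGGAGG"


def reverse_complement(seq):
    """Reverse complement using Watson-Crick pairs only."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def identical_positions(a, b):
    """Number of ungapped positions carrying the same base in a and b."""
    return sum(x == y for x, y in zip(a, b))


def stem_length(seq):
    """Largest n such that the first n bases pair with the last n bases."""
    n = 0
    while 2 * (n + 1) <= len(seq) and seq[: n + 1] == reverse_complement(seq[-(n + 1):]):
        n += 1
    return n


same = identical_positions(SEQ1, SEQ2)
print(f"{same} identical positions out of {len(SEQ1)}")
print("terminal stem lengths:", stem_length(SEQ1), stem_length(SEQ2))
```

Compensated mutations of this kind keep the base pairs while changing the bases, which is why sequence identity alone is a poor guide for ncRNA comparison.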
The Holy Grail of RNA Comparison: Sankoff's Algorithm
Simultaneous Folding and Alignment
– Time Complexity: O(L^(2n))
– Space Complexity: O(L^(3n))
In Practice, for Two Sequences:
– 50 nucleotides: 1 min., 6 M.
– 100 nucleotides: 16 min., 256 M.
– 200 nucleotides: 4 hours, 4 G.
– 400 nucleotides: 3 days, 3 T.
Forget about
– Multiple sequence alignments
– Database searches
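The running times quoted for two sequences follow from the stated time complexity: with n = 2 the exponent is 4, so doubling the length multiplies the time by 16. A minimal sketch, assuming the 50-nucleotide case (1 minute) as the reference point; the function name and its parameters are illustrative.

```python
def scaled_minutes(length, base_length=50, base_minutes=1.0, exponent=4):
    """Extrapolate a running time, assuming time grows as length ** exponent.

    exponent=4 corresponds to O(L^(2n)) with n = 2 sequences; the reference
    point (50 nucleotides, 1 minute) is taken from the slide.
    """
    return base_minutes * (length / base_length) ** exponent


for length in (50, 100, 200, 400):
    minutes = scaled_minutes(length)
    print(f"{length} nt: {minutes:.0f} min = {minutes / 60:.1f} h = {minutes / 1440:.1f} days")
```

The extrapolated values (16 minutes, about 4 hours, about 3 days) match the figures on the slide, and they make clear why multiple alignments and database searches are out of reach for the exact algorithm.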
[Figure: RNA Sequences → Consan or Mafft / Muscle / ProbCons → Primary Library; RNA Sequences → RNAplfold → Secondary Structures; Primary Library + Secondary Structures → R-Coffee Extension → R-Coffee Extended Primary Library → R-Score → Progressive Alignment Using The R-Score]
R-Coffee Extension
[Figure: TC Library entries for two base-paired positions (C-G) in each sequence: G G Score X, C C Score Y]
Goal: Embedding RNA Structures Within The T-Coffee Libraries
The R-extension can be added on the top of any existing method.
R-Coffee + Regular Aligners
Method          Avg Braliscore          Net Improv.
                direct   +T     +R      +T     +R
-----------------------------------------------------------
Poa             0.62     0.65   0.70     48    154
Pcma            0.62     0.64   0.67     34    120
Prrn            0.64     0.61   0.66    -63     45
ClustalW        0.65     0.65   0.69     -7     83
Mafft_fftnts    0.68     0.68   0.72     17     68
ProbConsRNA     0.69     0.67   0.71    -49     39
Muscle          0.69     0.69   0.73    -17     42
Mafft_ginsi     0.70     0.68   0.72    -49     39
-----------------------------------------------------------
Improvement = # R-Coffee wins - # R-Coffee losses
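The improvement figure is a count over datasets rather than a difference of averages. A minimal sketch, assuming one score per dataset for each method and assuming that ties count neither as a win nor as a loss (the slide does not say how ties are handled); the scores below are made up for illustration.

```python
def net_improvement(baseline_scores, combined_scores):
    """Number of datasets where the combined method wins minus where it loses.

    Both lists hold one score per dataset, in the same order. Ties are
    ignored, which is an assumption of this sketch.
    """
    wins = sum(c > b for b, c in zip(baseline_scores, combined_scores))
    losses = sum(c < b for b, c in zip(baseline_scores, combined_scores))
    return wins - losses


# Hypothetical per-dataset scores for a direct aligner and its +R variant.
direct = [0.60, 0.70, 0.65, 0.80]
with_r = [0.65, 0.68, 0.70, 0.80]
print(net_improvement(direct, with_r))
```

A method can therefore show a negative net improvement even when its average score is unchanged, as several rows of the table illustrate.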
RM-Coffee + Regular Aligners
Method          Avg Braliscore          Net Improv.
                direct   +T     +R      +T     +R
-----------------------------------------------------------
Poa             0.62     0.65   0.70     48    154
Pcma            0.62     0.64   0.67     34    120
Prrn            0.64     0.61   0.66    -63     45
ClustalW        0.65     0.65   0.69     -7     83
Mafft_fftnts    0.68     0.68   0.72     17     68
ProbConsRNA     0.69     0.67   0.71    -49     39
Muscle          0.69     0.69   0.73    -17     42
Mafft_ginsi     0.70     0.68   0.72    -49     39
-----------------------------------------------------------
RM-Coffee4      0.71     /      0.74     /      84
R-Coffee + Structural Aligners
Method          Avg Braliscore          Net Improv.
                direct   +T     +R      +T     +R
-----------------------------------------------------------
Stemloc         0.62     0.75   0.76    104    113
Mlocarna        0.66     0.69   0.71    101    133
Murlet          0.73     0.70   0.72   -132    -73
Pmcomp          0.73     0.73   0.73    142    145
T-Lara          0.74     0.74   0.69    -36     -8
Foldalign       0.75     0.77   0.77     72     73
-----------------------------------------------------------
Dyalign         --       0.63   0.62    --     --
Consan          --       0.79   0.79    --     --
-----------------------------------------------------------
RM-Coffee4      0.71     /      0.74     /      84
Using the T-Coffee Multiple Sequence
Alignment Package
V – DNA Alignments
Aligning Genomic DNA
Main problem
– Tell a good alignment from a bad one
Strategy:
– Tuning on Orthologous Promoter Detection
– Evaluation on ChIP-Seq Data
Aligning Genomic DNA
Tuning of Gap Penalties
Design of a dinucleotide substitution matrix
Aligning Genomic DNA
gDNA is very heterogeneous
Each genomic feature requires its own aligner
Aligning non-orthologous regions with a global aligner is impossible
Pro-Coffee is designed to align orthologous promoter regions
Using the T-Coffee Multiple Sequence
Alignment Package
VI – Wrap Up
Which Flavor?
Fast Alignments
– M-Coffee with Fast Aligners: mafft, muscle, kalign
Difficult Protein Alignments
– Expresso
– PSI-Coffee
RNA Alignments
– R-Coffee
Promoter Alignments
– Pro-Coffee
www.tcoffee.org