T_Coffee.Workshop.CRG - T

Download Report

Transcript T_Coffee.Workshop.CRG - T

Using the T-Coffee Multiple Sequence
Alignment Package
I - Overview
Cédric Notredame
Comparative Bioinformatics Group
Bioinformatics and Genomics Program
What is T-Coffee ?

Tree Based Consistency based Objective
Function for Alignment Evaluation
–
–
Progressive Alignment
Consistency
Progressive Alignment
Feng and Dolittle, 1988; Taylor 1989
Clustering
Progressive Alignment
Dynamic Programming Using A Substitution Matrix
Progressive Alignment
-Depends on the CHOICE of the sequences.
-Depends on the ORDER of the sequences (Tree).
-Depends on the PARAMETERS:
•Substitution Matrix.
•Penalties (Gop, Gep).
•Sequence Weight.
•Tree making Algorithm.
Consistency?

Consistency is an attempt to use alignment
information at very early stages
T-Coffee and Concistency…
SeqA GARFIELD THE LAST FAT CAT
SeqB GARFIELD THE FAST CAT ---
Prim. Weight =88
SeqA GARFIELD THE LAST FA-T CAT
SeqC GARFIELD THE VERY FAST CAT
Prim. Weight =77
SeqA GARFIELD THE LAST FAT CAT
SeqD -------- THE ---- FAT CAT
Prim. Weight =100
SeqB GARFIELD THE ---- FAST CAT
SeqC GARFIELD THE VERY FAST CAT
Prim. Weight =100
SeqC GARFIELD THE VERY FAST CAT
SeqD -------- THE ---- FA-T CAT
Prim. Weight =100
T-Coffee and Concistency…
SeqA GARFIELD THE LAST FAT CAT
SeqB GARFIELD THE FAST CAT ---
Prim. Weight =88
SeqA GARFIELD THE LAST FA-T CAT
SeqC GARFIELD THE VERY FAST CAT
Prim. Weight =77
SeqA GARFIELD THE LAST FAT CAT
SeqD -------- THE ---- FAT CAT
Prim. Weight =100
SeqB GARFIELD THE ---- FAST CAT
SeqC GARFIELD THE VERY FAST CAT
Prim. Weight =100
SeqC GARFIELD THE VERY FAST CAT
SeqD -------- THE ---- FA-T CAT
Prim. Weight =100
SeqA GARFIELD THE LAST FAT CAT
SeqB GARFIELD THE FAST CAT ---
Weight =88
SeqA GARFIELD THE LAST FA-T CAT
SeqC GARFIELD THE VERY FAST CAT
SeqB GARFIELD THE ---- FAST CAT
Weight =77
SeqA GARFIELD THE LAST FA-T CAT
SeqD -------- THE ---- FA-T CAT
SeqB GARFIELD THE ---- FAST CAT
Weight =100
T-Coffee and Concistency…
SeqA GARFIELD THE LAST FAT CAT
SeqB GARFIELD THE FAST CAT ---
Weight =88
SeqA GARFIELD THE LAST FA-T CAT
SeqC GARFIELD THE VERY FAST CAT
SeqB GARFIELD THE ---- FAST CAT
Weight =77
SeqA GARFIELD THE LAST FA-T CAT
SeqD -------- THE ---- FA-T CAT
SeqB GARFIELD THE ---- FAST CAT
Weight =100
T-Coffee and Concistency…
Where Do The Primary Alignments
Come From?

Primary Alignments
–

Primary Library
Source
–
Any valid Third Party Method
T-Coffee and Concistency…
T-Coffee and Concistency…
Using the T-Coffee Multiple Sequence
Alignment Package
II – M-Coffee
Cédric Notredame
Comparative Bioinformatics Group
Bioinformatics and Genomics Program
What is the Best MSA method ?


More than 50 MSA methods
Some methods are fast and inacurate
–

Some methods are slow and accurate
–

Mafft, muscle, kalign
T-Coffee, ProbCons
Some Methods are slow and inacurate…
–
ClustalW
Why Not Combining Them ?

All Methods give different alignments
Their Agreement is an indication of accuracy

t_coffee –method mafft_msa, muscle_msa

Combining Many MSAs into ONE
ClustalW
MAFFT
T-Coffee
MUSCLE
???????
Where to Trust Your Alignments
Most Methods Disagree
Most Methods Agree
What To Do Without Structures
Using the T-Coffee Multiple Sequence
Alignment Package
III – Template Based Alignments
Cédric Notredame
Comparative Bioinformatics Group
Bioinformatics and Genomics Program
Sometimes Sequences are Not
Enough

Sequence based alignments are limited in
accuracy
–
–

30% for proteins
70% for DNA
It is hard to align correctly sequences whose
similarity is below these values
–
Twilight zone
One Solution: Template Based
Alignment

Replace the sequence with something more
informative
–
–
–
PDB Structure
Profile
RNA-Structure
Expresso
PSI-Coffee
R-Coffee
Template Based Multiple Sequence
Alignments
Sources
-Structure
Templates -Profile
-…
Template
Aligner
-Structure
-Profile
Templates
-…
Template Alignment
Source Template Alignment
Remove Templates
Library
Expresso: Finding the Right Structure
Sources
BLAST
BLAST
Templates
SAP
Templates
Template Alignment
Source Template Alignment
Remove Templates
Library
PSI-Coffee: Homology Extension
Sources
BLAST
BLAST
Templates
Profile Aligner
Templates
Template Alignment
Source Template Alignment
Remove Templates
Library
What is Homology Extension ?
-Simple scoring schemes result in alignment ambiguities
L
?
L
L
What is Homology Extension ?
L
L
L
L
L
L
Profile 1
L
L
L
L
L
I
V
I
L
L
L
L
L
L
L
Profile 2
What is Homology Extension ?
L
L
L
L
L
L
L
L
L
L
L
I
V
I
L
L
L
L
L
L
L
Profile 1
Profile 2
Method
Method
Template
Score
ClustalW-2
Progressive
NO
22.74
PRANK
Gap
NO
26.18
MAFFT
Iterative
NO
26.18
Muscle
Iterative
NO
31.37
ProbCons
Consistency
NO
40.80
ProbCons
MonoPhasic
NO
37.53
T-Coffee
Consistency
NO
42.30
M-Coffe4
Consistency
NO
43.60
PSI-Coffee
Consistency Profile
53.71
PROMAL
Consistency Profile
55.08
PROMAL-3D
Consistency PDB
57.60
3D-Coffee
Consistency PDB
61.00
Comment
Science2008
Expresso
Score: fraction of correct columns when compared with a structure based
reference (BB11 of BaliBase).
Templates
Templates
TARGET
Template
Aligner
TARGET
TARGET
Experimental
Data
…
Experimental
Data
…
Template Alignment
Template-Sequence Alignment
Template based Alignment
of the Sequences
Primary Library
Using the T-Coffee Multiple Sequence
Alignment Package
IV – RNA Alignments
Cédric Notredame
Comparative Bioinformatics Group
Bioinformatics and Genomics Program
ncRNAs Comparison

And ENCODE said…
“nearly the entire genome may be represented in primary transcripts
that extensively overlap and include many non-protein-coding regions”

Who Are They?
–
–
–
–

tRNA, rRNA, snoRNAs,
microRNAs, siRNAs
piRNAs
long ncRNAs (Xist, Evf, Air, CTN, PINK…)
How Many of them
–
–
–
.
Open question
30.000 is a common guess
Harder to detect than proteins
ncRNAs Can Evolve Rapidly
A
A C CA
C
G
G
G
G
A
A
CG
G
G C
A T
A T
C G
G C
G C
A T
C G
C G
A
A C CA
C
G
G
G
G
A
A
CG
G
C G
T A
CCAGGCAAGACGGGACGAGAGTTGCCTGG
T A
G C
CCTCCGTTCAGAGGTGCATAGAACGGAGG
C G
**-------*--**---*-**------**
C G
T A
C G
C G
The Holy Grail of RNA Comparison:
Sankoff’ Algorithm
The Holy Grail of RNA Comparison
Sankoff’ Algorithm

Simultaneous Folding and Alignment
–
–

In Practice, for Two Sequences:
–
–
–
–

Time Complexity: O(L2n)
Space Complexity: O(L3n)
50 nucleotides:
100 nucleotides
200 nucleotides
400 nucleotides
1 min.
16 min.
4 hours
3 days
Forget about
–
–
Multiple sequence alignments
Database searches
6 M.
256 M.
4 G.
3 T.
RNA Sequences
Consan
or
Mafft / Muscle / ProbCons
RNAplfold
Primary Library
Secondary
Structures
R-Coffee
Extension
R-Coffee Extended
Primary Library
R-Score
Progressive Alignment
Using The R-Score
R-Coffee Extension
TC Library
C
C
G
G
G G Score X
C C Score Y
C
C


G
G
Goal: Embedding RNA Structures Within The T-Coffee Libraries
The R-extension can be added on the top of any existing method.
R-Coffee + Regular Aligners
Method
Avg Braliscore
Net Improv.
direct +T
+R
+T
+R
----------------------------------------------------------Poa
0.62
0.65
0.70
48
154
Pcma
0.62
0.64
0.67
34
120
Prrn
0.64
0.61
0.66
-63
45
ClustalW
0.65
0.65
0.69
-7
83
Mafft_fftnts
0.68
0.68
0.72
17
68
ProbConsRNA
0.69
0.67
0.71
-49
39
Muscle
0.69
0.69
0.73
-17
42
Mafft_ginsi
0.70
0.68
0.72
-49
39
-----------------------------------------------------------
Improvement= # R-Coffee wins - # R-Coffee looses
RM-Coffee + Regular Aligners
Method
Avg Braliscore
Net Improv.
direct +T
+R
+T
+R
----------------------------------------------------------Poa
0.62
0.65
0.70
48
154
Pcma
0.62
0.64
0.67
34
120
Prrn
0.64
0.61
0.66
-63
45
ClustalW
0.65
0.65
0.69
-7
83
Mafft_fftnts
0.68
0.68
0.72
17
68
ProbConsRNA
0.69
0.67
0.71
-49
39
Muscle
0.69
0.69
0.73
-17
42
Mafft_ginsi
0.70
0.68
0.72
-49
39
----------------------------------------------------------RM-Coffee4
0.71
/
0.74
/
84
R-Coffee + Structural Aligners
Method
Avg Braliscore
Net Improv.
direct +T
+R
+T
+R
----------------------------------------------------------Stemloc
0.62
0.75
0.76
104
113
Mlocarna
0.66
0.69
0.71
101
133
Murlet
0.73
0.70
0.72
-132
-73
Pmcomp
0.73
0.73
0.73
142
145
T-Lara
0.74
0.74
0.69
-36
-8
Foldalign
0.75
0.77
0.77
72
73
----------------------------------------------------------Dyalign
--0.63
0.62
----Consan
--0.79
0.79
--------------------------------------------------------------RM-Coffee4
0.71
/
0.74
/
84
Using the T-Coffee Multiple Sequence
Alignment Package
V – DNA Alignments
Cédric Notredame
Comparative Bioinformatics Group
Bioinformatics and Genomics Program
Aligning Genomic DNA

Main problem
–

Tell a good alignment from a bad one
Strategy:
–
–
Tuning on Orthologous Promoter Detection
Evaluation on ChIp-Seq Data
Aligning Genomic DNA

Main problem
–

Tell a good alignment from a bad one
Strategy:
–
–
Tuning on Orthologous Promoter Detection
Evaluation on ChIp-Seq Data
Aligning Genomic DNA


Tuning of Gap
Penalties
Design of a dinucleotide
substitution matrix
Aligning Genomic DNA
Aligning Genomic DNA




gDNA is very heterogenous
Each genomic feature requires its own
aligner
Aligning non-orthologous regions with a
global aligner is impossible
Pro-Coffee is designed to align orthologous
promoter regions
Using the T-Coffee Multiple Sequence
Alignment Package
VI – Wrap Up
Cédric Notredame
Comparative Bioinformatics Group
Bioinformatics and Genomics Program
Which Flavor?

Fast Alignments
–

Difficult Protein Alignments
–
–

Expresso
PSI-Coffee
RNA Alignments
–

M-Coffee with Fast Aligners: mafft, muscle, kalign
R-Coffee
Promoter Alignments
–
Pro-Coffee
www.tcoffee.org