Bioinfo primer

Christophe Roos - MediCel ltd
Evolution changes sequences
Good solutions are advantageous
Similarity is a tool in understanding
the information in a sequence
Proteins share similar domains
By comparing several related sequences
to each other, one can distinguish segments
with a higher level of conservation. These
segments usually have a key role in the
function of the protein.
Blast identifies related sequences fast but
only roughly.
Christophe Roos - 5/6 Profiles, motifs, structures
Spring 2002
Refine the comparison
• A multiple sequence alignment of the best-scoring sequences found by
Blast (or in some other way) is done with a more sensitive algorithm.
• Example: The eyeless gene of the fruit fly is also found in several other animal groups:
birds, mammals, reptiles, fish, invertebrates. There it is called PAX6.
Visualise the relationship
• Once a multiple sequence alignment is done, it can also be used for finding relationships (evolutionary distance).
• The distance is calculated as the number of mutations needed to evolve from a putative ancestor to all the ‘present-day’ sequences used. Then a path including all sequences is computed. Different criteria can be used (maximum parsimony, maximum likelihood, etc.).
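A minimal sketch of the distance idea, assuming the sequences are already aligned to equal length. It uses the plain fraction of differing columns (the p-distance) as a simple stand-in for the number of mutations; it does not reconstruct an ancestor or build a tree. The sequences and the function names are hypothetical.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of aligned columns (ignoring gapped ones) where a and b differ."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    mismatches = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # skip columns where either sequence has a gap
        compared += 1
        if x != y:
            mismatches += 1
    return mismatches / compared if compared else 0.0


def distance_matrix(aligned):
    """Pairwise p-distances for a dict of name -> aligned sequence."""
    return {
        (m, n): p_distance(aligned[m], aligned[n])
        for m, n in combinations(aligned, 2)
    }


# Toy alignment, invented for illustration only.
toy_alignment = {
    "seq_a": "MKRPC-LVS",
    "seq_b": "MKRPCALVS",
    "seq_c": "MRRPCAIVT",
}
distances = distance_matrix(toy_alignment)
for pair, d in distances.items():
    print(pair, round(d, 3))
```

A tree-building method would take such a matrix (or the alignment itself) as its starting point.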
Visualise the output of aligned domains
First all sequence pairs are aligned and
scored, then in a second round a
multiple sequence alignment is built up.
In this case (PAX6 proteins from
vertebrates and fruit fly), two domains
are more conserved than the rest of the
sequence. The most conserved areas
have been highlighted by the use of
black or gray background and white
text.
Only part of the alignment is shown.
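The highlighting of conserved areas can be sketched as a per-column conservation score. This is an illustrative assumption, not the method used by the alignment tool behind the figure: each column is scored by the share of sequences carrying its most common residue, and fully conserved columns are marked. The toy sequences and function names are hypothetical.

```python
from collections import Counter


def column_conservation(aligned):
    """Share of sequences with the most common residue, per alignment column."""
    if not aligned:
        return []
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("aligned sequences must have equal length")
    scores = []
    for column in zip(*aligned):
        residues = [c for c in column if c != "-"]
        if not residues:
            scores.append(0.0)  # column made of gaps only
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))  # gaps lower the score
    return scores


def shade(aligned, threshold=1.0):
    """Mask string: '#' for columns at or above the threshold, '.' elsewhere."""
    return "".join(
        "#" if s >= threshold else "." for s in column_conservation(aligned)
    )


toy = ["MKRPCALVS", "MKRPC-LVS", "MRRPCAIVT"]
mask = shade(toy)
print(mask)  # #.###..#.
```

Lowering the threshold would correspond to the gray (partly conserved) shading.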
Profiles and motifs
• A sequence motif is a locally conserved region of a sequence or a short sequence pattern shared by a set of sequences.
• The term motif refers to any sequence pattern that is predictive of a molecule’s function, a structural feature, or a family membership.
• Motifs can be detected in protein, DNA and RNA sequences, but the term most commonly refers to protein motifs.
• Motifs can be represented for computational purposes as
– Flexible patterns, e.g. [K,R]-R-P-C-x(11)-C-V-S (qualitative, unweighted; see the Prosite database at www.expasy.org)
– Position-specific scoring matrices (PSSM, see next page)
– Profile hidden Markov models (HMM). These are a rigorous probabilistic formulation of a sequence profile. They contain the same probability information as PSSMs but can also account for gaps.
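A flexible pattern of this kind maps naturally onto a regular expression. The sketch below handles only the elements seen in the example (single residues, bracketed alternatives, x and x(n) or x(n,m) wildcards); it is not a full parser for the Prosite pattern syntax. The function name and the test sequence are hypothetical.

```python
import re


def pattern_to_regex(pattern):
    """Translate a pattern like [K,R]-R-P-C-x(11)-C-V-S into a regex string."""
    parts = []
    for element in pattern.split("-"):
        element = element.strip()
        wildcard = re.fullmatch(r"x(?:\((\d+)(?:,(\d+))?\))?", element)
        if wildcard:
            low, high = wildcard.group(1), wildcard.group(2)
            if low is None:
                parts.append(".")  # bare x: any one residue
            elif high is None:
                parts.append(".{%s}" % low)  # x(n): exactly n residues
            else:
                parts.append(".{%s,%s}" % (low, high))  # x(n,m): n to m residues
        elif element.startswith("[") and element.endswith("]"):
            # Alternatives; commas as written on the slide are dropped.
            parts.append("[" + element[1:-1].replace(",", "") + "]")
        elif len(element) == 1 and element.isalpha() and element.isupper():
            parts.append(element)  # a single fixed residue
        else:
            raise ValueError("unsupported pattern element: %r" % element)
    return "".join(parts)


regex = pattern_to_regex("[K,R]-R-P-C-x(11)-C-V-S")
print(regex)  # [KR]RPC.{11}CVS

# Invented test sequence containing one occurrence of the pattern.
sequence = "AA" + "KRPC" + "G" * 11 + "CVS" + "TT"
match = re.search(regex, sequence)
print(match.start(), match.end())  # 2 20
```

The match is all-or-nothing, which is what "qualitative, unweighted" means in practice: a single mismatch loses the hit.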
Position specific scoring matrix
• This corresponds to the flexible pattern of the paired box: [K,R]-R-P-C-x(11)-C-V-S
[Table: position-specific scoring matrix for the paired box motif, giving an integer score for each residue symbol (A to Z, plus - and *) at each position. The matrix layout was lost in extraction and the values cannot be reliably reassigned to rows and columns.]
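To make the idea of a PSSM concrete, here is a minimal sketch that builds a small log-odds matrix from ungapped motif instances and slides it along a sequence. The uniform background, the pseudocount and the log2 scaling are assumptions chosen for simplicity; the matrix on the slide was produced by a tool with its own scaling. The instances and function names are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def build_pssm(instances, pseudocount=1.0):
    """Build a log-odds matrix from equal-length, ungapped motif instances."""
    width = len(instances[0])
    if any(len(s) != width for s in instances):
        raise ValueError("all motif instances must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for column in zip(*instances):
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((column.count(aa) + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_window(pssm, window):
    """Sum the per-position scores; unknown symbols get the column minimum."""
    return sum(col.get(res, min(col.values())) for col, res in zip(pssm, window))


def best_hit(pssm, sequence):
    """Return (score, start) of the best-scoring window in the sequence."""
    width = len(pssm)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the matrix")
    return max(
        (score_window(pssm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


# Toy motif instances, invented for illustration only.
toy_instances = ["KRPC", "RRPC", "KRPC"]
toy_pssm = build_pssm(toy_instances)
score, start = best_hit(toy_pssm, "GGAKRPCVS")
print(start, round(score, 2))
```

Unlike the flexible pattern, every window gets a score, so near-misses are ranked rather than discarded.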
Motif and databases – mode of use
• Motifs can be used to search sequence databases
– take a family of related sequences
– align them and define motifs
– use the motifs to search a database of sequences to find novel family members
– motifs can also be generated from unaligned sequences (e.g. MEME, see next page)
• Motif databases can be searched with sequences
– take one sequence and ask what known motifs it contains
– deduce its function using knowledge about those motifs in other sequences
• Databases
– Blocks, Fred Hutchinson Cancer Research Center (ungapped alignments)
– COG, clusters of orthologous groups, NCBI (21 complete genomes)
– Pfam, Sanger Center (gapped profiles, curated)
– Prints, Univ. Manchester (fingerprints, i.e. more than one pattern)
– Prosite, Univ. Geneva (consensus patterns, expert-curated)
– SMART, EMBL-Heidelberg
– InterPro, EBI (multiple, curated), includes Pfam, SMART, etc. [2 pages forward]
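The second mode of use (one sequence against many known motifs) can be sketched with a tiny in-memory motif library. The library here is a hypothetical stand-in for a real motif database: one entry is the paired box pattern quoted in these slides, written as a regular expression, and the other is an invented toy pattern. The query sequence is also invented.

```python
import re

# Hypothetical mini-library: name -> compiled pattern.
MOTIF_LIBRARY = {
    "paired_box_pattern": re.compile(r"[KR]RPC.{11}CVS"),  # pattern from the slides
    "toy_pattern": re.compile(r"C.{2}C.{3}H"),  # invented for illustration
}


def scan(sequence, library):
    """Return (motif name, start, matched text) for every non-overlapping hit."""
    hits = []
    for name, regex in library.items():
        for m in regex.finditer(sequence):
            hits.append((name, m.start(), m.group()))
    return sorted(hits, key=lambda hit: hit[1])  # order hits along the sequence


# Invented query containing one occurrence of each pattern.
query = "MT" + "CAACGGGH" + "KRPC" + "A" * 11 + "CVS"
hits = scan(query, MOTIF_LIBRARY)
for name, start, text in hits:
    print(name, start, text)
```

In a real search, each hit would link to the database entry describing the function associated with that motif.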
Motif discovery tools and PSSM creators
• The MEME tool takes unaligned sequences as input and searches for patterns according to several parameters such as
– Minimum and maximum length
– Number of occurrences per sequence
– Number of occurrences per set
• MEME also generates PSSMs for the domains it finds.
• MAST is a tool for searching databases with PSSMs.
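The idea of finding a pattern in unaligned sequences can be illustrated with a deliberately naive sketch: count which exact substrings of length k occur in every sequence. This is not MEME's algorithm (MEME fits a statistical model by expectation maximization and tolerates mismatches); the parameters k and min_fraction are hypothetical and only loosely analogous to the length and occurrence settings listed above. The sequences are invented.

```python
from collections import Counter


def shared_kmers(sequences, k, min_fraction=1.0):
    """Exact k-mers present in at least min_fraction of the sequences."""
    support = Counter()
    for seq in sequences:
        # A set, so each k-mer counts at most once per sequence.
        kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        support.update(kmers)
    needed = min_fraction * len(sequences)
    return sorted(kmer for kmer, n in support.items() if n >= needed)


# Unaligned toy sequences; the shared segment sits at different offsets.
toy = ["AAGKRPCTT", "KRPCGGA", "TTTKRPCA"]
found = shared_kmers(toy, 4)
print(found)  # ['KRPC']
```

The instances found this way could then serve as input for building a PSSM.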
The InterPro database of motifs at EBI
• The Nov 2001 release was built from Pfam 6.6, PRINTS 31.0, PROSITE 16.37, ProDom 2001.2, SMART 3.1, TIGRFAMs 1.2, and the current SWISS-PROT + TrEMBL data. This release of InterPro contains 4691 entries, representing 1068 domains, 3532 families, 74 repeats and 15 post-translational modification sites.
Scan the InterPro database - example
• The InterPro database was scanned with the PAX6 sequence from the fruit fly.
Protein 3D structure
• 3D is better than linear strings of
letters...
• Protein folding is critical for
function
• Protein folding is ordered
• Structures consist of folds
• 3D structure can be measured, but
computational ab initio structure
prediction is a tough task and
nearly impossible above a certain
protein size (CPU and rule limits)
Protein 3D structure building blocks
• Primary structure: the linear array of amino acids
• Secondary structures
– Alpha helix
– Beta-strand
• Tertiary structures
[Figure: DNA-binding protein (DNA helix, white; helices, pink; sheets of beta-strands, ochre)]