Multiple Sequence Alignment


Marine Biological Laboratory
— Workshop on Molecular
Evolution
Woods Hole, Massachusetts
July 25, 2005, 7 to 10 PM
Multiple Sequence
Alignment & Analysis
thru GCG’s SeqLab
Steven M. Thompson
Florida State University School of
Computational Science (SCS)
More data yields stronger analyses — if done carefully!
Mosaic ideas and evolutionary ‘importance.’
But first a prelude: My definitions
Biocomputing and computational biology are synonymous and
describe the use of computers and computational techniques to
analyze any biological system, from molecules, through cells,
tissues, and organisms, all the way to populations.
Bioinformatics describes using computational techniques to access,
analyze, and interpret the biological information in any of the
available biological databases.
Sequence analysis is the study of molecular sequence data for the
purpose of inferring the function, mechanism, interactions,
evolution, and perhaps structure of biological molecules.
Genomics analyzes the context of genes or complete genomes (the
total DNA content of an organism) within and across genomes.
Proteomics is the subdivision of genomics concerned with analyzing
the complete protein complement, i.e. the proteome, of organisms,
both within and between different organisms.
And a ‘way’ to think about it:
The reverse biochemistry analogy
from a ‘virtual’ DNA sequence to actual molecular
physical characterization, not the other way ‘round.
Using bioinformatics tools, you can infer all sorts
of functional, evolutionary, and structural
insights into a gene product, without the need
to isolate and purify massive amounts of
protein! Eventually you can go on to clone
and express the gene based on that analysis
using PCR techniques.
The computer and molecular databases are an
essential part of this process.
The exponential growth of molecular
sequence databases & cpu power
Year    Base pairs      Sequences
1982    680338          606
1983    2274029         2427
1984    3368765         4175
1985    5204420         5700
1986    9615371         9978
1987    15514776        14584
1988    23800000        20579
1989    34762585        28791
1990    49179285        39533
1991    71947426        55627
1992    101008486       78608
1993    157152442       143492
1994    217102462       215273
1995    384939485       555694
1996    651972984       1021211
1997    1160300687      1765847
1998    2008761784      2837897
1999    3841163011      4864570
2000    11101066288     10106023
2001    15849921438     14976310
2002    28507990166     22318883
2003    36553368485     30968418
2004    44575745176     40604319
Doubling time ~ 1 year!
http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
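As a rough check on the doubling-time claim, the sketch below estimates an average doubling time from the first and last base-pair counts in the table. It assumes steady exponential growth between those two points, which is a simplification; the function name is illustrative only. Over the whole 1982 to 2004 span the average comes out at a bit under a year and a half, the same order as the figure quoted on the slide.

```python
import math

# Endpoints copied from the GenBank growth table (year, base pairs).
first_year, first_bp = 1982, 680_338
last_year, last_bp = 2004, 44_575_745_176

def doubling_time(t0, n0, t1, n1):
    """Average doubling time, assuming steady exponential growth between two points."""
    return (t1 - t0) * math.log(2) / math.log(n1 / n0)

dt = doubling_time(first_year, first_bp, last_year, last_bp)
print(f"average doubling time: {dt:.2f} years")
```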
Back to multiple sequence
alignment — Applicability?
So what; why even bother?
Applications:
Probe/primer, and motif/profile design;
Graphical illustrations;
Comparative ‘homology’ inference;
Molecular evolutionary analysis.
OK — well, how do you do it?
Dynamic programming’s complexity
increases exponentially with the number of
sequences being compared:
N-dimensional matrix . . . .
complexity = [sequence length]^[number of sequences]
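To make the growth concrete, this small sketch counts the cells in a full N-dimensional dynamic programming matrix. The sequence length of 300 is an arbitrary illustrative value, and the function name is hypothetical.

```python
def dp_cells(seq_length: int, n_sequences: int) -> int:
    """Cells in a full N-dimensional dynamic programming matrix."""
    return seq_length ** n_sequences

# Each added sequence multiplies the work by the sequence length.
for n in (2, 3, 4, 5):
    print(f"{n} sequences of 300 residues: {dp_cells(300, n):.3e} cells")
```

Two sequences need 90,000 cells; three already need 27 million, which is why exhaustive multiple alignment is limited to a handful of short sequences.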
‘Global’ heuristic solutions
See —
MSA (‘global’ within ‘bounding box’) and
PIMA (‘local’ portions only) on the multiple
alignment page at the
Baylor College of Medicine’s Search
Launcher —
http://searchlauncher.bcm.tmc.edu/ — but,
severely limiting restrictions!
Multiple Sequence Dynamic Programming
Therefore — pairwise,
progressive dynamic
programming restricts the
solution to the neighborhood of only two
sequences at a time.
All sequences are
compared, pairwise, and
then each is aligned to its
most similar partner or
group of partners. Each
group of partners is then
aligned to finish the
complete multiple
sequence alignment.
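The first half of that procedure (compare everything pairwise, then decide who gets aligned with whom) can be sketched as follows. This is a minimal illustration, not PileUp's or ClustalW's actual implementation: the scoring values, the average-score rule for comparing groups, and all names are assumptions, and the sketch only reports the joining order rather than building the alignments.

```python
from itertools import combinations

def nw_score(a: str, b: str, match=1, mismatch=-1, gap=-2) -> int:
    """Needleman-Wunsch global alignment score (score only, linear gap cost)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * gap]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

def join_order(seqs: dict) -> list:
    """Greedy clustering: repeatedly merge the two most similar groups.

    Group-to-group similarity is the average pairwise score (an assumption).
    """
    scores = {frozenset(p): nw_score(seqs[p[0]], seqs[p[1]])
              for p in combinations(seqs, 2)}

    def avg(g, h):
        total = sum(scores[frozenset((x, y))] for x in g for y in h)
        return total / (len(g) * len(h))

    groups = [(name,) for name in seqs]
    order = []
    while len(groups) > 1:
        g, h = max(combinations(groups, 2), key=lambda gh: avg(*gh))
        groups.remove(g)
        groups.remove(h)
        groups.append(g + h)
        order.append((g, h))
    return order

toy = {"s1": "GATTACA", "s2": "GATTAGA", "s3": "GCTTGCA", "s4": "TTTTTTT"}
for left, right in join_order(toy):
    print(left, "+", right)
```

In a real progressive aligner each merge step would align the two groups (sequence to sequence, sequence to alignment, or alignment to alignment) instead of just recording their names.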
Reliability and the
Comparative Approach —
explicit homologous correspondence;
manual adjustments based on
knowledge,
especially structural, regulatory, and
functional sites.
Therefore, editors like SeqLab and
the Ribosomal Database Project:
http://rdp.cme.msu.edu/index.jsp
Structural & Functional correspondence in
the Wisconsin Package’s SeqLab —
Work with proteins!
If at all possible —
Twenty match symbols versus four, plus
similarity! Way better signal to noise.
Also guarantees no indels are placed
within codons. So translate, then align.
Nucleotide sequences will only reliably
align if they are very similar to each
other. And they will require extensive
hand editing and careful consideration.
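One way to see why "translate, then align" keeps indels out of codons: once the protein alignment exists, the coding DNA can be laid back onto it three bases at a time. The sketch below assumes the DNA is exactly the coding region for the protein (no stop codon, no introns); the function name and the toy sequences are illustrative only.

```python
def thread_codons(aligned_protein: str, dna: str, gap: str = "-") -> str:
    """Lay an unaligned coding sequence onto its aligned protein, codon by codon."""
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    residues = sum(1 for ch in aligned_protein if ch != gap)
    if len(dna) % 3 or residues != len(codons):
        raise ValueError("DNA length must be exactly 3 x the number of residues")
    it = iter(codons)
    # Every protein gap becomes three nucleotide gaps, so codons are never split.
    return "".join(gap * 3 if ch == gap else next(it) for ch in aligned_protein)

print(thread_codons("MK-L", "ATGAAACTG"))  # ATGAAA---CTG
```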
Beware of aligning apples and
oranges [and grapefruit]!
Paralogous
versus
orthologous;
genomic versus
cDNA;
mature versus
precursor.
Mask out uncertain areas —
Complications —
Order dependence.
Not that big of a deal.
Substitution matrices and gap penalties.
A very big deal!
Regional ‘realignment’ becomes incredibly
important, especially with sequences that
have areas of high and low similarity (GCG’s
PileUp -InSitu option).
Complications cont. —
Format hassles!
Specialized format conversion
tools such as GCG’s From… and To…
programs and
PAUPSearch.
Don Gilbert’s public domain
ReadSeq program.
Still more complications —
Indels and missing
data symbols (i.e.
gaps) designation
discrepancy
headaches —
., -, ~, ?, N, or X
. . . . . Help!
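A small helper can take some of the pain out of moving alignments between programs. The sketch below rewrites the gap-like symbols to a single character and leaves everything else alone. Which symbols count as gaps versus missing data differs from program to program, so the choice here (treating ., ~ and - as gaps, and leaving ?, N and X untouched) is an assumption to adjust for the target program.

```python
GAP_CHARS = ".~-"  # symbols treated as alignment gaps (an assumption)

def normalize_gaps(aligned: str, gap: str = "-") -> str:
    """Rewrite every gap symbol as one character; leave residues, ?, N and X alone."""
    table = str.maketrans({ch: gap for ch in GAP_CHARS})
    return aligned.translate(table)

print(normalize_gaps("~~ACG..T-NNA?"))  # --ACG--T-NNA?
```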
Web resources for pairwise,
progressive multiple alignment —
http://www.techfak.uni-bielefeld.de/bcd/Curric/MulAli/welcome.html
http://pbil.univ-lyon1.fr/alignment.html
http://www.ebi.ac.uk/clustalw/
http://searchlauncher.bcm.tmc.edu/
However, problems with very large datasets and
huge multiple alignments make doing multiple
sequence alignment on the Web impractical
after your dataset has reached a certain size.
You’ll know it when you’re there!
If large datasets become intractable for
analysis on the Web, what other
resources are available?
Desktop software solutions — public domain
programs are available, but . . . complicated to
install, configure, and maintain. User must be
pretty computer savvy. So,
commercial software packages are available, e.g.
MacVector, DS Gene, DNAsis, DNAStar, etc.,
but . . . license hassles, big expense per
machine, and Internet and/or CD database
access all complicate matters!
Therefore, UNIX server-based solutions
Public domain solutions also exist, but now a very cooperative
systems manager needs to maintain everything for users, so,
commercial products, e.g. the Accelrys GCG Wisconsin Package [a
Pharmacopeia Co.]
and the SeqLab Graphical User Interface, simplify
matters for administrators and users.
One license fee for an entire institution and very fast, convenient
database access on local server disks. Connections from any
networked terminal or workstation anywhere!
Operating system: UNIX command line operation hassles;
communications software — telnet, ssh, and terminal emulation; X
graphics; file transfer — ftp, and scp/sftp; and editors — vi, emacs,
pico (or desktop word processing followed by file transfer [save as
"text only!"]). See my supplement pdf file.
The Genetics Computer Group —
The Accelrys Wisconsin Package for Sequence Analysis
Begun in 1982 in Oliver Smithies’ Genetics Dept. lab at the University
of Wisconsin, Madison, then a private company for over 10 years,
then acquired by the Oxford Molecular Group U.K., and now
owned by Pharmacopeia Inc. U.S.A., Accelrys Division, under the
brand new name, as of May 2005, Discovery Studio GCG.
The suite contains almost 150 programs designed to work in a
“toolbox” fashion. Several simple programs used in succession
can lead to sophisticated results.
Also ‘internal compatibility,’ i.e. once you learn to use one program,
all programs can be run similarly, and the output from many
programs can be used as input for other programs.
Used all over the world by more than 30,000 scientists at over 950
institutions in more than 35 countries, so learning it here will likely
be useful at any other research institution that you may end up at.
To answer the always perplexing GCG question — “What
sequence(s)? . . . .”
Specifying sequences, GCG style;
in order of increasing power and complexity:
The sequence is in a local GCG format single sequence file in your UNIX
account. (GCG Reformat and all From & To programs)
The sequence is in a local GCG database in which case you ‘point’ to it by
using any of the GCG database logical names. A colon, “:,” always sets
the logical name apart from either an accession number or a proper
identifier name or a wildcard expression and they are case insensitive.
The sequence is in a GCG format multiple sequence file, either an MSF
(multiple sequence format) file or an RSF (rich sequence format) file. To
specify sequences contained in a GCG multiple sequence file, supply the
file name followed by a pair of braces, “{},” containing the sequence
specification, e.g. a wildcard — {*}.
Finally, the most powerful method of specifying sequences is in a GCG “list”
file. It is merely a list of other sequence specifications and can even
contain other list files within it. The convention to use a GCG list file in a
program is to precede it with an at sign, “@.” Furthermore, one can
supply attribute information within list files to specify something special
about the sequence.
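The four styles can be told apart by syntax alone, which the following sketch illustrates. It is not part of GCG; the function name and the returned dictionary layout are hypothetical, and the classification rules are inferred only from the conventions described above (the at sign, the braces, and the colon).

```python
import re

def classify_spec(spec: str) -> dict:
    """Classify a GCG-style sequence specification by its syntax alone."""
    spec = spec.strip()
    if spec.startswith("@"):                      # list file
        return {"kind": "list", "file": spec[1:]}
    m = re.fullmatch(r"(?P<file>[^{}]+)\{(?P<entry>[^{}]*)\}", spec)
    if m:                                         # MSF or RSF file plus {entry}
        return {"kind": "multiple", "file": m["file"], "entry": m["entry"]}
    if ":" in spec:                               # database logical name
        logical, entry = spec.split(":", 1)
        # Logical names are case insensitive, so store them upper-cased.
        return {"kind": "database", "logical": logical.upper(), "entry": entry}
    return {"kind": "file", "file": spec}         # plain single sequence file

for s in ("my-special.pep", "SwissProt:EfTu_Ecoli", "Ef1a-Tu.msf{*}", "@another.list"):
    print(s, "->", classify_spec(s)["kind"])
```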
‘Clean’ GCG format single sequence file after
‘reformat’ (or any of the From… programs)
This is a small example of GCG single sequence format.
Always put some documentation on top, so in the future
you can figure out what it is you're dealing with! The
line with the two periods is converted to the checksum line.

example.seq  Length: 77  July 21, 1999 09:30  Type: N  Check: 4099 ..

       1  ACTGACGTCA CATACTGGGA CTGAGATTTA CCGAGTTATA CAAGTATACA
      51  GATTTAATAG CATGCGATCC CATGGGA

SeqLab’s Editor mode can also
“Import” native GenBank format and
ABI or LI-COR trace files!
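Reading the single sequence format back is straightforward once the two-period divider is found. The sketch below is an illustrative reader, not a GCG tool: it assumes the divider is the first line ending in two periods and that everything after it is sequence data with position numbers and spacing to be discarded. It does not verify the checksum.

```python
def read_gcg_single(text: str) -> tuple:
    """Split GCG single-sequence text into (documentation, sequence)."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.rstrip().endswith(".."):   # the checksum/divider line
            break
    else:
        raise ValueError("no '..' divider line found")
    doc = "\n".join(lines[:i + 1])
    # Below the divider: drop position numbers and spacing, keep sequence symbols.
    seq = "".join(ch for line in lines[i + 1:] for ch in line
                  if not (ch.isdigit() or ch.isspace()))
    return doc, seq

sample = """Always put some documentation on top.
example.seq  Length: 77  July 21, 1999 09:30  Type: N  Check: 4099 ..

       1  ACTGACGTCA CATACTGGGA CTGAGATTTA CCGAGTTATA CAAGTATACA
      51  GATTTAATAG CATGCGATCC CATGGGA
"""
doc, seq = read_gcg_single(sample)
print(len(seq), seq[:10])
```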
Logical terms for the Wisconsin Package
Sequence databases, nucleic acids:
GENBANKPLUS     all of GenBank plus EST and GSS subdivisions
GBP             all of GenBank plus EST and GSS subdivisions
GENBANK         all of GenBank except EST and GSS subdivisions
GB              all of GenBank except EST and GSS subdivisions
BA              GenBank bacterial subdivision
BACTERIAL       GenBank bacterial subdivision
EST             GenBank EST (Expressed Sequence Tags) subdivision
GSS             GenBank GSS (Genome Survey Sequences) subdivision
HTC             GenBank High Throughput cDNA
HTG             GenBank High Throughput Genomic
IN              GenBank invertebrate subdivision
INVERTEBRATE    GenBank invertebrate subdivision
OM              GenBank other mammalian subdivision
OTHERMAMM       GenBank other mammalian subdivision
OV              GenBank other vertebrate subdivision
OTHERVERT       GenBank other vertebrate subdivision
PAT             GenBank patent subdivision
PATENT          GenBank patent subdivision
PH              GenBank phage subdivision
PHAGE           GenBank phage subdivision
PL              GenBank plant subdivision
PLANT           GenBank plant subdivision
PR              GenBank primate subdivision
PRIMATE         GenBank primate subdivision
RO              GenBank rodent subdivision
RODENT          GenBank rodent subdivision
STS             GenBank (sequence tagged sites) subdivision
SY              GenBank synthetic subdivision
SYNTHETIC       GenBank synthetic subdivision
TAGS            GenBank EST and GSS subdivisions
UN              GenBank unannotated subdivision
UNANNOTATED     GenBank unannotated subdivision
VI              GenBank viral subdivision
VIRAL           GenBank viral subdivision

Sequence databases, amino acids:
GENPEPT         GenBank CDS translations
GP              GenBank CDS translations
SWISSPROTPLUS   all of Swiss-Prot and all of SPTrEMBL
SWP             all of Swiss-Prot and all of SPTrEMBL
SWISSPROT       all of Swiss-Prot (fully annotated)
SW              all of Swiss-Prot (fully annotated)
SPTREMBL        Swiss-Prot preliminary EMBL translations
SPT             Swiss-Prot preliminary EMBL translations
P               all of PIR Protein
PIR             all of PIR Protein
PROTEIN         PIR fully annotated subdivision
PIR1            PIR fully annotated subdivision
PIR2            PIR preliminary subdivision
PIR3            PIR unverified subdivision
PIR4            PIR unencoded subdivision
NRL_3D          PDB 3D protein sequences
NRL             PDB 3D protein sequences

General data files:
GENMOREDATA     path to GCG optional data files
GENRUNDATA      path to GCG default data files

These are easy: they make sense and you’ll have a vested interest.
GCG MSF & RSF format
!!AA_MULTIPLE_ALIGNMENT 1.0

  small.pfs.msf  MSF: 735  Type: P  July 20, 2001 14:53  Check: 6619 ..

 Name: a49171   Len:   425  Check:  537  Weight:  1.00
 Name: e70827   Len:   577  Check:   21  Weight:  1.00
 Name: g83052   Len:   718  Check: 9535  Weight:  1.00
 Name: f70556   Len:   534  Check: 3494  Weight:  1.00
 Name: t17237   Len:   229  Check: 9552  Weight:  1.00
 Name: s65758   Len:   735  Check:  111  Weight:  1.00
 Name: a46241   Len:   274  Check: 3514  Weight:  1.00

//

//////////////////////////////////////////////////

!!RICH_SEQUENCE 1.0
..
{
name          ef1a_giala
descrip       PileUp of: @/users1/thompson/.seqlab-mendel/pileup_28.list
type          PROTEIN
longname      /users1/thompson/seqlab/EF1A_primitive.orig.msf{ef1a_giala}
sequence-ID   Q08046
checksum      7342
offset        23
creation-date 07/11/2001 16:51:19
strand        1
comments      ////////////////////////////////////////////////////////////

This is SeqLab’s native format.
The trick is to not forget the Braces and ‘wild card,’ e.g.
filename{*}, when specifying!
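The entry names needed inside the braces come from the "Name:" lines of the MSF header. The sketch below pulls them out with a regular expression and builds the matching filename{name} specifications. It is an illustration only: the pattern assumes the field order shown in the example header (Name, Len, Check, Weight), and the function name is hypothetical.

```python
import re

NAME_LINE = re.compile(
    r"Name:\s*(?P<name>\S+)\s+Len:\s*(?P<len>\d+)\s+"
    r"Check:\s*(?P<check>\d+)\s+Weight:\s*(?P<weight>[\d.]+)"
)

def msf_entries(header: str) -> list:
    """Return one dict per 'Name:' line found in an MSF header."""
    return [{"name": m["name"], "len": int(m["len"]),
             "check": int(m["check"]), "weight": float(m["weight"])}
            for m in NAME_LINE.finditer(header)]

header = """
 Name: a49171   Len:   425  Check:  537  Weight:  1.00
 Name: e70827   Len:   577  Check:   21  Weight:  1.00
 Name: s65758   Len:   735  Check:  111  Weight:  1.00
//
"""
entries = msf_entries(header)
# Build one specification per entry, e.g. small.pfs.msf{a49171}
specs = [f"small.pfs.msf{{{e['name']}}}" for e in entries]
print(specs)
```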
The List File Format
remember the @ sign!
An example GCG list file of many elongation
1a and Tu factors follows. As with all GCG
data files, two periods separate
documentation from data.
..
my-special.pep begin:24 end:134
SwissProt:EfTu_Ecoli
Ef1a-Tu.msf{*}
/usr/accounts/test/another.rsf{ef1a_*}
@another.list
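A list file like this is easy to read programmatically. The sketch below is an illustrative parser, not GCG code: it assumes the data start after the first line ending in two periods, that each data line holds one specification, and that any attributes follow it on the same line as key:value pairs. Nested list files (the @ entries) are returned as-is rather than expanded.

```python
def read_list_file(text: str) -> list:
    """Parse GCG-style list file text into (specification, attributes) pairs."""
    lines = text.splitlines()
    start = next((i for i, line in enumerate(lines)
                  if line.rstrip().endswith("..")), None)
    if start is None:
        raise ValueError("no '..' divider found")
    entries = []
    for line in lines[start + 1:]:
        fields = line.split()
        if not fields:
            continue                      # skip blank lines
        spec, attrs = fields[0], {}
        for field in fields[1:]:
            key, _, value = field.partition(":")
            attrs[key] = value            # attribute values kept as strings
        entries.append((spec, attrs))
    return entries

sample = """An example list file; two periods separate documentation from data. ..

my-special.pep begin:24 end:134
SwissProt:EfTu_Ecoli
Ef1a-Tu.msf{*}
/usr/accounts/test/another.rsf{ef1a_*}
@another.list
"""
parsed = read_list_file(sample)
for spec, attrs in parsed:
    print(spec, attrs)
```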
The ‘way’ SeqLab works!
SeqLab — GCG’s X-based GUI!
SeqLab is the merger of Steve Smith’s Genetic
Data Environment and GCG’s Wisconsin
Package Interface:
GDE + WPI = SeqLab
Requires an X-Windowing environment —
either native on UNIX computers (including
Linux; it is not installed by default on Mac OS
X [v.10+] systems, but see Apple’s free
X11 package or XDarwin), or emulated with X-Server software on personal computers.
Conclusions —
Gunnar von Heijne, in his old but quite readable treatise Sequence
Analysis in Molecular Biology: Treasure Trove or Trivial Pursuit
(1987), provides a very appropriate conclusion:
“Think about what you’re doing; use your knowledge of the molecular
system involved to guide both your interpretation of results and your
direction of inquiry; use as much information as possible; and do not
blindly accept everything the computer offers you.”
He continues:
“. . . if any lesson is to be drawn . . . it surely is that to be able to make a
useful contribution one must first and foremost be a biologist, and only
second a theoretician . . . . We have to develop better algorithms, we
have to find ways to cope with the massive amounts of data, and above
all we have to become better biologists. But that’s all it takes.”
FOR MORE INFO...
Explore my Web Home: http://bio.fsu.edu/~stevet/cv.html.
Contact me ([email protected]) for specific long-distance
bioinformatics assistance and collaboration.
AND FOR EVEN MORE INFO...
Many texts are now available in the field. To ‘honk-my-own-horn’ a bit, check out:
Current Protocols in Bioinformatics from John Wiley & Sons, Inc. (http://www.does.org/cp/bioinfo.html);
Horizon Scientific Press’ Computational Genomics: Theory and Application (http://www.horizonpress.com/hsp/books/com.html); and
Humana Press’ Introduction to Bioinformatics: A Theoretical And Practical Approach (http://www.humanapress.com/Product.pasp?txtCatalog=HumanaBooks&txtCategory=&txtProductID=1-58829-241X&isVariant=0).
They all asked me to contribute chapters on multiple sequence alignment and analysis using GCG software.
References —
Bailey, T.L. and Elkan, C., (1994) Fitting a mixture model by expectation maximization to discover motifs in
biopolymers, in Proceedings of the Second International Conference on Intelligent Systems for Molecular
Biology, AAAI Press, Menlo Park, California, U.S.A. pp. 28–36.
Bairoch, A. (1992) PROSITE: A Dictionary of Sites and Patterns in Proteins. Nucleic Acids Research 20, 2013–2018.
Eddy, S.R. (1996) Hidden Markov models. Current Opinion in Structural Biology 6, 361–365.
Eddy, S.R. (1998) Profile hidden Markov models. Bioinformatics 14, 755–763.
Felsenstein, J. (1993–2005) PHYLIP (Phylogeny Inference Package) Distributed by the author. Dept. of Genetics,
University of Washington, Seattle, Washington, U.S.A.
Feng, D.F. and Doolittle, R. F. (1987) Progressive sequence alignment as a prerequisite to correct phylogenetic
trees. Journal of Molecular Evolution 25, 351–360.
Genetics Computer Group (Copyright 1982–2005) Program Manual for the Wisconsin Package, Version 10.3,
Accelrys, subsidiary of Pharmacopeia Inc.
Gilbert, D.G. (1993 [C release] and 1999 [Java release]) ReadSeq, public domain software distributed by the
author. http://iubio.bio.indiana.edu/soft/molbio/readseq/ Bioinformatics Group, Biology Department, Indiana
University, Bloomington, Indiana, U.S.A.
Gribskov, M., McLachlan, M., and Eisenberg, D. (1987) Profile analysis: detection of distantly related proteins. Proc. Natl.
Acad. Sci. U.S.A. 84, 4355–4358.
Gupta, S.K., Kececioglu, J.D., and Schaffer, A.A. (1995) Improving the practical space and time efficiency of the
shortest-paths approach to sum-of-pairs multiple sequence alignment. Journal of Computational Biology 2,
459–472.
Smith, R.F. and Smith, T.F. (1992) Pattern-induced multi-sequence alignment (PIMA) algorithm employing
secondary structure-dependent gap penalties for comparative protein modelling. Protein Engineering 5, 35–41.
Swofford, D.L., PAUP (Phylogenetic Analysis Using Parsimony) (1989-1993) Illinois Natural History Survey, (1994)
personal copyright, (1995–2000) Smithsonian Institution, Washington D.C., U.S.A., and (2001–2005) Florida
State University, School of Computational Science, Tallahassee, Florida, U.S.A.
Thompson, J.D., Gibson, T.J., Plewniak, F., Jeanmougin, F. and Higgins, D.G. (1997) The ClustalX windows
interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids
Research 25, 4876–4882.
Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994) CLUSTAL W: improving the sensitivity of progressive
multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix
choice. Nucleic Acids Research 22, 4673–4680.
On to a demonstration of some
of SeqLab’s multiple sequence
dataset capabilities —
Glutathione Reductase, G-protein
coupled TM7 receptors, primate prions,
Human Papilloma Virus L1 major coat
protein, Major Histocompatibility Class
II, Vicilin seed storage proteins, and
Elongation Factor 1/Tu.