Multiple Sequence Alignment

Marine Biological Laboratory
— Workshop on Molecular
Evolution
Woods Hole, Massachusetts
July 25, 2006, 7 to 10 PM
Multiple Sequence
Alignment & Analysis
thru GCG’s SeqLab
Steven M. Thompson
Florida State University School of
Computational Science (SCS)
More data yields stronger analyses — if done carefully!
Mosaic ideas and evolutionary ‘importance.’
But first a prelude: My definitions
Biocomputing and computational biology are synonymous and
describe the use of computers and computational techniques to
analyze any biological system, from molecules, through cells,
tissues, organisms, and populations, to complete ecologies.
Bioinformatics describes using computational techniques to access,
analyze, and interpret the biological information in any of the
available online biological databases.
Sequence analysis is the study of molecular sequence data for the
purpose of inferring the function, mechanism, interactions,
evolution, and perhaps structure of biological molecules.
Genomics analyzes the context of genes or complete genomes (the
total DNA content of an organism) within and across genomes.
Proteomics is a subdivision of genomics concerned with analyzing
the complete protein complement, i.e. the proteome, of organisms,
both within and between different organisms.
And a ‘way’ to think about it:
The reverse biochemistry analogy:
from a ‘virtual’ DNA sequence to actual molecular
physical characterization, not the other way ‘round.
Using bioinformatics tools, you can infer all sorts
of functional, evolutionary, and structural
insights into a gene product, without the need
to isolate and purify massive amounts of
protein! Eventually you can go on to clone
and express the gene, based on that analysis,
using PCR techniques.
The computer and molecular databases are an
essential part of this process.
The exponential growth of molecular
sequence databases & cpu power
Year      Base pairs    Sequences
1982          680338          606
1983         2274029         2427
1984         3368765         4175
1985         5204420         5700
1986         9615371         9978
1987        15514776        14584
1988        23800000        20579
1989        34762585        28791
1990        49179285        39533
1991        71947426        55627
1992       101008486        78608
1993       157152442       143492
1994       217102462       215273
1995       384939485       555694
1996       651972984      1021211
1997      1160300687      1765847
1998      2008761784      2837897
1999      3841163011      4864570
2000     11101066288     10106023
2001     15849921438     14976310
2002     28507990166     22318883
2003     36553368485     30968418
2004     44575745176     40604319
2005     56037734462     52016762
Doubling time ~ 1 year!
http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
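The doubling-time figure can be checked against the GenBank numbers with a little arithmetic. The sketch below assumes steady exponential growth between the first and last rows of the statistics (1982 and 2005) and uses only those two base-pair counts; the function name is illustrative. Averaged over the whole period the result is a rough figure on the order of a year or so, not an exact constant, since growth was faster in some years than others.

```python
import math

# Endpoints copied from the GenBank statistics above (total base pairs).
START_YEAR, START_BP = 1982, 680_338
END_YEAR, END_BP = 2005, 56_037_734_462

def doubling_time(year0, n0, year1, n1):
    """Average doubling time in years, assuming steady exponential growth."""
    doublings = math.log2(n1 / n0)  # how many times the database doubled
    return (year1 - year0) / doublings

avg = doubling_time(START_YEAR, START_BP, END_YEAR, END_BP)
print(f"Average doubling time {START_YEAR}-{END_YEAR}: {avg:.2f} years")
```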
Back to multiple sequence
alignment — Applicability?
So what; why even bother?
Applications:
Probe/primer, and motif/profile design;
Graphical illustrations;
Comparative ‘homology’ inference;
Molecular evolutionary analysis.
OK — well, how do you do it?
Dynamic programming’s complexity
increases exponentially with the number of
sequences being compared:
N-dimensional matrix . . . .
complexity = [sequence length]^(number of sequences)
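To get a feel for how quickly that N-dimensional matrix grows, here is a small worked example. The sequence length of 300 is an arbitrary illustrative value, and the function simply counts matrix cells; it says nothing about the constant factors of any real program.

```python
def dp_cells(seq_length, n_sequences):
    """Cells in a full dynamic-programming matrix: length ** number of sequences."""
    return seq_length ** n_sequences

# Each added sequence multiplies the work by the sequence length.
for n in (2, 3, 4, 5):
    print(f"{n} sequences of length 300: {dp_cells(300, n):,} cells")
```

Two sequences need 90,000 cells; three already need 27 million, which is why exact multiple-sequence dynamic programming is reserved for very small problems.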
‘Global’ heuristic solutions
See —
MSA (‘global’ within ‘bounding box’) and
PIMA (‘local’ portions only) on the multiple
alignment page at the
Baylor College of Medicine’s Search
Launcher —
http://searchlauncher.bcm.tmc.edu/ — but,
severely limiting restrictions!
Multiple Sequence Dynamic Programming
Therefore — pairwise,
progressive dynamic
programming restricts the
solution to the neighborhood of only two
sequences at a time.
All sequences are
compared, pairwise, and
then each is aligned to its
most similar partner or
group of partners. Each
group of partners is then
aligned to finish the
complete multiple
sequence alignment.
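The ordering step of that procedure can be sketched in a few lines. This is not PileUp's or ClustalW's code: as a loudly labeled assumption, a cheap shared-word (k-mer) similarity stands in for the pairwise dynamic programming scores, and the sketch only decides which sequences or groups would be joined in which order. It does not perform the alignments themselves. All function names are illustrative.

```python
from itertools import combinations

def kmers(seq, k=2):
    """Set of overlapping k-letter words in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def similarity(a, b, k=2):
    """Jaccard similarity of k-mer sets (stand-in for a pairwise alignment score)."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0

def join_order(seqs, k=2):
    """Return the merges, most similar first, as pairs of name tuples.

    seqs maps name -> sequence. A group is a tuple of names; similarity
    between two groups is the mean over all cross-group sequence pairs.
    """
    def group_sim(g1, g2):
        scores = [similarity(seqs[x], seqs[y], k) for x in g1 for y in g2]
        return sum(scores) / len(scores)

    groups = [(name,) for name in seqs]
    merges = []
    while len(groups) > 1:
        # Pick the most similar pair of groups and merge them.
        g1, g2 = max(combinations(groups, 2), key=lambda pair: group_sim(*pair))
        merges.append((g1, g2))
        groups = [g for g in groups if g not in (g1, g2)] + [g1 + g2]
    return merges

example = {"a": "MKVLAAGIVG", "b": "MKVLAAGIVA", "c": "WWPPYYHHCC"}
for left, right in join_order(example):
    print("align", left, "with", right)
```

In a real progressive aligner each merge would be carried out by aligning the two sequences or groups with dynamic programming, so errors made in early merges are carried through to the end.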
Reliability and the
Comparative Approach —
explicit homologous correspondence;
manual adjustments should be
encouraged — based on knowledge,
especially structural, regulatory, and
functional sites.
Therefore, editors like SeqLab and
the Ribosomal Database Project:
http://rdp.cme.msu.edu/index.jsp
Structural & Functional correspondence in
the Wisconsin Package’s SeqLab —
Work with proteins!
If at all possible —
Twenty match symbols versus four, plus
similarity! Way better signal to noise.
Also guarantees no indels are placed
within codons. So translate, then align.
Nucleotide sequences will only reliably
align if they are very similar to each
other. And they will require extensive
hand editing and careful consideration.
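"Translate, then align" implies a final step: threading the original coding DNA back onto the protein alignment so that every gap falls between codons. A minimal sketch of that step follows. It assumes the coding sequence is in frame, has no stop codon, and matches the ungapped protein residue for residue; the function name is hypothetical and this is not a GCG program.

```python
def thread_codons(aligned_protein, cds, gap="-"):
    """Map one gapped protein alignment row back onto its coding DNA.

    Each residue consumes one codon; each protein gap becomes three
    gap symbols, so indels can never split a codon.
    """
    n_residues = sum(1 for ch in aligned_protein if ch != gap)
    if len(cds) != 3 * n_residues:
        raise ValueError("CDS length must be exactly 3 x number of residues")
    codons = iter(cds[i:i + 3] for i in range(0, len(cds), 3))
    return "".join(gap * 3 if ch == gap else next(codons)
                   for ch in aligned_protein)

print(thread_codons("M-K", "ATGAAA"))
```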
Beware of aligning apples and
oranges [and grapefruit]!
Paralogous
versus
orthologous;
genomic versus
cDNA;
mature versus
precursor.
Mask out uncertain areas —
Complications —
Order dependence.
Not that big of a deal.
Substitution matrices and gap penalties.
A very big deal!
Regional ‘realignment’ becomes incredibly
important, especially with sequences that
have areas of high and low similarity (GCG’s
PileUp -InSitu option).
Complications cont. —
Format hassles!
Specialized format conversion
tools such as GCG’s
SeqConv+ program and
PAUPSearch, and
Don Gilbert’s public domain
ReadSeq program.
Still more complications —
Indels and missing
data symbols (i.e.
gaps) designation
discrepancy
headaches —
., -, ~, ?, N, or X
. . . . . Help!
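One way to tame the symbol discrepancy is to normalize gap characters before handing an alignment to another program. The sketch below assumes that "." and "~" are being used as gap symbols in the input (as in GCG-style alignments) and that the target program wants "-". It deliberately leaves ?, N, and X alone by default, because those usually mean "unknown residue" rather than "gap"; pass them in explicitly only if you know your data uses them as gaps. The names are illustrative.

```python
# Assumption: these two characters mean "gap" in the input alignment.
DEFAULT_GAP_SYMBOLS = {".", "~"}

def normalize_gaps(row, gap="-", extra=()):
    """Replace known gap symbols in one alignment row with a single gap character."""
    symbols = DEFAULT_GAP_SYMBOLS | set(extra)
    return "".join(gap if ch in symbols else ch for ch in row)

print(normalize_gaps("~~AC.GT?N"))          # '?' and 'N' are left untouched
print(normalize_gaps("AC?G", extra="?"))    # treat '?' as a gap on request
```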
Web resources for pairwise,
progressive multiple alignment —
http://www.techfak.uni-bielefeld.de/bcd/Curric/MulAli/welcome.html
http://pbil.univ-lyon1.fr/alignment.html
http://www.ebi.ac.uk/clustalw/
http://searchlauncher.bcm.tmc.edu/
However, problems with very large datasets and
huge multiple alignments make doing multiple
sequence alignment on the Web impractical
after your dataset has reached a certain size.
You’ll know it when you’re there!
If large datasets become intractable for
analysis on the Web, what other
resources are available?
Desktop software solutions — public domain
programs are available, but . . . complicated to
install, configure, and maintain. User must be
pretty computer savvy. So,
commercial software packages are available, e.g.
MacVector, DS Gene, DNAsis, DNAStar, etc.,
but . . . license hassles, big expense per
machine, and Internet and/or CD database
access all complicate matters!
Therefore, UNIX server-based solutions
Public domain solutions also exist, but now a very cooperative
systems manager needs to maintain everything for users, so,
commercial products, e.g. the Accelrys GCG Wisconsin Package
and the SeqLab Graphical User Interface, simplify matters for
administrators and users. One format, one ‘look-and-feel.’
One license fee for an entire institution and very fast, convenient
database access on local server disks. Connections from any
networked terminal or workstation anywhere!
Operating system: UNIX command line operation hassles;
communications software — telnet, ssh, and terminal emulation; X
graphics; file transfer — ftp, and scp/sftp; and editors — vi, emacs,
pico (or desktop word processing followed by file transfer [save as
"text only!"]). See my supplement pdf file.
The Genetics Computer Group —
The Accelrys Wisconsin Package for Sequence Analysis
GCG began in 1982 in Oliver Smithies’ Genetics Dept. lab at the
University of Wisconsin, Madison; and then starting in 1990 it
became a private company; which was acquired by the Oxford
Molecular Group, U.K., in 1997; and then by Pharmacopeia Inc.,
U.S.A., in 2000; and then in 2004 Accelrys, San Diego,
California, left Pharmacopeia to become an independent entity.
The suite contains around 150 programs designed to work in a
“toolbox” fashion. Several simple programs used in succession
can lead to very sophisticated results.
Also ‘internal compatibility,’ i.e. once you learn to use one program,
all programs can be run similarly, and the output from many
programs can be used as input for other programs.
Used all over the world at over 950 institutions, so learning it will
likely be useful at other research institutions as well.
To answer the always perplexing GCG question — “What
sequence(s)? . . . .”
Specifying sequences, GCG style;
in order of increasing power and complexity:
The sequence is in a local GCG format single sequence file in your UNIX
account. (GCG Reformat and SeqConv+ programs)
The sequence is in a local GCG database in which case you ‘point’ to it by
using any of the GCG database logical names. A colon, “:,” always sets
the logical name apart from either an accession number or a proper
identifier name or a wildcard expression, and they are case insensitive.
The sequence is in a GCG format multiple sequence file, either an MSF
(multiple sequence format) file or an RSF (rich sequence format) file. To
specify sequences contained in a GCG multiple sequence file, supply the
file name followed by a pair of braces, “{},” containing the sequence
specification, e.g. a wildcard — {*}.
Finally, the most powerful method of specifying sequences is in a GCG “list”
file. It is merely a list of other sequence specifications and can even
contain other list files within it. The convention to use a GCG list file in a
program is to precede it with an at sign, “@.” Furthermore, you can
supply attribute information within list files to specify something special
about the sequence such as begin and end constraints.
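The four styles of specification can be told apart purely from their punctuation. The sketch below is not GCG code, only an illustration of the conventions just described: "@" marks a list file, braces mark a multiple sequence file, a colon marks a database logical name, and anything else is taken to be a single sequence file. The function name and the returned tuple layout are assumptions made for this example.

```python
def classify_spec(spec):
    """Classify a GCG-style sequence specification by its punctuation.

    Returns (kind, target, detail):
      list      -> ("list", list_file_name, None)
      multiple  -> ("multiple", file_name, pattern_inside_braces)
      database  -> ("database", LOGICAL_NAME, entry_or_wildcard)
      file      -> ("file", file_name, None)
    """
    spec = spec.strip()
    if spec.startswith("@"):
        return ("list", spec[1:], None)
    if spec.endswith("}") and "{" in spec:
        filename, _, inner = spec[:-1].partition("{")
        return ("multiple", filename, inner)
    if ":" in spec:
        logical, _, entry = spec.partition(":")
        # Logical names are case insensitive, so fold them to upper case.
        return ("database", logical.upper(), entry)
    return ("file", spec, None)

for s in ("my-special.pep", "SwissProt:EfTu_Ecoli", "Ef1a-Tu.msf{*}", "@another.list"):
    print(s, "->", classify_spec(s))
```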
‘Clean’ GCG format single sequence file after
‘reformat’ (or the SeqConv+ program)
!!NA_SEQUENCE 1.0
This is a small example of GCG single sequence format.
Always put some documentation on top, so in the future
you can figure out what it is you're dealing with! The
line with the two periods is converted to the checksum line.

example.seq  Length: 77  July 21, 1999 09:30  Type: N  Check: 4099  ..

       1  ACTGACGTCA CATACTGGGA CTGAGATTTA CCGAGTTATA CAAGTATACA
      51  GATTTAATAG CATGCGATCC CATGGGA

SeqLab’s Editor mode can also
“Import” native GenBank format and
ABI or LI-COR trace files!
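The role of the two periods is easy to see in code. The sketch below is a simplified reader for the single sequence layout shown above, not GCG's own parser: it treats the first line ending in ".." as the divider, keeps everything up to and including that line as documentation, and collects only letters from the remaining lines, so position numbers and spaces are dropped. Gap and stop symbols would also be dropped, which is a simplification.

```python
def parse_gcg_single(text):
    """Split GCG single-sequence text into (header_lines, sequence)."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.rstrip().endswith(".."):
            break  # i now indexes the divider (checksum) line
    else:
        raise ValueError("no '..' divider line found")
    header = lines[:i + 1]
    # Keep letters only: this strips coordinates, blanks, and line breaks.
    sequence = "".join(ch for line in lines[i + 1:] for ch in line if ch.isalpha())
    return header, sequence

example = """!!NA_SEQUENCE 1.0
Some documentation goes here.

example.seq  Length: 77  July 21, 1999 09:30  Type: N  Check: 4099  ..

       1  ACTGACGTCA CATACTGGGA CTGAGATTTA CCGAGTTATA CAAGTATACA
      51  GATTTAATAG CATGCGATCC CATGGGA
"""
header, seq = parse_gcg_single(example)
print(len(seq), seq[:10])
```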
Logical terms for the Wisconsin Package
Sequence databases, nucleic acids:
GENBANKPLUS     all of GenBank plus EST, HTC & GSS subdivisions
GBP             all of GenBank plus EST, HTC & GSS subdivisions
GENBANK         all of GenBank except EST, HTC & GSS subdivisions
GB              all of GenBank except EST, HTC & GSS subdivisions
BA              GenBank bacterial subdivision
BACTERIAL       GenBank bacterial subdivision
EST             GenBank EST (Expressed Sequence Tags) subdivision
GSS             GenBank GSS (Genome Survey Sequences) subdivision
HTC             GenBank High Throughput cDNA
HTG             GenBank High Throughput Genomic
IN              GenBank invertebrate subdivision
INVERTEBRATE    GenBank invertebrate subdivision
OM              GenBank other mammalian subdivision
OTHERMAMM       GenBank other mammalian subdivision
OV              GenBank other vertebrate subdivision
OTHERVERT       GenBank other vertebrate subdivision
PAT             GenBank patent subdivision
PATENT          GenBank patent subdivision
PH              GenBank phage subdivision
PHAGE           GenBank phage subdivision
PL              GenBank plant subdivision
PLANT           GenBank plant subdivision
PR              GenBank primate subdivision
PRIMATE         GenBank primate subdivision
RO              GenBank rodent subdivision
RODENT          GenBank rodent subdivision
STS             GenBank (Sequence Tagged Sites) subdivision
SY              GenBank synthetic subdivision
SYNTHETIC       GenBank synthetic subdivision
TAGS            GenBank EST, HTC & GSS subdivisions
UN              GenBank unannotated subdivision
UNANNOTATED     GenBank unannotated subdivision
VI              GenBank viral subdivision
VIRAL           GenBank viral subdivision

Sequence databases, amino acids:
GENPEPT         GenBank CDS translations
GP              GenBank CDS translations
UNIPROT or UNI  all of Swiss-Prot and all of SPTrEMBL
SWISSPROTPLUS   all of Swiss-Prot and all of SPTrEMBL
SWP             all of Swiss-Prot and all of SPTrEMBL
UNISPROT        all of Swiss-Prot (fully annotated)
SWISSPROT       all of Swiss-Prot (fully annotated)
SWISS           all of Swiss-Prot (fully annotated)
SW              all of Swiss-Prot (fully annotated)
UNITREMBL       Swiss-Prot preliminary EMBL translations
SPTREMBL        Swiss-Prot preliminary EMBL translations
SPT             Swiss-Prot preliminary EMBL translations
P               all of PIR Protein
PIR             all of PIR Protein
PIR1            PIR fully annotated subdivision
PIR2            PIR preliminary subdivision
PIR3            PIR unverified subdivision
PIR4            PIR unencoded subdivision
Note: not all GCG installations support the PIR database

General data files:
GENMOREDATA     path to GCG optional data files
GENRUNDATA      path to GCG default data files
These are easy —
they make sense and
you’ll have a vested
interest.
GCG MSF & RSF format
!!AA_MULTIPLE_ALIGNMENT 1.0

small.pfs.msf  MSF: 735  Type: P  July 20, 2001 14:53  Check: 6619 ..

 Name: a49171  Len:  425  Check:  537  Weight: 1.00
 Name: e70827  Len:  577  Check:   21  Weight: 1.00
 Name: g83052  Len:  718  Check: 9535  Weight: 1.00
 Name: f70556  Len:  534  Check: 3494  Weight: 1.00
 Name: t17237  Len:  229  Check: 9552  Weight: 1.00
 Name: s65758  Len:  735  Check:  111  Weight: 1.00
 Name: a46241  Len:  274  Check: 3514  Weight: 1.00

//
//////////////////////////////////////////////////
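The Name lines in an MSF header are regular enough to read with one pattern. The sketch below is an assumption-laden illustration and not part of the Wisconsin Package: it expects each entry on a single line in the order Name, Len, Check, Weight, and stops at the "//" line that separates the header from the alignment itself.

```python
import re

# Assumed field order on each header line: Name, Len, Check, Weight.
NAME_LINE = re.compile(
    r"Name:\s*(\S+)\s+Len:\s*(\d+)\s+Check:\s*(\d+)\s+Weight:\s*([\d.]+)"
)

def parse_msf_names(text):
    """Return one dict per 'Name:' line found before the '//' divider."""
    entries = []
    for line in text.splitlines():
        if line.strip() == "//":
            break  # the alignment rows start after this line
        match = NAME_LINE.search(line)
        if match:
            name, length, check, weight = match.groups()
            entries.append({"name": name, "len": int(length),
                            "check": int(check), "weight": float(weight)})
    return entries

sample = """ Name: a49171  Len:  425  Check:  537  Weight: 1.00
 Name: e70827  Len:  577  Check:   21  Weight: 1.00
//
 Name: ignored Len: 1 Check: 1 Weight: 1.00
"""
for entry in parse_msf_names(sample):
    print(entry["name"], entry["len"])
```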
!!RICH_SEQUENCE 1.0
..
{
name           ef1a_giala
descrip        PileUp of: @/users1/thompson/.seqlab-mendel/pileup_28.list
type           PROTEIN
longname       /users1/thompson/seqlab/EF1A_primitive.orig.msf{ef1a_giala}
sequence-ID    Q08046
checksum       7342
offset         23
creation-date  07/11/2001 16:51:19
strand         1
comments       ////////////////////////////////////////////////////////////

This is SeqLab’s native format
The trick is to not forget the Braces and ‘wild card,’ e.g.
filename{*}, when specifying!
The List File Format
remember the @ sign!
!!SEQUENCE_LIST 1.0
An example GCG list file of many elongation
1a and Tu factors follows. As with all GCG
data files, two periods separate
documentation from data.
..
my-special.pep  begin:24  end:134
SwissProt:EfTu_Ecoli
Ef1a-Tu.msf{*}
/usr/accounts/test/another.rsf{ef1a_*}
@another.list
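A list file like this one is simple to read programmatically. The sketch below is illustrative only and makes two assumptions: the documentation ends at a line containing nothing but "..", and any attributes such as begin and end follow the sequence specification on the same line as key:value fields. Nested list files (entries starting with "@") are returned as-is, not opened.

```python
def parse_list_file(text):
    """Return [(specification, {attribute: value}), ...] from GCG-style list text."""
    lines = text.splitlines()
    start = 0
    for i, line in enumerate(lines):
        if line.strip() == "..":
            start = i + 1  # data begins after the two-period divider
            break
    entries = []
    for line in lines[start:]:
        fields = line.split()
        if not fields:
            continue
        spec, attrs = fields[0], {}
        for field in fields[1:]:
            key, sep, value = field.partition(":")
            if sep:
                attrs[key.lower()] = value
        entries.append((spec, attrs))
    return entries

sample = """!!SEQUENCE_LIST 1.0
Documentation goes above the two periods.
..
my-special.pep  begin:24  end:134
SwissProt:EfTu_Ecoli
Ef1a-Tu.msf{*}
/usr/accounts/test/another.rsf{ef1a_*}
@another.list
"""
for spec, attrs in parse_list_file(sample):
    print(spec, attrs)
```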
The ‘way’ SeqLab works!
SeqLab — GCG’s X-based GUI!
SeqLab is the merger of Steve Smith’s Genetic
Data Environment and GCG’s Wisconsin
Package Interface:
GDE + WPI = SeqLab
Requires an X-Windowing environment:
either native on UNIX computers (including
Linux; it is not installed by default on Mac OS
X [v.10+] systems, but see Apple’s free
X11 package or XDarwin), or emulated with X-Server software on personal computers.
Conclusions —
Gunnar von Heijne, in his old but quite readable treatise, Sequence
Analysis in Molecular Biology: Treasure Trove or Trivial Pursuit
(1987), provides a very appropriate conclusion:
“Think about what you’re doing; use your knowledge of the molecular
system involved to guide both your interpretation of results and your
direction of inquiry; use as much information as possible; and do not
blindly accept everything the computer offers you.”
He continues:
“. . . if any lesson is to be drawn . . . it surely is that to be able to make a
useful contribution one must first and foremost be a biologist, and only
second a theoretician . . . . We have to develop better algorithms, we
have to find ways to cope with the massive amounts of data, and above
all we have to become better biologists. But that’s all it takes.”
FOR MORE INFO...
Explore my Web Home: http://bio.fsu.edu/~stevet/cv.html.
Contact me ([email protected]) for specific long-distance
bioinformatics assistance and collaboration.
AND FOR EVEN MORE INFO...
Many texts are now available in
the field. To ‘honk-my-own-horn’ a bit,
check out:
Current Protocols in Bioinformatics
from John Wiley & Sons, Inc.
(http://www.does.org/cp/bioinfo.html);
and Horizon Scientific
Press’
Computational Genomics:
Theory and Application
(http://www.horizonpress.com/hsp/books/com.html).
Humana Press’
Introduction to Bioinformatics:
A Theoretical And Practical Approach
(http://www.humanapress.com/Product.pasp?txtCatalog=HumanaBooks&txtCategory=&txtProductID=1-58829-241X&isVariant=0);
They all asked me to
contribute chapters on
multiple sequence
alignment and analysis
using GCG software.
On to a demonstration of some of
SeqLab’s multiple sequence
dataset capabilities —
some of my prebuilt alignments, and . . .
Elongation Factor 1/Tu, how to do it.