BioInfo Survey - University of Massachusetts Boston

Download Report

Transcript BioInfo Survey - University of Massachusetts Boston

The “Nuts and Bolts” of ‘doing’
bioinformatics with the
Wisconsin Package at FSU
Steve Thompson
Florida State University School of
Computational Science (SCS)
BCH 5425 Molecular Biology
Dr. Hong Li
February 16, 2005
Given nucleotide or amino acid sequence data,
what can we learn about biological molecules,
using the popular Accelrys Wisconsin Package?
But first some of my definitions, lots of overlap —
Biocomputing and computational biology are synonyms and
describe the use of computers and computational techniques to
analyze any type of a biological system, from individual molecules
to organisms to overall ecology.
Bioinformatics describes using computational techniques to access,
analyze, and interpret the biological information in any type of
biological database.
Sequence analysis is the study of molecular sequence data for the
purpose of inferring the function, interactions, evolution, and
perhaps structure of biological molecules.
Genomics analyzes the context of genes or complete genomes (the
total DNA content of an organism) within the same and/or across
different genomes.
Proteomics is the subdivision of genomics concerned with analyzing
the complete protein complement, i.e. the proteome, of organisms,
both within and between different organisms.
And one way to think about the field —
The reverse biochemistry analogy.
Biochemists no longer have to begin a research project by
isolating and purifying massive amounts of a protein from
its native organism in order to characterize a particular
gene product. Rather, now scientists can amplify a
section of some genome based on its similarity to other
genomes, sequence that piece of DNA and, using
sequence analysis tools, infer all sorts of functional,
evolutionary, and, perhaps, structural insight into that
stretch of DNA! They can then clone and express it.
The computer and molecular databases are a
necessary, integral part of this entire process.
The exponential growth of molecular sequence
databases & cpu power —
Year
1982
1983
1984
1985
1986
1987
1988
1989
1990
1991
1992
1993
1994
1995
1996
1997
1998
1999
2000
2001
2002
2003
BasePairs
Sequences
680338
606
2274029
2427
3368765
4175
5204420
5700
9615371
9978
15514776
14584
23800000
20579
34762585
28791
49179285
39533
71947426
55627
101008486 78608
157152442 143492
217102462 215273
384939485 555694
651972984 1021211
1160300687 1765847
2008761784 2837897
3841163011 4864570
11101066288 10106023
15849921438 14976310
28507990166 22318883
36553368485 30968418
doubling time ~
one year
Q uickTim e™ and a
TI FF ( Uncompr essed) decompr essor
ar e needed t o see t his pict ur e.
http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
Another perspective on size and some organization stuff —
As of February 2005 the sequences in GenBank also include over
240 complete genomes, not including viruses! Nucleic acid
sequence databases (and TrEMBL) are split into subdivisions
based on taxonomy (historical rankings — the Fungi and Archaea
warning!). PIR is split into subdivisions based on level of
annotation. TrEMBL sequences are merged into SWISS-PROT
as they receive increased levels of annotation.
Nucleic Acid DB’s
GenBank/EMBL/DDBJ
all Taxonomic
categories + HTC’s,
HTG’s, & STS’s
“Tags”
EST’s
GSS’s
Amino Acid DB’s
SWISS-PROT
TrEMBL
PIR
PIR1
PIR2
PIR3
PIR4
NRL_3D
Genpept
So how do you access and manipulate all this data?
Often on the InterNet over the World Wide Web:
Site
URL (Uniform Resource Locator)
Content
Nat’l Center Biotech' Info'
http://www.ncbi.nlm.nih.gov/
databases/analysis/software
PIR/NBRF
http://www-nbrf.georgetown.edu/
protein sequence database
IUBIO Biology Archive
http://iubio.bio.indiana.edu/
database/software archive
Univ. of Montreal
http://megasun.bch.umontreal.ca/
database/software archive
Japan's GenomeNet
http://www.genome.ad.jp/
databases/analysis/software
European Mol' Bio' Lab'
http://www.embl-heidelberg.de/
databases/analysis/software
European Bioinformatics
http://www.ebi.ac.uk/
databases/analysis/software
The Sanger Institute
http://www.sanger.ac.uk/
databases/analysis/software
Univ. of Geneva BioWeb
http://www.expasy.ch/
databases/analysis/software
ProteinDataBank
http://www.rcsb.org/pdb/
3D mol' structure database
Molecules to Go
http://molbio.info.nih.gov/cgi-bin/pdb/
3D protein/nuc' visualization
The Genome DataBase
http://www.gdb.org/
The Human Genome Project
Stanford Genomics
http://genome-www.stanford.edu/
various genome projects
Inst. for Genomic Res’rch
http://www.tigr.org/
esp. microbial genome projects
HIV Sequence Database
http://hiv-web.lanl.gov/
HIV epidemeology seq' DB
The Tree of Life
http://tolweb.org/tree/phylogeny.html
overview of all phylogeny
Ribosomal Database Proj’
http://rdp.cme.msu.edu/index.jsp
databases/analysis/software
PUMA2 at Argonne
http://compbio.mcs.anl.gov/puma2/cgi-bin/
metabolic reconstruction
Harvard Bio' Laboratories
http://golgi.harvard.edu/
nice bioinformatics links list
With a World Wide Web browser and tools like NCBI’s Entrez & EMBL’s SRS
But ‘doing’ bioinformatics on the Web has
both its pros and its cons —
Advantages: Accesses the very latest database
updates. It’s fun and very fast. It can be very
powerful and efficient, if you know what you’re doing.
In most cases relational links between different
databases ease navigation, and in some cases
neighboring concepts link similar entries.
Disadvantages: Can be very inefficient, if you don’t
know what you’re doing. Reformatting downloaded
sequence data is usually essential, if the sequence is
to be used in any other software. And, it’s very easy
to get lost and distracted in cyberspace!
Also, problems sometimes arise with the World Wide
Web itself, like dropped or slow connections . . . .
So what are the alternatives?
Personal computer software solutions — public domain
programs are available, but . . . a bit complicated to
install, configure, and maintain. User must be pretty
computer savvy. So,
good commercial software packages are also available,
e.g. Sequencher, MacVector, DNAStar, DNAsis, etc.,
but . . . license hassles, especially big expense per
machine, and Internet and/or CD database access all
complicate matters!
Therefore, UNIX server-based, non-Web
solutions are available as an alternative.
Public domain solutions also exist for UNIX servers, but
now a very cooperative systems manager needs to
maintain everything for users. So,
commercial products, e.g. the Accelrys GCG Wisconsin
Package [a Pharmacopeia Co.] and the SeqLab Graphical
User Interface, simplify matters for administrators and
users. One commercial license fee for an entire
institution and very fast, convenient database access
on local server disks. Connections from any
networked terminal or workstation anywhere, anytime!
Mendel (mendel.csit.fsu.edu) — FSU’s
UNIX (Linux) Biocomputing Server —
Operating system — UNIX command line;
communications software — telnet vs. ssh; X graphics;
ssh -X [email protected]
file transfer — ftp vs. scp/sftp;
and editors — vi, emacs, pico (or word processing
followed by file transfer [save as "text only!"]).
How do I get an account — just ask me! I am the
contact person for Mendel. It usually takes a couple of
days for the SCS system administrator to act on my
request. Anybody associated with FSU is entitled to an
account and there are NO fees associated with it.
The Genetics Computer Group —
the Wisconsin Package for Sequence Analysis.
Begun in 1982 in Oliver Smithies’ lab at the Genetics Dept. at the
University of Wisconsin, Madison, then a private company for
over 10 years, then acquired by the Oxford Molecular Group
U.K., and now owned by Pharmacopeia U.S.A. under the new
name Accelrys, Inc.
The suite contains almost 150 programs designed to work in a
"toolbox" fashion. Several simple programs used in succession
can lead to sophisticated results.
Also 'internal compatibility,' i.e. once you learn to use one program,
all programs can be run similarly, and, the output from many
programs can be used as input for other programs.
Used all over the world by more than 30,000 scientists at over 530
institutions in 35 countries, so learning it here will most likely be
useful anywhere else you may end up.
To answer the always perplexing GCG question — “What
sequence(s)? . . . .” Specifying sequences, GCG style;
in order of increasing power and complexity:
1) The sequence is in a local GCG format single sequence file in your UNIX
account. (GCG Reformat and all From- & To- programs)
2) The sequence is in a local GCG database in which case you ‘point’ to it
by using any of the GCG database logical names. A colon, “:,” always
sets the logical name apart from either an accession number or a proper
identifier name or a wildcard expression and they are case insensitive.
3) The sequence is in a GCG format multiple sequence file, either an MSF
(multiple sequence format) file or an RSF (rich sequence format) file. To
specify sequences contained in a GCG multiple sequence file, supply the
file name followed by a pair of braces, “{},” containing the sequence
specification, e.g. a wildcard — {*}.
4) Finally, the most powerful method of specifying sequences is in a GCG
“list” file. This is merely a list of other sequence specifications and can
even contain other list files within it. The convention to use a GCG list file
in a program is to precede it with an at sign, “@.” Furthermore, attribute
information within list files can specify particular sequence aspects.
The first way —
‘Clean’ GCG format single sequence file after
‘reformat’ (or any of the From… programs)
This is a small example of GCG single sequence format.
Always put some documentation on top, so in the future
you can figure out what it is you're dealing with! The
line with the two periods is converted to the checksum line.
example.seq
1
51
Length: 77
July 21, 1999 09:30
Type: N
Check: 4099
..
ACTGACGTCA CATACTGGGA CTGAGATTTA CCGAGTTATA CAAGTATACA
GATTTAATAG CATGCGATCC CATGGGA
SeqLab’s Editor mode can also
“Import” native GenBank format and
ABI or LI-COR trace files!
The logical terms for the second way of running the Wisconsin Package
Sequence databases, nucleic acids:
Sequence databases, amino acids:
GENBANKPLUS
all of GenBank plus EST and GSS subdivisions
GENPEPT
GenBank CDS translations
GBP
all of GenBank plus EST and GSS subdivisions
GP
GenBank CDS translations
GENBANK
all of GenBank except EST and GSS subdivisions
SWISSPROTPLUS
all of Swiss-Prot and all of SPTrEMBL
GB
all of GenBank except EST and GSS subdivisions
SWP
all of Swiss-Prot and all of SPTrEMBL
BA
GenBank bacterial subdivision
SWISSPROT
all of Swiss-Prot (fully annotated)
BACTERIAL
GenBank bacterial subdivision
SW
all of Swiss-Prot (fully annotated)
EST
GenBank EST (Expressed Sequence Tags) subdivision
SPTREMBL
Swiss-Prot preliminary EMBL translations
GSS
GenBank GSS (Genome Survey Sequences) subdivision
SPT
Swiss-Prot preliminary EMBL translations
HTC
GenBank High Throughput cDNA
P
all of PIR Protein
HTG
GenBank High Throughput Genomic
PIR
all of PIR Protein
IN
GenBank invertebrate subdivision
PROTEIN
PIR fully annotated subdivision
INVERTEBRATE
GenBank invertebrate subdivision
PIR1
PIR fully annotated subdivision
OM
GenBank other mammalian subdivision
PIR2
PIR preliminary subdivision
OTHERMAMM
GenBank other mammalian subdivision
PIR3
PIR unverified subdivision
OV
GenBank other vertebrate subdivision
PIR4
PIR unencoded subdivision
OTHERVERT
GenBank other vertebrate subdivision
NRL_3D
PDB 3D protein sequences
PAT
GenBank patent subdivision
NRL
PDB 3D protein sequences
PATENT
GenBank patent subdivision
PH
GenBank phage subdivision
Genome databases
PHAGE
GenBank phage subdivision
HOMO
NCBI human refseq
PL
GenBank plant subdivision
DANIO
Sanger Zebrafish build
PLANT
GenBank plant subdivision
PR
GenBank primate subdivision
General data files:
PRIMATE
GenBank primate subdivision
GENMOREDATA
path to GCG optional data files
RO
GenBank rodent subdivision
GENRUNDATA
path to GCG default data files
RODENT
GenBank rodent subdivision
GENTRAINDATA
path to GCG training datasets
STS
GenBank (sequence tagged sites) subdivision
SY
GenBank synthetic subdivision
SYNTHETIC
GenBank synthetic subdivision
TAGS
GenBank EST and GSS subdivisions
UN
GenBank unannotated subdivision
UNANNOTATED
GenBank unannotated subdivision
VI
GenBank viral subdivision
VIRAL
GenBank viral subdivision
These are easy —
they make sense and
you’ll have a vested
interest.
The third way — multiple sequence
formats — GCG MSF & RSF format
!!AA_MULTIPLE_ALIGNMENT 1.0
small.pfs.msf
Name:
Name:
Name:
Name:
Name:
Name:
Name:
//
a49171
e70827
g83052
f70556
t17237
s65758
a46241
MSF: 735
Type: P
Len:
Len:
Len:
Len:
Len:
Len:
Len:
425
577
718
534
229
735
274
July 20, 2001 14:53
Check:
Check:
Check:
Check:
Check:
Check:
Check:
537
21
9535
3494
9552
111
3514
Weight:
Weight:
Weight:
Weight:
Weight:
Weight:
Weight:
Check: 6619 ..
1.00
1.00
1.00
1.00
1.00
1.00
1.00
//////////////////////////////////////////////////
!!RICH_SEQUENCE 1.0
..
{
name ef1a_giala
descrip
PileUp of: @/users1/thompson/.seqlab-mendel/pileup_28.list
type
PROTEIN
longname /users1/thompson/seqlab/EF1A_primitive.orig.msf{ef1a_giala}
sequence-ID Q08046
checksum
7342
offset
23
This is SeqLab’s native format
creation-date 07/11/2001 16:51:19
strand 1
comments ////////////////////////////////////////////////////////////
The trick is to not forget the Braces and ‘wild card,’ e.g. filename{*}!
And the forth way, the most powerful
way by far — the List File format
An example GCG list file of many elongation
1a and Tu factors follows. As with all GCG
data files, two periods separate
documentation from data.
..
my-special.pep
begin:24
end:134
SwissProt:EfTu_Ecoli
Ef1a-Tu.msf{*}
/usr/accounts/test/another.rsf{ef1a_*}
@another.list
The ‘way’ SeqLab works!
Within the GCG suite —
LookUp, a Sequence Retrieval System (SRS)
derivative, is used to find sequences of
interest based on text words, and database
similarity searches find sequences from
local GCG server databases.
Advantages: Search output is a legitimate GCG list file,
appropriate input to other GCG programs; no need to
download and then reformat — it’s all GCG.
Disadvantage: DB’s only as new as GCG administrator
(me) maintains them. I update every two months to
coincide with NCBI’s full releases.
Let’s build two list files with LookUp —
One, elongation factor 1 alpha from humans, and
two, all proteins in the SwissProt database from the
so-called non-crown ‘primitive’ eukaryotes.
I’ll use the following search strings:
“elongation & factor & alpha” in the
“Definition” category and “Homo” in the
“Organism” field for the first search, and
“eukaryota ! ( fungi | metazoa |
viridiplantae )” in the “Organism”
category for the second search.
These two searches illustrate LookUp’s syntax
rules, in particular it’s Boolean qualifiers.
SeqLab — GCG’s X-based GUI!
The SeqLab graphical user interface is the
merger of Steve Smith’s Genetic Data
Environment and GCG’s Wisconsin Package
Interface:
GDE + WPI = SeqLab
Requires an X-Windowing environment —
either native on UNIX computers (including
LINUX, but not included by Apple in Mac OS
X [v.10+] see Apple’s free X11 package), or
emulated with X-Server Software on other
personal computers.
SeqLab — Editor mode, residue display —
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
Structural & functional correspondence —
So let’s see what it looks
like, SeqLab in action —
From an X ‘aware’ terminal window I
launch the GUI with the command:
seqlab &
The ampersand is not required, but
it allows you to continue to use the
terminal window for system level
commands by running SeqLab as a
background process.
OK then, how can we see if two
sequences are similar enough to belong
in alignments? So first homology and
similarity —
Don’t confuse homology with similarity: there is
a huge difference! Similarity is a statistic that
describes how much two (sub)sequences are
alike according to some set scoring criteria. It
can be normalized to ascertain statistical
significance, but it’s still just a number.
Homology, in contrast and by definition —
implies an evolutionary relationship, more than just
everything evolving from the same primordial ‘slime.’ To
demonstrate homology reconstruct the phylogeny of the
organisms or genes of interest. Better yet, show some
experimental evidence — structural, morphological,
genetic, and/or fossil — that corroborates your assertion.
Percent homology is an invalid concept; something is
either homologous or it is not. Walter Fitch is credited
with the joke “homology is like pregnancy — you can’t be
45% pregnant, just like something can’t be 45%
homologous.” Highly significant similarity can argue for
homology; however, the inverse does not hold.
So, to introduce the concept of
sequence comparison, a graphical
method . . .
One way — Dot Matrices.
Provide a ‘Gestalt’ of all possible alignments
between two sequences.
To begin — very simple 0, 1 (match,
nomatch) identity scoring function.
Put a dot wherever symbols match.
Identities and insertion/deletion events (indels)
identified (zero:one match score matrix, no window).
Noise due to random composition contributes to confusion. To ‘clean up’ the
plot consider a filtered windowing approach. A dot is placed at the middle of a
window if some ‘stringency’ is met within that defined window size. Then the
window is shifted one position and the entire process is repeated (zero:one
match score, window of size three and a stringency level of two out of three).
Dot matrix analysis requires
two programs in the
Wisconsin Package —
Compare generates the data that
serves as input to DotPlot, which
actually draws the matrix.
Let’s see how a couple of the
elongation factors that we found
earlier look using this method.
SW:EF11_Human vs.
SW:EF1a_Schco
Exact alignment — but how can we ‘see’ the
correspondence of individual residues?
We can compare one molecule against another by
aligning them. However, a ‘brute force’ approach just
won’t work. Even without considering the introduction of
gaps, the computation required to compare all possible
alignments between two sequences requires time
proportional to the product of the lengths of the two
sequences. Therefore, if the two sequences are
approximately the same length (N), this is a N2 problem.
To include gaps, we would have to repeat the
calculation 2N times to examine the possibility of gaps
at each possible position within the sequences, now a
N4N problem. There’s no way! We need an algorithm.
But —
Just what the heck is an algorithm ! ?
Merriam-Webster’s says: “A rule
of procedure for solving a
problem [often mathematical] that
frequently involves repetition of
an operation.”
So, you could write an algorithm
for tying your shoe! It’s just a set
of explicit instructions for doing
some routine task.
Enter the Dynamic Programming Algorithm!
Computer scientists figured it out long ago; Needleman and Wunsch
applied it to the alignment of the full lengths of two sequences in
1970. An optimal alignment is defined as an arrangement of two
sequences, 1 of length i and 2 of length j, such that:
1)
you maximize the number of matching symbols between 1 and 2;
2)
you minimize the number of indels within 1 and 2; and
3)
you minimize the number of mismatched symbols between 1 and 2.
Therefore, the actual solution can be represented by:
Sij = sij + max
Si-1
max
2 <
max
2 <
j-1
Si-x j-1 + wx-1
x < i
Si-1 j-y + wy-1
y < I
or
or
Where Sij is the score for the alignment ending at i in sequence
1 and j in sequence 2,
sij is the score for aligning i with j,
wx is the score for making a x long gap in sequence 1,
wy is the score for making a y long gap in sequence 2,
allowing gaps to be any length in either sequence.
An oversimplified example —
total penalty = gap opening penalty {zero here} + ([length of gap][gap extension penalty {one here}])
Optimum Alignments —
There may be more than one best path through the matrix, and
optimum doesn’t guarantee biologically correct. Starting at the top
and working down, then tracing back, I found one best alignment:
cTATAtAagg
| |||||
cg.TAtAaT.
With our example’s scoring scheme this alignment’s final score is 5,
the highest bottom-right score in the trace-back path graph, and the
sum of six matches minus one interior gap. This is the number
optimized by the algorithm, not any type of a percentage! Only one
optimal solution will be reported. Do you have any ideas about how
others can be discovered, besides alternate trace back paths?
Answer — Often if you reverse the solution of the entire process,
other solutions will be found!
This was a global solution. Smith Waterman style local solutions
(1981) use negative numbers in the match matrix and pick the best
diagonal within overall graph gives local.
What about proteins — conservative replacements and similarity as
opposed to identity. The nitrogenous bases, A, C, T, G, are either the
same or they’re not, but amino acids can be similar, genetically,
evolutionarily, and structurally! Enter log-odds scoring matrices.
BLOSUM62 amino acid substitution matrix (the default in many sequence analysis programs).
A
C
D
E
F
G
H
I
K
L
M
N
P
Q
R
S
T
V
W
X
Y
A
4
0
-2
-1
-2
0
-2
-1
-1
-1
-1
-2
-1
-1
-1
1
0
0
-3
-1
-2
C
0
9
-3
-4
-2
-3
-3
-1
-3
-1
-1
-3
-3
-3
-3
-1
-1
-1
-2
-1
-2
D
-2
-3
6
2
-3
-1
-1
-3
-1
-4
-3
1
-1
0
-2
0
-1
-3
-4
-1
-3
E
-1
-4
2
5
-3
-2
0
-3
1
-3
-2
0
-1
2
0
0
-1
-2
-3
-1
-2
F
-2
-2
-3
-3
6
-3
-1
0
-3
0
0
-3
-4
-3
-3
-2
-2
-1
1
-1
3
G
0
-3
-1
-2
-3
6
-2
-4
-2
-4
-3
0
-2
-2
-2
0
-2
-3
-2
-1
-3
H
-2
-3
-1
0
-1
-2
8
-3
-1
-3
-2
1
-2
0
0
-1
-2
-3
-2
-1
2
I
-1
-1
-3
-3
0
-4
-3
4
-3
2
1
-3
-3
-3
-3
-2
-1
3
-3
-1
-1
K
-1
-3
-1
1
-3
-2
-1
-3
5
-2
-1
0
-1
1
2
0
-1
-2
-3
-1
-2
L
-1
-1
-4
-3
0
-4
-3
2
-2
4
2
-3
-3
-2
-2
-2
-1
1
-2
-1
-1
M
-1
-1
-3
-2
0
-3
-2
1
-1
2
5
-2
-2
0
-1
-1
-1
1
-1
-1
-1
N
-2
-3
1
0
-3
0
1
-3
0
-3
-2
6
-2
0
0
1
0
-3
-4
-1
-2
P
-1
-3
-1
-1
-4
-2
-2
-3
-1
-3
-2
-2
7
-1
-2
-1
-1
-2
-4
-1
-3
Q
-1
-3
0
2
-3
-2
0
-3
1
-2
0
0
-1
5
1
0
-1
-2
-2
-1
-1
R
-1
-3
-2
0
-3
-2
0
-3
2
-2
-1
0
-2
1
5
-1
-1
-3
-3
-1
-2
S
1
-1
0
0
-2
0
-1
-2
0
-2
-1
1
-1
0
-1
4
1
-2
-3
-1
-2
T
0
-1
-1
-1
-2
-2
-2
-1
-1
-1
-1
0
-1
-1
-1
1
5
0
-2
-1
-2
V
0
-1
-3
-2
-1
-3
-3
3
-2
1
1
-3
-2
-2
-3
-2
0
4
-3
-1
-1
W
-3
-2
-4
-3
1
-2
-2
-3
-3
-2
-1
-4
-4
-2
-3
-3
-2
-3
11
-1
2
X
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
Y
-2
-2
-3
-2
3
-3
2
-1
-2
-1
-1
-2
-3
-1
-2
-2
-2
-1
2
-1
7
Notice that positive values for identity range from 4 to 11 and negative values for those
substitutions that rarely occur go as low as –4. The most conserved residue is tryptophan with a
score of 11; cysteine is next with a score of 9; both proline and tyrosine get scores of 7 for identity.
We can imagine screening databases for sequences
similar to ours using the concepts of dynamic
programming and log-odds scoring matrices and yet to
be described algorithmic tricks.
But why even bother? Inference
through homology is a
fundamental principle of biology!
When a sequence is found to fall into a preexisting
family we may be able to infer function, mechanism,
evolution, perhaps even structure, based on homology
with its neighbors.
Independent of all that, what is a
‘good’ alignment?
So, first — Significance:
when is any alignment worth
anything biologically?
An old statistics trick — Monte Carlo simulations:
Z score = [ ( actual score ) - ( mean of randomized scores ) ]
( standard deviation of randomized score distribution )
The Wisconsin Package dynamic
programmings tools —
BestFit — Smith Waterman local
alignments,
Gap — Needleman Wunsch global
alignments,
FrameAlign — nucleotide to protein, either
local or global.
I’ll illustrate in SeqLab with same previous
example, but at the command line:
bestfit sw:ef11_human sw:ef1a_schco -shuffle=100
The Normal distribution —
Many Z scores measure the distance from the mean
using this simplistic Monte Carlo model assuming a
Gaussian distribution, a.k.a. the Normal distribution
(http://mathworld.wolfram.com/NormalDistribution.html),
in spite of the fact that ‘sequence-space’ actually
follows what is know as the ‘Extreme Value distribution.’
Regardless, Monte Carlo methods approximate
significance estimates pretty well.
0:==
< 20 650
0:
0
22
0:=
3
24
8:*
26 22
28 98 87:*
30 289 528:*
32 1714 2042:===*
34 5585 5539:=========*
36 12495 11375:==================*==
38 21957 18799:===============================*=====
40 28875 26223:===========================================*====
42 34153 32054:=====================================================*===
44 35427 35359:==========================================================*
46 36219 36014:===========================================================*
48 33699 34479:======================================================== *
50 30727 31462:=================================================== *
52 27288 27661:=============================================*
54 22538 23627:====================================== *
56 18055 19736:============================== *
58 14617 16203:========================= *
60 12595 13125:=====================*
62 10563 10522:=================*
64 8626 8368:=============*=
66 6426 6614:==========*
68 4770 5203:========*
70 4017 4077:======*
72 2920 3186:=====*
74 2448 2484:====*
76 1696 1933:===*
78 1178 1503:==*
80 935 1167:=*
82 722 893:=*
84 454 707:=*
86 438 547:*
88 322 423:*
90 257 328:*
92 175 253:*
94 210 196:*
96 102 152:*
98 63 117:*
100 58 91:*
102 40 70:*
104 30 54:*
106 17 42:*
108 14 33:*
110 14 25:*
112 12 20:*
9 15:*
114
6 12:*
116
9:*
8
118
7:*=
>120 1030
‘Sequence-space’ (Huh, what’s that?)
actually follows the ‘Extreme Value distribution’
(http://mathworld.wolfram.com/ExtremeValueDistribution.html).
Based on this known statistical
distribution, and robust
statistical methodology, a
realistic Expectation function,
the E Value, can be calculated
from database searches.
The ‘take-home’ message is . . .
The Expectation Value?
The higher the E value is, the more probable
that the observed match is due to chance in a
search of the same size database, and the
lower its Z score will be, i.e. is NOT significant.
Therefore, the smaller the E value, i.e. the
closer it is to zero, the more significant it is and
the higher its Z score will be! The E value is
the number that really matters.
Rules of thumb for a protein search —
The Z score represents the number of standard deviations some
particular alignment is from a distribution of random alignments
(often the Normal distribution).
They very roughly correspond to the listed E Values (based on the
Extreme Value distribution) for a typical protein sequence similarity
search. But remember probabilities are dependent on the size and
composition of the database and even on how often you search!
On to the searches —
How can you search the databases for
similar sequences, if pair-wise alignments
take N2 time?!
Database searching programs use the two
concepts of dynamic programming and log-odds
scoring matrices; however, dynamic programming
takes far too long when used against most
sequence databases with a ‘normal’ computer.
Remember how huge the databases are!
Therefore, the programs use tricks to make things
happen faster. These tricks fall into two main
categories, hashing and heuristics.
Corn beef hash? Huh . . .
Hashing is the process of breaking your sequence into
small ‘words’ or ‘k-tuples’ (think all chopped up, just like
corn beef hash) of a set size and creating a ‘look-up’
table with those words keyed to position numbers.
Computers can deal with numbers way faster than they
can deal with strings of letters, and this preprocessing
step happens very quickly.
Then when any of the word positions match part of an
entry in the database, that match, the ‘offset,’ is saved.
In general, hashing reduces the complexity of the search
problem from N2 for dynamic programming to N, the
length of all the sequences in the database.
OK. Heuristics . . . What’s that?
Approximation techniques are collectively known as ‘heuristics.’
Webster’s defines heuristic as “serving to guide, discover, or reveal;
. . . but unproved or incapable of proof.”
In database similarity searching techniques the heuristic usually
restricts the necessary search space by calculating some sort of a
statistic that allows the program to decide whether further scrutiny
of a particular match should be pursued. This statistic may miss
things depending on the parameters set — that’s what makes it
heuristic. ‘Worthwhile’ results at the end are compiled and the
longest alignment within the program’s restrictions is created.
The exact implementation varies between the different programs,
but the basic idea follows in most all of them.
Two predominant versions exist: BLAST and Fast
Both return local alignments, and are not a single program, but rather
a family of programs with implementations designed to compare a
sequence to a database in about every which way imaginable.
These include:
1) a DNA sequence against a DNA database (not recommended unless
forced to do so because you are dealing with a non-translated region of
the genome — DNA is just too darn noisy, only identity & four bases!),
2) a translated (where the translation is done ‘on-the-fly’ in all six frames)
version of a DNA sequence against a translated (‘on-the-fly’ six-frame)
version of the DNA database (not available in the Fast package),
3) a translated (‘on-the-fly’ six-frame) version of a DNA sequence against a
protein database,
4) a protein sequence against a translated (‘on-the-fly’ six-frame) version of
a DNA database,
5) or a protein sequence against a protein database.
Many implementations allow for the possibility of frame shifts in
translated comparisons and don’t penalize the score for doing so.
The BLAST and Fast programs — some generalities
BLAST — Basic Local Alignment
Search Tool, developed at NCBI.
FastA — and its family of relatives,
developed by Bill Pearson at the
1) Normally NOT a good idea
to use for DNA against
DNA searches w/o
translation (not optimized);
2) Pre-filters repeat and “low
complexity” sequence
regions;
4) Can find more than one
region of gapped similarity;
5) Very fast heuristic and
parallel implementation;
6) Restricted to precompiled,
specially formatted
databases;
University of Virginia.
1) Works well for DNA against
DNA searches (within limits
of possible sensitivity);
2) Can find only one gapped
region of similarity;
3) Relatively slow, should often
be run in the background;
4) Does not require specially
prepared, preformatted
databases.
The algorithms, in brief —
BLAST:
Two word hits on the
same diagonal above
some similarity
threshold triggers
ungapped extension until
the score isn’t improved
enough above another
threshold:
the HSP.
Initiate gapped extensions
using dynamic programming for
those HSP’s above a third
threshold up to the point where
the score starts to drop below a
fourth threshold: yields
alignment.
Find all ungapped exact
word hits; maximize the
ten best continuous
regions’ scores: init1.
Fast:
Combine nonoverlapping init
regions on different
diagonals:
initn.
Use dynamic
programming ‘in a
band’ for all regions
with initn scores
better than some
threshold: opt score.
I’ll illustrate with FastA —
FastA of human elongation factor 1 alpha
searched against that list file of primitive
organism proteins from SwissProt.
I’ll show SeqLab’s implementation, but
at the command line it would be:
fasta sw:ef11_human @primitive.list
Multiple Sequence Analysis:
Multiple Sequence Alignment.
Dynamic programming’s complexity increases exponentially with the
number of sequences being compared. N-dimensional matrix . . . .
Therefore — pairwise,
progressive dynamic
programming restricts the
solution to the
neighborhood of only two
sequences at a time.
All sequences are
compared, pairwise, and
then each is aligned to its
most similar partner or
group of partners. Each
group of partners is then
aligned to finish the
complete multiple
sequence alignment.
PileUp is the Wisconsin
Package’s implementation of
pairwise progressive multiple
sequence alignment.
Let’s run PileUp on our ‘primitive’
dataset in SeqLab. At the
command line this would be:
pileup @primitive.list
The consensus and motifs —
P-Loop
QuickTime™ and a
Graphics decompressor
are needed to see this picture.
Conserved
regions can be
visualized with a
sliding window
approach and
appear as
peaks.
Let’s
concentrate on
the first peak
seen here to
simplify matters.
Motifs (a.k.a. signatures)
GHVDHGKS
A consensus isn’t
necessarily the
biologically “correct”
combination.
Therefore, build
one-dimensional
‘pattern descriptors.’
PROSITE Database
of protein families
and domains - over
1,000 motifs.
This motif, the Ploop, is defined:
(A,G)x4GK(S,T), i.e.
either an Alanine or
a Glycine, followed
by four of anything,
followed by an
invariant GlycineLysine pair, followed
by either a Serine or
a Threonine.
Discover motifs in ‘ungapped’
sequences with the program
Motifs in the Wisconsin
Package —
Again I’ll show you in SeqLab,
but at the command line:
motifs sw:ef11_human
Enter
the
Profile
But motifs can not convey any degree of the ‘importance’
of the residues. Use a position specific, two-dimensional
matrix where conserved areas of the alignment receive the
most importance and variable regions hardly matter!
Cons A B C D E F G H I K L M N P Q R S T V W Y Z Gap Len
E 11 20 -11 27 33 -21 16 10 -4 10 -9 -6 16 6 18 0 8 17 -3 -29 -15 26 12 12
K 0 27 -40 21 22 -47 -6 7 -13 100 -20 13 27 7 27 53 14 13 -13 5 -40 28 12 12
! 11
P 13 3 4 3 3 -13 9 2 3 3 -2 -1 1 28 4 3 11 20 9 -21 -16 4 12 12
H -7 26 -6 26 26 -6 -14 99 -18 6 -12 -19 33 13 46 33 -13 -6 -19 -7 20 33 12 12
I 3 -7 2 -7 -6 19 -6 -9 43 -7 29 22 -10 -4 -6 -10 -4 6 38 -17 1 -5 12 12
N 14 73 -19 47 33 -34 27 33 -20 27 -27 -20 100 0 26 7 22 14 -20 -20 -7 27 12 12
I 1 -10 -1 -10 -8 26 -9 -10 46 -8 34 27 -12 -6 -8 -12 -6 5 40 -12 4 -7 12 12
V 15 2 7 3 1 -1 20 -9 24 -6 14 11 -3 6 -3 -11 4 10 37 -30 -9 -1 12 12
V 9 -4 7 -5 -4 5 7 -8 29 -4 20 15 -6 4 -7 -9 0 19 36 -21 -2 -5 12 12
I 0 -16 16 -16 -16 55 -24 -24 118 -16 63 47 -24 -16 -24 -24 -8 16 87 -39 8 -16 12 12
G 55 47 16 55 39 -47 118 -16 -24 -8 -39 -24 31 24 16 -24 47 31 16 -79 -55 24 12 12
H -6 27 -7 27 27 -8 -13 100 -20 7 -13 -20 34 14 48 34 -13 -7 -20 -7 19 34 12 12
! 21
V 11 -12 12 -12 -12 13 11 -18 67 -12 48 36 -18 5 -12 -18 -6 12 89 -47 -6 -12 12 12
D 24 87 -39 118 79 -79 55 31 -16 24 -39 -31 55 8 55 0 16 16 -16 -87 -39 71 12 12
S 9 12 11 11 11 -8 8 22 -7 5 -10 -10 14 11 11 9 23 4 -6 1 -2 9 12 12
G 55 47 16 55 39 -47 118 -16 -24 -8 -39 -24 31 24 16 -24 47 31 16 -79 -55 24 12 12
K 0 27 -40 20 20 -47 -7 7 -14 100 -20 13 27 7 27 55 13 13 -14 8 -40 27 12 12
S 19 14 30 10 10 -14 27 -9 -2 10 -17 -12 14 19 -5 3 63 24 -2 7 -19 1 100 100
T 40 20 20 20 20 -30 40 -10 20 20 -10 0 20 30 -10 -10 30 150 20 -60 -30 10 100 100
T 8 -4 -9 -4 0 13 1 -6 18 0 23 22 -2 2 -4 -9 0 34 18 -6 -2 -1 100 100
T 19 8 10 8 8 -12 19 -6 16 8 1 4 7 14 -6 -6 13 69 18 -32 -14 3 100 100
G 40 24 10 28 21 -27 61 -8 -11 -4 -19 -11 16 16 9 -14 26 18 9 -44 -28 13 100 100
! 31
H 10 11 -1 11 11 -10 1 34 -8 7 -8 -5 13 11 19 18 0 1 -6 -1 0 14 100 100
L -4 -20 -27 -20 -13 50 -21 -10 43 -13 62 53 -17 -13 -7 -17 -15 -2 40 13 12 -9 100 100
* 20 0 0 27 12 3 73 70 65 46 38 0 24 11 5 6 33 85 65 0 0 0
The threonine at position 27 is absolutely conserved — it gets the highest score, 150! The aspartate at position 22
substituted with a tryptophan would never happen, -87. Tryptophan is the most conserved residue on all matrix series and
aspartate 22 is conserved throughout the alignment — the negative matrix score of any substitution to tryptophan times the
high conservation at that position for aspartate equals the most negative score in the profile. Position 16 has a valine
assigned because it has the highest score, 37, but glycine also occurs several times, a score of 20. However, other
residues are ranked in the substitution matrices as being quite similar to valine; therefore isoleucine and leucine also get
similar scores, 24 and 14, and alanine occurs some of the time in the alignment so it gets a comparable score, 15.
Advanced methodologies — wondrous stuff based on
combinations of the previous techniques, e.g.
PSI-BLAST uses profile methods to iterate database searches.
Profiles can be optimized with hidden Markov models (HMMs) or even
discovered in unaligned sequences using expectation maximization (MEME).
Exon and intron structure can be predicted. See e.g. the genefinder at
http://genomic.sanger.ac.uk/gf/gf.html and GrailEXP at
http://grail.lsd.ornl.gov/grailexp/.
Secondary structure can often be predicted. See http://www.emblheidelberg.de/predictprotein/predictprotein.html, which uses multiple
sequence alignment profile techniques along with neural net technology.
Even three-dimensional “homology modeling” will often lead to remarkably
accurate results if the similarity is great enough between your protein and
one in which the structure has been solved through experimental means.
See SwissModel at http://www.expasy.ch/swissmod/SWISS-MODEL.html.
Evolutionary relationships can be ascertained using a multiple sequence
alignment and the methods of molecular phylogenetics. See the PAUP* and
PHYLIP software packages. And if you’re really interested in this topic check
out the Workshop on Molecular Evolution offered every August at the Woods
Hole Marine Biological Laboratory and/or similar courses worldwide.
Finally, what’s the deal with DNA versus
protein for searches and alignment?
All database similarity searching and sequence
alignment, regardless of the algorithm, is far more
sensitive at the amino acid level than with DNA. This is
because proteins have twenty match criteria versus
DNA’s four, and those four DNA bases can generally only
be identical, not similar, to each other; and many DNA
base changes (especially third position changes) do not
change the encoded protein.
All of these factors drastically increase the ‘noise’ level of
a DNA against DNA search, and give protein searches a
much greater ‘look-back’ time, at least doubling it.
Therefore, whenever dealing with coding sequence, it is
always prudent to work at the protein level!
Conclusions — A comprehensive sequence analysis software
suite, such as the Wisconsin Package, expedites
bioinformatics, putting a large assortment of tools all under
one organizational model with one user interface.
The better you understand the chemical, physical, and biological system
under study, the better your chance of success in their analysis. Certain
strategies are inherently more appropriate than others. Making these
types of subjective, discriminatory decisions is one of the most important
‘take-home’ messages I can offer!
Gunnar von Heijne in his old but quite readable treatise, Sequence
Analysis in Molecular Biology; Treasure Trove or Trivial Pursuit (1987),
provides a very appropriate conclusion:
“Think about what you’re doing; use your knowledge of the molecular
system involved to guide both your interpretation of results and your
direction of inquiry; use as much information as possible; and do not
blindly accept everything the computer offers you.”
FOR MORE INFO...
See http://bio.fsu.edu/~stevet/workshop.html and contact me
([email protected]) for further bioinformatics assistance.
References
Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990) Basic Local Alignment Tool. Journal of Molecular Biology 215, 403-410.
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a New Generation
of Protein Database Search Programs. Nucleic Acids Research 25, 3389-3402.
Bairoch A. (1992) PROSITE: A Dictionary of Sites and Patterns in Proteins. Nucleic Acids Research 20, 2013-2018.
Felsenstein, J. (1993) PHYLIP (Phylogeny Inference Package) version 3.5c. Distributed by the author. Dept. of Genetics, University of Washington,
Seattle, Washington, U.S.A.
Genetics Computer Group (GCG), Inc. (Copyright 1982-2000) Program Manual for the Wisconsin Package, Version 10.1, Madison, Wisconsin, USA
53711.
Gribskov, M. and Devereux, J., editors (1992) Sequence Analysis Primer. W.H. Freeman and Company, New York, N.Y., U.S.A.
Gribskov M., McLachlan M., Eisenberg D. (1987) Profile analysis: detection of distantly related proteins. Proc. Natl. Acad. Sci. U.S.A. 84, 4355-4358.
Henikoff, S. and Henikoff, J.G. (1992) Amino Acid Substitution Matrices from Protein Blocks. Proceedings of the National Academy of Sciences U.S.A. 89,
10915-10919.
Needleman, S.B. and Wunsch, C.D. (1970) A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins.
Journal of Molecular Biology 48, 443-453.
Pearson, P., Francomano, C., Foster, P., Bocchini, C., Li, P., and McKusick, V. (1994) The Status of Online Mendelian Inheritance in Man (OMIM) medio
1994. Nucleic Acids Research 22, 3470-3473.
Pearson, W.R. and Lipman, D.J. (1988) Improved Tools for Biological Sequence Analysis. Proceedings of the National Academy of Sciences U.S.A. 85,
2444-2448.
Rost, B. and Sander, C. (1993) Prediction of Protein Secondary Structure at Better than 70% Accuracy. Journal of Molecular Biology 232, 584-599.
Smith, S.W., Overbeek, R., Woese, C.R., Gilbert, W., and Gillevet, P.M. (1994) The Genetic Data Environment, an Expandable GUI for Multiple Sequence
Analysis. CABIOS, 10, 671-675.
Schwartz, R.M. and Dayhoff, M.O. (1979) Matrices for Detecting Distant Relationships. In Atlas of Protein Sequences and Structure, (M.O. Dayhoff editor)
5, Suppl. 3, 353-358, National Biomedical Research Foundation, Washington D.C., U.S.A.
Smith, T.F. and Waterman, M.S. (1981) Comparison of Bio-Sequences. Advances in Applied Mathematics 2, 482-489.
Sundaralingam, M., Mizuno, H., Stout, C.D., Rao, S.T., Liedman, M., and Yathindra, N. (1976) Mechanisms of Chain Folding in Nucleic Acids. The Omega
Plot and its Correlation to the Nucleotide Geometry in Yeast tRNAPhe1. Nucleic Acids Research 10, 2471-2484.
Swofford, D.L., PAUP (Phylogenetic Analysis Using Parsimony) (1989-1993) Illinois Natural History Survey, (1994) personal copyright, and (1997)
Smithsonian Institution, Washington D.C., U.S.A.
Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994) CLUSTALW: improving the sensitivity of progressive multiple sequence alignment through
sequence weighting, positions-specific gap penalties and weight matrix choice. Nucleic Acids Research, 22, 4673-4680.
von Heijne, G. (1987) Sequence Analysis in Molecular Biology; Treasure Trove or Trivial Pursuit. Academic Press, Inc., San Diego, California, U.S.A.
Wilbur, W.J. and Lipman, D.J. (1983) Rapid Similarity Searches of Nucleic Acid and Protein Data Banks. Proceedings of the National Academy of
Sciences U.S.A. 80, 726-730.
Zuker, M. (1989) On Finding All Suboptimal Foldings of an RNA Molecule. Science 244, 48-52.