BioInformatics at FSU - whose job is it and why it needs

Download Report

Transcript BioInformatics at FSU - whose job is it and why it needs

A BioInformatics Survey
. . . just a taste, with an
emphasis on the GCG suite.
Steven M. Thompson
Florida State University School of
Computational Science and
Information Technology (CSIT)
¥
A GCG SeqLab Introduction for:
the
Woods Hole Oceanographic Institution
Woods Hole, MA 02543
http://www.whoi.edu/home/
Feb. 21 – 22, 2003
Introductory Overview:
What is bioinformatics , genomics, sequence
analysis, computational molecular biology . . .
The Reverse Biochemistry Analogy.
Using sequence analysis tools, one can infer
all sorts of functional, evolutionary, and,
perhaps, structural insight into a gene,
without the need to isolate and purify
massive amounts of protein!
The computer is an essential part of this
entire process.
Definitions:
Biocomputing and computational biology are fairly synonymous and
both describe the use of computers and computational techniques
to analyze biological systems.
Bioinformatics describes using computational techniques to access,
analyze, and interpret the biological information in any of the
available biological databases.
Sequence analysis is the study of molecular sequence data for the
purpose of inferring the function, interactions, evolution, and
perhaps structure of biological molecules.
Genomics analyzes the context of genes or complete genomes (the
total DNA content of an organism) within and across genomes.
Proteomics is the subdivision of genomics concerned with analyzing
the complete protein complement, i.e. the proteome, of organisms,
both within and between different organisms.
The exponential growth of molecular
sequence databases & cpu power.
Year
BasePairs
1982
1983
1984
1985
1986
1987
1988
1989
1990
1991
1992
1993
1994
1995
1996
1997
1998
1999
2000
2001
680338
2274029
3368765
5204420
9615371
15514776
23800000
34762585
49179285
71947426
101008486
157152442
217102462
384939485
651972984
1160300687
2008761784
3841163011
11101066288
14396883064
Sequences
606
2427
4175
5700
9978
14584
20579
28791
39533
55627
78608
143492
215273
555694
1021211
1765847
2837897
4864570
10106023
13602262
Doubling time
~ 1 year!
http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
Database Growth (cont.)
The Human Genome Project and numerous smaller
genome projects have kept the data coming at
alarming rates. As of February 2003, 115 complete,
finished genomes are publicly available for analysis,
not counting all the virus and viroid genomes available.
The International Human Genome Sequencing
Consortium announced the completion of a "Working
Draft" of the human genome in June 2000;
independently that same month, the private company
Celera Genomics announced that it had completed the
first assembly of the human genome. Both articles
were published mid-February 2001 in the journals
Science and Nature.
Some neat stuff from the papers:
We, Homo sapiens, aren’t nearly as special as
we had hoped we were. Of the 3.2 billion base
pairs in our DNA —
Traditional, text-book estimates of the number of
genes were often in the 100,000 range; turns out
we’ve only got about twice as many as a fruit fly,
between 25,000 and 35,000!
The protein coding region of the genome is only about
1% or so, much of the remainder “junk” is “jumping,”
“selfish DNA” of which much may be involved in
regulation and control.
100-200 genes were transferred from an ancestral
bacterial genome to an ancestral vertebrate
genome! (Later shown to be not true by more extensive
analyses, and to be due to gene loss rather than transfer.)
What are these databases like?
What are primary sequences?
(Central Dogma: DNA —> RNA —> protein)
Primary refers to one dimension — all of the “symbol”
information written in sequential order necessary to
specify a particular biological molecular entity, be it
polypeptide or nucleotide.
The symbols are the one letter alphabetic codes for all of
the biological nitrogenous bases and amino acid
residues and their ambiguity codes. Biological
carbohydrates, lipids, and structural information are not
included within this sequence, however, much of this
type of information is available in the reference
documentation sections associated with primary
sequences in the databases.
What are sequence databases?
These databases are an organized way to store the
tremendous amount of sequence information that
accumulates from laboratories worldwide. Each
database has its own specific format. Three major
database organizations around the world are
responsible for maintaining most of this data; they
largely ‘mirror’ one another.
North America: National Center for Biotechnology
Information (NCBI): GenBank & GenPept.
Also Georgetown University’s NBRF Protein
Identification Resource: PIR & NRL_3D.
Europe: European Molecular Biology Laboratory (also
EBI & ExPasy): EMBL & Swiss-Prot.
Asia: The DNA Data Bank of Japan (DDBJ).
Content & Organization:
Most sequence database installations are examples of complex
ASCII/Binary databases, but they usually are not Oracle or SQL or
Object Oriented (proprietary ones often are). They often contain
several very long text files containing different types of information all
related to particular sequences, such as all of the sequences
themselves, versus all of the title lines, or all of the reference
sections. Binary files often help ‘glue together’ all of these other files
by providing index functions.
Software is usually required to successfully interact with these
databases and access is most easily handled through various
software packages and interfaces, either on the World Wide Web or
otherwise, although systems level commands can be used if one
understands the data's structure. Nucleic acid databases are split
into subdivisions based on taxonomy (historical). Protein databases
are often organized into sections by level of annotation.
What are other biological databases?
Three dimensional structure databases:
the Protein Data Bank and Rutgers Nucleic Acid Database.
Still more; these can be considered ‘non-molecular’:
Reference Databases: e.g.
OMIM — Online Mendelian Inheritance in Man
PubMed/MedLine — over 11 million citations from more
than 4 thousand bio/medical scientific journals.
Phylogenetic Tree Databases: e.g. the Tree of Life.
Metabolic Pathway Databases: e.g. WIT (What Is There) and
Japan’s GenomeNet KEGG (the Kyoto Encyclopedia of
Genes and Genomes).
Population studies data — which strains, where, etc.
And then databases that most biocomputing people don’t even
usually consider:
e.g. GIS/GPS/remote sensing data, medical records, census
counts, mortality and birth rates . . . .
So how does one do Bioinformatics?
Often on the InterNet over the World Wide Web:
Site
URL (Uniform Resource Locator)
Content
Nat’l Center Biotech' Info'
http://www.ncbi.nlm.nih.gov/
databases/analysis/software
PIR/NBRF
http://www-nbrf.georgetown.edu/
protein sequence database
IUBIO Biology Archive
http://iubio.bio.indiana.edu/
database/software archive
Univ. of Montreal
http://megasun.bch.umontreal.ca/
database/software archive
Japan's GenomeNet
http://www.genome.ad.jp/
databases/analysis/software
European Mol' Bio' Lab'
http://www.embl-heidelberg.de/
databases/analysis/software
European Bioinformatics
http://www.ebi.ac.uk/
databases/analysis/software
The Sanger Institute
http://www.sanger.ac.uk/
databases/analysis/software
Univ. of Geneva BioWeb
http://www.expasy.ch/
databases/analysis/software
ProteinDataBank
http://www.rcsb.org/pdb/
3D mol' structure database
Molecules R Us
http://molbio.info.nih.gov/cgi-bin/pdb/
3D protein/nuc' visualization
The Genome DataBase
http://www.gdb.org/
The Human Genome Project
Stanford Genomics
http://genome-www.stanford.edu/
various genome projects
Inst. for Genomic Res’rch
http://www.tigr.org/
esp. microbial genome projects
HIV Sequence Database
http://hiv-web.lanl.gov/
HIV epidemeology seq' DB
The Tree of Life
http://tolweb.org/tree/phylogeny.html
overview of all phylogeny
Ribosomal Database Proj’
http://rdp.cme.msu.edu/html/
databases/analysis/software
WIT Metabolism
http://wit.mcs.anl.gov/WIT2/
metabolic reconstruction
Harvard Bio' Laboratories
http://golgi.harvard.edu/
nice bioinformatics links list
NCBI’s BLAST & Entrez, EMBL’s SRS, + GCG’s SeqLab and LookUp, phylogenetics . . .
So what are the alternatives . . . ?
Desktop software solutions — public domain
programs are available, but . . . complicated to
install, configure, and maintain. User must be pretty
computer savvy. So,
commercial software packages are available, e.g.
Omiga, MacVector, DNAsis, DNAStar, etc.,
but . . . license hassles, big expense per machine, and
Internet and/or CD database access all complicate
matters!
Therefore, UNIX server-based
solutions (e.g. the Accelrys GCG
Wisconsin Package [a Pharmacopeia Co.]):
One commercial license fee for an entire institution and
very fast, convenient database access on local
server disks. Connections from any networked
terminal or workstation anywhere!
Operating system: UNIX command line operation
hassles; communications software — telnet, ssh,
xdmcp, etc. and terminal emulation; X graphics; file
transfer — ftp, Mac Fetch, and scp/sftp; and editors
— vi, emacs, pico (or desktop word processing
followed by file transfer [save as "text only!"]).
Basic UNIX for Neophytes.
The Genetics Computer Group —
The Accelrys Wisconsin Package for Sequence
Analysis.
Begun in 1982 in Oliver Smithies’ lab at the Genetics Dept.
at the University of Wisconsin, Madison, then a private
company for over 10 years, then acquired by the Oxford
Molecular Group U.K., and now owned by Pharmacopeia
U.S.A. under the new name Accelrys, Inc.
The suite contains almost 150 programs designed to work in
a "toolbox" fashion. Several simple programs used in
succession can lead to sophisticated results.
Also 'internal compatibility,' i.e. once you learn to use one
program, all programs can be run similarly, and, the output
from many programs can be used as input for other
programs.
Used all over the world by more than 30,000 scientists at
over 530 institutions in 35 countries, so learning it here will
most likely be useful anywhere else you may end up.
To answer the always perplexing GCG question — “What
sequence(s)? . . . .”
Specifying sequences, GCG style;
in order of increasing power and complexity:
The sequence is in a local GCG format single sequence file in your UNIX
account. (GCG Reformat and all From & To programs)
The sequence is in a local GCG database in which case you ‘point’ to it by
using any of the GCG database logical names. A colon, “:,” always sets
the logical name apart from either an accession number or a proper
identifier name or a wildcard expression and they are case insensitive.
The sequence is in a GCG format multiple sequence file, either an MSF
(multiple sequence format) file or an RSF (rich sequence format) file. To
specify sequences contained in a GCG multiple sequence file, supply the
file name followed by a pair of braces, “{},” containing the sequence
specification, e.g. a wildcard — {*}.
Finally, the most powerful method of specifying sequences is in a GCG “list”
file. It is merely a list of other sequence specifications and can even
contain other list files within it. The convention to use a GCG list file in a
program is to precede it with an at sign, “@.” Furthermore, one can
supply attribute information within list files to specify something special
about the sequence.
‘Clean’ GCG format single sequence file after
‘reformat’ (or any of the From… programs) —
This is a small example of GCG single sequence format.
Always put some documentation on top, so in the future
you can figure out what it is you're dealing with! The
line with the two periods is converted to the checksum line.
example.seq
1
51
Length: 77
July 21, 1999 09:30
Type: N
Check: 4099
..
ACTGACGTCA CATACTGGGA CTGAGATTTA CCGAGTTATA CAAGTATACA
GATTTAATAG CATGCGATCC CATGGGA
SeqLab’s Editor mode can also
“Import” native GenBank format and
ABI or LI-COR trace files!
Logical terms for the Wisconsin Package —
Sequence databases, nucleic acids:
GENMBLPLUS
all of GenEMBL plus EST and GSS subdivisions
GEP
all of GenEMBL plus EST and GSS subdivisions
GENEMBL
all of GenEMBL except EST and GSS subdivisions
GE
all of GenEMBL except EST and GSS subdivisions
BA
GenEMBL bacterial subdivision
BACTERIAL
GenEMBL bacterial subdivision
EST
GenEMBL EST (Expressed Sequence Tags) subdivision
GSS
GenEMBL GSS (Genome Survey Sequences) subdivision
HTC
GenEMBL High Throughput cDNA
HTG
GenEMBL High Throughput Genomic
IN
GenEMBL invertebrate subdivision
INVERTEBRATE GenEMBL invertebrate subdivision
OM
GenEMBL other mammalian subdivision
OTHERMAMM
GenEMBL other mammalian subdivision
OV
GenEMBL other vertebrate subdivision
OTHERVERT
GenEMBL other vertebrate subdivision
PAT
GenEMBL patent subdivision
PATENT
GenEMBL patent subdivision
PH
GenEMBL phage subdivision
PHAGE
GenEMBL phage subdivision
PL
GenEMBL plant subdivision
PLANT
GenEMBL plant subdivision
PR
GenEMBL primate subdivision
PRIMATE
GenEMBL primate subdivision
RO
GenEMBL rodent subdivision
RODENT
GenEMBL rodent subdivision
STS
GenEMBL (sequence tagged sites) subdivision
SY
GenEMBL synthetic subdivision
SYNTHETIC
GenEMBL synthetic subdivision
TAGS
GenEMBL EST and GSS subdivisions
UN
GenEMBL unannotated subdivision
UNANNOTATED GenEMBL unannotated subdivision
VI
GenEMBL viral subdivision
VIRAL
GenEMBL viral subdivision
Sequence databases, amino acids:
GENPEPT
GenBank CDS translations
GP
GenBank CDS translations
SWISSPROTPLUS all of Swiss-Prot and all of SPTrEMBL
SWP
all of Swiss-Prot and all of SPTrEMBL
SWISSPROT
all of Swiss-Prot (fully annotated)
SW
all of Swiss-Prot (fully annotated)
SPTREMBL
Swiss-Prot preliminary EMBL translations
SPT
Swiss-Prot preliminary EMBL translations
P
all of PIR Protein
PIR
all of PIR Protein
PROTEIN
PIR fully annotated subdivision
PIR1
PIR fully annotated subdivision
PIR2
PIR preliminary subdivision
PIR3
PIR unverified subdivision
PIR4
PIR unencoded subdivision
NRL_3D
PDB 3D protein sequences
NRL
PDB 3D protein sequences
General data files:
GCGCORE
path to main GCG files
GENMOREDATA path to GCG optional data files
GENRUNDATA
path to GCG default data files
These are easy —
they make sense and
you’ll have a vested
interest.
GCG MSF & RSF format —
!!AA_MULTIPLE_ALIGNMENT 1.0
small.pfs.msf
Name:
Name:
Name:
Name:
Name:
Name:
Name:
//
a49171
e70827
g83052
f70556
t17237
s65758
a46241
MSF: 735
Type: P
Len:
Len:
Len:
Len:
Len:
Len:
Len:
425
577
718
534
229
735
274
July 20, 2001 14:53
Check:
Check:
Check:
Check:
Check:
Check:
Check:
537
21
9535
3494
9552
111
3514
Weight:
Weight:
Weight:
Weight:
Weight:
Weight:
Weight:
Check: 6619 ..
1.00
1.00
1.00
1.00
1.00
1.00
1.00
//////////////////////////////////////////////////
!!RICH_SEQUENCE 1.0
..
{
name ef1a_giala
descrip
PileUp of: @/users1/thompson/.seqlab-mendel/pileup_28.list
type
PROTEIN
longname /users1/thompson/seqlab/EF1A_primitive.orig.msf{ef1a_giala}
sequence-ID Q08046
checksum
7342
offset
23
This is SeqLab’s native format
creation-date 07/11/2001 16:51:19
strand 1
comments ////////////////////////////////////////////////////////////
The trick is to not forget the Braces and ‘wild
card,’ e.g. filename{*}, when specifying!
The List File Format —
remember the @ sign!
An example GCG list file of many elongation
1a and Tu factors follows. As with all GCG
data files, two periods separate
documentation from data.
..
my-special.pep
begin:24
end:134
SwissProt:EfTu_Ecoli
Ef1a-Tu.msf{*}
/usr/accounts/test/another.rsf{ef1a_*}
@another.list
The ‘way’ SeqLab works!
SeqLab — GCG’s X-based GUI!
Seqlab is the merger of Steve Smith’s
Genetic Data Environment and GCG’s
Wisconsin Package Interface:
GDE + WPI = SeqLab
Requires an X-Windowing environment
— either native on UNIX computers
(including LINUX, but not included by
Apple in Mac OS X [v.10+] but see
XDarwin), or emulated with X-Server
Software on personal computers.
FOR MORE INFO...
Participate in the tutorial following this
lecture and visit my Web page stuff:
http://bio.fsu.edu/~stevet/cv.html.
Contact me ([email protected]) for
specific bioinformatics assistance
and/or long distance collaboration.
And to ‘honk-my-own-horn’ a bit,
check out the new Current
Protocols in Bioinformatics from
John Wiley & Sons, Inc:
Humana Press, Inc. also
asked me to contribute.
I’ve got two chapters in
their Introduction to
Bioinformatics —
A Theoretical And Practical
Approach:
http://www.does.org/cp/bioinfo.html.
http://www.humanapress.com
/Product.pasp?txtCatalog=
HumanaBooks&txtCategor
y=&txtProductID=1-58829241-X&isVariant=0.
They asked me to contribute a
chapter on multiple sequence
analysis using GCG software.
Both publishers say the
volumes should be
available February 2003!
The SeqLab Tutorial —
Elongation Factor 1 from a
sorted collection of ‘lower’
Eukaryotes.
How to search databases,
analyze and interpret pair-wise
comparisons for significance,
and prepare and analyze
multiple sequence alignments.
Supplement —
How the algorithms work; accompanying
illustrations for the Tutorial Introduction:
Dot Matrix Techniques,
Dynamic Programming,
Score Matrices,
Significance,
Database Similarity Searching,
Multiple Sequence Analysis.
What about Homology?
Inference through homology is a
fundamental principle of
biology!
What is homology — in this context it is
similarity great enough such that common
ancestry is implied. Walter Fitch, a famous
molecular evolutionist, likes to relate the
analogy — homology is like pregnancy, you
either are or you’re not; there’s no such
thing as 65% pregnant!
A way to see similarities —
Pairwise Comparisons:
The Dot Matrix Method.
Provides a ‘Gestalt’ of all possible
alignments between two
sequences.
To begin — very simple 0, 1 (match,
nomatch) identity scoring function.
Put a dot wherever symbols match.
Identities and insertion/deletion events (indels) identified
(zero:one match score matrix, no window).
Noise due to random composition effects contributes to confusion. To ‘clean up’
the plot consider a filtered windowing approach. A dot is placed at the middle of
a window if some ‘stringency’ is met within that defined window size. Then the
window is shifted one position and the entire process is repeated (zero:one
match score, window of size three and a stringency level of two out of three).
The phenylalanine transfer RNA molecule from yeast plotted against itself
using a window size to 7 and the stringency value to 5. As a general guide
pick a window size about the same size as the feature that you are trying to
recognize and a stringency such that unwanted background noise is just
filtered away enough to enable you to see that desired feature.
RNA comparisons of the reverse, complement of a sequence to itself can often
be very informative. The yeast tRNA sequence is compared to its reverse,
complement using the same 5 out of 7 stringency setting as previously. The
stem-loop, inverted repeats of the tRNA clover-leaf molecular shape become
obvious. They appear as clearly delineated diagonals running perpendicular to
an imaginary main diagonal running oppositely than before.
22 GAGCGCCAGACT
|| | ||||| |
48 CTGGAGGTCTAG
G
12, 22
A
A
3
Base position 22 through position 33 base pairs with (think — is quite similar to the reversecomplement of) itself from base position 37 through position 48. MFold, Zuker’s RNA folding
algorithm uses base pairing energies to find the family of optimal and suboptimal structures; the
most stable structure found is shown to possess a stem at positions 27 to 31 with 39 to 43.
However the region around position 38 is represented as a loop. The actual modeled structure as
seen in PDB’s 1TRA shows ‘reality’ lies somewhere in between.
Pairwise Comparisons: Dynamic Programming.
A ‘brute force’ approach just won’t work. The computation required to compare all possible
alignments between two sequences requires time proportional to the product of the lengths of
the two sequences, without considering gaps at all. If the two sequences are approximately the
same length (N), this is a N2 problem. To include gaps, the calculation needs to be repeated 2N
times to examine the possibility of gaps at each possible position within the sequences, now a
N4N problem.
Therefore, An optimal alignment is defined as an arrangement of two sequences, 1 of length i
and 2 of length j, such that:
1) you maximize the number of matching symbols between 1 and 2;
2) you minimize the number of indels within 1 and 2; and
3) you minimize the number of mismatched symbols between 1 and 2.
Therefore, the actual solution can be represented by:
Sij = sij + max
Si-1
max
2 <
max
2 <
j-1
Si-x j-1 + wx-1
x < i
Si-1 j-y + wy-1
y < I
or
or
Where Sij is the score for the alignment ending at i in sequence 1 and j in sequence 2,
sij is the score for aligning i with j,
wx is the score for making a x long gap in sequence 1,
wy is the score for making a y long gap in sequence 2,
allowing gaps to be any length in either sequence.
An oversimplified example —
total penalty = gap opening penalty {zero here} + ([length of gap][gap extension penalty {one here}])
Optimum Alignments —
There will probably be more than one best path through the matrix and
none of them may be the biologically CORRECT alignment. Starting at
the top and working down as we did, then tracing back, I found two
optimum alignments:
cTATAtAagg
| |||||
cg.TAtAaT.
cTATAtAagg
| ||||
cgT.AtAaT.
Each of these solutions yields a traceback total score of 22. This is the
number optimized by the algorithm, not any type of a similarity or
identity score! Even though one of these alignments has 6 exact
matches and the other has 5, they are both optimal according to the
relatively strange criteria by which we solved the algorithm. Software
will report only one of these solutions. Do you have any ideas about
how others could be discovered? Answer — Often if you reverse the
solution of the entire dynamic programming process, other solutions
can be found!
Global versus local solution: negative numbers in match matrix and
pick best diagonal within overall graph.
What about proteins — conservative replacements and similarity as
opposed to identity, and similarity versus homology! Similarity is not
automatically homology. Homology always means related by descent from a
common ancestor.
BLOSUM62 amino acid substitution matrix Henikoff, S. and Henikoff, J. G. (1992).
A
B
C
D
E
F
G
H
I
K
L
M
N
P
Q
R
S
T
V
W
X
Y
Z
A
4
-2
0
-2
-1
-2
0
-2
-1
-1
-1
-1
-2
-1
-1
-1
1
0
0
-3
-1
-2
-1
B
-2
6
-3
6
2
-3
-1
-1
-3
-1
-4
-3
1
-1
0
-2
0
-1
-3
-4
-1
-3
2
C
0
-3
9
-3
-4
-2
-3
-3
-1
-3
-1
-1
-3
-3
-3
-3
-1
-1
-1
-2
-1
-2
-4
D
-2
6
-3
6
2
-3
-1
-1
-3
-1
-4
-3
1
-1
0
-2
0
-1
-3
-4
-1
-3
2
E
-1
2
-4
2
5
-3
-2
0
-3
1
-3
-2
0
-1
2
0
0
-1
-2
-3
-1
-2
5
F
-2
-3
-2
-3
-3
6
-3
-1
0
-3
0
0
-3
-4
-3
-3
-2
-2
-1
1
-1
3
-3
G
0
-1
-3
-1
-2
-3
6
-2
-4
-2
-4
-3
0
-2
-2
-2
0
-2
-3
-2
-1
-3
-2
H
-2
-1
-3
-1
0
-1
-2
8
-3
-1
-3
-2
1
-2
0
0
-1
-2
-3
-2
-1
2
0
I
-1
-3
-1
-3
-3
0
-4
-3
4
-3
2
1
-3
-3
-3
-3
-2
-1
3
-3
-1
-1
-3
K
-1
-1
-3
-1
1
-3
-2
-1
-3
5
-2
-1
0
-1
1
2
0
-1
-2
-3
-1
-2
1
L
-1
-4
-1
-4
-3
0
-4
-3
2
-2
4
2
-3
-3
-2
-2
-2
-1
1
-2
-1
-1
-3
M
-1
-3
-1
-3
-2
0
-3
-2
1
-1
2
5
-2
-2
0
-1
-1
-1
1
-1
-1
-1
-2
N
-2
1
-3
1
0
-3
0
1
-3
0
-3
-2
6
-2
0
0
1
0
-3
-4
-1
-2
0
P
-1
-1
-3
-1
-1
-4
-2
-2
-3
-1
-3
-2
-2
7
-1
-2
-1
-1
-2
-4
-1
-3
-1
Q
-1
0
-3
0
2
-3
-2
0
-3
1
-2
0
0
-1
5
1
0
-1
-2
-2
-1
-1
2
R
-1
-2
-3
-2
0
-3
-2
0
-3
2
-2
-1
0
-2
1
5
-1
-1
-3
-3
-1
-2
0
S
1
0
-1
0
0
-2
0
-1
-2
0
-2
-1
1
-1
0
-1
4
1
-2
-3
-1
-2
0
T
0
-1
-1
-1
-1
-2
-2
-2
-1
-1
-1
-1
0
-1
-1
-1
1
5
0
-2
-1
-2
-1
V
0
-3
-1
-3
-2
-1
-3
-3
3
-2
1
1
-3
-2
-2
-3
-2
0
4
-3
-1
-1
-2
W
-3
-4
-2
-4
-3
1
-2
-2
-3
-3
-2
-1
-4
-4
-2
-3
-3
-2
-3
11
-1
2
-3
X
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
Y
-2
-3
-2
-3
-2
3
-3
2
-1
-2
-1
-1
-2
-3
-1
-2
-2
-2
-1
2
-1
7
-2
Z
-1
2
-4
2
5
-3
-2
0
-3
1
-3
-2
0
-1
2
0
0
-1
-2
-3
-1
-2
x
Values whose magnitude is  4 are drawn in outline characters to make them easier to recognize. Notice that positive values
for identity range from 4 to 11 and negative values for those substitutions that rarely occur go as low as –4. The most conserved
residue is tryptophan with a score of 11; cysteine is next with a score of 9; both proline and tyrosine get scores of 7 for identity.
Significance — When is an Alignment Worth
Anything Biologically?
Monte Carlo simulations:
Z score = [ ( actual score ) - ( mean of randomized scores ) ]
( standard deviation of randomized score distribution )
Many Z scores measure the distance from a mean using a simplistic Monte
Carlo model assuming a normal distribution, in spite of the fact that
‘sequence-space’ actually follows what is know as an ‘extreme value
distribution;’ however, the Monte Carlo method does approximate
significance estimates pretty well.
Histogram Key:
Each histogram symbol represents 604 search set sequences
Each inset symbol represents 21 search set sequences
z-scores computed from opt scores
z-score obs exp
(=) (*)
< 20 650
0:==
22
0
0:
24
3
0:=
26 22
8:*
28 98 87:*
30 289 528:*
32 1714 2042:===*
34 5585 5539:=========*
36 12495 11375:==================*==
38 21957 18799:===============================*=====
40 28875 26223:===========================================*====
42 34153 32054:=====================================================*===
44 35427 35359:==========================================================*
46 36219 36014:===========================================================*
48 33699 34479:======================================================== *
50 30727 31462:=================================================== *
52 27288 27661:=============================================*
54 22538 23627:====================================== *
56 18055 19736:============================== *
58 14617 16203:========================= *
60 12595 13125:=====================*
62 10563 10522:=================*
64 8626 8368:=============*=
66 6426 6614:==========*
68 4770 5203:========*
70 4017 4077:======*
72 2920 3186:=====*
74 2448 2484:====*
76 1696 1933:===*
78 1178 1503:==*
80 935 1167:=*
82 722 893:=*
84 454 707:=*
86 438 547:*
88 322 423:*
90 257 328:*
92 175 253:*
:========= *
94 210 196:*
:=========*
96 102 152:*
:===== *
98 63 117:*
:=== *
100 58 91:*
:=== *
102 40 70:*
:== *
104 30 54:*
:==*
106 17 42:*
:=*
108 14 33:*
:=*
110 14 25:*
:=*
112 12 20:*
:*
114
9 15:*
:*
116
6 12:*
:*
118
8
9:*
:*
>120 1030
7:*=
:*=======================================
‘Sequence-space’ actually
follows the ‘extreme value
distribution.’ Based on this
known statistical distribution, and
robust statistical methodology, a
realistic Expectation function, the
E value, can be calculated. The
particulars of how BLAST and
FastA do this differ, but the ‘takehome’ message is the same:
The higher the E value is, the
more probable that the observed
match is due to chance in a
search of the same size database
and the lower its Z score will be,
i.e. is NOT significant. Therefore,
the smaller the E value, i.e. the
closer it is to zero, the more
significant it is and the higher its Z
score will be! The E value is the
number that really matters.
These are the best hits, those most
similar sequences with a Pearson zscore greater than 120 in this search.
Pairwise Comparisons — Database Searching
Add the previous concepts to ‘hashing’ to come up with heuristic style database searching. Hashing
breaks sequences into small ‘words’ or ‘ktuples’ of a set size to create a ‘look-up’ table with words keyed
to numbers. When a word matches part of a database entry, that match is saved. ‘Worthwhile’ results at
the end are compiled and the longest alignment within the program’s restrictions is created. Hashing
reduces the complexity of the search problem from N2 for dynamic programming to N, the length of all the
sequences in the database. Approximation techniques are collectively known as ‘heuristics.’ In database
searching the heuristic restricts search space by calculating a statistic that allows the program to decide
whether further scrutiny of a particular match should be pursued.
BLAST — Basic Local Alignment Search
Tool, developed at NCBI.
1) Normally NOT a good idea to use
for DNA against DNA searches (not
optimized);
2) Prefilters repeat and “low
complexity” sequence regions;
4) Can find more than one region of
gapped similarity;
5) Very fast heuristic and parallel
implementation;
6) Restricted to precompiled, specially
formatted databases;
FastA — and its family of relatives,
developed by Bill Pearson at the University
of Virginia.
1) Works well for DNA against DNA
searches (within limits of possible
sensitivity);
2) Can find only one gapped region
of similarity;
3) Relatively slow, should usually be
run in the background;
4) Does not require specially
prepared, preformatted
databases.
Versions available of each for DNA-DNA, DNA-protein, protein-DNA, and
protein-protein searches. Translations done ‘on the fly’ for mixed searches.
The algorithms —
BLAST:
Two word hits on the
same diagonal above
some similarity threshold
triggers ungapped
extension until the score
isn’t improved enough
above another threshold:
the HSP.
Initiate gapped extensions
using dynamic programming for
those HSP’s above a third
threshold up to the point where
the score starts to drop below a
fourth threshold: yields
alignment.
Find all ungapped exact
word hits; maximize the
ten best continuous
regions’ scores: init1.
FastA:
Combine
nonoverlapping init
regions on different
diagonals:
initn.
Use dynamic
programming ‘in a
band’ for all regions
with initn scores
better than some
threshold: opt score.
What about more than just two sequences?
Dynamic programming’s complexity increases
exponentially with the number of sequences being
compared:
N-dimensional matrix . . . .
complexity=[sequence length]number of sequences
Multiple Sequence Alignment and
Analysis —
Therefore — pairwise,
progressive dynamic
programming restricts the
solution to the neighbor-hood
of only two sequences at a
time.
All sequences are compared,
pairwise, and then each is
aligned to its most similar
partner or group of partners.
Each group of partners is then
aligned to finish the complete
multiple sequence alignment.
Conserved regions can be
visualized with a sliding
window approach and appear
as peaks. Concentrate on the
first peak seen here.
Motifs —
GHVDHGKS
A consensus isn’t
necessarily the
biologically “correct”
combination.
Therefore, build onedimensional ‘pattern
descriptors.’
PROSITE Database of
protein families and
domains - over 1,000
motifs.
This motif, the P-loop, is
defined:
(A,G)x4GK(S,T), i.e.
either an Alanine or a
Glycine, followed by
four of anything,
followed by an
invariant GlycineLysine pair, followed
by either a Serine or a
Threonine.
But motifs can not
convey any degree of
the ‘importance’ of the
residues.
a multiple sequence alignment, how can we use all of the information
Enter Given
contained in it to find ever more remotely similar sequences, that is those
“Twilight Zone” similarities below ~20% identity, those Z scores below ~5, those
BLAST/Fast E values above ~10 or so?
the
Use a position specific, two-dimensional matrix where conserved areas of the
Profile alignment receive the most importance and variable regions hardly matter!
-5
The threonine at position 27 is absolutely conserved — it gets the highest score, 150! The aspartate at position 22 substituted with a
tryptophan would never happen, -87. Tryptophan is the most conserved residue on all matrix series and aspartate 22 is conserved throughout
the alignment — the negative matrix score of any substitution to tryptophan times the high conservation at that position for aspartate equals the
most negative score in the profile. Position 16 has a valine assigned because it has the highest score, 37, but glycine also occurs several
times, a score of 20. However, other residues are ranked in the substitution matrices as being quite similar to valine; therefore isoleucine and
leucine also get similar scores, 24 and 14, and alanine occurs some of the time in the alignment so it gets a comparable score, 15.
Some generalities: Work with
proteins! If at all possible —
Twenty match symbols versus four, plus
similarity! Way better signal to noise.
Also guarantees no indels are placed
within codons. So translate, then align.
Nucleotide sequences will only reliably
align if they are very similar to each
other. And they will require extensive
hand editing and careful consideration.
Beware of aligning apples and
oranges [and grapefruit]!
Parologous
versus
orthologous;
genomic versus
cDNA;
mature versus
precursor.
Advanced methodologies —
Many wondrous things can be accomplished based on combinations of all
the previous techniques.
PSI-BLAST uses profile methods to iterate database searches.
Motif profiles can be discovered in unaligned sequences using
Expectation Maximization or optimized in aligned sequences with
Hidden Markov Modeling statistics.
Secondary structure can be predicted in many cases. See
http://www.embl-heidelberg.de/predictprotein/predictprotein.html,
which uses multiple sequence alignment profile techniques along
with neural net technology. Even three-dimensional “homology
modeling” will often lead to remarkably accurate representations if
the similarity is great enough between your protein and one in which
the structure has been solved through experimental means. See
SwissModel at http://www.expasy.ch/swissmod/SWISSMODEL.html.
Evolutionary relationships can be ascertained using a multiple
sequence alignment and the methods of molecular phylogenetics.
See the PAUP* and PHYLIP software packages. And if you’re really
interested in this topic check out the Workshop on Molecular
Evolution offered every August at the Woods Hole Marine Biological
Laboratory and/or similar courses worldwide.
Conclusions —
Gunnar von Heijne in his old but quite readable treatise, Sequence
Analysis in Molecular Biology; Treasure Trove or Trivial Pursuit
(1987), provides a very appropriate conclusion:
“Think about what you’re doing; use your knowledge of the molecular
system involved to guide both your interpretation of results and your
direction of inquiry; use as much information as possible; and do not
blindly accept everything the computer offers you.”
He continues:
“. . . if any lesson is to be drawn . . . it surely is that to be able to make a
useful contribution one must first and foremost be a biologist, and only
second a theoretician . . . . We have to develop better algorithms, we
have to find ways to cope with the massive amounts of data, and above
all we have to become better biologists. But that’s all it takes.”
To learn more See the listed references and WWW sites. Many fine
texts are also starting to become available in the field.
Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990) Basic Local Alignment Tool. Journal of Molecular Biology 215, 403-410.
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a New Generation of
Protein Database Search Programs. Nucleic Acids Research 25, 3389-3402.
Bairoch A. (1992) PROSITE: A Dictionary of Sites and Patterns in Proteins. Nucleic Acids Research 20, 2013-2018.
Felsenstein, J. (1993) PHYLIP (Phylogeny Inference Package) version 3.5c. Distributed by the author. Dept. of Genetics, University of Washington, Seattle,
Washington, U.S.A.
Genetics Computer Group (GCG), a division of Accelrys, a member of Pharmacopeia, Inc. (Copyright 1982-2003) Program Manual for the Wisconsin
Package, Version 10.3, San Diego, California, USA.
Gribskov, M. and Devereux, J., editors (1992) Sequence Analysis Primer. W.H. Freeman and Company, New York, N.Y., U.S.A.
Gribskov M., McLachlan M., Eisenberg D. (1987) Profile analysis: detection of distantly related proteins. Proc. Natl. Acad. Sci. U.S.A. 84, 4355-4358.
Henikoff, S. and Henikoff, J.G. (1992) Amino Acid Substitution Matrices from Protein Blocks. Proceedings of the National Academy of Sciences U.S.A. 89,
10915-10919.
Needleman, S.B. and Wunsch, C.D. (1970) A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins. Journal
of Molecular Biology 48, 443-453.
Pearson, P., Francomano, C., Foster, P., Bocchini, C., Li, P., and McKusick, V. (1994) The Status of Online Mendelian Inheritance in Man (OMIM) medio
1994. Nucleic Acids Research 22, 3470-3473.
Pearson, W.R. and Lipman, D.J. (1988) Improved Tools for Biological Sequence Analysis. Proceedings of the National Academy of Sciences U.S.A. 85,
2444-2448.
Rost, B. and Sander, C. (1993) Prediction of Protein Secondary Structure at Better than 70% Accuracy. Journal of Molecular Biology 232, 584-599.
Smith, S.W., Overbeek, R., Woese, C.R., Gilbert, W., and Gillevet, P.M. (1994) The Genetic Data Environment, an Expandable GUI for Multiple Sequence
Analysis. CABIOS, 10, 671-675.
Schwartz, R.M. and Dayhoff, M.O. (1979) Matrices for Detecting Distant Relationships. In Atlas of Protein Sequences and Structure, (M.O. Dayhoff editor) 5,
Suppl. 3, 353-358, National Biomedical Research Foundation, Washington D.C., U.S.A.
Smith, T.F. and Waterman, M.S. (1981) Comparison of Bio-Sequences. Advances in Applied Mathematics 2, 482-489.
Sundaralingam, M., Mizuno, H., Stout, C.D., Rao, S.T., Liedman, M., and Yathindra, N. (1976) Mechanisms of Chain Folding in Nucleic Acids. The Omega
Plot and its Correlation to the Nucleotide Geometry in Yeast tRNAPhe1. Nucleic Acids Research 10, 2471-2484.
Swofford, D.L., PAUP (Phylogenetic Analysis Using Parsimony) (1989-1993) Illinois Natural History Survey, (1994) personal copyright, and (1997)
Smithsonian Institution, Washington D.C., U.S.A.
Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994) CLUSTALW: improving the sensitivity of progressive multiple sequence alignment through sequence
weighting, positions-specific gap penalties and weight matrix choice. Nucleic Acids Research, 22, 4673-4680.
von Heijne, G. (1987) Sequence Analysis in Molecular Biology; Treasure Trove or Trivial Pursuit. Academic Press, Inc., San Diego, California, U.S.A.
Wilbur, W.J. and Lipman, D.J. (1983) Rapid Similarity Searches of Nucleic Acid and Protein Data Banks. Proceedings of the National Academy of Sciences
U.S.A. 80, 726-730.
Zuker, M. (1989) On Finding All Suboptimal Foldings of an RNA Molecule. Science 244, 48-52.