BioInformatics at FSU - whose job is it and why it needs

Special Topics BSC4933/5936:
An Introduction to Bioinformatics.
Florida State University
The Department of Biological Science
www.bio.fsu.edu
Sept. 2, 2003
BioInformatics Databases,
the “T” in van Engelen’s talk . . .
just a glimpse.
Steven M. Thompson
Florida State University School of
Computational Science and
Information Technology (CSIT)
But first some comments —
If van Engelen’s talk scared you, don’t
worry. His lecture is designed to
provide an overview of the “way”
that computers “think.” Don’t get
hung up on the details.
If you were bored, just wait. Lots and
lots more ‘exciting’ algorithms are
coming up. Something is
guaranteed to challenge you!
To begin — some terminology
Just what the heck is an algorithm ! ?
Merriam-Webster’s says: “A rule of
procedure for solving a problem [often
mathematical] that frequently involves
repetition of an operation.”
So, you could write an algorithm for
tying your shoe! It’s just a set of
explicit instructions for doing some
routine task.
And what about bioinformatics, genomics,
proteomics, sequence analysis,
computational molecular biology . . . ?
Some Definitions, lots of overlap —
Biocomputing and computational biology are synonyms and
describe the use of computers and computational techniques
to analyze any type of biological system, from individual
molecules to organisms to overall ecology.
Bioinformatics describes using computational techniques to
access, analyze, and interpret the biological information in
any type of biological database (much more later).
Sequence analysis is the study of molecular sequence data for
the purpose of inferring the function, interactions, evolution,
and perhaps structure of biological molecules.
Genomics analyzes the context of genes or complete genomes
(the total DNA content of an organism) within the same and/or
across different genomes.
Proteomics is the subdivision of genomics concerned with
analyzing the complete protein complement, i.e. the proteome,
of organisms, both within and between different organisms.
One way to think about the field —
The Reverse Biochemistry Analogy.
Biochemists no longer have to begin a research project by
isolating and purifying massive amounts of a protein from
its native organism in order to characterize a particular
gene product. Rather, now scientists can amplify a
section of some genome based on its similarity to other
genomes, sequence that piece of DNA and, using
sequence analysis tools, infer all sorts of functional,
evolutionary, and, perhaps, structural insight into that
stretch of DNA!
The computer and molecular databases are a
necessary, integral part of this entire process.
The exponential growth of molecular
sequence databases & cpu power —
Year    BasePairs  Sequences
1982       680338        606
1983      2274029       2427
1984      3368765       4175
1985      5204420       5700
1986      9615371       9978
1987     15514776      14584
1988     23800000      20579
1989     34762585      28791
1990     49179285      39533
1991     71947426      55627
1992    101008486      78608
1993    157152442     143492
1994    217102462     215273
1995    384939485     555694
1996    651972984    1021211
1997   1160300687    1765847
1998   2008761784    2837897
1999   3841163011    4864570
2000  11101066288   10106023
2001  14396883064   13602262
doubling time ~ one year
http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
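The doubling-time estimate can be checked against the first and last years in the GenBank statistics. The following is a minimal sketch, assuming steady exponential growth between 1982 and 2001; the function name is illustrative only.

```python
import math

def doubling_time(first_year, first_count, last_year, last_count):
    """Average doubling time in years, assuming steady exponential growth."""
    doublings = math.log2(last_count / first_count)
    return (last_year - first_year) / doublings

# Base-pair totals for 1982 and 2001, taken from the GenBank statistics.
bp_doubling = doubling_time(1982, 680338, 2001, 14396883064)
# Sequence-entry totals for the same two years.
seq_doubling = doubling_time(1982, 606, 2001, 13602262)

print(f"base pairs double roughly every {bp_doubling:.2f} years")
print(f"entries double roughly every {seq_doubling:.2f} years")
```

Averaged over the whole period, both figures come out at a little over one year per doubling, in line with the rough estimate on the slide.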
Database Growth (cont.) —
The Human Genome Project and numerous smaller
genome projects have kept the data coming at alarming
rates. As of October 2002, 99 complete, finished
genomes are publicly available for analysis, not
counting all the virus and viroid genomes available.
The International Human Genome Sequencing
Consortium announced the completion of the "Working
Draft" of the human genome in June 2000.
Independently, that same month, the private company
Celera Genomics announced that it had completed the
first “Assembly” of the human genome. Both analyses
were published in mid-February 2001, the Consortium’s
in Nature and Celera’s in Science.
Some neat stuff from the papers —
We, Homo sapiens, aren’t nearly as special as
we had hoped we were. Of the 3.2 billion
base pairs in our DNA:
Traditional, text-book estimates of the number of genes
were often in the 100,000 range; turns out we’ve only
got about twice as many as a fruit fly, between 25,000 and
35,000!
The protein coding region of the genome is only about
1% or so, a bunch of the remainder is ‘jumping’
‘selfish DNA’ of which much may be involved in
regulation and control.
Some 100-200 genes were transferred from an ancestral
bacterial genome to an ancestral vertebrate genome!
(Later shown to be not true by more extensive analyses, and to
be due to gene loss rather than transfer.)
What are sequence databases?
These databases are an organized way to store the tremendous
amount of sequence information accumulating worldwide. Most have
their own specific format. An ‘alphabet soup’ of three major database
organizations around the world is responsible for maintaining most
of this data. They largely ‘mirror’ one another and share accession
codes, but NOT proper identifier names:
North America: the National Center for Biotechnology Information (NCBI),
a division of the National Library of Medicine (NLM), at the National
Institutes of Health (NIH), has GenBank & GenPept. Also Georgetown
University’s National Biomedical Research Foundation (NBRF) Protein
Identification Resource (PIR) & NRL_3D (Naval Research Lab
sequences of known three-dimensional structure).
Europe: the European Molecular Biology Laboratory (EMBL), the
European Bioinformatics Institute (EBI), and the Swiss Institute of
Bioinformatics’ (SIB) Expert Protein Analysis System (ExPASy) all help
maintain the EMBL Nucleotide Sequence Database, and the SWISS-PROT
& TrEMBL amino acid sequence databases.
Asia: the National Institute of Genetics (NIG) supports the Center for
Information Biology’s (CIB) DNA Data Bank of Japan (DDBJ).
A little history —
Developments that affect software and the end user —
The first well recognized sequence database was Dr. Margaret Dayhoff’s
hardbound Atlas of Protein Sequence and Structure, begun in the mid-sixties.
EMBL began in 1980, GenBank in 1982, and DDBJ in 1984.
They are all attempts at establishing an organized, reliable,
comprehensive and openly available library of genetic sequences.
Databases have long since outgrown a hardbound atlas. They have
become huge and have evolved through many changes with many more
yet to come.
Changes in format over the years are a major source of grief for software
designers and program users. Each program needs to be able to
recognize particular aspects of the sequence files; whenever they
change it throws a wrench in the works. NCBI’s ASN.1 format and its
Entrez interface attempt to circumvent some of these frustrations.
However, database format is much debated as many bioinformaticians
argue for relational or object-oriented standards. Unfortunately, until all
biologists and computer scientists worldwide agree on one standard and
all software is (re)written to that standard, neither of which is likely to
happen very quickly, format issues will probably remain the most
confusing and troubling aspect of working with primary sequence data.
So what are these databases like?
Just what are primary sequences?
(Central Dogma: DNA —> RNA —> protein)
Primary refers to one dimension — all of the ‘symbol’ information
written in sequential order necessary to specify a particular
biological molecular entity, be it polypeptide or nucleotide (van
Engelen’s “P”).
The symbols are the one letter codes for all of the biological
nitrogenous bases and amino acid residues and their ambiguity
codes (van Engelen’s “Alphabet”). Biological carbohydrates,
lipids, and structural and functional information are not
sequence data. Not even DNA translations in a DNA database!
However, much of this feature and bibliographic type information
is available in the reference documentation sections associated
with primary sequences in the databases.
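The idea of a fixed alphabet of one-letter codes plus ambiguity codes can be made concrete with a small check. This is a minimal sketch using the standard IUPAC nucleotide codes; the function names are illustrative and not part of any package mentioned in this lecture.

```python
# IUPAC one-letter nucleotide codes: the plain bases plus ambiguity codes.
# Each code maps to the set of bases it may stand for.
IUPAC_NUCLEOTIDE = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def invalid_symbols(sequence, alphabet=IUPAC_NUCLEOTIDE):
    """Return the set of symbols in `sequence` that are not in `alphabet`."""
    return {symbol for symbol in sequence.upper() if symbol not in alphabet}

def is_ambiguous(sequence, alphabet=IUPAC_NUCLEOTIDE):
    """True if any symbol stands for more than one possible base."""
    return any(len(alphabet.get(s, "")) > 1 for s in sequence.upper())

print(invalid_symbols("acgggtttgc"))  # no invalid symbols in a plain DNA string
print(invalid_symbols("acgXgtZ"))     # X and Z are not nucleotide codes
print(is_ambiguous("ACGRYN"))         # R, Y and N are ambiguity codes
```

An amino acid alphabet could be checked the same way by passing a different dictionary as `alphabet`.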
Content & Organization —
Sequence database installations are commonly a complex
ASCII/Binary mix, usually not relational or Object Oriented (but
proprietary ones often are). They’ll contain several very long
text files each containing different types of information all
related to particular sequences, such as all of the sequences
themselves, versus all of the title lines, or all of the reference
sections. Binary files often help ‘glue together’ all of these
other files by providing indexing functions.
Software is usually required to successfully interact with these
databases and access is most easily handled through various
software packages and interfaces, either on the World Wide
Web or otherwise, although systems level commands can be
used if one understands the data's structure well enough.
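The role of an index file can be illustrated with a toy example. The sketch below is only an assumption about how such ‘glue’ might work in principle, not a description of any real installation: it records the byte offset of each entry in one long text file, so that a single entry can be fetched without reading the whole file. Entry names and contents are invented, and title lines are assumed to start with ">" as in FastA format.

```python
import io

def build_offset_index(handle):
    """Map each entry name to the byte offset of its title line."""
    index = {}
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b">"):
            name = line[1:].split()[0].decode("ascii")
            index[name] = offset
    return index

def fetch(handle, index, name):
    """Jump straight to one entry and read it up to the next title line."""
    handle.seek(index[name])
    lines = [handle.readline()]
    for line in iter(handle.readline, b""):
        if line.startswith(b">"):
            break
        lines.append(line)
    return b"".join(lines).decode("ascii")

# A tiny stand-in for a very long database text file (hypothetical entries).
FLATFILE = io.BytesIO(b">seqA first entry\nACGT\nACGT\n>seqB second entry\nGGCC\n")
index = build_offset_index(FLATFILE)
print(index)
print(fetch(FLATFILE, index, "seqB"))
```

Real installations store such offsets in binary files and keep separate indices for names, accession numbers, and keywords, but the lookup idea is the same.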
More organization stuff —
Nucleic acid sequence databases (and TrEMBL) are split into
subdivisions based on taxonomy (historical rankings — the Fungi
warning!). PIR is split into subdivisions based on level of
annotation. TrEMBL sequences are merged into SWISS-PROT
as they receive increased levels of annotation.
Nucleic Acid DB’s
  GenBank/EMBL/DDBJ
    all Taxonomic categories + HTC’s, HTG’s, & STS’s
    “Tags”
      EST’s
      GSS’s
Amino Acid DB’s
  SWISS-PROT
  TrEMBL
  PIR
    PIR1
    PIR2
    PIR3
    PIR4
  NRL_3D
  GenPept
Parts and problems —
All sequence databases contain these elements:
Name: LOCUS, ENTRY, ID all are unique identifiers
Definition: A brief, one-line, textual sequence description.
Accession Number: A constant data identifier.
Source and taxonomy information.
Complete literature references.
Comments and keywords.
The all important FEATURE table!
A summary or checksum line.
The sequence itself.
But:
Each major database as well as each major suite of software tools
that you are likely to use has its own distinct format requirements.
This can be a huge problem and an enormous time sink, even with
helpful tools such as Don Gilbert’s ReadSeq. Therefore, becoming
familiar with some of the common formats is a big help. Look for key
features of each type of entry:
GenBank and GenPept format —
LOCUS       HSEF1AR      1506 bp    mRNA    linear   PRI 12-SEP-1993
DEFINITION  Human mRNA for elongation factor 1 alpha subunit (EF-1 alpha).
ACCESSION   X03558
VERSION     X03558.1  GI:31097
KEYWORDS    elongation factor; elongation factor 1.
SOURCE      human.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 1506)
  AUTHORS   Brands,J.H., Maassen,J.A., van Hemert,F.J., Amons,R. and Moller,W.
  TITLE     The primary structure of the alpha subunit of human elongation……
  JOURNAL   Eur. J. Biochem. 155 (1), 167-171 (1986)
  MEDLINE   86136120
FEATURES             Location/Qualifiers
     source          1..1506
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
     CDS             54..1442
                     /note="EF-1 alpha (aa 1-463)"
                     /codon_start=1
                     /protein_id="CAA27245.1"
                     /db_xref="GI:31098"
                     /db_xref="SWISS-PROT:P04720"
                     /translation="MGKEKTHINIVVIGHVDSGKSTTTGHLIYKCGGIDKRTIEKFEK
                     EAAEMGKGSFKYAWVLDKLKAERERGITIDISLWKFETSKYYVTIIDAPGHRDFIKNM
                     ……VTKSAQKAQKAK"
BASE COUNT      412 a    337 c    387 g    370 t
ORIGIN
        1 acgggtttgc cgccagaaca caggtgtcgt gaaaactacc cctaaaagcc aaaatgggaa
       61 aggaaaagac tcatatcaac attgtcgtca ttggacacgt agattcgggc aagtccacca……….
     1501 aactgt
//
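Because GenBank keywords start in the first column and the sequence follows the ORIGIN line, an entry like this one can be picked apart with a few lines of code. This is a minimal sketch, not a full GenBank parser: the name parse_genbank is illustrative, indented continuation lines and the FEATURES table are skipped, two-word keywords such as BASE COUNT are not handled, and the sample keeps only a few lines of the entry shown here.

```python
SAMPLE = """\
LOCUS       HSEF1AR      1506 bp    mRNA    linear   PRI 12-SEP-1993
DEFINITION  Human mRNA for elongation factor 1 alpha subunit (EF-1 alpha).
ACCESSION   X03558
VERSION     X03558.1  GI:31097
ORIGIN
        1 acgggtttgc cgccagaaca caggtgtcgt gaaaactacc cctaaaagcc aaaatgggaa
//
"""

def parse_genbank(text):
    """Collect top-level keyword fields and the sequence from one record."""
    fields, sequence, in_origin = {}, [], False
    for line in text.splitlines():
        if line.startswith("//"):          # end-of-record marker
            break
        if in_origin:
            # Sequence lines: a position number, then blocks of bases.
            sequence.append("".join(line.split()[1:]))
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line and not line[0].isspace():
            # Top-level keyword in the first column, value after it.
            keyword, _, value = line.partition(" ")
            fields[keyword] = value.strip()
    fields["sequence"] = "".join(sequence)
    return fields

record = parse_genbank(SAMPLE)
print(record["ACCESSION"], len(record["sequence"]))
```

For real work, a maintained reader that handles the feature table and multi-line fields is a better choice than hand-written parsing.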
EMBL and SWISS-PROT format —
ID   EF11_HUMAN     STANDARD;      PRT;   462 AA.
AC   P04720; P04719;
DT   13-AUG-1987 (Rel. 05, Created)……
DE   Elongation factor 1-alpha 1 (EF-1-alpha-1) (Elongation factor 1 A-1)
DE   (eEF1A-1) (Elongation factor Tu) (EF-Tu).
GN   EEF1A1 OR EEF1A OR EF1A.
OS   Homo sapiens (Human),
OS   Bos taurus (Bovine), and
OS   Oryctolagus cuniculus (Rabbit).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
OX   NCBI_TaxID=9606, 9913, 9986;
RN   [1]
RP   SEQUENCE FROM N.A.
RC   SPECIES=Human;
RX   MEDLINE=86136120; PubMed=3512269;
RA   Brands J.H.G.M., Maassen J.A., van Hemert F.J., Amons R., Moeller W.;
RT   "The primary structure of the alpha subunit of human elongation …. -binding sites.";
RL   Eur. J. Biochem. 155:167-171(1986).……
CC   -!- FUNCTION: THIS PROTEIN PROMOTES THE GTP-DEPENDENT BINDING OF
CC       AMINOACYL-TRNA TO THE A-SITE OF RIBOSOMES DURING PROTEIN
CC       BIOSYNTHESIS.
CC   -!- SUBCELLULAR LOCATION: Cytoplasmic.
CC   -!- TISSUE SPECIFICITY: BRAIN, PLACENTA, LUNG, LIVER, KIDNEY,
CC       PANCREAS BUT BARELY DETECTABLE IN HEART AND SKELETAL MUSCLE.
CC   -!- SIMILARITY: BELONGS TO THE GTP-BINDING ELONGATION FACTOR FAMILY.
CC       EF-TU/EF-1A SUBFAMILY……
DR   EMBL; X03558; CAA27245.1; -……
DR   PIR; S18054; EFRB1……
DR   HSSP; Q01698; 1TUI……
DR   InterPro; IPR004160; GTP_EFTU_D3.
DR   Pfam; PF00009; GTP_EFTU; 1……
DR   PROSITE; PS00301; EFACTOR_GTP; 1.
KW   Elongation factor; Protein biosynthesis; GTP-binding; Methylation;
KW   Multigene family.
FT   NP_BIND      14     21       GTP (BY SIMILARITY).
FT   NP_BIND      91     95       GTP (BY SIMILARITY).
FT   NP_BIND     153    156       GTP (BY SIMILARITY).
FT   MOD_RES      36     36       METHYLATION (TRI-).
FT   MOD_RES      55     55       METHYLATION (DI-).
FT   MOD_RES      79     79       METHYLATION (TRI-).
FT   MOD_RES     165    165       METHYLATION (DI-).
FT   MOD_RES     318    318       METHYLATION (TRI-).
FT   BINDING     301    301       ETHANOLAMINE-PHOSPHOGLYCEROL.
FT   BINDING     374    374       ETHANOLAMINE-PHOSPHOGLYCEROL.
FT   CONFLICT     83     83       S -> A (IN REF. 2).
FT   CONFLICT    232    232       L -> V (IN REF. 3).
SQ   SEQUENCE   462 AA;  50141 MW;  D465615545AF686A CRC64;
     MGKEKTHINI VVIGHVDSGK STTTGHLIYK CGGIDKRTIE KFEKEAAEMG KGSFKYAWVL
     DKLKAERERG …… VTKSAQKAQK AK
//
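In EMBL and SWISS-PROT entries every line begins with a two-letter code, which makes the format easy to take apart. The following is a minimal sketch that groups lines by their code; it assumes the usual layout of a two-letter code, three spaces, then the content, and the function name and the shortened sample are illustrative only.

```python
from collections import defaultdict

SAMPLE = """\
ID   EF11_HUMAN     STANDARD;      PRT;   462 AA.
AC   P04720; P04719;
OS   Homo sapiens (Human),
OS   Bos taurus (Bovine), and
OS   Oryctolagus cuniculus (Rabbit).
OX   NCBI_TaxID=9606, 9913, 9986;
//
"""

def group_by_line_code(text):
    """Group the content of an entry's lines under their two-letter codes."""
    codes = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):      # end-of-entry marker
            break
        code, value = line[:2], line[5:].strip()
        if code.strip():               # sequence lines start with blanks
            codes[code].append(value)
    return codes

entry = group_by_line_code(SAMPLE)
accessions = [a.strip() for a in entry["AC"][0].split(";") if a.strip()]
print(entry["ID"][0].split()[0], accessions, len(entry["OS"]))
```

Repeated codes such as OS or CC stay in order within their list, so multi-line fields can be rejoined afterwards.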
PIR/NBRF format —
ENTRY            EFHU1 #type complete
                 iProClass View of EFHU1
TITLE            translation elongation factor eEF-1 alpha-1 chain - human
ALTERNATE_NAMES  translation elongation factor Tu
ORGANISM         #formal_name Homo sapiens #common_name man
                 #cross-references taxon:9606
DATE             30-Jun-1988 #sequence_revision 05-Apr-1995 #text_change…..
ACCESSIONS       B24977; A25409; A29946; A32863; I37339
REFERENCE        A93610
   #authors      Rao, T.R.; Slobin, L.I.
   #journal      Nucleic Acids Res. (1986) 14:2409
   #title        Structure of the amino-terminal end of mammalian elongation…
   #accession    B24977
      ##molecule_type mRNA
      ##residues 1-82,'A',84-94 ##label RAO
      ##cross-references EMBL:X03689; NID:g31109; PIDN:CAA27325.1;
                 PID:g31110…….
GENETICS
   #gene         GDB:EEF1A1; EEF1A; EF1A
      ##cross-references GDB:118791; OMIM:130590
   #map_position 6q14-6q14
   #introns      48/3; 108/3; 207/3; 258/1; 343/3; 422/1
CLASSIFICATION   SF003007
   #superfamily  translation elongation factor Tu; translation elongation
                 factor Tu homology
KEYWORDS         GTP binding; methylated amino acid; nucleotide binding;
                 P-loop; phosphoprotein; protein biosynthesis; RNA binding
FEATURE
   1-223              #domain eEF-1 alpha domain I, GTP-binding #status
                      predicted #label EF1\
   8-156              #domain translation elongation factor Tu homology
                      #label ETU\
   14-21              #region nucleotide-binding motif A (P-loop)\
   153-156            #region GTP-binding NKXD motif\
   245-330            #domain eEF-1 alpha domain II, tRNA-binding
                      #status predicted #label EF2\
   332-462            #domain eEF-1 alpha domain III, tRNA-binding
                      #status predicted #label EF3\
   36,55,79,165,318   #modified_site N6,N6,N6-trimethyllysine (Lys)
                      #status predicted\
   301,374            #binding_site glycerylphosphorylethanolamine
                      (Glu) (covalent) #status predicted
SUMMARY          #length 462 #molecular_weight 50141
SEQUENCE
                5        10        15        20        25        30
      1 M G K E K T H I N I V V I G H V D S G K S T T T G H L I Y K
     31 C G G I D K R T I E K F E K E A A E M G K G S F K Y A W V L
     61 D K L K A E R E R …... Q K A Q K A K
Pearson FastA
format —
>EFHU1 PIR1 release 71.01
MGKEKTHINIVVIGHVDSGKSTTTGHLIYKCGGIDKRTIEKFEKEAAEMG
KGSFKYAWVLDKLKAERERGITIDISLWKFETSKYYVTIIDAPGHRDFIK
NMITGTSQADCAVLIVAAGVGEFEAGISKNGQTREHALLAYTLGVKQLIV
GVNKMDSTEPPYSQKRYEEIVKEVSTYIKKIGYNPDTVAFVPISGWNGDN
MLEPSANMPWFKGWKVTRKDGNASGTTLLEALDCILPPTRPTDKPLRLPL
QDVYKIGGIGTVPVGRVETGVLKPGMVVTFAPVNVTTEVKSVEMHHEALS
EALPGDNVGFNVKNVSVKDVRRGNVAGDSKNDPPMEAAGFTAQVIILNHP
GQISAGYAPVLDCHTAHIACKFAELKEKIDRRSGKKLEDGPKFLKSGDAA
IVDMVPGKPMCVESFSDYPPLGRFAVRDMRQTVAVGVIKAVDKKAAGAGK
VTKSAQKAQKAK
Only one line of documentation allowed!
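The FastA layout, one title line starting with ">" followed by nothing but sequence, is simple enough to read in a few lines. This is a minimal sketch; the function name read_fasta is illustrative, and the sample uses only the first two sequence lines of the entry shown here.

```python
def read_fasta(text):
    """Yield (title, sequence) pairs from FastA-formatted text."""
    title, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new title line closes the previous entry, if there was one.
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:], []
        else:
            chunks.append(line)
    if title is not None:
        yield title, "".join(chunks)

SAMPLE = """\
>EFHU1 PIR1 release 71.01
MGKEKTHINIVVIGHVDSGKSTTTGHLIYKCGGIDKRTIEKFEKEAAEMG
KGSFKYAWVLDKLKAERERGITIDISLWKFETSKYYVTIIDAPGHRDFIK
"""

records = list(read_fasta(SAMPLE))
for title, sequence in records:
    print(title, len(sequence))
```

The same function handles files holding many entries, since each new ">" line starts a fresh record.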
GCG single sequence
format —
!!AA_SEQUENCE 1.0
P1;EFHU1 - translation elongation factor eEF-1 alpha-1 chain - human
N;Alternate names: translation elongation factor Tu……
F;1-223/Domain: eEF-1 alpha domain I, GTP-binding #status predicted <EF1>
F;8-156/Domain: translation elongation factor Tu homology <ETU>
F;14-21/Region: nucleotide-binding motif A (P-loop)
F;153-156/Region: GTP-binding NKXD motif
F;245-330/Domain: eEF-1 alpha domain II, tRNA-binding #status predicted <EF2>
F;332-462/Domain: eEF-1 alpha domain III, tRNA-binding #status predicted
<EF3>
F;36,55,79,165,318/Modified site: N6,N6,N6-trimethyllysine (Lys) #status
predicted
F;301,374/Binding site: glycerylphosphorylethanolamine (Glu) (covalent)
#status predicted
EFHU1 Length: 462 January 14, 2002 19:49 Type: P Check: 5308 ..
  1  MGKEKTHINI VVIGHVDSGK STTTGHLIYK CGGIDKRTIE KFEKE……
351  GQISAGYAPV LDCHTAHIAC KFAELKEKID RRSGKKLEDG PKFLKSGDAA
401  IVDMVPGKPM CVESFSDYPP LGRFAVRDMR QTVAVGVIKA VDKKAAGAGK
451  VTKSAQKAQK AK
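The Check: value on the GCG header line is a checksum over the sequence. The sketch below shows the position-weighted sum commonly described for GCG checksums; treat it as an assumption about the scheme rather than a statement of what the Wisconsin Package does internally. It has not been verified against the Check value shown in this entry, and the function name is illustrative.

```python
def gcg_checksum(sequence):
    """Position-weighted checksum; weights cycle 1..57, result is modulo 10000."""
    total = 0
    for position, symbol in enumerate(sequence):
        weight = position % 57 + 1           # 1, 2, ..., 57, then 1 again
        total += weight * ord(symbol.upper())
    return total % 10000

# Case does not matter, and the result always fits in four digits.
print(gcg_checksum("AC"), gcg_checksum("ac"))
```

Spaces and position numbers would need to be stripped from the sequence lines before a value like this could be compared with the one in a file header.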
!!AA_MULTIPLE_ALIGNMENT 1.0
small.pfs.msf  MSF: 735  Type: P  July 20, 2001 14:53  Check: 6619 ..
 Name: a49171  Len: 425  Check:  537  Weight: 1.00
 Name: e70827  Len: 577  Check:   21  Weight: 1.00
 Name: g83052  Len: 718  Check: 9535  Weight: 1.00
 Name: f70556  Len: 534  Check: 3494  Weight: 1.00
 Name: t17237  Len: 229  Check: 9552  Weight: 1.00
 Name: s65758  Len: 735  Check:  111  Weight: 1.00
 Name: a46241  Len: 274  Check: 3514  Weight: 1.00
//
……………
!!RICH_SEQUENCE 1.0
..
{
name          ef1a_giala
descrip       PileUp of: @/users1/thompson/.seqlab-mendel/pileup_28.list
type          PROTEIN
longname      /users1/thompson/seqlab/EF1A_primitive.orig.msf{ef1a_giala}
sequence-ID   Q08046
checksum      7342
offset        23
creation-date 07/11/2001 16:51:19
strand        1
comments      …………….
This is SeqLab’s native format
Specialized ‘sequence’ -type DB’s —
Databases that contain special types of sequence
information, such as patterns, motifs, and profiles.
These include: REBASE, EPD, PROSITE, BLOCKS,
ProDom, Pfam . . . .
Databases that contain multiple sequence entries
aligned, e.g. RDP and ALN.
Databases that contain families of sequences ordered
functionally, structurally, or phylogenetically, e.g.
iProClass and HOVERGEN.
Databases of species specific sequences, e.g. the HIV
Database and the Giardia lamblia Genome Project.
And on and on . . . . See Amos Bairoch’s excellent links
page: http://us.expasy.org/alinks.html and the
wonderful Ensembl human genome project at
http://www.ensembl.org/ that tries to tie it all together.
What about other types of biological databases?
Three dimensional structure databases:
the Protein Data Bank and Rutgers Nucleic Acid Database.
These databases contain all of the 3D atomic coordinate data
necessary to define the tertiary shape of a particular biological
molecule. The data is usually experimentally derived, either by
X-ray crystallography or with NMR, but sometimes it is a
hypothetical model. In all cases the source of the structure and
its resolution are clearly indicated.
Secondary structure boundaries, sequence data, and reference
information are often associated with the coordinate data, but it
is the 3D data that really matters, not the annotation.
Molecular visualization or modeling software is required to
interact with the data. It has little meaning on its own. See
Molecules R Us at http://molbio.info.nih.gov/cgi-bin/pdb/ .
Other types of Biological DB’s —
Still more; these can be considered ‘non-molecular’:
Genomic linkage mapping databases for most large genome projects
(w/ pointers to sequences) — H. sapiens, Mus, Drosophila, C.
elegans, Saccharomyces, Arabidopsis, E. coli, . . . .
Reference Databases (also w/ pointers to sequences): e.g.
OMIM — Online Mendelian Inheritance in Man
PubMed/MedLine — over 11 million citations from more than 4
thousand bio/medical scientific journals.
Phylogenetic Tree Databases: e.g. the Tree of Life.
Metabolic Pathway Databases: e.g. WIT (What Is There) and Japan’s
GenomeNet KEGG (the Kyoto Encyclopedia of Genes and
Genomes).
Population studies data — which strains, where, etc.
And then databases that many biocomputing people don’t even usually
consider:
e.g. GIS/GPS/remote sensing data, medical records, census counts,
mortality and birth rates . . . .
So how do you access and manipulate all this data?
Often on the Internet over the World Wide Web:
Site                         URL (Uniform Resource Locator)           Content
Nat’l Center Biotech' Info'  http://www.ncbi.nlm.nih.gov/             databases/analysis/software
PIR/NBRF                     http://www-nbrf.georgetown.edu/          protein sequence database
IUBIO Biology Archive        http://iubio.bio.indiana.edu/            database/software archive
Univ. of Montreal            http://megasun.bch.umontreal.ca/         database/software archive
Japan's GenomeNet            http://www.genome.ad.jp/                 databases/analysis/software
European Mol' Bio' Lab'      http://www.embl-heidelberg.de/           databases/analysis/software
European Bioinformatics      http://www.ebi.ac.uk/                    databases/analysis/software
The Sanger Institute         http://www.sanger.ac.uk/                 databases/analysis/software
Univ. of Geneva BioWeb       http://www.expasy.ch/                    databases/analysis/software
ProteinDataBank              http://www.rcsb.org/pdb/                 3D mol' structure database
Molecules R Us               http://molbio.info.nih.gov/cgi-bin/pdb/  3D protein/nuc' visualization
The Genome DataBase          http://www.gdb.org/                      The Human Genome Project
Stanford Genomics            http://genome-www.stanford.edu/          various genome projects
Inst. for Genomic Res’rch    http://www.tigr.org/                     esp. microbial genome projects
HIV Sequence Database        http://hiv-web.lanl.gov/                 HIV epidemiology seq' DB
The Tree of Life             http://tolweb.org/tree/phylogeny.html    overview of all phylogeny
Ribosomal Database Proj’     http://rdp.cme.msu.edu/html/             databases/analysis/software
WIT Metabolism               http://wit.mcs.anl.gov/WIT2/             metabolic reconstruction
Harvard Bio' Laboratories    http://golgi.harvard.edu/                nice bioinformatics links list
With tools like NCBI’s Entrez & EMBL’s SRS . . .
Net access software examples —
Internet surfing tools — a World Wide Web browser such as
Netscape or MS Explorer.
Advantage: Can access last night’s updates, fun, fast, efficient.
Disadvantage: Reformatting usually essential, and . . .
very easy to get lost and/or distracted in cyberspace!
E-Mail at NCBI’s Retrieve Server and other similar servers.
Advantage: Can access last night’s updates.
Disadvantage: slow, inefficient, and reformatting essential.
Anonymous FTP through many servers worldwide.
Advantage: Can access last night’s updates, very fast.
Disadvantage: Difficult, inefficient, reformatting essential, and . . .
often no graphical user interface (GUI).
NCBI’s Network Entrez program — a client/server database access
solution installed on your machine with a very nice GUI.
Advantage: Very fast, powerful and efficient; relational links
between different databases; neighboring concept.
Disadvantage: Reformatting usually essential.
But problems sometimes arise with the Net, like bad
connections, and there’s always format hassles, etc.
So what are some of the alternatives . . . ?
Desktop software solutions — public domain
programs are available, but . . . complicated to
install, configure, and maintain. User must be
pretty computer savvy. So,
commercial software packages are available, e.g.
Sequencher, MacVector, DNAsis, DNAStar, etc.,
but . . . license hassles, big expense per machine,
and Internet and/or CD database access all
complicate matters!
Therefore, server-based solutions (e.g. the Wisconsin Package)
— we’re talking UNIX server computers here.
One commercial license fee for an entire institution and very fast,
convenient database access on local server disks. Connections from
any networked terminal or workstation anywhere!
Operating system: command line operation hassles; communications
software (telnet, ssh, xdmcp, etc.) and terminal emulation; X
graphics; file transfer (ftp, Mac Fetch, and scp); and editors (vi,
emacs, pico, or desktop word processing followed by file transfer
[save as "text only!"]). Therefore, those in the optional lab section
went through a UNIX tutorial last week and everybody else is
encouraged to take a look:
Week One BioComputing Basics —
(http://bio.fsu.edu/~stevet/BSC5936/Lab1.pdf)
Within the GCG suite, LookUp is an SRS derivative used to find a
sequence of interest from local GCG server databases.
Advantage: Search output is a legitimate GCG list file, appropriate
input to other GCG programs; no need to reformat — all GCG.
Disadvantage: DB’s only as new as administrator maintains them.
The Genetics Computer Group —
the Wisconsin Package for Sequence Analysis.
Begun in 1982 in Oliver Smithies’ lab at the Genetics Dept.
at the University of Wisconsin, Madison, then a private
company for over 10 years, then acquired by the Oxford
Molecular Group U.K., and now owned by Pharmacopeia
U.S.A. under the new name Accelrys, Inc.
The suite contains almost 150 programs designed to work in
a "toolbox" fashion. Several simple programs used in
succession can lead to sophisticated results.
Also 'internal compatibility,' i.e. once you learn to use one
program, all programs can be run similarly, and the
output from many programs can be used as input for
other programs.
Used all over the world by more than 30,000 scientists at
over 530 institutions in 35 countries, so learning it here
will most likely be useful anywhere else you may end up.
To answer the always perplexing GCG question — “What
sequence(s)? . . . .”
Specifying sequences, GCG style;
in order of increasing power and complexity:
The sequence is in a local GCG format single sequence file in your UNIX
account. (GCG Reformat and all From & To programs)
The sequence is in a local GCG database in which case you ‘point’ to it by
using any of the GCG database logical names. A colon, “:,” always sets
the logical name apart from either an accession number or a proper
identifier name or a wildcard expression and they are case insensitive.
The sequence is in a GCG format multiple sequence file, either an MSF
(multiple sequence format) file or an RSF (rich sequence format) file. To
specify sequences contained in a GCG multiple sequence file, supply the
file name followed by a pair of braces, “{},” containing the sequence
specification, e.g. a wildcard — {*}.
Finally, the most powerful method of specifying sequences is in a GCG “list”
file. It is merely a list of other sequence specifications and can even
contain other list files within it. The convention to use a GCG list file in a
program is to precede it with an at sign, “@.” Furthermore, one can
supply attribute information within list files to specify something special
about the sequence.
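The four styles of sequence specification differ only in their punctuation, so they can be told apart by syntax alone. The sketch below is an illustration of the rules just described, not GCG code; the function name and the returned labels are assumptions, and the example specifications are the ones used in the list file example later in this lecture.

```python
def classify_spec(spec):
    """Classify a GCG-style sequence specification by its syntax alone."""
    spec = spec.strip()
    if spec.startswith("@"):
        # An at sign marks a list file of further specifications.
        return ("list", spec[1:], None)
    if spec.endswith("}") and "{" in spec:
        # Braces select sequences inside an MSF or RSF file.
        filename, _, inner = spec[:-1].partition("{")
        return ("multiple", filename, inner)
    if ":" in spec:
        # A colon separates a database logical name from the entry;
        # both parts are case insensitive, so normalize them.
        logical, _, entry = spec.partition(":")
        return ("database", logical.upper(), entry.upper())
    # Anything else is taken to be a single sequence file.
    return ("file", spec, None)

for example in ("my-special.pep", "SwissProt:EfTu_Ecoli",
                "Ef1a-Tu.msf{*}", "@another.list"):
    print(classify_spec(example))
```

The order of the tests matters: the at sign and braces are checked before the colon so that a list or multiple sequence file is never mistaken for a database reference.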
Logical terms for the Wisconsin Package —
Sequence databases, nucleic acids:
GENBANKPLUS    all of GenBank plus EST and GSS subdivisions
GBP            all of GenBank plus EST and GSS subdivisions
GENBANK        all of GenBank except EST and GSS subdivisions
GB             all of GenBank except EST and GSS subdivisions
BA             GenBank bacterial subdivision
BACTERIAL      GenBank bacterial subdivision
EST            GenBank EST (Expressed Sequence Tags) subdivision
GSS            GenBank GSS (Genome Survey Sequences) subdivision
HTC            GenBank High Throughput cDNA
HTG            GenBank High Throughput Genomic
IN             GenBank invertebrate subdivision
INVERTEBRATE   GenBank invertebrate subdivision
OM             GenBank other mammalian subdivision
OTHERMAMM      GenBank other mammalian subdivision
OV             GenBank other vertebrate subdivision
OTHERVERT      GenBank other vertebrate subdivision
PAT            GenBank patent subdivision
PATENT         GenBank patent subdivision
PH             GenBank phage subdivision
PHAGE          GenBank phage subdivision
PL             GenBank plant subdivision
PLANT          GenBank plant subdivision
PR             GenBank primate subdivision
PRIMATE        GenBank primate subdivision
RO             GenBank rodent subdivision
RODENT         GenBank rodent subdivision
STS            GenBank (sequence tagged sites) subdivision
SY             GenBank synthetic subdivision
SYNTHETIC      GenBank synthetic subdivision
TAGS           GenBank EST and GSS subdivisions
UN             GenBank unannotated subdivision
UNANNOTATED    GenBank unannotated subdivision
VI             GenBank viral subdivision
VIRAL          GenBank viral subdivision
Sequence databases, amino acids:
GENPEPT        GenBank CDS translations
GP             GenBank CDS translations
SWISSPROTPLUS  all of Swiss-Prot and all of SPTrEMBL
SWP            all of Swiss-Prot and all of SPTrEMBL
SWISSPROT      all of Swiss-Prot (fully annotated)
SW             all of Swiss-Prot (fully annotated)
SPTREMBL       Swiss-Prot preliminary EMBL translations
SPT            Swiss-Prot preliminary EMBL translations
P              all of PIR Protein
PIR            all of PIR Protein
PROTEIN        PIR fully annotated subdivision
PIR1           PIR fully annotated subdivision
PIR2           PIR preliminary subdivision
PIR3           PIR unverified subdivision
PIR4           PIR unencoded subdivision
NRL_3D         PDB 3D protein sequences
NRL            PDB 3D protein sequences
General data files:
GENMOREDATA    path to GCG optional data files
GENRUNDATA     path to GCG default data files
These are easy: they make sense and you’ll have a vested interest.
The List File Format —
An example GCG list file of many elongation
1a and Tu factors follows. As with all GCG
data files, two periods separate
documentation from data.
..
my-special.pep begin:24 end:134
SwissProt:EfTu_Ecoli
Ef1a-Tu.msf{*}
/usr/accounts/test/another.rsf{ef1a_*}
@another.list
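The structure of a list file, documentation, then two periods, then one sequence specification per line with optional attributes, can be read with a short routine. This is a minimal sketch rather than GCG's own reader: the function name is illustrative, the documentation line in the sample is invented, and attributes are assumed to follow the specification on the same line as name:value pairs.

```python
def parse_list_file(text):
    """Return (specification, attributes) pairs from GCG list-file text."""
    entries, in_data = [], False
    for line in text.splitlines():
        line = line.strip()
        if not in_data:
            # Everything up to and including the ".." divider is documentation.
            in_data = line.endswith("..")
            continue
        if not line:
            continue
        spec, *rest = line.split()
        # Remaining tokens such as begin:24 are attributes of the sequence.
        attributes = dict(item.split(":", 1) for item in rest if ":" in item)
        entries.append((spec, attributes))
    return entries

SAMPLE = """\
An illustrative documentation line (not from a real file).
..
my-special.pep begin:24 end:134
SwissProt:EfTu_Ecoli
Ef1a-Tu.msf{*}
/usr/accounts/test/another.rsf{ef1a_*}
@another.list
"""

entries = parse_list_file(SAMPLE)
for spec, attributes in entries:
    print(spec, attributes)
```

An entry beginning with an at sign names another list file, so a fuller reader would open that file and call the same routine on it recursively.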
The ‘way’ SeqLab works!
Conclusions —
There’s a bewildering assortment of different databases and ways to
access and manipulate the information within them. The key is to
learn how to use that information in the most efficient manner.
Next, a special treat: a colleague of mine, Misha
Taylor, will further discuss the nature of databases,
with particular emphasis on relational and object
oriented data structures.
To learn more, see the listed references and WWW sites.
Many fine texts are also starting to become available in the field.
FOR EVEN MORE INFO...
http://bio.fsu.edu/~stevet/workshop.html
Contact me ([email protected]) for specific bioinformatics
assistance and/or collaboration.
General Bioinformatics References —
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. (1990) Basic Local Alignment Search Tool. Journal of Molecular Biology 215, 403-410.
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a New Generation of
Protein Database Search Programs. Nucleic Acids Research 25, 3389-3402.
Bairoch A. (1992) PROSITE: A Dictionary of Sites and Patterns in Proteins. Nucleic Acids Research 20, 2013-2018.
Felsenstein, J. (1993) PHYLIP (Phylogeny Inference Package) version 3.5c. Distributed by the author. Dept. of Genetics, University of Washington, Seattle,
Washington, U.S.A.
Genetics Computer Group (GCG), Inc. (Copyright 1982-2001) Program Manual for the Wisconsin Package, Version 10.2, Madison, Wisconsin, USA 53711.
Gribskov, M. and Devereux, J., editors (1992) Sequence Analysis Primer. W.H. Freeman and Company, New York, N.Y., U.S.A.
Gribskov M., McLachlan M., Eisenberg D. (1987) Profile analysis: detection of distantly related proteins. Proc. Natl. Acad. Sci. U.S.A. 84, 4355-4358.
Henikoff, S. and Henikoff, J.G. (1992) Amino Acid Substitution Matrices from Protein Blocks. Proceedings of the National Academy of Sciences U.S.A. 89,
10915-10919.
Needleman, S.B. and Wunsch, C.D. (1970) A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins. Journal
of Molecular Biology 48, 443-453.
Pearson, P., Francomano, C., Foster, P., Bocchini, C., Li, P., and McKusick, V. (1994) The Status of Online Mendelian Inheritance in Man (OMIM) medio
1994. Nucleic Acids Research 22, 3470-3473.
Pearson, W.R. and Lipman, D.J. (1988) Improved Tools for Biological Sequence Analysis. Proceedings of the National Academy of Sciences U.S.A. 85,
2444-2448.
Rost, B. and Sander, C. (1993) Prediction of Protein Secondary Structure at Better than 70% Accuracy. Journal of Molecular Biology 232, 584-599.
Smith, S.W., Overbeek, R., Woese, C.R., Gilbert, W., and Gillevet, P.M. (1994) The Genetic Data Environment, an Expandable GUI for Multiple Sequence
Analysis. CABIOS, 10, 671-675.
Schwartz, R.M. and Dayhoff, M.O. (1979) Matrices for Detecting Distant Relationships. In Atlas of Protein Sequences and Structure, (M.O. Dayhoff editor) 5,
Suppl. 3, 353-358, National Biomedical Research Foundation, Washington D.C., U.S.A.
Smith, T.F. and Waterman, M.S. (1981) Comparison of Bio-Sequences. Advances in Applied Mathematics 2, 482-489.
Sundaralingam, M., Mizuno, H., Stout, C.D., Rao, S.T., Liedman, M., and Yathindra, N. (1976) Mechanisms of Chain Folding in Nucleic Acids. The Omega
Plot and its Correlation to the Nucleotide Geometry in Yeast tRNAPhe1. Nucleic Acids Research 10, 2471-2484.
Swofford, D.L., PAUP (Phylogenetic Analysis Using Parsimony) (1989-1993) Illinois Natural History Survey, (1994) personal copyright, and (1997)
Smithsonian Institution, Washington D.C., U.S.A.
Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994) CLUSTALW: improving the sensitivity of progressive multiple sequence alignment through sequence
weighting, positions-specific gap penalties and weight matrix choice. Nucleic Acids Research, 22, 4673-4680.
von Heijne, G. (1987) Sequence Analysis in Molecular Biology; Treasure Trove or Trivial Pursuit. Academic Press, Inc., San Diego, California, U.S.A.
Wilbur, W.J. and Lipman, D.J. (1983) Rapid Similarity Searches of Nucleic Acid and Protein Data Banks. Proceedings of the National Academy of Sciences
U.S.A. 80, 726-730.
Zuker, M. (1989) On Finding All Suboptimal Foldings of an RNA Molecule. Science 244, 48-52.