Information organization

Oct 2, 2012
Learning objectives
Demonstrate the Dotter program.
Understand how information is stored in GenBank.
Learn how to read a GenBank flat file.
Learn how to search GenBank for information.
Understand the difference between header, features and sequence.
Distinguish between a primary database and a secondary database.
Homework #2 due today.
Homework #3 due Tues. Oct. 9
What is GenBank?
Gene sequence database
Annotated records, each representing a single contiguous stretch of DNA or RNA; a record may have more than one coding region.
Generated from authors' direct submissions to the DNA sequence databases.
Part of the International Nucleotide Sequence Database Collaboration.
http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
History of GenBank
Began with the Atlas of Protein Sequence and Structure (Dayhoff et al., 1965)
In 1986 it shared data with EMBL and in 1987 it
shared data with DDBJ.
Primary database
Examples of secondary databases derived from
GenBank: UniProt, EST database.
GenBank Flat File is a human readable form of a
GenBank record.
[Figure: structure of a gene on double-stranded DNA, showing the coding strand and the template strand with their 5’ and 3’ ends. Labels along the coding strand, 5’ to 3’: promoter, transcription initiation site (start of gene), 5’ untranslated region (5’UTR), protein coding sequence (CDS), 3’ untranslated region (3’UTR), transcription termination site (end of gene). Upstream and downstream are defined relative to the CDS. Flow: DNA -> (transcription) -> RNA -> (translation) -> protein -> (protein folding) -> folded protein.]
Transcript splicing
[Figure: DNA with exons 1 to 4 separated by introns 1 to 3. Flow: DNA -> (transcription) -> primary transcript -> (splicing) -> mRNA with exons 1 to 4 -> (translation) -> protein.]
Alternative splicing
[Figure: primary transcript with exons 1 to 4.]
General Comments on GBFF
Three sections:
 1) Header: information about the whole record.
 2) Features: description of annotations, each represented by a key.
 3) Nucleotide sequence: the record ends with // on its last line.

DNA-centered
Translated sequence is a feature
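The three-section layout can be sketched in Python. This is a minimal illustration, not an official parser: it assumes the record is held in a string and that the FEATURES, BASE COUNT and ORIGIN keywords start in the first column. The function name split_record and the shortened sample record are hypothetical.

```python
def split_record(text):
    """Split one GenBank flat-file record into header, features and sequence lines."""
    sections = {"header": [], "features": [], "sequence": []}
    current = "header"
    for line in text.splitlines():
        if line.startswith("//"):            # terminator: last line of the record
            break
        if line.startswith("FEATURES"):
            current = "features"
        elif line.startswith(("BASE COUNT", "ORIGIN")):
            current = "sequence"
        sections[current].append(line)
    return sections


# Shortened sample built from lines of the U49845 record (assumption: most lines omitted).
record = """LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999
ACCESSION   U49845
FEATURES             Location/Qualifiers
     source          1..5028
ORIGIN
        1 gatcctccat atacaacggt
//"""

parts = split_record(record)
for name, lines in parts.items():
    print(name, len(lines))
```

Keeping the keyword lines inside their sections makes it easy to check where each section starts.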
Feature Keys
Purpose:
 1) Indicates biological nature of sequence
 2) Supplies information about changes to sequences

Feature Key     Description
conflict        Separate determinations of the same seq. differ
rep_origin      Origin of replication
protein_bind    Protein binding site on DNA
CDS             (Protein) coding sequence
Feature Keys-Terminology
Feature Key     Location/Qualifiers
CDS             23..400
                /product="alcohol dehydro."
                /gene="adhI"
The feature CDS is a coding sequence beginning at base 23 and ending at base 400 that has a product called “alcohol dehydrogenase” and corresponds to the gene called “adhI”.
Feature Keys-Terminology (Cont.)
Feature Key     Location/Qualifiers
CDS             join(544..589,688..1032)
                /product="T-cell recep. B-ch."
                /partial
The feature CDS is a partial coding sequence formed by joining the indicated elements to form one contiguous sequence encoding a product called T-cell receptor beta-chain.
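Reading a location string can be sketched in Python. This is a simplified illustration that assumes only the forms used in these slides: a plain range such as 23..400, join(...), complement(...), and the < or > partial markers. The function name parse_location is hypothetical, and the full feature table syntax has more cases than this handles.

```python
import re


def parse_location(location):
    """Return (segments, is_complement, is_partial) for a simple location string."""
    text = location.replace(" ", "")
    is_complement = text.startswith("complement(")
    is_partial = "<" in text or ">" in text
    # Each start..end pair becomes a tuple of 1-based, inclusive coordinates.
    pairs = re.findall(r"[<>]?(\d+)\.\.[<>]?(\d+)", text)
    segments = [(int(start), int(end)) for start, end in pairs]
    return segments, is_complement, is_partial


def feature_length(segments):
    """Total number of bases covered by all segments."""
    return sum(end - start + 1 for start, end in segments)


for loc in ["23..400", "join(544..589,688..1032)", "<1..206", "complement(3300..4037)"]:
    segments, is_complement, is_partial = parse_location(loc)
    print(loc, segments, is_complement, is_partial, feature_length(segments))
```

For 23..400 the length is 378 bases, which is 126 complete codons; the joined example covers 46 + 345 = 391 bases.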
Record from GenBank
LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds, and
            Axl2p (AXL2) and Rev7p (REV7) genes, complete cds.
ACCESSION   U49845
VERSION     U49845.1  GI:1293613
KEYWORDS    .
SOURCE      baker's yeast.
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Hemiascomycetes; Saccharomycetales;
            Saccharomycetaceae; Saccharomyces.

Annotations:
LOCUS: locus name (SCU49845), GenBank division (PLN: plant, fungal and algal), modification date (21-JUN-1999).
DEFINITION: "cds" stands for coding sequence.
ACCESSION: accession number (never changes).
VERSION: nucleotide sequence identifier in accession.version form (changes when there is a change in sequence).
GI: GeneInfo identifier (changes whenever the sequence changes).
KEYWORDS: word or phrase describing the sequence (not based on controlled vocabulary). Not used in newer records.
SOURCE: common name for organism.
ORGANISM: formal scientific name for the source organism and its lineage, based on NCBI Taxonomy Database.
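The header fields can be pulled apart with ordinary string splitting. The sketch below assumes the LOCUS and VERSION lines have exactly the whitespace-separated fields shown in this record; newer records can carry extra fields, so this is an illustration rather than a general parser. The function names parse_locus and parse_version are hypothetical.

```python
def parse_locus(line):
    """Split a LOCUS line of the form used in this record into named fields."""
    fields = line.split()
    # Expected: ['LOCUS', name, length, 'bp', molecule, division, date]
    return {
        "name": fields[1],
        "length": int(fields[2]),
        "molecule": fields[4],
        "division": fields[5],
        "date": fields[6],
    }


def parse_version(line):
    """Split a VERSION line into accession, version number and GI (if present)."""
    fields = line.split()
    accession, version = fields[1].split(".")
    gi = None
    for field in fields[2:]:
        if field.startswith("GI:"):
            gi = field[3:]
    return {"accession": accession, "version": int(version), "gi": gi}


locus = parse_locus("LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999")
version = parse_version("VERSION     U49845.1  GI:1293613")
print(locus)
print(version)
```

The accession stays fixed, while the number after the dot is the part that increases when the sequence is revised.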
Record from GenBank (cont.1)
REFERENCE   1  (bases 1 to 5028)
  AUTHORS   Torpey,L.E., Gibbs,P.E., Nelson,J. and Lawrence,C.W.
  TITLE     Cloning and sequence of REV7, a gene whose function is required
            for DNA damage-induced mutagenesis in Saccharomyces cerevisiae
  JOURNAL   Yeast 10 (11), 1503-1509 (1994)
  MEDLINE   95176709
REFERENCE   2  (bases 1 to 5028)
  AUTHORS   Roemer,T., Madden,K., Chang,J. and Snyder,M.
  TITLE     Selection of axial growth sites in yeast requires Axl2p, a
            novel plasma membrane glycoprotein
  JOURNAL   Genes Dev. 10 (7), 777-793 (1996)
  MEDLINE   96194260
REFERENCE   3  (bases 1 to 5028)
  AUTHORS   Roemer,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (22-FEB-1996) Terry Roemer, Biology, Yale University,
            New Haven, CT, USA

Annotations:
REFERENCE 1: oldest reference first.
MEDLINE: Medline UID.
REFERENCE 3: submitter of sequence (always the last reference).
Record from GenBank (cont.2)
Each feature has three parts: a key (a keyword indicating the functional group), a location (instructions for finding the feature), and qualifiers (auxiliary information about the feature).
FEATURES             Location/Qualifiers
     source          1..5028
                     /organism="Saccharomyces cerevisiae"
                     /db_xref="taxon:4932"
                     /chromosome="IX"
                     /map="9"
     CDS             <1..206
                     /codon_start=3
                     /product="TCP1-beta"
                     /protein_id="AAA98665.1"
                     /db_xref="GI:1293614"
                     /translation="SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVSSASEA
                     AEVLLRVDNIIRARPRTANRQHM"

Annotations:
Keys: source, CDS.
Location: 1..5028.
Qualifiers: /organism, /chromosome, /map and so on, each with a value after the = sign.
Database cross-refs: /db_xref.
<1..206: partial sequence on the 5’ end. The 3’ end is complete.
/codon_start=3: start of open reading frame (the third base of the feature).
/product: descriptive free text must be in quotation marks.
/protein_id: protein sequence ID number.
/translation: note that this is only a partial sequence.
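The effect of /codon_start=3 can be checked against the sequence itself. The sketch below translates the first 120 bases listed in the ORIGIN section of this record, starting at base 3. It assumes the standard genetic code; the function name translate is hypothetical, and only the opening stretch of the CDS is covered because only those bases appear on the slides.

```python
BASES = "tcag"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built in the conventional t, c, a, g order.
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna, codon_start=1):
    """Translate DNA starting at the 1-based position codon_start."""
    seq = dna.lower()[codon_start - 1:]
    usable = len(seq) - len(seq) % 3      # ignore an incomplete trailing codon
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, usable, 3))


# First 120 bases of U49845 as listed under ORIGIN.
first_120 = (
    "gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg"
    "ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct"
).replace(" ", "")

# First line of the /translation qualifier of the TCP1-beta CDS.
slide_translation = "SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVSSASEA"

protein = translate(first_120, codon_start=3)
print(protein)                                # SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVS
print(slide_translation.startswith(protein))  # True
```

Starting at base 1 or 2 instead would give a different reading frame, which is why a partial CDS needs the /codon_start qualifier.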
Record from GenBank (cont.3)
     gene            687..3158
                     /gene="AXL2"
     CDS             687..3158
                     /gene="AXL2"
                     /note="plasma membrane glycoprotein"
                     /codon_start=1
                     /function="required for axial budding pattern of S.
                     cerevisiae"
                     /product="Axl2p"
                     /protein_id="AAA98666.1"
                     /db_xref="GI:1293615"
                     /translation="MTQLQISLLLTATISLLHLVVATPYEAYPIGKQYPPVARVN. . ."
     gene            complement(3300..4037)
                     /gene="REV7"
     CDS             complement(3300..4037)
                     /gene="REV7"
                     /codon_start=1
                     /product="Rev7p"
                     /protein_id="AAA98667.1"
                     /db_xref="GI:1293616"
                     /translation="MNRWVEKWLRVYLKCYINLILFYRNVYPPQSFDYTTYQSFNLPQ . . ."

Annotations:
687..3158 and complement(3300..4037): further locations.
/translation: both translations are cut off on the slide.
complement(3300..4037): the coding strand is the complementary strand.
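A complement(...) location means the feature is read from the opposite strand, so the listed bases must be reverse-complemented before translation. A minimal sketch follows; the function names and the ten-base toy sequence are hypothetical, because the slides do not show bases 3300 to 4037 of the real record.

```python
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def reverse_complement(seq):
    """Complement every base, then reverse to read the other strand 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def extract(seq, start, end, on_complement=False):
    """Return bases start..end (1-based, inclusive), from the other strand if requested."""
    piece = seq[start - 1:end]
    return reverse_complement(piece) if on_complement else piece


toy = "ggttacatcc"                           # made-up sequence for illustration
print(extract(toy, 3, 8))                    # ttacat
print(extract(toy, 3, 8, on_complement=True))  # atgtaa
```

In the toy example, complement(3..8) yields atgtaa, a start codon followed by a stop codon, which is not visible when the listed strand is read directly.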
Record from GenBank (cont.4)
BASE COUNT     1510 a   1074 c    835 g   1609 t
ORIGIN
        1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg
       61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct . . .
//
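The sequence section can be turned back into a plain string by dropping the position numbers and spaces. The sketch below does this for the two ORIGIN lines shown and checks that the BASE COUNT values add up to the length on the LOCUS line. The function name read_origin is hypothetical, and only the 120 bases visible on the slide are used.

```python
from collections import Counter

origin_lines = [
    "        1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg",
    "       61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct",
]


def read_origin(lines):
    """Drop the leading position number and the spaces, keeping only the bases."""
    return "".join("".join(line.split()[1:]) for line in lines)


sequence = read_origin(origin_lines)
counts = Counter(sequence)
print(len(sequence))                         # 120
print(sorted(counts.items()))

# BASE COUNT values reported for the whole record.
reported = {"a": 1510, "c": 1074, "g": 835, "t": 1609}
print(sum(reported.values()))                # 5028, the length on the LOCUS line
```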
[Figure labels: DNA, RNA, protein, cDNA.]
DNA databases derived from GenBank containing data for a single gene
•Non-redundant (nr)
•dbGSS
•dbSTS
RNA (cDNA) databases derived from GenBank containing data for a single gene
•dbEST
•UniGene
•RefSeq
Protein databases derived from GenBank containing data for a single gene
•Non-redundant (nr)
•UniProtKB
Types of primary databases carrying biological information
GenBank/EMBL/DDBJ
dbEST-expressed sequence tags-single pass cDNA
sequences (high error freq.)
It is non-redundant
PDB-Three-dimensional structure coordinates of
biological molecules
PROSITE-database of protein domain/function
relationships.
Summary
GenBank-longest running molecular
biology database.
Three sections in every GenBank record
Primary databases and secondary databases.
RefSeq-contains unique record for each
RNA variant.
UniProtKB-protein centered
Workshop
Do problem 1 in Chapter 2.
Homework
Do problems 2 and 3 in Chapter 2.
