Central Dogma of Molecular Biology


Fundamentals in Sequence Analysis 1 (part 1)
Review of basic biology + database searching in biology.
Hugues Sicotte
NCBI
The Flow of Biotechnology Information
Gene
> DNA sequence
AATTCATGAAAATCGTATACTGGTCTGGTACCGGCAACAC
TGAGAAAATGGCAGAGCTCATCGCTAAAGGTATCATCGAA
TCTGGTAAAGACGTCAACACCATCAACGTGTCTGACGTTA
ACATCGATGAACTGCTGAACGAAGATATCCTGATCCTGGG
TTGCTCTGCCATGGGCGATGAAGTTCTCGAGGAAAGCGAA
TTTGAACCGTTCATCGAAGAGATCTCTACCAAAATCTCTG
GTAAGAAGGTTGCGCTGTTCGGTTCTTACGGTTGGGGCGA
CGGTAAGTGGATGCGTGACTTCGAAGAACGTATGAACGGC
TACGGTTGCGTTGTTGTTGAGACCCCGCTGATCGTTCAGA
ACGAGCCGGACGAAGCTGAGCAGGACTGCATCGAATTTGG
TAAGAAGATCGCGAACATCTAGTAGA
Function
> Protein sequence
MKIVYWSGTGNTEKMAELIAKGIIESGKDVNTINVSDVNI
DELLNEDILILGCSAMGDEVLEESEFEPFIEEISTKISGK
KVALFGSYGWGDGKWMRDFEERMNGYGCVVVETPLIVQNE
PDEAEQDCIEFGKKIANI
Prerequisites to Sequence Analysis
• Basic biology, so you can understand the language of the databases: Central Dogma (transcription, translation, prokaryotes, eukaryotes, CDS, 3´UTR, 5´UTR, introns, exons, promoters, operons, codons, start codons, stop codons, snRNA, hnRNA, tRNA, secondary structure, tertiary structure).
• Before you can analyze sequences, you have to understand their structure and know about basic biological database searching.
Central Dogmas of Molecular Biology
1) The concept of a gene was historically defined on the basis of genetic inheritance of a phenotype (Mendelian inheritance).
2) The DNA of an organism encodes the genetic information. It is a double-stranded helix whose backbone is made of deoxyribose sugars and phosphates, carrying four bases:
Adenine (A), Cytosine (C), Guanine (G) and Thymine (T).
[Note that only 4 values need to be encoded (A, C, G, T), which can be done using 2 bits. But to allow ambiguity letters (like N, which means any of the 4 nucleotides), one usually resorts to a 4-bit alphabet.]
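The note on 2-bit versus 4-bit alphabets can be sketched in a few lines of Python. This is only an illustration: the bit assignments and the helper names `pack_2bit` and `compatible` are assumptions made for the example, not the layout of any particular file format.

```python
# 2-bit code: enough for the four unambiguous bases.
TWO_BIT = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

# 4-bit code: one bit per base, so an ambiguity letter is the OR of its bases.
A, C, G, T = 1, 2, 4, 8
FOUR_BIT = {
    "A": A, "C": C, "G": G, "T": T,
    "R": A | G,          # purine
    "Y": C | T,          # pyrimidine
    "N": A | C | G | T,  # any of the 4 nucleotides
}

def pack_2bit(seq):
    """Pack an unambiguous DNA string into one integer, 2 bits per base."""
    value = 0
    for base in seq.upper():
        value = (value << 2) | TWO_BIT[base]
    return value

def compatible(letter1, letter2):
    """Two 4-bit letters can match if they share at least one base."""
    return (FOUR_BIT[letter1] & FOUR_BIT[letter2]) != 0
```

With the 4-bit alphabet, "N matches anything" becomes a single bitwise AND, which is why the extra 2 bits per base are usually considered worth paying.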
Central Dogmas of Molecular Biology
3) Each base on one strand of the double helix faces its complementary base on the other strand:
A pairs with T, and G pairs with C.
4) Biochemical processes that copy the DNA (replication and transcription) always build the new strand from its 5´ end towards its 3´ end, so sequences are conventionally read from the 5´ side towards the 3´ side.
5) A gene can be located on either the ´plus´ strand or the ´minus´ strand. Rule 4 imposes the orientation of reading, and rule 3 (complementarity) tells us to complement each base. E.g.
if the sequence on the + strand is ACGTGATCGATGCTA, the – strand must be read off by reading the complement of this sequence going ´backwards´:
TAGCATCGATCACGT
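Rules 3 to 5 amount to the "reverse complement" operation. A minimal Python sketch (the function name is chosen for this example) that reproduces the sequence pair above:

```python
# Complement table for the four bases, upper and lower case.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Complement every base (rule 3), then read it backwards (rule 4)."""
    return seq.translate(COMPLEMENT)[::-1]

plus_strand = "ACGTGATCGATGCTA"
minus_strand = reverse_complement(plus_strand)
print(minus_strand)  # TAGCATCGATCACGT
```

Applying the function twice returns the original sequence, which is a quick sanity check when handling minus-strand genes.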
Central Dogmas of Molecular Biology
6) DNA information is copied over to mRNA that acts as a template to
produce proteins.
We often concentrate on protein coding genes, because proteins are
the building blocks of cells and the majority of bio-active molecules.
(but let´s not forget the various RNA genes)
Prokaryotic genes
Prokaryotes (intronless protein coding genes)
[Diagram: DNA, with the promoter upstream (5’), the gene region (start codon shown as TAC on the DNA), and the downstream (3’) region]
Transcription (gene is encoded on minus strand, and the reverse complement is read into mRNA)
[Diagram: mRNA = 5´ UTR, CoDing Sequence (CDS) beginning at ATG, 3´ UTR]
Translation: tRNAs read off the codons, 3 bases at a time, starting at the start codon until a STOP codon is reached.
[Diagram: protein]
Why does Nature bother with the mRNA?
Why would the cell want to have an intermediate between DNA and
the proteins it encodes?
•Gene information can be amplified by having many copies of an
RNA made from one copy of DNA.
•Regulation of gene expression can be effected by having specific
controls at each element of the pathway between DNA and
proteins. The more elements there are in the pathway, the more
opportunities there are to control it in different circumstances.
•In Eukaryotes, the DNA can then stay pristine and protected,
away from the caustic chemistry of the cytoplasm.
Prokaryotic genes (operons)
Prokaryotes (operon structure)
upstream promoter
downstream
Gene 1
Gene 2
Gene 3
In prokaryotes, genes that are part of the same operational pathway are sometimes grouped together under a single promoter. They are then transcribed into a single (polycistronic) mRNA, from which the separate proteins (3 in this example) are produced.
Bacterial Gene Structure of signals
Bacterial genomes have simple gene structure.
- Transcription factor binding site.
- Promoters
-35 sequence (T82 T84 G78 A65 C54 A45; the numbers are the % conservation of each consensus base), followed by 15-20 bases
-10 sequence (T80 A95 T45 A60 A50 T96), followed by 5-9 bases
- Start of transcription: initiation start: purine (90%) (sometimes it’s the “A” in CAT)
- Translation binding site (Shine-Dalgarno, AGGAGG) 10 bp upstream of AUG
- One or more Open Reading Frame
•start-codon (unless sequence is partial)
•until next in-frame stop codon on that strand ..
Separated by intercistronic sequences.
- Termination
Genetic Code
How does an mRNA specify amino acid sequence? The answer lies in
the genetic code. It would be impossible for each amino acid to be
specified by one nucleotide, because there are only 4 nucleotides and 20
amino acids. Similarly, two nucleotide combinations could only specify
16 amino acids. The final conclusion is that each amino acid is specified
by a particular combination of three nucleotides, called a codon:
Each group of 3 nucleotides codes for one amino acid.
•The first codon is the start codon, and usually codes for the amino acid Methionine (M, whose codon is ‘ATG’).
•The last codon is the stop codon and does NOT code for an amino acid.
It is sometimes represented by ‘*’ to indicate the ‘STOP’ codon.
•A coding region (abbreviation CDS) starts at the START codon and
ends at the STOP codon.
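The rules above can be sketched as a small translation routine. This assumes the standard genetic code only (a few species deviate from it); the names `CODON_TABLE` and `translate` are chosen for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(cds):
    """Translate a coding sequence codon by codon; '*' marks the STOP codon."""
    cds = cds.upper().replace("U", "T")  # databases write uracil as T
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE.get(cds[i:i + 3], "X")  # X = unknown codon
        protein.append(amino_acid)
        if amino_acid == "*":
            break
    return "".join(protein)

# Start of the CDS in the DNA example of the first slides (from its ATG).
print(translate("ATGAAAATCGTATACTGGTCTGGTACC"))  # MKIVYWSGT
```

The input is expected to begin at the start codon; finding that start is the job of ORF detection, not of the translation step.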
Codon table
Note the degeneracy of the genetic code: each amino acid might have up to six codons that specify it.
• Different organisms have different frequencies of codon usage.
• A handful of species vary from the codon association described above, and use different codons for different amino acids.
How do tRNAs recognize to which codon to bring an amino acid? The tRNA has an anticodon on its mRNA-binding end that is complementary to the codon on the mRNA. Each tRNA only binds the appropriate amino acid for its anticodon.
RNA
RNA has the same primary structure as DNA. It consists of a sugar-phosphate
backbone, with bases attached to the 1' carbon of the sugar. The differences
between DNA and RNA are that:
1. RNA has a hydroxyl group on the 2' carbon of the sugar (thus the difference
between deoxyribonucleic acid and ribonucleic acid).
2. Instead of using the base thymine, RNA uses another base called
uracil.
3. Because of the extra hydroxyl group on the sugar, RNA is too bulky to form
a stable double helix. RNA exists as a single-stranded molecule. However,
regions of double helix can form where there is some base pair
complementation (U and A , G and C), resulting in hairpin loops. The RNA
molecule with its hairpin loops is said to have a secondary structure.
4. Because the RNA molecule is not restricted to a rigid double helix, it can
form many different stable three-dimensional tertiary structures.
tRNA (transfer RNA)
is a small RNA that has a very specific secondary and tertiary structure such that it can
bind an amino acid at one end, and mRNA at the other end. It acts as an adaptor to carry
the amino acid elements of a protein to the appropriate place as coded for by the mRNA.
[Figures: secondary structure of tRNA; three-dimensional tertiary structure]
Bacterial Gene Prediction
Most of the consensus sequences are known from E. coli
studies, so for each bacterium the exact distribution of the
consensus will change.
Most modern gene prediction programs need to be
“trained”, i.e. they find their own consensus and assembly
rules given a few example genes.
A few programs find their own rules from a completely
unannotated bacterial genome by trying to find conserved
patterns. This is feasible because ORF’s restrict the
search space of possible gene candidates.
E.g. selfid program([email protected])
Open Reading Frames
The simplest bacterial gene prediction techniques simply
1) identify all open reading frames (ORFs),
2) and blastx them against known proteins.
3) The ORFs with the best homology are retained first.
4) This usually densely covers the bacterial genomes with genes. rRNA and tRNA are detected separately using tRNAscan or blastn.
Open Reading Frames (ORF)
On a given piece of DNA, there can be 6 possible frames. The ORF can be
on either the + or the minus strand, and in any of 3 possible frames:
Frame 1: 1st base of start codon can be at base 1,4,7,10,...
Frame 2: 1st base of start codon can be at base 2,5,8,11,...
Frame 3: 1st base of start codon can be at base 3,6,9,12,...
(frame –1,-2,-3 are on minus strand)
Some programs have other conventions for naming frames.. (0..5, 1-6, etc)
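A minimal six-frame ORF scan following the frame convention above (frames 1..3 on the plus strand, -1..-3 on the minus strand). This is a sketch under simple assumptions: an ORF is an ATG followed by the next in-frame stop codon, and minus-strand coordinates are reported on the reverse-complemented sequence. Real tools such as the NCBI ORF finder have more options; the function names here are illustrative.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def orfs_in_strand(seq, min_codons=1):
    """Yield (frame, start, end): frame is 1-3, coordinates are 1-based."""
    for offset in range(3):
        start = None
        for i in range(offset, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                       # first in-frame start codon
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    yield offset + 1, start + 1, i + 3   # end includes the stop
                start = None

def six_frame_orfs(seq, min_codons=1):
    """ORFs on both strands; minus-strand frames are negative."""
    seq = seq.upper()
    found = list(orfs_in_strand(seq, min_codons))
    # Minus-strand coordinates refer to the reverse-complemented sequence.
    found += [(-f, s, e) for f, s, e in orfs_in_strand(reverse_complement(seq), min_codons)]
    return found

print(six_frame_orfs("CCATGAAATAGCC"))  # [(3, 3, 11)]
```

In the example the ATG begins at base 3, so the ORF is reported in frame 3, matching the "3,6,9,12,..." rule.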
Gene finding in eukaryotic cDNA uses ORF finding + blastx as well.
http://www.ncbi.nlm.nih.gov/gorf/gorf.html
try with gi=41 (or your own piece of DNA)
Eukaryotic Central Dogma
In Eukaryotes (cells where the DNA is sequestered in a separate nucleus)
the DNA does not contain a contiguous copy of the coding sequence; rather, exons must be spliced together.
(Many eukaryotic genes contain no introns, particularly in ´lower´ organisms.)
mRNA – (messenger RNA) Contains the assembled copy of the gene. The mRNA acts as a
messenger to carry the information stored in the DNA in the nucleus to the cytoplasm
where the ribosomes can make it into protein.
Eukaryotic Nuclear Gene Structure
Gene prediction for Pol II transcribed genes.
• Upstream Enhancer elements.
• Upstream Promoter elements.
• GC box (-90 nt) (20 bp), CAAT box (-75 nt) (22 bp)
• TATA promoter (-30 nt) (70%, 15 nt consensus; Bucher et al. (1990))
• 14-20 nt spacer DNA
• CAP site (8 bp)
• Transcription Initiation.
• Transcript region, interrupted by introns.
• Translation Initiation (Kozak signal, 12 bp consensus) 6 bp prior to initiation codon.
• polyA signal (AATAAA 99%, other)
Introns
•Transcript region, interrupted by introns. Each intron
•starts with a donor site consensus (G100 T100 A62 A68 G84 T63 ..)
•has a branch site near the 3’ end of the intron (one not very conserved consensus: UACUAAC)
•ends with an acceptor site consensus (12 Py .. N C65 A100 G100)
[Figure: intron with the branch site UACUAAC and the terminal AG]
Exons
•The exons of the transcript region are composed of:
•5’ UTR (mean length of 769 bp, with a specific base composition that depends on the local G+C content of the genome)
•AUG (or other start codon)
•Remainder of coding region
•Stop Codon
•3’ UTR (mean length of 457 bp, with a specific base composition that depends on the local G+C content of the genome)
Structure of the Eukaryotic Genome
~6-12% of human DNA encodes proteins (higher fraction in nematode)
~10% of human DNA codes for UTR
~90% of human DNA is noncoding.
Non-Coding Eukaryotic DNA
•Untranslated regions (UTRs)
•introns (there can be genes within introns of another gene!)
•intergenic regions.
- repetitive elements
- pseudogenes (dead genes that may (or may not) have been retroposed back into the genome as a single-exon “gene”)
Pseudogenes
Pseudogenes:
DNA sequence that might code for a
gene, but that is unable to result in a protein.
This deficiency might be in transcription (lack of
promoter, for example) or in translation or both.
Processed pseudogenes:
Gene retroposed back into the genome
after being processed by the splicing apparatus.
Thus it is fully spliced and has a polyA tail.
The insertion process flanks the mRNA sequence with
short direct repeats.
Thus no promoter, unless it is accidentally
retroposed downstream of a promoter
sequence.
Do not confuse with single-exon genes.
Repeats
Each repeat family has many subfamilies.
- ALU: ~300 nt long; 600,000 elements in the human
genome. Can cause false homology with mRNA.
Many have an AluI restriction site.
- Retroposons (can get copied back into the
genome)
- Telltale sign: direct or inverted repeats flank
the repeated element. That repeat was the
priming site for the RNA that was inserted.
LINEs (Long INterspersed Elements)
L1: 1-7 kb long, 50000 copies
Have two ORFs!!!!! Will cause problems
for gene prediction programs.
SINEs (Short INterspersed Elements)
Low-Complexity Elements
• When analyzing sequences, one often relies on the
fact that two stretches are similar to infer that they
are homologous (and therefore related). But
sequences with repeated patterns will match
without there being any phylogenetic relation!
• Sequences like ATATATACTTATATA which are
mostly two letters are called low-complexity.
• Triplet repeats (particularly CAG) have a tendency
to make the replication machinery stutter.. So they
are amplified.
• The low-complexity sequence can also be hidden
at the translated protein level.
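One common way to flag low-complexity stretches is to measure the Shannon entropy of the base composition in a sliding window. The sketch below is only illustrative: the window size and the 1.5-bit threshold are arbitrary choices made for this example, and it is not the algorithm of any specific masking program.

```python
import math
from collections import Counter

def shannon_entropy(window):
    """Entropy in bits of the letter composition (2.0 = all 4 bases equally used)."""
    counts = Counter(window)
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def low_complexity_windows(seq, size=12, threshold=1.5):
    """0-based start positions of windows whose entropy is below the threshold."""
    return [i for i in range(len(seq) - size + 1)
            if shannon_entropy(seq[i:i + size]) < threshold]

print(low_complexity_windows("ATATATACTTATATA"))  # [0, 1, 2, 3]
print(low_complexity_windows("ACGTACGTACGT"))     # []
```

The example sequence from the slide is flagged in every window because it is almost entirely A and T, while a sequence using all four bases evenly is not.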
Masking
•To avoid finding spurious matches in alignment programs, you
should always mask out the repeats and low-complexity regions of the query sequence.
•Before predicting genes it is a good idea to mask out repeats (at
least those containing ORFs).
•Before running blastn against a genomic record, you must mask
out the repeats.
•Most used Programs:
CENSOR:
Repeat Masker:
http://ftp.genome.washington.edu/cgi-bin/RepeatMasker
More Non-Protein genes
rRNA - ribosomal RNA
is one of the structural components of the ribosome. It has sequence
complementarity to regions of the mRNA so that the ribosome knows where to
bind to an mRNA it needs to make protein from.
snRNA - small nuclear RNA
is involved in the machinery that processes RNAs (e.g. splicing) before they travel from the
nucleus to the cytoplasm.
hnRNA – heterogeneous nuclear RNA.
the unprocessed primary transcripts (pre-mRNA) found in the nucleus.
Protein Processing & localization.
The protein as read off from the mRNA may not be in the final
form that will be used in the cell. Some proteins contain
• Signal Peptide (located at the N-terminus (beginning)): this signal
peptide is used to guide the protein towards its
final cellular localization. The signal peptide is cleaved out at
the cleavage site once the protein has reached (or is near) its final
destination.
•Various Post-Translational modifications (e.g. phosphorylation)
The final protein is called the “mature peptide”
Convention for nucleotides in database
Because the mRNA is actually read off the minus strand
of the DNA, the nucleotide sequences are always quoted
on the minus strand.
In bioinformatics the sequence format does NOT make a
difference between Uracil and Thymine. There is no
symbol for Uracil: it is always represented by a ´T´.
Even genomic sequence follows that convention. A gene
on the ´plus´ strand is quoted so that it is in the same
strand as its product mRNA.
Biology Information on the Internet
• Introduction to Databases
• Searching the Internet for Biology
Information.
– General Search methods
– Biology Web sites
• Introduction to Genbank file format.
• Introduction to Entrez and Pubmed
• Ref: Chapters 1,2,5,6 of “Bioinformatics”
• Databases:
– A collection of records.
– Each record has many fields.
– Each field contains specific information.
– Each field has a data type.
» E.g. money, currency, text field, integer, date, address (text field), citation (text field)
– Each record has a primary key: a UNIQUE identifier that unambiguously defines this record.
Spreadsheet: flat-file version of a database.
gi       Accession  version  date      Genbank Division  taxid  organism      Number of Chromosomes
6226959  NM_000014  3        06/01/00  PRI               9606   homo sapiens  22 diploid + X+Y
6226762  NM_000014  2        10/12/99  PRI               9606   homo sapiens  22 diploid + X+Y
4557224  NM_000014  1        02/04/99  PRI               9606   homo sapiens  22 diploid + X+Y
41       X63129     1        06/06/96  MAM               9913   bos taurus    29+X+Y
gi       Accession  version  date        Genbank Division  taxid  organism      Number of Chromosomes
6226959  NM_000014  3        01/06/2000  PRI               9606   homo sapiens  22 diploid + X+Y
6226762  NM_000014  2        12/10/1999  PRI               9606   homo sapiens  22 diploid + X+Y
4557224  NM_000014  1        04/02/1999  PRI               9606   homo sapiens  22 diploid + X+Y
41       X63129     1        06/06/1996  MAM               9913   bos taurus    29+X+Y
GI = Genbank Identifier: unique key: primary key.
The GI changes with each update of the sequence record.
Accession Number: secondary key: points to the same locus and sequence despite sequence updates.
Accession + Version Number is equivalent to the GI.
Relational Database (normalizing a database for repeated subelements: splitting it into smaller databases, and relating
the sub-databases to the first one using the primary key.)
gi       Accession  version  date        Genbank Division  taxid
6226959  NM_000014  3        01/06/2000  PRI               9606
6226762  NM_000014  2        12/10/1999  PRI               9606
4557224  NM_000014  1        04/02/1999  PRI               9606
41       X63129     1        06/06/1996  MAM               9913

taxid  organism      Number of Chromosomes
9606   homo sapiens  22 diploid + X+Y
9913   bos taurus    29+X+Y
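The same normalization can be tried with Python's built-in sqlite3 module. This is a sketch only: the table and column names are invented for the example and do not reflect the schema actually used at NCBI.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE organism (taxid INTEGER PRIMARY KEY, name TEXT, chromosomes TEXT);
CREATE TABLE sequence (gi INTEGER PRIMARY KEY, accession TEXT, version INTEGER,
                       division TEXT, taxid INTEGER REFERENCES organism(taxid));
""")

# Organism details are stored once per taxid instead of once per sequence.
conn.executemany("INSERT INTO organism VALUES (?, ?, ?)", [
    (9606, "homo sapiens", "22 diploid + X+Y"),
    (9913, "bos taurus", "29+X+Y"),
])
conn.executemany("INSERT INTO sequence VALUES (?, ?, ?, ?, ?)", [
    (6226959, "NM_000014", 3, "PRI", 9606),
    (6226762, "NM_000014", 2, "PRI", 9606),
    (4557224, "NM_000014", 1, "PRI", 9606),
    (41, "X63129", 1, "MAM", 9913),
])

# The join on taxid rebuilds the original flat rows on demand.
rows = conn.execute("""
    SELECT s.gi, s.accession, s.version, o.name
    FROM sequence AS s JOIN organism AS o ON o.taxid = s.taxid
    WHERE s.accession = ? ORDER BY s.version
""", ("NM_000014",)).fetchall()
for row in rows:
    print(row)
```

The chromosome count now lives in one place, so correcting it for an organism updates every sequence record that points to that taxid.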
Types of Relational databases.
• The Internet can be thought of as one
enormous relational database.
– The “links”/URLs are the primary keys.
• SQL (Structured Query Language)
– Sybase; Oracle; Access (database systems)
• Sybase used at NCBI.
– SRS(One type of database querying system of
use in Biology)
Indexed searches.
• To allow easy searching of a database, make
an index.
• An index is a list of primary keys
corresponding to a key in a given field (or to
a collection of fields)
Genbank division  gi list
PRI               6226959; 6226762; 4557224; …
MAM               41; …

Accession         gi list
NM_000014         6226959; 6226762; 4557224;
X63129            41;
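Building such an index is a short loop: for every value found in a field, collect the primary keys (gi) of the records carrying it. A minimal sketch, with record layout and function name chosen for the example:

```python
from collections import defaultdict

records = [
    {"gi": 6226959, "accession": "NM_000014", "division": "PRI"},
    {"gi": 6226762, "accession": "NM_000014", "division": "PRI"},
    {"gi": 4557224, "accession": "NM_000014", "division": "PRI"},
    {"gi": 41, "accession": "X63129", "division": "MAM"},
]

def build_index(records, field):
    """Map each value of `field` to the list of primary keys that have it."""
    index = defaultdict(list)
    for record in records:
        index[record[field]].append(record["gi"])
    return dict(index)

division_index = build_index(records, "division")
print(division_index["PRI"])  # [6226959, 6226762, 4557224]
```

A search on a field then becomes a dictionary lookup instead of a scan through every record.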
Indexed searches.
• Boolean Query: Merging and Intersecting lists:
– AND (in both lists) (e.g. human AND genome)
– +human +genome
– human && genome
– OR (in either lists) (e.g. human OR genome)
– human || genome
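Once every term has its list of record identifiers, Boolean queries reduce to set operations. A small sketch with made-up identifier lists (the numbers are placeholders, not real records):

```python
# Hypothetical index: term -> set of document identifiers containing it.
postings = {
    "human": {1, 2, 5, 8},
    "genome": {2, 3, 8, 9},
}

def query_and(index, *terms):
    """Documents found in the lists of ALL terms (intersection)."""
    return set.intersection(*(index.get(term, set()) for term in terms))

def query_or(index, *terms):
    """Documents found in the list of ANY term (union)."""
    return set.union(*(index.get(term, set()) for term in terms))

print(sorted(query_and(postings, "human", "genome")))  # [2, 8]
print(sorted(query_or(postings, "human", "genome")))   # [1, 2, 3, 5, 8, 9]
```

AND can only shrink the result and OR can only grow it, which is why OR queries return the most matches.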
Search strategies
• Search engines use complex strategies that go
beyond Boolean queries.
– Phrases matching:
• human genome -> “human genome”
– togetherness: documents with human close to genome
are scored higher.
– Term expansion & synonyms:
• human -> homo sapiens
– neighbours:
– human genome -> genome projects, chromosomes, genetics
– Frequency of links (www.google.com)
• To avoid this term mapping, enclose your queries in quotes:
“human” AND “genome”
Search strategies
• To require that ALL the terms in your query be treated as important,
precede them with a “+”. This also prevents term
mapping.
• To force the order of the words to be important, group
them within a quoted string: “biology of mammals”.
Indexed searches.
Example
• find the advanced query page at
http://www.altavista.com
• type human (and hit the Search button)
• type genome
• type human AND genome
• type “human genome” (finds the fewest matches)
• type human OR genome (finds the most matches)
• Search Engines:
– Web Spiders: collection of all web pages, but
since web pages change all the time and new
ones appear, they must constantly roam the web
and re-index, or depend on people submitting
their own pages.
• www.google.com (BEST!)
• www.infoseek.com
• www.lycos.com
• www.exite.com
• www.webcrawler.com
• www.looksmart.com (country specific)
• Search Engines:
• www.google.com (BEST!)
• Google ranks pages according to how many pages with those
terms refer to the pages you are asking for. Not only must one
document contain ALL the search terms, but other documents
which refer to this one must also contain all the terms.
• Great when you know what you are looking for! You can also
use “” to require immediate proximity and order of terms.
• E.g. type
» Web server for the blast program.
But google only indexes about 40% of the web.. So you may
have to use other web spiders.
(disclaimer.. I don’t own stock in that company.. But I’d like to)
• Search Engines:
– Curated Collections: Not comprehensive:
Contains list of best sites for commonly
requested topics, but is missing important sites
for more specialized topics (like biology)
• www.yahoo.com (Has travel maps too!)
– Answer-based curated collections: easy to
use English-like queries. First looks at a list of
predefined answers, then refines answers based
on user interaction. Also answers new questions.
• www.askjeeves.com
• www.magellan.com
• www.altavista.com (has translation TOOLS)
• www.hotbot.com
• Search Engines:
– Meta-Search Engines: poll several search
engines, and return the consensus of all results.
Likely to miss sites, but the sites returned are
very relevant to the query.
– The other operating mode is to return the sum of all
the results. It then becomes very sensitive to a
very detailed query.
• www.metacrawler.com
• www.savvysearch.com
• www.1blink.com (fast)
• www.metafind.com
• www.dogpile.com
• Virtual Libraries: Curated collections of
links for Biologists.(by Biologists)
– Pedro’s BioMolecular Research Tools:(1996)
• http://www.public.iastate.edu/~pedro/
– Virtual Library: Bio Sciences
• http://vlib.org/Biosciences.html
– Publications and abstract search.
• http://www.ncbi.nlm.nih.gov/
– Expasy server
• http://www.expasy.ch
– EBI Biocatalog (software & databases list)
• http://www.ebi.ac.uk/biocat/
Biological Databases
• Nucleotide databases:
– Genbank: International Collaboration
• NCBI(USA), EMBL(Europe), DDBJ (Japan and Asia)
• A “bank”: no curation. Submission to these databases is
required for publication in a journal.
– Organism specific databases (Exercise: find URLs
using search engines)
• FlyBase
• ChickGBASE
• pigbase
• wormpep
• YPD (Yeast Protein Database)
• SGD (Saccharomyces Genome Database)
• Protein Databases:
– NCBI:
– Swiss Prot:(Free for academic use, otherwise
commercial. Licensing restrictions on discoveries made
using the DB. 1998 version free of any licensing)
• http://www.expasy.ch(latest pay version)
• NCBI has the latest free version.
• Translated Proteins from Genbank Submissions
– EMBL
• TrEMBL is a computer-annotated supplement of SWISS-PROT
that contains all the translations of EMBL nucleotide sequence
entries not yet integrated in SWISS-PROT
– PIR
• Structure databases:
– PDB: Protein structure database.
• Http://www.rscb.org/pdb/
– MMDB: NCBI’s version of PDB with entrez
links.
• Http://www.ncbi.nlm.nih.gov
• Genome Mapping Information:
– http://www.il-st-acad-sci.org/health/genebase.html
– NCBI(Human)
– Genome Centers:
• Stanford, Washington University, Stanford
– Research Centers and Universities
• Literature databases:
– NCBI: Pubmed: all biomedical literature.
• Www.ncbi.nlm.nih.gov
• Abstracts and links to publisher sites for
– full text retrieval/ordering
– journal browsing.
– Publisher web sites.
– Biomednet: commercial site for literature
search.
• Pathways Database:
– KEGG: Kyoto Encyclopedia of Genes and
Genomes: www.genome.ad.jp/kegg/kegg/html
• Database Identifiers: primary keys
– GI (changes with each sequence update, for
NCBI only)
• Annotation may change without the gi changing!
– Accession (stable)
– version (changes with each sequence update)
– “Version” also refers to Accession.version
– Secondary accession: records may have been
merged in the past, so the records which were
not chosen as the primary were made
secondary.
Primary Databases
• A primary database is a repository of data
derived from experiments or from research
knowledge.
– Genbank (nucleotide repository)
– Protein DB, Swissprot
– PDB (MMDB) are primary databases.
– Pubmed (literature)
– Genome Mapping databases.
– Kegg Database (pathways)
Secondary Databases
• A secondary database contains information
derived from other sources.
– Refseq (curated collection of Genbank at
NCBI)
– Unigene (Clustering of ESTs at NCBI)
• Organism-specific databases are often a mix
between primary and secondary.
Genbank Records
• A Bank: no attempt at reconciliation.
• Submit a sequence → get an Accession Number!
– Cannot modify sequences without submitter’s consent.
– No attempt at reconciliation (not a unique collection per
LOCUS/gene).
– Entries of various sequence quality and different
sources ==> separated into various divisions:
• High Quality sequences in taxon-specific divisions.
• Low Quality sequences in usage-specific databases.
• A Collaboration between NCBI, EMBL and
DDBJ. They contain (nearly) the same
information, only the data format differs.
EMBL does not differentiate between the different types of RNA
records, while NCBI (and DDBJ) do. In Entrez EMBL records are
patched up to add that information.
Refseq and LocusLink
• Attempt to produce 1 mRNA, 1 protein, and
1 genomic gene for each frequently
occurring allele of a protein-expressing gene.
• www.ncbi.nlm.nih.gov/LocusLink
• Special non-genbank Accession numbers
– NM_nnnnnn mRNA refseq
– NP_nnnnnn protein refseq
– NC_nnnnnn refseq genomic contig
– NT_nnnnnn temporary genomic contig
– NX_nnnnnn predicted gene
Genbank divisions
Sequences in genbank are split into various categories based
on
1) The quality and type of sequences
2) The high quality nucleotide sequences are divided into
organism-dependent divisions.
• Genbank Entry type: (and query to restrict to that
field)
– mRNA (1/10000 errors)
• biomol_mRNA [PROP]
– cDNA (EST, 95-99% accuracy, single pass )
• gbdiv_EST [PROP]
– genomic ( biomol_genomic [PROP])
• in HTGS division: >99% accuracy;
– gbdiv_HTG [PROP]
• GSS(low-quality genome survey sequences)
– gbdiv_GSS [PROP]
• rest of Genbank; 1/10000 accuracy.
– Human gbdiv_PRI [PROP]
– mouse gbdiv_ROD [PROP]
– bovine gbdiv_MAM [PROP]
– STS(EST or cDNA used in mapping)
• gbdiv_STS [PROP]
FASTA Format
MOST important data format!!!
>identifier descriptive text
nucleotide or amino-acid sequence, on multiple lines if needed.
Example:
>gi|41|emb|X63129.1|BTA1AT B.taurus mRNA for alpha-1-anti-trypsin
GACCAGCCCTGACCTAGGACAGTGAATCGATAATGGCACTCTC
CATCACGCGGGGCCTTCTGCTGCTGGC ….
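Because the format is so simple, a FASTA reader fits in a few lines. This sketch (function name chosen for the example) splits each defline into identifier and descriptive text and joins the sequence lines; libraries such as Biopython provide more complete parsers.

```python
def parse_fasta(text):
    """Return a list of (identifier, description, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:            # close the previous record
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)               # sequence may span many lines
    if header is not None:
        records.append((header, "".join(chunks)))

    parsed = []
    for header, sequence in records:
        parts = header.split(None, 1)         # identifier ends at first space
        identifier = parts[0] if parts else ""
        description = parts[1] if len(parts) > 1 else ""
        parsed.append((identifier, description, sequence))
    return parsed

example = """>gi|41|emb|X63129.1|BTA1AT B.taurus mRNA for alpha-1-anti-trypsin
GACCAGCCCTGACC
TAGGACAG
"""
print(parse_fasta(example))
```

Everything up to the first whitespace on the defline is taken as the identifier, which is the convention most tools follow.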
Modified FASTA Format
1) A few tools follow the convention that
lower case sequences are masked. (repeat
masker, some versions of blast, megablast,
blastz)
2) A few analysis tools (like CLUSTAL)
want a simplified identifier on the defline..
So they can have a short string for the
alignment.
>X63129.1
GACCAGCCCTGACCTAGGACAGTGAATCGATAATGGCACTCTC
CATCACGCGGGGCCTTCTGCTGCTGGC ….
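Both conventions are easy to handle in code. The sketch below turns lower-case ("soft") masking into N ("hard") masking and shortens an NCBI-style defline to accession.version. The helper names are invented for the example, and the assumption that the accession is the 4th "|"-separated field only holds for deflines shaped like the gi|...|emb|... example above.

```python
def hard_mask(seq, mask_char="N"):
    """Replace lower-case (soft-masked) bases by a masking character."""
    return "".join(mask_char if base.islower() else base for base in seq)

def short_identifier(defline):
    """Reduce 'gi|41|emb|X63129.1|BTA1AT ...' to 'X63129.1'."""
    token = defline.lstrip(">").split()[0]
    fields = token.split("|")
    if len(fields) >= 4 and fields[0] == "gi":
        return fields[3]
    return token                               # already a simple identifier

print(hard_mask("ACGTacgtAC"))  # ACGTNNNNAC
print(short_identifier(">gi|41|emb|X63129.1|BTA1AT B.taurus mRNA"))  # X63129.1
```

Soft masking keeps the original bases available, so a tool can ignore the masked region when seeding matches but still extend an alignment through it.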
• WIM now will talk about GCG …
Feature table (NCBI; EMBL/DDBJ)
• http://www.ncbi.nlm.nih.gov/collab/FT/index.html
Genbank Data format
LOCUS       BTA1AT       1380 bp    mRNA            MAM       30-APR-1992
DEFINITION  B.taurus mRNA for alpha-1-antitrypsin.
ACCESSION   X63129
NID         g41
VERSION     X63129.1  GI:41
KEYWORDS    alpha-1 antitrypsin; serine protease inhibitor; serpin.
SOURCE      Bos taurus.
  ORGANISM  Bos taurus
            Eukaryota; Metazoa; Chordata; Vertebrata; Mammalia; Eutheria;
            Artiodactyla; Ruminantia; Pecora; Bovoidea; Bovidae; Bovinae; Bos.
Genbank References
LOCUS       BTA1AT       1380 bp    mRNA            MAM       30-APR-1992
...
REFERENCE   1  (bases 1 to 1380)
  AUTHORS   Sinha,D.
  TITLE     Direct Submission
  JOURNAL   Submitted (22-OCT-1991) D. Sinha, Dept of Biochemistry,
            Temple University, 3400 North Broad Street, Philadelphia, PA 19140, USA
REFERENCE   2  (bases 1 to 1380)
  AUTHORS   Sinha,D., Bakhshi,M.R. and Kirby,E.P.
  TITLE     Complete cDNA sequence of bovine alpha 1-antitrypsin
  JOURNAL   Biochim. Biophys. Acta 1130 (2), 209-212 (1992)
  MEDLINE   92223096
FEATURES             Location/Qualifiers
Genbank Source Qualifier
LOCUS       BTA1AT       1380 bp    mRNA            MAM       30-APR-1992
...
FEATURES             Location/Qualifiers
     source          1..1380
                     /organism="Bos taurus"
                     /db_xref="taxon:9913"
                     /tissue_type="liver"
                     /cell_type="hepatocyte"
                     /clone_lib="lambda gt11"
                     /clone="2f-Ic"
     mRNA            <1..>1380
     sig_peptide     33..104
...
Genbank mRNA+CDS features
     mRNA            <1..>1380
     sig_peptide     33..104
     CDS             33..1283
                     /codon_start=1
                     /product="alpha-1-antitrypsin"
                     /protein_id="CAA44840.1"
                     /db_xref="PID:g42"
                     /db_xref="GI:42"
                     /db_xref="SWISS-PROT:P34955"
                     /translation="MALSITRGLLLLAALCCLAPISLAGVLQGHAVQETDDTSHQEAAC
                     HKIAPNLANFAFSIYHHLAHQSNTSNIFFSPVSIASAFAMLSLGAKGNTHTEILKG
                     LGFNLTELAEAEIHKGFQHLLHTLNQPNHQLQLTTGNGLFINESAKLVDTFLEDV
                     KNLYHSEAFSINFRDAEEAKKKINDYVEKGSHGKIVELVKVLDPNTVFALVNYIS
                     FKGKWEKPFEMKHTTERDFHVDEQTTVKVPMMNRLGMFDLHYCDKLASWVL
                     LLDYVGNVTACFILPDLGKLQQLEDKLNNELLAKFLEKKYASSANLHLPKLSISE
                     TYDLKSVLGDVGITEVFSDRADLSGITKEQPLKVSKALHKAALTIDEKGTEAVG
                     STFLEAIPMSLPPDVEFNRPFLCILYDRNTKSPLFVGKVVNPTQA"
     mat_peptide     105..1280
                     /product="alpha-1-antitrypsin"
     polyA_signal    1343..1348
     polyA_site      1368
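The location strings in the feature table can be decoded with a small regular expression. This sketch only covers the simple forms seen in this record (a single base, `start..end`, and the `<` / `>` partial markers); the full feature table syntax also has operators such as join() and complement(), which are not handled here. The function name and returned keys are choices made for the example.

```python
import re

# Optional "<", start, then optionally ".." followed by optional ">" and end.
LOCATION = re.compile(r"^(<?)(\d+)(?:\.\.(>?)(\d+))?$")

def parse_location(text):
    """Decode a simple feature location into 1-based inclusive coordinates."""
    match = LOCATION.match(text.strip())
    if match is None:
        raise ValueError("unsupported location: " + text)
    start = int(match.group(2))
    end = int(match.group(4)) if match.group(4) else start
    return {
        "start": start,
        "end": end,
        "partial_5": match.group(1) == "<",   # feature begins before base 1
        "partial_3": match.group(3) == ">",   # feature continues past the end
        "length": end - start + 1,
    }

cds = parse_location("33..1283")
print(cds["length"], cds["length"] % 3)  # 1251 0
```

The CDS length of 1251 bases is a multiple of 3, i.e. 417 codons including the stop codon, which is the kind of consistency check worth running on any annotated record.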
Genbank Sequence format
...
BASE COUNT      357 a    413 c    322 g    288 t
ORIGIN
        1 gaccagccct gacctaggac agtgaatcga taatggcact ctccatcacg cggggccttc
       61 tgctgctggc agccctgtgc tgcctggccc ccatctccct ggctggagtt ctccaaggac
      121 acgctgtcca agagacagat gatacatccc accaggaagc agcgtgccac aagattgccc
      181 ccaacctggc caactttgcc ttcagcatat accaccattt ggctcatcag tccaacacca
      241 gcaacatctt cttctccccc gtgagcatcg cttcagcctt tgcgatgctc tccctgggag
      301 ccaagggcaa cactcacact gagatcctga agggcctggg tttcaacctc actgagctcg
      361 cagaggctga gatccacaaa ggctttcagc atcttctcca caccctgaac cagccaaacc
...
     1321 gtccccccac tccctccatg gcattaaagg atgactgacc tagccccgaa aaaaaaaaaa
//
EMBL DATA FORMAT
• Embl: http://www.ebi.ac.uk/Databases/
• http://www.ebi.ac.uk/cgi-bin/emblfetch
• Use Accession X63129
DDBJ DATA FORMAT
• DDBJ: http://www.ddbj.nig.ac.jp/
• http://ftp2.ddbj.nig.ac.jp:8000/getstarte.html
• Use Accession X63129
• Flat file format same as NCBI/Genbank
format.
Entrez
• Index-based search system. Each field in
the database is searchable individually or as
an aggregate.
– (e.g. CDS [FKEY])
– default is aggregate [ALL FIELDS] *
• All primary databases are interlinked as one
big relational database.
– (e.g. Pubmed links in Genbank records)
• Phrase matching.
– Human genome -> “human genome”
Entrez
• Available neighbours (related documents or
related sequences)
• In Pubmed searches: Term mapping to
neighbouring documents and neighbouring terms.
• Term mapping to chemical names.
– In pubmed: term [All Fields] is term mapped to
chemical names + MeSH terms + Text Fields.
– ... unless “term” is within double quotes.
Entrez
• http://www.ncbi.nlm.nih.gov/Entrez/
• Tutorials:
• http://www.ncbi.nlm.nih.gov/Class/MLACourse/Genetics/index.html
• http://www.ncbi.nlm.nih.gov/Literature/pubmed_search.html
• http://www.ncbi.nlm.nih.gov/Database.tut1.html
SWISSPROT
http://www.expasy.ch/sprot/sprot_details.html
1. Core data: protein sequence data; the citation information and the
taxonomic data
2. Annotation
• Function(s) of the protein
• Domains and sites. For example calcium binding regions, ATP-binding sites, zinc fingers, homeobox, kringle, etc.
• Post-translational modification(s). For example carbohydrates,
phosphorylation, acetylation, GPI-anchor, etc.
• Secondary structure
• Quaternary structure. For example homodimer, heterotrimer, etc
• Similarities to other proteins
• Disease(s) associated with deficiency(ies) in the protein
• Sequence conflicts, variants, etc.
SWISSPROT
http://www.expasy.ch/cgi-bin/get-random-entry.pl?S
REBASE (Restriction enzymes dataBASE)
Restriction enzymes have a pattern recognition sequence, and then
within or a few bases away from that pattern is the actual
cutting site
http://rebase.neb.com/rebase/rebase.html
I prefer the Bairoch format (SWISSPROT format)
http://rebase.neb.com/rebase/rebase.f19.html
ID enzyme name
ET enzyme type
OS microorganism name
PT prototype
RS recognition sequence, cut site
MS methylation site (type)
CR commercial sources for the restriction enzyme
CM commercial sources for the methylase
RN [count]
RA authors
RL jour, vol, pages, year, etc.
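A recognition sequence (the RS field) can be searched for in a DNA sequence by turning its ambiguity letters into a regular expression. This is a sketch only: it covers a subset of the ambiguity letters, scans the given strand only, and ignores the cut-site position. The function name is chosen for the example; GAATTC (EcoRI) and GTYRAC (HincII) are used as sample recognition sequences.

```python
import re

# Subset of the nucleotide ambiguity letters, written as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]", "N": "[ACGT]",
}

def find_sites(seq, recognition):
    """1-based start positions of every (possibly overlapping) site."""
    body = "".join(IUPAC[letter] for letter in recognition.upper())
    pattern = re.compile("(?=(" + body + "))")   # lookahead allows overlaps
    return [match.start() + 1 for match in pattern.finditer(seq.upper())]

print(find_sites("AAGAATTCGGGAATTCA", "GAATTC"))  # [3, 11]
print(find_sites("GTTAAC", "GTYRAC"))              # [1]
```

For palindromic recognition sequences such as these two, scanning one strand is enough because the site reads the same on the reverse complement.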
Exercises
•You can work in teams for this.
•1a) Use the first 6000 bases of your genomic piece [or find a
bacterial genomic or mRNA sequence in Entrez with length between
2000 and 10000].
•b) Use the ORF finder to find the gene(s). Compare the answer you
get to the annotation you can infer from using blastn against Genbank
and from using blastx against a protein database.
•Do the Entrez exercises (separate Word document).