
NGS Bioinformatics Workshop
1.2 Sequence Formats, Databases and
Visualization Tools
March 14th, 2012
IRMACS 10900, SFU
Facilitator: Richard Bruskiewich
Adjunct Professor, MBB
Deeply Grateful Acknowledgment:
This week’s slides mainly courtesy of Professor Fiona Brinkman, MBB
Overview
• Understand the purpose and use of bioinformatics database resources, such as GenBank, UniProt/Swiss-Prot, Entrez and Ensembl.
• Be able to recognize common database data formats and sequence identifiers, and know what their primary use is.
• Know what kinds of tools are available to visualize sequence data.
• Appreciate the issues surrounding bioinformatic database updating.
Biological Databases and Data
Models
Great resource: The annual “January Database issue” of Nucleic Acids Research
and associated “database of databases”
http://www.oxfordjournals.org/nar/database/c/
Also check out the annual “web-software” issue of NAR every July
Databases
Organized array of information
Place where you put things in, and (if all goes
well!) you should be able to get them out
again.
Allows you to make discoveries.
Database Examples in Bioinformatics
• Primary (archival)
– GenBank/EMBL/DDBJ (seqs)
– PDB (protein structures)
– Medline (literature)
– IMEx databases (protein interactions)
• Secondary (curated)
– RefSeq (seqs)
– UniProt: Swiss-Prot (seqs)
– Taxon (taxonomy)
– PROSITE (binding sites)
– OMIM (genetics literature/reviews)
– IMEx databases (protein interactions)
UniProt: Swiss-Prot – An example of curated, reviewed annotation
• Incorporates:
– Function of the protein
– Subcellular localization of protein
– Post-translational modification
– Domains and sites
– Secondary structure
– Quaternary structure
– Similarities to other proteins
– Diseases associated with deficiencies in the protein
– Sequence conflicts, variants, etc.
INSDC - International Nucleotide Sequence Database Collaboration
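[Diagram: the three INSDC partner databases exchange data with one another, and each accepts submissions and updates:
• GenBank (NCBI, NIH), searched through Entrez
• EMBL (EBI), searched through SRS
• DDBJ (CIB, NIG), searched through getentry]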
Sequence Databases
DNA
NCBI: GenBank -> RefSeq
EBI: EMBL
Protein
NCBI: GenPept
EBI: UniProt: TrEMBL -> UniProt: Swiss-Prot
(TrEMBL = “translated EMBL”)
National Center for Biotechnology Information www.ncbi.nlm.nih.gov
European Bioinformatics Institute www.ebi.ac.uk
GenBank Flat File
Sections: Header (title, taxonomy, citation); Features (AA seq); DNA Sequence

LOCUS       AF115338      591 bp    DNA     linear   BCT 19-AUG-1999
DEFINITION  Pseudomonas fluorescens ECF sigma factor SigX (sigX) gene, complete
            cds.
ACCESSION   AF115338
VERSION     AF115338.1  GI:4959391
KEYWORDS    .
SOURCE      Pseudomonas fluorescens.
  ORGANISM  Pseudomonas fluorescens
            Bacteria; Proteobacteria; gamma subdivision; Pseudomonadaceae;
            Pseudomonas.
REFERENCE   1  (bases 1 to 591)
  AUTHORS   Brinkman,F.S., Schoofs,G., Hancock,R.E. and De Mot,R.
  TITLE     Influence of a putative ECF sigma factor on expression of the major
            outer membrane protein, OprF, in Pseudomonas aeruginosa and
            Pseudomonas fluorescens
  JOURNAL   J. Bacteriol. 181 (16), 4746-4754 (1999)
  MEDLINE   99369842
   PUBMED   10438740
REFERENCE   2  (bases 1 to 591)
  AUTHORS   De Mot,R.
  TITLE     Direct Submission
  JOURNAL   Submitted (04-DEC-1998) F.A. Janssens Laboratory of Genetics,
            Applied Plant Sciences, K. Mercierlaan 92, Heverlee B-3001, Belgium
FEATURES             Location/Qualifiers
     source          1..591
                     /organism="Pseudomonas fluorescens"
                     /strain="M114"
                     /db_xref="taxon:294"
     gene            1..591
                     /gene="sigX"
     CDS             1..591
                     /gene="sigX"
                     /codon_start=1
                     /transl_table=11
                     /product="ECF sigma factor SigX"
                     /protein_id="AAD34329.1"
                     /db_xref="GI:4959392"
                     /translation="MNKAQTLSTRYDPRELSDEELVARSHTELFHVTRAYEELMRRYQ
                     RTLFNVCARYLGNDRDADDVCQEVMLKVLYGLKNLEGKSKFKTWLYSITYNECITQYR
                     KERRKRRLMDALSLDPLEEASEEKALQPEEKGGLDRWLVYVNPIDRGILVLRFVAELE
                     FQEIADIMHMGLSATKMRYKRALDKLREKFAGETET"
BASE COUNT      157 a    133 c    170 g    131 t
ORIGIN
        1 atgaataaag cccaaacgct atccacgcgc tacgaccccc gcgagctctc tgatgaggag
       61 ttggtcgcgc gctcgcatac cgagcttttt cacgtaacgc gcgcctatga agaactgatg
      121 cggcgttacc agcgaacatt atttaacgtt tgtgcgagat atcttgggaa cgatcgcgac
      181 gcagacgatg tctgtcagga agtcatgttg aaggtgctgt atggcctgaa gaacctcgag
      241 gggaaatcga agttcaaaac gtggctctac agcatcacgt acaacgaatg tattacgcag
      301 tatcggaagg aacggcgaaa gcgtcgcttg atggacgcat tgagtcttga ccccctcgag
      361 gaagcgtccg aagaaaaggc gcttcaaccc gaggagaagg gcgggcttga tcgctggctg
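As a minimal sketch (not an official parser), the LOCUS and BASE COUNT lines of the GenBank record above can be split on whitespace to confirm that the base counts add up to the stated sequence length. The function names `parse_locus_line` and `parse_base_count` are hypothetical. Real GenBank files are column-formatted and older records may lack some LOCUS fields, so an established library such as Biopython is a safer choice for real work.

```python
def parse_locus_line(line):
    """Split a GenBank LOCUS line into named fields (whitespace-delimited sketch)."""
    parts = line.split()
    # Assumed token order: LOCUS name length 'bp' molecule topology division date
    return {
        "name": parts[1],
        "length": int(parts[2]),
        "molecule": parts[4],
        "topology": parts[5],
        "division": parts[6],
        "date": parts[7],
    }


def parse_base_count(line):
    """Turn 'BASE COUNT 157 a 133 c ...' into {'a': 157, 'c': 133, ...}."""
    tokens = line.split()[2:]  # drop the 'BASE' and 'COUNT' keywords
    counts = tokens[0::2]
    bases = tokens[1::2]
    return {base: int(count) for count, base in zip(counts, bases)}


locus = parse_locus_line(
    "LOCUS       AF115338      591 bp    DNA     linear   BCT 19-AUG-1999"
)
base_count = parse_base_count("BASE COUNT      157 a    133 c    170 g    131 t")
total = sum(base_count.values())

# The summed base counts should equal the length given on the LOCUS line.
print(locus["name"], locus["length"], total)
```

Here the four counts (157 + 133 + 170 + 131) sum to 591, matching the LOCUS line.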
EMBL Flat File
Sections: Header (title, taxonomy, citation); Features (AA seq); DNA Sequence

ID   AF115338   standard; DNA; PRO; 591 BP.
AC   AF115338;
SV   AF115338.1
DT   03-JUN-1999 (Rel. 59, Created)
DT   23-AUG-1999 (Rel. 60, Last updated, Version 2)
DE   Pseudomonas fluorescens ECF sigma factor SigX (sigX) gene, complete cds.
KW   .
OS   Pseudomonas fluorescens
OC   Bacteria; Proteobacteria; gamma subdivision; Pseudomonadaceae; Pseudomonas.
RN   [1]
RP   1-591
RX   MEDLINE; 99369842.
RA   Brinkman F.S., Schoofs G., Hancock R.E., De Mot R.;
RT   "Influence of a putative ECF sigma factor on expression of the major outer
RT   membrane protein, OprF, in Pseudomonas aeruginosa and Pseudomonas
RT   fluorescens";
RL   J. Bacteriol. 181(16):4746-4754(1999).
RN   [2]
RP   1-591
RA   De Mot R.;
RT   ;
RL   Submitted (04-DEC-1998) to the EMBL/GenBank/DDBJ databases.
RL   F.A. Janssens Laboratory of Genetics, Applied Plant Sciences, K.
RL   Mercierlaan 92, Heverlee B-3001, Belgium
DR   SPTREMBL; Q9X4L7; Q9X4L7.
FH   Key             Location/Qualifiers
FH
FT   source          1..591
FT                   /db_xref="taxon:294"
FT                   /organism="Pseudomonas fluorescens"
FT                   /strain="M114"
FT   CDS             1..591
FT                   /codon_start=1
FT                   /db_xref="SPTREMBL:Q9X4L7"
FT                   /transl_table=11
FT                   /gene="sigX"
FT                   /product="ECF sigma factor SigX"
FT                   /protein_id="AAD34329.1"
FT                   /translation="MNKAQTLSTRYDPRELSDEELVARSHTELFHVTRAYEELMRRYQR
FT                   TLFNVCARYLGNDRDADDVCQEVMLKVLYGLKNLEGKSKFKTWLYSITYNECITQYRKE
FT                   RRKRRLMDALSLDPLEEASEEKALQPEEKGGLDRWLVYVNPIDRGILVLRFVAELEFQE
FT                   IADIMHMGLSATKMRYKRALDKLREKFAGETET"
SQ   Sequence 591 BP; 157 A; 133 C; 170 G; 131 T; 0 other;
     atgaataaag cccaaacgct atccacgcgc tacgaccccc gcgagctctc tgatgaggag        60
     ttggtcgcgc gctcgcatac cgagcttttt cacgtaacgc gcgcctatga agaactgatg       120
     cggcgttacc agcgaacatt atttaacgtt tgtgcgagat atcttgggaa cgatcgcgac       180
     gcagacgatg tctgtcagga agtcatgttg aaggtgctgt atggcctgaa gaacctcgag       240
     gggaaatcga agttcaaaac gtggctctac agcatcacgt acaacgaatg tattacgcag       300
     tatcggaagg aacggcgaaa gcgtcgcttg atggacgcat tgagtcttga ccccctcgag       360
UniProt: Swiss-Prot (a curated DB)
Callouts: OS/OC = taxonomy; RX = citation; CC = comments, including the disclaimer; DR = database cross-reference

ID   CYS3_YEAST     STANDARD;      PRT;   393 AA.
AC   P31373;
DT   01-JUL-1993 (REL. 26, CREATED)
DT   01-JUL-1993 (REL. 26, LAST SEQUENCE UPDATE)
DT   01-NOV-1995 (REL. 32, LAST ANNOTATION UPDATE)
DE   CYSTATHIONINE GAMMA-LYASE (EC 4.4.1.1) (GAMMA-CYSTATHIONASE).
GN   CYS3 OR CYI1 OR STR1 OR YAL012W OR FUN35.
OS   SACCHAROMYCES CEREVISIAE (BAKER'S YEAST).
OC   EUKARYOTA; FUNGI; ASCOMYCOTA; HEMIASCOMYCETES; SACCHAR
OC   SACCHAROMYCETACEAE; SACCHAROMYCES.
RN   [1]
RP   SEQUENCE FROM N.A., AND PARTIAL SEQUENCE.
RX   MEDLINE; 92250430. [NCBI, ExPASy, Israel, Japan]
RA   ONO B.-I., TANAKA K., NAITO K., HEIKE C., SHINODA S., YAMAMOTO S
RA   OHMORI S., OSHIMA T., TOH-E A.;
RT   "Cloning and characterization of the CYS3 (CYI1) gene of
RT   Saccharomyces cerevisiae.";
RL   J. BACTERIOL. 174:3339-3347(1992).
CC   -!- CATALYTIC ACTIVITY: L-CYSTATHIONINE + H(2)O = L-CYSTEINE +
CC       NH(3) + 2-OXOBUTANOATE.
CC   -!- COFACTOR: PYRIDOXAL PHOSPHATE.
CC   -!- PATHWAY: FINAL STEP IN THE TRANS-SULFURATION PATHWAY SYNTHESIZING
CC       L-CYSTEINE FROM L-METHIONINE.
CC   -!- SUBUNIT: HOMOTETRAMER.
CC   -!- SUBCELLULAR LOCATION: CYTOPLASMIC.
CC   -!- SIMILARITY: BELONGS TO THE TRANS-SULFURATION ENZYMES FAMILY.
CC   --------------------------------------------------------------------------
CC   This SWISS-PROT entry is copyright. It is produced through a collaboration
CC   between the Swiss Institute of Bioinformatics and the EMBL outstation -
CC   the European Bioinformatics Institute. There are no restrictions on its
CC   use by non-profit institutions as long as its content is in no way
CC   modified and this statement is not removed. Usage by and for commercial
CC   entities requires a license agreement (See http://www.isb-sib.ch/announce/
CC   or send an email to [email protected]).
CC   --------------------------------------------------------------------------
DR   EMBL; L05146; AAC04945.1; -. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR   EMBL; L04459; AAA85217.1; -. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR   EMBL; D14135; BAA03190.1; -. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR   PIR; S31228; S31228.
DR   YEPD; 5280; -.
DR   SGD; L0000470; CYS3. [SGD / YPD]
DR   PFAM; PF01053; Cys_Met_Meta_PP; 1.
DR   PROSITE; PS00868; CYS_MET_METAB_PP; 1.
DR   DOMO; P31373.
DR   PRODOM [Domain structure / List of seq. sharing at least 1 domain]
DR   PROTOMAP; P31373.
DR   PRESAGE; P31373.
DR   SWISS-2DPAGE; GET REGION ON 2D PAGE.
KW   CYSTEINE BIOSYNTHESIS; LYASE; PYRIDOXAL PHOSPHATE.
FT   INIT_MET      0      0
FT   BINDING     203    203       PYRIDOXAL PHOSPHATE (BY SIMILARITY).
SQ   SEQUENCE   393 AA;  42411 MW;  55BA2771 CRC32;
     TLQESDKFAT KAIHAGEHVD VHGSVIEPIS LSTTFKQSSP ANPIGTYEYS RSQNPNRENL
     ERAVAALENA QYGLAFSSGS ATTATILQSL PQGSHAVSIG DVYGGTHRYF TKVANAHGVE
     TSFTNDLLND LPQLIKENTK LVWIETPTNP TLKVTDIQKV ADLIKKHAAG QDVILVVDNT
     FLSPYISNPL NFGADIVVHS ATKYINGHSD VVLGVLATNN KPLYERLQFL QNAIGAIPSP
     FDAWLTHRGL KTLHLRVRQA ALSANKIAEF LAADKENVVA VNYPGLKTHP NYDVVLKQHR
     DALGGGMISF RIKGGAEAAS KFASSTRLFT LAESLGGIES LLEVPAVMTH GGIPKEAREA
     SGVFDDLVRI SVGIEDTDDL LEDIKQALKQ ATN
//
PDB
Protein Data Bank
Protein and Nucleic
acid 3D structures
X-ray, NMR, computationally predicted
Sequence
present
PDB
• HEADER
• COMPND
• SOURCE
• AUTHOR
• DATE
• JRNL
• REMARK
• SEQRES
• ATOM COORDINATES

HEADER    LEUCINE ZIPPER                          15-JUL-93   1DGC      1DGC   2
COMPND    GCN4 LEUCINE ZIPPER COMPLEXED WITH SPECIFIC                   1DGC   3
COMPND   2 ATF/CREB SITE DNA                                            1DGC   4
SOURCE    GCN4: YEAST (SACCHAROMYCES CEREVISIAE); DNA: SYNTHETIC        1DGC   5
AUTHOR    T.J.RICHMOND                                                  1DGC   6
REVDAT   1   22-JUN-94 1DGC    0                                        1DGC   7
JRNL        AUTH   P.KONIG,T.J.RICHMOND                                 1DGC   8
JRNL        TITL   THE X-RAY STRUCTURE OF THE GCN4-BZIP BOUND TO        1DGC   9
JRNL        TITL 2 ATF/CREB SITE DNA SHOWS THE COMPLEX DEPENDS ON DNA   1DGC  10
JRNL        TITL 3 FLEXIBILITY                                          1DGC  11
JRNL        REF    J.MOL.BIOL.                   V. 233   139 1993      1DGC  12
JRNL        REFN   ASTM JMOBAK  UK ISSN 0022-2836                 0070  1DGC  13
REMARK   1                                                              1DGC  14
REMARK   2                                                              1DGC  15
REMARK   2 RESOLUTION. 3.0  ANGSTROMS.                                  1DGC  16
REMARK   3                                                              1DGC  17
REMARK   3 REFINEMENT.                                                  1DGC  18
REMARK   3   PROGRAM                    X-PLOR                          1DGC  19
REMARK   3   AUTHORS                    BRUNGER                         1DGC  20
REMARK   3   R VALUE                    0.216                           1DGC  21
REMARK   3   RMSD BOND DISTANCES        0.020  ANGSTROMS                1DGC  22
REMARK   3   RMSD BOND ANGLES           3.86   DEGREES                  1DGC  23
REMARK   3                                                              1DGC  24
REMARK   3   NUMBER OF REFLECTIONS      3296                            1DGC  25
REMARK   3   RESOLUTION RANGE           10.0 - 3.0 ANGSTROMS            1DGC  26
REMARK   3   DATA CUTOFF                3.0   SIGMA(F)                  1DGC  27
REMARK   3   PERCENT COMPLETION         98.2                            1DGC  28
REMARK   3                                                              1DGC  29
REMARK   3   NUMBER OF PROTEIN ATOMS         456                        1DGC  30
REMARK   3   NUMBER OF NUCLEIC ACID ATOMS    386                        1DGC  31
REMARK   4                                                              1DGC  32
SEQRES   1 A   62  ILE VAL PRO GLU SER SER ASP PRO ALA ALA LEU LYS ARG  1DGC  60
SEQRES   2 A   62  ALA ARG ASN THR GLU ALA ALA ARG ARG SER ARG ALA ARG  1DGC  61
SEQRES   3 A   62  LYS LEU GLN ARG MET LYS GLN LEU GLU ASP LYS VAL GLU  1DGC  62
SEQRES   4 A   62  GLU LEU LEU SER LYS ASN TYR HIS LEU GLU ASN GLU VAL  1DGC  63
SEQRES   5 A   62  ALA ARG LEU LYS LYS LEU VAL GLY GLU ARG              1DGC  64
SEQRES   1 B   19    T   G   G   A   G   A   T   G   A   C   G   T   C  1DGC  65
SEQRES   2 B   19    A   T   C   T   C   C                              1DGC  66
HELIX    1   A ALA A  228  LYS A  276  1                                1DGC  67
CRYST1   58.660   58.660   86.660  90.00  90.00  90.00 P 41 21 2     8  1DGC  68
ORIGX1      1.000000  0.000000  0.000000        0.00000                 1DGC  69
ORIGX2      0.000000  1.000000  0.000000        0.00000                 1DGC  70
ORIGX3      0.000000  0.000000  1.000000        0.00000                 1DGC  71
SCALE1      0.017047  0.000000  0.000000        0.00000                 1DGC  72
SCALE2      0.000000  0.017047  0.000000        0.00000                 1DGC  73
SCALE3      0.000000  0.000000  0.011539        0.00000                 1DGC  74
Data Formats
Flat Files √
Many other formats for particular uses…
XML, Clustal (for multiple sequence alignments), GFF
(for sequence annotation), etc…
FASTA – simplest!
High throughput data file formats: BAM, etc.
FASTA
>
>gi|121066|sp|P03069|GCN4_YEAST GENERAL CONTROL PROTEIN GCN4
MSEYQPSLFALNPMGFSPLDGSKSTNENVSASTSTAKPMVGQLIFDKFIKTEEDPI
IKQDTPSNLDFDFALPQTATAPDAKTVLPIPELDDAVVESFFSSSTDSTPMFEYEN
LEDNSKEWTSLFDNDIPVTTDDVSLADKAIESTEEVSLVPSNLEVSTTSFLPTPVL
EDAKLTQTRKVKKPNSVVKKSHHVGKDDESRLDHLGVVAYNRKQRSIPLSPIVPES
SDPAALKRARNTEAARRSRARKLQRMKQLEDKVEELLSKNYHLENEVARLKKLVGE
R
FASTA
> Your favourite gene 1 - yfg1
MSEYQPSLFALNPMGFSPLDGSKSTNENVSASTSTAKPMVGQLIFDKFIKTEEDPI
IKQDTPSNLDFDFALPQTATAPDAKTVLPIPELDDAVVESFFSSSTDSTPMFEYEN
LEDNSKEWTSLFDNDIPVTTDDVSLADKAIESTEEVSLVPSNLEVSTTSFLPTPVL
EDAKLTQTRKVKKPNSVVKKSHHVGKDDESRLDHLGVVAYNRKQRSIPLSPIVPES
SDPAALKRARNTEAARRSRARKLQRMKQLEDKVEELLSKNYHLENEVARLKKLVGE
R
> Your favourite gene 2 - yfg2
MQPSLFALNPMGFSPLDGSKSTNENVSASTSTAKPMVGQLIFDKFIKTEEDPIVIV
DTPSNLDFDFALPQTATAPDAKTVLPIPELDDAVVESFFSSSTDSTPMFEYENWTI
TSLFDNDIPVTTDDVSLADKAIESTEEVSLVPSNLEVSTTSFLPTPVLLEDNSKEW
EDAKLTQTRKVKKPNSVVKKSHHVGKDDESRLDHLGVVAYNRKQRSIPLSPIV
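Because FASTA is just a ">" header line followed by sequence lines, a multi-record file like the one above can be read with a few lines of code. The following is a minimal sketch, not a reference implementation: the function name `read_fasta` is hypothetical, and the sequences in `example` are shortened fragments of the yfg1/yfg2 records, used only for illustration.

```python
def read_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)  # sequence may be wrapped over several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Shortened, illustrative fragments of the two records shown on the slide.
example = """> Your favourite gene 1 - yfg1
MSEYQPSLFALNPMGF
SPLDGSKSTNENVSAS
> Your favourite gene 2 - yfg2
MQPSLFALNPMGFSPL
"""

records = read_fasta(example)
for header, sequence in records:
    print(header, len(sequence))
```

Wrapped sequence lines are joined, so the line length used in the file does not affect the result.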
In GenBank, records are organized for various
reasons. Understanding the rationale behind
“groupings” and “numbering” systems for
such databases is the key to fully taking
advantage of database resources appropriately!
LOCUS vs Accession vs PID vs protein_id: What’s
the difference?
LOCUS: Unique string of 10 letters and numbers in
the database. Not maintained amongst databases.
ACCESSION: A unique identifier to that record (particular
sequence) in GenBank/EMBL/DDBJ that does not change when
record is updated.
Nucleotide gi: Geninfo identifier (gi), a unique integer
specific for GenBank which will change every time the
sequence changes.
VERSION: System started in 1999 for GenBank/EMBL/DDBJ where
the accession and version play the same function as the
accession and gi number. Format: accession.version
PID: Protein Identifier: g, e or d prefix to gi number.
Can have one or two on one CDS (coding sequence).
Protein gi: Geninfo identifier (gi), a GenBank unique
integer which will change every time the sequence changes.
protein_id: Identifier which has the same structure and
function as the nucleotide Accession with version numbers.
LOCUS, Accession, NID, gi and PID

LOCUS       HSU40282     1789 bp    mRNA            PRI       21-MAY-1998
DEFINITION  Homo sapiens integrin-linked kinase (ILK) mRNA, complete cds.
ACCESSION   U40282
VERSION     U40282.1  GI:3150001

LOCUS:       HSU40282
ACCESSION:   U40282
VERSION:     U40282.1
GI:          3150001
PID:         g3150002
Protein gi:  3150002
protein_id:  AAC16892.1

     CDS             157..1515
                     /gene="ILK"
                     /note="protein serine/threonine kinase"
                     /codon_start=1
                     /product="integrin-linked kinase"
                     /protein_id="AAC16892.1"
                     /db_xref="PID:g3150002"
                     /db_xref="GI:3150002"
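The accession.version format described above is easy to take apart programmatically. This is a minimal sketch using the ILK record's identifiers; the function names `split_accession_version` and `parse_version_line` are hypothetical, and the code assumes the older VERSION layout that still carries a GI number.

```python
def split_accession_version(identifier):
    """Split 'U40282.1' into ('U40282', 1)."""
    accession, version = identifier.rsplit(".", 1)
    return accession, int(version)


def parse_version_line(value):
    """Parse the value of a GenBank VERSION line such as 'U40282.1 GI:3150001'."""
    accession_version, gi_field = value.split()
    accession, version = split_accession_version(accession_version)
    gi = int(gi_field.split(":", 1)[1])
    return {"accession": accession, "version": version, "gi": gi}


nucleotide = parse_version_line("U40282.1 GI:3150001")
protein_accession, protein_version = split_accession_version("AAC16892.1")

print(nucleotide)
print(protein_accession, protein_version)
```

The accession part stays stable across updates, while the version suffix (like the gi) changes whenever the sequence itself changes.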
Which of these would you use to cite
a sequence in a paper?
Can you think of situations where
you would use one over another?
Which of these would you use to cite a sequence?
When would you use one over another?
LOCUS: Unique string of 10 letters and numbers in
the database. Not maintained amongst databases.
ACCESSION: A unique identifier to that record (particular
sequence) in GenBank/EMBL/DDBJ that does not change when
record is updated.
Nucleotide gi: Geninfo identifier (gi), a unique integer
specific for GenBank which will change every time the
sequence changes. (and can disappear!)
VERSION: System started in 1999 for GenBank/EMBL/DDBJ where
the accession and version play the same function as the
accession and gi number. Format: accession.version
PID: Protein Identifier: g, e or d prefix to gi number.
Can have one or two on one CDS (coding sequence).
Protein gi: Geninfo identifier (gi), a GenBank unique
integer which will change every time the sequence changes.
protein_id: Identifier which has the same structure and
function as the nucleotide Accession with version numbers.
Tomato graphics from www.rottentomatoes.com
Briefly…Examples of Functional Divisions
PAT  Patent
EST  Expressed Sequence Tags
STS  Sequence Tagged Site
GSS  Genome Survey Sequence
HTG  High Throughput Genome (unfinished)
HTC  High throughput cDNA (unfinished)
GenBank overview:
http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=handbook&part=ch1
Other Sequence (& related) File Formats
• Historically, a number of other sequence and annotation file formats have been proposed, see:
http://emboss.sourceforge.net/docs/themes/SequenceFormats.html
• The demands of representing NGS data have given rise to additional file formats and data compression standards, some of which you will encounter in this course. The next few slides will present an overview of a few of these emergent NGS formats and standards. See:
http://www.broadinstitute.org/software/igv/FileFormats
http://www.broadinstitute.org/software/igv/RecommendedFileFormats
http://genome.ucsc.edu/FAQ/FAQformat
Other Sequence (& Annotation) File Formats
FASTQ – FASTA with quality data
2bit – compressed DNA sequence format
SAM/BAM – Sequence Alignment/Map
GFF/GTF – General Feature Format
BED/WIG – annotation track data formats
FASTQ
• FASTQ – FASTA “with an attitude” (embedded quality scores). Originally developed at the Sanger Institute to couple (Phred) quality data with sequence, it is now commonly used for raw read output data from NGS machines.

@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+EAS54_6_R1_2_1_443_348
*-+*''))**55CCF>>>>>>CCCC

• Various flavors:
– fastq-sanger
– fastq-illumina
– fastq-solexa
These differ in the format of the sequence identifier and in the valid range of quality scores. See:
http://en.wikipedia.org/wiki/FASTQ_format
http://maq.sourceforge.net/fastq.shtml
http://nar.oxfordjournals.org/content/early/2009/12/16/nar.gkp1137.full
“…the Sanger version of the FASTQ format has found the broadest acceptance, supported by many assembly and read mapping tools …Therefore, most users will do this conversion very early in their workflows…”
http://hannonlab.cshl.edu/fastx_toolkit/
(Linux, MacOSX or Unix only)
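To make the "embedded quality scores" concrete, here is a minimal sketch that reads the example record above and decodes its quality string. It assumes the fastq-sanger flavor (Phred scores stored as ASCII code minus 33) and exactly four lines per record; the function names `parse_fastq` and `phred_scores` are hypothetical.

```python
def parse_fastq(text):
    """Parse four-line FASTQ records into dictionaries."""
    lines = [line.strip() for line in text.strip().splitlines()]
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per record")
    records = []
    for i in range(0, len(lines), 4):
        title, sequence, separator, quality = lines[i:i + 4]
        if not title.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        records.append({"name": title[1:], "sequence": sequence, "quality": quality})
    return records


def phred_scores(quality, offset=33):
    """Decode a quality string; offset 33 is the fastq-sanger convention."""
    return [ord(char) - offset for char in quality]


example = """@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+EAS54_6_R1_2_1_443_348
*-+*''))**55CCF>>>>>>CCCC
"""

record = parse_fastq(example)[0]
scores = phred_scores(record["quality"])
print(record["name"], scores)
```

Other flavors use a different offset or score scale, which is why a file's flavor should be known before its quality values are interpreted.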
2bit File Format
Highly compressed sequence file stores
multiple DNA sequences (up to 4 Gb total) in a
compact randomly-accessible format. The file
contains masking information as well as the
DNA itself.
http://genome.ucsc.edu/FAQ/FAQformat#format7
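The compression comes from storing each base in two bits, so four bases fit in one byte. The sketch below illustrates only that packing idea, using the base codes given in the UCSC format description (T=00, C=01, A=10, G=11, first base in the most significant bits). It is not a reader or writer for real .2bit files, which also carry a header, an index, and N and mask blocks; the function names `pack` and `unpack` are hypothetical.

```python
CODES = {"T": 0b00, "C": 0b01, "A": 0b10, "G": 0b11}
BASES = "TCAG"  # index in this string equals the two-bit code


def pack(sequence):
    """Pack an unambiguous DNA string into bytes, four bases per byte."""
    packed = bytearray()
    for i in range(0, len(sequence), 4):
        chunk = sequence[i:i + 4].upper()
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        byte <<= 2 * (4 - len(chunk))  # pad a short final chunk with zero bits
        packed.append(byte)
    return bytes(packed)


def unpack(packed, length):
    """Recover the first `length` bases from packed bytes."""
    bases = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 0b11])
    return "".join(bases[:length])


read = "GTTGCTTCTGGCGTGGGTGGGGGGG"
packed = pack(read)
print(len(read), "bases ->", len(packed), "bytes")
```

The sequence length has to be stored separately, because padding in the last byte is indistinguishable from real T bases.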
SAM/BAM
• SAM – a tab-delimited text file that contains a compact and indexable representation of nucleotide sequence alignments
http://samtools.sourceforge.net/SAM1.pdf
http://samtools.sourceforge.net/
• BAM – binary version of SAM (preferred by IGV)
• I/O format of several NGS tools, see:
http://samtools.sourceforge.net/swlist.shtml
• See also:
Li H.*, Handsaker B.*, Wysoker A., Fennell T., Ruan J., Homer N., Marth
G., Abecasis G., Durbin R. and 1000 Genome Project Data Processing
Subgroup (2009) The Sequence alignment/map (SAM) format and
SAMtools. Bioinformatics, 25, 2078-9.
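Since SAM is tab-delimited, one alignment line can be split into the eleven mandatory fields defined by the SAM specification. The following is a minimal sketch: the alignment line itself is made up for illustration (read name, reference name, position and tag are all hypothetical), and the function names are not from any library.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line):
    """Split one SAM alignment line into its mandatory fields plus optional tags."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(SAM_FIELDS):
        raise ValueError("a SAM alignment line needs at least 11 columns")
    record = {}
    for name, value in zip(SAM_FIELDS, columns):
        record[name] = int(value) if name in INT_FIELDS else value
    record["TAGS"] = columns[len(SAM_FIELDS):]
    return record


def is_unmapped(record):
    """FLAG bit 0x4 marks an unmapped read."""
    return bool(record["FLAG"] & 0x4)


def is_reverse_strand(record):
    """FLAG bit 0x10 marks a read aligned to the reverse strand."""
    return bool(record["FLAG"] & 0x10)


# Hypothetical alignment line, for illustration only.
line = "read001\t16\tchr1\t100\t60\t25M\t*\t0\t0\tGTTGCTTCTGGCGTGGGTGGGGGGG\t*\tNM:i:0"
alignment = parse_sam_line(line)
print(alignment["RNAME"], alignment["POS"], alignment["CIGAR"],
      is_reverse_strand(alignment))
```

For real data, tools such as SAMtools handle BAM compression and indexing, which this sketch does not attempt.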
Gene/General/Generic Feature Formats (GFF)
A General Feature Format (GFF) file is a
relatively simple tab-delimited text file for
describing genomic features. Many genome
browsers – gbrowse, IGV, etc. - take GFF as
input for annotation data
There are several slightly but significantly
different GFF file formats (GFF, GFF2, GFF3,
GTF). The current primary standard is GFF3:
http://www.sequenceontology.org/gff3.shtml
Excerpt of a GFF File
##gff-version 3
##sequence-region ctg123 1 1497228
ctg123 . gene 1000 9000 . + . ID=gene00001;Name=EDEN
ctg123 . mRNA 1050 9000 . + . ID=mRNA00001;Parent=gene00001;Name=EDEN.1
ctg123 . exon 1300 1500 . + . ID=exon00001;Parent=mRNA00003
ctg123 . CDS 1201 1500 . + 0 ID=cds00001;Parent=mRNA00001;Name=edenprotein.1
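A GFF3 feature line has nine tab-separated columns, the last of which holds semicolon-separated key=value attributes. As a minimal sketch, the CDS line from the excerpt above can be parsed as follows; the function name `parse_gff3_line` is hypothetical, the tabs are written explicitly because the slide shows them as spaces, and escaping rules for special characters are ignored.

```python
GFF3_COLUMNS = ["seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes"]


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dictionary."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(GFF3_COLUMNS):
        raise ValueError("expected 9 tab-separated columns")
    feature = dict(zip(GFF3_COLUMNS, values))
    feature["start"] = int(feature["start"])
    feature["end"] = int(feature["end"])
    # Column 9: 'ID=cds00001;Parent=mRNA00001;...' becomes a dictionary.
    feature["attributes"] = dict(
        pair.split("=", 1) for pair in feature["attributes"].split(";") if pair
    )
    return feature


cds = parse_gff3_line(
    "ctg123\t.\tCDS\t1201\t1500\t.\t+\t0\t"
    "ID=cds00001;Parent=mRNA00001;Name=edenprotein.1"
)
# GFF coordinates are 1-based and inclusive, hence the +1.
cds_length = cds["end"] - cds["start"] + 1
print(cds["type"], cds["attributes"]["Parent"], cds_length)
```

The Parent attribute is what ties exons and CDS segments to their mRNA, and mRNAs to their gene.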
BED File Format
BED format provides a flexible way to define
the data lines that are displayed in an
annotation track in a genome browser.
http://genome.ucsc.edu/FAQ/FAQformat#format1
If your data set is BED-like, but it is very large
and you would like to keep it on your own
server, you should use the bigBed data
format.
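One practical detail when mixing BED with GFF: per the UCSC format description, BED start positions are 0-based and the end is exclusive, whereas GFF uses 1-based inclusive coordinates. The sketch below converts the three required BED fields; the example line is hypothetical and the function name `bed_to_one_based` is not from any library.

```python
def bed_to_one_based(line):
    """Convert the first three BED fields to 1-based, inclusive coordinates."""
    chrom, start, end = line.split()[:3]
    # BED: start is 0-based, end is exclusive, so only the start shifts by one.
    return chrom, int(start) + 1, int(end)


def bed_length(line):
    """Feature length in bases; with half-open coordinates it is simply end - start."""
    _, start, end = line.split()[:3]
    return int(end) - int(start)


# Hypothetical BED line with only the three required fields.
bed_line = "chr7\t127471196\t127472363"
interval = bed_to_one_based(bed_line)
length = bed_length(bed_line)
print(interval, length)
```

Forgetting this shift is a common source of off-by-one errors when annotation tracks are converted between formats.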
WIGgle format
The Wiggle format is for display of dense,
continuous data such as GC percent, probability
scores, and transcriptome data.
http://genome.ucsc.edu/goldenPath/help/wiggle.html
If you need to display continuous data that is
sparse or contains elements of varying size, use
the BedGraph format instead.
If you have a very large data set and you would
like to keep it on your own server, you should
use the bigWig data format
EMBOSS Sequence Analysis Suite
emboss.sourceforge.net
Open Bioinformatics Foundation
bioperl / biojava / biopython / bioruby / biosql etc.
www.open-bio.org
Sequence Databases: “Roll your Own”?
GMOD BioSQL: a lightweight database
schema for storing and retrieving (annotated)
sequence records using OpenBio software
tools.
GMOD “Chado”: a more complex database
schema for storing sequence data, genome
feature annotation and a host of other related
biological data (initially inspired by Drosophila
genome annotation and genetics; supported
by many GMOD software tools)
Retrieving Sequence Information:
Using integrated database
resources such as Entrez
What you may be looking for:
• Heard on CBC about a disease gene that was recently discovered, and you want to know more about it.
• Want to build a dataset of DNA sequences upstream of a set of co-expressed genes, to identify common regulatory element sequences
• Evolutionary, functional, structural analyses, etc…
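For dataset-building tasks like these, Entrez can also be queried programmatically through NCBI's E-utilities web service, which is not covered in these slides and is introduced here only as an illustration. The minimal sketch below just builds an efetch request URL for a nucleotide record and does not contact the network; the function name `efetch_url` is hypothetical, and NCBI's usage guidelines (rate limits, identifying your tool) should be checked before sending real requests.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def efetch_url(accession, db="nuccore", rettype="fasta"):
    """Build an efetch URL that would return one record as plain text."""
    query = urlencode({
        "db": db,            # Entrez database to query
        "id": accession,     # accession.version (or another accepted identifier)
        "rettype": rettype,  # e.g. 'fasta' or 'gb' for a GenBank flat file
        "retmode": "text",
    })
    return EUTILS_BASE + "efetch.fcgi?" + query


url = efetch_url("AF115338.1")
print(url)
```

Looping this over a list of accessions is one way to assemble a sequence dataset without clicking through the web interface.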
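Entrez: Initial version of this “Pathway to Discovery”
[Diagram: three linked data sets and the relationships used to connect them:
• MEDLINE abstracts (neighbors found by term frequency statistics)
• Nucleotide sequences (neighbors found by nucleotide sequence similarity)
• Protein sequences (neighbors found by amino acid sequence similarity)
• MEDLINE is linked to both sequence sets through literature citations in sequence databases
• Nucleotide and protein sequences are linked through coding region features]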
PubMed Text Neighboring
Example of two neighboring titles: “Genetic Analysis of Cancer in Families” and “The Genetic Predisposition to Cancer”
• Common terms could indicate similar subject matter
• Statistical method
• Weights based on term frequencies within document and within the database as a whole
• Some terms are better than others
Entrez began to integrate more data…
[Diagram: Entrez links among the following data types: MEDLINE and PubMed (online journals, full text), expression data, GenBank nucleotide sequences (via accession numbers), SNP data, genomes (via accession numbers and map position), protein sequences, and structures (MMDB, VAST; structure:function)]
Entrez
Entrez Help
http://www.ncbi.nlm.nih.gov/books/NBK3837/
Also check out What’s New
http://www.ncbi.nlm.nih.gov/books/NBK1969/
Or follow @NCBI on Twitter to keep up with new features as they are added (like the recently released Database of Genomic Structural Variation)
SFU’s Cenk Sahinalp is an international leader in structural variation bioinformatics research
BLink
Other Sequence Databases and
Sequence Data Visualization
The Ensembl Genomes Database:
Focuses on humans and select vertebrates
(but a plant version is also available…)
www.ensembl.org
What is Ensembl?
• Publicly available, automated annotation of selected eukaryotic genomes (initially with mammalian focus)
• Open source software (but slightly complicated to set up…)
• Multiple different ways to access data, including programmatic (Perl API)
• Provides access to additional data from other groups (distributed annotation system or DAS)
ENSEMBL – Region in
Detail
Check out the “Printable mini-course” at
http://uswest.ensembl.org/info/website/tutorials/index.html
Generic Model Organism Database (GMOD)
Project
www.gmod.org
BioMart
(Ensmart)
A powerful
querying
system
(later: we’ll learn
about Ensembl’s
Perl API)
Distributed Annotation System (DAS)
• Allows third-party annotation
• Users choose the annotation they are interested in
• Good for specialized feature annotation or for comparison of different methodologies
• Allows you to view different data in a consistent user interface/display
Open source display focused on eukaryotes
• Ensembl
Open source display for any dataset
• Gbrowse
Gbrowse: Another genome data viewer with DAS
• Gene track
• Protein track
• Metabolic pathways track
• Regulons track
• 3D structures track
• Intergenic sequences track
• Terminators track
• DNA sequence track
• Translation track
http://gmod.org/wiki/GBrowse
Gbrowse is used to display genomic data for many projects
• Mouse, Rat, Fly, C. elegans and other animals
• Rice and a number of other plants
• S. cerevisiae and other yeasts
• A number of unicellular eukaryotes
• Many, many prokaryotes
• Other types of data: HapMap, Segmental Duplications, RNA-seq data-specific or other type-specific data
• ** Open source package ** (slightly simpler to set up than Ensembl)
Entrez, Ensembl, Gbrowse: What’s the difference?
• Entrez
– Search and retrieval system for major databases, including PubMed, Sequences (including genomes), Structures, Taxonomy, etc.
– NCBI (Maryland, USA) centrally hosts Entrez and decides what to host and maintain
– Not open source
• Ensembl
– Automated annotation of selected eukaryotic genomes
– EMBL-EBI and the Sanger Institute (Cambridge/Hinxton, UK) centrally host most resources and decide what data to host and maintain
– Open source; you can obtain a local copy and access other DAS data
• Gbrowse
– Genome/genomic data viewer
– Very decentralized – anyone can set it up and publicly display any data
– Open source; you can set up a local copy and access other DAS data
Entrez, Ensembl, Gbrowse: Benefits/Disadvantages of each?
• Entrez
– Reputable institution – trust in the data
– Maintained by a well-established group with a lot of capital
– Perceived greater consistency
– Limited to what they make available
– They make the call on how to display it, analyze it, and classify it
– Some of the analyses are definitely a black box
• Ensembl
– Open source – can see how the data is analyzed/processed – NOT necessarily an issue with lower quality data – a lot of eyes are watching you (wooahh haa haa…)
– Reputable institution – trust in the data
• Gbrowse
– Easy to use and set up
– Open source – can see how the data is analyzed/processed
– Anybody can release their data to the world
– Anybody can analyze the data in any way they want and release it to the world
Local Visualization of NGS Data
http://www.broadinstitute.org/igv/
How do I update or correct errors in the Databases?
• Example: For gene names, citations, new protein names, sequencing errors in GenBank…
[email protected]
• But most people don’t bother to correct things that they notice are wrong…
• Hence an increased need for more focused community-based projects
Community Assisted Curation of Subsets of Datasets
• Core curators continually update annotation of a data subset (e.g. a genome)
– Literature review
– Input from the community
• Updates sent in batches to centralized databases -> additional review -> becomes, for example, an NCBI RefSeq
• Examples: WormBase.org, Pseudomonas.com
Ethical issues with bioinformatics databases
• How public and/or open source should biomolecular data be?
• How much should researchers be forced to release data as soon as possible?
• How much analysis of a genome can a researcher publish before the genome sequence is published?
• How do we best organize the data?
– BIG issue! For example, biomolecular pathway classifications can bias analyses of which pathways are found to be upregulated or downregulated by gene expression analysis
Resources
http://www.ncbi.nlm.nih.gov/
http://www.ebi.ac.uk/
http://www.expasy.ch/
http://www.ensembl.org/
http://www.rcsb.org/pdb/
http://www.pseudomonas.com/
http://www.wormbase.org/
http://biodas.org/
http://nar.oupjournals.org/
http://www.gmod.org/
http://www.broadinstitute.org/igv/