Transcript Slide 1

The Ensembl Browser
Dr. Giulietta M. Spudich
European Bioinformatics Institute
1 of 31
Today
Introduction to the Ensembl project and gene set
Walk-through of the browser
Hands-on Browser
BioMart
Lunch
BioMart Hands-on
Comparative Genomics + Hands-on
Variation & Functional Genomics + Hands-on
2 of 31
Course Objectives
• How to browse information about a gene
• How to choose a transcript
• Where to find sequence variations
• How to view multiple alignments
• How to use BioMart
• Where to go for help
3 of 31
Introduction to Ensembl
Why do we have genome browsers?
Why Ensembl?
Ensembl genes and genomes
Where to go for help?
4 of 31
Genome browsers provide a map
DNase I sensitive site
Histone modification
Gene
Conserved sequence
Allele
Figure adapted from the ENCODE project
www.nature.com/nature/focus/encode/
5 of 31
Genome Browsers
• Ensembl Genome browser
http://www.ensembl.org
• NCBI Map Viewer
http://www.ncbi.nlm.nih.gov/mapview/
• UCSC Genome Browser
http://genome.ucsc.edu
6 of 31
Ensembl Features
• The gene set. Automatic annotation based on mRNA
and protein information plus manual annotation
(GENCODE set).
• BioMart (data export tool)
• Comparative analysis (gene trees)
• Variation and functional genomics
• Integration with other databases (DAS)
• Programmatic access via the Perl API (open source)
7 of 31
Subjects
Why do we have genome browsers?
Why Ensembl?
Ensembl genes and genomes
Where to go for help?
8 of 31
To meet a challenge…
Ensembl’s AIM: To provide annotation for the biological
community that is freely available and of high quality
• Started in 2000
• Joint project between EBI and Sanger
• Funded primarily by the Wellcome Trust,
additional funding by EMBL, NIH-NIAID, EU,
BBSRC and MRC
9 of 31
Genome annotation
Genome annotation is the process of
attaching biological information to
sequences. It consists of two main steps:
1. Identifying genes on the genome.
2. Attaching biological information to
genes and the genome. (For example,
effects of sequence variation).
10 of 31
Ensembl Annotates Vertebrate
Genomes
50 species including:
Non-chordates:
D. melanogaster
C. elegans
S. cerevisiae
11 of 31
Extending Ensembl across the taxonomic space
[Figure: tree of life annotated with the groups covered. Labels: 48 chordates including Human, Mouse, Zebrafish, Chicken, Chimpanzee, Pig, Platypus; Plasmodia (falciparum, knowlesi, vivax); Arabidopsis thaliana, Arabidopsis lyrata, Oryza sativa; Drosophila, Caenorhabditis, Anopheles gambiae; yeast (S. cerevisiae, S. pombe); Aspergillus; bacterial and prokaryotic clades]
F. D. Ciccarelli, T. Doerks, C. von Mering, C. J. Creevey, B. Snel & P. Bork.
Towards automatic reconstruction of a highly resolved tree of life. Science, 3 March 2006.
Slide design by Jeff Almeida-King
12 of 31
Exploring genomes
• Vertebrates focus: www.ensembl.org
• Other species: www.ensemblgenomes.org
13 of 31
Subjects
Why do we have genome browsers?
Why Ensembl?
Ensembl genes and genomes
Where to go for help?
14 of 31
What is known?
Genomic assemblies from sequencing
consortia
15 of 31
What is known?
Proteins and cDNA/mRNA sequences from
the research community found in:
• UniProtKB/Swiss-Prot (manually curated)
• UniProtKB/TrEMBL
www.uniprot.org
• NCBI RefSeq (manually curated)
www.ncbi.nlm.nih.gov/RefSeq
Note: See pages 55 and 56 of the course booklet
16 of 31
Combining genes and genomes
…tgcctgttag...
Exon
Untranslated+Coding
Exon
Coding
Exon
Untranslated
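As a rough illustration of the exon labels on this slide, the sketch below models a transcript as a list of exons plus a coding region and labels each exon the way the slide does. The `Exon` class, the function names and all coordinates are invented for illustration (forward strand and 1-based inclusive coordinates are assumed); this is not how Ensembl stores transcripts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Exon:
    start: int  # 1-based, inclusive (assumption for this sketch)
    end: int


def split_exon(exon, cds_start, cds_end):
    """Return (upstream untranslated, coding, downstream untranslated) lengths."""
    length = exon.end - exon.start + 1
    coding = max(0, min(exon.end, cds_end) - max(exon.start, cds_start) + 1)
    upstream = min(length, max(0, cds_start - exon.start))
    downstream = min(length, max(0, exon.end - cds_end))
    return upstream, coding, downstream


def label_exon(exon, cds_start, cds_end):
    """Label an exon as on the slide: Untranslated, Coding, or both."""
    upstream, coding, downstream = split_exon(exon, cds_start, cds_end)
    untranslated = upstream + downstream
    if coding and untranslated:
        return "Untranslated+Coding"
    if coding:
        return "Coding"
    return "Untranslated"


# Hypothetical three-exon transcript; the coding region runs from 150 to 400.
exons = [Exon(100, 200), Exon(300, 400), Exon(420, 500)]
labels = [label_exon(exon, 150, 400) for exon in exons]
print(labels)
```

The three labels line up with the three exons drawn on the slide: the first exon starts before the coding region, the second lies entirely inside it, and the third lies entirely after it.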
17 of 31
Too many pieces…
Genome
Aligned cDNA
and protein
Exon
Untranslated+Coding
Exon
Coding
Exon
Untranslated
18 of 31
Ensembl shows one transcript
with underlying evidence
19 of 31
Ensembl Compared with Swiss-Prot and
NCBI RefSeq sequences
20 of 31
Is there any consensus?
• NCBI RefSeq set ≠ UniProt set
• Ensembl combines these sets
• UCSC has its own gene set
How do we come up with a consensus
gene set between all these?
21 of 31
CCDS
• Reaching a consensus coding
sequence set for human and mouse.
• 19,851 (ENS human),
17,679 (ENS mouse) (*as of Sept 2009)
• If you see a “CCDS ID”, the coding
sequence is agreed upon.
Genome Res. 2009 Jul;19(7):1316-23. Epub 2009 Jun 4
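One way to picture "agreed upon" is as a set intersection: a coding sequence counts as consensus only if two annotation sets place it at identical coordinates. The sketch below is a toy model under that assumption; the tuple layout, the function name and every coordinate are invented, and the real CCDS process applies further checks that are not modelled here.

```python
def consensus_cds(set_a, set_b):
    """Return the coding sequences whose coordinates are identical in both sets."""
    return set(set_a) & set(set_b)


# Each coding sequence is (chromosome, strand, ((start, end), ...)).
# All values are invented for illustration.
set_a = {
    ("1", "+", ((100, 200), (300, 400))),
    ("1", "+", ((900, 950),)),
}
set_b = {
    ("1", "+", ((100, 200), (300, 400))),  # same coordinates as in set_a
    ("1", "+", ((900, 960),)),             # differs at one exon end
}

agreed = consensus_cds(set_a, set_b)
print(agreed)
```

Only the first coding sequence survives, because the second differs at one exon boundary between the two sets.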
22 of 31
VEGA/Havana
• Automatic annotation pipeline (Ensembl): gene
building all at once (whole genome)
• Manual curation (Havana): case-by-case basis
VEGA: Vertebrate Genome Annotation
23 of 31
Genes and Transcripts in Ensembl
High Quality:
• CCDS transcripts
• Ensembl/Havana merged transcripts
24 of 31
Ensembl/Havana
• Transcripts are from:
Ensembl
Havana
Ensembl/Havana merge
25 of 31
Gene Names in Ensembl
• ENSG### Ensembl Gene ID
• ENST### Ensembl Transcript ID
• ENSP### Ensembl Peptide ID
• ENSE### Ensembl Exon ID
• For species other than human, a species code is
inserted after “ENS”:
MUS (Mus musculus) for mouse: ENSMUSG###
DAR (Danio rerio) for zebrafish: ENSDARG###, etc.
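The naming scheme above is regular enough to take apart with a short pattern. The following is a minimal sketch, assuming an ID is "ENS", an optional upper-case species code, one feature letter (G, T, P or E) and a run of digits; the function name is hypothetical, only the species codes shown on the slide are recognised, and the example IDs use placeholder numbers rather than real identifiers.

```python
import re

# Feature letters and species codes as listed on the slide.
FEATURE_TYPES = {"G": "gene", "T": "transcript", "P": "peptide", "E": "exon"}
SPECIES_CODES = {"": "human", "MUS": "mouse", "DAR": "zebrafish"}

# Assumed layout: ENS + optional species code + feature letter + digits.
STABLE_ID = re.compile(r"^ENS([A-Z]*)([GTPE])(\d+)$")


def parse_stable_id(stable_id):
    """Split an Ensembl-style ID into species, feature type and number."""
    match = STABLE_ID.match(stable_id)
    if match is None:
        raise ValueError(f"not an Ensembl-style ID: {stable_id!r}")
    code, letter, number = match.groups()
    return {
        "species": SPECIES_CODES.get(code, f"unknown code {code}"),
        "feature": FEATURE_TYPES[letter],
        "number": number,
    }


# Placeholder IDs, not real identifiers.
for example in ("ENSG00000000001", "ENSMUSG00000000001", "ENSDART00000000001"):
    print(example, parse_stable_id(example))
```

Because the feature letter is always the last letter before the digits, the pattern stays unambiguous even when a species code itself ends in G, T, P or E.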
26 of 31
How is all this information
organised?
• Ensembl Views (Website)
• Ensembl Database (open source)
• BioMart ‘DataMining tool’
27 of 31
What other annotation?
• Non-coding (nc)RNAs
• IDs in other databases
• Microarray probes, clonesets, BAC maps
• Other features of the genome:
repeats, CpG islands
• Homologs and whole genome alignments:
orthologues and paralogues, protein families, syntenic
regions
• Variation data:
Single Nucleotide Polymorphisms, InDels, CNVs
• Regulatory data (a first guess at promoter and
enhancer elements)
• Data from external sources (DAS)
28 of 31
Subjects
Why do we have genome browsers?
Why Ensembl?
Ensembl genes and genomes
Where to go for help?
29 of 31
Help and Information
• Comments and questions?
[email protected]
• Check out our tutorials page:
www.ensembl.org/info/website/tutorials/index.html
• Videos http://www.youtube.com/user/EnsemblHelpdesk
• Mailing list [email protected]
• Come visit our blog!
http://ensembl.blogspot.com/
• FTP site: ftp://ftp.ensembl.org
• Amazon Web Services: http://aws.amazon.com/publicdatasets
30 of 31
Ensembl Team
Ensembl: Paul Flicek (EBI), Steve Searle (Sanger Institute)
Software: Glenn Proctor, Andreas Kähäri, Stephen Keenan, Rhoda Kinsella, Eugene Kulesha, Ian Longden, Iliana Toneva, Jorge Zamora
Comparative Genomics: Javier Herrero, Kathryn Beal, Stephen Fitzgerald, Leo Gordon
Functional Genomics: Ian Dunham, Nathan Johnson, Daniel Sobral, Steven Wilder
Variation: Fiona Cunningham, Pontus Larsson, Will McLaren, Graham Ritchie
Analysis and Annotation: Jan-Hinnerck Vogel, Bronwen Aken, Susan Fairley, Thibaut Hourlier, Magali Ruffier, Simon White, Amy Tang, Amonida Zadissa
Web Team: Anne Parker, Ridwan Amode, Simon Brent, Maurice Hendrix, Bethan Pritchard, Steve Trevanion (VEGA)
Outreach: Xosé M Fernández, Jeff Almeida-King, Bert Overduin, Michael Schuster (QC), Giulietta Spudich, Jana Vandrovcova
Systems & Support: Guy Coates, James Beal, Gen-Tao Chiang, Peter Clapham, Simon Kelley, Shelley Goddard, Tracy Mumford, Kerry Smith
Research: Benoît Ballester, Petra Catalina Schwalie, André Faure, Markus Fritz, Damian Keefe, Alison Meynert, Dace Ruklisa, Mikhail Spivakov, David Thybert, Sander Timmer, Albert Vilella
Vertebrate Genomics: Chao-Kung Chen, Laura Clarke, Jonathan Hinton, Zam Iqbal, Vasudev Kumanduri, Ilkka Lappalainen, Edoardo Marcora, Pablo Marín, Damian Smedley, Richard Smith, Phil Wilkinson, Holly Zheng-Bradley
Ensembl Genomes: Paul Kersey, Paul Derwent, Matthias Haimel, Alan Horne, Arnaud Kerhornou, Uma Maheswari, Michael Nuhn, Dan Staines, Andy Yates
VectorBase: Dan Lawson, Gautier Koscielny, Karyn Megy
Zebrafish: Kerstin Howe, Kim Brugger, Will Chow, Britt Reimholz, James Torrance
Ensembl Strategy: Ewan Birney, Richard Durbin, Tim Hubbard
Ensembl’s 10th Year Nucleic Acids Res. 2010
http://www.ncbi.nlm.nih.gov/pubmed/19906699
31 of 31