Transcript Slide 1

Genomes with Ensembl
Dr. Giulietta M. Spudich
European Bioinformatics Institute
Hinxton, UK
1 of 34
Today
Introduction to the Ensembl project
Walk-through of the browser
BioMart
Variation
Comparative Genomics
2 of 34
Introduction to Ensembl
Why do we have genome browsers?
Why Ensembl?
Ensembl genes and genomes
Help and tutorials
3 of 34
Genome browsers provide a map
DNase I sensitive site
Histone modification
Gene
Conserved sequence
Allele
Figure adapted from the ENCODE project
www.nature.com/nature/focus/encode/
4 of 34
Genome Browsers
• Ensembl Genome browser
http://www.ensembl.org
• NCBI Map Viewer
http://www.ncbi.nlm.nih.gov/mapview/
• UCSC Genome Browser
http://genome.ucsc.edu
5 of 34
What Distinguishes Ensembl from
the UCSC and NCBI Browsers?
• The gene set. Automatic annotation
based on mRNA and protein information.
• Programmatic access via the Perl API
(open source)
• BioMart
• Integration with other databases (DAS)
• Comparative analysis (gene trees)
6 of 34
Subjects
Why do we have genome browsers?
Why Ensembl?
How can we extract data from Ensembl?
Where can I find help?
7 of 34
To meet a challenge…
Ensembl’s AIM: To provide annotation for the biological
community that is freely available and of high quality
• Started in 2000
• Joint project between EBI and Sanger
• Funded primarily by the Wellcome Trust,
additional funding by EMBL, NIH-NIAID, EU,
BBSRC and MRC
8 of 34
Vertebrates are available
Extension to other genomes:
Plants, Microorganisms,…
www.ensemblgenomes.org
Non-chordates:
D. melanogaster
C. elegans
S. cerevisiae
9 of 34
Ensembl Genomes: Extending Ensembl across the taxonomic space
[Figure: tree of life spanning Archaea, Bacteria and Eukaryota, labelled with the groups covered]
• 48 Chordates including: Human, Mouse, Zebrafish, Chicken, Chimpanzee, Pig, Platypus
• 3 Plasmodia: P. falciparum, P. knowlesi, P. vivax
• Arabidopsis thaliana, Arabidopsis lyrata, Oryza sativa
• Drosophila, Caenorhabditis, Anopheles gambiae
• 6 yeast, including S. cerevisiae and S. pombe
• Aspergillus species
• Bacterial and prokaryotic clades
F. D. Ciccarelli, T. Doerks, C. von Mering, C. J. Creevey, B. Snel & P. Bork.
Towards automatic reconstruction of a highly resolved tree of life. Science, 3 March 2006.
Slide design by Jeff Almeida-King
10 of 34
Exploring genomes
• Vertebrates focus: www.ensembl.org
• Other species: www.ensemblgenomes.org
11 of 34
Subjects
Why do we have genome browsers?
Why Ensembl?
Ensembl (vertebrate) genes & genomes
Help and tutorials
12 of 34
What is known?
Genomic assemblies from sequencing
consortia
13 of 34
What is known?
Proteins and cDNA/mRNA sequences from
the research community found in:
• UniProt/Swiss-Prot (manually curated)
• UniProt/TrEMBL
www.uniprot.org
• NCBI RefSeq (manually curated)
www.ncbi.nlm.nih.gov/RefSeq
14 of 34
Combining genes and genomes
…tgcctgttag...
Exon
Untranslated+Coding
Exon
Coding
Exon
Untranslated
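The exon labels on this slide can be reproduced with a small sketch. It assumes a forward-strand transcript with 1-based inclusive coordinates; the function name and the coordinates are hypothetical and are not taken from Ensembl.

```python
def classify_exons(exons, cds_start, cds_end):
    """Label each exon by whether it overlaps the coding region (CDS).

    exons: list of (start, end) tuples, forward strand, inclusive.
    cds_start, cds_end: first and last coding base (hypothetical values).
    """
    labels = []
    for start, end in exons:
        # The exon contains coding sequence if it overlaps the CDS at all.
        has_coding = start <= cds_end and end >= cds_start
        # The exon contains untranslated sequence if it extends beyond the CDS.
        has_utr = start < cds_start or end > cds_end
        if has_coding and has_utr:
            labels.append("Untranslated+Coding")
        elif has_coding:
            labels.append("Coding")
        else:
            labels.append("Untranslated")
    return labels


# Illustrative coordinates only: three exons, CDS from base 150 to base 400.
example_labels = classify_exons([(100, 200), (300, 400), (500, 600)], 150, 400)
print(example_labels)
```

With these made-up coordinates the three exons receive the same labels as in the diagram: the first is part untranslated and part coding, the second is fully coding, and the third is fully untranslated.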
15 of 34
Too many pieces…
Genome
Aligned cDNA
and protein
Exon
Untranslated+Coding
Exon
Coding
Exon
Untranslated
16 of 34
Ensembl shows one transcript
with underlying evidence
17 of 34
Ensembl and VEGA/Havana
• Ensembl: automatic annotation pipeline, gene building all at once (whole genome)
• Havana: manual curation on a case-by-case basis
VEGA: Vertebrate Genome Annotation
18 of 34
HAVANA
http://www.sanger.ac.uk/HGP/havana/
19 of 34
Genes and Transcripts in Ensembl
• Ensembl known transcripts
• Ensembl novel transcripts
• Ensembl merged transcripts (Havana)
• EST clusters
• More manual curation (SGD,
WormBase, FlyBase)
20 of 34
Ensembl/Havana
• Transcripts are labelled:
Ensembl
Havana
Ensembl/Havana merge
21 of 34
Names in Ensembl
• ENSG### Ensembl Gene ID
• ENST### Ensembl Transcript ID
• ENSP### Ensembl Peptide ID
• ENSE### Ensembl Exon ID
• For species other than human, a species code is inserted after the ENS prefix:
MUS (Mus musculus) for mouse: ENSMUSG###
DAR (Danio rerio) for zebrafish: ENSDARG###, etc.
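The naming scheme can be illustrated with a minimal parsing sketch. The regular expression, the function name and the small species table are assumptions made for illustration; only the MUS and DAR codes come from the slide, and the mouse and zebrafish ID numbers below are made up.

```python
import re

# ENS, an optional species code, one feature letter, then digits.
STABLE_ID = re.compile(r"^ENS([A-Z]*)([GTPE])(\d+)$")

FEATURES = {"G": "gene", "T": "transcript", "P": "peptide", "E": "exon"}

# Only the codes named on the slide; an empty code means human.
SPECIES = {"": "human", "MUS": "mouse", "DAR": "zebrafish"}


def parse_stable_id(stable_id):
    """Split an Ensembl-style stable ID into species, feature type and number."""
    match = STABLE_ID.match(stable_id)
    if match is None:
        raise ValueError(f"not an Ensembl-style stable ID: {stable_id!r}")
    code, letter, number = match.groups()
    return {
        # Unknown codes are returned unchanged rather than guessed.
        "species": SPECIES.get(code, code),
        "feature": FEATURES[letter],
        "number": number,
    }


human_gene = parse_stable_id("ENSG00000139618")
mouse_transcript = parse_stable_id("ENSMUST00000000001")  # made-up number
zebrafish_gene = parse_stable_id("ENSDARG00000000001")    # made-up number
print(human_gene, mouse_transcript, zebrafish_gene)
```

The species code is matched greedily and the regular expression then backtracks by one letter, so the last capital letter before the digits is always read as the feature type.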
22 of 34
Low-coverage genomes
• High-coverage sequencing is time-consuming and expensive
– BAC clones (>10x): Human, Mouse, Zebrafish
– Whole Genome Shotgun (6x): Chimp, Rat,
Chicken,...
• Low (~2x) coverage genome sequencing
– Faster, cheaper, but only useful when annotated
• Assembled into lots of “scaffolds”
• “Classic” Ensembl gene-build would
result in many partial and fragmented
genes
23 of 34
Some 2X genomes
24 of 34
Low-Coverage Gene-Build
• Whole Genome Alignment to an
annotated high-quality reference
genome
• Guided re-ordering of scaffolds
• Annotation of longer, more complete
gene structures
25 of 34
2X Genebuild
Human gene
Human genome
Cat scaffold 1
NNNNNN
Cat scaffold 2
Human or dog gene (projected)
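The idea of guided scaffold re-ordering can be shown with a toy sketch. It assumes each scaffold already has a single alignment start position on the reference genome; the function name, scaffold names, sequences and positions are all hypothetical, and this is not the Ensembl gene-build pipeline itself.

```python
def order_scaffolds(scaffolds, gap="NNNNNN"):
    """Order scaffolds by where they align on the reference and join them.

    scaffolds: list of (name, sequence, reference_start) tuples, where
    reference_start is the assumed alignment position on the reference.
    gap: padding placed between neighbouring scaffolds.
    """
    ordered = sorted(scaffolds, key=lambda scaffold: scaffold[2])
    names = [name for name, _, _ in ordered]
    joined = gap.join(sequence for _, sequence, _ in ordered)
    return names, joined


# Made-up scaffolds, given out of order on purpose.
scaffold_names, joined_sequence = order_scaffolds([
    ("cat_scaffold_2", "GGCC", 5000),
    ("cat_scaffold_1", "ATAT", 1200),
])
print(scaffold_names, joined_sequence)
```

Once the scaffolds are placed in reference order with N padding between them, a gene projected from the reference can span both scaffolds as one longer structure instead of two fragments.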
26 of 34
What other annotation?
• Non-coding (nc)RNAs
• IDs in other databases
• Microarray probes, clonesets, BAC maps
• Other features of the genome: repeats, CpG islands
• Comparative data:
orthologues and paralogues, protein families, whole
genome alignments, syntenic regions
• Variation data:
SNPs, InDels
• Regulatory data (a first guess at promoter and
enhancer elements)
• Data from external sources (DAS)
27 of 34
Sources of Variation
NCBI dbSNP
• Import: alleles, flanking sequence, frequencies
• Calculate: position, transcript effect
http://www.ncbi.nlm.nih.gov/SNP/
For human also:
HGVbase
Affy GeneChip 100K and 500K Mapping Array
Affy Genome-Wide SNP array 6.0
Ensembl-called SNPs (from Celera reads and Jim
Watson’s and Craig Venter’s genomes)
For mouse, rat, dog and chicken also:
Sanger- and Ensembl-called SNPs (other strains /
breeds)
STAR Project for rat, other projects
28 of 34
External Sources
Large-scale variations in…
DECIPHER
• Database of Chromosomal Imbalance and
Phenotype in Humans using Ensembl Resources
DGV loci
• Database of Genomic Variants
• CNVs, Inversions, InDels
29 of 34
Subjects
Why do we have genome browsers?
Why Ensembl?
Ensembl genes and genomes
Help and tutorials
30 of 34
How is this information organised?
• Ensembl Views (Website)
• Ensembl Database (open source)
• BioMart ‘DataMining tool’
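As an illustration of data mining with BioMart, the sketch below builds a query document in the XML format that BioMart services accept. The dataset, filter and attribute names (hsapiens_gene_ensembl, chromosome_name, ensembl_gene_id, external_gene_name) are assumptions that may differ between releases, and the helper function is hypothetical. Sending the query to a server is left out on purpose.

```python
import xml.etree.ElementTree as ET


def build_biomart_query(dataset, attributes, filters=None):
    """Return a BioMart-style XML query as a string (nothing is sent)."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",      # tab-separated output
        "header": "0",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    dataset_element = ET.SubElement(
        query, "Dataset", {"name": dataset, "interface": "default"}
    )
    # Filters restrict which rows are returned.
    for name, value in (filters or {}).items():
        ET.SubElement(dataset_element, "Filter", {"name": name, "value": value})
    # Attributes select which columns are returned.
    for name in attributes:
        ET.SubElement(dataset_element, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


# Assumed names: human genes on chromosome 21, with gene ID and gene name.
xml_text = build_biomart_query(
    "hsapiens_gene_ensembl",
    ["ensembl_gene_id", "external_gene_name"],
    {"chromosome_name": "21"},
)
print(xml_text)
```

The same query can be assembled interactively in the BioMart web interface by choosing a dataset, then filters, then attributes.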
31 of 34
Help and Information
• Comments and questions?
[email protected]
• Check out our tutorials page:
www.ensembl.org/info/website/tutorials/index.html
• Videos http://www.youtube.com/user/EnsemblHelpdesk
• Mailing list [email protected]
• Come visit our blog!
http://ensembl.blogspot.com/
• FTP site: ftp://ftp.ensembl.org
• Amazon Web Services: http://aws.amazon.com/publicdatasets
32 of 34
Ensembl Team
Ensembl: Paul Flicek (EBI), Steve Searle (Sanger Institute)
Software: Glenn Proctor, Andreas Kähäri, Stephen Keenan, Rhoda Kinsella, Eugene Kulesha, Ian Longden, Daniel Rios, Iliana Toneva
Comparative Genomics: Javier Herrero, Kathryn Beal, Stephen Fitzgerald, Leo Gordon, Albert Vilella
Functional Genomics: Ian Dunham, Nathan Johnson, Steven Wilder
Variation: Fiona Cunningham, Yuan Chen, Pontus Larrson, Will McLaren
Analysis and Annotation: Bronwen Aken, Julio Banet, Susan Fairley, Jan-Hinnerck Vogel, Simon White, Amonida Zadissa
Web Team: Anne Parker, Eugene Bragin, Bethan Pritchard, Steve Trevanion (VEGA)
Outreach: Xosé M Fernández, Jeff Almeida-King, Bert Overduin, Michael Schuster (QC), Giulietta Spudich
Systems & Support: Guy Coates, James Beal, Gen-Tao Chiang, Peter Clapham, Simon Kelley, Shelley Goddard, Tracy Mumford, Kerry Smith
Research: Benoît Ballester, Damian Keefe, Dace Ruklisa, Petra Catalina Schwalie, Guy Slater
Vertebrate Genomics: Illka Lappalainen, Chao-Kung Chen, Laura Clark, Jonathan Hinton, Vasudev Kumanduri, Edoardo Marcora, Damian Smedley, Richard Smith, Phil Wilkinson, Holly Zheng-Bradley
Ensembl Genomes: Paul Kersey, Paul Derwent, Matthias Haimel, Arnaud Kerhornou, Uma Maheswari, Michael Nuhn, Dan Staines, Andy Yates
VectorBase: Dan Lawson, Gautier Koscielny, Karyn Megy
Zebrafish: Kerstin Howe, Kim Brugger (GRC), Will Chow, Britt Reimholz, James Torrance
Ensembl Strategy: Ewan Birney, Richard Durbin, Tim Hubbard
33 of 34
The Wellcome Trust Genome Campus
34 of 34