Transcript Slide 1
Genomes with Ensembl
Dr. Giulietta M. Spudich
European Bioinformatics Institute, Hinxton, UK
1 of 34

Today
• Introduction to the Ensembl project
• Walk-through of the browser
• BioMart
• Variation
• Comparative Genomics
2 of 34

Introduction to Ensembl
• Why do we have genome browsers?
• Why Ensembl?
• Ensembl genes and genomes
• Help and tutorials
3 of 34

Genome browsers provide a map
[Figure labels: DNase I sensitive site; Histone modification; Gene; Conserved sequence; Allele]
Figure adapted from the ENCODE project www.nature.com/nature/focus/encode/
4 of 34

Genome Browsers
• Ensembl Genome browser http://www.ensembl.org
• NCBI Map Viewer http://www.ncbi.nlm.nih.gov/mapview/
• UCSC Genome Browser http://genome.ucsc.edu
5 of 34

What Distinguishes Ensembl from the UCSC and NCBI Browsers?
• The gene set. Automatic annotation based on mRNA and protein information.
• Programmatic access via the Perl API (open source)
• BioMart
• Integration with other databases (DAS)
• Comparative analysis (gene trees)
6 of 34

Subjects
• Why do we have genome browsers?
• Why Ensembl?
• How can we extract data from Ensembl?
• Where can I find help?
7 of 34

To meet a challenge…
Ensembl’s AIM: To provide annotation for the biological community that is freely available and of high quality
• Started in 2000
• Joint project between EBI and Sanger
• Funded primarily by the Wellcome Trust, additional funding by EMBL, NIH-NIAID, EU, BBSRC and MRC
8 of 34

Vertebrates are available
Extension to other genomes: Plants, Microorganisms,… www.ensemblgenomes.org
Non-chordates: D. melanogaster, C. elegans, S. cerevisiae
9 of 34

Extending Ensembl across the taxonomic space
[Figure: tree of life spanning Archaea, Eukaryota and Bacteria, annotated with covered taxa: 48 chordates (including Human, Mouse, Zebrafish, Chicken, Chimpanzee, Pig, Platypus); 3 Plasmodia (falciparum, knowlesi, vivax); 2 Arabidopsis (thaliana, lyrata); Oryza sativa; Drosophila; Caenorhabditis; Anopheles gambiae; yeast (S. cerevisiae, S. pombe); Aspergillus species; bacterial and prokaryotic clades]
F. D. Ciccarelli, T.
Doerks, C. von Mering, C. J. Creevey, B. Snel & P. Bork. Towards automatic reconstruction of a highly resolved tree of life. Science, 3 March 2006.
Slide design by Jeff Almeida-King
10 of 34

Exploring genomes
• Vertebrates focus: www.ensembl.org
• Other species: www.ensemblgenomes.org
11 of 34

Subjects
• Why do we have genome browsers?
• Why Ensembl?
• Ensembl (vertebrate) genes & genomes
• Help and tutorials
12 of 34

What is known?
Genomic assemblies from sequencing consortia
13 of 34

What is known?
Proteins and cDNA/mRNA sequences from the research community found in:
• UniProt/Swiss-Prot (manually curated)
• UniProt/TrEMBL www.uniprot.org
• NCBI RefSeq (manually curated) www.ncbi.nlm.nih.gov/RefSeq
14 of 34

Combining genes and genomes
…tgcctgttag...
[Figure legend: Exon Untranslated+Coding; Exon Coding; Exon Untranslated]
15 of 34

Too many pieces…
[Figure labels: Genome; Aligned cDNA and protein]
[Figure legend: Exon Untranslated+Coding; Exon Coding; Exon Untranslated]
16 of 34

Ensembl shows one transcript with underlying evidence
17 of 34

VEGA/Havana
• Automatic annotation pipeline (Ensembl): gene building all at once (whole genome)
• Manual curation (Havana): case-by-case basis
VEGA: Vertebrate Genome Annotation
18 of 34

HAVANA
http://www.sanger.ac.uk/HGP/havana/
19 of 34

Genes and Transcripts in Ensembl
• Ensembl known transcripts
• Ensembl novel transcripts
• Ensembl merged transcripts (Havana)
• EST clusters
• More manual curation (SGD, WormBase, FlyBase)
20 of 34

Ensembl/Havana
• Transcripts are labelled:
– Ensembl
– Havana
– Ensembl/Havana merge
21 of 34

Names in Ensembl
• ENSG### Ensembl Gene ID
• ENST### Ensembl Transcript ID
• ENSP### Ensembl Peptide ID
• ENSE### Ensembl Exon ID
• For species other than human, a species code is added after ENS:
– MUS (Mus musculus) for mouse: ENSMUSG###
– DAR (Danio rerio) for zebrafish: ENSDARG###, etc.
22 of 34

Low-coverage genomes
• High-coverage sequencing is time-consuming and expensive
– BAC clones (>10x): Human, Mouse, Zebrafish
– Whole Genome Shotgun (6x): Chimp, Rat, Chicken,...
• Low (~2x) coverage genome sequencing
– Faster, cheaper, but only useful when annotated
• Assembled into lots of “scaffolds”
• “Classic” Ensembl gene-build would result in many partial and fragmented genes
23 of 34

Some 2X genomes
24 of 34

Low-Coverage Gene-Build
• Whole Genome Alignment to an annotated high-quality reference genome
• Guided re-ordering of scaffolds
• Annotation of longer, more complete gene structures
25 of 34

2X Genebuild
[Figure labels: Human gene; Human genome; Cat scaffold 1; NNNNNN; Cat scaffold 2; Human or dog gene (projected)]
26 of 34

What other annotation?
• Non-coding (nc)RNAs
• IDs in other databases
• microarray probes, clonesets, BAC maps
• Other features of the genome: repeats, CpG islands
• Comparative data: orthologues and paralogues, protein families, whole genome alignments, syntenic regions
• Variation data: SNPs, InDels
• Regulatory data (a first guess at promoter and enhancer elements)
• Data from external sources (DAS)
27 of 34

Sources of Variation
NCBI dbSNP http://www.ncbi.nlm.nih.gov/SNP/
• Import: alleles, flanking sequence, frequencies
• Calculate: position, transcript effect
For human also:
– HGVbase
– Affy GeneChip 100K and 500K Mapping Array
– Affy Genome-Wide SNP array 6.0
– Ensembl-called SNPs (from Celera reads and Jim Watson’s and Craig Venter’s genomes)
For mouse, rat, dog and chicken also:
– Sanger- and Ensembl-called SNPs (other strains / breeds)
– STAR Project for rat, other projects
28 of 34

External Sources
Large-scale variations in…
DECIPHER
• Database of Chromosomal Imbalance and Phenotype in Humans using Ensembl Resources
DGV loci
• Database of Genomic Variants
• CNVs, Inversions, InDels
29 of 34

Subjects
• Why do we have genome browsers?
• Why Ensembl?
• Ensembl genes and genomes
• Help and tutorials
30 of 34

How is this information organised?
• Ensembl Views (Website)
• Ensembl Database (open source)
• BioMart ‘DataMining tool’
31 of 34

Help and Information
• Comments and questions?
[email protected]
• Check out our tutorials page: www.ensembl.org/info/website/tutorials/index.html
• Videos http://www.youtube.com/user/EnsemblHelpdesk
• Mailing list [email protected]
• Come visit our blog! http://ensembl.blogspot.com/
• FTP site: ftp://ftp.ensembl.org
• Amazon Web Services: http://aws.amazon.com/publicdatasets
32 of 34

Ensembl Team
Ensembl: Paul Flicek (EBI), Steve Searle (Sanger Institute)
Software: Glenn Proctor, Andreas Kähäri, Stephen Keenan, Rhoda Kinsella, Eugene Kulesha, Ian Longden, Daniel Rios, Iliana Toneva
Comparative Genomics: Javier Herrero, Kathryn Beal, Stephen Fitzgerald, Leo Gordon, Albert Vilella
Functional Genomics: Ian Dunham, Nathan Johnson, Steven Wilder
Variation: Fiona Cunningham, Yuan Chen, Pontus Larrson, Will McLaren
Analysis and Annotation: Bronwen Aken, Julio Banet, Susan Fairley, Jan-Hinnerck Vogel, Simon White, Amonida Zadissa
Web Team: Anne Parker, Eugene Bragin, Bethan Pritchard, Steve Trevanion (VEGA)
Outreach: Xosé M Fernández, Jeff Almeida-King, Bert Overduin, Michael Schuster (QC), Giulietta Spudich
Systems & Support: Guy Coates, James Beal, Gen-Tao Chiang, Peter Clapham, Simon Kelley, Shelley Goddard, Tracy Mumford, Kerry Smith
Research: Benoît Ballester, Damian Keefe, Dace Ruklisa, Petra Catalina Schwalie, Guy Slater
Vertebrate Genomics: Illka Lappalainen, Chao-Kung Chen, Laura Clark, Jonathan Hinton, Vasudev Kumanduri, Edoardo Marcora, Damian Smedley, Richard Smith, Phil Wilkinson, Holly Zheng-Bradley
Ensembl Genomes: Paul Kersey, Paul Derwent, Matthias Haimel, Arnaud Kerhornou, Uma Maheswari, Michael Nuhn, Dan Staines, Andy Yates
VectorBase: Dan Lawson, Gautier Koscielny, Karyn Megy
Zebrafish: Kerstin Howe, Kim Brugger (GRC), Will Chow, Britt Reimholz, James Torrance
Ensembl Strategy: Ewan Birney, Richard Durbin, Tim Hubbard
33 of 34

The Wellcome Trust Genome Campus
34 of 34
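
Worked example: reading an Ensembl ID
The naming scheme on the "Names in Ensembl" slide (ENS, an optional three-letter species code, a letter for gene/transcript/peptide/exon, then a number) can be sketched in a few lines of Python. This is only an illustration and not part of any Ensembl software: the function name parse_stable_id, the StableId tuple and the example IDs are made up for this sketch, the "###" on the slide is assumed to stand for one or more digits, and only the two species codes shown on the slide (MUS, DAR) are listed.

```python
import re
from typing import NamedTuple, Optional

# Feature letters taken from the slide: G, T, P, E.
FEATURE_TYPES = {"G": "gene", "T": "transcript", "P": "peptide", "E": "exon"}

# Species codes taken from the slide; human IDs carry no code.
# Any other three-letter code is reported as unlisted rather than guessed.
SPECIES_CODES = {"": "human", "MUS": "mouse", "DAR": "zebrafish"}

# Assumption: "###" on the slide means "one or more digits".
_STABLE_ID = re.compile(r"^ENS(?P<species>[A-Z]{3})?(?P<kind>[GTPE])(?P<number>\d+)$")


class StableId(NamedTuple):
    species: str  # e.g. "human", "mouse"
    feature: str  # "gene", "transcript", "peptide" or "exon"
    number: str   # the digits, kept as text so leading zeros survive


def parse_stable_id(stable_id: str) -> Optional[StableId]:
    """Split an ID of the form ENS[species]{G,T,P,E}digits; None if it does not fit."""
    match = _STABLE_ID.match(stable_id)
    if match is None:
        return None
    code = match.group("species") or ""
    species = SPECIES_CODES.get(code, f"unlisted species code {code}")
    return StableId(species, FEATURE_TYPES[match.group("kind")], match.group("number"))


if __name__ == "__main__":
    # Placeholder IDs written for this sketch, not looked up in Ensembl.
    for example in ("ENSG00000000001", "ENSMUSG00000000001", "ENSDART00000000001", "not-an-id"):
        print(example, "->", parse_stable_id(example))
```

Keeping the number as text is a deliberate choice here: the leading zeros are part of the ID as written, so converting to an integer would lose information.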