From Genome Sequencing to Biology in the Lab of Milk and


BeeBase - The Honey Bee Model Organism Database

Chris Elsik

Outline
• BeeBase - what it is now
• How it works
• Future Plans

BeeBase

http://racerx00.tamu.edu/PHP/bee_search.php

• Predicted Gene and Homolog Search Page
• Genome Browser
• Comparative Map Viewer
• Protein Families Database with Bee, Fly and Mosquito proteins
• The newest assembly (release 2.0): http://racerx00.tamu.edu/cgi-bin/gbrowse/bee_genome2

Gbrowse

• A module of the Generic Model Organism Database Project (GMOD), www.gmod.org

• A graphical viewer of features along a reference sequence
• Based on MySQL and Perl
• The configuration file allows us to:
– Change fonts, colors, text.
– Change the overview: sequence scaffold, contig, genetic map, karyotype.
– Define tracks.
– Modify track appearance.

Gbrowse Internals

• BioPerl Library - allows the browser to run on top of a variety of database management systems and schemata
• Bio::Graphics module - used to graphically render any type of nucleotide or protein feature
• Bio::DB::GFF Database - uses a flat coordinate system to represent genomic features. Optimized for queries that retrieve features by ID, type or region of genome

Our task is to generate GFF data

• GFF = generic feature format
• A standard format that aids data exchange
• Allows you to specify a substring of a biological sequence
• The current version (2) uses terms from the Sequence Ontology project, a set of terms used to describe features on a nucleotide or protein sequence. It encompasses both "raw" features, such as nucleotide similarity hits, and interpretations, such as gene models.
• For information on the specifications: http://www.sanger.ac.uk/Software/formats/GFF/
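As a rough illustration of what a GFF record looks like, here is a minimal Python sketch that builds one tab-delimited GFF version 2 line with its nine fields (sequence name, source, feature type, start, end, score, strand, frame, group). The scaffold name, feature type and EST identifier are hypothetical, and the helper `gff2_line` is not part of BeeBase or GBrowse.

```python
def gff2_line(seqname, source, feature, start, end,
              score=None, strand=".", frame=".", group=""):
    """Return one tab-delimited GFF version 2 record (nine fields).

    Coordinates are 1-based and inclusive; a missing value is written as ".".
    """
    if start > end:
        # GFF expects start <= end; orientation is carried by the strand field.
        start, end = end, start
    score_text = "." if score is None else f"{score:g}"
    fields = [seqname, source, feature, str(start), str(end),
              score_text, strand, frame, group]
    return "\t".join(fields)


# Hypothetical example: an EST match on a scaffold, predicted by exonerate.
record = gff2_line("Group1.1", "exonerate", "EST_match", 1200, 1850,
                   score=97.5, strand="+", group='EST "BB0001"')
print(record)
```

Each record describes one substring of the reference sequence, so a multi-exon match would normally be written as several lines that share the same group value.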

Computing Data for Tracks

• Markers
– Compare marker sequences to genome scaffolds using BLASTN
– Use ePCR (primersearch) for markers with primers, but no sequence
• ESTs
– Compare ESTs to genome scaffolds using fasta or BLAT
– Use exonerate (http://www.ebi.ac.uk/~guy/exonerate/) to predict exon/intron boundaries for each match
• Protein Homologs
– Compare protein sequences to genome scaffolds using tfastx to identify matches
– Use exonerate to predict exon/intron boundaries for each match
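The marker step can be sketched in Python as follows, assuming the BLASTN hits were saved in the standard 12-column tabular format (query, subject, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, e-value, bit score). The marker and scaffold names are made up, and keeping only the best-scoring hit per marker is an assumption for illustration, not necessarily the filter used for BeeBase.

```python
def blast_hit_to_feature(line):
    """Convert one tabular BLASTN hit into scaffold coordinates for a track."""
    cols = line.rstrip("\n").split("\t")
    s_start, s_end = int(cols[8]), int(cols[9])
    # BLASTN reports minus-strand hits with subject start > subject end.
    strand = "+" if s_start <= s_end else "-"
    start, end = sorted((s_start, s_end))
    return {
        "marker": cols[0],
        "scaffold": cols[1],
        "start": start,
        "end": end,
        "strand": strand,
        "bitscore": float(cols[11]),
    }


def best_hit_per_marker(lines):
    """Keep the highest bit-score hit for each marker (illustrative filter)."""
    best = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment headers
        hit = blast_hit_to_feature(line)
        kept = best.get(hit["marker"])
        if kept is None or hit["bitscore"] > kept["bitscore"]:
            best[hit["marker"]] = hit
    return best


# Hypothetical hits: marker AM0012 matches two scaffolds, AM0040 matches one.
hits = [
    "AM0012\tGroup3.7\t98.50\t200\t3\t0\t1\t200\t5400\t5201\t1e-100\t370",
    "AM0012\tGroup9.2\t85.00\t120\t18\t0\t40\t159\t810\t929\t1e-20\t110",
    "AM0040\tGroup1.1\t99.00\t150\t1\t0\t1\t150\t42\t191\t1e-70\t280",
]
placed = best_hit_per_marker(hits)
for marker, hit in sorted(placed.items()):
    print(marker, hit["scaffold"], hit["start"], hit["end"], hit["strand"])
```

The resulting coordinates and strand map directly onto the start, end and strand fields of a GFF record.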

Annotating Tracks

• The most time consuming task in computing tracks is providing annotations for protein homologs.

• Annotations come from different sources and are in different formats depending on protein dataset.

• We use UniProt for all homolog tracks in assembly 1.1 and 1.2 browsers.

• Assembly 2 uses proteome sets for Drosophila (FlyBase), C. elegans (WormBase), Yeast (SGD), Mosquito (Ensembl) and Human (Ensembl) to avoid redundancy within proteomes.

– The fasta formatted sequences are not annotated (except yeast).

• The “other insect” track will come from UniProt.

– To identify which sequences are insect, we use taxon-id and a locally installed NCBI taxonomy database.
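One way to do the taxon-id check is to walk up the parent links of the NCBI taxonomy until the insect node is reached or the root is hit. The Python sketch below assumes the pipe-delimited layout of the NCBI `nodes.dmp` dump (taxon id, then parent taxon id, then rank); the small tree and its ids are toy values for illustration, not real NCBI taxon ids.

```python
def parse_nodes(lines):
    """Build a child -> parent map from nodes.dmp-style lines."""
    parents = {}
    for line in lines:
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 2 or not fields[0]:
            continue  # skip blank or malformed lines
        parents[int(fields[0])] = int(fields[1])
    return parents


def is_descendant(taxon_id, ancestor_id, parents):
    """Return True if ancestor_id is taxon_id itself or lies above it."""
    seen = set()
    current = taxon_id
    while current not in seen:
        if current == ancestor_id:
            return True
        seen.add(current)
        if current not in parents:
            return False  # unknown taxon id
        current = parents[current]  # the root is its own parent, ending the loop
    return False


# Toy taxonomy (ids are hypothetical): 1 = root, 10 = insects, 11 = a bee,
# 12 = a fly, 20 = mammals, 21 = a human.
toy_nodes = [
    "1\t|\t1\t|\tno rank\t|",
    "10\t|\t1\t|\tclass\t|",
    "11\t|\t10\t|\tspecies\t|",
    "12\t|\t10\t|\tspecies\t|",
    "20\t|\t1\t|\tclass\t|",
    "21\t|\t20\t|\tspecies\t|",
]
parents = parse_nodes(toy_nodes)
insect_ids = [t for t in (11, 12, 21) if is_descendant(t, 10, parents)]
print(insect_ids)
```

With the real dump, the same function would be called with the taxon id attached to each UniProt entry and the taxon id of the insect class.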

CMAP

• CMap is a web-based tool that allows users to view comparisons of genetic and physical maps.
• The package also includes tools for curating map data.
• Built on MySQL and Perl
• Consists of modules for data, logic (how maps are laid out), and presentation.

• Our work is to modify the configuration file and format data.

Future BeeBase Plans

• Redo protein families analysis after final gene prediction set is released; add proteins from additional model organisms (worm, yeast, mouse, human)
• Phylogenetic analysis to identify orthologs
• Gene Ontology assignment
• Create gene pages for each gene, similar to FlyBase, using the new “Turnkey gmod-web” module

More BeeBase Plans

• Curate literature for orthologs to provide an entry into the BeeSpace conceptual navigation system.

• Incorporate QTL viewer using Dave Adelson’s QTL viewer software, which was developed for cattle.

• Incorporate OpenGeneX gene expression database and expression data from the BeeSpace project.

Gene Ontology For Honey Bee


Gene Ontology Consortium http://www.geneontology.org/

“The goal of the Gene Ontology™ (GO) Consortium is to produce a controlled vocabulary that can be applied to all organisms even as knowledge of gene and protein roles in cells is accumulating and changing.”

GO provides three structured networks of defined terms to describe gene product attributes.

Molecular Function Ontology

The tasks performed by individual gene products; examples are carbohydrate binding and ATPase activity.

Biological Process Ontology

Broad biological goals, such as mitosis or purine metabolism, that are accomplished by ordered assemblies of molecular functions.

Cellular Component Ontology

Subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and origin recognition complex.

GO Evidence Codes

• IDA: inferred from direct assay - Enzyme assays, In vitro reconstitution (e.g. transcription), Immunofluorescence (for cellular component), Cell fractionation (for cellular component), Physical interaction/binding assay
• IEP: inferred from expression pattern - useful for biological process ontology
• IGI: inferred from genetic interaction - "Traditional" genetic interactions such as suppressors, synthetic lethals, etc., Functional complementation, Rescue experiments, Inference about one gene drawn from the phenotype of a mutation in a different gene
• IMP: inferred from mutant phenotype - Any gene mutation/knockout, Overexpression/ectopic expression of wild-type or mutant genes, Anti-sense experiments, RNAi experiments, Specific protein inhibitors
• IPI: inferred from physical interaction - 2-hybrid interactions, Co-purification, Co-immunoprecipitation, Ion/protein binding experiments
• IEA: inferred from electronic annotation
• ISS: inferred from sequence or structural similarity
• IC: inferred by curator
• TAS: traceable author statement
• NAS: non-traceable author statement
• ND: no biological data available
• NR: not recorded

Applying GO to Honey Bee

• We must rely heavily on IEA (inferred from electronic annotation - no curator) or ISS (inferred from sequence similarity - inspected by curator)
• We must make the most reliable inferences possible, based on orthology instead of homology

Background: Evolution-based functional inference and orthology

Evolution Allows us to Infer Function

• The most powerful method for inferring function of a gene or protein is by similarity searching a sequence database.

• Our ability to characterize biological properties of a protein using sequence data alone stems from properties conserved through evolutionary time.

• Homologous (evolutionarily related) proteins always share a common 3-dimensional folding structure.

• They often contain common active sites or binding domains.

• They frequently share common functions.

• Predictions made using similar, but non-homologous proteins are much less reliable.

Orthologs

• Homologs = genes that are evolutionarily related

• There are two kinds of homologs:

– Orthologs = genes in different species that have diverged from a common gene in an ancestral species.

– Paralogs = genes that have diverged due to gene duplication.

• Orthologs are more likely than paralogs to have conserved function.

• Orthologs cannot be identified using BLAST or FASTA sequence comparison alone.

• Reliable ortholog identification requires phylogenetic methods.

Example Gene Tree (with plant genes)

[Figure: gene tree with leaves Barley-1, Wheat-1, Maize-1, Sorghum-1, Arabidopsis, Rice-2b, Rice-2a, Maize-2, Wheat-2 and Sorghum-2; brackets mark groups of orthologs and paralogs.]

The outgroup, Arabidopsis, is a dicot. The cereals are monocots. Monocots and dicots diverged ~230 million years ago. Monocots diverged from each other ~60 mya.
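To make the ortholog/paralog distinction concrete, here is a minimal Python sketch of the species-overlap heuristic, one simple phylogenetic approach and not necessarily the method planned for BeeBase. A node whose two subtrees share a species is treated as a duplication (its cross pairs are paralogs); otherwise it is treated as a speciation (its cross pairs are orthologs). The tree topology is a hypothetical arrangement of the gene names from the example, not a transcription of the figure.

```python
from itertools import product


def leaves(node):
    """Return the leaf names under a node (a leaf is a string, a clade a 2-tuple)."""
    if isinstance(node, str):
        return [node]
    found = []
    for child in node:
        found.extend(leaves(child))
    return found


def species(leaf):
    """'Rice-2a' -> 'Rice'; a name without a dash is its own species label."""
    return leaf.split("-")[0]


def classify_pairs(node, orthologs, paralogs):
    """Label every gene pair by the event at its last common ancestor."""
    if isinstance(node, str):
        return
    left, right = node
    left_leaves, right_leaves = leaves(left), leaves(right)
    shared = {species(x) for x in left_leaves} & {species(x) for x in right_leaves}
    target = paralogs if shared else orthologs  # shared species => duplication
    for a, b in product(left_leaves, right_leaves):
        target.add(frozenset((a, b)))
    classify_pairs(left, orthologs, paralogs)
    classify_pairs(right, orthologs, paralogs)


# Hypothetical binary tree: Arabidopsis outgroup, then a duplication that
# produced the "-1" and "-2" cereal clades, then a rice-specific duplication.
tree = ("Arabidopsis",
        ((("Barley-1", "Wheat-1"), ("Maize-1", "Sorghum-1")),
         ((("Rice-2a", "Rice-2b"), "Wheat-2"), ("Maize-2", "Sorghum-2"))))

orthologs, paralogs = set(), set()
classify_pairs(tree, orthologs, paralogs)
print(frozenset(("Wheat-1", "Wheat-2")) in paralogs)
print(frozenset(("Barley-1", "Wheat-1")) in orthologs)
```

A pairwise BLAST score alone cannot make this call, because Wheat-1 and Wheat-2 may be highly similar and still be separated by a duplication.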

Why shouldn’t we depend on inferences based on paralogs?

• Paralogs emerge after a gene duplication.

• Possible fates of duplicated genes:
– Loss of function for one of the duplicates - lack of selective pressure allows the gene to mutate beyond recognition
– Emergence of new functional paralogs - one duplicate acquires a new function, so selection favors its maintenance in the genome
– Sub-functionalization - both duplicates are required to maintain the function of the original

Back to Gene Ontology for Honey Bee: Proposed Evidence Codes within ISS

• ISS = inferred from sequence similarity (inspected by a curator)
• We can break this down into:
– Inferred from homology (lowest)
– Inferred from an ortholog in one species
– Inferred from orthologs in more than one species, all of which have the same GO classification (highest)

– What if they don’t all have the same GO classification? Move up in the directed acyclic graph to a point where the GO classifications converge.

– This can be tricky, since the graph is a directed acyclic graph and each node can have more than one parent.
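The "move up until the classifications converge" step can be sketched in Python as follows. The idea is to collect every ancestor of each ortholog's term, intersect those sets, and keep only the most specific shared terms. The term names and the parent map are a toy graph for illustration, not real GO identifiers, and `convergence_terms` is a hypothetical helper.

```python
def ancestors(term, parents):
    """Return the term itself plus every term reachable through parent links."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents.get(current, ()))
    return found


def convergence_terms(terms, parents):
    """Most specific terms shared by all inputs (expects at least one term)."""
    common = set.intersection(*(ancestors(t, parents) for t in terms))
    # A shared term is redundant if it lies above another shared term.
    redundant = set()
    for term in common:
        redundant |= ancestors(term, parents) - {term}
    return common - redundant


# Toy graph: "atpase" has two parents, which is what makes this a DAG
# rather than a tree.
parents = {
    "root": [],
    "binding": ["root"],
    "catalytic": ["root"],
    "nucleotide_binding": ["binding"],
    "atp_binding": ["nucleotide_binding"],
    "gtp_binding": ["nucleotide_binding"],
    "atpase": ["catalytic", "atp_binding"],
}

# Two orthologs annotated with different terms converge one level up.
print(convergence_terms(["atpase", "gtp_binding"], parents))
```

The result is a set because, with multiple parents per node, the classifications can converge at more than one equally specific term.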

Some Ongoing Gene Ontology Work in the Elsik Lab - Cattle

• Cattle EST Gene Family Database

• Cattle gene families were created using assembled, translated ESTs grouped with homologous human protein families.

• Database is searchable using GO for the human proteins.

• The next step is phylogenetic analysis to identify human/cattle orthologs.

Searching by Gene Ontology

Borrowing More From Cattle

• Bovine QTL Database - David Adelson, TAMU

The Bovine QTL viewer Interface

Image showing all chromosomes

Image showing one chromosome

QTL Details

OpenGeneX

• Web-based access to the database
• PostgreSQL
• Includes a client-side Java application, used as a curation tool, that formats data in MAGE-ML
• Includes several statistical routines and data analysis tools
– Uses the R statistical analysis package (open source)

Acknowledgements

• Elsik Lab
– Justin Reese
– Kyounghwa Bae
– Anand Venkatraman
– Shreyas Murthi
– Michael Dickens
– Juan Anzola
• Collaborators
– Bruce Schatz, Gene Robinson and the BeeSpace group, UIUC
– William Gelbart - FlyBase (Harvard University)
– Spencer Johnston (TAMU)
– Danny Weaver, Bee Power LP