Biological Ontologies: - Georgetown University


Biomedical Ontologies
Bio-Trac 40 (Protein Bioinformatics)
October 9, 2008
Zhang-Zhi Hu, M.D.
Research Associate Professor
Protein Information Resource, Department of
Biochemistry and Molecular & Cellular Biology
Georgetown University Medical Center
1
Overview
• What is ontology?
– What is biomedical ontology?
• What is gene ontology?
– How is it generated?
– How is it used for annotation?
• What is protein ontology?
– Why is it necessary?
– How to use it?
• ……
2
Tree of Porphyry with Aristotle’s Categories
Aristotle, 384 BC – 322 BC
3
Ontology:
onto-, of being or existence; -logy, study.
Greek origin; Latin, ontologia, 1606
• In philosophy, it seeks to describe basic categories
and relationships of being or existence to define
entities and types of entities within its framework:
– What do you know? How do you know it?
– What is existence? What is a physical object?
– What constitutes the identity of an object? ……
• Central goal is to have a definitive and exhaustive
classification of all entities.
“The science of what is, of the kinds and structures of objects,
properties, events, processes and relations in every area of reality”
– Barry Smith, U Buffalo
4
In computer and information science
• Ontology is a data model that represents a set of
concepts within a domain and the relationships
between those concepts. It is used to reason about
the objects within that domain.
Most ontologies describe individuals (instances), classes (concepts),
attributes, and relations:
– Classes (concepts)
– Relations, e.g. is_a
– Attributes, e.g. color, engine, door…
– Individuals (instances), e.g. your Ford, my Ford, his Ford…
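A minimal sketch of this data model in Python may make the four parts concrete. The class names "vehicle" and "car" and the color values are assumptions added for illustration; only the Ford individuals and the color, engine and door attributes come from the slide.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class OntologyClass:
    """A class (concept), optionally linked to a parent class by is_a."""
    name: str
    is_a: Optional["OntologyClass"] = None
    attributes: tuple = ()


@dataclass
class Individual:
    """An individual (instance) of a class, with attribute values."""
    name: str
    instance_of: OntologyClass
    values: dict = field(default_factory=dict)


def is_instance_of(individual, cls):
    """Reason over is_a: walk up from the individual's class to the root."""
    current = individual.instance_of
    while current is not None:
        if current is cls:
            return True
        current = current.is_a
    return False


# "vehicle" and "car" are hypothetical classes, not taken from the slide.
vehicle = OntologyClass("vehicle")
car = OntologyClass("car", is_a=vehicle, attributes=("color", "engine", "door"))
ford = OntologyClass("Ford", is_a=car)

my_ford = Individual("my Ford", ford, {"color": "red"})
your_ford = Individual("your Ford", ford, {"color": "blue"})

print(is_instance_of(my_ford, vehicle))  # my Ford is a Ford, a car and a vehicle
```

The point of the sketch is that the statement "my Ford is a vehicle" is never stored; it is inferred from the is_a links, which is the kind of reasoning the definition above refers to.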
5
What are ontologies useful for?
Ontology is a form of knowledge representation about
the world or some part of it.
• Terminology management
• Integration, interoperability, and sharing of data
– promote precise communication between scientists
– enable information retrieval across multiple resources
• Knowledge reuse and decision support
– extend the power of computational approaches to perform
data exploration, inference, and mining
Biomedical Terminology vs. Biomedical Ontology
• UMLS (Unified Medical Language System)
• MeSH (Medical Subject Headings)
• NCI Thesaurus
• SNOMED / SNODENT
• Medical WordNet
6
Ontology Enables
Large-Scale Biomedical Science
Ontologies are at the center of two major activities in current
biomedical research:
• Structured representation of biomedicine:
– For different types of entities and relations to
describe biomedicine (ontology content curation).
• Annotation: using ontologies to summarize and
describe biomedical experimental results to enable:
– Integration of their data with other researchers’ results
– Cross-species analyses
7
Gene Ontology (GO)
What makes it so wildly successful?
8
GO Consortium
http://www.geneontology.org/
• The Gene Ontology was originally constructed in 1998 by
a consortium of researchers studying the genomes of three
model organisms:
– Drosophila melanogaster (fruit fly) (FlyBase)
– Mus musculus (mouse) (MGD)
– Saccharomyces cerevisiae (yeast) (SGD)
• Many other model organism databases have joined the
GO consortium, contributing:
– development of the ontologies
– annotations for the genes of one or more organisms
9
Need for annotation of genome sequences
• What is Gene Ontology? GO provides a controlled vocabulary to
describe gene and gene product attributes in any organism – how gene
products behave in a cellular context
Three key concepts: [Currently total 25804 GO terms] (Oct. 2008)
• Biological process: series of events accomplished by one or more
ordered assemblies of molecular functions, e.g. signal transduction,
pyrimidine metabolism, and alpha-glucoside transport. [total: 15161]
• Molecular function: describes activities, such as catalytic or binding
activities, that occur at the molecular level and that can be
performed by individual gene products or by assembled complexes of
gene products; e.g. catalytic activity, transporter activity. [total: 8425]
• Cellular component: a component of a cell that is part of some larger
object, which may be an anatomical structure (e.g. ER or nucleus) or a
gene product group (e.g. ribosome, or a protein dimer). [total: 2218]
• GO annotation
– Characterization of gene products using GO terms
– Members submit their data, which are available at the GO website.
10
GO Representation:
Tree or Network?
GO is a network structure
[Figure: a GO graph. Labels: root; node (a concept or a term);
relations (is_a or part_of); C has two parents, A and B; leaf node.]
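The difference between a tree and a network can be sketched in a few lines of Python. This is an illustration only: the terms A, B, C and root come from the figure labels, but which relation (is_a or part_of) links C to each parent is an assumption made here.

```python
# Each child term maps to a list of (relation, parent) pairs.
# A tree would allow only one parent per term; a network allows several.
edges = {
    "A": [("is_a", "root")],
    "B": [("is_a", "root")],
    "C": [("is_a", "A"), ("part_of", "B")],  # relation choice is hypothetical
}


def ancestors(term, edges):
    """Collect every term reachable by following parent links upward."""
    seen = set()
    stack = [term]
    while stack:
        node = stack.pop()
        for _relation, parent in edges.get(node, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def leaf_terms(edges):
    """Terms that never appear as a parent of another term."""
    parents = {parent for links in edges.values() for _rel, parent in links}
    return sorted(term for term in edges if term not in parents)


print(sorted(ancestors("C", edges)))
print(leaf_terms(edges))
```

Because C reaches the root through both A and B, a gene product annotated to C is implicitly annotated to both parents, which a strict tree could not express.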
11
http://www.geneontology.org/
12
GO search and display tool
GO term (GO:0006366):
mRNA transcription from RNA polymerase II promoter
Leaf
node
13
Human p53 – GO annotation
(UniProtKB:P04637)
GO:0006289:nucleotide-excision repair [PMID:7663514; evidence:IMP]
14
GO annotation of gene products
• Science basis of the GO: trained experts use the experimental
observations from literature to associate GO terms with gene products
(to annotate the entities represented in the gene/protein databases)
• Enabling data integration across databases and making them available
to semantic search
http://www.geneontology.org/GO.current.annotations.shtml
~46
Human, mouse, plant, worm, yeast …
15
What GO is NOT……
• Ontology of gene products: e.g. cytochrome c is not in
GO, but attributes of cytochrome c are, e.g.
oxidoreductase activity.
• Processes, functions and components unique to mutants
or diseases: e.g. oncogenesis is not a valid GO term.
• Protein domains or structural features.
• Protein-protein interactions.
• Environment, evolution and expression.
• Anatomical or histological features above the level of
cellular components, including cell types.
Nor is GO an ontology of genes!! – a misnomer
16
Missing GO nodes…
not deep enough…
not broad enough…
17
Lack of connections among GOs
Estrogen
receptor
18
GO: A Common Standard for Omics Data Analysis
what molecular function?
what biological process?
what cellular component?
19
need more…
• need to improve the quality of GO to support more
rigorous logic-based reasoning across the data
annotated in its terms
• need to extend the GO by engaging ever broader
community support for addition of new terms and
for correction of errors
• need to extend the methodology to other domains,
including clinical domains, such as:
– disease ontology
– immunology ontology
– symptom (phenotype) ontology
– clinical trial ontology
– ...
20
http://www.obofoundry.org/
• Establish common rules governing best practices for creating
ontologies and for using these in annotations
• Apply these rules to create a complete suite of orthogonal
interoperable biomedical reference ontologies
National Center for Biomedical Ontology (NCBO)
http://bioontology.org/
21
http://www.obofoundry.org/index.cgi?sort=domain&show=ontologies
22
The OBO Foundry
• A family of interoperable gold standard biomedical
reference ontologies to serve annotation of:
– scientific literature
– model organism databases
– clinical trial data …
OBO Foundry = a subset of OBO ontologies, whose developers
have agreed in advance to accept a common set of principles
reflecting best practice in ontology development designed to ensure:
• tight connection to the biomedical basic sciences
• compatibility, interoperability, common relations
• support for logic-based reasoning
OBO Foundry Principles: http://www.obofoundry.org/crit.shtml
23
Rationale of OBO Foundry coverage
(by GRANULARITY and by RELATION TO TIME: CONTINUANT, INDEPENDENT or
DEPENDENT; OCCURRENT)
ORGAN AND ORGANISM
– Independent continuant: Organism (NCBI Taxonomy); Anatomical Entity
(FMA, CARO)
– Dependent continuant: Organ Function (FMP, CPRO); Phenotypic Quality
(PaTO)
CELL AND CELLULAR COMPONENT
– Independent continuant: Cell (CL); Cellular Component (FMA, GO)
– Dependent continuant: Cellular Function (GO)
MOLECULE
– Independent continuant: Molecule (ChEBI, SO, RNAO, PRO)
– Dependent continuant: Molecular Function (GO)
Occurrent column: Biological Process (GO); Molecular Process (GO)
24
OBO Relation Ontology
Foundational: is_a, part_of
Spatial: located_in, contained_in, adjacent_to
Temporal: transformation_of, derives_from, preceded_by
Participation: has_participant, has_agent
e.g.: A is_a B =def. every instance of A is an instance of B
“rose is_a plant” means that every instance of rose is an instance of plant
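The instance-level definition of is_a can be checked mechanically. The sketch below is an illustration under stated assumptions: the class "organism" and the individuals rose_1, rose_2 and oak_1 are hypothetical, and only "rose is_a plant" comes from the slide.

```python
# Class-level assertions: child class -> parent class.
is_a = {"rose": "plant", "plant": "organism"}  # "organism" is hypothetical

# Instance-level assertions: individual -> its most specific class.
instance_of = {"rose_1": "rose", "rose_2": "rose", "oak_1": "plant"}


def classes_of(individual):
    """All classes an individual belongs to, following is_a upward."""
    result = []
    cls = instance_of[individual]
    while cls is not None:
        result.append(cls)
        cls = is_a.get(cls)
    return result


def extension(cls):
    """The set of all individuals that are instances of a class."""
    return {ind for ind in instance_of if cls in classes_of(ind)}


def is_a_holds(a, b):
    """A is_a B holds when every instance of A is an instance of B."""
    return extension(a) <= extension(b)


print(is_a_holds("rose", "plant"))   # every rose is a plant
print(is_a_holds("plant", "rose"))   # oak_1 is a plant but not a rose
```

Defining relations through instances in this way is what lets different ontologies share the same relation names with the same meaning.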
25
What is Protein Ontology? Why?
PRO
http://pir.georgetown.edu/pro/
26
The Need for Representation of Various Protein Forms
Glucocorticoid receptor (GR)
Human PRLR
and PTMs…
27
Sphingomyelin phosphodiesterase (SMPD1)
(ASM_HUMAN)
• Cleavage sites:
– lysosomal: the enzyme is transported from the Golgi apparatus
to the lysosome after additions of mannose-6-phosphate
moieties (M6P) and binding to M6P receptor.
– secreted: the shorter cleaved form is not modified with M6P
and is targeted for secretion to the extracellular space, with
different functions such as LDL binding and oxidized LDL
catabolism.
lysosome
M6P
Extracellular, e.g.
LDL binding
28
Alternative splicing
a single new contact between Phe32
(F32) of FGF8b and a hydrophobic
groove within Ig domain 3 of FGFR2c
Olsen et al., Genes Dev. 2006
FGF8a, 8b – differ in their ability
to pattern embryonic brain
FGF8a
FGF8b
FGF8_HUMAN alternative splicing
• Only FGF8b can
transform midbrain to
cerebellum whereas
FGF8a causes an
overgrowth of midbrain.
29
GOA for Transcription factor Ovo-like 2
Form 1 - long: GO:0045892 IDA - negative regulation of transcription, DNA-dependent
Form 2 – short: GO:0045893 IDA - positive regulation of transcription, DNA-dependent
- Gene. 2004 336:47-58. PMID:15225875
274 aa
OVOL2_MOUSE
(Q8CIV7)
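The Ovo-like 2 case shows what is lost when annotations are attached to a protein entry rather than to a specific form. The following sketch is illustrative: the two GO annotations are taken from the slide, while the form labels used as keys and the helper function are assumptions.

```python
from collections import defaultdict

# (protein form, GO id, GO term name, evidence code), as listed on the slide.
annotations = [
    ("OVOL2_MOUSE long form", "GO:0045892",
     "negative regulation of transcription, DNA-dependent", "IDA"),
    ("OVOL2_MOUSE short form", "GO:0045893",
     "positive regulation of transcription, DNA-dependent", "IDA"),
]


def group_terms(annotations, key):
    """Group GO ids by whatever the key function extracts from the form label."""
    grouped = defaultdict(set)
    for form, go_id, _name, _evidence in annotations:
        grouped[key(form)].add(go_id)
    return dict(grouped)


# Form-level view: each form keeps its own, unambiguous annotation.
by_form = group_terms(annotations, key=lambda form: form)

# Entry-level view: both terms collapse onto the single entry OVOL2_MOUSE.
by_entry = group_terms(annotations, key=lambda form: form.split()[0])

print(by_form)
print(by_entry)
```

At the entry level the protein appears to regulate transcription both negatively and positively, and the information about which form does which is gone.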
30
The Need for Protein Classes
Representing Protein Evolutionary Relationships
• Genes/proteins identified in model organisms, such as mouse,
yeast, fly, may have important functional implications in human.
– Gene function in a model organism may not apply to humans
• Animal models for human diseases: such as mouse models for
diabetes, arthritis, and tumor.
– Essential genes may be redundant and nonessential in another
species due to functional compensation, e.g.:
• mutation of Rb1 causes retinoblastoma in early childhood
• Rb1 knock-out mouse did not develop retinoblastoma because
of compensation from a functional homolog p107.
• Close examination of proteins in phylogenetic classes and their
functional convergence and divergence in an ontological
structure is important for application of disease models.
31
Implications of Protein Evolution
• Conclusions from experiments performed on proteins from one organism
are often applicable to the homologous protein from another organism.
• Information learned about existing proteins allows us to infer the
properties of ancestral proteins.
Common
ancestor
32
Protein Evolution
Sequence changes
With enough similarity, one can
trace back to a common origin
Domain shuffling
What about
these?
33
Functional convergence
• Protein classes of the same function derived from different
evolutionary origins, e.g. carbonate dehydratase (or
carbonic anhydrase EC 4.2.1.1), which has three
independent gene families with functional convergence.
Animal and
prokaryotic type
Plant and
prokaryotic type
Archaea type
34
Functional divergence
[Figure: gene tree for TGM3 and EPB42. A gene duplication (TGM3/EPB42
split) gives a TGM3 branch and an EPB42 branch; speciation (human/mouse
split) on each branch then gives TGM3 (Human), TGM3 (Mouse),
EPB42 (Human) and EPB42 (Mouse).]
TGM3 = Protein-glutamine gamma-glutamyltransferase
(Transglutaminase; involved in protein modification)
EPB42 = Erythrocyte membrane protein band 4.2
(Constituent of cytoskeleton; involved in cell shape)
35
The Need for Protein Ontology
• Data integration and knowledge management for -omics work.
• A gap exists in OBO for gene products.
• Protein Ontology (PRO) will contain two connected components
(or subontologies):
– ProEvo captures the protein classes represented by protein
families at fold, domain and full length levels that reflect
evolutionary relationships
– ProForm captures the specific protein objects of a specific
gene resulting from alternative splicing, posttranslational
modification, genetic variations.
– ProEvo and ProForm are connected through the “reference”
(canonical) protein sequence currently annotated in
UniProtKB.
• PRO formalization of these detailed protein objects and classes
will allow accurate and consistent proteomics experimental design
and data analysis/integration.
36
PRO Framework
• PRO is designed to be a formal and well-principled OBO
Foundry ontology for protein entities.
• Attributes of objects will take the form of links to other
ontologies, such as gene (GO), sequence (SO), modification
(PSI-MOD) and disease (DO) ontologies.
• A PRO prototype for TGF-beta signaling proteins was built
based on this framework.
• In this way, PRO aims at providing an ontological framework
to define protein entities and evolutionary-related classes
that the community can adopt for different purposes, e.g.
– annotation of entity attributes,
– mapping of objects in pathways, and
– modeling of biological system dynamics and disease.
37
Protein Ontology (PRO)
http://pir.georgetown.edu/pro/
Levels, from most specific to most general: Modification Level,
Sequence Level, Gene Level, Family Level, Root Level (the side bar of
the figure labels these with ProForm and ProEvo). Each level is linked
to the one above by is_a, except the modification level, which is
linked by derives_from.
Root Level: protein
Family-Level Distinction: translation product of an evolutionarily-related gene
• Derivation: common ancestor
• Source: PIRSF family
Gene-Level Distinction: translation product of a specific gene
• Derivation: specific gene
• Sources: PIRSF subfamily, Panther subfamily
Sequence-Level Distinction: translation product of a specific mRNA
• Derivation: specific allele or splice variant
• Source: UniProtKB
Modification-Level Distinction: cleaved/modified translation product
• Derived from post-translational modification
• Source: UniProtKB
Links to other resources:
– Pfam (Domain; protein domain): has_part
– GO (Gene Ontology): has_function (molecular function);
participates_in (biological process); part_of (for complexes) or
located_in (for compartments) (cellular component)
– OMIM (Disease; disease): agent_in
– SO (Sequence Ontology; sequence change): has_agent (sequence change),
agent_of (effect on function)
– PSI-MOD (Modification; protein modification): has_modification
Example:
TGF-beta receptor phosphorylated smad2 isoform1
is a phosphorylated smad2 isoform1
derives_from smad2 isoform1
is a smad2
is a TGF-b receptor-regulated smad
is a smad
is a protein
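The Smad2 example can be read as a chain of links from the most specific form up to the root. The sketch below encodes that chain as a plain dictionary; the term names follow the slide (with "isoform1" spelled consistently), while the data structure and the function name are assumptions for illustration and not PRO's actual file format.

```python
# Each term maps to (relation, parent), following the slide's example.
links = {
    "TGF-beta receptor phosphorylated smad2 isoform1":
        ("is_a", "phosphorylated smad2 isoform1"),
    "phosphorylated smad2 isoform1": ("derives_from", "smad2 isoform1"),
    "smad2 isoform1": ("is_a", "smad2"),
    "smad2": ("is_a", "TGF-b receptor-regulated smad"),
    "TGF-b receptor-regulated smad": ("is_a", "smad"),
    "smad": ("is_a", "protein"),
}


def lineage(term, links):
    """Follow parent links from a term up to the root, recording each step."""
    steps = []
    while term in links:
        relation, parent = links[term]
        steps.append((term, relation, parent))
        term = parent
    return steps


start = "TGF-beta receptor phosphorylated smad2 isoform1"
for child, relation, parent in lineage(start, links):
    print(f"{child} {relation} {parent}")
```

Note that the step from the modified form to the unmodified isoform uses derives_from rather than is_a: a phosphorylated Smad2 is produced from the isoform, it is not a kind of unmodified isoform.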
38
Mothers against decapentaplegic homolog 2
Smad 2
GO annotation of SMAD2_HUMAN:
Cellular Component:
- nucleus
Molecular Function:
- protein binding
Biological Process:
- signal transduction
- regulation of transcription, DNA-dependent
39
[Figure: TGF-beta signaling through Smad 2. Labels: TGF-b; TGF-beta
receptor I and II; Smad 2; Smad 4; CAMK2; ERK1; P (phosphate);
Cytoplasm; Nucleus. Numbered steps: 1 phosphorylation; 2 complex
formation; 3 nuclear translocation; 4 DNA binding and transcription
regulation.]
40
Smad2 gene products
Forms, location and ID (all SMAD2_HUMAN):
• “normal” (PRO:00000011): cytoplasmic
• TGF-b receptor phosphorylated (PRO:00000013): forms complex; nuclear;
Txn upregulation
• ERK1 phosphorylated (PRO:00000014): forms complex; nuclear;
Txn upregulation++
• CAMK2 phosphorylated (PRO:00000015): forms complex; cytoplasmic;
no Txn upregulation
• alternatively spliced short form (PRO:00000016): cytoplasmic
• phosphorylated short form (PRO:00000018): nuclear; Txn upregulation
• point mutation (causative agent: large intestine carcinoma)
(PRO:00000019): doesn’t form complex; cytoplasmic; no Txn upregulation
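Giving each form its own identifier makes the differences queryable. The sketch below records the first four forms from the slide as Python objects; the field names are assumptions chosen for illustration, and properties the slide does not state for a form are left as None.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProteinForm:
    pro_id: str
    description: str
    location: str
    forms_complex: Optional[bool]      # None where the slide does not say
    txn_upregulation: Optional[bool]   # None where the slide does not say


# All four are forms of the same UniProtKB entry, SMAD2_HUMAN.
forms = [
    ProteinForm("PRO:00000011", "normal", "cytoplasmic", None, None),
    ProteinForm("PRO:00000013", "TGF-b receptor phosphorylated",
                "nuclear", True, True),
    ProteinForm("PRO:00000014", "ERK1 phosphorylated",
                "nuclear", True, True),
    ProteinForm("PRO:00000015", "CAMK2 phosphorylated",
                "cytoplasmic", True, False),
]

# Which forms are reported to upregulate transcription?
upregulating = [f.pro_id for f in forms if f.txn_upregulation]

# Which phosphorylated forms stay in the cytoplasm?
cytoplasmic_complexes = [
    f.pro_id for f in forms
    if f.forms_complex and f.location == "cytoplasmic"
]

print(upregulating)
print(cytoplasmic_complexes)
```

With a single identifier for SMAD2_HUMAN, neither question could be answered, since the answers differ from form to form.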
41
PRO hierarchy in OBO-Edit
ProEvo:
Representing evolutionary-related protein classes.
In this example, children of TGF-beta-like cysteine-knot cytokine
have a common architecture consisting of a signal peptide, a variable
propeptide region and a transforming growth factor beta-like
domain that is a cysteine-knot domain.
has_part Pfam:PF00019 "Transforming growth factor beta like domain".
ProForm:
OBO relations: is_a, derives_from
Representing multiple protein products of a gene. Only forms with
experimental data are included. When common protein forms exist in
human and mouse, a single node is created (See details below).
43
Summary
• The vision of the biomedical ontology community is that all
biomedical knowledge and data are disseminated on the
Internet using principled ontologies, such that they are
semantically interoperable and useful for improving biomedical
science and clinical care.
• The scope extends to all knowledge and data that is relevant to
the understanding or improvement of human biology and health.
• Knowledge and data are semantically interoperable when they
enable predictable, meaningful computation across knowledge
sources developed independently to meet diverse needs.
• Principled ontologies are ones that follow NCBO-recommended
formats and methodologies for ontology development,
maintenance, and use.
44