Introduction to
Functional Analysis
J.L. Mosquera and Alex Sanchez
Motivation
• The rise of the genomic era, and especially the
deciphering of the whole genome sequences of several
organisms, has produced huge quantities of
information.
• New technologies such as DNA microarrays (but not
only these!) allow the simultaneous study of hundreds,
even thousands of genes, in a single experiment.
Motivation
• This poses several challenges:
1) The experiment itself
2) Statistical analysis of results
3) Biological interpretation
• Very often the results are large lists of genes which
have been selected according to some specific
criteria.
PROBLEM: How could a researcher give these sets a
biological interpretation?
Rationale
• A reasonable thing to do is to rely on existing
annotations which help to relate the selected
sequences with biological knowledge.
• Bioinformatics resources hold data, often in the form
of sequences which are annotated in scientific natural
language.
• The annotation in this form is human readable and
understandable, but difficult to interpret
computationally.
What’s in a name?
QUESTION: What’s a cell?
• The same name can be
used to describe
different concepts
• A concept can be
described using different
names
• Comparison is difficult,
especially across species
or across databases
Image from http://microscopy.fsu.edu
Functional annotation
• Probably the most important thing you want to know
is what the genes or their products are concerned
with, i.e. their function.
• Function annotation is difficult:
1) Different people use different words for the same function,
2) or may mean different things by the same word.
3) The context in which a gene was found (e.g. “TGF-induced gene”)
may not be particularly associated with its function.
• Inference of function from sequence alone is error-prone and sometimes unreliable.
• The best function annotation systems use human
beings who read the literature before assigning a
function to a gene.
What can we do?
To overcome some of the problems, an
annotation system has been created: The
Gene Ontology (GO).
What is an ontology?
• An ontology is an entity which provides a set of
vocabulary terms covering a conceptual domain.
• These terms must
1) have an exhaustive and rigorous definition,
2) be placed within a structure of relationships. It usually is a
hierarchical data structure.
• The terms may be linked by two kinds of
relationships:
1) “is-a” between parent and child.
2) “part-of” between part and whole.
• A term may have one or more parents.
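A minimal sketch of such a structure, assuming a toy vocabulary (the term names below are illustrative, not real GO identifiers): each term stores its parents together with the relationship type, and because a term may have several parents the structure is a directed acyclic graph rather than a tree.

```python
# Toy ontology (hypothetical terms): term -> list of (relationship, parent).
PARENTS = {
    "cell": [],
    "organelle": [("part-of", "cell")],
    "nucleus": [("is-a", "organelle"), ("part-of", "cell")],  # two parents
    "nucleolus": [("part-of", "nucleus")],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable from `term` by following parent links."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relation, parent in parents[current]:
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


print(sorted(ancestors("nucleolus")))
```

Storing the relationship type alongside each parent keeps the “is-a” and “part-of” links distinguishable while still allowing a single traversal over both.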
What’s the GO?
• The GO is a cooperative project, developed and maintained
by the Gene Ontology Consortium.
• It is an annotation database created to provide a controlled
vocabulary to describe gene and gene product attributes in
any organism.
• It is organized around three basic ontologies:
Ontology              Number of terms (May 2005)
Molecular Function    7220
Biological Process    9529
Cellular Component    1536
Total Terms           18285
The GO ontologies and the GO graph
[Figure: the GO graph and its three ontologies: Biological Processes (BP), Cellular Components (CC) and Molecular Functions (MF)]
Genes and GO terms
A given gene product may
• represent one or more
molecular functions,
• be used in one or more
biological processes and
• appear in one or more
cellular components.
GO database
• It consists of two essential parts:
1) The current ontologies:
o Vocabulary
o Structure
2) The current annotations:
o Create a link between the known genes and the associated
GO terms that define their function.
• The GO database exists independently of other
annotation databases:
1) It does not depend on the organism
2) It does not depend on other databases
• However, most important databases have cross-references
to the GO database
o It is possible to link and relate other annotations with those contained
in GO
Two types of GO Annotations
• Electronic Annotation
• Manual Annotation
• All annotations must
1) be attributed to a source,
2) indicate what evidence was found to support the GO
term-gene/protein association
Evidence Codes
IEA   Inferred from Electronic Annotation
ISS   Inferred from Sequence Similarity
IEP   Inferred from Expression Pattern
IMP   Inferred from Mutant Phenotype
IGI   Inferred from Genetic Interaction
IPI   Inferred from Physical Interaction
IDA   Inferred from Direct Assay
RCA   Inferred from Reviewed Computational Analysis
TAS   Traceable Author Statement
NAS   Non-traceable Author Statement
IC    Inferred by Curator
ND    No biological Data available
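One common use of evidence codes is to restrict an analysis to manually supported annotations. The sketch below assumes a hypothetical list of (gene, term, evidence code) records and treats IEA as the only electronic code; both the records and that split are illustrative assumptions, not part of the GO data model.

```python
# Assumption for this sketch: IEA is the only code treated as electronic.
ELECTRONIC_CODES = {"IEA"}

# Hypothetical annotations: (gene, GO term label, evidence code).
ANNOTATIONS = [
    ("geneA", "term1", "IDA"),
    ("geneA", "term2", "IEA"),
    ("geneB", "term1", "TAS"),
    ("geneC", "term3", "IEA"),
]


def manual_only(annotations):
    """Keep only annotations whose evidence code is not electronic."""
    return [a for a in annotations if a[2] not in ELECTRONIC_CODES]


for gene, term, code in manual_only(ANNOTATIONS):
    print(gene, term, code)
```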
Enrichment Analysis
• Unbiased method to ask the question, “What’s so special
about my set of genes?”
• Many tools follow similar steps:
1) Obtain GO annotation (most specific term(s)) for
genes in your set
2) Climb an ontology to get all “parents” (more
general terms)
3) Look at occurrence of each term in your set
compared to terms in population (all genes or all
genes on your chip)
4) Are some terms over-represented?
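Steps 1 to 3 can be sketched as follows, assuming a toy ontology and hypothetical gene annotations (all names below are made up for illustration). Each gene's most specific terms are expanded to include every parent, and term occurrences are then counted in the selected set and in the population.

```python
from collections import Counter

# Toy ontology (hypothetical): term -> list of parent terms.
PARENTS = {
    "root": [],
    "metabolism": ["root"],
    "lipid metabolism": ["metabolism"],
    "transport": ["root"],
}

# Step 1: most specific annotation for each gene (hypothetical data).
DIRECT = {
    "g1": {"lipid metabolism"},
    "g2": {"lipid metabolism"},
    "g3": {"transport"},
    "g4": {"metabolism"},
    "g5": {"transport"},
    "g6": {"transport"},
}


def with_parents(terms):
    """Step 2: climb the ontology and collect all more general terms."""
    result = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in result:
            result.add(term)
            stack.extend(PARENTS[term])
    return result


def term_counts(genes):
    """Step 3: count how many genes are annotated to each term."""
    counts = Counter()
    for gene in genes:
        counts.update(with_parents(DIRECT[gene]))
    return counts


population = sorted(DIRECT)
selected = ["g1", "g2", "g4"]
pop_counts = term_counts(population)
sel_counts = term_counts(selected)
for term in sorted(sel_counts):
    print(f"{term}: {sel_counts[term]}/{len(selected)} selected, "
          f"{pop_counts[term]}/{len(population)} in population")
```

Step 4, deciding whether a term is over-represented, requires a statistical test on these counts.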
Statistical Methods for enrichment analysis
• Let us consider:
o N genes on a microarray:
M belong to a given GO term category (A)
N-M do not belong to it (category Ac)
o K of the N genes are selected and assigned to a given class (e.g.
regulated genes)
o x genes of these K will be in A
(EXAMPLE)
STATISTICAL HYPOTHESIS:
H0: GO category A is equally represented on the microarray
and in the class of differentially regulated genes
H1: GO category A is more (or less) represented on the
microarray than in the class of differentially regulated genes
Hypergeometric Distribution (1/2)
We ask: Assuming sampling without replacement, what is the
probability of having exactly x genes of category A?
• The probability that a certain category occurs x times just by
chance in the list of differentially regulated genes is
modelled by a hypergeometric distribution with parameters
(N, M, K):
$$P(X = x) = \frac{\binom{M}{x}\binom{N-M}{K-x}}{\binom{N}{K}}$$
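The hypergeometric probability with parameters (N, M, K) can be evaluated directly from binomial coefficients. The function name below is an illustrative choice; the sketch uses only Python's standard library and exact integer arithmetic until the final division.

```python
from math import comb


def hypergeom_pmf(x, N, M, K):
    """P(X = x): exactly x of the K selected genes fall in category A,
    when M of the N genes on the array belong to A."""
    # Values outside the support have probability zero.
    if x < max(0, K - (N - M)) or x > min(K, M):
        return 0.0
    return comb(M, x) * comb(N - M, K - x) / comb(N, K)


# Small check: N=10 genes, M=4 in A, K=3 selected.
print(hypergeom_pmf(1, 10, 4, 3))
```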
Hypergeometric Distribution (2/2)
• So, under the null hypothesis, the p_value of having x or
more genes in A will be:
$$\mathrm{pvalue} = P(X \ge x \mid H_0) = \sum_{k=x}^{K} \frac{\binom{M}{k}\binom{N-M}{K-k}}{\binom{N}{K}}$$
• This corresponds to a one-sided test in which small p_values
relate to over-represented GO terms.
• For under-represented categories, the p_value can be calculated as
1 - p_value, that is, P(X < x | H0).
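A minimal sketch of the one-sided p_value P(X ≥ x | H0), assuming the same (N, M, K) notation; the function name is illustrative. The numerator is accumulated as an exact integer and divided once at the end, which avoids adding many tiny floating-point terms.

```python
from math import comb


def enrichment_pvalue(x, N, M, K):
    """Upper-tail hypergeometric probability P(X >= x)."""
    lower = max(x, 0, K - (N - M))   # smallest value in the support >= x
    upper = min(K, M)                # largest value in the support
    tail = sum(comb(M, k) * comb(N - M, K - k) for k in range(lower, upper + 1))
    return tail / comb(N, K)


# Small check: N=10 genes, M=4 in A, K=3 selected, x=3 observed in A.
print(enrichment_pvalue(3, 10, 4, 3))
```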
Disadvantages
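• The hypergeometric distribution is rather difficult
and time-consuming to calculate when N is large.
• One can prove that, with M/N held fixed,
$$\mathrm{Hyp}(N, M, K) \xrightarrow{N \to \infty} \mathrm{Bin}\left(K, \frac{M}{N}\right)$$
• Using this approximation, the p_value for over-represented GO terms can be calculated as
$$\mathrm{pvalue} \approx 1 - \sum_{i=0}^{x-1} \binom{K}{i} \left(\frac{M}{N}\right)^{i} \left(1 - \frac{M}{N}\right)^{K-i}$$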
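The binomial approximation Bin(K, M/N) to the p_value for over-representation can be sketched as below; the function name is illustrative. The code sums the upper tail from i = x to K directly, which is mathematically the same quantity as one minus the sum from 0 to x-1 but avoids the loss of precision that subtraction from 1 causes when the p_value is very small.

```python
from math import comb


def binomial_pvalue(x, N, M, K):
    """Approximate P(X >= x) using X ~ Bin(K, M/N)."""
    p = M / N
    return sum(comb(K, i) * p**i * (1 - p) ** (K - i) for i in range(max(x, 0), K + 1))


# Small check: K=2 draws with M/N = 0.5, at least one success.
print(binomial_pvalue(1, 10, 5, 2))
```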
Alternative approaches
• Let us assume
              Differentially regulated genes (D)    Dc     Genes on microarray
Category A    n11                                   n12    N1.
Ac            n21                                   n22    N2.
Total         N.1                                   N.2    N..
where N=N.., M=N1., K=N.1 and x=n11
• Using this notation, alternatives include:
o χ² test for equality of two proportions
o Fisher’s Exact Test
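For a 2×2 table in the n11, n12, n21, n22 notation, the χ² statistic for equality of two proportions has a closed form, and with one degree of freedom its p_value can be obtained from the complementary error function. The sketch below applies no continuity correction and assumes all four marginal totals are non-zero; the function name is an illustrative choice.

```python
from math import erfc, sqrt


def chi_square_2x2(n11, n12, n21, n22):
    """Return (chi-square statistic, p_value) for a 2x2 contingency table."""
    row1, row2 = n11 + n12, n21 + n22      # N1., N2.
    col1, col2 = n11 + n21, n12 + n22      # N.1, N.2
    total = row1 + row2                    # N..
    statistic = total * (n11 * n22 - n12 * n21) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = erfc(sqrt(statistic / 2.0))
    return statistic, p_value


stat, p = chi_square_2x2(10, 20, 30, 40)
print(f"chi2 = {stat:.4f}, p = {p:.4f}")
```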
Fisher’s Exact Test
• This test treats the marginal totals as fixed and
uses the hypergeometric distribution to calculate
the probability of observing each individual table.
• One can calculate a table containing all possible
combinations of n11, n12, n21 and n22.
• The p_value for a particular occurrence is the sum
of all probabilities lower than or equal to the
probability corresponding to the observed
combination.
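The procedure can be sketched in plain Python: with the margins fixed, every possible table is determined by its n11 entry, so the code enumerates n11, weights each table by its hypergeometric numerator, and sums the weights that do not exceed the observed one. The function name is illustrative; comparisons are done on exact integers so that ties are handled without floating-point tolerance.

```python
from math import comb


def fisher_exact_two_sided(n11, n12, n21, n22):
    """Two-sided Fisher's exact test p_value for a 2x2 table."""
    row1 = n11 + n12          # N1.
    row2 = n21 + n22          # N2.
    col1 = n11 + n21          # N.1
    total = row1 + row2       # N..

    def weight(a):
        # Unnormalised hypergeometric probability of the table with n11 = a.
        return comb(row1, a) * comb(row2, col1 - a)

    lowest = max(0, col1 - row2)
    highest = min(col1, row1)
    observed = weight(n11)
    extreme = sum(weight(a) for a in range(lowest, highest + 1) if weight(a) <= observed)
    return extreme / comb(total, col1)


print(fisher_exact_two_sided(3, 1, 1, 3))
```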
Correction for Multiple Tests
• As the number of GO terms tested for
significance is large, p_values have to be
corrected for multiple testing. For instance:
o Methods controlling the False Discovery Rate (FDR):
▪ Benjamini and Hochberg (assuming independence)
▪ Benjamini and Yekutieli (dropping independence)
o Methods controlling the Family-Wise Error Rate (FWER):
▪ Holm correction
▪ Westfall and Young
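Two of these corrections are simple enough to sketch directly: the Benjamini and Hochberg step-up adjustment (FDR) and the Holm step-down adjustment (FWER). The function names are illustrative, and both return adjusted p_values in the same order as the input list; this is a sketch rather than a replacement for a statistics package.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p_values (controls the FDR)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p_value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def holm(pvalues):
    """Holm adjusted p_values (controls the FWER)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    # Walk from the smallest p_value up, enforcing monotonicity.
    for rank, i in enumerate(order):
        running_max = max(running_max, min(1.0, pvalues[i] * (m - rank)))
        adjusted[i] = running_max
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(raw))
print(holm(raw))
```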
Example
N = 9177 genes on microarray
M = 467 in GO category A
N-M = 8710 in Ac
K = 173 genes picked randomly
x = 51 genes of category A
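The numbers of this example can be plugged into the exact hypergeometric upper tail. The sketch below, using only Python's standard library, also prints the number of category A genes expected by chance (K·M/N) so that it can be compared with the observed x; variable names are illustrative.

```python
from math import comb

N, M, K, x = 9177, 467, 173, 51

# Number of category A genes expected among K random picks.
expected = K * M / N

# Exact P(X >= x | H0): integer sum of table counts, divided once.
tail = sum(comb(M, k) * comb(N - M, K - k) for k in range(x, min(K, M) + 1))
p_value = tail / comb(N, K)

print(f"expected genes in A by chance: {expected:.2f}")
print(f"observed genes in A: {x}")
print(f"P(X >= {x} | H0) = {p_value:.3e}")
```

Since the observed count is several times the expected one, the resulting p_value is extremely small, which points to over-representation of category A in the selected genes.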