Statistics and bioinformatics
applied to –omics technologies
Part II: Integrating biological knowledge
Frédéric Schütz
[email protected]
Bioinformatics Core Facility
Swiss Institute of Bioinformatics
Center for Integrative Genomics
University of Lausanne, Switzerland
Contents
• Class prediction
• Gene Ontology analysis
• Geneset analysis (GSEA, etc.)
Class discovery and class prediction
• Example: patients from which we obtained measurements (e.g. gene expression)
• Class discovery: find natural groups in the data (e.g. sets of patients with similar gene expression)
• Class prediction: given previous measurements for which the grouping is known (red and blue), can we predict the group to which a new observation belongs ?
[Figure: two scatter plots of Gene 2 against Gene 1, one for class discovery and one for class prediction; in the class prediction plot the new observation is marked “?”]
Why do we want to do class prediction ?
• Many questions in biology and medicine are “class prediction” questions:
– Does a patient have a predisposition for a given disease ?
– What is the prognosis for this patient ?
– What will be the response of this patient to a given drug ?
– Is this tumour benign or malignant ?
– What type is this tumour ?
– Which treatment should be used ?
Class prediction: easy case
[Figure: scatter plot of Gene 2 against Gene 1 with a threshold on Gene 1; everything on one side of the threshold is classified as “blue”, everything on the other side as “red”]
Example
Blue points represent “oestrogen receptor (ER) status positive” determined
by immunohistochemistry.
Pierre Farmer et al. Identification of molecular apocrine breast
tumours by microarray analysis. Oncogene (2005) 24, 4660–4671
Class prediction: in practice
[Figure: scatter plot of Gene 2 against Gene 1]
• The two groups are not perfectly separated (and this is still a pretty good
case…)
• One variable (gene) is not sufficient to assign patients to groups
• Remember that with microarrays, we are not talking about just 2 measurements, but tens of thousands.
Discrimination in general
• Goal: assign objects (e.g. patients) to classes based on some
measurements (e.g. gene expression)
• Typically, in a microarray setting:
– Tens or (at best) hundreds of patients
– Tens of thousands of genes
• Unsupervised learning: nothing is known about the grouping of
the data, and we try to find natural groups in the data
• Supervised learning: the classes are predefined; we use
previously labelled objects to create a procedure for classification
of future observations.
Some supervised analysis methods
• K-nearest neighbours
• Linear Discriminant Analysis
• Classification trees
• Support Vector Machines (SVM)
• etc.
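Example: 3-nearest neighbours
[Figure: scatter plot of Gene 2 against Gene 1 with a new, unlabelled point: red or blue ?]
Example: 3-nearest neighbours
[Figure: same plot with the 3 nearest neighbours of the new point highlighted]
2 red vs 1 blue: the point is assigned to “red”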
K-nearest neighbours
• Choose a value for k (typical values: 3 or 5); in
practice it can be chosen using the learning data
(value that produces the best result)
• Find the k observations in the learning set that are
closest to the new, unknown, observation
• Predict the class by a majority vote, that is, choose
the class that is most common among the
neighbours.
• Very simple method, with surprisingly good
performance
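The three steps above can be sketched in a few lines of Python. This is only a minimal illustration using Euclidean distance; the function name knn_predict and the expression values are made up for the example and do not come from any real dataset.

```python
from collections import Counter
from math import dist

def knn_predict(train_points, train_labels, new_point, k=3):
    """Predict the class of new_point by majority vote among its k nearest neighbours."""
    # Sort the learning set by Euclidean distance to the new observation
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: dist(pair[0], new_point),
    )
    # Keep the labels of the k closest observations
    nearest_labels = [label for _, label in by_distance[:k]]
    # Majority vote: the most common label among the neighbours wins
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical (Gene 1, Gene 2) expression values for six patients
train_points = [(1.0, 1.2), (1.5, 0.8), (2.0, 1.5), (4.0, 4.2), (4.5, 3.8), (5.0, 4.5)]
train_labels = ["blue", "blue", "blue", "red", "red", "red"]

# The 3 nearest neighbours of this point are 1 blue and 2 red
prediction = knn_predict(train_points, train_labels, (3.0, 2.8), k=3)
print(prediction)
```

With two classes, an odd value of k avoids ties in the vote.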
Linear Discriminant Analysis
• Suggested by R.A. Fisher in 1936
• Procedure to find a linear combination of the observed
variables that best separates (discriminates) two
classes of objects.
• Using the “new variable”, objects from the same class are close together, and objects from different classes are further apart.
• Straightforward to calculate
• Can easily be extended to more than two classes
• Similar idea to Principal Component Analysis (PCA)
• Often forgotten in favour of PCA
Back to the easy case
[Figure: scatter plot of Gene 2 against Gene 1 with a threshold on Gene 1. Here the discriminant is simply Gene 1: points with a low value of the discriminant are classified as “blue”, points with a high value as “red”]
Linear Discriminant Analysis: Example
[Figure: scatter plot of Gene 2 against Gene 1]
• The two groups are well separated
• Neither Gene1 nor Gene2 alone is able to discriminate between the two categories
Linear Discriminant Analysis: Example
[Figure: scatter plot of Gene 2 against Gene 1]
• However, the linear combination L = Gene1 + Gene2 discriminates well between the two groups
• Blue points tend to have smaller L values
• Red points tend to have larger L values
Linear Discriminant Analysis: Example
[Figure: scatter plot of Gene 2 against Gene 1 with the threshold drawn between the two groups]
• A threshold is set in between the means of the two groups
• Points with a value L above the threshold are classified as red
• Points with a value L below the threshold are classified as blue
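As a rough sketch of how such a discriminant can be computed for two genes, the code below uses Fisher's criterion: the weights are the inverse of the pooled within-class scatter matrix applied to the difference of the class means, and the threshold is placed midway between the projected means. The data points and the function names (fisher_lda, classify) are hypothetical and chosen so that neither gene separates the groups on its own.

```python
def mean(points):
    """Mean of a list of 2-D points."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(2)]

def scatter(points, m):
    """2x2 scatter matrix: sum of outer products of deviations from the mean m."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - m[0], p[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def fisher_lda(blue, red):
    """Return the weights w of L = w[0]*Gene1 + w[1]*Gene2 and the threshold on L."""
    m_b, m_r = mean(blue), mean(red)
    s_b, s_r = scatter(blue, m_b), scatter(red, m_r)
    # Pooled within-class scatter matrix
    sw = [[s_b[i][j] + s_r[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    diff = [m_r[0] - m_b[0], m_r[1] - m_b[1]]
    # w = inverse(sw) * (mean_red - mean_blue), using the explicit 2x2 inverse
    w = [(sw[1][1] * diff[0] - sw[0][1] * diff[1]) / det,
         (-sw[1][0] * diff[0] + sw[0][0] * diff[1]) / det]
    # Threshold midway between the projected means of the two groups
    l_blue = w[0] * m_b[0] + w[1] * m_b[1]
    l_red = w[0] * m_r[0] + w[1] * m_r[1]
    return w, (l_blue + l_red) / 2

def classify(point, w, threshold):
    """Points with L above the threshold are red, the others blue."""
    score = w[0] * point[0] + w[1] * point[1]
    return "red" if score > threshold else "blue"

# Hypothetical (Gene 1, Gene 2) values: the ranges of each gene overlap between groups
blue = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0), (1.5, 2.4), (2.5, 1.6)]
red = [(2.0, 4.0), (3.0, 3.0), (4.0, 2.0), (2.5, 3.6), (3.5, 2.4)]

w, threshold = fisher_lda(blue, red)
print(w, threshold)
```

On this toy data the two weights come out nearly equal, so the discriminant is close to a multiple of Gene1 + Gene2.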
Caveats: Overfitting
• It is easy to create classifiers which fit the training data perfectly
• It is harder to find classifiers which still work as well when
validated on new data
• A classifier must ALWAYS be tested on data independent from the data used to train it.
• This is particularly important in microarray analysis:
– Few samples
– Many different measurements
• If not careful, it is always possible to find a classifier that works
well for your training data !
Caveats: Overfitting
[Figure: scatter plot of Gene 2 against Gene 1 with a marked region labelled “Classify everything in this region as red”]
• Perfect classifier for this data
• Probably not so good with any new data
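The effect can be illustrated with a small simulation. This is a sketch under artificial assumptions: the "expression values" are pure random noise, the labels are assigned at random, and a 1-nearest-neighbour classifier stands in for any overly flexible method. All names and sizes are invented for the example.

```python
import random
from math import dist

random.seed(0)

def random_dataset(n_samples, n_genes):
    """Pure-noise 'expression' values with labels assigned at random."""
    points = [tuple(random.gauss(0, 1) for _ in range(n_genes)) for _ in range(n_samples)]
    labels = [random.choice(["red", "blue"]) for _ in range(n_samples)]
    return points, labels

def one_nn_predict(train_points, train_labels, new_point):
    """Return the label of the single closest training observation."""
    nearest = min(range(len(train_points)), key=lambda i: dist(train_points[i], new_point))
    return train_labels[nearest]

def accuracy(train_points, train_labels, points, labels):
    """Fraction of points whose predicted label matches the given label."""
    hits = sum(one_nn_predict(train_points, train_labels, p) == y
               for p, y in zip(points, labels))
    return hits / len(points)

# Few samples, many measurements, as in a microarray setting
train_x, train_y = random_dataset(20, 500)
test_x, test_y = random_dataset(200, 500)

train_acc = accuracy(train_x, train_y, train_x, train_y)  # evaluated on the training data
test_acc = accuracy(train_x, train_y, test_x, test_y)     # evaluated on independent data
print(train_acc, test_acc)
```

The classifier reproduces its own training labels perfectly, even though there is nothing to learn; on independent data it should do no better than chance.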
Gene Ontology analysis
• Many microarray experiments produce lists of genes that are significantly differentially expressed between two conditions (gene comparison).
• In some (rare) cases, only a few genes are of interest, and they can easily be examined and validated.
• In most cases, however, a long list of differentially expressed genes is returned, and these genes cannot be considered individually.
• It is harder to obtain biological understanding from such a list.
• One strategy: consider the functional annotation of the differentially
expressed genes.
• Question: what do these genes have in common that could be of
interest ?
Reminder: Gene Ontology (GO) project
• Collaborative effort to address the need for consistent
descriptions of gene products in different databases.
• Three structured, controlled vocabularies (ontologies) that
describe gene products in terms of their associated
– biological processes
– cellular components
– molecular functions
in a species-independent manner.
(From http://www.geneontology.org/)
Example
PPARA, NR1C1, PPAR: Peroxisome proliferator-activated receptor alpha
(TAS: Traceable Author Statement, IPI: Inferred from Physical Interaction)
(From http://www.geneontology.org/)
Example of GO analysis
• Simple microarray experiment: WT vs KO
• The microarray has 10,000 genes, 100 of which have GO annotation “fatty acid
transport”
• I obtain 1000 differentially expressed genes (10% of all genes)
[Figure: pie chart of the 10,000 genes in total: 1000 genes (10%) differentially expressed, 90% not]
• If my experiment has nothing to do with “fatty acid transport”, I expect on average about 10% of these 100 genes (i.e. 10) to be differentially expressed.
• If this proportion is higher, it means the list of
differentially-expressed genes is enriched in
“fatty acid transport” genes
• If the difference is significant, it suggests a link
between differential expression and this GO
annotation: genes with this annotation are more
likely to be differentially expressed than others
• This indicates that this biological process may
be related to my KO experiment.
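[Figure: number of “fatty acid transport” genes among the 1000 differentially expressed genes (10% of the 10,000 genes in total), on a scale from 0 (0%) to 100 (100%). 10 genes (10%) differentially expressed and 90 (90%) not looks like a random distribution: no apparent association. 100 genes (100%) would be a strong association. Values in between are marked “?”]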
Statistical analysis
• Assume that I found 20 differentially expressed genes with the GO annotation of interest.
• Count the numbers of genes with or without the GO annotation, and compare with differential expression:

                          Differentially expressed   Not D.E.   Total
“Fatty acid transport”                          20         80     100
Others                                         980       8920    9900
Total                                         1000       9000   10000

• A statistical test such as Fisher’s exact test can tell us the probability of observing this result (or a more extreme one) if there is no association between the rows and columns
• In this case, this probability (p-value) is 0.002
• This indicates that this biological process may be important in the difference between WT and KO.
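The calculation behind such a test can be sketched with the hypergeometric distribution. The function below (a hypothetical helper, not taken from any package) computes the one-sided version of Fisher's exact test, i.e. the probability of seeing at least the observed number of annotated genes among the differentially expressed ones if the list were drawn at random. A two-sided test, as many programs report by default, may give a slightly different value.

```python
from math import comb

def enrichment_p_value(total_genes, annotated, selected, annotated_selected):
    """One-sided Fisher's exact test: P(X >= annotated_selected).

    X is the number of annotated genes found when 'selected' genes are drawn
    at random, without replacement, from 'total_genes' genes of which
    'annotated' carry the annotation (hypergeometric distribution).
    """
    denominator = comb(total_genes, selected)
    upper = min(annotated, selected)
    tail = sum(
        comb(annotated, k) * comb(total_genes - annotated, selected - k)
        for k in range(annotated_selected, upper + 1)
    )
    return tail / denominator

# Numbers from the table: 10,000 genes, 100 annotated, 1000 D.E., 20 annotated and D.E.
expected = 100 * 1000 / 10_000  # annotated genes expected among the D.E. genes by chance
p = enrichment_p_value(10_000, 100, 1000, 20)
print(expected, p)
```

Exact integer arithmetic is used for the binomial coefficients, so no rounding occurs until the final division.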
In practice
• One can either suggest a GO annotation and see if it
is enriched in the list of differentially expressed genes
• Or we may want to go “fishing” and try all potentially
interesting GO annotations to see if any of them is
enriched.
• Easy to do
• Multiple services available on the web
– User indicates the list of genes differentially expressed
– Returns the most significant GO annotations
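What such a service does can be sketched as a loop over annotations: for each one, count the annotated genes in the submitted list and compute an enrichment p-value, then sort. The gene names, annotation contents and function names below are entirely hypothetical, and the one-sided hypergeometric test is only one possible choice of statistic.

```python
from math import comb

def upper_tail(total, annotated, selected, hits):
    """P(X >= hits) under the hypergeometric distribution (one-sided enrichment test)."""
    tail = sum(
        comb(annotated, k) * comb(total - annotated, selected - k)
        for k in range(hits, min(annotated, selected) + 1)
    )
    return tail / comb(total, selected)

def rank_annotations(all_genes, de_genes, annotations):
    """Return (annotation, hits, p_value) tuples, most significant first."""
    results = []
    for name, genes in annotations.items():
        genes = genes & all_genes      # ignore genes that are not on the array
        hits = len(genes & de_genes)   # annotated genes that are differentially expressed
        p = upper_tail(len(all_genes), len(genes), len(de_genes), hits)
        results.append((name, hits, p))
    return sorted(results, key=lambda r: r[2])

# Hypothetical array of 200 genes, 20 of them (10%) differentially expressed
all_genes = {f"g{i}" for i in range(200)}
de_genes = {f"g{i}" for i in range(20)}
annotations = {
    "fatty acid transport": {f"g{i}" for i in range(12)} | {f"g{i}" for i in range(100, 108)},
    "cell cycle": {f"g{i}" for i in range(18, 38)},
    "translation": {f"g{i}" for i in range(150, 180)},
}

ranked = rank_annotations(all_genes, de_genes, annotations)
for name, hits, p in ranked:
    print(name, hits, p)
```

When many annotations are tested in this way, the p-values are usually adjusted for multiple testing; that step is left out of this sketch.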
Gene Ontology analysis: example. I
• Microarray with about 22,000 genes
• We look at the 1% of the genes that differ most between subtypes of cancer.
• Which processes are likely to be different between these subtypes ?
– Those for which more than 1% of the genes are differentially expressed are good
candidates
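[Table: proportion (Prop.) of differentially expressed genes for seven GO annotations, whose names are not preserved: 5%, 19%, 10%, 3%, 5%, 5%, 4%]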
Pierre Farmer et al. Identification of molecular apocrine breast
tumours by microarray analysis. Oncogene (2005) 24, 4660–4671
Gene Ontology analysis: example. II
• To apply this GO analysis, we need first to define a list
of differentially expressed genes.
• This usually means calculating a “score” (e.g. p-value), and selecting a cut-off point.
• While there are some traditional cut-off points (0.001,
0.01 or the “magical” 0.05), they remain fairly arbitrary
– Is there really a difference between a gene associated with
a p-value of 0.049 and another one with a p-value of 0.051 ?
Gene Ontology analysis: example. III
• Some genes may be differentially expressed, but the change
may be so small (lost in the noise) that it will not appear in the
list.
• However, the difference in expression may appear at the level
of a set of genes rather than individual genes
• A set of genes may correspond e.g. to co-regulated genes, or to genes belonging to the same pathway
• If the change of expression is consistent across genes in the
set, it may indicate that the set is of interest, even if no
individual gene shows a significant difference.
Gene set enrichment analysis (GSEA)
• Series of papers describing a method for analyzing the
expression of sets of genes
• Software available, along with a database of biologically
relevant gene sets
• Relatively hot topic in bioinformatics/statistics: many different papers and methods published on the topic, with small or large differences
• GSEA usually refers to this particular program, but
sometimes indicates any such method which examines
sets of genes.
Principle of GSEA
• We have a list of genes sorted according to a given
measure (score for differential expression, correlation
to a phenotype, etc)
• Among this list, we have a smaller set of genes of
interest (e.g. all belonging to a given pathway)
• Is the smaller set distributed randomly in the sorted list
of genes ?
– If yes, the set is less likely to be of interest
– If no, it may indicate that the function represented by the set
is linked with the measure.
Principle of GSEA (most methods)
[Figure: all genes, sorted from high values (e.g. up-regulated) to low values (e.g. down-regulated), with the positions in the list of the genes of our set of interest marked]
The locations of the genes of our set of interest within the list seem random (uniform); the set does not appear to be linked with differential expression.
Principle of GSEA (most methods)
[Figure: all genes, sorted from high values (e.g. up-regulated) to low values (e.g. down-regulated), with two examples of the positions in the list of the genes of our set of interest: one concentrated among the high values (link with up-regulation), the other concentrated among the low values (link with down-regulation)]
Statistical analysis
• “Random walk”:
– The list of genes is walked down from left to right
– Every time a gene belongs to our set S, the score goes up
– Every time a gene does not belong to the set, it goes down
• If the genes of the set are uniformly distributed, the score will never go very high (“up” soon followed by a “down”)
• If the genes of the set are clustered together, the score will go higher before getting back to 0.
• Using a permutation test, a p-value can be associated with the gene set.
From fig. 1 of Subramanian et al. PNAS 2005; 102; 15545-15550
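The random walk described above can be sketched as follows. This is a simplified, unweighted version and not the published algorithm: the step sizes (1/hits up, 1/misses down) are an assumption chosen so that the walk returns to 0 at the end of the list, and the permutation test shuffles the order of the genes, whereas the published method may permute differently. Gene names and function names are invented for the example.

```python
import random

def enrichment_score(ranked_genes, gene_set):
    """Largest deviation from 0 of the running sum along the ranked list.

    Assumes the set contains at least one, but not all, of the ranked genes.
    """
    n_hits = sum(g in gene_set for g in ranked_genes)
    n_miss = len(ranked_genes) - n_hits
    up, down = 1 / n_hits, 1 / n_miss  # step sizes chosen so the walk ends at 0
    running, best = 0.0, 0.0
    for g in ranked_genes:
        running += up if g in gene_set else -down
        if abs(running) > abs(best):
            best = running
    return best

def permutation_p_value(ranked_genes, gene_set, n_perm=1000, seed=0):
    """Share of random gene orderings with a score at least as extreme as observed."""
    rng = random.Random(seed)
    observed = abs(enrichment_score(ranked_genes, gene_set))
    shuffled = list(ranked_genes)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(enrichment_score(shuffled, gene_set)) >= observed:
            extreme += 1
    # Add 1 to both counts so the estimated p-value is never exactly 0
    return (extreme + 1) / (n_perm + 1)

# Hypothetical ranked list of 100 genes, from high to low values
ranked = [f"g{i}" for i in range(100)]
clustered = {f"g{i}" for i in range(10)}       # all at the top of the list
spread = {f"g{i}" for i in range(5, 100, 10)}  # evenly spread along the list

es_clustered = enrichment_score(ranked, clustered)
es_spread = enrichment_score(ranked, spread)
p_clustered = permutation_p_value(ranked, clustered)
print(es_clustered, es_spread, p_clustered)
```

A positive score points to a set concentrated among the high values, a negative score to a set concentrated among the low values.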
Statistical analysis
• How can we summarise and assess an apparent link
between a set and differential expression ?
• Each method uses different statistics
• Original GSEA method based on the Kolmogorov-Smirnov test (compare the distribution of genes with a uniform distribution)
• Later replaced by an “Enrichment Score” (similar but weighted)
Example
• mRNA expression profiles from lymphoblastoid cell
lines derived from 15 males and 17 females
• Identify gene sets correlated with the difference
between males and females
[Table: results not preserved; FDR = False Discovery Rate]
From table 2 of Subramanian et al. PNAS 2005; 102; 15545-15550
Example
• Gene expression patterns from a collection of 50 cancer cell lines
• p53 regulates gene expression in response to various signals of cellular stress
• 33 cell lines carry a mutation on the p53 gene, and 17 are normal.
From table 2 of Subramanian et al. PNAS 2005; 102; 15545-15550
Conclusions
• GeneSet Enrichment Analysis methods have quickly
become widespread in the microarray community.
• Intuitive method
• Can be used to confirm a known or suspected association… (use a given geneset)
• … or to go “fishing” for unknown associations (use a database of genesets)
• More generally, microarray analysis makes increasing use of this kind of external biological knowledge.