The Statistical Significance
of Max-gap Clusters
Rose Hoberman
David Sankoff
Dannie Durand
Glycolysis Pathway
Glycolysis Clusters
Clostridium acetobutylicum
Gene Clustering for Functional
Inference in Bacterial Genomes
The Use of Gene Clusters to Infer Functional Coupling,
Overbeek et al., PNAS 96: 2896-2901, 1999.
original genome
large scale duplication
or speciation event
rearrangement,
mutation
Gene content and order
are preserved
Similarity in gene content
Neither content nor order is strictly preserved
“Evolution of gene order conservation
in prokaryotes”
Tamames, Genome Biology 2, 2001
Gene insertion/loss
Local rearrangement
Two Possible Questions
1. Given a set of genes that we
believe are functionally related,
determine if they cluster together
spatially more than we would
expect by chance
Reference set scenario
2. Identify all significantly conserved
gene clusters as a starting point
for making functional inferences
Whole genome comparison
Reference Set Scenario
• Model of a genome
– G = 1, …, n; an ordered set of n unique genes
– assume genes do not overlap
– chromosome breaks ignored
• Reference gene scenario:
– m genes of interest (in red) are pre-specified
– want to find clusters of (a subset of) these genes
Whole Genome Scenario
Given: two genomes: G = 1, …, n and H = 1, …, n
G
H
Find all significant clusters of at least
k homologs in close proximity in both genomes.
Outline
• What formalisms do we need to address
these questions?
– Definitions: formulate a cluster definition
– Algorithms: identifying clusters in real data
– Statistics: assess the significance of one or more
clusters
• Reference set scenario
• Whole genome comparison
• Conclusion
Why develop a formal statistical model?
• Understand trends and verify that they match
our expectations
• Choose parameters effectively
• Statistical tests for data analysis
Typically researchers use randomization
tests to estimate statistical significance
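A randomization test of this kind can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function names, the uniform placement of the m reference genes among n positions, and the convention that a gap is the number of non-reference genes between two adjacent reference genes are all assumptions.

```python
import random

def largest_max_gap_cluster(positions, g):
    """Size of the largest run of marked positions whose consecutive gaps are <= g.

    Assumed convention: the gap between two marked genes is the number of
    unmarked genes lying between them.
    """
    best = run = 0
    prev = None
    for p in sorted(positions):
        if prev is not None and p - prev - 1 <= g:
            run += 1          # close enough to extend the current cluster
        else:
            run = 1           # gap too large (or first gene): start a new cluster
        best = max(best, run)
        prev = p
    return best

def randomization_p_value(n, m, g, observed_size, trials=10000, seed=0):
    """Estimate P(largest cluster >= observed_size) under random gene order."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Place the m reference genes uniformly at random among n positions.
        positions = rng.sample(range(n), m)
        if largest_max_gap_cluster(positions, g) >= observed_size:
            hits += 1
    return hits / trials
```

The estimate is only as fine as 1/trials, which is one reason an exact model is attractive when very small significance levels are of interest.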
Cluster Definitions
• An intuitive notion of a cluster is a group of genes
– occurring in close proximity
– with neither gene content nor order strictly conserved
• Algorithms and statistics require a formal definition.
– What properties are desirable?
– Do existing definitions have these properties?
Possible Cluster Parameters
– size: number of red genes in the cluster
• Example: cluster size ≥ 3 (figure: size = 3 genes)
– length: number of genes between first and last red
genes
• Example: cluster length ≤ 6 (figure: length = 6)
– density: proportion of red genes (size/length)
• Example: density ≥ 0.5 (figure: density = 6/11)
– compactness: maximum gap between adjacent red
genes
• Example: gap ≤ 4 genes
Max-Gap Cluster
gap g
• Commonly used in analysis of genomic data
• Desirable properties
– Ensures minimum local density
– Extensible: doesn’t artificially limit cluster length
– Disjoint: clusters will not overlap
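In the reference set scenario, max-gap clusters can be found in a single pass along the genome. The sketch below assumes a gap is the number of non-reference genes between adjacent reference genes, and that cluster length is the inclusive span from the first to the last red gene (which is how a density such as 6/11 would arise). The names max_gap_clusters and describe are illustrative, not taken from the talk.

```python
def max_gap_clusters(genome, reference, g):
    """Split the reference genes into max-gap clusters along one genome.

    genome    -- list of gene identifiers in chromosomal order
    reference -- set of genes of interest (the "red" genes)
    g         -- maximum number of other genes allowed between adjacent red genes
    """
    clusters = []
    current = []
    last_pos = None
    for pos, gene in enumerate(genome):
        if gene not in reference:
            continue
        # A gap larger than g closes the current cluster, so clusters are disjoint.
        if last_pos is not None and pos - last_pos - 1 > g:
            clusters.append(current)
            current = []
        current.append((pos, gene))
        last_pos = pos
    if current:
        clusters.append(current)
    return clusters

def describe(cluster):
    """Size, length (inclusive span, an assumption) and density of one cluster."""
    size = len(cluster)
    length = cluster[-1][0] - cluster[0][0] + 1
    return {
        "genes": [gene for _, gene in cluster],
        "size": size,
        "length": length,
        "density": size / length,
    }
```

Because a cluster is only closed when a gap exceeds g, its length is never capped, which is the extensibility property listed above.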
Outline
• Formalisms
• Reference set scenario
• Whole genome comparison
• Conclusion
Formalisms
• Definitions: formulate a cluster definition
• Algorithms: identify clusters in real data
• Statistics: assess the significance of a cluster
A Statistical Model
• Given
– a genome: G = 1, …, n, an ordered set of n unique genes
– a set of m reference genes
– a maximum-gap size g
• Null hypothesis:
– Random gene order
• Alternate hypotheses:
– Evolutionary history
– Functional selection
Statistics of Max-Gap Gene Clusters
• We provide
– analytical and dynamic programming solutions
– to determine cluster significance exactly
– for the reference set scenario
Hoberman, Sankoff and Durand. In “Proceedings of the RECOMB
Satellite Workshop on Comparative Genomics”, J. Lagergren, ed.,
Lecture Notes in Bioinformatics, Springer Verlag, in press.
Hoberman, Sankoff, Durand. Submitted to RECOMB 2005.
Test Statistic: Complete Clusters
The probability of observing all m reference genes in
a max-gap cluster in G
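Under the random gene order null hypothesis, this probability can be obtained by direct counting. The sketch below is one such count, not necessarily the authors' formulation: it assumes each of the C(n, m) placements of the reference genes is equally likely, and that a gap is the number of non-reference genes between adjacent reference genes. The function name is illustrative.

```python
from math import comb

def complete_cluster_probability(n, m, g):
    """P(all m reference genes form one max-gap cluster) under random placement."""
    if not 1 <= m <= n:
        raise ValueError("need 1 <= m <= n")
    max_total = n - m  # at most n - m non-reference genes can sit inside the cluster
    # ways[s]: number of gap tuples built so far (each gap in 0..g) with total s
    ways = [1] + [0] * max_total
    for _ in range(m - 1):
        new = [0] * (max_total + 1)
        for s, w in enumerate(ways):
            if w:
                for gap in range(min(g, max_total - s) + 1):
                    new[s + gap] += w
        ways = new
    # A cluster spanning m + s genes can start at n - (m + s) + 1 positions.
    favorable = sum(w * (max_total - s + 1) for s, w in enumerate(ways))
    return favorable / comb(n, m)
```

For example, with n = 8, m = 3 and g = 1 there are 20 favorable placements out of 56.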
Test Statistic: Incomplete Clusters
The probability of observing at least h of the m
reference genes in a max-gap cluster in G
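The talk reports dynamic programming solutions for this statistic but the slides do not show them. The following is one straightforward dynamic program written for illustration only, under the same assumptions as before (uniform random placement; gap = number of non-reference genes between adjacent reference genes). It counts placements in which every cluster is smaller than h and takes the complement. The function name and state encoding are hypothetical.

```python
from collections import defaultdict
from math import comb

def prob_cluster_at_least(n, m, g, h):
    """P(some max-gap cluster contains >= h of the m reference genes)."""
    # State: (j, r, d)
    #   j = reference genes placed so far
    #   r = size of the currently open cluster (0 if none is open)
    #   d = non-reference genes seen since the last reference gene (only if r > 0)
    states = {(0, 0, 0): 1}
    for _ in range(n):
        nxt = defaultdict(int)
        for (j, r, d), ways in states.items():
            # Case 1: this position holds a non-reference gene.
            if r > 0 and d + 1 <= g:
                nxt[(j, r, d + 1)] += ways      # cluster can still be extended
            else:
                nxt[(j, 0, 0)] += ways          # gap exceeded g: cluster closed
            # Case 2: this position holds a reference gene.
            # Keep only placements whose clusters all stay below size h.
            if j < m and r + 1 < h:
                nxt[(j + 1, r + 1, 0)] += ways
        states = nxt
    no_large_cluster = sum(w for (j, _, _), w in states.items() if j == m)
    return 1 - no_large_cluster / comb(n, m)
```

The number of states grows with m, h and g, so this sketch is meant for moderate parameter values rather than as an optimized implementation.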
Cluster significance
Plot panels: n = 1000, m = 50; n = 500, h = m/2
• n = number of genes in each genome
• m = number of genes shared between the two genomes
• g = maximum allowed gap size
• h = size of cluster (e.g. number of red genes)
Significant Parameter Values (α = 0.0001)
n = 500
Formalisms
• Definitions: formulate a cluster definition
• Algorithms: identify clusters in real data
• Statistics: assess the significance of one or
more clusters
Whole genome comparison
g 10
g 10
Find all sets of genes that form max-gap
clusters in both genomes.
Properties of Max-Gap Clusters for
Whole Genome Comparison
• Clusters are locally dense in both genomes
• Clusters are still guaranteed to be disjoint.
• The definition is symmetric with respect to
genome
Most existing cluster algorithms are not symmetric!
Algorithms: Finding Max-Gap Clusters
If g = 2
• There is no valid max-gap cluster of
size two or three
• There is a valid max-gap cluster of
size four
Algorithms: Finding Max-Gap Clusters
• A consequence of this is that a greedy iterative
approach will not find all max-gap clusters
– Specifically, larger clusters that don’t contain smaller
ones will not be found
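The point can be checked by brute force on a small made-up instance. The sketch below is not the slide's figure (which used g = 2) and not the Bergeron et al. algorithm: it simply enumerates every subset of shared genes and keeps those whose adjacent members are separated by at most g other genes in both genomes. The genomes G and H, the value g = 0 and the function names are hypothetical.

```python
from itertools import combinations

def is_chain(gene_set, genome, g):
    """True if adjacent members of gene_set are separated by <= g other genes."""
    positions = sorted(i for i, gene in enumerate(genome) if gene in gene_set)
    return all(b - a - 1 <= g for a, b in zip(positions, positions[1:]))

def common_chains(G, H, g, min_size=2):
    """All gene sets of size >= min_size that satisfy the gap bound in both genomes.

    Exhaustive enumeration: exponential time, for small illustrations only.
    """
    shared = sorted(set(G) & set(H))
    found = []
    for k in range(min_size, len(shared) + 1):
        for subset in combinations(shared, k):
            s = set(subset)
            if is_chain(s, G, g) and is_chain(s, H, g):
                found.append(s)
    return found

def maximal_chains(G, H, g, min_size=2):
    """Keep only the sets not strictly contained in another qualifying set."""
    chains = common_chains(G, H, g, min_size)
    return [s for s in chains if not any(s < t for t in chains)]

# Hypothetical example: four genes, no gaps allowed (g = 0).
G = [1, 2, 3, 4]
H = [2, 4, 1, 3]
sizes = sorted(len(s) for s in common_chains(G, H, 0))
```

Here sizes is [4]: the only qualifying set is {1, 2, 3, 4}, with no qualifying pair or triple inside it, so a procedure that grows clusters one gene at a time from smaller valid clusters would never reach it.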
Algorithms: Finding Max-Gap Clusters
There is an efficient divide-and-conquer algorithm to
find all max-gap clusters (Bergeron et al, 2002)
Since algorithms are generally not stated formally in
application papers, we don’t know whether people
are actually getting what they think they’re getting
Formalisms
• Definitions: formulate a cluster definition
• Algorithms: identify clusters in real data
• Statistics: assess the significance of one or
more clusters
Work in Progress…
Statistics: Whole genome comparison
g 10
g 10
What is the probability that at least k genes form a
max-gap cluster in both genomes?
Assuming identical gene content, the probability
of finding a max-gap cluster of size at least k is
always one!
An Example
Example: g = 1
• A cluster of size k does not necessarily
contain a cluster of size k-1
• When gene content is identical, there will always
be a cluster of size n
• Therefore, for all k, there will always be a cluster
of size at least k
• Therefore, the probability of finding a cluster of size at
least k is always one!
Relaxing the Assumption of Identical
Gene Content
• Assume only m of the n genes in each genome
are shared
• If the longest run of “non-shared” genes is less
than g then we are still guaranteed to find a
complete cluster
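The sufficient condition above is easy to test on a pair of gene orders. The sketch below applies the slide's strict "less than g" condition to both genomes; the function names and the example genomes are hypothetical, and genes are treated as shared when the same identifier appears in both lists.

```python
def longest_nonshared_run(genome, shared):
    """Longest run of non-shared genes lying between two shared genes."""
    longest = run = 0
    seen_shared = False
    for gene in genome:
        if gene in shared:
            if seen_shared:
                longest = max(longest, run)  # run is bounded by shared genes on both sides
            seen_shared = True
            run = 0
        else:
            run += 1
    return longest  # non-shared genes before the first or after the last shared gene are ignored

def complete_cluster_guaranteed(G, H, g):
    """True when the slide's sufficient condition holds in both genomes."""
    shared = set(G) & set(H)
    return all(longest_nonshared_run(genome, shared) < g for genome in (G, H))

# Hypothetical example with shared genes a, b, c.
G = ["a", "x", "b", "y", "z", "c"]
H = ["c", "b", "w", "a"]
```

A False result only means the condition does not apply; a complete cluster may still exist.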
More generally…
Simulations of randomly ordered genomes
show that large clusters may be very likely
to occur merely by chance
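A simulation of this kind can be sketched for the simplest case, a cluster containing all m shared genes. The model here is an assumption rather than the authors' setup: in each genome the m shared genes occupy positions chosen uniformly at random among n, independently of the other genome, and a gap is the number of non-shared genes between adjacent shared genes. Under that model the full set of shared genes is a max-gap cluster exactly when no such gap exceeds g in either genome.

```python
import random

def shared_genes_form_chain(n, m, g, rng):
    """Randomly place m shared genes among n positions; True if no gap exceeds g."""
    positions = sorted(rng.sample(range(n), m))
    return all(b - a - 1 <= g for a, b in zip(positions, positions[1:]))

def simulate_complete_cluster(n, m, g, trials=2000, seed=0):
    """Estimate P(all m shared genes form a max-gap cluster in both genomes)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        in_first = shared_genes_form_chain(n, m, g, rng)
        in_second = shared_genes_form_chain(n, m, g, rng)
        if in_first and in_second:
            hits += 1
    return hits / trials
```

A call such as simulate_complete_cluster(1000, 250, 20) estimates this probability for one parameter setting; the value obtained depends on the gap convention and on the random model assumed here.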
Unexpected Statistical Trends
• There can be a significant
probability of finding a cluster
that includes all homologous
gene pairs
• The significance of a cluster of
size k can be less than that of a
cluster of size k-1
• Probabilities are not monotonic
• Large clusters may not be
significant
Plot: n = 1000, m = 250, g = 20; probability of a cluster
of size 250 ~ 50%
Clusters Are Used in Many Other Applications
Inferring functional coupling of genes in bacteria (Overbeek et al 1999)
Recent polyploidy in Arabidopsis (Blanc et al 2003)
Sequence of the human genome (Venter et al 2001)
Duplications in Arabidopsis through comparison with rice (Vandepoele et al 2002)
Duplications in Eukaryotes (Vision et al 2000)
Identification of horizontal transfers (Lawrence and Roth 1996)
Evolution of gene order conservation in prokaryotes (Tamames 2001)
Ancient yeast duplication (Wolfe and Shields 1997)
Genomic duplication during early chordate evolution (McLysaght et al 2002)
Comparing rates of rearrangements (Coghlan and Wolfe 2002)
Genome rearrangements after duplication in yeast (Seoighe and Wolfe 1998)
Operon prediction in newly sequenced bacteria (Chen et al 2004)
Breakpoints as phylogenetic features (Blanchette et al 1999)
...
Max-Gap Clusters are Especially Common
Formal statistical models allow us to
– understand trends and verify that they match
our expectations,
– choose parameters effectively
– conduct statistical tests for data analysis
Formal statistical models require
– a formal cluster definition
– a search procedure to find clusters
These issues are more complicated than
they might seem!
Summary
Results: statistical tests of significance for max-gap
clusters
• Reference set scenario
• Genome comparison (work in progress)
We need to
• explicitly consider the cluster properties we would like
our definitions to satisfy
• rigorously evaluate whether our definition meets these
requirements
• carefully prove that our search procedures match our
stated definitions
Thank You