Annotation of the human genome


Systems biology:
identification of regulatory regions and
disease causing genes and mechanisms
PhD defense
Peter Van Loo
Promotor: P. Marynen
Co-promotors: B. De Moor
and C. De Wolf-Peeters
Human Genome Laboratory
Department of Human Genetics
May 23rd, 2008
Introduction
Introduction
The genome
Systems biology
Regulatory
regions
Disease genes
Mechanisms
CRM detection
Gene
prioritization
Gene
expression
…
Genes
Expression
profiling of
THRLBCL
Polysomy 17 in
breast cancer
Conclusions
Regulatory
regions
Figure: structure of a gene and its regulatory region. Labels: TSS, 5’ UTR, exon, intron, 3’ UTR, gene, TATA-box, regulatory region, transcription factor binding site.
Disease gene identification:
  Linkage analysis
  Cytogenetics
  Molecular cytogenetics, e.g. array-CGH
Candidate genes…
Validation: labour intensive
→ Computational gene prioritization by data integration
Introduction
Introduction
CRM detection
CRMs
Three types
Principles
ModuleMiner
Validation
Noise sensitivity
Comparison
CRM predictions
Conclusions
Gene
prioritization
Gene
expression
…
Genes
1
Regulatory
regions
Expression
profiling of
THRLBCL
Polysomy 17 in
breast cancer
Conclusions
Computational detection of regulatory regions: cis-regulatory modules
Wasserman and Sandelin, Nat Rev Genetics 2004
A classification of existing CRM detection methods
Type I
Type II
Type III
Computational detection of cis-regulatory modules: method overview
Figure: overview of the method. Labels: human gene and mouse gene (TSS, 5’ UTR, exon 1, exon 2, CDS, 3’ UTR) aligned with LAGAN & VISTA; conserved non-coding sequence (CNS 1) and CNS database; MotifScanner with Transfac and Jaspar and TFBS database; set of coregulated genes as input to ModuleMiner; transcriptional regulatory model (p,v; 211 bp); ModuleScanner; new target genes.
The transcriptional regulatory model (TRM) - ModuleScanner
Transcriptional regulatory model (example: p,v; 211 bp): collection of position weight matrices + parameters
For a transcriptional regulatory model Θ, we can assign a score to a sequence s as:
$$S_\Theta(s) = \max \left[ f(p,t) \cdot \sum_{\text{TFBS } t} \log S(t) \right]$$
  TFBS: transcription factor binding site
  log S(t): logarithmic score of TFBS t
  f(p,t): constraints depending on parameters p (0 if invalid, 1 if all criteria are satisfied)
  max: maximization over all binding sites of all transcription factors in the model
ModuleScanner ranks all genes in the genome
→ Highest ranking genes are putative target genes
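The scoring rule for a transcriptional regulatory model can be sketched in Python. This is only an illustration under stated assumptions, not ModuleScanner itself: the toy position weight matrices, the uniform background, and the reduction of f(p,t) to a single "all sites fit within max_length bp" constraint are hypothetical choices made for the example.

```python
import math
from itertools import product

# Hypothetical toy PWMs: one dict of base -> probability per motif position.
PWMS = {
    "TF_A": [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
             {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1}],
    "TF_B": [{"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
             {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7}],
}
BACKGROUND = 0.25  # assumed uniform background frequency


def site_log_scores(seq, pwm):
    """Return (start, log-odds score) for every window of the sequence."""
    width = len(pwm)
    hits = []
    for start in range(len(seq) - width + 1):
        score = sum(math.log(pwm[i][seq[start + i]] / BACKGROUND)
                    for i in range(width))
        hits.append((start, score))
    return hits


def trm_score(seq, pwms, max_length):
    """Maximum over one site per factor of f(p,t) * sum of log S(t).

    f(p,t) is simplified here to 1 when all chosen sites fit inside a
    window of max_length bp, and 0 otherwise.
    """
    per_factor = [site_log_scores(seq, pwm) for pwm in pwms.values()]
    widths = [len(pwm) for pwm in pwms.values()]
    best = 0.0
    for combo in product(*per_factor):
        starts = [pos for pos, _ in combo]
        ends = [pos + w for (pos, _), w in zip(combo, widths)]
        valid = 1 if max(ends) - min(starts) <= max_length else 0
        best = max(best, valid * sum(score for _, score in combo))
    return best
```

Ranking genes then amounts to calling `trm_score` on each gene's conserved sequence and sorting by the result; the exhaustive enumeration is only practical for short toy sequences.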
ModuleMiner: parameterless CRM detection in sets of co-regulated genes

Search space: all possible TRMs
Figure: six example TRMs of 97 bp, 167 bp, 211 bp, 191 bp, 465 bp and 43 bp, each labelled with its parameters (p, v or p,v).
For each TRM: use ModuleScanner to score all genes
Gene
1  I-KAPPA-B-RELATED PROTEIN
2  WD-REPEAT CONTAINING PROT…
3  PITUITARY HOMEOBOX 2 …
…
Use order statistics to assign a score to the ranks of the co-regulated genes
All genes in genome, ordered by transcriptional regulatory model score
Set of coregulated genes:
Rank   Rank ratio r_i
6      6.8E-4
11     1.2E-3
18     2.0E-3
29     3.3E-3
45     5.1E-3
…
Assign p-value:
$$Q(r_1, r_2, \ldots, r_n) = n! \int_0^{r_1} \int_{s_1}^{r_2} \cdots \int_{s_{n-1}}^{r_n} ds_n \, ds_{n-1} \cdots ds_1$$
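As a rough sketch of how the order-statistics score could be evaluated, the nested integral can be computed with a recursion obtained by integrating from the innermost variable outwards: V_0 = 1, V_k = Σ_{i=1..k} (-1)^(i-1) V_{k-i} / i! · r_{n-k+1}^i, and Q = n! · V_n. The function names are hypothetical, and the number of ranked genes is an input because it is not stated on the slide.

```python
from math import factorial


def rank_ratios(ranks, n_genes):
    """Convert ranks (1 = best) into rank ratios r_i = rank / n_genes."""
    return [rank / n_genes for rank in ranks]


def q_statistic(ratios):
    """Evaluate Q(r_1, ..., r_n) = n! * nested integral via the V_k recursion."""
    r = sorted(ratios)  # the integral is defined on ordered rank ratios
    n = len(r)
    v = [1.0]  # V_0
    for k in range(1, n + 1):
        r_k = r[n - k]  # r_{n-k+1} in 1-based notation
        v.append(sum((-1) ** (i - 1) * v[k - i] / factorial(i) * r_k ** i
                     for i in range(1, k + 1)))
    return factorial(n) * v[n]


# Rank ratios taken from the slide.
slide_ratios = [6.8e-4, 1.2e-3, 2.0e-3, 3.3e-3, 5.1e-3]
q_value = q_statistic(slide_ratios)
```

A small Q means the co-regulated genes sit much closer to the top of the ranking than expected by chance. For long lists the alternating sum can lose precision, so this direct form is best kept to small sets.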
Use order statistics to assign a score to each TRM
Figure: the six example TRMs, each with its score: 6E-5, 1E-3, 5E-1, 8E-9, 2E-7 and 3E-4.
Select the best performing TRM: the 211 bp TRM (p), with score 8E-9
→ Genetic algorithm based optimization
In silico validation – leave-one-out cross-validation
High quality set of 12 smooth muscle specific genes*
  Leave one gene out
  ModuleMiner: train a TRM on the other 11 genes
  ModuleScanner: rank the full genome
  Look at the position of the left-out gene
* Nelander et al. (2003), Genome Res 13:1838-1854
Do for all possible thresholds:
  Sensitivity = % of left-out genes above threshold
  Specificity = % of genome below threshold
  Plot on ROC curve
Area Under the Curve is a measure of performance
Figure: ROC curves with an Area Under the Curve of 50 %, 100 % and 93 %.
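The threshold sweep behind the ROC curve can be sketched as follows. This is a minimal illustration, assuming the only inputs are the genome-wide rank of each left-out gene and the number of ranked genes; the function names are hypothetical.

```python
def roc_curve(left_out_ranks, n_genes):
    """Sweep every rank threshold and return (1 - specificity, sensitivity) points."""
    points = []
    for threshold in range(n_genes + 1):
        # Sensitivity: fraction of left-out genes ranked at or above the threshold.
        sensitivity = (sum(rank <= threshold for rank in left_out_ranks)
                       / len(left_out_ranks))
        # Specificity: fraction of the genome ranked below the threshold.
        specificity = (n_genes - threshold) / n_genes
        points.append((1.0 - specificity, sensitivity))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

Left-out genes that always rank first give an area close to 1, while a gene that lands in the middle of the ranking contributes an area close to 0.5.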
Sensitivity to noise
Figure: AUC (0.5 to 1.0) plotted against # of random genes / # of smooth muscle genes (0 to 4), for two series: 10 smooth muscle genes + random genes, and smooth muscle genes + random genes = 10.
Comparison to other CRM detection algorithms
Figure: performance of ModuleMiner, ModuleSearcher, CisModule, EMCMODULE and random TRMs.
Application of ModuleMiner to adult tissues and embryonic development

10 microarray clusters
  Genes expressed in different adult tissues
  9/10: successful CRM detection
5 custom-built sets
  Embryonic development processes
  5/5: successful CRM detection
Figure: distance from TSS for all conserved regions, adult tissue CRM predictions and embryonic development CRM predictions.
Conclusions
ModuleMiner
  detects similar cis-regulatory modules in co-regulated genes
  outperforms existing CRM detection algorithms on benchmark data
  detects CRMs in microarray clusters of different adult tissues
    Mostly close to TSS
  detects CRMs in custom-built embryonic development sets
    Mostly further from TSS
Introduction
Introduction
CRM detection
Gene
prioritization
Genomic data
fusion
Endeavour
Data sources
Order statistics
Validation
Results
Predicting gene
expression
Conclusions
Gene
expression
…
Genes
2
Regulatory
regions
Expression
profiling of
THRLBCL
Polysomy 17 in
breast cancer
Conclusions
Gene prioritization: genomic data fusion
Figure: genomic data fusion. Data sources (literature, gene expression, anatomical expression, protein domains, process/pathway, gene regulation, protein-protein interactions, BLAST) are combined into P(gene).
ENDEAVOUR: The approach
Figure: known (training) genes, n data sources, candidate genes, n prioritizations (one per data source) and one overall prioritization.
Prioritization based on one data source





Vector-based
  Literature (text-mining)
  Microarray data
  Cis-regulatory motifs
Attribute-based
  Gene ontology
  Protein domains
  Pathways
  Anatomical expression
  Figure: GO IDs of the training genes (observed frequencies) are compared with GO IDs of the full genome (expected frequencies), giving a P-value per GO ID; the P-values are combined as
  $$-2 \sum_i \log p_i$$
Other
  BLAST
  Cis-regulatory modules
  Protein-protein interactions
  Figure: protein sequences of the training genes form a local BLAST database; candidate genes are compared against it with BLASTP.
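The statistic −2 Σ log p_i shown for attribute-based data sources has the form of Fisher's omnibus statistic. As a hedged sketch, the snippet below computes it and, assuming independent p-values (an assumption not stated on the slide), converts it to a combined p-value using the closed form of the chi-square survival function with 2k degrees of freedom. The function name is hypothetical.

```python
import math


def fisher_omnibus(p_values):
    """Combine p-values (each > 0) into (statistic, combined p-value)."""
    k = len(p_values)
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    half = statistic / 2.0
    # Chi-square survival function for 2k degrees of freedom:
    # exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!
    term = 1.0
    total = 1.0
    for j in range(1, k):
        term *= half / j
        total += term
    return statistic, math.exp(-half) * total
```

With a single p-value the combined value is that p-value itself, which is a quick sanity check on the formula.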
Genomic data fusion: order statistics
Figure: known (training) genes, n data sources, candidate genes, n prioritizations (one per data source) and one overall prioritization, obtained with order statistics:
$$Q(r_1, r_2, \ldots, r_n) = n! \int_0^{r_1} \int_{s_1}^{r_2} \cdots \int_{s_{n-1}}^{r_n} ds_n \, ds_{n-1} \cdots ds_1$$
Cross-validation
Training genes in which one gene is left out
99 random test genes + one left-out gene
Data source prioritization: the left-out gene ends up at the n-th position
  Sensitivity = % of left-out genes above threshold
  Specificity = % of random genes below threshold
Plot on ROC curve (sensitivity against 1 - specificity)
Performance measure: Area Under the Curve (AUC)
Cross-validation with all data sources
The same procedure, with the prioritization based on all data sources instead of one data source.
Cross-validation on disease and pathway genes
Figure: AUC (0 to 100) for Diseases (OMIM, 627 genes, 29 diseases), Pathways (GO, 76 genes, 3 pathways) and Random.
Integrated case study: predicting expression from sequence
AGCTTCTCCTCTGTAGACACCGAGACTCATAACTCTGATGAGATCCACAGTTCTATTGGAGTTGTGCAATGAAATAGCAGACACTCTTGGAATCTCTTGGGGCTCCCCCAACTTCATGAATGAAT CTCTAAGTTCTGCATG
CCCCATATAAACTGATGACAAGATCTTTGAGAGCACTGTTTCCTTAGTGGGTTTCCACAGAGAAATTTTGAATATGGGGGTCCACGAAGTGGCTTGAGCCATCTACCCCAACAACAACATTTGGC CTTTGGTGCCTCTCTA
GTATTCTCCTGATGGTTATGCAGATGGTGGCATACAGAAATGGAGTAAATTAGTAAACTAAAAGAATAAATGAGGTGCCCCATTTCTCTGACTCTATTCTAGGAAAATGAGTGAGAAGCAGGATC TCCCAGATTTCAGGAG
AGATCTGGGTCACTTTTTGGAGGTTTCTGGTATTGAAAATTATATATATATATCCTCCAGCTGTATATATATATATATATATATATATATATATATATATATATATAACATCTCTATATGATATA CGTATCTATCTATACC
TCTATAGATATCTATAGATATCTATCTATATCTCTATATGATATATAGAGATATAGATATCTCCTCCAGGTAATAGACTTAATTTTTAAGAACATGTTTCAATTCACAGAAAAATTGAGCAGATG GTACAGAGAATAACCC
TGTGCCCAGTTTCCCCTATGATTAACATAATACATTATATGGTACACGTGTAACAGTGAAACAATATCGGTACATTATTATTCACTAAAGTTCATCATTGATTCAGATTTGTCTAGGTTGATCTT ATGTCTTTTTGTGGCC
AATTATTCCATCTAAGATTCGACATTATATTAAGTTGTCATGTCTCCTTAGGCTAATCCTTGCCTGTGACAATTTCTCAGACTTTCCTTGTTTCTGATGACCTTGATGGGCTTGAGGATTACTGG TTTTTTGTAGGACGCC
CCTCTACTAGAATTTGTCTGATGTTTTTCTTATGATTAGACTAGTATTATGAGAGCAGGACCACAGAGAGAAAGAACAATTTTCACCACATCCTATCAAGAGTATATACTATCAAGATGATTTAT CATTGTTGATGTTGGT
CTTAATCCCCTGGCTAAGAGAGTGTTTGTCAGGCTTCTCCTAAGCTATTTTCCCCTGACTACCTTTCCATACGGAATATACTCCCCGGGAAGAAGTTACTATCTATAGCCCACAATTAAAGAGTG TGGGTTTCTGTTTCTC
CTCCTTAAGGCCGGCACATGTCTATAAATTATTTGGAATCCCTGTGCACATCTATATAAATAAATTTGGAATTACGGGATGTTTGTCTTTTCTCTCTGGTTTATTAATTTACTTAATAATTTATT TATAATAGTATGGACT
CATTACTTTTTTTTTTTTTTTTTTTTTGAGAAGGAGCCTCACTCTGTTGCCCAGGCTGGAGTGCAGTGGCACAATCTTGGCTCACTGAACCTCCGCCTCCCGGGTTCAATGGATTCTCCTGTCTC AGCCTCCCGAGTAGCT
GGGATTACAGGCATACGACACCATGCCCAGCTAATTTCTGTATTTTTAGTAGAGACAGGGTTTCACCATGTTGGCCAGGCTGGTCTTGAACTCCTGACCTCAAGTGATCTGCCCACCTCGGCCTC CCAAAGTGCTGGGATT
ACAGGTGTAAGCCACTGCACCCAGCCCTGGACTCATTAATATTTATTTTATACTTTGGGTTATAATGTAAAACACTATTCTATTTTGTTGCTCAAATTGTTGCAGCTTTGGCCACTGGGAGCTCT TCAAGTGGCTCCTGTG
TCTTTTTGAAATATCCCTCACCAATGTAGTTTTGTTTTTGAATAATTCCTTACTTTAAGGTGCTACAAGATCTTTCATTCTCATTTGTGTATTTCCTGCCCCAGTTTTAGAACTCAACAATTTCT CCAAGAAGCCTTGGTT
CCAGCTGCTGAGAAATGGCATTAAAACTGAGACCAGCCTGGCCAACATGGTGAAACCCTGTCTCTACTGAAAATACAAAAACTAGCCGGGCGTGGTGGTGCAGGGCTGTAATCCCAGCTACTCGG GAGGCTGAGGCAAGAG
AATCGCTTGAACCCGGGAGGCGGAGGTTACAGTCGGCTGAGATCGCGCCACTGCACTCCAGCCTGGGCAACAGAGTGAGACTCTGTGTCAAAAACAAAAACAAACACAAAAACAAAACAAAACTG AGCTCTGAGCACCAGG
TGTGCTTGTTGCTACGACAAATATATTTCAAACCTTATATTTTTAACACCAGCACCCACACAACTACAATACAATTGCACTATTCATAAAACAATTATAGATTATTAACAACATTCAATCATGGT GTCATAGGAGCCTGTG
GTCCTACACTGGATCCCACACACAAAACTTGCATATGATGGTCATCTTCTTTCAGTCCTGTTAGGATTGAAAGAGAGATGTATAGCCTCAGTGGAGATAATATCAAAAGTCTAATTTTATTTATT TATTTCTTTCTTTATT
TTGAGACAGGGTCTTACTCTGTGGCCCAGGCTGGAGTGGTGCCATCATAGCTCACTGCAGCCTCAAATTCCTGGGTTCAAGAAATCCTCCTGCCTCAGCCTCCCAAGTGGCTAGCACTACAAGTA TGTGCCATCATGCCTG
GCTATTTTTTTTTTCTTCCGTTTTTTTAGAGACAGGGTCTACGTTGCCCAAGCTGGTCTTGAATCCCTGGTCTCAAGTGATCTTCCCACCTCAGCCTCACAAAGTATTGGGACTACAGTTGTGAG TCATTGTGTCTGGCCC
AAAAGTCCAAATTTGAGGCCTTCTTTGGATGTGTGGCCACAATAAATGGCTCTTGCAAGGCTGCCAACCCCTTACACTCTTTCCATAATATGCCATAAGAAAAGCATACTGGATTTAGAAATAGG GCATGAAAGTTCTGAA
TCCAGCTGTGTTAGTGTTATAGCATATATAGCAAGTGGATTGTGTCTGGGCCTCAATTTCCAATGATACAAAATCAGGAACATCAGATTGGATAATGGCTAAAGGCCCTCCCAGTTCTAGCACAC TATAATTTTCAACAGA
CTTACACTGGGGGAATACAATTGGCTCCACTAGTCTTTGTATACAGGCCTAATATTCCAGAAAGTCTAAACCAGTGGAGGCATGGGGGTGCGGAGGTCGCGGCTAATAAATCAGAGTCATTTTAT TATTTTTTGGGAATGC
CAAGACCTGTTAAAGGCTTTAGATAGTCTAGACAATCGGGCCTGAGAAACTTTAGACCTTTCTTTTTAAAGAATGAAGATCAAAAAAGTATAAAAAATATTGATGGAAAGTATCTCTTTCATTGG TTTCATGTTCTGATAG
ATCAAGACTTCTTCCTCTTTTTTTTTTTTTTCCTTTAGTAAGGGAAAACTCCTCATCTGCTTTTTCCTCTGACTTCAAATAATTACCTTTAATGCAGTGATGGCTGAGCCACCTCTAAGTTTCTT ACCAGGAATCTCTCTC
TAGGTTTTTATTTTTTTCTTTTTCTCTTCCTTCCTTCTCTCCTCTCTGCCTCCCTCTCTTATGCTCTCCCTCCCTCCCTTCCTTCCCTTGAATGTTATGATGTGTTTTTTACATCCATATACTAC CCAGTGTACAAATGTC
ATCTCCTTCCTATCATTGATGGGGACAATTTGCAAAAACAAATAGAAGGAAAATAAAAAAGGAAATATAGGGCAGAAAAGACACTTGGGAACTGTCACATTTGATTATGAATGCTGGAGATCAAA GGTGCAAGGTCTTAGA
ACCTACTTCCTCCACCTCTTAACGTTTAAAATCTTCAATTGGCTTTTGAACCCACTCAGCAAAATCCCAGACTTTGGTCTACAATTGGTTAAAAATTGATAGAGTGAGGATTCTGGGACTGCCTT CTTTACTTAGAAGTTT
ACATTTTAACTCCTTCCCTAGCCCCAGGTACACATATACACACAGCTCCTTTCCACTCCTCTCGCACAGTTCTGTAAATATGTTTTGAAATGTAAAGGTACAGAACTAAGCGCAGACCGGCATCC CTCAAATCATCGGGGC
TATTCCTTCACACAGCTGAGGAAACGGAGTCCTCACAAGTGGCTTTGCTCAATGTCCCATAAAGAGTTTCAGGCACAGCTGTAATTAGAAACCAAGGGTTTGTGTGTGTGTGCGCGCGTGTGTGT GTGTGTGTGTGTGTGT
GTGTTTGCAACAAATACAGTGTTTTTTTTTTTCTCCCTACACTGTGCCCCCGTGGAGTCACATTTGTGTGTCTGTGTCTCTGTACGTACATAAGTTACACAGACACTGACATGTAGGAAACGTGC ACCAAAGTGTCTGTCT
TCTGACCTCAGGTAACAGTATAATGACTTGAATTTCAGGCAGCTGAAAGGTTTCTGCCGGTGGAGGTTGAAATAAACAAGAAAAGCCACTGTGGAGATGTGAATGGAAAAGTACCGAGCCCTCCC TCCCTCCGCACATTCT
TCCGCTCTCCAGCTCTCCCTGCCATCGAGCTGGCTTCAGATAGGCTTCTGCATGGTCAGGTGTACAAGAGGGCGGTGGGGAGAAGAAAAAAAAAATGCAGGCACAACACGCAAATCAAGTTTTTC CACTTCTAGCCTTAGG
TAGTAGAGACAGCTAAGTACAGCAGCCAGCAGCCCGGCAACGGCAGCGGGTGGACCAGCCACCCTGAGTTTACAAACACTCAAGTGCTTTCCTTCCCTCATCCCTCTCAGAGTCCAGCTGCTGCT TTCCTTCATGCTAAGG
TTTCATAGGAAGTGAAAACTCTGCTATTCAAAACAGCGATCGAACGCAATAAACAAATCATTACACACCCCCTAACCCCCATCACTTCTCTATTTTAAGCTTCTGATATTTATTCCCATTTTAAA TAAGTGAGAAAAGTGT
GGAAAATTAGTGTTTGGGGGTAAACTCTGAGCCAGGCTGAAAAGGTTTCTAAAGGAAAAAAAAATCTCAGAACAATAAAGGCTAAAAGCAGGCAGCATATGGATGAAAATTAAACACTGATACTT CCTTTTCAGAAGGCAG
TAGCTGGAAATTATACACTTTTTTAATGTCTCAAAACTTTCTGCTCATCTTGCTATGTTAAAAACGCCTTTCTTTCTCCAAGGATACTACAAAAAGCTTGTTTACAACAGTTCTAAATGAAGGAT TTGAAATAAAACGAAC
AGGTAAAATTTAACAAGTCTGATAGATAGTGTCTCCCAAATCTATCAAAAGCAGTGCCAAGTACTTCAATGTAGCTGAGAGGCATAAATAAACCCAAATGACCATCAAAACTCATCATGACTTGG AGTTCGCTCTGAGTTT
TGCAGTTTACAAAGAGACCATGGCAGCCTTGCTTCCCTCAGTTCTACAAGGACACAAGATATACACCTACAGACTCAAGTTGTCAGATTACACTGATCCTCTAAAATGACAGAGGGCCAGCAAAT CATGCAGACCCATTTT
CAGTTGTGTTCCTGGGGTCACACATGCTCCTAGTGAAGACCCAGCCTATAATCCTGAAAGAAGAAAGCCTAGAGAAGGTGATTGATTTGAAAAAGTCTTCCCAGTTTTAAAATCTTTAGTCCTAT GATGTGGTATCTTAAA
GACCTACCAAGGTGCCAGAGGTTCCTGACAGGTGAAACCAACTTCCTCTTGTGAGCCCCCTTAGAAAGAGGACAGAACCGTGTTTATTCCAAGGATAGGTTCTTTTTCAGCTATGACTGAATTGT GGGGAAGGTTTTGCAA
GGGGGAATTGGATGTGAAGTCTGTTCTTTTCCTCAAATAGATGTAATATTAGGACCAGGCTATTTTATTTTGTAATAAAGCTTATATTTACCCAGCAGCAATGATCAGGGACCTATTCTTATGCC CAGTCCATGAGGCAAA
GAGGTTGGCCTGGTCCTCTACTGAGTATTAGCAGCCAGTAACCATTAAATCATGGGACTAGTTGAAATGTAGTGCCCCGAAGTCTGCAGGAATTATTCATACGACCCCAGACATGGAATCACTCT TTAGAGCTTCTTAAGG
ATGATTTAAAAGAATCAGAATACGTTCAAGTCAGCCCTTTCTTTAATCCTGTAACACGGCACTGCGGGAGTGAGGGAGGCCCACATAGTGATGCCAACTGGATACTGAGGAGAGGTCAAGAATGA AAGAAGAAATGACATT
CTGGAAGAAATTCAACTGGTATAATATTTGACAAAGTTACTTTCCTAGGAATTGAAAAGAGATTGAGAGGCGGGTGCACAATTTTCCTCACCATTCATTCAGTTCAAAGTAAAAGAGACTCACCG AAAAGTAAGTGCCTAT
CTTTAGAAAATTTTCAATAATGATTTTCTCTTTCTTTCTAACTGGTCTTCTGTTCTGGTCAATTTCTTTCAGTGTAAACACATTGATTTGGCAAAAAGCAGTAGGAAAATGTGGCACTCTGGCAC TTGGTCCCAGAAATAA
TATGCTGGGAAGATTTGAGGTCCCTGGTGATGAGGTTAATTATATATGAACCAGCCCTGGTGGGTTCTCCCTCTAGGGGCTCACTGCAGAGAATTAAAGAGGGCTGAGGTATGAGAAGGGTGAAT TCCTTCCCAGCCCCCA
CTCTGGCTGGTTCTACTACTGCCTTAGAGAGCAGATTTCCCTTTGCTCTGCAGCGCCCCCATGGGGCTAAGAGTGGAGTGGCAAAGGGAACACAGGAGGGACAAGCTGTGTTTCAGGTTGAGGGG GGCGGTGGATGAGGCT
GAATGGCAGTTTTGACAAAGAAAAAAGTGACCAAAAATCATAAAAATAATCTTTTGAGGGCCCAATAGTAAGGCAGAGCCATACAATTCACATTCCAAACCATATAGATACGTCTGAGAAATCCT AAAGTGCTAATTGCTC
ATAAAAGAAAAAATTACACATATAAACACACAAAGAAAATCCCTTCCACAAAATCGGGGTGTCATTTTGCATCCAGCGGGATTCATTTTAATTTCTTTGAAAATGAGAAGGAAGGGGACTCAAAT GAAAAAGCAGATAGTC
TGCCTTCTGGCAGAATAAATCTGAACTTGACAATATCATGTGTCTTTGGGGGTAAAACGTACATTTCAACAACAGTGACAGGATTAGGCCTATGTATATTTTTCAAAAACCGTTCACAAGACAGG CTTTTCTGCAGAGGCT
GCAGTAATCCATCTGTCAATAAGTATTAAAATATTCAGATTTCACAGGGACAGACACTTTAACGCATATTTCCTAAGCTCCAGCCCTTGTGGAAAATAATCAACCTCTTTGCACCTTTCTGGGTT TTAAAACCTAAAATAC
AGCCTTTAAAAATGTGTGTGTGTTGTGGGGTAGGGGGGTGCATTGCCAACAACATTTTCGGTGATAGATGGAACTTCTTACGGGACTGTCAATGAAAGAGATTTTCCAAATATCCCAGCAAACAG CAATCTTTCACAGCTC
TGATCACTCCTCCATTATAAACCCAAATTTTGGGTTGAGATAGGTAGATTATTTTAGACATATCTTTATTAGAAATTAACAAGTGACGAGATTTTGTGGAAGCTTTAAGAATTCATCTGTAATTT AATAAGTCGCTTGAAG
GACTCTCATAGCCAAGGCTCAGAACAGCCTGACCTTTGAAAGCTGCTTCTGGTCCAAACATTTTGGGCTAATTCTTGAGGAATCTGAAATATTATTTTCCCCTCACACCCTTCTTTTAAGAGAGA GACATAAAAGAAACAA
GAGTCTCCCTTATTCAGGGATGAGTAGGAGGGGAAAAAACCCGAACCAACATTTAAATAAGGAAACTAGCAGCTCTGAACAAACAAACTAGGACCCACAATGAAATGATTCTGCACTGCAATTGC CTTTAAAAAGAAAGTA
ATAGAGAAAAAGAGAAGGAAAGAATTTCTCCTTCTTCTCTACCCCCCCCCCACCCCACCCCCCAACTCAGCTTCAAAGCTAAGAAGACTGTGCTGCGTGTAGTGCATTGTAGTTGTGGCAGTCTG TTCTAAATACAGGCAG
TATCTGTGATACTGGCACGGCAGGCCTTTAGAATTCCCTCCGGCTGATCTCTTAAACACAGACTGAAGAGATTTTTTTACAACGACCTTGAAACGAGCCTCGAAAACAAAAATCTCAAGACCTTA AGAGAAAACAAAACAC
AAACAGGTATTTGGCTCACAGAATTTTGTAGAAAACACACACATACCACCCCGCCACCCCCACCCTCCCCCCCACACACACGTTTCTTGCAACAAGAAATTTCCCAAGAGTCAACAATAACAGAT TAAACCCACCACTTGC
TGTCCTGGAAAGAAACAAACCAAACCAAAACAAATCCTTTGAACATTTCTCTGAAGTGCAGGAGAGACACACTTCAGCAAAAGTCCAAGGGGGAAAAAGAAAATTGCACCAAAGGAAAAAAAAAA AAAAAAAGTGGGGGCT
GGGATTGTTACATATGGCCAAAAATTTAAGCTTCTTTCAATAGTATTAGTATTGAAATAATACATCTTTAAAACGCTTGAGGGATTAGATAGGGAAAGAAAAGGCACGTACAAAAAAATCCAACC GATGCCGATCCTGTGA
TTTACGTAACACCACAAACTTGCAAAAGGCAAAAAATCAGAAGCAAAAATCCATAAACCATCAAAATACAGAAACCAAAAATCCCAAGCCACCACACCAGAAAGAAAAAAACCCAGAACAACAGC AAAAACCCCTGTCCTA
AATAAAAATAAAGCAAATGAACCCACCGAAAACTGCTTGGCAAATATTTTTCTCGTGGTGCCTAATATTCTAGTTGGAAAGAGCTGTGATGTTTATTTTATTTTATTTTTCTCTTACTCGCCTCT CTAACCCTACTATATA
TATAACATACTTTTCCCAGTGGTTCAAACCTCTCGCTCCCTTTTGTGCATTTAGCTCGATCTGCTGAGTTTATGGGTAAGAAAGAAGGAATTAGCCCCAGACCCCGGGAAAGCAAAGCGCACTCC CCCTCTTATGTCACCG
AATAGCAAATTAGTTCTCAGAATTCCAGAGGCCGAGCTTTGCTACAGCGAAGGCGCCGACGTCACAGAGGAGGAGCCCACGTGATGGTGGCGGAGCAGGCCATACCATCGTCTTGGGCCCGGGGAGGGAGAGCCACCTTCA
How is this gene expressed?
Predicting expression from sequence – macrophage differentiation
Figure: HL-60 cell line (granulocytic monocytic progenitor cell), treated with TPA, gives differentiated HL-60 cells (macrophage).
Using only sequence information, can we predict genes up- or down-regulated during macrophage differentiation?
CRM detection: prediction of new target genes
18 upregulated genes → transcriptional regulatory model (CBF, NRF2, LBP1, AP1, HNF3; 200 bp) → 100 new target genes
Figure: plot of fold upregulation (axis from 0.01 to 100).
Prioritization of new target genes
Figure: 18 upregulated genes (training genes), transcriptional regulatory model, 100 new target genes, prioritization.
Prioritization of new target genes - conclusions
Figure: 18 upregulated genes (training genes), transcriptional regulatory model, 100 new target genes, prioritization, with a plot of fold upregulation (axis from 0.01 to 10000).
Conclusions

Endeavour prioritizes candidate genes
  Looks for similarities with known disease/pathway genes
  Integrates information from many heterogeneous data sources
Computational validation
  Disease/pathway genes ranked on average at the 10th position of 100 candidate genes
In vitro validation
  Predicting gene expression from sequence
Introduction
Introduction
CRM detection
Gene
prioritization
Expression
profiling of
THRLBCL
Gene
expression
Polysomy 17 in
breast cancer
…
Genes
Regulatory
regions
THRLBCL
Experiment
Classifier
Genes
Tolerogenic
immune
response
Conclusions
Conclusions
3
Identification of disease causing mechanisms: THRLBCL

T cell/histiocyte rich large B cell lymphoma
  Figure: survival (years, 0 to 15) of THRLBCL and NLPHL
  Similarities with nodular lymphocyte predominant Hodgkin’s lymphoma (NLPHL)
  Functional meaning of the THRLBCL microenvironment?
→ Microarray expression profiling of THRLBCL, in comparison with NLPHL
The microarray experiment - PCA plot
Figure: PCA plot of THRLBCL, NLPHL and reactive lymph node samples.
A three-gene quantitative RT-PCR classifier of THRLBCL vs NLPHL



3 most significant genes
One calibrator of each lymphoma type
Each converted to give 6 percentage scores:
$$S_i^{\mathrm{THRLBCL}} = \frac{\log(x_i) - \log(r_i^{\mathrm{NLPHL}})}{\log(r_i^{\mathrm{THRLBCL}}) - \log(r_i^{\mathrm{NLPHL}})} \times 100$$
$$S_i^{\mathrm{NLPHL}} = \frac{\log(x_i) - \log(r_i^{\mathrm{THRLBCL}})}{\log(r_i^{\mathrm{NLPHL}}) - \log(r_i^{\mathrm{THRLBCL}})} \times 100$$
Averaged to give one NLPHL and one THRLBCL similarity score

                           classification by the three-gene classifier
diagnosis by morphology    NLPHL    THRLBCL
NLPHL                      46       0
THRLBCL                    0        23
Differential expression
[Figure: heat map of differentially expressed genes, showing an NLPHL signature and a THRLBCL signature. Gene labels, in the order shown: FCER1G, VSIG4, IDO, CCL8, TLR1, TLR2, TLR4, TLR8, CD14, STAT1, CCR1, CXCL10, CXCL16, CCRL2, CD80, CD86, CD274, CSF1R, CSF3R, PDCD1LG2, FCGR3B, FCGR1A, ICAM1, IL1RN, IL18BP, IRAK3, CD74, S100A9, CASP5, MSR1, CD163, SOD2, IFNAR1, IFNGR2, IFIT3, IFI6, C1QA, C1QC, C2, C3AR1, FCRL1, CD79A, CD79B, CD19, CD22, MS4A1, PAX5, BCL11A, FGFR1OP, FCER2, BANK1]
The model
[Figure: model of the THRLBCL microenvironment. Labels: CCL8, recruitment, macrophages and dendritic cells, innate immunity, scavenger receptors, Toll-like receptors, VSIG4, IDO, tumor tolerance, activation, IFN-γ]
Conclusions
• Expression profiles are in line with differences in microenvironment between THRLBCL and NLPHL
• Insight into the functional significance of the microenvironment
  • Tolerogenic immune response
• New targets for therapy
Breast cancer: clinicopathological significance of polysomy 17
[Figure: relation of polysomy 17 to clinicopathological variables. Labels: polysomy 17, tumour grade (I, II, III), Nottingham Prognostic Index, HER2 amplification, HER2 expression, trastuzumab, ER status (negative, positive), PR status (negative, positive). Other categories shown: Normal, I, II, III; <1, 1-3, 3-5, 5-10, >10]
Final conclusions

• Development of novel systems biology methods
  • ModuleMiner: CRM detection
  • Endeavour: gene prioritization
• Systems biology to gain more insight into diseases and processes
  • Predicting expression from sequence: integrated case study
  • A tolerogenic immune response in THRLBCL
  • Clinicopathological significance of polysomy 17 in breast cancer
Perspectives
Systems biology methods for the identification of:
• Regulatory regions
  • Protein-binding microarrays: more and better PWMs
• Disease genes
  • Three systems biology methods
  • Combination with array-CGH
• Disease mechanisms
  • Microarrays: focused experiments
  • Data integration
Disease treatment:
• Insight → directed treatment
3 conservation options:

1. All predicted binding sites in all human-mouse conserved non-coding sequences (CNSs), 10 kb 5’ of the transcription start
[Figure: human gene and mouse gene aligned with LAGAN & VISTA; 10 kb upstream region containing CNS 1, followed by the 5’ UTR, exon 1 and CDS]

2. Same as (1), but limited to binding sites that occur in both the human and mouse CNS
[Figure: human CNS aligned with mouse CNS]

3. Same as (2), but add 100 kb of mouse sequence both 5’ and 3’ (to correct for transcription start annotation errors)
[Figure: as in (1), with the mouse region extended to 10 kb + 100 kb on the 5’ side and 100 kb on the 3’ side]
ModuleMiner performance: TRMs and TRGMs
Comparison to other CRM detection algorithms - results
• Using TFBS set 2 in the ranking step improves the performance of other methods
• TFBS set 2 does not always do best
Application of ModuleMiner to microarray clusters
Application of ModuleMiner to embryonic development sets
Embryonic development process | TFBS set | Nr target genes after leave-one-out cross-validation (p-val) | AUC
Primary heart field [44]      | 1        | 6 / 7 (p = 6.4 × 10^-6)                                       | 0.92
Secondary heart field [44]    | 1        | 6 / 9 (p = 6.4 × 10^-5)                                       | 0.79
Neural crest cells [45]       | 2        | 6 / 10 (p = 1.5 × 10^-4)                                      | 0.86
Eye development [46]          | 1        | 10 / 15 (p = 1.9 × 10^-7)                                     | 0.79
Limb development [47]         | 1        | 10 / 24 (p = 5.2 × 10^-5)                                     | 0.77
Where are the CRM predictions located?
[Figure panels: TFBS set 1 and 2; TFBS set 3; microarray clusters; development sets]
How are genes ranked?

• Vector-based data sources: microarray data (expression in brain, liver, kidney, ...), literature, motifs
  • A candidate gene with expression similar to that of genes known for the disease gets a high score
  [Figure labels: known disease genes; low score candidates; high score candidates]
• Attribute-based data sources: Gene Ontology, InterPro protein domains, KEGG pathways, EST anatomical expression
  • Example attribute: cytoskeleton (GO:0005856)
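The vector-based idea on this slide (a candidate whose expression resembles that of the known disease genes gets a high score) can be illustrated with a small Python sketch. It is an assumption-laden illustration, not Endeavour's implementation: the known genes are summarized by their average profile, similarity is measured with the Pearson correlation, and all names and values are hypothetical.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation between two equally long numeric vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def rank_candidates(training_profiles, candidates):
    """Rank candidate genes by similarity to the average training profile.

    training_profiles: list of expression vectors of known disease genes.
    candidates: dict mapping gene name -> expression vector.
    Returns the candidate names, most similar first.
    """
    n_genes = len(training_profiles)
    centroid = [sum(column) / n_genes for column in zip(*training_profiles)]
    scores = {name: pearson(profile, centroid)
              for name, profile in candidates.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

For instance, with two known genes expressed mainly in the first tissue, `rank_candidates([[9, 1, 1], [8, 2, 1]], {"geneA": [7, 1, 2], "geneB": [1, 8, 9]})` places `geneA` ahead of `geneB`.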
Order statistics




Given a set of n ordered rank ratios for gene i:
(9/100; 4/120; 30/150; 30/50; 2/10; 80/80)
→ (0.09; 0.03; 0.2; 0.6; 0.2; 1)
→ (0.03; 0.09; 0.2; 0.2; 0.3; 0.5; 0.6; 0.8)
What is the probability of getting these rank ratios or better by chance alone?
“How many rank vectors does my vector strictly dominate?”
Joint probability density function of all n order statistics:

$$Q(r_1, r_2, \ldots, r_n) = n! \int_0^{r_1} \int_{s_1}^{r_2} \cdots \int_{s_{n-1}}^{r_n} ds_n \, ds_{n-1} \cdots ds_1$$

Recursive formula of complexity O(n^2):

$$V_k = \sum_{i=1}^{k} (-1)^{i-1} \frac{V_{k-i}}{i!} \, r_{n-k+1}^{i}, \qquad V_0 = 1$$
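A minimal Python sketch of the recursion on this slide follows. It assumes that the Q statistic is obtained from the last term as Q = n! V_n, which reproduces the integral for n = 1 (Q = r_1) and n = 2 (Q = 2 r_1 r_2 - r_1^2). The function name is hypothetical.

```python
from math import factorial

def q_statistic(rank_ratios):
    """Q statistic of a set of rank ratios, via the O(n^2) recursion.

    rank_ratios: values in [0, 1], one per data source; they are sorted
    in ascending order here, so the caller may pass them in any order.
    """
    r = sorted(rank_ratios)
    n = len(r)
    v = [1.0]  # V_0 = 1
    for k in range(1, n + 1):
        r_k = r[n - k]  # r_{n-k+1} in the 1-based notation of the formula
        v_k = 0.0
        for i in range(1, k + 1):
            v_k += (-1) ** (i - 1) * v[k - i] / factorial(i) * r_k ** i
        v.append(v_k)
    return factorial(n) * v[n]
```

A call such as `q_statistic([0.09, 0.03, 0.2, 0.6, 0.2, 1])` evaluates the six rank ratios of the example. Small values of Q indicate a gene that ranks well across the data sources.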
Validation of the literature data source
Prioritizations of 199 random genes + the indicated disease gene

Disease                            | Gene    | Publication date | Rank: rolled-back text | Rank: all | Rank: all, no text
Arrhythmia                         | CACNA1C | October 2004     | 3                      | 4         | 4
Congenital heart disease           | CRELD1  | April 2003       | 1                      | 6         | 3
Cardiomyopathy 1                   | CAV3    | January 2004     | 1                      | 8         | 2
Parkinson’s disease                | LRRK2   | November 2004    | *                      | 42        | 50
Charcot-Marie-Tooth                | DNM2    | March 2005       | 100                    | 12        | 14
Amyotrophic lateral sclerosis      | DCTN1   | August 2004      | 97                     | 23        | 27
Klippel-Trenaunay disease          | VG5Q    | February 2004    | 39                     | 3         | 3
Cardiomyopathy 2                   | ABCC9   | April 2004       | 51                     | 1         | 1
Distal hereditary motor neuropathy | BSCL2   | March 2004       | 62                     | 6         | 15
Cornelia de Lange                  | NIPBL   | June 2004        | 75                     | 3         | 9
Average rank                       |         |                  | 48±13                  | 11±4      | 13±5
Validation of genes related to complex diseases
Prioritizations of 199 random genes + the indicated disease gene

Disease              | Gene    | Publication date | Rank
Atherosclerosis 1    | TNFSF4  | April 2005       | 54
Crohn's disease      | OCTN    | May 2004         | 71
Parkinson's disease  | GBA     | November 2004    | 23
Rheumatoid arthritis | PTPN22  | August 2004      | 11
Atherosclerosis 2    | ALOX5AP | February 2004    | 29
Alzheimer's disease  | UBQLN1  | March 2005       | 54
Average rank         |         |                  | 40±10
Disease case study: DiGeorge syndrome
Atypical 22q11 deletion
58 candidate genes

Training sets used to prioritize TBX1 or YPEL1 | Rank assigned to TBX1 | Rank assigned to YPEL1
DGS (14)                                       | 1                     | 1
Cardiovascular birth defects (14)              | 1                     | 3
Cleft palate birth defects (9)                 | 1                     | 2
Neural crest genes (14)                        | 2                     | 1
Average rank                                   | 1.25 ± 0.25           | 1.75 ± 0.48
Ypel1 as a novel DGS gene: validation in zebrafish
A screen for genes involved in congenital heart defects (CHD) – work in progress
• Array-CGH of CHD patients with a ‘chromosomal’ phenotype (n = 100)
• Map (micro)deletions and (micro)duplications (n = 16)
  [Figure: array-CGH profile of chromosome 14; y-axis from -1.0 to 1.0]
  • Known CHD gene explains phenotype (n = 7)
  • No known CHD gene in deleted/duplicated region(s) (n = 9)
    • Endeavour prioritization
    • Validation in zebrafish: in situ hybridisation, morpholino knockdown
CHD gene prioritization – optimizing the performance
• Extra data source: microarray data of embryonic heart development (mouse)
• Multiple training sets: primary heart field, secondary heart field, neural crest cells, vascularization, CHD genes
• Validation/optimization of each training set by leave-one-out cross-validation
• Combine prioritizations using different training sets into one prioritization
[Figure: performance gain. Bar chart of AUC (axis from 85 to 100) per training set (primary heart field, secondary heart field, neural crest cells, vascularization, congenital heart disease) for three settings: all data sources except microarrays of heart development; all data sources; selected data sources]
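The leave-one-out validation of a training set can be summarized as an AUC once the rank of each left-out gene is known. The sketch below makes two assumptions that are not stated on the slide: each left-out gene is ranked within a test set of `n_test` genes (the gene itself plus other genes), and the AUC is taken as the average fraction of the other test genes that rank below the left-out gene. The function name is hypothetical.

```python
def auc_from_loo_ranks(ranks, n_test):
    """Average AUC over leave-one-out runs.

    ranks: 1-based rank of the left-out gene in each run (1 = best).
    n_test: number of genes ranked in each run, left-out gene included.
    """
    if n_test < 2:
        raise ValueError("n_test must be at least 2")
    # One positive per run: its AUC is the share of the other genes ranked below it.
    per_run = [(n_test - rank) / (n_test - 1) for rank in ranks]
    return sum(per_run) / len(per_run)
```

Under these assumptions a training set whose left-out genes always rank first gives an AUC of 1.0, while random rankings give values around 0.5, which is the scale on which the training sets and data source selections are compared.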
CHD gene prioritization – preliminary results (in situ hybridisation)
[Figure: array-CGH profiles of chromosome 4 and chromosome 14; y-axes from -1.0 to 1.0]
Bias to well characterized genes
Numbers of data sources with information
Endeavour
http://www.esat.kuleuven.be/endeavour
Differential expression – histogram of p-values
Is the spleen sample aberrant?
Quantitative RT-PCR validation


Fold change THRLBCL vs NLPHL
Genes selected for involvement in:
• Interferon pathways
• Macrophage activation
• Innate immune responses

Gene symbol  | Description                                        | Fold difference, microarray | Fold difference, quantitative RT-PCR (p-value¹)
IFN-γ        | Interferon gamma                                   | 4.7²                        | 4.4 (p = 1.0 × 10^-5)
STAT-1       | Signal transducer and activator of transcription 1 | 1.6                         | 2.9 (p = 4.4 × 10^-9)
CD74         | HLA class II histocompatibility antigen gamma chain | 2.8                        | 1.2 (p = 0.21)
CCL8 (MCP-2) | Monocyte chemotactic protein 2                     | 143.5                       | 84.8 (p = 3.9 × 10^-9)
IDO          | Indoleamine 2,3-dioxygenase                        | 9.0                         | 12.3 (p = 1.6 × 10^-8)
IFN-α1       | Interferon alpha 1                                 | 1.0²                        | 0.92 (p = 0.81)
IFN-αR2      | Interferon alpha receptor 2                        | 0.9²                        | 1.3 (p = 5.3 × 10^-3)
STAT-2       | Signal transducer and activator of transcription 2 | 1.8²                        | 1.3 (p = 0.11)
TLR8         | Toll-like receptor 8                               | 11.5                        | 11.5 (p = 6.4 × 10^-11)
MyD88        | Myeloid differentiation primary response gene (88) | 1.8²                        | 2.2 (p = 6.7 × 10^-7)

¹ T-test, not corrected for multiple testing.
² Difference was not significant at p < 0.001 (after correction for multiple testing).
Sensitivity of the classifier to the choice of reference samples