Inference of cis and trans regulatory
variation in the human genome
Manolis Dermitzakis
The Wellcome Trust Sanger Institute
Wellcome Trust Genome Campus
Cambridge, UK
Gene expression
• Altered patterns of gene expression → disease.
– e.g., Type 1 diabetes, Burkitt’s lymphomas.
• Widespread intraspecific variation.
• Heritable genetic variation for transcript levels.
– Familial aggregation of expression profiles (Cheung et al. 2003).
– In humans, ~30% of surveyed loci exhibited a genetic component
for expression differences (Monks et al. 2004; Schadt et al. 2003).
• Much of the influential variation is located cis- to the coding
locus.
– In humans, mouse, and maize, 35%-50% of the genetic basis for
intraspecific differences in transcription level is cis- to the coding
locus (e.g. Morley et al. 2004; Schadt et al. 2003; Stranger et al. 2005;
Cheung et al. 2005).
Nature of regulatory variation
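[Figure: a regulatory element (REG) and a gene (GENE) on DNA, with the levels at which regulatory variation acts on expression: i) Pre-mRNA, ii) mRNA, iii) Protein, iv) DNA.]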
Stranger and Dermitzakis, Human Genomics 2005
Effects of Copy Number Variation on
gene expression
[Figure: four arrangements of regulatory elements (REG) and genes (GENE) produced by copy number variation:]
• Additional gene copy
• Increase of distance from regulatory element
• New regulatory element
• Gene interruption
Gene expression association mapping
[Figure: histogram of Frequency against Expression Levels, with separate distributions for genotypes AA, AG and GG.]
Stranger et al. PLoS Genet 2005
Phenotypic variation space
illumina Human 6 x 2 gene GEX arrays
Beads in Wells
Whole-genome gene expression
~48,000 transcripts
24,000 RefSeq
24,000 other transcripts
Cell line
270 HapMap individuals:
CEU: 30 trios, 90 total
CHB: 45 unrelated
JPT: 45 unrelated
YRI: 30 trios, 90 total
RNA:
2 IVTs each person (IVT1, IVT2)
2 replicate hybridizations each IVT (rep1 to rep4 in total)
Quantile normalization of all replicates of
each individual.
Median normalization across all
individuals of a population.
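The two normalization steps above can be sketched as follows. This is only an illustration under stated assumptions: intensities are taken to be on a log scale and held in plain Python lists, median normalization is implemented as an additive shift to a common median, and the function names and toy values are hypothetical. The slides do not specify these details.

```python
from statistics import mean, median

def quantile_normalize(replicates):
    """Give every replicate the same distribution of values.

    replicates: list of equal-length lists (one list per hybridization,
    one value per probe). Ties are broken by probe order (an assumption).
    """
    n_probes = len(replicates[0])
    # Sort each replicate, then average across replicates rank by rank.
    sorted_reps = [sorted(rep) for rep in replicates]
    rank_means = [mean(rep[k] for rep in sorted_reps) for k in range(n_probes)]
    normalized = []
    for rep in replicates:
        order = sorted(range(n_probes), key=lambda i: rep[i])
        out = [0.0] * n_probes
        for rank, probe in enumerate(order):
            out[probe] = rank_means[rank]
        normalized.append(out)
    return normalized

def median_normalize(individuals):
    """Shift each individual's profile so that all share one median
    (here the median of the per-individual medians, an assumption)."""
    medians = [median(ind) for ind in individuals]
    target = median(medians)
    return [[v - m + target for v in ind] for ind, m in zip(individuals, medians)]

# Toy data: 4 replicate hybridizations (2 IVTs x 2 replicates), 5 probes.
reps = [
    [5.0, 2.0, 3.0, 4.0, 1.0],
    [4.0, 1.0, 4.5, 2.0, 3.0],
    [3.0, 4.0, 6.0, 8.0, 2.0],
    [6.0, 1.5, 2.5, 3.5, 0.5],
]
qn = quantile_normalize(reps)
# One expression value per probe for this individual: average the replicates.
per_individual = [mean(col) for col in zip(*qn)]
```

After quantile normalization every replicate contains the same set of values, only assigned to different probes; the per-individual profiles would then be passed together to `median_normalize` within each population.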
Within- and between- individual variation
[Figure: two scatter plots of array signal on log10 axes.]
test: 1400410090_A vs. 1400410090_C (2 replicates; single YRI individual)
- r2 (all genes) = 0.990
- Detected genes (0.98 in both samples): 12,076
- r2 (detected) = 0.994
trial: 1400410090_A vs. 1400410096_A (2 YRI individuals)
- r2 (all genes) = 0.964
- Detected genes (0.98 in both samples): 11,529
- r2 (detected) = 0.964
HapMap SNPs
Individuals: 60 CEU, 45 CHB, 44 JPT, 60 YRI
14,925 genes
Phase I HapMap; MAF > 0.05
CEU: 762,447 SNPs
CHB: 695,601 SNPs
JPT: 689,295 SNPs
YRI: 799,242 SNPs
~1 SNP per 5 kb
Copy Number Variation dataset
• Genome Structural Variation Consortium
– Redon et al. Nature in press
• Array-CGH using a whole genome tile path array
– Median clone size ~170 kb
– All 270 HapMap individuals
– 26,563 clones
– 93.7% euchromatic genome
• Quantitative values (log2 ratios) representing
diploid genome copy number, not genotypes.
• 1117 CNVs called from log2 ratios
– Calls based on standard deviation of log2 ratios
– Many CNVs experimentally verified
SNP cis-analysis:
SNPs within 1Mb of probe midpoint
[Figure: a window extending 1Mb on either side of the probe within a gene; SNPs falling in the window are tested.]
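The cis window can be expressed as a simple coordinate filter. The sketch below is illustrative only: the dictionary layout, the identifiers and the coordinates are hypothetical, and the strict `<` comparison follows the "SNP-probe distance < 1Mb" wording used elsewhere in these slides.

```python
CIS_WINDOW_BP = 1_000_000  # 1 Mb on either side of the probe midpoint

def cis_snps(probe, snps, window=CIS_WINDOW_BP):
    """Return IDs of SNPs on the probe's chromosome that lie less than
    `window` bp from the probe midpoint."""
    midpoint = (probe["start"] + probe["end"]) // 2
    return [
        snp_id
        for snp_id, (chrom, pos) in snps.items()
        if chrom == probe["chrom"] and abs(pos - midpoint) < window
    ]

# Hypothetical probe and SNP coordinates.
probe = {"chrom": "1", "start": 5_000_000, "end": 5_000_050}
snps = {
    "snp_a": ("1", 4_200_000),   # about 0.8 Mb away: kept
    "snp_b": ("1", 6_500_000),   # about 1.5 Mb away: dropped
    "snp_c": ("2", 5_000_000),   # different chromosome: dropped
    "snp_d": ("1", 5_999_000),   # just under 1 Mb away: kept
}
selected = cis_snps(probe, snps)
```

The same filter would apply to CNV clones by passing clone midpoints as positions and `window=2_000_000`.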
Association analysis
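Additive association model:
Linear regression e.g. CC = 0, CT = 1, TT = 2.
[Figure: Expression level (y-axis, about 8.0 to 9.5) against Genotype CC, CT, TT coded 0, 1, 2 (x-axis), with a fitted regression line.]
Reported for each test:
- slope of line
- p-value
- r2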
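A minimal sketch of the additive model, using only the standard library. The genotype coding comes from the slide; the function name and the expression values are hypothetical. The sketch returns the slope, r2 and the t statistic of the slope; the p-value would come from a Student t distribution with n - 2 degrees of freedom, which a statistics library such as SciPy (`scipy.stats.linregress`) reports directly.

```python
from math import sqrt

CODES = {"CC": 0, "CT": 1, "TT": 2}  # additive coding from the slide

def additive_regression(genotypes, expression):
    """Least-squares fit of expression on genotype code.

    Returns (slope, intercept, r2, t_stat). Assumes at least 3 samples,
    more than one genotype class, and a fit that is not perfect.
    """
    x = [CODES[g] for g in genotypes]
    y = list(expression)
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    r2 = sxy * sxy / (sxx * syy)
    # Standard error of the slope from the residual sum of squares.
    slope_se = sqrt((1 - r2) * syy / ((n - 2) * sxx))
    return slope, intercept, r2, slope / slope_se

# Hypothetical data for six individuals.
genotypes = ["CC", "CC", "CT", "CT", "TT", "TT"]
expression = [8.0, 8.2, 8.5, 8.7, 9.0, 9.4]
slope, intercept, r2, t_stat = additive_regression(genotypes, expression)
```

For the CNV analysis the same fit would use the clone log2 ratio as the predictor in place of the genotype code.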
CNV cis-analysis:
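clone midpoint within 2Mb of probe midpoint
[Figure: a window extending 2Mb on either side of the probe within a gene; clones whose midpoints fall in the window are tested.]
Linear regression for CNV and expression: expression level against clone signal (log2 ratio)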
Multiple-test correction
1. Bonferroni: cis- and whole-genome
2. False Discovery Rate (FDR): cis- and whole-genome
3. Permutations: cis- and whole-genome
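The first two corrections can be sketched in a few lines. The slides do not say which FDR procedure was used; the Benjamini-Hochberg step-up procedure below is an assumption, as are the function names, the `alpha` level and the example p-values.

```python
def bonferroni(pvalues, alpha=0.05):
    """Flag tests with p <= alpha / (number of tests)."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]

def benjamini_hochberg(pvalues, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (assumed FDR method).

    Finds the largest rank k such that p_(k) <= (k / m) * alpha and
    flags the k smallest p-values.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / m * alpha:
            cutoff_rank = rank
    significant = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff_rank:
            significant[i] = True
    return significant

# Hypothetical p-values from ten tests.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
bonf = bonferroni(pvals)
bh = benjamini_hochberg(pvals)
```

On these ten values Bonferroni keeps only the smallest p-value, while the FDR procedure keeps the two smallest, which illustrates why the less conservative correction is attractive when many SNP-gene pairs are tested.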
Permutation design
[Figure: a GENOTYPES matrix (entries g11 ... gin) alongside GENE EXPRESSION values (Exp1 ... Expi); the expression values are permuted relative to the genotypes.]
- 10,000 permutations; each time keep lowest p-value
- Null distribution of 10,000 extreme p-values
- Compare observed p-values to the tails of the null
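The permutation scheme above can be sketched for a single gene and its cis SNPs. Assumptions, none of which come from the slides: r2 from the additive regression stands in for the p-value (with a fixed sample size and no missing data, the lowest regression p-value corresponds to the highest r2), the adjusted p-value uses the common (1 + count) / (1 + permutations) estimator, and all names and data are hypothetical.

```python
import random

def r_squared(x, y):
    """Squared Pearson correlation; 0.0 if either variable is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)

def permutation_pvalues(genotypes, expression, n_perm=10_000, seed=0):
    """genotypes: dict of SNP id -> list of 0/1/2 codes per individual.
    expression: list of expression values in the same individual order.
    Returns dict of SNP id -> permutation-adjusted p-value."""
    rng = random.Random(seed)
    observed = {snp: r_squared(g, expression) for snp, g in genotypes.items()}
    shuffled = list(expression)
    null_best = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the genotype-expression link
        # Keep only the most extreme statistic across all SNPs.
        null_best.append(max(r_squared(g, shuffled) for g in genotypes.values()))
    return {
        snp: (1 + sum(b >= r2 for b in null_best)) / (n_perm + 1)
        for snp, r2 in observed.items()
    }

# Hypothetical data: 12 individuals, one associated SNP and one unrelated SNP.
genos = {
    "snp1": [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
    "snp2": [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2],
}
expr = [8.0, 8.1, 7.9, 8.2, 8.6, 8.4, 8.5, 8.7, 9.0, 9.1, 8.9, 9.2]
adjusted = permutation_pvalues(genos, expr, n_perm=2000, seed=1)
```

Because whole expression vectors are shuffled while genotypes stay fixed, the correlation between neighbouring SNPs is preserved in the null distribution, which is the usual motivation for this design over Bonferroni correction.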
Significant expression – cis-SNP
associations
• CEU genes 323
• CHB genes 348
• JPT genes 370
• YRI genes 411
• 888 non-redundant genes
• 67 genes in all 4 populations (8%)
• 333 genes in at least 2 populations (37%)
~6% of genes exhibit significant cis- association
permutation threshold 0.001; SNP-probe distance < 1Mb
Significant expression – cis-CNV clones
associations
• CEU genes 40
• CHB genes 32
• JPT genes 40
• YRI genes 42
• 99 non-redundant genes
• 7 genes associated in all 4 populations (7%)
• 34 genes in at least 2 populations (34%)
permutation threshold 0.001; clone-probe distance < 2Mb
Table 2: Population overlap of SNP-associated genes, clone-associated genes,
and genes with both SNP and clone associations.
Population overlap                             CNV     SNP
CEU-CHB-JPT-YRI                                  7      67
CEU-CHB-JPT                                      4      48
CEU-CHB-YRI                                      0      11
CEU-JPT-YRI                                      0      12
CHB-JPT-YRI                                      3      28
CEU-CHB                                          3      18
CEU-JPT                                          0      15
CEU-YRI                                          6      36
CHB-JPT                                          5      51
CHB-YRI                                          3      18
JPT-YRI                                          3      27
CEU only                                        20     116
CHB only                                         7     107
JPT only                                        18     122
YRI only                                        20     212
SUM                                             99     888
Gene associations in at least 2 populations     34     331
  fraction of total                           0.34    0.37
Gene associations in single populations         65     557
  fraction of total                           0.66    0.63
Note: clones in CNVs with freq > 1
permutation threshold 10^-3
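An overlap table of this kind can be derived from the per-population lists of associated genes. The sketch below uses made-up gene identifiers and a hypothetical function name; it only shows the counting logic, not the actual gene lists behind the table.

```python
from collections import Counter

POPS = ["CEU", "CHB", "JPT", "YRI"]

def overlap_table(genes_by_pop):
    """Count genes by the exact set of populations in which they are
    significant. Returns (counts per combination, number of
    non-redundant genes, number of genes found in >= 2 populations)."""
    membership = {}
    for pop in POPS:
        for gene in genes_by_pop.get(pop, ()):
            membership.setdefault(gene, []).append(pop)
    counts = Counter("-".join(pops) for pops in membership.values())
    total = len(membership)
    shared = sum(1 for pops in membership.values() if len(pops) >= 2)
    return counts, total, shared

# Hypothetical gene sets, one per population.
genes_by_pop = {
    "CEU": {"g1", "g2", "g3"},
    "CHB": {"g1", "g4"},
    "JPT": {"g1", "g4", "g5"},
    "YRI": {"g1", "g2"},
}
counts, total, shared = overlap_table(genes_by_pop)
```

Here `total` corresponds to the SUM row (non-redundant genes) and `shared / total` to the fraction of genes associated in at least 2 populations.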
Some genes
ABC1, ABHD6, ACY1L2, ADAT1, ARNT, ARSA, ASAHL, ATP13A, B7, BBS2, BLK, C14orf130, C14orf4, C14orf52, C1orf16, C20orf22,
C21orf107, C7orf13, C7orf29, C7orf31, C8orf13, C9orf95, CARD8, CAT, CD151, CD79B, CDKN1A, CDKN2B, CGI-111, CGI-62, CGI-96,
CHCHD2, CHI3L2, CHRNE, CNN2, CP110, CPEB4, CPNE1, CRIPT, CSTB, CTNS, CTSH, CTSK, DCLRE1B, DCTD, DERP6, dJ383J4.3,
DKFZp434N035, DKFZP566H073, DKFZP566J2046, DKFZP586D0919, DKFZp761A132, DNAJD1, DOM3Z, DPYSL4, DSCR5, DTNB,
ECHDC3, EGFL5, EIF2B2, ENTPD1, ERMAP, FCGR2A, FDX1, FKBP1A, FLJ10252, FLJ10904, FLJ12994, FLJ12998, FLJ13576, FLJ14009,
FLJ14753, FLJ20444, FLJ20635, FLJ21347, FLJ21616, FLJ22374, FLJ22573, FLJ22635, FLJ23235, FLJ34443, FLJ35827, FLJ36888,
FLJ37970, FLJ40432, FLJ46603, FLJ90036, FUT10, GAA, GSTM1, GSTM2, GSTT1, H17, HABP4, HIBCH, HLA-C, HLA-DQA1, HLA-DQA2,
hmm1412, hmm23621, hmm26268, hmm31752, hmm31999, hmm3577, hmm3587, hmm5445, hmm665, hmm8232, HNLF, Hs.119946,
Hs.124623, Hs.135624, Hs.153573, Hs.158943, Hs.164463, Hs.169006, Hs.171169, Hs.212658, Hs.245997, Hs.26039, Hs.264076, Hs.311977,
Hs.333841, Hs.379903, Hs.396207, Hs.400876, Hs.40696, Hs.431200, Hs.43687, Hs.453941, Hs.460359, Hs.465789, Hs.466924, Hs.467281,
Hs.482037, Hs.485895, Hs.490095, Hs.495422, Hs.506072, Hs.517172, Hs.519979, Hs.5855, Hs.6637, HSRTSBETA, IFIT5, IL16, IL21R,
IMAGE3451454, IMMT, IPP, IREB2, IRF5, KIAA0265, KIAA0483, KIAA0643, KIAA0748, KIAA1463, KIAA1627, LCMT1, LOC113386,
LOC132001, LOC132321, LOC135043, LOC151963, LOC282956, LOC283710, LOC283970, LOC284184, LOC284293, LOC285407,
LOC286353, LOC339231, LOC339803, LOC339804, LOC340435, LOC347981, LOC348094, LOC348180, LOC374758, LOC375097,
LOC375399, LOC378075, LOC388918, LOC389362, LOC389763, LOC399987, LOC400410, LOC400566, LOC400642, LOC400684,
LOC400933, LOC401075, LOC401135, LOC401284, LOC51240, LOC90637, LOC90693, MAN1A2, MCMDC1, MGC10120, MGC12458,
MGC13186, MGC19764, MGC20235, MGC20481, MGC20781, MGC22773, MGC24665, MGC2752, MGC3794, MGC9084, MMRP19, MRPL21,
MRPL43, MTERF, MYOM2, NDUFA10, NDUFS5, NMNAT3, NUDT2, OAS1, PACSIN2, PASK, PBX4, PCTAIRE2BP, PEX5, PEX6, PGS1,
PHACS, PHC2, PHEMX, PIP5K1C, PIP5K2A, PKHD1L1, POLR2J, PP3856, PP784, PPA2, PPFIA1, PPIL3, PTER, QRSL1, R29124_1,
RABEP1, RAPGEFL1, RDH5, RPAP1, RPL13, RPL36AL, RPL8, RPLP2, RPS16, RPS6KB2, SARS2, SERPINB10, SF1, SH3GLB2, SHMT1,
SIAT4C, SIVA, SKIV2L, SNAP29, SNX11, SOD2, SPG7, SQSTM1, ST7L, STAT6, STK25, SYNGR1, SYNGR3, TAP2, TAPBP-R, TBC1D4,
TCL6, TEF, TGM5, THAP5, THAP6, THOC3, TIMM10, TINP1, TMEM8, TMPIT, TRAPPC4, TRIM4, TSGA10, TSGA2, TUBB, UBE2G1,
UGT2B11, UGT2B17, UGT2B7, UROS, USMG5, VPS28, WARS2, WBSCR27, WWOX, XRRA1, ZNF266, ZNF384, ZNF493, ZNF587, ZNF79,
ZNF85, ZRANB1
• UGT2B7, UGT2B11, UGT2B17
• GSTM1
Genomic location of associations
[Figure, panels C to F, each with one sub-panel per population (CEU, CHB, JPT, YRI).
Panels C and D: -log10(pvalue) against distance, for SNPs (0 to 1,000,000 bp) and for CNVs (0 to 2,000,000 bp); points are marked pos/neg.
Panels E and F: Frequency against Adjusted_R^2 (0.0 to 1.0), for SNPs and for CNVs.]
Effects of Copy Number Variation on
gene expression
[Figure: the four arrangements of regulatory elements (REG) and genes (GENE), annotated with the expected direction of the association with expression:]
• Additional gene copy: POSITIVE
• Increase of distance from regulatory element: POSITIVE OR NEGATIVE
• New regulatory element: POSITIVE OR NEGATIVE
• Gene interruption: NEGATIVE
Negative or positive slope in CNV associations
80% positive
20% negative
What is the overlap between
SNP and CNV effects?
Do SNPs capture the CNV effects
through Linkage Disequilibrium?
LD between CNV and SNP
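[Figure: chromosomes carrying SNP allele A have two copies of Gene X (2x expression); chromosomes carrying SNP allele G have one copy of Gene X (1x expression).]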
CNVs and SNPs mostly capture
different effects
• Relative impact on gene expression: 82% SNPs, 18% CNVs
• Only 13% of genes with CNV association also had a SNP
association in the same population
– biased toward large effect size.
– CNV and SNP variation are highly correlated (p-value 0.001).
• Lack of overlapping effects is not due to CNVs in regions
of segmental duplications (few HapMap SNPs).
– Percentage of associated clones overlapping SDs does not differ
from all clones overlapping SDs (p-value: 0.016).
– Also, the probability that a CNV signal is captured by SNPs does
not depend on whether the CNV is in a SD (17.3%) or outside of
SDs (15.9%).
Phase II HapMap (2.2m SNPs)
Overlap of genes across populations as detected using phaseI and phaseII HapMap (10^-3 threshold)
                                              SNP phaseII           SNP phaseI
Population overlap                            genes   fraction      genes   fraction
CEU-CHB-JPT-YRI                                  71      0.077         67      0.075
CEU-CHB-JPT                                      43      0.047         48      0.054
CEU-CHB-YRI                                      15      0.016         11      0.012
CEU-JPT-YRI                                      10      0.011         12      0.014
CHB-JPT-YRI                                      33      0.036         28      0.032
CEU-CHB                                          23      0.025         18      0.020
CEU-JPT                                          14      0.015         15      0.017
CEU-YRI                                          43      0.047         36      0.041
CHB-JPT                                          47      0.051         51      0.057
CHB-YRI                                          24      0.026         18      0.020
JPT-YRI                                          30      0.033         27      0.030
CEU only                                        116      0.126        116      0.131
CHB only                                         97      0.106        107      0.120
JPT only                                        123      0.134        122      0.137
YRI only                                        228      0.249        212      0.239
SUM (Non-redundant genes)                       917                   888
Gene associations in at least 2 populations     353       0.38        331       0.37
Gene associations in single populations         564       0.62        557       0.63
Note: 770 genes overlap between Non-redundant associated genes using phaseI and Non-redundant associated genes using phaseII.
Direction of allelic effect
[Figure: histograms of Frequency against Expression Levels by genotype (AA, AG, GG) in two populations (POP1, POP2).
AGREEMENT: the genotypes are ordered the same way along the expression axis in both populations.
OPPOSITE: the order of AA and GG is reversed between the two populations.]
Direction of allelic effects
[Figure: scatter plot of CEU-slope (y-axis, -1.5 to 1.0) against YRI-slope (x-axis, -2 to 2).]
95% have the same direction
Trans effects
[Figure: classes of candidate SNPs around a regulatory element (REG) and gene (GENE) on DNA: rSNPs, nsSNPs, spliceSNPs, mirnaSNPs.]
Genome-wide associations
Dissect regulatory networks
Trans analysis: number of genes with significant associations
Threshold     CEU     YRI     CHB     JPT
pv 10^-4       16       9      17      16
pv 10^-3       45      23      38      40
pv 10^-2      251     164     216     200
pv 0.05      1107     743     900     876
Trans analysis: number of significant SNP-gene associations
Threshold     CEU     YRI     CHB     JPT
pv 10^-4       93      16      76      43
pv 10^-3      193      41     118     104
pv 10^-2      660     253     400     461
pv 0.05      2726    1130    1777    1913
Regulatory variants have the highest impact on regulatory
networks
Statistics of RS categories used as input for trans analysis (percentages)
Category      CEU      YRI      CHB      JPT
cis 10^-3   53.07    51.27    54.42    54.53
ns          39.76    41.12    38.88    38.76
splice       7.05     7.47     6.57     6.59
miRNA        0.12     0.13     0.13     0.12
Statistics of significant RS categories for threshold 10^-2 (percentages)
Category      CEU      YRI      CHB      JPT
cis 10^-3   80.84    54.69    72.39    73
ns          16.02    39.84    22.64    22.89
splice       3.14     5.47     4.98     3.89
miRNA        0        0        0        0.22
Statistics of significant RS categories for threshold 10^-3 (percentages)
Category      CEU      YRI      CHB      JPT
cis 10^-3   87.24    58.54    81.51    79.05
ns           9.18    39.02    13.45    16.19
splice       3.57     2.44     5.04     4.76
miRNA        0        0        0        0
Conclusions
- Large number of genes with significant expression variation within
and between human population samples, and strong association
between individual genes and specific SNPs and CNVs.
- Little overlap between SNP and CNV signals.
- Replication of significant signals across populations.
- Promising approach for identification of functionally variable
regulatory regions.
- Cis regulatory variation is mostly responsible for genome-wide
regulatory variation.
Pre-publication data release
www.sanger.ac.uk/genevar/
Acknowledgements
Barbara Stranger, Matthew Forrest, Catherine Ingle, Antigone Dimas,
Christine Bird, Alexandra Nica, Claude Beazley, Panos Deloukas
Cambridge University: Mark Dunning, Simon Tavaré
Cornell University: Andy Clark
Genome Structural Variation Consortium: Matt Hurles, Richard Redon,
Nigel Carter, Charles Lee, Chris Tyler-Smith, Stephen Scherer
illumina: Jill Orwick, Mark Gibbs
The HapMap Consortium
Wellcome Trust for funding
Wellcome Trust Advanced Courses
Working with the HapMap
2-5 April 2007
Closing date for applications: 10 January 2007
Wellcome Trust Genome Campus, Hinxton, Cambridge
This 4-day residential workshop will provide a comprehensive overview of the
International HapMap Project, including practical experience of working with the
HapMap data to map phenotypic traits to locations in the human genome.
Theoretical lectures will be combined with hands-on practical sessions and
introduction to relevant databases and tools.
Course instructors: Paul de Bakker (MIT), Manolis Dermitzakis (Sanger Institute),
Mike Feolo (NIH/NCBI), Jonathan Marchini (Oxford University), Gil McVean (Oxford
University), Steve Sherry (NIH/NCBI), Albert Vernon Smith (CSHL), Barbara Stranger
(Sanger Institute), Eleftheria Zeggini (Wellcome Trust Center for Human Genetics)
Speakers: Lon Cardon (Wellcome Trust Center for Human Genetics), Panos Deloukas
(Sanger Institute), John Todd (Cambridge University)
Full information and application details at:
www.wellcome.ac.uk/advancedcourses