Transcript Slide 1
Inference of cis and trans regulatory variation in the human genome
Manolis Dermitzakis
The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Cambridge, UK

Gene expression
• Altered patterns of gene expression are linked to disease.
– e.g., Type 1 diabetes, Burkitt's lymphomas.
• Widespread intraspecific variation.
• Heritable genetic variation for transcript levels.
– Familial aggregation of expression profiles (Cheung et al. 2003).
– In humans, ~30% of surveyed loci exhibited a genetic component for expression differences (Monks et al. 2004; Schadt et al. 2003).
• Much of the influential variation is located cis- to the coding locus.
– In humans, mouse, and maize, 35%-50% of the genetic basis for intraspecific differences in transcription level is cis- to the coding locus (e.g. Morley et al. 2004; Schadt et al. 2003; Stranger et al. 2005; Cheung et al. 2005, etc.).

Nature of regulatory variation
[Diagram: REG and GENE on DNA, with numbered panels i) to iv) and the labels Pre-mRNA, mRNA, Protein, Expression]
Stranger and Dermitzakis, Human Genomics 2005

Effects of Copy Number Variation on gene expression
[Diagram: four REG/GENE scenarios: additional gene copy; increase of distance from regulatory element; new regulatory element; gene interruption]

Gene expression association mapping
[Figure: histograms of expression levels by genotype (AA, AG, GG); x-axis: Expression Levels, y-axis: Frequency]
Stranger et al. PLoS Genet 2005

Phenotypic variation space
• illumina Human 6 x 2 gene GEX arrays (Beads in Wells)
• Whole-genome gene expression, ~48,000 transcripts: 24,000 RefSeq, 24,000 other transcripts
• 270 HapMap individuals:
– CEU: 30 trios, 90 total
– CHB: 45 unrelated
– JPT: 45 unrelated
– YRI: 30 trios, 90 total
[Diagram: cell line, RNA, two IVT reactions (IVT1, IVT2), four replicates (rep1 to rep4)]
• 2 IVTs each person; 2 replicate hybridizations each IVT
• Quantile normalization of all replicates of each individual.
• Median normalization across all individuals of a population.
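The two normalization steps named on the slide can be sketched roughly as follows. This is an illustration only, not the pipeline used for the talk: the function names, the NumPy implementation, the lack of tie handling, and the choice of an additive shift for the median step (which assumes log-scale signals) are all assumptions made for the example.

```python
import numpy as np

def quantile_normalize(replicates):
    """Force every replicate (column) onto a common distribution.

    replicates: array of shape (n_probes, n_replicates) for ONE individual.
    The common distribution is the mean of the sorted columns.
    Ties are not treated specially in this sketch.
    """
    data = np.asarray(replicates, dtype=float)
    order = np.argsort(data, axis=0)            # probe order within each column
    ranks = np.argsort(order, axis=0)           # rank of each probe in its column
    reference = np.sort(data, axis=0).mean(axis=1)
    return reference[ranks]                     # look up reference value by rank

def median_normalize(individuals):
    """Shift each individual (column) so that all medians coincide.

    individuals: array of shape (n_probes, n_individuals) for ONE population.
    Assumption: an additive shift on log-scale data; the slide does not say
    whether the original analysis used a shift or a scaling.
    """
    data = np.asarray(individuals, dtype=float)
    col_medians = np.median(data, axis=0)
    target = np.median(col_medians)             # population-wide target median
    return data - col_medians + target

# Hypothetical toy data: 3 probes, 2 replicates of one individual.
reps = [[5.0, 4.0], [2.0, 1.0], [3.0, 6.0]]
normalized_reps = quantile_normalize(reps)

# Hypothetical toy data: 3 probes, 2 individuals of one population.
pop = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
normalized_pop = median_normalize(pop)
```

After the first step the replicates of an individual share the same set of values and differ only in which probe holds which value; after the second step every individual in the population has the same median signal.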
Within- and between- individual variation
[Figure: two log-scale scatter plots of array signal. "test": 1400410090_A vs. 1400410090_C; "trial": 1400410090_A vs. 1400410096_A]
• 2 replicates; single YRI individual: r2 (all genes) = 0.990; detected genes (0.98 in both samples: 12,076); r2 (detected) = 0.994
• 2 YRI individuals: r2 (all genes) = 0.964; detected genes (0.98 in both samples: 11,529); r2 (detected) = 0.964

HapMap SNPs
• 60 CEU, 45 CHB, 44 JPT, 60 YRI
• 14,925 genes
• Phase I HapMap; MAF > 0.05
– CEU: 762,447 SNPs
– CHB: 695,601
– JPT: 689,295
– YRI: 799,242
• ~1 SNP per 5 kb

Copy Number Variation dataset
• Genome Structural Variation Consortium
– Redon et al. Nature in press
• Array-CGH using a whole genome tile path array
– Median clone size ~170 kb
– All 270 HapMap individuals
• Quantitative values (log2 ratios) representing diploid genome copy number, not genotypes.
• 1117 CNVs called from log2 ratios
– Calls based on standard deviation of log2 ratios
– Many CNVs experimentally verified
• 26,563 clones; 93.7% of the euchromatic genome

SNP cis-analysis: SNPs within 1Mb of probe midpoint
[Diagram: 1Mb window on either side of the probe in the gene, with SNPs marked]

Association analysis
• Additive association model: linear regression, e.g. CC = 0, CT = 1, TT = 2.
• Reported per test: slope of line, p-value, r2
[Figure: expression level against genotype (CC, CT, TT coded 0, 1, 2) with fitted line]

CNV cis-analysis: clone midpoint within 2Mb of probe midpoint
[Diagram: 2Mb window on either side of the probe in the gene, with clones marked]
• Linear regression for CNV and expression, using the clone signal (log2 ratio)

Multiple-test correction
1. Bonferroni (whole-genome, cis-)
2. False Discovery Rate, FDR (whole-genome, cis-)
3. Permutations (whole-genome, cis-)

Permutation design
[Diagram: GENOTYPES matrix (g11 … gin) alongside GENE EXPRESSION values (Exp1 … Expi), with a "permute" step]
- 10,000 permutations; each time keep lowest p-value
- Null distribution of 10,000 extreme p-values
- Compare observed p-values to the tails of the null

Significant expression – cis-SNP associations
• CEU genes 323
• CHB genes 348
• JPT genes 370
• YRI genes 411
• 888 non-redundant genes
• 67 genes in all 4 populations (8%)
• 331 genes in at least 2 populations (37%)
• ~6% of genes exhibit significant cis- association
permutation threshold 0.001; SNP-probe distance < 1Mb

Significant expression – cis-CNV clone associations
• CEU genes 40
• CHB genes 32
• JPT genes 40
• YRI genes 42
• 99 non-redundant genes
• 7 genes associated in all 4 populations (7%)
• 34 genes in at least 2 populations (34%)
permutation threshold 0.001; clone-probe distance < 2Mb

Table 2: Population overlap of SNP-associated genes, clone-associated genes, and genes with both SNP and clone associations.
Populations        CNV   SNP
CEU-CHB-JPT-YRI      7    67
CEU-CHB-JPT          4    48
CEU-CHB-YRI          0    11
CEU-JPT-YRI          0    12
CHB-JPT-YRI          3    28
CEU-CHB              3    18
CEU-JPT              0    15
CEU-YRI              6    36
CHB-JPT              5    51
CHB-YRI              3    18
JPT-YRI              3    27
CEU only            20   116
CHB only             7   107
JPT only            18   122
YRI only            20   212
SUM                 99   888

Gene associations in at least 2 populations: CNV 34 (0.34 of total); SNP 331 (0.37 of total)
Gene associations in single populations: CNV 65 (0.66 of total); SNP 557 (0.63 of total)
Note: clones in CNVs with freq > 1; permutation threshold 10^-3

Some genes
ABC1, ABHD6, ACY1L2, ADAT1, ARNT, ARSA, ASAHL, ATP13A, B7, BBS2, BLK, C14orf130, C14orf4, C14orf52, C1orf16, C20orf22, C21orf107, C7orf13, C7orf29, C7orf31, C8orf13, C9orf95, CARD8, CAT, CD151, CD79B, CDKN1A, CDKN2B, CGI-111, CGI-62, CGI-96, CHCHD2, CHI3L2, CHRNE, CNN2, CP110, CPEB4, CPNE1, CRIPT, CSTB, CTNS, CTSH, CTSK, DCLRE1B, DCTD, DERP6, dJ383J4.3, DKFZp434N035, DKFZP566H073, DKFZP566J2046, DKFZP586D0919, DKFZp761A132, DNAJD1, DOM3Z, DPYSL4, DSCR5, DTNB, ECHDC3, EGFL5, EIF2B2, ENTPD1, ERMAP, FCGR2A, FDX1, FKBP1A, FLJ10252, FLJ10904, FLJ12994, FLJ12998, FLJ13576, FLJ14009, FLJ14753, FLJ20444, FLJ20635, FLJ21347, FLJ21616, FLJ22374, FLJ22573, FLJ22635, FLJ23235, FLJ34443, FLJ35827, FLJ36888, FLJ37970, FLJ40432, FLJ46603, FLJ90036, FUT10, GAA, GSTM1, GSTM2, GSTT1, H17, HABP4, HIBCH, HLA-C, HLA-DQA1, HLA-DQA2, hmm1412, hmm23621, hmm26268, hmm31752, hmm31999, hmm3577, hmm3587, hmm5445, hmm665, hmm8232, HNLF, Hs.119946, Hs.124623, Hs.135624, Hs.153573, Hs.158943, Hs.164463, Hs.169006, Hs.171169, Hs.212658, Hs.245997, Hs.26039, Hs.264076, Hs.311977, Hs.333841, Hs.379903, Hs.396207, Hs.400876, Hs.40696, Hs.431200, Hs.43687, Hs.453941, Hs.460359, Hs.465789, Hs.466924, Hs.467281, Hs.482037, Hs.485895, Hs.490095, Hs.495422, Hs.506072, Hs.517172, Hs.519979, Hs.5855, Hs.6637, HSRTSBETA, IFIT5, IL16, IL21R, IMAGE3451454, IMMT, IPP, IREB2, IRF5, KIAA0265, KIAA0483, KIAA0643, KIAA0748, KIAA1463, KIAA1627, LCMT1, LOC113386, LOC132001, LOC132321, LOC135043, LOC151963, LOC282956, LOC283710, LOC283970, LOC284184, LOC284293, LOC285407, LOC286353, LOC339231, LOC339803, LOC339804, LOC340435, LOC347981, LOC348094, LOC348180, LOC374758, LOC375097, LOC375399, LOC378075, LOC388918, LOC389362, LOC389763, LOC399987, LOC400410, LOC400566, LOC400642, LOC400684, LOC400933, LOC401075, LOC401135, LOC401284, LOC51240, LOC90637, LOC90693, MAN1A2, MCMDC1, MGC10120, MGC12458, MGC13186, MGC19764, MGC20235, MGC20481, MGC20781, MGC22773, MGC24665, MGC2752, MGC3794, MGC9084, MMRP19, MRPL21, MRPL43, MTERF, MYOM2, NDUFA10, NDUFS5, NMNAT3, NUDT2, OAS1, PACSIN2, PASK, PBX4, PCTAIRE2BP, PEX5, PEX6, PGS1, PHACS, PHC2, PHEMX, PIP5K1C, PIP5K2A, PKHD1L1, POLR2J, PP3856, PP784, PPA2, PPFIA1, PPIL3, PTER, QRSL1, R29124_1, RABEP1, RAPGEFL1, RDH5, RPAP1, RPL13, RPL36AL, RPL8, RPLP2, RPS16, RPS6KB2, SARS2, SERPINB10, SF1, SH3GLB2, SHMT1, SIAT4C, SIVA, SKIV2L, SNAP29, SNX11, SOD2, SPG7, SQSTM1, ST7L, STAT6, STK25, SYNGR1, SYNGR3, TAP2, TAPBP-R, TBC1D4, TCL6, TEF, TGM5, THAP5, THAP6, THOC3, TIMM10, TINP1, TMEM8, TMPIT, TRAPPC4, TRIM4, TSGA10, TSGA2, TUBB, UBE2G1, UGT2B11, UGT2B17, UGT2B7, UROS, USMG5, VPS28, WARS2, WBSCR27, WWOX, XRRA1, ZNF266, ZNF384, ZNF493, ZNF587, ZNF79, ZNF85, ZRANB1
• UGT2B7, 11, 17
• GSTM1

Genomic location of associations
[Figure, panels C (SNPs) and D (CNVs): -log10(pvalue) against distance, 0 to 1,000,000 for SNPs and 0 to 2,000,000 for CNVs, one plot per population (CEU, CHB, JPT, YRI); legend: pos/neg]
[Figure, panels E and F: histograms of Adjusted_R^2 (0.0 to 1.0), one plot per population (CEU, CHB, JPT, YRI); y-axis: Frequency]

Effects of Copy Number Variation on gene expression
[Diagram: four REG/GENE scenarios, each labelled with the expected direction of effect]
• Additional gene copy: POSITIVE
• Increase of distance from regulatory element: POSITIVE OR NEGATIVE
• New regulatory element: POSITIVE OR NEGATIVE
• Gene interruption: NEGATIVE

Negative or positive slope in CNV associations
• 80% positive
• 20% negative

What is the overlap between SNP and CNV effects?
Do SNPs capture the CNV effects through Linkage Disequilibrium?

LD between CNV and SNP
[Diagram: haplotypes carrying SNP allele A with two copies of Gene X (2x expression) and haplotypes carrying SNP allele G with one copy of Gene X (1x expression)]

CNVs and SNPs mostly capture different effects
• Relative impact on gene expression: 82% SNPs, 18% CNVs
• Only 13% of genes with CNV association also had a SNP association in the same population
– biased toward large effect size.
– CNV and SNP variation are highly correlated (p-value 0.001).
• Lack of overlapping effects is not due to CNVs in regions of segmental duplications (few HapMap SNPs).
– Percentage of associated clones overlapping SDs does not differ from all clones overlapping SDs (p-value: 0.016).
– Also, the probability that a CNV signal is captured by SNPs does not depend on whether the CNV is in a SD (17.3%) or outside of SDs (15.9%).
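As a rough illustration of the regressions behind these slopes, and of the permutation design described earlier in the talk, a minimal Python sketch follows. It is not the authors' code. The function names (additive_fit, permutation_p), the NumPy implementation, and the choice to shuffle the expression vector are assumptions. The sketch also keeps the largest r2 per permutation instead of the lowest p-value; for simple linear regression with a fixed number of individuals and no missing data the two give the same ranking, but this is a substitution, not what the slides state.

```python
import numpy as np

def additive_fit(predictor, expression):
    """Least-squares line of expression on one numeric predictor.

    predictor: genotype coded 0/1/2 (e.g. CC, CT, TT) or a clone log2 ratio.
    Returns (slope, r_squared). A constant input gives (0.0, 0.0).
    """
    x = np.asarray(predictor, dtype=float)
    y = np.asarray(expression, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    sxx = np.dot(xc, xc)
    syy = np.dot(yc, yc)
    if sxx == 0.0 or syy == 0.0:
        return 0.0, 0.0
    sxy = np.dot(xc, yc)
    return sxy / sxx, (sxy * sxy) / (sxx * syy)

def permutation_p(predictors, expression, n_perm=1000, seed=0):
    """Permutation-adjusted significance for the best cis- variant of one gene.

    predictors: array (n_variants, n_individuals), all variants in the window.
    Each permutation shuffles expression across individuals (assumption) and
    keeps the most extreme statistic over all variants, building a null
    distribution of extremes. The observed best statistic is compared to it.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(predictors, dtype=float)
    y = np.asarray(expression, dtype=float)
    observed = max(additive_fit(row, y)[1] for row in X)
    exceed = 0
    for _ in range(n_perm):
        y_perm = rng.permutation(y)
        best = max(additive_fit(row, y_perm)[1] for row in X)
        if best >= observed:
            exceed += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (exceed + 1) / (n_perm + 1)

# Hypothetical toy data: six individuals, two SNPs near one probe.
genotypes = [[0, 1, 2, 0, 1, 2],
             [0, 0, 1, 1, 2, 2]]
expression = [8.0, 8.5, 9.0, 8.0, 8.5, 9.0]

slope, r2 = additive_fit(genotypes[0], expression)
best_r2, adjusted_p = permutation_p(genotypes, expression, n_perm=200, seed=1)
```

The sign of the returned slope is what the 80%/20% split above refers to: for a clone log2 ratio, a positive slope means expression rises with copy number. The slides use 10,000 permutations; n_perm is kept small here only to make the example quick to run.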
Phase II HapMap (2.2m SNPs)
Overlap of genes across populations as detected using phase I and phase II HapMap (10^-3 threshold)

                   SNP phase II         SNP phase I
Populations        genes  fraction      genes  fraction
CEU-CHB-JPT-YRI       71     0.077         67     0.075
CEU-CHB-JPT           43     0.047         48     0.054
CEU-CHB-YRI           15     0.016         11     0.012
CEU-JPT-YRI           10     0.011         12     0.014
CHB-JPT-YRI           33     0.036         28     0.032
CEU-CHB               23     0.025         18     0.020
CEU-JPT               14     0.015         15     0.017
CEU-YRI               43     0.047         36     0.041
CHB-JPT               47     0.051         51     0.057
CHB-YRI               24     0.026         18     0.020
JPT-YRI               30     0.033         27     0.030
CEU only             116     0.126        116     0.131
CHB only              97     0.106        107     0.120
JPT only             123     0.134        122     0.137
YRI only             228     0.249        212     0.239
SUM (non-redundant)  917                  888

Gene associations in at least 2 populations: phase II 353 (0.38 of total); phase I 331 (0.37 of total)
Gene associations in single populations: phase II 564 (0.62 of total); phase I 557 (0.63 of total)
Note: 770 genes overlap between the non-redundant associated genes using phase I and the non-redundant associated genes using phase II.

Direction of allelic effect
[Figure: expression-level histograms by genotype (AA, AG, GG) in two populations (POP1, POP2); one pair of panels labelled AGREEMENT, the other OPPOSITE]

Direction of allelic effects
[Figure: scatter plot of CEU-slope against YRI-slope]
• 95% have the same direction

Trans effects
[Diagram: SNP categories around REG and GENE on DNA: rSNPs, nsSNPs, spliceSNPs, mirnaSNPs]
• Genome-wide associations
• Dissect regulatory networks

Trans analysis: number of genes with significant associations

       pv 10^-4  pv 10^-3  pv 10^-2  pv 0.05
CEU          16        45       251     1107
YRI           9        23       164      743
CHB          17        38       216      900
JPT          16        40       200      876

Trans analysis: number of significant SNP-gene associations

       pv 10^-4  pv 10^-3  pv 10^-2  pv 0.05
CEU          93       193       660     2726
YRI          16        41       253     1130
CHB          76       118       400     1777
JPT          43       104       461     1913

Regulatory variants have the highest impact on regulatory networks

Statistics of RS categories used as input for trans analysis (percentages)

             CEU     YRI     CHB     JPT
cis 10^-3  53.07   51.27   54.42   54.53
ns         39.76   41.12   38.88   38.76
splice      7.05    7.47    6.57    6.59
miRNA       0.12    0.13    0.13    0.12

Statistics of significant RS categories for threshold 10^-2 (percentages)

             CEU     YRI     CHB     JPT
cis 10^-3  80.84   54.69   72.39   73
ns         16.02   39.84   22.64   22.89
splice      3.14    5.47    4.98    3.89
miRNA       0       0       0       0.22

Statistics of significant RS categories for threshold 10^-3 (percentages)

             CEU     YRI     CHB     JPT
cis 10^-3  87.24   58.54   81.51   79.05
ns          9.18   39.02   13.45   16.19
splice      3.57    2.44    5.04    4.76
miRNA       0       0       0       0

Conclusions
- Large number of genes with significant expression variation within and between human population samples, and strong association between individual genes and specific SNPs and CNVs.
- Little overlap between SNP and CNV signals.
- Replication of significant signals across populations.
- Promising approach for identification of functionally variable regulatory regions.
- Cis regulatory variation is mostly responsible for genome-wide regulatory variation.

Pre-publication data release
www.sanger.ac.uk/genevar/

Acknowledgements
• Barbara Stranger, Matthew Forrest, Catherine Ingle, Antigone Dimas, Christine Bird, Alexandra Nica, Claude Beazley, Panos Deloukas
• Cambridge University: Mark Dunning, Simon Tavaré
• Cornell University: Andy Clark
• Genome Structural Variation Consortium: Matt Hurles, Richard Redon, Nigel Carter, Charles Lee, Chris Tyler-Smith, Stephen Scherer
• illumina: Jill Orwick, Mark Gibbs
• The HapMap Consortium
• Wellcome Trust for funding

Wellcome Trust Advanced Courses
Working with the HapMap
2-5 April 2007
Closing date for applications: 10 January 2007
Wellcome Trust Genome Campus, Hinxton, Cambridge
This 4-day residential workshop will provide a comprehensive overview of the International HapMap Project, including practical experience of working with the HapMap data to map phenotypic traits to locations in the human genome. Theoretical lectures will be combined with hands-on practical sessions and an introduction to relevant databases and tools.
Course instructors: Paul de Bakker (MIT), Manolis Dermitzakis (Sanger Institute), Mike Feolo (NIH/NCBI), Jonathan Marchini (Oxford University), Gil McVean (Oxford University), Steve Sherry (NIH/NCBI), Albert Vernon Smith (CSHL), Barbara Stranger (Sanger Institute), Eleftheria Zeggini (Wellcome Trust Center for Human Genetics) Speakers: Lon Cardon (Wellcome Trust Center for Human Genetics), Panos Deloukas (Sanger Institute), John Todd (Cambridge University) Full information and application details at: www.wellcome.ac.uk/advancedcourses