1. Interpreting rich epigenomic datasets

[Figure: chromatin state enrichments. Column labels: % genome, TSS, CpG, hiCpG-TSS, loCpG-TSS, TSS, transcribed, TES, DNase, conservation, ZNF, lamina, repeats, expression, L1 repeat, Alu repeat]
Interpreting chromatin states
How many states are meaningful:
agreement between cell types
[Figure: y-axis: ratio vs. background. Series: H1-H9, H9-H1, H1/9-IMR90, IMR90-H1, IMR90-H9, background]
• Distinctions remain recoverable between cell types,
even after 40-50 chromatin states (IMR90-H1-H9)
Preferential enhancer-promoter interactions
[Figure: interactions between chromatin states. Panels: IMR90 (same chromosome), IMR90 (different chromosomes), H1 (same chromosome), H1 (different chromosomes). State labels: Transcribed 3', Transcribed 5', Transcribed strong, Transcribed weak, Transcribed enhancer, Enhancer poised, Enhancer active, Strongest enhancer, Strong enhancer, Weak enhancer, Low signal, Heterochromatin, Repressed, Bivalent promoter, Active promoter. Group labels: Off, Prom]
• Different enhancer states show different interactions
• Enhancers/transcribed/promoters interact
• Inactive regions show fewer interactions overall
(both to active states, and to each other)
• H3K9me3 states interact between chromosomes in ES cells
2. Prioritizing experiments
Ever-expanding dimensions of epigenomics
[Figure: dimensions of epigenomic data. Current matrix: chromatin marks × cell types (thousands of whole-genome datasets). Additional dimensions: environment, genotype, disease, gender, stage, age]
• Today: Cell-type and chromatin-mark dimensions
• Next: Personal epigenomes: genotype/phenotype
• Complete matrix of conditions, individuals, alleles
Prioritize experiments for additional cell types
2 methods
Method 1
• Based on unique information
Method 2
• Based on chromatin state recovery
(1) Quantify state recovery using subsets of marks
(2) Capture additional information from mark intensity
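Step (1) of Method 2 can be illustrated with a minimal sketch. It assumes each genomic bin already has a state call from the full set of marks and a second call from a model restricted to a subset of marks; the function name `state_recovery` and the toy state tracks are hypothetical and are not taken from the talk.

```python
from collections import Counter

def state_recovery(full_states, subset_states):
    """Per-state fraction of bins whose full-model state is reproduced
    by the model that only sees a subset of marks."""
    if len(full_states) != len(subset_states):
        raise ValueError("state tracks must cover the same bins")
    totals = Counter(full_states)
    matches = Counter(f for f, s in zip(full_states, subset_states) if f == s)
    return {state: matches[state] / totals[state] for state in totals}

# Hypothetical per-bin state calls (toy data for illustration only).
full   = ["Prom", "Prom", "Enh", "Enh", "Enh", "Tx", "Tx", "Low"]
subset = ["Prom", "Enh",  "Enh", "Enh", "Low", "Tx", "Tx", "Low"]

recovery = state_recovery(full, subset)
for state, fraction in recovery.items():
    print(f"{state}: {fraction:.2f}")
```

Ranking candidate subsets by this kind of recovery score is one way to compare marks; it ignores mark intensity, which is what step (2) is meant to capture.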
→ Beyond marks: trade-offs of more cell types vs. greater depth
Method 1 example: Rank chromatin marks for a new cell type
[Figure: IMR90 marks ranked by squared prediction error, each predicted using all other marks. Top of ranking: hardest to predict (→ prioritize these marks?). Bottom of ranking: easiest to predict (redundant)]
• Hardest marks to predict using all other IMR90 marks: H3K3me3, etc
• These match the marks usually identified as the most useful: is prediction error a good metric?
Method 2 example: Rank additional marks for existing cell type
Extend IMR90 set beyond initial 22 marks
22 marks common with CD4T data:
H2AK5ac, H3K27ac, H3K27me3, H3K9me3, H2BK120ac, H3K4ac, H3K36me3, H4K20me1, H2BK12ac, H3K9ac, H3K4me1, H2BK20ac, H4K5ac, H3K4me2, H3K14ac, H4K8ac, H3K4me3, H3K18ac, H4K91ac, H3K79me1, H3K23ac, H3K79me2
19 marks only in CD4T data:
H2AK9ac, H2BK5me1, H3K9me2, CTCF, H2BK5ac, H3K27me1, H3R2me1, H2AZ, H3K36ac, H3K27me2, H3R2me2, PolII, H4K12ac, H3K36me1, H4K20me3, H4K16ac, H3K79me3, H4R3me2
3. Completing epigenomes computationally
Chromatin mark imputation
Predicting signal for missing marks
• Question: Can we predict signal intensity of one
mark given other sets of marks
• Datasets used:
– H1, IMR90 (+H9, K562, GM12878, HSMM)
• Methodological decisions:
– Focus on common set of marks
– Downsample one replicate to 10 million reads
– Split reads equally between training and test data
– Bin genome into 2 kb bins
• Model/metrics:
– Use a linear regression model for predictions
– Use squared-error loss on mark signal as the objective
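The model described above can be sketched as an ordinary least-squares fit over per-bin signals. This is an illustration under stated assumptions, not the pipeline used in the talk: the signals are synthetic, the two predictor marks and the generating coefficients (0.5, 0.3, 0.2) are made up, and the held-out data are separate simulated bins rather than a split of sequencing reads.

```python
import random

def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_linear(features, target):
    """Least-squares fit (minimizes summed squared error on the target signal).
    Returns [intercept, coef_1, ..., coef_k]."""
    rows = [[1.0] + list(f) for f in features]  # leading 1.0 is the intercept column
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, target)) for i in range(k)]
    return solve(xtx, xty)

def predict(weights, features):
    return [weights[0] + sum(w * v for w, v in zip(weights[1:], f)) for f in features]

def mse(pred, true):
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)

# Synthetic per-bin signals: two predictor marks and one target mark (assumed values).
rng = random.Random(0)

def make_bins(n):
    feats = [(rng.random(), rng.random()) for _ in range(n)]
    target = [0.5 + 0.3 * a + 0.2 * b + rng.gauss(0, 0.05) for a, b in feats]
    return feats, target

train_x, train_y = make_bins(200)
test_x, test_y = make_bins(200)

weights = fit_linear(train_x, train_y)
test_mse = mse(predict(weights, test_x), test_y)
print("intercept and coefficients:", [round(w, 3) for w in weights])
print("held-out squared error:", round(test_mse, 5))
```

The normal equations are solved directly here to keep the example dependency-free; with many marks and millions of bins a numerical library would be the more practical choice.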
Example: Predicting H3K9ac signal
[Figure: predicted vs. true H3K9ac signal]
• How good is the prediction?
• How similar to other marks?
• How does it compare to biological replicate?
Mark        Coeff
H3K56ac      0.32
H3K4me3      0.29
H3K4ac       0.22
H3K4me2      0.15
H3K27ac      0.14
H2AK5ac      0.14
H4K8ac       0.14
H3K23ac      0.13
H3K14ac      0.13
H3K79me2     0.12
H4K5ac       0.06
H3K36me3     0.04
H4K91ac      0.01
H3K4me1     -0.01
H3K18ac     -0.01
H3K27me3    -0.02
H4K20me1    -0.04
H2BK120ac   -0.05
H3K9me3     -0.05
Input       -0.07
H2BK15ac    -0.10
H3K79me1    -0.15
H2BK12ac    -0.15
H2BK20ac    -0.22
Intercept   -0.16
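One way to approach the questions on this slide (how good is the prediction, and how does it compare to a biological replicate) is to correlate each track with the observed signal and use the replicate as a reference point. The sketch below assumes Pearson correlation as the metric, which the slides do not specify, and all per-bin values are invented for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length signal tracks."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

# Hypothetical per-bin H3K9ac signal (toy values, not real data).
true_signal = [0.1, 0.4, 2.0, 3.5, 0.2, 1.1]
predicted   = [0.2, 0.5, 1.8, 3.0, 0.1, 1.3]   # imputed from other marks
replicate   = [0.0, 0.6, 2.2, 3.3, 0.3, 0.9]   # second biological replicate

r_pred = pearson(predicted, true_signal)
r_rep = pearson(replicate, true_signal)
print(f"prediction vs. true: r = {r_pred:.3f}")
print(f"replicate  vs. true: r = {r_rep:.3f}")
```

A prediction whose correlation with the observed track approaches the replicate-to-replicate correlation is about as good as repeating the experiment; the same function can be applied to each individual predictor mark to answer the "how similar to other marks" question.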
Impute missing datasets / predict new cell types
Predict missing mark from many others
Predict many marks in new cell type
Prediction of K27ac,K9ac,K4me1… in GM from DNase
Prediction of H3K4me1 from DNase across cell types
• Use mark correlations to predict missing datasets as matrices become denser
• Applications: (1) prediction in difficult-to-access conditions; (2) detecting failed experiments/replicates; (3) finding unexpected differences between predicted and raw signal
4. Allele-specific chromatin marks
Known imprinted genes confirm allele-specific methodology
Method
• Map to phased GM12878 haplotypes
• Count maternal vs. paternal reads
Validation
• Known imprinted genes are allelic
• X-inactivation: only one X chromosome active
• Requires sufficient SNPs and sufficient reads for significance
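The read-counting step and the need for sufficient reads can be illustrated with an exact two-sided binomial test against a 50/50 maternal/paternal expectation. The slides do not say which test was used, so the choice of test, the function name `allelic_p_value`, and the read counts below are assumptions.

```python
from math import comb

def allelic_p_value(maternal, paternal, p=0.5):
    """Exact two-sided binomial p-value for allelic imbalance.
    Sums the probability of every outcome no more likely than the observed one.
    Intended for modest read counts; very large counts would need log-space math."""
    n = maternal + paternal
    if n == 0:
        return 1.0
    probs = [comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]
    observed = probs[maternal]
    return min(1.0, sum(q for q in probs if q <= observed * (1 + 1e-9)))

# Hypothetical read counts over heterozygous SNPs in one gene.
p_strong = allelic_p_value(18, 2)   # clear maternal bias, enough reads
p_weak = allelic_p_value(6, 4)      # no convincing bias
p_sparse = allelic_p_value(2, 0)    # fully maternal, but too few reads
print(p_strong, p_weak, p_sparse)
```

The last case shows why read depth matters: two reads from the same allele are completely one-sided yet cannot reach significance.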
Discover allelic genes genome-wide
→ Aggregate by gene / chromatin state
Allelic activity supported by many marks, Pol2, TFs
• Includes X-inactivated paternal chromosome genes
Genome-wide correlations for pairs of marks
• Aggregate signal across chromatin states
• Active marks positively correlated
• H3K27me3 negatively correlated
Zoom in on individual examples
Active/repressive marks on paternal/maternal alleles
Pol2 reads on paternal chromosome
Active transcription of paternal chromosome
Repressive marks on maternal chromosome
• Strong repressive signal (K27me3): reads mostly maternal
• Strong active signal (K79me2 tx): reads mostly paternal
Allele-specific chromatin marks: cis-vs-trans effects
• Maternal and paternal GM12878 genomes sequenced
• Map reads to phased genome, handle SNPs and indels
• Correlate activity changes with sequence differences
5. Linking enhancers to promoters using many cell types
Power should increase with additional cell types
Chromatin State
Gene expression
Chance of spurious correlation decreases
Power to predict links increases with more cell types
• True enhancers show excess of high correlation
• Number of non-random links increases linearly with number of cell types
• Can estimate number of non-random links at any FDR
• 30 cell types: 15,000 links
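The idea of counting non-random links at a given FDR can be sketched by comparing observed enhancer-gene correlations across cell types with a shuffled background. Everything in this example is assumed for illustration: the activity and expression profiles are simulated, the 0.7 correlation threshold is arbitrary, and shuffling cell-type order is just one possible null model.

```python
import random
from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

rng = random.Random(1)
N_CELL_TYPES = 30

def random_profile():
    return [rng.random() for _ in range(N_CELL_TYPES)]

# Candidate (enhancer activity, gene expression) pairs across cell types.
pairs = []
for _ in range(20):                      # simulated true links
    activity = random_profile()
    pairs.append((activity, [a + rng.gauss(0, 0.1) for a in activity]))
for _ in range(200):                     # simulated unrelated pairs
    pairs.append((random_profile(), random_profile()))

def count_above(candidates, threshold):
    return sum(1 for e, g in candidates if pearson(e, g) >= threshold)

def shuffled(candidates):
    """Break the cell-type correspondence to build a background set."""
    out = []
    for e, g in candidates:
        g_perm = g[:]
        rng.shuffle(g_perm)
        out.append((e, g_perm))
    return out

threshold = 0.7
observed = count_above(pairs, threshold)
background = count_above(shuffled(pairs), threshold)
non_random = observed - background       # estimated number of real links
fdr = background / observed if observed else 0.0
print(observed, background, non_random, round(fdr, 3))
```

With fewer cell types the background correlations spread wider, so more shuffled pairs pass any fixed threshold; that is the sense in which the chance of spurious correlation decreases as cell types are added.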
Visualizing 10,000s of predicted enhancer-gene links
• Overlapping regulatory units, both few and many
• Both upstream and downstream elements linked
• Enhancers correlate with sequence constraint
6. Disease enrichments across 1000s of enhancers
Full T1D association spectrum → 1000s of causal SNPs
• Rank all SNPs by P-value
• Find chromatin states with enrichment in high ranks
• Signal spans 1000s of SNPs
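The ranking procedure above can be sketched as follows: sort SNPs by P-value, count how many of the top-ranked SNPs fall in a given chromatin state, and compare with the count expected from that state's share of all SNPs. The function name, the SNP list, and the state annotations are hypothetical toy values.

```python
def enrichment_in_top_ranks(snps, state, top_n):
    """snps: list of (p_value, chromatin_state) tuples.
    Returns (observed, expected, excess) for the top_n most significant SNPs."""
    ranked = sorted(snps, key=lambda s: s[0])
    observed = sum(1 for _, s in ranked[:top_n] if s == state)
    background_fraction = sum(1 for _, s in snps if s == state) / len(snps)
    expected = background_fraction * top_n
    return observed, expected, observed - expected

# Toy association results (made-up P-values and annotations).
snps = [
    (1e-8, "Enhancer"), (3e-6, "Enhancer"), (2e-5, "Other"), (4e-4, "Enhancer"),
    (0.01, "Other"), (0.05, "Promoter"), (0.2, "Other"), (0.4, "Other"),
    (0.6, "Enhancer"), (0.9, "Other"),
]

observed, expected, excess = enrichment_in_top_ranks(snps, "Enhancer", top_n=4)
print(observed, expected, excess)
```

Repeating the calculation over increasing values of `top_n` shows how far down the ranking the enrichment persists, which is how a signal spanning thousands of SNPs would show up.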
[Figure panels: GM12878 (lymphoblastoid); K562 (myelogenous leukemia)]
GM12878 enhancer enrichment now seen
Could bias in array design contribute to these enrichments?
→ Evaluate all 1000 Genomes SNPs by imputing those in LD
Cell type specific: GM and K562 enhancers
Chromatin state specific: Enhancers/promoters
Imputing SNPs in LD → stronger cell/state separation
Enhancers across cell types
Chromatin states in GM12878
Promoters: 462 (excess 81)
Enhancers: 2049 (excess 392)
1940 distinct loci (R^2<0.8)
Transcribed: 4740 (excess 522)
Insulator: 240 (excess 23)
Repressed: 1351 (excess 76)
Other: 21k (depleted by 1093)
• Excess: of 30,000 SNPs → 2049 in enhancers (excess 392)
• Mostly found in independent loci (1730 with R^2<0.2)
→ Systematically measure their regulatory contributions
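Counting independent loci at an R^2 cutoff can be illustrated with a greedy pruning pass: walk through SNPs in rank order and keep a SNP only if its R^2 with every SNP already kept is below the threshold. The slides do not describe how loci were counted, so this procedure, the SNP identifiers, and the pairwise R^2 values are all assumptions.

```python
def independent_loci(snps_by_rank, r2, threshold=0.2):
    """Greedy LD pruning. r2 maps frozenset({snp_a, snp_b}) to pairwise R^2;
    pairs missing from the map are treated as unlinked (R^2 = 0)."""
    kept = []
    for snp in snps_by_rank:
        if all(r2.get(frozenset((snp, k)), 0.0) < threshold for k in kept):
            kept.append(snp)
    return kept

# Hypothetical SNPs, already sorted by association P-value.
snps_by_rank = ["rs_a", "rs_b", "rs_c", "rs_d"]
r2 = {
    frozenset(("rs_a", "rs_b")): 0.9,   # tightly linked pair
    frozenset(("rs_c", "rs_d")): 0.1,   # weakly linked pair
}

loose = independent_loci(snps_by_rank, r2, threshold=0.2)
strict = independent_loci(snps_by_rank, r2, threshold=0.05)
print(loose, strict)
```

A stricter threshold merges more SNPs into the same locus, so the count of distinct loci falls as the cutoff is lowered.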