RNA-Seq Analysis Practicals

Differential Methylation Analysis
Simon Andrews
@simon_andrews
1
A basic question…
2
Factors to consider
• Number of observations
• Magnitude of effect
• Technical considerations
• Biological variability
• Biological common sense
3
The problem of power…
• Ideally want to cover every Cytosine (CpG)
• Have to correct for the number of tests
• There’s no way you’ll collect enough data to analyse each C and have p-values which survive multiple testing correction
• Stats have to find a way to work round this.
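As a rough illustration of the power problem, the sketch below compares the smallest p-value a single CpG could ever reach with a Bonferroni-corrected threshold. The figure of 28 million CpGs, the choice of Fisher's exact test and the equal read depth in both samples are assumptions made for the example, not values from the slides.

```python
from math import comb

def min_fisher_p(depth):
    """Smallest two-sided Fisher p-value for two samples at equal read depth.

    The most extreme table is fully methylated vs fully unmethylated:
    [[depth, 0], [0, depth]]. Its probability is 1 / C(2*depth, depth), and
    the mirror-image table is equally likely, hence the factor of 2.
    """
    return 2 / comb(2 * depth, depth)

def depth_needed(n_tests, alpha=0.05):
    """Lowest per-sample depth at which even a perfect 0% vs 100%
    difference could pass a Bonferroni-corrected threshold."""
    threshold = alpha / n_tests
    depth = 1
    while min_fisher_p(depth) >= threshold:
        depth += 1
    return depth

N_CPGS = 28_000_000  # assumed, approximate CpG count for a human genome

for depth in (5, 10, 20):
    print("depth", depth, "best possible p:", min_fisher_p(depth))
print("Bonferroni threshold:", 0.05 / N_CPGS)
print("Depth needed for a perfect difference:", depth_needed(N_CPGS))
```

Real differences are far smaller than 0% vs 100%, so the depth needed in practice would be much higher than this best-case figure.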
4
Maximising power
• Options
– Analyse in windows
– Pre-filter
– Hierarchical or Adaptive filtering
5
Window sizes
Small windows
• Good resolution
• Specific biological effects
• High MTC burden
• Few observations
• High p-values
Large windows
• Lots of data
• High statistical power
• Low MTC burden
• Low p-values
• Effect averaging
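A minimal sketch of the window trade-off, assuming per-CpG counts stored as (position, methylated reads, unmethylated reads) tuples. The function name, the data and the window sizes are all hypothetical.

```python
from collections import defaultdict

def pool_into_windows(cpgs, window_size):
    """Sum per-CpG counts into fixed, non-overlapping windows.

    cpgs: iterable of (position, methylated_reads, unmethylated_reads).
    Returns {window_start: (methylated, unmethylated, n_cpgs)}.
    """
    windows = defaultdict(lambda: [0, 0, 0])
    for pos, meth, unmeth in cpgs:
        start = (pos // window_size) * window_size
        w = windows[start]
        w[0] += meth
        w[1] += unmeth
        w[2] += 1
    return {start: tuple(v) for start, v in sorted(windows.items())}

# Hypothetical data: (position, methylated reads, unmethylated reads)
cpgs = [(105, 3, 1), (180, 2, 2), (420, 0, 4), (460, 1, 5), (1900, 4, 0)]

small = pool_into_windows(cpgs, 250)   # more tests, fewer reads per test
large = pool_into_windows(cpgs, 2000)  # one test, all reads pooled
print(len(small), "small windows:", small)
print(len(large), "large window:", large)
```

With the large window the fully methylated CpG at position 1900 is averaged into its neighbours, which is the effect-averaging cost listed above.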
6
Hierarchical testing
• Test larger regions
– Windows / Features etc.
• Take significant hits and subdivide
– Smaller windows
– Individual CpGs
– Correct only for these tests
• Assemble hits together to make up DMRs
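The two-stage idea can be sketched as follows. This is an illustration under stated assumptions rather than any particular package's method: the chi-square test, the Bonferroni correction and the data layout are choices made for the example, and the final step of assembling hits into DMRs is left out.

```python
from math import erfc, sqrt

def chi2_p(a, b, c, d):
    """Chi-square p-value (1 d.f., no continuity correction) for the
    2x2 table [[a, b], [c, d]]; returns 1.0 if a margin is empty."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0
    stat = n * (a * d - b * c) ** 2 / denom
    return erfc(sqrt(stat / 2))

def hierarchical_hits(windows, alpha=0.05):
    """windows: {name: [(meth_A, unmeth_A, meth_B, unmeth_B), ...]} per CpG.

    Stage 1 tests pooled window counts, corrected for the number of windows.
    Stage 2 tests CpGs inside significant windows only, corrected for the
    number of CpGs actually tested at that stage.
    """
    stage1 = []
    for name, cpgs in windows.items():
        pooled = [sum(col) for col in zip(*cpgs)]
        if chi2_p(*pooled) * len(windows) < alpha:
            stage1.append(name)
    n_stage2 = sum(len(windows[name]) for name in stage1)
    hits = []
    for name in stage1:
        for i, cpg in enumerate(windows[name]):
            if chi2_p(*cpg) * n_stage2 < alpha:
                hits.append((name, i))
    return stage1, hits

# Hypothetical counts for two CpG islands
windows = {
    "CGI_1": [(40, 10, 10, 40), (30, 20, 25, 25), (45, 5, 5, 45)],
    "CGI_2": [(20, 30, 22, 28), (25, 25, 24, 26)],
}
significant_windows, cpg_hits = hierarchical_hits(windows)
print(significant_windows, cpg_hits)
```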
7
Hierarchical testing
[Figure: CpG islands (CGI) along the genome, with some marked X at successive stages]
Statistically ‘creative’ solution to not having enough data
8
Un-replicated Analysis
9
Contingency tests
• Chi-square / G-test / Fisher’s exact test
– Differ only at low observations
– Significant changes require enough observations that any of these should give the same answer
• Operates on single replicates
• Technical measure of difference
Meth A    Unmeth A
Meth B    Unmeth B
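To make the comparison between tests concrete, here is a standard-library sketch of a chi-square test (1 degree of freedom, no continuity correction) and a two-sided Fisher's exact test on a 2x2 table of methylated and unmethylated counts. The function names and the example counts are hypothetical.

```python
from math import comb, erfc, sqrt

def chi_square_p(a, b, c, d):
    """Chi-square p-value for [[a, b], [c, d]] with 1 degree of freedom."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0
    stat = n * (a * d - b * c) ** 2 / denom
    return erfc(sqrt(stat / 2))

def fisher_exact_p(a, b, c, d):
    """Two-sided Fisher's exact test: sum the probabilities of all tables
    with the same margins that are no more likely than the observed one."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance so ties with the observed table are included
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, p)

# Same 75% vs 25% methylation, at low and high coverage (hypothetical counts)
for table in [(3, 1, 1, 3), (300, 100, 100, 300)]:
    print(table, "chi-square:", chi_square_p(*table),
          "Fisher:", fisher_exact_p(*table))
```

At four reads per sample the two tests disagree noticeably and neither is significant. At 400 reads per sample both give vanishingly small p-values, so the choice of test no longer matters.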
10
Chi-Square results
11
Biological considerations
• Minimum relevant effect size?
– Balance power vs change
– What makes biological sense
– (what would you follow up?)
• Minimum coverage worth testing
– No point testing poorly covered regions
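A pre-filter along these lines might look like the sketch below. The thresholds (10 reads, 10 percentage points) and the region data are placeholders chosen for the example. Sensible values depend on the experiment.

```python
def passes_prefilter(meth_a, unmeth_a, meth_b, unmeth_b,
                     min_coverage=10, min_diff=10.0):
    """Keep a region only if both samples are well covered and the absolute
    methylation difference (in percentage points) is large enough to matter."""
    cov_a = meth_a + unmeth_a
    cov_b = meth_b + unmeth_b
    # max(..., 1) also protects the division below from zero coverage
    if min(cov_a, cov_b) < max(min_coverage, 1):
        return False
    diff = 100 * meth_a / cov_a - 100 * meth_b / cov_b
    return abs(diff) >= min_diff

# Hypothetical regions: (meth_A, unmeth_A, meth_B, unmeth_B)
regions = {
    "r1": (8, 2, 1, 9),       # covered, large change
    "r2": (2, 1, 0, 3),       # too few reads
    "r3": (50, 50, 52, 48),   # covered, negligible change
    "r4": (30, 10, 10, 30),   # covered, large change
}
kept = [name for name, counts in regions.items() if passes_prefilter(*counts)]
print("tests before:", len(regions), "after:", len(kept), kept)
```

Fewer tests means a lighter multiple testing correction for the regions that remain.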
12
Effect of pre-filtering
13
Replicated Analysis
14
Dealing with replicates
• Simple approach
– Merge data from replicates together
– Single test, High power
– Post-hoc test for consistency
• Explicitly account for batch effects
– Logistic regression
– Measures batch effects and excludes them from final significance calculation
– Beta binomial tests – account for both epigenetic and observation noise.
• Work with methylation values
– Normalise percentage methylation values
– Use conventional statistics (t-tests etc) for comparing groups
15
Simple Approach
[Figure: workflow diagram]
• WT1, WT2, WT3 → Merged WT
• KO1, KO2, KO3 → Merged KO
• Merged WT vs Merged KO → Contingency test → Hits
• Hits → Clustering, Correlation, T-test etc. → Consistent Hits
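The merge-then-check workflow can be sketched as below. The consistency rule used here (every KO replicate on the same side of every WT replicate) is just one simple stand-in for the clustering, correlation or t-test options named on the slide, and all counts are hypothetical.

```python
def merge_replicates(replicates):
    """Sum (methylated, unmethylated) counts across replicates of one condition."""
    meth = sum(m for m, _ in replicates)
    unmeth = sum(u for _, u in replicates)
    return meth, unmeth

def percent_meth(meth, unmeth):
    return 100 * meth / (meth + unmeth)

def is_consistent(wt_reps, ko_reps):
    """Post-hoc check (one simple choice among many): every KO replicate
    lies on the same side of every WT replicate."""
    wt = [percent_meth(m, u) for m, u in wt_reps]
    ko = [percent_meth(m, u) for m, u in ko_reps]
    return max(wt) < min(ko) or max(ko) < min(wt)

# Region 1: replicates agree with each other
wt_good = [(8, 12), (9, 11), (7, 13)]
ko_good = [(15, 5), (14, 6), (16, 4)]
# Region 2: same merged WT counts, but one WT replicate is an outlier
wt_bad = [(2, 18), (3, 17), (19, 1)]
ko_bad = [(10, 10), (11, 9), (9, 11)]

print(merge_replicates(wt_good), merge_replicates(ko_good),
      is_consistent(wt_good, ko_good))
print(merge_replicates(wt_bad), merge_replicates(ko_bad),
      is_consistent(wt_bad, ko_bad))
```

Both regions have identical merged WT counts, which is why the merged contingency test alone cannot tell them apart and a post-hoc check is needed.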
16
17
Replicated Count Based Analyses
• Logistic Regression
– Effectively a replicated contingency test
– Accounts for sampling bias
– Compares samples in multiple conditions for difference
• Beta binomial modeling
– Tries to account for epigenetic variation
– Different noise levels for different methylation levels
– Can do dispersion shrinkage based on similar points
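To show what the beta-binomial adds over a plain binomial, the sketch below writes out its probability mass function and variance. The mean and dispersion parameterisation is an assumption chosen for readability (dispersion = 1 / (a + b + 1) for a Beta(a, b) prior), and individual packages may parameterise and estimate it differently.

```python
from math import exp, lgamma

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_binomial_pmf(k, n, mean, dispersion):
    """P(k methylated reads out of n) when the true methylation level varies
    between replicates as Beta(a, b) with the given mean.

    dispersion is in (0, 1): values near 0 approach a plain binomial, larger
    values mean more replicate-to-replicate (biological) variation.
    """
    scale = 1 / dispersion - 1  # equals a + b
    a, b = mean * scale, (1 - mean) * scale
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_choose + log_beta(k + a, n - k + b) - log_beta(a, b))

def beta_binomial_variance(n, mean, dispersion):
    """Binomial variance inflated by the factor 1 + (n - 1) * dispersion."""
    return n * mean * (1 - mean) * (1 + (n - 1) * dispersion)

n, dispersion = 20, 0.1  # hypothetical depth and dispersion
for mean in (0.05, 0.3, 0.5):
    binomial_var = n * mean * (1 - mean)
    print(mean, "binomial var:", binomial_var,
          "beta-binomial var:", beta_binomial_variance(n, mean, dispersion))
```

The variance depends on the methylation level as well as on the dispersion, which is the "different noise levels for different methylation levels" point above.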
18
Replicated methylation level analyses
• Works from methylation percentages
• Noise is very high – individual values unreliable
• Expect no sudden changes
• Observation level should predict noise
• Can generate “smoothed” data
• Use normal continuous statistics on smoothed data
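A simplified sketch of smoothing, assuming per-CpG data as (position, methylated reads, total reads). This is a coverage-weighted kernel average written for illustration. It is not the BSmooth implementation, and the tricube kernel and 500 bp bandwidth are arbitrary choices.

```python
def smooth_methylation(cpgs, bandwidth=500):
    """Coverage-weighted, tricube-kernel local average of methylation.

    cpgs: list of (position, methylated_reads, total_reads).
    Each CpG's smoothed value borrows from neighbours within `bandwidth` bp,
    weighting nearby and well-covered CpGs more heavily.
    """
    smoothed = []
    for pos, _, _ in cpgs:
        num = den = 0.0
        for other_pos, meth, total in cpgs:
            dist = abs(other_pos - pos) / bandwidth
            if dist >= 1 or total == 0:
                continue
            weight = (1 - dist ** 3) ** 3
            num += weight * meth
            den += weight * total
        smoothed.append(num / den if den else float("nan"))
    return smoothed

# Hypothetical CpGs: the one at 220 has a single unmethylated read
cpgs = [(100, 1, 2), (150, 9, 10), (220, 0, 1), (300, 18, 20), (2000, 1, 10)]
raw = [meth / total for _, meth, total in cpgs]
smoothed = smooth_methylation(cpgs)
for (pos, _, total), r, s in zip(cpgs, raw, smoothed):
    print(pos, "reads:", total, "raw:", round(r, 2), "smoothed:", round(s, 2))
```

The single-read CpG at position 220 is pulled towards its well-covered neighbours, while the isolated CpG at position 2000 has nothing to borrow from and keeps its raw value.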
19
BSmooth algorithm
black: 25x (Lister)
pink: 4x (Lister)
20
BSmooth t-values
21
Methylation statistics packages
• swDMR (Perl/R-package)
– Sliding window DMR finding (choose between t_test, Kolmogorov, Fisher, ChiSquare, Wilcoxon for n = 2; ANOVA, Kruskal for n > 3)
• methylKit* (R-package by A. Akalin et al.)
– Sliding window, Fisher’s exact test or logistic regression. Adjusts p-values to q-values using SLIM method.
• bsseq* (R/Bioconductor by K.D. Hansen)
– Implements the BSmooth smoothing algorithm. Numerous CpG-wise t-tests and p-value cutoff to define DMRs. Outperforms Fisher’s exact test. Requires biological replicates for DMR detection
• BiSeq* (R/Bioconductor by K. Hebestreit et al.)
– Beta regression model, impractical for very large data other than RRBS or targeted BS-Seq
• RnBeads* (R package by F. Mueller et al.)
– Works for 450K arrays, BS-Seq, MeDIP or MBD-Seq data
• DMAP* (C command line tool by P. Stockwell et al.)
– RRBS fragment or fixed window approach, Fisher’s exact test, Chi-squared or ANOVA
• RADMeth (C++ command line tool by E. Dolzhenko and A.D. Smith)
– Beta-binomial regression analysis to find DMCs or DMRs, local likelihood, adjust for neighbouring CpGs
• MOABS* (C++ command line tool by D. Sun et al.)
– Beta binomial hierarchical model to capture sampling and biological variation, Credible Methylation Difference (CDIF) single metric that combines biological and statistical significance
• ComMet (Y. Saito et al., 2014)
– Bisulfighter suite; DMR detection based on hidden Markov models (HMMs) that enable automated adjustment of DMC chaining criteria. Does not require biological replicates
• DSS (R/Bioconductor by Feng et al., 2014)
– Constructs genome-wide prior distribution for beta-binomial dispersion. Bayesian hierarchical model to detect differentially methylated loci
* interface well with
22