ChIP-seq introduction Powerpoint slides

Download Report

Transcript ChIP-seq introduction Powerpoint slides

ChIP-seq Data Analysis
ChIP-seq overview
DNA + bound protein
Fragment DNA
Sequence
Map sequence
tags to genome
& identify
peaks
Adapted from slide set by: Stuart M. Brown, Ph.D.,
Center for Health Informatics & Bioinformatics, NYU School of Medicine
Prepare
sequencing
library
Immunoprecipitate
Release DNA
ChIP-seq big picture
• Combine high-throughput sequencing with Chromatin
Immuno-precipitation to identify specific protein-DNA
interactions genome-wide, including those of:
–
–
–
–
–
Transcription factors
Histones (various types and modifications)
RNA Polymerase (survey of transcription)
DNA polymerase (investigate DNA replication)
DNA repair enzymes
• … or fragments of DNA that are modified (e.g.
methylated)
ChIP-seq Workflow
Confirm ChIP
Prepare library
Submit for Sequencing
Get Raw sequence data & do QC
Map sequence reads to genome
Identify ChIP peaks over input background
Downstream bioinformatic analyses
Initial quality control measures
FastQC as for the RNASeq data
A note on using duplication levels to
estimate your library size (complexity)
Assuming you have 100 initial fragments in your library (before amplification) & which
fragment gets read is random:
# reads :
25
50
75
100
150
200
# unique reads:
23
37
52
63
78
87
% duplicated:
8%
26%
31%
43%
55%
69%
x-more left in lib:
4.3
2.7
1.9
1.6
1.3
1.15
x-more than prev: 1.6
1.4
1.2
1.24
1.11
Given 9% duplicates, an additional sequencing run of the same size (from the same
library) will give you 1.6x more unique reads. Two additional runs will give you 2.2x
more (1.6*1.4).
…but if you have a high % duplicates (e.g. 43%) adding one more lane will only give you
1.37x more unique reads than you had initially. Depending on sequencing depth, this
could indicate that your library has low complexity – either because too few fragments
from your ChIP survived to the library amplification step, or because the protein binds
few sites.
Alternatively: sub-sample your data and check saturation of peak calling
Read mapping - Bowtie Algorithm
(Burrows-Wheeler Transformation)
Provides an identifier to any sequence, allowing fast lookup of all its genomic
positions in an indexed genome file (ebw file). Avoid having to search genome
for matches each time (like blast would do).
How do peak-finders map binding sites?
•Fragments contain the TF binding site
at a (mostly) random position within
them.
•Reads are (randomly) from left or right
edges (sense or antisense) of
fragments.
•Thus peak for sense tags will be 1/2
the fragment length upstream…
•Binding site position = mid-way
between sense tag peak & antisense
tag peak.
•To get binding site peak, shift sense
downstream by ½ fragsize & antisense
upstream by ½ fragsize.
•
Adapted from slide set by: Stuart M. Brown, Ph.D., Center for Health Informatics
& Bioinformatics, NYU School of Medicine & from Jothi, et al. Genome-wide
identification of in vivo protein–DNA binding sites from ChIP-Seq data. NAR
(2008), 36: 5221-31
MACS procedure
IGV visualization of MACS results using the FoxA1 data set
IGV visualization of MACS results using the University of Washington H3K4me3 data set.
IGV visualization of MACS results using the Broad Institute H3K36me3 data set.
MACS’ "shiftsize” model
You can find info on the estimated parameters in your .macsinfo file…
INFO @ Sun, 10 Feb 2013 21:27:51: #2 Build Peak Model...
INFO @ Sun, 10 Feb 2013 21:27:51: #2 number of paired peaks: 0
WARNING @ Sun, 10 Feb 2013 21:27:51: Too few paired peaks (0) so I can not build the model!
Broader your MFOLD range parameter may erase this error. If it still can't build the model, please
use --nomodel and --shiftsize 100 instead.
WARNING @ Sun, 10 Feb 2013 21:27:51: Process for pairing-model is terminated!
WARNING @ Sun, 10 Feb 2013 21:27:51: #2 Skipped...
WARNING @ Sun, 10 Feb 2013 21:27:51: #2 Use 100 as shiftsize, 200 as fragment length
Here MACs tried to estimate the “shift size” for moving sense & antisense
reads to get a final peak position, by identifying sets of strong + & - strand
peaks at a certain distance from each other.
There was not enough info on chromosome 19 to do this, so it made a guess
that the fragment size was 200 & shiftsize was 100. 200 is close enough to the
actual fragment size of ~150 bp that we can go with this.
MACS model file
This a representative result from running MACS
#2 Build Peak Model...
#2 number of paired peaks: 683
Fewer paired peaks (683) than 1000!
Model may not be build well! Lower
your MFOLD parameter may erase this
warning. Now I will use 683 pairs to
build model!
finished!
predicted fragment length is 125 bps
Generate R script for model :
LiE_IP_v_INPUT_11_2012_dup1_mode
l.r
Call peaks...
use control data to filter peak
candidates...
Finally, 9504 peaks are called!
find negative peaks by swapping treat
and control
Finally, 337 peaks are called!
To generate this file you will
need to go into R, and enter:
Source(“MACS_output_file.r”),
which will generate a .pdf
Peaks & negative peaks
Keep scrolling down your .macsinfo file until you find…
…
INFO @ Sun, 10 Feb 2013 21:36:47: #3 Finally, 364 peaks are called!
INFO @ Sun, 10 Feb 2013 21:36:47: #3 find negative peaks by swapping treat
and control
INFO @ Sun, 10 Feb 2013 21:36:52: #3 Finally, 364 peaks are called!
INFO @ Sun, 10 Feb 2013 21:36:52: #4 Write output…
This is the pay-off, where MACS identifies your ER alpha peak locations!
364 peaks on chromosome 19 (which is ~1/50th of the genome) suggests
~20,000 peaks for the whole genome, which is not bad!
Equally critical, MACS now swaps treat & control (pretending your INPUT data
is your IP & your ChIP data is your input) and looks again for peaks.
The number of “negative” peaks found in this way should be far less than the
positive peaks, and the 10:1 ratio here is fine.
Troubleshooting MACs
Briefly…
MACS can’t build a model:
- Adjust the mfold values (the fold over background ranges MACs considers
for paired peaks)
- Tell MACs to not build a model, but instead use the shiftsize you specify.
Peaks/Negative Peaks ratio is poor or too few peaks are detected:
- Adjust model settings to see if you can improve both. Otherwise, you may
have to conclude that 1) your library was no good or 2) the factor just
doesn’t bind to many places in the genome.
Toubleshooting MACs…
Be on the lookout for
MACS building a model
from short-separation
noise peaks (that may
arise from sonication
sensitive breakpoints or
other things unrelated to
your protein binding).
To avoid this, you can
decrease the maximum
“mfold” so that these
strong irrelevant peaks are
ignored when the model
is built.
Downstream analysis
Downstream analysis
• Find the nearest gene to each peak
• Check distribution relative to gene features (start site, exon,
intron, upstream/downstream)
• Find overrepresented motifs in peak region (TFBS binding
sites of our factor + possible co-binders) – kmers/logos
• Check if peaks are clustered or co-occur with other binding
events
• Sequence conservation (or conservation of binding event, if
data is available)
• Gene set functional analysis
• …
Overlaps in Galaxy
Operate On Genomic Intervals-> Intersect
This lets you create a new .bed file which has only the regions that intersect
between two datasets.
Overlapping Intervals:
(saves complete intervals from file 1 that
overlap anything in file 2)
Overlapping Pieces of Intervals:
(saves only the regions shared between 1
& 2)
Additional notes on overlaps
Comparing peaks from multiple
samples (Bardet et al. 2011)
During genome-wide peak calling, only the best peaks pass the stringent thresholds
required for low false discovery rates (FDRs) due to the correction for multiple testing
(orange lines).
Regions may show substantial tag enrichment, yet are not called as peaks (green line).
When comparing peaks across conditions, we advocate using 'significant enrichment'
(not multiple testing–corrected) as the measure to assess whether a peak is shared
across conditions or is truly condition-specific. Merely intersecting peaks called at
each condition would miss conserved peaks (e.g., middle examples).
How can we tell whether overlaps are significantly greater
than chance? Assigning a p-value?
We have 3 pieces of count frequency information: The
number of overlaps, the number of regions compared,
and we can generate the expected background
frequency through shuffling.
This type of data is like coin tosses & is ideally suited for
a binomial test, which uses “number of matches”,
“number of tests” and “expected background
frequency” to calculate p. values.
If you flip a coin, say 10 times and it comes up heads 6 out of 10
(frequency 0.6 vs. expected 0.5), that would not seem unlikely – and
a binomial test would tell you this.
p.=.74
1
5 6
10
p.=2e-10
However if you flip a coin 1000 times & get heads 600 out of 1000,
that would seem a bit odd, and the binomial test would indicate this
by saying that the probability of the null hypothesis (that the
frequency of heads is 0.5) is low.
100
500 600 1000
Resampling statistic or binomial tests for overlaps
Let’s say we have peak regions from two samples (7000 and 8260), and the number of
overlaps is 1653.
We can estimate the background chance by randomly placing the regions in the first dataset
in new locations, and then count the number of overlaps. We would repeat this procedure
100-1000 times.
Then, we can either ask: how many times, out of the 1000 repeated procedures, was the
number of overlaps greater than that observed for the real dataset? This is our p-value
(3/1000-> p=0.003. 0 -> p<0.001)
Alternatively, we can take the mean of the number of overlaps, and use this in a binomial
test. For example, let’s say the mean number of overlaps in the shuffled sets was 95.11
Run a binomial test in R by typing:
binom.test(1653, 8260, 95.11/8260)
The p.value is <2e-16.
Very low. So, yes, the binding sites in the two samples overlapped more than expected by
chance… but binding events are still to ~80% different places between these two samples..
Data in the UCSC browser
ChIP-seq for histone modifications
Method:
•Mouse ES cells vs ES-derived primary neural progenitor cells (NPCs).
•Prepare chromatin & ChIP with antibodies to specific histone modifications.
•H3K4 methylation marks active genes, H3K27 marks repressed genes, both marks together in ES
cells mark “poised” genes that will become activated in certain developmental lineages.
Meissner et al. 2008, Nature 454:766.
The ENCODE Project
Dozens of labs did ChIP-seq, under rigorous quality
guidelines, for over 100 transcription factors and histone
modifications, plus related assays for DNA methylation,
chromatin accessibility etc.
Major paper (many others provide additional details):
Encode Project Consortium (over 100 authors) An integrated encyclopedia of DNA
elements in the human genome. Nature. 2012 Sep 6;489(7414):57-74. doi:
10.1038/nature11247.
Some ways to access this data:
Nature.com/encode (Nature’s summary & links to all related papers)
factorbook.org (a way to explore the data in a wiki format)
UCSC genome browser (hands on examples next)
sra (short read archive, repository for raw data, more on this later!)
Sample of Encode Data
Example of ChIP-seq data tracks on UCSC browser
Peak calls: regions
of significant
enrichment over
background
Processed read
density, read as # of
reads overlapping a
given BP position,
data (used to make
peak calls)
H3K4me3 & H3K27acet are marks of active promoters (e.g. MTRF1L)
H3K27me2 is a mark of repressed promoters
Genes in ESCs that are required for differentiation are often “poised” and bear K4me3 & K27me3
“bivalent mark” (e.g. SYNE1)
Differentiation resolves bivalent mark to all activating marks (Osteoblasts) or all repressive marks (HepG2)
Downloading data from UCSC browser
Try zooming in (you can go all the way to base pair resolution)
Want to learn more about a gene? Control click on it’s ideogram & select
“open details page in new window”
What if you want to use this data somewhere else?
Select Tools->Table Browser
Select Group: Regulation, Track: Broad Histone
Table: H1-hESC H3K4me3 … Pk (for the peaks data, the signal file will be huge)
In output format, select “all fields from selected table”
--Note that you could have selected “sequence”… if you had you’d get the actual DNA
sequence for each one of these peaks. We’ll use this later.
Check “Galaxy” next to send output to:
--Note that you could have selected send to file, we’ll use this later as well.
Click “Get output” & then click “send query to galaxy”
Introduction to Galaxy Tools
Galaxy is a web platform providing a lot of basic tools for manipulating
genomics data.
On the right are input & analysis options.
On the left is your history of uploaded files & analyses.
You’ll have one item in process, which will finish soon & turn green.
Click on the title to get a sample of what the data looks like.
Click on the eye to see the data in the central panel.
Each entry has a chromosome#, BP for start & BP for end & some other
values (signal value=enrichment over background, p.value=-log(base10)of p.
value, so, for p=.0001 this would be 4)
Click on the pencil to look at and edit the name & other attributes of any
item.
We’ll look more at Galaxy tools later…
What about data that’s not on the UCSC Browser?
The ENCODE project was UNUSUALLY considerate when compared to most other researchers who
generate genomics data.
Even though ENCODE is huge, it’s probably <10% of published NGS data.
To publish, researchers must make their data accessible, but they will very rarely provide a link to a
UCSC browser track. If you’re lucky, they will have put processed data up somewhere: generally on
GEO…
The GENE EXPRESSION OMNIBUS (GEO) Key repository for microarray and genomics data.
Open a new browser tab (ctrl-T) & go to:
http://www.ncbi.nlm.nih.gov/geo/
Search for “encode h3k4me3 h1-hesc”. You’ll see several entries. The first few are larger datasets
that include this specific data. The one at the bottom is just the data for this track. First note the
“accession number GSM733657” - often publications will give this accession number, providing an
easy way to get directly to the right place.
---> Now, click on the title for this entry “Bernstein…”
Scroll down the next page: There’s lots of info about this experiment with links for more
information.
At the bottom are the processed data files:
They’ve been nice & offer us a “BROADPEAK” file (the same as what we just uploaded to Galaxy), a
“BAM” file for each experimental replicate (which has the genomic coordinates for each read), and a
“BIGWIG” file (the filetype for the “signal” track on the UCSC browser)
What if the data I’m interested in isn’t in GEO?
Authors are almost always required to make their NGS data accessible in order to publish….
…but they’re often not required to make it easy!
Many times the only thing that’s available is the raw data, stored in…
The Sequence Read (SRA) - Repository for
Open a new browser tab (ctrl-T) & go to:
raw NGS data.
http://www.ncbi.nlm.nih.gov/sra/
Search again for “encode h3k4me3 h1-hesc”.
A single record will be called up…
Clicking on link 1 or 2 under “run” will give you information about that particular biological replicate
sample.
Clicking on the link 1 or 2 under ‘size’, will take you to a page with a single linked file with a “.sra”
extension.
Note how big this file is… just a few of these would rapidly fill up a PC hard drive!
So what is a .sra file & what can you do with it? Don’t even try to download & open it… besides being
huge, it’s not even normal text.
In the next few lectures we’ll find out how to handle this sort of raw data.
How many reads do I need?
Minimum for ChIP-seq of a transcription factor with < ~30,000
binding sites in a mammalian genome:
• 2 replicates per condition
• 20+ million reads per sample (>40M per condition, proportionately less for
smaller genomes & fewer binding peaks)
• One HiSeq lane gives ~150 million reads
…& can multiplex ~4 samples (2 exp + 2 input / lane)
• Single end 50 bp reads almost always good
(unlike RNA-seq where longer and/or paired end reads are required for many downstream
questions).
For some applications need many more reads (e.g. mapping nucleosome positions need
>400 M). Make your best estimate. If you have too few you can re-sequence the same
samples or add additional samples. Reads from all runs can be pooled in the end.