New Approaches for ChIP

Download Report

Transcript New Approaches for ChIP

Transcriptional and post-transcriptional
regulation of gene expression
protein
Translation
Localization
Stability
mRNA
3’UTR
Pol II
DNA
Activation
Repression
• Where does each transcription factor
bind in the genome, in each cell type, at
a given time ? Near which genes ?
• What is the cis-regulatory code of each
factor ? Does they require any cofactors ?
DNA
Activation
Repression
ChIP-seq
Transcription
factor of
interest
Antibody
Genome Analyzer II (Solexa)
Control: input DNA
Genome Analyzer II (Solexa)
ACCAATAACCGAGGCTCATGCTAAGGCGTTAGCCACAGATGGAAGTCCGACGGCTTGATCCAGAATGGTGTGTGGATTGCCTTGGAACTGATTAGTGAATTC
TGGTTATTGGCTCCGAGTACGATTCCGCAATCGGTGTCTACCTTCAGGCTGCCGAACTAGGTCTTACCACACACCTAACGGAACCTTGACTAATCACTTAAG
Average length ~ 250bp
25-40bp
ACCAATAACCGAGGCTCATGCTAAGGCGTTAGCCACAGATGGAAGTCCGACGGCTTGATCCAGAATGGTGTGTGGATTGCCTTGGAACTGATTAGTGAATTC
TGGTTATTGGCTCCGAGTACGATTCCGCAATCGGTGTCTACCTTCAGGCTGCCGAACTAGGTCTTACCACACACCTAACGGAACCTTGACTAATCACTTAAG
Average length ~ 250bp
25-40bp
ACCAATAACCGAGGCTCATGCTAAGGCGTTAGCCACAGATGGAAGTCCGACGGCTTGATCCAGAATGGTGTGTGGATTGCCTTGGAACTGATTAGTGAATTC
TGGTTATTGGCTCCGAGTACGATTCCGCAATCGGTGTCTACCTTCAGGCTGCCGAACTAGGTCTTACCACACACCTAACGGAACCTTGACTAATCACTTAAG
Average length ~ 250bp
BCL6 ChIP-seq
•
•
•
•
•
•
Lymphoma cell line (OCI-Ly1)
Solexa/Illumina
6 lanes for ChIP, 1 for input DNA, 1 for QC
36nt long sequences
32 Million reads
Aligned/mapped to hg18 with Eland
Melnick lab at WCMC
Read mapping with Eland
Solexa Read
AAAATACGCGTATTCTCCCAAAACAATATC
TCCCAAAACAAAAAAATACGCGTATTCTCCCAAAACAATATCTTACAAGATGTAAATATACCCAAGAT
Reference Human Genome (hg18)
Read mapping with Eland
Solexa Read
AAAATACGCCTATTCTCCCAAAACAATATC
TCCCAAAACAAAAAAATACGCGTATTCTCCCAAAACAATATCTTACAAGATGTAAATATACCCAAGAT
Reference Human Genome (hg18)
Read mapping with Eland
Solexa Read
AAAATACGCCTATTCTCCCATAACAATATC
TCCCAAAACAAAAAAATACGCGTATTCTCCCAAAACAATATCTTACAAGATGTAAATATACCCAAGAT
Reference Human Genome (hg18)
Reads can map to multiple
locations/chromosomes
Solexa Read 1
Reference Human Genome (hg18)
Solexa Read 2
Reads map to one strand or
the other
Solexa Read 1
hg18
Solexa Read 2
1011 AGGTCACAAAACAAGTCCTAACAAATTTAAGAGTAT
U0
1245 GTCAGAAAAATCCTTTTTATTATATAAACAATACAT
U2
945
GTCATCAAACTCCAAGGATTCTGTTTTCAACATACT
:1118
GAAAGTGATTAGCAGATTGTCATTTAATAATTGTCT
874
GATAAATTTTTTCCTACAATCTTAAATTATTACACA
928
AAAAATTAAACAATTCTAAAAATATTTTTATCTTAA
:4
GCACATGTCATACTCTTTCTAGCTCTCTTATTTTTC
1015 AAATTAATGTAAAAAATAGGATACTGAATTGTGATA
U1
926
GTAGTTAACAATAATTTATTTTATACTTCAAAATTC
1702 GTCAGAATTAATTAATCAAAACACCAAATGTACTTC
U0
1003 ATTTTGACTTTATTATTTTTTCTTCAATGTTTTTAA
NM
1090 GAAAGTACATCAAATACATATTATATACTTTACATA
R2
937
AATCCATATACATTTCTTTTTAATCATTTCCTCTTT
:330 GTGAGTTTCTTAATCCTGAGTTCTAATTTTATTTCA
R0
1031 ACATTTTATAAATTTTTAATTTCATTTTAATTTATA
NM
:1469
GTTTTTAAAATCAACACTTTTATTATAGAAGTAGCA
:828 GTACTGATGTAAACTTGGTAAAAACATTGACATAAA
U0
:583 GAAGAAAATGACTATGTCAAAATATTATCTCTCAAT
U0
:1653
GTTTTACTGATTTTCTTACTTACTAAACTACCTGTT
:943 AATGATACGGCGACCACCGACAGGTTCAGAGTTCTA
NM
:268 GAGAATTATTCAGAAGTCAAATCTGTGCTTAGTTTA
U2
:318 GTATGTATCATATATATTTATGTATCATATATATTT
R1
:1113
GATTGCTCCATTATTTGTTAAAAACATAGTAAAATA
1072 ATGAGATCAGTACTTCAAAGAGATATCTGCACTCCC
U0
1178 GTTAGTCCCAATATTCCATTAATCCCAATAAATATA
U2
:972 GAGATAATAATAGCAGTTATGGCATCGAGATAATTT
U0
:341 GTAGAGGGCACACATCACAAACAAGTTTCTGAGAAT
R2
:302 GAATATCCACTTGCAGACTTTACAAACAAATTTTTT
R2
:1126
GGCAGATGAAACTTCTATACACTATATTTTAGCCAG
1371 GAAAGAAAAACTATTGAAAAAATAGTTACTTTCCAA
U0
:614 GTGTAGATGATATCGAGGGCATTAGAAGTAAATAGC
U0
:808 GAGAGGAAATAATAAAGATAAAAGTAGAAAAAGTGA
U0
:815 GATAATTATGTTGTTGTAATTATTGTTTGTTTTTTT
U0
:260 GTTGACAATCCAGCTGTCATAGAAACTGACTATTTT
U0
:916 AAAAATTCTCCCAAAACAACAAGATGTAAATATACC
U0
:308 GTTCTTACACTGATATGAAGAAATACCTGAGACTGG
U0
917
GAGAAACACACATATTTTTGTAAGTGCCATCACATC
:348 GTATTATCTAACACACAAGATGATGTTTGTTTTTAT
NM
:857 GAGTGTAGAAAATTTTCTGCCCTAAAATATTTGTTA
U1
1729 GTATCCTAAAGTGTATCTTATGTTTTTTCATCTTCT
U1
:64
AATAAAACAAATTCCAATGGCTTAGATTCTACTTAA
1059 AAATGGTCATACTTCCCAAAGCGATCTACAGATTCA
U1
1061 ACATTTCCACATTTCTGTGGAAGCCTCACAATCATT
R2
932
ATTAATCAACAGCAACATTAATCAACTGAATCAACA
647
GAATAAATAATCAAAACATATAATACATTTTTTTAT
:731 ATATACACATATATATACATATATATATACACATAT
R0
:1196
GAGAAGGAAATGTGTTTTCTAAGTTTCTTTATCTTC
1
0
U0
U2
U1
U2
U0
0
U1
1
0
0
U1
29
0
U0
1
1
U0
0
0
0
NM
1
0
1
0
0
U0
1
1
1
1
1
1
1
U1
0
0
0
U2
0
0
U0
U1
47
U1
13
0
1
0
0
0
1
1
0
0
0
0
0
255
0
1
0
0
1
0
0
3
0
1
0
0
0
0
1
0
0
0
0
0
0
2
0
0
1
1
0
1
0
1
0
255
0
62
1
1
0
1
0
0
0
1
0
0
2
1
255
0
0
0
0
0
0
1
2
0
9
1
0
3
4
0
0
0
0
0
0
0
67
1
0
0
0
0
29
2
0
1
255
1
chr8.fa
59699745
R
chr5.fa
121195098
F
0
chr18.fa
8914049
1
chr1.fa
97496963
0
chr3.fa
95643444
1
chr2.fa
177727639
0
chr8.fa
79132719
chr10.fa
69774166
F
17
chrX.fa
26496842
chr12.fa
72700465
F
DD
DD
R
F
R
R
R
DD
R
DD
15G
DD
DD
DD
DD
DD
30G
DD
20G
0
F
DD
20G
1
chr12.fa
62166701
chr14.fa
65160857
F
chr5.fa
97782464
F
0
chr7.fa
133200265
R
DD
DD
F
DD
chr5.fa
162472124
R
DD
3G
7C
0
chr12.fa
chr6.fa
chr2.fa
33830898
110722427
47305609
R
F
R
DD
DD
DD
15G
19G
0
chr13.fa
90021137
chr1.fa
74303257
R
chr5.fa
16031200
F
chr1.fa
187326417
F
chr15.fa
46739015
R
chr12.fa
38910133
R
chr3.fa
101625712
R
chr2.fa
214128537
R
0
chr7.fa
13668652
F
DD
DD
DD
DD
DD
DD
DD
R
DD
chr6.fa
74625385
F
chr12.fa
7400023
R
1
chr10.fa
98020799
chr3.fa
50834510
R
DD
DD
R
DD
13G
9C
DD
19C
0
0
chr2.fa
chr5.fa
46078825
41496935
R
F
DD
DD
32G
0
chr4.fa
188020201
F
DD
32G
chr11.fa
94204222
18G
10C
18C
28G
31G
7A
DD
DD
18C
15C
20C
Number of reads per Eland type
U0
U1
U2
R0
R1
R2
NM
QC
21019702
3280059
1007173
3661054
815275
406002
2050499
306352
65%
10%
3%
11%
2%
1%
6%
1%
Peak detection
• Calculate read count at each position (bp) in
genome
• Determine if read count is greater than expected
Peak detection
• We need to correct for input DNA reads (control)
•
- non-uniformaly distributed (form peaks too)
- vastly different numbers of reads
between ChIP and input
Peak detection using ChIPseeqer
genome
Read
count
Expected read count
T A T T A A T T A T C C C C A T A T A TG A T A T
genome
Expected read count = total number of reads * extended fragment length / chr length
Is the observed read count at a
given genomic position greater than
expected ?
x1
Frequency
P(X  x)  1 
0
xe 
x!
x = observed read count
λ = expected read count
Read count

The Poisson
distribution
Is the observed read count at a
given genomic position greater than
expected ?
x1
P(X  x)  1 
0
xe 
x!
x = 10 reads (observed)
λ = 0.5 reads (expected)
genome

P(X>=10) = 1.7 x 10-10
log10 P(X>=10) = -9.77
The Poisson
-log10 P(X>=10) = 9.77
Read
count
Expected read count
x1
Pc (X  x)  1 
0
-Log(p)
xce 
c
x!

Expected read count = total number of reads * extended frag len / chr len
Read
count
Expected read count
-Log(p)
Expected read count = total number of reads * extended frag len / chr len
Input reads
ChIP
INPUT
Read
count
Expected read
count
Read
count
Genome positions (bp)
Genome positions (bp)
x1
Pc (X  x)  1 
0
Expected read
count
xce 
x1
Pi (X  x)  1 
c
x!
0
-Log(Pc)
-Log(Pi)


Log(Pc) - Log(Pi)
Threshold
xi e 
i
x!
Normalized Peak score (at each bp)
P(XChIP)
R = -log10
P(Xinput)
Will detect peaks with high read
counts in ChIP, low in Input
Works when no input DNA !
x1
Pi (X  x)  1 
0
xi e 
i
x!
Non-mappable fraction of the
genome
We enumerated all 30-mers, counted # occurrences,
calculated non-unique fraction of genome
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
chr18
chr2
chr3
chr4
chr6
chr8
chr5
chr12
chr11
chr20
chr7
chr10
chr17
chrX
chr1
chr13
chr16
chr14
chrM
chr9
chr19
chr15
chr21
chr22
chrY
9369067/76117153
33849240/242951149
27854877/199501827
27090014/191273063
24330283/170899992
20932821/146274826
26029902/180857866
19382853/132349534
20039443/134452384
10017788/62435964
26182588/158821424
22968951/135374737
14496284/78774742
31269270/154913754
55186693/247249719
28668063/114142980
23552340/88827254
29689825/106368585
4628/16571
43125838/140273252
20251255/63811651
31877970/100338915
16867677/46944323
21176578/49691432
43209644/57772954
0.123087459668913 (=12%)
0.139325292921335
0.139622164963933
0.141630052737745
0.142365618132972
0.143106107677065
0.143924633059643
0.14645199279659
0.149044906485258
0.160449000194824
0.164855517225434
0.169669404417753
0.184021980040252
0.201849540099583
0.223202247602959
0.251159230291692
0.265147676410215
0.279122120502026
0.279283084907368
0.307441635415995
0.317359834491667
0.317702957023205
0.359312392256674
0.426161556382597
0.747921665906161 (=74%)
Peak detection
• Determine all genomic regions with R>=15
• Merge peaks separated by less than 100bp
• Output all peaks with length >= 100bp
• Process 23M reads in <7mins
BCL6: 18,814 peaks
ChIP reads
Input reads
Detected Peaks
80% are within <20kb of a known gene
• Where does each transcription factor
bind in the genome, in each cell type, at
a given time ? Near which genes ?
• What is the cis-regulatory code of each
factor ? Does they require any cofactors ?
DNA
Activation
Repression
Regulatory Sequence
Discovery using FIRE
Discovering regulatory sequences
associated with peak regions
True TF binding
peak?
correlation is quantified using
the mutual information
Yes
Yes
Target
regions
Yes
True TF peak
Yes
Yes
No
Yes
Absent
0.40
0.10
0.33
Present
0.10
0.40
0.00
Yes
Motif
…
No
No
Random
regions
No
No
2
2
I(motif ;groups)   P(i, j)log
No
i1 j1
No
…

P(i, j)
P(i)P( j)
Motif Search Algorithm
Highly
informative
k-mer
CTCATCG
TCATCGC
AAAATTT
GATGAGC
AAAAATT
ATGAGCT
TTGCCAC
TGCCACC
ATCTCAT
...
...
ACGCGCG
CGACGCG
TACGCTA
ACCCCCT
CCACGGC
TTCAAAA
AGACGCG
CGAGAGC
Not
informative CTTATTA
MI
0.0618
0.0485
0.0438
0.0434
0.0383
0.0334
0.0322
0.0298
0.0265
0.0018
0.0012
0.0011
0.0010
0.0009
0.0005
0.0004
0.0003
0.0002
MI=0.081
MI=0.045
MI=0.040
...
Optimizing k-mers into more informative
degenerate motifs
True TF binding
peak?
Yes
ATCCGTACA
Yes
Target
regions
Yes
Yes
Yes
Yes
…
A/G
C/G/T
T/G
A/T/G
C/G
A/C/G
No
No
Random
regions
No
ATCC[C/G]TACA
No
No
No
…
which character increases
the mutual information by
the largest amount ?
Optimizing k-mers into more informative
degenerate motifs
True TF binding
peak?
Yes
Yes
Target
regions
Yes
Yes
Yes
Yes
…
A/C
C/G/T
T/C
A/T/C
C/G
A/C/G
No
No
Random
regions
No
No
No
No
…
ATCC[C/G]TACA
.
.
.
Mutual
information
change
Similarity to
ChIP-chip
RAP1 motif
Motif
Motifs optimized so far
k-mer
CTCATCG
TCATCGC
AAAATTT
Highly
GCTCATC
informative k- AAAAATT
mers
ATGAGCT
TTGCCAC
TGCCACC
ATCTCAT
...
MI
0.0618
0.0485
0.0438
0.0434
0.0383
0.0334
0.0322
0.0298
0.0265
MI=0.081
MI=0.045
optimize ?
Only optimize k-mer if
I(k-mer;expression | motif)
is large enough
(for all motifs optimized so far)
Conditional mutual information I(X;Y|Z)
Depletion
Enrichment
Discovered
Motifs
FIRE automatically compares
discovered motifs to known
motifs in TRANSFAC and
JASPAR
Motif co-occurrence
anallysis
ChIPseeqer: an integrated
framework for ChIP-seq data
analysis
•
•
•
•
•
ChIPseeqer (peak detection)
ChIPseeqer2Track (for Genome Browser)
ChIPseeqer2FIRE (+ motif analysis)
ChIPseeqer2iPAGE (+ pathway analysis)
ChIPseeqer2cons (conservation analysis)
Installing and setting up
programs
Install ChIPseeqer and FIRE:
http://physiology.med.cornell.edu/faculty/elemento/lab/chipseq.shtml
http://tavazoielab.princeton.edu/FIRE/
Execute following commands:
export FIREDIR=/Applications/FIRE-1.1
export PATH=$PATH:$FIREDIR
export CHIPSEEQERDIR=/Applications/ChIPseeqer-1.0
export PATH=$PATH:$CHIPSEEQERDIR:$CHIPSEEQERDIR/SCRIPTS
chmod +x $CHIPSEEQERDIR/ChIP*
chmod +x $CHIPSEEQERDIR/SCRIPTS/*.pl
Peak Detection
- Input file: CTCF.bed
cd ~/Desktop/elemento
Or download from:
http://physiology.med.cornell.edu/faculty/elemento/lab/files/chi
seq/
- 2947043 U0 reads in BED format
(check by typing wc –l CTCF.bed)
(view by typing more CTCF.bed and q to exit)
- No input DNA for this experiment
Peak Detection
Step 1: Split big read file into one file per chromosome
split_bed_or_mit_files.pl CTCF.bed
Expected output:
Opening CTCF.bed
Current directory = .
Creating ./reads.chr1
…
Peak Detection
Step 2. Detect peaks
ChIPseeqer --chipdir=. --t=15 --fraglen=250 --format=bed -outfile=CTCF_peaks_t15.txt
Expected output:
Processing reads in chrY ... done.
Processing reads in chrX ... done.
Processing reads in chr9 ... done.
Processing reads in chr8 ... done.
Step 3. Count how many peaks were found
wc -l CTCF_peaks_t15.txt
Making a Genome Browser track
Command lines:
cd JuliaChild
wc –l CTCF_peaks_t15.txt
ChIPseeqer2track --targets=CTCF_peaks_t15.txt --trackname=“CTCF peaks”
Expected output:
CTCF_peaks_t15.txt.wgl.gz created.
To check that the file was created:
ls
Making a Genome Browser track
http://genome.ucsc.edu/cgi-bin/hgGateway
Making FIRE input files
Command line (type instructions below as one single line):
ChIPseeqer2FIRE --targets=CTCF_peaks_t15.txt –genome=wg.fa
--suffix=CTCF_peaks_t15_FIRE
wg.fa is also available from:
http://physiology.med.cornell.edu/faculty/elemento/lab/files/chipseq/
(decompress with gunzip wg.fa.gz)
Expected output:
Extracting sequences ... Done.
Extracting randomly selected sequences ... Done.
CTCF_peaks_t15_FIRE.txt and CTCF_peaks_t15_FIRE.seq have been generated.
…
FIRE analysis
Command line (type instructions below as one single line):
fire.pl --expfile=CTCF_peaks_t15_FIRE.txt
--fastafile_dna=CTCF_peaks_t15_FIRE.seq --nodups=1 --minr=2
--species=human --dorna=0 --dodnarna=0
Expected output:
Extracting sequences ... Done.
Extracting randomly selected sequences ... Done.
CTCF_peaks_t15_FIRE.txt and CTCF_peaks_t15_FIRE.seq have been generated.
…
FIRE main output file
open CTCF_peaks_t15_FIRE.txt_FIRE/DNA/CTCF_peaks_t15_FIRE.txt.summary.pdf
Randomly
selected
sequences
Peak
sequences