Transcription factor binding sites and gene regulatory network
Victor Jin
Department of Biomedical Informatics
The Ohio State University
Transcription in higher eukaryotes
Gene Expression
1. Chromatin structure
2. Initiation of transcription
3. Processing of the transcript
4. Transport to the cytoplasm
5. mRNA translation
6. mRNA stability
7. Protein activity and stability
Transcriptional Regulation
[Figure labels: nuclear membrane; binding site/motif CCG__CCG; genome-wide mRNA transcript data (e.g. microarrays)]
Learning problems:
• Understand which regulators control which target genes
• Discover motifs representing regulatory elements
Some common approaches
• Cluster-first motif discovery
– Cluster genes by expression profile, annotation, …
to find potentially coregulated genes
– Find overrepresented motifs in promoter
sequences of similar genes (algorithms: MEME,
Consensus, Gibbs sampler, AlignACE, …)
(Spellman et al. 1998)
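The second step of the cluster-first approach can be illustrated with a deliberately simple stand-in: counting which k-mers occur in more promoters of a gene cluster than in background promoters. This is a minimal sketch, not what MEME, Consensus, Gibbs sampling or AlignACE do; the function names, the toy sequences and the pseudocount of 1 are all assumptions made for illustration.

```python
from collections import Counter


def kmer_presence(seqs, k):
    """For each k-mer, count how many sequences contain it at least once."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        counts.update({s[i:i + k] for i in range(len(s) - k + 1)})
    return counts


def enriched_kmers(cluster, background, k=6, pseudo=1.0):
    """Rank k-mers by (fraction of cluster seqs) / (fraction of background seqs).

    The pseudocount (an assumption of this sketch) avoids division by zero
    for k-mers that never occur in the background.
    """
    fg = kmer_presence(cluster, k)
    bg = kmer_presence(background, k)
    ranked = []
    for kmer, n_fg in fg.items():
        frac_fg = (n_fg + pseudo) / (len(cluster) + 2 * pseudo)
        frac_bg = (bg[kmer] + pseudo) / (len(background) + 2 * pseudo)
        ranked.append((frac_fg / frac_bg, kmer))
    ranked.sort(reverse=True)
    return ranked


# Toy promoters (hypothetical): the cluster shares the core TGACC.
cluster = ["ACGTGACCTT", "TTTGACCAGA", "GGTGACCTCA"]
background = ["ACGTACGTAC", "TTTTAAAACC", "GGGCGCGCAT", "CATCATCATC"]
top = enriched_kmers(cluster, background, k=5)[:3]
print(top[0])  # highest-ranked (ratio, k-mer) pair
```

A real analysis would replace the ratio with a proper significance test and would allow degenerate positions, which is what the dedicated motif finders listed above provide.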
Training data – Features
[Figure labels: regulator expression; promoter sequence; label; feature vector]
What is PWM?
• Transcription factor binding sites (TFBSs) are
usually slightly variable in their sequences.
• A positional weight matrix (PWM) specifies the
probability that you will see a given base at each
index position of the motif.
Pos   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
A    18   8   5   4   1  29   7   7   7   0   1  39   1   1   6
C     8   3   3   9  33   4  21  15  14   0   0   1  43  39  18
G    13  31  34   9   8  10  11  15  19   4  44   3   0   1   6
T     7   4   4  24   4   3   7   9   6  42   1   3   2   5  16
Con   N   G   G   T   C   A   N   N   N   T   G   A   C   C   N
PWM for ERE
1.  acggcagggTGACCc
2.  aGGGCAtcgTGACCc
3.  cGGTCGccaGGACCt
4.  tGGTCAggcTGGTCt
5.  aGGTGGcccTGACCc
6.  cTGTCCctcTGACCc
7.  aGGCTAcgaTGACGt
...
41. cagggagtgTGACCc
42. gagcatgggTGACCa
43. aGGTCAtaacgattt
44. gGAACAgttTGACCc
45. cGGTGAcctTGACCc
46. gGGGCAaagTGACTg
Position frequency matrix (PFM)
(also known as raw count matrix)
Given N sequence fragments of fixed length, one
can assemble a position frequency matrix
(number of times a particular nucleotide appears
at a given position). A normalized PFM, in which
each column adds up to a total of one, is a matrix
of probabilities for observing each nucleotide at
each position.
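As a minimal sketch of the definition above, the following counts bases per position for a few aligned sites and then normalizes each column to sum to one. The three sites are taken from the ERE list in this deck; the function names `build_pfm` and `normalize` are hypothetical, and the code assumes the sites contain only A, C, G and T.

```python
BASES = "ACGT"


def build_pfm(sites):
    """Count how often each base occurs at each position of aligned sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    pfm = {b: [0] * length for b in BASES}
    for site in sites:
        for i, base in enumerate(site.upper()):
            pfm[base][i] += 1  # assumes only A/C/G/T occur
    return pfm


def normalize(pfm):
    """Divide each count by its column sum so every column adds up to 1."""
    n_cols = len(pfm["A"])
    col_sums = [sum(pfm[b][i] for b in BASES) for i in range(n_cols)]
    return {b: [pfm[b][i] / col_sums[i] for i in range(n_cols)] for b in BASES}


sites = ["aGGGCAtcgTGACCc", "cGGTCGccaGGACCt", "tGGTCAggcTGGTCt"]
pfm = build_pfm(sites)
probs = normalize(pfm)
print(pfm["G"])    # raw counts of G at each of the 15 positions
print(probs["G"])  # the same row as column-normalized probabilities
```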
Position weight matrix (PWM)
(also known as position-specific scoring matrix)
PFM should be converted to log-scale for efficient
computational analysis. To eliminate null values before
log-conversion, and to correct for small samples of
binding sites, a sampling correction, known as
pseudocounts, is added to each cell of the PFM.
Position Weight Matrix for ERE
Converting a PFM into a PWM

PFM (raw counts):
Pos   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
A    18   8   5   4   1  29   7   7   7   0   1  39   1   1   6
C     8   3   3   9  33   4  21  15  14   0   0   1  43  39  18
G    13  31  34   9   8  10  11  15  19   4  44   3   0   1   6
T     7   4   4  24   4   3   7   9   6  42   1   3   2   5  16

For each matrix element do:

$$w(b,i) = \log_2 \frac{p(b,i)}{p(b)}, \qquad p(b,i) = \frac{f_{b,i} + \sqrt{N}/4}{N + \sqrt{N}}$$

PWM:
Pos      1      2      3      4      5      6      7      8      9     10     11     12     13     14     15
A     0.58  -0.44  -0.98  -1.21  -2.29   1.22  -0.60  -0.60  -0.60  -2.96  -2.29   1.62  -2.29  -2.29  -0.72
C    -0.44  -1.49  -1.49  -0.30   1.39  -1.21   0.78   0.34   0.25  -2.96  -2.96  -2.29   1.76   1.62   0.46
G     0.16   1.31   1.44  -0.30  -0.44  -0.17  -0.06   0.34   0.65  -1.21   1.79  -1.49  -2.96  -2.29  -0.64
T    -0.60  -1.21  -1.21   0.96  -1.21  -1.49  -0.60  -0.30  -0.78   1.73  -2.29  -1.49  -1.84  -0.98   0.23

$f_{b,i}$ – raw count (PFM matrix element) of nucleotide b in column i
$N$ – number of sequences used to create PFM (= column sum)
$\sqrt{N}$ and $\sqrt{N}/4$ – pseudocounts (correction for small sample size)
$p(b)$ – background frequency of nucleotide b
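The conversion can be sketched in a few lines. The code below applies w(b, i) = log2(p(b, i) / p(b)) with p(b, i) = (f + sqrt(N)/4) / (N + sqrt(N)) to three columns of the ERE count matrix (positions 1, 2 and 10). The function name `pfm_to_pwm` is hypothetical, and the uniform background p(b) = 0.25 is an assumption; it is the value that reproduces the weights shown on the slide for these columns.

```python
import math


def pfm_to_pwm(pfm, background=0.25):
    """Convert raw counts to log2-odds weights with sqrt(N) pseudocounts.

    background is p(b); a uniform 0.25 is assumed here for all four bases.
    """
    bases = sorted(pfm)
    n_cols = len(pfm[bases[0]])
    pwm = {b: [] for b in bases}
    for i in range(n_cols):
        n = sum(pfm[b][i] for b in bases)  # N = column sum
        root_n = math.sqrt(n)
        for b in bases:
            # The pseudocount sqrt(N)/4 keeps zero counts finite after the log.
            p_bi = (pfm[b][i] + root_n / 4) / (n + root_n)
            pwm[b].append(math.log2(p_bi / background))
    return pwm


# Positions 1, 2 and 10 of the ERE count matrix (N = 46 in each column).
pfm = {"A": [18, 8, 0], "C": [8, 3, 0], "G": [13, 31, 4], "T": [7, 4, 42]}
pwm = pfm_to_pwm(pfm)
for base in "ACGT":
    print(base, [round(w, 2) for w in pwm[base]])
```

Rounded to two decimals, the A row comes out as 0.58, -0.44, -2.96, matching the corresponding entries of the PWM.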
Scoring putative EREs by scanning the promoter with PWM

Site: GGGTCAGCATGGCCA

Pos      1      2      3      4      5      6      7      8      9     10     11     12     13     14     15
A     0.58  -0.44  -0.98  -1.21  -2.29   1.22  -0.60  -0.60  -0.60  -2.96  -2.29   1.62  -2.29  -2.29  -0.72
C    -0.44  -1.49  -1.49  -0.30   1.39  -1.21   0.78   0.34   0.25  -2.96  -2.96  -2.29   1.76   1.62   0.46
G     0.16   1.31   1.44  -0.30  -0.44  -0.17  -0.06   0.34   0.65  -1.21   1.79  -1.49  -2.96  -2.29  -0.64
T    -0.60  -1.21  -1.21   0.96  -1.21  -1.49  -0.60  -0.30  -0.78   1.73  -2.29  -1.49  -1.84  -0.98   0.23

Absolute score of the site:

$$S = \sum_{i=1}^{m} w(b,i) = 11.57$$

Row      1      2      3      4      5      6      7      8      9     10     11     12     13     14     Sum
Max   0.58   1.31   1.44   0.96   1.39   1.22   0.78   0.34   0.65   1.73   1.79   1.62   1.76   1.62   17.20
Min  -0.60  -1.49  -1.49  -1.21  -2.29  -1.49  -0.60  -0.60  -0.78  -2.96  -2.96  -2.29  -2.96  -2.29  -24.02

$$\text{relative score} = \frac{\text{absolute score} - \text{minimum score}}{\text{maximum score} - \text{minimum score}} = \frac{11.57 - (-24.02)}{17.20 - (-24.02)} = 0.86$$
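The scoring step can be reproduced with a short sketch. The weights below are copied from the ERE PWM; the function names and the toy promoter are hypothetical. One assumption is worth flagging: the slide's totals (11.57, 17.20, -24.02) match a sum over positions 1 to 14 only, so the sketch uses m = 14. That the 15th column is left out is inferred from the numbers, not stated in the source. Summing the rounded two-decimal weights gives 17.19 and -24.01 for the bounds, a rounding difference from the slide that does not change the relative score at two decimals.

```python
# ERE PWM from the slide (rows A, C, G, T; 15 columns).
PWM = {
    "A": [0.58, -0.44, -0.98, -1.21, -2.29, 1.22, -0.60, -0.60,
          -0.60, -2.96, -2.29, 1.62, -2.29, -2.29, -0.72],
    "C": [-0.44, -1.49, -1.49, -0.30, 1.39, -1.21, 0.78, 0.34,
          0.25, -2.96, -2.96, -2.29, 1.76, 1.62, 0.46],
    "G": [0.16, 1.31, 1.44, -0.30, -0.44, -0.17, -0.06, 0.34,
          0.65, -1.21, 1.79, -1.49, -2.96, -2.29, -0.64],
    "T": [-0.60, -1.21, -1.21, 0.96, -1.21, -1.49, -0.60, -0.30,
          -0.78, 1.73, -2.29, -1.49, -1.84, -0.98, 0.23],
}
M = 14  # assumption: number of scored positions, inferred from the slide totals


def absolute_score(site, pwm=PWM, m=M):
    """Sum the weight of the observed base at each of the first m positions."""
    return sum(pwm[base][i] for i, base in enumerate(site.upper()[:m]))


def score_bounds(pwm=PWM, m=M):
    """Lowest and highest score any site can reach over the first m positions."""
    lo = sum(min(pwm[b][i] for b in pwm) for i in range(m))
    hi = sum(max(pwm[b][i] for b in pwm) for i in range(m))
    return lo, hi


def relative_score(site, pwm=PWM, m=M):
    """Rescale the absolute score to the range 0..1."""
    lo, hi = score_bounds(pwm, m)
    return (absolute_score(site, pwm, m) - lo) / (hi - lo)


def scan(promoter, pwm=PWM, m=M):
    """Score every window of length m along one strand of the promoter."""
    promoter = promoter.upper()
    return [(off, relative_score(promoter[off:off + m], pwm, m))
            for off in range(len(promoter) - m + 1)]


site = "GGGTCAGCATGGCCA"
print(round(absolute_score(site), 2))  # 11.57
print(round(relative_score(site), 2))  # 0.86
hits = scan("TTT" + site + "AAA")      # toy promoter with the site at offset 3
```

In practice a relative-score cutoff is applied to the output of `scan`, and the reverse strand is scanned as well; neither is shown here.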
Yeast ESR: Biological Validation
[Figure labels: universal stress repressor motif; STRE element]
Previous work: “Structure learning”
• Graphical models (and other methods)
– Learn structure of “regulatory network”, “regulatory
modules”, etc.
– Fit interpretable model to training data
– Model small number of genes or clusters of genes
– Many computational and statistical challenges; often
used for qualitative hypotheses rather than prediction
(Pe’er et al. 2001)
(Segal et al. 2003, 2004)
Signaling networks in a cell
Network inference
• Regulator-motif associations in nodes can have
different meanings:
– Direct binding (figure labels: P, M_P)
– Indirect effect (figure labels: P, TF, M_TF)
– Co-occurrence (figure labels: P, M_P, M)
• Need other data to confirm binding relationship
between regulator and target (e.g. ChIP-chip)
• Still, can determine statistically significant regulator-target relationships from regulation program
Example: oxygen sensing and regulatory network
Binding data for regulatory networks
• ChIP-chip: genome-wide protein-DNA binding data, i.e. which
promoters are bound by a TF?
• Investigate regulatory network
model: use ChIP-chip data in place of
motifs (no motif discovery)
– Features: (regulator, TF-occupancy) pairs
[Figure labels: P1; P2; TF]
Inferring regulatory networks from the combination
of expression data and binding data
An extended ER regulatory network in MCF7 cells
[Figure: network node labels, in extraction order] RUVBL1, GTF2I, CCNL1, RFC1, RXRA, MKL2, BHLHB2, BAZ1B, HEY2, ZNF394, TTF2, RAB18, ASCC3, STRAP, ZNF50, 0, XBP1, ER, TLE3, FOXP4, BRIP1, PAWR, FOS, HDAC1, TBX2, MYC, ELF3, DDX20, NRIP1, LASS2, THRAP1, ZBTB41, TXNIP, PNN, CEBP, VPS72, ZNF23, BRF1, 9, HSF2, MSX2, HIF1A, ZNF38, DNMT1, ZKSCAN1, C140RF43, PURB, ADAR, CUTL1, IVNS1ABP, CHAF1B, BATF, CSDE1, SP3, TXNDC, HES1
Signaling molecules -- Networks
• Find all SMs that associate as regulators with a
particular TF’s ChIP occupancy in ADT features
• e.g. TF: Hsf1; SM mRNA: Gac1, Sds22, Gip1 (Glc7 phosphatase complex)
• Hypothesis: Glc7 phosphatase complex interacts with
Hsf1 in regulation of Hsf1 targets
(Interaction supported in literature)
http://motif.bmi.ohio-state.edu/ChIPMotifs/
Input Data
•FASTA file
•Contact Info
•Control data (optional)
Ab initio Motif Discovery Programs
•Weeder
•MaMf
•MEME
Statistical Methods
•Bootstrap re-sampling
•Fisher test
STAMP Matching
Results
•SeqLog
•PWM
•P-value
•Known or novel motifs
http://motif.bmi-ohio-state.edu/HRTBLDb
Software Demo
• W-ChIPMotifs
• HRTargetDB