Transcript Document

Regulatory sequence analysis
tools and approaches
Alexander Kel
BIOBASE GmbH
Halchtersche Strasse 33
D-38304 Wolfenbuettel
Germany
[email protected]
www.biobase.de
F(g)=E(g)A(g
p)
Gene functional
role
Gene expression
profile
organ,
tissue,
cell
stage of
development
cell cycle
phase
extracellular
signals
Protein specific activity
(as enzyme or structural or
regulatory protein)
?
?
Where ?
organ,
tissue,
cell
When ?
stage of
development
How ?
cell cycle
phase
extracellular
signals
With whom?
organ,
tissue,
cell
stage of
development
cell cycle
phase
External signals,
conditions
Collecting bits of information about regulation of gene expression
through transcription factor binding sites
Mouse p53 tumor suppressor gene
Expression
level
+1 to +216: enhancer
•+3 to +19: NF-1
•+35 to +51: p53
•+53 to +69: NF-kappaB
•+57 to +72: ETF
•+65 to +79: E-box
-225 to +1: promoter 1
•-195 to -170: p53
•-68 to -53: AP-1
-320 to -225: negative
regulatory element
low
maximal level
low
none
high
medium
very low
high
induction
induction
induction
induction
Organ, tissue, cell Stage of
type
development
Cell cycle
phase
Extracellular
signal
G1
G1/S, S
G2
G0
heart, liver
heart
heart;
terminally diff.
cardiomyocytes
spleen, thymus;
proliferating
fibroblasts
lymphocytes
embryo
at birth
adult
TPA, serum
mitogenic induct.
TNF-α
UV radiation
?
Consensus    A    C    G    T
N            9    8    4    8
T            2    3    2   22
T            1    1    2   25
T            0    1    2   26
S            1   13   15    0
G            0    3   26    0
C            0   29    0    0
G            0    0   29    0
C            0   22    7    0
S            1    8   17    3
M           15    9    3    2
R           13    4    9    3
N            7    8    8    6
D           13    1    7    8

q = \frac{\sum_{i=1}^{l} I(i)\, f(b_i, i) - \sum_{i=1}^{l} I(i)\, f_{\min}(i)}{\sum_{i=1}^{l} I(i)\, f_{\max}(i) - \sum_{i=1}^{l} I(i)\, f_{\min}(i)}    (1)

I(i) = \sum_{b \in \{A,T,G,C\}} f(b, i) \ln(4 f(b, i))    (2)
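The score defined by equations (1) and (2) can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by the authors: the count matrix is the one shown on the slide, read here as one [A, C, G, T] column per position, and the function names (`frequencies`, `information_vector`, `matrix_score`) as well as the convention that a zero frequency contributes 0 to I(i) are assumptions made for the example.

```python
import math

BASES = "ACGT"

# Count matrix read from the slide, one [A, C, G, T] column per position
# (consensus N T T T S G C G C S M R N D). The row order is an assumption.
E2F_COUNTS = [
    [9, 8, 4, 8], [2, 3, 2, 22], [1, 1, 2, 25], [0, 1, 2, 26],
    [1, 13, 15, 0], [0, 3, 26, 0], [0, 29, 0, 0], [0, 0, 29, 0],
    [0, 22, 7, 0], [1, 8, 17, 3], [15, 9, 3, 2], [13, 4, 9, 3],
    [7, 8, 8, 6], [13, 1, 7, 8],
]


def frequencies(column):
    """Turn the counts of one position into relative frequencies f(b, i)."""
    total = sum(column)
    return [count / total for count in column]


def information_vector(counts):
    """Equation (2): I(i) = sum_b f(b, i) * ln(4 f(b, i)).

    Terms with f = 0 are skipped (assumed to contribute 0).
    """
    return [
        sum(f * math.log(4 * f) for f in frequencies(column) if f > 0)
        for column in counts
    ]


def matrix_score(counts, site):
    """Equation (1): information-weighted score rescaled between min and max."""
    if len(site) != len(counts):
        raise ValueError("site and matrix must have the same length")
    current = minimum = maximum = 0.0
    for column, info, base in zip(counts, information_vector(counts), site.upper()):
        freqs = frequencies(column)
        current += info * freqs[BASES.index(base)]
        minimum += info * min(freqs)
        maximum += info * max(freqs)
    return (current - minimum) / (maximum - minimum)


# Score an arbitrary 14-mer against the matrix.
print(round(matrix_score(E2F_COUNTS, "TTTTGGCGCGAAAA"), 3))
```

A site that carries the most frequent nucleotide in every position scores 1, and a site that carries the least frequent nucleotide in every position scores 0.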
TFBS identification via pattern search
Phylogenetic footprint of promoter regions of nucleolin genes
1
<===========V$CREB_02(0.85)
=============================================================================
2
<=======V$CREB_01(0.82)
MMNUCLEO
GGCCCGCTCATCAGCCCGAGGGAACCCTAGG--CC------TTCCGGCGTTCT------423
MMNUCLEO
TCTCCCCAC-CACACCAGGAAGTCACCTCTCTCA----------ACCTG---GAGTTATA
225
RNNUCIA1
GGCCCACTAAACGGCCCGAATGAACTCTAGG--CC------TTCCGGCGCTCT------435
1
<===========V$CREB_02(0.85)
CSNUCLEO
GGCC-GCGAGCTGGCCCCAGTGG-CTCTAGG--CCCTCAACTTCCGGCGCTCTCCGGCTC
450
2
<=======V$CREB_01(0.82)
HSNUCLEO
TGCCTCCAAAAGGGCCAACGGGAACTCCGCGGTCCCTGAACTTCCGGTGCTGGAGG---A
448
RNNUCIA1
TCTCCCACCACACACCAGGAAGTCACCTCTCTGA----------ACCTG---GAGTTATA
221
*** *
***
* * *
* **
****** * *
1
<===========V$CREB_02(0.85)
=============================================================================
2
<=======V$CREB_01(0.82)
MMNUCLEO
-TCAGCAGGACCACGCGGCG---------------------------------------442
CSNUCLEO
CCTCC-AGCACACACCAGGAAGTCACCTCTCCGAGACCGTCCCCATCAG---GAGTTAAA
229
RNNUCIA1
-CCAGCTCTTCAGCGCGGCGAACGTTCTAGGCCCCTGAGAAGTCCACCGGGAGGCGCAGG
494
1
<===============V$TH1E47_01(0.85)
CSNUCLEO
CTCAGCGGGAACGCGCGGCGAGCAGTTGAGGCCGCCGCGGATTCCAACGGGTTGGGGACG
510
HSNUCLEO
TGGCCCTGT-GAGGCCAGAAAGTTACTTCTCCGAGGCCAGTTCCCCATGTCTGAGAAATA
229
HSNUCLEO
CTCCTCGCTCCAGGGCCACCAGGAGCCGCGGC---------------------GTGAGTG
487
**
* **** **** ** **** *
*
*** * *
* *
** *
=============================================================================
=============================================================================
MMNUCLEO
--------------GGGGGAAA-----GCACCGAGAAACGCCCAGACCACCTGAGCATCG
483
1
<==========V$DELTAEF1_01(0.82)
RNNUCIA1
TTTCCGCTACGCGAGGGGGAAA-----TCCCCGAGAAATGCCCAGACCACCTAAGCACAG
549
MMNUCLEO
CCTACCG-CGAGAGGTCACCGACATTACATGGATCGCTTGTGCACTGCTCGTA--CACAC
282
CSNUCLEO
TTCGC----AGCGCGGGGGATGCTCGGGCCACCCACCACCCCCCCACCCCCCCGGCCACG
566
1
<======== ==V$DELTAEF1_01(0.87)
HSNUCLEO
CGTGCCGGAACCGAGGGCGGGG-----TCTCTGAGGAACTCCAAGGCTGCCCAAGCCTAC
542
RNNUCIA1
CCTACCG-CGTGAGGTCA--GAGATTAAATGGACTGTTTGTGCACTGCTCACA--CACAC
276
*** *
*
* **
* **
**
1
<======== ==V$DELTAEF1_01(0.84)
=============================================================================
CSNUCLEO
TCTACCG-CGCGAGGTTG--GACATTAAGCGAGCTGTTTGAGCACTGCACACAGGCGCGC
286
MMNUCLEO
CCGCCC--------ATGCTGCCTCGGAACACCTGAGGGAATCCGGGCCACGCCGCCACCT
535
1
<========= =V$DELTAEF1_01(0.84)
RNNUCIA1
ACGTCC--------ATGCGGCGTACGGATACCTGAGGGAATCCGGGCCATACCGCCACCT
601
HSNUCLEO
TCTCCCAACTTGAGGTTCT-GTGGGGTAGGGGAGGGTTCGTGACTTTCTCACAGAAAACC
288
CSNUCLEO
AGGCCCGGAGCTCCAGGTAGCAGTGCAGCACTAGGCGGCGTCCGGGCCACGCCGCCCAAT
626
** ** * *****
*
*
* * * *
* * * *
*
HSNUCLEO
GGACCC---------AGCCACATTGGCGAACC----GGAGACCGCCCGATTCCACCACC588
=============================================================================
**
*
*
**
**
*** * * ** **
1
<=======V$NKX25_02(0.84)
2
=========>V$CETS1P54_01(0.87)=============================================================================
1
<=======V$E2F_02(1.00)
MMNUCLEO
ACACACGCAC------------AACTGCTTTTATTAGGAGCT----CTCAGGAAAGCGGG
326
MMNUCLEO
ACCCGCG--CCTCACACACAAGCCGCGCCAAACTCGCCCGTCCCACTGCGCAGGCGTGGG
593
1
<=======V$NKX25_02(0.84)
1
<=======V$E2F_02(1.00)
2
=========>V$CETS1P54_01(0.87)
RNNUCIA1
ACTCGCG--CCTCACTC--AAGCCGCGCCAAACTCGCGCGTTTCACTGCGCAGGCGTGTA
657
RNNUCIA1
ACACACGCGCGCGCGCGCGCGAAATTGCTTTTATTAGGAGCT----CTCAGGAAAGTGGT
332
1
<=======V$E2F_02(1.00)
1
=======>V$NKX25_02(0.82)
TCCCCCGAGCCCCTTCCACAAGCCGCGCCAAACGGGTCTG---CACCGCGCAGGCG--GC
681
2
<==========V$DELTAEF1_01(0.81)CSNUCLEO
1
<=======V$E2F_02(1.00)
3
=========>V$CETS1P54_01(0.84)
HSNUCLEO
-CCCGCGCTCCCCTCAC--AGCCGGCGCCAAAAACGCCAGTCCCACGACGCAGGC----640
CSNUCLEO
ACACACGCACGC----------AACTGCCTTTATTGGGAGCTGTCTCTCAGGAGAACAGC
336
* * ** ** *
* * * ********
*
*
*** *******
1
<=======V$NKX25_02(0.83)
2
<==========V$DELTAEF1_01(0.81)
3
=========>V$CETS1P54_01(0.86)
HSNUCLEO
TCGTACAGACCC-------CGCCACTGCCTTTATTAACAGCT----CTCAGGAGACTGCC
337
* **
*
* *** ******
****
******* *
HSNUCLEO - Homo sapiens;
=============================================================================
CSNUCLEO - Cricetulus griseus;
MMNUCLEO
GACTCGCATCA---TAGCCAAG----AAGCCGTTCGCGAC-TCCGCGGAGAACAGGCCGA
378
RNNUCIA1
GGCTCGCATCAGGCTACCACAGCC--AAGAGGACCGCCACCTCTACCGAGGGCAGGCCAA
390
MMNUCLEO - Mus musculus;
CSNUCLEO
GGCCCGCGGCGCAACACTAGAGCCCCGGGATGTTCTCGGC-TCTGCCGAGGGCAG-CCGA
394
RNNUCIA1 – Rattus norvegicus
HSNUCLEO
TGCAGGAGGGGGGTCGCTCCGGCC---CCATGCTCGCGGG-CAAGCAGGGATAAG--CTG
391
* *
*
* * *
* * *
** *
Gibbs sampling
Algorithm
[Figure: a weight matrix with rows A, T, G, C shown at successive iterations 1), 2), 3), ...]
Jun
Fos
TGASTCA
AP-1
NFAT
human TNF promoter
-107
AP-1
mast cells
-74
NFAT
T-cells
NF-kB
dendritic cells
VDR
AP-1
C/EBP
T-cells + ?
Functional of Averaged Density

Weight Sum:  S = \sum_{sequence \in sample} k(sequence)
Kernel Volume:  V = \left( \sum_{sequence \in space} k^{1+h}(sequence) \right)^{1/(1+h)}
Averaged Density:  S / V
Condition for Maximum of Averaged Density:  k^{h}(sequence) \propto p(sequence)
Main properties of the functional of averaged density \Phi_{0,h}(k),
represented as 3 theorems.
Theorem 1: The functional of averaged density \Phi_{0,h}(k) reaches its maximum with respect to the kernel
k(\omega) that satisfies the equation
k(\omega) = c \cdot p^{1/h}(\omega),
where c is an arbitrary normalization factor.
This theorem says that we can get an accurate estimate of the probability function p(\omega) by
maximizing \Phi_{n,h}(k) as n \to \infty. In this case p_n(\omega) = const \cdot \{k_n(\omega)\}^{h}.
Theorem 2: Let the averaged density functional \Phi_{n,h}(k) reach its maximum with respect to the
kernel k_h = c_h \cdot p_h^{1/h}(\omega), with k_h \in K and p_h \in P. Then the log-likelihood function
L_n(p) has a limit as h \to \infty, and it equals
\lim_{h \to \infty} L_n(p_h) = \sup_{p \in P} L_n(p)
This theorem says that as h \to \infty the method of averaged density functional maximization
is similar to the method of likelihood maximization.
Theorem 3: Let p(\omega) be the probabilities of sequences \omega. Let p_n(\omega) be some estimates of
probabilities satisfying the equation
p ( )  k ( )   p ( )  k ( )



i 
n
i
n
i
i 
i
n
i
where k(\omega) = c \cdot p^{1/h}(\omega) and k_n(\omega) = c \cdot p_n^{1/h}(\omega). The following relation is true
1 h
 p(i )  pn (i )  kn (i )




8  h    0 h (k )  h
 i 
 

1


h  1    0 h (kn ) 

 max

p
(

)

k
(

)
,
p
(

)

k
(

)



i
i
n
i
n
i 
 
 i 
 i 

2
This theorem says that we can expect more accurate probability estimates for sequences with
higher kernel weights. At the least, the theorem establishes an upper bound on the accuracy of
estimation. The problem is that we maximize the empirical functional \Phi_{n,h}(k), not \Phi_{0,h}(k). If the
family of probabilities (respectively, the family of kernels) is too diverse, the value of \Phi_{n,h}(k) may
differ significantly from \Phi_{0,h}(k). But this is a problem of the maximum likelihood method as well.
Model for Independent Distribution of Symbols

R_L is the distance from the sequence L to the given local consensus:
R_L = \sum_{j} \rho_{jl}    (1)
\rho_{jl} is the distance coefficient for letter l in the j-th position.

s_{jl} is the weighted sum for letter l in the j-th position:
s_{jl} = \sum_{L \in L^{*}_{jl}} e^{-(R_L - \rho_{jl})/h}    (2)
L^{*}_{jl} is the set of all sequences from the selection in which letter l is situated in the j-th position.

s_{jl_0} = \max_{l} (s_{jl}) for every position j    (3)

\rho'_{jl} = \ln\left( \frac{s_{jl_0}}{s_{jl}} \right)    (4)
1. Initialisation of the algorithm by setting the initial values \rho_{jl}. For that we select a sequence
L and set \rho_{jl} = 0, where l is the letter in the j-th position of sequence L. All other values of \rho_{jl} are set
to 1.
2. Calculation of the distance R_L (1).
3. Calculation of the partial sums s_{jl} (2).
4. Determination of the maximal values s_{jl_0} for every position (3).
5. Calculation of the new values \rho'_{jl} (4).
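The five steps above can be sketched as a short iteration in Python. This is only an illustration under stated assumptions, not the authors' program: the exponent of the weighted sum is taken as -(R_L - rho_jl)/h, the loop runs for a fixed number of rounds instead of testing convergence, letters that never occur in a position receive a hypothetical cap value, and all names (`kernel_iteration`, `consensus`, `rounds`, `cap`) are invented for the example.

```python
import math

ALPHABET = "ACGT"


def kernel_iteration(sequences, seed, h=1.2, rounds=20, cap=10.0):
    """Iterate steps 1-5 for equal-length sequences.

    Returns one dict per position mapping each letter to its distance
    coefficient rho. `cap` (an assumption) is used where a letter is absent.
    """
    length = len(seed)
    # Step 1: rho = 0 for the letters of the seed sequence, 1 otherwise.
    rho = [{l: (0.0 if l == seed[j] else 1.0) for l in ALPHABET}
           for j in range(length)]
    for _ in range(rounds):
        # Step 2: distance of every sequence to the current local consensus.
        dist = [sum(rho[j][seq[j]] for j in range(length)) for seq in sequences]
        new_rho = []
        for j in range(length):
            # Step 3: weighted partial sums over sequences with letter l at j.
            s = {l: 0.0 for l in ALPHABET}
            for seq, r in zip(sequences, dist):
                l = seq[j]
                s[l] += math.exp(-(r - rho[j][l]) / h)
            # Step 4: maximal partial sum for this position.
            best = max(s.values())
            # Step 5: new coefficients rho' = ln(best / s).
            new_rho.append({
                l: (min(cap, math.log(best / s[l])) if s[l] > 0 else cap)
                for l in ALPHABET
            })
        rho = new_rho
    return rho


def consensus(rho):
    """Letter with the smallest distance coefficient in every position."""
    return "".join(min(column, key=column.get) for column in rho)


sample = ["TGACTCA"] * 8 + ["TGAGTCA"] * 3 + ["CGACTCA"]
print(consensus(kernel_iteration(sample, seed="TGAGTCA")))
```

Starting from different seed sequences is one way such an iteration could settle on different local consensuses in a mixed sample.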
Testing of the kernel method of motif finding.
A mixture of CREB and AP-1 sites was analyzed. The kernel method revealed the two original
motifs, whereas CONSENSUS-V6C.1 and Gibbs sampling were not able to separate the two patterns:
each revealed only one pattern, which is a mixture of the original two.
Table 1. Weight matrices revealed with the kernel method (smoothing parameter h = 1.2).

Weight matrix 1 (113 sequences contain this motif)
Position     1    2    3    4    5    6    7    8    9   10
A           15   18   51    0    0    4  100    6    2  100
G           14   55    4   84    0    2    0   36    0    0
C            6    5   34    1    5   94    0   27   98    0
T           65   22   11   15   95    0    0   31    0    0
Consensus    T    G    A    G    T    C    A         C    A

Weight matrix 2 (73 sequences contain this motif)
Position     1    2    3    4    5    6
A           12    1   75    7   23    0
G            8   77   10    5   62    0
C            5    3    5   88   15    3
T           75   19   10    0    0   97
Consensus    T    G    A    C    G    T
Table 2. The optimal weight matrix (153 sampled words) resulting from a run of the program
CONSENSUS-V6C.1 (Hertz and Stormo, 1999).
Position     1    2    3    4    5    6    7    8
A            9   39   37    0    0    5  100    5
G           29   40   11   87    0    0    0   26
C            8    4   40    8    0   95    0   27
T           54   17   12    5  100    0    0   42
Consensus    T  G/A  C/A    G    T    C    A    T
Table 3. Weight matrix obtained with Gibbs sampling (Lawrence et al., 1993).
A
G
C
T
Consensus
36
42
43
93
82
36
52
T
G/A
C/A
85
G
92
T
C
A
39
T
A
[Figure: layout of the simulated test data; labels: 10 bp implanted site, 100, 100 bp. In the matrix diagram the consensus nucleotide of each position has one fixed probability and each of the other three nucleotides has one third of the remainder.]

D = \sum_{j,l} \left( p^{implanted}_{jl} - p^{calculated}_{jl} \right)^{2}

Result of comparison of four different pattern discovery programs on sets of simulated
sequences with implanted TF binding sites for one matrix; y-axis: the averaged sum of
squared differences between the revealed matrix and the original one; x-axis: the
probability of the “consensus nucleotide” in each position of the matrix.
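The comparison measure, a sum of squared differences between the implanted and the recovered matrix, is simple enough to show as a worked sketch. The helper names (`implanted_matrix`, `squared_difference`) and the parameter name `consensus_prob` are hypothetical; the sketch assumes that the non-consensus nucleotides share the remaining probability equally, as the matrix diagram suggests.

```python
BASES = "ACGT"


def implanted_matrix(consensus, consensus_prob):
    """Build a probability matrix, one [A, C, G, T] row per position.

    The consensus nucleotide gets `consensus_prob`; the three others share
    the remainder equally (assumption taken from the slide diagram).
    """
    rest = (1.0 - consensus_prob) / 3.0
    return [[consensus_prob if base == letter else rest for base in BASES]
            for letter in consensus]


def squared_difference(implanted, calculated):
    """Sum over positions j and letters l of (p_implanted - p_calculated)^2."""
    return sum((p - q) ** 2
               for row_p, row_q in zip(implanted, calculated)
               for p, q in zip(row_p, row_q))


original = implanted_matrix("TGA", 0.7)
uniform = [[0.25] * 4 for _ in range(3)]  # a matrix that recovered nothing
print(squared_difference(original, original), squared_difference(original, uniform))
```

A perfectly recovered matrix gives 0, and the value grows as the recovered matrix drifts away from the implanted one.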
[Plot: y-axis from 0.000 to 1.000; x-axis from 0.65 to 0.95; series: Kernel, MEME, CONSENSUS, GIBBS]
Table 1. Comparison of the 3 programs that perform best at low values of the consensus-nucleotide probability.

Probability      0.65    0.7
Kernel           0.205   0.165
MULTIPROFILER    0.208   0.255
PROJECTION       0.260   0.304
=0.7
[Figure: layout of the simulated test data with two implanted 10 bp site matrices, X1 and X2; labels: 100, 100 bp]
Result of comparison of four different pattern discovery programs on sets of
simulated sequences with implanted TF binding sites for two matrices; y-axis:
the averaged sum of squared differences between the two revealed matrices and the two
original ones; x-axis: 4 different variants of matrices, ordered from the most different
matrices (first) to the most similar matrices (last).
[Plot: y-axis from 0 to 2.5; x-axis: matrix variants 1 to 4; series: Kernel, MULTIPROF, CONS t=10, CONS t=20, CONS t=50, GIBBS, ANN-SPEC]
Hierarchical order of the anatomical structures
Bronchial tree and Intrapulmonary Airways
Human body
Lung
Bronchial tree
Main bronchus
Lobar bronchus
Segmental bronchus
Bronchus
Bronchiolus
Terminal bronchiolus
Respiratory bronchiolus
Alveolar duct
Alveolar sac
Pulmonary alveolus
Alveolar septa
Alveolar pore
Alveolar epithelium
Pneumocytes
Cytomer/Content
Link from CYTOMER to TRANSFAC
Link from CYTOMER to TRANSFAC: T00167
Gene expression
UniGene
EST
Cytomer
TRANSGENOME
Gene expression
group 1
Gene expression
group 2
Gene expression
group 3
[Bar chart: number of promoters of the specific genes (y-axis from 0 to 500) per category: cell cycle, T-cell, uterus, testis, stomach, prostate gland, placenta, peripheral lymphoid, pancreas, muscles, lymph, lung, liver, large intestine, kidney, heart, eye, ear, brain, breast]
Cell-cycle
TTTCGCGCCA    V$E2F_03
ATTTGGCGCG    V$E2F_03
AggGCCGgGC    V$SP1_Q6; V$MYCMAX_B
AAAGGAtTTG    ?
GGGGCGGGGC    V$SP1_Q6; V$MAZ_Q6
GGGGGCGGGG    V$SP1_Q6
CCAAAGCCCG    ?
cGCAGCCAAT    V$SREBP_Q3; V$NFY_01
T-cell
CaTTTCCTCT    V$HOX13_01; V$ISRE_01
TATAAAGgga    V$TATA_C; V$LEF1_Q6; V$SRF_Q6
cCCCCGCCCc    V$MTF1_Q4
AtAgAGGAAg    V$NFAT_Q6; V$NFKAPPAB65_01
TGAGGAAATG    V$PTF1BETA_Q6; V$MAF_Q6
CCCCGCCCcc    V$MTF1_Q4
TtCCTtTATA    V$TATA_C
Muscle
GaCTATATAA    V$TATA_C; V$SRF_Q6; V$AMEF2_Q6
GCCcCCtCCT    V$HOX13_01
GGGGcAGgGg    V$SP1_Q6; V$MAZ_Q6; V$MAF_Q6
GAGGtGGCTG    V$MYOD_01
GCAGGGGtGG    ?
CCCCCGGCTC    V$AP2_Q6; V$MTF1_Q4
GGGGAGGggg    V$SP1_Q6; V$MAZ_Q6; V$ETF_Q6
gGGGGCAGGG    ?
Motifs found by the Kernel method in three different sets of promoters. The TRANSFAC
matrices that are most similar to the motifs are shown. Matrices that are very similar to
the motif are shown in bold. Matrices for the factors that are known to be involved in
the regulation of the corresponding specific function are underlined.
TRANSPLORER (TRANScription exPLORER) is a software package for the analysis of transcription regulatory
sequences. Currently, the TRANSPLORER site prediction tool uses collections of position weight matrices (PWMs). It can
use several matrix sources: the largest and most up-to-date library of matrices, derived from the TRANSFAC® Professional
database, other matrix libraries, and any user-developed matrix libraries. It therefore makes it
possible to search for a great variety of transcription factor binding sites. A search can use all matrices or
subsets of matrices from the libraries.
Search for most probable binding sites regulating gene expression
Search for binding sites coinciding with SNPs
Mouse c-fos promoter
(Matrix search for TF binding sites)
1
<------------V$IK1_01(0.86)
-----...V$CREBP1CJUN_01(0.85)
2
<-----------V$IK2_01(0.90)
-----...V$CREB_01(0.96)
3
----------->V$AP2_Q6(0.87)
<-------------V$GKLF_01(0.87)
4-->V$ATF_01(0.89)
<-------V$MZF1_01(0.99)
----...V$ELK1_01(0.87)
5
<-----------V$AP2_Q6(0.92)
<------------V$SP1_Q6(0.88)
6>V$AP1FJ_Q2(0.89)
<-------------V$GKLF_01(0.85)
7>V$AP1_Q2(0.87)
<-------------V$GKLF_01(0.86)
8->V$CREB_Q2(0.86)
<---------V$CETS1P54_01(0.90)
9->V$CREB_Q4(0.90)
<---------V$NRF2_01(0.90)
10
<-------------V$GC_01(0.88)
11
----------->V$CAAT_01(0.87)
12
<------------V$TCF11_01(0.87)
13
----------->V$AP2_Q6(0.87)
14
<---------V$USF_Q6(0.93)
16
--------...V$ATF_01(0.94)
17
-------...V$AP1FJ_Q2(0.95)
20
-------...V$CREBP1_Q2(0.93)
21
-------...V$CREB_Q2(0.95)
23
---...V$IK2_01(0.85)
MMCFOS_1
GAGCGCCCGCAGAGGGCCTTGGGGCGCGCTTCCCCCCCCTTCCAGTTCCGCCCAGTGACG
420
1-->V$CREBP1CJUN_01(0.85)
-------------->V$BARBIE_01(0.86)
2-->V$CREB_01(0.96)
-------------->V$TATA_01(0.95)
3
----------->V$CAAT_01(0.91)
--------->V$AP4_Q5(0.95)
4----------->V$ELK1_01(0.87)
--------------------->V$HEN1_01(0.87)
5
--------->V$AP4_Q5(0.88)
<---...V$CMYB_01(0.93)
6
<---------V$CDPCR3HD_01(0.93)
--...V$VMYB_02(0.89)
7
<--------------V$TATA_01(0.88)
8
--------------------->V$HEN1_02(0.87)
9
<---------------------V$HEN1_02(0.86)
10
<-----------------V$AP4_01(0.88)
11
----------->V$LMO2COM_01(0.93)
12
<-----------V$LMO2COM_01(0.93)
13
<-----------V$MYOD_01(0.88)
17--->V$AP1FJ_Q2(0.95)
<---------V$AP4_Q6(0.99)
20---->V$CREBP1_Q2(0.93)
<---------V$MYOD_Q6(0.96)
21---->V$CREB_Q2(0.95)
Transcription start
23-------->V$IK2_01(0.85)
24
<=========== E2F (0.80)
MMCFOS_1
TAGGAAGTCCATCCATTCACAGCGCTTCTATAAAGGCGCCAGCTGAGGCGCCTACTACTC
480
1
<-----------------V$CMYB_01(0.91)
-------...V$ER_Q6(0.86)
2
<-----------V$LMO2COM_01(0.90)
<----...V$TCF11_01(0.87)
3
--------->V$MYOD_Q6(0.90)
-------->V$STAT_01(0.93)
4
--------->V$VMYB_01(0.89)
<--------V$STAT_01(0.89)
5--------------V$CMYB_01(0.93)
-------->V$LMO2COM_02(0.93)
6------>V$VMYB_02(0.89)
<-----------V$CAAT_01(0.85)
7
-------->V$VMYB_02(0.88)
8
-------------->V$EVI1_04(0.86)
9
------------->V$GATA1_02(0.93)
12
<------------V$ZID_01(0.85)
13
<----------V$CP2_01(0.97)
14
---------->V$GATA_C(0.92)
15
----------------->V$CMYB_01(0.86)
16
--------->V$CREL_01(0.91)
24
<=========== E2F (0.82)
MMCFOS_1
CAACCGCGACTGCAGCGAGCAACTGAGAAGACTGGATAGAGCCGGCGGTTCCGCGAACGA
540
Exon 2 sequence of human thyroid transcription factor-1
(TTF-1) gene (HS198161)
(Matrix search for TF binding sites)
1------------V$AHRARNT_01(0.90)
<-----------------V$NF1_Q6(0.85)
2--------V$NMYC_01(0.89)
--------->V$AP4_Q5(0.91)
3------>V$USF_Q6(0.89)
--------->V$AP4_Q6(0.85)
4------V$USF_C(0.86)
------------...V$YY1_02(0.86)
5 --------->V$AP4_Q5(0.91)
6 --------->V$AP4_Q6(0.86)
7
--------->V$AP4_Q5(0.92)
8
--------->V$AP4_Q6(0.86)
9
--------->V$AP4_Q5(0.86)
HS198161_1 ACGCGCAGCAGCAGGCGCAGCACCAGGCGCAGGCCGCGCAGGCGGCGGCAGCGGCCATCT
540
1
----------------->V$NF1_Q6(0.96)
2
<-----------------V$NF1_Q6(0.90)
3
--------->V$USF_Q6(0.87)
4------->V$YY1_02(0.86)
---------->V$CP2_01(0.88)
5
--------->V$AP4_Q5(0.92)
----------->V$CAAT_01(0.85)
6
--------->V$AP4_Q6(0.85)
--------->V$AP4_Q5(0.86)
7
------...V$CP2_01(0.86)
8
===========> E2F (0.81)
9
===========> E2F (0.90)
HS198161_1 CCGTGGGCAGCGGTGGCGCCGGCCTTGGCGCACACCCGGGCCACCAGCCAGGCAGCGCAG
600
1 <---------V$CETS1P54_01(0.89)
<--------...V$GATA_C(0.86)
2
----------------->V$NF1_Q6(0.85)
<-------...V$GATA1_02(0.90)
3
--------->V$CETS1P54_01(0.90)
<-------...V$GATA1_03(0.92)
4
<--------------------V$R_01(0.88) <-----...V$LMO2COM_02(0.90)
5
<---------------V$AHRARNT_01(0.86)
6
----------->V$AP2_Q6(0.95)
7---->V$CP2_01(0.86)
<-------...V$GATA1_04(0.87)
8
<----...V$CETS1P54_01(0.87)
9
===========> E2F (0.80)
HS198161_1 GCCAGTCTCCGGACCTGGCGCACCACGCCGCCAGCCCCGCGGCGCTGCAGGGCCAGGTAT 660
1--V$GATA_C(0.86)
<---------V$CETS1P54_01(0.89)
2------V$GATA1_02(0.90)
--------...V$DELTAEF1_01(0.96)
3------V$GATA1_03(0.92)
<---...V$CEBPB_01(0.88)
4---V$LMO2COM_02(0.90)
5
<-----------V$IK2_01(0.92)
6
<---------------V$E47_02(0.87)
7-----V$GATA1_04(0.87)
8-----V$CETS1P54_01(0.87)
9
<--------------V$E47_01(0.86)
10
---------->V$DELTAEF1_01(0.99)
11
<-----------V$LMO2COM_01(0.94)
12
<-----------V$MYOD_01(0.87)
13
--------->V$MYOD_Q6(0.91)
14
------->V$USF_C(0.93)
HS198161_1 CCAGCCTGTCCCACCTGAACTCCTCGGGCTCGGACTACGGCACCATGTCCTGCTCCACCT
720
Enhanceosome
Recruitment of CIITA to MHC-II promoters. A prototypical MHC-II promoter (HLA-DRA) is represented schematically with the
W, X, X2, and Y sequences conserved in all MHC-II, Ii, and HLA-DM promoters. RFX, X2BP, NF-Y, and an as yet undefined W-binding protein bind cooperatively to these sequences and assemble into a stable higher order nucleoprotein complex referred to
here as the MHC-II enhanceosome. CIITA is tethered to the enhanceosome via multiple weak protein-protein interactions with the
W, X, X2, and Y-binding factors. The octamer site found in the HLA-DRA promoter (O), and its cognate activators (Oct and OBF-1) are not required for recruitment of CIITA. CIITA is proposed to activate transcription (arrow) via its amino-terminal activation
domains (AD), which contact the RNA polymerase II basal transcription machinery.
Masternak K et al., Genes Dev 2000 May 1;14(9):1156-66
One of the TF binding sites in a composite element can be rather weak.
Weak DNA-protein interactions are stabilized by protein-protein interactions.
Mouse Interleukin-2
gene promoter
AP-1
COMPEL:C00050
NF-ATp
.......
tgccacacaggtagactcttTTGAAAATAtgTGTAATAtgtaaaa catcgtgaca cccccatatt… …
-96
-79
TGAGTCA
AP-1 consensus
ST
Antagonistic composite elements
COMPEL: C00006
Chicken embryonic -globin gene
Sp1
NF-Y
GGTGGGcctccggagtgaccaatgagtgTGGACAGATGCCA
NF-1
Sp1 cooperatively with NF-Y activates
transcription
in primitive erythroid cells
NF-1 represses transcription
in adult cells
COMPEL: C00009
Human c-fos protooncogene
SRF mediates the rapid, transient induction of the
c-fos protooncogene by serum growth factors.
SRF
acaggaTGTCCATATTAGGacatctgcg
YY-1
YY1 diminishes both basal and
serum-induced expression of c-fos.
COMPEL: C00054
Rat serum amyloid A1 gene
C/EBP
NF-κB
C/EBP and NF-κB synergistically
activate transcription in liver cells
during acute phase response
TGGTAGTCTTGCACAGGAAATGACATggtGGGACTTTCCCcaggg
YY-1
YY1 represses inducible transcription of this
gene.
NFAT
human TNF promoter
-107
AP-1
mast cells
-74
NFAT
T-cells
NF-kB
dendritic cells
VDR
AP-1
C/EBP
T-cells + ?
E2F site context
Local context
TTTGGCGCGAAA
Global context
Revealing of local oligonucleotide context
of TF binding sites
motif: WSG
TTTGGCGCGAAA
window: [
]
Promoters of cell-cycle genes:
.............
Exon 2 sequences:
.............
}
}
Frequency
of the motifs
in the window
Search for a maximal clique in a graph of non-correlated
characteristics
0.91
VWS [7,65]
0.74
TTT [39,41]
0.78
BAY [7,65]
0.84
MGSG [25,27]
0.73
WWTT [11,65]
0.88
WS [15,65]
0.77
YKMG [13,15]
0.76
MGCG [19,21]
0.83
VTS [33,35]
0.89
CGSK [17,37]
Found motifs in the flanking regions of E2F sites
in cell-cycle promoters
12 bp
30 bp
30 bp
TTTGGCGCGAAA
MGCG:
TTT:
High
frequency
CGSK:
HKCG:
[
]
]
[
DWTT:
[
Low
frequency
]
[
]
VTV:
BAY:
]
[
]
[
VDWW:
VWS:
]
[
[
[
]
]
Motifs found in the local context of E2F sites in
promoters of cell cycle-related genes

N    Motif   Window (w) 1)   \hat{f}_Y / \hat{f}_N       Utility 2)   \alpha_i
Positive characteristics
1    MGCG    [27,34]         0.0048 / 0.0041 = 1.179    0.80         -0.394
2    TTT     [39,41]         0.0112 / 0.0032 = 3.536    0.75          0.9618
3    CGSK    [17,38]         0.0851 / 0.0341 = 2.499    0.90          0.5353
4    HKCG    [13,16]         0.0675 / 0.0095 = 7.071    0.79          0.5904
5    VDWW    [17,46]         0.1233 / 0.0536 = 2.299    0.72          0.223
6    DWTT    [21,26]         0.0337 / 0.0000            0.80          0.5036
7    GSDM    [3,69]          0.0980 / 0.0559 = 1.754    0.82          0.595
Negative characteristics
8    VWS     [7,66]          0.1258 / 0.1932 = 0.651    0.91         -0.095
9    HSWY    [26,65]         0.0413 / 0.0813 = 0.508    0.79         -0.2297
10   VTV     [19,34]         0.0427 / 0.1354 = 0.315    0.71         -0.261
11   BAY     [7,65]          0.0274 / 0.0614 = 0.447    0.78         -0.566
\alpha_0 = -5.6767

Score of context:
d(X) = \sum_{i=0}^{k} \alpha_i \cdot f(\mu_i, w_i, X)
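A score of context of this kind, a weighted sum of motif frequencies measured in windows around a candidate site, might be computed as in the sketch below. Several points are assumptions made only for illustration: windows are read as 1-based inclusive start positions within the analysed fragment, a frequency is the share of window positions at which the degenerate (IUPAC) motif starts, no free term is added, and the names `motif_frequency` and `context_score` are hypothetical.

```python
import re

# IUPAC nucleotide codes expressed as regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def motif_frequency(motif, window, sequence):
    """Share of start positions inside `window` where `motif` matches.

    `window` is a (start, end) pair, 1-based and inclusive (assumption).
    """
    start, end = window
    sequence = sequence.upper()
    pattern = re.compile("".join(IUPAC[code] for code in motif))
    last_start = len(sequence) - len(motif) + 1
    positions = range(start - 1, min(end, last_start))
    if len(positions) == 0:
        return 0.0
    hits = sum(1 for pos in positions if pattern.match(sequence, pos))
    return hits / len(positions)


def context_score(sequence, characteristics):
    """Weighted sum of motif frequencies.

    `characteristics` is a list of (motif, window, coefficient) triples.
    """
    return sum(coefficient * motif_frequency(motif, window, sequence)
               for motif, window, coefficient in characteristics)


# Toy example with made-up coefficients, not the values from the table.
print(context_score("TTTTACGT", [("TTT", (1, 4), 2.0), ("WS", (1, 8), -1.0)]))
```

Motifs that are enriched around true sites would carry positive coefficients and motifs that are depleted would carry negative ones, so a high score supports a candidate site.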
Human uracil DNA-glycosylase (E2F sites)
-1000
+1
1000
3000
5000
7000
9000
+ score of context
-1000
+1
1000
3000
5000
7000
ttTTTGCCGCGAAAag q=0.92 d=2.8 (known site)
9000
False negative (FN in percents) and false positive (FP sites per 1000bp) rates
for recognition of E2F sites.
[Plot: FP (y-axis, 0 to 1.4) versus FN (x-axis, 0 to 70) for PWM and for PWM + score of context; labelled points: 0.80 and 0.79]
Analysis of promoters of cell cycle-related genes
by E2F weight matrix
Comparison of frequencies of potential E2F sites in different promoter sets
High frequency of potential E2F sites near the transcription start site in promoters of cell
cycle-related genes.
[Plots: frequency of potential E2F sites (y-axes from 0 to 0.008 and from 0 to 0.025); series: Cell cycle-related genes, Other genes (EPD), Random sequences, Exons 2]
Identification of new E2F target genes
[Plot: x-axis from -650 to 350]
SITEVIDEO system
Building of E2F site recognition program (step 1)
SITEVIDEO system
Building of E2F site recognition program (step 2)
SITEVIDEO system
Building of E2F site recognition program (step 3)
Composite elements
ternary complex formation and stabilization of DNA-protein complexes
COMPEL:C00149
NF-ATp
.........
Mouse Interleukin-2
gene promoter
AP-1
tcagtgtatgggggtttaaAGAAATTCCagAGAGTCAtcagaagaggaaaaacaaa… …
-147
-164
Human Interleukin-2
gene promoter
AP-1
COMPEL:C00109
NF-ATp
ST
.......
ccacccccttaaagaaaggAGGAAAAAcTGTTTCAtacagaaggcgttaattgcatg… …
-283
-268
ST
Recognition method for
T-cell specific Composite Elements NFAT/AP-1
AP-1
NFATp
5’
..WRGAAAA.. ..TGASTCA..3’
8-12 bp
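A very reduced version of such a composite-element search can be written with the two consensus strings alone. The sketch below is not the recognition method of the slides, which works with weight matrices and a composite score; it only looks for a WRGAAAA-like word followed by a TGASTCA-like word on the forward strand. Reading "8-12 bp" as the distance between the start positions of the two sites is an assumption, and the function names and the `max_mismatches` parameter are hypothetical.

```python
# IUPAC codes as sets of allowed nucleotides (only those needed here plus N).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "AT", "R": "AG", "S": "CG", "N": "ACGT",
}


def mismatches(consensus, word):
    """Number of positions where `word` violates the IUPAC consensus."""
    return sum(base not in IUPAC[code] for code, base in zip(consensus, word))


def find_composite(sequence, nfat="WRGAAAA", ap1="TGASTCA",
                   min_offset=8, max_offset=12, max_mismatches=1):
    """Return (nfat_start, ap1_start) pairs, 0-based, on the forward strand.

    The offset is measured between the two start positions (assumption).
    """
    seq = sequence.upper()
    hits = []
    for i in range(len(seq) - len(nfat) + 1):
        if mismatches(nfat, seq[i:i + len(nfat)]) > max_mismatches:
            continue
        for offset in range(min_offset, max_offset + 1):
            j = i + offset
            if j + len(ap1) > len(seq):
                break
            if mismatches(ap1, seq[j:j + len(ap1)]) <= max_mismatches:
                hits.append((i, j))
    return hits


# Synthetic fragment with one exact NFAT-like and one exact AP-1-like word.
fragment = "CCCCC" + "AGGAAAA" + "CC" + "TGAGTCA" + "CCCCC"
print(find_composite(fragment, max_mismatches=0))
```

Because one of the two sites in a composite element can be rather weak, a practical search would score each site with its matrix rather than count consensus mismatches.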
NFAT matrix:
Position    1    2    3    4    5    6    7    8
A           5   12    2    0   26   25   25   15
C           5    1    0    0    0    0    1    5
G           8    2   23   26    0    1    0    2
T           8   11    1    0    0    0    0    4
NFAT = -log(1-scoreNFAT)
AP-1 matrix:
Position    1    2    3    4    5    6    7    8    9
A          19    4    4   36    3    0    2   47    2
C           3    2    2    4   13    0   44    0    8
G          16    5   33    2   29    0    0    0   24
T           9   36    8    5    2   47    1    0   13
AP-1 = -log(1-scoreAP-1)
6,7
5,7
4,7
3,7
NFAT/AP-1 (training)
Random
2,7
Composite score
 1.47 AP1  4.7

wCE  17,0   NFAT
 NFAT  0.88 AP1  3.5
1,7
0,7
0,7
1,2
1,7
2,2
2,7
3,2
3,7
4,2
4,7
Frequency of NFAT/AP-1 in genomic sequences
[Bar chart: frequency per 1000 bp (y-axis from 0 to 1) in promoters, introns and CDS; series: T-cell, Muscle, dbEST, Random]
Frequency of NFAT/AP-1 in promoters
[Plot: frequency (y-axis from 0 to 0.007) in muscle promoters and T-cell promoters over the position intervals > -900, [-900:-750], [-750:-600], [-600:-450], [-450:-300], [-300:-150], [-150:+1]]
Composite modules encode gene expression pattern
organ,
tissue,
cell
stage of
development
cell cycle
phase
extracellular
signals
Composite modules
w
(1)
1
s
( 2)
1
s
(1)
cut off
s
( 2)
2
( 2)
cut off
(k )
(k )
1 ... nk
q
q
 (1)
 ( 2)
...
C  max 
w
(k )
 q (w)
k 1, K
K - number of TF matrices
(k )
avr
s
... s
...
Start of
transcription
(k )
cut off
q
 (k )
...
Parameters of
the model to be
estimated
(k )
q
(
s
q (w)   i )
(k )
avr
i 1, nk
(k )
q ( si( k ) )  qcut
off
(k )
si w
Mutation, recombination and selection of the best genomes
G
g1
g2
g3
SELECTION
F
…….
41
27
3
MUTATION
0.9
0.9
0.5
0.9
0.8
RECOMBINATION
0.7
0.7
0.9
0.7
0.6
1
14
5
6
0.5
0.5
0.9
0.6
0.7
0.5
MULTIPLICATION
0.9
0.9
0.7
0.9
0.7
gn
4
Genetic Algorithm (GA)
Fitness function of the GA
F    FN    FP    T    N    AC
# promoters
FN – false negatives
T-test
FP – false positives
N
FN FP
T – T-test (difference
between mean values)
cms
N – normal likeness
AC – Akaike Information
Criterion
Composite module in promoters of
T-cell specific genes

Weight      q_cutoff    TF matrix
0.618300    0.923077    V$NFKB_Q6
0.162534    0.895279    V$OCT1_02
0.743705    0.965039    V$NFKAPPAB65_01
0.002359    0.788579    V$HOX13_01
0.928935    0.928569    V$NFAT_DWM_1
[Histogram: number of observations (No of obs, y-axis from 0 to 100) of the composite module score C over the k = 1, ..., 5 matrices, in bins from <= -0.2 to > 2.2; series: T-cell specific promoters, other promoters]
Composite module in promoters of
cell cycle-related genes

Weight      q_cutoff    TF matrix
1.000000    0.840072    V$E2F_19
0.954483    0.737637    V$TATA_01
0.888064    0.939687    V$CREB_01
0.816179    0.941583    V$SP1_Q6
0.039746    0.839702    V$TAL1BETAE47_01
[Histogram: number of sequences (y-axis from 0 to 40) of the composite module score C over the k = 1, ..., 5 matrices, on an axis from -0.5 to 4.0; series: Exon-2 sequences, Cell cycle-related promoters]
1
<------------V$IK1_01(0.86)
-----...V$CREBP1CJUN_01(0.85)
2
<-----------V$IK2_01(0.90)
-----...V$CREB_01(0.96)
3
----------->V$AP2_Q6(0.87)
<-------------V$GKLF_01(0.87)
4-->V$ATF_01(0.89)
<-------V$MZF1_01(0.99)
----...V$ELK1_01(0.87)
5
<-----------V$AP2_Q6(0.92)
<------------V$SP1_Q6(0.88)
6>V$AP1FJ_Q2(0.89)
<-------------V$GKLF_01(0.85)
7>V$AP1_Q2(0.87)
<-------------V$GKLF_01(0.86)
8->V$CREB_Q2(0.86)
<---------V$CETS1P54_01(0.90)
9->V$CREB_Q4(0.90)
<---------V$NRF2_01(0.90)
10
<-------------V$GC_01(0.88)
11
----------->V$CAAT_01(0.87)
12
<------------V$TCF11_01(0.87)
13
----------->V$AP2_Q6(0.87)
14
<---------V$USF_Q6(0.93)
16
--------...V$ATF_01(0.94)
17
-------...V$AP1FJ_Q2(0.95)
20
-------...V$CREBP1_Q2(0.93)
21
-------...V$CREB_Q2(0.95)
23
---...V$IK2_01(0.85)
MMCFOS_1
GAGCGCCCGCAGAGGGCCTTGGGGCGCGCTTCCCCCCCCTTCCAGTTCCGCCCAGTGACG
420
Mouse c-fos promoter
E2F composite module
(global context)
E2F flanking motifs
(local context)
1-->V$CREBP1CJUN_01(0.85)
-------------->V$BARBIE_01(0.86)
2-->V$CREB_01(0.96)
-------------->V$TATA_01(0.95)
3
----------->V$CAAT_01(0.91)
--------->V$AP4_Q5(0.95)
4----------->V$ELK1_01(0.87)
--------------------->V$HEN1_01(0.87)
5
--------->V$AP4_Q5(0.88)
<---...V$CMYB_01(0.93)
6
<---------V$CDPCR3HD_01(0.93)
--...V$VMYB_02(0.89)
7
<--------------V$TATA_01(0.88)
8
--------------------->V$HEN1_02(0.87)
9
<---------------------V$HEN1_02(0.86)
10
<-----------------V$AP4_01(0.88)
11
----------->V$LMO2COM_01(0.93)
12
<-----------V$LMO2COM_01(0.93)
13
<-----------V$MYOD_01(0.88)
17--->V$AP1FJ_Q2(0.95)
<---------V$AP4_Q6(0.99)
20---->V$CREBP1_Q2(0.93)
<---------V$MYOD_Q6(0.96)
21---->V$CREB_Q2(0.95)
Transcription start
23-------->V$IK2_01(0.85)
24
<----------- E2F (0.80)
MMCFOS_1
TAGGAAGTCCATCCATTCACAGCGCTTCTATAAAGGCGCCAGCTGAGGCGCCTACTACTC
480
1
<-----------------V$CMYB_01(0.91)
-------...V$ER_Q6(0.86)
2
<-----------V$LMO2COM_01(0.90)
<----...V$TCF11_01(0.87)
3
--------->V$MYOD_Q6(0.90)
-------->V$STAT_01(0.93)
4
--------->V$VMYB_01(0.89)
<--------V$STAT_01(0.89)
5--------------V$CMYB_01(0.93)
-------->V$LMO2COM_02(0.93)
6------>V$VMYB_02(0.89)
<-----------V$CAAT_01(0.85)
7
-------->V$VMYB_02(0.88)
8
-------------->V$EVI1_04(0.86)
9
------------->V$GATA1_02(0.93)
12
<------------V$ZID_01(0.85)
13
<----------V$CP2_01(0.97)
14
---------->V$GATA_C(0.92)
15
----------------->V$CMYB_01(0.86)
16
--------->V$CREL_01(0.91)
24
<----------- E2F (0.82)
MMCFOS_1
CAACCGCGACTGCAGCGAGCAACTGAGAAGACTGGATAGAGCCGGCGGTTCCGCGAACGA
540
MMCFOS_1
1----------->V$ER_Q6(0.86)
2--------V$TCF11_01(0.87)
3
--------->V$AP4_Q5(0.91)
4
--------->V$AP4_Q6(0.87)
5 ---------->V$AP1FJ_Q2(0.93)
6 ---------->V$AP1_Q2(0.90)
7 ---------->V$AP1_Q4(0.87)
8
<-----------V$IK2_01(0.94)
GCAGTGACCGCGCTCCCACCCAGCTCTGCTCTGCAGCTCC
580
Computationally predicted E2F target genes
confirmed by in vivo footprint
EMBL
Gene
Chromatin crosslinking
c-fos, Hs
HSFOS
JunB, Hs
HS207341
tgf-β1, Hs
HSTGFB1P
R
p14ARF, Hs
AF082338
Immunoprecipitation
Mcm4
(Cdc21), Hs
mcm5 (P1cdc46), Hs
PCR
Von HippelLindau
(VHL), Hs
B-myb, Hs
HSU63630
HS286B10
AF010238
HSBMYBD
NA
nucleolin,
Hs
nucleolin,
Cg
nucleolin,
Ms
HSNUCLEO
CSNUCLEO
MMNUCLEO
Score
,q
(+) aaGCTCGCGCCACTgc
(-) gcAGTGGCGCGAGCtt
(-) gtCTTCGCGCGCGCtc
Position rel.
start of
transcription
-165 .. -176
-92 .. –103
-90 .. –79
-78 .. –89
79 .. 90
91 .. 80
169 .. 158
-513 .. -502
-298 .. -287
28 .. 39
40 .. 29
85 .. 96
-1384 .. -1395
-1009 .. -1020
-739 .. -750
-589 .. -578
-265 .. -276
-491 .. -502
-409 .. -420
-377 .. -366
-175 .. -164
-93 .. -82
-187 .. -176
-175 .. -186
8 .. 19
20 .. 9
-270 .. -259
-258 .. -269
-28 .. 39
(-) gtCCTGGCGCGCGGgc
(+) cgCTTGGCGGGAGAta
-72 .. –83
-53 .. -42
0.83
0.87
1.18
-296 ->
+14 <-
(-) ttTTTGGCGCCGGCtg
(-) ccGTGGGCGCGCGGgt
-297 .. -308
-256 .. -267
0.97
0.81
2.91
-407 ->
-41 <-
(-) cgTTTGGCGCGGCTtg
-296 .. -307
0.97
6.67
-538 ->
-198 <-
(-) agTTTGGCGCGGCTtg
-306 .. -317
0.97
1.76
-531 ->
-232 <-
Sequence of the potential
sites
(-)
(-)
(+)
(-)
gcCTTGGCGCGTGTcc
ggGGTGGCGCGCGGgc
ccTCTGGCGCCACCgt
acGGTGGCGCCAGAgg
(+) gcTATCGCGCCAGAga
(-) tcTCTGGCGCGATAgc
(-) ggGCTGGCGCGGGCgg
(+)
(+)
(+)
(-)
(+)
ctGTTTGCGGGGCGga
ccCTTCGCGCCCTGgg
ctCTTGGCGCGACGct
agCGTCGCGCCAAGag
ccTTTGCCGCCGGGga
(-)
(-)
(-)
(+)
(-)
ctCTCCGCGCGCGGga
gtCTTGGCGACCGTtg
ggCCTGGCGCCGGAct
tgATTGGCGGATAGag
acTTTCCCGCCCTGtg
(-)
(-)
(+)
(+)
(+)
gtTTTCGCGGGAAAac
ctTTCAGCGCCCGTgc
gcAGTGGCGCCTCCcg
ggCGTGGCGCGGAGcc
ctTGTCGCGCAGGTac
(+)
(-)
(+)
(-)
agTTTCGCGCCAAAtt
aaTTTGGCGCGAAAct
ttTTTCCCGCGAAAct
agTTTCGCGGGAAAaa
0.92
0.84
0.88
0.83
0.89
0.91
0.82
0.80
0.91
0.93
0.83
0.85
0.81
0.81
0.81
0.83
0.86
0.93
0.82
0.80
0.83
0.86
0.99
1.00
0.89
0.93
0.81
0.84
0.92
Score of
context,
d
2.92
Positions
of PCR
primers
-201 ->
+96 <-
-27 ->
+313 <3.17
2.03
-122 ->
+210 <-
4.11
-404 ->
-143 <-
3.53
-667 ->
-330 <-
4.39
4.91
-211 ->
+88 <-
3.01
4.21
-137 ->
+123 <2.22
•Phylogenetic footprinting
Alignment of c-fos promoters
E2F
mouse
rat
hamster
man
ATGTTCGCTCGCCTTCTCTGCCTTTCCCGCCTCCCCTCCCCCGGCCGCGGCCCCGGTTCC
ATGTTCGCTCGCCTTCTCTGCCTTTCCCGCCTCCCCTCCCCCGGCCGCGGCCCCCGCTCC
ATGTTCGCTCGCCTTCTCTACCTTTCCCGCCTCCCCTCCCCCGGCCGCGGCCCCAGCTCC
ATGTTCTCTCTCATTCTGCGCCGTTCCCGCCTCCCCTCCCCCAGCCGCGGCCCCCGCCTC
****** *** * ****
** ******************* *********** *
*
mouse    CCCCCT----GCGCTGCACCCTCAGAGTTGGCTGCAGCCGGCGAGCTGTTCCCGTCAATC
rat      CCCCTT----GCGCTGCACCCTCAGAGTTGGCTGCAGCCGGCGAGCTGTTCCCGTCAATC
hamster  CCCCTCCCCCGCGCTGCACCCTCAGAGTTGGCTGCAGCCGGCAAGCAGTTCCCGTCAATC
man      CCCCC-----GCACTGCACCCTCGGTGTTGGCTGCAGCCCGCGAGCAGTTCCCGTCAATC
****
** ********** * ************* ** *** *************
mouse    CCTCCCTCCTTTACACAGGATGTCCATATTAGGACATCTGCGTCAGCAGGTTTCCACGGC
rat      CCTCCCTCCTTTACACAGGATGTCCATATTAGGACATCTGCGTCA---GGTTTCCACGGC
hamster  CCT---TTCC---CACAGGATGTCCATATTAGGACATCTGCGTCAGCAGGTTTCCACGGC
man      CCTCCCCCCTT-ACACAGGATGTCCATATTAGGACATCTGCGTCAGCAGGTTTCCACGGC
***
*
********************************
************
mouse    CGGTCCCTGTTGTTCTGGGGGGGGGACCATCTCCGAAATCCTACACGC-GGAAGGTCTAG
rat      CGGTCCCTGTTGTCCTGGGGGGA--ACCATCCCCGAAATCCTACATGC-GGAGGGTCCAG
hamster  CGGTCCTTGTAGACCTGGGGGTG--ACGATCCCCAAAATCCTACATGC-GGAGAGTCCAG
man      CTTTCCCTGTAGCCCTGGGGGGA--GCCATCCCCGAAACCCCTCATCTTGGGGGGCCCAC
* *** *** * *******
* *** ** *** ** **
**
* * *
mouse    GAGACCCCCTAAGATCCCAAATGTGAACA-CTCATAGGTGAAAGATGTATGCCAAGACGG
rat      GAGACCTTCTAAGATCCCAATTGTGAACA-CTCATAGGTGAAAGTTACAGACTGAGACGG
hamster  GAGACCCCCTAAGACCCCTATTGTGAACA-CAAATGGGTGAAAATTACATGTCAAGACGG
man      GAGACCT-CTGAGACAGGAACTGCGAAATGCTCACGAGATTAGGACACGCGCCAAGGCGG
****** ** ***
* ** ***
* *
*
*
** ***
mouse    GGGTTGAAAGCCTGGGGCGTAGAGTTGACGACAGAGCGCCCGCAGAGG-GCCTTGGGGCG
rat      GGGTTGAGAGCCTGGGGGCTAGAGTTGATGACAGGGAGCCCGCAGAGG-GCATTCGGGAG
hamster  AGGCGGGGGACCCGGGGCGCGGAGTTGACGCCAGGGCGGCCGCAGAAG-GCCTGGGGGCG
man      GGGCAGGGAGCTGCGAGCGCTGGGGACGCAGCCGGGCGGCCGCAGAAGCGCCCAGGCCCG
** *
*
* *
* *
* * * * ******* * **
*
*
mouse    CGCTTCCCCCCCC-------TTC-CAGTTCCGCCCAGTGACGTAGGAAGTCCATCCATTC
rat      CGCTTTCCCCCCTCCAGT--TTCTCTGTTCCGCTCA-TGACGTAGTAAG-----CCATTC
hamster  CGCGGCTCCCCTCCGTC---GCCACAGTTCCGCCCAGTGACGTGTAATGT----TCATTC
man      CGCGCCACCCCTCTGGCGCCACCGTGGTTGAGCCCG-TGACGTTTACAC-----TCATTC
***
****
*
*** ** * ******
*****
mouse    AC--AGCGCTTC-TATAAAGGCGCCAGCTGAGGCGCCTACTACTCCAACCGCGACTGCAG
rat      A---AGCGCTTC-TATAAAGCGGCCAGCTGAGGCGCCTACTACTCCAACCGCGATTGCAG
hamster  ACA-AGCGCTTC-TATAAAGGCACCGGCTGAGGCGCCTACTACTCCAACCGCGACTGCAG
man      ATAAAACGCTTGTTATAAAAGCAGTGGCTGCGGCGCCTCGTACTCCAACCGCATCTGCAG
*
* ***** ******
**** ******* ************
*****
CRE
Ets YY1 SRE
CRE/AP-1
E2F
SP-1 E2F
CRE
TATA E2F
CRE
E2F
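The conservation line under each alignment block marks columns that are identical in all species. A minimal Python sketch of that step is given here, assuming the rows are already aligned and of equal length. The function names `conservation_line` and `conserved_blocks` and the `min_len` threshold are illustrative, not part of any tool named in the slides.

```python
def conservation_line(rows):
    """Return a ClustalW-style line: '*' where all rows share the same non-gap base."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("aligned rows must have equal length")
    marks = []
    for column in zip(*rows):
        identical = len(set(column)) == 1 and column[0] != "-"
        marks.append("*" if identical else " ")
    return "".join(marks)


def conserved_blocks(line, min_len=6):
    """Return half-open (start, end) intervals of '*' runs of at least min_len columns."""
    blocks, start = [], None
    for i, ch in enumerate(line + " "):  # trailing space closes a final run
        if ch == "*" and start is None:
            start = i
        elif ch != "*" and start is not None:
            if i - start >= min_len:
                blocks.append((start, i))
            start = None
    return blocks


if __name__ == "__main__":
    # Toy rows, not taken from the c-fos alignment.
    rows = ["ACGT-ATTGACGTCA", "ACCT-ATTGACGTCA", "ACGTCATTGACGTCA"]
    line = conservation_line(rows)
    print(line)
    print(conserved_blocks(line, min_len=6))
```

Long conserved blocks found this way are the candidate regions in which to look for shared TF binding sites.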
Phylogenetic footprint
(human/mouse)
Spec1
Spec2
Phylogenetic footprint of the promoter of p53 gene
p53_human
ClustalW
alignment
p53_mouse
p53_human
p53_mouse
p53_human
Motif-based
re-alignment
p53_mouse
p53_human
p53_mouse
1
==========>V$AP1_Q4(0.91)
TTAGTATCTACGGCACCAGGTCGGCGAGAATCCTGACTCTGCACCCTCCTCCCCAACTCC
1
==========>V$AP1_Q4(0.91)
TTCCTGCTGAGGGCAACATCTCAGGGAGAATCCTGACTCTGCAAG----TCCCCGCCTCC
** *
* **** ** ** * ******************
***** ****
ATTTCCTTTGCTTCCTCCGGCAGGCGGATTACTTGCCCT
ATTTC--TTGC--CCTCAACCCACGGAAGGACTTGCCCT
***** **** ****
*
* * *********
60
56
99
91
1
==========>V$AP1_Q4(0.91)
2
<============V$SP1_Q6(0.88)
TTAGTATCTACGGCACCAGGTCGGCGAGAATCCTGACTCTGCAC-CCTCCTCCCCAACTC
59
1
==========>V$AP1_Q4(0.91)
2
<============V$SP1_Q6(0.90)
TTCCTGCTGAGGGCAACATCTCAGGGAGAATCCTGACTCTGCAAGTCCCCGCCTCCATTT
60
** *
* **** ** ** * ******************
* ** ** * * *
CATTTCCTTTGCTTCCTCCGGCAGGCGGATTACTTGCCCT
C-TT--------GCCCTCAACCCACGGAAGGACTTGCCCT
* **
****
*
* * *********
99
91
New human/mouse conserved SP-1 sites were found
Phylogenetic footprint of 5’ regulatory
region of Xist gene
human
horse
mouse
M.subarv
IV
III
II
*
**
* **** ***** *
** **** ** ****
*** * ***
*
CATAGTTAAAAAATTACAAACAGGTCACAAACCAGTACTCTTTCTTGATTATTTAGGAACCAAATAGCCATTCTATGAAATGTCTTCCTTTCC
CGCAGTTTAAAACTTACAAACAGGTCAAAAACAG-------TACTCGATTATTTCGGGGCCAAATTGGCATTCTGTGAAATGCCTTCCTTTCC
ATGAGCGTAAGCCCTCCAAATCGGTCACAAC------TAATACTCTGATAATTTAGGAACCAAGGAGCCATTTTGTGAGGCATTTCTACCCTT
CTGTGCGCAATCAGTACAAATAGGTCACAGCCAA---TAATACCCTAATAATTTAGGAACCAAGGAACGATTTTGTGAAGCACCTCTTCTTTT
|||||
||| |.||
|||||| ||
RGGTCAnnnTgacy ER
rTtnnGmAAt C/EBP
wwTTGTTww SRY
| ||||
|||||| |
TgaGTCA AP-1
rrCCAATs CCAAT box
|| .|||||
WAWnnAGGTCA RAR
TF binding sites in the distal conserved region of the XIST 5' sequence:
overlapping binding sites for ER (estrogen receptor), AP-1 (c-fos/c-jun)
and RAR (retinoic acid receptor); sites for C/EBP factors and a
potential CCAAT box; sites for the SRY transcription factor (sex-determining
region Y gene product).
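Consensus strings such as RGGTCAnnnTgacy or rrCCAATs use IUPAC ambiguity codes. A minimal Python sketch of scanning a sequence with such a consensus follows. It assumes that lower-case letters can be treated like upper-case ones and that only the given strand is searched; both are simplifications, and the function names are illustrative.

```python
import re

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def consensus_to_regex(consensus):
    """Translate an IUPAC consensus into a regular expression (case is ignored)."""
    return "".join(IUPAC[ch] for ch in consensus.upper())


def find_sites(consensus, sequence):
    """Return 0-based start positions of all matches, including overlapping ones."""
    # A lookahead makes the regex engine report overlapping occurrences.
    pattern = re.compile("(?=" + consensus_to_regex(consensus) + ")")
    return [m.start() for m in pattern.finditer(sequence.upper())]


if __name__ == "__main__":
    print(find_sites("TgaGTCA", "CCTGAGTCATT"))   # AP-1 consensus from the slide
    print(find_sites("rrCCAATs", "AGCCAATC"))     # CCAAT box consensus from the slide
```

A consensus search of this kind is stricter than a weight-matrix search; scanning the reverse complement as well would be needed to find sites on the other strand.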
Methods to detect protein-DNA interactions
ChIP-chip approach
(chromatin immunoprecipitation – chip analysis)
Robine et al., pbil.univ-lyon1.fr/events/jobim2005/proceedings/P126Robine.pdf
Composite module on flanks of HNF-4
functional binding sites
500bp
HNF-4
Matrix_ID(1)
cut-off(1) Matrix_ID(2)
cut-off(2) dmin
V$MAZ_Q6
0.89
V$ER_Q6
0.913
V$HEB_Q6
0.969
V$HNF4_Q6_01
0.976
V$HEN1_02
0.854
V$CREB_Q2
0.888
V$HNF4_Q6_01
0.8325 V$EFC_Q6
0.6825
V$COUP_01
0.8005 V$KROX_Q6
0.8315
V$PEBP_Q6
0.84 V$TEL2_Q6
0.878
V$ELK1_01
0.785 V$WHN_B
0.948
V$CMYB_01
0.86 V$KROX_Q6
0.841
V$FOXO1_02
0.8715 V$FXR_Q3
0.8135
V$HNF4_Q6_01
0.8065 V$HNF4_01
0.8705
V$XBP1_01
0.8845 V$FOXO1_02 0.8715
Intercept
500bp

dmax
8
8
8
8
8
8
8
8

100
100
100
100
100
500
200
200
4
4
4
4
4
4
2
2
2
2
2
2
2
2
0.020763
0.047177
0.078905
0.210340
0.099368
0.086618
0.043344
0.053285
0.214469
0.111909
0.100922
0.100184
0.080381
0.112402
-0.098626
[Figure: histogram of composite module score (x-axis from -0.098626 to 1.031299; y-axis: No of sites) for two sets, HNF4 sites (+/-500bp) and genome PWM matches (+/-500bp), with fitted normal curves Var1 = 70*0.0942*normal(x, 0.4991, 0.2237) and Var2 = 643*0.0942*normal(x, 0.05, 0.1342)]
Analysis of ChIP-chip data on HNF-4
from Odom et al. (2004)
[Figure: scatter plot of local context (y-axis, -0.2 to 1.8) against global context (x-axis, 0 to 1); series: H13K_noHNF4, Selected]
Composite module in different promoter functional classes
Promoter class | TF factors selected | Score
Cell-cycle related | E2F (1.00), TATA (0.95), CREB (0.88), Sp-1 (0.81) | 7.2
Brain enriched | BRLF1 (0.192), ATF (0.038), CREB (0.450), Sp-1 (0.592), HFH2 (1.00) | 3.8
Muscle-specific | Tal-1 (0.50), YY-1 (1.0), Oct-1 (0.40), MyoD (0.80), SRF (1.0), PAX5 (0.80) | 5.2
Immune cell specific | COMP1 (0.024), STAF (0.017), NF-kB (1.30), NFAT (0.957), Brn-2 (0.059) | 6.6
Erythroid specific | n-myc (0.31), GR (0.08), AP-4 (1.00), RREB-1 (0.08), v-Maf (0.08) | 2.0
Liver enriched | RORalpha1 (1.00), Sp-1 (0.03), SREBP-1 (1.00), HNF-1 (0.54), ER (0.07), GATA-1 (0.03) | 2.6
Housekeeping | Egr-2 (0.15), AhR/Arnt (0.72), ZID (0.94), Elk-1 (0.79), NRF-2 (0.54), CREB (0.62) | 7.2
A decision tree method for classification of
promoters based on combinations of TF binding sites
[Figure: decision tree with yes/no branches. Split nodes: ER (F>0.26), MyoD (F>0.2), NF-AT (F>0.8), Nkx-2.5 (F>0.6), E2F + SRF (F>0.8), Oct-1 (F>0.3). Leaves: Muscle-specific 44%, Liver-enriched 51%, Immune cell specific 54%, Housekeeping 20%, Cell cycle related 65%, Brain-enriched 34%, Erythroid-specific 70%]
CYTOMER®
Hierarchical
representation of
anatomical
(sub)structures in the
Organ table of
CYTOMER
Human DNA sequence from clone RP1-102D24 on chromosome 22
[Figure: cell cycle regulatory potential and promoter potential (y-axis 0 to 1.2) along the sequence (positions 0 to 130000); labelled features: composite modules, start of transcription, and a novel mitosis-specific chromosome segregation protein, SMC1-like protein]

Cell cycle regulatory potential:

$$CP(i) = \sum_{k=i-LS}^{i+LS} W_k, \qquad W_k = \begin{cases} C(k), & \text{if } C(k) \ge C_{\text{cut-off}} \\ 0, & \text{otherwise} \end{cases}$$

$$C = \max_{w} \sum_{k=1}^{K} \varphi^{(k)} \, q_{avr}^{(k)}(w)$$

$$q_{avr}^{(k)}(w) = \sum_{\substack{i=1,\dots,n_k \\ q(s_i^{(k)}) \ge q_{\text{cut-off}}^{(k)} \\ s_i^{(k)} \in w}} q(s_i^{(k)})$$

Here $s_1^{(k)}, \dots, s_{n_k}^{(k)}$ are the sites found for TF matrix k, w is the window, C(k) is the score C at sequence position k, and K is the number of TF matrices.

Parameters of the model to be estimated: $q_{\text{cut-off}}^{(k)}$, $\varphi^{(k)}$

Weight | q cut-off | TF matrix
1.000000 | 0.840072 | V$E2F_19
0.954483 | 0.737637 | V$TATA_01
0.888064 | 0.939687 | V$CREB_01
0.816179 | 0.941583 | V$SP1_Q6
0.039746 | 0.839702 | V$TAL1BETAE47_01

LS = 5000
Ccut-off = 0.9
Promoter recognition matrix
[Table: promoter recognition matrix]

$$\text{Promoter potential} = \sum_{m=1}^{M} \left( \sum_{k=1}^{K_m} \omega_{m,k} \right)$$

M – number of promoter regions
K_m – number of found sites in the region m
\omega_{m,k} – weight of the site k in the region m
Regulatory potential for mouse Xist gene
[Figure: regulatory potential (y-axis 0 to 3) along the mouse Xist locus, shown on two position scales (0 to 60000 and 0 to 65000); labels on the map: P0, P1, P2, Ex-1 to Ex-8, A2, BC, D, E, R, CH, pS12, pS19X, pMKK2, NLAR, CpG, S/MAR, MIR, 17-mer, 34-mer, TSIX]
Clusters of immune-cell specific NF-AT/AP-1 composite elements
a) Human IL-4 (HSIL4A): Cluster (5: 399bp); map positions 1 to 9000, exons ex1 to ex4
b) Human prointerleukin 1 (HSIL1B): Cluster (4: 228bp); map positions 1 to 9000, exons ex1 to ex7
c) Human DNA sequence from PAC 272J12 on chromosome 22q12-qter (HS272J12): Cluster (3: 76bp); map positions 81000 to 87000
The task is to reveal statistically significant composite clusters
of TF binding sites.
Andreas Wagner: Genes regulated cooperatively by one or
more transcription factors and their identification in whole
eukaryotic genomes.
Revealing statistically significant composite clusters
[Figure: sites of several types within a window]
P(1,1,1) = 0.1
P(2,2,1) = 0.0001
P = P(1,1,1) +
P(2,1,1) + P(1,2,1) + P(1,1,2) +
P(3,1,1) + P(2,2,1) + P(2,1,2) + P(1,1,3) + P(1,2,2) + P(1,3,1) +
...
The probability of finding a cluster of n or more sites (of m types)
within a window of length w:

$$P(n) = \sum_{\{k_i\} \in \Omega} P(k_1) P(k_2) \cdots P(k_m)$$

$$\Omega = \{k_1, \dots, k_m \mid k_1 \ge K_1, \dots, k_m \ge K_m;\ k_1 + \dots + k_m \ge n\}$$

K_i – constraint on the presence of sites of type i (at least K_i sites).

$$P(k_i) = e^{-\lambda_i w} \, \frac{(\lambda_i w)^{k_i}}{k_i!}$$

where \lambda_i is the frequency of sites of type i per nucleotide.
A form that is easier to calculate:

$$P(n) = \sum_{k_1 \ge K_1} P(k_1) \cdot \sum_{k_2 \ge K_2} P(k_2) \cdots \sum_{k_m \ge K_m} P(k_m) \; - \sum_{\substack{k_i \ge K_i \\ k_1 + \dots + k_m < n}} P(k_1) \cdot P(k_2) \cdots P(k_m)$$
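The "easier form" takes the product of the per-type tail probabilities and subtracts the finitely many combinations whose total number of sites stays below the required count. A minimal Python sketch of that calculation follows, assuming independent Poisson counts per site type. The function names and the per-nucleotide densities passed in `lams` are illustrative.

```python
import math
from itertools import product


def poisson_pmf(k, lam, w):
    """Probability of exactly k sites in a window of length w at density lam."""
    mean = lam * w
    return math.exp(-mean) * mean ** k / math.factorial(k)


def poisson_tail(k_min, lam, w):
    """Probability of at least k_min sites in the window."""
    return 1.0 - sum(poisson_pmf(k, lam, w) for k in range(k_min))


def cluster_probability(n, lams, k_mins, w):
    """P(total >= n sites, with at least k_mins[i] sites of each type i)."""
    # First term: every type meets its own minimum, with no limit on the total.
    full = 1.0
    for lam, k_min in zip(lams, k_mins):
        full *= poisson_tail(k_min, lam, w)

    # Second term: combinations that meet the minima but have fewer than n sites.
    below = 0.0
    ranges = [range(k_min, n) for k_min in k_mins]
    for ks in product(*ranges):
        if sum(ks) < n:
            term = 1.0
            for k, lam in zip(ks, lams):
                term *= poisson_pmf(k, lam, w)
            below += term
    return full - below


if __name__ == "__main__":
    # Toy densities: one site per 1000 bp for each of three types, 300 bp window.
    print(cluster_probability(5, [0.001, 0.001, 0.001], [1, 1, 1], 300))
```

The subtraction loses precision when the result is very small, so P-values near 1e-16 would need a direct summation of the upper tail or higher-precision arithmetic rather than this form.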
Some sites tend to occur together because their binding patterns are
similar. This makes the distribution deviate from the Poisson law.

$$C_{AB} = \frac{N(A \cap B)}{N(A)} \approx P(B \mid A)$$
Examples of C_AB values:
V$AP1_C - V$AP1_C = 0.56
V$SRF_C - V$YY1_01 = 0.37
V$USF_Q6 - V$USF_Q6 = 0.34
V$HNF1_01 - V$AP1_C = 0.16
V$HNF4_01 - V$GR_Q6 = 0.15
V$NFY_Q6 - V$CEBPA_01 = 0.12
V$OCT_C - V$HNF3B_01 = 0.11
V$CEBPA_01 - V$CEBPA_01 = 0.10
[Figure: two diagrams of sites a and b, for the cases Cab > Cba and Cab < Cba]
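The coefficient C_AB = N(A ∩ B) / N(A) can be estimated from lists of site positions. The Python sketch below assumes that "A ∩ B" means an A site with a B site starting within `max_dist` nucleotides; that distance criterion, the function name and the `same_type` flag are assumptions of this sketch, since the slides do not define the overlap rule.

```python
def cooccurrence(sites_a, sites_b, max_dist, same_type=False):
    """Estimate C_AB: the fraction of A sites that have a B site nearby.

    sites_a, sites_b -- start positions of the sites of type A and type B
    max_dist         -- largest distance (bp) still counted as co-occurrence
    same_type        -- set True when A and B are the same matrix, so that
                        a site is not counted as its own neighbour
    """
    if not sites_a:
        return 0.0
    hits = 0
    for a in sites_a:
        for b in sites_b:
            if same_type and a == b:
                continue
            if abs(a - b) <= max_dist:
                hits += 1
                break  # one neighbour is enough for this A site
    return hits / len(sites_a)


if __name__ == "__main__":
    a_sites = [10, 50, 200, 300]   # toy positions, not from the slides
    b_sites = [12, 205]
    print(cooccurrence(a_sites, b_sites, 5))   # C_AB
    print(cooccurrence(b_sites, a_sites, 5))   # C_BA
```

The measure is not symmetric: in the toy data every B site has an A site nearby, but only half of the A sites have a B site nearby, which corresponds to the case Cab < Cba.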
Chromosome 21. Length = 33*10^6 bp; window = 300 bp.
Some examples
1. Homo-type - P=2.2e-16 (5288100,5288400) Number of sites: 28. Classes - V$HNF3B_01-28
2. Hetero-type - P=5.0e-11 (4575600,4575900) Number of sites: 16. Classes - V$MEF2_02-1, V$HNF4_01-1, V$MYB_Q6-1, V$AP1_C-1, V$USF_Q6-2, V$YY1_01-1, V$GATA1_04-2, V$CEBPA_01-1, V$NFY_Q6-1, V$CREBP1_Q2-1, V$GR_Q6-3, V$NF1_Q6-1.
3. Few types only - P=1.2e-17 (28848900,28849200) Number of sites: 21. Classes - V$EGR1_01-2, V$GC_01-9, V$GR_Q6-10
V$HNF3B_01
V$CEBPA_01
V$MEF2_02
V$GATA1_04
V$AP1_C
V$HNF4_01
V$OCT_C
V$MYB_Q6
V$YY1_01
[Figure: GC-content (y-axis 0,23 to 0,63) along positions 13 890 000 to 13 945 000; tracks: LocusLink_Ge, Cluster300, Cluster500; labelled gene: STCH; one cluster region is marked "? Potentially a new gene"]
Normalized frequencies of cluster distribution within promoters,
exons, and entire genes.
[Figure: bar chart (y-axis 0 to 7); categories: Promoters10000, Promoters2000, Promoters1000, Promoters300, Genes2000, Exons; series: Cluster500, Cluster300]
URLs for main resources mentioned:
http://www.gene-regulation.de
http://www.biobase.de
http://www.hnbioinfo.de
http://compel.bionet.nsc.ru