Regulatory sequence analysis tools and approaches

Alexander Kel
BIOBASE GmbH, Halchtersche Strasse 33, D-38304 Wolfenbuettel, Germany
[email protected]
www.biobase.de

F(g) = E(g) A(g p)

Gene functional role
Gene expression profile: organ, tissue, cell; stage of development; cell cycle phase; extracellular signals
Protein specific activity (as enzyme or structural or regulatory protein)

Where? Organ, tissue, cell
When? Stage of development
How? Cell cycle phase, extracellular signals
With whom?

Organ, tissue, cell; stage of development; cell cycle phase; external signals, conditions

Collecting bits of information about regulation of gene expression through transcription factor binding sites

Mouse p53 tumor suppressor gene

+1 to +216: enhancer
• +3 to +19: NF-1
• +35 to +51: p53
• +53 to +69: NF-kappaB
• +57 to +72: ETF
• +65 to +79: E-box
-225 to +1: promoter 1
• -195 to -170: p53
• -68 to -53: AP-1
-320 to -225: negative regulatory element

Expression level: low; maximal level; low; none; high; medium; very low; high; induction; induction; induction; induction
Cell cycle phase: G1; G1/S, S; G2; G0
Organ, tissue, cell type: heart, liver; heart; heart (terminally differentiated cardiomyocytes); spleen, thymus; proliferating fibroblasts; lymphocytes
Stage of development: embryo; at birth; adult
Extracellular signal: TPA; serum; mitogenic induction; TNF-α; UV radiation; ?
A    C    G    T    Consensus
9    8    4    8    N
2    3    2    22   T
1    1    2    25   T
0    1    2    26   T
1    13   15   0    S
0    3    26   0    G
0    29   0    0    C
0    0    29   0    G
0    22   7    0    C
1    8    17   3    S
15   9    3    2    M
13   4    9    3    R
7    8    8    6    N
13   1    7    8    D

$$q = \frac{\sum_{i=1}^{l} I(i)\, f(b_i, i) - \sum_{i=1}^{l} I(i)\, f_{\min}(i)}{\sum_{i=1}^{l} I(i)\, f_{\max}(i) - \sum_{i=1}^{l} I(i)\, f_{\min}(i)} \qquad (1)$$

$$I(i) = \sum_{b \in \{A,T,G,C\}} f(b, i) \ln(4 f(b, i)) \qquad (2)$$

TFBS identification via pattern search

Phylogenetic footprint of promoter regions of nucleolin genes

1 <===========V$CREB_02(0.85)
=============================================================================
2 <=======V$CREB_01(0.82)
MMNUCLEO GGCCCGCTCATCAGCCCGAGGGAACCCTAGG--CC------TTCCGGCGTTCT------423
MMNUCLEO TCTCCCCAC-CACACCAGGAAGTCACCTCTCTCA----------ACCTG---GAGTTATA 225
RNNUCIA1 GGCCCACTAAACGGCCCGAATGAACTCTAGG--CC------TTCCGGCGCTCT------435
1 <===========V$CREB_02(0.85)
CSNUCLEO GGCC-GCGAGCTGGCCCCAGTGG-CTCTAGG--CCCTCAACTTCCGGCGCTCTCCGGCTC 450
2 <=======V$CREB_01(0.82)
HSNUCLEO TGCCTCCAAAAGGGCCAACGGGAACTCCGCGGTCCCTGAACTTCCGGTGCTGGAGG---A 448
RNNUCIA1 TCTCCCACCACACACCAGGAAGTCACCTCTCTGA----------ACCTG---GAGTTATA 221
*** * *** * * * * ** ****** * *
1 <===========V$CREB_02(0.85)
=============================================================================
2 <=======V$CREB_01(0.82)
MMNUCLEO -TCAGCAGGACCACGCGGCG---------------------------------------442
CSNUCLEO CCTCC-AGCACACACCAGGAAGTCACCTCTCCGAGACCGTCCCCATCAG---GAGTTAAA 229
RNNUCIA1 -CCAGCTCTTCAGCGCGGCGAACGTTCTAGGCCCCTGAGAAGTCCACCGGGAGGCGCAGG 494
1 <===============V$TH1E47_01(0.85)
CSNUCLEO CTCAGCGGGAACGCGCGGCGAGCAGTTGAGGCCGCCGCGGATTCCAACGGGTTGGGGACG 510
HSNUCLEO TGGCCCTGT-GAGGCCAGAAAGTTACTTCTCCGAGGCCAGTTCCCCATGTCTGAGAAATA 229
HSNUCLEO CTCCTCGCTCCAGGGCCACCAGGAGCCGCGGC---------------------GTGAGTG 487
** * **** **** ** **** * * *** * * * * ** *
=============================================================================
=============================================================================
MMNUCLEO --------------GGGGGAAA-----GCACCGAGAAACGCCCAGACCACCTGAGCATCG 483
1 <==========V$DELTAEF1_01(0.82)
RNNUCIA1
TTTCCGCTACGCGAGGGGGAAA-----TCCCCGAGAAATGCCCAGACCACCTAAGCACAG 549 MMNUCLEO CCTACCG-CGAGAGGTCACCGACATTACATGGATCGCTTGTGCACTGCTCGTA--CACAC 282 CSNUCLEO TTCGC----AGCGCGGGGGATGCTCGGGCCACCCACCACCCCCCCACCCCCCCGGCCACG 566 1 <======== ==V$DELTAEF1_01(0.87) HSNUCLEO CGTGCCGGAACCGAGGGCGGGG-----TCTCTGAGGAACTCCAAGGCTGCCCAAGCCTAC 542 RNNUCIA1 CCTACCG-CGTGAGGTCA--GAGATTAAATGGACTGTTTGTGCACTGCTCACA--CACAC 276 *** * * * ** * ** ** 1 <======== ==V$DELTAEF1_01(0.84) ============================================================================= CSNUCLEO TCTACCG-CGCGAGGTTG--GACATTAAGCGAGCTGTTTGAGCACTGCACACAGGCGCGC 286 MMNUCLEO CCGCCC--------ATGCTGCCTCGGAACACCTGAGGGAATCCGGGCCACGCCGCCACCT 535 1 <========= =V$DELTAEF1_01(0.84) RNNUCIA1 ACGTCC--------ATGCGGCGTACGGATACCTGAGGGAATCCGGGCCATACCGCCACCT 601 HSNUCLEO TCTCCCAACTTGAGGTTCT-GTGGGGTAGGGGAGGGTTCGTGACTTTCTCACAGAAAACC 288 CSNUCLEO AGGCCCGGAGCTCCAGGTAGCAGTGCAGCACTAGGCGGCGTCCGGGCCACGCCGCCCAAT 626 ** ** * ***** * * * * * * * * * * * HSNUCLEO GGACCC---------AGCCACATTGGCGAACC----GGAGACCGCCCGATTCCACCACC588 ============================================================================= ** * * ** ** *** * * ** ** 1 <=======V$NKX25_02(0.84) 2 =========>V$CETS1P54_01(0.87)============================================================================= 1 <=======V$E2F_02(1.00) MMNUCLEO ACACACGCAC------------AACTGCTTTTATTAGGAGCT----CTCAGGAAAGCGGG 326 MMNUCLEO ACCCGCG--CCTCACACACAAGCCGCGCCAAACTCGCCCGTCCCACTGCGCAGGCGTGGG 593 1 <=======V$NKX25_02(0.84) 1 <=======V$E2F_02(1.00) 2 =========>V$CETS1P54_01(0.87) RNNUCIA1 ACTCGCG--CCTCACTC--AAGCCGCGCCAAACTCGCGCGTTTCACTGCGCAGGCGTGTA 657 RNNUCIA1 ACACACGCGCGCGCGCGCGCGAAATTGCTTTTATTAGGAGCT----CTCAGGAAAGTGGT 332 1 <=======V$E2F_02(1.00) 1 =======>V$NKX25_02(0.82) TCCCCCGAGCCCCTTCCACAAGCCGCGCCAAACGGGTCTG---CACCGCGCAGGCG--GC 681 2 <==========V$DELTAEF1_01(0.81)CSNUCLEO 1 <=======V$E2F_02(1.00) 3 =========>V$CETS1P54_01(0.84) HSNUCLEO -CCCGCGCTCCCCTCAC--AGCCGGCGCCAAAAACGCCAGTCCCACGACGCAGGC----640 CSNUCLEO 
ACACACGCACGC----------AACTGCCTTTATTGGGAGCTGTCTCTCAGGAGAACAGC 336
* * ** ** * * * * ******** * * *** *******
1 <=======V$NKX25_02(0.83)
2 <==========V$DELTAEF1_01(0.81)
3 =========>V$CETS1P54_01(0.86)
HSNUCLEO TCGTACAGACCC-------CGCCACTGCCTTTATTAACAGCT----CTCAGGAGACTGCC 337
* ** * * *** ****** **** ******* *
=============================================================================
MMNUCLEO GACTCGCATCA---TAGCCAAG----AAGCCGTTCGCGAC-TCCGCGGAGAACAGGCCGA 378
RNNUCIA1 GGCTCGCATCAGGCTACCACAGCC--AAGAGGACCGCCACCTCTACCGAGGGCAGGCCAA 390
CSNUCLEO GGCCCGCGGCGCAACACTAGAGCCCCGGGATGTTCTCGGC-TCTGCCGAGGGCAG-CCGA 394
HSNUCLEO TGCAGGAGGGGGGTCGCTCCGGCC---CCATGCTCGCGGG-CAAGCAGGGATAAG--CTG 391
* * * * * * * * * ** *

HSNUCLEO - Homo sapiens; CSNUCLEO - Cricetulus griseus; MMNUCLEO - Mus musculus; RNNUCIA1 - Rattus norvegicus

Gibbs sampling Algorithm

[Figure: Gibbs sampling algorithm, steps 1) to 3), shown on sequences of A, T, G, C]

[Figure: Jun and Fos bind the AP-1 site TGASTCA. Human TNF promoter: NFAT, AP-1 (-107), NFAT (-74), NF-kB, VDR, AP-1, C/EBP; cell types: mast cells, T-cells, dendritic cells, T-cells + ?]

Functional of Averaged Density

Weight sum: $S = \sum_{\omega \in \text{sample}} k(\omega)$
Kernel volume: $V = \left( \sum_{\omega \in \text{sequence space}} k^{1+h}(\omega) \right)^{1/(1+h)}$
Averaged density: $\Phi = S / V$, with $k^{h}(\omega) \sim p(\omega)$

Condition for Maximum of Averaged Density

The main properties of the functional of averaged density $\Phi_{0,h}(k)$ are represented as 3 theorems.

Theorem 1: The functional of averaged density $\Phi_{0,h}(k)$ reaches its maximum with respect to the kernel $k(\omega)$ that satisfies the equation

$$k(\omega) = c\, p^{1/h}(\omega)$$

where c is an arbitrary normalization factor.

This theorem tells us that we can get an accurate estimate of the probability function $p(\omega)$ by maximizing $\Phi_{n,h}(k)$ as $n \to \infty$. In this case $p_n(\omega) = \mathrm{const} \cdot \{k_n(\omega)\}^{h}$.

Theorem 2: Let the averaged density functional $\Phi_{n,h}(k)$ reach its maximum with respect to k at the kernel $k_h(\omega) = c_h\, p_h^{1/h}(\omega)$, with $k_h \in K$ and $p_h \in P$. Then the log-likelihood function $L_n(p_h)$ has a limit as $h \to \infty$, and it equals

$$\lim_{h \to \infty} L_n(p_h) = \sup_{p \in P} L_n(p)$$

This theorem tells us that as $h \to \infty$ the method of averaged density functional maximization becomes similar to the method of likelihood maximization.

Theorem 3: Let $p(\omega)$ be the probabilities of sequences $\omega$. Let $p_n(\omega)$ be some estimates of the probabilities satisfying the equation

p ( ) k ( ) p ( ) k ( ) i n i n i i i n i

where $k(\omega) = c\, p^{1/h}(\omega)$ and $k_n(\omega) = c\, p_n^{1/h}(\omega)$. The following relation is true:

1 h p(i ) pn (i ) kn (i ) 8 h 0 h (k ) h i 1 h 1 0 h (kn ) max p ( ) k ( ) , p ( ) k ( ) i i n i n i i i 2

This theorem tells us that we can expect more accurate estimates of probabilities for sequences with higher kernel weights. At the least, the theorem establishes an upper bound for the accuracy of estimation.

The problem is that we maximize the empirical functional $\Phi_{n,h}(k)$, not $\Phi_{0,h}(k)$. If the family of probabilities (respectively, the family of kernels) is too rich, the value of $\Phi_{n,h}(k)$ may differ significantly from $\Phi_{0,h}(k)$. But this is a problem of the maximum likelihood method as well.

Model for Independent Distribution of Symbols

$R_L$ is the distance from the sequence L to the given local consensus;
$\rho_{jl}$ is the distance coefficient for letter l in the j-th position;
$s_{jl}$ is the weighted sum for letter l in the j-th position;
$L^*_{jl}$ is the set of all sequences from the selection in which letter l is situated in the j-th position.

$$R_L = \sum_{j} \rho_{jl} \qquad (1)$$

where l is the letter in the j-th position of sequence L.

$$s_{jl} = \sum_{L \in L^*_{jl}} e^{-R_L / h} \qquad (2)$$

$$s_{jl_0} = \max_{l} (s_{jl}) \qquad (3)$$

$$\rho'_{jl} = \ln\left( \frac{s_{jl_0}}{s_{jl}} \right) \qquad (4)$$

1. Initialisation of the algorithm by setting the initial values $\rho_{jl}$. For that we select a sequence L and set $\rho_{jl} = 0$, where l is the letter in the j-th position of sequence L. All other values of $\rho_{jl}$ are set to 1.
2. Calculation of the distance $R_L$ (1).
3. Calculation of the partial sums $s_{jl}$ (2).
4. Determination of the maximal values $s_{jl_0}$ for every position (3).
5. Calculation of the new values $\rho'_{jl}$ (4).

Testing of the kernel method of motif finding. A mixture of CREB and AP-1 sites was analyzed.
The kernel method revealed the two original motifs, whereas CONSENSUS-V6C.1 and Gibbs sampling were not able to reveal two different patterns. Each of them revealed only one pattern, which presents a mixture of the original two.

Table 1. Weight matrices revealed with the kernel method (smoothing parameter h = 1.2).

Weight matrix 1 (113 sequences contain this motif)
A     G     C     T     Consensus
15    14    6     65    T
18    55    5     22    G
51    4     34    11    A
0     84    1     15    G
0     0     5     95    T
4     2     94          C
100   0     0     0     A
6     36    27    31
2     0     98    0     C
100   0     0     0     A

Weight matrix 2 (73 sequences contain this motif)
A     G     C     T     Consensus
12    8     5     75    T
1     77    3     19    G
75    10    5     10    A
7     5     88    0     C
23    62    15    0     G
0     0     3     97    T

Table 2. The optimal weight matrix (153 sampled words) resulting from a run of the program CONSENSUS-V6C.1 (Hertz and Stormo, 1999).
A     G     C     T     Consensus
9     29    8     54    T
39    40    4     17    G/A
37    11    40    12    C/A
0     87    8     5     G
0     0     0     100   T
5     0     95    0     C
100   0     0     0     A
5     26    27    42    T

Table 3. Weight matrix obtained with Gibbs sampling (Lawrence et al., 1993).
A G C T Consensus 36 42 43 93 82 36 52 T G/A C/A 85 G 92 T C A 39 T

[Figure: simulated data set of 100 sequences, each 100 bp long, with an implanted 10 bp site]

$$D = \sum_{j,l} \left( p^{\text{implanted}}_{jl} - p^{\text{calculated}}_{jl} \right)^2$$

Result of the comparison of four different pattern discovery programs on sets of simulated sequences with implanted TF binding sites for one matrix. y-axis: the averaged sum of squared differences between the revealed matrix and the original one; x-axis: the probability of the "consensus nucleotide" in each position of the matrix.

[Figure: plot with y-axis from 0.000 to 1.000 and x-axis from 0.65 to 0.95; series: Kernel, MEME, CONSENSUS, GIBBS]

Table 1. Comparison of the 3 programs performing best for low values of the consensus nucleotide probability.
                 0.65    0.7
Kernel           0.205   0.165
MULTIPROFILER    0.208   0.255
PROJECTION       0.260   0.304

[Figure: simulated data set for two matrices, X1 and X2, at consensus nucleotide probability 0.7: 100 sequences, each 100 bp long, with implanted 10 bp sites]

Result of the comparison of four different pattern discovery programs on sets of simulated sequences with implanted TF binding sites for two matrices. y-axis: the averaged sum of squared differences between the two revealed matrices and the two original ones; x-axis: 4 different variants of matrices, where the first variant is the pair of most different matrices and the last is the pair of most similar matrices.

[Figure: plot with y-axis from 0 to 2.5 and x-axis from 1 to 4; series: Kernel, MULTIPROF, CONS t=10, CONS t=20, CONS t=50, GIBBS, ANN-SPEC]

Hierarchical order of the anatomical structures
Bronchial tree and Intrapulmonary Airways
Cytomer/Content: Human body; Lung; Bronchial tree; Main bronchus; Lobar bronchus; Segmental bronchus; Bronchus; Bronchiolus; Terminal bronchiolus; Alveolar sac; Pulmonary alveolus; Alveolar pore; Alveolar epithelium; Pneumocytes; Respiratory bronchiolus; Alveolar duct; Alveolar septa

Link from CYTOMER to TRANSFAC: T00167

[Figure: Gene expression, UniGene EST, Cytomer, TRANSGENOME; gene expression groups 1, 2 and 3]

[Figure: bar chart, "Number of promoters of the specific genes", y-axis from 0 to 500; categories: cell cycle, T-cell, uterus, testis, stomach, prostate gland, placenta, peripheral lymphoid, pancreas, muscles, lymph, lung, liver, large intestine, kidney, heart, eye, ear, brain, breast]

Cell-cycle
Motif         Most similar TRANSFAC matrices
TTTCGCGCCA    V$E2F_03
ATTTGGCGCG    V$E2F_03
AggGCCGgGC    V$SP1_Q6; V$MYCMAX_B
AAAGGAtTTG    ?
GGGGCGGGGC    V$SP1_Q6; V$MAZ_Q6
GGGGGCGGGG    V$SP1_Q6
CCAAAGCCCG    ?
cGCAGCCAAT    V$SREBP_Q3; V$NFY_01

T-cell
Motif         Most similar TRANSFAC matrices
CaTTTCCTCT    V$HOX13_01; V$ISRE_01
TATAAAGgga    V$TATA_C; V$LEF1_Q6; V$SRF_Q6
cCCCCGCCCc    V$MTF1_Q4
AtAgAGGAAg    V$NFAT_Q6; V$NFKAPPAB65_01
TGAGGAAATG    V$PTF1BETA_Q6; V$MAF_Q6
CCCCGCCCcc    V$MTF1_Q4
TtCCTtTATA    V$TATA_C

Muscle
Motif         Most similar TRANSFAC matrices
GaCTATATAA    V$TATA_C; V$SRF_Q6; V$AMEF2_Q6
GCCcCCtCCT    V$HOX13_01
GGGGcAGgGg    V$SP1_Q6; V$MAZ_Q6; V$MAF_Q6
GAGGtGGCTG    V$MYOD_01
GCAGGGGtGG    ?
CCCCCGGCTC    V$AP2_Q6; V$MTF1_Q4
GGGGAGGggg    V$SP1_Q6; V$MAZ_Q6; V$ETF_Q6
gGGGGCAGGG    ?

Motifs found by the Kernel method in three different sets of promoters. TRANSFAC matrices that are most similar to the motifs are shown. Matrices that are very similar to the motif are shown in bold. Matrices for the factors that are known to be involved in the regulation of the corresponding specific function are underlined.

TRANSPLORER (TRANScription exPLORER) is a software package for the analysis of transcription regulatory sequences. Currently, the TRANSPLORER site prediction tool uses collections of position weight matrices (PWMs). It can use several matrix sources: the largest and most up-to-date library of matrices, derived from the TRANSFAC® Professional database, other matrix libraries, and any user-developed matrix libraries. It can therefore search for a great variety of transcription factor binding sites. A search can be made using all matrices from the libraries or subsets of them.
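As an illustration of what a PWM-based site search involves, here is a minimal Python sketch. It is not TRANSPLORER code. It assumes the information-weighted similarity score q and the information vector I(i) as they can be read from the formulas earlier in the talk, and it uses the 14-position count matrix shown there (consensus NTTTSGCGCSMRND). The function names, the forward-strand-only scan and the default cutoff of 0.8 are choices made for this example.

```python
import math

BASES = "ACGT"

# Count matrix from the earlier slide, one row per position, columns A, C, G, T.
E2F_COUNTS = [
    (9, 8, 4, 8), (2, 3, 2, 22), (1, 1, 2, 25), (0, 1, 2, 26),
    (1, 13, 15, 0), (0, 3, 26, 0), (0, 29, 0, 0), (0, 0, 29, 0),
    (0, 22, 7, 0), (1, 8, 17, 3), (15, 9, 3, 2), (13, 4, 9, 3),
    (7, 8, 8, 6), (13, 1, 7, 8),
]


def frequencies(counts):
    """Convert per-position counts into per-position frequencies f(b, i)."""
    return [[c / sum(row) for c in row] for row in counts]


def information(freq_row):
    """I(i) = sum over b of f(b, i) * ln(4 * f(b, i)); zero frequencies contribute 0."""
    return sum(f * math.log(4 * f) for f in freq_row if f > 0)


def similarity(site, freqs):
    """Score q in [0, 1]: (current - min) / (max - min), each term weighted by I(i)."""
    info = [information(row) for row in freqs]
    current = sum(i * row[BASES.index(b)] for i, row, b in zip(info, freqs, site))
    lowest = sum(i * min(row) for i, row in zip(info, freqs))
    highest = sum(i * max(row) for i, row in zip(info, freqs))
    return (current - lowest) / (highest - lowest)


def scan(sequence, counts, cutoff=0.8):
    """Slide the matrix along the forward strand and report windows with q >= cutoff."""
    freqs = frequencies(counts)
    width = len(freqs)
    seq = sequence.upper()
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        if set(window) <= set(BASES):  # skip windows containing N or other symbols
            q = similarity(window, freqs)
            if q >= cutoff:
                hits.append((start, window, round(q, 3)))
    return hits
```

Calling `scan(promoter_sequence, E2F_COUNTS)` returns a list of `(start, site, q)` tuples. A fuller tool would also scan the reverse complement and use matrix-specific cutoffs, which this sketch leaves out.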
Search for most probable binding sites regulating gene expression Search for binding sites coinciding with SNPs Mouse c-fos promoter (Matrix search for TF binding sites) 1 <------------V$IK1_01(0.86) -----...V$CREBP1CJUN_01(0.85) 2 <-----------V$IK2_01(0.90) -----...V$CREB_01(0.96) 3 ----------->V$AP2_Q6(0.87) <-------------V$GKLF_01(0.87) 4-->V$ATF_01(0.89) <-------V$MZF1_01(0.99) ----...V$ELK1_01(0.87) 5 <-----------V$AP2_Q6(0.92) <------------V$SP1_Q6(0.88) 6>V$AP1FJ_Q2(0.89) <-------------V$GKLF_01(0.85) 7>V$AP1_Q2(0.87) <-------------V$GKLF_01(0.86) 8->V$CREB_Q2(0.86) <---------V$CETS1P54_01(0.90) 9->V$CREB_Q4(0.90) <---------V$NRF2_01(0.90) 10 <-------------V$GC_01(0.88) 11 ----------->V$CAAT_01(0.87) 12 <------------V$TCF11_01(0.87) 13 ----------->V$AP2_Q6(0.87) 14 <---------V$USF_Q6(0.93) 16 --------...V$ATF_01(0.94) 17 -------...V$AP1FJ_Q2(0.95) 20 -------...V$CREBP1_Q2(0.93) 21 -------...V$CREB_Q2(0.95) 23 ---...V$IK2_01(0.85) MMCFOS_1 GAGCGCCCGCAGAGGGCCTTGGGGCGCGCTTCCCCCCCCTTCCAGTTCCGCCCAGTGACG 420 1-->V$CREBP1CJUN_01(0.85) -------------->V$BARBIE_01(0.86) 2-->V$CREB_01(0.96) -------------->V$TATA_01(0.95) 3 ----------->V$CAAT_01(0.91) --------->V$AP4_Q5(0.95) 4----------->V$ELK1_01(0.87) --------------------->V$HEN1_01(0.87) 5 --------->V$AP4_Q5(0.88) <---...V$CMYB_01(0.93) 6 <---------V$CDPCR3HD_01(0.93) --...V$VMYB_02(0.89) 7 <--------------V$TATA_01(0.88) 8 --------------------->V$HEN1_02(0.87) 9 <---------------------V$HEN1_02(0.86) 10 <-----------------V$AP4_01(0.88) 11 ----------->V$LMO2COM_01(0.93) 12 <-----------V$LMO2COM_01(0.93) 13 <-----------V$MYOD_01(0.88) 17--->V$AP1FJ_Q2(0.95) <---------V$AP4_Q6(0.99) 20---->V$CREBP1_Q2(0.93) <---------V$MYOD_Q6(0.96) 21---->V$CREB_Q2(0.95) Transcription start 23-------->V$IK2_01(0.85) 24 <=========== E2F (0.80) MMCFOS_1 TAGGAAGTCCATCCATTCACAGCGCTTCTATAAAGGCGCCAGCTGAGGCGCCTACTACTC 480 1 <-----------------V$CMYB_01(0.91) -------...V$ER_Q6(0.86) 2 <-----------V$LMO2COM_01(0.90) <----...V$TCF11_01(0.87) 3 
--------->V$MYOD_Q6(0.90) -------->V$STAT_01(0.93) 4 --------->V$VMYB_01(0.89) <--------V$STAT_01(0.89) 5--------------V$CMYB_01(0.93) -------->V$LMO2COM_02(0.93) 6------>V$VMYB_02(0.89) <-----------V$CAAT_01(0.85) 7 -------->V$VMYB_02(0.88) 8 -------------->V$EVI1_04(0.86) 9 ------------->V$GATA1_02(0.93) 12 <------------V$ZID_01(0.85) 13 <----------V$CP2_01(0.97) 14 ---------->V$GATA_C(0.92) 15 ----------------->V$CMYB_01(0.86) 16 --------->V$CREL_01(0.91) 24 <=========== E2F (0.82) MMCFOS_1 CAACCGCGACTGCAGCGAGCAACTGAGAAGACTGGATAGAGCCGGCGGTTCCGCGAACGA 540 Exon 2 sequence of human thyroid transcription factor-1 (TTF-1) gene (HS198161) (Matrix search for TF binding sites) 1------------V$AHRARNT_01(0.90) <-----------------V$NF1_Q6(0.85) 2--------V$NMYC_01(0.89) --------->V$AP4_Q5(0.91) 3------>V$USF_Q6(0.89) --------->V$AP4_Q6(0.85) 4------V$USF_C(0.86) ------------...V$YY1_02(0.86) 5 --------->V$AP4_Q5(0.91) 6 --------->V$AP4_Q6(0.86) 7 --------->V$AP4_Q5(0.92) 8 --------->V$AP4_Q6(0.86) 9 --------->V$AP4_Q5(0.86) HS198161_1 ACGCGCAGCAGCAGGCGCAGCACCAGGCGCAGGCCGCGCAGGCGGCGGCAGCGGCCATCT 540 1 ----------------->V$NF1_Q6(0.96) 2 <-----------------V$NF1_Q6(0.90) 3 --------->V$USF_Q6(0.87) 4------->V$YY1_02(0.86) ---------->V$CP2_01(0.88) 5 --------->V$AP4_Q5(0.92) ----------->V$CAAT_01(0.85) 6 --------->V$AP4_Q6(0.85) --------->V$AP4_Q5(0.86) 7 ------...V$CP2_01(0.86) 8 ===========> E2F (0.81) 9 ===========> E2F (0.90) HS198161_1 CCGTGGGCAGCGGTGGCGCCGGCCTTGGCGCACACCCGGGCCACCAGCCAGGCAGCGCAG 600 1 <---------V$CETS1P54_01(0.89) <--------...V$GATA_C(0.86) 2 ----------------->V$NF1_Q6(0.85) <-------...V$GATA1_02(0.90) 3 --------->V$CETS1P54_01(0.90) <-------...V$GATA1_03(0.92) 4 <--------------------V$R_01(0.88) <-----...V$LMO2COM_02(0.90) 5 <---------------V$AHRARNT_01(0.86) 6 ----------->V$AP2_Q6(0.95) 7---->V$CP2_01(0.86) <-------...V$GATA1_04(0.87) 8 <----...V$CETS1P54_01(0.87) 9 ===========> E2F (0.80) HS198161_1 
GCCAGTCTCCGGACCTGGCGCACCACGCCGCCAGCCCCGCGGCGCTGCAGGGCCAGGTAT 660 1--V$GATA_C(0.86) <---------V$CETS1P54_01(0.89) 2------V$GATA1_02(0.90) --------...V$DELTAEF1_01(0.96) 3------V$GATA1_03(0.92) <---...V$CEBPB_01(0.88) 4---V$LMO2COM_02(0.90) 5 <-----------V$IK2_01(0.92) 6 <---------------V$E47_02(0.87) 7-----V$GATA1_04(0.87) 8-----V$CETS1P54_01(0.87) 9 <--------------V$E47_01(0.86) 10 ---------->V$DELTAEF1_01(0.99) 11 <-----------V$LMO2COM_01(0.94) 12 <-----------V$MYOD_01(0.87) 13 --------->V$MYOD_Q6(0.91) 14 ------->V$USF_C(0.93) HS198161_1 CCAGCCTGTCCCACCTGAACTCCTCGGGCTCGGACTACGGCACCATGTCCTGCTCCACCT 720

Enhanceosome

Recruitment of CIITA to MHC-II promoters. A prototypical MHC-II promoter (HLA-DRA) is represented schematically with the W, X, X2, and Y sequences conserved in all MHC-II, Ii, and HLA-DM promoters. RFX, X2BP, NF-Y, and an as yet undefined W-binding protein bind cooperatively to these sequences and assemble into a stable higher-order nucleoprotein complex, referred to here as the MHC-II enhanceosome. CIITA is tethered to the enhanceosome via multiple weak protein-protein interactions with the W, X, X2, and Y-binding factors. The octamer site found in the HLA-DRA promoter (O) and its cognate activators (Oct and OBF1) are not required for recruitment of CIITA. CIITA is proposed to activate transcription (arrow) via its amino-terminal activation domains (AD), which contact the RNA polymerase II basal transcription machinery.

Masternak K et al., Genes Dev 2000 May 1;14(9):1156-66

One of the TF binding sites in a composite element can be rather weak. Weak DNA-protein interactions are stabilized by protein-protein interactions.

Mouse Interleukin-2 gene promoter, COMPEL:C00050, NF-ATp and AP-1 .......
tgccacacaggtagactcttTTGAAAATAtgTGTAATAtgtaaaa catcgtgaca cccccatatt… … -96 -79
AP-1 consensus: TGAGTCA
ST

Antagonistic composite elements

COMPEL: C00006, Chicken embryonic -globin gene
Sp1, NF-Y, NF-1
GGTGGGcctccggagtgaccaatgagtgTGGACAGATGCCA
Sp1 cooperatively with NF-Y activates transcription in primitive erythroid cells. NF-1 represses transcription in adult cells.

COMPEL: C00009, Human c-fos protooncogene
SRF, YY1
acaggaTGTCCATATTAGGacatctgcg
SRF mediates the rapid, transient induction of the c-fos protooncogene by serum growth factors. YY1 diminishes both basal and serum-induced expression of c-fos.

COMPEL: C00054, Rat serum amyloid A1 gene
C/EBP, NF-κB, YY1
TGGTAGTCTTGCACAGGAAATGACATggtGGGACTTTCCCcaggg
C/EBP and NF-κB synergistically activate transcription in liver cells during the acute phase response. YY1 represses inducible transcription of this gene.

[Figure: human TNF promoter: NFAT, AP-1 (-107), NFAT (-74), NF-kB, VDR, AP-1, C/EBP; cell types: mast cells, T-cells, dendritic cells, T-cells + ?]

E2F site context
Local context: TTTGGCGCGAAA
Global context

Revealing the local oligonucleotide context of TF binding sites
motif: WSG
TTTGGCGCGAAA
window: [ ]
Promoters of cell-cycle genes: .............
Exon 2 sequences: .............
Frequency of the motifs in the window

Search for a maximal clique in a graph of non-correlated characteristics

Utility   Motif   Window
0.91      VWS     [7,65]
0.74      TTT     [39,41]
0.78      BAY     [7,65]
0.84      MGSG    [25,27]
0.73      WWTT    [11,65]
0.88      WS      [15,65]
0.77      YKMG    [13,15]
0.76      MGCG    [19,21]
0.83      VTS     [33,35]
0.89      CGSK    [17,37]

Found motifs in the flanking regions of E2F sites in cell-cycle promoters

[Figure: 30 bp flank, 12 bp E2F site TTTGGCGCGAAA, 30 bp flank, with windows marked for each motif. High frequency: MGCG, TTT, CGSK, HKCG, DWTT, VDWW. Low frequency: VTV, BAY, VWS]

Motifs found in the local context of E2F sites in promoters of cell cycle-related genes

Positive characteristics
N    Motif   Window (w)   f_Y / f_N                 Utility   α_i
1    MGCG    [27,34]      0.0048 / 0.0041 = 1.179   0.80      -0.394
2    TTT     [39,41]      0.0112 / 0.0032 = 3.536   0.75      0.9618
3    CGSK    [17,38]      0.0851 / 0.0341 = 2.499   0.90      0.5353
4    HKCG    [13,16]      0.0675 / 0.0095 = 7.071   0.79      0.5904
5    VDWW    [17,46]      0.1233 / 0.0536 = 2.299   0.72      0.223
6    DWTT    [21,26]      0.0337 / 0.0000           0.80      0.5036
7    GSDM    [3,69]       0.0980 / 0.0559 = 1.754   0.82      0.595

Negative characteristics
N    Motif   Window (w)   f_Y / f_N                 Utility   α_i
8    VWS     [7,66]       0.1258 / 0.1932 = 0.651   0.91      -0.095
9    HSWY    [26,65]      0.0413 / 0.0813 = 0.508   0.79      -0.2297
10   VTV     [19,34]      0.0427 / 0.1354 = 0.315   0.71      -0.261
11   BAY     [7,65]       0.0274 / 0.0614 = 0.447   0.78      -0.566

α_0 = -5.6767

Score of context:

$$d(X) = \sum_{i=0}^{k} \alpha_i \, f(\mu_i, w_i, X)$$

Human uracil DNA-glycosylase (E2F sites)

[Figure: two tracks over positions -1000 to 9000 relative to +1, without and with the score of context; known site ttTTTGCCGCGAAAag, q=0.92, d=2.8]

False negative (FN, in percent) and false positive (FP, sites per 1000 bp) rates for recognition of E2F sites.

[Figure: FP (0 to 1.4) plotted against FN (0 to 70); series: PWM, PWM + score of context; labelled points 0.80 and 0.79]

Analysis of promoters of cell cycle-related genes by E2F weight matrix

Comparison of frequencies of potential E2F sites in different promoter sets

High frequency of potential E2F sites near the transcription start site in promoters of cell cycle-related genes.
[Figure: two plots of the frequency of potential E2F sites. Series in one plot: cell cycle-related genes, other genes (EPD), random sequences, exons 2. Series in the other plot: cell cycle-related genes, other genes (EPD). One axis runs over positions from -650 to +350]

Identification of new E2F target genes

SITEVIDEO system: building of the E2F site recognition program (step 1)
SITEVIDEO system: building of the E2F site recognition program (step 2)
SITEVIDEO system: building of the E2F site recognition program (step 3)

Composite elements: ternary complex formation and stabilization of DNA-protein complexes

Mouse Interleukin-2 gene promoter, COMPEL:C00149, NF-ATp and AP-1 .........
tcagtgtatgggggtttaaAGAAATTCCagAGAGTCAtcagaagaggaaaaacaaa… … -147 -164

Human Interleukin-2 gene promoter, COMPEL:C00109, NF-ATp and AP-1, ST .......
ccacccccttaaagaaaggAGGAAAAAcTGTTTCAtacagaaggcgttaattgcatg… … -283 -268 ST

Recognition method for T-cell specific Composite Elements NFAT/AP-1

5' ..WRGAAAA.. (NFATp), spacer of 8-12 bp, ..TGASTCA.. (AP-1) 3'

A C G T 1 2 3 4 5 6 7 8 5 5 8 8 12 1 2 11 2 0 26 0 0 0 23 26 0 1 0 0 25 0 1 0 25 1 0 0 15 5 2 4
NFAT = -log(1-scoreNFAT)

A C G T 1 2 3 4 5 6 7 8 9 19 3 16 9 4 2 5 36 4 36 3 2 4 13 33 2 29 8 5 2 0 0 0 47 2 44 0 1 47 0 0 0 2 8 24 13
AP-1 = -log(1-scoreAP-1)

[Figure: scatter plot of NFAT/AP-1 (training) against Random, both axes from 0.7 to 6.7; annotations: Composite score 1.47 AP1 4.7 wCE 17,0 NFAT NFAT 0.88 AP1 3.5]

Frequency of NFAT/AP-1 in genomic sequences

[Figure: bar chart, frequency per 1000 bp (0 to 1) in promoters, introns and CDS; series: T-cell, Muscle, dbEST, Random]

Frequency of NFAT/AP-1 in promoters

[Figure: frequency (0 to 0.007) in muscle promoters and T-cell promoters by region: > -900, [-900:-750], [-750:-600], [-600:-450], [-450:-300], [-300:-150], [-150:+1]]

Composite modules encode gene expression pattern: organ, tissue, cell; stage of development; cell cycle phase; extracellular signals

Composite modules w (1) 1 s ( 2) 1 s (1) cut off s ( 2) 2 ( 2) cut off (k ) (k ) 1 ... nk q q (1) ( 2) ...
C max w (k ) q (w) k 1, K K - number of TF matrixes (k ) avr s ... s ... Start of transcription (k ) cut off q (k ) ... Parameters of the model to be estimated (k ) q ( s q (w) i ) (k ) avr i 1, nk (k ) q ( si( k ) ) qcut off (k ) si w Mutation, recombination and selection of the best genomes G g1 g2 g3 SELECTION F ……. 41 27 3 MUTATION 0.9 0.9 0.5 0.9 0.8 RECOMBINATION 0.7 0.7 0.9 0.7 0.6 1 14 5 6 0.5 0.5 0.9 0.6 0.7 O.5 MULTIPLICATION 0.9 0.9 0.7 0.9 0.7 gn 4 Genetic Algorithm (GA) Fitness function of the GA F FN FP T N AC # promoters FN – false negatives T-test FP – false positives N FN FP T – T-test (difference between mean values) cms N – normal likeness AC – Akaike Information Criteria Composite module in promoters of T-cell specific genes Weight: qcutoff TF matrix 0.618300 0.923077 V$NFKB_Q6 0.162534 0.895279 V$OCT1_02 0.743705 0.965039 V$NFKAPPAB65_01 0.002359 0.788579 V$HOX13_01 0.928935 0.928569 V$NFAT_DWM_1 100 90 t-cell T-cell specific promoters other promoters 80 70 Other promoters No of obs 60 50 40 C 30 (k ) (k ) q cut off k 1,5 20 10 0 <= -,2 (0;,2] (-,2;0] (,4;,6] (,2;,4] (,6;,8] (,8;1,] (1,2;1,4] (1,6;1,8] (2,;2,2] (1,;1,2] (1,4;1,6] (1,8;2,] > 2,2 Composite module in promoters of cell cycle-related genes Weight: qcutoff TF matrix 1.000000 0.840072 V$E2F_19 0.954483 0.737637 V$TATA_01 0.888064 0.939687 V$CREB_01 0.816179 0.941583 V$SP1_Q6 0.039746 0.839702 V$TAL1BETAE47_01 4 0 Exon-2 sequences Cell cycle-related promoters Noofsequences 3 0 2 0 C 1 0 (k ) (k ) q cut off k 1,5 0 -0 ,5 0 ,0 0 ,5 1 ,0 1 ,5 2 ,0 2 ,5 3 ,0 3 ,5 4 ,0 1 <------------V$IK1_01(0.86) -----...V$CREBP1CJUN_01(0.85) 2 <-----------V$IK2_01(0.90) -----...V$CREB_01(0.96) 3 ----------->V$AP2_Q6(0.87) <-------------V$GKLF_01(0.87) 4-->V$ATF_01(0.89) <-------V$MZF1_01(0.99) ----...V$ELK1_01(0.87) 5 <-----------V$AP2_Q6(0.92) <------------V$SP1_Q6(0.88) 6>V$AP1FJ_Q2(0.89) <-------------V$GKLF_01(0.85) 7>V$AP1_Q2(0.87) <-------------V$GKLF_01(0.86) 8->V$CREB_Q2(0.86) 
<---------V$CETS1P54_01(0.90) 9->V$CREB_Q4(0.90) <---------V$NRF2_01(0.90) 10 <-------------V$GC_01(0.88) 11 ----------->V$CAAT_01(0.87) 12 <------------V$TCF11_01(0.87) 13 ----------->V$AP2_Q6(0.87) 14 <---------V$USF_Q6(0.93) 16 --------...V$ATF_01(0.94) 17 -------...V$AP1FJ_Q2(0.95) 20 -------...V$CREBP1_Q2(0.93) 21 -------...V$CREB_Q2(0.95) 23 ---...V$IK2_01(0.85) MMCFOS_1 GAGCGCCCGCAGAGGGCCTTGGGGCGCGCTTCCCCCCCCTTCCAGTTCCGCCCAGTGACG 420 Mouse c-fos promoter E2F composite module (global context) E2F flanking motifs (local context) 1-->V$CREBP1CJUN_01(0.85) -------------->V$BARBIE_01(0.86) 2-->V$CREB_01(0.96) -------------->V$TATA_01(0.95) 3 ----------->V$CAAT_01(0.91) --------->V$AP4_Q5(0.95) 4----------->V$ELK1_01(0.87) --------------------->V$HEN1_01(0.87) 5 --------->V$AP4_Q5(0.88) <---...V$CMYB_01(0.93) 6 <---------V$CDPCR3HD_01(0.93) --...V$VMYB_02(0.89) 7 <--------------V$TATA_01(0.88) 8 --------------------->V$HEN1_02(0.87) 9 <---------------------V$HEN1_02(0.86) 10 <-----------------V$AP4_01(0.88) 11 ----------->V$LMO2COM_01(0.93) 12 <-----------V$LMO2COM_01(0.93) 13 <-----------V$MYOD_01(0.88) 17--->V$AP1FJ_Q2(0.95) <---------V$AP4_Q6(0.99) 20---->V$CREBP1_Q2(0.93) <---------V$MYOD_Q6(0.96) 21---->V$CREB_Q2(0.95) Transcription start 23-------->V$IK2_01(0.85) 24 <----------- E2F (0.80) MMCFOS_1 TAGGAAGTCCATCCATTCACAGCGCTTCTATAAAGGCGCCAGCTGAGGCGCCTACTACTC 480 1 <-----------------V$CMYB_01(0.91) -------...V$ER_Q6(0.86) 2 <-----------V$LMO2COM_01(0.90) <----...V$TCF11_01(0.87) 3 --------->V$MYOD_Q6(0.90) -------->V$STAT_01(0.93) 4 --------->V$VMYB_01(0.89) <--------V$STAT_01(0.89) 5--------------V$CMYB_01(0.93) -------->V$LMO2COM_02(0.93) 6------>V$VMYB_02(0.89) <-----------V$CAAT_01(0.85) 7 -------->V$VMYB_02(0.88) 8 -------------->V$EVI1_04(0.86) 9 ------------->V$GATA1_02(0.93) 12 <------------V$ZID_01(0.85) 13 <----------V$CP2_01(0.97) 14 ---------->V$GATA_C(0.92) 15 ----------------->V$CMYB_01(0.86) 16 --------->V$CREL_01(0.91) 24 <----------- E2F 
(0.82) MMCFOS_1 CAACCGCGACTGCAGCGAGCAACTGAGAAGACTGGATAGAGCCGGCGGTTCCGCGAACGA 540 MMCFOS_1 1----------->V$ER_Q6(0.86) 2--------V$TCF11_01(0.87) 3 --------->V$AP4_Q5(0.91) 4 --------->V$AP4_Q6(0.87) 5 ---------->V$AP1FJ_Q2(0.93) 6 ---------->V$AP1_Q2(0.90) 7 ---------->V$AP1_Q4(0.87) 8 <-----------V$IK2_01(0.94) GCAGTGACCGCGCTCCCACCCAGCTCTGCTCTGCAGCTCC 580 Computationally predicted E2F target genes confirmed by in vivo footprint EMBL Gene Chromatin crosslinking c-fos, Hs HSFOS JunB, Hs HS207341 tgf-1, Hs HSTGFB1P R p14ARF, Hs AF082338 Immunoprecipitation Mcm4 (Cdc21), Hs mcm5 (P1cdc46), Hs PCR Von HippelLindau (VHL), Hs B-myb, Hs HSU63630 HS286B10 AF010238 HSBMYBD NA nucleolin, Hs nucleolin, Cg nucleolin, Ms HSNUCLEO CSNUCLEO MMNUCLE O Score ,q (+) aaGCTCGCGCCACTgc (-) gcAGTGGCGCGAGCtt (-) gtCTTCGCGCGCGCtc Position rel. start of transcription -165 .. -176 -92 .. –103 -90 .. –79 -78 .. –89 79 .. 90 91 .. 80 169 .. 158 -513 .. -502 -298 .. -287 28 .. 39 40 .. 29 85 .. 96 -1384 .. -1395 -1009 .. -1020 -739 .. -750 -589 .. -578 -265 .. -276 -491 .. -502 -409 .. -420 -377 .. -366 -175 .. -164 -93 .. -82 -187 .. -176 -175 .. -186 8 .. 19 20 .. 9 -270 .. -259 -258 .. -269 -28 .. 39 (-) gtCCTGGCGCGCGGgc (+) cgCTTGGCGGGAGAta -72 .. –83 -53 .. -42 0.83 0.87 1.18 -296 -> +14 <- (-) ttTTTGGCGCCGGCtg (-) ccGTGGGCGCGCGGgt -297 .. -308 -256 .. -267 0.97 0.81 2.91 -407 -> -41 <- (-) cgTTTGGCGCGGCTtg -296 .. -307 0.97 6.67 -538 -> -198 <- (-) agTTTGGCGCGGCTtg -306 .. 
-317 0.97 1.76 -531 -> -232 <- Sequence of the potential sites (-) (-) (+) (-) gcCTTGGCGCGTGTcc ggGGTGGCGCGCGGgc ccTCTGGCGCCACCgt acGGTGGCGCCAGAgg (+) gcTATCGCGCCAGAga (-) tcTCTGGCGCGATAgc (-) ggGCTGGCGCGGGCgg (+) (+) (+) (-) (+) ctGTTTGCGGGGCGga ccCTTCGCGCCCTGgg ctCTTGGCGCGACGct agCGTCGCGCCAAGag ccTTTGCCGCCGGGga (-) (-) (-) (+) (-) ctCTCCGCGCGCGGga gtCTTGGCGACCGTtg ggCCTGGCGCCGGAct tgATTGGCGGATAGag acTTTCCCGCCCTGtg (-) (-) (+) (+) (+) gtTTTCGCGGGAAAac ctTTCAGCGCCCGTgc gcAGTGGCGCCTCCcg ggCGTGGCGCGGAGcc ctTGTCGCGCAGGTac (+) (-) (+) (-) agTTTCGCGCCAAAtt aaTTTGGCGCGAAAct ttTTTCCCGCGAAAct agTTTCGCGGGAAAaa 0.92 0.84 0.88 0.83 0.89 0.91 0.82 0.80 0.91 0.93 0.83 0.85 0.81 0.81 0.81 0.83 0.86 0.93 0.82 0.80 0.83 0.86 0.99 1.00 0.89 0.93 0.81 0.84 0.92 Score of context, d 2.92 Positions of PCR primers -201 -> +96 <- -27 -> +313 <3.17 2.03 -122 -> +210 <- 4.11 -404 -> -143 <- 3.53 -667 -> -330 <- 4.39 4.91 -211 -> +88 <- 3.01 4.21 -137 -> +123 <2.22 •Phylogenetic footprinting Alignment of c-fos promoters E2F mouse rat hamster man ATGTTCGCTCGCCTTCTCTGCCTTTCCCGCCTCCCCTCCCCCGGCCGCGGCCCCGGTTCC ATGTTCGCTCGCCTTCTCTGCCTTTCCCGCCTCCCCTCCCCCGGCCGCGGCCCCCGCTCC ATGTTCGCTCGCCTTCTCTACCTTTCCCGCCTCCCCTCCCCCGGCCGCGGCCCCAGCTCC ATGTTCTCTCTCATTCTGCGCCGTTCCCGCCTCCCCTCCCCCAGCCGCGGCCCCCGCCTC ****** *** * **** ** ******************* *********** * * mouse rat hamster man CCCCCT----GCGCTGCACCCTCAGAGTTGGCTGCAGCCGGCGAGCTGTTCCCGTCAATC CCCCTT----GCGCTGCACCCTCAGAGTTGGCTGCAGCCGGCGAGCTGTTCCCGTCAATC CCCCTCCCCCGCGCTGCACCCTCAGAGTTGGCTGCAGCCGGCAAGCAGTTCCCGTCAATC CCCCC-----GCACTGCACCCTCGGTGTTGGCTGCAGCCCGCGAGCAGTTCCCGTCAATC **** ** ********** * ************* ** *** ************* mouse rat hamster man CCTCCCTCCTTTACACAGGATGTCCATATTAGGACATCTGCGTCAGCAGGTTTCCACGGC CCTCCCTCCTTTACACAGGATGTCCATATTAGGACATCTGCGTCA---GGTTTCCACGGC CCT---TTCC---CACAGGATGTCCATATTAGGACATCTGCGTCAGCAGGTTTCCACGGC CCTCCCCCCTT-ACACAGGATGTCCATATTAGGACATCTGCGTCAGCAGGTTTCCACGGC *** * ******************************** ************ mouse rat hamster man 
CGGTCCCTGTTGTTCTGGGGGGGGGACCATCTCCGAAATCCTACACGC-GGAAGGTCTAG
CGGTCCCTGTTGTCCTGGGGGGA--ACCATCCCCGAAATCCTACATGC-GGAGGGTCCAG
CGGTCCTTGTAGACCTGGGGGTG--ACGATCCCCAAAATCCTACATGC-GGAGAGTCCAG
CTTTCCCTGTAGCCCTGGGGGGA--GCCATCCCCGAAACCCCTCATCTTGGGGGGCCCAC
* *** *** * ******* * *** ** *** ** ** ** * * *
mouse rat hamster man
GAGACCCCCTAAGATCCCAAATGTGAACA-CTCATAGGTGAAAGATGTATGCCAAGACGG
GAGACCTTCTAAGATCCCAATTGTGAACA-CTCATAGGTGAAAGTTACAGACTGAGACGG
GAGACCCCCTAAGACCCCTATTGTGAACA-CAAATGGGTGAAAATTACATGTCAAGACGG
GAGACCT-CTGAGACAGGAACTGCGAAATGCTCACGAGATTAGGACACGCGCCAAGGCGG
****** ** *** * ** *** * * * * ** ***
mouse rat hamster man
GGGTTGAAAGCCTGGGGCGTAGAGTTGACGACAGAGCGCCCGCAGAGG-GCCTTGGGGCG
GGGTTGAGAGCCTGGGGGCTAGAGTTGATGACAGGGAGCCCGCAGAGG-GCATTCGGGAG
AGGCGGGGGACCCGGGGCGCGGAGTTGACGCCAGGGCGGCCGCAGAAG-GCCTGGGGGCG
GGGCAGGGAGCTGCGAGCGCTGGGGACGCAGCCGGGCGGCCGCAGAAGCGCCCAGGCCCG
** * * * * * * * * * * ******* * ** * *
mouse rat hamster man
CGCTTCCCCCCCC-------TTC-CAGTTCCGCCCAGTGACGTAGGAAGTCCATCCATTC
CGCTTTCCCCCCTCCAGT--TTCTCTGTTCCGCTCA-TGACGTAGTAAG-----CCATTC
CGCGGCTCCCCTCCGTC---GCCACAGTTCCGCCCAGTGACGTGTAATGT----TCATTC
CGCGCCACCCCTCTGGCGCCACCGTGGTTGAGCCCG-TGACGTTTACAC-----TCATTC
*** **** * *** ** * ****** *****
mouse rat hamster man
AC--AGCGCTTC-TATAAAGGCGCCAGCTGAGGCGCCTACTACTCCAACCGCGACTGCAG
A---AGCGCTTC-TATAAAGCGGCCAGCTGAGGCGCCTACTACTCCAACCGCGATTGCAG
ACA-AGCGCTTC-TATAAAGGCACCGGCTGAGGCGCCTACTACTCCAACCGCGACTGCAG
ATAAAACGCTTGTTATAAAAGCAGTGGCTGCGGCGCCTCGTACTCCAACCGCATCTGCAG
* * ***** ****** **** ******* ************ *****
Sites marked on the alignment: CRE, Ets, YY1, SRE, CRE/AP-1, E2F, SP-1, E2F, CRE, TATA, E2F, CRE, E2F

Phylogenetic footprint (human/mouse)
[Diagram: Spec1, Spec2]

Phylogenetic footprint of the promoter of p53 gene
[Diagram: p53_human and p53_mouse, first by ClustalW alignment, then by Motif-based re-alignment]

1 ==========>V$AP1_Q4(0.91)
TTAGTATCTACGGCACCAGGTCGGCGAGAATCCTGACTCTGCACCCTCCTCCCCAACTCC
1 ==========>V$AP1_Q4(0.91)
TTCCTGCTGAGGGCAACATCTCAGGGAGAATCCTGACTCTGCAAG----TCCCCGCCTCC
** * * **** ** ** *
****************** ***** ****
ATTTCCTTTGCTTCCTCCGGCAGGCGGATTACTTGCCCT
ATTTC--TTGC--CCTCAACCCACGGAAGGACTTGCCCT
***** **** **** * * * *********
Last residue positions: human 60 and mouse 56 (first block); human 99 and mouse 91 (second block).

1 ==========>V$AP1_Q4(0.91)
2 <============V$SP1_Q6(0.88)
TTAGTATCTACGGCACCAGGTCGGCGAGAATCCTGACTCTGCAC-CCTCCTCCCCAACTC 59
1 ==========>V$AP1_Q4(0.91)
2 <============V$SP1_Q6(0.90)
TTCCTGCTGAGGGCAACATCTCAGGGAGAATCCTGACTCTGCAAGTCCCCGCCTCCATTT 60
** * * **** ** ** * ****************** * ** ** * * *
CATTTCCTTTGCTTCCTCCGGCAGGCGGATTACTTGCCCT 99
C-TT--------GCCCTCAACCCACGGAAGGACTTGCCCT 91
* ** **** * * * *********

New human/mouse conserved SP-1 sites were found

Phylogenetic footprint of 5’ regulatory region of Xist gene
human horse mouse M.subarv
Regions: IV, III, II
* ** * **** ***** * ** **** ** **** *** * *** *
CATAGTTAAAAAATTACAAACAGGTCACAAACCAGTACTCTTTCTTGATTATTTAGGAACCAAATAGCCATTCTATGAAATGTCTTCCTTTCC
CGCAGTTTAAAACTTACAAACAGGTCAAAAACAG-------TACTCGATTATTTCGGGGCCAAATTGGCATTCTGTGAAATGCCTTCCTTTCC
ATGAGCGTAAGCCCTCCAAATCGGTCACAAC------TAATACTCTGATAATTTAGGAACCAAGGAGCCATTTTGTGAGGCATTTCTACCCTT
CTGTGCGCAATCAGTACAAATAGGTCACAGCCAA---TAATACCCTAATAATTTAGGAACCAAGGAACGATTTTGTGAAGCACCTCTTCTTTT

Consensus patterns matched against the alignment:
RGGTCAnnnTgacy  ER
rTtnnGmAAt  C/EBP
wwTTGTTww  SRY
TgaGTCA  AP-1
rrCCAATs  CCAAT box
WAWnnAGGTCA  RAR

TF binding sites in the distal conserved region of the XIST 5’ sequence: overlapping binding sites for ER (estrogen receptor), AP-1 (c-fos/c-jun) and RAR (retinoic acid receptor); sites for C/EBP factors and a potential CCAAT box; sites for the SRY transcription factor (sex-determining region Y gene product).
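The conservation rows of asterisks under the alignments above can be recomputed from the aligned sequences. The following is a minimal Python sketch, not the tool used for the slides: the function names `conservation_line` and `conserved_blocks` and the minimum block length of 10 columns are assumptions made for illustration. The input is the first block of the c-fos promoter alignment (mouse, rat, hamster, man) as given in the transcript.

```python
from itertools import groupby

# First block of the c-fos promoter alignment, copied from the transcript.
CFOS_BLOCK = {
    "mouse":   "ATGTTCGCTCGCCTTCTCTGCCTTTCCCGCCTCCCCTCCCCCGGCCGCGGCCCCGGTTCC",
    "rat":     "ATGTTCGCTCGCCTTCTCTGCCTTTCCCGCCTCCCCTCCCCCGGCCGCGGCCCCCGCTCC",
    "hamster": "ATGTTCGCTCGCCTTCTCTACCTTTCCCGCCTCCCCTCCCCCGGCCGCGGCCCCAGCTCC",
    "man":     "ATGTTCTCTCTCATTCTGCGCCGTTCCCGCCTCCCCTCCCCCAGCCGCGGCCCCCGCCTC",
}


def conservation_line(seqs):
    """Return '*' for columns identical in all sequences (gaps excluded), ' ' otherwise."""
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("aligned sequences must have equal length")
    return "".join(
        "*" if len(set(column)) == 1 and column[0] != "-" else " "
        for column in zip(*seqs)
    )


def conserved_blocks(line, min_len):
    """Return (start, end) column ranges, end exclusive, of '*' runs of at least min_len."""
    blocks = []
    pos = 0
    for mark, run in groupby(line):
        length = len(list(run))
        if mark == "*" and length >= min_len:
            blocks.append((pos, pos + length))
        pos += length
    return blocks


line = conservation_line(list(CFOS_BLOCK.values()))
blocks = conserved_blocks(line, min_len=10)  # min_len=10 is an illustrative choice

if __name__ == "__main__":
    print(line)
    for start, end in blocks:
        print(start, end, CFOS_BLOCK["mouse"][start:end])
```

Conserved blocks found this way are only candidate regions; in the approach shown in the slides they are combined with weight-matrix matches such as V$AP1_Q4 or V$SP1_Q6 before a site is reported.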
Methods to detect protein-DNA interactions
ChIP-chip approach (chromatin immunoprecipitation - chip analysis)
Robine et al., pbil.univ-lyon1.fr/events/jobim2005/proceedings/P126Robine.pdf

Composite module on flanks of HNF-4 functional binding sites
[Diagram: HNF-4 site with 500bp flanks on each side]

Table columns: Matrix_ID(1), cut-off(1), Matrix_ID(2), cut-off(2), dmin, dmax

Matrix_ID and cut-off:
V$MAZ_Q6 0.89
V$ER_Q6 0.913
V$HEB_Q6 0.969
V$HNF4_Q6_01 0.976
V$HEN1_02 0.854
V$CREB_Q2 0.888
V$HNF4_Q6_01 0.8325
V$EFC_Q6 0.6825
V$COUP_01 0.8005
V$KROX_Q6 0.8315
V$PEBP_Q6 0.84
V$TEL2_Q6 0.878
V$ELK1_01 0.785
V$WHN_B 0.948
V$CMYB_01 0.86
V$KROX_Q6 0.841
V$FOXO1_02 0.8715
V$FXR_Q3 0.8135
V$HNF4_Q6_01 0.8065
V$HNF4_01 0.8705
V$XBP1_01 0.8845
V$FOXO1_02 0.8715

dmin and dmax values: 8 8 8 8 8 8 8 8 100 100 100 100 100 500 200 200
4 4 4 4 4 4 2 2 2 2 2 2 2 2
0.020763 0.047177 0.078905 0.210340 0.099368 0.086618 0.043344 0.053285 0.214469 0.111909 0.100922 0.100184 0.080381 0.112402
Intercept: -0.098626

[Figure: histogram of No of sites (y axis) over values from -0.0986 to 1.0313 (x axis) for two sets: HNF4 sites (+/-500bp) and Genome PWM matches (+/-500bp). Fitted curves: Var1 = 70*0.0942*normal(x, 0.4991, 0.2237); Var2 = 643*0.0942*normal(x, 0.05, 0.1342)]

Analysis of ChIP-chip data on HNF-4 from Odom et al.
(2004)
[Figure: scatter plot of Local context (y axis, -0.2 to 1.8) against Global context (x axis, 0 to 1) for two series: H13K_noHNF4 and Selected]

Composite module in different promoter functional classes
Promoter class: TF factors selected; Score
Cell-cycle related: E2F (1.00), TATA (0.95), CREB (0.88), Sp-1 (0.81); 7.2
Brain enriched: BRLF1 (0.192), ATF (0.038), CREB (0.450), Sp-1 (0.592), HFH2 (1.00); 3.8
Muscle-specific: Tal-1 (0.50), YY-1 (1.0), Oct-1 (0.40), MyoD (0.80), SRF (1.0), PAX5 (0.80); 5.2
Immune cell specific: COMP1 (0.024), STAF (0.017), NF-kB (1.30), NFAT (0.957), Brn-2 (0.059); 6.6
Erythroid specific: n-myc (0.31), GR (0.08), AP-4 (1.00), RREB-1 (0.08), v-Maf (0.08); 2.0
Liver enriched: RORalpha1 (1.00), Sp-1 (0.03), SREBP-1 (1.00), HNF-1 (0.54), ER (0.07), GATA-1 (0.03); 2.6
Housekeeping: Egr-2 (0.15), AhR/Arnt (0.72), ZID (0.94), Elk-1 (0.79), NRF-2 (0.54), CREB (0.62); 7.2

A decision tree method for classification of promoters based on combinations of TF binding sites
[Figure: decision tree with yes/no branches. Decision nodes: ER (F>0.26), MyoD (F>0.2), NF-AT (F>0.8), Nkx-2.5 (F>0.6), E2F + SRF (F>0.8), Oct-1 (F>0.3). Leaves: Muscle-specific 44%, Liver-enriched 51%, Immune cell specific 54%, Housekeeping 20%, Cell cycle related 65%, Brain-enriched 34%, Erythroid-specific 70%]

CYTOMER®
Hierarchical representation of anatomical (sub)structures in the Organ table of CYTOMER

Human DNA sequence from clone RP1-102D24 on chromosome 22
[Figure: Cell cycle regulatory potential and Promoter potential (y axis, 0 to 1,2) along the sequence (x axis, position in bp). Annotations: Novel Mitosis-specific Chromosome Segregation protein SMC1 LIKE protein; Composite modules]

[Diagram: sites s_1^{(1)}, s_1^{(2)}, s_2^{(2)}, ..., s_1^{(k)} ... s_{n_k}^{(k)} within a window w, shown relative to the Start of transcription]

Cell cycle regulatory potential:
CP(i) = \sum_{k=i-LS}^{i+LS} W_k
W_k = C(k) if C(k) > C_{\text{cut-off}}, otherwise W_k = 0
C = \max_{w} \sum_{k=1}^{K} \phi^{(k)} \, q_{avr}^{(k)}(w)
\phi^{(k)} - weight of TF matrix k
K - number of TF matrices
q_{avr}^{(k)}(w) = \sum_{i=1,\ldots,n_k;\; q(s_i^{(k)}) > q_{\text{cut-off}}^{(k)};\; s_i^{(k)} \in w} q(s_i^{(k)})

Parameters of the model to be estimated:
Weight    q_cut-off^(k)    TF matrix
1.000000  0.840072  V$E2F_19
0.954483  0.737637  V$TATA_01
0.888064  0.939687  V$CREB_01
0.816179  0.941583  V$SP1_Q6
0.039746  0.839702  V$TAL1BETAE47_01
LS = 5000
Ccut-off = 0.9

Promoter recognition matrix
Counts:
5  9  -  -  -
2  6  3  -  1
-  13 3  3  -
1  5  2  6  1
-  5  -  2  1
Column totals: 8 38 8 11 3
Weights (count divided by column total):
0,625  0,237  -      -      -
0,250  0,158  0,375  -      0,333
-      0,342  0,375  0,273  -
0,125  0,132  0,250  0,545  0,333
-      0,132  -      0,181  0,333

Promoter potential = \sum_{m=1}^{M} \left( \sum_{k=1}^{K_m} \omega_{m,k} \right)
M - number of promoter regions
K_m - number of found sites in the region m
\omega_{m,k} - weight of the site k in the region m

Regulatory potential for mouse Xist gene
[Figure: regulatory potential (y axis, 0 to 3) along the mouse Xist locus (x axis, positions 0 to 65000). Labelled features: P0, P1P, P2, Ex-1 to Ex-8, BC, D, A2, R, E, pS12, pS19X, pMKK2, NLAR, CpG, S/MAR, MIR, CH, 17-mer, 34-mer, TSIX]

Clusters of immune-cell specific NF-AT/AP-1 composite elements
a) Human IL-4 (HSIL4A): Cluster (5: 399bp); map of exons ex1 to ex4 over positions 1 to 9000
b) Human prointerleukin 1 (HSIL1B): Cluster (4: 228bp); map of exons ex1 to ex7 over positions 1 to 9000
c) Human DNA sequence from PAC 272J12 on chromosome 22q12-qter (HS272J12): Cluster (3: 76bp); positions 81000 to 87000

The task is to reveal statistically significant composite clusters of TF binding sites
Andreas Wagner: Genes regulated cooperatively by one or more transcription factors and their identification in whole eukaryotic genomes.

Revealing of statistically significant composite clusters
[Diagram: sites of several types within a window; P(1,1,1)=0.1, P(2,2,1)=0.0001]
P = P(1,1,1) + P(2,1,1) + P(1,2,1) + P(1,1,2) + P(3,1,1) + P(2,2,1) + P(2,1,2) + P(1,1,3) + P(1,2,2) + P(1,3,1) + ...
The probability of finding a cluster of n or more sites (of m types) within a window of length w:
P(n) = \sum_{\{k_i\}} P(k_1) P(k_2) \cdots P(k_m), \quad \{k_i\} = \{k_1, \ldots, k_m \mid k_1 \ge K_1, \ldots, k_m \ge K_m;\; k_1 + \cdots + k_m \ge n\}
K_i - constraints on the existence of sites of type i.
P(k_i) = e^{-\lambda_i w} \frac{(\lambda_i w)^{k_i}}{k_i!}
Here \lambda_i is the expected number of sites of type i per position, so \lambda_i w is the Poisson mean for a window of length w.

The easier form for calculation is:
P(N) = \sum_{k_1 \ge K_1} P(k_1) \sum_{k_2 \ge K_2} P(k_2) \cdots \sum_{k_m \ge K_m} P(k_m) \;-\; \sum_{k_i \ge K_i;\; k_1 + \cdots + k_m < N} P(k_1) P(k_2) \cdots P(k_m)

Some sites tend to occur together because their binding patterns are similar. This makes the distribution deviate from the Poisson law.
C_{AB} = \frac{N(A \cap B)}{N(A)} = P(B \mid A)

Pairs a - b and their C_ab:
V$AP1_C - V$AP1_C = 0.56
V$SRF_C - V$YY1_01 = 0.37
V$USF_Q6 - V$USF_Q6 = 0.34
V$HNF1_01 - V$AP1_C = 0.16
V$HNF4_01 - V$GR_Q6 = 0.15
V$NFY_Q6 - V$CEBPA_01 = 0.12
V$OCT_C - V$HNF3B_01 = 0.11
V$CEBPA_01 - V$CEBPA_01 = 0.10
[Diagram legend: link between a and b drawn differently for the cases "If Cab > Cba" and "If Cab < Cba"]

Chromosome 21. Length = 33*10^6 bp, window = 300 bp.
Some examples
1. Homo-type: P=2.2e-16 (5288100,5288400). Number of sites: 28. Classes: V$HNF3B_01-28
2. Hetero-type: P=5.0e-11 (4575600,4575900). Number of sites: 16. Classes: V$MEF2_02-1, V$HNF4_01-1, V$MYB_Q6-1, V$AP1_C-1, V$USF_Q6-2, V$YY1_01-1, V$GATA1_04-2, V$CEBPA_01-1, V$NFY_Q6-1, V$CREBP1_Q2-1, V$GR_Q6-3, V$NF1_Q6-1
3. Few types only: P=1.2e-17 (28848900,28849200). Number of sites: 21. Classes: V$EGR1_01-2, V$GC_01-9, V$GR_Q6-10

Matrices shown: V$HNF3B_01, V$CEBPA_01, V$MEF2_02, V$GATA1_04, V$AP1_C, V$HNF4_01, V$OCT_C, V$MYB_Q6, V$YY1_01

[Figure: GC-content (y axis, 0,23 to 0,63) with tracks LocusLink_Ge, Cluster300 and Cluster500 along positions 13 890 000 to 13 945 000. Labelled gene: STCH. Annotation on a cluster region: "? Potentially a new gene"]

Normalized frequencies of cluster distribution within promoters, exons and entire genes.
[Chart: y axis 0 to 7; series Promoters10000, Promoters2000, Promoters1000, Promoters300, Genes2000 and Exons; categories Cluster300 and Cluster500]

URLs for main resources mentioned:
http://www.gene-regulation.de
http://www.biobase.de
http://www.hnbioinfo.de
http://compel.bionet.nsc.ru
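The cluster probability introduced earlier (n or more sites of m types within a window of length w, with Poisson-distributed counts per type) can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names, the per-position site densities and the example values are hypothetical, and site types are treated as independent, which the slides themselves note is only approximately true. The sketch uses the "easier form": the product of the per-type tail probabilities minus the finite sum over combinations whose total stays below n.

```python
import math
from itertools import product


def poisson_pmf(k, mean):
    """Poisson probability of exactly k sites when `mean` sites are expected."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def poisson_tail(k_min, mean):
    """Probability of at least k_min sites."""
    return 1.0 - sum(poisson_pmf(k, mean) for k in range(k_min))


def cluster_probability(n, w, densities, k_min):
    """P(at least n sites in total, with at least k_min[i] sites of each type i).

    densities[i] is an assumed expected number of type-i sites per position,
    so the Poisson mean for a window of length w is densities[i] * w.
    """
    means = [d * w for d in densities]
    # Product of tails: every type satisfies its own minimum count.
    all_tails = math.prod(poisson_tail(K, mu) for K, mu in zip(k_min, means))
    # Finite correction: combinations that satisfy the minima but total < n.
    below = 0.0
    for ks in product(*(range(K, n) for K in k_min)):
        if sum(ks) < n:
            below += math.prod(poisson_pmf(k, mu) for k, mu in zip(ks, means))
    return all_tails - below


if __name__ == "__main__":
    # Hypothetical densities for two site types in a 300 bp window.
    print(cluster_probability(n=5, w=300, densities=[0.005, 0.002], k_min=[1, 1]))
```

Because the result is a difference of two numbers close to each other when the cluster is very unlikely, probabilities as small as those quoted for chromosome 21 (around 1e-16) would need a direct summation of the upper tail or higher-precision arithmetic rather than this subtraction.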