Transcript Document
Canadian Bioinformatics Workshops
www.bioinformatics.ca

Module 3
Canadian Bioinformatics Workshops 2008
Inferring Regulatory Mechanisms Governing Sets of Genes
Wyeth W. Wasserman
University of British Columbia
www.cisreg.ca

Deciphering Regulation of Co-Expressed Genes
[Figure: Co-Expressed genes vs. Negative Controls]

Module 3: Overview
Part 1: Overview of transcription
  Lab 3.1: Promoters in Genome Browser (UCSC)
Part 2: Prediction of transcription factor binding sites using binding profiles (“Discrimination”)
  Lab 3.2: TFBS scan (Footer)
Part 3: Interrogation of sets of co-expressed genes to identify mediating transcription factors
  Lab 3.3: TFBS Over-Representation (oPOSSUM)
Part 4: Detection of novel motifs (TFBS) overrepresented in regulatory regions of co-expressed genes (“Discovery”)
  Lab 3.4: Motif Discovery (MEME/STAMP)

Restrictions in Coverage
• Focus on Eukaryotic cells and Pol-II Promoters
• Most principles apply to prokaryotes
• Pol-II ~ protein coding genes
• All references are made to activating sequences
• Information about repression is sparse

Part 1
Introduction to transcription in eukaryotic cells

Transcription Over-Simplified
Three-step Process:
1. TF binds to TFBS (DNA)
2. TF catalyzes recruitment of polymerase II complex
3. Production of RNA from transcription start site (TSS)
[Figure: TF bound to TFBS, Pol-II, TATA, TSS]

Anatomy of Transcriptional Regulation
WARNING: Terms vary widely in meaning between scientists
[Figure: Distal Regulatory Region (TFBS), Proximal Regulatory Region (TFBS), Core Promoter/Initiation Region (Inr) with TATA and TSR, EXON, Distal R.R.]
• Core Promoter – Sufficient for initiation of transcription; orientation dependent
• TSR – transcription start region
  – Refers to a region rather than specific start site (TSS)
• TFBS – single transcription factor binding site
• Regulatory Regions
  • Proximal/Distal – vague reference to distance from TSR
  • May be positive (enhancing) or negative (repressing)
  • Orientation independent (generally)
  • Modules – Sets of TFBS within a region that function together
• Transcriptional Unit
  • DNA sequence transcribed as a single polycistronic mRNA

Complexity in Transcription
[Figure: Chromatin, Distal enhancer, Proximal enhancer, Core Promoter, Distal enhancer]

Lab Discovery of TF Binding Sites
[Figure: LUCIFERASE reporter constructs, one carrying a mutation; axis: Reporter Gene Activity, 0% to 100%]
Identify functional regulatory region within a sequence and delineate specific TFBS through mutagenesis (and in vitro binding studies)

EMSA/Gel Shift Assays to Identify Binding Proteins
[Figure: gel lanes for TF + DNA and DNA; http://www.biomedcentral.com/content/figures/1741-7015-4-28-8.jpg]

High-throughput Methods
• SELEX – mix random ds DNA oligonucleotides with TF protein, recover TF-DNA complexes and sequence DNA
• Protein Binding Arrays – prepare arrays with ds DNA attached, label protein with a fluorescent mark and observe DNA bound by protein
• ChIP – covalently link proteins to DNA in cell, shear DNA, recover protein-DNA complexes and identify DNA (PCR, array or sequencing)

Promoters
• In most vertebrates the delineation of the transcription start position is not easy
• cDNA often incomplete at 5’ end
• Multiple promoters for many genes
• Referencing position relative to the initiation “site” is therefore not a good idea
  – But done almost uniformly in biological papers
• (Translation start equally problematic)
  – Can be in internal exon
  – Multiple start positions common

mRNA Caps for Mapping Initiation Sites
• 5’ end of
mRNAs have a “cap” structure that can be precipitated with an antibody
  – Allows for large-scale sequencing of “full-length” cDNAs and “tags” derived from the 5’ end of mRNAs
  – RIKEN is the leading generator of such sequences
http://departments.oxy.edu/biology/Stillman/bi221/111300/26_18a.GIF

Classes of Initiation Regions
[Figure: initiation sites by Position; Bias: TATA Box (“Selective”) vs. Bias: CpG Island (“Broad”)]
This is over-simplified; see paper for greater detail. The take-home message is that promoters are not drawn from a single continuous distribution of properties, but from at least two classes.
Image from Carninci P, et al (2006). Genome-wide analysis of mammalian promoter architecture and evolution. Nat Genet. Apr 28 PMID: 16645617

CpG Islands
• DNA methylation occurs in competition with histone acetylation
• Acetylation promotes open chromatin structure that is permissive for TF binding to DNA
• Methylation of DNA inhibits histone acetylation
• Certain TFs promote histone acetylation by recruiting acetylases
• Methylation occurs on cytosines
• Preferentially on cytosines adjacent to guanines (CG dinucleotides, generally referred to as CpG)
• Methylated cytosines frequently undergo deamination to form thymine (CpG -> TpG)
• CpG Islands are regions of DNA where CG dinucleotides occur at a frequency consistent with C and G mononucleotide frequencies
• Highlight regions in which histones are acetylated – regions of active transcription

CpG Islands (2)
• It is important to recognize that promoters selectively active after early development will not be acetylated (and hence will be methylated) in the cell divisions preceding the establishment of germ cells, and therefore will not have CpG islands.
• Lists of genes that have higher or lower CpG frequencies than average can misleadingly appear to have TF binding motifs based on this compositional characteristic.

Section 3.1 What have we learned?
• Transcription controlled by regulatory regions
• Regulatory regions can be distant from initiation regions
• Laboratory methods can identify regulatory regions and TF binding sites
• Concept of single initiation site is flawed
• Promoters fall into subclasses
  • CpG vs TATA
  • Can impact assessment of motifs in sets of genes

Part 2
Prediction of TF Binding Sites
Teaching a computer to find TFBS…

Representing Binding Sites for a TF
• A single site
  • AAGTTAATGA
• A set of sites represented as a consensus
  • VDRTWRWWSHD (IUPAC degenerate DNA)
• A matrix describing a set of sites:

A  14 16  4  0  1 19 20  1  4 13  4  4 13 12  3
C   3  0  0  0  0  0  0  0  7  3  1  0  3  1 12
G   4  3 17  0  0  2  0  0  9  1  3  0  5  2  2
T   0  2  0 21 20  0  1 20  1  4 13 17  0  6  4

Logo – A graphical representation of frequency matrix. Y-axis is information content, which reflects the strength of the pattern in each column of the matrix

Set of binding sites
AAGTTAATGA CAGTTAATAA GAGTTAAACA CAGTTAATTA GAGTTAATAA CAGTTATTCA GAGTTAATAA
CAGTTAATCA AGATTAAAGA AAGTTAACGA AGGTTAACGA ATGTTGATGA AAGTTAATGA AAGTTAACGA
AAATTAATGA GAGTTAATGA AAGTTAATCA AAGTTGATGA AAATTAATGA ATGTTAATGA AAGTAAATGA
AAGTTAATGA AAGTTAATGA AAATTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA

Conversion of PFMs to Position Specific Scoring Matrices (PSSM)
Add the following features to the matrix profile:
1. Correct for nucleotide frequencies in genome
2. Weight for the confidence (depth) in the pattern
3.
Convert to log-scale probability for easy arithmetic

pfm
A   5  0  1  0  0
C   0  2  2  4  0
G   0  3  1  0  4
T   0  0  1  1  1

$$\mathrm{pssm}(b,i) = \log\left(\frac{f(b,i) + s(n)}{p(b)}\right)$$

pssm
A   1.6 -1.7 -0.2 -1.7 -1.7
C  -1.7  0.5  0.5  1.3 -1.7
G  -1.7  1.0 -0.2 -1.7  1.3
T  -1.7 -1.7 -0.2 -0.2 -0.2

TGCTG = 0.9

PSSM Scoring Scales
• Raw scores
  • Sum of values from indicated cells of the matrix
• Relative Scores (most common)
  • Normalize the scores to range of 0-1 or 0%-100%
• Empirical p-values
  • Based on distribution of scores for some DNA sequence, determine a p-value (see next slide)

Detecting binding sites in a single sequence
Sp1
ACCCTCCCCAGGGGCGGGGGGCGGTGGCCAGGACGGTAGCTCC

Raw Scores
A C G T [-0.2284 0.4368 [-0.2284 -0.2284 [ 1.2348 1.2348 [ 0.4368 -0.2284 -1.5 -1.5 2.1222 -1.5 -1.5 -1.5 -1.5 1.5128 2.1222 0.4368 -1.5 -0.2284 0.4368 -1.5 -1.5 -0.2284 1.2348 1.5128 0.4368 0.4368 -1.5 -0.2284 -1.5 -0.2284 1.7457 1.7457 0.4368 -1.5 0.4368 -1.5 -1.5 1.7457 ] ] ] ]
Abs_score = 13.4 (sum of column scores)

Relative Scores
Max_score = 15.2 (sum of highest column scores)
Min_score = -10.3 (sum of lowest column scores)
$$\mathrm{Rel\_score} = \frac{\mathrm{Abs\_score} - \mathrm{Min\_score}}{\mathrm{Max\_score} - \mathrm{Min\_score}} \times 100\% = \frac{13.4 - (-10.3)}{15.2 - (-10.3)} \times 100\% = 93\%$$

Empirical p-value
[Figure: Frequency distribution of Scores; x-axis: Relative Score, 0.0 to 1.0; y-axis: Frequency, 0.0 to 0.3]
$$p = \frac{\text{Area to right of value}}{\text{Area under entire curve}}$$

JASPAR: AN OPEN-ACCESS DATABASE OF TF BINDING PROFILES ( jaspar.genereg.net )

The
Good…
• Tronche (1997) tested 50 predicted HNF1 TFBS using an in vitro binding test and found that 96% of the predicted sites were bound!
• Stormo and Fields (1998) found in detailed biochemical studies that the best weight matrices produce scores highly correlated with in vitro binding energy
[Figure: BINDING ENERGY vs. PSSM SCORE]

…the Bad…
• Fickett (1995) found that a profile for the myoD TF made predictions at a rate of 1 per ~500bp of human DNA sequence
  – This corresponds to an average of 20 sites / gene (assuming 10,000 bp as average gene size)

…and the Ugly!
Human Cardiac α-Actin gene analyzed with a set of profiles (each line represents a TFBS prediction)
Futility Conjecture: TFBS predictions are almost always wrong
Red boxes are protein coding exons; TFBS predictions excluded in this analysis

ADVANCED TOPIC
Issues of Column Independence
• PSSM model assumes independence between positions
  • For example, if you observe a G at position 2, the model assumes there is no influence on the likelihood of a T at position 3. This is known to be an incorrect assumption
• Other models can represent dependence
  • Hidden Markov models of Nth order where Nth refers to the number of influencing positions
• For the very few cases where there are hundreds of TFBS known for a TF, there has been only modest improvement in the specificity of TFBS predictions using advanced column inter-dependent models

A Conundrum…
[Figure: PPV vs. THRESHOLD]
• Counter to intuition, the ratio of true positives to predictions fails to improve for “stringent” thresholds
• For most predictive models this ratio would increase
• Why?
• True binding sites are defined by properties not incorporated into the profile scores; above some threshold all sites could be bound if in the right setting

Section 3.2A What have we learned?
• PSSMs accurately reflect in vitro binding properties of DNA binding proteins
• Suitable binding sites occur at a rate far too frequent to reflect in vivo function
• Bioinformatics methods that use PSSMs for binding site studies must incorporate additional information to enhance specificity
  • Unfiltered predictions are too noisy for most applications
  • Organisms with short regulatory sequences are less problematic (e.g. yeast and E. coli)

Using Phylogenetic Footprinting to Improve TFBS Discrimination
70,000,000 years of evolution can reveal regulatory regions

Phylogenetic Footprinting
FoxC2 – a single exon gene
[Figure: percent identity (0% to 80%) by position (0 to 7000) across the FoxC2 gene]
• Align orthologous gene sequences (e.g. LAGAN)
• For the first window of 100 bp of sequence #1, determine the % with identical match in sequence #2
• Step across the first sequence, recording the percentage of identical nucleotides in each window
• Observe that the single exon contains a region of high identity that corresponds to the ORF, with lower identity in the 5’ and 3’ UTRs
• Additional conserved regions could be regulatory regions

Phylogenetic Footprinting (cont)
[Figure: % Identity in 200 bp windows vs. Window Start Position (human sequence); Actin gene compared between human and mouse]

Phylogenetic Footprinting Dramatically Reduces Spurious Hits
[Figure: Human and Mouse; Actin, alpha cardiac]

TFBS Prediction with Human & Mouse Pairwise Phylogenetic Footprinting
[Figure: SELECTIVITY and SENSITIVITY]
• Testing set: 40 experimentally defined sites in 15 well studied genes (Replicated with 100+ site set)
• 75-80% of defined sites detected with conservation filter, while only 11-16% of total predictions retained

1kbp insulin receptor promoter screened with footprinting

Choosing the ”right” species for pairwise comparison...
[Figure: HUMAN compared with CHICKEN, MOUSE and COW]

Multi-species Phylogenetic Footprinting
• PhastCons scores indicate the regions of DNA which are unusual in their sequence composition in some subset of organisms

ConSite

TFBS Discrimination Tools
• Phylogenetic Footprinting Servers
  • FOOTER http://biodev.hgen.pitt.edu/footer_php/Footerv2_0.php
  • CONSITE http://asp.ii.uib.no:8090/cgi-bin/CONSITE/consite/
  • rVISTA http://rvista.dcode.org/
• SNPs in TFBS Analysis
  • RAVEN http://burgundy.cmmt.ubc.ca/cgi-bin/RAVEN/a?rm=home
• Prokaryotes
  • PRODORIC http://prodoric.tu-bs.de/
• Software Packages
  • TOUCAN http://homes.esat.kuleuven.be/~saerts/software/toucan.php
• Programming Tools
  • TFBS http://tfbs.genereg.net/
  • ORCAtk http://burgundy.cmmt.ubc.ca/cgi-bin/OrcaTK/orcatk

Analysis of TFBS with Phylogenetic Footprinting
Scanning a single sequence
Low specificity of profiles:
• too many hits
• great majority not biologically significant
Scanning a pair of orthologous sequences for conserved patterns in conserved sequence regions
A dramatic improvement in the percentage of biologically significant detections

Section 3.2B What have we learned?
• TFBS discrimination coupled with phylogenetic footprinting has greater specificity with tolerable loss of sensitivity
• As with any purification process, some true binding sites will be lost
• Available online resources support phylogenetic footprinting

Laboratory Exercise 3.2
TF Binding Site Prediction

20 minute break
Until 10:50am
Next: Sections 3.3 and 3.4

Part 3: Inferring Regulating TFs for Sets of Co-Expressed Genes

Deciphering Regulation of Co-Expressed Genes
[Figure: Co-Expressed genes vs. Negative Controls]

TFBS Over-representation
• Akin to the GO studies yesterday, we seek to determine if a set of co-expressed genes contains an over-abundance of predicted binding sites for a known TF
• Phylogenetic footprinting to reduce false prediction rate

Two Examples of TFBS Over-Representation
[Figure: Foreground vs. Background; More Total TFBS; More Genes with TFBS]

Statistical Methods for Identifying Over-represented TFBS
• Binomial test (Z scores)
  – Based on the number of occurrences of the TFBS relative to background
  – Normalized for sequence length
  – Simple binomial distribution model
• Fisher exact probability scores
  – Based on the number of genes containing the TFBS relative to background
  – Hypergeometric probability distribution

oPOSSUM Procedure
Set of coexpressed genes
Automated sequence retrieval from EnsEMBL
Phylogenetic Footprinting ORCA
Putative mediating
transcription factors
Statistical significance of binding sites
Detection of transcription factor binding sites

Validation using Reference Gene Sets

A. Muscle-specific (23 input; 16 analyzed)
TF           Rank  Z-score  Fisher
SRF          1     21.41    1.18e-02
MEF2         2     18.12    8.05e-04
c-MYB_1      3     14.41    1.25e-03
Myf          4     13.54    3.83e-03
TEF-1        5     11.22    2.87e-03
deltaEF1     6     10.88    1.09e-02
S8           7     5.874    2.93e-01
Irf-1        8     5.245    2.63e-01
Thing1-E47   9     4.485    4.97e-02
HNF-1        10    3.353    2.93e-01

B. Liver-specific (20 input; 12 analyzed)
TF           Rank  Z-score  Fisher
HNF-1        1     38.21    8.83e-08
HLF          2     11.00    9.50e-03
Sox-5        3     9.822    1.22e-01
FREAC-4      4     7.101    1.60e-01
HNF-3beta    5     4.494    4.66e-02
SOX17        6     4.229    4.20e-01
Yin-Yang     7     4.070    1.16e-01
S8           8     3.821    1.61e-02
Irf-1        9     3.477    1.69e-01
COUP-TF      10    3.286    2.97e-01

TFs with experimentally-verified sites in the reference sets.

Empirical Selection of Parameters based on Reference Studies
[Figure: Z-score (-20 to 40) vs. Fisher p-value (1.0E-09 to 1.0E-01) for the Muscle, Liver and NF-κB reference sets, with the Z-score cutoff and Fisher cutoff marked; labelled TFs: p65, SRF, c-Rel, HNF-1, p50, TEF-1, MEF2, FREAC-2, Myf, cEBP, SP1, HNF-3β]

C-Myc SAGE Data
• c-Myc transcription factor dimerizes with the Max protein
• Key regulator of cell proliferation, differentiation and apoptosis
• Menssen and Hermeking identified 216 different SAGE tags corresponding to unique mRNAs that were induced after adenoviral expression of c-Myc in HUVEC cells
• They then went on to confirm the induction of 53 genes using microarray analysis and RT-PCR

Induced Genes after Ectopic Expression of c-Myc (SAGE)
(53 input; 36 analyzed)
TF         Class             Rank  Z-score  Fisher    No. Genes
Myc-Max    bHLH-ZIP          1     21.68    5.35e-03  7
Staf       ZN-FINGER, C2H2   2     20.17    1.70e-02  2
Max        bHLH-ZIP          3     18.32    2.16e-02  12
SAP-1      ETS               4     13.23    1.61e-04  13
USF        bHLH-ZIP          5     11.90    1.84e-01  16
SP1        ZN-FINGER, C2H2   6     11.68    4.40e-02  12
n-MYC      bHLH-ZIP          7     11.11    1.55e-01  20
ARNT       bHLH              8     11.11    1.55e-01  20
Elk-1      ETS               9     10.92    3.88e-03  19
Ahr-ARNT   bHLH              10    10.17    1.11e-01  25

C-Fos Microarray Experiment
• In a study examining the role of transcriptional repression in oncogenesis, Ordway et al. compared the gene expression profiles of fibroblasts transformed by c-fos to the parental 208F rat fibroblast cell line
• We mapped the list of 252 induced Affymetrix Rat Genome U34A GeneChip sequences to 136 human orthologs

Induced Genes after Ectopic Expression of c-Fos (Affymetrix)
(136 input; 86 analyzed)
TF                Class             Rank  Z-score  Fisher    No. Genes
c-FOS             bZIP              1     17.53    2.60e-05  45
RREB-1            ZN-FINGER, C2H2   2     8.899    1.41e-01  1
PPARgamma-RXRal   NUCLEAR RECEPTOR  3     3.991    2.98e-01  1
CREB              bZIP              4     3.626    1.25e-01  10
E2F               Unknown           5     2.965    7.67e-02  15

Structurally-related TFs with Indistinguishable TFBS
• Ets example

oPOSSUM Server

Section 3.3 What have we learned?
• New generation of tools to help interrogate the meaning of observed clusters of co-expressed genes
• Generally best performance has been with data directly linked to a transcription factor
• Highly dependent on the experimental design – cannot overcome noisy data from poor design (Recall Day 1)
• The identity of a mediating TF may not be apparent when many proteins can bind to the same motif

Laboratory Exercise 3.3
TFBS Over-Representation Analysis

Part 4: de novo Discovery of TF Binding Sites

de novo Pattern Discovery
• String-based
  – e.g.
YMF (Sinha & Tompa)
  – Generalization: Identify over-represented oligomers in comparison of “+” and “-” (or complete) promoter collections
  – Used often for yeast promoter analysis
• Profile-based
  – e.g. AnnSpec (Workman & Stormo) or MEME (Bailey & Elkan)
  – Generalization: Identify strong patterns in “+” promoter collection vs. background model of expected sequence characteristics

Assessing Discovered Patterns
• Strength
• Similarity search

String-based methods (1)
How likely are X words in a set of sequences, given background sequence characteristics?

CCCGCCGGAATGAAATCTGATTGACATTTTCC
TTCAAATTTTAACGCCGGAATAATCTCCTATT
TCGCTGTAACCGGAATATTTAGTCAGTTTTTG
TATCGTCATTCTCCGCCTCTTTTCTT
GCTTATCAATGCGCCCGGAATAAAACGCTATA
CATTGACTTTATCGAATAAATCTGTT
ATCTATTTACAATGATAAAACTTCAA
ATGGTCTCTACCGGAAAGCTACTTTCAGAATT
TTTCAAATCCGGAATTTCCACCCGGAATTACT
TTTCCTTCTTCCCGGAATCCACTTTTTCTTCC
ACTGAACTTGTCTTCAAATTTCAACACCGGAA
TCAATGCCGGAATTCTGAATGTGAGTCGCCCT

>EP71002 (+) Ce[IV] msp-56 B; range -100 to -75
>EP63009 (+) Ce Cuticle Col-12; range -100 to -75
>EP63010 (+) Ce Cuticle Col-13; range -100 to -75
>EP11013 (+) Ce vitellogenin 2; range -100 to -75
>EP11014 (+) Ce vitellogenin 5; range -100 to -75
>EP11015 (-) Ce vitellogenin 4; range -100 to -75
>EP11016 (+) Ce vitellogenin 6; range -100 to -75
>EP11017 (+) Ce calmodulin cal-2; range -100 to -75
>EP63007 (-) Ce cAMP-dep. PKR P1+; range -100 to -75
>EP63008 (+) Ce cAMP-dep. PKR P2; range -100 to -75
>EP17012 (+) Ce hsp 16K-1 A; range -100 to -75
>EP55011 (-) Ce hsp 16K-1 B; range

String-based methods (2)
Find all words of length n in the yeast promoters (e.g.
n=7)

GTCTTTATCTTCAAAGTTGTCTGTCCAAGATTTGGACTTGAAGG
ACAAGCGTGTCTTCTCAGAGTTGACTTCAACGTCCCATTGGAC
GGTAAGAAGATCACTTCTAACCAAAGAATTGTTGCTGCTTTGC
CAACCATCAAGTACGTTTTGGAACACCACCCAAGATACGTTGT
CTTGTTCTCACTTGGGTAGACCAAACGGTGAAAGAAACGAAAA
ATACTCTTTGGCTCCAGTTGCTAAGGAATTGCAATCATTGTTG
GGTAAGGATGTCACCTTCTTGAACGACTGTGTCGGTCCAGAA
GTTGAAGCCGCTGTCAAGGCTTCTGCCCCAGGTTCCGTTATTT
TGTTGGAAAACTGCGTTACCACATCGAAGAAGAAGGTTCCAGA
AAGGTCGATGGTCAAAAGGTCAAGGCTCAAGGAAGATGTTCA
AAAGTTCAGACACGAATTGAGCTCTTTGGCTGATGTTTACATC
ACGATGCCTTCGGTACCGCTCACAGAGCTCACTCTTCTATGGT
CGGTTTCGACTTGCCAACGTGCTGCCGGTTTCTTGTTGGAAAA
GGAATTGAAGTACTTCGGTAAGGCTTTGGAGAACCCAACCAG
ACCATTCTTGGCCATCTTAGGTGGTGCCAAGGTTGCTGACAAG
ATTCAATTGATTGACAACTTGTTGGACAAGGTCGACTCTATCAT
CATTGGTGGTGGTATGGCTTTCCCTTCAAGAAGGTTTTGGAAA
ACACTGAAATCGGTGACTCCATCTTCGACAAGGCTGGTGCTG
AAATCGTTCCAAAGTTGATGGAAAAGGCCAAGGCCAAGGGTG
TCGAAGTCGTCTTGCAGTCGACTTCATCATTGCTGATGCTTTC
TCTGCTGATGCCAACACCAAGACTGTCACTGACAAGGAAGGT
ATTCCAGCTGGCTGGCAAGGGTTGGACAATGGTCCAGAATCT
AGAAAGTGTTTGCTGCTACTGTTGCAAAGGCTAAGACCATTGT
CTGGAACGGTCCACCAGGTGTTTTCGAATTCGAAAAGTTCGCT
GCTGGTACTAAGGCTTTGTTAGACGAAGTTGTCAAGAGCTCTG
CTGCTGGTAACACCGTCATCATTGGTGGTGGTGACACTGCCA

Make a lookup table:
AAACCTTT  456
TTTTTTTT  57788
GATAGGCA  589
Etc...

String-based methods (3)
Xw: Instances of a word w within our set of X genes
E[Xw]: Average number of instances of w based on number of genes in our set
Var[Xw]: Variance – how much deviation from the average is expected for w
$$Z_w = \frac{X_w - E[X_w]}{\sqrt{\mathrm{Var}[X_w]}}$$

Limitations of String-based Methods
• Longer word lengths not possible
• While degeneracy codes can be used, TFBS are not words – we lose quantitation for variable positions with consensus sequences
  • Imagine a column in a PFM with 7 A’s and 1 T: in a consensus sequence we would represent it as W or throw out the instance with T
• Recently the string-based method has found renewed utility in the analysis of 3’UTRs for the presence of microRNA target sequences...
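
The word-counting and Z-score steps from String-based methods (2) and (3) can be sketched in Python as below. This is only a minimal sketch under stated assumptions, not the YMF implementation: `count_words` and `word_z_scores` are hypothetical helper names, E[Xw] and Var[Xw] are taken from a simple binomial model whose per-position word probability is estimated from a background set with add-one smoothing, and word self-overlap is ignored. The toy sequences are invented for illustration.

```python
from collections import Counter
from math import sqrt


def count_words(seqs, n):
    """Count every overlapping word of length n; return (counts, number of word positions)."""
    counts = Counter()
    total = 0
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
            total += 1
    return counts, total


def word_z_scores(foreground, background, n):
    """Z_w = (X_w - E[X_w]) / sqrt(Var[X_w]) for every word seen in the foreground."""
    fg_counts, fg_total = count_words(foreground, n)
    bg_counts, bg_total = count_words(background, n)
    scores = {}
    for word, x_w in fg_counts.items():
        # Assumed background model: per-position probability of the word,
        # estimated from the background set with add-one smoothing.
        p_w = (bg_counts[word] + 1) / (bg_total + 4 ** n)
        expected = fg_total * p_w               # E[X_w] under the binomial model
        variance = fg_total * p_w * (1 - p_w)   # Var[X_w] under the binomial model
        scores[word] = (x_w - expected) / sqrt(variance)
    return scores


# Hypothetical toy data, not taken from the slides.
foreground = ["TTCCGGAATT", "AACCGGAATA", "GGCCGGAATC"]
background = ["ATATATATATAT", "TTTTAAAACCCC", "GATTACAGATTA"]
scores = word_z_scores(foreground, background, 7)
top_word = max(scores, key=scores.get)
print(top_word)
```

A more faithful background would use a Markov model of the promoter collection; the binomial shortcut is chosen here only to keep the arithmetic visible.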

microRNA Target Sequences
• Lim et al expressed miRNAs in cells and observed that the overall pattern of gene expression shifted toward the pattern of expression observed in cells which naturally express the miRNA
• The genes with reduced expression in response to miRNA exposure shared 7nt motifs in the 3’UTR of their transcripts
• Nice website tutorial:
  • http://www.ambion.com/main/explorations/mirna.html

Probabilistic Methods for Pattern Discovery
• What is a probabilistic method?
• The Gibbs sampler algorithm

Probabilistic Methods
Overview:
Find a local alignment of width x of sites that maximizes information content (or related measure) in reasonable time
Usually by Gibbs sampling or EM methods
Motivation:
TFBS are not words
Efficiency – can handle longer patterns than string-based methods
Can be intentionally influenced to reflect prior knowledge

What does probabilistic mean?
• Based on probability
• Functionally, it means we’re going to guess our way to a good pattern (TFBS)
• We’re going to try to make a good guess
• Two different flavours of the approach
  – Expectation Maximization in which we try to make the best guess each time
  – Gibbs Sampling in which we make our guesses based on the strength of our conviction

Gibbs Sampling
Two data structures used:
1) Current pattern nucleotide frequencies $q_{i,1}, \dots, q_{i,4}$ and corresponding background frequencies $p_{i,1}, \dots, p_{i,4}$
   tgacttcc
   tgatctct
   agacctca
   tgacctct
2) Current positions of site startpoints in the N sequences $a_1, \dots, a_N$, i.e. the alignment that contributes to $q_{i,j}$. One starting point in each sequence is chosen randomly initially.

Iterations in Gibbs Sampling
Remove one sequence z from the set. Update the current pattern according to
$$q_{i,j} = \frac{c_{i,j} + b_j}{N - 1 + B}$$
$b_j$: Pseudocount for symbol j
$B$: Sum of all pseudocounts in column
[Figure: sequence z set apart from the alignment tgacttcc, tgatctct, agacctca, tgacctct]
’Score’ the current pattern against each possible occurrence $a_k$ in z.
Draw a new $a_k$ with probabilities based on respective score divided by the background model

Gibbs Sampling (grossly over-simplified)
ttcgctcc
cgatacgc
tgctacct
tgacttcc
agacctca
ctgtagtg
acgcatct

    1  2  3  4  5  6  7  8
A   2  0  2  2  2  1  0  1
C   0  2  3  3  2  1  6  2
G   0  4  1  0  1  0  1  1
T   4  1  1  2  2  5  0  2

Pattern Discovery
• Gibbs sampling is guaranteed to return an optimal pattern if repeated sufficiently often
• Procedure is fast, so running many 1000s of times is feasible
• Unfortunately, we have a problem…what if the mediating TFBS are not strongly overrepresented relative to other patterns…

Applied Pattern Discovery is Acutely Sensitive to Noise
[Figure: PATTERN SIMILARITY vs. TRUE MEF2 PROFILE (10 to 18) against SEQUENCE LENGTH (0 to 600) for True Mef2 Binding Sites; pink line is negative control with no Mef2 sites included]

Four Approaches to Improve Sensitivity
• Better background models
  – Higher-order properties of DNA
• Phylogenetic Footprinting
  – Human:Mouse comparison eliminates ~75% of sequence
• Regulatory Modules
  – Architectural rules
• Limit the types of binding profiles allowed
  – TFBS patterns are NOT random

Pattern Discovery Summary
• Pattern discovery methods can recover overrepresented patterns in the promoters of coexpressed genes
• Methods are acutely sensitive to noise, indicating that the signal we seek is weak
  • TFs tolerate great variability between binding sites
• As for pattern discrimination, supplementary information/approaches are required to overcome the noise

Laboratory Exercise 3.4
Motif Discovery

REFLECTIONS
• Part 2
  – Futility Theorem – Essentially predictions of individual TFBS have no relationship to an in vivo function
  – Successful bioinformatics methods for site discrimination incorporate additional information (clusters, conservation)
• Part 3
  – TFBS over-representation is a powerful new means to identify TFs likely to contribute to observed patterns of coexpression
•
Part 4
  – Pattern discovery methods are severely restricted by the Signal-to-Noise problem
    • Observed patterns must be carefully considered
  – Successful methods for pattern discovery will have to incorporate additional information (conservation, structural constraints on TFs)

THE END
• Questions before the break?
• Lab exercises address Sections 2 and 3
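
For the Section 2 lab material, the PFM-to-PSSM conversion and the raw and relative scoring scales from Part 2 can be sketched in Python as below. This is a rough sketch rather than the method used by any particular tool: the function names are hypothetical, and the pseudocount (a total of sqrt(N) per column, split by background frequency), the uniform background of 0.25 and the base-2 logarithm are all assumptions, since the slide does not state them. With these assumed settings the values round to those shown on the conversion slide, for example 1.6 for A at position 1 and 0.9 for TGCTG. The PFM is the five-position example from that slide, read column by column.

```python
from math import log2, sqrt

BASES = "ACGT"

# PFM from the PFM-to-PSSM slide: one list per base, one entry per position.
PFM = {
    "A": [5, 0, 1, 0, 0],
    "C": [0, 2, 2, 4, 0],
    "G": [0, 3, 1, 0, 4],
    "T": [0, 0, 1, 1, 1],
}


def pfm_to_pssm(pfm, background=None):
    """Convert counts to log2-odds scores using an assumed sqrt(N) pseudocount."""
    background = background or {b: 0.25 for b in BASES}
    width = len(pfm["A"])
    pssm = {b: [] for b in BASES}
    for i in range(width):
        n = sum(pfm[b][i] for b in BASES)        # number of sites in this column
        for b in BASES:
            pseudo = sqrt(n) * background[b]     # assumed pseudocount s(n)
            prob = (pfm[b][i] + pseudo) / (n + sqrt(n))
            pssm[b].append(log2(prob / background[b]))
    return pssm


def raw_score(pssm, site):
    """Sum the matrix cell selected by each base of the site."""
    return sum(pssm[base][i] for i, base in enumerate(site.upper()))


def relative_score(pssm, site):
    """Rescale the raw score to 0..1 between the lowest and highest possible scores."""
    width = len(pssm["A"])
    max_score = sum(max(pssm[b][i] for b in BASES) for i in range(width))
    min_score = sum(min(pssm[b][i] for b in BASES) for i in range(width))
    return (raw_score(pssm, site) - min_score) / (max_score - min_score)


pssm = pfm_to_pssm(PFM)
print(round(raw_score(pssm, "TGCTG"), 1))
print(round(relative_score(pssm, "TGCTG"), 2))
```

Multiplying the relative score by 100 gives the percentage scale used on the Sp1 slide.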