Transcription factor binding sites and gene regulatory network
Victor Jin
Department of Biomedical Informatics
The Ohio State University

Transcription in higher eukaryotes
[Figure]

Gene Expression
1. Chromatin structure
2. Initiation of transcription
3. Processing of the transcript
4. Transport to the cytoplasm
5. mRNA translation
6. mRNA stability
7. Protein activity stability

Transcriptional Regulation
[Figure labels: nuclear membrane; binding site/motif CCG__CCG; genome-wide mRNA transcript data (e.g. microarrays)]
Learning problems:
• Understand which regulators control which target genes
• Discover motifs representing regulatory elements

Some common approaches
• Cluster-first motif discovery
– Cluster genes by expression profile, annotation, … to find potentially coregulated genes
– Find overrepresented motifs in promoter sequences of similar genes (algorithms: MEME, Consensus, Gibbs sampler, AlignACE, …)
(Spellman et al. 1998)

Training data
– Features
[Figure labels: regulator expression; promoter sequence; label; feature vector]

What is PWM?
Transcription factor binding sites (TFBSs) are usually slightly variable in their sequences. A positional weight matrix (PWM) specifies the probability that you will see a given base at each index position of the motif.

PWM for ERE
Pos   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
A    18   8   5   4   1  29   7   7   7   0   1  39   1   1   6
C     8   3   3   9  33   4  21  15  14   0   0   1  43  39  18
G    13  31  34   9   8  10  11  15  19   4  44   3   0   1   6
T     7   4   4  24   4   3   7   9   6  42   1   3   2   5  16
Con   N   G   G   T   C   A   N   N   N   T   G   A   C   C   N

Aligned sites:
 1. acggcagggTGACCc
 2. aGGGCAtcgTGACCc
 3. cGGTCGccaGGACCt
 4. tGGTCAggcTGGTCt
 5. aGGTGGcccTGACCc
 6. cTGTCCctcTGACCc
 7. aGGCTAcgaTGACGt
 ...
41. cagggagtgTGACCc
42. gagcatgggTGACCa
43. aGGTCAtaacgattt
44. gGAACAgttTGACCc
45. cGGTGAcctTGACCc
46. gGGGCAaagTGACTg
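
A minimal sketch of how the ERE count table on the "What is PWM?" slide relates to per-position base probabilities and to its consensus (Con) row. The names COUNTS, column_probabilities and consensus are illustrative, and the "more than 50% of sites" rule for calling a consensus base is an assumption: the slide does not state its rule, although this one reproduces the Con row.

```python
# ERE count table from the "What is PWM?" slide: COUNTS[base][i] is the number
# of aligned sites with that base at position i + 1 (15 positions, 46 sites).
BASES = "ACGT"
COUNTS = {
    "A": [18, 8, 5, 4, 1, 29, 7, 7, 7, 0, 1, 39, 1, 1, 6],
    "C": [8, 3, 3, 9, 33, 4, 21, 15, 14, 0, 0, 1, 43, 39, 18],
    "G": [13, 31, 34, 9, 8, 10, 11, 15, 19, 4, 44, 3, 0, 1, 6],
    "T": [7, 4, 4, 24, 4, 3, 7, 9, 6, 42, 1, 3, 2, 5, 16],
}


def column_probabilities(counts):
    """Normalize each column of a count matrix so that it sums to 1."""
    length = len(counts["A"])
    probs = []
    for i in range(length):
        total = sum(counts[b][i] for b in BASES)
        probs.append({b: counts[b][i] / total for b in BASES})
    return probs


def consensus(counts, threshold=0.5):
    """Call a base where its frequency exceeds `threshold`, otherwise 'N'.

    The threshold is an assumption made for this sketch, not taken from the slide.
    """
    letters = []
    for column in column_probabilities(counts):
        best = max(BASES, key=lambda b: column[b])
        letters.append(best if column[best] > threshold else "N")
    return "".join(letters)


if __name__ == "__main__":
    probs = column_probabilities(COUNTS)
    print(round(probs[1]["G"], 2))  # 31 of 46 sites have G at position 2 -> 0.67
    print(consensus(COUNTS))        # NGGTCANNNTGACCN
```

Each column sums to 46, the number of aligned sites, so dividing by the column total gives the probability of seeing each base at that position.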
Position frequency matrix (PFM) (also known as raw count matrix)
Given N sequence fragments of fixed length, one can assemble a position frequency matrix (number of times a particular nucleotide appears at a given position). A normalized PFM, in which each column adds up to a total of one, is a matrix of probabilities for observing each nucleotide at each position.

Position weight matrix (PWM) (also known as position-specific scoring matrix)
PFM should be converted to log-scale for efficient computational analysis. To eliminate null values before log-conversion, and to correct for small samples of binding sites, a sampling correction, known as pseudocounts, is added to each cell of the PFM.

Position Weight Matrix for ERE
Converting a PFM into a PWM

PFM:
Pos   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
A    18   8   5   4   1  29   7   7   7   0   1  39   1   1   6
C     8   3   3   9  33   4  21  15  14   0   0   1  43  39  18
G    13  31  34   9   8  10  11  15  19   4  44   3   0   1   6
T     7   4   4  24   4   3   7   9   6  42   1   3   2   5  16

For each matrix element do:

$$p(b, i) = \frac{f_{b,i} + \sqrt{N}/4}{N + \sqrt{N}}$$

$$w(b, i) = \log_2 \frac{p(b, i)}{p(b)}$$

$f_{b,i}$ – raw count (PFM matrix element) of nucleotide b in column i
$N$ – number of sequences used to create PFM (= column sum)
$\sqrt{N}$ and $\sqrt{N}/4$ – pseudocounts (correction for small sample size)
$p(b)$ – background frequency of nucleotide b

PWM:
Pos      1      2      3      4      5      6      7      8      9     10     11     12     13     14     15
A     0.58  -0.44  -0.98  -1.21  -2.29   1.22  -0.60  -0.60  -0.60  -2.96  -2.29   1.62  -2.29  -2.29  -0.72
C    -0.44  -1.49  -1.49  -0.30   1.39  -1.21   0.78   0.34   0.25  -2.96  -2.96  -2.29   1.76   1.62   0.46
G     0.16   1.31   1.44  -0.30  -0.44  -0.17  -0.06   0.34   0.65  -1.21   1.79  -1.49  -2.96  -2.29  -0.64
T    -0.60  -1.21  -1.21   0.96  -1.21  -1.49  -0.60  -0.30  -0.78   1.73  -2.29  -1.49  -1.84  -0.98   0.23

Scoring putative EREs by scanning the promoter with PWM
Site: GGGTCAGCATGGCCA, scored against the ERE PWM above
Absolute score of the site:

$$S = \sum_{i=1}^{m} w(b, i) = 11.57$$

Row      1      2      3      4      5      6      7      8      9     10     11     12     13     14     Sum
Max   0.58   1.31   1.44   0.96   1.39   1.22   0.78   0.34   0.65   1.73   1.79   1.62   1.76   1.62   17.20
Min  -0.60  -1.49  -1.49  -1.21  -2.29  -1.49  -0.60  -0.60  -0.78  -2.96  -2.96  -2.29  -2.96  -2.29  -24.02

$$\text{relative\_score} = \frac{\text{Absolute\_score} - \text{Minimum\_score}}{\text{Maximum\_score} - \text{Minimum\_score}} = \frac{11.57 + 24.02}{17.20 + 24.02} = 0.86$$

Yeast ESR: Biological Validation
[Figure labels: universal stress repressor motif; STRE element]

Previous work: “Structure learning”
• Graphical models (and other methods)
– Learn structure of “regulatory network”, “regulatory modules”, etc.
– Fit interpretable model to training data
– Model small number of genes or clusters of genes
– Many computational and statistical challenges; often used for qualitative hypotheses rather than prediction
(Pe’er et al. 2001) (Segal et al. 2003, 2004)

Signaling networks in a cell
[Figure]

Network inference
• Regulator-motif associations in nodes can have different meanings:
– Direct binding (P, Mp)
– Indirect effect (P, TF, MTF)
– Co-occurrence (P, Mp, M)
• Need other data to confirm binding relationship between regulator and target (e.g. ChIP-chip)
• Still, can determine statistically significant regulator-target relationships from regulation program

Example: oxygen sensing and regulatory network
[Figure]

Binding data for regulatory networks
• ChIP-chip: genome-wide protein-DNA binding data, i.e. what promoters are bound by TF?
• Investigate regulatory network model: use ChIP-chip data in place of motifs (no motif discovery)
– Features: (regulator, TF-occupancy) pairs
[Figure labels: P1, P2, TF]

Inferring regulatory networks from the combination of expression data and binding data

An extended ER regulatory network in MCF7 cells
[Figure; node labels as extracted: RUVBL1 GTF2I CCNL1 RFC1 RXRA MKL2 BHLHB2 BAZ1B HEY2 ZNF394 TTF2 RAB18 ASCC3 STRAP ZNF50 0 XBP1 ER TLE3 FOXP4 BRIP1 PAWR FOS HDAC1 TBX2 MYC ELF3 DDX20 NRIP1 LASS2 THRAP1 ZBTB41 TXNIP PNN CEBP VPS72 ZNF23 BRF1 9 HSF2 MSX2 HIF1A ZNF38 DNMT1 ZKSCAN1 C140RF43 PURB ADAR CUTL1 IVNS1ABP CHAF1B BATF CSDE1 SP3 TXNDC HES1]

Signaling molecules -- Networks
• Find all SMs that associate as regulators with a particular TF’s ChIP occupancy in ADT features
• e.g. TF: Hsf1; SM mRNA: Gac1, Sds22, Gip1 (Glc7 phosphatase complex)
• Hypothesis: Glc7 phosphatase complex interacts with Hsf1 in regulation of Hsf1 targets (Interaction supported in literature)

http://motif.bmi.ohio-state.edu/ChIPMotifs/
Input Data
• FASTA file
• Contact Info
• Control data (optional)
Ab initio Motif Discovery Programs
• Weeder
• MaMf
• MEME
Statistical Methods
• Bootstrap re-sampling
• Fisher test
STAMP Matching
Results
• SeqLog
• PWM
• P-value
• Known or novel motifs

http://motif.bmi-ohio-state.edu/HRTBLDb

Software Demo
• W-ChIPMotifs
• HRTargetDB
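
As a worked check on the ERE slides, the PFM-to-PWM conversion and the absolute and relative site scores can be sketched as below. This is a sketch under stated assumptions, not the author's implementation: a uniform background of 0.25 per base and a pseudocount of sqrt(N)/4 per cell are inferred from the slide's legend and reproduce its PWM values (for example 0.58 and -2.96). The slide's reported totals (11.57, 17.20, -24.02) match sums over positions 1 to 14, so the count matrix and the site are truncated to 14 positions here. All function names are illustrative.

```python
import math

BASES = "ACGT"

# ERE counts from the slides, truncated to positions 1-14 (assumption: the
# slide's score totals correspond to these 14 positions).
COUNTS = {
    "A": [18, 8, 5, 4, 1, 29, 7, 7, 7, 0, 1, 39, 1, 1],
    "C": [8, 3, 3, 9, 33, 4, 21, 15, 14, 0, 0, 1, 43, 39],
    "G": [13, 31, 34, 9, 8, 10, 11, 15, 19, 4, 44, 3, 0, 1],
    "T": [7, 4, 4, 24, 4, 3, 7, 9, 6, 42, 1, 3, 2, 5],
}


def pfm_to_pwm(counts, background=0.25):
    """Convert raw counts to log2-odds weights with sqrt(N) pseudocounts.

    `background` is an assumed uniform base frequency.
    """
    n = sum(counts[b][0] for b in BASES)  # column sum = number of sites
    root = math.sqrt(n)
    return {
        b: [math.log2((f + root / 4) / (n + root) / background) for f in counts[b]]
        for b in BASES
    }


def absolute_score(pwm, site):
    """Sum the weight of the observed base at each position of the site."""
    return sum(pwm[base][i] for i, base in enumerate(site))


def score_bounds(pwm):
    """Return (minimum, maximum) achievable absolute score for this PWM."""
    length = len(pwm["A"])
    lo = sum(min(pwm[b][i] for b in BASES) for i in range(length))
    hi = sum(max(pwm[b][i] for b in BASES) for i in range(length))
    return lo, hi


def relative_score(pwm, site):
    """Rescale the absolute score to the range 0..1."""
    lo, hi = score_bounds(pwm)
    return (absolute_score(pwm, site) - lo) / (hi - lo)


def scan(pwm, sequence):
    """Yield (start, window, relative score) for every window of PWM width.

    Assumes `sequence` contains only uppercase A, C, G, T.
    """
    width = len(pwm["A"])
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        yield start, window, relative_score(pwm, window)


if __name__ == "__main__":
    pwm = pfm_to_pwm(COUNTS)
    site = "GGGTCAGCATGGCC"  # first 14 bases of the site on the scoring slide
    print(round(absolute_score(pwm, site), 2))  # 11.57
    print(round(relative_score(pwm, site), 2))  # 0.86
```

In practice a promoter would be passed to scan and windows kept only above a chosen relative-score cutoff; the cutoff itself is not given in the slides.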