Promoter identification

COMPUTATIONAL ANALYSIS OF PROMOTERS

Gene regulation

• Genomes usually contain several thousands of different genes.

• Some gene products are required by the cell under all growth conditions; their genes are called housekeeping genes.
  • e.g. genes for DNA polymerase, RNA polymerase, rRNA, tRNA, …

• Many other gene products are required only under specific growth conditions.
  • e.g. enzymes responding to a specific environmental condition such as DNA damage

Gene regulation

• Housekeeping genes must be expressed at some level all of the time.
  • Frequently, as the cell grows faster, more of the housekeeping gene products are needed.

• The gene products required for specific growth conditions are not needed all of the time.
  • These genes are frequently expressed at extremely low levels, or not at all, when they are not needed, and are switched on when they are needed.

• Gene expression must therefore be regulated so that the genes being expressed meet the needs of different cell types, developmental stages, or external conditions.

Gene regulation

Gene regulation basically occurs at three different places:

• transcriptional regulation
  • transcription of the gene is regulated
  • control of transcription initiation – most important control mechanism

• translational regulation
  • translation of the gene is regulated
  • how often the mRNA is translated influences the amount of gene product that is made

• post-transcriptional/post-translational regulation
  • regulation of gene products after they are completely synthesized, e.g. degradation, chemical modifications (methylation, phosphorylation)

Transcriptional regulation

• Transcription control has two key features:
  1. Protein-binding regulatory DNA sequences (control elements) are associated with genes.
  2. Specific proteins that bind to these regulatory sequences determine where transcription will start, and either activate or repress it.

• The DNA sequence specifying where RNA polymerase binds and initiates transcription of a gene is called a promoter.

• Transcription from a particular promoter is controlled by DNA-binding proteins termed transcription factors.

• The DNA control elements that bind transcription factors may be located very far from the promoter they regulate.

• As a result of this arrangement, transcription from a single promoter may be regulated by binding of multiple transcription factors to alternative control elements, permitting complex control of gene expression.

Three different polymerases

• RNA polymerase I synthesizes rRNA.

• RNA polymerase II synthesizes mRNA.

• RNA polymerase III synthesizes small RNAs and tRNA.

source: Molecular Biology of the Cell. 4th edition. Alberts B

Three parts of promoter

• core promoter
  • responsible for actual binding of the transcription apparatus
  • very close upstream (~35 bp) of the transcriptional start site (TSS); may also extend downstream, see later

• proximal promoter
  • contains several regulatory elements
  • a few hundred bases upstream of the TSS

• distal promoter
  • contains enhancers (upstream/downstream) and silencers

• These elements are cis-acting: a cis-element regulates a gene on the same DNA molecule.

• cis-acting sequences are bound by trans-acting (i.e. acting from a different molecule) regulatory proteins.

• However, the distinction between proximal elements and enhancers/silencers is not very clear.

Core promoter

• Eukaryotic RNA polymerase II (RNAPII) cannot initiate transcription by itself in vitro.

• It needs to be supplemented by general (basal) transcription factors (GTFs).
  • Factors are named TFIIX, where X is a letter, e.g. TFIIA, TFIIB, …

• RNAPII + TFs form the pre-initiation complex (PIC). Only then can transcription commence.

• minimal (core) promoter – DNA sequence sufficient for assembly of the pre-initiation complex.

• Transcription initiated by the core promoter is called basal transcription.

Core promoter elements

• The core promoter is usually located close to or overlapping the TSS.

• It contains several sequence motifs with which TFs interact in a sequence-specific manner.

• The combination of TF-binding motifs varies from gene to gene.

Core promoter elements

• TATA box … ~30 bp upstream, consensus TATA(A/T)A(A/T)

• Instead of a TATA box, some eukaryotic (TATA-less) genes contain an initiator (Inr) … surrounds the TSS, extremely degenerate consensus sequence YYAN(T/A)YYY (A – TSS, N – any nucleotide)

• Promoters with both TATA and Inr also exist.

• DPE (downstream promoter element) … present in some TATA-less, Inr-containing (TATA−, Inr+) promoters, 30 bp downstream.
  • consensus: RGWCGTG (W = A or T)

Butler JE, Kadonaga JT. The RNA polymerase II core promoter: a key component in the regulation of gene expression. Genes Dev. 2002; (20):2583-92.

Promoter proximal elements

• Found within 100 to 200 bp of the TSS.

• CAAT (CCAAT, CAT) box … consensus GGCCAATCT

• GC box … consensus (G/T)(G/A)GGCG(G/T)(G/A)(G/A)(C/T)
  • It is a GC-rich segment.
  • A promoter may contain multiple GC boxes; such promoters usually lack a TATA box.

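A hypothetical mammalian promoter region

[Figure: schematic of a hypothetical mammalian promoter region. Labelled elements: enhancers 10–50 kb upstream (−10 to −50 kb) and downstream (+10 to +50 kb) of the gene, promoter-proximal elements around −200, TATA box at −30, transcription start at +1, followed by exons and introns.]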

CpG island

• Transcription of genes with TATA/Inr promoters begins at a well-defined site.

• However, transcription of many protein-coding genes has been shown to begin at any one of multiple possible sites over an extended region 20–200 bp long.
  • As a result, such genes give rise to mRNAs with multiple alternative 5' ends.

• These are housekeeping genes; they contain neither a TATA box nor an Inr.

• Most genes of this type contain a CG-rich stretch of several hundred nucleotides, a CpG island, within ≈100 base pairs upstream of the TSS.

• CpG islands are typical for vertebrates (including human). They are not common in lower eukaryotes.

CpG island

[Figure: a CpG island lying ~100 bp upstream of the mRNA, with multiple 5' start sites.]

• Computational analysis is based on CG dinucleotide imbalance.

• The expected CpG frequency assumes independent bases: expected p(CG) = p(C) · p(G).

• length ≥ 200 bp, C+G content min 50%, CpG observed / CpG expected > 0.60
  M. Gardiner-Garden, M. Frommer, CpG islands in vertebrate genomes, J. Mol. Biol. 1987, 196, 261-282.

• length ≥ 500 bp, C+G content min 55%, CpG observed / CpG expected > 0.65
  D. Takai, P. A. Jones, Comprehensive analysis of CpG islands in human chromosomes 21 and 22, PNAS 2002, 3740-45.

CpG island

Example: len = 251, #C = 76, #G = 101, #CG = 30

p(C) = 76/251, p(G) = 101/251, p(CG) = 30/251

C+G content = p(C) + p(G) = 0.71, CpG o/e = p(CG) / (p(C) · p(G)) = 0.98

• Simple methods based on the frequency of CG perform remarkably well at correctly predicting regions containing TSSs.

• EMBOSS CpGPlot/CpGReport: http://www.ebi.ac.uk/Tools/emboss/cpgplot/

• CpG Island Searcher: http://cpgislands.usc.edu/ (IE only)
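The CpG island criteria above can be checked with a few lines of code. The following is a minimal sketch, not the implementation used by CpGPlot or CpG Island Searcher; the function names and the default thresholds (taken from the Gardiner-Garden and Frommer criteria) are assumptions made for illustration. It follows the slide convention p(CG) = #CG / len.

```python
def cpg_obs_exp(n_c, n_g, n_cg, length):
    """CpG observed/expected ratio from raw counts: p(CG) / (p(C) * p(G))."""
    p_c, p_g, p_cg = n_c / length, n_g / length, n_cg / length
    if p_c == 0 or p_g == 0:
        return 0.0  # no C or no G: a CpG cannot occur
    return p_cg / (p_c * p_g)


def cpg_stats(seq):
    """Return (C+G content, CpG observed/expected) for a DNA string."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    n_c, n_g, n_cg = seq.count("C"), seq.count("G"), seq.count("CG")
    gc_content = (n_c + n_g) / len(seq)
    return gc_content, cpg_obs_exp(n_c, n_g, n_cg, len(seq))


def is_cpg_island(seq, min_len=200, min_gc=0.50, min_oe=0.60):
    """Apply length, C+G content and CpG o/e thresholds to one sequence."""
    if len(seq) < min_len:
        return False
    gc_content, oe = cpg_stats(seq)
    return gc_content >= min_gc and oe > min_oe


# Counts from the worked example: #C = 76, #G = 101, #CG = 30, len = 251
print(round(cpg_obs_exp(76, 101, 30, 251), 2))  # 0.98
```

Passing `min_len=500, min_gc=0.55, min_oe=0.65` gives the stricter Takai and Jones criteria. A real island finder would slide this test along the genome in windows; that step is left out here.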

Promoter regions in human genes

Suzuki Y et al., Identification and characterization of the potential promoter regions of 1031 kinds of human genes. Genome Res. 2001; (5):677-84.

Element             Promoters
TATA                32%
Inr                 85%
GC box              97%
CAAT box            64%
located in CpG      48%

Combination         Promoters
TATA+ Inr+          28%
TATA+ Inr−           4%
TATA− Inr+          56%
TATA− Inr−          12%

Computational analysis of promoters

Introduction

• Regulatory regions typically contain several transcription factor binding sites strung out over a large region.

• Which particular factor is used depends not only on the binding site, but also on which factors are available for binding in a given cell type at a given time.

• Any given gene will typically have its very own pattern of binding sites for transcriptional activators and repressors, ensuring that the gene is only transcribed in the proper cell type(s) and at the proper time during development.

Introduction

• Transcription factors themselves are also subject to similar transcriptional regulation, thereby forming transcriptional cascades and feed-back control loops.

• While this is all very nice and interesting from a biologist’s point of view, it spells big trouble for promoter prediction.

Computational difficulties

• There are thousands of transcriptional regulators, many of which have recognition sequences that are not yet characterized.

• Any given sequence element might be recognized by different factors in different cell types.

• Core promoter regulatory elements are short and not completely conserved ⟹ similar elements will be found purely by chance all over the genome.

What do promoter prediction methods actually predict?

• The transcription start site (TSS): the first nucleotide copied at the 5' end of the corresponding mRNA.

• The region around the TSS is often referred to as the core promoter.

• Owing to the strong link between the TSS and the core promoter, these terms are often used interchangeably.

• Promoter prediction methods use three distinct types of features:
  1. signal features
  2. context features
  3. structure features

Evaluating predictions

• sensitivity (Se), recall, TPR
  • proportion of correct predictions of TSSs relative to all experimental TSSs
  • Se = TP / (TP + FN)

• positive predictive value (PPV), precision
  • proportion of correct predictions of TSSs out of all counted positive predictions
  • PPV = TP / (TP + FP)

Evaluating predictions

• How are TP, FP and FN obtained?

• Take a gene sequence with a known TSS location and make a prediction.
  • If the prediction falls within the region [−2000, +2000] relative to the annotated TSS, it is a TP.
  • Predictions falling into the annotated part of the gene within [+2001, EndOfGene] are FPs.
  • If no promoter is predicted for this gene sequence, it is an FN.
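The counting rules above can be sketched in code as follows. This is an illustrative sketch under stated assumptions, not a standard benchmark script. Positions are given relative to the annotated TSS, several predictions inside the ±2000 window count as a single TP, and predictions upstream of −2000 are ignored because the rules above do not cover them. All function names are hypothetical.

```python
def count_outcomes(predictions, gene_end, window=2000):
    """Return (TP, FP, FN) for one gene.

    predictions: predicted TSS positions relative to the annotated TSS
    gene_end:    end of the annotated gene, relative to the annotated TSS
    """
    in_window = [p for p in predictions if -window <= p <= window]
    in_gene_body = [p for p in predictions if window < p <= gene_end]
    tp = 1 if in_window else 0        # assumption: at most one TP per gene
    fp = len(in_gene_body)            # every hit in [+2001, EndOfGene] is an FP
    fn = 1 if not predictions else 0  # FN only when nothing was predicted
    return tp, fp, fn


def sensitivity(tp, fn):
    """Se = TP / (TP + FN)."""
    return tp / (tp + fn) if tp + fn else 0.0


def ppv(tp, fp):
    """PPV = TP / (TP + FP)."""
    return tp / (tp + fp) if tp + fp else 0.0


def evaluate(genes):
    """genes: iterable of (predictions, gene_end) pairs. Returns (TP, FP, FN)."""
    tp = fp = fn = 0
    for predictions, gene_end in genes:
        g_tp, g_fp, g_fn = count_outcomes(predictions, gene_end)
        tp, fp, fn = tp + g_tp, fp + g_fp, fn + g_fn
    return tp, fp, fn


# Three hypothetical genes: one hit, one hit plus one FP, one missed entirely.
genes = [([150], 30000), ([5000, -300], 40000), ([], 10000)]
tp, fp, fn = evaluate(genes)
print(tp, fp, fn, sensitivity(tp, fn), ppv(tp, fp))
```

For the toy input, both Se and PPV come out as 2/3. Published evaluations differ in window size and in how multiple hits are merged, so the numbers are comparable only when the same rules are used.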

Signal features

• Recognize “conserved” signals such as the TATA box, Inr, DPE, BRE etc.

• Such motifs are highly variable and degenerate, which leads to a high false positive rate.

• Methods based on core promoter elements and other specific TFBSs (e.g. CAAT box) are far from accurate.

• A much more reliable signal is the CpG island. However, only ≈50% of human genes contain CpG islands.

⇓ CpG and non-CpG promoters are predicted with different success; prediction of non-CpG promoters is less accurate.
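To see why short degenerate motifs produce many false positives, a consensus can be turned into a regular expression and its chance match rate estimated. The sketch below is illustrative only. The TATA consensus TATA(A/T)A(A/T) is written in IUPAC code as TATAWAW, the function names are hypothetical, and the chance probability assumes independent, uniformly distributed bases, which real genomes do not satisfy.

```python
import re

# IUPAC nucleotide codes used here, as regex character classes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]",
    "K": "[GT]", "M": "[AC]", "N": "[ACGT]",
}


def consensus_to_regex(consensus):
    """Translate an IUPAC consensus string into a regex pattern."""
    return "".join(IUPAC[ch] for ch in consensus.upper())


def find_motif(seq, consensus):
    """Start positions (0-based) of all matches, overlapping ones included."""
    # A lookahead matches without consuming, so overlapping hits are reported.
    pattern = re.compile("(?=(" + consensus_to_regex(consensus) + "))")
    return [m.start() for m in pattern.finditer(seq.upper())]


def chance_probability(consensus):
    """Probability that a random position matches (uniform, independent bases)."""
    p = 1.0
    for ch in consensus.upper():
        allowed = len(IUPAC[ch].strip("[]"))  # number of bases the code permits
        p *= allowed / 4
    return p


print(find_motif("GGTATAAAAGG", "TATAWAW"))  # [2]
print(1 / chance_probability("TATAWAW"))      # 4096.0
```

Under these assumptions a TATA-like heptamer is expected about once every 4 kb on each strand, so a genome-wide scan returns far more chance matches than real promoters.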

Context features

• Extracted from the genomic context of promoters.

• Represented by a set of n-mers (DNA sequences n bases long). Their statistics are estimated from training samples.

• n-mers can cover most biological signals (TFBS: TATAAA, CCAAT; CpG: GC-rich n-mers like CGGCG).

• The n-mer representation encodes contextual information of promoters and has the following advantages:
  • contextual information is independent of any biological signals
  • the distribution of n-mers may have biological significance (TFBS, CpG)
  • n-mers may reveal details of yet unknown promoter regions

• n-mers reduce FPR while maintaining relatively high TPR (i.e. Se).
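A minimal sketch of an n-mer representation follows. It counts overlapping n-mers, turns the counts into frequencies estimated from training sequences, and scores a new sequence by a log-odds sum. The log-odds scoring and the `floor` pseudo-frequency are assumptions chosen for illustration; they are not the method of any particular tool named in these slides.

```python
import math
from collections import Counter


def nmer_counts(seq, n):
    """Count all overlapping n-mers in a DNA string."""
    seq = seq.upper()
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def nmer_frequencies(seqs, n):
    """Relative n-mer frequencies pooled over a list of training sequences."""
    counts = Counter()
    for seq in seqs:
        counts.update(nmer_counts(seq, n))
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()} if total else {}


def log_odds_score(seq, n, promoter_freq, background_freq, floor=1e-6):
    """Sum of log(f_promoter / f_background) over the n-mers of seq.

    floor is a hypothetical pseudo-frequency for n-mers unseen in training.
    """
    score = 0.0
    for word in nmer_counts(seq, n).elements():
        score += math.log(promoter_freq.get(word, floor)
                          / background_freq.get(word, floor))
    return score


# Toy training data: a GC-rich "promoter" set and an AT-rich "background" set.
promoter_freq = nmer_frequencies(["CGCGCGCG"], 2)
background_freq = nmer_frequencies(["ATATATAT"], 2)
print(log_odds_score("CGCG", 2, promoter_freq, background_freq) > 0)  # True
```

A positive score means the sequence looks more like the promoter training set than the background. In practice n is often around 6, and the frequencies feed a trained classifier rather than a plain sum.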

Structure features

• They originate from DNA 3D structures that characterize proximal promoters.

• DNA encodes in its sequence at least two independent levels of functional information:
  • the DNA sequence itself, which encodes proteins and their regulatory elements
  • the physical and structural properties of the DNA molecule

• Examples:
  • dinucleotide properties – stacking energy, propeller twist
  • trinucleotide properties – bendability, nucleosome position preference

• They have long-range interactions (up to 10 kbp), so they can exhibit properties not visible in the sequence.
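A structural profile of this kind can be sketched as a sliding-window average of a per-dinucleotide property. The code below is an assumption-laden illustration. `TOY_TABLE` holds placeholder values, not measured stacking energies or twist angles, and the function names are hypothetical. A real analysis would substitute a published parameter table.

```python
def structural_profile(seq, table, window=3):
    """Sliding-window mean of a dinucleotide property along a DNA string.

    table maps each of the 16 dinucleotides to a numeric property value.
    """
    seq = seq.upper()
    values = [table[seq[i:i + 2]] for i in range(len(seq) - 1)]
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def positions_above(profile, threshold):
    """Indices of profile windows whose mean reaches the threshold."""
    return [i for i, value in enumerate(profile) if value >= threshold]


# Placeholder property table (NOT real physical parameters):
# 1.0 for CG and GC steps, 0.0 for every other dinucleotide.
TOY_TABLE = {a + b: (1.0 if a + b in ("CG", "GC") else 0.0)
             for a in "ACGT" for b in "ACGT"}

profile = structural_profile("AACGCGAA", TOY_TABLE, window=3)
print(positions_above(profile, 0.9))  # [2]
```

Thresholding the smoothed profile is the simplest way to turn it into a prediction. Window size and threshold are free parameters that would have to be tuned on annotated promoters.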

Model for cooperative assembly of an activated transcription-initiation complex.

This figure shows why structural features such as flexibility are important.

Molecular Cell Biology. 4th edition. Lodish H, Berk A, Zipursky SL, et al. New York: W. H. Freeman; 2000.

Werner T, Fessele S, Maier H, Nelson PJ. Computer modeling of promoter organization as a tool to study transcriptional coregulation. FASEB J. 2003; (10):1228-37.

Software

Signal features (two leading CpG predictors)

• FirstEF – different quadratic discriminant functions for CpG and non-CpG promoters; slightly improves performance by concentrating on regions around the first exon

• Eponine – TATA box and G+C-rich domain, Relevance Vector Machine

Context features

• PromoterInspector – IUPAC word groups with wildcards

Structure features

• McPromoter – DNA sequence, bending, DNA twist, ANN

• EP3 – features from [1]; prediction is based just on a threshold imposed on the structural profile.

[1] Florquin K et al., Large-scale structural analysis of the core promoter in mammalian and plant genomes. Nucleic Acids Res. 2005; (13):4255

Integrated approaches

• combine sequence, context and structural features

• ARTS – SVM with sophisticated kernels; combines n-mers with structure features (e.g. twist angle, stacking energies)
  • does not distinguish CpG-related promoters from unrelated ones; it is not clear how it performs on non-CpG promoters

• SCS – sequence (TATA, Inr, DPE, CpG), structure (flexibility), and context (6-mers) features are used in different prediction models; their outcomes are combined by a decision tree

• CoreBoost – boosting technique with stumps; integrates core promoter signals, DNA flexibility, n-mer frequency, …

• CoreBoost_HM … adds experimental histone modification data

Boosting, stumps

• Boosting
  • Belongs to ensemble methods that produce a very accurate prediction rule (strong learner) by combining rough and moderately inaccurate rules (weak learners, WL), i.e. rules just a bit better than random guessing.
  • Iteratively learn weak classifiers and add them to a final strong classifier.
  • When a WL is added, it is weighted based on its accuracy.
  • After a WL is added, the data are reweighted: misclassified examples gain weight and correctly classified examples lose weight.
  • Thus, future WLs focus more on the examples that previous WLs misclassified.

• Stump
  • One-level decision tree (i.e. it has one root and two terminal nodes)

source: Wikipedia
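The boosting loop described above can be sketched with AdaBoost, one common boosting variant, using stumps as weak learners. The slides do not say which variant CoreBoost uses, so this should be read as a generic illustration rather than that tool's algorithm. Labels are +1/−1, and all function names are hypothetical.

```python
import math


def train_stump(X, y, w):
    """Best one-level tree under example weights w.

    Returns (weighted_error, feature_index, threshold, sign); the stump
    predicts sign when x[feature] >= threshold and -sign otherwise.
    """
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for sign in (1, -1):
                preds = [sign if row[j] >= thr else -sign for row in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best


def stump_predict(stump, row):
    _, j, thr, sign = stump
    return sign if row[j] >= thr else -sign


def adaboost(X, y, rounds=10):
    """Return a list of (alpha, stump) pairs forming the strong classifier."""
    n = len(X)
    w = [1.0 / n] * n                      # start with uniform example weights
    model = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        err = max(stump[0], 1e-10)         # avoid division by zero
        if err >= 0.5:                     # no better than random guessing
            break
        alpha = 0.5 * math.log((1 - err) / err)  # more accurate WL, larger weight
        model.append((alpha, stump))
        # Misclassified examples gain weight, correctly classified ones lose it.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, yi, row in zip(w, y, X)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model


def predict(model, row):
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in model)
    return 1 if score >= 0 else -1


# No single threshold separates this pattern, but three stumps together do.
X = [[1], [2], [3], [4], [5], [6]]
y = [1, 1, -1, -1, 1, 1]
model = adaboost(X, y, rounds=3)
print([predict(model, row) for row in X] == y)  # True
```

In the toy data the best single stump still misclassifies two of the six points, which is what makes the combination of weighted stumps worthwhile.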

Databases

• EPD – Eukaryotic Promoter Database
  • http://epd.vital-it.ch
  • manually annotated non-redundant collection of eukaryotic POL II promoters

• DBTSS
  • http://dbtss.hgc.jp/
  • putative core promoter: e.g. −100 bp … +50 bp, −250 bp … +50 bp, −200 bp … +200 bp

Actual state of the promoter prediction

• CpG island promoters are easier to predict than non-CpG promoters.

• CpG islands usually correspond to housekeeping genes. Promoters of housekeeping genes are easier to predict, but housekeeping genes are not regulated that strongly. So a biologist who wants to up- or down-regulate expression is usually not happy to hear that the gene has a CpG island promoter.

• Non-CpG promoters correspond to tissue-specific expression and are the bottleneck in accurate promoter prediction.

• The best approach is to use transcription data. Alignment of the 5' ends of ESTs or full-length cDNAs can indicate the promoter sequence. However, conventional cDNAs are often truncated at the 5' end and lack the complete 5' UTR. This is overcome by newer mRNA cap-cloning techniques (see DBTSS).

Future directions

• False positives are still the main problem.

• This is because information about chromatin structure is missing in prediction models.

• Without knowing which regions of chromatin are open or closed (and to what degree), researchers have to assume that the whole genome is accessible for binding, which is obviously wrong and leads to more FPs (and FNs because of the extra noise).

• Chromatin remodelling: enzyme-assisted movement of nucleosomes on DNA.

source: http://www.nida.nih.gov/NIDA_notes/NNvol21N4/gene.html
