Transcript Document

Biological Motif Discovery

Concepts

Motif Modeling and Motif Information EM and Gibbs Sampling Comparative Motif Prediction

Applications

Transcription Factor Binding Site Prediction Epitope Prediction

Lab Practical

DNA Motif Discovery with MEME and AlignAce Co-regulated genes from TB Boshoff data set

Regulatory Motifs

Find promoter motifs associated with co-regulated or functionally related genes

Transcription Factor Binding Sites

• Very Small • Highly Variable • ~Constant Size • Often repeated • Low-complexity-ish

Slide Credit: S. Batzoglou

Other “Motifs”

• Splicing Signals – Splice junctions – Exonic Splicing Enhancers (ESE) – Exonic Splicing Surpressors (ESS) • Protein Domains – Glycosylation sites – Kinase targets – Targetting signals • Protein Epitopes – MHC binding specificities

Essential Tasks

• Modeling Motifs – How to computationally represent motifs • Visualizing Motifs – Motif “Information” • Predicting Motif Instances – Using the model to classify new sequences • Learning Motif Structure – Finding new motifs, assessing their quality

Modeling Motifs

Consensus Sequences

Useful for publication IUPAC symbols for degenerate sites Not very amenable to computation

Nature Biotechnology

24, 423 - 425 (2006)

Probabilistic Model

1 K

Count frequencies Add pseudocounts

A C G T M 1 .1

.2

.4

.3

.2

.2

.5

.1

.1

.2

.4

.2

.4

.2

.2

.2

.1

.5

.2

.2

M K .1

.1

.1

.7

P k (S|M) Position Frequency Matrix (PFM)

Scoring A Sequence

To score a sequence, we compare to a null model Log likelihood ratio

Score

 log )  log

i N

  1

i i

|

i

| ) ) 

i N

  1 log

i i

|

i

| ) ) A C G T

PFM .1

.2

.4

.3

.2

.2

.5

.1

.1

.2

.4

.2

.4

.2

.2

.2

.1

.5

.2

.2

.1

.1

.1

.7

Background DNA (B) Position Weight Matrix (PWM)

A C G T

-1.3

-0.3

0.6

0.3

-0.3

-0.3

1 -1.3

-1.3

0.3

0.6

-0.3

0.6

-0.3

-1.3

1 -0.3

-0.3

-0.3

-0.3

-1.3

-1.3

-1.3

1.4

Scoring a Sequence

Common threshold = 60% of maximum score MacIsaac & Fraenkel (2006) PLoS Comp Bio

Visualizing Motifs – Motif Logos

Represent both base frequency and conservation position at each Height of letter proportional to frequency of base at that position Height of stack to conservation proportional at that position

Motif Information

The height of a stack is often called the motif information that position measured in bits at Motif Position Information

b

  { , , , } 

p b

log

p b

Why is this a measure of information?

Uncertainty and probability

Uncertainty is related to our surprise at an event “The sun will rise tomorrow” “The sun will

not

rise tomorrow” Not surprising (p~1)

Very

surprising (p<<1) Uncertainty is inversely related to probability of event

Uncertainty

p

1

p

1

p event

Average Uncertainty

Two possible outcomes for sun rising A “The sun will rise tomorrow” B “The sun will

not

rise tomorrow” P(A)=p 1 P(B)=p 2 What is our

average uncertainty

about the sun rising

     

p

1 

log

p i p

1

log

p i p

2

log

p

2

= Entropy

Entropy

Entropy measures average uncertainty Entropy measures randomness   

i p i

log 2

p i

If log is base 2 , then the units are called bits

Entropy versus randomness

Entropy is maximum at maximum randomness 1 0.9

0.8

0.7

0.6

0.5

0.4

0.3

0.2

0.1

0 0 0.1

0.2

0.3

0.4

0.5

0.6

P(heads) 0.7

0.8

0.9

1 Example: Coin Toss P(heads)=0.1 Not very random H(X)=0.47 bits P(heads)=0.5 Completely random H(X)=1 bits

1 0.9

0.8

0.7

0.6

0.5

0.4

0.3

0.2

0.1

0

Entropy Examples

1 0.9

0.8

0.7

0.6

0.5

0.4

0.3

0.2

0.1

0

      2 bits       0.63 bits

Information Content Information is a

decrease in uncertainty

Once I tell you the sun will rise, your uncertainty about the event decreases Information = H before (X) H after (X)

Information is difference in entropy after receiving information

Motif Information

Motif Position Information = 2 -

b

  { , , , } 

p b

log

p b

1 0.9

0.8

0.7

0.6

0.5

0.4

0.3

0.2

0.1

0

H background (X)

Prior uncertainty about nucleotide A T G H(X)=2 bits C

H motif_i (X)

Uncertainty after learning it is position i in a motif 1 0.9

0.8

0.7

0.6

0.5

0.4

0.3

0.2

0.1

0 A T G H(X)=0.63 bits C Uncertainty at this position has been reduced by 0.37 bits

Motif Logo

Conserved Residue Reduction of uncertainty of 2 bits Little Conservation Minimal reduction of uncertainty

Background DNA Frequency

The definition of information assumes a uniform background DNA nucleotide frequency What if the background frequency is not uniform?

(e.g. Plasmodium ) H background (X) 1 0.9

0.8

0.7

0.6

0.5

0.4

0.3

0.2

0.1

0 A T G H(X)=1.7 bits C H motif_i (X) 1 0.9

0.8

0.7

0.6

0.5

0.4

0.3

0.2

0.1

0 A T G H(X)=1.9 bits C Motif Position Information = -

b

  { , , , } 

p b

log

p b

= -0.2 bits

Some motifs could have negative information!

A Different Measure

Relative entropy or Kullback-Leibler (KL) divergence Divergence between a “true” distribution and another

D KL

(

P motif

||

P background

) 

i

  { , , , }

P motif

“True” Distribution Other Distribution D KL is larger the more different P motif is from P background Same as Information if P background is uniform

P motif P background

Properties

D KL D KL D KL

 0  0 if and only if P motif =P background ) 

D KL

(

Comparing Both Methods

Information assuming uniform background DNA KL Distance assuming 20% GC content (e.g. Plasmodium)

Online Logo Generation

http://weblogo.berkeley.edu/ http://biodev.hgen.pitt.edu/cgi-bin/enologos/enologos.cgi

Finding New Motifs Learning Motif Models

A Promoter Model

Length K A C G T

Motif

M 1 .1

.2

.4

.3

.2

.2

.5

.1

.1

.2

.4

.2

.4

.2

.2

.2

.1

.5

.2

.2

M K .3

.4

.2

.1

P k (S|M) Background DNA P(S|B) The same motif model in all promoters

Probability of a Sequence

Given a sequence(s), motif

model

and motif

location

1 60 65 A T A T G C 100

|  10,

Model

) 

i

59   1

S i = nucleotide at position i in the sequence i

| ) 6   1

k P k

S k

 63 |

M i

 100   66

P

(

S i

A C G T M 1 .1

.2

.2

.4

.3

.2

.5

.1

.1

.2

.4

.2

.4

.2

.2

.2

.5

.2

.2

.1

M K .3

.4

.2

.1

Parameterizing the Motif Model

Given multiple sequences and motif locations but no motif

model

AATGCG ATATGG ATATCG GATGCA Count Frequencies Add pseudocounts

A C G T M 1

3/4

M 6 ETC…

3/4

Finding Known Motifs

Given multiple sequences and motif model but no motif

locations

x

P(Seq window |Motif) window

x x x Calculate P(Seq window |Motif) for every starting location Choose best starting location in each sequence

Discovering Motifs

Given a set of co-regulated genes, we need to discover with only sequences

We have neither a motif model nor motif locations Need to discover both

How can we approach this problem?

(Hint: start with a random motif model)

Expectation Maximization (EM)

Remember the basic idea!

1.Use model to estimate (distribution of) missing data 2.Use estimate to update model 3.Repeat

until convergence

Model is the motif model Missing data are the motif locations

EM for Motif Discovery

.1

.2

.1

.4

.3

1. Start with random motif model 2.

E Step: estimate probability of motif positions for each sequence

A C G T

.2

.4

.3

.2

.5

.1

.2

.4

.2

.2

.2

.2

.1

.5

.2

.2

.4

.2

.1

.1

.1

.1

.1

.1

.3

3.

M Step: use estimate to update

A C

.2

.3

.2

.2

.5

.1

G

.4

.5

.4

.5

ETC…

.2

.1

motif model

T

.3

.1

.2

.2

.2

.1

At each iteration, P(Sequences|Model) guaranteed to increase 4. Iterate (to convergence)

MEME

• MEME - implements EM for motif discovery in DNA and proteins • MAST – search sequences for motifs given a model

http://meme.sdsc.edu/meme/

P(Seq|Model) Landscape

EM searches for parameters to increase P(seqs|parameters)

Useful to think of P(seqs|parameters) as a function of parameters EM starts at an initial set of parameters And then “climbs uphill” until it reaches a local maximum

Where EM starts can make a big difference

Search from Many Different Starts

To minimize the effects of local maxima, you should search multiple times from different starting points

MEME uses this idea Start at many points Run for one iteration Choose starting point that got the “highest” and continue

Gibbs Sampling

A stochastic version of EM that differs from deterministic EM in two key ways 1. At each iteration, we only update the motif position of a single sequence 2. We may update a motif position to a “suboptimal” new position

Gibbs Sampling

“Best” Location New Location

1. Start with random motif locations calculate a motif model and 2. Randomly select a sequence, remove its motif and recalculate tempory model 3. With temporary model, calculate probability of motif at each position on sequence 4. Select new position based on this distribution 5. Update model and Iterate

A C G T

.1

.2

.4

.3

.2

.2

.5

.1

.1

.2

.4

.2

.4

.2

.2

.2

.1

.5

.2

.2

.3

.4

.2

.1

A C G T

.1

.2

.4

.3

.1

.3

.5

.1

.1

.2

.4

.2

.1

.2

.5

.2

.1

.5

.2

.2

.3

.1

.1

.1

ETC…

Gibbs Sampling and Climbing

Because gibbs sampling does always choose the best new location it can move to another place not directly uphill

In theory,

Gibbs Sampling less likely to get stuck a local maxima

AlignACE

Implements Gibbs sampling for motif discovery

Several enhancements

ScanAce – look for motifs in a sequence given a model

CompareAce motifs) – calculate “similarity” between two motifs (i.e. for clustering http://atlas.med.harvard.edu/cgi-bin/alignace.pl

Assessing Motif Quality

Scan the genome for all instances and associate with nearest genes

• • • •

Category Enrichment

– look for association between motif and sets of genes. Score using Hypergeometric distribution – Functional Category – Gene Families – Protein Complexes

Group Specificity

– how restricted are motif instances to the promoter sequences used to find the motif?

Positional Bias

– do motif instances cluster at a certain distance from ATG?

Orientation Bias

– do motifs appear preferentially upstream or downstream of genes?

Comparative Motif Prediction

Kellis et al. (2003) Nature

Conservation of Motifs

GAL10

Scer TTATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATACA Spar CTATGTTGATCTTTTCAGAATTTTT-CACTATATTAAGATGGGTGCAAAGAAGTGTGATTATTATATTACATCGCTTTCCTATCATACACA Smik GTATATTGAATTTTTCAGTTTTTTTTCACTATCTTCAAGGTTATGTAAAAAA-TGTCAAGATAATATTACATTTCGTTACTATCATACACA Sbay TTTTTTTGATTTCTTTAGTTTTCTTTCTTTAACTTCAAAATTATAAAAGAAAGTGTAGTCACATCATGCTATCT-GTCACTATCACATATA * * **** * * * ** ** * * ** ** ** * * * ** ** * * * ** * * * TBP Scer TATCCATATCTAATCTTACTTATATGTTGT-GGAAAT-GTAAAGAGCCCCATTATCTTAGCCTAAAAAAACC--TTCTCTTTGGAACTTTCAGTAATACG Spar TATCCATATCTAGTCTTACTTATATGTTGT-GAGAGT-GTTGATAACCCCAGTATCTTAACCCAAGAAAGCC--TT-TCTATGAAACTTGAACTG-TACG Smik TACCGATGTCTAGTCTTACTTATATGTTAC-GGGAATTGTTGGTAATCCCAGTCTCCCAGATCAAAAAAGGT--CTTTCTATGGAGCTTTG-CTA-TATG Sbay TAGATATTTCTGATCTTTCTTATATATTATAGAGAGATGCCAATAAACGTGCTACCTCGAACAAAAGAAGGGGATTTTCTGTAGGGCTTTCCCTATTTTG ** ** *** **** ******* ** * * * * * * * ** ** * *** * *** * * * GAL4 GAL4 GAL4 Scer CTTAACTGCTCATTGC-----TATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTCCGTGCGTCCTCGTCT Spar CTAAACTGCTCATTGC-----AATATTGAAGTACGGATCAGAAGCCGCCGAGCGGACGACAGCCCTCCGACGGAATATTCCCCTCCGTGCGTCGCCGTCT Smik TTTAGCTGTTCAAG--------ATATTGAAATACGGATGAGAAGCCGCCGAACGGACGACAATTCCCCGACGGAACATTCTCCTCCGCGCGGCGTCCTCT Sbay TCTTATTGTCCATTACTTCGCAATGTTGAAATACGGATCAGAAGCTGCCGACCGGATGACAGTACTCCGGCGGAAAACTGTCCTCCGTGCGAAGTCGTCT ** ** ** ***** ******* ****** ***** *** **** * *** ***** * * ****** *** * *** GAL4 Scer TCACCGG-TCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAA-----TACTAGCTTTT--ATGGTTATGAA Spar TCGTCGGGTTGTGTCCCTTAA-CATCGATGTACCTCGCGCCGCCCTGCTCCGAACAATAAGGATTCTACAAGAAA-TACTTGTTTTTTTATGGTTATGAC Smik ACGTTGG-TCGCGTCCCTGAA-CATAGGTACGGCTCGCACCACCGTGGTCCGAACTATAATACTGGCATAAAGAGGTACTAATTTCT--ACGGTGATGCC Sbay GTG-CGGATCACGTCCCTGAT-TACTGAAGCGTCTCGCCCCGCCATACCCCGAACAATGCAAATGCAAGAACAAA-TGCCTGTAGTG--GCAGTTATGGT ** * ** *** * * ***** ** * * ****** ** * * ** * * ** *** MIG1 Scer GAGGA-AAAATTGGCAGTAA----CCTGGCCCCACAAACCTT-CAAATTAACGAATCAAATTAACAACCATA-GGATGATAATGCGA------TTAG--T Spar AGGAACAAAATAAGCAGCCC----ACTGACCCCATATACCTTTCAAACTATTGAATCAAATTGGCCAGCATA-TGGTAATAGTACAG------TTAG--G Smik CAACGCAAAATAAACAGTCC----CCCGGCCCCACATACCTT-CAAATCGATGCGTAAAACTGGCTAGCATA-GAATTTTGGTAGCAA-AATATTAG--G Sbay GAACGTGAAATGACAATTCCTTGCCCCT-CCCCAATATACTTTGTTCCGTGTACAGCACACTGGATAGAACAATGATGGGGTTGCGGTCAAGCCTACTCG **** * * ***** *** * * * * * * * * ** MIG1 TBP Scer TTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCG--ATGATTTTT-GATCTATTAACAGATATATAAATGGAAAAGCTGCATAACCAC-----TT Spar GTTTT--TCTTATTCCTGAGACAATTCATCCGCAAAAAATAATGGTTTTT-GGTCTATTAGCAAACATATAAATGCAAAAGTTGCATAGCCAC-----TT Smik TTCTCA--CCTTTCTCTGTGATAATTCATCACCGAAATG--ATGGTTTA--GGACTATTAGCAAACATATAAATGCAAAAGTCGCAGAGATCA-----AT Sbay TTTTCCGTTTTACTTCTGTAGTGGCTCAT--GCAGAAAGTAATGGTTTTCTGTTCCTTTTGCAAACATATAAATATGAAAGTAAGATCGCCTCAATTGTA * * * *** * ** * * *** *** * * ** ** * ******** **** * Scer TAACTAATACTTTCAACATTTTCAGT--TTGTATTACTT-CTTATTCAAAT----GTCATAAAAGTATCAACA-AAAAATTGTTAATATACCTCTATACT Spar TAAATAC-ATTTGCTCCTCCAAGATT--TTTAATTTCGT-TTTGTTTTATT----GTCATGGAAATATTAACA-ACAAGTAGTTAATATACATCTATACT Smik TCATTCC-ATTCGAACCTTTGAGACTAATTATATTTAGTACTAGTTTTCTTTGGAGTTATAGAAATACCAAAA-AAAAATAGTCAGTATCTATACATACA Sbay TAGTTTTTCTTTATTCCGTTTGTACTTCTTAGATTTGTTATTTCCGGTTTTACTTTGTCTCCAATTATCAAAACATCAATAACAAGTATTCAACATTTGT * * * * * * ** *** * * * * ** ** ** * * * * * *** * Scer TTAA-CGTCAAGGA---GAAAAAACTATA Spar TTAT-CGTCAAGGAAA-GAACAAACTATA Smik TCGTTCATCAAGAA----AAAAAACTA..

Sbay TTATCCCAAAAAAACAACAACAACATATA * * ** * ** ** **

GAL1 Factor footprint slide credits: M. Kellis Conservation island

Genome-wide conservation

Scer Spar Smik Sbay Evaluate conservation within: (1) All intergenic regions (2) Intergenic : coding (3) Upstream : downstream Gal4 22% 4:1 12:0 Controls 5% 1:3 1:1 A signature for regulatory motifs

Finding Motifs in Yeast Genomes

M. Kellis PhD Thesis 1.

Enumerate all “mini-motifs”

T C A

2. Apply three tests

N A C G

1.

2.

3.

Look for motifs conserved in intergenic regions Look for motifs more conserved intergenically than in genes Look for motifs preferentially conserved upstream or downstream of genes slide credits: M. Kellis

Test 1: Intergenic conservation

CGG-11-CCG Total count slide credits: M. Kellis

Test 2: Intergenic vs. Coding

CGG-11-CCG Higher Conservation in Genes Coding Conservation slide credits: M. Kellis

Test 3: Upstream vs. Downstream

CGG-11-CCG Most Patterns Downstream Conservation slide credits: M. Kellis

Constructing full motifs

Test 1 Test 2 Test 3 2,000 Mini-motifs Extend R

T C A

Y R

G A

R

T T T

T

C C C G

C A C R

G

Y

C 6 5

Collapse Collapse Merge

A

Extend

C G

R A

A

A A

C G C G C G A

A

A A

Full Motifs 72 Full motifs slide credits: M. Kellis

Results

Rank Discovered Motif

12 13 14 15 16 17 18 10 11 6 7 8 9 1 2 3 4 5 RCGCAnGCGY CACGTG SCGGAAGY ACTAYRnnnCCCR GATTGGY GGGCGGR TGAnTCA TMTCGCGAnR TGAYRTCA GCCATnTTG MGGAAGTG CAGGTG CTTTGT TGACGTCA CAGCTG RYTTCCTG AACTTT TCAnnTGAY 19 GKCGCn(7)TGAY G

Known TF motif

NRF-1 MYC ELK-1 NF-Y SP1 AP-1 ATF3 YY1 GABP E12 LEF1 ATF3 AP-4 C-ETS-2 IRF1(*) SREBP 1

Tissue Enrichment

Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes

Distance bias

Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes • 174 promoter motifs  70 match known TF motifs  60 show positional bias  75% have evidence • Control sequences < 2% match known TF motifs < 3% show positional bias  < 7% false positives slide credits: M. Kellis

Antigen Epitope Prediction

Genome to “Immunome”

Pathogen genome sequences provide define all proteins that could illicit an immune response

• Looking for a needle… – Only a small number of epitopes are typically antigenic • …in a very big haystack –

Vaccinia

virus (258 ORFs): and 10-mers) 175,716 potential epitopes (8-, 9-, – –

M. tuberculosis

(~4K genes): 433,206 potential epitopes

A. nidulans

(~9K genes): 1,579,000 potential epitopes

Can computational approaches predict all antigenic epitopes from a genome?

Antigen Processing & Presentation

Modeling MHC Epitopes • Have a set of peptides that have been associate with a particular MHC allele • Want to discover motif within the peptide bound by MHC allele • Use motif to predict epitopes other potential

Motifs Bound by MHCs

• MHC 1 – Closed ends of grove – Peptides 8-10 AAs in length –

Motif is the peptide

• MHC 2 – Grove has open ends – Peptides have broad length distribution: 10-30 AAs –

Need to find binding motif within peptides

MHC 2 Motif Discovery

Use Gibbs Sampling!

462 peptides known to bind to MHC II HLA-DR4(B1*0401) 9-30 residues in length Goal: identify a common length 9 binding motif Nielsen et al (2004) Bioinf

Vaccinia Epitope Prediction

Mutaftsi et al (2006) Nat. Biotech.

• Predict MHC1 peptides binding • Using 4 matrices for H-2 Kb and Db • Top 1% predictions experimentally validated

49 validated epitopes accounting for 95% of immune response