CSCE590/822 Data Mining Principles and Applications

Download Report

Transcript CSCE590/822 Data Mining Principles and Applications

CSCE555 Bioinformatics

Lecture 10 Motif Discovery

HAPPY CHINESE NEW YEAR

Instructor: Dr. Jianjun Hu Course page: http://www.scigen.org/csce555 University of South Carolina Department of Computer Science and Engineering 2008 www.cse.sc.edu

.

Outline

    Introduction to DNA Motif Motif Representations (Recap) Motif database search Algorithms for motif discovery 4/25/2020 2

What is a DNA motif?

 Motif  A Recurring pattern   A short conserved sequence pattern associated with distinct functions of a protein or DNA DNA motifs asTranscription Factor biding sites

Transcription: binding sites (DNA) and factors (proteins)

Colored lines are binding sites: DNA sequence patterns.

Blobs are factors (proteins) that recognize binding sites.

Example: Transcription Factor Binding Sites

Gene

Efp TERT Oxytocin Lactoferrin Angiotensin VEGF Estrogen Receptor ERE (estrogen response element)

… a g … t t g g t c a g g t c a … g c g g t g a … c a g g t c a … t a g g g c a … a t a a t c a

Transcription start DNA

ERE Sequence t g g g g c c c t a g g t g a c c t g a t c t g a c c c g a t c t c g t g a c c g a c t g a c t c t … t c … c c … t t … c g … g g …

Why are sequence patterns useful- revisited

In the context of transcriptional regulation, sequence patterns can be used to help answer several questions .

 What transcription factors regulating my gene?

are involved in   Does my gene contain a DNA binding domain ?

What novel transcription factor binding sites does my set of co-regulated genes contain?

How do we represent sequence patterns?

The three most common pattern representation languages :    regular expressions (e.g.,leucine zipper) profiles (PWMs, PSSMs etc.) hidden Markov models (HMMs)

1) Regular expressions define sets of sequences that they match

Sp1 binds to DNA via 3 zinc-finger binding domains :

C -X(2,4) C -X(3) [LIVMFYWC] -X(8) H X(3,5) H

These particular domains recognize Sp1 binding sites : G R GG C R GG W

Transcription factor Sp1 binding to DNA

2) Profiles are built from multiple alignments of instances of a pattern

Example : nuclear hormone receptor transcription factor binding site profile derived from experimentally determined sites.

Observed counts converted to can be frequencies dividing by the number of observed instances.

by

So profiles are probabilistic models of sequence patterns.

Counts of number of times each letter is observed at each position in pattern.

3) Making a Markov Model

A C A - - - A T G T C A A C T A T C A C A C - - A G C A G A - - - A T C A C C G - - A T C [AT][CG][AC][ACGT-](3)A[TG][GC] ~3600 possible valid sequences

Lecture 3.2

10

Making a Markov Model of Motif

0.4

C:0.4

G:0.2

T:0.2

A:0.2

0.6

0.6

A:0.8

T:0.2

1.0

C:0.8

G:0.2

1.0

A:0.8

C:0.2

0.4

A:1.0

1.0

T:0.8

G:0.2

1.0

C:0.8

G:0.2

P(ACAC--ATC)=0.8x1.0x0.8x1.0x0.8x1.0x0.6x0.4

x0.6x1.0x1.0x0.8x1.0x0.8 = 0.0047

Lecture 3.2

11

How to score the match of a sequence against three motif models?

   Regular express: exact match or fuzzy match Profile: sum of log-odds HMM: probability score P(s|H)

Outline

    Introduction to DNA Motif Motif Representations (Recap) Motif database search Algorithms for motif discovery 4/25/2020 13

How do we search for occurrences of known patterns?

Tools exist that allow us to search for one or more known sequence patterns in one or more sequences in different ways.

 The patterns can come from a patterns or be novel patterns we have discovered using pattern discovery software database of known or other means.

 Some tools treat each pattern independently ; others look for groups of matches to patterns.

 All tools compare each pattern to each position and compute a score which can be the number of mutations (regular expression patterns) or a probability or log odds (profiles and HMMs).

Many useful databases of patterns have been compiled

 TRANSFAC – transcription factor binding sites (profiles)   PROSITE – protein sites and domains (regular expressions and profiles) EPD – eukaryotic promoters (profiles)   PFAM – protein families and domains (HMMs) BLOCKS – protein families (profiles)

Searching for known patterns in a given sequence

   MOTIF – search protein sequence against Prosite, PFAM etc.; search DNA sequence against TRANSFAC PROFILESCAN – search protein sequence against Prosite database of profiles or regular expressions MAST – search for occurrences of one or more patterns in a DNA sequence (or database of sequences)

Outline

    Introduction to DNA Motif Motif Representations (Recap) Motif database search Algorithms for motif discovery 4/25/2020 17

The Motif Discovery Problem

 We are given a set of sequences, each containing an instance of an unknown motif. Find the motif.

  Multiple, local sequence alignment.

A clean, computer-sciencey problem. A bit too clean, we should be suspicious…

In Real Life

   A microarray experiment indicates that 50 genes share similar expression patterns.

Do they share a common type of transcription factor binding site?

◦ Almost certainly some of the genes were included erroneously: experimental noise.

◦ Perhaps they share a common mRNA degradation signal.

Is the TFBS near the transcription start site?

◦ Yeast: probably. Human: who knows?

Approaches to Motif Discovery

 ◦ ◦ Matrix-based: ◦ Gibbs Sampling - most popular.

Expectation maximization.

Stormo’s greedy algorithm.

 Consensus sequence-based: ◦ ◦ Several algorithms by Pevzner.

Box-finder of Kielbasa et al.

Three Ingredients of Almost any Bioinformatics Method 1.

2.

3.

Search space (haystack) Scoring scheme Mathematically precise formulation of the problem Search algorithm (= optimization technique) Strictly speaking, Gibbs sampling and expectation maximization are search algorithms. They are not specific to motif discovery; indeed they were first used in other contexts.

Gibbs Sampling: Simplifying Assumptions

 The width of the motif is known in advance.

 No indels (gaps).

 Each sequence contains precisely one instance of the motif.

 The sequences are single-stranded (e.g. mRNA).

Search Space

Motif width = W Length = L Size of search space = (L – W + 1) N L=100, W=15, N=10  size  10 19 N

Scoring Scheme

 Assign a numeric score to any proposed answer.

caga ctga cacc cgca  What score should this get?

Some Definitions

ctga caga cacc cgca k  count matrix: c ki =

a c t g 1

0 4 0 0

2

2 0 1 1

3

0 2 2 0 • p ki = c ki / N • p i = background abundance of i th residue type

4

3 1 0 0 i

Two Scoring Schemes

1. Based on frequentist statistics / information theory: score 

k W

  1 ln   

i

 

a

,

c

,

g

,

t c ki i

a

g

,

c

, ,

t

 

i

   2. Based on Bayesian statistics: score 

k W

  1 ln    

N

6  3  !

i

 

a

,

c

,

g c ki

,

t

!

i

 

a

,

c

,

g

,

t

 

i

  

Worked Example

score 

k W

  1 ln    

N

6  3  !

i

 

a

,

c

,

g c

,

t ki

!

i

 

a

,

c

,

g

,

t

 

i

   c ki =

a c g t 1

0 4 0 0

2

2 0 1 1

3

0 2 2 0

4

3 1 0 0 Score = 1.99 - 0.50 + 0.20 + 0.60 = 2.29

N = 4 p i = ¼  

i

  

N

6  3  !

i N

 1 256  32 105

Search Algorithm

 We want the global maximum score!

(Or as close as we can get.)  Exact algorithms (e.g. dynamic programming) would be too slow (e.g. lifetime of universe).

 Therefore we resort to a heuristic algorithm: Gibbs sampling , which is a type of Monte Carlo Markov chain method.

Gibbs Sampling Search

1 X 2 Suppose the search space is a 2D rectangle. (Typically, more than 2 dimensions!) Start at a random point X.

Randomly pick a dimension.

Look at all points along this dimension.

Move to one of them randomly, proportional to its score π.

Repeat.

Gibbs Sampling for Motif Search

Choose a random starting state.

Randomly pick a sequence.

Look at all motif positions in this sequence.

Pick one randomly proportional to exp(score).

Repeat.

Does it Work in Practice?

      Only successful cases get published!

Seems more successful in microbes than in animals.

(bacteria & yeast) The search algorithm seems to work quite well, the problem is the scoring scheme: real motifs often don’t have higher scores than you would find in random sequences by chance. I.e. the needle looks like hay.

Attempts to deal with this: ◦ Assume the motif is an inverted palindrome (they often are).

◦ Only analyze sequence regions that are conserved in another species (e.g. human vs. mouse).

As usual, repetitive sequences cause problems.

More powerful algorithm: MEME

1.

Go to our MEME server: 1.

2.

3.

4.

5.

http://molgen.biol.rug.nl/meme/website/meme.ht

ml Fill in your emailadres, description of the sequences Open the fasta formatted file you just saved with Genome2d (click “Browse”) Select the number of motifs, number of sites and the optimum width of the motif Click “Search given strand only” Click “Start search”

Something like this will appear in your email. The results are quite self explanatory.

Summary

    Motif discovery and Motif search problem Motif representation Gibbs sampling algorithm for motif discovery Using MEME (Expectation Maximization algorithm) for motif discovery

Acknowledgement

  Zhiping Weng (Boston Uni.) Timothy L. Bailey