Protein synthesis - Swedish Institute of Computer Science


Detecting binding sites for transcription factors by correlating sequence data with expression

Erik Aurell, Adam Ameur, Jakub Orzechowski Westholm, in collaboration with AstraZeneca

Outline of the talk

• Introduction

• Data description

• The REDUCE method

• Results

• Applications and Conclusions

Ameur, Orzechowski 11/3 2003

Introduction - the REDUCE method

• The aim is to find binding sites for transcription factors (motifs) in the human genome, using a method developed at Rockefeller University (Bussemaker, Li & Siggia 2001).

• This method is called REDUCE and has previously only been applied to yeast data. We applied it to human data.

• The idea is to find motifs by correlating sequence and expression data.

• Input consists of expression data, sequence data and a set of putative motifs.

• Output is a list of significant motifs:

consensus              id      description  Δχ²      F        probes  hits
NNNRRCCAATSRGNNN       M00287  NF-Y         0.0044    0.0661  1041    1279
NNNCGGCCATCTTGNCTSNW   M00069  YY1          0.0014    0.0363   300     314
NNRACAGGTGYAN          M00060  Sn           0.0013   -0.0345   368     374
NNNRGGNCAAAGKTCANNN    M00134  HNF-4        0.0008    0.0290   263     272
TWTTTAATTGGTT          M00424  NKX6-1       0.0007   -0.0234   428     457
KNNKNNTYGCGTGCMS       M00235  AhR/Arnt     0.0006   -0.0254   155     161
NANCACGTGNNW           M00123  c-Myc/Max    0.0006   -0.0243    50      50
NNBTNTNCTATTTNTT       M00092  BR-CZ2       0.0005    0.0233    92      94
NNGAATATKCANNNN        M00136  Oct-1        0.0005   -0.0230   213     244


Expression data

• Expression data is provided by AstraZeneca. It consists of 81 samples of human cerebral cortex stem cells undergoing various treatments. Expression levels are measured on an Affymetrix U133 chip.

• We visualize expression data in a heatmap.

           sample 1   . . .   sample 81
gene 1     e(1,1)     . . .   e(1,81)
gene 2     e(2,1)     . . .   e(2,81)
  .          .                  .
  .          .                  .
gene n     e(n,1)     . . .   e(n,81)

• It is possible to identify regions of correlated genes in the heatmap.


Sequence data

• In the REDUCE model, expression levels are explained by the number of times the motifs occur in the upstream sequences of human genes.

• For this, sequences around the transcription starts are extracted. We take sequences in the range [1000 bp upstream, 100 bp downstream].

• Transcription starts and genome data are provided by AstraZeneca.

• The upstream sequences are masked for repeats (with RepeatMasker).

• Putative motifs are matched to the resulting sequences.

Example (matches set off by spaces; the diagram marks the transcription start and the -1000 bp and +100 bp boundaries):

GGAGTTCAAGACCAACCTAAGCAACAAAG TGAAA CCACATCACTATAAATATATTC TTAAA CG TGAAA TGTTCACTCAGGCTT TTTAA TATTTTA TTTCA TTT

• The motif TKAAA and its reverse complement TTTMA are matched in the example.
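The matching of TKAAA and its reverse complement TTTMA can be illustrated with a small sketch. It is not the matching procedure used in the project (that one is based on weight matrices); it only shows how an IUPAC consensus such as TKAAA expands to concrete bases and how both strands can be counted. The function names are hypothetical.

```python
import re

# IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Complement of each IUPAC code (for example K = G/T pairs with M = A/C).
COMPLEMENT = {
    "A": "T", "T": "A", "C": "G", "G": "C",
    "R": "Y", "Y": "R", "K": "M", "M": "K", "S": "S", "W": "W",
    "B": "V", "V": "B", "D": "H", "H": "D", "N": "N",
}


def reverse_complement(motif):
    """Reverse complement of an IUPAC motif, e.g. TKAAA -> TTTMA."""
    return "".join(COMPLEMENT[c] for c in reversed(motif))


def find_motif(sequence, motif):
    """0-based start positions of all (possibly overlapping) matches."""
    pattern = "".join("[" + IUPAC[c] + "]" for c in motif)
    # The lookahead makes overlapping matches visible to finditer.
    return [m.start() for m in re.finditer("(?=" + pattern + ")", sequence)]


def count_both_strands(sequence, motif):
    """Count matches of the motif and of its reverse complement."""
    rc = reverse_complement(motif)
    hits = len(find_motif(sequence, motif))
    if rc != motif:  # do not count a palindromic motif twice
        hits += len(find_motif(sequence, rc))
    return hits


if __name__ == "__main__":
    # The example sequence from the slide, with the spaces removed.
    upstream = (
        "GGAGTTCAAGACCAACCTAAGCAACAAAGTGAAACCACATCACTATAAATATATTC"
        "TTAAACGTGAAATGTTCACTCAGGCTTTTTAATATTTTATTTCATTT"
    )
    print(reverse_complement("TKAAA"))
    print(count_both_strands(upstream, "TKAAA"))
```

Counting the reverse complement on the given strand is equivalent to counting the motif itself on the opposite strand, which is why only one strand needs to be stored.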


Motifs

• Motifs are represented as weight matrices:

W(M) =
           A         C         G         T
pos 1    w(1,A)    w(1,C)    w(1,G)    w(1,T)
pos 2    w(2,A)    w(2,C)    w(2,G)    w(2,T)
  .        .         .         .         .
  .        .         .         .         .
pos n    w(n,A)    w(n,C)    w(n,G)    w(n,T)

w(i,B) is the probability that base i is the nucleotide B in the motif M.

• We generate the set of putative motifs as weight matrices. This can be done in several ways:

• One possibility is to use the matrices (about 300) in the TransFac database.

• Another possibility is to generate matrices of our own, for example for all sequences of a certain length. Since the number of possible sequences grows exponentially with the length, this is only possible for sequences up to length 7 or 8.

• We have implemented a method based on Gibbs sampling to match weight matrices to upstream regions.
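The exponential growth mentioned above is easy to make concrete: over the alphabet A, C, G, T there are 4^k sequences of length k. The following minimal sketch enumerates them; the function name `all_kmers` is hypothetical and is not taken from the project.

```python
from itertools import product


def all_kmers(k):
    """Yield every DNA sequence of length k over the alphabet A, C, G, T."""
    for letters in product("ACGT", repeat=k):
        yield "".join(letters)


if __name__ == "__main__":
    for k in range(5, 9):
        # 4**5 = 1024, 4**6 = 4096, 4**7 = 16384, 4**8 = 65536
        print(k, 4 ** k)
```

Each additional base multiplies the number of candidate motifs by four, so the candidate set quickly becomes too large to test exhaustively.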


Matching motifs to the upstream sequences

• A weight matrix W is matched to a sequence s_1 s_2 … s_n in the following way:

• The entropy of position i in a weight matrix W is defined as:

E_i = -\sum_{B \in \{A,C,G,T\}} w(i,B) \log w(i,B)

• If the sequence S is added to the weight matrix W, a new weight matrix W_s is obtained.

• We then define a score based on the changes in the entropies when a sequence S is added to a weight matrix W:

score(W, S) = \sum_{i=1}^{l} \left( E_i^{W} - E_i^{W_s} \right)

• If the score is non-negative, that is, if the entropy decreases, a match is reported.

Ameur, Orzechowski 11/3 2003
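The entropy-based score can be sketched in a few lines. This is an illustration under stated assumptions, not the project's implementation: the weight matrix is assumed to be stored as per-position base counts, "adding" a sequence is assumed to mean incrementing the count of the observed base at each position, and the entropy is assumed to be the Shannon entropy in bits. All names are hypothetical.

```python
import math

BASES = "ACGT"


def column_entropy(counts):
    """Shannon entropy (bits) of one motif position, given its base counts."""
    total = sum(counts.values())
    entropy = 0.0
    for base in BASES:
        p = counts.get(base, 0) / total
        if p > 0:  # a base with probability 0 contributes nothing
            entropy -= p * math.log2(p)
    return entropy


def match_score(count_matrix, sequence):
    """Sum over positions of E_i(W) - E_i(W_s).

    count_matrix: one dict of base counts per motif position (assumed format).
    sequence: candidate site with the same length as the motif.
    """
    if len(sequence) != len(count_matrix):
        raise ValueError("sequence and weight matrix must have equal length")
    score = 0.0
    for counts, base in zip(count_matrix, sequence):
        updated = dict(counts)
        updated[base] = updated.get(base, 0) + 1  # assumed way of adding S to W
        score += column_entropy(counts) - column_entropy(updated)
    return score


def is_match(count_matrix, sequence):
    """A match is reported when the total entropy does not increase."""
    return match_score(count_matrix, sequence) >= 0


if __name__ == "__main__":
    # Hypothetical two-position motif built from ten aligned sites.
    w = [{"A": 8, "C": 2}, {"G": 10}]
    print(is_match(w, "AG"), is_match(w, "TC"))
```

A sequence that agrees with the dominant base at each position sharpens the distribution and lowers the entropy, while a sequence that introduces unseen bases flattens it and raises the entropy.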

Pre-processing and REDUCE

Pipeline (diagram):

• Human genome + mapping from probes to transcription starts → upstream sequences

• Upstream sequences → masked upstream sequences

• Masked upstream sequences + putative motifs → motif occurrences in upstream regions

• Motif occurrences in upstream regions + expression data for cortex stem cells → REDUCE indata

• REDUCE model: A_g = C + \sum_{M} F_M n_{Mg}
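To make the linear model behind REDUCE more tangible, here is a minimal sketch that fits expression against the occurrence counts of a single motif in a single sample by ordinary least squares. Restricting the fit to one motif is a simplification made for illustration; the function name and the toy numbers are hypothetical and do not come from the project.

```python
def fit_single_motif(expression, motif_counts):
    """Least-squares fit of A_g = C + F * n_g for one motif in one sample.

    expression: expression level A_g for each gene g.
    motif_counts: number of motif occurrences n_g upstream of each gene.
    Returns (C, F, r_squared); r_squared is the fraction of the expression
    variance explained by the motif counts.
    """
    n_genes = len(expression)
    mean_a = sum(expression) / n_genes
    mean_n = sum(motif_counts) / n_genes
    var_n = sum((n - mean_n) ** 2 for n in motif_counts)
    var_a = sum((a - mean_a) ** 2 for a in expression)
    if var_n == 0 or var_a == 0:
        # No variation to explain: the constant term carries everything.
        return mean_a, 0.0, 0.0
    cov = sum((n - mean_n) * (a - mean_a)
              for n, a in zip(motif_counts, expression))
    slope = cov / var_n                  # F: effect of the motif
    intercept = mean_a - slope * mean_n  # C: baseline expression
    return intercept, slope, cov * cov / (var_n * var_a)


if __name__ == "__main__":
    # Toy data: expression rises by 0.5 per motif occurrence.
    c, f, r2 = fit_single_motif([1.0, 1.5, 2.0, 2.5], [0, 1, 2, 3])
    print(c, f, r2)
```

A positive F corresponds to a motif whose presence goes with higher expression, and a negative F to one that goes with lower expression.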

REDUCE output

consensus              id      description  Δχ²      F        probes  hits
NNNRRCCAATSRGNNN       M00287  NF-Y         0.0044    0.0661  1041    1279
NNNCGGCCATCTTGNCTSNW   M00069  YY1          0.0014    0.0363   300     314
NNRACAGGTGYAN          M00060  Sn           0.0013   -0.0345   368     374
NNNRGGNCAAAGKTCANNN    M00134  HNF-4        0.0008    0.0290   263     272
TWTTTAATTGGTT          M00424  NKX6-1       0.0007   -0.0234   428     457
KNNKNNTYGCGTGCMS       M00235  AhR/Arnt     0.0006   -0.0254   155     161
NANCACGTGNNW           M00123  c-Myc/Max    0.0006   -0.0243    50      50
NNBTNTNCTATTTNTT       M00092  BR-CZ2       0.0005    0.0233    92      94
NNGAATATKCANNNN        M00136  Oct-1        0.0005   -0.0230   213     244

consensus - A consensus sequence for the motif.

id - A unique id for each motif.

description - The transcription factor name.

Δχ² - The significance of the motif.

F - The effect. A positive value indicates activation and a negative value repression.

probes - Number of probes with occurrences of the motif in their upstream regions.

hits - Total number of motif occurrences.


Visualizing REDUCE outdata

• REDUCE outdata can be visualized in a heatmap.

              m_1           . . .   m_n
sample 1      F_1^{m_1}     . . .   F_1^{m_n}
sample 2      F_2^{m_1}     . . .   F_2^{m_n}
  .             .                     .
  .             .                     .
sample 81     F_81^{m_1}    . . .   F_81^{m_n}

• The motifs in this heatmap are taken from TransFac.

• Green dots indicate repressing and red dots indicate activating motifs.

• The heatmap gives a clustering of samples on motifs.


Validation of results

• A bootstrap test was carried out to validate the results of REDUCE.

• The test used 10 sets of randomized data. Each set consists of the same upstream sequences and expression levels, but paired randomly.

[Figure: results per sample for the actual data compared with the worst, best and mean of the randomized data sets; vertical axis from 0 to 35.]

• For most samples the results for the actual data are significantly better than for the randomized data.
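The randomization behind this test can be sketched as follows. The slides do not say which statistic was compared between actual and randomized data, so the squared correlation between expression and motif counts is used here purely as an assumed stand-in; the function names, the seed and the toy data are hypothetical.

```python
import random


def r_squared(expression, motif_counts):
    """Squared Pearson correlation between expression and motif counts."""
    n = len(expression)
    mean_a = sum(expression) / n
    mean_n = sum(motif_counts) / n
    var_a = sum((a - mean_a) ** 2 for a in expression)
    var_n = sum((c - mean_n) ** 2 for c in motif_counts)
    if var_a == 0 or var_n == 0:
        return 0.0
    cov = sum((a - mean_a) * (c - mean_n)
              for a, c in zip(expression, motif_counts))
    return cov * cov / (var_a * var_n)


def randomization_test(expression, motif_counts, n_sets=10, seed=0):
    """Compare the actual fit with fits on randomly re-paired data.

    The expression values and motif counts are kept unchanged; only the
    pairing between them is shuffled, as described on the slide.
    """
    rng = random.Random(seed)
    actual = r_squared(expression, motif_counts)
    randomized = []
    for _ in range(n_sets):
        shuffled = list(expression)
        rng.shuffle(shuffled)  # break the gene-to-sequence pairing
        randomized.append(r_squared(shuffled, motif_counts))
    return actual, randomized


if __name__ == "__main__":
    expression = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
    counts = [0, 1, 2, 3, 4, 5]
    actual, randomized = randomization_test(expression, counts)
    print(actual, min(randomized), max(randomized),
          sum(randomized) / len(randomized))
```

Reporting the worst, best and mean of the randomized sets next to the actual value mirrors the four series named in the figure.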

Analyzing REDUCE outdata

• More validation: The pictures below show the samples clustered on expression and on motifs.

• Analysis of significant motifs: By analyzing the motifs found by REDUCE we hope to find motifs that explain clusters of correlated genes. For example, REDUCE found a TransFac motif in the samples associated with the red area in the picture. It matches 18% of the 109 genes in the picture, and 4% of the other genes.

• Finding new motifs: One iteration of REDUCE was run on all sequences of length 5.

CCGGA GCGGA TCGCG GCGAC GCGCG CCGCG GCGGC CGGCG CCGCC AGGCG GCGCC GCGGG GGGCG

pos          1      2      3      4      5      6      7
A          0.17    0      0      0      0     0.14   0.29
C          0.33   0.5    0.15    1      0     0.29   0.57
G          0.33   0.5    0.85    0      1     0.57   0.14
T          0.17    0      0      0      0      0      0
consensus   N      S      G      C      G      S      M
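The step from the thirteen 5-mers to the 7-column matrix can be reproduced with a short sketch. The slide does not state how the sequences were aligned, so the offsets below are an assumption, chosen so that the GCG/CCG core lands on positions 3 to 5; with these offsets the computed frequencies agree with the matrix on the slide after rounding. The consensus threshold of 0.16 is likewise an assumption that happens to reproduce NSGCGSM. All names are hypothetical.

```python
# (offset, sequence): offset 0 covers positions 1-5, offset 2 positions 3-7.
# The offsets are assumed, not given on the slide.
ALIGNED = [
    (2, "CCGGA"), (2, "GCGGA"), (0, "TCGCG"), (2, "GCGAC"), (0, "GCGCG"),
    (0, "CCGCG"), (2, "GCGGC"), (0, "CGGCG"), (2, "CCGCC"), (0, "AGGCG"),
    (2, "GCGCC"), (2, "GCGGG"), (0, "GGGCG"),
]

# IUPAC codes, inverted so that a set of bases maps to its one-letter code.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
CODE_FOR = {frozenset(bases): code for code, bases in IUPAC.items()}


def frequency_matrix(aligned, width):
    """Per-position base frequencies of the aligned sequences."""
    counts = [dict.fromkeys("ACGT", 0) for _ in range(width)]
    for offset, seq in aligned:
        for i, base in enumerate(seq):
            counts[offset + i][base] += 1
    matrix = []
    for column in counts:
        total = sum(column.values())  # sequences covering this position
        matrix.append({b: column[b] / total for b in "ACGT"})
    return matrix


def consensus(matrix, threshold=0.16):
    """IUPAC consensus keeping bases at or above the (assumed) threshold."""
    letters = []
    for column in matrix:
        kept = frozenset(b for b, p in column.items() if p >= threshold)
        letters.append(CODE_FOR[kept])
    return "".join(letters)


if __name__ == "__main__":
    matrix = frequency_matrix(ALIGNED, 7)
    for base in "ACGT":
        print(base, [round(column[base], 2) for column in matrix])
    print(consensus(matrix))
```

Because only some of the sequences cover the outer positions, each column is normalized by the number of sequences that actually overlap it (6 for positions 1 and 2, 13 for positions 3 to 5, 7 for positions 6 and 7).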

Applications

• Identify coregulated genes with potentially different expression profiles, using the motifs found by REDUCE.

• Predict previously unknown motifs, or new properties of known ones.

Conclusions

As the bootstrap test shows, our results are significant in most cases. This suggests that the motifs we find are biologically meaningful, and that the method is applicable to human data.

Determining the role of each transcription factor requires more in-depth examination of subgroups of genes. This is very interesting, but beyond the scope of this project.

Our results on human data had somewhat lower significance than previous results on yeast presented in (Bussemaker, Li & Siggia, 2001). There are several possible causes for this:

• Data quality: expression data, upstream regions.

• Gene regulation is probably more complicated in humans.

Even so, our results suggest that the REDUCE method might give useful information about transcription factor binding sites in humans. Probably, this requires prior knowledge about motifs and other methods such as clustering.
