Pattern Overrepresentation using Z


Statistical Analysis for Word Counting in Drosophila Core Promoters
Yogita Mantri
April 27 2005
Bioinformatics Capstone presentation
 Introduction & Motivation
 Dataset used
 Part I – Unbiased word counting
 Part II – TCAGT-centric word counting
 Conclusions and Future work
Introduction



Regulatory elements are short DNA sequences that control gene expression.
They are often found around the Transcription Start Site (TSS), sometimes further upstream.
Identification of promoters and regulatory elements is a major challenge in bioinformatics:
  Regulatory elements are not well conserved
  Computational discovery of the TSS is not straightforward
  Promoter sequences do not have distinguishable statistical properties
  Transcription is a highly cooperative process, involving competitive or cooperative binding that is not completely determined by the genome’s DNA sequence
Drosophila Core Promoters
“Computational analysis of core promoters in the Drosophila
Genome”, Ohler, Rubin et al., Genome Biology 2002, 3(12):research0087.1–0087.12
Above image edited from:
http://163.238.8.180/~davis/Bio_327/lectures/Transcription/TranscriptionOver.html
Motivation for project

A database of core promoters with experimentally determined TSSs is a major advantage over approaches that use only gene upstream regions.

A word-counting method is used to determine significant patterns, inspired by Dr. Peter Cherbas’ earlier work:
“The arthropod initiator: the capsite consensus plays an important role in transcription”, Cherbas L, Cherbas P., Insect Biochem Mol Biol. 1993 Jan;23(1):81-90
The Database of Drosophila Core Promoters
Compiled by Sumit Middha. It consists of Drosophila core promoters from three experimental sources.
Ohler, Rubin et al.:
  1941 promoters
  Stringent criteria for identifying TSSs, requiring 5’ ends of multiple cDNAs to lie in close proximity.
Kadonaga et al.:
  205 promoters
  Changed TSS to coincide with the A of the Inr consensus TCAGT even if experimental results reported the TSS in the vicinity.
  The discrepancy was fixed by taking the experimentally reported TSS.
Eukaryotic Promoter Database:
  1926 promoters
  Assigned TSS based on experimental data with a precision of +/- 5 bp or better.
3458 sequences after removing redundant entries in the dataset.
Word Analysis – Part I
Unbiased search
Used various statistical measures, such as the Z-score, on all possible n-mers in the entire dataset and in specific windows.
 The goal was to see whether known patterns of interest were more significantly enriched in promoter sequences than other patterns.

Basic Statistics of the dataset
3458 promoter sequences in the database.
The first step was a word-frequency analysis (pentamers used for initial analysis).
Performed analysis on the following sets:
 Entire dataset (DS-1)
 Subset of the above dataset, with only the -20 to +20 region (DS-2)
Two types of analyses, differing in the “Random” sequences used:
 1st-order Markov chains based on base and transition probabilities of the respective dataset
 “Non-coding” regions
Random set
 Generated 100 sets of 1st-order Markov chains
 Each set contained the same number of sequences as the original dataset (3458), each of the same length (350)
 Computed the occurrence of each pentamer in actual and random sequences
 For random sequences, calculated the average and S.D. over all sets
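A minimal sketch of how such random sets could be produced, assuming the promoters are held as lowercase a/c/g/t Python strings. The function names (`fit_first_order`, `sample_sequence`, `random_sets`) and the choice to draw the first base from the overall base composition are illustrative assumptions, not the code used in the project.

```python
import random
from collections import Counter, defaultdict


def fit_first_order(seqs):
    """Estimate base counts and base-to-base transition counts from sequences."""
    base = Counter("".join(seqs))          # overall base composition
    trans = defaultdict(Counter)           # trans[prev][next] = count
    for s in seqs:
        for prev, nxt in zip(s, s[1:]):
            trans[prev][nxt] += 1
    return base, trans


def sample_sequence(base, trans, length, rng):
    """Draw one sequence of the given length from the fitted chain."""
    if length == 0:
        return ""
    # Assumption: the first base is drawn from the overall base composition.
    seq = rng.choices(list(base), weights=list(base.values()))
    while len(seq) < length:
        counts = trans[seq[-1]]
        if not counts:                     # base never seen with a successor
            counts = base
        seq += rng.choices(list(counts), weights=list(counts.values()))
    return "".join(seq)


def random_sets(seqs, n_sets=100, seed=0):
    """Build n_sets random datasets matching the real one in size and lengths."""
    rng = random.Random(seed)
    base, trans = fit_first_order(seqs)
    return [[sample_sequence(base, trans, len(s), rng) for s in seqs]
            for _ in range(n_sets)]
```

With the real data, `random_sets(promoters)` would yield 100 sets of 3458 sequences of length 350, matching the setup described above.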

Z-score
A test of significance
 Mean and S.D. calculated over 100 sets
 Calculated Z-scores for all pentamers
 Looking for pentamers with very high or very low Z-scores
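The Z-score of a word is its observed count minus the mean count over the random sets, divided by the standard deviation over those sets. A small sketch of that calculation follows; the helper names are hypothetical, overlapping occurrences are counted, and the sample standard deviation is used. These are assumptions, since the slides do not specify either choice.

```python
from collections import Counter
from statistics import mean, stdev


def count_kmers(seqs, k=5):
    """Count every overlapping k-mer across a list of sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def z_scores(real_seqs, random_sets, k=5):
    """Z = (observed - mean over random sets) / S.D. over random sets.

    Needs at least two random sets. Words whose random-set counts have
    zero spread are skipped, because their Z-score is undefined.
    """
    observed = count_kmers(real_seqs, k)
    background = [count_kmers(rs, k) for rs in random_sets]
    words = set(observed).union(*background)
    scores = {}
    for word in words:
        values = [bg[word] for bg in background]
        sd = stdev(values)
        if sd > 0:
            scores[word] = (observed[word] - mean(values)) / sd
    return scores
```

Sorting the resulting dictionary by value gives a ranking of the kind shown in the tables that follow.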

Rank of TCAGT and variants in entire dataset
Rank   Pattern   Z-Score
1      aaaaa     113.037
2      ttttt     111.647
3      ttttg     88.1
4      gaaaa     83.156
5      aaaac     82.69
6      atttt     82.152
7      gtttt     82.067
8      ttttc     79.485
9      aaaat     78.348
10     gcagc     77.091
101    gcagt     29.269
115    tcagt     27.156
307    acagt     10.286
485    tcatt     1.375
965    tataa     -25.213
Summary of known pentamers in different windows

-20 to +20 region
Pattern   Z-score     Rank
tcagt     58.929      2
tcatt     3.6         418
gcagt     25.545      34
acagt     12.923      179
tataa     -25         1022

Sliding windows
Pattern   Z-score     Rank
tcagt     7.559871    254
tcatt     -1.402484   576
gcagt     9.0644839   200
acagt     2.7177419   409
tataa     -8.962065   880

Non-overlapping windows
Pattern   Z-score     Rank
tcagt     4.277429    356
tcatt     -2.00671    590
gcagt     7.714143    246
acagt     2.080429    435
tataa     -9.064      898
Z-score plots of tcagt and variants using sliding windows of 10 bp
[Figure: Z-score (y-axis) against sliding window (x-axis). Patterns plotted: tcagt, acagt, gcagt, tcatt, tataa, cgtcg, aaaaa, atttt, cagcg, atatc, tagta.]
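The difference between sliding and non-overlapping windows can be sketched as below. This is only an illustration: `windows` and `word_counts_by_window` are hypothetical names, windows are given as sequence indices rather than TSS-relative positions (the slides do not state the TSS offset within each sequence), and a word is counted only when it lies wholly inside a window.

```python
def windows(seq_len, width=10, step=1):
    """Yield (start, end) index pairs. step=1 gives sliding windows;
    step=width gives non-overlapping windows."""
    for start in range(0, seq_len - width + 1, step):
        yield start, start + width


def word_counts_by_window(seqs, word, width=10, step=1):
    """Count occurrences of `word` fully inside each window, summed over sequences."""
    seq_len = min(len(s) for s in seqs)
    profile = []
    for start, end in windows(seq_len, width, step):
        total = 0
        for s in seqs:
            chunk = s[start:end]
            total += sum(chunk.startswith(word, i)
                         for i in range(len(chunk) - len(word) + 1))
        profile.append(((start, end), total))
    return profile
```

Running the same per-window counts on the random sets and applying the Z-score per window would give a positional profile like the one plotted above.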
Lesson

Cannot ignore position preference of
regulatory motifs!
Word Analysis – Part II
Guided search, starting with known INR element TCAGT
 Identification of INR-enriched regions
 Identification of synonyms
 Correlation analysis of INR synonyms
 Guided search

TCAGT-centric word analysis
Window     Z-score
(-3,3)     130.58
(-4,2)     116.27
(-2,4)     105.67
(-5,1)     98.96
(-6,-1)    95.71
(-7,-2)    85.83
(-1,5)     59.23
(1,6)      47.68
(2,7)      43.30
(3,8)      28.79
Z-score vs position with respect to TSS
[Figure: Z-score of TCAGT (y-axis, 20 to 160) against window position relative to the TSS (x-axis, -20 to +15).]
INR Synonyms
Group1
CTCAG---
ATCAG---
TTCAG---
GTCAG---
-TCAGT--
---AGTTG
---AGTCG
--CAGTT-
--CAGTC-
Group 2
TTAGT
Group 3
ACACT---CACTCTG
Group 4
-TCACA-
GTCAC--
--CACAC
Group 5
TCACTCT
Group 6
-CATTC
TCATT
“Computational analysis of core promoters in the Drosophila Genome”, Ohler, Rubin et al., Genome Biology 2002, 3(12):research0087.1–0087.12
Binary Tree Representation of Dataset
TOTAL: 3412
  INR+: 1611
    TATA+: 410
      DPE+: 79
      DPE-: 331
    TATA-: 1201
      DPE+: 369
      DPE-: 832
  INR-: 1801
    TATA+: 397
      DPE+: 76
      DPE-: 321
    TATA-: 1404
      DPE+: 232
      DPE-: 1172
3 Clusters in INR-positive set
[Figure: Z-score (y-axis, 0 to 250) against position (x-axis, -40 to +40). Labeled clusters: INR (-10, +2), DPE (+20, +30), TATA (-40, -35). Words annotated on the plot: ggtcacact, ggtcacac, cggtcacac, ttcagtcg, tcagt, cggacgtg, tataaaag.]
Contingency Matrices for INR, TATA, DPE

         TATA+   TATA-   Total
INR+       410    1201    1611
INR-       397    1404    1801
Total      807    2605    3412
INR+, TATA+ Log Likelihood: 0.073

         DPE+    DPE-    Total
INR+       448    1163    1611
INR-       308    1493    1801
Total      756    2656    3412
INR+, DPE+ Log Likelihood: 0.227

         DPE+    DPE-    Total
TATA+      155     652     807
TATA-      601    2004    2605
Total      756    2656    3412
TATA+, DPE+ Log Likelihood: -0.143
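The three reported values are reproduced by the natural log of the observed co-occurrence count over the count expected if the two elements were independent. The slides do not state the formula, so this reading is inferred from the numbers, and the function name below is hypothetical.

```python
import math


def log_obs_over_exp(both, row_total, col_total, grand_total):
    """ln(observed / expected), where expected assumes independence."""
    expected = row_total * col_total / grand_total
    return math.log(both / expected)


N = 3412  # total number of sequences in the binary tree

inr_tata = log_obs_over_exp(410, 1611, 807, N)   # INR+ and TATA+
inr_dpe = log_obs_over_exp(448, 1611, 756, N)    # INR+ and DPE+
tata_dpe = log_obs_over_exp(155, 807, 756, N)    # TATA+ and DPE+

print(round(inr_tata, 3), round(inr_dpe, 3), round(tata_dpe, 3))
# prints: 0.073 0.227 -0.143
```

A positive value means the pair co-occurs more often than expected under independence (INR with DPE most strongly), and a negative value means less often (TATA with DPE).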
Possible Alternative TATA and INR Synonyms?
[Figure: Z-score vs position in the INR-negative set. Z-score (y-axis, 0 to 90) against position (x-axis, -40 to +40). Candidate clusters labeled “TATA – 2 ?” and “INR – 2 ?”. Words annotated on the plot: tctttcttt, ggtcacac, ctatcgat, gtcacact, ctcgaggg, ttctttccg, cggtcacac.]
Enrichment further upstream – New Binding Sites?
[Figure: Z-score (y-axis, 0 to 80) against position in the -100 to -40 region (x-axis). Words annotated on the plot: actatcgat, ctatcgat, tatcgata, aactatcgat.]
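Next Level of Binary Tree analysis
[Figure: the binary tree extended by a further level. TOTAL: 3412 splits into INR+ (1611) and INR- (1801). Node labels shown in the figure: TATA+/TATA-, INR_2+/INR_2-, TATA_2+/TATA_2-, DPE+/DPE-. Counts shown: 410, 1201, 397, 1404. Two nodes are marked “?”.]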
Conclusions & Future steps
The main goal of this project was to identify significant words based only on statistical overrepresentation.
The first part of the analysis, using an unbiased search, was successful only in a very narrow range of positions around the TSS.
However, the biased search starting with the Inr consensus revealed the 3 known regulatory elements in that region.
An analysis of the Inr-negative set showed overrepresentation of patterns at the positions where the Inr, TATA and DPE would be expected; these patterns could be synonyms.
Thus the word-counting strategy has the potential to reveal:
  Regulatory motifs and interrelationships that other motif discovery programs cannot
  Synonyms for regulatory motifs
  Dependencies among regulatory motifs
Acknowledgements
Dr. Haixu Tang
 Dr. Sun Kim
 Dr. Peter Cherbas
 Sumit Middha
 Bioinformatics Research Group
