dambe.bio.uottawa.ca

Download Report

Transcript dambe.bio.uottawa.ca

Position weight matrix (PWM),
Perceptron and their applications
Xuhua Xia
[email protected]
http://dambe.bio.uottawa.ca
Sequence Annotation
• Objective of sequence annotation: Given a nucleotide
or amino acid sequence, find its biological function
by bioinformatics tools.
• Approaches:
– Homology search: Annotation based on known, well
annotated genes in databases, using BLAST and FASTA
– Gene prediction: annotation based on known gene
structure using position weight matrix, perceptron, HMM.
Position-weight matrix (PWM)
• Also called position-specific scoring matrix (PSSM)
• Used in
– Characterizing sequence motifs
•
•
•
•
Eukaryotic translation initiation consensus
Splicing sites
Branchpoint sites
Shine-Dalgarno sequences
– Database searches (PHI-BLAST, PSI-BLAST and RPSBLAST)
Slide 3
Position weight matrix
Sequences flanking the initiation
codon of 508 CDSs:
1234567890123
ATACCATGTCCAA
ACAAAATGGCGCC
GGAGTATGGTTGA
CCGCCATGGCCCG
.....
A4GALT
ACO2
ACR
ADM2
....
N: Number of sequences, i.e. 508
L: Sequence length, i.e., 13
i: A, C, G or T
j: Site index, i.e., 1, 2, ..., 13
Site-specific frequencies: Pij
Non-site-specific (global) frequencies: Pi
L
pij 
f ij
, e.g.,
N
75
p A1 
 0.1476
508
pi 

j
f ij
, e.g .,
NL
1563
pA 
 0.2367
508  13
Site
A
C
G
U
1
75
173
171
89
2
105
216
144
43
3
199
70
212
27
4
124
236
83
65
5
60
276
143
29
6
502
6
0
0
7
0
0
0
508
8
0
0
508
0
9
98
71
274
65
10
141
227
89
51
11
49
154
221
84
12
89
144
196
79
13
121
151
134
102
1563
1724
2175
1142
Sum
Slide 4
Two hypotheses
• No: All sites have the same nuc/aa distributions
• Yes: Different sites have different nuc/aa distributions
• Related terms:
– Observation: S
– Likelihood: probability of having S given a model (a
hypothesis)
– Odds ratio: LYes/LNo
– Log-odds: log(LYes/LNo)
1234567890123
A4GALT ATACCATGTCCAA
ACO2
ACAAAATGGCGCC
ACR
GGAGTATGGTTGA
ADM2
CCGCCATGGCCCG
....
.....
S = ACGGTACCACGTT
LYes  p(S | Yes )  p A1 pC 2 pG 3 pG 4 pT 5 p A6 pC 7 pC 8 p A9 pC10 pG11 pT 12 pT 13
LNo  p( S |  No )  pA pC pG pG pT pA pC pC pA pC pG pT pT  p A3 pC4 pG3 pT3
 LYes 
 pC 2 
 p A1 
 pT 13 
log 
  log 
  ......  log 
  log 

L
p
p
 A
 pT 
 No 
 C 
Slide 5
Position weight matrix (PWM)
• Two major purposes of PWM
– To characterize the sequence
pattern (the motif)
– to facilitate the computation
of log-odds (or PWM score),
e.g., computing the PWMS
for ATACCATGTCCAA
p 

PWM ij  log 2  i , j   log 2 
 pi 

f
 log 2  L   log 2  i , j
 fi
fi , j / N 

fi /( NL) 



Site
A
C
G
U
Std
1
0.0726
0.7140
0.5377
0.3675
0.2731
2
0.3307
0.9354
0.3913
-0.1780
0.4553
3
0.9284
-0.0167
0.7350
-0.4293
0.6367
4
0.4731
1.0279
-0.0072
0.1086
0.4656
5
-0.0761
1.1967
0.3856
-0.3954
0.6915
6
1.9941
-0.7772
-0.8254
-0.9879
1.4316
7
-0.8980
-0.8743
-0.8254
2.3190
1.5927
8
-0.8980
-0.8743
1.6783
-0.9879
1.3001
9
0.2745
-0.0075
0.9900
0.1086
0.4476
10
0.5896
0.9870
0.0373
-0.0671
0.4931
11
-0.1958
0.6042
0.7750
0.3173
0.4249
12
0.1988
0.5428
0.6612
0.2652
0.2207
0.4515
0.5860
0.3331
0.4905
0.1047
 f   fi 
13
PWM ij  log 2  L   log 2  i , j
 f    f 
i 
 i
 75  0.05 1563i 
PWM A1  log 2 13  log 2 
  0.07263
 1563  0.05  508 13 
 
PWMS   PWM S j , j   log 2  Yes 
Xuhua
Xia
j 1
j 1
  No 
L
RCCAUGG
L
Slide 6
PWMS over sites
12345678901234567890123456789012345678901234567890123456789012345678901234567890
GGACUGGCUGGGCGAGACUCUCCACCUGCUCCCUGGGACCAUCGCCCACCAUGGCUGUGGCCCAGCAGCUGCGGGCCGAG
------------------------------------15
10
PWMS
5
0
-5
-10
-15
0
10
20
30
40
50
Site
Figure 5-1. Illustration of scanning the 5’-end of the NCF4 gene (30 bases upstream of the initiation codon ATG
and 27 bases downstream of ATG. The highest peak, with PWMS = 12.3897, corresponds to the 13-mer with 5
bases flanking the ATG. PWMS computed with  = 0.01.
Slide 7
BLAST Programs
Program
Database
Query
Typical Uses
BLASTN/ME
GABLAST
Nucleotide
Nucleotide
MEGABLAST has longer word size than
BLASTN
BLASTP
Protein
Protein
Query a protein/peptide against a protein
database.
BLASTX
Protein
Nucleotide
Translate a nuc sequence into a “protein”
in six frames and search against a protein
database
TBLASTN
Nucleotide
Protein
Unannotated nuc sequences (e.g., ESTs)
are translated in six frames against which
the query protein is searched
TBLASTX
Nucleotide
Nucleotide
6-frame translation of both query and
database
PHI-BLAST
Protein
Protein
Pattern-hit iterated BLAST
PSI-BLAST
Protein
Protein
Position-specific iterated BLAST
RPS-BLAST
Protein
Protein
Reverse PSI-BLAST
Slide 8
Yeast 5’ ss PWM
Table 3: Site-specific frequencies and position weight matrix (PWM) for 275 5′ ss. The consensus sequence
(UAAAG ∣GUAUGUU UAAUU) can be obtained from those large site-specific PWM entries, with the most important
sites in bold italics . The χ 2 test is performed for each site against the background frequencies (A = 0.3279, C = 0.1915,
G = 0.2043, and U = 0.2763). The nucleotide sites are labeled with the five exon nucleotides as −5 to −1 and the 12
intron nucleotides as 1 to 12. The PWM is nearly identical when the introns in 5′ UTR were excluded.
Site
−5
−4
−3
−2
−1
1
2
3
4
5
6
7
8
9
10
11
12
A
94
119
139
138
91
0
0
268
17
2
10
97
95
123
118
105
90
C
32
47
38
40
45
1
9
1
29
0
8
18
54
45
41
33
44
G
57
48
43
36
88
274
0
2
1
272
2
39
35
34
38
43
42
U
92
61
55
61
51
0
266
4
228
1
255
121
91
73
78
94
99
χ2
11.798
14.117
39.672
38.899
27.270
1060.426
658.096
522.754
428.607
1041.047
583.545
55.570
11.363
22.172
17.334
17.367
12.109
p
0.0081088
0.0027505
0.0000001
0.0000001
0.0000052
0.0000004
0.0000003
0.0000003
0.0000002
0.0000004
0.0000003
0.0000001
0.0099180
0.0000601
0.0006034
0.0005940
0.0070180
A
0.0641
0.4032
0.6268
0.6164
0.0174
−8.1042
−8.1042
1.5723
−2.3805
−5.2765
−3.1271
0.1092
0.0793
0.4508
0.3911
0.2232
0.0015
C
−0.7117
−0.1599
−0.4651
−0.3915
−0.2223
−5.4675
−2.5200
−5.4675
−0.8528
−8.1049
−2.6862
−1.5351
0.0397
−0.2223
−0.3560
−0.6676
−0.2546
G
0.0245
−0.2225
−0.3805
−0.6355
0.6492
2.2855
−8.1048
−4.6732
−5.5454
2.2750
−4.6732
−0.5206
−0.6759
−0.7175
−0.5579
−0.3805
−0.4142
U
0.2792
−0.3115
−0.4601
−0.3115
−0.5685
−8.1044
1.8081
−4.1523
1.5859
−5.8967
1.7472
0.6734
0.2635
−0.0534
0.0418
0.3101
0.3847
Ma and Xia 2011
Slide 9
Yeast 3’ss PWM
Table 4. Site-specific frequencies and position weight matrix (PWM) for 278 3’ ss. The consensus sequence
(UUUUUUUUAYAG|GCUUC) can be obtained from those large site-specific PWM entries, with the most important
sites in bold italics. The 2 test is performed for each site against the expected background frequencies. The sites are
labeled with first exon site as 1.
Site
-12
-11
-10
-9
-8
-7
-6
-5
-4
-3
-2
-1
1
2
3
4
5
A
61
70
79
38
51
91
95
93
136
12
277
0
93
72
90
83
90
C
53
47
42
30
42
33
42
33
25
121
1
0
37
64
54
43
65
G
34
20
12
23
27
28
35
23
38
0
0
278
73
54
48
54
37
U
130
141
145
187
158
126
106
129
79
145
0
0
75
88
86
98
86
2
56.0
83.2
99.9
219.4
121.6
53.8
22.0
63.3
43.3
272.3
563.7
1082.7
9.7
8.0
2.5
8.7
10.6
p
0
0
0
0
0
0
0.0001
0
0
0
0
0
0.0217
0.0466
0.4771
0.0337
0.0140
A
-0.5608
-0.3649
-0.1926
-1.2308
-0.8149
0.0093
0.0707
0.0403
0.5842
-2.8223
1.6056
-6.6464
0.0403
-0.3248
-0.0065
-0.1221
-0.0065
C
0.0033
-0.1682
-0.3285
-0.8067
-0.3285
-0.6715
-0.3285
-0.6715
-1.0647
1.1862
-5.1232
-6.6483
-0.5089
0.2729
0.0300
-0.2950
0.2951
G
-0.7205
-1.4696
-2.1802
-1.2731
-1.0470
-0.9956
-0.6794
-1.2731
-0.5626
-6.6480
-6.6480
2.2900
0.3691
-0.0619
-0.2299
-0.0619
-0.6005
U
0.7648
0.8813
0.9214
1.2867
1.0447
0.7200
0.4722
0.7537
0.0517
0.9214
-6.6469
-6.6469
-0.0226
0.2059
0.1730
0.3599
0.1730
Ma and Xia 2011
Slide 10
(a)
(b)
Slide 11
PWMS as a proxy of splicing strength
Table 6. Position weight matrix scores (PWMS, as a proxy for splicing strength) is significantly smaller
for splice sites from intron-containing genes (ICGs) whose transcripts failed to recruit U1 snRNPs
(NRG for non-recruiting group) than for those from ICGs whose transcripts binds well to U1 snRNPs
(RG for recruiting group). The pattern is consistent for both 5' ss and 3' ss, based on two-sample t-tests
assuming equal variances. Mann-Whitney tests yield the same conclusion.
5' ss
PWMS Mean
PWMS Var.
N
3' ss
NRG
RG
NRG
RG
8.8138
11.1978
5.3129
7.1762
31.5069
4.8646
13.3017
8.2077
44
202
49
229
t
-4.6346
-3.9257
p
0.0000
0.0001
Slide 12
Highly expressed genes should have high
splicing efficiency.
1.12
1.1
1.08
Splicing efficiency
1.06
1.04
1.02
1
Predictions:
(1) Highly transcribed genes
should, on average, have introns
with greater splicing efficiency
(2) Lowly transcribed genes should
have greater variance in splicing
efficiency than highly transcribed
genes.
0.98
0.96
0.94
0.92
0.9
0
5
10
15
20
Gene expression
25
30
35
PWMS and Gene Expression
Slide 14
PWMS and Splicing Mechanisms
• Expected PWMS is 0 when there is no site-specific
difference in nucleotide frequency distribution
• What does a strongly negative PWMS mean?
• 5’ ss:
– HAC1: -8.8291
– HFM1: -7.3825
– HOP2: -7.8898
• 3’ ss:
– HAC1: -4.4039
– REC102: -3.4464
Slide 15
Perceptron
• The perceptron is one of the simplest artificial neural
networks invented in 1957 at the Cornell
Aeronautical Laboratory by Frank Rosenblatt
(Rosenblatt, 1958).
• Perceptron has been used in bioinformatics research
since 1980s:
– The identification of translational initiation sites in E. coli
(Stormo et al., 1982a).
– Characterizing the ATP/GTP-binding motif (Hirst and
Sternberg, 1991).
– More recent publications use multi-layer perceptrons
which is more complicated than what we cover here.
Slide 16
What perceptron does
• Positive sequences
POS1 ACGT
POS2 GCGC
• Negative sequences
NEG1 AGCT
NEG2 GGCC
• Objective: Find a scoring matrix that can distinguish
between the two groups (positive and negative) of
sequences
Slide 17
Definitions
POS1 ACGT
POS2 GCGC
L
PS  WS j , j
j 1
PSPOS1  WA,1  WC ,2  WG,3  WT ,4  4
NEG1 AGCT
NEG2 GGCC
Table 5-3. The weighting matrix (W) for the
WS j , j  WS j , j  1, if S is from POS group and PS < 0
fictitious example with two sequences of length 4 in
each group, initialized with values of 1. The first row WS j , j  WS j , j  1, if S is from NEG group and PS  0
designates sites 1-4.
No change in WS j , j otherwise.
Base
1
2
3
4
A
1
1
1
1
C
1
1
1
1
G
1
1
1
1
T
1
1
1
1
One may initialize all values to 0 instead of 1.
For amino acid sequences, the
matrix would be 20 by 4.
Slide 18
Iterations and convergence
First round of the training process in the perceptron algorithm. Updated values are highlighted
in bold.
NEG1: AGCT, PS = 4, update
Base
1
2
3
4
A
0
1
1
1
(a)
C
1
1
0
1
G
1
0
1
1
T
1
1
1
0
NEG2: GGCC, PS = 2, update
(b)
POS1: ACGT, PS = 2, no update
(c)
POS2: GCGC, PS = 2, no update
(d)
A
C
G
T
0
1
0
1
1
1
-1
1
1
-1
1
1
1
0
1
0
A
C
G
T
0
1
0
1
1
1
-1
1
1
-1
1
1
1
0
1
0
A
C
G
T
0
1
0
1
1
1
-1
1
1
-1
1
1
1
0
1
0
Slide 19
Post-processing
What is the score
for:
TAAA?
POS1 ACGT
POS2 GCGC
NEG1 AGCT
NEG2 GGCC
Base
1
2
3
4
A
0
1
1
1
C
1
1
-1
0
G
0
-1
1
1
T
1
1
1
0
Base
1
2
3
4
A
0
0
0
0
C
0
1
-1
0
G
0
-1
1
0
T
0
0
0
0
A WSi,j = 0 means either there is no data on that cell or the cell has no discriminant power
Slide 20
Doublet perceptron
P1
P2
P3
P4
P5
P6
1234567890
ACGUAUACGU
ACGUCUACGU
ACGUGUACGU
ACGUUAACGU
ACGUUCACGU
ACGUUGACGU
N1
N1
N1
N1
N1
N1
N1
N1
N1
N1
ACGUAAACGU
ACGUACACGU
ACGUAGACGU
ACGUCAACGU
ACGUCCACGU
ACGUCGACGU
ACGUGAACGU
ACGUGCACGU
ACGUGGACGU
ACGUUUACGU
1
AC
AC
AC
AC
AC
AC
2
CG
CG
CG
CG
CG
CG
3
GU
GU
GU
GU
GU
GU
4
UA
UC
UG
UU
UU
UU
5
AU
CU
GU
UA
UC
UG
6
UA
UA
UA
AA
CA
GA
7
AC
AC
AC
AC
AC
AC
8
CG
CG
CG
CG
CG
CG
9
GU
GU
GU
GU
GU
GU
AC
AC
AC
AC
AC
AC
AC
AC
AC
AC
CG
CG
CG
CG
CG
CG
CG
CG
CG
CG
GU
GU
GU
GU
GU
GU
GU
GU
GU
GU
UA
UA
UA
UC
UC
UC
UG
UG
UG
UU
AA
AC
AG
CA
CC
CG
GA
GC
GG
UU
AA
CA
GA
AA
CA
GA
AA
CA
GA
UA
AC
AC
AC
AC
AC
AC
AC
AC
AC
AC
CG
CG
CG
CG
CG
CG
CG
CG
CG
CG
GU
GU
GU
GU
GU
GU
GU
GU
GU
GU
Slide 21
Doublet Perceptron
AA
AC
AG
AU
CA
CC
CG
CU
GA
GC
GG
GU
UA
UC
UG
UU
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
2
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
3
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
Doublet
4
5
6
0
-6 -4.3
0
-4
0
0
-2
0
0 8.33
0
0
-4
-1
0
-1
0
0
-1
0
0
5
0
0
-1 -0.7
0
-1
0
0
-1
0
0 3.33
0
-3.7 6.67 5.67
-1
5
0
0.33 3.33
0
4
-11
0
7
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
8
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
9
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
Large amount of data are needed to avoid the problem of overfitting
Slide 22
Gene/Motif Prediction
• Objective: given molecular sequence, find its biological
function (preferably in terms of gene ontology).
– Cellular localization
– Biological processes the gene (its product) participates in
– The biological reaction
• Related terms:
– Motif: e.g., RccAUGG
– Fingerprint: a set of aligned sequences from which a position weight
matrix or the like can be constructed to predict the motif effectively
• Gene/Motif prediction methods
–
–
–
–
–
Position weight matrix
Perceptrons
Supervised learning
Hidden Markov Models (HMMs)
Neural networks (e.g., self-organizing map or SOM)
Bayesian inference on breast cancer
Population: women aged 40+
A woman has a chance of 0.01 of
getting breast cancer.
80% of those with breast cancer
will get positive
mammographies.
priors
0.8
0.01
P(i | E ) 
P( E | 1 ) P(1 )
P( E | 1 ) P(1 )  P( E |  2 ) P( 2 )
0.008
0.008
0.075=0.008/
(0.008+0.099)
0.2
0.002
0.099
10% of those without breast
cancer will also get a positive
mammography.
What is the probability that a
woman with a positive
mammography actually has
breast cancer?
posterior
0.099
0.1
0.99
0.9
0.891
0.925
Application of Bayes Theorem
P(i | E ) 
P( H Cancer
P( E | i ) P(i )
P( E | 1 ) P(1 )  P( E |  2 ) P( 2 )
P( H Cancer ) P( E | H Cancer )
| E ) 
P( H Cancer ) P( E | H Cancer )  P( H NoCancer ) P( E | H NoCancer )

0.01 0.8
0.008

 0.075
0.01 0.8  0.99  0.1 0.008  0.099
Many more diagnostic tools are
needed and their predictions are
combined to reach a better joint
prediction.
Bayesian prediction of genes
Population: All 300mers
Probability of the 300mer is a
gene: 0.02
95% of those 300mers from a
gene will get positive scores.
posterior
priors
0.95
0.02
0.019
0.019
0.114=0.019/
(0.019+0.147)
0.05
0.001
15% of those 300mers from nongenes or pseudogenes will also
get a positive score.
0.147
0.15
What is the probability that a
300mer with a positive score is
from a real gene?
0.98
0.85
0.833
P(i | E ) 
P( E | 1 ) P(1 )
P( E | 1 ) P(1 )  P( E |  2 ) P( 2 )
0.147
0.925