Assessment of the suitability of Markov models
for various applications in bioinformatics
Jacek Leluk
Interdisciplinary Centre for Mathematical and Computational Modelling
Warsaw University
Markov models in the identification and localization
of coding sequences in the genome
Identification of coding regions in the genome
Methods based on reference coding DNA, using:
  oligonucleotide occurrence: codon usage, amino acid usage, codon preference, hexamer usage
  tendencies in the occupancy of codon positions: codon prototype
  dependencies in the occupancy of neighbouring positions: Markov models
Methods independent of reference coding DNA, using:
  tendencies in the nucleotide occupancy of codon positions: position asymmetry
  periodic correlation between positions: periodic asymmetry index, average mutual information, Fourier spectra
Methods requiring reference coding DNA
Tendencies in the occupancy of consecutive neighbouring positions
Markov Models
In Markov models, the probability that a given nucleotide occurs at a particular codon position depends on the nucleotide(s) immediately preceding it in the sequence.
The simplest example is the first-order Markov model.
The first-order Markov model is based on the probability of encountering each of the 4 nucleotides at each of the three codon positions, conditional on the nucleotide that precedes that position. The method uses three 4x4 transition matrices (F1, F2 and F3), one for each of the three codon positions.
Markov models of order 1 to 5 are used.
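The three-matrix scheme described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation behind the slides: the function names, the pseudocount, and the convention that training sequences start at codon position 1 are all introduced here.

```python
import math

NUCLEOTIDES = "ACGT"


def train_transition_matrices(coding_seqs, pseudocount=1.0):
    """Estimate F1, F2, F3: one 4x4 transition matrix per codon position.

    matrices[k][prev][cur] estimates P(cur | prev) for a nucleotide `cur`
    located at codon position k + 1. Sequences are assumed to start at
    codon position 1 (an assumption of this sketch).
    """
    counts = [
        {prev: {cur: pseudocount for cur in NUCLEOTIDES} for prev in NUCLEOTIDES}
        for _ in range(3)
    ]
    for seq in coding_seqs:
        for i in range(1, len(seq)):
            prev, cur = seq[i - 1], seq[i]
            if prev in NUCLEOTIDES and cur in NUCLEOTIDES:
                counts[i % 3][prev][cur] += 1
    matrices = []
    for table in counts:
        matrices.append({
            prev: {cur: n / sum(row.values()) for cur, n in row.items()}
            for prev, row in table.items()
        })
    return matrices


def log_likelihood(seq, matrices, frame=0):
    """Log-probability of the transitions in `seq` if a codon starts at offset `frame`.

    The probability of the very first nucleotide is ignored for simplicity.
    """
    total = 0.0
    for i in range(1, len(seq)):
        prev, cur = seq[i - 1], seq[i]
        if prev in NUCLEOTIDES and cur in NUCLEOTIDES:
            total += math.log(matrices[(i - frame) % 3][prev][cur])
    return total


# Toy training set (hypothetical data, far too small for real use).
F1, F2, F3 = train_transition_matrices(["ATGATGATG"])
scores = {f: log_likelihood("ATGATGATG", [F1, F2, F3], f) for f in range(3)}
best_frame = max(scores, key=scores.get)
```

Scoring a window in all three frames and comparing the log-likelihoods is one simple way to use such matrices; a higher-order model would condition on more than one preceding nucleotide.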
Genetic conditioning of the amino acid
replacement probabilities and spectrum in
molecular evolution
Do amino acids have a pedigree?
or...
Do they contain information about their history (genealogy)?
or...
Can amino acid mutational replacements be described as Markovian processes?
The Markov model assumes that the probability of substituting amino acid AA1 by AA2 is the same regardless of which residue (AAx or AAy) AA1 itself arose from:
AAx → AA1 → AA2, with probability Pa for the step AA1 → AA2
AAy → AA1 → AA2, with probability Pb for the step AA1 → AA2
Pa = Pb
The statistical algorithms currently in use are based on a Markovian model of amino acid replacement (they directly use stochastic matrices of replacement frequency indices).
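In symbols, the assumption Pa = Pb can be written as below. The notation X_t, for the residue occupying a sequence position after t replacement steps, is introduced here only for illustration and does not come from the slides.

```latex
\begin{align}
P_a &= P\left(X_{t+1} = \mathrm{AA}_2 \mid X_t = \mathrm{AA}_1,\ X_{t-1} = \mathrm{AA}_x\right) \\
P_b &= P\left(X_{t+1} = \mathrm{AA}_2 \mid X_t = \mathrm{AA}_1,\ X_{t-1} = \mathrm{AA}_y\right) \\
P_a &= P_b = P\left(X_{t+1} = \mathrm{AA}_2 \mid X_t = \mathrm{AA}_1\right)
\end{align}
```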
PAM250 matrix of amino acid replacements
    C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W
C  12
S   0   2
T  -2   1   3
P  -3   1   0   6
A  -2   1   1   1   2
G  -3   1   0  -1   1   5
N  -4   1   0  -1   0   0   2
D  -5   0   0  -1   0   1   2   4
E  -5   0   0  -1   0   0   1   3   4
Q  -5  -1  -1   0   0  -1   1   2   2   4
H  -3  -1  -1   0  -1  -2   2   1   1   3   6
R  -4   0  -1   0  -2  -3   0  -1  -1   1   2   6
K  -5   0   0  -1  -1  -2   1   0   0   1   0   3   5
M  -5  -2  -1  -2  -1  -3  -2  -3  -2  -1  -2   0   0   6
I  -2  -1   0  -2  -1  -3  -2  -2  -2  -2  -2  -2  -2   2   5
L  -6  -3  -2  -3  -2  -4  -3  -4  -3  -2  -2  -3  -3   4   2   6
V  -2  -1   0  -1   0  -1  -2  -2  -2  -2  -2  -2  -2   2   4   2   4
F  -4  -3  -3  -5  -5  -5  -4  -6  -5  -5  -2  -4  -5   0   1   2  -1   9
Y   0  -3  -3  -5  -3  -5  -2  -4  -4  -4   0  -4  -4  -2  -1  -1  -2   7  10
W  -8  -2  -5  -6  -6  -7  -4  -7  -7  -5  -3   2  -3  -4  -5  -2  -6   0   0  17
Why is tryptophan the most conservative residue here?
BLOSUM62 matrix of amino acid replacements
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A   4
R  -1   5
N  -2   0   6
D  -2  -2   1   6
C   0  -3  -3  -3   9
Q  -1   1   0   0  -3   5
E  -1   0   0   2  -4   2   5
G   0  -2   0  -1  -3  -2  -2   6
H  -2   0   1  -1  -3   0   0  -2   8
I  -1  -3  -3  -3  -1  -3  -3  -4  -3   4
L  -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4
K  -1   2   0  -1  -3   1   1  -2  -1  -3  -2   5
M  -1  -1  -2  -3  -1   0  -2  -3  -2   1   2  -1   5
F  -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6
P  -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   7
S   1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4
T   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5
W  -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11
Y  -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   7
V   0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4
Replacement Arg → Lys according to the statistical interpretation using stochastic matrix indices
Matrix      Score
PAM250      3
BLOSUM62    2
BLOSUM35    2
BLOSUM45    3
BLOSUM100   3
Diagram of genetic relationships between amino acids
[Figure: network diagram of amino acids in one-letter code, with "–" marking stop codons; the figure also carries the labels AGCU and 1, 2, 3.]
Diagram of amino acid genetic relationships / Diagram of codon genetic relationships
[Figure: network diagram in which each node is labelled with an amino acid and its codon, e.g. K (AAA), E (GAA), M (AUG), W (UGG); the stop codons UAA, UAG and UGA are marked "–"; the figure also carries the labels AGCU and 1, 2, 3.]
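Arginine-to-lysine mutational conversion pathways for arginines of different origin
[Figure: conversion pathway diagram. One pathway is fully recoverable: Met (AUG) → Arg (AGG) → Lys (AAG). The remaining nodes are His (CAC), Asn (AAC), Ser (AGC), Gln (CAG), Pro (CCC) and Arg (CGG, CGC), with one step marked "?".]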
Possible single-point-mutational processing of serine with respect to its origin
Trp (UGG) → Ser (UCG) → Thr, Ala, Pro, Leu, Trp, Ser, stop (UAG)
Asn (AAU) → Ser (AGU) → Thr, Ile, Asn, Arg, Cys, Gly, Ser
Amino acid mutational substitution based on a single transition/transversion is NOT a Markovian process
Theoretical proof
The conversion pathways of arginine into lysine, glutamine and serine, for arginine arising from codons that encode different amino acids
Possible codons for arginine: AGA AGG CGA CGG CGC CGT
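The argument can be checked by enumerating, for each of the six arginine codons, the residues reachable through one nucleotide substitution. The sketch below assumes the standard genetic code written in the DNA alphabet; the function and variable names are illustrative only.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; "*" marks stop codons.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip(("".join(c) for c in product(BASES, repeat=3)), AMINO_ACIDS))


def single_point_neighbours(codon):
    """Return the 9 codons that differ from `codon` at exactly one position."""
    result = []
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                result.append(codon[:pos] + base + codon[pos + 1:])
    return result


def reachable_residues(codon):
    """Residues encoded by the single-point neighbours of `codon`."""
    return {CODON_TABLE[n] for n in single_point_neighbours(codon)}


ARG_CODONS = sorted(c for c, aa in CODON_TABLE.items() if aa == "R")

# For each target residue, list the arginine codons that can reach it in one step.
sources = {}
for target in "KQS":
    sources[target] = [c for c in ARG_CODONS if target in reachable_residues(c)]

for target, codons in sources.items():
    print(target, codons)
# K ['AGA', 'AGG']
# Q ['CGA', 'CGG']
# S ['AGA', 'AGG', 'CGC', 'CGT']
```

No single arginine codon reaches lysine, glutamine and serine alike, which is the codon-level basis for the pathway slides that follow.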
Conversion of arginine into lysine
Met (ATG) → Arg (AGG) → Lys (AAG)
Leu (CTR) → Arg (CGR) → Gln (CAR)
Lys (AAR) ↔ Arg (AGR)
His (CAY) → Arg (CGY) → Ser (AGY)
Arg (CGR) → Arg (AGR) → Lys (AAR)
Conversion of arginine into serine
Met (ATG) → Arg (AGG) → Ser (AGY)
Leu (CTR) → Arg (CGR) → Arg (AGR)
Ser (AGY) ↔ Arg (CGY)
His (CAY) → Arg (CGY) → Ser (AGY)
Conversion of arginine into glutamine
Met (ATG) → Arg (AGG) → Lys (AAG)
Gln (CAG) ↔ Arg (CGG)
Leu (CTR) → Arg (CGR) → Gln (CAR)
His (CAY) → Arg (CGY) → His (CAY)
Gln (CAR) ↔ Arg (CGR)
then...
The probability of replacing one amino acid with another depends significantly on which amino acids occupied that position in the past.
There is a high risk that commonly used algorithms applying stochastic data matrices (MDM, PAM, BLOSUM) lead to a wrong interpretation of the mutational processes occurring in proteins.
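The dependence on history can be made concrete with a toy calculation. The sketch below assumes the standard genetic code and, purely for illustration, that every single-nucleotide substitution is equally likely and that selection is ignored; neither assumption comes from the slides, and the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip(("".join(c) for c in product(BASES, repeat=3)), AMINO_ACIDS))


def neighbours(codon):
    """Codons that differ from `codon` at exactly one position."""
    return [codon[:p] + b + codon[p + 1:]
            for p in range(3) for b in BASES if b != codon[p]]


def next_residue_distribution(origin, current="R"):
    """Distribution of the next residue for a `current` residue that arose
    from `origin` by one substitution (uniform toy model, see lead-in)."""
    histories = []
    for codon, aa in CODON_TABLE.items():
        if aa == origin:
            histories.extend(n for n in neighbours(codon) if CODON_TABLE[n] == current)
    if not histories:
        raise ValueError("no single-point pathway from origin to current residue")
    weight = Fraction(1, 9 * len(histories))
    dist = {}
    for codon in histories:
        for n in neighbours(codon):
            aa = CODON_TABLE[n]
            dist[aa] = dist.get(aa, Fraction(0)) + weight
    return dist


from_met = next_residue_distribution("M")   # Arg that arose from Met (ATG -> AGG)
from_his = next_residue_distribution("H")   # Arg that arose from His (CAY -> CGY)
p_a = from_met.get("K", Fraction(0))        # 1/9
p_b = from_his.get("K", Fraction(0))        # 0
```

Under these assumptions the Arg → Lys step has probability 1/9 for an arginine derived from methionine and 0 for one derived from histidine, so Pa and Pb differ.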
Genetic relationships between Arg and Met/Gln
[Figure: diagram of genetic relationships between amino acids in one-letter code, with "–" marking stop codons and the labels AGCU and 1, 2, 3, here illustrating Arg, Met and Gln.]
Arg-Met and Arg-Gln substitutions.
"Two kinds" of arginine
Each entry gives the alignment position followed by the residues observed there ("–" denotes a gap). Positions marked * contain both Arg and Met; positions marked # contain both Arg and Gln.

Squash inhibitors (from Cucurbitaceae plants)
1 RVMIG *; 2 RVMIGS *; 3 C; 4 P; 5 RKL; 6 I; 7 [LW][Y]; 8 MNK; 9 REKQP #; 10 C
11 KSQT; 12 [KSRHQTY][V] #; 13 DN; 14 RSDA; 15 D; 16 C; 17 LFMP; 18 ALTPGR; 19 DEGQK; 20 C
21 VITKR; 22 C; 23 LKGQMV; 24 PKEQRSA #; 25 NHEQDS–; 26 [I][D]–; 27 GE; 28 YFIH; 29 C; 30 G

Bowman-Birk type inhibitors
47 C; 48 C; 49 DRBSN; 50 QHELZRSIFTK #; 51 C; 52 ASTKEMILRDVPF *; 53 C; 54 T; 55 [KR][A]; 56 S
57 NMIEKRDQ *#; 58 P; 59 P; 60 QKZETI; 61 C; 62 [RHQS][V] #; 63 C; 64 STNVAEHR; 65 DZBN; 66 MILVTR *
67 R; 68 L; 69 NDE; 70 SKTR; 71 C; 72 H; 73 S; 74 A; 75 C; 76 KSDEN
77 SLGRTFH; 78 C; 79 IAVLM; 80 C; 81 ATNR; 82 LYFRK; 83 S; 84 YIEFMQDN; 85 P; 86 AGP
87 QKZM; 88 C; 89 FVRIHSQ #; 90 C; 91 VTBGLAYF; 92 DB; 93 [IMTV][Q]; 94 TNBKAHD; 95 DBNKT; 96 FSY
97 C; 98 [YH][T]; 99 EAKPD; 100 PSAK; 101 C

Ovomucoid domains (Kazal type)
1 VILE; 2 NDH; 3 C; 4 [STR][D]; 5 LPKQE; 6 YF; 7 ALPKQ; 8 SQTK; 9 GTRS–; 10 IVKNT
11 GVSTL; 12 KRTQ– #; 13 DGN–; 14 G–; 15 TNRKE–; 16 STLAQP; 17 WMLIV–; 18 VTI; 19 [A][R]–; 20 C
21 PT; 22 [RM][F] *; 23 [NI][E]; 24 [L][Y]; 25 KSLQDV; 26 [P][E]; 27 [V][H]; 28 C; 29 GA; 30 TS
31 DN; 32 GS; 33 SFV; 34 T; 35 Y; 36 SDA; 37 NS; 38 [ED][R]; 39 C; 40 GSTF
41 ILF; 42 C; 43 [L][A][N]; 44 [YH][A]; 45 NY; 46 RAILV; 47 EQ; 48 HQLS; 49 GHRN; 50 ATR
51 [NHST][E]; 52 VIL; 53 ESKAGN; 54 [K][L]; 55 ELKSRV; 56 [YHS][K]; 57 [DN][M]; 58 GA; 59 EKRA; 60 C
61 RKE; 62 PLQE; 63 KERD; 64 [ISV][H]; 65 [VG][PT]; 66 [MEK][PS]
PAM250 matrix of amino acid replacements
[Slide repeated: the values are identical to the PAM250 matrix shown earlier.]
PAM250 and BLOSUM62 scores for the replacements Arg-Lys, Lys-Gln, Lys-Glu, Arg-Gln and Arg-Glu
Replacement   PAM250   BLOSUM62
Arg/Lys          3        2
Lys/Gln          1        1
Arg/Gln          1        1
Lys/Glu          0        1
Arg/Glu         -1        0
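For comparison with these scores, the genetic distance between two residues can be expressed as the minimum number of nucleotide substitutions separating any pair of their codons. The sketch below assumes the standard genetic code; the helper name min_substitutions is introduced here for illustration.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip(("".join(c) for c in product(BASES, repeat=3)), AMINO_ACIDS))

# Group codons by the residue they encode.
CODONS_OF = {}
for codon, aa in CODON_TABLE.items():
    CODONS_OF.setdefault(aa, []).append(codon)


def min_substitutions(aa1, aa2):
    """Smallest number of differing nucleotides between any codon of aa1
    and any codon of aa2."""
    best = 3
    for c1 in CODONS_OF[aa1]:
        for c2 in CODONS_OF[aa2]:
            diff = sum(1 for x, y in zip(c1, c2) if x != y)
            best = min(best, diff)
    return best


pairs = [("R", "K"), ("K", "Q"), ("R", "Q"), ("K", "E"), ("R", "E")]
distances = {a + "/" + b: min_substitutions(a, b) for a, b in pairs}
print(distances)
# {'R/K': 1, 'K/Q': 1, 'R/Q': 1, 'K/E': 1, 'R/E': 2}
```

Arg/Glu is the only pair in the table that cannot be connected by a single nucleotide change: every Glu codon has A in the second position and G in the first, while every Arg codon has G in the second position and A or C in the first.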
Genetic relationships among Arg, Lys, Glu and Gln
[Figure: diagram of genetic relationships between amino acids in one-letter code, with "–" marking stop codons and the labels AGCU and 1, 2, 3, here illustrating Arg, Lys, Glu and Gln.]
Arg-Glu and Lys-Glu substitutions (Arg/Lys/Gln/Glu replacements)
[Tables repeated: position-specific residue sets for the squash inhibitors (positions 1-30), the Bowman-Birk type inhibitors (positions 47-101) and the ovomucoid domains of Kazal type (positions 1-66), with the same values as on the slide "Arg-Met and Arg-Gln substitutions". On this slide a "!" marker appears next to the Bowman-Birk entry STNVAEHR (position 64); a second "!" marker is present, but its position cannot be recovered.]
Multiple alignment of seven chicken ovoinhibitor domains obtained with Markovian and non-Markovian methods
What part of the codon contains the information about the previous amino acid that occurred at a given position of the protein sequence?
At most 2/3 of the entire codon.
Ala (GCG) → Val (GUG)
How long is the information about the codons of preceding amino acids stored?
The shortest storage period is 3 transitions/transversions:
Ala (GCG) → Val (GUG) → Met (AUG) → Ile (AUA)
Ser (UCC) → Ser (UCU) → Thr (ACU) → Ser (AGU)
Theoretically the longest period is infinite:
Lys (AAA) → Asn (AAC) → Asp (GAC) → His (CAC) → Gln (CAG) → Glu (GAG) → Asp (GAU)
Tyr (UAU) → His (CAU) → Asn (AAU) → Lys (AAG) → Gln (CAG) → His (CAC)
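The storage period can be traced by recording which codon positions have never changed since the starting codon. The following is a minimal sketch; the function name retained_positions is an assumption made for this illustration.

```python
def retained_positions(path):
    """For each step of a single-point mutation pathway, return the codon
    positions (1-3) that still hold the nucleotide of the starting codon
    and have never been substituted."""
    untouched = {0, 1, 2}
    history = []
    for prev, cur in zip(path, path[1:]):
        changed = [i for i in range(3) if prev[i] != cur[i]]
        if len(changed) != 1:
            raise ValueError(prev + " -> " + cur + " is not a single-point mutation")
        untouched -= set(changed)
        history.append(sorted(i + 1 for i in untouched))
    return history


shortest = retained_positions(["GCG", "GUG", "AUG", "AUA"])
longest = retained_positions(["AAA", "AAC", "GAC", "CAC", "CAG", "GAG", "GAU"])
print(shortest)  # [[1, 3], [3], []]
print(longest)   # [[1, 2], [2], [2], [2], [2], [2]]
```

In the first pathway all three positions have been replaced after three steps, so nothing of the original Ala codon remains. In the second pathway the middle nucleotide survives every step, so part of the ancestral codon can persist indefinitely.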
CONCLUSIONS
The analysis of genetic semihomology rules out the applicability of the Markov model to studies of protein variability at the amino acid level.
Amino acid codons do contain information about the "ancestral" amino acids whose codons were the starting point for the codon of the current residue.
This applies mainly to positions undergoing single-point mutations, the most basic mechanism of evolutionary variability.
Thank you for your attention!