Two cases of chemometrics application in protein crystallography
Andrey Bogomolov
European Molecular Biology Laboratory (EMBL), Hamburg, Germany
Outline
• Protein crystallography: a brief introduction
• Case I: determination of protein secondary structure from
the raw diffraction data using PLS-R
• Case II: modeling of crystal radiation damage
• Potential applications of chemometric techniques to
crystallography (of biological macromolecules)
Protein crystallography: introduction
• Protein (macromolecular) crystallography is a scientific
discipline that studies…
• biological objects: proteins, DNA, RNA etc. …
• by physical means: X-ray diffraction, synchrotron radiation …
• on the chemical level: 3D-structure, complexes, interactions …
• with the extensive use of mathematics: data analysis, modeling
• The main objectives:
• solve 3D-structure of a molecule
• explain its biological function at the atomic level
• Today’s hot topic:
• drug design
• part of the global “-omics” project (genomics/proteomics)
Protein crystallography workflow
• Steps: expression & purification → crystallization → data collection → phasing → structure solution
• Results along the way: protein (DNA, RNA) solution → protein crystal → diffraction pattern → electron density map → 3D structure
Protein Data Bank (PDB)
Global data collection (>30000 records)
• www.pdb.org
• 3D structures
• experimental data
• biological and chemical information
[Figure: PDB entries, total and per year, 1976–2006; vertical axis from 0 to 40 000]
PDB entries by method and molecule type:
Method        Proteins  NA    Complexes  Other  Total
X-ray         27335     807   1270       85     29497
NMR           4421      674   118        17     5230
El. Microsc.  77        9     27         0      113
Other         70        4     3          0      77
Total         31903     1494  1418       102    34917
Crystallographic data collection: Wilson plot
[Figure: data collection in the X-ray beam, labelled "control" and "optimization"; Wilson plot of log intensity vs. reciprocal resolution, theoretical and experimental curves]
Case I: Determination of protein secondary structure
[Figure: α-helix and β-sheet]
Problem:
• determine the contents (fractions of the polypeptide chain) of secondary structure elements in a protein molecule from the raw diffraction data (Wilson plot)
• this is a well-established approach for CD and IR spectra of protein solutions
• PLS regression is one of the best methods
• for Wilson plots, only qualitative evidence of such a correlation exists, and only for “theoretical” data
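To make the PLS regression idea more concrete, here is a minimal single-response PLS (PLS1, NIPALS-style) sketch in plain Python. It is only an illustration: the function names `pls1_fit` and `pls1_predict` and the toy numbers are assumptions, not the data or the software used in this work. Rows of `X` stand in for preprocessed Wilson plots and `y` for a secondary structure fraction.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pls1_fit(X, y, n_components):
    """Fit a PLS1 model; returns (x_mean, y_mean, [(w, p, q), ...])."""
    n, m = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(m)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(m)] for row in X]  # centered X
    f = [v - y_mean for v in y]                                # centered y
    comps = []
    for _ in range(n_components):
        # weight vector: direction in X space with maximal covariance with y
        w = [dot([E[i][j] for i in range(n)], f) for j in range(m)]
        norm = math.sqrt(dot(w, w))
        if norm < 1e-12:      # nothing left to explain in y
            break
        w = [v / norm for v in w]
        t = [dot(row, w) for row in E]                         # scores
        tt = dot(t, t)
        p = [dot([E[i][j] for i in range(n)], t) / tt for j in range(m)]
        q = dot(f, t) / tt
        # deflate X and y before extracting the next component
        E = [[E[i][j] - t[i] * p[j] for j in range(m)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        comps.append((w, p, q))
    return x_mean, y_mean, comps

def pls1_predict(model, x):
    """Predict y for one new row x using the stored components."""
    x_mean, y_mean, comps = model
    e = [a - b for a, b in zip(x, x_mean)]
    y_hat = y_mean
    for w, p, q in comps:
        t = dot(e, w)
        y_hat += q * t
        e = [ei - t * pi for ei, pi in zip(e, p)]
    return y_hat

# Toy data (hypothetical): 5 "Wilson plots" with 3 points each
X = [[1.0, 2.0, 0.5], [2.0, 1.0, 1.5], [3.0, 4.0, 1.0],
     [4.0, 3.0, 2.5], [5.0, 5.0, 2.0]]
y = [0.1 * a + 0.05 * b - 0.02 * c for a, b, c in X]
model = pls1_fit(X, y, n_components=2)
print([round(pls1_predict(model, row), 3) for row in X])
```

In practice the number of components would be chosen by validation (for example, by the lowest RMSEP), and a tested library implementation would normally be preferred over a hand-written one.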
Secondary structure determination: data
Data preprocessing:
• averaging with an optimal bin size*
• special scaling (correction for anisotropic B-factor)*
• taking the natural logarithm
• conversion into the matrix (Wilson plots in rows)*
• auto-scaling
• outliers detection and removal*
*) experimental data only
[Figure: theoretical and experimental Wilson plots, log(<I>) vs. 1/d (Å⁻¹), for 1hq3 (α=0.63, β=0.06), 1at0 (α=0.00, β=0.60) and 1d5t (α=0.27, β=0.23)]
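Part of the preprocessing chain listed on this slide can be sketched in plain Python: averaging intensities in 1/d bins, taking the natural logarithm, stacking the Wilson plots as matrix rows and auto-scaling the columns. The helper names `wilson_row` and `autoscale`, the bin edges and the toy reflections are assumptions made for illustration. The anisotropic B-factor correction and the outlier removal are not shown.

```python
import math

def wilson_row(inv_d, intensity, edges):
    """Average intensities in 1/d bins and return ln(<I>) for each bin."""
    n_bins = len(edges) - 1
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for s, i in zip(inv_d, intensity):
        for k in range(n_bins):
            if edges[k] <= s < edges[k + 1]:
                sums[k] += i
                counts[k] += 1
                break
    if 0 in counts:
        raise ValueError("empty bin: choose wider bins")
    # assumes the mean intensity in every bin is positive
    return [math.log(sums[k] / counts[k]) for k in range(n_bins)]

def autoscale(matrix):
    """Center each column and scale it to unit sample standard deviation."""
    n = len(matrix)
    cols = list(zip(*matrix))
    means = [sum(c) / n for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / (n - 1))
            for c, m in zip(cols, means)]
    # constant columns carry no information and are set to zero
    return [[(v - m) / s if s > 0 else 0.0
             for v, m, s in zip(row, means, stds)] for row in matrix]

# Toy example (hypothetical numbers): two crystals, two resolution bins
edges = [0.2, 0.4, 0.6]
crystal_a = wilson_row([0.25, 0.30, 0.45, 0.55], [900.0, 700.0, 300.0, 200.0], edges)
crystal_b = wilson_row([0.22, 0.35, 0.41, 0.58], [500.0, 450.0, 150.0, 90.0], edges)
data_matrix = autoscale([crystal_a, crystal_b])  # Wilson plots in rows
print(data_matrix)
```

Using the same bin edges for every crystal is what makes the rows comparable, so that each matrix column corresponds to one resolution shell.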
Secondary structure determination: data (2)
[Figure: theoretical and experimental Wilson plots, log(<I>) vs. 1/d (Å⁻¹), for 1hq3 (α), 1at0 (β) and 1d5t (α+β)]
Secondary structure determination: calibration results
RMSEP & correlation coefficients (in parentheses) for different methods
Method               α-helix       β-sheet
Theoretical          0.062 (0.96)  0.060 (0.92)
Experimental*        0.112 (0.84)  0.081 (0.84)
IR/PLS [1]           0.078 (0.93)  0.075 (0.93)
CD/PLS [2]           0.077 (0.94)  0.092 (0.89)
μ: α=0.31, β=0.24    0.21 (0.00)   0.22 (0.00)
*) Resolution (1/d) = 0.52 Å⁻¹ (~1.9 Å)
[Figures: predicted vs. measured for α-helix (theoretical) and β-sheet (experimental)]
1. S. Navea, R. Tauler, A. de Juan, Elucidation of protein secondary structure, Anal. Biochem. 336 (2005) 231–242
2. K.A. Oberg, J.-M. Ruysschaert, and E. Goormaghtigh, The optimization of protein secondary structure determination with infrared and circular dichroism spectra, Eur. J. Biochem. 271 (2004) 2937–2948
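For reference, the two figures of merit reported on this slide, RMSEP and the correlation coefficient between measured and predicted fractions, can be computed as in the sketch below. The function names and the toy numbers are assumptions. Returning 0.0 for a constant predictor is a convention chosen here, not something stated in the source.

```python
import math

def rmsep(measured, predicted):
    """Root mean square error of prediction."""
    n = len(measured)
    return math.sqrt(sum((m - p) ** 2 for m, p in zip(measured, predicted)) / n)

def pearson_r(a, b):
    """Pearson correlation; 0.0 if either series is (numerically) constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    if saa < 1e-20 or sbb < 1e-20:   # assumed convention for constant input
        return 0.0
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return sab / math.sqrt(saa * sbb)

# Toy example (hypothetical helix fractions)
measured = [0.10, 0.30, 0.45, 0.63]
predicted = [0.15, 0.27, 0.50, 0.58]
print(rmsep(measured, predicted), pearson_r(measured, predicted))

# Baseline: always predict the mean of the measured values
baseline = [sum(measured) / len(measured)] * len(measured)
print(rmsep(measured, baseline), pearson_r(measured, baseline))
```

A mean-only baseline of this kind gives a useful reference point: a calibration model is informative only if its RMSEP is clearly lower than that of the baseline.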
Case II: Modeling radiation damage
• A biological crystal exposed to X-rays undergoes radiation damage
• Modeling of radiation damage is important for
• understanding its effect on the protein
• optimization of data collection
• Present state of the problem
• no comprehensive theory of radiation damage (RD)
• specific effects are well known, but the main changes are nonspecific
• Suggestion by Gleb Bourenkov:
• radiation dose has a linear effect on atomic B-factors
• Task
• check for linearity, find the reason(s) for deviations
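The linearity check can be sketched as an ordinary least-squares fit of B = b0 + b1*dose for a single atom, followed by an inspection of the residuals. This is a minimal illustration in plain Python: the function name `fit_line` and the toy dose and B-factor values are assumptions, not the trypsin data.

```python
def fit_line(dose, b_factor):
    """Least-squares fit of B = b0 + b1*dose; returns (b0, b1, residuals)."""
    n = len(dose)
    mean_x = sum(dose) / n
    mean_y = sum(b_factor) / n
    sxx = sum((x - mean_x) ** 2 for x in dose)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dose, b_factor))
    b1 = sxy / sxx                 # slope: B-factor increase per unit dose
    b0 = mean_y - b1 * mean_x      # intercept: B-factor at zero dose
    residuals = [y - (b0 + b1 * x) for x, y in zip(dose, b_factor)]
    return b0, b1, residuals

# Toy example (hypothetical values for one atom)
dose = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
b_factor = [20.1, 24.8, 30.3, 34.9, 40.2, 44.7]
b0, b1, residuals = fit_line(dose, b_factor)
print(b0, b1, max(abs(r) for r in residuals))
```

Repeating the fit for every atom gives a distribution of slopes b1. Atoms with large or systematically curved residuals are the natural candidates when looking for the reasons for deviation from linearity.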
Radiation damage modeling: data (trypsin)
[Figures: dose vs. measurement number; B-factor vs. dose; distribution (p) of the slope b1 in B = b0 + b1*dose]
Radiation damage modeling: results
[Figures: score plot, t1 (X: 92%; Y: 99%) vs. t2 (X: 2%; Y: 1%); loading plot, p1 vs. p2, with atoms as points (CYS and GLU marked); RMSEP vs. number of PCs (1 to 7); predicted vs. measured, r = 0.999, RMSEP = 9.4×10⁻³]
Conclusions
• Multivariate data analysis has great potential for protein crystallography
• currently its application is episodic
• it rarely goes beyond PCA
• Method-centric approach would be beneficial:
• “I have a method, I am looking for problems”
X-files
Chemometric methods:
• PCA, Factor Analysis
• SIMCA, PLSD
• Multivariate Regression
• MSPC, Design Of Experiment
• Curve Resolution
• Multivariate Image Analysis
• Target Factor Analysis
• PARAFAC, 3(multi)-way
• Wavelet Transform
Crystallography tasks:
• crystallization, HTPC
• crystal screening
• crystal auto-mounting
• data collection
• data reduction
• radiation damage
• phasing
• structure solution
• structure refinement
Challenge
Critical re-assessment of the entire protein crystallographic workflow with a multivariate approach in mind: an ambitious project for chemometricians?
Acknowledgements
• Alexander Popov
• Gleb Bourenkov
• Victor Lamzin