qa.iis.sinica.edu.tw

An Iterative Relaxation Technique for the NMR Backbone Assignment Problem
Wen-Lian Hsu
Institute of Information Science
Academia Sinica
1/62
Characteristics of Our Method
• Model this as a constraint satisfaction problem
• Solve it using natural language parsing techniques
  - Both top-down and bottom-up
  - An iterative approach
• Create spin systems based on noisy data.
• Link spin systems by using maximum independent set finding techniques.
2/62
Outline
• Introduction
• Method
• Experiment Results
• Conclusion
3/62
Blind Man’s Elephant
• We cannot directly “see” the positions of these atoms (the structure)
• But we can measure a set of parameters (with constraints) on these atoms
  - Which can help us infer their coordinates
• Each experiment can only determine a subset of parameters (with noise)
• To combine the parameters of different experiments we need to stitch them together
4/62
The Flow of NMR Experiments
1. Get protein samples
2. Collect NMR spectra
3. Resonance assignment
4. Structure constraints
5. Calculation and simulation
   - Energy minimization
   - Fitness of structure constraints
5/62
Chemical Shift Assignment
Find out the chemical shift for each atom
• Backbone atoms: Ca, Cb, C’, N, NH
  Various experiments: HSQC, CBCANH, CBCA(CO)NH, HN(CA)CO, HNCO, HN(CO)CA, HNCA
• Side chain: all others (especially CHs)
  TOCSY-HSQC, HCCCONH, CCCONH, HCCH-TOCSY
[Figure: one amino acid, with backbone atoms N, H, Ca, CO and side-chain groups Cb H2, Cg H2, Cd H3]
6/62
Some Relevant Parameters
[Figure: a peptide backbone (-N-C-C-N-C-C-...) with side-chain groups (CH3, H-C-H, O, H), annotated with chemical shift ranges in ppm: 16-20, 17-23, 18-23, 19-24, 30-35, 31-34, 55-60]
7/62
Three important experiments
• Backbone: Ca, Cb, C’, N, NH
HSQC, CBCANH, CBCA(CO)NH, HN(CA)CO, HNCO, HN(CO)CA, HNCA
HSQC
sequential assignment
chemical shifts of Ca, Cb, NH
8/62
Our NMR spectra
• HSQC
• CBCA(CO)NH (2 peaks)
• HNCACB (4 peaks)
[Figure: backbone diagrams of two adjacent residues, marking the atoms (N, H, Ca, Cb) observed in CBCANH and in CBCA(CO)NH]
HSQC Spectra
• HSQC peaks (1 chemical shift for an amino acid)
H       N        Intensity
8.109   118.60   65920032
10/62
CBCA(CO)NH Spectra
• CBCA(CO)NH peaks (2 chemical shifts for one amino acid)
H       N        C       Intensity
8.116   118.25   16.37   79238811
8.109   118.60   36.52   65920032
11/62
CBCANH Spectra
• CBCANH peaks (4 chemical shifts for one amino acid)
  - Ca (+), Cb (-)
H       N        C       Intensity
8.116   118.25   16.37    79238811
8.109   118.60   36.52   -65920032
8.117   118.90   61.58   -51223894
8.119   117.25   57.42   109928374
12/62
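The sign convention on the CBCANH slide (Ca positive, Cb negative) can be sketched in a few lines. This is only an illustration, assuming each peak is stored as an (H, N, C, intensity) record; the `Peak` type and the `carbon_type` function are hypothetical names, not part of the presented tool.

```python
from typing import NamedTuple

class Peak(NamedTuple):
    h: float          # amide proton shift (ppm)
    n: float          # amide nitrogen shift (ppm)
    c: float          # carbon shift (ppm)
    intensity: float  # signed peak intensity

def carbon_type(peak: Peak) -> str:
    """Label a CBCANH peak by the sign of its intensity: Ca (+), Cb (-)."""
    return "Ca" if peak.intensity > 0 else "Cb"

# The four CBCANH rows shown on the slide.
peaks = [
    Peak(8.116, 118.25, 16.37, 79238811),
    Peak(8.109, 118.60, 36.52, -65920032),
    Peak(8.117, 118.90, 61.58, -51223894),
    Peak(8.119, 117.25, 57.42, 109928374),
]
labels = [carbon_type(p) for p in peaks]
```

With the slide's rows, two peaks come out as Ca and two as Cb, which is the 2 + 2 pattern a complete CBCANH group is expected to show.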
A Dataset Example
[Figure: peaks of a real dataset plotted on H and N axes. Legend: HSQC, HNCACB (4 peaks expected), CBCA(CO)NH (2 peaks expected). Each cluster is labeled with peak identifiers written as *n, # n and plain numbers, e.g. *9 with # 364, # 365, # 366, # 367 and 27, 185]
13/62
Backbone Assignment
• Goal
  - Assign chemical shifts to N, NH, Ca (and Cb) along the protein backbone.
• General approaches
  - Generate spin systems
    A spin system: an amino acid with known chemical shifts on its N, NH, Ca (and Cb).
  - Link spin systems
14/62
Ambiguities
• All 4 point experiments are mixed together
• All 2 point experiments are mixed together
• Each spin system can be mapped to several amino acids in the protein sequence
• False positives, false negatives
15/62
Previous Approaches
• Constrained bipartite matching problem
  - Legal matching
  - Illegal matching under constraints
• The spin system might be ambiguous
• Can’t deal with ambiguous links
16/62
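Natural Language Processing: Signal or Noise?
• Speech recognition: homophone selection
  Input sentence: 台北市一位小孩走失了 ("A child has gone missing in Taipei City")
  Candidate words competing for the same sounds:
  - 台北市 (Taipei City), 台北 (Taipei)
  - 適宜 (suitable), 事宜 (matters)
  - 一位 (one person), 一味 (blindly), 移位 (to shift position)
  - 小孩 (child), 走失 (to go missing)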
17/62
An Error-Tolerant Algorithm
18/62
Phrase, Sentence Combination
19/62
Hierarchical Analysis
Sentence-meaning templates
Sentence-pattern templates
Phrase templates
Word templates
20/62
Perfect Group
• Each spin group contains 6 points, in which
  - 4 points are from the first experiment
  - 2 points are from the second experiment
[Figure: backbone diagram of two adjacent residues, marking the N, H, Ca and Cb atoms that produce the 6 points]
A Perfect Spin System Group
CBCA(CO)NH
N         H       C        Intensity       Residue
113.293   7.897   56.294   1.64325e+008    i-1
113.293   7.897   27.853   1.08099e+008    i-1
CBCANH
N         H       C        Intensity       Type
113.293   7.92    62.544   8.52851e+007    Ca
113.293   7.92    56.294   4.71331e+007    Ca
113.293   7.92    68.483   -8.54121e+007   Cb
113.293   7.92    28.165   -3.49346e+007   Cb
Resulting spin system
Ca(i-1)   Cb(i-1)   Ca(i)    Cb(i)
56.294    28.165    62.544   68.483
23/62
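How the six peaks of a perfect group collapse into the four values Ca(i-1), Cb(i-1), Ca(i), Cb(i) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `assemble_perfect` and the 0.5 ppm matching tolerance are assumptions, since the slides give no threshold. The idea is that a CBCANH carbon confirmed by a CBCA(CO)NH peak belongs to residue i-1, and the remaining one belongs to residue i.

```python
from typing import List, Optional, Tuple

def assemble_perfect(cbcaconh_c: List[float],
                     cbcanh: List[Tuple[float, float]],
                     tol: float = 0.5) -> Optional[Tuple[float, float, float, float]]:
    """Return (Ca(i-1), Cb(i-1), Ca(i), Cb(i)), or None if the group is not perfect.

    cbcaconh_c: the 2 carbon shifts seen in CBCA(CO)NH (residue i-1 only).
    cbcanh: the 4 (carbon shift, signed intensity) pairs seen in CBCANH.
    tol: assumed matching tolerance in ppm (not stated on the slides).
    """
    if len(cbcaconh_c) != 2 or len(cbcanh) != 4:
        return None
    ca = [c for c, inten in cbcanh if inten > 0]  # Ca peaks are positive
    cb = [c for c, inten in cbcanh if inten < 0]  # Cb peaks are negative
    if len(ca) != 2 or len(cb) != 2:
        return None

    def split(cands: List[float]) -> Optional[Tuple[float, float]]:
        # Exactly one candidate must be confirmed by a CBCA(CO)NH peak.
        prev = [c for c in cands if any(abs(c - r) <= tol for r in cbcaconh_c)]
        if len(prev) != 1:
            return None
        own = [c for c in cands if c != prev[0]]
        return prev[0], own[0]

    ca_split, cb_split = split(ca), split(cb)
    if ca_split is None or cb_split is None:
        return None
    return ca_split[0], cb_split[0], ca_split[1], cb_split[1]

# Carbon shifts and intensities from the slide's perfect group.
result = assemble_perfect(
    [56.294, 27.853],
    [(62.544, 8.52851e7), (56.294, 4.71331e7),
     (68.483, -8.54121e7), (28.165, -3.49346e7)],
)
```

On the slide's data this yields 56.294, 28.165, 62.544 and 68.483, the same four values listed for the group.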
False Positives and False Negatives
• False positives
  - Noise with high intensity
  - Produce fake spin systems
• False negatives
  - Peaks with low intensity
  - Missing peaks
• In real wet-lab data, nearly 50% of the peaks are noise (false positives).
24/62
Spin System Group
[Figure: the dataset example plotted on H and N axes, with individual peak clusters marked as Perfect, False Negative or False Positive]
25/62
Outline
• Introduction
• Method
• Experiment Results
• Conclusion
26/62
Main Idea
• Deal with false negatives in the spin system generation procedure.
• Eliminate false positives in the spin system linking procedure.
• Perform the spin system generation and linking procedures in an iterative fashion.
27/62
Spin System Group Generation
• Three types of spin system groups are generated based on the quality of the CBCANH data:
  - Perfect
  - Weak false negative
  - Severe false negative
28/62
Perfect Spin Systems
• A spin system is determined without any added pseudo peak.
CBCA(CO)NH
N         H       C        Intensity       Residue
113.293   7.897   56.294   1.64325e+008    i-1
113.293   7.897   27.853   1.08099e+008    i-1
CBCANH
N         H       C        Intensity       Type
113.293   7.92    62.544   8.52851e+007    Ca
113.293   7.92    56.294   4.71331e+007    Ca
113.293   7.92    68.483   -8.54121e+007   Cb
113.293   7.92    28.165   -3.49346e+007   Cb
Resulting spin system
Ca(i-1)   Cb(i-1)   Ca(i)    Cb(i)
56.294    28.165    62.544   68.483
29/62
Weak False Negative Spin System Group
• A spin system is determined with an added pseudo peak.
CBCA(CO)NH
N         H       C        Intensity       Residue
115.481   9.604   60.044   1.30407e+008    i-1
115.481   9.604   30.66    6.93923e+007    i-1
CBCANH (the last row is the pseudo peak, taken from CBCA(CO)NH)
N         H       C        Intensity       Type
115.481   9.616   59.419   2.25295e+008    Ca
115.481   9.616   31.291   -4.82097e+007   Cb
115.481   9.616   27.853   -1.33326e+008   Cb
115.481   9.604   60.044   1.30407e+008    Ca
Resulting spin system
Ca(i-1)   Cb(i-1)   Ca(i)    Cb(i)
60.044    31.291    59.419   27.583
30/62
Severe False Negative Spin System Group
• A spin system is determined with two added pseudo peaks.
CBCA(CO)NH
N         H       C        Intensity       Residue
119.857   8.435   28.166   3.36293e+007    i-1
119.857   8.435   59.419   1.56434e+008    i-1
CBCANH (the last two rows are the pseudo peaks, taken from CBCA(CO)NH)
N         H       C        Intensity       Type
119.856   8.477   58.481   3.7353e+008     Ca
119.856   8.477   28.79    -2.55735e+008   Cb
119.857   8.435   28.166   3.36293e+007    Cb
119.857   8.435   59.419   1.56434e+008    Ca
Resulting spin system
Ca(i-1)   Cb(i-1)   Ca(i)    Cb(i)
59.419    28.166    58.481   28.79
Note: it is also possible that Ca(i-1) = 28.166 and Cb(i-1) = 59.419.
31/62
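The three group types can be told apart by counting which CBCANH peaks are present. The sketch below is inferred from the three example slides rather than taken from a stated rule: it assumes a complete group has two positive (Ca) and two negative (Cb) CBCANH peaks, and that every missing peak is filled by one pseudo peak. The function name `classify_group` is hypothetical.

```python
from typing import List, Tuple

def classify_group(cbcanh_intensities: List[float]) -> Tuple[str, List[str]]:
    """Classify a spin system group by how many CBCANH peaks are missing.

    Returns (kind, missing), where missing lists the atom types that would
    need a pseudo peak.
    """
    n_ca = sum(1 for x in cbcanh_intensities if x > 0)  # positive peaks: Ca
    n_cb = sum(1 for x in cbcanh_intensities if x < 0)  # negative peaks: Cb
    if n_ca > 2 or n_cb > 2:
        return "unclassified", []
    missing = ["Ca"] * (2 - n_ca) + ["Cb"] * (2 - n_cb)
    kinds = {0: "perfect", 1: "weak false negative", 2: "severe false negative"}
    return kinds.get(len(missing), "unclassified"), missing

# Observed (non-pseudo) CBCANH intensities from the three example slides.
perfect = classify_group([8.52851e7, 4.71331e7, -8.54121e7, -3.49346e7])
weak = classify_group([2.25295e8, -4.82097e7, -1.33326e8])
severe = classify_group([3.7353e8, -2.55735e8])
```

For the weak example this reports one missing Ca, and for the severe example one missing Ca and one missing Cb, matching the types of the pseudo peaks shown on those slides.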
A note on spin system generation
• To generate *ALL* possible spin systems, a peak can be included in more than one spin system.
  - False positives are eliminated in the spin system linking procedure.
  - False negatives are treated by adding pseudo peaks.
• A rule-based mechanism is used to filter out incompatible spin systems (false positives).
  - Adopt a maximum weight independent set algorithm
32/62
Spin System Linking
• Goal
  - Link spin systems into segments that are as long as possible.
• Constraints
  - Each spin system is uniquely assigned to a position of the target protein sequence.
  - Two spin systems are linked only if the differences between the intra-residue chemical shifts of one and the inter-residue chemical shifts of the other are less than the predefined thresholds.
33/62
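The linking constraint can be written as a small predicate. This is a sketch under assumptions: the `SpinSystem` layout, the function name `can_link` and the 0.5 ppm thresholds are illustrative, since the slides only say that the thresholds are predefined. A Cb value of 0 stands for a residue without a Cb, as in the glycine rows of the positioning example later in the talk.

```python
from typing import NamedTuple

class SpinSystem(NamedTuple):
    ca_prev: float  # Ca of residue i-1 (inter-residue)
    cb_prev: float  # Cb of residue i-1 (inter-residue)
    ca: float       # Ca of residue i (intra-residue)
    cb: float       # Cb of residue i (intra-residue)

def can_link(a: SpinSystem, b: SpinSystem,
             ca_tol: float = 0.5, cb_tol: float = 0.5) -> bool:
    """True if b can directly follow a on the sequence.

    b's inter-residue shifts must agree with a's intra-residue shifts
    within the (assumed) thresholds.
    """
    return (abs(a.ca - b.ca_prev) < ca_tol and
            abs(a.cb - b.cb_prev) < cb_tol)

# Two spin systems taken from the positioning example (D->G and G->R).
dg = SpinSystem(55.266, 38.675, 44.555, 0.0)
gr = SpinSystem(44.417, 0.0, 55.043, 30.04)
forward = can_link(dg, gr)
backward = can_link(gr, dg)
```

The check is directional: `dg` followed by `gr` passes, while the reverse order fails on the Cb comparison.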
A Peculiar Parking Lot (valet parking)
Information you have: The make of your car, the car
parked in front of you (approximately). Together with
others, try to identify as many cars in the right order as
possible (maximizing the overall satisfaction).
Backbone Assignment
Target sequence: DGRIGEIKGRKTLATPAVRRLAMENNIKLS
[Figure: the labeled peak clusters of the dataset example, to be placed in the right order along the target sequence]
Spin System Positioning
• We assign spin system groups to a protein sequence according to their codes.
Residue codes
D   50
G   10
R   40
I   50|51
Spin system (Ca(i-1) Cb(i-1) Ca(i) Cb(i))   => codes
55.266 38.675 44.555 0                       => 50 10
44.417 0      55.043 30.04                   => 10 40
44.417 0      30.665 28.72                   => 10 40
55.356 29.782 60.044 37.541                  => 40 50
36/62
Link Spin System groups
Sequence: D G R I
Segment 1: 55.266 38.675 44.555 0
Segment 2: 44.417 0 55.043 30.04
Segment 3: 55.356 29.782 60.044 37.541
Also shown under G R, without a segment label: 44.417 0 30.665 28.72
37/62
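Iterative Concatenation
[Figure: spin systems (1, 2, ..., 47, 56) are placed along the sequence DGRI….FKJJREKL and concatenated step by step (Step 1, Step 2, ..., Step n-1, Step n) into candidate segments (Segment 1, Segment 2, Segment 31, ..., Segment 78, Segment 79, Segment 99)]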
38/62
Conflict Segments
[Figure: segments 71, 78, 79, 97, 98 and 99 aligned against the sequence DGRIGEIKGRKTLATPAVRRLAMENNIKLS]
Two kinds of conflict segments
• Overlap (e.g. segment 71, segment 99)
• Use the same spin system (e.g. both segment 78 and segment 79 contain spin system 1)
39/62
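The two kinds of conflict described above can be tested with one predicate. This is a minimal sketch: the `Segment` record (a start position plus an ordered tuple of spin system ids) is an assumed representation, and the sample segments are made up for illustration rather than read off the figure.

```python
from typing import NamedTuple, Tuple

class Segment(NamedTuple):
    start: int                     # 0-based sequence position of the first spin system
    spin_systems: Tuple[int, ...]  # spin system ids, in sequence order

def conflicts(a: Segment, b: Segment) -> bool:
    """Two segments conflict if they overlap on the sequence or share a spin system."""
    a_end = a.start + len(a.spin_systems)  # exclusive end position
    b_end = b.start + len(b.spin_systems)
    overlap = a.start < b_end and b.start < a_end
    shared = bool(set(a.spin_systems) & set(b.spin_systems))
    return overlap or shared

# Hypothetical segments, for illustration only.
s1 = Segment(0, (1, 2, 3))
s2 = Segment(2, (4, 5, 6))      # covers position 2, which s1 also covers
s3 = Segment(10, (1, 7, 8))     # reuses spin system 1
s4 = Segment(20, (9, 10, 11))   # disjoint from s1 in both senses
```

Here `s1` conflicts with `s2` by overlap and with `s3` by a shared spin system, while `s1` and `s4` are compatible.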
A Graph Model for Spin System Linking
• G(V, E)
  - V: a set of nodes (segments).
  - E: (u, v), u, v ∈ V, where u and v are in conflict.
• Goal
  - Assign as many non-conflicting segments as possible => find the maximum independent set of G.
40/62
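An Example of G
• Seq.: GEIKGRKTLATPAVRRLAMENNIKLSE
Segment1: SP12->SP13->SP14
Segment2: SP9->SP13->SP20->SP4
Segment3: SP8->SP15->SP21
Segment4: SP7->SP1->SP15->SP3
[Figure: the four segments aligned against the sequence, and the graph G on Seg1, Seg2, Seg3, Seg4. Seg1 and Seg2 are joined by an edge labeled SP13, Seg3 and Seg4 by an edge labeled SP15, and two further edges are labeled Overlap]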
41/62
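For a graph as small as the example above, the maximum independent set can be found by exhaustive search. The sketch below builds only the edges that come from shared spin systems; the "Overlap" edges depend on sequence positions that cannot be recovered from the transcript, so they are left out, and the result is illustrative rather than the figure's answer. The function name `max_independent_set` is hypothetical.

```python
from itertools import combinations

# The four segments listed on the slide.
segments = {
    "Seg1": ["SP12", "SP13", "SP14"],
    "Seg2": ["SP9", "SP13", "SP20", "SP4"],
    "Seg3": ["SP8", "SP15", "SP21"],
    "Seg4": ["SP7", "SP1", "SP15", "SP3"],
}

# Conflict edges from shared spin systems only (overlap edges omitted).
edges = {
    frozenset((u, v))
    for u, v in combinations(segments, 2)
    if set(segments[u]) & set(segments[v])
}

def max_independent_set(nodes, edges):
    """Exhaustive search, largest subsets first; only suitable for a few nodes."""
    nodes = list(nodes)
    for k in range(len(nodes), 0, -1):
        for subset in combinations(nodes, k):
            if all(frozenset(p) not in edges for p in combinations(subset, 2)):
                return set(subset)
    return set()

best = max_independent_set(segments, edges)
```

Exhaustive search is exponential in the number of segments, which is why an approximation algorithm is needed once there are many candidate segments.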
Segment weight
• The longer a segment is, the higher its weight.
• The less frequent a segment is, the higher its weight.
42/62
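The slide states only the two monotonic rules, not a formula. One function that satisfies both is length divided by frequency; this is an assumption made for illustration, not the weight used by the authors. "Frequency" is read here as the number of places a segment could be assigned to, which is one possible interpretation of the slide.

```python
def segment_weight(length: int, frequency: int) -> float:
    """Assumed weight: grows with segment length, shrinks with frequency."""
    if length < 1 or frequency < 1:
        raise ValueError("length and frequency must be positive")
    return length / frequency

long_unique = segment_weight(6, 1)
short_unique = segment_weight(3, 1)
short_ambiguous = segment_weight(3, 2)
```

Any other function that increases with length and decreases with frequency would fit the slide equally well.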
Find Maximum Weight Independent Set of G
• Boppana, R. and M.M. Halldórsson, Approximating Maximum Independent Sets by Excluding Subgraphs. BIT, 1992. 32(2).
43/62
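To give a feel for weighted independent set selection, here is a simple greedy heuristic. It is explicitly not the Boppana and Halldórsson algorithm cited above, only a stand-in that is easy to follow: take the heaviest remaining node, discard its neighbours, repeat. The weights used in the example are the segment lengths from the earlier four-segment example, which is an assumption; the name `greedy_mwis` is hypothetical.

```python
def greedy_mwis(weights, edges):
    """Greedy heuristic: repeatedly pick the heaviest remaining node and drop its neighbours."""
    neighbours = {v: set() for v in weights}
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    chosen = []
    remaining = set(weights)
    while remaining:
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(remaining), key=lambda v: weights[v])
        chosen.append(best)
        remaining -= neighbours[best] | {best}
    return chosen

# Assumed weights: the lengths of the four example segments.
weights = {"Seg1": 3, "Seg2": 4, "Seg3": 3, "Seg4": 4}
edges = [("Seg1", "Seg2"), ("Seg3", "Seg4")]
picked = greedy_mwis(weights, edges)
```

A greedy choice like this carries no optimality guarantee in general; it is shown only to make the selection step concrete.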
An Iterative Approach
• We perform spin system generation and linking iteratively.
• Three stages.
44/62
First Stage
• Generate perfect spin systems;
  - Perform spin system concatenation on the newly generated perfect spin systems to generate segments;
  - Retain segments that contain at least 3 spin systems;
  - Perform MaxIndSet on the segments;
  - Drop spin systems (and related peaks) that are used in the resulting segments.
45/62
Second Stage
• Generate weak false negative spin systems.
  - Perform segment extension on the resulting segments of the first stage (using unused perfect and newly generated weak false negative spin systems);
  - Perform spin system concatenation on the unused spin systems (perfect + weak false negative) to generate longer segments;
  - Retain segments that contain at least 3 spin systems;
  - Perform MaxIndSet on the segments;
  - Drop spin systems (and related peaks) that are used in the resulting segments.
46/62
Third Stage
• Generate severe false negative spin systems.
  - Perform segment extension on the resulting segments of the second stage (using unused perfect and weak false negative spin systems, as well as newly generated severe false negative spin systems);
  - Perform spin system concatenation on the unused spin systems (perfect + weak false negative + severe false negative) to generate longer segments;
  - Retain segments that contain at least 3 spin systems;
  - Perform MaxIndSet on the segments.
47/62
Segment Extension
[Figure: an existing segment (109) on the sequence ….FKJJREKL…. is extended with new spin systems (e.g. 29, 45), producing a new segment 109]
48/62
Segment Extension
[Figure: segments 71, 77, 78, 97 and 99 aligned against the sequence DGRGEKGRKTLATPAVRRLAMENNIKLS; extension turns 97 and 99 into 97' and 99'; MaxIndSet then keeps 77, 99' and 97']
49/62
Outline
• Introduction
• Method
• Experimental Results
• Conclusion
50/62
Experimental Results
• Two datasets obtained from our collaborator Dr. Tai-Huang Huang in IBMS, Academia Sinica:
  - Average precision: 87.5%
  - Average recall: 73.1%
• Perfect data from BMRB: 99.1%
51/62
Real Wet-Lab Datasets
• The two datasets are obtained from our collaborator Dr. Tai-Huang Huang in IBMS at Academia Sinica, Taiwan.
Datasets                                                     sbd      lbd
# of amino acids                                             53       85
# of amino acids that are assigned manually by biologists    42       80
# of HSQC peaks                                              58       78
# of CBCA(CO)NH peaks                                        258      271
# of HNCACB peaks                                            224      620
# of expected CBCA(CO)NH                                     84       160
# of expected HNCACB                                         168      320
false positive of CBCA(CO)NH                                 67.4%    41.0%
false positive of HNCACB                                     25.0%    48.4%
52/62
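The expected peak counts and false positive rates in the table above can be reproduced from the other rows. The sketch assumes that each manually assigned residue contributes 2 expected CBCA(CO)NH peaks and 4 expected HNCACB peaks, and that the false positive rate is the share of observed peaks beyond the expected count; both assumptions are inferred from the numbers, not stated on the slide.

```python
def false_positive_rate(observed: int, expected: int) -> float:
    """Fraction of observed peaks that exceed the expected count."""
    return (observed - expected) / observed

assigned = {"sbd": 42, "lbd": 80}  # manually assigned residues
observed = {
    "CBCA(CO)NH": {"sbd": 258, "lbd": 271},
    "HNCACB": {"sbd": 224, "lbd": 620},
}
peaks_per_residue = {"CBCA(CO)NH": 2, "HNCACB": 4}

# Percentages rounded to one decimal, keyed by (experiment, dataset).
rates = {
    (exp, ds): round(
        100 * false_positive_rate(observed[exp][ds],
                                  peaks_per_residue[exp] * assigned[ds]), 1)
    for exp in observed
    for ds in assigned
}
```

For example, sbd has 42 x 2 = 84 expected CBCA(CO)NH peaks against 258 observed, so (258 - 84) / 258 gives the 67.4% in the table.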
Experimental Results on Real Data
Datasets                    sbd     lbd
# of amino acids            53      85
# of assigned amino acids   42      80
# of HSQC peaks             58      78
# of CBCANH peaks           224     620
# of CBCA(CO)NH peaks       258     271

                 # of correctly assigned   # of assigned   accuracy   recall
Method on sbd    32                        35              91.4%      76.2%
Method on lbd    56                        67              83.6%      70.0%
53/62
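The accuracy and recall columns follow from the counts in the table. In the sketch below, "accuracy" is computed as precision (correct / assigned by the method) and recall as correct / manually assigned residues; for lbd the reference count is taken as 80, the value that reproduces the reported 70.0% recall. The function name `precision_recall` is illustrative.

```python
def precision_recall(correct: int, assigned: int, reference: int):
    """Precision = correct / assigned; recall = correct / reference."""
    return correct / assigned, correct / reference

sbd = precision_recall(correct=32, assigned=35, reference=42)
lbd = precision_recall(correct=56, assigned=67, reference=80)

# Averages over the two datasets, as reported in the summary slide.
avg_precision = (sbd[0] + lbd[0]) / 2
avg_recall = (sbd[1] + lbd[1]) / 2
```

The two averages come out at 87.5% and 73.1%, the figures quoted as average precision and average recall.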
Outline
• Introduction
• Method
• Experiment Results
• Conclusion
54/62
Conclusion
• We model the backbone assignment problem as a constraint satisfaction problem
• This problem is solved using a natural language parsing technique (both bottom-up and top-down approaches)
• The same approach seems to work for a large class of noise reduction problems that are discrete in nature
55/62
A genetic algorithm for NMR backbone resonance assignment (I)
• Randomly generate a population of chromosomes
  - Each chromosome represents a possible backbone resonance assignment
• Fitness function
  - Evaluate the fitness of each chromosome according to the connectivity between adjacent amino acids
56/62
A genetic algorithm for NMR backbone resonance assignment (II)
• Crossover operation
  - An offspring inherits different connected blocks from its parents
• Mutation operation
  - Make a new connected block from any position to increase the population diversity
57/62
Generation of a random chromosome
[Figure: a chromosome of spin system groups: 2 7 11 6 22 18 32 17 23 5 9]
• Step1. Randomly select a position x
• Step2. Randomly select a SSGroup i from CL(x)
• Step3. Extend connected fragments from i to both sides by using adjacency lists until no more extension can be found.
• Step4. Repeat Step1~Step3 until all positions are assigned.
58/62
Fitness Evaluation
[Figure: a chromosome of spin system groups: 2 7 11 6 22 18 32 17 23 5 9]
Building blocks: connected fragments
Fitness(ch) = the number of connected pairs, associated with their chemical shift differences.
Two principles:
1. The more connected pairs it has, the higher score it gets.
2. The smaller its chemical shift differences are, the higher score it gets.
59/62
Crossover Operation
[Figure: parent chromosomes of spin system groups (2 7 11 6 22 18 32 17 23 5 9) are cut at a cutting site and recombined into an offspring]
60/62
Mutation operation
• Once a position is going to mutate, the following positions will also mutate to produce a connected fragment.
[Figure: a chromosome with the mutation point marked]
61/62
Experiment Results
• The accuracy on two real datasets
  - SBD: 95.1% (FP: 67%)
  - LBD: 100% (FP: 48%)
• The average accuracy on perfect BMRB datasets (902 proteins)
62/62
