A Geometric Framework for Robust Nearest Neighbor Analysis of Protein Structure and Function
Deepak Bandyopadhyay
Department of Computer Science, University of North Carolina at Chapel Hill
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Outline
Motivation Methods Applications
2 5/1/2020
Motivation: use geometric proximity (Voronoi / Delaunay) to analyze protein structures for insight into their function.
Applications: let's modify existing neighbor analyses of protein structure to make them robust, and design new ones!
SNAPP; packing differences; secondary structure; hinges. Detail: structural fingerprints for function inference.
Geometric structures on point sets
● Voronoi diagram
● Delaunay triangulation / tessellation (DT)
Input: points. Output: neighbors.
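To make the "points in, neighbors out" idea concrete, here is a minimal 2D sketch that finds Delaunay triangles by brute force with the empty-circumcircle test. It is for illustration only: the function names are hypothetical, the method is O(n^4), and real work would use a library such as CGAL, as the talk does.

```python
from itertools import combinations


def circumcircle(a, b, c):
    """Return (cx, cy, r_squared) of the circle through a, b, c, or None if collinear."""
    ax, ay = a
    bx, by = b
    cx, cy = c
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if abs(d) < 1e-12:
        return None
    a2, b2, c2 = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
    ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
    uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
    return ux, uy, (ax - ux) ** 2 + (ay - uy) ** 2


def delaunay_triangles(points):
    """Triangles (index triples) whose circumcircle contains no other input point."""
    n = len(points)
    triangles = []
    for i, j, k in combinations(range(n), 3):
        cc = circumcircle(points[i], points[j], points[k])
        if cc is None:
            continue
        ux, uy, r2 = cc
        others = (m for m in range(n) if m not in (i, j, k))
        if all((points[m][0] - ux) ** 2 + (points[m][1] - uy) ** 2 >= r2 - 1e-9
               for m in others):
            triangles.append((i, j, k))
    return triangles


def delaunay_neighbors(points):
    """Map each point index to the set of indices it shares a Delaunay edge with."""
    neighbors = {i: set() for i in range(len(points))}
    for tri in delaunay_triangles(points):
        for u, v in combinations(tri, 2):
            neighbors[u].add(v)
            neighbors[v].add(u)
    return neighbors


if __name__ == "__main__":
    pts = [(0, 0), (4, 0), (2, 4), (2, 1)]
    print(delaunay_triangles(pts))
    print(delaunay_neighbors(pts))
```

In 3D the same idea applies with tetrahedra and empty circumspheres, which is the case used for proteins in the rest of the talk.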
Delaunay tessellation of proteins
● Represent each amino acid by a point
  ● Cα, side-chain centroid, Cβ, ...
● Delaunay tessellation
● Delaunay tetrahedra = nearest neighbor quadruplets
(Figure: residues A, L, M, D, E joined into quadruplets.)
Delaunay Tessellation Applications
● SNAPP, four-body statistical potential for hydrophobic core stability [Carter et al, 2001]
● Decoy discrimination [Krishnamoorthy and Tropsha, 2003]
● Scoring ligand-receptor binding affinity [Zhang et al, 2004]
● Mining frequent substructures in protein families [Huan et al., 2004, 2005]
● Structure-based function inference [Bandyopadhyay et al, 2005]
Outline
Geometric proximity structures have problems with imprecise points. But we can fix this!
Motivation Methods Applications
Effect of Imprecision on Delaunay
● If point coordinates are imprecise, what happens to the Delaunay neighbors?
● Think of 4 nearly co-circular points in 2D: Delaunay edges may flip, so neighbors change.
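The edge flip can be seen with a small sketch. It uses the standard in-circle determinant on a convex quadrilateral; the function names and the example coordinates are illustrative assumptions, not taken from the talk.

```python
def in_circle(a, b, c, d):
    """Positive if d lies inside the circle through a, b, c (given counter-clockwise)."""
    rows = []
    for x, y in (a, b, c):
        dx, dy = x - d[0], y - d[1]
        rows.append((dx, dy, dx * dx + dy * dy))
    (a1, a2, a3), (b1, b2, b3), (c1, c2, c3) = rows
    return (a1 * (b2 * c3 - b3 * c2)
            - a2 * (b1 * c3 - b3 * c1)
            + a3 * (b1 * c2 - b2 * c1))


def delaunay_diagonal(quad):
    """For a convex quad p0..p3 in counter-clockwise order, return the Delaunay diagonal.

    Diagonal (0, 2) is Delaunay unless p3 falls inside the circle through p0, p1, p2,
    in which case the edge flips to (1, 3).
    """
    p0, p1, p2, p3 = quad
    return (1, 3) if in_circle(p0, p1, p2, p3) > 0 else (0, 2)


if __name__ == "__main__":
    # A unit square is exactly co-circular; nudge one corner by 0.01 either way.
    print(delaunay_diagonal([(0, 0), (1, 0), (1, 1), (0.01, 0.99)]))
    print(delaunay_diagonal([(0, 0), (1, 0), (1, 1), (-0.01, 1.01)]))
```

A coordinate change of about 0.01 is enough to change which pairs of points count as neighbors, which is the instability the following slides address.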
Which applications are affected by instability of Delaunay ?
● Quantitative, continuous measures (e.g. Voronoi volumes): less affected
● Qualitative, discretized measures (e.g. frequent subgraphs): worse affected
● When people use Delaunay in analysis of protein structure, they assume it is robust to perturbations!
Method 1 : Almost-Delaunay (AD) tetrahedra
● A 4-tuple of points is in AD(ε) if, by perturbing all points in the set by at most ε, its circumscribing sphere can become empty.
● The minimum perturbation required, ε, is the AD threshold.
● Each vertex can move within a sphere of radius ε.
(Figure legend: green is Delaunay, in AD(0); red is in AD(ε).)
AD tetrahedra for protein 2ACY, 98 residues, Cαs (colored by threshold; DT not shown, for clarity).
AD tetrahedra may overlap; they do not tile space.
Computing AD thresholds
[Bandyopadhyay and Snoeyink, 2004]
● Find the spherical shell of minimum width, using a result from computational metrology [Garcia-Lopez et al, 1998]
● Given a set of points P, a simplex t is in AD(ε) iff its points are contained in the shell between 2 concentric spheres s.t.:
  • difference in radii is 2ε, minimum over all such concentric spheres
  • inner sphere contains no points of P
(Figure: 2D example.)
Code to compute AD edges, triangles, tetrahedra for 3D points, in C++/CGAL (with MATLAB interface and utilities) is available from: http://www.cs.unc.edu/~debug/software
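The shell condition can be sketched for a fixed candidate center. For a center c, the thinnest admissible shell has its outer radius at the farthest simplex vertex and its inner radius at the nearest point of P, so half its width is an upper bound on the AD threshold. The sketch below is not the published algorithm, which searches over all centers; the function names and the short candidate list are assumptions made for illustration.

```python
from math import dist


def shell_half_width(points, simplex, center):
    """Half the width of the thinnest shell around `center` that contains the
    simplex vertices while its inner sphere is empty of all points."""
    outer = max(dist(points[i], center) for i in simplex)
    inner = min(dist(p, center) for p in points)
    return (outer - inner) / 2.0


def ad_threshold_upper_bound(points, simplex, candidate_centers):
    """Best (smallest) half-width over a hand-picked list of candidate centers.

    The true AD threshold is the minimum over all centers, so this is only
    an upper bound on it.
    """
    return min(shell_half_width(points, simplex, c) for c in candidate_centers)


if __name__ == "__main__":
    # Three corners of a unit square plus a fourth point just inside their circle.
    pts = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.01, 0.99)]
    print(shell_half_width(pts, (0, 1, 2), (0.5, 0.5)))
```

For a simplex that is already Delaunay, the circumcenter gives a half-width of zero, matching AD(0).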
Method 2: Delaunay Probability
● AD(ε) captures worst-case deviation in coordinates
● Uncertainty in actual coordinates → probabilistic model
● Assume each point has a Gaussian p.d.f.
● Prob(sphere empty of $p_i$) $= 1 - \int_{\text{sphere}} (\text{p.d.f. of } p_i)$
● Probability that tetrahedron $abcd$ is Delaunay: integrate over all possible spheres defined by $a, b, c, d$:
$$\Pr(abcd \text{ is Delaunay}) = \int_{\text{spheres}} \text{prob(sphere)} \cdot \prod_{p \notin \{a,b,c,d\}} \text{prob(sphere empty of } p)$$
● AD algorithm makes Delaunay Probability computation feasible
● Delaunay Probability significant only for tetrahedra with low ε!
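As a rough illustration of what the Delaunay probability measures, the sketch below estimates it by Monte Carlo sampling in 2D rather than by the integral above: every point is jittered with isotropic Gaussian noise and the fraction of trials in which a given triangle stays Delaunay is reported. The function names, the noise model with a single sigma, and the trial count are assumptions for this example only.

```python
import random


def orient(a, b, c):
    """Twice the signed area of triangle abc (positive if counter-clockwise)."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def in_circle(a, b, c, d):
    """Positive if d lies inside the circle through a, b, c (given counter-clockwise)."""
    rows = []
    for x, y in (a, b, c):
        dx, dy = x - d[0], y - d[1]
        rows.append((dx, dy, dx * dx + dy * dy))
    (a1, a2, a3), (b1, b2, b3), (c1, c2, c3) = rows
    return (a1 * (b2 * c3 - b3 * c2)
            - a2 * (b1 * c3 - b3 * c1)
            + a3 * (b1 * c2 - b2 * c1))


def is_delaunay(points, tri):
    """True if no other point lies strictly inside the circumcircle of `tri`."""
    a, b, c = (points[i] for i in tri)
    o = orient(a, b, c)
    if o == 0:
        return False
    if o < 0:
        b, c = c, b  # make the triangle counter-clockwise
    return all(in_circle(a, b, c, points[m]) <= 0
               for m in range(len(points)) if m not in tri)


def delaunay_probability(points, tri, sigma, trials=2000, seed=0):
    """Fraction of Gaussian-perturbed copies of `points` in which `tri` is Delaunay."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        noisy = [(x + rng.gauss(0, sigma), y + rng.gauss(0, sigma)) for x, y in points]
        hits += is_delaunay(noisy, tri)
    return hits / trials


if __name__ == "__main__":
    square = [(0, 0), (1, 0), (1, 1), (0, 1)]
    print(delaunay_probability(square, (0, 1, 2), sigma=0.05))
```

For the co-circular square the two diagonals are equally likely, so the estimate should land near one half, while a triangle with a large empty-circle margin stays at 1.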
Summary of contributions
● Algorithmic:
  ● Theory and algorithm for the general framework
  ● Fast and robust implementation for 3D points
● Application domain:
  ● Nearest neighbor analysis with imprecision
● Applications explored:
  ● Scoring protein packing with a statistical 4-body potential (SNAPP)
  ● Quantifying packing differences between proteins and other structures
  ● Assigning secondary structure from Cαs
  ● Analyzing conformational changes and finding hinge residues
  ● Finding local packing motifs specific to protein families, applied to structure classification, and functional inference for structural genomics
Outline
Motivation Methods Applications
Let's modify existing neighbor analyses of protein structure to make them robust, and design new ones!
SNAPP; packing differences; secondary structure; hinges. Detail: structural fingerprints for function inference
Application 1: SNAPP
● Simplicial Neighborhood Analysis of Protein Packing [Carter et al, JMB'99]
● Residues represented by side-chain centroids
● Protein structure represented as an aggregate of space-filling, irregular tetrahedra
● Unique and objective recognition of nearest neighbor residues in sets of four (quadruplets)
(Figure: residues A, M, L, D, E.)
Likelihood Scores for 8724 Compositions
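The slides do not spell out how a likelihood score per composition is computed. A common form for four-body potentials of this kind is q = log10(f / p), where f is the observed frequency of a quadruplet composition and p is its expected frequency from independent residue frequencies, multiplied by the number of distinct orderings of the composition. The sketch below assumes that form and, as a further simplification, estimates residue frequencies from the quadruplet vertices themselves; the function name is hypothetical.

```python
from collections import Counter
from math import factorial, log10


def snapp_log_likelihoods(quadruplets):
    """Map each unordered residue composition to log10(observed / expected frequency)."""
    compositions = Counter(tuple(sorted(q)) for q in quadruplets)
    total_quads = sum(compositions.values())
    residue_counts = Counter(r for q in quadruplets for r in q)
    total_residues = sum(residue_counts.values())

    scores = {}
    for comp, count in compositions.items():
        observed = count / total_quads
        # Number of distinct orderings of the composition: 4! / prod(t_n!).
        orderings = factorial(4)
        for multiplicity in Counter(comp).values():
            orderings //= factorial(multiplicity)
        expected = float(orderings)
        for residue in comp:
            expected *= residue_counts[residue] / total_residues
        scores[comp] = log10(observed / expected)
    return scores


if __name__ == "__main__":
    quads = [("A", "L", "M", "D"), ("L", "A", "D", "M"), ("A", "A", "L", "L")]
    for comp, q in snapp_log_likelihoods(quads).items():
        print(comp, round(q, 3))
```

A positive score marks a composition seen more often than chance would predict, which is the kind of signal the likelihood plot on this slide displays.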
Likelihood Mapped to hydrophobic core
Applications
● Applications
  ● Decoy discrimination [Krishnamoorthy and Tropsha, 2003]
    • Weighting scheme based on tetrahedron sequence topology
  ● Conformation change on ligand binding [Sherman et al, 2003]
  ● Study of folding simulations [Krishnamoorthy and Tropsha, 2003]
  ● Ligand-receptor binding affinity [Zhang, Golbraikh and Tropsha, 2004]
● Contribution of almost-Delaunay:
  ● How stable is the SNAPP score computed using Delaunay?
  ● Compute variants of it using AD and Delaunay Probability
Results : scoring decoys
(Figure: results on three decoy sets: 1. 4state_reduced, 2. lattice_ssfit, 3. semfold.)
● SNAPP with Delaunay probabilities distinguishes decoys from the native state as well as (even better than?) Delaunay-based SNAPP.
● Hence, the original Delaunay-based score is stable.
Results : scoring CASP5 predictions
● SNAPP with Delaunay probabilities discriminates native structures from predictions as well as Delaunay-based SNAPP (usually even better).
● Hence, the original Delaunay-based score is stable.
(Figure axis: Z-score (Rank).)
Application 2: Packing Differences
● How does DT change as points are perturbed, for different point sets?
  ● random points
  ● random walks
  ● random folded chains
  ● decoys
  ● proteins (Cα, side-chain centroids)
Stability of the DT in Proteins
Left: Average # of AD tetrahedra at low ε (< 0.5 Å) grows faster for random points than for proteins, as seen in this cumulative histogram. This suggests that the DT is stable for small perturbations in proteins.
Right: Number of Delaunay and AD(0.3) tetrahedra for a sample of CASP5 predictions. Notice that the native structures, colored green, have fewer AD tetrahedra for the same number of Delaunay tetrahedra.
Application 3: Secondary structure from Cα
● AD threshold histogram of α-helix has a unique signature that enables helix assignment from Cαs…
(Figure: α-helix.)
AD secondary structure assignment
● Strong α-helix signal, weaker β-sheet and β-turn signals
● Better accuracy than previous work [Wako and Yamato, 1998]
● More tolerant to structural and H-bond imperfections than DSSP
  ● 1bg5, irregular helix on right
● Applications:
  ● consensus assignment
  ● structure prediction
Above: Visual comparison of α-helix, β-sheet and β-turn assignments in 1BG5 (AD vs. DSSP), showing an irregular α-helix detected by AD and not DSSP.
Application 4: Conformational Change and Hinges
● Analysis of conformational change and detection of hinges from a few unaligned conformations using AD tetrahedra
Neighbor Changes on Motion
● Motion: major rearrangements at a few key residues, the hinges
● Model as neighbor changes, rather than large dihedral angle changes
● DT contains no conformational change signal; AD tetrahedra do
● In neighborhood of hinge region, neighbor relationships change drastically (quantify by changes in AD tetrahedra thresholds)
● Hinge residues from hinge tetrahedra
(Figure: ovotransferrin, 1TFA SC, apo (open) form, and 1IEJ SC, holo (closed) form; tetrahedra colored by threshold: 0, 0.01-0.1, 0.1-0.5, 0.5-1, 1-2.)
Application 5 : Family-Specific Fingerprints
● Find residue packing patterns specific to protein families, using graph representations with DT/AD edges.
● Use for family classification and functional annotation
Graph Representation
(Figure: graph representations of proteins and small molecules, with peptide edges and proximity edges.)
● Node label: amino acid type, chemical properties, …
● Edge label: sequence adjacency or structure proximity, determined by distance
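A labeled graph of this kind can be sketched as follows. Nodes carry the amino acid type, consecutive residues get a "peptide" edge, and other pairs within a distance cutoff get a "proximity" edge. The plain distance cutoff and its 8.5 Å value are assumptions made to keep the example short; the talk defines proximity with DT/AD edges. The function name and input layout are hypothetical.

```python
from math import dist


def build_structure_graph(residues, cutoff=8.5):
    """Build (nodes, edges) from residues given in sequence order.

    residues: list of (amino_acid_label, (x, y, z)) tuples.
    nodes:    dict index -> amino acid label.
    edges:    dict (i, j) with i < j -> "peptide" or "proximity".
    """
    nodes = {i: aa for i, (aa, _) in enumerate(residues)}
    edges = {}
    for i in range(len(residues)):
        for j in range(i + 1, len(residues)):
            if j == i + 1:
                edges[(i, j)] = "peptide"      # sequence adjacency
            elif dist(residues[i][1], residues[j][1]) <= cutoff:
                edges[(i, j)] = "proximity"    # spatial neighbors
    return nodes, edges


if __name__ == "__main__":
    chain = [("A", (0, 0, 0)), ("L", (3.8, 0, 0)), ("M", (7.6, 0, 0)),
             ("D", (3.8, 5, 0)), ("E", (30, 0, 0))]
    nodes, edges = build_structure_graph(chain)
    print(nodes)
    print(edges)
```

Swapping the cutoff test for membership in a precomputed DT or AD edge set would give the representations compared later in the talk.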
Graph Database Mining
● Input: database of labeled undirected graphs; support threshold σ with 0 < σ ≤ 1 (example: σ = 2/3)
● Output: all (connected) frequent subgraphs from the graph database.
● Performance is critical
  ● Number of patterns can grow exponentially for large and dense graphs
  ● Subgraph isomorphism (NP-complete)
Subgraph mining algorithms developed in our group
● Frequent Subgraph Mining [ICDM'03]
  ● Canonical Adjacency Matrix (CAM) tree
● Induced Subgraph Mining [RECOMB'04]
  ● Induced subgraphs: geometrically more rigid, superimposable
  ● Miss many useful motifs "embedded" in a dense graph.
● Maximal frequent subgraph mining [SIGKDD'04]
  ● Mines only maximal frequent subgraphs (no supergraph is frequent)
  ● Uses a spanning tree comparison algorithm
● CliqueHashing and CliqueHashing+ [ISMB'05 demo]
  ● Finding frequent cliques in linear time
Three Graph Representations
(Figure: three graph representations, CD, AD(0.5) and DT, with edge sets E(DT), E(AD) and E(CD).)
Family Specific Fingerprints
● Frequent: occur in >80% of family proteins
● Family-specific: occur in <5% of background proteins
(Figure: Human Kallikrein 6 (1LO6), Serine Protease family, with subgraphs G1 and G2; residue labels TRP141, GLY196, HIS57, CYS42, GLY197, ALA55, CYS58.)
● Subgraph G1: not sequence conserved. Useful for the annotation of structural orphans.
● Subgraph G2: sequence-conserved motif C-x(12)-A-x-H-C. Useful for the annotation of both structural orphans and sequences.
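The two thresholds above translate directly into a filter over mined patterns. The sketch below assumes the mining step has already produced, for each pattern, the set of family and background proteins that contain it; the function name, argument layout and pattern identifiers are hypothetical.

```python
def family_specific_fingerprints(family_hits, background_hits,
                                 n_family, n_background,
                                 min_family=0.80, max_background=0.05):
    """Keep patterns found in more than `min_family` of the family proteins
    and in fewer than `max_background` of the background proteins.

    family_hits / background_hits: dict pattern_id -> set of protein ids.
    """
    selected = []
    for pattern, proteins in family_hits.items():
        family_support = len(proteins) / n_family
        background_support = len(background_hits.get(pattern, ())) / n_background
        if family_support > min_family and background_support < max_background:
            selected.append(pattern)
    return selected


if __name__ == "__main__":
    fam = {"G1": set(range(9)), "G2": set(range(10)), "G3": set(range(5))}
    bg = {"G1": {100, 101}, "G2": set(range(100, 110))}
    print(family_specific_fingerprints(fam, bg, n_family=10, n_background=100))
```

In this toy input G2 is frequent in the family but too common in the background, and G3 is not frequent enough, so only G1 survives.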
Largest Serine Protease Fingerprint
(Figure: fingerprint residues ASP102, ALA55, HIS57, SER195, GLY43, GLY140, GLY142, ASP194, LEU16. Blue = His57-Asp102-Ser195 catalytic triad; grey = others.)
Cyclin Dependent Protein Kinases (structure of PDB:1B6C)
SCOP classification of 1B6C:
● Superfamily: Protein Kinase like (PK like).
● Family: Protein Kinase catalytic subunit.
● Protein: Type I TGF-beta receptor R4.
● Species: Human.
• 6-residue motif is highlighted in red
• ASP(333) is part of the active site
• Conserved in 18 out of 29 PK proteins.
Applications of Family-Specific Fingerprints
● Functional family inference for Structural Genomics
● Functional family inference for predicted structures
● Functional neighbors and remote structural similarity
● Deriving sequence patterns from fingerprints
Motivation
● Hypothetical proteins from Structural Genomics:
  ● structure known, function unknown
● Function has to be inferred from structure
  ● Overall fold similarity
  ● Local structure similarity to structure with known function
● Overall fold similarity not necessary, sometimes misleading
● Existing local structure methods
  ● Search for known functional sites
  ● Derive templates by clique detection
Related work : function inference from local structure
● Detecting similarity to known functional sites
  ● SiteEngine [Shulman-Peleg et al, 2003]
  ● SURFACE [Ferre et al, 2004]
  ● eF-site [Kinoshita and Nakamura, 2004]
  ● PINTS-weekly [Stark, Shkumatov and Russell 2004]
● Detecting functional sites derived from protein families
  ● FoldMiner [Shapiro and Brutlag, 2003]
  ● Phunctioner [Pazos and Sternberg, 2004]
  ● DRESPAT [Wangikar et al, 2003]
  ● Common structural cliques [Milik et al, 2003]
(Slide annotations grouping the methods by technique: geometric hashing, surface patches, superposition, graph search / clique detection.)
Method for functional inference
● Pick families from SCOP, EC or other classifications
● Model protein structures by labeled graphs, with almost-Delaunay edges defining proximity
● Enumerate all frequent subgraphs within the family using a subgraph mining algorithm
● Pick frequent subgraphs infrequent in background as family-specific fingerprints
● Search for fingerprints in structure to be annotated
  ● use an index of graph similarity to speed up Ullmann's algorithm
● Assign significance of family membership based on the fingerprints found.
Fast Graph Search Using Local Neighborhood Index
(Figure: query subgraph with nodes 1 ASP 102, 2 HIS 57, 3 ALA 55, 4 SER 214, 5 ALA 196, and local neighborhood index entries for its nodes, such as A1H1S1, A2H1S1, A1D1S1, A2D1S1, A1D1H1S1, A2D1H1, A1S1.)
Hard case: search for 11 subgraphs in the 6500-protein background dataset (hydrophobic, average 60 occurrences per protein). Search without the index is intractable.
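One plausible reading of index entries like A1H1S1 is a count of the labels found among a node's neighbors (one A, one H, one S). Under that assumption, the sketch below shows how such a signature can prune candidate matches before a full subgraph isomorphism test is run. It is not the authors' index; the function names and the toy graphs are hypothetical.

```python
from collections import Counter


def neighborhood_signature(labels, adjacency, node):
    """Count the labels of a node's neighbors, e.g. Counter({'A': 1, 'H': 1, 'S': 1})."""
    return Counter(labels[n] for n in adjacency[node])


def candidate_nodes(pattern_labels, pattern_adj, target_labels, target_adj):
    """For each pattern node, list target nodes that could match it.

    A target node qualifies if it has the same label and its neighborhood
    contains at least the neighbor labels the pattern node requires.
    """
    candidates = {}
    for p in pattern_labels:
        needed = neighborhood_signature(pattern_labels, pattern_adj, p)
        candidates[p] = [
            t for t in target_labels
            if target_labels[t] == pattern_labels[p]
            # Counter subtraction keeps only unmet requirements; empty means all met.
            and not (needed - neighborhood_signature(target_labels, target_adj, t))
        ]
    return candidates


if __name__ == "__main__":
    p_labels = {1: "D", 2: "H", 3: "A", 4: "S"}
    p_adj = {1: [2], 2: [1, 3, 4], 3: [2], 4: [2]}
    t_labels = {"a": "H", "b": "H", "c": "D", "d": "S", "e": "A", "f": "G"}
    t_adj = {"a": ["c", "d", "e", "f"], "b": ["c", "f"], "c": ["a", "b"],
             "d": ["a"], "e": ["a"], "f": ["a", "b"]}
    print(candidate_nodes(p_labels, p_adj, t_labels, t_adj))
```

Only the surviving candidates need to be passed to the exact matcher, which is where the speed-up over an unindexed search would come from.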
Function Inference Using Fingerprints
● Given query structure q
● Given fingerprints {X_1 … X_m} for prospective family F_i
● Say {X_q1 … X_qn} are found in q: is q in F_i?
● Simple approximation: based on # fingerprints
  ● P-value based on number of BG proteins with more fingerprints
● Accurate: Bayesian formula applied to family and background probabilities of X_q1 … X_qn
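Both scoring ideas can be sketched in a few lines. The first function follows the slide literally and reports the fraction of background proteins that contain more fingerprints than the query. The second is a naive Bayes posterior that treats fingerprints as independent given the class, which is an assumption added here; the slide does not state the exact Bayesian formula. Function names, the prior of 0.5 and the example probabilities are hypothetical.

```python
from math import prod


def fingerprint_p_value(query_count, background_counts):
    """Fraction of background proteins with more fingerprints than the query."""
    more = sum(1 for c in background_counts if c > query_count)
    return more / len(background_counts)


def naive_bayes_posterior(found, p_family, p_background, prior=0.5):
    """Posterior probability of family membership given the fingerprints found.

    p_family[x] and p_background[x] are the occurrence probabilities of
    fingerprint x in family and background proteins. Fingerprints are
    treated as independent, which is a simplifying assumption.
    """
    likelihood_family = prod(p_family[x] for x in found)
    likelihood_background = prod(p_background[x] for x in found)
    numerator = prior * likelihood_family
    return numerator / (numerator + (1.0 - prior) * likelihood_background)


if __name__ == "__main__":
    print(fingerprint_p_value(30, [0, 1, 2, 35, 4, 0, 0, 0, 0, 0]))
    print(naive_bayes_posterior(["X1", "X2"],
                                {"X1": 0.9, "X2": 0.8},
                                {"X1": 0.01, "X2": 0.05}))
```

Note that the literal "more fingerprints" rule gives a p-value of 0 when no background protein beats the query; counting ties as well would be a more conservative variant.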
Advantages of our method
● Sequence similarity: not sensitive enough
● Global fold similarity: misleading
● Functional site similarity
  ● Different functional families sometimes share functional sites
  ● Exact matching may not be robust (distortion/mutation)
  ● Clique methods sacrifice generality of patterns
● Subgraph fingerprints:
  ● Family-specific, few false positives by definition
  ● Multiple fingerprints = consensus
  ● Confidence of family membership
Discriminating the TIM barrels
● Validation of method: all TIM barrel families are structurally very similar.
(Table: fingerprints (FP) of each family found in members of each family. Fingerprint families: TIM (1920 FP), Amylase (11 FP), Alcohol Dehydrogenase (127 FP), Xylose IM (671 FP). Protein families: TIM, Amylase, Alcohol Dehydrogenase, Xylose IM. Cell values in slide order: 1656, 82, 87, 89, 1, 9, 1, …, 2, …, 22, 19, …, 13, 78, …, 105, 29, …, 1, …, 615, …)
Annotations missed by SCOP 1.65
New serine protease annotations, based on the number of fingerprints found out of 79 Serine Protease fingerprints: 1op0A (73/79)*; 1os8A (73/79); 1p57B (73/79); 1s83 (73/79); 1ssx (46/79); 1md8 (45/79).
New Triosephosphate Isomerase (TIM): 1r2r, 1885/1920 fingerprints.
Verified in PDB file headers, literature.
* All the above except 1op0 have been classified in SCOP 1.67, Feb 2005
Structural Genomics Function inference I
● 1nfg: SCOP 51556, Metallo-dependent hydrolase; 8-stranded βα (TIM) barrel fold; 17 members, 49 FP
● 1m65 (CASP5 T0147, Ycdx): unknown function; 7-stranded barrel fold; 30 FP found
Residues hit by fingerprints
● 1nfg: SCOP 51556, Metallo-dependent hydrolase; 8-stranded βα (TIM) barrel fold; 17 members, 49 FP
● 1m65 (CASP5 T0147, Ycdx): unknown function; 7-stranded barrel fold; 30 FP found
Residue coloring: acidic, basic, polar, hydrophobic. Figures made in VMD.
Function inference for predictions
● Check predicted structures against family of template
● SNAPP [Fischer et al 2004], SPREK [Taylor, Jonassen 2004] not family-specific
  ● well-packed predictions with wrong fold may score high
● Fingerprints infer the correct functional family, even if the template chosen is incorrect.
● E.g. CASP5 target T0147, PDB 1m65
  ● rare (βα)8 fold, putative metallo-dependent hydrolase (MDH)
  ● 107 predictions ranked 1
  ● 50 predictions had 50% or more of 49 MDH FP
  ● 51 other families had ≤4 preds with ≥50% FP
Functional Neighbors
● Finding families that share some fingerprints
  ● Search for family fingerprints in the background
  ● Cluster hits for significant enrichment in SCOP, GO hierarchy
● E.g. find local similarity between remotely related SCOP families
(Figure: 1kew. SCOP: NAD(P)-binding Rossmann fold; SCOP: FAD/NAD-linked reductase.)
The DALI Z-score of the two structures is 4.5, which suggests that they are dissimilar at the fold level. The pairwise sequence identity is 16% and there is no local sequence similarity at the region of the motif.
Sequence Patterns from Tertiary Packing
● Frequent DT quadruplets or subgraph motifs that are conserved in sequence order, mapped back to sequence
● Sparse Sequence Signatures
  ● Evaluate precision/recall by querying SwissProt
  ● Overlap with / comparable to PROSITE patterns
● Sequence Motif = {aa1, aa2, aa3, aa4, d12, d23, d34}, e.g. {D, S, G, P, 2, 3, 7}
Joint work with Ruchir Shah
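A sparse signature of this form can be turned into a searchable pattern. The sketch below assumes that d12, d23 and d34 are differences between sequence positions, so a difference of d leaves d - 1 arbitrary residues in between. That reading is consistent with the earlier motif C-x(12)-A-x-H-C for CYS42, ALA55, HIS57, CYS58, but it is an assumption, and the function name and test sequence are made up for illustration.

```python
import re


def motif_to_regex(residues, gaps):
    """Convert residues such as 'DSGP' and position differences such as (2, 3, 7)
    into a regular expression with fixed-length wildcards between residues."""
    if len(gaps) != len(residues) - 1:
        raise ValueError("need exactly one gap per consecutive residue pair")
    parts = [residues[0]]
    for residue, d in zip(residues[1:], gaps):
        if d < 1:
            raise ValueError("position differences must be at least 1")
        parts.append(".{%d}" % (d - 1))  # d - 1 arbitrary residues in between
        parts.append(residue)
    return "".join(parts)


if __name__ == "__main__":
    pattern = motif_to_regex("DSGP", (2, 3, 7))
    print(pattern)
    match = re.search(pattern, "MKDASLLGAAAAAAPQ")
    print(match.start() if match else None)
```

Counting such matches over a sequence database is one simple way to estimate the precision and recall mentioned on the slide.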
Future Work
● Biological validation of function inference
● Future applications in bioinformatics
  ● Hierarchical family fingerprints to infer function for novel folds, with no putative family information
  ● Tool for template verification in homology modeling/fold recognition
  ● Augment domain classifications (SCOP) with motif-based functions
  ● Augment structure neighbor searches (VAST) with functional neighbors
  ● Robust neighbor relation to accelerate MD, QM simulations
  ● Improve docking (graph matching, MD)
  ● Local similarity search
  ● Other geometric computations (Voronoi volumes/domains, alpha shapes, …)
Thanks to…
● Thesis advisor: Dr. Jack Snoeyink (UNC CS)
● Collaborators in this work:
  ● Dr. Alexander Tropsha (UNC Pharmacy)
  ● Jun (Luke) Huan, Dr. Wei Wang, Dr. Jan Prins (UNC CS)
  ● Ruchir Shah (UNC Biomolecular Informatics)
  ● Dr. Bala Krishnamoorthy (Washington State U. Pullman, Math)
  ● Dr. Charlie Carter (UNC Biochemistry)
● Mother nature, for her wonderful imprecision and complexity, that is an endless source of problems…
References
● References to my publications:
  ● Bandyopadhyay, D. and J. Snoeyink (2004). Almost-Delaunay simplices: Nearest neighbor relations for imprecise points. ACM-SIAM Symposium On Discrete Algorithms (SODA'04). http://www.cs.unc.edu/~debug/papers/AlmDel
  ● Bandyopadhyay, D. and J. Snoeyink (2004). Almost-Delaunay simplices: Robust nearest neighbor relations for imprecise points in CGAL. Second CGAL User Workshop, 2004. Software: http://www.cs.unc.edu/~debug/software
  ● Jun Huan, Wei Wang, Deepak Bandyopadhyay, Jack Snoeyink, Jan Prins, Alexander Tropsha (2004). Finding Protein Family-specific residue packing patterns in Protein Structure Graphs. RECOMB 2004. Invited to Journal of Computational Biology, 2005, in press.
  ● Bandyopadhyay, Deepak, Alexander Tropsha and Jack Snoeyink. A Robust Score for Protein Packing using Almost-Delaunay Tetrahedra. 2005, in submission.
  ● Bandyopadhyay, Deepak, Jun Huan, Jinze Liu, Jan Prins, Jack Snoeyink, Wei Wang, and Alexander Tropsha. Protein Functional Family Identification by Fast Subgraph Isomorphism Using Structure-Based Fingerprints Mined from SCOP and EC families. 2005, in submission. Poster presented at Triangle Biophysics Symposium, 2004.
  ● Bandyopadhyay, Deepak, Jack Snoeyink, Alexander Tropsha and Charlie Carter. Analysis of Protein Conformational Change Using Almost-Delaunay Tetrahedra. Manuscript in preparation. Poster presented at Pacific Symposium on Biocomputing (PSB), Jan. 2005, Big Island of Hawaii.
  ● Bandyopadhyay, Deepak, Alexander Tropsha and Jack Snoeyink. Analyzing Protein Structure using Almost-Delaunay Tetrahedra. UNC-CS Technical Report TR03-043, 2003. Poster presented at RECOMB 2004, March 2004, San Diego, CA.
References
● Computational geometry methods applied to protein structure analysis:
  ● Gerstein, M., J. Tsai, and M. Levitt (1995). The volume of atoms on the protein surface: Calculated from simulation, using Voronoi polyhedra. Journal of Molecular Biology 249(5), 955-966.
  ● Tsai, J., R. Taylor, C. Chothia, and M. Gerstein (1999). The packing density in proteins: Standard radii and volumes. Journal of Molecular Biology 290(1), 253-266.
  ● Angelov, B., J. Sadoc, R. Jullien, A. Soyer, J. Mornon, and J. Chomilier (2002). Nonatomic solvent-driven Voronoi tessellation of proteins: an open tool to analyze protein folds. Proteins 49(4), 446-456.
  ● Pontius, J., J. Richelle and S.J. Wodak (1996). Deviations from Standard Atomic Volumes as a Quality Measure for Protein Crystal Structures. Journal of Molecular Biology 264(1), 121-136.
  ● Edelsbrunner, H. and P. Koehl (2003). The weighted-volume derivative of a space-filling diagram. PNAS, Mar 2003; 100: 2203-2208.
  ● Liang, J. and K. A. Dill (2001). Are proteins well-packed? Biophys. J. 81(2), 751-766.
  ● Liang, J., H. Edelsbrunner, P. Fu, P. Sudhakar, and S. Subramaniam (1998). Analytical shape computing of macromolecules II: identification and computation of inaccessible cavities inside proteins. Proteins 33, 18-29.
  ● Cheng, H.L. (2002). Algorithms for Smooth and Deformable Surfaces in 3D. Ph.D. Dissertation, University of Illinois at Urbana-Champaign.
  ● Ban, Y.-E., H. Edelsbrunner and J. Rudolph (2004). Interface surfaces for protein-protein complexes. Proc. RECOMB 2004.
  ● Wernisch, L., M. Hunting, and S. Wodak (1999). Identification of structural domains in proteins by a graph heuristic. Proteins 35(3), 338-352.
  ● Wako, H. and T. Yamato (1998). Novel method to detect a motif of local structures in different protein conformations. Protein Engineering 11, 981-990.
References
● SNAPP:
  ● C. W. Carter, B. C. LeFebvre, S. Cammer, A. Tropsha, and M. H. Edgell (2001). Four-body potentials reveal protein-specific correlations to stability changes caused by hydrophobic core mutations. Journal of Molecular Biology, 311(4):625-638.
  ● B. Krishnamoorthy and A. Tropsha (2003). Development of a four-body statistical pseudo-potential to discriminate native from non-native protein conformations. Bioinformatics, 19(12).
  ● Tropsha, A., Carter, C., Cammer, S. & Vaisman, I. (2003). Simplicial neighborhood analysis of protein packing (SNAPP): a computational geometry approach to studying proteins. Meth. Enzymol., 374, 509-544.
● Hinges:
  ● Krebs WG, Alexandrov V, Wilson CA, Echols N, Yu H, Gerstein M. (2002). Normal mode analysis of macromolecular motions in a database framework: developing mode concentration as a useful classifying statistic. Proteins. 2002 Sep 1;48(4):682-95.
  ● Jacobs DJ, Rader AJ, Kuhn LA, Thorpe MF (2001). Protein Flexibility Predictions using Graph Theory. Proteins 44, 150-165.
  ● M.F. Thorpe, Ming Lei, A.J. Rader, Donald J. Jacobs, and Leslie A. Kuhn (2001). Protein Flexibility and Dynamics using Constraint Theory. J. Molecular Graphics and Modelling 19, 60-69.
● Secondary structure:
  ● Kabsch, W. and C. Sander (1983). Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22(12), 2577-2637.
● Family-specific motifs:
  ● Cammer, S. A. and A. Tropsha (2000). Identification of sequence specific tertiary packing motifs in protein structures using Delaunay tessellation. Lecture Notes in Computational Science and Engineering. Springer Verlag, New York.
  ● J. Huan, W. Wang, and J. Prins (2003). Efficient Mining of Frequent Subgraphs in the Presence of Isomorphism. International Conference on Data Mining '03.
  ● Jun (Luke) Huan, Wei Wang, Anglinia Washington, Jan Prins, Ruchir Shah, Alexander Tropsha (2004). Accurate Classification of Protein Structural Families Based on Coherent Subgraph Mining. PSB 2004.
  ● Huan, J., Wang, W., Prins, J. & Yang, J. (2004b). SPIN: Mining maximal frequent subgraphs from graph databases. SIGKDD 2004.
Canonical Adjacency Matrix
● The Canonical Adjacency Matrix (CAM) of a graph G is the maximal adjacency matrix for G under a total ordering defined on adjacency matrices.
(Figure: labeled graph on nodes p1 to p5 and three of its adjacency matrices, lower triangles shown, ordered M1 > M2 > M3.)
M1:
a
y b
y x b
0 y 0 c
0 0 y 0 d
M2:
a
y b
y x b
0 0 y d
0 y 0 c
M3:
b
x b
y 0 d
0 y 0 c
y y 0 0 a
CAM Tree
(Figure: CAM tree for the labeled graph P on nodes p1 to p5; each tree node is the adjacency matrix of a subgraph of P.)
Chemical Datasets
● Predictive Toxicology Evaluation Competition
  ● Dataset: 337 compounds
  ● Two class labels: positive (180) and negative (157)
  ● Each chemical graph contains 27 nodes and 27 edges on average
● NIH DTP Anti-Viral Screen Test
  ● Chemicals are classified as Confirmed Active (CA), Confirmed Moderate Active (CM) and Confirmed Inactive (CI)
  ● Dataset contains 423 CA and 1083 CM compounds
  ● Each chemical graph contains 25 nodes and 27 edges on average
Performance (Chemical Datasets)
(Figure: performance on the PTE and DTP CA/CM datasets.)
FFSM and gSpan are currently the most efficient available frequent subgraph mining algorithms.