A Geometric Framework for Robust Nearest Neighbor Analysis Deepak Bandyopadhyay


A Geometric Framework for Robust Nearest Neighbor Analysis of Protein Structure and Function

Deepak Bandyopadhyay

Department of Computer Science, University of North Carolina at Chapel Hill

The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL

Outline

Motivation: Use geometric proximity (Voronoi / Delaunay) to analyze protein structures for insight into their function.

Methods: Geometric proximity structures have problems with imprecise points. But we can fix this!

Applications: Let’s modify existing neighbor analyses of protein structure to make them robust, and design new ones! SNAPP; packing differences; secondary structure; hinges. Detail: structural fingerprints for function inference.

Geometric structures on point sets

● Input: Points
● Voronoi Diagram
● Delaunay triangulation / tessellation (DT)
● Output: Neighbors

Delaunay tessellation of proteins

● Represent each amino acid by a point
   • Cα, side-chain centroid, Cβ, ...
● Delaunay Tessellation
● Delaunay tetrahedra → nearest neighbor quadruplets

[Figure: residues labeled A, L, M, D, E grouped into Delaunay quadruplets]

Delaunay Tessellation Applications

● SNAPP, four-body statistical potential for hydrophobic core stability [Carter et al, 2001]
● Decoy discrimination [Krishnamoorthy and Tropsha, 2003]
● Scoring ligand-receptor binding affinity [Zhang et al, 2004]
● Mining frequent substructures in protein families [Huan et al., 2004, 2005]
● Structure-based function inference [Bandyopadhyay et al, 2005]

Outline

Methods: Geometric proximity structures have problems with imprecise points. But we can fix this!

Effect of Imprecision on Delaunay

● If point coordinates are imprecise...
● What happens to the Delaunay neighbors?
● Think of 4 nearly co-circular points in 2D. Delaunay edges may flip … neighbors change.
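The edge flip for four nearly co-circular points can be illustrated with the standard in-circle determinant. This is a minimal 2D sketch, not code from the talk; the function names and the sample coordinates are illustrative assumptions.

```python
def in_circle(a, b, c, d):
    """Return > 0 if d lies inside the circle through a, b, c (given counter-clockwise)."""
    rows = []
    for p in (a, b, c):
        dx, dy = p[0] - d[0], p[1] - d[1]
        rows.append((dx, dy, dx * dx + dy * dy))
    (a1, a2, a3), (b1, b2, b3), (c1, c2, c3) = rows
    # 3x3 determinant expanded along the first row.
    return (a1 * (b2 * c3 - b3 * c2)
            - a2 * (b1 * c3 - b3 * c1)
            + a3 * (b1 * c2 - b2 * c1))


def delaunay_diagonal(a, b, c, d):
    """Delaunay diagonal of the convex quadrilateral a, b, c, d (counter-clockwise)."""
    # If d is inside the circumcircle of abc, diagonal ac is illegal and flips to bd.
    return "bd" if in_circle(a, b, c, d) > 0 else "ac"


# Three points on the unit circle; the fourth is moved by only 0.02 across the circle.
a, b, c = (1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)
print(delaunay_diagonal(a, b, c, (0.0, -1.01)))  # d just outside the circle
print(delaunay_diagonal(a, b, c, (0.0, -0.99)))  # d just inside the circle
```

A perturbation far smaller than typical coordinate error is enough to change which pairs of points count as neighbors.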

Which applications are affected by instability of Delaunay ?

● Less affected: quantitative, continuous measures (Voronoi volumes)
● Worse affected: qualitative, discretized measures (frequent subgraphs)

When people use Delaunay in analysis of protein structure, they assume it is robust to perturbations!

Method 1 : Almost-Delaunay (AD) tetrahedra

● A 4-tuple of points is in AD(ε) if, by perturbing all points in the set by at most ε, its circumscribing sphere can become empty.
● The minimum perturbation required, ε, is the AD threshold.

[Figure: each vertex can move within a sphere of radius ε. Green: Delaunay, in AD(0). Red: in AD(ε).]

[Figure: AD tetrahedra for protein 2ACY, 98 residues, Cαs (colored by threshold; DT not shown, for clarity)]

AD tetrahedra may overlap; they do not tile space.

Computing AD thresholds

[Bandyopadhyay and Snoeyink, 2004]
● Find the spherical shell of minimum width, using a result from computational metrology [Garcia-Lopez et al, 1998]
● Given a set of points P, a simplex t is in AD(ε) iff its points are contained within 2 concentric spheres s.t.:
   • difference in radii is 2ε, minimum over all such concentric spheres
   • inner sphere contains no points of P
● [Figure: 2D example]

Code to compute AD edges, triangles, tetrahedra for 3D points, in C++/CGAL (with MATLAB interface and utilities) is available from: http://www.cs.unc.edu/~debug/software
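The shell condition above can be sketched in 2D with a brute-force search over candidate centers. This is only an illustration of the definition, assuming a coarse grid of centers; it is not the published algorithm or the CGAL implementation, and `shell_half_width`, `ad_threshold_grid`, and the grid parameters are hypothetical names and values.

```python
import math


def shell_half_width(center, simplex, points):
    """Half the width of the thinnest shell around `center` that holds the simplex
    while its inner circle contains no point of the whole set."""
    far = max(math.dist(center, p) for p in simplex)   # outer radius must reach this
    near = min(math.dist(center, p) for p in points)   # inner radius cannot exceed this
    return (far - near) / 2.0


def ad_threshold_grid(simplex, points, half_extent=2.0, steps=100):
    """Approximate AD threshold: minimum shell half-width over a grid of centers."""
    step = half_extent / steps
    best = math.inf
    for i in range(-steps, steps + 1):
        for j in range(-steps, steps + 1):
            best = min(best, shell_half_width((i * step, j * step), simplex, points))
    return best


tri = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]
print(ad_threshold_grid(tri, tri + [(0.0, -1.0)]))   # co-circular fourth point
print(ad_threshold_grid(tri, tri + [(0.0, -0.98)]))  # fourth point slightly inside
```

The grid only bounds the threshold from above; an exact method has to search the continuous space of centers, which is what the minimum-width shell result is used for.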

Method 2: Delaunay Probability

● AD(ε) captures worst-case deviation in coordinates
● Uncertainty in actual coordinates → probabilistic model
● Assume each point has Gaussian p.d.f.
● $\mathrm{prob}(\text{sphere empty of } p_i) = 1 - \int (\text{p.d.f. of } p_i \text{ inside sphere})$
● Probability that tetrahedron abcd is Delaunay: integrate over all possible spheres defined by a, b, c, d:

   $\int \mathrm{prob}(\text{sphere}) \cdot \prod_{p \notin \{a,b,c,d\}} \mathrm{prob}(\text{sphere empty of } p)$

● AD algorithm makes Delaunay Probability computation feasible
● Delaunay Probability significant only for tetrahedra with low ε!
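As a rough 2D illustration of the idea, the probability that a triangle stays Delaunay can be estimated by sampling Gaussian perturbations instead of evaluating the integral. Monte Carlo sampling is a substitute technique chosen here for brevity, not the computation described in the talk; `delaunay_probability`, `sigma`, and the sample count are assumptions.

```python
import random


def orient(a, b, c):
    """Twice the signed area of triangle abc; > 0 when counter-clockwise."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def in_circle(a, b, c, d):
    """> 0 if d is inside the circle through counter-clockwise a, b, c."""
    rows = []
    for p in (a, b, c):
        dx, dy = p[0] - d[0], p[1] - d[1]
        rows.append((dx, dy, dx * dx + dy * dy))
    (a1, a2, a3), (b1, b2, b3), (c1, c2, c3) = rows
    return (a1 * (b2 * c3 - b3 * c2)
            - a2 * (b1 * c3 - b3 * c1)
            + a3 * (b1 * c2 - b2 * c1))


def delaunay_probability(tri, others, sigma, samples=2000, seed=0):
    """Fraction of Gaussian-perturbed samples in which the circumcircle of tri is empty."""
    rng = random.Random(seed)

    def jitter(p):
        return (p[0] + rng.gauss(0.0, sigma), p[1] + rng.gauss(0.0, sigma))

    hits = 0
    for _ in range(samples):
        a, b, c = (jitter(p) for p in tri)
        sign = 1.0 if orient(a, b, c) > 0 else -1.0  # make the test orientation-independent
        if all(sign * in_circle(a, b, c, jitter(q)) <= 0 for q in others):
            hits += 1
    return hits / samples


tri = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]
print(delaunay_probability(tri, [(0.0, -5.0)], sigma=0.01))  # distant point
print(delaunay_probability(tri, [(0.0, -1.0)], sigma=0.01))  # co-circular point
```

A distant point leaves the triangle Delaunay in essentially every sample, while a co-circular point makes the outcome close to a coin flip.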

Summary of contributions

● Algorithmic:
   • Theory and algorithm for the general framework
   • Fast and robust implementation for 3D points
● Application domain:
   • Nearest neighbor analysis with imprecision
● Applications explored:
   • Scoring protein packing with a statistical 4-body potential (SNAPP)
   • Quantifying packing differences between proteins and other structures
   • Assigning secondary structure from Cαs
   • Analyzing conformational changes and finding hinge residues
   • Finding local packing motifs specific to protein families, applied to structure classification, and functional inference for structural genomics

Outline

Applications: Let’s modify existing neighbor analyses of protein structure to make them robust, and design new ones! SNAPP; packing differences; secondary structure; hinges. Detail: structural fingerprints for function inference.

Application 1: SNAPP

● Simplicial Neighborhood Analysis of Protein Packing [Carter et al, JMB’99]
● Residues represented by side-chain centroids
● Protein structure represented as an aggregate of space-filling, irregular tetrahedra
● Unique and objective recognition of nearest neighbor residues in sets of four (quadruplets)

[Figure: residues labeled A, M, L, D, E]

Likelihood Scores for 8724 Compositions


Likelihood Mapped to hydrophobic core


Applications

● Decoy discrimination [Krishnamoorthy and Tropsha, 2003]
   • Weighting scheme based on tetrahedron sequence topology
● Conformation change on ligand binding [Sherman et al, 2003]
● Study of folding simulations [Krishnamoorthy and Tropsha, 2003]
● Ligand-receptor binding affinity [Zhang, Golbraikh and Tropsha, 2004]

Contribution of almost-Delaunay:
● How stable is the SNAPP score computed using Delaunay?
● Compute variants of it using AD and Delaunay Probability

Results : scoring decoys

[Figure: results for three decoy sets: 1. 4state_reduced, 2. lattice_ssfit, 3. semfold]

● SNAPP with Delaunay probabilities distinguishes decoys from native state as well as (even better than?) Delaunay-based SNAPP.
● Hence, the original Delaunay-based score is stable

Results : scoring CASP5 predictions

● SNAPP with Delaunay probabilities discriminates native structures from predictions as well as Delaunay-based SNAPP (usually even better).
● Hence, the original Delaunay-based score is stable

[Figure: Z-score (Rank)]

Application 2: Packing Differences

● How does DT change as points are perturbed, for different point sets?
   • random points; random walks; random folded chains; decoys; protein Cα; sidechain centroids

Stability of the DT in Proteins

Left: Average # of AD tetrahedra at low ε (< 0.5 Å) grows faster for random points than proteins, as seen in this cumulative histogram. This suggests that the DT is stable for small perturbations in proteins.

Right: Number of Delaunay and AD(0.3) tetrahedra for a sample of predictions to CASP5. Notice that the native structures, colored green, have fewer AD tetrahedra for the same number of Delaunay tetrahedra.

Application 3: Secondary structure from Cα

AD threshold histogram of α-helix has unique signature that enables helix assignment from Cαs…

[Figure: AD threshold histogram for α-helix]

AD secondary structure assignment

● Strong α-helix signal, weaker β-sheet and β-turn signals
● Better accuracy than previous work [Wako and Yamato, 1998]
● More tolerant to structural and H-bond imperfections than DSSP
   • 1bg5, irregular helix on right
● Applications:
   • consensus assignment
   • structure prediction

Above: Visual comparison of α-helix, β-sheet and β-turn assignments in 1BG5 (AD vs. DSSP), showing an irregular α-helix detected by AD and not DSSP.

Application 4 : Conformational Change and Hinges

Analysis of conformational change and detection of hinges from a few unaligned conformations using AD tetrahedra


Neighbor Changes on Motion

● Motion: major rearrangements at a few key residues, the hinges
● Model as neighbor changes, rather than large dihedral angle changes
● DT contains no conformational change signal; AD tetrahedra do
● In neighborhood of hinge region, neighbor relationships change drastically (quantify by changes in AD tetrahedra thresholds)
● Hinge residues from hinge tetrahedra

[Figure: Ovotransferrin, 1TFA SC apo (open) form and 1IEJ SC holo (closed) form; threshold color: 0, 0.01-0.1, 0.1-0.5, 0.5-1, 1-2]

Application 5 : Family-Specific Fingerprints

● Find residue packing patterns specific to protein families, using graph representations with DT/AD edges.
● Use for family classification and functional annotation

Graph Representation

[Figure: graph representations of proteins and small molecules; legend: peptide edge, proximity edge]

Node label: Amino acid type, chemical properties, …

Edge label: Sequence adjacency or structure proximity, determined by distance
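A labeled graph of this kind might be built as follows. The sketch uses a plain distance cutoff for proximity edges, whereas the talk also uses DT/AD edges; the function name `protein_graph`, the input layout, and the 8.5 Å cutoff are illustrative assumptions.

```python
import math
from itertools import combinations


def protein_graph(residues, cutoff=8.5):
    """Build node and edge labels from residues given in sequence order.

    residues: list of (amino_acid_type, (x, y, z)) tuples.
    cutoff:   hypothetical distance threshold for proximity edges.
    """
    nodes = {i: aa for i, (aa, _) in enumerate(residues)}
    edges = {}
    for i, j in combinations(range(len(residues)), 2):
        if j == i + 1:
            edges[(i, j)] = "peptide"      # sequence adjacency
        elif math.dist(residues[i][1], residues[j][1]) <= cutoff:
            edges[(i, j)] = "proximity"    # spatial neighbors, not adjacent in sequence
    return nodes, edges


chain = [("ASP", (0.0, 0.0, 0.0)), ("HIS", (3.8, 0.0, 0.0)),
         ("SER", (7.6, 0.0, 0.0)), ("GLY", (30.0, 0.0, 0.0))]
nodes, edges = protein_graph(chain)
print(nodes)
print(edges)
```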

Graph Database Mining

Input: database of labeled undirected graphs; support threshold 0 < σ ≤ 1 (in the example, σ = 2/3)

Output: All (connected) frequent subgraphs from the graph database.

● Performance is critical
● Number of patterns can grow exponentially for large and dense graphs
● Subgraph isomorphism (NP-complete)
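The support threshold can be illustrated on the smallest possible patterns, single labeled edges. This is only the first level of a frequent subgraph search, not any of the mining algorithms from the talk; `frequent_edges` and the toy database are assumptions.

```python
from collections import Counter


def frequent_edges(graphs, sigma):
    """Return single-edge patterns that occur in at least a fraction sigma of the graphs.

    graphs: list of (nodes, edges) with nodes = {id: label}, edges = {(u, v): label}.
    """
    counts = Counter()
    for nodes, edges in graphs:
        # Each pattern is counted once per graph, however often it occurs inside it.
        patterns = {(tuple(sorted((nodes[u], nodes[v]))), label)
                    for (u, v), label in edges.items()}
        counts.update(patterns)
    need = sigma * len(graphs)
    return {p for p, n in counts.items() if n >= need}


db = [
    ({1: "a", 2: "b", 3: "b"}, {(1, 2): "y", (2, 3): "x"}),
    ({1: "a", 2: "b", 3: "b", 4: "c"}, {(1, 2): "y", (2, 3): "x", (3, 4): "y"}),
    ({1: "b", 2: "a"}, {(1, 2): "y"}),
]
print(frequent_edges(db, 2 / 3))
```

Larger patterns require matching subgraphs up to isomorphism, which is where the cost noted on the slide comes from.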

Subgraph mining algorithms developed in our group

● Frequent Subgraph Mining [ICDM’03]
   • Canonical Adjacency Matrix (CAM) tree
● Induced Subgraph Mining [RECOMB’04]
   • Induced subgraphs: geometrically more rigid, superimposable
   • Miss many useful motifs “embedded” in a dense graph.
● Maximal frequent subgraph mining [SIGKDD’04]
   • Mines only maximal frequent subgraphs (no supergraph is frequent)
   • Uses a spanning tree comparison algorithm
● CliqueHashing and CliqueHashing+ [ISMB’05 demo]
   • Finding frequent cliques in linear time

Three Graph Representations

[Figure: three graph representations, CD, AD(0.5) and DT, with edge sets E(DT), E(AD), E(CD)]

Family Specific Fingerprints

● Frequent: occur in >80% of family proteins
● Family-specific: occur in <5% of background proteins

[Figure: Human Kallikrein 6 (1LO6), Serine Protease family, with subgraphs G1 and G2; residue labels TRP141, GLY196, GLY197, CYS42, ALA55, HIS57, CYS58]

Subgraph G1: Not sequence conserved. Useful for the annotation of structural orphans.

Subgraph G2: Sequence conserved motif C-x(12)-A-x-H-C. Useful for the annotation of both structural orphans and sequences.
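The two selection criteria amount to a simple filter over pattern supports. A minimal sketch, assuming supports have already been computed as fractions; `family_fingerprints` and the pattern names are hypothetical.

```python
def family_fingerprints(family_support, background_support,
                        min_family=0.80, max_background=0.05):
    """Keep patterns that are frequent in the family and rare in the background.

    Both arguments map a pattern identifier to the fraction of proteins containing it.
    """
    return sorted(
        pattern for pattern, support in family_support.items()
        if support > min_family
        and background_support.get(pattern, 0.0) < max_background
    )


family = {"G1": 0.92, "G2": 0.85, "G3": 0.60, "G4": 0.95}
background = {"G1": 0.01, "G2": 0.03, "G3": 0.00, "G4": 0.40}
print(family_fingerprints(family, background))
```

Here G3 is dropped for being too rare in the family and G4 for being too common in the background.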

Largest Serine Protease Fingerprint

[Figure: fingerprint residues ASP102, ALA55, HIS57, SER195, GLY43, GLY140, GLY142, ASP194, LEU16. Blue = His57-Asp102-Ser195 catalytic triad; Grey = others]

Cyclin Dependent Protein Kinases (structure of PDB:1B6C)

SCOP classification of 1B6C
● Superfamily: Protein Kinase like (PK like).
● Family: Protein Kinase catalytic subunit.
● Protein: Type I TGF-beta receptor R4.
● Species: Human.

• 6 residue motif is highlighted in red
• ASP(333) is part of the active site
• Conserved in 18 out of 29 PK proteins.

Applications of Family-Specific Fingerprints

● Functional family inference for Structural Genomics
● Functional family inference for predicted structures
● Functional neighbors and remote structural similarity
● Deriving sequence patterns from fingerprints

Motivation

● Hypothetical proteins from Structural Genomics:
   • structure known, function unknown
● Function has to be inferred from structure
   • Overall fold similarity
   • Local structure similarity to structure with known function
● Overall fold similarity not necessary, sometimes misleading
● Existing local structure methods
   • Search for known functional sites
   • Derive templates by clique detection

Related work : function inference from local structure

● Detecting similarity to known functional sites
   • SiteEngine [Shulman-Peleg et al, 2003]
   • SURFACE [Ferre et al, 2004]
   • eF-site [Kinoshita and Nakamura, 2004]
   • PINTS-weekly [Stark, Shkumatov and Russell 2004]
● Detecting functional sites derived from protein families
   • FoldMiner [Shapiro and Brutlag, 2003]
   • Phunctioner [Pazos and Sternberg, 2004]
   • DRESPAT [Wangikar et al, 2003]
   • Common structural cliques [Milik et al, 2003]

Techniques annotated on the slide: geometric hashing, surface patches, superposition; graph search / clique detection (DRESPAT, common structural cliques).

Method for functional inference

● Pick families from SCOP, EC or other classifications
● Model protein structures by labeled graphs, with almost-Delaunay edges defining proximity
● Enumerate all frequent subgraphs within the family using a subgraph mining algorithm
● Pick frequent subgraphs infrequent in background as family-specific fingerprints
● Search for fingerprints in structure to be annotated
   • use an index of graph similarity to speed up Ullmann’s algorithm
● Assign significance of family membership based on the fingerprints found.

Fast Graph Search Using Local Neighborhood Index

[Figure: query subgraph with nodes 1 ASP 102, 2 HIS 57, 3 ALA 55, 4 SER 214, 5 ALA 196, and local neighborhood index entries for its nodes such as A1H1S1, A2H1S1, A1D1S1, A2D1S1, A1D1H1S1, A2D1H1, A1S1]

Hard case: search for 11 subgraphs in the 6500-protein background dataset (hydrophobic, average 60 occurrences per protein). Search without index: intractable.

Function Inference Using Fingerprints

● Given query structure q
● Given fingerprints {X_1 … X_m} for prospective family F_i
● Say {X_q1 … X_qn} are found in q; is q in F_i?
● Simple approximation: based on # fingerprints
   • P-value based on number of BG proteins with more fingerprints
● Accurate: Bayesian formula applied to family and background probabilities of X_q1 … X_qn
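The simple approximation can be written as an empirical P-value over the background set. This sketch counts background proteins with at least as many fingerprints as the query, which is a conservative reading of "more" and an assumption; `fingerprint_p_value` and the counts are hypothetical.

```python
def fingerprint_p_value(query_count, background_counts):
    """Fraction of background proteins with at least as many fingerprints as the query."""
    as_many = sum(1 for n in background_counts if n >= query_count)
    return as_many / len(background_counts)


# Number of family fingerprints found in each of ten background proteins.
background_counts = [0, 0, 1, 2, 5, 10, 0, 3, 1, 0]
print(fingerprint_p_value(5, background_counts))
print(fingerprint_p_value(11, background_counts))
```

A small value means few unrelated proteins match as many fingerprints as the query, which supports family membership.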

Advantages of our method

● Sequence similarity: not sensitive enough
● Global fold similarity: misleading
● Functional site similarity
   • Different functional families sometimes share functional sites
   • Exact matching may not be robust (distortion/mutation)
   • Clique methods sacrifice generality of patterns
● Subgraph fingerprints:
   • Family-specific, few false positives by definition
   • Multiple fingerprints = consensus
   • Confidence of family membership

Discriminating the TIM barrels

● Validation of method: all TIM barrel families are structurally very similar.

FP family \ Family                TIM     Amylase   Alcohol Dehydrogenase   Xylose IM   …
TIM (1920 FP)                     1656    82        87                      89          …
Amylase (11 FP)                   1       9         1                       2           …
Alcohol Dehydrogenase (127 FP)    22      13        105                     1           …
Xylose IM (671 FP)                19      78        29                      615         …
…

Annotations missed by SCOP 1.65

New serine protease annotations, based on the number of fingerprints found out of 79 Serine Protease fingerprints: 1op0A (73/79)*; 1os8A (73/79); 1p57B (73/79); 1s83 (73/79); 1ssx (46/79); 1md8 (45/79).

New Triosephosphate Isomerase (TIM): 1r2r, 1885/1920 fingerprints.

Verified in PDB file headers, literature.

* All the above except 1op0 have been classified in SCOP 1.67, Feb 2005

Structural Genomics Function inference I

● 1nfg: SCOP 51556, Metallo-dependent hydrolase; 8-stranded βα (TIM) barrel fold; 17 members, 49 FP
● 1m65: CASP5 T0147, YcdX; unknown function; 7-stranded barrel fold; 30 FP found

Residues hit by fingerprints

[Figure: residues hit by fingerprints in 1nfg (SCOP 51556, Metallo-dependent hydrolase, 8-stranded βα (TIM) barrel fold, 17 members, 49 FP) and 1m65 (CASP5 T0147, YcdX, unknown function, 7-stranded barrel fold, 30 FP found), colored as acidic, basic, polar, hydrophobic. Figures made in VMD.]

Function inference for predictions

● Check predicted structures against family of template
● SNAPP [Fischer et al 2004], SPREK [Taylor, Jonassen 2004] not family-specific
   • well-packed predictions with wrong fold may score high
● Fingerprints infer the correct functional family, even if the template chosen is incorrect.
● E.g. CASP5 target T0147, PDB 1m65
   • rare (βα)8 fold, putative metallo-dependent hydrolase (MDH)
   • 107 predictions ranked 1
   • 50 predictions had 50% or more of 49 MDH FP
   • 51 other families had ≤4 preds with ≥50% FP

Functional Neighbors

● Finding families that share some fingerprints
   • Search for family fingerprints in the background
   • Cluster hits for significant enrichment in SCOP, GO hierarchy
   • E.g. find local similarity between remotely related SCOP families

[Figure: 1kew; SCOP: NAD(P) binding Rossmann fold; SCOP: FAD/NAD linked reductase]

The DALI Z-score of the two structures is 4.5, which suggests that they are dissimilar at the fold level. The pair-wise sequence identity is 16% and there is no local sequence similarity at the region of the motif.

Sequence Patterns from Tertiary Packing

● Frequent DT quadruplets or subgraph motifs that are conserved in sequence order, mapped back to sequence → Sparse Sequence Signatures
● Evaluate precision/recall by querying SwissProt
● Overlap with / comparable to PROSITE patterns
● Joint work with Ruchir Shah

Sequence Motif = {aa1, aa2, aa3, aa4, d12, d23, d34}, e.g. {D, S, G, P, 2, 3, 7}
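A motif in this form can be turned into a regular expression for scanning sequences. The sketch assumes each d value is the difference in sequence position between consecutive motif residues, so d = 2 leaves one arbitrary residue in between; that reading, the function name `motif_regex`, and the test sequence are assumptions.

```python
import re


def motif_regex(residues, separations):
    """Build a regex from one-letter residues and their sequence separations."""
    parts = [residues[0]]
    for residue, d in zip(residues[1:], separations):
        parts.append(f".{{{d - 1}}}")  # d - 1 arbitrary residues in between
        parts.append(residue)
    return "".join(parts)


pattern = motif_regex(["D", "S", "G", "P"], [2, 3, 7])
print(pattern)

sequence = "AADKSLLGAAAAAAPA"  # made-up sequence containing the motif at index 2
match = re.search(pattern, sequence)
print(match.start() if match else None)
```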

Future Work

● Biological validation of function inference
● Future applications in bioinformatics
   • Hierarchical family fingerprints to infer function for novel folds, with no putative family information
   • Tool for template verification in homology modeling/fold recognition
   • Augment domain classifications (SCOP) with motif-based functions
   • Augment structure neighbor searches (VAST) with functional neighbors
   • Robust neighbor relation to accelerate MD, QM simulations
   • Improve docking (graph matching, MD)
   • Local similarity search
   • Other geometric computations (Voronoi volumes/domains, alpha shapes, …)

Thanks to…

● Thesis advisor: Dr. Jack Snoeyink (UNC CS)
● Collaborators in this work:
   • Dr. Alexander Tropsha (UNC Pharmacy)
   • Jun (Luke) Huan, Dr. Wei Wang, Dr. Jan Prins (UNC CS)
   • Ruchir Shah (UNC Biomolecular Informatics)
   • Dr. Bala Krishnamoorthy (Washington State U. Pullman, Math)
   • Dr. Charlie Carter (UNC Biochemistry)
● Mother nature, for her wonderful imprecision and complexity, that is an endless source of problems…

References

● References to my publications:
   • Bandyopadhyay, D. and J. Snoeyink (2004). Almost-Delaunay simplices: Nearest neighbor relations for imprecise points. ACM-SIAM Symposium On Discrete Algorithms (SODA’04). http://www.cs.unc.edu/~debug/papers/AlmDel
   • Bandyopadhyay, D. and J. Snoeyink (2004). Almost-Delaunay simplices: Robust nearest neighbor relations for imprecise points in CGAL. Second CGAL User Workshop, 2004. Software: http://www.cs.unc.edu/~debug/software
   • Jun Huan, Wei Wang, Deepak Bandyopadhyay, Jack Snoeyink, Jan Prins, Alexander Tropsha (2004). Finding Protein Family-specific residue packing patterns in Protein Structure Graphs. RECOMB 2004. Invited to Journal of Computational Biology, 2005, in press.
   • Bandyopadhyay, Deepak, Alexander Tropsha and Jack Snoeyink. A Robust Score for Protein Packing using Almost-Delaunay Tetrahedra. 2005, in submission.
   • Bandyopadhyay, Deepak, Jun Huan, Jinze Liu, Jan Prins, Jack Snoeyink, Wei Wang, and Alexander Tropsha. Protein Functional Family Identification by Fast Subgraph Isomorphism Using Structure-Based Fingerprints Mined from SCOP and EC families. 2005, in submission. Poster presented at Triangle Biophysics Symposium, 2004.
   • Bandyopadhyay, Deepak, Jack Snoeyink, Alexander Tropsha and Charlie Carter. Analysis of Protein Conformational Change Using Almost-Delaunay Tetrahedra. Manuscript in preparation. Poster presented at Pacific Symposium on Biocomputing (PSB), Jan. 2005, Big Island of Hawaii.
   • Bandyopadhyay, Deepak, Alexander Tropsha and Jack Snoeyink. Analyzing Protein Structure using Almost-Delaunay Tetrahedra. UNC-CS Technical Report TR03-043, 2003. Poster presented at RECOMB 2004, March 2004, San Diego, CA.

References

● Computational geometry methods applied to protein structure analysis:
   • Gerstein, M., J. Tsai, and M. Levitt (1995). The volume of atoms on the protein surface: Calculated from simulation, using Voronoi polyhedra. Journal of Molecular Biology 249(5), 955-966.
   • Tsai, J., R. Taylor, C. Chothia, and M. Gerstein (1999). The packing density in proteins: Standard radii and volumes. Journal of Molecular Biology 290(1), 253-266.
   • Angelov, B., J. Sadoc, R. Jullien, A. Soyer, J. Mornon, and J. Chomilier (2002). Nonatomic solvent-driven Voronoi tessellation of proteins: an open tool to analyze protein folds. Proteins 49(4), 446-456.
   • J. Pontius, J. Richelle and S.J. Wodak (1996). Deviations from Standard Atomic Volumes as a Quality Measure for Protein Crystal Structures. Journal of Molecular Biology 264(1), 121-136.
   • H. Edelsbrunner and P. Koehl. The weighted-volume derivative of a space-filling diagram. PNAS, Mar 2003; 100: 2203-2208.
   • Liang, J. and K. A. Dill (2001). Are proteins well-packed? Biophys. J. 81(2), 751-766.
   • J. Liang, H. Edelsbrunner, P. Fu, P. Sudhakar, and S. Subramaniam. Analytical shape computing of macromolecules II: identification and computation of inaccessible cavities inside proteins. Proteins, 33:18-29, 1998.
   • H.L. Cheng. Algorithms for Smooth and Deformable Surfaces in 3D. Ph.D. Dissertation, University of Illinois at Urbana-Champaign, 2002.
   • Y.-E. Ban, H. Edelsbrunner and J. Rudolph. Interface surfaces for protein-protein complexes. Proc. RECOMB 2004.
   • Wernisch, L., M. Hunting, and S. Wodak (1999). Identification of structural domains in proteins by a graph heuristic. Proteins 35(3), 338-352.
   • Wako, H. and T. Yamato (1998). Novel method to detect a motif of local structures in different protein conformations. Protein Engineering 11, 981-990.

References

● SNAPP:
   • C. W. Carter, B. C. LeFebvre, S. Cammer, A. Tropsha, and M. H. Edgell (2001). Four-body potentials reveal protein-specific correlations to stability changes caused by hydrophobic core mutations. Journal of Molecular Biology, 311(4):625-638.
   • B. Krishnamoorthy and A. Tropsha (2003). Development of a four-body statistical pseudo-potential to discriminate native from non-native protein conformations. Bioinformatics, 19(12).
   • Tropsha, A., Carter, C., Cammer, S. & Vaisman, I. (2003). Simplicial neighborhood analysis of protein packing (SNAPP): a computational geometry approach to studying proteins. Meth. Enzymol., 374, 509-544.
● Hinges:
   • Krebs WG, Alexandrov V, Wilson CA, Echols N, Yu H, Gerstein M. (2002). Normal mode analysis of macromolecular motions in a database framework: developing mode concentration as a useful classifying statistic. Proteins. 2002 Sep 1;48(4):682-95.
   • Jacobs DJ, Rader AJ, Kuhn LA, Thorpe MF (2001). Protein Flexibility Predictions using Graph Theory. Proteins 44, 150-165.
   • M.F. Thorpe, Ming Lei, A.J. Rader, Donald J. Jacobs, and Leslie A. Kuhn (2001). Protein Flexibility and Dynamics using Constraint Theory. J. Molecular Graphics and Modelling 19, 60-69.
● Secondary structure:
   • Kabsch, W. and C. Sander (1983). Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22(12), 2577-2637.
● Family-specific motifs:
   • Cammer, S. A. and A. Tropsha (2000). Identification of sequence specific tertiary packing motifs in protein structures using Delaunay tessellation. Lecture Notes in Computational Science and Engineering. Springer Verlag, New York.
   • J. Huan, W. Wang, and J. Prins (2003). Efficient Mining of Frequent Subgraphs in the Presence of Isomorphism. International Conference on Data Mining ’03.
   • Jun (Luke) Huan, Wei Wang, Anglinia Washington, Jan Prins, Ruchir Shah, Alexander Tropsha (2004). Accurate Classification of Protein Structural Families Based on Coherent Subgraph Mining. PSB 2004.
   • Huan, J., Wang, W., Prins, J. & Yang, J. (2004b). SPIN: Mining maximal frequent subgraphs from graph databases. SIGKDD 2004.

Canonical Adjacency Matrix

The Canonical Adjacency Matrix (CAM) of a graph G is the maximal adjacency matrix for G under a total ordering defined on adjacency matrices.

[Figure: labeled graph P on vertices p1 … p5, with node labels a, b, b, c, d and edge labels x, y, and three of its adjacency matrices (lower triangles shown, node labels on the diagonal)]

M1:            M2:            M3:
a              a              b
y b            y b            x b
y x b          y x b          y 0 d
0 y 0 c        0 0 y d        0 y 0 c
0 0 y 0 d      0 y 0 0 c      y y 0 0 a

M1 > M2 > M3
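The definition can be checked by brute force on a small graph: try every vertex ordering, read off the lower-triangular code row by row, and keep the largest. This is a sketch of the definition only, not the CAM-tree algorithm; the symbol ranking a > b > c > d > x > y > 0, the vertex names, and the edge list of the example graph are assumptions.

```python
from itertools import permutations

# Assumed total order on symbols, lowest to highest: 0 < y < x < d < c < b < a.
RANK = {symbol: i for i, symbol in enumerate("0yxdcba")}


def matrix_code(order, labels, edges):
    """Lower-triangular adjacency code for one vertex ordering, read row by row."""
    code = []
    for i, u in enumerate(order):
        for v in order[:i]:
            code.append(edges.get(frozenset((u, v)), "0"))  # edge label or 0
        code.append(labels[u])                              # node label on the diagonal
    return code


def cam_code(labels, edges):
    """Maximal code over all vertex orderings (exponential; small graphs only)."""
    best = max((matrix_code(p, labels, edges) for p in permutations(labels)),
               key=lambda code: [RANK[s] for s in code])
    return "".join(best)


labels = {"p1": "a", "p2": "b", "p3": "b", "p4": "c", "p5": "d"}
edges = {frozenset(("p1", "p2")): "y", frozenset(("p1", "p3")): "y",
         frozenset(("p2", "p3")): "x", frozenset(("p2", "p4")): "y",
         frozenset(("p3", "p5")): "y"}
print(cam_code(labels, edges))
```

Because the code is canonical, two isomorphic labeled graphs yield the same string regardless of how their vertices are named.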

CAM Tree

[Figure: CAM tree for graph P, showing adjacency matrices of its subgraphs from single nodes (a, b, c, d) up to the matrix of P]

Chemical Datasets

● Predictive Toxicology Evaluation Competition
   • Dataset: 337 compounds
   • Two class labels: positive (180) and negative (157)
   • Each chemical graph contains 27 nodes and 27 edges on average
● NIH DTP Anti-Viral Screen Test
   • Chemicals are classified as Confirmed Active (CA), Confirmed Moderate Active (CM) and Confirmed Inactive (CI) in the NIH DTP Anti-Viral Screen Test.
   • Dataset contains 423 CA and 1083 CM compounds
   • Each chemical graph contains 25 nodes and 27 edges on average

Performance (Chemical Datasets)

[Figure: performance plots for the PTE and DTP CA/CM datasets]

FFSM and gSpan are currently the most efficient available frequent subgraph mining algorithms.