Transcript Document


- A brief introduction
- The MASS algorithm
  - The pairwise case
  - Extension to the multiple case
- Experimental results



- Protein analysis:
  - Protein classification
  - Detecting functional units which share similar geometrical configurations
- Applications to:
  - Docking
  - Protein engineering
- Drug design:
  - Pharmacophore searching

The LCP Problem

- Given a collection of point-sets in 3D space, find the largest common subset.
- Known as the Largest Common Point-set (LCP) problem.

 The LCP problem is NP-hard.

 All solutions are based on some heuristics.

The Multiple Alignment by Secondary Structures (MASS) Algorithm


- The MUSTA algorithm (Leibowitz, Fligelman, Nussinov, et al., 1999)
  - A truly multiple-based approach
- Desired improvements:
  - Efficiency
  - Finding partial solutions, i.e. alignments between a subset of the input molecules.

Partial Alignments

Two types of partial alignments:
[Figure: example partial alignments among molecules A, B and C]

General Strategy

- Pivot scheme
- Based on a two-level alignment:
  - Local secondary structure superposition
  - Global atomic superposition
- Geometric hashing paradigm

Why Secondary Structure?

- Stability: secondary structures are conserved during evolution.
- Robustness: proteins are dense molecules.
- Efficiency: introduces great savings in structural description.

The Pairwise Case

Outline:
- SSE assignment
- SSE representation
- Detection of seed matches
- Clustering the seed matches
- Global extension & refinement
[Figure labels: SSE Representation, Atomic Representation]

Step 1: SSE assignment

The proteins are represented by their secondary structure elements.

Secondary Structure Element (SSE)
  Helix
    Alpha    abundant
    3-10     infrequent
    π        rare
  Strand     abundant

Secondary Structure Assignment
  PDB        Bernstein et al. 1977
  DSSP       Kabsch & Sander 1983
  STICK      Taylor 2001
  DSSPCont   Andersen et al.


Step 2: SSE representation

- An SSE is represented by a 3D line segment with fuzzy endpoints.

Helix representation:

 Strand representation:

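- The SSE least-squares line minimizes $\sum_i d_i^2$, where $d_i$ is the distance from Cα atom $i$ to the line.
[Figure: 2D illustration of a Cα atom at $(x_i, y_i)$ and its distance $d_i$ to the line]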
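The least-squares line can be sketched with a singular value decomposition. This is a minimal illustration, not the authors' implementation: the function name `fit_sse_axis` and the choice to take the segment endpoints as projections of the first and last Cα atoms are assumptions.

```python
import numpy as np

def fit_sse_axis(ca_coords):
    """Fit a 3D least-squares line to the C-alpha atoms of one SSE.

    Returns (centroid, direction, start, end). The endpoints are the
    projections of the first and last atoms onto the line (an assumption).
    """
    pts = np.asarray(ca_coords, dtype=float)
    centroid = pts.mean(axis=0)
    # The first right-singular vector of the centered points is the direction
    # that minimizes the sum of squared perpendicular distances d_i^2.
    _, _, vt = np.linalg.svd(pts - centroid)
    direction = vt[0]
    # Orient the axis from the first residue towards the last one.
    if np.dot(pts[-1] - pts[0], direction) < 0:
        direction = -direction
    along = (pts - centroid) @ direction
    start = centroid + along[0] * direction
    end = centroid + along[-1] * direction
    return centroid, direction, start, end
```

The line always passes through the centroid of the atoms, so only the direction has to be solved for.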

Step 3: detection of seed matches

- Base: an SSE pair.
- Find bases whose configuration appears in both proteins.
- A base configuration is represented by a fingerprint.


- A base fingerprint is a 5D vector composed of:
  - SSE types: helix, strand
  - Line distance
  - Midpoint distance
  - Angle
[Figure: the midpoint distance and the line distance of a base]
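A base fingerprint could be computed as follows. This is a sketch under stated assumptions: each SSE is given as a hypothetical tuple `(type, start, end)`, the line distance is taken to be the shortest distance between the two infinite axis lines, and the angle is the angle between the two axis directions.

```python
import numpy as np

def line_distance(p1, d1, p2, d2):
    """Shortest distance between two infinite lines p + s*d (d of unit length)."""
    normal = np.cross(d1, d2)
    norm = np.linalg.norm(normal)
    if norm < 1e-9:
        # Parallel lines: distance from p2 to the first line.
        diff = p2 - p1
        return float(np.linalg.norm(diff - np.dot(diff, d1) * d1))
    return float(abs(np.dot(p2 - p1, normal)) / norm)

def base_fingerprint(sse_a, sse_b):
    """5D fingerprint: (type_a, type_b, line distance, midpoint distance, angle)."""
    type_a, a0, a1 = sse_a
    type_b, b0, b1 = sse_b
    a0, a1, b0, b1 = (np.asarray(v, dtype=float) for v in (a0, a1, b0, b1))
    dir_a = (a1 - a0) / np.linalg.norm(a1 - a0)
    dir_b = (b1 - b0) / np.linalg.norm(b1 - b0)
    midpoint_dist = float(np.linalg.norm((a0 + a1) / 2 - (b0 + b1) / 2))
    angle = float(np.degrees(np.arccos(np.clip(np.dot(dir_a, dir_b), -1.0, 1.0))))
    return (type_a, type_b, line_distance(a0, dir_a, b0, dir_b), midpoint_dist, angle)
```

All three numeric components depend only on relative geometry, so moving both SSEs by the same rotation and translation leaves the fingerprint unchanged.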

- The fingerprint is invariant to 3D rigid transformations.
- Bases with a similar fingerprint can be aligned in different ways:
  - Axis system superposition
  - Midpoint to midpoint alignment
  - RMSD minimization

Axis system superposition:
- Define an axis system on each base.
  [Figure: axis system of a base, with the Y-axis labelled]
- Superimpose the axis systems of matched bases.
- Based on the assumption: the line distance segments are conserved.
- Pros: no use of the SSE length and endpoints.
- Cons: the assumption is not always correct.

Pathological example in 2D:
[Figure: line distances d and d = 0]

Midpoint to midpoint alignment:
- Align the middle Cα atoms.
- Expand to the sides.
- Based on the assumptions:
  - SSE endpoints are fuzzy.
  - SSE midpoints are conserved.
- Pros: simplicity.
- Cons: the SSE midpoints are not always conserved.
  - DSSP sometimes splits an SSE in two.

RMSD minimization:
- Iterate over all the possible atomic alignments between the matched SSEs.
- Choose the alignment that minimizes the RMSD.
- Pros: no assumption.
- Cons: may converge to a local minimum instead of the global one.
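The RMSD minimization option can be illustrated for a single pair of matched SSEs. This is a simplified sketch, not the MASS implementation: it assumes ungapped alignments only (the shorter SSE slides along the longer one), uses the standard Kabsch least-squares superposition, and the names `kabsch` and `best_ungapped_alignment` are hypothetical.

```python
import numpy as np

def kabsch(P, Q):
    """Rotation R and translation t minimizing the RMSD between R @ p + t and q."""
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)
    U, _, Vt = np.linalg.svd(H)
    # Guard against an improper rotation (reflection).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, cq - R @ cp

def rmsd_after(P, Q, R, t):
    diff = P @ R.T + t - Q
    return float(np.sqrt(np.mean(np.sum(diff ** 2, axis=1))))

def best_ungapped_alignment(short_ca, long_ca):
    """Slide the shorter SSE along the longer one and keep the lowest-RMSD offset.

    Returns (offset, rmsd, R, t), where short_ca[i] is paired with
    long_ca[offset + i] and (R, t) maps the short SSE onto the long one.
    """
    short_ca = np.asarray(short_ca, dtype=float)
    long_ca = np.asarray(long_ca, dtype=float)
    n = len(short_ca)
    best = None
    for offset in range(len(long_ca) - n + 1):
        window = long_ca[offset:offset + n]
        R, t = kabsch(short_ca, window)
        score = rmsd_after(short_ca, window, R, t)
        if best is None or score < best[1]:
            best = (offset, score, R, t)
    return best
```

Because every offset is tried, this small search is exhaustive for one SSE pair; the local-minimum caveat applies once the alignment is extended beyond the seed.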

To find congruent bases efficiently:
- All bases are stored in a geometric hash according to their fingerprint.
- Bases that reside in the same hash bin or in adjacent bins are congruent.
  [Figure: 2D cut of the hash; ε is the tolerance]
- For each hash bin:
  - Retrieve all the bases in the bin and in the adjacent bins.
  - Insert the bases into a combinatorial bucket.
  - Two bases from different columns define a seed match.
  [Figure: a combinatorial bucket with columns Protein 1 and Protein 2, giving 3 x 2 seed matches]
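The hashing and bucket steps above can be sketched with a dictionary keyed by quantized fingerprints. Everything named here is an assumption for illustration: bases arrive as `(protein_id, base_id, fingerprint)` triples, a fingerprint is `(type_a, type_b, line_dist, midpoint_dist, angle)`, and `eps` holds one tolerance per numeric dimension.

```python
import math
from collections import defaultdict
from itertools import product

def bin_key(fingerprint, eps):
    """Quantize the numeric part of a fingerprint; SSE types are kept exactly."""
    type_a, type_b, *values = fingerprint
    return (type_a, type_b) + tuple(math.floor(v / e) for v, e in zip(values, eps))

def build_hash(bases, eps):
    table = defaultdict(list)
    for protein_id, base_id, fingerprint in bases:
        table[bin_key(fingerprint, eps)].append((protein_id, base_id))
    return table

def neighbours(key):
    """The bin itself plus all adjacent bins (same SSE types)."""
    types, cells = key[:2], key[2:]
    for delta in product((-1, 0, 1), repeat=len(cells)):
        yield types + tuple(c + d for c, d in zip(cells, delta))

def seed_matches(table):
    seeds = set()
    for key in list(table):
        # Combinatorial bucket: one column of bases per protein.
        bucket = defaultdict(list)
        for nkey in neighbours(key):
            for protein_id, base_id in table.get(nkey, ()):
                bucket[protein_id].append(base_id)
        proteins = sorted(bucket)
        for idx, p in enumerate(proteins):
            for q in proteins[idx + 1:]:
                # Two bases from different columns define a seed match.
                for a, b in product(bucket[p], bucket[q]):
                    seeds.add(((p, a), (q, b)))
    return seeds
```

With three congruent bases from one protein and two from another, the bucket yields 3 x 2 = 6 seed matches, matching the slide's example. Collecting seeds in a set removes the duplicates produced when the same pair is seen from several neighbouring bins.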

Step 4: clustering the seed matches

- Detect matches with similar transformations and join them into clusters.

RMSD clustering:
- Similar to (Rarey 1996)
- Works in an iterative manner


     


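[Formulas garbled in extraction; the surviving terms are a distance Dist(T_i, T_j) between transformations T_i and T_j and edges edge(T_i, T_j) defined through that distance.]
[Figure: graph on the transformations T1 to T6 with edge weights 1, 2 and 3]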
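One plausible reading of iterative RMSD clustering is sketched below. The slide's exact definitions did not survive, so the details are assumptions: Dist(T_i, T_j) is taken as the RMSD between a reference point set moved by T_i and by T_j, clusters are merged closest-first under single linkage, and `max_dist` is a hypothetical threshold.

```python
import numpy as np

def transform_distance(Ti, Tj, ref_points):
    """RMSD between ref_points moved by Ti and by Tj; each T is a pair (R, t)."""
    (Ri, ti), (Rj, tj) = Ti, Tj
    diff = (ref_points @ Ri.T + ti) - (ref_points @ Rj.T + tj)
    return float(np.sqrt(np.mean(np.sum(diff ** 2, axis=1))))

def cluster_transformations(transforms, ref_points, max_dist):
    """Iteratively merge the two closest clusters until none is within max_dist."""
    clusters = [[i] for i in range(len(transforms))]
    while True:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance of the closest pair of members.
                d = min(
                    transform_distance(transforms[i], transforms[j], ref_points)
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if d <= max_dist and (best is None or d < best[0]):
                    best = (d, a, b)
        if best is None:
            return clusters
        _, a, b = best
        clusters[a].extend(clusters.pop(b))
```

The returned clusters are lists of indices into `transforms`; a representative transformation per cluster can then be carried on to the extension step.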

Step 5: global extension & refinement

For each match:
- Apply its transformation.
- Find corresponding atoms that lie close enough to each other after the superposition.
- Refine using the least-squares fitting transformation.
- Iterate until the RMSD converges.
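The extension and refinement loop can be sketched as below. This is an illustration under assumptions, not the authors' code: atoms are paired greedily, closest first and one-to-one, within a hypothetical cutoff `max_dist`; the refit uses the standard Kabsch superposition; and the loop stops when the RMSD changes by less than `tol`.

```python
import numpy as np

def kabsch(P, Q):
    """Rotation R and translation t minimizing the RMSD between R @ p + t and q."""
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    U, _, Vt = np.linalg.svd((P - cp).T @ (Q - cq))
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, cq - R @ cp

def match_close_atoms(moved, B, max_dist):
    """Greedy one-to-one pairing of atoms closer than max_dist, closest first."""
    dists = np.linalg.norm(moved[:, None, :] - B[None, :, :], axis=2)
    cand = np.argwhere(dists <= max_dist)
    order = np.argsort(dists[cand[:, 0], cand[:, 1]])
    used_a, used_b, pairs = set(), set(), []
    for i, j in cand[order]:
        i, j = int(i), int(j)
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            pairs.append((i, j))
    return pairs

def extend_and_refine(A, B, R, t, max_dist=3.0, tol=1e-4, max_iter=50):
    """Refine the transformation (R, t) that maps atoms A onto atoms B."""
    pairs, rmsd, prev = [], float("inf"), float("inf")
    for _ in range(max_iter):
        pairs = match_close_atoms(A @ R.T + t, B, max_dist)
        if len(pairs) < 3:
            break  # too few correspondences to fit a rigid transformation
        ia = [i for i, _ in pairs]
        ib = [j for _, j in pairs]
        R, t = kabsch(A[ia], B[ib])
        diff = A[ia] @ R.T + t - B[ib]
        rmsd = float(np.sqrt(np.mean(np.sum(diff ** 2, axis=1))))
        if abs(prev - rmsd) < tol:
            break
        prev = rmsd
    return R, t, pairs, rmsd
```

The all-against-all distance matrix keeps the sketch short; for whole proteins a spatial grid or k-d tree would be the usual choice.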

The Multiple Case

Outline:
- SSE assignment & representation
- Detection of seed matches
- Clustering the seed pairwise matches
- Global extension of pairwise matches
- Computing multiple matches
- Refinement
- Selecting high-scoring multiple matches

Finding bases whose configuration appears in a sufficient number of molecules:
- All bases are stored in a geometric hash according to their fingerprint.
- Bases that reside in the same bin or in adjacent bins are congruent.
- For each hash bin:
  - Retrieve all the bases in the bin and in the adjacent bins.
  - Insert them into a combinatorial bucket (CB).
  [Figure: a combinatorial bucket with columns Protein i, Protein j, Protein k, Protein r and Protein s]

- Construct pairwise seed matches.
  - The reference protein is the one with the smaller index.
- Cluster the pairwise matches.
- Globally extend the pairwise matches.

- Recursively construct the multiple alignment.
  [Figure: combinatorial bucket with columns Protein i, Protein j, Protein k, Protein r and Protein s]

- Refinement
- Selecting high-scoring multiple matches
  - The score of a multiple match is a function of its number of proteins, n, and its number of atoms, k.
  - Example: n = 3, k = 4, score = 12

Experimental Results


Partial Solutions

All-alpha Class

The common core of ten proteins belonging to 4 different folds of the all-alpha class.

TIM-barrel Fold

The common core of 6 out of 7 proteins, taken from different superfamilies of the TIM-barrel fold.

Calcium Binding

The common core of 6 proteins belonging to 3 different families of the EF-hand-like superfamily.

Lipase Family

An alignment of four structures from different species of the lipase family.

Two of the conformations are open and two are closed.