
Agenda

- A brief introduction
- The MASS algorithm
  - The pairwise case
  - Extension to the multiple case
- Experimental results

Introduction

Importance

- Protein analysis:
  - Protein classification
  - Detecting functional units which share similar geometrical configurations
- Applications to:
  - Docking
  - Protein engineering
- Drug design:
  - Pharmacophore searching

The LCP Problem

- Given a collection of point-sets in 3D space, find the largest common subset.
- This is known as the Largest Common Point-set (LCP) problem.
- The LCP problem is NP-hard.
- All practical solutions therefore rely on heuristics.

The Multiple Alignment by Secondary Structures (MASS) Algorithm

Motivation

- The MUSTA algorithm (Leibowitz, Fligelman, Nussinov, and Wolfson 1999)
  - A truly multiple-based approach
- Desired improvements:
  - Efficiency
  - Finding partial solutions, i.e. alignments between a subset of the input molecules

Partial Alignments

Two types of partial alignments:

[Figure: molecules A, B and C illustrating the two types of partial alignment]

General Strategy

- Pivot scheme
- Based on a two-level alignment:
  - Local secondary structure superposition
  - Global atomic superposition
- Geometric hashing paradigm

Why Secondary Structure?

- Stability: secondary structures are conserved during evolution
- Robustness: proteins are dense molecules
- Efficiency: introduces great savings in structural description

The Pairwise Case

Outline:
- SSE assignment
- SSE representation
- Detection of seed matches
- Clustering the seed matches
- Global extension & refinement

[Figure: the steps are annotated as working on either the SSE representation or the atomic representation]

Step 1: SSE assignment

The proteins are represented by their secondary structure elements.

Secondary Structure Element (SSE)

Type     Subtype   Frequency
Helix    Alpha     abundant
Helix    3₁₀       infrequent
Helix    π         rare
Strand             abundant

Secondary Structure Assignment

Method     Reference
PDB        Bernstein et al. 1977
DSSP       Kabsch & Sander 1983
STICK      Taylor 2001
DSSPCont   Andersen et al. 2002

Step 2: SSE representation

- An SSE is represented by a 3D line segment with fuzzy endpoints.
- [Figures: helix representation; strand representation]
- The SSE least-squares line minimizes:

  $\sum_i d_i^2$

  where $d_i$ is the distance from the $i$-th Cα atom $(x_i, y_i)$ to the line.
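The least-squares line above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `fit_sse_line` and `sum_squared_distances` are hypothetical, the input is a plain list of Cα coordinates, and power iteration is just one convenient way to obtain the dominant direction; the slides do not say how MASS computes it.

```python
import math

def fit_sse_line(points, iterations=200):
    """Fit a least-squares 3D line to a list of (x, y, z) C-alpha coordinates.

    Returns (centroid, direction). The line through the centroid along the
    dominant eigenvector of the scatter matrix minimizes sum_i d_i^2.
    """
    n = len(points)
    centroid = [sum(p[a] for p in points) / n for a in range(3)]
    centered = [[p[a] - centroid[a] for a in range(3)] for p in points]
    # 3x3 scatter matrix of the centered coordinates.
    scatter = [[sum(q[a] * q[b] for q in centered) for b in range(3)]
               for a in range(3)]
    # Power iteration (an illustrative choice) for the dominant eigenvector.
    v = [0.6, 0.5, 0.4]
    length = math.sqrt(sum(x * x for x in v))
    v = [x / length for x in v]
    for _ in range(iterations):
        w = [sum(scatter[a][b] * v[b] for b in range(3)) for a in range(3)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # degenerate input, e.g. all points identical
            break
        v = [x / norm for x in w]
    return centroid, v

def sum_squared_distances(points, centroid, direction):
    """Sum of squared point-to-line distances, i.e. the minimized quantity."""
    total = 0.0
    for p in points:
        r = [p[a] - centroid[a] for a in range(3)]
        along = sum(r[a] * direction[a] for a in range(3))
        total += sum(x * x for x in r) - along * along
    return total
```

For perfectly collinear atoms the fitted direction is parallel to the atoms' common line and the residual is zero up to rounding.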

Step 3: detection of seed matches



- Base: an SSE pair
- Goal: finding bases whose configuration appears in both proteins.
- A base configuration is represented by a fingerprint.
- A base fingerprint is a 5D vector composed of:
  - SSE types (helix or strand) of the two SSEs
  - Line distance
  - Midpoint distance
  - Angle

[Figure: a base, with its midpoint distance and line distance marked]
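A possible way to compute the base fingerprint described above is sketched here. Several details are assumptions rather than facts from the slides: each SSE is encoded as a hypothetical tuple `(type, start, end)`, the "line distance" is taken to be the shortest distance between the two infinite lines carrying the segments, and the angle is the angle between the segment directions.

```python
import math

def _sub(a, b):
    return [a[i] - b[i] for i in range(3)]

def _dot(a, b):
    return sum(a[i] * b[i] for i in range(3))

def _cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def _norm(a):
    return math.sqrt(_dot(a, a))

def base_fingerprint(sse1, sse2):
    """Return (type1, type2, line_distance, midpoint_distance, angle).

    Each SSE is (type, start, end), with type 'H' (helix) or 'E' (strand)
    and start/end the 3D endpoints of its line segment (assumed encoding).
    """
    t1, s1, e1 = sse1
    t2, s2, e2 = sse2
    d1, d2 = _sub(e1, s1), _sub(e2, s2)
    m1 = [(s1[i] + e1[i]) / 2.0 for i in range(3)]
    m2 = [(s2[i] + e2[i]) / 2.0 for i in range(3)]
    midpoint_distance = _norm(_sub(m2, m1))
    cos_angle = _dot(d1, d2) / (_norm(d1) * _norm(d2))
    angle = math.acos(max(-1.0, min(1.0, cos_angle)))
    normal = _cross(d1, d2)
    if _norm(normal) < 1e-9:
        # Parallel lines: distance from a point of line 2 to line 1.
        line_distance = _norm(_cross(_sub(s2, s1), d1)) / _norm(d1)
    else:
        # Skew or intersecting lines: project onto the common normal.
        line_distance = abs(_dot(_sub(s2, s1), normal)) / _norm(normal)
    return (t1, t2, line_distance, midpoint_distance, angle)
```

Every component depends only on distances and angles between the two segments, so moving both SSEs by the same rigid transformation leaves the fingerprint unchanged.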

- The fingerprint is invariant to 3D rigid transformation.
- Bases with a similar fingerprint can be aligned in different ways:
  - Axis system superposition
  - Midpoint to midpoint alignment
  - RMSD minimization

Axis system superposition:
- Define an axis-system on each base. [Figure: a base with its Y-axis marked]
- Superimpose the axis-systems of matched bases. [Figure: superimposed bases sharing the Y-axis]
- Based on the assumption:
  - The line distance segments are conserved
- Pros:
  - No use of the SSE length and endpoints
- Cons:
  - The assumption is not always correct.
- Pathological example in 2D: [Figure: line distance d versus d = 0]

Midpoint to midpoint alignment:
- Align the mid Cα atoms
- Expand to the sides
- Based on the assumptions:
  - SSE endpoints are fuzzy
  - SSE midpoints are conserved
- Pros:
  - Simplicity
- Cons:
  - The SSE midpoints are not always conserved.
  - DSSP sometimes splits an SSE in two.
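The midpoint-to-midpoint pairing can be illustrated with a short sketch. It assumes each SSE is given only by its number of Cα atoms, takes the middle atom to be the one at index `length // 2`, and returns residue index pairs; the function name `midpoint_alignment` is hypothetical.

```python
def midpoint_alignment(len1, len2):
    """Pair the residues of two SSEs by aligning their middle C-alpha atoms.

    The middle atoms are paired first, then the pairing is expanded to both
    sides until the shorter SSE runs out on that side.
    Returns a list of (index_in_sse1, index_in_sse2) pairs.
    """
    mid1, mid2 = len1 // 2, len2 // 2
    left = min(mid1, mid2)                            # room on the left side
    right = min(len1 - 1 - mid1, len2 - 1 - mid2)     # room on the right side
    return [(mid1 + k, mid2 + k) for k in range(-left, right + 1)]
```

For example, a 7-residue SSE against a 4-residue SSE yields the pairs `(1, 0), (2, 1), (3, 2), (4, 3)`; the unpaired ends reflect the fuzzy endpoints.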

RMSD minimization:
- Iterate over all the possible atomic alignments between the matched SSEs.
- Choose the alignment that minimizes the RMSD.
- Pros:
  - No assumptions
- Cons:
  - Convergence to a local minimum instead of a global one.

To find congruent bases efficiently:
- All bases are stored in a geometric hash according to their fingerprint.
- Bases that reside in the same hash bin or in adjacent bins are congruent.
  [Figure: 2D cut of the hash; ε is the tolerance]
- For each hash bin:
  - Retrieve all the bases in the bin and in the adjacent bins
  - Insert the bases into a combinatorial bucket
  - Two bases from different columns define a seed match
  [Figure: combinatorial bucket with columns Protein 1 and Protein 2; 3 x 2 seed matches]
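The hashing and bucket steps above might look roughly as follows. This is a sketch under assumptions: fingerprints are tuples `(type1, type2, line_distance, midpoint_distance, angle)`, a single tolerance `eps` is used as the bin width for all three numeric components (a simplification), and the names `bin_key`, `build_hash` and `seed_matches` are hypothetical.

```python
import math
from collections import defaultdict
from itertools import product

def bin_key(fingerprint, eps):
    """Map a fingerprint to its hash bin; SSE types must match exactly."""
    t1, t2, *values = fingerprint
    return (t1, t2) + tuple(math.floor(v / eps) for v in values)

def build_hash(bases, eps):
    """bases: iterable of (protein_id, base_id, fingerprint)."""
    table = defaultdict(list)
    for protein_id, base_id, fingerprint in bases:
        table[bin_key(fingerprint, eps)].append((protein_id, base_id))
    return table

def seed_matches(table):
    """Collect seed matches: pairs of bases from different proteins that
    fall in the same bin or in adjacent bins."""
    seeds = set()
    for key in list(table):
        types, cells = key[:2], key[2:]
        bucket = defaultdict(list)  # combinatorial bucket: one column per protein
        for offset in product((-1, 0, 1), repeat=len(cells)):
            neighbour = types + tuple(c + o for c, o in zip(cells, offset))
            for protein_id, base_id in table.get(neighbour, ()):
                bucket[protein_id].append(base_id)
        proteins = sorted(bucket)
        for a in range(len(proteins)):
            for b in range(a + 1, len(proteins)):
                for x in bucket[proteins[a]]:
                    for y in bucket[proteins[b]]:
                        seeds.add(((proteins[a], x), (proteins[b], y)))
    return seeds
```

With three congruent bases in one protein and two in another, the bucket has columns of size 3 and 2 and produces 3 x 2 = 6 seed matches. Using a set avoids reporting the same pair once per neighbouring bin.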

Step 4: clustering the seed matches

- Detect matches with similar transformations and join them into clusters.
- Using RMSD clustering:
  - Similar to (Rarey 1996)
  - Works in an iterative manner

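$G = (V, E)$, where $V$ is the set of transformations $T_i$ and each $\mathrm{edge}(T_i, T_j) \in E$ is weighted by $\mathrm{Dist}(T_i, T_j)$,

where $\mathrm{Dist}(T_i, T_j) = \mathrm{RMSD}(T_i(\cdot), T_j(\cdot))$

[Figure: example graph on transformations T1 to T6 with edge weights 1, 2 and 3]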
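One way to picture iterative RMSD clustering of transformations is the sketch below. It is not taken from the slides: it assumes each transformation is a pair `(rotation, translation)`, measures the distance between two transformations as the RMSD between the images of a caller-supplied point set, and merges the two closest clusters (complete linkage) until no pair is closer than a hypothetical threshold `max_dist`.

```python
import math

def apply(transform, point):
    """Apply (rotation 3x3, translation) to a 3D point."""
    rot, trans = transform
    return [sum(rot[r][c] * point[c] for c in range(3)) + trans[r]
            for r in range(3)]

def dist(t_i, t_j, points):
    """RMSD between the images of `points` under the two transformations."""
    total = 0.0
    for p in points:
        a, b = apply(t_i, p), apply(t_j, p)
        total += sum((a[k] - b[k]) ** 2 for k in range(3))
    return math.sqrt(total / len(points))

def cluster_transformations(transforms, points, max_dist):
    """Iteratively merge the two closest clusters of transformation indices."""
    clusters = [[i] for i in range(len(transforms))]

    def linkage(c1, c2):
        # Complete linkage: the largest pairwise distance between members.
        return max(dist(transforms[i], transforms[j], points)
                   for i in c1 for j in c2)

    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > max_dist:
            break  # no remaining pair is similar enough
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

Complete linkage keeps every pair inside a cluster within `max_dist`, which is one reasonable reading of "similar transformation"; other linkage rules would also fit the description.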

Step 5: global extension & refinement

For each match:
- Apply its transformation.
- Find corresponding atoms that lie close enough to each other after the superposition.
- Use the least-squares fitting transformation to refine the match.
- Iterate until the RMSD converges.
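The correspondence step of this loop can be sketched as below. Only that step and the RMSD computation are shown; the least-squares fit itself is left out. The greedy closest-first pairing, the `cutoff` parameter and the function names are assumptions made for illustration, and the coordinates of the second molecule are expected to be already transformed.

```python
import math

def match_atoms(ref_atoms, moved_atoms, cutoff):
    """Greedy one-to-one pairing of atoms that lie within `cutoff` of each
    other after superposition. Returns sorted (ref_index, moved_index) pairs."""
    candidates = []
    for i, a in enumerate(ref_atoms):
        for j, b in enumerate(moved_atoms):
            d = math.dist(a, b)
            if d <= cutoff:
                candidates.append((d, i, j))
    candidates.sort()  # closest pairs are accepted first
    used_ref, used_moved, pairs = set(), set(), []
    for _, i, j in candidates:
        if i not in used_ref and j not in used_moved:
            used_ref.add(i)
            used_moved.add(j)
            pairs.append((i, j))
    return sorted(pairs)

def rmsd(ref_atoms, moved_atoms, pairs):
    """Root-mean-square deviation over the matched atom pairs."""
    total = sum(math.dist(ref_atoms[i], moved_atoms[j]) ** 2 for i, j in pairs)
    return math.sqrt(total / len(pairs))
```

In a full refinement loop, the matched pairs would be passed to a least-squares superposition routine, the new transformation applied, and the matching repeated until the RMSD stops changing.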

The Multiple Case

Outline:
- SSE assignment & representation
- Detection of seed matches
- Clustering the seed pairwise matches
- Global extension of pairwise matches
- Computing multiple matches
- Refinement
- Selecting high-scoring multiple matches

Finding bases whose configuration appears in a sufficient number of molecules:
- All bases are stored in a geometric hash according to their fingerprint.
- Bases that reside in the same bin or in adjacent bins are congruent.
- For each hash bin:
  - Retrieve all the bases in the bin and in the adjacent bins
  - Insert them into a combinatorial bucket (CB)
  [Figure: combinatorial bucket with columns Protein i, Protein j, Protein k, Protein r, Protein s]

- Construct pairwise seed matches.
  - The reference protein is the one with the smaller index.
- Cluster the pairwise matches.
- Globally extend the pairwise matches.

- Recursively construct multiple alignment:
  [Figure: combinatorial bucket with columns Protein i, Protein j, Protein k, Protein r, Protein s]

- Refinement
- Selecting high-scoring multiple matches
  - The score of a multiple match with n proteins and k atoms is given by:

    $\mathrm{score} = n \cdot k$

  - Example: n = 3, k = 4, score = 12
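The recursive construction and scoring of multiple matches could be sketched as follows. Much of this is assumption rather than the method on the slides: each pairwise match is reduced to the set of pivot (reference protein) atoms it covers, a multiple match keeps only the pivot atoms common to all combined pairwise matches, and the score is taken to be the product n * k, which is consistent with the example n = 3, k = 4, score = 12. The name `build_multiple_matches` is hypothetical.

```python
def build_multiple_matches(pairwise_matches):
    """Combine pairwise matches that share the same pivot protein.

    pairwise_matches: list of (protein_id, frozenset of matched pivot atoms).
    Returns a list of (score, proteins, core) sorted by decreasing score,
    where n counts the pivot plus the combined proteins and k = len(core).
    """
    results = []

    def extend(start, proteins, core):
        if proteins:
            n = len(proteins) + 1          # combined proteins plus the pivot
            results.append((n * len(core), tuple(proteins), core))
        for idx in range(start, len(pairwise_matches)):
            protein_id, atoms = pairwise_matches[idx]
            new_core = atoms if core is None else core & atoms
            if new_core:                   # stop extending once the core is empty
                extend(idx + 1, proteins + [protein_id], new_core)

    extend(0, [], None)
    results.sort(key=lambda r: r[0], reverse=True)
    return results
```

Because every subset of proteins with a non-empty common core is reported, partial solutions (alignments between a subset of the input molecules) appear in the output alongside the full one, and the ranking by score decides between them.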

Experimental Results

MASS vs. MUSTA

Partial Solutions

All-alpha Class

The core shared by ten proteins, which belong to 4 different folds of the all-alpha class.

TIM-barrel Fold

The core shared by 6 of 7 proteins taken from different superfamilies of the TIM-barrel fold.

The core of 6 proteins belonging to 3 different families of the EF hand-like superfamily.

Calcium Binding

An alignment of four structures from different species of the Lipase Family.

Two of the conformations are open and two of them are closed.