Agenda
A brief introduction
The MASS algorithm
The pairwise case
Extension to the multiple case
Experimental results
Introduction
Importance
Protein analysis: protein classification; detecting functional units that share similar geometrical configurations.
Applications: docking; protein engineering; drug design (pharmacophore searching).
The LCP Problem
Given a collection of M point-sets in 3D space, find the largest common subset.
This is known as the LCP problem.
The LCP problem is NP-hard.
All solutions are based on some heuristics.
The Multiple Alignment by Secondary Structures (MASS) Algorithm
Motivation
The MUSTA algorithm (Leibowitz, Fligelman, Nussinov, and Wolfson, 1999): a truly multiple-based approach.
Desired improvements:
Efficiency
Finding partial solutions, i.e. alignments between a subset of the input molecules.
Partial Alignments
Two types of partial alignments:
[Figure: diagrams of molecules labeled A, B, and C, with one region labeled B & C]
General Strategy
Pivot scheme
Based on a two-level alignment: local secondary structure superposition; global atomic superposition
Geometric hashing paradigm
Why Secondary Structure?
Stability: secondary structures are conserved during evolution.
Robustness: proteins are dense molecules.
Efficiency: introduces great savings in structural description.
The Pairwise Case
Outline:
SSE assignment
SSE representation
Detection of seed matches
Clustering the seed matches
Global extension & refinement
[Figure panels: SSE Representation; Atomic Representation]
Step 1: SSE assignment
The proteins are represented by their secondary structure elements.
Secondary Structure Element (SSE) types:
Helix: alpha (abundant), 3₁₀ (infrequent), π (rare)
Strand (abundant)
Secondary structure assignment:
PDB (Bernstein et al., 1977)
DSSP (Kabsch & Sander, 1983)
STICK (Taylor, 2001)
DSSPCont (Andersen et al., 2002)
Step 2: SSE representation
An SSE is represented by a 3D line segment with fuzzy endpoints.
Helix representation:
Strand representation:
The SSE least-squares line minimizes $\sum_i d_i^2$, where $d_i$ is the distance from Cα atom $i$, located at $(x_i, y_i)$ in the 2D illustration, to the line.
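The least-squares line can be sketched in a few lines of Python. This is a minimal illustration, not the MASS implementation: it assumes the SSE is given as a list of Cα coordinates, and the names `fit_sse_line` and `sum_squared_distances` are hypothetical. The best-fit line passes through the centroid along the dominant eigenvector of the scatter matrix, which is found here by power iteration to avoid external libraries.

```python
import math

def fit_sse_line(points, iterations=50):
    """Return (centroid, unit direction) of the 3D least-squares line."""
    n = len(points)
    centroid = [sum(p[a] for p in points) / n for a in range(3)]
    centered = [[p[a] - centroid[a] for a in range(3)] for p in points]
    # 3x3 scatter matrix of the centered coordinates.
    scatter = [[sum(c[a] * c[b] for c in centered) for b in range(3)]
               for a in range(3)]

    def multiply(v):
        return [sum(scatter[a][b] * v[b] for b in range(3)) for a in range(3)]

    # Power iteration from each coordinate axis; keep the direction with the
    # largest spread, so a start orthogonal to the answer cannot mislead us.
    best_direction, best_spread = None, -1.0
    for axis in range(3):
        v = [1.0 if a == axis else 0.0 for a in range(3)]
        for _ in range(iterations):
            w = multiply(v)
            length = math.sqrt(sum(x * x for x in w))
            if length == 0.0:
                break  # this start carries no information about the line
            v = [x / length for x in w]
        spread = sum(x * y for x, y in zip(v, multiply(v)))
        if spread > best_spread:
            best_direction, best_spread = v, spread
    return centroid, best_direction

def sum_squared_distances(points, centroid, direction):
    """Sum of squared point-to-line distances (the minimized quantity)."""
    total = 0.0
    for p in points:
        r = [p[a] - centroid[a] for a in range(3)]
        along = sum(r[a] * direction[a] for a in range(3))
        total += sum(x * x for x in r) - along * along
    return total
```

For elongated point sets such as helices and strands, the dominant eigenvalue is well separated, so a few dozen iterations are normally enough.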
Step 3: detection of seed matches
Base: an SSE pair.
Goal: find bases whose configuration appears in both proteins.
A base configuration is represented by a fingerprint.
A base fingerprint is a 5D vector composed of:
SSE types of the two SSEs (helix or strand)
Line distance
Midpoint distance
Angle
[Figure: midpoint distance and line distance between two SSEs]
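A fingerprint of this kind could be computed roughly as follows. The sketch rests on assumptions the slides do not spell out: each SSE is a tuple `(type, start, end)` of its line segment, "line distance" is taken as the shortest distance between the two infinite axis lines, and "angle" is the angle between the axis directions. The function name `fingerprint` and the helpers are illustrative.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def norm(a):
    return math.sqrt(dot(a, a))

def fingerprint(sse1, sse2):
    """Return (type1, type2, line distance, midpoint distance, angle in degrees)."""
    type1, s1, e1 = sse1
    type2, s2, e2 = sse2
    d1, d2 = sub(e1, s1), sub(e2, s2)
    m1 = tuple((a + b) / 2 for a, b in zip(s1, e1))
    m2 = tuple((a + b) / 2 for a, b in zip(s2, e2))
    midpoint_distance = norm(sub(m2, m1))
    normal = cross(d1, d2)
    if norm(normal) > 1e-9:
        # Non-parallel lines: project the connecting vector on the common normal.
        line_distance = abs(dot(sub(s2, s1), normal)) / norm(normal)
    else:
        # Parallel lines: distance from a point of line 2 to line 1.
        line_distance = norm(cross(sub(s2, s1), d1)) / norm(d1)
    cos_angle = dot(d1, d2) / (norm(d1) * norm(d2))
    angle = math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))
    return (type1, type2, line_distance, midpoint_distance, angle)
```

All three numeric components depend only on distances and angles between the two segments, so they are unchanged when both SSEs are rotated and translated together.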
The fingerprint is invariant to 3D rigid transformations.
Bases with a similar fingerprint can be aligned in different ways:
Axis system superposition
Midpoint to midpoint alignment
RMSD minimization
Axis system superposition:
Define an axis system on each base. [Figure: base with its Y-axis labeled]
Superimpose the axis systems of matched bases.
Based on the assumption that the line-distance segments are conserved.
Pros: no use of the SSE length and endpoints.
Cons: the assumption is not always correct.
Pathological example in 2D: [Figure labels: d, d = 0]
Midpoint to midpoint alignment:
Align the middle Cα atoms.
Expand to the sides.
Based on the assumptions: SSE endpoints are fuzzy; SSE midpoints are conserved.
Pros: simplicity.
Cons: the SSE midpoints are not always conserved.
DSSP sometimes splits an SSE in two.
RMSD minimization:
Iterate over all possible atomic alignments between the matched SSEs.
Choose the alignment that minimizes the RMSD.
Pros: no assumptions.
Cons: may converge to a local minimum instead of the global one.
To find congruent bases efficiently:
All bases are stored in a geometric hash (GH) according to their fingerprint.
Bases that reside in the same hash bin or in adjacent bins are considered congruent. [Figure: 2D cut of the hash; ε is the tolerance]
For each hash bin:
Retrieve all the bases in the bin and in the adjacent bins.
Insert the bases into a combinatorial bucket.
Two bases from different columns define a seed match. [Figure: bucket with a Protein 1 column and a Protein 2 column, giving 3 x 2 seed matches]
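The hashing and bucket steps above can be sketched as follows. This is an illustrative simplification, not the MASS code: fingerprints are assumed to be tuples `(type1, type2, line_distance, midpoint_distance, angle)`, a single tolerance `eps` is used as the bin width for all three numeric components, protein ids are assumed to be comparable (integers here), and all function names are hypothetical.

```python
from collections import defaultdict
from itertools import product

def bin_key(fingerprint, eps):
    """Quantize the numeric part of a fingerprint into a hash-bin key."""
    type1, type2, *values = fingerprint
    return (type1, type2) + tuple(int(v // eps) for v in values)

def build_hash(bases, eps):
    """bases: iterable of (protein_id, base_id, fingerprint)."""
    table = defaultdict(list)
    for protein_id, base_id, fp in bases:
        table[bin_key(fp, eps)].append((protein_id, base_id))
    return table

def neighbors(key):
    """Yield the bin itself and every adjacent bin with the same SSE types."""
    types, cells = key[:2], key[2:]
    for offsets in product((-1, 0, 1), repeat=len(cells)):
        yield types + tuple(c + o for c, o in zip(cells, offsets))

def seed_matches(table):
    """Pairs of bases from different proteins in the same or adjacent bins."""
    matches = set()
    for key in list(table):
        # Combinatorial bucket: everything in this bin and its neighbors.
        bucket = [b for k in neighbors(key) for b in table.get(k, [])]
        for a, b in product(table[key], bucket):
            if a[0] < b[0]:
                matches.add((a, b))
            elif b[0] < a[0]:
                matches.add((b, a))
    return matches
```

With three congruent bases from protein 1 and two from protein 2, `seed_matches` returns the 3 x 2 = 6 pairs of the slide's example; bases whose SSE types differ never share a bin.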
Step 4: clustering the seed matches
Detect matches with similar transformations and join them into clusters.
Uses RMSD clustering:
Similar to (Rarey 1996)
Works in an iterative manner
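$G = (V, E)$, with $V = \{T_i\}$ and $E = \{\mathrm{edge}(T_i, T_j) : \mathrm{Dist}(T_i, T_j) \text{ is small enough}\}$, where $\mathrm{Dist}(T_i, T_j)$ compares the images $T_i(\cdot)$ and $T_j(\cdot)$ of the same points.
[Figure: example graph on transformations T1 to T6 with edge weights 1, 1, 2, 3, 3]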
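A simplified stand-in for this clustering step is sketched below. The slides describe an iterative procedure whose details are not recoverable here, so this sketch swaps in something plainer: it links two transformations whenever the RMSD between the images of a common point set is below a threshold, then reports connected components using union-find. A transformation is assumed to be a pair `(rotation, translation)` with a 3x3 rotation matrix; `dist`, `cluster`, and `threshold` are illustrative names.

```python
import math

def apply(transform, p):
    """Apply a rigid transformation (3x3 rotation, translation) to a point."""
    rot, trans = transform
    return tuple(sum(rot[r][c] * p[c] for c in range(3)) + trans[r]
                 for r in range(3))

def dist(t_i, t_j, points):
    """RMSD between the images of `points` under two transformations."""
    total = 0.0
    for p in points:
        a, b = apply(t_i, p), apply(t_j, p)
        total += sum((x - y) ** 2 for x, y in zip(a, b))
    return math.sqrt(total / len(points))

def cluster(transforms, points, threshold):
    """Group transformation indices into connected components."""
    parent = list(range(len(transforms)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(transforms)):
        for j in range(i + 1, len(transforms)):
            if dist(transforms[i], transforms[j], points) < threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(transforms)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Measuring the distance on transformed points rather than on rotation and translation parameters keeps a single, physically meaningful unit for the threshold.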
Step 5: global extension & refinement
For each match:
Apply its transformation.
Find corresponding atoms that lie close enough to each other after the superposition.
Use a least-squares fitting transformation to refine the match.
Iterate until the RMSD converges.
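The correspondence part of this loop can be sketched as below. It is a hedged illustration only: the atoms of the second molecule are assumed to be already transformed, pairing is done greedily by nearest neighbor within a distance `cutoff` (the slides do not say how correspondences are chosen), and the least-squares refit itself (for example a Kabsch-style superposition) is omitted. The names `match_atoms` and `rmsd` are hypothetical.

```python
import math

def match_atoms(ref_atoms, moved_atoms, cutoff):
    """Greedily pair each reference atom with its nearest unused atom."""
    pairs, used = [], set()
    for i, a in enumerate(ref_atoms):
        best, best_d = None, cutoff
        for j, b in enumerate(moved_atoms):
            if j in used:
                continue
            d = math.dist(a, b)
            if d <= best_d:
                best, best_d = j, d
        if best is not None:
            used.add(best)
            pairs.append((i, best))
    return pairs

def rmsd(ref_atoms, moved_atoms, pairs):
    """Root-mean-square deviation over the matched atom pairs."""
    total = sum(math.dist(ref_atoms[i], moved_atoms[j]) ** 2 for i, j in pairs)
    return math.sqrt(total / len(pairs))
```

In a full refinement loop, the matched pairs would feed a least-squares fit, the new transformation would be applied, and matching would repeat until the RMSD stops changing.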
The Multiple Case
Outline:
SSE assignment & representation
Detection of seed matches
Clustering the seed pairwise matches
Global extension of pairwise matches
Computing multiple matches
Refinement
Selecting high-scoring multiple matches
Finding bases whose configuration appears in a sufficient number of molecules:
All bases are stored in a geometric hash according to their fingerprint.
Bases that reside in the same bin or in adjacent bins are congruent.
For each hash bin:
Retrieve all the bases in the bin and in the adjacent bins.
Insert them into a combinatorial bucket (CB). [Figure: CB with columns for Protein i, Protein j, Protein k, Protein r, Protein s]
Construct pairwise seed matches. The reference protein is the one with the smaller index.
Cluster the pairwise matches.
Globally extend the pairwise matches.
Recursively construct multiple alignments:
[Figure: recursion over the combinatorial bucket (CB) columns for Protein i, Protein j, Protein k, Protein r, Protein s]
Refinement
Selecting high-scoring multiple matches
The score of a multiple match with n proteins and k atoms is given by $\text{score} = n \cdot k$.
Example: n = 3, k = 4, score = 12.
Experimental Results
MASS vs. MUSTA
Partial Solutions
All-alpha Class
The common core of ten proteins that belong to 4 different folds of the all-alpha class.
TIM-barrel Fold
The common core of 6 out of 7 proteins taken from different superfamilies of the TIM-barrel fold.
Calcium Binding
The common core of 6 proteins that belong to 3 different families of the EF hand-like superfamily.
An alignment of four structures from different species of the lipase family.
Two of the conformations are open and two are closed.