Fast Approximate Database Searching
of Polypeptide Structures
Hanjo Taeubig
Arno Buchner
Jan Griebsch
Efficient Algorithms Group
Prof. Ernst W. Mayr
Technical University of Munich
German Conference on Bioinformatics
October 4th, 2004
Structure
I. motivation & problem definition
II. suffix trees
III. polypeptide angles suffix trees
IV. application & future work
www14.informatik.tu-münchen.de/PAST {taeubig|buchner|griebsch}@in.tum.de
I. Motivation
• the function of a protein is largely determined by its
structure and geometric shape
• How can similar structures be found in a database?
• related work
– DALI, VAST, CE
– TopScan, ProtDex2
• existing methods are mostly based on the principle
filter heuristics + exhaustive search/pairwise
comparison and scale at least linearly with the database size
I. Motivation
• PDB – Protein Data Bank
– ca. 3.5GB compressed, 14GB decompressed
– > 23,000 entries
– 90% proteins, 5% nucleotide sequences, 4% nucleotide-protein complexes
– 85% X-ray crystallography, 15% NMR
• protein structure databases grow almost
exponentially
• search methods with time complexity at most O(n)
required
I. Problem Definition
• search for a given polypeptide structure in a protein
database
• search for the longest common substructure in the
database
• identify frequent substructures (motifs) in the
database
II. Suffix Trees
Tries
• tree with a root node
• every edge is labeled
with a letter
• labels of all edges to
the child nodes of one
node are pairwise
distinct
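The three properties above can be sketched in a few lines. This is only an illustration, not the authors' implementation; the names `TrieNode`, `trie_insert` and `trie_contains_prefix` are hypothetical. Using a dictionary per node keeps the edge labels to the children pairwise distinct.

```python
class TrieNode:
    """One trie node; each outgoing edge is labeled with a single letter."""

    def __init__(self):
        # letter -> TrieNode; dictionary keys are unique, so the labels of
        # all edges leaving one node are pairwise distinct
        self.children = {}


def trie_insert(root, word):
    """Insert word letter by letter, reusing edges that already exist."""
    node = root
    for letter in word:
        node = node.children.setdefault(letter, TrieNode())
    return node


def trie_contains_prefix(root, prefix):
    """Follow one edge per letter; fail as soon as an edge is missing."""
    node = root
    for letter in prefix:
        node = node.children.get(letter)
        if node is None:
            return False
    return True


root = TrieNode()
for word in ("abba", "abab", "baab"):
    trie_insert(root, word)
```

The lookup takes one step per letter of the query, regardless of how many words are stored.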
II. Suffix Trees
Suffix Tries
• stores all suffixes of a
string
• the sentinel $ ensures
that every suffix is
represented by a leaf
Suffix tree for the word aaabbb$
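A naive construction for the example word makes the role of the sentinel visible. This is a sketch with hypothetical helper names (`build_suffix_trie`, `count_leaves`) and quadratic running time; it is not the linear-time construction mentioned later in the talk.

```python
def build_suffix_trie(text):
    """Naive suffix trie as nested dictionaries: insert every suffix of text."""
    root = {}
    for start in range(len(text)):
        node = root
        for letter in text[start:]:
            node = node.setdefault(letter, {})
    return root


def count_leaves(node):
    """A leaf is a node without children."""
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())


with_sentinel = build_suffix_trie("aaabbb$")
without_sentinel = build_suffix_trie("aaabbb")
print(count_leaves(with_sentinel))     # 7: one leaf per suffix
print(count_leaves(without_sentinel))  # 4: "b" and "bb" end inside the path of "bbb"
```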
II. Suffix Trees
Compressed Suffix Tries
• collapse linear paths in
the tree
• store only the start and
end index of each edge label
• linear number of inner
nodes
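The compression can be illustrated with a naive builder that inserts one suffix at a time and splits an edge where a mismatch occurs. This is a quadratic-time sketch under the assumption that the text ends with a unique sentinel; `Node`, `build_compressed_suffix_tree` and `count_nodes` are hypothetical names, and this is not the online linear-time construction.

```python
class Node:
    def __init__(self, start=0, end=0):
        # the edge leading to this node is labeled text[start:end];
        # only the two indices are stored, not the label itself
        self.start, self.end = start, end
        self.children = {}  # first letter of the child edge -> Node


def build_compressed_suffix_tree(text):
    """Naive builder; assumes text ends with a unique sentinel such as '$'."""
    root, n = Node(), len(text)
    for i in range(n):
        node, pos = root, i
        while pos < n:
            child = node.children.get(text[pos])
            if child is None:
                node.children[text[pos]] = Node(pos, n)  # new leaf
                break
            k = child.start
            while k < child.end and pos < n and text[k] == text[pos]:
                k += 1
                pos += 1
            if k == child.end:  # edge fully matched, continue below it
                node = child
                continue
            if pos == n:  # suffix ends inside an edge (only without sentinel)
                break
            mid = Node(child.start, k)  # split the edge at the mismatch
            node.children[text[child.start]] = mid
            child.start = k
            mid.children[text[k]] = child
            mid.children[text[pos]] = Node(pos, n)
            break
    return root


def count_nodes(node):
    """Return (inner nodes including the root, leaves) below node."""
    if not node.children:
        return 0, 1
    inner, leaves = 1, 0
    for child in node.children.values():
        child_inner, child_leaves = count_nodes(child)
        inner += child_inner
        leaves += child_leaves
    return inner, leaves


tree = build_compressed_suffix_tree("aaabbb$")
print(count_nodes(tree))  # (5, 7)
```

Every inner node of the compressed tree has at least two children, so a tree with n leaves cannot have more than n inner nodes.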
II. Suffix Trees
Further Extensions
• generalized suffix trees
– stores suffixes of multiple strings in one tree
• online linear time construction
Time Complexity
• Finding an occurrence of the search pattern takes time linear in
the length m of the pattern, independent of the size of the
searched database
• Finding all k occurrences of a pattern takes time proportional to
m+k
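A small generalized suffix trie shows where the two bounds come from. This is a sketch with hypothetical names (`build_generalized_suffix_trie`, `find_all`); it uses an uncompressed trie for brevity, so the collection step visits more nodes than a compressed suffix tree would.

```python
def build_generalized_suffix_trie(strings):
    """Insert all suffixes of all strings; remember where each suffix starts."""
    root = {"children": {}, "hits": []}
    for string_id, s in enumerate(strings):
        for start in range(len(s)):
            node = root
            for letter in s[start:]:
                node = node["children"].setdefault(
                    letter, {"children": {}, "hits": []})
            node["hits"].append((string_id, start))  # suffix s[start:] ends here
    return root


def find_all(root, pattern):
    """Return (string id, start position) for every occurrence of pattern."""
    node = root
    for letter in pattern:  # m steps, independent of the database size
        node = node["children"].get(letter)
        if node is None:
            return []
    found, stack = [], [node]
    while stack:  # every suffix below this node starts with pattern
        current = stack.pop()
        found.extend(current["hits"])
        stack.extend(current["children"].values())
    return sorted(found)


database = build_generalized_suffix_trie(["abba", "babb"])
print(find_all(database, "bb"))  # [(0, 1), (1, 2)]
```

The descent costs one step per pattern letter. In a compressed tree the subtree below the reached node has a number of nodes proportional to its k leaves, which gives the m+k bound.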
III. Polypeptide Angles Suffix Tree
Idea
I. encode the geometry of the database proteins in a
translation and rotation invariant linear description
(“structural text”)
– torsion angle encoding of the protein backbone
II. adapt efficient text mining methods to the error
tolerant substructure searching problem
– generalized suffix trees with fault tolerant search
strategies
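A torsion (dihedral) angle depends only on the relative positions of four consecutive atoms, which is why the encoding is invariant under translation and rotation. The following is a minimal sketch of the standard computation, not code from the authors; the function name `torsion_angle` and the toy coordinates are assumptions. For a protein backbone, φ is commonly taken over the atoms C(i-1), N, CA, C and ψ over N, CA, C, N(i+1).

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def torsion_angle(p0, p1, p2, p3):
    """Signed dihedral angle in degrees around the axis p1 -> p2."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)
    # project b0 and b2 onto the plane perpendicular to the axis
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    return math.degrees(math.atan2(_dot(_cross(b1, v), w), _dot(v, w)))


# toy coordinates (hypothetical, not taken from a PDB entry)
atoms = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 1.0)]
shifted = [(x + 5.0, y - 3.0, z + 2.0) for x, y, z in atoms]  # translation
rotated = [(-y, x, z) for x, y, z in atoms]                   # 90° about z
print(torsion_angle(*atoms))
```

Translating or rotating all four atoms together leaves the returned angle unchanged.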
III. Polypeptide Angles Suffix Tree
1a1f
↓ torsion angles
… (22,93), (112, 4) …
↓ discretization
…abba…
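The discretization step can be sketched as a mapping from each angle to the letter of the interval it falls into. The function names, the use of lowercase letters as alphabet, and the choice of one letter per angle are assumptions made for illustration. An interval width of 90° happens to reproduce the letters of the slide example; the application later in the talk uses 24 intervals of 15°.

```python
import string


def discretize(angles, interval_width=15.0):
    """Map each angle (in degrees) to the letter of its interval."""
    n_intervals = int(360 / interval_width)
    letters = string.ascii_lowercase
    if n_intervals > len(letters):
        raise ValueError("too many intervals for this toy alphabet")
    encoded = []
    for angle in angles:
        index = int((angle % 360.0) // interval_width)
        index = min(index, n_intervals - 1)  # guard against rounding at 360
        encoded.append(letters[index])
    return "".join(encoded)


def encode_backbone(angle_pairs, interval_width=15.0):
    """Flatten the angle pairs and emit one letter per angle."""
    flat = [angle for pair in angle_pairs for angle in pair]
    return discretize(flat, interval_width)


print(encode_backbone([(22, 93), (112, 4)], interval_width=90.0))  # abba
print(encode_backbone([(22, 93), (112, 4)]))                       # bgha
```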
III. Polypeptide Angles Suffix Tree
Fault Tolerant Searching
• accept a “neighborhood
range” of δ intervals to the left
and right of each query interval
• worst case time
complexity: exponential (!)
• average: $O\left(n^{\log_{|\Sigma|}(2\delta+1)}\right)$
figure: branching with δ=1
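The branching search can be sketched on a plain suffix trie: at every pattern position the search follows all edges whose letter lies within the neighborhood range of the query letter. This is an illustration under assumptions, not the authors' implementation: the parameter name `delta`, the helper names, and the cyclic wrap-around of the alphabet (angles are periodic) are choices made here.

```python
def build_suffix_trie(text):
    """Naive suffix trie as nested dictionaries."""
    root = {}
    for start in range(len(text)):
        node = root
        for letter in text[start:]:
            node = node.setdefault(letter, {})
    return root


def tolerant_search(root, pattern, delta, alphabet):
    """Return all substrings whose letters lie within delta intervals of the
    corresponding pattern letters (wrap-around at the alphabet ends assumed)."""
    size = len(alphabet)
    rank = {letter: i for i, letter in enumerate(alphabet)}
    matches = []

    def descend(node, depth, prefix):
        if depth == len(pattern):
            matches.append(prefix)
            return
        center = rank[pattern[depth]]
        # up to 2*delta + 1 branches per pattern position
        accepted = {alphabet[(center + offset) % size]
                    for offset in range(-delta, delta + 1)}
        for letter in sorted(accepted):
            child = node.get(letter)
            if child is not None:
                descend(child, depth + 1, prefix + letter)

    descend(root, 0, "")
    return matches


trie = build_suffix_trie("acegbdfh$")
print(tolerant_search(trie, "bd", 0, "abcdefgh"))  # ['bd']
print(tolerant_search(trie, "bd", 1, "abcdefgh"))  # ['ac', 'bd', 'ce']
```

With a range of 0 this is an exact search along a single path. Each additional interval widens the fan-out at every level, which is where the exponential worst case comes from.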
IV. Application
Example
• search for occurrences of the C2H2 zinc finger in the
complete PDB
• discretization: 24 intervals of 15°
• compare with SCOP classification, sequence-based
search, SPASM
IV. Application
Sequences producing significant alignments:                         Score (bits)  E Value
gi|37926551|pdb|1LLM|C  Chain C, Crystal Structure Of A Zif2...     47            6e-07
gi|15988358|pdb|1F2I|G  Chain G, Cocrystal Structure Of Sele...     42            2e-05
gi|3319019|pdb|1A1H|A   Chain A, Qgsr (Zif268 Variant) Zinc F...    42            3e-05
gi|3319013|pdb|1A1F|A   Chain A, Dsnr (Zif268 Variant) Zinc F...    41            3e-05
gi|3319022|pdb|1A1I|A   Chain A, Radr (Zif268 Variant) Zinc F...    41            3e-05
gi|16975178|pdb|1JK1|A  Chain A, Zif268 D20a Mutant Bound To...     41            3e-05
gi|2098365|pdb|1AAY|A   Chain A, Zif268 Zinc Finger-Dna Compl...    41            4e-05
gi|33357855|pdb|1P47|A  Chain A, Crystal Structure Of Tandem...     41            5e-05
gi|443340|pdb|1ZAA|C    Chain C, Zif268 Immediate Early Gene (...   40            8e-05
gi|15988466|pdb|1G2F|C  Chain C, Structure Of A Cys2his2 Zin...     33            0.015
gi|15988460|pdb|1G2D|C  Chain C, Structure Of A Cys2his2 Zin...     32            0.025
gi|1941952|pdb|1MEY|C   Chain C, Crystal Structure Of A Desig...    28            0.44
gi|40889293|pdb|1P7A|A  Chain A, Solution Stucture Of The Th...     27            0.64
gi|3318788|pdb|2ADR|    Adr1 Dna-Binding Domain From Saccharo...    27            0.78
gi|2094895|pdb|1SP1|    Nmr Structure Of A Zinc Finger Domain...    26            1.4
gi|1420993|pdb|1ARD|    Yeast Transcription Factor Adr1 (Resi...    23            9.7
. . .
IV. Application
Search range in 15°
±0 ±1 ±2 ±3 ±4 ±5 ±6 ±7
True positives
11 12 64 05 56 16 26 46
1a1f False positives
13333
Time [s]
< 1 < 11
23458
True positives
113
369
14 15
1mfs False positives
49
Time [s]
< 1 < 1 < 1 < 111236
True positives
11 78 7 120 132 135 138 144
1a3n False positives
Time [s]
< 1 < 11
23468
±8
5
254
12
18
9
146
0
12
Table 1: The number of true and false positives for the structure searches.
Figure: Searching PDB entry 1a1f with different neighborhood settings
IV. Application
Minimum RMSD superposition: 1a1f vs. 1f2i
1a1f vs. 6 other true positives
“False” positives: 1a1f vs. 1vl2
IV. Application
Run Time
Preprocessing
• decompression of the packed PDB files: 25 min
• parsing of the PDB files and calculating the torsion angles: 55 min
• discretization and building the PAST: 2 min
Searching
• searching a structure: seconds
Summary
• suffix-tree-based protein (sub-)structure database
search method
• preprocessing required
• fast search
• does not rely on heuristics or SSE (secondary structure element) recognition
• adaptable sensitivity and error models
• until gapped matching is modeled: applicable for
shorter peptide chains, motifs
• surprisingly simple
Future Work
• model matching with insertions & deletions
• consensus search pattern
• implementation and practical testing of further error
models
•  and  angle encoding
• identification of new motifs
• testing, testing, testing: evaluating the method
further with real life problems from pharmaceutical
researchers, biologists, patent offices, …
Acknowledgements
• Hanjo Taeubig, Arno Buchner
• Volker Heun, Moritz Maass
• BFAM/BMBF
• ALTANA