Latest developments and plans

BALBES (Current working name)
A. Vagin, F. Long, J. Foadi, A. Lebedev
G. Murshudov
Chemistry Department, University of York
Outline
• Introduction
• Database
• System manager
• Scientific programs
• Calibrating the System
• An Example
• Release and Development Plan
Introduction
• The number of entries in the Protein Data Bank (PDB) increases every year, which has many implications for macromolecular crystallography. One challenge is how to use these entries efficiently when developing structure solution software.
• Analysis of the PDB shows that around 67% of the structures deposited this year are reported as solved by molecular replacement.
• With better algorithms and better organisation of the data bank, this proportion is expected to become substantially higher. Our system contains three main components: (1) a reorganised database, (2) a manager written in Python that makes decisions and (3) scientific programs such as MOLREP and REFMAC.
Database: Reorganisation of PDB
• All entries in the PDB have been analysed according to their homology, and only a non-redundant set of structures was stored.
• A hierarchical database was organised according to sequence identity.
• If domains are present, information about them was stored.
• Multimers of a structure
• Fragments of various lengths (under way)
• Intensity curves for various types of macromolecules (later)
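The idea of keeping only a non-redundant set can be illustrated with a greedy clustering by sequence similarity. This is a minimal sketch, not the procedure used to build the actual database: the 0.9 threshold, the function names and the toy sequences are all assumptions, and `difflib.SequenceMatcher` is only a crude stand-in for a proper alignment-based sequence identity.

```python
from difflib import SequenceMatcher


def identity(a: str, b: str) -> float:
    """Crude similarity score in [0, 1]; a stand-in for alignment-based identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def non_redundant(entries: dict[str, str], threshold: float = 0.9) -> dict[str, list[str]]:
    """Greedy clustering: each entry joins the first representative it resembles.

    Returns a mapping from representative name to the names in its cluster.
    The threshold is an illustrative assumption, not a value from the slides.
    """
    clusters: dict[str, list[str]] = {}
    for name, seq in entries.items():
        for rep in clusters:
            if identity(seq, entries[rep]) >= threshold:
                clusters[rep].append(name)
                break
        else:
            # No existing representative is similar enough: start a new cluster.
            clusters[name] = [name]
    return clusters


# Toy sequences, invented for illustration only.
entries = {
    "A": "MKTAYIAKQRQISFVKSHFSRQ",
    "B": "MKTAYIAKQRQISFVKSHFSRQ",  # identical to A, so redundant
    "C": "GSHMLEDPVQWERTYIPLKNMA",  # unrelated
}
clusters = non_redundant(entries)
representatives = list(clusters)
```

Only the representatives would be stored as search models, while the cluster membership preserves the link back to the redundant entries.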
Database (continued)
A database of portable size has been created, which enables:
• fast search for similar structures (less than 10 seconds on a typical Mac G5 processor for most test cases so far)
• all actions to be performed locally (independent of the internet)
• provision of the required information about similar structures (domains, tertiary structures)
Manager System
[Flow diagram: user's input files (cif, mtz, pdb, seq) → input data check → general manager → output file (pdb). The general manager is connected to the database, template model generation, molecular replacement, refinement and different protocols (e.g. multiple domain processing).]
System Manager
It is written in Python and relies on XML files for information exchange:
1. Data
• Twinning
• Pseudotranslation
• Resolution for molecular replacement
• Completeness and other properties
2. Sequence
• Finds template structures with their domain and multimeric organisations
• Finds the number of molecules in the asymmetric unit
• “Corrects” template molecules using sequence alignment
3. Protocols
• Runs various protocols with molecular replacement and refinement and makes decisions accordingly
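Exchanging information through XML files could look roughly like the sketch below, which uses only Python's standard library. The tag names (`data_check`, `twinning`, and so on) and the values are hypothetical; the slides do not describe the actual schema.

```python
import xml.etree.ElementTree as ET


def write_report(checks: dict[str, str]) -> str:
    """Serialise data-check results as a small XML document (hypothetical schema)."""
    root = ET.Element("data_check")
    for key, value in checks.items():
        ET.SubElement(root, key).text = value
    return ET.tostring(root, encoding="unicode")


def read_report(xml_text: str) -> dict[str, str]:
    """Parse the XML back into a dictionary, as a manager script might do."""
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "") for child in root}


# Invented example values: one program writes the report, the manager reads it.
report = write_report(
    {"twinning": "no", "pseudotranslation": "yes", "mr_resolution": "3.0"}
)
parsed = read_report(report)

# The manager can then branch on the parsed values.
needs_special_protocol = parsed["pseudotranslation"] == "yes"
```

Keeping the exchange format as plain XML means each scientific program only has to write a report, and all decision logic stays in the manager.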
Scientific programs
MOLREP - molecular replacement
Simple molecular replacement, phased rotation and translation functions, spherically averaged phased translation function, dyad search, search with one model fixed, etc.
REFMAC
Maximum likelihood refinement, phased refinement, rigid body refinement, extensive dictionary, map coefficients, etc.
SFCHECK
Twinning tests, pseudotranslation, optical resolution, optimal resolution for molecular replacement, analysis of coordinates against electron density, etc.
Auxiliary programs:
Alignment, search in the DB, analysis of sequence and data to suggest the number of expected monomers, removal of bits of structure from coordinates according to their fit to the electron density, semiautomatic domain definition, etc.
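The slides do not say how the expected number of monomers is derived. One common approach is the Matthews coefficient, V_M = V_cell / (Z × n × M), with the solvent fraction approximated as 1 − 1.23 / V_M. The sketch below assumes that approach; the accepted V_M window of 1.7 to 3.5 Å³/Da, the function names and the example crystal are illustrative assumptions.

```python
def matthews(cell_volume: float, n_symops: int, mol_weight: float, n_mol: int) -> float:
    """Matthews coefficient in cubic angstroms per dalton for n_mol copies per asymmetric unit."""
    return cell_volume / (n_symops * n_mol * mol_weight)


def solvent_fraction(vm: float) -> float:
    """Approximate solvent fraction from the Matthews coefficient."""
    return 1.0 - 1.23 / vm


def plausible_copies(cell_volume: float, n_symops: int, mol_weight: float,
                     vm_range: tuple[float, float] = (1.7, 3.5),
                     max_copies: int = 12) -> list[int]:
    """Copy numbers whose Matthews coefficient falls inside the assumed window."""
    low, high = vm_range
    return [
        n for n in range(1, max_copies + 1)
        if low <= matthews(cell_volume, n_symops, mol_weight, n) <= high
    ]


# Invented example: 400,000 cubic angstrom cell, 4 symmetry operators, 20 kDa monomer.
copies = plausible_copies(400_000.0, 4, 20_000.0)
vm_two = matthews(400_000.0, 4, 20_000.0, 2)
solvent_two = solvent_fraction(vm_two)
```

In this example only two copies per asymmetric unit give a coefficient inside the window, so a search for two monomers would be tried first.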
Calibrating the System
Step 1: Making the database
More than 30,000 structures had been deposited in the PDB up to the end of 2004, but only ~10,000 were non-redundant. These 10,000 were used to construct our database of known structures.
Step 2: Testing the system
~1000 structures were deposited between January and May 2005. We tried to solve all of these with our automated approach. The success rate was ~75% with our current version. This is actually higher than the proportion reported as solved using MR!
Overall test results
Method   Number of cases   Successful cases   Rate (%)
ALL      1027              777                75.6
MR       695               609                87.6
SAD      80                23                 28.8
MAD      117               40                 34.1
SIR      10                5                  50
MIR      23                9                  39.0
OTHER    102               89                 87.9
Reported in PDB (pie chart): MR 67.7%, MAD 11.4%, SAD 7.8%, MIR 2.2%, SIR 1%, Other 9.9%
Test Case Statistics
Note that not all structures that were used as a
search model are present in our DB
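The shares in the "Reported in PDB" chart can be cross-checked against the case counts in the overall results table. This is a small sketch, assuming the chart shows each method's share of the 1027 test structures; the variable and function names are illustrative.

```python
# Case counts per phasing method, copied from the overall results table.
cases = {"MR": 695, "SAD": 80, "MAD": 117, "SIR": 10, "MIR": 23, "OTHER": 102}

total = sum(cases.values())


def share(count: int, total: int) -> float:
    """Percentage of the test set, rounded to one decimal place."""
    return round(100.0 * count / total, 1)


shares = {method: share(n, total) for method, n in cases.items()}

# Print the methods from most to least common.
for method, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{method:6s}{pct:5.1f}%")
```

The per-method counts add up to the 1027 cases listed in the ALL row, and the MR share is the roughly 67% quoted in the introduction.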
Schematic view of the success rate of our system
[Diagram: all structures 100%; reported to be solved by MR 67%; solved automatically by our system 75%]
Progress to date
We are analysing all failed cases and have already
significantly enhanced the system as a result. We have
developed several new techniques by carefully
analysing these results.
Success is great for funding!
Failure is great for future developments!
Example: Addition of domains
[Flowchart]
1. Search with the whole molecule. Is it a solution? Yes: refine and exit. No: go to step 2.
2. Are there domains? No: other protocols. Yes: go to step 3.
3. Run MR for each domain and find the best. Is it a solution? No: other protocols. Yes: refine and produce a map.
4. Is the solution complete? Yes: refine and exit. No: mask out the found domain(s) and use SPTF, PRF, PTF to find missing domains.
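The decision logic of the domain-addition protocol can be sketched as plain control flow. This is an illustrative simplification, not the manager's actual code: `try_mr` is a hypothetical stand-in for a molecular replacement run plus solution check, the first domain that works is kept instead of the best one, and the masked-map searches (SPTF, PRF, PTF) are reduced to a call that knows which domains are already placed.

```python
from typing import Callable, Sequence


def solve(whole: str,
          domains: Sequence[str],
          try_mr: Callable[[str, frozenset], bool]) -> str:
    """Return a label for the branch of the protocol that was taken.

    try_mr(model, placed) is hypothetical: it reports whether the model can be
    located, given the set of domains already found and masked out.
    """
    if try_mr(whole, frozenset()):
        return "refine and exit (whole molecule)"
    if not domains:
        return "other protocols"

    placed: set[str] = set()
    # Plain MR for each domain; here the first success is kept for simplicity.
    for d in domains:
        if try_mr(d, frozenset()):
            placed.add(d)
            break
    if not placed:
        return "other protocols"

    # Look for the missing domains with the found ones masked out,
    # repeating until everything is placed or no progress is made.
    progress = True
    while progress and len(placed) < len(domains):
        progress = False
        for d in domains:
            if d not in placed and try_mr(d, frozenset(placed)):
                placed.add(d)
                progress = True

    if len(placed) == len(domains):
        return "refine and exit (domains)"
    return "other protocols"


def toy_mr(model: str, placed: frozenset) -> bool:
    """Toy behaviour: the large domain is found alone, the small one only after it."""
    if model == "large":
        return True
    if model == "small":
        return "large" in placed
    return False


result = solve("whole", ["small", "large"], toy_mr)
```

The toy case mimics a structure where the whole molecule cannot be placed but the domains can be found one after another.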
Example: Domain motions - 1tj3
Finding the whole molecule was problematic. Finding the large domain, refining it and then using SPTF/PT/TF with the masked map was straightforward.
Conclusions
1. A database is an essential ingredient of efficient automation
2. With relatively simple protocols it will be possible to solve more than 80% of structures automatically
3. The interplay of different protocols is very promising
4. A huge number of tests helps to prioritise developments and generate ideas
Development Plans
Development currently under way and in the immediate future:
• Update database by adding entries based on PDB files deposited in
2005 (Thanks Eugene for PISA, which we use for multimer analysis)
• Add multichain domain definitions
• Test the system against PDB files deposited in 2006
• Target release date: May-June 2006
• Combine with some protocols from experimental phasing and automatic
model building (Foadi, Cowtan)
Future:
• Combine with automatic model building
• Make decisions during refinement about twinning and other properties
• Pass information about search templates to refinement
• Combine with experimental phasing
• Regular updates
Acknowledgements
All CCP4 and YSBL people
Wellcome Trust, BBSRC, EU BIOXHIT, NIH for support