Abstracts of main servers in
CASP11
presented by Chao Wang
Official Ranking by SUM Z-score (>-2.0)
• We focus on
– Zhang-Server
– QUARK
– BAKER-ROSETTASERVER
– RaptorX
– MULTICOM-CLUSTER
– Pcons
• We don’t focus on
– HHpred
– ZHOU-SPARKS-X
– MUFOLD-Server
• No HHpred and ZHOU-SPARKS-X abstracts in proceedings of
CASP11
• MUFOLD:
• formulates the structure prediction problem as a graph realization
problem
• employs the multi-dimensional scaling (MDS) technique
– Cut the sequence into different segments and generate distance
matrices of the blocks.
– Cluster distance matrices on each block.
– Recombine these cluster centers of each block to generate new
distance matrices and filter out some poor distance matrices by a
set of criteria such as the triangle inequality.
– Generate new structures according to the sampled distance
matrices.
– Use a consensus method to select best candidates from the new
structures.
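
The MDS step and the triangle filter above can be sketched as follows. This is a minimal illustration using classical (Torgerson) MDS in NumPy, not MUFOLD's actual implementation; the function names, the tolerance, and the choice of classical MDS rather than another MDS variant are all assumptions made for the example.

```python
import numpy as np

def satisfies_triangle_inequality(dist, tol=1e-9):
    """Return True if dist[i, k] <= dist[i, j] + dist[j, k] for all i, j, k."""
    dist = np.asarray(dist, dtype=float)
    # via[i, j, k] = dist[i, j] + dist[j, k]
    via = dist[:, :, None] + dist[None, :, :]
    # The direct distance must not exceed the shortest detour through any j.
    return bool(np.all(dist <= via.min(axis=1) + tol))

def classical_mds(dist, dim=3):
    """Embed points in `dim` dimensions from a matrix of pairwise distances."""
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    # Double-centre the squared distances to obtain a Gram matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    # Keep the `dim` largest eigenvalues; clip small negatives caused by noise.
    top = np.argsort(eigvals)[::-1][:dim]
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale
```

A sampled distance matrix that fails the triangle check would be discarded before embedding. The recovered coordinates are defined only up to rotation, translation, and reflection.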
Zhang-Server
• based on the I-TASSER pipeline
• In addition to the classic I-TASSER pipeline, several approaches
were recently developed and integrated into I-TASSER to enhance
its ability of structure modeling for distant-homology targets.
• First, the top models generated by the QUARK ab initio folding were
merged into the threading template pool, which were used as the
starting conformations of I-TASSER simulations.
• Second, since the hard targets generally lack global templates, the
sequences were broken into segments of 2-4 consecutive
secondary structure elements which were then threaded through the
PDB by the segmental threading tool SEGMER to identify
super-secondary structure motifs.
• Third, SVM-SEQ and SPcon (Shen et al, in preparation) are used to
generate residue contact maps.
• For multiple-domain proteins, ThreaDom was used to predict the
domain boundary and linker regions.
QUARK
• QUARK has been developed for ab initio protein structure prediction.
• It starts with the collection of continuously distributed structural
fragments with 1-20 residues from unrelated proteins in the PDB.
Full-length structure models are then assembled from the fragments
by replica-exchanged Monte Carlo (REMC) simulations, which are
guided by a composite physics- and knowledge-based force field
that contains a variety of local structure features derived from
sequence.
• For the proteins that are deemed by LOMETS as Easy or
Medium targets, i.e. there is at least one structure template with a
Z-score above the confidence cutoff, a new template-based QUARK
pipeline is exploited to generate the structure prediction. In this
pipeline, each replica in the REMC simulation starts from different
top LOMETS templates.
• The weights of the QUARK force field have been reparameterized in
this pipeline to enhance the knowledge-based components derived
from threading alignments.
• multiple-domain proteins: ThreaDom
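
The replica-exchange idea behind REMC can be sketched with the standard Metropolis swap criterion between neighbouring temperatures. This is a generic textbook illustration, not QUARK's code; the function names are hypothetical, and temperatures are assumed to be in units where the Boltzmann constant is 1.

```python
import math
import random

def swap_probability(energy_i, temp_i, energy_j, temp_j):
    """Metropolis probability of exchanging the conformations of two replicas."""
    delta = (1.0 / temp_i - 1.0 / temp_j) * (energy_i - energy_j)
    return 1.0 if delta >= 0 else math.exp(delta)

def attempt_swaps(energies, temps, rng=random):
    """Try one sweep of neighbour exchanges.

    energies[r] is the current energy of conformation r, and conformation r
    starts at temperature temps[r]. Returns a list `order` where order[k] is
    the index of the conformation that ends up at temperature temps[k].
    """
    order = list(range(len(temps)))
    for k in range(len(temps) - 1):
        a, b = order[k], order[k + 1]
        p = swap_probability(energies[a], temps[k], energies[b], temps[k + 1])
        if rng.random() < p:
            order[k], order[k + 1] = b, a
    return order
```

A swap that moves the lower-energy conformation to the colder replica is always accepted. The reverse swap is accepted with an exponentially small probability, which lets hot replicas explore while cold replicas refine.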
BAKER-ROSETTASERVER
• Robetta is a fully automated structure prediction server that consists
of three main steps: domain boundary identification, structure
modeling, and domain assembly.
• Domain boundary identification: Domain boundaries are predicted
by identifying PDB templates with optimal sequence similarity and
structural coverage to the target through an iterative process. For
each iteration, we use locally installed programs, HHSearch, Sparks,
and Raptor, to identify templates and generate alignments. The
target sequence is threaded onto the template structures to
generate partial-threaded models, which are then clustered to
identify distinct topologies that are ranked based on the likelihood of
the alignments. Regions of the target sequence that are not covered
by the partial-threads or are not similar in structure within the top
ranked cluster are passed on to the next search iteration.
• Structure modeling: For each predicted domain, models are
generated using our comparative modeling protocol, RosettaCM,
which recombines structural elements from the clustered
partial-threads and models missing segments using a combination of
fragment insertion and mixed torsion-Cartesian space minimization.
• For difficult domains, models are also generated using the Rosetta
fragment assembly methodology (Rosetta Abinitio), and if GREMLIN
contacts are predicted, they are used as restraints for sampling and
refinement.
• All models are refined using a relax protocol that minimizes the
Rosetta full-atom energy in torsion and Cartesian space to allow
bond angle flexibility. Final models are selected by clustering the
best scoring 100 models from each topologically distinct alignment
cluster, and then averaging the models within each cluster and
refining the final averaged models.
RaptorX
• RaptorX is a template-based protein modeling server.
• Not finished.
• To significantly advance homology detection and fold recognition,
we have developed a Markov Random Fields (MRFs) modeling of
an MSA (multiple sequence alignment). MRFs can model long-range
residue interactions and thus encode information for the global 3D
structure of a protein family.
• Each node is associated with a function describing position-specific
amino acid mutation pattern. Similarly, each edge is associated with
a function describing correlated mutation statistics between two
columns.
• To score the similarity of two MRFs, we use both node and edge
alignment potentials, which measure the node (i.e., residue)
similarity and edge (i.e., interaction pattern) similarity, respectively.
To derive the node alignment potential, we use a set of 1400 protein
pairs as the training data, which covers 458 SCOP folds. The
reference alignment for a protein pair is generated by a structure
alignment tool DeepAlign. The edge alignment potential is derived
from a software package EPAD, which takes as input PSSM and
residue interaction strength and outputs the inter-residue distance
probability distribution. The interaction strength of two residues can
be calculated in different ways. The current implementation uses the
mutual information (MI) matrix.
• It is computationally challenging to optimize the MRFalign scoring
function due to the edge alignment potential. We formulate this
problem as an integer programming problem and then develop an
ADMM (Alternating Direction Method of Multipliers) algorithm to
solve it efficiently to a suboptimal solution.
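
The mutual information used as residue interaction strength can be illustrated for a single pair of MSA columns. This is a plain-frequency sketch only: it assumes equal-length aligned sequences and uses no pseudocounts, sequence weighting, or gap handling, any of which a production method may apply. The function name is hypothetical.

```python
import math
from collections import Counter

def column_mutual_information(msa, i, j):
    """Mutual information (in nats) between columns i and j of an alignment.

    `msa` is a list of equal-length aligned sequences (strings).
    """
    n = len(msa)
    pair_counts = Counter((seq[i], seq[j]) for seq in msa)
    counts_i = Counter(seq[i] for seq in msa)
    counts_j = Counter(seq[j] for seq in msa)
    mi = 0.0
    for (a, b), count in pair_counts.items():
        p_ab = count / n
        p_a = counts_i[a] / n
        p_b = counts_j[b] / n
        mi += p_ab * math.log(p_ab / (p_a * p_b))
    return mi
```

Two columns that always mutate together give a high value, while independent columns give a value near zero. Computing this for every column pair yields the MI matrix.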
MULTICOM-CLUSTER
• The method was based on a conformation ensemble approach to
protein tertiary structure prediction.
• The basic conformation ensemble protocol in MULTICOM-CLUSTER
generated an ensemble of protein models for each target
using multiple templates identified by more than a dozen
sequence/profile comparison tools (e.g., BLAST, PSI-BLAST,
HHSearch, SAM, HMMer, MUSTER, RaptorX), combination of
alternative target-template alignments, and complementary model
generation tools.
• An ensemble of hundreds (e.g., 150-250) of models generally
approximated the near native conformations of a relatively easy
target well if one or more homologous templates were identified for
the target. For relatively hard targets for which no good template
was found, additional tens of models selected from hundreds of
template-free models generated by a fragment assembly based tool
(i.e. Rosetta) were added into the ensemble in order to increase the
diversity of the model pool.
• The conformations of all the chunks will be combined into a
full-length model using Modeller.
• The ensemble of models of a target was evaluated by several
different methods, including the single-model absolute model quality
assessment tool ModelEvaluator, the fully pairwise model
comparison tool APOLLO, a protein energy calculation tool
SELECTpro, and the frequency of the templates (i.e., number of
times that a template was chosen by different sequence/profile
comparison tools) used to generate models if any. From the
ensemble, MULTICOM-CLUSTER selected the top five models ranked
mostly by the APOLLO scores supplemented by other information.
• Trick (Chao comments) Furthermore, several exception handling
strategies were applied to remove seemingly bad models ranked in the
top five models, including replacing template-based models with
very low template coverage, filling in terminal regions of models not
covered by any template by model combination, replacing duplicate
models within the top five models, and removing models based on false
positives of BLAST-based search.
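
Pairwise (consensus) model ranking of the kind described above can be sketched as scoring each model by its average similarity to all the others. This is a generic illustration, not APOLLO itself; the similarity matrix is assumed to be precomputed with some structural measure (for example GDT-TS or TM-score), and the function names are hypothetical.

```python
import numpy as np

def consensus_scores(similarity):
    """Average similarity of each model to every other model in the pool."""
    sim = np.asarray(similarity, dtype=float)
    n = sim.shape[0]
    # Drop the self-similarity on the diagonal before averaging.
    return (sim.sum(axis=1) - np.diag(sim)) / (n - 1)

def select_top(similarity, k=5):
    """Indices of the k models with the highest consensus score."""
    scores = consensus_scores(similarity)
    ranked = np.argsort(-scores, kind="stable")
    return [int(i) for i in ranked[:k]]
```

Models that resemble many others in the pool rank highest. That works well when the pool is dominated by near-native conformations, and less well for hard targets where most models share the same error.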
Pcons
• PconsFold is a fully automated pipeline for ab-initio
protein structure prediction based on evolutionary
information.
• PconsFold is based on PconsC contact prediction and
uses the Rosetta folding protocol.
• PconsC2 is a novel method that uses a deep learning
approach to identify protein-like contact patterns to
improve contact predictions.
Advantages
• RaptorX: alignment accuracy, statistical
model and learning methods
• MULTICOM: model selection, template
selection
• Rosetta: assembly
• I-TASSER & QUARK: You see
We need to improve …
• Clean vs. Dirty
– Discovery vs. Performance
– For understanding vs. For CASP
• Clean
– Single secondary structure element
– Interaction of pair SSEs
– Topology
• Domain parsing: avoid deriving it directly from threading
alignments
• Template selection: avoid relying only on p-values or
raw threading scores
• Model generation: MODELLER is really NOT reliable
when the gap length is over 15(?)
• Model selection: avoid selecting based only on the
dDFIRE score
• to develop:
– loop modeling tools
– template selection tools (SVM?)
– consensus based selection strategy