Transcript ccsws2 9830
Jack Snoeyink & Matt O’Meara
Dept. Computer Science
UNC Chapel Hill
Collaborators
Brian Kuhlman, UNC Biochem
Many other members of the RosettaCommons
Richardson lab, Duke Biochem
Funding
NIH
NSF
Scientific Models, esp. for Structural Molecular Biology
Focus on statistical/computational models with a sample source, observable local features, chosen functional form, fit parameters, & visualization/testing methods
Models are the lens through which we view data
Models are predominantly geometric
Computational models are complex
Models evolve, so testing becomes crucial
Capture assumptions and data used to build models to:
Visualize for making design decisions while building
Fit parameters to ensure best performance
Record as scientific benchmarks
Case Study: Rosetta protein structure prediction software [B]
Physical and Conceptual models
Kept simple to aid understanding
Statistical and Computational models
Evolve by combining simple models
Even when complex, they can still be effective at validation (MolProbity) or prediction (Rosetta)
Spiral development, much like software
Discover problematic features in some data
Create an energy function to adjust them
Fit parameters to improve results
Check it into the software as a new option
Make it the default option if everyone likes it
Occasionally refactor and rewrite, removing outdated or unused models
But with less support for testing…
Our goal:
Capture data and assumptions from model building for use in model visualization and testing.
Abstraction: A simple component of a complex computational model consists of:
One or more sample sources (e.g., PDB files from natives or decoys) giving
Observable local features (e.g., hydrogen bond distances and angles) having a
Chosen functional form (e.g., energy from distances and angles) that
Depends on fitting parameters (e.g., weights for combining terms)
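A minimal sketch in Python of how such a component fits together, using hypothetical names and a simple harmonic functional form (not Rosetta's actual hydrogen-bond term): a sample source supplies observed A-H distances, a chosen functional form maps a distance to an energy, and its parameters are fit from the data.

```python
import numpy as np

def gather_hbond_distances(pdb_distances):
    """Sample source: A-H distances (Angstroms) already extracted
    from PDB files of natives or decoys."""
    return np.asarray(pdb_distances, dtype=float)

def fit_harmonic(distances):
    """Chosen functional form: a harmonic well E(d) = k * (d - d0)^2.
    The fit parameters (d0, k) are estimated from the observed distribution."""
    d0 = distances.mean()               # ideal distance taken from the data
    k = 1.0 / (2.0 * distances.var())   # stiffness taken from the spread
    return d0, k

def hbond_energy(d, d0, k):
    """Energy from a distance, given the fit parameters."""
    return k * (d - d0) ** 2

# Usage: fit on one sample source, then score a feature from another.
native = gather_hbond_distances([1.85, 1.90, 1.95, 2.00, 1.88])
d0, k = fit_harmonic(native)
print(hbond_energy(2.3, d0, k))
```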
[Pipeline diagram (KMB’03): data set A, data set B, …, data set Z → gather features → SQL query / filter / transform / statistics → ggplot2 spec → plots]
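A sketch of that pipeline in Python, assuming an SQLite database and a hypothetical hbond_features table (the schema and column names here are illustrative, not the real tools'): features gathered from each data set are stored once, then queried, filtered, and transformed before plotting.

```python
import sqlite3

conn = sqlite3.connect("features.db")
conn.execute("""CREATE TABLE IF NOT EXISTS hbond_features (
                  data_set TEXT, struct_id TEXT, AHdist REAL, AHD_angle REAL)""")

def gather_features(data_set, rows):
    """rows: iterable of (struct_id, AHdist, AHD_angle) tuples for one data set."""
    conn.executemany(
        "INSERT INTO hbond_features VALUES (?, ?, ?, ?)",
        [(data_set, *r) for r in rows])
    conn.commit()

def query_distances(data_set, lo=1.45, hi=2.85):
    """SQL query + filter: A-H distances for one data set, restricted to a range."""
    cur = conn.execute(
        "SELECT AHdist FROM hbond_features "
        "WHERE data_set = ? AND AHdist BETWEEN ? AND ?",
        (data_set, lo, hi))
    return [d for (d,) in cur.fetchall()]
```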
Implemented tools
Compare distributions from sample sources
Tufte’s small multiples via ggplot
Kernel density estimation
Normalization
Opportunities for
Statistical analysis
Dimension reduction …
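One way such a comparison could look in Python, with matplotlib subplots standing in for ggplot2 facets as the small multiples; the kernel bandwidth, axis range, and sample data are assumptions.

```python
import numpy as np
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt

def plot_small_multiples(samples, xlim=(1.45, 2.85)):
    """samples: dict mapping sample-source name -> array of A-H distances."""
    xs = np.linspace(*xlim, 200)
    fig, axes = plt.subplots(1, len(samples), sharey=True,
                             figsize=(4 * len(samples), 3))
    for ax, (name, data) in zip(np.atleast_1d(axes), samples.items()):
        kde = gaussian_kde(data)   # kernel density estimate of the distribution
        ax.plot(xs, kde(xs))       # densities integrate to 1, so sources of
        ax.set_title(name)         # different sizes are compared on one scale
        ax.set_xlabel("A-H distance (Å)")
    plt.tight_layout()
    plt.show()

# Usage with made-up data standing in for two sample sources:
plot_small_multiples({"natives": np.random.normal(1.9, 0.10, 500),
                      "decoys":  np.random.normal(2.0, 0.15, 500)})
```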
[Figure: Histogram of Hbond A-H distances in natives [KMB’03]; x-axis 1.45–2.85 Å, y-axis counts 0–1400]
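A distribution like this is the raw material for a knowledge-based score term. Below is a sketch of the standard inverse-Boltzmann construction (E ∝ −ln P); the binning, smoothing, and absence of a reference state are assumptions for illustration, not Rosetta's actual term.

```python
import numpy as np

def knowledge_based_term(distances, bins=np.arange(1.45, 2.90, 0.05)):
    """Turn observed A-H distances into an energy per distance bin."""
    counts, edges = np.histogram(distances, bins=bins)
    probs = (counts + 1) / (counts.sum() + len(counts))  # smoothed frequencies
    energies = -np.log(probs)                            # energy in arbitrary units
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, energies
```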
Scientific unit tests
native, HEAD, ^HEAD
run on a continuous testing server
Knowledge-based score term creation
native, release, experimental
turn exploration into living benchmarks
Test design hypotheses
native, protocol, designs
how strange is this geometry?
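A sketch of what one such scientific unit test could look like: a two-sample Kolmogorov–Smirnov test asking whether H-bond geometry from the current build (HEAD) still matches natives. The threshold and the way the distances are obtained are assumptions, not the project's actual test harness.

```python
from scipy.stats import ks_2samp

def test_hbond_distances(native_dists, head_dists, alpha=0.01):
    """Fail the build if HEAD's A-H distance distribution drifts from natives."""
    stat, p_value = ks_2samp(native_dists, head_dists)
    # "How strange is this geometry?" -- a small p-value means the HEAD
    # distribution no longer looks like the native one.
    assert p_value > alpha, (
        f"H-bond A-H distances drifted (KS={stat:.3f}, p={p_value:.3g})")
```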