

Dynameomics

Background
HT MD – Target Selection – Database Mining
Native DB
Reference unfolded peptide DB
Mining Unfolding Protein DB
Prion Protein and amyloid DB

Valerie Daggett
Bioengineering Department, Biomedical and Health Informatics
University of Washington, Seattle, WA

Central dogma of biology

Genomes

DNA: information to make protein
…AAAGTCCAGGCAGAATATAATTCTATAAAG GGAACTCCTTCAGAGGCTGAAATCTTT…

transcription

RNA: template to make protein

translation

Protein: function, phenotype
…LEVVAATPTSLLISWDAPAVTVR YYTYGETGGNSPVQEFTVPGS…

Life

Protein: function, phenotype. Motion critical

Dynamic cleft discovered through MD

Cytochrome b5

Storch et al., Biochem, 1995, 1999a,b, 2000

Protein folding embedded

Genomes, DNA → transcription → RNA → translation → Protein → Life

Protein folding problem

D, denatured, biologically inactive → ? → N, native, biologically active

Protein un/folding problem

D, denatured, biologically inactive
? = Process or pathway
N, native, biologically active

Unfolding pathway of CI2 in water

[Simulation contains 500,000 structures]

373 K N (94 ns) TS (21 ns) D (30 ns) D (94 ns)

• MD unfolding process in good agreement with experiment
• TS in quantitative agreement with experiment (prediction)
• Residual structure in D verified experimentally
• Atomic-level characterization of transition, intermediate and denatured state ensembles

Daggett and Fersht, TIBS, PNAS, +

Conformational ensembles in folding

N TS D

100 simulations

Day and Daggett, PNAS, 2005

Refolding by quenching TS

Brute force MD can refold proteins from the TS

[Plot: y-axis 0 to 8 (label not recovered) vs. Time (ns), 0 to 3; traces for ‘D’, TS and the control, N]

Plan: predict TS structures, perform MD simulations and solve protein folding problem
But we need info to predict TS (TS easier than D)

DeJong et al., JMB 2002

Reversible folding and unfolding

348 K in water, the Tm of the protein

[Figure: snapshots with residues A16, I20, L49 and I57 labeled]
Xtal
5 ns: 4.0 Å
25.6 ns: 8.9 Å
200 ns: 4.8 Å

And, refolding = unfolding
– Detailed pathway reversed
– A16/I20 orientation maintained

Day and Daggett, JMB, 2007

McCully et al., Biochem in press (EnHD)

Reverse central dogma of biology

D, denatured, biologically inactive
? = Process or pathway
N, native, biologically active

Determine pathways for many proteins, ascertain general features

DNA, RNA, Protein: decode genomes

Proteins

Proteins are life’s machines, tools and structures

Many jobs, many shapes, many sizes

Dynameomics

Goals:

1. Perform HT MD simulations of representatives of all folds (41,000 structures in PDB → 1130 fold families)
2. Construct a novel relational/multidimensional database to house these data and facilitate discovery
– Native state: information relevant to disease and drug design targets, SNPs
– Unfolding: disease and solution to protein folding problem

Beck et al., Prot Eng Des Sel, 2008

– NERSC – DOE – Unix – The Wall – Windows – Athena @ MS

Fold space

[Plot: fold population (0 to 700) vs. fold rank; second plot: cumulative fraction of structures (0 to 1.0) vs. fold rank (0 to 1000)]

Rank  Fold        Target       Population
1     IGG-like    Fibronectin  642
2     Rossmann    CheY         424
3     TIM Barrel  TIM          205
4     Jelly Roll  SAP          162
5     α/β Plait   S6           121

30 folds represent ~50% of known protein structures

• Divide protein structures into folds
– Consensus of SCOP, CATH and Dali
• Rank folds based on population
• Choose a representative protein from each fold

Day et al., Prot. Sci., 2003
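
The ranking step above can be sketched in a few lines. This is only an illustration: the fold assignments are toy data, and the function names (`rank_folds`, `cumulative_coverage`, `folds_needed`) are hypothetical rather than part of the Dynameomics tooling. It shows how ranking folds by population leads to a statement such as "30 folds represent ~50% of known protein structures".

```python
from collections import Counter
from itertools import accumulate


def rank_folds(fold_of_structure):
    """Return [(fold, population), ...] sorted from most to least populated."""
    counts = Counter(fold_of_structure.values())
    # Ties are broken alphabetically so the ranking is deterministic.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def cumulative_coverage(ranked):
    """Fraction of all structures covered by the top 1, 2, ... folds."""
    total = sum(pop for _, pop in ranked)
    return [running / total for running in accumulate(pop for _, pop in ranked)]


def folds_needed(ranked, fraction):
    """Smallest number of top-ranked folds that covers the given fraction."""
    for rank, covered in enumerate(cumulative_coverage(ranked), start=1):
        if covered >= fraction:
            return rank
    return len(ranked)


# Toy assignments: structure ID -> consensus fold label (both hypothetical).
toy = {"s1": "A", "s2": "A", "s3": "A", "s4": "B", "s5": "B", "s6": "C"}
ranked = rank_folds(toy)
```

With real consensus assignments in place of `toy`, `folds_needed(ranked, 0.5)` would give the number of folds needed to cover half of the structures.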

Target selection

• Selection criteria
– Structure quality
– Protein size
– Experimental data available
– Biomedical relevance
– 1st globular then membrane

Example: Rank 2, population 424. Selected target: CheY [PDB:3chy]

PDB code  X-ray/NMR (resolution)  Length  Gap in chain  Folding studies
3chy      X-ray, 1.66 Å           128     No            Yes
2fvx      X-ray, 1.80 Å           136     No            No
1a1v      X-ray, 2.20 Å           137     Yes           No
...       …                       …       …             …

Amanda Jonsson
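
A minimal sketch of how the Rank 2 example could be scored. The slide lists the criteria but not how they were weighted, so the ordering used here (exclude chains with gaps, prefer proteins with folding studies, then better resolution, then shorter length) is an assumption, and `Candidate` and `choose_target` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    pdb_code: str
    method: str
    resolution: float      # in Å
    length: int            # residues
    gap_in_chain: bool
    folding_studies: bool


def choose_target(candidates):
    """Pick one representative; the priority order is an assumed one."""
    eligible = [c for c in candidates if not c.gap_in_chain]
    if not eligible:
        return None
    # False sorts before True, so candidates with folding studies come first.
    return min(eligible, key=lambda c: (not c.folding_studies, c.resolution, c.length))


# The three candidates shown for fold rank 2.
rank2 = [
    Candidate("3chy", "X-ray", 1.66, 128, False, True),
    Candidate("2fvx", "X-ray", 1.80, 136, False, False),
    Candidate("1a1v", "X-ray", 2.20, 137, True, False),
]
best = choose_target(rank2)
```

Under these assumed priorities the sketch picks 3chy, the target named on the slide.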

Targets with biomedical relevance

Amyloid β precursor protein: Alzheimer’s disease
Glutathione S-transferase: Chemotherapy resistance
HIV-1 Protease: HIV
Triosephosphate isomerase: Neurodegeneration
MAP30: HIV and cancer
Serum amyloid P component: Amyloidosis

Top 30 folds

Represent 50% of all known protein folds

IGG-like: 1fna, fibronectin
Rossmann: 3chy, CheY
TIM barrel: 1ypi, TIM
Jelly Roll: 1sac, SAP
α/β plait: 1ris, S6
3-helix bundle: 1enh, engrailed homeodomain
Globin: 1a6n, myoglobin
4-helix bundle: 2a0b, phosphotransfer domain
β-grasp: 1pgb, protein G
EF-hand: 4icb, calbindin
Trypsin-like serine protease: 1qq4, α-lytic protease
Thioredoxin-like: 1ev4, GST A1-1
OB fold: 1mjc, CspA
IGG-like: 1e65, azurin
Cytochrome C: 1hrc, cytochrome C
Rossmann: 1ght, transposon γδ resolvase
SH3 barrel: 1shg, α-spectrin SH3 domain
FAD/NAD(P) binding domain: 1ebd, oxidoreductase
knottin: 1snb, neurotoxin BMK M8
C-type lectin: 2afp, type II antifreeze prot.
lipocalin: 1ifc, fatty acid binding prot.
trefoil: 1tld, bovine trypsin
Zn finger: 2adr, Zn finger (ADR)
snake toxin: 1ntn, cobra neurotoxin
acid protease: 1g6l, HIV-1 protease
Rossmann: 2pth, peptidyl tRNA hydrolase
GST (C-term): 1ev4, GST A1-1
IL-8 like (OB): 1bf4, Sso7d
PLP dep. transferase: 1e5f, methionine γ-lyase
Laminin-like: 1edm, coagulation factor IX

Data and metadata for ‘Top 30’ at www.dynameomics.org

Dynameomics protocol

• One 298 K native state simulation (21-60 ns, <26 ns>)
• At least three 310 K native simulations (some)
• At least five 498 K unfolding simulations
– Two long simulations (at least 31 ns, <36 ns>)
– At least three short simulations (2 ns, <14 ns>)
– (5 simulations ~ 100 simulations)

Trade-off sampling of different folds and different sequences as opposed to more thorough sampling of individual protein (~400 simulations of PrP)
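
The protocol can be read as a checklist per target. The sketch below encodes that reading; it assumes the lengths on the slide are minimums (21 ns for the 298 K run, 31 ns for the two long unfolding runs, 2 ns for the rest), and `protocol_gaps` is a hypothetical helper, not part of the project's software. The optional 310 K simulations are not checked.

```python
def protocol_gaps(simulations):
    """List unmet protocol requirements for one target.

    simulations: list of (temperature_K, length_ns) tuples.
    """
    native_298 = [ns for temp, ns in simulations if temp == 298]
    unfolding = [ns for temp, ns in simulations if temp == 498 and ns >= 2]
    long_runs = [ns for ns in unfolding if ns >= 31]

    gaps = []
    if not any(ns >= 21 for ns in native_298):
        gaps.append("need one 298 K native simulation of at least 21 ns")
    if len(unfolding) < 5:
        gaps.append("need at least five 498 K unfolding simulations")
    if len(long_runs) < 2:
        gaps.append("need two 498 K simulations of at least 31 ns")
    return gaps


complete = [(298, 21), (498, 31), (498, 36), (498, 2), (498, 2), (498, 2)]
missing_native = complete[1:]
```

An empty list means the target satisfies this reading of the protocol.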

Validation of Trajectories

• Computational checks: energy conservation
• Native State: NOEs, S² order parameters from NMR relaxation experiments, etc.
• Unfolding Process: Φ values, residual structure in denatured state, intermediates

David Beck

Native State Simulations: Ubiquitin

– NOEs (2727): MD 95.2%, XTAL 94.4%
– Proton chemical shifts: R = 0.98

Comparison with available NMR

Available BMRB data   Proteins (data points)      Level of agreement with experiment
NOE (1)               27 proteins (28,504 NOEs)   92 ± 4 %
Chemical shifts (2)   15 proteins (5,778)         R = 0.96

1. The 27 proteins with available data (by PDB code) are: 1aa3, 1c06, 1d1r, 1gle, 1kjs, 2ife, 3gcc, 1bf0, 1cmz, 1cok, 1cz4, 1d1n, 1d8v, 1enh, 1fad, 1fvl, 1fzt, 1ght, 1i11, 1iyu, 11dl, 1mut, 1sso, 1tfb, 1ubq, 1uxc, 3chy.
2. Proton chemical shifts from MD structures were calculated with SHIFTS (Osapay and Case, 1991). The 15 proteins with data available (by PDB code): 1mjc, 1hcc, 1ubq, 1baz, 1cz4, 1a2p, 1e65, 1ill, 3chy, 1ght, 1cmz, 1gpr, 1byl, 1fzt, 1b10.
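
One common way to score NOE agreement for a trajectory is to average each proton-proton distance as ⟨r⁻⁶⟩^(−1/6) over the frames and count the restraint as satisfied when that value is within the experimental upper bound. The slides do not state which averaging or cutoff was used, so the sketch below is an assumed scheme with hypothetical function names and made-up distances.

```python
def noe_satisfied(distances, upper_bound):
    """True if the r^-6 averaged distance (in Å) is within the upper bound."""
    mean_inverse_sixth = sum(r ** -6 for r in distances) / len(distances)
    effective_distance = mean_inverse_sixth ** (-1.0 / 6.0)
    return effective_distance <= upper_bound


def percent_satisfied(restraints):
    """restraints: list of (distances_over_frames, upper_bound) pairs."""
    satisfied = sum(noe_satisfied(d, bound) for d, bound in restraints)
    return 100.0 * satisfied / len(restraints)


# Two made-up restraints, each sampled in two frames, with a 5.0 Å bound.
example = [([3.0, 6.0], 5.0), ([7.0, 8.0], 5.0)]
```

Because of the r⁻⁶ weighting, the first restraint is satisfied even though one frame is beyond 5 Å: short distances dominate the average.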

Dynameomics status

• Dataset includes over 500 proteins and nearly 4000 simulations for a total of > 60 µs of simulation time, > 65M structures
• > 64 TB

Measure: Residues
Mean: 134
Min: 29
Max: 417

Not including 637 amyloid simulations

Comprehensive data/metadata

In theory, build a warehouse

Build a data warehouse (not so easy)

• The data set is large… (~6 months to load protein coordinates)
– Storing protein data only, no solvent data
– Only single simulations per table (10M – 90M rows)
– 4000 simulations x 10 analyses right now (40K tables)
– And we are growing at a rate of ~2000 simulations per year (10K tables)
• Approach for scaling...
– Multiple servers
– Multiple databases per server
– 100 targets per database

Andrew Simms

Simms et al., Prot Eng Des Sel, 2008
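
The scaling approach above amounts to a simple placement rule. The sketch below is an illustration only: the slide gives 100 targets per database, but the number of databases per server is not stated, so `databases_per_server=4` is an arbitrary assumption, and both function names are hypothetical.

```python
def table_count(simulations, analyses_per_simulation):
    """One table per simulation per analysis."""
    return simulations * analyses_per_simulation


def placement(target_index, targets_per_database=100, databases_per_server=4):
    """Map a zero-based target index to (server, database on that server)."""
    database = target_index // targets_per_database
    server = database // databases_per_server
    return server, database % databases_per_server
```

For example, `table_count(4000, 10)` reproduces the 40K tables quoted on the slide, and `placement(450)` puts target 450 in the first database of the second server under the assumed layout.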

Multi-D cubes for complex data analysis

• Though our data set may be large, our requirements are typical in the scientific world
• Large, complex and often multidimensional data sets
• Analytical rather than transactional processing
• Need for performance and storage efficiency

On-line analytical processing: OLAP
MOLAP: multidimensional OLAP

Catherine Kehl

Molecular Dynamics

MD provides atomic resolution of native dynamics

3chy, waters and hydrogens hidden


native state simulation of 3chy at 298 K, Asp 57

Native-state dynamics: helix motion

[Figure: standard deviation of helix angle (degrees) for helix pairs α2:α3 and α3:α4, CheY at 298 K; snapshots at 0, 5, 10, 15 and 20 ns with helices α2 to α5 labeled]

α2 and α3 dynamic, α4 and α5 stable structural scaffold
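
A minimal sketch of the quantity plotted above, the standard deviation of an inter-helix angle over a trajectory. It assumes each helix axis is approximated by the vector between two points, for example the centroids of its first and last turns; the slides do not say how the axes were defined, and the function names are hypothetical.

```python
import math
from statistics import pstdev


def axis(start, end):
    """Unit vector pointing from the start point of a helix to its end point."""
    vector = [e - s for s, e in zip(start, end)]
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def interhelix_angle(axis_a, axis_b):
    """Angle in degrees between two unit axis vectors."""
    dot = sum(a * b for a, b in zip(axis_a, axis_b))
    dot = max(-1.0, min(1.0, dot))  # guard against rounding outside [-1, 1]
    return math.degrees(math.acos(dot))


def angle_sd(frames):
    """Population standard deviation of the angle over (axis_a, axis_b) frames."""
    return pstdev([interhelix_angle(a, b) for a, b in frames])
```

A rigid helix pair gives a standard deviation near zero; a mobile pair such as α2:α3 would give a larger value.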

CheY – Binding partners

Structures of CheY complexes show binding to α4 and α5

[Figure: distances between ends of helices (α4:α5), 20 ns; complexes CheY/CheZ, CheY/FliM and CheY/CheA, with helices α2, α3, α4 and α5 labeled]

• Functionally important face of protein stable
• Asp 57, phosphorylation
• Motion in α2 and α3 does not disrupt function, entropy sink?

Rudesh Toofanny

Catechol O-methyltransferase

COMT and CheY

– Both proteins: Rank 2 Rossmann fold
– COMT polymorphism: Val108 → Met
– 108M: increased risk for diseases such as breast cancer and OCD
– Improved memory
– α6 and α7 mobile in COMT, too
– In 108M, movement of α6 propagates 16 Å and disrupts the active site

[Figure: MD of 108M at 15 ns and 30 ns]

Rutherford et al., Biochem. 2006

Importance of characterizing dynamics

SNP-induced changes in COMT

[Figure: native-like and intermediate structures of 108V and 108M, with α6, α7 and α8 labeled]

Mutation to Met leads to loosening of the active site
Followed up with CD, NMR, crystallography, fluorescence

Rutherford et al., BBA, JMB, JMB, Biochem, 2008

SNP leads to broader conformational ensemble at 310 K

[Plot: Cα RMSD distribution (Å), 0 to 5 Å, for 108V COMT and a second panel; curves for the starting structure and for 25 °C, 37 °C and 50 °C]

SNP-omics

• COMT: SNP leads to subtle differences in packing near the mutation site that propagate to the active site
• Similar behavior now seen in 4 other members of this methyltransferase family (fold rank 2)
• Effects NOT apparent in static structures
• Large scale effort to investigate dynamic effects of SNPs starting with 80 proteins: dynameomics protocol, add multiple 310 K simulations

SLIRP

• Structural Library of Intrinsic Residue Propensities (SLIRP) to determine structural propensities for design
– GGXGG peptides in water at 298 K and 498 K and in 8 M urea at 298 K (multiple simulations, 100 ns)
• Unbiased coil library, main chain and side chain, exhaustive sampling
– Dynamic protein side chain rotamer library
• Rotamer populations, improved over static from crystal structures
• S² axis, waiting times between rotamers

“Random Coil” Peptides

[Figure: conformational distributions for Ala and Glu in GGAGG, Protein-MD and Protein-PDB; axes labeled Φ (°); panel percentages as extracted: 26%, 16%, 4%, 26%, 24%]

HN, Hα, Hβ, NH, Cα, Cβ, and C′ for GGAGG are very close to the corresponding experimentally derived values (R = 0.999 over 28 points, 7 atoms x 4 independent simulations).

Chemical shifts for GGXGG: MD and Expt

Atom in X   R        Values
HN          0.8666   19
Hα          0.8703   20
Hβ1         0.9722   19
Hβ2         0.9216   15
NH          0.8483   19
Cα          0.9879   20
Cβ          0.9950   19
C′          0.9068   20
Overall     0.9998   151

Predictions calculated with ShiftX v1.0 (Neal et al., 2003, J Biomol NMR)
Experimental data taken from Schwarzinger et al., J. Biomol NMR, 2000
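
The R values quoted for chemical shifts are correlation coefficients between predicted and experimental shifts. A minimal sketch of the Pearson coefficient follows; `pearson_r` is a hypothetical helper, and no real shift data are reproduced here.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)
```

An overall R computed across nuclei with very different shift ranges (protons, carbons, nitrogens) is typically higher than the per-atom values, since the spread between nuclei dominates the correlation.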

“Random Coil” Peptides vs. Protein

[Figure: conformational distributions for Ala and Glu in GGAGG, Protein-MD and Protein-PDB; axes labeled Φ (°)]

• Ala in protein MD distributions (188 proteins) similar to PDB
• Ala in GGXGG different
• GGAGG vs experimental helix propensities, R = 0.28
• Protein MD vs helix propensities, R = 0.92
• Host-guest studies reflecting the host more than the guest

Mining the database

• SLIRP to determine structural propensities for design
• Dynamic area conserved in members of protein family. In one case critical for biological function and in another mutation at the region leads to disease
• Inflexible region across 188 proteins, identified novel structural elements associated with loop structure (antifreeze)

Rudesh Toofanny, Noah Benson

Solving the protein folding problem?

– Data mining of the Dynameomics database for information to predict TS structures
– Bootstrapping to native state prediction by refolding from predicted transition state structures

[Diagram: unfolding from N through TS to D; refolding from D through TS to N, with question marks on the refolding steps]

Dustin Schaeffer

Contact analysis

• Determined contact probabilities by amino acid and by separation between the amino acids, from mining of the Dynameomics DB

[Figure: contact probabilities for contacts i → i+x by residue type and residue separation (i → i+1, i → i+2, i → i+3); examples shown for Leu and for Leu-Leu at i → i+3]
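
A minimal sketch of the kind of tally described above: contact probabilities keyed by residue pair and sequence separation, accumulated over snapshots. It assumes one representative point per residue and a single distance cutoff; the actual contact definition used for the Dynameomics database is not given on the slide, and the 5.0 Å value and the function name are placeholders.

```python
import math
from collections import defaultdict


def contact_probabilities(snapshots, cutoff=5.0, max_separation=3):
    """Probability of a contact for each (residue_i, residue_j, separation).

    snapshots: list of (sequence, coords) where coords[i] is an (x, y, z)
    point for residue i. The cutoff is in the same units as the coordinates.
    """
    hits = defaultdict(int)
    trials = defaultdict(int)
    for sequence, coords in snapshots:
        for i in range(len(sequence)):
            for separation in range(1, max_separation + 1):
                j = i + separation
                if j >= len(sequence):
                    break
                key = (sequence[i], sequence[j], separation)
                trials[key] += 1
                if math.dist(coords[i], coords[j]) <= cutoff:
                    hits[key] += 1
    return {key: hits[key] / trials[key] for key in trials}


# One toy snapshot of a three-residue peptide (coordinates are made up).
toy = [("LAL", [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (10.0, 0.0, 0.0)])]
probabilities = contact_probabilities(toy)
```

Keys are ordered by sequence position, so `("L", "A", 1)` and `("A", "L", 1)` are tallied separately.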

Coordinates from contacts

Most probable contacts → DG → Protein structure

A set of distances for a particular sequence can be converted into coordinates by singular value decomposition (SVD) of a distance matrix (distance geometry, DG)
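
The conversion from distances to coordinates can be sketched with classical metric matrix embedding: square the distances, double-center them to obtain a Gram matrix, and take the leading singular vectors. This is a generic textbook version written with NumPy, not the group's implementation, and the function names are hypothetical. It assumes a complete and mutually consistent distance matrix; with predicted or noisy distances the Gram matrix can have negative eigenvalues and the result is only an approximation.

```python
import numpy as np


def coords_from_distances(distances, dim=3):
    """Embed an n x n matrix of pairwise distances as n points in `dim` dimensions."""
    d = np.asarray(distances, dtype=float)
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Gram matrix of the centered coordinates.
    gram = -0.5 * centering @ (d ** 2) @ centering
    u, s, _ = np.linalg.svd(gram)
    # Scale the leading singular vectors by the square roots of the singular values.
    return u[:, :dim] * np.sqrt(s[:dim])


def pairwise_distances(coords):
    """All-against-all Euclidean distances for an (n, dim) coordinate array."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))
```

The recovered coordinates match the originals only up to rotation, translation and reflection, so comparisons are best made on the distances or after superposition.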

TS predictions for Fyn SH3

Prediction from mined data via distance geometry (too compact)

RMSD = 3.8 ± 0.37 Å

MD generated TS ensemble

Solving the folding problem with MD

High-throughput structure prediction should be possible by refolding from transition states

Sequence → (DB Info + DG) → TS Structure → (MD) → N Structure

• We have TSs for 80% of known protein structures
• We have refolded from TS

Dynameomics Conclusions

• Native state simulations to probe protein function, for drug design, SNP-omics
• Unfolding simulations for structure prediction, protein design/redesign, unfolding diseases
• SLIRP (Structural Library of Intrinsic Residue Propensities): intrinsic mainchain conformations, dynamic side chain rotamer library, coil library
• Dynameomics.org

Noah Benson