Protein Structure Prediction (Lecture for CS397-CXZ Algorithms in Bioinformatics) April 23, 2004
ChengXiang Zhai
Department of Computer Science University of Illinois, Urbana-Champaign
Topics in Bioinformatics
• Gene (DNA): genomics
• Gene expression & regulation: transcriptomics, microarray data (matrix)
• Function (protein): proteomics

> DNA sequence
AATTCATGAAAATCGTATACTGGTCTGGTACCGGC
TGAGAAAATGGCAGAGCTCATCGCTAAAGGTA
TCTGGTAAAGACGTCAACACCATCAACGTGTC
ACATCGATGAACTGCTGAACGAAGATATCCTG
TTGCTCTGCCATGGGCGATGAAGTTCTCGAGG

> Protein sequence
MKIVYWSGTGNTEKMAELIAKGIIESGKDV
DELLNEDILILGCSAMGDEVLEESEFEPFIE
KVALFGSYGWGDGKWMRDFEERMNGYG
PDEAEQDCIEFGKKIANI
Proteomics: Protein Sequence Analysis
• Determine protein sequences (primary structure)
  – Indirect: Find genes and then translate them to proteins
  – Direct: Mass spectrometry data
• Determine 3-D protein structures (secondary, tertiary, quaternary)
  – Computational: Sequence matching, energy minimization, etc.
  – Experimental: X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy, electron microscopy/diffraction
• Determine protein functions
  – Computational: Profile HMMs, protein classification, motif analysis
  – Experimental: Wet lab experiments
• Determine protein-protein interactions
  – Gene network finding (time series microarray data)
  – Metabolic engineering
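The "indirect" route (find a gene, then translate it) can be sketched in a few lines. This is a minimal illustration, not a gene finder: it assumes the standard genetic code, starts at the first ATG it sees, and reads codons until a stop codon or the end of the fragment. The function name `translate` is made up for this example. The input is the first line of the DNA sequence shown on the "Topics in Bioinformatics" slide.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order
# (third base varies fastest), "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate from the first ATG until a stop codon or the end of the DNA."""
    start = dna.find("ATG")
    if start < 0:
        return ""
    protein = []
    for i in range(start, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# First line of the DNA sequence from the slide.
fragment = "AATTCATGAAAATCGTATACTGGTCTGGTACCGGC"
print(translate(fragment))  # MKIVYWSGTG
```

The result matches the first ten residues of the protein sequence on the same slide (MKIVYWSGTG...).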
Basics of Protein Structures…
The Building Blocks (Amino Acids)
The 20 Amino Acids
Protein structure hierarchical levels
• PRIMARY STRUCTURE (amino acid sequence)
  VHLTPEEKSAVTALWGKVNVDE VGGEALGRLLVVYPWTQRFFE SFGDLSTPDAVMGNPKVKAHG KKVLGAFSDGLAHLDNLKGTFA TLSELHCDKLHVDPENFRLLGN VLVCVLAHHFGKEFTPPVQAAY QKVVAGVANALAHKYH
• SECONDARY STRUCTURE (helices, strands): α-helix, β-sheet, loop/coil
• TERTIARY STRUCTURE (fold)
• QUATERNARY STRUCTURE (oligomers)
(Adapted from Jaap Heringa’s slide)
Domain and Folds
• Domain: a discrete portion of a protein assumed to fold independently of the rest of the protein and possessing its own function.
• Most proteins have multiple domains.
• The core 3D structure of a domain is called a fold. There are only a few thousand possible folds.
Examples of fold classes (CATH architectures)
Protein Structure & Function
[Figure: diagram relating sequence, structure, function, and medicine]
Most functions depend on structures
Structure Prediction Methods
• Homology modeling
  – High sequence similarity (> 30% identity)
  – Exploit known whole structure
• Fold recognition
  – Medium sequence similarity (generally < 30% identity)
  – Exploit known partial structures (e.g., known folds, secondary structures)
• Ab initio
  – Low sequence similarity
  – Use “first principles” (e.g., energy minimization)
(Adapted from a slide by P. Johansson, E. Jakobsson)
First, suppose we have high similarity…
Homology Modeling
• Simplest, reliable approach
• Basis: proteins with similar sequences tend to fold into similar structures
• It has been observed that even proteins with 30% sequence identity fold into similar structures
• Does not work for remote homologs (< 30% pairwise identity)
Homology Modeling (cont.)
• Given:
  – A query sequence Q
  – A database of known protein structures
• Find protein P such that P has high sequence similarity to Q
  – Based on sequence alignment (tuned for protein structure matching, less penalty for gaps)
  – HMMs, BLAST, etc.
• Return P’s structure as an approximation to Q’s structure
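The template search can be sketched as "align Q against every sequence in the database and keep the best hit". The sketch below is a stand-in, not the tuned alignment, HMM, or BLAST search the slide refers to: it uses a plain Needleman-Wunsch global alignment with made-up scoring parameters (match +1, mismatch -1, gap -1) and defines identity as matches divided by alignment length. The template names and sequences are hypothetical.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment; returns the two gapped strings."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner along one optimal path.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))


def percent_identity(a, b):
    """Identical aligned positions as a percentage of the alignment length."""
    top, bottom = global_align(a, b)
    matches = sum(x == y for x, y in zip(top, bottom))
    return 100.0 * matches / len(top)


def best_template(query, database):
    """Return (name, identity) of the database entry most similar to query."""
    name = max(database, key=lambda k: percent_identity(query, database[k]))
    return name, percent_identity(query, database[name])


# Hypothetical database: template name -> sequence of a known structure.
database = {"tmplA": "ACDEFGHIK", "tmplB": "WWWWYYYY"}
name, identity = best_template("ACDEFGHIR", database)
print(name, round(identity, 1))  # tmplA 88.9
```

Following the slides, the structure of the returned template would be used as the approximation only when the identity is high (above roughly 30%).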
Now, if we don’t have high similarity, but we have medium similarity…
Threading (Fold Recognition)
• Given:
  – Sequence of protein P with unknown structure
  – Database of known folds (overall structures)
• Find:
  – Most plausible fold for P
  – Evaluate quality of such arrangement
• Places the residues of unknown P along the backbone of a known structure and determines stability of side chains in that arrangement
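A toy version of the idea: place the sequence along each known fold and score how comfortable each residue is in the position it lands on. Everything specific here is an assumption made for illustration. Folds are reduced to a string of buried ("b") and exposed ("e") positions, the score rewards hydrophobic residues in buried positions using the Kyte-Doolittle hydropathy scale, and no gaps are allowed. Real threading programs use richer structural environments, statistical potentials, and gapped alignment.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def thread_score(sequence, environment):
    """Score one ungapped placement: hydrophobic residues should be buried."""
    score = 0.0
    for residue, env in zip(sequence, environment):
        h = HYDROPATHY[residue]
        score += h if env == "b" else -h
    return score


def best_fold(sequence, folds):
    """Try every fold and every ungapped offset; return (score, name, offset)."""
    best = None
    for name, environment in folds.items():
        for offset in range(len(environment) - len(sequence) + 1):
            placed = environment[offset:offset + len(sequence)]
            score = thread_score(sequence, placed)
            if best is None or score > best[0]:
                best = (score, name, offset)
    return best


# Hypothetical fold library: 'b' = buried position, 'e' = exposed position.
folds = {"fold_1": "bbeebbee", "fold_2": "eebbeebb"}
score, name, offset = best_fold("LLKKLLKK", folds)
print(name, offset, round(score, 1))  # fold_1 0 30.8
```

The returned score plays the role of the "quality of such arrangement" on the slide: a higher value means the side chains fit the fold's environments better under this toy scoring.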
What if we have really low similarity?
Secondary Structure Prediction
• Given an amino acid sequence
• Predict a secondary structure state (α, β, coil) for each residue in the sequence
• Secondary structures can help
  – Determine 3D structures (e.g., help threading)
  – Provide insights about functions
• Evaluation: Q3 = percentage of residues assigned the correct state (α, β, or coil)
• Accuracy
  – 64%-75% based on primary sequence only (recent methods perform better)
  – Higher accuracy for α-helices than for strands
  – Accuracy is dependent on protein family
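Q3 is simple enough to compute directly. The sketch below assumes predictions and reference assignments are given as equal-length strings over the three states, written here as H (helix), E (strand), and C (coil). The example strings are made up.

```python
def q3(predicted, observed):
    """Percentage of residues whose predicted state matches the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Made-up example: 10 of 12 residues are assigned the correct state.
observed = "HHHHEEEECCCC"
predicted = "HHHCEEECCCCC"
print(round(q3(predicted, observed), 1))  # 83.3
```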
Typical Secondary Structure Prediction Results
Secondary Structure Prediction Methods
• Early approaches (Chou and Fasman 1978)
  – Make prediction for a given residue by considering a window of n (13–21) neighboring residues
  – Learn model that performs mapping from window of residues to secondary structure state
• Later methods utilize evolutionary information (e.g., PHD system (Rost & Sander, 1993)) and consider related sequences when making prediction
• Most recent approaches: Neural networks (PSIPRED, 77%) (Altschul et al., 1997)
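The window-based setup can be sketched as follows: for every residue, cut out the n residues centered on it and hand that window to whatever model does the prediction. The padding symbol "-" for positions that fall off either end is an assumption of this sketch, as is the function name.

```python
def residue_windows(sequence, n=13, pad="-"):
    """Return one window of n residues centered on each position."""
    if n % 2 == 0:
        raise ValueError("window size must be odd so it has a center")
    half = n // 2
    padded = pad * half + sequence + pad * half
    return [padded[i:i + n] for i in range(len(sequence))]


windows = residue_windows("TSPTAELMRSTG", n=13)
print(windows[0])   # ------TSPTAEL
print(windows[-1])  # ELMRSTG------
```

Each window has the residue to be predicted at its middle position (index 6 when n = 13), so a model trained on (window, state) pairs produces one state per residue.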
Chou-Fasman Method
• Developed by Chou & Fasman in 1974 & 1978
• Based on frequencies of residues in α-helices, β-sheets and turns
• Assumptions:
  – The entire information for forming secondary structure is contained in the primary sequence
  – Side groups of residues will determine structure
  – Examining windows of 13-17 residues is sufficient to predict structure
  – Basis for window size selection:
    - α-helices 5–40 residues long
    - β-strands 5–10 residues long
• Accuracy ~50-60% Q3
Chou-Fasman Pij-values
Name            P(H)   P(E)   P(turn)
Alanine          142     83      66
Arginine          98     93      95
Aspartic Acid    101     54     146
Asparagine        67     89     156
Cysteine          70    119     119
Glutamic Acid    151     37      74
Glutamine        111    110      98
Glycine           57     75     156
Histidine        100     87      95
Isoleucine       108    160      47
Leucine          121    130      59
Lysine           114     74     101
Methionine       145    105      60
Phenylalanine    113    138      60
Proline           57     55     152
Serine            77     75     143
Threonine         83    119      96
Tryptophan       108    137      96
Tyrosine          69    147     114
Valine           106    170      50
Values indicate how likely an amino acid is to occur in one secondary structure as opposed to the others
Improved Chou-Fasman
1. Assign all of the residues the appropriate set of parameters
2. Identify α-helix and β-sheet regions. Extend the regions in both directions.
3. If structures overlap, compare average values for P(H) and P(E) and assign secondary structure based on best scores.
4. Turns are modeled as tetrapeptides using 2 different probability values.
Assign Pij values
1. Assign all of the residues the appropriate set of parameters
Residue    T    S    P    T    A    E    L    M    R    S    T    G
P(H)      69   77   57   69  142  151  121  145   98   77   69   57
P(E)     147   75   55  147   83   37  130  105   93   75  147   75
P(turn)  114  143  152  114   66   74   59   60   95  143  114  156
Scan peptide for α-helix regions
2. Identify regions where 4 out of 6 consecutive residues have P(H) > 100 (“α-helix nucleus”)
Residue    T    S    P    T    A    E    L    M    R    S    T    G
P(H)      69   77   57   69  142  151  121  145   98   77   69   57
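The nucleus scan is a sliding window count. The sketch below copies the P(H) values exactly as printed on the slide for the peptide TSPTAELMRSTG. Note that the slide lists 69/147/114 for T, which are the tyrosine entries in the P-value table; the threonine entries there are 83/119/96. The function name and the 0-based start positions are choices made for this example.

```python
PEPTIDE = "TSPTAELMRSTG"
# P(H) per residue, copied as printed on the slide.
P_H = [69, 77, 57, 69, 142, 151, 121, 145, 98, 77, 69, 57]


def find_nuclei(values, window, min_hits, threshold=100):
    """Start positions of windows with at least min_hits values > threshold."""
    starts = []
    for start in range(len(values) - window + 1):
        hits = sum(v > threshold for v in values[start:start + window])
        if hits >= min_hits:
            starts.append(start)
    return starts


helix_nuclei = find_nuclei(P_H, window=6, min_hits=4)
for start in helix_nuclei:
    print(start, PEPTIDE[start:start + 6])
# 2 PTAELM
# 3 TAELMR
# 4 AELMRS
```

All three windows contain the run A, E, L, M, which supplies the four residues with P(H) > 100.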
Extend α-helix nucleus
3. Extend helix in both directions until a set of four residues have an average P(H) < 100.
Residue    T    S    P    T    A    E    L    M    R    S    T    G
P(H)      69   77   57   69  142  151  121  145   98   77   69   57
Repeat steps 1 – 3 for entire peptide
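The extension step leaves some room for interpretation. The sketch below assumes one reading: a region grows by one residue at a time, and growth in a direction stops as soon as the four residues ending at the candidate position (the candidate plus its three neighbors inside the region) average below 100. It starts from the nucleus PTAELM (0-based positions 2 to 7) and uses the P(H) values as printed on the slide.

```python
PEPTIDE = "TSPTAELMRSTG"
# P(H) per residue, copied as printed on the slide.
P_H = [69, 77, 57, 69, 142, 151, 121, 145, 98, 77, 69, 57]


def extend_region(values, start, end, threshold=100, probe=4):
    """Grow [start, end] (inclusive) while the 4-residue edge average >= threshold."""
    # Extend to the left: test the candidate residue plus the next three.
    while start > 0:
        window = values[start - 1:start - 1 + probe]
        if sum(window) / len(window) < threshold:
            break
        start -= 1
    # Extend to the right: test the candidate residue plus the previous three.
    while end < len(values) - 1:
        window = values[end + 2 - probe:end + 2]
        if sum(window) / len(window) < threshold:
            break
        end += 1
    return start, end


start, end = extend_region(P_H, 2, 7)
print(start, end, PEPTIDE[start:end + 1])  # 2 9 PTAELMRS
```

Under this reading the helix stops growing to the left immediately (S, P, T, A average 86.25) and gains R and S on the right before M, R, S, T average 97.25. Starting from either of the other two nuclei (positions 3 to 8 or 4 to 9) gives the same region.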
Scan peptide for β-sheet regions
4. Identify regions where 3 out of 5 consecutive residues have P(E) > 100 (“β-sheet nucleus”)
5. Extend β-sheet until 4 contiguous residues have an average P(E) < 100
Residue    T    S    P    T    A    E    L    M    R    S    T    G
P(H)      69   77   57   69  142  151  121  145   98   77   69   57
P(E)     147   75   55  147   83   37  130  105   93   75  147   75
6. If the region’s average P(E) > 105 and the average P(E) > average P(H), then “β-sheet”
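Steps 4 to 6 can be run end to end on the same peptide. This is a sketch under stated assumptions: values are copied as printed on the slide, the extension rule is read as "stop when the four residues ending at the candidate position average below 100", and "region average" in step 6 is taken to mean the average P(E). Function names are made up for the example.

```python
PEPTIDE = "TSPTAELMRSTG"
# Values per residue, copied as printed on the slide.
P_H = [69, 77, 57, 69, 142, 151, 121, 145, 98, 77, 69, 57]
P_E = [147, 75, 55, 147, 83, 37, 130, 105, 93, 75, 147, 75]


def mean(values):
    return sum(values) / len(values)


def find_nuclei(values, window, min_hits, threshold=100):
    """Start positions of windows with at least min_hits values > threshold."""
    return [s for s in range(len(values) - window + 1)
            if sum(v > threshold for v in values[s:s + window]) >= min_hits]


def extend_region(values, start, end, threshold=100, probe=4):
    """Grow [start, end] (inclusive) while the 4-residue edge average >= threshold."""
    while start > 0 and mean(values[start - 1:start - 1 + probe]) >= threshold:
        start -= 1
    while end < len(values) - 1 and mean(values[end + 2 - probe:end + 2]) >= threshold:
        end += 1
    return start, end


sheet_regions = []
for nucleus_start in find_nuclei(P_E, window=5, min_hits=3):      # step 4
    start, end = extend_region(P_E, nucleus_start, nucleus_start + 4)  # step 5
    avg_e = mean(P_E[start:end + 1])
    avg_h = mean(P_H[start:end + 1])
    print(PEPTIDE[start:end + 1], avg_e, avg_h)
    if avg_e > 105 and avg_e > avg_h and (start, end) not in sheet_regions:  # step 6
        sheet_regions.append((start, end))
# TAELM 100.4 125.6
# LMRST 110.0 102.0
print(sheet_regions)  # [(6, 10)]
```

Only LMRST passes step 6. It overlaps the helix candidate around AELM, which is the situation the overlap rule (compare average P(H) and P(E) over the shared residues) is meant to resolve.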
Visit http://fasta.bioch.virginia.edu/fasta_www/chofas.htm
Neural Network Predictors
• All current state-of-the-art methods for secondary structure prediction (except consensus methods) employ neural network classifiers.
• (Large) data sets are used to train the neural net
• A sequence window centered on the amino acid to predict is presented to the classifier
• Homologous sequences (e.g., PSI-BLAST profile) are used to augment prediction capability
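How a sequence window is "presented to the classifier" can be illustrated with a one-hot encoding. This is only the input side, under assumptions of this sketch: a 13-residue window, a 21-symbol alphabet (20 amino acids plus a padding symbol for positions beyond the sequence ends), and a flat feature vector. The network itself is not shown. Profile-based methods would replace each one-hot column with the corresponding profile column for that position.

```python
# 20 amino acids plus "-" for window positions that fall outside the sequence.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"


def encode_window(sequence, center, n=13):
    """One-hot encode the n-residue window centered on position `center`."""
    half = n // 2
    features = []
    for pos in range(center - half, center + half + 1):
        symbol = sequence[pos] if 0 <= pos < len(sequence) else "-"
        one_hot = [0.0] * len(ALPHABET)
        one_hot[ALPHABET.index(symbol)] = 1.0
        features.extend(one_hot)
    return features


x = encode_window("TSPTAELMRSTG", center=0)
print(len(x))  # 273 input values: 13 window positions x 21 symbols
```

One such vector is produced per residue, paired with that residue's known state (α, β, or coil) during training.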
What about exploiting physical principles?
Ab Initio Prediction
• Solve a complex optimization problem
• Measure “goodness” based on energy, etc.
  - Randomly start with some conformation
  - Heuristically propose a next conformation
  - Search for the best conformation
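The search loop can be sketched with a Metropolis-style Monte Carlo procedure. Everything physical is replaced by a stand-in here: a conformation is just a list of angles, and `toy_energy` is a made-up function that is lowest when every angle equals a target value. A real ab initio method would use a physical or knowledge-based energy function and far more elaborate moves. The parameter values (step count, temperature, move size) are arbitrary choices for the example.

```python
import math
import random


def toy_energy(angles, targets):
    """Made-up energy: zero when every angle equals its target, positive otherwise."""
    return sum(1.0 - math.cos(a - t) for a, t in zip(angles, targets))


def monte_carlo_search(targets, steps=5000, temperature=0.5, seed=0):
    rng = random.Random(seed)
    # Randomly start with some conformation.
    current = [rng.uniform(-math.pi, math.pi) for _ in targets]
    e_current = toy_energy(current, targets)
    initial_energy = e_current
    best, e_best = list(current), e_current
    for _ in range(steps):
        # Heuristically propose a next conformation: perturb one angle.
        candidate = list(current)
        i = rng.randrange(len(candidate))
        candidate[i] += rng.gauss(0.0, 0.3)
        e_candidate = toy_energy(candidate, targets)
        # Accept downhill moves always, uphill moves with Boltzmann probability.
        accept = (e_candidate <= e_current or
                  rng.random() < math.exp((e_current - e_candidate) / temperature))
        if accept:
            current, e_current = candidate, e_candidate
            # Keep track of the best conformation seen so far.
            if e_current < e_best:
                best, e_best = list(current), e_current
    return best, e_best, initial_energy


targets = [0.5 * k for k in range(10)]
best, e_best, e_start = monte_carlo_search(targets)
print(e_start, e_best)
```

Accepting some uphill moves lets the search escape local minima, which is the reason for not using pure greedy descent.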
Best so far…
Using Rosetta for Ab Initio Structure Prediction in the Fourth Community Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction (CASP4)
Group of David Baker, Univ. of Washington
Visit their website and read the paper if you are interested…
http://depts.washington.edu/bakerpg/