CSCE590/822 Data Mining Principles and Applications



CSCE555 Bioinformatics

Lecture 18: Protein Bioinformatics and Protein Secondary Structure Prediction
Meeting: MW 4:00PM-5:15PM SWGN2A21
Instructor: Dr. Jianjun Hu
Course page: http://www.scigen.org/csce555
University of South Carolina, Department of Computer Science and Engineering, 2008
www.cse.sc.edu


Outline

• Understanding Protein Structures
• Protein bioinformatics: what and why?
• Protein Secondary Structure Prediction: problem & algorithm
• Summary

Proteins

• Large organic compounds made of amino acids
• Proteins play a crucial role in virtually all biological processes, with a broad range of functions.
• The activity of an enzyme or the function of a protein is governed by its three-dimensional structure.

How Proteins Are Generated

folding

Protein Bioinformatics

• Analysis and prediction of protein structures (Structural Bioinformatics)
◦ Protein Design: design a sequence that will fold into a designated structure
• Assist experimental biology in assigning functions or suggesting functional hypotheses for all known proteins.

Protein Bioinformatics

[Figure: DNA → (transcription) → RNA → (translation) → protein → phenotype, annotated with the related databases: genomic DNA databases; cDNA, ESTs, UniGene; gene expression database; protein sequence databases; protein structure databases]

TOP 10 Most Wanted solutions in protein bioinformatics

1. Protein sequence alignment
2. Predicting protein features from sequence
3. Function prediction
4. Protein structure prediction
5. Membrane proteins
6. Functional site identification
7. Protein-protein interaction
8. Protein-small molecule interaction (Docking)
9. Protein design
10. Protein engineering

Why Protein Bioinformatics?

Function = Σ interactions

Disease Mechanism, Gene regulation, Drug design…

Relevance of Protein Structure in the Post-Genome Era

structure medicine sequence function

Protein Structure Example

[Figure labels: helix, beta sheet, loop; 2 chains]

Protein Structure is Hierarchical

Sequence → Local Folding → Long-range Folding (single peptide chain) → Multi-meric organization (multiple peptide chains)

How to Obtain Protein Structures

• Experimental methods (>50,000)
◦ X-ray crystallography or NMR (nuclear magnetic resonance) spectroscopy
◦ Limitations: protein size; crystallography requires crystallized proteins
◦ Membrane proteins are difficult to crystallize
• Computational methods (predictive methods)
◦ 2-D structure (secondary structure)
◦ 3-D structure (tertiary structure)
• CASP competition: Critical Assessment of Techniques for Protein Structure Prediction

Protein Structure Prediction Problem

◦ Given the amino acid sequence of a protein, what is its shape in three-dimensional space?

Sequence → secondary structure → 3D structure → function

Why Prediction Needed?

• The function of a protein is determined by its structure.
• Experimental methods to determine protein structure are time-consuming and expensive.
• There is a big gap between the number of available protein sequences and the number of available structures.

Growth of Protein Sequences and Structures

30000*X species; 50,000 as of 2008. Data from http://www.dna.affrc.go.jp

     

What determines structures: Inter-atomic Forces

• Covalent bond
◦ (short range, very strong) Binds atoms into molecules / macromolecules
• Hydrogen bond
◦ (short range, strong) Binds two polar groups (hydrogen + electronegative atom)
• Disulfide bond / bridge
◦ (short range, very strong) Covalent bond between sulfhydryl (sulfur + hydrogen) groups
• Hydrophobic / hydrophilic interaction (weak)
◦ Hydrogen bonding with H2O in solution
• van der Waals interaction (very weak)
◦ Nonspecific electrostatic attractive force
• Electrostatic forces
◦ Oppositely charged side chains form salt bridges

Secondary Structure Prediction (2D)

• For each residue in a protein structure, there are three possible states: α (α-helix), β (β-strand), t (others).

[Figure: amino acid sequence mapped to secondary structure sequence]

• Currently the accuracy of secondary structure prediction methods is about 80-82% (2006). The theoretical upper limit is 90%, because secondary structure assignment in real proteins is uncertain to about 10%.
• Secondary structure prediction can provide useful information to improve other sequence and structure analysis methods, such as sequence alignment and 3-D modeling.

http://bioinf.cs.ucl.ac.uk/psipred/psiform.html

PSSP: Protein Secondary Structure Prediction

Three Generations
• Based on statistical information of single amino acids
• Based on local amino acid interaction (segments). Typically a segment contains 11-21 amino acids
• Based on evolutionary information of homologous sequences

Formulate PSSP as a machine learning classification problem

• Use a sliding window that moves along the amino acid sequence
◦ Each window denotes an instance
◦ Each amino acid inside the window denotes an attribute
◦ The known secondary structure of the central amino acid is the class label
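The sliding-window formulation can be sketched in a few lines of Python. This is only an illustration: the function name `sliding_windows`, the padding symbol `-` for positions beyond the ends of the chain, and the window length used below are assumptions, not part of the lecture.

```python
def sliding_windows(sequence, labels, window=13, pad="-"):
    """Turn one labelled protein into (window, class label) instances.

    The pad symbol for positions outside the chain is an assumption
    made for this sketch; the slides do not say how ends are handled.
    """
    if len(sequence) != len(labels):
        raise ValueError("sequence and labels must have the same length")
    if window % 2 == 0:
        raise ValueError("window must be odd so that it has a central residue")
    half = window // 2
    padded = pad * half + sequence + pad * half
    instances = []
    for i, label in enumerate(labels):
        # The window is centred on residue i; its label is the class.
        instances.append((padded[i:i + window], label))
    return instances


# First ten residues of the example protein used later in the lecture.
examples = sliding_windows("ALHEASGPSV", "hhhhhooooe", window=5)
for attributes, label in examples[:3]:
    print(attributes, label)
```

Each tuple is one training instance: the characters of the window are the attributes, and the label of the central residue is the class.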

How to formulate protein secondary structure prediction as a machine learning problem?

• A set of “examples” is generated from sequences with known secondary structures
• Examples form a training set
• Build a neural network classifier
• Apply the classifier to a sequence with unknown secondary structure

Introduction to Neural Network

• What is an Artificial Neural Network?
◦ An extremely simplified model of the brain
◦ Essentially a function approximator
◦ Transforms inputs into outputs to the best of its ability

How Do Neural Networks Work?

• A neuron (perceptron) is a single-layer NN
• The output of a neuron is a function of the weighted sum of the inputs plus a bias

Activation Function

• Binary activation function
◦ f(x) = 1 if x >= 0
◦ f(x) = 0 otherwise
• The most common sigmoid function used is the logistic function
◦ $f(x) = 1/(1 + e^{-x})$
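As a small sketch of the two activation functions and of a single neuron's output (weighted sum of the inputs plus a bias, passed through the activation), one could write the following. The function names and the example weights are illustrative assumptions.

```python
import math


def binary_step(x):
    """Binary activation: 1 if x >= 0, otherwise 0."""
    return 1.0 if x >= 0 else 0.0


def logistic(x):
    """Logistic sigmoid: 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))


def neuron_output(inputs, weights, bias, activation=logistic):
    """Apply the activation to the weighted sum of the inputs plus a bias."""
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(net)


# Example weights chosen by hand for illustration only.
print(neuron_output([1, 0], [0.5, 0.5], -0.5, binary_step))
print(neuron_output([0, 0], [1.0, 1.0], 0.0))
```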

Multi-Layer Feedforward NN Example

• XOR problem (capable of nonlinear classification)
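To make the XOR example concrete, here is a minimal sketch of a two-layer network with binary activations that computes XOR. The weights are chosen by hand for illustration (one hidden unit acts as OR, the other as AND); they are not taken from the slide's figure.

```python
def step(x):
    """Binary activation: 1 if x >= 0, otherwise 0."""
    return 1 if x >= 0 else 0


def xor_network(x1, x2):
    # Hidden layer (hand-picked weights, an assumption of this sketch).
    h_or = step(x1 + x2 - 0.5)    # fires if at least one input is 1
    h_and = step(x1 + x2 - 1.5)   # fires only if both inputs are 1
    # Output fires when OR is on and AND is off.
    return step(h_or - h_and - 0.5)


for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_network(a, b))
```

A single perceptron cannot represent this function because the two classes are not linearly separable; the hidden layer is what makes it possible.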

Where Do The Weights Come From?

• The weights in a neural network are the most important factor in determining its function
• Training is the act of presenting the network with some sample data and modifying the weights to better approximate the desired function (class labels)
◦ Supervised training: supplies the neural network with inputs and the desired outputs; the response of the network to the inputs is measured
◦ The weights are modified to reduce the difference between the actual and desired outputs

Training in Perceptron Neural Net

Training a perceptron: find the weights $W$ that minimize the error function

$$E = \sum_{i=1}^{P} \left[ F(W \cdot X_i) - t(X_i) \right]^2$$

P: number of training data
$X_i$: training vectors
$F(W \cdot X_i)$: output of the perceptron
$t(X_i)$: target value for $X_i$

Use steepest descent:

compute gradient:

$$\nabla E = \left( \frac{\partial E}{\partial w_1}, \frac{\partial E}{\partial w_2}, \frac{\partial E}{\partial w_3}, \ldots, \frac{\partial E}{\partial w_N} \right)$$

update weight vector:

$$W_{new} = W_{old} - \epsilon \nabla E$$

iterate ($\epsilon$: learning rate)
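The steepest-descent procedure above can be sketched as follows. The sketch assumes that F is the logistic function, folds the bias into the weight vector by appending a constant input of 1, and uses the OR function as toy training data; the learning rate and iteration count are arbitrary choices, none of which come from the slides.

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def perceptron(W, x):
    """F(W . X): logistic output of the weighted sum (F is assumed logistic)."""
    return logistic(sum(w * v for w, v in zip(W, x)))


def error(W, X, t):
    """E = sum over training vectors of [F(W . X_i) - t(X_i)]^2."""
    return sum((perceptron(W, xi) - ti) ** 2 for xi, ti in zip(X, t))


def gradient(W, X, t):
    """Partial derivatives of E with respect to each weight."""
    grad = [0.0] * len(W)
    for xi, ti in zip(X, t):
        out = perceptron(W, xi)
        common = 2.0 * (out - ti) * out * (1.0 - out)  # chain rule
        for n, v in enumerate(xi):
            grad[n] += common * v
    return grad


def steepest_descent(X, t, epsilon=0.5, iterations=2000):
    W = [0.0] * len(X[0])
    for _ in range(iterations):
        g = gradient(W, X, t)
        W = [w - epsilon * gn for w, gn in zip(W, g)]  # W_new = W_old - eps * grad
    return W


# Toy data: logical OR; the last component (always 1) plays the role of the bias.
X = [(0, 0, 1), (0, 1, 1), (1, 0, 1), (1, 1, 1)]
t = [0, 1, 1, 1]
W = steepest_descent(X, t)
print([round(perceptron(W, xi)) for xi in X])
```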

Back-propagation algorithm

• For a multi-layer NN, the errors of the hidden layers are not known directly
• Back-propagation searches for weight values that minimize the total error of the network over the set of training examples
◦ Forward pass: compute the outputs of all units in the network, and the error of the output layer.
◦ Backward pass: the network error is backpropagated to update the weights (credit assignment problem).

Feedforward Network Training by Backpropagation: Process Summary

• Select an architecture
• Randomly initialize weights
• While error is too large
◦ Select training pattern and feedforward to find actual network output
◦ Calculate errors and backpropagate error signals
◦ Adjust weights
• Evaluate performance using the test set

5/2/2020 Copyright G. A. Tagliarini, PhD
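The process summary can be sketched as a small pure-Python training loop. Everything specific here is an assumption made for illustration: one hidden layer of four logistic units, XOR as the training set, the learning rate, the error tolerance, the epoch cap, and the fixed random seed. Evaluation on a separate test set is omitted because the toy data has only four patterns.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(x, W_hidden, W_out):
    """Feedforward; each weight row ends with a bias weight (bias input = 1)."""
    hidden = [sigmoid(sum(w * v for w, v in zip(row, x + [1.0]))) for row in W_hidden]
    out = [sigmoid(sum(w * v for w, v in zip(row, hidden + [1.0]))) for row in W_out]
    return hidden, out


def total_error(data, W_hidden, W_out):
    return sum((t - o) ** 2
               for x, target in data
               for t, o in zip(target, forward(x, W_hidden, W_out)[1]))


def train(data, n_hidden=4, rate=0.5, tolerance=0.01, max_epochs=5000, seed=0):
    rng = random.Random(seed)
    n_in, n_out = len(data[0][0]), len(data[0][1])
    # Randomly initialize weights.
    W_hidden = [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    W_out = [[rng.uniform(-1, 1) for _ in range(n_hidden + 1)] for _ in range(n_out)]
    initial_error = total_error(data, W_hidden, W_out)
    epochs = 0
    # While error is too large (with a cap on epochs as a safeguard).
    while total_error(data, W_hidden, W_out) > tolerance and epochs < max_epochs:
        for x, target in data:
            hidden, out = forward(x, W_hidden, W_out)
            # Error signals for output neurons, then for hidden neurons.
            d_out = [(t - o) * o * (1 - o) for t, o in zip(target, out)]
            d_hid = [h * (1 - h) * sum(d_out[k] * W_out[k][j] for k in range(n_out))
                     for j, h in enumerate(hidden)]
            # Adjust weights.
            for k in range(n_out):
                for j, v in enumerate(hidden + [1.0]):
                    W_out[k][j] += rate * d_out[k] * v
            for j in range(n_hidden):
                for i, v in enumerate(x + [1.0]):
                    W_hidden[j][i] += rate * d_hid[j] * v
        epochs += 1
    return W_hidden, W_out, initial_error


xor_data = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
            ([1.0, 0.0], [1.0]), ([1.0, 1.0], [0.0])]
W_hidden, W_out, initial_error = train(xor_data)
final_error = total_error(xor_data, W_hidden, W_out)
print(initial_error, final_error)
```

Whether training reaches the tolerance depends on the random initialization; the epoch cap keeps the loop from running indefinitely if it does not.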

NN for Protein Secondary Structure Prediction

How to Encode Each Amino Acid?

20-bit binary sequence
10000000000000000000-----A
01000000000000000000-----R
00100000000000000000-----N
…
00000000000000000001-----V
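A minimal sketch of this encoding is given below. The slide shows only the codes for A, R, N and V; the full ordering of the 20 amino acids used here (alphabetical by three-letter name, which is consistent with those four) is an assumption, as are the function names.

```python
# Assumed residue order; only A, R, N (first three) and V (last) are
# confirmed by the slide.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def encode_residue(residue):
    """Return the 20-bit code of one amino acid as a string of 0s and 1s."""
    if residue not in AMINO_ACIDS:
        raise ValueError("unknown amino acid: " + residue)
    bits = ["0"] * len(AMINO_ACIDS)
    bits[AMINO_ACIDS.index(residue)] = "1"
    return "".join(bits)


def encode_window(window):
    """Concatenate the residue codes of a window into one 0/1 input vector."""
    return [int(b) for residue in window for b in encode_residue(residue)]


print(encode_residue("A"))
print(encode_residue("V"))
print(len(encode_window("ALHEA")))
```

With this encoding, a window of w residues becomes a network input of 20 × w binary values.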

Evaluation of Performance: Accuracy(Q3)

Amino acid sequence:             ALHEASGPSVILFGSDVTVPPASNAEQAK
Actual secondary structure:      hhhhhooooeeeeoooeeeooooohhhhh
Predicted secondary structure:   ohhhooooeeeeoooooeeeooohhhhhh

Q3 = 22/29 = 76%

• Q3 for random prediction is 33%
• Secondary structure assignment in real proteins is uncertain to about 10%; therefore, a “perfect” prediction would have Q3 = 90%.
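Q3 is simply the fraction of residues whose predicted state matches the actual state. A short sketch, using the two state strings from the example above with the spaces between segments removed (the function name `q3` is an assumption):

```python
def q3(actual, predicted):
    """Fraction of residues whose predicted state equals the actual state."""
    if len(actual) != len(predicted):
        raise ValueError("state strings must have the same length")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


actual = "hhhhhooooeeeeoooeeeooooohhhhh"
predicted = "ohhhooooeeeeoooooeeeooohhhhhh"
score = q3(actual, predicted)
print(round(100 * score))  # 22 of 29 residues match, about 76%
```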

Performances(CASP)

CASP     Year   # of Targets   Q3     Group
CASP1    1994   6              63%    Rost and Sander
CASP2    1996   24             70%    Rost
CASP3    1998   18             75%    Jones
CASP4    2000   28             80%    Jones

Summary

• Protein bioinformatics is a very important area with many interesting problems
• Computational methods can have a big impact in medicine and molecular biology
• Protein secondary structure prediction algorithms have become quite accurate (Q3 of about 80%)

Slides Acknowledgements

• Jinbo Xu, University of Waterloo
• Xingquan Zhu

Why predict structure: Can Label Proteins by Dominant Structure  Protein classification, Structural Blasting

Amino Acids

Side chain Each amino acid is identified by its side chain, which determines the properties of this amino acid.

Side Chain Properties

Property                 Amino acids
Hydrophobic              V, L, I, M, F
Hydrophilic              N, E, Q, H, K, R, D
In-between               G, A, S, T, Y, W, C, P
Positively charged       R, H, K
Negatively charged       D, E
Polar but not charged    N, Q, S, T
Nonpolar                 A, G, I, L, M, P, V
Aromatic                 F, W, Y

Hydrophobic amino acids tend to stay in the interior of a protein, while hydrophilic ones tend to stay on the exterior.

Oppositely charged amino acids can form salt bridges.

Polar amino acids can participate in hydrogen bonding.

Alpha Helix Examples

Beta Sheet Examples

Parallel beta sheet Anti-parallel beta sheet

Calculate Outputs For Each Neuron Based On The Pattern

The output from neuron j for pattern p is $O_{pj}$, where

$$O_{pj}(net_j) = \frac{1}{1 + e^{-net_j}}$$

and

$$net_j = bias \cdot W_{bias} + \sum_k O_{pk} W_{jk}$$

Here k ranges over the inputs of neuron j, and $W_{jk}$ is the weight of the connection from input k to neuron j.

Calculate The Error Signal For Each Output Neuron

The output neuron error signal $\delta_{pj}$ is given by

$$\delta_{pj} = (T_{pj} - O_{pj}) \, O_{pj} \, (1 - O_{pj})$$

$T_{pj}$ is the target value of output neuron j for pattern p
$O_{pj}$ is the actual output value of output neuron j for pattern p

Calculate The Error Signal For Each Hidden Neuron

The hidden neuron error signal $\delta_{pj}$ is given by

$$\delta_{pj} = O_{pj} (1 - O_{pj}) \sum_k \delta_{pk} W_{kj}$$

where $\delta_{pk}$ is the error signal of a post-synaptic neuron k and $W_{kj}$ is the weight of the connection from hidden neuron j to the post-synaptic neuron k
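A small numeric check of the two error-signal formulas may help. The function names and the example values (outputs of 0.5, a target of 1, a connecting weight of 2) are illustrative assumptions.

```python
def output_delta(target, output):
    """Error signal of an output neuron: (T - O) * O * (1 - O)."""
    return (target - output) * output * (1.0 - output)


def hidden_delta(output, downstream_deltas, downstream_weights):
    """Error signal of a hidden neuron: O * (1 - O) * sum_k(delta_k * W_kj)."""
    weighted = sum(d * w for d, w in zip(downstream_deltas, downstream_weights))
    return output * (1.0 - output) * weighted


# One output neuron k with target 1 and actual output 0.5.
d_k = output_delta(target=1.0, output=0.5)
# One hidden neuron j with output 0.5, connected to k with weight 2.
d_j = hidden_delta(output=0.5, downstream_deltas=[d_k], downstream_weights=[2.0])
print(d_k, d_j)  # 0.125 0.0625
```

The factor O(1 - O) in both formulas is the derivative of the logistic activation, which is why the signals shrink as a neuron's output approaches 0 or 1.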