Protein Family Classification using Sparse Markov Transducers Intelligent Systems for Molecular Biology
Protein Family Classification
using Sparse Markov Transducers
Proceedings of Eighth International Conference on
Intelligent Systems for Molecular Biology
(ISMB2000), pp. 134-145
E. Eskin, W.N. Grundy, and Y. Singer
Cho, Dong-Yeon
Abstract
Classifying proteins into families using sparse Markov transducers (SMTs)
Estimation of a probability distribution conditioned on an input sequence
Similar to probability suffix trees
Allowing for wild-cards
Two models
Efficient data structures
Introduction
Protein Classification
Pairwise similarity
Creating profiles for protein families
Consensus patterns using motifs
HMM-based approaches
Probability suffix trees (PSTs)
A PST is a model that predicts the next symbol in a sequence based on the previous symbols.
This approach is based on the presence of common short sequences (motifs) throughout the protein family.
One drawback of PSTs is that they rely on exact matches to the conditioning sequences (e.g., 3-hydroxyacyl-CoA dehydrogenase).
VAVIGSGT
VGVLGLGT
V*V*G*GT – wild cards
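The wild-card idea above can be illustrated with a small sketch (not from the paper's code): treating each "*" as "match any single amino acid", both motifs are covered by the one pattern.

```python
import re

# Illustrative helper: convert a wild-card pattern (e.g. "V*V*G*GT", where
# "*" matches exactly one amino acid) into a regular expression.
def wildcard_to_regex(pattern):
    return "^" + pattern.replace("*", ".") + "$"

motifs = ["VAVIGSGT", "VGVLGLGT"]
pattern = wildcard_to_regex("V*V*G*GT")

# Both motifs match the single wild-card pattern even though they
# differ at the wild-card positions.
print([bool(re.match(pattern, m)) for m in motifs])  # → [True, True]
```

A PST, by contrast, would need an exact match and could not cover both motifs with one conditioning sequence.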
Sparse Markov Transducers (SMTs)
A generalization of PSTs
It can condition the probability model over a sequence that contains wild-cards.
In a transducer, the input symbol alphabet and output symbol
alphabet can be different.
Two methods
Single amino acid
Protein family
Efficient data structure
Experiments
Pfam database of protein families
Sparse Markov Transducers
A Markov Transducer of Order L
Conditional probability distribution
P(Y_t | X_t X_{t-1} X_{t-2} X_{t-3} ... X_{t-(L-1)})
X_k are random variables over an input alphabet
Y_k is a random variable over an output alphabet
Sparse Markov Transducer
Conditional probability distribution
P(Y_t | φ^{n_1}(X_{t_1}) φ^{n_2}(X_{t_2}) ... φ^{n_k}(X_{t_k}))
φ: the wild card, matching any single input symbol; n_i: the number of wild cards preceding the i-th conditioned symbol
t_i = t - (Σ_{j=1}^{i} n_j) - (i - 1)
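The position formula t_i = t - Σ_{j=1}^{i} n_j - (i - 1) can be checked with a short sketch (names are illustrative, not from the paper's implementation):

```python
# Given wild-card run lengths n_1..n_k, compute the input positions
# t_i = t - sum_{j<=i} n_j - (i - 1) that the SMT conditions on.
def conditioning_positions(t, wildcard_runs):
    positions = []
    consumed = 0  # wild cards consumed so far; (i - 1) counts conditioned symbols
    for i, n in enumerate(wildcard_runs, start=1):
        consumed += n
        positions.append(t - consumed - (i - 1))
    return positions

# With t = 10 and runs (1, 1, 1), i.e. φX φX φX, the model conditions
# on X_9, X_7, X_5.
print(conditioning_positions(10, [1, 1, 1]))  # → [9, 7, 5]
```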
i
Two approaches for SMT-based protein classification
A prediction model for each family: single amino acid
A single model for the entire database: protein family
Sparse Markov Trees
Representationally equivalent to SMTs
The topology of a tree encodes the positions of the wild-cards in the conditioning sequence of the probability distribution.
[Figure: example sparse prediction trees; wild-card (φ) nodes along the tree paths correspond to conditioning sequences such as *C***C and *A**C, applied to input strings such as ACAAAC, AACCC, CCADC, and BAACC]
Training a Prediction Tree
A set of training examples
The input symbols are used to identify which leaf node is associated with that training example.
The output symbol is then used to update the count of the appropriate predictor.
The predictor keeps counts of each output symbol seen by that predictor.
We smooth each count by adding a constant value to the count of each output symbol (cf. Dirichlet distribution).
Example (leaf u1): training pairs such as (DACDADDDCAA, C) and (CAAAACAD, D) update the leaf's output counts; a new input AACCAAA, ? is then predicted from the smoothed counts, e.g. P(C) = 0.5, P(D) = 0.5
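A minimal sketch of a leaf predictor with additive (pseudocount) smoothing, as described above; the alphabet and smoothing constant are illustrative choices, not taken from the paper:

```python
from collections import Counter

class LeafPredictor:
    def __init__(self, alphabet, alpha=0.5):
        self.alphabet = alphabet
        self.alpha = alpha      # constant added to the count of every output symbol
        self.counts = Counter()

    def update(self, output_symbol):
        # Each training example routed to this leaf updates one output count.
        self.counts[output_symbol] += 1

    def predict(self, output_symbol):
        # Smoothed relative frequency of the output symbol at this leaf.
        total = sum(self.counts.values()) + self.alpha * len(self.alphabet)
        return (self.counts[output_symbol] + self.alpha) / total

leaf = LeafPredictor(alphabet="ACD")
leaf.update("C")
leaf.update("D")
print(round(leaf.predict("C"), 3))  # → 0.429
```

With no smoothing an unseen symbol would get probability zero; the pseudocounts keep every output symbol possible.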
Mixture of Sparse Prediction Trees
We do not know which tree topology can best estimate the distribution.
A mixture technique employs a weighted sum of trees as a predictor.
P^t(Y | X^t) = Σ_T w_T^t P_T(Y | X^t) / Σ_T w_T^t
Updating the weight of each tree for each input string in the data set, based on how well the tree performed on predicting the output:
w_T^{t+1} = w_T^t · P_T(y_t | x^t)
so that
w_T^{t+1} = w_T^1 · Π_{i=1}^{t} P_T(y_i | x^i)
The prior probability of a tree is defined by the topology of the tree.
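The mixture prediction and weight update above can be sketched as follows; the two fixed toy predictors stand in for trained trees:

```python
def mixture_predict(weights, tree_probs):
    # tree_probs[T] = P_T(Y | x^t) for the candidate output Y.
    # Weighted average of the trees' predictions, normalized by total weight.
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, tree_probs)) / total

def update_weights(weights, observed_probs):
    # observed_probs[T] = P_T(y_t | x^t) for the output actually observed:
    # each tree's weight is multiplied by how well it predicted.
    return [w * p for w, p in zip(weights, observed_probs)]

weights = [1.0, 1.0]                          # uniform prior over two toy trees
print(mixture_predict(weights, [0.8, 0.2]))   # → 0.5
weights = update_weights(weights, [0.8, 0.2])
print(mixture_predict(weights, [0.8, 0.2]))   # → 0.68
```

After one observation the better-predicting tree dominates the mixture, which is exactly the intended effect of the multiplicative update.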
Implementation of SMTs
Two important parameters
MAX_DEPTH: the maximum depth of the tree
MAX_PHI: the maximum number of wild-cards at every node
Ten trees in the mixture if MAX_DEPTH = 2 and MAX_PHI = 1
Template tree
We only store those nodes which are reached during training (e.g. AA, AC, and CD).
Efficient Data Structures
Performance of the SMT typically improves with higher MAX_PHI and MAX_DEPTH.
Memory usage becomes the bottleneck: it restricts these parameters to values that allow the tree to fit in memory.
Lazy Evaluation
We store the tails of the training sequences and recompute parts of the tree on demand when necessary.
Example with EXPAND_SEQUENCE_COUNT = 4: a node stores tails such as ACDACAC(D); once more than four tails have accumulated, e.g. ACDACAC(A), DACADAC(C), DACAAAC(D), ACACDAC(A), ADCADAC(D), the node is expanded.
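A sketch of this lazy evaluation (class and method names are illustrative, not from the paper's implementation): a node buffers (input tail, output) pairs and only builds its subtree once the buffer exceeds the threshold.

```python
EXPAND_SEQUENCE_COUNT = 4

class LazyNode:
    def __init__(self):
        self.pending = []     # buffered (tail, output) training examples
        self.children = None  # subtree, built on demand

    def add(self, tail, output):
        self.pending.append((tail, output))
        if self.children is None and len(self.pending) > EXPAND_SEQUENCE_COUNT:
            self.expand()

    def expand(self):
        # Recompute the subtree from the stored tails: group by first symbol.
        self.children = {}
        for tail, output in self.pending:
            self.children.setdefault(tail[0], []).append((tail[1:], output))

node = LazyNode()
for tail, out in [("ACDACAC", "A"), ("DACADAC", "C"),
                  ("DACAAAC", "D"), ("ACACDAC", "A")]:
    node.add(tail, out)
print(node.children)          # → None (only four tails stored so far)
node.add("ADCADAC", "D")      # fifth tail triggers expansion
print(sorted(node.children))  # → ['A', 'D']
```

Until expansion the node costs only the stored tails, which is what lets MAX_DEPTH and MAX_PHI be pushed higher for the same memory budget.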
Methodology
Data
Two versions of the Pfam database
Version 1.0: for comparing results to previous work
Version 5.2: the latest version
175 protein families
A total of 15,610 single-domain protein sequences containing a total of 3,560,959 residues
Training and test data with a ratio of 4:1 for each family
transmembrane receptor: 530 protein sequences (424 + 106)
The 424 sequences of the training set give 108,858 subsequences that are used to train the model.
Building SMT Prediction Models
A prediction model for each protein family
A sliding window of size 11
Prediction of the middle symbol a6 using neighboring symbols
The input symbols are ordered a5 a7 a4 a8 a3 a9 a2 a10 a1 a11 (nearest neighbors first, alternating left and right).
MAX_DEPTH = 7 and MAX_PHI = 1
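The sliding-window encoding above can be sketched as follows: for each window of 11 residues, the middle symbol is the output and the neighbors, ordered nearest-first and alternating left/right (a5 a7 a4 a8 a3 a9 a2 a10 a1 a11), are the input.

```python
def window_pairs(sequence, size=11):
    mid = size // 2
    pairs = []
    for start in range(len(sequence) - size + 1):
        w = sequence[start:start + size]
        inputs = []
        for d in range(1, mid + 1):
            inputs.append(w[mid - d])  # left neighbor at distance d
            inputs.append(w[mid + d])  # right neighbor at distance d
        pairs.append(("".join(inputs), w[mid]))
    return pairs

inp, out = window_pairs("ABCDEFGHIJK")[0]
print(inp, out)  # → EGDHCIBJAK F
```

Ordering the input nearest-first means a shallow tree (small MAX_DEPTH) still conditions on the residues closest to the predicted position.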
Classification of a Sequence using a SMT
Prediction Model
Computation of the likelihood for an unknown sequence
A sequence is classified into a family by computing the likelihood of the fit for each of the 175 models.
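The classification step can be sketched as a log-likelihood comparison; the constant toy models below stand in for trained per-family SMT prediction models, and the names are illustrative:

```python
import math

def classify(pairs, models):
    # pairs: (input context, output symbol) pairs from the sliding window.
    # models: family name -> callable giving P(output | input context).
    # Pick the family whose model assigns the highest log-likelihood.
    best_family, best_loglik = None, float("-inf")
    for family, prob in models.items():
        loglik = sum(math.log(prob(inp, out)) for inp, out in pairs)
        if loglik > best_loglik:
            best_family, best_loglik = family, loglik
    return best_family

toy_models = {"globin": lambda i, o: 0.2, "kinase": lambda i, o: 0.05}
print(classify([("EGDH", "F"), ("GDHC", "A")], toy_models))  # → globin
```

Summing log-probabilities rather than multiplying raw probabilities avoids numerical underflow on long sequences.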
Building the SMT Classifier Model
Estimation of the probability over protein families given a
sequence of amino acids
Input sequence: an amino acid sequence from a protein family
Output symbol: the protein family name
A sliding window of 10 amino acids: a1,…,a10
MAX_DEPTH=5 and MAX_PHI=1
Classification of a Sequence using an SMT Classifier
Each position of the sequence gives us a probability over
the 175 families measuring how likely the substring
originated from each family.
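One simple way to combine the per-position family distributions into a single decision is to sum log-probabilities across positions and pick the best-scoring family; this combination rule is an assumption for illustration, not a detail taken from the paper:

```python
import math

def combine_positions(position_dists):
    # position_dists: one dict per sequence position mapping
    # family name -> probability that the substring came from that family.
    families = position_dists[0].keys()
    scores = {f: sum(math.log(d[f]) for d in position_dists) for f in families}
    return max(scores, key=scores.get)

dists = [{"globin": 0.7, "kinase": 0.3},
         {"globin": 0.6, "kinase": 0.4}]
print(combine_positions(dists))  # → globin
```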
Results
Time-Space-Performance tradeoffs
Results of Protein Classification using SMTs
The SMT models outperform the PST models.
SMT Classifier > SMT Prediction > PST Prediction
Discussion
Sparse Markov Transducers (SMTs)
We have presented two methods for protein
classification using sparse Markov transducers (SMTs).
Future Work
Incorporating biological information into the model
such as Dirichlet mixture priors
Combining a generative and discriminative model
Using both positive and negative examples in training