Deborah Weisser

Partitioning Sequences Based on Association Measures
Deborah Weisser
Carnegie Mellon University
Feature boundaries
• Need to know the form and function of protein sequences to understand complex biological systems
• Features and functions cannot be determined directly
– feature positions are estimated from indirect laboratory measurements, e.g. hydrophobicity
• Use statistical measures of association to determine feature boundaries
Feature boundaries
• Proteins are composed of adjacent, non-overlapping features:
– helical, cytoplasmic, periplasmic, extracellular, intracellular, etc.
• GPCR proteins have a fixed feature pattern, although feature positions are known for only one member of the family, rhodopsin (opsd_human)
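[Figure, panel A: schematic of a GPCR from the N terminus to the C terminus, with seven transmembrane helices (1-7) in the transmembrane domain, loops cp1, cp2, and cp3 in the cytoplasmic (cp) domain, and loops ec1, ec2, and ec3 in the extracellular (ec) domain.]
[Figure, panel B: the linear sequence from N to C, marking the segments contributing to the extracellular (ec) domain, the transmembrane (helices) domain, and the cytoplasmic (cp) domain.]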
Goal: Statistically determine feature
boundaries in sequences of amino acids
SHDEGCLSSEPKPRKQSDSST
Association measures
SHDEGCLSSEPKPRKQSDSST
[Figure: the value 2.5 is shown between the adjacent residues P and R.]
2.5 is a measure of the strength of the association between P and R
Association measures
SHDEGCLSSEPKPRKQSDSST
[Figure: an association value is shown between each pair of adjacent amino acids in the sequence; the values range from 0.2 to 6.2.]
Adjacent pairs with low association measures
are candidates for partition points.
Association measures are used
to quantify correlations between
adjacent amino acids
• Yule’s Q statistic
• Mutual information
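As an illustration of the two measures named above, the following is a minimal sketch of how each could be computed for an ordered pair of adjacent residues from pair counts. The slides do not give the exact formulation used, so several choices here are assumptions: the 2x2 contingency table is built over all adjacent pairs in the training sequences, mutual information is taken in its pointwise (per-pair) form, and the function names are hypothetical.

```python
import math
from collections import Counter

def pair_counts(sequences):
    """Count ordered adjacent (left, right) residue pairs over all sequences."""
    pairs = Counter()
    for seq in sequences:
        pairs.update(zip(seq, seq[1:]))
    return pairs

def contingency(pairs, a, b):
    """2x2 table for 'left residue is a' versus 'right residue is b'."""
    total = sum(pairs.values())
    n11 = pairs[(a, b)]
    left_a = sum(c for (x, _), c in pairs.items() if x == a)
    right_b = sum(c for (_, y), c in pairs.items() if y == b)
    n12 = left_a - n11           # left is a, right is not b
    n21 = right_b - n11          # left is not a, right is b
    n22 = total - n11 - n12 - n21
    return n11, n12, n21, n22

def yules_q(pairs, a, b):
    """Yule's Q = (n11*n22 - n12*n21) / (n11*n22 + n12*n21), in [-1, 1]."""
    n11, n12, n21, n22 = contingency(pairs, a, b)
    den = n11 * n22 + n12 * n21
    # Assumption: treat an undefined Q (zero denominator) as no association.
    return (n11 * n22 - n12 * n21) / den if den else 0.0

def pointwise_mi(pairs, a, b):
    """log2( p(a,b) / (p(a) * p(b)) ) for the ordered pair (a, b)."""
    n11, n12, n21, n22 = contingency(pairs, a, b)
    total = n11 + n12 + n21 + n22
    if n11 == 0:
        return float("-inf")     # pair never observed
    return math.log2((n11 / total) / (((n11 + n12) / total) * ((n11 + n21) / total)))

def association_profile(seq, pairs, measure):
    """One association value per adjacent pair in seq."""
    return [measure(pairs, a, b) for a, b in zip(seq, seq[1:])]

# Toy usage: counts are taken from the example sequence itself, purely for
# demonstration; in practice they would come from a training set.
seq = "SHDEGCLSSEPKPRKQSDSST"
profile = association_profile(seq, pair_counts([seq]), yules_q)
print(len(profile))  # one value per adjacent pair
```

A sequence of length n yields n - 1 values, one between each pair of neighbours, which is the profile that the partitioning step operates on.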
[Figure: break positions from mutual information (MI breaks) and from hydropathy (Hydropathy breaks), marked on diagrams of the cytoplasmic (cp), transmembrane (helices), and extracellular (ec) domains, with a detail of the cytoplasmic (cp) domain showing loops cp1, cp2, and cp3.]
MI breaks: 39, 63, 76, 94, 115, 136, 155, 176, 205, 233, 255, 279, 287, 301
Hydropathy breaks: 37, 61, 74, 98, 114, 133, 153, 176, 203, 230, 253, 276, 285, 309
• The changes in association measure
values correspond to feature boundaries
• Goal: automatically detect partition points
based on association measures
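To make the agreement between the MI and hydropathy break lists above concrete, the short sketch below computes the offset between them. It assumes that the i-th MI break corresponds to the i-th hydropathy break, which the slides do not state explicitly; both lists have 14 entries, so the pairing is at least plausible.

```python
# Break positions as listed on the slide.
mi_breaks = [39, 63, 76, 94, 115, 136, 155, 176, 205, 233, 255, 279, 287, 301]
hydropathy_breaks = [37, 61, 74, 98, 114, 133, 153, 176, 203, 230, 253, 276, 285, 309]

# Assumption: breaks are paired by their order in the two lists.
offsets = [m - h for m, h in zip(mi_breaks, hydropathy_breaks)]
abs_offsets = [abs(d) for d in offsets]

print("offsets (MI - hydropathy):", offsets)
print("largest absolute offset:", max(abs_offsets))
print("mean absolute offset: %.2f" % (sum(abs_offsets) / len(abs_offsets)))
```

Under that pairing, the two sets of breaks differ by at most 8 positions, with a mean absolute offset of about 2.6 positions.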
Partitioning algorithm
• Cluster adjacent association values
– each group is represented by its mean value
• Calculate standard deviation of values over
all clusters
• Locate partition points in data based on:
– deviation from mean
– [change between adjacent values]
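The steps above can be sketched roughly as follows. This is not the author's implementation; it is a minimal illustration under several assumptions: clusters are non-overlapping windows of a fixed size, only clusters whose mean falls below the overall mean are flagged (since low association marks a candidate partition point), and the bracketed "change between adjacent values" criterion is left out. The names `window_means` and `partition_points` and the example values are hypothetical.

```python
from statistics import mean, pstdev

def window_means(values, window):
    """Group adjacent values into non-overlapping windows.

    Returns a list of (start_index, mean_of_window) tuples.
    """
    return [(i, mean(values[i:i + window]))
            for i in range(0, len(values), window)]

def partition_points(values, window=3, cutoff=1.0):
    """Start indices of windows with unusually low mean association.

    A window is flagged when its mean lies more than `cutoff` standard
    deviations below the mean over all windows (an assumed reading of
    "deviation from mean").
    """
    clusters = window_means(values, window)
    means = [m for _, m in clusters]
    overall = mean(means)
    spread = pstdev(means)   # standard deviation over all clusters
    if spread == 0:
        return []            # flat profile: nothing stands out
    return [start for start, m in clusters
            if (overall - m) / spread > cutoff]

# Hypothetical association profile with a low-association stretch in the middle.
values = [4.8, 4.5, 5.2, 0.3, 0.7, 0.2, 4.1, 5.5, 6.2]
print(partition_points(values, window=3, cutoff=1.0))
```

Here `window` and `cutoff` play the role of the two tunable quantities: how many adjacent values are averaged, and how far from the mean a cluster must lie to count as extreme.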
Parameters
• Cluster adjacent association values
– each group is represented by its mean value
– parameter: window size for computing the mean
• Calculate standard deviation of values over all clusters
• Locate partition points in data based on:
– deviation from mean
– [change between adjacent values]
– parameter: cutoff distance from the mean for a value to be considered “extreme”
Effect of cutoff threshold on partitioning in opsd_human using mutual information
Effect of window size on partitioning in opsd_human using mutual information
GPCR: different subfamilies
Class A Rhodopsin like
    Amine
    Peptide
    Hormone protein
    (Rhodopsin)
        Rhodopsin Vertebrate
            Rhodopsin Vertebrate type 1
            Rhodopsin Vertebrate type 2
            Rhodopsin Vertebrate type 3
            Rhodopsin Vertebrate type 4
            Rhodopsin Vertebrate type 5
        Rhodopsin Arthropod
        Rhodopsin Mollusc
        Rhodopsin Other
    Olfactory
    Prostanoid
    Nucleotide-like
    Cannabis
    Platelet activating factor
    Gonadotropin-releasing hormone
    Thyrotropin-releasing hormone & Secretagogue
    Melatonin
    Viral
    Lysosphingolipid & LPA (EDG)
    Leukotriene B4 receptor
    Class A Orphan/other
Class B Secretin like
Class C Metabotropic glutamate / pheromone
Class D Fungal pheromone
Class E cAMP receptors (Dictyostelium)
Frizzled/Smoothened family
GPCR: different subfamilies
Hierarchy                     Size
GPCR                          717755
    Class A                   371134
        Rhodopsin             48393
            Vertebrate        33543
                Vertebrate 1  20314
                    opsd_human  348
    Class B                   39724
    Class C                   20930
• The structure of the curve is preserved even when the dataset is small.
In progress / Future work
• Set parameters of partition algorithm
automatically
• Apply to other sources of data, types of
features
• Group amino acids into sub-classes
• Quantify the effect of training set
information content and training set size.