Machine Learning Challenges in Location Proteomics Robert F. Murphy Departments of Biological Sciences and Biomedical Engineering & Center for Automated Learning and Discovery Carnegie Mellon University.


Machine Learning Challenges in
Location Proteomics
Robert F. Murphy
Departments of Biological Sciences and
Biomedical Engineering &
Center for Automated Learning and Discovery
Carnegie Mellon University
Protein characteristics relevant to
systems approach
sequence
 structure
 expression level
 activity
 partners

location
Subcellular locations from major
protein databases


Giantin
 Entrez: /note="a new 376kD Golgi complex
outher membrane protein"
 SwissProt: INTEGRAL MEMBRANE
PROTEIN. GOLGI MEMBRANE.
GPP130
 Entrez: /note="GPP130; type II Golgi
membrane protein"
 SwissProt: nothing
More questions than answers

We learned that Giantin and GPP130 are
both Golgi proteins, but do we know:
 What part (i.e., cis, medial, trans) of the
Golgi complex they each are found in?
 If they have the same subcellular
distribution?
 If they also are found in other
compartments?
Vocabulary is part of the problem
Different investigators may use different terms
to refer to the same pattern or the same term to
refer to different patterns
 Efforts to create restricted vocabularies (e.g.,
Gene Ontology consortium) for location have
been made

SWALL entries for giantin and gpp130
ID   GIAN_HUMAN   STANDARD;      PRT;  3259 AA.
AC   Q14789; Q14398;
GN   GOLGB1.
DR   GO; GO:0000139; C:Golgi membrane; TAS.
DR   GO; GO:0005795; C:Golgi stack; TAS.
DR   GO; GO:0016021; C:integral to membrane; TAS.
DR   GO; GO:0007030; P:Golgi organization and biogenesis; TAS.

ID   O00461       PRELIMINARY;   PRT;   696 AA.
AC   O00461;
GN   GPP130.
DR   GO; GO:0005810; C:endocytotic transport vesicle; TAS.
DR   GO; GO:0005801; C:Golgi cis-face; TAS.
DR   GO; GO:0005796; C:Golgi lumen; TAS.
DR   GO; GO:0016021; C:integral to membrane; TAS.
Words are not enough
Still don’t know how similar the location
patterns of these proteins are
 Restricted vocabularies do not provide the
necessary complexity and specificity

Needed: Systematic Approach
•Need to advance past “cartoon” view of subcellular location
•Need systematic, quantitative approach to protein location
•Need new methods for accurately and objectively determining the subcellular location pattern of all proteins
•Distinct from drug screening by low-resolution microscopy
First Decision Point
• Classification by direct (pixel-by-pixel) comparison of individual images to known patterns is not useful, since
   different cells have different shapes, sizes, orientations
   organelles within cells are not found in fixed locations
• Therefore, use feature-based methods rather than (pixel) model-based methods
Input Images
Created 2D image database for HeLa cells
 Ten classes covering all major subcellular
structures: Golgi, ER, mitochondria,
lysosomes, endosomes, nuclei, nucleoli,
microfilaments, microtubules
 Included classes that are similar to each
other

Example 2D Images of HeLa
Features: SLF
Developed sets of Subcellular Location
Features (SLF) containing features of
different types
 Motivated in part by descriptions used by
biologists (e.g., punctate, perinuclear)
 First type of features derived from
morphological image processing - finding
objects by automated thresholding

Features: Morphological





Number of fluorescent objects per cell
Variance of the object sizes
Ratio of the largest object to the smallest
Average distance of objects to the ‘center
of fluorescence’
Average “roundness” of objects
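The object-based features listed above can be sketched in a few lines. This is a minimal illustration, not the SLF implementation: the threshold is passed in by hand (the slides use automated thresholding), objects are taken to be 4-connected pixel groups, and the names `label_objects`, `morphological_features` and `example_image` are hypothetical. Roundness is omitted.

```python
from collections import deque
from math import hypot


def label_objects(mask):
    """Group True pixels into 4-connected objects (lists of (row, col))."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    objects = []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not seen[r][c]:
                seen[r][c] = True
                queue, pixels = deque([(r, c)]), []
                while queue:  # breadth-first flood fill
                    y, x = queue.popleft()
                    pixels.append((y, x))
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < rows and 0 <= nx < cols \
                                and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                objects.append(pixels)
    return objects


def morphological_features(image, threshold):
    """Compute a few object-based features from a 2D intensity image."""
    mask = [[value > threshold for value in row] for row in image]
    objects = label_objects(mask)
    if not objects:
        raise ValueError("no pixels above threshold")
    sizes = [len(obj) for obj in objects]
    mean_size = sum(sizes) / len(sizes)
    # Center of fluorescence: intensity-weighted centroid of the whole image.
    total = sum(sum(row) for row in image)
    cof_r = sum(r * v for r, row in enumerate(image) for v in row) / total
    cof_c = sum(c * v for row in image for c, v in enumerate(row)) / total
    distances = [
        hypot(sum(r for r, _ in obj) / len(obj) - cof_r,
              sum(c for _, c in obj) / len(obj) - cof_c)
        for obj in objects
    ]
    return {
        "num_objects": len(objects),
        # Population variance of object sizes (in pixels).
        "size_variance": sum((s - mean_size) ** 2 for s in sizes) / len(sizes),
        "size_ratio": max(sizes) / min(sizes),
        "mean_distance_to_cof": sum(distances) / len(distances),
    }


# Made-up 5x5 image: one 4-pixel object and one 1-pixel object.
example_image = [
    [0, 0, 0, 0, 0],
    [0, 9, 9, 0, 0],
    [0, 9, 9, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 9],
]
features = morphological_features(example_image, threshold=5)
```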
Features: Haralick texture

Give information on correlations in
intensity between adjacent pixels to
answer questions like
 is the pattern more like a checkerboard or
alternating stripes?
 is the pattern highly organized (ordered)
or more scattered (disordered)?
Example: Difference detected by
texture feature “entropy”
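A rough sketch of how an entropy texture feature can separate ordered from disordered patterns. It builds a symmetric gray-level co-occurrence matrix for horizontally adjacent pixels and takes its Shannon entropy. The function names and the tiny two-level test images are illustrative assumptions; the full Haralick set uses several directions and many more statistics.

```python
from math import log2


def cooccurrence(image, levels, dy=0, dx=1):
    """Normalized, symmetric co-occurrence matrix for pixel pairs offset by (dy, dx)."""
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                a, b = image[r][c], image[r2][c2]
                counts[a][b] += 1
                counts[b][a] += 1  # count each pair in both orders
    total = sum(map(sum, counts))
    return [[n / total for n in row] for row in counts]


def glcm_entropy(p):
    """Shannon entropy (bits) of a co-occurrence matrix; higher means less ordered."""
    return -sum(v * log2(v) for row in p for v in row if v > 0)


# Made-up two-level images: a constant field, a checkerboard, a disordered pattern.
flat = [[0] * 4 for _ in range(4)]
checker = [[(r + c) % 2 for c in range(4)] for r in range(4)]
disordered = [[0, 0, 1, 1], [1, 0, 1, 0], [0, 1, 1, 0], [1, 1, 0, 0]]

entropy_flat = glcm_entropy(cooccurrence(flat, levels=2))
entropy_checker = glcm_entropy(cooccurrence(checker, levels=2))
entropy_disordered = glcm_entropy(cooccurrence(disordered, levels=2))
```

In this toy case the disordered image uses all four co-occurrence cells and so scores higher than the strictly alternating checkerboard.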
Features: Zernike moment
Measure degree to which pattern matches a
particular Zernike polynomial
 Give information on basic nature of pattern
(e.g., circle, donut) and sizes (frequencies)
present in pattern
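As an illustration of "degree of match to a Zernike polynomial", the sketch below evaluates the standard radial polynomial and projects an image onto Z(n,m) over the unit disk. The mapping of pixels to the disk, the lack of pixel-area normalization, and the function names are assumptions made for this example, not details taken from the slides.

```python
import cmath
from math import atan2, factorial, hypot, pi


def radial_poly(n, m, rho):
    """Zernike radial polynomial R_n^|m|(rho); requires n - |m| even and |m| <= n."""
    m = abs(m)
    if m > n or (n - m) % 2:
        raise ValueError("need |m| <= n and n - |m| even")
    return sum(
        (-1) ** k * factorial(n - k)
        / (factorial(k) * factorial((n + m) // 2 - k) * factorial((n - m) // 2 - k))
        * rho ** (n - 2 * k)
        for k in range((n - m) // 2 + 1)
    )


def zernike_magnitude(image, n, m):
    """Magnitude of the (unnormalized) Zernike moment of a square image."""
    size = len(image)
    center = (size - 1) / 2   # pixel grid is centered on the disk
    radius = size / 2         # image edge maps to rho = 1
    total = 0j
    for r, row in enumerate(image):
        for c, value in enumerate(row):
            y, x = (r - center) / radius, (c - center) / radius
            rho = hypot(x, y)
            if rho <= 1 and value:
                theta = atan2(y, x)
                # Project onto the complex conjugate of the basis function.
                total += value * radial_poly(n, m, rho) * cmath.exp(-1j * m * theta)
    return abs(total) * (n + 1) / pi


# A uniform 5x5 patch has 4-fold symmetry, so its Z(2,2) component vanishes.
uniform = [[1] * 5 for _ in range(5)]
z00 = zernike_magnitude(uniform, 0, 0)
z22 = zernike_magnitude(uniform, 2, 2)
```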

Examples of Zernike Polynomials
Z(2,0)
Z(4,4)
Z(10,6)
Subcellular Location Features: 2D
Morphological features
 Haralick texture features
 Zernike moment features
 Geometric features
 Edge features

2D Classification Results
True    Output of the Classifier (%)
Class   DNA   ER  Gia  Gpp  Lam  Mit  Nuc  Act  TfR  Tub
DNA      99    1    0    0    0    0    0    0    0    0
ER        0   97    0    0    0    2    0    0    0    1
Gia       0    0   91    7    0    0    0    0    2    0
Gpp       0    0   14   82    0    0    2    0    1    0
Lam       0    0    1    0   88    1    0    0   10    0
Mit       0    3    0    0    0   92    0    0    3    3
Nuc       0    0    0    0    0    0   99    0    1    0
Act       0    0    0    0    0    0    0  100    0    0
TfR       0    1    0    0   12    2    0    1   81    2
Tub       1    2    0    0    0    1    0    0    1   95
Overall accuracy = 92% (95% for major patterns)
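The 92% figure can be checked against the 2D confusion matrix, assuming each row is a percentage of that true class and every class carries equal weight (an assumption; per-class image counts are not given here). The helper names are hypothetical.

```python
CLASSES = ["DNA", "ER", "Gia", "Gpp", "Lam", "Mit", "Nuc", "Act", "TfR", "Tub"]

# Rows: true class; columns: classifier output (percent), from the 2D results.
CONFUSION = [
    [99, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 97, 0, 0, 0, 2, 0, 0, 0, 1],
    [0, 0, 91, 7, 0, 0, 0, 0, 2, 0],
    [0, 0, 14, 82, 0, 0, 2, 0, 1, 0],
    [0, 0, 1, 0, 88, 1, 0, 0, 10, 0],
    [0, 3, 0, 0, 0, 92, 0, 0, 3, 3],
    [0, 0, 0, 0, 0, 0, 99, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 100, 0, 0],
    [0, 1, 0, 0, 12, 2, 0, 1, 81, 2],
    [1, 2, 0, 0, 0, 1, 0, 0, 1, 95],
]


def overall_accuracy(matrix):
    """Mean of the diagonal, i.e. average per-class accuracy."""
    return sum(matrix[i][i] for i in range(len(matrix))) / len(matrix)


def most_confused(matrix, names):
    """Return (true class, predicted class, percent) for the largest off-diagonal cell."""
    value, true_name, predicted_name = max(
        (matrix[i][j], names[i], names[j])
        for i in range(len(matrix))
        for j in range(len(matrix))
        if i != j
    )
    return true_name, predicted_name, value


accuracy = overall_accuracy(CONFUSION)
worst_pair = most_confused(CONFUSION, CLASSES)
```

The largest off-diagonal entry is between the two Golgi proteins, which is the hardest distinction in this class set.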
Human Classification Results
True          Output of the Classifier
Class         DNA    ER   Gia   GPP   LAM   Mit   Nuc   Act   TfR   Tub
DNA          100%    0%    0%    0%    0%    0%    0%    0%    0%    0%
ER             0%   90%    0%    0%    3%    6%    0%    0%    0%    0%
Giantin        0%    0%   56%   36%    3%    3%    0%    0%    0%    0%
GPP130         0%    0%   53%   43%    0%    0%    0%    0%    3%    0%
LAMP2          0%    0%    6%    0%   73%    0%    0%    0%   20%    0%
Mitochond.     0%    3%    0%    0%    0%   96%    0%    0%    0%    0%
Nucleolin      0%    0%    0%    0%    0%    0%  100%    0%    0%    0%
Actin          0%    0%    0%    0%    0%    0%    0%  100%    0%    0%
TfR            0%   13%    0%    0%    3%    0%    0%    0%   83%    0%
Tubulin        0%    3%    0%    0%    0%    0%    0%    3%    0%   93%
Overall accuracy = 83% (92% for major patterns)
Computer vs. Human
[Figure: scatter plot of Human Accuracy (%) versus Computer Accuracy (%), both axes running from 40 to 100]
Extending to 3D:
Labeling approach
Total protein labeled with Cy5 reactive dye
 DNA labeled with PI
 Specific Proteins labeled with primary Ab +
Alexa488 conjugated secondary Ab

3D Image Set
Nuclear
ER
Giantin
gpp130
Lysosomal
Mitoch.
Nucleolar
Actin
Endosomal
Tubulin
New features to measure “z”
asymmetry
2D features treated x and y equivalently
 For 3D images, while it makes sense to treat
x and y equivalently (cells don’t have a
“left” and “right”), z should be treated
differently (“top” and “bottom” are not the
same)
 We designed features to separate distance
measures into x-y component and z
component
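One way to picture the separation of a distance into an x-y component and a z component is shown below. The choice of a signed z offset (so that "above" and "below" differ) and the function name are assumptions for illustration; the actual 3D feature definitions are not spelled out in the slides.

```python
from math import hypot


def split_distance(point, reference):
    """Split the offset from reference to point, both (x, y, z), into two parts.

    Returns (horizontal distance in the x-y plane, signed vertical offset in z).
    """
    dx, dy, dz = (p - q for p, q in zip(point, reference))
    return hypot(dx, dy), dz


# Hypothetical object centroid relative to a reference point such as the
# center of fluorescence.
horizontal, vertical = split_distance((3, 4, 5), (0, 0, 2))
```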

Classification Results for 3D images
Overall accuracy = 97%
How to do even better
Biologists interpreting images of protein
localization typically view many cells
before reaching a conclusion
 Can simulate this by classifying sets of cells
from the same microscope slide
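A simple way to classify a set of cells is to classify each cell separately and take the most frequent label. Plurality voting is an assumption here; the slides do not state which combination rule was used, and `classify_set` is a hypothetical name.

```python
from collections import Counter


def classify_set(single_cell_labels):
    """Return the most frequent single-cell label in a set of cells.

    Ties go to the label that appeared first (Counter preserves insertion order).
    """
    if not single_cell_labels:
        raise ValueError("empty set of cells")
    return Counter(single_cell_labels).most_common(1)[0][0]


# Made-up single-cell outputs for nine cells from one slide.
votes = ["Gia", "Gpp", "Gia", "Gia", "Gpp", "Gia", "Gia", "Gpp", "Gia"]
set_label = classify_set(votes)
```

Even with a third of the individual cells misclassified in this made-up set, the set-level call is still the majority class.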

Classification of Sets of 3D Images
True     Predicted Class
Class    DNA   ER  Gia  Gpp Lyso Mito Nucl Actin Endo  Tub
DNA      100    0    0    0    0    0    0     0    0    0
ER         0   99    0    0    0    0    0     0    0    0
Gia        0    0  100    0    0    0    0     0    0    0
Gpp        0    0    0   99    0    0    0     0    0    0
Lyso       0    0    0    0  100    0    0     0    0    0
Mito       0    0    0    0    0  100    0     0    0    0
Nucl       0    0    0    0    0    0  100     0    0    0
Actin      0    0    0    0    0    0    0   100    0    0
Endo       0    0    0    0    0    0    0     0  100    0
Tub        0    0    0    0    0    0    0     0    0   99
Set size 9, Overall accuracy = 99.7%
First Conclusion

Description of subcellular locations for
systems biology should be implemented
using a data-driven approach rather than a
knowledge-capture approach, but…
Subcellular Location Image
Finder

(Have automated system for finding images in on-line journal
articles that match a particular pattern - enables connection
between new images and previously published results)
[Figure: SLIF processing pipeline. Recoverable box labels: caption understanding (scope; entity extraction of proteins, cells, drugs, experimental conditions, …), yielding annotated scopes; figure panel splitting; label finding; label matching (alignment between caption entities and panel labels); panel classification and micrograph analysis (image type, scale, subcellular pattern analysis, …), yielding annotated image panels. Citations shown in the diagram: [Murphy et al, 2001], [Cohen et al, 2003].]
Image Similarity
Classification power of features implies that
they capture essential characteristics of
protein patterns
 Can be used to measure similarity between
patterns

Clustering by Image Similarity
Ability to measure similarity of protein
patterns allows us for the first time to create
a systematic, objective, framework for
describing subcellular locations
 Ideal for database references
 One way is by creating a Subcellular
Location Tree
 Illustration: Build hierarchical dendrogram
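A hierarchical dendrogram can be built by repeatedly merging the two closest groups. The sketch below uses average linkage on plain Euclidean distance and returns the tree as nested tuples; the linkage rule, the function name and the two-feature profiles are made up for illustration and are not the values behind the tree shown next.

```python
from math import dist  # Python 3.8+


def build_tree(profiles):
    """Agglomerative clustering. profiles maps a name to a feature vector.

    Returns nested tuples, e.g. ("Tub", ("Gia", "Gpp")).
    """
    # Each cluster is keyed by its (sub)tree and stores its member vectors.
    clusters = {name: [vector] for name, vector in profiles.items()}

    def average_linkage(a, b):
        return sum(dist(u, v) for u in a for v in b) / (len(a) * len(b))

    while len(clusters) > 1:
        keys = list(clusters)
        left, right = min(
            ((p, q) for i, p in enumerate(keys) for q in keys[i + 1:]),
            key=lambda pair: average_linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        merged = clusters.pop(left) + clusters.pop(right)
        clusters[(left, right)] = merged
    return next(iter(clusters))


# Hypothetical mean feature vectors for three patterns.
tree = build_tree({"Gia": [0.0, 0.0], "Gpp": [0.1, 0.0], "Tub": [5.0, 5.0]})
```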

Subcellular
Location
Tree for 10
classes in
HeLa cells
Do this for all proteins:
Location Proteomics




Can use CD-tagging (developed by Dr. Jonathan
Jarvik) to randomly tag many proteins: Infect
population of cells with a retrovirus carrying a
DNA sequence that will produce a “tag” in a
random gene in each cell
Isolate separate clones, each of which
expresses one tagged protein
Use RT-PCR to identify tagged gene in each
clone
Collect images of many cells for each clone using
fluorescence microscopy
Example images of CD-tagged
clones
(A) Glut1 gene (type 1 glucose
transporter)
(B) Tmpo gene (thymopoietin 
(C) tuba1 gene (α-tubulin)
(D) Cald gene (caldesmon 1)
(E) Ncl gene (nucleolin)
(F) Rps11 gene (ribosomal
protein S11)
(G) Hmga1 gene (high mobility
group AT-hook 1)
(H) Col1a2 gene (procollagen
type I α2)
(I) Atp5a1 gene (ATP synthase
isoform 1)
Proof of principle

Cluster 46 clones expressing different
tagged proteins based on their subcellular
location patterns
Feature selection
Use Stepwise Discriminant Analysis to
rank features based on their ability to
distinguish proteins
 Use increasing numbers of features to train
neural network classifiers and evaluate
classification accuracy over all 46 clones
 Best performance obtained with 10 features
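The ranking step can be illustrated with a simpler stand-in for Stepwise Discriminant Analysis: scoring each feature on its own by a one-way ANOVA F statistic. This is not SDA (which adds features stepwise using a multivariate criterion); it only shows the idea of ordering features by how well they separate classes. Function names and data are hypothetical, and within-class variance is assumed to be nonzero.

```python
def f_statistic(groups):
    """One-way ANOVA F for one feature; groups holds one list of values per class."""
    all_values = [v for group in groups for v in group]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(group) / len(group) for group in groups]
    between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (between / df_between) / (within / df_within)


def rank_features(data):
    """data maps class name to a list of feature vectors. Returns indices, best first."""
    n_features = len(next(iter(data.values()))[0])
    scores = [
        f_statistic([[vector[j] for vector in vectors] for vectors in data.values()])
        for j in range(n_features)
    ]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)


# Made-up data: feature 0 separates the two clones, feature 1 does not.
example = {
    "clone_A": [[1.0, 5.0], [1.2, 1.0], [0.8, 3.0]],
    "clone_B": [[3.0, 2.0], [3.2, 4.0], [2.8, 3.0]],
}
ranking = rank_features(example)
```

Classifiers would then be trained on the top 1, 2, 3, ... ranked features to find the subset size that gives the best accuracy.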

Tree building



Therefore use these 10 features with z-scored
Euclidean distance function to build SLT
Find optimal number of clusters using k-means
clustering and AIC
Find consensus hierarchical trees by randomly
dividing the images for each protein in half and
keeping branches conserved between both halves
(repeat for 50 random divisions)
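The z-scored Euclidean distance mentioned in the first step can be sketched as follows: each feature is rescaled to zero mean and unit standard deviation across all samples before distances are taken, so that features with large numeric ranges do not dominate. The use of the population standard deviation and the function names are assumptions; every feature is assumed to vary.

```python
from math import sqrt
from statistics import mean, pstdev


def zscore_columns(rows):
    """Standardize each feature (column) to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    mus = [mean(col) for col in columns]
    sigmas = [pstdev(col) for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, mus, sigmas)] for row in rows]


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Made-up feature vectors with very different scales per feature.
raw = [[1.0, 100.0], [2.0, 300.0], [3.0, 200.0]]
scaled = zscore_columns(raw)
raw_distance = euclidean(raw[0], raw[1])
scaled_distance = euclidean(scaled[0], scaled[1])
```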
Consensus Subcellular Location Tree
Examples from major clusters
Significance
Proteins clustered by location analogous to
proteins clustered by sequence (e.g., PFAM)
 Can subdivide clusters by observing
response to drugs, oncogenes, etc.
 These represent protein location states
 Base knowledge required for modeling
 Can be used to filter protein interactions

From patterns to causes
Machine learning approaches have been
previously used to find localization motifs
in protein sequences, but the set of locations
used was limited to major organelles
 High-resolution subcellular location trees
can be used to discover (recursively) new
motifs that determine location of each group
 Can include post-translational modifications

More Conclusions
Organized data collection approach is
required to capture high-resolution
information on the subcellular location of
all proteins
 Prohibitive combinatorial complexity makes
the colocalization approach infeasible, so major
effort should focus on one protein at a time

Center for Bioimage Informatics






$2.75 M CMU funding from NSF ITR
Joint with UCSB and collaborators at Berkeley
and MIT
R. Murphy (CALD/Biomed.Eng./Biol.Sci.)
Jelena Kovacevic (Biomedical Engineering)
Tom Mitchell (CALD)
Christos Faloutsos (CALD)
Acknowledgments




Former students
 Michael Boland, Mia Markey,
William Dirks, Gregory Porreca,
Edward Roques, Meel Velliste
Current grad students
 Kai Huang, Xiang Chen, Ting Zhao,
Yanhua Hu, Elvira Garcia Osuna,
Zhenzhen Kou, Juchang Hua
Funding
 NSF, NIH, Rockefeller Bros. Fund,
PA. Tobacco Settlement Fund
Collaborators/Consultants
 Simon Watkins, David Casasent,
Tom Mitchell, Christos Faloutsos,