Machine Learning Challenges in Location Proteomics Robert F. Murphy Departments of Biological Sciences and Biomedical Engineering & Center for Automated Learning and Discovery Carnegie Mellon University.
Protein characteristics relevant to systems approach
• sequence
• structure
• expression level
• activity
• partners
• location

Subcellular locations from major protein databases
• Giantin
  • Entrez: /note="a new 376kD Golgi complex outher membrane protein"
  • SwissProt: INTEGRAL MEMBRANE PROTEIN. GOLGI MEMBRANE.
• GPP130
  • Entrez: /note="GPP130; type II Golgi membrane protein"
  • SwissProt: nothing

More questions than answers
• We learned that Giantin and GPP130 are both Golgi proteins, but do we know:
  • What part (i.e., cis, medial, trans) of the Golgi complex they each are found in?
  • If they have the same subcellular distribution?
  • If they also are found in other compartments?

Vocabulary is part of the problem
• Different investigators may use different terms to refer to the same pattern or the same term to refer to different patterns
• Efforts to create restricted vocabularies (e.g., Gene Ontology consortium) for location have been made

SWALL entries for giantin and gpp130

ID   GIAN_HUMAN  STANDARD; PRT; 3259 AA.
AC   Q14789; Q14398;
GN   GOLGB1.
DR   GO; GO:0000139; C:Golgi membrane; TAS.
DR   GO; GO:0005795; C:Golgi stack; TAS.
DR   GO; GO:0016021; C:integral to membrane; TAS.
DR   GO; GO:0007030; P:Golgi organization and biogenesis; TAS.

ID   O00461  PRELIMINARY; PRT; 696 AA.
AC   O00461;
GN   GPP130.
DR   GO; GO:0005810; C:endocytotic transport vesicle; TAS.
DR   GO; GO:0005801; C:Golgi cis-face; TAS.
DR   GO; GO:0005796; C:Golgi lumen; TAS.
DR   GO; GO:0016021; C:integral to membrane; TAS.

Words are not enough
• Still don’t know how similar the location patterns of these proteins are
• Restricted vocabularies do not provide the necessary complexity and specificity

Needed: Systematic Approach
• Need to advance past “cartoon” view of subcellular location
• Need systematic, quantitative approach to protein location
• Need new methods for accurately and objectively determining the subcellular location pattern of all proteins
• Distinct from drug screening by low-resolution microscopy

First Decision Point
• Classification by direct (pixel-by-pixel) comparison of individual images to known patterns is not useful, since
  • different cells have different shapes, sizes, orientations
  • organelles within cells are not found in fixed locations
• Therefore, use feature-based methods rather than (pixel) model-based methods

Input Images
• Created 2D image database for HeLa cells
• Ten classes covering all major subcellular structures: Golgi, ER, mitochondria, lysosomes, endosomes, nuclei, nucleoli, microfilaments, microtubules
• Included classes that are similar to each other

Example 2D Images of HeLa

Features: SLF
• Developed sets of Subcellular Location Features (SLF) containing features of different types
• Motivated in part by descriptions used by biologists (e.g., punctate, perinuclear)
• First type of features derived from morphological image processing - finding objects by automated thresholding

Features: Morphological
• Number of fluorescent objects per cell
• Variance of the object sizes
• Ratio of the largest object to the smallest
• Average distance of objects to the ‘center of fluorescence’
• Average “roundness” of objects

Features: Haralick texture
• Give information on correlations in intensity between adjacent pixels to answer questions like
  • is the pattern more like a checkerboard or alternating stripes?
  • is the pattern highly organized (ordered) or more scattered (disordered)?
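
To make the ordered-versus-disordered question concrete, here is a minimal sketch of one texture statistic, entropy, computed from a gray-level co-occurrence matrix of horizontally adjacent pixels. It is an illustration only, not the SLF implementation: the function names `cooccurrence` and `texture_entropy`, the single pixel offset, the base-2 logarithm and the tiny test images are all assumptions made for this example.

```python
import math
from collections import Counter

def cooccurrence(image, dr=0, dc=1):
    """Normalized, symmetric gray-level co-occurrence matrix.

    Returns a dict mapping (level_a, level_b) to the fraction of
    neighboring pixel pairs (offset dr, dc) with those gray levels.
    """
    counts = Counter()
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                a, b = image[r][c], image[r2][c2]
                counts[(a, b)] += 1  # count the pair in both orders
                counts[(b, a)] += 1  # so the matrix is symmetric
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}

def texture_entropy(glcm):
    """Entropy (in bits) of the co-occurrence distribution."""
    return sum(-p * math.log2(p) for p in glcm.values() if p > 0)

# Hypothetical 4x4 test images (gray levels are small integers).
flat = [[3] * 4 for _ in range(4)]                           # uniform
checker = [[(r + c) % 2 for c in range(4)] for r in range(4)]  # ordered
disordered = [[0, 3, 1, 2],
              [2, 0, 3, 1],
              [1, 1, 0, 3],
              [3, 2, 2, 0]]

print(texture_entropy(cooccurrence(flat)))        # 0.0
print(texture_entropy(cooccurrence(checker)))     # 1.0
print(texture_entropy(cooccurrence(disordered)))  # larger than the checkerboard
```

A highly ordered pattern concentrates the co-occurrence distribution on a few gray-level pairs and gives low entropy; a scattered pattern spreads it over many pairs and gives high entropy. Entropy at a single offset cannot by itself separate a checkerboard from stripes, which is one reason several texture statistics are used together.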

Example: Difference detected by texture feature “entropy”

Features: Zernike moment
• Measure degree to which pattern matches a particular Zernike polynomial
• Give information on basic nature of pattern (e.g., circle, donut) and sizes (frequencies) present in pattern

Examples of Zernike Polynomials
[Figure panels: Z(2,0), Z(4,4), Z(10,6)]

Subcellular Location Features: 2D
• Morphological features
• Haralick texture features
• Zernike moment features
• Geometric features
• Edge features

2D Classification Results

True    Output of the Classifier
Class   DNA   ER  Gia  Gpp  Lam  Mit  Nuc  Act  TfR  Tub
DNA      99    1    0    0    0    0    0    0    0    0
ER        0   97    0    0    0    2    0    0    0    1
Gia       0    0   91    7    0    0    0    0    2    0
Gpp       0    0   14   82    0    0    2    0    1    0
Lam       0    0    1    0   88    1    0    0   10    0
Mit       0    3    0    0    0   92    0    0    3    3
Nuc       0    0    0    0    0    0   99    0    1    0
Act       0    0    0    0    0    0    0  100    0    0
TfR       0    1    0    0   12    2    0    1   81    2
Tub       1    2    0    0    0    1    0    0    1   95

Overall accuracy = 92% (95% for major patterns)

Human Classification Results

True         Output of the Classifier
Class         DNA    ER   Gia   GPP   LAM   Mit   Nuc   Act   TfR   Tub
DNA          100%    0%    0%    0%    0%    0%    0%    0%    0%    0%
ER             0%   90%    0%    0%    3%    6%    0%    0%    0%    0%
Giantin        0%    0%   56%   36%    3%    3%    0%    0%    0%    0%
GPP130         0%    0%   53%   43%    0%    0%    0%    0%    3%    0%
LAMP2          0%    0%    6%    0%   73%    0%    0%    0%   20%    0%
Mitochond.     0%    3%    0%    0%    0%   96%    0%    0%    0%    0%
Nucleolin      0%    0%    0%    0%    0%    0%  100%    0%    0%    0%
Actin          0%    0%    0%    0%    0%    0%    0%  100%    0%    0%
TfR            0%   13%    0%    0%    3%    0%    0%    0%   83%    0%
Tubulin        0%    3%    0%    0%    0%    0%    0%    3%    0%   93%

Overall accuracy = 83% (92% for major patterns)

Computer vs. Human
[Figure: Human Accuracy plotted against Computer Accuracy; both axes run from 40 to 100]

Extending to 3D: Labeling approach
• Total protein labeled with Cy5 reactive dye
• DNA labeled with PI
• Specific proteins labeled with primary Ab + Alexa488 conjugated secondary Ab

3D Image Set
[Figure panels: Nuclear, ER, Giantin, gpp130, Lysosomal, Mitoch.,
Nucleolar, Actin, Endosomal, Tubulin]

New features to measure “z” asymmetry
• 2D features treated x and y equivalently
• For 3D images, while it makes sense to treat x and y equivalently (cells don’t have a “left” and “right”), z should be treated differently (“top” and “bottom” are not the same)
• We designed features to separate distance measures into x-y component and z component

Classification Results for 3D images
• Overall accuracy = 97%

How to do even better
• Biologists interpreting images of protein localization typically view many cells before reaching a conclusion
• Can simulate this by classifying sets of cells from the same microscope slide

Classification of Sets of 3D Images

True    Predicted Class
Class   DNA   ER  Gia  Gpp  Lyso  Mito  Nucl  Actin  Endo  Tub
DNA     100    0    0    0     0     0     0      0     0    0
ER        0   99    0    0     0     0     0      0     0    0
Gia       0    0  100    0     0     0     0      0     0    0
Gpp       0    0    0   99     0     0     0      0     0    0
Lyso      0    0    0    0   100     0     0      0     0    0
Mito      0    0    0    0     0   100     0      0     0    0
Nucle     0    0    0    0     0     0   100      0     0    0
Actin     0    0    0    0     0     0     0    100     0    0
Endo      0    0    0    0     0     0     0      0   100    0
Tub       0    0    0    0     0     0     0      0     0   99

Set size 9, Overall accuracy = 99.7%

First Conclusion
• Description of subcellular locations for systems biology should be implemented using a data-driven approach rather than a knowledge-capture approach, but…

Subcellular Location Image Finder
• (Have automated system for finding images in on-line journal articles that match a particular pattern - enables connection between new images and previously published results)

[Diagram: SLIF processing pipeline. Recoverable labels: Figure; Caption; Scope; Panel splitting [Murphy et al, 2001]; Caption understanding [Cohen et al, 2003]; Panels; Label finding; Entity extraction (proteins, cells, drugs, experimental conditions, …); ImagePtr; Annotated Scopes; Alignment (matching between caption entities and panel labels); Panel classification, micrograph analysis [Murphy et al, 2001] (image type, image scale, subcellular pattern analysis…); Annotated Panels]

Image Similarity
• Classification power of features implies that they capture essential characteristics of protein patterns
• Can be used to
measure similarity between patterns

Clustering by Image Similarity
• Ability to measure similarity of protein patterns allows us for the first time to create a systematic, objective framework for describing subcellular locations
• Ideal for database references
• One way is by creating a Subcellular Location Tree
• Illustration: Build hierarchical dendrogram

Subcellular Location Tree for 10 classes in HeLa cells

Do this for all proteins: Location Proteomics
• Can use CD-tagging (developed by Dr. Jonathan Jarvik) to randomly tag many proteins:
  • Infect population of cells with a retrovirus carrying a DNA sequence that will produce a “tag” in a random gene in each cell
  • Isolate separate clones, each of which expresses one tagged protein
  • Use RT-PCR to identify tagged gene in each clone
  • Collect images of many cells for each clone using fluorescence microscopy

Example images of CD-tagged clones
(A) Glut1 gene (type 1 glucose transporter)
(B) Tmpo gene (thymopoietin)
(C) tuba1 gene (α-tubulin)
(D) Cald gene (caldesmon 1)
(E) Ncl gene (nucleolin)
(F) Rps11 gene (ribosomal protein S11)
(G) Hmga1 gene (high mobility group AT-hook 1)
(H) Col1a2 gene (procollagen type I α2)
(I) Atp5a1 gene (ATP synthase isoform 1)

Proof of principle
• Cluster 46 clones expressing different tagged proteins based on their subcellular location patterns

Feature selection
• Use Stepwise Discriminant Analysis to rank features based on their ability to distinguish proteins
• Use increasing numbers of features to train neural network classifiers and evaluate classification accuracy over all 46 clones
• Best performance obtained with 10 features

Tree building
• Therefore use these 10 features with z-scored Euclidean distance function to build SLT
• Find optimal number of clusters using k-means clustering and AIC
• Find consensus hierarchical trees by randomly dividing the images for each protein in half and keeping branches conserved between both halves (repeat for 50 random divisions)

Consensus Subcellular Location Tree
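
The first tree-building step above (z-scored Euclidean distance followed by hierarchical grouping) can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the procedure used for the consensus tree: average linkage is an assumed choice (the slides do not name a linkage rule), the k-means/AIC and consensus steps are omitted, the function names are hypothetical, and the feature values are made-up numbers rather than measured SLF values.

```python
import math
from statistics import mean, pstdev

def zscore_columns(rows):
    """Scale each feature (column) to zero mean and unit variance."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant features
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_tree(names, features):
    """Agglomerative clustering with average linkage (an assumed choice).

    Returns the tree as nested 2-tuples of protein names.
    """
    z = zscore_columns(features)
    # Each cluster is (subtree, list of member row indices).
    clusters = [(name, [i]) for i, name in enumerate(names)]

    def linkage(m1, m2):
        return mean(euclidean(z[i], z[j]) for i in m1 for j in m2)

    while len(clusters) > 1:
        # Find the closest pair of clusters and merge them.
        a, b = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]][1], clusters[ij[1]][1]),
        )
        merged = ((clusters[a][0], clusters[b][0]),
                  clusters[a][1] + clusters[b][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters[0][0]

# Made-up mean feature vectors for four proteins (two features each).
names = ["giantin", "gpp130", "tubulin", "actin"]
feats = [[1.0, 10.0], [1.1, 11.0], [5.0, 50.0], [5.2, 48.0]]

tree = build_tree(names, feats)
print(tree)  # (('giantin', 'gpp130'), ('tubulin', 'actin'))
```

Z-scoring keeps a feature with a large numeric range from dominating the distance, which is why it comes before the Euclidean step.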

Examples from major clusters

Significance
• Proteins clustered by location analogous to proteins clustered by sequence (e.g., PFAM)
• Can subdivide clusters by observing response to drugs, oncogenes, etc.
• These represent protein location states
• Base knowledge required for modeling
• Can be used to filter protein interactions

From patterns to causes
• Machine learning approaches have been previously used to find localization motifs in protein sequences, but the set of locations used was limited to major organelles
• High-resolution subcellular location trees can be used to discover (recursively) new motifs that determine location of each group
• Can include post-translational modifications

More Conclusions
• Organized data collection approach is required to capture high-resolution information on the subcellular location of all proteins
• Prohibitive combinatorial complexity makes colocalization approach infeasible, so major effort should focus on one protein at a time

Center for Bioimage Informatics
• $2.75 M CMU funding from NSF ITR
• Joint with UCSB and collaborators at Berkeley and MIT
• R. Murphy (CALD/Biomed.Eng./Biol.Sci.)
• Jelena Kovacevic (Biomedical Engineering)
• Tom Mitchell (CALD)
• Christos Faloutsos (CALD)

Acknowledgments
• Former students: Michael Boland, Mia Markey, William Dirks, Gregory Porreca, Edward Roques, Meel Velliste
• Current grad students: Kai Huang, Xiang Chen, Ting Zhao, Yanhua Hu, Elvira Garcia Osuna, Zhenzhen Kou, Juchang Hua
• Funding: NSF, NIH, Rockefeller Bros. Fund, PA. Tobacco Settlement Fund
• Collaborators/Consultants: Simon Watkins, David Casasent, Tom Mitchell, Christos Faloutsos,