Assignment of protein function

Søren Brunak
Center for Biological Sequence Analysis, BioCentrum-DTU, Technical University of Denmark

An enzyme (1AOZ) and a non-enzyme (1PLC) from the Cupredoxin superfamily

1AOZ (129 aa) vs. 1PLC (99 aa)
Scoring matrix: BLOSUM50; gap penalties: -12/-2
15.5% identity; global alignment score: -23

1AOZ  SQIRHYKWEVEYMFWAPNCNENIVMGINGQFPGPTIRANAGDSVVVELTNKLHTEGVVIH
1PLC  ---------IDVLLGA---DDGSLAFVPSEFS-----ISPGEKIVFK-NNAGFPHNIVFD

1AOZ  WHGILQRGTPWADGTASISQCAINPGETFFYNFTVDNPGTFFYHGHLGMQRSAGLYGSLI
1PLC  EDSI-PSGVDASKISMSEEDLLNAKGETFEVALSNKGEYSFYCSPHQG----AGMVGKVT

1AOZ  VDPPQGKKE
1PLC  VN-------

Transfer of functional information: in what space?

Recognize function in:
• Sequence space: sequence alignment
• Structure space: structural comparison
• Gene expression spaces: array data
• Interaction spaces: network/pathway extraction
• Paper space: text mining
• …
• Protein feature space

Cellular context and protein function prediction in feature space

• All sequences have to use the standard cellular machinery for sorting, post-translational modification, etc.
• A similar pattern of modification may imply similar function
• Predict sequence attributes independently, e.g. local and global properties such as
  - post-translational modifications
  - localization signals
  - structure
  - composition, length, isoelectric point, …
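
A minimal sketch, in Python, of how a few of the simplest global attributes named above (length and composition) could be computed directly from a sequence. The function name, the toy sequence and the two summary fractions are illustrative assumptions and not part of ProtFun; dedicated predictors for modifications, localization signals or isoelectric point are far more involved.

```python
from collections import Counter


def simple_features(seq: str) -> dict:
    """Compute a few global sequence attributes (illustrative only)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    counts = Counter(seq)
    n = len(seq)
    return {
        "length": n,
        # Relative frequency of every residue type that occurs in the sequence.
        "composition": {aa: counts[aa] / n for aa in sorted(counts)},
        # Crude proxies, chosen here only as examples of global properties:
        # Ser/Thr content (potential phosphorylation or O-glycosylation sites)
        # and the share of charged residues (D, E, K, R).
        "ser_thr_fraction": (counts["S"] + counts["T"]) / n,
        "charged_fraction": sum(counts[aa] for aa in "DEKR") / n,
    }


# Hypothetical toy peptide, not a real protein.
toy_sequence = "MKTSSDELLK"
features = simple_features(toy_sequence)
print(features["length"], features["ser_thr_fraction"], features["charged_fraction"])
```

Each attribute is computed independently of the others, which mirrors the idea that the individual features are predicted separately and only combined afterwards.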

Integrate and correlate using machine learning techniques

Length distributions and functional role categories

Propeptide cleavage sites

Post-translational processing by limited proteolysis of inactive secretory precursors produces active proteins and peptides.

Furin-specific (a) and other proprotein convertase cleavage sites (b)

PCs activate a large variety of proteins

Peptide hormones, neuropeptides, growth and differentiation factors, adhesion factors, receptors, blood coagulation factors, plasma proteins, extracellular matrix proteins, proteases, exogenous proteins such as coat glycoproteins from infectious viruses (e.g. HIV-1 and Influenza) and bacterial toxins (e.g. diphtheria and anthrax toxin).

PCs play an essential role in many vital biological processes like embryonic development and neural function, and in viral and bacterial pathogenesis. PCs are implicated in pathologies such as cancer and neurodegenerative diseases.

Mucin-type O-glycosylation
[Figure panels: All Ser sites; Single Ser sites; All Thr sites; Single Thr sites]

Mucin-type O-glycosylation site conservation

Positional preference of N-Glyc sites across cellular role categories

NES: a tricky structural motif

Nuclear export signals (NES) are structural motifs believed to interact with CRM1, a receptor involved in active transport from the nucleus to the cytoplasm.

61 NES motifs, centered on the last leucine residue.
Only ~30% of known NES signals fit the general NES consensus of L-x(2,3)-[LIVFM]-x(2,3)-L-x-[LI]

The concept of ProtFun

• Predict as many biologically relevant features as we can from the sequence
• Train artificial neural networks for each category
• Assign a probability for each category from the NN outputs

Predicting Gene Ontology categories

• The GO system is designed for proteins to belong to multiple classes rather than one
• Different kinds of function can be annotated:
  - Molecular function
  - Biological process
  - Cellular component
• GO assigns the "function" at several levels of detail rather than only one

An enzyme (1AOZ) and a non-enzyme (1PLC) from the Cupredoxin superfamily

1AOZ and 1PLC predictions

# Functional category              1AOZ    1PLC
Amino_acid_biosynthesis            0.126   0.070
Biosynthesis_of_cofactors          0.100   0.075
Cell_envelope                      0.429   0.032
Cellular_processes                 0.057   0.059
Central_intermediary_metabolism    0.063   0.041
Energy_metabolism                  0.126   0.268
Fatty_acid_metabolism              0.027   0.072
Purines_and_pyrimidines            0.439   0.088
Regulatory_functions               0.102   0.019
Replication_and_transcription      0.052   0.089
Translation                        0.079   0.150
Transport_and_binding              0.032   0.052

# Enzyme/nonenzyme                 1AOZ    1PLC
Enzyme                             0.773   0.310
Nonenzyme                          0.227   0.690

# Enzyme class                     1AOZ    1PLC
Oxidoreductase (EC 1.-.-.-)        0.077   0.077
Transferase    (EC 2.-.-.-)        0.260   0.099
Hydrolase      (EC 3.-.-.-)        0.114   0.071
Lyase          (EC 4.-.-.-)        0.025   0.020
Isomerase      (EC 5.-.-.-)        0.010   0.068
Ligase         (EC 6.-.-.-)        0.017   0.017

Performance on Gene Ontology categories (worst case)

Predicting the periodically expressed genes in yeast: whole system description

• Focus on whole systems, rather than individual units
• Requires identification of all units in the system
• High diversity in biological systems
• Inference of system features/functions from experimental data
• Ultimate goal is in-silico modeling of the temporal aspects of the cell cycle in different organisms

The Eukaryotic Cell Cycle

Microarray identification of periodic genes
[Figure labels: synchronous yeast cells; DNA chips; gene expression; temporal expression; periodic; non-periodic]
Look for those with a periodic expression

Identification of periodically expressed genes

1) Visual inspection of expression profiles (Cho et al., 1998)
2) Fourier analysis and correlation with profiles of known genes (Spellman et al., 1998)
3) Statistical modeling (single pulse model) (Zhao et al., 2001)

104 known genes, fraction identified: 1) 70%  2) 91%  3) 47%

Problems
• Cho uses non-objective criteria
• Spellman identifies too many genes
• Zhao identifies less than half of the previously identified cell cycle regulated genes

The ProtFun strategy applied to the cell cycle

Sequence-based "machine learning approach"
• 6200 genes pass through a consistency filter
• Periodic genes: positive set (97 sequences)
• Non-periodic genes: negative set (556 sequences)
• Grey zone area (~5600 genes)
• Learn from the positive and negative sets

Prediction of cell cycle regulated genes from protein sequence

Features of cell cycle regulated genes used by neural net ensemble

Non-linear function prediction! Responds to single AA change

ORF       ANN    F-score  Intensity  Protein function
YIL169C   0.98   2.8      176        Protein of unknown function
YNL322C   0.98   1.7      870        Cell wall protein needed for cell wall beta-1,6-glucan assembly
YJL078C   0.98   5.5      86         Protein that may have a role in mating efficiency
YDL038C   0.98   5.3      165        Protein of unknown function
YOL155C   0.97   3.0      391        Protein with similarity to glucan 1,4-alpha-glucosidase
YJR151C   0.97   1.3      251        Member of the seripauperin (PAU) family
YLR286C   0.97   9.3      520        Endochitinase
YOL030W   0.97   4.1      817        Protein with similarity to Gas1p
YOR220W   0.97   2.5      340        Protein of unknown function
YNR044W   0.97   6.5      172        Anchor subunit of a-agglutinin
YGR023W   0.97   1.8      129        Signal transduction of cell wall stress during morphogenesis
YDL016C   0.97   0.8      338        Protein of unknown function
YDL152W   0.97   1.0      156        Protein of unknown function
YPR136C   0.97   1.1      76         Protein of unknown function
YGR115C   0.97   1.0      71         Protein of unknown function, questionable ORF
YMR317W   0.97   2.1      260        Protein of unknown function
YCR089W   0.97   3.4      104        Protein involved in mating induction
YLR194C   0.96   5.4      1870       Protein of unknown function
YIL011W   0.96   2.6      565        Member of the seripauperin (PAU) family
YGR161C   0.96   2.4      190        Protein of unknown function
YBR067C   0.96   5.9      825        Cold- and heat-shock induced mannoprotein of the cell wall
YNL228W   0.96   1.9      250        Protein of unknown function; questionable ORF
YNL327W   0.96   8.7      1320       Cell-cycle regulation protein involved in cell separation
YLR332W   0.96   1.5      642        Putative sensor for cell wall integrity signaling during growth
YNR067C   0.96   6.3      222        Protein with similarity to endo-1,3-beta-glucanase

Top 250 genes predicted from the entire genome

Among the "top 250 predicted" genes not used for training are
• 75 previously identified as cell cycle regulated genes
• 175 new potentially cell cycle regulated genes

[Chart labels. Functional grouping: cytoskeleton, hydrolase, serine rich, RNA binding, transcription, kinase & phosphatase, other, unknown. Subcellular localization: cytoplasmic, membrane, nuclear, wall, other, unknown.]

Febit Geniom One chip

Factory synthesis of arrays
• Affymetrix: photolithography with masks
• Spotted arrays: robot spotting of oligos
• Agilent: inkjet synthesis of arrays
• NimbleGen: micromirror photosynthesis

Customer synthesis of arrays
• Robot spotting of oligos
• Febit: micromirror photosynthesis

Experimental validation of predictions with CDC15-2 mutant

• Fermentation of synchronous yeast culture
• Samples taken at 20 min intervals
• Experiment covers two whole cell cycles
• Samples analyzed on the Febit Geniom microarray platform
• Probe design optimized with GeneWiz
• Non-linear normalization with Qspline

Validation results

• More than 100 new periodic genes identified/validated
• For many of them, a role in the cell cycle is supported by other sources of evidence
• About 30% of them have no known functional role

Gene     p-value  Neural network score  GO Biological Process & Gene Description
Gene A   0.0009   0.76                  Regulates the cell size requirement for passage through Start and commitment to cell division
Gene B   0.0026   0.70                  Cyclin involved in G1/S transition of mitotic cell cycle
Gene C   0.0081   0.59                  Involved in cell cycle dependent gene expression
Gene D   0.0111   0.76                  Cell wall organization and biogenesis*
Gene E   0.0142   0.90                  Required for spindle pole body duplication and a mitotic checkpoint function
Gene F   0.0169   0.85                  DNA repair*
Gene G   0.0192   0.74                  G1/S transition of mitotic cell cycle*
Gene H   0.0222   0.76                  DNA repair*
Gene I   0.0247   0.75                  Cellular morphogenesis*
Gene J   0.0255   0.81                  Regulation of exit from mitosis
Gene K   0.0353   0.46                  Protein with similarity to putative glycosidase of the cell wall
Gene L   0.0482   0.74                  G2/M transition of mitotic cell cycle*
Gene M   0.0520   0.81                  Chromatin assembly/disassembly*
Gene N   0.0630   0.92                  Actin cytoskeleton organization and biogenesis*

High confidence set

"Completely" novel cell cycle regulated genes in yeast

Novel periodic genes tend to be weakly expressed

Protein-protein interaction data

Coloring by peak time

• Represent time by color in cell cycle
• Most interactions happen between proteins that are close in time.

Identifying protein complexes

Temporal interaction network

de Lichtenberg, Jensen, Brunak, Bork, Science, Feb 4, 2005

Just-in-time synthesis? Yes and no!

Observation: The dynamic proteins are generally expressed just before they are needed to carry out their function, which is generally referred to as just-in-time synthesis.

But the general design principle seems to be that only some key components of each module/complex are dynamic.

This suggests a mechanism of just-in-time assembly or partial just-in-time synthesis.

Network as a discovery tool

Observation: The network places 30+ uncharacterized proteins in a temporal interaction context. The network thus generates detailed hypotheses about their function.

Observation: The network contains entire novel modules and complexes.
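
The observation that most interactions happen between proteins that are close in time can be quantified with a circular distance between peak times. The sketch below assumes peak times are expressed as a percentage of the cell cycle (0 to 100); the protein names, peak times, interaction list and the 10% threshold are made-up illustrations, not data from the network.

```python
def circular_gap(peak_a: float, peak_b: float, period: float = 100.0) -> float:
    """Shortest distance between two peak times on a cycle of length `period`."""
    d = abs(peak_a - peak_b) % period
    return min(d, period - d)


def fraction_close_in_time(interactions, peak_time, max_gap=10.0, period=100.0):
    """Fraction of interacting pairs whose peak times differ by at most `max_gap`."""
    if not interactions:
        raise ValueError("need at least one interaction")
    close = sum(
        1
        for a, b in interactions
        if circular_gap(peak_time[a], peak_time[b], period) <= max_gap
    )
    return close / len(interactions)


# Hypothetical peak times (% of the cell cycle) and interactions.
toy_peak = {"A": 5.0, "B": 95.0, "C": 50.0, "D": 55.0}
toy_edges = [("A", "B"), ("C", "D"), ("A", "C")]

close_fraction = fraction_close_in_time(toy_edges, toy_peak)
print(circular_gap(toy_peak["A"], toy_peak["B"]), close_fraction)
```

The wrap-around matters because the cycle is periodic: proteins peaking at 5% and 95% are only 10% apart, not 90%.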

Phosphorylation and degradation

By comparing the dynamic and the static components in our network, we discover that:
• Phosphorylation by the yeast cyclin-dependent kinase, Cdc28, specifically targets the dynamic subunits, but not the static.
• PEST degradation signals are significantly more frequent among the dynamic proteins and among the Cdc28 targets.

In summary, we discover that only some subunits of each complex are regulated transcriptionally, but these dynamic proteins are often subject to additional regulation at the level of post-translational modifications and targeted degradation.

Network Hubs: "Party" versus "Date"

• "Party" hub: the hub protein and its interactors are expressed close in time.
• "Date" hub: the hub protein interacts with different proteins at different times.

The eukaryotic cell cycle

The cell division process is divided into four phases:
• G1: growth/synthesis
• S: replication of DNA
• G2: growth/synthesis
• M: mitosis/cell division

Temporal variation in feature space

S phase feature snapshot

S phase? 40% into the cell cycle, the plot shows:
• High isoelectric point
• Many nuclear proteins
• Short proteins
• Low potential for N-glycosylation
• Low potential for Ser/Thr-phosphorylation
• Few PEST regions
• Low aliphatic index

S phase peaking genes

Name      Fscore  Avg. Int.  pI    Length  Protein function or role
IRS4      0.98    122        9.8   615     Protein involved in silencing of ribosomal DNA
SHE1      2.09    60         10.4  338     Protein that causes lethality when overexpressed
HHT1      8.89    2920       11.4  136     Histone H3, identical to Hht2p
YGR079W   1.06    194        5.4   370     Protein of unknown function
HTB1      9.68    1171       10.1  131     Histone H2B
MKC7      2.00    533        4.6   596     Aspartyl protease found in the periplasmic space
YNL228W   1.92    250        4.9   258     Protein of unknown function; questionable ORF
HTB2      9.70    1071       10.1  131     Histone H2B, nearly identical to Htb1p
HHF2      9.18    1955       11.4  103     Histone H4, identical to Hhf1p
TOF2      4.15    270        8.0   771     Protein that interacts with DNA topoisomerase I
ENT4      1.47    73         9.4   247     Protein of unknown function
HTA1      9.82    1340       10.7  132     Histone H2A, nearly identical to Hta2p
HHT2      7.86    2084       11.4  136     Histone H3, core component of the nucleosome
YPL150W   0.66    95         9.4   901     Serine/threonine protein kinase with unknown role
YKR045C   1.01    242        11.0  191     Protein of unknown function
YNR014W   1.80    312        8.7   212     Protein of unknown function
HHO1      9.17    625        10.2  258     Histone H1

Predicted cdc2 phosphorylation in yeast

Predicted cdc2 phosphorylation in human HeLa cell data

Acknowledgements

People at CBS
• Lars Juhl Jensen
• Ramneek Gupta
• + 10 others
• Karin Julenius (O-glyc conservation)
• Thomas Skøt Jensen (cell cycle)
• Ulrik de Lichtenberg (cell cycle)
• Anders Fausbøll (cell cycle)
• Rasmus Wernersson (Febit experiments)
• Lars Kiemer (NES prediction)
• Thomas Sicheritz-Pontén (new ProtFun method)

Febit AG
• Peer Smith

CNB/CSIC, Madrid
• Alfonso Valencia
• Javier Tamames
• Damien Devos (ProtFun approach)

Matthias Mann group, SDU, Odense (NucleolusP)

Peer Bork, EMBL
Lars Juhl Jensen, EMBL (cell cycle interaction data analysis)

WWW resources and links
www.cbs.dtu.dk/services/Protfun
www.cbs.dtu.dk/cellcycle