Søren Brunak Center for Biological Sequence Analysis BioCentrum-DTU Technical University of Denmark [email protected] Assignment of protein function.

Søren Brunak
Center for Biological Sequence Analysis
BioCentrum-DTU
Technical University of Denmark
[email protected]
Assignment of protein function
An enzyme (1AOZ) and a non-enzyme (1PLC) from the Cupredoxin superfamily
1AOZ (129 aa) vs. 1PLC (99 aa)
Scoring matrix: BLOSUM50; gap penalties: -12/-2
15.5% identity; global alignment score: -23

1AOZ    1 SQIRHYKWEVEYMFWAPNCNENIVMGINGQFPGPTIRANAGDSVVVELTNKLHTEGVVIH  60
1PLC    1 ---------IDVLLGA---DDGSLAFVPSEFS-----ISPGEKIVFK-NNAGFPHNIVFD  42

1AOZ   61 WHGILQRGTPWADGTASISQCAINPGETFFYNFTVDNPGTFFYHGHLGMQRSAGLYGSLI 120
1PLC   43 EDSI-PSGVDASKISMSEEDLLNAKGETFEVALSNKGEYSFYCSPHQG----AGMVGKVT  97

1AOZ  121 VDPPQGKKE 129
1PLC   98 VN-------  99
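The identity figure quoted for this alignment can be checked directly from the two aligned rows. The sketch below is only an illustration: it assumes that percent identity is defined as identical columns divided by the total number of alignment columns (other definitions divide by the shorter sequence), and the function name `alignment_identity` is made up for this example.

```python
# Aligned rows copied from the 1AOZ / 1PLC alignment ('-' marks a gap).
AOZ = (
    "SQIRHYKWEVEYMFWAPNCNENIVMGINGQFPGPTIRANAGDSVVVELTNKLHTEGVVIH"
    "WHGILQRGTPWADGTASISQCAINPGETFFYNFTVDNPGTFFYHGHLGMQRSAGLYGSLI"
    "VDPPQGKKE"
)
PLC = (
    "---------IDVLLGA---DDGSLAFVPSEFS-----ISPGEKIVFK-NNAGFPHNIVFD"
    "EDSI-PSGVDASKISMSEEDLLNAKGETFEVALSNKGEYSFYCSPHQG----AGMVGKVT"
    "VN-------"
)


def alignment_identity(row_a, row_b):
    """Return (identical columns, total columns) for two aligned rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    # A column counts as identical only if both rows carry the same residue.
    identical = sum(1 for a, b in zip(row_a, row_b) if a == b and a != "-")
    return identical, len(row_a)


identical, columns = alignment_identity(AOZ, PLC)
percent = 100.0 * identical / columns
print(f"{identical} identical columns out of {columns}: {percent:.1f}% identity")
```

Under this definition the rows give 20 identical columns out of 129, which agrees with the 15.5% stated on the slide.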
Transfer of functional information – in what space?
Recognize function in:
Sequence space – sequence alignment
Structure space – structural comparison
Gene expression spaces – array data
Interaction spaces – network/pathway extraction
Paper space – text mining
…
Protein feature space
Cellular context and protein function prediction in feature space
All sequences have to use the standard cellular machinery for sorting, post-translational modification, etc.
A similar pattern of modification may imply similar function
Predict sequence attributes independently, e.g. local and global properties such as
- post-translational modifications
- localization signals
- structure
- composition, length, isoelectric point, …
Integrate and correlate using machine learning techniques
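As an illustration of what "global properties" in feature space can look like, the sketch below turns a protein sequence into a small numeric feature dictionary (length, amino acid composition, fraction of charged residues). This is not the ProtFun feature set; the function name `sequence_features` and the choice of features are assumptions made for this example, and predicted attributes such as modification sites or localization signals would come from separate predictors.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sequence_features(seq):
    """Compute simple global features for one protein sequence."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    features = {"length": n}
    # Amino acid composition as fractions of the sequence length.
    for aa in AMINO_ACIDS:
        features["frac_" + aa] = counts[aa] / n
    # Crude charge-related summaries (Lys/Arg vs. Asp/Glu).
    features["frac_positive"] = (counts["K"] + counts["R"]) / n
    features["frac_negative"] = (counts["D"] + counts["E"]) / n
    return features


example = sequence_features("AKDE")
print(example["length"], example["frac_positive"], example["frac_negative"])
```

A vector like this, extended with the independently predicted attributes listed above, is the kind of input a machine learning method could integrate and correlate.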
Length distributions and functional role categories
Propeptide cleavage sites
Post-translational processing by limited proteolysis of inactive secretory precursors produces active proteins and peptides
Furin specific (a) and other proprotein convertase cleavage sites (b)
PCs activate a large variety of
proteins
Peptide hormones, neuropeptides, growth and
differentiation factors, adhesion factors, receptors,
blood coagulation factors, plasma proteins,
extracellular matrix proteins, proteases,
exogenous proteins such as coat glycoproteins from
infectious viruses (e.g. HIV-1 and Influenza) and
bacterial toxins (e.g. diphtheria and anthrax toxin).
PCs play an essential role in many vital biological
processes like embryonic development and neural
function, and in viral and bacterial pathogenesis.
PCs are implicated in pathologies such as cancer
and neurodegenerative diseases.
Mucin-type O-glycosylation
[Figure panels: All Ser sites; Single Ser sites; All Thr sites; Single Thr sites]
Mucin-type O-glycosylation site conservation
Positional preference of N-Glyc sites across cellular role categories
NES – a tricky structural motif
Nuclear export signals (NES) are structural motifs
believed to interact with CRM1, a receptor
involved in active transport from the nucleus to
the cytoplasm
61 NES motifs, centered on
last leucine residue.
Only ~30% of known NES
signals fit the general NES
consensus of
L-x(2,3)-[LIVFM]-x(2,3)-L-x-[LI]
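The consensus above is written in PROSITE-style notation and translates directly into a regular expression. The sketch below is a minimal illustration of such a scan; the function name `find_nes_candidates` and the test sequence are made up for this example. Since only about 30% of known NES signals fit the consensus, a scan like this would miss most real signals, which is why the slide calls the motif tricky.

```python
import re

# L-x(2,3)-[LIVFM]-x(2,3)-L-x-[LI] as a regular expression.
# The lookahead lets overlapping candidates be reported as well.
NES_PATTERN = re.compile(r"(?=(L.{2,3}[LIVFM].{2,3}L.[LI]))")


def find_nes_candidates(seq):
    """Return (0-based start, matched segment) for each consensus match."""
    return [(m.start(1), m.group(1)) for m in NES_PATTERN.finditer(seq.upper())]


# Illustrative sequence containing one segment that fits the consensus.
hits = find_nes_candidates("MNELALKLAGLDINK")
print(hits)
```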
The concept of ProtFun
Predict as many
biologically relevant
features as we can
from the sequence
Train artificial neural
networks for each
category
Assign a probability
for each category from
the NN outputs
Predicting Gene Ontology
categories
The GO system is
designed for proteins to
belong to multiple classes
rather than one
Different kinds of function
can be annotated:
• Molecular function
• Biological process
• Cellular component
GO assigns the "function"
at several levels of detail
rather than only one
An enzyme (1AOZ) and a non-enzyme
(1PLC) from the Cupredoxin superfamily
1AOZ and 1PLC predictions
# Functional category              1AOZ    1PLC
Amino_acid_biosynthesis            0.126   0.070
Biosynthesis_of_cofactors          0.100   0.075
Cell_envelope                      0.429   0.032
Cellular_processes                 0.057   0.059
Central_intermediary_metabolism    0.063   0.041
Energy_metabolism                  0.126   0.268
Fatty_acid_metabolism              0.027   0.072
Purines_and_pyrimidines            0.439   0.088
Regulatory_functions               0.102   0.019
Replication_and_transcription      0.052   0.089
Translation                        0.079   0.150
Transport_and_binding              0.032   0.052
# Enzyme/nonenzyme                 1AOZ    1PLC
Enzyme                             0.773   0.310
Nonenzyme                          0.227   0.690
# Enzyme class                     1AOZ    1PLC
Oxidoreductase (EC 1.-.-.-)        0.077   0.077
Transferase (EC 2.-.-.-)           0.260   0.099
Hydrolase (EC 3.-.-.-)             0.114   0.071
Lyase (EC 4.-.-.-)                 0.025   0.020
Isomerase (EC 5.-.-.-)             0.010   0.068
Ligase (EC 6.-.-.-)                0.017   0.017
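To read the prediction output, one simple option is to rank categories by their raw score. The sketch below does that for the functional category and enzyme/nonenzyme scores shown for 1AOZ and 1PLC. Ranking by raw score is an assumption made for illustration; the slides do not state how the final call is made, and the helper name `top_category` is hypothetical.

```python
CATEGORIES = [
    "Amino_acid_biosynthesis", "Biosynthesis_of_cofactors", "Cell_envelope",
    "Cellular_processes", "Central_intermediary_metabolism", "Energy_metabolism",
    "Fatty_acid_metabolism", "Purines_and_pyrimidines", "Regulatory_functions",
    "Replication_and_transcription", "Translation", "Transport_and_binding",
]

# Scores as listed on the slide, in the category order above.
FUNCTIONAL_SCORES = {
    "1AOZ": dict(zip(CATEGORIES, [0.126, 0.100, 0.429, 0.057, 0.063, 0.126,
                                  0.027, 0.439, 0.102, 0.052, 0.079, 0.032])),
    "1PLC": dict(zip(CATEGORIES, [0.070, 0.075, 0.032, 0.059, 0.041, 0.268,
                                  0.072, 0.088, 0.019, 0.089, 0.150, 0.052])),
}
ENZYME_SCORES = {
    "1AOZ": {"Enzyme": 0.773, "Nonenzyme": 0.227},
    "1PLC": {"Enzyme": 0.310, "Nonenzyme": 0.690},
}


def top_category(scores):
    """Return the (category, score) pair with the highest score."""
    return max(scores.items(), key=lambda item: item[1])


for protein in ("1AOZ", "1PLC"):
    print(protein,
          top_category(FUNCTIONAL_SCORES[protein]),
          top_category(ENZYME_SCORES[protein]))
```

With these numbers the enzyme/nonenzyme call separates the two proteins (1AOZ as enzyme, 1PLC as nonenzyme) even though sequence alignment alone gives only 15.5% identity.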
Performance on Gene Ontology
categories (worst case)
Predicting the periodically
expressed genes in yeast –
Whole system description
• Focus on whole systems, rather
than individual units
• Requires identification of all units
in the system
• High diversity in biological
systems
• Inference of system
features/functions from
experimental data
• Ultimate goal is in-silico modeling
of the temporal aspects of the cell
cycle in different organisms
The Eukaryotic Cell Cycle
Microarray identification of periodic genes
[Figure: synchronous yeast cells, DNA chips, gene expression, temporal expression; genes sorted into periodic, non-periodic and "????"]
Look for those with a periodic expression
Identification of periodically expressed genes
1) Visual inspection of expression profiles (Cho et al., 1998)
2) Fourier analysis and correlation with profiles of known genes (Spellman et al., 1998)
3) Statistical modeling (single pulse model) (Zhao et al., 2001)
Recovery of the 104 known genes: 70% (Cho), 91% (Spellman), 47% (Zhao)
Problems
• Cho uses non-objective criteria
• Spellman identifies too many genes
• Zhao identifies fewer than half of the previously identified cell cycle regulated genes
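The idea behind a Fourier-based periodicity score can be sketched in a few lines: project a mean-centred expression profile onto a sine and a cosine at the cell cycle frequency and take the magnitude. This is not the published Spellman procedure, only a minimal illustration; the function name `fourier_score`, the 120-minute period and the synthetic profiles are assumptions, while the 20-minute sampling over two cycles follows the validation experiment described later in the talk.

```python
import math


def fourier_score(values, times, period):
    """Magnitude of the Fourier component at the given period, per sample."""
    if len(values) != len(times) or not values:
        raise ValueError("values and times must be non-empty and equally long")
    omega = 2.0 * math.pi / period
    mean = sum(values) / len(values)
    # Project the mean-centred profile onto cosine and sine at frequency omega.
    cos_sum = sum((v - mean) * math.cos(omega * t) for v, t in zip(values, times))
    sin_sum = sum((v - mean) * math.sin(omega * t) for v, t in zip(values, times))
    return math.hypot(cos_sum, sin_sum) / len(values)


# Synthetic profiles: samples every 20 min over two hypothetical 120 min cycles.
times = list(range(0, 240, 20))
periodic = [math.sin(2.0 * math.pi * t / 120.0) for t in times]
flat = [1.0] * len(times)

print(fourier_score(periodic, times, 120.0))  # clearly above zero
print(fourier_score(flat, times, 120.0))      # zero for a constant profile
```

Where to put the cutoff on such a score is the hard part, which is one way to read the remark that a Fourier-based method can identify too many genes.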
The ProtFun strategy applied to the cell cycle
Sequence-based "machine learning approach"
[Figure: a consistency filter splits the 6200 genes into periodic genes (positive set, 97 sequences), non-periodic genes (negative set, 556 sequences) and a grey zone area (~5600 genes, marked "?"); the positive and negative sets are used for learning]
Prediction of cell cycle
regulated genes from protein
sequence
Features of cell cycle regulated
genes used by neural net
ensemble
Non-linear function prediction!
Responds to single AA change
ORF       ANN    F-score  Intensity  Protein function
YIL169C   0.98   2.8      176        Protein of unknown function
YNL322C   0.98   1.7      870        Cell wall protein needed for cell wall beta-1,6-glucan assembly
YJL078C   0.98   5.5      86         Protein that may have a role in mating efficiency
YDL038C   0.98   5.3      165        Protein of unknown function
YOL155C   0.97   3.0      391        Protein with similarity to glucan 1,4-alpha-glucosidase
YJR151C   0.97   1.3      251        Member of the seripauperin (PAU) family
YLR286C   0.97   9.3      520        Endochitinase
YOL030W   0.97   4.1      817        Protein with similarity to Gas1p
YOR220W   0.97   2.5      340        Protein of unknown function
YNR044W   0.97   6.5      172        Anchor subunit of a-agglutinin
YGR023W   0.97   1.8      129        Signal transduction of cell wall stress during morphogenesis
YDL016C   0.97   0.8      338        Protein of unknown function
YDL152W   0.97   1.0      156        Protein of unknown function
YPR136C   0.97   1.1      76         Protein of unknown function
YGR115C   0.97   1.0      71         Protein of unknown function, questionable ORF
YMR317W   0.97   2.1      260        Protein of unknown function
YCR089W   0.97   3.4      104        Protein involved in mating induction
YLR194C   0.96   5.4      1870       Protein of unknown function
YIL011W   0.96   2.6      565        Member of the seripauperin (PAU) family
YGR161C   0.96   2.4      190        Protein of unknown function
YBR067C   0.96   5.9      825        Cold- and heat-shock induced mannoprotein of the cell wall
YNL228W   0.96   1.9      250        Protein of unknown function; questionable ORF
YNL327W   0.96   8.7      1320       Cell-cycle regulation protein involved in cell separation
YLR332W   0.96   1.5      642        Putative sensor for cell wall integrity signaling during growth
YNR067C   0.96   6.3      222        Protein with similarity to endo-1,3-beta-glucanase
Top 250 genes predicted from
the entire genome
Among the "top 250 predicted" genes not used for training are
• 75 previously identified as cell cycle regulated genes
• 175 new potentially cell cycle regulated genes
[Pie charts: Functional grouping and Subcellular localization. Slice labels: cytoskeleton, cytoplasmic, hydrolase, membrane, Serine rich, RNA binding, nuclear, transcription, kinase & phosphatase, wall, other, unknown]
Febit Geniom One chip
Factory synthesis of arrays
Affymetrix
Photolithography with masks
Spotted arrays
Robot spotting of oligos
Agilent
Inkjet synthesis of arrays
NimbleGen
Micromirror photosynthesis
Febit
Customer synthesis of arrays
Robot spotting of oligos
Micromirror photosynthesis
Experimental validation of
predictions with CDC15-2 mutant
Fermentation of synchronous yeast culture
Samples taken at 20 min intervals
Experiment covers two whole cell cycles
Samples analyzed on the Febit Geniom
microarray platform
Probe design optimized with GeneWiz
Non-linear normalization with Qspline
Validation results
More than 100 new periodic genes identified/validated
For many of them, a role in the cell cycle is supported by other
sources of evidence
About 30% of them have no known functional role
Gene     p-value  Neural network score  GO Biological Process & Gene Description
Gene A   0.0009   0.76                  Regulates the cell size requirement for passage through Start and commitment to cell division
Gene B   0.0026   0.70                  cyclin involved in G1/S transition of mitotic cell cycle
Gene C   0.0081   0.59                  Involved in cell cycle dependent gene expression
Gene D   0.0111   0.76                  cell wall organization and biogenesis*
Gene E   0.0142   0.90                  Required for spindle pole body duplication and a mitotic checkpoint function.
Gene F   0.0169   0.85                  DNA repair*
Gene G   0.0192   0.74                  G1/S transition of mitotic cell cycle*
Gene H   0.0222   0.76                  DNA repair*
Gene I   0.0247   0.75                  cellular morphogenesis*
Gene J   0.0255   0.81                  regulation of exit from mitosis
Gene K   0.0353   0.46                  Protein with similarity to putative glycosidase of the cell wall
Gene L   0.0482   0.74                  G2/M transition of mitotic cell cycle*
Gene M   0.0520   0.81                  chromatin assembly/disassembly*
Gene N   0.0630   0.92                  actin cytoskeleton organization and biogenesis*
High confidence set
“Completely” novel cell cycle
regulated genes in yeast
Novel periodic genes tend to be
weakly expressed
Protein-protein interaction data
Coloring by peak time
Represent time by color in cell cycle
Most interactions happen between proteins that
are close in time.
Identifying protein complexes
Temporal interaction network
de Lichtenberg,
Jensen, Brunak, Bork
Science, Feb 4, 2005
Just-in-time synthesis? yes and no!
Observation: The dynamic
proteins are generally
expressed just before they
are needed to carry out their
function, generally referred
to as just-in-time
synthesis
But, the general design
principle seems to be that
only some key components
of each module/complex are
dynamic
This suggests a mechanism
of just-in-time assembly or
partial just-in-time
synthesis
Network as a discovery tool
Observation: The
network places 30+
uncharacterized
proteins in a temporal
interaction context.
The network thus
generates detailed
hypotheses about
their function.
Observation: The
network contains
entire novel modules
and complexes.
Phosphorylation and degradation
By comparing the dynamic and the static
components in our network, we discover
that:
Phosphorylation by the yeast cyclin-dependent kinase, Cdc28, specifically
targets the dynamic subunits, but not
the static.
PEST degradation signals are
significantly more frequent among the
dynamic proteins and among the Cdc28 targets.
In summary, we discover that only
some subunits of each complex are
regulated transcriptionally, but these
dynamic proteins are often subject to
additional regulation at the level of
post-translational modifications and
targeted degradation.
Network Hubs: “Party” versus “Date”
“Party” Hub: the hub
protein and its interactors
are expressed close in
time.
“Date” Hub: the hub
protein interacts with
different proteins at
different times.
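The party/date distinction can be made concrete once each protein has a peak time, expressed here as a percentage of the cell cycle. The sketch below is a hypothetical rule, not the one used in the study: a hub is called "party" when most of its interactors peak within a window around the hub's own peak, and "date" otherwise. The names `cyclic_distance` and `classify_hub`, the 15% window and the 75% cutoff are all assumptions made for illustration.

```python
def cyclic_distance(t1, t2, cycle=100.0):
    """Shortest distance between two time points on a circular cell cycle."""
    d = abs(t1 - t2) % cycle
    return min(d, cycle - d)


def classify_hub(hub_peak, partner_peaks, window=15.0, min_fraction=0.75):
    """Label a hub 'party' if most partners peak close to it, else 'date'."""
    if not partner_peaks:
        raise ValueError("a hub needs at least one interaction partner")
    close = sum(1 for p in partner_peaks
                if cyclic_distance(hub_peak, p) <= window)
    return "party" if close / len(partner_peaks) >= min_fraction else "date"


# Peak times in percent of the cell cycle (illustrative values only).
print(classify_hub(95.0, [2.0, 98.0, 90.0, 5.0]))    # partners cluster around the hub
print(classify_hub(10.0, [12.0, 40.0, 65.0, 90.0]))  # partners spread over the cycle
```

Time is measured on a circle, so 95% and 5% are only 10% of a cycle apart; a plain difference would wrongly treat them as far apart.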
The eukaryotic cell cycle
The cell division process is divided into four phases:
• G1: growth/synthesis
• S: replication of DNA
• G2: growth/synthesis
• M: mitosis/cell division
Temporal variation in feature
space
S phase feature snapshot
S phase ?
40% into the cell cycle, the plot shows:
• High isoelectric point
• Many nuclear proteins
• Short proteins
• Low potential for N-glycosylation
• Low potential for Ser/Thr-phosphorylation
• Few PEST regions
• Low aliphatic index
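Of the features in this snapshot, the aliphatic index is simple to compute from composition alone. The sketch below uses the standard definition (Ikai, 1980): mole percent of Ala, plus 2.9 times mole percent of Val, plus 3.9 times the combined mole percent of Ile and Leu. That this is the exact variant used for the plot is an assumption, and the function name `aliphatic_index` is chosen for this example.

```python
def aliphatic_index(seq):
    """Aliphatic index from mole percentages of Ala, Val, Ile and Leu."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("empty sequence")

    def mole_percent(residue):
        return 100.0 * seq.count(residue) / n

    return (mole_percent("A")
            + 2.9 * mole_percent("V")
            + 3.9 * (mole_percent("I") + mole_percent("L")))


# Toy sequences: one made only of aliphatic residues, one with none.
print(aliphatic_index("AVIL"))
print(aliphatic_index("GGGG"))
```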
S phase peaking genes
Name      Fscore  Avg. Int.  pI    Length  Protein function or role
IRS4      0.98    122        9.8   615     Protein involved in silencing of ribosomal DNA
SHE1      2.09    60         10.4  338     Protein that causes lethality when overexpressed
HHT1      8.89    2920       11.4  136     Histone H3, identical to Hht2p
YGR079W   1.06    194        5.4   370     Protein of unknown function
HTB1      9.68    1171       10.1  131     Histone H2B
MKC7      2.00    533        4.6   596     Aspartyl protease found in the periplasmic space
YNL228W   1.92    250        4.9   258     Protein of unknown function; questionable ORF
HTB2      9.70    1071       10.1  131     Histone H2B, nearly identical to Htb1p
HHF2      9.18    1955       11.4  103     Histone H4, identical to Hhf1p
TOF2      4.15    270        8.0   771     Protein that interacts with DNA topoisomerase I
ENT4      1.47    73         9.4   247     Protein of unknown function
HTA1      9.82    1340       10.7  132     Histone H2A, nearly identical to Hta2p
HHT2      7.86    2084       11.4  136     Histone H3, core component of the nucleosome
YPL150W   0.66    95         9.4   901     Serine/threonine protein kinase with unknown role
YKR045C   1.01    242        11.0  191     Protein of unknown function
YNR014W   1.80    312        8.7   212     Protein of unknown function
HHO1      9.17    625        10.2  258     Histone H1
Predicted cdc2 phosphorylation
in yeast
Predicted cdc2 phosphorylation
in human HeLa cell data
Acknowledgements
People at CBS
• Lars Juhl Jensen
• Ramneek Gupta
• + 10 others
• Karin Julenius (O-glyc conservation)
• Thomas Skøt Jensen (cell cycle)
• Ulrik de Lichtenberg (cell cycle)
• Anders Fausbøll (cell cycle)
• Rasmus Wernersson (Febit experiments)
• Lars Kiemer (NES prediction)
• Thomas Sicheritz-Pontén (new ProtFun method)
Febit AG
• Peer Smith
CNB/CSIC, Madrid
• Alfonso Valencia
• Javier Tamames
• Damien Devos
(ProtFun approach)
Matthias Mann group,
SDU, Odense (NucleolusP)
Peer Bork, EMBL
Lars Juhl Jensen, EMBL
(cell cycle interaction data
analysis)
WWW resources and links
www.cbs.dtu.dk/services/Protfun
www.cbs.dtu.dk/cellcycle