Lecture 3
1. Protein Function prediction using network
concepts
2. Application of network concepts in DNA
sequencing
The topology of a protein-protein interaction (PPI) network is informative, but
further analysis can reveal other information.
A popular assumption, which is true in many cases, is that
proteins of similar function interact with each other.
Based on this assumption, we have developed methods to
predict protein functions and protein complexes from PPI
networks, mainly based on cluster analysis.
Cluster Analysis
Cluster Analysis, also called data segmentation, implies grouping
or segmenting a collection of objects into subsets or "clusters",
such that those within each cluster are more closely related to
one another than objects assigned to different clusters.
In the context of a graph, densely connected groups of nodes are
considered clusters.
Visually we can detect two clusters in this graph
K-cores of
Protein-Protein Interaction Networks
Definition
Let a graph $G=(V, E)$ consist of a finite set of
nodes $V$ and a finite set of edges $E$.
A subgraph $S=(V', E')$, where $V' \subseteq V$ and $E' \subseteq E$,
is a k-core or a core of order $k$ of $G$ if and only
if $\forall v \in V'$: $\deg(v) \geq k$ within $S$, and $S$ is the
maximal subgraph with this property.
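The definition above can be turned into a simple "peeling" procedure: repeatedly delete every node whose degree inside the remaining subgraph is below k. The following is a minimal sketch of that idea, not code from the lecture; the function name `k_core` and the toy edge list are illustrative assumptions.

```python
from collections import defaultdict

def k_core(edges, k):
    """Return the node set of the k-core of an undirected graph (empty set if none)."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:                      # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)
    alive = set(adj)
    # Nodes whose degree is already below k are the first candidates for removal.
    stack = [n for n in alive if len(adj[n]) < k]
    while stack:
        n = stack.pop()
        if n not in alive:
            continue                    # already removed earlier
        alive.discard(n)
        for m in adj[n]:
            if m in alive:
                adj[m].discard(n)       # n no longer counts towards m's degree
                if len(adj[m]) < k:
                    stack.append(m)
    return alive

# Hypothetical toy graph: a triangle a-b-c with a tail c-d-e.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e")]
print(sorted(k_core(toy_edges, 2)))
```

In this toy graph the 1-core is the whole graph, the 2-core is the triangle, and there is no 3-core.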
Concept of a k-core graph
Graph G
1-core graph: the degree of every node is one or more
2-core graph: the degree of every node is two or more
3-core graph: the degree of every node is three or more
The 3-core is the highest k-core subgraph of the graph G
Application of a k-core graph
Analyzing protein-protein interaction data obtained from
different sources, G. D. Bader and C.W.V. Hogue, Nature
biotechnology, Vol 20, 2002
Protein function prediction using k-core graphs
Introduction : Function prediction
Schwikowski, B., Uetz, P. and Fields, S. A network of protein-protein
interactions in yeast. Nature Biotech. 18, 1257-1261 (2000)
Deals with a network of 2039 proteins and 2709 interactions.
65% of interactions occurred between protein pairs with
at least one common function
Hishigaki, H., Nakai, K., Ono, T., Tanigami, A., and Takagi, T.
Assessment of prediction accuracy of protein function from
protein-protein interaction data. Yeast 18, 523-531 (2001)
Reported similar results.
Introduction : Function prediction
Hypothesis
Function-unknown proteins that form a densely connected
subgraph with proteins of a particular function may belong to that
functional group.
[Figure: a densely connected subgraph of CLASS A proteins and UNCLASSIFIED PROTEINS]
We utilize this concept by determining k-cores of strategically
constructed sub-networks.
Prediction of Protein Functions Based on K-cores of
Protein-Protein Interaction Networks
“Prediction of Protein Functions Based on K-cores of
Protein-Protein Interaction Networks and Amino Acid
Sequences”, Md. Altaf-Ul-Amin, Kensaku Nishikata,
Toshihiro Koma, Teppei Miyasato, Yoko Shinbo, Md.
Arifuzzaman, Chieko Wada, Maki Maeda, Taku Oshima,
Hirotada Mori, Shigehiko Kanaya The 14th International
Conference on Genome Informatics December 14-17,
2003, Yokohama Japan.
E.Coli PPI network
Total 3007 proteins and 11531 interactions
Around 2000 are function-unknown proteins
The highest k-core of this total graph is not so helpful
10-core graph—the highest k-core of the E.Coli
PPI network
We separate 1072 interactions (out of 11531) involving protein
synthesis proteins and function-unknown proteins.
[Figure: 6-core graph of this sub-network; nodes are labeled P. S. (protein synthesis) or U. F. (unknown function)]
Function-unknown proteins in this 6-core graph are likely to be involved
in protein synthesis.
Extending the k-core based function prediction method
and its application to PPI data of Arabidopsis thaliana
Protein Function Prediction based on k-cores of
Interaction Networks, Norihiko Kamakura, Hiroki
Takahashi, Kensuke Nakamura, Shigehiko Kanaya and
Md. Altaf-Ul-Amin, Proceedings of 2010 International
Conference on Bioinformatics and Biomedical Technology
(ICBBT 2010)
Materials and Methods : Dataset
All PPI data of Arabidopsis thaliana
• 3118 interactions involving 1302 proteins.
• Collected from databases and scientific literature by our laboratory.
Green = unknown proteins (289 proteins)
Pink = known proteins (1013 proteins)
Materials and Methods : Dataset
Functional groups in the network
The PPI dataset contains proteins of 19 different functions according to the first level
categories of the KNApSAcK database.
Function name | Number of proteins
CELL CYCLE AND DNA PROCESSING | 69
CELL FATE | 5
CELL RESCUE, DEFENSE AND VIRULENCE | 32
CELLULAR COMMUNICATION/SIGNAL TRANSDUCTION MECHANISM | 171
CONTROL OF CELLULAR ORGANIZATION | 3
DEVELOPMENT (Systemic) | 9
ENERGY | 51
Endoplasmic reticulum biogenesis | 4
METABOLISM | 120
Mitochondria biogenesis | 4
PROTEIN ACTIVITY REGULATION | 1
PROTEIN FATE (folding, modification, destination) | 112
PROTEIN SYNTHESIS | 20
REGULATION OF / INTERACTION WITH CELLULAR ENVIRONMENT | 1
STORAGE PROTEIN | 1
SYSTEMIC REGULATION OF / INTERACTION WITH ENVIRONMENT | 2
TRANSCRIPTION | 362
TRANSPORT FACILITATION | 46
UNCLASSIFIED PROTEINS | 289
Materials and Methods : Dataset
The trends of interactions in the context of functional
similarity
Diagonal elements show number of interactions between similar function proteins.
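No | Function name | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19
1 | METABOLISM | 72 | 23 | 1 | 9 | 10 | 0 | 1 | 0 | 67 | 0 | 29 | 0 | 4 | 3 | 0 | 0 | 0 | 0 | 0
2 | UNCLASSIFIED PROTEINS | 23 | 82 | 19 | 166 | 279 | 9 | 3 | 4 | 189 | 0 | 35 | 0 | 35 | 16 | 0 | 0 | 0 | 0 | 1
3 | CELL RESCUE, DEFENSE AND VIRULENCE | 1 | 19 | 9 | 15 | 7 | 0 | 0 | 0 | 38 | 0 | 1 | 0 | 3 | 4 | 0 | 0 | 0 | 0 | 0
4 | TRANSCRIPTION | 9 | 166 | 15 | 689 | 64 | 6 | 1 | 0 | 354 | 0 | 2 | 3 | 22 | 7 | 0 | 0 | 0 | 1 | 0
5 | PROTEIN FATE (folding, modification, destination) | 10 | 279 | 7 | 64 | 137 | 0 | 9 | 2 | 20 | 0 | 22 | 2 | 7 | 5 | 0 | 0 | 0 | 0 | 0
6 | DEVELOPMENT (Systemic) | 0 | 9 | 0 | 6 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0
7 | CELL FATE | 1 | 3 | 0 | 1 | 9 | 0 | 1 | 0 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0
8 | PROTEIN SYNTHESIS | 0 | 4 | 0 | 0 | 2 | 0 | 0 | 17 | 2 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0
9 | CELLULAR COMMUNICATION/SIGNAL TRANSDUCTION MECHANISM | 67 | 189 | 38 | 354 | 20 | 1 | 2 | 2 | 374 | 0 | 24 | 0 | 35 | 11 | 0 | 0 | 1 | 1 | 0
10 | Mitochondria biogenesis | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
11 | ENERGY | 29 | 35 | 1 | 2 | 22 | 0 | 0 | 1 | 24 | 0 | 64 | 0 | 3 | 8 | 0 | 0 | 0 | 0 | 0
12 | SYSTEMIC REGULATION OF / INTERACTION WITH ENVIRONMENT | 0 | 0 | 0 | 3 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
13 | CELL CYCLE AND DNA PROCESSING | 4 | 35 | 3 | 22 | 7 | 0 | 0 | 1 | 35 | 0 | 3 | 0 | 44 | 2 | 2 | 0 | 0 | 0 | 0
14 | TRANSPORT FACILITATION | 3 | 16 | 4 | 7 | 5 | 2 | 1 | 1 | 11 | 0 | 8 | 0 | 2 | 17 | 0 | 2 | 0 | 0 | 3
15 | CONTROL OF CELLULAR ORGANIZATION | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 1 | 0 | 0 | 0 | 0
16 | REGULATION OF / INTERACTION WITH CELLULAR ENVIRONMENT | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0
17 | PROTEIN ACTIVITY REGULATION | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
18 | STORAGE PROTEIN | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
19 | Endoplasmic reticulum biogenesis | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 6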
Materials And Methods : Flowchart of the method
Input: A PPI network
Make a sub-network corresponding to a functional group
Remove the components consisting of only unknown proteins
Determine k-cores and assign the corresponding function to
the unknown proteins included in the k-cores(for k =3 or more)
Output: Predicted functions for some unknown proteins
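The second and third steps of the flowchart (building the sub-network for one functional group and removing components that consist only of unknown proteins) could be sketched as follows. This is an illustrative sketch under simplifying assumptions, not the authors' implementation: each protein is given a single function label in the hypothetical dictionary `function_of`, the label `"UNKNOWN"` is made up for the example, and k-cores would then be determined on the returned edge list.

```python
from collections import defaultdict

def function_subnetwork(edges, function_of, target, unknown="UNKNOWN"):
    """Keep target-target, target-unknown and unknown-unknown interactions,
    then drop connected components made only of unknown proteins."""
    allowed = {target, unknown}
    kept = [(u, v) for u, v in edges
            if function_of[u] in allowed and function_of[v] in allowed]
    adj = defaultdict(set)
    for u, v in kept:
        adj[u].add(v)
        adj[v].add(u)
    keep_nodes, seen = set(), set()
    for start in adj:
        if start in seen:
            continue
        # Depth-first search collects one connected component.
        comp, stack = set(), [start]
        while stack:
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(adj[n] - comp)
        seen |= comp
        # A component is kept only if it contains at least one known protein.
        if any(function_of[n] == target for n in comp):
            keep_nodes |= comp
    return [(u, v) for u, v in kept if u in keep_nodes]

# Hypothetical toy data: k1 has the target function, x1 another function.
toy_function = {"k1": "T", "x1": "OTHER",
                "u1": "UNKNOWN", "u2": "UNKNOWN", "u3": "UNKNOWN", "u4": "UNKNOWN"}
toy_edges = [("k1", "u1"), ("u1", "u2"), ("u3", "u4"), ("k1", "x1")]
print(function_subnetwork(toy_edges, toy_function, "T"))
```

Here the u3-u4 component is discarded because it contains no known protein, and the k1-x1 interaction is discarded because x1 belongs to another functional group.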
Results : Subnetworks
Subnetwork Name
Number of interactions
In this work we do not consider sub-networks that contain fewer than 100 interactions.
Finally, we consider the sub-networks corresponding to 9 functional classes.
Results : Subnetwork corresponding to
cellular communication
As an example here we show the subnetworks and k-cores corresponding
to cellular communication.
Subnetwork extraction
We extracted the following 3 types of interactions.
Cellular communication-Cellular communication
Cellular communication-Unknown,
Unknown-Unknown
Total 603 interactions
Results : Subnetwork corresponding to
cellular communication
1-core
The red nodes : known proteins.
The green nodes : unknown proteins.
Results : k-cores corresponding to cellular
communication
2-core
3-core
The red nodes : known proteins.
The green nodes : unknown proteins.
Results : k-cores corresponding to cellular
communication
4-core
5-core
The red nodes : known proteins
The green nodes : unknown
proteins.
6-core
7-core
This figure implies that determination
of k-cores in strategically constructed
sub-networks can reveal which
unknown proteins are densely
connected to proteins of a particular
functional class.
Results : Function Predictions
The number of unknown genes included in different k-cores
corresponding to different functional groups
k-core 2
cell_cycle
11
cell_rescue
4
cellular_communicati
on
k-core 3 k-core 4 k-core 5 k-core 6 k-core 7 k-core 8
7
37
33
23
15
12
8
energy
5
2
2
2
2
2
metabo
5
1
1
69
35
25
25
15
10
24
14
11
8
8
88
64
52
36
27
protein_fate
protein_synthesis
transcription
transport_facilitation
total
2
2
33
2
129
2
Results : Function Predictions
Prediction based on 2-cores, 3-cores and 4-cores
2-core
4-core
Most proteins have been assigned
unique functions
CELL CYCLE AND DNA PROCESSING
CELLULAR COMMUNICATION/SIGNAL TRANSDUCTION
CELL RESCUE, DEFENSE AND VIRULENCE
ENERGY
3-core
Most proteins have been assigned
unique functions and some have
been assigned multiple functions
METABOLISM
PROTEIN FATE (folding, modification, destination)
PROTEIN SYNTHESIS
TRANSCRIPTION
TRANSPORT FACILITATION
Assessment of Predictions
Most of the proteins with predicted functions are still uncharacterized,
so their annotations do not contain clear information on their
functions.
When k is much larger than one, the effect of false positives is
greatly reduced.
To assess the predictions statistically, we constructed 1000 random
graphs consisting of the same 1,302 proteins with 3,118 randomly
inserted edges, and constructed the corresponding sub-networks.
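A random-graph comparison of this kind could be sketched as below. The slides do not state how the edges were randomized, so uniform sampling of distinct node pairs is an assumption here, and the helper names `random_graph` and `highest_core_order` are hypothetical. The demo uses a much smaller graph than the 1,302 proteins and 3,118 edges of the real data so that it runs quickly.

```python
import random
from collections import defaultdict

def random_graph(nodes, n_edges, rng):
    """Sample n_edges distinct undirected edges uniformly at random (no self-loops)."""
    nodes = list(nodes)
    edges = set()
    while len(edges) < n_edges:
        u, v = rng.sample(nodes, 2)          # two different nodes
        edges.add((u, v) if u < v else (v, u))
    return sorted(edges)

def highest_core_order(edges):
    """Largest k for which a non-empty k-core exists (the degeneracy of the graph)."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    best = 0
    while adj:
        n = min(adj, key=lambda x: len(adj[x]))   # node of minimum remaining degree
        best = max(best, len(adj[n]))
        for m in adj.pop(n):
            adj[m].discard(n)
    return best

rng = random.Random(0)
orders = [highest_core_order(random_graph(range(30), 60, rng)) for _ in range(20)]
print("highest k-core orders in 20 random graphs:", sorted(set(orders)))
```

A k-core of the real sub-network whose order exceeds every value observed in the random graphs would then be treated as unlikely to arise by chance.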
Assessment of Predictions
[Figure: box plots in nine panels: Cell Cycle, Energy, Protein Synthesis, Cell Rescue, Metabolism, Transcription, Cellular Communication, Protein Fate, Transport]
The box plots show the distribution of k-core sizes in the 1000 random
graphs corresponding to each sub-network, and the filled triangles show the sizes of
the k-cores in the real PPI sub-networks.
Assessment of Predictions
• It can be concluded that the existence of higher-order
k-cores in the PPI sub-networks than in random graphs
of the same size is likely due to interactions between
similar-function proteins.
• Therefore we assume that function predictions based
on k-cores with k greater than the highest value of k found
in the corresponding random graphs are
statistically significant predictions.
• Based on this we predicted the functions of 67
proteins (the list is available online at
http://kanaya.naist.jp/Kcore/supplementary/Function_prediction.xls).
“Prediction of Protein Functions Based on Protein-Protein
Interaction Networks: A Min-Cut
Approach”, Md. Altaf-Ul-Amin, Toshihiro Koma,
Ken Kurokawa, Shigehiko Kanaya, Proceedings of
the Workshop on Biomedical Data Engineering
(BMDE), Tokyo, Japan, pp. 37-43, April 3-4, 2005.
Outline
•Introduction
•The concept of Min-Cut
•Problem Formulation
•A Heuristic Method
•Evaluation of the Proposed Method
•Conclusions
Introduction
After the complete sequencing of several genomes, the
challenging problem now is to determine the functions of
proteins
1) Determining protein functions experimentally
2) Using various computational methods
a) sequence
b) structure
c) gene neighborhood
d) gene fusions
e) cellular localization
f) protein-protein interactions
Introduction
The present work predicts protein functions based on the protein-protein interaction network.
For the purpose of prediction, we consider, in the context of the whole network, the interactions of
• function-unknown proteins with function-known proteins and
• function-unknown proteins with function-unknown proteins.
Introduction
The majority of protein-protein interactions are between protein
pairs of similar function.
Therefore, we assign function-unknown proteins to different
functional groups in such a way that the number of
inter-group interactions becomes minimum.
Hence we call the proposed approach a Min-Cut
approach.
The concept of Min-Cut
[Figure: a typical small network of known proteins (K, in groups G1 and G2) and unknown proteins (U1 to U4)]
[Figure: unknown proteins assigned to known groups based on majority interactions. Number of CUT = 4]
[Figure: an alternative assignment of unknown proteins. Number of CUT = 2]
For every assignment of unknown proteins, there is a value of CUT.
The Min-Cut approach looks for an assignment for which the number of
CUT is minimum.
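The value of CUT for a given assignment can be computed by counting the interactions whose two proteins end up in different groups. The sketch below assumes, for simplicity, that every protein has exactly one group and that interactions between two known proteins are not counted; the function name `cut_size` and the toy network are illustrative, not the network drawn on the slides.

```python
def cut_size(edges, group_of, unknown):
    """Number of interactions that cross group boundaries and involve
    at least one unknown protein."""
    return sum(1 for u, v in edges
               if (u in unknown or v in unknown) and group_of[u] != group_of[v])

# Hypothetical toy network: K1 and K2 belong to G1, K3 belongs to G2.
toy_edges = [("K1", "U1"), ("K2", "U1"), ("K3", "U1"), ("K1", "K3")]
known = {"K1": "G1", "K2": "G1", "K3": "G2"}

for choice in ("G1", "G2"):
    assignment = dict(known, U1=choice)
    print("U1 ->", choice, "CUT =", cut_size(toy_edges, assignment, {"U1"}))
```

Assigning U1 to G1, the group of the majority of its partners, gives the smaller CUT in this toy case.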
Problem Formulation
Let $G_1, G_2, \ldots, G_n$ be $n$ sets/groups of function-known
proteins such that all proteins of a group are
of similar function. Multiple function proteins are
members of more than one group. Therefore, the set
of all function-known proteins is $G = \bigcup_{k=1}^{n} G_k$. The set of
function-unknown proteins is denoted by $U$. $N(V, E)$ is
a graph/network where $v_i \in V$ is a node representing a
protein and $e_{ij}(v_i, v_j) \in E$ is an edge representing…….
Here we explain some points with a typical example.
Problem Formulation
[Figure: example network N(V, E) with known proteins K1 to K10 in groups G1, G2 and G3 and unknown proteins U1 to U8]
V = set of all nodes
E = set of all edges
G={K1, K2, K3, K4, K5, K6, K7, K8, K9, K10}
U={U1, U2, U3, U4, U5, U6, U7, U8}
Problem Formulation
We generate U´ ⊆ U such that each protein of U´ is
connected in N with at least one protein of group G
by a path of length 1 or length 2.
[Figure: the example network; U8 is not within two steps of any known protein]
U´= {U1, U2, U3, U4, U5, U6, U7}
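Selecting U´ amounts to checking, for each unknown protein, whether a known protein lies within two steps in the network. A minimal sketch follows; the function name `reachable_unknowns` and the toy chain of proteins are assumptions made for illustration and do not reproduce the network on the slide.

```python
from collections import defaultdict

def reachable_unknowns(edges, known, unknown):
    """Unknown proteins joined to at least one known protein by a path of length 1 or 2."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    result = set()
    for p in unknown:
        neighbours = adj.get(p, set())
        within_two = set(neighbours)          # path length 1
        for q in neighbours:
            within_two |= adj[q]              # path length 2
        if within_two & known:
            result.add(p)
    return result

# Hypothetical chain K1 - U1 - U2 - U3: U3 is three steps away from K1.
toy_edges = [("K1", "U1"), ("U1", "U2"), ("U2", "U3")]
print(sorted(reachable_unknowns(toy_edges, {"K1"}, {"U1", "U2", "U3"})))
```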
Problem Formulation
[Figure: the example network with the proteins of U´ assigned to groups]
We can assign proteins of U´ to different groups and calculate CUT.
Interactions between known protein pairs can never be part of CUT.
For this assignment of unknown proteins, the CUT = 6
Problem Formulation
The problem we are trying to solve is to
assign the proteins of set U´ to the known
groups G1, G2, …, G3 in such a way
that the CUT becomes minimum.
A Heuristic Method
•The problem at hand is a variant of the network partitioning
problem.
•It is known that network partitioning problems are NP-hard.
•Therefore, we resort to heuristics to find as good a solution
as possible.
A Heuristic Method
[Flowchart]
min_cut = |E|
iteration = 0
Make a table for each protein of U´ (U1 to U7) containing
at most 3 IDs of its priority groups
Assign each protein of U´ to some randomly or intentionally
chosen group from among its priority groups
Calculate CUT
If CUT < min_cut (YES): min_cut = CUT and record the current assignment
iteration = iteration + 1
If iteration < max_value (YES): go back to the assignment step
Otherwise (NO): print min_cut and the corresponding assignment, and exit
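The loop of the flowchart could be sketched as follows, using purely random choices among the priority groups (the slides also allow intentional choices, which are not modelled here). The function name `min_cut_heuristic`, its parameters and the toy data are hypothetical; the priority table is taken as an input, and each known protein is assumed to belong to a single group.

```python
import random

def min_cut_heuristic(edges, priority, known_group, max_value, rng):
    """Try max_value random assignments and keep the one with the smallest CUT.

    priority    : dict unknown protein -> list of its priority group IDs
    known_group : dict known protein -> its group ID (one group per protein assumed)
    """
    labelled = set(priority) | set(known_group)
    # Only interactions that involve an unknown protein can contribute to CUT.
    relevant = [(u, v) for u, v in edges
                if (u in priority or v in priority) and u in labelled and v in labelled]
    min_cut, best = len(edges), None               # min_cut = |E|
    for _ in range(max_value):
        group = dict(known_group)
        for p in priority:
            group[p] = rng.choice(priority[p])     # random pick among priority groups
        cut = sum(1 for u, v in relevant if group[u] != group[v])
        if cut < min_cut:                          # record the better assignment
            min_cut, best = cut, {p: group[p] for p in priority}
    return min_cut, best

# Hypothetical toy data, not the network drawn on the slides.
toy_edges = [("K1", "U1"), ("K2", "U1"), ("K3", "U1"), ("U1", "U2"), ("K3", "U2")]
toy_known = {"K1": "G1", "K2": "G1", "K3": "G2"}
toy_priority = {"U1": ["G1", "G2"], "U2": ["G2", "G1"]}
print(min_cut_heuristic(toy_edges, toy_priority, toy_known, 200, random.Random(0)))
```

Because the search is random, a larger max_value makes it more likely that a low-CUT assignment is found, at the cost of running time.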
A Heuristic Method
[Figure: the example network, with the priority-group table filled in step by step]
U1 has one path of length 1 with G2 and two paths of length 2
with G1, so its priority groups are G2, G1.
U4 has two paths of length 1 with G1, one path of length 1
with G2 and one path of length 2 with G3, so its priority groups are G1, G2, G3.
Priority groups of each protein of U´ (x = no further group):
U1 | G2 G1 x
U2 | G2 G1 x
U3 | G2 G1 x
U4 | G1 G2 G3
U5 | G1 G2 G3
U6 | G1 G3 G2
U7 | G3 G2 x
A Heuristic Method
[Figure: three different assignments of the unknown proteins in the example network]
By assigning all the unknown proteins to their respective highest
priority groups, CUT = 6
For another assignment of unknown proteins, the CUT = 7
For yet another assignment of unknown proteins, the CUT = 4
Evaluation of the Proposed Approach
•The proposed method is a general one and can be
applied to any organism and any type of functional
classification.
•Here we applied it to yeast Saccharomyces cerevisiae
protein-protein interaction network
•We obtain the protein-protein interaction data from
ftp://ftpmips.gsf.de/yeast/PPI/ which contains 15613
genetic and physical interactions.
Evaluation of the Proposed Approach
We discard self-interactions and extract a
set of 12487 unique binary
interactions involving 4648
proteins.
YAR019c YMR001c
YAR019c YNL098c
YAR019c YOR101w
YAR019c YPR111w
YAR027w YAR030c
YAR027w YBR135w
YAR031w YBR217w
-------------
Total 12487 pairs
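The filtering step described above (discarding self-interactions and keeping each binary interaction once) could look like the sketch below. The function name is hypothetical, the input is assumed to be a list of ORF-name pairs already parsed from the downloaded file, and the reversed and self-interacting pairs in the demo are made up for illustration.

```python
def unique_binary_interactions(pairs):
    """Drop self-interactions and treat (a, b) and (b, a) as the same interaction."""
    unique = set()
    for a, b in pairs:
        if a != b:                                 # discard self-interactions
            unique.add(tuple(sorted((a, b))))      # order-independent key
    return sorted(unique)

raw_pairs = [
    ("YAR019c", "YMR001c"),
    ("YMR001c", "YAR019c"),   # same interaction listed in reverse (made up)
    ("YAR027w", "YAR027w"),   # self-interaction (made up)
    ("YAR027w", "YAR030c"),
]
for pair in unique_binary_interactions(raw_pairs):
    print(*pair)
```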
Evaluation of the Proposed Approach
A network of 12487 interactions and 4648 proteins is reasonably big
Evaluation of the Proposed Approach
We collect from http://mips.gsf.de/genre/proj/yeast/index.jsp the
classification data
Name of functional class | # of proteins
METABOLISM | 984
ENERGY | 260
CELL CYCLE AND DNA PROCESSING | 690
TRANSCRIPTION | 842
PROTEIN SYNTHESIS | 381
PROTEIN FATE (folding, modification, destination) | 631
PROTEIN WITH BINDING FUNCTION OR COFACTOR REQUIREMENT (structural or catalytic) | 39
PROTEIN ACTIVITY REGULATION | 27
CELLULAR TRANSPORT, TRANSPORT FACILITATION AND TRANSPORT ROUTES | 719
CELLULAR COMMUNICATION/SIGNAL TRANSDUCTION MECHANISM | 94
CELL RESCUE, DEFENSE AND VIRULENCE | 296
INTERACTION WITH THE CELLULAR ENVIRONMENT | 336
TRANSPOSABLE ELEMENTS, VIRAL AND PLASMID PROTEINS | 118
BIOGENESIS OF CELLULAR COMPONENTS | 451
CELL TYPE DIFFERENTIATION | 339
Evaluation of the Proposed Approach
•The proposed approach is intended to predict the functions
of function-unknown proteins.
•However, when the functions of truly function-unknown
proteins are predicted, it is not possible to
determine the correctness of the predictions.
•Therefore we treat around 10% randomly selected proteins of
each group of Table 1 as function-unknown proteins.
Evaluation of the Proposed Approach
•The union of the 10% selected from all groups
consists of 604 proteins. This is the
unknown group U.
•The union of the remaining 90% of each
functional group constitutes
the set of known proteins G. There
are 3783 proteins in total in G.
•We generate U´ ⊆ U such that each
protein of U´ is connected in N with
at least one protein of group G by a
path of length 1 or length 2. There
are 470 proteins in U´.
•We predicted the functions of these 470
proteins using the proposed method.
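The hold-out step could be sketched as follows. The slides do not say how a protein that belongs to several groups is handled once it is selected, so this sketch assumes that a selected protein is hidden from all of its groups; the function name `hold_out` and the toy groups are hypothetical.

```python
import random

def hold_out(groups, fraction, rng):
    """Pick about `fraction` of each group at random as pseudo-unknown proteins.

    Assumption: a protein picked from one group is treated as unknown everywhere.
    """
    unknown = set()
    for members in groups.values():
        members = sorted(members)                  # fixed order for reproducibility
        n = round(len(members) * fraction)
        unknown |= set(rng.sample(members, n))
    known = set().union(*groups.values()) - unknown
    return unknown, known

# Hypothetical toy groups with 10 and 20 proteins.
toy_groups = {
    "A": {f"a{i}" for i in range(10)},
    "B": {f"b{i}" for i in range(20)},
}
unknown, known = hold_out(toy_groups, 0.10, random.Random(0))
print(len(unknown), "pseudo-unknown,", len(known), "known")
```

Since the true functions of the hidden proteins are known, each prediction for them can be checked against the original annotation.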
Evaluation of the Proposed Approach
We applied the heuristic algorithm (the flowchart shown earlier under
A Heuristic Method) with max_value = 50000 to
predict the functions of the 470 proteins.
Evaluation of the Proposed Approach
•We cannot guarantee that the minimum CUT corresponds to the
maximum number of successful predictions.
•However, the trend of the results in the figure above
shows that, very likely, the lower the value of
CUT, the greater the number of successful predictions.
Evaluation of the Proposed Approach
We then examine the relation between successful predictions and
the degree of the proteins in the network.
[Figure: the example network]
Degree of U4 = 7
Degree of U7 = 3
Evaluation of the Proposed Approach
Degree | Number of proteins | Successful prediction | Percentage
1 | 128 | 39 | 30.46
2 | 80 | 39 | 48.75
3 | 60 | 32 | 53.33
4 | 33 | 24 | 72.72
5 | 23 | 15 | 65.21
6 | 24 | 14 | 58.33
7 | 17 | 12 | 70.58
>7 | 105 | 71 | 67.61
Total | 470 | 246 | 52.34
•The success rate of prediction is as low as 30.46%
for proteins that have degree one in the interaction
network.
•However, it is 67.61% for proteins that have degree 8
or more.
•This implies that the reliability of the prediction
can be improved by providing a reasonable amount of
interaction information.
[Figure: plot of Success Percentage (0 to 100) against Degree (0 to 8)]
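The percentages in the table can be rechecked from the two count columns. The short sketch below uses integer arithmetic and truncates to two decimals, which is what reproduces the printed values (for example 39/128 = 30.468...% appears as 30.46); the helper name is hypothetical.

```python
# (degree label, number of proteins, successful predictions), copied from the table
rows = [("1", 128, 39), ("2", 80, 39), ("3", 60, 32), ("4", 33, 24),
        ("5", 23, 15), ("6", 24, 14), ("7", 17, 12), (">7", 105, 71)]

def percent_hundredths(successes, total):
    """Success percentage in hundredths of a percent, truncated (integer arithmetic)."""
    return successes * 10000 // total

for label, n, ok in rows:
    h = percent_hundredths(ok, n)
    print(f"degree {label}: {h // 100}.{h % 100:02d}%")

total_n = sum(n for _, n, _ in rows)
total_ok = sum(ok for _, _, ok in rows)
h = percent_hundredths(total_ok, total_n)
print(f"total: {total_ok}/{total_n} = {h // 100}.{h % 100:02d}%")
```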
Application of network concepts in DNA sequencing
Sequencing by hybridization (SBH)
Given an unknown DNA sequence, an array provides
information about all strings of length l that the sequence
contains
s=TATGGTGC, l=3
S(s,l)={TAT, ATG, TGG, GGT, GTG, TGC} (orderly placed)
S(s,l)={GTG, ATG, TGG, TAT, GGT, TGC} (randomly placed)
Input: A spectrum S representing all l-mers from an unknown string s
Output: The string s such that spectrum (s,l) = S.
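For illustration, the spectrum of a known string can be generated with a sliding window of length l. This is a minimal sketch; the function name `spectrum` is chosen here, and the result is returned in sorted order because a spectrum carries no positional information.

```python
def spectrum(s, l):
    """All distinct substrings of length l of s, without positional information."""
    return sorted({s[i:i + l] for i in range(len(s) - l + 1)})

print(spectrum("TATGGTGC", 3))
```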
Sequencing by hybridization (SBH)
The reduction of the SBH problem to an Eulerian
path problem is to construct a graph whose edges
correspond to l-mers from spectrum(s,l) and then to
find a path in this graph visiting every edge exactly
once.
Sequencing by hybridization (SBH)
The reduction of the SBH problem to an Eulerian path
problem is to construct a graph whose nodes correspond to
(l-1)-mers and edges correspond to l-mers from
spectrum(s,l) and then to find a path in this graph visiting
every edge exactly once.
S(s,l)={GTG, ATG, TGG, TAT, GGT, TGC}
(l-1)-mers: GT, TG, AT, TG, TG, GG, TA, AT, GG, GT, TG, GC
(l-1)-mers(redundancy removed): GT, TG, AT, GG, TA, GC
[Figure: directed graph with nodes GG, AT, GT, GC, TG, TA; the path visiting every edge exactly once spells s=TATGGTGC]
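The reduction can be tried out with a short program: build the graph whose nodes are (l-1)-mers and whose edges are the l-mers, then trace a path that uses every edge once. The sketch below uses Hierholzer's algorithm for that last step, which is a standard choice and not necessarily the one intended in the lecture; the function name `reconstruct` is hypothetical, and each l-mer is assumed to occur once in the spectrum.

```python
from collections import defaultdict

def reconstruct(spectrum):
    """Rebuild a string from its spectrum via an Eulerian path (None if there is none)."""
    out_edges = defaultdict(list)
    balance = defaultdict(int)                     # outdegree - indegree per node
    for mer in sorted(spectrum):
        a, b = mer[:-1], mer[1:]                   # edge a -> b for each l-mer
        out_edges[a].append(b)
        balance[a] += 1
        balance[b] -= 1
    # Start at the node with one extra outgoing edge, if there is one.
    start = next((n for n, d in balance.items() if d == 1), next(iter(balance)))
    stack, path = [start], []
    while stack:                                   # Hierholzer's algorithm
        node = stack[-1]
        if out_edges[node]:
            stack.append(out_edges[node].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    if len(path) != len(spectrum) + 1:
        return None                                # some l-mer could not be used
    return path[0] + "".join(n[-1] for n in path[1:])

print(reconstruct({"GTG", "ATG", "TGG", "TAT", "GGT", "TGC"}))
```

When the graph has more than one Eulerian path, the function returns just one of the possible strings.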
Sequencing by hybridization (SBH)
A path in a graph visiting every edge exactly once is
called Eulerian (pronounced Oilerian) path
A connected graph has an Eulerian path, if and only if it contains at
most two semibalanced nodes and all other nodes are balanced.
Balanced node, indegree=outdegree
Semibalanced node |indegree-outdegree|=1
[Figure: the graph with nodes GG, AT, GT, GC, TG, TA; the semibalanced nodes are marked]
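The degree condition can be checked directly from the spectrum. The sketch below only tests the balance condition stated above and assumes the graph is connected; the function names are hypothetical.

```python
from collections import Counter

def node_degrees(spectrum):
    """Map each (l-1)-mer node to its (indegree, outdegree)."""
    indeg, outdeg = Counter(), Counter()
    for mer in spectrum:
        outdeg[mer[:-1]] += 1
        indeg[mer[1:]] += 1
    return {n: (indeg[n], outdeg[n]) for n in set(indeg) | set(outdeg)}

def satisfies_balance_condition(spectrum):
    """True if at most two nodes are semibalanced and all others are balanced."""
    degrees = node_degrees(spectrum)
    semibalanced = [n for n, (i, o) in degrees.items() if abs(i - o) == 1]
    unbalanced = [n for n, (i, o) in degrees.items() if abs(i - o) > 1]
    return not unbalanced and len(semibalanced) <= 2

S = {"GTG", "ATG", "TGG", "TAT", "GGT", "TGC"}
print(satisfies_balance_condition(S))
print(sorted(n for n, (i, o) in node_degrees(S).items() if abs(i - o) == 1))
```

For this spectrum the two semibalanced nodes are TA, where the path starts, and GC, where it ends.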
Sequencing by hybridization (SBH)
Another example
S(s,l)={ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT}
(l-1)-mers:AT, TG, TG, GG, TG, GC, GT, TG, GG, GC, GC, CA, GC, CG, CG, GT
(l-1)-mers(redundancy removed):AT, TG, GG, GC, GT, CA, CG
[Figure: graph with nodes TG, GG, AT, GC, CG, GT, CA; one Eulerian path spells ATGGCGTGCA]
[Figure: the same graph; another Eulerian path spells ATGCGTGGCA]
Both strings have the same spectrum, so the reconstruction is not unique in this example.
