Transcript GraphCrunch

Complementarity of network and sequence information in homologous proteins
Vesna Memišević2, Tijana Milenković2, and Nataša Pržulj1
1Department of Computing, Imperial College London, London, UK
2Department of Computer Science, University of California, Irvine, USA
International Symposium on Integrative Bioinformatics
March, 2010
Motivation
• Genetic sequences – revolutionized understanding of biology
• Non-sequence based data of importance, e.g.:
– Secondary & tertiary structure of RNA have the dominant role in RNA function
(tRNA: Gautheret et al., Comput. Appl. Biosci., 1990)
(rRNA: Woese et al., Microbiological Reviews, 1983)
– Secondary structure-based approach – more effective at finding new functional RNAs than sequence-based alignments
(Webb et al., Science, 2009)
• What about patterns of interconnections in PPI networks?
– Can they complement the knowledge learned from genomic sequence?
– Wiring patterns of duplicated proteins in the PPI network – insights into evolutionary distance?
– Does the information about homologues captured by PPI network topology differ from that captured by their sequence?
Nataša Pržulj
[email protected]
Background
• Homologs – descend from a common ancestor:
1. Paralogs: in the same species, evolve through gene duplication events
2. Orthologs: in different species, evolve through speciation events
Background
• Sequence-based homology data from:
1. Clusters of Orthologous Groups – COG[1]
• Proteins in different genomes – sequences compared for the best hits (BeTs)
• The graph of BeTs constructed
• Triangles in it found
• Triangles sharing a side merged into groups of orthologs and paralogs
• No dependence on the absolute level of similarity between compared proteins
[Figure: example graph of BeTs]
2. KEGG Orthology System[2]
[1] Tatusov et al., BMC Bioinformatics, 4(41), 2003.
[2] Kanehisa et al., Nucleic Acids Res., 28:27–30, 2000.
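The COG steps listed above (build the graph of BeTs, find triangles, merge triangles that share a side) can be sketched as follows. This is only an illustration under stated assumptions, not the COG pipeline itself: BeTs are treated as undirected edges, the toy edge list is made up, and the helper names `find_triangles` and `merge_triangles` are hypothetical.

```python
from itertools import combinations

def find_triangles(edges):
    """Return every triangle in an undirected graph as a frozenset of 3 nodes."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    triangles = set()
    for u, v in edges:
        # Any common neighbour w of u and v closes the triangle u-v-w.
        for w in adj[u] & adj[v]:
            triangles.add(frozenset((u, v, w)))
    return triangles

def merge_triangles(triangles):
    """Merge triangles that share a side (two nodes) into groups, using union-find."""
    tris = list(triangles)
    parent = list(range(len(tris)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(tris)), 2):
        if len(tris[i] & tris[j]) == 2:  # shared side
            parent[find(i)] = find(j)

    groups = {}
    for i, tri in enumerate(tris):
        groups.setdefault(find(i), set()).update(tri)
    return sorted(sorted(g) for g in groups.values())

# Toy BeT graph (hypothetical): two triangles sharing the side 2-3,
# plus edges that belong to no triangle.
bet_edges = [("1", "2"), ("2", "3"), ("1", "3"), ("1'", "2"), ("1'", "3"),
             ("3", "4"), ("5", "6"), ("6", "7")]
groups = merge_triangles(find_triangles(bet_edges))
print(groups)  # [['1', "1'", '2', '3']]
```

Nodes that take part in no triangle (4, 5, 6, 7 in the toy graph) end up in no group, which is why the grouping depends on the pattern of best hits rather than on any absolute similarity cutoff.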
Background
• Sequence-based homology data from:
1. Clusters of Orthologous Groups – COG[1]
2. KEGG Orthology System[2]
• Sequences aligned
• If alignment score < 10^-8, then 1 is assigned as the “similarity bit”
• Otherwise, 0 is assigned as the “similarity bit”
• “Bit vectors” constructed for each protein, over all proteins
• Graph constructed with protein sequences as nodes and correlation coefficients of the nodes’ bit vectors as edges
• Cliques found in the graph = orthology groups
• Again, no dependence on the absolute level of similarity
[Figure: example graph with nodes 1, 1’, 2, 3, 4, 5, 6, 7]
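The KEGG-style steps above (similarity bits, bit vectors, correlation graph, cliques) can be sketched in plain Python. Everything specific here is an assumption made for illustration: the toy score table, the correlation threshold `min_corr` used to decide which edges exist before clique finding, and the function names. The clique search is a basic Bron-Kerbosch enumeration, swapped in as a generic technique rather than taken from KEGG.

```python
import math

def bit_vectors(scores, cutoff=1e-8):
    """One bit vector per protein: bit is 1 where the alignment score is below cutoff."""
    proteins = sorted(scores)
    return {p: [1 if scores[p][q] < cutoff else 0 for q in proteins] for p in proteins}

def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def maximal_cliques(adj):
    """Enumerate maximal cliques with Bron-Kerbosch (no pivoting)."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(sorted(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return sorted(cliques)

def orthology_groups(scores, min_corr=0.9):
    """Cliques of size >= 2 in the graph of highly correlated bit vectors.
    min_corr is an assumed threshold, not a value from the slides."""
    vecs = bit_vectors(scores)
    adj = {p: set() for p in vecs}
    for p in vecs:
        for q in vecs:
            if p < q and pearson(vecs[p], vecs[q]) >= min_corr:
                adj[p].add(q)
                adj[q].add(p)
    return [c for c in maximal_cliques(adj) if len(c) >= 2]

# Toy scores (hypothetical): A, B, C align well with each other; D only with itself.
good, bad = 1e-20, 1.0
scores = {
    "A": {"A": good, "B": good, "C": good, "D": bad},
    "B": {"A": good, "B": good, "C": good, "D": bad},
    "C": {"A": good, "B": good, "C": good, "D": bad},
    "D": {"A": bad, "B": bad, "C": bad, "D": good},
}
kegg_groups = orthology_groups(scores)
print(kegg_groups)  # [['A', 'B', 'C']]
```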
Background
• Sequence-based homology data from:
1. Clusters of Orthologous Groups – COG[1]
2. KEGG Orthology System[2]
• We examine yeast proteins only:
• Extract all possible pairs of them in COG and KEGG groups = “orthologous pairs”
• There are 9,643 unique such pairs
• What are their topological similarities within the PPI network?
Background
• Sequence-based homology data from:
1. Clusters of Orthologous Groups – COG[1]
2. KEGG Orthology System[2]
• Previous network-topology assisted approaches:
• Network-alignment-based (ISORank)
• Yosef, Sharan & Noble, Bioinformatics, 2008 (hybrid Rankprop)
→ Rely heavily on sequence information
→ Use only a limited amount of network topology
Our Method
• We examine yeast proteins only:
• Extract all possible pairs of them in COG and KEGG groups = “orthologous pairs”
• There are 9,643 unique such pairs
• What are their topological similarities within the PPI network?
• PPI networks are noisy
• We analyze the high-confidence part of the yeast PPI network by Collins et al.[3]: 9,074 edges amongst 1,621 proteins
• Focus on proteins with degree > 3 to avoid noisy PPIs
• There are 175 orthologous pairs amongst 181 proteins
[3] Collins et al., Molecular and Cellular Proteomics, 6(3):439–450, 2007
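The pair extraction and degree filter described above can be sketched as below. The group contents, protein names, and the helper names `pairs_from_groups` and `filter_pairs_by_degree` are hypothetical; only the "degree > 3" rule comes from the slide.

```python
from collections import Counter
from itertools import combinations

def pairs_from_groups(groups):
    """All unique unordered protein pairs that co-occur in at least one group."""
    pairs = set()
    for group in groups:
        for a, b in combinations(sorted(set(group)), 2):
            pairs.add((a, b))
    return pairs

def filter_pairs_by_degree(pairs, edges, min_degree=4):
    """Keep pairs in which both proteins have PPI degree > 3 (i.e. >= min_degree)."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return sorted(p for p in pairs if all(degree[x] >= min_degree for x in p))

# Toy data (hypothetical): P1 has degree 4, P2 degree 5, P3 degree 1.
groups = [["P1", "P2", "P3"], ["P2", "P3"]]
edges = ([("P1", f"a{i}") for i in range(4)]
         + [("P2", f"b{i}") for i in range(5)]
         + [("P3", "a0")])
all_pairs = pairs_from_groups(groups)
kept = filter_pairs_by_degree(all_pairs, edges)
print(len(all_pairs), kept)  # 3 [('P1', 'P2')]
```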
Our Method
• Does PPI network topology contain homology information?
→ Are similarly wired proteins homologous?
• Does homology information obtained from network topology differ from that obtained from sequence?
Our Method
Graphlets: small connected subgraphs of the network
→ Induced
→ Of any frequency
N. Przulj, D. G. Corneil, and I. Jurisica, “Modeling Interactome: Scale Free or Geometric?,” Bioinformatics, vol. 20, num. 18, pg. 3508-3515, 2004.
Our Method
Graphlet degrees generalize node degree
N. Przulj, “Biological Network Comparison Using Graphlet Degree Distribution,” ECCB, Bioinformatics, vol. 23, pg. e177-e183, 2007.
Our Method
Graphlet Degree (GD) vectors, or “node signatures”
T. Milenkovic and N. Przulj, “Uncovering Biological Network Function via Graphlet Degree Signatures”, Cancer Informatics, vol. 4, pg. 257-273, 2008.
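To make the idea of a node signature concrete, here is a deliberately reduced sketch. The full signature in the cited work covers 73 orbits over 2- to 5-node graphlets; the code below counts only the four orbits of the 2- and 3-node graphlets (edge end, path end, path middle, triangle). The function name `partial_gdv` and the toy graph are assumptions made for illustration.

```python
from itertools import combinations

def partial_gdv(adj, v):
    """First four graphlet-degree counts for node v (2- and 3-node graphlets only).

    orbit 0: edges touching v (the ordinary degree)
    orbit 1: induced 3-node paths in which v is an end node
    orbit 2: induced 3-node paths in which v is the middle node
    orbit 3: triangles touching v
    """
    nbrs = adj[v]
    o0 = len(nbrs)
    # Pairs of neighbours that are themselves adjacent close a triangle at v.
    o3 = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    # Remaining neighbour pairs form an induced path with v in the middle.
    o2 = o0 * (o0 - 1) // 2 - o3
    # Paths v-u-w where w is neither v nor a neighbour of v (so the path is induced).
    o1 = sum(len(adj[u] - nbrs - {v}) for u in nbrs)
    return [o0, o1, o2, o3]

# Toy graph (hypothetical): triangle a-b-c with a pendant node d attached to c.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
for node in sorted(adj):
    print(node, partial_gdv(adj, node))
# a [2, 1, 0, 1]
# b [2, 1, 0, 1]
# c [3, 0, 2, 1]
# d [1, 2, 0, 0]
```

Nodes a and b get identical vectors because they occupy symmetric positions in the toy graph, while c and d differ from them even though all four nodes sit in the same small neighbourhood.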
Our Method
Similarity measure between nodes’ Graphlet Degree vectors
T. Milenkovic and N. Przulj, “Uncovering Biological Network Function via Graphlet Degree Signatures”, Cancer Informatics, vol. 4, pg. 257-273, 2008.
Our Method
Signature Similarity Measure
T. Milenkovic and N. Przulj, “Uncovering Biological Network Function via Graphlet Degree Signatures”, Cancer Informatics, vol. 4, pg. 257-273, 2008.
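The formula on this slide did not survive in the transcript. The sketch below follows the general form of the measure as it is usually stated for the cited paper: a per-orbit log-scaled distance, averaged with orbit weights, and subtracted from 1. Treat it as an assumption to be checked against the paper. In particular, the paper derives the weights from orbit dependencies, whereas this sketch defaults to uniform weights, and the function name is hypothetical.

```python
import math

def signature_similarity(u, v, weights=None):
    """Similarity in (0, 1] between two graphlet degree vectors u and v.

    Per-orbit distance: w_i * |log(u_i + 1) - log(v_i + 1)| / log(max(u_i, v_i) + 2)
    Total distance: sum of per-orbit distances divided by the sum of weights.
    Similarity: 1 - total distance.
    Uniform weights are an assumption of this sketch.
    """
    if len(u) != len(v):
        raise ValueError("signatures must have the same length")
    if weights is None:
        weights = [1.0] * len(u)
    total = 0.0
    for ui, vi, wi in zip(u, v, weights):
        total += wi * abs(math.log(ui + 1) - math.log(vi + 1)) / math.log(max(ui, vi) + 2)
    return 1.0 - total / sum(weights)

# Toy signatures (hypothetical four-orbit vectors).
same = signature_similarity([3, 0, 2, 1], [3, 0, 2, 1])
diff = signature_similarity([3, 0, 2, 1], [1, 2, 0, 0])
print(same)         # 1.0
print(diff < same)  # True
```

The logarithms keep high-count orbits from dominating the score, so two hubs with large but comparable counts can still come out as similar.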
Our Method
• For the 181 proteins in 175 orthologous pairs, we find:
• Graphlet degree vectors (GDVs) in the entire PPI network
• GDV-similarities (GDS) = topological similarities
• Sequence identities using Smith-Waterman local alignment with the BLOSUM50 substitution matrix as the scoring scheme
• We compare GDV-similarity vs. sequence identity (topology vs. sequence)
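A compact sketch of the sequence side of the comparison: Smith-Waterman local alignment followed by an identity calculation. Two simplifications are assumptions of this sketch and differ from the slide: a flat match/mismatch score stands in for BLOSUM50, and gaps carry a linear penalty. Identity is taken here as matches divided by the length of the local alignment, which is one common convention; the slides do not say which was used.

```python
def smith_waterman_identity(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best local alignment score, fraction of identical aligned positions)."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    matches = length = 0
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            matches += int(a[i - 1] == b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            i -= 1  # gap in b
        else:
            j -= 1  # gap in a
        length += 1
    identity = matches / length if length else 0.0
    return best, identity

print(smith_waterman_identity("XXACGTACGTYY", "ACGTACGT"))  # (16, 1.0)
print(smith_waterman_identity("AAAA", "CCCC"))              # (0, 0.0)
```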
Results
Network Topology
• Orthologous pairs often perform the same or similar function.
• Does GD vector similarity (GDS) imply shared biological function?
• Note: most GO annotations were obtained from sequences
→ Similar topology ~ similar sequence ~ similar function
Results
Network Topology
• Orthologous proteins have high GD vector similarities
• > 20% of orthologous pairs have GDS > 85%
• p-value < 0.05
Results
Network Topology – Robustness
• PPI networks are noisy
• Random edge additions, deletions and rewirings in the PPI net
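One way to set up such a robustness test is sketched below: perturb a chosen fraction of the edges and then recompute the signatures on the perturbed network. The function name `perturb`, the mode names, and the reading of "rewiring" as "delete k edges, then add k random new ones" are assumptions of this sketch; the slides do not give the exact randomization procedure.

```python
import random

def perturb(edges, fraction, mode, seed=0):
    """Return a perturbed copy of an undirected edge list.

    mode: "add"    -> add k random edges that are not yet present
          "delete" -> remove k random existing edges
          "rewire" -> remove k random edges, then add k random new ones
    where k = round(fraction * number of edges).
    """
    if mode not in ("add", "delete", "rewire"):
        raise ValueError("mode must be 'add', 'delete' or 'rewire'")
    rng = random.Random(seed)
    edge_set = {frozenset(e) for e in edges}
    nodes = sorted({n for e in edge_set for n in e})
    k = int(round(fraction * len(edge_set)))

    if mode in ("delete", "rewire"):
        removed = rng.sample(sorted(edge_set, key=sorted), k)
        edge_set -= set(removed)

    if mode in ("add", "rewire"):
        free = len(nodes) * (len(nodes) - 1) // 2 - len(edge_set)
        if k > free:
            raise ValueError("not enough absent edges to add")
        added = 0
        while added < k:
            candidate = frozenset(rng.sample(nodes, 2))
            if candidate not in edge_set:
                edge_set.add(candidate)
                added += 1

    return sorted(tuple(sorted(e)) for e in edge_set)

# Toy network (hypothetical): a path on six nodes, 5 edges; 40% noise -> k = 2.
path = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6)]
print(len(perturb(path, 0.4, "add")))     # 7
print(len(perturb(path, 0.4, "delete")))  # 3
print(len(perturb(path, 0.4, "rewire")))  # 5
```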
Results
Sequence
• Sequence identities for the 175 orthologous pairs
• ~20% of orthologous pairs have sequence identity > 90%
• ~70% of orthologous pairs have sequence identity < 35%
• “Twilight zone” for homology: 20-35% sequence identity
→ COG & KEGG do not depend on the absolute similarity, but on triangles in the graph of best matches
Results
Comparison:
• ~20% of orthologous pairs have signature similarities above 85% (35 pairs)
• ~30% of orthologous pairs have sequence identities above 35% (53 pairs)
• Overlap: 22 pairs (~60% of the smaller set)
→ Sequence and network topology: somewhat complementary slices of homology information
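The percentages in this comparison follow directly from the pair counts. The short check below assumes the total of 175 orthologous pairs stated earlier in the talk; the variable names are illustrative only.

```python
total_pairs = 175  # orthologous pairs among the 181 proteins
topo_pairs = 35    # pairs with signature similarity above 85%
seq_pairs = 53     # pairs with sequence identity above 35%
overlap = 22       # pairs that satisfy both criteria

frac_topo = topo_pairs / total_pairs
frac_seq = seq_pairs / total_pairs
frac_overlap = overlap / min(topo_pairs, seq_pairs)

topo_only = topo_pairs - overlap            # found by topology but not by sequence
seq_only = seq_pairs - overlap              # found by sequence but not by topology
either = topo_pairs + seq_pairs - overlap   # found by at least one criterion

print(f"{frac_topo:.1%}")     # 20.0%
print(f"{frac_seq:.1%}")      # 30.3%
print(f"{frac_overlap:.1%}")  # 62.9%
print(topo_only, seq_only, either)  # 13 31 66
```

The 13 topology-only and 31 sequence-only pairs are what makes the two sources complementary rather than redundant.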
Results
Examples
• 59 of the yeast ribosomal proteins – retained two genomic copies
• Are duplicated proteins functionally redundant?
• No: they have different genetic requirements for their assembly and localization, so they are functionally distinct
• Also note: average sequence identity of structurally similar proteins is ~8-10%
• Two pairs with identical sequence:
– 100% sequence identity, 50% signature similarity, degrees 25 and 5
– 100% sequence identity, 65% signature similarity, degrees 54 and 9
Conclusions
• Homology information captured by PPI network topology differs from that captured by sequence
• Complementary sources for identifying homologs
Future work:
• Could topological similarity be used to identify orthologs from best-hits graph analysis, as done for sequences?
Acknowledgements
This project was supported by the NSF CAREER IIS-0644424 grant