Transcript Document
Centre for Integrative Bioinformatics VU
Introduction to Bioinformatics
Lecture 20
Global network behaviour
Networks
"The thousands of components of a living cell are
dynamically interconnected, so that the cell’s functional
properties are ultimately encoded into a complex
intracellular web [network] of molecular interactions."
"This is perhaps most evident with cellular metabolism, a
fully connected biochemical network in which hundreds of
metabolic substrates are densely integrated through
biochemical reactions." (Ravasz E, et al.)
How linked up are the direct neighbours of the node considered?
[Figure: example network with nodes labelled TF and ribosomal proteins]
Clustering coefficient of the example node: 1/(4(4-1)/2) = 1/6
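The 1/6 value can be reproduced with a minimal sketch of the local clustering coefficient: the number of links that exist among a node's k neighbours, divided by the k(k-1)/2 links that could exist. The toy network and the node names below are hypothetical, chosen so that a node with 4 neighbours has exactly 1 link among them.

```python
from itertools import combinations

def clustering_coefficient(adjacency, node):
    """Fraction of possible links among a node's neighbours that actually exist."""
    neighbours = adjacency[node]
    k = len(neighbours)
    if k < 2:
        return 0.0  # fewer than two neighbours: no neighbour pairs to link
    links = sum(1 for a, b in combinations(neighbours, 2) if b in adjacency[a])
    return links / (k * (k - 1) / 2)

# Hypothetical toy network: node "n" has 4 neighbours, and only a-b are linked.
toy = {
    "n": {"a", "b", "c", "d"},
    "a": {"n", "b"},
    "b": {"n", "a"},
    "c": {"n"},
    "d": {"n"},
}
print(clustering_coefficient(toy, "n"))
```

For node "n" this gives 1 existing link out of 6 possible ones.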
Small-world networks
A seminal paper, Collective dynamics of "small-world" networks, by
Duncan J. Watts and Steven H. Strogatz, which appeared in Nature
volume 393, pp. 440-442 (4 June 1998), has attracted considerable
attention.
One can consider two extremes of networks:
The first are regular networks, where "nearby" nodes have large numbers
of interconnections, but "distant" nodes have few.
The second are random networks, where the nodes are connected at
random.
Regular networks are highly clustered, i.e., there is a high density of
connections between nearby nodes, but have long path lengths, i.e., to go
from one distant node to another one must pass through many
intermediate nodes.
Random networks are highly un-clustered but have short path lengths.
This is because the randomness makes it less likely that nearby nodes will
have lots of connections, but introduces more links that connect one part of
the network to another.
Regular and random networks
[Figure panels: random, regular, and regular complete networks]
Making a small world
A small-world network can be generated from a regular one by
randomly disconnecting a few points and randomly reconnecting
them elsewhere.
Another way to think of a small-world network is that some so-called 'shortcut' links are added to a regular network, as shown
here:
The added links are shortcuts because they allow travel from node (a)
to node (b) in only 3 steps, instead of the 5 needed without the shortcuts.
Regular, small-world and random networks:
Rewiring experiments (Watts and Strogatz, 1998)
p is the probability that a randomly chosen connection will be randomly redirected
elsewhere (i.e., p=0 means nothing is changed, leaving the network regular; p=1
means every connection is changed and randomly reconnected, yielding
complete randomness).
For example, for p = 0.01 (so that only 1% of the edges in the graph have been
randomly changed), the "clustering coefficient" is over 95% of what it would be for
a regular graph, but the "characteristic path length" is less than 20% of what it
would be for a regular graph.
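The rewiring experiment can be sketched in a few lines. This is a simplified illustration, not the exact procedure of Watts and Strogatz: the network size (200 nodes, 8 neighbours each), the seed and the function names are assumptions, and the path length is averaged over reachable pairs only.

```python
import random
from collections import deque

def ring_lattice(n, k):
    """Regular ring: each node is linked to its k nearest neighbours (k even)."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for step in range(1, k // 2 + 1):
            j = (i + step) % n
            adj[i].add(j)
            adj[j].add(i)
    return adj

def rewire(adj, p, rng):
    """With probability p, redirect each edge (i, j) to a random new endpoint."""
    new = {i: set(nb) for i, nb in adj.items()}
    edges = [(i, j) for i in adj for j in adj[i] if i < j]
    for i, j in edges:
        if rng.random() < p:
            candidates = [t for t in new if t != i and t not in new[i]]
            if candidates:
                t = rng.choice(candidates)
                new[i].discard(j)
                new[j].discard(i)
                new[i].add(t)
                new[t].add(i)
    return new

def avg_clustering(adj):
    """Mean local clustering coefficient C over all nodes."""
    total = 0.0
    for nb in adj.values():
        nodes = list(nb)
        k = len(nodes)
        if k < 2:
            continue
        links = sum(1 for x in range(k) for y in range(x + 1, k)
                    if nodes[y] in adj[nodes[x]])
        total += links / (k * (k - 1) / 2)
    return total / len(adj)

def avg_path_length(adj):
    """Characteristic path length L: mean BFS distance over reachable pairs."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

rng = random.Random(0)  # fixed seed so the sketch is repeatable
regular = ring_lattice(200, 8)
c0, l0 = avg_clustering(regular), avg_path_length(regular)
for p in (0.0, 0.01, 0.1, 1.0):
    g = rewire(regular, p, rng)
    # print p, C(p)/C(0) and L(p)/L(0)
    print(p, round(avg_clustering(g) / c0, 2), round(avg_path_length(g) / l0, 2))
```

The exact ratios depend on the seed and on the network size, so a small network like this one will not necessarily reproduce the percentages quoted above.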
Small-world networks
Network characterisation:
L = characteristic path length
C = clustering coefficient
A small-world network is much more highly clustered than an
equally sparse random graph (C >> C_random), and its
characteristic path length L is close to the theoretical
minimum shown by a random graph (L ~ L_random).
•The reason a graph can have small L despite being
highly clustered is that a few nodes connecting distant
clusters are sufficient to lower L.
•Because C changes little as small-worldliness
develops, it follows that small-worldliness is a global
graph property that cannot be found by studying local
graph properties.
Small-world networks
A network with rewiring probability p (0 < p < 1, as
in earlier slides) can be
characterized by the average
shortest path length L(p) between
any two points, and a
clustering coefficient C(p) that
measures the cliquishness of a
typical neighbourhood (a local
property).
These can be calculated from
mathematical simulations and
yield the following behavior
(Watts and Strogatz):
Small-world networks
Part of the reason for the interest in the results of Watts and Strogatz is that small-world networks seem to be good models for a wide variety of physical situations.
They showed that the power grid for the western U.S. (nodes are power stations,
and there is an edge joining two nodes if the power stations are joined by high-voltage transmission lines), the neural network of a nematode worm (nodes are
neurons and there is an edge joining two nodes if the neurons are joined by a
synapse or gap junction), and the Internet Movie Database (nodes are actors and
there is an edge joining two nodes if the actors have appeared in the same movie)
all have the characteristics (high clustering coefficient but low characteristic path
length) of small-world networks.
Intuitively, one can see why small-world networks might provide a good model
for a number of situations. For example, people tend to form tight clusters of
friends and colleagues (a regular network), but then one person might move from
New York to Los Angeles, say, introducing a random edge. The results of Watts
and Strogatz then provide an explanation for the empirically observed
phenomenon that there often seem to be surprisingly short connections between
unrelated people (e.g., you meet a complete stranger on an airplane and soon
discover that your sister's best friend went to college with his boss's wife).
Small world example: metabolism.
Wagner and Fell (2001) modeled the known reactions of 287 substrates that
represent the central routes of energy metabolism and small-molecule building
block synthesis in E. coli. This included metabolic sub-pathways such as:
•glycolysis
•pentose phosphate and Entner-Doudoroff pathways
•glycogen metabolism
•acetate production
•glyoxylate and anaplerotic reactions
•tricarboxylic acid cycle
•oxidative phosphorylation
•amino acid and polyamine biosynthesis
•nucleotide and nucleoside biosynthesis
•folate synthesis and 1-carbon metabolism
•glycerol 3-phosphate and membrane lipids
•riboflavin
•coenzyme A
•NAD(P)
•porphyrins, haem and sirohaem
•lipopolysaccharides and murein
•pyrophosphate metabolism
•transport reactions
•glycerol 3-phosphate production
•isoprenoid biosynthesis and quinone biosynthesis
•These sub-pathways form a
network because some
compounds are part of more
than one pathway and
because most of them include
common components such as
ATP and NADP.
Wagner A, Fell D (2001) The small world inside large
metabolic networks. Proc. R. Soc. London Ser. B 268, 1803-1810.
•The graphs on the left show
that, whether reactions or
substrates are taken as nodes, the
clustering coefficient
C >> C_random, and the characteristic
path length L is near
L_random, both characteristics of a
small-world system.
Scale-free Networks
Using a Web crawler, physicist Albert-Laszlo Barabasi and his colleagues at
the University of Notre Dame in Indiana in 1998 mapped the connectedness
of the Web. They were surprised to find that the structure of the Web didn't
conform to the then-accepted model of random connectivity. Instead, their
experiment yielded a connectivity map that they christened "scale-free."
•Often small-world networks are also
scale-free.
•In a scale-free network the
characteristic clustering is
maintained even as the networks
themselves grow arbitrarily large.
Scale-free Networks
In any real network some nodes are more highly connected than
others.
•P(k) is the proportion of nodes that have k links.
•For large, random graphs only a few nodes have a very small k
and only very few have a very large k, leading to a bell-shaped
Poisson distribution:
Scale-free networks fall off more slowly
and are more highly skewed than random
ones due to the combination of small-world local highly connected
neighborhoods and more 'shortcuts' than
would be expected by chance.
Scale-free networks are governed by a power law of the form:
P(k) ~ k^(-γ)
Scale-free Networks
Because of the P(k) ~ k^(-γ) power law relationship, a log-log plot of P(k)
versus k gives a straight line of slope -γ:
Some networks, especially small-world networks of modest size, do
not follow a power law, but are
exponential. This point can be
significant when trying to
understand the rules that
underlie the network.
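A minimal sketch of the log-log test: fit a straight line to log P(k) against log k and read the exponent off the slope. The exponent 2.5, the decay rate 0.1 and the function name are assumptions for illustration; real degree data are noisy and usually need more careful fitting than a least-squares line.

```python
import math

def loglog_slope(ks, pks):
    """Least-squares slope of log P(k) against log k."""
    xs = [math.log(k) for k in ks]
    ys = [math.log(p) for p in pks]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

gamma = 2.5  # hypothetical exponent
ks = list(range(1, 101))
power_law = [k ** -gamma for k in ks]           # P(k) ~ k^(-gamma)
exponential = [math.exp(-0.1 * k) for k in ks]  # P(k) ~ exp(-k/10)

# A power law gives the same slope (close to -gamma) over the whole range.
print(round(loglog_slope(ks, power_law), 3))

# An exponential is not straight on a log-log plot: the slope keeps steepening.
slope_low = loglog_slope(ks[:10], exponential[:10])
slope_high = loglog_slope(ks[50:], exponential[50:])
print(round(slope_low, 2), round(slope_high, 2))
```

Comparing the slope over low and high k is a quick way to see whether a distribution bends away from a straight line, as an exponential one does.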
Comparing Random and Scale-Free Distribution
In the random network (right), the five nodes with the most
links (in red) are connected to only 27% of all nodes (green).
In the scale-free network (left), the five most connected
nodes (red), often called hubs, are connected to 60% of all
nodes (green).
Scale-free Networks
• Barabasi and his team first studied the internet and
discovered scale-free network behaviour.
• Since then, this has been observed, for example, for power
grids, the stock market, cancerous cells, and sexually
transmitted diseases.
• From random network models, the idea was that large
networks would hardly have any well-connected nodes.
Although not all nodes in a random network would be
connected to the same degree, most would have a number
of connections hovering around a small, average value.
Also, as a randomly distributed network grows, the relative
number of very connected nodes decreases.
Scale-free Networks
• Scale-free networks include many "very connected" nodes,
hubs of connectivity that shape the way the network
operates. The ratio of very connected nodes to the number
of nodes in the rest of the network remains constant as the
network changes in size.
• Because of these differences, random and scale-free
networks behave differently as they break down. The
connectedness of a randomly distributed network decays
steadily as nodes fail, slowly breaking into smaller, separate
domains that are unable to communicate.
• Scale-free networks are more robust, but in a special way.
• Scale-free networks can have small-world characteristics,
as can randomly connected networks (but see the earlier
experiment for small-world networks).
Scale Free Network
• Hubs, highly connected nodes, bring together
different parts of the network
• Robustness: removing random nodes has
little effect
• Low attack resistance: Removing a hub is
lethal.
Random Network
• No hubs
• Low robustness
• Low attack resistance
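The contrast between random failure and targeted attack on a scale-free network can be explored with a small simulation. This is a sketch under stated assumptions: the network is grown by a simple preferential-attachment rule, and the size (1000 nodes), the 2 links per new node, the 10% removal fraction and the seed are all arbitrary choices for illustration.

```python
import random

def scale_free_graph(n, m, rng):
    """Grow a graph in which each new node links to m degree-biased targets."""
    adj = {i: set() for i in range(n)}
    # start from a small fully connected core of m + 1 nodes
    for i in range(m + 1):
        for j in range(i):
            adj[i].add(j)
            adj[j].add(i)
    # each node appears in `ends` once per link, so sampling from it
    # picks targets in proportion to their degree
    ends = [v for v in range(m + 1) for _ in adj[v]]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(ends))
        for t in targets:
            adj[new].add(t)
            adj[t].add(new)
            ends.extend((new, t))
    return adj

def largest_component(adj, removed):
    """Size of the largest connected component after deleting `removed`."""
    seen, best = set(removed), 0
    for start in adj:
        if start in seen:
            continue
        stack, size = [start], 0
        seen.add(start)
        while stack:
            u = stack.pop()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        best = max(best, size)
    return best

rng = random.Random(1)
g = scale_free_graph(1000, 2, rng)
n_remove = 100
random_failures = set(rng.sample(sorted(g), n_remove))
hubs = set(sorted(g, key=lambda v: len(g[v]), reverse=True)[:n_remove])
print("random failure:", largest_component(g, random_failures))
print("hub attack:    ", largest_component(g, hubs))
```

Removing the same number of nodes leaves a smaller connected core when the hubs are targeted than when nodes fail at random.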
Scale-free Networks
Epidemiologists are also pondering the significance of scale-free connectivity.
Until now, it has been accepted that stopping sexually transmitted diseases requires
reaching or immunizing a large proportion of the population; most contacts will be
safe, and the disease will no longer spread. But if societies of people include the very
connected individuals of scale-free networks—individuals who have sex lives that are
quantitatively different from those of their peers—then health offensives will fail
unless they target these individuals. These individuals will propagate the disease no
matter how many of their more subdued neighbors are immunized.
Now consider the following: Geographic connectivity of Internet nodes is scale-free,
the number of links on Web pages is scale-free, Web users belong to interest groups
that are connected in a scale-free way, and e-mails propagate in a scale-free way.
Barabasi's model of the Internet tells us that stopping a computer virus from
spreading requires that we focus on protecting the hubs.
14-3-3 subtypes (paralogs)
14-3-3 paralogs
(black) have
evolved to bind
different partners
(grey) but still
share MARK3 as a
binding partner
Schematic representation
of co-immunoprecipitation studies
performed with anti-MARK
(microtubule affinity-regulating
kinase) antibodies. The strength of
the interactions is indicated by the
thickness of the arrows (after ref. 2).
…connect preferentially to a hub
Preferential attachment
Hub protein characteristics:
•Multiple binding sites
•Promiscuous binding
•Non-specific binding
…connect preferentially to a hub
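Preferential attachment can be illustrated with a minimal "rich get richer" growth sketch: every new node makes one link, and the chance of picking an existing node is proportional to the number of links it already has. The network size, the seed and the function name are assumptions; only node degrees are tracked, not the full network.

```python
import random
from collections import Counter

def grow_by_preferential_attachment(n_nodes, rng):
    """Each new node links to one existing node chosen in proportion to degree."""
    degree = [1, 1]  # start with two linked nodes
    ends = [0, 1]    # node i appears in this list degree[i] times
    for new in range(2, n_nodes):
        target = rng.choice(ends)  # degree-biased choice of partner
        degree[target] += 1
        degree.append(1)
        ends.extend((target, new))
    return degree

degree = grow_by_preferential_attachment(2000, random.Random(0))
counts = Counter(degree)
print("largest hub degree:", max(degree))
print("nodes with a single link:", counts[1])
```

Most nodes end up with a single link while a few early nodes collect many links and become hubs.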
Network motifs
• Different Motifs in
different processes
• Observation: more
interconnected
motifs are more
conserved
Robustness of the biodegradation
network against perturbations is
tested here by removing 200
edges randomly (simulating each
time that the enzyme catalysing
the reaction step has mutated)
(A) For each connection lost (red line),
1.6 compounds lose their pathway to
the Central Metabolism (CM).
(B) However, the increase in the
average pathway length to the CM for
the remaining compounds is small
The biodegradation network appears to be less tolerant to
perturbations than metabolic networks (Jeong et al., 2000)
Preferential attachment in
biodegradation networks
New degradable
compounds are
observed to attach
preferentially to hubs
close to (or in) the
Central Metabolism
The “Matchmaker” 14-3-3 family
•Massively interacting protein family (the
PPI champions) by means of various
binding modes
•Involved in many essential cell
processes
•Occurs throughout the kingdoms of life
•Various numbers of isoforms in different
organisms (7 in human)
14-3-3 dimer structure
14-3-3 network (hub?) promotion
by binding and bringing together two different proteins
Janus-faced character of 14-3-3s
Identified (co)-targets fall in opposing classes: they seem
to both cause and work against cancer...
Clear color: actin
growth, proapoptotic,
stimulation of
transcription,
nuclear import,
neuron
development.
Hatched: opposing
functions. 100% =
56 proteins (De
Boer & Jimenez,
unpubl. data.).
Targets of 14-3-3 proteins implicated in
tumor development.
Arrows indicate positive effects while sticks represent inhibitory
effects. Targets involved in primary apoptosis and cell cycle control are
not shown due to space limitations.
Role of 14-3-3 proteins in apoptosis
14-3-3 proteins inhibit apoptosis through multiple mechanisms: sequestration
and control of subcellular localization of phosphorylated and
nonphosphorylated pro- and anti-apoptotic proteins.
What is the role of the subtypes? Modularity?
14-3-3 subtypes (paralogs)
Different subtypes display
different binding modes,
reflecting pronounced
divergent evolution after
duplication
14-3-3 subtypes
Schematic representation
of co-immunoprecipitation studies
performed with anti-MARK
(microtubule affinity-regulating
kinase) antibodies. The strength of
the interactions is indicated by the
thickness of the arrows.
Protein Interaction Prediction
How can we get the edges (connections)
of the cellular networks?
•We can predict functions of genes or
proteins so we know where they would fit in a
metabolic network
•There are also techniques to predict whether
two proteins interact, either functionally (e.g.
they are involved in a two-step metabolic
process) or directly physically (e.g. are
together in a protein complex)
Protein Function Prediction
The state of the art – it’s not
complete
Many genes are not annotated, and many more are
partially or erroneously annotated. Given a genome
which is partially annotated at best, how do we fill in the
blanks?
For each sequenced genome, the functions of 20%-50% of
the encoded proteins remain unknown!
How then do we build reasonably complete networks
when the parts list is so incomplete?
Protein interaction prediction through
co-evolution
FALSE NEGATIVES:
•need many organisms
• relies on known orthologous relationships
FALSE POSITIVES
• Phylogenetic signals at the organism level
• Functional interaction may not mean physical
interaction
Phylogenetic profile analysis (recap)
Function prediction of genes based on “guilt-by-association” – a non-homologous approach
The phylogenetic profile of a protein is a string that
encodes the presence or absence of the protein in
every sequenced genome
Because proteins that participate in a common
structural complex or metabolic pathway are likely to
co-evolve, the phylogenetic profiles of such proteins
are often “similar”
This means that such proteins have a good chance of
being physically or metabolically connected
Phylogenetic profile analysis (Recap)
Phylogenetic profile (against N genomes)
– For each gene X in a target genome (e.g., E. coli),
build a phylogenetic profile as follows
– If gene X has a homolog in genome #i, the ith bit
of X’s phylogenetic profile is “1”; otherwise it is “0”
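The construction above can be sketched as follows. The gene names, genome names and homology calls are hypothetical; in practice the presence/absence calls would come from a homology search, which is not shown here.

```python
def phylogenetic_profile(gene, genomes, has_homolog):
    """Bit string with '1' where `gene` has a homolog in the genome, else '0'."""
    return "".join("1" if has_homolog(gene, genome) else "0" for genome in genomes)

# Hypothetical homology calls: the set of genomes in which each gene has a homolog.
hits = {
    "geneX": {"genome1", "genome3"},
    "geneY": {"genome1", "genome3"},
    "geneZ": {"genome2"},
}
genomes = ["genome1", "genome2", "genome3", "genome4"]

profiles = {
    gene: phylogenetic_profile(gene, genomes, lambda g, genome: genome in hits[g])
    for gene in hits
}
print(profiles)
```

Here geneX and geneY get the same profile, which is the kind of signal the method looks for.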
Phylogenetic profile analysis (recap)
Example – phylogenetic profiles based on 60
genomes
genome
gene
orf1034:1110110110010111110100010100000000111100011111110110111010101
orf1036:1011110001000001010000010010000000010111101110011011010000101
orf1037:1101100110000001110010000111111001101111101011101111000010100
orf1038:1110100110010010110010011100000101110101101111111111110000101
orf1039:1111111111111111111111111111111111111111101111111111111111101
orf104: 1000101000000000000000101000000000110000000000000100101000100
orf1040:1110111111111101111101111100000111111100111111110110111111101
orf1041:1111111111111111110111111111111101111111101111111111111111101
orf1042:1110100101010010010110000100001001111110111110101101100010101
orf1043:1110100110010000010100111100100001111110101111011101000010101
orf1044:1111100111110010010111010111111001111111111111101101100010101
orf1045:1111110110110011111111111111111101111111101111111111110010101
orf1046:0101100000010001011000000111110000010100000001010010100000000
orf1047:0000000000000001000010000001000100000000000000010000000000000
orf105: 0110110110100010111101101010111001101100101111100010000010001
orf1054:0100100110000001100001000100000000100100100001000100100000000
By correlating the rows
(open reading frames
(ORF) or genes) you find
out about joint presence
or absence of genes: this
is a signal for a
functional connection
Genes with similar phylogenetic profiles have related functions
or are functionally linked – D Eisenberg and colleagues (1999)
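Correlating rows can be sketched with a very simple similarity measure: the fraction of genomes in which two genes are both present or both absent. The short 10-genome profiles below are hypothetical stand-ins (rows like those on the slide could be used the same way), and this matching fraction is only one of several possible measures.

```python
from itertools import combinations

def profile_similarity(a, b):
    """Fraction of genomes in which two genes are both present or both absent."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same genomes")
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Hypothetical 10-genome profiles
profiles = {
    "orfA": "1110110110",
    "orfB": "1110110100",
    "orfC": "0001001001",
    "orfD": "1111111111",
}

# Rank all gene pairs from most to least similar profile.
pairs = sorted(
    combinations(profiles, 2),
    key=lambda pair: profile_similarity(profiles[pair[0]], profiles[pair[1]]),
    reverse=True,
)
for a, b in pairs[:2]:
    print(a, b, profile_similarity(profiles[a], profiles[b]))
```

Note that a gene present in nearly every genome (like orfD here) matches many other genes fairly well, so it says little about specific functional links.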
Phylogenetic profile analysis
• Evolution suppresses
unnecessary proteins
• Once a member of an
interaction is lost, the
partner is likely to be lost
as well
Phylogenetic profile analysis (recap)
Phylogenetic profiles contain a great amount of functional
information
Phylogenetic profile analysis can be used to distinguish
orthologous genes from paralogous genes
Subcellular localization: 361 yeast nucleus-encoded
mitochondrial proteins are identified at 50% accuracy with 58%
coverage through phylogenetic profile analysis
Functional complementarity: By examining inverse phylogenetic
profiles, one can find functionally complementary genes that
have evolved through one of several mechanisms of convergent
evolution.
Prediction of protein-protein interactions
(recap)
Rosetta stone method
Gene fusion is an effective method for prediction
of protein-protein interactions
– If proteins A and B are homologous to two domains of a
protein C, A and B are predicted to interact
A
B
C
Though gene fusion has low prediction coverage, its
false-positive rate is low (high specificity)
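The Rosetta stone rule can be sketched as a search for pairs of proteins that match different, non-overlapping regions of the same fused protein. The input format (protein name plus start and end of the match on the fused protein), the coordinates and the function name are assumptions for illustration; the homology search that would produce such matches is not shown.

```python
from itertools import combinations

def rosetta_stone_pairs(domain_hits):
    """Predict A-B links when A and B match separate regions of one fused protein.

    domain_hits maps a fused protein to a list of
    (query_protein, start, end) matches along its sequence.
    """
    predicted = set()
    for hits in domain_hits.values():
        for (a, a_start, a_end), (b, b_start, b_end) in combinations(hits, 2):
            overlap = min(a_end, b_end) - max(a_start, b_start)
            # keep only distinct proteins that match non-overlapping regions
            if a != b and overlap <= 0:
                predicted.add(frozenset((a, b)))
    return predicted

# Hypothetical matches: A and B align to separate parts of C, while
# A and E align to overlapping parts of D (same region, so no prediction).
hits = {
    "C": [("A", 1, 200), ("B", 220, 450)],
    "D": [("A", 1, 180), ("E", 100, 300)],
}
print(rosetta_stone_pairs(hits))
```

Only the A-B pair is predicted here, because A and E match the same region of D rather than two different domains.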
Gene (domain) fusion example
Vertebrates have a multi-enzyme protein (GARs-AIRs-GARt) comprising the enzymes GAR
synthetase (GARs), AIR synthetase (AIRs), and
GAR transformylase (GARt).
In insects, the polypeptide appears as GARs-(AIRs)2-GARt.
In yeast, GARs-AIRs is encoded separately from
GARt
In bacteria each domain is encoded separately
(Henikoff et al., 1997).
GAR: glycinamide ribonucleotide
AIR: aminoimidazole ribonucleotide
Protein interaction database (recap)
There are numerous databases of protein-protein
interactions
DIP is a popular protein-protein interaction database
The DIP database catalogs
experimentally determined
interactions between proteins.
It combines information from a
variety of sources to create a
single, consistent set of
protein-protein interactions.
Protein interaction databases (Recap)
BIND - Biomolecular Interaction Network Database
DIP - Database of Interacting Proteins
PIM – Hybrigenics
PathCalling Yeast Interaction Database
MINT - a Molecular Interactions Database
GRID - The General Repository for Interaction Datasets
InterPreTS - protein interaction prediction through tertiary structure
STRING - predicted functional associations among genes/proteins
Mammalian protein-protein interaction database (PPI)
InterDom - database of putative interacting protein domains
FusionDB - database of bacterial and archaeal gene fusion events
IntAct Project
The Human Protein Interaction Database (HPID)
ADVICE - Automated Detection and Validation of Interaction by Co-evolution
InterWeaver - protein interaction reports with online evidence
PathBLAST - alignment of protein interaction networks
ClusPro - a fully automated algorithm for protein-protein docking
HPRD - Human Protein Reference Database
Protein interaction database (recap)
Network of protein interactions and predicted functional links involving silencing
information regulator (SIR) proteins. Filled circles represent proteins of known function;
open circles represent proteins of unknown function, represented only by their
Saccharomyces genome sequence numbers (http://genome-www.stanford.edu/Saccharomyces). Solid lines show experimentally determined
interactions, as summarized in the Database of Interacting Proteins (ref. 19; http://dip.doe-mbi.ucla.edu). Dashed lines show functional links predicted by the Rosetta Stone
method (ref. 12). Dotted lines show functional links predicted by phylogenetic profiles (ref. 16). Some
predicted links are omitted for clarity.
Network of predicted
functional linkages involving
the yeast prion protein (ref. 20)
Sup35. The dashed line shows
the only experimentally
determined interaction. The
other functional links were
calculated from genome and
expression data (ref. 11) by a
combination of methods,
including phylogenetic
profiles, Rosetta stone
linkages and mRNA
expression. Linkages
predicted by more than one
method, and hence
particularly reliable, are
shown by heavy lines.
Adapted from ref. 11.
STRING - predicted functional
associations among genes/proteins
STRING is a database of predicted functional
associations among genes/proteins.
Genes of similar function tend to be
maintained in close neighborhood, tend to be
present or absent together, i.e. to have the
same phylogenetic occurrence, and can
sometimes be found fused into a single gene
encoding a combined polypeptide.
STRING integrates this information from as
many genomes as possible to predict
functional links between proteins.
Berend Snel and Martijn Huynen (RUN) and the group of Peer Bork (EMBL, Heidelberg)
STRING - predicted functional associations among genes/proteins (recap)
STRING is a database of known and predicted protein-protein interactions.
The interactions include direct (physical) and indirect
(functional) associations; they are derived from four
sources:
1. Genomic Context (Synteny)
2. High-throughput Experiments
3. (Conserved) Co-expression
4. Previous Knowledge
STRING quantitatively integrates interaction data from
these sources for a large number of organisms, and
transfers information between these organisms where
applicable. The database currently contains 736429
proteins in 179 species
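The slides do not say how the four sources are integrated. As an illustration only, one common way to combine independent evidence scores is a "noisy-OR": if each source gives a probability s_i that a link is real, the combined probability is 1 minus the product of the (1 - s_i). The formula, the channel names and the numbers below are assumptions, not a description of the STRING implementation.

```python
def combined_score(channel_scores):
    """Probability that at least one independent evidence channel is right."""
    miss = 1.0
    for score in channel_scores.values():
        if not 0.0 <= score <= 1.0:
            raise ValueError("scores must be probabilities between 0 and 1")
        miss *= 1.0 - score  # chance that this channel does not support the link
    return 1.0 - miss

# Hypothetical per-source scores for one protein pair
evidence = {
    "genomic_context": 0.5,
    "experiments": 0.5,
    "coexpression": 0.0,
    "previous_knowledge": 0.0,
}
print(combined_score(evidence))
```

With two sources at 0.5 each, the combined score is higher than either source alone, which is the point of integrating them.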
STRING - predicted functional associations among genes/proteins (recap)
Conserved Neighborhood
This view shows runs of genes that occur repeatedly in close neighborhood in
(prokaryotic) genomes. Genes located together in a run are linked with a black line
(maximum allowed intergenic distance is 300 bp). Note that if there are multiple runs
for a given species, these are separated by white space. If there are other genes in
the run that are below the current score threshold, they are drawn as small white
triangles. Gene fusion occurrences are also drawn, but only if they are present in a
run.
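The 300 bp rule for linking genes into a run can be sketched as below. This covers only the distance criterion within one genome; the requirement that a run occurs repeatedly across genomes is not modelled. The gene names, the coordinates and the input format (name, start, end) are hypothetical.

```python
def gene_runs(genes, max_gap=300):
    """Group genes into runs whose consecutive members are at most max_gap bp apart.

    genes: list of (name, start, end) tuples on one chromosome.
    """
    runs = []
    for gene in sorted(genes, key=lambda g: g[1]):
        previous_end = runs[-1][-1][2] if runs else None
        if previous_end is not None and gene[1] - previous_end <= max_gap:
            runs[-1].append(gene)  # close enough: extend the current run
        else:
            runs.append([gene])    # gap too large: start a new run
    return [[name for name, _, _ in run] for run in runs]

# Hypothetical gene coordinates
genes = [("g1", 100, 900), ("g2", 1000, 1800), ("g3", 2050, 2900), ("g4", 5000, 6000)]
print(gene_runs(genes))
```

Here g1, g2 and g3 form one run (gaps of 100 and 250 bp), while g4 starts a new one because it lies 2100 bp downstream.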
STRING - predicted functional associations among genes/proteins (recap)
Genes clustered in a
genomic region are likely
to interact
• co-ordinated expression
• co-ordinated gene
gains/losses
Wrapping up
Understand regular, random, small-world and scale-free
networks
– Know and understand observations on path length, clustering
coefficients, etc.
Know and understand interaction prediction using
phylogenetic co-evolution, phylogenetic profiling,
Rosetta stone methods and the STRING server
Comparing and overlaying various networks (e.g.
regulation, signalling, metabolic, PPI) and studying
evolutionary conservation at these network levels is one
of the current grand challenges, and will be crucially
important for a systems-based approach to
(intra)cellular behaviour.