Statistical physics of complex networks Sergei Maslov Brookhaven National Laboratory


Statistical physics
of complex networks
Sergei Maslov
Brookhaven National Laboratory
Short history: complex systems
before & after networks

Statistical physics of complex systems was active in the 80’s-90’s
(following the chaos boom of the 70’s)





Fractals (Mandelbrot and many others)
Self-Organized Criticality (Per Bak and co-authors) → sandpiles →
granular systems
Complex==multiple time and length scales (e.g. avalanches) 
Cult of power-laws
Cellular automata (mostly in real space+time)
Examples:





earthquakes
disordered moving interfaces
(co)-evolution of species
agent-based modeling (“ants”)
By the end of the 90’s: breakup of the community and specialization




Biology
Economics and finance
Internet
Social sciences
Networks in complex systems

Complex systems



Large number of components interacting with each other
All components and/or interactions are different from
each other (unlike in traditional physics where $10^{23}$ electrons are all
the same!)
Paradigms:






$10^{4}$ types of proteins in an organism,
$10^{6}$ routers in the Internet
$10^{9}$ web pages in the WWW
$10^{11}$ neurons in a human brain
The simplest property, who interacts with whom, can be
visualized as a network
Complex networks are just a backbone for complex
dynamical processes
Why study the topology of
complex networks?



Lots of easily available data: that’s where the state of the
art information is (at least in biology)
Large networks may contain information about basic
design principles and/or evolutionary history of the
complex system
This is similar to paleontology: learning about an animal
from its backbone
Inside single cells
A small part of a metabolic network: the citric acid cycle
Metabolic pathway chart by ExPASy
Protein binding networks
Baker’s yeast S. cerevisiae
(only nuclear proteins shown)
Nematode worm C. elegans
Transcription regulatory networks
Bacterium: E. coli
Single-celled eukaryote:
S. cerevisiae
GENOME
protein-gene interactions
PROTEOME
protein-protein interactions
METABOLISM
bio-chemical reactions
slide after Reka Albert
Between cells in a multi-cellular
organism
Sea urchin embryonic development (endomesoderm up to 30 hours) by Davidson’s lab
C. elegans neurons
Between organisms
Freshwater food web by Neo Martinez and Richard Williams
Sexual contacts: M. E. J. Newman, The structure and function of complex networks, SIAM Review 45, 167-256 (2003).
Social
High school dating: Data drawn from Peter S. Bearman, James Moody, and Katherine Stovel visualized by Mark Newman
Network of actors co-starring in movies
Networks of scientists’ co-authorship of papers
Webpages connected by hyperlinks on the AT&T website circa 1996 visualized by Mark Newman
Citation networks are similar to the WWW but time-ordered
Technological
Internet as measured by Hal Burch and Bill Cheswick's Internet Mapping Project.
transportation networks: airlines
transportation networks: railway maps
Tokyo rail map

Lecture 1: General introduction into networks


Node degrees, their distribution, and correlations
Simple models





preferential attachment and Simon model
Growth model for protein families
Percolation transition on networks
Clustering coefficient
Lectures 2-3: Biomolecular (mostly protein) networks

Regulatory and signaling networks



How many regulators? Bureaucratic collapse
Network motifs in directed (e.g. regulatory) networks
Protein binding networks

Broad degree distributions in protein binding networks and possible
explanations







Evolutionary (duplication-divergence)
Biophysical (stickiness)
Functional
Beyond degree distributions: How is it all wired together? Correlations in
degrees
Randomization of networks
Law of Mass Action and propagation of perturbations
Lecture 4: Technological and information networks


Diffusion and modules in the Internet, WWW, and scientific citations
Predicting opinions of customers on products (e.g. movies)
using knowledge networks
Degree (or connectivity)
of a node – the # of neighbors
Degree
K=2
Degree
K=4
Directed networks have
in- and out-degrees
In-degree
$K_{in}=2$
Out-degree
$K_{out}=5$
Degree distributions
in random and real networks
Degree distribution in
a random network
Poisson distribution




Randomly throw E
edges among N nodes
Solomonoff, Rapaport,
Bull. Math. Biophysics
(1951)
Erdos-Renyi (1960)
Degree distribution –
Binomial → Poisson
$K \sim \lambda$ with no hubs
(fast decay of $N(K)$)

$N(K) = N \, \frac{\lambda^{K}}{K!} \exp(-\lambda)$
$\lambda = \langle K \rangle = 2E/N$
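A minimal Python sketch of the construction above: throw E edges at random among N nodes and compare the degree histogram with the Poisson form. The function names and the values N=2000, E=4000 are illustrative choices, not from the lecture, and the sketch assumes self-loops and multiple edges are simply skipped.

```python
import math
import random
from collections import Counter

def random_graph_degrees(n_nodes, n_edges, seed=0):
    """Throw n_edges edges at random among n_nodes nodes; return the degree list."""
    rng = random.Random(seed)
    degree = [0] * n_nodes
    edges = set()
    while len(edges) < n_edges:
        a, b = rng.sample(range(n_nodes), 2)   # two distinct nodes: no self-loops
        edge = (min(a, b), max(a, b))
        if edge not in edges:                  # assumption: skip multiple edges
            edges.add(edge)
            degree[a] += 1
            degree[b] += 1
    return degree

def poisson_histogram(n_nodes, lam, k):
    """Expected number of nodes with degree k: N * lam^k * exp(-lam) / k!"""
    return n_nodes * lam ** k * math.exp(-lam) / math.factorial(k)

N, E = 2000, 4000                # illustrative sizes
degrees = random_graph_degrees(N, E)
lam = 2 * E / N                  # lambda = <K> = 2E/N
observed = Counter(degrees)
for k in range(11):
    # columns: degree K, observed N(K), Poisson prediction
    print(k, observed.get(k, 0), round(poisson_histogram(N, lam, k), 1))
```

The largest degree stays within a few multiples of λ, which is the "no hubs" statement in numbers.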
Degree distribution in real
protein binding network

[Figure: log-log plot of the degree histogram, with a reference line $x^{-2.5}$]
Histogram $N(K)$ is broad: most nodes have low degree ~1,
few nodes have high degree ~100
Can be approximately fitted with the $N(K) \sim K^{-\gamma}$
functional form with $\gamma \approx 2.5$
Many real world networks
have broad degree distributions
NETWORK                 exponent $\gamma$
film actors             2.3
telephone call graph    2.1
email networks          1.5/2.0
sexual contacts         3.2
WWW                     2.3/2.7
internet                2.5
peer-to-peer            2.1
metabolic network       2.2
protein interactions    2.4
Basic BA-model

Very simple algorithm to implement

start with an initial set of m0 fully connected nodes
e.g. m0 = 3 (nodes 1, 2, 3 joined in a triangle)

now add new vertices one by one, each one with exactly m edges
each new edge connects to an existing vertex in proportion to the number of
edges that vertex already has → preferential attachment
easiest if you keep track of edge endpoints in one large array and select an
element from this array at random

the probability of selecting any one vertex will be proportional to the number of times it
appears in the array – which corresponds to its degree
1 1 2 2 2 3 3 4 5 6 6 7 8 ….
generating BA graphs – cont’d

To start, each vertex has an equal number of edges (2)
array: 1 1 2 2 3 3
the probability of choosing any vertex is 1/3
We add a new vertex (4), and it will have m edges, here take m=2
draw 2 random elements from the array – suppose they are 2 and 3
array: 1 1 2 2 2 3 3 3 4 4
Now the probabilities of selecting 1, 2, 3, or 4 are 1/5, 3/10, 3/10, 1/5
Add a new vertex (5), draw a vertex for it to connect from the array
array: 1 1 2 2 2 3 3 3 3 4 4 4 5 5
etc.
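The endpoint-array recipe above can be sketched in a few lines of Python. The function name and parameters are illustrative, and one detail is an assumption not stated on the slides: if the same target is drawn twice for one new vertex, the draw is repeated so that each new vertex gets m distinct neighbors.

```python
import random
from collections import Counter

def barabasi_albert(n_nodes, m, m0=3, seed=0):
    """Grow a graph by preferential attachment using an array of edge endpoints."""
    if m0 < 2 or m > m0:
        raise ValueError("need m0 >= 2 and m <= m0")
    rng = random.Random(seed)
    edges = []
    endpoints = []                      # each edge contributes both of its endpoints
    for a in range(1, m0 + 1):          # initial set of m0 fully connected nodes
        for b in range(a + 1, m0 + 1):
            edges.append((a, b))
            endpoints += [a, b]
    for new in range(m0 + 1, n_nodes + 1):
        targets = set()
        while len(targets) < m:         # assumption: redraw duplicate targets
            # a uniform draw from the array picks a vertex with probability ~ its degree
            targets.add(rng.choice(endpoints))
        for t in sorted(targets):
            edges.append((new, t))
            endpoints += [new, t]
    return edges, endpoints

edges, endpoints = barabasi_albert(1000, m=2, m0=3, seed=1)
degree = Counter(endpoints)             # degree of a vertex = its count in the array
print("largest degrees:", sorted(degree.values(), reverse=True)[:5])
```

Keeping the array avoids recomputing attachment probabilities: the number of times a vertex appears in it is always equal to its current degree.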
The tale of
linear vs exponential growth

Linear growth: the Barabasi-Albert model with
$\gamma = 3$ is a version of Simon’s word usage
model: $\gamma = 2 + \alpha$


$dn_k/dt = (k-1)\,n_{k-1}/(t+\alpha t) - k\,n_k/(t+\alpha t)$
Exponential growth: protein duplication-deletion model: $\gamma = 2 + \nu/(r_{dup} - r_{del})$

$dn_k/dt = r_{dup}\,(k-1)\,n_{k-1} - (r_{dup}+r_{del})\,k\,n_k + r_{del}\,(k+1)\,n_{k+1}$;
$N_F = \sum_k n_k$ also grows exponentially:
$dN_F/dt = \nu N_G = \nu \sum_k k\,n_k$
Preferential attachment
with fitness





Bianconi-Barabasi (2001)
Attractiveness of a node to new edges
is given by $f_i k_i / \sum_r f_r k_r$
For uniform $\rho(f)$: $P(k) \sim k^{-(1+C^*)}/\ln k$,
where $C^* = 1.255$
Generally $C$ depends on $\rho(f)$
Some $\rho(f)$ result in “Bose-Einstein
condensation” in which super-hubs
emerge
Percolation transition
in networks
Why should we care?

The most important property of a network: it
quantifies how broken up a network is

Below the percolation threshold: many small
components
At the percolation threshold: scale-free distribution
of component sizes: $P(S) \sim S^{-2.5}$
Above the percolation threshold: a giant connected
component and a few small ones
Determines the propagation of perturbations
which affect neighbors with probability p (e.g.
infections)
Naïve (and wrong) argument



An average node has $\langle K \rangle$ first neighbors,
$\langle K \rangle \langle K-1 \rangle$ second neighbors,
$\langle K \rangle \langle K-1 \rangle \langle K-1 \rangle$ third neighbors
We neglect overlap between e.g. second and
first neighbors: in random networks this is a small
effect $\sim 1/N$
If $\langle K-1 \rangle \ge 1$ a single node is connected to a
finite fraction of all nodes in the network
Where is it wrong?





Probability to arrive at a node with $K$
neighbors is proportional to $K$!
All averages have to be modified: $\langle F(K) \rangle \to \langle F(K)\,K \rangle / \langle K \rangle$
The right answer: if $\langle K(K-1) \rangle / \langle K \rangle \ge 1$
a perturbation would spread
In directed networks it is $\langle K_{in} K_{out} \rangle / \langle K_{in} \rangle \ge 1$
Correlations between degrees of neighbors
and an abnormally large number of triangles
(clustering) would affect the answer
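A small Python illustration of why the naive and the degree-weighted averages can disagree. The degree sequence (90 nodes of degree 1 and 10 hubs of degree 10) and all function names are hypothetical, chosen only to make the difference visible; degree correlations and clustering are ignored, as in the argument above.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)

def naive_ratio(degrees):
    """<K-1>: the naive estimate of new neighbors gained per step."""
    return mean(degrees) - 1

def branching_ratio(degrees):
    """<K(K-1)>/<K>: the same average taken over nodes reached by following an edge."""
    return mean(k * (k - 1) for k in degrees) / mean(degrees)

def directed_branching_ratio(in_degrees, out_degrees):
    """<Kin*Kout>/<Kin> for a directed network (one in/out pair per node)."""
    return mean(i * o for i, o in zip(in_degrees, out_degrees)) / mean(in_degrees)

def spreads(degrees, p=1.0):
    """A perturbation passed on with probability p spreads if p*<K(K-1)>/<K> >= 1."""
    return p * branching_ratio(degrees) >= 1

degrees = [1] * 90 + [10] * 10            # hypothetical degree sequence
print(naive_ratio(degrees))               # below 1: the naive argument predicts no spreading
print(branching_ratio(degrees))           # above 1: the weighted average predicts spreading
```

The hubs contribute little to the plain average but dominate the edge-weighted one, which is what shifts the threshold.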
How many clusters?





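If $\langle K(K-1) \rangle / \langle K \rangle \ll 1$ there are only small clusters
If $\langle K(K-1) \rangle / \langle K \rangle \approx 1$ cluster sizes $S$ have a scale-free
distribution: $P(S) \sim S^{-2.5}$.
If $\langle K(K-1) \rangle / \langle K \rangle \gg 1$ there is one “giant” cluster
and a few small ones
A perturbation which affects neighbors with probability
$p$ propagates if $p \langle K(K-1) \rangle / \langle K \rangle \ge 1$
For scale-free networks $P(K) \sim K^{-\gamma}$ with
$\gamma < 3$, $\langle K^2 \rangle = \infty$ → a perturbation always
spreads in a large enough network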
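Diameter and mean cluster size
are determined by $\langle k(k-1) \rangle / \langle k \rangle$


Mean diameter $L$: $1 + \langle k \rangle + \langle k \rangle \frac{\langle k(k-1) \rangle}{\langle k \rangle} + \dots + \langle k \rangle \left( \frac{\langle k(k-1) \rangle}{\langle k \rangle} \right)^{L-1} = N$
→ $L \approx \log(N/\langle k \rangle) / \log(\langle k(k-1) \rangle / \langle k \rangle) + 1$
Mean cluster size below $p_c$:
$\langle S \rangle = 1 + \langle k \rangle / (1 - \langle k(k-1) \rangle / \langle k \rangle)$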
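The two estimates above are easy to evaluate numerically. In this Python sketch `kappa` stands for the ratio of <k(k-1)> to <k>; the function names and the example numbers are illustrative, not from the lecture.

```python
import math

def diameter_estimate(n_nodes, k_mean, kappa):
    """L ~ log(N/<k>) / log(kappa) + 1, meaningful above the threshold (kappa > 1)."""
    if kappa <= 1:
        raise ValueError("kappa must exceed 1 for a giant cluster to exist")
    return math.log(n_nodes / k_mean) / math.log(kappa) + 1

def mean_cluster_size(k_mean, kappa):
    """<S> = 1 + <k>/(1 - kappa): sum of the geometric series below the threshold."""
    if kappa >= 1:
        raise ValueError("kappa must be below 1 (below the percolation threshold)")
    return 1 + k_mean / (1 - kappa)

# Hypothetical example: a million nodes with <k> = 10 and kappa = 10
print(diameter_estimate(10**6, 10, 10))
# Hypothetical example below the threshold: <k> = 0.5, kappa = 0.5
print(mean_cluster_size(0.5, 0.5))
```

The mean cluster size diverges as `kappa` approaches 1 from below, which is the percolation transition seen from the subcritical side.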
Amplification ratios
• A(dir): 1.08 - E. coli, 0.58 - Yeast
• A(undir): 10.5 - E. coli, 13.4 - Yeast
• A(PPI): ? - E. coli, 26.3 - Yeast
Clustering coefficient C




C=3 N/knk k(k-1)/2
Could be defined for individual nodes or as a
function of k:
C(k)=3 N(k)/nk k(k-1)/2
C=1 could not be realized if k is
heterogeneous
Needs to be compared to its value in
randomized networks with the same degree
sequence
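A short Python sketch of the global clustering coefficient: three times the number of triangles divided by the number of connected triples, i.e. the sum of k(k-1)/2 over nodes. The function name and the tiny test graphs are illustrative.

```python
from itertools import combinations

def clustering_coefficient(edges):
    """C = 3 * (number of triangles) / (number of connected triples)."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    closed = 0    # closed triples; every triangle is counted once per corner (3 times)
    triples = 0   # all connected triples: sum over nodes of k(k-1)/2
    for nbrs in neighbors.values():
        k = len(nbrs)
        triples += k * (k - 1) // 2
        for u, v in combinations(nbrs, 2):
            if v in neighbors[u]:
                closed += 1
    return closed / triples if triples else 0.0

triangle = [(1, 2), (2, 3), (1, 3)]
star = [(0, 1), (0, 2), (0, 3)]
print(clustering_coefficient(triangle))   # every triple is closed
print(clustering_coefficient(star))       # no triangles at all
```

For the comparison mentioned above, the same function would be applied to randomized versions of the network with the same degree sequence.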
End lecture 1
Lecture 2
Protein networks
Places to learn molecular biology
1. Molecular Biology of the Cell. Fourth Edition. Bruce Alberts, Alexander
Johnson, Julian Lewis, Martin Raff, Keith Roberts, Peter Walter. Garland
Science. 2002.
2. DNA from the beginning. http://www.dnaftb.org/
3. Online Biology Book.
http://gened.emc.maricopa.edu/bio/bio181/BIOBK/BioBookTOC.html
4. Kimball’s Biology Pages. http://www.ultranet.com/~jkimball/BiologyPages/
5. Gene expression.
http://vlib.org/Science/Cell_Biology/gene_expression.shtml
6. Human Genome Project. http://www.ornl.gov/hgmis/
7. Microarrays. http://www.gene-chips.com/
From Prof. Michael Hallett (McGill) online lectures
Protein networks


Nodes – proteins
Edges – interactions between proteins






Metabolic (enzymes sharing common metabolites are
connected)
Physical (binding interactions)
Regulatory and signaling (transcriptional regulation, protein
modifications)
Co-expression networks from microarray data (connect genes with
similar expression (abundance) patterns under many conditions)
Genetic interactions e.g. synthetic lethal protein pairs (removal of
any one of the two proteins doesn’t kill the cell, but removal of
both proteins does)
Etc, etc, etc.
Sources of data on
protein networks

Genome-wide experiments




Binding – two-hybrid (Y2H) and mass-spec (MS)
high-throughput techniques
Transcriptional regulation – ChIP-on-chip, or ChIP-then-SAGE
Expression, disruption networks – microarrays
Lethality of genes (including synthetic lethals):



Gene knockout – yeast
RNAi – worm, fly
Many small or intermediate-scale experiments

All stored in public databases: BIOGRID, DIP, BIND,
YPD (no longer public), SGD, Flybase, Ecocyc, etc.
Pathway → network
paradigm shift
Pathway
Network
Images from ResNet3.0 by Ariadne Genomics
Inhibition of apoptosis
MAPK signaling
Transcription
regulatory networks
Transcription factors bind DNA
Activators and repressors

Depending on the position of the binding site
(operator) with respect to the RNA polymerase binding site (promoter),
transcription factors can either activate or
repress the production of mRNA from a given
gene (transcription) and thus affect the
abundance of its protein product
Transcription regulatory networks
Bacterium: E. coli
3:2 ratio
Single-celled eukaryote:
S. cerevisiae; 3:1 ratio
Sea urchin embryonic development (endomesoderm up to 30 hours) by Davidson’s lab
How many transcriptional
regulators are out there?
Fraction of transcriptional
regulators in bacteria
from Stover et al.,
Nature (2000)
Figure from Erik van Nimwegen, TIG 2003
Complexity of regulation grows
with complexity of organism



$N_R \langle K_{out} \rangle = N \langle K_{in} \rangle$ = number of edges
$N_R/N = \langle K_{in} \rangle / \langle K_{out} \rangle$ increases with $N$
$\langle K_{in} \rangle$ grows with $N$



In bacteria $N_R \sim N^{2}$ (Stover et al., 2000)
In eukaryotes $N_R \sim N^{1.3}$ (van Nimwegen,
2002)
Networks in more complex organisms
are more interconnected than in simpler
ones
Complexity is manifested
in Kin distribution
E. coli vs H. sapiens
Table from Erik van Nimwegen, TIG 2003
Toolbox model




$N_{TF} = A N^2$ → $dN_{TF} = 2AN\,dN$ → $dN/dN_{TF} = 1/(2AN)$
In small genomes ~100 genes per TF. In
large ones only 4!
A toolbox (e.g. metabolic network) grows
linearly with $N$. To handle a new condition
($N_{TF} \to N_{TF}+1$) one needs fewer and fewer new
tools.
S. Maslov, S. Krishna, K. Sneppen in
preparation
How is it all connected?
(beyond degree distribution)
What is unusual about topology
of a given network?



Look for a number of occurrences of a certain
topological pattern
Compare with a randomized network
What patterns to look for?




Number of edges connecting nodes with given degrees
(degree-degree correlations)
Motifs – small subgraphs of 3-4 nodes (in undirected
networks clustering or the triangles)
Overrepresentation – Nature needs them for some
function
Underrepresentation – they are detrimental and
nature avoids them
How to construct a proper
random network?
Randomization of a network
given complex
network
random
Stub reconnection algorithm





Break every edge into two halves (“stubs”)
Randomly reconnect stubs
Watch for multiple edges!
For example, in the AS-Internet the two largest
hubs would end up being connected with 50
edges (sic!)
Not adaptable to conserve other low-level
topological properties of the network
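A minimal Python sketch of stub reconnection, together with a count of the self-loops and multiple edges it produces. The function names and the two-hub test network are hypothetical; the sketch makes no attempt to remove the defects, it only exposes them.

```python
import random
from collections import Counter

def stub_reconnect(edges, seed=0):
    """Break every edge into two stubs, shuffle the stubs, and pair them up again."""
    rng = random.Random(seed)
    stubs = [node for edge in edges for node in edge]
    rng.shuffle(stubs)
    return list(zip(stubs[0::2], stubs[1::2]))

def count_defects(edges):
    """Return (number of self-loops, number of surplus copies of repeated edges)."""
    self_loops = sum(1 for a, b in edges if a == b)
    pair_counts = Counter(tuple(sorted(e)) for e in edges if e[0] != e[1])
    multiple = sum(c - 1 for c in pair_counts.values() if c > 1)
    return self_loops, multiple

# Hypothetical test network: two hubs (0 and 1) with 50 leaves each
edges = [(0, leaf) for leaf in range(2, 52)] + [(1, leaf) for leaf in range(52, 102)]
new_edges = stub_reconnect(edges, seed=1)
hub_to_hub = sum(1 for e in new_edges if set(e) == {0, 1})
print("self-loops, multiple edges:", count_defects(new_edges))
print("edges between the two hubs:", hub_to_hub)
```

Every node keeps its degree, but high-degree nodes tend to be joined by several parallel edges, which is the problem the slide warns about.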
Local rewiring algorithm


Randomly select and
rewire two edges
Repeat many times
• R. Kannan, P. Tetali,
and S. Vempala,
Random Structures
and Algorithms (1999)
• SM, K. Sneppen,
Science (2002)
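One way to sketch the local rewiring step in Python: pick two edges A-B and C-D and replace them with A-D and C-B. The names are illustrative, and the sketch assumes that swaps creating a self-loop or a multiple edge are rejected, so every node keeps its degree and the network stays simple.

```python
import random
from collections import Counter

def local_rewire(edges, n_swaps, seed=0):
    """Degree-preserving randomization by repeated rewiring of pairs of edges."""
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    done = attempts = 0
    while done < n_swaps and attempts < 100 * n_swaps:   # cap on failed attempts
        attempts += 1
        i, j = rng.sample(range(len(edges)), 2)
        a, b = edges[i]
        c, d = edges[j]
        if rng.random() < 0.5:                           # random orientation of the second edge
            c, d = d, c
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if len(new1) < 2 or len(new2) < 2:               # would create a self-loop
            continue
        if new1 in present or new2 in present:           # would create a multiple edge
            continue
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {new1, new2}
        edges[i], edges[j] = (a, d), (c, b)
        done += 1
    return edges

ring = [(i, (i + 1) % 30) for i in range(30)]            # hypothetical test network
rewired = local_rewire(ring, n_swaps=100, seed=1)
degree = Counter(node for e in rewired for node in e)
print(sorted(set(degree.values())))                      # the degree sequence is unchanged
```

Because each accepted step only exchanges partners between two edges, extra constraints can be added as further rejection rules.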
Metropolis rewiring algorithm
“energy” $E$
“energy” $E + \Delta E$
SM, K. Sneppen:
cond-mat preprint
(2002),
Physica A (2004)

Randomly select two edges
Calculate the change $\Delta E$ in the “energy function”

Rewire with probability $p = \exp(-\Delta E/T)$

$E = (N_{actual} - N_{desired})^2 / N_{desired}$
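A hedged Python sketch of the Metropolis variant. Here the counted pattern is the number of triangles, which is only an illustrative choice; the function names, the ring test network, and the parameter values are assumptions. Moves that lower the energy are always accepted, and moves that raise it are accepted with probability exp(-ΔE/T), the usual Metropolis convention.

```python
import math
import random
from itertools import combinations

def count_triangles(edge_set):
    """Number of triangles in a set of undirected edges (frozensets of two nodes)."""
    neighbors = {}
    for e in edge_set:
        a, b = tuple(e)
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    corners = sum(1 for nbrs in neighbors.values()
                  for u, v in combinations(nbrs, 2) if v in neighbors[u])
    return corners // 3                      # each triangle is seen from 3 corners

def energy(n_actual, n_desired):
    return (n_actual - n_desired) ** 2 / n_desired

def metropolis_rewire(edges, n_desired, temperature, n_steps, seed=0):
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    current = energy(count_triangles(present), n_desired)
    for _ in range(n_steps):
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        if rng.random() < 0.5:
            c, d = d, c
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if len(new1) < 2 or len(new2) < 2 or new1 in present or new2 in present:
            continue                         # no self-loops or multiple edges
        trial = (present - {frozenset((a, b)), frozenset((c, d))}) | {new1, new2}
        proposed = energy(count_triangles(trial), n_desired)
        delta = proposed - current
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            present, current = trial, proposed
            edges[i], edges[j] = (a, d), (c, b)
    return edges, current

ring = [(i, (i + 1) % 30) for i in range(30)]   # hypothetical start: no triangles
final_edges, final_energy = metropolis_rewire(ring, n_desired=5,
                                              temperature=1e-9, n_steps=2000, seed=1)
print(final_energy)
```

Recounting the pattern from scratch at every step keeps the sketch short; a practical version would update the count locally.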
Degree-degree correlations
Central vs peripheral
network architecture
central
(hierarchical)
peripheral
random (anti-hierarchical)
A. Trusina, P. Minnhagen, SM, K. Sneppen, Phys. Rev. Lett. 92, 17870, (2004)
What is the case for
protein interaction network
SM, K. Sneppen, Science 296, 910 (2002)
Correlation profile



Count N(k0,k1) – the number of links
between nodes with connectivities
k0 and k1
Compare it to Nr(k0,k1) – the same
property in a random network
Qualitative features are very noise-tolerant with respect to both false
positives and false negatives
Correlation profile of the
protein interaction network
$R(k_0,k_1) = N(k_0,k_1)/N_r(k_0,k_1)$
$Z(k_0,k_1) = (N(k_0,k_1) - N_r(k_0,k_1))/\Delta N_r(k_0,k_1)$
Similar profile is seen in the yeast regulatory network
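A rough Python sketch of how such a correlation profile could be computed: count links between nodes of degrees k0 and k1 in the real network and in an ensemble of degree-preserving randomized copies. All names, the number of random copies, and the use of the population standard deviation for the spread are assumptions of this sketch, not details taken from the papers cited above.

```python
import random
from collections import Counter
from statistics import mean, pstdev

def degree_pair_counts(edges):
    """N(k0,k1): number of links between nodes of degrees k0 <= k1."""
    degree = Counter(node for e in edges for node in e)
    counts = Counter()
    for a, b in edges:
        counts[tuple(sorted((degree[a], degree[b])))] += 1
    return counts

def rewired_copy(edges, rng, attempts_per_edge=10):
    """Degree-preserving randomization by repeated pairwise edge rewiring."""
    edges = list(edges)
    present = {frozenset(e) for e in edges}
    for _ in range(attempts_per_edge * len(edges)):
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        if rng.random() < 0.5:
            c, d = d, c
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if len(new1) < 2 or len(new2) < 2 or new1 in present or new2 in present:
            continue                      # reject self-loops and multiple edges
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {new1, new2}
        edges[i], edges[j] = (a, d), (c, b)
    return edges

def correlation_profile(edges, n_random=100, seed=0):
    """Map (k0,k1) -> (R, Z) relative to the randomized ensemble."""
    rng = random.Random(seed)
    actual = degree_pair_counts(edges)
    samples = [degree_pair_counts(rewired_copy(edges, rng)) for _ in range(n_random)]
    profile = {}
    for pair, n in actual.items():
        values = [s.get(pair, 0) for s in samples]
        mu, sigma = mean(values), pstdev(values)
        ratio = n / mu if mu > 0 else float("inf")
        z = (n - mu) / sigma if sigma > 0 else float("nan")
        profile[pair] = (ratio, z)
    return profile
```

Degree pairs that occur only in the randomized copies are skipped here; a fuller version would report them with R = 0.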
Hubs may act within a module, or
connect modules

Party hub: simultaneous interactions;
tends to be within the same module
Date hub: sequential interactions;
connects different modules
Han et al, Nature 430, 88 (2004)
Correlation profile of the
yeast regulatory network
$R(k_{out},k_{in}) = N(k_{out},k_{in})/N_r(k_{out},k_{in})$
$Z(k_{out},k_{in}) = (N(k_{out},k_{in}) - N_r(k_{out},k_{in}))/\Delta N_r(k_{out},k_{in})$
Some scale-free networks
may appear similar
In both networks the degree distribution is scale-free, $P(k) \sim k^{-\gamma}$ with $\gamma \approx 2.2$-$2.5$
But: correlation profiles
give them unique identities
Protein interactions
Internet
Small network motifs
(Uri Alon and his group)
All 3 node motifs
Motifs can overlap in the network
graph
motif matches in the target graph
http://mavisto.ipk-gatersleben.de/frequency_concepts.html
motif to be found
Detection of important network
motifs

Technique:




construct many random graphs with the same
number of nodes and degree distribution
count the number of motifs in those graphs
calculate the Z score, which measures how unlikely it is that the
same or larger number of motifs in the real world
network could have occurred in a random one
Software available:

http://www.weizmann.ac.il/mcb/UriAlon/
What the Z score means
$z_x = (x - \mu_x)/\sigma_x$
$x$: number of times the motif appeared in the real network
$\mu_x$: mean number of times the motif
appeared in the random graphs
$\sigma_x$: standard deviation of the number of times the motif
appeared in the random graphs
The probability of observing a Z
score of 2 or larger is 0.02275
In the context of motifs:
Z > 0, motif occurs more often
than in random graphs
Z < 0, motif occurs less often
than in random graphs
|Z| > 1.65, only a 5% chance (one-tailed) of
random occurrence
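The numbers quoted above can be checked with a few lines of Python. The function names and the sample counts are hypothetical; the sketch assumes the motif counts in random graphs are approximately normally distributed and uses the population standard deviation.

```python
import math
from statistics import mean, pstdev

def z_score(real_count, random_counts):
    """Z = (x - mu) / sigma, with mu and sigma taken over the random graphs."""
    mu = mean(random_counts)
    sigma = pstdev(random_counts)        # assumption: population standard deviation
    return (real_count - mu) / sigma

def upper_tail_probability(z):
    """P(Z >= z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2))

print(round(upper_tail_probability(2.0), 5))    # one-tailed probability for Z = 2
print(round(upper_tail_probability(1.65), 3))   # one-tailed probability for Z = 1.65
print(z_score(10, [4, 8]))                      # hypothetical motif counts
```

With only a handful of random graphs the estimate of sigma is noisy, so in practice many randomized networks are generated before quoting a Z score.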
Examples of network motifs
(3 nodes)

Feed forward loop

Found in many
transcriptional
regulatory networks
coherent
incoherent
Possible functional role of a
coherent feed-forward loop


Noise filtering: short
pulses in the input do
not result in turning
on Z
To function, it needs a
time delay (about
0.5 hrs for bacterial
transcription)
All 4 node subgraphs (computational expense increases
with the size of the graph!)
Higher-order motifs



4-node motifs contain some 3-node
motifs
One needs to be careful when
calculating over-representation
Alon & co-authors use our Metropolis
algorithm to generate networks with a
given number of low-level motifs
Table 1 from
R Milo, S Shen-Orr, S Itzkovitz,
N Kashtan, D Chklovskii & U Alon,
Network Motifs: Simple Building
Blocks of Complex Networks
Science, 298:824-827 (2002)
Examples of network motifs (4 nodes)
W
X
Y
Z
Parallel paths are
over represented


Neural networks
Food webs
Finding classes on graphs based on
their motif “profiles”
THE END