Basic Principles of Graph Theory - hu


Humboldt University Berlin Theoretical Biophysics Max Planck Institute Molecular Genetics

Networks in Metabolism and Signaling Edda Klipp Humboldt University Berlin

Lecture 2 / WS 2007/08

Basic Principles of Graph Theory and Random Networks

VL Netzwerke, WS 2007/08 Edda Klipp 1


Basic Principles of Graph Theory

Literature:

J. Sedláček (1968) Einführung in die Graphentheorie. Teubner Verlagsgesellschaft, Leipzig.

Albert & Barabási (2002) Statistical mechanics of complex networks. Rev Mod Physics, 74, 47-97.

Barabási & Oltvai (2004) Network biology: understanding the cell’s functional organization. Nature Reviews Genetics, 5, 101-113.

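Classical Examples

The problem of the ferryman, the goat, the wolf and the hay (“Fährmann, Ziege, Wolf und Heu”): a ferryman (F) has to bring a goat (Z), a wolf (W) and hay (H) across a river. He can take at most one of them per crossing, and the goat must never be left alone with the wolf or with the hay.

Each vertex of the graph lists who is on the starting bank; each edge is one crossing:

(F,Z,W,H) (F,W,H) (F,Z,W) (F,Z,H) (F,Z) (W,H) (W) (Z) (H) (0)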
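The state graph can also be searched mechanically. The following is a minimal sketch, not part of the lecture: the function names are illustrative, and it assumes that the boat carries the ferryman plus at most one item. It finds a shortest sequence of crossings by breadth-first search.

```python
from collections import deque

ITEMS = frozenset("FZWH")  # F = ferryman, Z = goat, W = wolf, H = hay


def is_safe(bank):
    """A bank is safe if the ferryman is there or the goat is not with wolf or hay."""
    if "F" in bank:
        return True
    return not ("Z" in bank and ("W" in bank or "H" in bank))


def neighbors(state):
    """Yield all states reachable by one crossing (state = items on the starting bank)."""
    ferryman_here = "F" in state
    bank = state if ferryman_here else ITEMS - state  # bank where the boat is
    for cargo in [None] + [x for x in bank if x != "F"]:
        moved = {"F"} | ({cargo} if cargo else set())
        new_state = state - moved if ferryman_here else state | moved
        if is_safe(new_state) and is_safe(ITEMS - new_state):
            yield frozenset(new_state)


def solve():
    """Breadth-first search from (F,Z,W,H) to the empty starting bank (0)."""
    start, goal = ITEMS, frozenset()
    parent = {start: None}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state == goal:
            path = []
            while state is not None:
                path.append(state)
                state = parent[state]
            return path[::-1]
        for nxt in neighbors(state):
            if nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)
    return None


for state in solve():
    print("(" + ",".join(sorted(state)) + ")" if state else "(0)")
```

Under these assumptions the shortest solution visits 8 states, i.e. it needs 7 crossings.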


The Bridges of Königsberg

In the centre of the Prussian city of Königsberg (today Kaliningrad), the river Pregel forms an island where two of its arms meet. In the 18th century, 7 bridges connect the river banks with the island. The question is whether there is a round trip that crosses each of the 7 bridges exactly once and returns to the starting point.

History

The problem of the bridges of Königsberg goes back to Leonhard Euler. In 1736 he proved that no such round trip can exist. He considered the general case with an arbitrary number of islands and bridges and showed that a round trip of the desired kind is possible if and only if no land area has an odd number of bridges. If exactly two land areas have an odd number of bridges, there is a path that starts at one of them, ends at the other and crosses every bridge exactly once. If, as in Königsberg, more than two areas are reached by an odd number of bridges, no path can exist that crosses every bridge exactly once.
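Euler's criterion only requires counting the bridges at each land area. A minimal sketch follows, assuming a connected multigraph; the labels A (island), B and C (the two banks) and D (land between the river arms) are chosen here for illustration and do not come from the slide.

```python
from collections import Counter

# Each pair is one bridge; parallel bridges appear twice (assumed labelling).
KOENIGSBERG = [("A", "B"), ("A", "B"), ("A", "C"), ("A", "C"),
               ("A", "D"), ("B", "D"), ("C", "D")]


def degrees(edges):
    """Number of bridges (edges) ending at each land area (vertex)."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def euler_walk_type(edges):
    """Classify a connected multigraph by the number of odd-degree vertices."""
    odd = sum(1 for d in degrees(edges).values() if d % 2 == 1)
    if odd == 0:
        return "closed tour"   # round trip crossing every edge exactly once
    if odd == 2:
        return "open path"     # starts and ends at the two odd vertices
    return "none"


print(dict(degrees(KOENIGSBERG)))    # {'A': 5, 'B': 3, 'C': 3, 'D': 3}
print(euler_walk_type(KOENIGSBERG))  # none
```

All four land areas have an odd number of bridges, so neither a round trip nor an open path exists.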

Graphs: Definitions

[Figure: graph with vertices (nodes) A, B, C and an edge]

$V = \{A, B, C\}$, $E = \{(A,B), \ldots\}$

German terms: vertex = Knoten, edge = Kante, tuple = Tupel (geordnete Menge, ordered set), set = Menge.

A graph $G = (V, E)$ is a tuple $(V, E)$ with $V$ a set of $n$ vertices and $E$ a set of $m$ edges, $E \subseteq V \times V$.

Example: proteins are the vertices, interactions are the edges.

Graphs: Completeness

[Figure: graph with vertices A, B, C and edge AB]

$V = \{A, B, C\}$, $E = \{(A,B), \ldots\}$, $E \subseteq V \times V$

Edge AB has the vertices A and B. Vertex A is incident with edge AB.

Let $E_0$ be the set of all subsets of $V$ with two elements.

A graph is complete if $E = E_0$.

[Figure: example graphs a) to d)]

If $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$ with $V_1 = V_2 = V$, $E_1 \cup E_2 = E_0$ and $E_1 \cap E_2 = \emptyset$, then $G_1$ and $G_2$ are complementary.
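Completeness and complementarity can be checked directly with sets. This is a minimal sketch for undirected graphs; edges are stored as two-element frozensets, and the function names are illustrative assumptions.

```python
from itertools import combinations


def all_pairs(V):
    """E0: the set of all two-element subsets of V."""
    return {frozenset(pair) for pair in combinations(V, 2)}


def is_complete(V, E):
    """A graph is complete if its edge set equals E0."""
    return set(E) == all_pairs(V)


def complement(V, E):
    """Edge set of the complementary graph: all pairs that are not in E."""
    return all_pairs(V) - set(E)


V = {"A", "B", "C"}
E1 = {frozenset({"A", "B"})}
E2 = complement(V, E1)

print(is_complete(V, E1))        # False
print(is_complete(V, E1 | E2))   # True
print(E1 & E2 == set())          # True
```

Here `E1` and `E2` together give `E0` and share no edge, which is the condition for complementary graphs stated above.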

Graph Types

Undirected graphs:

[Figure: A and B joined by an undirected edge] $V = \{A, B\}$, $E = \{(A,B)\}$

Directed graphs (digraphs): directed edge $(i,j) \in E$ with $i$ denoting the tail and $j$ denoting the head of the edge.

[Figure: A and B joined by directed edges] $V = \{A, B\}$, $E = \{(A,B), (B,A)\}$

Extension: directed edge $(i,j,s) \in E$ with $s \in \{+1,-1\}$ to represent activatory or inhibitory influences.

Graph Types: Bipartite Graphs

[Figure: vertices A, B, C, D connected through reaction vertex R1]

The set of graph vertices is decomposed into two disjoint sets such that no two graph vertices within the same set are adjacent.

Such graphs represent two distinct classes of nodes, such as metabolites (blue circles) and reactions (yellow boxes).

[Figure: ATP and Fruc-6-P enter reaction R1, which yields ADP and Fruc-1,6-P2]

Graph Representation: Adjacency Matrix

[Figure: directed graph with vertices A to G]

    A  B  C  D  E  F  G
A   0  1  0  0  0  0  0
B   0  0  1  1  0  0  0
C   0  0  0  0  0  0  0
D   0  0  1  0  0  1  0
E   0  0  0  1  0  0  0
F   0  0  0  0  1  0  1
G   0  0  0  0  0  0  0

German term: adjacency matrix = Adjazenzmatrix.

◊ Adjacency matrix A: non-zero entries represent edges (row = tail, column = head of the edge)
  - square
  - unique assignment of adjacency matrix to graph
  - unique assignment of graph to adjacency matrix
◊ Bipartite graphs: sub-matrices for the two classes of nodes
◊ Alternative formats: edge lists, vertex lists
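The conversion between an edge list and an adjacency matrix can be sketched as follows. The edge list is read off the matrix of the example graph, assuming that rows are tails and columns are heads (consistent with the out-degree of B on the degree slide); the function names are illustrative.

```python
NODES = ["A", "B", "C", "D", "E", "F", "G"]
EDGES = [("A", "B"), ("B", "C"), ("B", "D"), ("D", "C"),
         ("D", "F"), ("E", "D"), ("F", "E"), ("F", "G")]


def adjacency_matrix(nodes, edges):
    """Build a 0/1 matrix with entry [i][j] = 1 for a directed edge i -> j."""
    index = {v: i for i, v in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for tail, head in edges:
        matrix[index[tail]][index[head]] = 1
    return matrix


def edge_list(nodes, matrix):
    """Recover the directed edge list from the adjacency matrix."""
    return [(nodes[i], nodes[j])
            for i, row in enumerate(matrix)
            for j, entry in enumerate(row) if entry]


matrix = adjacency_matrix(NODES, EDGES)
for name, row in zip(NODES, matrix):
    print(name, *row)
```

Converting back with `edge_list(NODES, matrix)` returns the same edges, which illustrates the one-to-one assignment between graph and matrix.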

Graph Theoretical Measures: Degree

[Figure: example graph with vertices A to G and its adjacency matrix]

◊ Number of edges to which a vertex is connected: degree $k$ (German: Knotengrad).
◊ For directed graphs: in-degree (edges ending at a vertex) and out-degree (edges starting at a vertex).

Example for vertex B: $k_B = 3$, $k_B^{i} = 1$, $k_B^{o} = 2$

◊ Vertices with degree 0: isolated

Let $G$ be a finite graph, $v$ the number of nodes, $k$ the number of edges and $s_1, s_2, \ldots, s_v$ the degrees of the individual nodes. Then:

$\sum_{i=1}^{v} s_i = 2k$
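In-degree, out-degree and the degree sum can be read directly from the adjacency matrix. A minimal sketch for the example graph, assuming rows are tails and columns are heads; the helper names are illustrative.

```python
NODES = "ABCDEFG"
A = [
    [0, 1, 0, 0, 0, 0, 0],  # A
    [0, 0, 1, 1, 0, 0, 0],  # B
    [0, 0, 0, 0, 0, 0, 0],  # C
    [0, 0, 1, 0, 0, 1, 0],  # D
    [0, 0, 0, 1, 0, 0, 0],  # E
    [0, 0, 0, 0, 1, 0, 1],  # F
    [0, 0, 0, 0, 0, 0, 0],  # G
]


def out_degrees(matrix):
    """Row sums: number of edges starting at each vertex."""
    return [sum(row) for row in matrix]


def in_degrees(matrix):
    """Column sums: number of edges ending at each vertex."""
    return [sum(col) for col in zip(*matrix)]


k_out = out_degrees(A)
k_in = in_degrees(A)
k_total = [i + o for i, o in zip(k_in, k_out)]
num_edges = sum(map(sum, A))

b = NODES.index("B")
print(k_total[b], k_in[b], k_out[b])   # 3 1 2
print(sum(k_total), 2 * num_edges)     # 16 16
```

The last line checks the degree sum formula: every edge contributes to the degree of exactly two vertices.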

Graph Theoretical Measures: Degree

[Figure: example graph with vertices A to G]

k_in       0    1    2
P(k_in)   1/7  4/7  2/7

k_out      0    1    2
P(k_out)  2/7  2/7  3/7

◊ Global connectivity properties of a graph:

Average degree $\langle k \rangle$: $\langle k_{in} \rangle = (4 \times 1 + 2 \times 2)/7 = 8/7 \approx 1.14$

Degree distribution $P(k)$

Degree distributions make it possible to distinguish between different types of networks.
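A degree distribution is the fraction of vertices with each degree. A minimal sketch for the in-degrees of the example graph (read off the columns of its adjacency matrix); exact fractions are used so that the values can be compared with the table, and the function names are illustrative.

```python
from collections import Counter
from fractions import Fraction

# In-degrees of vertices A..G of the example graph (column sums of its matrix).
K_IN = [0, 1, 2, 2, 1, 1, 1]


def degree_distribution(degrees):
    """P(k): fraction of vertices that have degree k."""
    n = len(degrees)
    counts = Counter(degrees)
    return {k: Fraction(c, n) for k, c in sorted(counts.items())}


def mean_degree(degrees):
    """Average degree <k>."""
    return Fraction(sum(degrees), len(degrees))


for k, prob in degree_distribution(K_IN).items():
    print(k, prob)            # 0 1/7, 1 4/7, 2 2/7
print(mean_degree(K_IN))      # 8/7
```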

Aside: Discrete Probability Distributions

Binomial distribution:

$P(k) = \binom{n}{k} p^k (1-p)^{n-k}$

[Figure: binomial distribution for $p = 1/2$]

Properties of a sample: if the desired outcome of a trial has probability $p$ and the number of trials is $n$, the binomial distribution gives the probability of obtaining exactly $k$ successes in total.

$P(k)$ is the probability of, for example, drawing $k$ black balls in $n$ draws from an urn of balls.

Sum of all probabilities: $\sum_{k=0}^{n} P(k) = 1$

$E(X) = np$ (expected value; compare the mean over very many repetitions)

$\mathrm{Var}(X) = np(1-p)$ (variance)

Aside: Discrete Probability Distributions

Poisson distribution:

$P(k) = \frac{\lambda^k}{k!} e^{-\lambda}$

Properties of a sample: as before, but for a very small probability of the individual events, e.g. because $n$ is very large.

$\lambda$: event rate (e.g. the error rate in DNA replication)

$E(X) = \lambda$ (expected value; compare the mean over very many repetitions)

$\mathrm{Var}(X) = \lambda$ (variance)

Exponential distribution:

$f(x) = \lambda e^{-\lambda x}$, $E(X) = 1/\lambda$, $\mathrm{Var}(X) = 1/\lambda^2$

[Figure: plot of the distribution; x-axis from 0 to 100, y-axis from 0.0 to 0.8]
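The relation between the two distributions can be checked numerically. A minimal sketch, with $n$ and $p$ chosen here only for illustration: for large $n$ and small $p$ the binomial probabilities are close to the Poisson probabilities with $\lambda = np$.

```python
from math import comb, exp, factorial


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)


def poisson_pmf(k, lam):
    """Probability of exactly k events for event rate lam."""
    return lam**k * exp(-lam) / factorial(k)


n, p = 1000, 0.003          # illustrative values: many trials, rare events
lam = n * p
for k in range(8):
    print(k, round(binomial_pmf(k, n, p), 4), round(poisson_pmf(k, lam), 4))
```

The two columns agree to about three decimal places for these values; the agreement gets worse when $p$ is not small.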

Degree Distributions

[Figure] Degree distribution of the World Wide Web from two different measurements: □, the 325 729-node sample of Albert et al. (1999); ○, the measurements of over 200 million pages by Broder et al. (2000); (a) degree distribution of the outgoing edges; (b) degree distribution of the incoming edges. The data have been binned logarithmically to reduce noise.

Albert & Barabási, 2002, Rev Mod Phys

Degree Distributions

[Figure] The degree distribution of several real networks: (a) Internet at the router level. Data courtesy of Ramesh Govindan; (b) movie actor collaboration network. After Barabási and Albert 1999. Note that if TV series are included as well, which aggregate a large number of actors, an exponential cutoff emerges for large k (Amaral et al., 2000); (c) co-authorship network of high energy physicists. After Newman (2001a, 2001b); (d) co-authorship network of neuroscientists. After Barabási et al. (2001).

Albert & Barabási, 2002, Rev Mod Phys

Degree Distributions

[Figure] Connectivity distributions P(k) for substrates. a, Archaeoglobus fulgidus (archaea); b, E. coli (bacterium); c, Caenorhabditis elegans (eukaryote), shown on a log-log plot, counting separately the incoming (In) and outgoing links (Out) for each substrate. k_in (k_out) corresponds to the number of reactions in which a substrate participates as a product (educt). d, The connectivity distribution averaged over all 43 organisms.

Jeong H et al, 2000, Nature

Random Graphs

First well-known example: the model of Paul Erdős and Alfréd Rényi

History: The Erdős number describes the distance in the co-authorship graph relative to the mathematician Paul Erdős. In this graph, authors are represented as nodes, and an edge exists between two of them if they have written a publication together.

Paul Erdős himself has Erdős number 0; all co-authors with whom he published have Erdős number 1. Authors who have published with co-authors of Paul Erdős have Erdős number 2, and so on. If no connection of this kind can be established to a person, their Erdős number is ∞.

It turns out that the Erdős number of most people is either infinite or surprisingly small. The latter is mainly because Erdős published jointly with more than 500 different scientists and was well versed in many areas of mathematics.

Random Graphs

A well-known example: the model of Paul Erdős and Alfréd Rényi

Start with $N$ nodes.

Connect every pair of nodes with probability $p$.

→ Obtain a graph with approx. $\frac{1}{2} pN(N-1)$ edges.

Degree distribution: Poisson distribution

Average degree: $\langle k \rangle = \frac{1}{2} pN(N-1) \cdot \frac{2}{N} = p(N-1) \approx pN$

[Figure: table of the node pairs among n1 to n5, with x marking connected pairs. For each pair a random number ("dice number") $z \in [0;1]$ is drawn; if $z < p$, the pair is connected.]
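The construction can be sketched in a few lines. This is a minimal illustration, not the lecture's code: the function name, the seed and the values of $N$ and $p$ are assumptions, and each pair is connected when its random number is below $p$.

```python
import random
from itertools import combinations


def erdos_renyi(num_nodes, p, seed=None):
    """Return the edge list of a random graph: each pair is connected with probability p."""
    rng = random.Random(seed)
    return [(i, j) for i, j in combinations(range(num_nodes), 2)
            if rng.random() < p]   # random number z in [0, 1); connect if z < p


N, p = 500, 0.02
edges = erdos_renyi(N, p, seed=1)

print("edges:", len(edges), "expected about:", p * N * (N - 1) / 2)
print("average degree:", 2 * len(edges) / N, "expected about:", p * (N - 1))
```

The number of edges fluctuates from run to run around $\frac{1}{2} pN(N-1)$; fixing the seed only makes a single run reproducible.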

Random Graphs: Evolution

Construction of random graphs is called evolution. Starting with a set of $N$ isolated vertices, the graph develops by the successive addition of random edges. The graphs obtained at different stages of this process correspond to larger and larger connection probabilities $p$, eventually obtaining a fully connected graph having the maximum number of edges $n = N(N-1)/2$ for $p \to 1$.

Random Networks

Questions:

Are real networks really random?

Do real networks display organization principles?

Is a typical graph connected (depending on $p$)?

Does it contain a triangle of connected nodes?

Does its diameter depend on its size?

Random Networks: Subgraphs

A graph $G_1$ consisting of a set $V_1$ of nodes and a set $E_1$ of edges is a subgraph of a graph $G = \{V, E\}$ if all nodes in $V_1$ are also nodes of $V$ and all edges in $E_1$ are also edges of $E$.

A cycle of order $k$ is a closed loop of $k$ edges such that every two consecutive edges and only those have a common node. Average degree: 2. Examples: triangle, rectangle.

The opposite of cycles are the trees, which cannot form closed loops. More precisely, a graph is a tree of order $k$ if it has $k$ nodes and $k-1$ edges, and none of its subgraphs is a cycle.

Average degree of a tree of order $k$: $\langle k \rangle = 2(k-1)/k = 2 - 2/k$ ($\to 2$ for large trees)

Random Networks: Subgraphs

[Figure] The threshold probabilities at which different subgraphs appear in a random graph, with $p \sim N^{z}$. For $pN^{3/2} \to 0$ the graph consists of isolated nodes and edges. For $p \sim N^{-3/2}$ trees of order 3 appear, while for $p \sim N^{-4/3}$ trees of order 4 appear. At $p \sim N^{-1}$ trees of all orders are present, and at the same time cycles of all orders appear. The probability $p \sim N^{-2/3}$ marks the appearance of complete subgraphs of order 4 and $p \sim N^{-1/2}$ corresponds to complete subgraphs of order 5. As $z$ approaches 0, the graph contains complete subgraphs of increasing order.

Degree Distribution

[Figure] The degree distribution that results from the numerical simulation of a random graph. We generated a single random graph with $N = 10\,000$ nodes and connection probability $p = 0.0015$, and calculated the number of nodes with degree $k$, $X_k$. The plot compares $X_k/N$ with the expectation value of the Poisson distribution (Eq. 13 in Albert & Barabási, 2002), $E(X_k)/N = P(k_i = k)$, and we can see that the deviation is small.
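A comparison of this kind can be reproduced on a smaller scale. The sketch below is an illustration under stated assumptions: it uses $N = 1000$ and $p = 0.01$ instead of the values in the caption so that the pure-Python loop stays fast, and the function names and the seed are arbitrary.

```python
import random
from collections import Counter
from itertools import combinations
from math import exp, factorial


def simulated_degree_fractions(num_nodes, p, seed=0):
    """Generate one random graph and return X_k / N for every observed degree k."""
    rng = random.Random(seed)
    degree = [0] * num_nodes
    for i, j in combinations(range(num_nodes), 2):
        if rng.random() < p:        # connect the pair with probability p
            degree[i] += 1
            degree[j] += 1
    counts = Counter(degree)
    return {k: counts[k] / num_nodes for k in sorted(counts)}


def poisson_pmf(k, lam):
    """Poisson probability of degree k for mean degree lam."""
    return lam**k * exp(-lam) / factorial(k)


N, p = 1000, 0.01                   # illustrative values, smaller than in the caption
fractions = simulated_degree_fractions(N, p)
lam = p * (N - 1)                   # average degree <k>
for k, frac in fractions.items():
    print(k, round(frac, 4), round(poisson_pmf(k, lam), 4))
```

For a single graph of this size the simulated fractions scatter visibly around the Poisson values; the scatter shrinks as $N$ grows.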

Graph-theoretical Measure: Distance

[Figure: example graph with vertices A to G]

◊ Path: connection between two vertices $u$ and $v$ without repetition of nodes (i.e. no backtracking, no loops)

◊ Shortest path length $l(u,v)$: local measure for two nodes

◊ Average shortest path length $\langle l \rangle$: global network property indicating navigability

Graph-theoretical Measure: Distance

[Figure: example graph with vertices A to G]

◊ Breadth-first search: exploration of all nodes in a graph, starting from those adjacent to a current node.

◊ Dijkstra’s algorithm: construct the shortest-path tree from a source to every other vertex (for $N$ vertices: $O(N^2)$).
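For unweighted graphs, breadth-first search already yields all shortest path lengths from a source. A minimal sketch for the example graph follows; it assumes that the edges are treated as undirected (the slide does not say so explicitly) and that the graph is connected, and the function names are illustrative.

```python
from collections import deque

# Edges of the example graph, treated as undirected (assumption).
EDGES = [("A", "B"), ("B", "C"), ("B", "D"), ("C", "D"),
         ("D", "E"), ("D", "F"), ("E", "F"), ("F", "G")]


def adjacency(edges):
    """Neighbor sets of an undirected graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def bfs_distances(adj, source):
    """Shortest path length l(source, v) for every vertex v reachable from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:          # first visit gives the shortest distance
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def average_shortest_path_length(adj):
    """<l>: mean of l(u, v) over all pairs of distinct, mutually reachable vertices."""
    total = pairs = 0
    for u in adj:
        for v, d in bfs_distances(adj, u).items():
            if v != u:
                total += d
                pairs += 1
    return total / pairs


adj = adjacency(EDGES)
print(bfs_distances(adj, "A")["G"])                 # 4
print(round(average_shortest_path_length(adj), 3))  # 1.905
```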

Graph-theoretical Measure: Diameter

[Figure: example graph with vertices A to G]

The diameter of a graph is the maximal distance between any pair of its nodes.

Strictly speaking, the diameter of a disconnected graph (i.e., one made up of several isolated clusters) is infinite, but it can be defined as the maximum diameter of its clusters.

Random graphs tend to have small diameters, provided $p$ is not too small.

• If $\langle k \rangle = pN < 1$, a typical graph is composed of isolated trees and its diameter equals the diameter of a tree.

• If $\langle k \rangle > 1$, a giant cluster appears. The diameter of the graph equals the diameter of the giant cluster if $\langle k \rangle > 3.5$, and is proportional to $\ln(N)/\ln(\langle k \rangle)$.

• If $\langle k \rangle > \ln(N)$, almost every graph is totally connected. The diameters of the graphs having the same $N$ and $\langle k \rangle$ are concentrated on a few values around $\ln(N)/\ln(\langle k \rangle)$.
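The diameter, including the convention for disconnected graphs, can be computed by running a breadth-first search from every node. A minimal sketch, assuming undirected edges for the example graph; the second graph is a made-up disconnected example, and the function names are illustrative.

```python
from collections import deque


def adjacency(edges):
    """Neighbor sets of an undirected graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def bfs_distances(adj, source):
    """Shortest path lengths from source to every node in its cluster."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def diameter(adj):
    """Maximal distance between two nodes; for a disconnected graph,
    the maximum diameter of its clusters (BFS never leaves a cluster)."""
    return max(max(bfs_distances(adj, u).values()) for u in adj)


example = adjacency([("A", "B"), ("B", "C"), ("B", "D"), ("C", "D"),
                     ("D", "E"), ("D", "F"), ("E", "F"), ("F", "G")])
two_clusters = adjacency([(1, 2), (3, 4), (4, 5)])   # hypothetical example

print(diameter(example))       # 4
print(diameter(two_clusters))  # 2
```

This runs one search per node, which is fine for small graphs but costly for large networks.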

Graph-theoretical Measures: Clustering

[Figure: example graph with vertices A to G]

Example: $C(D) = 1/3$. Adjacent nodes: B, C, E, F. Number of links: 2. Possible number of links: 6.

Clustering coefficient $C(v)$ for node $v$: ratio between the number of edges linking nodes adjacent to $v$ and the total number of possible edges among them (at most $k_v(k_v-1)/2$ for $k_v$ neighbors).

Idea behind: In many networks, if node A is connected to B, and B is connected to C, then it is highly probable that A also has a direct link to C.

Graph-theoretical Measures: Clustering

[Figure: example graph with vertices A to G]

$C(A) = 0$, $C(B) = 1/3$, $C(C) = 1$, $C(D) = 1/3$, $C(E) = 1$, $C(F) = 1/3$, $C(G) = 0$

Average clustering coefficient $\langle C \rangle$: tendency of the network to form clusters or groups. Here $\langle C \rangle = 3/7$.

Average clustering coefficient for all nodes with $k$ links, $C(k)$: diversity of cohesiveness of local neighborhoods.
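The listed values can be recomputed from the definition. A minimal sketch for the example graph, assuming undirected edges and the convention that nodes with fewer than two neighbors get $C = 0$ (as for A and G); the function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations

# Edges of the example graph, treated as undirected (assumption).
EDGES = [("A", "B"), ("B", "C"), ("B", "D"), ("C", "D"),
         ("D", "E"), ("D", "F"), ("E", "F"), ("F", "G")]


def adjacency(edges):
    """Neighbor sets of an undirected graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def clustering(adj, v):
    """C(v): links among the neighbors of v divided by k_v (k_v - 1) / 2."""
    neighbors = adj[v]
    k = len(neighbors)
    if k < 2:
        return Fraction(0)   # convention for nodes with fewer than two neighbors
    links = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
    return Fraction(2 * links, k * (k - 1))


adj = adjacency(EDGES)
coefficients = {v: clustering(adj, v) for v in sorted(adj)}
for v, c in coefficients.items():
    print(v, c)
print("average:", sum(coefficients.values()) / len(coefficients))   # 3/7
```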

Graph-theoretical Measures: Clustering

[Figure: example graph with vertices A to G]

Complex networks exhibit a large degree of clustering. If we consider a node in a random graph and its nearest neighbors, the probability that two of these neighbors are connected is equal to the probability that two randomly selected nodes are connected. The clustering coefficient of a random graph is therefore $C_{rand} = p \approx \langle k \rangle / N$.

Graph-theoretical Measures: Clustering

[Figure] Clustering coefficients as predicted for random networks, compared with clustering coefficients for real networks (WWW, movie actors, co-authorship, E. coli substrate graph, E. coli reaction graph, food webs, word co-occurrence, power grids, …)