Models of the web graph
AM8002
Fall 2014
Week 3 - Complex Networks
and their Properties
Dr. Anthony Bonato
Ryerson University
Complex Networks
• web graph, social networks, biological networks, internet
networks, …
Networks - Bonato
What is a complex network?
• no precise definition
• however, there is general consensus on the
following observed properties
1. large scale
2. evolving over time
3. power law degree distributions
4. small world properties
• other properties depend on the kind of network
being discussed
Examples of complex networks
• technological/informational: web graph, router
graph, AS graph, call graph, e-mail graph
• social: on-line social networks (Facebook,
Twitter, LinkedIn, …), collaboration graphs, co-actor graph
• biological networks: protein interaction
networks, gene regulatory networks, food
networks
Example: the web graph
• nodes: web pages
• edges: links
• one of the first
complex networks
to be analyzed
• viewed as directed
or undirected
Example: On-line Social
Networks (OSNs)
• nodes: users on
some OSN
• edges: friendship
(or following) links
• may be directed or
undirected
Anthony Bonato - The web graph
Example: Co-author graph
• nodes:
mathematicians
and scientists
• edges: coauthorship
• undirected
Example: Co-actor graph
• nodes: actors
• edges: co-stars
• Hollywood graph
• undirected
Hierarchical social networks
• social networks
which are oriented
from top to bottom
• information flows
one way
• examples: Twitter,
executives in a
company, terrorist
networks
Example: protein interaction
networks
• nodes: proteins in a
living cell
• edges: biochemical
interaction
• undirected
Introducing the Web Graph Anthony Bonato
Properties of complex networks
1. Large scale: relative to order and size
• web graph: order > trillion
– in some sense infinite: number of strings entered into
Google
• Facebook: > 1 billion nodes; Twitter: > 500 million
nodes
– much denser (i.e. higher average degree) than the
web graph
• protein interaction networks: order in thousands
Properties of complex networks
2. Evolving: networks change over time
• web graph: billions of nodes and links appear and
disappear each day
• Facebook: grew to 1 billion users
– denser than the web graph
• protein interaction networks:
order in the thousands
– evolves much more slowly
Properties of Complex Networks
3. Power law degree distribution
• for a graph G of order n and i a positive integer, let $N_{i,n}$
denote the number of nodes of degree i in G
• we say that G follows a power law degree distribution if
for some range of i and some b > 2,
$N_{i,n} \approx i^{-b} n$
• b is called the exponent of the power law
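A minimal sketch of how the exponent b might be estimated from degree counts, assuming a plain least-squares fit of log N_i against log i. The function names and the synthetic counts are hypothetical and chosen only for illustration; a least-squares fit on a log-log plot is a rough estimator, not necessarily the method used for the crawls cited in these slides.

```python
import math
from collections import Counter

def degree_counts(degrees):
    """Return {i: N_i}, the number of nodes of each positive degree i."""
    return Counter(d for d in degrees if d > 0)

def estimate_exponent(counts):
    """Least-squares slope of log N_i against log i; returns b = -slope."""
    xs = [math.log(i) for i in counts]
    ys = [math.log(counts[i]) for i in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx

# Hypothetical, idealised counts that follow N_i = n * i^(-b) exactly with b = 2.1
n, b = 1_000_000, 2.1
synthetic = {i: n * i ** (-b) for i in range(1, 101)}
print(round(estimate_exponent(synthetic), 3))  # should come out close to 2.1
```

Real degree data is noisy at high degrees, so the fit is usually restricted to a range of i, in line with the "for some range of i" in the definition.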
Properties of Complex Networks
• power law degree distribution in the web
graph:
(Broder et al, 01)
reported an
exponent b = 2.1 for
the in-degree
distribution (in a 200
million vertex crawl)
Interpreting a power law
• many low-degree nodes
• few high-degree nodes
[Figure: a binomial degree distribution (highway network) compared with a power law degree distribution (air traffic network)]
Notes on power laws
• b is the exponent of the power law
• note that the law is
– approximate: constants do not affect it
– asymptotic: holds only for large n
– may not hold for all degrees, but most
degrees (for example, sufficiently large or
sufficiently small degrees)
Degree distribution (log-log plot) of
a power law graph
Power laws in OSNs
Discussion
Which of the following are power law
graphs?
1. High school/secondary school graph. Nodes: students
in a high school; edges: friendship links.
2. Power grids. Nodes: generators, power plants, large
consumers of power; edges: electrical cable.
3. Banking networks. Nodes: banks; edges: financial
transaction.
Graph parameters
• average distance:
$L(G) = \binom{n}{2}^{-1} \sum_{u,v \in V(G)} d(u,v)$
– the sum, taken over unordered pairs of distinct nodes, is the Wiener index, W(G)
• clustering coefficient:
$c(x) = \binom{\deg(x)}{2}^{-1} |E(N(x))|$, $C(G) = n^{-1} \sum_{x \in V(G)} c(x)$
Examples
• Cliques have average distance 1, and clustering
coefficient 1
• Triangle-free graphs have clustering coefficient 0
• Clustering coefficient of following graph is 0.75.
• Note: average distance bounded above by diameter
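The first two examples can be checked numerically. The sketch below assumes an undirected, connected graph stored as a dictionary mapping each node to its set of neighbours; the helper names are hypothetical, and taking c(x) = 0 for nodes of degree less than 2 is a convention assumed here, not one stated in the slides.

```python
from collections import deque
from itertools import combinations

def distances_from(adj, source):
    """Breadth-first search distances from source in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def average_distance(adj):
    """L(G) = W(G) / C(n, 2) for a connected graph."""
    n = len(adj)
    # Summing over all sources counts each unordered pair twice, so halve it.
    wiener = sum(sum(distances_from(adj, u).values()) for u in adj) / 2
    return wiener / (n * (n - 1) / 2)

def clustering_coefficient(adj):
    """C(G): mean over nodes x of c(x); c(x) is taken as 0 when deg(x) < 2."""
    total = 0.0
    for x, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue
        # Count edges among the neighbours of x.
        links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
        total += links / (k * (k - 1) / 2)
    return total / len(adj)

def clique(n):
    return {u: {v for v in range(n) if v != u} for u in range(n)}

def cycle(n):
    return {u: {(u - 1) % n, (u + 1) % n} for u in range(n)}

print(average_distance(clique(5)), clustering_coefficient(clique(5)))
print(clustering_coefficient(cycle(5)))  # a 5-cycle is triangle-free
```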
Properties of Complex Networks
4. Small world property
• small world networks introduced by social
scientists Watts & Strogatz in 1998
– low distances
• diam(G) = O(log n)
• L(G) = O(log log n)
– higher clustering coefficient than a random graph with
the same expected degree
[Figure: nodes labelled Ryerson, Nuit Blanche, City of Toronto, Four Seasons Hotel, Frommer’s, Greenland Tourism]
Sample data: Flickr, YouTube,
LiveJournal, Orkut
• (Mislove et al,07): short average distances
and high clustering coefficients
Other properties of complex networks
– many complex networks (including on-line
social networks) obey two additional laws:
1. Densification Power Law (Leskovec,
Kleinberg, Faloutsos,05):
– networks are becoming more dense over
time; i.e. average degree is increasing
$|E(G_t)| \approx |V(G_t)|^{a}$
where 1 < a ≤ 2: densification exponent
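One way to see why an exponent a > 1 means increasing average degree, sketched under the assumptions that the graph is undirected and that the approximation holds up to a constant factor:

```latex
\[
|E(G_t)| \approx |V(G_t)|^{a}
\quad\Longrightarrow\quad
\log |E(G_t)| \approx a \,\log |V(G_t)|,
\]
\[
\text{average degree}
\;=\; \frac{2\,|E(G_t)|}{|V(G_t)|}
\;\approx\; 2\,|V(G_t)|^{a-1}.
\]
```

So a is the slope of the edges-versus-nodes plot on log-log axes, and the average degree grows with the number of nodes whenever a > 1; a = 1 would correspond to constant average degree.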
Densification – Physics Citations
[Figure: densification plot with fitted exponent a = 1.69]
Densification – Autonomous Systems
[Figure: plot of e(t) against n(t), with fitted exponent a = 1.18]
2. Decreasing distances (Leskovec, Kleinberg,
Faloutsos,05):
• distances (diameter and/or average distances)
decrease with time
(Kumar et al,06):
Diameter – ArXiv citation graph
[Figure: diameter plotted against time in years]
Other properties
• Connected component structure: emergence of
components; giant components
• Spectral properties: adjacency matrix and Laplacian
matrices, spectral gap, eigenvalue distribution
• Small community phenomenon: most nodes belong to
small communities (i.e. subgraphs with more internal than
external links)
…
Discussion
Compute the average distance of each of
the following graphs.
1. A star with n nodes (i.e. a tree of order n
with one vertex of degree n-1, the rest
degree 1)
2. A path with n nodes
3. A wheel with n+1 nodes, n>2.
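Before working out closed forms by hand, it can help to compute a few values and look for the pattern. This is a hedged sketch using breadth-first search; the constructors `star`, `path` and `wheel` are hypothetical helpers written for this exercise, and the wheel is taken to be a hub joined to every node of an n-cycle.

```python
from collections import deque

def average_distance(adj):
    """Mean of d(u, v) over unordered pairs, via BFS from every node."""
    n = len(adj)
    total = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
    # total counts every unordered pair twice
    return (total / 2) / (n * (n - 1) / 2)

def from_edges(n, edges):
    """Undirected graph on nodes 0..n-1 as {node: set of neighbours}."""
    adj = {u: set() for u in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def star(n):   # node 0 is the centre
    return from_edges(n, [(0, v) for v in range(1, n)])

def path(n):
    return from_edges(n, [(v, v + 1) for v in range(n - 1)])

def wheel(n):  # hub 0 joined to a cycle on nodes 1..n, so n + 1 nodes in total
    rim = [(v, v % n + 1) for v in range(1, n + 1)]
    return from_edges(n + 1, rim + [(0, v) for v in range(1, n + 1)])

for n in (4, 8, 16):
    print(n, average_distance(star(n)), average_distance(path(n)),
          average_distance(wheel(n)))
```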
Web Search
• the web contains large amounts of
information (≈ 4 zettabytes, where 1 zettabyte = $10^{21}$ bytes)
– rely on web search engines, such as Google,
Yahoo! Search, Bing, …
Search Engines
• search engines are tools designed to hunt
for information on the web
• they do this by first crawling the web by
making copies of pages and their links
Indexing
• the search engine then indexes the
information crawled from the web, storing
and sorting it
User interface
• users type in queries and get back a
sorted list of web pages and links
Key questions
1. How do search engines choose their
rankings?
2. What makes modern search engines
more accurate than the first search
engines?
3. What does math have to do with it?
Challenges of web search
1. Massive size.
2. Multimedia.
3. Authorities.
Text based search
• first search engines ranked
pages using word frequency
– e.g. if “baseball” appears
many times on page X, then
X is ranked higher on a
search for “baseball”
• easily spammed: insert
“baseball” 100s of times on
page!
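A toy illustration of why word-frequency ranking is easy to spam. The page names and texts are invented for this sketch, and the ranking function is a deliberately naive stand-in rather than any real search engine's method.

```python
def rank_by_word_frequency(pages, query):
    """Order page names by how often the query word occurs in the page text."""
    word = query.lower()
    counts = {name: text.lower().split().count(word)
              for name, text in pages.items()}
    return sorted(counts, key=counts.get, reverse=True)

# Hypothetical pages, invented for illustration
pages = {
    "league_history": "baseball history and baseball statistics for every season",
    "spam_page": "baseball " * 100 + "buy cheap watches",
    "cooking": "recipes for a summer picnic",
}
print(rank_by_word_frequency(pages, "baseball"))
```

The spam page repeats the word 100 times and so outranks the page that is genuinely about baseball.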
Analogy: evil librarian
• you are looking for a
book on baseball in a
library
• evil librarian spends
her time moving books
to fool you
Then came
Google uses graph theory!
Google founders: Larry Page, Sergey Brin
• PageRank models web surfing via a random walk
• PageRank is the probability a random surfer visits a page
• surfer usually
moves via out-links
• on occasion, the
surfer teleports to a
random page
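The random surfer can be simulated directly. In this sketch the four-page web is hypothetical, the surfer follows an out-link with probability c = 0.85 and teleports otherwise, and a page with no out-links is assumed always to teleport; the visit frequencies only approximate PageRank.

```python
import random

def simulate_surfer(out_links, c=0.85, steps=100_000, seed=0):
    """Return the fraction of steps the surfer spends at each node."""
    rng = random.Random(seed)
    nodes = list(out_links)
    visits = dict.fromkeys(nodes, 0)
    current = rng.choice(nodes)
    for _ in range(steps):
        visits[current] += 1
        links = out_links[current]
        if links and rng.random() < c:
            current = rng.choice(links)   # follow a random out-link
        else:
            current = rng.choice(nodes)   # teleport to a random page
    return {v: count / steps for v, count in visits.items()}

# Hypothetical four-page web: pages "a", "b" and "d" all link to "c"
web = {"a": ["c"], "b": ["c"], "c": ["a", "b"], "d": ["c"]}
freq = simulate_surfer(web)
print(max(freq, key=freq.get))  # the most visited page
```

Page "d" has no in-links, so the surfer reaches it only by teleporting.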
How PageRank addresses the
challenges of web search
• PageRank can be computed quickly, even
for large matrices
• PageRank relies only on the link structure
– popular pages are those with many in-links, or
those linked to by other popular pages
• “authorities” have higher PageRank
Google random walk
• this modification of the usual
random walk is called the
Google random walk
• note that it takes place on a
directed graph
The Google Matrix
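• given a digraph G with nodes {1,…,n},
define the matrix P1 by $(P_1)_{ij} = 1/\deg^{+}(i)$ if (i,j) is a directed edge of G, and 0 otherwise
• form P2 by replacing any zero rows of P1 by $(1/n) J_{1,n}$
• define the Google matrix P as
$P = c P_2 + (1 - c)(1/n) J_{n,n}$
– c in (0,1) is the teleportation constant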
Example
Example, continued
Motivation
• P1 corresponds to the random walk using
out-links
• P2 takes care of spider traps: nodes with
zero out-degree
• P(G) adds in the teleportation:
– 85% of the time follow out-links, 15% of the
time jump to a new node chosen at
random from all nodes
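The three steps can be sketched in code. This assumes the usual definitions: P1 has entry 1/out-degree(i) in position (i, j) when i links to j, and P = c·P2 + (1 − c)(1/n)J, where J is the all-ones matrix and c = 0.85 matches the 85%/15% split above. The function name and the example digraph are hypothetical, nodes are numbered from 0, and each list of out-links is assumed to contain no repeats.

```python
def google_matrix(out_links, c=0.85):
    """Build P = c*P2 + (1 - c)*(1/n)*J for a digraph on nodes 0..n-1.

    out_links[i] lists the nodes that node i links to.
    """
    n = len(out_links)
    # P1: row i spreads probability evenly over the out-links of node i
    p1 = [[0.0] * n for _ in range(n)]
    for i, targets in enumerate(out_links):
        for j in targets:
            p1[i][j] = 1.0 / len(targets)
    # P2: a zero row (node with no out-links) becomes the uniform row (1/n) J_{1,n}
    p2 = [row if any(row) else [1.0 / n] * n for row in p1]
    # P: follow out-links with probability c, otherwise teleport uniformly
    return [[c * p2[i][j] + (1 - c) / n for j in range(n)] for i in range(n)]

# Hypothetical 3-node digraph: 0 -> 1, 0 -> 2, 1 -> 2, and node 2 has no out-links
P = google_matrix([[1, 2], [2], []])
for row in P:
    print([round(x, 4) for x in row])
```

Every row of P sums to 1 and every entry is positive, which is what lets the random walk converge.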
PageRank defined
Theorem (Brin, Page, 2000) The Google random
walk converges to a stationary distribution s,
which is the dominant left eigenvector of P(G).
That is, the PageRank vector s solves the linear
system:
$s^T P(G) = s^T$.
Power method
• for a fixed integer n > 0, let $z_0$ be the stochastic vector
whose every entry is 1/n
• define
$z_{t+1}^T = z_t^T P = \dots = z_0^T P^{t+1}$
Lemma 6 (Power Method): The limit of the sequence
$(z_t : t \ge 0)$ is the dominant eigenvector.
• gives a simple method of computing PageRank: multiply
by powers of P(G)
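A minimal sketch of the power method on a small row-stochastic matrix. The matrix below is hypothetical rather than the example from the slides, and the stopping rule (`tol`, `max_iter`) is an assumption added so that the loop terminates.

```python
def power_method(P, tol=1e-12, max_iter=1000):
    """Iterate z_{t+1}^T = z_t^T P from the uniform vector until it settles."""
    n = len(P)
    z = [1.0 / n] * n
    for _ in range(max_iter):
        # Row vector times matrix: entry j is the sum over i of z[i] * P[i][j]
        nxt = [sum(z[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, z)) < tol:
            return nxt
        z = nxt
    return z

# Hypothetical row-stochastic matrix with positive entries (not from the slides)
P = [
    [0.05, 0.475, 0.475],
    [0.05, 0.05, 0.90],
    [1 / 3, 1 / 3, 1 / 3],
]
s = power_method(P)
print([round(x, 4) for x in s])
```

Each step costs one vector-matrix product; for a real web graph the matrix is sparse apart from the teleportation term, which is what keeps the computation feasible.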
Example, continued
PageRank vector: