Web Data Mining (Mining di dati web)
Lecture 2: The Web Graph. Academic year 2006/2007
The Web Graph
The link structure of Web pages forms a graph.
The Web graph (hereinafter called W) is a directed graph $W = (V, E)$.
V is the vertex set; each vertex represents a page in the Web.
E is the edge set; a directed edge $(e_1, e_2)$ exists whenever a link appears in the page represented by $e_1$ to the page represented by $e_2$.
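A Toy Example of W
[Figure: four pages numbered 1 to 4; page 1 contains Link 11 and Link 12, page 2 contains Link 21 and Link 22, page 3 contains Link 31, page 4 contains Link 41.]
V = {1, 2, 3, 4}
E = {(1,2), (1,4), (2,3), (2,4), (3,1), (4,3)}
Adjacency matrix (row = source page, column = destination page):
     1  2  3  4
1    0  1  0  1
2    0  0  1  1
3    1  0  0  0
4    0  0  1  0
Adjacency lists (out-links): 1: 2, 4    2: 3, 4    3: 1    4: 3
Adjacency lists (in-links):  1: 3       2: 1       3: 2, 4    4: 1, 2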
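The toy graph can be stored as an adjacency matrix and as adjacency lists built directly from E. The following is a minimal sketch; the names `adj_matrix`, `out_links`, and `in_links` are illustrative and do not come from the slides.

```python
# Toy Web graph from the slide: pages are vertices, links are directed edges.
V = [1, 2, 3, 4]
E = [(1, 2), (1, 4), (2, 3), (2, 4), (3, 1), (4, 3)]

# Adjacency matrix: adj_matrix[i][j] == 1 iff page V[i] links to page V[j].
index = {v: i for i, v in enumerate(V)}
adj_matrix = [[0] * len(V) for _ in V]
for src, dst in E:
    adj_matrix[index[src]][index[dst]] = 1

# Adjacency lists: successors (out-links) and predecessors (in-links).
out_links = {v: [] for v in V}
in_links = {v: [] for v in V}
for src, dst in E:
    out_links[src].append(dst)
    in_links[dst].append(src)

for row in adj_matrix:
    print(row)
print("out-links:", out_links)
print("in-links:", in_links)
```

The matrix needs |V|^2 cells regardless of the number of links, while the lists need space proportional to |V| + |E|, which is why list-based layouts are the usual choice for sparse graphs such as W.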
A More Realistic W
The size of W
What is being measured?
Number of hosts
Number of (static) HTML pages
Volume of data
Number of hosts: Netcraft survey, http://news.netcraft.com/archives/web_server_survey.html
Monthly report on how many web hosts and servers are out there.
Number of pages: numerous estimates. Recently Yahoo announced an index with 20B pages.
The “real” size of W
The web is really infinite.
Dynamic content, e.g. calendars, online organizers, etc.
http://www.raingod.com/raingod/resources/Programming/JavaScript/Software/RandomStrings/index.html
Static web contains syntactic duplication, mostly due to mirroring (~20-30%).
Some servers are seldom connected.
Recent Measurement of W
[Gulli & Signorini, 2005]. Total web > 11.5B.
2.3B pages are unknown to popular search engines.
35-120B pages are within the hidden web.
The index intersection between the largest available search engines (namely Google, Yahoo!, MSN, AskJeeves) is estimated to be 28.8%.
Evolution of W
All of these numbers keep changing.
Relatively few scientific studies of the evolution of the web [Fetterly & al., 2003] http://research.microsoft.com/research/sv/sv pubs/p97-fetterly/p97-fetterly.pdf
Sometimes possible to extrapolate from small samples (fractal models) [Dill & al., 2001] http://www.vldb.org/conf/2001/P069.pdf
Rate of change
There are a number of different studies analyzing the rate of change of pages in V.
[Cho & al., 2000]
720K pages from 270 popular sites, sampled daily from Feb 17 to Jun 14, 1999
Any changes: 40% weekly, 23% daily
[Fetterly & al., 2003]
Massive study: 151M pages checked over a few months
Significantly changed: 7% weekly
Slightly changed: 25% weekly
[Ntoulas & al., 2004]
154 large sites re-crawled from scratch weekly
8% new pages per week
8% die
5% new content
25% new links per week
Rate of change [Fetterly & al., 2003]
Rate of change [Ntoulas & al., 2004]
The Bow-Tie Structure
The Power of Power Laws
A power law relationship between two scalar quantities x and y is one where the relationship can be written as $y = a x^k$, where a (the constant of proportionality) and k (the exponent of the power law) are constants.
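Taking logarithms of a power law gives log y = log a + k log x, a straight line on a log-log plot with slope k. The sketch below recovers a and k from noise-free synthetic points by ordinary least squares on the logs. The function name `fit_power_law` and the sample values a = 3, k = -2.1 are assumptions made for illustration. This simple fit is only meant to show the linear relation; it is not a robust estimator for noisy empirical data.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of log y = log a + k log x; returns (a, k)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    k = sxy / sxx                          # slope on the log-log scale
    a = math.exp(mean_y - k * mean_x)      # intercept mapped back from log a
    return a, k

# Synthetic, noise-free points drawn from y = 3 * x^(-2.1) (illustrative values).
xs = [1, 2, 4, 8, 16, 32]
ys = [3.0 * x ** -2.1 for x in xs]
a, k = fit_power_law(xs, ys)
print(round(a, 6), round(k, 6))
```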
Power laws are observed in many subject areas, including physics, biology, geography, sociology, economics, and linguistics. Power laws are among the most frequent scaling laws that describe the scale invariance found in many natural phenomena.
Power Law Probability Distributions
Sometimes called heavy-tail or long-tail distributions.
Examples of power law probability distributions:
The Pareto distribution, for example, the distribution of wealth in capitalist economies
Zipf's law, for example, the frequency of unique words in large texts http://wordcount.org/main.php
Scale-free networks, where the distribution of links is given by a power law (in particular, the World Wide Web)
Frequency of events or effects of varying size in self-organized critical systems, e.g. the Gutenberg-Richter law of earthquake magnitudes and Horton's laws describing river systems
The in/out-degree
Power law trend: $\Pr(X = k) \approx c\,k^{-\gamma}$
[Figure: degree distribution plots; exponent values shown on the slide: 2.1 and 2.55]
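The empirical in-degree and out-degree distributions of a graph can be computed by counting, for each degree k, the fraction of nodes with that degree. The sketch below does this for the toy graph used earlier in the lecture; the function name `degree_distribution` is hypothetical.

```python
from collections import Counter

def degree_distribution(nodes, edges):
    """Return (in_dist, out_dist): each maps a degree k to the fraction of nodes with that degree."""
    in_deg = {v: 0 for v in nodes}
    out_deg = {v: 0 for v in nodes}
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
    n = len(nodes)
    in_dist = {k: c / n for k, c in Counter(in_deg.values()).items()}
    out_dist = {k: c / n for k, c in Counter(out_deg.values()).items()}
    return in_dist, out_dist

# Toy graph from the lecture.
toy_nodes = [1, 2, 3, 4]
toy_edges = [(1, 2), (1, 4), (2, 3), (2, 4), (3, 1), (4, 3)]
in_dist, out_dist = degree_distribution(toy_nodes, toy_edges)
print("in-degree distribution:", in_dist)
print("out-degree distribution:", out_dist)
```

On a real crawl, plotting these fractions against k on log-log axes is the usual way to inspect whether the tail looks like a power law.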
Random Graphs
Random graphs (RGs) are structures introduced by Paul Erdős and Alfréd Rényi.
There are several models of RGs. We are concerned with the model $G_{n,p}$.
A graph $G = (V,E) \in G_{n,p}$ is such that $|V| = n$ and each possible edge $(u,v)$ is included in E independently at random with probability p.
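A graph from the $G_{n,p}$ model can be sampled by flipping one biased coin per unordered pair of vertices. This is a minimal sketch assuming the undirected model; the function name `sample_gnp` and the `seed` parameter are illustrative. It examines all n(n-1)/2 pairs, so it is only practical for small n.

```python
import random
from itertools import combinations

def sample_gnp(n, p, seed=None):
    """Sample an undirected G(n, p) graph: each pair {u, v} becomes an edge with probability p."""
    rng = random.Random(seed)
    vertices = list(range(n))
    edges = [(u, v) for u, v in combinations(vertices, 2) if rng.random() < p]
    return vertices, edges

vertices, edges = sample_gnp(n=200, p=0.05, seed=42)
degree = {v: 0 for v in vertices}
for u, v in edges:
    degree[u] += 1
    degree[v] += 1
print("edges:", len(edges))
print("average degree:", sum(degree.values()) / len(vertices))
```

The average degree should come out close to p(n-1), since each vertex has n-1 potential neighbours, each present with probability p.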
W cannot be a RG
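Let $X_k$ be a discrete random variable counting the number of nodes with degree equal to k.
In $G_{n,p}$ the expected value of $X_k$ is
$$E(X_k) = n \binom{n-1}{k} p^k (1-p)^{n-1-k} = \lambda_k.$$
$X_k$ is asymptotically distributed as a Poisson variable with mean $\lambda_k$:
$$\Pr(X_k = r) = e^{-\lambda_k} \frac{\lambda_k^r}{r!}.$$
The degree of each node in $G_{n,p}$ is binomially distributed, so degrees concentrate around $np$ and large degrees are exponentially unlikely; this does not match the power law degree distribution observed for W.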
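The expected number of nodes of degree k in $G_{n,p}$ can be evaluated numerically to see how quickly it vanishes once k moves away from np. The sketch below assumes the standard binomial form n * C(n-1, k) * p^k * (1-p)^(n-1-k); the function names and the values n = 1000, p = 0.01 are chosen for illustration only.

```python
import math

def expected_count(n, p, k):
    """E(X_k) = n * C(n-1, k) * p^k * (1-p)^(n-1-k): expected number of degree-k nodes in G(n, p)."""
    return n * math.comb(n - 1, k) * p ** k * (1 - p) ** (n - 1 - k)

def poisson_pmf(mean, r):
    """Pr(X = r) for a Poisson variable with the given mean."""
    return math.exp(-mean) * mean ** r / math.factorial(r)

n, p = 1000, 0.01  # average degree is about n * p = 10
for k in (5, 10, 20, 40):
    lam = expected_count(n, p, k)
    # lam is the Poisson mean of X_k; poisson_pmf(lam, 0) approximates Pr(no node has degree k).
    print(k, lam, poisson_pmf(lam, 0))
```

Under a power law, doubling k divides the expected count by a constant factor; here the drop between k = 20 and k = 40 is far steeper, which is the mismatch with W that the slide points to.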
The avg distance of a graph G
Let $u, v \in V$ be two nodes of G.
Let $d(u,v)$ be the distance from u to v, expressed as the length of the shortest path connecting u to v. If u and v are not connected, the distance is set to $\infty$.
Define
$$L(G) = \frac{1}{|S|} \sum_{(u,v) \in S} d(u,v)$$
where S is the set of pairs of distinct nodes u, v of G with the property that $d(u,v)$ is finite.
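The average distance can be computed with one breadth-first search per node, summing the finite distances and counting the reachable pairs. The sketch below assumes that S contains ordered pairs and that paths follow edge directions; the function name `average_distance` is hypothetical. It is applied to the toy graph from earlier in the lecture.

```python
from collections import deque

def average_distance(nodes, edges):
    """Average of d(u, v) over ordered pairs of distinct nodes with finite directed distance."""
    succ = {v: [] for v in nodes}
    for src, dst in edges:
        succ[src].append(dst)
    total, pairs = 0, 0
    for source in nodes:
        dist = {source: 0}              # BFS distances from `source`
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in succ[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())     # the source itself contributes 0
        pairs += len(dist) - 1          # reachable nodes other than the source
    return total / pairs if pairs else float("nan")

toy_nodes = [1, 2, 3, 4]
toy_edges = [(1, 2), (1, 4), (2, 3), (2, 4), (3, 1), (4, 3)]
print(average_distance(toy_nodes, toy_edges))  # 19/12, about 1.583
```

Running one BFS per node costs O(|V| (|V| + |E|)) time, so on graphs the size of W the average distance is normally estimated from a sample of source nodes rather than computed exactly.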
The avg distance of W
A small world graph is a graph whose average distance is much smaller than the order of the graph.
For instance, $L(G) \in O(\log |V(G)|)$.
$L(W)$ is about 7.
$L_d(W)$ is about 18.
What’s the best model for W?
A graph model for the web should have (at least) the following features:
1. On-line property. The number of nodes and edges changes with time.
2. Power law degree distribution. The degree distribution follows a power law with exponent > 2.
3. Small world property. The average distance is much smaller than the order of the graph.
4. Many dense bipartite subgraphs. The number of distinct bipartite cliques or cores is large when compared to a random graph with the same number of nodes and edges.
It remains a challenge to find a web graph model that provably has all four properties.
W Models proposed so far.
[Bollobas & al., 2001]. Linearized Chord Diagram (LCD).
[Aiello & al., 2001]. ACL.
[Chung & al., 2003]. CL.
[Kumar & al., 1999]. Copying model.
[Chung & al., 2004]. CL-del (growth-deletion model).
[Cooper & al., 2004]. CFV.
General Characteristics
Model        LCD   ACL      CL       Copying  CL-del   CFV
Directed     Y     Y        N        Y        N        N
Property 1   Y     Y        N        Y        Y        Y
Property 2   Y     Y        Y        Y        Y        Y
Property 3   Y     ?        Y        ?        Y        ?
Property 4   ?     N        ?        Y        ?        ?
Exponent     3     (2, ∞)   (2, ∞)   (2, ∞)   (2, ∞)   (2, ∞)
References
[Gulli & Signorini, 2005]. Antonio Gulli and Alessio Signorini. The indexable web is more than 11.5 billion pages. WWW (Special interest tracks and posters) 2005: 902-903.
[Fetterly & al., 2003]. Dennis Fetterly, Mark Manasse, Marc Najork, and Janet Wiener. A Large-Scale Study of the Evolution of Web Pages. 12th International World Wide Web Conference (May 2003), pages 669-678.
[Dill & al., 2001]. Stephen Dill, Ravi Kumar, Kevin S. McCurley, Sridhar Rajagopalan, D. Sivakumar, Andrew Tomkins: Self-similarity in the web. ACM Trans. Internet Techn. 2(3): 205-223 (2002).
[Cho & al., 2000]. Junghoo Cho, Hector Garcia-Molina. The Evolution of the Web and Implications for an Incremental Crawler. VLDB 2000: 200-209.
[Ntoulas & al., 2004]. Alexandros Ntoulas, Junghoo Cho, Christopher Olston. What's new on the web?: the evolution of the web from a search engine perspective. WWW 2004: 1-12.
[Bollobas & al., 2001]. Béla Bollobás, Oliver Riordan, Joel Spencer, and Gábor Tusnády. The degree sequence of a scale-free random graph process. Random Structures and Algorithms, vol 18, 2001, 279-290.
[Aiello & al., 2001]. William Aiello, Fan R. K. Chung, Linyuan Lu. Random Evolution in Massive Graphs. FOCS 2001: 510-519.
[Chung & al., 2003]. Fan R. K. Chung, L. Lu. The average distances in random graphs with given expected degrees. Internet Mathematics. 1 (2003): 91-114.
[Kumar & al., 1999]. R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, and Eli Upfal. Stochastic models for the Web graph. Proceedings of the 41st FOCS. 2000, pp. 57-65.
[Chung & al., 2004]. F. Chung, L. Lu. Coupling Online and Offline Analyses for Random Power Law Graphs. Internet Mathematics. Vol 1 (2003). 409-461.
[Cooper & al., 2004]. C. Cooper, A. Frieze, J. Vera. Random Deletions in a Scale Free Random Graph Process. Internet Mathematics. Vol 1 (2003). 463-483.