Web Data Mining


Lecture 2: The Web Graph. Academic Year 2006/2007

The Web Graph

 The linkage structure of Web Pages forms a graph structure.

The Web Graph (hereinafter called W) is a directed graph W = (V, E).

V is the vertex set, and each vertex represents a page in the Web.

E is the edge set, and a directed edge $(e_1, e_2)$ exists whenever a link appears in the page represented by $e_1$ to the page represented by $e_2$.

A Toy Example of W

[Figure: four pages, numbered 1 to 4, containing the links Link 11, Link 12, Link 21, Link 22, Link 31, Link 41]

V = {1, 2, 3, 4}

E = {(1,2), (1,4), (2,3), (2,4), (3,1), (4,3)}

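Adjacency matrix of W (rows are source pages, columns are destination pages):

      1  2  3  4
  1   0  1  0  1
  2   0  0  1  1
  3   1  0  0  0
  4   0  0  1  0

Adjacency lists:

  1: 2, 4
  2: 3, 4
  3: 1
  4: 3

Adjacency lists with gaps (each successor after the first is stored as the difference from the previous one):

  1: 2, 2
  2: 3, 1
  3: 1
  4: 3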
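The three representations of the toy graph can be derived mechanically from V and E. The following is a minimal Python sketch, not part of the original slides; the function names (`adjacency_matrix`, `adjacency_lists`, `gap_encode`) are hypothetical helpers chosen for illustration.

```python
# Toy web graph from the slide: pages are vertices, hyperlinks are directed edges.
V = [1, 2, 3, 4]
E = [(1, 2), (1, 4), (2, 3), (2, 4), (3, 1), (4, 3)]

def adjacency_matrix(vertices, edges):
    """Return a 0/1 matrix with entry [i][j] = 1 when page i links to page j."""
    index = {v: i for i, v in enumerate(vertices)}
    matrix = [[0] * len(vertices) for _ in vertices]
    for src, dst in edges:
        matrix[index[src]][index[dst]] = 1
    return matrix

def adjacency_lists(vertices, edges):
    """Return, for every page, the sorted list of pages it links to."""
    lists = {v: [] for v in vertices}
    for src, dst in edges:
        lists[src].append(dst)
    return {v: sorted(succ) for v, succ in lists.items()}

def gap_encode(successors):
    """Keep the first successor, then store differences between consecutive ones."""
    return [s if i == 0 else s - successors[i - 1]
            for i, s in enumerate(successors)]

A = adjacency_matrix(V, E)
lists = adjacency_lists(V, E)
gaps = {v: gap_encode(succ) for v, succ in lists.items()}
```

Gap encoding is useful mainly because sorted successor lists tend to contain nearby identifiers, so the stored differences are small numbers.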

A More Realistic W

The size of W

 What is being measured?

Number of hosts

Number of (static) HTML pages

Volume of data

Number of hosts: Netcraft survey, a monthly report on how many web hosts and servers are out there. http://news.netcraft.com/archives/web_server_survey.html

Number of pages: numerous estimates. Recently Yahoo announced an index with 20B pages.

The “real” size of W

The web is really infinite: dynamic content, e.g. calendars, online organizers, etc.

http://www.raingod.com/raingod/resources/Programming/JavaScript/Software/RandomStrings/index.html

The static web contains syntactic duplication, mostly due to mirroring (~20-30%).

Some servers are seldom connected.

Recent Measurement of W

[Gulli & Signorini, 2005]. Total web > 11.5B pages.

2.3B pages are unknown to popular search engines.

35-120B pages are within the hidden web.

The index intersection between the largest available search engines (namely Google, Yahoo!, MSN, AskJeeves) is estimated to be 28.8%.

Evolution of W

 All of these numbers keep changing.

Relatively few scientific studies of the evolution of the web exist [Fetterly & al., 2003]. http://research.microsoft.com/research/sv/sv pubs/p97-fetterly/p97-fetterly.pdf

It is sometimes possible to extrapolate from small samples (fractal models) [Dill & al., 2001]. http://www.vldb.org/conf/2001/P069.pdf

Rate of change

There are a number of studies analyzing the rate of change of the pages in V.

[Cho & al., 2000] 720K pages from 270 popular sites, sampled daily from Feb 17 to Jun 14, 1999. Any changes: 40% weekly, 23% daily.

[Fetterly & al., 2003] Massive study: 151M pages checked over a few months. Significantly changed: 7% weekly. Slightly changed: 25% weekly.

[Ntoulas & al., 2004] 154 large sites re-crawled from scratch weekly. 8% new pages/week, 8% die, 5% new content, 25% new links/week.

Rate of change [Fetterly & al., 2003]

Rate of change [Ntoulas & al., 2004]

The Bow-Tie Structure

The Power of Power Laws

A power law relationship between two scalar quantities x and y is one where the relationship can be written as $y = a x^k$, where a (the constant of proportionality) and k (the exponent of the power law) are constants.

Power laws are observed in many subject areas, including physics, biology, geography, sociology, economics, and linguistics.

Power laws are among the most frequent scaling laws that describe the scale invariance found in many natural phenomena.
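A short worked derivation, added here as an illustration, shows why power laws are tied to scale invariance and why they appear as straight lines on log-log plots. It assumes x > 0 and a > 0; the scaling factor c is a symbol introduced for this sketch only.

```latex
% Power law: y = a x^k, with constants a > 0 and k.
\begin{align*}
  \log y &= \log a + k \log x
    && \text{(a straight line with slope } k \text{ on log-log axes)} \\
  a\,(c x)^k &= c^k \, a x^k = c^k \, y
    && \text{(rescaling } x \text{ by } c > 0 \text{ only rescales } y \text{ by } c^k\text{)}
\end{align*}
```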

Power Law Probability Distributions

 Sometimes called heavy-tail or long-tail distributions.

Examples of power law probability distributions:

The Pareto distribution, for example, the distribution of wealth in capitalist economies.

Zipf's law, for example, the frequency of unique words in large texts. http://wordcount.org/main.php

Scale-free networks, where the distribution of links is given by a power law (in particular, the World Wide Web).

Frequency of events or effects of varying size in self-organized critical systems, e.g. the Gutenberg-Richter law of earthquake magnitudes and Horton's laws describing river systems.



The in/out-degree

  2.1

 Pr(  Power law trend:

X k

k

) 

ck

    2.55

Random Graphs

RGs are structures introduced by Paul Erdős and Alfréd Rényi.

There are several models of RGs. We are concerned with the model $G_{n,p}$.

A graph $G = (V,E) \in G_{n,p}$ is such that $|V| = n$ and each edge $(u,v) \in E$ is selected independently at random with probability p.
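Sampling from this model takes only a few lines. The following is a minimal sketch for an undirected graph, assuming vertices are labelled 0 to n-1; the function name `sample_gnp` and the `seed` parameter are hypothetical choices made for this illustration.

```python
import random
from itertools import combinations

def sample_gnp(n, p, seed=None):
    """Sample an undirected graph with n vertices.

    Each of the n*(n-1)/2 possible edges is included independently
    with probability p.
    """
    rng = random.Random(seed)
    vertices = list(range(n))
    edges = [(u, v) for u, v in combinations(vertices, 2) if rng.random() < p]
    return vertices, edges

vertices, edges = sample_gnp(50, 0.1, seed=0)
```

The loop visits every pair of vertices, so it is quadratic in n and only suitable for small experiments.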

W cannot be a RG

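Let $X_k$ be a discrete random variable indicating the number of nodes having degree equal to k.

Obviously, in $G_{n,p}$ the expected value of $X_k$ is

$E(X_k) = n \binom{n-1}{k} p^k (1-p)^{n-1-k} = \lambda_k.$

$X_k$ is asymptotically distributed as a Poisson variable with mean $\lambda_k$:

$\Pr(X_k = r) \approx e^{-\lambda_k} \frac{\lambda_k^r}{r!}.$

The degree distribution of $G_{n,p}$ is therefore binomial, with a tail that decays exponentially rather than as a power law, so W cannot be a RG.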
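A small numerical check makes the mismatch with a power law concrete. This is a sketch under illustrative assumptions: the values n = 1000 and p = 0.01 are arbitrary, and the helper names `expected_nodes_with_degree` and `poisson_pmf` are hypothetical.

```python
import math

def expected_nodes_with_degree(n, p, k):
    """E(X_k) = n * C(n-1, k) * p^k * (1-p)^(n-1-k) in G_{n,p}."""
    return n * math.comb(n - 1, k) * p ** k * (1 - p) ** (n - 1 - k)

def poisson_pmf(lam, r):
    """Pr(X = r) for a Poisson variable with mean lam."""
    return math.exp(-lam) * lam ** r / math.factorial(r)

n, p = 1000, 0.01  # illustrative values; the mean degree is about 10
expected = [expected_nodes_with_degree(n, p, k) for k in range(n)]

# Expected number of nodes whose degree is ten times the mean.
high_degree_nodes = expected_nodes_with_degree(n, p, 100)
```

With these values `high_degree_nodes` is vanishingly small, whereas a power law with the same mean would still predict a noticeable number of high-degree pages.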



The avg distance of a graph G

Let $u, v \in V$ be two nodes of G.

Let d(u,v) be the distance from u to v, expressed as the length of the shortest path connecting u to v. If u and v are not connected, the distance is set to $\infty$.

Define

$L(G) = \frac{\sum_{(u,v) \in S} d(u,v)}{|S|}$

where S is the set of pairs of distinct nodes u, v of G with the property that d(u,v) is finite.
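The definition can be evaluated directly with one breadth-first search per node. The sketch below is an illustration that assumes a directed, unweighted graph given as successor lists, with S taken to be the ordered pairs at finite distance; the names `bfs_distances` and `average_distance` are hypothetical. It is applied to the toy graph with E = {(1,2), (1,4), (2,3), (2,4), (3,1), (4,3)}.

```python
from collections import deque

def bfs_distances(successors, source):
    """Shortest-path lengths from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in successors.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def average_distance(successors):
    """Mean of d(u, v) over ordered pairs of distinct nodes with finite distance.

    Every node must appear as a key of `successors`, possibly with an empty list.
    """
    total = 0
    pairs = 0
    for u in successors:
        for v, d in bfs_distances(successors, u).items():
            if v != u:
                total += d
                pairs += 1
    return total / pairs if pairs else float("nan")

toy = {1: [2, 4], 2: [3, 4], 3: [1], 4: [3]}
L_toy = average_distance(toy)  # 19 / 12, since the 12 ordered pairs sum to 19
```

Running one search per node costs O(|V| (|V| + |E|)), which is why measurements on the real Web rely on sampled sources.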



The avg distance of W

A small world graph is a graph whose avg distance is much smaller than the order of the graph.

For instance, $L(G) \in O(\log |V(G)|)$.

L(W) is about 7.

$L_d(W)$ is about 18.

What’s the best model for W?

A graph model for the web should have (at least) the following features:

1. On-line property. The number of nodes and edges changes with time.

2. Power law degree distribution. The degree distribution follows a power law with exponent > 2.

3. Small world property. The average distance is much smaller than the order of the graph.

4. Many dense bipartite subgraphs. The number of distinct bipartite cliques or cores is large when compared to a random graph with the same number of nodes and edges.

Goal: to find a web graph model that provably has all four properties.

W Models proposed so far.

 [Bollobas & al., 2001]. Linearized Chord Diagram (LCD).

 [Aiello & al., 2001]. ACL.

 [Chung & al., 2003]. CL.

 [Kumar & al., 1999]. Copying model.

[Chung & al., 2004]. CL-del growth-deletion model.

 [Cooper & al., 2004]. CFV.

General Characteristics

Model     Directed  1  2  3  4  Exponent
LCD       Y         Y  Y  Y  ?  3
ACL       Y         Y  Y  ?  N  (2,∞)
CL        N         N  Y  Y  ?  (2,∞)
Copying   Y         Y  Y  ?  Y  (2,∞)
CL-del    N         Y  Y  Y  ?  (2,∞)
CFV       N         Y  Y  ?  ?  (2,∞)

Columns 1 to 4 refer to the four required features of a web graph model.

References

   [Gulli & Signorini, 2005]. Antonio Gulli and Alessio Signorini. The indexable web is more than 11.5 billion pages. WWW (Special interest tracks and posters) 2005: 902-903.

[Fetterly & al., 2003]. Dennis Fetterly, Mark Manasse, Marc Najork, and Janet Wiener. A Large-Scale Study of the Evolution of Web Pages. 12th International World Wide Web Conference (May 2003), pages 669-678.

[Dill & al., 2001]. Stephen Dill, Ravi Kumar, Kevin S. McCurley, Sridhar Rajagopalan, D. Sivakumar, Andrew Tomkins: Self-similarity in the web. ACM Trans. Internet Techn. 2(3): 205-223 (2002).


[Cho & al., 2000]. Junghoo Cho, Hector Garcia-Molina. The Evolution of the Web and Implications for an Incremental Crawler. VLDB 2000: 200-209.

[Ntoulas & al., 2004]. Alexandros Ntoulas, Junghoo Cho, Christopher Olston. What's new on the web?: the evolution of the web from a search engine perspective. WWW 2004: 1-12.

[Bollobas & al., 2001]. Béla Bollobás, Oliver Riordan, Joel Spencer, and Gábor Tusnády. The degree sequence of a scale-free random graph process. Random Structures and Algorithms, vol 18, 2001, 279-290.


[Aiello & al., 2001]. William Aiello, Fan R. K. Chung, Linyuan Lu. Random Evolution in Massive Graphs. FOCS 2001: 510-519.

[Chung & al., 2003]. Fan R. K. Chung, L. Lu. The average distances in random graphs with given expected degrees. Internet Mathematics 1 (2003): 91-114.

[Kumar & al., 1999]. R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, and Eli Upfal. Stochastic models for the Web graph. Proceedings of the 41st FOCS. 2000, pp. 57-65.


[Chung & al., 2004]. F. Chung, L. Lu. Coupling Online and Offline Analyses for Random Power Law Graphs. Internet Mathematics. Vol 1 (2003). 409-461.

[Cooper & al., 2004]. C. Cooper, A. Frieze, J. Vera. Random Deletions in a Scale Free Random Graph Process. Internet Mathematics. Vol 1 (2003). 463-483.