February 25, 2005
Information Retrieval
Handout #8
(C) 2003, The University of Michigan
Course Information
• Instructor: Dragomir R. Radev ([email protected])
• Office: 3080, West Hall Connector
• Phone: (734) 615-5225
• Office hours: M 11-12 & Th 12-1 or via email
• Course page: http://tangra.si.umich.edu/~radev/650/
• Class meets on Fridays, 2:10-4:55 PM in 409 West Hall
Models of the Web
Size
• The Web is the largest repository of data and it
grows exponentially.
– 320 Million Web pages [Lawrence & Giles 1998]
– 800 Million Web pages, 15 TB [Lawrence & Giles
1999]
– 8 Billion Web pages indexed [Google 2005]
• Amount of data
– roughly 200 TB [Lyman et al. 2003]
Bow-tie model of the Web
• Components (number of pages):
– IN: 44 M
– SCC: 56 M
– OUT: 44 M
– TEND: 44 M
– DISC: 17 M
• 24% of pages reachable from a given page
Broder & al. WWW 2000, Dill & al. VLDB 2001
Power laws
• Web site size (Huberman and Adamic 1999)
• Power-law connectivity (Barabasi and Albert
1999): exponents 2.45 for out-degree and 2.1 for
the in-degree
• Others: call graphs among telephone carriers,
citation networks (Redner 1998), e.g., Erdos,
collaboration graph of actors, metabolic pathways
(Jeong et al. 2000), protein networks (Maslov and
Sneppen 2002). All values of gamma are around
2-3.
Small-world networks
• Diameter = average length of the shortest path
between all pairs of nodes. Example…
• Milgram experiment (1967)
– Kansas/Omaha --> Boston (42/160 letters)
– diameter = 6
• Albert et al. 1999 – average distance between two
vertices is $d = 0.35 + 2.06 \log_{10} n$. For $n = 10^9$,
$d = 18.89$.
• Six degrees of separation
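As a quick check of the Albert et al. estimate, the fitted formula can be evaluated directly. This is a minimal sketch; the function name `average_distance` is an illustrative choice and does not come from the slides.

```python
import math

def average_distance(n: float) -> float:
    """Albert et al. (1999) fit: d = 0.35 + 2.06 * log10(n)."""
    return 0.35 + 2.06 * math.log10(n)

# Average distance for a Web graph of one billion pages.
d = average_distance(1e9)
print(round(d, 2))
```

Because the dependence on n is logarithmic, multiplying the number of pages by ten adds only about two clicks to the average distance.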
Clustering coefficient
• Cliquishness (c): for a node v with $k_v$ neighbors, the fraction of
the $k_v (k_v - 1)/2$ pairs of neighbors that are themselves connected.
• Examples:
              n        k      d      drand   C      crand
Actors        225226   61     3.65   2.99    0.79   0.00027
Power grid    4941     2.67   18.7   12.4    0.08   0.005
C. Elegans    282      14     2.65   2.25    0.28   0.05
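The clustering coefficient of a single node can be sketched as follows, assuming an undirected graph stored as a dictionary of neighbor sets. The toy graph and the function name are hypothetical; they are not the networks in the table.

```python
from itertools import combinations

def clustering_coefficient(adj, v):
    """Fraction of the k_v (k_v - 1) / 2 neighbor pairs of v that are linked."""
    neighbors = adj[v]
    k = len(neighbors)
    if k < 2:
        return 0.0  # no neighbor pairs to check
    linked = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
    return linked / (k * (k - 1) / 2)

# Hypothetical undirected toy graph, stored as adjacency sets.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
c_a = clustering_coefficient(adj, "a")
average_c = sum(clustering_coefficient(adj, v) for v in adj) / len(adj)
print(c_a, average_c)
```

Averaging the per-node values over all nodes gives a graph-level C that can be compared with the value for a random graph of the same size and degree.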
Models of the Web
• Evolving networks: fundamental object of statistical physics, social
networks, mathematical biology, and epidemiology
• Erdős/Rényi 59, 60
• Barabási/Albert 99
• Watts/Strogatz 98
• Kleinberg 98
• Menczer 02
• Radev 03
Degree distributions (figure panels A/a and B/b):
$P(k) = \frac{e^{-\langle k \rangle} \langle k \rangle^{k}}{k!}$, with $\langle k \rangle = Np$
$P(k) \sim k^{-\gamma}$
Self-triggerability across hyperlinks
• Document closures for information retrieval
• Self-triggerability
• Poisson distribution [Mosteller&Wallace 84]
• Two-Poisson [Bookstein&Swanson 74]
• Negative Binomial, K-mixture [Church&Gale 95]
• Triggerability across hyperlinks?
$p' = p(w \in p_j \mid p_i \to p_j \wedge w \in p_i)$
[Figure: pages $p_i$ and $p_j$ joined by a hyperlink; plot of p’ against p with sample words by, with, from and photo, dream, path]
Evolving Word-based Web
• Observations:
– Links are made based on topics
– Topics are expressed with
words
– Words are distributed very
unevenly (Zipf, Benford, self-triggerability laws)
• Model
– Pick n
– Generate n lengths according
to a power-law distribution
– Generate n documents using a
trigram model
• Model (cont’d)
– Pick words in decreasing order
of r.
– Generate hyperlinks with
random directionality
• Outcome
– Generates power-law degree
distributions
– Generates topical communities
– Natural variation of PageRank:
LexRank
Social network analysis for IR
Social networks
• Induced by a relation
• Symmetric or not
• Examples:
– Friendship networks
– Board membership
– Citations
– Power grid of the US
– WWW
Krebs 2004
Prestige and centrality
• Degree centrality: how many neighbors each node has.
• Closeness centrality: how close a node is to all of the other
nodes
• Betweenness centrality: based on the role that a node plays
by virtue of being on the path between two other nodes
• Eigenvector centrality: the paths in the random walk are
weighted by the centrality of the nodes that the path
connects.
• Prestige = same as centrality but for directed graphs.
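Two of these measures can be sketched on a toy graph. The closeness definition used here, (n - 1) divided by the sum of shortest-path distances, is one common normalization and is an assumption, since the slide does not fix a formula. The path graph and function names are hypothetical.

```python
from collections import deque

def degree_centrality(adj):
    """Number of neighbors of each node."""
    return {v: len(nbrs) for v, nbrs in adj.items()}

def closeness_centrality(adj, source):
    """(n - 1) / sum of shortest-path distances from source.

    Assumes an unweighted, connected, undirected graph.
    """
    dist = {source: 0}
    queue = deque([source])
    while queue:  # breadth-first search from source
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return (len(adj) - 1) / sum(dist.values())

# Hypothetical path graph 1 - 2 - 3 - 4.
adj = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
print(degree_centrality(adj))
print(closeness_centrality(adj, 1), closeness_centrality(adj, 2))
```

On the path graph the inner nodes score higher on both measures, which matches the intuition that they are better placed to reach everyone else.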
Graph-based representations
• Graph G (V,E)
• Square connectivity (incidence) matrix
[Figure: a graph with nodes 1-8 and its 8 × 8 connectivity matrix]
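A connectivity matrix can be built from an edge list as in the sketch below. The edges are hypothetical and are not the graph from the slide; nodes are numbered from 1, as in the figure.

```python
def connectivity_matrix(n, edges):
    """Return an n x n 0/1 matrix M with M[i][j] = 1 for an edge (i+1) -> (j+1)."""
    matrix = [[0] * n for _ in range(n)]
    for src, dst in edges:
        matrix[src - 1][dst - 1] = 1
    return matrix

# Hypothetical directed edges on nodes 1..4.
edges = [(1, 2), (1, 3), (2, 3), (3, 1), (4, 3)]
M = connectivity_matrix(4, edges)
for row in M:
    print(row)
```

Row i lists the out-links of node i and column j lists the in-links of node j, so row sums and column sums give out-degree and in-degree.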
Markov chains
• A homogeneous Markov chain is defined by an
initial distribution x and a Markov kernel E.
• Path = sequence (x0, x1, …, xn), where
$x_i = x_{i-1} E$
• The probability of a path can be computed as a
product of probabilities for each step i.
• Random walk = find xj given x0, E, and j.
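The random-walk step can be sketched in a few lines, treating x as a row vector and E as a row-stochastic kernel. The 2-state kernel is a hypothetical example, and the function names are illustrative.

```python
def step(x, E):
    """One step of the chain: x_i = x_{i-1} * E (row vector times kernel)."""
    n = len(E)
    return [sum(x[i] * E[i][j] for i in range(n)) for j in range(n)]

def random_walk(x0, E, j):
    """Distribution after j steps, starting from the distribution x0."""
    x = list(x0)
    for _ in range(j):
        x = step(x, E)
    return x

# Hypothetical 2-state kernel; each row sums to 1.
E = [[0.5, 0.5],
     [0.25, 0.75]]
x2 = random_walk([1.0, 0.0], E, 2)
print(x2)
```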
Stationary solutions
• The fundamental Ergodic Theorem for Markov chains [Grimmett and
Stirzaker 1989] says that the Markov chain with kernel E has a
stationary distribution p under three conditions:
– E is stochastic
– E is irreducible
– E is aperiodic
• To make these conditions true:
– All rows of E add up to 1 (and no value is negative)
– Make sure that E is strongly connected
– Make sure that E is not bipartite
• Example: PageRank [Brin and Page 1998]: use “teleportation”
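One way to satisfy all three conditions at once is to blend the row-normalized link matrix with a uniform jump. The sketch below assumes a damping factor of 0.85 and a uniform jump from pages without out-links; both are illustrative choices that the slide does not state.

```python
def with_teleportation(adjacency, damping=0.85):
    """Blend row-normalized links with a uniform jump.

    The result is stochastic (rows sum to 1), irreducible (every entry is
    positive), and aperiodic (every node has a self-transition).
    """
    n = len(adjacency)
    kernel = []
    for row in adjacency:
        out = sum(row)
        if out == 0:
            # Dangling node: assume a uniform jump to any node.
            kernel.append([1.0 / n] * n)
        else:
            kernel.append([damping * a / out + (1 - damping) / n for a in row])
    return kernel

# Hypothetical 3-page chain 0 -> 1 -> 2, where page 2 has no out-links.
links = [[0, 1, 0],
         [0, 0, 1],
         [0, 0, 0]]
E = with_teleportation(links)
for row in E:
    print(row)
```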
Example
[Figure: the 8-node graph with bar charts of PageRank per node at t=0 and t=1]
This graph E has a second graph E’ (not drawn) superimposed on it:
E’ is the uniform transition graph.
Eigenvectors
• An eigenvector is an implicit “direction” for a
matrix.
Mv = λv, where v is non-zero, though λ can be any
complex number in principle.
• The largest eigenvalue of a stochastic matrix E is
real: λ1 = 1.
• For λ1, the left (principal) eigenvector is p, the
right eigenvector = 1
• In other words, $E^T p = p$.
Computing the stationary distribution
Solution for the stationary distribution:
$p = E^T p$
$(I - E^T) p = 0$

function PowerStatDist (E):
begin
  p(0) = u; (or p(0) = [1,0,…0])
  i = 1;
  repeat
    p(i) = E^T p(i-1);
    L = ||p(i) - p(i-1)||_1;
    i = i + 1;
  until L < ε
  return p(i-1)
end
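The power method can be written as a short runnable sketch. It starts from the uniform vector; the tolerance `eps`, the cap `max_iter`, and the 2-state kernel are illustrative assumptions rather than values from the slides.

```python
def power_stat_dist(E, eps=1e-10, max_iter=10_000):
    """Iterate p <- E^T p from the uniform vector until the L1 change is below eps."""
    n = len(E)
    p = [1.0 / n] * n
    for _ in range(max_iter):
        # (E^T p)[j] = sum over i of E[i][j] * p[i]
        nxt = [sum(E[i][j] * p[i] for i in range(n)) for j in range(n)]
        change = sum(abs(a - b) for a, b in zip(nxt, p))
        p = nxt
        if change < eps:
            break
    return p

# Hypothetical 2-state kernel; its stationary distribution is [1/3, 2/3].
E = [[0.5, 0.5],
     [0.25, 0.75]]
p = power_stat_dist(E)
print(p)
```

The number of iterations needed depends on the gap between the largest and second-largest eigenvalue magnitudes of E, which is one reason teleportation helps in practice.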
Example
[Figure: the 8-node graph with bar charts of PageRank per node at t=0, t=1, and t=10]
How Google works
• Crawling
• Anchor text
• Fast query processing
• PageRank
More about PageRank
• Named after Larry Page, co-founder of Google (and
UM alum)
• Reading: “The anatomy of a large-scale
hypertextual Web search engine” by Brin and
Page.
• Independent of query (although more recent work
by Haveliwala (WWW 2002) has also introduced
topic-sensitive PageRank).
HITS
• Query-dependent model (Kleinberg 97)
• Hubs and authorities (e.g., cars, Honda)
$a' = E^T h$
$h' = E a$
• Algorithm
– obtain root set using input query
– expand the root set by radius one
– run iterations on the hub and authority scores together
– report top-ranking authorities and hubs
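The iteration step can be sketched as below, using the update rules a' = E^T h and h' = E a on a connectivity matrix E. Normalizing both vectors to unit length after each round, the fixed iteration count, and the toy graph are assumptions made for illustration.

```python
import math

def hits(E, iterations=50):
    """Alternate a' = E^T h and h' = E a, normalizing to unit length each round."""
    n = len(E)
    a = [1.0] * n
    h = [1.0] * n
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages pointing to j.
        a = [sum(E[i][j] * h[i] for i in range(n)) for j in range(n)]
        # Hub score: sum of authority scores of the pages i points to.
        h = [sum(E[i][j] * a[j] for j in range(n)) for i in range(n)]
        a_norm = math.sqrt(sum(x * x for x in a)) or 1.0
        h_norm = math.sqrt(sum(x * x for x in h)) or 1.0
        a = [x / a_norm for x in a]
        h = [x / h_norm for x in h]
    return a, h

# Hypothetical graph: pages 0 and 1 both link to page 2; page 0 also links to 3.
E = [[0, 0, 1, 1],
     [0, 0, 1, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 0]]
authorities, hubs = hits(E)
print(authorities)
print(hubs)
```

In this toy graph page 2 comes out as the top authority because both hubs point to it, and page 0 as the top hub because it points to both authorities.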
The link-content hypothesis
• Topical locality: a page is similar ($\sigma$) to the page that points to it ($\delta$).
• Davison (TF*IDF, 100K pages)
– 0.31 same domain
– 0.23 linked pages
– 0.19 sibling
– 0.02 random
• Menczer (373K pages, non-linear least squares fit)
$\sigma(\delta) \approx \sigma_\infty + (1 - \sigma_\infty) e^{-\alpha_1 \delta^{\alpha_2}}$
$\alpha_1 = 1.8$, $\alpha_2 = 0.6$, $\sigma_\infty \approx 0.03$
• Chakrabarti (focused crawling) - prob. of losing the topic
Van Rijsbergen 1979, Chakrabarti & al. WWW 1999, Davison SIGIR 2000, Menczer 2001
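To see how quickly similarity decays with link distance, the Menczer fit can be evaluated numerically. The sketch assumes the fit has the form sigma(delta) = sigma_inf + (1 - sigma_inf) * exp(-alpha1 * delta^alpha2), with alpha1 = 1.8, alpha2 = 0.6, and a background level sigma_inf of about 0.03; the symbol names are assumptions.

```python
import math

def similarity_at_distance(delta, alpha1=1.8, alpha2=0.6, sigma_inf=0.03):
    """sigma(delta) = sigma_inf + (1 - sigma_inf) * exp(-alpha1 * delta ** alpha2)."""
    return sigma_inf + (1 - sigma_inf) * math.exp(-alpha1 * delta ** alpha2)

# Expected similarity at link distances 0 through 5.
values = [similarity_at_distance(d) for d in range(0, 6)]
for d, s in enumerate(values):
    print(d, round(s, 4))
```

Under these assumptions the similarity falls from 1 at distance 0 to roughly 0.19 after one link, and approaches the background level within a few more links.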
Document closures for Q&A
[Figure: pages (P) and a link (L), labeled with the words “spain”, “capital”, and “Madrid”]
Document closures for IR
[Figure: pages (P) and a link (L), labeled “University of Michigan”, “Physics Department”, and “Michigan Physics”]
Language models
• Conditional probability distributions over
word sequences
• Example:
$p(\text{“Paris”} \in d_j) = ?$
$p(\text{“Paris”} \in d_j \mid d_j \text{ on Europe}) = ?$
• Training models: assume a parametric form,
then maximize the probability of an existing
text
Link-based language models
• In the absence of other information,
$p(w_i \in p) = 1/d(w_j)$
• Link information:
$p(w_i \in p \mid p_1 \to p \wedge w_i \in p_1) = p(w_i \in p) \cdot R_i$
conjecture: $R_i > 1$
Experimental setup
• 2-Gigabyte wt2g corpus
• 247,491 Web documents
• 3,118,248 links
• 948,036 unique words (after Porter-style stemming)
• ALE (automatic link extrapolator)
Experiment one: setup
• For each stemmed word in wt2g, we compute the
following numbers:
– PagesContainingWord = how many pages in the
collection contain the word
– OutgoingLinks = the total number of outgoing links in
all the pages that contain the word
– LinkedPagesContainingWord = how many of the linked
pages contain the word
• For the latter two measures, only the links inside
the collection were considered
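The three counts can be sketched on a toy collection. The data layout (a dictionary of word sets plus a list of link pairs) and the function name are hypothetical, and counting LinkedPagesContainingWord once per link, rather than once per distinct target page, is an assumption.

```python
def link_counts(pages, links, word):
    """Return (PagesContainingWord, OutgoingLinks, LinkedPagesContainingWord).

    pages: {page_id: set of stemmed words}; links: list of (source, target).
    """
    containing = {pid for pid, words in pages.items() if word in words}
    # Keep only links that start at a page with the word and stay inside the collection.
    outgoing = [(s, t) for s, t in links if s in containing and t in pages]
    linked_containing = sum(1 for _, t in outgoing if t in containing)
    return len(containing), len(outgoing), linked_containing

# Hypothetical four-page collection; page "x" lies outside it.
pages = {
    "a": {"web", "graph"},
    "b": {"web", "link"},
    "c": {"graph"},
    "d": {"web"},
}
links = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "a"), ("d", "x")]
counts = link_counts(pages, links, "web")
print(counts)
```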
The link effect R
• The word “each”
p = 55654/247491 = .225
p’ = 15815/46163 = .343
R = p’/p = .343/.225 = 1.524
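The arithmetic for “each” can be reproduced as below. The sketch assumes that 46163 is the OutgoingLinks count and 15815 the LinkedPagesContainingWord count from the experimental setup; the function name is illustrative.

```python
def link_effect(pages_with_word, total_pages, linked_with_word, outgoing_links):
    """Return (p, p', R) where R = p' / p."""
    p = pages_with_word / total_pages
    p_prime = linked_with_word / outgoing_links
    return p, p_prime, p_prime / p

# Counts for the word "each" from the slide.
p, p_prime, r = link_effect(55654, 247491, 15815, 46163)
print(round(p, 3), round(p_prime, 3), round(r, 2))
```

Computing R from the unrounded probabilities gives about 1.52, slightly below the 1.524 obtained from the rounded values .343 and .225.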
Establishing values for R
IDF 3.0
sorted by IDF          sorted by R
word       IDF         word       R
human      2.981       close      1.675
accord     2.983       among      1.770
perform    2.984       further    1.796
close      2.985       expect     1.864
press      2.992       accord     1.922
applic     2.992       assist     1.962
expect     2.997       human      2.093
among      2.998       perform    2.095
assist     3.004       applic     2.203
further    3.011       press      2.388
IDF 4.0
sorted by IDF          sorted by R
word       IDF         word       R
centuri    3.988       extend     2.085
interact   3.990       beyond     2.477
introduct  3.993       front      2.606
front      3.994       centuri    2.713
travel     3.997       elimin     2.753
elimin     4.009       damag      2.757
opinion    4.013       introduct  2.843
damag      4.017       opinion    2.984
beyond     4.019       travel     3.491
extend     4.021       interact   3.527
IDF rank        p       p’      IDF      R        sample words
1-100           0.4047  0.5293  1.6761   1.3639   the of make and
101-200         0.2141  0.3574  2.3803   1.6745   under go between copyright
201-300         0.1688  0.3209  2.6896   1.9047   market subject special mean
301-400         0.1386  0.2876  2.9513   2.0750   administr put establish ask
401-500         0.1192  0.2588  3.1548   2.1750   understand social hand share
501-600         0.1046  0.2426  3.3326   2.3179   prevent staff risk north
601-700         0.0934  0.2246  3.4879   2.4085   trade class size california
701-800         0.0839  0.2201  3.6354   2.6233   global drug letter softwar
801-900         0.0752  0.2004  3.7884   2.6668   sound tool monitor transport
901-1000        0.0669  0.2024  3.9499   3.0200   permit target east normal
1001-1100       0.0605  0.1823  4.0909   3.0149   approxim telephon danger europ
1101-1200       0.0548  0.1710  4.2292   3.1213   favor richard map pictur
1201-1300       0.0498  0.1752  4.3635   3.5210   professor earth english republican
1301-1400       0.0454  0.1652  4.4934   3.6366   medicin doctor church color
1401-1500       0.0416  0.1630  4.6166   3.9224   permiss agenda programm prioriti
1501-1600       0.0385  0.1508  4.7262   3.9165   prospect broadcast acquir feedback
1601-1700       0.0358  0.1450  4.8306   4.0517   temperatur florida percentag membership
1701-1800       0.0332  0.1462  4.9358   4.4050   alcohol lake crisi china
1801-1900       0.0310  0.1386  5.0363   4.4735   francisco disciplin film medium
1901-2000       0.0287  0.1379  5.1454   4.8090   entertain psycholog anticip arrest
…               …       …       …        …        …
100001-100100   0.0000  0.0642  12.4774  363.73   sinker surmont thong undergrowth
500001-500100   0.0000  0.0215  16.2970  2658.9   scheflin schena schendel scheriff
Linear fit for the 2000 lowest-IDF words
[Figure: plot of p’ against p]
Cluster One
[Figure: plot of p’ against p; sample words: by, with, from]
Cluster Two
[Figure: plot of p’ against p; sample words: photo, dream, path]