Graph structures in data mining
Download
Report
Transcript Graph structures in data mining
Using Graphs in Unstructured
and Semistructured Data Mining
Soumen Chakrabarti
IIT Bombay
www.cse.iitb.ac.in/~soumen
ACFOCS 2004
Chakrabarti
1
Acknowledgments
C. Faloutsos, CMU
W. Cohen, CMU
IBM Almaden (many colleagues)
IIT Bombay (many students)
S. Sarawagi, IIT Bombay
S. Sudarshan, IIT Bombay
ACFOCS 2004
Chakrabarti
2
Graphics are everywhere
Phone network, Internet, Web
Databases, XML, email, blogs
Web of trust (epinion)
Text and language artifacts (WordNet)
Commodity distribution networks
Internet Map
[lumeta.com]
ACFOCS 2004
Food Web
[Martinez ’91]
Chakrabarti
Protein Interactions
[genomebiology.com]
3
Why analyze graphs?
What properties do real-life graphs have?
How important is a node? What is
importance?
Who is the best customer to target in a
social network?
Who spread a raging rumor?
How similar are two nodes?
How do nodes influence each other?
Can I predict some property of a node
based on its neighborhood?
ACFOCS 2004
Chakrabarti
4
Outline, some more detail
Part 1 (Modeling graphs)
What do real-life graphs look like?
What laws govern their formation, evolution and
properties?
What structural analyses are useful?
Part 2 (Analyzing graphs)
ACFOCS 2004
Modeling data analysis problems using graphs
Proposing parametric models
Estimating parameters
Applications from Web search and text mining
Chakrabarti
5
Modeling and generating
realistic graphs
ACFOCS 2004
Chakrabarti
6
Questions
What do real graphs look like?
Edges, communities, clustering effects
What properties of nodes, edges are
important to model?
Degree, paths, cycles, …
What local and global properties are
important to measure?
How to artificially generate realistic graphs?
ACFOCS 2004
Chakrabarti
7
Modeling: why care?
Algorithm design
Can skewed degree distribution make our
algorithm faster?
Extrapolation
How well will Pagerank work on the Web 10
years from now?
Sampling
Make sure scaled-down algorithm shows same
performance/behavior on large-scale data
Deviation detection
Is this page trying to spam the search engine?
ACFOCS 2004
Chakrabarti
8
Laws – degree distributions
Q: avg degree is ~10 - what is the most
probable degree?
count
??
10
ACFOCS 2004
degree
Chakrabarti
9
Laws – degree distributions
Q: avg degree is ~10 - what is the most
probable degree?
count
??
10
ACFOCS 2004
count
degree
Chakrabarti
10
degree
10
Power-law: outdegree O
Frequency
Exponent = slope
O = -2.15
-2.15
Nov’97
Outdegree
The plot is linear in log-log scale [FFF’99]
freq = degree (-2.15)
ACFOCS 2004
Chakrabarti
11
Power-law: rank R
outdegree
Exponent = slope
R = -0.74
R
Dec’98
Rank: nodes in decreasing outdegree order
The plot is a line in log-log scale
ACFOCS 2004
Chakrabarti
12
Eigenvalues
Let A be the adjacency matrix of graph
The eigenvalue is:
A v = v,
where v some vector
Eigenvalues are strongly related to graph
topology
A
B
C
A
B
A
B
C
D
1
1
1
1
C
1
D
1
D
ACFOCS 2004
Chakrabarti
13
Power-law: eigenvalues of E
Eigenvalues in decreasing order
Eigenvalue
Exponent = slope
E = -0.48
Dec’98
Rank of decreasing eigenvalue
ACFOCS 2004
Chakrabarti
14
The Node Neighborhood
N(h) = # of pairs of nodes within h hops
Let average degree = 3
How many neighbors should I expect within
1,2,… h hops?
Potential answer:
ACFOCS 2004
1 hop -> 3 neighbors
2 hops -> 3 * 3
…
h hops -> 3h
Chakrabarti
15
The Node Neighborhood
N(h) = # of pairs of nodes within h hops
Let average degree = 3
How many neighbors should I expect within
1,2,… h hops?
Potential answer:
ACFOCS 2004
1 hop -> 3 neighbors
2 hops -> 3 * 3
…
h hops -> 3h
WE HAVE DUPLICATES!
Chakrabarti
16
The Node Neighborhood
N(h) = # of pairs of nodes within h hops
Let average degree = 3
How many neighbors should I expect within
1,2,… h hops?
Potential answer:
ACFOCS 2004
1 hop -> 3 neighbors
2 hops -> 3 * 3
…
h hops -> 3h
‘avg’ degree: meaningless!
Chakrabarti
17
Power-law: hop-plot H
# of Pairs
# of Pairs
H = 4.86
Hops
Dec 98
H = 2.83
Hops
Router level ’95
Pairs of nodes as a function of hops N(h)= hH
ACFOCS 2004
Chakrabarti
18
Observation
Q: Intuition behind ‘hop exponent’?
A: ‘intrinsic=fractal dimensionality’ of the
network
...
N(h) ~ h1
ACFOCS 2004
N(h) ~ h2
Chakrabarti
19
Any other ‘laws’?
Bow-tie, for the Web [Kumar+ ‘99]
IN, SCC, OUT, ‘tendrils’
Disconnected components
ACFOCS 2004
Chakrabarti
20
Generators
How to generate graphs from a realistic
distribution?
Difficulty: simultaneously preserving many
local and global properties seen in realistic
graphs
Erdos-Renyi: switch on each edge
independently with some probability
Problem: degree distribution not power-law
Degree-based
Process-based (“preferential attachment”)
ACFOCS 2004
Chakrabarti
21
Degree-based
Fix the degree distribution (e.g., ‘Zipf’)
Assign degrees to nodes
Add matching edges to satisfy degrees
No control over
other properties
ACFOCS 2004
Chakrabarti
22
Process-based: Preferential attachment
Start with a clique with m nodes
Add one node v at every time step
v makes m links to old nodes
Suppose old node u has degree d(u)
Let pu = d(u)/ wd(w)
v invokes a multinomial distribution defined
by the set of p’s
And links to whichever u’s show up
What is the degree distribution?
ACFOCS 2004
Chakrabarti
23
Bipartite cores
Problem with preferential attachment: does not
explain dense/complete bipartite cores
100,000’s in O(20 million)-page crawl
Need a better generation process
Log(count)
n:1
n:3
n:2
Log(m)
ACFOCS 2004
Chakrabarti
2:3 core
(m:n core)
24
Other process-based generators
“Copying model” [Kleinberg+1999]
Sample a node v to which we must add k links
W.p. b add k links to nodes picked u.a.r.
W.p. 1–b choose a random reference node r
Copy k links from r to v
Much more difficult to analyze
Reference node compression techniques!
[Fabrikant+, ‘02]: H.O.T.: connect to closest,
high connectivity neighbor
[Pennock+, ‘02]: Winner does not take all
ACFOCS 2004
Chakrabarti
25
R-MAT: Graph generator package
Recursive MATrix generator
[Chakrabarti+,’04]
Goals:
ACFOCS 2004
Power-law in- and out-degrees
Power law eigenvalues
Small diameter
Few parameters
Chakrabarti
26
Graph Patterns
Power Laws
Effective
Diameter
Count vs Indegree
Count vs Outdegree
Eigenvalue vs Rank
“Network values” vs
Rank
ACFOCS 2004
Chakrabarti
Hop-plot
Count vs edge-stress
27
R-MAT
To
a
(0.5)
b
(0.1)
c
(0.15)
d
(0.25)
2n
Subdivide the
adjacency matrix
From
choose a quadrant
with probability
(a,b,c,d)
2n
ACFOCS 2004
Chakrabarti
28
R-MAT
Recurse till we reach
a 1*1 cell
a
b
c
d
2n
a
d
c
2n
ACFOCS 2004
Chakrabarti
29
R-MAT
a
2n
by construction:
rich-get-richer for in-degree
....................... for out-degree
communities within communities
and
small diameter
a
b
c
d
d
c
2n
ACFOCS 2004
Chakrabarti
30
Experiments (Clickstream)
Count vs Indegree
Count vs Outdegree
Hop-plot
Singular value vs
Rank
Left “Network value” Right “Network value”
R-MAT matches it well
ACFOCS 2004
Chakrabarti
31
Bible: rank vs. word
frequency
Length of file transfers
[Bestavros+]
Web hit counts
[Huberman]
Click-stream data
[Montgomery+01]
Lotka’s law of
publication count
(CiteSeer data)
“a”
“the”
log(rank)
log(count)
Power laws all over
log(freq)
J. Ullman
log(#citations)
ACFOCS 2004
Chakrabarti
32
Resources
Generators
RMAT ([email protected])
BRITE www.cs.bu.edu/brite/
INET topology.eecs.umich.edu/inet
Visualization tools
Graphviz www.graphviz.org
Pajek vlado.fmf.uni-lj.si/pub/networks/pajek
Kevin Bacon web site
www.cs.virginia.edu/oracle
Erdos numbers etc.
ACFOCS 2004
Chakrabarti
33
Outline, some more detail
Part 1 (Modeling graphs)
What do real-life graphs look like?
What laws govern their formation, evolution and
properties?
What structural analyses are useful?
Part 2 (Analyzing graphs)
ACFOCS 2004
Modeling data analysis problems using graphs
Proposing parametric models
Estimating parameters
Applications from Web search and text mining
Chakrabarti
34
Centrality and prestige
ACFOCS 2004
Chakrabarti
35
How important is a node?
Degree, min-max radius, …
Pagerank
Maximum entropy network flows
HITS and stochastic variants
Stability and susceptibility to spamming
Hypergraphs and nonlinear systems
Using other hypertext properties
Applications: Ranking, crawling, clustering,
detecting obsolete pages
ACFOCS 2004
Chakrabarti
36
Motivating problem
Given a graph, find its most
interesting/central node
A node is important,
if it is connected
with important nodes
(recursive, but OK!)
ACFOCS 2004
Chakrabarti
37
Motivating problem – pageRank solution
Given a graph, find its most
interesting/central node
Proposed solution: Random walk; spot
most ‘popular’ node (-> steady state
prob. (ssp))
A node has high ssp,
if it is connected
with high ssp nodes
(recursive, but OK!)
ACFOCS 2004
Chakrabarti
38
(Simplified) PageRank algorithm
Let A be the transition matrix (=
adjacency matrix); let AT become column-normalized From
then
AT
To
2
1
4
ACFOCS 2004
5
3
1
1
1
1/2
1/2
Chakrabarti
p1
p1
p2
p2
=
1/2
p3
1/2
p4
p4
p5
p5
p3
39
(Simplified) PageRank algorithm
AT p = p
AT
p
1
2
1
3
1
1
1/2
4
ACFOCS 2004
5
1/2
Chakrabarti
=
p
p1
p1
p2
p2
=
1/2
p3
1/2
p4
p4
p5
p5
p3
40
(Simplified) PageRank algorithm
AT p = 1 * p
thus, p is the eigenvector that corresponds
to the highest eigenvalue (=1, since the matrix is
column-normalized)
ACFOCS 2004
Chakrabarti
41
(Simplified) PageRank algorithm
In short: imagine a particle randomly moving
along the edges
compute its steady-state probabilities (ssp)
Full version of algo: with occasional random
jumps – see later
ACFOCS 2004
Chakrabarti
42
Intuition
A as vector transformation
x’
A
2
1
2
1 =
x
1
3
x’
1
0
x
1
3
2
1
ACFOCS 2004
Chakrabarti
43
Intuition
By defn., eigenvectors remain parallel to
themselves (‘fixed points’)
1
v1
0.52
3.62 * 0.85 =
ACFOCS 2004
A
2
1
v1
1
3
0.52
0.85
Chakrabarti
44
Convergence
Usually, fast:
depends on ratio
2
1 : 2
ACFOCS 2004
1
Chakrabarti
45
Prestige as Pagerank [BrinP1997]
“Maxwell’s equation for the Web”
PR (v )
PR (u )
u v E OutDegree(u )
( , )
u
OutDegree(u)=3
v
PR converges only if E is aperiodic and
irreducible; make it so:
d
PR (u )
PR (v ) (1 d )
N
(u ,v )E OutDegree(u )
d is the (tuned) probability of “teleporting” to
one of N pages uniformly at random
(Possibly) unintended consequences: topic
sensitivity, stability
ACFOCS 2004
Chakrabarti
46
Prestige as network flow
yij =#surfers clicking from i to j per unit time
Hits per unit time on page j is H j
y ij
( i , j )E
Flow is conserved at j :
(i , j )E y ij ( j ,k )E y jk
The total traffic is Y
j H j i , j y ij
Normalize: p y /Y
ij
ij
Can interpret pij as a probability
Standard Pagerank corresponds to one
solution: p H (Y OutDegree(i ))
ij
i
Many other solutions possible
ACFOCS 2004
Chakrabarti
47
Maximum entropy flow [Tomlin2003]
Flow conservation modeled using feature
1 j r , (i ,r ) E
r 1,,N : f r ( x ij ) 1 i r , (r , j ) E
0
otherwise
And the constraints 0 E (f r ( x ij )) p ij f r ( x ij )
i ,j
Goal is to maximize
i , j p ij log p ij
subject to p ij 1
i ,j
Solution has form p ij exp( 0 i j )
i is the “hotness” of page i
ACFOCS 2004
Chakrabarti
48
Maxent flow results
i ranking is better
than Pagerank; Hi
ranking is worse
Two IBM
intranet
data sets
with known
top URLs
Average
rank (106)
of known
top URLs
when
sorted by
Pagerank
Depth up to which
dmoz.org URLs are
used as ground truth
Hi
i
(Smaller rank is better)
Average rank (108)
ACFOCS 2004
Chakrabarti
49
Nodes influencing neighbors
Q1: How does a virus spread across an
arbitrary network?
Q2: will it create an epidemic?
Susceptible-Infected-Susceptible (SIS)
model
Cured nodes immediately become susceptible
Infected by neighbor
Susceptible/
healthy
ACFOCS 2004
Cured
internally
Chakrabarti
Infected
&
infectious
50
The model
(virus) Birth rate b: probability than an
infected neighbor attacks
(virus) Death rate d: probability that an
Healthy
infected node heals
Prob. d
N2
Prob. β
N1
N
Infected
ACFOCS 2004
N3
Chakrabarti
51
The model
Virus ‘strength’ s= b/d
Healthy
Prob. δ
N2
Prob. β
N1
N
Infected
ACFOCS 2004
N3
Chakrabarti
52
Epidemic threshold t
of a graph, defined as the value of t, such that
if strength s = b / d < t
an epidemic can not happen
Thus,
given a graph
compute its epidemic threshold
ACFOCS 2004
Chakrabarti
53
Epidemic threshold t
What should t depend on?
avg. degree? and/or highest degree?
and/or variance of degree?
and/or third moment of degree?
ACFOCS 2004
Chakrabarti
54
Epidemic threshold
[Theorem] We have no epidemic, if
β/δ <τ = 1/ λ1,A
ACFOCS 2004
Chakrabarti
55
Epidemic threshold
[Theorem] We have no epidemic, if
epidemic threshold
recovery prob.
β/δ <τ = 1/ λ1,A
attack prob.
largest eigenvalue
of adj. matrix A
Proof: [Wang+03]
ACFOCS 2004
Chakrabarti
56
Experiments (Oregon)
Number of Infected Nodes
500
Oregon
β = 0.001
b/d > τ
(above threshold)
400
300
200
b/d = τ
(at the threshold)
100
0
0
250
500
750
Time
δ:
ACFOCS 2004
0.05
0.06
1000
b/d < τ
(below threshold)
0.07
Chakrabarti
57
HITS [Kleinberg1997]
Two kinds of prestige
Good hubs link to good authorities
Good authorities are linked to by good hubs
a (v ) (u ,v )E h (u ); h (u ) (u ,v )E a (v )
Eigensystems of EET (h) and ETE (a)
Whereas Pagerank uses the eigensystem of
where dJ (1 d )LT
1/ N
J
1/ N
1/ N
1/OutDegree(i ) (i , j ) E
and L(i , j )
0
otherwise
1/ N
Query-specific graph; drop same-site links
ACFOCS 2004
Chakrabarti
58
Kleinberg’s algorithm
Step 1: expand by one move forward and
backward
ACFOCS 2004
Chakrabarti
59
Kleinberg’s algorithm
give high score (= ‘authorities’) to nodes that
many important nodes point to
give high importance score (‘hubs’) to nodes
that point to good ‘authorities’)
hubs
ACFOCS 2004
authorities
Chakrabarti
60
Kleinberg’s algorithm
Observations
recursive definition!
each node (say, ‘i’-th node) has both an
authoritativeness score ai and a hubness
score hi
ACFOCS 2004
Chakrabarti
61
Kleinberg’s algorithm
Let A be the adjacency matrix:
the (i,j) entry is 1 if the edge from i to j exists
Let h and a be [n x 1] vectors with the
‘hubness’ and ‘authoritativiness’ scores.
Then:
ACFOCS 2004
Chakrabarti
62
Kleinberg’s algorithm
In conclusion, we want vectors h and a such
that:
h=Aa
a = AT h
That is:
a = ATA a
ACFOCS 2004
Chakrabarti
63
Kleinberg’s algorithm
a is a right- singular vector of the adjacency
matrix A (by dfn!)
== eigenvector of ATA
Starting from random a’ and iterating, we’ll
eventually converge
(Q: to which of all the eigenvectors? why?)
ACFOCS 2004
Chakrabarti
64
Dyadic interpretation [CohnC2000]
Graph includes many communities z
Query=“Jaguar” gets auto, game, animal links
Each URL is represented as two things
A document d
A citation c
Max
Pr(d ,c )
(d ,c )E
d c
d c
( , )E
( , )E
Pr(d ) Pr(c | d )
Pr(d )z Pr( z | d ) Pr(c | z )
Guess number of aspects zs and use
[Hofmann 1999] to estimate Pr(c|z)
These are the most authoritative URLs
ACFOCS 2004
Chakrabarti
65
Dyadic results for “Machine learning”
Clustering based on citations + ranking within clusters
ACFOCS 2004
Chakrabarti
66
Spamming link-based ranking
Recipe for spamming HITS
Create a hub linking to genuine authorities
Then mix in links to your customers’ sites
Highly susceptible to adversarial behavior
Recipe for spamming Pagerank
Buy a bunch of domains, cloak IP addresses
Host a site at each domain
Sprinkle a few links at random per page to other
sites you own
Takes more work than spamming HITS
ACFOCS 2004
Chakrabarti
67
Why?
How to design more stable
algorithms?
ACFOCS 2004
Chakrabarti
HITS Authority
Compute HITS authority
scores and Pagerank
Delete 30% of nodes/links at
random
Recompute and compare
ranks; repeat
Pagerank ranks more stable
than HITS authority ranks
1
2
3
4
5
6
10
8
Pagerank
Stability of link analysis [NgZJ2001]
1
2
3
4
5
6
7
8
3
1
1
5
3
3
12
6
6
52 20 23
171 119 99
135 56 40
179 159 100
316 141 170
1
2
5
3
6
4
7
8
1
2
6
5
3
4
7
8
1
2
4
5
6
3
7
8
1
2
3
4
5
8
7
6
1
2
5
4
3
6
7
9
68
Stability depends on graph and params
Auth score is eigenvector for ETE = S, say
Let 1 > 2 be the first two eigenvalues
There exists an S’ such that
S and S’ are close ||S–S’||F = O(1 –2)
But ||u1 – u’1||2 = (1)
Pagerank p is eigenvector of (U (1 ) E )T
U is a matrix full of 1/N and is the jump prob
If set C of nodes are changed in any way, the
new Pagerank vector p’ satisfies
p' p 2 2uC pu /
ACFOCS 2004
Chakrabarti
69
T
(t )
a
1 (1 ) Erow h
( t 1)
(t )
h
1 (1 ) Ecol a
( t 1)
Much more stable than HITS
Results more meaningful
near 1 will always stabilize
Here was 0.2
ACFOCS 2004
Chakrabarti
Randomized HITS
Each half-step, with
probability , teleport to a
node chosen uniformly at
random
1
4
2
3
5
6
7
8
3
1
2
4
6
5
7
8
3
1
2
4
6
5
7
8
2
1
3
4
6
5
7
8
1
2
4
3
5
6
7
8
Pagerank
Randomized HITS
1
3
2
4
5
6
7
8
1
2
3
4
6
7
5
9
1
2
3
4
7
6
5
9
1
2
3
4
5
6
7
9
2
1
3
4
5
6
7
11
70
Another random walk variation of HITS
SALSA: Stochastic HITS [Lempel+2000]
Two separate random walks
1/3
a1
From authority to authority via hub 1/3
From hub to hub via authority
1/3
Transition probability Pr(aiaj) =
1
1
h :(h ,ai ),(h ,a j )E InDegree(a ) OutDegree(h )
i
1/2
a2
1/2
If transition graph is irreducible, InDegree(a )
a
For disconnected components, depends on
relative size of bipartite cores
Avoids dominance of larger cores
ACFOCS 2004
Chakrabarti
71
SALSA sample result (“movies”)
HITS: The Tightly-Knit Community (TKC) effect
SALSA: Less TKC influence (but no reinforcement!)
ACFOCS 2004
Chakrabarti
72
Links in relational data [GibsonKR1998]
(Attribute, value) pair is a node
Each node v has weight wv
Each tuple is a hyperedge
Tuple r has weight xr
HITS-like iterations to update
weight wv
For each tuple
r (v, u1 ,, uk )
xr ( wu1 , , wuk )
Update weight
wv r xr
Combining operator can be
sum, max, product, Lp avg, etc.
ACFOCS 2004
Chakrabarti
73
Database
Theory
Distilling links in relational data
ACFOCS 2004
Author
Author
Chakrabarti
Forum
Year
74
Searching and annotating graph data
ACFOCS 2004
Chakrabarti
75
Searching graph data
Nodes in graph contain text
RandomIntelligent surfer [RichardsonD2001]
Topic-sensitive Pagerank [Haveliwala2002]
Assigning image captions using random walks
[PanYFD2004]
Rotting pages and links [BarYossefBKT2004]
Query is a set of keywords
All keywords may not match a single node
Implicit joins [Hulgeri+2001, Agrawal+2002]
Or rank aggregation [Balmin+2004] required
ACFOCS 2004
Chakrabarti
76
Intelligent Web surfer
Prq ( j ) (1 b ) Pr'q ( j ) b (i , j )E Prq (i ) Prq (i j )
Keyword
Probability
of teleporting
to node j
Relevance
of node k wrt q
Pr'q ( j )
Rq ( j )
kV
Rq (k )
Pr'q (i j )
ACFOCS 2004
Rq ( j )
( i , k )E
PR Q ( j ) qQ Pr( q ) Prq ( j )
Query=set
of words
Probability of walking
from i to j wrt q
Rq (k )
Pick out-link to walk on in
proportion to relevance of
target out-neighbor
Pick a query word per
some distribution, e.g. IDF
Chakrabarti
77
Implementing the intelligent surfer
PRQ(j) approximates a walk that picks a query
keyword using Pr(q) at every step
Precompute and store Prq(j) for each keyword q in
lexicon: space blowup = avg doc length
Query-dependent PR rated better by volunteers
ACFOCS 2004
Chakrabarti
78
Topic-sensitive Pagerank
High overhead for per-word Pagerank
Instead, compute Pageranks for some
collection of broad topics PRc(j)
Topic c has sample page set Sc
Walk as in Pagerank
Jump to a node in Sc uniformly at random
“Project” query onto set of topics
Pr(c | Q) Pr( c)qQ Pr( q | c)
Rank responses by projection-weighted
Pageranks Score(Q, j )
Pr(c | Q) PR ( j )
c
ACFOCS 2004
Chakrabarti
c
79
Topic-sensitive Pagerank results
Users prefer topic-sensitive Pagerank on most
queries to global Pagerank + keyword filter
ACFOCS 2004
Chakrabarti
80
Image captioning
Segment images
into regions
Image has
caption words
Three-layer
graph: image,
regions, caption
words
Threshold on
region similarity
to connect
regions (dotted)
ACFOCS 2004
Chakrabarti
81
Random walks with restarts
Regions
Images
Test
image
Words
Find regions in test image
Connect regions to other nodes in the region layer
using region similarity
Random walk, restarting at test image node
Pick words with largest visit probability
ACFOCS 2004
Chakrabarti
82
More random walks: “Link rot”
How stale is a page?
Last-mod unreliable
Automatic dead-link cleaners mask disuse
A page is “completely stale” if it is “dead”
Let D be the set of pages which cannot be
accessed (404 and other problems)
How stale is a page u? Start with p u
If pD declare decay value of u to be 1, else
With probability declare decay value of u = 0
W.p. 1– choose outlink v, set pv, loop
ACFOCS 2004
Chakrabarti
83
Page staleness results
Decay
404s
Decay score is correlated with, but generally larger
than the fraction of dead outlinks on a page
Removing direct dead links automatically does not
eliminate live but “rotting” pages
ACFOCS 2004
Chakrabarti
84
Graph proximity search: two paradigms
A single node as query response
Find node that matches query terms…
…or is “near” nodes matching query terms
[Goldman+ 1998]
A connected subgraph as query response
Single node may not match all keywords
No natural “page boundary” [Bhalotia+2002]
[Agrawal+2002]
ACFOCS 2004
Chakrabarti
85
Single-node response examples
Travolta, Cage
Movie
Actor, Face/Off
Travolta, Cage,
Movie
“is-a”
Gathering
Grease
“acted-in”
Gathering, Grease
Kleiser, Woo, Actor
“directed”
Face/Off
Kleiser, Movie
Face/Off
A3
Travolta
Cage
“is-a”
Actor
Travolta
Kleiser
Woo
“is-a”
Director
ACFOCS 2004
Chakrabarti
86
Basic search strategy
Node subset A activated because they
match query keyword(s)
Look for node near nodes that are activated
Goodness of response node depends
Directly on degree of activation
Inversely on distance from activated node(s)
ACFOCS 2004
Chakrabarti
87
Proximity query: screenshot
http://www.cse.iitb.ac.in/banks/
ACFOCS 2004
Chakrabarti
88
Ranking a single node response
Activated node set A
Rank node r in “response set” R based on
proximity to nodes a in A
Nodes have relevance R and A in [0,1]
Edge costs are “specified by the system”
d(a,r) = cost of shortest path from a to r
Bond between a and r
A (a ) R ( r )
b(a, r )
t
d (a, r )
Parameter t tunes relative emphasis on
distance and relevance score
Several ad-hoc choices
ACFOCS 2004
Chakrabarti
89
Scoring single response nodes
Additive
Belief
score(r ) aA b(a, r )
score(r ) 1 aA 1 b(a, r )
Goal: list a limited number of find nodes with
the largest scores
Performance issues
Assume the graph is in memory?
Precompute all-pairs shortest path (|V |3)?
Prune unpromising candidates?
ACFOCS 2004
Chakrabarti
90
Hub indexing
Decompose APSP problem using sparse
vertex cuts
|A|+|B | shortest paths to p
|A|+|B | shortest paths to q
d(p,q)
To find d(a,b) compare
d(apb) not through q
d(aqb) not through p
d(apqb)
d(aqpb)
A
B
p
a
b
q
Greatest savings when |A||B|
Heuristics to find cuts, e.g. large-degree nodes
ACFOCS 2004
Chakrabarti
91
ObjectRank [Balmin+2004]
Given a data graph with nodes having text
For each keyword precompute a keywordsensitive Pagerank [RichardsonD2001]
Score of a node for multiple keyword search
based on fuzzy AND/OR
Approximation to Pagerank of node with restarts
to nodes matching keywords
Use Fagin-merge [Fagin2002] to get best
nodes in data graph
ACFOCS 2004
Chakrabarti
92
Connected subgraph as response
Single node may not match all keywords
No natural “page boundary”
On-the-fly joins make up a “response page”
Two scenarios
Keyword search on relational data
• Keywords spread among normalized relations
Keyword search on XML-like or Web data
• Keywords spread among DOM nodes and subtrees
ACFOCS 2004
Chakrabarti
93
Keyword search on relational data
Tuple = node
Some columns have text
Foreign key constraints =
edges in schema graph
Query = set of terms
No natural notion
AuthorID
A1
of a document
A2
Cites
Citing
Cited
Author
AuthorID
AuthorName
PaperID
P1
P2
P2
Paper
PaperID
PaperName
Writes
AuthorID
PaperID
AuthorID AuthorName
A1
Chaudhuri
A2
Sudarshan
A3
Hulgeri
Normalization
A3
Join may be needed
Citing Cited PaperID PaperName
to generate results
P2
P1
P1
DBXplorer
P2
BANKS
Cycles may exist in schema
graph: ‘Cites’
ACFOCS 2004
Chakrabarti
94
DBXplorer and DISCOVER
Enumerate subsets of relations in schema graph
which, when joined, may contain rows which have
all keywords in the query
“Join trees” derived from schema graph
Output SQL query for each join tree
Generate joins, checking rows for matches
[Agrawal+2001], [Hristidis+2002]
K1,K2,K3
T1
T2
T4
T3
ACFOCS 2004
K3
T3
T5
K2
T5
T2
T2
T4
T4
T2
T3
Chakrabarti
T2
T3
T5
95
Discussion
Exploits relational
schema information to
contain search
Pushes final extraction
of joined tuples into
RDBMS
Faster than dealing
with full data graph
directly
ACFOCS 2004
Coarse-grained
ranking based on
schema tree
Does not model
proximity or (dis)
similarity of individual
tuples
No recipe for data with
less regular (e.g. XML)
or ill-defined schema
Chakrabarti
96
Motivation from Web search
“Linux modem driver
for a Thinkpad A22p”
IBM Thinkpads
•A20m
Thinkpad
•A22p
Drivers
•Windows XP
Download
•Linux
Installation tips
•Modem
•Ethernet
Hyperlink path matches
query collectively
Conjunction query
would fail
Projects where X and
P work together
Conjunction may
retrieve wrong page
General notion of
graph proximity
ACFOCS 2004
The B System
Home Page of
Professor X
Papers
•VLDB…
Students
•P
•Q
Chakrabarti
Group members
•P
•S
•X
P’s home page
I work on the B
project.
97
Data structures for search
Answer = tree with at least one leaf
containing each keyword in query
Group Steiner tree problem, NP-hard
Query term t found in source nodes St
Single-source-shortest-path SSSP iterator
Initialize with a source (near-) node
Consider edges backwards
getNext() returns next nearest node
For each iterator, each visited node v
maintains for each t a set v.Rt of nodes in St
which have reached v
ACFOCS 2004
Chakrabarti
98
Generic expanding search
Near node sets St with S = t St
For all source nodes S
create a SSSP iterator with source
While more results required
Get next iterator and its next-nearest node v
Let t be the term for the iterator’s source s
crossProduct = {s} t ’tv.Rt’
For each tuple of nodes in crossProduct
• Create an answer tree rooted at v with paths to each
source node in the tuple
Add s to v.Rt
ACFOCS 2004
Chakrabarti
99
Search example (“Vu Kleinberg”)
Quoc Vu
Jon Kleinberg
writes
writes
Organizing Web pages
by “Information Unit”
cites
writes
Authoritative sources in a
hyperlinked environment
A metric
labeling problem
cites
cites
Divyakant Agrawal
author
ACFOCS 2004
paper
writes
writes
cites
Chakrabarti
writes
Eva Tardos
100
First response
Quoc Vu
Jon Kleinberg
writes
writes
Organizing Web pages
by “Information Unit”
cites
writes
Authoritative sources in a
hyperlinked environment
A metric
labeling problem
cites
cites
Divyakant Agrawal
author
ACFOCS 2004
paper
writes
writes
cites
Chakrabarti
writes
Eva Tardos
101
Subgraph search: screenshot
http://www.cse.iitb.ac.in/banks/
ACFOCS 2004
Chakrabarti
102
Similarity, neighborhood, influence
ACFOCS 2004
Chakrabarti
103
Why are two nodes similar?
What is/are the best paths connecting two nodes
explaining why/how they are related?
Graph of co-starring, citation, telephone call, …
Graph with nodes s and t; budget of b nodes
Find “best” b nodes capturing relationship between
s and t [FaloutsosMT2004]:
Proposing a definition of goodness
How to efficiently select best connections
Negroponte
Palmisano
Esther Dyson
ACFOCS 2004
Gerstner
Chakrabarti
104
Simple proposals that do not work
Shortest path
Pizza boy p gets same attention as g
Network flow
sabt is as good as sgt
Voltage
Connect +1V at s, ground t
Both g and p will be at +0.5V
Observations
a
s
b
g
t
p
Must reward parallel paths
Must reward short paths
Must penalize/tax pizza boys
ACFOCS 2004
Chakrabarti
105
Resistive network with universal sink
Connect +1V to s
Ground t
Introduce universal sink
Grounded
Connected to every node
a
s
b
g
t
p
Universal sink is a “tax
collector”
Penalizes pizza boys
Penalizes long paths
Goodness of a path is the
electric current it carries
ACFOCS 2004
Chakrabarti
Connected
to every node
106
Resistive network algorithm
Ohm’s law: I (u, v) C (u, v)[V (u ) V (v)] u, v
Kirchhoff’s current law: v s, t :
u I (u, v) 0
Boundary conditions (without sink):
V ( s) 1, V (t ) 0
Solution:
V (v)C (u , v)
v
V (u )
, for u s, t
C (u , w)
w
Here C(u,v) is the conductance from u to v
Add grounded universal sink z with V(z)=0
Set u : C (u, z )
w z C (u, w)
Display subgraph carrying high current
ACFOCS 2004
Chakrabarti
107
Distributions coupled via graphs
Hierarchical classification
Document topics organized in a tree
Mapping between ontologies
Can Dmoz label help labeling in Yahoo?
Hypertext classification
Topic of Web page better predicted from
hyperlink neighborhood
Categorical sequences
Part-of-speech tagging, named entity tagging
Disambiguation and linkage analysis
ACFOCS 2004
Chakrabarti
108
Hierarchical classification
Obvious approaches
Flatten to leaf topics, losing hierarchy info
Level-by-level, compounding error probability
Cascaded generative model
Pr(c | d ) Pr( r | d ) Pr(c | d , r )
r
c
Pr(c|d,r) estimated as Pr(c|r)Pr(d|c)/Z(r)
Estimate of Pr(d|c) makes naïve independence
assumptions if d has high dimensionality
Pr(c|d,r) tends to 0/1 for large dimensions and
Mistake made at shallow levels become
irrevocable
ACFOCS 2004
Chakrabarti
109
Global discriminative model
Each node has an associated bit X
Propose a parametric form
exp( wc F (d , xr ))
Pr( X c 1 | d , xr )
1 exp( wc F (d , xr ))
Each training instance sets one path to 1, all other
nodes have X=0
T
d
xr=0
xr
xr=1
2T+1
wc
ACFOCS 2004
F(d,xr)
%Accuracy
SVM 1v1 Ctree
Reuters 1%
41.9
47.3
Reuters 5%
68
71
News20 1%
17.2
21.8
Chakrabarti
110
Hypertext classification
c=class, t=text,
N=neighbors
Text-only model: Pr[t|c]
Using neighbors’ text
to judge my topic:
Pr[t, t(N) | c]
Better model:
Pr[t, c(N) | c]
Non-linear relaxation
ACFOCS 2004
Chakrabarti
?
111
Generative graphical model: results
ACFOCS 2004
Chakrabarti
40
%Error
9600 patents from 12
classes marked by
USPTO
Patents have text and
cite other patents
Expand test patent to
include neighborhood
‘Forget’ fraction of
neighbors’ classes
30
20
10
0
0
50
100
%Neighborhood known
Text
Link
Text+Link
112
Discriminative graphical model
OA(X) = direct “own” attributes of node X
LD(X) = link-derived attributes of node X
Mode-link: most frequent label of neighbors(X)
Count-link: histogram of neighbor labels
Binary-link: 0/1 histogram of neighbor labels
Pr(c | wo , OA( X )) 1 exp( cwoT OA( X ) 1)
Neighborhood model params
Local model params
Pr(c | wl , LD( X )) 1 exp( cwlT LD( X ) 1)
Cˆ ( X ) arg max c Pr(c | OA( X )) Pr(c | LD( X ))
Iterate as in generative case
ACFOCS 2004
Chakrabarti
113
Discriminative model: results [Li+2003]
Binary-link and count-link outperform content-only
at 95% confidence
Better to separately estimate wl and wo
In+Out+Cocitation better than any subset for LD
ACFOCS 2004
Chakrabarti
114
Undirected Markov networks
Clique cC(G) a set of
completely connected
nodes
Clique potential c(Vc) a
function over all possible
configurations of nodes
in Vc
Decompose Pr(v) as
(1/Z)cC(G)c(Vc)
Parametric form
Label coupling
c ( v c ) exp w c f c ( v c )
Params of model
ACFOCS 2004
Instance
Local feature variable
Label variable
Feature functions
Chakrabarti
115
Conditional and relational networks
x = vector of observed features at all nodes
y = vector of labels
1
Pr( y | x)
c (xc , y c )
Z (x) cC (G )
where Z (x)
'
(
x
,
y
c c c)
y ' cC ( G )
c (xc , y c ) exp w c Fc (xc , y c )
A set of “clique templates”
specifying links to use
Other features: “in the same
HTML section”
ACFOCS 2004
Chakrabarti
116
Special case: sequential networks
Text modeled as sequence of tokens drawn
from a large but finite vocabulary
Each token has attributes
Visible: allCaps, noCaps, hasXx, allDigits,
hasDigit, isAbbrev, (part-of-speech, wnSense)
Not visible: part-of-speech, (isPersonName,
isOrgName, isLocation, isDateTime),
{starts|continues|ends}-noun-phrase
Visible (symbols) and invisible (states)
attributes of nearby tokens are dependent
Application decides what is (not) visible
Goal: Estimate invisible attributes
ACFOCS 2004
Chakrabarti
117
Hidden Markov model
A generative sequential model for the joint
distribution of states (s) and symbols (o)
S
S
S
s s0 , s1 ,...sn
o o1 , o2 ,...on
|o|
Pr( s , o ) Pr( st 1 st ) Pr( st ot ) O
O
O
t-1
t-1
t
t
t+1
t+1
...
...
t 1
ACFOCS 2004
Chakrabarti
118
Using redundant token features
Each o is usually a vector of features
extracted from a token
Might have high dependence/redundancy:
hasCap, hasDigit, isNoun, isPreposition
Parametric model for Pr(stot) needs to
make naïve assumptions to be practical
Overall joint model Pr(s,o) can be very
inaccurate
(Same argument as in naïve Bayes vs. SVM
or maximum entropy text classifiers)
ACFOCS 2004
Chakrabarti
119
Discriminative graphical model
Assume one-stage Markov dependence
Propose direct parametric form for
conditional probability of state sequence
given symbol sequence
St-1
|o|
Pr( s | o )
1
Pr( st | st 1 ) Pr(ot | st )
Pr(o) t 1
O
Model
|o|
1
s ( st , st 1 )o ( st , st 1 , o, t )
Z (o ) t 1
o (t ) exp k k f k (st 1 , st , o, t )
t-1
Ot
St+1
...
Ot+1
...
Log-linear form
Feature function; might
depend on whole o
Parameters to fit
ACFOCS 2004
St
Chakrabarti
120
Feature functions and parameters
|o |
1
2k
L log exp k f k ( st , st 1 , o, t ) 2
s , o D
k
k 2 Penalize
Z (o ) t 1
large params
Maximize total
conditional
likelihood over
all instances
Find L/k for each k and
perform a gradient-based
numerical optimization
Efficient for linear state
dependence structure
ACFOCS 2004
Chakrabarti
121
Conditional vs. joint: results
Penn Treebank: 45 tags, 1M words training data
DT
NN
NN ,
NN
, VBZ RB
JJ
IN
The asbestos fiber , crocidolite, is unusually resilient once
PRP VBZ DT NNS , IN RB JJ
NNS TO PRP VBG
it enters the lungs , with even brief exposures to it causing
Algorithm/Features
HMM with words
CRF with words
CRF with words and orthography
%Error %OOVE*
5.69
45.99
5.55
48.05
4.27
23.76
Orthography: Use words, plus overlapping features:
isCap, startsWithDigit, hasHyphen, endsWith… -ing, ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies
ACFOCS 2004
Chakrabarti
Out-of-vocabulary error
NNS WDT VBP RP NNS JJ ,
NNS
VBD .
symptoms that show up decades later , researchers said .
122
Summary
Graphs provide a powerful way to model
many kinds of data, at multiple levels
Web pages, XML, relational data, images…
Words, senses, phrases, parse trees…
A few broad paradigms for analysis
Factors affecting graph evolution over time
Eigen analysis, conductance, random walks
Coupled distributions between node attributes
and graph neighborhood
Several new classes of model estimation
and inferencing algorithms
ACFOCS 2004
Chakrabarti
123
References
[BrinP1998] The Anatomy of a Large-Scale
Hypertextual Web Search Engine, WWW.
[GoldmanSVG1998] Proximity search in
databases. VLDB, 26—37.
[ChakrabartiDI1998] Enhanced hypertext
categorization using hyperlinks. SIGMOD.
[BikelSW1999] An Algorithm that Learns What’s in
a Name. Machine Learning Journal.
[GibsonKR1999] Clustering categorical data: An
approach based on dynamical systems. VLDB.
[Kleinberg1999] Authoritative sources in a
hyperlinked environment. JACM 46.
ACFOCS 2004
Chakrabarti
124
References
[CohnC2000] Probabilistically Identifying
Authoritative Documents, ICML.
[LempelM2000] The stochastic approach for linkstructure analysis (SALSA) and the TKC effect.
Computer Networks 33 (1-6): 387-401
[RichardsonD2001] The Intelligent Surfer:
Probabilistic Combination of Link and Content
Information in PageRank. NIPS 14 (1441-1448).
[LaffertyMP2001] Conditional Random Fields:
Probabilistic Models for Segmenting and Labeling
Sequence Data. ICML.
[BorkarDS2001] Automatic text segmentation for
extracting structured records. SIGMOD.
ACFOCS 2004
Chakrabarti
125
References
[NgZJ2001] Stable algorithms for link analysis.
SIGIR.
[Hulgeri+2001] Keyword Search in Databases.
IEEE Data Engineering Bulletin 24(3): 22-32.
[Hristidis+2002] DISCOVER: Keyword Search in
Relational Databases. VLDB.
[Agrawal+2002] DBXplorer: A system for keywordbased search over relational databases. ICDE.
[TaskarAK2002] Discriminative probabilistic
models for relational data.
[Fagin2002] Combining fuzzy information: an
overview. SIGMOD Record 31(2), 109–118.
ACFOCS 2004
Chakrabarti
126
References
[Chakrabarti2002] Mining the Web: Discovering
Knowledge from Hypertext Data
[Tomlin2003] A New Paradigm for Ranking Pages
on the World Wide Web. WWW.
[Haveliwala2003] Topic-Sensitive Pagerank: A
Context-Sensitive Ranking Algorithm for Web
Search. IEEE TKDE.
[LuG2003] Link-based Classification. ICML.
[FaloutsosMT2004] Connection Subgraphs in
Social Networks. SIAM-DM workshop.
[PanYFD2004] GCap: Graph-based Automatic
Image Captioning. MDDE/CVPR.
ACFOCS 2004
Chakrabarti
127
References
[Balmin+2004] Authority-Based Keyword Queries
in Databases using ObjectRank. VLDB.
[BarYossefBKT2004] Sic transit gloria telae:
Towards an understanding of the Web’s decay.
WWW2004.
ACFOCS 2004
Chakrabarti
128