A Graph-Theoretic Approach
to Webpage Segmentation
Deepayan Chakrabarti
Ravi Kumar
Kunal Punera
Motivation and Related Work
[Figure: example webpage with its sections labeled: header, navigation bar, primary content, ad, related links, copyright]
Divide a webpage into visually and semantically cohesive sections
Motivation and Related Work
Sectioning can be useful in:
Webpage classification
Displaying webpages on mobile phones and
small-screen devices
Webpage ranking
Duplicate detection
…
Motivation and Related Work
A lot of recent interest
Informative Structure Mining [Cai+/2003,
Kao+/2005]
Displaying webpages on small screens
[Chen+/2005, Baluja/2006]
Template detection: [Bar-Yossef+/2002]
Topic distillation: [Chakrabarti+/2001]
Each is based solely on visual, content, or DOM-based cues
Mostly heuristic approaches
Motivation and Related Work
Our contributions
Combine visual, DOM, and content based cues
Propose a formal graph-based combinatorial
optimization approach
Develop two instantiations, both with:
Approximation guarantees
Automatic determination of the number of sections
Develop methods for automatic learning of graph
weights
Outline
Motivation and Related Work
Proposed Work
Experiments
Conclusions
Proposed Work
A graph-based approach
Construct a neighborhood graph of
DOM tree nodes
Neighbors close according to:
DOM tree distance,
or, visual distance when
rendered on the screen,
or, similar content types
Partition the neighborhood graph to
optimize a cost function
[Figure: a DOM tree over nodes A through E and the corresponding neighborhood graph]
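As a rough illustration of the construction above, the sketch below builds a neighborhood graph from a toy DOM tree using DOM tree distance only. The `parent` map, the `max_dist` cutoff, and the function names are assumptions made for this example; the slides also allow visual distance and content-type similarity as neighbor criteria, which are not modeled here.

```python
from itertools import combinations

def path_to_root(node, parent):
    """Return the list of nodes from `node` up to the root."""
    path = [node]
    while parent.get(node) is not None:
        node = parent[node]
        path.append(node)
    return path

def tree_distance(u, v, parent):
    """Number of tree edges on the path between u and v."""
    depth_from_u = {n: i for i, n in enumerate(path_to_root(u, parent))}
    for j, n in enumerate(path_to_root(v, parent)):
        if n in depth_from_u:              # first common ancestor
            return depth_from_u[n] + j
    raise ValueError("nodes are not in the same tree")

def neighborhood_graph(parent, max_dist=1):
    """Connect every pair of DOM nodes whose tree distance is <= max_dist."""
    nodes = sorted(parent)
    return {(u, v) for u, v in combinations(nodes, 2)
            if tree_distance(u, v, parent) <= max_dist}

# Hypothetical toy DOM tree: A is the root, B and E are its children,
# C and D are children of B.
parent = {"A": None, "B": "A", "E": "A", "C": "B", "D": "B"}
edges = neighborhood_graph(parent, max_dist=1)
```

Raising `max_dist` makes the graph denser, so siblings such as C and D also become neighbors.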
Proposed Work
A graph-based approach
What is a good cost function?
Intuitive
Has polynomial-time algorithms that can get provably close to the optimal
Correlation Clustering
Energy-minimizing Graph Cuts
How should we set weights in the neighborhood graph?
[Figure: a DOM tree over nodes A through E and the corresponding neighborhood graph]
Correlation Clustering
Assign each DOM node p to a section S(p)
Penalty for having DOM
nodes p and q in
different sections
Vpq are edge weights in the
neighborhood graph
[Figure: neighborhood graph on nodes A through E with edge weights VAB, VAE, VBC]
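The objective itself did not survive in this transcript. One common way to write a correlation clustering cost, assumed here rather than taken from the slides, charges Vpq whenever p and q land in different sections and 1 - Vpq whenever they share one, with Vpq in [0, 1]. The weights below are hypothetical.

```python
def cclus_cost(section, weights):
    """Total cost of a sectioning under the assumed objective.

    section: dict mapping node -> section id, i.e. S(p)
    weights: dict mapping (p, q) -> V_pq in [0, 1], the penalty for
             separating p and q (assumption: 1 - V_pq is then the
             penalty for keeping them together)
    """
    total = 0.0
    for (p, q), v_pq in weights.items():
        if section[p] != section[q]:
            total += v_pq          # paid for splitting a similar pair
        else:
            total += 1.0 - v_pq    # paid for merging a dissimilar pair
    return total

# Hypothetical weights on the edges that appear in the slide's figure.
weights = {("A", "B"): 0.9, ("A", "E"): 0.2, ("B", "C"): 0.8}
together = {"A": 1, "B": 1, "C": 1, "D": 1, "E": 1}
split_e = {"A": 1, "B": 1, "C": 1, "D": 1, "E": 2}
```

Under these made-up weights, moving E into its own section lowers the cost because A and E are only weakly connected.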
Correlation Clustering
Rendering Constraint:
Each pixel on the screen must belong to at most one section
Parent section = child section
Either SA=SB=SC, or SA≠SB and SA≠SC
Constraint only applies to DOM nodes “aimed” at visual rendering
[Figure: DOM tree with parent A (SA=?) and children B and C]
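The rule for a parent A with children B and C can be checked mechanically. The sketch below is illustrative only: the function name is made up, and it assumes the set of nodes “aimed” at visual rendering is already known.

```python
def satisfies_rendering_constraint(children, section, rendering_nodes):
    """Check the parent/child rule for every node in `rendering_nodes`.

    children: dict mapping parent -> list of child nodes
    section: dict mapping node -> section id
    rendering_nodes: parents "aimed" at visual rendering (assumed given)
    """
    for parent in rendering_nodes:
        same = [section[c] == section[parent]
                for c in children.get(parent, [])]
        # Allowed: the parent agrees with all children, or with none of them.
        if any(same) and not all(same):
            return False
    return True

children = {"A": ["B", "C"]}
ok_same = {"A": 5, "B": 5, "C": 5}       # SA = SB = SC
ok_apart = {"A": 1, "B": 5, "C": 7}      # SA != SB and SA != SC
violating = {"A": 5, "B": 5, "C": 7}     # A shares a section with B only
```

In the violating case, the pixels of C would belong both to C's section and to the section A shares with B.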
Correlation Clustering
Rendering Constraint:
Each pixel on the screen must belong to at most one section
Either SA=SB=SC, or SA≠SB and SA≠SC
Not enforced by CCLUS
Workaround: Use only leaf nodes in the neighborhood graph
But content cues may be too noisy at the leaf level
[Figure: DOM tree with parent A (SA=?) and children B and C]
Correlation Clustering
Algorithm: [Ailon+/2005]
Pick a random leaf node p
Create a new section containing p and all nodes q that are strongly connected to p
Remove p and all such q from the neighborhood graph
Repeat until no nodes remain
Within a factor of 2 of the optimal
Number of sections picked automatically
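A minimal sketch of this pivot scheme follows. The exact test for “strongly connected” is not preserved in the transcript, so the `threshold` on Vpq is an assumption, as are the function name and the toy weights.

```python
import random

def pivot_sections(nodes, weights, threshold=0.5, seed=0):
    """Randomized pivot clustering over the neighborhood graph.

    weights: dict mapping frozenset({p, q}) -> V_pq; a missing pair counts as 0.
    threshold: assumed cutoff for "strongly connected" (not from the slides).
    """
    rng = random.Random(seed)
    remaining = set(nodes)
    sections = []
    while remaining:
        p = rng.choice(sorted(remaining))        # pick a random pivot node
        members = {p} | {
            q for q in remaining
            if q != p and weights.get(frozenset((p, q)), 0.0) >= threshold
        }
        sections.append(members)                 # new section around p
        remaining -= members                     # remove p and its q's
    return sections

# Hypothetical leaf nodes forming two tightly connected groups.
strong = [("A", "B"), ("A", "C"), ("B", "C"), ("D", "E")]
weights = {frozenset(pair): 0.9 for pair in strong}
sections = pivot_sections(["A", "B", "C", "D", "E"], weights)
```

No section count is passed in; the loop stops when every node has been assigned, which is how the number of sections falls out automatically.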
Energy-minimizing Graph Cuts
Extra: A predefined set of labels
Assign to each node p a label S(p)
The cost being minimized combines:
Dp(S(p)): distance of node p to its label
Vpq(S(p), S(q)): distance between pairs of nodes
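The two kinds of terms can be combined into a single energy, sketched below as the sum of node-to-label costs and pairwise costs. The toy costs and the brute-force search are assumptions for illustration only; the approach in the slides relies on min-cuts rather than enumeration.

```python
from itertools import product

def energy(labeling, D, V, edges):
    """Energy of a labeling: node-to-label costs plus pairwise costs.

    labeling: dict node -> label, i.e. S(p)
    D: function (p, label) -> cost of giving node p that label
    V: function (p, q, label_p, label_q) -> pairwise cost
    edges: iterable of (p, q) pairs in the neighborhood graph
    """
    node_cost = sum(D(p, lab) for p, lab in labeling.items())
    edge_cost = sum(V(p, q, labeling[p], labeling[q]) for p, q in edges)
    return node_cost + edge_cost

# Hypothetical costs for three nodes and two labels.
D_table = {("p", "x"): 0.0, ("p", "y"): 3.0,
           ("q", "x"): 1.0, ("q", "y"): 2.0,
           ("r", "x"): 4.0, ("r", "y"): 0.0}

def D(p, label):
    return D_table[(p, label)]

def V(p, q, label_p, label_q):
    return 0.0 if label_p == label_q else 1.5   # separating costs more than merging

nodes, labels = ["p", "q", "r"], ["x", "y"]
edges = [("p", "q"), ("q", "r")]

# Exhaustive search is only feasible for toy inputs like this one.
best = min((dict(zip(nodes, combo)) for combo in product(labels, repeat=len(nodes))),
           key=lambda lab: energy(lab, D, V, edges))
```

Here q follows p into label x because the pairwise cost outweighs q's mild preference, while r's strong preference for y justifies paying for one cut edge.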
Energy-minimizing Graph Cuts
Difference from CCLUS:
Node weights Dp in addition to edge weights Vpq
Dp and Vpq can depend on the labels (not just “same” or “different”)
[Figure: neighborhood graph on nodes A through E with node weights DA, DB, DE and edge weights VAB, VAE, VBC]
Energy-minimizing Graph Cuts
How can we fit the Rendering Constraint?
Have a special “invisible” label ξ
Parent is invisible, unless all
children have the same label
Can set the Vpq values accordingly
[Figure: DOM tree with parent A and children B and C; A is assigned the invisible label ξ]
Automatically infer “rendering”
versus “structural” DOM nodes
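The parent rule can be illustrated with a small bottom-up pass. This is only a sketch of the rule itself: in the slides the rule is encoded through the Vpq values and enforced during optimization, and the helper names below are made up.

```python
INVISIBLE = "ξ"  # the special "invisible" label from the slides

def parent_label(child_labels):
    """Label forced on a parent by its children's labels.

    The parent takes the shared label if all children agree,
    and the invisible label otherwise.
    """
    distinct = set(child_labels)
    return distinct.pop() if len(distinct) == 1 else INVISIBLE

def label_internal_nodes(children, leaf_labels, root):
    """Propagate labels bottom-up through a DOM tree (illustrative only).

    Assumes every leaf already has an entry in `leaf_labels`.
    """
    labels = dict(leaf_labels)

    def visit(node):
        if node not in labels:
            labels[node] = parent_label([visit(c) for c in children[node]])
        return labels[node]

    visit(root)
    return labels

children = {"A": ["B", "C"]}
agree = label_internal_nodes(children, {"B": 5, "C": 5}, "A")
differ = label_internal_nodes(children, {"B": 5, "C": 7}, "A")
```

A parent that ends up invisible behaves like a purely structural node, while a parent that inherits its children's label is rendered as part of their section.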
Energy-minimizing Graph Cuts
Why couldn’t we use this trick in CCLUS as
well?
CCLUS only asks: Are nodes p and q in the same
section or not?
It cannot handle “special” sections like the
invisible section
Hence, labels are giving us extra power
Energy-minimizing Graph Cuts
Advantages
Can use all DOM nodes while still obeying the Rendering Constraint, which makes it better than CCLUS
Factor of 2 approximation of the optimal, by
performing iterative min-cuts of specially
constructed graphs
We extend [Kolmogorov+/2004]
Number of sections is picked automatically
Energy-minimizing Graph Cuts
Theorem: Vpq must obey the constraint
Separation cost ≥ Merge cost
Set Vpq(different) >> Vpq(same) for nodes that are
extremely close
Cost minimization tries to place them in the
same section
However, we cannot use Vpq to push two nodes to
be in different sections
Use Dp instead
Energy-minimizing Graph Cuts
Dp(α): distance of node p to label α
To separate nodes p and q:
Ensure that either Dp(α) or Dq(α) is large, for every label α
Then assigning both p and q to the same label will be too costly
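A small numeric check makes the argument concrete. The label names, distances, and the separation cost below are all hypothetical; the point is only that when every label is far from at least one of the two nodes, any shared label is expensive.

```python
def cheapest_shared_cost(D_p, D_q):
    """Lowest node cost of giving p and q one common label."""
    return min(D_p[a] + D_q[a] for a in D_p)

def cheapest_split_cost(D_p, D_q, v_different):
    """Lowest cost when p and q take two different labels."""
    return v_different + min(D_p[a] + D_q[b]
                             for a in D_p for b in D_q if a != b)

# Hypothetical distances: for every label, either D_p or D_q is large.
D_p = {"a1": 0.1, "a2": 5.0, "a3": 6.0}
D_q = {"a1": 7.0, "a2": 0.2, "a3": 5.0}

shared = cheapest_shared_cost(D_p, D_q)
split = cheapest_split_cost(D_p, D_q, v_different=1.0)
```

Even after paying the pairwise separation cost, the split labeling is far cheaper, so the minimization places p and q in different sections.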
Energy-minimizing Graph Cuts
Dp term: ensures that nodes with very different content or visual features are split up
Vpq term: ensures that nodes with very similar content or visual features are merged
Invisible label lets us use the parent-child DOM tree structure
Learning graph weights
Extract content and visual features from training data
Learning Vpq(.)
Learn a logistic regression classifier (prob. that p and q belong to the same section)
[Figure: neighborhood graph on nodes A through E with node weights DA, DB, DE and edge weights VAB, VAE, VBC]
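A minimal sketch of such a classifier is shown below, using plain batch gradient descent so that it needs no external library. The two pair features (imagined here as normalized DOM distance and visual distance), the training pairs, and the hyperparameters are assumptions; how the resulting probability is turned into Vpq is not specified in the slides and is left out.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights (bias stored last) by batch gradient descent on the log loss."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + w[-1]
            err = sigmoid(z) - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj
            grad[-1] += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad)]
    return w

def prob_same_section(w, x):
    """Estimated probability that a node pair with features x shares a section."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + w[-1])

# Hypothetical training pairs: small distances -> same section (label 1).
X = [[0.1, 0.1], [0.2, 0.1], [0.1, 0.3],
     [0.9, 0.8], [0.8, 0.9], [0.7, 0.9]]
y = [1, 1, 1, 0, 0, 0]
w = train_logistic(X, y)
```

After training, `prob_same_section(w, x)` can be evaluated for any candidate pair of DOM nodes on a new page.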
Learning Dp(.)
Training data does not provide labels
Set of labels = Set of DOM tree nodes in that webpage
Dp(α) = distance in some feature space
Learn a Mahalanobis distance metric between nodes (distances within section < distances across sections)
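For reference, a Mahalanobis distance is a Euclidean distance reshaped by a matrix M. The sketch below only evaluates the distance for a given M; learning M from training data is not shown, and the matrix and feature vectors are hypothetical.

```python
import math

def mahalanobis(x, y, M):
    """sqrt((x - y)^T M (x - y)) for a symmetric positive semidefinite M."""
    d = [a - b for a, b in zip(x, y)]
    Md = [sum(M[i][j] * d[j] for j in range(len(d))) for i in range(len(d))]
    return math.sqrt(sum(di * mi for di, mi in zip(d, Md)))

# Hypothetical metric that weights the first feature more heavily.
M = [[4.0, 0.0],
     [0.0, 1.0]]
identity = [[1.0, 0.0],
            [0.0, 1.0]]

d_first = mahalanobis([1.0, 0.0], [0.0, 0.0], M)    # differs in feature 1
d_second = mahalanobis([0.0, 1.0], [0.0, 0.0], M)   # differs in feature 2
```

With the identity matrix the function reduces to the ordinary Euclidean distance; a learned M stretches the directions that best separate sections. Under this reading, Dp(α) would be `mahalanobis(features_p, features_alpha, M)`.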
Experiments
Manually sectioned 105 randomly chosen
webpages to get 1088 sections
Two measures were used:
Adjusted RAND: fraction of leaf node pairs which
are correctly predicted to be together or apart
(over and above random sectioning)
Normalized Mutual Information
Both are at most 1, with higher values indicating better results (Adjusted RAND is 0 in expectation for a random sectioning and can dip below 0).
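The Adjusted RAND measure can be computed from pair counts alone. The sketch below uses the standard pair-counting formula; the dictionary-based input format and the toy sectionings are assumptions for illustration.

```python
from collections import Counter
from math import comb

def adjusted_rand(truth, predicted):
    """Adjusted Rand index between two sectionings of the same leaf nodes.

    truth, predicted: dicts mapping leaf node -> section id.
    """
    nodes = list(truth)
    n = len(nodes)
    if n < 2:
        return 1.0                 # no pairs to disagree on
    pairs_both = sum(comb(c, 2) for c in
                     Counter((truth[v], predicted[v]) for v in nodes).values())
    pairs_truth = sum(comb(c, 2) for c in
                      Counter(truth[v] for v in nodes).values())
    pairs_pred = sum(comb(c, 2) for c in
                     Counter(predicted[v] for v in nodes).values())
    expected = pairs_truth * pairs_pred / comb(n, 2)   # agreement expected by chance
    max_index = (pairs_truth + pairs_pred) / 2
    if max_index == expected:      # degenerate case, e.g. one section on both sides
        return 1.0
    return (pairs_both - expected) / (max_index - expected)

truth = {"a": 1, "b": 1, "c": 2, "d": 2}
same_up_to_names = {"a": "x", "b": "x", "c": "y", "d": "y"}
lumped = {"a": 1, "b": 1, "c": 1, "d": 2}
```

Section ids only need to be consistent within each sectioning, so renaming sections does not change the score.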
Experiments
[Figure: cumulative distribution of Adjusted RAND scores; x-axis: Adjusted RAND, y-axis: % webpages < score]
GCUTS: Almost 50% of the webpages score better than 0.6
CCLUS: Only 20% of the webpages score better than 0.6
Experiments
Over all webpages
GCUTS is better than CCLUS
Experiments
Application to duplicate detection on the Web
Collected lyrics of the same songs from 3 different
sites (~2300 webpages)
Nearly similar content
Different template structures
Our approach:
Section all webpages
Perform duplicate detection using only the largest
section (primary content)
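The pipeline above can be sketched as follows. The slides do not say which duplicate detector was used, so word shingling with Jaccard similarity is swapped in here as an assumption, as are the `threshold`, the measure of "largest" (text length), and the toy pages.

```python
def largest_section(sections):
    """sections: dict mapping section id -> text; largest by text length (assumed)."""
    return max(sections.values(), key=len)

def shingles(text, k=3):
    """Set of k-word shingles of a text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 1.0

def is_duplicate(page_a, page_b, threshold=0.8):
    """Compare two sectioned pages using only their largest sections."""
    sa = shingles(largest_section(page_a))
    sb = shingles(largest_section(page_b))
    return jaccard(sa, sb) >= threshold

lyrics = ("twinkle twinkle little star how I wonder what you are "
          "up above the world so high")
other = ("row row row your boat gently down the stream "
         "merrily merrily life is but a dream")

# Hypothetical pages from different sites: same lyrics, different templates.
page_a = {"nav": "home artists albums", "main": lyrics, "footer": "copyright site one"}
page_b = {"header": "lyrics portal", "main": lyrics, "ads": "buy ringtones now"}
page_c = {"nav": "home artists albums", "main": other, "footer": "copyright site one"}
```

Because only the primary content is compared, shared or differing template text such as navigation bars and footers does not affect the decision.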
Experiments
Sectioning > No sectioning
GCUTS > CCLUS
Conclusions
Combined visual, DOM, and content based
cues
Optimization on a neighborhood graph
Node and edge weights are learnt from training
data
Developed CCLUS and GCUTS, both with:
Approximation guarantees
Automatic determination of the number of sections
Energy-minimizing Graph Cuts
What is such a Dp(.) function?
Use the set of internal DOM nodes as the set of
labels
Dp(α) measures the difference in feature vectors
between node p and internal node (label) α
If nodes p and q are very different, Dp(α) and
Dq(α) will differ for all α
Correlation Clustering
Does not enforce the Rendering Constraint:
Each pixel on the screen must belong to at most one section, so parent nodes should have the same section as their children
Workaround: Consider only leaf nodes in the
neighborhood graph
But content cues may be too noisy at the leaf level
Correlation Clustering
Does not enforce the Rendering Constraint
Each pixel on the screen must belong to at most one section
Parent section = child section
Either SA=SB=SC, or SA≠SB and SA≠SC
Apply rule only for ancestors “aimed” at visual rendering
[Figure: DOM tree with parent A (SA=?) and children B and C]
Correlation Clustering
Does not enforce the Rendering Constraint
Either SA=SB=SC, or SA≠SB and SA≠SC
Workaround: Consider only leaf nodes in the neighborhood graph
But content cues may be too noisy at the leaf level
[Figure: DOM tree with parent A (SA=?) and children B (SB=5) and C (SC=7)]
Energy-minimizing Graph Cuts
How can we fit the Rendering Constraint?
Have a special “invisible” label ξ
Parent is invisible, unless all children have the same label
Can set the Vpq values accordingly
Automatically infer “rendering” versus “structural” DOM nodes
[Figure: DOM tree with parent A and children B and C; with SB=5 and SC=5, SA=5; with SB=5 and SC=7, SA=ξ]
Energy-minimizing Graph Cuts
What is the set of labels?
The set of internal DOM nodes
Available at the beginning of the algorithm
The labels are themselves nodes, with feature vectors
Dp(α) = distance in some feature space
“Tuned” to the current webpage