Transcript (PPT)

A Graph-Theoretic Approach to Webpage Segmentation
Deepayan Chakrabarti ([email protected])
Ravi Kumar ([email protected])
Kunal Punera ([email protected])

Motivation and Related Work
- Divide a webpage into visually and semantically cohesive sections
[Figure: an example webpage with its sections labeled: header, navigation bar, primary content, ad, related links, copyright]

Motivation and Related Work
- Sectioning can be useful in:
  - Webpage classification
  - Displaying webpages on mobile phones and small-screen devices
  - Webpage ranking
  - Duplicate detection
  - …

Motivation and Related Work
- A lot of recent interest:
  - Informative structure mining [Cai+/2003, Kao+/2005]
  - Displaying webpages on small screens [Chen+/2005, Baluja/2006]
  - Template detection [Bar-Yossef+/2002]
  - Topic distillation [Chakrabarti+/2001]
- Based solely on visual, content, or DOM-based cues
- Mostly heuristic approaches

Motivation and Related Work
- Our contributions:
  - Combine visual, DOM, and content-based cues
  - Propose a formal graph-based combinatorial optimization approach
  - Develop two instantiations, both with:
    - Approximation guarantees
    - Automatic determination of the number of sections
  - Develop methods for automatic learning of graph weights

Outline
- Motivation and Related Work
- Proposed Work
- Experiments
- Conclusions

Proposed Work
- A graph-based approach:
  - Construct a neighborhood graph over DOM tree nodes
  - Neighbors are nodes that are close according to:
    - DOM tree distance,
    - or visual distance when rendered on the screen,
    - or similar content types
  - Partition the neighborhood graph to optimize a cost function (a sketch of the construction follows below)
[Figure: a DOM tree with nodes A-E and the corresponding neighborhood graph]

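A minimal sketch of this neighborhood-graph construction, assuming hypothetical helpers for DOM-tree distance, rendered (visual) distance, and content type; the helper names and thresholds are illustrative, not the paper's definitions:

```python
def build_neighborhood_graph(nodes, dom_distance, visual_distance, content_type,
                             dom_thresh=2, vis_thresh=50):
    """Connect DOM nodes that are 'close' by any of the three cues.

    dom_distance(p, q)    -> hops between p and q in the DOM tree (assumed helper)
    visual_distance(p, q) -> pixel distance between rendered boxes (assumed helper)
    content_type(p)       -> coarse type such as 'text', 'image', 'link' (assumed helper)
    Thresholds are illustrative, not taken from the paper.
    """
    edges = set()
    for i, p in enumerate(nodes):
        for q in nodes[i + 1:]:
            close = (dom_distance(p, q) <= dom_thresh
                     or visual_distance(p, q) <= vis_thresh
                     or content_type(p) == content_type(q))
            if close:
                edges.add((p, q))   # undirected edge; weights are learned later
    return edges
```
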
Proposed Work
- A graph-based approach:
  - What is a good cost function? It should be:
    - Intuitive
    - Backed by polynomial-time algorithms that can get provably close to the optimal
  - Two candidates: Correlation Clustering and Energy-minimizing Graph Cuts
  - How should we set weights in the neighborhood graph?
[Figure: a DOM tree with nodes A-E and the corresponding neighborhood graph]

Correlation Clustering
- Assign each DOM node p to a section S(p)
- Penalty for having DOM nodes p and q in different sections
- Vpq are edge weights in the neighborhood graph
[Figure: neighborhood graph over nodes A-E with edge weights VAB, VAE, VBC]

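The callout above annotates the edge-penalty term; a plausible reconstruction of the weighted correlation-clustering cost, written in the standard form consistent with these slides (the paper's exact formula may differ), is:

```latex
% Pay V_{pq} when neighbors p and q land in different sections,
% and 1 - V_{pq} when they land in the same one.
\[
  \mathrm{Cost}(S) \;=\; \sum_{(p,q)} V_{pq}\,\mathbf{1}\{S(p) \neq S(q)\}
      \;+\; \sum_{(p,q)} \bigl(1 - V_{pq}\bigr)\,\mathbf{1}\{S(p) = S(q)\}
\]
```
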
Correlation Clustering
- Rendering Constraint:
  - Each pixel on the screen must belong to at most one section
    - ⇒ Parent section = child section
    - Either SA = SB = SC, or SA ≠ SB and SA ≠ SC
  - The constraint only applies to DOM nodes "aimed" at visual rendering
[Figure: DOM tree with parent A and children B, C; SA = ?]

Correlation Clustering
- Rendering Constraint:
  - Each pixel on the screen must belong to at most one section
  - Either SA = SB = SC, or SA ≠ SB and SA ≠ SC
  - Not enforced by CCLUS
- Workaround: use only leaf nodes in the neighborhood graph
  - But content cues may be too noisy at the leaf level
[Figure: DOM tree with parent A and children B, C; SA = ?]

Correlation Clustering
- Algorithm [Ailon+/2005]:
  - Pick a random leaf node p
  - Create a new section containing p and all nodes q that are strongly connected to p
  - Remove p and the q's from the neighborhood graph
  - Iterate
- Within a factor of 2 of the optimal
- Number of sections is picked automatically

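A minimal sketch of this pivot-style procedure, assuming similarity weights Vpq in [0, 1] and treating "strongly connected" as Vpq above a threshold; the threshold and the graph representation are illustrative assumptions, not the paper's exact definitions:

```python
import random

def pivot_sections(nodes, weight, threshold=0.5):
    """Greedy pivot clustering in the spirit of [Ailon+/2005].

    nodes     : list of leaf DOM nodes
    weight    : dict mapping frozenset({p, q}) to the similarity V_pq in [0, 1]
    threshold : V_pq above this value counts as "strongly connected" (assumed)
    """
    remaining = set(nodes)
    sections = []
    while remaining:
        p = random.choice(tuple(remaining))           # pick a random pivot leaf
        section = {p}
        for q in remaining - {p}:
            if weight.get(frozenset((p, q)), 0.0) > threshold:
                section.add(q)                        # q is strongly connected to p
        sections.append(section)
        remaining -= section                          # remove p and its q's, iterate
    return sections
```

Because sections are created until the graph is exhausted, the number of sections falls out of the run rather than being fixed in advance, matching the slide's claim.
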
Energy-minimizing Graph Cuts
- Extra: a predefined set of labels
- Assign to each node p a label S(p)
- The cost has two parts: the distance of a node to its label (Dp), and the distance between pairs of nodes (Vpq)

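The two callouts annotate the usual energy that graph cuts minimize; a plausible reconstruction using the Dp and Vpq notation from these slides (the paper's exact form may differ) is:

```latex
% Node term D_p: distance of node p to its assigned label S(p).
% Edge term V_{pq}: distance between neighboring nodes p and q
% under their assigned labels.
\[
  E(S) \;=\; \sum_{p} D_p\!\big(S(p)\big)
        \;+\; \sum_{(p,q)} V_{pq}\!\big(S(p), S(q)\big)
\]
```
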
Energy-minimizing Graph Cuts
- Difference from CCLUS:
  - Node weights Dp (distance of a node to a label) in addition to edge weights Vpq (distance between pairs of nodes)
  - Dp and Vpq can depend on the labels (not just "same" or "different")
[Figure: neighborhood graph over nodes A-E with edge weights VAB, VAE, VBC and node weights DA, DB, DE]

Energy-minimizing Graph Cuts
- How can we fit the Rendering Constraint?
  - Have a special "invisible" label ξ
  - A parent is invisible unless all of its children have the same label
  - Can set the Vpq values accordingly (a sketch follows below)
  - Automatically infer "rendering" versus "structural" DOM nodes
[Figure: DOM tree with parent A and children B, C; SA = ξ unless SB = SC]

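An illustrative way to set parent-child pairwise costs so that the invisible label carries the rendering constraint; the concrete values are hand-picked assumptions for the sketch, not the learned weights from the paper:

```python
INF = float("inf")
XI = "invisible"   # the special label ξ

def parent_child_cost(parent_label, child_label, lam=1.0):
    """Illustrative pairwise cost V_pq for a parent-child DOM edge.

    The rule from the slide: a parent may carry a visible label only if it
    agrees with its child; otherwise the parent should take the invisible
    label ξ. The specific costs below are assumed for illustration.
    """
    if parent_label == XI:
        return lam                 # small fixed cost for hiding the parent
    if parent_label == child_label:
        return 0.0                 # visible parent agreeing with the child is free
    return INF                     # visible parent disagreeing with a child is forbidden
```

With these costs, a parent whose children disagree can avoid the infinite penalty only by taking ξ, which is how the rendering constraint gets expressed through the pairwise terms.
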
Energy-minimizing Graph Cuts
- Why couldn't we use this trick in CCLUS as well?
  - CCLUS only asks: are nodes p and q in the same section or not?
  - It cannot handle "special" sections like the invisible section
  - Hence, labels give us extra power

Energy-minimizing Graph Cuts
- Advantages:
  - Can use all DOM nodes while still obeying the Rendering Constraint ⇒ better than CCLUS
  - Factor-of-2 approximation of the optimal, by performing iterative min-cuts of specially constructed graphs
    - We extend [Kolmogorov+/2004]
  - Number of sections is picked automatically

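The slides do not spell out the iterative min-cut loop; below is a generic alpha-expansion skeleton in the spirit of [Kolmogorov+/2004], a sketch of the standard technique rather than the authors' exact construction. The energy evaluation and the single-move min-cut are assumed helpers:

```python
def alpha_expansion(nodes, labels, energy, one_expansion_move):
    """Generic iterative-min-cut (alpha-expansion) skeleton.

    energy(labeling)                 -> total cost sum(D_p) + sum(V_pq)   (assumed helper)
    one_expansion_move(labeling, a)  -> best labeling reachable by letting any subset of
                                        nodes switch to label a, found with one min-cut
                                        (assumed helper)
    """
    labeling = {p: labels[0] for p in nodes}      # arbitrary initial labeling
    improved = True
    while improved:
        improved = False
        for alpha in labels:                      # one expansion move per label per sweep
            candidate = one_expansion_move(labeling, alpha)
            if energy(candidate) < energy(labeling):
                labeling = candidate
                improved = True
    return labeling
```
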
Energy-minimizing Graph Cuts
- Theorem: Vpq must obey the constraint
  - Separation cost ≥ Merge cost
- Set Vpq(different) >> Vpq(same) for nodes that are extremely close
  - ⇒ Cost minimization tries to place them in the same section

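One concrete per-edge reading of "Separation cost ≥ Merge cost" (a reading of the slide, not a quote of the paper's theorem; for any two labels it is the submodularity condition under which a move can be solved exactly by a min-cut [Kolmogorov+/2004]):

```latex
% For every edge (p,q) and every pair of labels \alpha, \beta:
% splitting the two nodes must never cost less than keeping them together.
\[
  V_{pq}(\alpha,\beta) + V_{pq}(\beta,\alpha)
  \;\ge\;
  V_{pq}(\alpha,\alpha) + V_{pq}(\beta,\beta)
\]
```
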
Energy-minimizing Graph Cuts
- Theorem: Vpq must obey the constraint
  - Separation cost ≥ Merge cost
- However, we cannot use Vpq to push two nodes into different sections
  - ⇒ Use Dp instead

Energy-minimizing Graph Cuts
- Dp: distance of a node to a label
- To separate nodes p and q:
  - Ensure that either Dp(α) or Dq(α) is large, for any label α
  - So assigning both p and q to the same label will be too costly

Energy-minimizing Graph Cuts
- The node terms (Dp) ensure that nodes with very different content or visual features are split up
- The edge terms (Vpq) ensure that nodes with very similar content or visual features are merged
- The invisible label lets us use the parent-child DOM tree structure

Learning graph weights
- Extract content and visual features from training data
- Learning Vpq(.):
  - Learn a logistic regression classifier (probability that p and q belong to the same section)
[Figure: neighborhood graph over nodes A-E with edge weights VAB, VAE, VBC and node weights DA, DB, DE]

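A minimal sketch of the Vpq learning step with scikit-learn, assuming a hypothetical pairwise feature extractor and labeled node pairs taken from the hand-sectioned training pages; the feature helper is illustrative, not from the paper:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_vpq_model(node_pairs, pair_features, same_section):
    """Fit P(p and q are in the same section) from labeled pairs.

    node_pairs   : list of (p, q) neighbor pairs from training pages
    pair_features: function (p, q) -> feature vector of DOM / visual / content
                   differences (assumed helper, not defined in the slides)
    same_section : list of 0/1 labels, 1 if p and q were annotated together
    """
    X = np.array([pair_features(p, q) for p, q in node_pairs])
    y = np.array(same_section)
    model = LogisticRegression()
    model.fit(X, y)
    return model

def vpq(model, pair_features, p, q):
    """Edge weight V_pq = predicted probability that p and q share a section."""
    x = np.array([pair_features(p, q)])
    return model.predict_proba(x)[0, 1]
```
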
Learning graph weights
- Extract content and visual features from training data
- Learning Dp(.):
  - Training data does not provide labels
  - Set of labels = set of DOM tree nodes in that webpage
  - Dp(α) = distance in some feature space
  - Learn a Mahalanobis distance metric between nodes (distances within a section < distances across sections)
[Figure: neighborhood graph over nodes A-E with edge weights VAB, VAE, VBC and node weights DA, DB, DE]

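A minimal sketch of the Dp(.) computation, assuming a learned Mahalanobis matrix M; how M is fit is left out here, and the feature vectors are made up for illustration:

```python
import numpy as np

def mahalanobis_distance(x_p, x_alpha, M):
    """D_p(alpha): squared Mahalanobis distance (x_p - x_alpha)^T M (x_p - x_alpha)
    between node p's feature vector and that of the DOM node serving as label alpha.

    M is a learned positive semidefinite matrix; per the slide it would be trained
    so that within-section distances stay below across-section distances.
    """
    diff = x_p - x_alpha
    return float(diff @ M @ diff)

# Illustrative usage with made-up 3-dimensional feature vectors:
M = np.eye(3)                        # identity M reduces to squared Euclidean distance
x_p = np.array([0.2, 0.9, 0.1])      # features of DOM node p (assumed)
x_alpha = np.array([0.3, 0.8, 0.1])  # features of the label node alpha (assumed)
print(mahalanobis_distance(x_p, x_alpha, M))
```
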
Outline
- Motivation and Related Work
- Proposed Work
- Experiments
- Conclusions

Experiments
- Manually sectioned 105 randomly chosen webpages to get 1088 sections
- Two measures were used:
  - Adjusted RAND: fraction of leaf-node pairs which are correctly predicted to be together or apart (over and above random sectioning)
  - Normalized Mutual Information
- Both are between 0 and 1, with higher values indicating better results.

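Measures of this kind can be computed directly with scikit-learn over the leaf-node section assignments; the toy labels below are illustrative, and the paper's exact computation may differ slightly (for instance, scikit-learn's adjusted Rand index can dip below 0 for worse-than-random partitions):

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Ground-truth and predicted section ids for each leaf DOM node (toy example)
true_sections = [0, 0, 0, 1, 1, 2, 2, 2]
pred_sections = [0, 0, 1, 1, 1, 2, 2, 0]

print("Adjusted RAND:", adjusted_rand_score(true_sections, pred_sections))
print("NMI:", normalized_mutual_info_score(true_sections, pred_sections))
```
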
Experiments
[Figure: cumulative distribution of Adjusted RAND scores (x-axis: Adjusted RAND; y-axis: % of webpages below a given score). GCUTS: almost 50% of the webpages score better than 0.6. CCLUS: only 20% of the webpages score better than 0.6.]

Experiments
- Over all webpages, GCUTS is better than CCLUS

Experiments
- Application to duplicate detection on the Web
  - Collected lyrics of the same songs from 3 different sites (~2300 webpages)
    - Nearly similar content
    - Different template structures
  - Our approach:
    - Section all webpages
    - Perform duplicate detection using only the largest section (primary content)

Experiments
- Sectioning > no sectioning
- GCUTS > CCLUS

Outline
- Motivation and Related Work
- Proposed Work
- Experiments
- Conclusions

Conclusions
- Combined visual, DOM, and content-based cues
- Optimization on a neighborhood graph
  - Node and edge weights are learnt from training data
- Developed CCLUS and GCUTS, both with:
  - Approximation guarantees
  - Automatic determination of the number of sections

Energy-minimizing Graph Cuts
- What is such a Dp(.) function?
  - Use the set of internal DOM nodes as the set of labels
  - ⇒ Dp(α) measures the difference in feature vectors between node p and internal node (label) α
  - ⇒ If nodes p and q are very different, Dp(α) and Dq(α) will differ for all α

Correlation Clustering
- Does not enforce the Rendering Constraint:
  - Each pixel on the screen must belong to at most one section ⇒ parent nodes should have the same section as their children
- Workaround: consider only leaf nodes in the neighborhood graph
  - But content cues may be too noisy at the leaf level

Correlation Clustering
- Does not enforce the Rendering Constraint
  - Each pixel on the screen must belong to at most one section ⇒ parent section = child section
  - Either SA = SB = SC, or SA ≠ SB and SA ≠ SC
- Apply the rule only for ancestors "aimed" at visual rendering
[Figure: DOM tree with parent A and children B, C; SA = ?]

Correlation Clustering
- Does not enforce the Rendering Constraint
- Workaround: consider only leaf nodes in the neighborhood graph
  - But content cues may be too noisy at the leaf level
[Figure: DOM tree with parent A and children B, C; SB = 5, SC = 7, SA = ? (either SA = SB = SC, or SA ≠ SB and SA ≠ SC)]

Energy-minimizing Graph Cuts
- How can we fit the Rendering Constraint?
  - Have a special "invisible" label ξ
  - A parent is invisible unless all children have the same label
  - Can set the Vpq values accordingly
  - Automatically infer "rendering" versus "structural" DOM nodes
[Figure: DOM tree with parent A and children B, C; when SB = 5 and SC = 5, SA = 5; when SC = 7, SA = ξ]

Energy-minimizing Graph Cuts
- What is the set of labels?
  - The set of internal DOM nodes
    - Available at the beginning of the algorithm
    - The labels are themselves nodes, with feature vectors ⇒ Dp(α) = distance in some feature space
    - "Tuned" to the current webpage