Transcript ppt
Slide 1
A Graph-Theoretic Approach to Webpage Segmentation
Deepayan Chakrabarti, Ravi Kumar, Kunal Punera

Slide 2: Motivation and Related Work
[Figure: example webpage with its regions labeled: header, navigation bar, primary content, ad, related links, copyright]

Slide 3: Motivation and Related Work
Divide a webpage into visually and semantically cohesive sections.
[Figure: the same webpage with regions labeled: header, navigation bar, primary content, ad, related links, copyright]

Slide 4: Motivation and Related Work
Sectioning can be useful in:
- Webpage classification
- Displaying webpages on mobile phones and small-screen devices
- Webpage ranking
- Duplicate detection
- …

Slide 5: Motivation and Related Work
A lot of recent interest:
- Informative Structure Mining [Cai+/2003, Kao+/2005]
- Displaying webpages on small screens [Chen+/2005, Baluja/2006]
- Template detection [Bar-Yossef+/2002]
- Topic distillation [Chakrabarti+/2001]
Based solely on visual, or content, or DOM based clues
Mostly heuristic approaches

Slide 6: Motivation and Related Work
Our contributions:
- Combine visual, DOM, and content based cues
- Propose a formal graph-based combinatorial optimization approach
- Develop two instantiations, both with:
  - Approximation guarantees
  - Automatic determination of the number of sections
- Develop methods for automatic learning of graph weights

Slide 7: Outline
- Motivation and Related Work
- Proposed Work
- Experiments
- Conclusions

Slide 8: Proposed Work
A graph-based approach
- Construct a neighborhood graph of DOM tree nodes
  - Neighbors are close according to DOM tree distance, or visual distance when rendered on the screen, or similar content types
- Partition the neighborhood graph to optimize a cost function
[Figure: DOM tree with nodes A to E, and the corresponding neighborhood graph]

Slide 9: Proposed Work
A graph-based approach
- What is a good cost function?
  - Intuitive
  - Has polynomial-time algorithms that can get provably close to the optimal
  - Two candidates: Correlation Clustering and Energy-minimizing Graph Cuts
- How should we set weights in the neighborhood graph?
[Figure: DOM tree and neighborhood graph]
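
The neighborhood-graph construction named on the "A graph-based approach" slide can be sketched in a few lines. This is only an illustration under stated assumptions: the toy tree `PARENT`, the helper names, and the `max_dist` threshold are hypothetical, and only the DOM-tree-distance criterion is used (the slides also list visual distance and content similarity, which are omitted here).

```python
from collections import deque

# Hypothetical toy DOM tree, given as child -> parent (the root maps to None).
PARENT = {"A": None, "B": "A", "C": "A", "D": "C", "E": "C"}


def tree_adjacency(parent):
    """Turn the child -> parent map into an undirected adjacency map."""
    adj = {node: set() for node in parent}
    for child, par in parent.items():
        if par is not None:
            adj[child].add(par)
            adj[par].add(child)
    return adj


def tree_distances(adj, source):
    """Breadth-first search: number of tree edges from source to every node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def neighborhood_graph(parent, max_dist=2):
    """Connect every pair of DOM nodes whose tree distance is at most max_dist.

    Returns {(p, q): tree_distance} with p < q. The threshold is an assumption
    made for this sketch, not a value taken from the slides.
    """
    adj = tree_adjacency(parent)
    edges = {}
    for p in parent:
        for q, d in tree_distances(adj, p).items():
            if p < q and d <= max_dist:
                edges[(p, q)] = d
    return edges


edges = neighborhood_graph(PARENT, max_dist=2)
```

In a fuller version the stored value would be replaced by the learned weights discussed later in the deck; the tree distance is kept here only to make the example inspectable.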

Slide 10: Correlation Clustering
- Assign each DOM node p to a section S(p)
- Penalty for having DOM nodes p and q in different sections
- Vpq are edge weights in the neighborhood graph
[Figure: neighborhood graph on nodes A to E with edge weights VAB, VAE, VBC]

Slide 11: Correlation Clustering
Rendering Constraint: Each pixel on the screen must belong to at most one section
- Either S(A) = S(B) = S(C), or S(A) ≠ S(B) and S(A) ≠ S(C)
- Parent section = child section
- Constraint only applies to DOM nodes “aimed” at visual rendering
[Figure: DOM tree with parent A and children B and C; S(A) = ?]

Slide 12: Correlation Clustering
Rendering Constraint: Each pixel on the screen must belong to at most one section
- Either S(A) = S(B) = S(C), or S(A) ≠ S(B) and S(A) ≠ S(C)
- Not enforced by CCLUS
- Workaround: Use only leaf nodes in the neighborhood graph
- But content cues may be too noisy at the leaf level
[Figure: DOM tree with parent A and children B and C; S(A) = ?]

Slide 13: Correlation Clustering
Algorithm [Ailon+/2005]:
- Pick a random leaf node p
- Create a new section of p, and all nodes q which are strongly connected to p:
  [Equation: condition for q to be strongly connected to p]
- Remove p and q’s from the neighborhood graph
- Iterate
Within a factor of 2 of the optimal
Number of sections picked automatically

Slide 14: Proposed Work
A graph-based approach
- What is a good cost function?
  - Intuitive
  - Has polynomial-time algorithms that can get provably close to the optimal
  - Two candidates: Correlation Clustering and Energy-minimizing Graph Cuts
- How should we set weights in the neighborhood graph?
[Figure: DOM tree and neighborhood graph]

Slide 15: Energy-minimizing Graph Cuts
- Extra: A predefined set of labels
- Assign to each node p a label S(p)
- Cost terms: distance of node to label, and distance between pairs of nodes

Slide 16: Energy-minimizing Graph Cuts
- Cost terms: distance of node to label, and distance between pairs of nodes
- Difference from CCLUS:
  - Node weights Dp in addition to edge weights Vpq
  - Dp and Vpq can depend on the labels (not just “same” or “different”)
[Figure: neighborhood graph on nodes A to E with node weights DA, DB, DE and edge weights VAB, VAE, VBC]

Slide 17: Energy-minimizing Graph Cuts
How can we fit the Rendering Constraint?
- Have a special “invisible” label ξ
- Parent is invisible, unless all children have the same label
- Can set the Vpq values accordingly
[Figure: DOM tree with parent A and children B and C; S(A) = ? or ξ]

Slide 18: Energy-minimizing Graph Cuts
How can we fit the Rendering Constraint?
- Have a special “invisible” label ξ
- Parent is invisible, unless all children have the same label
- Can set the Vpq values accordingly
- Automatically infer “rendering” versus “structural” DOM nodes
[Figure: DOM tree with parent A and children B and C]

Slide 19: Energy-minimizing Graph Cuts
Why couldn’t we use this trick in CCLUS as well?
- CCLUS only asks: Are nodes p and q in the same section or not?
- It cannot handle “special” sections like the invisible section
- Hence, labels are giving us extra power

Slide 20: Energy-minimizing Graph Cuts
Advantages:
- Can use all DOM nodes, while still obeying the Rendering Constraint
  - Better than CCLUS
- Factor of 2 approximation of the optimal, by performing iterative min-cuts of specially constructed graphs
  - We extend [Kolmogorov+/2004]
- Number of sections is picked automatically

Slide 21: Energy-minimizing Graph Cuts
Theorem: Vpq must obey the constraint Separation cost ≥ Merge cost
- Set Vpq(different) >> Vpq(same) for nodes that are extremely close
- Cost minimization tries to place them in the same section

Slide 22: Energy-minimizing Graph Cuts
Theorem: Vpq must obey the constraint Separation cost ≥ Merge cost
- However, we cannot use Vpq to push two nodes to be in different sections
- Use Dp instead

Slide 23: Energy-minimizing Graph Cuts
Distance of node to label
- To separate nodes p and q: Ensure that either Dp(α) or Dq(α) is large, for any label α
- So, assigning both p and q to the same label will be too costly

Slide 24: Energy-minimizing Graph Cuts
- Dp ensures that nodes with very different content or visual features are split up
- Vpq ensures that nodes with very similar content or visual features are merged
- Invisible label lets us use the parent-child DOM tree structure

Slide 25: Proposed Work
A graph-based approach
- What is a good cost function?
  - Intuitive
  - Has polynomial-time algorithms that can get provably close to the optimal
  - Two candidates: Correlation Clustering and Energy-minimizing Graph Cuts
- How should we set weights in the neighborhood graph?
[Figure: DOM tree and neighborhood graph]

Slide 26: Learning graph weights
- Extract content and visual features from training data
- Learning Vpq(.)
  - Learn a logistic regression classifier (prob. that p and q belong to the same section)
[Figure: neighborhood graph on nodes A to E with node weights DA, DB, DE and edge weights VAB, VAE, VBC]

Slide 27: Learning graph weights
- Extract content and visual features from training data
- Learning Dp(.)
  - Training data does not provide labels
  - Set of labels = Set of DOM tree nodes in that webpage
  - Dp(α) = distance in some feature space
  - Learn a Mahalanobis distance metric between nodes (distances within section < distances across sections)
[Figure: neighborhood graph on nodes A to E with node weights DA, DB, DE and edge weights VAB, VAE, VBC]

Slide 28: Outline
- Motivation and Related Work
- Proposed Work
- Experiments
- Conclusions

Slide 29: Experiments
- Manually sectioned 105 randomly chosen webpages to get 1088 sections
- Two measures were used:
  - Adjusted RAND: fraction of leaf node pairs which are correctly predicted to be together or apart (over and above random sectioning)
  - Normalized Mutual Information
- Both are between 0 and 1, with higher values indicating better results.
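
To make the Adjusted RAND measure concrete, here is a minimal sketch that assumes the slide refers to the standard Hubert-Arabie adjusted Rand index computed over leaf nodes. The function name and the toy sectionings are hypothetical. Note that this standard form equals 1 for a perfect match, is about 0 for a random sectioning, and can dip below 0 when agreement is worse than chance.

```python
from collections import Counter
from math import comb


def adjusted_rand(true_sections, predicted_sections):
    """Adjusted Rand index between two sectionings of the same leaf nodes.

    Each argument is a sequence of section ids, one per leaf node, in the
    same node order. Section ids only need to be hashable.
    """
    n = len(true_sections)
    if n != len(predicted_sections):
        raise ValueError("both sectionings must cover the same nodes")

    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two nodes: nothing to disagree about

    # Contingency counts: nodes per (true, predicted) cell, per row, per column.
    cells = Counter(zip(true_sections, predicted_sections))
    rows = Counter(true_sections)
    cols = Counter(predicted_sections)

    together_both = sum(comb(c, 2) for c in cells.values())
    together_true = sum(comb(c, 2) for c in rows.values())
    together_pred = sum(comb(c, 2) for c in cols.values())

    expected = together_true * together_pred / total_pairs
    maximum = (together_true + together_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both put every node in one section
    return (together_both - expected) / (maximum - expected)


perfect = adjusted_rand([0, 0, 1, 1], [7, 7, 3, 3])   # same split, renamed ids
crossed = adjusted_rand([0, 0, 1, 1], [0, 1, 0, 1])   # every pair mismatched
```

The score depends only on which nodes are grouped together, so renaming section ids leaves it unchanged.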

Slide 30: Experiments
- GCUTS: Almost 50% of the webpages score better than 0.6
- CCLUS: Only 20% of the webpages score better than 0.6
[Figure: plot with axes "% webpages < score" and "Adjusted RAND"]

Slide 31: Experiments
Over all webpages GCUTS is better than CCLUS

Slide 32: Experiments
Application to duplicate detection on the Web
- Collected lyrics of the same songs from 3 different sites (~2300 webpages)
  - Nearly similar content
  - Different template structures
- Our approach:
  - Section all webpages
  - Perform duplicate detection using only the largest section (primary content)

Slide 33: Experiments
- Sectioning > No sectioning
- GCUTS > CCLUS

Slide 34: Outline
- Motivation and Related Work
- Proposed Work
- Experiments
- Conclusions

Slide 35: Conclusions
- Combined visual, DOM, and content based cues
- Optimization on a neighborhood graph
  - Node and edge weights are learnt from training data
- Developed CCLUS and GCUTS, both with:
  - Approximation guarantees
  - Automatic determination of the number of sections

Slide 36: Learning graph weights
- Extract content and visual features from training data
[Figure: neighborhood graph on nodes A to E with node weights DA, DB, DE and edge weights VAB, VAE, VBC]

Slide 37: Energy-minimizing Graph Cuts
What is such a Dp(.) function?
- Use the set of internal DOM nodes as the set of labels
- Dp(α) measures the difference in feature vectors between node p and internal node (label) α
- If nodes p and q are very different, Dp(α) and Dq(α) will differ for all α

Slide 38: Correlation Clustering
Does not enforce the Rendering Constraint:
- Each pixel on the screen must belong to at most one section
- Parent nodes should have same section as their children
- Workaround: Consider only leaf nodes in the neighborhood graph
- But content cues may be too noisy at the leaf level

Slide 39: Correlation Clustering
Does not enforce the Rendering Constraint
- Either S(A) = S(B) = S(C), or S(A) ≠ S(B) and S(A) ≠ S(C); S(A) = ?
- Each pixel on the screen must belong to at most one section
- Parent section = child section
- Apply rule only for ancestors “aimed” at visual rendering
[Figure: DOM tree with parent A and children B and C]

Slide 40: Correlation Clustering
Does not enforce the Rendering Constraint
- Workaround: Consider only leaf nodes in the neighborhood graph
- But content cues may be too noisy at the leaf level
- Either S(A) = S(B) = S(C), or S(A) ≠ S(B) and S(A) ≠ S(C)
[Figure: DOM tree with parent A and children B and C; S(B) = 5, S(C) = 7, S(A) = ?]

Slide 41: Energy-minimizing Graph Cuts
How can we fit the Rendering Constraint?
- Have a special “invisible” label ξ
- Parent is invisible, unless all children have the same label
- Can set the Vpq values accordingly
- Automatically infer “rendering” versus “structural” DOM nodes
[Figure: DOM tree with parent A and children B and C; S(B) = 5, S(C) = 5 or 7, S(A) = 5 or ξ]

Slide 42: Energy-minimizing Graph Cuts
What is the set of labels?
- The set of internal DOM nodes
  - Available at the beginning of the algorithm
- The labels are themselves nodes, with feature vectors
  - Dp(α) = distance in some feature space
  - “Tuned” to the current webpage
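
As a rough illustration of how labels, Dp and Vpq fit together, the sketch below evaluates a labeling cost of the form "sum of Dp(S(p)) plus sum of Vpq over neighbor pairs". Several things are assumptions made for the example and are not taken from the slides: Dp(α) is plain Euclidean distance rather than a learned Mahalanobis metric, Vpq is 0 for equal labels and a fixed non-negative weight otherwise (so separation cost ≥ merge cost holds), the invisible label ξ is left out, and a brute-force search stands in for the iterative min-cut algorithm. All names and toy values are hypothetical.

```python
from itertools import product
from math import dist


def energy(labeling, features, label_features, edges):
    """Cost of a labeling: node-to-label distances plus separation penalties."""
    # Dp(S(p)): distance between node p's features and its label's features.
    data_cost = sum(
        dist(features[p], label_features[labeling[p]]) for p in labeling
    )
    # Vpq: pay the edge weight only when neighbors p and q get different labels.
    pair_cost = sum(
        weight for (p, q), weight in edges.items() if labeling[p] != labeling[q]
    )
    return data_cost + pair_cost


def best_labeling(features, label_features, edges):
    """Exhaustive search over all labelings; only viable for tiny examples."""
    nodes = sorted(features)
    labels = sorted(label_features)
    candidates = (
        dict(zip(nodes, choice)) for choice in product(labels, repeat=len(nodes))
    )
    return min(
        candidates, key=lambda lab: energy(lab, features, label_features, edges)
    )


# Toy page: leaf nodes B, D, E; the internal nodes A and C act as labels.
features = {"B": (0.0, 0.0), "D": (1.0, 1.0), "E": (1.0, 0.9)}
label_features = {"A": (0.0, 0.1), "C": (1.0, 1.0)}
edges = {("D", "E"): 2.0, ("B", "D"): 0.1}

best = best_labeling(features, label_features, edges)
best_cost = energy(best, features, label_features, edges)
```

In this toy case D and E end up under label C and B under label A: the strong D-E edge keeps similar nodes together, while the large distance from B to label C pulls B into its own section, so the number of sections falls out of the minimization.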