Automatic Wrappers for Large Scale Web Extraction Nilesh Dalvi (Yahoo!), Ravi Kumar (Yahoo!), Mohamed Soliman (EMC)

Task: Learn rules to extract information (e.g., Directors) from structurally similar pages.
[Figure: DOM tree of a movie page — html > body containing div class='head' and div class='content'; the title "Godfather" and tables with cells "Title : Godfather", "Director : Coppola" (table width=80%), and "Runtime 118min".]
• We can use the following XPath rule to extract directors (see the sketch below):
W1 = /html/body/div[2]/table/td[2]/text()
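As a minimal illustration (not from the talk), here is a sketch of applying such an XPath rule with lxml; the page snippet below is a simplified, hypothetical version of the movie page in the figure.

```python
# A minimal sketch (not from the talk) of applying an XPath wrapper rule
# with lxml; the page below is a simplified, hypothetical movie page.
from lxml import etree

page = """
<html>
  <body>
    <div class="head"><title>Godfather</title></div>
    <div class="content">
      <table width="80%">
        <td>Director :</td><td>Coppola</td>
        <td>Runtime</td><td>118min</td>
      </table>
    </div>
  </body>
</html>
"""

tree = etree.fromstring(page)
# W1 = /html/body/div[2]/table/td[2]/text()
directors = tree.xpath("/html/body/div[2]/table/td[2]/text()")
print(directors)  # ['Coppola']
```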
Wrappers
• Can be learned with a small amount of supervision.
• Very effective for site-level extraction.
• Have been extensively studied in the literature.
In This Work:
Objective: learn wrappers without site-level supervision.
Idea
• Obtain training data cheaply using dictionaries or automatic labelers.
• Make wrapper induction tolerant to noise.
Summary of Approach
• A generic framework that can incorporate wrapper inductors with plausible properties.
• Input: a wrapper inductor Φ and a set of labels L.
• Idea: apply Φ to subsets of L and choose the wrapper that gives the best list (see the sketch below).
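As a rough illustration (my own schematic, not the authors' code), the overall loop can be sketched as follows; `enumerate_wrappers` and `score` are hypothetical stand-ins for the enumeration and ranking components described on the next slides.

```python
# Schematic sketch of the framework (assumed structure, not the authors'
# implementation): induce wrappers from subsets of the noisy labels L and
# keep the best-ranked one.

def choose_wrapper(phi, labels, enumerate_wrappers, score):
    """phi: wrapper inductor, maps a label set to a wrapper.
    labels: noisy label set L (e.g. from dictionaries or automatic labelers).
    enumerate_wrappers: yields the wrapper space W(L) = {phi(S) for subsets S of L}.
    score: ranking function, e.g. proportional to P[L | X] * P[X]."""
    candidates = enumerate_wrappers(phi, labels)
    return max(candidates, key=score)
```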
Summary of Approach
• Two main problems:
  • Wrapper Enumeration: how to generate the space of all possible wrappers efficiently?
  • Wrapper Ranking: how to rank the enumerated wrappers based on quality?
Example: TABLE wrapper system

[Table of the running example (5 rows × 4 columns):]
n1  a1  z1  p1
n2  a2  z2  p2
n3  a3  z3  p3
n4  a4  z4  p4
n5  a5  z5  p5

• Works on a table.
• Generates wrappers from the following space: a single cell, a row, a column, or the entire table.
Example: TABLE wrapper system (continued)

• L = {n1, n2, n4, a4, z5}
• 32 possible subsets
• 8 unique wrappers: {n1, n2, n4, a4, z5, C1, R4, T}, where C1 = column 1, R4 = row 4, T = entire table (see the sketch below)
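For illustration only, a brute-force sketch (my own, assuming the TABLE inductor behaves as described on this slide) that applies Φ to every non-empty subset of L and counts the distinct wrappers:

```python
from itertools import combinations

# Cells of the 5x4 example table, addressed as (row, col); columns n, a, z, p.
cells = {f"{c}{r + 1}": (r, k) for r in range(5) for k, c in enumerate("nazp")}

def table_inductor(subset):
    """TABLE inductor as described above: a single cell, a row, a column,
    or the entire table -- the smallest one covering the subset."""
    rows = {cells[x][0] for x in subset}
    cols = {cells[x][1] for x in subset}
    if len(subset) == 1:
        return ("cell", next(iter(subset)))
    if len(rows) == 1:
        return ("row", f"R{next(iter(rows)) + 1}")
    if len(cols) == 1:
        return ("col", f"C{next(iter(cols)) + 1}")
    return ("table", "T")

L = ["n1", "n2", "n4", "a4", "z5"]
wrappers = {table_inductor(S) for k in range(1, len(L) + 1)
            for S in combinations(L, k)}
print(len(wrappers))                         # 8
print(sorted(name for _, name in wrappers))  # ['C1', 'R4', 'T', 'a4', 'n1', 'n2', 'n4', 'z5']
```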
Wrapper Enumeration Problem
• Input: a wrapper inductor Φ and a set of labels L.
• The wrapper space of L is defined as W(L) = {Φ(S) | S ⊆ L}.
• Problem: enumerate the wrapper space of L in time polynomial in the size of the wrapper space and in |L|.
Wrapper Inductors
• TABLE: the wrapper inductor defined above.
• XPATH: learn the minimal XPath rule, in a simple fragment of XPath, that covers all the training examples.
• LR: find the maximal pair of strings preceding and following all the training examples; the output of the wrapper is all strings delimited by that pair (see the sketch below).
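As a rough sketch of the LR idea (my own reading of the one-line description above, not the authors' exact inductor), assuming the page is a single string and each training example occurs once:

```python
import os
import re

def lr_inductor(page, examples):
    """LR (sketch): the maximal pair of strings preceding and following
    every training example; the wrapper extracts all strings they delimit."""
    def context(ex):
        i = page.index(ex)
        return page[:i], page[i + len(ex):]
    lefts, rights = zip(*(context(ex) for ex in examples))
    left = os.path.commonprefix([s[::-1] for s in lefts])[::-1]  # common suffix of left contexts
    right = os.path.commonprefix(list(rights))                   # common prefix of right contexts
    return left, right

def apply_lr(page, wrapper):
    left, right = wrapper
    return re.findall(re.escape(left) + "(.*?)" + re.escape(right), page)

# Hypothetical page and training examples.
page = "Name: Alice (CEO); Name: Bob (CTO); Name: Carol (CFO); "
w = lr_inductor(page, ["Alice", "Bob"])
print(w)                  # ('Name: ', ' (C')
print(apply_lr(page, w))  # ['Alice', 'Bob', 'Carol']
```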
Well-behaved Inductor
• A wrapper inductor Φ is well-behaved if it has the following properties:
  • [Fidelity] L ⊆ Φ(L)
  • [Closure] l ∈ Φ(L) ⇒ Φ(L) = Φ(L ∪ {l})
  • [Monotonicity] L1 ⊆ L2 ⇒ Φ(L1) ⊆ Φ(L2)
• Theorem: TABLE, LR, and XPATH are well-behaved wrapper inductors.
Bottom-up Algorithm
• Start with the singleton labels in L as candidate label sets.
• Learn wrappers by feeding candidate label sets to Φ.
• Incrementally apply one-label extensions to each candidate.
• Extend the candidates with the closures of the wrappers learned by Φ.
• Theorem: the bottom-up algorithm is sound and complete.
• Theorem: the bottom-up algorithm makes at most k·|L| calls to the wrapper inductor, where k is the size of the wrapper space.
Can we do better?
• A wrapper inductor is a feature-based inductor if:
  • Every label is associated with a set of features ((attribute, value) pairs).
  • Φ(L) = the intersection of the features of all labels in L.
  • The output of a wrapper w = the text nodes satisfying all the features of w.
• E.g., TABLE can be expressed as a feature-based inductor with two features, row and col.
• Both LR and XPATH can be expressed as feature-based inductors (see the sketch below).
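A minimal sketch of this definition (my own, with an assumed representation of features as (attribute, value) pairs); instantiating it with `row` and `col` features recovers the TABLE inductor:

```python
def feature_inductor(features_of, labels):
    """Phi(L) = intersection of the feature sets of the labeled nodes."""
    feature_sets = [set(features_of(l).items()) for l in labels]
    return set.intersection(*feature_sets)

def apply_wrapper(features_of, wrapper, nodes):
    """Output of a wrapper = all text nodes satisfying every feature of it."""
    return [n for n in nodes if wrapper <= set(features_of(n).items())]

# TABLE as a feature-based inductor with two features, row and col.
nodes = [f"{c}{r}" for r in range(1, 6) for c in "nazp"]

def features_of(cell):
    return {"row": int(cell[1]), "col": cell[0]}

w = feature_inductor(features_of, ["n1", "n2", "n4"])
print(w)                                     # {('col', 'n')}  -- i.e. column C1
print(apply_wrapper(features_of, w, nodes))  # ['n1', 'n2', 'n3', 'n4', 'n5']
```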
Top-down Algorithm
• We give a top-down algorithm for feature-based wrapper inductors that makes exactly k calls to the inductor, where k is the size of the wrapper space.
Wrapper Ranking Problem
• Given a set of wrappers, we want to output the one that gives the "best" list.
• Let X be the list extracted by a wrapper w.
• Choose the wrapper that maximizes P[X | L], or equivalently, P[L | X] · P[X].
Example: Extracting names from business listings

• Let us rank the following three lists as candidates for the set of names (over the same table and labels L = {n1, n2, n4, a4, z5} as above):
  • X1 = first column
  • X2 = entire table
  • X3 = first two columns
• X1 = first column
  • P[L | X1]: 2 wrong labels, 3 correct labels
  • P[X1]: nice repeating structure, schema size = 4
• X2 = entire table
  • P[L | X2]: 0 wrong labels, 5 correct labels
  • P[X2]: nice repeating structure, schema size = 1
• X3 = first two columns
  • P[L | X3]: 1 wrong label, 4 correct labels
  • P[X3]: poor repeating structure, schema size = 1 or 3
Ranking Model
• P[L | X]:
  • Assume a simple annotator with precision p and recall r that labels each node independently.
  • Each node in X is added to L with probability r.
  • Each node not in X is added to L with probability 1 − p (see the sketch below).
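A small scoring sketch of this model (my own, in log space; `all_nodes`, p, and r are assumed inputs, and the P[X] prior is left as a pluggable function):

```python
import math

def log_p_labels_given_x(X, L, all_nodes, p, r):
    """log P[L | X] under the independent-annotator model above:
    nodes in X end up in L with probability r (recall),
    nodes outside X end up in L with probability 1 - p (precision p)."""
    X, L = set(X), set(L)
    score = 0.0
    for node in all_nodes:
        if node in X:
            score += math.log(r if node in L else 1 - r)
        else:
            score += math.log(1 - p if node in L else p)
    return score

def rank(candidate_lists, L, all_nodes, p, r, log_prior):
    """argmax_X P[L | X] * P[X], computed in log space."""
    return max(candidate_lists,
               key=lambda X: log_p_labels_given_x(X, L, all_nodes, p, r) + log_prior(X))
```

In the business-listings example, this is where the wrong/correct label counts for X1, X2, and X3 enter the score, while schema size and repeating structure would enter through the P[X] prior.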
Ranking Model
• P[X]:
  • Define features of the grammar that describes X, e.g. schema size and repeating structure.
  • Learn distributions on the values of these features, or take them as input as part of domain knowledge.
Experiments
• Datasets:
  • DEALERS: used automatic form-filling techniques to obtain dealer listings from 300 store-locator pages.
  • DISCOGRAPHY: crawled 14 music websites that contain track listings of albums.
• Task: automatically learn wrappers to extract business names/track titles for each website.
Summary
• A new framework for noise-tolerant wrapper induction
  • Two efficient wrapper enumeration algorithms
  • Probabilistic wrapper ranking model
• Web-scale information extraction
  • No site-level supervision ⇒ no manual labeling
  • Tolerating noise in automatic labeling
Bottom-up Algorithm
INPUT: Φ, L
Z = all singleton subsets of L
W = Z
while (Z not empty)
    Remove the smallest set S from Z
    For each possible single-label expansion S' of S
        Add Φ(S') to W
        Add (Φ(S') ∩ L) back to Z
Bottom-up Algorithm (trace on the running example)

[Figure: the same 5×4 table, with the candidate queue Z and the discovered wrappers evolving step by step:]
Z = {n1, n2, n4, a4, z5}
Z = {n2, n4, a4, z5, {n1, n2, n4}}
Z = {n2, n4, a4, z5, {n1, n2, n4}, {n1, n2, n4, a4, z5}}
Z = {n4, a4, z5, {n1, n2, n4}, {n1, n2, n4, a4, z5}}
Z = {a4, z5, {n4, a4}, {n1, n2, n4}, {n1, n2, n4, a4, z5}}
Z = {z5, {n4, a4}, {n1, n2, n4}, {n1, n2, n4, a4, z5}}
Z = {{n4, a4}, {n1, n2, n4}, {n1, n2, n4, a4, z5}}
Z = {{n1, n2, n4}, {n1, n2, n4, a4, z5}}
Z = {{n1, n2, n4, a4, z5}}
Z = {}
The wrappers discovered along the way are n1, n2, n4, a4, z5, C1, R4, and T.
Top-down Algorithm (on the running example)

[Figure: the same 5×4 table, with the label set {n1, n2, n4, a4, z5} split top-down by the row and column features into {n1, n2, n4} and {n4, a4}, and further into the singletons n1, n2, n4, a4, z5.]
Wrapper Ranking
• argmax_X P[L | X] · P[X]?
• The possible values of X are the possible wrappers computed by Φ.
• P[L | X]: the probability of observing L given that X is the right wrapper.
• The annotator has precision p and recall r (estimated from tested labelings).
[Figure: the set H of all nodes, partitioned by the wrapper output X and the labels L into labeled and non-labeled nodes inside X and labeled and non-labeled nodes outside X (regions X1, X2, A1, A2).]
• Independent annotation process: the annotator decides on labeling each node independently.
  • Each node in X is added to L with probability r.
  • Each node not in X is added to L with probability 1 − p.