Transcript [Poster]

Very Fast Similarity Queries on Semi-Structured Data from the Web
Bhavana Dalvi, William W. Cohen
Language Technologies Institute, Carnegie Mellon University
Contributions
• PIC-D: a single low-dimensional representation for entities on the Web, built using Power Iteration Clustering (PIC) by Lin and Cohen, ICML 2010.
• #dimensions in PIC-D = √(total number of dimensions)
• Time to create PIC-D is linear in the total number of dimensions.
• Information extraction tasks are posed as similarity queries on PIC-D.
• Comparable precision and recall w.r.t. the high-dimensional baseline.
• Up to 2 orders of magnitude improvement in query run-time, at the cost of a small amount of pre-processing time to create PIC-D.

PIC-D Representation for Entities on the Web
Hypothesis: PIC-D embeddings will cluster similar entities (entities belonging to the same class) together.
[Figure: each view is a bipartite graph that PIC maps to a low-dimensional embedding. View 1, e.g. entities in HTML tables: |X| * n_1 bipartite graph -> PIC -> |X| * m PIC embedding, m << n_1. View 2, e.g. entities with Hearst patterns: |X| * n_2 bipartite graph.]
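A minimal sketch of how one view could be embedded and the views concatenated, assuming NumPy. It follows the general PIC recipe (repeatedly multiply by the row-normalised affinity matrix and L1-normalise), with the affinity matrix taken to be A = F F^T for an entity-by-feature matrix F. The function names, the random start vectors, the fixed iteration count (the original PIC uses a convergence-based stopping rule) and the toy matrices are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def pic_embedding(F, m, n_iter=10, seed=0):
    """PIC-style embedding of the row entities of one bipartite graph.

    F: |X| x n non-negative entity-by-feature matrix with at least one edge.
    Returns an |X| x m matrix; column j is one truncated power iteration.
    """
    F = np.asarray(F, dtype=float)
    n_entities = F.shape[0]
    # Row sums of the affinity matrix A = F F^T, computed without building A.
    degree = F @ (F.T @ np.ones(n_entities))
    degree[degree == 0] = 1.0                      # isolated entities stay at 0
    rng = np.random.default_rng(seed)
    V = rng.random((n_entities, m))                # m random start vectors (assumption)
    V /= V.sum(axis=0, keepdims=True)
    for _ in range(n_iter):
        V = (F @ (F.T @ V)) / degree[:, None]      # V <- D^-1 A V
        V /= np.abs(V).sum(axis=0, keepdims=True)  # L1-normalise each column
    return V

def pic_d(views, m, n_iter=10):
    """Concatenate the per-view embeddings into an |X| x (D*m) matrix."""
    return np.hstack([pic_embedding(F, m, n_iter, seed=d) for d, F in enumerate(views)])

# Hypothetical toy views over 5 entities: USA, India, Football, Hockey, Baseball.
tables = np.array([[1, 1, 0, 0],     # entity x table-column incidence
                   [1, 1, 0, 0],
                   [0, 0, 1, 1],
                   [0, 0, 1, 1],
                   [0, 0, 1, 1]])
concepts = np.array([[1, 1, 0],      # entity x concept incidence
                     [1, 1, 0],
                     [0, 0, 1],
                     [0, 0, 1],
                     [0, 0, 1]])
E = pic_d([tables, concepts], m=2)
print(E.shape)   # (5, 4)
```

In this toy input the two countries share all their features and so do the three sports, so each group ends up with identical embedding rows, which is the clustering behaviour the hypothesis describes.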
Preprocessing to create PIC-D

Property   Description                         Toy_Apple   Delicious_Sports  ASIA_INT  Clueweb_Sports
           # HTML pages                        574         21K               121K      918K
|X|        # entities                          15K         438               15K       30K
|C|        # table columns                     156         925               8K        78K
|(x, c)|   # (x, c) edges                      70.5K       5.5K              91K       566K
|Ys|       # suchas concepts                   2.3K        1.6K              3.8K      21.4K
|(x, Ys)|  # (x, Ys) edges                     7.7K        4.8K              18.3K     107.8K
|Yn|       # NELL classes                      11          3                 23        23
|(x, Yn)|  # (x, Yn) edges                     419         39                691       977
|Yc|       # manual column labels              31          30                -         -
|(c, Yc)|  # (c, Yc) pairs                     156         925               -         -
           # PIC-D dimensions                  51          51                110       317
           Total time to create PIC-D (msec)   49.7        53                69.7      0.0576
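As a rough check of the rule "#dimensions in PIC-D = √(total number of dimensions)", the sketch below recomputes the dimension counts from the |C| and |Ys| figures listed above. It assumes that the total number of dimensions is |C| + |Ys| and that the four values in each row are in the order Toy_Apple, Delicious_Sports, ASIA_INT, Clueweb_Sports; both are readings of the poster rather than something it states. The "K" figures are rounded, so only approximate agreement can be expected.

```python
import math

# (|C| table columns, |Ys| suchas concepts, reported #PIC-D dimensions).
# "K" values are rounded in the source, so results are approximate.
datasets = {
    "Toy_Apple":        (156,    2_300,  51),
    "Delicious_Sports": (925,    1_600,  51),
    "ASIA_INT":         (8_000,  3_800,  110),
    "Clueweb_Sports":   (78_000, 21_400, 317),
}

def estimated_dims(n_columns, n_concepts):
    # Assumption: "total number of dimensions" = |C| + |Ys|.
    return math.sqrt(n_columns + n_concepts)

estimates = {name: estimated_dims(c, ys) for name, (c, ys, _) in datasets.items()}
for name, (_, _, reported) in datasets.items():
    print(f"{name}: sqrt(total) ~ {estimates[name]:.1f}, reported {reported}")
```

Under these assumptions every estimate lands within a few percent of the reported number of PIC-D dimensions.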
How much time does it take to create PIC-D? time = O(n)
[Figure: view D: PIC -> |X| * m PIC embedding, m << n_D. Output: |X| * D * m PIC-D embedding.]
Hypothesis: Entities co-occurring in multiple table columns or with similar suchas concepts probably belong to the same class.
[Figure: entity occurrences in HTML table columns. Entities: USA, India, Football, Hockey, Baseball. Table columns: TC-1, TC-2, TC-3, TC-4. Labels: Country, Location, Sports.]
Example PIC-3 embedding, m = 2
X1: 0.23, 0.21, 0.36, 0.35, 0.34
X2: 0.76, 0.79, 0.80, 0.82, 0.79
Y1: 0.43, 0.41, 0.66, 0.16, 0.14
Y2: 0.66, 0.69, 0.35, 0.92, 0.89
Aggregate results over
• Set Expansion: 272 queries (Delicious_Sports) and 152 queries (Toy_Apple)
• ASIA: 25 queries (Delicious_Sports)
• COL-CLASS: 925 queries (Delicious_Sports) and 156 queries (Toy_Apple)

Time = avg. query time (msec); Speedup = speedup of PIC-D.

                                                         Delicious_Sports     Toy_Apple
Task                                 Method              Time      Speedup    Time      Speedup
Set Expansion                        K-NN on PIC-D       12.1                 72.8
                                     K-NN Baseline       164.4     13.5       17578.3   241.5
                                     Label propagation   1902.4    157.2      4801.9    65.9
Automatic Set Instance Acquisition   K-NN on PIC-D       20.0
                                     K-NN Baseline       56.0      2.8
                                     Label propagation   6000.0    300.0
Column Classification                SVM on PIC-D        0.1                  3.8
                                     SVM Baseline        1.2       12         56.8      14.94
Similarity queries on PIC-D are up to 2 orders of magnitude faster.
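The speedup figures appear to be the ratio of another method's average query time to the PIC-D method's average query time for the same task and dataset. The sketch below recomputes them from the times reported above. The ratio interpretation and the assignment of each number to a task, dataset and method are a reading of the flattened results table, so treat them as assumptions.

```python
# Average query times in msec, copied from the results above.
pic_d_time = {
    ("Set Expansion", "Delicious_Sports"): 12.1,
    ("Set Expansion", "Toy_Apple"): 72.8,
    ("ASIA", "Delicious_Sports"): 20.0,
    ("Column Classification", "Delicious_Sports"): 0.1,
    ("Column Classification", "Toy_Apple"): 3.8,
}
other_time = {
    ("Set Expansion", "Delicious_Sports", "K-NN Baseline"): 164.4,
    ("Set Expansion", "Delicious_Sports", "Label propagation"): 1902.4,
    ("Set Expansion", "Toy_Apple", "K-NN Baseline"): 17578.3,
    ("Set Expansion", "Toy_Apple", "Label propagation"): 4801.9,
    ("ASIA", "Delicious_Sports", "K-NN Baseline"): 56.0,
    ("ASIA", "Delicious_Sports", "Label propagation"): 6000.0,
    ("Column Classification", "Delicious_Sports", "SVM Baseline"): 1.2,
    ("Column Classification", "Toy_Apple", "SVM Baseline"): 56.8,
}

# Speedup of PIC-D = time of the other method / time of the PIC-D method.
speedup = {key: t / pic_d_time[key[:2]] for key, t in other_time.items()}
for (task, dataset, method), s in speedup.items():
    print(f"{task} / {dataset} / {method}: {s:.1f}x")
```

The recomputed ratios agree with the reported speedups up to rounding, with the largest (about 300x and 241x) accounting for the "2 orders of magnitude" claim.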
[Plot panels: Set Expansion; ASIA]
IE Tasks as Similarity Queries
Set Expansion
  Input: seed entities
  Output: more entities of the same type
  Training: PIC-D
Automatic Set Instance Acquisition (ASIA)
  Input: class name
  Output: entities belonging to the class
Column Classification
  Output: class name for seeds
How many PIC-D dimensions are enough? m = √n, and concatenate.
[Figure: entity occurrences in text with Hearst patterns]
[Figure: view D, e.g. entities in Subj-Verb-Obj triples: |X| * n_D bipartite graph -> PIC]
Dataset        Toy_Apple   Delicious_Sports  ASIA_INT  Clueweb_Sports
# HTML pages   574         21K               121K      918K
[Figure: view 2: |X| * m PIC embedding, m << n_2]
Query Runtime Speedup vs. Results Quality
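IE tasks as similarity queries: training and testing steps
ASIA
  Training: PIC-D + Index HCD
  Testing: seeds = top-k-entities(lookup concept in HCD) + Set Expansion(seeds)
Column Classification
  Input: seed entities
  Training: PIC-D + Train SVM
  Testing: Centroid(column) + Predict_SVM(Centroid)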
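The ASIA testing step (look up the concept in HCD, take the top-k entities as seeds, then run Set Expansion on the seeds) can be sketched as follows, assuming NumPy. The poster does not specify how HCD is stored or how entities are ranked, so the dictionary of concept -> entity counts, the count-based ranking, the Euclidean distance and the toy embedding values are all hypothetical.

```python
import numpy as np

def top_k_entities(hcd, concept, k):
    """Most frequent entities recorded for a concept (ties broken by name)."""
    counts = hcd.get(concept, {})
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [entity for entity, _ in ranked[:k]]

def expand(embedding, names, seeds, k):
    """Set expansion: the k non-seed entities nearest to the seed centroid."""
    rows = [names.index(s) for s in seeds]
    centroid = embedding[rows].mean(axis=0)
    order = np.argsort(np.linalg.norm(embedding - centroid, axis=1))
    return [names[i] for i in order if names[i] not in seeds][:k]

def asia(hcd, embedding, names, concept, n_seeds=2, k=1):
    """Class name in, entities out; assumes the concept has seeds in `names`."""
    seeds = [s for s in top_k_entities(hcd, concept, n_seeds) if s in names]
    return seeds, expand(embedding, names, seeds, k)

# Hypothetical 2-D embedding and concept index, for illustration only.
names = ["USA", "India", "Football", "Hockey", "Baseball"]
embedding = np.array([[0.20, 0.80],
                      [0.22, 0.78],
                      [0.70, 0.30],
                      [0.72, 0.28],
                      [0.69, 0.33]])
hcd = {"Sports": {"Football": 12, "Hockey": 7},
       "Country": {"USA": 9, "India": 5}}

seeds, expanded = asia(hcd, embedding, names, "Sports")
print(seeds, expanded)   # ['Football', 'Hockey'] ['Baseball']
```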
Set Expansion task on Clueweb_Sports
Seed entities: Arsenal, Liverpool, Manchester United
  Expanded entity set by K-NN + PIC-D method: Middlesbrough, Man United, Blackburn Rovers, Manchester City, Tottenham, West Brom, Tottenham Hotspur, Bolton Wanderers, Newcastle United, Blackburn, Bolton, Birmingham City, Aston Villa, Chelsea Fc, Sunderland, Sheffield United, ...
Seed entities: MSN, Google, Yahoo
  Expanded entity set by K-NN + PIC-D method: Qas, Mitre, Cosco, Cerberus, Cdt, Garrett, Sportingbet, Excelsior, Genzyme, Gt, Broad, Ge, Bruno, Nortel, Level 3, Nec, Foster, Renault, Ricardo, Persepolis, …
Set Expansion
  Testing: Centroid(entity set) + K-NN(Centroid)
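A minimal sketch of the Set Expansion query, Centroid(entity set) + K-NN(Centroid), assuming NumPy and Euclidean distance in the embedding space. The distance metric, the function name and the toy embedding values are assumptions, since the poster does not state them.

```python
import numpy as np

def set_expansion(embedding, names, seeds, k=3):
    """Rank non-seed entities by distance to the centroid of the seed rows."""
    index = {name: i for i, name in enumerate(names)}
    seed_rows = [index[s] for s in seeds]
    centroid = embedding[seed_rows].mean(axis=0)            # Centroid(entity set)
    dist = np.linalg.norm(embedding - centroid, axis=1)     # distance of every entity
    ranked = [names[i] for i in np.argsort(dist) if names[i] not in seeds]
    return ranked[:k]                                       # K-NN(Centroid)

# Hypothetical 2-D embedding rows, for illustration only.
names = ["USA", "India", "Football", "Hockey", "Baseball"]
embedding = np.array([[0.20, 0.80],
                      [0.22, 0.78],
                      [0.70, 0.30],
                      [0.72, 0.28],
                      [0.69, 0.33]])

sports = set_expansion(embedding, names, ["Football", "Hockey"], k=2)
countries = set_expansion(embedding, names, ["USA"], k=1)
print(sports)      # ['Baseball', 'India']
print(countries)   # ['India']
```

Because the query touches only an |X| x (D*m) matrix instead of the full high-dimensional feature space, each lookup is a single pass over short vectors, which is consistent with the run-time savings reported on the poster.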
ASIA task on Clueweb_Sports
Concept: Sports
  Seed set: Football, Basketball, Soccer
  K-NN + PIC-D expanded set: Softball, Ice Hockey, Volleyball, Skating, Martial Arts, Windsurfing, Hunting, Strength Sports, Lacrosse, Dodgeball, Curling, ...
Concept: Outdoor Recreation
  Seed set: Hunting, Fishing, Skiing
  K-NN + PIC-D expanded set: Cross Country, Martial Arts, Ice Hockey, Croquet, Curling, Climbing, Lacrosse, Softball, Basketball, Golf, Windsurfing, Baseball, ...
Concept: Leagues
  Seed set: NFL, NHL, NBA
PIC-D results in comparable precision/recall w.r.t. the high-dimensional baseline. Label propagation achieves better performance at the cost of huge query runtimes.
[Plot panel: Column Classification]
Conclusions
• We present a single, efficiently constructible representation, named PIC-D, for entities on the Web.
• IE tasks can be posed as similarity queries on the PIC-D representation: Set Expansion, Automatic Set Instance Acquisition and Column Classification.
• PIC-D reduces query run-time by up to 2 orders of magnitude with comparable quality of results.
• Future work: using the PIC-D representation
  - with many more views of data, e.g., SVO triples, properties derived from KBs, etc.
  - for unsupervised class-instance pair acquisition.
ASIA task on Clueweb_Sports, concept Leagues, K-NN + PIC-D expanded set: NHL, NASCAR, NHRA, NCCA, PGA, Sports Illustrated, Premier League, ...
Acknowledgements : This work is supported by Google and the Intelligence Advanced Research Projects Activity (IARPA) via Air Force Research Laboratory (AFRL) contract number FA8650-10-C-7058.