Transcript [Poster]
Very Fast Similarity Queries on Semi-Structured Data from the Web
Bhavana Dalvi, William W. Cohen
Language Technologies Institute, Carnegie Mellon University

Contributions
- PIC-D: a single low-dimensional representation for entities on the Web, built using Power Iteration Clustering (PIC) by Lin and Cohen, ICML 2010.
- #dimensions in PIC-D = √(total number of dimensions).
- Hypothesis: PIC-D embeddings will cluster similar entities (entities belonging to the same class) together.
- Time to create PIC-D is linear in the total number of dimensions.
- Information extraction tasks posed as similarity queries on PIC-D.
- Comparable precision and recall w.r.t. the high-dimensional baseline.
- Up to 2 orders of magnitude improvement in query run-time, incurring a small amount of pre-processing time to create PIC-D.

PIC-D Representation for Entities on the Web
[Figure: construction of the PIC-D embedding]
- E.g. entities in HTML tables: |X| * n_1 bipartite graph -> PIC -> |X| * m PIC embedding, m << n_1
- E.g. entities with Hearst patterns: |X| * n_2 bipartite graph -> PIC -> |X| * m PIC embedding, m << n_2
- E.g. entities in Subj-Verb-Obj triples: |X| * n_D bipartite graph -> PIC -> |X| * m PIC embedding, m << n_D
- m = √n and concatenate -> |X| * D * m PIC-D embedding

Hypothesis: Entities co-occurring in multiple table columns or with similar suchas concepts probably belong to the same class label.

[Figure: entity occurrences in HTML table columns (TC-1 to TC-4) and in text with Hearst patterns (concepts Country, Location, Sports) for the entities USA, India, Football, Hockey, Baseball]

Example PIC-3 embedding, m = 2

Entity     X1     X2     Y1     Y2
USA        0.23   0.76   0.43   0.66
India      0.21   0.79   0.41   0.69
Football   0.36   0.80   0.66   0.35
Hockey     0.35   0.82   0.16   0.92
Baseball   0.34   0.79   0.14   0.89

Preprocessing to create PIC-D

Property    Description                         Toy_Apple   Delicious_Sports   ASIA_INT   Clueweb_Sports
            # HTML pages                        574         21K                121K       918K
|X|         # entities                          15K         438                15K        30K
|C|         # table columns                     156         925                8K         78K
|(x, c)|    # (x, c) edges                      70.5K       5.5K               91K        566K
|Ys|        # suchas concepts                   2.3K        1.6K               3.8K       21.4K
|(x, Ys)|   # (x, Ys) edges                     7.7K        4.8K               18.3K      107.8K
|Yn|        # NELL classes                      11          3                  23         23
|(x, Yn)|   # (x, Yn) edges                     419         39                 691        977
|Yc|        # manual column labels              31          30                 -          -
|(c, Yc)|   # (c, Yc) pairs                     156         925                -          -
            # PIC-D dimensions                  51          51                 110        317
            Total time to create PIC-D (msec)   49.7        53                 69.7       0.0576

[Plot: How much time does it take to create PIC-D? time = O(n)]
[Plot: How many PIC-D dimensions are enough?]

IE Tasks as Similarity Queries

Set Expansion
  Input: Seed entities
  Output: More entities of same type
  Training: PIC-D
  Testing: Centroid(entity set) + K-NN(Centroid)

Automatic Set Instance Acquisition (ASIA)
  Input: Class name
  Output: Entities belonging to the class
  Training: PIC-D + Index HCD
  Testing: seeds = top-k-entities(lookup concept in HCD) + Set Expansion(seeds)

Column Classification
  Input: Seed entities
  Output: Class name for seeds
  Training: PIC-D + Train SVM
  Testing: Centroid(column) + Predict_SVM(Centroid)

Query runtime
Aggregate results over:
- Set Expansion: 272 queries (Delicious_Sports) and 152 queries (Toy_Apple)
- ASIA: 25 queries (Delicious_Sports)
- COL-CLASS: 925 queries (Delicious_Sports) and 156 queries (Toy_Apple)

                                                           Delicious_Sports               Toy_Apple
Task                                 Method                Avg. query     Speedup         Avg. query     Speedup
                                                           time (msec)    of PIC-D        time (msec)    of PIC-D
Set Expansion                        K-NN on PIC-D         12.1                           72.8
                                     K-NN Baseline         164.4          13.5            17578.3        241.5
                                     Label propagation     1902.4         157.2           4801.9         65.9
Automatic Set Instance Acquisition   K-NN on PIC-D         20.0
                                     K-NN Baseline         56.0           2.8
                                     Label propagation     6000.0         300.0
Column Classification                SVM on PIC-D          0.1                            3.8
                                     SVM Baseline          1.2            12              56.8           14.94

Similarity queries on PIC-D are up to 2 orders of magnitude faster.

Query Runtime Speedup vs. Results Quality
[Plots: Set Expansion, ASIA, Column Classification]
- PIC-D results in comparable precision/recall w.r.t. the high-dimensional baseline.
- Label propagation achieves better performance at the cost of huge query runtimes.

Set Expansion task on Clueweb_Sports
Expanded entity sets produced by the K-NN + PIC-D method:
- Seed entities: Arsenal, Liverpool, Manchester United
  Expanded set: Middlesbrough, Man United, Blackburn Rovers, Manchester City, Tottenham, West Brom, Tottenham Hotspur, Bolton Wanderers, Newcastle United, Blackburn, Bolton, Birmingham City, Aston Villa, Chelsea Fc, Sunderland, Sheffield United, ...
- Seed entities: MSN, Google, Yahoo
  Expanded set: Qas, Mitre, Cosco, Cerberus, Cdt, Garrett, Sportingbet, Excelsior, Genzyme, Gt, Broad, Ge, Bruno, Nortel, Level 3, Nec, Foster, Renault, Ricardo, Persepolis, ...

ASIA task on Clueweb_Sports
Expanded sets produced by K-NN + PIC-D:
- Concept: Sports
  Seed set: Football, Basketball, Soccer
  Expanded set: Softball, Ice Hockey, Volleyball, Skating, Martial Arts, Windsurfing, Hunting, Strength Sports, Lacrosse, Dodgeball, Curling, ...
- Concept: Outdoor Recreation
  Seed set: Hunting, Fishing, Skiing
  Expanded set: Cross Country, Martial Arts, Ice Hockey, Croquet, Curling, Climbing, Lacrosse, Softball, Basketball, Golf, Windsurfing, Baseball, ...
- Concept: Leagues
  Seed set: NFL, NHL, NBA
  Expanded set: NHL, NASCAR, NHRA, NCCA, PGA, Sports Illustrated, Premier League, ...

Conclusions
- We present a single, efficiently constructible representation for entities on the Web, named the PIC-D representation.
- IE tasks can be posed as similarity queries on the PIC-D representation: Set Expansion, Automatic Set Instance Acquisition, and Column Classification.
- PIC-D results in huge savings in query run-time with comparable quality of results.
- Future work: using the PIC-D representation with many more views of the data (e.g., SVO triples, properties derived from KBs) for unsupervised class-instance pair acquisition.
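The construction and query steps named on this poster (one PIC embedding per bipartite graph with m = √n, concatenation into PIC-D, then a K-NN query around the centroid of the seed entities) can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several things in it are assumptions: the function names `pic_embedding`, `pic_d`, and `knn_expand`, the toy entity-by-feature matrices, taking the entity affinity as F Fᵀ computed implicitly, rounding √n up, a small fixed iteration count in place of PIC's early-stopping rule, and Euclidean distance for K-NN.

```python
import math
import numpy as np

def pic_embedding(F, m, iters=5, seed=0):
    """m-dimensional PIC-style embedding of the rows of a bipartite matrix F (|X| x n).

    Assumption: entity affinity is A = F F^T, applied as F (F^T v) so that
    A is never materialised and each step costs O(nnz(F) * m).
    """
    F = np.asarray(F, dtype=float)
    rng = np.random.default_rng(seed)
    num_entities = F.shape[0]
    # Row sums of A, used to row-normalise the affinity matrix.
    degree = F @ (F.T @ np.ones(num_entities))
    degree[degree == 0] = 1.0
    # One random start vector per embedding dimension.
    V = rng.random((num_entities, m))
    V /= V.sum(axis=0, keepdims=True)
    # A few truncated power-iteration steps (simplified stopping rule).
    for _ in range(iters):
        V = (F @ (F.T @ V)) / degree[:, None]
        V /= np.abs(V).sum(axis=0, keepdims=True)
    return V

def pic_d(views, iters=5, seed=0):
    """Concatenate one PIC embedding per view, with m = ceil(sqrt(n)) per view."""
    blocks = []
    for i, F in enumerate(views):
        F = np.asarray(F, dtype=float)
        m = max(1, math.ceil(math.sqrt(F.shape[1])))
        blocks.append(pic_embedding(F, m, iters=iters, seed=seed + i))
    return np.hstack(blocks)

def knn_expand(embedding, seed_ids, k):
    """Set expansion: k nearest entities to the centroid of the seed entities."""
    centroid = embedding[seed_ids].mean(axis=0)
    dists = np.linalg.norm(embedding - centroid, axis=1)
    dists[seed_ids] = np.inf  # never return the seeds themselves
    return np.argsort(dists)[:k].tolist()

# Hypothetical toy data (not taken from the poster's datasets).
entities = ["USA", "India", "Football", "Hockey", "Baseball"]
table_columns = [[1, 1, 0, 0],   # entity x table-column occurrences
                 [1, 1, 0, 0],
                 [0, 0, 1, 1],
                 [0, 0, 1, 1],
                 [0, 0, 1, 1]]
hearst_concepts = [[1, 1, 0],    # entity x Hearst-concept occurrences
                   [1, 1, 0],
                   [0, 0, 1],
                   [0, 0, 1],
                   [0, 0, 1]]

E = pic_d([table_columns, hearst_concepts])
expanded = knn_expand(E, seed_ids=[2, 3], k=1)
print(E.shape, [entities[i] for i in expanded])
```

In this sketch the query touches only the low-dimensional matrix `E`, which is why the query cost depends on the number of PIC-D dimensions and not on the original number of table columns or concepts.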
Acknowledgements : This work is supported by Google and the Intelligence Advanced Research Projects Activity (IARPA) via Air Force Research Laboratory (AFRL) contract number FA8650-10-C-7058.