Transcript Slides
iTrails: Pay-as-you-go Information Integration in Dataspaces
  Marcos Vaz Salles, Jens Dittrich, Shant Karakashian, Olivier Girard, Lukas Blunschi
  ETH Zurich
  VLDB 2007, September 26, 2007

Outline
  Motivation
  iTrails
  Experiments
  Conclusions and Future Work
  September 26, 2007 Marcos Vaz Salles / ETH Zurich / [email protected]

Problem: Querying Several Sources
  Query: What is the impact of global warming in Zurich?
  Systems: ?
  Data Sources: Laptop, Email Server, Web Server, DB Server

Solution 1: Use a Search Engine
  Query: global warming zurich
  Graph IR Search Engine
  System: TopX [VLDB05], FleXPath [SIGMOD04], XSearch [VLDB03], XRank [SIGMOD03]
  From each data source: text, links
  Data Sources: Laptop, Email Server, Web Server, DB Server
  Drawback: Query semantics are not precise!

Solution 2: Use an Information Integration System
  Query: //Temperatures/*[city = "zurich"]
  Information Integration System
  System: GAV (e.g. [ICDE95]), LAV (e.g. [VLDB96]), GLAV [AAAI99], P2P (e.g. [SIGMOD04])
  Schema: Temps, Cities, CO2, Sunspots, ...
  Data Sources: Laptop, Email Server, Web Server, DB Server (two with a schema mapping, two with a missing schema mapping)
  Drawback: Too much effort to provide schema mappings!

Research Challenge: Is There an Integration Solution in-between These Two Extremes?
  Queries: global warming zurich; //Temperatures/*[city = "zurich"]
  Graph IR Search Engine: text, links from the Data Sources (Laptop, Email Server, Web Server, DB Server)
  ? Pay-as-you-go Information Integration: Dataspace System
  Information Integration System (Temps, Cities, CO2, Sunspots): full-blown schema mappings to the Data Sources
  Dataspace Vision by Franklin, Halevy, and Maier [SIGMOD Record 05]

Outline
  Motivation
  iTrails
  Experiments
  Conclusions and Future Work

iTrails Core Idea: Add Integration Hints Incrementally
  Step 1: Provide a search service over all the data
    Use a general graph data model (see VLDB 2006)
    Works for unstructured documents, XML, and relations
  Step 2: Add integration semantics via hints (trails) on top of the graph
    Works across data sources, not only between sources
  Step 3: If more semantics needed, go back to step 2
  Impact:
    Smooth transition between search and data integration
    Semantics added incrementally improve precision / recall

iTrails: Defining Trails
  Basic Form of a Trail:
    QL [.CL] → QR [.CR]
    QL, QR: queries (NEXI-like keyword and path expressions)
    CL, CR: attribute projections
  Intuition: When I query for QL [.CL], you should also query for QR [.CR]

Trail Examples: Global Warming Zurich
  Query: global warming zurich
  Trail for Implicit Meaning: "When I query for global warming, you should also query for Temperature data above 10 degrees"
    global warming → //Temperatures/*[celsius > 10]
  Trail for an Entity: "When I query for zurich, you should also query for references of zurich as a region"
    zurich → //*[region = "ZH"]
  Temperatures (DB Server):
    date    city    region  celsius
    24-Sep  Bern    BE      20
    24-Sep  Uster   ZH      15
    25-Sep  Zurich  ZH      14
    26-Sep  Zurich  ZH      9

Trail Example: Deep Web Bookmarks
  Query (Web Server): train home
  Trail for a Bookmark: "When I query for train home, you should also query for the TrainCompany's website with origin at ETH Uni and destination at Seilbahn Rigiblick"
    train home → //trainCompany.com//*[origin = "ETH Uni" and dest = "Seilbahn Rigiblick"]

Trail Examples: Thesauri, Dictionaries, Language-agnostic Search
  Data Sources: Email Server, Laptop
  Trail for Thesauri: "When I query for car, you should also query for auto"
    car → auto
  Trails for Dictionary: "When I query for car, you should also query for carro and vice-versa"
    car → carro
    carro → car

Trail Examples: Schema Equivalences
  DB Server:
    Employee (empId, empName, salary)
    Person (SSN, name, age, income)
  Trail for schema match on names: "When I query for Employee.empName, you should also query for Person.name"
    //Employee//*.tuple.empName → //Person//*.tuple.name
  Trail for schema match on salaries: "When I query for Employee.salary, you should also query for Person.income"
    //Employee//*.tuple.salary → //Person//*.tuple.income

Outline
  Motivation
  iTrails
    Core Idea
    Trail Examples
    How are Trails Created?
    Uncertainty and Trails
    Rewriting Queries with Trails
    Recursive Matches
  Experiments
  Conclusion and Future Work

How are Trails Created?
  Given by the user
    Explicitly
    Via Relevance Feedback
  (Semi-)Automatically
    Information extraction techniques
    Automatic schema matching
    Ontologies and thesauri (e.g., WordNet)
    User communities (e.g., trails on gene data, bookmarks)

Uncertainty and Trails
  Probabilistic Trails:
    model uncertain trails
    probabilities used to rank trails
  QL [.CL] → QR [.CR], with probability p, 0 ≤ p ≤ 1
  Example: car → auto, p = 0.8

Certainty and Trails
  Scored Trails:
    give higher value to certain trails
    scoring factors used to boost scores of query results obtained by the trail
  QL [.CL] → QR [.CR], with scoring factor sf, sf > 1
  Examples:
    T1: weather → //Temperatures/*, p = 0.9, sf = 2
    T2: yesterday → //*[date = today() - 1], p = 1, sf = 3

Rewriting Queries with Trails
  Query: weather yesterday
  Trail T2: yesterday → //*[date = today() - 1]
  [Figure: query tree for weather yesterday in three steps: (1) Matching, T2 matches yesterday; (2) Transformation, yesterday U //*[date = today() - 1]; (3) Merging, the result is merged with weather]

Replacing Trails
  Trails that use replace instead of union semantics
  Query: weather yesterday
  Trail T2: yesterday → //*[date = today() - 1]
  [Figure: (1) Matching, T2 matches yesterday; (2) Transformation, yesterday is replaced by //*[date = today() - 1]; (3) Merging, the result is merged with weather]

Problem: Recursive Matches (1/2)
  Query: weather yesterday
  T2: yesterday → //*[date = today() - 1]
  New query still matches T2, so T2 could be applied again
  [Figure: each application of T2 adds another U with //*[date = today() - 1] to the query tree]
  Infinite recursion

Problem: Recursive Matches (2/2)
  Trails may be mutually recursive
  T3: //*.tuple.date → //*.tuple.modified
  T10: //*.tuple.modified → //*.tuple.date
  [Figure: T3 matches //*[date = today() - 1] and adds //*[modified = today() - 1]; T10 matches and adds //*[date = today() - 1] again]
  We again match T3 and enter an infinite loop

Solution: Multiple Match Coloring Algorithm
  T1: weather → //Temperatures/*
  T2: yesterday → //*[date = today() - 1]
  T3: //*.tuple.date → //*.tuple.modified
  T4: //*.tuple.date → //*.tuple.received
  Query: weather yesterday
  First Level: T1 matches weather, T2 matches yesterday
    weather U //Temperatures/*
    yesterday U //*[date = today() - 1]
  Second Level: T3, T4 match
    //*[date = today() - 1] U //*[modified = today() - 1] U //*[received = today() - 1]

Multiple Match Coloring Algorithm Analysis
  Problem: MMCA is exponential in number of levels
  Solution: Trail Pruning
    Prune by number of levels
    Prune by top-K trails matched in each level
    Prune by both top-K trails and number of levels

Outline
  Motivation
  iTrails
  Experiments
  Conclusion and Future Work

iTrails Evaluation in iMeMex
  iMeMex Dataspace System: Open-source prototype available at http://www.imemex.org
  Main Questions in Evaluation
    Quality: Top-K Precision and Recall
    Performance: Use of Materialization
    Scalability: Query-rewrite Time vs. Number of Trails

iTrails Evaluation in iMeMex
  Scenario 1: Few High-quality Trails
    Closer to information integration use cases
    Obtained real datasets and indexed them
    18 hand-crafted trails
    14 hand-crafted queries
  Scenario 2: Many Low-quality Trails
    Closer to search use cases
    Generated up to 10,000 trails

iTrails Evaluation in iMeMex: Scenario 1
  Configured iMeMex to act in three modes
    Baseline: Graph / IR search engine
    iTrails: Rewrite search queries with trails
    Perfect Query: Semantics-aware query
  Data: shipped to central index
  [Table: data sizes in MB for Laptop, Web Server, Email Server, DB Server]

Quality: Top-K Precision and Recall
  Scenario 1: few high-quality trails (18 trails)
  K = 20
  [Figure: top-K precision and recall per query, compared against the perfect query]
  Search Engine misses relevant results
  Search Query is partially semantics-aware
    Q3: pdf yesterday
    Q13: to = raimund.grube@enron.com
  Perfect Query always has precision and recall equal to 1

Performance: Use of Materialization
  Scenario 1: few high-quality trails (18 trails)
  [Figure: response times in sec.]
  Trail merging adds overhead to query execution
  Trail Materialization provides interactive times for all queries

Scalability: Query-rewrite Time vs. Number of Trails
  Scenario 2: many low-quality trails
  Query-rewrite time can be controlled with pruning

Conclusion: Pay-as-you-go Information Integration
  Step 1: Provide a search service over all the data
  Step 2: Add integration semantics via trails
  Step 3: If more semantics needed, go back to step 2
  [Figure: query global warming zurich posed to a Dataspace System over text and links from the Data Sources]
  Our Contributions
    iTrails: generic method to model semantic relationships (e.g. implicit meaning, bookmarks, dictionaries, thesauri, attribute matches, ...)
    We propose a framework and algorithms for Pay-as-you-go Information Integration
    Smooth transition between search and data integration

Future Work
  Trail Creation
    Use collections (ontologies, thesauri, Wikipedia)
    Work on automatic mining of trails from the dataspace
  Other types of trails
    Associations
    Lineage

Questions?
  Thanks in advance for your feedback!
  [email protected]
  http://www.imemex.org

Backup Slides

Related Work: Search vs. Data Integration vs. Dataspaces
  Features / Integration Solution  Search              Dataspaces          Data Integration
  Integration Effort               Low                 Pay-as-you-go       High
  Query Semantics                  Precision / Recall  Precision / Recall  Precise
  Need for Schema                  Schema-never        Schema-later        Schema-first

Personal Dataspaces Literature
  Dittrich, Salles, Kossmann, Blunschi. iMeMex: Escapes from the Personal Information Jungle (Demo Paper).
    VLDB, September 2005.
  Dittrich, Salles. iDM: A Unified and Versatile Data Model for Personal Dataspace Management.
    VLDB, September 2006.
  Dittrich. iMeMex: A Platform for Personal Dataspace Management.
    SIGIR PIM, August 2006.
  Blunschi, Dittrich, Girard, Karakashian, Salles. A Dataspace Odyssey: The iMeMex Personal Dataspace Management System (Demo Paper).
    CIDR, January 2007.
  Dittrich, Blunschi, Färber, Girard, Karakashian, Salles. From Personal Desktops to Personal Dataspaces: A Report on Building the iMeMex Personal Dataspace Management System.
    BTW 2007, March 2007.
  Salles, Dittrich, Karakashian, Girard, Blunschi. iTrails: Pay-as-you-go Information Integration in Dataspaces.
    VLDB, September 2007.

iDM: iMeMex Data Model
  Our approach: get the data model closer to personal information, not the other way around
  Supports:
    Unstructured, semi-structured and structured data, e.g., files & folders, XML, relations
    Clear separation of logical and physical representation of data
    Arbitrary directed graph structures, e.g., section references in LaTeX documents, links in filesystems, etc.
    Lazily computed data, e.g., ActiveXML (Abiteboul et al.)
    Infinite data, e.g., media and data streams
  See VLDB 2006

Data Model Options
  Data Models: Bag of Words, Relational, XML, iDM
  Features:
    Non-schematic data
    Serialization independent
    Support for Personal Data
    Support for Graph data: Relational: specific schema; XML: extension (XLink / XPointer)
    Support for Lazy Computation: Relational: view mechanism; XML: extension (ActiveXML)
    Support for Infinite data: Bag of Words: extension (document streams); Relational: extension (relational streams); XML: extension (XML streams)

Data Models for Personal Information
  [Figure: abstraction-level axis from lower to higher, positioning Physical Level, Document / Bag of Words, XML, Relational, iDM, and Personal Information]

Architectural Perspective of iMeMex
  Application Layer: Search & Browse, Email, Office Tools, ...
  Indexes & Replicas access (warehousing)
  iMeMex PDSMS
    iQL Query Processor, iDM Query Processor, Data Source Query Processor
    Complex operators (query algebra); data source access (mediation)
    Operators: Physical Algebra, Data Cleaning, Content Converters
    Catalog
    Data Store: Result Cache, Indexes & Replicas
    Data Source Plugins
  Data Source Layer: File System, IMAP, DBMS, ...
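
The three rewriting steps (matching, transformation, merging) and the "prune by number of levels" idea from the talk can be sketched in a few lines of Python. This is a minimal illustration under strong simplifying assumptions, not the Multiple Match Coloring Algorithm itself: queries are modeled as lists of string fragments rather than query trees, keyword trails match by exact equality, and attribute trails match by a hypothetical rule that looks for the attribute name at the start of a predicate. The names `Trail`, `apply_trail`, and `rewrite` are invented for this sketch and do not come from iMeMex.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trail:
    name: str
    left: str                # QL: what the trail matches
    right: str               # QR: what should also be queried
    attribute: bool = False  # True for attribute trails such as date -> modified


def apply_trail(trail, fragment):
    """Return the fragment produced by the trail, or None if it does not match."""
    if trail.attribute:
        # Hypothetical matching rule: the attribute name opens a predicate.
        needle = f"[{trail.left} "
        if needle in fragment:
            return fragment.replace(needle, f"[{trail.right} ")
        return None
    # Keyword trails match only when the whole fragment equals the left side.
    return trail.right if fragment == trail.left else None


def rewrite(query, trails, max_levels=2):
    """Expand each query fragment into a list of alternatives (a union)."""
    rewritten = []
    for fragment in query:
        alternatives = [fragment]    # union semantics: the original is kept
        frontier = [fragment]
        for _ in range(max_levels):  # prune by number of levels
            next_frontier = []
            for current in frontier:
                for trail in trails:
                    # (1) matching and (2) transformation
                    produced = apply_trail(trail, current)
                    if produced is not None and produced not in alternatives:
                        alternatives.append(produced)  # (3) merging
                        next_frontier.append(produced)
            if not next_frontier:
                break
            frontier = next_frontier
        rewritten.append(alternatives)
    return rewritten


# Trails taken from the slides; T3 and T10 are mutually recursive.
TRAILS = [
    Trail("T1", "weather", "//Temperatures/*"),
    Trail("T2", "yesterday", "//*[date = today() - 1]"),
    Trail("T3", "date", "modified", attribute=True),
    Trail("T4", "date", "received", attribute=True),
    Trail("T10", "modified", "date", attribute=True),
]

for alternatives in rewrite(["weather", "yesterday"], TRAILS, max_levels=2):
    print(" UNION ".join(alternatives))
```

In this sketch the first level applies T1 and T2, and the second level applies T3 and T4 to the fragment that T2 produced, mirroring the two levels on the Multiple Match Coloring Algorithm slide. The rewrite terminates despite the mutually recursive T3 and T10 because the number of levels is bounded and alternatives that were already produced are not added again.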