Information Integration: A Status Report
Alon Halevy
University of Washington, Seattle
IJCAI 2003

[Figure: mediated schema with entities Entity, Phenotype, Gene, Sequenceable Entity, Protein, Experiment, Nucleotide Sequence, Microarray Experiment, and Structured Vocabulary, over the sources OMIM, SwissProt, HUGO, GeneClinics, LocusLink, GO, Entrez, and GEO]
Query: For the micro-array experiment I just ran, what are the related nucleotide sequences and for what protein do they code?

Motivation and Activity
• Application areas of data integration:
  • Enterprise information integration ($$)
  • The government
  • Data sources on the web
  • Scientific data sharing
• Several data sharing architectures: virtual data integration, warehousing, message passing, web services.
• Many research projects. Mine: Information Manifold, Tukwila, LSD, Piazza.
• EII: a new industry buzzword.

Today's Agenda
Recent progress
• Mediation languages
• Query processing (XML and other)
• Some lessons from the commercial world
Current challenges
• Enabling large-scale data sharing: peer-data management systems
• The age-old problem: semantic heterogeneity
• A new agenda item for AI: corpus-based KR
AI is more vital than ever for progress here!

Mediation Languages
• Goal: a language for specifying semantic relationships (not full FOL).
• Assume: data at the sources is structured (or seems so).
[Figure: query Q on the Mediated Schema; queries Q' on five Sources]

Global-as-View (GAV)
Actor(x,y) :- R1(x,y,z)
Actor(x,y) :- R2(x,z), R3(z,y)
[Figure: Mediated Schema (Title, Actor, …) over Sources R1, R2, R3, R4, R5]

Local-as-View (LAV, GLAV)
R1(x,y,z) :- Title(x,y), Actor(x,z), y < 1970
R5(x,y,z) :- Movie(x,y,"French")
[Figure: Mediated Schema (Title, Actor, …) over Sources R1, R2, R3, R4, R5]

Mediation Languages: Summary
• A lot of nice theory and practical algorithms.
• Careful choice of expressive power mattered.
• Algorithms for answering queries using views are in every commercial DBMS.
• Description Logics: also an attractive formalism for mediation.
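As a rough illustration of the two GAV rules for Actor(x,y) given on the Global-as-View slide, the sketch below evaluates each rule body over in-memory source relations and unions the results. The tuples, the reading of x as a title and y as an actor, and the function name `actor` are assumptions made for this example only; they are not from the talk.

```python
# Hypothetical source instances; every tuple here is made up for illustration.
R1 = {("Casablanca", "Bogart", 1942)}   # R1(x, y, z)
R2 = {("Vertigo", "m7")}                # R2(x, z)
R3 = {("m7", "Stewart")}                # R3(z, y)


def actor():
    """Evaluate both GAV rules for Actor(x, y) and union their answers."""
    # Actor(x,y) :- R1(x,y,z)   (project away z)
    from_r1 = {(x, y) for (x, y, _z) in R1}
    # Actor(x,y) :- R2(x,z), R3(z,y)   (join on the shared variable z)
    from_r2_r3 = {(x, y) for (x, z) in R2 for (z2, y) in R3 if z == z2}
    return from_r1 | from_r2_r3


print(sorted(actor()))
```

Under GAV, a query over the mediated relation is answered by unfolding it into the rule bodies in this way. LAV rules run in the opposite direction (sources described as views over the mediated schema) and need an answering-queries-using-views algorithm, which this sketch does not attempt.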
The bottleneck is coming up with the mapping expressions.

Outline
Recent progress
• Mediation languages
• Query processing (XML and other)
• Some lessons from the commercial world
Current challenges
• Enabling large-scale data sharing: peer-data management systems
• The age-old problem: semantic heterogeneity
• A new agenda item for AI: corpus-based KR

Adaptive Query Processing
• Problem: no statistics, network unstable.
• Cannot "plan and then execute"; need to adapt the plan during execution.
• Ideas already in Ingres (1976), an early database system.
• Interleaving planning and execution (AI).
• Key question: timing and granularity of adaptation. For every tuple? At materialization points?
• See [Ives et al. 2002] for our solution.

Convergent Query Processing [Ives et al., 2002]
• Example: join In-stock, Orders, Shipping (I, O, S).
[Figure: query plans over data partitions I0, O0, S0; I1, O1, S1; I2, O2, S2, plus a "cleanup" query plan]

XML Query Processing
• XML facilitates integration.
• The mediator query processor may manipulate XML directly.
• Challenges:
  • XML is not flat, but nested; path queries.
  • XML can be irregular; it doesn't adhere to a strict schema.
• Progress:
  • Defining and optimizing XQuery.
  • Going back and forth: XML to relational.

The Commercial World
• Some startups: Nimble, MetaMatrix, Calixa, Composite, Enosys.
• Big guys making announcements: IBM, BEA, MS (Oracle still being defiant).
• Integration technology in different layers, e.g., reporting companies want it (Actuate).
• Progress: analysts have a buzzword: EII.
• Challenges: Integration with EAI? Yet another middleware? Horizontal vs. vertical?

What Worked?
• Performance was not an issue.
• Tools, tools, tools: for managing sources and creating mediated schemas.
• XML query processing was needed.
• Concordance: need common keys to join sources. Active research area!

Outline
Recent progress
• Mediation languages
• Query processing (XML and other)
• Some lessons from the commercial world
Current challenges
• Enabling large-scale data sharing: peer-data management systems
• The age-old problem: semantic heterogeneity
• A new agenda item for AI: corpus-based KR

Limitations of Mediated Schema
[Figure: query Q on a single Mediated Schema; queries Q' on five Sources]

Peer Data-Management
• PDMS: a network of peers (data sources).
• Peers can:
  • Export base data, or combinations of data
  • Serve as logical mediators for other peers
• A peer can be both a server and a client.
• Semantic relationships are specified locally (between small sets of peers).
• This is a Semantic Web (different angle).

Network of Mappings (Piazza)
[Figure: peers CiteSeer, UW, Stanford, DBLP, Roma, Vienna, and Paris connected by GAV, LAV, and GLAV mappings; queries Q, Q', Q'' at the peers]

Advantages of PDMS
• No need for a central mediated schema.
• Can map data opportunistically, as is most convenient.
• Queries are posed using the peer's schema.
• Answers come from anywhere in the system.
• Infrastructure for Semantic Web applications.
• This is not P2P file sharing:
  • Data has rich semantics
  • Membership is not as dynamic

Schema Mediation for PDMS
• When can LAV and GAV be combined to form such a network structure? (Semantics not yet obvious.)
• See [ICDE-03], [WWW-03 for XML].
[Figure: the same mapping network over CiteSeer, UW, Stanford, DBLP, Roma, Vienna, and Paris]

Efficient Query Answering
Problems:
• redundant paths
• expensive reformulation
Possible solution:
• Pre-compose some paths
[Figure: the same mapping network over CiteSeer, UW, Stanford, DBLP, Roma, Vienna, and Paris]

Mapping Composition [Jayant Madhavan and Halevy, VLDB 2003]
• Incredibly subtle!
• In general, composition can be an infinite set of GLAV formulas.
• Results:
  • Finite in many cases
  • Even when infinite, often has a finite, useful encoding
• Hence, compositions can usually be pre-optimized.

Other Research Issues
• Intelligent data placement
• Management of mapping networks
• Improving networks: finding additional connections
• Handling inconsistencies
[Figure: mapping network over peers CiteSeer, UW, Stanford, DBLP, Saarbruecken, Berlin, and Leipzig; queries Q, Q', Q'' at the peers]

PDMS-Related Projects
• Hyperion (Toronto)
• PeerDB (Singapore)
• Local relational models (Trento)
• Edutella (Hannover, Germany)
• Semantic Gossiping (EPFL, Lausanne)
• Raccoon (UC Irvine)
• Orchestra (Ives, U. Penn)

Outline
Recent progress
• Mediation languages
• Query processing (XML and other)
• Some lessons from the commercial world
Current challenges
• Enabling large-scale data sharing: peer-data management systems
• The age-old problem: semantic heterogeneity
• A new agenda item for AI: corpus-based KR

Schema/Ontology Matching
• Schema heterogeneity: a key roadblock for information integration.
• Different data sources speak their own schema.
• Mapping is key to any data sharing architecture.
[Figure: a Consumer, a Mediator, and three Data Sources; vocabularies shown: (Hotel, Restaurant, AdventureSports, HistoricalSites), (Hotel, Gaststätte, Brauerei, Kathedrale), (Lodges, Restaurants, Beaches, Volcanoes)]

Schema Matching
Inventory Database A
• BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords)
Inventory Database B
• Books(Title, ISBN, Price, DiscountPrice, Edition)
• Authors(ISBN, FirstName, LastName)
• BookCategories(ISBN, Category)
• CDCategories(ASIN, Category)
• CDs(Album, ASIN, Price, DiscountPrice, Studio)
• Artists(ASIN, ArtistName, GroupName)
Schema matching: discovering correspondences between similar elements.
Eventually…
BooksAndMusic(x:Title,…) = Books(x:Title,…) ∪ CDs(x:Album,…)

Typical Approaches
Multiple sources of evidence in the schemas:
• Schema element names: BooksAndCDs/Categories ~ BookCategories/Category
• Descriptions and documentation: ItemID: unique identifier for a book or a CD; ISBN: unique identifier for any book
• Data types, data instances: DateTime, Integer; addresses have similar formats
• Schema structure: all books have similar attributes
• Use domain knowledge
In isolation, techniques are incomplete or brittle.
Combine multiple techniques to exploit all available evidence.

Philosophy of Solutions
• Effective schema matching requires a principled combination of techniques.
• Like human experts, the matcher should improve over time.
• LSD: mapping data sources to a mediated schema.
  • Use a few mappings as training examples to learn hypotheses for elements of the mediated schema.
  • See [Doan et al., SIGMOD-2001, MLJ-2003].
• Next step: corpus-based matching.

Corpus-Based Matching
• Collection of schemas and mappings.
• Identify common concepts and patterns:
  • Books, Authors, Publishers, …
  • Books: Title, Author, Price, Publisher
• Reuse extracted information to match new schemas.
[Figure: corpus of books and inventory schemas, with element labels Music, Books, Authors, Items, Artists, Information, Publisher, Literature, CDs, Categories]

Mapping Knowledge Base
• Schemas and mappings: accumulated over time.
• Learners: extract knowledge from schemas and mappings.
  • Name Learner (NL), Data Instances Learner (DIL), Data Type Learner (DTL), Description Learner (DL), Structure Learner (SL), Meta Learner (ML)
• Learned models: for each unique element in any schema (C1 through CN, each with NL:…, DIL:…, DTL:…, DL:…, SL:…, ML:…).

Mapping Knowledge Base
Preliminary results: the corpus is useful.
[Figure: Shipping Domain. Average number of matches (series: Only MKB, Only BASIC) across schema pairs P1a, P1b, P2a, P2b, P3a, P3b, P4a, P4b]

With and without the corpus
[Figure: Inventory Domain. Recall (series: MKB, BASIC, COMB) across schema pairs P1a, P1b, P2a, P2b, P3a, P3b, P4a, P4b]

Outline
Recent progress
• Mediation languages
• Query processing (XML and other)
• Some lessons from the commercial world
Current challenges
• Enabling large-scale data sharing: peer-data management systems
• The age-old problem: semantic heterogeneity
• A new agenda item for AI: corpus-based KR

Corpus vs. Traditional KR
• A large corpus of uncoordinated knowledge fragments vs. a carefully designed knowledge base.
• Can a corpus offer a more attractive solution for some KR problems?

Pause: KR vs. Corpus
• Knowledge base:
  • Hard to engineer, brittle at the boundaries
  • Only one way of saying things
• Corpus:
  • "Easier" to build, coverage not predefined
  • Many views of the domain
See proceedings for full argument.

Corpus-based KR
• Contents: schemas, ontologies, meta-data, data, queries, mappings.
• Collect statistics on the corpus:
  • How often does a word appear as a relation name?
  • When it does, what tend to be the attribute names?
  • What other tables are there?
• Support a KR-style interface on the corpus (OKBC-like).

Other Applications of C-B-KR
• Question answering on the web
• Focused crawling
• Natural language interfaces to DBs
• Schema and ontology authoring
• Semantic query optimization
Whenever we need knowledge to help us rank multiple answers/plans.

Example Queries
• How are two terms related?
  GPA(studentID, $value), Student(studentID, GPA, address)
• Find different ways of saying the same:
  Class(Lexus, Luxury), LuxuryCar(Lexus, Toyota)
• When do two terms play similar roles?
  IJCAIReview(p1, rev2, accept), AIJReferees(round2, p3, rev4, reject)

Challenges for C-B-KR
• Building the corpus.
• How focused should the corpus be?
• Is human tuning needed or helpful?
• How do we accommodate inference?
• How do we leverage traditional KR?

Summary
• The vision: data authoring, querying and sharing by everyone.
• We got the plumbing to work. To go further, we need AI techniques.
• Challenge: cross the structure chasm. It's hard to author and query structured data!
• PDMS: architecture for ad-hoc sharing.
• Ontology/schema matching is key! Are we providing the right tools?
• Corpus-based knowledge representation.
• We need benchmarks!

Some References
• www.cs.washington.edu/homes/alon
• Piazza: ICDE03, WWW03, VLDB-03
• The Structure Chasm: CIDR-03
• Mediation surveys: VLDB Journal 01, Lenzerini tutorial
• Schema matching: Rahm and Bernstein, VLDB Journal 01
• Workshops: IJCAI, Semantic Web Conf.
• Teaching integration to undergraduates: SIGMOD Record, September, 2003
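The three corpus statistics listed on the Corpus-based KR slide (how often a word appears as a relation name, which attribute names it tends to have, and what other tables appear alongside it) could be collected roughly as follows. This is a minimal sketch: the toy corpus, its dictionary representation, and all function names are assumptions for illustration, not the data structures used in the work described.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: each schema maps relation names to attribute names.
CORPUS = [
    {"Books": ["Title", "Author", "Price"], "Authors": ["Name"]},
    {"Books": ["Title", "ISBN", "Price"]},
    {"CDs": ["Album", "Price"]},
]


def relation_name_counts(corpus):
    """How often does each word appear as a relation name?"""
    return Counter(rel for schema in corpus for rel in schema)


def attribute_counts(corpus):
    """For each relation name, how often does each attribute name occur?"""
    by_relation = defaultdict(Counter)
    for schema in corpus:
        for rel, attrs in schema.items():
            by_relation[rel].update(attrs)
    return by_relation


def co_occurring_relations(corpus, rel):
    """Which other relation names share a schema with `rel`, and how often?"""
    counts = Counter()
    for schema in corpus:
        if rel in schema:
            counts.update(name for name in schema if name != rel)
    return counts


print(relation_name_counts(CORPUS)["Books"])
print(attribute_counts(CORPUS)["Books"].most_common())
print(co_occurring_relations(CORPUS, "Books"))
```

Counts like these are only the raw material; a KR-style interface of the kind the slide mentions would sit on top of them and would also need to handle synonyms and noisy names, which this sketch ignores.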