retreat slides


Intelligent Information Integration
(I3)
Chun-Nan Hsu
Institute of Information Science
Academia Sinica, Taipei, TAIWAN
Copyright © 1998 Chun-Nan Hsu, All rights reserved
Prepared for a presentation at IIS, AS, Taiwan
October 12, 1998
Lab name TBA
IIS internal talk
1
Information Distribution and
Information Integration

Real query: need a list of attorneys in the Phoenix metro
area specializing in immigration and deportation. Also
show their years in service, educational background,
and languages spoken.

The answer IS on the Web!
» US West yellow page web site
» US bar association member directory web site
» Alumni directory of law schools
» and more…
BUT…
Intelligent Information Integration

Environment assumptions:
» Autonomous information sources
» Heterogeneous but relevant
» Query only (no or limited update allowed)

Desiderata:
» Extensible
– easily add new sources
» Flexible
– can be queried in as many ways as integrated sources
» Scalable
– integrate 1,000s, 10,000s, or 100,000s of relevant information
sources
Solution: Information Integration
Systems (IIS)

Also known as
» information mediation agents, information mediators
» information gathering agents
» information brokering agents, information brokers

Key ideas:
» users access data through a domain model
» information sources represented by a source model
» the mediator reformulates a domain model query into source
model sub-queries
» the mediator constructs a query plan that determines the
order of data flow and execution to retrieve data
Architecture
Human & Computer Users
User Services:
• Query
• Monitor
Information Integration Service
» Domain model: describes the domain (terms and their relations)
» Source model: provides source descriptions and semantic integration
» Query planner: determines how to answer input queries
» Query plan optimizer/executor: optimizes and executes a query plan
» Wrappers (SQL, ORB): provide translation and communication with sources
Heterogeneous Data Sources
» Text, Images/Video, Spreadsheets
» Hierarchical & Network Databases
» Relational Databases
» Object & Knowledge Bases
Query processing flow
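Diagram: flow of a query through the Information Integration Service
» The user query goes to the query planner, which uses the domain model and the source model
» The query planner passes a query plan to the query plan optimizer/executor
» The optimizer/executor sends subqueries to the wrappers (ORB, SQL)
» The wrappers exchange translated queries and data with the information sources
» The answer is returned to the user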
Query plan

Query: Find immigration attorneys in Phoenix and their
educational background...
Us-west.com (result P):
  SELECT name1, phone, address
  FROM LAWFIRMS
  WHERE location = Phoenix
    and class = law-firms
    and specialty = immigration
Law.yale.edu, Law.harvard.edu (result S):
  SELECT name2, year, degree
  FROM ALUMNI
  WHERE name2 is one of name1
Output:
  SELECT *
  FROM P,S,…
  WHERE P.name1 = S.name2
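The three steps of this query plan can be sketched in Python. This is only an illustration under stated assumptions: the in-memory tables, their rows, and the function names are invented for the example; only the column names and the order of the steps come from the slide.

```python
# Hypothetical in-memory stand-ins for the wrapped Web sources.
LAWFIRMS = [
    {"name1": "A. Lee", "phone": "555-0101", "address": "1 Main St",
     "location": "Phoenix", "class": "law-firms", "specialty": "immigration"},
    {"name1": "B. Cruz", "phone": "555-0102", "address": "2 Oak Ave",
     "location": "Phoenix", "class": "law-firms", "specialty": "tax"},
    {"name1": "C. Park", "phone": "555-0103", "address": "3 Elm Rd",
     "location": "Tucson", "class": "law-firms", "specialty": "immigration"},
]
ALUMNI = [
    {"name2": "A. Lee", "year": 1990, "degree": "JD"},
    {"name2": "D. Kim", "year": 1985, "degree": "LLM"},
]


def subquery_p(lawfirms):
    """Yellow-page subquery: immigration law firms in Phoenix."""
    return [
        {"name1": r["name1"], "phone": r["phone"], "address": r["address"]}
        for r in lawfirms
        if r["location"] == "Phoenix"
        and r["class"] == "law-firms"
        and r["specialty"] == "immigration"
    ]


def subquery_s(alumni, names):
    """Alumni subquery, restricted to names already found in P."""
    return [r for r in alumni if r["name2"] in names]


def join(p, s):
    """Final step: SELECT * FROM P, S WHERE P.name1 = S.name2."""
    return [{**pr, **sr} for pr in p for sr in s if pr["name1"] == sr["name2"]]


P = subquery_p(LAWFIRMS)
S = subquery_s(ALUMNI, {r["name1"] for r in P})
output = join(P, S)
```

Passing the names from P into the alumni subquery mirrors the "name2 is one of name1" condition: the second source is asked only about attorneys the first source returned.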
Representation and integration of
domain and source models
SIMS
» Domain representation: LOOM (description logic, object-like with inheritance)
» Integration: links between classes and attributes
TSIMMIS
» Domain representation: OEM (Object exchange model)
» Integration: domain classes as views of sources
IM, Infomaster, Softbot
» Domain representation: Datalog
» Integration: source classes as views of domains
Integrating domain and source
models -- Example
Domain class: pilots(pilot,airline)
Source classes: s1(pilot,aircraft), s2(aircraft,airline)

Domain as view of source:
pilots(pilot,airline) :- s1(pilot,aircraft), s2(aircraft,airline).

Source as view of domain:
s1(pilot,_) :- pilots(pilot,_).
s2(_,airline) :- pilots(_,airline).
Referential integrity constraint on aircraft

Diagram: domain class pilots (pilot, airline) connected by source-links to source classes s1 (pilot, aircraft) and s2 (aircraft, airline).
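The domain-as-view definition amounts to a join of the two sources on aircraft. A minimal Python sketch, assuming small hypothetical source contents (only the schemas of s1 and s2 come from the slide):

```python
# Hypothetical source contents; the aircraft and airline names are invented.
s1 = {("mike", "a1"), ("ann", "a2")}         # s1(pilot, aircraft)
s2 = {("a1", "alpha_air"), ("a2", "beta_air")}  # s2(aircraft, airline)


def pilots_view(s1_rows, s2_rows):
    """Join s1 and s2 on aircraft, keeping (pilot, airline) pairs."""
    return {
        (pilot, airline)
        for pilot, aircraft in s1_rows
        for aircraft2, airline in s2_rows
        if aircraft == aircraft2
    }


pilots = pilots_view(s1, s2)
```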
Representation of queries

Queries
» Enumerate?
» Conjunctive?
» Negation?
» Disjunctive?
» Aggregate operators? (group-by, having, etc. SQL stuff)
Properties of Query Plans

Quality of the answer
» Is anything returned that was not asked for?
» Maximally contained? (due to O. Duschka, 1998)

Executable (retrievable) query plans
» One that contains no domain model term
Query Planning in SIMS -decompose
Q: Pilots for the same airline as Mike
Q(?p) :- pilots(?p, ?a), pilots(mike, ?a).

Sources:
s1(pilot,aircraft).
s2(aircraft,airline).

Decomposed query:
Q(?p) :- s1(mike, ?a), s2(?a, ?al), s2(?a2, ?al), s1(?p, ?a2).

Diagram: domain class pilots (pilot, airline) connected by source-links to source classes s1 (pilot, aircraft) and s2 (aircraft, airline).
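The decomposed query mentions only the source classes s1 and s2, so it can be evaluated directly. A small Python sketch, assuming hypothetical source contents (the rows and the function name are invented for illustration):

```python
# Hypothetical source contents.
s1 = {("mike", "a1"), ("sue", "a2"), ("bob", "a3")}   # s1(pilot, aircraft)
s2 = {("a1", "alpha"), ("a2", "alpha"), ("a3", "beta")}  # s2(aircraft, airline)


def same_airline_as(name, s1_rows, s2_rows):
    """Evaluate Q(?p) :- s1(name,?a), s2(?a,?al), s2(?a2,?al), s1(?p,?a2)."""
    return {
        p
        for pilot, a in s1_rows if pilot == name
        for a_, al in s2_rows if a_ == a
        for a2, al2 in s2_rows if al2 == al
        for p, a2_ in s1_rows if a2_ == a2
    }


answer = same_airline_as("mike", s1, s2)
```

Mike himself appears in the answer because nothing in the query excludes him.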
Query Planning in SIMS -- partition
Q: What are Mike's flight-hours?
Q(?h) :- pilots(mike, ?h).

Domain classes: pilots (pilot, airline, flight-hours); Civil pilots and Military pilots are each a subset-of pilots

Sources:
s3(pilot,aircraft,hours).
s4(pilot,aircraft,hours).

Partitioned subqueries:
Q(?h) :- s3(mike,_,?h).
Q(?h) :- s4(mike,_,?h).

Diagram: domain class pilots with subclasses Civil pilots and Military pilots; source classes s3 and s4 (pilot, aircraft).
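The answer to the original query is the union of the answers to the partitioned subqueries. A minimal Python sketch, assuming hypothetical rows for s3 and s4:

```python
# Hypothetical source contents: (pilot, aircraft, hours).
s3 = {("mike", "b737", 1200), ("ann", "a320", 800)}
s4 = {("mike", "f16", 300)}


def flight_hours(name, *sources):
    """Union of Q(?h) :- s(name, _, ?h) over every given source s."""
    return {
        hours
        for rows in sources
        for pilot, _aircraft, hours in rows
        if pilot == name
    }


hours = flight_hours("mike", s3, s4)
```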
Query planning in SIMS




There are 7 other such operators for query “reformulation” (Arens et al. 1995, JIIS)

In addition, there are 9 other operators for opening a source, moving data around, etc. (Knoblock, 1996, AIPS)

Planning involves selecting appropriate operators and determining the best order for these operators

There are always many choices, and search is required to find the “optimal” query plan
Recursive query plan
Q: Pilots for the same airline as Mike
Q(?p) :- pilots(?p, ?a), pilots(mike, ?a).

Sources:
s1(pilot,aircraft).

Non-recursive query plan?
Maximally contained?
– (this part is due to O. Duschka, 1997, PhD thesis, Stanford Univ.)

Diagram: domain class pilots (pilot, airline) connected by a source-link to source class s1 (pilot, aircraft).
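One way to see why a recursive plan helps here: with only s1 available, a pilot who shares an aircraft with Mike flies for Mike's airline, and so does anyone who shares an aircraft with that pilot, and so on. The Python sketch below computes that fixpoint. It rests on assumptions the slide does not state: every aircraft belongs to exactly one airline, every pilot works for exactly one airline, and the rows of s1 are invented. It is an illustration, not a reproduction of the plan in the cited thesis.

```python
# Hypothetical contents of s1(pilot, aircraft).
s1 = {
    ("mike", "a1"), ("sue", "a1"),
    ("sue", "a2"), ("bob", "a2"),
    ("eve", "a9"),
}


def same_airline_as(name, s1_rows):
    """Fixpoint of the two rules
         Q(?p) :- s1(name, ?a), s1(?p, ?a).
         Q(?p) :- Q(?p1), s1(?p1, ?a), s1(?p, ?a).
    """
    found = set()
    frontier = {name}
    while frontier:
        # Aircraft flown by the pilots discovered in the previous round.
        aircraft = {a for p, a in s1_rows if p in frontier}
        # Every pilot who flies one of those aircraft.
        reached = {p for p, a in s1_rows if a in aircraft}
        frontier = reached - found
        found |= reached
    return found


answer = same_airline_as("mike", s1)
```

Bob never shares an aircraft with Mike directly, so no fixed-length chain of joins chosen in advance is guaranteed to find every such pilot; the loop runs until no new pilots appear.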
Negative results of query planning
using source-as-view

Query planning for a query plan equivalent to an
input datalog query is UNDECIDABLE
» otherwise, theorem proving for first-order logic would be
decidable
» (see O. Duschka, 1998, PhD thesis, Stanford University)

Query planning for conjunctive, comparison-free
queries with a minimal number of sources accessed is
NP-complete
» otherwise, containment of two datalog programs would be
polynomial
» (see A. Levy, 1995, PODS)
Domain as view of source


Simply replacing domain terms in a query with their
view definitions will yield an executable query plan

Adding a new source may require changing the whole
domain model-source model integration
» not a problem for source-as-view
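Replacing domain terms with their view definitions (view unfolding) can be sketched as follows. The encoding is hypothetical: an atom is a (predicate, terms) pair, variables start with "?", and the VIEWS table holds the pilots definition from the earlier example. Fresh names are generated for variables that occur only in a view body; the sketch assumes they do not clash with names already in the query.

```python
import itertools

# Hypothetical encoding of:
#   pilots(?p, ?al) :- s1(?p, ?ac), s2(?ac, ?al).
VIEWS = {
    "pilots": (("?p", "?al"),
               [("s1", ("?p", "?ac")), ("s2", ("?ac", "?al"))]),
}


def unfold(query_atoms, views):
    """Replace each domain atom by the body of its view definition."""
    fresh = itertools.count(1)
    plan = []
    for pred, args in query_atoms:
        if pred not in views:
            plan.append((pred, args))  # already a source term
            continue
        head_vars, body = views[pred]
        subst = dict(zip(head_vars, args))
        n = next(fresh)
        for body_pred, body_args in body:
            # Head variables take the query's terms; others get fresh names.
            new_args = tuple(
                subst.get(t, f"{t}_{n}") if t.startswith("?") else t
                for t in body_args
            )
            plan.append((body_pred, new_args))
    return plan


query = [("pilots", ("?p", "?a")), ("pilots", ("mike", "?a"))]
plan = unfold(query, VIEWS)
```

The resulting plan contains only s1 and s2 atoms, so it is executable in the sense used earlier: it contains no domain model term.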
Query optimizations

Semantic query optimization (Hsu and Knoblock, 1999, IEEE TKDE)

Less “semantic” (using local completeness, functional dependency, etc.) (Kwok AAAI-96, Levy)

Exploring parallelism in plans (Knoblock, IJCAI-95)

Replanning failed retrieval (Knoblock, IJCAI-95)

Caching (static)

Dynamic caching (using partial results from subqueries)
Basic idea of adaptive
semantic query optimization
Input Query: Give me all the papers written by “Chunnan”

Semantic rules, produced from the databases by BASIL (semantic rules learner/KDDer):
R1: If AUTHOR is an “AIer”, then PAPER is an “AI” paper
R2: “Chunnan” is an “AIer”
R3: ...

PESTO (query optimizer) applies the rules to the input query

Optimized Query: Give me all the “AI” papers written by “Chunnan”
Web wrapper
Name        Degree  School     Affiliation
WL Hsu      PhD     Cornell    IIS, Sinica
CS Ho       PhD     NTU        EE, NTIT
C.Chen      PhD     SUNY       EE, NTIT
C.Wu        PhD     Utexas     Cedu, NNU
Mark Liao   PhD     NWU        IIS, Sinica
CJ Liau     PhD     NTU        IIS, Sinica
WK Cheng    PhD     TKU        Tunghai
WC Wang     MS      Syracuse   FIT
:           :       :          :
Wrapper construction



For structured databases, wrapper construction is an
engineering problem

Web sources require an information extractor

Hand-encoded Web information extractor?
» Web pages change frequently (8% monthly failure rate at
Junglee)

Web wrapper induction? YES (Hsu 1999, J of Info Systems;
Kushmerick 1997, PhD Thesis, U of WA)

XML will make wrapper induction easier
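To make the idea of a Web information extractor concrete, here is a small Python sketch of a delimiter-based wrapper: each attribute is located by a left and a right delimiter string. The delimiters are written by hand here; a wrapper induction system would instead learn them from labeled example pages. The page fragment, the delimiters, and the function name are invented for illustration and are not taken from the cited work.

```python
def lr_extract(page, delimiters):
    """Extract rows using one (left, right) delimiter pair per attribute."""
    rows, pos = [], 0
    while True:
        row = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return rows  # no further rows (a partial row is dropped)
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return rows
            row.append(page[start:end])
            pos = end + len(right)
        rows.append(tuple(row))


# Hypothetical page fragment resembling the faculty table.
PAGE = (
    "<tr><td><b>WL Hsu</b></td><td><i>PhD</i></td></tr>"
    "<tr><td><b>WC Wang</b></td><td><i>MS</i></td></tr>"
)
rows = lr_extract(PAGE, [("<b>", "</b>"), ("<i>", "</i>")])
```

If the site changes its markup, for example from <b> to <strong>, the hand-written delimiters stop matching, which is the maintenance problem the slide describes.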
Major players (1)

SIMS, Ariadne
» Arens, Knoblock, Minton, Shen, Hsu, et al. at ISI of USC
» flexible query planner, adaptive semantic query optimizer

Information Manifold
» Levy, Srivastava, Kirk, et al. at AT&T Labs
» query reformulation, relevant source selections

TSIMMIS
» Hammer, Garcia-Molina, Papakonstantinou, Ullman et al. at
Stanford University
» object-based data modeling (OEM)
Major players (2)

Softbot family: Occam, Razor, etc.
» Etzioni, Weld, Kwok, Kushmerick, Friedman
» Fast query planning, wrapper induction, query optimization

Infomaster
» Duschka and Genesereth at Stanford
» recursive query plans, theoretical analysis of III

Others
» HERMES at U of Maryland, Broker Agents at SRI,
Ontobroker at AIFB Germany, etc.
» Taiwan? Academia Sinica (WL Hsu, CN Hsu) and VF at
NTU (YJ Hsu), others?
Positive results of intelligent
information integration

Spin-offs
» Junglee (www.junglee.com)
– Key scientists: Mike Stonebraker? Peter Norvig
– Largest integration: 700 Web sites, 30 attributes, 1000+
wrappers
– Bought by Amazon.com for ~$180,000,000
» Jango (www.jango.com)
– Key scientists: Dan Weld, Oren Etzioni
– Bought by Excite.com

Startups
» Socratix Systems
– Key scientist: Oliver Duschka
Competing alternatives

Hardwired
» mostly applied?

Schema Integration
» dying?

Distributed Heterogeneous Multi-Databases
» dying? Name too long?

Data warehousing
» kicking real good!
» Updating a tough problem

Software Reverse Engineering
» Taiwan has an edge on this?
Research challenges







Optimization

Probabilistic representation of domain-source models

Probabilistic query answering; anytime, imprecise query answering

Automatically locating and integrating relevant new sources

Sharing information between incompatible sources (F-C? Exchange rate? Aliases?)

Wrapper induction

Cooperative agents for information integration
Information sources of intelligent
information integration

Journals
» Journal of intelligent systems, information systems,
intelligent information systems, cooperative information
systems, agents(?)
» Other journals for AI, databases

Meetings
» 1998 AAAI Workshop on AI & Information Integration
» 1998 ECAI Workshop on Intelligent Information Integration
» 1997 SIGMOD Workshop on Semistructured data
» 1997 German Annual Conference On AI Workshop on III
» 1995 AAAI Symposium on Information Gathering from
Distributed Heterogeneous Sources
More information sources on
Intelligent Information Integration


Best papers usually published in AAAI, IJCAI,
SIGMOD and PODS
Upcoming meeting:
» IJCAI-99 WORKSHOP on Intelligent Information Integration
(proposed)
The Future
Networked Information Mediators
Human & Computer Users
Mediators:
» Attorney Mediator
» Immigration Attorney Mediator
» Phoenix Attorney Mediator
» Good Attorney Mediator
» Bad Attorney Mediator
» Law school Mediator
Heterogeneous Data Sources