Automated Resolution of Semantic Heterogeneity in


Agenda
• Brief introduction to distributed databases and multidatabases
• Automated Resolution of Semantic Heterogeneity in Multidatabases
• SEMINT: A tool for identifying attribute correspondences in heterogeneous databases using neural networks
• Context Interchange: New Features and Formalisms for the Intelligent Integration of Information
7/17/2015
1
Distributed Database Architecture
[Figure: External schemas 1..n sit above the global conceptual schema, which integrates local conceptual schemas 1..n; each local conceptual schema is backed by a local internal schema at its site]
• A distributed database management system (DDBMS) is the software system that manages the distributed database and makes the distribution transparent to the users.
• Local internal schema: the data are physically located at different sites; each site has its own local internal schema.
• Local conceptual schema: because data in a distributed database are usually fragmented and replicated, the local conceptual schema describes the logical organization of the data at each site.
• Global conceptual schema: a union of the local conceptual schemas; it provides a single view of the database.
• External schema: supports user applications' access to the database; it sits above the global conceptual schema.
Distributed Multidatabase Systems
• Besides the issues common to distributed database
systems, distributed multidatabase systems need to
address database integration, global query language
and query translation issues.
• Two steps in database integration
1. Translation: local database schemas are translated (mapped) to a common intermediate canonical representation.
2. Schema integration: each intermediate schema is integrated into a global conceptual schema.
• With database integration, a multidatabase system
provides integrated global access to heterogeneous,
autonomous local databases in a distributed system.
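The two integration steps above can be sketched in a few lines of Python. Everything here is invented for illustration: the toy schemas, the canonical-name table, and the representation of a schema as (relation, attribute) pairs are assumptions, not the paper's actual machinery.

```python
# Step 1: Translation - map each local schema to a canonical form
# (here: a set of (relation, attribute) pairs with canonical names).
# The canonical-name table below is a toy stand-in for a real mapping.

CANONICAL_NAMES = {"wages": "pay", "salary": "pay", "town": "city"}

def translate(local_schema):
    """Map a local schema {relation: [attrs]} to canonical (relation, attr) pairs."""
    canonical = set()
    for relation, attrs in local_schema.items():
        for attr in attrs:
            canonical.add((relation, CANONICAL_NAMES.get(attr, attr)))
    return canonical

# Step 2: Schema integration - merge the intermediate schemas into one
# global conceptual schema (here: union of attribute names per relation).

def integrate(intermediate_schemas):
    global_schema = {}
    for schema in intermediate_schemas:
        for relation, attr in schema:
            global_schema.setdefault(relation, set()).add(attr)
    return global_schema

db1 = {"staff": ["name", "wages"]}
db2 = {"staff": ["name", "salary", "town"]}
gcs = integrate([translate(db1), translate(db2)])
print(sorted(gcs["staff"]))  # ['city', 'name', 'pay']
```

The point of the sketch is the pipeline shape: per-database translation first, then a single integration step over the intermediate results.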
Database Integration Process
[Figure: Local databases pass through per-database translators to intermediate schemas 1..n, which an integrator combines into the global conceptual schema]
Three major design approaches for
multidatabase systems
1. Global-Schema Multidatabases
2. Federated databases
3. Multidatabase Language Systems
(The Summary Schemas Model is based on approach #3.)
1. Global-Schema Multidatabases
Another layer is added above the local external schemas.
Benefits:
• Global users essentially see a single, large, integrated database.
Problems:
• The amount of global knowledge: a global schema can be a very large data object.
• The global DBA must understand all the local optimizations.
• Changes to local schemas must be reflected in the global schema.
2. Federated databases
Federated databases only require partial
integration. A federated database integrates a
collection of local database systems by
supporting interoperability between pairs or
collections of the local databases rather than
through a complete global schema.
3. Multidatabase Language Systems
• Attempt to resolve some of the problems associated
with global schemas
• Beyond standard database capabilities, most of the
language extensions are involved with manipulating
differing data representations. The language must
have the ability to transform source information into
the integrated representations.
Automated Resolution of Semantic
Heterogeneity in Multidatabases
M.W. Bright, A.R. Hurson and S. Pakzad
June 1994, ACM Transactions on Database Systems
Summary
• Problem: identifying semantically similar data in different
local databases.
• Solution: Summary Schemas Model
• Key characteristics of solution: accept imprecise query;
use semantic distance for similarity; use Roget
thesaurus as the taxonomy to construct hierarchical
summary schemas
Summary Schemas Model
• One study showed that the probability of two subjects picking the same term for a given entity ranged from 7% to 18%.
• The Summary Schemas Model (SSM) extends multidatabase systems with linguistic support to automatically identify semantically similar entities that are named with different terms.
• The proposed Summary Schemas Model provides a user-friendly interface by allowing users to specify queries in their own terms (imprecise queries) and/or to use imprecise terms for data references.
Large-System User Interfaces
• In a small data access system, it is reasonable to
expect all users to learn the exact names for different
entities.
• In large distributed systems, it is unreasonable to expect users to know the exact system access terms (and time-consuming to find them manually).
– a reason for the Multidatabase Language Systems approach
Semantic Relationships
• Synonym: a semantically similar term (similar meaning), e.g., salary is similar to wage.
• Hypernym: a term with a broader, more general meaning, e.g., earnings is broader than salary.
• Hyponym: the opposite of a hypernym: a term with a narrower, more specific meaning.
Summary Schemas Model
• The summary schemas and the online taxonomy
map imprecise terms to the semantically closest
terms that actually exist in the system.
• Taxonomy: 1965 version of Roget’s Thesaurus
All Summary Schemas Model processing is
described in terms of Roget’s structure.
• A summary schema represents the input data in a
more abstract manner and consequently needs fewer
terms to describe the information.
• e.g., Consider two base relations: one includes the attributes “city” and “zip code”, while the other has “city” and “country.”
A global-schema representation of these schemas might have a generalized object with the “city”, “zip code”, and “country” attributes. The global schema is a precise representation of the base schemas.
The summary schema for the same base relations may represent the input attributes with a single access term (hypernym): “location”.
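The abstraction step in this example can be shown with a toy hypernym table. The table below is invented for illustration; the SSM derives its hypernyms from Roget's Thesaurus, not from a hand-written dictionary.

```python
# Toy illustration of summary-schema abstraction: several precise
# attribute terms collapse to a single hypernym access term.

HYPERNYM = {
    "city": "location",
    "zip code": "location",
    "country": "location",
    "salary": "pay",
    "wages": "pay",
}

def summarize(attributes):
    """Replace each attribute by its hypernym, collapsing duplicates."""
    return sorted({HYPERNYM.get(a, a) for a in attributes})

print(summarize(["city", "zip code"]))  # ['location']
print(summarize(["city", "country"]))   # ['location']
```

Both base relations summarize to the single term "location", which is why the summary schema needs fewer terms than a precise global schema.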
Schema summation & Summary schema
hierarchy
• Leaf nodes map to the terms in the local schemas.
• Each internal level of the hierarchy has a hypernym relationship with its child node(s).
• The Summary Schemas Model structures multidatabase nodes in a hierarchy; each internal node contains a summary schema.
Semantic-Distance Metric
• A key feature of the SSM is the ability to identify
semantically similar entities. The Semantic-Distance
Metric (SDM) provides a quantitative measurement of
“semantic similarity.”
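The link-counting idea behind the SDM can be sketched directly. The parent table below is a toy stand-in for the Roget-derived hypernym hierarchy, and the terms in it are illustrative only.

```python
# A minimal sketch of a Semantic-Distance Metric as a count of
# hypernym links between two terms in a (toy) hierarchy.

PARENT = {          # term -> its hypernym
    "salary": "pay",
    "wages": "pay",
    "pay": "payment",
    "payment": "possessive relations",
}

def ancestors(term):
    """Return [term, parent, grandparent, ...] up to the root."""
    chain = [term]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def semantic_distance(a, b):
    """Number of hypernym links between a and b via their closest common ancestor."""
    chain_a, chain_b = ancestors(a), ancestors(b)
    for i, term in enumerate(chain_a):
        if term in chain_b:
            return i + chain_b.index(term)
    return float("inf")  # no common ancestor in the toy table

print(semantic_distance("salary", "wages"))  # 2 (salary -> pay <- wages)
print(semantic_distance("pay", "salary"))    # 1 (one hypernym link)
```

A small distance marks two terms as semantically close; a query system can then accept either term as a reference to the same data.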
Imprecise-Query Processing
• When an imprecise reference is detected at the query
origin node, the summary schemas structure is used to
match the reference to a semantically close precise
reference, and query processing proceeds normally with
precise data references.
Example of Summary Schemas Model query processing
NODE 1 (leaf): external schema – (staff, employees, supervisor, wages, address)
NODE 2 (leaf): external schema – (personnel, engineers, manager, salary, town, income, city)
1st-level internal node summary schema:
NODE 3: (personnel<1,2>, worker<1>, manager<1,2>, pay<1,2>, locality<1>, artisan<2>, district<2>, earnings<2>, polity<2>)
2nd-level internal node summary schema:
NODE 4: (agent<3>, director<3>, payment<3>, location<3>, region<3>, acquisition<3>, authority<3>)
3rd-level internal node summary schema:
NODE 5: (voluntary action<4>, possessive relations<4>, space in general<4>, general<4>)
GIVEN:
The Semantic-Distance Metric (SDM) is a simple count of links in the hypernym hierarchy. The SDM threshold used is 1.
The user is somewhat familiar with the STAFF and PERSONNEL databases and wants to retrieve employees that make more than $20,000. The user is unsure of the access term for wages in Node 2.
The LET command in the query represents the multidatabase language system's ability to combine multiple data references into a single term.
QUERY:
LET PERS = NODE1.STAFF AND NODE2.PERSONNEL;
/* open two databases with precise data references */
LET EMP = NODE1.EMPLOYEES AND NODE2.ENGINEERS;
/* combine two precise relation references */
LET PY = NODE1.WAGES AND PAY;
/* “pay” is an imprecise data reference */
SELECT ID, PY
FROM PERS.EMP
WHERE PY > 20000;
QUERY PROCESSING:
Step 1) At Node 1, parse the query. PY(pay) is an imprecise reference, so pass the query to Node 3.
Step 2) Node 3 has a summary schema term “pay” which is 0 links away from PY(pay). Since the SDM at Node 3 is less than 1, send a message down the hierarchy to see if the actual access term at Node 2 is within the specified semantic distance.
At Node 2, the access term “salary” is 1 link away from “pay” (the hypernym link). Replace PY(pay) with PY(NODE2.SALARY) and return to Node 3.
Step 3) All data references are now precise.
Step 4) Execute the multidatabase query.
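The resolution steps above can be sketched as a lookup against the summary schemas. The node contents mirror the example, but the traversal logic here is heavily simplified and invented for illustration; it is not the SSM's actual algorithm.

```python
# Toy resolution of an imprecise term via a summary schema node.
# SUMMARY records which child nodes contributed each summary term;
# LOCAL records the precise access term at each leaf for a given hypernym.

SUMMARY = {3: {"personnel": [1, 2], "worker": [1], "manager": [1, 2],
               "pay": [1, 2], "locality": [1]}}
LOCAL = {1: {"pay": "WAGES"}, 2: {"pay": "SALARY"}}

def resolve(term, internal_node, target_leaf):
    """Map an imprecise term to a precise access term at target_leaf,
    if the summary schema at internal_node covers it."""
    contributors = SUMMARY[internal_node].get(term)
    if not contributors or target_leaf not in contributors:
        return None  # term not summarized here, or leaf did not contribute it
    # descend one hypernym link from the summary term to the local term
    return LOCAL[target_leaf].get(term)

print(resolve("pay", 3, 2))  # SALARY
print(resolve("pay", 3, 1))  # WAGES
```

As in Step 2 of the example, the imprecise reference "pay" resolves to the precise local access term SALARY at Node 2.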
SEMINT: A tool for identifying attribute
correspondences in heterogeneous databases
using neural networks
Wen-Syan Li, Chris Clifton
2000 Data & Knowledge Engineering
Summary
• Problem: Semantic integration - identifying
relationships between attributes or classes in different
database schemas
• Solution: SEMantic INTegrator (SEMINT)
• Key characteristics of solution: utilizes both schema
information and data contents; neural network
• Evaluations: Preliminary experiment; The Boeing
company tooling database; US NSF and Canada
NSERC funding award databases
Related work
• Comparing attribute names
This approach assumes that the same attribute may be
represented by synonyms in different databases.
– abbreviations
– homonyms
• Comparing field specifications at the schema level
attribute matches are 'pre-programmed' by DBAs
• Comparing attribute values and patterns in the data
content level
Relationships and entity sets can be integrated primarily based
on their domain relationships: EQUAL, CONTAINS, OVERLAP,
CONTAINED-IN, and DISJOINT. Determining such relationships can be time-consuming and tedious.
SEMantic INTegrator (SEMINT)
• The authors focus on the problem of identifying
corresponding attributes in different DBMSs that
reflect the same real-world class of information.
• Observations: Attributes in different databases that
represent the same real-world concept will likely have
similarities in schema designs, constraints, and data
value patterns.
• SEMINT supports access to a variety of database
systems and utilizes both schema information and
data contents to produce rules for matching
corresponding attributes automatically.
Technologies in SEMINT
Neural networks versus traditional programmed computing
• Programmed computing uses a known set of rules
• Neural networks are trained
Attribute Correspondence Identification Procedure
1. Metadata extraction using DBMS-specific parsers
* Data type conversion
Five types are defined: character, number, date, rowid, and raw
* Normalization of metadata
- Binary values
- Category values
converting a category input into a vector of binary values, e.g., (1, 0, 0)
- Range values
using a SIGMOID-like function to map values into the range [0, 1]
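The three normalizations above can be sketched as follows. The particular encodings, the example metadata, and the sigmoid midpoint are assumptions chosen for illustration; SEMINT's actual discriminators differ.

```python
# Toy metadata normalization: binary, category (one-hot), and range
# values are all mapped into [0, 1] before neural network training.
import math

def encode_binary(flag):
    """Binary metadata (e.g., 'is nullable') -> 0.0 or 1.0."""
    return 1.0 if flag else 0.0

def encode_category(value, categories):
    """Category metadata -> vector of binary values, e.g., (1, 0, 0)."""
    return [1.0 if value == c else 0.0 for c in categories]

def encode_range(x, midpoint=20.0):
    """Range metadata (e.g., field length) -> [0, 1] via a sigmoid-like map."""
    return 1.0 / (1.0 + math.exp(-(x - midpoint) / midpoint))

print(encode_category("date", ["character", "number", "date"]))  # [0.0, 0.0, 1.0]
print(round(encode_range(20.0), 2))  # 0.5
```

Normalizing every discriminator into [0, 1] keeps any single metadata field from dominating the attribute signature vectors.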
2. Classifier
Before the metadata is used for neural network training, a classifier is used to cluster the attributes of a single database into categories.
Reasons:
– Ease of training
– If we have information in one database that refers to the same real-world information, we do not want to separate it into two different categories.
The user controls how fine-grained the categories are by setting the radius of the clusters rather than the number of categories.
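The radius-controlled behavior can be shown with a deliberately simple clustering sketch: an attribute joins the first cluster whose seed is within the radius, otherwise it starts a new one. SEMINT actually uses a self-organizing map; this stand-in only illustrates how the radius, not a preset category count, determines granularity.

```python
# Toy radius-based clustering of attribute signature vectors.
import math

def distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cluster(vectors, radius):
    """Group signature vectors into clusters of the given radius."""
    centers, clusters = [], []
    for v in vectors:
        for i, c in enumerate(centers):
            if distance(v, c) <= radius:
                clusters[i].append(v)
                break
        else:
            centers.append(v)   # v seeds a new cluster
            clusters.append([v])
    return clusters

sigs = [(0.0, 0.1), (0.1, 0.0), (0.9, 1.0)]
print(len(cluster(sigs, radius=0.5)))   # 2 clusters
print(len(cluster(sigs, radius=0.05)))  # 3 clusters (finer granularity)
```

Shrinking the radius splits near-identical signatures into separate categories; growing it merges them, which is exactly the knob the slide describes.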
[Figure: Illustration of a 3-D self-organizing map]
[Figure: Classifier architecture in SEMINT]
3. Category learning and recognition neural networks
The authors used the output of the classifier (M vectors) to train a back-propagation network to recognize database attributes' signatures.
After the back-propagation network is trained, the attribute health_Plan.Insured# is run through the neural network. The network determines the similarity between the given input pattern and each of the M categories. It shows that the input pattern `Insured#' is closest to category 3 (SSN and Emp_ID) with similarity 0.92. It also shows that health_Plan.Insured# is not likely related to the other attributes, since those similarities are low.
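The recognition step can be approximated without a trained network. The sketch below scores an input signature against each category's center with cosine similarity instead of a back-propagation network; all vectors and category names are invented for illustration.

```python
# Much-simplified stand-in for category recognition: rank categories
# by cosine similarity between the input signature and category centers.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

CATEGORY_CENTERS = {           # category -> mean signature vector (toy data)
    "cat1 (names)":      (0.9, 0.1, 0.0),
    "cat2 (dates)":      (0.1, 0.9, 0.1),
    "cat3 (SSN/Emp_ID)": (0.1, 0.1, 0.9),
}

def recognize(signature):
    """Return (category, similarity) pairs, best match first."""
    scores = {c: cosine(signature, v) for c, v in CATEGORY_CENTERS.items()}
    return sorted(scores.items(), key=lambda kv: -kv[1])

best, score = recognize((0.0, 0.2, 0.95))[0]
print(best)  # cat3 (SSN/Emp_ID)
```

As in the Insured# example, a high similarity to one category and low similarity to the rest is what marks a likely attribute correspondence.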
Using neural networks
To determine attribute correspondences between two
databases, users take the network trained for one
database, and use information extracted from the
other database as input to this network.
Automated attribute correspondence
identification procedure (summary)
Test1: Preliminary experimental results
• The result of these experiments showed that there is
a substantial gap between high and low similarity.
High similarity reflected corresponding attributes, and
low similarity reflected non-corresponding attributes.
Test2: The Boeing company tooling database
The database contains 12 tables and 412 attributes.
The authors split this information into two parts: the first part (denoted AM) had four tables with 260 attributes; the second (OZ) had eight tables with 152 attributes.
They trained a neural network to recognize the attributes of AM, and then determined the similarity between this network's categories and the attributes of OZ.
The precision in this test was approximately 80%, and the recall was close to 90%.
Why compare AM with OZ? Need to integrate them?
Test3: US NSF and Canada NSERC funding
award databases
• The NSF and NSERC award databases are both
stored in relational databases.
• They contain information on research grants
• They contain a wide variety of information, with a few
attributes reflecting the same information.
Context Interchange: New Features and
Formalisms for the Intelligent Integration of
Information
Cheng Hian Goh
S. Bressan, S. Madnick, and M. Siegel
July 1999, ACM Transactions on Information Systems
Summary
• Problem: Intelligent integration of information
(Zhongming: please be more specific)
• Solution: The context interchange strategy is a
mediator-based approach for achieving semantic
interoperability among heterogeneous sources and
receivers
• Key characteristics of solution: context mediator
using logic
CONTEXT INTERCHANGE BY EXAMPLE
Q1: SELECT r1.cname, r1.revenue FROM r1, r2
WHERE r1.cname = r2.cname AND r1.revenue > r2.expenses;
Without mediation, this query will return the empty answer. In the context interchange system, the semantics of data can be explicitly represented in the form of a context theory and a set of elevation axioms with reference to a domain model.
The Context Mediator rewrites the user query into a mediated query.
The Optimizer transforms this into an optimized query plan.
The optimized query plan is executed by an Executioner, which dispatches subqueries to the individual systems, collates the results, undertakes conversions, and returns the answers to the receiver.
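The mediate → optimize → execute flow can be sketched with stub functions. Everything below is invented scaffolding: the "queries" are plain strings, the plan is a list of tagged subqueries, and the conversion rate is an assumed value; only the division of responsibilities mirrors the description above.

```python
# Toy mediation pipeline: mediate -> optimize -> execute.

def mediate(query, context_theory):
    """Context Mediator: rewrite the query to account for context conflicts."""
    return {"mediated": query,
            "conversions": context_theory.get("conversions", [])}

def optimize(mediated_query):
    """Optimizer: turn the mediated query into an (abstract) query plan."""
    return {"plan": [("subquery", mediated_query["mediated"])],
            "conversions": mediated_query["conversions"]}

def execute(plan, sources):
    """Executioner: dispatch subqueries, collate results, apply conversions."""
    rows = []
    for _, subquery in plan["plan"]:
        for src in sources:
            rows.extend(src.get(subquery, []))
    for convert in plan["conversions"]:
        rows = [convert(r) for r in rows]
    return rows

sources = [{"Q1": [("NTT", 1_000_000)]}]
to_usd = lambda row: (row[0], row[1] * 0.0096)  # assumed JPY->USD rate
plan = optimize(mediate("Q1", {"conversions": [to_usd]}))
print(execute(plan, sources))
```

The receiver only ever sees the converted, collated answer; the conversions are injected by the mediator from the context theory, not written by the user.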
e.g. Q1 is transformed to a mediated query, MQ1 (How to do this?)
SELECT r1.cname, r1.revenue
FROM r1, r2, r4
WHERE r1.country = r4.country
AND r4.currency = 'USD'
AND r1.cname = r2.cname
AND r1.revenue > r2.expenses
UNION
SELECT r1.cname, r1.revenue * 1000 * r3.rate
FROM r1, r2, r3, r4
WHERE r1.country = r4.country
AND r4.currency = 'JPY'
AND r1.cname = r2.cname
AND r3.fromCur = 'JPY'
AND r3.toCur = 'USD'
AND r1.revenue * 1000 * r3.rate > r2.expenses
UNION
SELECT r1.cname, r1.revenue * r3.rate
FROM r1, r2, r3, r4
WHERE r1.country = r4.country
AND r4.currency <> 'USD'
AND r4.currency <> 'JPY'
AND r3.fromCur = r4.currency
AND r3.toCur = 'USD'
AND r1.cname = r2.cname
AND r1.revenue * r3.rate > r2.expenses;
The context interchange framework
The truth or falsity of a statement can only be understood with reference to a given context. This is formalized using assertions of the form
ist(c, φ)
read as: the statement φ is true ("ist") in context c.
The Domain Model
Two kinds of data objects:
primitive objects: native to sources and receivers, e.g., strings, integers, reals
semantic objects: complex types introduced to support the integration strategy
A domain model is a collection of primitive types and semantic types.
A semantic object may have different values in different “contexts.” Suppose we introduce two contexts labeled c1 and c2, which we associate with the sources and the receiver. We may write
f_r1_revenue(“NTT”)[value(c1) → 1000000]
f_r1_revenue(“NTT”)[value(c2) → 9600000]
Elevation axioms
Elevation axioms provide the means for mapping “values” present in sources to “objects” which are meaningful with respect to a domain model.
The mapping of attributes to semantic types is formally represented in two sets of assertions:
f_r1_revenue(x, y, z) : moneyAmt ← r1(x, y, z)
f_r1_revenue(x)[value(c) → y] ← r1(x, y, z), Δ(r1, c)
where Δ is a source-to-context mapping, e.g., Δ(s, c)
Context axioms
Context axioms associated with a source or receiver provide for the articulation of the data semantics which are often implicit in the given context.
x : moneyAmt, y : semanticNumber ⊢ y[value(c2) → 1] ← x[scaleFactor(c2) → y]
x : moneyAmt, y : currencyType ⊢ y[value(c2) → “USD”] ← x[currency(c2) → y]
Query Mediation as Abductive Inferences
The simplest case of abduction takes the form:
From observing A and the axiom B → A,
infer B as a possible “explanation” of A.
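The abductive step above can be sketched directly: given rules of the form B → A and an observation A, every antecedent B is collected as a possible explanation. The rules below are invented for illustration and only loosely echo the currency/scale conversions in the MQ1 example.

```python
# Toy abduction: from observing A and an axiom B -> A, infer B.

RULES = [  # (antecedent B, consequent A)
    ("revenue reported in JPY", "currency conversion needed"),
    ("revenue reported at scale 1000", "scale conversion needed"),
    ("revenue reported in USD", "no currency conversion needed"),
]

def abduce(observation):
    """Return every B such that the axiom B -> observation exists."""
    return [b for b, a in RULES if a == observation]

print(abduce("currency conversion needed"))  # ['revenue reported in JPY']
```

In context mediation, the abduced "explanations" are the conditions under which the user query is answerable, which is how the mediated query's UNION branches arise.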
Reference:
Slides 2-5: “Principles of Distributed Database Systems,” M. Tamer Ozsu and Patrick Valduriez, Prentice Hall, 1991