Reducing the Cost of Validating
Mapping Compositions by
Exploiting Semantic Relationships
Eduard C. Dragut
University of Illinois at Chicago
Ramon Lawrence
University of British Columbia Okanagan
ODBASE 2006, Montpellier, France
Talk Overview

- Introduction
- Background
  - Model and Mapping representation systems
- Proposed Mapping Representation System
- Invert and Compose operator definitions and properties
- Mappings Composition
- Experiment
  - Estimate the quality of the proposed system
E. Dragut and R. Lawrence Reducing the Cost of Validating Mapping Compositions by Exploiting Semantic Relationships
Page 2
Introduction - Terminology

Models
- denote a representation of a domain in a formal language (e.g., EER, Relational, Description Logic)
- have two components [Russell et al 2003]:
  - terminological (or metadata): this is the focus of this work and talk
  - extensional (i.e., facts or instances)

Mappings
- describe how two models are related to each other
Introduction - Mappings

Ways to define mappings between models
- binary relationships
  - called morphisms [Melnik et al 2003] or inter-schema correspondences [Popa et al. 2002]
- mapping using a helper model
  - [Bernstein et al. 2003]
- mapping as queries
  - [Madhavan et al. 2003, Bernstein et al. 2006]

Our work falls in the class of the first two types of mappings.
- We call them metadata level mappings.
- They are not concerned with the instances of a model.
Introduction - Examples

Examples of models
- diagrams, interface definitions, database schemas, web site layouts, control flow, XML schemas

Applications of mappings
- mapping between XML schemas to drive message translation;
- schema and database integration;
- mapping between ontologies to help in the process of merging and alignment
Background - Mapping creation

- The creation will rarely be completely automated.
- The general strategy is to build mappings semi-automatically:
  - use heuristics to generate matchings (e.g., name similarity)
    - [Rahm and Bernstein 2001, Shvaiko and Euzenat 2005] (surveys)
  - translate matches into formulas
    - e.g., Clio project [Popa et al. 2002]
  - generate new mappings from existing mappings
    - Composition, e.g., [Madhavan et al. 2003, Bernstein et al. 2006]
    - Invert, e.g., [Fagin 2006]
- Semi-automatic tools can significantly speed up the process.
Background - Morphisms

Mapping:
- is just a set of binary relations between the elements of two models
- is a set of pairs < l, r >

CREATE TABLE Actor1(
  ActID int PRIMARY KEY,
  Bio varchar,
  ActorName varchar)

<schema xmlns="...">
  <complexType name="Actor2">
    <element name="ActorID" type="xs:int"/>
    <element name="Bio" type="xs:string"/>
    <element name="FirstName" type="xs:string"/>
    <element name="LastName" type="xs:string"/>
  </complexType>
</schema>

Advantages/Disadvantages
- their expressiveness is enough for certain classes of problems and they exhibit certain mathematical properties [Melnik et al. 2003]
- main drawback: assumes similarity to be transitive
Background – Morphisms problems

Composition
- < ID, ActID > ○ < ActID, ActorID > = < ID, ActorID >, due to the transitivity assumption

Problems with this technique
- Whenever an m:1 correspondence is composed with a 1:n correspondence, the composition result is a cross-product, many of whose pairs are false positives.
- It may miss or suggest false relationships.

[Figure: the schemas Actor1 (ID, Bio, FirstName, LastName, MiddleName, LastMovie), Actor (ActID, Bio, ActorName), and Actor2 (ActorID, Bio, FirstName, LastName, RecentMovie), shown first with the two morphisms through Actor and then with the composed morphism from Actor1 to Actor2. Legend: blue = correct; red = false positive or missed.]
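The cross-product effect can be reproduced in a few lines of Python. This is a minimal sketch, not the authors' implementation: the pairs in `map1` and `map2` are assumed correspondences for the Actor1, Actor, and Actor2 schemas (the arrows of the original figure did not survive), and `compose_morphisms` is a hypothetical helper name.

```python
def compose_morphisms(map1, map2):
    """Relational composition of two morphisms, each a set of <l, r> pairs."""
    return {(l, r2) for (l, r1) in map1 for (l2, r2) in map2 if r1 == l2}


# ASSUMED correspondences Actor1 -> Actor (an m:1 match on ActorName).
map1 = {
    ("ID", "ActID"),
    ("Bio", "Bio"),
    ("FirstName", "ActorName"),
    ("MiddleName", "ActorName"),
    ("LastName", "ActorName"),
}
# ASSUMED correspondences Actor -> Actor2 (a 1:n match on ActorName).
map2 = {
    ("ActID", "ActorID"),
    ("Bio", "Bio"),
    ("ActorName", "FirstName"),
    ("ActorName", "LastName"),
}

composed = compose_morphisms(map1, map2)
name_pairs = sorted(p for p in composed if p[0].endswith("Name"))
for pair in name_pairs:
    print(pair)
```

Under these assumed inputs, three name attributes on the left times two on the right give six name pairs, and the morphism itself carries no information that would separate the plausible ones (FirstName with FirstName, LastName with LastName) from the rest. LastMovie and RecentMovie are never paired because neither is connected through Actor.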
Background – Mapping with helper models

Example:
[Figure: Actor1 (ID, Bio, FirstName, LastName, MiddleName, LastMovie) is connected to Actor (ActID, Bio, ActorName) through the helper mapping map1, and Actor is connected to Actor2 (ActorID, Bio, FirstName, LastName, RecentMovie) through the helper mapping map2. The mapping elements are labeled m2 through m7.]

Algorithm (for right compose) [Bernstein et al 2002]
- copy the right hand side mapping
- for each mapping element, m, on the right, i.e. in map2
  - compute its Input(m)
- for each mapping element, m, on the right, i.e. in map2
  - set its domain to the union of the domains of Input(m)
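The three steps can be sketched as follows. This is an illustration under stated assumptions rather than the algorithm from [Bernstein et al 2002]: the slide does not define Input(m), so the sketch assumes it is the set of elements of map1 whose range overlaps the domain of m. The `HelperElement` class, the element names, and the sample connections are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HelperElement:
    """One element of a helper-model mapping (hypothetical structure)."""
    name: str
    domain: frozenset  # elements it connects to on the left
    range_: frozenset  # elements it connects to on the right


def right_compose(map1, map2):
    # Step 1: copy the right-hand side mapping.
    copied = list(map2)
    # Step 2: compute Input(m) for every element m of the copy.
    # ASSUMPTION: Input(m) = elements of map1 whose range overlaps m's domain.
    inputs = {
        m.name: [m1 for m1 in map1 if m1.range_ & m.domain] for m in copied
    }
    # Step 3: set each domain to the union of the domains in Input(m).
    result = []
    for m in copied:
        new_domain = frozenset().union(*(m1.domain for m1 in inputs[m.name]))
        result.append(HelperElement(m.name, new_domain, m.range_))
    return result


# ASSUMED mapping elements; the figure's actual connections are not recoverable.
map1 = [
    HelperElement("a_id", frozenset({"ID"}), frozenset({"ActID"})),
    HelperElement("a_name", frozenset({"FirstName", "LastName"}),
                  frozenset({"ActorName"})),
]
map2 = [
    HelperElement("b_id", frozenset({"ActID"}), frozenset({"ActorID"})),
    HelperElement("b_name", frozenset({"ActorName"}),
                  frozenset({"FirstName", "LastName"})),
    HelperElement("b_bio", frozenset({"Bio"}), frozenset({"Bio"})),
]

composed = {m.name: m for m in right_compose(map1, map2)}
print(sorted(composed["b_name"].domain))
print(sorted(composed["b_bio"].domain))
```

In this toy input no element of `map1` reaches Bio, so `b_bio` ends up with an empty domain.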
Background – Mapping with helper models

Composition result
[Figure: the original configuration (Actor1, map1, Actor, map2, Actor2, with mapping elements m2 through m7) next to the composed result, in which Actor1 is connected to Actor2 through map2 alone. Legend: red = missed relationships.]

Problems with this technique
- It may miss or suggest false relationships.
The Objectives

The driving motivation
- The need for a mapping definition subsuming the relationship kinds that the state of the art matching algorithms discover with high precision.
- Investigate to what extent a set of operations over this mapping definition can be defined.

Provide a mapping representation at the metadata level combining the advantages of morphisms and mappings with helper models.
- The former has good mathematical properties.
- The latter is more expressive.

Provide a compose algorithm that exploits the semantic relationships within the mapping expression to produce correct semantic relationships whenever these can be determined automatically, and to isolate those instances that require human intervention.
Proposed Mappings Representation

Model
- A model has similar expressiveness to an EER model and is consistent with the definition of model used in previous work on model management [Bernstein et al 2002, Pottinger and Bernstein 2003].

Mapping Representation
- A mapping consists of a set of mapping elements. Each mapping element is a directed, kinded binary relationship between a pair of elements not in the same model:
  - triplets of the form < m1, type, m2 >, where type ∈ {IsA, AKindOf, HasA, PartOf, =, Contains, ContainedBy, Unknown, Complex}

Comments
- Some of these types were introduced in other works, e.g., [Euzenat 2004, Giunchiglia et al. 2004, Pottinger and Bernstein 2003, Xu and Embley 2003, Wu et al. 2004].
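One way to hold such triplets in code is an enumeration for the nine kinds plus an immutable record for the triplet. This is only a sketch of a possible data structure: the names `Kind` and `MappingElement` and the sample elements are assumptions for illustration and do not come from the talk.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    """The nine relationship kinds listed in the mapping definition."""
    IS_A = "IsA"
    A_KIND_OF = "AKindOf"
    HAS_A = "HasA"
    PART_OF = "PartOf"
    EQUAL = "="
    CONTAINS = "Contains"
    CONTAINED_BY = "ContainedBy"
    UNKNOWN = "Unknown"
    COMPLEX = "Complex"


@dataclass(frozen=True)
class MappingElement:
    """A directed, kinded relationship < left, kind, right >."""
    left: str   # element of the source model
    kind: Kind
    right: str  # element of the target model


# A mapping is a set of mapping elements (sample elements are hypothetical).
mapping = {
    MappingElement("HomePhone", Kind.IS_A, "Phone"),
    MappingElement("WorkPhone", Kind.IS_A, "Phone"),
    MappingElement("Street1", Kind.UNKNOWN, "Street"),
}

for element in sorted(mapping, key=lambda e: e.left):
    print(f"< {element.left}, {element.kind.value}, {element.right} >")
```

Freezing the dataclass makes elements hashable, so a mapping can be an ordinary Python set, which matches the set-based definition above.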
Proposed Mappings Representation

An example
- Most of the relationship kinds in the mapping representation are well-known, except for Unknown and Complex.
- < a, Unknown, b > means that the relationship between concepts a and b is not precisely known.
- < a, Complex, b > means that the relationship between concepts a and b may require a functional specification: a = f(b)
  - e.g., Price = PriceVat(VAT + 1)

[Figure: example mapping between the PO and POrder models. Element labels: Product, Article, ShipTo, Recipient, FirstName, LastName, ShipAddress, Street1, Street2, Phone, WorkPhone, HomePhone, Shipper, Organization. Legend (edge kinds): Equality, Contains, HasA, IsA.]
Operators - Invert

Each of the relationship types introduced has well-defined inversion properties:
- IsA inverted is AKindOf, HasA inverted is PartOf, Contains inverted is ContainedBy

Definition [invert for mapping elements]:
- Consider m = < a, type, b >. Then its corresponding inverted mapping element, denoted m^{-1}, is given by the following expression: < b, type^{-1}, a >
- Mathematical form: < a, type, b >^{-1} = < b, type^{-1}, a >
  - E.g., < a, HasA, b >^{-1} = < b, HasA^{-1}, a > = < b, PartOf, a >

Definition [invert for mappings]:
- Given two models A and B and a mapping, map, between them, the invert of map, denoted by map^{-1}, is defined from B to A and its expression is given by:
  map^{-1} = { < b, type^{-1}, a > | < a, type, b > ∈ map }
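The two definitions translate almost directly into code. The sketch below represents a mapping element as a plain `(a, kind, b)` tuple. The IsA/AKindOf, HasA/PartOf, and Contains/ContainedBy pairs are the ones stated on the slide; treating "=", Unknown, and Complex as their own inverses is an assumption made here, as are the function names and the sample elements.

```python
# Inverse of each relationship kind. The first three pairs come from the slide;
# treating "=", Unknown and Complex as self-inverse is an ASSUMPTION.
INVERSE = {
    "IsA": "AKindOf", "AKindOf": "IsA",
    "HasA": "PartOf", "PartOf": "HasA",
    "Contains": "ContainedBy", "ContainedBy": "Contains",
    "=": "=", "Unknown": "Unknown", "Complex": "Complex",
}


def invert_element(element):
    """< a, type, b >^{-1} = < b, type^{-1}, a >"""
    a, kind, b = element
    return (b, INVERSE[kind], a)


def invert_mapping(mapping):
    """map^{-1} = { < b, type^{-1}, a > | < a, type, b > in map }"""
    return {invert_element(e) for e in mapping}


# Hypothetical sample mapping.
mapping = {("HomePhone", "IsA", "Phone"), ("Recipient", "HasA", "LastName")}
inverted = invert_mapping(mapping)
for element in sorted(inverted):
    print(element)
```

Because every kind's inverse maps back to the original kind in this table, inverting twice returns the original mapping.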
Operators - Compose

- Composing two mappings involves defining a composition operation between the elements of the mappings (i.e., between triplets of the form < a, type, b >)

Example
- < HomePhone, IsA, Phone > ○ < Phone, =, Telephone > = < HomePhone, IsA, Telephone >
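A deliberately conservative sketch of element-wise composition follows. The slides do not reproduce the full composition table, so the code encodes only one rule: composing with "=" keeps the other kind. The slide's example shows this for "=" on the right, and applying it on the left as well is an assumption. Every other combination falls back to Unknown, which is likely more pessimistic than the authors' operator. All function names are hypothetical.

```python
EQUAL, UNKNOWN = "=", "Unknown"


def compose_kinds(kind1, kind2):
    """Kind of < a, kind1, b > composed with < b, kind2, c >.

    ASSUMPTION: only the '=' rule is encoded; anything else is Unknown.
    """
    if kind1 == EQUAL:
        return kind2
    if kind2 == EQUAL:
        return kind1
    return UNKNOWN


def compose_elements(e1, e2):
    a, kind1, b1 = e1
    b2, kind2, c = e2
    if b1 != b2:
        raise ValueError("elements do not share a middle concept")
    return (a, compose_kinds(kind1, kind2), c)


def compose_mappings(map1, map2):
    """Compose every pair of elements that meet in the intermediate model."""
    return {
        compose_elements(e1, e2)
        for e1 in map1
        for e2 in map2
        if e1[2] == e2[0]
    }


result = compose_elements(("HomePhone", "IsA", "Phone"),
                          ("Phone", "=", "Telephone"))
print(result)
```

With this fallback the operator never guesses a kind it cannot justify; it marks the pair Unknown and leaves it for a person to validate.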
Compose Properties

Remarks:
- Composition is closed: composing two mappings whose elements are triplets < a, type, b > again yields a mapping made of such triplets.
- Mapping composition is symmetric in this framework:
  (< a, type_1, b > ○ < b, type_2, c >)^{-1} = < c, type_2^{-1}, b > ○ < b, type_1^{-1}, a >
- Composing two mappings does not produce false correspondences between the elements of the two models, i.e., it does not suggest false directed, kinded relationships.
- The Compose operator uses the Unknown relationship to indicate when it is not possible (in general) to suggest a relationship type given only the information expressed in the two mappings.
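The closure and symmetry remarks can be checked mechanically for any candidate composition table. The sketch below does this exhaustively for a toy table, which is an assumption and not the authors' table: "=" acts as the identity, everything else composes to Unknown, and "=", Unknown, and Complex are treated as self-inverse.

```python
from itertools import product

KINDS = ["IsA", "AKindOf", "HasA", "PartOf", "=",
         "Contains", "ContainedBy", "Unknown", "Complex"]

# ASSUMPTION: '=', Unknown and Complex are their own inverses.
INVERSE = {
    "IsA": "AKindOf", "AKindOf": "IsA",
    "HasA": "PartOf", "PartOf": "HasA",
    "Contains": "ContainedBy", "ContainedBy": "Contains",
    "=": "=", "Unknown": "Unknown", "Complex": "Complex",
}


def compose_kinds(k1, k2):
    # ASSUMED toy table: '=' is the identity, anything else is Unknown.
    if k1 == "=":
        return k2
    if k2 == "=":
        return k1
    return "Unknown"


def invert(e):
    a, kind, b = e
    return (b, INVERSE[kind], a)


def compose(e1, e2):
    return (e1[0], compose_kinds(e1[1], e2[1]), e2[2])


def symmetry_holds(k1, k2):
    left, right = ("a", k1, "b"), ("b", k2, "c")
    return invert(compose(left, right)) == compose(invert(right), invert(left))


pairs = list(product(KINDS, repeat=2))
violations = [p for p in pairs if not symmetry_holds(*p)]
not_closed = [p for p in pairs if compose_kinds(*p) not in KINDS]
print(len(pairs), len(violations), len(not_closed))
```

The same harness can be pointed at a richer table: any entry that breaks the inverse law or leaves the set of nine kinds shows up in `violations` or `not_closed`.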
Experiment - Setup

Experiment goal:
- Show that the composition framework is robust when applied to real-world applications and that we are able to correctly identify problematic cases.
- We compare it against mappings as morphisms.

Five real-world XML schemas in the purchase order domain: CIDR, Excel, Noris, Paragon, and Apertum from www.biztalk.org
- They were used in other projects: [Dragut and Lawrence 2004, Madhavan et al. 2001]

And a reference ontology
- to which each XML schema is manually mapped, both using morphisms [Dragut and Lawrence 2004] and using the new mapping definition.
Experiment - Setup

Example of XML schemas:
- XML Excel and CIDR schemas

[Figure: tree views of the CIDR and Excel Purchase Order schemas. Element names visible in the trees include POHeader, POShipTo, POLines, Header, Footer, DeliverTo, InvoiceTo, Contact, Address, Items, and Item, with leaf elements such as poNumber, poDate, orderNum, orderDate, contactName, contactEmail, contactPhone, companyName, street1 to street4, city, stateProvince, postalCode, country, partno, partNumber, partDescription, qty, quantity, uom, unitOfMeasure, unitPrice, salesValue, and totalValue.]
Experiment - Intermediary model

Comments:
- The intermediary model does not have all concepts in the schemas (e.g., unitOfMeasure, count, and VAT).
- The intermediary model is structurally different from the five schemas considered and it is defined using OWL.

[Figure: the POrder intermediary ontology. Concepts: PurchaseOrder, Agent, Person, Organization, Personnel, Shipper, Supplier, Address, PurchasedItem, ItemsCollection. Relationships: hasAddress, contactPerson, suppliedBy, shippedBy, shipTo, billTo, hasItem, hasItems. Attributes: HomePage, Email, Phone, Fax, FirstName, LastName, Title, Position, OrganizationName, Street, City, State, Zip, Country, UPC, Discount, ItemDescription, ItemName, PartNumber, Quantity, Price, Amount, Currency, PurchaseOrderDate, OrderDate, OrderNumber, ShipmentDate, Comments.]
Experiment - Methodology

Step 1: map the five schemas to the intermediary model:
- First, using morphisms
- Second, using the proposed mapping

Step 2: apply the compose operators to compute direct mappings between the schemas
- First, employing composition over morphisms
- Second, using the new compose operator

Step 3: measure the quality of the two compositions in terms of Precision, Recall, and Overall
- A new metric is introduced: User Effort.
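The slides do not spell out the three quality measures. The sketch below assumes the definitions commonly used in the schema matching literature, where Overall = Recall * (2 - 1/Precision), which is algebraically equal to (true positives - false positives) / |reference|. The function name and the sample correspondences are hypothetical.

```python
def precision_recall_overall(found, reference):
    """Compare produced correspondences against a manually built reference.

    ASSUMPTION: Overall = Recall * (2 - 1/Precision), computed in the
    equivalent form (TP - FP) / |reference| to avoid dividing by zero.
    """
    found, reference = set(found), set(reference)
    true_pos = len(found & reference)
    false_pos = len(found - reference)
    precision = true_pos / len(found) if found else 0.0
    recall = true_pos / len(reference) if reference else 0.0
    overall = (true_pos - false_pos) / len(reference) if reference else 0.0
    return precision, recall, overall


# Hypothetical data: 4 reference pairs, 5 produced pairs, 3 of them correct.
reference = {("poNumber", "orderNum"), ("poDate", "orderDate"),
             ("qty", "quantity"), ("partno", "partNumber")}
found = {("poNumber", "orderNum"), ("poDate", "orderDate"),
         ("qty", "quantity"), ("qty", "itemCount"), ("city", "country")}

precision, recall, overall = precision_recall_overall(found, reference)
print(precision, recall, overall)
```

Overall penalizes false positives as heavily as it rewards true positives, so it turns negative once more than half of the produced correspondences are wrong.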
Experiment - Stats

Overall after composition was computed

[Figure: bar chart of Overall (y-axis, 0 to 1) for each schema pair (x-axis: 1<->2, 1<->3, 1<->4, 1<->5, 2<->3, 2<->4, 2<->5, 3<->4, 3<->5, 4<->5), with one series for "ours" and one for "morphisms".]

CIDR, Excel, Noris, Paragon, and Apertum are assigned numbers 1, 2, 3, 4, and 5 respectively.
Experiment - Stats

User effort is the % of mappings that must be validated by a user.
- For morphisms, user effort is 100%, as there is no way to distinguish true from false relationships.
- In our framework, it is the ratio of the number of Unknown relationships to the number of all produced relationships. On average it is only 19%.

[Figure: stacked bar chart (y-axis, 0% to 100%) for each schema pair (x-axis: 1<->2 through 4<->5). Series: % correct (semantic), % unknown (semantic), % correct (morphism), % incorrect (morphism).]
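The user effort ratio for the proposed framework is straightforward to compute from a composed mapping. The sketch below follows the definition given on this slide; the function name and the sample triplets are hypothetical.

```python
def user_effort(composed):
    """Fraction of produced relationships that are Unknown.

    These are the relationships a person still has to validate.
    """
    composed = list(composed)
    if not composed:
        return 0.0
    unknown = sum(1 for (_, kind, _) in composed if kind == "Unknown")
    return unknown / len(composed)


# Hypothetical composed mapping: 5 relationships, 1 of them Unknown.
composed = [
    ("poNumber", "=", "orderNum"),
    ("poDate", "=", "orderDate"),
    ("contactPhone", "IsA", "telephone"),
    ("POShipTo", "HasA", "city"),
    ("qty", "Unknown", "itemCount"),
]

effort = user_effort(composed)
print(f"{effort:.0%}")
```

For a plain morphism there is no kind to inspect, so every produced pair has to be checked, which corresponds to the 100% figure stated above.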
End
Thank you for your time and patience!