Transcript: slide 1

Semantic integration of
traditional and web-based
information sources
Gergely Lukácsy, BUTE
Péter Szeredi, BUTE
Péter Krauth, IQSYS
Attila Bodnár, IQSYS
What is a mashup?
• A mashup is a website or application that
combines content from more than one
source into an integrated experience.
• The etymology of this term possibly
derives from its similar use in pop music.
/Wikipedia/
Quotes on mashups
• “Web mashups, and other Web 2.0 development
(e.g. Ajax) are all facets of the same phenomenon
that :
– information and presentation are being separated in
ways that allow for novel forms of reuse.”
• “The mash-up is the offspring of an environment
where application developers facilitate the creation
of integrated, yet highly derivative application
hybrids by third parties, something they do by
providing rich public APIs to their user base.”
What’s so special about mashups?
• Content used in mashups is typically sourced from a third
party via a public interface or API.
• Other methods of sourcing content for mashups include web
feeds (e.g. RSS or Atom), web services and screen
scraping.
• Some in the community believe that only cases where
public interfaces are not used count as mashups.
• Many people are experimenting with mashups using
Google, eBay, Amazon, Flickr, and Yahoo APIs.
• Google has a mashup editor in beta.
Mashup = Application Integration à la Web 2.0?
What are we going to talk about?
SINTAGMA = Semantic INtegration Technology Applied in Grid-like, Model-driven Architectures
R&D project:
• Sponsored by the National
Research and Development
Program, 2005-2007
Consortium:
• Coordinator: IQSYS
• Developer Organisations:
• IQSYS, BUTE, SZTAKI
• User Organisations:
• OSZK, MTI, ARECO/eBolt
Information Integration with Sintagma
[Figure: SINTAGMA handles data access and transformation over the information sources Database A, Database B (RDBMS, XML, RDF), Application A (web service) and Application B (traditional); presentation and further processing take place in an external application (e.g. mashup application).]
• Separates clearly the data access and transformation layers
of integration from the presentation layer
• Uses a comprehensive metadata repository
(Model Warehouse)
– Semantics of data represented in the repository: maps local and remote
metadata to each other
– Data access and transformation driven by the repository
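The idea of a repository that stores only metadata and drives data access can be sketched as follows. This is a minimal illustration, not SINTAGMA's actual implementation: the names `MAPPINGS` and `fetch_unified`, the field names and the sample rows are all hypothetical.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]

# Hypothetical metadata repository: for each source, how its local field
# names map to the field names of a unified model. No data is stored here.
MAPPINGS: Dict[str, Dict[str, str]] = {
    "database_a": {"prod_name": "name", "prod_price": "price"},
    "web_service_b": {"title": "name", "cost": "price"},
}


def fetch_unified(sources: Dict[str, Callable[[], Iterable[Row]]]) -> List[Row]:
    """Read each source on demand and rename its fields using the mapping."""
    unified: List[Row] = []
    for source_name, read in sources.items():
        mapping = MAPPINGS[source_name]
        for row in read():
            # Fields without a mapping are not part of the unified model.
            unified.append({mapping[k]: v for k, v in row.items() if k in mapping})
    return unified


# Stand-ins for real data sources (assumed sample data).
sources = {
    "database_a": lambda: [{"prod_name": "printer", "prod_price": 120}],
    "web_service_b": lambda: [{"title": "toner", "cost": 35, "stock": 4}],
}
result = fetch_unified(sources)
```

The presentation layer (for example a mashup) would only ever see `result`; the source-specific field names stay inside the mapping.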
Search and analysis of Web data
[Figure: a search and analysis application (e.g. mashup) on top of several SINTAGMA nodes. Labels: data, service, metadata mapping. Legend: SINTAGMA node.]
Sintagma –
an approach to information integration
• Key Principles:
– No duplication of data: Model Warehouse vs. Data Warehouse
– Communication: one-way, on-line (no modification of data, instant access)
– Integration of web services as information sources supported (no modification
required)
• Key Components:
– Manages various forms of metadata (Model Manager)
– Accesses various structured and semi-structured information sources
(Wrappers):
• RDBMS
• RDF
• XML
• Web Services
– Preprocesses various “unstructured” information sources (Annotators):
• Texts
• Raster maps (labels and signs)
• Excel tables
– Optimises query execution: query planning using deduction (Mediator)
– Data Quality Control
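The division of labour between Wrappers and the Mediator can be illustrated with a small sketch. It rests on assumptions: the class names `Wrapper`, `SqlWrapper`, `XmlWrapper` and `Mediator` are hypothetical, an in-memory SQLite table stands in for an RDBMS, and the mediator here simply asks every wrapper, whereas the real Mediator plans queries using deduction.

```python
import sqlite3
import xml.etree.ElementTree as ET
from abc import ABC, abstractmethod
from typing import Dict, List


class Wrapper(ABC):
    """Uniform, read-only access to one kind of information source."""

    @abstractmethod
    def rows(self) -> List[Dict[str, str]]:
        ...


class SqlWrapper(Wrapper):
    def __init__(self, conn: sqlite3.Connection, table: str):
        self.conn, self.table = conn, table

    def rows(self) -> List[Dict[str, str]]:
        cur = self.conn.execute(f"SELECT * FROM {self.table}")
        cols = [d[0] for d in cur.description]
        return [dict(zip(cols, map(str, r))) for r in cur.fetchall()]


class XmlWrapper(Wrapper):
    def __init__(self, xml_text: str, item_tag: str):
        self.root, self.item_tag = ET.fromstring(xml_text), item_tag

    def rows(self) -> List[Dict[str, str]]:
        return [{child.tag: child.text or "" for child in item}
                for item in self.root.iter(self.item_tag)]


class Mediator:
    """Answers a query by asking every registered wrapper (no planning)."""

    def __init__(self, wrappers: List[Wrapper]):
        self.wrappers = wrappers

    def select(self, **conditions: str) -> List[Dict[str, str]]:
        return [row for w in self.wrappers for row in w.rows()
                if all(row.get(k) == v for k, v in conditions.items())]


# Assumed sample sources.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (name TEXT, kind TEXT)")
conn.execute("INSERT INTO product VALUES ('LaserJet', 'printer')")
xml_text = ("<catalogue>"
            "<item><name>InkJet</name><kind>printer</kind></item>"
            "<item><name>A4</name><kind>paper</kind></item>"
            "</catalogue>")

mediator = Mediator([SqlWrapper(conn, "product"), XmlWrapper(xml_text, "item")])
printers = mediator.select(kind="printer")
```

The sources are only read, never modified, which matches the one-way, on-line communication principle above.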
Architecture of SINTAGMA
[Figure: architecture diagram. Components shown: Sintagma GUI; Model Manager / Model Warehouse; Model Manager (local and remote); Mediator; XML, RDF, WS, WD and JDBC Wrappers over XML, RDF, Web Service, HTML and RDBMS sources; Text Annotation subsystem (Text Annotator, texts); Map Annotation subsystem (Map Annotator, Map Server, maps); Data Quality Control subsystem (Data Quality Controller, DQ Engine (meta), DQ Engine (native), DQ log).]
Model Warehouse of SINTAGMA
[Figure: the Model Warehouse in four levels, from bottom to top.
Source Level: Data Source 1, 2, 3, ..., n.
Interface Level: local models and a transformed model; an External model (e.g. BPM) is also shown.
Application Level: the unified Integrated Application Model.
Conceptual Level: the Integrated Conceptual Model, with common, clarified concepts, special concepts of business areas, domain specific terminology, domain specific knowledge/ontologies, and conceptual views of workers in a business area.
Legend: model, mapping, input.]
Modelling in SINTAGMA
• The Model Warehouse
– content of the Model Warehouse
– interface models and abstractions
– ontology concepts
• Use cases
– Product comparison
– Workflow of Equipment purchase
– Web service integration demos
Model Warehouse
• Content of the Model Warehouse
– Object-oriented models
• Structural properties of sources in UML Object Model
• Non-structural information given as OCL Constraints
• Mapping between models as abstractions
– Description Logic models
– Queries: source and conceptual level
• Classification of models
– interface
– unified (application)
– conceptual
• Modeling: SILan – Semantic Integration Language
– Describes content of Model Warehouse in textual format
– Has well-defined semantics
Interface Models
Higher level models
• Abstractions (data transformations)
– populate higher level entities
• Filter low level data (suppliers)
• Transform data to appropriate higher level form (clients)
– can have multiple suppliers and clients
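An abstraction of this kind can be sketched as a filter plus a transformation applied to one or more supplier collections. The class name `Abstraction`, its fields and the sample vendor rows are hypothetical and only illustrate the idea; in SINTAGMA abstractions are described in SILan, not in Python.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]


@dataclass
class Abstraction:
    """Populates a higher level entity from lower level (supplier) rows."""
    keep: Callable[[Row], bool]       # filter on the low level data
    transform: Callable[[Row], Row]   # supplier row -> higher level row

    def populate(self, *suppliers: Iterable[Row]) -> List[Row]:
        return [self.transform(row)
                for supplier in suppliers
                for row in supplier
                if self.keep(row)]


# Assumed sample data from two lower level models.
vendor_a = [{"item": "printer", "price_huf": 40000},
            {"item": "toner", "price_huf": 0}]
vendor_b = [{"item": "monitor", "price_huf": 60000}]

priced_products = Abstraction(
    keep=lambda r: r["price_huf"] > 0,
    transform=lambda r: {"name": r["item"].upper(), "price": r["price_huf"]},
)
products = priced_products.populate(vendor_a, vendor_b)
```

The same `products` list could in turn act as a supplier for another abstraction, which is how one entity can have several suppliers and several clients.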
Higher level models (cont’d)
• Invariants
– have to be satisfied by all the instances of a model
element
– can contain navigation
• Queries
– can be formulated on any model
• Interface level models: directly accessing data sources
• Higher level models: using mediation
– are interchangeable with abstractions
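Invariants and queries can be pictured as predicates over model instances. In SINTAGMA the constraints are given in OCL; the Python below only mirrors the idea, and the classes `Vendor` and `Order`, the two invariants and the sample data are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Vendor:
    name: str
    preferred: bool


@dataclass
class Order:
    item: str
    price: float
    vendor: Vendor


Invariant = Tuple[str, Callable[[Order], bool]]

# Each invariant must hold for every instance of the model element.
INVARIANTS: List[Invariant] = [
    ("price must be positive", lambda o: o.price > 0),
    # This one navigates from the order to its vendor.
    ("vendor must be preferred", lambda o: o.vendor.preferred),
]


def violations(orders: List[Order]) -> List[Tuple[str, str]]:
    """Return (item, broken invariant) pairs."""
    return [(o.item, text)
            for o in orders
            for text, holds in INVARIANTS
            if not holds(o)]


def query(orders: List[Order], condition: Callable[[Order], bool]) -> List[str]:
    """A query is a condition evaluated over the instances of a model."""
    return [o.item for o in orders if condition(o)]


a, b = Vendor("A", True), Vendor("B", False)
orders = [Order("printer", 200.0, a), Order("toner", 0.0, a), Order("monitor", 150.0, b)]

broken = violations(orders)
expensive = query(orders, lambda o: o.price > 100)
```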
Conceptual Models
Conceptual models (cont’d)
• These models encapsulate concepts given
in Description Logic formalism
Use case 1: Product comparison
• Goal: find products that are similar to the
products in a host system
• Information sources
– catalogues from various vendors in Excel
– database of the host system
• Problems to solve
– heterogeneity of the catalogues: preprocessing
– algorithm for product comparison
Solution in SINTAGMA
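[Figure: in the Model Warehouse, the Catalogue and Host Database models feed Unified Products, from which Product comparison derives Similar Products. Sources: vendor catalogues in Excel, preprocessed into XML, and the host database in MySQL.]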
Use case 2: Equipment purchase
Equipment purchase in an
organisation
• Scenario
– Each department maintains a wish-list of equipment
– There are vendors who provide products to departments
• Vendors sell different types of products (vendor A sells printers and
toners, Vendor B monitors and printers etc.)
• The financial department dynamically designates a preferred vendor
for each product
• Questions: is there any expensive order? what is the total? etc.
• Information Sources:
– Department’s wish-list:
• relational database with columns description, category, e.g.: “we
have run out of paper”, “15/18”
– Financial department:
• Web service, with operation determining where to buy a given
product, e.g.: (15,8) -> (A4 paper, 4, 23)
– Vendors:
• Heterogeneous web services which return prices, units and delivery
date, e.g.: 23 -> (12, 1, 2007-07-01)
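The chain from wish-list entry to vendor offer can be sketched with stand-in functions. Several things here are assumptions rather than facts from the slides: the category is taken to be a pair encoded as "15/8", the financial service result is read as (product name, vendor id, product id), the vendor result as (price, units, delivery date), and the threshold for an "expensive" order is invented. The function names are hypothetical.

```python
from datetime import date
from typing import Dict, List, Tuple

# Stand-in for the departments' wish-list table: (description, category).
WISH_LIST: List[Tuple[str, str]] = [
    ("we have run out of paper", "15/8"),
]


def where_to_buy(category: Tuple[int, int]) -> Tuple[str, int, int]:
    """Stand-in for the financial department's web service.
    Assumed result: (product name, vendor id, product id)."""
    return {(15, 8): ("A4 paper", 4, 23)}[category]


def vendor_offer(vendor_id: int, product_id: int) -> Tuple[float, int, date]:
    """Stand-in for a vendor's web service.
    Assumed result: (price, units, delivery date)."""
    return {(4, 23): (12.0, 1, date(2007, 7, 1))}[(vendor_id, product_id)]


def planned_orders() -> List[Dict[str, object]]:
    orders = []
    for description, category in WISH_LIST:
        major, minor = (int(part) for part in category.split("/"))
        product, vendor_id, product_id = where_to_buy((major, minor))
        price, units, delivery = vendor_offer(vendor_id, product_id)
        orders.append({"request": description, "product": product,
                       "vendor": vendor_id, "price": price,
                       "units": units, "delivery": delivery})
    return orders


orders = planned_orders()
total = sum(o["price"] * o["units"] for o in orders)
EXPENSIVE_LIMIT = 100.0  # assumed threshold, not from the slides
expensive = [o for o in orders if o["price"] > EXPENSIVE_LIMIT]
```

The two questions from the scenario (is there an expensive order, what is the total) become plain computations once the three sources are joined.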
Event Driven Process Chain
Solution in Sintagma
Use case 3: Web Service Integration
• Integrating Amazon and Barnes&Noble
• Integrating RSS-sources (e.g. origo, nol,
index, metro)
• Integrating World Championship Results
(2002 and 2006)
Integrating Amazon and Barnes&Noble
[Figure: model levels for the Amazon and Barnes&Noble integration.
Source Level: Amazon.com web service, Barnesandnoble.com web service, currency exchange service.
Interface Level: Amazon, Barnes&Noble and currency models.
Application/Conceptual Level: AmazonBN model.
Legend: model, mapping, query, input.]
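A combined bookshop model that relies on a currency exchange source can be sketched as below. This is an illustration under assumptions: the offers, ISBNs and exchange rates are made up, the function names are hypothetical, and no real Amazon or Barnes&Noble API is called.

```python
from typing import Dict, List

# Stand-in for the currency exchange service (made-up rates).
RATES_TO_EUR = {"EUR": 1.0, "USD": 0.5}


def exchange(amount: float, currency: str) -> float:
    return amount * RATES_TO_EUR[currency]


def unify(store: str, offers: List[dict]) -> List[dict]:
    """Map one shop's offers to a common form with prices in EUR."""
    return [{"store": store, "isbn": o["isbn"], "title": o["title"],
             "price_eur": exchange(o["price"], o["currency"])}
            for o in offers]


def cheapest(offers: List[dict]) -> Dict[str, dict]:
    """Pick the lowest priced offer per ISBN across all shops."""
    best: Dict[str, dict] = {}
    for offer in offers:
        current = best.get(offer["isbn"])
        if current is None or offer["price_eur"] < current["price_eur"]:
            best[offer["isbn"]] = offer
    return best


# Assumed sample answers of the two shop services.
amazon = [{"isbn": "111", "title": "Book A", "price": 20.0, "currency": "USD"},
          {"isbn": "222", "title": "Book B", "price": 30.0, "currency": "USD"}]
barnes_noble = [{"isbn": "111", "title": "Book A", "price": 9.0, "currency": "EUR"}]

combined = unify("Amazon", amazon) + unify("Barnes&Noble", barnes_noble)
best_offers = cheapest(combined)
```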
Integrating results of World Championships
[Figure: model levels for the World Championship integration.
Source Level: two web services. The 2002 WC service uses Match Id 0-63, the 2006 WC service uses Match Id 1-64; one reports the result as Score: n-m, the other as Score1: n and Score2: m.
Interface Level: 2002 WC and 2006 WC models.
Application Level: Unified WC matches (combination), Optimised WC matches (transformation).
Conceptual Level: Team matches and Teams (grouping), Positions and First Four by year (derivation).
Legend: model, mapping, query, input.]
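Resolving the technical differences between the two championship services can be sketched as two small converters into one unified form. The sketch assumes that the 2002 service is the one with 0-based match ids and a combined "n-m" score, and that the 2006 service has 1-based ids and separate scores; the field names and the sample scores are made up.

```python
from typing import Dict, List


def from_2002(row: Dict[str, object]) -> Dict[str, int]:
    """Assumed 2002 format: MatchId 0..63, Score as the string 'n-m'."""
    score1, score2 = (int(part) for part in str(row["Score"]).split("-"))
    return {"year": 2002, "match_no": int(row["MatchId"]) + 1,
            "score1": score1, "score2": score2}


def from_2006(row: Dict[str, object]) -> Dict[str, int]:
    """Assumed 2006 format: MatchId 1..64, Score1 and Score2 as numbers."""
    return {"year": 2006, "match_no": int(row["MatchId"]),
            "score1": int(row["Score1"]), "score2": int(row["Score2"])}


# Made-up sample rows, not real results.
wc2002 = [{"MatchId": 0, "Score": "2-1"}, {"MatchId": 63, "Score": "0-0"}]
wc2006 = [{"MatchId": 1, "Score1": 3, "Score2": 1}]

unified: List[Dict[str, int]] = (
    [from_2002(r) for r in wc2002] + [from_2006(r) for r in wc2006]
)
```

After this step every match is numbered 1 to 64 and carries two integer scores, so higher level models can group and derive results without caring which service a row came from.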
Integrating RSS-feeds
[Figure: model levels for the RSS integration.
Source Level: Origo.hu, Nol.hu, Index.hu and metro.hu RSS sources, and a VIP database.
Interface Level: origo, nol, index, metro and VIP models.
Application Level: Unified RSS-feeds (combination).
Conceptual Level: search for high level concepts (e.g. political conflicts) using the Text Annotator and the VIP concepts "members of opposition" and "members of government".
Legend: model, mapping, query, input.]
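The combination of feeds and the search for a high level concept can be sketched as follows. The rule used here (an item is a candidate for a political conflict if its title mentions at least one government member and at least one opposition member) is an assumption made for the example, as are the names in `VIPS`, the feed contents and the function names; the real Text Annotator is not shown.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Set

# Stand-in for the VIP database (made-up names).
VIPS: Dict[str, Set[str]] = {
    "government": {"Minister X"},
    "opposition": {"Deputy Y"},
}


def rss_items(feed_name: str, rss_text: str) -> List[dict]:
    """Extract item titles from an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    return [{"feed": feed_name, "title": item.findtext("title", default="")}
            for item in root.iter("item")]


def annotate(item: dict) -> dict:
    """Tag an item with the VIP groups whose members appear in its title."""
    groups = {group for group, names in VIPS.items()
              if any(name in item["title"] for name in names)}
    return {**item, "groups": groups}


def conflict_candidates(items: List[dict]) -> List[dict]:
    return [annotated for annotated in map(annotate, items)
            if {"government", "opposition"} <= annotated["groups"]]


# Assumed sample feeds.
feed_a = ("<rss version='2.0'><channel>"
          "<item><title>Minister X answers Deputy Y</title></item>"
          "<item><title>Weather forecast</title></item>"
          "</channel></rss>")
feed_b = ("<rss version='2.0'><channel>"
          "<item><title>Minister X opens a bridge</title></item>"
          "</channel></rss>")

unified_feed = rss_items("feed_a", feed_a) + rss_items("feed_b", feed_b)
candidates = conflict_candidates(unified_feed)
```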
Summary
• The system
– is a semantic information integration tool
– handles structured sources (relational), various semi-structured sources
and web services
– preprocesses various unstructured sources
• texts, maps, tables
– uses logic / constraint logic programming
– can be used in mashup creation
• disciplined and flexible approach to data access in mashups
• separates data integration from mashup presentation logic
• resolves semantic and technical differences in sources
Real estate search - Trulia
• A real estate search engine that helps you find
homes for sale and provides real estate
information at the local level to help you make
better decisions in the process. Trulia pulls in
real estate data from partnerships with
thousands of brokers and agents and displays it
on a Google Maps interface.
• Trulia shows you how sales prices have been
trending where it matters—in your county, city,
ZIP code and neighborhood. They also offer
heat maps and real estate guides.
• http://www.trulia.com/#start
Hotel Guide - Trivop
• The self-proclaimed first videoguide for hotels
doesn’t disappoint. Locate hotels on this Google
Maps + Hotel mashup and view user-created
videos of the hotels. This gives a much better
view of a prospective hotel before visiting.
• Currently it looks like they only have hotels in
England and France, but with their recruiting
efforts one can only assume Trivop will
be coming to a region near you.
• http://www.trivop.com
Visual Music search – Music Map
• Visual music search application mashed
with Amazon data. Choose an artist and
album, see related artists in an abstract
tree graph. Wicked.
• http://www.dimvision.com/musicmap/
Search for Popular Music –
Hype Machine
• The Hype Machine follows music blog discussion. Every day, hundreds of
people around the world write about music they love.
• The Hype Machine tracks a variety of MP3 blogs. If a post contains MP3
links, it adds those links to its database and displays them on the front page.
• Some of the frequently accessed tracks are cached by the Hype Machine
server, much like Google Search caches web pages, to reduce load on the
bloggers' servers and protect their bandwidth. Those tracks are NOT
available for download, but you can preview them via the "listen" links that
are next to each track or using your media player.
• The blog that posted a particular track is identified under every track by
name and with a "read post" link that leads to the blog post itself. If you
enjoyed a track someone posted, stop by and let them know!
• You can purchase CDs and individual tracks by using the "amazon" and
"itunes" links that appear next to most tracks. Each purchase you make via
the Amazon and iTunes links supports both the artists and the Hype
Machine. Please buy and enjoy.
• http://hypem.com/