
Semistructured data -- June 2001
Semistructured data:
from practice to theory
Serge Abiteboul
INRIA & Xyleme SA
[email protected]
http://www-rocq.inria.fr/verso
[email protected]
http://www.xyleme.com
1
2
Organization
• Motivations
• XML
• Typing XML
• Querying XML
• XML and the Web
• Illustrations: 2 problems
– Incomplete information
– Xyleme
• Conclusion
Motivations
3
4
Motivation: Complex data
• Structure is irregular (missing/extra data…)
• Schema does not exist or is unknown
• Schema is rapidly evolving
• Relational and ODB models are too rigid
• Examples: BibTeX, HTML, SGML, XML, ASN.1, STEP/Express…
5
Complex data: mediation
[Figure: User, Mediator (Ontology, meta-data), six wrappers, six Sources; one wrapper per Source]
Many data sources coming and going
6
Motivations: The Web today
• Terabytes of data
• Private web: not publicly available pages
• Deep web: data hidden behind forms
• A lot of public pages
• Standard is a document/hypertext language: HTML
7
The Web today
• Browsing
• Search engines
– in: list of words
– out: sorted list of URLs
• Applications: hand-made wrappers
– Expensive
– Incomplete
– Short-lived, not adapted to the Web's constant changes
[Raghavan ’00]
8
A new standard: XML
• HTML is not appropriate for data exchange on the Web
• Standard database models are too constraining for the Web
• The solution: a semistructured data model, XML
– Reminder: a data model consists of a type definition language, a query/update language + more
The most successful
semistructured data model: XML
9
10
The origin of XML
• Parents
– SGML
– Relational and OO databases
• SGML: markup language for documents
• HTML and the Web: billions of pages
• Not appropriate for data exchange
• XML: eXtensible Mark-up Language
– W3C and most industrial companies [B2B]
– Main idea: separate content and presentation
– Use tags to represent structure and semantics
11
XML: documents + databases
• HTML
– comes from SGML
– hypertext language
– fixed number of tags
– content and presentation are mixed
– very difficult to extract data from a page
– old standard for the Web
• XML
– also comes from SGML
– semistructured data
– number of tags not fixed
– content and presentation not mixed
– much easier to extract data
– new standard
12
HTML = Hypertext Language
Information System:
Ref   Name    Price
X23   Camera  359.99
R2D2  Robot   19350.00
Z25   PC      1299.99
HTML (text + presentation):
The <b> X23 </b> new camera replaces the <b> X22 </b>. It comes equipped with a flash (worth by itself <i>53.99 $</i>) and provides great quality for only <i>359.99 $</i>.
Extracting the data into the information system: hard
Where is the data?
13
XML = Semistructured Data
Information System:
Ref   Name    Price
X23   Camera  359.99
R2D2  Robot   19350.00
Z25   PC      1299.99
...
XML (data + structure):
<product-table>
  <product reference="X23">
    <designation> camera </designation>
    <price unit="Dollars"> 359.99 </price>
    <description> … </description>
  </product>
  <product reference="R2D2">
    <designation> Robot </designation>
    <price unit="Dollars"> 19350 </price>
    <description> … </description>
  </product>
  ...
</product-table>
Extracting the data into the information system: easy
Semistructured: more flexible
14
XML: example
<dealer>
<UsedCars>
<ad>
<model>Honda</model>
<year>96</year></ad>
</UsedCars>
<NewCars>
<ad>
<model>Acura</model>
</ad>
</NewCars>
<NewCars>
<ad>
<model>R406</model>
</ad>
</NewCars>
</dealer>
[Figure: the same document drawn as a tree]
dealer
  UsedCars
    ad
      model: Honda
      year: 96
  NewCars
    ad
      model: Acura
  NewCars
    ad
      model: R406
It is just an unranked, tagged, ordered tree
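The tree view can be checked with a few lines of Python. This is a minimal sketch using the standard `xml.etree.ElementTree` module on the dealer document; the helper name `to_tree` is ours and does not come from the talk.

```python
import xml.etree.ElementTree as ET

DEALER_XML = """
<dealer>
  <UsedCars><ad><model>Honda</model><year>96</year></ad></UsedCars>
  <NewCars><ad><model>Acura</model></ad></NewCars>
  <NewCars><ad><model>R406</model></ad></NewCars>
</dealer>
"""

def to_tree(elem):
    """Return (tag, text, children), with children kept in document order."""
    text = (elem.text or "").strip()
    return (elem.tag, text, [to_tree(child) for child in elem])

root = ET.fromstring(DEALER_XML)
tree = to_tree(root)

# Unranked: a node may have any number of children, and tags may repeat.
# Ordered: the children come back in the order they appear in the document.
child_tags = [child.tag for child in root]
print(child_tags)  # ['UsedCars', 'NewCars', 'NewCars']
```

The nested tuples are only one possible encoding; any structure that keeps the tag, the text and the ordered list of children would do.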
15
XML
• Tree or graph
• Data and structure/semantics are mixed
– Tags contain typing information
• Core constructor is list of tag/value pairs
• Details
– Each node may have an arbitrary number of children, whose tags need not be distinct
– Nodes also have attributes, which are unordered and unique per node
– Standard means to represent cyclic data: ID/IDREF(S) attributes
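As an illustration of references turning the tree into a graph, here is a small Python sketch. The document, the attribute names `id` and `partner`, and the helper `partner_name` are all hypothetical. Without a DTD, `xml.etree.ElementTree` treats these attributes as plain strings, so the references are resolved by hand.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: "id" plays the role of an ID attribute and
# "partner" the role of an IDREF attribute.
DOC = """
<people>
  <person id="p1" partner="p2"><name>Ann</name></person>
  <person id="p2" partner="p1"><name>Bob</name></person>
</people>
"""

root = ET.fromstring(DOC)

# Index the elements by identifier so that references can be followed.
by_id = {p.get("id"): p for p in root.iter("person")}

def partner_name(person):
    """Follow the reference stored in the 'partner' attribute."""
    return by_id[person.get("partner")].findtext("name")

ann = by_id["p1"]
print(partner_name(ann))                         # Bob
print(partner_name(by_id[ann.get("partner")]))   # Ann: the data is cyclic
```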
16
XML
• Very active/noisy field: standards
– types (DTD/XML Schema), style sheets (XSL), resource description (RDF...)
– DOM, SAX…
– WML (WAP), MathML, SMIL (multimedia), RSS (news), RDF (metadata)...
• How fast will XML conquer the web?
– so far rather slow (about 1% of the visible web now; much more in intranets); accelerating (e.g., with Explorer 5.5)
Typing XML
17
18
Typing XML
• This is heresy for the freedom of the Web
• Essential for data management: query
optimization, user interfaces, applications
• Differences with standard database typing
– Collections are sequences instead of sets
– Types may be very large (e.g., from integration)
– Data is more irregular, so types should be more permissive
– Sometimes new issues: given only the data, extract its type, or an approximate type
19
Intuition : the type is a tree
dealer
  UsedCars
    ad
      model: text
      year: text
  NewCars
    ad
      model: text
• Semantics and structure are in paths
– dealer/UsedCars/ad
– dealer/UsedCars/ad/model
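The set of label paths of a document gives a first, rough type for it. The following Python sketch collects them for the dealer document; the function name `label_paths` is ours, and this only illustrates the intuition, not a type inference method from the talk.

```python
import xml.etree.ElementTree as ET

DEALER_XML = """
<dealer>
  <UsedCars><ad><model>Honda</model><year>96</year></ad></UsedCars>
  <NewCars><ad><model>Acura</model></ad></NewCars>
  <NewCars><ad><model>R406</model></ad></NewCars>
</dealer>
"""

def label_paths(elem, prefix=""):
    """Yield the root-to-node label path of every element, in document order."""
    path = f"{prefix}/{elem.tag}" if prefix else elem.tag
    yield path
    for child in elem:
        yield from label_paths(child, path)

root = ET.fromstring(DEALER_XML)

# Duplicates collapse: the two NewCars subtrees contribute the same paths.
paths = sorted(set(label_paths(root)))
for p in paths:
    print(p)
```

Note that `dealer/UsedCars/ad/year` is among the paths while `dealer/NewCars/ad/year` is not: the path already records that only used-car ads carry a year.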
20
DTD: a grammar
Catalog → Product*
Product → Name Price? Cat (Part Quantity)*
Part → BasicPart + ComposedPart
BasicPart → Name
ComposedPart → Name (Part Quantity)*
• Nice and simple
• Shortcoming: type of an element is independent of
its context
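Validation against such a grammar amounts to matching each element's sequence of child tags against a regular expression. The sketch below does this in Python under two assumptions of ours: the `+` in the Part rule is read as a choice, and elements without a rule (Name, Price, Cat, Quantity) contain text only. The names `CONTENT_MODEL` and `is_valid` are illustrative.

```python
import re
import xml.etree.ElementTree as ET

# Content models written as regular expressions over the sequence of child
# tags. Each tag is followed by one space so that names cannot run together.
CONTENT_MODEL = {
    "Catalog": r"(Product )*",
    "Product": r"Name (Price )?Cat (Part Quantity )*",
    "Part": r"(BasicPart |ComposedPart )",   # assumption: "+" means choice
    "BasicPart": r"Name ",
    "ComposedPart": r"Name (Part Quantity )*",
}

def is_valid(elem):
    """Check every element's children against the rule for its tag."""
    word = "".join(child.tag + " " for child in elem)
    model = CONTENT_MODEL.get(elem.tag, r"")  # default: no element children
    if re.fullmatch(model, word) is None:
        return False
    return all(is_valid(child) for child in elem)

good = ET.fromstring(
    "<Catalog><Product><Name>bike</Name><Cat>sport</Cat>"
    "<Part><BasicPart><Name>wheel</Name></BasicPart></Part>"
    "<Quantity>2</Quantity></Product></Catalog>"
)
bad = ET.fromstring(
    "<Catalog><Product><Price>10</Price><Name>bike</Name>"
    "<Cat>sport</Cat></Product></Catalog>"
)
print(is_valid(good), is_valid(bad))  # True False
```

The rule applied to an element depends only on its tag, never on where the element occurs, which is exactly the shortcoming noted on the slide.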
21
More complex: specialization
• Type of ad depends on its context
• One way to view it: homomorphism
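[Figure: two type trees related by a homomorphism that maps adused and adnew to ad]
Specialized type:
dealer
  UsedCars
    adused
      model
      year
  NewCars
    adnew
      model
Its image under the homomorphism:
dealer
  UsedCars
    ad
      model
      year
  NewCars
    ad
      model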
22
Regular tree automata
• Set of accepted trees: regular tree languages
• Definable in monadic second-order logic
[Figure: a run of the automaton; each node is shown with its state]
dealer [q0]
  Used [p]
    ad [r]
      m [qf]
      y [qf]
    ad [r]
      m [qf]
      y [qf]
  New [q]
    ad [s]
      m [qf]
    ad [s]
      m [qf]
Acceptance: there is a computation such that all leaves are labeled qf
• Variants: top-down/bottom-up, nondeterminism, unranked trees
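To make the idea concrete, here is a minimal Python sketch of a bottom-up automaton on unranked trees. It is one of the variants, not a transcription of the automaton drawn on the slide. Each rule reads "a node with this tag, whose children's states spell a word of this regular expression, may take this state". The rule set, the state names and the functions `states` and `accepts` are our assumptions.

```python
import re
from itertools import product
import xml.etree.ElementTree as ET

# (tag, regular expression over the children's states, resulting state).
# The two rules for "ad" give it a different state depending on its content,
# which lets UsedCars and NewCars require different kinds of ads.
RULES = [
    ("model", r"", "m"),
    ("year", r"", "y"),
    ("ad", r"m y ", "used_ad"),
    ("ad", r"m ", "new_ad"),
    ("UsedCars", r"(used_ad )*", "used"),
    ("NewCars", r"(new_ad )*", "new"),
    ("dealer", r"(used |new )*", "ok"),
]

def states(elem):
    """Return the set of states the automaton can assign to elem."""
    child_state_sets = [states(child) for child in elem]
    result = set()
    # Nondeterminism: try every way of picking one state per child.
    for choice in product(*child_state_sets):
        word = "".join(s + " " for s in choice)
        for tag, child_regex, state in RULES:
            if tag == elem.tag and re.fullmatch(child_regex, word):
                result.add(state)
    return result

def accepts(root):
    """A tree is accepted if the root can reach the final state 'ok'."""
    return "ok" in states(root)

good = ET.fromstring(
    "<dealer><UsedCars><ad><model>Honda</model><year>96</year></ad></UsedCars>"
    "<NewCars><ad><model>Acura</model></ad></NewCars></dealer>"
)
bad = ET.fromstring(
    "<dealer><NewCars><ad><model>Acura</model><year>01</year></ad></NewCars></dealer>"
)
print(accepts(good), accepts(bad))  # True False
```

Enumerating all state choices is exponential in the worst case; it keeps the sketch short but is not how one would implement validation at scale.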
23
DTDs+specialization
Result: DTDs + specialization = regular tree languages
• Closure (intersection, union, complement)
• Tests for validation, inclusion
• Static analysis
24
Situation today
• Many people are using DTDs
– Nice and simple in spite of ugly syntax
• New proposal: XML Schema
– More powerful but too complicated?
• Other proposals: RELAX, TREX
– Usually based on some kind of regular tree automata
• From experience: one will win, and not necessarily the best
Query languages for XML
25
26
Query Languages for XML
• Extensions of SQL
– First-order logic
– Information retrieval: keyword search
– Navigation via regular expressions + pattern matching
– Examples: Lorel, XML-QL, XMAS…
• Structural recursion
– Examples: UnQL, XSLT…
• No official winner; the leader is XQuery
27
Pattern matching
• Tree with variables and constraints
• Pattern matching between the query and the data
• Each match provides a valuation for X, Y, Z
[Figure: the query pattern]
catalog
  product
    name: X
    price: Y, with Y < 200
    cat = elec
      subcategory: Z
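The pattern above can be evaluated by hand in Python, as a sketch. The XML encoding of the catalog is our assumption (subcategory nested under cat, as in the pattern). The first three products use the values shown later in the talk, and the fourth is invented to show a product that does not match.

```python
import xml.etree.ElementTree as ET

CATALOG = """
<catalog>
  <product><name>canon</name><price>120</price><cat>elec<subcategory>camera</subcategory></cat></product>
  <product><name>nikon</name><price>199</price><cat>elec<subcategory>camera</subcategory></cat></product>
  <product><name>sony</name><price>175</price><cat>elec<subcategory>cdplayer</subcategory></cat></product>
  <product><name>hypothetical-tv</name><price>450</price><cat>elec<subcategory>tv</subcategory></cat></product>
</catalog>
"""

def matches(catalog):
    """Yield one valuation {X, Y, Z} per product that matches the pattern."""
    for prod in catalog.findall("product"):
        name = prod.findtext("name")
        price = prod.findtext("price")
        cat = prod.find("cat")
        if name is None or price is None or cat is None:
            continue  # irregular data: the pattern simply does not match
        sub = cat.findtext("subcategory")
        is_elec = (cat.text or "").strip() == "elec"
        if is_elec and sub is not None and float(price) < 200:
            yield {"X": name, "Y": float(price), "Z": sub}

root = ET.fromstring(CATALOG)
valuations = list(matches(root))
for v in valuations:
    print(v)
```

Each dictionary is one valuation of the variables; a query language would then build the answer from these valuations.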
28
Example in Lorel
select <offer> Z/name, P/name, P’/price </offer>
from P in catalog/product,
Z in discount_stores/store,
P’ in Z/storecatalog/product
where P/category = "camera" and P/make = "canon" and
P’/id = P/id
• Joins like in relational databases
• Construction of complex results
• Regular expressions for paths (e.g., W/*/name = "Gates")
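For readers more at home with programming languages than with Lorel, the join in this query can be mimicked with a Python comprehension. All data below is hypothetical, and nested dictionaries stand in for the semistructured sources; this is a sketch of the query's meaning, not of how a Lorel engine evaluates it.

```python
# Hypothetical data for the two sources named in the query.
catalog = [
    {"id": 1, "name": "EOS", "category": "camera", "make": "canon"},
    {"id": 2, "name": "F80", "category": "camera", "make": "nikon"},
]
discount_stores = [
    {"name": "ShopA", "storecatalog": [{"id": 1, "price": 450.0},
                                       {"id": 2, "price": 300.0}]},
    {"name": "ShopB", "storecatalog": [{"id": 1, "price": 430.0}]},
]

# One loop per variable of the from clause (P, Z, P'); the condition is the
# where clause; the dictionary built for each match is the <offer> element.
offers = [
    {"store": z["name"], "product": p["name"], "price": p2["price"]}
    for p in catalog
    for z in discount_stores
    for p2 in z["storecatalog"]
    if p["category"] == "camera" and p["make"] == "canon" and p2["id"] == p["id"]
]
for offer in offers:
    print(offer)
```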
29
What is new in XML queries
• A bit new: limited recursion (like in deductive
databases)
• A bit new but no big deal: constructed answers
(like in OODB)
• Very new: ordered data
• Bothersome
– Theoretical base is a bit messy: FO, tree automata, bisimulation
– No yardstick like relational calculus/algebra
30
Proposal : k-pebble transducers
[Figure: a k-pebble transducer and its stack]
[Milo, Suciu, Vianu]
31
k-pebble transducers: result
[Figure: a tree with a node labeled root and nodes labeled a, b and c]
XML and the Web
32
33
Why it is the same old story
• Massive amounts of data
• Providers export data, users access data
• Query languages, indexing, optimization
• Database paradigm: still effective on the Web
34
Why it is not the same old story
Databases
• rigid structure
• transactions, concurrency control
• data independence
• controlled (e.g., known cost model)
• coherent system, very polished artifact
The Web
• flexible, no schema
• flexible protocols
• fuzzy separation
• perfect mess (and that’s why people like it?)
• closer to a natural ecosystem!
35
The principles of the Web
• The uncertainty principle: you can never be sure of
anything or that the data is consistent
• The incompleteness principle: they do not give you
all the data you want (but some you don’t want :-)
• The chaos principle: you can rarely assume the
existence of some global schema
• The instability principle: everything keeps changing
Every piece of data you get is probably wrong, incomplete, does not conform to its expected type, and is probably already stale
36
What can be reused?
• Some technology? indexes, B-trees, distributed
query processing (concurrency control and
transactions not yet)
• Database theory? little
– Algebra and rewrite rules for optimization
– Dependency theory
– First order and other logics
– It seems that the ordering opens the gates for many more tools, such as regular/tree languages
37
Metaphor [AV]: the Web is infinite
• What are the pages pointing to my
homepage?
– Google solution: milliseconds – stale data
– Freeze the Web: weeks to get exact answer
– Exact answer: no means to get it
• Leads us to reconsider the notion of computation
38
Computability
• Finitely computable: the answer is given in finite time
– All pages reached from my homepage (HP) in fewer than 3 links
• Eventually computable: each solution is given in finite time; the computation may be infinite
– All pages reached from my HP
• Not computable
– Can my HP be reached starting from my HP?
• Also: approximate, partial, stale, pipelined answers
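The first two notions can be illustrated with a Python generator over a toy infinite Web. The link structure (page n links to pages 2n and 2n + 1) and the names `links` and `reachable` are assumptions made only for this sketch.

```python
from collections import deque
from itertools import islice

def links(page):
    """Toy infinite Web (our assumption): page n links to 2n and 2n + 1."""
    return [2 * page, 2 * page + 1]

def reachable(start, max_links=None):
    """Breadth-first generator of the pages reachable from start.

    With max_links set, the loop ends: a finitely computable query.
    With max_links=None it never ends on this Web, yet every reachable page
    is produced after finitely many steps: an eventually computable query.
    """
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, dist = queue.popleft()
        yield page
        if max_links is not None and dist >= max_links:
            continue  # do not follow links beyond the bound
        for nxt in links(page):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))

# Fewer than 3 links from page 1 (the start page itself counts as 0 links).
within_two = list(reachable(1, max_links=2))
print(within_two)  # [1, 2, 3, 4, 5, 6, 7]

# The unbounded query can only be consumed as a stream; take a prefix.
first_ten = list(islice(reachable(1), 10))
print(first_ten)   # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

Calling `list(reachable(1))` without a bound would never return, which is the practical face of an answer that is only eventually computable.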
39
Tough life: the Web is huge
• Relational calculus/algebra: logspace data complexity (also AC0)
• What is the data complexity of an XQuery query over the Web?
• Complexity of computing on the Web
– Logspace in the Web?
– Need to trade quality for performance
40
The Web keeps changing
• Classical: versions, temporal queries
• Less classical: monitoring of the Web
[Xyleme]
– Smart crawling of the Web: flow of docs
– Query subscription: query on this flow
– Continuous queries
• What is the underlying theory?
Illustration: incomplete
information
Work with Victor Vianu
41
42
Example
Access to an electronic catalog
Q1: name, subcat, price of electronic products with price less than $200
Q2: name, pictures of cameras pictured at least once
43
[Figure: what is known of the catalog after Q1]
catalog
  product: canon, 120, elec, camera
  product: nikon, 199, elec, camera
  product: sony, 175, elec, cdplayer
  missing: product* (of type product1 or product2)
Q1: name, subcat, price of electronic products with price less than 200
44
Missing data after Q1
product1: name, price (>200), cat (=elec, with subcategory), picture*
product2: name, price, cat (!=elec, with subcategory), picture*
45
[Figure: what is known of the catalog after Q1 and Q2]
Known products (name, price, cat, subcategory, picture):
canon  120  elec  camera    c.jpg
nikon  199  elec  camera
sony   175  elec  cdplayer
akai        elec  camera    a.jpg
Missing: product* of types product1, product2a, product2b, product2c, product3
Q2: name, pictures of cameras pictured at least once
Missing data
46
[Figure: types of the missing data after Q1 and Q2. The types product1, product2a, product2b, product2c and product3 are trees over name, price, cat, picture and subcategory, with constraints such as price >200, cat =elec or !=elec, subcategory !=camera, and "no picture"; the known data is shown separately]
47
After two queries
• Known information:
– Prefix of the real data tree
• Missing information
– Complex type
• Q3: name, price, pictures of cameras costing less than $100 and pictured at least once
– can be completely answered using A1, A2 (the answers to Q1 and Q2)
• Q4: list all cameras
– can only be partially answered using A1, A2
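Why Q3 is fully answerable while Q4 is not can be seen on the example data. The Python sketch below rests on two assumptions of ours: a product is identified by its name, and every camera has cat = elec. The functions `q3` and `q4_known` are illustrative names.

```python
# Answers already obtained, with the values shown on the slides.
A1 = [  # Q1: name, subcat, price of electronic products with price < 200
    {"name": "canon", "subcat": "camera", "price": 120},
    {"name": "nikon", "subcat": "camera", "price": 199},
    {"name": "sony", "subcat": "cdplayer", "price": 175},
]
A2 = [  # Q2: name, pictures of cameras pictured at least once
    {"name": "canon", "pictures": ["c.jpg"]},
    {"name": "akai", "pictures": ["a.jpg"]},
]

pictures = {row["name"]: row["pictures"] for row in A2}

def q3(limit=100):
    """Pictured cameras cheaper than limit.

    Complete as long as limit <= 200: every such camera is in A1 (cheap
    electronic product) and in A2 (pictured camera), so joining them loses
    nothing.
    """
    return [
        {"name": r["name"], "price": r["price"], "pictures": pictures[r["name"]]}
        for r in A1
        if r["subcat"] == "camera" and r["price"] < limit and r["name"] in pictures
    ]

def q4_known():
    """Cameras we know of. Cameras costing 200 or more and never pictured
    appear in neither A1 nor A2, so this answer is only partial."""
    names = {r["name"] for r in A1 if r["subcat"] == "camera"} | set(pictures)
    return sorted(names)

print(q3())        # [] : no pictured camera under 100, and that is certain
print(q3(150))     # the canon, with its price and picture
print(q4_known())  # ['akai', 'canon', 'nikon']
```

The empty answer to Q3 is itself complete information: the local data is enough to know that no such camera exists in the source.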
Illustration: Xyleme
48
49
A dynamic warehouse of Web data
• Warehouse
– Xyleme stores huge quantities of data (terabytes)
– Xyleme is not a search engine (which keeps only an index) or a mediator (which handles only virtual data)
• XML
– Xyleme is focused on XML
• Dynamic
– Xyleme is interested in data evolution/changes
50
Technical Challenges
1. Data Acquisition and Maintenance: discover data of interest and keep it up to date
2. Repository: store and index this data
3. Efficient Query Processing
4. Semantic Integration: provide a simple view of each semantic domain
5. Change Control: monitor the web and offer services such as Query Subscription
51
Technical challenges
• Scale to the web
• Size of data: billions of pages
• Size of index: terabytes
• Number of customers
– thousands of simultaneous queries
– millions of subscriptions
52
Web Heterogeneity
• Semantic domains, e.g., cinema
• Many possible types for data in this domain,
many DTDs
• Semantic Integration
– one abstract DTD for the domain
– gives the illusion that the system maintains a homogeneous database for this domain
1 domain = 1 abstract DTD
53
Discover the Domains
Cluster DTDs sharing similar « tags », using data mining techniques (frequent item sets) and linguistic tools (e.g., thesaurus, heuristics to extract words from composite words or abbreviations, etc.), to obtain domains
[Figure: many concrete DTDs (cdtd1 to cdtd10) grouped into fewer abstract DTDs (adtd1, adtd2, adtd4)]
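As a rough illustration of grouping DTDs by shared tags, the Python sketch below clusters tag sets greedily by Jaccard similarity. This is a deliberately simpler technique swapped in for the frequent item sets and linguistic tools mentioned on the slide; the DTDs, the threshold and the function names are all hypothetical.

```python
def jaccard(a, b):
    """Similarity of two tag sets: size of intersection over size of union."""
    return len(a & b) / len(a | b)

# Hypothetical concrete DTDs, reduced to their sets of tags.
CONCRETE_DTDS = {
    "cdtd1": {"movie", "title", "director", "actor"},
    "cdtd2": {"film", "title", "director", "cast"},
    "cdtd3": {"painting", "title", "painter", "museum"},
    "cdtd4": {"artwork", "painter", "museum", "year"},
}

def cluster(dtds, threshold=0.3):
    """Greedy single pass: join the first cluster that is similar enough."""
    clusters = []  # list of (tag set of the cluster, member names)
    for name, tags in dtds.items():
        for cluster_tags, members in clusters:
            if jaccard(tags, cluster_tags) >= threshold:
                cluster_tags |= tags      # grow the cluster's vocabulary
                members.append(name)
                break
        else:
            clusters.append((set(tags), [name]))
    return [members for _, members in clusters]

domains = cluster(CONCRETE_DTDS)
print(domains)  # [['cdtd1', 'cdtd2'], ['cdtd3', 'cdtd4']]
```

A thesaurus would additionally let "movie" and "film" count as the same tag, which plain set overlap cannot do.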
54
Answering queries
• Choose an abstract DTD (ADTD)
– Automatically, manually, hybrid
• For each concrete DTD in a domain
– Find how it relates to the abstract DTD
– Mappings between paths in both
• Distributed query processing (cluster of PCs)
– Many concrete DTDs; often not possible to compute a static execution plan
– Dynamic generation of execution plans [Cluet et al]
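The role of path mappings can be sketched as follows in Python: a query phrased on the abstract DTD is rewritten, per concrete DTD, into that DTD's own paths. The mappings, the path names and the rule that a concrete DTD takes part only if it maps every path of the query are our simplifications, not a description of the Xyleme implementation.

```python
# Hypothetical mappings from abstract paths to concrete paths, one dictionary
# per concrete DTD of the domain.
MAPPINGS = {
    "cdtd1": {
        "cinema/movie/title": "movies/movie/title",
        "cinema/movie/director": "movies/movie/director",
    },
    "cdtd2": {
        "cinema/movie/title": "films/film/name",
    },
}

def translate(abstract_paths):
    """Rewrite the query's abstract paths for each concrete DTD.

    Simplification: a concrete DTD is kept only if it maps every path used
    by the query; otherwise it is skipped.
    """
    plans = {}
    for dtd, mapping in MAPPINGS.items():
        if all(p in mapping for p in abstract_paths):
            plans[dtd] = [mapping[p] for p in abstract_paths]
    return plans

print(translate(["cinema/movie/title"]))
print(translate(["cinema/movie/title", "cinema/movie/director"]))
```

Because the set of participating concrete DTDs depends on the query, the plan has to be produced per query, which is in line with the slide's remark on dynamic plan generation.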
Conclusion
55
56
One Question Only
• The web is turning from a large collection of documents into a huge knowledge base
When will I be able to get the precise knowledge I need?
Database + Knowledge Base + Linguistics + ...