Transcript WWW06

Symmetrically Exploiting XML
Shuohao Zhang and Curtis Dyreson
School of E.E. and Computer Science
Washington State University
Pullman, Washington, USA
The 15th International World Wide Web Conference
May 2006
Edinburgh, Scotland
1970’s Database Controversy
• Hierarchical model vs. relational model
• Codd: symmetric exploitation of data

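[Figure: alternative hierarchies over the same data: Part over Project; Project over Part; Commit over Project and Part]
The path part/project works on some of these structures, but not all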
• Path expressions are asymmetric
• Currently, all XML query languages use path expressions
Symmetrically Exploiting XML: Zhang, Dyreson
Querying Data with Path Expressions
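[Figure: data tree for author.xml. An author node has a name child ("E. F. Codd") and two book children; each book has title, publisher, and price children. The titles are "DB" and "Automata", the publishers Addison Wesley and Academic Press, and the prices 46.95 and 9.99.]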
• Task
    Find books by E. F. Codd
• XQuery
    return doc("author.xml")//author[name= 'E. F. Codd']/book
Same Data, Different Structure
[Figure: two data trees holding the same data. In the first, an author node ("E. F. Codd") has name and book children, and each book has title, publisher, and price children. In the second, each book has title, author, price, and publisher children, and each author has a name child.]
• Same task
    Find books by E. F. Codd
• Need different XQuery
    return doc("book.xml")//book[author/name='E. F. Codd']
Goal
• Make same query work on different structures
• Useful when there is
    lack of schema knowledge
    heterogeneous data
    irregular data
    schema evolution
• Factor out the problem of differing label sets (others are working on it)
Existing Axes are Directional
[Figure: the ancestor, descendant, preceding, and following axes around the context node (self)]
Proposal: A Non-directional Axis
[Figure: the ancestor, descendant, preceding, and following axes around the context node (self)]
The Closest Axis
• Syntax
    closest::
    ->name is an abbreviation for closest::name
• Semantics
    a function that takes a context node and returns a sequence of closest nodes
Closest Axis of the First Title
[Figure: data tree. An author node has a name child and two book children; each book has title, publisher, and price children.]
• closest::*
    Returns a list of five nodes
• closest::price
    Returns the first price node
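The two results above can be reproduced with a small brute-force sketch. It is not the authors' implementation. It assumes that a node's type is its root-to-node label path, that distance is the number of edges on the tree path between two nodes, and that a node is closest when its distance from the context node equals the minimal distance between any two nodes of the two types. Leaving the context node out of its own result is a further assumption, made here so that the count matches the five nodes on the slide. The names chains, node_type, distance, and closest are hypothetical.

```python
import xml.etree.ElementTree as ET

# Tree shape copied from the slide; elements are left empty because the
# slide shows labels only.
DOC = """
<author>
  <name/>
  <book><title/><publisher/><price/></book>
  <book><title/><publisher/><price/></book>
</author>
"""

def chains(root):
    """Map each element to its list of ancestors-or-self, root first."""
    out = {}
    def walk(elem, above):
        here = above + [elem]
        out[elem] = here
        for child in elem:
            walk(child, here)
    walk(root, [])
    return out

def node_type(chain):
    """Assumed notion of type: the root-to-node path of labels."""
    return tuple(e.tag for e in chain)

def distance(chain_a, chain_b):
    """Edges on the tree path between two nodes, via their deepest shared ancestor."""
    shared = 0
    for a, b in zip(chain_a, chain_b):
        if a is not b:
            break
        shared += 1
    return len(chain_a) + len(chain_b) - 2 * shared

def closest(root, context, name="*"):
    """Brute-force closest axis: keep nodes whose distance from the context
    node equals the minimal distance between the two node types."""
    all_chains = chains(root)
    ctx = all_chains[context]
    ctx_type = node_type(ctx)
    type_dist = {}  # target type -> minimal distance from the context type
    for a in all_chains.values():
        if node_type(a) != ctx_type:
            continue
        for b in all_chains.values():
            t, d = node_type(b), distance(a, b)
            type_dist[t] = min(d, type_dist.get(t, d))
    return [
        node for node in root.iter()          # document order
        if node is not context                # assumption: self is excluded
        and name in ("*", node.tag)
        and distance(ctx, all_chains[node]) == type_dist[node_type(all_chains[node])]
    ]

root = ET.fromstring(DOC)
first_title = root.find("book/title")
print([n.tag for n in closest(root, first_title)])
print(closest(root, first_title, "price")[0] is root.find("book/price"))
```

Under these assumptions the first call lists author, name, the first book, and that book's publisher and price; the second book and its children are farther away than the minimal distance for their types, so they are dropped.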
When the First Book Lacks a Price
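[Figure: data tree. An author node has a name child and two book children; the first book has title and publisher children, the second has title, publisher, and price children.]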
• Node selection restricted by minimal type distance
    The minimal distance between a title and a price is 2
• closest::price
    Returns an empty list
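The empty result can be checked with a few lines of arithmetic. The sketch below is illustrative only: it labels each node of the slide's tree with a Dewey-style position (an assumption, not something the slides mention), takes the root-to-node label path as the node's type, and measures distance through the longest shared label prefix. The names NODES, distance, type_distance, and closest are hypothetical.

```python
# Dewey-style label -> root-to-node label path, mirroring the slide's tree:
# the first book (1, 2) has no price, the second book (1, 3) has one.
NODES = {
    (1,): ("author",),
    (1, 1): ("author", "name"),
    (1, 2): ("author", "book"),
    (1, 2, 1): ("author", "book", "title"),
    (1, 2, 2): ("author", "book", "publisher"),
    (1, 3): ("author", "book"),
    (1, 3, 1): ("author", "book", "title"),
    (1, 3, 2): ("author", "book", "publisher"),
    (1, 3, 3): ("author", "book", "price"),
}

TITLE = ("author", "book", "title")
PRICE = ("author", "book", "price")

def distance(a, b):
    """Edges between two nodes: climb to the shared prefix, then descend."""
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    return len(a) + len(b) - 2 * shared

def type_distance(t1, t2):
    """Minimal distance between any node of type t1 and any node of type t2."""
    return min(
        distance(a, b)
        for a, ta in NODES.items() if ta == t1
        for b, tb in NODES.items() if tb == t2
    )

def closest(context, label):
    """Nodes with the given label whose distance equals the type distance."""
    ctx_type = NODES[context]
    return [
        n for n, t in NODES.items()
        if n != context and t[-1] == label
        and distance(context, n) == type_distance(ctx_type, t)
    ]

print(type_distance(TITLE, PRICE))   # distance inside the second book
print(closest((1, 2, 1), "price"))   # first title
print(closest((1, 3, 1), "price"))   # second title
```

The only price sits four edges from the first title, which exceeds the type distance of 2, so the first title gets no price rather than borrowing the second book's.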
Type Distance is Crucial
• closest::name for each book?
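[Figure: data tree. An author node has a name child and two book children; the first book has title and publisher children, the second has title, publisher, and price children; a second name node sits under a publisher.]
• Root-to-node path type
    author/name
    author/book/publisher/name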
Querying with the Closest Axes
Same query: return doc("any.xml")->author[->name='E. F. Codd']->book
[Figure: one query, sent to a closest axis-enabled XQuery evaluation engine, yields Result#1, Result#2, and Result#3]
Querying with Directional Axes
Query#1 -- return doc("author.xml")//author[name= 'E. F. Codd']/book
Query#2 -- ……
Query#3 -- return doc("book.xml")//book[author/name='E. F. Codd']
[Figure: three separate queries are sent to the XQuery evaluation engine, which yields Result#1, Result#2, and Result#3]
In-memory Implementation
• Naïve approach
    Compute Closest for every node
    Time complexity is O(sn^2)
        s: number of labels in the signature
        n: number of nodes
• Converting to a path expression
    Find the closest price for title
[Figure: data tree. An author node has name and book children; the book has title, publisher, and price children.]
    Non-directional expression: closest::price
    Directional (path) expression: parent::*/child::price
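One way to picture the conversion is to work on the two root-to-node path types alone. The sketch below is an assumption about how such a rewrite could go, not the authors' algorithm: given the source type, the target type, and the type distance between them, it works out how many parent steps lead up to the shared ancestor and which child steps lead back down. The function name to_path_expression is hypothetical.

```python
def to_path_expression(src_type, dst_type, type_dist):
    """Rewrite closest::<last label of dst_type> as a directional path.

    src_type and dst_type are root-to-node label paths; type_dist is the
    minimal distance between nodes of those two types in the data.
    """
    total = len(src_type) + len(dst_type) - type_dist
    if total < 0 or total % 2:
        raise ValueError("type distance is inconsistent with the two types")
    k = total // 2  # number of labels shared on the way to the common ancestor
    if k > min(len(src_type), len(dst_type)) or src_type[:k] != dst_type[:k]:
        raise ValueError("types do not share an ancestor at that depth")
    steps = ["parent::*"] * (len(src_type) - k)               # climb
    steps += [f"child::{label}" for label in dst_type[k:]]    # descend
    return "/".join(steps) or "self::node()"

print(to_path_expression(
    ("author", "book", "title"),
    ("author", "book", "price"),
    2,
))
```

For the title and price types on the slide, with a type distance of 2, this yields parent::*/child::price, matching the directional expression shown above.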
Experiment
• Compare directional vs. non-directional
for $b in doc("bib.xml")//title/closest::publisher
return $b
for $b in doc("bib.xml")//title/..//publisher
return $b
• Implemented closest in eXist (an XML DBMS)
[Figure: query time in milliseconds (0 to 1600) against number of nodes (0 to 150000) for the descendant and closest axes]
Persistent Implementation
• Take advantage of type indexes
• LCA-join
    Every Closest pair is related via an LCA
    Idea is to merge lists of types
    O(sn)
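The slides do not spell out the LCA-join, so the following is only an illustration of the merging idea, not the authors' algorithm. It assumes each type index is a document-ordered list of Dewey-style labels (an assumption), and pairs up labels from two lists that share an ancestor at a given depth, trying the deepest candidate depth first. The names lca_join and closest_pairs, and the max_depth parameter, are hypothetical.

```python
def lca_join(left, right, k):
    """Merge two document-ordered label lists, pairing labels that share
    their first k components (that is, an ancestor at depth k)."""
    pairs, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lkey, rkey = left[i][:k], right[j][:k]
        if lkey < rkey:
            i += 1
        elif lkey > rkey:
            j += 1
        else:
            # Find the run of right-hand labels under this ancestor ...
            stop = j
            while stop < len(right) and right[stop][:k] == lkey:
                stop += 1
            # ... and pair it with every left-hand label under the same one.
            while i < len(left) and left[i][:k] == lkey:
                pairs.extend((left[i], r) for r in right[j:stop])
                i += 1
            j = stop
    return pairs

def closest_pairs(left, right, max_depth):
    """Try the deepest candidate ancestor depth first; the first depth that
    yields any pair corresponds to the minimal distance for the two types."""
    for k in range(max_depth, 0, -1):
        pairs = lca_join(left, right, k)
        if pairs:
            return pairs
    return []

# Type indexes for the tree in which the first book lacks a price.
TITLES = [(1, 2, 1), (1, 3, 1)]
PRICES = [(1, 3, 3)]

# max_depth is the length of the label prefix the two types share
# (author/book for title and price).
print(closest_pairs(TITLES, PRICES, max_depth=2))
```

Here only the second title is paired with the price, because the two share a book ancestor; the first title finds no partner at that depth and is left out, in line with the empty closest::price result shown earlier.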

Related Work
• Data integration
    TSIMMIS: Garcia-Molina et al. (Journal of Intelligent Information Systems 1997)
    YAT: Christophides, Cluet, Siméon (SIGMOD Record June 2000)
    SilkRoute: Fernandez, Tan, Suciu (WWW 2000)
• LCA-related techniques
    Schmidt, Kersten, Windhouwer (ICDE 2001)
    Cohen, Mamou, Kanza, Sagiv (VLDB 2003)
    Li, Yu, Jagadish (VLDB 2004)
Related Research Projects
• XML Restructuring
    Zhang, Dyreson (IIWeb 2006)
• XML Compaction
    Zhang, Dyreson, Dang (DASFAA 2006)
• Common theme – symmetric exploitation!
Conclusion
• Current XQuery depends on path expressions
• A path expression is directional (asymmetric)
    May break down if structure changes
• The closest axis is non-directional (symmetric)
    Simple in syntax
    Can be easily integrated in XQuery
    Can be implemented efficiently
        In-memory
        Persistent
Thank You!