MonetDB/Pathfinder: XQuery on top of a relational DBMS

MonetDB/XQuery:
Using a Relational DBMS for XML
Peter Boncz
CWI
The Netherlands
Outline
• Basic XML / XQuery
• Introduction of Pathfinder and MonetDB projects
• Relational XQuery
– XPath steps in the pre/post plane
– Translating for-loops, and beyond
• Optimizations
– Order prevention
– Loop-Lifted Staircase join
– Join recognition
• Outlook
– Conclusions
Peter Boncz
Pathfinder - MonetDB/XQuery
TU Delft 10-5-2005
XML
• Standard, flexible syntax for data exchange
– Regular, structured data
Database content of all kinds: Inventory, billing, orders, …
“Small” typed values
– Irregular, unstructured text
Documents of all kinds: Transcripts, books, legal briefs, …
“Large” untyped values
• Lingua franca of B2B Applications…
– Increase access to products & services
– Integrate disparate data sources
– Automate business processes
• … and numerous other application domains
– Bio-informatics, library science, …
XML : A First Look
• XML document describing catalog of books
<?xml version="1.0" encoding="ISO-8859-1" ?>
<catalog>
<book isbn="ISBN 1565114302">
<title>No Such Thing as a Bad Day</title>
<author>Hamilton Jordan</author>
<publisher>Longstreet Press, Inc.</publisher>
<price currency="USD">17.60</price>
<review>
<reviewer>Publisher</reviewer>: This book is the moving
account of one man's successful battles against three
cancers ... <title>No Such Thing as a Bad Day</title> is
warmly recommended.
</review>
</book>
<!-- more books and specifications -->
</catalog>
XQuery 1.0
• Functional, strongly-typed query language
• XQuery 1.0 =
XPath 2.0 for navigation, selection, extraction
+ A few more expressions
For-Let-Where-Order By-Return (FLWOR)
XML construction
Operators on types
+ User-defined functions & modules
+ Strong typing
XSLT vs. XQuery
• XSLT 1.0: XML → XML, HTML, Text
– Loosely-typed scripting language
– Format XML in HTML for display in browser
– Must be highly tolerant of variability/errors in data
• XQuery 1.0: XML → XML
– Strongly-typed query language
– Large-scale database access
– Must guarantee safety/correctness of operations on data
• Over time, XSLT & XQuery may both serve needs of
many application domains
• XQuery will become a hidden, commodity language
Navigation, Selection, Extraction
• Titles of all books published by Longstreet Press
$cat/catalog/book[publisher="Longstreet Press"]/title
<title>No Such Thing as a Bad Day</title>
• Publications with Jerome Simeon as author or editor
$cat//*[(author|editor) = "Jerome Simeon"]
<book><title>XQuery from the Experts</title>…</book>
<spec><title>XQuery Formal Semantics</title>…</spec>
Transformation & Construction
• First author & title of books published by A/W
for $b in $cat//book[publisher = "Addison Wesley"]
return <awbook> { $b/author[1], $b/title } </awbook>
<awbook>
<author>Don Chamberlin</author>
<title>XQuery from the Experts</title>
</awbook>
Sequences & Iteration
• Sequence constructor
Return all books followed by all W3C specifications
($cat/catalog/book, $cat/catalog/W3Cspec)
• XPath Expression
Return all books & W3C specifications in doc order
$cat/catalog/(book|W3Cspec)
• For Expression
– Similar to map : apply function to each item in sequence
Return number of authors in each book
for $b in $cat/catalog/book
return fn:count($b/author)
=> (3,1,2,…)
Conditional & Quantified
• Conditional
if (//show[year >= 2000]) then "A-OK!" else "Error!"
• Existential quantification
– Implicit meaning of predicate expressions
//show[year >= 2000]
– Explicit expression:
//show[some $y in ./year satisfies $y >= 2000]
• Universal quantification
//show[every $y in year satisfies $y >= 2000]
Putting It Together
• For each author, return the number of books and the receipts of books published in the past 2 years, ordered by name
let $cat := fn:doc("www.bn.com/catalog.xml"),
    $sales := fn:doc("www.publishersweekly.com/sales.xml")
for $author in distinct-values($cat//author)
let $books := $cat//book[@year >= 2000 and author = $author],
    $receipts := $sales/book[@isbn = $books/@isbn]/receipts
order by $author
return
<sales>
{ $author }
<count> { fn:count($books) } </count>
<total> { fn:sum($receipts) } </total>
</sales>
Slide callouts on the query: Join, Grouping, S.J., Ordering, XML Construction, Aggregation
Recursive Processing
• Recursive functions support recursive data
<part id="001">
  <part id="002">
    <part id="003"/>
  </part>
  <part id="004"/>
</part>
=>
<partCt count="2" id="001">
  <partCt count="1" id="002">
    <partCt count="0" id="003"/>
  </partCt>
  <partCt count="0" id="004"/>
</partCt>
declare function partCount($p as element(part))
as element(partCt) {
<partCt count="{ count($p/part) }">
{ $p/@id, for $p2 in $p/part return partCount($p2) }
</partCt>
}
XML Schema Languages
• Many variants…
– DTDs, XML Schema, RELAX NG, XDuce
• … with similar goals to define
– Types of literal (terminal) data
– Names of elements & attributes
• XQuery designed to support (all of) XML Schema
– Structural & name constraints over types
– Regular tree expressions over elements, attributes, atomic types
TeXQuery : Full-text extensions
• Text search & querying of structured content
• Limited support in XQuery 1.0
– String operators with collation sequences
$cat//book[contains(review/text(), "two thumbs up")]
• Stop words, proximity searching, ranking
Ex: “Tony Blair” within two words of “George Bush”
• Phrases that span tags and annotations
Ex: Match “Mr. English sponsored the bill” in
<sponsor> Mr. English </sponsor> <footnote> for himself and <cosponsor> Mr. Coyne </cosponsor> </footnote> sponsored the bill in the <committee-name> Committee for Financial Services </committee-name>
XQuery Systems: 2 Approaches
• Tree-based
– Tree is basic data structure
• Also on disk (if an XQuery DBMS)
– Navigational Approach
• Galax [Simeon..], Flux [Koch..], X-Hive
– Tree Algebra Approach
• TIMBER [Jagadish..]
• Relational
– Data shredded in relational tables
– XQuery translated into database query (e.g. SQL)
The Pathfinder Project
• Challenge / Goal:
– Turn RDBMSs into efficient XQuery engines
• People:
– Maurice van Keulen
• University of Twente
– Torsten Grust, Jens Teubner
• University of Konstanz
– Jan Rittinger
• University of Konstanz & CWI
• Task: generate code for MonetDB
MonetDB: Applied CS Research at CWI
• a decade of “query-intensive” application experience
• image retrieval: Peter Bosch → ImageSpotter
• audio/video retrieval: Alex van Ballegooij → RAM
• XML text retrieval: de Vries / Hiemstra → TIJAH
• biological sequences: Arno Siebes → BRICKS
• XML databases: Albrecht Schmidt → XMark; Grust / van Keulen → Pathfinder
• GIS: Wilco Quak → MAGNUM
• data warehousing / OLAP / data mining: SPSS → DataDistilleries; Univ. Massachusetts → PROXIMITY
CWI research group successfully spun off DataDistilleries (now SPSS)
Pathfinder — MonetDB
[Figure: compilation pipeline. Pathfinder stages: Parser, Sem. Analysis, Core Translation, Typechecking, Core to MIL Translation. Other labels: SQL, Relational Algebra, MIL (Query Algebra), Database, MonetDB.]
Open Source
• MonetDB + Pathfinder on Sourceforge
– Mozilla License
• Project Homepage
– http://monetdb.cwi.nl
• Developers website:
– http://sf.net/projects/monetdb
RoadMap
• 14-apr-04: initial Beta release MonetDB/SQL
• 30-sep-04: first official release MonetDB/SQL
• 30-may-05: beta release of MonetDB/XQuery (i.e. Pathfinder)
MonetDB
Binary Association Tables (BATs)
BAT storage as thin arrays
MonetDB Particulars
• Column wise fragmentation
– BAT: Binary Association Tables [oid,X]
– Don’t touch what you don’t need
• Void (virtual-oid) columns
– Contain dense sequence 0,1,2,3,4,…
– Require no space
– Positional access (nice for XPath skipping)
• pre = void
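The void-column idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the class name `VoidBAT`, its `seqbase` field, and the sample `size` and `level` columns are hypothetical and are not MonetDB's actual data structures or API. The point it shows is that a dense head column need not be stored, so a lookup by oid becomes an array offset.

```python
from dataclasses import dataclass
from typing import Generic, List, TypeVar

T = TypeVar("T")


@dataclass
class VoidBAT(Generic[T]):
    """Hypothetical stand-in for a BAT [void, X]: the head column is implicit."""
    seqbase: int   # first oid of the dense head sequence
    tail: List[T]  # the only data that is actually stored

    def find(self, oid: int) -> T:
        # Positional access: no search, just an array offset.
        return self.tail[oid - self.seqbase]

    def heads(self) -> range:
        # The dense head sequence is computed on demand, never materialized.
        return range(self.seqbase, self.seqbase + len(self.tail))


# One BAT per attribute (column-wise fragmentation). A query that only
# needs `size` never touches `level`. The values describe a made-up
# five-node tree addressed by pre rank 0..4.
size = VoidBAT(0, [4, 1, 0, 1, 0])
level = VoidBAT(0, [0, 1, 2, 1, 2])
```

With pre ranks as the void head, `size.find(3)` reads the subtree size of the node with pre rank 3 directly by position.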
DBMS Architecture
Monet: DBMS Microkernel
MonetDB: extensible architecture
Front-end/back-end:
• support multiple data models
• support multiple end-user languages
• support diverse application domains
Pathfinder: XQuery Frontend
Architecture
Outline
• Basic XML / XQuery
• Introduction of Pathfinder and MonetDB projects
• Relational XQuery
– XPath steps in the pre/post plane
– Translating for-loops, and beyond
• MonetDB Implementation
– Data structures
• Optimizations
– Order prevention
– Loop-Lifted Staircase join
– Join recognition
• Outlook
– Conclusions
XPath on an RDBMS
Node-based relational encoding of XQuery's data model
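A node-based encoding of this kind can be illustrated with a small Python sketch. It assumes the common pre/post numbering (preorder and postorder ranks per element). The function names `encode`, `descendants`, and `ancestors` are made up for this example and are not Pathfinder code. Each XPath axis then becomes a range selection on the two rank columns.

```python
import xml.etree.ElementTree as ET


def encode(xml_text):
    """Return one row (pre, post, tag) per element, in document order."""
    root = ET.fromstring(xml_text)
    rows = []
    post_counter = 0

    def visit(node):
        nonlocal post_counter
        pre = len(rows)
        rows.append(None)            # reserve the slot at position `pre`
        for child in node:
            visit(child)
        rows[pre] = (pre, post_counter, node.tag)
        post_counter += 1            # post rank is assigned when the node closes

    visit(root)
    return rows


def descendants(rows, ctx_pre):
    """descendant axis: later in preorder, earlier in postorder."""
    ctx_post = rows[ctx_pre][1]
    return [r for r in rows if r[0] > ctx_pre and r[1] < ctx_post]


def ancestors(rows, ctx_pre):
    """ancestor axis: earlier in preorder, later in postorder."""
    ctx_post = rows[ctx_pre][1]
    return [r for r in rows if r[0] < ctx_pre and r[1] > ctx_post]
```

For the document `<a><b><c/></b><d><e/></d></a>`, the descendants of `b` are the rows in the lower-right region of `b` in the pre/post plane, which here is just `c`.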
Tree Knowledge 1: pruning
Tree Knowledge 2: Partitioning
Staircase Join Algorithm
Tree Knowledge 3: Skipping
Pre/Post → Pre/Level/Size
done for better skipping and updates
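The pruning and skipping ideas can be sketched for the descendant axis as follows. This is a simplified illustration, not the published staircase join: it assumes the pre/size encoding, where the descendants of a node with pre rank `v` are exactly the pre ranks `v+1 .. v+size[v]`, and the function name and the `size` list are hypothetical.

```python
def staircase_descendant(size, context):
    """Descendant step for a *set* of context nodes, given as pre ranks.

    size[pre] is the number of descendants of the node with that pre rank.
    The result is duplicate-free and in document order.
    """
    result = []
    covered_up_to = -1                  # last pre rank already emitted
    for c in sorted(set(context)):
        if c <= covered_up_to:
            # Pruning: c lies inside the subtree of an earlier context
            # node, so its descendants have already been produced.
            continue
        end = c + size[c]
        # Skipping: jump straight to the subtree of c; nodes between
        # two subtrees are never inspected.
        result.extend(range(c + 1, end + 1))
        covered_up_to = end
    return result
```

For a five-node tree with `size = [4, 1, 0, 1, 0]` and context nodes `{1, 2, 3}`, node 2 is pruned because it sits inside the subtree of node 1, and the step yields pre ranks 2 and 4 in one pass.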
Updates
• Dense pre-numbers are nice for XPath
– Positional skipping in Staircase join!
• But how to handle updates?
Planned Update Solution
XPath → XQuery
Sequence Representation
(10, "x", <a/>, 10) →
pos | item
1   | 10
2   | "x"
3   | pre(a)
4   | 10
• sequence = table of items
• add pos column for maintaining order
• ignore polymorphism for the moment
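As a minimal sketch of this representation, the following Python helpers turn a sequence into (pos, item) rows and back. The helper names are hypothetical, and the node `<a/>` is represented by the placeholder string "pre(a)" because polymorphism is ignored here as well.

```python
def encode_sequence(items):
    """Sequence -> table of (pos, item) rows; pos starts at 1."""
    return [(pos, item) for pos, item in enumerate(items, start=1)]


def decode_sequence(table):
    """Table -> sequence; the physical row order does not matter,
    only the pos column does."""
    return [item for _, item in sorted(table, key=lambda row: row[0])]


table = encode_sequence([10, "x", "pre(a)", 10])
```

Because order lives in the pos column, the rows can be stored or processed in any physical order and the sequence is still recoverable.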
For-loops: the iter column
Loop-lifting
Full Example
[Figure: relational plan for the full example; operator labels: join, calc, project]
Mapping Rules
XQuery construct → relational algebra
See VLDB’04 / TDM’04 [Grust, Teubner]
– Sequence construction → union
– If-Then-[Else] → select, [union]
– For loop → map with cartesian product (all combinations)
– Calculations → projection expressions
– List functions (e.g. fn:first) → select(pos=1)
– Element construction → updates using descendant
– Path steps → selections on the pre/post plane
• Staircase join [VLDB03]:
– Single pass for a *set* of context nodes
– Elaborate skipping!
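A few of these rules can be made concrete with a small Python sketch over (iter, pos, item) rows. It is a sketch under assumptions, not Pathfinder's generated plan: it covers a single for-loop only, and the helper names `lift_for`, `calc`, `seq_union`, `unlift`, and `renumber` are invented for this example. It evaluates `for $x in (10, 20) return ($x + 1, $x * 2)` without ever looping over `$x` one value at a time.

```python
def renumber(rows):
    """rows: (iter, item) pairs, already ordered within each iter.
    Assigns fresh pos values 1, 2, ... per iter."""
    counters, out = {}, []
    for it, item in rows:
        counters[it] = counters.get(it, 0) + 1
        out.append((it, counters[it], item))
    return out


def lift_for(seq):
    """For loop: every item of `seq` becomes its own inner iteration.
    Returns the map (inner iter, outer iter) and the loop variable table."""
    ordered = sorted(seq, key=lambda r: (r[0], r[1]))
    loop_map = [(inner, outer) for inner, (outer, _, _) in enumerate(ordered, 1)]
    var = [(inner, 1, item) for inner, (_, _, item) in enumerate(ordered, 1)]
    return loop_map, var


def calc(table, fn):
    """Calculation: a projection expression on the item column."""
    return [(it, pos, fn(item)) for it, pos, item in table]


def seq_union(left, right):
    """Sequence construction (left, right): union, left rows first per iter."""
    tagged = [(it, 0, pos, item) for it, pos, item in left]
    tagged += [(it, 1, pos, item) for it, pos, item in right]
    tagged.sort(key=lambda r: r[:3])
    return renumber([(it, item) for it, _, _, item in tagged])


def unlift(loop_map, body):
    """Map the body result back to the enclosing iteration."""
    outer_of = dict(loop_map)
    rows = sorted(body, key=lambda r: (outer_of[r[0]], r[0], r[1]))
    return renumber([(outer_of[it], item) for it, _, item in rows])


# for $x in (10, 20) return ($x + 1, $x * 2)
seq = [(1, 1, 10), (1, 2, 20)]
loop_map, x = lift_for(seq)
body = seq_union(calc(x, lambda v: v + 1), calc(x, lambda v: v * 2))
result = unlift(loop_map, body)
```

Every step operates on whole tables, which is what allows a relational engine to run the loop body as bulk operations.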
XMark Query 2
XMark Query 2 (common subexpressions)
Order Prevention
To encode order, we use the pos column
New pos columns are created using the DENSE_RANK (SQL) primitive
• Needs [pos] | [iter] order
• More commonly [iter,pos]
This requires a lot of sorting! → often not necessary
Order Prevention
[VLDB03 Wang & Cherniack] define:
• Order properties of relations
• Order propagation rules for relational operators
Decoration of physical plans with order properties → eliminate sort
New ideas:
• RefineSort: pipelined algorithm that extends sort order
• Order property [C1] | [C2]
“for each equal value of [C2], in order of appearance, the values in [C1] are monotonically increasing”
Hash-based DENSE RANK only requires [pos] | [iter]
→ sorts on [iter,pos] avoided
Join Recognition
for $p in $auction/site/people/person
for $t in $auction/site/closed_auctions/closed_auction
where $t/buyer/@person = $p/@id
return $t
– For loop → map with all combinations → O(N*N)
– If a `simple' condition exists on two loop variables → join
– Only make a map with the matching combinations
– E.g. with a hash table → O(N)
Performed on the XCore tree
Recognize if-then expressions
Open question: where to optimize best?
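The difference between the two strategies can be sketched in Python. The dictionaries below are hypothetical stand-ins for the person and closed_auction nodes, with the key `id` for `@id` and `buyer` for `buyer/@person`. The sketch only contrasts the naive map over all combinations with a hash-based map of the matching combinations.

```python
from collections import defaultdict


def nested_loop(persons, auctions):
    """Literal translation of the two for-loops: all combinations, O(N*N)."""
    return [t for p in persons for t in auctions if t["buyer"] == p["id"]]


def hash_join(persons, auctions):
    """Recognized join: build a hash table once, then probe it per person."""
    by_buyer = defaultdict(list)
    for t in auctions:                      # build phase
        by_buyer[t["buyer"]].append(t)
    # probe phase: only matching combinations are ever produced
    return [t for p in persons for t in by_buyer.get(p["id"], [])]


persons = [{"id": "person0"}, {"id": "person1"}, {"id": "person2"}]
auctions = [
    {"auction": 1, "buyer": "person1"},
    {"auction": 2, "buyer": "person0"},
    {"auction": 3, "buyer": "person1"},
]
```

Both functions return the auctions in the same order (outer loop over persons, inner matches in document order), so the join version is a drop-in replacement for the nested loops.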
Join Optimization
for $x in $foo
for $y in $bar
where $x/p1/@a < $y/p2/@a
return $x
[Figure: two query plans for this query. Operator labels in the first plan: /p1, /p2, thetajoin, project. The second plan adds Aggr(min) and Aggr(max) before the thetajoin.]
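One plausible reading of the Aggr(min) and Aggr(max) operators is sketched below in Python. It is an inference from the existential semantics of the `<` comparison, not something the slide states: some value on the left is smaller than some value on the right exactly when the smallest left value is smaller than the largest right value. Which aggregate goes on which side, and all names in the code, are assumptions of this sketch.

```python
def exists_lt_naive(xs, ys):
    """XQuery general comparison xs < ys: true if ANY pair satisfies a < b."""
    return any(a < b for a in xs for b in ys)


def theta_join_naive(foo, bar):
    """foo, bar: lists of (node name, list of @a values under that node)."""
    return [x for x, xs in foo for _, ys in bar if exists_lt_naive(xs, ys)]


def theta_join_minmax(foo, bar):
    """Reduce each side to one value per node, then join on single values."""
    left = [(x, min(xs)) for x, xs in foo if xs]    # Aggr(min) per $x
    right = [(y, max(ys)) for y, ys in bar if ys]   # Aggr(max) per $y
    return [x for x, lo in left for _, hi in right if lo < hi]


foo = [("x1", [5, 9]), ("x2", [7]), ("x3", [])]
bar = [("y1", [1, 6]), ("y2", [2])]
```

After the aggregation each loop variable contributes a single value, so the theta-join compares one pair per combination instead of every pair of attribute values.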
Loop-lifted staircase join
• Staircase join [VLDB03]:
– Single pass for a *set* of context nodes
– Elaborate skipping!
• Loop-lifting: multiple iters → multiple sets of context nodes
• Loop-Lifted Staircase Join
– In a single pass: process multiple input context node lists
– Use a stack
– Exploit axis properties for pruning
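The single-pass idea can be sketched for the descendant axis in Python. This is a simplified illustration under assumptions and not the MonetDB implementation: it uses a pre/size encoding (`size[pre]` is the number of descendants), takes the context as (pre, iter) pairs, and all names are hypothetical. The stack holds the context nodes whose subtrees contain the current document position.

```python
def loop_lifted_descendant(size, ctx):
    """ctx: (pre, iter) pairs. Returns (iter, pre) result pairs, produced
    in one forward pass over the document."""
    ctx = sorted(set(ctx))          # context in document order
    result = []
    stack = []                      # active context nodes: (subtree end, iter)
    active = set()                  # iters that currently have a stack entry
    i, pre, n = 0, 0, len(size)
    while pre < n:
        # Leave the subtrees that ended before the current node.
        while stack and stack[-1][0] < pre:
            active.discard(stack.pop()[1])
        if not stack:
            if i == len(ctx):
                break               # no context left: done
            pre = ctx[i][0]         # skipping: jump to the next context node
        # The current node is a descendant of every active context node.
        for _, it in stack:
            result.append((it, pre))
        # Activate the context nodes located at the current node.
        while i < len(ctx) and ctx[i][0] == pre:
            it = ctx[i][1]
            # Pruning: a context node nested inside an active context node
            # of the same iter cannot contribute anything new.
            if it not in active and size[pre] > 0:
                stack.append((pre + size[pre], it))
                active.add(it)
            i += 1
        pre += 1
    return result


size = [4, 1, 0, 1, 0]
# iter 1 has context nodes {1, 3}; iter 2 has context nodes {0, 1}
pairs = loop_lifted_descendant(size, [(1, 1), (3, 1), (0, 2), (1, 2)])
```

All iterations share one scan of the document; only the stack contents differ per position, which is what makes the step cost independent of the number of iterations that share a context region.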
Staircase join
[Figure: a document and a single list of context nodes]
Loop-lifted staircase join
[Figure: a document, multiple lists of context nodes, and an active stack]
Scalability
Test platform: Opteron 1.6GHz, 8GB RAM, Red Hat Linux 64-bit
• Can process 11GB document!
• Mostly linear scaling with document size
• Some swapping in the join queries
• Q11 + Q12 generate quadratic result
XMark 10MB: Pathfinder vs X-Hive & Galax
XMark 1GB: Pathfinder vs X-Hive
[Figure: benchmark chart; includes the label "did not finish"]
Conclusions
• Relational approach can be scalable & fast
• Crucial Optimizations
– Join recognition
– Loop-lifted XPath steps
– Order awareness
Future Roadmap (beta: May 30, Holland Open)
• Algebraic Query Optimization
• Updates (not in release)