Twig2Stack: Bottom-up Processing of
Generalized-Tree-Pattern Queries over
XML Documents
Songting Chen, Hua-Gang Li *, Junichi Tatemura
Wang-Pin Hsiung, Divykant Agrawal and K. Selcuk Candan
NEC Laboratories America
* University of California, Santa Barbara
Background
• XML
– Hierarchical (tree) structured data
– Provide flexibility to model semi-structured data
– Widely accepted as universal data exchange format
• Query over XML
– XPath, XQuery [W3C]
– Extensively used by many applications
– Adopted by a number of commercial systems
VLDB' 2006. Seoul, Korea
State-of-the-art: XML Query Processing
• Query types: Path, Tree, Generalized Tree Pattern (GTP)
• Algebraic Approach
– Binary Structure Joins [Timber]
• Large intermediate results
– Optimize multiple path expressions of XQuery [Chen et al.]
• Expensive post-processing
• Holistic Approach
– PathStack [Bruno et al.] (Path)
– TwigStack [Bruno et al.] (Tree)
– ? (GTP): Twig2Stack
Processing Generalized Tree Pattern (GTP) Queries
• GTP feature types and how the algebraic approach [Chen et al.] handles them
– Mandatory Axis: Structural Joins
– Optional Axis: Structural Outer Joins
– Return node
– Group return node: Grouping
– Non return node: Duplication Elimination
• Also required: Sort
• Our goal: Avoid ALL these!
• Example XQuery:
FOR $b in //A[E]/B,
$d in $b/D
LET $c = $b/C
RETURN $b, $c, $d
[Figure: GTP query tree over A, B, C, D; sample elements a1, a2, b1, b2 for //A//B and //A/B]
Motivation: PathStack [Bruno et al.]
• Query: //A//B; Data: a1, a2, b1, b2
[Figure: data elements a1, a2, b1, b2; stack S[A] holding a1, a2 and stack S[B] holding b1, b2]
• Key observation: minimize intermediate results through compact
representation of path matches, by
– Inter-node: record the ancestor-descendant (AD) relationship between
elements in different query nodes, e.g., b1→a2, b2→a2
– Intra-node: record the AD relationship between elements within the same
query node, e.g., b1, b2

• TwigStack [Bruno et al.] minimizes intermediate results through:
– Output only those path matches that are in the final twig results
– However, such optimality cannot be guaranteed [Choi et al.]
– Not helpful for processing GTP queries

• Question: can we minimize intermediate results for twig queries
through compact result encoding (similar to PathStack)?
– Useful for processing GTP queries as well?
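The compact path encoding described above can be illustrated with a minimal sketch. It is not the PathStack algorithm as published: it handles only the two-node query //A//B, and the region numbers for a1, a2, b1, b2 are hypothetical values chosen so that the four elements nest in one path. The names `Elem` and `path_stack_a_desc_b` are illustrative.

```python
from collections import namedtuple

# Hypothetical element record: [left, right] is the element's region in the document.
Elem = namedtuple("Elem", "name tag left right")

def path_stack_a_desc_b(elements):
    """Evaluate //A//B over elements visited in document order (by left position)."""
    stack_a = []    # S[A]: chain of currently open, nested A elements
    pointers = {}   # inter-node edge: each b points to the top of S[A] when pushed
    results = []    # path matches expanded from the compact encoding
    for e in sorted(elements, key=lambda x: x.left):
        # Pop A elements that closed before e starts; they cannot be ancestors of e.
        while stack_a and stack_a[-1].right < e.left:
            stack_a.pop()
        if e.tag == "A":
            stack_a.append(e)
        elif e.tag == "B" and stack_a:
            pointers[e.name] = stack_a[-1].name
            # Everything at or below the pointer in S[A] is an ancestor of e.
            results.extend((a.name, e.name) for a in stack_a)
    return pointers, results

# Assumed nesting: a1 contains a2 contains b1 contains b2.
data = [Elem("a1", "A", 1, 8), Elem("a2", "A", 2, 7),
        Elem("b1", "B", 3, 6), Elem("b2", "B", 4, 5)]
pointers, results = path_stack_a_desc_b(data)
print(pointers)  # one edge per b instead of one entry per (a, b) pair
print(results)
```

Two pointers stand in for four path matches here; the saving grows with the depth of nesting.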
Hierarchical Stack Encoding
• Inter-node: //A//B
– Can still use explicit edges
• Intra-node: A
– Matching elements form a tree structure as well
[Figure: elements a1, a2, a3, a4 and their tree-shaped arrangement in HS[A]]
• Associate each query node with a hierarchical stack
– Push element e into hierarchical stack HS[E] iff e satisfies
the sub-twig query rooted at E
• Matching can be determined once the entire sub-tree of e has been seen
• Requires post-order document traversal
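The push rule above can be sketched as a post-order traversal. This is a simplified illustration, not the Twig2Stack implementation: each HS[E] is kept as a flat list rather than a hierarchical stack with edges, every tag is assumed to appear in at most one query node, and the small document and the function name `bottom_up_match` are hypothetical.

```python
def bottom_up_match(doc, query):
    """doc: (name, tag, children); query: tag -> list of (axis, child_tag)."""
    hs = {tag: [] for tag in query}   # flat stand-in for the hierarchical stacks

    def visit(node):
        name, tag, children = node
        child_tags, below_tags = set(), set()
        for child in children:                      # children first: post-order
            child_matched, child_below = visit(child)
            if child_matched:
                child_tags.add(child[1])            # tags matched by direct children
            below_tags |= child_below               # tags matched anywhere below
        # e satisfies the sub-twig rooted at its query node iff every query
        # child is matched by a child ("/") or a descendant ("//") of e.
        matched = tag in query and all(
            q_tag in (child_tags if axis == "/" else below_tags)
            for axis, q_tag in query[tag]
        )
        if matched:
            hs[tag].append(name)                    # push only after the sub-tree is seen
            below_tags.add(tag)
        return matched, below_tags

    visit(doc)
    return hs

# Query: A with child B; B with descendants D and C.
QUERY = {"A": [("/", "B")], "B": [("//", "D"), ("//", "C")], "C": [], "D": []}
# Hypothetical document: b2 has no C below it, so it is never pushed.
DOC = ("a1", "A", [("b1", "B", [("d1", "D", []), ("c1", "C", [])]),
                   ("b2", "B", [("d2", "D", [])])])
hs = bottom_up_match(DOC, QUERY)
print(hs)
```

Because an element is tested only after its whole sub-tree has been visited, the decision to push never has to be revised later.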
Twig2Stack: Running Example
• Query: twig over A, B, C, D (paths //A/B//D and //A/B//C)
• Data, labeled with [left, right], level:
– a1: [1,20], 1
– a2: [2,15], 2
– b1: [3,14], 3
– d1: [4,11], 4
– b2: [5,10], 5
– d2: [6,7], 6
– c1: [8,9], 6
– c2: [12,13], 4
– b3: [16,19], 2
– d3: [17,18], 3
[Figure: hierarchical stacks HS[A] (a2), HS[B] (b1, b2), HS[C] (c1, c2), HS[D] (d1, d2, d3); merging stacks]
• TwigStack needs to enumerate 3 matches for //A/B//D and 2 for
//A/B//C, then join them together.
• Twig2Stack requires neither path joins nor path enumeration!
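The match counts quoted for TwigStack can be checked by brute force from the region labels. The sketch below assumes the pairing of labels to elements read off the slide figure (a1 = [1,20] level 1, and so on) and uses the standard containment test on [left, right] regions; it enumerates and joins path matches only to show the work that Twig2Stack is said to avoid. The helper names are illustrative.

```python
# name: (tag, left, right, level); pairing assumed from the running example figure
DOC = {
    "a1": ("A", 1, 20, 1), "a2": ("A", 2, 15, 2), "b1": ("B", 3, 14, 3),
    "d1": ("D", 4, 11, 4), "b2": ("B", 5, 10, 5), "d2": ("D", 6, 7, 6),
    "c1": ("C", 8, 9, 6), "c2": ("C", 12, 13, 4), "b3": ("B", 16, 19, 2),
    "d3": ("D", 17, 18, 3),
}

def is_ancestor(a, d):
    """a is an ancestor of d iff d's region lies strictly inside a's region."""
    return DOC[a][1] < DOC[d][1] and DOC[d][2] < DOC[a][2]

def is_parent(p, c):
    """Parent-child is ancestor-descendant with a level difference of one."""
    return is_ancestor(p, c) and DOC[c][3] == DOC[p][3] + 1

def by_tag(tag):
    return [name for name, e in DOC.items() if e[0] == tag]

def path_matches(leaf_tag):
    """All (a, b, x) with a/b and b//x, i.e. matches of //A/B//leaf_tag."""
    return [(a, b, x)
            for a in by_tag("A") for b in by_tag("B") for x in by_tag(leaf_tag)
            if is_parent(a, b) and is_ancestor(b, x)]

d_paths = path_matches("D")
c_paths = path_matches("C")
# Join the two path lists on their shared (a, b) prefix to get twig matches.
twigs = [(a, b, d, c) for (a, b, d) in d_paths for (a2, b2, c) in c_paths
         if (a, b) == (a2, b2)]
print(len(d_paths), len(c_paths), len(twigs))
```

Under this labeling the path (a1, b3, d3) is enumerated but joins with nothing, which is the kind of wasted intermediate result the slide refers to.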
GTP Result Enumeration
[Figure: hierarchical stacks with elements a4, b1, b2, d1, d2, d3, c1, c2]
• Bottom-up Computation vs. Top-down Enumeration
– Visit only those elements that are in the twig matches
• Handling grouping results
– Automatic grouping through inter-node edges
• Handling duplicates and out-of-order results
– Problems come from non-return nodes
– If D is a return node while B is not
• b1 → d1, d2, d3 and b2 → d2, d3 (duplicates)
– Observation: intra-node hierarchy provides hints
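The duplicate problem above can be made concrete with a small sketch. It shows one way the intra-node hierarchy could serve as a hint, under two assumptions not stated on the slide: b2 is nested inside b1 within HS[B], and B reaches D through a descendant axis, so every D below b2 is also below b1. The function and variable names are hypothetical.

```python
def enumerate_d(edges, parent_in_hs, use_hint):
    """Collect D results by walking the non-return node B."""
    out = []
    for b, ds in edges.items():
        if use_hint and parent_in_hs.get(b) is not None:
            # b is nested in another B element whose D set already covers b's.
            continue
        out.extend(ds)
    return out

# Inter-node edges from the slide: b1 -> d1, d2, d3 and b2 -> d2, d3.
edges = {"b1": ["d1", "d2", "d3"], "b2": ["d2", "d3"]}
# Assumption: within HS[B], b2 sits below b1.
parent_in_hs = {"b1": None, "b2": "b1"}

naive = enumerate_d(edges, parent_in_hs, use_hint=False)
hinted = enumerate_d(edges, parent_in_hs, use_hint=True)
print(naive)   # d2 and d3 appear twice
print(hinted)  # each D appears once, with no separate de-duplication pass
```

Skipping nested elements during enumeration removes the duplicates at the source, which is the alternative to a duplicate-elimination step after the fact.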
Experiment Setup
• Implementation
– Twig2Stack: Java 1.4.2
– TwigStack, TJFast: Java 1.4.2
• Kindly provided by Jiaheng Lu from National University of
Singapore (NUS)
• Datasets
– XMark, DBLP, TreeBank
• Metrics
– Query processing time
– IO time
Processing Full Twig Queries
• Optimization of query processing: TwigStack, Twig2Stack
• Optimization of IO: TJFast
Not yet done: Memory Usage
• Hierarchical Stack Encoding could hold entire document
in memory in the worst case
– Unlike DOM approach, only matches need to be stored
• Tag match
• (Partial) twig match
• Predicate evaluation
• Early result enumeration dramatically reduces the
memory usage
– Enumerate query results before the end of document and
release buffer
– Main idea: hybrid of top-down (PathStack) and bottom-up
(Twig2Stack) approaches
Early Result Enumeration (ERM)
• Enumerate results and release buffer when elements in the top branch node are popped from PathStack
[Figure: running example with PathStack stacks S[A], S[B], S[C], S[D] alongside hierarchical stacks HS[A], HS[B], HS[C], HS[D]]
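The buffer-release idea can be sketched over a stream of open/close events. This is a loose illustration, not the published scheme: it performs no twig matching, buffers every element that closes inside an open top branch element, and assumes results are flushed only when the outermost such element is popped. The event format and the name `early_enumeration` are hypothetical.

```python
def early_enumeration(events, top_tag):
    """events: list of (kind, name, tag) with kind in {"open", "close"}."""
    open_top = []            # plays the role of the PathStack for the top branch node
    buffer, flushed, peak = [], [], 0
    for kind, name, tag in events:
        if kind == "open":
            if tag == top_tag:
                open_top.append(name)
            continue
        if open_top:
            buffer.append(name)              # stand-in for a push into a hierarchical stack
            peak = max(peak, len(buffer))
        if tag == top_tag:
            open_top.pop()
            if not open_top:                 # outermost top branch element is done
                flushed.append(list(buffer)) # enumerate results for this sub-tree
                buffer.clear()               # release the buffer
    return flushed, peak

# Hypothetical dblp-like stream: many small article sub-trees under one root.
events = [("open", "dblp", "dblp"),
          ("open", "art1", "article"), ("open", "t1", "title"),
          ("close", "t1", "title"), ("close", "art1", "article"),
          ("open", "art2", "article"), ("open", "t2", "title"),
          ("close", "t2", "title"), ("close", "art2", "article"),
          ("close", "dblp", "dblp")]
flushed, peak = early_enumeration(events, "article")
print(flushed, peak)
```

With small sub-trees under the top branch node the peak buffer stays small no matter how long the document is; a single huge sub-tree would keep the buffer large until it closes.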
Memory Usage
• dblp, article (title, year): small sub-tree
• site, open_auctions (bid, reserve, bidder, increase): huge sub-tree
Conclusions and Future Work
• Proposed a bottom-up GTP processing solution
– A twig encoding scheme
– A GTP enumeration algorithm that avoids any
post-processing operations
– A hybrid scheme to reduce memory usage
• Future directions
– Handling worst case memory issues
– Optimizing IO cost by exploiting indexes
– Handling other axes, full XQuery, graph input
– Handling XML streams
–…
Processing GTP
• Optimization of non-return nodes
• Automatic grouping