Twig2Stack: Bottom-up Processing of Generalized-Tree-Pattern Queries over XML Documents
Songting Chen, Hua-Gang Li*, Junichi Tatemura, Wang-Pin Hsiung, Divyakant Agrawal and K. Selcuk Candan
NEC Laboratories America
* University of California, Santa Barbara
VLDB 2006, Seoul, Korea

Background
• XML
  – Hierarchical (tree) structured data
  – Provides flexibility to model semi-structured data
  – Widely accepted as a universal data exchange format
• Query over XML
  – XPath, XQuery [W3C]
  – Extensively used by many applications
  – Adopted by a number of commercial systems

State-of-the-art: XML Query Processing
Query types: Path, Tree, Generalized Tree Pattern (GTP)
• Algebraic approach
  – Binary structural joins [Timber]: large intermediate results
  – Optimizing multiple path expressions of XQuery [Chen et al.]: expensive post-processing
• Holistic approach
  – PathStack [Bruno et al.] for path queries
  – TwigStack [Bruno et al.] for tree (twig) queries
  – Twig2Stack for GTP queries

Processing Generalized Tree Pattern (GTP) Queries
Type                 Algebraic approach [Chen et al.]
Mandatory axis       Structural joins
Optional axis        Structural outer joins
Return node          Sort
Group return node    Grouping
Non-return node      Duplicate elimination
• Example XQuery: FOR $b in //A[E]/B, $d in $b/D LET $c = $b/C RETURN $b, $c, $d
[Figure: example GTP with query nodes A, B, C, D; sample data tree with elements a1, a2, b1, b2, shown for //A//B and //A/B]
• Our goal: avoid ALL of these!

Motivation: PathStack [Bruno et al.]
• Query: //A//B; Data: a1, a2, b1, b2
[Figure: stack S[A] holding a1, a2 and stack S[B] holding b1, b2]
• Key observation: minimize intermediate results through a compact representation of path matches, by
  – Inter-node: recording the ancestor-descendant (AD) relationship between elements in different query nodes, e.g., b1→a2, b2→a2
  – Intra-node: recording the AD relationship between elements within the same query node, e.g., b1, b2
• TwigStack [Bruno et al.] minimizes intermediate results by:
  – Outputting only those path matches that are in the final twig results
  – However, such optimality cannot be guaranteed [Choi et al.]
  – Not helpful for processing GTP queries
• Question: can we minimize intermediate results for twig queries through compact result encoding (similar to PathStack)?
  – Would this be useful for processing GTP queries as well?

Hierarchical Stack Encoding
• Inter-node: //A//B
  – Can still use explicit edges
• Intra-node: A
  – Matching elements form a tree structure as well
[Figure: elements a1, a2, a3, a4 matching query node A, arranged as a tree in HS[A]]
• Associate each query node with a hierarchical stack
  – Push element e into hierarchical stack HS[E] iff e satisfies the sub-twig query rooted at E
    • Matching can be determined once the entire sub-tree of e has been seen
    • Requires post-order document traversal

Twig2Stack: Running Example
[Figure: query with nodes A, B, C, D; data tree with the labels below; hierarchical stacks HS[A], HS[B], HS[C], HS[D]; merging stacks]
a1  [1,20], 1
a2  [2,15], 2
b1  [3,14], 3
d1  [4,11], 4
b2  [5,10], 5
d2  [6,7], 6
c1  [8,9], 6
c2  [12,13], 4
b3  [16,19], 2
d3  [17,18], 3
• TwigStack needs to enumerate 3 matches for //A/B//D and 2 for //A/B//C, then join them together.
• Twig2Stack requires neither path joins nor path enumeration!

GTP Result Enumeration
• Bottom-up computation vs. top-down enumeration
  – Visit only those elements that are in the twig matches
• Handling grouping results
  – Automatic grouping through inter-node edges
• Handling duplicates and out-of-order results
  – Problems come from non-return nodes
  – If D is a return node while B is not:
    • b1 → d1, d2, d3 and b2 → d2, d3 (duplicates)
  – Observation: the intra-node hierarchy provides hints
[Figure: elements a4, b1, b2, d1, d2, d3, c1, c2]

Experiment Setup
• Implementation
  – Twig2Stack: Java 1.4.2
  – TwigStack, TJFast: Java 1.4.2
    • Kindly provided by Jiaheng Lu from National University of Singapore (NUS)
• Datasets
  – XMark, DBLP, TreeBank
• Metrics
  – Query processing time
  – IO time

Processing Full Twig Queries
• Optimization of query processing: TwigStack, Twig2Stack
• Optimization of IO: TJFast

Not yet done: Memory Usage
• Hierarchical stack encoding could hold the entire document in memory in the worst case
  – Unlike the DOM approach, only matches need to be stored
    • Tag match
    • (Partial) twig match
    • Predicate evaluation
• Early result enumeration dramatically reduces the memory usage
  – Enumerate query results before the end of the document and release the buffer
  – Main idea: hybrid of top-down (PathStack) and bottom-up (Twig2Stack) approaches

Early Result Enumeration (ERM)
• Enumerate results and release the buffer when elements in the top branch node are popped from PathStack
[Figure: the running example with PathStack stacks S[A], S[B], S[C], S[D] alongside hierarchical stacks HS[A], HS[B], HS[C], HS[D]]

Memory Usage
[Figure: dblp, article, title, year: small sub-tree; site, open_auctions, bid, reserve, bidder, increase: huge sub-tree]
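The bottom-up matching rule from the Hierarchical Stack Encoding and Running Example slides can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the element labels are read from the figure as ([start,end] interval, level), treats every axis of the query //A/B[.//D][.//C] as mandatory, and keeps flat per-node lists instead of merged hierarchical stacks. The names Elem, DOC, QUERY, related and bottom_up_match are hypothetical.

```python
from collections import namedtuple

# Labeled element: tag, [start, end] interval, and depth level.
Elem = namedtuple("Elem", "name tag start end level")

# Data tree as read from the running-example figure (an assumption of this sketch).
DOC = [
    Elem("a1", "A", 1, 20, 1), Elem("a2", "A", 2, 15, 2),
    Elem("b1", "B", 3, 14, 3), Elem("d1", "D", 4, 11, 4),
    Elem("b2", "B", 5, 10, 5), Elem("d2", "D", 6, 7, 6),
    Elem("c1", "C", 8, 9, 6), Elem("c2", "C", 12, 13, 4),
    Elem("b3", "B", 16, 19, 2), Elem("d3", "D", 17, 18, 3),
]

# Query //A/B[.//D][.//C]: each query node lists (child query node, axis) pairs.
# "/" is parent-child, "//" is ancestor-descendant; all axes are mandatory here.
QUERY = {"A": [("B", "/")], "B": [("D", "//"), ("C", "//")], "D": [], "C": []}


def related(anc, desc, axis):
    """True if desc lies inside anc's interval (and one level below for '/')."""
    inside = anc.start < desc.start and desc.end < anc.end
    return inside and (axis == "//" or desc.level == anc.level + 1)


def bottom_up_match(doc, query):
    """Return (hs, edges): matched elements per query node and inter-node edges."""
    hs = {q: [] for q in query}
    edges = {}
    # Post-order traversal: an element is visited after its whole sub-tree,
    # which for interval labels is ascending order of the end position.
    for e in sorted(doc, key=lambda x: x.end):
        if e.tag not in query:
            continue
        links = {c: [m.name for m in hs[c] if related(e, m, axis)]
                 for c, axis in query[e.tag]}
        # Push e only if every child query node has at least one match below e.
        if all(links.values()):
            hs[e.tag].append(e)
            edges[e.name] = links
    return hs, edges


hs, edges = bottom_up_match(DOC, QUERY)
matched = {q: [e.name for e in elems] for q, elems in hs.items()}
print(matched)      # elements pushed for each query node, in post-order
print(edges["b1"])  # inter-node edges recorded for b1
```

Under these assumptions b3 is never pushed because it has no C descendant, and a1 is not pushed because its only B child is b3. That is the path match (a1, b3, d3) a path-enumerating approach would still produce for //A/B//D and later discard in the join.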
Conclusions and Future Work
• Proposed a bottom-up GTP processing solution
  – A twig encoding scheme
  – A GTP enumeration algorithm that avoids any post-processing operations
  – A hybrid scheme to reduce memory usage
• Future directions
  – Handling worst-case memory issues
  – Optimizing IO cost by exploiting indexes
  – Handling other axes, full XQuery, graph input
  – Handling XML streams
  – …

Processing GTP
• Optimization of non-return nodes
• Automatic grouping
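The duplicate problem for non-return nodes noted on the GTP Result Enumeration slide (b1 → d1, d2, d3 and b2 → d2, d3) can be made concrete with a short Python sketch. It shows one possible reading of the "intra-node hierarchy provides hints" observation, not necessarily the authors' algorithm: for a descendant axis B//D, only the top-most B elements are visited. The interval labels in REGION are hypothetical values chosen to reproduce the edges on the slide, and the function names are illustrative.

```python
# Hypothetical interval labels chosen to reproduce the edges on the slide:
# b1 -> d1, d2, d3 and b2 -> d2, d3 (b2 is nested inside b1).
REGION = {"b1": (1, 10), "d1": (2, 3), "b2": (4, 9), "d2": (5, 6), "d3": (7, 8)}
EDGES = {"b1": ["d1", "d2", "d3"], "b2": ["d2", "d3"]}


def naive_enumerate(b_elems, edges):
    """Follow every B element's edges; nested B elements yield duplicates."""
    return [d for b in b_elems for d in edges[b]]


def contains(outer, inner):
    """True if inner's interval lies strictly inside outer's interval."""
    (s1, e1), (s2, e2) = REGION[outer], REGION[inner]
    return s1 < s2 and e2 < e1


def hinted_enumerate(b_elems, edges):
    """Visit only B elements that have no B ancestor among the matches.

    For a descendant axis B//D, every D below a nested B is also below its
    B ancestor, so the top-most B elements already cover all D results.
    Assumes each edge list is stored in document order.
    """
    tops = [b for b in b_elems
            if not any(contains(other, b) for other in b_elems if other != b)]
    tops.sort(key=lambda b: REGION[b][0])  # document order of top-most B elements
    return [d for b in tops for d in edges[b]]


naive = naive_enumerate(["b1", "b2"], EDGES)
hinted = hinted_enumerate(["b1", "b2"], EDGES)
print(naive)   # ['d1', 'd2', 'd3', 'd2', 'd3']
print(hinted)  # ['d1', 'd2', 'd3']
```

In this sketch the nested element b2 is skipped, so the D results come out once each and in document order, without a separate sort or duplicate-elimination pass.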