Chapter 3: Data Storage and Access Methods
• Title: The R* Tree: An Efficient and Robust Access Method for Points and Rectangles
• Authors: N. Beckmann, H. Kriegel, R. Schneider and B. Seeger
• Pages: 207-216

The R* Tree: An Efficient and Robust Access Method for Points and Rectangles
• Problem
  – Problem Statement
  – Why is this problem important?
  – Why is this problem hard?
• Approaches
  – Approach description, key concepts
  – Contributions (novelty, improved)
  – Assumptions

Problem Statement – R* Tree
• Given
  – Data containing points and rectangles
  – Spatial queries (point, range query, insert, delete)
• Find
  – An access method (data structure)
  – A hierarchical organization of rectangles
  – Example from Wikipedia
• Objectives
  – Efficiency of spatial queries
• Constraints
  – Balanced tree
  – Each node is a disk page and has >= m entries (m = minimum number of entries)
  – Root has at least two children unless it is a leaf
  – Efficiency metric = number of disk pages accessed

Why is this problem important?
• Multi-dimensional applications
  – Large geographic data, e.g., map objects such as countries occupy regions of non-zero size in two dimensions.
  – Common real-world usage: "Find all museums within 2 miles of my current location".
  – CAD
  – …
• Many DBMS servers support spatial indices
  – Oracle, IBM DB2, …

Why is this problem hard?
• B-tree split methods are ineffective in two dimensions
  – Ex. Sorting
• Size variation across data rectangles
  – Large rectangles limit split options!
• Non-uniform data distribution over space
• Dynamic access method
  – Insertions and deletions
  – Overlapping directory rectangles => multiple search paths

Novelty of Contribution
• Related Work
  – Traditional one-dimensional indexing structures (e.g., hash, B-tree) are not appropriate for multi-dimensional range search
  – B+ tree
    • Represents sorted data in a way that allows for efficient insertion and removal of elements.
    • Dynamic, multilevel index with maximum and minimum bounds on the number of keys in each node.
    • Leaf nodes are linked together as a linked list to make range queries easy.
  – R-tree
    • The R-tree is a foundation for spatial access methods
    • A complex spatial object is represented by its minimum bounding rectangle while preserving essential geometric properties
    • Overlapping regions
    • Heuristic: minimize the area of each enclosing rectangle in the inner nodes.

Principles of R-tree
• Height-balanced tree similar to a B-tree, with index records in its leaf nodes containing pointers to data objects.
• Heuristic optimization: minimize the area of each enclosing rectangle in the inner nodes.
Reference: A. Guttman, "R-trees: A Dynamic Index Structure for Spatial Searching", 1984

Performance Parameters beyond R-tree
• (Q1) The area covered by a directory rectangle should be minimized.
• (Q2) The overlap between directory rectangles should be minimized.
• (Q3) The margin of a directory rectangle should be minimized.
• (Q4) Storage utilization should be optimized.
• Intuitions:
  – Reduce overlap between sibling nodes.
  – Reduce traversal of multiple branches for a point query.
  – Reinserting old data moves entries between neighboring nodes and thus decreases overlap.
  – Due to more restructuring, fewer splits occur.

Difference between R-tree and R*-tree
• Minimization of area, margin, and overlap is crucial to the performance of the R-tree / R*-tree.
• The R*-tree attempts to reduce these, using a combination of a revised node split algorithm and the concept of forced reinsertion at node overflow. This is based on the observation that R-tree structures are highly susceptible to the order in which their entries are inserted, so an insertion-built (rather than bulk-loaded) structure is likely to be sub-optimal. Deletion and reinsertion of entries allow them to "find" a place in the tree that may be more appropriate than their original location.
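The quantities behind (Q1) to (Q3) can be made concrete with a small sketch. The `Rect` class and its method names below are hypothetical helpers, not part of the slides or the paper; the sketch assumes two-dimensional, axis-aligned rectangles and takes "margin" to mean the sum of the edge lengths (the perimeter).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle (hypothetical helper type), xmin <= xmax, ymin <= ymax."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def area(self) -> float:
        # (Q1): area covered by the rectangle.
        return (self.xmax - self.xmin) * (self.ymax - self.ymin)

    def margin(self) -> float:
        # (Q3): sum of the edge lengths, i.e. the perimeter in 2-D.
        return 2 * ((self.xmax - self.xmin) + (self.ymax - self.ymin))

    def overlap(self, other: "Rect") -> float:
        # (Q2): area of the intersection, 0 if the rectangles are disjoint.
        w = min(self.xmax, other.xmax) - max(self.xmin, other.xmin)
        h = min(self.ymax, other.ymax) - max(self.ymin, other.ymin)
        return w * h if w > 0 and h > 0 else 0.0

    def union(self, other: "Rect") -> "Rect":
        # Minimum bounding rectangle enclosing both rectangles.
        return Rect(min(self.xmin, other.xmin), min(self.ymin, other.ymin),
                    max(self.xmax, other.xmax), max(self.ymax, other.ymax))


a = Rect(0, 0, 2, 2)
b = Rect(1, 1, 3, 3)
print(a.area(), a.margin(), a.overlap(b), a.union(b).area())

# Same area, different margin: the square has the smaller margin.
square, strip = Rect(0, 0, 2, 2), Rect(0, 0, 4, 1)
print(square.margin(), strip.margin())
```

The last two lines hint at why margin is a separate criterion: for a fixed area, a smaller margin means a more square-like directory rectangle.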
Improve retrieval performance

Example
[Figure: rectangles R1 to R5 shown in alternative groupings, labeled "Preferred by R-tree" and "Preferred by R*-tree"]

Validation Methodology
• Methodology
  – Experiments with simulated workloads
  – Evaluation of design decisions
• Results
  – R*-tree outperforms variants of the R-tree and the 2-level grid file.
  – R*-tree is robust against non-uniform data distributions.

Summary
• Paper's focus
  – R*-tree
  – Implementations and performance
• Ideas
  – Heuristic optimizations (p. 208)
    • Reduction of area, margin, and overlap of the directory rectangles
  – Better storage utilization (p. 211)
    • Forced reinsertion (splits can be prevented)
• Experimental comparison
  – Using many data distributions

Assumptions, Rewrite today
• Assumptions
  – Indexing data in two-dimensional space
  – Bulk load and bulk reorganization not available
  – Concurrency control and recovery costs are negligible
    • Reinserts during split!
• Rewrite today
  – Bulk-load of rectangles
  – Compare with newer methods
    • R+ tree (disjoint siblings), Hilbert R-tree
  – Analytical results
    • Formally compare R*-tree with alternatives
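The slides name forced reinsertion but do not say how the entries to reinsert are chosen. The sketch below assumes the selection rule described in the R*-tree paper: sort the entries of an overflowing node by the distance between their centers and the center of the node's bounding rectangle, then remove the farthest ones for reinsertion. The function names, the tuple layout `(xmin, ymin, xmax, ymax)`, and applying the 30% fraction to the current entry count are illustrative assumptions, not the authors' code.

```python
import math


def bounding_box(rects):
    """Minimum bounding rectangle of a list of (xmin, ymin, xmax, ymax) tuples."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))


def center(rect):
    """Center point of a rectangle."""
    return ((rect[0] + rect[2]) / 2, (rect[1] + rect[3]) / 2)


def pick_reinsert_entries(entries, fraction=0.3):
    """Split the entries of an overflowing node into (keep, reinsert).

    Assumed rule: the entries farthest from the node center are reinserted.
    """
    cx, cy = center(bounding_box(entries))

    def distance(rect):
        rx, ry = center(rect)
        return math.hypot(rx - cx, ry - cy)

    # Decreasing distance: the farthest entries come first.
    ordered = sorted(entries, key=distance, reverse=True)
    p = max(1, int(fraction * len(entries)))
    return ordered[p:], ordered[:p]


# Hypothetical overflowing node: four nearby entries and one far away.
entries = [(0, 0, 1, 1), (6, 2, 7, 3), (7, 3, 8, 4), (6, 4, 7, 5), (8, 2, 9, 3)]
keep, reinsert = pick_reinsert_entries(entries)
print(reinsert)                                     # entries handed back to Insert
print(bounding_box(entries), bounding_box(keep))    # node rectangle before / after
```

In this example the node's bounding rectangle shrinks once the distant entry is removed, which is the effect the slides describe as entries "finding" a more appropriate place and a split possibly being avoided.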