
Chapter 3: Data Storage and Access Methods
• Title: The R*-tree: An Efficient and Robust Access Method for Points and Rectangles
• Authors: N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger
• Pages: 207-216
The R*-tree: An Efficient and Robust Access Method for Points and Rectangles
• Problem
– Problem Statement
– Why is this problem important?
– Why is this problem hard?
• Approaches
– Approach description, key concepts
– Contributions (novelty, improved)
– Assumptions
Problem Statement – R* Tree
• Given
– Data containing points and rectangles
– Spatial operations (point query, range query, insert, delete)
• Find - An Access Method (Data Structure)
– A hierarchical organization of rectangles
– Example from Wikipedia
• Objectives
– Efficiency of spatial queries
• Constraints
– Balanced tree
– Each node is a disk page and has >= m (min # of entries) entries.
– Root has at least two children unless it is a leaf
– Efficiency metric = number of disk pages accessed
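The constraints above can be sketched as a small checker. This is only an illustration: the `Node` layout, the function names, and the page capacity `M` (maximum entries per node) are assumptions introduced here, not part of the slides.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)

@dataclass
class Node:
    # Hypothetical node layout: one node corresponds to one disk page.
    is_leaf: bool
    rects: List[Rect] = field(default_factory=list)       # one bounding rectangle per entry
    children: List["Node"] = field(default_factory=list)  # parallel to rects; empty in a leaf

def leaf_depths(node: Node, depth: int = 0) -> set:
    """Collect the set of depths at which leaves occur."""
    if node.is_leaf:
        return {depth}
    depths = set()
    for child in node.children:
        depths |= leaf_depths(child, depth + 1)
    return depths

def satisfies_constraints(root: Node, m: int, M: int) -> bool:
    """Check balance, the minimum fill m, and the root rule.

    M (page capacity) is an assumed parameter, not stated on the slide.
    """
    if len(leaf_depths(root)) != 1:      # balanced: all leaves at the same depth
        return False

    def ok(node: Node, is_root: bool) -> bool:
        n = len(node.rects)
        if n > M:
            return False
        if not node.is_leaf and len(node.children) != n:
            return False
        if is_root:
            if not node.is_leaf and n < 2:   # root needs >= 2 children unless it is a leaf
                return False
        elif n < m:                          # every other node holds >= m entries
            return False
        return all(ok(child, False) for child in node.children)

    return ok(root, True)

leaf_a = Node(True, [(0, 0, 1, 1), (1, 1, 2, 2)])
leaf_b = Node(True, [(5, 5, 6, 6), (6, 6, 7, 7)])
root = Node(False, [(0, 0, 2, 2), (5, 5, 7, 7)], [leaf_a, leaf_b])
print(satisfies_constraints(root, m=2, M=4))
```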
Why is this problem important?
• Multi-dimensional Applications
– Large geographic data, e.g., map objects such as countries occupy
regions of non-zero size in two dimensions.
– Common real-world usage: "Find all museums within 2 miles of
my current location".
– CAD
– …
• Many DBMS servers support spatial indices
– Oracle, IBM DB2, …
Why is this problem hard?
• B-tree split methods are ineffective in two dimensions
– Ex.: sorting, since rectangles have no single linear order that preserves spatial proximity
• Size variation across data rectangles
– Large rectangles limit split options!
• Non-uniform data distribution over space
• Dynamic Access Method
– Insertions and deletions
– Overlapping directory rectangles => multiple search paths
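To illustrate why overlapping directory rectangles lead to multiple search paths, here is a minimal search sketch that counts the pages it touches. The `Node` class, the `search` function, and the sample rectangles are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)

@dataclass
class Node:
    is_leaf: bool
    rects: List[Rect]
    children: List["Node"] = field(default_factory=list)

def intersects(a: Rect, b: Rect) -> bool:
    """True if the two closed rectangles share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def search(node: Node, query: Rect, stats: dict) -> List[Rect]:
    """Return data rectangles intersecting query; count each visited node as one page."""
    stats["pages"] += 1
    hits = []
    for i, rect in enumerate(node.rects):
        if intersects(rect, query):
            if node.is_leaf:
                hits.append(rect)
            else:
                # Every intersecting directory rectangle must be followed.
                hits.extend(search(node.children[i], query, stats))
    return hits

# Two leaves whose directory rectangles overlap in the region [4,6] x [4,6].
leaf_a = Node(True, [(0, 0, 1, 1), (5, 5, 6, 6)])
leaf_b = Node(True, [(4, 4, 4.5, 4.5), (9, 9, 10, 10)])
root = Node(False, [(0, 0, 6, 6), (4, 4, 10, 10)], [leaf_a, leaf_b])

inside_overlap = {"pages": 0}
search(root, (5.5, 5.5, 5.5, 5.5), inside_overlap)   # point query in the overlap
outside_overlap = {"pages": 0}
search(root, (0.5, 0.5, 0.5, 0.5), outside_overlap)  # point query outside it
print(inside_overlap["pages"], outside_overlap["pages"])  # 3 2
```

The point query inside the overlap region has to read both leaves even though only one of them holds a match, which is the extra cost the efficiency metric (disk pages accessed) captures.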
Novelty of Contribution
• Related Work
– Traditional one-dimensional indexing structures (e.g., hash, B-tree)
are not appropriate for multi-dimensional range search
– B+ tree
• Represents sorted data in a way that allows for efficient insertion and
removal of elements.
• Dynamic, multilevel index with maximum and minimum bounds on the
number of keys in each node.
• Leaf nodes are linked together as a linked list to make range queries easy.
– R-tree
• The R-tree is a foundation for spatial access methods
• A complex spatial object is represented by its minimum bounding rectangle
while preserving essential geometric properties
• Bounding regions may overlap
• Heuristic: minimize the area of each enclosing rectangle in the inner nodes.
Principles of R-tree
• Height-balanced tree similar to a B-tree with index records
in its leaf nodes containing pointers to data objects.
• Heuristic Optimization: minimize the area of each
enclosing rectangle in the inner nodes.
Reference: A. Guttman, "R-trees: A Dynamic Index Structure for Spatial Searching", 1984
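The area heuristic can be sketched as the subtree choice made during insertion: descend into the entry whose rectangle needs the least area enlargement to include the new rectangle, breaking ties by smaller area. The function names and sample data below are illustrative assumptions, not code from the paper.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)

def area(r: Rect) -> float:
    return (r[2] - r[0]) * (r[3] - r[1])

def union(a: Rect, b: Rect) -> Rect:
    """Minimum bounding rectangle of two rectangles."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def enlargement(r: Rect, new: Rect) -> float:
    """Extra area r would need in order to cover new."""
    return area(union(r, new)) - area(r)

def choose_subtree(rects: List[Rect], new: Rect) -> int:
    """Index of the entry with least area enlargement; ties go to the smaller rectangle."""
    return min(range(len(rects)),
               key=lambda i: (enlargement(rects[i], new), area(rects[i])))

directory = [(0, 0, 4, 4), (5, 5, 6, 6)]
print(choose_subtree(directory, (4, 4, 5, 5)))  # 1: enlargement 3 versus 9
print(choose_subtree(directory, (1, 1, 2, 2)))  # 0: already covered, enlargement 0
```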
Performance Parameters beyond R-tree
• (Q1) The area covered by a directory rectangle should be minimized.
• (Q2) The overlap between directory rectangles should be minimized.
• (Q3) The margin of a directory rectangle should be minimized.
• (Q4) Storage utilization should be optimized.
• Intuitions:
– Reduce overlap between sibling nodes.
– Reduce traversal of multiple branches for point query
– Reinserting old entries moves them between neighboring nodes and thus
decreases overlap.
– Because of this additional restructuring, fewer splits occur
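The quantities behind Q1 to Q3 can be written down directly for two-dimensional rectangles. In this sketch the margin is taken as the sum of the edge lengths (the perimeter) and the overlap as the area of the intersection; the function names are assumptions made for illustration.

```python
from typing import Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)

def area(r: Rect) -> float:
    """Q1: area covered by a rectangle."""
    return (r[2] - r[0]) * (r[3] - r[1])

def margin(r: Rect) -> float:
    """Q3: sum of the edge lengths of a rectangle."""
    return 2 * ((r[2] - r[0]) + (r[3] - r[1]))

def overlap(a: Rect, b: Rect) -> float:
    """Q2: area shared by two rectangles (0 if they are disjoint or only touch)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return w * h if w > 0 and h > 0 else 0

square, sliver = (0, 0, 2, 2), (0, 0, 4, 1)
print(area(square), area(sliver))      # 4 4
print(margin(square), margin(sliver))  # 8 10
print(overlap((0, 0, 2, 2), (1, 1, 3, 3)))  # 1
```

The square and the sliver cover the same area, but the square has the smaller margin, so minimizing margin favors more quadratic directory rectangles.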
Difference between R-tree and R*-tree
• Minimization of area, margin, and overlap is crucial to the
performance of R-tree / R*-tree.
• The R*-tree attempts to reduce all three, using a combination of a
revised node split algorithm and the concept of forced reinsertion at
node overflow. This is based on the observation that R-tree structures
are highly susceptible to the order in which their entries are inserted,
so an insertion-built (rather than bulk-loaded) structure is likely to be
sub-optimal. Deletion and reinsertion of entries allows them to "find" a
place in the tree that may be more appropriate than their original
location, which improves retrieval performance.
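Forced reinsertion can be sketched as follows: when a node overflows, the entries whose centers lie farthest from the center of the node's bounding rectangle are removed and handed back to the insertion routine. This is a simplified illustration; the function names are hypothetical, and the default `fraction=0.3` is an assumed tuning value rather than a fixed part of the method.

```python
import math
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)

def center(r: Rect) -> Tuple[float, float]:
    return ((r[0] + r[2]) / 2, (r[1] + r[3]) / 2)

def mbr(rects: List[Rect]) -> Rect:
    """Minimum bounding rectangle of a non-empty list of rectangles."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))

def pick_for_reinsert(rects: List[Rect], fraction: float = 0.3):
    """Split an overflowing node's entries into (to_reinsert, to_keep).

    Entries are ranked by the distance of their center from the center of
    the node's bounding rectangle; the farthest ones are selected.
    """
    node_center = center(mbr(rects))
    ranked = sorted(rects,
                    key=lambda r: math.dist(center(r), node_center),
                    reverse=True)
    p = max(1, int(len(rects) * fraction))  # always remove at least one entry
    return ranked[:p], ranked[p:]

entries = [(0, 0, 2, 2), (1, 1, 3, 3), (2, 0, 4, 2), (0, 2, 2, 4), (10, 1, 11, 2)]
removed, kept = pick_for_reinsert(entries)
print(removed)    # [(10, 1, 11, 2)]
print(mbr(kept))  # (0, 0, 4, 4)
```

Removing the single outlier shrinks the node's bounding rectangle from (0, 0, 11, 4) to (0, 0, 4, 4); the removed entry would then be inserted again from the root so it can settle in a better-fitting node.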
Example
[Figure: two alternative groupings of the rectangles R1 to R5, one labeled "Preferred by R-tree" and the other "Preferred by R*-tree"]
Validation Methodology
• Methodology
– Experiments with simulated workloads
– Evaluation of design decisions
• Results
– R*-tree outperforms variants of R-tree and 2-level grid file.
– R*-tree is robust against non-uniform data distributions.
Summary
• Paper’s focus
– R*-tree – implementations and performance
• Ideas
– Heuristic Optimizations (p. 208)
• Reduction of area, margin, and overlap of the directory rectangles
– Better Storage Utilization (p. 211)
• Forced Reinsertion (splits can be prevented)
• Experimental comparison
– Using many data distributions
Assumptions, Rewrite today
• Assumptions
– Indexing data in two-dimensional space
– Bulk load and bulk reorganization not available
– Concurrency control and recovery costs are negligible
• Reinserts during split!
• Rewrite today
– Bulk-load of rectangles
– Compare with newer methods
• R+-tree (disjoint siblings), Hilbert R-tree
– Analytical results
• Formally compare R*-tree with alternatives