Transcript Slides
Fast Set Intersection in Memory
Bolin Ding (UIUC), Arnd Christian König (Microsoft Research)
Outline
• Introduction
• Intersection via fixed-width partitions
• Intersection via randomized partitions
• Experiments
Outline
• Introduction
• Intersection via fixed-width partitions
• Intersection via randomized partitions
• Experiments
Introduction
• Motivation: a general operation in many contexts
  – Information retrieval (boolean keyword queries)
  – Evaluation of conjunctive predicates
  – Data mining
  – Web search
• Preprocessing: index in linear (in the set size) space
• Fast online processing algorithms
• Focus on an in-memory index
Related Work
|L1|=n, |L2|=m, |L1 ∩ L2|=r; w: word size
• Sorted lists merge; B-trees, skip lists, or treaps
  [Hwang and Lin 72], [Knuth 73], [Brown and Tarjan 79], [Pugh 90], [Blelloch and Reid-Miller 98], …
• Adaptive algorithms (bound # comparisons w.r.t. opt)
  [Demaine, Lopez-Ortiz, and Munro 00], …
• Hash-table lookup
• Word-level parallelism
  – Map L1 and L2 to a small range when the intersection size r is small
  [Bille, Pagh, Pagh 07]
[Figure: small groups of L1 and L2 with their hash images]
Basic Idea
Preprocessing:
1. Partition into small groups
2. Hash-map each group to a small range {1, …, w} (w: word size)
Basic Idea
Two observations:
1. At most r = |L1 ∩ L2| (usually small) pairs of small groups intersect
2. If a pair of small groups does NOT intersect, then, w.h.p., their hash images do not intersect either (so we can rule it out)
Basic Idea
Online processing:
1. For each candidate pair (I1, I2), compute h(I1) ∩ h(I2)
2.1. If empty, skip (if r = |L1 ∩ L2| is small, we can skip many pairs);
2.2. otherwise, compute the intersection of the two small groups, I1 ∩ I2
Basic Idea
Two questions:
1. How to partition?
2. How to compute the intersection of two small groups?
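The filtering step above can be sketched in C. This is a hedged illustration, not the authors' code: w = 64 is taken to be the machine word size, and `hash_to_bit` is a placeholder hash standing in for the paper's hash function h.

```c
/* Sketch of the basic idea: each small group is summarized by a w-bit
 * "hash image"; a single bitwise AND of two images rules out most
 * non-intersecting pairs of groups in one word operation. */
#include <stdint.h>
#include <stddef.h>

#define W 64  /* word size w (assumed) */

/* Placeholder hash mapping an element to one of the w bit positions. */
static unsigned hash_to_bit(uint32_t x) {
    return (unsigned)((x * 2654435761u) >> 26);  /* top 6 bits -> 0..63 */
}

/* Preprocessing: encode a small group as a w-bit hash image. */
uint64_t hash_image(const uint32_t *group, size_t len) {
    uint64_t img = 0;
    for (size_t i = 0; i < len; i++)
        img |= 1ULL << hash_to_bit(group[i]);
    return img;
}

/* Online filter: if the images do not intersect, the groups cannot. */
int may_intersect(uint64_t img1, uint64_t img2) {
    return (img1 & img2) != 0;
}
```

A pair whose images AND to zero is certainly disjoint and is skipped; a nonzero AND may still be a false positive, which the exact intersection of the two small groups then resolves.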
Our Results
• Sets of comparable sizes: |L1|=n, |L2|=m, |L1 ∩ L2|=r (efficient in practice)
• Sets of skewed sizes (reuse the index) (better when n < …)

Outline
• Introduction
• Intersection via fixed-width partitions
• Intersection via randomized partitions
• Experiments

Fixed-Width Partitions
|L1|=n, |L2|=m, |L1 ∩ L2|=r; w: word size
– Sort the two sets L1 and L2
– Partition each into equal-depth intervals (small groups) of √w elements per interval
– Compute the intersection of each candidate pair of small groups using QuickIntersect(I1, I2)

Subroutine QuickIntersect
(Preprocessing) Map I1 and I2 to {1, 2, …, w} using a hash function h;
  hash images are encoded as words of size w
[Figure: example groups {123401, 132405, 156710} and {122402, 132406, 156710} with bit-string hash images 0100001000100001, 0100000000000001, 0100010001000001]
(Online processing) Compute h(I1) ∩ h(I2) (bitwise AND) — one operation
(Online processing) Look up the 1's in h(I1) ∩ h(I2) and the "1-entries" in the hash tables;
  go back to I1 and I2 for each "1-entry"
Bad case: two different elements of I1 and I2 are mapped to the same value in {1, 2, …, w}

Analysis
|L1|=n, |L2|=m, |L1 ∩ L2|=r
• Per pair of groups: one word operation plus |h(I1) ∩ h(I2)| = (|I1 ∩ I2| + # bad cases) operations
• One bad case for each pair in expectation (choose group sizes s.t. |I1|·|I2| = w)
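A minimal sketch of QuickIntersect as described above, assuming w = 64, a placeholder hash `qbit`, and (a simplification the paper does not make) at most one element per bit position within each group; `__builtin_ctzll` is GCC/Clang-specific.

```c
/* Hedged sketch of QuickIntersect: each group keeps a w-bit hash image
 * plus a w-entry table bit -> element.  Assumes elements are nonzero and
 * that no two elements of the SAME group share a bit position. */
#include <stdint.h>
#include <stddef.h>

#define W 64  /* word size w (assumed) */

static unsigned qbit(uint32_t x) { return (unsigned)((x * 2654435761u) >> 26); }

typedef struct {
    uint64_t image;    /* w-bit hash image of the group */
    uint32_t slot[W];  /* bit position -> element (0 = empty) */
} Group;

/* Preprocessing: build the image and the per-bit lookup table. */
void build_group(Group *g, const uint32_t *elems, size_t len) {
    g->image = 0;
    for (unsigned i = 0; i < W; i++) g->slot[i] = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned b = qbit(elems[i]);
        g->image |= 1ULL << b;
        g->slot[b] = elems[i];
    }
}

/* Online: AND the images, then check each surviving bit ("1-entry").
 * A set bit with different elements on the two sides is a "bad case". */
size_t quick_intersect(const Group *g1, const Group *g2, uint32_t *out) {
    uint64_t common = g1->image & g2->image;
    size_t r = 0;
    while (common) {
        unsigned b = (unsigned)__builtin_ctzll(common);  /* lowest set bit */
        common &= common - 1;
        if (g1->slot[b] == g2->slot[b])  /* real match, not a bad case */
            out[r++] = g1->slot[b];
    }
    return r;
}
```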
Analysis
Optimize the parameters a and b: |L1|=n, |L2|=m, |L1 ∩ L2|=r; w: word size
– n/a + m/b: total # of small groups (|I1| = a, |I2| = b)
– a·b = w: ensures one bad case for each pair I1, I2 in expectation
Total complexity: O(n/a + m/b + r) = O((n+m)/√w + r)

Outline
• Introduction
• Intersection via fixed-width partitions
• Intersection via randomized partitions
• Experiments

Randomized Partition
– Two sets L1 and L2; |L1|=n, |L2|=m, |L1 ∩ L2|=r; w: word size
– Grouping hash function g (different from h): group the elements of L1 and L2 according to g(·);
  the same grouping function is used for all sets
– Compute the intersection of each pair of small groups I1^i = {x ∈ L1 | g(x)=i} and I2^i = {y ∈ L2 | g(y)=i} using QuickIntersect
[Figure: elements of L1 bucketed by g(x) = 1, 2, …; elements of L2 bucketed by g(y) = 1, 2, …]

Analysis
• Number of bad cases for each pair: the rigorous analysis is trickier
• Then follow the previous analysis
• Generalizes to k sets

Data Structures
• Multi-resolution structure for online processing
• Linear (in the size of each set) space

A Practical Version
– Motivation: linear scan is cheap in memory
– Instead of using QuickIntersect to compute I1^i ∩ I2^i, just linear-scan the two groups, in time O(|I1^i| + |I2^i|)
– (Preprocessing) Map I1^i and I2^i into {1, …, w} using a hash function h (w: word size)
– (Online processing) Linear-scan I1^i and I2^i only if h(I1^i) ∩ h(I2^i) ≠ ∅
[Figure: a nonempty h(I1^i) ∩ h(I2^i) triggers a linear scan of I1^i and I2^i]
• The probability of a false positive is bounded; using p hash functions, a false positive happens with lower probability (at most the p-th power of the single-function bound)
• Complexity (the rigorous analysis is trickier)
  – When I1^i ∩ I2^i = ∅ (false positive): the pair is scanned only with small probability
  – When I1^i ∩ I2^i ≠ ∅: it must be scanned (only r such pairs)

Intersection of Short-Long Lists (when |I1^i| >> |I2^i|)
– Linear scan in I2^i
– Binary search in I1^i
– Based on the same data structure, we get …

Outline
• Introduction
• Intersection via fixed-width partitions
• Intersection via randomized partitions
• Experiments

Experiments
• Synthetic data: uniformly generated elements in {1, 2, …, R}
• Real data:
  – 8 million Wikipedia documents (inverted index → sets)
  – The 10 thousand most frequent queries from Bing (online intersection queries)
• Implemented in C; 4GB, 64-bit, 2.4GHz PC
• Time reported in milliseconds
• Baselines (cf. related work): Merge, SkipList, Adaptive, SvS, Hash, BPP, Lookup
  – Sorted lists merge; B-trees, skip lists, or treaps
  – Adaptive algorithms
  – Hash-table lookup
  – Word-level parallelism
• Our approaches:
  – Fixed-width partition: IntGroup
  – Randomized partition: RanGroup
  – The practical version: RanGroupScan
  – Short-long list intersection: HashBin

Varying the Size of Sets: n = m = 1M~10M; r = n·1%
Varying the Size of Intersection: n = m = 10M; r = n·(0.005%~100%) = 500~10M
Varying the Ratio of Set Sizes: m = 10M; m/n = 2~500; r = n·1%

Conclusion
• Simple and fast set intersection algorithms
• Novel performance guarantees in theory
• Better wall-clock performance: best in most cases; otherwise, close to the best
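The practical version above (RanGroupScan in the experiments) might look like the following sketch, where the grouping hash `g_of`, the mapping hash `h_of`, and the group count `NGROUPS` are all illustrative stand-ins, and groups are assumed to stay small.

```c
/* Hedged sketch of the practical version: bucket both sorted lists by a
 * grouping hash g; online, a matching pair of groups is merge-scanned
 * only if their w-bit hash images (built with a second hash h) intersect. */
#include <stdint.h>
#include <stddef.h>

#define W 64       /* word size w (assumed) */
#define NGROUPS 8  /* illustrative; the paper tunes group counts to set sizes */

static unsigned g_of(uint32_t x) { return (unsigned)(x % NGROUPS); }   /* grouping hash g */
static unsigned h_of(uint32_t x) { return (unsigned)((x * 40503u) % W); } /* mapping hash h */

typedef struct {
    uint32_t elems[64];  /* groups assumed small (<= 64) in this sketch */
    size_t len;
    uint64_t image;      /* w-bit hash image under h */
} SGroup;

/* Preprocessing: bucket a sorted list by g and build each group's image. */
void build_groups(SGroup groups[NGROUPS], const uint32_t *list, size_t n) {
    for (int i = 0; i < NGROUPS; i++) { groups[i].len = 0; groups[i].image = 0; }
    for (size_t i = 0; i < n; i++) {
        SGroup *sg = &groups[g_of(list[i])];
        sg->elems[sg->len++] = list[i];
        sg->image |= 1ULL << h_of(list[i]);
    }
}

/* Online: skip a pair of matching groups unless their images intersect;
 * otherwise linear-scan (merge) the two small sorted groups. */
size_t ran_group_scan(const SGroup a[NGROUPS], const SGroup b[NGROUPS],
                      uint32_t *out) {
    size_t r = 0;
    for (int i = 0; i < NGROUPS; i++) {
        if ((a[i].image & b[i].image) == 0) continue;  /* certainly disjoint */
        size_t j = 0, k = 0;
        while (j < a[i].len && k < b[i].len) {
            if (a[i].elems[j] < b[i].elems[k]) j++;
            else if (a[i].elems[j] > b[i].elems[k]) k++;
            else { out[r++] = a[i].elems[j]; j++; k++; }
        }
    }
    return r;
}
```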
Future work
– Storage compression in our approaches
– Select the best algorithm/parameter by estimating the size of the intersection
  (RanGroup vs. HashBin) (RanGroup vs. Merge) (parameter p) (sizes of groups)

Compression (preliminary results)
• Storage compression in RanGroup
  – Grouping hash function g: random permutation / prefix of permuted ID
  – Storage of short lists: suffix of permuted ID
• Compared with the Merge algorithm on d-compression of the inverted index
  – Compressed Merge: 70% compression, 7 times slower (800ms)
  – Compressed RanGroup: 40% compression, 1.5 times slower (140ms)

Remarks
• Linear additional space for preprocessing and indexing
• Generalizes to k lists
• Generalizes to the disk I/O model
• Difference between Pagh's approach and ours
  – Pagh: mapping + word-level parallelism
  – Ours: grouping + mapping

Number of Hash Functions
[Figure: linear scan; size = 10M; p = 1, 2, 4, 6, 8; n = m = 1M~10M; r = n·1%]

Intersection of Short-Long Lists
– Two sorted lists L1 and L2
– (Preprocessing) Hash function g: group the elements of L1 and L2 according to g(·)
– (Online processing) For each element y in I2^i = {y ∈ L2 | g(y)=i}, binary-search it in I1^i = {x ∈ L1 | g(x)=i}
[Figure: L1 and L2 bucketed into matching groups]

Analysis
• Online processing time
  – Random variable S_i: size of I2^i
  – Uses: a) E(S_i) = 1; b) ∑_i |I1^i| = n; c) concavity of log(·)
• Comments:
  1. The same performance guarantee as B-trees, skip lists, and treaps
  2. Simple and efficient in practice
  3. Can be generalized to k lists

Data Structures
[Figure: random access vs. linear scan]
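The short-long (HashBin) online step described above can be sketched as follows, for one group index i; the grouping by g is assumed to have happened at preprocessing time, and the helper names are illustrative.

```c
/* Hedged sketch of the HashBin online step: each element of the short
 * list's group I2^i is binary-searched in the long list's matching
 * sorted group I1^i. */
#include <stdint.h>
#include <stddef.h>

/* Binary search for key in a sorted array; returns 1 if present. */
static int bsearch_u32(const uint32_t *a, size_t n, uint32_t key) {
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (a[mid] == key) return 1;
        if (a[mid] < key) lo = mid + 1; else hi = mid;
    }
    return 0;
}

/* Online step for one group index i: i1 is the long group (sorted),
 * i2 the short group; each y in i2 is looked up in i1. */
size_t hashbin_group(const uint32_t *i1, size_t n1,
                     const uint32_t *i2, size_t n2, uint32_t *out) {
    size_t r = 0;
    for (size_t j = 0; j < n2; j++)
        if (bsearch_u32(i1, n1, i2[j])) out[r++] = i2[j];
    return r;
}
```

With m groups and E(S_i) = 1, this spends one O(log |I1^i|) search per short-list element, which is where the B-tree-like guarantee in the analysis comes from.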