C-Store: Class Overview Spring, 2009

Download Report

Transcript C-Store: Class Overview Spring, 2009

Bitmap Indices for Data Warehouse
Jianlin Feng
School of Software
SUN YAT-SEN UNIVERSITY
Feb 27, 2009
Star Schema Vs.
Multi-dimensional Range Queries
product
prodId name price
p1
bolt
10
p2
nut
5
sale oderId
o100
o102
o105
date custId
1/7/97
53
2/7/97
53
3/8/97 111
SUM (qty * amt)
WHERE ProdId in [p1..
p10] AND custId < 200
store
prodId
p1
p2
p1
customer
storeId
c1
c1
c3
custId
53
81
111
qty
1
2
5
storeId
c1
c2
c3
city
nyc
sfo
la
amt
12
11
50
name
joe
fred
sally
address
10 main
12 main
80 willow
city
sfo
sfo
la
Characteristics of Multi-Dimensional
Range Queries in Data Warehouse

Ad-Hoc


Give N dimensions (attributes), every combination
is possible: 2N combinations
A Data Cube equals to 2N GROUP-Bys

High Dimensions ( > 20)

Large Number of Records
Multi-Dimensional Index Fails!

R-Trees or KD-Trees



Effective only for moderate number of dimensions
Efficient only for queries involving all indexed
dimensions.
For Ad-hoc Rang Queries, Projection Index is
usually better, and Bitmap Index is even
better.
Projection Index

Fix the order of the
records in the base table


Project records along
some dimension




Store
i.e, A single Column
Keeping the record order
Keeping the duplicates
Like “array” in C language
base table
store
storeId
c1
c2
c3
storeId
c1
c2
c3
city
nyc
sfo
la
Projection
Index
Multi-dimensional Range Queries :
A General Idea

Build an index for each dimension (attribute);


A Projection Index
A B-Tree

1 Primary B-Tree, N -1 Secondary B-Trees

For each involved dimension, use the index
on that dimension to select records;

“AND” the records to get the final answer set.
How to make the “AND” operation fast?

Projection Index (B-Tree is similar)



Scan each involved dimension,
And return a set of RIDs.
Intersection the RID sets



Sets have different lengths
We can use Sort and Merge to do the Intersection
Life is easier


when all the sets have the same length and in the same
order
Use 1/0 to record the membership of each record
General Ideas of Bitmap Index



Fix the order of records in the base table
Suppose the base table has m records
For each dimension




For each distinct dimension value (as the KEY)
Build a bitmap with m bits (as the POSITIONS)
A bitmap is like an Inverted Index
“AND”, “OR” operations


realized by bitwise logical operations
Well supported by hardware
Basic Bitmap Index
P. O’Neil, Model 204,1987
Size of Bitmap Indices

Number of Bitmap (Indices)



How to build bitmap indices for dimensions with
large distinct values
Temperature dimension
Size (i.e., Length) of a Single Bitmap
Three Solutions

Encoding


Binning


Reduce the Number of Bitmaps
Reduce the Number of Bitmaps
Compression

Reduce the Size of a Single Bitmap
Encoding Strategies

Equality-encoded



Bit-sliced index


Assume dimension A has c distinct values, use log2c
bitmap indices to represent each record (its value)
Range-encoded


Good for equality queries,such as “temperature == 100”
Basic Bitmap Index
Good for one-sided range queries, such as “Pressure <
56.7”
Interval-encoded

Good for two-sided range queries, such as“35.8 <
Pressure < 56.7”
Binning

Encoding mainly considers discrete dimension
values


Basic Ideas of Binning



Usually integers
Build a bitmap index for a bin instead of for a distinct value
The Number of Bitmaps has nothing to do with the number
of distinct values in a dimension.
Pros and Cons


Pros:control the number of bitmap via controling the
number of bins.
Cons:need to check original dimension values to decide if
the records really satisfy query conditions.
A Binning Example:
Values of Dimension A lie in [0, 100]
Compression Strategies

General-purpose compression methods


Software packages are widely available
Tradeoff between query processing and compression ration


De-compress data first
Specific methods

BBC (Byte-aligned Bitmap Code ), Antoshenkov,1994,1996.


Adopted since Oracle 7.3
WAH(Word-aligned Hybrid Bitmap code ), Wu et al 2004,
2006.

Used in Lawrence Berkeley Lab for high-energy physics
WAH(Word-aligned Hybrid Bitmap code )

Based on run-length encoding


Use machine WORD as the unit for compression


For consecutive 0s or 1s in a bit sequence (part of a
bitmap)
Instead of BYTE in BBC
Design Goal:

reduce the overhead of de-compression, in order to speedup query response.
Run-length encoding





Bit sequence B: 11111111110001110000111111110001001
fill:a set of consecutive identical bits (all 0s or all 1s)
 The first 10 bits in B
 fill = count “+” bit value
 1111111111=10 “+” 1
tail: a set of mizxed 0s and 1s
 The last 8 bits in B
Run:
 Run = fill + tail
Basic Ideas of WAH
 Define fill and tail appropriately so that they can be stored in
WORDs.
Word-aligned Hybrid Bitmap code:
32-bit WORD
WAH vs. B-tree vs. BBC

On one dimensional range queries


B-tree has the same time complexity


The query response time grows linearly in the number of
hits.
Records selected by each single dimension can not be
easily combined.
Query response time




I/O + CPU
I/O: WAH > BBC
CPU: BBC > WAH
In Total:using WAH is 10 times faster than using BBC
Characteristics of Industrial Products

Model 204. (Pat O’Neil,1987)




Oracle (1995)



The first that adopted bitmap index
Basic Bitmap Index, No binning, No compression
Now owned by Computer Corporation of America
Adopted compressed bitmap index since 7.3
Probably use BBC for compression, Equality-encoded, No
binning.
Sybase IQ



bit-sliced index(Pat O’Neil et al,1997)
No binning, No compression
For dimension with small number of distinct values, use
Basic Bitmap Index.
References

Kurt Stockinger, Kesheng Wu, Bitmap Indices
for Data Warehouses, In Wrembel R.,
Koncilia Ch.: Data Warehouses and OLAP:
Concepts, Architectures and Solutions. Idea
Group, Inc. 2006.