C-Store: Integrating Compression and Execution

Jianlin Feng
School of Software
SUN YAT-SEN UNIVERSITY
Mar 20, 2009
High Compressibility in Column Store

Each attribute is stored in a separate column.

A column store can use traditional
compression techniques:

Dictionary encoding, Huffman encoding, etc.

It can also use column-oriented techniques:

Run-length encoding
Benefits of Compression in DBMS

Reduces the size of data

Improves I/O performance

Reducing seek times: data are stored nearer to each other.
Reducing transfer times: there is less data to transfer.
Increasing buffer hit rate: the buffer can hold a larger
fraction of the data.
How to Query a Compressed Column?

De-compress the data

Query the compressed column directly?

Run-Length Encoding
A Simple Example

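In column C1, the value “42” appears 1000
times consecutively.

Assume C1 is compressed with run-length
encoding, so the run is stored once as a value and a count.

Query: SUM(C1)

Answer: 42 * 1000 = 42000, computed without decompressing the run.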
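The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the (value, run_length) pair layout and the function names `rle_encode` and `rle_sum` are assumptions made for the example, not C-Store's actual storage format or API.

```python
def rle_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)  # extend the current run
        else:
            runs.append((v, 1))              # start a new run
    return runs


def rle_sum(runs):
    """SUM over the compressed column: one multiplication per run."""
    return sum(value * length for value, length in runs)


c1 = [42] * 1000          # the column from the example
runs = rle_encode(c1)     # a single run: (42, 1000)
total = rle_sum(runs)     # 42 * 1000
```

The sum touches one run instead of 1000 values, which is the point of operating on the compressed form.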
History of Compression in DBMS

80s:

Focus on compression ratio

90s:

CPU cost of compressing/decompressing should
be less than the savings of reducing the size of
data.

Now:

CPU speed increases much faster
than memory speed and disk speed.
Light-weight compression is good
Reducing CPU cost on compressed data
(Graefe and Shapiro, 1991)

Lazy Decompression


Data is decompressed only when an operator
needs to operate on it.

Query the compressed data directly

Exact-match comparison and natural join can run on
compressed data if the constant portion of the predicate
is compressed in the same way as the data.
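As a rough illustration of an exact-match comparison on compressed data, the sketch below uses dictionary encoding. The choice of dictionary encoding, the helper names (`build_dictionary`, `encode`, `select_equal`), and the sample values are all assumptions for the example; the slide does not specify a scheme.

```python
def build_dictionary(values):
    """Assign small integer codes in order of first appearance."""
    codes = {}
    for v in values:
        codes.setdefault(v, len(codes))
    return codes


def encode(values, codes):
    """Replace each value by its dictionary code."""
    return [codes[v] for v in values]


def select_equal(encoded_column, codes, constant):
    """Evaluate `column = constant` without decoding the column.

    The constant is compressed the same way as the data (looked up in
    the same dictionary), then codes are compared directly.
    """
    code = codes.get(constant)
    if code is None:
        return []  # the constant never occurs in the column
    return [pos for pos, c in enumerate(encoded_column) if c == code]


retflag = ["A", "N", "R", "N", "A", "N"]   # hypothetical column values
codes = build_dictionary(retflag)
encoded = encode(retflag, codes)
positions = select_equal(encoded, codes, "N")
```

Only the single constant is encoded; the column itself is never decompressed.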
New Work in C-Store

Simultaneously apply an operation on
multiple values in a single column.

Introduces a novel architecture for passing
compressed data between query operators.


Minimizes operator code complexity
while maximizing the chances for direct operation on
compressed data.
Review: Basic Concepts of C-Store


A logical table is physically represented as a
set of projections.
Each projection consists of a set of columns



Columns are stored separately, along with a
common sort order defined by SORT KEY.
Each column appears in at least one
projection.
A column can have different sort orders if it is
stored in multiple projections.
An example of C-Store Projection

LINEITEM(shipdate, quantity, retflag, suppkey |
shipdate, quantity, retflag)




First sorted by shipdate
Second sorted by quantity
Third sorted by retflag
Sorting increases locality of data.

This favors compression techniques such as run-length
encoding.
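To see why a sorted projection favors run-length encoding, the sketch below counts the runs in the shipdate column before and after sorting by (shipdate, quantity, retflag). The rows are made-up sample data, and `count_runs` is a hypothetical helper, not part of C-Store.

```python
from itertools import groupby


def count_runs(column):
    """Number of maximal runs of equal consecutive values."""
    return sum(1 for _ in groupby(column))


# Hypothetical (shipdate, quantity, retflag) rows in arrival order.
rows = [
    ("1998-01-02", 5, "N"),
    ("1998-01-01", 3, "A"),
    ("1998-01-02", 3, "N"),
    ("1998-01-01", 3, "R"),
    ("1998-01-02", 5, "A"),
    ("1998-01-01", 7, "A"),
]

# Tuple comparison sorts by shipdate, then quantity, then retflag.
sorted_rows = sorted(rows)

runs_before = count_runs([r[0] for r in rows])
runs_after = count_runs([r[0] for r in sorted_rows])
```

Fewer runs means fewer RLE entries for the same column.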
C-Store Operators vs. Relational Operators

Selection

Produces bitmaps that can be efficiently combined.

Mask

Materializes a set of values from a column and a bitmap.

Permute

Reorders a column using a join index.

Projection

Projecting a column is free.
Two columns in the same order can be concatenated for
free.

Join

Produces positions rather than values.
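The selection and mask operators can be illustrated with a small sketch. Bitmaps are modeled here as Python lists of 0/1, and the function names `select`, `bitmap_and`, and `mask` as well as the sample columns are assumptions for the example, not C-Store's operator interface.

```python
def select(column, predicate):
    """Selection: one bit per position, 1 where the predicate holds."""
    return [1 if predicate(v) else 0 for v in column]


def bitmap_and(a, b):
    """Combine two selection results position by position."""
    return [x & y for x, y in zip(a, b)]


def mask(column, bitmap):
    """Mask: materialize the values whose bit is set."""
    return [v for v, bit in zip(column, bitmap) if bit]


# Two hypothetical columns stored in the same order.
quantity = [3, 3, 7, 3, 5, 5]
retflag = ["A", "R", "A", "N", "A", "N"]

# WHERE quantity >= 5 AND retflag = 'A'
bm = bitmap_and(
    select(quantity, lambda q: q >= 5),
    select(retflag, lambda f: f == "A"),
)
result = mask(quantity, bm)
```

Values are materialized only once, after the two predicates have been combined at the bitmap level.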
Join over Two Columns: An Example
Compressed Query Execution:
Two Classes for Each New Compression Technique

Compression Block


Encapsulates an intermediate representation for
compressed data.
DataSource operator

Reads in compressed pages from disk and
converts them into compression blocks.
A Compression Block


contains a buffer of the column data in
compressed format
Provides an API that allows the buffer to be
accessed in several ways.
Accessing Properties of Compression
Block

isOneValue()

Returns whether or not the block contains just one
value (and many positions for that value).

isValueSorted()

Returns whether or not the block’s values are
sorted.

isPosContig()

Returns whether or not the block is a consecutive
subset of a column.
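One way an operator might use these properties is to branch on them rather than on the compression scheme itself. The sketch below is a guess at that pattern: `RLEBlock`, `PlainBlock`, and `block_sum` are hypothetical, and the property values returned by `PlainBlock` are conservative assumptions. The methods getSize(), getStartValue(), and asArray() follow the names used elsewhere in these slides.

```python
class RLEBlock:
    """Hypothetical block holding one RLE triple."""

    def __init__(self, value, start_pos, run_length):
        self.value = value
        self.start_pos = start_pos
        self.run_length = run_length

    def isOneValue(self):
        return True   # a run repeats a single value

    def isValueSorted(self):
        return True   # a single repeated value is trivially sorted

    def isPosContig(self):
        return True   # a run covers consecutive positions

    def getSize(self):
        return self.run_length

    def getStartValue(self):
        return self.value

    def asArray(self):
        return [self.value] * self.run_length


class PlainBlock:
    """Hypothetical uncompressed block of consecutive column values."""

    def __init__(self, values, start_pos):
        self.values = list(values)
        self.start_pos = start_pos

    def isOneValue(self):
        return False

    def isValueSorted(self):
        return False  # assumed: nothing is known about the order

    def isPosContig(self):
        return True

    def getSize(self):
        return len(self.values)

    def asArray(self):
        return list(self.values)


def block_sum(block):
    """Sum a block, avoiding decompression when it holds one value."""
    if block.isOneValue():
        return block.getStartValue() * block.getSize()
    return sum(block.asArray())  # fall back to decompression
```

The operator never asks which encoding produced the block, so a new encoding only needs to answer the same property questions.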
Properties of Compression Block:
for Various Encoding Schemes.
Iterator Access:
where decompression cannot be avoided.

getNext()



Transiently decompresses the next value in the
compression block
Returns that value along with the position in the
uncompressed column.
asArray()


Decompresses the entire compression block
And returns a pointer to an array of data in the
uncompressed column type.
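A minimal sketch of iterator access over a run-length encoded block follows. The class layout is an assumption; in particular, Python has no pointers, so asArray() returns a list here, and returning None at the end of the block is a convention chosen only for this example.

```python
class RLEBlock:
    """Hypothetical block holding one RLE triple."""

    def __init__(self, value, start_pos, run_length):
        self.value = value
        self.start_pos = start_pos
        self.run_length = run_length
        self._offset = 0  # how many entries getNext() has returned

    def getNext(self):
        """Return (value, position) for the next entry, or None at the end."""
        if self._offset >= self.run_length:
            return None
        pair = (self.value, self.start_pos + self._offset)
        self._offset += 1
        return pair

    def asArray(self):
        """Decompress the whole block into a list of values."""
        return [self.value] * self.run_length


block = RLEBlock(value=42, start_pos=10, run_length=3)
first = block.getNext()    # first value with its column position
second = block.getNext()   # next value, next position
arr = block.asArray()      # the fully decompressed block
```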
Block Information Methods (1):
Getting Data without Decompression

For Run-Length Encoding


A compression block consists of a single RLE
triple of the form (value, start_pos, run_length).

getSize():

Returns run_length.

getStartValue():

Returns value.

getEndPosition():

Returns (start_pos + run_length - 1).
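The three methods map directly onto the fields of the RLE triple. The sketch below assumes a Python dataclass as the container; the class itself and the sample numbers are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RLEBlock:
    """One RLE triple (value, start_pos, run_length)."""

    value: int
    start_pos: int
    run_length: int

    def getSize(self):
        return self.run_length

    def getStartValue(self):
        return self.value

    def getEndPosition(self):
        # Positions are inclusive, hence the -1.
        return self.start_pos + self.run_length - 1


block = RLEBlock(value=42, start_pos=100, run_length=1000)
size = block.getSize()
start_value = block.getStartValue()
end_position = block.getEndPosition()
```

None of the three calls expands the run.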
Block Information Methods (2):
Getting Data without Decompression

For bitmaps


A compression block is a consecutive subset of a
bitmap for a single value.

getSize():

Returns the number of on bits in the block (i.e., a bit
string).

getStartValue():

Returns the value with which the bit string is associated.

getEndPosition():

Returns the position of the last on bit in the bit string.
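For the bitmap case, a comparable sketch is shown below. Representing the bit string as a Python str of '0'/'1' characters, carrying an explicit start_pos, and returning None when no bit is set are all assumptions made for the example.

```python
class BitmapBlock:
    """Hypothetical block: a slice of the bitmap for one value.

    bits[i] == '1' means the value occurs at column position start_pos + i.
    """

    def __init__(self, value, start_pos, bits):
        self.value = value
        self.start_pos = start_pos
        self.bits = bits

    def getSize(self):
        """Number of on bits, i.e. occurrences of the value in the block."""
        return self.bits.count("1")

    def getStartValue(self):
        """The value this bit string is associated with."""
        return self.value

    def getEndPosition(self):
        """Column position of the last on bit, or None if no bit is set."""
        last = self.bits.rfind("1")
        return None if last < 0 else self.start_pos + last


block = BitmapBlock(value="A", start_pos=20, bits="0110100")
size = block.getSize()
end_position = block.getEndPosition()
```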
Compression-Aware Optimization

Natural Join



One input column is compressed with run-length
encoding; the other input column is uncompressed.

Do the join directly on the RLE triples.

This reduces the number of operations by a factor of k, where
k is the run length of the RLE triple.
Count
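The natural-join case above can be sketched as a nested-loop join that compares each uncompressed value against one RLE triple rather than against every value in the run. The function name, the (value, start_pos, run_length) tuple layout, the output format, and the sample data are assumptions for illustration; the slide does not give the actual join algorithm.

```python
def join_rle_with_plain(rle_runs, plain_column):
    """Equi-join an RLE column with an uncompressed column.

    rle_runs: list of (value, start_pos, run_length) triples.
    Returns (matches, comparisons), where each match pairs the range of
    left positions covered by a run with one right position.
    """
    matches = []
    comparisons = 0
    for right_pos, right_val in enumerate(plain_column):
        for value, start_pos, run_length in rle_runs:
            comparisons += 1  # one comparison per run, not per value
            if value == right_val:
                left_positions = range(start_pos, start_pos + run_length)
                matches.append((left_positions, right_pos))
    return matches, comparisons


left = [(7, 0, 4), (9, 4, 4)]   # decompressed: 7 7 7 7 9 9 9 9
right = [9, 7, 5]
matches, comparisons = join_rle_with_plain(left, right)
joined_pairs = sum(len(r) for r, _ in matches)
```

With run length k = 4, the join performs 6 comparisons instead of the 24 a value-by-value nested loop would need on the decompressed column.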
Summary:
Integrating Compression and Execution

Operate directly on compressed data
whenever possible

Using compression blocks as an intermediate
representation of data.

Degenerate to a lazy decompression scheme
when decompression cannot be avoided

Iterating through values in a compression block.

Reduce query executor complexity

By abstracting general properties of compression
techniques.
References
1. Mike Stonebraker, Daniel Abadi, Adam Batkin,
Xuedong Chen, Mitch Cherniack, Miguel Ferreira,
Edmond Lau, Amerson Lin, Sam Madden,
Elizabeth O'Neil, Pat O'Neil, Alex Rasin, Nga Tran
and Stan Zdonik. C-Store: A Column Oriented
DBMS. VLDB, 2005.
http://db.csail.mit.edu/projects/cstore/vldb.pdf
2. Daniel J. Abadi, Samuel R. Madden, and Miguel C.
Ferreira. Integrating Compression and Execution
in Column-Oriented Database Systems. In
SIGMOD, June 2006, Chicago, USA.
http://db.csail.mit.edu/projects/cstore/abadisigmod06.pdf