PowerPoint 演示文稿

Transcript PowerPoint 演示文稿

DGFIndex for Smart Grid:
Enhancing Hive with a Cost Effective
Multi-dimensional Range Index
Yue Liu, Songlin Hu*, Tilmann Rabl, Wantao Liu,
Hans-Arno Jacobsen, Kaifeng Wu, Jian Chen, Jintao Li
MIDDLEWARE SYSTEMS
RESEARCH GROUP
MSRG.ORG
国网浙江省电力公司
国网电力科学研究院
STATE GRID ZHEJIANG ELECTRIC POWER COMPANY
STATE GRID ELECTRIC POWER RESEARCH INSTITUTE
1/19
Outline

Big Data challenges in Smart Grid

DGFIndex design

Experiments on smart grid data

Conclusions
2/19
Big Data Challenges in Smart Grid
Smart
Meter
The Electricity Consumption Information
Collection System of the Grid
CDMA
230MHZ
Collector
GPRS
RDBMS
Figure 1 Data flow in State Grid

Example


22 million smart meters in Zhejiang province, as required by China
State Grid, it should be 96 measurements/day
Will be 2.1 billion records in a single table
3/19
Features of Smart Meter Data
UserId
PowerConsumed
TimeStamp
PATE with Rate 1
Other Metrics
24012
12.34
1332988833
10.45
…
Table 1 An example format of smart meter data

Smart meter data features:




Queries features:




Time stamp field
Append only
Schema is static
Multi-dimensional range query
Lots of aggregation query
Query examples:
What is the average power consumption of user ids in the range 100 to
1000 and dates in the ranges “2013-01-01” to “2013-02-01”?
4/19
Why Migrate to Hadoop/Hive

Limitations of RDBMS



Low write throughput
Weak scalability
High license cost
16 times faster with only 1/10 cost
Figure 2 Write throughput comparison of RDBMS and HDFS
Hadoop/Hive is a good choice for solving smart meter big data problem

Hadoop and Hive



High write throughput
Flexible scalability
Low budget and cost effective
5/19
Indexes in Hive

Index



Data partition


Compact Index, Aggregate Index and
Bitmap Index
Store all combinations of index
dimensions and location
Each partition is a directory, reorganize
data into different directory
Column
Name
Type
Index
dimension 1
Type in base
table
Index
dimension 2
Type in base
table
Index
dimension 3
Type in base
table
_bucketname
string
_offset
Array<bigint>
Table 2 3-dimensional compact index
Disadvantages on multi-dimensional range queries
when large number of distinct value in index dimensions


Index: large index table size
Data partition: lots of directories and small files
6/19
Limitations of Indexing in Hive
Temporary File
File:Offset
2
1
Predicate
4
Scan Index
Table
3
JobTracker
InputFormat.getSplits
5
Chosen Splits
6
MR Job
Figure 3 Query with current index

Limitations:



Storing combination of index dimensions leads to extremely
large index table
High selectivity leads to large temporary file, which may
overflow the JobTracker’s memory
Poor filter effect when the value of index dimension scattered
evenly in data file
7/19
Recall Data Features

Smart meter data features:




Time stamp field
Append only
Schema is static
Queries features:


Multi-dimensional range query
Lots of aggregation query
8/19
DGFIndex Design




GFUKey is the left lower coordinate of
GFU in the data space
GFUValue consists of header and
location


2-dimensional
DGFIndex
aggregation values
dimension Y

Using grid file to split logical data
space into units(GFU)
Data in same GFU is stored together in
the file of HDFS, named Slice
GFU is stored as a GFUKey/GFUValue
pair in key/value store
data file on
HDFS
GFUKey
Slice
GFUValue
header location
GFU
dimension X
Figure 4 DGFIndex architecture
Header contains some pre-computed
aggregation values
Location is the start and end offset of
corresponding data segment in file of
HDFS
9/19
DGFIndex Construction
Input
Map Phase
Reduce Phase
Splitting Policy
DGFIndex Table
19
GFUKey
test
A
1
5
7
2
9
11
3
12
8
B
14
18
12
11
14
16
18
12
13
17
C
0.1
0.5
1.2
0.5
0.8
1.3
0.9
0.3
0.2
B 15
13
11
1
4
A:9|B:14
7
A
10
Output
13
1_13
4_17
7_11
1_11
7_13
1_17
10_11
10_15
GFUValue
header
location
sum(C) filename start
0.1
test
0
0.5
test
9
1.2
test
18
0.5
test
27
1
test
36
0.9
test
54
0.3
test
63
1.3
test
72
end
0
9
18
27
45
54
63
72
7_13
Reducer
Mapper
Input:{36,9|14|0.8}
Input:{7_13,<9|14|0.8,8|13|0.2>}
Output:{7_13,9|14|0.8}
Output:{null,<9|14|0.8,8|13|0.2>}
Figure 6 DGFIndex construction
test after
reorganization
A
1
5
7
2
9
8
3
12
11
B
14
18
12
11
14
13
18
12
16
C
0.1
0.5
1.2
0.5
0.8
0.2
0.9
0.3
1.3
10/19
DGFIndex Query
SELECT SUM(C)
FROM test
WHERE A>=5 AND A<12
AND B>=12 AND B<16
19
Step 1
17
(Hive)
B
15
13
11
1
4
7
10
13
A
GFUKey
1_13
4_17
7_11
1_11
7_13
1_17
10_11
10_15
GFUValue
header
location
sum(C) filename start
0.1
test
0
0.5
test
9
1.2
test
18
0.5
test
27
1
test
36
0.9
test
54
0.3
test
63
1.3
test
72
end
0
9
18
27
45
54
63
72
1.0
filename start
test
18
test
63
test
72
end
18
63
72
overlapped?
split
test:0
Step 2
List<start,end>
18,18,63,63,72,72
(InputFormat.getSplits)
Step 3
chosen splits
Skip
(Reader.next)
0
test:0
test
combine
2.2
all splits
1.2
Skip
18
test:0
63
72
Figure 7 DGFIndex query
11/19
Advantages of DGFIndex



Smaller index size
High index read speed, selective
Pre-computation
Data File
on HDFS
2-Dimensional
DGFIndex
Slice
Dimension Y
GFUKey GFUValue
Header Location
GFU
Slice
Dimension X
Figure 8 DGFIndex architecture
12/19
Experiments

Comparison System


Hive with Compact Index, HadoopDB
Environment

Hardware


Software





29 virtual nodes, each has 8 cores, 8GB RAM, 300GB disk
CentOS 6.5 b4bit, Jdk 1.6.0_45 64bit, Hadoop-1.2.1, HBase-0.94.13
DGFIndex is implemented in Hive-0.10.0
Replication factor is 2 in HDFS, mapred.task.io.sort.mb=512MB
PostgreSQL 8.4.20 for HadoopDB
Data Set and Query


Real meter data (1TB in TextFile and 890GB in RCFile, no
compression) and ad-hoc queries from Zhejiang Grid
Lineitem table (518GB in TextFile and 468GB in RCFile, no
compression) and Q6 from TPC-H
13/19
Index Size and Construction Time
Dimension Name
# of distinct
value
Name
# of intervals in
UserId
UserId
14,000,000
DGF-L
100
RegionId
11
DGF-M
1,000
Time
30
DGF-S
10,000
Table 3 The number of distinct value in index dimensions
4 The number of intervals in UserId dimension
DGFIndex construction costs moreTable
time,
but has smaller size
Index Type
Table Type
# of index
dimension
Index dimension
Size
Time(s)
Compact
RCFile
3
UserId,RegionId,Time
821GB
23,350
Compact
RCFile
2
RegionId,Time
7MB
1,884
DGF-L
TextFile
3
UserId,RegionId,Time
0.94MB
25,816
DGF-M
TextFile
3
UserId,RegionId,Time
3MB
25,632
DGF-S
TextFile
3
UserId,RegionId,Time
13MB
26,027
Table 5 Index size and construction time
14/19
Aggregation Query
For aggregation query, DGFIndex is 2-50 faster
Figure 9 Point query
than Compact Index and HadoopDB
Figure 10 5% selectivity
Figure 11 12% selectivity
Index Type
Point
5%
12%
SELECT SUM(powerConsumed)
FROM meterdata
WHERE regionId>r1 AND regionId<r2
AND userId>u1 AND userId<u2
AND time>t1 AND time<t2
Compact
169,395,953
4,756,501,768
6,586,886,752
DGF-L
4,347,200
67,678
100,386
DGF-M
4,258,358
20,280
31,215
DGF-S
2,291,718
16,122
23,712
Listing 5 Aggregation query
Accurate
26
569,186,384
1,354,351,336
Table 6 number of records needed to read after being filtered by index
15/19
GroupyBy Query
or non-aggregation query, DGFIndex is 2-5 faster than Compact Index and
Figure 12 Point query
Figure 13
5% selectivity data of
Figure
14 12%Index.
selectivity
HadoopDB,
only need to read
1/40-1/5
Compact
SELECT time,SUM(powerConsumed)
FROM meterdata
WHERE regionId>r1 AND regionId<r2
AND userId>u1 AND userId<u2
AND time>t1 AND time<t2
GROUP BY time
Listing 6 GroupBy query
Index Type
Point
5%
Compact
169,395,953
4,756,501,768
6,586,886,752
DGF-L
4,347,200
681,321,681
1,433,931,728
DGF-M
4,258,358
641,128,331
1,401,070,456
DGF-S
2,291,718
572,231,864
1,367,754,156
26
569,186,384
1,354,351,336
Accurate
12%
Table 7 number of records needed to read after being filtered by index
16/19
TPC-H Data Set and Q6
SELECT SUM(l_extendedprice*l_discount) as revenue
FROM lineitem
WHERE l_shipdate>=date’[DATE]’
AND l_shipdate<date’[DATE]’+interval ‘1’ year
AND l_discount between [DISCOUNT]-0.01
and [DISCOUNT]+0.01
AND l_quantity<[QUANTITY]
Listing 8 Q6 from TPC-H
DGFIndex is also efficient for general case data
Index Type
Record
Number
Whole table
4,095,002,340
Compact-3D
Figure 16 Q6 from TPC-H cost time
Table
Type
# of index
dimension
Size
Time(s)
4,095,002,340
Index
Type
Compact-2D
4,095,002,340
Compact
RCFile
3
189GB
7,367
DGFIndex
85,430,966
Compact
RCFile
2(l_discount,l_quantity)
637MB
991
Accurate
77,955,077
DGFIndex
TextFile
3
4.3MB
10,997
Table 10 number of records needed to
read after being filtered by index
Table 8 Index size and construction time
17/19
Conclusions



Multi-dimensional range index is essential for
Hive-based smart meter data processing
We propose a cost effective multi-dimensional
range index for Hadoop/Hive
Experimental results show the efficiency of our
DGFIndex
18/19
Thanks
Questions?
19/19