2013-10-09-A Charact..

A Characterization of Big Data
Benchmarks
Wen Xiong, Zhibin Yu, Zhendong Bei, Juanjuan
Zhao, Fan Zhang, Yubin Zou, Xue Bai, Ye Li,
Chengzhong Xu
Shenzhen Institutes of Advanced Technology
Chinese Academy of Sciences
Agenda
• Background
• Motivation
• Methodology
• Evaluation
• Conclusion
• Future work
Background
• Requirements of a benchmark suite
• Characteristics of different workload-input pairs
• Spatio-temporal data in a real world system
ETI Confidential
13/04/2015
3
Background (1/3)
• Requirements of a benchmark suite
– a benchmark suite should contain workloads that
represent a wide range of application domains.
– workloads in a benchmark suite should be as diverse as
possible.
– a benchmark suite should not contain redundant
workloads, so that simulation or measurement time stays
as short as possible.
Background (1/3)
• Simulation time for different numbers of
workload-input pairs
Removing redundancy can reduce the number of
workload-input pairs by 30% and the simulation time by 40%.
Background (2/3)
• Characteristics of different workload-input pairs
– Behavior of workload characteristics as the size of the
input data set changes
• Stable
• Unstable
Background (3/3)
• Spatio-temporal data in Shenzhen Transportation
System
– GPS trajectory data of taxicabs: 30,000+ taxicabs, 90
million GPS points per day.
– Smart card data in the metro transportation system: 15+
million smart cards, 12+ million transaction records
per day.
Background (3/3)
(1) 2000 square kilometers, 18 million people.
(2) The road network in Shenzhen contains 73,515 vertices and
101,794 road segments.
Motivation
• Remove redundancy of a typical benchmark suite
• Provide a benchmark suite for spatio-temporal
data
Motivation (1/2)
• Remove redundancy of a typical benchmark suite
– To reduce the time needed to benchmark the
target system by minimizing the number of typical
workload-input pairs.
Motivation (2/2)
• Provide a benchmark suite for spatio-temporal data
– Representative workloads in our benchmark suite are as
follows:
• transaction count (hotregion)
• spatiotemporal origin destination (sztod)
• map matching
• hotspot monitoring
• spatiotemporal secondary sort
Methodology
• Typical MapReduce-based workloads
• Microarchitecture-level metrics
• Principal component analysis (PCA)
• Hierarchical clustering and K-means clustering
Methodology
• Typical MapReduce-based workloads (1/2):
index  workload        source
1      sort            HiBench
2      wordcount       HiBench
3      terasort        HiBench
4      bayes           HiBench
5      K-means         HiBench
6      Nutch indexing  HiBench
7      pagerank        HiBench
8      hive-join       HiBench
9      hive-aggregate  HiBench
10     grep            DCBench
11     svm             DCBench
Methodology
• Typical MapReduce-based workloads (2/2):
index  workload   source
12     ibcf       DCBench
13     fpg        DCBench
14     hmm        DCBench
15     sztod      our internal program for trajectory data
16     hotregion  our internal program for trajectory data
Methodology
• Microarchitecture-level metrics are as follows:
– Instructions per cycle (IPC)
– L1 instruction cache miss ratio
– L2 instruction cache miss ratio
– Last-level cache miss ratio
– Branch predictions per instruction
– Branch mispredictions per instruction
– Off-chip bandwidth utilization
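Each of these metrics is a ratio of raw hardware event counts. The sketch below shows one plausible way to derive them. The slides name neither the performance events nor the collection tool, so every field name is hypothetical, the definitions (for example, miss ratio = misses / accesses, bandwidth utilization = achieved / peak) are assumptions, and the sample numbers are made up.

```python
from dataclasses import dataclass


@dataclass
class Counters:
    """Raw event counts for one run. All field names are hypothetical."""
    instructions: int
    cycles: int
    l1i_accesses: int
    l1i_misses: int
    l2i_accesses: int
    l2i_misses: int
    llc_accesses: int
    llc_misses: int
    branches: int
    branch_misses: int
    offchip_bytes: int
    seconds: float


def ratio(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def derive_metrics(c, peak_bytes_per_sec):
    """Turn raw counts into the seven metrics (assumed definitions)."""
    return {
        "ipc": ratio(c.instructions, c.cycles),
        "l1i_miss_ratio": ratio(c.l1i_misses, c.l1i_accesses),
        "l2i_miss_ratio": ratio(c.l2i_misses, c.l2i_accesses),
        "llc_miss_ratio": ratio(c.llc_misses, c.llc_accesses),
        "branches_per_instr": ratio(c.branches, c.instructions),
        "branch_misses_per_instr": ratio(c.branch_misses, c.instructions),
        # Achieved bandwidth (bytes / second) divided by the peak bandwidth.
        "offchip_bw_utilization": ratio(c.offchip_bytes,
                                        c.seconds * peak_bytes_per_sec),
    }


# Made-up numbers, for illustration only.
sample = Counters(instructions=850, cycles=1000,
                  l1i_accesses=400, l1i_misses=36,
                  l2i_accesses=36, l2i_misses=18,
                  llc_accesses=20, llc_misses=5,
                  branches=170, branch_misses=17,
                  offchip_bytes=140, seconds=1.0)
metrics = derive_metrics(sample, peak_bytes_per_sec=1000.0)
```

One such seven-value vector per workload-input pair would form one row of the matrix that the later analysis steps operate on.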
Methodology
• Principal Component Analysis:
– PCA reduces the number of program characteristics
(dimensions) while controlling the amount of information
that is thrown away.
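A minimal, dependency-free sketch of this idea follows. It assumes the metrics are standardized first and that "controlling the information thrown away" means keeping the fewest components that explain a chosen fraction of the variance; the slides confirm neither detail. The function names, the `keep` threshold, and the input matrix are illustrative, and power iteration with deflation stands in for a proper eigen-solver.

```python
import math


def standardize(rows):
    """Scale every column (metric) to zero mean and unit variance."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    # `or 1.0` avoids dividing by zero for a constant column.
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1)) or 1.0
            for j in range(d)]
    return [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]


def covariance(rows):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_eigenpair(m, iters=1000):
    """Dominant eigenvalue and eigenvector of a symmetric matrix (power iteration)."""
    d = len(m)
    v = [1.0 / (j + 1) for j in range(d)]  # arbitrary non-zero start vector
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iters):
        w = [sum(m[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:  # nothing left to extract
            return 0.0, v
        v = [x / norm for x in w]
    mv = [sum(m[i][j] * v[j] for j in range(d)) for i in range(d)]
    return sum(v[i] * mv[i] for i in range(d)), v


def pca(rows, keep=0.9):
    """Project rows onto the fewest components explaining >= `keep` of the variance."""
    z = standardize(rows)
    cov = covariance(z)
    d = len(cov)
    total = sum(cov[i][i] for i in range(d))
    components, explained = [], 0.0
    while len(components) < d and explained < keep * total:
        lam, v = top_eigenpair(cov)
        components.append(v)
        explained += lam
        # Deflation: subtract the component just found from the matrix.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    scores = [[sum(r[j] * c[j] for j in range(d)) for c in components]
              for r in z]
    return scores, (explained / total if total else 1.0)


# Made-up matrix: 4 workload-input pairs x 3 perfectly correlated metrics.
data = [[1, 2, 9], [2, 4, 8], [3, 6, 7], [4, 8, 6]]
scores, kept = pca(data, keep=0.9)
```

Because the three made-up metrics carry the same information, a single component is enough here; with real counter data more components would normally be retained.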
Methodology
• Hierarchical clustering
– Hierarchical clustering is a "bottom-up" approach: each
observation (here, a workload-input pair) starts in its own
cluster, and pairs of clusters are merged as one moves up the
hierarchy. It lets us look at multiple clustering
possibilities at once, and a dendrogram can be used to
select the desired number of clusters.
• K-means clustering
– K-means clustering partitions n workload-input
pairs into k clusters, in which each workload-input pair
belongs to the cluster with the nearest mean; k is a
value specified by the user.
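The two clustering schemes can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the authors' implementation: the slides do not name the distance or linkage, so Euclidean distance and average linkage are assumed; K-means is seeded with the first k points for reproducibility; and the 2-D points are made up (they could stand for the first two principal components of five workload-input pairs).

```python
import math


def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerative(points, k):
    """Bottom-up clustering: merge the two closest clusters until k remain.

    Average linkage is an assumption. Returns clusters as lists of indices.
    """
    clusters = [[i] for i in range(len(points))]

    def linkage(c1, c2):
        total = sum(dist(points[i], points[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        a, b = min(pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[a] += clusters[b]
        del clusters[b]
    return clusters


def kmeans(points, k, iters=100):
    """Lloyd's algorithm, seeded with the first k points (a simplification)."""
    centers = [list(p) for p in points[:k]]
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster goes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Made-up 2-D points, for illustration only.
pts = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0), (0.1, 0.0), (5.1, 5.0)]
hier = agglomerative(pts, 3)
labels, centers = kmeans(pts, 3)
```

Stopping the merge loop at different values of k corresponds to cutting the dendrogram at different linkage distances, whereas K-means needs k up front and its result depends on the seeds.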
Evaluation (instructions per cycle)
The IPC of these sixteen workloads ranges from 0.72 to
0.96, with an average of 0.85. Wordcount has the
lowest IPC and hotregion has the highest among
these workloads.
Evaluation (L1 ICache miss ratio)
The L1 instruction cache miss ratios of these typical workloads
range from 3.9% to 19.8%, with an average of 8.9%. svm has
the lowest L1 instruction cache miss ratio and hive-aggre has
the highest.
Evaluation (L2 ICache miss ratio)
The L2 instruction cache miss ratios of these workloads range from
23.7% to 64.9%. On average, workloads from DCBench (on the
right side) have a higher L2 instruction miss ratio than workloads
from HiBench (on the left side). Overall, the L2 cache is
ineffective on our experimental platform.
Evaluation (branch predictions per instruction)
These values range from 0.18 to 0.23, with an average
of 0.21. Hotregion has the lowest number of branch
predictions per instruction and nutchindexing has the
highest.
Evaluation (branch misprediction ratio)
These ratios range from 1.5% to 5.6%, with an average
of 2.7%. Pagerank has the lowest branch misprediction
ratio and nutchindexing has the highest. The results
show that the branch predictor of our processor is well
suited to these typical MapReduce-based applications.
Evaluation (off-chip bandwidth utilization)
Among the workloads we evaluated, terasort has the
highest utilization ratio, at 14%.
Overall, on our experimental platform, the processors
significantly over-provision off-chip bandwidth for these
typical workloads.
Evaluation (Hierarchical clustering )
[Figure: dendrogram of the workload-input pairs; axis: Linkage Distance (ticks 2 to 8); labels a, b, c appear in the figure]
Leaf labels as extracted: sort-30G, sort-60G, sort-15G, terasort-100G,
terasort-50G, bayes, terasort-25G, sztod-98G, pagerank, hotregion-17G,
hotregion-35G, grep-80G, hive-join, sztod-49G, svm-40G, grep-20G,
hotregion-70G, hive-aggre, wordcount-15G, wordcount-30G, wordcount-60G,
k-means, sztod-24G, ibcf-8G, hmm-16G, ibcf-4G, hmm-32G, hmm-8G, svm-20G,
grep-40G, ibcf-2G, svm-10G, fpg, nutchindexing
Evaluation (Hierarchical clustering )
index  cluster type    workloads
1      strong cluster  wordcount, sort, terasort
2      weak cluster    sztod, hotregion
3      non cluster     svm, ibcf
(1) Strong cluster: all three workload-input pairs of the same workload
are clustered together.
(2) Weak cluster: two workload-input pairs of the same workload are
clustered together.
(3) Non cluster: no workload-input pairs of the same workload are
clustered together.
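These three definitions can be expressed as a small classification rule. The sketch below assumes that each workload has three input sizes, that pair names follow the `workload-size` pattern, and that a cluster label per pair has already been obtained (for example by cutting the dendrogram). The function name, the placeholder workloads, and the labels are all hypothetical and do not reproduce the figure.

```python
from collections import Counter, defaultdict


def classify(pair_labels):
    """Map each workload to 'strong', 'weak', or 'non' cluster.

    pair_labels: dict of 'workload-size' name -> cluster label.
    """
    by_workload = defaultdict(list)
    for pair, label in pair_labels.items():
        workload = pair.rsplit("-", 1)[0]  # 'alpha-1G' -> 'alpha'
        by_workload[workload].append(label)

    result = {}
    for workload, labels in by_workload.items():
        # Largest number of this workload's pairs that share one cluster.
        largest = max(Counter(labels).values())
        if largest >= 3:
            result[workload] = "strong"
        elif largest == 2:
            result[workload] = "weak"
        else:
            result[workload] = "non"
    return result


# Invented labels for three placeholder workloads.
example = {
    "alpha-1G": 0, "alpha-2G": 0, "alpha-4G": 0,
    "beta-1G": 1, "beta-2G": 1, "beta-4G": 2,
    "gamma-1G": 0, "gamma-2G": 1, "gamma-4G": 2,
}
kinds = classify(example)
```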
Evaluation (K-means clustering)
• Selecting 8 workload-input pairs via K-means clustering
cluster  workloads                                     representative
1        sztod-98G, hotregion-17G, hmm-16G             hmm-16G
2        fpg, ibcf-2G                                  fpg
3        sztod-24G, sztod-49G                          sztod-49G
4        wordcount-15G, wordcount-30G,                 wordcount-30G
         wordcount-60G, svm-20G
5        nutchindexing                                 nutchindexing
6        hotregion-35G, hotregion-70G, bayes,          hotregion-35G
         hive-aggre
7        sort-15G, sort-30G, sort-60G,                 sort-60G
         terasort-25G, terasort-50G, terasort-100G,
         hive-join, pagerank
8        kmeans                                        kmeans
Evaluation (K-means clustering)
sort-60G can be taken as the representative workload-input pair
of its group, which has eight members.
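The slides do not state how the representative of a cluster is chosen. One common choice, assumed in the sketch below, is the member closest to the cluster centroid. The function name, the pair names, and the coordinates are hypothetical.

```python
import math


def representatives(names, points, labels):
    """For every cluster label, return the member nearest to that cluster's centroid."""
    clusters = {}
    for name, point, label in zip(names, points, labels):
        clusters.setdefault(label, []).append((name, point))

    reps = {}
    for label, members in clusters.items():
        coords = [p for _, p in members]
        centroid = [sum(col) / len(coords) for col in zip(*coords)]
        # Pick the member with the smallest Euclidean distance to the centroid.
        reps[label] = min(members, key=lambda m: math.dist(m[1], centroid))[0]
    return reps


# Made-up points: three pairs in cluster 0 and a singleton in cluster 1.
names = ["p1", "p2", "p3", "q1"]
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (9.0, 9.0)]
labels = [0, 0, 0, 1]
reps = representatives(names, points, labels)
```

A singleton cluster is its own representative, which matches rows such as nutchindexing and kmeans in the table.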
Conclusion
• Redundancy exists in these pioneering benchmark
suites
– For example, sort and terasort.
• The workload behavior of trajectory data analysis
applications is dramatically affected by their input
data sets.
Future work
• Conduct similarity analysis of workload-input
pairs at a larger scale.
– More metrics and larger input sizes
• Fully implement a big data benchmark suite for
spatio-temporal data
– Data model, data generator and typical workload-input
pairs.
Thank You !!!