Experience from Hadoop Benchmarking with HiBench

Experience with HiBench
From Micro-Benchmarks toward End-to-End Pipelines
WBDB 2013 Workshop Presentation
Lan Yi
Senior Software Engineer
Intel China Software Center
2013.07.16
HiBench
• Micro Benchmarks
– Sort
– WordCount
– TeraSort
• Web Search
– Nutch Indexing
– Page Rank
• HDFS
– Enhanced DFSIO
• Machine Learning
– Bayesian Classification
– K-Means Clustering
1. Different from GridMix, SWIM?
2. Micro Benchmark?
3. Isolated components?
4. End-2-end Benchmark?
5. We need ETL-Recommendation Pipeline
See our paper “The HiBench Benchmark Suite: Characterization of the MapReduce-Based Data Analysis” in ICDE’10 workshops (WISS’10)
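ETL-Recommendation (hammer)
[Figure: data flow of the ETL-Recommendation pipeline on a HIVE-Hadoop Cluster (Data Warehouse)]
• Inputs: TPC-DS sales updates and cookies updates, in batches h1, h2, ..., h24
• ETL-sales: sales updates → sales tables
• ETL-logs: cookies updates → log table (cookies, ip, agent, Retcode)
• Pref-sales: sales tables → sales preferences
• Pref-logs: log table → browsing preferences
• Pref-comb: sales preferences + browsing preferences → user-item preferences
• Mahout Item-based Collaborative Filtering: user-item preferences → item-item similarity matrix
• Offline test: test data → statistics & measurements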
ETL-Recommendation (hammer)
Task Dependences
[Figure: task dependency graph over ETL-sales, ETL-logs, Pref-sales, Pref-logs, Pref-comb, Item-based Collaborative Filtering, and Offline test]
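A task dependency graph like this one can be turned into an execution schedule with a topological sort. The sketch below is illustrative only: the edges are an assumption inferred from the task names (the extracted slide text lists the tasks but not the arrows), and `DEPENDS_ON` and `execution_waves` are hypothetical names, not part of HiBench.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# ASSUMED dependencies (task -> tasks it must wait for), inferred from the
# task names; the slide's actual arrows are not recoverable from the text.
DEPENDS_ON = {
    "ETL-sales": set(),
    "ETL-logs": set(),
    "Pref-sales": {"ETL-sales"},
    "Pref-logs": {"ETL-logs"},
    "Pref-comb": {"Pref-sales", "Pref-logs"},
    "Item-based CF": {"Pref-comb"},
    "Offline test": {"Item-based CF"},
}


def execution_waves(depends_on):
    """Group tasks into waves; all tasks within one wave could run in parallel."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose prerequisites are done
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(execution_waves(DEPENDS_ON), start=1):
        print(f"wave {number}: {', '.join(wave)}")
```

Under these assumed edges, the two ETL tasks are independent of each other, as are the two preference tasks, so each pair could be scheduled concurrently.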
Empirical Data (hammer)
Test environment
• Intel Xeon E5-2600 @ 2.2GHz, Sandy Bridge
• 2 x 8 x HT = 32 cores
• 192G Mem, WD 7200 0.3x12x4=14.4T
• 1000M net, 300M~400M/s
• 4-node cluster, RHL6.2, cdh4.1.2
• HiBench etl-recomm branch, HiTune-0.9
• Sales ~14G (TPC-DS scale 100), logs ~105G
[Figure: stacked bars per stage, orig vs. opt; axis 0.00 to 140.00]
Stage            orig     opt
Refresh-Sales    2.93     3.01
Refresh-Logs     19.44    19.78
Gen-Pref         12.29    12.00
Gen-Recomm       87.44    40.69
Test             15.85    16.33
[Figure: ETL vs. Recomm (orig): ETL 22.36, Recomm 115.57]
[Figure: ETL vs. Recomm (opt): ETL 22.78, Recomm 69.02]
[Figure: Gen-Recomm Hotspots: 2-hot-jobs vs. rest; data labels 75.75 and 62.19]
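The chart data labels can be cross-checked with a little arithmetic. The sketch below assumes a stage-to-value mapping that is not spelled out in the extracted slide text; it was chosen because the sums reproduce the ETL and Recomm totals on the slide to within rounding. The slide does not state the unit, so none is assumed here, and all names in the code are hypothetical.

```python
# Per-stage values from the chart data labels (decimal commas converted).
# ASSUMPTION: which label belongs to which stage and run is inferred from
# the fact that the sums below match the slide's ETL and Recomm totals.
STAGES = {
    "Refresh-Sales": {"orig": 2.93, "opt": 3.01},
    "Refresh-Logs": {"orig": 19.44, "opt": 19.78},
    "Gen-Pref": {"orig": 12.29, "opt": 12.00},
    "Gen-Recomm": {"orig": 87.44, "opt": 40.69},
    "Test": {"orig": 15.85, "opt": 16.33},
}

ETL_STAGES = ("Refresh-Sales", "Refresh-Logs")
RECOMM_STAGES = ("Gen-Pref", "Gen-Recomm", "Test")


def group_total(stage_names, run):
    """Sum the values of the given stages for one run ('orig' or 'opt')."""
    return round(sum(STAGES[name][run] for name in stage_names), 2)


def speedup(stage_name):
    """Ratio of the original value to the optimized value for one stage."""
    values = STAGES[stage_name]
    return values["orig"] / values["opt"]


if __name__ == "__main__":
    for run in ("orig", "opt"):
        print(run, "ETL:", group_total(ETL_STAGES, run),
              "Recomm:", group_total(RECOMM_STAGES, run))
    print("Gen-Recomm speedup:", round(speedup("Gen-Recomm"), 2))
```

With this mapping, the ETL share is nearly unchanged between the two runs, and almost all of the difference comes from Gen-Recomm, which drops by roughly a factor of two.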
Empirical Data (hammer)
[Figure: CPU (cores) Utilization over the timeline, two panels. Stacked series on a 0% to 100% axis: Sum of %idle, %guest, %iowait, %irq, %nice, %soft, %steal, %sys. Second axis: Completion Time in milliseconds, 0 to 4000000 in the first panel and 0 to 2000000 in the second.]
Empirical Data (hammer)
[Figure: Network Throughput (kB/s) and IO Utilization (%) over the timeline, two sets of panels. First set: throughput axis 0 to 7000 kB/s, IO utilization axis 0 to 25%. Second set: throughput axis 0 to 40000 kB/s, IO utilization axis 0 to 20%.]
LinkBench
• Benchmark for Social Graph Service
• Originally Developed by Facebook on Top of MySQL
– Simulate social graph workloads similar to Facebook’s online service
– Key workload properties match Facebook’s real production workload
• Different from Analytical Workloads
• Our Work
– Port LinkBench to HBase
– On top of Phoenix (SQL support over HBase)
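To make the contrast with analytical workloads concrete: a social graph service mostly handles short point and range operations on typed edges between object ids, rather than large batch scans. The toy below is a hypothetical in-memory sketch of that kind of operation mix. The class and method names are illustrative only; this is not LinkBench's code, nor the HBase/Phoenix schema used in the port.

```python
class InMemoryLinkStore:
    """Toy stand-in for a social-graph link store (illustration only)."""

    def __init__(self):
        # (id1, link_type) -> {id2: (timestamp, data)}
        self._links = {}

    def add_link(self, id1, link_type, id2, timestamp, data=b""):
        """Insert or overwrite the edge id1 -> id2 of the given type."""
        self._links.setdefault((id1, link_type), {})[id2] = (timestamp, data)

    def delete_link(self, id1, link_type, id2):
        """Remove the edge; return True if it existed."""
        edges = self._links.get((id1, link_type), {})
        return edges.pop(id2, None) is not None

    def count_links(self, id1, link_type):
        """Number of outgoing edges of one type from id1."""
        return len(self._links.get((id1, link_type), {}))

    def get_link_list(self, id1, link_type, limit=10):
        """Return up to `limit` (id2, timestamp) pairs, newest first."""
        edges = self._links.get((id1, link_type), {})
        newest_first = sorted(
            edges.items(), key=lambda item: item[1][0], reverse=True
        )
        return [(id2, ts) for id2, (ts, _data) in newest_first[:limit]]


if __name__ == "__main__":
    store = InMemoryLinkStore()
    store.add_link(1, "friend", 2, timestamp=100)
    store.add_link(1, "friend", 3, timestamp=200)
    print(store.count_links(1, "friend"))
    print(store.get_link_list(1, "friend", limit=1))
```

Each call touches only the edges of a single (id, type) pair, which is the access pattern that distinguishes this kind of online workload from the scan-heavy jobs in the rest of the deck.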
Resources
• HiBench
– https://github.com/intel-hadoop/HiBench
• HiBench ETL-Recomm Branch
– https://github.com/intel-hadoop/HiBench/tree/etl-recomm
• LinkBench
– https://github.com/intel-hadoop/linkbench
• HiTune
– https://github.com/intel-hadoop/HiTune
• Phoenix
– https://github.com/intel-hadoop/phoenix