System Software for Big Data Computing
Cho-Li Wang
The University of Hong Kong
HKU High-Performance Computing Lab.
• Total # of cores: 3004 CPU + 5376 GPU cores
• RAM size: 8.34 TB
• Disk storage: 130 TB
• Peak computing power: 27.05 TFlops
• GPU cluster (Nvidia M2050, "Tianhe-1a"): 7.62 TFlops
[Chart: aggregate computing power of the CS Gideon-II & CC MDRP clusters grew from 2.6 TFlops (2007.7) to 3.1 TFlops (2009), 20 TFlops (2010), and 31.45 TFlops (2011.1), a 12x increase in 3.5 years.]
Big Data: The "3Vs" Model
• High Volume (amount of data)
• High Velocity (speed of data in and out)
• High Variety (range of data types and sources)
2.5 × 10^18 bytes of data.
In 2010: 800,000 petabytes (would fill a stack of DVDs reaching from the earth to the moon and back). By 2020, that pile of DVDs would stretch halfway to Mars.
Our Research
• Heterogeneous Manycore Computing (CPUs + GPUs)
• Big Data Computing on Future Manycore Chips
• Multi-granularity Computation Migration
(1) Heterogeneous Manycore Computing (CPUs + GPUs)
JAPONICA: Java with Auto-Parallelization ON GraphIcs Coprocessing Architecture
[Figure: heterogeneous manycore architecture combining CPUs and a GPU]
New GPU & Coprocessors

Vendor | Model | Launch Date | Fab. (nm) | Accelerator Cores (Max.) | GPU Clock (MHz) | TDP (W) | Memory | Bandwidth (GB/s) | Programming Model | Remarks
Intel | Sandy Bridge | 2011Q1 | 32 | 12 HD Graphics 3000 EUs (8 threads/EU) | 850-1350 | 95 | L3: 8MB, sys mem (DDR3) | 21 | OpenCL | Bandwidth is system DDR3 memory bandwidth
Intel | Ivy Bridge | 2012Q2 | 22 | 16 HD Graphics 4000 EUs (8 threads/EU) | 650-1150 | 77 | L3: 8MB, sys mem (DDR3) | 25.6 | OpenCL | Bandwidth is system DDR3 memory bandwidth
Intel | Xeon Phi | 2012H2 | 22 | 60 x86 cores (with a 512-bit vector unit) | 600-1100 | 300 | 8GB GDDR5 | 320 | OpenMP, OpenCL, OpenACC | Less sensitive to branch-divergent workloads
AMD | Brazos 2.0 | 2012Q2 | 40 | 80 Evergreen shader cores | 488-680 | 18 | L2: 1MB, sys mem (DDR3) | 21 | OpenCL, C++ AMP | APU
AMD | Trinity | 2012Q2 | 32 | 128-384 Northern Islands cores | 723-800 | 17-100 | L2: 4MB, sys mem (DDR3) | 25 | OpenCL, C++ AMP | APU
Nvidia | Fermi | 2010Q1 | 40 | 512 CUDA cores (16 SMs) | 1300 | 238 | L1: 48KB, L2: 768KB, 6GB | 148 | CUDA, OpenCL, OpenACC | -
Nvidia | Kepler (GK110) | 2012Q4 | 28 | 2880 CUDA cores | 836/876 | 300 | 6GB GDDR5 | 288.5 | CUDA, OpenCL, OpenACC | 3x Perf/Watt, Dynamic Parallelism, Hyper-Q
#1 in Top500 (11/2012): Titan @ Oak Ridge National Lab.
• 18,688 AMD Opteron 6274 16-core CPUs (32 GB DDR3)
• 18,688 Nvidia Tesla K20X GPUs
• Total RAM size: over 710 TB
• Total storage: 10 PB
• Peak performance: 27 Petaflop/s (see the check after this list)
  o GPU : CPU = 1.311 TF/s : 0.141 TF/s = 9.3 : 1
• Linpack: 17.59 Petaflop/s
• Power consumption: 8.2 MW
• Titan compute board: 4 AMD Opterons + 4 NVIDIA Tesla K20X GPUs
• NVIDIA Tesla K20X (Kepler GK110) GPU: 2688 CUDA cores
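A quick consistency check (not from the slide, just arithmetic on the numbers above): 18,688 nodes × (1.311 + 0.141) TF/s ≈ 27.1 PF/s peak, and 1.311 / 0.141 ≈ 9.3, the quoted GPU-to-CPU ratio.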
Design Challenge: GPU Can't Handle Dynamic Loops
• GPU = SIMD/vector execution
• Data dependency issues (RAW, WAW)
Solutions?

Static loop:
  for (i = 0; i < N; i++) {
      C[i] = A[i] + B[i];
  }

Dynamic loop:
  for (i = 0; i < N; i++) {
      A[ w[i] ] = 3 * A[ r[i] ];
  }

Non-deterministic data dependencies inhibit exploitation of the inherent parallelism; only DO-ALL loops or embarrassingly parallel workloads get admitted to GPUs.
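To make the contrast concrete, here is a minimal CUDA sketch (an illustration, not code from the slides) of how the static DO-ALL loop maps one iteration to one GPU thread; the dynamic loop has no such direct mapping because the dependences between iterations are only known once w[] and r[] are read at run time.

// Minimal sketch: the static loop C[i] = A[i] + B[i] is a DO-ALL loop,
// so each GPU thread can safely execute one iteration.
__global__ void vecAdd(const float *A, const float *B, float *C, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];          // no iteration reads another's result
}
// The dynamic loop A[w[i]] = 3 * A[r[i]] cannot be launched this way:
// whether iteration j depends on iteration i is decided by runtime values.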
Dynamic loops are common in scientific and engineering applications.
Source: Z. Shen, Z. Li, and P. Yew, "An Empirical Study on Array Subscripts and Data Dependencies".
GPU-TLS: Thread-Level Speculation on GPU
• Incremental parallelization
  o Sliding-window style execution
• Efficient dependency checking schemes
• Deferred update
  o Speculative updates are stored in the write buffer of each thread until commit time.
• Three phases of execution:
  o Phase I: speculative execution
  o Phase II: dependency checking (intra-thread RAW; valid inter-thread RAW in GPU; true inter-thread RAW)
  o Phase III: commit
GPU: lock-step execution within the same warp (32 threads per warp).
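A highly simplified CUDA sketch of the three phases for the dynamic loop A[w[i]] = 3 * A[r[i]]. The names, the single-window layout (one thread block, one iteration per thread), and the serial commit are assumptions for illustration, not the actual GPU-TLS implementation.

// One speculation window, one iteration per thread.
// violation must be zero-initialized by the host before launch.
__global__ void tls_window(float *A, const int *w, const int *r,
                           int base, int window,
                           int *readAddr, int *writeAddr, float *writeVal,
                           int *violation)
{
    int t = threadIdx.x;
    if (t >= window) return;
    int i = base + t;

    // Phase I: speculative execution with deferred update --
    // the write is buffered, not applied to A.
    readAddr[t]  = r[i];
    writeAddr[t] = w[i];
    writeVal[t]  = 3.0f * A[r[i]];
    __syncthreads();

    // Phase II: dependency checking -- a true inter-thread RAW exists if an
    // earlier iteration (s < t) in this window wrote what iteration t read.
    for (int s = 0; s < t; ++s)
        if (writeAddr[s] == readAddr[t])
            atomicExch(violation, 1);      // mark mis-speculation
    __syncthreads();

    // Phase III: commit buffered writes in iteration order if no violation
    // (the serial commit also preserves WAW order inside the window).
    if (*violation == 0 && t == 0)
        for (int s = 0; s < window; ++s)
            A[writeAddr[s]] = writeVal[s];
}
// If a violation is flagged, the window is re-executed (e.g. sequentially).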
JAPONICA: Profile-Guided Work Dispatching
[Diagram: dynamic profiling measures inter-iteration dependences (Read-After-Write, Write-After-Read, Write-After-Write) and reports a dependency density to the scheduler, which picks a target of matching parallelism:
• High dependency density → parallel execution on the multi-core CPU (8 high-speed x86 cores)
• Medium → highly parallel execution on many-core coprocessors (64 x86 cores)
• Low/None → massively parallel execution on many-core coprocessors (2880 cores)]
JAPONICA: System Architecture
[Diagram] Flow of the system:
• Sequential Java code with user annotation → JavaR code translation → static dependence analysis, which builds the Program Dependence Graph (PDG).
• No dependence → DO-ALL parallelizer → CPU multi-threads + GPU many-threads (CUDA kernels & CPU multi-threads).
• Uncertain dependences → profiler (on GPU): dependency density analysis for one loop, using intra-warp and inter-warp dependence checks.
• Profiling results: RAW → speculator (GPU-TLS); WAW/WAR → privatization; generating CUDA kernels with GPU-TLS / privatization & CPU single-thread.
• Task Scheduler: CPU-GPU co-scheduling with task sharing and task stealing. Tasks are assigned among CPU and GPU according to their dependency density (DD), as sketched below:
  o High DD: CPU single core
  o Low DD: CPU + GPU-TLS
  o DD = 0: CPU multithreads + GPU
  o CPU queue: low, high, 0; GPU queue: low, 0 (with CPU-GPU communication)
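The DD-based dispatch rule can be read as a small decision function. This host-side sketch only illustrates the rule above; the threshold value and names are assumptions, not JAPONICA's actual code.

// Host-side sketch of dependency-density (DD) based dispatching.
enum Target { CPU_SINGLE_THREAD, CPU_PLUS_GPU_TLS, CPU_MULTI_PLUS_GPU };

const double LOW_DD_THRESHOLD = 0.1;     // assumed cut-off between "low" and "high" DD

Target dispatch(double dd)
{
    if (dd == 0.0)                       // no cross-iteration dependence: DO-ALL loop
        return CPU_MULTI_PLUS_GPU;       // CPU multithreads + GPU kernel
    if (dd < LOW_DD_THRESHOLD)           // low DD: worth speculating
        return CPU_PLUS_GPU_TLS;         // GPU-TLS with CPU assistance
    return CPU_SINGLE_THREAD;            // high DD: run sequentially on one CPU core
}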
(2) Crocodiles: Cloud Runtime with Object Coherence On Dynamic tILES
"General-purpose" manycore: tile-based architecture in which cores are connected through a 2D network-on-a-chip.
鳄鱼 ("Crocodiles") @ HKU (01/2013-12/2015)
• Crocodiles: Cloud Runtime with Object Coherence On Dynamic tILES for future 1000-core tiled processors
[Diagram: a tiled manycore chip partitioned into zones (ZONE 1-4), with memory controllers and RAM, a DRAM controller, GbE and PCI-E interfaces at the chip edges.]
• Dynamic Zoning
  o Multi-tenant cloud architecture: the partition varies over time, mimicking a "data center on a chip".
  o Performance isolation
  o On-demand scaling
  o Power efficiency (high flops/watt)
Design Challenge: the "Off-chip Memory Wall" Problem
– DRAM performance (latency) has improved slowly over the past 40 years.
[Figures: (a) the gap between DRAM density and speed; (b) DRAM latency not improved]
Memory density has doubled nearly every two years, while performance has improved slowly (e.g. still 100+ core clock cycles per memory access).
Lock Contention in Multicore Systems
• Physical memory allocation performance sorted by function: as more cores are added, more processing time is spent contending for locks.
[Chart: Exim on Linux; kernel CPU time (milliseconds/message) rises with core count until performance collapses, dominated by lock contention.]
Challenges and Potential Solutions
• Cache-aware design
  o Data locality / working set getting critical!
  o Compiler or runtime techniques to improve data reuse (see the sketch below)
• Stop multitasking
  o Context switching breaks data locality
  o Time sharing → space sharing
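One classic compiler/runtime technique for improving data reuse is loop tiling (blocking). The sketch below is a generic illustration of the idea; the matrix-multiply example and tile size are assumptions, not code from the Crocodiles project.

// Tiled matrix multiply: each TILE x TILE block of B is reused many times
// while it is still resident in cache, instead of being re-fetched from DRAM.
// C is assumed to be zero-initialized; n is the matrix dimension.
#define TILE 64

void matmul_tiled(const float *A, const float *B, float *C, int n)
{
    for (int ii = 0; ii < n; ii += TILE)
        for (int kk = 0; kk < n; kk += TILE)
            for (int jj = 0; jj < n; jj += TILE)
                // work inside one tile of the iteration space
                for (int i = ii; i < ii + TILE && i < n; ++i)
                    for (int k = kk; k < kk + TILE && k < n; ++k)
                        for (int j = jj; j < jj + TILE && j < n; ++j)
                            C[i * n + j] += A[i * n + k] * B[k * n + j];
}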
马其顿方阵 ("Macedonian Phalanx") manycore operating system: next-generation operating system for 1000-core processors
Thanks!
For more information:
C.L. Wang’s webpage:
http://www.cs.hku.hk/~clwang/
http://i.cs.hku.hk/~clwang/recruit2012.htm
Multi-granularity Computation Migration
[Diagram: migration granularity (coarse → fine, i.e. size of migrated state) plotted against system scale (small → large), covering WAVNet Desktop Cloud, G-JavaMPI, JESSICA2, and SOD.]
WAVNet: Live VM Migration over WAN
• A P2P cloud with live VM migration over WAN: a "virtualized LAN" over the Internet
• High penetration via NAT hole punching
  o Establishes direct host-to-host connections
  o Free from proxies, able to traverse most NATs
Zheming Xu, Sheng Di, Weida Zhang, Luwei Cheng, and Cho-Li Wang, "WAVNet: Wide-Area Network Virtualization Technique for Virtual Private Cloud," 2011 International Conference on Parallel Processing (ICPP2011).
WAVNet: Experiments at Pacific Rim Areas
• IHEP, Beijing (北京高能物理所)
• AIST, Japan (日本产业技术综合研究所)
• SDSC, San Diego
• SIAT, Shenzhen (深圳先进院)
• Academia Sinica, Taiwan (中央研究院)
• Providence University (静宜大学)
• HKU (香港大学)
JESSICA2: Distributed Java Virtual Machine
• JESSICA = Java-Enabled Single-System-Image Computing Architecture
• A multithreaded Java program runs across a cluster of JESSICA2 JVMs (one master, several workers); threads migrate between JVMs in JIT-compiler mode using portable Java frames.
[Diagram: master JESSICA2 JVM plus worker JESSICA2 JVMs hosting the migrated threads]
History and Roadmap of the JESSICA Project
• JESSICA V1.0 (1996-1999)
  – Execution mode: interpreter mode
  – JVM kernel modification (Kaffe JVM)
  – Global heap: built on top of TreadMarks (lazy release consistency + homeless)
• JESSICA V2.0 (2000-2006)
  – Execution mode: JIT-compiler mode
  – JVM kernel modification
  – Lazy release consistency + migrating-home protocol
• JESSICA V3.0 (2008-2010)
  – Built above the JVM (via JVMTI)
  – Supports Large Object Space
• JESSICA v4 (2010~)
  – Japonica: automatic loop parallelization and speculative execution on GPU and multicore CPU
  – TrC-DC: a software transactional memory system on clusters with distributed clocks (not discussed)
J1 and J2 received a total of 1107 source code downloads.
Past members: King Tin Lam, Kinson Chan, Chenggang Zhang, Ricky Ma
Stack-on-Demand (SOD)
[Diagram: the top stack frame (program counter, local variables) of a running thread on the mobile node is shipped to a cloud node and rebuilt there along with the method area, while the remaining stack frames and heap stay behind; heap objects are (pre-)fetched from the mobile node as the migrated frame needs them.]
Elastic Execution Model via SOD
(a) "Remote method call"
(b) Mimic thread migration
(c) "Task roaming": like a mobile agent roaming over the network or a workflow
With such flexible or composable execution paths, SOD enables agile and elastic exploitation of distributed resources (storage), a Big Data solution!
Lightweight, portable, adaptable
eXCloud: Integrated Solution for Multi-granularity Migration
[Diagram: migration mechanisms at three granularities. Stack-on-demand (SOD) ships code and stack segments with a partial heap from a small-footprint JVM on a mobile client (e.g. iOS) into a cloud JVM; thread migration (JESSICA2) moves stacks and heap between multi-threaded Java processes running in JVMs on guest OSes; live migration moves whole Xen VMs, triggered by the load balancer when a host is overloaded. The cloud service provider duplicates VM instances for scaling; desktop PCs also join as resources.]
Ricky K. K. Ma, King Tin Lam, Cho-Li Wang, "eXCloud: Transparent Runtime Support for Scaling Mobile Applications," 2011 IEEE International Conference on Cloud and Service Computing (CSC2011). (Best Paper Award)