Transcript GM_0930_ChihSheng.ppt
Parallel, Communication, and the Future
Chih-Sheng Lin (Patrick)
• Parallel Programming on Many-core Processors
• Software-Defined Radio for GPUs
• Future Work
• (365-18) days in USA
Parallel Programming on Many-core Processors
Modern Video Games Demand Copious Processing Power
• Millions of pixels
• Thousands of calculations per pixel
• One hundred frames per second
• 100s of GFLOPS are needed
Why Massively Parallel Computing?
• A quiet revolution and potential build-up
  • Calculation: TFLOPS vs. 100 GFLOPS
  • Memory bandwidth: ~10x
• GPUs are in almost every PC
Multi-core vs Many-core
• "A multi-core processor is a single computing component with two or more independent actual processors (called 'cores')"
• "The many-core threshold is roughly in the range of several tens of cores"
  - Wikipedia
Comparison
GPU Programming Languages
• Compute Unified Device Architecture (CUDA)
• OpenCL
• DirectCompute
CUDA
• Fortran, C/C++
• Integrated (host + device) programs
  • Serial or modestly parallel parts in host code
  • Highly parallel parts in device SPMD kernel code
CUDA Devices and Threads
• A compute device
  • is a coprocessor to the CPU (the host)
  • has its own memory (device memory)
  • runs many threads in parallel
  • is typically a GPU, but can also be another type of parallel processing device
• Data-parallel portions of an application are expressed as kernels, which run on many threads
• Differences between CPU and GPU threads
  • GPU threads are extremely lightweight: very little creation overhead
  • A GPU needs 1000s of threads for full efficiency; a multi-core CPU needs only a few
Programming Model
Example 1: Vector Addition

void VectorAdd(float* VecA, float* VecB, float* VecR)
{
    int size = Width * sizeof(float);
    float *dVecA, *dVecB, *dVecR;

    //1. Allocate vectors in device memory
    cudaMalloc(&dVecA, size);
    cudaMalloc(&dVecB, size);
    cudaMalloc(&dVecR, size);

    //2. Load vectors to device memory
    cudaMemcpy(dVecA, VecA, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dVecB, VecB, size, cudaMemcpyHostToDevice);
    cudaMemset(dVecR, 0, size);

Example 1: Vector Addition (Cont.)

    //3. Kernel invocation code
    …

    //4. Read result vector from the device
    cudaMemcpy(VecR, dVecR, size, cudaMemcpyDeviceToHost);

    //5. Free device vectors
    cudaFree(dVecA);
    cudaFree(dVecB);
    cudaFree(dVecR);
}
Wait, where is the kernel?
• Kernel: a sequence of instructions per thread
  • All threads run the same code, except on divergence
  • Each thread has an ID that it uses to compute memory addresses and make control decisions
Threads, Blocks, and Grids
• Threads: 3D IDs, unique within a block
• Blocks: 3D IDs, unique within a grid
• Built-in variables: threadIdx.x/.y/.z, blockDim.x/.y/.z, blockIdx.x/.y/.z, gridDim.x/.y/.z
Warp
• A warp
  • consists of 32 threads
  • is the minimal execution unit for a streaming multiprocessor (SM)
Indexization

Global ID = blockDim.x * blockIdx.x + threadIdx.x
Example 1: Vector Addition (Cont.)

__global__ void VectorAddKernel(float* dVecA, float* dVecB, float* dVecR)
{
    //indexization
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    dVecR[tid] = dVecA[tid] + dVecB[tid];
}

(Host-side code)

void VectorAdd(float* VecA, float* VecB, float* VecR)
{
    …
    VectorAddKernel<<<…>>>(dVecA, dVecB, dVecR);
    …
}
GPGPU Architecture
CUDA Memory Overview
• Access latency
  • Register: ~zero
  • Constant/Texture memory: ~zero
  • Shared memory: 4~6 cycles
  • Global/Local memory: 400~600 cycles
Compiling CUDA Program
PTX: Parallel Thread eXecution
Example 2: Matrix Multiplication

…
for (int k = 0; k < l; ++k)
{
    float Aelement = A[ty * l + k];
    float Belement = B[k * n + tx];
    Pvalue += Aelement * Belement;
}
//Write the matrix to device memory
//Each thread writes one element
C[ty * m + tx] += Pvalue;
Tiled Multiply
Tiled Multiply Phase 2
(figure: the second phase loads the next tiles of Md and Nd to update the 4×4 block of Pd elements)
Tiled Multiply – Data Reuse
Tiled Multiply – Large Matrices
• All tiles are loaded from global memory into shared memory
Tiled Width – G80
• Check the compute capability of the device
  • What is the maximum number of active blocks per multiprocessor?
    • 8 in compute capability 1.0 (G80)
  • What is the maximum number of threads per block?
    • 512 in compute capability 1.0
  • What is the maximum number of active threads per multiprocessor?
    • 768 in compute capability 1.0
Tiled Multiply - Kernel
Limitations
• No support for recursive functions
• No support for static variables
• No support for variable-length arguments
• No support for host-only functions
  • e.g., printf, fopen, etc.
Optimization Patterns
• Tiling/Data reuse
• Thread coarsening
• Data layout transformation
• Scatter to gather
• Binning
• Privatization
• Compaction
• Regularization
A Tiling-Scheme Viterbi Decoder in Software-Defined Radio for GPUs
Software-Defined Radio
• In 1991, Joseph Mitola coined the term SDR for realizing wireless standards on a digital programmable platform
• The advantages of SDR
  • Reduced development costs
  • Driven by user demands
  • Shorter time-to-market
SDR Platform
• The requirements of SDR
  • High system throughput
  • Intensive computing
  • Real-time enforcement
• Candidates for an SDR platform
  • CPUs
  • FPGAs
  • DSPs
  • GPUs
Related Work on SDR for GPUs
• Speed-up of link simulators for SDR systems
• Multiple-input multiple-output (MIMO) detectors for multi-antenna SDR systems
• Implementations of SDR systems
  • Encoders/Decoders
Channel Coding
Example for Viterbi Decoding
(figure: trellis diagram; each edge labeled "state transition pattern / decoding bit", with accumulated path metrics at the nodes)
Code Word: 00 10 10 01
Decoding Sequence: 0 1 1 0
Compute Unified Device Architecture (CUDA)
The Previous Method on GPUs [1]

[1] J. Kim, S. Hyeon, and S. Choi. "Implementation of an SDR System Using Graphics Processing Unit." IEEE Communications Magazine, 48(3):156–162, 2010.
Tiled Viterbi Decoding Algorithm (TVDA)
Running Example: Stage 1
• Viterbi decoding in chunks
Running Example: Stage 2
• Merge chunks
Experimental Environment
• Hardware
  • CPU: Intel Core 2 Quad [email protected]
  • GPU: nVIDIA GeForce 8800 GTX 128 [email protected]
• OS
  • Linux Ubuntu 11.04
• Testing scenario
  • Wireless standard: GSM
  • Block size = 456 bits
BER Performance
Performance Estimation
• Add operations of TVDA = 2 × (T/N) + S × (N-1)
• Comparison operations of TVDA = T/N + (S-1) × (N-1)
  • T: number of trellises
  • N: number of chunks
  • S: number of states per trellis
Performance Result
Conclusion
• A tiled Viterbi decoding algorithm was proposed.
• The proposed algorithm eliminates partial data dependencies and runs 2.5x/14.6x faster than the previous GPU/CPU implementations.
FEC Block – Why is it not as fast as we expect?
(figure: ideal 4x speedup vs. actual 2.5x)
Performance Modeling and Tuning for Heterogeneous Systems
Factors for Performance
• Application/Algorithm level
  1. Heuristics (divide-and-conquer, geometric decomposition, …)
  2. Optimization patterns (tiling/data reuse, scatter to gather, …)
  …
• Compiler/OS level
  1. Scheduling
  2. Code optimization
  …
• Platform/Architecture level
  1. DMA transfer size
  2. Clock rate, number of cores
  3. Memory bandwidth
  …
Current Methods
• Manual derivation by designers
• Statistical methods
  • Factor analysis
  • Principal component analysis
  • Linear regression
• Emulation/Simulation
(365-18) days in USA
Reflections After Returning Home
• Research spirit and attitude
• Promoting yourself
• Teamwork and communication
• How to treat and deal with people
• English is extremely important
Thank you~