Transcript GM_0930_ChihSheng.ppt
Parallel, Communication, and the Future
Chih-Sheng Lin (Patrick)
• Parallel Programming on Many-core Processors
• Software-Defined Radio for GPUs
• Future Work
• (365-18) days in USA
Parallel Programming on Many-core Processors
Modern Video Games Demand Copious Processing Power
• Millions of pixels
• Thousands of calculations per pixel
• One hundred frames per second
• 100s of GFLOPS are needed
Why Massively Parallel Computing?
• A quiet revolution and potential build-up
  • Calculation: TFLOPS vs. 100 GFLOPS
  • Memory bandwidth: ~10x
• GPUs are in almost every PC
Multi-core vs Many-core
• "A multi-core processor is a single computing component with two or more independent actual processors (called 'cores')"
• "The many-core threshold is roughly in the range of several tens of cores"
  - Wikipedia
Comparison
GPU Programming Languages
• Compute Unified Device Architecture (CUDA)
• OpenCL
• DirectCompute
CUDA
• Fortran, C/C++
• Integrated (host + device) programs
  • Serial or modestly parallel parts in host code
  • Highly parallel parts in device SPMD kernel code
CUDA Devices and Threads
• A compute device
  • is a coprocessor to the CPU (the host)
  • has its own memory (device memory)
  • runs many threads in parallel
  • is typically a GPU, but can also be another type of parallel processing device
• Data-parallel portions of an application are expressed as kernels, which run on many threads
• Differences between CPU and GPU threads
  • GPU threads are extremely lightweight: very little creation overhead
  • A GPU needs 1000s of threads for full efficiency; a multi-core CPU needs only a few
Programming Model
Example 1: Vector Addition

void VectorAdd(float* VecA, float* VecB, float* VecR)
{
    int size = Width * sizeof(float);
    float *dVecA, *dVecB, *dVecR;

    //1. Allocate vectors in device memory
    cudaMalloc(&dVecA, size);
    cudaMalloc(&dVecB, size);
    cudaMalloc(&dVecR, size);

    //2. Load vectors to device memory
    cudaMemcpy(dVecA, VecA, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dVecB, VecB, size, cudaMemcpyHostToDevice);
    cudaMemset(dVecR, 0, size);

Example 1: Vector Addition (Cont.)

    //3. Kernel invocation code
    …

    //4. Read result vector from the device
    cudaMemcpy(VecR, dVecR, size, cudaMemcpyDeviceToHost);

    //5. Free device vectors
    cudaFree(dVecA);
    cudaFree(dVecB);
    cudaFree(dVecR);
}
Wait, where is the kernel?
• Kernel: a sequence of instructions per thread
  • All threads run the same code, except on divergence
  • Each thread has an ID that it uses to compute memory addresses and make control decisions
Threads, Blocks, and Grids
• Threads: 3D IDs, unique within a block
• Blocks: 3D IDs, unique within a grid
• Built-in variables: threadIdx.x/.y/.z, blockDim.x/.y/.z, blockIdx.x/.y/.z, gridDim.x/.y/.z
Warp
• A warp
  • consists of 32 threads
  • is the minimal execution unit for a streaming multiprocessor (SM)
Indexization

Global ID = blockDim.x * blockIdx.x + threadIdx.x
Example 1: Vector Addition (Cont.)

__global__ void VectorAddKernel(float* dVecA, float* dVecB, float* dVecR)
{
    //indexization
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    dVecR[tid] = dVecA[tid] + dVecB[tid];
}

(Host-side code)

void VectorAdd(float* VecA, float* VecB, float* VecR)
{
    …
    VectorAddKernel<<<…>>>(dVecA, dVecB, dVecR);
    …
}
GPGPU Architecture
CUDA Memory Overview
• Access latency
  • Register: ~zero
  • Constant/Texture memory: ~zero
  • Shared memory: 4~6 cycles
  • Global/Local memory: 400~600 cycles
Compiling CUDA Program
PTX: Parallel Thread eXecution
Example 2: Matrix Multiplication

…
for (int k = 0; k < l; ++k)
{
    float Aelement = A[ty * l + k];
    float Belement = B[k * n + tx];
    Pvalue += Aelement * Belement;
}
//Write the matrix to device memory
//Each thread writes one element
C[ty * m + tx] += Pvalue;
Tiled Multiply
Tiled Multiply Phase 2
(figure: the second phase loads the next tiles of Md and Nd to update the 4×4 block of Pd elements)
Tiled Multiply – Data Reuse
Tiled Multiply – Large Matrices
• All tiles are loaded from global memory into shared memory
Tiled Width – G80
• Check the compute capability of the device
  • What is the maximum number of active blocks per multiprocessor?
    • 8 in compute capability 1.0 (G80)
  • What is the maximum number of threads per block?
    • 512 in compute capability 1.0
  • What is the maximum number of active threads per multiprocessor?
    • 768 in compute capability 1.0
Tiled Multiply - Kernel
Limitations
• No support for recursive functions
• No support for static variables
• No support for variable-length arguments
• No support for host-only functions
  • e.g., printf, fopen, etc.
Optimization Patterns
• Tiling/Data reuse
• Thread coarsening
• Data layout transformation
• Scatter to gather
• Binning
• Privatization
• Compaction
• Regularization
A Tiling-Scheme Viterbi Decoder in Software-Defined Radio for GPUs
Software-Defined Radio
• In 1991, Joseph Mitola coined the term SDR for realizing wireless standards on a digital programmable platform
• The advantages of SDR
  • Reduced development costs
  • Driven by user demands
  • Shorter time-to-market
SDR Platform
• The requirements of SDR
  • High system throughput
  • Intensive computing
  • Real-time enforcement
• Candidates for an SDR platform
  • CPUs
  • FPGAs
  • DSPs
  • GPUs
Related Work on SDR for GPUs
• Speed-up of link simulators for SDR systems
• Multiple-input multiple-output (MIMO) detectors for multi-antenna SDR systems
• Implementations of SDR systems
  • Encoders/Decoders
Channel Coding
Example for Viterbi Decoding
(figure: trellis diagram; each edge labeled "state transition pattern / decoding bit", with accumulated path metrics at the nodes)
Code Word: 00 10 10 01
Decoding Sequence: 0 1 1 0
Compute Unified Device Architecture (CUDA)
The Previous Method on GPUs [1]

[1] J. Kim, S. Hyeon, and S. Choi. "Implementation of an SDR System Using Graphics Processing Unit." IEEE Communications Magazine, 48(3):156–162, 2010.
Tiled Viterbi Decoding Algorithm (TVDA)
Running Example: Stage 1
• Viterbi decoding in chunks
Running Example: Stage 2
• Merge chunks
Experimental Environment
• Hardware
  • CPU: Intel Core 2 Quad [email protected]
  • GPU: nVIDIA GeForce 8800 GTX 128 [email protected]
• OS
  • Linux Ubuntu 11.04
• Testing scenario
  • Wireless standard: GSM
  • Block size = 456 bits
BER Performance
Performance Estimation
• Add operations of TVDA = 2 × (T/N) + S × (N-1)
• Comparison operations of TVDA = T/N + (S-1) × (N-1)
  • T: number of trellises
  • N: number of chunks
  • S: number of states per trellis
Performance Result
Conclusion
• A tiled Viterbi decoding algorithm was proposed.
• The proposed algorithm eliminates partial data dependencies and runs 2.5x/14.6x faster than the previous GPU/CPU implementations.
FEC Block – Why is it not as fast as we expect?
(figure: ideal 4x speedup vs. actual 2.5x)
Performance Modeling and Tuning for Heterogeneous Systems
Factors for Performance
• Application/Algorithm level
  1. Heuristics (divide-and-conquer, geometric decomposition, …)
  2. Optimization patterns (tiling/data reuse, scatter to gather, …)
  …
• Compiler/OS level
  1. Scheduling
  2. Code optimization
  …
• Platform/Architecture level
  1. DMA transfer size
  2. Clock rate, number of cores
  3. Memory bandwidth
  …
Current Methods
• Manual derivation by designers
• Statistical methods
  • Factor analysis
  • Principal component analysis
  • Linear regression
• Emulation/Simulation
(365-18) days in USA
Reflections After Returning Home
• Research spirit and attitude
• Promoting yourself
• Teamwork and communication
• How to treat and deal with people
• English is extremely important
Thank you~