[CSE 4152] Advanced Software Lab I, Week 6: 『GPU(CUDA) Programming』

December 3, 2013 (Tue)
안재풍 (AS907, [email protected])
CUDA Memory Architecture

Warp

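The unit of threads that execute together on a single SM
A warp is a group of 32 consecutive threads
Memory accesses and arithmetic operations are carried out simultaneously, one warp at a time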
[Figure: SMs, each with its own shared memory, and a thread block of 8x32 threads]
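
The grouping of threads into warps can be sketched with a few lines of arithmetic. This is only an illustrative model, not CUDA code: the function name `warp_and_lane` is hypothetical, and the 8x32 block from the slide is assumed to be laid out with 32 threads in x and 8 in y, so that each row of the block forms one warp.

```python
WARP_SIZE = 32  # threads per warp, as stated on the slide


def warp_and_lane(thread_x, thread_y, block_dim_x):
    """Return (warp index, lane index) of a thread inside its block."""
    # CUDA linearizes a 2D block with the x index varying fastest.
    linear_id = thread_y * block_dim_x + thread_x
    return linear_id // WARP_SIZE, linear_id % WARP_SIZE


# Assumed block shape: 32 threads in x, 8 in y (8 warps in total).
print(warp_and_lane(0, 0, 32))   # (0, 0)
print(warp_and_lane(31, 0, 32))  # (0, 31)
print(warp_and_lane(0, 1, 32))   # (1, 0)
print(warp_and_lane(5, 7, 32))   # (7, 5)
```

Under this layout, threads that share a `thread_y` value belong to the same warp and therefore issue their memory accesses together.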
CUDA Memory Architecture (Global Memory)

Global memory access pattern



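Because code executes one warp at a time, the next instruction is issued only after
the memory accesses of every thread in the warp have completed
Global memory is the most frequently used memory, and depending on the access pattern
an access can take up to 16 times longer than the best case
A single global memory access is served as a transaction on a contiguous
128-byte segment

Most efficient when the 32 threads of a warp access a contiguous memory region
Inefficient when the 32 threads of a warp access non-contiguous memory regions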
CUDA Memory Architecture (Global Memory)
[Figure: three access patterns of one warp (threads 0 ... 31) over memory addresses 0, 128, 256, ...
1 x 128-byte memory transaction at 128
2 x 128-byte memory transactions at 128, 256
16 x 128-byte memory transactions]
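
The transaction counts 1, 2 and 16 can be reproduced with a small model that counts how many 128-byte segments one warp touches. This is a sketch under stated assumptions rather than a description of the hardware: each thread is assumed to read one 4-byte word, the helper names are hypothetical, and the 64-byte stride is just one pattern that happens to yield 16 transactions (the exact pattern in the figure is not recoverable).

```python
SEGMENT_BYTES = 128  # size of one global memory transaction (from the slide)
WARP_SIZE = 32


def count_transactions(byte_addresses, access_bytes=4):
    """Count the distinct 128-byte segments touched by one warp-wide access."""
    segments = set()
    for addr in byte_addresses:
        first = addr // SEGMENT_BYTES
        last = (addr + access_bytes - 1) // SEGMENT_BYTES
        segments.update(range(first, last + 1))
    return len(segments)


def strided_warp(base, stride_bytes):
    """Addresses read by threads 0..31 when thread i reads base + i * stride."""
    return [base + i * stride_bytes for i in range(WARP_SIZE)]


aligned = count_transactions(strided_warp(128, 4))     # contiguous, inside one segment
misaligned = count_transactions(strided_warp(192, 4))  # contiguous, spans two segments
scattered = count_transactions(strided_warp(128, 64))  # assumed 64-byte stride
print(aligned, misaligned, scattered)  # 1 2 16
```

The same 128 bytes of useful data cost one transaction when the warp's addresses are contiguous and aligned, and many more once the addresses spread out.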
CUDA Memory Architecture (Global Memory)

Array of structures vs. structure of arrays

CPU : array of structures (cache hit ratio ↑)
| float x, float y, float z, float d, … | float x, float y, float z, float d, … | … | float x, float y, float z, float d, … |

GPU : structure of arrays (global memory transactions ↓)
| float x0, float x1, float x2, float x3, … | float y0, float y1, float y2, float y3, … | float z0, float z1, float z2, float z3, … | float d0, float d1, float d2, float d3, … | … |
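
Why the structure-of-arrays layout reduces global memory transactions can be checked with the same kind of address arithmetic. The sketch below assumes a warp in which thread i reads the `x` member of element i, four 4-byte float members per element as on the slide, and arrays that start at address 0. The element count `n` and all function names are hypothetical.

```python
SEGMENT_BYTES = 128
WARP_SIZE = 32
FLOAT_BYTES = 4
FIELDS = ("x", "y", "z", "d")  # the four float members shown on the slide


def segments_touched(byte_addresses):
    """Number of distinct 128-byte segments hit by 4-byte aligned accesses."""
    return len({addr // SEGMENT_BYTES for addr in byte_addresses})


def aos_address(i, field):
    # Array of structures: element i is one 16-byte record {x, y, z, d}.
    return i * len(FIELDS) * FLOAT_BYTES + FIELDS.index(field) * FLOAT_BYTES


def soa_address(i, field, n):
    # Structure of arrays: four back-to-back arrays of n floats each.
    return FIELDS.index(field) * n * FLOAT_BYTES + i * FLOAT_BYTES


n = 1024  # hypothetical element count
aos = segments_touched([aos_address(i, "x") for i in range(WARP_SIZE)])
soa = segments_touched([soa_address(i, "x", n) for i in range(WARP_SIZE)])
print(aos, soa)  # 4 1
```

In the array-of-structures case the 32 `x` values are spread over 512 bytes, so the warp needs four transactions and three quarters of the fetched data goes unused. In the structure-of-arrays case the same values sit in one 128-byte segment.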
Shared Memory
Shared memory is on-chip memory, so it is markedly faster to access than
global memory

Shared memory is organized differently from global memory, and when no bank
conflicts occur it performs about as well as registers

Shared Memory : Usage
Shared Memory : Bank Conflict
Shared memory has 16 banks (32 banks on devices of compute capability 2.x)

bank           | 0     | 1     | … | 15    | 0     | 1     | … | 15    | 0     | …
Shared memory  | 4byte | 4byte | … | 4byte | 4byte | 4byte | … | 4byte | 4byte | …

Shared memory is divided into equally-sized memory modules,
called banks, which can be accessed simultaneously

If two addresses of a memory request fall in the same memory bank,
there is a bank conflict and the access has to be serialized

Shared memory features a broadcast mechanism whereby a 32-bit
word can be read and broadcast to several threads simultaneously
when servicing one memory read request
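
The bank rules above can be turned into a short calculation. The following is a simplified model, not a statement of how the hardware is implemented: successive 4-byte words are assumed to map to successive banks, threads reading the same word are treated as one broadcast, and a request is assumed to cover 16 threads in the 16-bank case. The names `bank_of` and `conflict_degree` are hypothetical.

```python
from collections import defaultdict

WORD_BYTES = 4  # each bank entry is one 4-byte word (from the slide)


def bank_of(byte_address, num_banks=16):
    """Bank that holds the 4-byte word containing byte_address."""
    return (byte_address // WORD_BYTES) % num_banks


def conflict_degree(byte_addresses, num_banks=16):
    """Largest number of distinct words that one request needs from a single bank.

    1 means conflict free; k means a k-way bank conflict (k serialized accesses).
    """
    words_per_bank = defaultdict(set)
    for addr in byte_addresses:
        word = addr // WORD_BYTES
        words_per_bank[word % num_banks].add(word)
    return max(len(words) for words in words_per_bank.values())


threads = range(16)  # assumed request size for the 16-bank case
print(conflict_degree([4 * t for t in threads]))      # 1  (consecutive words)
print(conflict_degree([4 * 4 * t for t in threads]))  # 4  (stride of 4 words)
print(conflict_degree([0 for _ in threads]))          # 1  (same word, broadcast)
```

A stride of 4 words sends the 16 threads to only 4 banks, each of which must deliver 4 different words, which is why the accesses are serialized into 4 steps.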
Shared Memory : Bank Conflict
Some examples

[Figure: two panels mapping threads onto shared memory banks 0, 1, 2, 3, … (4-byte words)
Panel 1: 4-way bank conflict
Panel 2: conflict free]
Shared Memory : Bank Conflict

For devices of compute capability 2.x, multiple words can be
broadcast in a single transaction (for devices of compute
capability 1.x, only a single word can be broadcast in a single
transaction)
[Figure: threads reading 4-byte words in shared memory banks 0, 1, 2, 3, …]
Compute capability 1.x : 4 way bank conflict
Compute capability 2.x : no bank conflict
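
The difference between the two broadcast rules can be illustrated with a rough step-counting model. Everything here is an assumption made for illustration: the access pattern (groups of threads reading the same word, with the words in different banks) is hypothetical because the figure is not recoverable, the function names are invented, and the choice of which word gets broadcast in the single-broadcast model is arbitrary.

```python
from collections import Counter


def steps_multi_broadcast(words, num_banks):
    """Model for 2.x: any number of words can be broadcast in one transaction,
    so only distinct words that share a bank are serialized."""
    distinct_per_bank = Counter(w % num_banks for w in set(words))
    return max(distinct_per_bank.values())


def steps_single_broadcast(words, num_banks):
    """Model for 1.x: each step broadcasts one word and otherwise serves
    at most one thread for every other bank."""
    remaining = list(words)
    steps = 0
    while remaining:
        steps += 1
        # Assumption: broadcast the word with the most pending readers.
        broadcast = Counter(remaining).most_common(1)[0][0]
        served_banks = {broadcast % num_banks}
        pending = []
        for w in remaining:
            if w == broadcast:
                continue  # served by the broadcast
            bank = w % num_banks
            if bank not in served_banks:
                served_banks.add(bank)  # first thread on this bank is served
            else:
                pending.append(w)       # must wait for a later step
        remaining = pending
    return steps


# Hypothetical pattern: word indices, four threads per word, words in banks 0..3.
half_warp = [t // 4 for t in range(16)]
print(steps_single_broadcast(half_warp, 16))  # 4
print(steps_multi_broadcast(half_warp, 32))   # 1
```

Under these assumptions the single-broadcast model needs 4 steps for the pattern, while the multi-broadcast model finishes in 1, which is consistent with the "4 way bank conflict" and "no bank conflict" labels above.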