A non-blocking approach to dynamic memory management on GPU
Joy Lee @ NVIDIA
Outline
Introduction to the buddy memory system
Our parallel implementation
Performance comparison
Discussion
Fixed size memory (memory pool)
The fastest and simplest memory system.
Free list (item = address): each item of the free list records an address that is available for allocation. The free list can be implemented with a queue, stack, list, or any other data structure.
Allocate: just take one item from the free list.
Free: just return the address to the free list.
Performance: constant time for both allocation and free.
Free list (figure): 0x0000, 0x0100, 0x0200, 0x0300, ….
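As a rough illustration of the fixed size pool, here is a minimal single-threaded C++ sketch. The class name FixedPool, its methods, and the choice of a stack of byte offsets as the free list are assumptions made for this example, not the implementation behind the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical fixed size pool: the free list is a stack of byte offsets.
class FixedPool {
public:
    FixedPool(std::size_t block_size, std::size_t block_count) {
        assert(block_size > 0);
        // Push in reverse order so that offset 0 is handed out first.
        for (std::size_t i = block_count; i > 0; --i)
            free_list_.push_back((i - 1) * block_size);
    }

    // Allocate: take one item from the free list. Returns false when it is empty.
    bool allocate(std::size_t& offset) {
        if (free_list_.empty()) return false;
        offset = free_list_.back();
        free_list_.pop_back();
        return true;
    }

    // Free: return the offset to the free list.
    void release(std::size_t offset) { free_list_.push_back(offset); }

    std::size_t available() const { return free_list_.size(); }

private:
    std::vector<std::size_t> free_list_;
};
```

Both operations touch only the end of the vector, which is where the constant time claim comes from.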
Multi-lists memory
To manage memory of non-fixed size, a natural extension of fixed size memory is the multi-lists memory system.
Free list: multiple free lists of fixed size memory, each with a different block size (e.g. each list twice the size of the previous one).
Allocate: find the first free list whose block size is at least the requested size by an arithmetic operation (example: ceil(log2(size))), then take one element from that free list.
Free: find the correct free list for the block, then return the address to that free list.
Performance: constant time for both allocation and free, since the suitable free list can be found with an arithmetic operation instead of a linear search.
Drawback: wastes memory.
Free lists (figure): Size = 256, Size = 512, Size = 1024, Size = 2048, ….
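The ceil(log2(size)) lookup can be sketched as follows. The function names and the constant kMinLog2 = 8 (a smallest class of 256 bytes, read off the figure) are assumptions for illustration only.

```cpp
#include <cassert>
#include <cstddef>

// ceil(log2(size)) for size >= 1, using integer arithmetic only.
// Sizes close to the maximum of std::size_t are not handled in this sketch.
inline unsigned ceil_log2(std::size_t size) {
    assert(size >= 1);
    unsigned bits = 0;
    std::size_t capacity = 1;
    while (capacity < size) { capacity <<= 1; ++bits; }
    return bits;
}

// Index of the first free list whose block size is at least `size`.
// Assumption: list 0 holds 256-byte blocks and each list doubles the size.
inline unsigned size_class(std::size_t size) {
    const unsigned kMinLog2 = 8;
    unsigned bits = ceil_log2(size);
    return bits <= kMinLog2 ? 0 : bits - kMinLog2;
}
```

The loop is bounded by the word width; on the GPU a count-leading-zeros intrinsic such as __clz could replace it. A request of 257 bytes lands in the 512-byte list, which shows the memory waste named as the drawback.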
Buddy memory
To avoid the memory waste of multi-lists memory, it is natural to allocate memory from the layer directly above (twice the size) when a free list is empty, instead of pre-allocating memory in all free lists.
Free list: multiple free lists of fixed size memory, with sizes growing in powers of 2.
Allocate: find the first free list whose block size is at least the requested size and take one element from it. If that free list is empty, create a pair from the upper list.
Free: find the correct free list for the block (using records) and return the address to it. If the buddy is also in the free list, free the merged block to the upper list.
Performance: constant time for both allocation and free.
Free lists (figure): Size = 256, Size = 512, Size = 1024, Size = 2048, Size = 4096.
Buddy memory
Good internal defragmentation.
The buddy address can be calculated as address XOR size.
Allocation and free cost O(h), where h = log2(max size / min size). Since h is a constant, both are constant time operations.
Figure labels: this, buddy.
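The XOR rule can be written out as a small C++ sketch. It assumes that addresses are byte offsets from the start of the pool, that size is a power of two, and that every block is aligned to its own size; the function names are made up for this example.

```cpp
#include <cassert>
#include <cstddef>

// Buddy of a block, given its byte offset from the pool base and its size.
// Assumes size is a power of two and offset is a multiple of size.
inline std::size_t buddy_of(std::size_t offset, std::size_t size) {
    assert(size != 0 && (size & (size - 1)) == 0);
    assert(offset % size == 0);
    return offset ^ size;
}

// The merged block of size 2 * size starts at the lower of the two buddies.
inline std::size_t parent_of(std::size_t offset, std::size_t size) {
    return offset & ~size;
}
```

For 256-byte blocks, the blocks at 0x0200 and 0x0300 are buddies of each other, and merging them gives the 512-byte block at 0x0200.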
Memory layers
Just implement one class for a single layer; the other layers are instances with different sizes.
Lower layer: the memory layer with 1/2 the size of the current layer.
Current layer: the layer that receives the allocation request.
Upper layer: the memory layer with 2x the size of the current layer.
Free lists (figure): Size = 256 (lower layer), Size = 512 (current layer), Size = 1024 (upper layer), Size = 2048, Size = 4096, …
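The one-class-per-layer design, combined with the allocate and free rules of the buddy system, might look like the following sequential C++ sketch. The class Layer, its members, and the use of std::set as the free list are assumptions for illustration; the parallel GPU version in the talk uses a free queue instead.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Hypothetical single-threaded layer; each instance manages one block size.
class Layer {
public:
    Layer(std::size_t block_size, Layer* upper)
        : size_(block_size), upper_(upper) {}

    // Seed the layer with a free block (used here for the top layer).
    void add_block(std::size_t offset) { free_.insert(offset); }

    // Allocate one block; on an empty free list, split a block of the upper layer.
    bool allocate(std::size_t& offset) {
        if (free_.empty()) {
            std::size_t big = 0;
            if (upper_ == nullptr || !upper_->allocate(big)) return false;
            free_.insert(big);          // pair creation: lower half
            free_.insert(big + size_);  // pair creation: upper half
        }
        offset = *free_.begin();
        free_.erase(free_.begin());
        return true;
    }

    // Free one block; if its buddy is also free, merge and free to the upper layer.
    void release(std::size_t offset) {
        std::size_t buddy = offset ^ size_;
        std::set<std::size_t>::iterator it = free_.find(buddy);
        if (upper_ != nullptr && it != free_.end()) {
            free_.erase(it);
            upper_->release(offset & ~size_);
        } else {
            free_.insert(offset);
        }
    }

    std::size_t free_count() const { return free_.size(); }

private:
    std::size_t size_;
    Layer* upper_;
    std::set<std::size_t> free_;
};
```

Chaining three instances (256, 512, 1024 bytes) and seeding only the top one shows memory flowing down on allocation and back up on free. The recursion depth is bounded by the number of layers h.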
Pair creation
If the current free list is empty, it allocates memory from the upper allocator.
Since the block size of the upper layer is 2x, each upper block yields a pair of available blocks in the current free list.
If N threads simultaneously allocate memory in the current layer while its free list is empty, only N/2 threads need to allocate memory from the upper layer.
Figure: one block of memory from the upper layer is split into two blocks of memory for the current layer.
Free Queue
The free list is implemented as a queue whose head can run past its tail.
Head < Tail: memory available (allocate directly from this free list).
Head = Tail: empty free list.
Head > Tail: under-available (requires pair creation from the upper layer).
These states determine which threads call pair_creation() on the upper layer.
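The three queue states map directly onto a comparison of two counters. A minimal sketch, assuming head and tail are monotonically increasing indices (wrap-around is ignored) and using made-up names:

```cpp
#include <cassert>

enum class QueueState { Available, Empty, UnderAvailable };

// head and tail are treated as monotonically increasing counters.
inline QueueState classify(long head, long tail) {
    if (head < tail) return QueueState::Available;   // allocate directly
    if (head == tail) return QueueState::Empty;      // nothing left, nobody waiting
    return QueueState::UnderAvailable;               // requests outnumber free blocks
}

// Number of outstanding requests that must be served by pair creation.
inline long shortfall(long head, long tail) {
    return head > tail ? head - tail : 0;
}
```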
Parallel strategy (Alloc)
Each allocation requester creates a socket on which it listens for an address.
The socket is implemented on the free queue: atomicAdd(&head,1) creates a socket.
The output address can come from the current free list or from pair creation in the upper free list.
Figure: the threads with allocation requests to this layer advance Head to New Head. Slots between Head and Tail are served by available memory in the free queue; slots between Tail and New Head need pair creation from the upper layer.
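The socket idea can be imitated on the host with std::atomic, whose fetch_add returns the previous value just as atomicAdd does on the GPU. The struct FreeQueue and both function names are hypothetical, and this sketch only shows slot reservation, not how the address is later delivered to the slot.

```cpp
#include <atomic>
#include <cassert>

struct FreeQueue {
    std::atomic<long> head{0};  // next slot to hand out
    std::atomic<long> tail{0};  // one past the last filled slot
};

// Reserve one slot ("socket"). fetch_add returns the previous head,
// so every caller gets a distinct slot without any lock or retry loop.
inline long open_socket(FreeQueue& q) {
    return q.head.fetch_add(1);
}

// True if the slot already holds an address; false if it must wait for
// pair creation. tail_at_start is a snapshot of tail taken before the
// batch of requests began (an assumption of this sketch).
inline bool served_by_current_list(long slot, long tail_at_start) {
    return slot < tail_at_start;
}
```

With two free blocks and three requesters, the slots are 0, 1 and 2; the first two are served from the queue and the third has to wait for a pair from the upper layer.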
Odd/Even Pair Creation
The under-available threads perform pair creations in an odd/even loop until new tail >= new head, which avoids the overhead of simultaneous pair creation.
Figure: threads with allocation requests to this layer move Head to New Head; pair creations move Tail to New Tail.
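The slides do not spell out how the odd and even roles are assigned. One plausible reading, stated here purely as an assumption, is that a waiting slot at an even distance from the old tail requests a block from the upper layer and the following odd slot receives the second half of that pair. Under that reading, the bookkeeping is:

```cpp
#include <cassert>

// Upper-layer requests needed to serve `shortfall` waiting slots:
// each request yields two blocks, so ceil(shortfall / 2) are needed.
inline long pair_creations_needed(long shortfall) {
    assert(shortfall >= 0);
    return (shortfall + 1) / 2;
}

// Assumed role split (not confirmed by the slides): even distance from
// the old tail calls the upper layer, odd distance waits for its partner.
inline bool calls_upper_layer(long slot, long old_tail) {
    assert(slot >= old_tail);
    return (slot - old_tail) % 2 == 0;
}
```

This matches the earlier statement that only about N/2 of N simultaneous requesters have to go to the upper layer.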
Parallel strategy (Free)
Store the freed address in the free list.
Calculate the buddy address: XOR(addr, size).
Check whether the buddy is already in the free list, using the hand shake algorithm for fast lookup.
If yes, mark both elements in the free list as N/A, then free the merged memory block to the upper layer.
Hand shake
Each freed memory block records its index in the free list, and the free list records the address of the freed memory.
Fast check of whether the buddy memory address is in the free list:
1. Calculate the buddy memory address (XOR).
2. Read the index stored at this address.
3. Check whether the address stored at this index in the free list equals the buddy memory address.
Figure: the free list records the address of the memory block; the memory block records its index in the free list.
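The hand shake check can be sketched in sequential C++ as below. The struct HandshakeList, the N/A marker kNotAvailable, and the choice to store the index in the first bytes of the freed block are assumptions for this example; the real version runs concurrently on the GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

const std::size_t kNotAvailable = static_cast<std::size_t>(-1);  // "N/A" marker

struct HandshakeList {
    std::vector<unsigned char> memory;  // the managed pool
    std::vector<std::size_t> slots;     // free list: slot index -> block offset
    std::size_t block_size;

    HandshakeList(std::size_t bytes, std::size_t size)
        : memory(bytes, 0), block_size(size) {}

    // Free: the list records the offset, and the block records its slot index.
    void push_free(std::size_t offset) {
        std::size_t index = slots.size();
        slots.push_back(offset);
        std::memcpy(&memory[offset], &index, sizeof index);
    }

    // Hand shake: read the index stored in the block, then confirm that
    // this slot of the free list points back at the same block.
    bool is_free(std::size_t offset) const {
        std::size_t index = 0;
        std::memcpy(&index, &memory[offset], sizeof index);
        return index < slots.size() && slots[index] == offset;
    }

    // Buddy lookup: XOR gives the buddy, the hand shake tells if it is free.
    bool buddy_is_free(std::size_t offset) const {
        return is_free(offset ^ block_size);
    }

    // Mark a free block's slot as N/A, e.g. after merging into the upper layer.
    void invalidate(std::size_t offset) {
        if (is_free(offset)) {
            std::size_t index = 0;
            std::memcpy(&index, &memory[offset], sizeof index);
            slots[index] = kNotAvailable;
        }
    }
};
```

The two-way check is what makes stale data harmless: whatever bytes an allocated block happens to contain, the lookup only succeeds if the free list slot they name points back at that block.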
Performance
Configuration: gridDim=512, blockDim=512, K20.

Test                                                          CUDA 5.0     This       Speedup
256 bytes alloc/free, single time                             278.9 ms     10.8 ms    25.8x
256 bytes alloc                                               7155.4 ms    10.48 ms   682x
256 bytes free                                                5671.2 ms    7.27 ms    780x
Random # of bytes, alloc/free 35 times, size < lower 2 layer  5376.3 ms    65.8 ms    81.7x
Random # of bytes, alloc/free 35 times, full range            4153.8 ms    370.5 ms   11.2x
Discussion
Warp level group allocation
Dynamic expanding free queue
Backup Slides
Slow atomicCAS() loop
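// Pop the head node of a linked free list. Node is a hypothetical struct with a
// `next` pointer; head is a Node* in global memory and is assumed to be non-empty.
Node *now, *ret = head;
do {
    now = ret;  // head value we expect to replace
    ret = (Node*)atomicCAS((unsigned long long*)&head,
                           (unsigned long long)now,
                           (unsigned long long)now->next);
} while (ret != now);  // another thread changed head first: retry with the value seen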