[OLCF's CUDA series](https://vimeo.com/showcase/6729038)
01. CUDA C Basics
- Host: The CPU and its memory
- Device: The GPU and its memory
Simple Processing Flow
- Copy memory (from CPU to GPU)
- Load GPU program and execute
- Copy memory (from GPU to CPU), then free
Problem::vector addition
- 1:1 (input:output)
Concepts
__global__ void mykernel(void) {}
mykernel<<<N,1>>>(); // Grid (N blocks), Block(1 thread)
- __global__ marks kernel code (runs on the device)
- <<<GRID, Block>>>, which means:
  - GRID: # of blocks per grid
  - Block: # of threads per block
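The add kernel launched in the walkthrough below is not defined in these notes; a minimal sketch, assuming int-pointer parameters, could look like this (with <<<N,1>>> each block has a single thread, so blockIdx.x selects the element):

```c
__global__ void add(int *a, int *b, int *c) {
    // One thread per block: the block index picks the element to process.
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}
```

With a <<<1,N>>> launch (one block of N threads), threadIdx.x would be used as the index instead.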
// 1-1. prepare GPU (device) global memory
cudaMalloc((void **)&d_a, size);
// 1-2. copy input to the device (d_a on the GPU from a on the host)
cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
// 2. load GPU program and execute
add<<<N,1>>>(d_a, d_b, d_c);
// 3. copy the result back (GPU -> CPU)
cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
// 4. free host and device memory
free(a); cudaFree(d_a);
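Putting the steps together, a minimal end-to-end sketch of the vector-addition program (the value of N, the host-side initialization, and the allocations and copies for b and c are assumptions added here for completeness):

```c
#include <stdlib.h>

#define N 512

__global__ void add(int *a, int *b, int *c) {
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

int main(void) {
    int *a, *b, *c;        // host copies
    int *d_a, *d_b, *d_c;  // device copies
    int size = N * sizeof(int);

    // 1-1. prepare GPU global memory
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // host allocation and (assumed) initialization
    a = (int *)malloc(size);
    b = (int *)malloc(size);
    c = (int *)malloc(size);
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

    // 1-2. copy inputs to the device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // 2. load GPU program and execute: N blocks, 1 thread per block
    add<<<N, 1>>>(d_a, d_b, d_c);

    // 3. copy the result back (GPU -> CPU)
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    // 4. free host and device memory
    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```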
02. Shared Memory
Problem::1D Stencil
- It is not a 1:1 (input:output) problem.
- e.g., with a radius of 3, each input element is read seven times (once for each output it contributes to)
Concept::Shared Memory
- On-chip memory (much faster than global memory)
- Per block (not visible to other blocks)
- User-managed memory
__shared__ int s[64];
...
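A sketch of how the 1D stencil kernel typically uses shared memory (the RADIUS and BLOCK_SIZE values, the kernel name stencil_1d, and the halo handling are assumptions; __syncthreads() makes every thread wait until the shared array is fully loaded before reading it):

```c
#define RADIUS 3
#define BLOCK_SIZE 64

// Assumes `in` points RADIUS elements into a buffer padded with a halo on
// both ends, so the halo loads below stay inside the allocation.
// Example launch: stencil_1d<<<N / BLOCK_SIZE, BLOCK_SIZE>>>(d_in + RADIUS, d_out);
__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Each thread loads one element; the first RADIUS threads also load the halo.
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }

    // Wait until every thread in the block has finished loading shared memory.
    __syncthreads();

    // Each output reads its 2*RADIUS + 1 neighbors from fast on-chip memory.
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];
    out[gindex] = result;
}
```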
Starting from Volta (2017 and later), shared memory (__shared__, software-managed) and the L1 cache (hardware-managed) share the same on-chip SRAM resources. Developers can configure how much of this SRAM is allocated to shared memory versus L1 cache, depending on the application's needs.
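For example, the CUDA runtime exposes a per-kernel carveout hint for this split (the stencil_1d name and the 75% value below are only illustrative):

```c
// Ask for roughly 75% of the combined shared memory / L1 SRAM as shared memory.
// The named values cudaSharedmemCarveoutMaxShared / cudaSharedmemCarveoutMaxL1 also work.
cudaFuncSetAttribute(stencil_1d,
                     cudaFuncAttributePreferredSharedMemoryCarveout, 75);
```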