[OLCF's CUDA series](https://vimeo.com/showcase/6729038)
01. CUDA C Basics
- Host: The CPU and its memory
- Device: The GPU and its memory
Simple Processing Flow
- Copy memory (from CPU to GPU)
- Load GPU program and execute
- Copy memory (from GPU to CPU), then free
Problem::vector addition
- 1:1 (input:output)
Concepts
__global__ void mykernel(void) {}
mykernel<<<N,1>>>(); // Grid (N blocks), Block(1 thread)
- __global__ marks kernel code (runs on the device)
- <<<GRID, Block>>>, which means:
  - GRID: # of blocks per grid
  - Block: # of threads per block
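The add kernel launched in the walkthrough below is not defined in these notes; a minimal sketch, assuming int-pointer parameters, could look like this (with <<<N,1>>> each block has a single thread, so blockIdx.x selects the element):

```c
__global__ void add(int *a, int *b, int *c) {
    // One thread per block: the block index picks the element to process.
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}
```

With a <<<1,N>>> launch (one block of N threads), threadIdx.x would be used as the index instead.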
// 1-1. prepare GPU (device) global memory
cudaMalloc((void **)&d_a, size);
// 1-2. copy input to the device (d_a on the GPU from a on the host)
cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
// 2. load GPU program and execute
add<<<N,1>>>(d_a, d_b, d_c);
// 3. copy the result back (GPU -> CPU)
cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
// 4. free host and device memory
free(a); cudaFree(d_a);
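Putting the steps together, a minimal end-to-end sketch of the vector-addition program (the value of N, the host-side initialization, and the allocations and copies for b and c are assumptions added here for completeness):

```c
#include <stdlib.h>

#define N 512

__global__ void add(int *a, int *b, int *c) {
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

int main(void) {
    int *a, *b, *c;        // host copies
    int *d_a, *d_b, *d_c;  // device copies
    int size = N * sizeof(int);

    // 1-1. prepare GPU global memory
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // host allocation and (assumed) initialization
    a = (int *)malloc(size);
    b = (int *)malloc(size);
    c = (int *)malloc(size);
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

    // 1-2. copy inputs to the device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // 2. load GPU program and execute: N blocks, 1 thread per block
    add<<<N, 1>>>(d_a, d_b, d_c);

    // 3. copy the result back (GPU -> CPU)
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    // 4. free host and device memory
    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```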
02. Shared Memory
Problem::1D Stencil
- It is not a 1:1 (input:output) problem.
- e.g., with a radius of 3, each input element is read seven times (once for each output it contributes to)
Concept::Shared Memory
- On-chip memory (much faster than global memory)
- Per block (not visible to other blocks)
- User-managed memory
__shared__ int s[64];
...
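A sketch of how the 1D stencil kernel typically uses shared memory (the RADIUS and BLOCK_SIZE values, the kernel name stencil_1d, and the halo handling are assumptions; __syncthreads() makes every thread wait until the shared array is fully loaded before reading it):

```c
#define RADIUS 3
#define BLOCK_SIZE 64

// Assumes `in` points RADIUS elements into a buffer padded with a halo on
// both ends, so the halo loads below stay inside the allocation.
// Example launch: stencil_1d<<<N / BLOCK_SIZE, BLOCK_SIZE>>>(d_in + RADIUS, d_out);
__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Each thread loads one element; the first RADIUS threads also load the halo.
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }

    // Wait until every thread in the block has finished loading shared memory.
    __syncthreads();

    // Each output reads its 2*RADIUS + 1 neighbors from fast on-chip memory.
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];
    out[gindex] = result;
}
```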
Starting from Volta (2017 and later), shared memory (__shared__, software-managed) and the L1 cache (hardware-managed) share the same on-chip SRAM resources. Developers can configure how much of this SRAM is allocated to shared memory versus L1 cache, depending on the application's needs.
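For example, the CUDA runtime exposes a per-kernel carveout hint for this split (the stencil_1d name and the 75% value below are only illustrative):

```c
// Ask for roughly 75% of the combined shared memory / L1 SRAM as shared memory.
// The named values cudaSharedmemCarveoutMaxShared / cudaSharedmemCarveoutMaxL1 also work.
cudaFuncSetAttribute(stencil_1d,
                     cudaFuncAttributePreferredSharedMemoryCarveout, 75);
```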