There's a short backstory to this thread, and it starts with a phone call from my undergrad uni in September '25. It was the Head of the Department of IT on the other end, asking if I'd be willing to take a technical knowledge-sharing session for the students.
I was obviously in; I just needed an excuse to go back to the campus. It's been two years since I graduated and I dearly miss the place. Having said that, the topic of discussion was open-ended, so they left it to me to decide. I thought about many things - AI, blockchain, data science - but stumbled upon one thing I had already been exploring lately, and I was almost certain it was something most undergrad engineering students wouldn't have come across yet.
Such is the world of GPU programming. GPUs were the curiosity that got me into gaming as a kid, long before I began exploring the compute prowess they bring to a machine. At the time, my understanding was as simple as "big GPU, smooth game".
Thanks to this gift of the cloud, access to GPUs is much easier than back in the day. Looking at today's market, accessing a cloud GPU is much easier than getting your hands on an actual one even if you had the money XD
Finding a GPU Instance on EC2
In just a few clicks, you can find yourself in the shell of a machine that has a powerful Nvidia GPU.
You might need a service quota approval to launch the g4dn.xlarge type of EC2 instance; it's fairly straightforward to request from the Service Quotas dashboard.
Once that is done, you can spin up an Ubuntu instance.
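If you prefer the CLI to the console, launching the instance looks roughly like this. This is just a sketch - the AMI ID, key pair and security group below are placeholders you'd substitute with your own:

# Launch a g4dn.xlarge (ships with a single Nvidia T4) running Ubuntu
# ami-xxxxxxxx, my-key and sg-xxxxxxxx are placeholders
aws ec2 run-instances \
  --image-id ami-xxxxxxxx \
  --instance-type g4dn.xlarge \
  --key-name my-key \
  --security-group-ids sg-xxxxxxxx \
  --count 1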
Installing CUDA
Once you're in a shell on your EC2 instance, you can install the latest CUDA Toolkit by following the official download page. The commands below use the deb (local) installer for Ubuntu 24.04 and look something like this.
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-ubuntu2404.pin
sudo mv cuda-ubuntu2404.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/13.1.0/local_installers/cuda-repo-ubuntu2404-13-1-local_13.1.0-590.44.01-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2404-13-1-local_13.1.0-590.44.01-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2404-13-1-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda-toolkit-13-1
Per the instructions on the same download page, you may need to install the Nvidia drivers separately.
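In my case the driver came from the same apt repository; assuming the repo set up above, the meta-package below should pull it in (double-check the download page for the exact package it currently recommends). A reboot afterwards helps the driver get picked up.

# Installs the Nvidia driver from the CUDA apt repository
sudo apt-get install -y cuda-drivers
sudo reboot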
The entire thing is just a matter of running the commands one after another. Running nvidia-smi on the terminal and seeing its nice little table is a sign that the installation completed perfectly.
Troubleshooting
If nvcc --version doesn't work, check that the toolkit is actually present under /usr/local/cuda and add its bin directory to PATH in your bashrc.
echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
Basics of CUDA
It took me an hour-long C++ refresher to get comfortable working with CUDA files. You can get started by cloning my GitHub repository; it has two basic CUDA programs to work with.
git clone https://github.com/sreekeshiyer/hello-cuda.git
The most fundamental idea of working with something like CUDA is that you want to parallelise your computation to get as many tasks done simultaneously as possible. In class, I talked in depth about how CPUs and GPUs differ, and why a GPU completes far more of these tasks in less time purely because of its hardware: a CPU has a handful of powerful cores optimised for sequential work, while a GPU has thousands of simpler cores built to run the same operation across huge amounts of data at once.
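To make that contrast concrete before we touch any CUDA, here's the kind of single-threaded C++ loop we're trying to escape - one core visiting every element in order (vectorAddCPU is just my name for it here):

// Plain CPU version: one thread walks through all N elements sequentially
void vectorAddCPU(const float *A, const float *B, float *C, int N) {
    for (int i = 0; i < N; i++) {
        C[i] = A[i] + B[i];
    }
}

On the GPU, that loop disappears entirely; every iteration becomes its own thread.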
Let's walk through the most basic vector addition program to understand how to work with CUDA.
Vector addition walkthrough
When you start reading through vector_add_simple.cu in the basics directory, you'll realise how similar CUDA is to C++. There are two key differences, however.
#define BLOCK_SIZE 256
// ---------------- Define GPU Kernel ----------------
__global__ void vectorAdd(const float *A, const float *B, float *C, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x; // Compute global thread index
    if (i < N) {
        C[i] = A[i] + B[i];
    }
}
// 5. Launch kernel on GPU
cout << "[5] Launching kernel...\n";
int blocksPerGrid = (N + BLOCK_SIZE - 1) / BLOCK_SIZE;
cout << " → Grid size: " << blocksPerGrid << " blocks\n";
cout << " → Block size: " << BLOCK_SIZE << " threads\n";
vectorAdd<<<blocksPerGrid, BLOCK_SIZE>>>(d_A, d_B, d_C, N);
cout << " ✅ Kernel execution complete.\n";
I want you to closely observe these two code blocks. The first one is where we define the kernel. The operation, as one would expect, is rather simple - you are adding two one-dimensional vectors. The only difference is that we distribute one addition to each thread, which is why we compute the global thread index first.
After that, we launch this function across blocks of threads, computing the number of blocks in the grid up front: blocksPerGrid = (N + BLOCK_SIZE - 1) / BLOCK_SIZE is just a ceiling division, so for N = 1,000,000 and a block size of 256 it comes to 3,907 blocks. There's a nice article from Nvidia that covers these terminologies in depth; you might want to take a look. I spent about 5-10 minutes on the whiteboard in the class trying to explain the concept of threads, blocks and grids.
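One thing the two snippets above skip over is where d_A, d_B and d_C come from. Here's a minimal sketch of the host-side plumbing around the launch - the host array names (h_A, h_B, h_C) are mine, so the actual file in the repo may differ slightly:

size_t size = N * sizeof(float);

// Allocate memory on the device
float *d_A, *d_B, *d_C;
cudaMalloc(&d_A, size);
cudaMalloc(&d_B, size);
cudaMalloc(&d_C, size);

// Copy the input vectors from host to device
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

// Launch the kernel (as shown above)
vectorAdd<<<blocksPerGrid, BLOCK_SIZE>>>(d_A, d_B, d_C, N);

// Wait for the GPU to finish, then copy the result back
cudaDeviceSynchronize();
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

// Free device memory
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);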
Compiling the CUDA Program
It's quite easy to run a CUDA program. You first compile it using nvcc, and then you can run the outFile that gets generated as a result.
# The -o flag sets the outFile name to 00
nvcc -o 00 vector_add_simple.cu

# You can `ls` in the directory and find this outFile once compiled.
# Run the outFile as an executable to execute your program.
./00
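One optional detail: nvcc compiles for a default GPU architecture, and you can target your card explicitly. The g4dn instances carry a Tesla T4, which is compute capability 7.5, so the flag below should match - verify against your own GPU if you're running elsewhere:

# Target the T4's architecture (compute capability 7.5) explicitly
nvcc -arch=sm_75 -o 00 vector_add_simple.cu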
We're essentially adding vectors with a million elements each, and you'll notice that the entire operation completes in an instant. But that doesn't tell us much unless we compare it to a regular CPU-based program.
Comparing Results with a CPU
We explored the capabilities of parallel computing with a GPU using a simple vector addition program. But how does it actually fare against a CPU? This is what the students in the class were most curious about. I've added another file, vector_add_benchmark.cu, in the same folder for this very purpose. It runs the entire thing on a CPU (single-threaded) and then on the GPU to compare the time difference.
You'll notice a stark difference (2600x in my case; results may vary, but you can expect the same general trend). While 32ms doesn't look huge, you can try playing around and expanding the array size, at which point you'll see the gap widen as the CPU takes longer and longer.
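If you'd like to write your own comparison, here's a minimal sketch of the timing approach - std::chrono for the CPU pass, CUDA events for the kernel. This is my own illustration of the idea, not a copy of what's inside vector_add_benchmark.cu:

// Time the single-threaded CPU pass with std::chrono
auto t0 = std::chrono::high_resolution_clock::now();
vectorAddCPU(h_A, h_B, h_C, N);
auto t1 = std::chrono::high_resolution_clock::now();
float cpuMs = std::chrono::duration<float, std::milli>(t1 - t0).count();

// Time the kernel with CUDA events, which run on the GPU's own clock
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
vectorAdd<<<blocksPerGrid, BLOCK_SIZE>>>(d_A, d_B, d_C, N);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float gpuMs = 0.0f;
cudaEventElapsedTime(&gpuMs, start, stop);

cout << "CPU: " << cpuMs << " ms | GPU: " << gpuMs << " ms | speedup: " << (cpuMs / gpuMs) << "x\n";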
--
I don't expect this article to be a "Welcome to CUDA", and I apologise if you expected it to be. I just wanted to walk through and document the little session I took at my university, as a good habit. Overall, the feedback for the session was very positive. Feel free to reach out to me if you want to take a look at the deck (PS - I make good decks, wink-wink).
Thanks for reading, hope you're having a lovely time with your friends and family in the holiday season. I'll see you again in 2026 :D



