Yuriy Khamzyaev
Starting GPU computing with CUDA

With recent advances in AI, there’s a lot of talk about how critical GPU computation is for AI and why NVIDIA’s CUDA platform matters so much for training models. Coming from a web development background, I saw it as a distant and complex field, reserved for specialists with deep knowledge of hardware and parallel programming libraries. Still, I got curious about why this technology is so critical for training AI models. When I finally tried CUDA, it turned out to have a very elegant API and was quite easy to get started with.

I started exploring GPU programming based on recommendations in this Reddit thread and picked up the book Professional CUDA C Programming (2014). While it’s quite old, the fundamentals should still hold. It’s a hands-on guide for beginners that lets you practice parallel programming on the GPU right from the start.

Chapter 1 gives an intro to heterogeneous computing, in which computation is carried out on either the CPU or the GPU depending on the task at hand. The challenge of splitting a program so that sequential parts run on the CPU while parallelisable parts run on the GPU is managed quite elegantly with CUDA. You write both CPU code (in C) and GPU code (in CUDA C, an extension of C) in the same file. The CUDA compiler, nvcc, then separates these parts, invokes the appropriate compiler for each, and links everything together with the necessary CUDA libraries to enable execution on the GPU.

Here’s what a basic “Hello World” GPU program looks like in CUDA:

#include <stdio.h>

// GPU kernel function that prints a message from each thread
__global__ void helloGPU(void)
{
    printf("Hello from GPU! (thread %d)\n", threadIdx.x);
}

int main(void)
{
    printf("Hello from CPU!\n");

    // Launch the kernel with 1 block of 10 threads
    helloGPU<<<1, 10>>>();

    // Wait for the GPU to finish and flush its printf output
    cudaDeviceReset();

    return 0;
}

The __global__ keyword marks helloGPU as a kernel, a function that runs on the GPU. We call it with helloGPU<<<1, 10>>>(), where the numbers specify 1 block (a GPU concept explored in Chapter 2) and 10 threads per block.
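To try it yourself, save the code in a file (I’ll assume hello.cu; the name is my own choice) and build it with nvcc, the CUDA compiler driver:

nvcc hello.cu -o hello
./hello

The program should print the CPU line followed by ten GPU lines, one per thread (CUDA makes no guarantee about the order in which threads print):

Hello from CPU!
Hello from GPU! (thread 0)
Hello from GPU! (thread 1)
...
Hello from GPU! (thread 9)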

To run this code, I used a GPU-based machine on Google Cloud (chosen for convenience, though AWS or other GPU providers would work just as well). At about $0.80 per hour for access to an NVIDIA L4 GPU, it’s a bit pricey, but it provides full command-line access to the system for experimenting with compiling and running code. I used the Deep Learning VM with CUDA 12.4 (M129) image, which comes with CUDA preinstalled. There is also a much cheaper option of running an instance with an NVIDIA T4 GPU, but in my experience it was almost never available whenever I tried to create one.
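On a fresh machine, a minimal device query is a handy sanity check that the driver and GPU are set up correctly. Here’s a sketch using the standard CUDA runtime API (the file name, say query.cu, is again my own choice):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    // Count the CUDA-capable devices visible to the runtime
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("CUDA devices found: %d\n", count);

    // Print the name and compute capability of device 0
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Device 0: %s (compute capability %d.%d)\n",
           prop.name, prop.major, prop.minor);

    return 0;
}

Compiled the same way (nvcc query.cu -o query), on the L4 instance it should report the GPU’s name and compute capability.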

What helps further with experimenting is setting up VS Code to connect to the remote host over SSH (with the Remote - SSH extension):

[Screenshot: executing-cuda-code.png]

While CUDA syntax looks quite elegant, there are newer approaches that go further. For example, Mojo allows CPU and GPU code to be written in the same programming language and provides memory safety with a borrow checker.
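Remote - SSH picks up hosts from your ~/.ssh/config, so a minimal entry is enough to connect. A sketch (the host alias and the placeholders in angle brackets are mine; gcloud generates the google_compute_engine key by default):

Host cuda-vm
    HostName <EXTERNAL_IP_OF_THE_VM>
    User <YOUR_USERNAME>
    IdentityFile ~/.ssh/google_compute_engine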

Coming next:

  • Summing up vectors with CUDA (Chapter 2 of Professional CUDA C Programming)
  • Summing up vectors with Metal (Apple’s GPU framework)
  • Summing up vectors with the Mojo programming language