Understanding GPU programming: Vector Addition
Introduction
This year I’ve spent a lot of time building with agents and poking around with local LLMs. But I’ve realized that I have a gap in my mental model.
I have some burning questions that need answers: How do we write code to run on GPUs? How are GPUs so fast? What mental model of GPU programming do we need to reason about model training and inference?
I’m not going to answer all of these in this post. But we have to start somewhere. The aim of this post is to explain the mental model for writing parallel code that runs on the GPU. I will start with the ‘Hello World’ of CUDA: Vector Addition.
Pre-requisites
This post assumes you can read code and understand loops. No CUDA experience necessary. A basic understanding of what a GPU is will help, but I’ll cover the parallel programming ideas as we go.
All the GPU code in this post is written in CUDA C, which is NVIDIA’s platform for programming GPUs. If you want to try this example out, check out LeetGPU (it’s the first problem).
Before we move on, you should know that a vector is just a one-dimensional array of numbers. So vector addition means that we’re adding them element-wise.
For example, [1, 2, 3] + [4, 5, 6] = [5, 7, 9].
Sequential vs Parallel Computation
Let’s first consider how we would add two vectors sequentially. For example, in TypeScript:
function addVectors(a: number[], b: number[]): number[] {
const n = a.length;
const c: number[] = [];
for (let i = 0; i < n; i++) {
c[i] = a[i] + b[i];
}
return c;
}
One key thing to notice with this code is that each iteration doesn’t care about the others. c[1] doesn’t need c[2] to finish first. c[999] doesn’t wait for c[998]. Every iteration of the loop is independent.
This independence is the reason a GPU can help us. Whenever the work fits this shape, you can stop writing a loop and describe a single iteration. This is an example of data parallelism.
In CUDA, a grid of threads handles the iteration. Each thread handles one element. They all run in parallel.
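In that model, you no longer write the loop at all; you write what one iteration does, and launch enough threads to cover every element. Roughly (not yet valid CUDA on its own):
// each thread performs exactly one iteration of the original loop
C[i] = A[i] + B[i];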
Which raises the question: if there’s no i from a loop counter, how does each thread know which element it owns?
How Threads Organize into Blocks and Grids
When we think about CPUs, “thread” is a loaded word. Spinning up a hundred threads is something you’d think long and hard about.
Now I want you to forget all of that.
The reason a CPU can’t just “spawn a million threads” is that every thread costs real OS resources and competes for scheduling. A GPU is built around the opposite assumption: threads are cheap and disposable.
So now let’s talk about the CUDA hierarchy:
- Thread executes a Kernel
- Threads are grouped into Blocks
- Blocks are grouped into a Grid
Let’s start with the smallest unit of execution and work our way up.
A kernel is a special function that runs on the GPU and describes what one thread does. Each thread executes the kernel code independently.
A block is a group of threads that can cooperate and share data quickly through shared memory. Each block contains some number of threads (up to 1024). That 1024-per-block ceiling means you can’t put a million threads in one block; you spread them across many blocks instead.
The grid is an array of thread blocks and it represents the entire set of threads launched for a single kernel call. When you launch a kernel, you specify how many blocks to create (the dimensions of the grid).
Inside the kernel, there are three built-in variables that tell each thread who it is:
- blockIdx.x = which block am I in?
- threadIdx.x = what position am I inside my block?
- blockDim.x = how many threads are in each block?
Note: The .x is a hint that there’s also .y and .z. Grids and blocks can be 2D or 3D for data like images. We don’t need to worry about that for this 1D vector example.
These three properties are enough for each thread to figure out its global position in the grid:
int i = blockIdx.x * blockDim.x + threadIdx.x;
Every thread gets its own blockIdx.x and threadIdx.x which together resolve to a unique global index.
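For example, with 256 threads per block, the thread at threadIdx.x = 5 inside block blockIdx.x = 2 computes i = 2 * 256 + 5 = 517, and no other thread in the grid produces that index.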
Click a thread in the grid below to watch blockIdx.x and threadIdx.x plug into the formula and resolve to a global index. Try changing the block size and block count to see how the same thread ends up at a different i:
One more point worth knowing: threads can run in any order. The hardware doesn’t promise that block 0 finishes before block 1 starts, or that blocks even run in numerical order. If your kernel relies on a particular order, you’ll run into problems.
Launch Syntax
We’ve talked about kernels and grids in the abstract. Now let’s see what they actually look like.
A kernel is just a C-like function with a special qualifier in front of it:
__global__ void vector_add(const float* A, const float* B, float* C, int N) {
// ...
}
The __global__ keyword tells the compiler: this function runs on the GPU, and it’s called from the CPU. Without it, the compiler has no idea this code is meant to leave the host.
In CUDA terminology, Host refers to the CPU and Device refers to the GPU.
Calling the kernel function looks like this:
vector_add<<<blocksPerGrid, threadsPerBlock>>>(A, B, C, N);
The <<< and >>> aren’t a typo. They’re CUDA’s launch syntax, and they’re the one piece of the language that isn’t valid C++ (you can only compile this with nvcc, NVIDIA’s CUDA compiler).
Between the triple angle brackets, you tell the GPU exactly how to shape the grid, i.e., how many blocks, and how many threads per block. The hardware uses these values to populate blockDim.x and to assign each thread its blockIdx.x and threadIdx.x.
Choosing the values is straightforward. Pick the threads-per-block (256 is a common default), then compute how many blocks you need to cover N elements:
int threadsPerBlock = 256;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
vector_add<<<blocksPerGrid, threadsPerBlock>>>(A, B, C, N);
That (N + threadsPerBlock - 1) / threadsPerBlock is ceiling division, which rounds up the number of blocks needed to cover all N elements.
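For example, with N = 1,000,000 and 256 threads per block, plain integer division gives 1,000,000 / 256 = 3906 blocks, which only covers 999,936 elements; adding 255 first gives (1,000,000 + 255) / 256 = 3907 blocks, enough to cover every element.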
Checking the Boundary
The launch code from the previous section has an issue. Data sizes don’t usually divide cleanly into block sizes.
With N = 1000 and 256 threads per block, that gives 4 blocks and 1024 threads for 1000 elements. The last 24 threads have nothing to add.
But those extra threads still run the kernel, and we need to make sure they don’t read and write past the end of the arrays.
Check out the explorer below. Adjust N and threadsPerBlock, then toggle the bounds check on and off.
On a CPU, an out-of-bounds access often gets caught. The OS hands you a segmentation fault and the program dies (you might know the pain).
On a GPU it’s worse. There’s no automatic bounds check on a memory access. Those reads and writes will succeed, silently, against whatever happens to live next to those arrays in GPU memory. You may get back a result that looks right but isn’t.
The fix is one line:
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < N) {
C[i] = A[i] + B[i];
}
GPU Memory
There are several types of memory that CUDA exposes on the GPU, each with different speed, capacity, and scope.
| Type | Speed | Size | Visible to | Use for |
|---|---|---|---|---|
| Global | Slowest | Largest (GBs) | All threads | Large datasets, inputs and outputs |
| Shared | Fast | Small (KB per block) | Threads in same block | Data shared frequently within a block |
| Registers | Fastest | Tiny (per thread) | One thread only | Local variables, loop counters |
| Constant | Fast (cached) | Small (64KB) | All threads (read-only) | Fixed configuration, lookup tables |
| Local | Slow | Per-thread (backed by global) | One thread only | Register spill, avoid if possible |
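As a purely illustrative sketch of how these spaces show up in code (the kernel and variable names here are made up and not part of our vector addition example):
__constant__ float config[64];                 // constant memory: small, read-only, visible to all threads

__global__ void example_kernel(float* data) {  // data points into global memory
    __shared__ float tile[256];                // shared memory: one copy per block, shared by its threads
    float x = data[threadIdx.x];               // x lives in a register, private to this thread
    tile[threadIdx.x] = x * config[0];         // threads in the same block can now cooperate on tile
}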
In our vector addition kernel, the arrays A, B, and C live in global memory on the device.
Note: Before the kernel launches, the CPU allocates space on the GPU with cudaMalloc and copies the input data across with cudaMemcpy. Every thread in the kernel can then read from or write to that memory.
The index i and any intermediate float values are stored in registers, private to each thread. The compiler places local variables there automatically.
We won’t use shared or constant memory for this kernel. Shared memory in particular becomes important for more advanced kernels like matrix multiplication, where threads in a block need to reuse the same data repeatedly. But that’s a future post.
Putting it all together
Here’s the whole thing in one place.
#include <cuda_runtime.h>
// a function that runs on the GPU and is called from the CPU
__global__ void vector_add(const float* A, const float* B, float* C, int N) {
// every thread computes its global index from where it sits in the grid
int i = blockIdx.x * blockDim.x + threadIdx.x;
// guard against the extra threads we launched but don't need
if (i < N) {
C[i] = A[i] + B[i];
}
}
// this CPU function launches the kernel
void launch_vector_add(const float* A, const float* B, float* C, int N) {
int threadsPerBlock = 256;
// ceiling division to cover all `N` elements
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
// launch the kernel with our grid configuration
vector_add<<<blocksPerGrid, threadsPerBlock>>>(A, B, C, N);
// kernel launches are async.
// The CPU returns immediately, so we have to wait before reading the results
cudaDeviceSynchronize();
}
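One thing the code above glosses over: the pointers passed to launch_vector_add must point at device memory, not ordinary CPU arrays. Here’s a minimal sketch of a caller, assuming h_A, h_B, and h_C are plain host arrays (the h_/d_ names are just a convention for host and device pointers):
void add_on_gpu(const float* h_A, const float* h_B, float* h_C, int N) {
    size_t bytes = N * sizeof(float);
    float *d_A, *d_B, *d_C;

    // allocate global memory on the device
    cudaMalloc((void**)&d_A, bytes);
    cudaMalloc((void**)&d_B, bytes);
    cudaMalloc((void**)&d_C, bytes);

    // copy the inputs from host to device
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    // launch the kernel (launch_vector_add synchronizes before returning)
    launch_vector_add(d_A, d_B, d_C, N);

    // copy the result back to the host and release device memory
    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}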
Conclusion
Vector addition is the ‘Hello World’ of CUDA. The interesting kernels like matrix multiplication and attention change what the body does, but not the shape around it.
If you write PyTorch and you’ve ever typed A + B on a tensor with .cuda(), the kernel that ran underneath looks an awful lot like the one we just built.
I encourage you to take what we’ve learned here and try it out. Vector addition is the first problem on LeetGPU. Go see if you can solve it.