Understanding GPU programming: Vector Addition
Introduction
This year I’ve spent a lot of time building with agents and poking around with local LLMs. But I’ve realized that I have a gap in my mental model.
I have some burning questions that need answers: How do we write code to run on GPUs? How are GPUs so fast? What mental model of GPU programming do we need to reason about model training and inference?
I’m not going to answer all of these in this post. But we have to start somewhere. The aim of this post is to explain the mental model for writing parallel code that runs on the GPU. I will start with the ‘Hello World’ of CUDA: Vector Addition.
Pre-requisites
This post assumes you can read code and understand loops. No CUDA experience necessary. A basic understanding of what a GPU is will help, but I’ll cover the parallel programming ideas as we go.
All the GPU code in this post is written in CUDA C, which is NVIDIA’s platform for programming GPUs. If you want to try this example out, check out LeetGPU (it’s the first problem).
Before we move on, you should know that a vector is just a one-dimensional array of numbers. So vector addition means that we’re adding them element-wise.
For example, [1, 2, 3] + [4, 5, 6] = [5, 7, 9].
Sequential vs Parallel Computation
Let’s first consider how we would add two vectors sequentially. For example, in TypeScript:
function addVectors(a: number[], b: number[]): number[] {
const n = a.length;
const c: number[] = [];
for (let i = 0; i < n; i++) {
c[i] = a[i] + b[i];
}
return c;
}
One key thing to notice with this code is that each iteration doesn’t care about the others. c[1] doesn’t need c[2] to finish first. c[999] doesn’t wait for c[998]. Every iteration of the loop is independent.
This independence is the reason a GPU can help us. Whenever the work fits this shape, you can stop writing a loop and describe a single iteration. This is an example of data parallelism.
In CUDA, a grid of threads handles the iteration. Each thread handles one element. They all run in parallel.
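In that model, you no longer write the loop at all; you write what one iteration does, and launch enough threads to cover every element. Roughly (not yet valid CUDA on its own):
// each thread performs exactly one iteration of the original loop
C[i] = A[i] + B[i];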
Which raises the question: if there’s no i from a loop counter, how does each thread know which element it owns?
How Threads Organize into Blocks and Grids
When we think about CPUs, “thread” is a loaded word. Spinning up a hundred threads is something you’d think long and hard about.
Now I want you to forget all of that.
The reason a CPU can’t just “spawn a million threads” is that every thread costs real OS resources and competes for scheduling. A GPU is built around the opposite assumption: threads are cheap and disposable.
So now let’s talk about the CUDA hierarchy:
- Thread executes a Kernel
- Threads are grouped into Blocks
- Blocks are grouped into a Grid
Let’s start with the smallest unit of execution and work our way up.
A kernel is a special function that runs on the GPU and describes what one thread does. Each thread executes the kernel code independently.
A block is a group of threads that can cooperate and share data quickly through shared memory. Each block contains some number of threads (up to 1024). That 1024-per-block ceiling means you can’t put a million threads in one block; you spread them across many blocks instead.
The grid is an array of thread blocks and it represents the entire set of threads launched for a single kernel call. When you launch a kernel, you specify how many blocks to create (the dimensions of the grid).
Inside the kernel, there are three built-in variables that tell each thread who it is:
- blockIdx.x = which block am I in?
- threadIdx.x = what position am I inside my block?
- blockDim.x = how many threads are in each block?
Note: The .x is a hint that there’s also .y and .z. Grids and blocks can be 2D or 3D for data like images. We don’t need to worry about that for this 1D vector example.
These three properties are enough for each thread to figure out its global position in the grid:
int i = blockIdx.x * blockDim.x + threadIdx.x;
Every thread gets its own blockIdx.x and threadIdx.x which together resolve to a unique global index.
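For example, with 256 threads per block, the thread at threadIdx.x = 5 inside block blockIdx.x = 2 computes i = 2 * 256 + 5 = 517, and no other thread in the grid produces that index.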
Click a thread in the grid below to watch blockIdx.x and threadIdx.x plug into the formula and resolve to a global index. Try changing the block size and block count to see how the same thread ends up at a different i:
One more point worth knowing: threads can run in any order. The hardware doesn’t promise that block 0 finishes before block 1 starts, or that blocks even run in numerical order. If your kernel relies on a particular order, you’ll run into problems.
Launch Syntax
We’ve talked about kernels and grids in the abstract. Now let’s see what they actually look like.
A kernel is just a C-like function with a special qualifier in front of it:
__global__ void vector_add(const float* A, const float* B, float* C, int N) {
// ...
}
The __global__ keyword tells the compiler: this function runs on the GPU, and it’s called from the CPU. Without it, the compiler has no idea this code is meant to leave the host.
In CUDA terminology, Host refers to the CPU and Device refers to the GPU.
Calling the kernel function looks like this:
vector_add<<<blocksPerGrid, threadsPerBlock>>>(A, B, C, N);
The <<< and >>> aren’t a typo. They’re CUDA’s launch syntax, and they’re the one piece of the language that isn’t valid C++ (you can only compile this with nvcc, NVIDIA’s CUDA compiler).
Between the triple angle brackets, you tell the GPU exactly how to shape the grid, i.e., how many blocks, and how many threads per block. The hardware uses these values to populate blockDim.x and to assign each thread its blockIdx.x and threadIdx.x.
Choosing the values is straightforward. Pick the threads-per-block (256 is a common default), then compute how many blocks you need to cover N elements:
int threadsPerBlock = 256;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
vector_add<<<blocksPerGrid, threadsPerBlock>>>(A, B, C, N);
That (N + threadsPerBlock - 1) / threadsPerBlock is ceiling division, which rounds up the number of blocks needed to cover all N elements.
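For example, with N = 1,000,000 and 256 threads per block, plain integer division gives 1,000,000 / 256 = 3906 blocks, which only covers 999,936 elements; adding 255 first gives (1,000,000 + 255) / 256 = 3907 blocks, enough to cover every element.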
Checking the Boundary
The launch code from the previous section has an issue. Data sizes don’t usually divide cleanly into block sizes.
With N = 1000 and 256 threads per block, that gives 4 blocks and 1024 threads for 1000 elements. The last 24 threads have nothing to add.
But those extra threads still run the kernel, and we need to make sure they don’t read and write past the end of the arrays.
Check out the explorer below. Adjust N and threadsPerBlock, then toggle the bounds check on and off.
On a CPU, an out-of-bounds access often gets caught. The OS hands you a segmentation fault and the program dies (you might know the pain).
On a GPU it’s worse. There’s no automatic bounds check on a memory access. Those reads and writes will succeed, silently, against whatever happens to live next to those arrays in GPU memory. You may get back a result that looks right but isn’t.
The fix is one line:
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < N) {
C[i] = A[i] + B[i];
}
GPU Memory
There are several types of memory that CUDA exposes on the GPU, each with different speed, capacity, and scope.
| Type | Speed | Size | Visible to | Use for |
|---|---|---|---|---|
| Global | Slowest | Largest (GBs) | All threads | Large datasets, inputs and outputs |
| Shared | Fast | Small (KB per block) | Threads in same block | Data shared frequently within a block |
| Registers | Fastest | Tiny (per thread) | One thread only | Local variables, loop counters |
| Constant | Fast (cached) | Small (64KB) | All threads (read-only) | Fixed configuration, lookup tables |
| Local | Slow | Per-thread (backed by global) | One thread only | Register spill, avoid if possible |
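As a purely illustrative sketch of how these spaces show up in code (the kernel and variable names here are made up and not part of our vector addition example):
__constant__ float config[64];                 // constant memory: small, read-only, visible to all threads

__global__ void example_kernel(float* data) {  // data points into global memory
    __shared__ float tile[256];                // shared memory: one copy per block, shared by its threads
    float x = data[threadIdx.x];               // x lives in a register, private to this thread
    tile[threadIdx.x] = x * config[0];         // threads in the same block can now cooperate on tile
}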
In our vector addition kernel, the arrays A, B, and C live in global memory on the device.
Note: Before the kernel launches, the CPU allocates space on the GPU with cudaMalloc and copies the input data across with cudaMemcpy. Every thread in the kernel can then read from or write to that memory.
The index i and any intermediate float values are stored in registers, private to each thread. The compiler places local variables there automatically.
We won’t use shared or constant memory for this kernel. Shared memory in particular becomes important for more advanced kernels like matrix multiplication, where threads in a block need to reuse the same data repeatedly. But that’s a future post.
Putting it all together
Here’s the whole thing in one place.
#include <cuda_runtime.h>
// a function that runs on the GPU and is called from the CPU
__global__ void vector_add(const float* A, const float* B, float* C, int N) {
// every thread computes its global index from where it sits in the grid
int i = blockIdx.x * blockDim.x + threadIdx.x;
// guard against the extra threads we launched but don't need
if (i < N) {
C[i] = A[i] + B[i];
}
}
// this CPU function launches the kernel
void launch_vector_add(const float* A, const float* B, float* C, int N) {
int threadsPerBlock = 256;
// ceiling division to cover all `N` elements
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
// launch the kernel with our grid configuration
vector_add<<<blocksPerGrid, threadsPerBlock>>>(A, B, C, N);
// kernel launches are async.
// The CPU returns immediately, so we have to wait before reading the results
cudaDeviceSynchronize();
}
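One thing the code above glosses over: the pointers passed to launch_vector_add must point at device memory, not ordinary CPU arrays. Here’s a minimal sketch of a caller, assuming h_A, h_B, and h_C are plain host arrays (the h_/d_ names are just a convention for host and device pointers):
void add_on_gpu(const float* h_A, const float* h_B, float* h_C, int N) {
    size_t bytes = N * sizeof(float);
    float *d_A, *d_B, *d_C;

    // allocate global memory on the device
    cudaMalloc((void**)&d_A, bytes);
    cudaMalloc((void**)&d_B, bytes);
    cudaMalloc((void**)&d_C, bytes);

    // copy the inputs from host to device
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    // launch the kernel (launch_vector_add synchronizes before returning)
    launch_vector_add(d_A, d_B, d_C, N);

    // copy the result back to the host and release device memory
    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}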
Conclusion
Vector addition is the ‘Hello World’ of CUDA. The interesting kernels like matrix multiplication and attention change what the body does, but not the shape around it.
If you write PyTorch and you’ve ever typed A + B on a tensor with .cuda(), the kernel that ran underneath looks an awful lot like the one we just built.
I encourage you to take what we’ve learned here and try it out. Vector addition is the first problem on LeetGPU. Go see if you can solve it.