Explainer·Technical Deep Dive

What Is a CUDA Kernel? A Visual Explainer

A visual guide to GPU kernels: threads, blocks, warps, the memory hierarchy, kernel launch syntax, and the FlashAttention-class kernels powering every LLM in 2026.

Jonathan Chavez
Co-Founder @ LLM Stats
14 min read

Everything your favorite LLM is doing right now, at a rate of billions of operations per second, happens inside CUDA kernels. They are the smallest unit of work the GPU understands. They are also the place where every interesting AI system in 2026 lives or dies.

This is a visual explainer. We will start from the single most important idea in GPU programming, which is the difference between thinking serially and thinking in parallel. Then we will build up the kernel: what it is, what its launch looks like, how threads find their work, where the speed comes from, and what trips it up. By the end you should be able to read a kernel and have an opinion about whether it is fast.

One Idea, Two Machines

The same 256-element SAXPY, y[i] = a·x[i] + y[i], on two machines. A CPU does it one at a time; a GPU does all of them at once.

CPU · serial loop — for i in 0..N. One worker walks the array. Each element waits for the one before it. The loop is the program.

GPU · parallel kernel — kernel<<<1, N>>>. 256 workers start at once. Each one handles a single element. The launch is the program.

Visualisation of the same SAXPY operation on a CPU (sequential for-loop) and a GPU (one-thread-per-element kernel). Real timings are not to scale; the intent is to show the change of frame, from time as iteration to time as launch.

The Parallel Revelation

A CPU is a small number of strong workers. Eight cores, sixteen threads, each running at 4 GHz, each capable of complex branching and big caches. A GPU is the opposite. An NVIDIA H100 has 132 streaming multiprocessors (SMs), each able to have 2,048 threads in flight. That is over a quarter of a million threads, each running at a slower clock, each individually less capable than a CPU thread. The trade is intentional. If you have a task that decomposes into independent pieces, the GPU will eat it.

Almost everything in deep learning decomposes that way. A matrix multiply is independent dot products. A convolution is independent windows. Attention is independent rows of a softmax. All of these are jobs that look like "do the same thing, everywhere, at once." That is exactly what a kernel expresses.


A Kernel, In One Sentence

A CUDA kernel is a function that runs on the GPU and is executed in parallel by many threads.

That sentence has the entire model. The function is just C++ (with a few extra keywords). The runtime part is "in parallel by many threads". You write the function once. The GPU runs as many copies of it as you ask for, all at the same time, each on slightly different data.

The function is marked with __global__, which tells the compiler "this gets compiled to GPU instructions and is launched from the host". To launch it, you use the triple-chevron syntax that has confused every CUDA beginner since 2007.
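Before the diagram below, here is the whole pattern in one minimal sketch. The vecAdd kernel is illustrative rather than taken from any library, and it assumes the arrays hold exactly as many elements as threads launched (no bounds check):

__global__ void vecAdd(const float *A, const float *B, float *C) {
    // Each thread computes the index of the single element it owns.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    C[i] = A[i] + B[i];
}

// Host side: 4 blocks of 256 threads, one thread per element of a 1,024-element array.
vecAdd<<<4, 256>>>(A, B, C);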

Anatomy of a Launch

Triple-chevron syntax: one line holds the entire program.

vecAdd<<<4, 256>>>(A, B, C);

kernel name — a function marked __global__. Compiled for the GPU.
arguments — pointers shipped to the device.
gridDim — 4 blocks in the grid.
blockDim — 256 threads per block. A multiple of 32.

What this launches: 1 grid (the full launch), 4 blocks (each scheduled onto one SM), 1,024 threads in total (4 blocks × 256 threads).

Source: NVIDIA CUDA C++ Programming Guide, §2.1. The triple-chevron notation has been the standard kernel launch syntax since CUDA 1.0 (2007). cudaLaunchKernel is the equivalent runtime API call.

The Anatomy of a Launch

The chevrons hold the execution configuration: how many blocks, how many threads per block. Everything inside the parentheses is the regular argument list. There is nothing else to it. The compiler turns the chevron call into a runtime API call, the runtime ships the launch and its arguments across the PCIe bus to the GPU, and the GPU schedules the work.

Two things are doing all the work in that line. The first is blockDim: how many threads make up a block. A block is the unit that is guaranteed to run on a single SM, and its threads can talk to each other through shared memory. The second is gridDim: how many of those blocks you want. The grid is the entire population of threads for the launch. Multiply them and you get the total thread count. The standard ceiling-division trick to cover an array of size N is (N + blockSize - 1) / blockSize blocks of blockSize threads each.
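As a worked example (N = 1,000 is an arbitrary size, and saxpy is the kernel defined later in this post):

// (1000 + 255) / 256 = 4 blocks × 256 threads = 1,024 threads.
// The last 24 threads fail the kernel's i < n bounds check and do nothing.
int blockSize = 256;
int gridSize  = (N + blockSize - 1) / blockSize;
saxpy<<<gridSize, blockSize>>>(N, 2.0f, x, y);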


Threads, Blocks, Grids

Three levels of hierarchy. The thread is the smallest unit: one running instance of your function. Threads in the same block live on the same SM and can synchronize and share memory cheaply. Blocks make up the grid, and blocks cannot directly talk to each other (except through global memory or, on Hopper, distributed shared memory inside a thread block cluster).

Inside the kernel, every thread can ask "who am I?" and gets back four built-in variables: threadIdx, blockIdx, blockDim, and gridDim. The standard way to convert that identity into an index into a global array is the most important formula in the language.

The Index Formula

4 blocks × 8 threads = 32

Every thread computes its own
place in the array.

int i = blockIdx.x × blockDim.x + threadIdx.x;2 × 8 + 5 = 21
Gridblock 001234567block 101234567block 201234567block 301234567Global array · 32 elements08162124
Each thread reads four built-in variables (threadIdx, blockIdx, blockDim, gridDim) and computes its own global index. There is no for loop in the kernel; the for loop is the launch.

That formula is doing the work of the for loop in a CPU implementation. There is no for loop in a kernel. The for loop is the launch itself. Each thread is one iteration, running concurrently with all the others. This shift, from "one worker iterating" to "every iteration is its own worker", is the change of frame that makes everything else click.


The Memory Hierarchy

The threads themselves are cheap. Feeding them is the entire problem.

A modern GPU is structured as a deep memory hierarchy. The closer to the SM, the faster and the smaller. The further out, the larger and the slower. Writing a kernel is, more than anything else, an exercise in moving the right data into the right tier and reusing it as much as possible before sending it back out.

The Memory Hierarchy

NVIDIA H100 SXM5. Five floors, each one bigger and slower than the one above it.

The art of writing a fast kernel is moving the right data into the right floor and reusing it before sending it back down. Almost every CUDA optimization is a version of that one move.

01 · Registers · per thread — on-chip storage in the SM, allocated to each thread at compile time. Free to read. Capacity 256 KB / SM · latency 0 cycles · bandwidth ~10 TB/s.

02 · Shared memory · per block — a small scratchpad that threads in the same block read and write. The workbench. Capacity 228 KB / SM · latency ~30 cycles · bandwidth ~10 TB/s.

03 · L2 cache · per GPU — the last on-die cache, shared across all SMs, hidden from the programmer. Capacity 50 MB · latency ~200 cycles · bandwidth ~5.5 TB/s.

04 · Global memory · HBM3 · per GPU — off-chip DRAM. The warehouse, where the model weights live. Capacity 80 GB · latency ~400 cycles · bandwidth 3.35 TB/s.

05 · Host · PCIe Gen5 · per system — system RAM, reached over the PCIe bus. About 50× slower than HBM. Capacity TB+ · latency thousands of ns · bandwidth 64 GB/s.

Specifications from the NVIDIA H100 white paper and CUDA C++ Best Practices Guide. Latencies are typical; actual values vary with access pattern and contention. PCIe is roughly 150× slower than the on-die memories.

A useful mental model: HBM is the warehouse, L2 is the loading dock, shared memory is the workbench, registers are your hands. You do not assemble things in the warehouse. You bring a tile to the workbench, you finish your task there, and you ship the result back. Most kernel optimization is some version of "keep more of the work on the workbench, make fewer trips to the warehouse".

The numbers above are approximate H100 figures. Blackwell B200 raises HBM bandwidth to roughly 8 TB/s and adds FP4 Tensor Cores, but the shape of the hierarchy is the same. The ratios that matter (registers roughly an order of magnitude lower latency than shared memory, shared memory roughly 3× the bandwidth of HBM) hold across generations.
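To make the workbench metaphor concrete, here is a minimal sketch of a block-level sum that touches HBM exactly once per element and does all of its reuse in shared memory. It is illustrative only (it assumes 256-thread blocks and an input that exactly fills the grid), not a production reduction:

__global__ void blockSum(const float *in, float *blockSums) {
    __shared__ float tile[256];                  // the workbench: one slot per thread
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + threadIdx.x;

    tile[tid] = in[i];                           // one trip to the warehouse (HBM)
    __syncthreads();

    // Tree reduction entirely in shared memory; every partial sum is reused on-chip.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) tile[tid] += tile[tid + s];
        __syncthreads();
    }

    if (tid == 0) blockSums[blockIdx.x] = tile[0];  // one write back per block
}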


Warps and the Cost of Branches

One more piece of the architecture you need before you can read a real kernel. Threads are not actually scheduled individually. The SM groups them into bundles of 32 threads called warps, and the warp is the unit it executes. Every cycle, all 32 threads in a warp execute the same instruction at the same time, each on its own data. NVIDIA calls this SIMT: Single Instruction, Multiple Threads.

The neat consequence: block sizes that are multiples of 32 waste nothing. Anything else pays for unused lanes in the final warp. The annoying consequence is what happens at branches.

Warp Divergence

32 threads · 1 warp. Disagree on a branch, and the warp pays for both paths.

A · convergent — all 32 threads take the same branch. The warp executes the path once. Full speed.
if (true) { fast_path(); } → cycle 1 · all active → time 1.0×

B · divergent — half take the if, half take the else. The warp serializes, running each path with the other half masked off. Both costs are paid.
if (tid < 16) { ... } else { ... } → cycle 1 · 16 active, 16 masked → cycle 2 · 16 masked, 16 active → time 2.0×

Worst case: a 32-way switch with one thread per branch is 32× slower than the convergent equivalent. Fast kernels are written with branch divergence in mind from the first line.

SIMT (Single Instruction, Multiple Threads) execution model. Source: NVIDIA CUDA C++ Programming Guide §10. Volta and later architectures added independent thread scheduling, which lets divergent threads make forward progress, but it does not eliminate the serialization tax in the inner loop.

When the threads in a warp disagree about which way to go in an if, the warp serializes. It runs the true branch with the false-branch threads masked off, then runs the false branch with the true-branch threads masked off. Both paths cost time. This is warp divergence, and it is the most common reason a kernel that "should be fast" turns out to be merely fine.

The cure is structural: arrange your data and your branches so that whole warps go the same way. Sort your work into groups before launching. Pad to the warp boundary. Use predicated math (compute both sides and select) instead of branches when the true and false paths are short. The fastest kernels in the world have almost no branches inside their inner loops.
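A small sketch of that last trick. Both versions compute the same result; the second replaces the branch with straight-line code and a select, which keeps the whole warp on one path (whether the compiler emits actual predicated instructions depends on the architecture and optimization level, and the variable names here are placeholders):

// Branchy: threads in the same warp may disagree on the condition and serialize.
if (x[i] > 0.0f) { y[i] = a * x[i]; } else { y[i] = b * x[i]; }

// Predicated: every thread runs the same instructions; only the selected value differs.
float scale = (x[i] > 0.0f) ? a : b;
y[i] = scale * x[i];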


A Complete Kernel: SAXPY

Time to put it together. SAXPY stands for Single-precision A·X Plus Y, an operation inherited from the original BLAS library and the "hello world" of GPU programming since CUDA's earliest days. It computes y[i] = a * x[i] + y[i] for every i, and it is embarrassingly parallel: every thread does one multiplication and one addition, with no dependencies on any other thread.

A Complete Kernel

SAXPY · y = a·x + y. Six lines of CUDA. Eight threads, one shot.

saxpy.cu

__global__ void saxpy(int n, float a,
                      float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = a * x[i] + y[i];
    }
}

host launch

int blocks = (n + 255) / 256;
saxpy<<<blocks, 256>>>(n, 2.0f, x, y);

In the visualization, a = 2.0 and kernel<<<1, 8>>> launches eight threads t0..t7, one per element of x = [1, 3, 2, 4, 5, 1, 7, 2] and y = [4, 1, 7, 3, 2, 8, 1, 5] in HBM, each computing y[i] = 2.0 · x[i] + y[i].

FLOPs per thread: 2 — one multiply, one add, fused on the hardware.
Bytes per thread: 12 — read x[i] (4) + read y[i] (4) + write y[i] (4).
Arithmetic intensity: 0.17 FLOPs per byte — below the memory roofline; memory-bound.

SAXPY is the "hello world" of GPU programming. On an H100, this kernel sustains around 3 TB/s, the HBM3 bandwidth limit. The path to faster is not more compute, it is less memory traffic. That is the entire FlashAttention insight in one sentence.

The kernel is six lines. The launch is one line. That is the entire program. Every thread reads two floats from HBM, does one fused multiply-add, writes one float back. There are no loops inside the kernel. The for loop is the grid.

SAXPY is also a useful sanity check for hardware. On an H100, this kernel runs at about 3 TB/s, almost exactly the HBM3 bandwidth, because every thread reads 8 bytes, writes 4 bytes, and does 2 floating-point operations. It is memory-bound. You cannot make it faster by adding more compute. You can only make it faster by moving less data, which is exactly the move that FlashAttention pulled off for the attention operation.
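For completeness, here is a sketch of the host-side plumbing the figure leaves out; hx and hy are assumed host arrays filled elsewhere, and error checking is elided:

int n = 1 << 20;
size_t bytes = n * sizeof(float);

float *x, *y;
cudaMalloc(&x, bytes);                              // allocate in device HBM
cudaMalloc(&y, bytes);
cudaMemcpy(x, hx, bytes, cudaMemcpyHostToDevice);   // ship inputs across PCIe
cudaMemcpy(y, hy, bytes, cudaMemcpyHostToDevice);

int blocks = (n + 255) / 256;
saxpy<<<blocks, 256>>>(n, 2.0f, x, y);              // launch is asynchronous
cudaDeviceSynchronize();                            // wait for the kernel to finish

cudaMemcpy(hy, y, bytes, cudaMemcpyDeviceToHost);   // bring the result back
cudaFree(x);
cudaFree(y);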


Why Every LLM Lives on Kernels

Once you can write SAXPY, you have the model for everything else. The catch is that the operations that matter are harder. A matrix multiply has data dependencies. Softmax has a reduction. Attention has both, plus a giant intermediate matrix you do not want to materialize. These are all kernels too, just much more carefully designed ones.

Three pieces of work define the modern landscape.

FlashAttention

FlashAttention (Tri Dao, 2022, with v2 in 2023 and v3 in 2024) is the single most important AI kernel of the last five years. The insight: the attention output for one query block can be computed by streaming through the keys and values in tiles, keeping a running softmax in shared memory, and never writing the full attention matrix to HBM. Memory traffic drops from O(N²) to O(N). On long contexts, the speedup is several-fold.
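The heart of that streaming trick is the online softmax recurrence. The sketch below is a scalar, one-value-at-a-time illustration of the idea, with score(j) and value(j) standing in for the dot products and V rows that the real kernel computes per tile on Tensor Cores:

// Streaming softmax-weighted sum: out = sum_j softmax(s_j) * v_j, in one pass,
// without ever storing all the scores.
float m   = -INFINITY;  // running max of the scores seen so far
float l   = 0.0f;       // running softmax denominator
float acc = 0.0f;       // running, unnormalized output

for (int j = 0; j < n; ++j) {
    float s     = score(j);                  // q · k_j, computed on the fly
    float m_new = fmaxf(m, s);
    float corr  = expf(m - m_new);           // rescale everything accumulated so far
    l   = l   * corr + expf(s - m_new);
    acc = acc * corr + expf(s - m_new) * value(j);
    m   = m_new;
}
float out = acc / l;    // identical to materializing all N scores first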

FlashAttention-3 added warp specialization (different warps in a block do different jobs, like loading vs computing), asynchronous WGMMA on Hopper Tensor Cores, and FP8 support. It hits 740 TFLOP/s in FP16, around 75% of an H100's theoretical peak, and close to 1.2 PFLOP/s in FP8. Without this kernel, the 128K-token context windows that GPT-4 and Llama 3 ship with, let alone the 1M-token windows beyond them, would not be commercially viable.

Triton

Triton (Philippe Tillet's project, first published in 2019 and now developed at OpenAI) is a Python-like language that compiles to PTX. It handles a huge amount of the bookkeeping (tile loading, register allocation, pipelining) for you. It does not always match a hand-written CUDA kernel on the latest hardware, but it is fast enough that most of PyTorch's newer fused ops ship as Triton kernels. If you want to write a custom kernel without spending three weeks learning CUTLASS, Triton is the right entry point.

ThunderKittens

ThunderKittens (Hazy Research, Stanford, 2024) is the cleanest abstraction in the space. Its unit of work is a 16×16 tile, the natural granule for the H100's Tensor Core instructions, and its primitives map directly to warp-level operations. Their FlashAttention implementation is under 100 lines of code and reaches 855 TFLOP/s, about 86% of H100 peak, matching FA3. They have since produced kernels for Mamba-2, linear attention, RoPE, LayerNorm, and FlashFFTConv, several of which beat the best Triton implementations by 6-14×.

The lesson across all three: the abstraction that wins is the one that exposes the GPU's actual structure (warps, tiles, async memory copies) instead of hiding it. PyTorch hides too much for frontier kernels. Pure CUDA shows too much. Triton, ThunderKittens, and CUTLASS each find a different equilibrium in the middle.


How to Read Kernel Performance

When you run a kernel under Nsight Compute, the first question to ask is whether you are memory-bound or compute-bound. The roofline model gives you a clean answer: plot achieved performance against arithmetic intensity (FLOPs per byte loaded). If your intensity puts you under the sloped memory roof, you are waiting on HBM. If it puts you under the flat peak-FLOPs roof, you are waiting on the math units (the Tensor Cores, for matmul-shaped work).

SAXPY has an arithmetic intensity of 0.17 FLOPs/byte, far below where compute could ever bottleneck it. It is unfixably memory-bound. A matmul with a big M, N, K has arithmetic intensity in the hundreds, comfortably in the compute-bound regime, and that is why GEMMs are how we measure peak FLOPs in the first place.
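A back-of-the-envelope version of that comparison, assuming square FP16 matrices with M = N = K = 4096, each matrix moving through HBM exactly once, and taking H100 dense FP16 peak as roughly 990 TFLOP/s:

GEMM:   FLOPs ≈ 2·M·N·K = 2·4096³
        bytes ≈ 2 bytes × (M·K + K·N + M·N) = 6·4096²
        intensity ≈ 2·4096 / 6 ≈ 1,365 FLOPs/byte   → compute-bound
SAXPY:  2 FLOPs / 12 bytes ≈ 0.17 FLOPs/byte        → memory-bound
Ridge:  ~990 TFLOP/s ÷ 3.35 TB/s ≈ 295 FLOPs/byte   (the corner of the H100 roofline)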

The other number that matters is occupancy: how many of the SM's 2,048 thread slots are actually in flight. Low occupancy is usually a sign of register pressure (each thread using too many registers) or shared memory pressure. The fix is almost always to use smaller tiles, fewer registers, or both.
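If you want that number without the profiler, the CUDA occupancy API will report it; a minimal sketch against the saxpy kernel above (256-thread blocks, no dynamic shared memory, 2,048 threads per SM assumed for H100):

int blocksPerSM = 0;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, saxpy, 256, 0);

// Occupancy = resident threads per SM / hardware limit.
float occupancy = (blocksPerSM * 256) / 2048.0f;
printf("theoretical occupancy: %.0f%%\n", occupancy * 100.0f);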

That is the entire mental model: arithmetic intensity tells you which roof you are hitting, occupancy tells you whether you are even on the roof. Everything else (warp efficiency, memory coalescing, shared memory bank conflicts) is a way to climb a little higher.

The interesting work in 2026 is happening just below those rooflines, where 75% of peak is the new floor for serious attention kernels and 90% is the active research frontier. The tools to write those kernels are better than ever. The hardware (Hopper, Blackwell) is finally fast enough that getting the kernel right matters more than getting more silicon. If you build AI infrastructure and you have not stared at an ncu profile in a while, this is the year to start.

Frequently Asked Questions

  • What is a CUDA kernel? A CUDA kernel is a function written in CUDA C++ (or a CUDA-like language) that runs on the GPU instead of the CPU. When you launch a kernel, you tell the GPU how many copies of that function to run in parallel. Each copy is a thread. Threads are organized into blocks, which are organized into a grid. A typical kernel launch starts thousands or millions of threads at once, all running the same code on different pieces of data. That parallelism is the entire point of using a GPU.
  • What does the triple-chevron launch syntax mean? The syntax kernel<<<gridDim, blockDim>>>(args) is how CUDA C++ launches a kernel from host (CPU) code. gridDim is the number of blocks in the grid. blockDim is the number of threads in each block. Each can be a 1D integer or a 3D dim3. So vecAdd<<<4, 256>>>(A, B, C) launches 4 blocks of 256 threads each, for 1,024 total threads. Each thread can identify itself with the built-in variables threadIdx, blockIdx, blockDim, and gridDim.
  • What is a warp, and what is warp divergence? A warp is a group of 32 threads that the GPU executes together in lockstep. Every CUDA SM schedules warps, not individual threads. When all 32 threads in a warp take the same branch of an if statement, the warp runs at full speed. When they diverge (some take the true branch, some take the false), the warp serializes, executing each path with the inactive threads masked off. This is called warp divergence and it can cut performance in half or worse. It is the most common source of mysterious slowdowns in CUDA code.
  • Why does shared memory matter so much? Global memory (HBM) on an H100 reaches around 3 TB/s of bandwidth, but with ~400 cycles of latency. Shared memory inside an SM hits around 10 TB/s with ~30 cycles. Loading a tile from HBM into shared memory once and reusing it across many threads can be the difference between a memory-bound kernel and a compute-bound one. FlashAttention's entire trick is reorganizing attention so that the working set fits in shared memory and is reused, instead of materializing the N×N attention matrix in HBM.

  • What is FlashAttention? FlashAttention is a fused CUDA kernel for the attention operation in Transformers. The naïve implementation reads and writes the N×N attention matrix to global memory, which dominates runtime on long contexts. FlashAttention tiles the computation so that the intermediate values live in shared memory and are never written to HBM, reducing memory traffic from O(N²) to O(N). FlashAttention-3 (Tri Dao et al., 2024) added warp specialization, GEMM-softmax pipelining, and FP8 support to reach 740 TFLOP/s on H100, around 75% of theoretical peak.
  • Do I need to write CUDA kernels myself? Most people do not. PyTorch and JAX call into highly optimized kernels (cuBLAS, cuDNN, FlashAttention, FlashInfer, vLLM's custom ops) for you. But if you are doing serious AI infrastructure work, custom kernel performance is the bottleneck. The 2026 stack gives you three productive entry points: Triton (Python-like, JIT-compiles to PTX), ThunderKittens (a CUDA DSL with 16×16 tiles as the unit of work), and CUTLASS (NVIDIA's C++ template library, what FA3 is built on). All three remove a large fraction of the boilerplate and most of the bugs.
