Optimise LLM Inference Throughput from First Principles (Part I)

This article explores why the most intuitive lever for throughput, batching concurrent requests, is so effective at improving inference throughput. By dissecting the underlying computations of a transformer during inference on a single GPU setup, this article aims to establish a rigorous mental model of the LLM inference workload. This serves as a foundational step toward building a full inference cost model, allowing us to predict token throughput for a given model and hardware before deployment, which we will conclude at the end of this blog series.

1. Lifetime of a prompt

In inference, a prompt goes through two stages: prefill and decode. Prefill is a single forward pass that processes all the prompt's tokens in parallel and produces the first output token. Decode then generates each subsequent output token one by one — one forward pass per token.

In the diagram below, all the orange tokens are processed in a single prefill forward pass to produce the first blue token ("jumps"). The K and V vectors are cached as they're computed,For the motivation behind KV caching, see this primer. The short version: attention re-reads every previous K and V at every step, so caching them turns an O(N²) recompute into an O(N) read. so every subsequent forward pass only has to operate on a single token — the latest output token.

Prefill and decode stages of a transformer inference, showing prompt tokens processed in parallel during prefill and one token at a time during decode
Illustration from https://huggingface.co/blog/tngtech/llm-performance-prefill-decode-concurrent-requests.

These two stages have contrasting performance characteristics: prefill is compute-bound, decode is memory-bound. To see why, let's look at what actually happens inside a transformer layer.

What one forward pass does

The computation graph of a transformer LM starts with token embedding, then a stack of transformer layers (each contains one attention and one MLP layer), and finishes with an unembedding projection into the final probablity distribution over the vocabulary called logits.

Transformer compute graph: token embedding, residual block with attention heads and MLP, unembedding to logits
Illustration from https://transformer-circuits.pub/2021/framework/index.html.

The key architectural feature for our purposes is the residual stream running across every layer. It has the same dimension everywhere — starting as the input embedding (at x₀ in the diagram) and ending as the input to the unembedding layer. Attention and MLP layers are best understood as transformations on the stream: each one reads the stream, computes something, and adds its result back. You can think of the residual stream as embedding vectors of tokens being updated over time.3Blue1Brown has a nice visualisation of this on YouTube: each token's vector is gradually rewritten by successive layers as more context flows in.

Residual stream visualization: attention output added back to the residual stream as new embedding vectors for each token

The final logits are produced by applying the unembedding only to the residual stream of the last token.

Unembedding diagram: residual stream of the last token projected to vocabulary logits

Together with KV cache mechanism, this means decode only has to process one token in its forward pass, while prefill processes all input tokens at once. That alone is the first heuristic for why prefill is compute-bound while decode is memory-bound.

Let's actually compare the number of arithmetic operations in prefill vs decode to ground this heuristic.

Prefill is compute-bound, decode is memory-bound

This section will calculate the arithmetic intensity (number of arith operations / number of memory transfer in bytes) of prefill and decode and draw a roofline model on A100 GPU. Let's start by counting the floating-point operations FLOPs not to be confused with FLOPS, which is FLOPs per second. (the main arithmetic operations in transformer) in one transformer layer for a concrete model: Qwen3-32B. Here is the configuration of the model:

d_model    / hidden_size  = 5120
d_ff       / intermediate = 25600
n_layers                  = 64
n_heads                   = 64
n_kv_heads                = 8       (Grouped Query Attention)
d_head                    = 128
d_q_total  = n_heads    × d_head = 8192
d_kv_total = n_kv_heads × d_head = 1024
dtype      = bf16 = 2 bytes/param
Attention variants: MQA, GQA, MHA, MLA showing different sharing patterns of key/value heads
Illustration from https://verticalserve.medium.com/group-query-attention-58283b337c65

Qwen3 uses GQA, which reduces the number of KV heads by assigning each KV head to a group of query heads. For Qwen3-32B, each KV head is shared across 64 / 8 = 8 query heads.

At 32B parameters × 2 bytes, the model is ~64 GB. Spread over 64 layers, one layer of weights is ~1 GB — keep that number, it's the denominator of every arithmetic-intensity calculation below.

The one rule everything reduces to

A matmul between two matrices [M, K] @ [K, N] produces output matrix of size [M, N] and costs 2 × K × M × N FLOPs. This is because each of the M × N output entries is an inner product over K (K multiplies and K-1 adds), which combines into the 2 × K term. Every estimate below is an instance of this rule.

Matrix multiplication FLOP visualization: each output entry requires K multiplies and K-1 adds

Attention FLOPs

The attention layer breaks down into two main matmul groups: projections (Q, K, V, O) and attention score computation. The residual stream x is projected into Q, K, V via matmuls, attention scores are computed per head, the per-head outputs are concatenated, and the result is projected back to the residual stream via O.

Multi-head attention layer: residual stream X projected to per-head Q, K, V; attention computed per head; concatenated and projected by W_O
Illustration from https://arxiv.org/pdf/2408.06357

The diagram below is a handy reference for the matrix dimensions involved in the following FLOPs calculation.

Attention dimensions: input X of shape (n, d) projected by W_Q, W_K, W_V to Q, K, V; QK^T softmax times V produces output Z
Illustration from https://sebastianraschka.com/blog/2023/self-attention-from-scratch.html

Attention projections (Q, K, V, O)

Each projection is a matmul of dimension [N , d_model] @ [d_model, d_head]N is the number of tokens. A prompt of 4096 token is used as input for this example., costing 2 × d_model × N × d_head = 2 × 5120 × N × 128 FLOPs per head. Total FLOPs scale with the number of heads — and under GQA, the number of Q heads and KV heads differs, so each has to be counted separately:

Q : 2 × 5120 × N × 128 × 64   (64 query heads)
K : 2 × 5120 × N × 128 ×  8   (8  kv heads)
V : 2 × 5120 × N × 128 ×  8   (8  kv heads)
O : 2 × 5120 × N × 128 × 64   (output projection)

For prefill (N = 4096): 343.6 B + 43.0 B + 43.0 B + 343.6 B ≈ 0.77 T FLOPs.
For decode (N = 1) ≈ 0.19 B FLOPs.

Attention scores

Prefill: N queries attend over N keys and values. Per query head:

Q @ K^T     : [N, d_head] @ [d_head, N] → 2 × N² × d_head
softmax @ V : [N, N] @ [N, d_head] → 2 × N² × d_head
Total       : 4 × N² × d_head

Across 64 query heads, total FLOPs = 4 × N² × 128 × 64Notice FLOPs of attention score grows quadratically with number of tokens in prefill.. For N = 4096: 4 × 4096² × 128 × 64 ≈ 0.55 T FLOPs.

Decode: One new query attends over N_cache cached keys and values. Per query head:

Q @ K^T     : [1, d_head] @ [d_head, N_cache] → 2 × N_cache × d_head
softmax @ V : [1, N_cache] @ [N_cache, d_head] → 2 × N_cache × d_head
Total       : 4 × N_cache × d_head

Across 64 query heads, total FLOPs = 4 × N_cache × 128 × 64. For N_cache = 4096: 4 × 4096 × 128 × 64 ≈ 0.13 B FLOPs.

MLP FLOPs

MLP SwiGLU block: linear gate and linear up multiplied through Swish, projected back by linear down
Illustration from https://machinelearningmastery.com/linear-layers-and-activation-functions-in-transformer-models/

Qwen3's MLP has three dense layers — gate, up, and down — wired up as SwiGLU. The output is:

Gate: [N, d_model] @ [d_model, d_ff]    → 2 × N × d_model × d_ff
Up  : [N, d_model] @ [d_model, d_ff]    → 2 × N × d_model × d_ff
Down: [N, d_ff   ] @ [d_ff,    d_model] → 2 × N × d_ff    × d_model
Total                                   : 6 × N × d_model × d_ff

Prefill (N = 4096) ≈ 3.2 T FLOPs; Decode (N = 1) ≈ 0.79 B FLOPs. The SiLU (Swish) and the elementwise gate × up product are O(N × d_ff) — negligible compared to the matmuls, so they're left out.

Putting the per-stage numbers together

Decode, 1 token, N_cache = 4096, one layer:

Projections (Q, K, V, O) ~0.19 B Attention scores + softmax @ V ~0.13 B MLP (gate + up + down) ~0.79 B ──── Total ~1.1 B FLOPs HBM transfer ~1 GB (weight of one layer) Arithmetic intensity ~1.1 GFLOP / 1 GB ≈ ~1 FLOP/byte

Prefill, S = 4096 tokens, one layer:

Projections ~0.77 T Attention QK^T + softmax @ V ~0.55 T MLP ~3.2 T ──── Total ~4.5 T FLOPs HBM transfer ~1 GB (loaded once, reused across all 4096 tokens) Arithmetic intensity ~4.5 TFLOP / 1 GB ≈ ~4500 FLOPs/byte

In this simplified calculation, memory transfer does not take into account KV cache transfer in decode which would further decrease its arithmetic intensity. Section 3 will discuss the size of KV cache in more details.

An A100 GPU has ~2 TB/s HBM bandwidth and ~312 TFLOP/s for bf16 compute. The roofline crossover — the point at which a kernel flips from memory-bound to compute-bound — sits at 312 / 2 = 156 FLOPs/byte.

Decode has arithmetic intensity of ~1, making it massively memory-bound, compute cores almost idle, the whole step bottlenecked on streaming 1 GB of weights from HBM. On the other hand, prefill lands at ~4500, well above 156, making it solidly compute-bound.

Prefill TFLOP/s 1 10 100 312 1 10 1K 10K 156 memory-bound compute-bound prefill Arithmetic intensity (FLOP/byte) Decode TFLOP/s 1 10 100 312 1 10 1K 10K 156 memory-bound compute-bound decode Arithmetic intensity (FLOP/byte)

So how can we increase the arithmetic intensity of decode to better utilise GPU compute?

The MLP dominates either way

As a final note, whether the bottleneck is memory (decode) or compute (prefill), the dominant cost on that axis is the MLP. Attention stays comparatively cheap until the context grows long — the quadratic N² attention term in prefill only overtakes the linear projection term around N ≈ 6K for these dimensions, well past the 4K we're using here.

ProjectionsAttentionMLPMLP share
Decode (N = 1)0.19 B~0.13 B0.79 B~71%
Prefill (N = 4096)0.77 T0.55 T3.2 T~71%

2. Batching behaviour of the inference system

We have established that decode stage of inference drastically underutilised GPU compute. To increase decode's arithmetic intensity, we need to give it more work to do or reduce the amount of memory transfer. To increase the amount of work without increasing memory transfer, we can batch multiple decode tokens together. This effectively means running multiple prompts concurrently.

Suppose the GPU has capacity to process at most 10 tokens in parallel at one time. In step 0 of the figure below, three prompts are prefilled and the token budget is fully occupied (GPU utilisation at 100%). By step 3, all three have transitioned to decode and the GPU utilisation falls to 30%. Imagine starting with 10 concurrent prompts, GPU utilisation will always be 100% and throughput (tokens per second) remains highAverage Time to First Token might take longer as there are more prefills..

Scheduler diagram: three prompts R1, R2, R3 with a token budget of 10. Step 0 has full budget filled by prefill; subsequent steps trail off as requests enter decode and the budget empties.
Illustration from https://vllm.ai/blog/2025-01-27-v1-alpha-release.

Throughput scaling

In general, prefill throughput is essentially flat with concurrency / batch size Batch Size / Concurrency / Concurrent Requests all refer to the number of concurrent prompts. as a single long prefill already saturates compute, while decode throughput scales close to linearly with concurrency. As such, to increase the overall inference throughput, we just need to increase concurrency.

Two bar charts: prefill throughput is flat across batch sizes 1, 2, 4, 8 around 4K-5K tokens/sec. Decode throughput rises linearly from ~10 to ~800 tokens/sec as batch size grows from 1 to 64.
Illustration from https://arxiv.org/abs/2403.02310

First principle of throughput optimisation: increase concurrency until compute or memory is saturated.

How to know when compute or memory is saturated?

Continue with the example workload from Section 1, the roofline crossover sits at 156 FLOP/byte and decode arithmetic intensity is ~1.1 FLOP/byte per token. Therefore, the theoretical maximum batch size before compute saturates is therefore 156 / 1.1 ≈ 141 concurrent requests. Memory will be saturated by the growing KV caches from more concurrent requests. This begs the question of whether memory or compute is the bottleneck?


3. The memory wall: KV cache sizing

In the previous section, we determined the compute ceiling of running Qwen3-32B on one A100 GPU sits at ~141 concurrent requests. But in practice, you almost never get there as the GPU runs out of memory for the KV cache first.

KV cache memory formula

The cache stores one K vector and one V vector, each of dimension n_kv_heads × d_head for every token of the prompt (both input and output tokens).

KV_cache_bytes_per_token = 2 (K and V) × n_layers × n_kv_heads × d_head × dtype_bytes

For Qwen3-32B (n_layers = 64, n_kv_heads = 8, d_head = 128, bf16):

KV_cache_bytes_per_token = 2 × 64 × 8 × 128 × 2 ≈ 256 KB / token

A 4096-token context is therefore ~1 GB of KV cache. A batch of 141 such requests would be ~141 GB which is larger than a single A100's HBM. And this is with GQA shrinking the cache 8× compared to full MHAThere has been mutliple optimisation techniques for KV caches such as DeepSeek's MLA — which compresses K and V into a small latent vector for KV cache storage and only expands them at attention time or FP8 KV cache quantisation: https://vllm.ai/blog/2026-05-11-turboquant..

Memory budget on a single GPU

Sizing the KV cache pool comes down to subtracting everything else from total HBMActivation buffers store the intermediary outputs of each operation in a forward pass before it gets passed into the next operation. It is set as the peak activation which is often the output of the MLP layer. For Qwen3-32B, the peak activation would be of size 2x[B, N, d_ff] for the output of gate and up projections.:

Total HBM (A100 80GB) 80 GB ├─ Model weights (32B × bf16) ~64 GB ├─ Activation buffers ~0.5–2 GB ├─ Other overhead ~1 GB (CUDA graph, logits buffer, cuBLAS, etc) └─ Remaining → KV cache pool (~13 GB)

What's left over for the KV cache pool on an 80 GB A100 is roughly 13 GB — enough for about 13 concurrent requests at 4K context. Comparing this to the compute-saturation ceiling (~141 requests) makes the asymmetry vivid: the memory wall arrives an order of magnitude before the compute wall.

4. Profiling and tuning

We have established the key conceptual foundations of LLM inference. This section discusses practical details when optimising throughput using vLLM.

vLLM exposes kv_cache_usage_pct as a live metric. Because KV cache usage is roughly linear in concurrent request count (assuming similar context lengths), it's a clean knob: tune max_num_seqs (max number of concurrent requests) to land at whatever steady-state utilisation you want. 85–90% is a reasonable target — high enough to push throughput, with slack for transient spikes.

The following run sweeps concurrency on a real serving config. At concurrency 512 (brown line), KV cache averages ~30% — plenty of headroom. Doubling to 1028 (green) doubles cache usage to ~60% and decode throughput roughly doubles in turn, exactly the linear scaling discussed in Section 2.

vLLM kv_cache_usage chart over time: brown line at concurrency 512 plateaus near 30 percent; green line at concurrency 1028 plateaus near 60 percent.
KV cache utilisation at concurrency 512 (brown) vs 1028 (green).
vLLM decode throughput chart: green line at concurrency 1028 peaks above 2000 tokens/sec, roughly double the brown 512 line.
Decode throughput at the same two configs. Green is ~2× brown, matching the cache-usage ratio.

Encouraged by the headroom, I pushed concurrency further to 1200 (blue line). KV cache now saturates at 100%, and against expectation, throughput collapses:

vLLM kv_cache_usage chart at concurrency 1200: blue line saturates at 100 percent across most of the run; green 1028 line tops out near 60 percent for comparison.
At concurrency 1200, cache usage pins at 100%.
vLLM decode throughput chart at concurrency 1200: blue line is volatile and averages well below the green 1028 baseline, hovering around 300-500 tokens/sec.
Throughput at concurrency 1200 collapses below 1028, despite higher KV usage.

Preemption: what happens when the cache fills

The throughput collapse is due to preemption. When a new request arrives and there isn't enough KV cache memory for it, vLLM has to evict one or more running requests to make space. There are two eviction modes:

Either way the cost is real: a request preempted once roughly doubles its end-to-end latency.

As such, one should be careful when tuning concurrency with kv cache. Two metrics to watch:

Second principle of throughput optimisation: push concurrency to fill up the KV cache, but stop before triggering preemption. The optimum sits just below 100% cache utilisation.


References