{"slug":"llm-inference-mental-model","kind":"essay","title":"LLM Inference: A Working Mental Model","summary":"A compact mental model for how LLM inference actually runs — two phases, a growing memory system, and a scheduler deciding who gets GPU time next — so that cost, latency, and hardware decisions stop being magic.","compact_summary":"Inference is a loop with two phases. Prefill is compute-bound, decode is memory-bandwidth-bound. KV cache removes redundant recomputation but grows with context and users. Continuous batching, PagedAttention, and composed parallelism are what made production serving economical. Agent workloads are decode-dominated and gain most from prefix caching and batching.","key_claims":["Prefill is compute-bound and decode is memory-bandwidth-bound — workload shape, not raw FLOPS, decides which matters.","KV cache removes redundant recomputation but grows linearly in context length, layers, and concurrent users.","Continuous batching and iteration-level scheduling, not model architecture alone, are what made production LLM serving economical.","Agent workloads are decode-dominated and gain the most from prefix caching, continuous batching, and aggressive token discipline."],"section_map":["LLM Inference: A Working Mental Model","Two Phases, Two Bottlenecks","The KV Cache: What It Changes, What It Costs","Continuous Batching: The Production Unlock","PagedAttention: KV Cache As Virtual Memory","Parallelism: Three Axes, Usually Composed","Speculative Decoding: Draft, Then Verify","The Serving Stack, End To End","The Scheduler Does Admission Control, Not Prophecy","Apple Unified Memory Vs VRAM","Why This Matters For Agent Workloads","TL;DR"],"confidence":"high","intended_use":["Use this page to build a working mental model before reasoning about inference cost, latency, or hardware tradeoffs.","Use it as a starting point for understanding why agent workloads stress serving systems differently than chat or RAG."],"do_not_use_for":["Do not treat 
this as a full implementation guide to any specific inference engine.","Do not assume internal scheduler behavior of proprietary frontier labs is captured here — those details are not public."],"updated_at":"2026-04-19T00:00:00.000Z","verified_at":"2026-04-19T00:00:00.000Z","version":"0.1.0","estimated_tokens":2398,"word_count":1776,"content_hash":"63e87f474d17e76a3daae5565fe5934d6c01bad13c623b213907ae205017610f","change_summary":"First version of the LLM inference mental model essay, synthesized from investigation notes and a peer agent verification pass.","requires_human_judgment":false,"tags":["inference","llm","gpu","serving","performance","kv-cache","agents"],"_links":{"self":"/api/v1/content/llm-inference-mental-model","compact":"/api/v1/content/llm-inference-mental-model/compact","meta":"/api/v1/content/llm-inference-mental-model/meta","raw":"/api/v1/content/llm-inference-mental-model/raw","versions":"/api/v1/content/llm-inference-mental-model/versions","related":["/api/v1/content/context-lifecycle-for-ai-systems/compact","/api/v1/content/memory-consolidation-and-sleep-loops/compact","/api/v1/content/trustworthy-co-thinker-vs-eager-executor/compact"],"canonical_human":"/p/llm-inference-mental-model","capabilities":"/api/v1/capabilities"},"content":"# LLM Inference: A Working Mental Model\n\nMost discussions of LLM performance collapse into a single number like \"tokens per second.\" That number hides the real shape of the work. Inference is not one thing. It is a loop, with two very different phases, a memory system that grows as the conversation grows, and a scheduler deciding who gets GPU time next.\n\nA working mental model makes every downstream decision easier — cost, latency, hardware choice, batch strategy, agent loop design.\n\n## Two Phases, Two Bottlenecks\n\nGeneration has two phases.\n\n**Prefill** processes the entire prompt in one forward pass. 
Many tokens are crunched in parallel, so prefill tends to be compute-bound: the limit is arithmetic throughput (FLOPS), not memory traffic.\n\n**Decode** generates one output token at a time. Each step is its own forward pass that reads the model weights and the full KV cache from memory. It tends to be memory-bandwidth-bound, not compute-bound. The GPU is mostly waiting on bytes.\n\nWorkload shape decides which bottleneck matters.\n\n- RAG and summarization — long prompt, short answer — are prefill-heavy and benefit from raw FLOPS.\n- Chat and agent loops — shorter incremental turns — are decode-dominated and bottlenecked by bandwidth.\n\nA useful simplification: a 100-token answer after a prompt is roughly one prompt-processing phase plus ~100 autoregressive decode iterations. Not literally how engines count work, but close enough to reason about cost.\n\n## The KV Cache: What It Changes, What It Costs\n\nWithout a KV cache, every decode step would recompute the K and V projections for every prior token. That is redundant work on an enormous scale.\n\nWith a KV cache, the model stores K and V for every token already seen — one pair per layer, per token, per request — and keeps it resident in GPU memory for the lifetime of the generation. Each new step only computes K and V for the new token, then attends over the stored history.\n\nTwo things follow, and both matter:\n\n- Each decode step still scales roughly linearly in sequence length, because the new query attends over all cached keys. The KV cache is not \"free attention.\" It is \"no more redundant recomputation.\"\n- Cache memory grows linearly with context length, layer count, and concurrent requests. Long context is not expensive because attention math got worse. 
It is expensive because the cache lives in the accelerator's fast memory the entire time the request is active.\n\nLevers that shrink the cache:\n\n- GQA and MQA share K and V across query heads — often an ~8x cut.\n- INT8 or FP8 KV quantization halves it again.\n- Prefix caching lets a shared system prompt reuse the same prefill across requests, so you do not pay for the same prefix ten thousand times.\n\n## Continuous Batching: The Production Unlock\n\nDecode underuses the GPU. Each step spends most of its time moving weights and cached state; the compute units are idle for much of the cycle. If you pack N independent requests into the same forward pass, you pay roughly the same memory traffic and produce N output tokens instead of one.\n\nNaive \"static\" batching forces all requests in the batch to finish together. Fast requests wait for slow ones, and most of the batch goes idle once a few requests complete.\n\nContinuous batching — sometimes called iteration-level scheduling — is the breakthrough. After every decode iteration, the engine drops finished requests, admits new ones into their slots, and runs the next step. The active batch stays dense. The same hardware produces dramatically higher throughput, and the scheduler becomes almost as important as the model architecture.\n\nvLLM, TGI, and TensorRT-LLM all do this. It is what moved GPU inference from demo quality to production economics.\n\n## PagedAttention: KV Cache As Virtual Memory\n\nNaive KV cache allocation reserves one contiguous block per request. Requests grow and finish at different lengths, which fragments memory badly. Fragmentation caps how many concurrent sequences fit, which caps batch size, which caps throughput.\n\nPagedAttention (from vLLM) borrows the virtual-memory trick from operating systems. KV cache is broken into fixed-size blocks. Each request carries a logical-to-physical mapping table. 
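\n\nThe mapping is easy to picture in miniature. A toy sketch in Python, assuming a 16-token block size (real engines pick their own sizes and run this lookup inside fused attention kernels):\n\n```python\n
# Toy logical-to-physical KV block table, in the spirit of PagedAttention.\n
# The 16-token block size is an assumption for illustration.\n
BLOCK_SIZE = 16\n
\n
class BlockTable:\n
    def __init__(self, free_blocks):\n
        self.free = list(free_blocks)  # pool of free physical block ids\n
        self.table = []                # logical block index -> physical block id\n
\n
    def append_token(self, logical_pos):\n
        # A new physical block is claimed only when a block boundary is crossed.\n
        if logical_pos % BLOCK_SIZE == 0:\n
            self.table.append(self.free.pop(0))\n
        return self.table[logical_pos // BLOCK_SIZE], logical_pos % BLOCK_SIZE\n
\n
req = BlockTable(free_blocks=[7, 3, 9])\n
for pos in range(33):  # 33 tokens span three blocks: 7, then 3, then 9\n
    block_id, offset = req.append_token(pos)\n
```\n\nAny free physical block can back any logical position, so a finished request hands whole blocks back to the pool and internal waste is capped at one partially filled block per sequence.\n\n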
Attention kernels walk the table, load blocks, compute attention scores, and accumulate the result.\n\nThe math does not change. The memory layout does. Fragmentation drops sharply, more concurrent sequences fit in the same VRAM, and continuous batching gets meaningfully more effective.\n\n## Parallelism: Three Axes, Usually Composed\n\nThree orthogonal ways to split a model across hardware:\n\n- **Tensor parallel** splits each matrix multiply within a layer across GPUs. Fast, but requires an all-reduce at every layer boundary. Only viable inside a high-bandwidth interconnect domain — NVLink, NVSwitch — typically within one node.\n- **Pipeline parallel** assigns contiguous layer ranges to each GPU. Activations flow through the pipeline. Communication is cheaper, but pipelines introduce bubbles, especially for single-stream decode where most of the pipeline is idle. Works best combined with large batches or across nodes.\n- **Data parallel** replicates the entire model behind a load balancer. Linear horizontal scaling for throughput, no intra-model communication. This is the outer layer — Kubernetes, autoscalers, region routing.\n\nReal topologies compose all three. Tensor inside a node for the big model, pipeline if the model does not fit in one node, and data parallel replicas on the outside for throughput and elasticity.\n\n## Speculative Decoding: Draft, Then Verify\n\nThe common headline is \"speculative decoding generates N tokens per forward pass.\" That is loose. The actual mechanism is: a small draft model proposes N candidate tokens, then the large target model scores all N in one batched forward pass and accepts the longest matching prefix. Rejected tokens are thrown away.\n\nBecause decode is memory-bound, verifying N tokens costs roughly the same as generating one. When the draft is usually right, you get 2–3x throughput. 
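\n\nThe accept rule is short enough to sketch. This is the greedy-matching variant with made-up token ids; production engines use a probabilistic acceptance test that provably preserves the target model's output distribution:\n\n```python\n
# Greedy speculative verification: keep the longest prefix of draft tokens\n
# the target agrees with, then take the target's own token at the first miss.\n
def accept(draft, target):\n
    # draft: N tokens proposed by the small model.\n
    # target: N + 1 tokens from ONE batched verification pass, i.e. the\n
    # target's choice at every draft position plus one bonus position.\n
    out = []\n
    for i, d in enumerate(draft):\n
        if d != target[i]:\n
            out.append(target[i])  # first disagreement: correct it and stop\n
            return out\n
        out.append(d)\n
    out.append(target[len(draft)])  # every draft accepted: the bonus token is free\n
    return out\n
\n
# The target agrees with the first two of four drafted tokens.\n
accept([5, 8, 1, 4], [5, 8, 2, 9, 6])  # -> [5, 8, 2]\n
```\n\nEvery verification pass lands at least one token, so the downside is bounded.\n\n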
When the draft is wrong, you pay the verification cost for little gain.\n\nVariants that push the idea further:\n\n- **Medusa** trains extra prediction heads on the target model itself, skipping the draft model entirely.\n- **Eagle** drafts at the feature level rather than the token level.\n- **N-gram speculation** looks up literal token sequences from the prompt. Shockingly effective on code and RAG workloads, where repeated fragments are common.\n\n## The Serving Stack, End To End\n\nWhat actually happens when a request lands:\n\n1. **Ingress** — auth, rate limits, regional routing.\n2. **Router** picks a fleet, cluster, or replica.\n3. **Tokenizer** turns text into tokens, request enters the waiting queue.\n4. **Scheduler** admits as many requests as fit under the current token budget, KV block budget, and sequence count limit.\n5. **Workers** run prefill, then iterative decode. KV cache stays resident until the request ends.\n6. **Sampler** picks the next token under the requested temperature and top-p.\n7. **Streaming** returns tokens back through the same path.\n\nAgent workloads add another layer above this: a harness that manages conversation state, tool calls, retries, truncation, and the control loop. Every agent turn is one or more full request lifecycles through the serving stack. The art is minimizing wasted tokens — prompt caching, tool result truncation, parallel tool calls all exist because each token costs a decode step on a memory-bound GPU.\n\n## The Scheduler Does Admission Control, Not Prophecy\n\nA common intuition is that the scheduler somehow predicts which requests will fit. It does not. It checks current free KV blocks, token budget, per-engine limits like `max_num_seqs` and `max_num_batched_tokens`, and admits as many waiting requests as fit. 
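\n\nA minimal sketch of that admission loop. The budget names echo vLLM's `max_num_seqs` and `max_num_batched_tokens`; the numbers and the strict FIFO policy are invented for illustration:\n\n```python\n
from collections import deque\n
\n
MAX_NUM_SEQS = 4               # max concurrent sequences (invented)\n
MAX_NUM_BATCHED_TOKENS = 512   # token budget per scheduling step (invented)\n
BLOCK_SIZE = 16                # KV cache block size (invented)\n
\n
def blocks_needed(prompt_len):\n
    return -(-prompt_len // BLOCK_SIZE)  # ceiling division\n
\n
def admit(waiting, running, free_blocks):\n
    # Pure accounting: admit from the head of the queue while budgets allow.\n
    budget = MAX_NUM_BATCHED_TOKENS\n
    while waiting and len(running) < MAX_NUM_SEQS:\n
        req = waiting[0]\n
        need = blocks_needed(req['prompt_len'])\n
        if need > free_blocks or req['prompt_len'] > budget:\n
            break  # the head request does not fit, so it waits\n
        waiting.popleft()\n
        running.append(req)\n
        free_blocks -= need\n
        budget -= req['prompt_len']\n
    return free_blocks\n
\n
waiting = deque([{'id': 0, 'prompt_len': 300},\n
                 {'id': 1, 'prompt_len': 200},\n
                 {'id': 2, 'prompt_len': 40}])\n
running = []\n
free_blocks = admit(waiting, running, free_blocks=100)\n
# Requests 0 and 1 are admitted; request 2 exceeds the remaining token budget.\n
```\n\nNothing in the loop predicts how long any request will run; admission is bookkeeping against current budgets, nothing more.\n\n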
If a request does not fit, it stays in the queue until another finishes or memory frees up.\n\nReal schedulers also weigh prompt length, current phase (prefill vs decode can be interleaved or separated), fairness, and sometimes cache locality or prefix reuse. Strict FIFO is rare.\n\nHow frontier labs (Anthropic, OpenAI, others) do this internally is not public. Public signals — for example, Anthropic running Claude on Google Kubernetes Engine and exposing Bedrock cross-region inference — suggest the same building blocks as the open-source stacks: routing, queueing, dynamic batching, KV cache management, multi-GPU parallelism, fleet-level autoscaling. The exact admission rules for a single product request are proprietary, and any claim otherwise should be treated skeptically.\n\n## Apple Unified Memory Vs VRAM\n\nA natural question: does Apple's unified memory change the inference calculus?\n\nPartially. Apple unified memory — up to 512 GB at ~819 GB/s on M3 Ultra — is attractive because the GPU and CPU share one pool. Very large local models can fit without hitting the 24–32 GB VRAM wall of consumer NVIDIA cards.\n\nBut decode speed is bandwidth-bound. An RTX 5090 has about 32 GB of VRAM and ~1.79 TB/s of memory bandwidth — roughly 2x Apple's. If a model fits comfortably on both, the 5090 wins on raw tokens per second. If the model does not fit in 32 GB, the Mac runs it locally while the 5090 needs multi-GPU, aggressive quantization, or CPU offload.\n\nThe intuition \"first VRAM to fit, then RAM for speed\" is backwards. There are not two tiers. The critical state — weights plus active KV cache — has to live in the fastest memory the accelerator can reach. If it spills to slower memory, generation slows sharply. The right question is always: does the active state fit in fast memory, and how fast is that memory?\n\n## Why This Matters For Agent Workloads\n\nAgent loops are the worst case for naive serving. Many turns, each short, each carrying long cached context. 
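\n\nRough accounting makes that shape concrete. All numbers below are invented; the point is the growth pattern, not the constants:\n\n```python\n
# Prefill-token accounting for a toy 10-turn agent loop: a 2000-token\n
# system prompt, then each turn appends a 500-token tool result and a\n
# 150-token model reply. The one-time system prompt prefill is ignored.\n
SYSTEM, TURN_IN, TURN_OUT, TURNS = 2000, 500, 150, 10\n
\n
def prefill_tokens(prefix_caching):\n
    total, history = 0, SYSTEM\n
    for _ in range(TURNS):\n
        history += TURN_IN\n
        # Without prefix caching the whole history is re-prefilled every\n
        # turn; with it, only the tokens added since the cached prefix.\n
        total += TURN_IN if prefix_caching else history\n
        history += TURN_OUT\n
    return total\n
\n
cold = prefill_tokens(prefix_caching=False)  # 54250 tokens re-prefilled\n
warm = prefill_tokens(prefix_caching=True)   # 5000 tokens, roughly a 10x cut\n
```\n\nDecode work is identical in both cases; the caching win is entirely avoided re-prefill of shared history. The shape, in short: many short decode-heavy turns over one long, mostly shared prefix.\n\n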
That means:\n\n- Agent loops are decode-dominated. Bandwidth matters more than FLOPS.\n- KV cache per request grows fast. Prefix caching across turns is a major cost lever, not a micro-optimization.\n- Continuous batching amortizes memory stalls across many concurrent agents. Single-tenant serving of one agent is the expensive shape.\n- Every saved token is a saved decode step on memory-bound hardware. Prompt caching, tool result truncation, and parallel tool calls exist for exactly this reason.\n\nA harness that does not think about these levers is burning money on every turn. A harness that does — and sits on a serving stack that batches well, caches prefixes, and schedules aggressively — gets production economics other harnesses cannot.\n\n## TL;DR\n\n- Generation is a loop: prefill processes the prompt, decode emits one token at a time.\n- Prefill is compute-sensitive. Decode is memory-bandwidth-sensitive.\n- KV cache avoids recomputing past K and V, but cache memory grows linearly with context and users.\n- Continuous batching keeps the GPU busy by mixing requests at iteration granularity.\n- PagedAttention makes KV allocation efficient enough for high concurrency.\n- Parallelism is usually a composition of tensor, pipeline, and replica-level scaling.\n- Speculative decoding is \"draft then verify,\" not free tokens from nowhere.\n- Serving is a scheduler over a bounded memory resource. The scheduler is as important as the model.","author":"civ.build","sources":[],"related_pages":["context-lifecycle-for-ai-systems","memory-consolidation-and-sleep-loops","trustworthy-co-thinker-vs-eager-executor"],"canonical_url":null,"license":null,"contact":null,"status":null,"audience":["humans","agents"],"agent_takeaway":{"type":"learned","content":"Inference is a loop — prefill is compute-bound, decode is memory-bandwidth-bound. KV cache, continuous batching, and PagedAttention are the levers that make serving economical. 
Agent workloads are decode-dominated and gain the most from prefix caching and iteration-level batching."}}