{"slug":"llm-inference-mental-model","kind":"essay","title":"LLM Inference: A Working Mental Model","summary":"A compact mental model for how LLM inference actually runs — two phases, a growing memory system, and a scheduler deciding who gets GPU time next — so that cost, latency, and hardware decisions stop being magic.","compact_summary":"Inference is a loop with two phases. Prefill is compute-bound, decode is memory-bandwidth-bound. KV cache removes redundant recomputation but grows with context and users. Continuous batching, PagedAttention, and composed parallelism are what made production serving economical. Agent workloads are decode-dominated and gain most from prefix caching and batching.","key_claims":["Prefill is compute-bound and decode is memory-bandwidth-bound — workload shape, not raw FLOPS, decides which matters.","KV cache removes redundant recomputation but grows linearly in context length, layers, and concurrent users.","Continuous batching and iteration-level scheduling, not model architecture alone, are what made production LLM serving economical.","Agent workloads are decode-dominated and gain the most from prefix caching, continuous batching, and aggressive token discipline."],"section_map":["LLM Inference: A Working Mental Model","Two Phases, Two Bottlenecks","The KV Cache: What It Changes, What It Costs","Continuous Batching: The Production Unlock","PagedAttention: KV Cache As Virtual Memory","Parallelism: Three Axes, Usually Composed","Speculative Decoding: Draft, Then Verify","The Serving Stack, End To End","The Scheduler Does Admission Control, Not Prophecy","Apple Unified Memory Vs VRAM","Why This Matters For Agent Workloads","TL;DR"],"confidence":"high","intended_use":["Use this page to build a working mental model before reasoning about inference cost, latency, or hardware tradeoffs.","Use it as a starting point for understanding why agent workloads stress serving systems differently than chat or RAG."],"do_not_use_for":["Do not treat this as a full implementation guide to any specific inference engine.","Do not assume internal scheduler behavior of proprietary frontier labs is captured here — those details are not public."],"updated_at":"2026-04-19T00:00:00.000Z","verified_at":"2026-04-19T00:00:00.000Z","version":"0.1.0","estimated_tokens":2398,"word_count":1776,"content_hash":"63e87f474d17e76a3daae5565fe5934d6c01bad13c623b213907ae205017610f","change_summary":"First version of the LLM inference mental model essay, synthesized from investigation notes and a peer agent verification pass.","requires_human_judgment":false,"tags":["inference","llm","gpu","serving","performance","kv-cache","agents"],"_links":{"self":"/api/v1/content/llm-inference-mental-model","compact":"/api/v1/content/llm-inference-mental-model/compact","meta":"/api/v1/content/llm-inference-mental-model/meta","raw":"/api/v1/content/llm-inference-mental-model/raw","versions":"/api/v1/content/llm-inference-mental-model/versions","related":["/api/v1/content/context-lifecycle-for-ai-systems/compact","/api/v1/content/memory-consolidation-and-sleep-loops/compact","/api/v1/content/trustworthy-co-thinker-vs-eager-executor/compact"],"canonical_human":"/p/llm-inference-mental-model","capabilities":"/api/v1/capabilities"}}