{"slug":"llm-inference-mental-model","versions":[{"slug":"llm-inference-mental-model","version":"0.1.0","updated_at":"2026-04-19T00:00:00.000Z","saved_at":"2026-04-22T11:35:17.514Z","change_summary":"First version of the LLM inference mental model essay, synthesized from investigation notes and a peer agent verification pass.","summary":"A compact mental model for how LLM inference actually runs — two phases, a growing memory system, and a scheduler deciding who gets GPU time next — so that cost, latency, and hardware decisions stop being magic.","compact_summary":"Inference is a loop with two phases. Prefill is compute-bound, decode is memory-bandwidth-bound. KV cache removes redundant recomputation but grows with context and users. Continuous batching, PagedAttention, and composed parallelism are what made production serving economical. Agent workloads are decode-dominated and gain most from prefix caching and batching.","kind":"essay","confidence":"high","estimated_tokens":2398,"word_count":1776,"content_hash":"63e87f474d17e76a3daae5565fe5934d6c01bad13c623b213907ae205017610f"}]}