{"slug":"llm-operator-fundamentals","kind":"essay","title":"LLM Operator Fundamentals: Temperature, Attention, Caching","summary":"The three knobs that actually change what an LLM system ships — temperature, attention design, and prompt caching — explained at the operator level, not the research level.","compact_summary":"Temperature controls determinism and should be set per task, not left at the API default. Attention is a finite budget that structured prompts and focused retrieval spend well. Prompt caching has specific semantics — cache boundaries, breakpoints, and prefix stability — that decide whether an agent costs cents or dollars per session.","key_claims":["Temperature should be set deliberately per task. Any agent that acts — tool calls, code, structured output — should default to 0 and only be raised for a concrete reason.","Attention is a finite budget. Structured system prompts, focused retrieval, and deliberate instruction placement concentrate attention where it belongs; filler dilutes every real instruction.","Prompt caching is a design decision, not a default. 
Cache boundaries, multiple breakpoints, and prefix stability determine whether an agent costs cents or dollars per session.","Temperature, attention design, and cache layout are three views of the same problem — spending the model's finite determinism, focus, and budget deliberately."],"section_map":["Temperature — The Wrong Knob To Leave At Default","Attention — What Your Prompt Is Actually Competing For","Prompt Caching — The Lever Nobody Talks About Until The Bill Arrives","Why These Three Together"],"confidence":"high","intended_use":["Use this page to set deliberate defaults before building an agent, RAG pipeline, or tool-calling system.","Use it as a checklist when debugging unreliable tool-call arguments, lost-in-the-middle retrieval, or surprising inference bills."],"do_not_use_for":["Do not treat cache pricing specifics as authoritative over time — provider economics shift and should be verified against current SDK docs.","Do not skip empirical evaluation; these knobs are design starting points, not replacements for measurement on your own workload."],"updated_at":"2026-04-19T00:00:00.000Z","verified_at":"2026-04-19T00:00:00.000Z","version":"0.1.0","estimated_tokens":1643,"word_count":1217,"content_hash":"f7744ac7ec2f0cdce7558e8f3d4a2b922d922213772241f63ccc8ba0687f6b1f","change_summary":"Initial public 
version.","requires_human_judgment":false,"tags":["llm","agents","prompt-engineering","attention","prompt-caching","temperature","operator"],"_links":{"self":"/api/v1/content/llm-operator-fundamentals","compact":"/api/v1/content/llm-operator-fundamentals/compact","meta":"/api/v1/content/llm-operator-fundamentals/meta","raw":"/api/v1/content/llm-operator-fundamentals/raw","versions":"/api/v1/content/llm-operator-fundamentals/versions","related":["/api/v1/content/llm-inference-mental-model/compact","/api/v1/content/context-lifecycle-for-ai-systems/compact","/api/v1/content/trustworthy-co-thinker-vs-eager-executor/compact"],"canonical_human":"/p/llm-operator-fundamentals","capabilities":"/api/v1/capabilities"},"content":"# LLM Operator Fundamentals\n\nMost of what matters about running LLMs in production is not in the API docs as a coherent story. It lives in scattered blog posts, model cards, and inference bills that only make sense in hindsight.\n\nThree knobs account for most of the quality and cost difference between a careless agent and a serious one: temperature, attention design, and prompt caching. None of them are research topics. All of them are operator decisions that get left on default far too often.\n\n## Temperature — The Wrong Knob To Leave At Default\n\nTemperature controls how deterministic the model's token choices are. Nominally 0 to 2; above 2 it degenerates into random sampling and stops being useful.\n\nThe mental model: grounded task gets low temperature; divergent task gets high temperature. If the evaluation metric is \"is this correct,\" you want low. If it is \"how many distinct good options,\" you want high.\n\n| Task | Temperature | Why |\n| --- | --- | --- |\n| Tool calling | 0 | Tool args need to be exact. Diversity is noise. |\n| Code generation | 0 | Syntax is narrow. Creativity is almost always wrong. |\n| Summarization | ~0.3 | Faithful with some paraphrasing latitude. 
|\n| Troubleshooting | 0.3–0.5 | Focused, with room for non-obvious connections. |\n| Brainstorming | 0.8–1.0 | Diversity is the point. |\n\nThe trap: teams leave temperature at the API default (often 1.0) and then wonder why their tool-calling agent picks slightly different argument names every run. The default for any agent that acts — tools, code, structured output — should be 0. Only raise it when you can name a concrete reason.\n\n## Attention — What Your Prompt Is Actually Competing For\n\nThe attention mechanism is the core of what makes transformers work. When the model generates the next token, attention decides which parts of the input to focus on. Knowing how it works — at the operator level, not the math level — changes how prompts get written, how RAG gets structured, and how tool sets get designed.\n\nThe filing-cabinet analogy is the one that sticks:\n\n```\nYou have a QUERY:     \"What failed during the deployment?\"\nThe cabinet has KEYS: labels on each drawer (each token's identity)\nBehind each key is a VALUE: the actual content in that drawer\n\nAttention = match your query against all keys,\n            then pull content from the drawers that match best\n```\n\nEvery input token gets projected into three vectors:\n\n- **Q (Query)** — what am I looking for?\n- **K (Key)** — what do I contain?\n- **V (Value)** — here's my actual content.\n\nThe model scores tokens against Q, softmaxes the scores into weights, and returns a weighted sum of V vectors. Tokens with the strongest Q·K alignment dominate the answer.\n\n### Operator Implications\n\n- **Structured system prompts give the model strong K vectors.** That is why decision tables outperform prose wishes — clear instructions are sharper hooks for the attention mechanism. \"Always validate input. 
Never call the X API before Y.\" beats \"please be careful with the X API.\"\n- **RAG retrieves top-K chunks precisely because dumping everything would drown attention.** Small focused retrieval concentrates attention where it belongs.\n- **Chunk size and top-K trade against each other.** 5 chunks × 2k tokens = 10k of context (attention has a chance). 20 chunks × 2k tokens = 40k (attention dilutes, middle tokens get lost).\n- **Recent and first tokens get more attention than middle tokens.** \"Lost in the middle\" is real. Put the most important instructions at the start or the end of long contexts. Do not bury them.\n\n### Attention Budget Rules\n\nThese are the rules that survive contact with real agent work:\n\n1. Keep system prompts focused and small. Every instruction competes for attention; filler weakens every real instruction.\n2. Retrieve 5 RAG chunks, not 20. Right-size chunks so each carries one complete idea.\n3. Do not run 5 tools in parallel inside one agent. Each tool's output competes for attention against the others. If you genuinely need 5 tools, spawn sub-agents with narrow tasks — or group tool calls by data type so they fetch similar information together.\n4. Decision tables with explicit instructions outperform prose every time.\n\nThe useful reframe: the model's attention is a finite budget. A good prompt spends that budget on the right regions.\n\n## Prompt Caching — The Lever Nobody Talks About Until The Bill Arrives\n\nAnthropic's prompt caching is a concrete cost and latency lever, but it has specific semantics that need to be designed around. It is the difference between an agent that costs $0.50 per session and one that costs $5.\n\n### The Core Model\n\n- Cache writes are slightly more expensive than an uncached request (~25% premium for 5-minute TTL).\n- Cache reads are ~90% cheaper on the cached tokens.\n- Default TTL is 5 minutes. 
Extending to 1 hour costs roughly 2x the 5-minute write cost.\n- Minimum cacheable prompt length: 4096 tokens for Opus, 2048 for Sonnet, 4096 for Haiku. Shorter prompts cannot be cached at all.\n\nThe caller explicitly marks what to cache:\n\n```python\nimport anthropic\n\nclient = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment\n\nresponse = client.messages.create(\n    model=\"claude-sonnet-4-6\",\n    max_tokens=1024,  # required by the Messages API\n    system=[\n        {\n            \"type\": \"text\",\n            \"text\": \"Your very long system prompt here... 4000 tokens of instructions...\",\n            \"cache_control\": {\"type\": \"ephemeral\"}  # cache this block\n        }\n    ],\n    messages=[\n        {\"role\": \"user\", \"content\": \"What happened to environment 7ef0f460?\"}\n    ],\n)\n```\n\n### Where The Marker Goes Matters\n\nEverything up to and including the marker is cached; everything after is fresh tokens every call:\n\n```\n[system prompt] [cache_control] [user message]\n|---- cached ----|              |-- not cached --|\n\n[system prompt] [RAG context] [cache_control] [user question]\n|------------ cached ------------------|      |-- not cached --|\n```\n\n### Multiple Cache Breakpoints — The Powerful Pattern\n\nSeveral boundaries can be marked in the same request. 
Each one caches its own prefix, so one cache layer can be reused across many follow-up queries while another floats on top:\n\n```\n[system prompt] [cache_control] [RAG context for env X] [cache_control] [question]\n|-- cache 1 (reused across all questions) --|\n|--------------- cache 2 (reused per environment) -------|\n                                                          |-- never cached --|\n```\n\nA real troubleshooting workflow looks like:\n\n```\nInvestigation 1: \"What caused the OOM?\"\n  [skill prompt: CACHED] [diag output for env X: CACHED] [question 1]\n\nInvestigation 2: \"Was it the same in other environments?\"\n  [skill prompt: HIT] [diag output for env X: HIT] [question 2]\n\nInvestigation 3: (new environment, different ticket)\n  [skill prompt: HIT] [diag output for env Y: MISS, new write] [question 3]\n```\n\nThe skill prompt cache is reused across every ticket. The environment-specific cache is reused across every question about that environment. Only the user question itself pays full cost every turn.\n\n### What Breaks The Cache\n\n- Any change in the cached prefix invalidates it. A one-line edit to the system prompt or a claude.md file wipes the whole cache.\n- Tool definition changes invalidate the cache.\n- A long thread from yesterday (200k tokens) has decayed past TTL — the first message in a new session pays full write cost on all of it. Starting fresh, or seeding with a concise summary, is usually cheaper than carrying a giant stale context — unless losing any detail is unacceptable.\n\n### What To Cache, In Priority Order\n\n1. Long documents being shared as context. Biggest single win.\n2. System prompts past the minimum cache threshold.\n3. RAG retrieval context when the same scope is queried repeatedly.\n4. Do not cache the user question itself — it changes every turn.\n\n### The Implication For Spec Docs\n\nIf every edit to the system prompt or a claude.md file wipes the cache, those files should be stable. 
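That split can be sketched with the system-block shape from the earlier example; the contract text, the volatile note, and the `build_system_blocks` helper are illustrative, not part of any SDK:

```python
# Stable contract: edited rarely, so it is safe to cache. (Illustrative text.)
STABLE_CONTRACT = 'You are a troubleshooting agent. Always validate input.'

def build_system_blocks(volatile_notes):
    # Block 1 carries the cache_control marker: everything up to and
    # including it is the cached prefix. Block 2 sits after the boundary,
    # so it can churn per query without wiping the cache.
    return [
        {'type': 'text', 'text': STABLE_CONTRACT,
         'cache_control': {'type': 'ephemeral'}},
        {'type': 'text', 'text': volatile_notes},
    ]

blocks = build_system_blocks('Today, prioritize tickets tagged sev-1.')
assert 'cache_control' in blocks[0]      # stable contract: cached
assert 'cache_control' not in blocks[1]  # per-query details: not cached
```

Passing this list as the `system` parameter keeps the cached prefix byte-stable across sessions while the second block absorbs the churn.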
Volatile instructions belong outside the cache boundary or in a separate non-cached block. Structure the prompt so the stable contract is cached and only the per-query details churn.\n\n## Why These Three Together\n\nTemperature, attention, and caching look like separate topics. They are not. They are three views of the same problem: the model has finite determinism, finite focus, and finite budget. Operator skill is spending those finite resources where they pay back.\n\n- Wrong temperature → determinism budget gets spent on tasks that needed creativity, or the reverse.\n- Wrong attention design → the model's focus gets spent on filler instead of signal.\n- Wrong cache layout → money and latency get spent on tokens that should have been free.\n\nNone of this is in the API docs as a coherent story. The biggest jump in agent quality rarely comes from a better model — it comes from treating these three knobs as design decisions instead of defaults.\n\nIf an agent is being built and temperature has never been deliberately set per task, instruction placement has never been considered, and no cache boundary has been marked by hand: that is where the next 10x lives.","author":"civ.build","sources":[],"related_pages":["llm-inference-mental-model","context-lifecycle-for-ai-systems","trustworthy-co-thinker-vs-eager-executor"],"canonical_url":null,"license":null,"contact":null,"status":null,"audience":["humans","agents"],"agent_takeaway":{"type":"learned","content":"For production LLM work, set temperature per task (default 0 for any acting agent), design prompts to spend attention on signal not filler, and treat prompt caching as an explicit layout decision with stable prefixes and deliberate breakpoints."}}