LLM Operator Fundamentals

Temperature controls determinism and should be set per task, not left at the API default. Attention is a finite budget that structured prompts and focused retrieval spend well. Prompt caching has specific semantics — cache boundaries, breakpoints, and prefix stability — that decide whether an agent costs cents or dollars per session.
Most of what matters about running LLMs in production is not in the API docs as a coherent story. It lives in scattered blog posts, model cards, and inference bills that only make sense in hindsight.
Three knobs account for most of the quality and cost difference between a careless agent and a serious one: temperature, attention design, and prompt caching. None of them are research topics. All of them are operator decisions that get left on default far too often.
Temperature — The Wrong Knob To Leave At Default
Temperature controls how deterministic the model's token choices are. The range is nominally 0 to 2; toward the top of that range sampling degenerates into near-random token picks and stops being useful.
The mental model: grounded task gets low temperature; divergent task gets high temperature. If the evaluation metric is "is this correct," you want low. If it is "how many distinct good options," you want high.
| Task | Temperature | Why |
|---|---|---|
| Tool calling | 0 | Tool args need to be exact. Diversity is noise. |
| Code generation | 0 | Syntax is narrow. Creativity is almost always wrong. |
| Summarization | ~0.3 | Faithful with some paraphrasing latitude. |
| Troubleshooting | 0.3–0.5 | Focused, with room for non-obvious connections. |
| Brainstorming | 0.8–1.0 | Diversity is the point. |
The trap: teams leave temperature at the API default (often 1.0) and then wonder why their tool-calling agent picks slightly different argument names every run. The default for any agent that acts — tools, code, structured output — should be 0. Only raise it when you can name a concrete reason.
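One lightweight way to enforce "temperature is a per-task decision" is a single lookup table in the codebase instead of literals scattered across call sites. A minimal sketch; the task names mirror the table above and the helper is illustrative, not part of any SDK:

```python
# Per-task temperature defaults, mirroring the table above.
# Unknown task types fall back to 0 -- the safe default for agents that act.
TASK_TEMPERATURE = {
    "tool_calling": 0.0,
    "code_generation": 0.0,
    "summarization": 0.3,
    "troubleshooting": 0.4,
    "brainstorming": 0.9,
}

def temperature_for(task: str) -> float:
    """Return the deliberate temperature for a task, defaulting to 0."""
    return TASK_TEMPERATURE.get(task, 0.0)
```

The fallback encodes the rule from above: anything unrecognized is treated as an acting task and gets 0 until someone can name a concrete reason to raise it.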
Attention — What Your Prompt Is Actually Competing For
The attention mechanism is the core of what makes transformers work. When the model generates the next token, attention decides which parts of the input to focus on. Knowing how it works — at the operator level, not the math level — changes how prompts get written, how RAG gets structured, and how tool sets get designed.
The filing-cabinet analogy is the one that sticks:
You have a QUERY: "What failed during the deployment?"
The cabinet has KEYS: labels on each drawer (each token's identity)
Behind each key is a VALUE: the actual content in that drawer
Attention = match your query against all keys,
then pull content from the drawers that match best
Every input token gets projected into three vectors:
- Q (Query) — what am I looking for?
- K (Key) — what do I contain?
- V (Value) — here's my actual content.
The model scores tokens against Q, softmaxes the scores into weights, and returns a weighted sum of V vectors. Tokens with the strongest Q·K alignment dominate the answer.
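The Q/K/V mechanics above fit in a few lines of numpy. This is the single-query case with none of the real-model machinery (no multiple heads, no masking, no learned projections); the vectors are made up so that the first "token" aligns with the query and the others do not:

```python
import numpy as np

def attention(q, K, V):
    """Single-query scaled dot-product attention.

    q: (d,) query vector; K: (n, d) one key per token; V: (n, d_v) values.
    Scores each token's key against the query, softmaxes the scores into
    weights, and returns the weighted sum of value vectors.
    """
    scores = K @ q / np.sqrt(q.shape[0])     # Q.K alignment per token
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ V, weights

# Three "tokens": the first key points the same way as the query.
q = np.array([1.0, 0.0])
K = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
V = np.array([[10.0], [20.0], [30.0]])
out, w = attention(q, K, V)
# w[0] is the largest weight: the aligned token dominates the answer.
```

Note that every token still gets a nonzero weight — softmax dilutes rather than excludes, which is exactly why flooding the context with filler weakens the tokens that matter.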
Operator Implications
- Structured system prompts give the model strong K vectors. That is why decision tables outperform prose wishes — clear instructions are sharper hooks for the attention mechanism. "Always validate input. Never call the X API before Y." beats "please be careful with the X API."
- RAG retrieves top-K chunks precisely because dumping everything would drown attention. Small focused retrieval concentrates attention where it belongs.
- Chunk size and top-K trade against each other. 5 chunks × 2k tokens = 10k of context (attention has a chance). 20 chunks × 2k tokens = 40k (attention dilutes, middle tokens get lost).
- Recent and first tokens get more attention than middle tokens. "Lost in the middle" is real. Put the most important instructions at the start or the end of long contexts. Do not bury them.
Attention Budget Rules
These are the rules that survive contact with real agent work:
- Keep system prompts focused and small. Every instruction competes for attention; filler weakens every real instruction.
- Retrieve 5 RAG chunks, not 20. Right-size chunks so each carries one complete idea.
- Do not run 5 tools in parallel inside one agent. Each tool's output competes for attention against the others. If you genuinely need 5 tools, spawn sub-agents with narrow tasks — or group tool calls by data type so they fetch similar information together.
- Decision tables with explicit instructions outperform prose every time.
The useful reframe: the model's attention is a finite budget. A good prompt spends that budget on the right regions.
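The chunk arithmetic from the rules above can be made executable as a guard in a retrieval pipeline. A sketch; the 12k-token ceiling is an arbitrary illustrative budget, not a model limit:

```python
CONTEXT_BUDGET = 12_000  # illustrative attention budget, not a model limit

def retrieval_cost(top_k: int, chunk_tokens: int) -> int:
    """Tokens a retrieval configuration spends from the context budget."""
    return top_k * chunk_tokens

def within_budget(top_k: int, chunk_tokens: int) -> bool:
    """True if this top-k / chunk-size combination fits the budget."""
    return retrieval_cost(top_k, chunk_tokens) <= CONTEXT_BUDGET

# 5 chunks x 2k tokens = 10k: fits, attention has a chance.
# 20 chunks x 2k tokens = 40k: blows the budget, attention dilutes.
```

The point of writing it down as code is that the trade-off becomes a reviewable constant instead of an accident of whatever top-k the retrieval library defaulted to.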
Prompt Caching — Keep the Cost Under Control
Anthropic's prompt caching is a concrete cost and latency lever, but it has specific semantics that need to be designed around. It is the difference between an agent that costs $0.50 per session and one that costs $5.
The Core Model
- Cache writes are slightly more expensive than an uncached request (~25% premium for 5-minute TTL).
- Cache reads are ~90% cheaper on the cached tokens.
- Default TTL is 5 minutes. Extending to 1 hour raises the write cost to roughly 2x the base input price (versus 1.25x for the 5-minute write).
- Minimum cacheable prompt length is model-specific: on the order of 1024 tokens for Opus and Sonnet models and 2048 to 4096 for Haiku models (check the current docs for your exact model). Shorter prompts cannot be cached at all.
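Under these mechanics, the cents-versus-dollars gap is just arithmetic. A sketch, assuming a $3-per-million-token base input price and a 50k-token shared prefix — both numbers are illustrative, not quoted pricing; only the 1.25x write and 0.10x read multipliers come from the bullets above:

```python
BASE_PRICE_PER_MTOK = 3.00   # assumed base input price, $/million tokens
WRITE_MULT = 1.25            # cache write premium (5-minute TTL)
READ_MULT = 0.10             # cache reads ~90% cheaper

def session_cost(prefix_tokens: int, fresh_tokens: int, turns: int,
                 cached: bool) -> float:
    """Input-token cost of a session that reuses a shared prefix.

    With caching: turn 1 pays the write premium on the prefix,
    turns 2..n pay the read rate. Without: every turn pays full price.
    """
    per_tok = BASE_PRICE_PER_MTOK / 1_000_000
    if cached:
        prefix_cost = prefix_tokens * per_tok * (WRITE_MULT + READ_MULT * (turns - 1))
    else:
        prefix_cost = prefix_tokens * per_tok * turns
    return prefix_cost + fresh_tokens * per_tok * turns

# 50k-token prefix, 500 fresh tokens per turn, 20 turns:
uncached = session_cost(50_000, 500, 20, cached=False)  # ~$3.03
cached = session_cost(50_000, 500, 20, cached=True)     # ~$0.50
```

Under these assumptions the cached session lands around fifty cents while the uncached one is about six times more, which is the shape of the gap the section opened with.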
The caller explicitly marks what to cache:
```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "Your very long system prompt here... 4000 tokens of instructions...",
            "cache_control": {"type": "ephemeral"},  # cache this block
        }
    ],
    messages=[
        {"role": "user", "content": "What happened to environment 7ef0f460?"}
    ],
)
```
Where The Marker Goes Matters
Everything up to and including the marker is cached; everything after is fresh tokens every call:
[system prompt] [cache_control] [user message]
|------------ cached ---------| |-not cached-|

[system prompt] [RAG context] [cache_control] [user question]
|------------------- cached ----------------| |-not cached--|
Multiple Cache Breakpoints — The Powerful Pattern
Several boundaries can be marked in the same request (the API currently allows up to four breakpoints). Each one caches its own prefix, so one cache layer can be reused across many follow-up queries while another floats on top:
[system prompt] [cache_control] [RAG context for env X] [cache_control] [question]
|---- cache 1 (reused across all tickets) ----|
|---- cache 2 (reused per environment) --------------------------------|
                                                                        |-- never cached --|
A real troubleshooting workflow looks like:
Investigation 1: "What caused the OOM?"
[skill prompt: CACHED] [diag output for env X: CACHED] [question 1]
Investigation 2: "Was it the same in other environments?"
[skill prompt: HIT] [diag output for env X: HIT] [question 2]
Investigation 3: (new environment, different ticket)
[skill prompt: HIT] [diag output for env Y: MISS, new write] [question 3]
The skill prompt cache is reused across every ticket. The environment-specific cache is reused across every question about that environment. Only the user question itself pays full cost every turn.
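The two-layer cache from this workflow can be expressed as a request body. Building it as plain dicts makes the breakpoint placement easy to inspect and test without calling the API; the shape mirrors the Anthropic Messages API, but the prompt text and diagnostic payload here are placeholders:

```python
SKILL_PROMPT = "...troubleshooting skill instructions, several thousand tokens..."

def build_request(diag_output: str, question: str) -> dict:
    """Two cache breakpoints: one after the skill prompt (stable across
    tickets), one after the environment diagnostics (stable per
    environment). Only the question falls outside every cached prefix."""
    return {
        "system": [
            {"type": "text", "text": SKILL_PROMPT,
             "cache_control": {"type": "ephemeral"}},      # cache 1: every ticket
        ],
        "messages": [
            {"role": "user", "content": [
                {"type": "text", "text": diag_output,
                 "cache_control": {"type": "ephemeral"}},  # cache 2: per environment
                {"type": "text", "text": question},        # never cached
            ]},
        ],
    }

req = build_request("diag output for env X", "What caused the OOM?")
```

A second question about env X reuses both layers; a new environment misses only cache 2 and rewrites just the diagnostics block.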
What Breaks The Cache
- Any change in the cached prefix invalidates it from the edit point onward. A one-line edit near the top of the system prompt or a claude.md file effectively wipes the whole cache.
- Tool definition changes invalidate the cache.
- A long thread from yesterday (200k tokens) has decayed past TTL — the first message in a new session pays full write cost on all of it. Starting fresh, or seeding with a concise summary, is usually cheaper than carrying a giant stale context — unless losing any detail is unacceptable.
What To Cache, In Priority Order
- Long documents being shared as context. Biggest single win.
- System prompts past the minimum cache threshold.
- RAG retrieval context when the same scope is queried repeatedly.
- Do not cache the user question itself — it changes every turn.
The Implication For Spec Docs
If every edit to the system prompt or a claude.md file wipes the cache, those files should be stable. Volatile instructions belong outside the cache boundary or in a separate non-cached block. Structure the prompt so the stable contract is cached and only the per-query details churn.
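One way to implement that split is two system blocks with the cache marker only on the first, so editing the volatile block never invalidates the cached prefix. A sketch; the block contents are placeholders:

```python
def build_system(stable_contract: str, volatile_notes: str) -> list:
    """Split the system prompt at the cache boundary: the stable contract
    gets the cache marker, volatile instructions sit after it and are
    paid as fresh tokens every call."""
    return [
        {"type": "text", "text": stable_contract,
         "cache_control": {"type": "ephemeral"}},  # cached, edit rarely
        {"type": "text", "text": volatile_notes},  # fresh every call, edit freely
    ]

system = build_system(
    "...the stable agent contract, thousands of tokens...",
    "Today's on-call rotation and feature flags...",
)
```

The discipline this buys: anyone editing the second block pays nothing; anyone editing the first block knows they are rewriting the cache and should batch their changes.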
Why These Three Together
Temperature, attention, and caching look like separate topics. They are not. They are three views of the same problem: the model has finite determinism, finite focus, and finite budget. Operator skill is spending those finite resources where they pay back.
- Wrong temperature → determinism budget gets spent on tasks that needed creativity, or the reverse.
- Wrong attention design → the model's focus gets spent on filler instead of signal.
- Wrong cache layout → money and latency get spent on tokens that should have been free.
None of this is in the API docs as a coherent story. The biggest jump in agent quality rarely comes from a better model — it comes from treating these three knobs as design decisions instead of defaults.
If an agent is being built and temperature has never been deliberately set per task, instruction placement has never been considered, and no cache boundary has been marked by hand: that is where the next 10x lives.