- Prompt caching can cut the cost of long, multi-turn AI conversations by roughly an order of magnitude, turning an otherwise unprofitable product into a viable one.
- Without caching, each new message forces the model to reprocess the entire conversation history, causing token usage (and cost) to explode as sessions get longer.
- Claude’s prompt caching relies on prefix matching: if anything in the cached prefix changes, even a single character, the cache is invalidated and you pay full price again.
- Cached reads are dramatically cheaper (about 10% of normal input pricing) and also reduce latency, making agents feel faster as context grows.
- To maximize cache hits, structure prompts like a cache hierarchy—static content first (system/tools), then project/session context, and put the most dynamic content (latest messages) last.

The Claude Code team treats prompt cache misses like server outages. They run alerts. They declare incidents. A shift of a few percentage points in cache miss rate triggers an emergency response.
That sounds dramatic until you do the math. Without prompt caching, every message in a long AI conversation reprocesses the entire conversation history from scratch. A conversation that grows to 100,000 tokens over 50 messages means the API processes on the order of 5 million input tokens across the session, because each of those 50 requests rereads up to the full history. With caching, it reads the repeated tokens from cache at a 90% discount.
The difference between a cached and uncached agentic session is the difference between a product that costs $0.50 and one that costs $5. At scale, that is the difference between staying in business and shutting down.
How prompt caching works
Every time you send a message to the Claude API, the model processes all input tokens: system prompt, tools, conversation history, everything. Processing is the expensive part.
Prompt caching lets the API remember the processed result of tokens it has seen before. On the next request, if the beginning of your input matches what was cached, the API skips reprocessing and reads the cached result instead. Cached reads cost 10% of normal input pricing.
The critical mechanic is prefix matching. The API caches from the start of your request up to a cache breakpoint. If the next request has an identical prefix, those tokens are read from cache. If anything in that prefix changes, even one character, the cache is invalidated.
For Claude Sonnet 4.6, cached reads cost $0.30 per million tokens versus $3.00 uncached. For Opus 4.6, it is $1.50 versus $15.00. There is also a meaningful latency improvement: cached tokens process faster, which means your agent responds noticeably quicker as conversations grow.
Implementation: two options
Auto-caching (the easy way)
Add one field to your API request:
{
  "model": "claude-sonnet-4-6",
  "max_tokens": 1024,
  "cache_control": {"type": "ephemeral"},
  "system": "Your system prompt here...",
  "messages": [...]
}
The API automatically places the cache breakpoint at the end of the last cacheable block and moves it forward as the conversation grows. For most multi-turn conversation use cases, this is all you need.
Explicit breakpoints (for control)
When you need precise control over what gets cached, place cache_control on specific content blocks:
{
  "system": [
    {
      "type": "text",
      "text": "Your long system prompt...",
      "cache_control": {"type": "ephemeral"}
    }
  ],
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Large reference document here...",
          "cache_control": {"type": "ephemeral"}
        }
      ]
    }
  ]
}
This lets you set multiple breakpoints at specific positions. Useful when you have distinct sections of context, like a system prompt, reference documents, and conversation history, and want each cached independently.
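If you call the API through the Python SDK, the same structure carries over directly. A minimal sketch, assuming the anthropic package, with placeholder prompt and document text:

# Sketch: the explicit-breakpoint request above, sent through the Python SDK.
# The system prompt and document text are placeholders.
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "Your long system prompt...",
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Large reference document here...",
                    "cache_control": {"type": "ephemeral"},
                },
                {"type": "text", "text": "First question about the document..."},
            ],
        }
    ],
)

# The usage block reports what this request wrote to and read from the cache.
print(response.usage.cache_creation_input_tokens, response.usage.cache_read_input_tokens)

Those two usage fields are also what you will want later for monitoring your cache hit rate.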

Lessons from Claude Code in production
Auto-caching handles the basics. But the Claude Code team's design lessons, learned through real incidents and cost spikes, are where the actual value is.
Order your prompt like a cache hierarchy
Because caching matches from the beginning forward, the order of content determines how much gets cached across requests.
The rule: static content first, dynamic content last.
Claude Code structures every request like this:
- Static system prompt and tool definitions (same for every user)
- Project-level context like CLAUDE.md (same within a project)
- Session-level context (same within a session)
- Conversation messages (changes every turn)
Even requests from different sessions share cache hits on the system prompt and tools. Flip the ordering and you invalidate the cache on every single request.
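Concretely, that ordering looks something like the sketch below. The prompt strings and the single tool are placeholders, not Claude Code's actual content:

# Sketch: order request content from most static to most dynamic.
# Every string and the tool definition here are illustrative placeholders.
import anthropic

client = anthropic.Anthropic()

STATIC_SYSTEM_PROMPT = "You are a coding agent..."              # identical for every user
PROJECT_CONTEXT = "Contents of CLAUDE.md for this project..."   # stable within a project
SESSION_CONTEXT = "Session-level notes..."                      # stable within a session
conversation_messages = [{"role": "user", "content": "Latest user message..."}]  # changes every turn

TOOL_DEFINITIONS = [  # static, identical for every user
    {
        "name": "read_file",
        "description": "Read a file from the workspace.",
        "input_schema": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    }
]

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    tools=TOOL_DEFINITIONS,
    system=[
        {"type": "text", "text": STATIC_SYSTEM_PROMPT},
        {"type": "text", "text": PROJECT_CONTEXT},
        {
            "type": "text",
            "text": SESSION_CONTEXT,
            # One breakpoint at the end of the stable prefix; everything up to
            # here, tools included, is eligible for cache reads next request.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=conversation_messages,  # dynamic content stays last
)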
The prefix is more fragile than you think
The Claude Code team broke their own caching multiple times with changes that seemed harmless:
- Putting a timestamp in the system prompt (changes every second, invalidates the entire prefix)
- Shuffling tool definitions in a non-deterministic order (same tools, different ordering, cache miss)
- Updating tool parameters dynamically
Each caused costs to spike. Anything in your static prefix needs to be truly static. If it changes, move it into conversation messages instead.
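Two cheap safeguards follow from those incidents: build the prefix deterministically, and keep anything time-varying out of it. A sketch with an illustrative helper, not code from Claude Code:

# Sketch: keep the cached prefix byte-for-byte stable across requests.
# build_prefix is an illustrative helper, not part of any SDK.

def build_prefix(tools, system_prompt):
    # Serialize tool definitions in a fixed order so the prefix never shuffles
    # between requests (same tools in a different order is still a cache miss).
    stable_tools = sorted(tools, key=lambda t: t["name"])

    # The system prompt must be a constant string: no timestamps, no request IDs,
    # no dynamically rewritten tool parameters.
    system = [{
        "type": "text",
        "text": system_prompt,
        "cache_control": {"type": "ephemeral"},
    }]
    return stable_tools, system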
Use messages for updates, not prompt changes
When information changes mid-session, the tempting approach is to update the system prompt. This breaks the cache.
Instead, pass the update as a conversation message. Claude Code uses a <system-reminder> tag inside user messages to communicate updates: "It is now Wednesday." "The user changed file X." The model reads the update from the message. The system prompt stays identical. The cache stays intact.
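In code, that means appending a message rather than editing the prefix. The helper below is illustrative; only the <system-reminder> convention comes from Claude Code:

# Sketch: deliver mid-session updates as messages so the cached prefix never changes.
# append_reminder is an illustrative helper mimicking Claude Code's convention.

def append_reminder(messages, reminder_text, user_text=None):
    content = f"<system-reminder>{reminder_text}</system-reminder>"
    if user_text:
        content += "\n" + user_text
    return messages + [{"role": "user", "content": content}]

# Example: the working state changed mid-session.
messages = append_reminder(
    messages=[],
    reminder_text="The user changed file X since the last turn.",
    user_text="Please take that change into account.",
)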
Never change tools mid-session
This one catches people off guard. Adding or removing a tool invalidates the cache for the entire conversation, because tool definitions are part of the cached prefix.
Claude Code's solution: keep all tools in every request. For tools that are only sometimes needed, they send lightweight stubs with just the name and a defer_loading: true flag. The model discovers full schemas through a discovery tool when it actually needs them. The stubs stay stable in the prefix. The cache holds.
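The article describes the flag as defer_loading: true; the Messages API has no standard field by that name, so the sketch below approximates the pattern with stub descriptions plus a hypothetical describe_tool discovery tool:

# Sketch of the stub-plus-discovery pattern. Tool names, the describe_tool
# discovery tool, and the wording are assumptions, not Claude Code's definitions.

DISCOVERY_TOOL = {
    "name": "describe_tool",
    "description": "Return the full input schema for a deferred tool by name.",
    "input_schema": {
        "type": "object",
        "properties": {"tool_name": {"type": "string"}},
        "required": ["tool_name"],
    },
}

def stub(name, summary):
    # A lightweight placeholder that never changes, so the cached prefix stays stable.
    return {
        "name": name,
        "description": f"{summary} (deferred: call describe_tool for the full schema)",
        "input_schema": {"type": "object", "properties": {}},
    }

# Every request sends the same full tool list: real tools plus stable stubs.
ALL_TOOLS = [
    DISCOVERY_TOOL,
    stub("render_chart", "Render a chart from tabular data."),
    stub("query_warehouse", "Run a read-only SQL query against the warehouse."),
]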
Do not switch models mid-session
Prompt caches are unique to each model. If you are 100,000 tokens into a conversation with Opus and switch to Haiku for a quick question, you rebuild the entire cache from scratch. That is more expensive, not less.
The better pattern: use sub-agents. The main model prepares a concise handoff with relevant context, and a sub-agent on a cheaper model handles the task in its own session with a focused context window.
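A rough sketch of that handoff with the Python SDK; the model name, prompts, and helper are all illustrative:

# Sketch: hand a side task to a cheaper model in its own session instead of
# switching models (and rebuilding the cache) mid-conversation.
import anthropic

client = anthropic.Anthropic()

def run_subagent(handoff_summary, task):
    # Fresh, focused context for the cheaper model; the main session's
    # tools, system prompt, and history (and therefore its cache) are untouched.
    response = client.messages.create(
        model="claude-haiku-4-5",  # illustrative model choice
        max_tokens=1024,
        system="You are a sub-agent. Complete the task using only the handoff below.",
        messages=[{"role": "user", "content": f"Handoff:\n{handoff_summary}\n\nTask:\n{task}"}],
    )
    return response.content[0].text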
Design features around the cache
Plan mode in Claude Code illustrates this perfectly. The obvious implementation: swap out tools for read-only tools when the user enters plan mode. But swapping tools breaks the cache.
Instead, Claude Code implements plan mode as a tool itself. The model calls EnterPlanMode, receives constraints through a message, and calls ExitPlanMode when done. Tool definitions never change. Cache never breaks. And because it is a tool the model can call on its own, it autonomously enters plan mode when it detects a hard problem. A constraint-driven design produced better behavior.
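The article does not show the schemas, so the definitions below are an assumption of what stable plan-mode tools could look like; only the EnterPlanMode and ExitPlanMode names come from the source:

# Sketch: plan mode expressed as tools so the tool list never changes mid-session.
# The schemas are assumptions; only the tool names come from the article.

PLAN_MODE_TOOLS = [
    {
        "name": "EnterPlanMode",
        "description": "Switch to read-only planning. Constraints come back in the tool result.",
        "input_schema": {"type": "object", "properties": {}},
    },
    {
        "name": "ExitPlanMode",
        "description": "Leave planning and resume normal execution.",
        "input_schema": {
            "type": "object",
            "properties": {"plan": {"type": "string", "description": "The agreed plan to execute."}},
            "required": ["plan"],
        },
    },
]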
The five rules
If you are building anything on the Claude API with multi-turn conversations, these are the rules:
- Enable auto-caching. One field, 80-90% cost reduction on long conversations.
- Put static content at the top of your prompts. Dynamic content at the bottom.
- Do not modify your system prompt mid-session. Use messages.
- Do not add, remove, or reorder tools mid-session.
- Monitor your cache hit rate. A drop means something changed in your prefix (a sketch for computing the rate follows this list).
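The Messages API reports cache activity on every response, so the hit rate is easy to compute. A minimal sketch using those usage fields; the 0.8 threshold is arbitrary:

# Sketch: per-request cache hit rate from the usage block the API returns.

def cache_hit_rate(usage):
    cached = getattr(usage, "cache_read_input_tokens", 0) or 0
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    uncached = usage.input_tokens  # tokens processed at the full input rate
    total = cached + written + uncached
    return cached / total if total else 0.0

# Example: flag a drop on long conversations, since it almost always traces
# back to something changing in the static prefix.
# if cache_hit_rate(response.usage) < 0.8: alert("prompt cache hit rate dropped")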
The Claude Code team did not optimize for caching after the fact. They designed around it from day one. Every architectural decision, from how plan mode works to how tools are loaded to how context compaction runs, was shaped by one question: does this break the cache?
If you are building on the API and ignoring prompt caching, you are leaving money and speed on the table. At the token volumes that agentic products generate, probably enough of both to determine whether your product survives.
Turn on auto-caching for multi-turn apps by adding the cache_control field, then audit your prompt structure so stable content comes first and changing content comes last. If you have big, reusable chunks (system instructions, reference docs, project context), consider explicit cache breakpoints so they can be cached independently. Treat cache miss rate as a production metric—small prompt tweaks can silently break prefix matching and spike both cost and latency.