How Prompt Caching Can Cut Your AI Agent Costs by 80%
Prompt caching is the highest-ROI optimization most AI agent teams haven't tried yet. By storing repeated context — system prompts, few-shot examples, retrieved documents — you can reduce input token costs by 60–90% with almost no code changes. Here's how it works, when to use it, and how to trace cache effectiveness in Nexus.
LLM APIs charge per token. Every token you send in a prompt costs money — even if you've sent the exact same system prompt a thousand times today. Prompt caching changes this: instead of re-billing you for context that hasn't changed, the API reuses a stored version at a dramatically reduced rate.
Anthropic's Claude bills cache reads at 10% of the standard input price. At a $3-per-million-token input rate, a 10,000-token system prompt costs $0.03 per call uncached but only $0.003 per cache hit. At 1,000 calls per day, that's $30/day vs. $3/day — just from caching your system prompt. (Cache writes are billed at a premium, 1.25× base for Claude, but a single write amortizes over many reads within the TTL.)
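The arithmetic is worth making concrete. A sketch of the cost model, assuming a $3-per-million-token base input rate and a 10% cache-read ratio (check your model's current pricing; the rates here are illustrative):

```typescript
// Rough input-cost model for prompt caching. Rates are assumptions,
// not authoritative pricing.
const BASE_RATE_PER_MTOK = 3.0 // $ per million input tokens (assumed)
const CACHE_READ_RATIO = 0.1   // cache reads billed at 10% of base

function dailyInputCost(
  promptTokens: number,
  callsPerDay: number,
  cached: boolean,
): number {
  const rate = cached ? BASE_RATE_PER_MTOK * CACHE_READ_RATIO : BASE_RATE_PER_MTOK
  return (promptTokens * rate * callsPerDay) / 1_000_000
}

// 10,000-token system prompt, 1,000 calls/day:
const uncached = dailyInputCost(10_000, 1_000, false) // ~$30/day
const withCache = dailyInputCost(10_000, 1_000, true) // ~$3/day
```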
What can be cached
Prompt caching works by marking a portion of your prompt as a cache checkpoint. Everything up to that checkpoint is eligible for caching. The best candidates are:
- System prompts — same across all calls; ideal for caching
- Few-shot examples — static demonstrations that don't change per request
- Retrieved context — RAG documents that are reused across multiple turns
- Tool definitions — your tool schemas rarely change mid-session
Enabling cache control in TypeScript
import Anthropic from '@anthropic-ai/sdk'

const client = new Anthropic()

const response = await client.messages.create({
  model: 'claude-opus-4-6',
  max_tokens: 1024,
  system: [
    {
      type: 'text',
      text: yourLongSystemPrompt, // <-- mark this for caching
      cache_control: { type: 'ephemeral' },
    },
  ],
  messages: [{ role: 'user', content: userMessage }],
})

// Check cache effectiveness
const usage = response.usage
console.log('Cache read tokens:', usage.cache_read_input_tokens)
console.log('Cache write tokens:', usage.cache_creation_input_tokens)
console.log('Uncached input tokens:', usage.input_tokens)
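The same pattern applies to tool definitions: placing `cache_control` on the last tool caches the entire tool block as a prefix. A sketch of a small helper for that (the simplified `Tool` type and `withCacheBreakpoint` function are illustrative, not part of the SDK):

```typescript
// Simplified tool shape for illustration; the real SDK type has more fields.
type Tool = {
  name: string
  description?: string
  cache_control?: { type: 'ephemeral' }
}

// Mark the last tool in the array as a cache breakpoint, so everything
// up to and including it becomes the cached prefix. Returns a new array;
// the input is not mutated.
function withCacheBreakpoint(tools: Tool[]): Tool[] {
  if (tools.length === 0) return tools
  const last = tools[tools.length - 1]
  return [...tools.slice(0, -1), { ...last, cache_control: { type: 'ephemeral' } }]
}

// Usage: pass the result as `tools:` in client.messages.create({ ... })
const cachedTools = withCacheBreakpoint([
  { name: 'search_docs' },
  { name: 'run_query' },
])
```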
Tracking cache effectiveness with Nexus
The key metric for caching is your cache hit rate: what percentage of your input tokens are being served from cache? Log this in every span's metadata:
const span = await trace.startSpan({ name: 'claude-with-cache' })

const response = await client.messages.create({ /* ... */ })

const cacheHits = response.usage.cache_read_input_tokens ?? 0
const cacheWrites = response.usage.cache_creation_input_tokens ?? 0
const freshTokens = response.usage.input_tokens

await span.end({
  status: 'success',
  metadata: {
    model: 'claude-opus-4-6',
    cache_hit_tokens: cacheHits,
    cache_write_tokens: cacheWrites,
    fresh_input_tokens: freshTokens,
    cache_hit_rate: cacheHits / (cacheHits + freshTokens + cacheWrites),
    output_tokens: response.usage.output_tokens,
  },
})
With this in place, your Nexus trace detail shows cache hit rates per span. When you see a low hit rate on a span that should be caching (e.g., 0.1 instead of 0.9), you know something changed in your prompt structure that's busting the cache.
Common caching mistakes
Caching dynamic content. Any content that changes per request (user messages, timestamps, session IDs) should come after the cache checkpoint — not before it. If you include dynamic content in the cached portion, you'll get 0% cache hits.
Letting the cache expire. Cache entries are short-lived: Claude's ephemeral cache has a roughly 5-minute TTL, refreshed on each read. If your calls arrive less often than that, every request pays the cache-write premium and gets zero cache reads. Note that the cache is keyed on the prompt prefix server-side, not on your client instance, so the fix is keeping traffic frequent enough (or batching it) that cached prefixes stay warm.
Caching short prompts. Prompt caching has a minimum cacheable length (typically 1,024 tokens for Claude). Short prompts don't qualify. Focus caching effort on long system prompts and large few-shot blocks.
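To avoid spending a breakpoint on a prompt that's too short to qualify, you can gate `cache_control` on a rough length estimate. A sketch, assuming the common chars/4 token heuristic (a real tokenizer is more accurate, and the 1,024-token threshold varies by model):

```typescript
// Minimum cacheable length for Claude (assumption: 1,024 tokens for
// Sonnet/Opus-class models; smaller models may require more).
const MIN_CACHEABLE_TOKENS = 1024

// Crude token estimate: ~4 characters per token. Illustrative only;
// use a real tokenizer for production decisions.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4)
}

// Build a system block, attaching cache_control only when the prompt
// is plausibly above the minimum cacheable length.
function systemBlock(text: string) {
  return estimateTokens(text) >= MIN_CACHEABLE_TOKENS
    ? [{ type: 'text', text, cache_control: { type: 'ephemeral' } }]
    : [{ type: 'text', text }] // too short to cache; skip the breakpoint
}
```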
Expected savings by agent type
Based on real production deployments:
- RAG pipelines: 50–70% input cost reduction (retrieved docs cached across turns)
- Tool-use agents with long system prompts: 60–80% savings on repeated calls
- Batch processing agents: 70–90% savings when processing many items with a shared system prompt
- Conversational agents (multi-turn): 30–50% savings on turns 2+ when conversation history is cached
The ceiling is roughly: (cached tokens / total input tokens) × (1 - cache price ratio). For a 90% cacheable prompt at Claude's 10% cache rate, that's 81% savings on input costs.
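That ceiling is simple enough to sanity-check in code. A sketch of the formula above:

```typescript
// Upper bound on input-cost savings from caching:
// (fraction of input tokens that are cacheable) × (discount on cache reads).
function savingsCeiling(cacheableFraction: number, cachePriceRatio: number): number {
  return cacheableFraction * (1 - cachePriceRatio)
}

// 90% cacheable prompt at a 10% cache-read price ratio:
savingsCeiling(0.9, 0.1) // ≈ 0.81, i.e. 81% savings on input costs
```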
Track cache hit rates in your traces
Nexus lets you log cache metadata on every span. See which calls are hitting the cache and which are missing it — with no changes to your LLM provider.
Start free →