2026-04-15 · 8 min read

How Prompt Caching Can Cut Your AI Agent Costs by 80%

Prompt caching is the highest-ROI optimization most AI agent teams haven't tried yet. By storing repeated context — system prompts, few-shot examples, retrieved documents — you can reduce input token costs by 60–90% with almost no code changes. Here's how it works, when to use it, and how to trace cache effectiveness in Nexus.

LLM APIs charge per token. Every token you send in a prompt costs money — even if you've sent the exact same system prompt a thousand times today. Prompt caching changes this: instead of re-billing you for context that hasn't changed, the API reuses a stored version at a dramatically reduced rate.

Anthropic's Claude charges 10% of the standard input price for cache reads. That means a 10,000-token system prompt that you'd pay $0.03 for on every call now costs $0.003 per cache hit. At 1,000 calls per day, that's $30/day vs. $3/day — just from caching your system prompt.
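A quick sketch of that arithmetic (the per-token price here is illustrative, not a live rate card):

```typescript
// Illustrative pricing: $3 per million input tokens, cache reads at 10%.
const PRICE_PER_TOKEN = 3 / 1_000_000
const CACHE_READ_RATIO = 0.1

const promptTokens = 10_000
const callsPerDay = 1_000

// Cost of re-sending the full prompt on every call vs. reading it from cache.
const uncachedDaily = promptTokens * callsPerDay * PRICE_PER_TOKEN
const cachedDaily = uncachedDaily * CACHE_READ_RATIO

console.log(uncachedDaily.toFixed(2)) // "30.00"
console.log(cachedDaily.toFixed(2))   // "3.00"
```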

What can be cached

Prompt caching works by marking a portion of your prompt as a cache checkpoint. Everything up to that checkpoint is eligible for caching. The best candidates are:

System prompts. Long, static instructions that are byte-identical on every call.

Few-shot examples. Large example blocks that rarely change between requests.

Retrieved documents. Reference material reused across many requests in a session.

Enabling cache control in TypeScript

import Anthropic from '@anthropic-ai/sdk'

const client = new Anthropic()

const response = await client.messages.create({
  model: 'claude-opus-4-6',
  max_tokens: 1024,
  system: [
    {
      type: 'text',
      text: yourLongSystemPrompt,          // <-- mark this for caching
      cache_control: { type: 'ephemeral' },
    },
  ],
  messages: [{ role: 'user', content: userMessage }],
})

// Check cache effectiveness
const usage = response.usage
console.log('Cache read tokens:', usage.cache_read_input_tokens)
console.log('Cache write tokens:', usage.cache_creation_input_tokens)
console.log('Uncached input tokens:', usage.input_tokens)

Tracking cache effectiveness with Nexus

The key metric for caching is your cache hit rate: what percentage of your input tokens are being served from cache? Log this in every span's metadata:

const span = await trace.startSpan({ name: 'claude-with-cache' })

const response = await client.messages.create({ /* ... */ })

const cacheHits = response.usage.cache_read_input_tokens ?? 0
const cacheWrites = response.usage.cache_creation_input_tokens ?? 0
const freshTokens = response.usage.input_tokens

await span.end({
  status: 'success',
  metadata: {
    model: 'claude-opus-4-6',
    cache_hit_tokens: cacheHits,
    cache_write_tokens: cacheWrites,
    fresh_input_tokens: freshTokens,
    cache_hit_rate: cacheHits / (cacheHits + freshTokens + cacheWrites),
    output_tokens: response.usage.output_tokens,
  },
})

With this in place, your Nexus trace detail shows cache hit rates per span. When you see a low hit rate on a span that should be caching (e.g., 0.1 instead of 0.9), you know something changed in your prompt structure that's busting the cache.
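One way to catch this automatically is a small helper that flags suspicious spans; the function names and the 0.5 threshold here are assumptions for illustration, not part of the Nexus SDK:

```typescript
// Token counts pulled from a span's metadata (same fields logged above).
interface CacheUsage {
  cacheHits: number   // cache_read_input_tokens
  cacheWrites: number // cache_creation_input_tokens
  freshTokens: number // input_tokens (uncached)
}

// Same formula as the span metadata: hits over all input tokens.
function cacheHitRate({ cacheHits, cacheWrites, freshTokens }: CacheUsage): number {
  const total = cacheHits + cacheWrites + freshTokens
  return total === 0 ? 0 : cacheHits / total
}

// Flag spans that should be caching but aren't.
function isCacheBusted(usage: CacheUsage, threshold = 0.5): boolean {
  return cacheHitRate(usage) < threshold
}

// Healthy span: most input tokens served from cache.
console.log(isCacheBusted({ cacheHits: 9_000, cacheWrites: 0, freshTokens: 1_000 })) // false

// Busted span: every call re-writes the prefix instead of reading it.
console.log(isCacheBusted({ cacheHits: 0, cacheWrites: 9_000, freshTokens: 1_000 })) // true
```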

Common caching mistakes

Caching dynamic content. Any content that changes per request (user messages, timestamps, session IDs) should come after the cache checkpoint — not before it. The cache matches on an exact prompt prefix, so a single changed character invalidates everything after it: include dynamic content in the cached portion and you'll get 0% cache hits.
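A minimal sketch of the difference, reusing the request shape from the example above (yourLongSystemPrompt and userMessage are placeholders):

```typescript
const yourLongSystemPrompt = '...thousands of tokens of static instructions...'
const userMessage = 'Summarize the latest deploy logs.'

// Busted: a timestamp baked into the cached block changes the prefix on
// every call, so no request ever matches a previously cached one.
const bustedSystem = [
  {
    type: 'text' as const,
    text: `${yourLongSystemPrompt}\nCurrent time: ${new Date().toISOString()}`,
    cache_control: { type: 'ephemeral' as const },
  },
]

// Correct: the cached block is byte-identical across calls; anything
// per-request lives in the user message, after the checkpoint.
const cachedSystem = [
  {
    type: 'text' as const,
    text: yourLongSystemPrompt,
    cache_control: { type: 'ephemeral' as const },
  },
]
const messages = [
  {
    role: 'user' as const,
    content: `Current time: ${new Date().toISOString()}\n${userMessage}`,
  },
]
```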

Letting the cache go cold. Cache TTLs are typically 5 minutes, refreshed on each hit. The cache lives server-side and is keyed on your prompt prefix, so it survives across client instances — but if your agent goes quiet for longer than the TTL, the next call pays the full cache-write price again. Keep traffic to a shared prefix frequent enough that the cache stays warm.

Caching short prompts. Prompt caching has a minimum cacheable length (typically 1,024 tokens for Claude). Short prompts don't qualify. Focus caching effort on long system prompts and large few-shot blocks.

Expected savings by agent type

In real production deployments, savings scale with how much of each prompt is static: the larger the shared prefix relative to total input, the closer you get to the theoretical ceiling.

The ceiling is roughly: (cached tokens / total input tokens) × (1 - cache price ratio). For a 90% cacheable prompt at Claude's 10% cache rate, that's 81% savings on input costs.
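As a sketch, that ceiling can be computed directly:

```typescript
// Upper bound on input-cost savings from caching.
// cacheableFraction: share of input tokens that sit before the checkpoint.
// cachePriceRatio: cache-read price as a fraction of normal input price.
function savingsCeiling(cacheableFraction: number, cachePriceRatio: number): number {
  return cacheableFraction * (1 - cachePriceRatio)
}

// 90% cacheable prompt at a 10% cache-read rate ≈ 81% off input costs.
console.log(savingsCeiling(0.9, 0.1))
```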

Track cache hit rates in your traces

Nexus lets you log cache metadata on every span. See which calls are hitting the cache and which are missing it — with no changes to your LLM provider.

Start free →