2026-04-14 · 9 min read

How Trace Analysis Cut Our AI Agent Costs by 60%

Running AI agents in production gets expensive fast. We went from $800/month to $310/month on LLM costs — without reducing quality. Here's the trace-driven approach we used: identifying the spans burning the most tokens, eliminating unnecessary retries, and caching repeated context.

When we started running AI agents in production at scale, our LLM bill was shocking. Three things drove most of the cost: retry storms from rate-limited calls, oversized context windows being sent on every request, and using Opus where Haiku would have worked fine.

Trace analysis exposed all three. Here's the exact process we used to cut costs by 60% without reducing quality.

Step 1: Track estimated cost per span

You can't optimize what you don't measure. Start by logging estimated cost in every LLM span's output. This takes about 10 minutes to add and immediately shows you which calls are expensive:

import { NexusClient } from '@keylightdigital/nexus'
import Anthropic from '@anthropic-ai/sdk'

const nexus = new NexusClient({ apiKey: process.env.NEXUS_API_KEY!, agentId: 'cost-agent' })
const anthropic = new Anthropic()

// Pricing per 1M tokens (as of 2026)
const PRICING: Record<string, { input: number; output: number }> = {
  'claude-opus-4-6':   { input: 15.00, output: 75.00 },
  'claude-sonnet-4-6': { input: 3.00,  output: 15.00 },
  'claude-haiku-4-5':  { input: 0.25,  output: 1.25 },
}

async function tracedLLMCall(trace: any, model: string, messages: any[]) {
  const span = await trace.addSpan({
    name: `llm:${model}`,
    input: { message_count: messages.length },
  })

  const resp = await anthropic.messages.create({ model, max_tokens: 2048, messages })

  const pricing = PRICING[model] ?? { input: 0, output: 0 }
  const inputCost  = (resp.usage.input_tokens  / 1_000_000) * pricing.input
  const outputCost = (resp.usage.output_tokens / 1_000_000) * pricing.output

  await span.end({
    status: 'ok',
    output: {
      input_tokens:  resp.usage.input_tokens,
      output_tokens: resp.usage.output_tokens,
      estimated_cost_usd: +(inputCost + outputCost).toFixed(6),
    },
  })

  return resp
}

Once this is in place, open a trace in Nexus and look at the span outputs. You'll immediately see which models and which steps are costing the most.
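If you'd rather rank steps programmatically, you can sum the estimated_cost_usd fields across spans. The span shape below (a list of dicts with a name and an output) is an assumption about exported trace data, not Nexus's actual export format:

```python
from collections import defaultdict

def cost_by_span(spans: list[dict]) -> dict[str, float]:
    """Sum estimated cost per span name, most expensive first."""
    totals: dict[str, float] = defaultdict(float)
    for span in spans:
        totals[span["name"]] += (span.get("output") or {}).get("estimated_cost_usd", 0.0)
    return dict(sorted(totals.items(), key=lambda kv: -kv[1]))

spans = [
    {"name": "llm:claude-opus-4-6",  "output": {"estimated_cost_usd": 0.50}},
    {"name": "llm:claude-haiku-4-5", "output": {"estimated_cost_usd": 0.01}},
    {"name": "llm:claude-opus-4-6",  "output": {"estimated_cost_usd": 0.25}},
]
print(cost_by_span(spans))  # → {'llm:claude-opus-4-6': 0.75, 'llm:claude-haiku-4-5': 0.01}
```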

Step 2: Identify retry storms

Look at your Nexus trace list filtered by status: error. If you see the same agent name appearing 5+ times in rapid succession, you have a retry storm — each retry attempt counts as a full LLM call even if it's immediately rate-limited.

Fix: add exponential backoff and a maximum retry limit. This alone cut our costs by 20% because we had an agent that would retry rate limit errors 10 times without delay.
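A minimal sketch of that fix in Python. RateLimited here is a placeholder for whatever rate-limit exception your SDK raises (e.g. anthropic.RateLimitError):

```python
import random
import time

class RateLimited(Exception):
    """Stand-in for your SDK's rate-limit error."""

def call_with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=30.0,
                      retryable=(RateLimited,)):
    # Each retry waits base_delay * 2^attempt seconds (capped), plus jitter,
    # so retries spread out instead of hammering the API in a tight loop.
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retryable:
            if attempt == max_retries:
                raise  # give up: surfacing the error beats burning more calls
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay + random.uniform(0, delay * 0.1))
```

With a hard cap of 5 retries, the worst case is 6 LLM calls instead of the unbounded storm we had before.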

Step 3: Spot context window bloat

In our trace spans, we logged input_tokens on each LLM call. What we found: on iteration 1, context was ~2,000 tokens. By iteration 10, it had grown to 18,000 tokens — because we were including the full conversation history including every tool result verbatim.

Fix: summarize tool outputs before adding them to context. A bash command that returns 5,000 characters of output only needs a 50-character summary in the conversation history. This cut average context size by 70% on long-running agents.
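The cheapest version of this is plain truncation with a marker showing how much was dropped; a fancier version would summarize the output with a cheap model like Haiku. A sketch of the truncation approach:

```python
def compact_tool_output(output: str, max_chars: int = 200) -> str:
    """Keep tool output short before it enters the conversation history."""
    if len(output) <= max_chars:
        return output
    dropped = len(output) - max_chars
    return output[:max_chars] + f" ...[truncated {dropped} chars]"

# A 5,000-character bash result shrinks to a couple hundred characters
# in history, while the truncation marker records what was cut.
big = "x" * 5_000
print(len(compact_tool_output(big)))
```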

Step 4: Use prompt caching

If your agent uses a large, static system prompt (instructions, tool descriptions, knowledge base), you're paying to transmit it on every call. Claude's prompt caching feature lets you mark blocks as cacheable so they're only billed at 10% of the normal input token rate after the first call:

# Use prompt caching to avoid re-sending large system prompts
import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = """
You are an expert data analyst...
[2000 tokens of context that never changes]
"""

def analyze_with_cache(data: str) -> str:
    # The system prompt is cached after the first call.
    # Subsequent calls bill it at the cache-read rate (10% of base input)
    # plus the small data input.
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"}  # cache this block
            }
        ],
        messages=[{"role": "user", "content": f"Analyze this data: {data}"}],
    )
    return response.content[0].text

For a 2,000-token system prompt sent 1,000 times per day, this saves roughly $5.40/day using Claude Sonnet, or close to $2,000/year from one agent's system prompt alone (assuming calls arrive often enough to keep the five-minute ephemeral cache warm).
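Working the arithmetic (ignoring the one-time 25% cache-write premium on the first call), the savings come out to about $5.40/day:

```python
# Daily cost of re-sending a 2,000-token system prompt 1,000 times on Sonnet,
# with and without prompt caching (cache reads bill at 10% of the input rate).
PROMPT_TOKENS = 2_000
CALLS_PER_DAY = 1_000
SONNET_INPUT_PER_M = 3.00                      # $ per 1M input tokens
CACHE_READ_PER_M = 0.10 * SONNET_INPUT_PER_M   # 10% of base input rate

daily_tokens = PROMPT_TOKENS * CALLS_PER_DAY   # 2M prompt tokens/day
uncached = daily_tokens / 1_000_000 * SONNET_INPUT_PER_M
cached = daily_tokens / 1_000_000 * CACHE_READ_PER_M
print(f"${uncached:.2f}/day -> ${cached:.2f}/day, saving ${uncached - cached:.2f}/day")
# → $6.00/day -> $0.60/day, saving $5.40/day
```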

Step 5: Right-size your model

Not every step in your agent needs the most capable model. Classification tasks, simple reformatting, and yes/no routing decisions are often handled well by Haiku at 20x lower cost than Opus.

Look at your Nexus trace spans filtered by model name. Identify steps that are using Opus where you could use Haiku. Run evals on those specific spans to confirm quality holds. Then switch the model for those spans only.
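One lightweight way to wire in the switch is a task-type router. The task categories below are illustrative, not a Nexus feature; the model names match the pricing table above:

```python
# Route simple, well-bounded tasks to a cheap model; reserve the capable
# (and ~20x more expensive) model for open-ended reasoning steps.
CHEAP = "claude-haiku-4-5"
CAPABLE = "claude-opus-4-6"
SIMPLE_TASKS = {"classify", "route", "reformat", "extract"}

def pick_model(task_type: str) -> str:
    return CHEAP if task_type in SIMPLE_TASKS else CAPABLE

print(pick_model("classify"))  # → claude-haiku-4-5
print(pick_model("plan"))      # → claude-opus-4-6
```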

The results

After implementing all five changes over two weeks, our monthly LLM bill dropped from roughly $800 to $310, a cut of about 60%.

None of this required reducing the quality of the agent's output — it required understanding exactly where the tokens were going, which is only possible with trace-level observability.

Find out where your token budget is going

Start free — add Nexus to your agent and see the breakdown in minutes.
