Why AI Agent Costs Surprise Teams
A simple chatbot using GPT-3.5 Turbo might cost a fraction of a cent per conversation. A multi-tool agent using GPT-4o with web search, code execution, and 10 reasoning steps can cost $0.50–$2.00 per run. At 100 runs/day, that's $1,500–$6,000/month — from a single agent.
Most teams discover this after the invoice arrives. By that point, the patterns that drove the cost (unnecessary retries, context overflow, redundant tool calls) have been running for weeks. Here's how to model costs before they surprise you, and how to trace the patterns that compound them.
Cost Ranges by Agent Type
| Agent Type | Model | Cost/Run | 100 runs/day |
|---|---|---|---|
| Simple chatbot | GPT-3.5 Turbo | $0.001–$0.003 | ~$9/mo |
| Research agent (3–5 tools) | GPT-4o mini | $0.01–$0.05 | ~$150/mo |
| Document analysis agent | GPT-4o | $0.05–$0.20 | ~$600/mo |
| Multi-step reasoning agent | GPT-4o | $0.20–$1.00 | ~$3,000/mo |
| Multi-agent system (3+ agents) | GPT-4o + sub-agents | $0.50–$3.00 | ~$9,000/mo |
These are ballpark estimates based on typical token usage patterns. Actual costs depend heavily on your specific prompts, context sizes, and retry behavior.
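The monthly column is a straight projection from the top of each per-run range. As a minimal sketch (the `monthlyCost` helper name and 30-day month are assumptions, not part of any SDK):

```javascript
// Project monthly spend from per-run cost (assumes a 30-day month)
function monthlyCost(costPerRun, runsPerDay, daysPerMonth = 30) {
  return costPerRun * runsPerDay * daysPerMonth
}

// Multi-step reasoning agent at the top of its range:
// monthlyCost(1.00, 100) → 3000, i.e. ~$3,000/mo
```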
How to Calculate Token Costs
Token pricing is asymmetric: input tokens (your prompt + context) are cheaper than output tokens (the model's response). For GPT-4o as of early 2026: $0.005/1K input tokens, $0.015/1K output tokens. For Claude 3.5 Sonnet: $0.003/1K input, $0.015/1K output.
The formula:
```
cost = (prompt_tokens × input_price + completion_tokens × output_price) / 1000
```
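As a helper function (a sketch; `llmCallCost` is a name chosen here, with per-1K prices passed in so you can swap models):

```javascript
// Cost of one LLM call, with prices expressed per 1K tokens
function llmCallCost(promptTokens, completionTokens, inputPricePer1K, outputPricePer1K) {
  return (promptTokens * inputPricePer1K + completionTokens * outputPricePer1K) / 1000
}

// GPT-4o at $0.005/1K input, $0.015/1K output:
// llmCallCost(1200, 340, 0.005, 0.015) ≈ $0.0111
```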
Track this at the span level so you can see cost by tool call, by agent step, and roll up to trace total:
```javascript
// Track token usage in every LLM span
const llmSpan = await trace.addSpan({
  name: 'gpt-4o-analysis',
  input: {
    prompt_tokens: 1200,
    model: 'gpt-4o',
  },
  output: {
    completion_tokens: 340,
    total_tokens: 1540,
    // At $0.005/1K input + $0.015/1K output (gpt-4o):
    // cost = (1200 * 0.005 + 340 * 0.015) / 1000 = $0.0111
    estimated_cost_usd: 0.0111,
  },
})
```
Hidden Cost Patterns
1. Retry Amplification
A tool that fails and retries 3 times doesn't just cost 3× the token budget — it also adds latency and often triggers additional LLM calls to re-evaluate the situation. Track retry counts as metadata so you can identify which tools are unreliable and driving disproportionate cost:
```javascript
// Log retry attempts to catch cost amplifiers
const MAX_ATTEMPTS = 3
let attempt = 0
while (attempt < MAX_ATTEMPTS) {
  attempt++
  const retrySpan = await trace.addSpan({
    name: 'tool-call-with-retry',
    input: { attempt, tool: 'web-search', query: searchQuery },
  })
  try {
    const result = await webSearch(searchQuery)
    await retrySpan.end({ status: 'ok', output: { result_count: result.length } })
    break
  } catch (err) {
    await retrySpan.end({ status: 'error', error: `Attempt ${attempt}: ${err.message}` })
    // If all 3 attempts fail, you'll see 3x the token cost in the trace.
    // Surface the final failure instead of silently falling out of the loop:
    if (attempt === MAX_ATTEMPTS) throw err
  }
}
```
2. Context Window Overflow
As conversation history grows, each new LLM call becomes more expensive — you're re-paying for every past message on every new turn. A 10-turn conversation doesn't cost 10× a 1-turn conversation; it often costs 50× because the context accumulates.
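The compounding is easy to model: if every turn adds roughly the same number of tokens and the full history is resent each time, total input tokens grow quadratically with turn count (a toy model; real turns vary in size):

```javascript
// Total input tokens paid over n turns when the full history is
// resent every turn: t + 2t + ... + nt = t * n * (n + 1) / 2
function cumulativeInputTokens(turns, tokensPerTurn) {
  return (tokensPerTurn * turns * (turns + 1)) / 2
}

// 1 turn at 500 tokens: 500 input tokens total.
// 10 turns: 500 * 10 * 11 / 2 = 27,500 total, 55x the single turn rather than 10x.
```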
The fix is to monitor context size and prune proactively:
```javascript
// Detect context window bloat before it hits the limit
const CONTEXT_WINDOW = 128000 // gpt-4o's context window
const span = await trace.addSpan({
  name: 'context-assembly',
  input: {
    message_count: conversationHistory.length,
    estimated_tokens: estimateTokens(conversationHistory),
    context_window_limit: CONTEXT_WINDOW,
  },
})
// Alert if context exceeds 80% of the window, and prune before the expensive overflow
if (estimateTokens(conversationHistory) > CONTEXT_WINDOW * 0.8) {
  console.warn('[nexus] Context window at 80%, consider pruning old messages')
  conversationHistory = pruneOldestMessages(conversationHistory, 0.5)
}
await span.end({ status: 'ok', output: { message_count_after_prune: conversationHistory.length } })
```
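`pruneOldestMessages` is left undefined above; a minimal version that keeps only the newest fraction of messages might look like this (an assumption, not a library function):

```javascript
// Keep only the newest `keepFraction` of messages (e.g. 0.5 drops the older half).
// A production version would summarize dropped turns instead of discarding them.
function pruneOldestMessages(history, keepFraction) {
  const keepCount = Math.max(1, Math.ceil(history.length * keepFraction))
  return history.slice(history.length - keepCount)
}
```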
3. Unnecessary Tool Calls
Some agents call tools "just to be sure" — searching for information they already have, re-reading documents they just read, calling APIs they called two steps ago. These are invisible costs until you look at trace data.
Once you have trace data, you can spot patterns: "this agent always calls web-search twice in a row for the same query." A simple dedup cache at the tool call level cuts costs 30–50% for research agents.
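A sketch of such a dedup cache (the helper name, key scheme, and 5-minute default TTL are assumptions to adapt to your tool layer):

```javascript
// Deduplicate identical tool calls within a TTL window
const toolCache = new Map()

async function cachedToolCall(tool, input, run, ttlMs = 5 * 60 * 1000) {
  const key = `${tool}:${JSON.stringify(input)}`
  const hit = toolCache.get(key)
  if (hit && hit.expiresAt > Date.now()) return hit.value
  const value = await run()
  toolCache.set(key, { value, expiresAt: Date.now() + ttlMs })
  return value
}
```

A second identical `web-search` call within the window now returns the cached result instead of paying for another API round-trip plus the LLM tokens to process it.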
4. Sub-Agent Fanout
Multi-agent systems can fan out dramatically: a coordinator spawns 3 sub-agents, each of those spawns 2 more, and suddenly you have 10 agent runs (1 + 3 + 6) each paying full context costs. Budget modeling must account for the fanout factor, not just the top-level agent cost.
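Counting every run in the fanout tree makes the multiplier explicit (a sketch; `totalAgentRuns` takes the branching factor at each level below the coordinator):

```javascript
// Total agent runs in a fanout tree: the coordinator plus each level,
// where every agent at a level spawns `branching` children
function totalAgentRuns(branchingPerLevel) {
  let levelCount = 1 // the coordinator
  let total = 1
  for (const branching of branchingPerLevel) {
    levelCount *= branching
    total += levelCount
  }
  return total
}

// Coordinator spawns 3 sub-agents, each spawns 2 more:
// totalAgentRuns([3, 2]) → 10 runs
```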
Using Tracing for Cost Management
The most effective cost management technique is trace-based cost attribution: tag every trace with context that lets you group costs by feature, customer, or workflow.
```javascript
// Tag traces with cost metadata for budget reporting
const trace = await nexus.startTrace({
  name: 'customer-support-agent',
  metadata: {
    customer_tier: 'enterprise',
    ticket_id: ticketId,
    budget_center: 'support-ops',
    // Track at trace level so you can group costs in dashboard
    model_family: 'gpt-4o',
  },
})
With this tagging, your trace dashboard becomes a cost dashboard: you can answer "which customer is driving 40% of our token spend?" or "which feature request triggers the most expensive agent runs?"
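If your tracing backend lets you export trace records, the grouping itself is a small fold (a sketch; the record shape mirrors the metadata above but is an assumption about your export format):

```javascript
// Sum estimated cost per metadata value, e.g. by customer_tier or budget_center
function groupCostBy(traces, keyOf) {
  const totals = new Map()
  for (const t of traces) {
    const key = keyOf(t) ?? 'unknown'
    totals.set(key, (totals.get(key) ?? 0) + t.estimated_cost_usd)
  }
  return totals
}

// groupCostBy(traces, (t) => t.metadata.customer_tier)
```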
Cost Reduction Checklist
- 1. Instrument token usage — Track prompt_tokens and completion_tokens in every LLM span. You can't optimize what you don't measure.
- 2. Set step limits — Hard-cap agent iterations. A research agent that runs for 20 steps when 8 would suffice is burning 2.5× your budget.
- 3. Prune context aggressively — For conversational agents, summarize old turns rather than passing them verbatim. A 100-token summary replaces a 2,000-token history.
- 4. Cache tool results — If a tool returns the same result for the same input, cache it for 5–60 minutes. Many research agents call the same APIs multiple times per session.
- 5. Use smaller models for cheap steps — Classification, routing, and simple extraction don't need GPT-4o. GPT-4o mini or Claude 3.5 Haiku runs at roughly a tenth to a twentieth of the price for tasks that don't require frontier reasoning.
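Point 5 can start as a one-line router (a sketch; the task labels and model choices are assumptions to adapt to your stack):

```javascript
// Route cheap task types to a small model; reserve the frontier model for reasoning
function pickModel(taskKind) {
  const cheapTasks = new Set(['classification', 'routing', 'extraction'])
  return cheapTasks.has(taskKind) ? 'gpt-4o-mini' : 'gpt-4o'
}
```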
See your token costs in Nexus
Track token usage, cost per trace, and error rates. Free tier: 1,000 traces/month. No credit card required.
Start monitoring free →