Monitoring AI Agent Token Budget and Cost Thresholds with Nexus
OpenAI charges per token. Claude charges per token. A runaway agent in a bad loop can spend $50 in minutes before anyone notices. Here's how to record token usage as Nexus span metadata, compute cost per trace using model pricing tables, build a token budget guard that aborts a run before it exceeds your limit, and alert when a session hits 80% of its budget.
Why token budgets matter in production
Most teams discover their token cost problem in one of two ways: a Stripe invoice that's three times higher than expected, or an agent that loops for 20 minutes before timing out. Both are avoidable with a token budget guard built on trace data.
The per-token prices are small individually — GPT-4o costs $2.50 per million input tokens, Claude 3.5 Sonnet costs $3.00 per million — but agents multiply them. A research agent that makes 15 tool calls per task, each with a 10,000-token context window, costs about $0.45 per run. At 500 runs per day, that's $225/day before you've added caching. A single bad prompt that inflates the context to 80,000 tokens makes every affected run roughly eight times more expensive.
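The arithmetic behind those numbers, as a quick sketch — the GPT-4o rates match the pricing table below, and the 500-token average completion size is an assumption for illustration:

```python
# Back-of-envelope cost model for the research agent described above.
INPUT_RATE = 2.50 / 1_000_000    # dollars per input token (GPT-4o)
OUTPUT_RATE = 10.00 / 1_000_000  # dollars per output token (GPT-4o)

calls_per_run = 15
input_tokens_per_call = 10_000
output_tokens_per_call = 500  # assumed average completion size

cost_per_run = calls_per_run * (
    input_tokens_per_call * INPUT_RATE + output_tokens_per_call * OUTPUT_RATE
)
daily_cost = cost_per_run * 500  # 500 runs per day

print(f"per run: ${cost_per_run:.2f}, per day: ${daily_cost:.2f}")
# → per run: $0.45, per day: $225.00
```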
The solution is to track tokens at the span level — each LLM call records its own `prompt_tokens` and `completion_tokens` — then aggregate them at the trace level and enforce a budget before the next call is made.
Recording token usage as Nexus span metadata
Every LLM API response returns token counts in its usage object. Record them as span metadata so they're searchable and aggregatable in Nexus.
Python — OpenAI with token metadata:
```python
import os
from openai import OpenAI
from nexus_sdk import NexusClient

nexus = NexusClient(api_key=os.environ["NEXUS_API_KEY"])
openai = OpenAI()

def run_agent(user_query: str, task_name: str) -> str:
    trace = nexus.start_trace(name=task_name, metadata={"query": user_query})
    cumulative_prompt_tokens = 0
    cumulative_completion_tokens = 0
    messages = [{"role": "user", "content": user_query}]

    for turn in range(10):
        span = nexus.start_span(
            trace_id=trace["trace_id"],
            name=f"llm_call_turn_{turn + 1}",
            metadata={"turn": turn + 1, "message_count": len(messages)}
        )
        response = openai.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            max_tokens=1024
        )
        usage = response.usage
        cumulative_prompt_tokens += usage.prompt_tokens
        cumulative_completion_tokens += usage.completion_tokens
        estimated_cost_usd = round(
            cumulative_prompt_tokens * 0.0000025 +
            cumulative_completion_tokens * 0.000010,
            6
        )
        nexus.end_span(
            span_id=span["id"],
            status="success",
            metadata={
                "prompt_tokens": usage.prompt_tokens,
                "completion_tokens": usage.completion_tokens,
                "cumulative_prompt_tokens": cumulative_prompt_tokens,
                "cumulative_completion_tokens": cumulative_completion_tokens,
                "estimated_cost_usd": estimated_cost_usd,
                "model": "gpt-4o"
            }
        )
        reply = response.choices[0].message.content
        if response.choices[0].finish_reason == "stop":
            nexus.end_trace(
                trace_id=trace["trace_id"],
                status="success",
                metadata={
                    "total_prompt_tokens": cumulative_prompt_tokens,
                    "total_completion_tokens": cumulative_completion_tokens,
                    "total_cost_usd": estimated_cost_usd,
                    "turns": turn + 1
                }
            )
            return reply
        messages.append({"role": "assistant", "content": reply})
        messages.append({"role": "user", "content": "Continue."})

    nexus.end_trace(trace_id=trace["trace_id"], status="error",
                    metadata={"reason": "max_turns_exceeded"})
    return "Agent did not complete within turn limit."
```
TypeScript — Anthropic Claude with token metadata:
```typescript
import Anthropic from '@anthropic-ai/sdk'
import { NexusClient } from 'keylightdigital-nexus'

const nexus = new NexusClient({ apiKey: process.env.NEXUS_API_KEY! })
const anthropic = new Anthropic()

async function runAgent(userQuery: string, taskName: string): Promise<string> {
  const trace = await nexus.startTrace({ name: taskName, metadata: { query: userQuery } })
  let cumulativeInputTokens = 0
  let cumulativeOutputTokens = 0
  const messages: Anthropic.MessageParam[] = [{ role: 'user', content: userQuery }]

  for (let turn = 0; turn < 10; turn++) {
    const span = await nexus.startSpan(trace.id, {
      name: `llm_call_turn_${turn + 1}`,
      metadata: { turn: turn + 1, messageCount: messages.length }
    })
    const response = await anthropic.messages.create({
      model: 'claude-3-5-sonnet-20241022',
      max_tokens: 1024,
      messages
    })
    const usage = response.usage
    cumulativeInputTokens += usage.input_tokens
    cumulativeOutputTokens += usage.output_tokens
    const estimatedCostUsd = parseFloat((
      cumulativeInputTokens * 0.000003 +
      cumulativeOutputTokens * 0.000015
    ).toFixed(6))
    await nexus.endSpan(span.id, {
      status: 'success',
      metadata: {
        inputTokens: usage.input_tokens,
        outputTokens: usage.output_tokens,
        cumulativeInputTokens,
        cumulativeOutputTokens,
        estimatedCostUsd,
        model: 'claude-3-5-sonnet-20241022'
      }
    })
    const reply = response.content[0].type === 'text' ? response.content[0].text : ''
    if (response.stop_reason === 'end_turn') {
      await nexus.endTrace(trace.id, {
        status: 'success',
        metadata: {
          totalInputTokens: cumulativeInputTokens,
          totalOutputTokens: cumulativeOutputTokens,
          totalCostUsd: estimatedCostUsd,
          turns: turn + 1
        }
      })
      return reply
    }
    messages.push({ role: 'assistant', content: reply })
    messages.push({ role: 'user', content: 'Continue.' })
  }

  await nexus.endTrace(trace.id, { status: 'error', metadata: { reason: 'max_turns_exceeded' } })
  return 'Agent did not complete within turn limit.'
}
```
Model pricing reference table
Cost-per-token varies significantly by model. Using the wrong pricing constant will silently underestimate real spend. Current rates as of April 2026:
| Model | Input ($/1M tokens) | Output ($/1M tokens) |
|---|---|---|
| GPT-4o | $2.50 | $10.00 |
| GPT-4o mini | $0.15 | $0.60 |
| Claude 3.5 Sonnet | $3.00 | $15.00 |
| Claude 3.5 Haiku | $0.80 | $4.00 |
| Gemini 1.5 Flash | $0.075 | $0.30 |
| Gemini 1.5 Pro | $3.50 | $10.50 |
Store these as a lookup table in your codebase rather than hardcoding per-call. When model prices change, one update propagates everywhere:
```python
# Dollars per token, derived from the $/1M rates in the table above.
MODEL_PRICING = {
    "gpt-4o":                     {"input": 0.0000025,   "output": 0.000010},
    "gpt-4o-mini":                {"input": 0.00000015,  "output": 0.0000006},
    "claude-3-5-sonnet-20241022": {"input": 0.000003,    "output": 0.000015},
    "claude-3-5-haiku-20241022":  {"input": 0.0000008,   "output": 0.000004},
    "gemini-1.5-flash":           {"input": 0.000000075, "output": 0.0000003},
    "gemini-1.5-pro":             {"input": 0.0000035,   "output": 0.0000105},
}

def compute_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    rates = MODEL_PRICING.get(model, {"input": 0, "output": 0})
    return round(
        input_tokens * rates["input"] + output_tokens * rates["output"],
        6
    )
```
Building a token budget guard
A budget guard checks cumulative token spend before each LLM call and aborts the run if it would exceed a threshold. This stops runaway loops before they become runaway invoices.
```python
class BudgetExceededError(RuntimeError):
    pass

class TokenBudgetGuard:
    def __init__(self, model: str, max_cost_usd: float):
        self.model = model
        self.max_cost_usd = max_cost_usd
        self.cumulative_input = 0
        self.cumulative_output = 0

    @property
    def current_cost(self) -> float:
        return compute_cost(self.model, self.cumulative_input, self.cumulative_output)

    @property
    def budget_pct(self) -> float:
        return (self.current_cost / self.max_cost_usd) * 100 if self.max_cost_usd > 0 else 0

    def check(self, nexus, trace_id: str) -> None:
        """Raise BudgetExceededError if current spend >= max_cost_usd."""
        if self.current_cost >= self.max_cost_usd:
            nexus.end_trace(
                trace_id=trace_id,
                status="error",
                metadata={
                    "reason": "budget_exceeded",
                    "cost_usd": self.current_cost,
                    "budget_usd": self.max_cost_usd,
                    "budget_pct": self.budget_pct
                }
            )
            raise BudgetExceededError(
                f"Token budget exceeded: spent ${self.current_cost:.4f} "
                f"of ${self.max_cost_usd:.4f} limit"
            )

    def record(self, input_tokens: int, output_tokens: int) -> None:
        self.cumulative_input += input_tokens
        self.cumulative_output += output_tokens

# Usage in your agent loop
def run_agent_with_budget(user_query: str, max_cost_usd: float = 0.05) -> str:
    guard = TokenBudgetGuard(model="gpt-4o", max_cost_usd=max_cost_usd)
    trace = nexus.start_trace(name="budget_guarded_agent",
                              metadata={"query": user_query, "budget_usd": max_cost_usd})
    messages = [{"role": "user", "content": user_query}]
    for turn in range(20):
        guard.check(nexus, trace["trace_id"])  # abort before the next call if over budget
        span = nexus.start_span(trace_id=trace["trace_id"], name=f"llm_call_turn_{turn + 1}")
        response = openai.chat.completions.create(model="gpt-4o", messages=messages)
        usage = response.usage
        guard.record(usage.prompt_tokens, usage.completion_tokens)
        nexus.end_span(
            span_id=span["id"],
            status="success",
            metadata={
                "prompt_tokens": usage.prompt_tokens,
                "completion_tokens": usage.completion_tokens,
                "cost_usd": guard.current_cost,
                "budget_pct": round(guard.budget_pct, 1)
            }
        )
        # ... rest of loop logic
```
Alerting at 80% budget utilization
Rather than waiting for hard failure, trigger an alert when a session hits 80% of its budget. This gives you time to investigate before the next call fails.
```python
import os
import requests

ALERT_WEBHOOK = os.environ.get("SLACK_WEBHOOK_URL")

def send_budget_alert(task_name: str, cost_usd: float, budget_usd: float,
                      pct: float, trace_id: str) -> None:
    if not ALERT_WEBHOOK:
        return
    requests.post(ALERT_WEBHOOK, json={
        "text": (
            f":warning: *Token budget at {pct:.0f}%* for `{task_name}`\n"
            f"Spent: `${cost_usd:.4f}` of `${budget_usd:.4f}` budget\n"
            f"Trace: `{trace_id}`"
        )
    }, timeout=5)

class TokenBudgetGuard:
    def __init__(self, model: str, max_cost_usd: float, task_name: str = "agent"):
        self.model = model
        self.max_cost_usd = max_cost_usd
        self.task_name = task_name
        self.cumulative_input = 0
        self.cumulative_output = 0
        self._alert_sent = False

    # current_cost, budget_pct, check(), and record() are unchanged
    # from the class definition above.

    def record_and_alert(self, input_tokens: int, output_tokens: int, trace_id: str) -> None:
        self.cumulative_input += input_tokens
        self.cumulative_output += output_tokens
        pct = self.budget_pct
        if pct >= 80 and not self._alert_sent:
            send_budget_alert(
                self.task_name, self.current_cost,
                self.max_cost_usd, pct, trace_id
            )
            self._alert_sent = True
```
For webhook-based alerting with more detail — including the specific span that pushed the agent over the threshold — see the AI agent alerting guide.
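As a minimal sketch of that idea — the function name and its extra fields are hypothetical, not part of the Nexus SDK — the alert text can name the span whose call crossed the threshold:

```python
def format_budget_alert(task_name: str, cost_usd: float, budget_usd: float,
                        pct: float, trace_id: str, span_id: str,
                        span_cost_usd: float) -> str:
    """Build the Slack message body, naming the span whose call crossed the line."""
    return (
        f":warning: *Token budget at {pct:.0f}%* for `{task_name}`\n"
        f"Spent: `${cost_usd:.4f}` of `${budget_usd:.4f}` budget\n"
        f"Trace: `{trace_id}`\n"
        f"Crossed threshold at span `{span_id}` (call cost `${span_cost_usd:.4f}`)"
    )
```

Threading the current `span_id` through `record_and_alert` is a one-line change to the guard above.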
TypeScript budget guard
```typescript
import { NexusClient } from 'keylightdigital-nexus'

const MODEL_PRICING: Record<string, { input: number; output: number }> = {
  'gpt-4o': { input: 0.0000025, output: 0.00001 },
  'gpt-4o-mini': { input: 0.00000015, output: 0.0000006 },
  'claude-3-5-sonnet-20241022': { input: 0.000003, output: 0.000015 },
  'gemini-1.5-flash': { input: 0.000000075, output: 0.0000003 },
}

class TokenBudgetGuard {
  private cumulativeInput = 0
  private cumulativeOutput = 0
  private alertSent = false

  constructor(
    private model: string,
    private maxCostUsd: number,
    private taskName: string
  ) {}

  get currentCost(): number {
    const rates = MODEL_PRICING[this.model] ?? { input: 0, output: 0 }
    return this.cumulativeInput * rates.input + this.cumulativeOutput * rates.output
  }

  get budgetPct(): number {
    return this.maxCostUsd > 0 ? (this.currentCost / this.maxCostUsd) * 100 : 0
  }

  async record(inputTokens: number, outputTokens: number, traceId: string): Promise<void> {
    this.cumulativeInput += inputTokens
    this.cumulativeOutput += outputTokens
    if (this.budgetPct >= 80 && !this.alertSent) {
      this.alertSent = true
      const webhookUrl = process.env.SLACK_WEBHOOK_URL
      if (webhookUrl) {
        const pct = this.budgetPct.toFixed(0)
        const cost = this.currentCost.toFixed(4)
        const budget = this.maxCostUsd.toFixed(4)
        const msg = ':warning: *Token budget at ' + pct + '%* for ' + this.taskName +
          '\nSpent: $' + cost + ' of $' + budget + '\nTrace: ' + traceId
        await fetch(webhookUrl, {
          method: 'POST',
          headers: { 'Content-Type': 'application/json' },
          body: JSON.stringify({ text: msg })
        })
      }
    }
  }

  checkHardLimit(nexus: NexusClient, traceId: string): void {
    if (this.currentCost >= this.maxCostUsd) {
      nexus.endTrace(traceId, {
        status: 'error',
        metadata: { reason: 'budget_exceeded', costUsd: this.currentCost, budgetUsd: this.maxCostUsd }
      })
      const msg = 'Budget exceeded: $' + this.currentCost.toFixed(4) + ' of $' + this.maxCostUsd.toFixed(4)
      throw new Error(msg)
    }
  }
}
```
Comparing token efficiency across model versions
Once you have per-span token metadata in Nexus, you can compare models systematically. Run the same workload through two model versions and group traces by model name. The question to answer: does the cheaper model produce the same output quality with fewer total tokens, or does it compensate by being more verbose?
```python
import os
from openai import OpenAI
from nexus_sdk import NexusClient

nexus = NexusClient(api_key=os.environ["NEXUS_API_KEY"])
openai = OpenAI()

MODELS_TO_COMPARE = [
    ("gpt-4o", 0.0000025, 0.000010),
    ("gpt-4o-mini", 0.00000015, 0.0000006),
]

BENCHMARK_QUERIES = [
    "Summarize the key arguments in favor of carbon pricing.",
    "Write a Python function that merges two sorted lists.",
    "Explain the CAP theorem in plain language.",
    "Draft a polite decline to a sales pitch email.",
    "List five questions to validate a B2B SaaS idea.",
]

results = {}
for model, input_rate, output_rate in MODELS_TO_COMPARE:
    totals = {"input": 0, "output": 0, "cost": 0.0}
    for query in BENCHMARK_QUERIES:
        trace = nexus.start_trace(
            name="model_comparison",
            metadata={"model": model, "query": query}
        )
        span = nexus.start_span(trace_id=trace["trace_id"], name="llm_call",
                                metadata={"model": model})
        response = openai.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": query}]
        )
        usage = response.usage
        cost = round(usage.prompt_tokens * input_rate + usage.completion_tokens * output_rate, 6)
        totals["input"] += usage.prompt_tokens
        totals["output"] += usage.completion_tokens
        totals["cost"] += cost
        nexus.end_span(span_id=span["id"], status="success",
                       metadata={"prompt_tokens": usage.prompt_tokens,
                                 "completion_tokens": usage.completion_tokens,
                                 "cost_usd": cost, "model": model})
        nexus.end_trace(trace_id=trace["trace_id"], status="success",
                        metadata={"total_cost_usd": cost, "model": model})
    results[model] = totals

for model, totals in results.items():
    avg_cost = totals["cost"] / len(BENCHMARK_QUERIES)
    print(f"{model}: avg input={totals['input'] // len(BENCHMARK_QUERIES)} "
          f"output={totals['output'] // len(BENCHMARK_QUERIES)} cost={avg_cost:.5f}")
```
In Nexus, filter by `metadata.model` and compare the `estimated_cost_usd` distribution. Outlier traces — tasks where the cheaper model used 3× more tokens — point to prompts that need few-shot examples or tighter constraints to work well with smaller models.
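One way to flag those outliers offline, assuming you export per-query token totals for each model from your traces — the data shapes and the 3× threshold here are illustrative, not a Nexus API:

```python
# Flag queries where a candidate model burned far more tokens than the baseline.
def find_token_outliers(baseline: dict[str, int], candidate: dict[str, int],
                        ratio: float = 3.0) -> list[str]:
    """Return queries where the candidate's total tokens >= ratio * baseline's."""
    outliers = []
    for query, base_tokens in baseline.items():
        cand_tokens = candidate.get(query, 0)
        if base_tokens > 0 and cand_tokens >= ratio * base_tokens:
            outliers.append(query)
    return outliers

gpt_4o = {"summarize carbon pricing": 900, "merge sorted lists": 450}
gpt_4o_mini = {"summarize carbon pricing": 1100, "merge sorted lists": 1600}
print(find_token_outliers(gpt_4o, gpt_4o_mini))
# → ['merge sorted lists']
```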
What to monitor in production
- Cost per trace by task type: Group by `metadata.task_name` to identify which workflows are expensive outliers. A task with high cost variance usually means one prompt is underspecified and the model compensates with verbose output.
- Budget utilization distribution: What percentage of your traces hit 80%+ of their budget? If it's more than 5%, lower the per-trace budget or add a mid-run summarization step to compress context.
- Output token ratio: `completion_tokens / prompt_tokens` above 0.3 often means the model is being asked to produce long-form output where a structured shorter output would do. Use this ratio to identify prompts ripe for optimization.
- Turn count vs. cost correlation: Agents that loop more than 5 turns on average are accumulating context across turns. Instrument turn count alongside cumulative token count to find the exact turn where cost starts compounding.
- Model upgrade regression: After switching model versions, compare median `estimated_cost_usd` before and after. An increase in output tokens often signals the new model version is more verbose — a calibrated system prompt can bring it back in line.
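The budget utilization check above can be computed directly from exported trace metadata. A sketch, assuming each trace dict carries the final `budget_pct` the guard recorded (the field name matches the guard's metadata but the export shape is illustrative):

```python
def pct_traces_near_budget(traces: list[dict], threshold_pct: float = 80.0) -> float:
    """Percentage of traces whose recorded budget_pct reached the threshold."""
    if not traces:
        return 0.0
    hot = sum(1 for t in traces if t.get("budget_pct", 0.0) >= threshold_pct)
    return 100.0 * hot / len(traces)

traces = [{"budget_pct": 35.0}, {"budget_pct": 92.0},
          {"budget_pct": 61.0}, {"budget_pct": 84.0}]
print(pct_traces_near_budget(traces))
# → 50.0
```

A result above 5% is the signal to lower the per-trace budget or compress context mid-run.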
Next steps
Token budgets are the first line of defense against runaway AI agent costs. Recording `prompt_tokens` and `completion_tokens` at the span level, computing cost per trace against a pricing table, and enforcing a hard limit before each LLM call gives you cost visibility that no provider dashboard offers — because only you can see across tasks, models, and users at once. Sign up for a free Nexus account to start capturing token usage from your agents today.
Monitor token budgets across all your AI agents
Free tier, no credit card required. Span-level token tracking in under 5 minutes.