2026-04-28 · 9 min read

Monitoring AI Agent Token Budget and Cost Thresholds with Nexus

OpenAI charges per token. Claude charges per token. A runaway agent in a bad loop can spend $50 in minutes before anyone notices. Here's how to record token usage as Nexus span metadata, compute cost per trace using model pricing tables, build a token budget guard that aborts a run before it exceeds your limit, and alert when a session hits 80% of its budget.

Why token budgets matter in production

Most teams discover their token cost problem in one of two ways: a Stripe invoice that's three times higher than expected, or an agent that loops for 20 minutes before timing out. Both are avoidable with a token budget guard built on trace data.

The per-token prices are small individually — GPT-4o costs $2.50 per million input tokens, Claude 3.5 Sonnet costs $3.00 per million — but agents multiply them. A research agent that makes 15 tool calls per task, each with a 10,000-token context window, costs about $0.45 per run. At 500 runs per day, that's $225/day before you've added caching. A single bad prompt that inflates the context to 80,000 tokens makes every affected run several times more expensive.
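The arithmetic behind those figures is worth sanity-checking yourself. A quick sketch (the ~500 completion tokens per call is an assumption to make the numbers work, not a figure from any provider):

```python
# Back-of-envelope cost check for the research-agent example above.
GPT4O_INPUT_RATE = 2.50 / 1_000_000    # $ per input token
GPT4O_OUTPUT_RATE = 10.00 / 1_000_000  # $ per output token

calls_per_run = 15
input_tokens_per_call = 10_000
output_tokens_per_call = 500  # assumption, not stated above

cost_per_run = (
    calls_per_run * input_tokens_per_call * GPT4O_INPUT_RATE
    + calls_per_run * output_tokens_per_call * GPT4O_OUTPUT_RATE
)
print(f"${cost_per_run:.2f} per run")        # $0.45 per run
print(f"${cost_per_run * 500:.0f} per day")  # $225 per day at 500 runs
```

Input tokens dominate: $0.375 of the $0.45 is context, which is why context bloat is the cost multiplier to watch.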

The solution is to track tokens at the span level — each LLM call records its own prompt_tokens and completion_tokens — then aggregate them at the trace level and enforce a budget before the next call is made.

Recording token usage as Nexus span metadata

Every LLM API response returns token counts in its usage object. Record them as span metadata so they're searchable and aggregatable in Nexus.

Python — OpenAI with token metadata:

import os
from openai import OpenAI
from nexus_sdk import NexusClient

nexus = NexusClient(api_key=os.environ["NEXUS_API_KEY"])
openai = OpenAI()

def run_agent(user_query: str, task_name: str) -> str:
    trace = nexus.start_trace(name=task_name, metadata={"query": user_query})
    cumulative_prompt_tokens = 0
    cumulative_completion_tokens = 0
    messages = [{"role": "user", "content": user_query}]

    for turn in range(10):
        span = nexus.start_span(
            trace_id=trace["trace_id"],
            name=f"llm_call_turn_{turn + 1}",
            metadata={"turn": turn + 1, "message_count": len(messages)}
        )

        response = openai.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            max_tokens=1024
        )

        usage = response.usage
        cumulative_prompt_tokens += usage.prompt_tokens
        cumulative_completion_tokens += usage.completion_tokens
        estimated_cost_usd = round(
            cumulative_prompt_tokens * 0.0000025 +     # gpt-4o: $2.50 / 1M input tokens
            cumulative_completion_tokens * 0.000010,   # gpt-4o: $10.00 / 1M output tokens
            6
        )

        nexus.end_span(
            span_id=span["id"],
            status="success",
            metadata={
                "prompt_tokens": usage.prompt_tokens,
                "completion_tokens": usage.completion_tokens,
                "cumulative_prompt_tokens": cumulative_prompt_tokens,
                "cumulative_completion_tokens": cumulative_completion_tokens,
                "estimated_cost_usd": estimated_cost_usd,
                "model": "gpt-4o"
            }
        )

        reply = response.choices[0].message.content
        if response.choices[0].finish_reason == "stop":
            nexus.end_trace(
                trace_id=trace["trace_id"],
                status="success",
                metadata={
                    "total_prompt_tokens": cumulative_prompt_tokens,
                    "total_completion_tokens": cumulative_completion_tokens,
                    "total_cost_usd": estimated_cost_usd,
                    "turns": turn + 1
                }
            )
            return reply
        messages.append({"role": "assistant", "content": reply})
        messages.append({"role": "user", "content": "Continue."})

    nexus.end_trace(trace_id=trace["trace_id"], status="error",
                    metadata={"reason": "max_turns_exceeded"})
    return "Agent did not complete within turn limit."

TypeScript — Anthropic Claude with token metadata:

import Anthropic from '@anthropic-ai/sdk'
import { NexusClient } from 'keylightdigital-nexus'

const nexus = new NexusClient({ apiKey: process.env.NEXUS_API_KEY! })
const anthropic = new Anthropic()

async function runAgent(userQuery: string, taskName: string): Promise<string> {
  const trace = await nexus.startTrace({ name: taskName, metadata: { query: userQuery } })
  let cumulativeInputTokens = 0
  let cumulativeOutputTokens = 0
  const messages: Anthropic.MessageParam[] = [{ role: 'user', content: userQuery }]

  for (let turn = 0; turn < 10; turn++) {
    const span = await nexus.startSpan(trace.id, {
      name: `llm_call_turn_${turn + 1}`,
      metadata: { turn: turn + 1, messageCount: messages.length }
    })

    const response = await anthropic.messages.create({
      model: 'claude-3-5-sonnet-20241022',
      max_tokens: 1024,
      messages
    })

    const usage = response.usage
    cumulativeInputTokens += usage.input_tokens
    cumulativeOutputTokens += usage.output_tokens
    const estimatedCostUsd = parseFloat((
      cumulativeInputTokens * 0.000003 +    // Claude 3.5 Sonnet: $3.00 / 1M input tokens
      cumulativeOutputTokens * 0.000015     // Claude 3.5 Sonnet: $15.00 / 1M output tokens
    ).toFixed(6))

    await nexus.endSpan(span.id, {
      status: 'success',
      metadata: {
        inputTokens: usage.input_tokens,
        outputTokens: usage.output_tokens,
        cumulativeInputTokens,
        cumulativeOutputTokens,
        estimatedCostUsd,
        model: 'claude-3-5-sonnet-20241022'
      }
    })

    const reply = response.content[0].type === 'text' ? response.content[0].text : ''
    if (response.stop_reason === 'end_turn') {
      await nexus.endTrace(trace.id, {
        status: 'success',
        metadata: {
          totalInputTokens: cumulativeInputTokens,
          totalOutputTokens: cumulativeOutputTokens,
          totalCostUsd: estimatedCostUsd,
          turns: turn + 1
        }
      })
      return reply
    }
    messages.push({ role: 'assistant', content: reply })
    messages.push({ role: 'user', content: 'Continue.' })
  }

  await nexus.endTrace(trace.id, { status: 'error', metadata: { reason: 'max_turns_exceeded' } })
  return 'Agent did not complete within turn limit.'
}

Model pricing reference table

Cost per token varies significantly by model, and using the wrong pricing constant will silently underestimate real spend. Rates as of April 2026:

Model                 Input ($/1M tokens)   Output ($/1M tokens)
GPT-4o                $2.50                 $10.00
GPT-4o mini           $0.15                 $0.60
Claude 3.5 Sonnet     $3.00                 $15.00
Claude 3.5 Haiku      $0.80                 $4.00
Gemini 1.5 Flash      $0.075                $0.30
Gemini 1.5 Pro        $3.50                 $10.50

Store these as a lookup table in your codebase rather than hardcoding per-call. When model prices change, one update propagates everywhere:

MODEL_PRICING = {
    "gpt-4o":                    {"input": 0.0000025, "output": 0.000010},
    "gpt-4o-mini":               {"input": 0.00000015, "output": 0.0000006},
    "claude-3-5-sonnet-20241022":{"input": 0.000003,  "output": 0.000015},
    "claude-3-5-haiku-20241022": {"input": 0.0000008, "output": 0.000004},
    "gemini-1.5-flash":          {"input": 0.000000075, "output": 0.0000003},
    "gemini-1.5-pro":            {"input": 0.0000035, "output": 0.00001050},
}

def compute_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    if model not in MODEL_PRICING:
        # Fail loudly: a silent $0 rate is exactly the underestimate warned about above
        raise KeyError(f"No pricing entry for model {model!r}")
    rates = MODEL_PRICING[model]
    return round(
        input_tokens * rates["input"] + output_tokens * rates["output"],
        6
    )
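A quick sanity check of the lookup approach, with two rows of the table inlined so the snippet runs standalone:

```python
MODEL_PRICING = {
    "gpt-4o":      {"input": 0.0000025, "output": 0.000010},
    "gpt-4o-mini": {"input": 0.00000015, "output": 0.0000006},
}

def compute_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    rates = MODEL_PRICING[model]
    return round(input_tokens * rates["input"] + output_tokens * rates["output"], 6)

# 150k input + 7.5k output on gpt-4o: $0.375 input + $0.075 output
print(compute_cost("gpt-4o", 150_000, 7_500))       # 0.45
# The same token workload on gpt-4o-mini is roughly 17x cheaper
print(compute_cost("gpt-4o-mini", 150_000, 7_500))  # 0.027
```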

Building a token budget guard

A budget guard checks cumulative token spend before each LLM call and aborts the run if it would exceed a threshold. This stops runaway loops before they become runaway invoices.

class TokenBudgetGuard:
    def __init__(self, model: str, max_cost_usd: float):
        self.model = model
        self.max_cost_usd = max_cost_usd
        self.cumulative_input = 0
        self.cumulative_output = 0

    @property
    def current_cost(self) -> float:
        return compute_cost(self.model, self.cumulative_input, self.cumulative_output)

    @property
    def budget_pct(self) -> float:
        return (self.current_cost / self.max_cost_usd) * 100 if self.max_cost_usd > 0 else 0

    def check(self, nexus, trace_id: str) -> None:
        """Raise BudgetExceeded if current spend >= max_cost_usd."""
        if self.current_cost >= self.max_cost_usd:
            nexus.end_trace(
                trace_id=trace_id,
                status="error",
                metadata={
                    "reason": "budget_exceeded",
                    "cost_usd": self.current_cost,
                    "budget_usd": self.max_cost_usd,
                    "budget_pct": self.budget_pct
                }
            )
            raise BudgetExceededError(
                f"Token budget exceeded: spent ${self.current_cost:.4f} "
                f"of ${self.max_cost_usd:.4f} limit"
            )

    def record(self, input_tokens: int, output_tokens: int) -> None:
        self.cumulative_input += input_tokens
        self.cumulative_output += output_tokens

class BudgetExceededError(RuntimeError):
    pass

# Usage in your agent loop
def run_agent_with_budget(user_query: str, max_cost_usd: float = 0.05) -> str:
    guard = TokenBudgetGuard(model="gpt-4o", max_cost_usd=max_cost_usd)
    trace = nexus.start_trace(name="budget_guarded_agent",
                               metadata={"query": user_query, "budget_usd": max_cost_usd})

    for turn in range(20):
        guard.check(nexus, trace["trace_id"])  # abort before the next call if over budget

        span = nexus.start_span(trace_id=trace["trace_id"], name=f"llm_call_turn_{turn + 1}")
        response = openai.chat.completions.create(model="gpt-4o", messages=messages)
        usage = response.usage
        guard.record(usage.prompt_tokens, usage.completion_tokens)

        nexus.end_span(
            span_id=span["id"],
            status="success",
            metadata={
                "prompt_tokens": usage.prompt_tokens,
                "completion_tokens": usage.completion_tokens,
                "cost_usd": guard.current_cost,
                "budget_pct": round(guard.budget_pct, 1)
            }
        )
        # ... rest of loop logic

Alerting at 80% budget utilization

Rather than waiting for hard failure, trigger an alert when a session hits 80% of its budget. This gives you time to investigate before the next call fails.

import os
import requests

ALERT_WEBHOOK = os.environ.get("SLACK_WEBHOOK_URL")

def send_budget_alert(task_name: str, cost_usd: float, budget_usd: float, pct: float, trace_id: str):
    if not ALERT_WEBHOOK:
        return
    requests.post(ALERT_WEBHOOK, json={
        "text": (
            f":warning: *Token budget at {pct:.0f}%* for `{task_name}`\n"
            f"Spent: `${cost_usd:.4f}` of `${budget_usd:.4f}` budget\n"
            f"Trace: `{trace_id}`"
        )
    }, timeout=5)

# Same TokenBudgetGuard as above, now with a one-shot 80% alert; the
# current_cost and budget_pct properties and check() are unchanged and omitted here.
class TokenBudgetGuard:
    def __init__(self, model: str, max_cost_usd: float, task_name: str = "agent"):
        self.model = model
        self.max_cost_usd = max_cost_usd
        self.task_name = task_name
        self.cumulative_input = 0
        self.cumulative_output = 0
        self._alert_sent = False

    def record_and_alert(self, input_tokens: int, output_tokens: int, trace_id: str):
        self.cumulative_input += input_tokens
        self.cumulative_output += output_tokens
        pct = self.budget_pct
        if pct >= 80 and not self._alert_sent:
            send_budget_alert(
                self.task_name, self.current_cost,
                self.max_cost_usd, pct, trace_id
            )
            self._alert_sent = True

For webhook-based alerting with more detail — including the specific span that pushed the agent over the threshold — see the AI agent alerting guide.
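As a minimal sketch of that idea (the payload shape and the span_id parameter are illustrative, not part of the helper above), the alert message can name the span whose tokens crossed the threshold:

```python
def build_budget_alert_payload(task_name: str, cost_usd: float, budget_usd: float,
                               pct: float, trace_id: str, span_id: str) -> dict:
    """Slack-style payload that names the span that tipped the budget."""
    return {
        "text": (
            f":warning: *Token budget at {pct:.0f}%* for `{task_name}`\n"
            f"Spent: `${cost_usd:.4f}` of `${budget_usd:.4f}` budget\n"
            f"Trace: `{trace_id}` · Triggering span: `{span_id}`"
        )
    }

payload = build_budget_alert_payload("research_agent", 0.0412, 0.05, 82.4,
                                     "tr_123", "sp_turn_7")
# payload["text"] ends with: Triggering span: `sp_turn_7`
```

Pass the most recent span id from your agent loop when the 80% check fires; that points the on-call engineer directly at the turn whose context grew.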

TypeScript budget guard

import { NexusClient } from 'keylightdigital-nexus'

const MODEL_PRICING: Record<string, { input: number; output: number }> = {
  'gpt-4o':                     { input: 0.0000025,    output: 0.00001 },
  'gpt-4o-mini':                { input: 0.00000015,   output: 0.0000006 },
  'claude-3-5-sonnet-20241022': { input: 0.000003,     output: 0.000015 },
  'gemini-1.5-flash':           { input: 0.000000075,  output: 0.0000003 },
}

class TokenBudgetGuard {
  private cumulativeInput = 0
  private cumulativeOutput = 0
  private alertSent = false

  constructor(
    private model: string,
    private maxCostUsd: number,
    private taskName: string
  ) {}

  get currentCost(): number {
    const rates = MODEL_PRICING[this.model]
    if (!rates) throw new Error(`No pricing entry for model ${this.model}`)
    return this.cumulativeInput * rates.input + this.cumulativeOutput * rates.output
  }

  get budgetPct(): number {
    return this.maxCostUsd > 0 ? (this.currentCost / this.maxCostUsd) * 100 : 0
  }

  async record(inputTokens: number, outputTokens: number, traceId: string): Promise<void> {
    this.cumulativeInput += inputTokens
    this.cumulativeOutput += outputTokens

    if (this.budgetPct >= 80 && !this.alertSent) {
      this.alertSent = true
      const webhookUrl = process.env.SLACK_WEBHOOK_URL
      if (webhookUrl) {
        const pct = this.budgetPct.toFixed(0)
        const cost = this.currentCost.toFixed(4)
        const budget = this.maxCostUsd.toFixed(4)
        const msg = ':warning: *Token budget at ' + pct + '%* for ' + this.taskName +
          '\nSpent: $' + cost + ' of $' + budget + '\nTrace: ' + traceId
        await fetch(webhookUrl, {
          method: 'POST',
          headers: { 'Content-Type': 'application/json' },
          body: JSON.stringify({ text: msg })
        })
      }
    }
  }

  async checkHardLimit(nexus: NexusClient, traceId: string): Promise<void> {
    if (this.currentCost >= this.maxCostUsd) {
      await nexus.endTrace(traceId, {
        status: 'error',
        metadata: { reason: 'budget_exceeded', costUsd: this.currentCost, budgetUsd: this.maxCostUsd }
      })
      const msg = 'Budget exceeded: $' + this.currentCost.toFixed(4) + ' of $' + this.maxCostUsd.toFixed(4)
      throw new Error(msg)
    }
  }
}

Comparing token efficiency across model versions

Once you have per-span token metadata in Nexus, you can compare models systematically. Run the same workload through two model versions and group traces by model name. The question to answer: does the cheaper model produce the same output quality with fewer total tokens, or does it compensate by being more verbose?

import os
from openai import OpenAI
from nexus_sdk import NexusClient

nexus = NexusClient(api_key=os.environ["NEXUS_API_KEY"])
openai = OpenAI()

MODELS_TO_COMPARE = [
    ("gpt-4o",      0.0000025,  0.000010),
    ("gpt-4o-mini", 0.00000015, 0.0000006),
]

BENCHMARK_QUERIES = [
    "Summarize the key arguments in favor of carbon pricing.",
    "Write a Python function that merges two sorted lists.",
    "Explain the CAP theorem in plain language.",
    "Draft a polite decline to a sales pitch email.",
    "List five questions to validate a B2B SaaS idea.",
]

results = {}

for model, input_rate, output_rate in MODELS_TO_COMPARE:
    totals = {"input": 0, "output": 0, "cost": 0.0}

    for query in BENCHMARK_QUERIES:
        trace = nexus.start_trace(
            name="model_comparison",
            metadata={"model": model, "query": query}
        )
        span = nexus.start_span(trace_id=trace["trace_id"], name="llm_call",
                                 metadata={"model": model})
        response = openai.chat.completions.create(model=model,
                                                   messages=[{"role": "user", "content": query}])
        usage = response.usage
        cost = round(usage.prompt_tokens * input_rate + usage.completion_tokens * output_rate, 6)
        totals["input"] += usage.prompt_tokens
        totals["output"] += usage.completion_tokens
        totals["cost"] += cost

        nexus.end_span(span_id=span["id"], status="success",
                       metadata={"prompt_tokens": usage.prompt_tokens,
                                  "completion_tokens": usage.completion_tokens,
                                  "cost_usd": cost, "model": model})
        nexus.end_trace(trace_id=trace["trace_id"], status="success",
                        metadata={"total_cost_usd": cost, "model": model})

    results[model] = totals

for model, totals in results.items():
    avg_cost = totals["cost"] / len(BENCHMARK_QUERIES)
    print(f"{model}: avg input={totals['input']//len(BENCHMARK_QUERIES)} "
          f"output={totals['output']//len(BENCHMARK_QUERIES)} cost={avg_cost:.5f}")

In Nexus, filter by metadata.model and compare the estimated_cost_usd distribution. Outlier traces — tasks where the cheaper model used 3× more tokens — point to prompts that need few-shot examples or tighter constraints to work well with smaller models.
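The outlier check itself is simple once the per-trace costs are in hand. A minimal sketch, operating on plain (model, cost) records you have already exported from your traces (the export mechanism varies and is not shown here):

```python
from statistics import median

# Each record: (model, total_cost_usd) pulled from trace metadata.
records = [
    ("gpt-4o-mini", 0.0009), ("gpt-4o-mini", 0.0011),
    ("gpt-4o-mini", 0.0010), ("gpt-4o-mini", 0.0042),  # outlier
    ("gpt-4o", 0.0051), ("gpt-4o", 0.0048),
]

def find_cost_outliers(records, factor=3.0):
    """Flag traces costing more than `factor` x their model's median cost."""
    by_model = {}
    for model, cost in records:
        by_model.setdefault(model, []).append(cost)
    medians = {m: median(costs) for m, costs in by_model.items()}
    return [(m, c) for m, c in records if c > factor * medians[m]]

print(find_cost_outliers(records))  # [('gpt-4o-mini', 0.0042)]
```

Comparing against each model's own median, rather than a global threshold, keeps an expensive-but-normal gpt-4o trace from drowning out a genuinely anomalous mini trace.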

What to monitor in production

Once the guard is live, four signals cover most failure modes: the per-trace total_cost_usd distribution (a growing right tail usually means context bloat), the budget_pct recorded on each span (sessions that routinely cross 80% need a larger budget or a cheaper model), the rate of budget_exceeded trace errors (each one is an aborted run a user saw fail), and per-model cumulative token counts after every prompt change.

Next steps

Token budgets are the first line of defense against runaway AI agent costs. Recording prompt_tokens and completion_tokens at the span level, computing cost per trace against a pricing table, and enforcing a hard limit before each LLM call gives you cost visibility that no provider dashboard offers — because only you can see across tasks, models, and users at once. Sign up for a free Nexus account to start capturing token usage from your agents today.

Monitor token budgets across all your AI agents

Free tier, no credit card required. Span-level token tracking in under 5 minutes.