# Tracking Token Costs for AI Agents in Production
Token costs are the biggest variable expense in AI agent systems — but most teams have no per-agent cost visibility. A trace that ran for 3 seconds may cost $0.001 or $0.15 depending on model and prompt size. Here's how to record, aggregate, and alert on token costs using Nexus.
## Why per-agent token cost visibility matters
Most teams know their total OpenAI bill. Few know which agent or which type of task is responsible for the majority of that spend. Without per-agent cost data, you can't answer the questions that matter: Is the customer support agent 10x more expensive than the document summarizer? Is a prompt regression in staging costing $0.50/run instead of $0.05?
The path to answering these is to record token usage as span metadata on every LLM call. Then aggregate by agent ID in your trace data. Nexus stores this metadata alongside traces and lets you filter and compare across agents.
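Once spans carry cost metadata, the per-agent rollup is a small exercise. Here is a minimal sketch assuming spans are exported as plain dicts with a hypothetical agent_id field alongside the metadata recorded below; adapt the field names to whatever your export actually produces:

```python
from collections import defaultdict


def cost_by_agent(spans: list[dict]) -> dict[str, float]:
    """Sum estimated cost per agent across exported LLM spans.

    Assumes each span dict has "type", "agent_id", and a "metadata"
    dict that may contain "estimated_cost_usd" (hypothetical shape).
    """
    totals: dict[str, float] = defaultdict(float)
    for span in spans:
        if span.get("type") == "llm":
            cost = span.get("metadata", {}).get("estimated_cost_usd", 0.0)
            totals[span["agent_id"]] += cost
    return dict(totals)
```

A rollup like this is usually enough to spot which agent dominates spend before reaching for a dedicated analytics system.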
## Recording token usage in Python
OpenAI, Anthropic, and most other LLM clients return token usage in the API response. Record it as metadata on the LLM span:
```python
import os

from nexus_sdk import NexusClient
from openai import OpenAI

nexus = NexusClient(api_key=os.environ["NEXUS_API_KEY"])
openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])


def llm_call_with_cost(trace_id: str, system: str, user: str, model: str = "gpt-4o") -> str:
    span = nexus.start_span(trace_id, {
        "name": f"llm:{model}",
        "type": "llm",
        "metadata": {"model": model},
    })
    response = openai_client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    )
    usage = response.usage
    nexus.end_span(span["id"], {
        "output": response.choices[0].message.content,
        "metadata": {
            "prompt_tokens": usage.prompt_tokens,
            "completion_tokens": usage.completion_tokens,
            "total_tokens": usage.total_tokens,
            # Approximate cost for gpt-4o (adjust for your model)
            "estimated_cost_usd": round(
                usage.prompt_tokens * 0.0000025 + usage.completion_tokens * 0.00001,
                6,
            ),
        },
    })
    return response.choices[0].message.content
```
## Recording token usage in TypeScript
```typescript
import OpenAI from 'openai'
import { NexusClient } from '@nexus/sdk'

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY! })
const nexus = new NexusClient({ apiKey: process.env.NEXUS_API_KEY! })

async function llmCallWithCost(
  traceId: string,
  system: string,
  user: string,
  model = 'gpt-4o'
): Promise<string> {
  const span = await nexus.startSpan(traceId, {
    name: 'llm:' + model,
    type: 'llm',
    metadata: { model },
  })
  const response = await openai.chat.completions.create({
    model,
    messages: [
      { role: 'system', content: system },
      { role: 'user', content: user },
    ],
  })
  const usage = response.usage!
  // Approximate cost for gpt-4o (adjust for your model and pricing)
  const estimatedCostUsd =
    usage.prompt_tokens * 0.0000025 + usage.completion_tokens * 0.00001
  await nexus.endSpan(span.id, {
    output: response.choices[0].message.content ?? '',
    metadata: {
      promptTokens: usage.prompt_tokens,
      completionTokens: usage.completion_tokens,
      totalTokens: usage.total_tokens,
      estimatedCostUsd: Math.round(estimatedCostUsd * 1_000_000) / 1_000_000,
    },
  })
  return response.choices[0].message.content ?? ''
}
```
## Per-model cost reference (approximate, 2026)
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| GPT-4o | $2.50 | $10.00 |
| GPT-4o-mini | $0.15 | $0.60 |
| Claude Sonnet 4.6 | $3.00 | $15.00 |
| Claude Haiku 4.5 | $0.80 | $4.00 |
| Gemini 2.0 Flash | $0.10 | $0.40 |
Approximate prices. Always verify against provider pricing pages.
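Rather than hard-coding one model's rates into every call site, the table above can be turned into a small lookup. This is a sketch with the prices hard-coded as of the table; both the PRICES dict and the model keys are illustrative, and the numbers should be refreshed from provider pricing pages:

```python
# USD per 1M tokens: (input_price, output_price). Illustrative values only.
PRICES: dict[str, tuple[float, float]] = {
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
    "claude-sonnet-4.6": (3.00, 15.00),
    "claude-haiku-4.5": (0.80, 4.00),
    "gemini-2.0-flash": (0.10, 0.40),
}


def estimate_cost_usd(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate a single call's cost from its token counts."""
    input_price, output_price = PRICES[model]
    return round(
        prompt_tokens * input_price / 1_000_000
        + completion_tokens * output_price / 1_000_000,
        6,
    )
```

With this in place, the hard-coded multipliers in the earlier examples can be replaced by a single estimate_cost_usd(model, ...) call.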
## What to look for in your cost data
**High prompt token counts:** If prompt_tokens is consistently 3-5x your completion_tokens, you have a system prompt problem. Re-sending a large system prompt on every call is the fastest way to multiply costs. If your provider offers prompt caching, move stable context into a cached prefix.
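That ratio check is trivial to run over recorded spans. A minimal sketch, with the 3x threshold as an assumed default you should tune for your workload:

```python
def prompt_heavy(prompt_tokens: int, completion_tokens: int, threshold: float = 3.0) -> bool:
    """Flag calls where the prompt dominates the completion.

    These calls are the best candidates for prompt trimming or caching.
    The default threshold is illustrative, not a recommendation.
    """
    if completion_tokens <= 0:
        return False
    return prompt_tokens / completion_tokens >= threshold
```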
**Completion token spikes:** A sudden spike in completion_tokens on what should be a short-answer task usually means the model is being verbose. Add an explicit output-length instruction or use a structured output format to constrain it.
**Cost differences across agents:** If one agent costs 10x another on similar tasks, check its system prompt length and whether it is running more turns per task than expected. Both are fixable.
**Cost regressions across model versions:** When testing a new model, compare estimated_cost_usd distributions, not just the mean. A cheaper model can look good on average but have a fat tail of expensive edge-case calls.
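Comparing distributions rather than averages can be as simple as looking at the mean next to a high percentile. A rough sketch using only the standard library (the nearest-rank p95 here is an approximation; a real analysis might use numpy.percentile):

```python
import statistics


def cost_summary(costs: list[float]) -> dict[str, float]:
    """Summarize a per-call cost distribution.

    A model that is cheaper on average can still hide a fat tail of
    expensive edge-case calls, which shows up in p95/max but not mean.
    """
    ordered = sorted(costs)
    p95 = ordered[int(0.95 * (len(ordered) - 1))]  # nearest-rank approximation
    return {"mean": statistics.mean(ordered), "p95": p95, "max": ordered[-1]}
```

Run this per model version over the same task set and compare the p95 and max columns, not just the mean.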
Token cost tracking doesn't require a separate analytics system. Adding three metadata fields per LLM span — prompt_tokens, completion_tokens, estimated_cost_usd — gives you everything you need to understand and control your AI agent spend.
## Track token costs per agent in Nexus
Nexus stores span metadata alongside traces. Record prompt_tokens and completion_tokens in every LLM span and filter by agent in the dashboard. Free tier, no credit card required.
Start free →