5 Metrics Every AI Agent Team Should Track

Most teams start monitoring AI agents with a single metric: did it work? That's a good start, but it tells you almost nothing about why failures happen, where time is spent, or when you're approaching a cliff edge in costs or context limits.

Here are the five metrics that actually matter — what they measure, why they predict problems, and how to track them with the Nexus SDK.

1. Latency (p50, p95, p99)

Why it matters: Average latency lies. An agent that averages 2s but takes 45s at p99 gives 1 in every 100 requests a terrible experience, and at thousands of requests a day that adds up to dozens of painfully slow runs daily. Percentile latency shows you the shape of your distribution.

p50 (median): Half of all requests finish at or below this, so it reflects the typical experience. Use for capacity planning.
p95: The slow tail. If this is 10× your p50, you have a high-variance problem.
p99: The cliff edge. Spikes here often indicate timeouts, retries, or cold start issues.

Nexus captures trace duration automatically from startTrace to trace.end(). Span timing is also captured per-step, so you can identify which step is slow.

TypeScript

const nexus = new NexusClient({ apiKey: 'nxs_...', agentId: 'invoice-processor' })

const trace = await nexus.startTrace({ name: 'process-invoice' })

// Span timing is captured automatically
const span = await trace.addSpan({
  name: 'gpt-4o-extraction',
  input: { prompt: 'Extract fields from...', tokens: 1200 },
})

// ... your LLM call goes here and assigns its parsed result to `extracted` ...

await span.end({
  output: { result: extracted, tokens: 800 },
  status: 'ok',
})
await trace.end({ status: 'success' })

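In the Nexus dashboard, the trace list shows the duration of each trace. For p50/p95/p99 analysis, export the trace data or query the D1 database directly, for example: SELECT percentile(duration_ms, 95) FROM traces WHERE agent_id = ?. Note that percentile() is not part of SQLite's default function set, and D1 is SQLite-based, so confirm the function is available before relying on this query; otherwise, sort by duration_ms and read the row at the 95% offset.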
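If you export durations and compute percentiles yourself, a nearest-rank calculation is enough to get started. The sketch below is not part of the Nexus SDK; the function name and the sample durations are hypothetical, and it assumes you already have an array of trace durations in milliseconds.

```typescript
// Nearest-rank percentile over trace durations in milliseconds.
// Assumption: `durationsMs` comes from your own export of trace data.
function percentile(durationsMs: number[], p: number): number {
  if (durationsMs.length === 0) {
    throw new Error('cannot compute a percentile of an empty list')
  }
  if (p <= 0 || p > 100) {
    throw new Error('p must be in the range (0, 100]')
  }
  // Copy before sorting so the caller's array is left untouched.
  const sorted = durationsMs.slice().sort((a, b) => a - b)
  // Nearest rank: the smallest value with at least p% of the data at or below it.
  const rank = Math.ceil((p / 100) * sorted.length)
  return sorted[rank - 1]
}

// Hypothetical sample: nine ordinary runs and one slow outlier.
const sampleDurationsMs = [1800, 1900, 2000, 2000, 2100, 2100, 2200, 2300, 2400, 45000]
const p50Ms = percentile(sampleDurationsMs, 50)
const p95Ms = percentile(sampleDurationsMs, 95)
const meanMs = sampleDurationsMs.reduce((sum, d) => sum + d, 0) / sampleDurationsMs.length
```

With this sample the mean lands between the typical run and the outlier and describes neither, which is why the percentiles are worth tracking separately.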

2. Token cost per request

Why it matters: LLM APIs charge per token. An agent that processes 10,000 requests/day at $0.01 each costs $100/day. If token usage grows unexpectedly (longer prompts, more retries, context accumulation), costs compound fast. Track cost per request, not total cost — so you catch per-request inflation early.

The Nexus SDK doesn't have built-in token counting (we don't know which model you're using), but you can log it as span metadata:

TypeScript

// Log token cost as metadata on each span
await trace.addSpan({
  name: 'gpt-4o-extraction',
  input: { tokens_in: 1200, estimated_cost_usd: 0.0036 },
  output: { tokens_out: 800, estimated_cost_usd: 0.0024 },
  status: 'ok',
})
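The estimated_cost_usd values have to come from somewhere. One way to derive them is to multiply token counts by per-million-token prices. This is a sketch with hypothetical names (ModelPricing, estimateCostUsd), and the $3 per million rate is a placeholder chosen only because it reproduces the figures in the snippet above; look up the real prices for the model you call.

```typescript
// Per-million-token prices. Placeholder values, not real provider rates.
interface ModelPricing {
  inputUsdPerMillion: number
  outputUsdPerMillion: number
}

// Estimated cost of one LLM call from its token counts.
function estimateCostUsd(tokensIn: number, tokensOut: number, pricing: ModelPricing): number {
  const inputCost = (tokensIn / 1e6) * pricing.inputUsdPerMillion
  const outputCost = (tokensOut / 1e6) * pricing.outputUsdPerMillion
  return inputCost + outputCost
}

// Assumed rate: $3 per million tokens in both directions.
const placeholderPricing: ModelPricing = { inputUsdPerMillion: 3, outputUsdPerMillion: 3 }
// 1,200 input tokens and 800 output tokens, as in the span above.
const callCostUsd = estimateCostUsd(1200, 800, placeholderPricing)
```

Summing this value across all spans in a trace gives the cost per request.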

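Alert trigger: Set a budget alert when average cost per trace exceeds your target. For a $9/mo product, a target under $0.05 per agent run is a sensible starting point: at that rate, 180 runs per user per month would consume the entire subscription price.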
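The check itself can be a few lines run over recent traces. This is a minimal sketch, assuming you have already collected a cost figure per trace; the function names and sample costs are hypothetical, and wiring the result to a notification channel is left out.

```typescript
// Average cost across a batch of traces, in USD. Returns 0 for an empty batch.
function averageCostUsd(traceCostsUsd: number[]): number {
  if (traceCostsUsd.length === 0) return 0
  const total = traceCostsUsd.reduce((sum, cost) => sum + cost, 0)
  return total / traceCostsUsd.length
}

// True when the batch average is above the per-run target.
function exceedsBudget(traceCostsUsd: number[], targetUsdPerRun: number): boolean {
  return averageCostUsd(traceCostsUsd) > targetUsdPerRun
}

// Hypothetical batch: three cheap runs and one expensive one.
const recentCostsUsd = [0.02, 0.03, 0.04, 0.15]
const overBudget = exceedsBudget(recentCostsUsd, 0.05)
```

A single expensive run is enough to push this batch over a $0.05 target, so it is worth looking at the most expensive traces as well as the average.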

3. Error rate by tool and agent

Why it matters: Aggregate error rate hides the real problem. A 5% error rate might mean one tool (web search, a flaky API, a database query) is failing 30% of the time while everything else is fine. Error rate by tool tells you where to look.

Instrument each tool call as a span, and record failures with status: 'error' and an error message:

TypeScript

// Track tool call errors in span metadata
const toolSpan = await trace.addSpan({
  name: 'search-web',
  input: { query: userQuery },
  status: 'error',
  error: 'Timeout after 5s — search API unavailable',
})

In the Nexus trace viewer, span status is color-coded (red = error, green = ok). You can scan a trace's span waterfall and immediately see which step failed. Query across traces: SELECT name, COUNT(*) as errors FROM spans WHERE status = 'error' GROUP BY name ORDER BY errors DESC.
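The query above counts errors per span name; a rate also needs the total number of calls per tool. If you work from exported span rows, the grouping might look like the sketch below. SpanRecord and errorRateByTool are hypothetical names, and the row shape is an assumption rather than the Nexus export format.

```typescript
// Assumed shape of an exported span row.
interface SpanRecord {
  name: string
  status: 'ok' | 'error'
}

// Error rate (0 to 1) for each span name.
function errorRateByTool(spans: SpanRecord[]): Map<string, number> {
  const totals = new Map<string, { errors: number; total: number }>()
  for (const span of spans) {
    const entry = totals.get(span.name) ?? { errors: 0, total: 0 }
    entry.total += 1
    if (span.status === 'error') entry.errors += 1
    totals.set(span.name, entry)
  }
  const rates = new Map<string, number>()
  totals.forEach((entry, toolName) => rates.set(toolName, entry.errors / entry.total))
  return rates
}

// Hypothetical rows: one flaky tool and one healthy tool.
const exportedSpans: SpanRecord[] = [
  { name: 'search-web', status: 'error' },
  { name: 'search-web', status: 'ok' },
  { name: 'db-query', status: 'ok' },
  { name: 'db-query', status: 'ok' },
]
const toolErrorRates = errorRateByTool(exportedSpans)
```

Here the aggregate error rate is 25%, but the per-tool view shows that every failure comes from search-web.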

4. Trace completion rate

Why it matters: An agent that starts a task but never finishes — no error, no success, just running forever — is invisible without this metric. Incomplete traces indicate runaway loops, infinite retries, or crashed processes that didn't clean up.

Completion rate = traces with a terminal status (success, error, or timeout) / total traces. Traces stuck at running are abandoned runs.
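As a rough sketch, the calculation over a list of trace statuses could look like this. The status values are the ones this article mentions; the type and function names are hypothetical.

```typescript
// Status values used in this article; 'running' is the only non-terminal one.
type TraceStatus = 'running' | 'success' | 'error' | 'timeout'

// Share of traces (0 to 1) that reached a terminal status.
function completionRate(statuses: TraceStatus[]): number {
  // With no traces at all, nothing has been abandoned.
  if (statuses.length === 0) return 1
  const finished = statuses.filter((s) => s !== 'running').length
  return finished / statuses.length
}

// Hypothetical batch: three finished traces and one still running.
const batchStatuses: TraceStatus[] = ['success', 'success', 'error', 'running']
const batchCompletionRate = completionRate(batchStatuses)
```

Note that an errored trace still counts as completed: this metric is about runs that ended, not runs that succeeded.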

TypeScript

// Mark trace completion explicitly
// If end() is not called, trace stays 'running' — easy to spot abandoned runs
await trace.end({ status: 'success' }) // or 'error', 'timeout'

Nexus marks traces running until you call trace.end(). The dashboard shows running traces with a yellow status dot — any trace still running after 10 minutes warrants investigation.
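The 10-minute rule is easy to automate if you can list traces with their start times. The sketch below assumes a hypothetical TraceRecord shape with a millisecond start timestamp; it is not the Nexus data model.

```typescript
// Assumed shape of a trace row; field names are illustrative.
interface TraceRecord {
  id: string
  status: 'running' | 'success' | 'error' | 'timeout'
  startedAtMs: number
}

// 10 minutes, matching the rule of thumb above.
const STALE_AFTER_MS = 10 * 60 * 1000

// Traces still marked 'running' more than 10 minutes after they started.
function findStaleTraces(traces: TraceRecord[], nowMs: number): TraceRecord[] {
  return traces.filter(
    (t) => t.status === 'running' && nowMs - t.startedAtMs > STALE_AFTER_MS,
  )
}
```

In practice you would call it with Date.now() on a schedule and investigate whatever it returns.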

Tip: Wrap your entire agent run in a try/finally block to ensure trace.end() always fires, even on uncaught exceptions.
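One way to apply the tip is a small wrapper. This sketch declares its own minimal TraceHandle interface so it stands alone; the real SDK type may differ, and runWithTrace is a hypothetical helper rather than a Nexus API.

```typescript
// Minimal shape of a trace handle for this sketch; the real SDK type may differ.
interface TraceHandle {
  end(result: { status: 'success' | 'error' }): Promise<void>
}

// Runs `work` and always ends the trace, whether `work` resolves or throws.
async function runWithTrace<T>(trace: TraceHandle, work: () => Promise<T>): Promise<T> {
  // Assume failure until the work actually finishes.
  let status: 'success' | 'error' = 'error'
  try {
    const result = await work()
    status = 'success'
    return result
  } finally {
    // Runs on both paths, so the trace never stays 'running'.
    await trace.end({ status })
  }
}
```

The original exception still propagates to the caller after the trace is closed, so existing error handling keeps working.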

5. Context window utilization

Why it matters: Context overflow is one of the most common causes of agent degradation. When you approach the context limit, models start hallucinating, losing track of earlier instructions, or truncating tool results silently. Tracking utilization lets you catch this before it causes failures.

Log utilization as a percentage in span metadata:

TypeScript

// Log context window utilization as span metadata
const contextSpan = await trace.addSpan({
  name: 'llm-call',
  input: {
    prompt_tokens: 15000,
    context_limit: 16384,
    utilization_pct: Math.round((15000 / 16384) * 100), // 92%
  },
  status: 'ok',
})

Alert threshold: Flag any span where utilization_pct > 85. Above 90%, you're in the danger zone where model behavior can degrade noticeably, though the exact point varies by model and provider.
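The threshold check can live next to the code that builds the prompt. A minimal sketch follows, with hypothetical function names; the token count and context limit are assumed to come from your own tokenizer and model configuration.

```typescript
// Flag spans above this utilization percentage.
const UTILIZATION_ALERT_PCT = 85

// Context window utilization as a whole-number percentage.
function utilizationPct(promptTokens: number, contextLimit: number): number {
  if (contextLimit <= 0) throw new Error('contextLimit must be positive')
  return Math.round((promptTokens / contextLimit) * 100)
}

// True when a call is close enough to the limit to warrant a look.
function shouldFlagContext(promptTokens: number, contextLimit: number): boolean {
  return utilizationPct(promptTokens, contextLimit) > UTILIZATION_ALERT_PCT
}

// The example above: 15,000 prompt tokens against a 16,384-token limit.
const exampleUtilization = utilizationPct(15000, 16384)
const exampleFlagged = shouldFlagContext(15000, 16384)
```

Running the check before the LLM call, rather than after, gives you the option to summarize or trim context instead of only logging the problem.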

Setting up the dashboard

These five metrics give you a complete picture of agent health: speed, cost, reliability, completion, and capacity. Instrument all five from day one; it's far easier to add instrumentation before problems appear than to reconstruct what went wrong afterward from logs alone.

See how this looks in practice on the Nexus demo, or read more in the docs. If you're using LangChain, LlamaIndex, or DSPy, check the framework-specific guides.
