Blog · 2026-04-09 · 9 min read

OpenTelemetry for AI Agents: Why Standard APM Falls Short

OpenTelemetry is great at instrumenting web services. But AI agents fail in ways that standard spans and metrics were never designed to capture. Here's what OTEL gets right, five things it misses, and how purpose-built agent observability fills the gaps.

What OpenTelemetry Gets Right

OpenTelemetry is one of the best things to happen to observability in the last decade. A vendor-neutral standard for traces, metrics, and logs — with first-class support in every major language and cloud. If you're running microservices, you should absolutely be using OTEL.

The distributed tracing model maps well to AI agents: a "trace" corresponds to a single agent run, and "spans" map to individual steps — tool calls, LLM invocations, retrieval queries. The waterfall view OTEL popularized is exactly what you want for understanding agent execution order and timing.

Here's a standard OTEL span:

// Standard OTEL span — adequate for web services
const span = tracer.startSpan('http.request', {
  attributes: { 'http.request.method': 'GET', 'url.path': '/api/data' }
})
span.end()

Clean. Simple. Works perfectly for HTTP requests, database queries, and service calls. The problem is that AI agent "spans" look nothing like this.

5 Things Standard APM Misses for AI Agents

1. Prompt and Response Capture

Standard OTEL attributes are key-value pairs designed for infrastructure metadata. They weren't built to store multi-kilobyte prompt strings or structured JSON responses from LLMs. Most APM tools either truncate them, ignore them, or charge per-character for storage.
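To see why this matters, here is a toy model of backend-side truncation. The 256-character cap below is made up for illustration; real attribute limits vary by vendor and are often configurable:

```javascript
// Toy model of what an APM backend does to oversized attribute values.
// The 256-char cap is illustrative only; real limits differ per vendor.
function truncateAttributes(attributes, maxLen = 256) {
  const out = {}
  for (const [key, value] of Object.entries(attributes)) {
    const str = typeof value === 'string' ? value : JSON.stringify(value)
    out[key] = str.length > maxLen ? str.slice(0, maxLen) + '...[truncated]' : str
  }
  return out
}

// A multi-kilobyte prompt loses exactly the part you need for debugging
const prompt = 'You are an extraction agent. '.repeat(200) // 5800 chars
const stored = truncateAttributes({ 'llm.prompt': prompt })
```

After truncation, the stored attribute is a fraction of the original prompt: the instructions, few-shot examples, and retrieved context that actually explain the agent's behavior are gone.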

When an agent misbehaves, the first thing you need is the exact prompt it received and the exact response it got. Without that, you're debugging blind. Here's what purpose-built agent spans capture:

// What you actually need for an AI agent span
const span = await trace.addSpan({
  name: 'gpt-4o-extraction',
  input: {
    messages: [{ role: 'user', content: extractPrompt }],
    model: 'gpt-4o',
    temperature: 0.2,
  },
  output: {
    content: result.content,
    usage: { prompt_tokens: 1200, completion_tokens: 340, total_tokens: 1540 },
    finish_reason: 'stop',
  },
})
await span.end({ status: 'ok' })

2. Token Usage Tracking

Token usage is the primary cost driver for AI agents — and standard APM has no concept of it. OTEL metrics can record a counter, but they don't know that prompt_tokens is structurally different from completion_tokens, that they're priced differently, or that watching token growth over time predicts runaway cost before the invoice arrives.

Purpose-built agent observability surfaces token usage as a first-class metric per span, per trace, and per agent — with trend charts that show when your context windows are growing unexpectedly.
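To make the pricing asymmetry concrete, here is a minimal cost calculation. The per-million-token prices below are placeholders for illustration, not any provider's actual rates:

```javascript
// Illustrative prices in USD per 1M tokens; placeholders, not real rates
const PRICES = {
  'gpt-4o': { prompt: 2.5, completion: 10.0 },
}

function spanCostUSD(model, usage) {
  const p = PRICES[model]
  return (
    (usage.prompt_tokens / 1e6) * p.prompt +
    (usage.completion_tokens / 1e6) * p.completion
  )
}

// The same total_tokens can cost very different amounts depending on the split
const heavyPrompt = spanCostUSD('gpt-4o', { prompt_tokens: 1400, completion_tokens: 140 })
const heavyOutput = spanCostUSD('gpt-4o', { prompt_tokens: 140, completion_tokens: 1400 })
// heavyOutput costs roughly 3x more here; a flat token counter can't see that
```

Both calls report 1,540 total tokens, which is all a generic counter metric would record.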

3. Agent Loop Detection

One of the most expensive AI agent failures is an infinite loop: the agent keeps calling tools, generating responses, and incurring costs without making progress. Standard APM sees this as "many spans over a long time" — it has no concept of whether that's expected or pathological.

Agent-aware observability tracks step counts against declared limits and fires alerts when the ratio breaks expectations. A 20-step trace that should finish in 5 steps is a loop — not just a slow request.

// Detecting agent loops — impossible with standard OTEL counters
const trace = await nexus.startTrace({
  name: 'research-task',
  metadata: { max_steps: 20, task_id: taskId }
})

let steps = 0
let done = false

while (!done) {
  steps++
  // ...agent executes one step here, choosing currentTool, producing
  // stepResult, and setting done when the task completes
  const span = await trace.addSpan({
    name: 'agent-step',
    input: { step: steps, tool: currentTool },
    output: { result: stepResult },
  })
  await span.end({ status: 'ok' })

  if (steps >= 20) {
    await trace.end({ status: 'timeout' })
    // Nexus fires an alert — standard APM would see 20 spans and shrug
    break
  }
}

4. Tool Call Error Propagation

When a web search tool returns no results, should the agent fail? Retry? Hallucinate an answer? Standard APM records the HTTP 200 from the search API and considers the call successful. But from the agent's perspective, an empty result set is a semantic failure that should be tracked and alerted on.

AI-specific tracing captures semantic status at the tool call level — not just HTTP status codes — and propagates failures up to the trace level so you see the real error picture:

// Tool call tracing with error propagation
const toolSpan = await trace.addSpan({
  name: 'web-search',
  input: { query: searchQuery, engine: 'google' },
})

try {
  const results = await searchWeb(searchQuery)
  await toolSpan.end({ status: 'ok', output: { result_count: results.length } })
} catch (err) {
  // Error captured at tool level — visible in waterfall
  await toolSpan.end({ status: 'error', error: err.message })
  // Propagate to trace level — APM would lose this
  await trace.end({ status: 'error' })
}

5. Hallucination and Context Quality Monitoring

Hallucination monitoring requires capturing what context was given to the LLM alongside what it produced. Standard APM can't correlate retrieval quality with generation quality because it treats them as independent services. An agent observability tool that captures both steps in the same trace can surface patterns like "retrieval quality dropped → hallucination rate spiked three minutes later."
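A toy sketch of the kind of query a combined trace makes trivial. The traces array below is made-up data in the shape such a query might return, with a retrieval score and a hallucination flag per trace:

```javascript
// Hypothetical combined traces: retrieval score and hallucination flag live
// in the same record, so correlating them is a simple filter-and-count.
// With standard APM these would sit in two unrelated systems.
const traces = [
  { retrievalScore: 0.91, hallucinated: false },
  { retrievalScore: 0.88, hallucinated: false },
  { retrievalScore: 0.32, hallucinated: true },
  { retrievalScore: 0.28, hallucinated: true },
  { retrievalScore: 0.85, hallucinated: false },
  { retrievalScore: 0.40, hallucinated: true },
]

function hallucinationRate(ts) {
  return ts.filter(t => t.hallucinated).length / ts.length
}

// Split by retrieval quality and compare hallucination rates
const lowRetrieval = traces.filter(t => t.retrievalScore < 0.5)
const highRetrieval = traces.filter(t => t.retrievalScore >= 0.5)
const lowRate = hallucinationRate(lowRetrieval)
const highRate = hallucinationRate(highRetrieval)
```

In this toy data, every low-retrieval trace hallucinated and no high-retrieval trace did, which is exactly the "retrieval dropped, hallucinations spiked" pattern described above.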

This is why tools like Datadog LLM Observability had to build separate AI-specific layers on top of their APM infrastructure — the fundamental data model doesn't transfer.

The OTEL vs Purpose-Built Tradeoff

This isn't a knock on OpenTelemetry. OTEL is a standards body solving a hard interoperability problem, not a product company solving an AI monitoring problem. The right comparison is: OTEL is to AI agents what Prometheus is to application-layer business metrics — technically capable, but requiring significant wrapper work to surface what you actually care about.

The current generation of AI observability tools falls into two camps: those that extend standard OTEL with AI-specific semantic conventions and keep the existing collector and exporter pipeline, and purpose-built platforms that own the data model end to end.

Nexus is OTEL-inspired — the trace/span hierarchy comes directly from OTEL — but the data model is extended with AI-specific attributes: prompts, outputs, token usage, and semantic status. You get the familiar waterfall view without the boilerplate of setting up an OTEL collector, exporter, and backend.

Getting Started

If you're already using OTEL for your service infrastructure, Nexus can run alongside it for AI-specific spans without replacing your existing setup. Add the SDK to your agent code:
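The snippet below sketches that wiring. The real package name and client constructor aren't shown in this post, so createNexusStub is a local stand-in that records spans using the same startTrace / addSpan / end shape as the examples above; treat it as the pattern, not the API:

```javascript
// Local stand-in for the Nexus client, matching the startTrace/addSpan/end
// shape used throughout this post. Illustrative only: in real code you would
// initialize the actual SDK client here instead.
function createNexusStub() {
  const spans = []
  return {
    spans,
    async startTrace({ name, metadata }) {
      const trace = { name, metadata, status: null }
      return {
        async addSpan({ name, input, output }) {
          const span = { name, input, output, status: null }
          spans.push(span)
          return {
            async end({ status }) {
              span.status = status
            },
          }
        },
        async end({ status }) {
          trace.status = status
        },
      }
    },
  }
}

async function main() {
  const nexus = createNexusStub() // real code: initialize the Nexus SDK here

  const trace = await nexus.startTrace({
    name: 'hello-agent',
    metadata: { task_id: 'demo' },
  })

  const span = await trace.addSpan({
    name: 'llm-call',
    input: { model: 'gpt-4o', messages: [{ role: 'user', content: 'hi' }] },
    output: { content: 'hello', usage: { total_tokens: 12 } },
  })
  await span.end({ status: 'ok' })

  await trace.end({ status: 'ok' })
  return nexus.spans
}
```

From there, every example in this post — loop limits, tool error propagation, token capture — is the same pattern with different span names and payloads.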

Monitor your AI agents — not just your services

Purpose-built agent observability. Free tier, no credit card required.

Get started free →