How to Monitor Your AI Agents in Production

AI agents fail in production in ways that are invisible without observability. You ship a working agent, it runs great in development, and then one day a user's request triggers a cascade: the LLM returns an unexpected format, the tool parser throws, the agent retries three times silently, and eventually returns an empty response. The user sees nothing. You see nothing.

This post covers the most common agent failure modes we've seen, and how to instrument your agents before those failures become expensive problems.

Why agent observability is different

Traditional application monitoring tools (Datadog, Sentry, New Relic) are built around request-response cycles and error rates. AI agents don't fit that model. A single "agent run" might involve:

• 5-20 sequential LLM calls, each with different token counts and latencies
• Tool executions that may fail and retry
• Sub-agent spawning and handoffs
• Non-deterministic outputs — the same input can take wildly different paths

You need trace-level visibility: the full sequence of LLM calls and tool uses for each agent run, with timing, input, output, and errors at each step.
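
To make "trace-level visibility" concrete, here is a minimal sketch of a trace as an ordered list of spans. The `Span` and `Trace` classes and their field names are hypothetical, chosen for this illustration; they are not the Nexus SDK's types.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class Span:
    """One step in an agent run: an LLM call or a tool execution."""
    name: str
    kind: str                       # "llm" or "tool"
    duration_ms: float
    input: Any = None
    output: Any = None
    error: Optional[str] = None     # None means the step succeeded


@dataclass
class Trace:
    """The ordered sequence of spans for a single agent run."""
    name: str
    spans: List[Span] = field(default_factory=list)

    def failed_spans(self) -> List[Span]:
        """Return the steps that recorded an error, in execution order."""
        return [span for span in self.spans if span.error is not None]

    def total_duration_ms(self) -> float:
        """Sum of step durations (ignores any gaps between steps)."""
        return sum(span.duration_ms for span in self.spans)


# Hypothetical run: one LLM turn followed by a tool call that timed out.
trace = Trace(name="summarize ticket")
trace.spans.append(Span("turn-0", "llm", 820.0, output={"finish_reason": "tool_calls"}))
trace.spans.append(Span("tool-search", "tool", 140.0, input={"q": "refund"}, error="timeout"))
print([span.name for span in trace.failed_spans()])
```

With this shape, "which step failed, and how long did the run take?" becomes a query over one record instead of a search through scattered log lines.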

The 5 failure modes that will bite you

1. Silent retry loops

An agent retries a failed tool call 3 times before giving up. In logs: nothing. In your users' experience: a 30-second hang. This is the most common failure mode and the hardest to debug without trace data.
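
One way to make retries visible is to record every attempt, not just the final outcome. The sketch below is an assumption about how you might wrap a tool call; `call_with_recorded_retries` and the `record` callback are hypothetical names, and in practice `record` could forward each attempt to whatever tracing backend you use.

```python
import time
from typing import Any, Callable, Dict, List


def call_with_recorded_retries(
    tool: Callable[[], Any],
    record: Callable[[Dict[str, Any]], None],
    max_attempts: int = 3,
) -> Any:
    """Call `tool` up to `max_attempts` times, recording every attempt."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        started = time.perf_counter()
        try:
            result = tool()
        except Exception as exc:
            last_error = exc
            record({"attempt": attempt, "ok": False, "error": repr(exc),
                    "duration_s": time.perf_counter() - started})
        else:
            record({"attempt": attempt, "ok": True, "error": None,
                    "duration_s": time.perf_counter() - started})
            return result
    # Surface the failure instead of silently returning an empty result.
    raise RuntimeError(f"tool failed after {max_attempts} attempts") from last_error


# Hypothetical tool that fails twice, then succeeds.
attempts: List[Dict[str, Any]] = []
state = {"calls": 0}


def flaky_tool() -> str:
    state["calls"] += 1
    if state["calls"] < 3:
        raise ValueError("unexpected response format")
    return "ok"


result = call_with_recorded_retries(flaky_tool, attempts.append)
print(result, [a["ok"] for a in attempts])
```

The call still succeeds on the third try, but the two failed attempts are now on record rather than hidden inside the hang.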

2. Token budget exhaustion

You set max_tokens=4096 and the agent hits the limit mid-reasoning. The response gets truncated, the tool call is malformed, and the next step fails. Without span-level token counts, you'll never know which step triggered the cascade.
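
A truncated response can be flagged at the step where it happens. In the OpenAI Chat Completions API, a response cut off by the token limit reports `finish_reason == "length"`. The helper below is a small sketch built on that; `llm_step_summary` is a hypothetical name, and it takes plain values so you can feed it from `response.choices[0].finish_reason` and `response.usage`.

```python
from typing import Any, Dict


def llm_step_summary(finish_reason: str, prompt_tokens: int,
                     completion_tokens: int, max_tokens: int) -> Dict[str, Any]:
    """Build a per-step record that makes token-limit truncation explicit."""
    return {
        "finish_reason": finish_reason,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "max_tokens": max_tokens,
        # "length" means generation stopped because it hit the token limit.
        "truncated": finish_reason == "length",
    }


summary = llm_step_summary("length", prompt_tokens=1800,
                           completion_tokens=4096, max_tokens=4096)
print(summary["truncated"])
```

Attaching this record to each span lets you filter for `truncated` steps directly instead of inferring truncation from a malformed tool call two steps later.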

3. Context drift across turns

Multi-turn agents accumulate context. By turn 8, the agent has "forgotten" the original user goal and is optimizing for something else entirely. You need per-turn input/output logging to detect this.
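
A lightweight form of per-turn logging is to store the original goal next to a short preview of the agent's latest output, so a reviewer can compare the two turn by turn. This is a sketch under assumptions: `turn_record` is a hypothetical helper, messages are assumed to be dicts with `role` and `content` keys, and the 200-character preview is an arbitrary choice.

```python
from typing import Any, Dict, List


def turn_record(goal: str, turn: int, messages: List[Dict[str, Any]],
                preview_chars: int = 200) -> Dict[str, Any]:
    """Summarize one turn so the goal and the latest output sit side by side."""
    # Most recent assistant message, or "" if the agent has not replied yet.
    last_assistant = next(
        (m.get("content") or "" for m in reversed(messages)
         if m.get("role") == "assistant"),
        "",
    )
    return {
        "turn": turn,
        "goal": goal,
        "message_count": len(messages),
        "context_chars": sum(len(m.get("content") or "") for m in messages),
        "assistant_preview": last_assistant[:preview_chars],
    }


messages = [
    {"role": "user", "content": "Find the cheapest flight to Oslo"},
    {"role": "assistant", "content": "Searching hotel deals in Oslo"},
    {"role": "user", "content": "ok"},
]
record = turn_record("Find the cheapest flight to Oslo", turn=1, messages=messages)
print(record["assistant_preview"])
```

Reading the `goal` and `assistant_preview` columns together across turns is usually enough to spot the turn where the agent started working on something else.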

4. Tool output schema mismatch

An external API you call from a tool changes its response format. Your parsing code throws. The agent catches the exception and either retries indefinitely or returns a hallucinated fallback. Logging tool inputs and outputs makes this immediately visible.
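
Beyond logging, a small check on the tool's output can turn a format change into an explicit, recordable error. The sketch below assumes a flat JSON-like payload; `schema_problems`, `EXPECTED_WEATHER`, and the weather fields are hypothetical and only illustrate the idea.

```python
from typing import Any, Dict, List


def schema_problems(payload: Dict[str, Any], expected: Dict[str, type]) -> List[str]:
    """Return a list of mismatches between `payload` and the expected keys/types."""
    problems = []
    for key, expected_type in expected.items():
        if key not in payload:
            problems.append(f"missing key: {key}")
        elif not isinstance(payload[key], expected_type):
            problems.append(
                f"{key}: expected {expected_type.__name__}, "
                f"got {type(payload[key]).__name__}"
            )
    return problems


# Hypothetical weather tool: the upstream API renamed and nested a field.
EXPECTED_WEATHER = {"city": str, "temp_c": float}
old_response = {"city": "Oslo", "temp_c": 4.5}
new_response = {"city": "Oslo", "temperature": {"celsius": 4.5}}

print(schema_problems(old_response, EXPECTED_WEATHER))
print(schema_problems(new_response, EXPECTED_WEATHER))
```

Recording the returned list as the span's error gives you a message that names the changed field, which is far easier to act on than a parser exception.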

5. Runaway sub-agent spawning

An agent that can spawn sub-agents will sometimes do so unnecessarily — especially when given a vague task. Without trace data, you won't know whether the agent completed your task in 2 steps or 20.
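
Alongside tracing, it can help to cap spawning and count how often the cap is hit. The following is a minimal sketch, assuming your framework lets you intercept spawn requests; `SpawnBudget` and its limits are hypothetical.

```python
class SpawnBudget:
    """Caps how deep and how many sub-agents a single run may spawn."""

    def __init__(self, max_depth: int = 2, max_total: int = 5) -> None:
        self.max_depth = max_depth
        self.max_total = max_total
        self.spawned = 0
        self.denied = 0

    def try_spawn(self, parent_depth: int) -> bool:
        """Return True and count the spawn if the child stays within budget."""
        child_depth = parent_depth + 1
        if child_depth > self.max_depth or self.spawned >= self.max_total:
            self.denied += 1
            return False
        self.spawned += 1
        return True


budget = SpawnBudget(max_depth=2, max_total=3)
# Spawn requests from agents at depths 0, 1, 2, 0, 0.
decisions = [budget.try_spawn(depth) for depth in (0, 1, 2, 0, 0)]
print(decisions, budget.spawned, budget.denied)
```

Logging `spawned` and `denied` at the end of each run gives you a quick signal for tasks that are too vague: they tend to be the ones that exhaust the budget.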

How to instrument your agents in 10 minutes

Here's a before/after for a typical OpenAI-based agent loop:

Before: zero visibility

from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

async def run_agent(task: str) -> str:
    messages = [{"role": "user", "content": task}]
    for turn in range(10):
        response = await client.chat.completions.create(
            model="gpt-4o", messages=messages
        )
        if response.choices[0].finish_reason == "stop":
            return response.choices[0].message.content
        # handle tool calls...
    return "max turns reached"

After: full trace with Nexus (three SDK calls added)

import os

from nexus_client import NexusClient
from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment
nexus = NexusClient(api_key=os.environ["NEXUS_API_KEY"], agent_id="my-agent")

async def run_agent(task: str) -> str:
    trace = nexus.start_trace(name=f"Task: {task[:60]}")  # added: open a trace for this run
    messages = [{"role": "user", "content": task}]
    try:
        for turn in range(10):
            response = await client.chat.completions.create(
                model="gpt-4o", messages=messages
            )
            trace.add_span(                                # added: one span per LLM call
                name=f"turn-{turn}",
                output={"finish_reason": response.choices[0].finish_reason,
                        "tokens": response.usage.total_tokens}
            )
            if response.choices[0].finish_reason == "stop":
                trace.end(status="success")               # added: close the trace
                return response.choices[0].message.content
            # handle tool calls...
        # Ran out of turns without a final answer: record the run as failed.
        trace.end(status="error")
        return "max turns reached"
    except Exception:
        # Close the trace on any unexpected failure, then re-raise.
        trace.end(status="error")
        raise

Install with: pip install keylightdigital-nexus. See the full API reference.

What to track in each span

A good span captures enough to reproduce the failure without capturing so much that storage becomes a concern. For each LLM call, we recommend:

• Input: message count, turn number (not the full messages — that's expensive)
• Output: finish_reason, prompt and completion token counts
• Timing: auto-captured by the SDK

For tool calls, capture the full input and output, since tool arguments and results are usually small:

trace.add_span(
    name=f"tool-{tool_name}",
    input=tool_args,          # full tool arguments
    output={"result": result, "error": error_msg},
)

Setting up alerts

Once you have traces, you want to know when things go wrong. Nexus sends you an email when any agent trace ends with status error or timeout — rate-limited to 1 alert per agent per 5 minutes to prevent noise.
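
If you route alerts yourself, the same kind of per-agent cooldown is easy to approximate. The sketch below is not Nexus's implementation; `AlertThrottle` is a hypothetical class that allows at most one alert per agent per cooldown window, with the 300-second default mirroring the 5-minute limit described above.

```python
from typing import Dict


class AlertThrottle:
    """Allows at most one alert per agent within each cooldown window."""

    def __init__(self, cooldown_s: float = 300.0) -> None:
        self.cooldown_s = cooldown_s
        self._last_sent: Dict[str, float] = {}

    def should_send(self, agent_id: str, now: float) -> bool:
        """Return True (and start a new window) if no recent alert was sent."""
        last = self._last_sent.get(agent_id)
        if last is not None and now - last < self.cooldown_s:
            return False
        self._last_sent[agent_id] = now
        return True


throttle = AlertThrottle()
# Timestamps are seconds; in real use you would pass time.monotonic().
sent = [
    throttle.should_send("my-agent", 0.0),      # first failure: alert
    throttle.should_send("my-agent", 120.0),    # within 5 minutes: suppressed
    throttle.should_send("other-agent", 130.0), # different agent: alert
    throttle.should_send("my-agent", 300.0),    # window elapsed: alert
]
print(sent)
```

Passing `now` in explicitly keeps the throttle easy to test without waiting on a real clock.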

Alerts are a Pro feature. See the pricing page — Pro is $9/month.

The meta-story: Ralph monitors itself

Nexus was built by an AI agent named Ralph. Ralph runs as a scheduled Claude Code session that reads a PRD, picks the next user story, implements it, runs quality checks, and commits. Each iteration is a Nexus trace. Each LLM call and tool use is a span.

This means we dogfood our own product. When Ralph has a bad iteration — a test that didn't run, a commit that broke typecheck — we see exactly where it went wrong in the trace viewer. That feedback loop is why Nexus exists.

Start monitoring

Nexus is free for 1,000 traces/month with a 1-agent limit, which covers most side projects and prototypes. If you're running a production agent, Pro is $9/month for 50,000 traces, unlimited agents, and email alerts.