2026-04-07 · 9 min read

How to Debug AI Agents in Production

AI agents fail in non-obvious ways: tool call errors that cascade silently, context windows that overflow mid-task, loops that spin without terminating. Here's a practical debugging playbook with trace-first strategies and Nexus SDK examples.

You deployed your AI agent. It worked fine in testing. Then it started failing in production — and you have no idea why. The LLM call returned something. A tool threw an error. Or maybe it just… stopped responding. You can see the output (or lack of one), but the reasoning steps, intermediate state, and tool calls that led there are gone.

This is the central challenge of AI agent debugging: failures are non-deterministic, the state space is enormous, and without instrumentation, you're debugging blind. Here's how to fix that.

The 5 Common Agent Failure Modes

Before diving into debugging strategies, it helps to know what you're looking for. These are the five failure modes that cause the most production pain:

1. Tool Call Errors

The agent calls a tool with malformed arguments, the tool throws, and the agent either silently retries with the same bad input or gives up without explanation. Common cause: LLM hallucinates argument names or formats not in the schema.
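One mitigation is to validate tool arguments against the declared schema before executing the call, so a hallucinated argument name fails loudly at the boundary instead of cascading. A minimal sketch; the `TOOL_SCHEMAS` registry and tool name here are illustrative, not part of any SDK:

```python
# Hypothetical per-tool schema registry: required and allowed argument names.
TOOL_SCHEMAS = {
    "search_orders": {"required": {"customer_id"}, "allowed": {"customer_id", "limit"}},
}

def validate_tool_args(tool_name: str, tool_input: dict) -> list[str]:
    """Return a list of problems; an empty list means the call looks valid."""
    schema = TOOL_SCHEMAS.get(tool_name)
    if schema is None:
        return [f"unknown tool: {tool_name}"]
    problems = []
    missing = schema["required"] - tool_input.keys()
    extra = tool_input.keys() - schema["allowed"]
    if missing:
        problems.append(f"missing args: {sorted(missing)}")
    if extra:
        problems.append(f"hallucinated args: {sorted(extra)}")
    return problems
```

Run this check before dispatching the call, and log any problems into the span so the bad input is visible in the trace rather than buried in a retry loop.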

2. Context Window Overflow

Long-running agents accumulate conversation history until they hit the context limit. The LLM starts truncating or ignoring earlier instructions, causing degraded performance or outright errors — often silently.
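A common mitigation is to trim the oldest non-system turns before each call. A rough sketch, assuming role/content message dicts and a crude chars-per-token estimate; a real implementation should use the model's tokenizer:

```python
def estimate_tokens(messages: list[dict]) -> int:
    # Very rough heuristic: ~4 characters per token.
    return sum(len(m["content"]) // 4 for m in messages)

def trim_history(messages: list[dict], budget: int) -> list[dict]:
    """Drop the oldest non-system messages until the estimate fits the budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    while rest and estimate_tokens(system + rest) > budget:
        rest.pop(0)  # drop the oldest turn first, keep the system prompt
    return system + rest
```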

3. Infinite Loops

The agent decides to call a tool, the tool returns an unexpected result, the agent decides to call the same tool again, and so on. Without a step limit and tracing, these can run until they hit a timeout or run up a large API bill.

4. Hallucinated Actions

The LLM invents tool names that don't exist, invents valid-looking arguments with wrong values, or generates plausible-sounding reasoning that leads to a completely wrong action. Hard to catch without input/output logging at each step.

5. Latency Spikes

An agent that usually takes 2 seconds suddenly takes 30. Is it the LLM? A slow tool? Retry backoff? Network? Without per-span timing, you can't tell which step is the bottleneck — and users just see a hang.
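Even before wiring up a tracing backend, a local timing wrapper can answer which step is slow. A minimal stand-in for per-span timing, using only the standard library and invented step names:

```python
import time
from contextlib import contextmanager

durations: dict[str, float] = {}

@contextmanager
def timed(name: str):
    # Record wall-clock duration per named step, the way a span would.
    start = time.perf_counter()
    try:
        yield
    finally:
        durations[name] = time.perf_counter() - start

# Simulated agent run: one slow tool call, one fast LLM call.
with timed("tool:fetch_page"):
    time.sleep(0.05)
with timed("llm:decide"):
    time.sleep(0.01)

slowest = max(durations, key=durations.get)  # the bottleneck step
```

Sorting `durations` after a run points straight at the outlier span.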

Strategy 1: Trace-First Debugging

The single biggest upgrade you can make to your debugging process is adding structured tracing before you start debugging. Trying to debug an uninstrumented agent is like debugging a web service with no request logs — technically possible, but painful.

The minimum viable instrumentation: one trace per agent run, one span per LLM call, one span per tool call.

from nexus import NexusClient

nexus = NexusClient(api_key="nxs_...", agent_id="my-agent")

async def run_agent(user_input: str):
    trace = await nexus.start_trace(
        name="agent-run",
        metadata={"input": user_input, "agent_version": "1.2.0"}
    )
    try:
        result = await _execute(user_input, trace)
        await trace.end(status="success", output={"result": result})
        return result
    except Exception as e:
        await trace.end(status="error", output={"error": str(e)})
        raise

With this in place, every agent run produces a trace you can inspect in Nexus. You can see exact start/end times, inputs, outputs, and status for each step. When something goes wrong, you open the trace and immediately know which span failed, not just "the agent errored."

Strategy 2: Instrument Every Tool Call

Tool call failures are the most common agent failure mode, but they're invisible without span-level logging. The pattern: wrap every tool call in its own span, capturing input, output, and any exception.

# Wrap every tool call in its own span
async def call_tool(trace, tool_name: str, tool_input: dict):
    span = await trace.start_span(
        name="tool-call",
        metadata={"tool": tool_name}
    )
    try:
        result = await TOOLS[tool_name](**tool_input)
        await span.end(
            status="success",
            output={"tool": tool_name, "result_preview": str(result)[:200]}
        )
        return result
    except Exception as e:
        await span.end(
            status="error",
            output={"tool": tool_name, "error": str(e), "input": tool_input}
        )
        raise  # re-raise so the agent loop can handle it

This gives you a clear record of: what tool was called, with what arguments, what it returned, and whether it failed. When the agent hallucinates a tool argument, you'll see it in the span input. When a tool throws, you'll see the exact error and input that caused it.

Strategy 3: Track Token Usage Per Step

Context window overflow is sneaky because it degrades performance gradually rather than causing a hard error. The fix is to log token counts at every LLM call so you can see the trend and set alerts.

# Track token usage at each LLM call
llm_span = await trace.start_span(
    name="llm-call",
    metadata={"model": "claude-3-5-sonnet", "step": step_number}
)
response = await llm.ainvoke(messages)
await llm_span.end(
    status="success",
    output={
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
        "total_tokens": response.usage.input_tokens + response.usage.output_tokens,
        "message_count": len(messages),
    }
)
# Alert if approaching limit
if response.usage.input_tokens > 150_000:
    print(f"[WARN] step {step_number}: context at {response.usage.input_tokens} tokens")

With token counts in your spans, you can see exactly when context starts growing, which steps add the most tokens, and whether you're approaching the limit in a given run. Set a warning threshold (e.g., 80% of the context limit) to catch this before it causes failures.

Strategy 4: Enforce Step Limits with Trace Evidence

Every agent loop should have an explicit step limit. But more importantly, when that limit is hit, you want a trace showing exactly what happened in each step — so you can diagnose the loop, not just know it happened.

MAX_STEPS = 20
step_count = 0

while not done:
    step_count += 1
    step_span = await trace.start_span(
        name="agent-step",
        metadata={"step": step_count}
    )

    if step_count > MAX_STEPS:
        await step_span.end(status="error", output={"error": "max steps exceeded"})
        await trace.end(status="error", output={"error": f"loop: exceeded {MAX_STEPS} steps"})
        raise RuntimeError(f"Agent loop exceeded {MAX_STEPS} steps")

    action = await llm_decide(messages)
    await step_span.end(status="success", output={"action": action.type})

When the limit triggers, the trace will show you the step sequence, the action at each step, and where the loop started. Usually you'll find the same tool being called repeatedly with the same input — which points to the LLM not processing the tool's output correctly.
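You can also catch that pattern automatically, before the step limit fires. A heuristic sketch, not part of the Nexus SDK: flag when the same (tool, input) pair repeats within a short sliding window.

```python
from collections import deque

def make_loop_detector(window: int = 5, threshold: int = 3):
    """Return a recorder that flags repeated identical tool calls."""
    recent: deque = deque(maxlen=window)

    def record(tool_name: str, tool_input: dict) -> bool:
        # Normalize the call into a hashable key for comparison.
        key = (tool_name, tuple(sorted(tool_input.items())))
        recent.append(key)
        return recent.count(key) >= threshold

    return record
```

When `record` returns True, end the trace with a "loop detected" error instead of burning the remaining steps.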

Strategy 5: Structured Metadata for Replay

To debug a specific failing run, you need to be able to find it. That means tagging traces with enough metadata to filter by: which user, which session, which environment, which code version.

# Tag traces so you can find them later
trace = await nexus.start_trace(
    name="agent-run",
    metadata={
        "input": user_input,
        "user_id": user_id,
        "session_id": session_id,    # group multi-turn conversations
        "environment": "production",
        "git_sha": os.environ.get("GIT_SHA", "unknown"),
    }
)

With consistent metadata tagging, you can filter traces in Nexus by user, session, or deploy version. When a user reports a bug, you can find their exact session and inspect every step. When a deploy introduces a regression, you can compare traces from before and after.

Debugging Checklist

When an agent run fails and you're starting from scratch, work through this list:

  1. Open the trace. Find the failing span. Is it a tool call, an LLM call, or the trace itself? The span status and output tell you what failed.
  2. Read the span input. Did the LLM hallucinate a tool argument? Did a tool receive malformed input? The span input is the exact data that was passed — no guessing.
  3. Check token counts. Are input_tokens climbing step by step? Is any single step adding a large chunk? Context overflow usually shows up as a steady token increase over many steps.
  4. Look at step timing. Which span took the longest? If one tool call is an outlier, that's your latency bottleneck. If all LLM calls are slow, it's the model or the API.
  5. Count the steps. How many iterations did the loop run before failing or stopping? If it hit your step limit, look at what the agent was doing in the last 3-5 steps to find the loop pattern.

Getting Started

The Nexus SDK is a two-line addition to any agent. Install it, add your API key, and you have structured traces for every run:

pip install keylightdigital-nexus

From there, the integration guides cover every major framework: LangChain, LlamaIndex, DSPy, CrewAI, and the Anthropic SDK directly. The demo shows what the trace view looks like with real agent data.

Add tracing before your next production incident, not after.
