2026-04-18 · 9 min read

How to Debug LangChain Agents: 5 Bugs You Can Only Find with Distributed Tracing

LangChain agents fail in ways that logs alone will never show you. Silent tool failures, retry storms, token overflows, infinite loops, and sub-agent timeouts all look the same in a stack trace: vague. Distributed tracing makes each one obvious. Here are the five bugs and how traces expose them.

LangChain agents generate logs. Lots of them. But when something goes wrong in a multi-step agent — a tool silently returns nothing, a retry storm burns your token budget, a sub-agent hangs — the log output gives you timestamps and text. It doesn't give you causality.

Distributed tracing does. A trace shows you every span in the agent's execution as a tree: which step called which, how long each took, and what each received and returned. The five bugs below are essentially invisible in logs. In a trace, they're obvious.

Bug 1: Silent tool failures

In logs: You see the tool was called. You see the agent continued. You don't see that the tool returned an empty result — or a result that failed validation upstream — and the agent silently decided to skip that branch.

In a trace: The tool span shows its output payload. You can see result: {} or result: null immediately. The parent LLM span shows the tool output was included in context — confirming the agent saw the empty result and proceeded anyway.

from langchain.tools import tool

# Assumes vector_store was initialized elsewhere (e.g. FAISS, Chroma)

@tool
def search_docs(query: str) -> str:
    """Search the documentation index for a query."""
    results = vector_store.search(query)
    # Bug: returns empty string instead of raising on no results
    return results[0].content if results else ""

# In a trace you'll see: tool_output="" for this span
# The LLM then hallucinates an answer because it got nothing

Fix: add a span attribute for result length, or raise a ToolException on empty results so the trace shows an error span rather than silent success.
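Here is a minimal sketch of that fix. The names (`ToolError`, `search_docs_strict`) are illustrative, and the results list stands in for whatever your vector store returns; in real LangChain code you would raise `langchain_core.tools.ToolException` so the framework reports the error back to the agent.

```python
class ToolError(Exception):
    """Raised when a tool gets no usable result."""

def search_docs_strict(query: str, results: list) -> str:
    if not results:
        # Raising turns silent emptiness into an error span in the trace
        raise ToolError(f"search returned no results for {query!r}")
    return results[0]
```

The point is the shape of the failure: an exception produces an error span the trace tree highlights, while `return ""` produces a green span that looks identical to success.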

Bug 2: LLM retry storms

In logs: You see repeated lines like Retrying LLM call... but can't easily tell how many retries happened across which agent steps, or how much they cost.

In a trace: Each LLM call is a span with token counts. A retry storm shows up as 5–10 sibling spans under the same parent with identical inputs and escalating latency. The trace makes it obvious that one agent step cost 40,000 input tokens and took 12 seconds — even though your business logic "completed successfully."

from langchain_openai import ChatOpenAI

# Each retry is a separate LLM call, and a separate span in a trace
llm = ChatOpenAI(
    model="gpt-4o",
    max_retries=5,  # retries on rate limits and transient timeouts
    request_timeout=30,
)

# In a trace you'll see 5 llm_call spans under one agent_step span
# Total input_tokens: 5 * 8000 = 40,000 tokens burned on one step

Fix: lower max_retries and add circuit-breaker logic. Use trace data to identify which agent steps are retry-prone, then target those specifically.
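The circuit-breaker half can be sketched in a few lines. Everything below is illustrative (not a LangChain API): after a step fails too many times, the breaker opens and skips further calls until a cooldown elapses.

```python
import time

class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after: float = 60.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        """Check whether a call should be attempted at all."""
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at >= self.reset_after:
                # Half-open: allow one probe call after the cooldown
                self.opened_at = None
                self.failures = 0
            else:
                return False
        return True

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.monotonic()

    def record_success(self):
        self.failures = 0
```

Keep one breaker per retry-prone step (identified from your trace data) and check `allow()` before each LLM call; that caps a retry storm at `max_failures` calls instead of `max_retries` per attempt.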

Bug 3: Token overflow in summarization

In logs: You get a context length error — or worse, the model silently truncates. Either way, the log shows the exception without context: how long was the prompt? Which documents caused the overflow?

In a trace: Every LLM span carries input token count. You can see the summarization chain accumulating context across iterations. When iteration 3 suddenly jumps from 4,000 to 18,000 tokens, that's where a large retrieved document got included without a size check.

from langchain.chains.summarize import load_summarize_chain

chain = load_summarize_chain(llm, chain_type="refine")

# Each "refine" iteration adds the previous summary + new document chunk
# With large chunks, tokens compound quickly across iterations
# A trace shows per-iteration token counts — making the problem step obvious

Fix: add a chunk size limit and log the token count per document before adding to context. A trace attribute on each summarization span for doc_token_count lets you filter for outliers.
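A sketch of that pre-flight check, under stated assumptions: the 4-characters-per-token estimate is a rough heuristic (use tiktoken for exact counts), and `MAX_DOC_TOKENS` is a number you would tune for your model's context window.

```python
MAX_DOC_TOKENS = 2000  # illustrative budget per document

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text
    return max(1, len(text) // 4)

def filter_oversized(docs: list[str]) -> list[str]:
    kept = []
    for doc in docs:
        if estimate_tokens(doc) > MAX_DOC_TOKENS:
            # In a traced run, record doc_token_count as a span attribute
            # here, then skip (or chunk) the outlier instead of letting it
            # blow up the refine context
            continue
        kept.append(doc)
    return kept
```

Run this before handing documents to the refine chain; the trace then shows stable per-iteration token counts instead of a sudden jump at the oversized document.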

Bug 4: Infinite loops from bad tool outputs

In logs: The agent loops. Your logs fill with repeated tool calls. Eventually you hit a token limit or timeout. But the logs don't show why the loop started — only that it did.

In a trace: The trace tree is the loop. You see the same tool span repeating under the same parent, with near-identical inputs each time. Because traces capture the LLM's reasoning step (the thought field in ReAct agents), you can read exactly what the model decided and why it kept re-calling the same tool.

from langchain.agents import AgentType, initialize_agent

agent = initialize_agent(
    tools=[search_tool],
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    max_iterations=15,  # Without this, the agent loops indefinitely
    early_stopping_method="generate",
)

# In a trace: the "Thought" span shows the agent deciding to "search again"
# because the previous search returned something ambiguous
# Fix: add an iteration counter span attribute and set max_iterations
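Beyond `max_iterations`, you can catch the loop earlier by watching for repeated identical tool calls. This is a hedged sketch, not a LangChain API: `LoopDetector` and its parameters are illustrative, and you would call `observe()` from whatever callback or span hook your tracing setup provides.

```python
from collections import deque

class LoopDetector:
    def __init__(self, window: int = 6, threshold: int = 3):
        # Remember the last `window` (tool, input) pairs
        self.recent = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, tool_name: str, tool_input: str) -> bool:
        """Return True once the same call repeats enough to call it a loop."""
        key = (tool_name, tool_input)
        self.recent.append(key)
        return self.recent.count(key) >= self.threshold
```

When `observe()` returns True, abort the run (or force a final answer) and tag the trace with a loop-detected attribute so these runs are easy to filter for later.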

Bug 5: Sub-agent timeouts

In logs: A sub-agent times out. You get an exception with a timestamp. You don't know how far the sub-agent got, which tool it was waiting on, or whether it had partially succeeded before failing.

In a trace: The sub-agent's execution is a child trace linked to the parent by a parent_trace_id. You can see exactly which span was in-flight when the timeout hit — the tool call, the LLM response wait, or the result parsing. If the sub-agent completed 3 out of 4 steps, the trace shows that too.

import asyncio

import nexus_sdk as nexus

# Parent agent
async def run_orchestrator(task: str):
    trace = nexus.start_trace("orchestrator", metadata={"task": task})

    try:
        # Sub-agent call — pass parent trace ID for linked traces
        result = await call_sub_agent(task, parent_trace_id=trace.id)
    except asyncio.TimeoutError:
        # End the parent trace too, so the timeout is visible at both levels
        nexus.end_trace(trace.id, status="timeout")
        raise

    nexus.end_trace(trace.id, status="success")
    return result

async def call_sub_agent(task: str, parent_trace_id: str):
    trace = nexus.start_trace(
        "sub_agent",
        metadata={"parent_trace_id": parent_trace_id}
    )
    try:
        # ... sub-agent logic
        nexus.end_trace(trace.id, status="success")
    except asyncio.TimeoutError:
        nexus.end_trace(trace.id, status="timeout")
        raise

Linked traces let you navigate from the parent orchestrator's timeout error directly into the sub-agent's execution tree — without searching logs across multiple processes.

Making these bugs findable

The common thread: all five bugs are relational failures. They happen between steps, not within them. Logs capture individual events. Traces capture the causal chain connecting those events.

Adding tracing to a LangChain agent takes about 10 minutes with the Nexus SDK. Wrap your agent execution in a nexus.start_trace() call, add tool spans for each tool invocation, and pass metadata with token counts and tool outputs. The next time one of these five bugs hits production, you'll have the trace to diagnose it in minutes rather than hours.

Add tracing to your LangChain agent

Nexus gives you distributed traces, token cost tracking, and error rate dashboards for every agent run. Free to start — no credit card required.

Start free →