Debugging LangGraph Agents: Tracing Node Execution and State Transitions
LangGraph makes it easy to build stateful, cyclic agent workflows — and equally easy to build ones that infinite-loop, route incorrectly, or corrupt state silently. Here's how distributed tracing surfaces each failure mode and how to instrument LangGraph StateGraph nodes with Nexus spans.
What LangGraph adds — and where it breaks
LangGraph extends LangChain with a graph-based execution model. Instead of a linear chain, you define a StateGraph where nodes are Python functions and edges are routing rules. This enables cycles — agents that loop until a condition is met — which are essential for tool-use and self-reflection patterns.
That power comes with failure modes that don't exist in linear chains:
- Infinite loops: A conditional edge misconfiguration causes the graph to cycle indefinitely. Without tracing, you see a hung process — not which nodes are cycling.
- Wrong routing: A router node returns the wrong edge name due to a model hallucination or a regex match failure. The graph silently takes the wrong path.
- State corruption: A node writes an unexpected key into the shared state dict, overwriting data that a later node depends on.
- Tool call failures: A tool node raises an exception that gets swallowed by LangGraph's error handling, leaving the state in an undefined intermediate shape.
All of these are invisible without per-node spans. Here's how to add them.
Instrumenting a LangGraph StateGraph
The key insight is that every LangGraph node is just a Python function that takes state and returns updated state. Wrap each node in a Nexus span:
import functools
import os
from typing import TypedDict

from langgraph.graph import StateGraph, END
from nexus_sdk import NexusClient

nexus = NexusClient(api_key=os.environ["NEXUS_API_KEY"])

class AgentState(TypedDict):
    messages: list[dict]
    next_action: str
    tool_result: str | None

def node_span(trace_id: str, node_name: str):
    """Decorator that wraps a LangGraph node in a Nexus span."""
    def decorator(fn):
        @functools.wraps(fn)  # keep the node's name for LangGraph introspection
        def wrapper(state: AgentState) -> AgentState:
            span = nexus.start_span(trace_id, {
                "name": f"node:{node_name}",
                "type": "llm" if "llm" in node_name else "tool",
                "metadata": {
                    "node": node_name,
                    "input_message_count": len(state["messages"]),
                    "next_action_before": state.get("next_action"),
                },
            })
            try:
                result = fn(state)
                nexus.end_span(span["id"], {
                    "output": str(result.get("next_action", "")),
                    "metadata": {
                        "node": node_name,
                        "next_action_after": result.get("next_action"),
                        "output_message_count": len(result.get("messages", state["messages"])),
                    },
                })
                return result
            except Exception as e:
                # End the span with the error before re-raising so the
                # failure is visible in the trace waterfall.
                nexus.end_span(span["id"], {"error": str(e)})
                raise
        return wrapper
    return decorator
Building a traced StateGraph
from openai import OpenAI

openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def run_agent_with_tracing(user_query: str) -> str:
    trace = nexus.start_trace({
        "agent_id": "langgraph-research-agent",
        "name": f"query: {user_query[:60]}",
        "status": "running",
        "started_at": nexus.now(),
    })
    trace_id = trace["trace_id"]
    try:
        # Wrap nodes with tracing
        @node_span(trace_id, "llm_router")
        def router_node(state: AgentState) -> AgentState:
            response = openai_client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[
                    {"role": "system", "content": "Decide the next action. Reply with exactly one word: 'search' or 'answer'."},
                    *state["messages"],
                ],
            )
            action = response.choices[0].message.content.strip().lower()
            return {**state, "next_action": action}

        @node_span(trace_id, "tool_search")
        def search_node(state: AgentState) -> AgentState:
            # Simulate a search tool
            result = f"Search results for: {state['messages'][-1]['content']}"
            # Inject the result as a system message: the OpenAI API rejects
            # "role": "tool" messages that lack a matching tool_call_id.
            return {
                **state,
                "tool_result": result,
                "messages": [*state["messages"], {"role": "system", "content": result}],
            }

        @node_span(trace_id, "llm_answer")
        def answer_node(state: AgentState) -> AgentState:
            response = openai_client.chat.completions.create(
                model="gpt-4o",
                messages=state["messages"],
            )
            answer = response.choices[0].message.content
            return {
                **state,
                "messages": [*state["messages"], {"role": "assistant", "content": answer}],
            }

        def route_from_router(state: AgentState) -> str:
            # Default to "answer" so a hallucinated action name falls
            # through to a safe path instead of crashing the graph.
            action = state.get("next_action", "answer")
            if action == "search":
                return "search"
            return "answer"

        # Build graph
        graph = StateGraph(AgentState)
        graph.add_node("router", router_node)
        graph.add_node("search", search_node)
        graph.add_node("answer", answer_node)
        graph.set_entry_point("router")
        graph.add_conditional_edges("router", route_from_router, {
            "search": "search",
            "answer": "answer",
        })
        graph.add_edge("search", "answer")
        graph.add_edge("answer", END)
        app = graph.compile()

        initial_state: AgentState = {
            "messages": [{"role": "user", "content": user_query}],
            "next_action": "",
            "tool_result": None,
        }
        final_state = app.invoke(initial_state)
        final_answer = final_state["messages"][-1]["content"]

        nexus.end_trace(trace_id, {"status": "success"})
        return final_answer
    except Exception as e:
        nexus.end_trace(trace_id, {"status": "error", "metadata": {"error": str(e)}})
        raise
What to look for in the trace
Detecting infinite loops: If the same node name appears 10+ times in the span waterfall, you have a cycle. Check the router node's next_action_after metadata: if the same value repeats on every pass, the conditional edge can never take the exit path, and the graph is stuck.
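You can turn that rule of thumb into an automated check over a trace's ordered span names. This is a minimal sketch: detect_cycles and the threshold are illustrative helpers, not part of the Nexus SDK.

```python
from collections import Counter

def detect_cycles(span_names: list[str], threshold: int = 10) -> list[str]:
    """Return node names that occur at least `threshold` times in one trace.

    `span_names` is the ordered list of span names from a single trace's
    waterfall, e.g. ["node:llm_router", "node:tool_search", ...].
    """
    counts = Counter(span_names)
    return [name for name, n in counts.items() if n >= threshold]

# A router stuck bouncing between two nodes:
spans = ["node:llm_router", "node:tool_search"] * 12 + ["node:llm_answer"]
print(detect_cycles(spans))  # → ['node:llm_router', 'node:tool_search']
```

Running this over every finished trace, and alerting when it returns a non-empty list, catches cycles even when the process eventually terminates.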
Detecting wrong routing: Compare next_action_after values across router spans for successful vs failed runs. If the router outputs "search" when the input clearly needs "answer", the system prompt or the output parser has a bug.
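A cheap guard against this failure mode is to normalize the model's output before it reaches the conditional edge, and log whether normalization was needed. A sketch, assuming a fixed routing table; normalize_action is a hypothetical helper, not a LangGraph or Nexus API.

```python
VALID_ACTIONS = {"search", "answer"}

def normalize_action(raw: str, fallback: str = "answer") -> tuple[str, bool]:
    """Coerce a model's routing output to a known edge name.

    Returns (action, was_valid) so the caller can record a span warning
    whenever the model produced something outside the routing table.
    """
    action = raw.strip().lower().strip("'\".")
    if action in VALID_ACTIONS:
        return action, False if False else True  # valid: no fallback needed
    return fallback, False

print(normalize_action("Search"))       # → ('search', True)
print(normalize_action("'answer'."))    # → ('answer', True)
print(normalize_action("let me think")) # → ('answer', False)
```

Logging the was_valid flag as span metadata separates "the model routed wrong" from "the parser mangled a correct answer", which the next_action_after field alone cannot distinguish.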
Detecting state corruption: Log input_message_count and output_message_count on each node. If a node shows fewer messages out than in, it's overwriting instead of appending — a common TypedDict merge mistake.
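This check is easy to script against exported spans. The sketch below assumes each span dict carries the metadata fields logged by the node_span decorator above; check_message_growth itself is illustrative.

```python
def check_message_growth(spans: list[dict]) -> list[str]:
    """Flag nodes whose output held fewer messages than their input."""
    suspects = []
    for span in spans:
        meta = span.get("metadata", {})
        before = meta.get("input_message_count")
        after = meta.get("output_message_count")
        if before is not None and after is not None and after < before:
            suspects.append(meta["node"])
    return suspects

spans = [
    {"metadata": {"node": "llm_router", "input_message_count": 1, "output_message_count": 1}},
    {"metadata": {"node": "tool_search", "input_message_count": 1, "output_message_count": 2}},
    {"metadata": {"node": "llm_answer", "input_message_count": 2, "output_message_count": 1}},
]
print(check_message_growth(spans))  # → ['llm_answer']
```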
Detecting tool failures: Tool node spans that end with an error field are tool failures. If the span shows no error but the tool_result metadata is empty or None, the tool silently returned nothing — add a validation check inside the tool node.
Adding three metadata fields per node — the node name, input state shape, and output action — gives you a complete picture of how your LangGraph runs succeed and fail. Five minutes of instrumentation saves hours of print-statement debugging.
Debug LangGraph agents with Nexus
Nexus stores span metadata alongside traces, giving you per-node visibility into your LangGraph execution. Free tier, no credit card required.
Start free →