Tracing AG2 (AutoGen v2) Multi-Agent Conversations with Nexus

AG2 (formerly AutoGen) makes it easy to spin up teams of ConversableAgents — but when a multi-agent conversation goes wrong, figuring out which agent said what and where the chain broke is painful. Here's how to add full trace observability to AG2 conversations with Nexus.

What is AG2 and why is tracing hard?

AG2 (formerly known as AutoGen) is Microsoft's open-source framework for building multi-agent systems using ConversableAgent — a flexible agent class that can act as an LLM-backed assistant, a human-in-the-loop proxy, or a tool-executing code agent. AG2's key primitive is agent.initiate_chat(recipient, message), which kicks off a back-and-forth conversation between agents until a termination condition is met.

The tracing problem: a multi-agent conversation in AG2 is a sequence of messages exchanged between agents over multiple turns. Each turn may involve an LLM call, tool execution, or code run. When the conversation fails — the task isn't completed, an agent loops, or the wrong answer is produced — you need to answer: which agent made the wrong decision? Which turn triggered the failure? Which tool returned bad data?

Without external tracing, the only visibility you have is the console output and the final chat_result. That's not enough for production systems.

Setup

pip install ag2 nexus-sdk

import os
from autogen import ConversableAgent
from nexus_sdk import NexusClient

nexus = NexusClient(api_key=os.environ["NEXUS_API_KEY"])

Tracing a two-agent conversation

Wrap initiate_chat in a Nexus trace and record each agent turn as a span. Here's a minimal example with an assistant agent and a user proxy:

llm_config = {
    "config_list": [{"model": "gpt-4o", "api_key": os.environ["OPENAI_API_KEY"]}]
}

assistant = ConversableAgent(
    name="assistant",
    system_message="You are a helpful assistant. Reply TERMINATE when done.",
    llm_config=llm_config,
)

user_proxy = ConversableAgent(
    name="user_proxy",
    human_input_mode="NEVER",
    max_consecutive_auto_reply=5,
    is_termination_msg=lambda msg: "TERMINATE" in msg.get("content", ""),
)

task = "Write a Python function that reverses a linked list."

trace = nexus.start_trace(
    agent_id="ag2-assistant",
    input=task,
)

try:
    chat_result = user_proxy.initiate_chat(
        recipient=assistant,
        message=task,
        max_turns=6,
    )

    # Record the conversation summary
    messages = chat_result.chat_history
    nexus.end_trace(trace["id"], output=messages[-1]["content"] if messages else "", status="success")

except Exception as e:
    nexus.end_trace(trace["id"], output=str(e), status="error")
    raise

Recording per-turn spans

The above gives you start/end coverage but misses turn-level detail. AG2 lets you hook into the message-sending flow by subclassing ConversableAgent and overriding receive to record each message exchange as a span:

class TracedAgent(ConversableAgent):
    def __init__(self, *args, nexus_client, trace_id, **kwargs):
        super().__init__(*args, **kwargs)
        self._nexus = nexus_client
        self._trace_id = trace_id
        self._turn = 0

    def receive(self, message, sender, request_reply=None, silent=False):
        self._turn += 1
        span = self._nexus.start_span(self._trace_id, {
            "name": f"turn.{self._turn} — {sender.name} → {self.name}",
            "type": "llm",
            "metadata": {
                "sender": sender.name,
                "recipient": self.name,
                "turn": self._turn,
                "content_preview": str(message)[:200] if isinstance(message, str) else str(message.get("content", ""))[:200],
            },
        })
        try:
            result = super().receive(message, sender, request_reply, silent)
            self._nexus.end_span(span["id"], {"output": "ok"})
            return result
        except Exception as e:
            self._nexus.end_span(span["id"], {"output": str(e), "status": "error"})
            raise

Use TracedAgent in place of ConversableAgent for the recipient (the agent that receives and processes messages):

trace = nexus.start_trace(agent_id="ag2-assistant", input=task)

traced_assistant = TracedAgent(
    name="assistant",
    system_message="You are a helpful assistant. Reply TERMINATE when done.",
    llm_config=llm_config,
    nexus_client=nexus,
    trace_id=trace["id"],
)

chat_result = user_proxy.initiate_chat(
    recipient=traced_assistant,
    message=task,
    max_turns=6,
)

Useful metadata to capture

Turn count — how many turns the conversation ran before termination; high turn counts often indicate the task description is ambiguous
Termination reason — was it TERMINATE in the message, max turns reached, or an exception?
Agent names — for multi-agent pipelines with 3+ agents, recording sender/recipient per span is essential to follow the message flow
Model used — when testing GPT-4o vs GPT-4o-mini for cost/quality tradeoffs, log the model name in the span metadata

Common AG2 failure patterns in traces

Infinite conversation loop: The termination condition is never met — max turns is hit instead. In the trace, you see the expected number of spans (one per max turn) with the last span showing no TERMINATE in the output. Usually means the termination message pattern doesn't match what the LLM is actually returning.

Human proxy timeout: If human_input_mode="ALWAYS" is set accidentally in production, the agent hangs waiting for input. The trace shows a span that never ends — and the trace duration grows indefinitely until timeout.

Code execution failure: When using a UserProxyAgent with code execution, a Python error in the generated code causes a span with status: error. Without the span, you'd only see the final unhelpful "FAILED" message in the chat history.

Context window overflow: Long multi-turn conversations can hit the LLM's context limit. The trace shows escalating span latency as the model processes larger contexts, then an error span when the limit is reached.

AG2's simplicity is a feature — a ConversableAgent really is just a function that sends and receives messages. Wrapping that at the trace level keeps your observability just as simple: one trace per conversation, one span per turn. From the Nexus dashboard, you get the full turn waterfall, per-agent error rates, and alerts when conversations fail in production.