Building Multi-Agent Systems: Observability Patterns

2026-04-09 · 10 min read

Single-agent observability is relatively straightforward: one trace per run, spans for LLM calls and tool uses, done. Multi-agent systems are harder. When five agents collaborate on a task, "which agent caused the failure?" becomes genuinely ambiguous. Blame cascades: one agent's bad output becomes another agent's failure, and a timeout can surface far from the call that caused it. Emergent behaviors appear that no individual agent produces alone.

This post covers four multi-agent architectural patterns — supervisor, peer-to-peer, hierarchical, and consensus — with practical instrumentation approaches for each. All examples use Python with the keylightdigital-nexus SDK.

Pattern 1: Supervisor + Sub-agents

Architecture: A central supervisor agent receives the task, routes it to specialized sub-agents, and aggregates results. The supervisor is the single entry point — it breaks down work and assigns it.

Observability challenge: When a sub-agent fails, the supervisor often swallows the error or retries silently. Without instrumentation, you see the supervisor fail but not which sub-agent caused it.

Instrumentation approach: One trace per full pipeline run. The supervisor creates the trace; each sub-agent invocation gets its own span. If a sub-agent fails, its span captures the error before the supervisor handles it.

```python
import os

from nexus_client import NexusClient

nexus = NexusClient(
    api_key=os.environ["NEXUS_API_KEY"],
    agent_id="supervisor-agent",
)

def run_supervised_pipeline(task: str):
    # One trace for the full pipeline
    trace = nexus.start_trace(
        name=f"supervisor: {task[:50]}",
        metadata={"pattern": "supervisor", "task": task},
    )

    try:
        # Supervisor analyzes and routes
        router_span = trace.add_span(
            name="supervisor-routing",
            input={"task": task},
        )
        agent_name = route_to_agent(task)  # your routing logic
        router_span.end(output={"routed_to": agent_name}, status="ok")

        # Sub-agent execution: end the span with an error status here,
        # before any supervisor-level handling can swallow the failure
        sub_span = trace.add_span(
            name=f"sub-agent:{agent_name}",
            input={"task": task, "agent": agent_name},
        )
        try:
            result = run_sub_agent(agent_name, task)
        except Exception as e:
            sub_span.end(output={"error": str(e)}, status="error")
            raise
        sub_span.end(output={"result": result[:200]}, status="ok")

        trace.end(status="success")
        return result
    except Exception:
        trace.end(status="error")
        raise
```

In the Nexus dashboard, you'll see one trace per task with spans showing supervisor routing time, sub-agent execution time, and any failures isolated to the specific span. This is the pattern used by AutoGen and CrewAI hierarchies.
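
The silent-retry problem is worth instrumenting directly: give every attempt its own record so a retry is never invisible. A minimal sketch, with a plain `record` callback standing in for span creation (the helper and its names are illustrative, not part of the Nexus SDK):

```python
def run_with_retries(fn, max_attempts: int = 3, record=print):
    """Call fn, emitting one record per attempt so retries are never silent.

    `record` stands in for creating and ending a span on the current trace.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            result = fn()
            record({"attempt": attempt, "status": "ok"})
            return result
        except Exception as e:
            record({"attempt": attempt, "status": "error", "error": str(e)})
            if attempt == max_attempts:
                raise
```

In the supervisor, `record` would create and immediately end a span named something like `sub-agent:{name}:attempt-{n}`, so the dashboard shows two error spans and one success instead of a single mysteriously slow call.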

Pattern 2: Peer-to-peer collaboration

Architecture: Multiple agents run in parallel, each handling a specialization, with results combined at the end. No central authority — agents operate independently.

Observability challenge: When peers run in parallel, failures from different agents can land in logs out of order. The aggregation step may succeed even if one peer produced bad output — which won't show as an error until much later.

Instrumentation approach: One trace for the full pipeline, one span per peer. Each peer's span carries its own start and end timestamps, so the waterfall shows every peer's contribution even when they ran concurrently.

```python
import asyncio
import os

from nexus_client import NexusClient

nexus = NexusClient(
    api_key=os.environ["NEXUS_API_KEY"],
    agent_id="peer-coordinator",
)

async def run_peer_pipeline(task: str):
    trace = nexus.start_trace(
        name=f"peer-pipeline: {task[:50]}",
        metadata={"pattern": "peer-to-peer", "agent_count": 3},
    )

    try:
        # All peers run in parallel — one span per peer
        async def run_peer(name: str, subtask: str):
            span = trace.add_span(
                name=f"peer:{name}",
                input={"subtask": subtask},
            )
            try:
                result = await agent_run(name, subtask)
            except Exception as e:
                # End the span before gather() propagates the error,
                # so the failing peer is visible in the waterfall
                span.end(output={"error": str(e)}, status="error")
                raise
            span.end(output={"result": result[:200]}, status="ok")
            return result

        results = await asyncio.gather(
            run_peer("researcher", "gather facts"),
            run_peer("analyst", "analyze trends"),
            run_peer("writer", "draft outline"),
        )

        # Aggregation step
        agg_span = trace.add_span(
            name="aggregation",
            input={"result_count": len(results)},
        )
        final = aggregate(results)
        agg_span.end(output={"summary_len": len(final)}, status="ok")

        trace.end(status="success")
        return final
    except Exception:
        trace.end(status="error")
        raise
```
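
A cheap guard against the bad-output-at-aggregation problem described above is to sanity-check each peer's result before aggregating and record any failures. A minimal sketch; the specific checks and thresholds are illustrative assumptions, not SDK features:

```python
def validate_peer_output(name: str, result: str, min_len: int = 20) -> tuple[bool, str]:
    """Cheap sanity checks on one peer's output before aggregation."""
    if not isinstance(result, str) or not result.strip():
        return False, f"{name}: empty output"
    if len(result) < min_len:
        return False, f"{name}: output shorter than {min_len} chars"
    return True, f"{name}: ok"

def check_peers(results: dict[str, str]) -> list[str]:
    """Return failure reasons; an empty list means every peer passed."""
    failures = []
    for name, result in results.items():
        ok, reason = validate_peer_output(name, result)
        if not ok:
            failures.append(reason)
    return failures
```

Feeding the failure list into the aggregation span's input makes a degenerate peer output visible at the point it was produced, rather than as a confusing downstream error.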

Pattern 3: Hierarchical orchestration

Architecture: An orchestrator breaks a complex task into subtasks recursively. Each level delegates to the next until reaching leaf tasks that execute directly. Used for long-horizon planning and complex research tasks.

Observability challenge: Recursive systems can produce unbounded depth. An orchestrator that decomposes too aggressively creates hundreds of sub-tasks, each taking real time and money. Without instrumentation, you won't know how deep the recursion went or where most of the time was spent.
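
To make the blow-up concrete: with branching factor b, a decomposition that goes d levels deep produces b^d tasks at the bottom level alone. A quick back-of-the-envelope helper (the numbers are illustrative):

```python
def decomposition_size(branching: int, depth: int) -> int:
    """Total tasks in a full decomposition tree: sum of branching**k
    for levels k = 0..depth (root included)."""
    return sum(branching ** k for k in range(depth + 1))

# A modest branching factor of 5 taken 3 levels deep already means
# 1 + 5 + 25 + 125 = 156 tasks, each with its own latency and cost.
```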

Instrumentation approach: One trace per orchestrator level, with the depth recorded in trace metadata so the decomposition tree can be reconstructed across traces. Add a depth-limit guard that forces leaf execution past a maximum depth, so decomposition cannot recurse without bound.

```python
MAX_DEPTH = 2  # guard against runaway decomposition

def run_hierarchical(task: str, depth: int = 0):
    """Recursive orchestrator — each level creates its own trace."""
    trace = nexus.start_trace(
        name=f"orchestrator-L{depth}: {task[:40]}",
        metadata={"depth": depth, "pattern": "hierarchical"},
    )

    try:
        plan_span = trace.add_span(
            name="planning",
            input={"task": task, "depth": depth},
        )
        subtasks = decompose(task)  # returns list of subtask strings
        plan_span.end(
            output={"subtask_count": len(subtasks), "subtasks": subtasks[:3]},
            status="ok",
        )

        results = []
        for subtask in subtasks:
            if is_leaf(subtask) or depth >= MAX_DEPTH:
                # Execute directly (forced once the depth limit is hit)
                exec_span = trace.add_span(
                    name=f"execute:{subtask[:30]}",
                    input={"subtask": subtask},
                )
                result = execute_leaf(subtask)
                exec_span.end(output={"result": result[:200]}, status="ok")
                results.append(result)
            else:
                # Recurse — new trace at the next depth level
                result = run_hierarchical(subtask, depth + 1)
                results.append(result)

        trace.end(status="success")
        return combine(results)
    except Exception:
        trace.end(status="error")
        raise
```

By including depth in trace metadata, you can filter traces by depth level to understand your decomposition tree. This approach maps naturally onto LangChain and Google ADK recursive agent patterns.
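
Once depth is in the metadata, a small script over exported traces shows where the time goes. A sketch assuming each exported trace is a dict with a `metadata` field and a `duration_ms` field; that export shape is an assumption for illustration, not a documented Nexus format:

```python
from collections import defaultdict

def time_by_depth(traces: list[dict]) -> dict[int, float]:
    """Sum total trace duration per decomposition depth."""
    totals: dict[int, float] = defaultdict(float)
    for t in traces:
        totals[t["metadata"]["depth"]] += t["duration_ms"]
    return dict(totals)
```

If the deepest level dominates the totals, real work happens at the leaves; if depth 0 dominates, planning overhead is the problem.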

Pattern 4: Consensus voting

Architecture: Multiple agents answer the same question independently. A consensus mechanism (majority vote, threshold agreement, ranking) selects the final answer. Used for high-stakes decisions where a single agent's judgment is insufficient.

Observability challenge: A consensus failure (no agreement) can mask a deeper problem: all agents gave different wrong answers because of bad context, not because the question was ambiguous. You need to see all individual answers, not just the final outcome.

Instrumentation approach: One trace per consensus run, one span per agent vote (with the answer captured in span output). A final consensus-check span captures the vote tally, the winner, and whether the threshold was reached.

```python
from collections import Counter

def run_consensus(question: str, agents: list[str], required: int = 2):
    """Ask each agent the same question; require N agreeing answers."""
    trace = nexus.start_trace(
        name=f"consensus: {question[:50]}",
        metadata={"pattern": "consensus", "required": required, "agents": agents},
    )

    try:
        answers = {}
        for agent_name in agents:
            span = trace.add_span(
                name=f"vote:{agent_name}",
                input={"question": question},
            )
            answer = ask_agent(agent_name, question)
            answers[agent_name] = answer
            span.end(output={"answer": answer}, status="ok")

        # Tally votes
        counts = Counter(answers.values())
        winner, votes = counts.most_common(1)[0]

        consensus_span = trace.add_span(
            name="consensus-check",
            input={"vote_tally": dict(counts)},
        )
        consensus_span.end(
            output={"winner": winner, "votes": votes, "reached": votes >= required},
            status="ok" if votes >= required else "error",
        )

        if votes < required:
            # The except block below ends the trace with an error status
            raise ValueError(f"No consensus: best answer got {votes}/{required} votes")

        trace.end(status="success")
        return winner
    except Exception:
        trace.end(status="error")
        raise
```
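
One caveat with tallying raw strings: exact-match voting is fragile for free-text answers, since "Paris" and " paris. " never agree. A minimal normalization pass before counting is usually worth it (this helper is an illustrative sketch, not part of the SDK):

```python
import string

def normalize(answer: str) -> str:
    """Casefold, trim, and drop punctuation so trivially different
    phrasings of the same answer count as one vote."""
    cleaned = answer.strip().casefold()
    return cleaned.translate(str.maketrans("", "", string.punctuation))
```

Tally `Counter(normalize(a) for a in answers.values())` instead of the raw strings. For longer answers, agreement may need a semantic comparison such as an embedding-similarity threshold, which exact matching does not cover.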

Common multi-agent debugging mistakes

The failure modes above recur across patterns. The mistakes worth checking for first:

- Letting a supervisor swallow or silently retry sub-agent errors, so the failing sub-agent never appears in the trace.
- Relying on log ordering to debug parallel peers instead of giving each peer a span with its own timestamps.
- Decomposing recursively without a depth limit, so subtask count, latency, and cost grow unbounded.
- Recording only the consensus outcome rather than every individual answer, which hides failures caused by shared bad context.

Framework guides

These patterns apply across frameworks, and the same trace-and-span instrumentation carries over to framework-specific integrations.

Start tracing your multi-agent system

Free plan: 1,000 traces/month. Python + TypeScript SDKs. No infrastructure.
