Single-agent observability is relatively straightforward: one trace per run, spans for LLM calls and tool uses, done. Multi-agent systems are harder. When 5 agents collaborate on a task, "which agent caused the failure?" becomes genuinely ambiguous. Blame cascades from one agent to the next, a timeout in one agent surfaces as a confusing error in another, and emergent behaviors appear that no individual agent produces alone.
This post covers 4 multi-agent architectural patterns — supervisor, peer-to-peer, hierarchical, and consensus — with practical instrumentation approaches for each. All examples use Python with the keylightdigital-nexus SDK.
Pattern 1: Supervisor + Sub-agents
Architecture: A central supervisor agent receives the task, routes it to specialized sub-agents, and aggregates results. The supervisor is the single entry point — it breaks down work and assigns it.
Observability challenge: When a sub-agent fails, the supervisor often swallows the error or retries silently. Without instrumentation, you see the supervisor fail but not which sub-agent caused it.
Instrumentation approach: One trace per full pipeline run. The supervisor creates the trace; each sub-agent invocation gets its own span. If a sub-agent fails, its span captures the error before the supervisor handles it.
```python
import os

from nexus_client import NexusClient

nexus = NexusClient(
    api_key=os.environ["NEXUS_API_KEY"],
    agent_id="supervisor-agent",
)


def run_supervised_pipeline(task: str):
    # One trace for the full pipeline
    trace = nexus.start_trace(
        name=f"supervisor: {task[:50]}",
        metadata={"pattern": "supervisor", "task": task},
    )
    try:
        # Supervisor analyzes and routes
        router_span = trace.add_span(
            name="supervisor-routing",
            input={"task": task},
        )
        agent_name = route_to_agent(task)  # your routing logic
        router_span.end(output={"routed_to": agent_name}, status="ok")

        # Sub-agent execution: capture the error on the sub-agent's
        # span before the supervisor's handler sees it
        sub_span = trace.add_span(
            name=f"sub-agent:{agent_name}",
            input={"task": task, "agent": agent_name},
        )
        try:
            result = run_sub_agent(agent_name, task)
        except Exception as e:
            sub_span.end(output={"error": str(e)}, status="error")
            raise
        sub_span.end(output={"result": result[:200]}, status="ok")

        trace.end(status="success")
        return result
    except Exception:
        trace.end(status="error")
        raise
```
In the Nexus dashboard, you'll see one trace per task with spans showing supervisor routing time, sub-agent execution time, and any failures isolated to the specific span. This is the pattern used by AutoGen and CrewAI hierarchies.
Pattern 2: Peer-to-peer collaboration
Architecture: Multiple agents run in parallel, each handling a specialization, with results combined at the end. No central authority — agents operate independently.
Observability challenge: When peers run in parallel, failures from different agents can land in logs out of order. The aggregation step may succeed even if one peer produced bad output — which won't show as an error until much later.
Instrumentation approach: One trace for the full pipeline, one span per peer. Since spans are added to the trace as they complete, the waterfall shows each peer's contribution even if they ran in parallel.
```python
import asyncio
import os

from nexus_client import NexusClient

nexus = NexusClient(
    api_key=os.environ["NEXUS_API_KEY"],
    agent_id="peer-coordinator",
)


async def run_peer_pipeline(task: str):
    trace = nexus.start_trace(
        name=f"peer-pipeline: {task[:50]}",
        metadata={"pattern": "peer-to-peer", "agent_count": 3},
    )
    try:
        # All peers run in parallel — one span per peer
        async def run_peer(name: str, subtask: str):
            span = trace.add_span(
                name=f"peer:{name}",
                input={"subtask": subtask},
            )
            try:
                result = await agent_run(name, subtask)
            except Exception as e:
                # End the peer's span with the error so the failure is
                # attributed to this peer, not to the pipeline
                span.end(output={"error": str(e)}, status="error")
                raise
            span.end(output={"result": result[:200]}, status="ok")
            return result

        results = await asyncio.gather(
            run_peer("researcher", "gather facts"),
            run_peer("analyst", "analyze trends"),
            run_peer("writer", "draft outline"),
        )

        # Aggregation step
        agg_span = trace.add_span(
            name="aggregation",
            input={"result_count": len(results)},
        )
        final = aggregate(results)
        agg_span.end(output={"summary_len": len(final)}, status="ok")

        trace.end(status="success")
        return final
    except Exception:
        trace.end(status="error")
        raise
```
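One subtlety: with `asyncio.gather`'s defaults, the first peer exception propagates immediately and the remaining peers' results are discarded, so their spans may never record an end state. Passing `return_exceptions=True` lets every peer run to completion and end its own span, with failures returned as values to inspect. A minimal sketch (`gather_isolated` is our name, not an SDK API):

```python
import asyncio


async def gather_isolated(*coros):
    # With return_exceptions=True, one failing peer does not abort the
    # await: every peer runs to completion (and can end its own span),
    # and failures come back as values rather than a raised exception
    results = await asyncio.gather(*coros, return_exceptions=True)
    failures = [r for r in results if isinstance(r, Exception)]
    return results, failures
```

The coordinator can then decide whether partial results are enough to aggregate, and end the trace with an error status listing exactly which peers failed.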
Pattern 3: Hierarchical orchestration
Architecture: An orchestrator breaks a complex task into subtasks recursively. Each level delegates to the next until reaching leaf tasks that execute directly. Used for long-horizon planning and complex research tasks.
Observability challenge: Recursive systems can produce unbounded depth. An orchestrator that decomposes too aggressively creates hundreds of sub-tasks, each taking real time and money. Without instrumentation, you won't know how deep the recursion went or where most of the time was spent.
Instrumentation approach: One trace per orchestrator level. The depth metadata allows you to see the decomposition tree across traces. Add a depth limit guard with explicit error handling to prevent runaway recursion.
```python
def run_hierarchical(task: str, depth: int = 0):
    """Recursive orchestrator: each level creates its own trace."""
    trace = nexus.start_trace(
        name=f"orchestrator-L{depth}: {task[:40]}",
        metadata={"depth": depth, "pattern": "hierarchical"},
    )
    try:
        plan_span = trace.add_span(
            name="planning",
            input={"task": task, "depth": depth},
        )
        subtasks = decompose(task)  # returns list of subtasks
        plan_span.end(
            output={"subtask_count": len(subtasks), "subtasks": subtasks[:3]},
            status="ok",
        )

        results = []
        for subtask in subtasks:
            if is_leaf(subtask) or depth >= 2:
                # Execute directly
                exec_span = trace.add_span(
                    name=f"execute:{subtask[:30]}",
                    input={"subtask": subtask},
                )
                result = execute_leaf(subtask)
                exec_span.end(output={"result": result[:200]}, status="ok")
                results.append(result)
            else:
                # Recurse: new trace at the next depth level
                result = run_hierarchical(subtask, depth + 1)
                results.append(result)

        trace.end(status="success")
        return combine(results)
    except Exception:
        trace.end(status="error")
        raise
```
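Note that the `depth >= 2` cutoff stops recursion silently by forcing leaf execution. If you would rather fail loudly when decomposition runs away, a minimal guard looks like this (`MAX_DEPTH` and `check_depth` are our names, not SDK APIs; call it before decomposing at each level):

```python
MAX_DEPTH = 4


def check_depth(depth: int, task: str) -> None:
    # Raise instead of silently truncating the decomposition tree, so
    # the failing trace records how deep the recursion actually went
    if depth >= MAX_DEPTH:
        raise RecursionError(
            f"decomposition exceeded depth {MAX_DEPTH} for task: {task[:40]}"
        )
```

The resulting `RecursionError` propagates through each level's `except` block, so every trace on the runaway path ends with an error status.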
By including depth in trace metadata, you can filter traces by depth level to understand your decomposition tree. This approach maps naturally onto LangChain and Google ADK recursive agents.
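To see the shape of the tree at a glance, you can aggregate exported trace records by that depth field. A sketch assuming traces are available locally as plain dicts with a `metadata` key (the export format here is an assumption, not a documented Nexus schema):

```python
from collections import Counter


def depth_histogram(traces: list[dict]) -> Counter:
    # Count traces per decomposition depth; a long tail of deep
    # levels is a sign the orchestrator is over-decomposing
    return Counter(t["metadata"].get("depth", 0) for t in traces)
```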
Pattern 4: Consensus voting
Architecture: Multiple agents answer the same question independently. A consensus mechanism (majority vote, threshold agreement, ranking) selects the final answer. Used for high-stakes decisions where a single agent's judgment is insufficient.
Observability challenge: A consensus failure (no agreement) can mask a deeper problem: all agents gave different wrong answers because of bad context, not because the question was ambiguous. You need to see all individual answers, not just the final outcome.
Instrumentation approach: One trace per consensus run, one span per agent vote (with the answer captured in span output). A final consensus-check span captures the vote tally, the winner, and whether the threshold was reached.
```python
from collections import Counter


def run_consensus(question: str, agents: list[str], required: int = 2):
    """Ask each agent independently, require N agreeing answers."""
    trace = nexus.start_trace(
        name=f"consensus: {question[:50]}",
        metadata={"pattern": "consensus", "required": required, "agents": agents},
    )
    try:
        answers = {}
        for agent_name in agents:
            span = trace.add_span(
                name=f"vote:{agent_name}",
                input={"question": question},
            )
            answer = ask_agent(agent_name, question)
            answers[agent_name] = answer
            span.end(output={"answer": answer}, status="ok")

        # Tally votes
        counts = Counter(answers.values())
        winner, votes = counts.most_common(1)[0]
        reached = votes >= required

        consensus_span = trace.add_span(
            name="consensus-check",
            input={"vote_tally": dict(counts)},
        )
        consensus_span.end(
            output={"winner": winner, "votes": votes, "reached": reached},
            status="ok" if reached else "error",
        )

        if not reached:
            raise ValueError(f"No consensus: best answer got {votes}/{required} votes")

        trace.end(status="success")
        return winner
    except Exception:
        trace.end(status="error")
        raise
```
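One caveat: `Counter` tallies answers as exact strings, so trivially different phrasings of the same answer split the vote. A light normalization pass before tallying helps (`normalize_answer` is our helper, not part of the SDK; free-form answers may need semantic matching instead):

```python
def normalize_answer(answer: str) -> str:
    # Collapse case, inner/outer whitespace, and a trailing period so
    # "Paris." and " paris" count as the same vote
    return " ".join(answer.lower().strip().split()).rstrip(".")
```

Apply it when building the tally, e.g. `Counter(normalize_answer(a) for a in answers.values())`, and log both the raw and normalized answers in the vote spans so you can audit the mapping.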
Common multi-agent debugging mistakes
- Logging at the wrong granularity: One trace per agent (not per pipeline run) makes it impossible to understand which agents ran in the same session and how they relate.
- Swallowing sub-agent errors at the coordinator: If a supervisor catches exceptions and retries without logging them, failures become invisible. Log every exception in a span before re-raising or retrying.
- Not logging the routing decision: In supervisor patterns, the routing span (which agent was chosen and why) is often the most valuable debugging information. Log the routing rationale explicitly.
- Missing timeout instrumentation: Parallel peer pipelines can hang if one agent times out without surfacing the error. Add span-level timeout tracking.
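The last point can be sketched with `asyncio.wait_for`: wrap each peer call so a hung agent ends its span explicitly instead of stalling the whole pipeline. A minimal sketch (the helper and the `"timeout"` status value are our assumptions, not documented SDK behavior):

```python
import asyncio


async def run_with_timeout(coro, span, timeout_s: float):
    # End the span explicitly on timeout so the hang is visible in the
    # trace instead of the pipeline silently stalling
    try:
        result = await asyncio.wait_for(coro, timeout=timeout_s)
    except asyncio.TimeoutError:
        span.end(output={"error": f"timed out after {timeout_s}s"}, status="timeout")
        raise
    span.end(output={"result": str(result)[:200]}, status="ok")
    return result
```

In the peer pipeline above, each `agent_run` call would be wrapped in `run_with_timeout` so a hung peer produces a clearly marked span rather than an open-ended gap in the waterfall.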
Framework guides
These patterns apply across frameworks. For framework-specific integration guides:
- AutoGen integration guide — ConversableAgent and GroupChat tracing
- CrewAI integration guide — multi-agent crew execution tracing
- LangChain integration guide — agent executor and tool call tracing
- Pydantic AI integration guide — typed agent run tracing
- Google ADK integration guide — multi-agent pipeline tracing