Debugging CrewAI Agents: 5 Multi-Agent Bugs You Can Only Find with Traces
CrewAI makes it easy to build multi-agent pipelines. It also makes it easy to build pipelines where failures are invisible. Silent tool failures, agent handoff context bugs, infinite delegation loops, crew memory leaks — here's what each looks like in a trace and how to fix them.
CrewAI's agent abstraction is elegant: define agents with roles, give them tools, wire them together into a crew. The framework handles delegation, inter-agent communication, and task sequencing. What it doesn't do is tell you why things went wrong — especially when they go wrong silently.
Most CrewAI failures leave no obvious error. The crew completes. The final output looks plausible. But somewhere in the multi-agent chain, a tool returned nothing, an agent misread the output of its predecessor, or a delegation loop burned your entire token budget on two agents arguing with each other. None of this is visible in logs. In a trace, it's immediate.
Bug 1: Silent tool failures in crew tasks
In logs: Tool execution completes. Task completes. The final report mentions "the search results indicated..." but the search actually returned nothing. The LLM hallucinated the results.
In a trace: The tool span shows output: "". The subsequent LLM span shows this empty string was passed as the tool result. You can trace the hallucination directly to its source: the agent received empty context and invented a response rather than declaring failure.
from crewai.tools import BaseTool

class SearchTool(BaseTool):
    name: str = "search"
    description: str = "Search for information"

    def _run(self, query: str) -> str:
        results = search_api.query(query)
        # Bug: returns empty string silently on no results
        return results.text if results else ""

# In a trace: tool_output="" — agent saw nothing, hallucinated answer
# Fix: raise an exception on empty results so the trace shows error status
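The fix is framework-agnostic: validate the tool's result before returning it, so an empty payload becomes an error span in the trace instead of empty context for the agent. A minimal sketch (the helper name and the plain-function form are illustrative, not CrewAI API):

```python
def require_tool_output(output: str, tool_name: str = "search") -> str:
    """Raise instead of silently returning an empty tool result.

    An exception here gives the tool span an error status in the trace,
    so the agent never receives empty context to hallucinate from.
    """
    if not output or not output.strip():
        raise ValueError(f"Tool '{tool_name}' returned no results")
    return output

# Non-empty output passes through unchanged
print(require_tool_output("CrewAI docs: agents, tasks, crews"))

# Empty output raises instead of propagating ""
try:
    require_tool_output("")
except ValueError as e:
    print(f"error: {e}")
```

Inside a `BaseTool._run`, the same guard means the failure shows up at its source, not three spans later as a hallucinated answer.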
Bug 2: Agent handoff context bugs
In logs: Agent A completes successfully. Agent B produces incorrect output. Nothing obviously went wrong between them.
In a trace: You can see exactly what Agent A's output was (the context passed to Agent B) and compare it to what Agent B received. In CrewAI, the output of one task becomes the context for the next. If Agent A's output is truncated, formatted unexpectedly, or contains a marker that Agent B's prompt wasn't designed to handle, the trace shows the exact mismatch.
from crewai import Agent, Task, Crew

researcher = Agent(role='Researcher', goal='Find information', llm=llm)
writer = Agent(role='Writer', goal='Summarize findings', llm=llm)

research_task = Task(
    description='Research topic X',
    agent=researcher,
    expected_output='Bullet list of findings',  # Agent B expects bullets
)
write_task = Task(
    description='Write a summary of the research findings',
    expected_output='A short prose summary',
    agent=writer,
    context=[research_task],  # Agent A may return prose, not bullets
)
# In a trace: research_task.output is prose — write_task gets wrong format
# Fix: be explicit about output format in the expected_output field
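You can also catch the mismatch at the handoff boundary with a cheap format check before the writer runs. A sketch (the bullet-list convention comes from `expected_output` above; the helper itself is not part of CrewAI):

```python
def looks_like_bullet_list(text: str) -> bool:
    """Heuristic check that a handoff payload is a bullet list."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return bool(lines) and all(ln.startswith(("-", "*", "•")) for ln in lines)

prose = "The research showed three findings about topic X."
bullets = "- finding one\n- finding two\n- finding three"

print(looks_like_bullet_list(prose))    # False: prose slipped through
print(looks_like_bullet_list(bullets))  # True: matches expected_output
```

Logging the check's result alongside the task span turns "Agent B produced incorrect output" into "Agent A violated the handoff contract", which is the actual bug.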
Bug 3: Infinite delegation loops
In logs: The crew runs for a long time. Eventually it hits max iterations or a token limit. The error is generic.
In a trace: The delegation chain is the trace hierarchy. You see Agent A delegating to Agent B, which delegates back to Agent A, which delegates to Agent B again. The trace shows every step of the loop with the exact input and output at each stage — making it obvious that neither agent was equipped to handle the task it kept receiving.
from crewai import Agent, Crew, Process

# Agents can delegate when allow_delegation=True
agent_a = Agent(role='Analyst', goal='Analyze data', allow_delegation=True, llm=llm)
agent_b = Agent(role='Researcher', goal='Find data', allow_delegation=True, llm=llm)
crew = Crew(
    agents=[agent_a, agent_b],
    tasks=[task],
    process=Process.hierarchical,
    manager_llm=llm,  # hierarchical process requires a manager LLM
)
# Hierarchical + mutual delegation = potential loops
# Fix: set allow_delegation=False on leaf agents,
# or set max_iter on individual agents to cap iterations
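The loop is also easy to detect mechanically: count repeated agent-to-agent handoffs and fail loudly past a threshold. This is a sketch of the idea, independent of CrewAI's internals (inside the framework, per-agent `max_iter` plays a similar role):

```python
from collections import Counter

class DelegationGuard:
    """Raise when the same agent-to-agent handoff repeats too often."""

    def __init__(self, max_repeats: int = 3):
        self.max_repeats = max_repeats
        self.handoffs = Counter()

    def record(self, from_agent: str, to_agent: str) -> None:
        self.handoffs[(from_agent, to_agent)] += 1
        if self.handoffs[(from_agent, to_agent)] > self.max_repeats:
            raise RuntimeError(
                f"Delegation loop: {from_agent} -> {to_agent} "
                f"repeated more than {self.max_repeats} times"
            )

guard = DelegationGuard(max_repeats=2)
guard.record("Analyst", "Researcher")
guard.record("Researcher", "Analyst")
guard.record("Analyst", "Researcher")      # 2nd repeat: still allowed
try:
    guard.record("Analyst", "Researcher")  # 3rd repeat: loop detected
except RuntimeError as e:
    print(e)
```

A trace makes the same pattern visible without instrumentation, because the repeated delegation spans are literally stacked in the hierarchy.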
Bug 4: CrewAI memory leaks across tasks
In logs: The crew's later tasks take longer and cost more tokens than early tasks. No obvious cause.
In a trace: Each task span shows the token count for that step. You can see the input token count growing across tasks — the crew is accumulating conversation history and passing the entire history to each subsequent LLM call. By the fifth task, the agent is sending 18,000 tokens of history to answer a simple question.
from crewai import Crew

# memory=True enables CrewAI's built-in memory across tasks
crew = Crew(
    agents=[agent_a, agent_b],
    tasks=[task_1, task_2, task_3, task_4, task_5],
    memory=True,  # accumulates context across all tasks
    verbose=True,
)
# In a trace: task_5 input_tokens = 18,000 vs task_1 input_tokens = 2,000
# Fix: use memory selectively, or limit the memory window with custom config
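If you want memory on but bounded, the trimming logic is simple to sketch. This is a rough approximation, not CrewAI's actual memory config, and the 4-characters-per-token estimate is a crude heuristic rather than a real tokenizer:

```python
def trim_history(messages: list[str], max_tokens: int = 4000) -> list[str]:
    """Keep only the most recent messages that fit a token budget.

    Token count is estimated at roughly 4 characters per token, a
    crude heuristic; swap in a real tokenizer for production use.
    """
    budget = max_tokens * 4  # convert the token budget to characters
    kept, used = [], 0
    for msg in reversed(messages):  # walk newest-first
        if used + len(msg) > budget:
            break
        kept.append(msg)
        used += len(msg)
    return list(reversed(kept))  # restore chronological order

history = [f"task {i} output: " + "x" * 8000 for i in range(1, 6)]
trimmed = trim_history(history, max_tokens=4000)
print(len(trimmed))  # 1: only the newest task output fits the budget
```

The trace tells you when to reach for this: flat input-token counts across tasks mean the window is bounded; a staircase means it isn't.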
Bug 5: Tool permission errors that look like agent failures
In logs: Agent raises an exception. Stack trace points to the agent's execute method. Root cause is buried three levels deep in the tool's error handling.
In a trace: The tool span is red (error status). The error field shows the actual exception — PermissionError: API key not set or RateLimitError: 429. The agent span shows this tool error as the cause of the agent failure. In the log, the agent error masks the tool error. In the trace, the tool error is the first thing you see.
import nexus_sdk as nexus

# Wrap your crew execution in a Nexus trace
trace = nexus.start_trace("crewai_run", metadata={"crew": "research_crew"})
try:
    result = crew.kickoff()
    nexus.end_trace(trace.id, status="success")
except Exception as e:
    # The tool error that caused the agent failure is visible in spans
    nexus.end_trace(trace.id, status="error", error=str(e))
    raise
The pattern: CrewAI needs a trace for each kickoff
Multi-agent systems fail at the boundaries between agents — not inside any single agent. CrewAI's orchestration layer is exactly where these failures accumulate. A trace wrapping each crew.kickoff() call gives you a timestamped record of every agent execution, every tool call, and every handoff, with causal links between them.
Adding Nexus to a CrewAI pipeline is three lines of Python. The ROI is immediate the first time you hit a bug that would have taken hours to debug from logs.
Add tracing to your CrewAI agents
Nexus gives you distributed traces, per-agent error rates, and token cost tracking for every crew run. Free to start — no credit card required.
Start free →