Observability for LlamaIndex Agents and Query Pipelines
LlamaIndex gives you QueryPipelines and AgentWorkers for building RAG and agent workflows — but when retrieval quality drops, a ReAct loop over-iterates, or a tool call fails silently, standard logging can't tell you which step broke. Here's how to instrument LlamaIndex with full trace observability using Nexus.
LlamaIndex’s two agent primitives
LlamaIndex gives you two main ways to build AI agents: QueryPipelines and AgentWorkers. A QueryPipeline is a declarative, DAG-based abstraction — you chain retrieval, reranking, LLM, and synthesis steps together. An AgentWorker (used via AgentRunner) drives a ReAct loop where the agent iteratively calls tools until it reaches an answer.
Both share a failure surface that standard logging misses:
- Retrieval quality drops silently: the retriever returns low-relevance chunks and the LLM synthesizes an incorrect answer without any error raised.
- ReAct loops over-iterate: the agent calls the same tool repeatedly, burning tokens without making progress.
- Tool call failures are swallowed: a tool raises an exception that the agent treats as an observation, continuing the loop rather than surfacing a hard error.
- Latency spikes in pipeline steps: a slow reranker or vector store adds seconds to every query without visibility into which step is the bottleneck.
Instrumenting a QueryPipeline
QueryPipelines are step-based, which maps cleanly to Nexus spans: one span per pipeline step. Wrap the full run in a parent trace and emit a child span at each stage. For clarity, the example below invokes the retriever and synthesizer directly; the same pattern applies when you drive them through QueryPipeline.run:
```python
import os
import time

from llama_index.core import VectorStoreIndex
from llama_index.core.response_synthesizers import TreeSummarize
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.llms.openai import OpenAI
from nexus_sdk import NexusClient

nexus = NexusClient(api_key=os.environ["NEXUS_API_KEY"])

# Build pipeline components ([...] stands in for your document list)
retriever = VectorIndexRetriever(index=VectorStoreIndex.from_documents([...]))
llm = OpenAI(model="gpt-4o")
summarizer = TreeSummarize(llm=llm)

def run_pipeline_with_tracing(query: str, user_id: str) -> str:
    trace = nexus.start_trace({
        "agent_id": "llamaindex-rag-pipeline",
        "name": f"rag: {query[:60]}",
        "status": "running",
        "started_at": nexus.now(),
        "metadata": {
            "user_id": user_id,
            "query_length": len(query),
            "environment": os.environ.get("APP_ENV", "dev"),
        },
    })
    trace_id = trace["trace_id"]
    try:
        # Retrieval span
        t0 = time.time()
        nodes = retriever.retrieve(query)
        retrieval_ms = int((time.time() - t0) * 1000)
        nexus.add_span(trace_id, {
            "name": "retrieve",
            "status": "success",
            "latency_ms": retrieval_ms,
            "metadata": {
                "nodes_retrieved": len(nodes),
                "top_score": nodes[0].score if nodes else None,
            },
        })

        # Synthesis span
        t1 = time.time()
        response = summarizer.synthesize(query, nodes=nodes)
        synthesis_ms = int((time.time() - t1) * 1000)
        nexus.add_span(trace_id, {
            "name": "synthesize",
            "status": "success",
            "latency_ms": synthesis_ms,
            "metadata": {
                "response_length": len(str(response)),
            },
        })

        nexus.end_trace(trace_id, {"status": "success"})
        return str(response)
    except Exception as e:
        nexus.end_trace(trace_id, {
            "status": "error",
            "error": str(e),
        })
        raise
```
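The retrieval and synthesis spans above follow the same time-and-report pattern. A small context manager keeps that boilerplate in one place; `traced_step` is an illustrative helper written against the calls shown in this post, not part of the Nexus SDK:

```python
import time
from contextlib import contextmanager

@contextmanager
def traced_step(client, trace_id: str, name: str):
    """Time a block of work and report it to Nexus as a single span."""
    t0 = time.time()
    meta: dict = {}
    status = "success"
    try:
        # The caller can attach fields to the span, e.g. meta["nodes_retrieved"] = 4
        yield meta
    except Exception:
        status = "error"
        raise
    finally:
        client.add_span(trace_id, {
            "name": name,
            "status": status,
            "latency_ms": int((time.time() - t0) * 1000),
            "metadata": meta,
        })
```

With this helper, the retrieval stage becomes `with traced_step(nexus, trace_id, "retrieve") as meta: nodes = retriever.retrieve(query); meta["nodes_retrieved"] = len(nodes)`, and failed steps are reported with `status: "error"` automatically.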
Instrumenting an AgentWorker
AgentWorkers run a ReAct loop internally. The key insight is that each task step (a reason-act-observe cycle) is a natural span boundary. Instead of a single chat call, drive the loop yourself with create_task and run_step, and record how many steps ran so over-iteration shows up directly in the trace:
```python
import os
import time

from llama_index.core.agent import ReActAgent
from llama_index.core.tools import FunctionTool
from llama_index.llms.openai import OpenAI
from nexus_sdk import NexusClient

nexus = NexusClient(api_key=os.environ["NEXUS_API_KEY"])

# Define tools
def search_docs(query: str) -> str:
    """Search internal documentation."""
    return f"Documentation result for: {query}"

def run_sql(query: str) -> str:
    """Run a read-only SQL query."""
    return f"Query result for: {query}"

search_tool = FunctionTool.from_defaults(fn=search_docs)
sql_tool = FunctionTool.from_defaults(fn=run_sql)

agent = ReActAgent.from_tools(
    [search_tool, sql_tool],
    llm=OpenAI(model="gpt-4o"),
    max_iterations=10,
    verbose=False,
)

def run_agent_with_tracing(task: str, user_id: str) -> str:
    trace = nexus.start_trace({
        "agent_id": "llamaindex-react-agent",
        "name": f"agent: {task[:60]}",
        "status": "running",
        "started_at": nexus.now(),
        "metadata": {
            "user_id": user_id,
            "tools": ["search_docs", "run_sql"],
        },
    })
    trace_id = trace["trace_id"]
    t0 = time.time()
    try:
        # Drive the ReAct loop step by step so each cycle becomes a span
        task_obj = agent.create_task(task)
        step_count = 0
        while True:
            step_start = time.time()
            step_output = agent.run_step(task_obj.task_id)
            step_count += 1
            nexus.add_span(trace_id, {
                "name": f"step:{step_count}",
                "status": "success",
                "latency_ms": int((time.time() - step_start) * 1000),
            })
            if step_output.is_last:
                break
        response = agent.finalize_response(task_obj.task_id)
        nexus.end_trace(trace_id, {
            "status": "success",
            "latency_ms": int((time.time() - t0) * 1000),
            "metadata": {
                "steps": step_count,
                "response_length": len(str(response)),
            },
        })
        return str(response)
    except Exception as e:
        nexus.end_trace(trace_id, {
            "status": "error",
            "latency_ms": int((time.time() - t0) * 1000),
            "error": str(e),
        })
        raise
```
What to track per tool call
For each tool call in the ReAct loop, emit a span that captures the tool name, input, and whether the call succeeded. This makes over-iteration patterns immediately visible — you can see that a specific tool was called seven times before the agent gave up:
```python
import functools
import time

# Wrap individual tool functions to emit spans. functools.wraps preserves
# the function's name and docstring, which FunctionTool.from_defaults uses
# for the tool's name and description.
def traced_tool(tool_fn, tool_name: str, nexus_client, trace_id: str):
    @functools.wraps(tool_fn)
    def wrapper(*args, **kwargs):
        t0 = time.time()
        try:
            result = tool_fn(*args, **kwargs)
            nexus_client.add_span(trace_id, {
                "name": f"tool:{tool_name}",
                "status": "success",
                "latency_ms": int((time.time() - t0) * 1000),
                "metadata": {
                    "input": str(args[0]) if args else str(kwargs),
                    "output_length": len(str(result)),
                },
            })
            return result
        except Exception as e:
            nexus_client.add_span(trace_id, {
                "name": f"tool:{tool_name}",
                "status": "error",
                "latency_ms": int((time.time() - t0) * 1000),
                "error": str(e),
            })
            raise
    return wrapper
```
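One wrinkle: this wrapper binds a trace_id at wrap time, but agents are usually constructed once per process while traces are per request. A contextvar lets globally wrapped tools resolve the active trace at call time. The `traced_tool_ctx` variant below is an illustrative pattern, not part of the Nexus SDK:

```python
import contextvars
import functools
import time

# Request-scoped trace id, so tools wrapped at startup can find the active trace.
current_trace_id = contextvars.ContextVar("nexus_trace_id", default=None)

def traced_tool_ctx(tool_fn, tool_name: str, nexus_client):
    """Like traced_tool, but resolves the trace id at call time instead of wrap time."""
    @functools.wraps(tool_fn)
    def wrapper(*args, **kwargs):
        trace_id = current_trace_id.get()
        t0 = time.time()
        try:
            result = tool_fn(*args, **kwargs)
            status, error = "success", None
        except Exception as e:
            status, error = "error", str(e)
            raise
        finally:
            # Skip span emission entirely when no trace is active (e.g. in tests)
            if trace_id is not None:
                span = {
                    "name": f"tool:{tool_name}",
                    "status": status,
                    "latency_ms": int((time.time() - t0) * 1000),
                }
                if error:
                    span["error"] = error
                nexus_client.add_span(trace_id, span)
        return result
    return wrapper
```

Your request handler then calls `current_trace_id.set(trace_id)` right after `start_trace`, and every tool call inside that request lands on the correct trace.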
Diagnosing common failure patterns
Once you have traces flowing into Nexus, three patterns emerge quickly in production LlamaIndex deployments:
Low retrieval scores. If your top node score is consistently below 0.5, the synthesizer is working from weak context, and answers will degrade even when they still look plausible. Bump similarity_top_k or switch to a hybrid retriever so stronger candidates reach the synthesis step.
ReAct loops hitting max_iterations. If you see traces where the agent hits max_iterations and returns an incomplete answer, the issue is usually either a poorly scoped tool (returning too much data per call) or ambiguous instructions. Look at which tool was called repeatedly.
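Assuming you can export a trace's spans as a list of dicts (the same shape passed to add_span), spotting the repeatedly called tool is a one-liner with a Counter; the helper name and default threshold here are illustrative:

```python
from collections import Counter

def over_iterating_tools(spans: list[dict], threshold: int = 3) -> dict[str, int]:
    """Return tool-span names that appear more than `threshold` times in one trace."""
    counts = Counter(
        s["name"] for s in spans if s.get("name", "").startswith("tool:")
    )
    return {name: n for name, n in counts.items() if n > threshold}
```

Running this over traces that hit max_iterations usually points at a single dominant tool, which is where to focus prompt or tool-design fixes.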
Synthesis latency spikes. QueryPipeline synthesis latency spikes often trace back to the LLM being passed too many retrieved chunks. If your synthesis span shows latency > 5s, reduce similarity_top_k or add a reranker to trim the context window.
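Averages hide these spikes; percentiles expose them. A quick offline check over exported spans (again treating each span as a dict, with names as in the examples above) might look like:

```python
def p95_latency_ms(spans: list[dict], step_name: str) -> int:
    """95th-percentile latency for spans with a given name; 0 if none match."""
    latencies = sorted(
        s["latency_ms"] for s in spans if s.get("name") == step_name
    )
    if not latencies:
        return 0
    # Index of the 95th percentile, clamped to the last element
    idx = min(len(latencies) - 1, int(0.95 * len(latencies)))
    return latencies[idx]
```

A p95 far above the median for the "synthesize" span is the signature of oversized context on a subset of queries.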
Adding metadata for search and filtering
The most useful traces are filterable. Add metadata that lets you slice by document collection, user segment, or query type in the Nexus dashboard:
```python
trace = nexus.start_trace({
    "agent_id": "llamaindex-rag-pipeline",
    "name": f"rag: {query[:60]}",
    "status": "running",
    "started_at": nexus.now(),
    "metadata": {
        # Indexing context
        "index_name": "product-docs-v3",
        "retriever_type": "hybrid",
        "top_k": 8,
        # User context
        "user_id": user_id,
        "user_plan": "pro",
        # Query context
        "query_type": classify_query(query),  # "factual" | "how-to" | "comparison"
        "environment": os.environ.get("APP_ENV", "dev"),
    },
})
```
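The classify_query helper is yours to define. A minimal keyword heuristic (purely illustrative; swap in a real classifier if your traffic warrants it) could look like:

```python
def classify_query(query: str) -> str:
    """Toy keyword heuristic for query_type metadata."""
    q = query.lower()
    if any(marker in q for marker in ("how do i", "how to", "steps to")):
        return "how-to"
    if any(marker in q for marker in (" vs ", " versus ", "compare", "difference between")):
        return "comparison"
    return "factual"
```

Even a rough classifier pays off: filtering traces by query_type quickly shows, for example, that comparison queries drive most of your max_iterations failures.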
Next steps
With LlamaIndex traces in Nexus, you gain real production visibility: which pipeline steps are slow, which tools over-iterate, and which retrievers produce low-quality results. The per-agent error rate view in Nexus makes it easy to spot regressions when you change index configurations or switch models.
Sign up for a free Nexus account and start tracing your LlamaIndex pipelines in production.
Add observability to LlamaIndex
Free tier, no credit card required. Instrument your first pipeline in under 5 minutes.