Monitoring RAG Pipelines in Production: A Practical Guide
RAG (retrieval-augmented generation) pipelines are deceptively easy to get working in development and deceptively hard to keep working in production. The retrieval looks fine. The generation looks fine. But users ask questions that fall outside the indexed corpus, the vector similarity scores are borderline, or the LLM confidently answers from irrelevant context — and you have no visibility into which step failed.
This guide covers the RAG failure modes you'll encounter in production, the metrics worth tracking, and how to instrument your pipeline with trace-level observability using the Nexus SDK.
What can go wrong in RAG
RAG failures cluster into three categories:
1. Retrieval failures
The vector store returns chunks that are syntactically similar but semantically irrelevant. The LLM receives wrong context and either hallucinates or says "I don't know." These are the hardest failures to diagnose because the system doesn't error — it just answers incorrectly.
- Low similarity scores — retrieved chunks have cosine similarity below your threshold but still get passed to the LLM
- Embedding model mismatch — your query embeddings and document embeddings were generated by different model versions
- Chunking artifacts — a fact is split across two chunks; neither chunk alone answers the question
- Stale index — your knowledge base was updated but the vector index wasn't re-embedded
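Several of these failure modes share an early warning sign: weak similarity scores. A simple guard is to drop low-scoring chunks before they ever reach the LLM. Here is a minimal sketch, assuming your vector store returns chunks with a `score` attribute; the `Chunk` class and the 0.7 threshold are illustrative, not part of any specific library:

```python
# Sketch: gate retrieved chunks on a minimum similarity score before
# they reach the LLM. The Chunk shape and threshold are assumptions.
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    score: float  # cosine similarity reported by the vector store

def filter_relevant(chunks: list[Chunk], min_score: float = 0.7) -> list[Chunk]:
    """Drop chunks below the similarity threshold. An empty result
    means the query has no good match in the indexed corpus."""
    return [c for c in chunks if c.score >= min_score]

chunks = [Chunk("pricing page", 0.91), Chunk("unrelated blog post", 0.42)]
kept = filter_relevant(chunks)
# Only the 0.91 chunk survives; if kept is empty, prefer answering
# "I don't know" over passing noise to the LLM.
```

If `filter_relevant` returns an empty list, that query is a candidate for logging as a knowledge-base gap rather than an LLM call.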
2. Context window issues
You retrieve 10 chunks because you want high recall. Each chunk is 500 tokens. Add the system prompt, the conversation history, and the query — you're at 7,000 tokens before the LLM generates a single word. With GPT-4o's 128k context, this seems fine. But:
- Lost-in-the-middle problem — LLMs pay more attention to content at the start and end of the context window. Information buried in the middle of a long context gets less attention.
- Token cost — 7,000 input tokens per query adds up fast at scale. Without tracking context length per request, you won't catch runaway costs until the bill arrives.
- Truncation — if you're using a model with a smaller context window, long contexts get truncated silently.
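A cheap defense against all three is to enforce a token budget before the LLM call, keeping chunks in retrieval order so the highest-ranked context always survives trimming. A minimal sketch — the 4-characters-per-token ratio is a rough heuristic; use your model's actual tokenizer (e.g. tiktoken) for exact counts in production:

```python
# Sketch: enforce a context-token budget before calling the LLM.
# estimate_tokens uses a ~4 chars/token heuristic, not a real tokenizer.
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def fit_to_budget(chunks: list[str], budget_tokens: int) -> list[str]:
    """Keep chunks in retrieval order until the budget is exhausted,
    so the best-ranked context is never the part that gets dropped."""
    kept, used = [], 0
    for chunk in chunks:
        cost = estimate_tokens(chunk)
        if used + cost > budget_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept
```

Logging both the pre-trim and post-trim token counts also gives you the data to spot inefficient chunking.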
3. Hallucinations from bad context
When retrieved chunks don't contain the answer, LLMs often extrapolate rather than admitting they don't know. The answer sounds confident and coherent but is fabricated. This is the worst failure mode — users trust wrong answers more than obvious errors.
You can't catch hallucinations without either human review or automated answer evaluation. But you can detect the preconditions for hallucination: low retrieval scores, short context, queries with no good chunk matches.
What to monitor
For each RAG query, you want to capture:
| Metric | Why it matters | Red flag |
|---|---|---|
| retrieval_latency_ms | Vector search cost per query | > 500ms at p95 |
| chunk_count | How many chunks were retrieved | 0 chunks = no context |
| avg_relevance_score | Average cosine similarity of retrieved chunks | < 0.7 signals poor retrieval |
| context_tokens | Total tokens sent to the LLM | Spikes = inefficient chunking |
| generation_latency_ms | LLM call duration | Correlated with context_tokens |
| answer_length | Length of LLM response | Very short = LLM giving up |
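The table above translates directly into a per-query metrics record with the red-flag thresholds encoded as code. A minimal sketch — the class and field names are assumptions for illustration, not part of the Nexus SDK:

```python
# Sketch: one metrics record per RAG query, mirroring the table above.
# Thresholds match the "Red flag" column; tune them to your workload.
from dataclasses import dataclass

@dataclass
class RagQueryMetrics:
    retrieval_latency_ms: float
    chunk_count: int
    avg_relevance_score: float
    context_tokens: int
    generation_latency_ms: float
    answer_length: int

    def red_flags(self) -> list[str]:
        flags = []
        if self.chunk_count == 0:
            flags.append("no_context")
        if self.avg_relevance_score < 0.7:
            flags.append("poor_retrieval")
        if self.retrieval_latency_ms > 500:
            flags.append("slow_retrieval")
        return flags
```

Attaching `red_flags()` output to each trace makes degraded queries filterable instead of anecdotal.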
Instrumenting a RAG pipeline with Nexus
Install the Nexus Python SDK:
pip install keylightdigital-nexus
from nexus import NexusClient
nexus = NexusClient(api_key="nxs_...", agent_id="rag-pipeline")
Tracing retrieval and generation separately
The key pattern is creating one trace per user query, with separate child spans for the retrieval step and the generation step. This lets you see exactly where latency comes from and what data each step received:
async def answer_question(query: str) -> str:
    trace = await nexus.start_trace(name="rag-query", metadata={"query": query})

    # Trace the retrieval step
    retrieval_span = await trace.start_span(
        name="vector-retrieval",
        input={"query": query, "top_k": 5},
    )
    chunks = await vector_store.similarity_search(query, k=5)
    await retrieval_span.end(output={
        "chunk_count": len(chunks),
        "top_score": chunks[0].score if chunks else 0,
        "sources": [c.metadata["source"] for c in chunks],
    })

    # Trace the generation step
    context = "\n".join(c.page_content for c in chunks)
    generation_span = await trace.start_span(
        name="llm-generation",
        input={"context_length": len(context), "query": query},
    )
    response = await llm.apredict(
        "Answer based on context:\n" + context + "\n\nQuestion: " + query
    )
    await generation_span.end(output={
        "answer": response,
        "answer_length": len(response),
    })

    await trace.end(status="success", output={"answer": response})
    return response
In your Nexus dashboard, each query appears as a trace with two child spans: vector-retrieval and llm-generation. You can instantly see whether latency is dominated by vector search or LLM generation — and inspect the inputs and outputs of each.
Logging relevance scores
Most vector stores return similarity scores alongside chunks. Log them. They're your earliest warning signal for retrieval quality degradation:
# Log relevance scores alongside chunk count
retrieval_span = await trace.start_span(
    name="vector-retrieval",
    input={"query": query, "top_k": 5},
)
# similarity_search_with_score returns (chunk, score) pairs
results = await vector_store.similarity_search_with_score(query, k=5)
scores = [score for _, score in results]
await retrieval_span.end(output={
    "chunk_count": len(results),
    "avg_score": sum(scores) / len(scores) if scores else 0,
    "min_score": min(scores) if scores else 0,
    "max_score": max(scores) if scores else 0,
    "low_relevance": sum(1 for s in scores if s < 0.7),  # flag weak retrievals
})
Now in your Nexus trace inspector, you can filter for queries where low_relevance > 0 — these are the queries most likely to produce hallucinations. Review them manually to understand the gap in your knowledge base.
Tracing multi-turn RAG agents
When your RAG system is part of a multi-turn agent (plan → retrieve → reason → answer), you want a single trace per session with spans for each reasoning step:
from nexus import NexusClient

nexus = NexusClient(api_key="nxs_...", agent_id="rag-agent")

async def rag_agent_loop(user_query: str):
    trace = await nexus.start_trace(name="agent-session", metadata={"query": user_query})
    history = []  # accumulated actions and retrieved context for the planner

    for turn in range(1, 11):
        plan_span = await trace.start_span(name=f"plan-turn-{turn}", input={"turn": turn})
        action = await llm_plan(user_query, history)
        await plan_span.end(output={"action": action["type"]})

        if action["type"] == "retrieve":
            ret_span = await trace.start_span(name="retrieve", input={"query": action["query"]})
            chunks = await vector_store.search(action["query"])
            await ret_span.end(output={"chunks": len(chunks)})
            history.append({"action": action, "chunks": chunks})
        elif action["type"] == "answer":
            await trace.end(status="success", output={"answer": action["text"]})
            return action["text"]

    await trace.end(status="error", output={"reason": "max_turns_exceeded"})
The trace waterfall in Nexus will show the plan-retrieve loop as repeated spans, making it immediately visible when an agent is spinning (retrieving without making progress) versus converging toward an answer.
What a healthy RAG trace looks like
rag-query [0ms → 1.2s, success]
  vector-retrieval [0ms → 87ms]   chunk_count=5, avg_score=0.83
  llm-generation   [87ms → 1.2s]  context_tokens=1840, answer_length=312
A degraded trace looks like:
rag-query [0ms → 2.1s, success]  ← "success" but the answer is wrong
  vector-retrieval [0ms → 340ms]  chunk_count=5, avg_score=0.51, low_relevance=4
  llm-generation   [340ms → 2.1s] context_tokens=3200, answer_length=89
Low avg_score, high low_relevance, high context_tokens, short answer_length. The LLM received bad context and gave a hedging non-answer. You can find every trace matching this pattern before users file bug reports.
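This degraded pattern can be detected automatically from the span outputs. A sketch, assuming the output keys used in the traces above; the thresholds are illustrative:

```python
# Sketch: classify a trace as degraded from its span outputs.
# Keys match the retrieval/generation span outputs shown above;
# thresholds are illustrative and should be tuned per workload.
def is_degraded_trace(retrieval: dict, generation: dict) -> bool:
    return (
        retrieval.get("avg_score", 1.0) < 0.7
        or retrieval.get("low_relevance", 0) >= 3
        or generation.get("answer_length", 0) < 100
    )

healthy = is_degraded_trace(
    {"avg_score": 0.83, "low_relevance": 0},
    {"answer_length": 312},
)
degraded = is_degraded_trace(
    {"avg_score": 0.51, "low_relevance": 4},
    {"answer_length": 89},
)
# healthy is False, degraded is True
```

Run this over exported traces to build the review queue before users file bug reports.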
Integration guides
If you're using a RAG framework, see the specific integration guide for your stack:
- LlamaIndex integration guide — callback-based auto-tracing for query engines and agents
- LangChain integration guide — trace RAG chains and retrieval QA
- DSPy integration guide — trace DSPy RAGModule and optimizers
- Full SDK docs — manual instrumentation for any framework
Related
- How to Monitor AI Agents in Production — agent-level failure modes