Monitoring RAG Pipelines in Production: A Practical Guide
RAG (retrieval-augmented generation) pipelines are deceptively easy to get working in development and deceptively hard to keep working in production. The retrieval looks fine. The generation looks fine. But users ask questions that fall outside the indexed corpus, the vector similarity scores are borderline, or the LLM confidently answers from irrelevant context — and you have no visibility into which step failed.
This guide covers the RAG failure modes you'll encounter in production, the metrics worth tracking, and how to instrument your pipeline with trace-level observability using the Nexus SDK.
What can go wrong in RAG
RAG failures cluster into three categories:
1. Retrieval failures
The vector store returns chunks that are syntactically similar but semantically irrelevant. The LLM receives wrong context and either hallucinates or says "I don't know." These are the hardest failures to diagnose because the system doesn't error — it just answers incorrectly.
- Low similarity scores — retrieved chunks have cosine similarity below your threshold but still get passed to the LLM
- Embedding model mismatch — your query embeddings and document embeddings were generated by different model versions
- Chunking artifacts — a fact is split across two chunks; neither chunk alone answers the question
- Stale index — your knowledge base was updated but the vector index wasn't re-embedded
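Several of these failure modes share an early warning sign: weak similarity scores. A simple guard is to drop low-scoring chunks before they ever reach the LLM. Here is a minimal sketch, assuming your vector store returns chunks with a `score` attribute; the `Chunk` class and the 0.7 threshold are illustrative, not part of any specific library:

```python
# Sketch: gate retrieved chunks on a minimum similarity score before
# they reach the LLM. The Chunk shape and threshold are assumptions.
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    score: float  # cosine similarity reported by the vector store

def filter_relevant(chunks: list[Chunk], min_score: float = 0.7) -> list[Chunk]:
    """Drop chunks below the similarity threshold. An empty result
    means the query has no good match in the indexed corpus."""
    return [c for c in chunks if c.score >= min_score]

chunks = [Chunk("pricing page", 0.91), Chunk("unrelated blog post", 0.42)]
kept = filter_relevant(chunks)
# Only the 0.91 chunk survives; if kept is empty, prefer answering
# "I don't know" over passing noise to the LLM.
```

If `filter_relevant` returns an empty list, that query is a candidate for logging as a knowledge-base gap rather than an LLM call.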
2. Context window issues
You retrieve 10 chunks because you want high recall. Each chunk is 500 tokens. Add the system prompt, the conversation history, and the query — you're at 7,000 tokens before the LLM generates a single word. With GPT-4o's 128k context, this seems fine. But:
- Lost-in-the-middle problem — LLMs pay more attention to content at the start and end of the context window. Information buried in the middle of a long context gets less attention.
- Token cost — 7,000 input tokens per query adds up fast at scale. Without tracking context length per request, you won't catch runaway costs until the bill arrives.
- Truncation — if you're using a model with a smaller context window, long contexts get truncated silently.
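A cheap defense against all three is to enforce a token budget before the LLM call, keeping chunks in retrieval order so the highest-ranked context always survives trimming. A minimal sketch — the 4-characters-per-token ratio is a rough heuristic; use your model's actual tokenizer (e.g. tiktoken) for exact counts in production:

```python
# Sketch: enforce a context-token budget before calling the LLM.
# estimate_tokens uses a ~4 chars/token heuristic, not a real tokenizer.
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def fit_to_budget(chunks: list[str], budget_tokens: int) -> list[str]:
    """Keep chunks in retrieval order until the budget is exhausted,
    so the best-ranked context is never the part that gets dropped."""
    kept, used = [], 0
    for chunk in chunks:
        cost = estimate_tokens(chunk)
        if used + cost > budget_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept
```

Logging both the pre-trim and post-trim token counts also gives you the data to spot inefficient chunking.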
3. Hallucinations from bad context
When retrieved chunks don't contain the answer, LLMs often extrapolate rather than admitting they don't know. The answer sounds confident and coherent but is fabricated. This is the worst failure mode — users trust wrong answers more than obvious errors.
You can't catch hallucinations without either human review or automated answer evaluation. But you can detect the preconditions for hallucination: low retrieval scores, short context, queries with no good chunk matches.
What to monitor
For each RAG query, you want to capture:
| Metric | Why it matters | Red flag |
|---|---|---|
| retrieval_latency_ms | Vector search cost per query | > 500ms at p95 |
| chunk_count | How many chunks were retrieved | 0 chunks = no context |
| avg_relevance_score | Average cosine similarity of retrieved chunks | < 0.7 signals poor retrieval |
| context_tokens | Total tokens sent to the LLM | Spikes = inefficient chunking |
| generation_latency_ms | LLM call duration | Correlated with context_tokens |
| answer_length | Length of LLM response | Very short = LLM giving up |
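The table above translates directly into a per-query metrics record with the red-flag thresholds encoded as code. A minimal sketch — the class and field names are assumptions for illustration, not part of the Nexus SDK:

```python
# Sketch: one metrics record per RAG query, mirroring the table above.
# Thresholds match the "Red flag" column; tune them to your workload.
from dataclasses import dataclass

@dataclass
class RagQueryMetrics:
    retrieval_latency_ms: float
    chunk_count: int
    avg_relevance_score: float
    context_tokens: int
    generation_latency_ms: float
    answer_length: int

    def red_flags(self) -> list[str]:
        flags = []
        if self.chunk_count == 0:
            flags.append("no_context")
        if self.avg_relevance_score < 0.7:
            flags.append("poor_retrieval")
        if self.retrieval_latency_ms > 500:
            flags.append("slow_retrieval")
        return flags
```

Attaching `red_flags()` output to each trace makes degraded queries filterable instead of anecdotal.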
Instrumenting a RAG pipeline with Nexus
Install the Nexus Python SDK:
pip install keylightdigital-nexus
from nexus import NexusClient
nexus = NexusClient(api_key="nxs_...", agent_id="rag-pipeline")
Tracing retrieval and generation separately
The key pattern is creating one trace per user query, with separate child spans for the retrieval step and the generation step. This lets you see exactly where latency comes from and what data each step received:
async def answer_question(query: str) -> str:
    trace = await nexus.start_trace(name="rag-query", metadata={"query": query})

    # Trace the retrieval step
    retrieval_span = await trace.start_span(
        name="vector-retrieval",
        input={"query": query, "top_k": 5},
    )
    chunks = await vector_store.similarity_search(query, k=5)
    await retrieval_span.end(output={
        "chunk_count": len(chunks),
        "top_score": chunks[0].score if chunks else 0,
        "sources": [c.metadata["source"] for c in chunks],
    })

    # Trace the generation step
    context = "\n".join(c.page_content for c in chunks)
    generation_span = await trace.start_span(
        name="llm-generation",
        input={"context_length": len(context), "query": query},
    )
    response = await llm.apredict(
        "Answer based on context:\n" + context + "\n\nQuestion: " + query
    )
    await generation_span.end(output={
        "answer": response,
        "answer_length": len(response),
    })

    await trace.end(status="success", output={"answer": response})
    return response
In your Nexus dashboard, each query appears as a trace with two child spans: vector-retrieval and llm-generation. You can instantly see whether latency is dominated by vector search or LLM generation — and inspect the inputs and outputs of each.
Logging relevance scores
Most vector stores return similarity scores alongside chunks. Log them. They're your earliest warning signal for retrieval quality degradation:
# Log relevance scores alongside chunk count
retrieval_span = await trace.start_span(
    name="vector-retrieval",
    input={"query": query, "top_k": 5},
)
# similarity_search_with_score returns (chunk, score) pairs
results = await vector_store.similarity_search_with_score(query, k=5)
scores = [score for _, score in results]
await retrieval_span.end(output={
    "chunk_count": len(results),
    "avg_score": sum(scores) / len(scores) if scores else 0,
    "min_score": min(scores) if scores else 0,
    "max_score": max(scores) if scores else 0,
    "low_relevance": sum(1 for s in scores if s < 0.7),  # flag weak retrievals
})
Now in your Nexus trace inspector, you can filter for queries where low_relevance > 0 — these are the queries most likely to produce hallucinations. Review them manually to understand the gap in your knowledge base.
Tracing multi-turn RAG agents
When your RAG system is part of a multi-turn agent (plan → retrieve → reason → answer), you want a single trace per session with spans for each reasoning step:
from nexus import NexusClient

nexus = NexusClient(api_key="nxs_...", agent_id="rag-agent")

async def rag_agent_loop(user_query: str):
    trace = await nexus.start_trace(name="agent-session", metadata={"query": user_query})
    history = []  # accumulated actions and retrieved context for the planner

    for turn in range(1, 11):
        plan_span = await trace.start_span(name=f"plan-turn-{turn}", input={"turn": turn})
        action = await llm_plan(user_query, history)
        await plan_span.end(output={"action": action["type"]})

        if action["type"] == "retrieve":
            ret_span = await trace.start_span(name="retrieve", input={"query": action["query"]})
            chunks = await vector_store.search(action["query"])
            await ret_span.end(output={"chunks": len(chunks)})
            history.append({"action": action, "chunks": chunks})
        elif action["type"] == "answer":
            await trace.end(status="success", output={"answer": action["text"]})
            return action["text"]

    await trace.end(status="error", output={"reason": "max_turns_exceeded"})
The trace waterfall in Nexus will show the plan-retrieve loop as repeated spans, making it immediately visible when an agent is spinning (retrieving without making progress) versus converging toward an answer.
What a healthy RAG trace looks like
rag-query [0ms → 1.2s, success]
  vector-retrieval [0ms → 87ms]   chunk_count=5, avg_score=0.83
  llm-generation   [87ms → 1.2s]  context_tokens=1840, answer_length=312
A degraded trace looks like:
rag-query [0ms → 2.1s, success]  ← "success" but the answer is wrong
  vector-retrieval [0ms → 340ms]  chunk_count=5, avg_score=0.51, low_relevance=4
  llm-generation   [340ms → 2.1s] context_tokens=3200, answer_length=89
Low avg_score, high low_relevance, high context_tokens, short answer_length. The LLM received bad context and gave a hedging non-answer. You can find every trace matching this pattern before users file bug reports.
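This degraded pattern can be detected automatically from the span outputs. A sketch, assuming the output keys used in the traces above; the thresholds are illustrative:

```python
# Sketch: classify a trace as degraded from its span outputs.
# Keys match the retrieval/generation span outputs shown above;
# thresholds are illustrative and should be tuned per workload.
def is_degraded_trace(retrieval: dict, generation: dict) -> bool:
    return (
        retrieval.get("avg_score", 1.0) < 0.7
        or retrieval.get("low_relevance", 0) >= 3
        or generation.get("answer_length", 0) < 100
    )

healthy = is_degraded_trace(
    {"avg_score": 0.83, "low_relevance": 0},
    {"answer_length": 312},
)
degraded = is_degraded_trace(
    {"avg_score": 0.51, "low_relevance": 4},
    {"answer_length": 89},
)
# healthy is False, degraded is True
```

Run this over exported traces to build the review queue before users file bug reports.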
Integration guides
If you're using a RAG framework, see the specific integration guide for your stack:
- LlamaIndex integration guide — callback-based auto-tracing for query engines and agents
- LangChain integration guide — trace RAG chains and retrieval QA
- DSPy integration guide — trace DSPy RAGModule and optimizers
- Full SDK docs — manual instrumentation for any framework
Related
- How to Monitor AI Agents in Production — agent-level failure modes