The Hallucination Problem in Production
Hallucinations are AI's most embarrassing failure mode — and the hardest to catch at scale. A model generates a confident, fluent, completely wrong answer. Your users get misinformation. Their trust in your product erodes. And unless you're logging every LLM response and checking it, you won't know until a user complains.
Most teams discover hallucinations through user feedback. By that point, the damage is done. The key insight is that hallucinations leave a trace — if you're capturing the right data at inference time, you can detect most of them before they reach users, or at least understand when and why they happen.
This post covers the three main types of hallucinations and practical trace-based detection strategies for each.
The 3 Types of AI Hallucinations
1. Factual Hallucinations
The model asserts a fact that is false or unverifiable. "The Eiffel Tower was built in 1850" (it was 1889). "Company X raised a $50M Series B" (the number is fabricated). These are the hallucinations people fear most because they're indistinguishable from correct answers at a glance.
Detection approach: output-vs-source verification spans. If your agent retrieves documents before generating, you can compare the generated claims against the source material programmatically.
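As a cheap pre-filter before any LLM-based check, you can compute a lexical overlap score between the answer and the retrieved sources. This is a sketch of a crude heuristic — the function name and exact word matching are illustrative, not a Nexus API — but low scores suggest the answer contains material not found in the sources:

```python
def lexical_grounding_score(answer: str, sources: list[str]) -> float:
    """Crude heuristic: fraction of answer words that appear in the sources.

    Exact word matching misses paraphrases, so treat this as a pre-filter
    to decide which answers are worth a full LLM verification call.
    """
    source_words = set(" ".join(sources).lower().split())
    answer_words = answer.lower().split()
    if not answer_words:
        return 0.0
    hits = sum(1 for w in answer_words if w in source_words)
    return hits / len(answer_words)
```

A fully grounded answer scores near 1.0; a fabricated date or figure drags the score down because those tokens never appear in the source text.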
2. Logical Hallucinations
The model makes internally inconsistent statements or draws conclusions that don't follow from the premises. "Since X is true, Y must also be true" — where the inference is invalid. These are common in multi-step reasoning tasks and agent pipelines where earlier errors cascade.
Detection approach: chain-of-thought tracing. Log each reasoning step as a separate span. Logical breaks become visible as spans where the input assumptions don't match the previous span's output.
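Once reasoning steps are logged as spans, you can scan an exported chain for breaks offline. A minimal sketch, assuming each step span is exported as a dict with input and output attributes, and that a step's input should exactly match the previous step's output — real pipelines would need fuzzier matching:

```python
def find_logical_breaks(steps: list[dict]) -> list[int]:
    """Return indices of reasoning steps whose input doesn't match the
    previous step's output — candidate points where the chain broke.

    Assumes each step is a dict with "input" and "output" keys, as they
    might be exported from per-step spans.
    """
    breaks = []
    for i in range(1, len(steps)):
        if steps[i]["input"] != steps[i - 1]["output"]:
            breaks.append(i)
    return breaks
```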
3. Context Hallucinations
The model generates an answer that contradicts the context it was given. You explicitly said "the user's name is Alice" and the model later refers to "Bob." Or a RAG agent generates an answer that contradicts the retrieved documents it was supposed to use. These are the most tractable to detect because you have the source of truth in your trace.
Detection approach: retrieval-comparison tracing, covered in detail below.
Pattern 1: Output Verification Spans
The most direct detection method: after generating an answer, run a separate verification step that checks whether the answer is grounded in the source documents. Log both the answer and the verdict as span attributes. This adds one extra LLM call per query but gives you a hallucination signal you can query across all your traces.
import os
from nexus_client import NexusClient

nexus = NexusClient(api_key=os.environ["NEXUS_API_KEY"], agent_id="rag-agent")

def answer_question(question: str, retrieved_docs: list[str]) -> str:
    with nexus.trace(name="rag-query") as trace:
        # Log what context we gave to the LLM
        with trace.span("retrieval") as span:
            span.set_attribute("retrieved_doc_count", len(retrieved_docs))
            span.set_attribute("retrieved_context", "\n---\n".join(retrieved_docs[:3]))
            span.set_attribute("query", question)
        # Generate the answer
        with trace.span("generation") as span:
            answer = llm.complete(question, context=retrieved_docs)
            span.set_attribute("prompt", question)
            span.set_attribute("response", answer)
            span.set_attribute("model", "gpt-4o")
        return answer
The code above captures both the retrieval context and the generated answer in the same trace. Next, add the verification step that logs its verdict:
def verify_answer(answer: str, retrieved_docs: list[str]) -> dict:
    """Check if the answer is grounded in the retrieved context."""
    with nexus.trace(name="hallucination-check") as trace:
        with trace.span("output-verification") as span:
            context_text = " ".join(retrieved_docs)
            # Ask the LLM to verify its own output against source
            verification_prompt = (
                f"Answer: {answer}\n\n"
                f"Source documents: {context_text}\n\n"
                "Is this answer fully supported by the source documents? "
                "Reply with: SUPPORTED, UNSUPPORTED, or PARTIALLY_SUPPORTED"
            )
            verdict = llm.complete(verification_prompt).strip()
            # Compare exactly: "UNSUPPORTED" and "PARTIALLY_SUPPORTED" both
            # contain "SUPPORTED" as a substring, so a substring check would
            # misclassify them as grounded
            grounded = verdict == "SUPPORTED"
            span.set_attribute("answer", answer)
            span.set_attribute("verdict", verdict)
            span.set_attribute("grounded", grounded)
            if verdict == "UNSUPPORTED":
                span.set_status("warning", "Potential hallucination detected")
                trace.set_attribute("hallucination_detected", True)
            return {"verdict": verdict, "grounded": grounded}
In Nexus, you can filter traces where hallucination_detected = true and review them as a group. Over time, you'll find patterns: specific question types, context lengths, or topics where your model hallucinates most.
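Once traces are exported, the grouping can be done with a few lines of analysis code. A sketch, assuming traces come out as dicts carrying a question_type attribute alongside the hallucination_detected flag — both key names are illustrative, not a fixed Nexus export schema:

```python
from collections import defaultdict

def hallucination_rate_by_category(traces: list[dict]) -> dict[str, float]:
    """Group exported traces by question type and compute the fraction
    flagged with hallucination_detected in each group."""
    totals: dict[str, int] = defaultdict(int)
    flagged: dict[str, int] = defaultdict(int)
    for t in traces:
        cat = t.get("question_type", "unknown")
        totals[cat] += 1
        if t.get("hallucination_detected"):
            flagged[cat] += 1
    return {cat: flagged[cat] / totals[cat] for cat in totals}
```

Sorting the result by rate surfaces the question categories where your model hallucinates most.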
Pattern 2: Confidence Score Logging
Most LLM APIs don't expose calibrated uncertainty estimates, but you can ask the model to self-rate. While self-reported confidence isn't perfectly calibrated, it's a useful proxy — consistently low-confidence answers correlate with higher hallucination rates in practice.
def score_confidence(answer: str, question: str) -> float:
    """Log confidence scores to detect low-certainty responses."""
    with nexus.trace(name="confidence-scoring") as trace:
        with trace.span("confidence-check") as span:
            # Ask model to rate its own certainty (1-10)
            confidence_prompt = (
                f"Question: {question}\n"
                f"Answer: {answer}\n\n"
                "Rate your confidence in this answer from 1-10. "
                "Reply with just the number."
            )
            raw_score = llm.complete(confidence_prompt)
            try:
                score = float(raw_score.strip()) / 10.0
            except ValueError:
                # Model didn't reply with a bare number — treat as low confidence
                score = 0.0
            span.set_attribute("confidence_score", score)
            span.set_attribute("answer", answer)
            # Flag low-confidence answers for review
            if score < 0.6:
                span.set_status("warning", f"Low confidence: {score:.2f}")
                trace.set_attribute("low_confidence", True)
            return score
With confidence scores logged as span attributes, you can build dashboards showing your p10/p50/p90 confidence distribution. Sudden drops in average confidence often precede spikes in user-reported errors, making this a useful leading indicator before your support queue fills up.
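Computing the p10/p50/p90 distribution from logged scores is straightforward with the standard library. A minimal sketch, assuming you've already pulled the confidence_score values out of your traces:

```python
import statistics

def confidence_percentiles(scores: list[float]) -> dict[str, float]:
    """Compute the p10/p50/p90 of logged confidence scores.

    quantiles(n=10) returns the nine cut points between deciles;
    indices 0, 4, and 8 are p10, p50 (median), and p90.
    """
    qs = statistics.quantiles(scores, n=10, method="inclusive")
    return {"p10": qs[0], "p50": qs[4], "p90": qs[8]}
```

Track these per day or per release; a falling p10 with a stable p50 means your worst answers are getting worse even if the typical answer looks fine.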
See 5 Metrics Every AI Agent Team Should Track for more on building metric dashboards from span attributes.
Pattern 3: Retrieval-vs-Generation Comparison Tracing
For RAG pipelines, the most predictive hallucination signal is retrieval quality. When your vector search returns low-relevance chunks, the model has to fill gaps — and fills them with hallucinations. By capturing retrieval relevance scores in the same trace as the generation, you can correlate the two.
def retrieval_comparison_trace(question: str) -> str:
    """Compare retrieval quality vs generation quality in one trace."""
    with nexus.trace(name="retrieval-comparison") as trace:
        # Step 1: Retrieve
        with trace.span("vector-search") as span:
            docs = vector_db.search(question, top_k=5)
            scores = [d.relevance_score for d in docs]
            span.set_attribute("top_k", 5)
            span.set_attribute("avg_relevance_score", sum(scores) / len(scores))
            span.set_attribute("min_relevance_score", min(scores))
            span.set_attribute("retrieved_chunks", len(docs))
        # Step 2: Generate — with retrieval quality in scope
        with trace.span("llm-generation") as span:
            answer = llm.complete(question, context=[d.text for d in docs])
            # Word counts as a rough proxy for token counts
            span.set_attribute("input_token_count", len(question.split()))
            span.set_attribute("output_token_count", len(answer.split()))
            # Flag: poor retrieval → high hallucination risk
            if min(scores) < 0.3:
                span.set_attribute("hallucination_risk", "high")
                span.set_status("warning", "Low retrieval quality — high hallucination risk")
        return answer
The key insight: min_relevance_score is a better predictor of hallucination than avg_relevance_score. If even one retrieved chunk is irrelevant, the model may anchor on it and generate plausible-sounding nonsense. Flag traces where min_relevance_score < 0.3 for manual review.
For a deeper dive on RAG monitoring, see Monitoring RAG Pipelines in Production.
Practical Monitoring Strategy
Start with the lowest-overhead approach and add layers as your volume grows:
- Log everything first. Capture prompts, responses, and retrieved context as span attributes. You can't detect patterns you haven't logged. The Nexus SDK stores these in your D1 database — cheap and queryable.
- Add retrieval quality scores. If you're running RAG, your vector DB already computes relevance scores. Log them. This costs nothing extra and gives you the most predictive hallucination signal.
- Sample verification calls. Run output-vs-source verification on 10-20% of queries. Full coverage is expensive; sampled coverage still gives you a statistically reliable hallucination rate over time.
- Set alerts on hallucination_detected = true. Route flagged traces to a Slack channel or email alert for human review. This is your QA loop — the ground truth that lets you improve your prompts and retrieval.
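The sampling gate and the alert payload from the last two steps are each a few lines. A sketch, assuming a Slack incoming-webhook integration — the payload matches Slack's simple text webhook format, and actually POSTing it over HTTP is left out; the sample rate and helper names are illustrative:

```python
import random

SAMPLE_RATE = 0.15  # verify roughly 15% of queries

def should_verify() -> bool:
    """Sampling gate: run the verification call on a random subset of
    queries instead of paying for full coverage."""
    return random.random() < SAMPLE_RATE

def slack_alert_payload(trace_id: str, verdict: str, answer: str) -> dict:
    """Build a Slack incoming-webhook payload for a flagged trace.

    Assumes a webhook URL is configured separately; the answer is
    truncated so long responses don't flood the channel.
    """
    return {
        "text": (
            f":warning: Potential hallucination in trace {trace_id}\n"
            f"Verdict: {verdict}\n"
            f"Answer: {answer[:200]}"
        )
    }
```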
What to Do When You Find Hallucinations
Detection is only valuable if it drives improvement. When you identify hallucination-prone traces, look for:
- Question type patterns — do hallucinations cluster around specific query categories (dates, numbers, company facts)?
- Context length thresholds — does hallucination rate spike above certain input token counts? This signals context overflow or attention fragmentation.
- Retrieval gaps — are there topics your vector DB consistently fails to retrieve relevant context for? Expand your knowledge base there first.
- Model-specific patterns — if you're A/B testing models, trace which model hallucinates more on which query types. The answer is often surprising.
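The context-length analysis above is easy to script once token counts are logged. A sketch, assuming traces export input_token_count and hallucination_detected attributes as dict keys (illustrative names, same assumed export shape as earlier):

```python
def rate_by_token_bucket(traces: list[dict], bucket_size: int = 1000) -> dict[int, float]:
    """Bucket traces by input token count and compute the hallucination
    rate per bucket, to spot thresholds where the rate spikes."""
    totals: dict[int, int] = {}
    flagged: dict[int, int] = {}
    for t in traces:
        bucket = (t.get("input_token_count", 0) // bucket_size) * bucket_size
        totals[bucket] = totals.get(bucket, 0) + 1
        if t.get("hallucination_detected"):
            flagged[bucket] = flagged.get(bucket, 0) + 1
    return {b: flagged.get(b, 0) / totals[b] for b in sorted(totals)}
```

A sharp jump between adjacent buckets points at the context length where overflow or attention fragmentation starts to bite.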
Getting Started
Add hallucination monitoring to your agent in three steps:
- Install the Nexus SDK: pip install nexus-client
- Wrap your RAG pipeline with retrieval and generation spans (examples above)
- Add hallucination_detected boolean attributes to verification spans
See the integration guides for LangChain, LlamaIndex, and CrewAI — all support the trace/span pattern shown above. Or try the interactive demo to see what hallucination traces look like in the dashboard.
Catch hallucinations before your users do
Trace-level hallucination monitoring for production AI agents. Free tier, no credit card required.
Get started free →