Docs DSPy

Integration Guide

DSPy Observability with Nexus

Add production observability to DSPy programs. Trace every module forward pass, monitor optimizer compilations, and track evaluation runs — all in a single dashboard.


Step 1 — Install dependencies

DSPy is a Python framework. Install the Nexus Python SDK alongside DSPy:

pip install keylightdigital-nexus dspy-ai

Requires Python 3.9+ and dspy-ai ≥ 2.4.0

Step 2 — Create an API key

Go to /dashboard/keys and create a new API key. Add it to your environment:

export NEXUS_API_KEY="nxs_your_api_key_here"

Step 3 — Configure DSPy and Nexus

Configure your language model and Nexus client at startup. The Nexus client is lightweight and safe to initialize globally — all methods fail silently if the API is unreachable.

import dspy
import os
from nexus_client import NexusClient

# Configure your LM as usual
lm = dspy.LM("openai/gpt-4o-mini", api_key=os.environ["OPENAI_API_KEY"])
dspy.configure(lm=lm)

# Initialize Nexus client
nexus = NexusClient(
    api_key=os.environ["NEXUS_API_KEY"],
    agent_id="dspy-app",
)
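Because the client fails silently, a missing or misspelled key means traces are dropped with no error. If you would rather fail fast at startup, a small plain-Python check (not part of the Nexus SDK) does the job:

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail fast with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; traces would be dropped silently")
    return value

# Call once at startup, before constructing the client:
# api_key = require_env("NEXUS_API_KEY")
```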

Step 4 — Trace DSPy modules

Wrap your dspy.Module.forward() method with a Nexus trace. Add child spans for each distinct step (retrieval, prediction, tool call). This gives you a full waterfall view per module run.

import dspy
from nexus_client import NexusClient
import os

nexus = NexusClient(api_key=os.environ["NEXUS_API_KEY"], agent_id="dspy-rag")

# Define your DSPy signatures
class GenerateAnswer(dspy.Signature):
    """Answer a question given supporting context."""
    context: str = dspy.InputField(desc="Relevant passages from the knowledge base")
    question: str = dspy.InputField()
    answer: str = dspy.OutputField(desc="Concise answer based on the context")

class RAGModule(dspy.Module):
    def __init__(self):
        super().__init__()  # required for dspy.Module subclasses
        self.retrieve = dspy.Retrieve(k=3)
        self.generate = dspy.ChainOfThought(GenerateAnswer)

    def forward(self, question: str) -> dspy.Prediction:
        # Create a trace for the full module run
        trace = nexus.start_trace(
            name="rag: " + question[:60],
            metadata={"module": "RAGModule", "retrieve_k": 3},
        )
        try:
            # Span for retrieval step
            retrieve_span = nexus.add_span(
                trace_id=trace["id"],
                name="retrieve",
                input={"question": question, "k": 3},
            )
            passages = self.retrieve(question)
            nexus.end_span(
                trace_id=trace["id"],
                span_id=retrieve_span["span_id"],
                output={"num_passages": len(passages.passages)},
                status="ok",
            )

            # Span for generation step
            gen_span = nexus.add_span(
                trace_id=trace["id"],
                name="chain-of-thought",
                input={"context_length": sum(len(p) for p in passages.passages)},
            )
            prediction = self.generate(
                context=passages.passages,
                question=question,
            )
            nexus.end_span(
                trace_id=trace["id"],
                span_id=gen_span["span_id"],
                output={"answer_length": len(prediction.answer)},
                status="ok",
            )

            nexus.end_trace(trace_id=trace["id"], status="success")
            return prediction
        except Exception:
            nexus.end_trace(trace_id=trace["id"], status="error")
            raise

rag = RAGModule()
result = rag("What is DSPy and how does it differ from LangChain?")
print(result.answer)

# Nexus dashboard shows:
#   trace: rag: What is DSPy...
#   +-- retrieve (3 passages fetched)
#   +-- chain-of-thought (answer generated)
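If you trace many modules, the try/except plumbing in forward() can be factored into a small context manager. This is a sketch, not an SDK feature — it uses only the client methods shown above (start_trace, end_trace) and works with any object exposing them:

```python
from contextlib import contextmanager

@contextmanager
def traced(client, name, metadata=None):
    """Start a trace, yield it, and end it with the appropriate status."""
    trace = client.start_trace(name=name, metadata=metadata or {})
    try:
        yield trace
        client.end_trace(trace_id=trace["id"], status="success")
    except Exception:
        client.end_trace(trace_id=trace["id"], status="error")
        raise
```

forward() then shrinks to `with traced(nexus, "rag: " + question[:60]) as trace:` wrapped around the span calls.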

Step 5 — Trace optimizer runs

DSPy optimizers (teleprompters) like BootstrapFewShot and MIPRO can run for minutes or hours. Wrapping the compile call in a trace lets you compare compilation cost across configurations and track regressions.

import dspy
from dspy.teleprompt import BootstrapFewShot
from nexus_client import NexusClient
import os

nexus = NexusClient(api_key=os.environ["NEXUS_API_KEY"], agent_id="dspy-optimizer")

# Define metric
def answer_exact_match(example, pred, trace=None):
    return example.answer.lower() == pred.answer.lower()

# Training data
trainset = [
    dspy.Example(question="What is the capital of France?", answer="Paris").with_inputs("question"),
    dspy.Example(question="Who wrote Hamlet?", answer="Shakespeare").with_inputs("question"),
    # ... more examples
]

# Trace the entire optimization run
opt_trace = nexus.start_trace(
    name="bootstrap-few-shot optimization",
    metadata={"metric": "exact_match", "trainset_size": len(trainset), "max_bootstrapped_demos": 3},
)
try:
    teleprompter = BootstrapFewShot(metric=answer_exact_match, max_bootstrapped_demos=3)
    # RAGModule is the module defined in Step 4
    optimized_program = teleprompter.compile(RAGModule(), trainset=trainset)

    # Log a compilation-complete span
    nexus.add_span(
        trace_id=opt_trace["id"],
        name="compilation-complete",
        output={"status": "compiled", "demos_generated": 3},
        status="ok",
    )
    nexus.end_trace(trace_id=opt_trace["id"], status="success")
except Exception:
    nexus.end_trace(trace_id=opt_trace["id"], status="error")
    raise

# Save and reuse the optimized program
optimized_program.save("optimized_rag.json")

# Nexus dashboard shows:
#   trace: bootstrap-few-shot optimization
#   +-- compilation-complete
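To make compilation cost comparable across runs, record wall-clock duration in the span output. The timing helper below is ordinary Python, not part of the Nexus SDK:

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Inside the try block above, for example:
# optimized_program, elapsed = timed(teleprompter.compile, RAGModule(), trainset=trainset)
# ...then include {"compile_seconds": round(elapsed, 1)} in the span output.
```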

Step 6 — Monitor evaluations

Trace dspy.Evaluate calls to track metric scores over time. Compare pre-optimization vs post-optimization performance in the Nexus dashboard without digging through terminal output.

import dspy
from nexus_client import NexusClient
import os

nexus = NexusClient(api_key=os.environ["NEXUS_API_KEY"], agent_id="dspy-evaluator")

# Evaluation dataset
devset = [
    dspy.Example(question="What is RAG?", answer="Retrieval-Augmented Generation").with_inputs("question"),
    dspy.Example(question="Define gradient descent", answer="An optimization algorithm").with_inputs("question"),
    # ... more examples
]

def answer_f1(example, pred, trace=None):
    pred_tokens = set(pred.answer.lower().split())
    gold_tokens = set(example.answer.lower().split())
    if not pred_tokens or not gold_tokens:
        return 0.0
    precision = len(pred_tokens & gold_tokens) / len(pred_tokens)
    recall = len(pred_tokens & gold_tokens) / len(gold_tokens)
    return (2 * precision * recall / (precision + recall)) if (precision + recall) > 0 else 0.0

# Trace the evaluation run
eval_trace = nexus.start_trace(
    name="evaluation run",
    metadata={"metric": "f1", "devset_size": len(devset), "program": "RAGModule"},
)

evaluate = dspy.Evaluate(devset=devset, metric=answer_f1, num_threads=4)
# optimized_program is the compiled program from Step 5
# (reload it with: program = RAGModule(); program.load("optimized_rag.json"))
score = evaluate(optimized_program)

nexus.add_span(
    trace_id=eval_trace["id"],
    name="evaluation-result",
    output={"f1_score": score, "devset_size": len(devset)},
    status="ok",
)
nexus.end_trace(trace_id=eval_trace["id"], status="success")
print(f"Evaluation score: {score:.2f}")

# Nexus dashboard shows:
#   trace: evaluation run
#   +-- evaluation-result (f1_score: 0.78)
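As a sanity check, here is the token-level F1 metric above applied to one concrete pair. This is a self-contained copy of the same logic, taking plain strings instead of dspy examples:

```python
def token_f1(pred_answer: str, gold_answer: str) -> float:
    """Token-level F1 between a predicted answer and a gold answer."""
    pred_tokens = set(pred_answer.lower().split())
    gold_tokens = set(gold_answer.lower().split())
    if not pred_tokens or not gold_tokens:
        return 0.0
    overlap = len(pred_tokens & gold_tokens)
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return (2 * precision * recall / (precision + recall)) if (precision + recall) > 0 else 0.0

# 3 of 6 predicted tokens match, and all 3 gold tokens are covered:
# precision = 0.5, recall = 1.0, F1 = 2/3
score = token_f1("an optimization algorithm for minimizing loss", "An optimization algorithm")
```

A verbose but partially correct answer scores well on recall and is penalized on precision, which is why F1 is a more forgiving metric than the exact match used in Step 5.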

Step 7 — View traces in Nexus

Navigate to /dashboard/traces to see every module run, optimizer compilation, and evaluation as a trace with span waterfall, latency breakdown, and status indicators.

View demo with sample traces →

More resources

Start monitoring your DSPy programs

Free plan: 1,000 traces/month. No credit card needed. Add tracing in under 10 minutes.