A/B Testing AI Models with Nexus Observability
Switching from GPT-4o-mini to Claude Haiku or Llama 3 8B can cut token costs by 60% — but how do you know whether quality degrades? Here's how to use Nexus span metadata to run controlled A/B tests between models, compare token costs and latency per variant, and make data-driven model decisions without guessing.
Why Model Switching Is Risky Without Data
The business case for switching models is clear: GPT-4o-mini costs $0.15/M input tokens; Claude Haiku costs $0.25/M; Llama 3 8B on Together AI costs $0.20/M. Moving your agent from GPT-4o-mini to a cheaper model could meaningfully cut costs — but only if quality holds.
The problem is that "quality" is difficult to define and measure for open-ended agent tasks. Without a systematic comparison, you're making a bet. With Nexus span metadata, you can run a controlled A/B test in production and make the decision with real data.
Step 1: Tag Traces with Variant Metadata
The core pattern: assign each request to a variant at random, then tag the trace and span with the variant name, experiment ID, and model. This makes every trace queryable by variant later:
```python
import os
import random
import time

from openai import OpenAI
from anthropic import Anthropic
from nexus_client import NexusClient

nexus = NexusClient(api_key=os.environ["NEXUS_API_KEY"], agent_id="my-agent")
openai = OpenAI()
anthropic = Anthropic()

VARIANTS = {
    "control": {"provider": "openai", "model": "gpt-4o-mini"},
    "challenger": {"provider": "anthropic", "model": "claude-haiku-4-5-20251001"},
}

def run_agent(prompt: str, experiment_id: str = "model-comparison-v1") -> str:
    # 50/50 split between variants
    variant_name = random.choice(list(VARIANTS.keys()))
    variant = VARIANTS[variant_name]

    trace = nexus.start_trace(
        name=f"agent: {prompt[:60]}",
        metadata={
            "experiment_id": experiment_id,
            "model_variant": variant_name,
            "model": variant["model"],
        },
    )
    span = trace.add_span(
        name="llm-call",
        input={"prompt": prompt, "model": variant["model"]},
    )

    start = time.time()
    if variant["provider"] == "openai":
        response = openai.chat.completions.create(
            model=variant["model"],
            messages=[{"role": "user", "content": prompt}],
        )
        content = response.choices[0].message.content or ""
        input_tokens = response.usage.prompt_tokens
        output_tokens = response.usage.completion_tokens
    else:
        response = anthropic.messages.create(
            model=variant["model"],
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        content = response.content[0].text
        input_tokens = response.usage.input_tokens
        output_tokens = response.usage.output_tokens
    latency_ms = int((time.time() - start) * 1000)

    span.end(status="ok", output={
        "model_variant": variant_name,
        "model": variant["model"],
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "output_length": len(content),  # proxy quality signal
        "latency_ms": latency_ms,
    })
    trace.end(status="success")
    return content
```
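Random assignment works, but it can give the same user different variants across requests, which makes session-level behavior noisy. A common alternative is deterministic assignment: hash the user ID together with the experiment ID so each user always lands in the same bucket. A minimal sketch (the `assign_variant` helper and its 50/50 split are illustrative, not part of the Nexus SDK):

```python
import hashlib

VARIANTS = ["control", "challenger"]

def assign_variant(user_id: str, experiment_id: str = "model-comparison-v1") -> str:
    """Deterministically map a user to a variant bucket.

    Hashing experiment_id alongside user_id means a new experiment
    reshuffles users instead of reusing the previous split.
    """
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable value in 0-99 per (experiment, user)
    return VARIANTS[0] if bucket < 50 else VARIANTS[1]
```

The same user always gets the same variant for a given experiment, so per-user metrics stay internally consistent while the population still splits roughly 50/50.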
Step 2: Build a Report from the Nexus API
After collecting enough traces (50+ per variant is a reasonable floor for comparing averages, though not a formal significance test), query the Nexus API and aggregate by variant. Compare average latency, output length (a rough quality proxy), and estimated cost per 1,000 calls:
```python
import os
from collections import defaultdict

import requests

NEXUS_API_KEY = os.environ["NEXUS_API_KEY"]
BASE_URL = "https://nexus.keylightdigital.dev/v1"
HEADERS = {"Authorization": f"Bearer {NEXUS_API_KEY}"}

# Cost per 1K tokens (update for your models)
COSTS = {
    "gpt-4o-mini": {"input": 0.00015, "output": 0.00060},
    "claude-haiku-4-5-20251001": {"input": 0.00025, "output": 0.00125},
}

def ab_test_report(experiment_id: str):
    resp = requests.get(f"{BASE_URL}/traces", headers=HEADERS, params={"limit": 1000})
    traces = resp.json().get("traces", [])

    stats = defaultdict(lambda: {
        "count": 0, "total_input_tokens": 0, "total_output_tokens": 0,
        "total_latency_ms": 0, "total_output_length": 0, "model": "unknown",
    })
    for trace in traces:
        meta = trace.get("metadata", {})
        if meta.get("experiment_id") != experiment_id:
            continue
        variant = meta.get("model_variant", "unknown")
        stats[variant]["model"] = meta.get("model", "unknown")
        for span in trace.get("spans", []):
            out = span.get("output") or {}
            stats[variant]["count"] += 1
            stats[variant]["total_input_tokens"] += out.get("input_tokens", 0)
            stats[variant]["total_output_tokens"] += out.get("output_tokens", 0)
            stats[variant]["total_latency_ms"] += out.get("latency_ms", 0)
            stats[variant]["total_output_length"] += out.get("output_length", 0)

    print(f"A/B Test Report: {experiment_id}")
    print(f"{'Variant':<15} {'N':>5} {'Avg Latency':>12} {'Avg Output Len':>15} {'Cost/1K Calls':>15}")
    print("-" * 65)
    for variant, s in stats.items():
        n = s["count"] or 1
        avg_latency = s["total_latency_ms"] / n
        avg_output_len = s["total_output_length"] / n
        # Cost per call = (avg tokens / 1000) * cost per 1K tokens; multiplying
        # by 1,000 calls cancels the division, so cost per 1K calls is simply
        # avg tokens per call times the per-1K-token rate.
        cost_per_1k = 0.0
        if s["model"] in COSTS:
            c = COSTS[s["model"]]
            avg_in = s["total_input_tokens"] / n
            avg_out = s["total_output_tokens"] / n
            cost_per_1k = avg_in * c["input"] + avg_out * c["output"]
        print(f"{variant:<15} {n:>5} {avg_latency:>11.0f}ms {avg_output_len:>14.0f} ${cost_per_1k:>13.2f}")

ab_test_report("model-comparison-v1")
```
Quality Signals to Track
Since you can't automatically evaluate output quality for most agent tasks, these proxy signals are useful:
- Output length — shorter responses may indicate the model is truncating or refusing; very long responses may indicate verbosity. Record `output_length` in span output.
- Error rate — track how often each variant results in error spans (refusals, empty responses, tool call failures).
- User-facing signals — if your agent drives a user-facing product, tag traces with downstream success metrics (`user_satisfied: true`) and compare by variant.
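The error-rate signal falls out of the same trace data the report already fetches. As a sketch, assuming traces have the same shape as the `/traces` response above (a `metadata` dict and a list of spans with a `status` field), a small helper can compute error rate per variant:

```python
from collections import defaultdict

def error_rate_by_variant(traces):
    """Fraction of spans with status "error" per model_variant.

    Expects trace dicts shaped like the /traces response used in the
    report script: {"metadata": {...}, "spans": [{"status": ...}, ...]}.
    """
    counts = defaultdict(lambda: {"total": 0, "errors": 0})
    for trace in traces:
        variant = trace.get("metadata", {}).get("model_variant", "unknown")
        for span in trace.get("spans", []):
            counts[variant]["total"] += 1
            if span.get("status") == "error":
                counts[variant]["errors"] += 1
    return {v: c["errors"] / c["total"] for v, c in counts.items() if c["total"]}
```

A challenger with a noticeably higher error rate usually disqualifies it regardless of how the cost numbers look.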
Prompt A/B Testing Works the Same Way
The same pattern applies to prompt variants — tag with prompt_variant instead of model_variant:
```python
# The same pattern works for prompt A/B testing — just tag differently
def run_with_prompt_variant(user_input: str) -> str:
    variant = random.choice(["concise_prompt", "detailed_prompt"])
    prompts = {
        "concise_prompt": f"Answer briefly: {user_input}",
        "detailed_prompt": f"Please provide a thorough and detailed answer to: {user_input}",
    }

    trace = nexus.start_trace(
        name=f"agent: {user_input[:60]}",
        metadata={
            "experiment_id": "prompt-test-v1",
            "prompt_variant": variant,
        },
    )
    # ... rest of the call unchanged: send prompts[variant] to the model,
    # end the span and trace, and return the real response instead of ""
    return ""
```
When You Have Enough Data
With 50+ traces per variant, averages become stable enough to compare. If the challenger variant shows similar output length and error rate but lower cost, the case for switching is strong. If quality signals diverge, you can make an informed cost-vs-quality tradeoff rather than guessing; and when the differences are small, run a quick significance check to see whether they are real or just noise.
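For that significance check, Welch's t statistic can be computed from the per-call values (latencies, output lengths) using only the standard library. This is a sketch, not a substitute for a proper statistical test; as a rule of thumb, |t| above roughly 2 at these sample sizes suggests the difference is unlikely to be noise:

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Welch's t statistic for two independent samples with unequal variances.

    E.g. sample_a = per-call latencies for control, sample_b = for challenger.
    """
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    std_err = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / std_err
```

Feeding it the raw `latency_ms` values collected per variant (rather than the pre-aggregated totals) gives a quick read on whether a latency gap is worth acting on.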
Get Started
Create a free Nexus account at nexus.keylightdigital.dev/pricing and add the three metadata tags (experiment_id, model_variant, model) to your existing agent traces. The free tier covers 10,000 spans/month — plenty for a production A/B test.
Ready to A/B test your AI models?
Start free — no credit card required. Up to 10,000 spans/month on the free tier.
Start monitoring for free →