A/B Testing AI Models with Nexus Observability
Switching from GPT-4o-mini to Claude Haiku or Llama 3 8B can cut token costs by 60% — but how do you know whether quality degrades? Here's how to use Nexus span metadata to run controlled A/B tests between models, compare token costs and latency per variant, and make data-driven model decisions without guessing.
Why Model Switching Is Risky Without Data
The business case for switching models is clear: GPT-4o-mini costs $0.15/M input tokens; Claude Haiku costs $0.25/M; Llama 3 8B on Together AI costs $0.20/M. Moving your agent from GPT-4o-mini to a cheaper model could meaningfully cut costs — but only if quality holds.
The problem is that "quality" is difficult to define and measure for open-ended agent tasks. Without a systematic comparison, you're making a bet. With Nexus span metadata, you can run a controlled A/B test in production and make the decision with real data.
Step 1: Tag Traces with Variant Metadata
The core pattern: assign each request to a variant at random, then tag the trace and span with the variant name, experiment ID, and model. This makes every trace queryable by variant later:
```python
import os
import random
import time

from openai import OpenAI
from anthropic import Anthropic
from nexus_client import NexusClient

nexus = NexusClient(api_key=os.environ["NEXUS_API_KEY"], agent_id="my-agent")
openai = OpenAI()
anthropic = Anthropic()

VARIANTS = {
    "control": {"provider": "openai", "model": "gpt-4o-mini"},
    "challenger": {"provider": "anthropic", "model": "claude-haiku-4-5-20251001"},
}

def run_agent(prompt: str, experiment_id: str = "model-comparison-v1") -> str:
    # 50/50 split between variants
    variant_name = random.choice(list(VARIANTS.keys()))
    variant = VARIANTS[variant_name]

    trace = nexus.start_trace(
        name=f"agent: {prompt[:60]}",
        metadata={
            "experiment_id": experiment_id,
            "model_variant": variant_name,
            "model": variant["model"],
        },
    )
    span = trace.add_span(
        name="llm-call",
        input={"prompt": prompt, "model": variant["model"]},
    )

    start = time.time()
    if variant["provider"] == "openai":
        response = openai.chat.completions.create(
            model=variant["model"],
            messages=[{"role": "user", "content": prompt}],
        )
        content = response.choices[0].message.content or ""
        input_tokens = response.usage.prompt_tokens
        output_tokens = response.usage.completion_tokens
    else:
        response = anthropic.messages.create(
            model=variant["model"],
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        content = response.content[0].text
        input_tokens = response.usage.input_tokens
        output_tokens = response.usage.output_tokens
    latency_ms = int((time.time() - start) * 1000)

    span.end(status="ok", output={
        "model_variant": variant_name,
        "model": variant["model"],
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "output_length": len(content),  # proxy quality signal
        "latency_ms": latency_ms,
    })
    trace.end(status="success")
    return content
```
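Random assignment works, but it can give the same user different variants across requests, which makes session-level behavior noisy. A common alternative is deterministic assignment: hash the user ID together with the experiment ID so each user always lands in the same bucket. A minimal sketch (the `assign_variant` helper and its 50/50 split are illustrative, not part of the Nexus SDK):

```python
import hashlib

VARIANTS = ["control", "challenger"]

def assign_variant(user_id: str, experiment_id: str = "model-comparison-v1") -> str:
    """Deterministically map a user to a variant bucket.

    Hashing experiment_id alongside user_id means a new experiment
    reshuffles users instead of reusing the previous split.
    """
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable value in 0-99 per (experiment, user)
    return VARIANTS[0] if bucket < 50 else VARIANTS[1]
```

The same user always gets the same variant for a given experiment, so per-user metrics stay internally consistent while the population still splits roughly 50/50.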
Step 2: Build a Report from the Nexus API
After collecting enough traces (50+ per variant is a reasonable floor for comparing averages, though not a formal significance test), query the Nexus API and aggregate by variant. Compare average latency, output length (a rough quality proxy), and estimated cost per 1,000 calls:
```python
import os
from collections import defaultdict

import requests

NEXUS_API_KEY = os.environ["NEXUS_API_KEY"]
BASE_URL = "https://nexus.keylightdigital.dev/v1"
HEADERS = {"Authorization": f"Bearer {NEXUS_API_KEY}"}

# Cost per 1K tokens (update for your models)
COSTS = {
    "gpt-4o-mini": {"input": 0.00015, "output": 0.00060},
    "claude-haiku-4-5-20251001": {"input": 0.00025, "output": 0.00125},
}

def ab_test_report(experiment_id: str):
    resp = requests.get(f"{BASE_URL}/traces", headers=HEADERS, params={"limit": 1000})
    traces = resp.json().get("traces", [])

    stats = defaultdict(lambda: {
        "count": 0, "total_input_tokens": 0, "total_output_tokens": 0,
        "total_latency_ms": 0, "total_output_length": 0, "model": "unknown",
    })
    for trace in traces:
        meta = trace.get("metadata", {})
        if meta.get("experiment_id") != experiment_id:
            continue
        variant = meta.get("model_variant", "unknown")
        stats[variant]["model"] = meta.get("model", "unknown")
        for span in trace.get("spans", []):
            out = span.get("output") or {}
            stats[variant]["count"] += 1
            stats[variant]["total_input_tokens"] += out.get("input_tokens", 0)
            stats[variant]["total_output_tokens"] += out.get("output_tokens", 0)
            stats[variant]["total_latency_ms"] += out.get("latency_ms", 0)
            stats[variant]["total_output_length"] += out.get("output_length", 0)

    print(f"A/B Test Report: {experiment_id}")
    print(f"{'Variant':<15} {'N':>5} {'Avg Latency':>12} {'Avg Output Len':>15} {'Cost/1K Calls':>15}")
    print("-" * 65)
    for variant, s in stats.items():
        n = s["count"] or 1
        avg_latency = s["total_latency_ms"] / n
        avg_output_len = s["total_output_length"] / n
        # Cost per call = (avg tokens / 1000) * cost per 1K tokens; multiplying
        # by 1,000 calls cancels the division, so cost per 1K calls is simply
        # avg tokens per call times the per-1K-token rate.
        cost_per_1k = 0.0
        if s["model"] in COSTS:
            c = COSTS[s["model"]]
            avg_in = s["total_input_tokens"] / n
            avg_out = s["total_output_tokens"] / n
            cost_per_1k = avg_in * c["input"] + avg_out * c["output"]
        print(f"{variant:<15} {n:>5} {avg_latency:>11.0f}ms {avg_output_len:>14.0f} ${cost_per_1k:>13.2f}")

ab_test_report("model-comparison-v1")
```
Quality Signals to Track
Since you can't automatically evaluate output quality for most agent tasks, these proxy signals are useful:
- Output length — shorter responses may indicate the model is truncating or refusing; very long responses may indicate verbosity. Record `output_length` in span output.
- Error rate — track how often each variant results in error spans (refusals, empty responses, tool call failures).
- User-facing signals — if your agent drives a user-facing product, tag traces with downstream success metrics (`user_satisfied: true`) and compare by variant.
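The error-rate signal falls out of the same trace data the report already fetches. As a sketch, assuming traces have the same shape as the `/traces` response above (a `metadata` dict and a list of spans with a `status` field), a small helper can compute error rate per variant:

```python
from collections import defaultdict

def error_rate_by_variant(traces):
    """Fraction of spans with status "error" per model_variant.

    Expects trace dicts shaped like the /traces response used in the
    report script: {"metadata": {...}, "spans": [{"status": ...}, ...]}.
    """
    counts = defaultdict(lambda: {"total": 0, "errors": 0})
    for trace in traces:
        variant = trace.get("metadata", {}).get("model_variant", "unknown")
        for span in trace.get("spans", []):
            counts[variant]["total"] += 1
            if span.get("status") == "error":
                counts[variant]["errors"] += 1
    return {v: c["errors"] / c["total"] for v, c in counts.items() if c["total"]}
```

A challenger with a noticeably higher error rate usually disqualifies it regardless of how the cost numbers look.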
Prompt A/B Testing Works the Same Way
The same pattern applies to prompt variants — tag with prompt_variant instead of model_variant:
```python
# The same pattern works for prompt A/B testing — just tag differently
def run_with_prompt_variant(user_input: str) -> str:
    variant = random.choice(["concise_prompt", "detailed_prompt"])
    prompts = {
        "concise_prompt": f"Answer briefly: {user_input}",
        "detailed_prompt": f"Please provide a thorough and detailed answer to: {user_input}",
    }

    trace = nexus.start_trace(
        name=f"agent: {user_input[:60]}",
        metadata={
            "experiment_id": "prompt-test-v1",
            "prompt_variant": variant,
        },
    )
    # ... rest of the call unchanged: send prompts[variant] to the model,
    # end the span and trace, and return the real response instead of ""
    return ""
```
When You Have Enough Data
With 50+ traces per variant, averages become stable enough to compare. If the challenger variant shows similar output length and error rate but lower cost, the case for switching is strong. If quality signals diverge, you can make an informed cost-vs-quality tradeoff rather than guessing; and when the differences are small, run a quick significance check to see whether they are real or just noise.
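For that significance check, Welch's t statistic can be computed from the per-call values (latencies, output lengths) using only the standard library. This is a sketch, not a substitute for a proper statistical test; as a rule of thumb, |t| above roughly 2 at these sample sizes suggests the difference is unlikely to be noise:

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Welch's t statistic for two independent samples with unequal variances.

    E.g. sample_a = per-call latencies for control, sample_b = for challenger.
    """
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    std_err = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / std_err
```

Feeding it the raw `latency_ms` values collected per variant (rather than the pre-aggregated totals) gives a quick read on whether a latency gap is worth acting on.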
Get Started
Create a free Nexus account at nexus.keylightdigital.dev/pricing and add the three metadata tags (experiment_id, model_variant, model) to your existing agent traces. The free tier covers 10,000 spans/month — plenty for a production A/B test.
Ready to A/B test your AI models?
Start free — no credit card required. Up to 10,000 spans/month on the free tier.
Start monitoring for free →