Monitoring Multi-Model AI Agents: Routing Between GPT-4, Claude, and Gemini
Modern AI agents increasingly route requests across multiple LLM providers — OpenAI GPT-4 for reasoning, Claude for long-context tasks, Gemini for multimodal inputs. When a routing decision sends a request to the wrong model, costs spike, latency degrades, or quality silently drops. Here's how to track model routing, compare cost and latency across providers, and detect quality regressions with Nexus.
Why teams use multi-model routing
No single LLM is best at everything. GPT-4o handles complex reasoning chains and structured JSON output well. Claude excels at long-context tasks — processing large codebases, lengthy documents, or multi-turn conversations that exceed GPT-4’s comfortable context window. Gemini 1.5 Pro brings native multimodal support for image and audio inputs, and Flash variants offer dramatically lower cost-per-token for high-volume, lower-stakes tasks.
Production AI agents take advantage of this by routing requests to whichever model is best suited for the task at hand:
- Tool selection and function calling — GPT-4o for its reliable structured output
- Document analysis and summarization — Claude 3.5 Sonnet for 200k-token context
- Multimodal inputs (images, screenshots) — Gemini 1.5 Pro or GPT-4V
- High-volume, low-stakes classification — Gemini Flash or Haiku for cost efficiency
- Code generation — Claude 3.5 Sonnet or GPT-4o depending on language and context
The routing logic itself may be a rule-based classifier, a smaller LLM acting as a meta-agent, or a cost-weighted heuristic based on input token count. Whatever the mechanism, the result is the same: a single agent run touches multiple providers, and each provider call has different cost, latency, and quality characteristics.
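As a concrete illustration of the last option, here is a minimal cost-weighted heuristic in Python. The model names are real, but the context windows, relative costs, and the `route_by_cost` helper are illustrative assumptions, not current list prices:

```python
# Minimal cost-weighted router: choose the cheapest model whose context
# window fits the input and that meets the required capability tier.
# Window sizes and relative costs are illustrative placeholders.
MODELS = [
    # (name, max input tokens, relative cost, strong reasoning?)
    ("gemini-1.5-flash", 1_000_000, 1, False),
    ("gpt-4o-mini", 128_000, 2, False),
    ("gpt-4o", 128_000, 10, True),
    ("claude-3-5-sonnet", 200_000, 12, True),
]

def route_by_cost(input_tokens: int, needs_reasoning: bool = False) -> str:
    candidates = [
        (cost, name)
        for name, window, cost, strong in MODELS
        if input_tokens <= window and (strong or not needs_reasoning)
    ]
    return min(candidates)[1]  # cheapest eligible model wins
```

Short inputs with no reasoning requirement land on the cheapest model; a reasoning task that exceeds GPT-4o's window falls through to Claude.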
Observability blind spots in multi-model agents
Multi-model routing introduces failure modes that single-provider agents don’t have:
- Misrouted requests: A routing bug sends a 150k-token document to GPT-4o instead of Claude, hitting context limits silently or generating truncated output. Without a span recording which model handled which request, you can’t distinguish a model quality issue from a routing bug.
- Cost spikes from routing drift: A routing heuristic that was calibrated last quarter starts sending more traffic to GPT-4o as input sizes grow. Your per-trace cost doubles without any code change. Without model-level cost breakdowns, the spike is invisible until your monthly bill arrives.
- Quality regression after model switch: You switch from GPT-4o to Claude for a task type to reduce cost. Average response quality degrades subtly — not enough to trigger errors, but enough to reduce user satisfaction. Without pre/post comparison on the same prompts, the regression is undetectable.
- Latency variance across providers: Gemini Flash is fast on average but has high p99 latency under load. GPT-4o has more consistent latency but higher median. Without per-model latency percentiles in your traces, you’re optimizing for the wrong thing.
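To make the cost-drift problem concrete, here is a sketch that aggregates per-model spend from exported span attributes. The span dict shape mirrors the llm.* fields tagged later in this post, and the per-million-token prices are placeholder assumptions, not a provider price list:

```python
from collections import defaultdict

# Illustrative USD-per-million-input-token rates (placeholders, not quotes).
PRICE_PER_M_INPUT = {
    "gpt-4o": 2.50,
    "claude-3-5-sonnet-20241022": 3.00,
    "gemini-1.5-flash": 0.075,
}

def cost_by_model(spans: list[dict]) -> dict[str, float]:
    """Estimate spend per model from spans carrying llm.model and llm.input_tokens."""
    totals: dict[str, float] = defaultdict(float)
    for span in spans:
        rate = PRICE_PER_M_INPUT.get(span["llm.model"], 0.0)
        totals[span["llm.model"]] += span["llm.input_tokens"] / 1_000_000 * rate
    return dict(totals)
```

Run daily against exported spans, a breakdown like this turns a silent routing drift into a visible per-model cost curve.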
Tagging spans with model metadata (TypeScript)
The first step is to record which model handled each LLM call as structured span metadata. Here's a TypeScript helper that routes to the selected provider and wraps each model call in a Nexus span:
import Anthropic from "@anthropic-ai/sdk";
import OpenAI from "openai";
import { GoogleGenerativeAI } from "@google/generative-ai";
import { nexus } from "./nexus-client";
type ModelProvider = "openai" | "anthropic" | "google";
interface RoutedCallOptions {
provider: ModelProvider;
model: string;
prompt: string;
systemPrompt?: string;
maxTokens?: number;
routingReason?: string;
}
async function routedLLMCall(
traceId: string,
options: RoutedCallOptions
): Promise<string> {
const { provider, model, prompt, systemPrompt, maxTokens = 1024 } = options;
const startedAt = Date.now();
const span = await nexus.startSpan({
traceId,
name: `llm.${provider}.${model}`,
attributes: {
"llm.provider": provider,
"llm.model": model,
"routing.reason": options.routingReason ?? "default",
},
});
try {
let responseText = "";
let inputTokens = 0;
let outputTokens = 0;
if (provider === "openai") {
const client = new OpenAI();
const messages: OpenAI.ChatCompletionMessageParam[] = [];
if (systemPrompt) messages.push({ role: "system", content: systemPrompt });
messages.push({ role: "user", content: prompt });
const response = await client.chat.completions.create({
model,
messages,
max_tokens: maxTokens,
});
responseText = response.choices[0].message.content ?? "";
inputTokens = response.usage?.prompt_tokens ?? 0;
outputTokens = response.usage?.completion_tokens ?? 0;
} else if (provider === "anthropic") {
const client = new Anthropic();
const response = await client.messages.create({
model,
max_tokens: maxTokens,
system: systemPrompt,
messages: [{ role: "user", content: prompt }],
});
responseText = response.content[0].type === "text"
? response.content[0].text : "";
inputTokens = response.usage.input_tokens;
outputTokens = response.usage.output_tokens;
    } else if (provider === "google") {
      const genAI = new GoogleGenerativeAI(process.env.GOOGLE_API_KEY!);
      // Pass the system prompt and token cap through so Gemini honors the
      // same options as the other providers instead of silently dropping them.
      const gemini = genAI.getGenerativeModel({
        model,
        systemInstruction: systemPrompt,
        generationConfig: { maxOutputTokens: maxTokens },
      });
      const result = await gemini.generateContent(prompt);
      responseText = result.response.text();
      inputTokens = result.response.usageMetadata?.promptTokenCount ?? 0;
      outputTokens = result.response.usageMetadata?.candidatesTokenCount ?? 0;
    }
const durationMs = Date.now() - startedAt;
await nexus.endSpan(span.id, {
status: "ok",
attributes: {
"llm.input_tokens": inputTokens,
"llm.output_tokens": outputTokens,
"llm.duration_ms": durationMs,
"llm.tokens_per_second": outputTokens / (durationMs / 1000),
},
});
return responseText;
} catch (err) {
await nexus.endSpan(span.id, {
status: "error",
errorMessage: err instanceof Error ? err.message : String(err),
});
throw err;
}
}
The key fields to tag on every LLM span are llm.provider, llm.model, llm.input_tokens, llm.output_tokens, and llm.duration_ms. With these fields on every span, Nexus can break down cost and latency by provider across your entire trace history.
Routing between models in Python
Here’s the same pattern in Python, with an explicit routing function that selects the model based on input characteristics:
import anthropic
import openai
import google.generativeai as genai
import os
import time
from nexus_sdk import NexusClient
nexus = NexusClient(api_key=os.environ["NEXUS_API_KEY"])
def route_to_model(prompt: str, context_length: int, task_type: str) -> dict:
"""Route to the best model given task characteristics."""
if context_length > 100_000:
return {"provider": "anthropic", "model": "claude-3-5-sonnet-20241022", "reason": "long_context"}
elif task_type == "multimodal":
return {"provider": "google", "model": "gemini-1.5-pro", "reason": "multimodal_input"}
elif task_type == "classification" and context_length < 2_000:
return {"provider": "google", "model": "gemini-1.5-flash", "reason": "low_cost_classification"}
else:
return {"provider": "openai", "model": "gpt-4o", "reason": "general_reasoning"}
def call_with_routing(trace_id: str, prompt: str, task_type: str = "general") -> str:
    context_length = len(prompt.split())  # word count as a rough proxy for tokens
route = route_to_model(prompt, context_length, task_type)
span = nexus.start_span(
trace_id=trace_id,
name=f"llm.{route['provider']}.{route['model']}",
attributes={
"llm.provider": route["provider"],
"llm.model": route["model"],
"routing.reason": route["reason"],
"routing.task_type": task_type,
}
)
started_at = time.time()
try:
if route["provider"] == "anthropic":
client = anthropic.Anthropic()
response = client.messages.create(
model=route["model"],
max_tokens=1024,
messages=[{"role": "user", "content": prompt}]
)
text = response.content[0].text
input_tokens = response.usage.input_tokens
output_tokens = response.usage.output_tokens
elif route["provider"] == "openai":
client = openai.OpenAI()
response = client.chat.completions.create(
model=route["model"],
messages=[{"role": "user", "content": prompt}]
)
            text = response.choices[0].message.content or ""
input_tokens = response.usage.prompt_tokens
output_tokens = response.usage.completion_tokens
elif route["provider"] == "google":
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel(route["model"])
result = model.generate_content(prompt)
text = result.text
input_tokens = result.usage_metadata.prompt_token_count
output_tokens = result.usage_metadata.candidates_token_count
duration_ms = (time.time() - started_at) * 1000
nexus.end_span(span["id"], status="ok", attributes={
"llm.input_tokens": input_tokens,
"llm.output_tokens": output_tokens,
"llm.duration_ms": round(duration_ms),
})
return text
except Exception as e:
nexus.end_span(span["id"], status="error", error_message=str(e))
raise
Comparing cost and latency across models in Nexus
Once you’re tagging spans with llm.provider and llm.model, Nexus lets you filter and compare across dimensions that were previously invisible:
- Cost per provider: Filter traces by llm.provider=openai vs llm.provider=anthropic and compare average token counts to estimate per-provider spend without waiting for your monthly bill.
- Latency by model: The llm.duration_ms attribute lets you compare p50 and p99 latency per model directly in the trace list. Gemini Flash may look attractive on p50 but expose high variance at p99 during peak hours.
- Routing distribution: Filter by routing.reason to see what fraction of requests are being routed to each path. A routing bug often shows up as an unexpected shift: long_context suddenly drops to zero while general_reasoning spikes.
- Error rate per provider: Tag spans with status=error and filter by provider to see which one is causing the most failures under load.
Detecting quality regressions after model switches
Cost and latency comparisons are straightforward — they’re numbers in your spans. Quality regression detection is harder because “quality” is subjective. But you can make it concrete by logging a quality signal alongside each LLM call.
One approach is to add a lightweight self-evaluation step that scores the response on a defined rubric:
async function callWithQualityScore(
traceId: string,
prompt: string,
options: RoutedCallOptions
): Promise<{ response: string; qualityScore: number }> {
const response = await routedLLMCall(traceId, options);
// Quick self-eval: ask a cheap model to score the response 1-5
const evalPrompt = `Rate this AI response on a scale of 1-5 for accuracy,
completeness, and relevance. Reply with just the number.
Prompt: ${prompt.slice(0, 500)}
Response: ${response.slice(0, 500)}`;
const scoreStr = await routedLLMCall(traceId, {
provider: "google",
model: "gemini-1.5-flash",
prompt: evalPrompt,
maxTokens: 10,
routingReason: "quality_eval",
});
const qualityScore = parseInt(scoreStr.trim(), 10) || 3;
await nexus.addEvent(traceId, {
name: "llm.quality_scored",
attributes: {
"llm.model": options.model,
"llm.quality_score": qualityScore,
},
});
return { response, qualityScore };
}
With quality scores in your spans, you can filter by llm.model and compare average quality scores before and after a model switch. A drop from 4.2 to 3.7 average quality score after switching from GPT-4o to Claude for a specific task type is a concrete signal worth investigating — even if error rates stayed flat.
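The before/after comparison itself is simple once scores are exported. This sketch assumes each exported event carries the llm.quality_score attribute emitted above plus a ts timestamp field (the timestamp field name is an assumption about the export shape):

```python
from statistics import mean

def quality_shift(events: list[dict], switch_ts: float) -> tuple[float, float]:
    """Mean quality score before and after a model switch at switch_ts."""
    before = [e["llm.quality_score"] for e in events if e["ts"] < switch_ts]
    after = [e["llm.quality_score"] for e in events if e["ts"] >= switch_ts]
    return mean(before), mean(after)
```

A meaningful drop in the second number on the same task type is the signal to roll the switch back or adjust prompts for the new model.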
Handling provider errors and fallback routing
Multi-model agents need graceful fallback when a provider is unavailable. Tag fallback spans explicitly so you can distinguish planned routing from failure-induced routing:
def call_with_fallback(trace_id: str, prompt: str, task_type: str = "general") -> str:
    """Attempt the routed primary model; fall back to another provider on error."""
    primary_route = route_to_model(prompt, len(prompt.split()), task_type)
    try:
        return call_with_routing(trace_id, prompt, task_type)
    except Exception as primary_err:
        # Flip to the other major provider; a cheap model keeps fallback cost low.
        fallback_provider = "anthropic" if primary_route["provider"] == "openai" else "openai"
        fallback_model = ("claude-3-haiku-20240307" if fallback_provider == "anthropic"
                          else "gpt-4o-mini")
        # Record the fallback explicitly so planned routing and failure-induced
        # routing stay distinguishable in the trace.
        nexus.add_event(trace_id, {
            "name": "routing.fallback_triggered",
            "attributes": {
                "primary_provider": primary_route["provider"],
                "primary_model": primary_route["model"],
                "fallback_provider": fallback_provider,
                "fallback_model": fallback_model,
                "error": str(primary_err),
                "fallback_reason": "provider_error",
            }
        })
        if fallback_provider == "anthropic":
            response = anthropic.Anthropic().messages.create(
                model=fallback_model, max_tokens=1024,
                messages=[{"role": "user", "content": prompt}])
            return response.content[0].text
        response = openai.OpenAI().chat.completions.create(
            model=fallback_model,
            messages=[{"role": "user", "content": prompt}])
        return response.choices[0].message.content or ""
With the routing.fallback_triggered event in your trace, you can filter Nexus traces for fallback events to see how often each provider is causing fallback activations — a leading indicator of provider reliability issues before they surface as user-visible errors.
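Counting fallback activations per primary provider from exported events is then a one-liner. The event shape mirrors the routing.fallback_triggered event emitted above; the exported list-of-dicts format is an assumption:

```python
from collections import Counter

def fallback_counts(events: list[dict]) -> Counter:
    """Count routing.fallback_triggered events by the provider that failed."""
    return Counter(
        e["attributes"]["primary_provider"]
        for e in events
        if e["name"] == "routing.fallback_triggered"
    )
```

A rising count for one provider is the leading reliability indicator described above, visible before users see errors.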
What to put in every multi-model span
As a reference, here’s the full set of attributes worth logging on every LLM call span in a multi-model agent:
- llm.provider — openai, anthropic, or google
- llm.model — exact model name (e.g. gpt-4o, claude-3-5-sonnet-20241022)
- llm.input_tokens — actual count from provider response
- llm.output_tokens — actual count from provider response
- llm.duration_ms — wall-clock time for the API call
- routing.reason — why this model was chosen (e.g. long_context, multimodal_input)
- routing.task_type — what kind of task this is (classification, summarization, generation)
- llm.quality_score — optional self-eval score (1–5) for quality regression tracking
Multi-model routing gives you the best of each provider — but only if your routing logic stays correct and your cost/latency tradeoffs stay visible. Nexus spans are the instrumentation layer that keeps those tradeoffs in view as your agent evolves. Sign up for a free Nexus account to start capturing multi-model traces today.
Track cost and latency across every model
Free tier, no credit card required. Full trace visibility in under 5 minutes.