2026-04-22 · 9 min read

Monitoring Multi-Model AI Agents: Routing Between GPT-4, Claude, and Gemini

Modern AI agents increasingly route requests across multiple LLM providers — OpenAI GPT-4 for reasoning, Claude for long-context tasks, Gemini for multimodal inputs. When a routing decision sends the wrong request to the wrong model, costs spike, latency degrades, or quality silently drops. Here's how to track model routing, compare cost and latency across providers, and detect quality regressions with Nexus.

Why teams use multi-model routing

No single LLM is best at everything. GPT-4o handles complex reasoning chains and structured JSON output well. Claude excels at long-context tasks — processing large codebases, lengthy documents, or multi-turn conversations that exceed GPT-4’s comfortable context window. Gemini 1.5 Pro brings native multimodal support for image and audio inputs, and Flash variants offer dramatically lower cost-per-token for high-volume, lower-stakes tasks.

Production AI agents take advantage of this by routing each request to whichever model is best suited for the task at hand: complex reasoning and structured output to GPT-4o, long-context processing to Claude, and multimodal or high-volume, low-stakes work to Gemini.

The routing logic itself may be a rule-based classifier, a smaller LLM acting as a meta-agent, or a cost-weighted heuristic based on input token count. Whatever the mechanism, the result is the same: a single agent run touches multiple providers, and each provider call has different cost, latency, and quality characteristics.
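To make the rule-based variant concrete, here is a minimal TypeScript sketch. The function name, thresholds, and model choices are illustrative placeholders, not a recommendation; mirror them against your own traffic patterns:

```typescript
type Provider = "openai" | "anthropic" | "google";

interface Route {
  provider: Provider;
  model: string;
  reason: string;
}

// Illustrative rule-based router; the token thresholds are placeholders to tune.
function routeRequest(estimatedTokens: number, taskType: string): Route {
  if (estimatedTokens > 100_000) {
    return { provider: "anthropic", model: "claude-3-5-sonnet-20241022", reason: "long_context" };
  }
  if (taskType === "multimodal") {
    return { provider: "google", model: "gemini-1.5-pro", reason: "multimodal_input" };
  }
  if (taskType === "classification" && estimatedTokens < 2_000) {
    return { provider: "google", model: "gemini-1.5-flash", reason: "low_cost_classification" };
  }
  return { provider: "openai", model: "gpt-4o", reason: "general_reasoning" };
}
```

The routing reason returned here is exactly what you will want to tag on spans later, so that every trace records not just which model ran but why it was chosen.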

Observability blind spots in multi-model agents

Multi-model routing introduces failure modes that single-provider agents don’t have:

- Misrouting: a cheap classification task lands on an expensive frontier model, or a long-context document goes to a model that can’t fit it.
- Fragmented cost visibility: spend is split across multiple provider dashboards, so a spike in one provider is easy to miss.
- Latency variance: models differ widely in throughput and time-to-first-token, so a routing change silently shifts user-facing latency.
- Silent quality regressions: a model switch can degrade output quality without producing a single error.
- Hidden fallbacks: failure-induced rerouting looks identical to planned routing unless it is tagged explicitly.

Tagging spans with model metadata (TypeScript)

The first step is to record which model handled each LLM call as structured span metadata. Here’s a TypeScript helper that wraps each model call in a Nexus span:

import Anthropic from "@anthropic-ai/sdk";
import OpenAI from "openai";
import { GoogleGenerativeAI } from "@google/generative-ai";
import { nexus } from "./nexus-client";

type ModelProvider = "openai" | "anthropic" | "google";

interface RoutedCallOptions {
  provider: ModelProvider;
  model: string;
  prompt: string;
  systemPrompt?: string;
  maxTokens?: number;
  routingReason?: string;
}

async function routedLLMCall(
  traceId: string,
  options: RoutedCallOptions
): Promise<string> {
  const { provider, model, prompt, systemPrompt, maxTokens = 1024 } = options;
  const startedAt = Date.now();

  const span = await nexus.startSpan({
    traceId,
    name: `llm.${provider}.${model}`,
    attributes: {
      "llm.provider": provider,
      "llm.model": model,
      "routing.reason": options.routingReason ?? "default",
    },
  });

  try {
    let responseText = "";
    let inputTokens = 0;
    let outputTokens = 0;

    if (provider === "openai") {
      const client = new OpenAI();
      const messages: OpenAI.ChatCompletionMessageParam[] = [];
      if (systemPrompt) messages.push({ role: "system", content: systemPrompt });
      messages.push({ role: "user", content: prompt });

      const response = await client.chat.completions.create({
        model,
        messages,
        max_tokens: maxTokens,
      });
      responseText = response.choices[0].message.content ?? "";
      inputTokens = response.usage?.prompt_tokens ?? 0;
      outputTokens = response.usage?.completion_tokens ?? 0;

    } else if (provider === "anthropic") {
      const client = new Anthropic();
      const response = await client.messages.create({
        model,
        max_tokens: maxTokens,
        system: systemPrompt,
        messages: [{ role: "user", content: prompt }],
      });
      responseText = response.content[0].type === "text"
        ? response.content[0].text : "";
      inputTokens = response.usage.input_tokens;
      outputTokens = response.usage.output_tokens;

    } else if (provider === "google") {
      const genAI = new GoogleGenerativeAI(process.env.GOOGLE_API_KEY!);
      const gemini = genAI.getGenerativeModel({ model });
      const result = await gemini.generateContent(prompt);
      responseText = result.response.text();
      inputTokens = result.response.usageMetadata?.promptTokenCount ?? 0;
      outputTokens = result.response.usageMetadata?.candidatesTokenCount ?? 0;
    }

    const durationMs = Date.now() - startedAt;
    await nexus.endSpan(span.id, {
      status: "ok",
      attributes: {
        "llm.input_tokens": inputTokens,
        "llm.output_tokens": outputTokens,
        "llm.duration_ms": durationMs,
        "llm.tokens_per_second": durationMs > 0 ? outputTokens / (durationMs / 1000) : 0,
      },
    });

    return responseText;
  } catch (err) {
    await nexus.endSpan(span.id, {
      status: "error",
      errorMessage: err instanceof Error ? err.message : String(err),
    });
    throw err;
  }
}

The key fields to tag on every LLM span are llm.provider, llm.model, llm.input_tokens, llm.output_tokens, and llm.duration_ms. With these fields on every span, Nexus can break down cost and latency by provider across your entire trace history.

Routing between models in Python

Here’s the same pattern in Python, with an explicit routing function that selects the model based on input characteristics:

import anthropic
import openai
import google.generativeai as genai
import os
import time
from nexus_sdk import NexusClient

nexus = NexusClient(api_key=os.environ["NEXUS_API_KEY"])

def route_to_model(prompt: str, context_length: int, task_type: str) -> dict:
    """Route to the best model given task characteristics."""
    if context_length > 100_000:
        return {"provider": "anthropic", "model": "claude-3-5-sonnet-20241022", "reason": "long_context"}
    elif task_type == "multimodal":
        return {"provider": "google", "model": "gemini-1.5-pro", "reason": "multimodal_input"}
    elif task_type == "classification" and context_length < 2_000:
        return {"provider": "google", "model": "gemini-1.5-flash", "reason": "low_cost_classification"}
    else:
        return {"provider": "openai", "model": "gpt-4o", "reason": "general_reasoning"}

def call_with_routing(trace_id: str, prompt: str, task_type: str = "general") -> str:
    context_length = len(prompt.split())  # word count as a rough proxy for token count
    route = route_to_model(prompt, context_length, task_type)

    span = nexus.start_span(
        trace_id=trace_id,
        name=f"llm.{route['provider']}.{route['model']}",
        attributes={
            "llm.provider": route["provider"],
            "llm.model": route["model"],
            "routing.reason": route["reason"],
            "routing.task_type": task_type,
        }
    )

    started_at = time.time()
    try:
        if route["provider"] == "anthropic":
            client = anthropic.Anthropic()
            response = client.messages.create(
                model=route["model"],
                max_tokens=1024,
                messages=[{"role": "user", "content": prompt}]
            )
            text = response.content[0].text
            input_tokens = response.usage.input_tokens
            output_tokens = response.usage.output_tokens

        elif route["provider"] == "openai":
            client = openai.OpenAI()
            response = client.chat.completions.create(
                model=route["model"],
                messages=[{"role": "user", "content": prompt}]
            )
            text = response.choices[0].message.content
            input_tokens = response.usage.prompt_tokens
            output_tokens = response.usage.completion_tokens

        elif route["provider"] == "google":
            genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
            model = genai.GenerativeModel(route["model"])
            result = model.generate_content(prompt)
            text = result.text
            input_tokens = result.usage_metadata.prompt_token_count
            output_tokens = result.usage_metadata.candidates_token_count

        else:
            raise ValueError(f"Unsupported provider: {route['provider']}")

        duration_ms = (time.time() - started_at) * 1000
        nexus.end_span(span["id"], status="ok", attributes={
            "llm.input_tokens": input_tokens,
            "llm.output_tokens": output_tokens,
            "llm.duration_ms": round(duration_ms),
        })
        return text

    except Exception as e:
        nexus.end_span(span["id"], status="error", error_message=str(e))
        raise

Comparing cost and latency across models in Nexus

Once you’re tagging spans with llm.provider and llm.model, Nexus lets you filter and compare across dimensions that were previously invisible:

- Cost per provider and per model, derived from llm.input_tokens and llm.output_tokens.
- Latency distributions by model, using llm.duration_ms, and generation throughput via llm.tokens_per_second.
- Routing behavior over time, grouped by routing.reason, to verify the router is doing what you intended.
- Error rates per provider, using span status, to spot provider-specific reliability problems.
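As a concrete example of the cost dimension, per-provider spend can also be recomputed offline from exported span records using the tagged token counts. The prices in this sketch are illustrative placeholders, not current list prices; check each provider’s pricing page before relying on the numbers:

```typescript
// Shape of an exported span, reduced to the fields needed for costing.
interface SpanRecord {
  provider: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
}

// Illustrative USD prices per million tokens -- verify against current provider pricing.
const PRICES: Record<string, { input: number; output: number }> = {
  "gpt-4o": { input: 2.5, output: 10 },
  "claude-3-5-sonnet-20241022": { input: 3, output: 15 },
  "gemini-1.5-flash": { input: 0.075, output: 0.3 },
};

// Sum span-level cost into a per-provider total.
function costByProvider(spans: SpanRecord[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const span of spans) {
    const price = PRICES[span.model];
    if (!price) continue; // skip models without a configured price
    const cost =
      (span.inputTokens * price.input + span.outputTokens * price.output) / 1_000_000;
    totals.set(span.provider, (totals.get(span.provider) ?? 0) + cost);
  }
  return totals;
}
```

This is the same arithmetic a dashboard performs; having it as a standalone function is useful for spot-checking a bill against your traces.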

Detecting quality regressions after model switches

Cost and latency comparisons are straightforward — they’re numbers in your spans. Quality regression detection is harder because “quality” is subjective. But you can make it concrete by logging a quality signal alongside each LLM call.

One approach is to add a lightweight self-evaluation step that scores the response on a defined rubric:

async function callWithQualityScore(
  traceId: string,
  prompt: string,
  options: RoutedCallOptions
): Promise<{ response: string; qualityScore: number }> {
  const response = await routedLLMCall(traceId, options);

  // Quick self-eval: ask a cheap model to score the response 1-5
  const evalPrompt = `Rate this AI response on a scale of 1-5 for accuracy,
completeness, and relevance. Reply with just the number.

Prompt: ${prompt.slice(0, 500)}
Response: ${response.slice(0, 500)}`;

  const scoreStr = await routedLLMCall(traceId, {
    provider: "google",
    model: "gemini-1.5-flash",
    prompt: evalPrompt,
    maxTokens: 10,
    routingReason: "quality_eval",
  });

  const qualityScore = Math.min(5, Math.max(1, parseInt(scoreStr.trim(), 10) || 3));

  await nexus.addEvent(traceId, {
    name: "llm.quality_scored",
    attributes: {
      "llm.model": options.model,
      "llm.quality_score": qualityScore,
    },
  });

  return { response, qualityScore };
}

With quality scores in your spans, you can filter by llm.model and compare average quality scores before and after a model switch. A drop from 4.2 to 3.7 average quality score after switching from GPT-4o to Claude for a specific task type is a concrete signal worth investigating — even if error rates stayed flat.
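The before/after comparison itself is a simple aggregation over exported quality events. A minimal sketch, assuming each exported event carries the model name and its 1-to-5 score:

```typescript
// A quality-scored event exported from trace storage.
interface QualityEvent {
  model: string;
  qualityScore: number; // 1-5, as produced by the self-eval step
}

// Average quality score per model over a batch of events.
function averageQualityByModel(events: QualityEvent[]): Map<string, number> {
  const sums = new Map<string, { total: number; count: number }>();
  for (const e of events) {
    const agg = sums.get(e.model) ?? { total: 0, count: 0 };
    agg.total += e.qualityScore;
    agg.count += 1;
    sums.set(e.model, agg);
  }
  const averages = new Map<string, number>();
  sums.forEach(({ total, count }, model) => averages.set(model, total / count));
  return averages;
}
```

Run this over events from before and after a routing change and compare the per-model averages; a consistent drop for one task type is the regression signal described above.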

Handling provider errors and fallback routing

Multi-model agents need graceful fallback when a provider is unavailable. Tag fallback spans explicitly so you can distinguish planned routing from failure-induced routing:

def call_with_fallback(trace_id: str, prompt: str, primary_route: dict) -> str:
    """Attempt the primary model, fall back to a secondary provider on error."""
    try:
        return call_with_routing(trace_id, prompt, primary_route["reason"])
    except Exception as primary_err:
        nexus.add_event(trace_id, {
            "name": "routing.fallback_triggered",
            "attributes": {
                "primary_provider": primary_route["provider"],
                "primary_model": primary_route["model"],
                "error": str(primary_err),
                "fallback_reason": "provider_error",
            }
        })

        # Cross-provider fallback: OpenAI failures reroute to Anthropic and vice versa.
        if primary_route["provider"] == "openai":
            fallback_provider, fallback_model = "anthropic", "claude-3-haiku-20240307"
        else:
            fallback_provider, fallback_model = "openai", "gpt-4o-mini"

        # Tag the fallback span explicitly so it never masquerades as planned routing.
        span = nexus.start_span(
            trace_id=trace_id,
            name=f"llm.{fallback_provider}.{fallback_model}",
            attributes={"llm.provider": fallback_provider,
                        "llm.model": fallback_model,
                        "routing.reason": "fallback"},
        )
        try:
            if fallback_provider == "anthropic":
                response = anthropic.Anthropic().messages.create(
                    model=fallback_model, max_tokens=1024,
                    messages=[{"role": "user", "content": prompt}])
                text = response.content[0].text
            else:
                response = openai.OpenAI().chat.completions.create(
                    model=fallback_model,
                    messages=[{"role": "user", "content": prompt}])
                text = response.choices[0].message.content
            nexus.end_span(span["id"], status="ok")
            return text
        except Exception as fallback_err:
            nexus.end_span(span["id"], status="error", error_message=str(fallback_err))
            raise

With the routing.fallback_triggered event in your trace, you can filter Nexus traces for fallback events to see how often each provider is causing fallback activations — a leading indicator of provider reliability issues before they surface as user-visible errors.
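That fallback-rate analysis can also be done offline. A sketch over exported events, where `fallbackRates` is a hypothetical helper (not part of any SDK) that divides fallback activations by total calls per provider:

```typescript
// Fallback activations divided by total calls, per primary provider.
function fallbackRates(
  callsByProvider: Map<string, number>,
  fallbackEvents: { primaryProvider: string }[]
): Map<string, number> {
  const counts = new Map<string, number>();
  for (const e of fallbackEvents) {
    counts.set(e.primaryProvider, (counts.get(e.primaryProvider) ?? 0) + 1);
  }
  const rates = new Map<string, number>();
  callsByProvider.forEach((total, provider) =>
    rates.set(provider, total > 0 ? (counts.get(provider) ?? 0) / total : 0)
  );
  return rates;
}
```

A rising rate for one provider, with the others flat, points at that provider rather than at your routing logic.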

What to put in every multi-model span

As a reference, here’s the full set of attributes worth logging on every LLM call span in a multi-model agent:

- llm.provider: which provider served the call (openai, anthropic, or google)
- llm.model: the exact model identifier, not just the family name
- llm.input_tokens and llm.output_tokens: the basis for cost attribution
- llm.duration_ms: wall-clock latency of the call
- llm.tokens_per_second: generation throughput
- routing.reason: why the router chose this model
- routing.task_type: the task classification that drove the routing decision
- llm.quality_score: an optional quality signal, if you run a self-evaluation step
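In TypeScript, these attributes can be captured as a type so every span is tagged consistently. This interface is a sketch following the naming conventions used in this post; the optional fields apply only when the corresponding instrumentation is in place:

```typescript
// Span attribute schema for multi-model LLM calls.
interface MultiModelSpanAttributes {
  "llm.provider": "openai" | "anthropic" | "google";
  "llm.model": string;
  "llm.input_tokens": number;
  "llm.output_tokens": number;
  "llm.duration_ms": number;
  "llm.tokens_per_second"?: number;
  "routing.reason"?: string;
  "routing.task_type"?: string;
  "llm.quality_score"?: number;
}
```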

Multi-model routing gives you the best of each provider — but only if your routing logic stays correct and your cost/latency tradeoffs stay visible. Nexus spans are the instrumentation layer that keeps those tradeoffs in view as your agent evolves. Sign up for a free Nexus account to start capturing multi-model traces today.

Track cost and latency across every model

Free tier, no credit card required. Full trace visibility in under 5 minutes.