Blog

Thoughts on AI agent observability, developer tools, and building in public.

2026-07-01 · 7 min read

Observability for Modal Agents: Tracing Serverless GPU Functions with Nexus

Modal runs your Python functions on serverless GPU/CPU infrastructure — cold starts, GPU allocation time, and async task execution are all invisible by default. Here's how to wrap Modal functions with Nexus spans to capture cold start latency, GPU allocation overhead, CUDA OOM errors, and end-to-end trace context across async tasks.

2026-06-24 · 7 min read

A/B Testing AI Models with Nexus Observability

Switching from GPT-4o-mini to Claude Haiku or Llama 3 8B can cut token costs by 60%, but how do you know if quality degrades? Here's how to use Nexus span metadata to run controlled A/B tests between models, compare token costs and latency per variant, and make data-driven model decisions without guessing.

2026-06-10 · 6 min read

Observability for Groq API Agents: Tracing Ultra-Fast LLM Calls with Nexus

Groq's LPU inference delivers sub-second response times for Llama 3, Mixtral, and Gemma — but fast doesn't mean free from operational concerns. Token costs accumulate, rate limits hit silently, and latency still varies by model and request size. Here's how to wrap every Groq API call in a Nexus span for full trace-level visibility.

2026-06-03 · 7 min read

Observability for CrewAI Flows: Tracing State Machines and Conditional Routes with Nexus

CrewAI Flows (introduced in v0.60) bring structured, event-driven workflow orchestration to CrewAI — @start methods, @listen handlers, and @router decorators for conditional branching. Unlike unstructured Crew runs, Flows are deterministic state machines that need different observability: tracking state transitions, detecting unexpected routing branches, and recording what triggered each step. Here's how to wrap every Flow method with Nexus spans.

2026-05-27 · 8 min read

Multi-Tenant AI Agent Cost Attribution with Nexus

When multiple users share the same AI agent backend, token costs accumulate invisibly — you know your total OpenAI bill, but not which users or features are driving it. Here's how to use Nexus span metadata to tag every LLM call with user_id, tenant_id, and feature, then aggregate token spend per tag to build a cost-per-user report and enforce per-tenant budgets.

2026-05-20 · 7 min read

Observability for Cloudflare Workers AI Agents: Tracing Serverless LLM Calls with Nexus

Cloudflare Workers AI lets you run LLM inference inside a Worker with a single env.AI.run() call. There is no GPU provisioning, no model servers to manage, and no cold starts to fear. But serverless doesn't mean invisible: model quota limits, per-model latency spikes, and token usage you can't see still affect production agents. Here's how to wrap every Workers AI call with Nexus spans for full trace-level visibility.

2026-05-13 · 7 min read

Observability for Ollama Agents: Tracing Local LLMs with Nexus

Ollama lets you run Llama 3, Mistral, and Phi-3 locally via a simple REST API — but local LLMs still suffer from latency variance, quality regressions, and token usage you can't see. Here's how to wrap Ollama calls with Nexus spans using both direct REST requests and the OpenAI-compatible endpoint, so you get trace-level visibility into every local model invocation.

2026-05-06 · 8 min read

Observability for n8n AI Workflow Agents: Tracing Every LLM Call with Nexus

n8n's AI Agent node lets you drop a ReAct agent into any workflow with no code — but when the agent loops on a tool call, burns tokens retrying a bad prompt, or silently returns a hallucinated answer, your workflow logs show nothing. Here's how to wrap every n8n AI Agent execution with Nexus traces using HTTP Request nodes and n8n expressions, so you get span-level visibility into LLM calls, tool use, and token spend across every workflow run.

2026-04-29 · 9 min read

Observability for Model Context Protocol (MCP) Servers: Tracing Tool Calls with Nexus

The Model Context Protocol (MCP) lets AI hosts like Claude Desktop and Cursor call your server's tools over a standard JSON-RPC transport — but when a tool call returns the wrong result, takes 10 seconds, or throws a silent exception, the host LLM has no way to surface which tool failed or why. Here's how to wrap MCP tool handlers with Nexus spans in both Python (FastMCP) and TypeScript (@modelcontextprotocol/sdk) to get full trace-level visibility into every tool call your server handles.

2026-04-28 · 9 min read

Monitoring AI Agent Token Budget and Cost Thresholds with Nexus

OpenAI charges per token. Claude charges per token. A runaway agent in a bad loop can spend $50 in minutes before anyone notices. Here's how to record token usage as Nexus span metadata, compute cost per trace using model pricing tables, build a token budget guard that aborts a run before it exceeds your limit, and alert when a session hits 80% of its budget.

2026-04-27 · 10 min read

Monitoring Google Gemini and Vertex AI Agents with Nexus

Google Gemini and Vertex AI offer two entry points for building AI agents: the google-generativeai SDK for direct Gemini API access and the Vertex AI SDK for enterprise GCP-hosted agents. When a safety filter silently blocks a generation, a function call loop spins without reaching stop, or Vertex AI Search grounding returns zero results, the API response tells you what happened but not when or why. Here's how to wrap Gemini and Vertex AI agents in Nexus traces for full span-level observability.

2026-04-26 · 9 min read

Monitoring Mistral AI Agents: Tracing Function Calls, Token Costs, and Rate Limits

Mistral AI's function-calling API lets you build agents that route between tools using mistral-large or mistral-small. When a tool schema validation fails silently, a rate limit error gets swallowed, or your agent burns through tokens on a loop, the Mistral API response gives you an error code but no execution timeline. Here's how to wrap Mistral chat completions in Nexus traces and get full span-level observability.

2026-04-25 · 10 min read

Monitoring Azure AI Agent Service: Tracing Threads, Runs, and Tool Call Steps

Azure AI Agent Service is Microsoft's managed agent runtime built on the same threads/runs/steps model as OpenAI Assistants. When a run fails silently, a code interpreter execution times out, or a function tool call returns an unexpected value, the Azure portal doesn't give you span-level visibility into what went wrong. Here's how to wrap Azure AI Agent runs in Nexus traces and get full observability.

2026-04-24 · 10 min read

Observability for AWS Bedrock Agents: Tracing InvokeAgent, Action Groups, and Knowledge Bases

AWS Bedrock Agents orchestrate multi-step tasks using action groups (Lambda functions) and knowledge bases (RAG retrieval). When an action group Lambda throws silently, a knowledge base returns zero chunks, or the agent loops unexpectedly, Bedrock's built-in logs don't tell you which step failed or why. Here's how to add full trace observability to Bedrock Agents using Nexus.

2026-04-23 · 9 min read

Tracing Flowise Chatflows: Observability for No-Code AI Agent Workflows

Flowise lets you build AI chatflows visually by connecting LangChain nodes in a drag-and-drop UI — but when a chatflow returns a wrong answer, a custom tool node throws silently, or a production chatflow starts hallucinating, Flowise's built-in logs don't tell you which node failed or why. Here's how to add full trace observability to Flowise chatflows using Nexus.

2026-04-22 · 9 min read

Monitoring Multi-Model AI Agents: Routing Between GPT-4, Claude, and Gemini

Modern AI agents increasingly route requests across multiple LLM providers — OpenAI GPT-4 for reasoning, Claude for long-context tasks, Gemini for multimodal inputs. When a routing decision sends the wrong request to the wrong model, costs spike, latency degrades, or quality silently drops. Here's how to track model routing, compare cost and latency across providers, and detect quality regressions with Nexus.

2026-04-21 · 9 min read

Tracing Haystack Pipelines: Observability for RAG and Document AI

Haystack (by deepset) builds AI pipelines from composable components — Embedder, Retriever, PromptBuilder, Generator. When a retriever returns empty results, an embedder cold-starts slowly, or prompt length creep degrades generation quality, you need per-component trace visibility to diagnose it. Here's how to instrument Haystack pipelines with Nexus.

2026-04-20 · 8 min read

Tracing Agno Agents: Observability for Python Multi-Agent Pipelines

Agno (formerly phidata) is a Python-native multi-agent framework built around Agent and Team primitives. When a team routes to the wrong member agent, a tool call fails silently, or an agent run returns a low-quality response, you need trace visibility to diagnose what happened. Here's how to instrument Agno agents and teams with Nexus.

2026-04-20 · 8 min read

Observability for Microsoft Semantic Kernel Agents in Python

Microsoft Semantic Kernel gives you a structured way to build AI agents in Python with plugins, planners, and multi-model support. When a planner selects the wrong function, a plugin throws silently, or a kernel invocation spikes latency, you need trace visibility to diagnose it. Here's how to integrate Nexus into Semantic Kernel agents.

2026-04-19 · 8 min read

Tracing Google ADK Agents: Observability for Gemini-Powered Agent Pipelines

Google's Agent Development Kit (ADK) gives you Agent, SequentialAgent, and LoopAgent primitives for building Gemini-powered multi-agent systems. When a LoopAgent runs indefinitely, a sequential step fails silently, or a tool call surfaces as an agent observation instead of an error, you need trace visibility. Here's how to instrument ADK with Nexus.

2026-04-19 · 8 min read

Tracing Mastra Agents: Observability for TypeScript Agent Workflows

Mastra is a TypeScript-native agent framework with Agents, Workflows, and Networks built for Node.js and Vercel. When a workflow step fails silently, a tool call throws on malformed JSON, or a network routes to the wrong agent, you need trace visibility to debug it. Here's how to instrument Mastra with Nexus.

2026-04-19 · 9 min read

Tracing DSPy Programs: Observability for Prompt Optimization Pipelines

DSPy replaces hand-written prompts with compiled LM programs — but when an optimizer iteration degrades performance, a multi-hop retrieval chain produces irrelevant context, or production inputs diverge from your training set, you need trace visibility to diagnose what's happening. Here's how to instrument DSPy programs with Nexus.

2026-04-19 · 8 min read

Observability for LlamaIndex Agents and Query Pipelines

LlamaIndex gives you QueryPipelines and AgentWorkers for building RAG and agent workflows — but when retrieval quality drops, a ReAct loop over-iterates, or a tool call fails silently, standard logging can't tell you which step broke. Here's how to instrument LlamaIndex with full trace observability using Nexus.

2026-04-18 · 7 min read

Using Metadata to Make AI Agent Traces Searchable and Debuggable

Most teams record traces but never add metadata. That's a missed opportunity: metadata fields like model version, user ID, environment, and feature flag turn a trace from a raw log into a queryable record. Here's what to capture, how to name it, and how to use it to debug production incidents.

2026-04-18 · 7 min read

Tracking Token Costs for AI Agents in Production

Token costs are the biggest variable expense in AI agent systems — but most teams have no per-agent cost visibility. A trace that ran for 3 seconds may cost $0.001 or $0.15 depending on model and prompt size. Here's how to record, aggregate, and alert on token costs using Nexus.

2026-04-18 · 7 min read

Tracing AG2 (AutoGen v2) Multi-Agent Conversations with Nexus

AG2 (formerly AutoGen) makes it easy to spin up teams of ConversableAgents — but when a multi-agent conversation goes wrong, figuring out which agent said what and where the chain broke is painful. Here's how to add full trace observability to AG2 conversations with Nexus.

2026-04-18 · 8 min read

How to Add Observability to HuggingFace Smolagents

HuggingFace's Smolagents framework is compact by design — a minimal API for tool-calling and code-executing agents. That minimalism extends to debugging: when a Smolagents run fails, there's almost no built-in visibility. Here's how to add full distributed tracing to CodeAgent and ToolCallingAgent runs with Nexus in under 15 lines of Python.

2026-04-15 · 8 min read

How Prompt Caching Can Cut Your AI Agent Costs by 80%

Prompt caching is the highest-ROI optimization most AI agent teams haven't tried yet. By storing repeated context — system prompts, few-shot examples, retrieved documents — you can reduce input token costs by 60–90% with almost no code changes. Here's how it works, when to use it, and how to trace cache effectiveness in Nexus.

2026-04-15 · 9 min read

Debugging Multi-Agent Orchestration: A Practical Guide

Multi-agent systems fail in ways that single-agent debugging can't handle. When an orchestrator delegates to 5 sub-agents in parallel and one fails silently, you need distributed trace data — not just a single error message. This guide covers the 4 most common multi-agent failure modes and how to diagnose each one using trace spans.

2026-04-15 · 10 min read

How to Write Tests for LLM-Based AI Agents

Testing LLM-based agents is hard because outputs are non-deterministic. But "it's probabilistic" isn't an excuse to skip tests — it means you need different tests: deterministic unit tests for tool logic, contract tests for LLM interfaces, integration tests with seeded scenarios, and trace-based regression tests that compare execution paths. Here's the full testing pyramid for AI agents.

2026-04-14 · 8 min read

AI Agent Reliability Patterns: Retry, Timeout, and Circuit Breaker

AI agents fail differently from traditional software. Retry storms burn your token budget. Silent timeouts leave traces hanging. Circuit breakers prevent cascading LLM failures. Here are four battle-tested reliability patterns — with trace examples showing what each looks like in Nexus.

2026-04-14 · 9 min read

How Trace Analysis Cut Our AI Agent Costs by 60%

Running AI agents in production gets expensive fast. We went from $800/month to $310/month on LLM costs — without reducing quality. Here's the trace-driven approach we used: identifying the spans burning the most tokens, eliminating unnecessary retries, and caching repeated context.

2026-04-09 · 9 min read

Detecting AI Hallucinations in Production with Trace Analysis

Hallucinations are the silent killers of AI agent reliability. Most teams only discover them from user complaints. Here's how to use trace analysis to detect hallucinations before they reach your users — with output verification spans, confidence scoring, and retrieval comparison tracing.

2026-04-09 · 9 min read

How to Choose an AI Observability Tool in 2026

Evaluating AI observability tools? Most comparisons list features without helping you decide. Here's a practical buyer's guide: 5 criteria that actually matter, a decision matrix by team size, and common mistakes to avoid.

2026-04-09 · 8 min read

How Much Does It Cost to Run AI Agents? A Token Economics Guide

Running AI agents in production costs more than most teams expect. Token costs compound quickly across retries, context overflows, and unnecessary tool calls. Here's how to calculate realistic costs, identify hidden cost patterns, and use tracing to keep your bill predictable.

2026-04-09 · 9 min read

OpenTelemetry for AI Agents: Why Standard APM Falls Short

OpenTelemetry is great at instrumenting web services. But AI agents fail in ways that standard spans and metrics were never designed to capture. Here's what OTel gets right, five things it misses, and how purpose-built agent observability fills the gaps.

2026-04-09 · 8 min read

5 Metrics Every AI Agent Team Should Track

Most teams monitoring AI agents track the wrong things. Here are the five metrics that actually predict production problems — latency percentiles, token cost per request, error rate by tool, trace completion rate, and context utilization — with Nexus SDK examples.

2026-04-09 · 11 min read

AI Observability Tools Compared: The 2026 Guide

Langfuse, LangSmith, Helicone, Braintrust, Arize Phoenix, AgentOps, or Nexus? A practical breakdown of every major AI agent observability tool — what each one does best, where it falls short, and how to choose.

2026-04-07 · 9 min read

How to Debug AI Agents in Production

AI agents fail in non-obvious ways: tool call errors that cascade silently, context windows that overflow mid-task, loops that spin without terminating. Here's a practical debugging playbook with trace-first strategies and Nexus SDK examples.

2026-04-07 · 6 min read

How to Monitor Your AI Agents in Production

AI agents fail in production in ways that are invisible without observability. Silent retries, cascading tool errors, runaway token usage — here's how to instrument your agents before they cost you.
