How to Choose an AI Observability Tool in 2026

2026-04-09 · 9 min read

The AI observability tooling landscape has exploded over the last 18 months. Langfuse, LangSmith, Arize Phoenix, AgentOps, Helicone, Braintrust, Datadog LLM Observability, Portkey, Nexus — and a dozen more launched since last year. Every comparison table lists 30 features and calls it a guide. That doesn't help you decide.

This post is different. We'll walk through 5 criteria that actually separate tools in practice, a decision matrix by team type, and the 3 most common mistakes teams make when evaluating. We'll be honest — there are cases where Nexus isn't the right answer.

The 5 criteria that actually matter

1. Cost model

The question: Is it flat-rate or usage-based? Will costs grow linearly with your agent volume?

Usage-based pricing (per log, per token, per request) is cheap at zero scale and expensive at production scale. A single AI agent making 50 LLM calls per run, processing 1,000 requests/day, generates 50,000 logged calls/day. At $0.001/call (a typical tier), that's $1,500/month — not $9.

Flat-rate tools (Nexus at $9/mo, some tiers of Langfuse) are predictable regardless of volume. Usage-based tools (Helicone, AgentOps, Braintrust, Datadog) start cheap and scale with you — which is either a feature (you only pay for what you use) or a trap (costs surprise you at 10× load).
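
To make that concrete, here's a quick back-of-the-envelope projection using the numbers from the example above. The per-call price and flat rate are illustrative assumptions, not any vendor's actual quote; plug in your own traffic estimates.

```python
# Back-of-the-envelope cost projection: usage-based vs. flat-rate.
# All prices below are illustrative assumptions, not real vendor quotes.

calls_per_run = 50               # LLM calls one agent run makes
runs_per_day = 1_000             # expected production traffic
price_per_logged_call = 0.001    # $ per logged call (assumed usage tier)
flat_rate_per_month = 9.00       # $ per month (assumed flat plan)

logged_calls_per_month = calls_per_run * runs_per_day * 30
usage_based_cost = logged_calls_per_month * price_per_logged_call

print(f"Logged calls/month: {logged_calls_per_month:,}")         # 1,500,000
print(f"Usage-based:        ${usage_based_cost:,.0f}/month")      # $1,500
print(f"Flat-rate:          ${flat_rate_per_month:,.2f}/month")   # $9.00
```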

2. Integration depth

The question: Does the tool require a proxy, an SDK, or framework-specific hooks? What visibility do you get at each level?

There are three integration models, each with tradeoffs:

- Proxy-based (Helicone, Portkey): swap your LLM base URL and every call is logged with zero code changes, but you get isolated request logs, not agent structure.
- SDK-based (Nexus, Langfuse): add a few lines of instrumentation to your agent code; more setup, but the whole run becomes one trace with nested spans.
- Framework hooks (LangSmith with LangChain): near-zero config if you're already on the framework, manual SDK work if you're not.

For simple chatbots that make one or two LLM calls, proxy-based is often enough. For autonomous agents with tool use, loops, and multi-step reasoning, SDK-based instrumentation gives you the depth you need. The sketch below shows the two ends of that spectrum.
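
Here's a rough sketch of what those two ends look like in practice. The proxy URL and the `trace` decorator are placeholders for whatever your vendor provides, not any specific tool's API.

```python
# Two ends of the integration spectrum. The proxy URL and the observability
# SDK below are placeholders, not any specific vendor's actual API.
from openai import OpenAI

# Proxy-based: point the client at the vendor's gateway. Every LLM call is
# logged with no other code changes, but each call is an isolated record.
client = OpenAI(base_url="https://llm-proxy.example.com/v1")  # placeholder URL

# SDK-based: wrap the agent entry point so the whole run becomes one trace.
from my_observability_sdk import trace  # hypothetical SDK import

@trace(name="research_agent")
def run_agent(task: str) -> str:
    # LLM and tool calls made in this loop show up as child spans of one run,
    # instead of disconnected request logs.
    ...
```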

3. Data privacy

The question: Where does your trace data live? Who can see it? Can you self-host?

If your agents process PII, healthcare data, financial records, or proprietary business data, sending full inputs and outputs to a third-party SaaS may be a compliance blocker. Options:

- Hosted SaaS: fastest to set up, but your trace payloads live on the vendor's infrastructure.
- Self-hosted: Langfuse and Arize Phoenix can run on your own infrastructure, so trace data never leaves it.
- Redaction: keep the hosted tool, but mask sensitive fields before payloads are sent (a minimal sketch follows below).

For most indie developers and small teams, hosted SaaS is fine. For regulated industries, self-hosting or reviewing the vendor's data handling terms is non-negotiable.
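
As one example of the redaction option, here's a minimal pass you might run on trace payloads before they leave your infrastructure. The patterns and field names are illustrative only; real PII detection usually needs more than two regexes.

```python
# Minimal PII redaction on a trace payload before it is sent to a hosted
# backend. Patterns and field names are illustrative, not exhaustive.
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def redact(text: str) -> str:
    """Mask obvious PII in a string before logging it."""
    text = EMAIL.sub("[EMAIL]", text)
    return SSN.sub("[SSN]", text)

span_payload = {
    "input": "Contact jane.doe@example.com, SSN 123-45-6789",
    "output": "Drafted a reply to the customer.",
}
safe_payload = {key: redact(value) for key, value in span_payload.items()}
print(safe_payload["input"])  # "Contact [EMAIL], SSN [SSN]"
```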

4. Setup time

The question: How long until you see your first trace? How much ongoing maintenance does the integration require?

Setup time varies dramatically by tool type. Here's a realistic estimate:

| Tool | Time to first trace | Notes |
| --- | --- | --- |
| Nexus | < 2 min | 3-line SDK, no framework dependency |
| Langfuse | 5–10 min | More config for LangChain, simpler for standalone |
| LangSmith | 1–2 min (with LangChain) | Set env var and LangChain auto-instruments; without LangChain, manual SDK |
| Arize Phoenix | 15–30 min | Run local server or Colab notebook first |
| Helicone / Portkey | 1–5 min | Swap base URL for instant LLM call logging |

5. Team size and use case fit

The question: Is this tool designed for your team size and primary use case?

Tools have implicit target customers. Datadog LLM Observability is designed for enterprises already paying Datadog $10K+/year. W&B Weave is designed for ML researchers doing prompt experiments. Arize Phoenix is designed for data scientists in notebooks. Nexus is designed for indie developers and small teams shipping agents to production. Using the wrong tool for your team size means paying for features you'll never use or missing the ones you need.

Decision matrix by team type

| Team type | Best fit | Why |
| --- | --- | --- |
| Solo developer / indie | Nexus | $9/mo flat, minimal setup, framework-agnostic |
| LangChain team | LangSmith | Zero-config auto-tracing, deep LangChain integration |
| Self-hosting requirement | Langfuse | Best self-hosted option (21K stars, Docker, mature) |
| ML research team | W&B Weave | Already in W&B ecosystem, eval-first workflow |
| Gateway/routing needed | Portkey | Fallbacks, routing, virtual keys; observability is secondary |
| Enterprise on Datadog APM | Datadog | LLM obs integrates with existing dashboards, alerting, SLAs |
| Prompt eval focus | Braintrust | Best eval framework, test datasets, structured comparisons |

3 common evaluation mistakes

Mistake 1: Evaluating based on the feature checklist

Every tool has a feature comparison table on its pricing page. The problem: features are listed, not weighted. "Supports custom metadata" appears in the same row as "built-in evaluation framework" — but for your use case, one is essential and the other is irrelevant. Before reading any feature table, write down your top 3 requirements. Then evaluate only on those.

Mistake 2: Evaluating at zero scale

Tools that are free at zero scale often have steep pricing curves. Evaluate the pricing model at your expected production volume, not at your dev/test volume. If you're planning to process 50,000 agent runs/month by Q3, price that scenario for every tool you're evaluating. Usage-based tools look cheap in POCs and expensive in production.

Mistake 3: Choosing based on framework auto-tracing

"Zero code changes" is compelling. But proxy-based and auto-tracing integrations often give you request logs, not agent traces. If your agent has a loop that calls tools 8 times before finishing, a request log shows you 8 separate events with no connection between them. An SDK-based trace shows you a single agent run with 8 child spans in waterfall order. The extra 10 minutes of SDK setup is worth it for complex agents.

All comparison pages

If you want a detailed breakdown of Nexus vs a specific tool, we've written honest comparison pages for each:

If you're still not sure, the fastest path is to run a POC with your actual agent code — most tools have free tiers. Nexus has a free plan and takes under 2 minutes to set up. The decision matrix above is a starting point; your own agent's requirements are the final word.

Try Nexus free

1,000 traces/month, no credit card, under 2 minutes to set up. See if it fits.
