The AI observability tooling landscape has exploded in the past 18 months. Langfuse, LangSmith, Arize Phoenix, AgentOps, Helicone, Braintrust, Datadog LLM Observability, Portkey, Nexus — and a dozen more launched since last year. Every comparison table lists 30 features and calls it a guide. That doesn't help you decide.
This post is different. We'll walk through 5 criteria that actually separate tools in practice, a decision matrix by team type, and the 3 most common mistakes teams make when evaluating. We'll be honest — there are cases where Nexus isn't the right answer.
The 5 criteria that actually matter
1. Cost model
The question: Is it flat-rate or usage-based? Will costs grow linearly with your agent volume?
Usage-based pricing (per log, per token, per request) is cheap at zero scale and expensive at production scale. A single AI agent making 50 LLM calls per run, handling 1,000 runs/day, generates 50,000 logged calls/day. At $0.001/call (a typical tier), that's $50/day, or $1,500/month — not $9.
Flat-rate tools (Nexus at $9/mo, some tiers of Langfuse) are predictable regardless of volume. Usage-based tools (Helicone, AgentOps, Braintrust, Datadog) start cheap and scale with you — which is either a feature (you only pay for what you use) or a trap (costs surprise you at 10× load).
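The arithmetic above is worth generalizing for your own numbers. A quick sketch in plain Python, using the figures from the example (real per-call rates vary by vendor and tier):

```python
# Project monthly observability cost under usage-based pricing.
# Figures mirror the worked example above; real rates vary by vendor and tier.

def monthly_usage_cost(calls_per_run: int, runs_per_day: int,
                       price_per_call: float, days: int = 30) -> float:
    """Logged calls per day, times price per call, times days per month."""
    logged_calls_per_day = calls_per_run * runs_per_day
    return logged_calls_per_day * price_per_call * days

cost = monthly_usage_cost(calls_per_run=50, runs_per_day=1_000,
                          price_per_call=0.001)
print(f"${cost:,.0f}/month")  # → $1,500/month
```

Run it against your projected production volume, not your dev volume — the gap between the two is where pricing surprises live.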
2. Integration depth
The question: Does the tool require a proxy, an SDK, or framework-specific hooks? What visibility do you get at each level?
There are three integration models, each with tradeoffs:
- Proxy-based (Helicone, Portkey): Route LLM calls through a gateway. Zero code changes, automatic request logging. The tradeoff: you see LLM calls, not agent logic. No visibility into tool orchestration, loop detection, or multi-step reasoning.
- SDK-based (Nexus, LangSmith, Braintrust): Instrument your code with traces and spans. More setup, but you capture agent-level semantics: the full run, individual steps, tool calls as spans, sub-agent invocations. The right choice for complex agents.
- Framework hooks (Langfuse, LangSmith): Automatic tracing if you use LangChain or specific frameworks. Zero setup for framework users, but limited to what the framework exposes.
For simple chatbots that make one or two LLM calls, proxy-based is often enough. For autonomous agents with tool use, loops, and multi-step reasoning, SDK-based instrumentation gives you the depth you need.
3. Data privacy
The question: Where does your trace data live? Who can see it? Can you self-host?
If your agents process PII, healthcare data, financial records, or proprietary business data, sending full inputs and outputs to a third-party SaaS may be a compliance blocker. Options:
- Self-hosted: Langfuse (Docker), Arize Phoenix (Python server) — full data control, operational overhead
- Hosted with data agreements: LangSmith, Nexus — SOC 2 compliance and DPA available
- Edge-hosted (Nexus): Cloudflare-native, so trace data is processed at edge nodes globally and stored in D1 — no centralized US server
For most indie developers and small teams, hosted SaaS is fine. For regulated industries, self-hosting or reviewing the vendor's data handling terms is non-negotiable.
4. Setup time
The question: How long until you see your first trace? How much ongoing maintenance does the integration require?
Setup time varies dramatically by tool type. Here's a realistic estimate:
| Tool | Time to first trace | Notes |
|---|---|---|
| Nexus | < 2 min | 3-line SDK, no framework dependency |
| Langfuse | 5–10 min | More config for LangChain, simpler for standalone |
| LangSmith | 1–2 min (with LangChain) | Set env var — LangChain auto-instruments. Without LangChain: manual SDK. |
| Arize Phoenix | 15–30 min | Run local server or Colab notebook first |
| Helicone / Portkey | 1–5 min | Swap base URL — instant LLM call logging |
5. Team size and use case fit
The question: Is this tool designed for your team size and primary use case?
Tools have implicit target customers. Datadog LLM Observability is designed for enterprises already paying Datadog $10K+/year. W&B Weave is designed for ML researchers doing prompt experiments. Arize Phoenix is designed for data scientists in notebooks. Nexus is designed for indie developers and small teams shipping agents to production. Using the wrong tool for your team size means paying for features you'll never use or missing the ones you need.
Decision matrix by team type
| Team type | Best fit | Why |
|---|---|---|
| Solo developer / indie | Nexus | $9/mo flat, minimal setup, framework-agnostic |
| LangChain team | LangSmith | Zero-config auto-tracing, deep LangChain integration |
| Self-hosting requirement | Langfuse | Best self-hosted option (21K stars, Docker, mature) |
| ML research team | W&B Weave | Already in W&B ecosystem, eval-first workflow |
| Gateway/routing needed | Portkey | Fallbacks, routing, virtual keys — observability is secondary |
| Enterprise on Datadog APM | Datadog | LLM obs integrates with existing dashboards, alerting, SLAs |
| Prompt eval focus | Braintrust | Best eval framework, test datasets, structured comparisons |
3 common evaluation mistakes
Mistake 1: Evaluating based on the feature checklist
Every tool has a feature comparison table on its pricing page. The problem: features are listed, not weighted. "Supports custom metadata" appears in the same row as "built-in evaluation framework" — but for your use case, one is essential and the other is irrelevant. Before reading any feature table, write down your top 3 requirements. Then evaluate only on those.
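One way to make "evaluate only on your top 3 requirements" concrete is a tiny weighted scorecard. A sketch — the requirement names, weights, and scores below are illustrative, not taken from any vendor's comparison table:

```python
# Hypothetical weighted scorecard: score tools only on the requirements
# you declared up front. Names, weights, and scores are illustrative.

def score(tool_scores: dict[str, int], weights: dict[str, int]) -> int:
    """Weighted sum over declared requirements; everything else is ignored."""
    return sum(weights[req] * tool_scores.get(req, 0) for req in weights)

weights = {"flat_pricing": 5, "sdk_depth": 4, "setup_time": 3}  # your top 3

tools = {
    "Tool A": {"flat_pricing": 5, "sdk_depth": 4, "setup_time": 5, "evals": 1},
    "Tool B": {"flat_pricing": 1, "sdk_depth": 5, "setup_time": 3, "evals": 5},
}

for name, s in tools.items():
    print(name, score(s, weights))
```

Note that the `evals` score never enters the sum — that's the point: a feature that isn't in your requirements list shouldn't move your decision, no matter how prominently the pricing page lists it.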
Mistake 2: Evaluating at zero scale
Tools that are free at zero scale often have steep pricing curves. Evaluate the pricing model at your expected production volume, not at your dev/test volume. If you're planning to process 50,000 agent runs/month by Q3, price that scenario for every tool you're evaluating. Usage-based tools look cheap in POCs and expensive in production.
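A useful way to frame this is the break-even volume: the point where usage-based pricing overtakes a flat-rate plan. A sketch with illustrative numbers — the $9/mo and $0.001/call figures echo this post's earlier examples; substitute real quotes from the tools you're evaluating:

```python
# Break-even logged-call volume between a flat-rate plan and usage-based
# pricing. $9/mo and $0.001/call echo this post's examples, not real quotes.

def break_even_calls(flat_monthly: float, price_per_call: float) -> float:
    """Monthly logged calls at which the two pricing models cost the same."""
    return flat_monthly / price_per_call

calls = break_even_calls(flat_monthly=9.0, price_per_call=0.001)
print(f"{calls:,.0f} logged calls/month")  # → 9,000 logged calls/month
```

At the 50,000 runs/month scenario with 50 calls per run (2.5M logged calls), usage-based pricing at that rate would come to $2,500/month — orders of magnitude past break-even.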
Mistake 3: Choosing based on framework auto-tracing
"Zero code changes" is compelling. But proxy-based and auto-tracing integrations often give you request logs, not agent traces. If your agent has a loop that calls tools 8 times before finishing, a request log shows you 8 separate events with no connection between them. An SDK-based trace shows you a single agent run with 8 child spans in waterfall order. The extra 10 minutes of SDK setup is worth it for complex agents.
All comparison pages
If you want a detailed breakdown of Nexus vs a specific tool, we've written honest comparison pages for each:
- Nexus vs Langfuse — hosted simplicity vs OSS self-hosting
- Nexus vs LangSmith — framework-agnostic vs LangChain-native
- Nexus vs Helicone — SDK instrumentation vs proxy logging
- Nexus vs Braintrust — production monitoring vs prompt evaluation
- Nexus vs Arize Phoenix — hosted vs self-hosted Jupyter-native
- Nexus vs AgentOps — trace/span model vs session-based monitoring
- Nexus vs Datadog LLM Observability — purpose-built vs bolted-on APM
- Nexus vs Weights & Biases Weave — production monitoring vs experiment tracking
- Nexus vs Portkey — observability-first vs gateway-first
If you're still not sure, the fastest path is to run a POC with your actual agent code — most tools have free tiers. Nexus has a free plan and takes under 2 minutes to set up. The decision matrix above is a starting point; your own agent's requirements are the final word.