Observability for Cloudflare Workers AI Agents: Tracing Serverless LLM Calls with Nexus
Cloudflare Workers AI lets you run LLM inference inside a Worker with a single env.AI.run() call — no GPU provisioning, no model servers to scale, no infrastructure to babysit. But serverless doesn't mean invisible: model quota limits, per-model latency spikes, and token usage you can't see still affect production agents. Here's how to wrap every Workers AI call with Nexus spans for full trace-level visibility.
What Cloudflare Workers AI Is
Cloudflare Workers AI is a serverless LLM inference service built into the Workers runtime. You add an AI binding to your Worker and call env.AI.run(model, input) — Cloudflare handles GPU scheduling, model loading, and routing. There's no infrastructure to manage and no cold start penalty for inference itself.
The appeal is obvious for developers already on the Cloudflare stack: your Worker, your database (D1), your KV store, and your LLM inference all live in the same runtime with zero egress and generous free-tier limits. It's the fastest path from a Workers-based API to a working AI agent.
Why Serverless LLMs Still Need Observability
"Serverless" removes infrastructure concerns but not operational ones:
- Per-model latency variance — a 7B model and a 70B model have very different response times; when you switch models (or Cloudflare updates a model version), latency changes silently
- Model quota limits — Workers AI free tier enforces per-model neuron budgets; when you hit the limit, calls fail with a quota error that's hard to distinguish from a prompt error without span metadata
- Token usage blind spots — Workers AI returns `usage` in some model responses and omits it in others; without tracking it in spans, you have no signal on context growth in multi-turn agents
- Silent empty responses — some models return an empty `response` string on truncation or a content filter hit; your agent silently receives nothing
Nexus wraps each env.AI.run() call in a span, recording model name, latency, token counts when available, and error details — so you have full trace-level visibility without leaving the Workers ecosystem.
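When a call does fail, the error string recorded on the span is what lets you tell a quota hit from a prompt bug. A small heuristic tagger helps — note this is entirely a sketch: Cloudflare doesn't guarantee error message wording, so the string patterns below are assumptions you should adjust to match the errors you actually see in your spans.

```ts
// Heuristic error tagger for Workers AI failures. The string patterns
// are assumptions — inspect real error spans and adjust to match.
type AiErrorKind = 'quota' | 'timeout' | 'invalid_input' | 'unknown'

function classifyAiError(err: unknown): AiErrorKind {
  const msg = String(err).toLowerCase()
  if (msg.includes('quota') || msg.includes('neuron')) return 'quota'
  if (msg.includes('timeout') || msg.includes('timed out')) return 'timeout'
  if (msg.includes('invalid') || msg.includes('bad input')) return 'invalid_input'
  return 'unknown'
}
```

Record the result alongside `error: String(err)` in the span output so you can filter quota failures separately from prompt errors in the dashboard.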
The Core Pattern: Wrap env.AI.run() in a Span
Here's a complete Cloudflare Worker that instruments every inference call with Nexus:
```ts
import { NexusClient } from '@keylightdigital/nexus'

export interface Env {
  AI: Ai
  NEXUS_API_KEY: string
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const nexus = new NexusClient({
      apiKey: env.NEXUS_API_KEY,
      agentId: 'my-workers-ai-agent',
    })

    const body = await request.json<{ prompt: string }>()
    const model = '@cf/meta/llama-3-8b-instruct'

    const trace = await nexus.startTrace({
      name: `workers-ai: ${body.prompt.slice(0, 60)}`,
      metadata: { model },
    })
    const span = await trace.addSpan({
      name: 'workers-ai-inference',
      input: { prompt: body.prompt, model },
    })

    const start = Date.now()
    try {
      const response = await env.AI.run(model, {
        prompt: body.prompt,
      }) as { response: string; usage?: { input_tokens: number; output_tokens: number } }
      const latencyMs = Date.now() - start

      // Empty responses arrive as a normal result, not an error — check explicitly
      if (!response.response?.trim()) {
        await span.end({
          status: 'error',
          output: { error: 'empty_response', model, latency_ms: latencyMs },
        })
        await trace.end({ status: 'error' })
        return new Response('Model returned empty response', { status: 500 })
      }

      await span.end({
        status: 'ok',
        output: {
          model,
          latency_ms: latencyMs,
          // Workers AI returns usage for some models — guard for undefined
          input_tokens: response.usage?.input_tokens ?? null,
          output_tokens: response.usage?.output_tokens ?? null,
          response_preview: response.response.slice(0, 200),
        },
      })
      await trace.end({ status: 'success' })
      return Response.json({ result: response.response })
    } catch (err) {
      const latencyMs = Date.now() - start
      await span.end({
        status: 'error',
        output: { error: String(err), model, latency_ms: latencyMs },
      })
      await trace.end({ status: 'error' })
      throw err
    }
  },
}
```
Key implementation notes:
- Create one `NexusClient` per request — Workers are stateless, so client initialization is cheap
- Record `latency_ms` manually using `Date.now()` before and after `env.AI.run()`
- Guard against `response.usage` being `undefined` — token counts are not available for all Workers AI models
- Detect empty responses explicitly — they arrive as an empty string, not an error, so they need to be checked and recorded as error spans
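Since this pattern repeats for every inference call, it's worth factoring into a helper. Here's a sketch — the `NexusSpan` and `NexusTrace` interfaces below are minimal structural stand-ins for the real SDK types, so treat the exact shapes as assumptions and adapt them to `@keylightdigital/nexus`:

```ts
// Minimal structural stand-ins for the Nexus SDK's span/trace types
// (assumed shapes — the real @keylightdigital/nexus types may differ).
interface NexusSpan {
  end(opts: { status: 'ok' | 'error'; output: Record<string, unknown> }): Promise<void>
}
interface NexusTrace {
  addSpan(opts: { name: string; input: Record<string, unknown> }): Promise<NexusSpan>
}

interface AiRunResult {
  response: string
  usage?: { input_tokens: number; output_tokens: number }
}

// Wrap one env.AI.run() call in a span: latency, optional token usage,
// empty-response detection, and error recording all in one place.
async function tracedRun(
  trace: NexusTrace,
  ai: { run(model: string, input: Record<string, unknown>): Promise<unknown> },
  model: string,
  input: Record<string, unknown>,
): Promise<{ response: string; latencyMs: number }> {
  const span = await trace.addSpan({ name: `inference:${model}`, input: { model, ...input } })
  const start = Date.now()

  let res: AiRunResult
  try {
    res = await ai.run(model, input) as AiRunResult
  } catch (err) {
    await span.end({
      status: 'error',
      output: { error: String(err), model, latency_ms: Date.now() - start },
    })
    throw err
  }

  const latencyMs = Date.now() - start
  if (!res.response?.trim()) {
    // Empty responses are a normal-looking result, not a thrown error
    await span.end({
      status: 'error',
      output: { error: 'empty_response', model, latency_ms: latencyMs },
    })
    throw new Error(`empty response from ${model}`)
  }

  await span.end({
    status: 'ok',
    output: {
      model,
      latency_ms: latencyMs,
      input_tokens: res.usage?.input_tokens ?? null,
      output_tokens: res.usage?.output_tokens ?? null,
    },
  })
  return { response: res.response, latencyMs }
}
```

In the Worker, `const { response } = await tracedRun(trace, env.AI, model, { prompt: body.prompt })` then replaces the whole hand-rolled try/catch.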
Multi-Model Comparison: Parallel Spans Under One Trace
If you're comparing two models for the same prompt — a common pattern when evaluating whether to switch from Llama 3 8B to Mistral 7B — you can run both inference calls in parallel and record each as a separate span under a single trace:
```ts
// Compare two models with separate spans under one trace
async function compareModels(trace: Trace, prompt: string, env: Env) {
  const models = [
    '@cf/meta/llama-3-8b-instruct',
    '@cf/mistral/mistral-7b-instruct-v0.1',
  ]

  const results = await Promise.all(
    models.map(async (model) => {
      const span = await trace.addSpan({
        name: `inference:${model.split('/').pop()}`,
        input: { prompt, model },
      })
      const start = Date.now()
      try {
        const response = await env.AI.run(model, { prompt }) as { response: string }
        const latencyMs = Date.now() - start
        await span.end({
          status: 'ok',
          output: { model, latency_ms: latencyMs, response: response.response.slice(0, 200) },
        })
        return { model, response: response.response, latencyMs }
      } catch (err) {
        // End the span even on failure so it shows up as a red span, not a gap
        await span.end({
          status: 'error',
          output: { error: String(err), model, latency_ms: Date.now() - start },
        })
        throw err
      }
    })
  )
  return results
}
```
In the Nexus trace detail view, you'll see both spans side-by-side with their individual latencies — making it easy to compare model performance on real production prompts.
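One caveat of the Promise.all above: if either model throws (a quota error, say), the whole comparison rejects and the surviving model's result is discarded. For evaluation runs you usually want whatever succeeded. A sketch using Promise.allSettled — `compareSettled` and `runOne` are illustrative names, and `runOne` stands in for the per-model span-wrapped call:

```ts
// Run several model calls in parallel but keep partial results:
// a failed model becomes an { ok: false, error } entry instead of
// rejecting the whole comparison.
type ModelOutcome =
  | { model: string; ok: true; response: string; latencyMs: number }
  | { model: string; ok: false; error: string }

async function compareSettled(
  models: string[],
  runOne: (model: string) => Promise<{ response: string; latencyMs: number }>,
): Promise<ModelOutcome[]> {
  const settled = await Promise.allSettled(models.map((m) => runOne(m)))
  return settled.map((s, i) =>
    s.status === 'fulfilled'
      ? { model: models[i], ok: true as const, ...s.value }
      : { model: models[i], ok: false as const, error: String(s.reason) }
  )
}
```

The error branch pairs naturally with the error-span handling in compareModels: the failed model still shows up as a red span, and its failure no longer hides the other model's latency numbers.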
AI Gateway Compatibility
Cloudflare AI Gateway sits between your Worker and Workers AI inference, adding rate limiting, response caching, and its own usage logs. Nexus is complementary: AI Gateway gives you infrastructure-level logs; Nexus gives you application-level span context — which user request triggered this inference, what the agent was trying to accomplish, and custom metadata you want to query later.
```ts
// AI Gateway adds rate limiting, caching, and logging on top of Workers AI.
// Route a call through a gateway by passing its id in the options argument
// of env.AI.run() — no wrangler.toml change is needed:
const span = await trace.addSpan({
  name: 'workers-ai-inference',
  input: {
    prompt: body.prompt,
    model,
    via_ai_gateway: true, // tag for filtering in Nexus
    gateway_id: 'my-gateway-id',
  },
})
const response = await env.AI.run(
  model,
  { prompt: body.prompt },
  { gateway: { id: 'my-gateway-id' } },
)
```
There's no conflict between the two. Your Worker routes calls through AI Gateway as normal; Nexus spans wrap those calls from the application layer without affecting routing or caching.
Agentic Loops: Per-Call Spans Across Multiple Turns
For Workers that implement a multi-turn agent loop — calling env.AI.run() multiple times in a single request — create one span per inference call and share a single trace across the full loop:
```ts
// Agentic loop: multiple inference calls, one trace
const MAX_TURNS = 6

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const nexus = new NexusClient({ apiKey: env.NEXUS_API_KEY, agentId: 'workers-ai-agent' })
    const { task } = await request.json<{ task: string }>()
    const model = '@cf/meta/llama-3-8b-instruct'

    const trace = await nexus.startTrace({
      name: `agent: ${task.slice(0, 60)}`,
      metadata: { model },
    })

    const messages = [{ role: 'user', content: task }]
    let iteration = 0

    try {
      while (iteration < MAX_TURNS) {
        iteration++
        const span = await trace.addSpan({
          name: `llm-call-${iteration}`,
          input: { iteration, message_count: messages.length },
        })
        const start = Date.now()
        const resp = await env.AI.run(model, { messages }) as {
          response: string
          usage?: { input_tokens: number; output_tokens: number }
        }
        await span.end({
          status: 'ok',
          output: {
            latency_ms: Date.now() - start,
            input_tokens: resp.usage?.input_tokens ?? null,
            output_tokens: resp.usage?.output_tokens ?? null,
          },
        })

        messages.push({ role: 'assistant', content: resp.response })
        if (resp.response.includes('DONE') || iteration >= MAX_TURNS) break
        messages.push({ role: 'user', content: 'Continue.' })
      }

      await trace.end({ status: 'success' })
      return Response.json({ result: messages[messages.length - 1].content })
    } catch (err) {
      await trace.end({ status: 'error' })
      throw err
    }
  },
}
```
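Multi-turn loops are where context growth bites: each turn re-sends the entire messages array, so input tokens climb every iteration. A small accumulator makes the trend visible per trace — a sketch, assuming the same `usage` shape as above (which, remember, not every model returns):

```ts
interface TokenUsage { input_tokens: number; output_tokens: number }

// Sum per-turn usage into a running total; tolerate turns where
// Workers AI omitted the usage field entirely.
function accumulateUsage(total: TokenUsage, step?: TokenUsage): TokenUsage {
  return {
    input_tokens: total.input_tokens + (step?.input_tokens ?? 0),
    output_tokens: total.output_tokens + (step?.output_tokens ?? 0),
  }
}
```

Inside the loop, update with `totals = accumulateUsage(totals, resp.usage)`; after the loop, record `totals` in one final span so each trace carries its cumulative token cost alongside the per-call numbers.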
What You'll See in the Nexus Dashboard
Once instrumented, every Worker execution appears as a trace. For Workers AI agents, the most useful signals are:
- Latency per model — compare `@cf/meta/llama-3-8b-instruct` vs. `@cf/mistral/mistral-7b-instruct-v0.1` across real requests
- Token usage trends — track output token counts to catch runaway generation before it hits quota limits
- Error spans — quota errors, empty responses, and timeouts show up as red spans with metadata, not silent failures
- Model field — filter traces by model version to isolate regressions after a Cloudflare model update
Getting Started
Install the Nexus SDK:

```sh
npm install @keylightdigital/nexus
```
Add NEXUS_API_KEY as a Worker secret with wrangler secret put NEXUS_API_KEY (secrets are stored by Cloudflare, not in wrangler.toml), grab a free API key at nexus.keylightdigital.dev/pricing, and you'll have spans flowing from your Workers AI agent in under five minutes.
Ready to see inside your Cloudflare Workers AI agents?
Start free — no credit card required. Up to 10,000 spans/month on the free tier.
Start monitoring for free →