Observability for Cloudflare Workers AI Agents: Tracing Serverless LLM Calls with Nexus
Cloudflare Workers AI lets you run LLM inference inside a Worker with a single env.AI.run() call — no GPU provisioning, no model servers to scale, no infrastructure to babysit. But serverless doesn't mean invisible: model quota limits, per-model latency spikes, and token usage you can't see still affect production agents. Here's how to wrap every Workers AI call with Nexus spans for full trace-level visibility.
What Cloudflare Workers AI Is
Cloudflare Workers AI is a serverless LLM inference service built into the Workers runtime. You add an AI binding to your Worker and call env.AI.run(model, input) — Cloudflare handles GPU scheduling, model loading, and routing. There's no infrastructure to manage and no cold start penalty for inference itself.
The appeal is obvious for developers already on the Cloudflare stack: your Worker, your database (D1), your KV store, and your LLM inference all live in the same runtime with zero egress and generous free-tier limits. It's the fastest path from a Workers-based API to a working AI agent.
Why Serverless LLMs Still Need Observability
"Serverless" removes infrastructure concerns but not operational ones:
- Per-model latency variance — a 7B model and a 70B model have very different response times; when you switch models (or Cloudflare updates a model version), latency changes silently
- Model quota limits — Workers AI free tier enforces per-model neuron budgets; when you hit the limit, calls fail with a quota error that's hard to distinguish from a prompt error without span metadata
- Token usage blind spots — Workers AI returns `usage` in some model responses and omits it in others; without tracking it in spans, you have no signal on context growth in multi-turn agents
- Silent empty responses — some models return an empty `response` string on truncation or a content filter hit; your agent silently receives nothing
Nexus wraps each env.AI.run() call in a span, recording model name, latency, token counts when available, and error details — so you have full trace-level visibility without leaving the Workers ecosystem.
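When a call does fail, the error string recorded on the span is what lets you tell a quota hit from a prompt bug. A small heuristic tagger helps — note this is entirely a sketch: Cloudflare doesn't guarantee error message wording, so the string patterns below are assumptions you should adjust to match the errors you actually see in your spans.

```ts
// Heuristic error tagger for Workers AI failures. The string patterns
// are assumptions — inspect real error spans and adjust to match.
type AiErrorKind = 'quota' | 'timeout' | 'invalid_input' | 'unknown'

function classifyAiError(err: unknown): AiErrorKind {
  const msg = String(err).toLowerCase()
  if (msg.includes('quota') || msg.includes('neuron')) return 'quota'
  if (msg.includes('timeout') || msg.includes('timed out')) return 'timeout'
  if (msg.includes('invalid') || msg.includes('bad input')) return 'invalid_input'
  return 'unknown'
}
```

Record the result alongside `error: String(err)` in the span output so you can filter quota failures separately from prompt errors in the dashboard.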
The Core Pattern: Wrap env.AI.run() in a Span
Here's a complete Cloudflare Worker that instruments every inference call with Nexus:
```ts
import { NexusClient } from '@keylightdigital/nexus'

export interface Env {
  AI: Ai
  NEXUS_API_KEY: string
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const nexus = new NexusClient({
      apiKey: env.NEXUS_API_KEY,
      agentId: 'my-workers-ai-agent',
    })

    const body = await request.json<{ prompt: string }>()
    const model = '@cf/meta/llama-3-8b-instruct'

    const trace = await nexus.startTrace({
      name: `workers-ai: ${body.prompt.slice(0, 60)}`,
      metadata: { model },
    })
    const span = await trace.addSpan({
      name: 'workers-ai-inference',
      input: { prompt: body.prompt, model },
    })

    const start = Date.now()
    try {
      const response = await env.AI.run(model, {
        prompt: body.prompt,
      }) as { response: string; usage?: { input_tokens: number; output_tokens: number } }
      const latencyMs = Date.now() - start

      // Empty responses arrive as a normal result, not an error — check explicitly
      if (!response.response?.trim()) {
        await span.end({
          status: 'error',
          output: { error: 'empty_response', model, latency_ms: latencyMs },
        })
        await trace.end({ status: 'error' })
        return new Response('Model returned empty response', { status: 500 })
      }

      await span.end({
        status: 'ok',
        output: {
          model,
          latency_ms: latencyMs,
          // Workers AI returns usage for some models — guard for undefined
          input_tokens: response.usage?.input_tokens ?? null,
          output_tokens: response.usage?.output_tokens ?? null,
          response_preview: response.response.slice(0, 200),
        },
      })
      await trace.end({ status: 'success' })
      return Response.json({ result: response.response })
    } catch (err) {
      const latencyMs = Date.now() - start
      await span.end({
        status: 'error',
        output: { error: String(err), model, latency_ms: latencyMs },
      })
      await trace.end({ status: 'error' })
      throw err
    }
  },
}
```
Key implementation notes:
- Create one `NexusClient` per request — Workers are stateless, so client initialization is cheap
- Record `latency_ms` manually using `Date.now()` before and after `env.AI.run()`
- Guard against `response.usage` being `undefined` — token counts are not available for all Workers AI models
- Detect empty responses explicitly — they arrive as an empty string, not an error, so they need to be checked and recorded as error spans
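Since this pattern repeats for every inference call, it's worth factoring into a helper. Here's a sketch — the `NexusSpan` and `NexusTrace` interfaces below are minimal structural stand-ins for the real SDK types, so treat the exact shapes as assumptions and adapt them to `@keylightdigital/nexus`:

```ts
// Minimal structural stand-ins for the Nexus SDK's span/trace types
// (assumed shapes — the real @keylightdigital/nexus types may differ).
interface NexusSpan {
  end(opts: { status: 'ok' | 'error'; output: Record<string, unknown> }): Promise<void>
}
interface NexusTrace {
  addSpan(opts: { name: string; input: Record<string, unknown> }): Promise<NexusSpan>
}

interface AiRunResult {
  response: string
  usage?: { input_tokens: number; output_tokens: number }
}

// Wrap one env.AI.run() call in a span: latency, optional token usage,
// empty-response detection, and error recording all in one place.
async function tracedRun(
  trace: NexusTrace,
  ai: { run(model: string, input: Record<string, unknown>): Promise<unknown> },
  model: string,
  input: Record<string, unknown>,
): Promise<{ response: string; latencyMs: number }> {
  const span = await trace.addSpan({ name: `inference:${model}`, input: { model, ...input } })
  const start = Date.now()

  let res: AiRunResult
  try {
    res = await ai.run(model, input) as AiRunResult
  } catch (err) {
    await span.end({
      status: 'error',
      output: { error: String(err), model, latency_ms: Date.now() - start },
    })
    throw err
  }

  const latencyMs = Date.now() - start
  if (!res.response?.trim()) {
    // Empty responses are a normal-looking result, not a thrown error
    await span.end({
      status: 'error',
      output: { error: 'empty_response', model, latency_ms: latencyMs },
    })
    throw new Error(`empty response from ${model}`)
  }

  await span.end({
    status: 'ok',
    output: {
      model,
      latency_ms: latencyMs,
      input_tokens: res.usage?.input_tokens ?? null,
      output_tokens: res.usage?.output_tokens ?? null,
    },
  })
  return { response: res.response, latencyMs }
}
```

In the Worker, `const { response } = await tracedRun(trace, env.AI, model, { prompt: body.prompt })` then replaces the whole hand-rolled try/catch.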
Multi-Model Comparison: Parallel Spans Under One Trace
If you're comparing two models for the same prompt — a common pattern when evaluating whether to switch from Llama 3 8B to Mistral 7B — you can run both inference calls in parallel and record each as a separate span under a single trace:
```ts
// Compare two models with separate spans under one trace
async function compareModels(trace: Trace, prompt: string, env: Env) {
  const models = [
    '@cf/meta/llama-3-8b-instruct',
    '@cf/mistral/mistral-7b-instruct-v0.1',
  ]

  const results = await Promise.all(
    models.map(async (model) => {
      const span = await trace.addSpan({
        name: `inference:${model.split('/').pop()}`,
        input: { prompt, model },
      })
      const start = Date.now()
      try {
        const response = await env.AI.run(model, { prompt }) as { response: string }
        const latencyMs = Date.now() - start
        await span.end({
          status: 'ok',
          output: { model, latency_ms: latencyMs, response: response.response.slice(0, 200) },
        })
        return { model, response: response.response, latencyMs }
      } catch (err) {
        // End the span even on failure so it shows up as a red span, not a gap
        await span.end({
          status: 'error',
          output: { error: String(err), model, latency_ms: Date.now() - start },
        })
        throw err
      }
    })
  )
  return results
}
```
In the Nexus trace detail view, you'll see both spans side-by-side with their individual latencies — making it easy to compare model performance on real production prompts.
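One caveat of the Promise.all above: if either model throws (a quota error, say), the whole comparison rejects and the surviving model's result is discarded. For evaluation runs you usually want whatever succeeded. A sketch using Promise.allSettled — `compareSettled` and `runOne` are illustrative names, and `runOne` stands in for the per-model span-wrapped call:

```ts
// Run several model calls in parallel but keep partial results:
// a failed model becomes an { ok: false, error } entry instead of
// rejecting the whole comparison.
type ModelOutcome =
  | { model: string; ok: true; response: string; latencyMs: number }
  | { model: string; ok: false; error: string }

async function compareSettled(
  models: string[],
  runOne: (model: string) => Promise<{ response: string; latencyMs: number }>,
): Promise<ModelOutcome[]> {
  const settled = await Promise.allSettled(models.map((m) => runOne(m)))
  return settled.map((s, i) =>
    s.status === 'fulfilled'
      ? { model: models[i], ok: true as const, ...s.value }
      : { model: models[i], ok: false as const, error: String(s.reason) }
  )
}
```

The error branch pairs naturally with the error-span handling in compareModels: the failed model still shows up as a red span, and its failure no longer hides the other model's latency numbers.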
AI Gateway Compatibility
Cloudflare AI Gateway sits between your Worker and Workers AI inference, adding rate limiting, response caching, and its own usage logs. Nexus is complementary: AI Gateway gives you infrastructure-level logs; Nexus gives you application-level span context — which user request triggered this inference, what the agent was trying to accomplish, and custom metadata you want to query later.
```ts
// AI Gateway adds rate limiting, caching, and logging on top of Workers AI.
// Route a call through a gateway by passing its id in the options argument
// of env.AI.run() — no wrangler.toml change is needed:
const span = await trace.addSpan({
  name: 'workers-ai-inference',
  input: {
    prompt: body.prompt,
    model,
    via_ai_gateway: true, // tag for filtering in Nexus
    gateway_id: 'my-gateway-id',
  },
})
const response = await env.AI.run(
  model,
  { prompt: body.prompt },
  { gateway: { id: 'my-gateway-id' } },
)
```
There's no conflict between the two. Your Worker routes calls through AI Gateway as normal; Nexus spans wrap those calls from the application layer without affecting routing or caching.
Agentic Loops: Per-Call Spans Across Multiple Turns
For Workers that implement a multi-turn agent loop — calling env.AI.run() multiple times in a single request — create one span per inference call and share a single trace across the full loop:
```ts
// Agentic loop: multiple inference calls, one trace
const MAX_TURNS = 6

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const nexus = new NexusClient({ apiKey: env.NEXUS_API_KEY, agentId: 'workers-ai-agent' })
    const { task } = await request.json<{ task: string }>()
    const model = '@cf/meta/llama-3-8b-instruct'

    const trace = await nexus.startTrace({
      name: `agent: ${task.slice(0, 60)}`,
      metadata: { model },
    })

    const messages = [{ role: 'user', content: task }]
    let iteration = 0

    try {
      while (iteration < MAX_TURNS) {
        iteration++
        const span = await trace.addSpan({
          name: `llm-call-${iteration}`,
          input: { iteration, message_count: messages.length },
        })
        const start = Date.now()
        const resp = await env.AI.run(model, { messages }) as {
          response: string
          usage?: { input_tokens: number; output_tokens: number }
        }
        await span.end({
          status: 'ok',
          output: {
            latency_ms: Date.now() - start,
            input_tokens: resp.usage?.input_tokens ?? null,
            output_tokens: resp.usage?.output_tokens ?? null,
          },
        })

        messages.push({ role: 'assistant', content: resp.response })
        if (resp.response.includes('DONE') || iteration >= MAX_TURNS) break
        messages.push({ role: 'user', content: 'Continue.' })
      }

      await trace.end({ status: 'success' })
      return Response.json({ result: messages[messages.length - 1].content })
    } catch (err) {
      await trace.end({ status: 'error' })
      throw err
    }
  },
}
```
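Multi-turn loops are where context growth bites: each turn re-sends the entire messages array, so input tokens climb every iteration. A small accumulator makes the trend visible per trace — a sketch, assuming the same `usage` shape as above (which, remember, not every model returns):

```ts
interface TokenUsage { input_tokens: number; output_tokens: number }

// Sum per-turn usage into a running total; tolerate turns where
// Workers AI omitted the usage field entirely.
function accumulateUsage(total: TokenUsage, step?: TokenUsage): TokenUsage {
  return {
    input_tokens: total.input_tokens + (step?.input_tokens ?? 0),
    output_tokens: total.output_tokens + (step?.output_tokens ?? 0),
  }
}
```

Inside the loop, update with `totals = accumulateUsage(totals, resp.usage)`; after the loop, record `totals` in one final span so each trace carries its cumulative token cost alongside the per-call numbers.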
What You'll See in the Nexus Dashboard
Once instrumented, every Worker execution appears as a trace. For Workers AI agents, the most useful signals are:
- Latency per model — compare `@cf/meta/llama-3-8b-instruct` vs. `@cf/mistral/mistral-7b-instruct-v0.1` across real requests
- Token usage trends — track output token counts to catch runaway generation before it hits quota limits
- Error spans — quota errors, empty responses, and timeouts show up as red spans with metadata, not silent failures
- Model field — filter traces by model version to isolate regressions after a Cloudflare model update
Getting Started
Install the Nexus SDK:

```sh
npm install @keylightdigital/nexus
```
Add NEXUS_API_KEY as a Worker secret with wrangler secret put NEXUS_API_KEY (secrets are stored by Cloudflare, not in wrangler.toml), grab a free API key at nexus.keylightdigital.dev/pricing, and you'll have spans flowing from your Workers AI agent in under five minutes.
Ready to see inside your Cloudflare Workers AI agents?
Start free — no credit card required. Up to 10,000 spans/month on the free tier.
Start monitoring for free →