Integration guide

OpenAI Realtime API Observability

The OpenAI Realtime API streams audio and text back to users turn by turn. Standard request/response tracing doesn't capture this flow. This guide shows how to wrap each turn in a Nexus span so you get a complete conversation trace in the dashboard.

Realtime API overview

The OpenAI Realtime API uses a persistent WebSocket connection to stream audio and text between the client and the model. Unlike a standard chat completion, a single session contains multiple turns — each turn is a user utterance followed by a model response.

Realtime session anatomy

1. Session open — WebSocket connects, system prompt is set
2. Turn start — user speaks or types, audio is streamed in
3. Model response — model streams audio + optional text back
4. Session close — connection is torn down, final usage is emitted

Nexus models each session as a trace and each turn as a span. You can see the full conversation timeline in the waterfall view.

Why streaming agents differ

With a standard chat completion you get one request and one response — easy to time and log. The Realtime API is different in three ways:

  • No single response object — audio arrives as chunks; text arrives in response.text.delta events
  • Turn boundaries are events — conversation.item.created and response.done mark the start and end of a turn
  • Usage is deferred — token counts arrive in response.done after the response finishes streaming

The pattern that works: open a span on conversation.item.created and close it on response.done, capturing the usage data from the done event.
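The pattern can be reduced to a small state machine before any SDK enters the picture. This sketch tracks turns as plain objects; the TurnTracker class and its Turn type are illustrative names, not part of the Realtime API or the Nexus SDK — only the two event names come from OpenAI.

```typescript
// Minimal sketch of the turn-boundary pattern: open on the user's
// conversation.item.created, close on response.done.
type Turn = { id: string; startedAt: number; endedAt?: number; usage?: unknown }

class TurnTracker {
  private current: Turn | null = null
  readonly completed: Turn[] = []

  // Called for conversation.item.created — only user items start a turn
  onItemCreated(item: { id: string; role: string }, now = Date.now()): void {
    if (item.role !== 'user') return
    this.current = { id: item.id, startedAt: now }
  }

  // Called for response.done — attaches deferred usage and closes the turn
  onResponseDone(usage: unknown, now = Date.now()): void {
    if (!this.current) return
    this.current.endedAt = now
    this.current.usage = usage
    this.completed.push(this.current)
    this.current = null
  }
}
```

In the real integration (below), "open" and "close" become startSpan and end calls, but the bookkeeping is the same: one nullable current turn, reset after each response.done.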

Per-turn span wrapping

Install the Nexus SDK and create a trace when the session opens. Then open a span for each turn and close it when the response finishes.

import { RealtimeClient } from '@openai/realtime-api-beta'
import Nexus from 'nexus-sdk'

const nexus = new Nexus({ apiKey: process.env.NEXUS_API_KEY! })

async function runRealtimeSession(agentId: string) {
  // One trace per session
  const trace = await nexus.startTrace({
    agentId,
    name: 'realtime-session',
    status: 'running',
  })

  const client = new RealtimeClient({ apiKey: process.env.OPENAI_API_KEY! })

  let currentSpan: Awaited<ReturnType<typeof trace.startSpan>> | null = null
  let turnStartedAt = Date.now()

  // Turn starts when the user's item is committed
  client.on('conversation.item.created', async ({ item }) => {
    if (item.role === 'user') {
      turnStartedAt = Date.now()
      currentSpan = await trace.startSpan({
        name: `turn-${item.id}`,
        input: { role: 'user', content: item.content },
      })
    }
  })

  // Turn ends when the model response is complete
  client.on('response.done', async ({ response }) => {
    if (!currentSpan) return
    const usage = response.usage ?? {}
    await currentSpan.end({
      status: 'ok',
      output: { text: response.output?.[0]?.content?.[0]?.text ?? '' },
      metadata: {
        input_tokens: usage.input_tokens ?? 0,
        output_tokens: usage.output_tokens ?? 0,
        modalities: response.output?.[0]?.content?.map((c: { type: string }) => c.type) ?? [],
        duration_ms: Date.now() - turnStartedAt,
      },
    })
    currentSpan = null
  })

  await client.connect()
  await client.updateSession({
    instructions: 'You are a helpful voice assistant.',
    modalities: ['text', 'audio'],
  })

  return { trace, client }
}

Audio + text output tracing

When a model response includes both audio and text, capture both in the span metadata. The modalities field in the response tells you which output types were used.

// Inside response.done handler
const outputItems = response.output ?? []
const textContent = outputItems
  .flatMap((item: { content: { type: string; text?: string; transcript?: string }[] }) => item.content ?? [])
  .find((c: { type: string }) => c.type === 'text' || c.type === 'audio')

await currentSpan.end({
  status: 'ok',
  output: {
    text: textContent?.text ?? textContent?.transcript ?? '',
  },
  metadata: {
    input_tokens: response.usage?.input_tokens ?? 0,
    output_tokens: response.usage?.output_tokens ?? 0,
    // Which modalities did the model actually produce?
    audio_produced: outputItems.some((item: { content: { type: string }[] }) =>
      item.content?.some((c: { type: string }) => c.type === 'audio')
    ),
    text_produced: outputItems.some((item: { content: { type: string }[] }) =>
      item.content?.some((c: { type: string }) => c.type === 'text')
    ),
    // Approximate audio duration from token count (1 token ≈ 0.05s)
    audio_duration_s: ((response.usage?.output_tokens ?? 0) * 0.05).toFixed(1),
    duration_ms: Date.now() - turnStartedAt,
  },
})

Note on audio token counts

OpenAI counts audio tokens separately in the usage.output_token_details field. Capture both text_tokens and audio_tokens if you need accurate cost tracking.
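A small helper makes this split explicit. The field names mirror usage.output_token_details from the response.done payload; the helper name and the fallback-to-aggregate behavior are assumptions for illustration, not SDK code.

```typescript
// Split Realtime usage into text vs audio output tokens for cost tracking.
type RealtimeUsage = {
  input_tokens?: number
  output_tokens?: number
  output_token_details?: { text_tokens?: number; audio_tokens?: number }
}

function splitOutputTokens(usage: RealtimeUsage) {
  const details = usage.output_token_details ?? {}
  const text = details.text_tokens ?? 0
  const audio = details.audio_tokens ?? 0
  return {
    text_tokens: text,
    audio_tokens: audio,
    // If the detail breakdown is absent, fall back to the aggregate count.
    total: text + audio || (usage.output_tokens ?? 0),
  }
}
```

Attach the result to the span metadata in place of the single output_tokens field if your pricing differs for audio and text tokens.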

Full instrumented example

This complete example shows a voice agent that handles multiple turns and ends the trace when the session closes. Error handling ensures the trace ends even if the WebSocket drops unexpectedly.

import { RealtimeClient } from '@openai/realtime-api-beta'
import Nexus from 'nexus-sdk'

const nexus = new Nexus({ apiKey: process.env.NEXUS_API_KEY! })

async function runInstrumentedVoiceAgent(agentId: string): Promise<void> {
  const trace = await nexus.startTrace({
    agentId,
    name: 'voice-session',
    status: 'running',
  })

  const client = new RealtimeClient({ apiKey: process.env.OPENAI_API_KEY! })
  let currentSpan: Awaited<ReturnType<typeof trace.startSpan>> | null = null
  let turnStart = 0
  let totalInputTokens = 0
  let totalOutputTokens = 0

  client.on('conversation.item.created', async ({ item }) => {
    if (item.role !== 'user') return
    turnStart = Date.now()
    currentSpan = await trace.startSpan({
      name: `turn-${item.id}`,
      input: { role: 'user', content: item.content },
    })
  })

  client.on('response.done', async ({ response }) => {
    if (!currentSpan) return
    const usage = response.usage ?? {}
    totalInputTokens += usage.input_tokens ?? 0
    totalOutputTokens += usage.output_tokens ?? 0

    await currentSpan.end({
      status: response.status === 'failed' ? 'error' : 'ok',
      output: { text: response.output?.[0]?.content?.[0]?.text ?? '' },
      metadata: {
        input_tokens: usage.input_tokens ?? 0,
        output_tokens: usage.output_tokens ?? 0,
        duration_ms: Date.now() - turnStart,
      },
    })
    currentSpan = null
  })

  client.on('error', async (err: Error) => {
    currentSpan?.end({ status: 'error', metadata: { error: err.message } }).catch(() => {})
    currentSpan = null
  })

  try {
    await client.connect()
    await client.updateSession({
      instructions: 'You are a helpful voice assistant.',
      modalities: ['text', 'audio'],
    })

    // ... your session logic here ...

    await client.disconnect()
    await trace.end({
      status: 'success',
      metadata: { total_input_tokens: totalInputTokens, total_output_tokens: totalOutputTokens },
    })
  } catch (err) {
    const e = err instanceof Error ? err : new Error(String(err))
    await trace.end({ status: 'error', metadata: { error: e.message } })
    throw e
  }
}

See your Realtime agents in the dashboard

Sign up free — no credit card required. 1,000 traces/month on the free plan. Every voice session shows up as a multi-span waterfall within seconds of completion.