OpenAI Realtime API Observability
The OpenAI Realtime API streams audio and text back to users turn by turn. Standard request/response tracing doesn't capture this flow. This guide shows how to wrap each turn in a Nexus span so you get a complete conversation trace in the dashboard.
Realtime API overview
The OpenAI Realtime API uses a persistent WebSocket connection to stream audio and text between the client and the model. Unlike a standard chat completion, a single session contains multiple turns — each turn is a user utterance followed by a model response.
Realtime session anatomy
Nexus models each session as a trace and each turn as a span. You can see the full conversation timeline in the waterfall view.
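As a rough sketch of that mapping — the field names below are illustrative, not the Nexus SDK's actual schema — a three-turn session would look like one trace containing one span per turn:

```typescript
// Illustrative only: how a session maps onto a trace with one span per turn.
// Span names and metadata fields here are hypothetical.
const sessionTrace = {
  name: 'realtime-session',
  status: 'success',
  spans: [
    { name: 'turn-item_001', input_tokens: 42, output_tokens: 180, duration_ms: 2100 },
    { name: 'turn-item_002', input_tokens: 230, output_tokens: 95, duration_ms: 1400 },
    { name: 'turn-item_003', input_tokens: 310, output_tokens: 160, duration_ms: 1900 },
  ],
}

// The waterfall view renders one row per span, in order.
const totalOutputTokens = sessionTrace.spans.reduce((sum, s) => sum + s.output_tokens, 0)
```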
Why streaming agents differ
With a standard chat completion you get one request and one response — easy to time and log. The Realtime API is different in three ways:
- No single response object — audio arrives as chunks; text arrives in response.text.delta events
- Turn boundaries are events — conversation.item.created and response.done mark the start and end of a turn
- Usage is deferred — token counts arrive in response.done after the response finishes streaming
The pattern that works: open a span on conversation.item.created and close it on response.done, capturing the usage data from the done event.
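Stripped of any SDK, the open/close pattern is a small state machine: one "current turn" slot that the created event fills and the done event drains. A minimal sketch (plain objects standing in for spans, and an injectable clock for testing):

```typescript
// Minimal sketch of the per-turn open/close pattern, with spans reduced
// to plain objects so the logic is visible without any SDK.
type Turn = { id: string; startedAt: number; endedAt?: number; usage?: object }

function createTurnTracker(now: () => number = Date.now) {
  let current: Turn | null = null
  const completed: Turn[] = []
  return {
    // Call on conversation.item.created for user items: opens the turn.
    onItemCreated(itemId: string) {
      current = { id: itemId, startedAt: now() }
    },
    // Call on response.done: attaches the deferred usage payload and closes the turn.
    onResponseDone(usage: object) {
      if (!current) return // a done event with no open turn is ignored
      current.endedAt = now()
      current.usage = usage
      completed.push(current)
      current = null
    },
    turns: completed,
  }
}
```

The guard in onResponseDone matters: server-initiated responses (or retries after an error) can fire response.done without a matching user item, and silently dropping those keeps the trace consistent.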
Per-turn span wrapping
Install the Nexus SDK and create a trace when the session opens. Then open a span for each turn and close it when the response finishes.
import { RealtimeClient } from '@openai/realtime-api-beta'
import Nexus from 'nexus-sdk'
const nexus = new Nexus({ apiKey: process.env.NEXUS_API_KEY! })
async function runRealtimeSession(agentId: string) {
// One trace per session
const trace = await nexus.startTrace({
agentId,
name: 'realtime-session',
status: 'running',
})
const client = new RealtimeClient({ apiKey: process.env.OPENAI_API_KEY! })
let currentSpan: Awaited<ReturnType<typeof trace.startSpan>> | null = null
let turnStartedAt = Date.now()
// Turn starts when the user's item is committed
client.on('conversation.item.created', async ({ item }) => {
if (item.role === 'user') {
turnStartedAt = Date.now()
currentSpan = await trace.startSpan({
name: `turn-${item.id}`,
input: { role: 'user', content: item.content },
})
}
})
// Turn ends when the model response is complete
client.on('response.done', async ({ response }) => {
if (!currentSpan) return
const usage = response.usage ?? {}
await currentSpan.end({
status: 'ok',
output: { text: response.output?.[0]?.content?.[0]?.text ?? '' },
metadata: {
input_tokens: usage.input_tokens ?? 0,
output_tokens: usage.output_tokens ?? 0,
modalities: response.output?.[0]?.content?.map((c: { type: string }) => c.type) ?? [],
duration_ms: Date.now() - turnStartedAt,
},
})
currentSpan = null
})
await client.connect()
await client.updateSession({
instructions: 'You are a helpful voice assistant.',
modalities: ['text', 'audio'],
})
return { trace, client }
}
Audio + text output tracing
When a model response includes both audio and text, capture both in the span metadata. The modalities field in the response tells you which output types were used.
// Inside response.done handler
const outputItems = response.output ?? []
const textContent = outputItems
.flatMap((item: { content: { type: string; text?: string; transcript?: string }[] }) => item.content ?? [])
.find((c: { type: string }) => c.type === 'text' || c.type === 'audio')
await currentSpan.end({
status: 'ok',
output: {
text: textContent?.text ?? textContent?.transcript ?? '',
},
metadata: {
input_tokens: response.usage?.input_tokens ?? 0,
output_tokens: response.usage?.output_tokens ?? 0,
// Which modalities did the model actually produce?
audio_produced: outputItems.some((item: { content: { type: string }[] }) =>
item.content?.some((c: { type: string }) => c.type === 'audio')
),
text_produced: outputItems.some((item: { content: { type: string }[] }) =>
item.content?.some((c: { type: string }) => c.type === 'text')
),
// Approximate audio duration from token count (1 token ≈ 0.05s)
audio_duration_s: ((response.usage?.output_tokens ?? 0) * 0.05).toFixed(1),
duration_ms: Date.now() - turnStartedAt,
},
})
Note on audio token counts
OpenAI counts audio tokens separately in the usage.output_token_details field. Capture both text_tokens and audio_tokens if you need accurate cost tracking.
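A small helper makes that split explicit. This is a sketch over the usage shape described above — treat the interface as an assumption to check against the event payloads you actually receive:

```typescript
// Assumed shape of the usage payload on a response.done event.
interface RealtimeUsage {
  input_tokens?: number
  output_tokens?: number
  output_token_details?: { text_tokens?: number; audio_tokens?: number }
}

// Split output tokens into text and audio portions for cost tracking.
function splitOutputTokens(usage: RealtimeUsage) {
  const details = usage.output_token_details ?? {}
  const text_tokens = details.text_tokens ?? 0
  const audio_tokens = details.audio_tokens ?? 0
  return {
    text_tokens,
    audio_tokens,
    // Fall back to summing the parts when the combined count is absent.
    total_output_tokens: usage.output_tokens ?? text_tokens + audio_tokens,
  }
}
```

Record text_tokens and audio_tokens as separate metadata fields on the span, since audio output tokens are typically priced differently from text output tokens.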
Full instrumented example
This complete example shows a voice agent that handles multiple turns and ends the trace when the session closes. Error handling ensures the trace ends even if the WebSocket drops unexpectedly.
import { RealtimeClient } from '@openai/realtime-api-beta'
import Nexus from 'nexus-sdk'
const nexus = new Nexus({ apiKey: process.env.NEXUS_API_KEY! })
async function runInstrumentedVoiceAgent(agentId: string): Promise<void> {
const trace = await nexus.startTrace({
agentId,
name: 'voice-session',
status: 'running',
})
const client = new RealtimeClient({ apiKey: process.env.OPENAI_API_KEY! })
let currentSpan: Awaited<ReturnType<typeof trace.startSpan>> | null = null
let turnStart = 0
let totalInputTokens = 0
let totalOutputTokens = 0
client.on('conversation.item.created', async ({ item }) => {
if (item.role !== 'user') return
turnStart = Date.now()
currentSpan = await trace.startSpan({
name: `turn-${item.id}`,
input: { role: 'user', content: item.content },
})
})
client.on('response.done', async ({ response }) => {
if (!currentSpan) return
const usage = response.usage ?? {}
totalInputTokens += usage.input_tokens ?? 0
totalOutputTokens += usage.output_tokens ?? 0
await currentSpan.end({
status: response.status === 'failed' ? 'error' : 'ok',
output: { text: response.output?.[0]?.content?.[0]?.text ?? '' },
metadata: {
input_tokens: usage.input_tokens ?? 0,
output_tokens: usage.output_tokens ?? 0,
duration_ms: Date.now() - turnStart,
},
})
currentSpan = null
})
client.on('error', async (err: Error) => {
currentSpan?.end({ status: 'error', metadata: { error: err.message } }).catch(() => {})
currentSpan = null
})
try {
await client.connect()
await client.updateSession({
instructions: 'You are a helpful voice assistant.',
modalities: ['text', 'audio'],
})
// ... your session logic here ...
await client.disconnect()
await trace.end({
status: 'success',
metadata: { total_input_tokens: totalInputTokens, total_output_tokens: totalOutputTokens },
})
} catch (err) {
const e = err instanceof Error ? err : new Error(String(err))
await trace.end({ status: 'error', metadata: { error: e.message } })
throw e
}
}
See your Realtime agents in the dashboard
Sign up free — no credit card required. 1,000 traces/month on the free plan. Every voice session shows up as a multi-span waterfall within seconds of completion.