AI Agent Reliability Patterns: Retry, Timeout, and Circuit Breaker
AI agents fail differently from traditional software. Retry storms burn your token budget. Silent timeouts leave traces hanging. Circuit breakers prevent cascading LLM failures.
Traditional software reliability patterns — retries, timeouts, circuit breakers — all apply to AI agents. But the failure modes are different. An LLM call is slow, expensive, and non-deterministic in a way that a database query isn't. Getting reliability patterns wrong means burning your token budget on retry storms, or letting a hung trace block your entire pipeline.
Here are four patterns, with trace examples showing what each looks like in Nexus.
1. Retry with exponential backoff
The most basic reliability pattern. When an LLM call fails (rate limit, network error), retry with increasing delays. The key is to instrument each attempt as a separate span so you can see in Nexus exactly how many retries happened and what the cumulative cost was.
import asyncio
import os

from nexus_client import NexusClient

# RateLimitError and call_llm come from your LLM provider's SDK
nexus = NexusClient(api_key=os.environ["NEXUS_API_KEY"], agent_id="my-agent")

async def llm_call_with_retry(trace, prompt: str, max_retries: int = 3):
    for attempt in range(max_retries):
        span = trace.add_span(
            name=f"llm-call-attempt-{attempt + 1}",
            input={"prompt_len": len(prompt), "attempt": attempt + 1},
        )
        try:
            result = await call_llm(prompt)
            span.end(status="ok", output={"tokens": result.usage.total_tokens})
            return result
        except RateLimitError:
            wait = 2 ** attempt  # exponential backoff: 1s, 2s, 4s
            span.end(status="error", output={"error": "rate_limit", "wait_s": wait})
            if attempt < max_retries - 1:
                await asyncio.sleep(wait)
            else:
                raise  # surface to trace.end(status="error")
In the Nexus waterfall, you'll see llm-call-attempt-1 (red), llm-call-attempt-2 (red), llm-call-attempt-3 (green). The retry count is immediately obvious without parsing logs.
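One refinement worth considering: fixed 1s/2s/4s delays mean every client that hit the same rate limit retries on the same schedule, which is exactly how retry storms form. Adding jitter spreads retries out. A minimal sketch of full jitter, independent of the Nexus SDK (`backoff_delay` is a hypothetical helper, not a library function):

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter.

    Each retry waits a random amount between 0 and the exponential
    ceiling, so clients that failed together don't retry together.
    """
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

Swap `await asyncio.sleep(wait)` for `await asyncio.sleep(backoff_delay(attempt))` in the retry loop above, and log the drawn delay in the span output so the waterfall still tells the full story.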
2. Timeouts at every level
Agent loops can run indefinitely if you don't enforce time limits. Set timeouts at three levels: per LLM call (30s), per tool execution (60s), and per full agent session (300s). When a timeout fires, end the trace with status="timeout" so you can filter for them in Nexus.
import asyncio

# `nexus` is the NexusClient instance from the first snippet

async def agent_run_with_timeout(user_request: str, timeout_seconds: int = 300):
    trace = nexus.start_trace(
        name=f"agent: {user_request[:60]}",
        metadata={"timeout_s": timeout_seconds},
    )
    try:
        result = await asyncio.wait_for(
            run_agent(trace, user_request),  # your agent loop
            timeout=timeout_seconds,
        )
        trace.end(status="success")
        return result
    except asyncio.TimeoutError:
        # The trace lands in Nexus with status="timeout" — easy to filter
        trace.end(
            status="timeout",
            output={"message": f"Agent timed out after {timeout_seconds}s"},
        )
        raise
Timeout traces appear in your Nexus dashboard with orange timeout badges. If 10% of your traces are timing out, you have a problem. Without observability, you might never notice.
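The snippet above enforces only the session-level cap. The per-call and per-tool levels can be thin wrappers around the same `asyncio.wait_for` primitive. A minimal sketch; `call_llm_stub` is a hypothetical stand-in for your real provider call, not part of any SDK:

```python
import asyncio

async def call_llm_stub(prompt: str) -> str:
    """Stand-in for a real provider call."""
    await asyncio.sleep(0.05)  # simulate provider latency
    return f"echo: {prompt}"

async def llm_call_with_timeout(prompt: str, timeout_s: float = 30.0) -> str:
    # Level 1: cap each individual LLM call. The session-level
    # wait_for above still applies on top of this, so a single
    # slow call can't silently eat the whole session budget.
    return await asyncio.wait_for(call_llm_stub(prompt), timeout=timeout_s)
```

The same shape works for tool execution with a 60s cap. In real code, catch `asyncio.TimeoutError` here, end the span with `status="timeout"`, and re-raise so the session-level handler sees it too.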
3. Circuit breaker for LLM providers
If your LLM provider is having an incident, you don't want to keep hammering it with requests — each one burns tokens and delays the user. A circuit breaker opens after N consecutive failures and rejects calls for a recovery period, letting the system breathe.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=60):
        self.failures = 0
        self.threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.opened_at: float | None = None

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        if time.time() - self.opened_at > self.recovery_timeout:
            # recovery window elapsed — close the circuit and let traffic resume
            self.failures = 0
            self.opened_at = None
            return False
        return True

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = time.time()

    def record_success(self):
        self.failures = 0
        self.opened_at = None

llm_circuit = CircuitBreaker()

async def llm_call_with_circuit(trace, prompt: str):
    if llm_circuit.is_open():
        span = trace.add_span(name="llm-call-blocked")
        span.end(status="error", output={"error": "circuit_open"})
        raise RuntimeError("LLM circuit breaker open — too many recent failures")
    span = trace.add_span(name="llm-call", input={"prompt_len": len(prompt)})
    try:
        result = await call_llm(prompt)
        llm_circuit.record_success()
        span.end(status="ok")
        return result
    except Exception as e:
        llm_circuit.record_failure()
        span.end(status="error", output={"error": str(e)})
        raise
4. Dead letter queue for unhandled failures
Some agent failures are transient (network blip, rate limit) and safe to retry. Others are permanent (invalid input, tool misconfiguration). Route permanent failures to a dead letter queue for human review rather than retrying endlessly. Log the full trace ID in the DLQ record so you can pull up the Nexus trace for any queued item.
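The routing logic can be sketched with an in-memory queue. In production the queue would be SQS, a database table, or a Kafka topic, and the error taxonomy below is illustrative; none of these names come from the Nexus SDK:

```python
from dataclasses import dataclass, field

# Error kinds that are safe to retry; everything else is treated as permanent.
TRANSIENT_ERRORS = ("rate_limit", "timeout", "connection_reset")

@dataclass
class DeadLetterQueue:
    records: list = field(default_factory=list)

    def push(self, trace_id: str, error_kind: str, payload: dict):
        # Store the Nexus trace ID so a reviewer can pull up the full
        # trace for any queued item with one click.
        self.records.append({
            "trace_id": trace_id,
            "error": error_kind,
            "payload": payload,
        })

def handle_failure(dlq: DeadLetterQueue, trace_id: str,
                   error_kind: str, payload: dict) -> str:
    if error_kind in TRANSIENT_ERRORS:
        return "retry"          # safe to re-enqueue with backoff
    dlq.push(trace_id, error_kind, payload)
    return "dead_lettered"      # permanent failure: route to human review
```

The key design choice is that classification happens once, at the failure site, so a misconfigured tool never loops through the retry path burning tokens.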
How observability ties these together
Without trace data, you're guessing which reliability pattern is failing and how often. With Nexus, you can:
- Filter traces by status to see timeout and error rates over time
- See retry counts in span names without parsing logs
- Get email alerts the moment error rates spike
- Track p95 latency to detect slow degradation before users notice
See your agent reliability patterns in action
Start free — trace your first agent session in under 5 minutes.
Start monitoring for free →