AI Agent Reliability Patterns: Retry, Timeout, and Circuit Breaker
AI agents fail differently from traditional software. Retry storms burn your token budget. Silent timeouts leave traces hanging. Circuit breakers prevent cascading LLM failures.
Traditional software reliability patterns — retries, timeouts, circuit breakers — all apply to AI agents. But the failure modes are different. An LLM call is slow, expensive, and non-deterministic in a way that a database query isn't. Getting reliability patterns wrong means burning your token budget on retry storms, or letting a hung trace block your entire pipeline.
Here are four patterns, with trace examples showing what each looks like in Nexus.
1. Retry with exponential backoff
The most basic reliability pattern. When an LLM call fails (rate limit, network error), retry with increasing delays. The key is to instrument each attempt as a separate span so you can see in Nexus exactly how many retries happened and what the cumulative cost was.
import asyncio
import os

from nexus_client import NexusClient

# RateLimitError and call_llm come from your LLM provider's SDK
nexus = NexusClient(api_key=os.environ["NEXUS_API_KEY"], agent_id="my-agent")

async def llm_call_with_retry(trace, prompt: str, max_retries: int = 3):
    for attempt in range(max_retries):
        span = trace.add_span(
            name=f"llm-call-attempt-{attempt + 1}",
            input={"prompt_len": len(prompt), "attempt": attempt + 1},
        )
        try:
            result = await call_llm(prompt)
            span.end(status="ok", output={"tokens": result.usage.total_tokens})
            return result
        except RateLimitError:
            wait = 2 ** attempt  # exponential backoff: 1s, 2s, 4s
            span.end(status="error", output={"error": "rate_limit", "wait_s": wait})
            if attempt < max_retries - 1:
                await asyncio.sleep(wait)
            else:
                raise  # surface to trace.end(status="error")
In the Nexus waterfall, you'll see llm-call-attempt-1 (red), llm-call-attempt-2 (red), llm-call-attempt-3 (green). The retry count is immediately obvious without parsing logs.
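One refinement worth considering: fixed 1s/2s/4s delays mean every client that hit the same rate limit retries on the same schedule, which is exactly how retry storms form. Adding jitter spreads retries out. A minimal sketch of full jitter, independent of the Nexus SDK (`backoff_delay` is a hypothetical helper, not a library function):

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter.

    Each retry waits a random amount between 0 and the exponential
    ceiling, so clients that failed together don't retry together.
    """
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

Swap `await asyncio.sleep(wait)` for `await asyncio.sleep(backoff_delay(attempt))` in the retry loop above, and log the drawn delay in the span output so the waterfall still tells the full story.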
2. Timeouts at every level
Agent loops can run indefinitely if you don't enforce time limits. Set timeouts at three levels: per LLM call (30s), per tool execution (60s), and per full agent session (300s). When a timeout fires, end the trace with status="timeout" so you can filter for them in Nexus.
import asyncio

# `nexus` is the NexusClient instance from the first snippet

async def agent_run_with_timeout(user_request: str, timeout_seconds: int = 300):
    trace = nexus.start_trace(
        name=f"agent: {user_request[:60]}",
        metadata={"timeout_s": timeout_seconds},
    )
    try:
        result = await asyncio.wait_for(
            run_agent(trace, user_request),  # your agent loop
            timeout=timeout_seconds,
        )
        trace.end(status="success")
        return result
    except asyncio.TimeoutError:
        # The trace lands in Nexus with status="timeout" — easy to filter
        trace.end(
            status="timeout",
            output={"message": f"Agent timed out after {timeout_seconds}s"},
        )
        raise
Timeout traces appear in your Nexus dashboard with orange timeout badges. If 10% of your traces are timing out, you have a problem. Without observability, you might never notice.
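The snippet above enforces only the session-level cap. The per-call and per-tool levels can be thin wrappers around the same `asyncio.wait_for` primitive. A minimal sketch; `call_llm_stub` is a hypothetical stand-in for your real provider call, not part of any SDK:

```python
import asyncio

async def call_llm_stub(prompt: str) -> str:
    """Stand-in for a real provider call."""
    await asyncio.sleep(0.05)  # simulate provider latency
    return f"echo: {prompt}"

async def llm_call_with_timeout(prompt: str, timeout_s: float = 30.0) -> str:
    # Level 1: cap each individual LLM call. The session-level
    # wait_for above still applies on top of this, so a single
    # slow call can't silently eat the whole session budget.
    return await asyncio.wait_for(call_llm_stub(prompt), timeout=timeout_s)
```

The same shape works for tool execution with a 60s cap. In real code, catch `asyncio.TimeoutError` here, end the span with `status="timeout"`, and re-raise so the session-level handler sees it too.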
3. Circuit breaker for LLM providers
If your LLM provider is having an incident, you don't want to keep hammering it with requests — each one burns tokens and delays the user. A circuit breaker opens after N consecutive failures and rejects calls for a recovery period, letting the system breathe.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=60):
        self.failures = 0
        self.threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.opened_at: float | None = None

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        if time.time() - self.opened_at > self.recovery_timeout:
            # recovery window elapsed — close the circuit and let traffic resume
            self.failures = 0
            self.opened_at = None
            return False
        return True

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = time.time()

    def record_success(self):
        self.failures = 0
        self.opened_at = None

llm_circuit = CircuitBreaker()

async def llm_call_with_circuit(trace, prompt: str):
    if llm_circuit.is_open():
        span = trace.add_span(name="llm-call-blocked")
        span.end(status="error", output={"error": "circuit_open"})
        raise RuntimeError("LLM circuit breaker open — too many recent failures")
    span = trace.add_span(name="llm-call", input={"prompt_len": len(prompt)})
    try:
        result = await call_llm(prompt)
        llm_circuit.record_success()
        span.end(status="ok")
        return result
    except Exception as e:
        llm_circuit.record_failure()
        span.end(status="error", output={"error": str(e)})
        raise
4. Dead letter queue for unhandled failures
Some agent failures are transient (network blip, rate limit) and safe to retry. Others are permanent (invalid input, tool misconfiguration). Route permanent failures to a dead letter queue for human review rather than retrying endlessly. Log the full trace ID in the DLQ record so you can pull up the Nexus trace for any queued item.
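The routing logic can be sketched with an in-memory queue. In production the queue would be SQS, a database table, or a Kafka topic, and the error taxonomy below is illustrative; none of these names come from the Nexus SDK:

```python
from dataclasses import dataclass, field

# Error kinds that are safe to retry; everything else is treated as permanent.
TRANSIENT_ERRORS = ("rate_limit", "timeout", "connection_reset")

@dataclass
class DeadLetterQueue:
    records: list = field(default_factory=list)

    def push(self, trace_id: str, error_kind: str, payload: dict):
        # Store the Nexus trace ID so a reviewer can pull up the full
        # trace for any queued item with one click.
        self.records.append({
            "trace_id": trace_id,
            "error": error_kind,
            "payload": payload,
        })

def handle_failure(dlq: DeadLetterQueue, trace_id: str,
                   error_kind: str, payload: dict) -> str:
    if error_kind in TRANSIENT_ERRORS:
        return "retry"          # safe to re-enqueue with backoff
    dlq.push(trace_id, error_kind, payload)
    return "dead_lettered"      # permanent failure: route to human review
```

The key design choice is that classification happens once, at the failure site, so a misconfigured tool never loops through the retry path burning tokens.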
How observability ties these together
Without trace data, you're guessing which reliability pattern is failing and how often. With Nexus, you can:
- Filter traces by status to see timeout and error rates over time
- See retry counts in span names without parsing logs
- Get email alerts the moment error rates spike
- Track p95 latency to detect slow degradation before users notice
See your agent reliability patterns in action
Start free — trace your first agent session in under 5 minutes.
Start monitoring for free →