Building an Autonomous AI Agent with Observability — Lessons from Ralph
Ralph is an autonomous AI agent that builds software. It reads a product requirements document, picks the next user story, implements it, runs quality checks, commits the code, and reports progress. Then it stops and waits for the next session.
Ralph built Nexus. And Ralph uses Nexus to monitor itself. This post is the story of what we learned from running a production autonomous agent for three weeks — the failure modes, the trace patterns that caught them, and the design decisions that followed.
The architecture
Ralph runs on Claude Code (Anthropic's AI coding tool) invoked via a cron schedule. Each session is a single agent run:
- Read prd.json and progress.txt
- Pick the highest-priority story where passes: false
- Implement the story using file edits and bash commands
- Run quality checks (TypeScript, tests)
- Commit the changes and update the PRD
A typical session spans 20-40 LLM calls (reads, edits, shell commands), takes 5-15 minutes, and produces 1-3 commits. Over three weeks, Ralph completed 84 user stories.
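The story-selection step above can be sketched in a few lines of TypeScript. The prd.json shape (an array of stories with a boolean passes flag) matches the post; the field names beyond passes and the function name are illustrative assumptions, not Ralph's actual code.

```typescript
// Sketch of Ralph's "pick the next story" step, assuming prd.json holds
// stories with a priority and a boolean `passes` flag.
interface Story {
  id: string;
  title: string;
  priority: number;
  passes: boolean;
}

// Pick the highest-priority story that has not yet passed.
function pickNextStory(stories: Story[]): Story | undefined {
  return stories
    .filter((s) => !s.passes)
    .sort((a, b) => b.priority - a.priority)[0];
}
```

Because passes is the only state consulted, a crashed session can re-run this selection from scratch and land on the same story.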
The failure modes we saw
1. Context collapse
As sessions grew longer, Ralph would occasionally "forget" what it was implementing mid-story. In trace data, this showed up as a sudden shift in span names: implement-auth-endpoint → read-file → read-file → read-file (stuck re-reading instead of writing).
Fix: we added a "Codebase Patterns" section to progress.txt that Ralph reads at the start of every session. Shorter context → less drift.
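The post doesn't show the file itself, but a "Codebase Patterns" section in progress.txt might look something like this (contents invented for illustration):

```text
## Codebase Patterns
- API routes live in src/routes/, one file per resource
- All DB access goes through src/db/queries.ts
- Tests sit next to source files as *.test.ts
```

A short, stable summary like this gives each session its bearings without re-reading the whole codebase.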
2. Blocker loops
Some stories required external credentials (a Cloudflare API token, a PyPI token). Ralph couldn't proceed and would retry the same blocked action 5-8 times before logging the blocker. In traces, this looked like repeated identical spans with error status.
Fix: we added explicit blocker detection — if Ralph hits the same error 3 times, it logs to blockers.log, emails the operator (Steve), and skips to the next story. This is now our standard pattern for autonomous agents.
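The skip-don't-fail rule can be sketched as a retry wrapper. The 3-strikes threshold mirrors the post; the function names and the in-memory blocker log are illustrative stand-ins for Ralph's actual blockers.log and email plumbing.

```typescript
// Sketch: after 3 identical consecutive failures, record the blocker
// and skip to the next story instead of retrying forever.
const MAX_RETRIES = 3;

interface BlockerLog {
  entries: string[];
}

function runWithBlockerDetection(
  storyId: string,
  action: () => void,
  log: BlockerLog,
): "done" | "skipped" {
  let lastError = "";
  let sameErrorCount = 0;
  while (sameErrorCount < MAX_RETRIES) {
    try {
      action();
      return "done";
    } catch (e) {
      const msg = String(e);
      // Only identical repeated errors count toward the threshold.
      sameErrorCount = msg === lastError ? sameErrorCount + 1 : 1;
      lastError = msg;
    }
  }
  log.entries.push(`${storyId}: blocked after ${MAX_RETRIES} identical errors: ${lastError}`);
  // In Ralph this is also where the operator gets emailed.
  return "skipped";
}
```

Counting only identical consecutive errors matters: a flaky test and a missing API token should not be treated the same way.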
3. Duplicate work
Early on, Ralph would sometimes implement a feature, not recognize it had passed the acceptance criteria, and implement it again in a later session. Traces showed overlapping span patterns across sessions: the same file reads and the same function signatures appearing twice.
Fix: the PRD's passes: true/false field became the canonical state. Once a story passes, Ralph never touches it again. Git commit history provides the audit trail.
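For concreteness, a PRD entry carrying the canonical passes field might look like this (every field other than passes is an assumption for illustration):

```json
{
  "stories": [
    { "id": "story-12", "title": "Add login endpoint", "passes": true },
    { "id": "story-13", "title": "Add rate limiting", "passes": false }
  ]
}
```

Because the flag lives in a committed file rather than in the agent's context, "is this done?" has exactly one answer across sessions.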
4. Test environment side effects
Ralph's integration tests hit a real SQLite database (via Cloudflare D1 locally). Some test runs left state that caused the next test to fail. In traces, this appeared as test spans with status error that passed on retry — the classic test isolation bug.
Fix: test setup/teardown added to each test file. The trace pattern (pass → fail → pass → fail) was the diagnostic signal.
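Ralph's real tests run against Cloudflare D1; the sketch below shows the shape of the fix with an in-memory stand-in for the database. All names here are illustrative, not Ralph's actual test code.

```typescript
// Sketch of the test-isolation fix: reset database state before and
// after every test so no run can poison the next one.
type Row = { id: number; email: string };

class FakeDb {
  rows: Row[] = [];
  insert(r: Row) { this.rows.push(r); }
  count() { return this.rows.length; }
  reset() { this.rows = []; }
}

const db = new FakeDb();

// The setup/teardown pair added to each test file.
function setup() { db.reset(); }
function teardown() { db.reset(); }

// Wrapper that guarantees teardown runs even when the test throws.
function runIsolated(test: () => void): void {
  setup();
  try {
    test();
  } finally {
    teardown();
  }
}
```

The finally block is the important part: a failing test must still clean up, or you recreate the exact pass/fail/pass/fail pattern described above.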
What a healthy trace looks like
A successful Ralph session has a predictable span pattern:
```text
session-start          [0ms]
read-prd               [45ms]    reads prd.json
read-progress          [38ms]    reads progress.txt
pick-story             [320ms]   LLM selects next story
read-existing-code     [180ms]   reads relevant files
plan-implementation    [890ms]   LLM plans the approach
write-files            [2100ms]  creates/edits source files
run-typecheck          [4200ms]  npx tsc --noEmit
run-tests              [8300ms]  test suite
commit-changes         [1200ms]  git add + commit
update-prd             [350ms]   sets passes: true
append-progress        [200ms]   updates progress.txt
session-end            [success, 17.8s total]
```
When a session goes wrong, the deviation from this pattern is immediately visible. A blocker loop looks like run-typecheck repeating 5 times. A context collapse looks like read-existing-code spanning 40% of the session time.
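One way to turn that visual check into an automated one is to scan the span names for long runs of repeats. This heuristic is our illustration, not a Nexus feature; the threshold of 3 mirrors the blocker-detection limit above.

```typescript
// Sketch: flag a likely blocker loop by finding the longest run of
// consecutive identical span names in a session's trace.
function longestRepeat(spanNames: string[]): number {
  let best = 0;
  let run = 0;
  let prev = "";
  for (const name of spanNames) {
    run = name === prev ? run + 1 : 1;
    prev = name;
    best = Math.max(best, run);
  }
  return best;
}

function looksLikeBlockerLoop(spanNames: string[]): boolean {
  return longestRepeat(spanNames) >= 3;
}
```

A similar scan over span durations (one span eating 40% of the session) would flag the context-collapse pattern.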
The design principles that emerged
After 84 stories and hundreds of agent sessions, three design principles emerged:
1. Traces are your deployment logs
For autonomous agents, traces replace traditional deployment logs. You don't deploy a build — you deploy an agent run. The trace is the record of what it did and whether it succeeded. Store every trace. You'll need them for debugging.
2. Skip, don't fail
Autonomous agents should never block indefinitely. If a task can't be completed, log the blocker and move to the next task. An agent that skips 3 tasks and completes 7 is more valuable than one that hangs on task 1 for an hour.
3. Make state external and durable
Ralph's state is entirely in files: prd.json, progress.txt, blockers.log. When a session crashes mid-way, the next session picks up exactly where the last one left off. The state survives restarts because it lives in git.
Try the demo
You can see what Ralph's traces look like in the Nexus demo — realistic sample data showing the span waterfall, status colors, and input/output inspector. No signup required.
If you're building autonomous agents, instrument them with Nexus. The API reference has SDKs for TypeScript and Python. Free plan covers 1,000 traces/month — more than enough to get started.
Related
- Introducing Nexus — the story of the tool Ralph built to monitor itself
- API reference and quickstart — instrument your agent in 3 lines
- Interactive demo — see what Ralph's traces look like
Instrument your autonomous agent
1,000 traces/month free. No credit card needed.