Building an Autonomous AI Agent with Observability — Lessons from Ralph
Ralph is an autonomous AI agent that builds software. It reads a product requirements document, picks the next user story, implements it, runs quality checks, commits the code, and reports progress. Then it stops and waits for the next session.
Ralph built Nexus. And Ralph uses Nexus to monitor itself. This post is the story of what we learned from running a production autonomous agent for three weeks — the failure modes, the trace patterns that caught them, and the design decisions that followed.
The architecture
Ralph runs on Claude Code (Anthropic's AI coding tool) invoked via a cron schedule. Each session is a single agent run:
- Read prd.json and progress.txt
- Pick the highest-priority story where passes: false
- Implement the story using file edits and bash commands
- Run quality checks (TypeScript, tests)
- Commit the changes and update the PRD
A typical session spans 20-40 LLM calls (reads, edits, shell commands), takes 5-15 minutes, and produces 1-3 commits. Over three weeks, Ralph completed 84 user stories.
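The story-selection step above can be sketched in a few lines of TypeScript. The prd.json shape (an array of stories with a boolean passes flag) matches the post; the field names beyond passes and the function name are illustrative assumptions, not Ralph's actual code.

```typescript
// Sketch of Ralph's "pick the next story" step, assuming prd.json holds
// stories with a priority and a boolean `passes` flag.
interface Story {
  id: string;
  title: string;
  priority: number;
  passes: boolean;
}

// Pick the highest-priority story that has not yet passed.
function pickNextStory(stories: Story[]): Story | undefined {
  return stories
    .filter((s) => !s.passes)
    .sort((a, b) => b.priority - a.priority)[0];
}
```

Because passes is the only state consulted, a crashed session can re-run this selection from scratch and land on the same story.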
The failure modes we saw
1. Context collapse
As sessions grew longer, Ralph would occasionally "forget" what it was implementing mid-story. In trace data, this showed up as a sudden shift in span names: implement-auth-endpoint → read-file → read-file → read-file (stuck re-reading instead of writing).
Fix: we added a "Codebase Patterns" section to progress.txt that Ralph reads at the start of every session. Shorter context → less drift.
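The post doesn't show the file itself, but a "Codebase Patterns" section in progress.txt might look something like this (contents invented for illustration):

```text
## Codebase Patterns
- API routes live in src/routes/, one file per resource
- All DB access goes through src/db/queries.ts
- Tests sit next to source files as *.test.ts
```

A short, stable summary like this gives each session its bearings without re-reading the whole codebase.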
2. Blocker loops
Some stories required external credentials (a Cloudflare API token, a PyPI token). Ralph couldn't proceed and would retry the same blocked action 5-8 times before logging the blocker. In traces, this looked like repeated identical spans with error status.
Fix: we added explicit blocker detection — if Ralph hits the same error 3 times, it logs to blockers.log, emails the operator (Steve), and skips to the next story. This is now our standard pattern for autonomous agents.
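The skip-don't-fail rule can be sketched as a retry wrapper. The 3-strikes threshold mirrors the post; the function names and the in-memory blocker log are illustrative stand-ins for Ralph's actual blockers.log and email plumbing.

```typescript
// Sketch: after 3 identical consecutive failures, record the blocker
// and skip to the next story instead of retrying forever.
const MAX_RETRIES = 3;

interface BlockerLog {
  entries: string[];
}

function runWithBlockerDetection(
  storyId: string,
  action: () => void,
  log: BlockerLog,
): "done" | "skipped" {
  let lastError = "";
  let sameErrorCount = 0;
  while (sameErrorCount < MAX_RETRIES) {
    try {
      action();
      return "done";
    } catch (e) {
      const msg = String(e);
      // Only identical repeated errors count toward the threshold.
      sameErrorCount = msg === lastError ? sameErrorCount + 1 : 1;
      lastError = msg;
    }
  }
  log.entries.push(`${storyId}: blocked after ${MAX_RETRIES} identical errors: ${lastError}`);
  // In Ralph this is also where the operator gets emailed.
  return "skipped";
}
```

Counting only identical consecutive errors matters: a flaky test and a missing API token should not be treated the same way.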
3. Duplicate work
Early on, Ralph would sometimes implement a feature, not recognize it had passed the acceptance criteria, and implement it again in a later session. Traces showed overlapping span patterns across sessions: the same file reads and the same function signatures appearing twice.
Fix: the PRD's passes: true/false field became the canonical state. Once a story passes, Ralph never touches it again. Git commit history provides the audit trail.
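For concreteness, a PRD entry carrying the canonical passes field might look like this (every field other than passes is an assumption for illustration):

```json
{
  "stories": [
    { "id": "story-12", "title": "Add login endpoint", "passes": true },
    { "id": "story-13", "title": "Add rate limiting", "passes": false }
  ]
}
```

Because the flag lives in a committed file rather than in the agent's context, "is this done?" has exactly one answer across sessions.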
4. Test environment side effects
Ralph's integration tests hit a real SQLite database (via Cloudflare D1 locally). Some test runs left state that caused the next test to fail. In traces, this appeared as test spans with status error that passed on retry — the classic test isolation bug.
Fix: test setup/teardown added to each test file. The trace pattern (pass → fail → pass → fail) was the diagnostic signal.
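Ralph's real tests run against Cloudflare D1; the sketch below shows the shape of the fix with an in-memory stand-in for the database. All names here are illustrative, not Ralph's actual test code.

```typescript
// Sketch of the test-isolation fix: reset database state before and
// after every test so no run can poison the next one.
type Row = { id: number; email: string };

class FakeDb {
  rows: Row[] = [];
  insert(r: Row) { this.rows.push(r); }
  count() { return this.rows.length; }
  reset() { this.rows = []; }
}

const db = new FakeDb();

// The setup/teardown pair added to each test file.
function setup() { db.reset(); }
function teardown() { db.reset(); }

// Wrapper that guarantees teardown runs even when the test throws.
function runIsolated(test: () => void): void {
  setup();
  try {
    test();
  } finally {
    teardown();
  }
}
```

The finally block is the important part: a failing test must still clean up, or you recreate the exact pass/fail/pass/fail pattern described above.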
What a healthy trace looks like
A successful Ralph session has a predictable span pattern:
```text
session-start          [0ms]
read-prd               [45ms]    reads prd.json
read-progress          [38ms]    reads progress.txt
pick-story             [320ms]   LLM selects next story
read-existing-code     [180ms]   reads relevant files
plan-implementation    [890ms]   LLM plans the approach
write-files            [2100ms]  creates/edits source files
run-typecheck          [4200ms]  npx tsc --noEmit
run-tests              [8300ms]  test suite
commit-changes         [1200ms]  git add + commit
update-prd             [350ms]   sets passes: true
append-progress        [200ms]   updates progress.txt
session-end            [success, 17.8s total]
```
When a session goes wrong, the deviation from this pattern is immediately visible. A blocker loop looks like run-typecheck repeating 5 times. A context collapse looks like read-existing-code spanning 40% of the session time.
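One way to turn that visual check into an automated one is to scan the span names for long runs of repeats. This heuristic is our illustration, not a Nexus feature; the threshold of 3 mirrors the blocker-detection limit above.

```typescript
// Sketch: flag a likely blocker loop by finding the longest run of
// consecutive identical span names in a session's trace.
function longestRepeat(spanNames: string[]): number {
  let best = 0;
  let run = 0;
  let prev = "";
  for (const name of spanNames) {
    run = name === prev ? run + 1 : 1;
    prev = name;
    best = Math.max(best, run);
  }
  return best;
}

function looksLikeBlockerLoop(spanNames: string[]): boolean {
  return longestRepeat(spanNames) >= 3;
}
```

A similar scan over span durations (one span eating 40% of the session) would flag the context-collapse pattern.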
The design principles that emerged
After 84 stories and hundreds of agent sessions, three design principles emerged:
1. Traces are your deployment logs
For autonomous agents, traces replace traditional deployment logs. You don't deploy a build — you deploy an agent run. The trace is the record of what it did and whether it succeeded. Store every trace. You'll need them for debugging.
2. Skip, don't fail
Autonomous agents should never block indefinitely. If a task can't be completed, log the blocker and move to the next task. An agent that skips 3 tasks and completes 7 is more valuable than one that hangs on task 1 for an hour.
3. Make state external and durable
Ralph's state is entirely in files: prd.json, progress.txt, blockers.log. When a session crashes mid-way, the next session picks up exactly where the last one left off. The state survives restarts because it lives in git.
Try the demo
You can see what Ralph's traces look like in the Nexus demo — realistic sample data showing the span waterfall, status colors, and input/output inspector. No signup required.
If you're building autonomous agents, instrument them with Nexus. The API reference has SDKs for TypeScript and Python. Free plan covers 1,000 traces/month — more than enough to get started.
Related
- Introducing Nexus — the story of the tool Ralph built to monitor itself
- API reference and quickstart — instrument your agent in 3 lines
- Interactive demo — see what Ralph's traces look like
Instrument your autonomous agent
1,000 traces/month free. No credit card needed.