2026-04-15 · 10 min read

How to Write Tests for LLM-Based AI Agents

Testing LLM-based agents is hard because outputs are non-deterministic. But "it's probabilistic" isn't an excuse to skip tests — it means you need different tests: deterministic unit tests for tool logic, contract tests for LLM interfaces, integration tests with seeded scenarios, and trace-based regression tests that compare execution paths. Here's the full testing pyramid for AI agents.

"You can't test AI — the outputs are random." This claim is wrong, but it's common enough that many teams ship AI agents with zero automated tests. The truth: most of what makes an agent valuable is deterministic and testable. The non-deterministic parts need different testing strategies, not no testing.

Here's the full testing pyramid for LLM-based agents.

Layer 1: Unit tests for tool logic

Your agent's tools are regular functions. They should have regular unit tests. Test the logic of what each tool does, mocking only the external calls:

// tools/search.ts
export async function webSearch(query: string): Promise<SearchResult[]> {
  const response = await fetch(`https://api.search.example/v1?q=${encodeURIComponent(query)}`)
  if (!response.ok) throw new Error(`Search failed: ${response.status}`)
  return response.json()
}

// tools/search.test.ts
import { vi, test, expect } from 'vitest'
import { webSearch } from './search'

// webSearch uses the global fetch, so stub the global rather than mocking a module
vi.stubGlobal('fetch', vi.fn().mockResolvedValue({
  ok: true,
  json: () => Promise.resolve([{ title: 'Result', url: 'https://example.com' }]),
}))

test('returns search results on success', async () => {
  const results = await webSearch('test query')
  expect(results).toHaveLength(1)
  expect(results[0].url).toBe('https://example.com')
})

test('throws on non-200 response', async () => {
  vi.mocked(fetch).mockResolvedValueOnce({ ok: false, status: 429 } as Response)
  await expect(webSearch('test')).rejects.toThrow('Search failed: 429')
})

These tests are fast, deterministic, and catch regressions in your tool logic without touching an LLM API.

Layer 2: Contract tests for LLM interfaces

You can't test that the LLM "gives the right answer," but you can test that your code handles the range of shapes the LLM might return. Contract tests stub the LLM response and verify your parsing logic handles each case:

// Test: agent handles tool_use response correctly
test('routes tool call to correct handler', async () => {
  mockLLM.mockResolvedValue({
    stop_reason: 'tool_use',
    content: [{
      type: 'tool_use',
      name: 'web_search',
      id: 'tool_1',
      input: { query: 'AI observability' },
    }],
    usage: { input_tokens: 100, output_tokens: 50 },
  })

  await runAgentTurn(mockLLM, 'Find info about AI observability')
  expect(mockSearchTool).toHaveBeenCalledWith({ query: 'AI observability' })
})

// Test: agent handles end_turn correctly
test('returns final answer on end_turn', async () => {
  mockLLM.mockResolvedValue({
    stop_reason: 'end_turn',
    content: [{ type: 'text', text: 'The answer is 42.' }],
    usage: { input_tokens: 200, output_tokens: 10 },
  })

  const result = await runAgentTurn(mockLLM, 'What is the answer?')
  expect(result.answer).toBe('The answer is 42.')
})
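
A response shape that's easy to forget is the truncated one (stop_reason: 'max_tokens'). A contract test for it only pays off if your agent actually guards against truncation — here's a minimal sketch of such a guard (the assertComplete helper is illustrative, not part of the code above):

```typescript
interface LLMContentBlock { type: string; text?: string }

interface LLMResponse {
  stop_reason: 'end_turn' | 'tool_use' | 'max_tokens'
  content: LLMContentBlock[]
}

// Reject truncated responses before the agent loop treats partial text as an answer.
function assertComplete(response: LLMResponse): LLMResponse {
  if (response.stop_reason === 'max_tokens') {
    throw new Error('LLM response truncated: hit max_tokens')
  }
  return response
}
```

With a guard like this in place, the contract test is just: stub a max_tokens response and assert the agent surfaces an error instead of a half-finished answer.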

Layer 3: Integration tests with seeded scenarios

For end-to-end tests, use a test LLM that returns a pre-scripted sequence of responses. This lets you test the full agent loop deterministically — without calling the real API or paying per token:

class ScriptedLLM {
  private responses: LLMResponse[]
  private index = 0

  constructor(responses: LLMResponse[]) {
    this.responses = responses
  }

  async messages(params: MessageParams): Promise<LLMResponse> {
    const response = this.responses[this.index++]
    if (!response) throw new Error('ScriptedLLM: ran out of responses')
    return response
  }
}

test('research agent completes 2-step task', async () => {
  const scripted = new ScriptedLLM([
    // Turn 1: LLM decides to search
    { stop_reason: 'tool_use', content: [{ type: 'tool_use', name: 'web_search', id: 't1', input: { query: 'nexus agent observability' } }] },
    // Turn 2: LLM synthesizes result
    { stop_reason: 'end_turn', content: [{ type: 'text', text: 'Nexus is an observability platform for AI agents.' }] },
  ])

  const agent = new ResearchAgent(scripted)
  const result = await agent.run('What is Nexus?')

  expect(result).toContain('observability')
  expect(mockSearchTool).toHaveBeenCalledWith({ query: 'nexus agent observability' })
})
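
Scripted tests also catch runaway loops: if the agent calls the LLM more times than the script expects, ScriptedLLM throws. It's worth pairing that with a hard iteration cap in the agent loop itself. A self-contained sketch (the runWithCap loop is illustrative — your agent loop will do more per turn):

```typescript
type LLMResponse = { stop_reason: string }

class ScriptedLLM {
  private index = 0
  constructor(private responses: LLMResponse[]) {}

  async messages(): Promise<LLMResponse> {
    const response = this.responses[this.index++]
    if (!response) throw new Error('ScriptedLLM: ran out of responses')
    return response
  }
}

// Stop after maxTurns even if the LLM keeps asking for tools.
async function runWithCap(llm: ScriptedLLM, maxTurns = 5): Promise<LLMResponse> {
  for (let turn = 0; turn < maxTurns; turn++) {
    const response = await llm.messages()
    if (response.stop_reason === 'end_turn') return response
  }
  throw new Error(`Agent exceeded ${maxTurns} turns without finishing`)
}
```

A test that scripts five tool_use responses against a cap of three proves the agent fails loudly instead of looping forever.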

Layer 4: Trace-based regression tests

Once your agent is in production, your best regression tests are snapshots of real traces. When a user reports a bug, capture the trace, identify the failing span, and write a test that seeds the agent with those exact inputs and verifies the output:

// Reproduce bug from trace ACP-trace-abc123
// Original: agent returned empty result when search returned 0 results
test('handles empty search results gracefully', async () => {
  const scripted = new ScriptedLLM([
    { stop_reason: 'tool_use', content: [{ type: 'tool_use', name: 'web_search', id: 't1', input: { query: 'very obscure query' } }] },
    // Scripted: tool returns []
    { stop_reason: 'end_turn', content: [{ type: 'text', text: "I couldn't find any results for that query." }] },
  ])

  mockSearchTool.mockResolvedValue([])  // empty results

  const agent = new ResearchAgent(scripted)
  const result = await agent.run('very obscure query')

  // Should return a graceful "no results" message, not an empty string
  expect(result.length).toBeGreaterThan(10)
  expect(result).toContain("couldn't find")
})
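
To make this repeatable, a small converter from a captured trace to a ScriptedLLM script helps. A sketch, assuming traces export as an ordered array of spans where LLM-call spans carry the raw response (the span shape here is illustrative, not Nexus's actual export format):

```typescript
type LLMResponse = { stop_reason: string; content: unknown[] }

interface TraceSpan {
  kind: 'llm_call' | 'tool_call'
  response?: LLMResponse  // present only on llm_call spans
}

// Pull the LLM responses out of a trace, in order, ready to replay via ScriptedLLM.
function scriptFromTrace(spans: TraceSpan[]): LLMResponse[] {
  return spans
    .filter((span): span is TraceSpan & { response: LLMResponse } =>
      span.kind === 'llm_call' && span.response !== undefined)
    .map(span => span.response)
}
```

With this, "turn trace abc123 into a regression test" becomes a one-liner: new ScriptedLLM(scriptFromTrace(trace.spans)).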

What to measure, not just test

Some agent properties can't be pass/fail tested — but they can be monitored over time. Track them in Nexus: tool-call error rates, end-to-end latency, token usage per task, and how often the agent hits its retry or iteration limits.
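
These trends can be computed from exported per-trace metadata. A sketch, assuming an illustrative per-trace record shape (not Nexus's actual export format):

```typescript
interface TraceRecord {
  toolCalls: number
  toolErrors: number
  latencyMs: number
}

// Aggregate per-trace metadata into the trends worth alerting on.
function summarize(traces: TraceRecord[]) {
  const calls = traces.reduce((n, t) => n + t.toolCalls, 0)
  const errors = traces.reduce((n, t) => n + t.toolErrors, 0)
  const latencies = traces.map(t => t.latencyMs).sort((a, b) => a - b)
  return {
    toolErrorRate: calls ? errors / calls : 0,
    p95LatencyMs: latencies[Math.floor(latencies.length * 0.95)] ?? 0,
  }
}
```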

The goal isn't to make tests that assert "the LLM gave the right answer." It's to make the deterministic parts of your agent bulletproof, and to have enough visibility into the non-deterministic parts that you catch regressions quickly.

Catch regressions with trace monitoring

Nexus gives you per-trace metadata, error rates, and latency trends — the observability layer that complements your unit tests.

Start free →