Observability for Ollama Agents: Tracing Local LLMs with Nexus
Ollama lets you run Llama 3, Mistral, and Phi-3 locally via a simple REST API — but local LLMs still suffer from latency variance, quality regressions, and token usage you can't see. Here's how to wrap Ollama calls with Nexus spans using both direct REST requests and the OpenAI-compatible endpoint, so you get trace-level visibility into every local model invocation.
What Ollama Is (and Why It Still Needs Observability)
Ollama is an open-source tool that lets you run large language models — Llama 3, Mistral, Phi-3, Gemma, and dozens more — locally via a simple REST API. You install Ollama, run ollama pull llama3, and a model is available at http://localhost:11434. No API keys, no cloud costs, no rate limits.
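Before instrumenting anything, it helps to confirm the Ollama server is actually reachable. A minimal sketch — the helper name is ours, but /api/tags is Ollama's real endpoint for listing pulled models:

```python
import requests

def list_local_models(base_url: str = "http://localhost:11434") -> list[str]:
    """Return the names of locally pulled models, or [] if Ollama isn't reachable."""
    try:
        resp = requests.get(f"{base_url}/api/tags", timeout=5)
        resp.raise_for_status()
        # /api/tags returns {"models": [{"name": "llama3:latest", ...}, ...]}
        return [m["name"] for m in resp.json().get("models", [])]
    except requests.RequestException:
        return []

if __name__ == "__main__":
    print(list_local_models())
```

An empty list here means either Ollama isn't running or no models have been pulled yet — worth checking before you start debugging spans.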
The appeal for agent builders is obvious: local models mean zero per-token cost, full data privacy, and no dependency on external APIs. But "free and private" doesn't mean "easy to operate." Local LLMs still exhibit:
- Latency variance — response times depend on your hardware, model size, and concurrent load; a 7B model on a MacBook Pro might take 800ms or 12 seconds for the same prompt
- Quality regressions — when you switch from llama3 to llama3:instruct, response quality may degrade silently
- Token usage blind spots — you're not paying per token, but context window exhaustion still breaks your agent; knowing your average eval_count is essential for sizing prompts
- Silent empty responses — Ollama occasionally returns an empty message.content on resource-constrained hardware; without tracing, your agent silently fails
Nexus solves this by wrapping every Ollama call in a span, recording model name, output token count, latency, and error details — so you have a full trace of every local LLM invocation.
Pattern 1: Direct Ollama REST API with Nexus Spans
The Ollama /api/chat endpoint returns a JSON response with the model name, generated content, and token counts. Wrapping it in a Nexus span is straightforward:
import os
import time
import requests
from nexus_client import NexusClient
nexus = NexusClient(
api_key=os.environ["NEXUS_API_KEY"],
agent_id="my-ollama-agent",
)
OLLAMA_URL = "http://localhost:11434"
def run_ollama_with_tracing(prompt: str, model: str = "llama3") -> str:
trace = nexus.start_trace(
name=f"ollama: {prompt[:60]}",
metadata={"model": model},
)
span = trace.add_span(
name="ollama-chat",
input={"prompt": prompt, "model": model},
)
start = time.time()
try:
resp = requests.post(
f"{OLLAMA_URL}/api/chat",
json={
"model": model,
"messages": [{"role": "user", "content": prompt}],
"stream": False,
},
timeout=120,
)
resp.raise_for_status()
data = resp.json()
content = data["message"]["content"]
num_predict = data.get("eval_count", 0) # output tokens
latency_ms = int((time.time() - start) * 1000)
if not content.strip():
# Empty response — record as an error span
span.end(
status="error",
output={
"error": "empty_response",
"model": model,
"latency_ms": latency_ms,
},
)
trace.end(status="error")
raise ValueError("Ollama returned an empty response")
span.end(
status="ok",
output={
"model": model,
"output_tokens": num_predict,
"latency_ms": latency_ms,
"response_preview": content[:200],
},
)
trace.end(status="success")
return content
except requests.RequestException as e:
span.end(status="error", output={"error": str(e), "model": model})
trace.end(status="error")
raise
Key things to record from the Ollama response:
- eval_count — output token count (Ollama's name for what OpenAI calls completion_tokens)
- prompt_eval_count — prompt token count (may be absent if Ollama served the prompt from KV cache)
- model — the exact model tag used (important when you test multiple quantizations)
The empty-response check is essential. On constrained hardware, Ollama occasionally returns a response with an empty message.content rather than an error. Without explicit detection, your agent loop silently receives empty input and may loop forever. Recording it as an error span ensures you see it in the Nexus dashboard.
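Because run_ollama_with_tracing raises ValueError on an empty response, a thin retry wrapper is often enough to recover from transient empties. A sketch under that assumption — call_with_empty_retry is our name, not a Nexus API:

```python
import time

def call_with_empty_retry(call_fn, prompt: str, retries: int = 2, backoff_s: float = 0.5) -> str:
    """Retry call_fn when it raises ValueError (the empty-response signal above).

    call_fn is any prompt -> str function, e.g. run_ollama_with_tracing.
    Each failed attempt still produces its own error span, so retries stay
    visible in the trace rather than being papered over.
    """
    last_err = None
    for attempt in range(retries + 1):
        try:
            return call_fn(prompt)
        except ValueError as err:
            last_err = err
            if attempt < retries:
                time.sleep(backoff_s * (attempt + 1))  # linear backoff between attempts
    raise last_err
```

On constrained hardware, one retry resolves most empty responses; if a prompt fails all attempts, the accumulated error spans in the dashboard tell you it is the prompt or model, not a blip.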
Pattern 2: Ollama as an OpenAI-Compatible Endpoint
Ollama exposes an OpenAI-compatible API at http://localhost:11434/v1. If your agent already uses the OpenAI Python SDK (or you want to reuse OpenAI-compatible tooling), you can point the SDK at Ollama and wrap it with Nexus the same way:
import os
import time
from openai import OpenAI
from nexus_client import NexusClient
nexus = NexusClient(
api_key=os.environ["NEXUS_API_KEY"],
agent_id="my-ollama-agent",
)
# Ollama exposes an OpenAI-compatible endpoint on port 11434
ollama_client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama", # required by the SDK but ignored by Ollama
)
def run_ollama_openai_compat(prompt: str, model: str = "llama3") -> str:
trace = nexus.start_trace(
name=f"ollama: {prompt[:60]}",
metadata={"model": model, "pattern": "openai-compat"},
)
span = trace.add_span(
name="ollama-chat",
input={"prompt": prompt, "model": model},
)
start = time.time()
try:
response = ollama_client.chat.completions.create(
model=model,
messages=[{"role": "user", "content": prompt}],
)
content = response.choices[0].message.content or ""
latency_ms = int((time.time() - start) * 1000)
# OpenAI-compat endpoint returns usage — but Ollama may omit prompt_tokens
output_tokens = response.usage.completion_tokens if response.usage else 0
if not content.strip():
span.end(
status="error",
output={"error": "empty_response", "model": model, "latency_ms": latency_ms},
)
trace.end(status="error")
raise ValueError("Ollama returned an empty response")
span.end(
status="ok",
output={
"model": model,
"output_tokens": output_tokens,
"latency_ms": latency_ms,
"finish_reason": response.choices[0].finish_reason,
},
)
trace.end(status="success")
return content
except Exception as e:
span.end(status="error", output={"error": str(e), "model": model})
trace.end(status="error")
raise
One caveat: Ollama's OpenAI-compatible endpoint may return null for usage.prompt_tokens when it serves from the KV cache. Always guard against None when reading usage fields. The direct REST API pattern is more reliable for token counting because eval_count and prompt_eval_count are always present in the response JSON.
Multi-Turn Agent Loops: Per-Call Spans
For agents that run multiple LLM calls in a loop, create one span per call and share a single trace across all iterations. This gives you a waterfall view in Nexus showing how token usage and latency accumulate across the full agent session:
def run_agent_loop(task: str, model: str = "llama3") -> str:
"""Multi-turn agent loop with per-call span tracking."""
trace = nexus.start_trace(
name=f"agent: {task[:60]}",
metadata={"model": model},
)
messages = [{"role": "user", "content": task}]
iteration = 0
try:
while iteration < 10:
iteration += 1
span = trace.add_span(
name=f"llm-call-{iteration}",
input={"iteration": iteration, "messages": len(messages)},
)
start = time.time()
resp = requests.post(
f"{OLLAMA_URL}/api/chat",
json={"model": model, "messages": messages, "stream": False},
timeout=120,
)
resp.raise_for_status()
data = resp.json()
content = data["message"]["content"]
num_predict = data.get("eval_count", 0)
latency_ms = int((time.time() - start) * 1000)
# prompt_eval_count is the prompt token count for this call; it may be
# absent when Ollama serves the prompt from its KV cache
prompt_tokens = data.get("prompt_eval_count", 0)
span.end(
status="ok",
output={
"output_tokens": num_predict,
"prompt_tokens": prompt_tokens,
"latency_ms": latency_ms,
},
)
messages.append({"role": "assistant", "content": content})
if "DONE" in content or iteration >= 10:
break
messages.append({"role": "user", "content": "Continue."})
trace.end(status="success")
return messages[-1]["content"]
except Exception as e:
# Close the open span too, so the failed call shows up in the waterfall
span.end(status="error", output={"error": str(e)})
trace.end(status="error")
raise
The prompt_eval_count field is particularly useful in multi-turn loops: it shows you how much context Ollama is processing on each call, which helps you detect context window growth before it causes truncation errors.
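You can turn that signal into an early warning by comparing prompt_eval_count against the model's context window on each iteration. A sketch — the limits below are assumptions for common tags (verify with ollama show <model> on your install), and context_pressure is a name we made up:

```python
# Assumed context windows for common tags — verify with `ollama show <model>`.
CONTEXT_LIMITS = {"llama3": 8192, "llama3:instruct": 8192, "phi3:mini": 4096}

def context_pressure(prompt_tokens: int, model: str, warn_ratio: float = 0.8) -> bool:
    """True when the prompt already fills warn_ratio of the model's context window."""
    limit = CONTEXT_LIMITS.get(model, 8192)
    return prompt_tokens >= int(limit * warn_ratio)
```

Calling this right after reading prompt_eval_count in the loop lets you log a warning span (or truncate history) before Ollama starts dropping context silently.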
What You'll See in the Nexus Dashboard
Once instrumented, every agent run appears as a trace in your Nexus dashboard. For Ollama agents, the most useful signals are:
- Latency per call — see which prompts take 1 second vs. 20 seconds on your hardware; useful for detecting when context growth starts slowing the model
- Token counts per span — track eval_count across iterations to spot runaway generation
- Error spans — empty responses, timeouts, and connection errors show up as red spans, not silent failures
- Model field in metadata — compare traces across model versions (llama3 vs. llama3:instruct vs. phi3:mini) by filtering on trace metadata
Getting Started
Install the Nexus Python client:
pip install nexus-client
Create a free account at nexus.keylightdigital.dev/pricing, grab an API key, and you'll have traces flowing from your local Ollama agent in under five minutes. The free tier covers 10,000 spans per month — easily enough for a development environment with frequent local model calls.
Ready to see inside your local LLM agents?
Start free — no credit card required. Up to 10,000 spans/month on the free tier.
Start monitoring for free →