Observability for Ollama Agents: Tracing Local LLMs with Nexus
Ollama lets you run Llama 3, Mistral, and Phi-3 locally via a simple REST API — but local LLMs still suffer from latency variance, quality regressions, and token usage you can't see. Here's how to wrap Ollama calls with Nexus spans using both direct REST requests and the OpenAI-compatible endpoint, so you get trace-level visibility into every local model invocation.
What Ollama Is (and Why It Still Needs Observability)
Ollama is an open-source tool that lets you run large language models — Llama 3, Mistral, Phi-3, Gemma, and dozens more — locally via a simple REST API. You install Ollama, run ollama pull llama3, and a model is available at http://localhost:11434. No API keys, no cloud costs, no rate limits.
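Before instrumenting anything, it helps to confirm the Ollama server is actually reachable. A minimal sketch — the helper name is ours, but /api/tags is Ollama's real endpoint for listing pulled models:

```python
import requests

def list_local_models(base_url: str = "http://localhost:11434") -> list[str]:
    """Return the names of locally pulled models, or [] if Ollama isn't reachable."""
    try:
        resp = requests.get(f"{base_url}/api/tags", timeout=5)
        resp.raise_for_status()
        # /api/tags returns {"models": [{"name": "llama3:latest", ...}, ...]}
        return [m["name"] for m in resp.json().get("models", [])]
    except requests.RequestException:
        return []

if __name__ == "__main__":
    print(list_local_models())
```

An empty list here means either Ollama isn't running or no models have been pulled yet — worth checking before you start debugging spans.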
The appeal for agent builders is obvious: local models mean zero per-token cost, full data privacy, and no dependency on external APIs. But "free and private" doesn't mean "easy to operate." Local LLMs still exhibit:
- Latency variance — response times depend on your hardware, model size, and concurrent load; a 7B model on a MacBook Pro might take 800ms or 12 seconds for the same prompt
- Quality regressions — when you switch from llama3 to llama3:instruct, response quality may degrade silently
- Token usage blind spots — you're not paying per token, but context window exhaustion still breaks your agent; knowing your average eval_count is essential for sizing prompts
- Silent empty responses — Ollama occasionally returns an empty message.content on resource-constrained hardware; without tracing, your agent silently fails
Nexus solves this by wrapping every Ollama call in a span, recording model name, output token count, latency, and error details — so you have a full trace of every local LLM invocation.
Pattern 1: Direct Ollama REST API with Nexus Spans
The Ollama /api/chat endpoint returns a JSON response with the model name, generated content, and token counts. Wrapping it in a Nexus span is straightforward:
import os
import time
import requests
from nexus_client import NexusClient
nexus = NexusClient(
api_key=os.environ["NEXUS_API_KEY"],
agent_id="my-ollama-agent",
)
OLLAMA_URL = "http://localhost:11434"
def run_ollama_with_tracing(prompt: str, model: str = "llama3") -> str:
trace = nexus.start_trace(
name=f"ollama: {prompt[:60]}",
metadata={"model": model},
)
span = trace.add_span(
name="ollama-chat",
input={"prompt": prompt, "model": model},
)
start = time.time()
try:
resp = requests.post(
f"{OLLAMA_URL}/api/chat",
json={
"model": model,
"messages": [{"role": "user", "content": prompt}],
"stream": False,
},
timeout=120,
)
resp.raise_for_status()
data = resp.json()
content = data["message"]["content"]
num_predict = data.get("eval_count", 0) # output tokens
latency_ms = int((time.time() - start) * 1000)
if not content.strip():
# Empty response — record as an error span
span.end(
status="error",
output={
"error": "empty_response",
"model": model,
"latency_ms": latency_ms,
},
)
trace.end(status="error")
raise ValueError("Ollama returned an empty response")
span.end(
status="ok",
output={
"model": model,
"output_tokens": num_predict,
"latency_ms": latency_ms,
"response_preview": content[:200],
},
)
trace.end(status="success")
return content
except requests.RequestException as e:
span.end(status="error", output={"error": str(e), "model": model})
trace.end(status="error")
raise
Key things to record from the Ollama response:
- eval_count — output token count (Ollama's name for what OpenAI calls completion_tokens)
- prompt_eval_count — prompt token count (may be absent if Ollama served the prompt from KV cache)
- model — the exact model tag used (important when you test multiple quantizations)
The empty-response check is essential. On constrained hardware, Ollama occasionally returns a response with an empty message.content rather than an error. Without explicit detection, your agent loop silently receives empty input and may loop forever. Recording it as an error span ensures you see it in the Nexus dashboard.
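Because run_ollama_with_tracing raises ValueError on an empty response, a thin retry wrapper is often enough to recover from transient empties. A sketch under that assumption — call_with_empty_retry is our name, not a Nexus API:

```python
import time

def call_with_empty_retry(call_fn, prompt: str, retries: int = 2, backoff_s: float = 0.5) -> str:
    """Retry call_fn when it raises ValueError (the empty-response signal above).

    call_fn is any prompt -> str function, e.g. run_ollama_with_tracing.
    Each failed attempt still produces its own error span, so retries stay
    visible in the trace rather than being papered over.
    """
    last_err = None
    for attempt in range(retries + 1):
        try:
            return call_fn(prompt)
        except ValueError as err:
            last_err = err
            if attempt < retries:
                time.sleep(backoff_s * (attempt + 1))  # linear backoff between attempts
    raise last_err
```

On constrained hardware, one retry resolves most empty responses; if a prompt fails all attempts, the accumulated error spans in the dashboard tell you it is the prompt or model, not a blip.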
Pattern 2: Ollama as an OpenAI-Compatible Endpoint
Ollama exposes an OpenAI-compatible API at http://localhost:11434/v1. If your agent already uses the OpenAI Python SDK (or you want to reuse OpenAI-compatible tooling), you can point the SDK at Ollama and wrap it with Nexus the same way:
import os
import time
from openai import OpenAI
from nexus_client import NexusClient
nexus = NexusClient(
api_key=os.environ["NEXUS_API_KEY"],
agent_id="my-ollama-agent",
)
# Ollama exposes an OpenAI-compatible endpoint on port 11434
ollama_client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama", # required by the SDK but ignored by Ollama
)
def run_ollama_openai_compat(prompt: str, model: str = "llama3") -> str:
trace = nexus.start_trace(
name=f"ollama: {prompt[:60]}",
metadata={"model": model, "pattern": "openai-compat"},
)
span = trace.add_span(
name="ollama-chat",
input={"prompt": prompt, "model": model},
)
start = time.time()
try:
response = ollama_client.chat.completions.create(
model=model,
messages=[{"role": "user", "content": prompt}],
)
content = response.choices[0].message.content or ""
latency_ms = int((time.time() - start) * 1000)
# OpenAI-compat endpoint returns usage — but Ollama may omit prompt_tokens
output_tokens = response.usage.completion_tokens if response.usage else 0
if not content.strip():
span.end(
status="error",
output={"error": "empty_response", "model": model, "latency_ms": latency_ms},
)
trace.end(status="error")
raise ValueError("Ollama returned an empty response")
span.end(
status="ok",
output={
"model": model,
"output_tokens": output_tokens,
"latency_ms": latency_ms,
"finish_reason": response.choices[0].finish_reason,
},
)
trace.end(status="success")
return content
except Exception as e:
span.end(status="error", output={"error": str(e), "model": model})
trace.end(status="error")
raise
One caveat: Ollama's OpenAI-compatible endpoint may return null for usage.prompt_tokens when it serves from the KV cache. Always guard against None when reading usage fields. The direct REST API pattern is more reliable for token counting because eval_count and prompt_eval_count are always present in the response JSON.
Multi-Turn Agent Loops: Per-Call Spans
For agents that run multiple LLM calls in a loop, create one span per call and share a single trace across all iterations. This gives you a waterfall view in Nexus showing how token usage and latency accumulate across the full agent session:
def run_agent_loop(task: str, model: str = "llama3") -> str:
"""Multi-turn agent loop with per-call span tracking."""
trace = nexus.start_trace(
name=f"agent: {task[:60]}",
metadata={"model": model},
)
messages = [{"role": "user", "content": task}]
iteration = 0
try:
while iteration < 10:
iteration += 1
span = trace.add_span(
name=f"llm-call-{iteration}",
input={"iteration": iteration, "messages": len(messages)},
)
start = time.time()
resp = requests.post(
f"{OLLAMA_URL}/api/chat",
json={"model": model, "messages": messages, "stream": False},
timeout=120,
)
resp.raise_for_status()
data = resp.json()
content = data["message"]["content"]
num_predict = data.get("eval_count", 0)
latency_ms = int((time.time() - start) * 1000)
# prompt_eval_count is the prompt token count for this call; it may be
# absent when Ollama serves the prompt from its KV cache
prompt_tokens = data.get("prompt_eval_count", 0)
span.end(
status="ok",
output={
"output_tokens": num_predict,
"prompt_tokens": prompt_tokens,
"latency_ms": latency_ms,
},
)
messages.append({"role": "assistant", "content": content})
if "DONE" in content or iteration >= 10:
break
messages.append({"role": "user", "content": "Continue."})
trace.end(status="success")
return messages[-1]["content"]
except Exception as e:
# Close the open span too, so the failed call shows up in the waterfall
span.end(status="error", output={"error": str(e)})
trace.end(status="error")
raise
The prompt_eval_count field is particularly useful in multi-turn loops: it shows you how much context Ollama is processing on each call, which helps you detect context window growth before it causes truncation errors.
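You can turn that signal into an early warning by comparing prompt_eval_count against the model's context window on each iteration. A sketch — the limits below are assumptions for common tags (verify with ollama show <model> on your install), and context_pressure is a name we made up:

```python
# Assumed context windows for common tags — verify with `ollama show <model>`.
CONTEXT_LIMITS = {"llama3": 8192, "llama3:instruct": 8192, "phi3:mini": 4096}

def context_pressure(prompt_tokens: int, model: str, warn_ratio: float = 0.8) -> bool:
    """True when the prompt already fills warn_ratio of the model's context window."""
    limit = CONTEXT_LIMITS.get(model, 8192)
    return prompt_tokens >= int(limit * warn_ratio)
```

Calling this right after reading prompt_eval_count in the loop lets you log a warning span (or truncate history) before Ollama starts dropping context silently.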
What You'll See in the Nexus Dashboard
Once instrumented, every agent run appears as a trace in your Nexus dashboard. For Ollama agents, the most useful signals are:
- Latency per call — see which prompts take 1 second vs. 20 seconds on your hardware; useful for detecting when context growth starts slowing the model
- Token counts per span — track eval_count across iterations to spot runaway generation
- Error spans — empty responses, timeouts, and connection errors show up as red spans, not silent failures
- Model field in metadata — compare traces across model versions (llama3 vs. llama3:instruct vs. phi3:mini) by filtering on trace metadata
Getting Started
Install the Nexus Python client:
pip install nexus-client
Create a free account at nexus.keylightdigital.dev/pricing, grab an API key, and you'll have traces flowing from your local Ollama agent in under five minutes. The free tier covers 10,000 spans per month — easily enough for a development environment with frequent local model calls.
Ready to see inside your local LLM agents?
Start free — no credit card required. Up to 10,000 spans/month on the free tier.
Start monitoring for free →