Observability for Groq API Agents: Tracing Ultra-Fast LLM Calls with Nexus
Groq's LPU inference delivers sub-second response times for Llama 3, Mixtral, and Gemma — but fast doesn't mean free from operational concerns. Token costs accumulate, rate limits hit silently, and latency still varies by model and request size. Here's how to wrap every Groq API call in a Nexus span for full trace-level visibility.
What Groq Is and Why It Still Needs Observability
Groq is an inference provider built on custom LPU (Language Processing Unit) hardware that delivers response times in the hundreds of milliseconds — often 5–10× faster than GPU-based providers for the same model. For interactive agent applications where latency matters, Groq is frequently the right choice for Llama 3, Mixtral, and Gemma models.
But "fast" doesn't eliminate operational concerns:
- Token costs still accumulate — Groq is not free; Llama 3 70B at $0.59/M tokens adds up quickly in high-volume agents
- Rate limits hit silently — Groq enforces per-minute and per-day token limits; when you exceed them, requests fail with 429 errors that look identical to other failures without span metadata
- Latency varies by model — Llama 3 8B and Llama 3 70B have very different latency profiles; tracking this in spans lets you measure the tradeoff when switching models
- Multi-turn context growth — Groq's fast responses encourage more iterations, which means context windows fill faster; prompt_tokens growing across turns is your signal
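One way to turn that last signal into an alert is a small helper that compares prompt_tokens against the model's context window (the 8192 and 32768 limits are encoded in the model names). A minimal sketch — the CONTEXT_WINDOWS table and warn threshold are illustrative, not part of the Groq or Nexus APIs:

```python
# Sketch: flag when prompt_tokens approaches the model's context window.
CONTEXT_WINDOWS = {
    "llama3-8b-8192": 8192,
    "llama3-70b-8192": 8192,
    "mixtral-8x7b-32768": 32768,
}

def context_usage(prompt_tokens: int, model: str, warn_threshold: float = 0.8) -> tuple[float, bool]:
    """Return (fraction of context used, whether it crosses the warn threshold)."""
    window = CONTEXT_WINDOWS.get(model, 8192)
    fraction = prompt_tokens / window
    return fraction, fraction >= warn_threshold
```

Calling this after each turn and recording the fraction in span output lets you spot runs that are about to truncate context.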
The Core Pattern: Wrap groq.chat.completions.create()
The Groq Python SDK is OpenAI-compatible — response.usage.prompt_tokens and response.usage.completion_tokens are always present in non-streaming responses:
```python
import os
import time

from groq import Groq
from nexus_client import NexusClient

nexus = NexusClient(api_key=os.environ["NEXUS_API_KEY"], agent_id="my-groq-agent")
groq_client = Groq(api_key=os.environ["GROQ_API_KEY"])


def chat(prompt: str, model: str = "llama3-8b-8192") -> str:
    trace = nexus.start_trace(
        name=f"groq: {prompt[:60]}",
        metadata={"model": model},
    )
    span = trace.add_span(name="groq-chat", input={"prompt": prompt, "model": model})
    start = time.time()
    try:
        response = groq_client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
    except Exception as e:
        span.end(status="error", output={"error": str(e), "model": model})
        trace.end(status="error")
        raise
    content = response.choices[0].message.content or ""
    latency_ms = int((time.time() - start) * 1000)
    if not content.strip():
        # Outside the try block, so the span and trace are ended exactly once
        # before the error propagates.
        span.end(status="error", output={"error": "empty_response", "model": model, "latency_ms": latency_ms})
        trace.end(status="error")
        raise ValueError("Groq returned empty response")
    span.end(status="ok", output={
        "model": model,
        "input_tokens": response.usage.prompt_tokens,
        "output_tokens": response.usage.completion_tokens,
        "latency_ms": latency_ms,
    })
    trace.end(status="success")
    return content
```
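Since each span already records input_tokens and output_tokens, you can turn them into a per-call dollar estimate. A sketch using the $0.59/M Llama 3 70B figure from above — the 8B rate here is a placeholder, not an official number, and real pricing may split input and output rates, so check Groq's pricing page:

```python
# Sketch: estimate per-call cost from the usage counts recorded in each span.
PRICE_PER_MTOK = {
    "llama3-70b-8192": 0.59,  # figure cited above
    "llama3-8b-8192": 0.05,   # placeholder, not an official rate
}

def estimate_cost(input_tokens: int, output_tokens: int, model: str) -> float:
    """Flat-rate cost estimate in dollars (real pricing may differ by direction)."""
    rate = PRICE_PER_MTOK.get(model, 0.59)
    return (input_tokens + output_tokens) / 1_000_000 * rate
```

Attaching the estimate to span output makes expensive calls easy to sort by in the trace view.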
Handling Rate Limit Errors
Groq rate limits are per-minute and per-day, separately enforced per model. When you hit them, the SDK raises groq.RateLimitError. Recording each retry attempt as a separate span gives you visibility into how often rate limits affect your agents:
```python
import time

import groq


def chat_with_retry(prompt: str, model: str = "llama3-8b-8192", max_retries: int = 3) -> str:
    trace = nexus.start_trace(name=f"groq: {prompt[:60]}", metadata={"model": model})
    for attempt in range(max_retries):
        span = trace.add_span(
            name=f"groq-chat-attempt-{attempt + 1}",
            input={"prompt": prompt, "model": model, "attempt": attempt + 1},
        )
        start = time.time()
        try:
            response = groq_client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            content = response.choices[0].message.content or ""
            span.end(status="ok", output={
                "input_tokens": response.usage.prompt_tokens,
                "output_tokens": response.usage.completion_tokens,
                "latency_ms": int((time.time() - start) * 1000),
            })
            trace.end(status="success")
            return content
        except groq.RateLimitError as e:
            span.end(status="error", output={
                "error": "rate_limit",
                "retry_after": getattr(e, "retry_after", None),
                "latency_ms": int((time.time() - start) * 1000),
            })
            if attempt == max_retries - 1:
                trace.end(status="error")
                raise
            time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s
        except Exception as e:
            # Non-rate-limit failures should still close the span and trace.
            span.end(status="error", output={"error": str(e), "model": model})
            trace.end(status="error")
            raise
    trace.end(status="error")
    raise RuntimeError("Max retries exceeded")
```
Multi-Turn Agent Loops
For agents that make multiple Groq calls in a loop, track token accumulation across iterations by summing token counts and recording the total on the trace at the end:
```python
def agent_loop(task: str, model: str = "llama3-70b-8192") -> str:
    """Multi-turn agent loop with per-call span tracking."""
    trace = nexus.start_trace(name=f"agent: {task[:60]}", metadata={"model": model})
    messages = [{"role": "user", "content": task}]
    iteration = 0
    total_input_tokens = 0
    total_output_tokens = 0
    span = None
    try:
        while iteration < 8:
            iteration += 1
            span = trace.add_span(
                name=f"llm-call-{iteration}",
                input={"iteration": iteration, "messages": len(messages)},
            )
            start = time.time()
            response = groq_client.chat.completions.create(
                model=model,
                messages=messages,
            )
            content = response.choices[0].message.content or ""
            total_input_tokens += response.usage.prompt_tokens
            total_output_tokens += response.usage.completion_tokens
            span.end(status="ok", output={
                "input_tokens": response.usage.prompt_tokens,
                "output_tokens": response.usage.completion_tokens,
                "latency_ms": int((time.time() - start) * 1000),
            })
            span = None
            messages.append({"role": "assistant", "content": content})
            if "DONE" in content or iteration >= 8:
                break
            messages.append({"role": "user", "content": "Continue."})
        trace.end(status="success", output={
            "total_input_tokens": total_input_tokens,
            "total_output_tokens": total_output_tokens,
            "iterations": iteration,
        })
        return messages[-1]["content"]
    except Exception as e:
        if span is not None:
            # Close the in-flight span so the failed call is visible in the trace.
            span.end(status="error", output={"error": str(e)})
        trace.end(status="error")
        raise
```
Recording total_input_tokens at the trace level (rather than only per span) lets you see the full context cost of each agent run in the trace list view — useful for identifying unusually expensive sessions.
Popular Groq Models and When to Use Them
- llama3-8b-8192 — fastest, lowest cost; good for classification, routing, and simple generation tasks
- llama3-70b-8192 — stronger reasoning; use for multi-step tasks and complex tool use
- mixtral-8x7b-32768 — 32K context window; useful for long-document tasks
- gemma2-9b-it — Google's Gemma 2; competitive quality for its size
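The guidance above can be encoded as a simple routing table so your agent defaults to the cheapest model that handles each task type. A sketch — the task-type labels are illustrative, not a Groq API:

```python
# Sketch: route each task to the cheapest suitable model per the guidance above.
MODEL_FOR_TASK = {
    "classification": "llama3-8b-8192",
    "routing": "llama3-8b-8192",
    "reasoning": "llama3-70b-8192",
    "long_document": "mixtral-8x7b-32768",
}

def pick_model(task_type: str) -> str:
    """Fall back to the fast, cheap 8B model for unrecognized task types."""
    return MODEL_FOR_TASK.get(task_type, "llama3-8b-8192")
```

Passing the chosen model into trace metadata, as in the examples above, then lets you compare cost and latency per task type.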
Tag your spans with the model name (as shown above) so you can compare latency and token efficiency across models in the Nexus dashboard by filtering on trace metadata.
Get Started
Install the Nexus Python client (pip install nexus-client groq), create a free account at nexus.keylightdigital.dev/pricing, and you'll have traces flowing from your Groq agent in under five minutes.
Ready to see inside your Groq agents?
Start free — no credit card required. Up to 10,000 spans/month on the free tier.
Start monitoring for free →