2026-06-10 · 6 min read

Observability for Groq API Agents: Tracing Ultra-Fast LLM Calls with Nexus

Groq's LPU inference delivers sub-second response times for Llama 3, Mixtral, and Gemma — but fast doesn't mean free from operational concerns. Token costs accumulate, rate limits hit silently, and latency still varies by model and request size. Here's how to wrap every Groq API call in a Nexus span for full trace-level visibility.

What Groq Is and Why It Still Needs Observability

Groq is an inference provider built on custom LPU (Language Processing Unit) hardware that delivers response times in the hundreds of milliseconds — often 5–10× faster than GPU-based providers for the same model. For interactive agent applications where latency matters, Groq is frequently the right choice for Llama 3, Mixtral, and Gemma models.

But "fast" doesn't eliminate operational concerns:

- Token costs accumulate across calls, especially in multi-turn agent loops.
- Rate limits (per-minute and per-day, enforced per model) hit silently unless you track them.
- Latency still varies by model and request size.

The Core Pattern: Wrap groq.chat.completions.create()

The Groq Python SDK is OpenAI-compatible — response.usage.prompt_tokens and response.usage.completion_tokens are always present in non-streaming responses:

import os
import time
from groq import Groq
from nexus_client import NexusClient

nexus = NexusClient(api_key=os.environ["NEXUS_API_KEY"], agent_id="my-groq-agent")
groq_client = Groq(api_key=os.environ["GROQ_API_KEY"])

def chat(prompt: str, model: str = "llama3-8b-8192") -> str:
    trace = nexus.start_trace(
        name=f"groq: {prompt[:60]}",
        metadata={"model": model},
    )
    span = trace.add_span(name="groq-chat", input={"prompt": prompt, "model": model})
    start = time.time()
    try:
        response = groq_client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
    except Exception as e:
        span.end(status="error", output={"error": str(e), "model": model})
        trace.end(status="error")
        raise

    content = response.choices[0].message.content or ""
    latency_ms = int((time.time() - start) * 1000)

    if not content.strip():
        # End the span here, outside the try block — raising inside it would
        # hit the generic handler and end the span and trace a second time.
        span.end(status="error", output={"error": "empty_response", "model": model, "latency_ms": latency_ms})
        trace.end(status="error")
        raise ValueError("Groq returned empty response")

    span.end(status="ok", output={
        "model": model,
        "input_tokens": response.usage.prompt_tokens,
        "output_tokens": response.usage.completion_tokens,
        "latency_ms": latency_ms,
    })
    trace.end(status="success")
    return content

Handling Rate Limit Errors

Groq rate limits are per-minute and per-day, separately enforced per model. When you hit them, the SDK raises groq.RateLimitError. Recording each retry attempt as a separate span gives you visibility into how often rate limits affect your agents:

import groq

def chat_with_retry(prompt: str, model: str = "llama3-8b-8192", max_retries: int = 3) -> str:
    trace = nexus.start_trace(name=f"groq: {prompt[:60]}", metadata={"model": model})
    for attempt in range(max_retries):
        span = trace.add_span(
            name=f"groq-chat-attempt-{attempt + 1}",
            input={"prompt": prompt, "model": model, "attempt": attempt + 1},
        )
        start = time.time()
        try:
            response = groq_client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            content = response.choices[0].message.content or ""
            span.end(status="ok", output={
                "input_tokens": response.usage.prompt_tokens,
                "output_tokens": response.usage.completion_tokens,
                "latency_ms": int((time.time() - start) * 1000),
            })
            trace.end(status="success")
            return content
        except groq.RateLimitError as e:
            span.end(status="error", output={
                "error": "rate_limit",
                "retry_after": getattr(e, "retry_after", None),
                "latency_ms": int((time.time() - start) * 1000),
            })
            if attempt == max_retries - 1:
                trace.end(status="error")
                raise
            time.sleep(2 ** attempt)  # exponential backoff
        except Exception as e:
            # Non-rate-limit failures: end the span and trace so neither is left open.
            span.end(status="error", output={"error": str(e)})
            trace.end(status="error")
            raise
    trace.end(status="error")
    raise RuntimeError("Max retries exceeded")

Multi-Turn Agent Loops

For agents that make multiple Groq calls in a loop, track token accumulation across iterations by summing token counts and recording the total on the trace at the end:

def agent_loop(task: str, model: str = "llama3-70b-8192") -> str:
    """Multi-turn agent loop with per-call span tracking."""
    trace = nexus.start_trace(name=f"agent: {task[:60]}", metadata={"model": model})
    messages = [{"role": "user", "content": task}]
    iteration = 0
    total_input_tokens = 0
    total_output_tokens = 0

    try:
        while iteration < 8:
            iteration += 1
            span = trace.add_span(
                name=f"llm-call-{iteration}",
                input={"iteration": iteration, "messages": len(messages)},
            )
            start = time.time()
            response = groq_client.chat.completions.create(
                model=model,
                messages=messages,
            )
            content = response.choices[0].message.content or ""
            total_input_tokens += response.usage.prompt_tokens
            total_output_tokens += response.usage.completion_tokens

            span.end(status="ok", output={
                "input_tokens": response.usage.prompt_tokens,
                "output_tokens": response.usage.completion_tokens,
                "latency_ms": int((time.time() - start) * 1000),
            })
            messages.append({"role": "assistant", "content": content})
            if "DONE" in content or iteration >= 8:
                break
            messages.append({"role": "user", "content": "Continue."})

        trace.end(status="success", output={
            "total_input_tokens": total_input_tokens,
            "total_output_tokens": total_output_tokens,
            "iterations": iteration,
        })
        return messages[-1]["content"]
    except Exception:
        trace.end(status="error")
        raise

Recording total_input_tokens at the trace level (rather than only per span) lets you see the full context cost of each agent run in the trace list view — useful for identifying unusually expensive sessions.
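Token totals become directly actionable once you convert them to dollars. A minimal sketch — the per-million-token prices below are placeholders, not Groq's actual rates, so check current pricing before relying on them:

```python
# Hypothetical per-million-token prices in USD; substitute current Groq pricing.
PRICING = {
    "llama3-8b-8192": {"input": 0.05, "output": 0.08},
    "llama3-70b-8192": {"input": 0.59, "output": 0.79},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Rough USD cost of one agent run from the token totals recorded on the trace."""
    rates = PRICING[model]
    return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000
```

Attaching the estimate to the trace output alongside the token totals makes expensive sessions easy to sort on.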

Popular Groq Models and When to Use Them

The examples above use llama3-8b-8192 for quick single-turn calls and llama3-70b-8192 for the heavier agent loop; Groq also serves Mixtral and Gemma variants. Whichever you choose, tag your spans with the model name (as shown above) so you can compare latency and token efficiency across models in the Nexus dashboard by filtering on trace metadata.
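The same tags also support a quick offline comparison if you log span outputs locally. A sketch, assuming a list of span-output dicts shaped like the ones recorded above:

```python
from collections import defaultdict

def mean_latency_by_model(spans: list[dict]) -> dict[str, float]:
    """Average latency_ms per model across recorded span outputs."""
    by_model: dict[str, list[float]] = defaultdict(list)
    for s in spans:
        by_model[s["model"]].append(s["latency_ms"])
    return {model: sum(v) / len(v) for model, v in by_model.items()}
```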

Get Started

Install the Nexus Python client (pip install nexus-client groq), create a free account at nexus.keylightdigital.dev/pricing, and you'll have traces flowing from your Groq agent in under five minutes.

Ready to see inside your Groq agents?

Start free — no credit card required. Up to 10,000 spans/month on the free tier.

Start monitoring for free →