Observability for Groq API Agents: Tracing Ultra-Fast LLM Calls with Nexus
Groq's LPU inference delivers sub-second response times for Llama 3, Mixtral, and Gemma — but fast doesn't mean free from operational concerns. Token costs accumulate, rate limits hit silently, and latency still varies by model and request size. Here's how to wrap every Groq API call in a Nexus span for full trace-level visibility.
What Groq Is and Why It Still Needs Observability
Groq is an inference provider built on custom LPU (Language Processing Unit) hardware that delivers response times in the hundreds of milliseconds — often 5–10× faster than GPU-based providers for the same model. For interactive agent applications where latency matters, Groq is frequently the right choice for Llama 3, Mixtral, and Gemma models.
But "fast" doesn't eliminate operational concerns:
- Token costs still accumulate — Groq is not free; Llama 3 70B at $0.59/M tokens adds up quickly in high-volume agents
- Rate limits hit silently — Groq enforces per-minute and per-day token limits; when you exceed them, requests fail with 429 errors that look identical to other failures without span metadata
- Latency varies by model — Llama 3 8B and Llama 3 70B have very different latency profiles; tracking this in spans lets you measure the tradeoff when switching models
- Multi-turn context growth — Groq's fast responses encourage more iterations, which means context windows fill faster; prompt_tokens growing across turns is your signal
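One way to turn that last signal into an alert is a small helper that compares prompt_tokens against the model's context window (the 8192 and 32768 limits are encoded in the model names). A minimal sketch — the CONTEXT_WINDOWS table and warn threshold are illustrative, not part of the Groq or Nexus APIs:

```python
# Sketch: flag when prompt_tokens approaches the model's context window.
CONTEXT_WINDOWS = {
    "llama3-8b-8192": 8192,
    "llama3-70b-8192": 8192,
    "mixtral-8x7b-32768": 32768,
}

def context_usage(prompt_tokens: int, model: str, warn_threshold: float = 0.8) -> tuple[float, bool]:
    """Return (fraction of context used, whether it crosses the warn threshold)."""
    window = CONTEXT_WINDOWS.get(model, 8192)
    fraction = prompt_tokens / window
    return fraction, fraction >= warn_threshold
```

Calling this after each turn and recording the fraction in span output lets you spot runs that are about to truncate context.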
The Core Pattern: Wrap groq.chat.completions.create()
The Groq Python SDK is OpenAI-compatible — response.usage.prompt_tokens and response.usage.completion_tokens are always present in non-streaming responses:
```python
import os
import time

from groq import Groq
from nexus_client import NexusClient

nexus = NexusClient(api_key=os.environ["NEXUS_API_KEY"], agent_id="my-groq-agent")
groq_client = Groq(api_key=os.environ["GROQ_API_KEY"])


def chat(prompt: str, model: str = "llama3-8b-8192") -> str:
    trace = nexus.start_trace(
        name=f"groq: {prompt[:60]}",
        metadata={"model": model},
    )
    span = trace.add_span(name="groq-chat", input={"prompt": prompt, "model": model})
    start = time.time()
    try:
        response = groq_client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
    except Exception as e:
        span.end(status="error", output={"error": str(e), "model": model})
        trace.end(status="error")
        raise
    content = response.choices[0].message.content or ""
    latency_ms = int((time.time() - start) * 1000)
    if not content.strip():
        # Outside the try block, so the span and trace are ended exactly once
        # before the error propagates.
        span.end(status="error", output={"error": "empty_response", "model": model, "latency_ms": latency_ms})
        trace.end(status="error")
        raise ValueError("Groq returned empty response")
    span.end(status="ok", output={
        "model": model,
        "input_tokens": response.usage.prompt_tokens,
        "output_tokens": response.usage.completion_tokens,
        "latency_ms": latency_ms,
    })
    trace.end(status="success")
    return content
```
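Since each span already records input_tokens and output_tokens, you can turn them into a per-call dollar estimate. A sketch using the $0.59/M Llama 3 70B figure from above — the 8B rate here is a placeholder, not an official number, and real pricing may split input and output rates, so check Groq's pricing page:

```python
# Sketch: estimate per-call cost from the usage counts recorded in each span.
PRICE_PER_MTOK = {
    "llama3-70b-8192": 0.59,  # figure cited above
    "llama3-8b-8192": 0.05,   # placeholder, not an official rate
}

def estimate_cost(input_tokens: int, output_tokens: int, model: str) -> float:
    """Flat-rate cost estimate in dollars (real pricing may differ by direction)."""
    rate = PRICE_PER_MTOK.get(model, 0.59)
    return (input_tokens + output_tokens) / 1_000_000 * rate
```

Attaching the estimate to span output makes expensive calls easy to sort by in the trace view.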
Handling Rate Limit Errors
Groq rate limits are per-minute and per-day, separately enforced per model. When you hit them, the SDK raises groq.RateLimitError. Recording each retry attempt as a separate span gives you visibility into how often rate limits affect your agents:
```python
import time

import groq


def chat_with_retry(prompt: str, model: str = "llama3-8b-8192", max_retries: int = 3) -> str:
    trace = nexus.start_trace(name=f"groq: {prompt[:60]}", metadata={"model": model})
    for attempt in range(max_retries):
        span = trace.add_span(
            name=f"groq-chat-attempt-{attempt + 1}",
            input={"prompt": prompt, "model": model, "attempt": attempt + 1},
        )
        start = time.time()
        try:
            response = groq_client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            content = response.choices[0].message.content or ""
            span.end(status="ok", output={
                "input_tokens": response.usage.prompt_tokens,
                "output_tokens": response.usage.completion_tokens,
                "latency_ms": int((time.time() - start) * 1000),
            })
            trace.end(status="success")
            return content
        except groq.RateLimitError as e:
            span.end(status="error", output={
                "error": "rate_limit",
                "retry_after": getattr(e, "retry_after", None),
                "latency_ms": int((time.time() - start) * 1000),
            })
            if attempt == max_retries - 1:
                trace.end(status="error")
                raise
            time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s
        except Exception as e:
            # Non-rate-limit failures should still close the span and trace.
            span.end(status="error", output={"error": str(e), "model": model})
            trace.end(status="error")
            raise
    trace.end(status="error")
    raise RuntimeError("Max retries exceeded")
```
Multi-Turn Agent Loops
For agents that make multiple Groq calls in a loop, track token accumulation across iterations by summing token counts and recording the total on the trace at the end:
```python
def agent_loop(task: str, model: str = "llama3-70b-8192") -> str:
    """Multi-turn agent loop with per-call span tracking."""
    trace = nexus.start_trace(name=f"agent: {task[:60]}", metadata={"model": model})
    messages = [{"role": "user", "content": task}]
    iteration = 0
    total_input_tokens = 0
    total_output_tokens = 0
    span = None
    try:
        while iteration < 8:
            iteration += 1
            span = trace.add_span(
                name=f"llm-call-{iteration}",
                input={"iteration": iteration, "messages": len(messages)},
            )
            start = time.time()
            response = groq_client.chat.completions.create(
                model=model,
                messages=messages,
            )
            content = response.choices[0].message.content or ""
            total_input_tokens += response.usage.prompt_tokens
            total_output_tokens += response.usage.completion_tokens
            span.end(status="ok", output={
                "input_tokens": response.usage.prompt_tokens,
                "output_tokens": response.usage.completion_tokens,
                "latency_ms": int((time.time() - start) * 1000),
            })
            span = None
            messages.append({"role": "assistant", "content": content})
            if "DONE" in content or iteration >= 8:
                break
            messages.append({"role": "user", "content": "Continue."})
        trace.end(status="success", output={
            "total_input_tokens": total_input_tokens,
            "total_output_tokens": total_output_tokens,
            "iterations": iteration,
        })
        return messages[-1]["content"]
    except Exception as e:
        if span is not None:
            # Close the in-flight span so the failed call is visible in the trace.
            span.end(status="error", output={"error": str(e)})
        trace.end(status="error")
        raise
```
Recording total_input_tokens at the trace level (rather than only per span) lets you see the full context cost of each agent run in the trace list view — useful for identifying unusually expensive sessions.
Popular Groq Models and When to Use Them
- llama3-8b-8192 — fastest, lowest cost; good for classification, routing, and simple generation tasks
- llama3-70b-8192 — stronger reasoning; use for multi-step tasks and complex tool use
- mixtral-8x7b-32768 — 32K context window; useful for long-document tasks
- gemma2-9b-it — Google's Gemma 2; competitive quality for its size
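The guidance above can be encoded as a simple routing table so your agent defaults to the cheapest model that handles each task type. A sketch — the task-type labels are illustrative, not a Groq API:

```python
# Sketch: route each task to the cheapest suitable model per the guidance above.
MODEL_FOR_TASK = {
    "classification": "llama3-8b-8192",
    "routing": "llama3-8b-8192",
    "reasoning": "llama3-70b-8192",
    "long_document": "mixtral-8x7b-32768",
}

def pick_model(task_type: str) -> str:
    """Fall back to the fast, cheap 8B model for unrecognized task types."""
    return MODEL_FOR_TASK.get(task_type, "llama3-8b-8192")
```

Passing the chosen model into trace metadata, as in the examples above, then lets you compare cost and latency per task type.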
Tag your spans with the model name (as shown above) so you can compare latency and token efficiency across models in the Nexus dashboard by filtering on trace metadata.
Get Started
Install the Nexus Python client (pip install nexus-client groq), create a free account at nexus.keylightdigital.dev/pricing, and you'll have traces flowing from your Groq agent in under five minutes.
Ready to see inside your Groq agents?
Start free — no credit card required. Up to 10,000 spans/month on the free tier.
Start monitoring for free →