Observability for Together AI Agents: Tracing Open-Source Model Calls with Nexus
Together AI hosts Llama 3, Mistral, Qwen, DBRX, and dozens of other open-source models behind an OpenAI-compatible API — pay-per-token for OSS models without managing GPU infrastructure. Here's how to wrap Together AI calls in Nexus spans to track token costs, latency per model, and rate limit errors across every agent run.
What Together AI Is
Together AI is a cloud inference provider that hosts open-source models — Llama 3, Mistral, Qwen, DBRX, DeepSeek, and many others — behind an OpenAI-compatible REST API. You get pay-per-token access to the same models you'd self-host on GPU infrastructure, without managing GPU servers or dealing with model quantization.
The appeal for agent builders is flexibility: you can run a cost-optimized Llama 3 8B for simple tasks and Llama 3 70B for complex reasoning, both through the same API interface, without switching providers.
The Core Pattern: Wrap together.chat.completions.create()
Together AI's Python SDK is OpenAI-compatible — the same usage fields (usage.prompt_tokens, usage.completion_tokens) are present in non-streaming responses:
import os
import time

from together import Together
from nexus_client import NexusClient

nexus = NexusClient(api_key=os.environ["NEXUS_API_KEY"], agent_id="my-together-agent")
together = Together(api_key=os.environ["TOGETHER_API_KEY"])

def chat(prompt: str, model: str = "meta-llama/Llama-3-8b-chat-hf") -> str:
    trace = nexus.start_trace(
        name=f"together: {prompt[:60]}",
        metadata={"model": model},
    )
    span = trace.add_span(name="together-chat", input={"prompt": prompt, "model": model})
    start = time.time()
    try:
        response = together.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        content = response.choices[0].message.content or ""
        latency_ms = int((time.time() - start) * 1000)
    except Exception as e:
        span.end(status="error", output={"error": str(e), "model": model})
        trace.end(status="error")
        raise

    # Check for empty content outside the try block so the ValueError
    # raised here isn't re-caught above, which would end the span twice.
    if not content.strip():
        span.end(status="error", output={"error": "empty_response", "model": model, "latency_ms": latency_ms})
        trace.end(status="error")
        raise ValueError("Together AI returned empty response")

    span.end(status="ok", output={
        "model": model,
        "input_tokens": response.usage.prompt_tokens,
        "output_tokens": response.usage.completion_tokens,
        "latency_ms": latency_ms,
    })
    trace.end(status="success")
    return content
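The 8B-vs-70B flexibility mentioned earlier can be applied at the call site by routing each prompt to a model before invoking chat(). A minimal routing sketch; the length threshold and keyword markers below are illustrative assumptions for this article, not a Together AI feature:

```python
# Illustrative complexity markers; tune these for your own workload.
COMPLEX_MARKERS = ("explain", "analyze", "compare", "step by step")

def pick_model(prompt: str) -> str:
    """Route short, simple prompts to Llama 3 8B and longer or
    reasoning-heavy prompts to 70B (a heuristic, not an API feature)."""
    lowered = prompt.lower()
    if len(prompt) > 500 or any(marker in lowered for marker in COMPLEX_MARKERS):
        return "meta-llama/Llama-3-70b-chat-hf"
    return "meta-llama/Llama-3-8b-chat-hf"

# Usage with the chat() helper above:
# answer = chat(prompt, model=pick_model(prompt))
```

Because chat() already tags each span with its model, routed traffic shows up in the dashboard split by model with no extra instrumentation.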
Streaming Responses
Together AI supports streaming via the same pattern as OpenAI. When streaming, the usage field is not available in the stream chunks. Use output character count as a proxy, or switch to non-streaming calls when token tracking is critical:
def chat_streaming(prompt: str, model: str = "meta-llama/Llama-3-8b-chat-hf") -> str:
    trace = nexus.start_trace(name=f"together: {prompt[:60]}", metadata={"model": model, "streaming": True})
    span = trace.add_span(name="together-chat-stream", input={"prompt": prompt, "model": model})
    start = time.time()
    collected = []
    try:
        # Together AI supports streaming via the OpenAI SDK pattern
        for chunk in together.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            stream=True,
        ):
            delta = chunk.choices[0].delta.content or ""
            collected.append(delta)
        content = "".join(collected)
        latency_ms = int((time.time() - start) * 1000)
        # Streaming responses do not include usage — estimate from content length or track separately
        span.end(status="ok", output={
            "model": model,
            "output_chars": len(content),  # proxy when token counts unavailable
            "latency_ms": latency_ms,
        })
        trace.end(status="success")
        return content
    except Exception as e:
        span.end(status="error", output={"error": str(e)})
        trace.end(status="error")
        raise
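If you want a token figure rather than raw character counts in streaming spans, a rough conversion works for dashboards. A sketch; the 4-characters-per-token ratio is a common rule of thumb for English text, not an exact tokenizer count:

```python
def estimate_tokens(text: str) -> int:
    # Rough rule of thumb: about 4 characters per English token.
    # Real counts depend on the model's tokenizer, so treat this as a
    # dashboard approximation, not a billing figure.
    return len(text) // 4
```

Recording the estimate alongside output_chars in span output keeps both proxies available when you later compare against billed usage.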
Popular Models Reference
Together AI model IDs use the format organization/model-name. Tag your spans with the model name so you can compare cost and latency across model versions in the Nexus dashboard:
# Popular Together AI models (as of June 2026)
MODELS = {
    # Meta Llama 3
    "llama3-8b": "meta-llama/Llama-3-8b-chat-hf",
    "llama3-70b": "meta-llama/Llama-3-70b-chat-hf",
    # Mistral
    "mistral-7b": "mistralai/Mistral-7B-Instruct-v0.2",
    "mixtral-8x7b": "mistralai/Mixtral-8x7B-Instruct-v0.1",
    # Qwen
    "qwen2-72b": "Qwen/Qwen2-72B-Instruct",
    # DeepSeek
    "deepseek-coder": "deepseek-ai/deepseek-coder-33b-instruct",
}
What to Watch in the Dashboard
- Cost per model — filter traces by model metadata to compare token spend across Llama 3 8B vs. 70B
- Latency variance — Together AI latency varies by model size and server load; track latency_ms per model in span output
- Rate limit errors — Together AI enforces per-minute and per-day limits; error spans with error: "rate_limit" tell you when you're hitting ceilings
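To make rate-limit spans filterable as described above, the except handlers can classify an error before recording it. A sketch based on message inspection; the exact exception classes Together's SDK raises vary by version, so conservative string matching on the standard HTTP 429 status and common phrasings is used here:

```python
def classify_error(exc: Exception) -> str:
    """Map an exception to a coarse error tag for span output.

    String matching is a deliberate assumption: it avoids depending on
    any particular SDK exception hierarchy.
    """
    text = str(exc).lower()
    # HTTP 429 is the standard rate-limit status code.
    if "429" in text or "rate limit" in text or "rate_limit" in text:
        return "rate_limit"
    if "timeout" in text or "timed out" in text:
        return "timeout"
    return "unknown"

# In the except blocks above:
# span.end(status="error", output={"error": classify_error(e), "detail": str(e)})
```

Tagging the span with the coarse class while keeping the raw message in a detail field means dashboard filters stay stable even when provider error strings change.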
Get Started
Install the Nexus Python client and Together AI SDK (pip install nexus-client together), create a free account at nexus.keylightdigital.dev/pricing, and you'll have traces flowing in under five minutes.
Ready to see inside your Together AI agents?
Start free — no credit card required. Up to 10,000 spans/month on the free tier.
Start monitoring for free →