2026-06-17 · 6 min read

Observability for Together AI Agents: Tracing Open-Source Model Calls with Nexus

Together AI hosts Llama 3, Mistral, Qwen, DBRX, and dozens of other open-source models behind an OpenAI-compatible API — pay-per-token for OSS models without managing GPU infrastructure. Here's how to wrap Together AI calls in Nexus spans to track token costs, latency per model, and rate limit errors across every agent run.

What Together AI Is

Together AI is a cloud inference provider that hosts open-source models — Llama 3, Mistral, Qwen, DBRX, DeepSeek, and many others — behind an OpenAI-compatible REST API. You get pay-per-token access to the same models you'd self-host on GPU infrastructure, without managing GPU servers or dealing with model quantization.

The appeal for agent builders is flexibility: you can run a cost-optimized Llama 3 8B for simple tasks and Llama 3 70B for complex reasoning, both through the same API interface, without switching providers.
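That routing can be as simple as a dictionary lookup keyed on task type. A minimal sketch — the task labels and the fallback choice here are illustrative, not part of any Together AI API:

```python
# Route cheap, simple tasks to Llama 3 8B and harder reasoning to 70B.
# The task taxonomy is whatever fits your agent; these labels are examples.
MODEL_BY_TASK = {
    "classify":  "meta-llama/Llama-3-8b-chat-hf",   # cheap, fast
    "summarize": "meta-llama/Llama-3-8b-chat-hf",
    "reason":    "meta-llama/Llama-3-70b-chat-hf",  # stronger reasoning
}

def pick_model(task: str) -> str:
    # Fall back to the small model for unknown task types.
    return MODEL_BY_TASK.get(task, "meta-llama/Llama-3-8b-chat-hf")
```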

The Core Pattern: Wrap together.chat.completions.create()

Together AI's Python SDK is OpenAI-compatible — the same usage fields (usage.prompt_tokens, usage.completion_tokens) are present in non-streaming responses:

import os
import time
from together import Together
from nexus_client import NexusClient

nexus = NexusClient(api_key=os.environ["NEXUS_API_KEY"], agent_id="my-together-agent")
together = Together(api_key=os.environ["TOGETHER_API_KEY"])

def chat(prompt: str, model: str = "meta-llama/Llama-3-8b-chat-hf") -> str:
    trace = nexus.start_trace(
        name=f"together: {prompt[:60]}",
        metadata={"model": model},
    )
    span = trace.add_span(name="together-chat", input={"prompt": prompt, "model": model})
    start = time.time()
    try:
        response = together.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
    except Exception as e:
        # Network failures, rate limits, etc. — end the span before re-raising
        span.end(status="error", output={"error": str(e), "model": model})
        trace.end(status="error")
        raise

    content = response.choices[0].message.content or ""
    latency_ms = int((time.time() - start) * 1000)

    if not content.strip():
        # Raise outside the try block so the span isn't ended twice
        span.end(status="error", output={"error": "empty_response", "model": model, "latency_ms": latency_ms})
        trace.end(status="error")
        raise ValueError("Together AI returned empty response")

    span.end(status="ok", output={
        "model": model,
        "input_tokens": response.usage.prompt_tokens,
        "output_tokens": response.usage.completion_tokens,
        "latency_ms": latency_ms,
    })
    trace.end(status="success")
    return content
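With token counts in every span, you can also attach an estimated dollar cost per call. The per-million-token prices below are placeholders for illustration — substitute current rates from Together AI's pricing page:

```python
# Placeholder $/1M-token prices — check Together AI's pricing page for real rates.
PRICE_PER_M = {
    "meta-llama/Llama-3-8b-chat-hf":  {"input": 0.20, "output": 0.20},
    "meta-llama/Llama-3-70b-chat-hf": {"input": 0.90, "output": 0.90},
}

def estimate_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Rough cost estimate from usage counts; returns 0.0 for unknown models."""
    p = PRICE_PER_M.get(model)
    if p is None:
        return 0.0
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000
```

Include the result in the span output alongside the raw token counts so cost shows up per-run in the dashboard.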

Streaming Responses

Together AI supports streaming via the same pattern as OpenAI. When streaming, the usage field is not available in the stream chunks. Use output character count as a proxy, or switch to non-streaming calls when token tracking is critical:

def chat_streaming(prompt: str, model: str = "meta-llama/Llama-3-8b-chat-hf") -> str:
    trace = nexus.start_trace(name=f"together: {prompt[:60]}", metadata={"model": model, "streaming": True})
    span = trace.add_span(name="together-chat-stream", input={"prompt": prompt, "model": model})
    start = time.time()
    collected = []
    try:
        # Together AI supports streaming via the OpenAI SDK pattern
        for chunk in together.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            stream=True,
        ):
            delta = chunk.choices[0].delta.content or ""
            collected.append(delta)

        content = "".join(collected)
        latency_ms = int((time.time() - start) * 1000)
        # Streaming responses do not include usage — estimate from content length or track separately
        span.end(status="ok", output={
            "model": model,
            "output_chars": len(content),  # proxy when token counts unavailable
            "latency_ms": latency_ms,
        })
        trace.end(status="success")
        return content
    except Exception as e:
        span.end(status="error", output={"error": str(e), "model": model})
        trace.end(status="error")
        raise
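If you want a rough token figure for streamed output rather than a raw character count, a common heuristic is about four characters per token for English text. This is an approximation only — switch to non-streaming calls when exact counts matter:

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate from character count (~4 chars/token for English)."""
    return max(1, round(len(text) / chars_per_token)) if text else 0
```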

Popular Models Reference

Together AI model IDs use the format organization/model-name. Tag your spans with the model name so you can compare cost and latency across model versions in the Nexus dashboard:

# Popular Together AI models (as of June 2026)
MODELS = {
    # Meta Llama 3
    "llama3-8b":  "meta-llama/Llama-3-8b-chat-hf",
    "llama3-70b": "meta-llama/Llama-3-70b-chat-hf",

    # Mistral
    "mistral-7b": "mistralai/Mistral-7B-Instruct-v0.2",
    "mixtral-8x7b": "mistralai/Mixtral-8x7B-Instruct-v0.1",

    # Qwen
    "qwen2-72b": "Qwen/Qwen2-72B-Instruct",

    # DeepSeek
    "deepseek-coder": "deepseek-ai/deepseek-coder-33b-instruct",
}
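Because each span carries model, latency_ms, and token counts, comparing model tiers is just an aggregation. A sketch of a summary you might compute client-side from exported span outputs — the span dicts here mirror the output fields set in the chat() examples above:

```python
from collections import defaultdict

def latency_by_model(spans: list[dict]) -> dict[str, float]:
    """Average latency per model from span output dicts like those emitted above."""
    totals: dict[str, list[int]] = defaultdict(list)
    for s in spans:
        totals[s["model"]].append(s["latency_ms"])
    return {m: sum(v) / len(v) for m, v in totals.items()}
```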

What to Watch in the Dashboard

Once traces are flowing, three signals are worth checking regularly:

- Token cost per model — compare input/output token totals for Llama 3 8B versus 70B on the same task types to see where the cheaper tier is good enough.
- Latency per model — larger models take more wall-clock time; the latency_ms field on each span makes the tradeoff visible per call.
- Rate limit errors — exceeding your tier's limits surfaces as error spans with the exception message in the output, so spikes are easy to spot.

Get Started

Install the Nexus Python client and Together AI SDK (pip install nexus-client together), create a free account at nexus.keylightdigital.dev/pricing, and you'll have traces flowing in under five minutes.

Ready to see inside your Together AI agents?

Start free — no credit card required. Up to 10,000 spans/month on the free tier.

Start monitoring for free →