Observability for Together AI Agents: Tracing Open-Source Model Calls with Nexus
Together AI hosts Llama 3, Mistral, Qwen, DBRX, and dozens of other open-source models behind an OpenAI-compatible API — pay-per-token for OSS models without managing GPU infrastructure. Here's how to wrap Together AI calls in Nexus spans to track token costs, latency per model, and rate limit errors across every agent run.
What Together AI Is
Together AI is a cloud inference provider that hosts open-source models — Llama 3, Mistral, Qwen, DBRX, DeepSeek, and many others — behind an OpenAI-compatible REST API. You get pay-per-token access to the same models you'd self-host on GPU infrastructure, without managing GPU servers or dealing with model quantization.
The appeal for agent builders is flexibility: you can run a cost-optimized Llama 3 8B for simple tasks and Llama 3 70B for complex reasoning, both through the same API interface, without switching providers.
The Core Pattern: Wrap together.chat.completions.create()
Together AI's Python SDK is OpenAI-compatible — the same usage fields (usage.prompt_tokens, usage.completion_tokens) are present in non-streaming responses:
import os
import time

from together import Together
from nexus_client import NexusClient

nexus = NexusClient(api_key=os.environ["NEXUS_API_KEY"], agent_id="my-together-agent")
together = Together(api_key=os.environ["TOGETHER_API_KEY"])

def chat(prompt: str, model: str = "meta-llama/Llama-3-8b-chat-hf") -> str:
    trace = nexus.start_trace(
        name=f"together: {prompt[:60]}",
        metadata={"model": model},
    )
    span = trace.add_span(name="together-chat", input={"prompt": prompt, "model": model})
    start = time.time()
    try:
        response = together.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        content = response.choices[0].message.content or ""
        latency_ms = int((time.time() - start) * 1000)
    except Exception as e:
        span.end(status="error", output={"error": str(e), "model": model})
        trace.end(status="error")
        raise

    # Check for empty content outside the try block so the ValueError
    # raised here isn't re-caught above, which would end the span twice.
    if not content.strip():
        span.end(status="error", output={"error": "empty_response", "model": model, "latency_ms": latency_ms})
        trace.end(status="error")
        raise ValueError("Together AI returned empty response")

    span.end(status="ok", output={
        "model": model,
        "input_tokens": response.usage.prompt_tokens,
        "output_tokens": response.usage.completion_tokens,
        "latency_ms": latency_ms,
    })
    trace.end(status="success")
    return content
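The 8B-vs-70B flexibility mentioned earlier can be applied at the call site by routing each prompt to a model before invoking chat(). A minimal routing sketch; the length threshold and keyword markers below are illustrative assumptions for this article, not a Together AI feature:

```python
# Illustrative complexity markers; tune these for your own workload.
COMPLEX_MARKERS = ("explain", "analyze", "compare", "step by step")

def pick_model(prompt: str) -> str:
    """Route short, simple prompts to Llama 3 8B and longer or
    reasoning-heavy prompts to 70B (a heuristic, not an API feature)."""
    lowered = prompt.lower()
    if len(prompt) > 500 or any(marker in lowered for marker in COMPLEX_MARKERS):
        return "meta-llama/Llama-3-70b-chat-hf"
    return "meta-llama/Llama-3-8b-chat-hf"

# Usage with the chat() helper above:
# answer = chat(prompt, model=pick_model(prompt))
```

Because chat() already tags each span with its model, routed traffic shows up in the dashboard split by model with no extra instrumentation.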
Streaming Responses
Together AI supports streaming via the same pattern as OpenAI. When streaming, the usage field is not available in the stream chunks. Use output character count as a proxy, or switch to non-streaming calls when token tracking is critical:
def chat_streaming(prompt: str, model: str = "meta-llama/Llama-3-8b-chat-hf") -> str:
    trace = nexus.start_trace(name=f"together: {prompt[:60]}", metadata={"model": model, "streaming": True})
    span = trace.add_span(name="together-chat-stream", input={"prompt": prompt, "model": model})
    start = time.time()
    collected = []
    try:
        # Together AI supports streaming via the OpenAI SDK pattern
        for chunk in together.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            stream=True,
        ):
            delta = chunk.choices[0].delta.content or ""
            collected.append(delta)
        content = "".join(collected)
        latency_ms = int((time.time() - start) * 1000)
        # Streaming responses do not include usage — estimate from content length or track separately
        span.end(status="ok", output={
            "model": model,
            "output_chars": len(content),  # proxy when token counts unavailable
            "latency_ms": latency_ms,
        })
        trace.end(status="success")
        return content
    except Exception as e:
        span.end(status="error", output={"error": str(e)})
        trace.end(status="error")
        raise
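If you want a token figure rather than raw character counts in streaming spans, a rough conversion works for dashboards. A sketch; the 4-characters-per-token ratio is a common rule of thumb for English text, not an exact tokenizer count:

```python
def estimate_tokens(text: str) -> int:
    # Rough rule of thumb: about 4 characters per English token.
    # Real counts depend on the model's tokenizer, so treat this as a
    # dashboard approximation, not a billing figure.
    return len(text) // 4
```

Recording the estimate alongside output_chars in span output keeps both proxies available when you later compare against billed usage.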
Popular Models Reference
Together AI model IDs use the format organization/model-name. Tag your spans with the model name so you can compare cost and latency across model versions in the Nexus dashboard:
# Popular Together AI models (as of June 2026)
MODELS = {
    # Meta Llama 3
    "llama3-8b": "meta-llama/Llama-3-8b-chat-hf",
    "llama3-70b": "meta-llama/Llama-3-70b-chat-hf",
    # Mistral
    "mistral-7b": "mistralai/Mistral-7B-Instruct-v0.2",
    "mixtral-8x7b": "mistralai/Mixtral-8x7B-Instruct-v0.1",
    # Qwen
    "qwen2-72b": "Qwen/Qwen2-72B-Instruct",
    # DeepSeek
    "deepseek-coder": "deepseek-ai/deepseek-coder-33b-instruct",
}
What to Watch in the Dashboard
- Cost per model — filter traces by model metadata to compare token spend across Llama 3 8B vs. 70B
- Latency variance — Together AI latency varies by model size and server load; track latency_ms per model in span output
- Rate limit errors — Together AI enforces per-minute and per-day limits; error spans with error: "rate_limit" tell you when you're hitting ceilings
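To make rate-limit spans filterable as described above, the except handlers can classify an error before recording it. A sketch based on message inspection; the exact exception classes Together's SDK raises vary by version, so conservative string matching on the standard HTTP 429 status and common phrasings is used here:

```python
def classify_error(exc: Exception) -> str:
    """Map an exception to a coarse error tag for span output.

    String matching is a deliberate assumption: it avoids depending on
    any particular SDK exception hierarchy.
    """
    text = str(exc).lower()
    # HTTP 429 is the standard rate-limit status code.
    if "429" in text or "rate limit" in text or "rate_limit" in text:
        return "rate_limit"
    if "timeout" in text or "timed out" in text:
        return "timeout"
    return "unknown"

# In the except blocks above:
# span.end(status="error", output={"error": classify_error(e), "detail": str(e)})
```

Tagging the span with the coarse class while keeping the raw message in a detail field means dashboard filters stay stable even when provider error strings change.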
Get Started
Install the Nexus Python client and Together AI SDK (pip install nexus-client together), create a free account at nexus.keylightdigital.dev/pricing, and you'll have traces flowing in under five minutes.
Ready to see inside your Together AI agents?
Start free — no credit card required. Up to 10,000 spans/month on the free tier.
Start monitoring for free →