Observability for Modal Agents: Tracing Serverless GPU Functions with Nexus
Modal runs your Python functions on serverless GPU/CPU infrastructure — cold starts, GPU allocation time, and async task execution are all invisible by default. Here's how to wrap Modal functions with Nexus spans to capture cold start latency, GPU allocation overhead, CUDA OOM errors, and end-to-end trace context across async tasks.
What Modal Is
Modal is a serverless compute platform for Python — you write a function, decorate it with @modal.function, and Modal provisions GPU or CPU instances on demand. No Dockerfile, no cluster management, no idle costs. Functions cold-start in seconds and scale to thousands of parallel workers.
Modal is popular for AI workloads: fine-tuned model inference, batch embedding pipelines, async agent tasks, and GPU-heavy processing jobs. For agent builders, it means you can delegate compute-intensive steps (running a 70B model, processing a large document batch) to ephemeral Modal workers without managing infrastructure.
Why Modal Agents Need Observability
Modal's serverless model introduces latency sources that are invisible in standard logs:
- Cold start time — First call to a GPU function can take 5–30 seconds for container startup and GPU allocation. Without tracing, your agent just looks slow with no explanation.
- GPU allocation variance — A10G vs H100 allocation times differ, and spot availability affects scheduling. Span metadata captures this variance so you can tune GPU selection.
- CUDA OOM errors — Large model weights or batch sizes that exceed GPU VRAM throw cryptic errors. A span with error_type: "cuda_oom" makes these patterns visible.
- Async task linkage — When your agent spawns Modal workers with .spawn(), background tasks have no foreground logs. Passing trace context links them back to the root span.
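The exception-to-label mapping behind the error_type attribute can live in one small helper. A minimal sketch — classify_modal_error is an illustrative name, not part of Modal or Nexus, and the string checks assume PyTorch's standard OOM message:

```python
def classify_modal_error(exc: Exception) -> str:
    """Coarsely bucket a worker exception for span metadata.

    Illustrative helper, not a Modal or Nexus API.
    """
    msg = str(exc)
    if "CUDA out of memory" in msg:  # PyTorch's OOM message prefix
        return "cuda_oom"
    if "timeout" in msg.lower():
        return "timeout"
    return "unknown"
```

Recording the label instead of (or alongside) the raw message lets you group and alert on error classes in the dashboard rather than on free-form strings.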
Full Integration Example
The pattern: create a root trace in your local entrypoint, pass trace_id and parent_span_id explicitly to remote functions, and record Modal-specific metadata as span output attributes.
```python
import modal
import time

from nexus_client import NexusClient

app = modal.App("agent-pipeline")
nexus = NexusClient(api_key="nxs_...")

@app.function(gpu="A10G", timeout=300)
def run_inference(prompt: str, trace_id: str, parent_span_id: str) -> dict:
    # trace_id / parent_span_id are forwarded so the worker can create child spans
    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM

    fn_start = time.time()
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3-8b-instruct")
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-3-8b-instruct", torch_dtype=torch.float16
    )
    model.to("cuda")
    model_load_ms = int((time.time() - fn_start) * 1000)

    infer_start = time.time()
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=512)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    infer_ms = int((time.time() - infer_start) * 1000)

    return {
        "response": response,
        "model_load_ms": model_load_ms,
        "infer_ms": infer_ms,
    }

@app.local_entrypoint()
def main():
    with nexus.start_trace(name="modal-pipeline") as trace:
        span = trace.add_span(name="modal-gpu-call")
        try:
            result = run_inference.remote(
                prompt="Summarize the following research paper...",
                trace_id=trace.trace_id,
                parent_span_id=span.span_id,
            )
            span.end(status="ok", output={
                "modal.gpu": "A10G",
                "modal.model_load_ms": result["model_load_ms"],
                "modal.infer_ms": result["infer_ms"],
                "modal.warm": result["model_load_ms"] < 500,
            })
        except modal.exception.FunctionTimeoutError as e:
            span.end(status="error", output={"error": str(e), "error_type": "timeout"})
            raise
        except Exception as e:
            error_type = "cuda_oom" if "CUDA out of memory" in str(e) else "unknown"
            span.end(status="error", output={"error": str(e), "error_type": error_type})
            raise
```
Cold Start Detection
Modal sets the MODAL_IS_COLD_START environment variable to "true" on cold container starts. Record this as a boolean attribute so you can split latency histograms by warm vs cold in the Nexus dashboard.
```python
import os

@app.function(gpu="A10G")
def traced_inference(prompt: str, trace_id: str, parent_span_id: str) -> dict:
    # Modal sets MODAL_IS_COLD_START="true" on cold container starts
    is_cold = os.environ.get("MODAL_IS_COLD_START", "false") == "true"
    start = time.time()
    result = do_inference(prompt)  # do_inference: your model call, defined elsewhere
    total_ms = int((time.time() - start) * 1000)
    return {**result, "cold_start": is_cold, "total_ms": total_ms}
```
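The warm/cold split itself happens in the Nexus dashboard, but the same grouping is easy to sketch client-side over exported span outputs. split_by_cold_start is a hypothetical post-processing helper, not part of the Nexus client:

```python
def split_by_cold_start(spans: list[dict]) -> dict[str, list[int]]:
    """Group total_ms latencies by the cold_start flag recorded in span output.

    Hypothetical helper for local analysis of exported spans.
    """
    buckets: dict[str, list[int]] = {"cold": [], "warm": []}
    for s in spans:
        key = "cold" if s.get("cold_start") else "warm"
        buckets[key].append(s["total_ms"])
    return buckets
```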
Async Task Fan-Out
When your agent fans out work with .spawn(), pass trace_id and parent_span_id as parameters. Each worker creates a child span that links back to the root, giving you a complete trace tree even for async workloads with dozens of parallel workers.
```python
@app.function(cpu=4)
def process_batch(items: list[str], trace_id: str, parent_span_id: str) -> list[str]:
    span = nexus.start_span(trace_id=trace_id, name="batch-worker", parent_span_id=parent_span_id)
    results = [transform(item) for item in items]  # transform: your per-item logic
    span.end(status="ok", output={"batch.size": len(items)})
    return results

@app.local_entrypoint()
def main():
    with nexus.start_trace(name="batch-pipeline") as trace:
        root_span = trace.add_span(name="fan-out")
        futures = [
            process_batch.spawn(chunk, trace.trace_id, root_span.span_id)
            for chunk in chunks(data, size=100)
        ]
        results = [f.get() for f in futures]
        root_span.end(status="ok", output={"total_batches": len(futures)})
```
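The chunks helper used in the fan-out loop isn't shown above; any batching utility works. A minimal version:

```python
def chunks(items: list, size: int) -> list[list]:
    """Split items into consecutive sublists of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]
```

Batch size trades off per-call overhead against parallelism: 100 items per worker is the value used in the example, not a Modal requirement.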
Span Attributes Worth Recording
- modal.gpu — GPU type (A10G, H100, T4)
- modal.warm — boolean, true when model load time is under threshold
- modal.model_load_ms — time to load model weights from cold
- modal.infer_ms — pure inference time excluding model load overhead
- modal.batch_size — number of items processed in batch functions
- error_type — "timeout", "cuda_oom", or "container_dead" for error classification
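A small builder keeps these attribute names consistent across functions. A sketch assuming the 500 ms warm threshold from the earlier example; modal_span_output and WARM_THRESHOLD_MS are names invented here, not Nexus APIs:

```python
WARM_THRESHOLD_MS = 500  # model-load time below this => container was warm

def modal_span_output(gpu: str, model_load_ms: int, infer_ms: int) -> dict:
    """Build the standard modal.* span attribute dict (illustrative helper)."""
    return {
        "modal.gpu": gpu,
        "modal.model_load_ms": model_load_ms,
        "modal.infer_ms": infer_ms,
        "modal.warm": model_load_ms < WARM_THRESHOLD_MS,
    }
```

Centralizing the dict means a renamed attribute or a tuned threshold changes in one place instead of in every span.end call.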
Get Started
Install the Nexus Python client (pip install nexus-client modal), create a free account at nexus.keylightdigital.dev/pricing, and your Modal agents will have full trace visibility in under ten minutes.
Ready to see inside your Modal agents?
Start free — no credit card required. Up to 10,000 spans/month on the free tier.
Start monitoring for free →