2026-07-01 · 7 min read

Observability for Modal Agents: Tracing Serverless GPU Functions with Nexus

Modal runs your Python functions on serverless GPU/CPU infrastructure — cold starts, GPU allocation time, and async task execution are all invisible by default. Here's how to wrap Modal functions with Nexus spans to capture cold start latency, GPU allocation overhead, CUDA OOM errors, and end-to-end trace context across async tasks.

What Modal Is

Modal is a serverless compute platform for Python: you write a function, decorate it with @app.function(), and Modal provisions GPU or CPU instances on demand. No Dockerfile, no cluster management, no idle costs. Functions cold-start in seconds and scale out to thousands of parallel workers.

Modal is popular for AI workloads: fine-tuned model inference, batch embedding pipelines, async agent tasks, and GPU-heavy processing jobs. For agent builders, it means you can delegate compute-intensive steps (running a 70B model, processing a large document batch) to ephemeral Modal workers without managing infrastructure.

Why Modal Agents Need Observability

Modal's serverless model introduces latency sources that are invisible in standard logs:

- Cold starts: a fresh container must boot, import your dependencies, and load model weights before the first request runs
- GPU allocation: time spent waiting for a GPU to be attached to the container
- Async fan-out: parallel .spawn() workers whose latencies and failures never appear in one linear log
- Hard failures: function timeouts and CUDA out-of-memory errors that surface only as opaque exceptions

Full Integration Example

The pattern: create a root trace in your local entrypoint, pass trace_id and parent_span_id explicitly to remote functions, and record Modal-specific metadata as span output attributes.

import modal
import time
from nexus_client import NexusClient

app = modal.App("agent-pipeline")
nexus = NexusClient(api_key="nxs_...")

@app.function(gpu="A10G", timeout=300)
def run_inference(prompt: str, trace_id: str, parent_span_id: str) -> dict:
    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM

    fn_start = time.time()

    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Meta-Llama-3-8B-Instruct", torch_dtype=torch.float16
    )
    model.to("cuda")
    model_load_ms = int((time.time() - fn_start) * 1000)

    infer_start = time.time()
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=512)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    infer_ms = int((time.time() - infer_start) * 1000)

    return {
        "response": response,
        "model_load_ms": model_load_ms,
        "infer_ms": infer_ms,
    }

@app.local_entrypoint()
def main():
    with nexus.start_trace(name="modal-pipeline") as trace:
        span = trace.add_span(name="modal-gpu-call")
        try:
            result = run_inference.remote(
                prompt="Summarize the following research paper...",
                trace_id=trace.trace_id,
                parent_span_id=span.span_id,
            )
            span.end(status="ok", output={
                "modal.gpu": "A10G",
                "modal.model_load_ms": result["model_load_ms"],
                "modal.infer_ms": result["infer_ms"],
                "modal.warm": result["model_load_ms"] < 500,
            })
        except modal.exception.FunctionTimeoutError as e:
            span.end(status="error", output={"error": str(e), "error_type": "timeout"})
            raise
        except Exception as e:
            error_type = "cuda_oom" if "CUDA out of memory" in str(e) else "unknown"
            span.end(status="error", output={"error": str(e), "error_type": error_type})
            raise

Cold Start Detection

Rather than relying on an undocumented environment variable, detect cold starts with a module-level flag: module scope executes once per container, so the first invocation in a container sees the flag unset and every later invocation sees it set. Record this as a boolean attribute so you can split latency histograms by warm vs. cold in the Nexus dashboard.

_container_warm = False

@app.function(gpu="A10G")
def traced_inference(prompt: str, trace_id: str, parent_span_id: str) -> dict:
    global _container_warm
    is_cold = not _container_warm  # True only on the first call in this container
    _container_warm = True
    start = time.time()
    result = do_inference(prompt)
    total_ms = int((time.time() - start) * 1000)
    return {**result, "cold_start": is_cold, "total_ms": total_ms}

Async Task Fan-Out

When your agent fans out work with .spawn(), pass trace_id and parent_span_id as parameters. Each worker creates a child span that links back to the root, giving you a complete trace tree even for async workloads with dozens of parallel workers.

@app.function(cpu=4)
def process_batch(items: list[str], trace_id: str, parent_span_id: str) -> list[str]:
    span = nexus.start_span(trace_id=trace_id, name="batch-worker", parent_span_id=parent_span_id)
    results = [transform(item) for item in items]
    span.end(status="ok", output={"batch.size": len(items)})
    return results

@app.local_entrypoint()
def main():
    with nexus.start_trace(name="batch-pipeline") as trace:
        root_span = trace.add_span(name="fan-out")
        futures = [
            process_batch.spawn(chunk, trace.trace_id, root_span.span_id)
            for chunk in chunks(data, size=100)
        ]
        results = [f.get() for f in futures]
        root_span.end(status="ok", output={"total_batches": len(futures)})
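
The chunks helper in the fan-out example is left undefined; a minimal version, with the name and signature assumed to match the call site above:

```python
def chunks(items: list, size: int) -> list[list]:
    """Split items into consecutive slices of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]
```

Sizing chunks so each worker runs for at least a few seconds keeps the per-task scheduling overhead small relative to useful work.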

Span Attributes Worth Recording

Across the examples above, the attributes that pay for themselves in the Nexus dashboard are:

- modal.gpu — the GPU type requested (e.g. "A10G"), for comparing latency and cost across hardware
- modal.model_load_ms — time spent loading model weights, the dominant cold-start cost
- modal.infer_ms — pure inference time, excluding load overhead
- cold_start — boolean, for splitting latency histograms into warm and cold populations
- batch.size — items per worker in fan-out workloads
- error_type — "timeout", "cuda_oom", or "unknown", for error-rate breakdowns

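If you record Modal attributes from several functions, a small helper keeps the keys consistent. The helper name and key set are my own convention, mirroring the attributes used in the examples above, not part of the Nexus client:

```python
def modal_span_attrs(gpu=None, model_load_ms=None, infer_ms=None,
                     cold_start=None, batch_size=None, error_type=None):
    """Build a span-attribute dict, dropping unset values."""
    attrs = {
        "modal.gpu": gpu,
        "modal.model_load_ms": model_load_ms,
        "modal.infer_ms": infer_ms,
        "cold_start": cold_start,
        "batch.size": batch_size,
        "error_type": error_type,
    }
    return {k: v for k, v in attrs.items() if v is not None}
```

Pass the result straight to span.end(output=...), e.g. span.end(status="ok", output=modal_span_attrs(gpu="A10G", infer_ms=ms)).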
Get Started

Install the Nexus Python client (pip install nexus-client modal), create a free account at nexus.keylightdigital.dev/pricing, and your Modal agents will have full trace visibility in under ten minutes.

Ready to see inside your Modal agents?

Start free — no credit card required. Up to 10,000 spans/month on the free tier.
