Observability for Modal Agents: Tracing Serverless GPU Functions with Nexus
Modal runs your Python functions on serverless GPU/CPU infrastructure — cold starts, GPU allocation time, and async task execution are all invisible by default. Here's how to wrap Modal functions with Nexus spans to capture cold start latency, GPU allocation overhead, CUDA OOM errors, and end-to-end trace context across async tasks.
What Modal Is
Modal is a serverless compute platform for Python — you write a function, decorate it with @modal.function, and Modal provisions GPU or CPU instances on demand. No Dockerfile, no cluster management, no idle costs. Functions cold-start in seconds and scale to thousands of parallel workers.
Modal is popular for AI workloads: fine-tuned model inference, batch embedding pipelines, async agent tasks, and GPU-heavy processing jobs. For agent builders, it means you can delegate compute-intensive steps (running a 70B model, processing a large document batch) to ephemeral Modal workers without managing infrastructure.
Why Modal Agents Need Observability
Modal's serverless model introduces latency sources that are invisible in standard logs:
- Cold start time — First call to a GPU function can take 5–30 seconds for container startup and GPU allocation. Without tracing, your agent just looks slow with no explanation.
- GPU allocation variance — A10G vs H100 allocation times differ, and spot availability affects scheduling. Span metadata captures this variance so you can tune GPU selection.
- CUDA OOM errors — Large model weights or batch sizes that exceed GPU VRAM throw cryptic errors. A span with error_type: "cuda_oom" makes these patterns visible.
- Async task linkage — When your agent spawns Modal workers with .spawn(), background tasks have no foreground logs. Passing trace context links them back to the root span.
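The exception-to-label mapping behind the error_type attribute can live in one small helper. A minimal sketch — classify_modal_error is an illustrative name, not part of Modal or Nexus, and the string checks assume PyTorch's standard OOM message:

```python
def classify_modal_error(exc: Exception) -> str:
    """Coarsely bucket a worker exception for span metadata.

    Illustrative helper, not a Modal or Nexus API.
    """
    msg = str(exc)
    if "CUDA out of memory" in msg:  # PyTorch's OOM message prefix
        return "cuda_oom"
    if "timeout" in msg.lower():
        return "timeout"
    return "unknown"
```

Recording the label instead of (or alongside) the raw message lets you group and alert on error classes in the dashboard rather than on free-form strings.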
Full Integration Example
The pattern: create a root trace in your local entrypoint, pass trace_id and parent_span_id explicitly to remote functions, and record Modal-specific metadata as span output attributes.
```python
import modal
import time

from nexus_client import NexusClient

app = modal.App("agent-pipeline")
nexus = NexusClient(api_key="nxs_...")

@app.function(gpu="A10G", timeout=300)
def run_inference(prompt: str, trace_id: str, parent_span_id: str) -> dict:
    # trace_id / parent_span_id are forwarded so the worker can create child spans
    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM

    fn_start = time.time()
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3-8b-instruct")
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-3-8b-instruct", torch_dtype=torch.float16
    )
    model.to("cuda")
    model_load_ms = int((time.time() - fn_start) * 1000)

    infer_start = time.time()
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=512)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    infer_ms = int((time.time() - infer_start) * 1000)

    return {
        "response": response,
        "model_load_ms": model_load_ms,
        "infer_ms": infer_ms,
    }

@app.local_entrypoint()
def main():
    with nexus.start_trace(name="modal-pipeline") as trace:
        span = trace.add_span(name="modal-gpu-call")
        try:
            result = run_inference.remote(
                prompt="Summarize the following research paper...",
                trace_id=trace.trace_id,
                parent_span_id=span.span_id,
            )
            span.end(status="ok", output={
                "modal.gpu": "A10G",
                "modal.model_load_ms": result["model_load_ms"],
                "modal.infer_ms": result["infer_ms"],
                "modal.warm": result["model_load_ms"] < 500,
            })
        except modal.exception.FunctionTimeoutError as e:
            span.end(status="error", output={"error": str(e), "error_type": "timeout"})
            raise
        except Exception as e:
            error_type = "cuda_oom" if "CUDA out of memory" in str(e) else "unknown"
            span.end(status="error", output={"error": str(e), "error_type": error_type})
            raise
```
Cold Start Detection
Modal sets the MODAL_IS_COLD_START environment variable to "true" on cold container starts. Record this as a boolean attribute so you can split latency histograms by warm vs cold in the Nexus dashboard.
```python
import os

@app.function(gpu="A10G")
def traced_inference(prompt: str, trace_id: str, parent_span_id: str) -> dict:
    # Modal sets MODAL_IS_COLD_START="true" on cold container starts
    is_cold = os.environ.get("MODAL_IS_COLD_START", "false") == "true"
    start = time.time()
    result = do_inference(prompt)  # do_inference: your model call, defined elsewhere
    total_ms = int((time.time() - start) * 1000)
    return {**result, "cold_start": is_cold, "total_ms": total_ms}
```
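The warm/cold split itself happens in the Nexus dashboard, but the same grouping is easy to sketch client-side over exported span outputs. split_by_cold_start is a hypothetical post-processing helper, not part of the Nexus client:

```python
def split_by_cold_start(spans: list[dict]) -> dict[str, list[int]]:
    """Group total_ms latencies by the cold_start flag recorded in span output.

    Hypothetical helper for local analysis of exported spans.
    """
    buckets: dict[str, list[int]] = {"cold": [], "warm": []}
    for s in spans:
        key = "cold" if s.get("cold_start") else "warm"
        buckets[key].append(s["total_ms"])
    return buckets
```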
Async Task Fan-Out
When your agent fans out work with .spawn(), pass trace_id and parent_span_id as parameters. Each worker creates a child span that links back to the root, giving you a complete trace tree even for async workloads with dozens of parallel workers.
```python
@app.function(cpu=4)
def process_batch(items: list[str], trace_id: str, parent_span_id: str) -> list[str]:
    span = nexus.start_span(trace_id=trace_id, name="batch-worker", parent_span_id=parent_span_id)
    results = [transform(item) for item in items]  # transform: your per-item logic
    span.end(status="ok", output={"batch.size": len(items)})
    return results

@app.local_entrypoint()
def main():
    with nexus.start_trace(name="batch-pipeline") as trace:
        root_span = trace.add_span(name="fan-out")
        futures = [
            process_batch.spawn(chunk, trace.trace_id, root_span.span_id)
            for chunk in chunks(data, size=100)
        ]
        results = [f.get() for f in futures]
        root_span.end(status="ok", output={"total_batches": len(futures)})
```
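The chunks helper used in the fan-out loop isn't shown above; any batching utility works. A minimal version:

```python
def chunks(items: list, size: int) -> list[list]:
    """Split items into consecutive sublists of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]
```

Batch size trades off per-call overhead against parallelism: 100 items per worker is the value used in the example, not a Modal requirement.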
Span Attributes Worth Recording
- modal.gpu — GPU type (A10G, H100, T4)
- modal.warm — boolean, true when model load time is under threshold
- modal.model_load_ms — time to load model weights from cold
- modal.infer_ms — pure inference time excluding model load overhead
- modal.batch_size — number of items processed in batch functions
- error_type — "timeout", "cuda_oom", or "container_dead" for error classification
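A small builder keeps these attribute names consistent across functions. A sketch assuming the 500 ms warm threshold from the earlier example; modal_span_output and WARM_THRESHOLD_MS are names invented here, not Nexus APIs:

```python
WARM_THRESHOLD_MS = 500  # model-load time below this => container was warm

def modal_span_output(gpu: str, model_load_ms: int, infer_ms: int) -> dict:
    """Build the standard modal.* span attribute dict (illustrative helper)."""
    return {
        "modal.gpu": gpu,
        "modal.model_load_ms": model_load_ms,
        "modal.infer_ms": infer_ms,
        "modal.warm": model_load_ms < WARM_THRESHOLD_MS,
    }
```

Centralizing the dict means a renamed attribute or a tuned threshold changes in one place instead of in every span.end call.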
Get Started
Install the Nexus Python client (pip install nexus-client modal), create a free account at nexus.keylightdigital.dev/pricing, and your Modal agents will have full trace visibility in under ten minutes.
Ready to see inside your Modal agents?
Start free — no credit card required. Up to 10,000 spans/month on the free tier.
Start monitoring for free →