2026-04-19 · 7 min read

Setting Up Alerts for AI Agent Failures: Webhooks, Slack, and Error Rate Monitoring

Polling dashboards doesn't work for production AI agents — they fail silently, degrade gradually, and spike in error rate before you notice. Here's how to set up webhook and Slack alerts for agent errors and latency thresholds with Nexus, so you're notified within minutes of a failure.

Why polling dashboards fails for production agents

For traditional software, checking error dashboards every morning is fine — a crashed server is usually obvious and recovers quickly. AI agents fail differently: they degrade gradually, fail silently, or spike in error rate for a specific input pattern while handling other requests normally. By the time you notice in a dashboard, the impact has already happened.

Production AI agents need the same alerting model as distributed services:

- push notifications when something breaks, not dashboards you remember to check
- alerts on hard failures (errors, timeouts) and on soft ones (latency creep)
- rate limiting, so a burst of failures produces one page instead of fifty

Setting up Nexus webhook alerts

Nexus Pro users configure webhook alerts in Settings. When a trace ends with status: "error" or status: "timeout", Nexus sends a POST request to your webhook URL with a structured payload:

# Example Nexus webhook payload (trace.error event)
{
  "event": "trace.error",
  "trace_id": "tr_01HX...",
  "agent_id": "customer-support-agent",
  "status": "error",
  "error": "Tool call failed: search_docs returned 503",
  "latency_ms": 4823,
  "started_at": "2026-04-19T02:31:00Z",
  "ended_at": "2026-04-19T02:31:04Z",
  "metadata": {
    "user_id": "u_abc123",
    "environment": "production",
    "model": "gpt-4o"
  }
}
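If you consume this payload in your own code, it can help to parse it into a typed structure up front rather than passing raw dicts around. A minimal sketch based on the fields shown above — `AgentAlert` is an illustrative name, not part of any Nexus SDK:

```python
from dataclasses import dataclass, field


@dataclass
class AgentAlert:
    """Typed view of a Nexus trace.error webhook payload."""
    event: str
    trace_id: str
    agent_id: str
    status: str
    error: str = ""
    latency_ms: int = 0
    metadata: dict = field(default_factory=dict)

    @classmethod
    def from_payload(cls, payload: dict) -> "AgentAlert":
        # Required fields raise KeyError if missing; optional ones default.
        return cls(
            event=payload["event"],
            trace_id=payload["trace_id"],
            agent_id=payload["agent_id"],
            status=payload["status"],
            error=payload.get("error", ""),
            latency_ms=payload.get("latency_ms", 0),
            metadata=payload.get("metadata", {}),
        )

    @property
    def is_production(self) -> bool:
        return self.metadata.get("environment") == "production"
```

Downstream routing logic (page on production, ignore staging) then reads as a property check instead of nested `.get()` chains.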

You configure the webhook URL in Settings → Webhook URL. Nexus auto-detects Slack Incoming Webhook URLs and sends richly formatted Slack blocks instead of raw JSON — no custom adapter needed.

Receiving alerts in Slack

The fastest path to Slack alerts: create a Slack Incoming Webhook and paste its URL into Nexus Settings. Nexus detects the Slack URL and formats the alert as a Slack block message automatically:

# What Nexus sends to Slack (formatted automatically):
# 
# 🚨 Agent error: customer-support-agent
# Trace: tr_01HX...
# Error: Tool call failed: search_docs returned 503
# Latency: 4.8s
# Environment: production
# User: u_abc123
# 
# [View trace] [View agent]

Rate limiting prevents alert fatigue: Nexus sends at most one alert per agent per 5 minutes by default, regardless of how many traces fail in that window.

Building a custom webhook receiver

For custom alerting — PagerDuty, custom Slack formatting, or internal dashboards — build a simple webhook receiver:

# Simple Flask webhook receiver
from flask import Flask, request, jsonify
import requests
import os

app = Flask(__name__)
PAGERDUTY_KEY = os.environ["PAGERDUTY_INTEGRATION_KEY"]

@app.route('/nexus-webhook', methods=['POST'])
def handle_nexus_webhook():
    payload = request.get_json(silent=True) or {}
    event = payload.get('event')
    
    if event == 'trace.error':
        agent_id = payload['agent_id']
        error = payload.get('error', 'Unknown error')
        trace_id = payload['trace_id']
        metadata = payload.get('metadata', {})
        
        # Trigger PagerDuty incident for production errors
        if metadata.get('environment') == 'production':
            trigger_pagerduty(
                summary=f"AI agent error: {agent_id}",
                details={
                    "trace_id": trace_id,
                    "error": error,
                    "agent": agent_id,
                    "user_id": metadata.get("user_id"),
                },
            )
    
    elif event == 'trace.slow':
        # Latency threshold exceeded
        latency_ms = payload.get('latency_ms', 0)
        agent_id = payload['agent_id']
        post_to_slack(
            f":clock3: Slow trace on {agent_id}: {latency_ms}ms (threshold exceeded)"
        )
    
    return jsonify({"ok": True})

def trigger_pagerduty(summary: str, details: dict):
    requests.post(
        "https://events.pagerduty.com/v2/enqueue",
        json={
            "routing_key": PAGERDUTY_KEY,
            "event_action": "trigger",
            "payload": {
                "summary": summary,
                "severity": "error",
                "source": "nexus-agent-monitor",
                "custom_details": details,
            },
        },
        timeout=10,  # don't let a slow PagerDuty call hang the webhook worker
    )

def post_to_slack(message: str):
    slack_url = os.environ["SLACK_WEBHOOK_URL"]
    requests.post(slack_url, json={"text": message}, timeout=10)
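The outbound posts above fire once and give up on any network error. For paging paths that matters, so a small retry wrapper with exponential backoff is worth adding. A sketch — `with_retries` is an illustrative helper, not part of any library here, and the attempt counts are arbitrary:

```python
import time


def with_retries(send, attempts: int = 3, base_delay: float = 0.5):
    """Call send(); on exception, retry with exponential backoff.

    Returns the first successful result, or re-raises the last
    exception once all attempts are exhausted.
    """
    for attempt in range(attempts):
        try:
            return send()
        except Exception:
            if attempt == attempts - 1:
                raise
            # 0.5s, 1s, 2s, ... between attempts
            time.sleep(base_delay * (2 ** attempt))
```

In the receiver above you would wrap the outbound calls, e.g. `with_retries(lambda: post_to_slack(msg))`, keeping the webhook handler itself fast and idempotent.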

Latency threshold alerts

Beyond error alerts, Nexus Pro supports latency threshold alerts: when a trace duration exceeds your configured threshold, Nexus fires a trace.slow event. Configure the threshold in Settings → Latency threshold.

Useful thresholds vary by agent type: a single-turn Q&A agent warrants a much lower threshold than a multi-step research agent that legitimately runs for a minute. Set the threshold above your agent's normal tail latency, not its average, or routine slow traces will page you constantly.
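Rather than guessing, you can derive the number from the latencies you already observe: take a high percentile of recent trace durations and add headroom. A purely illustrative sketch (`p95_threshold` is not a Nexus function; Nexus itself just takes a fixed number in Settings):

```python
import math


def p95_threshold(latencies_ms: list[int], headroom: float = 1.25) -> int:
    """Suggest a latency alert threshold: ~p95 of recent traces plus headroom.

    With the threshold at p95 x headroom, normal tail latency stays quiet
    and only genuinely abnormal traces fire trace.slow alerts.
    """
    if not latencies_ms:
        raise ValueError("need at least one observed latency")
    ordered = sorted(latencies_ms)
    # Nearest-rank 95th percentile: index ceil(0.95 * n) - 1.
    idx = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return int(ordered[idx] * headroom)
```

Recompute this periodically (say, weekly) from recent production traces so the threshold tracks real behavior instead of a number picked on launch day.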

Alert fatigue and rate limiting

The most common alerting mistake: alerting on every error. If your agent handles 1,000 traces per hour and has a 5% error rate, you'll receive 50 alerts per hour — which you'll start ignoring within a day.

The right approach:

- rate-limit per agent, so a failure burst produces one notification, not fifty
- alert on the error rate crossing a threshold, not on every individual trace
- route by severity: page on production errors, post slow traces and staging failures to Slack

Next steps

Webhook and email alerts are included in the Nexus Pro plan ($9/mo flat). Sign up, connect your agents, and configure a webhook URL in Settings to start receiving alerts within minutes.

Get alerts when your agents fail

Webhook + email alerts on Pro. Free tier available, no credit card required.