Observability for Model Context Protocol (MCP) Servers: Tracing Tool Calls with Nexus
The Model Context Protocol (MCP) lets AI hosts like Claude Desktop and Cursor call your server's tools over a standard JSON-RPC transport — but when a tool call returns the wrong result, takes 10 seconds, or fails silently, the host LLM has no way to surface which tool failed or why. Here's how to wrap MCP tool handlers with Nexus spans in both Python (FastMCP) and TypeScript (@modelcontextprotocol/sdk) to get full trace-level visibility into every tool call your server handles.
What the Model Context Protocol is
The Model Context Protocol (MCP) is an open standard introduced by Anthropic that lets AI hosts — applications like Claude Desktop, Cursor, Cline, and any other LLM-powered client — call tools exposed by external servers over a standardized JSON-RPC transport. You write an MCP server that advertises a list of tools (functions with typed input schemas), and any compatible host can discover and invoke them.
A typical MCP server exposes tools like these:
- search_docs — semantic search over your documentation or knowledge base
- run_sql — execute a read-only SQL query against an analytics database
- get_weather — fetch current weather for a location
- create_issue — open a GitHub issue from natural language
- web_search — call a search API and return results
The host LLM decides when to call each tool, passes typed arguments, and integrates the tool’s response into its next generation. MCP decouples tool implementation from the AI host — the same server can be used by Claude Desktop, Cursor, and your own custom agent with no changes.
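Under the hood, discovery and invocation are plain JSON-RPC 2.0 messages. A minimal sketch of the two requests a host sends, using the `tools/list` and `tools/call` method names from the MCP spec (the tool name and arguments here are hypothetical):

```python
import json

# The host first discovers which tools the server advertises...
list_request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/list",
}

# ...then invokes one with arguments matching the tool's input schema.
call_request = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {
        "name": "get_weather",            # hypothetical tool
        "arguments": {"city": "Berlin"},  # typed per the tool's schema
    },
}

print(json.dumps(call_request, indent=2))
```

The server's response to `tools/call` carries the result content that the host LLM folds into its next generation.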
Why MCP servers need observability
From the host LLM’s perspective, a tool call is atomic: it sends arguments and receives a result. If the tool takes five seconds, returns garbage, or throws a non-descriptive error, the host sees only the failure — not which internal step caused it. Three failure modes are invisible without external instrumentation:
- Silent errors — your tool handler catches an exception and returns an empty string or a generic error message. The LLM hallucinates an answer because it has no data. The host logs show a successful tool call.
- Latency spikes — a database query or external API call inside a tool takes 8 seconds. The LLM waits, the user sees a spinner, and you have no span-level data to identify which tool is slow.
- Input/output blind spots — you know the tool was called but not what arguments the LLM chose to pass, or whether the output was truncated before being returned. Debugging prompt quality or tool design is impossible without this.
MCP servers are the interface between your infrastructure and an AI host you don’t control. Instrumenting them at the tool-call level gives you the only complete picture of what’s happening inside that interface.
Install the SDK
Python (FastMCP):

```bash
pip install nexus-agent mcp fastmcp
```

TypeScript (@modelcontextprotocol/sdk):

```bash
npm install @keylightdigital/nexus @modelcontextprotocol/sdk zod
```
Basic setup: Python with FastMCP
FastMCP is the high-level server API in the official MCP Python SDK. Initialize the Nexus client once at module level alongside your FastMCP app:
```python
import nexus_agent
from mcp.server.fastmcp import FastMCP

nexus = nexus_agent.Nexus(api_key="YOUR_API_KEY", agent_id="my-mcp-server")
app = FastMCP("my-server")
```
Then wrap each tool handler with a start_trace / start_span pair. Record the tool name and input in metadata:
```python
@app.tool()
def search_docs(query: str) -> str:
    """Search documentation for a query."""
    trace = nexus.start_trace(name="mcp_tool_call")
    span = nexus.start_span(
        trace_id=trace["trace_id"],
        name="search_docs",
        metadata={"tool": "search_docs", "input": query},
    )
    try:
        result = _do_search(query)
        nexus.end_span(
            span_id=span["id"],
            status="success",
            metadata={"output_length": len(result)},
        )
        nexus.end_trace(trace_id=trace["trace_id"], status="success")
        return result
    except Exception as e:
        nexus.end_span(
            span_id=span["id"],
            status="error",
            metadata={"error": str(e)},
        )
        nexus.end_trace(trace_id=trace["trace_id"], status="error")
        raise
```
Every tool call now appears in Nexus as a trace with a single span. The tool field in metadata lets you filter the Nexus dashboard by tool name to see per-tool latency and error distributions.
Scaling to multiple tools: a Python context manager
Repeating the start_trace / end_trace block in every tool handler adds noise. A small context manager centralizes the span lifecycle and adds latency tracking automatically:
```python
import contextlib
import time

@contextlib.contextmanager
def traced_tool(tool_name: str, inputs: dict):
    """Context manager that wraps any MCP tool call with a Nexus span."""
    trace = nexus.start_trace(name="mcp_tool_call")
    span = nexus.start_span(
        trace_id=trace["trace_id"],
        name=tool_name,
        metadata={"tool": tool_name, **inputs},
    )
    started = time.monotonic()
    try:
        yield trace, span
        latency_ms = (time.monotonic() - started) * 1000
        nexus.end_span(
            span_id=span["id"],
            status="success",
            metadata={"latency_ms": round(latency_ms, 1)},
        )
        nexus.end_trace(trace_id=trace["trace_id"], status="success")
    except Exception as exc:
        latency_ms = (time.monotonic() - started) * 1000
        nexus.end_span(
            span_id=span["id"],
            status="error",
            metadata={"error": str(exc), "latency_ms": round(latency_ms, 1)},
        )
        nexus.end_trace(trace_id=trace["trace_id"], status="error")
        raise

@app.tool()
def get_weather(city: str) -> str:
    with traced_tool("get_weather", {"city": city}):
        return _fetch_weather(city)

@app.tool()
def run_sql(query: str) -> list:
    with traced_tool("run_sql", {"query": query}):
        return _execute_query(query)
```
The latency_ms field recorded on every span lets you build a P95 latency view per tool in Nexus without any additional instrumentation. If run_sql spikes to 4× normal latency after a schema migration, you’ll see it immediately.
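If you want a quick p95 outside the dashboard, you can compute one locally from `latency_ms` values pulled out of span metadata. A small sketch using the nearest-rank method (the sample values are illustrative):

```python
import math

def p95(latencies_ms: list[float]) -> float:
    """Nearest-rank 95th percentile of a list of latency samples."""
    if not latencies_ms:
        return 0.0
    ordered = sorted(latencies_ms)
    # Nearest-rank method: ceil(0.95 * n) gives the 1-based rank.
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]

samples = [120.0, 95.0, 110.0, 4000.0, 130.0, 105.0, 98.0, 102.0, 115.0, 125.0]
print(p95(samples))  # the single 4000 ms outlier dominates the tail
```

This is exactly why p95 beats the average for tool SLOs: one slow database call barely moves the mean but is fully visible in the tail.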
TypeScript: basic setup with @modelcontextprotocol/sdk
The official TypeScript SDK uses McpServer with Zod schema definitions. Initialize Nexus alongside your server:
```typescript
import { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js'
import { StdioServerTransport } from '@modelcontextprotocol/sdk/server/stdio.js'
import { z } from 'zod'
import Nexus from '@keylightdigital/nexus'

const nexus = new Nexus({ apiKey: 'YOUR_API_KEY', agentId: 'my-mcp-server' })
const server = new McpServer({ name: 'my-server', version: '1.0.0' })
```
Then instrument each tool registration with startTrace and startSpan:
```typescript
server.tool(
  'search_docs',
  { query: z.string() },
  async ({ query }) => {
    const trace = await nexus.startTrace({ name: 'mcp_tool_call' })
    const span = await nexus.startSpan({
      traceId: trace.traceId,
      name: 'search_docs',
      metadata: { tool: 'search_docs', input: query },
    })
    try {
      const result = await doSearch(query)
      await nexus.endSpan({
        spanId: span.id,
        status: 'success',
        metadata: { outputLength: result.length },
      })
      await nexus.endTrace({ traceId: trace.traceId, status: 'success' })
      return { content: [{ type: 'text', text: result }] }
    } catch (err) {
      await nexus.endSpan({
        spanId: span.id,
        status: 'error',
        metadata: { error: String(err) },
      })
      await nexus.endTrace({ traceId: trace.traceId, status: 'error' })
      throw err
    }
  }
)
```
TypeScript: a reusable tool wrapper
Just as with Python, a wrapper function eliminates the repeated span boilerplate and ensures every tool gets latency tracking:
```typescript
function tracedTool<TArgs extends Record<string, unknown>>(
  toolName: string,
  handler: (args: TArgs) => Promise<{ content: Array<{ type: string; text: string }> }>
) {
  return async (args: TArgs) => {
    const started = Date.now()
    const trace = await nexus.startTrace({ name: 'mcp_tool_call' })
    const span = await nexus.startSpan({
      traceId: trace.traceId,
      name: toolName,
      metadata: { tool: toolName, ...args },
    })
    try {
      const result = await handler(args)
      await nexus.endSpan({
        spanId: span.id,
        status: 'success',
        metadata: { latencyMs: Date.now() - started },
      })
      await nexus.endTrace({ traceId: trace.traceId, status: 'success' })
      return result
    } catch (err) {
      await nexus.endSpan({
        spanId: span.id,
        status: 'error',
        metadata: { error: String(err), latencyMs: Date.now() - started },
      })
      await nexus.endTrace({ traceId: trace.traceId, status: 'error' })
      throw err
    }
  }
}

server.tool('get_weather', { city: z.string() }, tracedTool('get_weather', async ({ city }) => {
  const data = await fetchWeather(city)
  return { content: [{ type: 'text', text: JSON.stringify(data) }] }
}))

server.tool('run_sql', { query: z.string() }, tracedTool('run_sql', async ({ query }) => {
  const rows = await executeQuery(query)
  return { content: [{ type: 'text', text: JSON.stringify(rows) }] }
}))
```
The tracedTool wrapper accepts any tool handler and returns a span-wrapped version with the same type signature. Add it to your server registration once and every tool call lands in Nexus automatically.
What to record in metadata
The most useful fields to capture at the span level for MCP tools:
- tool — the tool name. Enables per-tool filtering in the Nexus dashboard without querying by span name.
- input / inputs — the arguments the LLM passed. Essential for debugging cases where the LLM chose the wrong arguments.
- output_length — the length of the result string or the number of rows returned. A sudden drop in output length (zero results, empty string) often signals a silent failure.
- latency_ms — wall-clock time inside the tool handler. Compare against an SLO (e.g., p95 < 500 ms) to catch slow tools before users do.
- error — the exception message or error code when the tool fails. Gives you the "why" that the host LLM never sees.
Tracking which tools get called most often
After a few days of production traffic, the Nexus trace list gives you a natural frequency distribution by tool name. Filter by metadata.tool in the Nexus dashboard to see call volume per tool. High-frequency tools are your optimization priority — a 50ms latency improvement on a tool called 500 times per day saves more than the same improvement on a tool called 5 times.
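The same distribution is easy to reproduce offline from exported spans, assuming each span carries the metadata.tool field shown earlier (the span shape below is illustrative):

```python
from collections import Counter

# Spans as exported from Nexus; shape assumed for illustration.
spans = [
    {"metadata": {"tool": "search_docs"}},
    {"metadata": {"tool": "search_docs"}},
    {"metadata": {"tool": "run_sql"}},
    {"metadata": {"tool": "search_docs"}},
]

volume = Counter(s["metadata"]["tool"] for s in spans)
for tool, count in volume.most_common():
    print(f"{tool}: {count}")
```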
Low-frequency tools that the LLM consistently selects with wrong arguments indicate a tool description problem. If run_sql is called with malformed queries, update the tool’s JSON schema description and example to steer the LLM toward valid inputs.
Alerting when a tool returns an error
Use the Nexus API to check error rates per tool and send an alert before users notice:
```python
# In your MCP server startup or a separate monitoring process
import requests

NEXUS_BASE = "https://api.nexus.keylightdigital.dev"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

def check_tool_error_rate(tool_name: str) -> float:
    """Return the error rate for a specific MCP tool over its most recent spans."""
    resp = requests.get(
        f"{NEXUS_BASE}/v1/spans",
        headers=HEADERS,
        params={"name": tool_name, "limit": 200},
    )
    spans = resp.json().get("spans", [])
    if not spans:
        return 0.0
    errors = sum(1 for s in spans if s.get("status") == "error")
    return errors / len(spans)

# Alert if any tool exceeds a 10% error rate
TOOLS = ["search_docs", "get_weather", "run_sql"]
for tool in TOOLS:
    rate = check_tool_error_rate(tool)
    if rate > 0.10:
        print(f"ALERT: {tool} error rate is {rate:.0%} — check Nexus traces")
```
Run this as a cron job every five minutes. When any tool exceeds a 10% error rate, you get an alert immediately — not when a user complains that the AI assistant “isn’t working.”
You can also use Nexus webhook alerts to post directly to Slack whenever an error rate threshold is crossed, without any polling infrastructure.
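If you do stick with the polling script, forwarding the alert to Slack is a single POST to a standard incoming webhook. A sketch using only the standard library (the webhook URL is a placeholder you generate in Slack):

```python
import json
import urllib.request

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def build_alert(tool: str, rate: float) -> dict:
    """Build the Slack incoming-webhook payload for a tool error-rate alert."""
    return {"text": f"ALERT: MCP tool `{tool}` error rate is {rate:.0%}. Check Nexus traces."}

def alert_slack(tool: str, rate: float) -> None:
    """POST the alert payload to the Slack incoming webhook."""
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps(build_alert(tool, rate)).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=10)
```

Call `alert_slack(tool, rate)` in place of the `print` in the polling loop above.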
What to monitor in production
- Per-tool error rate: Track error rates separately for each tool. A spike in run_sql errors is a different problem than a spike in get_weather errors, and the fix is different.
- P95 latency per tool: Set latency SLOs per tool. A search_docs call over 800 ms degrades the user's perceived AI response time even if the LLM is fast.
- Output length distribution: Zero-length results from a normally productive tool indicate a retrieval failure, an empty database query, or quota exhaustion upstream. Catch it before the LLM hallucinates a substitute.
- Tool call volume trends: A sudden drop in call volume means the LLM stopped selecting the tool — usually because a recent prompt change removed it from context or a schema change made it unselectable. Increasing volume can signal a loop.
- Input argument patterns: Recurring malformed inputs (empty strings, out-of-range values) indicate the LLM is misunderstanding the tool’s contract. Update the tool description or add input validation that surfaces a helpful error message rather than an exception.
Next steps
MCP servers are the edge of your AI infrastructure — the boundary between your code and an AI host you don’t control. Wrapping each tool handler with a Nexus span gives you the input, output, latency, and error data you need to debug failures, optimize slow tools, and catch regressions before your users do. Sign up for a free Nexus account to start tracing your MCP server tool calls today, or read the Flowise integration guide if you’re building on top of a no-code AI workflow platform.
Trace every MCP tool call
Free tier, no credit card required. Full span-level visibility in under 5 minutes.