Monitoring Google Gemini and Vertex AI Agents with Nexus
Google Gemini and Vertex AI offer two entry points for building AI agents: the google-generativeai SDK for direct Gemini API access and the Vertex AI SDK for enterprise GCP-hosted agents. When a safety filter silently blocks a generation, a function call loop spins without reaching stop, or Vertex AI Search grounding returns zero results, the API response tells you what happened but not when or why. Here's how to wrap Gemini and Vertex AI agents in Nexus traces for full span-level observability.
Two SDK entry points for Google AI agents
Google offers two distinct paths for building AI agents, each with different abstractions and observability challenges:
- google-generativeai — The direct Gemini API SDK. You call model.generate_content() or manage multi-turn conversations with ChatSession. Function calling is controlled via tool_config. Best for lightweight agents or when you want direct control over the model.
- google-cloud-aiplatform (Vertex AI SDK) — The enterprise GCP-hosted path. Adds Vertex AI Search grounding, managed model deployment, and integration with Google Cloud services. Agents can retrieve from a Vertex AI Search datastore before generating, giving them enterprise retrieval-augmented generation (RAG) without building your own retriever.
Both SDKs expose safety filters that can silently block a generation with a finish_reason of SAFETY, RECITATION, or OTHER. Without instrumentation, a safety block in production looks identical to a network timeout — you get no response, and you don’t know why.
Wrapping generate_content() in Nexus traces
The core pattern: open a Nexus trace before calling generate_content(), record the response in a span with token counts and finish reason, then check for safety blocks and record them as error spans.
import google.generativeai as genai
from nexus_sdk import NexusClient
genai.configure(api_key="YOUR_GEMINI_API_KEY")
nexus = NexusClient(api_key="YOUR_NEXUS_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro")
def run_gemini_agent(prompt: str, user_id: str) -> str:
trace = nexus.start_trace(
name="gemini_agent_run",
metadata={"user_id": user_id, "model": "gemini-1.5-pro"}
)
span = nexus.start_span(
trace_id=trace["trace_id"],
name="generate_content",
metadata={"prompt_length": len(prompt)}
)
try:
response = model.generate_content(prompt)
        # A prompt-level block returns no candidates at all; check
        # prompt_feedback before indexing into candidates
        if not response.candidates:
            block_reason = response.prompt_feedback.block_reason.name
            nexus.end_span(
                span_id=span["id"],
                status="error",
                metadata={"block_reason": block_reason}
            )
            nexus.end_trace(trace_id=trace["trace_id"], status="error", metadata={"blocked_by": block_reason})
            return f"[Blocked: {block_reason}]"
        # Check for a candidate-level safety block before accessing text
        finish_reason = response.candidates[0].finish_reason.name
if finish_reason in ("SAFETY", "RECITATION", "OTHER"):
nexus.end_span(
span_id=span["id"],
status="error",
metadata={
"finish_reason": finish_reason,
"safety_ratings": [
{"category": r.category.name, "probability": r.probability.name}
for r in response.candidates[0].safety_ratings
]
}
)
nexus.end_trace(trace_id=trace["trace_id"], status="error", metadata={"blocked_by": finish_reason})
return f"[Blocked: {finish_reason}]"
usage = response.usage_metadata
nexus.end_span(
span_id=span["id"],
status="success",
metadata={
"finish_reason": finish_reason,
"prompt_tokens": usage.prompt_token_count,
"completion_tokens": usage.candidates_token_count,
"total_tokens": usage.total_token_count,
                # Gemini 1.5 Pro: $3.50 / 1M input, $10.50 / 1M output tokens
                "estimated_cost_usd": round(
                    usage.prompt_token_count * 0.0000035 +
                    usage.candidates_token_count * 0.0000105, 6
                )
}
)
nexus.end_trace(trace_id=trace["trace_id"], status="success")
return response.text
except Exception as e:
nexus.end_span(span_id=span["id"], status="error", metadata={"error": str(e)})
nexus.end_trace(trace_id=trace["trace_id"], status="error")
raise
The finish_reason check is critical: Gemini does not raise an exception for safety blocks. The response object is populated with candidates, but accessing response.text on a blocked response raises a ValueError. Recording the block as an error span gives you a searchable trace of which prompts trigger safety filters and which categories are flagged.
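Because the block only surfaces when you touch response.text, it can help to centralize the check in one place. Here is a minimal sketch of such an accessor — safe_text is a hypothetical helper (not part of the SDK) that mirrors the checks above, assuming the response shape shown in the example:

```python
BLOCKED_REASONS = {"SAFETY", "RECITATION", "OTHER"}

def safe_text(response):
    """Return (text, finish_reason) without tripping the ValueError that
    response.text raises on a blocked candidate."""
    if not response.candidates:
        # Prompt-level block: the response carries no candidates at all
        return None, "PROMPT_BLOCKED"
    reason = response.candidates[0].finish_reason.name
    if reason in BLOCKED_REASONS:
        return None, reason
    return response.text, reason
```

Routing every generation through one accessor also guarantees the error span you record always agrees with the value you return to the caller.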
Tracing Gemini function calling with tool_config
Gemini’s function calling uses tool_config to control how the model selects tools. Setting function_calling_config.mode = "AUTO" lets the model decide when to call a function versus generating a direct response. The agent loop is structurally similar to OpenAI’s tool use:
import google.generativeai as genai
from nexus_sdk import NexusClient
nexus = NexusClient(api_key="YOUR_NEXUS_API_KEY")
search_tool = genai.protos.Tool(
function_declarations=[
genai.protos.FunctionDeclaration(
name="search_knowledge_base",
description="Search the internal knowledge base for answers",
parameters=genai.protos.Schema(
type=genai.protos.Type.OBJECT,
properties={
"query": genai.protos.Schema(type=genai.protos.Type.STRING)
},
required=["query"]
)
)
]
)
def search_knowledge_base(query: str) -> dict:
return {"results": [f"Result for: {query}"]}
def run_function_calling_agent(user_query: str) -> str:
model = genai.GenerativeModel(
"gemini-1.5-pro",
tools=[search_tool],
tool_config={"function_calling_config": {"mode": "AUTO"}}
)
chat = model.start_chat()
trace = nexus.start_trace(
name="gemini_function_calling_agent",
metadata={"user_query": user_query[:200]}
)
loop_count = 0
current_message = user_query
while True:
loop_count += 1
completion_span = nexus.start_span(
trace_id=trace["trace_id"],
name=f"generate_content_loop_{loop_count}",
metadata={"loop": loop_count}
)
response = chat.send_message(current_message)
        # Gemini's FinishReason enum has no TOOL_CALLS value; a candidate that
        # requests a function call still reports STOP, so branch on the presence
        # of function_call parts rather than on the finish reason alone
        candidate = response.candidates[0]
        function_calls = [
            part for part in candidate.content.parts
            if part.function_call.name
        ]
        if not function_calls:
            usage = response.usage_metadata
            nexus.end_span(
                span_id=completion_span["id"],
                status="success",
                metadata={
                    "finish_reason": candidate.finish_reason.name,
                    "prompt_tokens": usage.prompt_token_count,
                    "completion_tokens": usage.candidates_token_count
                }
            )
            nexus.end_trace(
                trace_id=trace["trace_id"],
                status="success",
                metadata={"loop_count": loop_count, "total_tokens": usage.total_token_count}
            )
            return response.text
        nexus.end_span(
            span_id=completion_span["id"],
            status="success",
            metadata={
                "finish_reason": candidate.finish_reason.name,
                "tool_call_count": len(function_calls)
            }
        )
        tool_results = []
        for part in function_calls:
fc = part.function_call
tool_span = nexus.start_span(
trace_id=trace["trace_id"],
name=f"tool_call:{fc.name}",
metadata={"function": fc.name, "args": dict(fc.args)}
)
try:
result = search_knowledge_base(**fc.args)
nexus.end_span(
span_id=tool_span["id"],
status="success",
metadata={"result_keys": list(result.keys())}
)
tool_results.append(genai.protos.Part(
function_response=genai.protos.FunctionResponse(
name=fc.name,
response={"output": result}
)
))
except Exception as e:
nexus.end_span(span_id=tool_span["id"], status="error", metadata={"error": str(e)})
raise
current_message = genai.protos.Content(parts=tool_results, role="user")
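Note that the while True loop above has no upper bound, so a model stuck re-requesting the same tool will spin until something times out. One way to bound it is a budget check at the top of each iteration — a sketch, where ToolLoopBudgetExceeded and the default limit of 8 are arbitrary choices of this example, not SDK features:

```python
class ToolLoopBudgetExceeded(RuntimeError):
    """Raised when the agent keeps requesting tools without reaching STOP."""

def check_loop_budget(loop_count: int, max_loops: int = 8) -> None:
    # Call before each send_message; lets the caller end the trace with
    # status="error" and the final loop_count in metadata
    if loop_count > max_loops:
        raise ToolLoopBudgetExceeded(
            f"agent exceeded {max_loops} tool-call loops without a final answer"
        )
```

Catching the exception at the call site gives you a clean error trace instead of a silent timeout, and feeds directly into loop-count monitoring.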
Tracing Vertex AI Search grounding
Vertex AI agents can ground responses in a Vertex AI Search datastore — the model retrieves enterprise documents before generating, which dramatically reduces hallucination on internal knowledge. The grounding response includes grounding_metadata with the retrieved chunks. Recording these in Nexus spans lets you detect when grounding returns zero results (often the root cause of hallucination in production).
from vertexai.generative_models import GenerativeModel, Tool, grounding
from vertexai import init as vertexai_init
from nexus_sdk import NexusClient
vertexai_init(project="YOUR_GCP_PROJECT", location="us-central1")
nexus = NexusClient(api_key="YOUR_NEXUS_API_KEY")
def run_vertex_grounded_agent(query: str, user_id: str, datastore_id: str) -> str:
    # Ground in a Vertex AI Search datastore; datastore_id is the full resource
    # name, e.g. projects/PROJECT/locations/global/collections/default_collection/dataStores/STORE
    grounding_tool = Tool.from_retrieval(
        grounding.Retrieval(grounding.VertexAISearch(datastore=datastore_id))
    )
    model = GenerativeModel(
        "gemini-1.5-pro-002",
        tools=[grounding_tool]
    )
trace = nexus.start_trace(
name="vertex_grounded_agent",
metadata={"user_id": user_id, "query": query[:200]}
)
span = nexus.start_span(
trace_id=trace["trace_id"],
name="grounded_generate",
metadata={"grounding": "vertex_search"}
)
try:
response = model.generate_content(query)
candidate = response.candidates[0]
grounding_chunks = []
if hasattr(candidate, "grounding_metadata") and candidate.grounding_metadata:
meta = candidate.grounding_metadata
grounding_chunks = [
{"uri": chunk.retrieved_context.uri, "title": chunk.retrieved_context.title}
for chunk in (meta.grounding_chunks or [])
if hasattr(chunk, "retrieved_context")
]
usage = response.usage_metadata
nexus.end_span(
span_id=span["id"],
status="success",
metadata={
"finish_reason": candidate.finish_reason.name,
"grounding_chunk_count": len(grounding_chunks),
"grounding_chunks": grounding_chunks[:5],
"prompt_tokens": usage.prompt_token_count,
"completion_tokens": usage.candidates_token_count,
"estimated_cost_usd": round(
usage.prompt_token_count * 0.0000035 +
usage.candidates_token_count * 0.00001050, 6
)
}
)
nexus.end_trace(trace_id=trace["trace_id"], status="success")
return response.text
except Exception as e:
nexus.end_span(span_id=span["id"], status="error", metadata={"error": str(e)})
nexus.end_trace(trace_id=trace["trace_id"], status="error")
raise
When grounding_chunk_count is 0, the model generated its response without retrieval — this is the most common source of hallucination in grounded agents. Alerting on grounding_chunk_count == 0 for more than a threshold percentage of requests lets you detect datastore configuration issues before they become user complaints.
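That alert can be as simple as a rolling-window check over recent grounding_chunk_count values pulled from span metadata. A sketch — the 20% default threshold here is an arbitrary starting point, not a recommendation:

```python
def grounding_zero_alert(chunk_counts: list[int], threshold_pct: float = 20.0) -> bool:
    """True when the share of recent requests that retrieved zero grounding
    chunks exceeds the threshold percentage."""
    if not chunk_counts:
        return False
    zero_pct = 100.0 * sum(1 for c in chunk_counts if c == 0) / len(chunk_counts)
    return zero_pct > threshold_pct
```

Tune the window size and threshold to your traffic: a low-volume datastore may need a longer window before the rate is meaningful.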
TypeScript: Tracing with @google/generative-ai
The TypeScript SDK follows the same structure. The key difference: response.usageMetadata uses camelCase, and safety ratings are accessed via response.candidates[0].safetyRatings.
import { GoogleGenerativeAI } from "@google/generative-ai";
import { NexusClient } from "keylightdigital-nexus";
const genai = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const nexus = new NexusClient({ apiKey: process.env.NEXUS_API_KEY! });
export async function runGeminiAgent(prompt: string, userId: string): Promise<string> {
const model = genai.getGenerativeModel({ model: "gemini-1.5-pro" });
const trace = await nexus.startTrace({
name: "gemini_agent_run",
metadata: { userId, model: "gemini-1.5-pro" }
});
const span = await nexus.startSpan(trace.id, {
name: "generateContent",
metadata: { promptLength: prompt.length }
});
try {
const result = await model.generateContent(prompt);
const response = result.response;
const candidate = response.candidates?.[0];
const finishReason = candidate?.finishReason ?? "UNKNOWN";
if (finishReason === "SAFETY" || finishReason === "RECITATION" || finishReason === "OTHER") {
await nexus.endSpan(span.id, {
status: "error",
metadata: { finishReason, safetyRatings: candidate?.safetyRatings ?? [] }
});
await nexus.endTrace(trace.id, { status: "error", metadata: { blockedBy: finishReason } });
return `[Blocked: ${finishReason}]`;
}
const usage = response.usageMetadata;
await nexus.endSpan(span.id, {
status: "success",
metadata: {
finishReason,
promptTokens: usage?.promptTokenCount ?? 0,
completionTokens: usage?.candidatesTokenCount ?? 0,
totalTokens: usage?.totalTokenCount ?? 0,
        // Gemini 1.5 Pro: $3.50 / 1M input, $10.50 / 1M output tokens
        estimatedCostUsd: Number((
          (usage?.promptTokenCount ?? 0) * 0.0000035 +
          (usage?.candidatesTokenCount ?? 0) * 0.0000105
        ).toFixed(6))
}
});
await nexus.endTrace(trace.id, { status: "success" });
return response.text();
} catch (err) {
await nexus.endSpan(span.id, { status: "error", metadata: { error: String(err) } });
await nexus.endTrace(trace.id, { status: "error" });
throw err;
}
}
Gemini vs GPT-4 vs Claude: cost-per-token comparison with Nexus trace data
One of the most valuable uses of Nexus trace data is comparing model costs for equivalent tasks. Here are approximate per-token rates (as of April 2026, subject to change) and what they mean in practice for agent workloads:
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Best for |
|---|---|---|---|
| Gemini 1.5 Flash | $0.075 | $0.30 | High-volume classification, routing |
| Gemini 1.5 Pro | $3.50 | $10.50 | Long-context reasoning, multimodal |
| GPT-4o | $2.50 | $10.00 | General reasoning, tool use |
| Claude 3.5 Sonnet | $3.00 | $15.00 | Code generation, analysis |
| Claude 3 Haiku | $0.25 | $1.25 | Fast, cheap extraction tasks |
With estimated_cost_usd recorded in every span, you can compare actual cost at the task level — not just model tier. A routing task that costs $0.0003 on Gemini 1.5 Flash vs $0.003 on GPT-4o is a 10x cost difference that compounds across millions of requests. Nexus lets you query by task type, compare finish quality via your own evaluation fields, and make the model-switching decision with data rather than intuition.
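As a sanity check on the estimated_cost_usd values your spans record, the table rates can be turned into a small calculator. The rates are hardcoded from the table above — they change over time, so treat this as a snapshot, not a source of truth:

```python
# (input, output) USD per 1M tokens, taken from the table above
RATES_PER_1M = {
    "gemini-1.5-flash": (0.075, 0.30),
    "gemini-1.5-pro": (3.50, 10.50),
    "gpt-4o": (2.50, 10.00),
    "claude-3.5-sonnet": (3.00, 15.00),
    "claude-3-haiku": (0.25, 1.25),
}

def task_cost_usd(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate the dollar cost of one call from its token counts."""
    input_rate, output_rate = RATES_PER_1M[model]
    return round((prompt_tokens * input_rate + completion_tokens * output_rate) / 1_000_000, 6)
```

Running the same prompt/completion token counts through each model's rates is a quick first cut at the switching decision before you invest in a full side-by-side evaluation.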
Monitoring multi-turn ChatSession conversations
Gemini’s ChatSession maintains conversation history automatically — which means prompt token counts grow with every turn. Without monitoring, a long conversation silently approaches the context window limit, causing either a truncation error or degraded response quality as early context is dropped.
import google.generativeai as genai
from nexus_sdk import NexusClient
nexus = NexusClient(api_key="YOUR_NEXUS_API_KEY")
def run_chat_session(session_id: str, user_messages: list[str]) -> list[str]:
model = genai.GenerativeModel("gemini-1.5-pro")
chat = model.start_chat()
trace = nexus.start_trace(
name="gemini_chat_session",
metadata={"session_id": session_id, "turn_count": len(user_messages)}
)
responses = []
cumulative_prompt_tokens = 0
for i, message in enumerate(user_messages):
turn_span = nexus.start_span(
trace_id=trace["trace_id"],
name=f"chat_turn_{i+1}",
metadata={"turn": i+1, "message_length": len(message)}
)
response = chat.send_message(message)
usage = response.usage_metadata
cumulative_prompt_tokens = usage.prompt_token_count
nexus.end_span(
span_id=turn_span["id"],
status="success",
metadata={
"prompt_tokens": usage.prompt_token_count,
"completion_tokens": usage.candidates_token_count,
"cumulative_prompt_tokens": cumulative_prompt_tokens,
"context_utilization_pct": round(usage.prompt_token_count / 1_000_000 * 100, 2)
}
)
responses.append(response.text)
nexus.end_trace(
trace_id=trace["trace_id"],
status="success",
metadata={
"total_turns": len(user_messages),
"peak_prompt_tokens": cumulative_prompt_tokens
}
)
return responses
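The context_utilization_pct field above divides by a hardcoded 1,000,000-token window; adjust that constant for your model version's actual context size. Under the same assumption, the summarization trigger reduces to a one-line predicate — a sketch, with the 80% threshold as an example value:

```python
def should_summarize(prompt_tokens: int,
                     context_window: int = 1_000_000,
                     threshold_pct: float = 80.0) -> bool:
    """True once the conversation's cumulative prompt tokens cross the
    utilization threshold, signaling it is time to summarize history."""
    return 100.0 * prompt_tokens / context_window >= threshold_pct
```

Calling this after each turn (with the usage.prompt_token_count just recorded) lets you summarize proactively instead of waiting for degraded responses.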
What to monitor in production
Once traces are flowing from your Gemini and Vertex AI agents, these metrics give you the most actionable signal:
- Safety block rate by category: Filter spans with finish_reason: SAFETY or RECITATION. A spike in SAFETY blocks often indicates a prompt template change that pushes responses into filtered territory.
- Function call loop count: Agents that loop more than expected (e.g., more than 5 turns) indicate the model is struggling to satisfy tool output constraints. Compare loop_count distributions across model versions to detect regressions after a model upgrade.
- Grounding chunk count (Vertex AI Search): Zero grounding chunks means the model answered from training data, not your datastore. Alert when this rate exceeds a threshold — it often signals a datastore indexing issue.
- Context utilization growth in ChatSession: When context_utilization_pct approaches 80%, users are approaching the context limit. Use this signal to trigger a conversation summarization step before quality degrades.
- Cost per task type: Group traces by task name and compare estimated_cost_usd medians. Tasks with high cost variance often reveal prompt inefficiencies — a simple few-shot example can reduce token usage by 30–50%.
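If you can export spans as plain dicts — an assumption here; adapt the field access to however you actually pull trace data out of Nexus — the first metric on this list reduces to a small aggregation:

```python
BLOCK_REASONS = ("SAFETY", "RECITATION", "OTHER")

def safety_block_rate(spans: list[dict]) -> float:
    """Fraction of generation spans whose recorded finish_reason was a block."""
    gens = [s for s in spans if s.get("name", "").startswith("generate_content")]
    if not gens:
        return 0.0
    blocked = sum(
        1 for s in gens
        if s.get("metadata", {}).get("finish_reason") in BLOCK_REASONS
    )
    return blocked / len(gens)
```

The same shape works for the other metrics: filter spans by name prefix, pull one metadata field, and aggregate.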
Next steps
Google Gemini and Vertex AI offer a compelling combination of cost efficiency (Gemini Flash), long-context capability (1.5 Pro), and enterprise grounding (Vertex AI Search). Instrumenting each generate_content call, each function tool span, and grounding metadata gives you the data to debug safety blocks, optimize costs, and choose the right model tier for each task in your pipeline. Sign up for a free Nexus account to start capturing traces from your Google AI agents today.
Add observability to Google Gemini and Vertex AI agents
Free tier, no credit card required. Full trace visibility in under 5 minutes.