2026-04-27 · 10 min read

Monitoring Google Gemini and Vertex AI Agents with Nexus

Google Gemini and Vertex AI offer two entry points for building AI agents: the google-generativeai SDK for direct Gemini API access and the Vertex AI SDK for enterprise GCP-hosted agents. When a safety filter silently blocks a generation, a function-calling loop spins without terminating, or Vertex AI Search grounding returns zero results, the API response tells you what happened but not when or why. Here's how to wrap Gemini and Vertex AI agents in Nexus traces for full span-level observability.

Two SDK entry points for Google AI agents

Google offers two distinct paths for building AI agents, each with different abstractions and observability challenges:

- google-generativeai (Python) and @google/generative-ai (TypeScript): direct Gemini API access with an API key — the fastest route for prototypes and standalone services
- Vertex AI SDK (vertexai): GCP-hosted access with IAM authentication, regional deployment, and enterprise features such as Vertex AI Search grounding

Both SDKs expose safety filters that can silently block a generation with a finish_reason of SAFETY, RECITATION, or OTHER. Without instrumentation, a safety block in production looks identical to a network timeout — you get no response, and you don’t know why.

Wrapping generate_content() in Nexus traces

The core pattern: open a Nexus trace before calling generate_content(), record the response in a span with token counts and finish reason, then check for safety blocks and record them as error spans.

import google.generativeai as genai
from nexus_sdk import NexusClient

genai.configure(api_key="YOUR_GEMINI_API_KEY")
nexus = NexusClient(api_key="YOUR_NEXUS_API_KEY")

model = genai.GenerativeModel("gemini-1.5-pro")

def run_gemini_agent(prompt: str, user_id: str) -> str:
    trace = nexus.start_trace(
        name="gemini_agent_run",
        metadata={"user_id": user_id, "model": "gemini-1.5-pro"}
    )

    span = nexus.start_span(
        trace_id=trace["trace_id"],
        name="generate_content",
        metadata={"prompt_length": len(prompt)}
    )

    try:
        response = model.generate_content(prompt)

        # Check for a block before accessing text. A prompt-level block
        # leaves candidates empty entirely — indexing candidates[0] would
        # raise an IndexError — so handle that case first.
        if not response.candidates:
            finish_reason = response.prompt_feedback.block_reason.name
            nexus.end_span(
                span_id=span["id"],
                status="error",
                metadata={"finish_reason": finish_reason}
            )
            nexus.end_trace(trace_id=trace["trace_id"], status="error", metadata={"blocked_by": finish_reason})
            return f"[Blocked: {finish_reason}]"

        finish_reason = response.candidates[0].finish_reason.name
        if finish_reason in ("SAFETY", "RECITATION", "OTHER"):
            nexus.end_span(
                span_id=span["id"],
                status="error",
                metadata={
                    "finish_reason": finish_reason,
                    "safety_ratings": [
                        {"category": r.category.name, "probability": r.probability.name}
                        for r in response.candidates[0].safety_ratings
                    ]
                }
            )
            nexus.end_trace(trace_id=trace["trace_id"], status="error", metadata={"blocked_by": finish_reason})
            return f"[Blocked: {finish_reason}]"

        usage = response.usage_metadata
        nexus.end_span(
            span_id=span["id"],
            status="success",
            metadata={
                "finish_reason": finish_reason,
                "prompt_tokens": usage.prompt_token_count,
                "completion_tokens": usage.candidates_token_count,
                "total_tokens": usage.total_token_count,
                # $3.50 input / $10.50 output per 1M tokens for gemini-1.5-pro
                "estimated_cost_usd": round(
                    usage.prompt_token_count * 0.0000035 +
                    usage.candidates_token_count * 0.0000105, 6
                )
            }
        )
        nexus.end_trace(trace_id=trace["trace_id"], status="success")
        return response.text

    except Exception as e:
        nexus.end_span(span_id=span["id"], status="error", metadata={"error": str(e)})
        nexus.end_trace(trace_id=trace["trace_id"], status="error")
        raise

The finish_reason check is critical: Gemini does not raise an exception for safety blocks. The response object is populated with candidates, but accessing response.text on a blocked response raises a ValueError. Recording the block as an error span gives you a searchable trace of which prompts trigger safety filters and which categories are flagged.

Tracing Gemini function calling with tool_config

Gemini’s function calling uses tool_config to control how the model selects tools. Setting function_calling_config.mode = "AUTO" lets the model decide when to call a function versus generating a direct response. The agent loop is structurally similar to OpenAI’s tool use:
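Besides AUTO, the mode field accepts ANY (force the model to call a function, optionally restricted to a whitelist via allowed_function_names) and NONE (disable tool calling entirely). As plain tool_config dicts:

```python
# The three function_calling_config modes, as plain tool_config dicts.
AUTO = {"function_calling_config": {"mode": "AUTO"}}  # model decides per turn

# Force a function call; allowed_function_names restricts which ones
# the model may choose.
ANY = {
    "function_calling_config": {
        "mode": "ANY",
        "allowed_function_names": ["search_knowledge_base"],
    }
}

NONE = {"function_calling_config": {"mode": "NONE"}}  # never call tools
```

ANY is useful for deterministic pipelines where a tool call must happen on the first turn; recording which mode was active in trace metadata makes loop behavior much easier to interpret later.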

import google.generativeai as genai
from nexus_sdk import NexusClient

nexus = NexusClient(api_key="YOUR_NEXUS_API_KEY")

search_tool = genai.protos.Tool(
    function_declarations=[
        genai.protos.FunctionDeclaration(
            name="search_knowledge_base",
            description="Search the internal knowledge base for answers",
            parameters=genai.protos.Schema(
                type=genai.protos.Type.OBJECT,
                properties={
                    "query": genai.protos.Schema(type=genai.protos.Type.STRING)
                },
                required=["query"]
            )
        )
    ]
)

def search_knowledge_base(query: str) -> dict:
    return {"results": [f"Result for: {query}"]}

def run_function_calling_agent(user_query: str) -> str:
    model = genai.GenerativeModel(
        "gemini-1.5-pro",
        tools=[search_tool],
        tool_config={"function_calling_config": {"mode": "AUTO"}}
    )
    chat = model.start_chat()

    trace = nexus.start_trace(
        name="gemini_function_calling_agent",
        metadata={"user_query": user_query[:200]}
    )

    loop_count = 0
    current_message = user_query

    max_loops = 8  # guard against tool-call loops that never reach a final answer

    while True:
        loop_count += 1
        if loop_count > max_loops:
            nexus.end_trace(
                trace_id=trace["trace_id"],
                status="error",
                metadata={"error": "max_loops_exceeded", "loop_count": loop_count}
            )
            raise RuntimeError(f"Function-calling loop exceeded {max_loops} iterations")
        completion_span = nexus.start_span(
            trace_id=trace["trace_id"],
            name=f"generate_content_loop_{loop_count}",
            metadata={"loop": loop_count}
        )

        response = chat.send_message(current_message)
        candidate = response.candidates[0]
        finish_reason = candidate.finish_reason.name

        # Gemini reports finish_reason STOP even when the model requests a
        # function call, so branch on the presence of function_call parts
        # rather than on finish_reason alone.
        function_calls = [
            part.function_call for part in candidate.content.parts
            if part.function_call
        ]

        if not function_calls:
            usage = response.usage_metadata
            nexus.end_span(
                span_id=completion_span["id"],
                status="success",
                metadata={
                    "finish_reason": finish_reason,
                    "prompt_tokens": usage.prompt_token_count,
                    "completion_tokens": usage.candidates_token_count
                }
            )
            nexus.end_trace(
                trace_id=trace["trace_id"],
                status="success",
                metadata={"loop_count": loop_count, "total_tokens": usage.total_token_count}
            )
            return response.text

        nexus.end_span(
            span_id=completion_span["id"],
            status="success",
            metadata={"finish_reason": finish_reason, "tool_call_count": len(function_calls)}
        )

        tool_results = []
        for fc in function_calls:
            tool_span = nexus.start_span(
                trace_id=trace["trace_id"],
                name=f"tool_call:{fc.name}",
                metadata={"function": fc.name, "args": dict(fc.args)}
            )
            try:
                # Single-tool agent: dispatch directly. With several
                # declared tools, route on fc.name instead.
                result = search_knowledge_base(**fc.args)
                nexus.end_span(
                    span_id=tool_span["id"],
                    status="success",
                    metadata={"result_keys": list(result.keys())}
                )
                tool_results.append(genai.protos.Part(
                    function_response=genai.protos.FunctionResponse(
                        name=fc.name,
                        response={"output": result}
                    )
                ))
            except Exception as e:
                nexus.end_span(span_id=tool_span["id"], status="error", metadata={"error": str(e)})
                raise

        current_message = genai.protos.Content(parts=tool_results, role="user")
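The example above routes every call to search_knowledge_base. With more than one declared tool, a registry keyed by the declared function name keeps the loop generic. A sketch — get_ticket_status is a hypothetical second tool added for illustration:

```python
def search_knowledge_base(query: str) -> dict:
    return {"results": [f"Result for: {query}"]}

def get_ticket_status(ticket_id: str) -> dict:
    # Hypothetical second tool, for illustration only.
    return {"status": "open", "ticket_id": ticket_id}

# Map declared function names to local handlers.
TOOL_REGISTRY = {
    "search_knowledge_base": search_knowledge_base,
    "get_ticket_status": get_ticket_status,
}

def dispatch_tool_call(name: str, args: dict) -> dict:
    handler = TOOL_REGISTRY.get(name)
    if handler is None:
        # Surface unknown tools as structured errors rather than
        # crashing the loop — the model can recover from them.
        return {"error": f"unknown tool: {name}"}
    return handler(**args)

print(dispatch_tool_call("search_knowledge_base", {"query": "refund policy"}))
# → {'results': ['Result for: refund policy']}
```

In the loop above, `result = dispatch_tool_call(fc.name, dict(fc.args))` would replace the direct call, and the registry doubles as the source of truth for which tools to declare.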

Tracing Vertex AI Search grounding

Vertex AI agents can ground responses in a Vertex AI Search datastore — the model retrieves enterprise documents before generating, which dramatically reduces hallucination on internal knowledge. The grounding response includes grounding_metadata with the retrieved chunks. Recording these in Nexus spans lets you detect when grounding returns zero results (often the root cause of hallucination in production).

from vertexai.generative_models import GenerativeModel, Tool, grounding
from vertexai import init as vertexai_init
from nexus_sdk import NexusClient

vertexai_init(project="YOUR_GCP_PROJECT", location="us-central1")
nexus = NexusClient(api_key="YOUR_NEXUS_API_KEY")

def run_vertex_grounded_agent(query: str, user_id: str) -> str:
    # Ground in a Vertex AI Search datastore (not public Google Search).
    grounding_tool = Tool.from_retrieval(
        grounding.Retrieval(
            grounding.VertexAISearch(
                datastore=(
                    "projects/YOUR_GCP_PROJECT/locations/global/"
                    "collections/default_collection/dataStores/YOUR_DATASTORE_ID"
                )
            )
        )
    )

    model = GenerativeModel(
        "gemini-1.5-pro-002",
        tools=[grounding_tool]
    )

    trace = nexus.start_trace(
        name="vertex_grounded_agent",
        metadata={"user_id": user_id, "query": query[:200]}
    )

    span = nexus.start_span(
        trace_id=trace["trace_id"],
        name="grounded_generate",
        metadata={"grounding": "vertex_search"}
    )

    try:
        response = model.generate_content(query)
        candidate = response.candidates[0]

        grounding_chunks = []
        if hasattr(candidate, "grounding_metadata") and candidate.grounding_metadata:
            meta = candidate.grounding_metadata
            grounding_chunks = [
                {"uri": chunk.retrieved_context.uri, "title": chunk.retrieved_context.title}
                for chunk in (meta.grounding_chunks or [])
                if chunk.retrieved_context  # hasattr is always True on proto fields; check content instead
            ]

        usage = response.usage_metadata
        nexus.end_span(
            span_id=span["id"],
            status="success",
            metadata={
                "finish_reason": candidate.finish_reason.name,
                "grounding_chunk_count": len(grounding_chunks),
                "grounding_chunks": grounding_chunks[:5],
                "prompt_tokens": usage.prompt_token_count,
                "completion_tokens": usage.candidates_token_count,
                # $3.50 input / $10.50 output per 1M tokens for gemini-1.5-pro
                "estimated_cost_usd": round(
                    usage.prompt_token_count * 0.0000035 +
                    usage.candidates_token_count * 0.0000105, 6
                )
            }
        )
        nexus.end_trace(trace_id=trace["trace_id"], status="success")
        return response.text

    except Exception as e:
        nexus.end_span(span_id=span["id"], status="error", metadata={"error": str(e)})
        nexus.end_trace(trace_id=trace["trace_id"], status="error")
        raise

When grounding_chunk_count is 0, the model generated its response without retrieval — this is the most common source of hallucination in grounded agents. Alerting on grounding_chunk_count == 0 for more than a threshold percentage of requests lets you detect datastore configuration issues before they become user complaints.
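That alert reduces to a simple ratio over recent span metadata. A sketch, assuming spans are fetched from Nexus as dicts carrying the grounding_chunk_count field recorded above (the fetch itself is omitted):

```python
def zero_grounding_rate(spans: list[dict]) -> float:
    """Fraction of grounded-generation spans that retrieved no chunks."""
    if not spans:
        return 0.0
    zero = sum(1 for s in spans if s.get("grounding_chunk_count", 0) == 0)
    return zero / len(spans)

def should_alert(spans: list[dict], threshold: float = 0.2) -> bool:
    # Fire when more than `threshold` of requests generated without retrieval.
    return zero_grounding_rate(spans) > threshold

spans = [
    {"grounding_chunk_count": 3},
    {"grounding_chunk_count": 0},
    {"grounding_chunk_count": 2},
    {"grounding_chunk_count": 0},
]
print(zero_grounding_rate(spans))  # 0.5
print(should_alert(spans))         # True
```

The 20% threshold is an assumption; tune it against your own baseline, since some queries legitimately retrieve nothing.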

TypeScript: Tracing with @google/generative-ai

The TypeScript SDK follows the same structure. The key difference: response.usageMetadata uses camelCase, and safety ratings are accessed via response.candidates[0].safetyRatings.

import { GoogleGenerativeAI } from "@google/generative-ai";
import { NexusClient } from "keylightdigital-nexus";

const genai = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const nexus = new NexusClient({ apiKey: process.env.NEXUS_API_KEY! });

export async function runGeminiAgent(prompt: string, userId: string): Promise<string> {
  const model = genai.getGenerativeModel({ model: "gemini-1.5-pro" });

  const trace = await nexus.startTrace({
    name: "gemini_agent_run",
    metadata: { userId, model: "gemini-1.5-pro" }
  });

  const span = await nexus.startSpan(trace.id, {
    name: "generateContent",
    metadata: { promptLength: prompt.length }
  });

  try {
    const result = await model.generateContent(prompt);
    const response = result.response;
    const candidate = response.candidates?.[0];
    const finishReason = candidate?.finishReason ?? "UNKNOWN";

    if (finishReason === "SAFETY" || finishReason === "RECITATION" || finishReason === "OTHER") {
      await nexus.endSpan(span.id, {
        status: "error",
        metadata: { finishReason, safetyRatings: candidate?.safetyRatings ?? [] }
      });
      await nexus.endTrace(trace.id, { status: "error", metadata: { blockedBy: finishReason } });
      return `[Blocked: ${finishReason}]`;
    }

    const usage = response.usageMetadata;
    await nexus.endSpan(span.id, {
      status: "success",
      metadata: {
        finishReason,
        promptTokens: usage?.promptTokenCount ?? 0,
        completionTokens: usage?.candidatesTokenCount ?? 0,
        totalTokens: usage?.totalTokenCount ?? 0,
        estimatedCostUsd: Number((
          // $3.50 input / $10.50 output per 1M tokens for gemini-1.5-pro
          (usage?.promptTokenCount ?? 0) * 0.0000035 +
          (usage?.candidatesTokenCount ?? 0) * 0.0000105
        ).toFixed(6))
      }
    });
    await nexus.endTrace(trace.id, { status: "success" });
    return response.text();
  } catch (err) {
    await nexus.endSpan(span.id, { status: "error", metadata: { error: String(err) } });
    await nexus.endTrace(trace.id, { status: "error" });
    throw err;
  }
}

Gemini vs GPT-4 vs Claude: cost-per-token comparison with Nexus trace data

One of the most valuable uses of Nexus trace data is comparing model costs for equivalent tasks. Here are approximate per-token rates (as of April 2026, subject to change) and what they mean in practice for agent workloads:

Model             | Input (per 1M tokens) | Output (per 1M tokens) | Best for
------------------|-----------------------|------------------------|-------------------------------------
Gemini 1.5 Flash  | $0.075                | $0.30                  | High-volume classification, routing
Gemini 1.5 Pro    | $3.50                 | $10.50                 | Long-context reasoning, multimodal
GPT-4o            | $2.50                 | $10.00                 | General reasoning, tool use
Claude 3.5 Sonnet | $3.00                 | $15.00                 | Code generation, analysis
Claude 3 Haiku    | $0.25                 | $1.25                  | Fast, cheap extraction tasks

With estimated_cost_usd recorded in every span, you can compare actual cost at the task level — not just model tier. A routing task that costs $0.0003 on Gemini 1.5 Flash vs $0.003 on GPT-4o is a 10x cost difference that compounds across millions of requests. Nexus lets you query by task type, compare finish quality via your own evaluation fields, and make the model-switching decision with data rather than intuition.
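With the table above as a rate card, the per-request comparison is simple arithmetic. A sketch — the rates are the April 2026 figures quoted above and will drift, and the 2,000/100 token routing workload is an assumption:

```python
# $ per 1M tokens (input, output), from the table above.
RATES = {
    "gemini-1.5-flash": (0.075, 0.30),
    "gemini-1.5-pro": (3.50, 10.50),
    "gpt-4o": (2.50, 10.00),
    "claude-3.5-sonnet": (3.00, 15.00),
    "claude-3-haiku": (0.25, 1.25),
}

def estimated_cost_usd(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Per-request cost for a model, given token counts from usage metadata."""
    inp, out = RATES[model]
    return round((prompt_tokens * inp + completion_tokens * out) / 1_000_000, 6)

# A small routing task: ~2,000 prompt tokens, ~100 completion tokens.
flash = estimated_cost_usd("gemini-1.5-flash", 2000, 100)
gpt4o = estimated_cost_usd("gpt-4o", 2000, 100)
print(flash, gpt4o, round(gpt4o / flash, 1))  # 0.00018 0.006 33.3
```

Plugging your own per-task token averages from Nexus spans into a function like this turns the rate card into concrete monthly projections per model tier.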

Monitoring multi-turn ChatSession conversations

Gemini’s ChatSession maintains conversation history automatically — which means prompt token counts grow with every turn. Without monitoring, a long conversation silently approaches the context window limit, causing either a truncation error or degraded response quality as early context is dropped.
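One mitigation is to trim old turns once utilization crosses a threshold and reseed a fresh session via model.start_chat(history=...). The trimming itself is a pure list operation, sketched below; the 1M-token window for gemini-1.5-pro, the 80% threshold, and the keep_last value are assumptions to tune:

```python
def trim_history(history: list, prompt_tokens: int,
                 context_window: int = 1_000_000,
                 threshold: float = 0.8,
                 keep_last: int = 10) -> list:
    """Drop old turns once the prompt occupies `threshold` of the window.

    `history` is the chat's message list (chat.history in the SDK); the
    trimmed list can seed a fresh session:
        chat = model.start_chat(history=trim_history(chat.history, tokens))
    """
    if prompt_tokens / context_window < threshold:
        return history  # still comfortably within the window
    return history[-keep_last:]  # keep only the most recent turns

print(len(trim_history(list(range(50)), prompt_tokens=100_000)))  # 50 (under threshold)
print(len(trim_history(list(range(50)), prompt_tokens=900_000)))  # 10 (trimmed)
```

Naive truncation loses early context, so for long-lived sessions consider summarizing the dropped turns into a single message instead of discarding them outright.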

import google.generativeai as genai
from nexus_sdk import NexusClient

nexus = NexusClient(api_key="YOUR_NEXUS_API_KEY")

def run_chat_session(session_id: str, user_messages: list[str]) -> list[str]:
    model = genai.GenerativeModel("gemini-1.5-pro")
    chat = model.start_chat()

    trace = nexus.start_trace(
        name="gemini_chat_session",
        metadata={"session_id": session_id, "turn_count": len(user_messages)}
    )

    responses = []
    cumulative_prompt_tokens = 0

    for i, message in enumerate(user_messages):
        turn_span = nexus.start_span(
            trace_id=trace["trace_id"],
            name=f"chat_turn_{i+1}",
            metadata={"turn": i+1, "message_length": len(message)}
        )

        response = chat.send_message(message)
        usage = response.usage_metadata
        # The latest turn's prompt token count includes the whole history.
        cumulative_prompt_tokens = usage.prompt_token_count

        nexus.end_span(
            span_id=turn_span["id"],
            status="success",
            metadata={
                "prompt_tokens": usage.prompt_token_count,
                "completion_tokens": usage.candidates_token_count,
                "cumulative_prompt_tokens": cumulative_prompt_tokens,
                # Assumes gemini-1.5-pro's 1M-token context window
                "context_utilization_pct": round(usage.prompt_token_count / 1_000_000 * 100, 2)
            }
        )
        responses.append(response.text)

    nexus.end_trace(
        trace_id=trace["trace_id"],
        status="success",
        metadata={
            "total_turns": len(user_messages),
            "peak_prompt_tokens": cumulative_prompt_tokens
        }
    )
    return responses

What to monitor in production

Once traces are flowing from your Gemini and Vertex AI agents, these metrics give you the most actionable signal:

- Safety block rate: the share of traces ending with finish_reason SAFETY, RECITATION, or OTHER, broken down by flagged category
- Zero-grounding rate: the percentage of grounded requests where grounding_chunk_count is 0
- Function-calling loop count: per-trace loop_count, to catch agents that spin without reaching a final answer
- Context utilization: cumulative prompt tokens per ChatSession relative to the model's context window
- Cost per task: estimated_cost_usd aggregated by task type and model tier
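These signals reduce to simple aggregations over trace metadata. A sketch over traces fetched from Nexus as dicts (the field names match those recorded earlier; the fetch itself and the alerting wiring are omitted):

```python
def safety_block_rate(traces: list[dict]) -> float:
    """Share of traces that ended blocked (metadata carries blocked_by)."""
    if not traces:
        return 0.0
    blocked = sum(1 for t in traces if t.get("blocked_by"))
    return blocked / len(traces)

def p95(values: list[float]) -> float:
    """Rough 95th-percentile helper for loop counts or token usage."""
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int(0.95 * len(ordered)))]

traces = [
    {"blocked_by": None},
    {"blocked_by": "SAFETY"},
    {"blocked_by": None},
    {"blocked_by": None},
]
print(safety_block_rate(traces))            # 0.25
print(p95([1, 1, 2, 2, 2, 3, 3, 4, 8, 9])) # 9
```

A p95 loop count that creeps up over days usually means a tool's responses have stopped satisfying the model — worth investigating before it becomes a cost problem.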

Next steps

Google Gemini and Vertex AI offer a compelling combination of cost efficiency (Gemini Flash), long-context capability (1.5 Pro), and enterprise grounding (Vertex AI Search). Instrumenting each generate_content call, each function tool span, and grounding metadata gives you the data to debug safety blocks, optimize costs, and choose the right model tier for each task in your pipeline. Sign up for a free Nexus account to start capturing traces from your Google AI agents today.

Add observability to Google Gemini and Vertex AI agents

Free tier, no credit card required. Full trace visibility in under 5 minutes.