
Kowshik Jallipalli

The Dead Reckoning Agent: Why Your LangGraph Pipeline Is Flying Blind (And How Google Just Fixed Half of It)

> *Part 2 of the Agent Reliability Series. Part 1 covered state persistence and conditional branching.*

I got an email from Google last week.
Not unusual. But this one landed differently. Cloud Observability was telling me that starting June 1, 2026, a new endpoint — telemetry.googleapis.com — would automatically activate on any Google Cloud project that already had Cloud Logging, Cloud Trace, or Cloud Monitoring running. No action required. No migration. Just suddenly, a native OpenTelemetry ingestion pipeline sitting live in my project, waiting.
I've been writing about agent reliability. State persistence. Crash recovery. Conditional branching that doesn't silently loop forever. And the whole time I was doing that, I was running LangGraph pipelines with zero observability. No traces. No spans. No way to answer the single most important question when something goes wrong at 3am: which node failed, what state did it have when it failed, and how many times had it already tried?
That's dead reckoning. Before GPS, sailors estimated their position from the last known location, their speed, and their heading. No confirmation. No ground truth. Just calculation and hope. That's what you're doing every time you debug a failed agent run by reading log output and inferring what must have happened between the graph's entry point and the crash. You know it started. You know it ended badly. Everything in between is estimation.
telemetry.googleapis.com is the GPS signal. Let me show you how to use it, what it costs when you get it wrong, and why you now have two paths that are fundamentally different architectural bets.

Why This Matters (The Audit Perspective)
After working through the failure modes in production LangGraph pipelines, observability gaps show up as the second-largest category of silent failures — right behind state persistence problems. And they're related. The reason state corruption is so hard to catch is that you have no trace of what the state looked like at each node when things went wrong. You have a final state and a crash. Dead reckoning.
The signal that matters most here: Google now calls telemetry.googleapis.com the recommended best practice for sending trace data — for both new and existing users — especially for those sending high volumes of trace data. That's a replacement recommendation, not an addition. If you're using the old Cloud Trace exporter, Google is telling you directly to migrate.
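What that migration looks like in code is mostly an exporter swap. A minimal sketch, assuming you are on the opentelemetry-exporter-gcp-trace package today; the authenticated gRPC channel the new exporter needs is built in setup_otel.py further down:

# Before: the proprietary Cloud Trace exporter (package: opentelemetry-exporter-gcp-trace)
# from opentelemetry.exporter.cloud_trace import CloudTraceSpanExporter
# exporter = CloudTraceSpanExporter()

# After: OTLP over gRPC to the native endpoint. In production this exporter needs
# ADC-backed gRPC channel credentials (the full wiring is in setup_otel.py below).
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

exporter = OTLPSpanExporter(
    endpoint="https://telemetry.googleapis.com:443",
    # credentials=channel_creds,  # built from Application Default Credentials below
)
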
The second signal: Google has published official instrumentation guidance specifically for LangGraph and Agent Development Kit frameworks — not just generic Python apps. Agent observability is a first-class concern at Google Cloud now. The tooling exists. The question is which approach gives you more for less overhead.
And the third signal — the one that appeared in my inbox — is the June 1, 2026 auto-activation. If you have Cloud Logging running on a project today, you are about to have an OTel ingestion endpoint running on that same project whether you configure it or not. The question is whether that endpoint is receiving useful agent traces or sitting idle while your pipeline guesses its way through failures.

The Two Approaches: Proprietary vs. Open Standard
LangSmith is LangChain's managed observability platform. If you're already using LangGraph, it's the zero-friction path — a @traceable decorator and an API key and your traces are flowing to app.langsmith.com. It knows the LangGraph data model. It surfaces inputs and outputs per node automatically. For getting started, nothing is faster.
OpenTelemetry on Google Cloud is the open standard path. More setup. More control. Traces stay in your own GCP project. Every attribute is a dimension you define. You can query by val_loss, by retry_count, by model path — because you put those on the span explicitly. And critically: you get a completely vendor-agnostic pipeline — you can create OTLP data using the OpenTelemetry SDK, collect and transform using an OpenTelemetry collector, and send directly to Cloud Monitoring without any proprietary format conversion.
The architectural bet is: do you want fast observability now with a vendor dependency, or slower setup now with full ownership of your telemetry data?

The Code: Instrumenting Your LangGraph Agent
Approach A — LangSmith: The Fast Path

# requirements: langsmith>=0.1.0
# Environment: LANGCHAIN_API_KEY must be set
# Cost: $0 for first 3,000 traces/month
#       then $0.005 per trace — 100k traces/month = $485/month

import os
from langsmith import traceable

# Guard upfront — a missing key fails silently at export time,
# not at import time. You'll think it's working. It isn't.
if not os.environ.get("LANGCHAIN_API_KEY"):
    raise EnvironmentError(
        "LANGCHAIN_API_KEY is not set. "
        "LangSmith will silently drop traces without it."
    )

@traceable(name="fastai-trainer")
def run_fastai_trainer_langsmith(state: dict) -> dict:
    """
    LangSmith instruments this automatically.
    Inputs and outputs are captured per call.

    What you get: zero-config tracing in app.langsmith.com
    What you don't get:
    - Custom queryable dimensions (val_loss, retry_count)
    - Data in your own GCP project
    - OTel-compatible export to Datadog, Grafana, Honeycomb
    - Control over what stays private
    """
    from fastai.vision.all import load_learner
    learn = load_learner(state["model_path"])
    learn.fine_tune(state["epoch"])
    return {
        "train_loss": float(learn.recorder.losses[-1]),
        "val_loss":   float(learn.recorder.values[-1][1]),
        "error_log":  None,
    }


What you get: Working traces in under 10 minutes. Node inputs and outputs captured automatically. Good enough for prototyping and solo projects.
What you still don't have: Any of your traces after you stop paying LangSmith. No custom attributes queryable at scale. No path to Datadog or Grafana without re-instrumentation.
Approach B — OpenTelemetry on Google Cloud: The Ownership Path

# setup_otel.py
# Requirements (Python 3.9+):
#   opentelemetry-sdk>=1.24.0
#   opentelemetry-exporter-otlp-proto-grpc>=1.24.0
#   google-auth>=2.29.0
#   grpcio>=1.62.0
#
# Credentials: Application Default Credentials (ADC) — NEVER hardcoded.
# Local dev:  gcloud auth application-default login
# Cloud Run:  automatic from attached service account
#
# Install:
#   pip install opentelemetry-sdk \
#               opentelemetry-exporter-otlp-proto-grpc \
#               google-auth grpcio

import os
import logging
from typing import Optional
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource, SERVICE_NAME
import google.auth
import google.auth.transport.requests
from google.auth.transport.grpc import AuthMetadataPlugin
import grpc

logger = logging.getLogger(__name__)


def setup_agent_telemetry(service_name: str) -> trace.Tracer:
    """
    Configures OTel to export agent traces to telemetry.googleapis.com
    via Application Default Credentials.

    Args:
        service_name: Identifies this agent in Cloud Trace.
                      Be specific: "fastai-eval-agent" not "agent".

    Returns:
        Configured OpenTelemetry Tracer instance.

    Raises:
        google.auth.exceptions.DefaultCredentialsError: ADC not configured.
        EnvironmentError: Cannot determine GCP project ID.
    """
    try:
        credentials, project_id = google.auth.default()
    except google.auth.exceptions.DefaultCredentialsError:
        raise google.auth.exceptions.DefaultCredentialsError(
            "No Application Default Credentials found.\n"
            "Fix: gcloud auth application-default login\n"
            "Or attach a service account on Cloud Run/GKE."
        )

    # Some ADC configurations don't return project_id from google.auth.default()
    if not project_id:
        project_id = os.environ.get("GOOGLE_CLOUD_PROJECT")
        if not project_id:
            raise EnvironmentError(
                "Cannot determine GCP project ID from ADC. "
                "Set the GOOGLE_CLOUD_PROJECT environment variable."
            )

    resource = Resource.create(attributes={
        SERVICE_NAME: service_name,
        "gcp.project_id": project_id,
    })

    # ── CRITICAL: Use gRPC exporter, NOT the HTTP exporter ───────────────
    # HTTP SDK exporters don't support dynamic token refresh.
    # A LangGraph pipeline running longer than ~60 minutes will silently
    # stop exporting traces when the ADC token expires.
    # You'll think you have observability. You'll have silence.
    # gRPC with AuthMetadataPlugin refreshes tokens automatically.
    request = google.auth.transport.requests.Request()
    auth_plugin = AuthMetadataPlugin(credentials=credentials, request=request)
    channel_creds = grpc.composite_channel_credentials(
        grpc.ssl_channel_credentials(),
        grpc.metadata_call_credentials(auth_plugin),
    )

    exporter = OTLPSpanExporter(
        endpoint="https://telemetry.googleapis.com:443",
        credentials=channel_creds,
    )

    provider = TracerProvider(resource=resource)
    # BatchSpanProcessor is non-blocking — spans export in background threads.
    # SimpleSpanProcessor is synchronous and adds latency to every node call.
    provider.add_span_processor(BatchSpanProcessor(exporter))
    trace.set_tracer_provider(provider)

    logger.info(
        "OTel configured. project=%s service=%s endpoint=telemetry.googleapis.com",
        project_id,
        service_name,
    )
    return trace.get_tracer(service_name)
# langgraph_instrumented.py
# Extends the FastAI eval graph from Part 1 with per-node OTel spans.
# Every node call is a traced operation with typed, queryable attributes.

import traceback
import logging
from typing import Optional, Literal
from typing_extensions import TypedDict
from opentelemetry import trace as otel_trace
from setup_otel import setup_agent_telemetry

logger = logging.getLogger(__name__)

# ── Initialize once at module level — NOT inside each node call ───────────
# Creating a new TracerProvider per node is a performance trap.
# It also creates multiple exporters, multiplying your trace volume
# and your Cloud Trace bill.
tracer = setup_agent_telemetry("fastai-eval-agent")


class FastAIEvalState(TypedDict):
    model_path: str
    epoch: int
    train_loss: Optional[float]
    val_loss:   Optional[float]
    error_log:  Optional[str]
    retry_count: int
    status: Literal["training", "evaluating", "failed", "passed", "escalated"]


def run_fastai_trainer(state: FastAIEvalState) -> dict:
    """
    LangGraph node — fully instrumented with OTel.
    One span per node call. Attributes are queryable dimensions in Cloud Trace.
    Full traceback captured on exception — not just str(e).
    """
    with tracer.start_as_current_span("fastai.trainer") as span:
        span.set_attribute("agent.node",         "trainer")
        span.set_attribute("model.path",          state["model_path"])
        span.set_attribute("model.epoch",         state["epoch"])
        span.set_attribute("agent.retry_count",   state["retry_count"])

        try:
            from fastai.vision.all import load_learner
            learn = load_learner(state["model_path"])
            learn.fine_tune(state["epoch"])

            train_loss = float(learn.recorder.losses[-1])
            val_loss   = float(learn.recorder.values[-1][1])

            span.set_attribute("training.train_loss", train_loss)
            span.set_attribute("training.val_loss",   val_loss)
            span.set_attribute("training.outcome",    "success")

            return {"train_loss": train_loss, "val_loss": val_loss, "error_log": None}

        except Exception as exc:
            full_trace = traceback.format_exc()
            # record_exception attaches the exception (type, message, stacktrace)
            # to the span itself. In the Cloud Trace UI, you see the failure inline
            # with the span, not in a separate log you have to correlate manually.
            span.record_exception(exc)
            span.set_attribute("training.outcome", "error")

            logger.error("Trainer node failed:\n%s", full_trace)
            return {"train_loss": None, "val_loss": None, "error_log": full_trace}


def evaluate_result(state: FastAIEvalState) -> dict:
    """
    Evaluation gate — now traced.
    val_loss threshold is strictly less than 0.15.
    val_loss == 0.15 is a FAIL. Documented on the span attribute.
    """
    VAL_LOSS_THRESHOLD = 0.15

    with tracer.start_as_current_span("fastai.evaluator") as span:
        span.set_attribute("agent.node",        "evaluator")
        span.set_attribute("eval.val_loss",     state["val_loss"] or -1.0)
        span.set_attribute("eval.threshold",    VAL_LOSS_THRESHOLD)
        span.set_attribute("agent.retry_count", state["retry_count"])

        if state["error_log"]:
            new_retry = state["retry_count"] + 1
            span.set_attribute("eval.decision", "failed_on_error")
            return {"status": "failed", "retry_count": new_retry}

        if state["val_loss"] < VAL_LOSS_THRESHOLD:  # type: ignore[operator]
            span.set_attribute("eval.decision", "passed")
            return {"status": "passed"}

        new_retry = state["retry_count"] + 1
        span.set_attribute("eval.decision",           "failed_on_threshold")
        span.set_attribute("agent.retry_count_after", new_retry)
        return {"status": "failed", "retry_count": new_retry}

What you get: Every node call traced with typed attributes. val_loss, retry_count, model.path are queryable dimensions in Cloud Trace. Failed nodes surface as error spans, not as successful spans with an error_log key nobody checks. Traces stay in your GCP project. Export to Datadog or Grafana later by changing one exporter line.
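That last sentence is worth making concrete. A minimal sketch, assuming an OpenTelemetry Collector listening on localhost:4317 (the address and the collector-in-front-of-Datadog-or-Grafana arrangement are assumptions about your deployment):

# Same SDK, same spans, same attributes. Only the destination changes.
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Google Cloud (from setup_otel.py):
#   OTLPSpanExporter(endpoint="https://telemetry.googleapis.com:443", credentials=channel_creds)

# Any OTLP-speaking backend, via a local collector:
exporter = OTLPSpanExporter(
    endpoint="http://localhost:4317",  # your collector's OTLP gRPC address
    insecure=True,                     # plaintext is acceptable for a localhost collector
)
# The collector's own config then routes traces to Datadog, Grafana Tempo, Honeycomb,
# or Cloud Trace without touching the instrumentation code.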

The Unit Tests: Verifying Your Spans Before Your Bill Does
Routing logic has no LLM in it — we covered that in Part 1. OTel span attributes are the same category: pure Python, zero LLM, must have tests. A missing model.path attribute means you can't filter traces by model in Cloud Trace. You find out when the query returns nothing at 3am, not in a test suite.

# test_otel_spans.py
# Tests verify span coverage without requiring a live GCP project.
# InMemorySpanExporter captures spans locally — no credentials, no billing.

import pytest
from unittest.mock import patch, MagicMock
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter
from opentelemetry.sdk.trace.export import SimpleSpanProcessor


def _make_test_tracer():
    """Factory: in-memory tracer for tests. No GCP project, no credentials."""
    exporter = InMemorySpanExporter()
    provider = TracerProvider()
    provider.add_span_processor(SimpleSpanProcessor(exporter))
    return provider.get_tracer("test"), exporter


def _base_state(**overrides) -> dict:
    base = {
        "model_path": "models/resnet34",
        "epoch": 3,
        "train_loss": None,
        "val_loss": None,
        "error_log": None,
        "retry_count": 0,
        "status": "training",
    }
    return {**base, **overrides}


class TestTrainerSpanAttributes:
    def test_successful_run_sets_required_attributes(self):
        """
        If model.path isn't on the span, you can't filter by model in Cloud Trace.
        If training.outcome isn't set, you can't build a success rate dashboard.
        Both caught here — not in a production debugging session.
        """
        import langgraph_instrumented as lg
        test_tracer, exporter = _make_test_tracer()

        mock_recorder = MagicMock()
        mock_recorder.losses = [0.12]
        mock_recorder.values = [[0.11, 0.13]]

        with patch.object(lg, "tracer", test_tracer), \
             patch("fastai.vision.all.load_learner") as mock_learn:
            mock_learn.return_value.recorder = mock_recorder
            lg.run_fastai_trainer(_base_state())

        spans = exporter.get_finished_spans()
        assert len(spans) == 1, "Trainer must produce exactly one span per call"

        attrs = spans[0].attributes
        assert attrs.get("model.path")        == "models/resnet34"
        assert attrs.get("agent.node")        == "trainer"
        assert attrs.get("training.outcome")  == "success"
        assert attrs.get("training.val_loss") == 0.13

    def test_failed_run_records_exception_on_span(self):
        """
        A failed node must surface as an error span in Cloud Trace.
        Without record_exception(), failure appears as a successful span
        with an error_log attribute nobody is alerting on.
        """
        import langgraph_instrumented as lg
        test_tracer, exporter = _make_test_tracer()

        with patch.object(lg, "tracer", test_tracer), \
             patch("fastai.vision.all.load_learner",
                   side_effect=FileNotFoundError("model not found")):
            result = lg.run_fastai_trainer(_base_state())

        assert result["error_log"] is not None, \
            "error_log must be populated on failure"

        spans = exporter.get_finished_spans()
        attrs = spans[0].attributes
        assert attrs.get("training.outcome") == "error", \
            "outcome must be 'error' — not 'success' with a log nobody checks"

    def test_trainer_span_does_not_expose_credentials(self):
        """
        Span attributes become searchable in Cloud Trace and queryable by
        anyone with project read access. Credentials must never appear.
        """
        import langgraph_instrumented as lg
        test_tracer, exporter = _make_test_tracer()

        mock_recorder = MagicMock()
        mock_recorder.losses = [0.10]
        mock_recorder.values = [[0.09, 0.11]]

        with patch.object(lg, "tracer", test_tracer), \
             patch("fastai.vision.all.load_learner") as mock_learn:
            mock_learn.return_value.recorder = mock_recorder
            lg.run_fastai_trainer(_base_state())

        spans = exporter.get_finished_spans()
        attrs = spans[0].attributes

        credential_markers = ("api_key", "token", "secret", "password", "credential")
        for key in attrs:
            # Substring match: catches "agent.api_key" as well as a bare "api_key".
            assert not any(marker in key.lower() for marker in credential_markers), \
                f"Credential-like attribute '{key}' found on span"

The Audit Section: What Would Kill This in Production
I ran the full security and logic audit before publishing this. Here is every failure found.
Bug 1 — Silent Auth Failure on Long-Running Agents (🔴 Fatal). This is the one nobody writes about. Using the HTTP OTLP exporter with SDK direct export means your ADC token expires after approximately 60 minutes. The exporter doesn't refresh it. Your traces stop appearing in Cloud Trace. Your pipeline keeps running. You think you have observability. The fix is the gRPC exporter with AuthMetadataPlugin — it handles token refresh automatically. First draft used HTTP. Fixed to gRPC.
Bug 2 — TracerProvider Created Per Node Call (🟠 High). Creating a TracerProvider inside each node function creates a new set of background export threads per call. In a pipeline with 3 retries × 2 nodes = 6 node calls, you've created 6 exporters. Trace volume multiplies. Cloud Trace bill multiplies. Fixed: initialize once at module level.
Bug 3 — Missing Project ID Guard (🟠 High). google.auth.default() sometimes returns None for project_id depending on how ADC is configured. Passing None to Resource.create() produces a trace with no project association — it exports successfully and disappears silently. Fixed: explicit fallback to GOOGLE_CLOUD_PROJECT environment variable with a guard that raises if both are missing.
Bug 4 — str(e) Exception Capture (🟠 High). First draft captured str(e) in error_log. For FileNotFoundError, PermissionError, and most import errors, str(e) is a single line with no stack context. traceback.format_exc() gives you the full call stack. Fixed in every node.
Bug 5 — Credentials Leak via Span Attributes (🟠 High). Span attributes in Cloud Trace are visible to everyone with project read access. A state dict that contains API keys or tokens — passed to a node that blindly sets span.set_attribute(k, v) for every state key — leaks credentials into your observability backend. Fixed: explicit attribute allowlist per node. Never for k, v in state.items(): span.set_attribute(k, v). A sketch of the allowlist pattern follows this list.
Bug 6 — SimpleSpanProcessor in Production (🟡 Med). SimpleSpanProcessor is synchronous — it blocks the calling thread until the span exports. In a LangGraph node, that means every node call waits for the network round-trip to telemetry.googleapis.com. Fixed: BatchSpanProcessor exports asynchronously in background threads.
Bug 7 — LangSmith Silently Drops Traces Without API Key (🟡 Med). If LANGCHAIN_API_KEY is unset, LangSmith's @traceable decorator does nothing and returns no error. Your function runs. Your trace disappears. The fix is an explicit guard at startup that raises if the key is missing.
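The allowlist pattern from Bug 5, roughly. A minimal sketch; ALLOWED_SPAN_ATTRIBUTES and _set_span_attributes are illustrative names rather than code from the graph above:

# Explicit allowlist: only these state keys ever become span attributes.
# Anything else in the state dict (tokens, API keys, raw prompts) stays off the span.
ALLOWED_SPAN_ATTRIBUTES = {
    "model_path":  "model.path",
    "epoch":       "model.epoch",
    "retry_count": "agent.retry_count",
}

def _set_span_attributes(span, state: dict) -> None:
    """Copy only allowlisted, low-cardinality state keys onto the span."""
    for state_key, attribute_name in ALLOWED_SPAN_ATTRIBUTES.items():
        if state.get(state_key) is not None:
            span.set_attribute(attribute_name, state[state_key])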

Pitfalls and Gotchas
The Billing Surprise on June 1. When telemetry.googleapis.com auto-activates on your existing GCP project, it doesn't start billing you — it starts accepting traces. What starts billing you is when you configure an exporter to send to it. But here's the trap: OTLP metrics billing is accounted for under the "Prometheus Samples Ingested" SKU. If you're already running Cloud Monitoring and you add metric spans to your OTel pipeline, you're adding to an existing bill line item that's easy to miss. Set a Cloud Billing budget alert before you enable metric export.
The Attribute Cardinality Trap. OpenTelemetry pricing on every backend is tied to the number of unique attribute combinations, known as cardinality. Put user_id on every span and you have created one unique series per user; put request_id on every span and it is one per request. At 10,000 users making 100 agent calls each, that is 10,000 unique series for user_id and 1 million for request_id. This is how Cloud Monitoring bills spiral. Attributes on spans should be low-cardinality: model.path, agent.node, eval.decision, not identifiers that change per request.
The LangSmith Lock-In Exit Cost. Migrating off LangSmith after six months of traces is not a data export problem — it's a re-instrumentation problem. Your @traceable decorators need to be replaced with OTel spans. Your dashboards are LangSmith-specific. Your alerting is LangSmith-specific. The exit cost is proportional to how deeply you've built on top of their UI. OTel doesn't have this problem — you change one exporter line and your data goes somewhere else.
The gcp.project_id Resource Attribute. This attribute is what tells Cloud Trace which project to store the span in. Without it, the span exports successfully and Google routes it based on the credentials alone — which may not match what you expect in multi-project setups. Always set it explicitly from project_id returned by google.auth.default().
The span.record_exception() Misconception. span.record_exception() adds an event to the span — it does not change the span's status to ERROR. If you want Cloud Trace to surface the span as failed (red in the UI, triggering error rate alerts), you also need span.set_status(Status(StatusCode.ERROR, description="...")). Without it, your failed nodes appear green.
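A minimal sketch of the corrected handler, using Status and StatusCode from the OpenTelemetry API (the span name and description string are illustrative):

from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("example")

with tracer.start_as_current_span("fastai.trainer") as span:
    try:
        raise FileNotFoundError("model not found")  # stand-in for the real training call
    except Exception as exc:
        span.record_exception(exc)                  # adds the exception event to the span
        span.set_status(Status(StatusCode.ERROR,    # marks the span itself as failed
                               "trainer node raised"))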

Recommendations
Beginner use: Start with LangSmith's @traceable. You'll have traces in 10 minutes, you'll see your agent's inputs and outputs, and you'll understand what observability actually feels like before you instrument anything manually. Set a LangSmith budget alert at $20/month so the cost doesn't surprise you as usage grows.
Production use: OpenTelemetry on Google Cloud with telemetry.googleapis.com. Use the gRPC exporter — not HTTP. Initialize the TracerProvider once at module level. Set explicit span attributes per node — never blindly mirror the state dict. Configure a Cloud Billing alert before enabling metric export. Write span attribute tests against InMemorySpanExporter before shipping — they catch cardinality problems before your billing dashboard does.
Research/prototyping use: LangSmith for the first two weeks of any new agent project, then migrate to OTel when the pipeline stabilizes. The @traceable decorator is faster than writing span code when you're still changing node structure daily. Once the graph is stable and you know which attributes you actually need, the OTel migration is a few hours — not a rewrite.

What to Try Next
Open Cloud Trace on your existing GCP project and check if telemetry.googleapis.com is already listed as an enabled API. If you created the project via Cloud Console or gcloud, it may already be active — meaning you have a live OTel endpoint waiting for your first setup_agent_telemetry() call.
Add span.set_status(Status(StatusCode.ERROR)) to your exception handlers. record_exception() adds event data but doesn't change the span color in Cloud Trace UI — your failed nodes appear green until you set the status explicitly. Two lines, instant improvement to error visibility.
Write one cardinality test before adding any new span attribute. Ask: how many unique values can this attribute have? If the answer is "one per user" or "one per request", it's high-cardinality and will compound your monitoring bill. If the answer is "one per model" or "one per decision type", it's safe. The test is a comment, not code — but writing it forces the question before you ship.

The broader move happening here is worth naming clearly: Google is converging its entire observability stack onto OpenTelemetry as the universal standard, with a fully managed, one-click pipeline for GKE called Managed OpenTelemetry that handles collector lifecycle, upgrades, and scaling automatically. For agent developers, this means the infrastructure investment in OTel instrumentation today pays dividends across every future Google Cloud service. LangSmith is faster to start. OTel is cheaper to keep.

The next post instruments the retry loop itself — tracking API cost per agent run as an OTel metric, so your Cloud Monitoring dashboard tells you what each failed training attempt actually cost before you approve another retry.

I build with Claude as my technical co-pilot — this post was researched, audited, and security-tested with AI assistance. The process of directing that analysis and verifying every code block is the work.
