In this blogpost I explore how a framework that translates code into a graph fits within the observability stack. Intuitively, something that decomposes pipeline code into its constituent elements should help: it should give a coding agent better testing and verification, faster debugging, and auditability. It should also support a host of coding-agent architectures, especially those working over longer time horizons. But this blogpost is trying to explore less of what such a framework can do, and more of where it fits in.
First, as a quick reminder: the framework I’m exploring (Etiq) maps your code and traces artifacts and their lineage deterministically, without manual instrumentation (a bit like extreme auto-logging), and it works for data and AI pipelines. It does this through a mix of static analysis and runtime execution.
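As a rough mental model (this is my own illustration, not Etiq's actual API or output format), the result is a graph pairing each function with the artifacts it consumed and produced:

```python
# Hypothetical illustration only - not Etiq's actual API or output format.
# The idea: each function in the pipeline becomes a node, paired with the
# artifacts it consumed and produced, and edges carry the lineage.
pipeline_graph = {
    "nodes": [
        {"fn": "load_feed",   "produces": ["raw_df"]},
        {"fn": "clean_rows",  "consumes": ["raw_df"],   "produces": ["clean_df"]},
        {"fn": "call_llm",    "consumes": ["clean_df"], "produces": ["llm_answer"]},
        {"fn": "join_source", "consumes": ["llm_answer", "other_df"], "produces": ["final_df"]},
    ],
    "edges": [  # producer -> consumer
        ("load_feed", "clean_rows"),
        ("clean_rows", "call_llm"),
        ("call_llm", "join_source"),
    ],
}
```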
Second, the observability stack for agent-derived code is a bit hard to pin down, but it roughly fits into three buckets:
- Agent ‘orchestration’ - state/memory store
- Telemetry
- Anything that helps you assess what actually happens in the code
The diagram below represents a high-level view of a coding agent's structure - or at least the main idea. In a coding agent, an orchestrator manages the task, asks the LLM what to do next, invokes tools, runs code in an isolated environment, and formats and checks the results before returning outputs. At a high level it follows something that can be described as a plan - act - verify loop, with complexity increasing depending on the agent.
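In code, that loop has roughly this shape - a minimal sketch, where `llm`, `tools`, and `verify` are illustrative stand-ins rather than any specific framework:

```python
# Minimal plan - act - verify loop. All names here are illustrative.
def run_task(task, llm, tools, verify, max_steps=10):
    state = {"task": task, "history": []}
    for _ in range(max_steps):
        plan = llm.next_action(state)                 # plan: ask the LLM what to do next
        result = tools[plan["tool"]](**plan["args"])  # act: invoke a tool / run code in a sandbox
        state["history"].append((plan, result))       # record the step in the state store
        if verify(result, task):                      # verify: check outputs before returning
            return result
    raise RuntimeError("step budget exhausted before the task was verified")
```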
Translated into our three buckets, we have the following:
Light green = state / memory store & separately artifact store
Light blue = OpenTelemetry
Pink = QA & test/ grading record
The first bucket - light green on the diagram - helps provide the agent context. That context is essential for spotting potential issues, because it shows the shape of the run and what was intended: why was a patch made, did the agent originally intend to modify one file before branching into a different fix, and so on. This bucket provides what the system believed it was doing, along with the end artifact store: the final outputs produced by an end-to-end run.
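One entry in such a store might look something like this (a hypothetical schema, just to make the idea concrete):

```python
# Hypothetical state-store entry: "what the system believed it was doing".
state_entry = {
    "run_id": "run-042",
    "step": 3,
    "intent": "fix failing unit test in parser.py",
    "plan": ["reproduce failure", "patch tokenizer", "re-run tests"],
    "files_targeted": ["parser.py"],
    "note": "branched into a tokenizer fix after reproducing the error",
}
```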
The second bucket, the light blue one, is runtime execution capture via OpenTelemetry. This layer captures traces, metrics, and logs, which in a coding-agent system can include model and tool-call spans, subprocess execution, HTTP and database activity, timings, statuses, exit codes, service-to-service requests, and the logs and metrics surrounding the run.
Runtime telemetry provides evidence that does not depend on whether the agent was honest, accurate, or even aware of what happened. The process either ran or it did not; the HTTP request either happened or it did not. OpenTelemetry shows what the platform observed rather than what the agent claimed. It can answer questions such as whether the model call happened, whether the patch step executed, whether the script ran, if/where latency occurred, and which retry loop consumed most of the time.
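Instrumenting a step is straightforward with the OpenTelemetry Python API; here is a small sketch, where `run_patch_step` is a hypothetical helper:

```python
from opentelemetry import trace

tracer = trace.get_tracer("coding-agent")

# Each significant step becomes a span: evidence that something actually ran,
# how long it took, and with what status - independent of what the agent claims.
with tracer.start_as_current_span("apply_patch") as span:
    span.set_attribute("file", "parser.py")
    exit_code = run_patch_step()  # hypothetical helper, assumed defined elsewhere
    span.set_attribute("exit_code", exit_code)
```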
The third bucket - the pink one - looks in more detail at what happens in the code the agent produced in this run. It covers code logic, unit tests, static analysis, and vulnerability capture, and with the Etiq framework it adds in-depth observability on the executed code beyond what OpenTelemetry provides. Say this is an agent that creates workflows based on various data feeds. At some point it calls an LLM, but before that call it performs ten steps of pure data processing; once the LLM returns an answer, that answer is joined with another data source and the pipeline continues. The green bucket gives us the agent's intention in writing this code and, hopefully, a coherent plan; the blue telemetry bucket captures the API calls to the LLM and to the initial data source, and associates the full code with them. But for the ten interim steps there is no way to log them in an observability framework short of instructing the agent itself to capture the artifacts and associate each with the appropriate function. Semantic search has no direct link to the interim artifacts produced. This is where a framework like Etiq comes in: it can log the granular interim artifact/function pairs and their lineage.
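A sketch of that kind of pipeline, assuming pandas and a hypothetical `call_llm` helper - telemetry sees the fetch and the LLM call, but not the interim dataframes in between:

```python
import pandas as pd

# Telemetry captures the HTTP fetch and the LLM call; the interim dataframes
# produced between them are invisible to it unless something pairs each
# function with its artifacts. `call_llm` is a hypothetical helper.
def build_workflow(feed_url: str, other: pd.DataFrame) -> pd.DataFrame:
    df = pd.read_json(feed_url)                   # visible: HTTP activity
    df = df.dropna(subset=["value"])              # interim step 1
    df = df[df["value"] > 0]                      # interim step 2
    df["value"] = df["value"].rolling(7).mean()   # interim step 3 ... and so on
    answer = call_llm(df.describe().to_json())    # visible: LLM API call
    return df.merge(other, on="key").assign(llm_label=answer)  # post-LLM join
```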
Take the case of a very simple example code-generation agent with the following structure:
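For concreteness, picture a linear set of nodes along these lines (names are purely illustrative):

```python
# Illustrative node sequence for the example agent.
AGENT_NODES = ["plan", "generate_code", "execute_in_sandbox", "verify_output", "report"]
```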
The orchestration would capture details on each of the agent's nodes - below, just for example purposes:
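Something like this, with a hypothetical schema:

```python
# Hypothetical orchestration records: intent and decisions per node.
orchestration_log = [
    {"node": "plan", "input": "add retry logic to the fetcher", "output": ["edit fetcher.py", "add a test"]},
    {"node": "generate_code", "model": "llm-x", "files_written": ["fetcher.py"]},
    {"node": "verify_output", "tests_run": 12, "tests_passed": 12, "decision": "done"},
]
```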
The OpenTelemetry logging would capture information along the lines of the below:
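Rendered in simplified form (real spans carry much more metadata):

```python
# Simplified rendering of OpenTelemetry spans for the same run.
otel_spans = [
    {"name": "llm.call",          "duration_ms": 1840, "status": "OK", "attributes": {"model": "llm-x"}},
    {"name": "subprocess.pytest", "duration_ms": 3200, "status": "OK", "attributes": {"exit_code": 0}},
    {"name": "http.get",          "duration_ms": 95,   "status": "OK", "attributes": {"url": "https://api.example.com/feed"}},
]
```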
And Etiq would log the details of what actually ran during the code execution for the given run:
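Conceptually, the record looks like the sketch below - my own illustration, not Etiq's actual output format - pairing each executed function with the interim artifacts it read and wrote:

```python
# Hypothetical lineage log - not Etiq's actual output format.
lineage_log = [
    {"fn": "load_feed",   "read": [],                       "wrote": {"raw_df":   {"rows": 10432, "cols": 6}}},
    {"fn": "clean_rows",  "read": ["raw_df"],               "wrote": {"clean_df": {"rows": 9871,  "cols": 6}}},
    {"fn": "join_source", "read": ["clean_df", "other_df"], "wrote": {"final_df": {"rows": 9871,  "cols": 9}}},
]
```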
The information produced via the Etiq framework serves a few different purposes:
- It captures interim artifact/function pairs, allowing verification, test harnesses, and checks on them - this enables the kind of granular testing that data and AI pipelines need (see the sketch after this list)
- It optimizes debugging, as it can point to the exact function that is producing the wrong interim step
- It provides a level of auditability that OpenTelemetry, agent orchestration, and end-artifact capture cannot, because it traces the lineage of data through the pipeline
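As a sketch of the first two points, here is a check bound to a specific producer function, using the hypothetical `lineage_log` from above:

```python
# Look up the recorded row count of an artifact in the lineage log.
def rows(log, artifact):
    for step in log:
        if artifact in step.get("wrote", {}):
            return step["wrote"][artifact]["rows"]
    raise KeyError(artifact)

# If the join multiplied rows, this check points straight at join_source.
def check_join_did_not_explode(log, max_ratio=1.1):
    assert rows(log, "final_df") <= max_ratio * rows(log, "clean_df"), \
        "row explosion in join_source"

check_join_did_not_explode(lineage_log)
```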
Fundamentally, it is great that we can observe what the system is trying to do and what it stores at the end as code or output artifacts, and it is equally important that we can capture the API and tool calls to the data sources, the LLMs, the sandbox in which the code runs, and so on. But there is currently a gap when it comes to observing the executed code the system produces. The solution to this gap is an observability framework beyond what we currently have in the space: one that can trace the interim artifacts produced by the code and their producer functions, and map their relationships, so they can be tested, debugged, and audited.