<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Raluca Crisan</title>
    <description>The latest articles on DEV Community by Raluca Crisan (@rraluca07).</description>
    <link>https://dev.to/rraluca07</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3923190%2Ff6ac5cbb-6592-4cc2-9548-02456fd8970c.png</url>
      <title>DEV Community: Raluca Crisan</title>
      <link>https://dev.to/rraluca07</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/rraluca07"/>
    <language>en</language>
    <item>
      <title>3 Levels of Observability for Coding Agents</title>
      <dc:creator>Raluca Crisan</dc:creator>
      <pubDate>Sun, 10 May 2026 11:45:43 +0000</pubDate>
      <link>https://dev.to/rraluca07/3-levels-of-observability-for-coding-agents-1ce3</link>
      <guid>https://dev.to/rraluca07/3-levels-of-observability-for-coding-agents-1ce3</guid>
      <description>&lt;p&gt;In this blogpost I am exploring how a framework that translates code into a graph fits within the observability stack. Intuitively, something that helps decompose a pipeline code into its corresponding elements should help - it should help coding agent with testing  &amp;amp; verification, optimized debugging, auditability. It should help support a host of coding agent architectures, especially with longer-term horizons. But this blogpost is trying to explore less of what it can do, and more of where this framework fits in.&lt;br&gt;
First, as a quick reminder: the framework I’m exploring (etiq) maps your code and traces artifacts and their lineage deterministically and without manual instrumentation (a bit like extreme auto-logging) and it works for data and AI pipelines. It does so on a mix of static analysis and run-time execution.&lt;br&gt;
Second, the observability stack for agent derived code is a bit hard to pin down but it can roughly fit three buckets: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Agent ‘orchestration’ - state/memory store &lt;/li&gt;
&lt;li&gt;Telemetry &lt;/li&gt;
&lt;li&gt;Anything that helps you assess what actually happens in the code &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The diagram below represents a high-level view of a coding agent structure - or at least the main idea. In a coding agent, an orchestrator manages the task, asks the LLM what to do next, invokes tools, runs code in an isolated environment, and formats and checks the results before returning outputs. At a high level it uses something that can be described as a plan-act-verify loop, with complexity increasing depending on the agent.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw5zm9zkaqyl61awyoi3r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw5zm9zkaqyl61awyoi3r.png" alt=" " width="512" height="230"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Translated into our three buckets, it looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvss3cnrqlbvmcdpdsupi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvss3cnrqlbvmcdpdsupi.png" alt=" " width="512" height="402"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Light green = state / memory store &amp;amp;, separately, artifact store&lt;br&gt;
Light blue = OpenTelemetry&lt;br&gt;
Pink = QA &amp;amp; test / grading record&lt;/p&gt;

&lt;p&gt;The first bucket - light green on the diagram - helps provide the agent’s context. That context is essential for spotting potential issues, because it shows the shape of the run and what was intended: why a patch was made, whether the agent originally intended to modify one file before branching into a different fix, and so on. This bucket captures what the system believed it was doing, together with the artifact store: the final outputs produced by an end-to-end run.&lt;/p&gt;

&lt;p&gt;The second bucket, the light blue one, is the runtime execution capture via OpenTelemetry. This layer captures traces, metrics, and logs, which in a coding-agent system can include model and tool-call spans, subprocess execution, HTTP and database activity, timings, statuses, exit codes, service-to-service requests, and logs and metrics surrounding the run.&lt;br&gt;
Runtime telemetry provides evidence that does not depend on whether the agent was honest, accurate, or even aware of what happened. The process either ran or it did not; the HTTP request either happened or it did not. OpenTelemetry shows what the platform observed rather than what the agent claimed. It can answer questions such as whether the model call happened, whether the patch step executed, whether the script ran, where latency occurred, and which retry loop consumed most of the time. &lt;/p&gt;
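
&lt;p&gt;To make the “evidence, not claims” point concrete, here is a toy sketch in plain Python (deliberately not the OpenTelemetry SDK) of the kind of span record runtime telemetry produces for a subprocess step: a name, a duration, an exit code, and a status that holds regardless of what the agent reports. The function and field names are illustrative.&lt;/p&gt;

```python
import subprocess
import sys
import time

def run_with_span(name, cmd):
    """Run a command and record a telemetry-style span for it.

    Returns a dict mimicking the fields an OpenTelemetry span would carry:
    step name, wall-clock duration, exit code, and a status derived from
    what actually happened rather than from the agent's own account.
    """
    start = time.time()
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return {
        "name": name,
        "duration_s": time.time() - start,
        "exit_code": proc.returncode,
        "status": "OK" if proc.returncode == 0 else "ERROR",
    }

# The process either ran or it did not; the span records which.
span = run_with_span("patch_step", [sys.executable, "-c", "print('patched')"])
print(span["status"], span["exit_code"])  # prints: OK 0
```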

&lt;p&gt;The third bucket - the pink one - looks in more detail at what happens in the code the agent produced in this run. It covers code logic, unit tests, static analysis, and vulnerability detection. And with the Etiq framework it can provide in-depth observability on the executed code beyond what OpenTelemetry offers. Say this is an agent that creates workflows based on various data feeds. At some point it calls an LLM, but prior to that call it performs 10 steps that are purely data processing; once the LLM returns an answer, the result gets joined with another data source and the pipeline keeps going. The green bucket would give us the agent’s intention in writing this code and, hopefully, a coherent plan; the blue telemetry bucket would capture the API calls to the LLM and to the initial data sources and would associate the full code with them. But there is no way to log the 10 interim steps in an observability framework, short of instructing the agent itself to capture the artifacts and associate them with the appropriate functions. Semantic search has no direct link to the produced interim artifacts. This is where a framework like Etiq comes in: it can log granular interim artifact/function pairs and their lineage. &lt;/p&gt;
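
&lt;p&gt;A toy illustration of the idea of capturing interim artifact/function pairs and lineage - this is not the Etiq API, just a minimal decorator-based sketch with hypothetical step names, to show what “deterministic, no self-reporting” capture of interim steps means:&lt;/p&gt;

```python
import hashlib
import json

ARTIFACT_LOG = []  # one record per executed step: producer, artifact, lineage

def _fingerprint(obj):
    """Deterministic short hash of a JSON-serializable interim artifact."""
    blob = json.dumps(obj, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:12]

def traced(fn):
    """Record each function's output artifact and its input lineage.

    A toy stand-in for deterministic instrumentation: every call appends a
    (producer function, artifact hash, parent artifact hashes) record, so
    interim steps become observable without asking the agent to self-report.
    """
    def wrapper(*args):
        out = fn(*args)
        ARTIFACT_LOG.append({
            "producer": fn.__name__,
            "artifact": _fingerprint(out),
            "inputs": [_fingerprint(a) for a in args],
        })
        return out
    return wrapper

@traced
def load_rows():
    return [{"id": 1, "spend": 10}, {"id": 2, "spend": 0}]

@traced
def drop_zero_spend(rows):
    return [r for r in rows if r["spend"] != 0]

@traced
def total_spend(rows):
    return sum(r["spend"] for r in rows)

result = total_spend(drop_zero_spend(load_rows()))
print(result)  # prints 10; ARTIFACT_LOG now links each artifact to its producer
```

&lt;p&gt;Because each record’s inputs are the fingerprints of earlier outputs, the log doubles as a lineage graph: you can walk from the final number back to the exact function and interim artifact that produced it.&lt;/p&gt;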

&lt;p&gt;Take a very simple example: a code generation agent with the following structure: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvulo6t9mdrc3eye0d4x9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvulo6t9mdrc3eye0d4x9.png" alt=" " width="800" height="370"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The orchestration layer would capture details on each of the agent’s nodes (the below is just for example purposes):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F69jhk0scqnav8bkjng4j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F69jhk0scqnav8bkjng4j.png" alt=" " width="800" height="545"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The OpenTelemetry logging would capture information such as the below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjnsarolgr1kh1adb8il6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjnsarolgr1kh1adb8il6.png" alt=" " width="800" height="191"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And Etiq would log the detail of what actually ran during the code execution for the given run:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzl0qzesyuwm71pyv73ld.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzl0qzesyuwm71pyv73ld.png" alt=" " width="800" height="511"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The information produced via the Etiq framework serves a few different purposes: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It captures interim artifact/function pairs, allowing verification, test harnesses, and checks on them - this enables the kind of granular testing that data and AI pipelines need&lt;/li&gt;
&lt;li&gt;It optimizes debugging, as it can point to the exact function that is producing the wrong interim step&lt;/li&gt;
&lt;li&gt;It provides a level of auditability that OpenTelemetry, agent orchestration, or end-artifact capture cannot, because it traces the lineage of data through the pipeline &lt;/li&gt;
&lt;/ul&gt;
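
&lt;p&gt;The debugging benefit can be sketched concretely: once interim artifact/function pairs have been captured, a small check harness can point at the exact producer function whose output first violates an expectation. The records, step names, and checks below are all hypothetical:&lt;/p&gt;

```python
# Hypothetical (producer function, interim artifact) records, standing in for
# what a deterministic tracing framework would capture during a single run.
captured = [
    {"producer": "load_rows", "artifact": [{"id": 1, "age": 34}, {"id": 2, "age": -5}]},
    {"producer": "clean_ages", "artifact": [{"id": 1, "age": 34}, {"id": 2, "age": -5}]},
    {"producer": "mean_age", "artifact": 14.5},
]

def no_negative_ages(artifact):
    """Row-level expectation: cleaned data should contain no negative ages."""
    return all(row["age"] >= 0 for row in artifact)

# Per-step expectations: the raw load may be messy, the cleaned output must not be.
expectations = {"clean_ages": no_negative_ages}

def first_failing_step(records, expectations):
    """Return the first producer whose interim artifact violates its check."""
    for rec in records:
        check = expectations.get(rec["producer"])
        if check is not None and not check(rec["artifact"]):
            return rec["producer"]
    return None

failing = first_failing_step(captured, expectations)
print(failing)  # prints clean_ages: the wrong interim step is localized exactly
```

&lt;p&gt;Without the interim artifacts, the same bug would only surface (if at all) in the final number, with no pointer back to the function that caused it.&lt;/p&gt;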

&lt;p&gt;Fundamentally, it is great that we are able to observe what the system is trying to do and what it stores at the end as code or output artifacts, and it is equally important that we can capture the API calls and tool calls to data sources, LLMs, the sandbox in which the code runs, and so on. But there is currently a gap when it comes to observing the executed code the system produces. The solution to this gap is an observability framework beyond what we currently have in the space: one that can trace the interim artifacts produced by the code and their producer functions, and map their relationships, so they can be tested, debugged, and audited. &lt;/p&gt;

</description>
      <category>ai</category>
      <category>observability</category>
      <category>agents</category>
      <category>opentelemetry</category>
    </item>
    <item>
      <title>The observability gap for data science and analytics agents</title>
      <dc:creator>Raluca Crisan</dc:creator>
      <pubDate>Sun, 10 May 2026 11:05:31 +0000</pubDate>
      <link>https://dev.to/rraluca07/the-observability-gap-for-data-science-and-analytics-agents-3cnd</link>
      <guid>https://dev.to/rraluca07/the-observability-gap-for-data-science-and-analytics-agents-3cnd</guid>
      <description>&lt;p&gt;Databricks and similar enterprise data platforms have spent a great deal of effort and time to full-proof their product suite with relevant observability and tracing. Not surprisingly this is needed as part of enterprise support especially in regulated sectors. But for the specific case of sophisticated data science and analytics agents there is a gap in the observability suite not just for Databricks but across all big and small analytics and data science agent providers.&lt;br&gt;
In the case of Databricks, even with notebooks as a primary user interface, given the offerings across data lineage, data management and MLflow, the level of control and tracing is no doubt high. However both large vendors like Databricks and Snowflake and smaller analytics and data science agents suppliers share an observability gap. The gap is inherent to coding agent architectures and does not apply equally to all agents. A text-to-SQL assistant can be wrong in an ‘obvious’ way: the result makes no sense. A multi-step python or spark pipeline produced by an agent is different. Even when made by a human, it’s hard to unpick pipeline logic given endless combinations of joins, data issues, data characteristics. This problem doesn’t go away when an agent is involved. E.g. Genie can plan a solution,run code, use cell outputs to improve results, and fix errors automatically. The question is what beyond the initial reasoning and the final artifact can be inspected in this instance and what can be reliably/not-probabilistically logged. &lt;br&gt;
To achieve their objectives, these more sophisticated data science and analytics agents need to create relatively complex multi-step pipelines. Past the initial data retrieval and the final storage step, the pipelines themselves are just arbitrary code. Observability for this type of scripts when they are man-made span a whole area of companies in the MLOps space including Databricks’ own Mlflow. But it is unclear what observability is out there when this code is produced by agents - short of asking the agent itself to instrument the code (probabilistically), thus somewhat defeating the purpose of observability in the first place. &lt;/p&gt;

&lt;p&gt;Now that we’ve narrowed the observability gap from the bigger data platform context to a specific area - the ‘executed pipeline code’ part of these more sophisticated analytics and data science agent workflows - my first question was whether MLflow or a different ‘off-the-shelf’ tool in the ecosystem can fill this gap directly. For why OpenTelemetry is not enough here, please see the previous blogpost.&lt;/p&gt;

&lt;p&gt;Unsurprisingly, MLflow is heading in the direction of more granular instrumentation with the least amount of effort - on anyone’s part, human or agent. For classic ML, a single mlflow.autolog() call can automatically capture params, metrics, models, datasets, and artifacts around supported training APIs, while for GenAI and agent workflows, one-line tracing primitives like @mlflow.trace, mlflow.trace(...), and mlflow.start_span() add function- and block-level visibility, including parent-child relationships, inputs, outputs, exceptions, and execution time. &lt;/p&gt;

&lt;p&gt;My initial experiments with instrumenting agent-created code with MLflow deterministically allowed me to track the models as experiments, which was a good step in the right direction 👍, but of course I cannot track the data transformations - with MLflow or with anything else I’m familiar with. &lt;br&gt;
Tracking with autolog was the better option for me - rather than the tracing functions - because I’m not really tracking the agent; I’m trying to track what happens in the code produced by the agent when it runs. Below is some basic example tracking:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg5x6sk8daei5jef0jrng.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg5x6sk8daei5jef0jrng.png" alt=" " width="800" height="514"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The gap, of course, is tracking what actually happens inside the pipeline outside the model itself: all the data operations for which no observability is present. While the code itself is the best evidence in other use cases, for pipeline-type structures where outcomes are heavily influenced by the particulars of the data, the code is not enough - observability on both the code and its runtime execution is needed. For these data science and analytics agents, the code they produce (outside the model itself) is currently a black box. Below is an example table of interim artifacts, which tooling like MLflow does not currently capture for agent-written code.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F69bbs07fybmr2c7vco5q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F69bbs07fybmr2c7vco5q.png" alt=" " width="512" height="329"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this space we were brainwashed to believe that observability matters at all costs; however, given the perception of coding agents in the market, I feel an argument might still have to be made for why it really matters in this instance. &lt;br&gt;
First, it’s about auditability. Truly, not everyone cares about this, and not everyone should. But in regulated sectors like finance or healthcare it matters. For model validation in, say, finance, the type of data lineage documentation required involves more than what gets stored in Unity Catalog, Delta Lake, or MLflow model tracking - all useful components. This type of use case needs to reflect the transformations that happen in the code itself once executed, and teams currently do this manually. At the moment, the use of semiautonomous coding agents for these use cases is minimal, but this is not where the enterprise stack is going.&lt;/p&gt;

&lt;p&gt;Second, observability for these more sophisticated agents bears on other related risks, such as reproducibility, error propagation across longer pipelines, and general control issues for agent-generated code. &lt;br&gt;
Without observability, it is harder to track ‘semantic mistakes’ the agent might make, such as not using the correct metric definition, or applying the analysis or model to the wrong population. A bad transformation early in the pipeline affects everything downstream. I’m not sure exactly what level of observability is needed to mitigate these potential issues, but without any we would certainly struggle. &lt;br&gt;
Reproducibility is another area that requires some level of observability: if transformation execution is not observable, the final notebook may not be a faithful record of the run that produced the result. Similarly, we would struggle to compare agent runs over time (or rather, without observability we would struggle more).&lt;br&gt;
The key argument for in-depth observability on agent-generated code is enterprise-level control, especially for regulated sectors. Usage of these sophisticated data science and analytics agents in regulated sectors might be small to begin with, relative to the size of the overall data platform offering. However, as Databricks and other large enterprise data platforms feel the pressure from coding agents and foundation models, there just aren’t that many avenues left to go into. If Databricks’ long-term position is around providing the governed system in which semiautonomous enterprise agents can actually run, then any observability gap will prove problematic. &lt;/p&gt;
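
&lt;p&gt;The ‘semantic mistake’ and error-propagation risks can be made concrete with a toy sketch (the data, population, and metric here are entirely hypothetical):&lt;/p&gt;

```python
# Intended analysis: churn rate for EU customers. The agent's generated code
# filters the wrong population early on; every later step still runs cleanly.
customers = [
    {"id": 1, "region": "EU", "churned": True},
    {"id": 2, "region": "EU", "churned": False},
    {"id": 3, "region": "US", "churned": True},
]

# Semantic mistake: keeps US rows where EU was intended.
population = [c for c in customers if c["region"] == "US"]

# The downstream metric computes without error, just on the wrong population.
churn_rate = sum(1 for c in population if c["churned"]) / len(population)
print(churn_rate)  # prints 1.0; the intended EU churn rate is 0.5

# A check on the captured interim artifact flags the filter step itself,
# instead of leaving a plausible-looking but wrong final number.
population_ok = all(c["region"] == "EU" for c in population)
print(population_ok)  # prints False
```

&lt;p&gt;Nothing in the run fails, and the final number looks plausible - only inspecting the interim artifact reveals that the wrong population was selected at the very first transformation.&lt;/p&gt;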

</description>
      <category>ai</category>
      <category>observability</category>
      <category>machinelearning</category>
      <category>agents</category>
    </item>
  </channel>
</rss>
