
Raluca Crisan

Turning Observability into a Tunable Search Space

In the MLOps world, people have long used DAGs/graphs, or at least the consensus has been that using them is best practice. With AI and agents, the types of graphs used for orchestration or instrumentation are more varied, but they share the same core idea: capturing and tracking the artifacts produced by a pipeline, along with their parent and child relationships. The reason for this is intuitive: a common pattern across data, ML, and agent pipelines is a sequence of steps that can be represented, stored, and discovered through a graph structure. Tracking, observing, and optimizing this sequence broadly supports monitoring, reproducibility, orchestration, backfills, retraining, and related workflows.
Etiq is a tool that creates a lineage and captures artifacts (a bit like a DAG or similar graph) from executed code. The question I’m addressing is what impact such a tool would have on different types of coding agents and coding agent architectures, given how tailor-made it seems for these types of agents.

Experiment Description

This first quick project looks at incorporating AutoML-style tuning into a coding agent whose main task is to solve various data-science challenges. Everything but the Etiq tool is more or less vibe-coded, and the agent repo is not designed to be used; it just illustrates a point and supports some quick hypothesis testing.

The main flow of the agent is:

  • It is given a data-science-style task
  • A code generator writes and runs a script to answer the task until it finds a configuration that runs
  • Etiq tracks the artifacts and lineage (graph) of this baseline script, including the intermediate data objects, the model object, and the flow between them
  • A small, controlled search space is derived from the script’s configuration and passed to SMAC (a well-known AutoML tuner) for optimization
  • Each SMAC trial reruns the same script with a different configuration, captures the resulting metric, and stores the full attempt record (a rough sketch of this loop follows below)
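
As a rough illustration of that last step, here is a minimal sketch assuming the SMAC3 2.x API; `run_pipeline_script` and the knob names/bounds are hypothetical stand-ins, not the actual agent code:

```python
# Illustrative sketch only: run_pipeline_script and the search-space bounds are
# hypothetical stand-ins for the configuration derived from the Etiq lineage.
from ConfigSpace import ConfigurationSpace, Float, Integer
from smac import HyperparameterOptimizationFacade, Scenario

cs = ConfigurationSpace()
cs.add_hyperparameters([
    Integer("n_estimators", (50, 500), default=200),
    Float("learning_rate", (0.01, 0.3), default=0.1, log=True),
])

def run_pipeline_script(config: dict, seed: int = 0) -> float:
    # Placeholder: the real agent reruns the generated script with `config`
    # and reads the validation score from the captured artifacts.
    return 0.5

def target(config, seed: int = 0) -> float:
    # Rerun the same script with this configuration and capture the metric;
    # SMAC minimizes, so return 1 - score.
    score = run_pipeline_script(dict(config), seed=seed)
    return 1.0 - score

scenario = Scenario(cs, deterministic=True, n_trials=50)
smac = HyperparameterOptimizationFacade(scenario, target)
incumbent = smac.optimize()  # best configuration found across the trials
```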

The current system turns the DAG into a tunable space via a number of rules: only include executed nodes, block certain node classes, inspect only literal call arguments when deciding the type of tuning, drop low-impact knobs, convert supported literals into bounded SMAC variables, and optionally add safe remove-node controls for simple preprocessing assignments. But really, this is the complex and rather arbitrary part of the process.
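
To make the rule set more concrete, here is a rough, hypothetical sketch of the node-to-knob conversion; the node structure, blocklist, and bounds are illustrative assumptions rather than Etiq internals:

```python
# Hypothetical sketch of the node-to-knob rules described above. The node
# structure, blocklist, and bounds are illustrative assumptions, not Etiq internals.
from ConfigSpace import ConfigurationSpace, Float, Integer

BLOCKED_CLASSES = {"DataLoader", "TrainTestSplit"}   # assumed blocked node classes
LOW_IMPACT = {"random_state", "verbose", "n_jobs"}   # assumed low-impact knobs

def dag_to_space(executed_nodes: list[dict]) -> ConfigurationSpace:
    cs = ConfigurationSpace()
    for node in executed_nodes:                            # rule: executed nodes only
        if node["class"] in BLOCKED_CLASSES:               # rule: block certain classes
            continue
        for name, value in node["literal_args"].items():   # rule: literal call args only
            if name in LOW_IMPACT or isinstance(value, bool) or value is None:
                continue                                   # rule: drop low-impact knobs
            knob = f'{node["id"]}.{name}'
            if isinstance(value, int) and value > 0:       # bounded range around the literal
                cs.add_hyperparameters([Integer(knob, (max(1, value // 4), value * 4), default=value)])
            elif isinstance(value, float) and value > 0:
                cs.add_hyperparameters([Float(knob, (value / 4, value * 4), default=value)])
    return cs
```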

Also, in this implementation the lineage edges are not yet used to create dependency constraints between knobs. The current implementation is therefore better described as executed-node-to-source-control tuning than as true DAG-topology optimization.
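
For illustration, the kind of dependency constraint not yet implemented could be expressed with ConfigSpace conditions; the knob names here are invented:

```python
# Hypothetical example of a knob dependency a lineage edge could justify:
# only tune the imputation strategy if the upstream "drop missing rows" step
# is switched off. Knob names are invented for illustration.
from ConfigSpace import Categorical, ConfigurationSpace, EqualsCondition

cs = ConfigurationSpace()
drop_missing = Categorical("dropna.enabled", ["on", "off"], default="on")
imputer = Categorical("imputer.strategy", ["mean", "median", "most_frequent"], default="mean")
cs.add_hyperparameters([drop_missing, imputer])

# The lineage edge dropna -> imputer would motivate this condition:
cs.add_condition(EqualsCondition(imputer, drop_missing, "off"))
```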

Metrics and benchmark

The hardest part here was finding what to compare against. The real question is whether this set-up helps with something: performance, time to best result, or cost in terms of LLM API calls. But what is a fair comparison point?

The starting point is always an executable script produced by an LLM tasked with addressing the given data-science problem. After this initial step, four different approaches were explored to see whether the impact of using Etiq/a DAG could be isolated. The tasks themselves were adapted from MLE-Bench: only five of them, and for structured data only. Well-performing solutions to the tasks are short scripts (no more than a few hundred lines of code each).

While it quickly became apparent that this type of comparison is fraught, some lessons were learned (by me).

Because the No-DAG + SMAC baseline also needed a tunable space, a kind of ad-hoc space was produced through an AST-parsing-plus-rules combo. The implementation and the idea were quite half-baked, and although it ran on a few of the tasks, it was problematic. What SMAC truly optimizes in that instance is the model. When the whole-pipeline DAG pretense was dropped from the approach and SMAC only optimized the model, it all made a lot more sense. In both cases the DAG + SMAC approach outperformed the No-DAG + SMAC one, in the second instance because it optimized the data prep as well as the model (and, as we all know, data matters!). The difference was not large, which is a recurring trend and which also made sense given the small tasks/pipelines the comparison was run on.
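
The ad-hoc space for the No-DAG baseline came from something along these lines; this is a simplified, hypothetical reconstruction of that AST-plus-rules combo, not the exact code:

```python
# Simplified, hypothetical sketch of the AST + rules approach: scan the script
# for keyword arguments passed as numeric literals to any call.
import ast

SCRIPT = """
model = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, random_state=42)
"""

SKIP = {"random_state", "verbose", "n_jobs"}  # assumed low-impact knobs

def literal_kwargs(source: str) -> dict:
    knobs = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            for kw in node.keywords:
                if (kw.arg and kw.arg not in SKIP
                        and isinstance(kw.value, ast.Constant)
                        and isinstance(kw.value.value, (int, float))
                        and not isinstance(kw.value.value, bool)):
                    knobs[kw.arg] = kw.value.value
    return knobs

print(literal_kwargs(SCRIPT))  # {'n_estimators': 200, 'learning_rate': 0.1}
```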

The harder but more interesting lesson, which made me think a bit more about the logic behind what I’m trying, was that free LLM search usually outperforms everything else (or one cannot tell the difference). Again, when the LLM search was constrained by an ad-hoc, made-up search space built using parsing and arbitrary rules (to make the comparison seem more ‘fair’), the no-DAG LLM search also failed or slightly underperformed the DAG version. But when the search was completely free, the LLM-only (no DAG) approach did outperform.

Why having a DAG can help, and in which instances

When thinking about it a bit harder, this finding kind of made sense.
There are a few potential benefits of the DAG + LLM search approach vs. the free LLM search approach:

  • Lower cost via localization (in theory the DAG acts as a kind of context compression, and I would emphasize ‘in theory’ here)
  • Better search/higher overall performance

Cost is a trickier story, but generally the second benefit only really shows up once the pipeline is large enough relative to the context window. For smaller pipelines/scripts, like the ones produced to answer the benchmark used here, it really doesn’t matter; if anything, it makes things worse. At the very beginning, when the pipeline is still small, the model benefits from seeing the whole design end to end. A DAG can potentially start to become useful once the pipeline has stabilized into recognizable stages, is large enough, and/or most changes become local. At that point, the DAG could help because it externalizes structure that the no-DAG agent would otherwise have to rediscover from code again and again.
Before concluding, it is worth making a quick detour to see if this DAG-based idea appears in other ‘nearby’ areas.
First, semantic search seems to me the closest comparison to a DAG-based approach, because both aim to avoid resending the full script on every iteration. However, one localizes context by similarity, while a DAG localizes context by explicit dependency structure. For example, a DAG can show which pipeline stage feeds another, what is upstream or downstream, and which artifacts connect different components. In data-science and ML-type pipelines, the important code may matter because of execution order, dataflow, or artifact dependencies, not because it looks textually similar to the request.
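
To illustrate the difference, a lineage graph makes upstream/downstream questions trivial to answer; here is a toy example with networkx, where the node names are made up:

```python
# Toy lineage graph: which artifacts sit upstream/downstream of a given stage?
# Node names are invented for illustration.
import networkx as nx

dag = nx.DiGraph()
dag.add_edges_from([
    ("raw_csv", "clean_df"),
    ("clean_df", "train_df"),
    ("clean_df", "test_df"),
    ("train_df", "model"),
    ("model", "predictions"),
    ("test_df", "predictions"),
])

print(nx.ancestors(dag, "model"))       # everything the model depends on
print(nx.descendants(dag, "clean_df"))  # everything affected by a change to clean_df
```
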
Second, looking at it from the coding agent angle, most coding agents do not natively extract a pipeline DAG and use it to guide local rewrites. Aider is the closest mainstream example, but it uses a repository graph rather than a true pipeline or dataflow DAG. And tools like Cline, Roo Code, and Sourcegraph Cody mainly rely on semantic search, AST/file analysis, code intelligence, and repository maps.
DAG-like approaches may appear more often in context engines and MCP tools than in mainstream coding agents, but these are primarily based on static analysis, not runtime observation. Static-analysis tools usually parse files into AST-like structures and combine them into a repository-level index or graph, which can indeed be very useful for general coding. But for data-science pipelines, runtime DAGs are often more relevant, because failures and performance issues depend on the specific data and configuration used.
I believe the reasons we don’t really see these DAGs in practice are two-fold. One, they are extremely hard to produce reliably and then integrate in a useful manner; two, and more importantly, their benefits only show up in either large codebases, where the main approach so far is semantic-based, or in specific types of long-horizon agents, which don’t show up that often in practice. In the next blog post, I will try to explore how setting up these observability DAGs as part of a long-horizon architecture itself improves performance (or doesn’t).
