Databricks and similar enterprise data platforms have invested a great deal of time and effort in hardening their product suites with observability and tracing. Unsurprisingly, this is required as part of enterprise support, especially in regulated sectors. But for the specific case of sophisticated data science and analytics agents, there is a gap in the observability suite, not just for Databricks but across analytics and data science agent providers large and small.
In the case of Databricks, even with notebooks as the primary user interface, the level of control and tracing is no doubt high given the offerings across data lineage, data management, and MLflow. Yet large vendors like Databricks and Snowflake and smaller analytics and data science agent suppliers share an observability gap. The gap is inherent to coding-agent architectures and does not apply equally to all agents. A text-to-SQL assistant can be wrong in an 'obvious' way: the result makes no sense. A multi-step Python or Spark pipeline produced by an agent is different. Even when written by a human, pipeline logic is hard to unpick given the endless combinations of joins, data issues, and data characteristics. This problem doesn't go away when an agent is involved. Genie, for example, can plan a solution, run code, use cell outputs to improve results, and fix errors automatically. The question is what, beyond the initial reasoning and the final artifact, can be inspected in this instance, and what can be logged reliably rather than probabilistically.
To achieve their objectives, these more sophisticated data science and analytics agents need to create relatively complex multi-step pipelines. Past the initial data retrieval and the final storage step, the pipelines themselves are just arbitrary code. Observability for this type of script, when human-written, spans a whole area of companies in the MLOps space, including Databricks' own MLflow. But it is unclear what observability exists when this code is produced by agents, short of asking the agent itself to instrument the code (probabilistically), which somewhat defeats the purpose of observability in the first place.
Having narrowed the observability gap from the broader data platform context to a specific area, the 'executed pipeline code' at the heart of these more sophisticated analytics and data science agent workflows, my first question was whether MLflow or another 'off-the-shelf' tool in the ecosystem can fill this gap directly. For why OpenTelemetry is not enough here, see the previous blog post.
Unsurprisingly, MLflow is heading in the direction of more granular instrumentation with the least amount of effort on anyone's part, human or agent. For classic ML, a single mlflow.autolog() call can automatically capture params, metrics, models, datasets, and artifacts around supported training APIs, while for GenAI and agent workflows, one-line tracing primitives like @mlflow.trace, mlflow.trace(...), and mlflow.start_span() add function- and block-level visibility, including parent-child relationships, inputs, outputs, exceptions, and execution time.
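For a sense of what those primitives look like in practice, here is a minimal sketch (the pipeline step and data are hypothetical stand-ins; the MLflow calls are the real tracing API):

```python
import mlflow

# Function-level tracing: the decorator records inputs, outputs,
# exceptions, and execution time for each call as a span.
@mlflow.trace
def clean(rows):
    # hypothetical step: drop rows containing missing values
    return [r for r in rows if None not in r.values()]

# Block-level tracing: spans can also be opened explicitly and nested,
# which is where the parent-child relationships come from.
with mlflow.start_span(name="pipeline") as span:
    raw = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": None}]
    span.set_inputs({"n_raw": len(raw)})
    cleaned = clean(raw)  # shows up as a child span of "pipeline"
    span.set_outputs({"n_cleaned": len(cleaned)})
```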
My initial experiments with instrumenting agent-created code with MLflow deterministically allowed me to track the models as experiments, which was a good step in the right direction 👍, but of course I cannot track data transformations, with MLflow or with anything else I'm familiar with.
Tracking with autolog was the better option for me, rather than the tracing functions, because I'm not really tracking the agent; I'm trying to track what happens in the code produced by the agent when it runs. Below is some basic example tracking:
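A minimal sketch of what I mean, with scikit-learn standing in for the agent-generated training code (dataset and model choice are illustrative):

```python
import mlflow
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# One call before the (agent-generated) training code runs:
# params, metrics, and the model are captured automatically
# for supported libraries such as scikit-learn.
mlflow.autolog()

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run():
    model = RandomForestRegressor(n_estimators=100, max_depth=5)
    model.fit(X_train, y_train)  # autologged: params, training metrics, model
```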
The gap, of course, is tracking what actually happens inside the pipeline outside the model itself: all the data operations for which no observability is present. While the code itself is the best evidence in other use cases, for pipeline-type structures where the outcomes are heavily influenced by the particulars of the data, the code is not enough. Observability on both the code and its runtime execution is needed, and for these data science and analytics agents, the code they produce (outside the model itself) is currently a black box. The example below sketches the kind of interim artifacts that tooling like MLflow does not currently capture for agent-written code.
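For illustration, a hypothetical agent-generated pipeline (names and data are invented); autolog would capture a model trained at the end of it, but none of the interim artifacts in between:

```python
import pandas as pd

# Hypothetical agent-generated pipeline steps. None of these interim
# artifacts are captured by autolog or tracing unless someone (or the
# agent, probabilistically) instruments them by hand.
customers = pd.DataFrame({"id": [1, 2, 3], "segment": ["a", "a", "b"]})
txns = pd.DataFrame({"id": [1, 1, 2, 4], "amount": [10.0, 5.0, 7.0, 3.0]})

joined = customers.merge(txns, on="id", how="left")  # silently drops the id=4 txn
joined["amount"] = joined["amount"].fillna(0)        # imputation decision
active = joined[joined["amount"] > 0]                # population filter
features = active.groupby("id")["amount"].sum()      # aggregation

# Each of `joined`, `active`, and `features` shapes the final result,
# yet only the eventual model (if any) appears in the tracking UI.
```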
In this space we were brainwashed to believe that observability matters at all costs; however, for this instance, given the perception of coding agents in the market, I feel an argument has to be made for why it really matters.
First, it's about auditability. Truly not everyone cares about this, and not everyone should. But in regulated sectors like finance or healthcare it matters. For model validation in finance, for example, the kind of data lineage documentation required involves more than what gets stored in Unity Catalog, Delta tables, or MLflow model tracking, all useful components. This type of use case needs to reflect the transformations that happen in the code itself once executed, and teams currently do this manually. At the moment, the use of semi-autonomous coding agents for these use cases is minimal, but that is not where the enterprise stack is going.
Second, observability for these more sophisticated agents extends to other related risks, such as reproducibility, error propagation across longer pipelines, and general control issues for agent-generated code.
Without observability, it is harder to catch 'semantic mistakes' the agent might make, such as not using the correct metric definition, or applying the analysis or model to the wrong population. A bad transformation early in the pipeline affects everything downstream. I'm not sure exactly what level of observability is needed to mitigate these issues, but without any we would certainly struggle.
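A toy example of the 'wrong population' failure mode (entirely hypothetical data), where nothing in the final number itself reveals the mistake:

```python
import pandas as pd

orders = pd.DataFrame({
    "customer": ["a", "b", "c", "d"],
    "status":   ["active", "active", "churned", "churned"],
    "revenue":  [100.0, 50.0, 200.0, 150.0],
})

# Intended metric: average revenue across ALL customers.
intended = orders["revenue"].mean()            # 125.0

# The agent silently applies the analysis to the wrong population:
wrong_pop = orders[orders["status"] == "active"]
produced = wrong_pop["revenue"].mean()         # 75.0

# Every downstream figure inherits the error, and the final artifact
# gives no signal that the population changed along the way.
```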
Reproducibility is another area that requires some level of observability: if transformation execution is not observable, the final notebook may not be a faithful record of the run that produced the result. Similarly, we would struggle to compare agent runs over time (or rather, without observability we would struggle more).
The key argument for in-depth observability of agent-generated code is enterprise-level control, especially in regulated sectors. Usage of these sophisticated data science and analytics agents in regulated sectors might be small to begin with, relative to the size of the overall data platform offering. However, as Databricks and the large enterprise data platforms feel the pressure from coding agents and foundation models, there just aren't that many avenues left to go into. If Databricks' long-term position is around providing the governed system in which semi-autonomous enterprise agents can actually run, then any observability gap will prove problematic.

