Blaine Elliott

Posted on Jun 22 • Originally published at blog.anomalyarmor.ai

How AI-Native Data Observability Changes Incident Response

#dataobservability #ai

AI-native data observability changes incident response by replacing dashboard navigation with conversational queries grounded in your warehouse metadata. Instead of clicking through a lineage graph to find which dashboards depend on a broken table, you ask the assistant, get an answer with citations to the actual metadata, and act on it. The change is not the alert itself; alerts have always fired. The change is the minutes between the alert firing and the engineer knowing exactly what to do.

I build AnomalyArmor, which is one of the tools that work this way. The argument below is about the shape of the workflow, not the product specifically. The same workflow exists with any tool that exposes its metadata to an AI assistant through an MCP server or a similar interface. The reason it matters is mean time to resolution, not vendor differentiation.

What a traditional incident response looks like

Take a real failure mode: a Fivetran sync to raw.orders runs late, the downstream fct_orders model builds against stale data, and the executive dashboard shows yesterday's revenue as today's. A freshness monitor fires at 9:04am.
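The freshness check behind an alert like this can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation: the function name `is_stale`, the calendar dates, and the 6-hour threshold are assumptions chosen to match the scenario (a last update 14 hours before a 9:04 check).

```python
from datetime import datetime, timedelta


def is_stale(last_update: datetime, now: datetime, max_age: timedelta) -> bool:
    """Return True when the last update is older than the allowed age."""
    return now - last_update > max_age


# Placeholder dates; only the 14-hour gap and the 6-hour limit matter here.
now = datetime(2024, 6, 22, 9, 4)
last_update = datetime(2024, 6, 21, 19, 4)

age = now - last_update
stale = is_stale(last_update, now, max_age=timedelta(hours=6))
print(f"raw.orders age: {age.total_seconds() / 3600:.0f}h, stale: {stale}")
```

The check itself is trivial, which is the point: detection is rarely the slow part of the response.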

Here is the workflow most teams run today.

9:04 — Alert fires in Slack. "Freshness check on raw.orders: last update 14 hours ago, expected < 6."
9:06 — Engineer opens the observability tool dashboard. Searches for raw.orders. Opens the table view.
9:08 — Clicks into lineage. Visual graph shows 12 downstream models and 4 dashboards. Engineer reads each one to figure out which matter.
9:12 — Switches to Fivetran. Confirms the sync is in a retrying state. Reads logs.
9:15 — Switches to dbt Cloud. Checks whether the model run that consumed stale data already completed. It did.
9:18 — Switches back to observability tool. Cross-references which dashboards consume fct_orders directly versus through a downstream mart.
9:22 — Pings #data-platform. "Hey, exec dashboard is on stale data, Fivetran is recovering, ETA?"
9:30 — Posts in #exec-team. "Heads up, dashboard is stale until ~10am, here is what is affected."

Twenty-six minutes from alert to communication. Most of that time is context-switching between tools to assemble a picture the metadata already contains. None of that time is spent fixing the actual problem.

What an AI-native incident response looks like

Same incident. Same alert. Different workflow because the assistant has access to the same metadata the dashboard does, plus the ability to reason across it.

9:04 — Alert fires in Slack. Same alert.
9:05 — Engineer opens Claude Code / Cursor / their assistant. Types: "raw.orders is 14h stale, which dashboards consume it and what is the blast radius."
9:05 — Assistant responds. "raw.orders feeds fct_orders (built 6:00 today on stale data). fct_orders feeds 3 dashboards: exec_revenue (refreshed 7:15), sales_pipeline (refreshed 7:30), finance_daily (refreshed 7:45). All three are showing yesterday's data. No other downstream consumers." Each claim cites the metadata source.
9:06 — Engineer asks one more. "What is the Fivetran sync status and last successful run."
9:06 — Assistant responds. "Fivetran sync raw_orders is in 'recovering' state since 8:51, retry 2 of 5, last successful sync at 19:04 yesterday. Expected next success window 9:30 to 10:00 based on retry cadence."
9:07 — Engineer posts in #exec-team. "Heads up: exec_revenue, sales_pipeline, and finance_daily are stale until ~10am. Fivetran recovering, no action needed on your side."

Three minutes from alert to communication. The dashboards were not opened. The lineage graph was not clicked through. The cross-tool stitching happened in one query because the assistant has access to the metadata that exists in the warehouse, the observability tool, and the data sync platform.

The fix did not get faster. Fivetran still needs to recover at its own pace. What got faster is everything that is not the fix: classification, impact analysis, stakeholder communication, and decision-making.

Why the workflow shape changes the math, not just the tool

The reason this matters is not that AI assistants are clever. It is that the bottleneck in most incident response is not "I do not know what broke" or "I do not know how to fix it." It is "I do not yet know what depends on the thing that broke." That is a metadata-traversal problem, and metadata traversal is what AI assistants do well when you give them the right context.
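To make "metadata traversal" concrete, here is a minimal sketch of a blast-radius query as a breadth-first walk over a lineage graph. The `LINEAGE` and `DASHBOARDS` structures are hypothetical stand-ins for whatever the observability tool actually stores; the asset names come from the example incident.

```python
from collections import deque

# Hypothetical lineage edges: upstream asset -> direct downstream consumers.
LINEAGE = {
    "raw.orders": ["fct_orders"],
    "fct_orders": ["exec_revenue", "sales_pipeline", "finance_daily"],
}
# Hypothetical set of assets that are dashboards rather than models.
DASHBOARDS = {"exec_revenue", "sales_pipeline", "finance_daily"}


def downstream(asset: str, lineage: dict[str, list[str]]) -> list[str]:
    """Breadth-first walk returning every asset reachable from `asset`."""
    seen = {asset}  # guards against cycles and duplicate paths
    order: list[str] = []
    queue = deque([asset])
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


affected = downstream("raw.orders", LINEAGE)
affected_dashboards = [a for a in affected if a in DASHBOARDS]
print(affected_dashboards)
```

A person clicking through a lineage graph performs the same walk by hand, one node at a time.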

Phase	Traditional time	AI-native time	Why the difference
Detect	0 to 5 min	0 to 5 min	Same alert
Classify (is this real?)	2 to 5 min	30 sec	Assistant reads alert payload + recent state
Impact analysis (what depends?)	5 to 15 min	30 sec	Assistant traverses lineage in one query
Cross-tool context (sync, dbt, BI)	5 to 10 min	30 sec	Assistant queries all three sources
Stakeholder communication	2 to 5 min	1 min	Engineer has the answer, just writes the message
Total mean time to communication	14 to 40 min	2.5 to 7.5 min	Metadata traversal happens in parallel, not serially
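As a rough check on the table, the per-phase ranges can be added up directly. The sketch below copies the phase rows and sums the low and high ends of each column. It assumes the phases run strictly one after another and treats "30 sec" as 0.5 minutes, so read the result as a bound rather than a measurement.

```python
# (phase, traditional (low, high), AI-native (low, high)), all in minutes.
PHASES = [
    ("Detect", (0, 5), (0, 5)),
    ("Classify", (2, 5), (0.5, 0.5)),
    ("Impact analysis", (5, 15), (0.5, 0.5)),
    ("Cross-tool context", (5, 10), (0.5, 0.5)),
    ("Stakeholder communication", (2, 5), (1, 1)),
]


def total(column: int) -> tuple[float, float]:
    """Sum the low and high ends of one column (1 = traditional, 2 = AI-native)."""
    low = sum(row[column][0] for row in PHASES)
    high = sum(row[column][1] for row in PHASES)
    return low, high


traditional = total(1)
ai_native = total(2)
print(f"traditional: {traditional[0]} to {traditional[1]} min")
print(f"AI-native:   {ai_native[0]} to {ai_native[1]} min")
```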

The "AI" is not doing the engineer's job. It is removing the dashboard-clicking from the engineer's job so that the engineer can spend their attention on the parts of the response that actually require judgment: do we roll back, do we wait, do we communicate broadly or narrowly, do we change the SLA going forward.

The technical shape that makes this work

The capability that enables the AI-native workflow is the MCP server (or equivalent agent-callable interface) that exposes the observability tool's metadata to the assistant. Without it, the assistant has training-data-level knowledge of "what dbt is" but no access to your specific schemas, freshness state, lineage graph, or recent alerts. With it, the assistant can query the same metadata the dashboard renders, and reason across it in natural language.

A working MCP surface for data observability needs to expose, at minimum:

Asset inventory. Tables, schemas, dependencies, owners.
Current state. Freshness, last successful sync, recent schema changes, active alerts.
Lineage. Which downstream models, dashboards, and consumers depend on a given table or column.
Historical context. What changed in the last hour, day, week. What alerts fired and how they resolved.
Cross-tool metadata where it exists. dbt run history, Fivetran/Airbyte sync state, BI tool consumption.

Each of these is a discrete tool call the assistant can compose into an answer. The engineer asks one natural-language question; the assistant runs three or four metadata queries and synthesizes a response. The work that used to be "open four tabs, read each one, reconcile in your head" becomes one round trip.
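The composition step can be sketched in plain Python. This does not use a real MCP SDK: the tool names (`get_freshness`, `get_downstream`, `get_sync_state`), the `METADATA` store, and the `source` strings are all hypothetical, standing in for whatever a given observability tool exposes.

```python
from typing import Callable

# Hypothetical in-memory metadata; a real server would read from the tool's backend.
METADATA = {
    "freshness": {"raw.orders": {"hours_since_update": 14, "expected_max_hours": 6}},
    "lineage": {"raw.orders": ["fct_orders"]},
    "sync": {"raw.orders": {"state": "recovering", "retry": 2, "max_retries": 5}},
}


def get_freshness(table: str) -> dict:
    return {"source": f"freshness/{table}", **METADATA["freshness"][table]}


def get_downstream(table: str) -> dict:
    return {"source": f"lineage/{table}", "consumers": METADATA["lineage"].get(table, [])}


def get_sync_state(table: str) -> dict:
    return {"source": f"sync/{table}", **METADATA["sync"][table]}


# Registry of callable tools, keyed by name.
TOOLS: dict[str, Callable[[str], dict]] = {
    "get_freshness": get_freshness,
    "get_downstream": get_downstream,
    "get_sync_state": get_sync_state,
}


def triage(table: str) -> dict:
    """Run every tool for one table and merge the results, keeping each source."""
    results = {name: tool(table) for name, tool in TOOLS.items()}
    fresh = results["get_freshness"]
    return {
        "stale": fresh["hours_since_update"] > fresh["expected_max_hours"],
        "direct_consumers": results["get_downstream"]["consumers"],
        "sync_state": results["get_sync_state"]["state"],
        "sources": [r["source"] for r in results.values()],
    }


summary = triage("raw.orders")
print(summary)
```

In a real assistant the model chooses which tools to call; here `triage` calls all of them to keep the sketch deterministic. Carrying a `source` on every result is what makes the cited answers described earlier possible.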

What AI-native observability does not do

It is worth being honest about the limits, because the characteristic failure mode of AI-native workflows is overclaiming.

It does not replace the human decision. The assistant tells you that exec_revenue and finance_daily are stale. You still decide whether to publish a stale dashboard, roll back to a snapshot, or let it ride to the next refresh. That decision involves business context the assistant does not have and should not be asked to have.

It does not catch what the underlying detection misses. If your freshness monitors are misconfigured or your schema checks are turned off, the AI-native workflow can tell you cleanly that no signal exists, but it cannot invent one. The underlying monitors still have to be good.

It does not work without metadata access. The whole capability hinges on the assistant being able to query the tool's metadata. A tool that does not expose an MCP server or equivalent does not participate in this workflow at all, regardless of how good its dashboard is.

It does not eliminate dashboards. The AI-native flow handles the "during an incident" path well. Long-term trend analysis, cross-team review, exec-level reporting, and visual lineage exploration still benefit from a UI. Both surfaces matter; they cover different work.

Where this changes hiring and team shape

A secondary effect worth naming. When incident response moves from "navigate four tools" to "ask one question," the implicit seniority floor for handling an incident drops. A new data engineer in week two cannot effectively triangulate a dashboard breakage across Snowflake, dbt Cloud, Fivetran, and the observability tool. They can ask the assistant a clear question and read a sourced answer.

The result is not that you need fewer senior engineers. It is that incident response stops monopolizing senior attention for routine cases, which frees senior engineers to spend time on the work where their judgment actually matters: schema design, SLA negotiation, postmortems that change the system, not just patch it.

How to evaluate whether a tool genuinely supports this

Many tools market "AI features" that do not change the incident-response workflow at all. Three questions separate the substantive integrations from the marketing layer.

Does the tool expose an MCP server or equivalent agent-callable interface? If the AI is only available inside the vendor's own dashboard chat box, the workflow has not changed; you have moved the dashboard click into a chat input. The point is to use the assistant you already work in, not to add another UI.
Are the answers grounded in your metadata, with citations to the source? "The lineage graph says X" is grounded. "Based on common patterns, X is likely" is not. Hallucinated answers in incident response are worse than no answers.
Can the assistant query historical state, not just current? "What changed in the last 24 hours" is the question incident response runs on. A tool that only exposes current snapshots forces you back to the dashboard for history, which puts you back in the traditional workflow.

Tools that answer yes to all three change the workflow. Tools that answer yes to one or two are partially there. Tools that answer no to all three have a feature called "AI" that does not affect how you handle an incident.
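The third question, historical state, comes down to whether the tool can answer a time-windowed query. A minimal sketch, assuming a hypothetical change log where each event carries a timestamp, an asset name, and an event type (the dates and event names are illustrative):

```python
from datetime import datetime, timedelta

# Hypothetical change log; timestamps and event names are illustrative only.
EVENTS = [
    {"at": datetime(2024, 6, 20, 14, 0), "asset": "raw.orders", "event": "schema_change"},
    {"at": datetime(2024, 6, 21, 19, 4), "asset": "raw.orders", "event": "sync_succeeded"},
    {"at": datetime(2024, 6, 22, 8, 51), "asset": "raw.orders", "event": "sync_retrying"},
    {"at": datetime(2024, 6, 22, 9, 4), "asset": "raw.orders", "event": "freshness_alert"},
]


def changes_since(events: list[dict], asset: str, now: datetime, window: timedelta) -> list[dict]:
    """Return events for `asset` that fall inside the window ending at `now`."""
    cutoff = now - window
    return [e for e in events if e["asset"] == asset and cutoff <= e["at"] <= now]


recent = changes_since(EVENTS, "raw.orders", datetime(2024, 6, 22, 9, 4), timedelta(hours=24))
for e in recent:
    print(e["at"].isoformat(), e["event"])
```

A tool that only exposes current snapshots has nothing like `EVENTS` to query, which is why this question pushes you back to the dashboard.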

A short experiment to run before committing

You do not have to switch tools to test whether this workflow would help your team. Run a one-week experiment.

For five business days, log every incident response. Timestamp each stage: alert fired, engineer started, impact identified, stakeholders notified, root cause known, fixed.
After each incident, write down which steps were "looking at metadata to figure out the shape of the problem" versus "actually doing the fix."
Total the "metadata-traversal" time across the week. That is the time an AI-native workflow targets.

Most teams who run this exercise find that 50 to 70 percent of their incident-response time is metadata traversal. If your number is in that range, the workflow change is worth evaluating concretely. If your number is below 25 percent (either because you have very few incidents or because your team has built strong tooling already), the gain is smaller and the priority lower.
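Totaling the week's log is simple arithmetic. The sketch below uses made-up incident durations purely for illustration. The `Incident` fields and the "borderline" label for the 25 to 50 percent band are my assumptions; the thresholds of 50 and 25 percent come from the paragraph above.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    name: str
    traversal_minutes: float  # time spent reading metadata to size the problem
    fix_minutes: float        # time spent actually fixing it


def traversal_share(incidents: list[Incident]) -> float:
    """Fraction of total incident time spent on metadata traversal."""
    traversal = sum(i.traversal_minutes for i in incidents)
    total = traversal + sum(i.fix_minutes for i in incidents)
    return traversal / total if total else 0.0


def verdict(share: float) -> str:
    if share < 0.25:
        return "lower priority"
    if share >= 0.5:
        return "worth evaluating"
    return "borderline"  # assumed label; the 25 to 50 percent band is not covered above


# Made-up numbers for one week, not real data.
WEEK = [
    Incident("late sync", traversal_minutes=22, fix_minutes=8),
    Incident("schema drift", traversal_minutes=15, fix_minutes=25),
    Incident("volume drop", traversal_minutes=18, fix_minutes=12),
]

share = traversal_share(WEEK)
print(f"traversal share: {share:.0%} -> {verdict(share)}")
```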

For what AI-native observability looks like end to end, see "A data observability tool that works from inside your AI assistant." For why this matters relative to dbt-only workflows, see "How to catch silent dbt test failures."

AI-native incident response FAQ

Is this the same as ChatGPT for data engineering?

No. A general assistant has no access to your warehouse, your lineage, your alert history, or your dbt project. It can talk about data engineering in the abstract. AI-native observability gives the assistant grounded access to your specific metadata so the answers are about your warehouse, not generic patterns.

What is an MCP server in this context?

MCP (Model Context Protocol) is a standard for exposing tools and data to AI assistants. An observability tool with an MCP server exposes its metadata (tables, lineage, freshness, alerts) as callable tools the assistant can use. The assistant decides which to call based on the question, runs them, and synthesizes the response. The engineer never has to know which tool was called.
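On the wire, MCP tool invocations are JSON-RPC 2.0 messages that use the `tools/call` method. The sketch below builds one such request. The tool name `get_downstream_assets` and its `table` argument are hypothetical, since each server defines its own tools.

```python
import json


def build_tool_call(request_id: int, tool_name: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    return json.dumps(
        {
            "jsonrpc": "2.0",
            "id": request_id,
            "method": "tools/call",
            "params": {"name": tool_name, "arguments": arguments},
        }
    )


# Hypothetical tool name and argument; a real server advertises its own.
message = build_tool_call(1, "get_downstream_assets", {"table": "raw.orders"})
print(message)
```

The assistant's MCP client builds and sends messages like this on the engineer's behalf, which is why the engineer never needs to know which tool was called.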

Does this work with Claude, Cursor, ChatGPT, or my own agent?

Any assistant or agent that speaks MCP can use any MCP-exposing tool. Claude Code, Cursor, and the OpenAI Agents SDK are the most common surfaces right now. The point of using a standard is that the observability tool does not have to integrate with each assistant separately.

Does this change the alert itself?

No. Alerts still fire from the same underlying monitors (freshness, schema, volume, distribution, custom rules). What changes is what happens between the alert firing and the human deciding what to do. That window is where the workflow shape matters.

What if my team already has good runbooks?

Runbooks are static; metadata is dynamic. A runbook tells you the kind of thing to check when freshness fails. AI-native observability runs the checks against the actual current state and tells you what is true right now. The two work together: the runbook is the strategy, the AI-native flow is the execution.

How is this different from the chat box already in my observability tool?

A vendor-specific chat box keeps you inside the vendor's UI, which means you have to context-switch into that UI during an incident. An MCP server lets the assistant you already use (the same one you write SQL with, edit code in, and run terminal commands from) talk to the observability tool. The difference is whether you change tools to ask the question.

Is the assistant going to hallucinate when I am trying to resolve an incident?

Hallucination risk drops sharply when the assistant has grounded access to real data and is asked to cite sources for its claims. The remaining risk is real but manageable: treat the assistant's answer as a faster version of what a junior engineer would tell you, and verify it the same way you would verify a junior engineer's report. Does the cited source actually say what the answer claims it says?
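That verification step can be partly mechanized. A minimal sketch, assuming each claim in an answer names the source it cites, a field, and the value it asserts; the `SOURCES` records, the claim format, and the status labels are all hypothetical:

```python
# Hypothetical metadata records keyed by a source identifier.
SOURCES = {
    "lineage/fct_orders": {"consumers": ["exec_revenue", "sales_pipeline", "finance_daily"]},
    "sync/raw.orders": {"state": "recovering"},
}

# Hypothetical claims extracted from an assistant's answer.
CLAIMS = [
    {"source": "sync/raw.orders", "field": "state", "value": "recovering"},
    {"source": "lineage/fct_orders", "field": "consumers", "value": ["exec_revenue", "sales_pipeline"]},
    {"source": "bi/exec_revenue", "field": "refreshed_at", "value": "7:15"},
]


def verify(claim: dict, sources: dict) -> str:
    """Compare one claim against the record it cites."""
    record = sources.get(claim["source"])
    if record is None:
        return "source_missing"  # the cited source does not exist
    if record.get(claim["field"]) != claim["value"]:
        return "mismatch"  # the source exists but says something else
    return "verified"


results = [verify(c, SOURCES) for c in CLAIMS]
print(results)
```

Anything other than "verified" is a reason to open the source yourself before acting on the answer.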

Does this only work in well-instrumented warehouses?

It works best in well-instrumented warehouses, the same way dashboards work best when the underlying metrics exist. If your tool has no lineage, the assistant cannot traverse lineage. The capability of AI-native observability is bounded by the metadata that exists under it.

The bottom line

The promise of AI-native observability is not smarter alerts; it is shorter time from "something broke" to "we know what to do." The change happens in the metadata-traversal window, which is where most incident-response time actually goes. The shape of the workflow shifts from "open four tabs" to "ask one question," and the engineer's attention moves from clicking through tools to making the decisions only a human can make.

This is not the only thing AnomalyArmor does, but it is the thing that has changed how I think about what an observability tool is for. AnomalyArmor is in private beta. If you want to see what the assistant-side workflow looks like on your own warehouse, reach out and we will get you access.
