The modern data stack is an engineering achievement. Teams can ingest petabytes from dozens of sources, transform them with dbt, warehouse them in Snowflake or BigQuery, and serve them through semantic layers to dashboards that refresh in seconds. The tooling has never been better.
And yet, the decisions that matter most — the ones made by automated systems in the moment a customer buys, a transaction clears, an agent acts — keep going wrong. Not because the data is missing. Not because the pipeline is slow. Because the stack was never designed for coherence.
What Coherence Means — and Why It's Different from Freshness
When engineers talk about data quality, they usually mean freshness: how recently was this data updated? That's a real problem, and the modern stack has made genuine progress on it. Streaming pipelines, near-real-time warehouses, and aggressive materialization schedules have compressed lag from hours to minutes, or even seconds.
But freshness is a per-table property. Coherence is a cross-system property.
A decision is coherent when every piece of context it consumes reflects the same version of reality at the same moment in time. A fraud check is coherent when the velocity counter, the account balance, the session signal, and the device fingerprint all describe the same instant — not a patchwork of states from four different systems, each updated at different times, each read at a different point in the query.
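To make the definition concrete, here is a minimal sketch (all names are illustrative, not from any real system): each piece of context carries the logical timestamp of the state it reflects, and a decision is coherent only when those timestamps agree within a tolerance.

```python
from dataclasses import dataclass

# Hypothetical sketch: each piece of decision context carries the logical
# timestamp ("as of" moment) at which its source system produced it.
@dataclass
class ContextPiece:
    name: str
    value: object
    as_of_ms: int  # logical timestamp of the state this value reflects

def is_coherent(pieces: list[ContextPiece], max_skew_ms: int = 50) -> bool:
    """Coherent = every piece describes (approximately) the same instant."""
    stamps = [p.as_of_ms for p in pieces]
    return max(stamps) - min(stamps) <= max_skew_ms

# A fraud check assembled from four sources, three in sync and one stale:
context = [
    ContextPiece("velocity_counter", 3, as_of_ms=1_000_000),
    ContextPiece("account_balance", 412.07, as_of_ms=1_000_010),
    ContextPiece("session_signal", "active", as_of_ms=1_000_020),
    ContextPiece("device_fingerprint", "abc123", as_of_ms=999_300),  # 700ms behind
]
print(is_coherent(context))  # False: one source lags beyond tolerance
```

Note that every individual piece here can be "fresh" by its own system's standard; coherence is the property of the set, not of any member.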
Most modern data stacks are very good at keeping individual tables fresh. They are almost universally bad at ensuring coherence across tables and systems when a decision has to read all of them simultaneously.
The Stack Was Built for Analysis, Not for Decisions
The modern data stack's architecture reflects its original purpose: analytical reporting. You have a source of truth (the warehouse), a transformation layer (dbt or similar), and a consumption layer (BI tools, notebooks, dashboards). The flow is append-only, batch-friendly, and eventually consistent.
That model works fine when a human is the decision-maker. A dashboard showing last night's revenue is coherent enough for a Monday morning review. Stale by a few seconds? Nobody cares.
The problem is that this architecture has been retrofitted — often ad hoc — into serving automated decisions that need correct context now. Product teams building recommendation engines, risk teams building fraud models, and AI teams building autonomous agents all end up pulling from the same warehouse or the same derived tables that were designed for dashboards. The tooling was built for analysis. The decisions need something different.
Three Ways the Modern Stack Loses Coherence
1. Preparation Delay
Derived state — aggregates, velocity counters, feature vectors, materialized views — is computed from raw events. That computation takes time. Even if your raw event pipeline is near-real-time, your derived state lags behind it by whatever your transformation cycle costs: minutes in a well-tuned dbt flow, tens of minutes in a typical warehouse setup.
During that window, the raw facts and the derived state describe different realities. An AI agent reading a user's "current session context" from a feature store may be reading a summary computed from events that stopped at T-5 minutes. The events since then exist somewhere in the pipeline, but they haven't made it into the context the agent reads.
This is not a pipeline speed problem. It's a structural gap between event arrival and context availability — and it persists even in sophisticated real-time stacks because preparation itself takes time.
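The gap is easy to see in a toy model (field names and timestamps are made up for illustration): raw events keep arriving, but the derived summary only reflects events up to the last transformation run.

```python
# Sketch of preparation delay: raw events arrive continuously, but the
# derived summary is rebuilt on a batch cycle, so at decision time it
# reflects only events that existed at the last run.
events = [
    {"user": "u1", "amount": 50, "ts": 100},
    {"user": "u1", "amount": 75, "ts": 160},
    {"user": "u1", "amount": 900, "ts": 295},  # arrived after the last batch
]

def materialize(events, up_to_ts):
    """Batch job: summarize only events visible at the last run."""
    seen = [e for e in events if e["ts"] <= up_to_ts]
    return {"txn_count": len(seen), "total": sum(e["amount"] for e in seen)}

last_batch_ts = 240  # the transformation cycle last ran at t=240
summary = materialize(events, last_batch_ts)

# The agent reads "current" context at t=300, but the summary stops at t=240:
print(summary)  # {'txn_count': 2, 'total': 125} — the 900 spike is invisible
```

The 900-unit spike exists in the raw log at decision time; it simply has not been folded into the context the decision reads.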
2. Cross-System Retrieval Inconsistency
Modern automated decisions rarely pull all their context from a single system. A real-time fraud check might read:
- Account balance from a transactional database
- Velocity counters from a Redis cache
- Device reputation from a third-party service
- Session signals from a streaming platform
- Feature vectors from a feature store
Each of these systems has its own consistency model, its own replication lag, and its own definition of "current." There's no transaction boundary spanning all five reads. The fraud engine assembles a composite context from five different points in time and treats it as if it represents a single, coherent moment.
Under normal load, the differences are small enough to ignore. Under high concurrency — exactly the conditions when fraud and limit breaches actually occur — the gaps widen. Events that changed account state 800ms ago may not have propagated to the cache yet. Two concurrent transactions may both read a pre-update balance.
The modern data stack has no native mechanism to provide a consistent snapshot across heterogeneous systems. Each tool guarantees consistency within itself. Cross-system coherence is left to the application.
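A simplified two-source model shows how the composite goes stale (the source names and lag figures are hypothetical): each system exposes only what has finished replicating, so a sequential read assembles values from two different points in time.

```python
# Illustrative sketch: independent sources with their own replication lag,
# read sequentially with no transaction spanning them.
class Source:
    def __init__(self, name, replication_lag_ms):
        self.name = name
        self.lag = replication_lag_ms
        self.state = {}  # commit_ts -> value

    def write(self, ts, value):
        self.state[ts] = value

    def read(self, now_ms):
        """Return the newest value that has finished replicating."""
        visible = [t for t in self.state if t + self.lag <= now_ms]
        return self.state[max(visible)] if visible else None

db = Source("transactional_db", replication_lag_ms=0)
cache = Source("redis_velocity", replication_lag_ms=800)

db.write(0, {"balance": 500}); cache.write(0, {"velocity": 3})
db.write(1000, {"balance": 0}); cache.write(1000, {"velocity": 4})  # account drained

now = 1500
context = {"db": db.read(now), "cache": cache.read(now)}
print(context)
# {'db': {'balance': 0}, 'cache': {'velocity': 3}} — two points in time,
# assembled and treated as one coherent moment
```

The database already shows the drained balance; the cache still shows the pre-drain velocity. Neither system is wrong by its own contract; only the composite is incoherent.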
3. Snapshot Incoherence Under Concurrency
The third failure mode is the subtlest. Even in a single system, reads and writes interleave under concurrent load. A velocity counter incremented by transaction A may not be visible to the read performed by transaction B if B reads before A's write commits. Depending on isolation level, B may see a partially updated state — or a state that will be rolled back.
In analytical workloads, this is tolerable. Slightly stale aggregates don't change business outcomes. In automated decision workloads, particularly in financial services, a velocity counter that misses the last N concurrent increments is the mechanism by which fraud rings exploit systems. The counter says "3 transactions in the last minute." The reality is 12.
The modern data stack's read-optimized, eventually consistent architecture — designed for analytical correctness — provides insufficient isolation for decision-time correctness under real concurrency.
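The velocity-counter race can be shown deterministically by replaying one bad interleaving (the limit and counts are illustrative): two concurrent checks both read before either increment commits, so both approve.

```python
# Deterministic replay of the race: both transactions read the committed
# counter before either increment lands.
class VelocityCounter:
    def __init__(self):
        self.committed = 3  # transactions seen in the last minute

    def read(self):
        return self.committed

    def commit_increment(self):
        self.committed += 1

limit = 4
counter = VelocityCounter()

# Interleaving under concurrency: read A, read B, commit A, commit B.
seen_by_a = counter.read()   # 3
seen_by_b = counter.read()   # 3 — B cannot see A's in-flight increment
counter.commit_increment()   # A commits
counter.commit_increment()   # B commits

print(seen_by_a < limit, seen_by_b < limit)  # True True: both approved
print(counter.read())        # 5 — the limit was actually breached
```

Scale this interleaving up to a coordinated burst and you get the "counter says 3, reality is 12" failure described above.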
Why Teams Don't See This as a Stack Problem
Here's the frustrating part: teams usually diagnose coherence failures as model problems, feature problems, or freshness problems — not as architectural problems.
When a fraud model approves a transaction it should have blocked, the first instinct is to retrain the model, adjust the threshold, or improve the features. When an AI agent acts on wrong context, the first instinct is to improve the prompt, add memory, or switch models.
The interventions that should work — more data, better models, lower latency — don't fix coherence failures, because the problem isn't in any single layer. It's in the gap between layers: the seam where independently consistent systems have to be read together and their outputs treated as a unified picture of reality.
Coherence failures are invisible in the tooling. Your data observability platform will show green. Your feature store's freshness metrics will look fine. Your latency dashboards will show sub-100ms reads. Everything looks healthy because every individual component is healthy. The incoherence only exists at the moment a decision assembles context from all of them simultaneously.
What Would a Coherence-Aware Stack Look Like?

A stack designed for decision coherence has three properties that the modern analytical stack lacks.
Single snapshot semantics across systems. A decision should be able to read all of its required context — transactional state, derived aggregates, streaming signals, vector representations — as of the same logical point in time. This is different from reading each system at "the latest." It means the stack maintains a consistent snapshot that spans systems, so a decision sees a coherent view of reality rather than a patchwork of independently current values.
Incremental materialization with bounded lag. Derived state — aggregates, features, rollups — should be maintained incrementally as events arrive, not recomputed on a batch schedule. The goal is not zero-lag (which is impossible for non-trivial transformations) but bounded lag: a guarantee that the context available at decision time is at most N milliseconds behind raw event arrival, where N is small enough to be within the validity window of the decision.
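A bounded-lag view can be sketched like this (names and the rejection policy are assumptions for illustration): events are folded into the aggregate as they arrive, the view tracks how far derived state trails raw arrival, and reads that exceed the bound are rejected rather than silently served stale.

```python
# Sketch of incremental materialization with an enforced lag bound.
class BoundedLagView:
    def __init__(self, lag_bound_ms):
        self.lag_bound_ms = lag_bound_ms
        self.log = []        # raw events, appended on arrival
        self.applied = 0     # index of the next event to fold in
        self.txn_count = 0
        self.total = 0.0

    def ingest(self, event):
        self.log.append(event)

    def step(self):
        """Fold one pending event into the derived state — no batch cycle."""
        if self.applied < len(self.log):
            e = self.log[self.applied]
            self.txn_count += 1
            self.total += e["amount"]
            self.applied += 1

    def lag_ms(self, now_ms):
        """How far derived state trails raw event arrival."""
        if self.applied == len(self.log):
            return 0
        return now_ms - self.log[self.applied]["arrived_ms"]

    def read(self, now_ms):
        lag = self.lag_ms(now_ms)
        if lag > self.lag_bound_ms:
            raise RuntimeError(f"derived state lags raw events by {lag}ms")
        return {"txn_count": self.txn_count, "total": self.total}

view = BoundedLagView(lag_bound_ms=200)
view.ingest({"amount": 50.0, "arrived_ms": 1_000})
view.step()  # applied immediately: lag stays near zero
view.ingest({"amount": 900.0, "arrived_ms": 1_100})
# The second event has not been folded in, so a read at t=1400 is rejected:
try:
    view.read(now_ms=1_400)
except RuntimeError as err:
    print(err)  # derived state lags raw events by 300ms
```

The design choice worth noting: the bound is checked at read time, so a decision either gets context within its validity window or learns explicitly that it cannot.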
Concurrent write isolation that doesn't sacrifice read performance. Under high concurrency, reads and writes must be isolated such that a decision sees either a fully committed write or no write — not a partial state. This is a standard database guarantee that most analytical systems relax for throughput. A decision-coherent stack restores it for the specific reads that feed automated decisions.
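Restoring that guarantee for a velocity check can be sketched with an atomic check-and-commit (a minimal illustration using a lock; a real system might use compare-and-swap or database-level isolation instead):

```python
import threading

# Sketch: the read-check-increment runs as one atomic unit, so every
# decision sees either a fully committed increment or none.
class IsolatedCounter:
    def __init__(self, limit):
        self.limit = limit
        self.count = 0
        self._lock = threading.Lock()

    def try_transaction(self):
        """Atomically check the limit and commit the increment."""
        with self._lock:
            if self.count >= self.limit:
                return False    # sees only committed state
            self.count += 1     # commit is visible before the lock releases
            return True

counter = IsolatedCounter(limit=4)
results = []
threads = [threading.Thread(target=lambda: results.append(counter.try_transaction()))
           for _ in range(10)]
for t in threads: t.start()
for t in threads: t.join()
print(sum(results), counter.count)  # 4 4 — never more approvals than the limit
```

Contrast with the race shown earlier: under the same ten-way concurrency, the limit can no longer be breached, because no decision ever reads a counter mid-update.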
These properties are not exotic. They exist in database systems, though usually only within a single system boundary. The architectural challenge — and the reason the modern data stack hasn't solved this — is providing them across the heterogeneous sources that real automated decisions consume.
The Coherence Problem Is Getting Harder
Three trends are making this worse.
AI agents read more context, from more systems, under tighter time constraints. A traditional fraud model might read five features from one system. A multi-agent orchestration system might read dozens of signals from a dozen systems, synthesize them, and act — all within a second. Each additional source multiplies the opportunity for incoherence.
Automated decisions are taking on higher-stakes actions. AI agents are increasingly being given the ability to take real-world actions: approving transactions, extending credit, executing trades, modifying customer state. The cost of acting on incoherent context is no longer a misfired recommendation — it's a financial loss, a compliance violation, or a cascading error that's hard to reverse.
Concurrency is increasing. As more decisions are automated and as systems scale, the window during which concurrent state changes can cause coherence failures grows. Fraud rings exploit exactly this: high-concurrency bursts timed to hit the gap between when state changes and when derived context reflects it.
The modern data stack was not designed for this world. It was designed for a world where decisions are made by humans, who can tolerate staleness, who can recognize and correct inconsistencies, and who operate at a cadence that makes analytical eventual consistency acceptable.
Coherence Is Not a Feature. It's Infrastructure.
The instinct, when faced with a coherence problem, is to solve it in the application: add more aggressive cache invalidation, tighten replication lag, build a custom state synchronization layer. Teams do this, and it works — until it doesn't, which is usually at the worst possible moment, under the highest possible load.
Coherence cannot be reliably provided by application logic on top of an architecture that was never designed to support it. It requires infrastructure that was designed for it from the start: a system that maintains a consistent, multi-modal view of state across sources, keeps derived context within bounded lag of events, and guarantees snapshot isolation for the reads that feed automated decisions.
This is what the modern data stack is missing. Not more speed. Not more data. Not better models. Coherence — the guarantee that when a decision is made, the context it reads describes the same world at the same moment in time.
Until the stack provides that guarantee, automated systems will keep making decisions on a world that doesn't quite exist anymore.
Originally published at tacnode.io