Apache SeaTunnel

Posted on May 28

The Next Decade of Data Engineering: From Modern Data Stack to Data Engineering Harness

#data #dataengineering #dataengineeringharness #bigdata

Over the past decade, the core evolution of data engineering has been the deconstruction and reconstruction of traditional data warehouse architectures through the Modern Data Stack.

We separated data ingestion from databases, forming the Data Ingestion layer, using tools like Fivetran, Airbyte, and Apache SeaTunnel to solve ELT / CDC / Reverse ETL problems;
We separated compute from storage, forming cloud data warehouse and lakehouse systems such as Snowflake, Databricks, Iceberg, and Hive;
We separated orchestration from scripts, leading to orchestration systems like Apache Airflow and Apache DolphinScheduler;
SQL development, data modeling, lineage, data quality, BI, and AI analytics were further split into independent tools.

This architecture was undoubtedly progress. It moved data engineering away from the primitive era of “a bunch of scripts + Crontab” toward cloud-native infrastructure, elastic computing, engineering governance, and open ecosystems.

The greatest contribution of the Modern Data Stack was “decoupling,” and its biggest side effect was also “decoupling.”
Tools became more powerful, but data engineers were forced to switch between more systems than ever before: datasources in one place, synchronization configs in another, DAGs somewhere else, logs elsewhere, SQL stored in Git, and Snowflake / Iceberg / cloud warehouse execution results living in yet another environment.

As a result, many data engineers spend less time on data modeling, business understanding, metric definitions, architecture design, and cost optimization — and far more time configuring datasources, setting field mappings, dragging DAG nodes, modifying SQL, checking logs, and rerunning tasks. This is the hidden pain created by the Modern Data Stack: data engineers became trapped inside tools.

The emergence of engineering-focused AI systems like Codex and Claude Code is now changing the entire software engineering workflow. But how can data engineers truly achieve Vibe Coding? That is exactly the direction I’ve been exploring, and the core topic of this article.

I believe future data engineering will no longer revolve around “humans operating tools.” Instead, it will evolve into: Codex + Data Engineering Skills & Harness + Data Engineering SaaS + Cloud Data Warehouse Infrastructure.

In the past, the Modern Data Stack assumed that humans were the operational center: humans understood the tools, clicked through interfaces, connected workflows, and absorbed the context switching. But in the AI and agentic development era, data engineering should no longer mean “humans operating a pile of tools.” Instead, the work is divided: humans define objectives; Codex/Claude Code decompose them and implement solutions automatically; the Data Engineering Skill & Harness layer sets engineering boundaries and translates agent actions into operations on cloud SaaS systems; Snowflake / Iceberg / cloud warehouses provide scalable compute; orchestration and synchronization engines keep the runtime stable; and humans review, govern, and make the final decisions.

Once Codex and Claude Code deeply participate in data engineering, perhaps data engineers can finally be freed from the “Dirty Work” created by the Modern Data Stack, allowing data engineering to return from “tool operation” back to “engineering creation.”

I believe this organizational transformation is inevitable in the AI and Agent era.

1. The Problem with the Modern Data Stack: The Issue Is Not Weak Tools — It’s That Humans Spend Too Much Time Managing Complexity

Today’s data platforms are already extremely capable. Datasource management, batch synchronization, real-time CDC, SQL development, workflow orchestration, runtime logs, alerting, auditing, and lineage analysis are all widely available. But the more features platforms add, the more complex they become. Menus multiply, configurations grow deeper, and processes become longer.

Data engineers are no longer mastering tools — they are adapting themselves to tools. The once-popular Modern Data Stack essentially forced engineers to learn endless tools under the glamorous label of “Data Stack,” while in reality engineers became slaves to tools. Engineers should control tools, not endlessly relearn fragmented ecosystems.

Even a seemingly simple MySQL-to-Snowflake synchronization task may involve source schemas, target database/schema/warehouse/role settings, field type conversion, synchronization strategies, workflow dependencies, failure logs, downstream SQL, and reporting definitions. Even with the best visual tools, it still requires multiple drag-and-drop operations and configuration steps.
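
To make that list concrete, here is a minimal Python sketch of what a single synchronization task has to carry. The class names, the strategy values, and the simplified type mapping are all assumptions made for illustration; they are not the configuration format of SeaTunnel, WhaleStudio, or any other tool mentioned here.

```python
from dataclasses import dataclass, field

# Illustrative, simplified MySQL -> Snowflake type mapping (an assumption for
# this sketch; a real mapping depends on precision, signedness, and team policy).
TYPE_MAP = {
    "int": "NUMBER(10,0)",
    "bigint": "NUMBER(19,0)",
    "varchar": "VARCHAR",
    "text": "VARCHAR",
    "datetime": "TIMESTAMP_NTZ",
    "date": "DATE",
    "double": "FLOAT",
    "json": "VARIANT",
}


@dataclass
class SnowflakeTarget:
    database: str
    schema: str
    warehouse: str
    role: str


@dataclass
class SyncSpec:
    source_table: str
    source_columns: dict[str, str]  # column name -> MySQL type
    target: SnowflakeTarget
    strategy: str = "full"          # hypothetical values: "full", "incremental", "cdc"
    depends_on: list[str] = field(default_factory=list)  # upstream workflow names

    def target_columns(self) -> dict[str, str]:
        """Convert every source column type, failing loudly on unknown types."""
        converted = {}
        for name, mysql_type in self.source_columns.items():
            base = mysql_type.lower().split("(")[0].strip()
            if base not in TYPE_MAP:
                raise ValueError(
                    f"no mapping for MySQL type {mysql_type!r} on column {name!r}"
                )
            converted[name] = TYPE_MAP[base]
        return converted


spec = SyncSpec(
    source_table="shop.orders",
    source_columns={"id": "BIGINT", "note": "VARCHAR(255)", "created_at": "DATETIME"},
    target=SnowflakeTarget("ANALYTICS", "RAW", "LOAD_WH", "LOADER"),
    strategy="cdc",
)
print(spec.target_columns())
```

Even in this toy form, one table needs a source schema, four target settings, a strategy, a dependency list, and a type conversion step. In most stacks those pieces are spread across several systems.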

The real burden is not that any single technical challenge is difficult. The real burden is excessive context switching. Datasources live in one system, task configurations in another, scheduling elsewhere, logs elsewhere, SQL in Git or local files, and Snowflake execution results in cloud environments.

In the past, there was no better way, so humans had to do everything manually.

But once engineering AI systems like Codex and Claude Code emerged, many of these decisions could be handled by large language models. Small repetitive actions could be decomposed, exposed as callable operations, executed, and corrected automatically from feedback. That made the emergence of the Data Engineering Harness possible.

A Data Engineering Harness is not simply another data platform. It is a data engineering capability framework designed specifically for AI systems and engineering agents.

It encapsulates datasource management, synchronization, CDC, SQL development, orchestration, log diagnostics, permission auditing, observability, cost governance, and human takeover mechanisms into engineering capabilities that Codex/Claude Code can invoke, humans can review, and enterprises can govern.

In other words, the Harness is not solving the question: “Can AI write SQL?” It is solving questions like:

After AI writes SQL, can it run safely?
After AI creates tasks, can they be audited and tracked?
After AI invokes Snowflake, can permissions and costs be controlled?
After AI generates workflows, can humans understand, confirm, and take over?
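
As a rough illustration of the first two questions, the sketch below reviews an AI-written statement before it runs and records every decision. The keyword sets, the environment names, and the `submit` function are hypothetical; a real harness would parse SQL properly rather than look at the first keyword, which misses CTEs and multi-statement scripts.

```python
import re
from dataclasses import dataclass

# Hypothetical policy for this sketch: which leading keywords count as what.
DESTRUCTIVE = {"DROP", "TRUNCATE", "DELETE", "ALTER"}
READ_ONLY = {"SELECT", "SHOW", "DESCRIBE", "EXPLAIN"}


@dataclass
class Decision:
    action: str   # "allow", "needs_approval", or "deny"
    reason: str


def first_keyword(sql: str) -> str:
    """Return the leading keyword, ignoring whitespace and -- line comments."""
    stripped = re.sub(r"--[^\n]*", "", sql).strip()
    return stripped.split(None, 1)[0].upper() if stripped else ""


def review_sql(sql: str, environment: str) -> Decision:
    keyword = first_keyword(sql)
    if not keyword:
        return Decision("deny", "empty statement")
    if keyword in READ_ONLY:
        return Decision("allow", "read-only statement")
    if environment == "prod" and keyword in DESTRUCTIVE:
        return Decision("needs_approval", f"{keyword} in production requires a human")
    return Decision("allow", f"{keyword} permitted in {environment}")


audit_log: list[dict] = []


def submit(sql: str, environment: str, actor: str) -> Decision:
    """Review a statement and record who asked for what, and what was decided."""
    decision = review_sql(sql, environment)
    audit_log.append(
        {"actor": actor, "environment": environment, "sql": sql,
         "action": decision.action, "reason": decision.reason}
    )
    return decision


print(submit("-- cleanup\nDROP TABLE staging_orders", "prod", "codex"))
```

The point is not the specific rules but where they live: outside the agent, in a layer that answers the same way every time and leaves a trail.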

Therefore, the value of a Data Engineering Harness is not replacing data engineers, nor simply replacing data platforms. It upgrades data engineering from “humans manually operating tools” into “humans define goals, Codex executes tasks, platforms provide boundaries, and enterprises accumulate engineering know-how.”

2. Why Not Let Codex Directly Write Scripts?

Many people ask: if Codex can write SQL and Python and invoke command-line tools, why do we still need a Data Engineering Harness? Why not simply let it connect directly to MySQL and Snowflake and generate scripts automatically?

This may work in personal experiments, but it fails in enterprise data engineering.

Enterprise data engineering is not simply “making a script run.” Production-grade systems require manageability, auditability, operations, collaboration, and governance. At minimum, enterprises must answer questions such as:

How do we restrict Codex/Claude Code behavior across development and production environments to avoid catastrophic actions?
How can runtime failures be interpreted and corrected automatically by AI?
How can other people, agents, or tools understand the generated engineering workflows?
Can failed tasks recover automatically through retries, checkpoint resume, or reruns?
Will table modifications affect downstream systems?
Can DAG dependencies be visualized?
Can synchronization, ETL, and Data Mapping processes be visually represented?
Who audits incidents when problems occur?
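
One of these questions, whether a table modification affects downstream systems, can be sketched as a walk over lineage. The table names and the `LINEAGE` dictionary below are invented for illustration; in practice the lineage would come from the platform's metadata rather than a hand-written dictionary.

```python
from collections import deque

# Hypothetical lineage: each key feeds the tables listed as its value.
LINEAGE = {
    "raw.orders": ["stg.orders"],
    "stg.orders": ["mart.daily_revenue", "mart.customer_ltv"],
    "mart.daily_revenue": ["bi.revenue_dashboard"],
    "mart.customer_ltv": [],
    "bi.revenue_dashboard": [],
}


def downstream_of(table: str, lineage: dict[str, list[str]]) -> list[str]:
    """Breadth-first walk returning everything that depends on `table`."""
    seen = {table}          # guards against cycles leading back to the start
    ordered = []
    queue = deque(lineage.get(table, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        ordered.append(current)
        queue.extend(lineage.get(current, []))
    return ordered


print(downstream_of("stg.orders", LINEAGE))
```

A temporary script written by an agent has no access to this kind of graph unless something provides it, which is part of the argument for a shared capability layer.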

If AI generates temporary scripts every time, we simply replace “humans writing scripts” with “AI generating scripts.” Short-term productivity improves, but long-term technical debt explodes: inconsistent styles, unclear permissions, nonstandard logs, uncontrolled failures, and untraceable operations.

Eventually, data engineering falls back into the “Shell + Crontab era.”

That is why the future of enterprise AI data engineering is not about letting Codex run freely. It is about giving Codex clear engineering boundaries.

That is the true meaning of the Data Engineering Harness, and also the reason I designed the WhaleStudio Harness Suite. The Harness does not exist to cripple Codex or Claude Code; its boundaries are what make them observable, manageable, and production-ready.

3. Data Engineering Harness Design Philosophy

Future Data Engineering Harness systems will no longer be traditional human-centered development platforms. They will become Harness & Skill suites designed specifically for Codex, Claude Code, and Agentic development.

Take WhaleStudio Harness Suite as an example. Previously:

Apache DolphinScheduler solved orchestration problems;
Apache SeaTunnel solved multi-datasource synchronization and CDC problems;
WhaleStudio integrated these capabilities into an all-in-one enterprise platform.

But in the era of large models and Codex/Claude Code, providing GUIs for humans is no longer sufficient.

Future systems must simultaneously allow humans to review and take over, while enabling Codex/Claude Code to invoke, debug, and receive feedback through CLI interfaces and engineering contexts.

This means WhaleStudio must reorganize the core capabilities of DolphinScheduler and SeaTunnel — including orchestration, synchronization, CDC, SQL tasks, runtime execution, diagnostics, auditing, and observability — into an engineering capability layer that agents can invoke and debug, engineers can rapidly review, and enterprises can govern.

This is not about adding an “AI button” or chatbot onto old platforms. It is about redesigning software interaction models around agents as primary users.

From underlying engines to development feedback systems, every layer must become understandable, callable, observable, and controllable by AI systems.

Future data engineering platforms will not simply be feature collections. They will become containers for enterprise data engineering know-how.

Scheduling strategies, synchronization experience, SQL migration expertise, Snowflake/cloud warehouse cost optimization strategies, release workflows, and exception handling rules should all become part of Harness Memory and Skills. Codex/Claude Code should invoke not raw APIs, but proven enterprise engineering capabilities.

4. UI Will Not Disappear — It Will Become an Observability & Fine-Tuning Interface

Some people believe AI will make enterprise software UI irrelevant.

I disagree.

UI will not disappear, but its role will change. Previously, UI was the operational entry point: humans created datasources, configured tasks, dragged DAGs, scheduled workflows, and inspected logs.

In the future, many actions will be completed by Codex/Claude Code. But humans must still clearly understand what the agent created, which datasources were used, which Snowflake schemas were modified, which SQL changed, whether DAG dependencies are valid, why tasks failed, whether downstream systems are impacted, and whether human takeover is needed. Teams also need a shared view of all this in order to collaborate.

Nobody wants to read another person’s AI prompt history just to understand an engineering workflow. This creates demand for Observability + Fine-Tuning Interfaces.

Future UI systems will no longer focus on step-by-step manual operations. Instead, they will help humans review, fine-tune, and build trust in AI-generated engineering workflows.

UI should visualize execution plans, SQL diffs, DAG dependencies, runtime states, failure logs, and cost risks.
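
Of these, the SQL diff is the easiest to sketch. The example below uses Python's standard `difflib` module to show a reviewer exactly what an agent changed; the `sql_diff` helper and the sample query are hypothetical, with a MySQL-style date expression rewritten in Snowflake style.

```python
import difflib


def sql_diff(before: str, after: str, name: str = "query.sql") -> str:
    """Return a unified diff between the current and the proposed statement."""
    lines = difflib.unified_diff(
        before.splitlines(),
        after.splitlines(),
        fromfile=f"{name} (current)",
        tofile=f"{name} (proposed by agent)",
        lineterm="",
    )
    return "\n".join(lines)


before = """SELECT order_id,
       DATE_FORMAT(created_at, '%Y-%m-%d') AS order_day
FROM orders"""

after = """SELECT order_id,
       TO_CHAR(created_at, 'YYYY-MM-DD') AS order_day
FROM orders"""

print(sql_diff(before, after))
```

A review interface would render this visually, but the underlying contract is the same: the human sees the change, not the prompt that produced it.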

In short:

CLI is for Codex execution.
GUI is for human review.
Harness connects both worlds.

The best future UI may not even be static pages. It may dynamically generate review interfaces around specific engineering actions: SQL migration diffs, synchronization confirmation, DAG risk analysis, cost estimation, and deployment approvals.

UI becomes the trust layer between humans and AI-generated engineering systems.

5. Future Data Engineers: From Tool Operators to Engineering Commanders

Data engineers will not disappear. But they will diverge into two categories.

One group will remain tool operators: configuring platforms, editing SQL, checking logs, and manually dragging DAGs. These skills still matter, but the work itself will increasingly be automated by agents.

The other group will move upward: understanding business goals, designing data models, governing cloud warehouse costs, understanding orchestration/synchronization/CDC relationships, and encoding team experience into Harness systems.

Future elite data engineers may not be the people who know the most tools. They will be the people who best organize engineering capabilities.

They will know:

what can be automated;
what requires human confirmation;
what should become Harness rules;
and what should remain human judgment.
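
One way to picture "Harness rules" is as plain data that routes each proposed action. This is only a sketch: the action kinds, the rule list, and the three routes are assumptions, and anything the rules do not cover falls through to human judgment by default.

```python
from enum import Enum


class Route(Enum):
    AUTOMATE = "automate"
    CONFIRM = "confirm"
    HUMAN = "human judgment"


# Hypothetical rules, ordered from most to least specific; first match wins.
RULES = [
    ({"kind": "rerun_failed_task", "environment": "dev"}, Route.AUTOMATE),
    ({"kind": "rerun_failed_task"}, Route.CONFIRM),
    ({"kind": "create_sync_task", "environment": "dev"}, Route.AUTOMATE),
    ({"kind": "alter_table"}, Route.CONFIRM),
]


def route(action: dict) -> Route:
    """Return the first matching rule's route, or HUMAN when nothing matches."""
    for pattern, decision in RULES:
        if all(action.get(key) == value for key, value in pattern.items()):
            return decision
    return Route.HUMAN


print(route({"kind": "rerun_failed_task", "environment": "prod"}))
print(route({"kind": "redefine_revenue_metric"}))
```

Deciding what goes into such a list, and what is deliberately left out of it, is the kind of work the second group of engineers would own.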

In the past, data engineers revolved around tools. In the future, tools, Codex/Claude Code, and cloud capabilities will revolve around engineering goals.

Conclusion: The Future of Data Engineering Is Not Humanless — Humans Finally Move to a Higher Level

In the future, engineers who only know how to manually operate Modern Data Stack tools may become obsolete, just like developers who only know how to manually write Java code.

But engineers who understand business, data engineering, cloud warehouses, AI workflows, and Harness systems will become increasingly valuable.

And this is not some distant vision.

In one of my experimental demos, I built an entire MySQL-to-Snowflake ETL pipeline, including automatically created SQL orchestration, in just 10 minutes using Codex and WhaleStudio Harness.

Through CLI-based capabilities, the system automatically identified datasources, created synchronization tasks, generated visual DAGs, executed workflows, inspected logs, converted SQL into Snowflake-compatible pipelines, debugged runtime failures, and corrected issues automatically.
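
The run, inspect, fix, and rerun cycle described above can be sketched as a bounded feedback loop. This is not the WhaleStudio implementation: `fake_run` and `fake_repair` are stand-ins for a real engine and a real agent, the error text is simulated, and `NeedsHuman` is a hypothetical signal for human takeover.

```python
from typing import Callable

RunFn = Callable[[str], tuple[bool, str]]   # returns (succeeded, log text)
RepairFn = Callable[[str, str], str]        # returns a revised statement


class NeedsHuman(Exception):
    """Raised when attempts are exhausted and a person must take over."""


def run_with_feedback(sql: str, run: RunFn, repair: RepairFn, max_attempts: int = 3):
    """Run, read the log on failure, apply a repair, and try again."""
    history = []
    for attempt in range(1, max_attempts + 1):
        ok, log = run(sql)
        history.append((attempt, sql, log))
        if ok:
            return sql, history
        sql = repair(sql, log)
    raise NeedsHuman(f"still failing after {max_attempts} attempts")


# Stand-in engine: rejects one function name to simulate a dialect error.
def fake_run(sql: str) -> tuple[bool, str]:
    if "DATE_FORMAT" in sql:
        return False, "simulated error: unsupported function DATE_FORMAT"
    return True, "simulated success"


# Stand-in agent: applies one known rewrite.
def fake_repair(sql: str, log: str) -> str:
    return sql.replace(
        "DATE_FORMAT(created_at, '%Y-%m-%d')", "TO_CHAR(created_at, 'YYYY-MM-DD')"
    )


final_sql, history = run_with_feedback(
    "SELECT DATE_FORMAT(created_at, '%Y-%m-%d') FROM orders", fake_run, fake_repair
)
print(final_sql)
print(len(history), "attempts")
```

The attempt limit matters as much as the loop: when the agent cannot fix the problem, the history it collected is what a human picks up.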

Through this demo, you can experience how future data engineers may work.

The next decade of data engineering will not be about adding more tools. It will be about AI deeply integrating into tools, understanding goals, respecting boundaries, and operating under human review. And that is what Data Engineering Harness truly means.
