
Linghua Jin

Why AI ETL Needs Different Primitives: Lessons from Building CocoIndex in Rust 🦀

Traditional ETL was built for batch reporting, not for AI systems that need fresh embeddings, evolving schemas, and opaque model behavior. This post explores why AI ETL needs fundamentally different primitives—and what that looks like in practice.

The Problem: Legacy ETL Doesn't Scale for AI

Most legacy ETL assumes:

  • Stable schemas and daily/hourly batches
  • SQL-only transformations
  • Single target systems
  • "Best effort" execution

In AI workloads, the reality is completely different:

Stale embeddings = hallucinations. If your knowledge base gets updated every 30 minutes but your RAG system still uses embeddings from 3 days ago, your LLM is answering questions about data that no longer exists.

Schemas evolve constantly. Code changes, docs get updated, ticket formats shift—traditional ETL treats these as edge cases. For AI, they're the norm.

Transformations aren't just SQL. You're calling embedding APIs, chunking documents, running custom Python, routing to multiple vector DBs, knowledge graphs, and relational stores simultaneously.

API calls are expensive. If you re-embed your entire corpus every batch run, you're not just wasting compute—you're hemorrhaging money and hitting rate limits.

If you treat AI data pipelines as "just another batch job," you end up with either:

  • Over-recomputation: Wasting 10x the compute by rebuilding what didn't change
  • Index drift: Running stale data and watching your AI performance degrade silently

That's the gap CocoIndex set out to solve.

Primitive 1: Dataflow Instead of Mutable Tables

CocoIndex adopts a dataflow programming model. Instead of imperative "insert/update/delete" commands, you define a DAG of transformations where each node is a pure function.

```
Raw Input → Parse → Chunk → Embed → Normalize → [Vector DB + Postgres + Graph]
```
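To make "every node is a pure function" concrete, here's a minimal Rust sketch of the same pipeline shape. This is illustrative only, not CocoIndex's actual API; `parse`, `chunk`, and `embed` are stand-in functions, and the Normalize step is omitted for brevity.

```rust
// Illustrative sketch only -- not CocoIndex's real API. Each stage is a
// pure function: the same input always yields the same output, so the
// engine can safely cache results and recompute only what changed.

fn parse(raw: &str) -> String {
    // Hypothetical parse step: stands in for real markup handling.
    raw.trim().to_string()
}

fn chunk(doc: &str) -> Vec<String> {
    // Hypothetical fixed-size chunker (real chunkers are structure-aware).
    doc.as_bytes()
        .chunks(32)
        .map(|c| String::from_utf8_lossy(c).into_owned())
        .collect()
}

fn embed(chunk: &str) -> Vec<f32> {
    // Stand-in embedding: a real pipeline would call an embedding model/API.
    vec![chunk.len() as f32]
}

fn main() {
    let raw = "  Raw input document ...  ";
    // The DAG: Raw Input -> Parse -> Chunk -> Embed -> fan out to targets.
    let embeddings: Vec<Vec<f32>> =
        chunk(&parse(raw)).iter().map(|c| embed(c)).collect();
    println!("{} chunks embedded", embeddings.len());
}
```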

Why this matters:

  • Declarative: Change a formula once, and it propagates everywhere
  • Safe to cache: Pure functions mean the engine knows exactly when results are reusable
  • Compositional: Complex AI pipelines are just nested dataflow graphs
  • Spreadsheet-intuitive: Every field is defined by a formula (like Excel), making it easy to reason about

Learn more about CocoIndex's dataflow architecture.

Primitive 2: Incremental Processing as First-Class

Production AI pipelines can't afford to reprocess entire corpora on every run. CocoIndex tracks changes at two levels:

Source-level: Content hashes and fingerprints detect when a file/row actually changed. If the fingerprint is identical, skip it entirely—no reprocessing, no API calls.

Flow-level: When transformation logic changes (new embedding model, better chunking), the engine computes which parts of the graph are affected and reprocesses only those nodes.
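As a rough sketch of the source-level idea (not CocoIndex internals): compute a fingerprint per source item, compare it against the previous run, and skip anything unchanged. A production system would use a stable content hash such as blake3; std's `DefaultHasher` is used here only to keep the example dependency-free.

```rust
// Sketch of source-level change detection (illustrative assumption, not
// CocoIndex internals).

use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

fn fingerprint(content: &str) -> u64 {
    let mut h = DefaultHasher::new();
    content.hash(&mut h);
    h.finish()
}

fn main() {
    // seen: source key -> fingerprint recorded on the previous run.
    let mut seen: HashMap<String, u64> = HashMap::new();
    seen.insert("docs/guide.md".into(), fingerprint("v1 contents"));

    for (key, content) in [("docs/guide.md", "v1 contents"), ("docs/api.md", "new file")] {
        let fp = fingerprint(content);
        if seen.get(key) == Some(&fp) {
            // Identical fingerprint: skip entirely -- no re-chunking,
            // no re-embedding, no API call.
            println!("skip  {key}");
        } else {
            println!("embed {key}");
            seen.insert(key.to_string(), fp);
        }
    }
}
```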

Result: Near real-time indexes without the cost of full rebuilds.

Primitive 3: Durable Execution for Unreliable APIs

AI ETL calls flaky APIs, hits rate limits, and deals with credential expiry. The execution engine itself must be durable:

  • Row-level retry semantics: Failed rows are captured and retried in subsequent runs
  • Version-aware commits: Incremental updates are applied in consistent order across targets
  • Stable error handling: Transient failures don't produce inconsistent data across stores

This shifts reliability from "bash script + prayer" to a system where failures, retries, and progress tracking are built-in concerns.
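Here's a minimal sketch of the row-level retry idea, assuming a hypothetical `call_embedding_api` that fails transiently. Failed rows are captured for a later run instead of aborting the whole pipeline; this illustrates the semantics, not CocoIndex's engine code.

```rust
// Row-level retry with exponential backoff (illustrative sketch).

use std::{thread, time::Duration};

fn call_embedding_api(row: &str, attempt: u32) -> Result<Vec<f32>, String> {
    // Stand-in for a flaky network call: fails twice, then succeeds.
    if attempt < 2 { Err(format!("rate limited: {row}")) } else { Ok(vec![0.0]) }
}

fn embed_with_retry(row: &str, max_attempts: u32) -> Result<Vec<f32>, String> {
    let mut last_err = String::new();
    for attempt in 0..max_attempts {
        match call_embedding_api(row, attempt) {
            Ok(v) => return Ok(v),
            Err(e) => {
                last_err = e;
                // Exponential backoff between transient failures.
                thread::sleep(Duration::from_millis(10u64 << attempt));
            }
        }
    }
    Err(last_err)
}

fn main() {
    let mut failed_rows = Vec::new();
    for row in ["row-1", "row-2"] {
        if embed_with_retry(row, 3).is_err() {
            // Captured for the next run instead of failing the pipeline.
            failed_rows.push(row);
        }
    }
    println!("rows to retry next run: {failed_rows:?}");
}
```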

Read more about durable execution in CocoIndex.

Primitive 4: Lineage and Observability

When your RAG system returns a bad answer, you need to know:

  • Which documents produced that chunk?
  • Which embedding model was used?
  • Was the chunking strategy correct?
  • Which version of the source data was indexed?

CocoIndex bakes this in end-to-end:

  • End-to-end lineage: Trace a bad search result back to source records and transformation versions
  • Before/after visibility: CocoInsight exposes data at each pipeline step (no custom logging needed)
  • Spreadsheet UI: Visual inspection of flows and transformations
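To make that concrete, here's a sketch of the kind of lineage record a pipeline needs to attach to every derived chunk. The field names are hypothetical, not CocoIndex's actual schema.

```rust
// Hypothetical per-chunk lineage record. With this attached to every
// indexed chunk, a bad answer traces back to its inputs.

#[derive(Debug)]
struct ChunkLineage {
    source_id: String,         // which document produced this chunk
    source_version: u64,       // which version of the source was indexed
    chunking_strategy: String, // how the text was split
    embedding_model: String,   // which model produced the vector
    flow_version: u64,         // version of the transformation logic
}

fn main() {
    let lineage = ChunkLineage {
        source_id: "docs/guide.md".into(),
        source_version: 42,
        chunking_strategy: "recursive/512-tokens".into(),
        embedding_model: "text-embedding-3-small".into(),
        flow_version: 7,
    };
    // Debugging a bad search result starts from this record, not from logs.
    println!("{lineage:#?}");
}
```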

Primitive 5: Multi-Target, AI-Native Connectivity

AI stacks don't write to one system. A single pipeline needs to:

  • Store embeddings in Qdrant or LanceDB
  • Persist metadata in Postgres or Snowflake
  • Emit knowledge graphs
  • Sync to feature stores

CocoIndex treats these as first-class plug-and-play targets, not special cases. One logical flow fans out to all of them, staying in sync automatically.
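A minimal sketch of what fan-out through a common target abstraction can look like (illustrative only; CocoIndex's real target interface will differ):

```rust
// One logical flow writes the same update to every registered target
// through a common trait (sketch, not CocoIndex's API).

struct Update { key: String, embedding: Vec<f32> }

trait Target {
    fn name(&self) -> &'static str;
    fn apply(&mut self, update: &Update);
}

struct VectorDb;   // e.g. Qdrant or LanceDB
struct Relational; // e.g. Postgres
struct Graph;      // knowledge graph

impl Target for VectorDb {
    fn name(&self) -> &'static str { "vector-db" }
    fn apply(&mut self, u: &Update) {
        println!("{}: upsert {} ({} dims)", self.name(), u.key, u.embedding.len());
    }
}
impl Target for Relational {
    fn name(&self) -> &'static str { "postgres" }
    fn apply(&mut self, u: &Update) {
        println!("{}: upsert row {}", self.name(), u.key);
    }
}
impl Target for Graph {
    fn name(&self) -> &'static str { "graph" }
    fn apply(&mut self, u: &Update) {
        println!("{}: merge node {}", self.name(), u.key);
    }
}

fn main() {
    let mut targets: Vec<Box<dyn Target>> =
        vec![Box::new(VectorDb), Box::new(Relational), Box::new(Graph)];
    let update = Update { key: "docs/guide.md#chunk-0".into(), embedding: vec![0.1, 0.2] };
    // One logical update fans out to all targets, keeping them in sync.
    for t in targets.iter_mut() { t.apply(&update); }
}
```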

Why Rust?

Building this in Rust wasn't cosmetic—it enables these primitives at scale:

Predictable performance: No garbage collection pauses when processing massive datasets. Incremental processing and change detection run in tight memory-efficient code.

Safe concurrency: Tracking multiple concurrent flows and partitions is inherently error-prone; Rust's ownership model prevents data races in the execution core.
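As a toy illustration of that point: each worker thread takes ownership of its partition outright, and results flow back through a channel, so there is no shared mutable state for a data race to occur on.

```rust
// Partitions processed on separate threads, results funneled through a
// channel. Ownership makes unsynchronized shared mutation a compile error.

use std::sync::mpsc;
use std::thread;

fn main() {
    let (tx, rx) = mpsc::channel();
    let partitions = vec![vec!["a", "b"], vec!["c"], vec!["d", "e", "f"]];

    for (id, part) in partitions.into_iter().enumerate() {
        let tx = tx.clone();
        thread::spawn(move || {
            // Each worker owns its partition -- no data races possible.
            tx.send((id, part.len())).unwrap();
        });
    }
    drop(tx); // close the channel so the receiver loop terminates

    for (id, processed) in rx {
        println!("partition {id}: {processed} rows processed");
    }
}
```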

Interop: Rust compiles to static binaries and integrates with Python, TypeScript, and other ecosystems, so the core runs natively while the API stays accessible.

Check out the CocoIndex architecture deep dive to see how Rust enables these capabilities.

What This Means for AI Teams

The lesson from building CocoIndex: AI ETL requires fundamentally different primitives than BI ETL.

If you're currently:

  • Rebuilding your entire embedding index every day
  • Debugging pipeline failures by grep-ing through logs
  • Manually syncing data across Postgres, Qdrant, and your feature store
  • Hoping your RAG system's knowledge base is reasonably fresh

...you're operating at a fraction of possible efficiency while accumulating technical debt with every run.

The emerging shape of AI ETL:

  • ✅ Continuous, fingerprint-aware incrementality (not nightly batches)
  • ✅ Declarative, observable dataflows with end-to-end lineage
  • ✅ Multi-target sync (vector DBs, graphs, OLTP stores as peers, not afterthoughts)
  • ✅ Durable execution for unreliable APIs
  • ✅ Native tooling for debugging and observability

CocoIndex is one concrete instantiation of these ideas. But the underlying principle is universal: AI systems demand ETL primitives that treat change, uncertainty, and heterogeneity as first-class concerns.


What's your biggest pain point in AI data pipelines? Are you dealing with stale embeddings, schema drift, or just constant re-engineering of ETL jobs? Drop a comment—I'd love to hear how you're solving this.

For more on building scalable AI data infrastructure, explore CocoIndex and check out our technical blog and documentation.
