We’ve been building CocoIndex, an ultra-performant real-time data transformation framework for AI, and the response from the developer and data engineering community has been incredible. This week, CocoIndex officially surpassed 1,000 stars on GitHub!
🎉 A huge thank you to everyone who has starred, forked, contributed, or shared the project — your support is helping shape the future of open-source AI data infrastructure. ❤️
CocoIndex is a high-performance, real-time data transformation framework designed for modern AI workflows. Built on a blazing-fast Rust engine, CocoIndex simplifies preparing fresh, continuously updated data for AI applications — including embedding generation, knowledge graph construction, and other complex data transformations. It enables developers to build real-time data pipelines that go far beyond the limitations of traditional SQL-based systems.
The philosophy is to have the framework handle source updates, while developers only worry about defining a series of data transformations, an approach inspired by spreadsheets.
Here's a bit of history about our open-source journey:
Data Flow Programming
Unlike conventional workflow orchestration frameworks—where data is typically treated as an opaque payload—CocoIndex elevates data and data transformations to first-class citizens within its architecture. Built around the Dataflow programming model, CocoIndex enforces a design where each transformation is a pure, stateless function that produces a new output field exclusively from its input fields, without relying on hidden states or mutating existing data.
This approach guarantees deterministic and reproducible transformations that can be fully observed and audited. Every intermediate state of the data, both before and after each transformation, is completely visible and traceable, enabling automatic data lineage tracking. This lineage provides deep insights into how data flows through the pipeline, making it easier to debug, troubleshoot, and optimize real-time data pipelines.
By embracing this model, CocoIndex empowers developers to build transparent, incremental, and highly performant AI data pipelines where data freshness and accuracy are maintained effortlessly. This visibility is critical for AI applications that depend on reliable and up-to-date information, such as embedding generation, knowledge graph construction, and complex data enrichment workflows.
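As an illustration of what "pure and stateless" means here, consider a tiny transformation written as a plain Python function. This is a hypothetical example for intuition, not a CocoIndex built-in: its output depends only on its input field, so the same input always yields the same output, and nothing is mutated.

# Illustrative only: a pure, stateless transformation in the dataflow sense.
# `extract_title` is a hypothetical example, not part of the CocoIndex API.
def extract_title(content: str) -> str:
    """Derive a new output field solely from the input field."""
    for line in content.splitlines():
        stripped = line.strip()
        if stripped:
            return stripped.lstrip('#').strip()
    return ''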
For example, here is a typical data flow: Parse files -> Data Mapping -> Data Extraction -> Knowledge Graph.
In particular, users don't define data operations like creation, update, or deletion. Instead, they declare, for a set of source data, the transformation or formula to apply. The framework takes care of the data operations, such as when to create, update, or delete.
With operations like:
# ingest
data['content'] = flow_builder.add_source(...)

# transform
data['out'] = (
    data['content']
    .transform(...)
    .transform(...)
)

# collect data
collector.collect(...)

# export to db, vector db, graph db ...
collector.export(...)
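To make this more concrete, here is a fuller sketch in the spirit of the CocoIndex quickstart, building a text-embedding pipeline. Treat the specific names (cocoindex.sources.LocalFile, cocoindex.functions.SplitRecursively, cocoindex.functions.SentenceTransformerEmbed, cocoindex.storages.Postgres, and their parameters) as assumptions that may differ across versions; consult the official docs for the current API.

import cocoindex

# Sketch of a flow definition; module and parameter names are assumptions
# based on the CocoIndex quickstart and may differ in current releases.
@cocoindex.flow_def(name="TextEmbedding")
def text_embedding_flow(flow_builder: cocoindex.FlowBuilder,
                        data_scope: cocoindex.DataScope):
    # ingest: watch a local directory of markdown files
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(path="markdown_files"))

    doc_embeddings = data_scope.add_collector()
    with data_scope["documents"].row() as doc:
        # transform: chunk each document, then embed each chunk
        doc["chunks"] = doc["content"].transform(
            cocoindex.functions.SplitRecursively(),
            language="markdown", chunk_size=2000, chunk_overlap=500)
        with doc["chunks"].row() as chunk:
            chunk["embedding"] = chunk["text"].transform(
                cocoindex.functions.SentenceTransformerEmbed(
                    model="sentence-transformers/all-MiniLM-L6-v2"))
            # collect one row per chunk
            doc_embeddings.collect(
                filename=doc["filename"], location=chunk["location"],
                text=chunk["text"], embedding=chunk["embedding"])

    # export: the framework keeps the target table in sync with the source
    doc_embeddings.export(
        "doc_embeddings", cocoindex.storages.Postgres(),
        primary_key_fields=["filename", "location"])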
Data Freshness
As a cutting-edge data framework, CocoIndex significantly enhances data freshness by leveraging incremental processing—a core feature that ensures only changed data is processed and updated, enabling faster, more efficient real-time data pipelines.
The framework takes care of:
- Change data capture
- Figuring out exactly what needs to be updated, and updating only that, without recomputing everything from scratch.
This makes it fast to reflect any source update in the target store. If you are worried about surfacing stale data to AI agents and are spending lots of effort on infrastructure to optimize latency, the framework handles that for you.
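For intuition, here is a generic sketch of the incremental idea: fingerprint each source item and recompute only the items whose fingerprint changed. This illustrates the concept in plain Python; it is not CocoIndex's actual engine, which also handles change capture and fine-grained dependency tracking for you.

import hashlib

# Generic illustration of incremental processing (not CocoIndex internals):
# remember a fingerprint per source item, recompute only on change.
_cache: dict[str, tuple[str, str]] = {}  # key -> (content_hash, last_output)

def process_incrementally(key: str, content: str, transform) -> str:
    digest = hashlib.sha256(content.encode()).hexdigest()
    cached = _cache.get(key)
    if cached is not None and cached[0] == digest:
        return cached[1]           # unchanged: reuse the previous result
    output = transform(content)    # new or changed: recompute just this item
    _cache[key] = (digest, output)
    return output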
Built-in modules and custom modules
At CocoIndex, we recognize that data preparation is highly use-case specific, and there is no one-size-fits-all solution for every AI workflow. With rapid advancements in AI models and data technologies, selecting the best tools involves balancing multiple factors such as data quality, performance, and cost efficiency.
The data infrastructure ecosystem offers a wide variety of specialized tools — including parsers, embedding generation models, vector databases, graph databases, and more — each optimized for different goals and scenarios.
Instead of building everything from scratch, CocoIndex adopts a modular, composition-based approach. We provide native plugin support that enables seamless integration with your preferred data tools and services. By standardizing plugin interfaces, CocoIndex allows developers to easily plug in, swap, or upgrade components like building blocks, just like LEGO. This flexibility empowers you to tailor your AI data pipelines for maximum efficiency and adaptability.
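As an illustration of what a standardized, swappable interface can look like, here is a hypothetical sketch (not CocoIndex's actual plugin API): any embedding backend that satisfies the same protocol can be plugged in or swapped out without touching the rest of the pipeline.

from typing import Protocol

# Hypothetical sketch of a standardized plugin interface; not the actual
# CocoIndex plugin API. Any backend with this shape is interchangeable.
class EmbeddingBackend(Protocol):
    def embed(self, texts: list[str]) -> list[list[float]]: ...

class ToyHashEmbedding:
    """A trivial stand-in backend, only here to show the swap-in pattern."""
    def __init__(self, dim: int = 8):
        self.dim = dim

    def embed(self, texts: list[str]) -> list[list[float]]:
        import hashlib
        return [[b / 255.0 for b in hashlib.sha256(t.encode()).digest()[: self.dim]]
                for t in texts]

def embed_corpus(backend: EmbeddingBackend, texts: list[str]) -> list[list[float]]:
    return backend.embed(texts)  # swap ToyHashEmbedding for any other backend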
This lets us focus on what we do best: building the best incremental pipeline and compute engine for the AI era.
Tooling
In addition to the standard statistics and reports that ship with CocoIndex, we are building a product called CocoInsight. It has zero pipeline data retention and connects to your on-premises CocoIndex server for pipeline insights. This makes your data directly visible and makes ETL pipelines much easier to develop.
As mentioned earlier, CocoIndex is heavily influenced by spreadsheets. This is directly reflected in CocoInsight: you can see all the data in a spreadsheet-like view, and how it looks after each transformation.
Here is an example of a mapping view:
In addition to data insights for data transformations, we also aim to provide tooling that makes LLM transformations directly visible and easy to understand, for example:
- Understanding and troubleshooting chunking (when present), and debugging why a particular chunk shows up in search results
- Understanding relation extraction for knowledge graphs
- and more
Data should be transparent, especially for ETL frameworks.
Support us
We are constantly improving, and more features and examples are coming soon. If you enjoyed this article, please drop us a star ⭐ on the GitHub repo to help us grow. You can also find us on Discord; I try to be there 24/7 😊 Looking forward to discussing CocoIndex, or any topic in Data/AI infra, with you!