amir

Posted on May 19

From DeepSeek to Quack: When the Dream of Distributed DuckDB Started to Feel Real

#duckdb #dataengineering #ai #database

At the beginning of 2025, DeepSeek changed the conversation around AI infrastructure.

Most people focused on the model quality, the training cost, and the geopolitical story around a new Chinese AI lab suddenly competing with the biggest names in the industry. That part was interesting, of course. But as an engineer, the part that caught my attention was not only the model.

It was the data pipeline behind it.

DeepSeek released Smallpond, a lightweight data processing framework built on DuckDB and 3FS. The idea was surprisingly simple: instead of building everything around a traditional big-data engine like Spark, run many independent DuckDB-based processing jobs close to the data, partition the workload carefully, and let each local engine do what it does best.

That sounds almost too simple.

But that is exactly why it is interesting.

The uncomfortable question: do we always need Spark?

As a senior engineer, I have worked on systems where Spark was the default answer before the problem was even fully understood.

Need to process files? Use Spark.

Need to aggregate logs? Use Spark.

Need to transform Parquet? Use Spark.

Need to join medium-sized datasets? Still Spark.

Spark is powerful, and I am not arguing against it. But in many real projects, the operational cost becomes the hidden tax: cluster configuration, memory tuning, shuffle behavior, executor sizing, dependency packaging, job retries, monitoring, and the constant pain of debugging a distributed job that fails somewhere in the middle of a long DAG.

DuckDB sits on the opposite side of that spectrum.

It is embedded. It runs inside your process. It reads Parquet beautifully. It speaks SQL. It is fast for analytical workloads. And most importantly, it makes local data processing feel boring again.

That boring part is a compliment.

When I first started using DuckDB seriously, it replaced a lot of small Python scripts in my workflow. Instead of loading CSV or Parquet files into Pandas, fighting memory limits, and then exporting results again, I could write SQL directly over files:

SELECT customer_id, sum(amount) AS total_spent
FROM read_parquet('orders/*.parquet')
GROUP BY customer_id
ORDER BY total_spent DESC
LIMIT 20;

For one-off analysis, this is already great. But Smallpond suggested something bigger: what if DuckDB is not just a local helper tool, but the execution unit of a distributed data system?

Smallpond's lesson: distribute the plan, not the database

Smallpond is interesting because it does not try to turn DuckDB itself into a distributed database. Instead, it treats DuckDB as a fast local execution engine.

The pattern looks like this:

Split the dataset into partitions.
Send partitions to different workers.
Let each worker run DuckDB locally.
Write intermediate results back to shared storage.
Merge or repartition when needed.
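
The steps above can be sketched in a few lines of plain Python. This is only an illustration of the pattern, not Smallpond's API: the function names are made up for this example, a thread pool stands in for a cluster of workers, and `aggregate_partition` stands in for the DuckDB query each worker would run locally.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
import zlib


def partition_by_key(rows, num_partitions):
    """Hash-partition (customer_id, amount) rows so each key lands in exactly one partition."""
    partitions = [[] for _ in range(num_partitions)]
    for customer_id, amount in rows:
        # crc32 is stable across processes, unlike the built-in hash() for strings.
        index = zlib.crc32(customer_id.encode("utf-8")) % num_partitions
        partitions[index].append((customer_id, amount))
    return partitions


def aggregate_partition(partition):
    """Stand-in for the per-worker query: sum(amount) GROUP BY customer_id."""
    totals = defaultdict(float)
    for customer_id, amount in partition:
        totals[customer_id] += amount
    return dict(totals)


def run_job(rows, num_partitions=4):
    partitions = partition_by_key(rows, num_partitions)
    # Threads stand in for independent workers; each one sees only its own partition.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_results = list(pool.map(aggregate_partition, partitions))
    # Rows were partitioned by the grouping key, so partial results never
    # share a key and the merge step is a plain union.
    merged = {}
    for partial in partial_results:
        merged.update(partial)
    return merged


orders = [("alice", 10.0), ("bob", 5.0), ("alice", 2.5), ("carol", 7.0), ("bob", 1.0)]
totals = run_job(orders)
```

The design choice that matters here is partitioning by the grouping key: it is what makes the final merge trivial. Partition on anything else and the merge step has to re-aggregate.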

That is not a new idea in distributed systems, but using DuckDB as the local analytical core makes it feel lightweight.

In the Smallpond repository, the basic example is simple: read Parquet, repartition by a key, run SQL, and write Parquet output again. The README also reports a GraySort benchmark in which Smallpond sorted 110.5 TiB of data in a little over 30 minutes on a cluster of 50 compute nodes and 25 storage nodes running 3FS. That is a strong reminder that not every scalable system needs to look like the traditional Hadoop/Spark stack.

The deeper lesson for me is this:

Sometimes the best distributed architecture is not one giant distributed database. Sometimes it is thousands of small, predictable local engines coordinated well.

That idea maps nicely to modern AI pipelines.

Training a model is not only about GPUs. Before the GPU sees anything, there is a long chain of data cleaning, deduplication, filtering, tokenization, feature extraction, metadata joins, quality checks, and batch generation. A lot of that work is analytical. A lot of it is file-based. And a lot of it can be pushed close to storage.

DuckDB is very good at that style of work.

Where DuckDB used to hurt

DuckDB's strength has always been its simplicity: it is an in-process analytical database.

But the same simplicity creates a limitation.

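If you have one Python process, one CLI session, or one application working with a DuckDB database file, everything feels clean. But DuckDB lets only one process open a database file for writing at a time; several processes can share the file only in read-only mode. Once multiple processes want to write to the same database file, you quickly run into locking and concurrency constraints.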

That is not a bug. It is part of the design.

DuckDB was originally optimized for embedded OLAP workloads, not for being a shared multi-client server like PostgreSQL.

In my own projects, I usually solved this by avoiding shared writes completely:

write partitioned Parquet instead of writing into one shared database file
let each worker produce immutable output
use object storage as the coordination layer
run a final compaction or merge step later
keep DuckDB as the query engine, not the source of truth

This works well, but it has trade-offs. You start building your own small coordination layer. You need naming conventions, idempotent writes, retry logic, cleanup jobs, and sometimes a metadata database just to track what happened.
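
As a rough illustration of the naming-convention and idempotent-write parts of that coordination layer, here is a minimal sketch for a local filesystem. The path layout, the `write_partition` helper, and the use of JSON instead of Parquet are all assumptions made to keep the example dependency-free. Object stores offer different primitives than an atomic rename, so treat this as the shape of the idea rather than a drop-in implementation.

```python
import json
import os
import tempfile


def output_path(output_dir, job_id, partition_id):
    """Deterministic name: a retried task targets the same file instead of creating a duplicate."""
    return os.path.join(output_dir, f"job={job_id}", f"part-{partition_id:05d}.json")


def write_partition(output_dir, job_id, partition_id, rows):
    """Write one immutable output file; calling it again for the same partition is a no-op."""
    final_path = output_path(output_dir, job_id, partition_id)
    if os.path.exists(final_path):
        return final_path  # already committed by an earlier attempt
    target_dir = os.path.dirname(final_path)
    os.makedirs(target_dir, exist_ok=True)
    # Write to a temporary file in the same directory, then rename it into place,
    # so readers never observe a half-written output file.
    fd, tmp_path = tempfile.mkstemp(dir=target_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(rows, handle)
        os.replace(tmp_path, final_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return final_path


with tempfile.TemporaryDirectory() as root:
    rows = [{"customer_id": "alice", "total_spent": 12.5}]
    first = write_partition(root, "daily-2025-01-01", 3, rows)
    second = write_partition(root, "daily-2025-01-01", 3, rows)  # simulated retry
    files = os.listdir(os.path.dirname(first))
```

After the simulated retry there is still exactly one output file for the partition, which is the property that makes retries safe.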

That is why Quack is so interesting.

Quack: DuckDB starts speaking over the network

In 2026, Hannes Mühleisen introduced Quack, a remote protocol that turns DuckDB into a client-server database.

The idea is elegant: both the client and the server are DuckDB instances, but they communicate through the quack: protocol. The server owns the data. The client sends queries. The heavy work happens near the data, and the result comes back to the client.

A simplified example looks like this:

INSTALL quack FROM core_nightly;
LOAD quack;

CREATE SECRET (
    TYPE quack,
    TOKEN 'super_secret'
);

ATTACH 'quack:bigserver:9494' AS remote;

SELECT customer_id, sum(amount) AS total_amount
FROM remote.transactions
GROUP BY customer_id
ORDER BY total_amount DESC
LIMIT 10;

This is not just a nicer syntax. It changes the deployment model.

Before Quack, DuckDB was mostly local-first. With Quack, DuckDB can become remote-first when needed.

That means:

the data can stay on a powerful server
laptop clients can query without downloading huge datasets
multiple clients can connect to the same DuckDB server
DuckDB can be used in more traditional application architectures
DuckDB-Wasm and browser-based analytical tools become more interesting

The official DuckDB documentation describes Quack as an RPC protocol for DuckDB and mentions use cases like concurrent read-write access, moving computation closer to data, and querying powerful servers from local clients.

For me, the key phrase is: compute near data.

That is one of the most important ideas in data engineering.

Moving 500 GB to a laptop is a bad plan. Sending a SQL query to the machine that already has the data is a better plan.
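
A back-of-the-envelope calculation makes the difference concrete. The 1 Gbit/s link and the 1 MB result size are assumptions picked for illustration, and the formula ignores protocol overhead, so real transfers will be slower.

```python
def transfer_seconds(size_gb, link_gbit_per_s):
    """Ideal transfer time for decimal gigabytes over a link measured in gigabits per second."""
    return size_gb * 8 / link_gbit_per_s


# Copying the whole dataset to the laptop over an assumed 1 Gbit/s link.
dataset_minutes = transfer_seconds(500, 1.0) / 60

# Shipping back only an aggregated result, assumed here to be about 1 MB (0.001 GB).
result_seconds = transfer_seconds(0.001, 1.0)
```

Under these assumptions the full copy takes over an hour before the first query can even start, while the aggregated result arrives in a few milliseconds.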

A prototype I would actually build

The first thing I wanted to try with this architecture was not a huge AI training pipeline. It was something more realistic: a lightweight analytics service for event data.

Imagine this setup:

application events are written as Parquet files to object storage
a small ingestion service batches new events
DuckDB reads and validates those files locally
embeddings are generated for selected text fields
analytical metadata is stored in DuckDB or DuckLake
vector search is handled by a dedicated vector database
Quack exposes the central DuckDB instance to internal tools

This kind of architecture is attractive because each tool does one job well.

DuckDB is great for analytical SQL.

Object storage is great for cheap durable files.

A vector database is great for similarity search.

Quack becomes the bridge that lets multiple clients query the analytical layer without copying everything locally.

Where vector databases fit into this story

A vector database stores and indexes embeddings rather than only rows and columns.

An embedding is a numerical representation of text, image, audio, code, or another object. For example, a support ticket like:

“The payment failed after I changed my billing address.”

can be converted into a vector such as:

[0.012, -0.441, 0.087, ...]

The individual numbers are not meaningful to humans, but the vector's position in the embedding space captures semantic meaning. Texts with similar meaning produce vectors that are close to each other.

That enables queries like:

Find tickets semantically similar to this new complaint.

Traditional SQL is not designed for that kind of similarity search. SQL is excellent when you know the exact fields and predicates:

WHERE status = 'failed'
AND country = 'AM'

Vector search is different:

Find documents close to this embedding.

This is why systems like Qdrant, Milvus, Weaviate, and Pinecone, along with the pgvector extension for PostgreSQL, became popular. They use approximate nearest-neighbor indexes such as HNSW or IVF to make similarity search fast.
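
To make "close to each other" concrete, here is a minimal brute-force sketch of what those indexes approximate: score every stored vector against the query and keep the best matches. The three-dimensional vectors and ticket names are invented for the example, since real embeddings have hundreds or thousands of dimensions. Cosine similarity is one common distance choice, not the only one.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: values near 1.0 mean they point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query, documents, k=2):
    """Exact nearest-neighbor search by scanning every document vector."""
    scored = [(cosine_similarity(query, vector), doc_id) for doc_id, vector in documents.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Toy embeddings, invented for illustration.
tickets = {
    "payment_failed": [0.9, 0.1, 0.0],
    "billing_address": [0.8, 0.3, 0.1],
    "password_reset": [0.0, 0.2, 0.9],
}
query = [0.85, 0.2, 0.05]
top = nearest(query, tickets, k=2)
```

A full scan like this costs time proportional to the number of stored vectors for every query. That cost is what HNSW and IVF avoid, at the price of returning approximate rather than exact neighbors.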

But here is the important part: vector search alone is rarely enough.

In production, you usually need hybrid retrieval:

vector similarity for semantic meaning
SQL filters for structured constraints
full-text search for exact keywords
metadata joins for permissions, customers, time ranges, or product categories
analytical queries to evaluate quality and drift
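
A small sketch of how those layers combine: the vector search returns scored candidates, and structured filters then decide which of them the caller is actually allowed to see. Everything here is hypothetical, including the `Candidate` type, the in-memory `METADATA` dictionary standing in for a real metadata table, and the ticket data.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Candidate:
    doc_id: str
    score: float  # similarity score returned by the vector search


# Hypothetical metadata, standing in for a table that would live in the analytical layer.
METADATA = {
    "t1": {"customer": "acme", "created": date(2025, 5, 1), "text": "payment failed after billing address change"},
    "t2": {"customer": "globex", "created": date(2025, 5, 3), "text": "payment failed at checkout"},
    "t3": {"customer": "acme", "created": date(2024, 11, 2), "text": "payment page timed out"},
    "t4": {"customer": "acme", "created": date(2025, 5, 10), "text": "refund for duplicate payment"},
}


def hybrid_filter(candidates, customer, since, keyword):
    """Keep candidates that pass the structured filters, ordered by similarity score."""
    kept = []
    for candidate in candidates:
        row = METADATA.get(candidate.doc_id)
        if row is None:
            continue  # no metadata means we cannot check permissions
        if row["customer"] != customer:
            continue  # permission / tenant filter
        if row["created"] < since:
            continue  # time-range filter
        if keyword.lower() not in row["text"].lower():
            continue  # exact keyword filter
        kept.append(candidate)
    return sorted(kept, key=lambda c: c.score, reverse=True)


candidates = [Candidate("t4", 0.52), Candidate("t1", 0.93), Candidate("t2", 0.91), Candidate("t3", 0.88)]
results = hybrid_filter(candidates, customer="acme", since=date(2025, 1, 1), keyword="payment")
result_ids = [c.doc_id for c in results]
```

Note that the second-best semantic match (`t2`) is dropped because it belongs to another customer. Similarity alone cannot make that decision.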

That is where DuckDB becomes valuable again.

I do not want my vector database to become my entire analytics platform. I want it to retrieve candidates. Then I want SQL to inspect, filter, aggregate, evaluate, and debug the system.

For example:

SELECT source, count(*) AS total, avg(score) AS avg_score
FROM retrieval_logs
WHERE created_at >= now() - INTERVAL '7 days'
GROUP BY source
ORDER BY avg_score DESC;

This kind of query belongs naturally in DuckDB or a lakehouse layer, not inside the vector database.

Why Quack makes this architecture cleaner

Without Quack, I would normally run DuckDB locally inside each service and write files back to object storage. That is still a good pattern. But it makes interactive querying harder.

With Quack, I can imagine a cleaner workflow:

ETL workers process raw data locally with DuckDB.
Processed files are written to object storage.
A central DuckDB/DuckLake server exposes curated tables.
Internal tools connect through Quack.
BI dashboards query the same analytical layer.
Vector search services write retrieval logs back into the lake.
Engineers debug everything with SQL.

This is not a replacement for every data warehouse. It is not a replacement for Kafka, Spark, PostgreSQL, or a vector database.

But it is a powerful middle layer.

It fits the space where many teams actually live: too much data for one Pandas script, but not enough scale to justify the operational complexity of a full big-data platform.

The part I like most as an engineer

What I like about this direction is that it shows mechanical sympathy: it works with the strengths of the underlying engine instead of against them.

DuckDB is fast because it understands analytical execution: vectorized processing, columnar storage, efficient scans, smart Parquet reads, and local execution without network overhead.

Smallpond says: keep that local execution model, but run it many times in parallel.

Quack says: keep DuckDB's engine, but allow it to communicate when a shared server model is useful.

That is a healthy evolution.

It does not throw away the original design. It extends it.

As someone who has spent too much time debugging over-engineered pipelines, I appreciate systems that scale by composition instead of magic.

What I would be careful about

I would still be cautious before using Quack as the core of a production system today.

The official page describes it as a beta release. That matters.

Before depending on it, I would test:

write concurrency under real workload
authentication and network exposure
backup and restore strategy
failure behavior during long-running writes
compatibility with existing DuckDB extensions
observability and query logging
behavior behind load balancers or proxies
performance with many small writes versus large batches

I would also avoid pretending DuckDB suddenly became PostgreSQL.

DuckDB is still fundamentally an analytical engine. Even if Quack makes multi-client access possible, I would not immediately use it for high-volume OLTP workloads like payments, orders, or user sessions.

For those, PostgreSQL is still the boring and correct answer.

But for analytical workloads, internal dashboards, data pipelines, AI preprocessing, evaluation datasets, batch transformations, and lakehouse-style metadata, Quack opens a very interesting door.

Final thought

The story from DeepSeek's Smallpond to DuckDB's Quack is not just about one tool becoming distributed.

It is about a shift in how we think about data systems.

For years, the default answer to scale was often: use a bigger distributed framework.

Now we are seeing another pattern:

keep compute simple
keep data in open formats
run fast local engines near the data
coordinate through lightweight protocols
use specialized systems where they actually make sense

That is why this space is exciting.

DuckDB made local analytics feel simple.

Smallpond showed that many local DuckDB jobs can become a serious distributed processing pattern.

Quack now makes DuckDB instances talk to each other.

And when you combine that with object storage, DuckLake, Parquet, and vector databases, you get a very pragmatic architecture for modern AI and data engineering.

Not because it is trendy.

Because it removes unnecessary complexity.

DEV Community