XIANGWEIXIAO
DuckDB in the Wild: What 6 Minutes of Benchmarking Across 4 Machines Taught Me About Real-World Performance


I care more about "can we ship this?" than "is this theoretically optimal?"

When I pick data tools, I usually ask three questions:

  • Will it run fast enough on the hardware we actually have?
  • How much does object storage overhead really cost us?
  • Can I estimate these things without building a full test lab?

With that mindset, I ran the same DuckDB workload across four different machines, comparing local disk against S3-compatible object storage. The goal was not to crown a winner. It was to calibrate my intuition for what "fast enough" looks like in practice.


Why I ran these tests

I had two recurring questions while working with DuckDB:

  1. Same SQL, different machine — why does it feel so much slower on some boxes? Is it CPU, disk, or something else entirely?
  2. Local folder vs object storage — where does the extra time actually go?

So I ran the same job on a MacBook, an Ubuntu server, a Windows workstation, and a small cloud instance. I kept the query identical, the data identical, and the DuckDB version identical. Everything else (CPU, memory, filesystem, network) I let vary.

This is not science. It is reconnaissance.


What I held constant (and what I did not)

This is not a lab benchmark. It is an engineering gut-check: same query, different machines, different storage. The numbers are messy, the comparisons are unfair, and yet they are surprisingly useful.

Same across all runs:

  • DuckDB version: v1.5.2 (Variegata)
  • Data scale:
    • Full dataset: ~64.5M rows, ~13.7 GB source, ~3.5 GB output
    • Sample dataset: ~18M rows, ~4.0 GB source, ~1.0 GB output
  • Job semantics: identical ELT pattern (read → transform → write)

Storage setups:

| Storage | Deployment | What this actually tests |
|---|---|---|
| Local disk | Native filesystem | Baseline: no network, no protocol overhead |
| MinIO | Single-node, same machine as DuckDB | S3 protocol overhead without network latency |
| Alibaba Cloud OSS | Cloud VM via internal network | Cloud CPU + memory + network combined |

Important caveat: I did not test "MinIO on a different machine." From experience, cross-machine object storage introduces bandwidth and latency costs that would push the overhead well beyond what I measured here.


Full dataset results: four machines, two storage types

How to read the numbers:

  • Wall time: total elapsed time from start to finish
  • Rough read MB/s: (source GB × 1024) ÷ wall time

This is not raw disk throughput. It includes Parquet parsing, SQL execution, and write operations. Think of it as "end-to-end productivity" rather than hardware specs.
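The formula above can be sanity-checked with a tiny helper (Python here, purely for illustration; the inputs are the figures from the table below):

```python
def rough_read_mbps(source_gb: float, wall_seconds: float) -> float:
    """End-to-end throughput: (source GB x 1024) / wall time.
    Includes Parquet parsing, SQL execution, and writes -- not raw disk speed."""
    return source_gb * 1024 / wall_seconds

# macOS local-disk run: 13.7 GB in 43 s
print(round(rough_read_mbps(13.7, 43)))   # prints 326
# Windows local-disk run: same data in 331 s
print(round(rough_read_mbps(13.7, 331)))  # prints 42
```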

| Environment | Data source | Wall time | Rows processed | Rough MB/s | Source size | Output size |
|---|---|---|---|---|---|---|
| macOS | Local disk | 43s | 64.5M | ~326 | 13.7 GB | 3.5 GB |
| Ubuntu | Local disk | 58s | 64.5M | ~242 | 13.7 GB | 3.5 GB |
| Ubuntu | MinIO (same machine) | 62s | 64.5M | ~226 | 13.7 GB | 3.5 GB |
| Windows | Local disk | 331s | 64.5M | ~42 | 13.7 GB | 3.5 GB |

How I interpret these

macOS vs Ubuntu (both local disk)

The 15-second gap (43s vs 58s) likely comes from CPU microarchitecture and memory bandwidth. The Ubuntu box runs an older Xeon; the MacBook has a newer M-series chip with better vectorization. Same disk type, different compute muscle.

Ubuntu local vs MinIO (same machine)

58s → 62s is about a 7% overhead. That is the cost of S3 protocol parsing, HTTP client work, and MinIO's request handling — without any network latency. If MinIO were on a separate machine, expect significantly more.
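The ~7% figure is just relative slowdown; a one-line sketch (Python, illustrative only) makes the arithmetic explicit:

```python
def overhead_pct(baseline_s: float, measured_s: float) -> float:
    """Relative slowdown of the measured wall time vs the baseline, in percent."""
    return (measured_s - baseline_s) / baseline_s * 100

# Ubuntu: local disk 58 s vs same-machine MinIO 62 s
print(f"{overhead_pct(58, 62):.1f}%")  # prints 6.9%
```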

Windows local disk (331s)

This number is an outlier, and I treat it as such. Anti-malware scanning, background services, power management policies, and NTFS characteristics can all drag down I/O on Windows. I would not generalize from this single data point without testing in a clean environment first.


Sample dataset results: cloud + OSS vs MacBook

This comparison is intentionally unfair. I am putting it here anyway because unfair comparisons happen in the real world all the time.

Cloud VM specs:

| Attribute | Value |
|---|---|
| vCPU | 2 |
| Threads | 4 (2 cores × hyperthreading) |
| Memory | ~7.3 GB |
| Disk | Cloud SSD (ESSD class) |
| Network | Internal VPC to OSS |

Comparison baseline:

MacBook — Apple M5, 10 cores, 32 GB RAM, local NVMe (APFS)

| Environment | Data source | Wall time | Rows processed | Rough MB/s | Source size | Output size |
|---|---|---|---|---|---|---|
| Debian (2 vCPU cloud) | OSS (internal) | 61s | 18.0M | ~67 | 4.0 GB | 1.0 GB |
| macOS | Local disk | 23s | 18.0M | ~178 | 4.0 GB | 1.0 GB |

How I interpret this

The cloud instance is ~2.7× slower — but that is not a story about "cloud is slow." It is a story about 2 vCPU + 7 GB RAM + network versus 10 cores + 32 GB RAM + local NVMe.

I did not profile deeply enough to separate CPU time from I/O wait from network latency. So I cannot tell you which factor matters most. What I can tell you: if someone hands you a 2-core cloud VM and asks for a performance estimate, 2–3× slower than a modern laptop is a reasonable working assumption.
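That working assumption can be turned into a back-of-the-envelope estimator. A sketch (Python; the 2–3× range and the 23-second laptop baseline come from the runs above):

```python
def estimate_cloud_wall_time(laptop_seconds: float,
                             slowdown_range=(2.0, 3.0)) -> tuple[float, float]:
    """Rough wall-time range for a small (2 vCPU) cloud VM, given the same
    job's wall time on a modern laptop and an assumed slowdown factor range."""
    lo, hi = slowdown_range
    return laptop_seconds * lo, laptop_seconds * hi

# Laptop ran the sample job in 23 s; the observed cloud time was 61 s
low, high = estimate_cloud_wall_time(23)
print(f"expect {low:.0f}-{high:.0f} s")  # prints "expect 46-69 s"
```

The measured 61 seconds lands inside that 46–69 second window, which is about as much validation as a rule of thumb deserves.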


Hardware inventory

For those who want to reproduce or sanity-check:

| Machine | OS | Kernel | CPU | Cores / Threads | RAM | Primary storage |
|---|---|---|---|---|---|---|
| MacBook | macOS 26.4 | n/a | Apple M5 | 10 / 10 | 32 GB | Apple NVMe SSD (APFS) |
| Workstation | Windows 10 Pro | n/a | Intel i7-10750H | 6 / 12 | ~64 GB | SK Hynix SSD (NTFS) |
| Server | Ubuntu 24.04.3 | 6.17.0-19 | Xeon E5-2698 v3 | 16 / 32 | ~188 GB | Samsung NVMe (ext4) |
| Cloud VM | Debian 11 | 5.10.0-26 | Xeon Platinum (virtual) | 2 / 4 | ~7.3 GB | Cloud SSD (ESSD) |

Minimal reproduction code

Here are the exact patterns I used, in case you want to run your own gut-check.

From MinIO (S3-compatible, local):

```sql
INSTALL httpfs;
LOAD httpfs;

CREATE OR REPLACE SECRET (
    TYPE s3,
    KEY_ID 'your_access_key',
    SECRET 'your_secret_key',
    REGION 'us-east-1',
    ENDPOINT '127.0.0.1:9000',
    URL_STYLE 'path',
    USE_SSL false  -- a local MinIO endpoint speaks plain HTTP; without this, DuckDB attempts HTTPS
);

COPY (
    SELECT *
    FROM read_csv(
        's3://bucket/raw/**/*.csv',
        ignore_errors = true,
        filename = true,
        union_by_name = true,
        sample_size = 200
    )
) TO 's3://bucket/output/result.parquet' (FORMAT PARQUET);
```

From local filesystem:

```sql
COPY (
    SELECT *
    FROM read_csv_auto(
        '/path/to/data/**/*.csv',
        ignore_errors = true,
        filename = true,
        union_by_name = true,
        sample_size = 200
    )
) TO '/path/to/output/result.parquet' (FORMAT PARQUET);
```

From Alibaba Cloud OSS:

```sql
INSTALL httpfs;
LOAD httpfs;

CREATE OR REPLACE SECRET (
    TYPE s3,
    KEY_ID 'your_oss_key',
    SECRET 'your_oss_secret',
    REGION 'cn-hangzhou',
    ENDPOINT 'oss-cn-hangzhou.aliyuncs.com'
);

COPY (
    SELECT *
    FROM read_csv(
        's3://bucket/raw/**/*.csv',
        ignore_errors = true,
        filename = true,
        union_by_name = true,
        sample_size = 200
    )
) TO 's3://bucket/output/result.parquet' (FORMAT PARQUET);
```

What I actually learned

Three things stick with me:

  1. Same query, same data, 8× wall-time spread — from 43 seconds to 331 seconds. Hardware and environment matter more than I sometimes remember.

  2. Local MinIO adds ~7% overhead when co-located. That is low enough that protocol compatibility is worth it for my use cases. Cross-machine MinIO would be a different story.

  3. Unfair comparisons are still useful — as long as you label them honestly. A 2-core cloud VM being 2–3× slower than a modern laptop is not surprising, but having a concrete number helps with capacity conversations.


What this means for picking your setup

I would not choose hardware based on these numbers alone. But I would use them to:

  • Set expectations: If someone proposes a 2-core cloud instance for heavy batch work, 60+ seconds for 4 GB of data is a realistic starting assumption.
  • Justify local object storage: a same-machine MinIO instance is cheap to run and S3-compatible, so the ~7% tax is usually acceptable.
  • Flag outliers early: That 331-second Windows run tells me "something else is going on here" — whether I investigate further depends on whether that machine actually matters for production.
