S3 as a disk for agents: geesefs vs s3fs, measured

#s3 #fuse #agents #infra

Most infra today is built for a human to click through. That's ending. The thing reading your docs, hitting your API, and filling your disk next year is an agent, not a person. And an agent doesn't have a laptop. It has a Linux box and a network.

So when you design for agents, two assumptions hold. They run on Linux. And the only storage layer that's actually everywhere — every cloud, every region, every cheap S3-compatible box in a closet — is S3. Not a managed filesystem, not a vendor volume. S3 won by being the lowest common denominator that the machine can reach from anywhere.

Which leads to the question we kept circling. What if S3 isn't an SDK call, but a disk? Mounted into the VM, inside our VPC, so the agent writes to /mnt/data and the bytes land in object storage with no code that knows S3 exists.

That matters because of how agents actually write code. They call open(), pandas.read_csv("/mnt/..."), they shell out to rsync and git. They do not reach for boto3 unless you force them. Mount the bucket and every POSIX tool speaks S3 for free. That's the point.

Two projects do this seriously over FUSE: s3fs-fuse and geesefs. We put both on the same VM, same bucket, same workload, and measured.

The setup

A 4-vCPU / 4 GB Linux VM, Ubuntu 24.04. Tigris as the S3 backend. Latest of each engine: geesefs v0.43.7, s3fs v1.97. apt still ships s3fs 1.93, so we built 1.97 from source. Same 128 MB sequential file, same 500 × 8 KiB small files, five runs, cold and warm. No on-disk cache on either side, so a remount is a true cold start.
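The small-file half of that workload is easy to reproduce. What follows is a minimal sketch, not the harness behind the numbers in this post: the function name `write_small_files` and the file naming scheme are made up for illustration, and it only uses plain POSIX file calls, so it runs the same against a mount or a local directory.

```python
import os
import time


def write_small_files(root, count=500, size=8 * 1024):
    """Write `count` files of `size` bytes under `root`; return files per second."""
    os.makedirs(root, exist_ok=True)
    payload = os.urandom(size)  # random bytes so nothing compresses away
    start = time.perf_counter()
    for i in range(count):
        path = os.path.join(root, f"f{i:04d}.bin")
        # The timing includes close(), which is where a FUSE layer may
        # block on the upload or hand it off to the background.
        with open(path, "wb") as fh:
            fh.write(payload)
    elapsed = time.perf_counter() - start
    return count / elapsed
```

Point `root` at a fresh directory under each mount, for example `write_small_files("/mnt/g/bench-1")`, and run it several times. A single run says little on a network filesystem.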

The mount commands — the part everyone gets wrong first:

# s3fs
s3fs my-bucket /mnt/s -o url=https://t3.storage.dev \
  -o use_path_request_style -o passwd_file=/etc/passwd-s3fs

# geesefs  (--list-type 2 is mandatory on non-Yandex S3; forget it and listing breaks)
AWS_ACCESS_KEY_ID=... AWS_SECRET_ACCESS_KEY=... \
  geesefs --endpoint https://t3.storage.dev --list-type 2 my-bucket /mnt/g

Mount a sub-prefix with s3fs and it 404s unless that prefix already exists as a directory marker — you need -o compat_dir. geesefs mounts a prefix natively. First gotcha, before a single byte moves.

The numbers

Sequential throughput is a tie. ~45–50 MB/s on write and on cold read, both engines. That number is the wire to S3, not the filesystem. If your benchmark is dd if=/dev/zero of=/mnt/big, you'll conclude the two are identical, and you'll be wrong about everything that matters.

On everything an agent actually does, it is not close:

workload	s3fs	geesefs	gap
small-file write	2.6–7 files/s	563 files/s	80–217×
small-file warm read	5–15 files/s	1188 files/s	78–233×
sequential warm read	61 MB/s	883 MB/s	14×
mount time	336 ms	139 ms	2×

The ranges are real: we ran two s3fs versions (1.93 from apt, 1.97 from source), and the gap held on both. Two mechanisms explain almost all of it.

Why

Small-file writes. s3fs does one synchronous PUT per file and waits for it. geesefs pipelines uploads in the background and returns. 500 tiny files is 500 serial round trips for s3fs; geesefs overlaps them. That's the 217×.
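The arithmetic behind serial versus overlapped requests can be shown without touching S3. This is a toy model, not either engine's code: `fake_put` just sleeps for an assumed round-trip time (`RTT`, an invented 20 ms), and a thread pool stands in for "many requests in flight". It does not model geesefs returning before the upload finishes, only the overlap.

```python
import time
from concurrent.futures import ThreadPoolExecutor

RTT = 0.02  # assumed per-request latency in seconds, purely illustrative


def fake_put(name):
    """Stand-in for one object upload: sleeps for one round trip."""
    time.sleep(RTT)
    return name


def upload_serial(names):
    """One request at a time, each waiting for the previous one to finish."""
    start = time.perf_counter()
    for name in names:
        fake_put(name)
    return time.perf_counter() - start


def upload_overlapped(names, workers=16):
    """Keep up to `workers` requests in flight at once."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(fake_put, names))  # drain the iterator to wait for all
    return time.perf_counter() - start
```

With N files the serial path costs roughly N round trips, while the overlapped path costs roughly N divided by the number of requests in flight. The payload size barely matters at 8 KiB; latency is the whole bill.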

Directory listing. geesefs fills file size and mtime straight from the LIST response it already made, so ls -la stays cheap and consistent — single-digit milliseconds warm. s3fs is the variable one: depending on version and config it either reads attributes from the listing or does a getattr per file, and when it does the latter, an ls -la over a few hundred files becomes a serial HEAD storm. We watched the same listing swing from sub-second on one s3fs build to ~11 seconds on another. Either way this is the operation that bites agents, because agents stat everything: os.listdir + os.stat, git status, rsync's whole model. A names-only ls is cheap. The moment a tool wants metadata, s3fs is the one that might fall off a cliff — and you won't see it in a dd test.
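To see which side of that cliff a given mount is on, time the two access patterns separately. A minimal sketch, with hypothetical helper names, that uses only `os.listdir` and `os.stat`, which is the same pattern many tools end up issuing:

```python
import os
import time


def list_names_only(directory):
    """Return (names, seconds) for a plain listing with no metadata."""
    start = time.perf_counter()
    names = os.listdir(directory)
    return names, time.perf_counter() - start


def list_with_stat(directory):
    """Return ({name: (size, mtime)}, seconds), doing one stat per entry."""
    start = time.perf_counter()
    meta = {}
    for name in os.listdir(directory):
        st = os.stat(os.path.join(directory, name))
        meta[name] = (st.st_size, st.st_mtime)
    return meta, time.perf_counter() - start
```

If the second number is orders of magnitude above the first on a directory of a few hundred files, each stat is probably turning into its own network request. That is an inference from timing, not a trace, so confirm it with the engine's debug logging before acting on it.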

The warm reads (14× sequential, 233× small) are the kernel page cache. geesefs lets Linux cache file pages; s3fs disables that by default and re-fetches from S3 on every read. Be precise about what this is: it is not "geesefs uses less RAM." Its cached data lives in the page cache, which doesn't even show up in the process's RSS. geesefs spends memory to make re-reads free. s3fs spends almost none and pays the network every time. For an agent that reads the same dataset in a loop, free re-reads are the whole game.
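The cold-versus-warm difference shows up if you read the same file several times in a row and time each pass. A rough sketch, assuming a hypothetical helper called `timed_reads`. The first pass only counts as cold if the file has not been read since the mount came up.

```python
import time


def timed_reads(path, repeats=3, chunk_size=1 << 20):
    """Read `path` end to end `repeats` times; return MB/s for each pass."""
    rates = []
    for _ in range(repeats):
        total = 0
        start = time.perf_counter()
        with open(path, "rb") as fh:
            while True:
                chunk = fh.read(chunk_size)
                if not chunk:
                    break
                total += len(chunk)
        elapsed = time.perf_counter() - start
        rates.append(total / elapsed / 1e6)  # decimal megabytes per second
    return rates
```

On a mount that keeps pages cached, the later passes should be far faster than the first. On one that re-fetches, all passes should sit near wire speed.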

The trade-off

geesefs is not a free win, and pretending otherwise is how you get paged at 3 AM.

Its cold directory listing is noisier: we saw a 0.3–5.7 s spread on the same ls. It buffers writes aggressively, so you run sync -f before you tear the mount down or you lose the tail. That bit us once; now it's reflex. It does not do concurrent multi-writer to one file. That's last-writer-wins with stale-read risk through each box's cache, so keep it to one writer or disjoint prefixes. And the async deletes that make it fast mean the objects removed by an rm -rf linger in the bucket for a while after the call returns.
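If the writer is your own code rather than a shell script, the same habit can live at the call site. A sketch under one loud assumption: that the FUSE layer treats fsync as "finish the upload before returning". Verify that against the engine you mount; we did not test it for this post. `write_and_fsync` is a hypothetical helper name.

```python
import os


def write_and_fsync(path, data):
    """Write `data` to `path` and ask the filesystem to flush it before returning."""
    with open(path, "wb") as fh:
        fh.write(data)
        fh.flush()             # push Python's userspace buffer to the kernel
        os.fsync(fh.fileno())  # ask the filesystem to persist it (assumed to
                               # mean "uploaded" on the mount; verify this)
    return len(data)
```

This trades away part of the write pipelining for the files that use it, so keep it for the outputs you cannot afford to lose and still sync the mount before tearing it down.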

s3fs earns its place when you want a single synchronous writer with no surprises, or you're streaming a few large objects and never re-reading them. Boring, predictable, slow on metadata. Sometimes boring is the requirement.

The takeaway

If the consumer is an agent on Linux — many small files, repeated reads, metadata-heavy tools — mount with geesefs and give it page-cache headroom. The 80–230× isn't a marketing number; it's synchronous-per-file versus pipelined, and re-fetch versus page cache — measured across two s3fs versions. If you only ever stream big blobs once and want the simplest possible failure mode, s3fs is fine.

S3 is the storage layer that's actually everywhere, and the machine consuming it doesn't care about your SDK. So give it a disk. Just pick the FUSE layer that matches how the thing on top does I/O — and measure the workload it'll really run, not dd.

DEV Community
