Over the past year my team and I have been building an AI product that needed to serve large LLM model files reliably, quickly, and privately.
We assumed the existing tooling would “just work”:
- Git LFS
- Hugging Face repos
- S3 / MinIO
- generic object stores
But once we started working with multi‑GB safetensors, gguf, ONNX, and datasets, everything broke in predictable and painful ways.
This post explains the technical journey that led us to build Anvil — an open‑source, S3‑compatible, AI‑native object store built in Rust — and how we designed it around:
- Tensor‑level streaming
- Model‑aware indexing
- QUIC transport
- Erasure‑coded distributed storage
- Simple Docker deployment
- Multi‑region clustering
- gRPC APIs + S3 compatibility
And why we decided to open source the entire project (Apache‑2.0).
The Pain That Set This All In Motion
Git LFS
Failed repeatedly on multi‑GB model files: corruption, slow diffs, weird retry loops.
Hugging Face
Amazing for public hosting — but for private/internal models:
- rate limits
- slow downloads
- no control over the infra
- not ideal for production workloads.
S3 / MinIO
Rock‑solid for normal object storage, but:
- treats model files as “dumb blobs”
- no safetensor/gguf indexing
- no tensor‑level streaming
- full downloads required before inferencing
- expensive when replication is used for durability
Our own app’s needs
We have users on:
- machines with 4–8GB VRAM
- laptops needing local/offline inference
- mobile‑adjacent devices
- distributed clusters needing fast warm starts
We could not afford a 5–15GB full model download on every startup.
We needed inference to start instantly.
That’s when we realized:
Object stores were never built for AI workloads.
We needed something model‑aware.
Enter Anvil — What We Ended Up Building
GitHub Repo: https://github.com/worka-ai/anvil
Docs: https://worka.ai/docs/anvil/getting-started
Landing: https://worka.ai/anvil
Release: https://github.com/worka-ai/anvil/releases/latest
Anvil started as an internal hack.
It’s now a complete, distributed object store built for ML systems.
At a high level, Anvil is:
- fully S3-compatible
- fully gRPC-native
- simple (Docker-first) to run
- built in Rust
- open-source (Apache‑2.0)
- model-aware (safetensors, gguf, onnx)
- supports tensor-streaming for partial inference loads
- supports erasure coding (Ceph-style)
- clusterable (libp2p gossip + QUIC)
- multi-region with isolated metadata
Let’s dive into the internals.
Model‑Aware Indexing (safetensors / gguf / onnx)
This is one of the core innovations.
When a model file is uploaded, Anvil automatically indexes:
- tensor names
- byte offsets
- dtypes
- shapes
- file segments
- metadata
This allows the client to do:
from anvilml import Model
m = Model("s3://models/llama3.safetensors")
q_proj = m.get_tensor("layers.12.attn.q_proj.weight")
No full download.
No giant memory spike.
Just one tensor.
Why this matters
It enables:
- partial inference on underpowered devices
- instant warm starts
- cold start reduction by ~12×
- efficient multi‑variant fine‑tune workflows
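For a sense of where that index comes from: safetensors files begin with a JSON header that already lists every tensor's name, dtype, shape, and byte offsets, so an index can be built without touching the weights themselves. Here is a minimal sketch of that parsing step (the file path is illustrative; Anvil does the equivalent server-side at upload time):

import json
import struct

def index_safetensors_header(path):
    """Build a tensor index (name -> dtype, shape, byte range) without reading any tensor data."""
    with open(path, "rb") as f:
        # First 8 bytes: little-endian u64 length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))

    index = {}
    for name, info in header.items():
        if name == "__metadata__":  # optional free-form metadata entry
            continue
        begin, end = info["data_offsets"]  # offsets relative to the data section
        index[name] = {
            "dtype": info["dtype"],
            "shape": info["shape"],
            # Absolute byte range within the file: 8-byte length prefix + header + offset.
            "range": (8 + header_len + begin, 8 + header_len + end),
        }
    return index

idx = index_safetensors_header("llama3.safetensors")
print(idx["layers.12.attn.q_proj.weight"])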
Tensor‑Level Streaming Over QUIC
Instead of downloading the entire model file:
- Use the tensor index
- Open a QUIC stream
- Fetch only the byte ranges needed
- Feed directly into the ML framework
This results in:
🟢 Cold start: 37.1s → 2.9s on a real 3B model
🟢 Data transferred: 6.3GB → 128MB
🟢 CPU and memory: way lower
QUIC gives us:
- multiplexing
- congestion control
- lower latency
- less TLS handshake overhead than HTTP/2 over TCP
And QUIC adoption keeps growing for exactly this kind of high-throughput, latency-sensitive transfer.
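Because Anvil also speaks the S3 API, the same partial-read idea can be approximated from any ordinary S3 client with a ranged GET; the anvilml SDK automates this and, per the design above, moves the bytes over QUIC instead of plain HTTP. A rough sketch with boto3, where the byte offset, dtype, and shape are illustrative values that would normally come from the tensor index:

import boto3
import numpy as np

# Ranged read against Anvil's S3-compatible endpoint.
s3 = boto3.client("s3", endpoint_url="http://localhost:9000")

start = 100_000_000            # hypothetical byte offset reported by the index
length = 4096 * 4096 * 2       # one fp16 tensor of shape (4096, 4096)

resp = s3.get_object(
    Bucket="models",
    Key="llama3.safetensors",
    Range=f"bytes={start}-{start + length - 1}",  # HTTP range ends are inclusive
)
raw = resp["Body"].read()

# Reinterpret the bytes as the tensor; only ~32MB crosses the wire, not the whole file.
q_proj = np.frombuffer(raw, dtype=np.float16).reshape(4096, 4096)
print(q_proj.shape)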
Erasure Coding for AI‑Sized Objects
Traditional replication is expensive:
- 100GB model
- 3× replication
- → 300GB storage required
Erasure coding (like Ceph) gives:
- 100GB
- + parity shards
- → ~150GB for the same durability
Anvil uses Reed‑Solomon encoding:
- configurable shard counts
- lost shards rebuilt on the fly
- stored across the cluster automatically
This is a life‑saver for multi‑GB models and datasets.
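The storage math is easy to sanity-check: with k data shards and m parity shards, raw usage is (k + m) / k times the object size, versus N× for N-way replication. An 8 + 4 layout (illustrative here, not necessarily Anvil's default) reproduces the ~150GB figure above:

def raw_storage_gb(object_gb, data_shards, parity_shards):
    # Reed-Solomon overhead: (k + m) / k of the original object size.
    return object_gb * (data_shards + parity_shards) / data_shards

model_gb = 100
print(raw_storage_gb(model_gb, data_shards=8, parity_shards=4))  # 150.0 GB, survives any 4 lost shards
print(model_gb * 3)                                              # 300 GB for 3x replication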
Multi‑Region Clustering (Gossip + Postgres)
We adopted a split‑metadata pattern:
Global Postgres
- tenant metadata
- bucket metadata
- auth
- region definitions
Regional Postgres (one per region)
- object metadata
- tensor index
- block maps
- journalling
Node Discovery via libp2p
Nodes gossip:
- liveness
- region membership
- shard ownership
- cluster size
- bootstrap points
Zero-configuration cluster growth:
anvil --bootstrap /dns/anvil1/tcp/7443
Code: Upload + Stream a Tensor
Upload a model file
aws --endpoint-url http://localhost:9000 s3 cp llama3.safetensors s3://models/
Stream a tensor
from anvilml import Model
m = Model("s3://models/llama3.safetensors")
w = m.get_tensor("layers.8.attn.q_proj.weight")
print(w.shape)
Deploy locally
docker compose up -d
Built for Local + Hybrid
We wanted something that:
- runs offline
- runs on laptops
- runs on home labs
- runs across small teams
- runs in production clusters
- doesn’t require k8s or cloud lock‑in
So Anvil is:
- single binary
- Docker-first
- multi-region optional
- no external services besides Postgres
Why Open Source?
Because object storage is infrastructure.
People need to trust it.
Teams need to inspect and extend it.
Researchers need to experiment with it.
ML engineers need to run it offline.
We’re releasing Anvil under Apache‑2.0 with:
- full source
- production-ready release
- detailed docs
- Python SDK
- S3 API
- examples and tutorials
If you want to run models locally, self-host private AI workloads, or build infra around LLMs — we hope Anvil is useful.
Links
GitHub
https://github.com/worka-ai/anvil
Docs
https://worka.ai/docs/anvil/getting-started
Landing
https://worka.ai/anvil
Release
https://github.com/worka-ai/anvil/releases/latest
If you have thoughts, critiques, architectural ideas, or want to break Anvil — we’d genuinely love feedback.
This is just the beginning.