How We Built an AI‑Native Object Store (Tensor Streaming, Erasure Coding, QUIC, Rust)

Over the past year my team and I have been building an AI product that needed to serve large LLM model files reliably, quickly, and privately.

We assumed the existing tooling would “just work”:

  • Git LFS
  • Hugging Face repos
  • S3 / MinIO
  • generic object stores

But once we started working with multi‑GB safetensors, gguf, and ONNX files, plus the datasets that go with them, everything broke in predictable and painful ways.

This post explains the technical journey that led us to build Anvil — an open‑source, S3‑compatible, AI‑native object store built in Rust — and how we designed it around:

  • Tensor‑level streaming
  • Model‑aware indexing
  • QUIC transport
  • Erasure‑coded distributed storage
  • Simple Docker deployment
  • Multi‑region clustering
  • gRPC APIs + S3 compatibility

And why we decided to open source the entire project (Apache‑2.0).


The Pain That Set This All In Motion

Git LFS

It failed repeatedly on multi‑GB model files: corruption, slow diffs, and weird retry loops.

Hugging Face

Amazing for public hosting — but for private/internal models:

  • rate limits
  • slow downloads
  • no control over the infra
  • not ideal for production workloads.

S3 / MinIO

Rock‑solid for normal object storage, but:

  • treats model files as “dumb blobs”
  • no safetensor/gguf indexing
  • no tensor‑level streaming
  • full downloads required before inference can start
  • expensive when replication is used for durability

Our own app’s needs

We have users on:

  • machines with 4–8GB VRAM
  • laptops needing local/offline inference
  • mobile‑adjacent devices
  • distributed clusters needing fast warm starts

We could not afford a 5–15GB full model download on every startup.

We needed inference to start instantly.

That’s when we realized:

Object stores were never built for AI workloads.

We needed something model‑aware.


Enter Anvil — What We Ended Up Building

GitHub Repo: https://github.com/worka-ai/anvil

Docs: https://worka.ai/docs/anvil/getting-started

Landing: https://worka.ai/anvil

Release: https://github.com/worka-ai/anvil/releases/latest

Anvil started as an internal hack.

It’s now a complete, distributed object store built for ML systems.

At a high level, Anvil is:

  • fully S3-compatible
  • fully gRPC-native
  • simple (Docker-first) to run
  • built in Rust
  • open-source (Apache‑2.0)
  • model-aware (safetensors, gguf, onnx)
  • able to stream individual tensors for partial inference loads
  • erasure-coded for durability (Ceph-style)
  • clusterable (libp2p gossip + QUIC)
  • multi-region with isolated metadata

Let’s dive into the internals.


Model‑Aware Indexing (safetensors / gguf / onnx)

This is one of the core innovations.

When a model file is uploaded, Anvil automatically indexes:

  • tensor names
  • byte offsets
  • dtypes
  • shapes
  • file segments
  • metadata

This allows the client to do:

from anvilml import Model

m = Model("s3://models/llama3.safetensors")
q_proj = m.get_tensor("layers.12.attn.q_proj.weight")

No full download.

No giant memory spike.

Just one tensor.
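
For safetensors specifically, the index comes straight from the file's own header: the first 8 bytes are a little-endian u64 giving the size of a JSON table that maps each tensor name to its dtype, shape, and byte offsets. Anvil's indexer is written in Rust; the Python below is just a minimal sketch of reading that header, not the production code:

import json
import struct

def index_safetensors_header(path):
    """Build {tensor_name: (dtype, shape, start, end)} from a safetensors
    file without touching the tensor data itself."""
    with open(path, "rb") as f:
        # First 8 bytes: little-endian u64 with the JSON header size.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))

    data_start = 8 + header_len  # tensor bytes begin right after the header
    index = {}
    for name, info in header.items():
        if name == "__metadata__":  # optional free-form metadata block
            continue
        begin, end = info["data_offsets"]  # offsets relative to data_start
        index[name] = (info["dtype"], info["shape"],
                       data_start + begin, data_start + end)
    return index

# index_safetensors_header("llama3.safetensors")["layers.12.attn.q_proj.weight"]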

Why this matters

It enables:

  • partial inference on underpowered devices
  • instant warm starts
  • cold start reduction by ~12×
  • efficient multi‑variant fine‑tune workflows

Tensor‑Level Streaming Over QUIC

Instead of downloading the entire model file:

  • Use the tensor index
  • Open a QUIC stream
  • Fetch only the byte ranges needed
  • Feed directly into the ML framework

This results in:

🟢 Cold Start

37.1s → 2.9s on a real 3B model.

🟢 Data transferred

6.3GB → 128MB

🟢 CPU and memory usage: significantly lower

QUIC gives us:

  • stream multiplexing without head‑of‑line blocking
  • built‑in congestion control
  • lower latency
  • faster connection setup than TCP + TLS with HTTP/2

And QUIC is seeing increasing adoption for exactly this kind of high‑throughput, latency‑sensitive transfer.
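
Anvil's own client does this over QUIC, but because the server also speaks plain S3, you can approximate the same partial read with an ordinary ranged GET once the offsets are known from the tensor index. A sketch with boto3 (the byte range here is made up for illustration):

import boto3

# Point any S3 client at the local Anvil endpoint.
s3 = boto3.client("s3", endpoint_url="http://localhost:9000")

start, end = 104_857_600, 138_412_031  # hypothetical offsets for one tensor
resp = s3.get_object(
    Bucket="models",
    Key="llama3.safetensors",
    Range=f"bytes={start}-{end}",  # inclusive range: ~32MB instead of 6.3GB
)
tensor_bytes = resp["Body"].read()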


Erasure Coding for AI‑Sized Objects

Traditional replication is expensive:

  • 100GB model
  • 3× replication
  • → 300GB storage required

Erasure coding (like Ceph) gives:

  • 100GB
  • + parity shards
  • ~150GB for the same durability
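
The arithmetic behind that number, with illustrative shard counts (8 data + 4 parity; the real counts are configurable):

# Storage overhead: 3x replication vs. Reed-Solomon erasure coding.
model_gb = 100
k, m = 8, 4                        # data shards, parity shards (illustrative)

replication_gb = model_gb * 3      # 300 GB, survives loss of 2 copies
ec_gb = model_gb * (k + m) / k     # 150 GB, survives loss of any m=4 shards

print(f"3x replication: {replication_gb} GB")
print(f"RS({k},{m}) erasure coding: {ec_gb:.0f} GB")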

Anvil uses Reed‑Solomon encoding:

  • configurable data/parity shard counts
  • lost shards rebuilt on the fly
  • shards distributed across the cluster automatically

This is a life‑saver for multi‑GB models and datasets.


Multi‑Region Clustering (Gossip + Postgres)

We adopted a split‑metadata pattern:

Global Postgres

  • tenant metadata
  • bucket metadata
  • auth
  • region definitions

Regional Postgres (one per region)

  • object metadata
  • tensor index
  • block maps
  • journalling
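
Putting the two tiers together: a request first resolves the bucket's region against the global store, then everything object‑level against that region's store. A toy illustration of the lookup flow (hypothetical data, not Anvil's actual schema):

# Global tier answers "which region owns this bucket?";
# the regional tier answers "where are this object's tensors and blocks?".
GLOBAL = {"buckets": {"models": {"tenant": "acme", "region": "eu-west"}}}
REGIONAL = {
    "eu-west": {
        "objects": {
            "models/llama3.safetensors": {
                "tensor_index": {"layers.12.attn.q_proj.weight": (104_857_600, 138_412_032)},
                "blocks": ["shard-0", "shard-1", "shard-2"],
            }
        }
    }
}

def locate(bucket: str, key: str):
    region = GLOBAL["buckets"][bucket]["region"]      # one global lookup
    return region, REGIONAL[region]["objects"][f"{bucket}/{key}"]

print(locate("models", "llama3.safetensors"))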

Node Discovery via libp2p

Nodes gossip:

  • liveness
  • region membership
  • shard ownership
  • cluster size
  • bootstrap points

Zero-configuration cluster growth:

anvil --bootstrap /dns/anvil1/tcp/7443

Code: Upload + Stream a Tensor

Upload a model file

aws --endpoint-url http://localhost:9000 s3 cp llama3.safetensors s3://models/
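
The same upload works from any S3 SDK. A quick boto3 sketch pointed at the same local endpoint (assumes credentials for it are already configured):

import boto3

s3 = boto3.client("s3", endpoint_url="http://localhost:9000")
s3.upload_file("llama3.safetensors", "models", "llama3.safetensors")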

Stream a tensor

from anvilml import Model

m = Model("s3://models/llama3.safetensors")
w = m.get_tensor("layers.8.attn.q_proj.weight")

print(w.shape)

Deploy locally

docker compose up -d

Built for Local + Hybrid

We wanted something that:

  • runs offline
  • runs on laptops
  • runs on home labs
  • runs across small teams
  • runs in production clusters
  • doesn’t require Kubernetes or tie you to a single cloud

So Anvil is:

  • single binary
  • Docker-first
  • multi-region optional
  • no external services besides Postgres

Why Open Source?

Because object storage is infrastructure.

People need to trust it.

Teams need to inspect and extend it.

Researchers need to experiment with it.

ML engineers need to run it offline.

We’re releasing Anvil under Apache‑2.0 with:

  • full source
  • production-ready release
  • detailed docs
  • Python SDK
  • S3 API
  • examples and tutorials

If you want to run models locally, self-host private AI workloads, or build infra around LLMs — we hope Anvil is useful.


Links

GitHub: https://github.com/worka-ai/anvil

Docs: https://worka.ai/docs/anvil/getting-started

Landing: https://worka.ai/anvil

Release: https://github.com/worka-ai/anvil/releases/latest


If you have thoughts, critiques, architectural ideas, or want to break Anvil — we’d genuinely love feedback.

This is just the beginning.
