Over the past year my team and I have been building an AI product that needed to serve large LLM model files reliably, quickly, and privately.
We assumed the existing tooling would “just work”:
- Git LFS
- Hugging Face repos
- S3 / MinIO
- generic object stores
But once we started working with multi‑GB safetensors, gguf, ONNX, and datasets, everything broke in predictable and painful ways.
This post explains the technical journey that led us to build Anvil — an open‑source, S3‑compatible, AI‑native object store built in Rust — and how we designed it around:
- Tensor‑level streaming
- Model‑aware indexing
- QUIC transport
- Erasure‑coded distributed storage
- Simple Docker deployment
- Multi‑region clustering
- gRPC APIs + S3 compatibility
And why we decided to open source the entire project (Apache‑2.0).
The Pain That Set This All In Motion
Git LFS
Failed repeatedly on multi‑GB model files: corruption, slow diffs, weird retry loops.
Hugging Face
Amazing for public hosting — but for private/internal models:
- rate limits
- slow downloads
- no control over the infra
- not ideal for production workloads.
S3 / MinIO
Rock‑solid for normal object storage, but:
- treats model files as “dumb blobs”
- no safetensor/gguf indexing
- no tensor‑level streaming
- full downloads required before inferencing
- expensive when replication is used for durability
Our own app’s needs
We have users on:
- machines with 4–8GB VRAM
- laptops needing local/offline inference
- mobile‑adjacent devices
- distributed clusters needing fast warm starts
We could not afford a 5–15GB full model download on every startup.
We needed inference to start instantly.
That’s when we realized:
Object stores were never built for AI workloads.
We needed something model‑aware.
Enter Anvil — What We Ended Up Building
GitHub Repo: https://github.com/worka-ai/anvil
Docs: https://worka.ai/docs/anvil/getting-started
Landing: https://worka.ai/anvil
Release: https://github.com/worka-ai/anvil/releases/latest
Anvil started as an internal hack.
It’s now a complete, distributed object store built for ML systems.
At a high level, Anvil is:
- fully S3-compatible
- fully gRPC-native
- simple (Docker-first) to run
- built in Rust
- open-source (Apache‑2.0)
- model-aware (safetensors, gguf, onnx)
- supports tensor-streaming for partial inference loads
- supports erasure coding (Ceph-style)
- clusterable (libp2p gossip + QUIC)
- multi-region with isolated metadata
Let’s dive into the internals.
Model‑Aware Indexing (safetensors / gguf / onnx)
This is one of the core innovations.
When a model file is uploaded, Anvil automatically indexes:
- tensor names
- byte offsets
- dtypes
- shapes
- file segments
- metadata
This allows the client to do:
from anvilml import Model
m = Model("s3://models/llama3.safetensors")
q_proj = m.get_tensor("layers.12.attn.q_proj.weight")
No full download.
No giant memory spike.
Just one tensor.
Why this matters
It enables:
- partial inference on underpowered devices
- instant warm starts
- cold start reduction by ~12×
- efficient multi‑variant fine‑tune workflows
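For a sense of where that index comes from: safetensors files begin with a JSON header that already lists every tensor's name, dtype, shape, and byte offsets, so an index can be built without touching the weights themselves. Here is a minimal sketch of that parsing step (the file path is illustrative; Anvil does the equivalent server-side at upload time):

import json
import struct

def index_safetensors_header(path):
    """Build a tensor index (name -> dtype, shape, byte range) without reading any tensor data."""
    with open(path, "rb") as f:
        # First 8 bytes: little-endian u64 length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))

    index = {}
    for name, info in header.items():
        if name == "__metadata__":  # optional free-form metadata entry
            continue
        begin, end = info["data_offsets"]  # offsets relative to the data section
        index[name] = {
            "dtype": info["dtype"],
            "shape": info["shape"],
            # Absolute byte range within the file: 8-byte length prefix + header + offset.
            "range": (8 + header_len + begin, 8 + header_len + end),
        }
    return index

idx = index_safetensors_header("llama3.safetensors")
print(idx["layers.12.attn.q_proj.weight"])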
Tensor‑Level Streaming Over QUIC
Instead of downloading the entire model file:
- Use the tensor index
- Open a QUIC stream
- Fetch only the byte ranges needed
- Feed directly into the ML framework
This results in:
🟢 Cold start: 37.1s → 2.9s on a real 3B model
🟢 Data transferred: 6.3GB → 128MB
🟢 CPU and memory: way lower
QUIC gives us:
- multiplexing
- congestion control
- lower latency
- less TLS handshake overhead than HTTP/2 over TCP
And QUIC adoption keeps growing for exactly this kind of high-throughput, latency-sensitive transfer.
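Because Anvil also speaks the S3 API, the same partial-read idea can be approximated from any ordinary S3 client with a ranged GET; the anvilml SDK automates this and, per the design above, moves the bytes over QUIC instead of plain HTTP. A rough sketch with boto3, where the byte offset, dtype, and shape are illustrative values that would normally come from the tensor index:

import boto3
import numpy as np

# Ranged read against Anvil's S3-compatible endpoint.
s3 = boto3.client("s3", endpoint_url="http://localhost:9000")

start = 100_000_000            # hypothetical byte offset reported by the index
length = 4096 * 4096 * 2       # one fp16 tensor of shape (4096, 4096)

resp = s3.get_object(
    Bucket="models",
    Key="llama3.safetensors",
    Range=f"bytes={start}-{start + length - 1}",  # HTTP range ends are inclusive
)
raw = resp["Body"].read()

# Reinterpret the bytes as the tensor; only ~32MB crosses the wire, not the whole file.
q_proj = np.frombuffer(raw, dtype=np.float16).reshape(4096, 4096)
print(q_proj.shape)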
Erasure Coding for AI‑Sized Objects
Traditional replication is expensive:
- 100GB model
- 3× replication
- → 300GB storage required
Erasure coding (like Ceph) gives:
- 100GB
- + parity shards
- → ~150GB for the same durability
Anvil uses Reed‑Solomon encoding:
- configurable shard counts
- lost shards rebuilt on the fly
- stored across the cluster automatically
This is a life‑saver for multi‑GB models and datasets.
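The storage math is easy to sanity-check: with k data shards and m parity shards, raw usage is (k + m) / k times the object size, versus N× for N-way replication. An 8 + 4 layout (illustrative here, not necessarily Anvil's default) reproduces the ~150GB figure above:

def raw_storage_gb(object_gb, data_shards, parity_shards):
    # Reed-Solomon overhead: (k + m) / k of the original object size.
    return object_gb * (data_shards + parity_shards) / data_shards

model_gb = 100
print(raw_storage_gb(model_gb, data_shards=8, parity_shards=4))  # 150.0 GB, survives any 4 lost shards
print(model_gb * 3)                                              # 300 GB for 3x replication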
Multi‑Region Clustering (Gossip + Postgres)
We adopted a split‑metadata pattern:
Global Postgres
- tenant metadata
- bucket metadata
- auth
- region definitions
Regional Postgres (one per region)
- object metadata
- tensor index
- block maps
- journalling
Node Discovery via libp2p
Nodes gossip:
- liveness
- region membership
- shard ownership
- cluster size
- bootstrap points
Zero-configuration cluster growth:
anvil --bootstrap /dns/anvil1/tcp/7443
Code: Upload + Stream a Tensor
Upload a model file
aws --endpoint-url http://localhost:9000 s3 cp llama3.safetensors s3://models/
Stream a tensor
from anvilml import Model
m = Model("s3://models/llama3.safetensors")
w = m.get_tensor("layers.8.attn.q_proj.weight")
print(w.shape)
Deploy locally
docker compose up -d
Built for Local + Hybrid
We wanted something that:
- runs offline
- runs on laptops
- runs on home labs
- runs across small teams
- runs in production clusters
- doesn’t require k8s or cloud lock‑in
So Anvil is:
- single binary
- Docker-first
- multi-region optional
- no external services besides Postgres
Why Open Source?
Because object storage is infrastructure.
People need to trust it.
Teams need to inspect and extend it.
Researchers need to experiment with it.
ML engineers need to run it offline.
We’re releasing Anvil under Apache‑2.0 with:
- full source
- production-ready release
- detailed docs
- Python SDK
- S3 API
- examples and tutorials
If you want to run models locally, self-host private AI workloads, or build infra around LLMs — we hope Anvil is useful.
Links
GitHub
https://github.com/worka-ai/anvil
Docs
https://worka.ai/docs/anvil/getting-started
Landing
https://worka.ai/anvil
Release
https://github.com/worka-ai/anvil/releases/latest
If you have thoughts, critiques, architectural ideas, or want to break Anvil — we’d genuinely love feedback.
This is just the beginning.