I Killed a 773 MB Model Download at 60%. It Recovered in 44 Seconds.

The discussion around local AI is a hardware discussion: which model fits on which device, at what speed, at what quantization. Framed that way, the field reads as a benchmark race against the cloud, and the cloud usually wins. The more consequential development sits one layer lower, in how models reach devices and how devices reach each other's models, and it is easy to miss because no benchmark measures it. A download killed at 488 of 773 megabytes measures it precisely.

This week I tested that layer directly. I installed a newly released local AI SDK on the cheapest server I operate, a virtual machine with four CPU cores, 8 GB of RAM and no GPU, and paid attention not to the token rate but to the network architecture underneath.

A model that arrives like a torrent

Peer-to-peer local AI means two things move between devices without a cloud endpoint: the model itself, synchronized block by block from whichever peers seed it, and the conversation with a model, when one device sends its inference calls over an encrypted peer connection to a device that runs the model locally.

On my test box, that looked unremarkable in the best sense: three commands, a 773 MB model fetched from the peer-to-peer registry, a correct completion streamed on CPU only.

```shell
mkdir qvac-test && cd qvac-test && npm init -y && npm pkg set type=module
npm install @qvac/sdk
node quickstart.js   # loads Llama 3.2 1B, streams a completion
```

The transport shows its nature when things go wrong. I wiped the cache, started the download again, killed the process at 488 of 773 megabytes, and reran the command. The partial blocks on disk were reused, and the rerun completed, inference included, in 44 seconds. For comparison, resumable model downloads have been an open request in the transformers.js project since March 2025, because over plain HTTP the client has to track and revalidate partial files itself, and that is a hard problem to get right. Over a block-synchronized transport, resume is the default behavior rather than an added feature.
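To make the contrast concrete, here is a minimal sketch of block-level resume in plain JavaScript. It is not the SDK's or the Holepunch stack's code: the 1 MB block size, the `planResume` name, and the assumption that exactly the first 488 blocks survived the kill are all hypothetical, chosen only to mirror the numbers in my test.

```javascript
// Hypothetical sketch of block-level resume; not the SDK's implementation.
// Assumption: the model is split into fixed-size blocks and the client
// records which block indexes were fully written and verified.
const MB = 1024 * 1024;

function planResume(totalBytes, blockSize, haveBlocks) {
  const totalBlocks = Math.ceil(totalBytes / blockSize);
  const missing = [];
  for (let i = 0; i < totalBlocks; i++) {
    if (!haveBlocks.has(i)) missing.push(i);
  }
  // The final block may be shorter than blockSize.
  const blockLength = (i) => Math.min(blockSize, totalBytes - i * blockSize);
  const bytesToFetch = missing.reduce((sum, i) => sum + blockLength(i), 0);
  return { totalBlocks, missing, bytesToFetch };
}

// Mirror the test: 773 MB total, process killed after 488 MB.
const have = new Set(Array.from({ length: 488 }, (_, i) => i));
const plan = planResume(773 * MB, 1 * MB, have);

console.log(plan.missing.length);     // blocks still needed
console.log(plan.bytesToFetch / MB);  // megabytes still needed
```

The plan is derived from the set of verified blocks, not from a byte offset, so the missing blocks could come from any peer in any order. That is the property a single sequential HTTP stream does not give you for free.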

Inference as a shared resource

The SDK in my test, QVAC by Tether, treats that transport as the foundation for something larger. Its architecture documents describe delegated inference over the Holepunch stack, and the mechanics matter: the model does not move, and inference runs on the device that already holds it. A device that holds a model announces it on a peer-discovery topic; another device connects, and its inference calls are proxied through an encrypted peer-to-peer stream while the holding device executes locally and streams results back. There is no server tier in this design. Holding and borrowing are roles per model, not device classes: every peer runs the same stack and is addressed by a public key, and one machine can serve a model while borrowing another, the way a file-sharing peer seeds one file and fetches the next. Access is gated by a firewall of allowed public keys, and blind relay nodes route the traffic across NATs. What travels between the devices is the conversation: an agent on one machine talking to a model on another. The project frames the ambition as building systems "like BitTorrent, IPFS, and blockchain networks, but for AI."
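The roles described above can be sketched in a few lines. This is an in-memory toy, not the QVAC API: the `Peer` class, its `hold`, `handle` and `borrow` methods, and the `Map` standing in for a discovery topic are all hypothetical names, and the sketch leaves out encryption, relays and streaming entirely. It only shows the two ideas that matter here: holder and borrower are roles per model, and access is checked against an allowlist of public keys.

```javascript
// Hypothetical in-memory sketch of delegated inference; not the QVAC API.
// Peers are identified by a public-key string; the "network" is a plain Map.
class Peer {
  constructor(publicKey) {
    this.publicKey = publicKey;
    this.models = new Map();   // model name -> local inference function
    this.allowed = new Map();  // model name -> Set of allowed public keys
  }

  // Holder role: serve a local model to an allowlist of peers.
  hold(model, run, allowedKeys) {
    this.models.set(model, run);
    this.allowed.set(model, new Set(allowedKeys));
  }

  // Runs on the holder when a remote peer sends an inference request.
  handle(fromKey, model, prompt) {
    const run = this.models.get(model);
    if (!run) throw new Error(`model not held: ${model}`);
    if (!this.allowed.get(model).has(fromKey)) {
      throw new Error(`key not allowed: ${fromKey}`);
    }
    return run(prompt); // executes locally; only the result travels back
  }

  // Borrower role: send the prompt to whichever peer announced the model.
  borrow(topic, model, prompt) {
    const holder = topic.get(model);
    if (!holder) throw new Error(`no peer announces ${model}`);
    return holder.handle(this.publicKey, model, prompt);
  }
}

const topic = new Map(); // stand-in for a peer-discovery topic
const a = new Peer('key-a');
const b = new Peer('key-b');
const c = new Peer('key-c');

// a holds a chat model for b; b holds an embedding model for a.
a.hold('chat', (p) => `echo: ${p}`, ['key-b']);
b.hold('embed', (p) => [p.length], ['key-a']);
topic.set('chat', a);
topic.set('embed', b);

console.log(b.borrow(topic, 'chat', 'hello'));  // b borrows while also serving
console.log(a.borrow(topic, 'embed', 'hello')); // a borrows while also serving
```

Peer `c` is on neither allowlist, so a call such as `c.borrow(topic, 'chat', 'hi')` throws before any inference runs.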

A scope note: I verified the model-delivery layer firsthand; the delegation layer above it is the documented design, built on the same stack that moved my 773 MB model. Taken as designed, a fleet of edge devices (sensor boxes, point-of-sale terminals, machines on a factory floor) could share whichever peer currently holds a capable model, without any of them holding an API key or reaching a cloud endpoint. Model registry, transport and delegation are all peer-to-peer; no central server sits in the path to be metered, throttled or switched off.

For regulated environments, the same property reads differently but lands in the same place: the data path is inspectable end to end, and nothing in it terminates at a third party.

The layer worth evaluating

The plumbing around all this is unusually complete for a young SDK. Session state persists to disk, output can be constrained with a JSON schema or a grammar enforced in the sampler, and streaming transcription ships with voice activity detection. The install is heavy at 3.2 GB of node_modules, a known issue the project tracks, and the team is still hardening GPU edge cases. None of that changes the architecture underneath.
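For readers who have not met sampler-level constraints before, here is a toy sketch of the general technique, under loud assumptions: the seven-token vocabulary, the hand-written `allowedNext` grammar and the fake logits are all invented for illustration, and none of it reflects how the SDK implements the feature. The idea is that tokens the grammar forbids are masked out before the sampler picks, so the output stays valid no matter what the model prefers.

```javascript
// Hypothetical sketch of grammar-constrained (greedy) sampling.
// Assumption: a toy vocabulary and a toy "grammar" expressed as a function
// from the text so far to the list of tokens allowed next.
const vocab = ['{', '}', '"ok"', ':', 'true', 'false', 'hello'];

// Toy grammar that only accepts {"ok":true} or {"ok":false}.
function allowedNext(text) {
  if (text === '') return ['{'];
  if (text === '{') return ['"ok"'];
  if (text === '{"ok"') return [':'];
  if (text === '{"ok":') return ['true', 'false'];
  if (text === '{"ok":true' || text === '{"ok":false') return ['}'];
  return []; // output is complete
}

// Pick the highest-scoring token among those the grammar allows.
function constrainedPick(logits, text) {
  const allowed = new Set(allowedNext(text));
  let best = null;
  for (let i = 0; i < vocab.length; i++) {
    if (!allowed.has(vocab[i])) continue; // masked out by the grammar
    if (best === null || logits[i] > logits[best]) best = i;
  }
  return best === null ? null : vocab[best];
}

// Stand-in for a model that always prefers the token "hello".
const fakeLogits = () => [0.1, 0.2, 0.3, 0.1, 0.4, 0.5, 9.0];

let out = '';
let tok = constrainedPick(fakeLogits(), out);
while (tok !== null) {
  out += tok;
  tok = constrainedPick(fakeLogits(), out);
}
console.log(out); // always parseable JSON under this toy grammar
```

Even though the stand-in model scores `hello` far above everything else, that token is never emitted, because the grammar never allows it.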

A 44-second recovery from a killed download told me more about this stack than any tokens-per-second table could have. Local AI on a single device is the settled part; benchmarks measure it because it is measurable. The part that decides whether fleets of edge agents become practical is the network layer between the devices, and that layer can be tested in an afternoon: interrupt the model transfer, watch what resumes, and read what the architecture does when no cloud is in the path.

I write field notes from real builds — AI integration, cron-driven automation, and the parts that break in production. New posts every two weeks; if this one was useful, the self-hosted LLM break-even calculator is the companion download.
