
David Aronchick


WebAssembly 3.0 and the Infrastructure We Actually Need

"We've finished the model... when do we get to the value?"

Here's what I'm watching happen in real time: DevOps teams are spending $$$$$$ annually on cloud egress fees just to move ML models between environments and funnel data into centralized hosting.

Platform engineers are debugging why a 1.1GB model requires a 7GB container. SREs are explaining why edge inference needs a 30-second cold start.

Everyone knows something's wrong. But we keep reaching for the same solution: bigger containers, faster networks, more centralized infrastructure.

WebAssembly 3.0, released last week, may offer another way. I hope!
The Container Tax We Normalized
Docker revolutionized deployment by making environments portable. Going from nightmare runbooks to `docker run` changed how we ship software. But watch what happened to model deployment:
A transformer model that's 1.11GB becomes 7.05GB once you add Python, PyTorch, and CUDA libraries. The actual inference code? Maybe 2MB. The rest is environmental overhead we're shipping because we can't separate the what from the where.
This compounds at scale. Every device not located in your datacenter in Northern Virginia gets gigabytes of infrastructure. Every browser download includes an entire runtime. Every cloud function pays startup penalties for containers that spend 99% of their time idle.

We solved "works on my machine" by making the machine part of what we ship. For ML deployment—particularly the edge-native inference we actually need—this is backwards.
What Changed in WASM 3.0
Four technical advances matter for how we deploy models:
- 64-bit memory: The 4GB limit is gone. Web environments support 16GB; off-web is essentially unlimited. You can load a 12GB language model directly, no pagination hacks. This unblocks the entire class of problems where models were too large for edge deployment.
- Garbage collection: Java, Kotlin, Scala, OCaml, Dart, Scheme all now compile to WASM. Previously, you needed Rust/C++ or JavaScript bridges. ML pipelines aren't monoglots. Python for training, Go for serving, Rust for performance paths: this is how production ML actually works. WASM GC removes the "rewrite everything" tax.
- Multiple memories: Separate address spaces for model weights versus runtime data. When you load untrusted model code (and increasingly, you will), it literally cannot access host memory. Security isolation becomes structural.
- Native exception handling: Models fail. Usually at 3 AM. WASM now has proper error boundaries that don't escape to host languages. This matters when your inference pipeline spans edge devices, fog nodes, and cloud.

The Topology Problem Docker Can't Solve
ML inference needs to run in five fundamentally different environments:
1. Browsers - Client-side inference, zero server cost
2. Edge devices - 512MB total storage, millisecond budgets
3. Serverless functions - Cold starts under 100ms
4. GPUs for training and heavy inference
5. CPUs for lightweight inference at scale
Docker handles #4. Barely handles #3. Can't touch #1, #2, or most practical implementations of #5. Not because it isn't a great tool, but because it wasn't built to.

WASM runs everywhere Docker runs, plus everywhere Docker can't. This isn't theoretical; it's shipping today.
What This Looks Like in Production
TinyGo compiles Go to WASM. A simple service drops from 1.1MB to 377KB with optimization. Debug information alone accounts for two-thirds of binary size—strip it and 93KB becomes 30KB. Cold starts move from seconds to sub-millisecond.
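As a rough sketch of what that workflow looks like (the exported score function is illustrative, and TinyGo flag names can vary by version), a trivially small module might look like this:

```go
// score.go - a minimal inference-style module for TinyGo's WASM target.
// Illustrative only: the exported function and the build flags below are
// assumptions, not the exact service measured in the post.
//
// Build (flags may differ by TinyGo version):
//   tinygo build -o score.wasm -target=wasi -opt=z -no-debug ./score.go
// -opt=z optimizes for size; -no-debug strips debug info, which is where
// most of the binary-size savings described above come from.
package main

// score applies a tiny hand-rolled linear model to one input value.
// Real inference code would operate on shared memory, but the deployment
// story (kilobytes, near-instant cold start) is the same.
//
//export score
func score(x float32) float32 {
	const weight, bias = 0.73, 0.1
	return weight*x + bias
}

// main is required by TinyGo even when only exported functions are used.
func main() {}
```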

Extism builds universal plugins on WASM. Write inference code in Rust, call it from Python, run it anywhere. This is the plugin architecture distributed ML actually needs: models as portable, sandboxed plugins rather than services wrapped in infrastructure.
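Extism wraps this idea in a polished cross-language SDK. The sketch below is not Extism's API; it is the same host-side pattern shown with the wazero runtime, with the module path, exported function name, and WASI setup carried over as assumptions from the sketch above:

```go
// host.go - loads a WASM "model plugin" and calls it from ordinary Go.
// A minimal sketch of the plugin pattern; Extism provides the same idea
// as a cross-language SDK with a friendlier interface.
package main

import (
	"context"
	"fmt"
	"log"
	"math"
	"os"

	"github.com/tetratelabs/wazero"
	"github.com/tetratelabs/wazero/imports/wasi_snapshot_preview1"
)

func main() {
	ctx := context.Background()

	// The sandboxed runtime: the module gets its own memory and only the
	// imports we explicitly provide.
	r := wazero.NewRuntime(ctx)
	defer r.Close(ctx)

	// Needed if the module was built for a WASI target; harmless otherwise.
	wasi_snapshot_preview1.MustInstantiate(ctx, r)

	wasmBytes, err := os.ReadFile("score.wasm") // path is illustrative
	if err != nil {
		log.Fatal(err)
	}

	mod, err := r.Instantiate(ctx, wasmBytes)
	if err != nil {
		log.Fatal(err)
	}

	// Call the exported "score" function. wazero passes parameters and
	// results as raw uint64 words, so float32 values are bit-cast.
	score := mod.ExportedFunction("score")
	results, err := score.Call(ctx, uint64(math.Float32bits(0.5)))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("prediction:", math.Float32frombits(uint32(results[0])))
}
```

The host stays a normal Go program; the model logic lives in a sandboxed module it can swap out without redeploying itself.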

Yzma proves WASM runs on actual resource-constrained hardware. Not "edge" as in "a smaller data center." Edge as in microcontrollers with meaningful memory limits doing real-time sensor processing.
The Missing Layer for Distributed Inference
We can train models across many machines; Kubernetes and GPUs solved that. We can store models in registries and blob stores. We still can't deploy models efficiently across the topology we actually need.

For edge deployment at scale, the standard approach of throwing everything in a container is unworkable. For browser-based inference, it's impossible. For compute-over-data architectures where you move inference to where data lives, it's economically absurd.

WASM 3.0 has a real opportunity to provide what we were missing:
- *Portable* across radically different environments
- *Secure* through sandboxing by design
- *Fast* with near-native performance
- *Small* measured in kilobytes, not gigabytes
- *Multi-language* without framework lock-in

What This Enables
Docker for environments, WASM for computation. This distinction matters:

Browser inference without 100MB downloads: Load a 400KB WASM module instead of shipping an entire ML framework. Process user data client-side, never hitting servers. Privacy by architecture, not policy.

Edge devices running real inference: A 6-watt chip running Llama2-7B beats a cloud round-trip for latency, cost, and privacy. WASM makes this practical to deploy and update.

Model updates in seconds, not minutes: Push a 500KB WASM module instead of rebuilding and shipping 4GB container images. Distributed inference infrastructure becomes as agile as the models themselves.
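A hedged sketch of what that update path can look like on a wazero-based host (the function and the version-naming scheme are illustrative, not a prescribed API):

```go
package inference

import (
	"context"

	"github.com/tetratelabs/wazero"
	"github.com/tetratelabs/wazero/api"
)

// swapModel replaces the running inference module with newly published bytes.
// Sketch only: how the new .wasm arrives (registry pull, HTTP fetch, message
// bus) is left out; the point is that an update means re-instantiating a few
// hundred kilobytes, not pulling and restarting a multi-gigabyte container.
func swapModel(ctx context.Context, r wazero.Runtime, current api.Module, newWasm []byte, version string) (api.Module, error) {
	// A unique module name per version avoids name collisions inside the runtime.
	next, err := r.InstantiateWithConfig(ctx, newWasm,
		wazero.NewModuleConfig().WithName("model-"+version))
	if err != nil {
		return current, err // keep serving the old model if the new build is bad
	}
	if current != nil {
		_ = current.Close(ctx) // release the previous instance
	}
	return next, nil
}
```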

The infrastructure we need isn't always what we expect. Sometimes it's the 16-year-old technology that finally grew the features to solve problems we created by centralizing everything.

Further Technical Deep Dives:
- WebAssembly 3.0 Specification
- TinyGo Optimization Guide
- Extism Universal Plugin System
- WASM Implementation Status
Want to learn how intelligent data pipelines can reduce your AI costs? Check out Expanso. Or don't. Who am I to tell you what to do.

NOTE: I'm currently writing a book based on what I've seen of the real-world challenges of data preparation for machine learning, focusing on operational, compliance, and cost challenges. [I'd love to hear your thoughts](https://github.com/aronchick/Project-Zen-and-the-Art-of-Data-Maintenance?ref=distributedthoughts.org)!


Originally published at Distributed Thoughts.
