Matias Affolter

Posted on Jun 4

🛡️ How I Built an NSFWJS Alternative That's 2.5× Lighter (~1 MB Gzipped) and Runs in ~65 ms (WASM)

#ai #webdev #javascript #opensource

A field report on shipping a binary SFW/NSFW image classifier for pixagram that runs **entirely in the browser**, tuned for the one thing every off-the-shelf model quietly gets wrong: pixel art.

🎯 The itch

I run a Web3 platform for pixel art. People upload sprites, mint them, share them. And the moment you let strangers upload images, you inherit a very old problem: some of those images shouldn't be shown to everyone.

The obvious move is NSFWJS. It's excellent, it's battle-tested, and it runs in the browser. I tried it. Then I looked at what it was doing on my content and realized I was using a sledgehammer to crack a very specific nut.

"I didn't need a model that knows a thousand things about a photograph. I needed one that knows exactly two things about a sprite."

So I built my own. This is the story of how — the good decisions, the one that almost shipped a broken model, and the numbers I ended up with. I've written it so a junior dev can follow the whole thing and a senior one still finds the war stories useful. Where I use jargon, there's a 🔰 In plain terms box right after.

🤔 Why not just use NSFWJS?

NSFWJS is built around a MobileNetV2 backbone, trained on a large corpus of photographic and drawn web imagery, and it returns five categories (Drawing, Hentai, Neutral, Porn, Sexy). That's a fantastic general tool. But three things didn't fit me:

It's photographic. My content is pixel art — hard edges, tiny palettes, no anti-aliasing. A model that learned from photos has never really seen a 160×160 sprite.
It's bigger than my whole problem. Five classes and a ~3.5M-parameter backbone add up to a multi-megabyte download for a yes/no question.
I only ever ask one question. "Is this safe to show or not?" That's a single threshold, not a five-way softmax I have to interpret.

I wanted something I could embed in my bundle, that ran on-device (no upload, no server round-trip, no privacy headache), and that was trained to understand the medium I actually serve.

📋 What I actually needed

Before writing a line of code, I wrote the spec. It fit on a sticky note:

✅ Binary. sfw vs nsfw, one probability, one threshold.
✅ Tiny. Small enough to base64-embed in an npm package — ideally ~1 MB.
✅ Fast. Real-time-ish on a mid-range laptop without a GPU. Target: well under 100 ms.
✅ On-device. Runs in a Web Worker via WebAssembly. Nothing leaves the browser.
✅ Pixel-art-aware. The preprocessing has to respect the pixel grid, not smear it.
✅ Self-contained. npm install, import, call classify(). No model-hosting step for the consumer.

Everything below is downstream of that list.

🧠 Choosing a brain that fits in a tweet

I picked MobileNetV4-conv-small-050 from timm. It's ~0.96M parameters — roughly 2.5× lighter than the MobileNetV2 that NSFWJS leans on — and it's one of the strongest architectures per parameter you can pull off the shelf with ImageNet weights.

Then I made it cheaper still: I trained and ran it at 160×160 instead of the usual 224×224. For a convolutional net, compute scales with the pixel area, so dropping from 224 to 160 cuts the work to roughly half ((160/224)² ≈ 0.51), landing the model at ~65 MFLOPs per image. For a binary decision on a sprite, that resolution is plenty.
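As a quick check of that scaling claim, here is a minimal sketch. It assumes cost is proportional to input pixel area, and the function name is illustrative rather than part of any library.

```python
def conv_cost_ratio(new_side: int, old_side: int) -> float:
    """Relative compute when a square input shrinks from old_side to new_side.

    Assumes cost is proportional to pixel area, the usual first-order
    approximation for a convolutional backbone.
    """
    return (new_side ** 2) / (old_side ** 2)


ratio = conv_cost_ratio(160, 224)
print(f"160px costs {ratio:.0%} of 224px")  # 160px costs 51% of 224px
```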

🔰 In plain terms: the "backbone" is the pretrained image-understanding part of the model. Starting from one that already learned general vision on millions of photos (ImageNet) means I only have to teach it the last mile — "of the things you already see, which are NSFW?" — instead of teaching it to see from scratch.

🔧 The pipeline (the boring part that matters)

The whole thing is four steps, and each one hands its output to the next:

Train in PyTorch/timm — fine-tune the backbone on a two-folder dataset (nsfw/, sfw/).
Export to ONNX — a portable model format the browser can run.
Quantize — shrink the weights from 32-bit floats to 8-bit integers.
Embed — base64 the quantized model straight into the JS bundle, so there's no separate fetch.
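The embed step can be sketched in a few lines. This is an illustration under assumptions, not the project's build script: the export name `MODEL_B64` and the stand-in bytes are hypothetical, and a real build would read the quantized `.onnx` file instead.

```python
import base64


def to_js_module(model_bytes: bytes, export_name: str = "MODEL_B64") -> str:
    """Wrap raw model bytes as a JS module exporting one base64 string."""
    encoded = base64.b64encode(model_bytes).decode("ascii")
    return f'export const {export_name} = "{encoded}";\n'


# Stand-in bytes; in practice this would be the quantized model file contents.
fake_model = bytes(range(256)) * 4
module_source = to_js_module(fake_model)

# base64 emits 4 characters per 3 input bytes, so the embedded string is
# about a third larger than the binary before gzip is applied.
encoded_len = len(module_source.split('"')[1])
print(len(fake_model), encoded_len)
```

The one-third size overhead is the price of skipping a separate fetch; gzip on the bundle claws part of it back.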

At runtime it's onnxruntime-web, which can run the same model on WebGPU when it's available and fall back to WebAssembly (WASM) on the CPU otherwise.

🔰 In plain terms: WASM lets the browser run compiled code at near-native speed. It's why a neural network can run client-side without melting the tab.

The thing I want to stress for anyone building one of these: the four steps are one chain, and the weakest link decides your accuracy. I learned that the hard way twice — once on quantization, once on preprocessing.

🐇 The quantization rabbit hole (where I almost outsmarted myself)

My first instinct was to go aggressive: if 8-bit is good, 4-bit (Q4) must be better, right? Half the size, the blogs promise 2× speedups, everyone's quantizing LLMs to 4-bit. I almost spent a weekend on it.

Then I actually checked what 4-bit means in ONNX Runtime, and on my own model. Two facts stopped me cold:

ONNX Runtime's 4-bit path is weight-only and only touches MatMul operations (it rewrites them into MatMulNBits). It's a tool built for transformers.
My model, when I counted its operations, is 46 convolutions, one tiny classifier head, and zero MatMuls. A MobileNet is almost pure convolution.

So I ran the 4-bit quantizer on it anyway, just to be sure. It dutifully skipped every single node and handed me back a file that was 100% of the original size. Nothing happened, because there was nothing it knew how to touch.
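Counting operator types before choosing a quantizer is cheap. The sketch below is a hypothetical helper, not the script used for this post. It runs on a toy stand-in graph, and the commented lines show how it would be pointed at a real file, assuming the `onnx` package is installed.

```python
from collections import Counter
from types import SimpleNamespace
from typing import Iterable


def count_ops(nodes: Iterable) -> Counter:
    """Tally graph nodes by operator type (each node needs an .op_type)."""
    return Counter(node.op_type for node in nodes)


# With the onnx package installed, the real call would look like this
# (the file name is a placeholder):
#   import onnx
#   ops = count_ops(onnx.load("model.onnx").graph.node)

# Toy stand-in graph so the sketch runs without any model file.
toy_nodes = [SimpleNamespace(op_type=t) for t in ["Conv", "Relu", "Conv", "Gemm"]]
ops = count_ops(toy_nodes)
print(ops["Conv"], ops["MatMul"])  # 2 0
```

If the `MatMul` count comes back as zero, a MatMul-only quantizer has nothing to rewrite.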

"Halving a file that already fits in cache doesn't make it faster — it just makes it smaller. Those are not the same win. The 4-bit magic is real; it's just real for a different kind of model than mine."

The memory-bandwidth argument behind 4-bit assumes a huge model whose weights can't fit in fast memory. Mine is ~1 MB — it lives in CPU cache comfortably. So I dropped the fantasy and used uint8, which is exactly what ONNX Runtime recommends for the WASM/CPU path.

🔰 In plain terms: quantization stores each weight in fewer bits (32-bit float → 8-bit int) to shrink and speed up the model. Less precision, smaller file. The trick is doing it without changing the model's answers — which is exactly where I got burned later.
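To put rough numbers on "fewer bits, smaller file", here is a back-of-the-envelope sketch. It counts raw weight storage only and ignores file-format overhead. The int4 row is hypothetical for this model, since the 4-bit path skipped its convolutions.

```python
def weight_megabytes(num_params: int, bits_per_weight: int) -> float:
    """Raw weight storage in MB (10**6 bytes), ignoring file-format overhead."""
    return num_params * bits_per_weight / 8 / 1_000_000


PARAMS = 960_000  # ~0.96M, the parameter count quoted earlier in the post

for label, bits in [("fp32", 32), ("int8", 8), ("int4", 4)]:
    print(f"{label}: {weight_megabytes(PARAMS, bits):.2f} MB")
```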

🎨 Pixel art breaks all the rules

Here's the thing nobody warns you about when your domain is pixel art: how you resize the image matters more than almost anything else.

Every model needs a fixed input size, so every image gets resized. For photos, the default — bilinear interpolation, which smoothly blends pixels — is fine. For pixel art, it's destructive: it takes crisp, intentional blocks and blurs them into mush. The thing that makes a sprite a sprite is the first thing bilinear throws away.

The right filter for upscaling pixel art is nearest-neighbor ("pixelated"): it keeps every block sharp.
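The difference is easy to see on a 2×2 checkerboard. The sketch below uses plain Python lists and hand-written resizers so the arithmetic is visible. A real pipeline would call an image library, and the function names here are illustrative.

```python
def resize_nearest(img, out_h, out_w):
    """Nearest-neighbor: every output pixel copies exactly one source pixel."""
    in_h, in_w = len(img), len(img[0])
    return [
        [img[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


def resize_bilinear(img, out_h, out_w):
    """Bilinear: every output pixel blends up to four source pixels."""
    in_h, in_w = len(img), len(img[0])
    out = []
    for y in range(out_h):
        # Map the output pixel centre back into source coordinates, clamped.
        sy = min(max((y + 0.5) * in_h / out_h - 0.5, 0.0), in_h - 1)
        y0 = int(sy)
        y1 = min(y0 + 1, in_h - 1)
        fy = sy - y0
        row = []
        for x in range(out_w):
            sx = min(max((x + 0.5) * in_w / out_w - 0.5, 0.0), in_w - 1)
            x0 = int(sx)
            x1 = min(x0 + 1, in_w - 1)
            fx = sx - x0
            top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
            bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


sprite = [[0, 255], [255, 0]]  # two-colour checkerboard
palette = {v for row in sprite for v in row}

nearest_values = {v for row in resize_nearest(sprite, 4, 4) for v in row}
bilinear_values = {round(v) for row in resize_bilinear(sprite, 4, 4) for v in row}

print(nearest_values <= palette)   # True: only the original colours survive
print(bilinear_values <= palette)  # False: in-between greys were invented
```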

But here's the trap, and it's the one that bites everyone: the resize at training time and the resize at serving time have to be identical. If you train on crisp nearest-neighbor images and then serve on blurred bilinear ones, the model is seeing a different distribution than it learned. You didn't just lose a little accuracy — you're running a different model than the one you trained.

"A model is only as honest as its preprocessing. Train on crisp pixels, serve on blurred ones, and you've quietly shipped a model you never actually tested."

So I made the resize filter a single source of truth: you choose it once at training time (--interp nearest), it gets stamped into the model checkpoint, the exporter reads it back and bakes it into the preprocessing config, and the browser reads that and resizes the same way. One decision, threaded through the entire chain, impossible to desync.
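One way to picture that single source of truth is a small config blob written once and read everywhere. This is a sketch under assumptions: the field names `interp` and `size` and both helper functions are hypothetical, not the package's actual format.

```python
import json

ALLOWED_INTERP = {"nearest", "bilinear"}


def make_preprocess_config(interp: str, size: int) -> str:
    """Serialize the resize settings chosen at training time."""
    if interp not in ALLOWED_INTERP:
        raise ValueError(f"unknown interpolation: {interp!r}")
    return json.dumps({"interp": interp, "size": size}, sort_keys=True)


def load_preprocess_config(blob: str) -> dict:
    """Read the settings back; fail loudly rather than guess a default."""
    cfg = json.loads(blob)
    if cfg.get("interp") not in ALLOWED_INTERP or not isinstance(cfg.get("size"), int):
        raise ValueError("preprocess config is missing or malformed")
    return cfg


stamped = make_preprocess_config("nearest", 160)  # written at training time
served = load_preprocess_config(stamped)          # read by exporter and runtime
print(served)
```

Refusing to fall back to a default filter is the point: a missing setting should stop the pipeline instead of silently resizing a different way.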

(One sharp edge worth flagging: in the browser, the cleaner-looking createImageBitmap API can do the resize, but its quality setting is browser-dependent — Firefox has historically ignored the "pixelated" hint. For guaranteed nearest-neighbor everywhere, an old-fashioned <canvas> with image smoothing turned off is the reliable tool.)

⚡ Making it fast (and invisible) in the browser

A classifier that freezes the UI is a classifier nobody ships. So the runtime does a few things to stay out of the way:

🧵 Web Worker by default. Inference runs on a background thread, with an automatic fallback to the main thread where workers aren't available. Same API either way.
📦 Batching. Multiple classify() calls within a few milliseconds get coalesced into a single inference, so the fixed per-call overhead is paid once for the whole batch.
🎒 The model travels with the code. It's base64-embedded in the bundle — one fewer network request, and it works offline.
🖥️ GPU when present, CPU when not. The runtime tries WebGPU and gracefully falls back to WASM, so a uint8 model still reaches the GPU on machines that have one.
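The batching idea from the list above can be modelled in a few lines. The real runtime is JavaScript, so this is a Python `asyncio` illustration of the coalescing pattern only. The `MicroBatcher` class and its 5 ms window are assumptions made for the sketch.

```python
import asyncio


class MicroBatcher:
    """Collect calls that arrive within a short window and run them together."""

    def __init__(self, run_batch, window_s: float = 0.005):
        self._run_batch = run_batch  # function: list of items -> list of results
        self._window_s = window_s
        self._pending = []
        self._flush_task = None

    async def classify(self, item):
        loop = asyncio.get_running_loop()
        future = loop.create_future()
        self._pending.append((item, future))
        if self._flush_task is None:  # first call in this window starts the timer
            self._flush_task = loop.create_task(self._flush_later())
        return await future

    async def _flush_later(self):
        await asyncio.sleep(self._window_s)
        batch, self._pending = self._pending, []
        self._flush_task = None
        results = self._run_batch([item for item, _ in batch])
        for (_, future), result in zip(batch, results):
            future.set_result(result)


batch_calls = []


def fake_inference(items):
    """Stand-in for one model run; records how often it is invoked."""
    batch_calls.append(list(items))
    return [f"label:{item}" for item in items]


async def demo():
    batcher = MicroBatcher(fake_inference)
    return await asyncio.gather(*(batcher.classify(name) for name in "abc"))


labels = asyncio.run(demo())
print(labels, len(batch_calls))  # three results from a single inference call
```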

The result of all this (small backbone, 160px, uint8, off-thread) is an inference that lands at ~65 ms on a CPU. Fast enough to check an image the moment it's dropped in.

🐛 The bug that almost broke me: "it never says NSFW"

This is the part I almost left out, because it's embarrassing. But it's also the most useful thing in this whole post, so here it is.

Everything was wired up. I deployed it. And it never flagged anything as NSFW. Not once. Clearly-NSFW pixel art sailed right through as "safe."

My first assumption was the worst-case one: the model never learned, or my training data (photographic) didn't transfer to pixel art at all. A wasted model. But before retraining anything, I wrote a tiny sanity-check script, and it was the single most valuable hour I spent on the project. All it does is run the full-precision model and the quantized model on the same image, with exactly the same preprocessing, in plain Python. No browser, no canvas, no worker. Just: where, precisely, does the answer go wrong?

I ran it on one NSFW sprite. Here's what it told me:

nsfw_pixelart.png
  FP32 : P(nsfw)=0.973   ← the real model is CONFIDENTLY correct
  UINT8: P(nsfw)=0.106   ← the SHIPPED model is confidently WRONG

There it was. The full-precision model was right — 97% sure it was NSFW. The quantized model I'd actually deployed was 89% sure it was safe. Quantization had flipped the answer.
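A comparison like this boils down to a small loop. The sketch below is not the original script. `find_flips`, the 0.5 threshold and the 0.05 drift tolerance are assumptions, and the two predictors are stubs that replay the numbers above plus one made-up agreeing image.

```python
from typing import Callable, Iterable, List, Tuple

Predict = Callable[[str], float]  # image path -> P(nsfw)


def find_flips(
    paths: Iterable[str],
    fp32: Predict,
    quant: Predict,
    threshold: float = 0.5,
    max_drift: float = 0.05,
) -> List[Tuple[str, float, float, bool]]:
    """List images where the two models disagree or drift apart."""
    report = []
    for path in paths:
        p_full, p_quant = fp32(path), quant(path)
        flipped = (p_full >= threshold) != (p_quant >= threshold)
        if flipped or abs(p_full - p_quant) > max_drift:
            report.append((path, p_full, p_quant, flipped))
    return report


# In a real check each predictor would wrap an onnxruntime session, roughly:
#   import onnxruntime as ort
#   session = ort.InferenceSession("model.onnx")  # placeholder path
#   outputs = session.run(None, {session.get_inputs()[0].name: batch})
# Stubs keep this sketch runnable without models or images.
FP32_SCORES = {"nsfw_pixelart.png": 0.973, "sfw_tree.png": 0.020}
UINT8_SCORES = {"nsfw_pixelart.png": 0.106, "sfw_tree.png": 0.030}

report = find_flips(FP32_SCORES, FP32_SCORES.get, UINT8_SCORES.get)
for path, p_full, p_quant, flipped in report:
    print(f"{path}: FP32={p_full:.3f} UINT8={p_quant:.3f} flipped={flipped}")
```

Running it over a handful of known-NSFW and known-safe images after every export or quantization change turns a silent regression into a visible line of output.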

"FP32 said 0.97. The version I shipped said 0.11. The model wasn't wrong — the model I deployed was a different model than the one I trained."

The cause is a classic, and it's specific to this family of networks. MobileNets are built from depthwise convolutions, where each channel has its own little filter and the weight magnitudes vary wildly from channel to channel. I'd quantized with a single scale for each whole weight tensor (per-tensor), which crushed the small-magnitude channels to nothing. The fix is per-channel quantization — give every channel its own scale — which is a single flag (--wasm-quant u8s8, per-channel int8 weights).

🔰 In plain terms: imagine compressing a choir by setting one volume knob for everyone. The loud singers are fine; the quiet ones vanish. Per-channel quantization gives each singer their own knob. For MobileNets, that's the difference between a working model and a broken one.
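The choir analogy can be reproduced with six numbers. This is a toy illustration using symmetric int8 rounding on made-up weights. It is not the ONNX Runtime quantizer and not the exact `u8s8` scheme mentioned above.

```python
from typing import List


def scale_for(values: List[float]) -> float:
    """Scale that maps the largest magnitude onto the int8 limit of 127."""
    return max(abs(v) for v in values) / 127


def quantize_dequantize(values: List[float], scale: float) -> List[float]:
    """Symmetric int8 round trip: snap to the integer grid, then map back."""
    out = []
    for v in values:
        q = max(-127, min(127, round(v / scale)))
        out.append(q * scale)
    return out


loud_channel = [10.0, -8.0, 6.0]     # made-up large-magnitude filter
quiet_channel = [0.02, -0.01, 0.03]  # made-up small-magnitude filter

# Per-tensor: one scale shared by both channels, dictated by the loud one.
shared_scale = scale_for(loud_channel + quiet_channel)
quiet_per_tensor = quantize_dequantize(quiet_channel, shared_scale)

# Per-channel: the quiet channel gets a scale of its own.
quiet_per_channel = quantize_dequantize(quiet_channel, scale_for(quiet_channel))

print(quiet_per_tensor)   # every quiet weight rounds to zero
print(quiet_per_channel)  # values stay close to the originals
```

With the shared scale, one integer step is about 0.079, so weights of size 0.03 fall below half a step and vanish. With its own scale, the quiet channel keeps its shape.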

The deeper lesson wasn't even about quantization. It was that the model you train and the model you deploy are not automatically the same model — export and quantization are transformations that can silently change behavior. A 30-line script that compares them at the boundary is worth more than any amount of staring at training curves.

🏆 The result

Where it all landed:

| | NSFWJS (typical) | This |
| --- | --- | --- |
| Backbone | MobileNetV2 (~3.5M params) | MobileNetV4-conv-small-050 (~0.96M params) |
| Output | 5 categories | binary `sfw` / `nsfw` |
| Domain | photographic / drawn | pixel art |
| Size | several MB | ~1 MB gzipped |
| Inference | - | ~65 ms on CPU (WASM) |
| Runs | in browser | in browser, in a worker, offline |

Roughly 2.5× lighter, single-purpose, on-device, and — critically — trained and served with preprocessing that respects the medium. It ships as a self-contained npm package: npm install, import { classify }, done.

💡 What I'd tell my past self

The reusable lessons, stripped of my specific stack:

🎯 Match the model to the question. A binary problem doesn't need a five-class model; a sprite doesn't need a photo-scale backbone. Smaller-but-fitted beats bigger-but-generic.
🧮 "Smaller file" and "faster" are different wins. 4-bit shrinks giant models that are memory-bound. A 1-MB CNN is compute-bound and already cache-resident — different problem, different tool.
🔗 Preprocessing is part of the model. Train and serve through the identical pipeline, or you're testing one thing and shipping another. Make it a single source of truth.
🧪 Verify at every boundary. Export and quantization can flip your answers. A tiny full-precision-vs-quantized diff script will save you a day of guessing.
🐛 When it "never fires," suspect the transform, not the model. My model was perfect. The 8-bit copy of it wasn't.

🚀 Try it

This powers content moderation on my pixel-art platform, and it's open. If you're building anything that takes user images in the browser — especially in a niche domain where the big general models don't quite fit — I'd genuinely encourage you to consider training a small, fitted model instead of reaching for the default. It's less work than it sounds, and the result is something you actually understand top to bottom.

Build the thing that knows exactly what it needs to know. 🎨🛡️

Happy shipping.

DEV Community