DEV Community

Yuka Kust
I shipped a free AI-art site with a flawed LoRA and ran a 75-image ablation to prove it

TL;DR. I built pinock.io — an endless feed of AI-generated animals in 1960s Soviet matchbox poster style. Free, no signup, no watermark. Under the hood: FLUX.2-klein + a custom LoRA + a two-pass "sandwich" pipeline. I posted it on r/StableDiffusion, got a long technical critique with three specific complaints, and ran a 75-image ablation (5 pipeline variants × 5 categories × 3 seeds) to verify. The critic was right — and the ablation surfaced one finding I did not expect: my LoRA literally renders Cyrillic gibberish into the output at the "textbook-correct" inference settings. This is a postmortem.

Master comparison grid, seed=42 — 5 variants × 5 animals

What pinock.io does

Open the site → see a feed of AI-generated animals in vintage Soviet/Eastern-European matchbox label illustration style. New image every 30 seconds. ~6,700 images so far. You can like, download, share, search ("cat", "owl"), or queue your own one-word prompt. No accounts, no watermarks, no paywalls.

Stack (deliberately tiny so one person can maintain it):

  • Frontend: vanilla JS, Caddy, static
  • Backend: FastAPI + SQLite (WAL mode) on a cheap Ubuntu box
  • FLUX worker: one RTX 3090 on vast.ai (~$0.20/hr), tunneled in via SSH
  • Caption worker: Qwen2.5-VL-7B INT4 on a secondary box
  • Real-ESRGAN x2 for upscaling Hall-of-Fame images
  • Stripe for paid edit-tokens (Gemini 3.1 Flash Image)

Cost per generated image: ~$0.01.

The "two-pass sandwich" — and why it's a hack

Each generation runs two passes:

prompt = "cat"
   │
   ├─ Pass 1: FLUX.2-klein + matchbox LoRA (rank=32, alpha=64, scale=2.0)
   │             text2image, 28 steps
   │             → output_b1 (stylized but with broken anatomy)
   │
   └─ Pass 2: FLUX.2-klein, no LoRA
                 img2img from output_b1, strength=0.9, 28 steps
                 → output_b (final)

Why? I trained the LoRA on ~300 matchbox samples. At lora_scale=1.0 the style was barely visible. At lora_scale=2.0 the style appeared but anatomy broke (extra limbs, fused heads). I patched it: pass-2 takes the broken pass-1 as init and at strength=0.9 essentially redraws the image from scratch, leaving only a low-frequency "style fingerprint." It works empirically.
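For concreteness, the sandwich can be sketched as the pass configuration the worker consumes. This is a minimal sketch in plain dicts; the actual FLUX.2-klein loading and inference code is elided, and the keys mirror the post's parameter names rather than any specific library's API.

```python
def sandwich_passes(prompt: str) -> list[dict]:
    """Return the two pass configs for the 'sandwich' pipeline.

    Pass 1 applies the matchbox LoRA hard (scale=2.0) to stamp the style;
    pass 2 runs img2img without the LoRA at strength=0.9, redrawing almost
    everything and keeping only a low-frequency style fingerprint.
    """
    pass1 = {
        "mode": "text2image",
        "prompt": prompt,
        "lora": "matchbox",   # rank=32, alpha=64
        "lora_scale": 2.0,    # overcooked on purpose; see below
        "steps": 28,
    }
    pass2 = {
        "mode": "img2img",
        "prompt": prompt,
        "init": "output_of_pass1",
        "lora": None,         # LoRA off: let the base model fix anatomy
        "strength": 0.9,      # ~90% of pass 1 is redrawn
        "steps": 28,
    }
    return [pass1, pass2]
```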

It also sounds like a trick.

The Reddit critique that made me sit down

Posted on r/StableDiffusion. Got a long, technically-precise comment from u/DelinquentTuna. Three points:

  1. lora_scale=2.0 over-cooks the LoRA, and you then nuke it with strength=0.9 in pass-2 — you're discarding ~90% of the LoRA's output.
  2. FLUX.2-klein has native edit/style-transfer features. The critic ran my images through them on his 4080 16GB and got 4× larger output (1024×1024) with more cohesive style in 9 seconds. Use the edit feature, not my handrolled i2i.
  3. ~300 examples is too few for matchbox aesthetic (halftone, limited palette, lithographic textures). You need 5× the dataset and proper captions.

All three were technically correct. I sat down to ablate.

The ablation — 5 variants × 5 animals × 3 seeds = 75 images

Tested on the prod rig (RTX 3090 + FLUX.2-klein + matchbox LoRA, same stack as production). Two tmux scripts, ~30 minutes total, results gridded with PIL.

  • A: pure FLUX, no LoRA, bare prompt (baseline)
  • B: LoRA t2i pass-1 snapshot, i.e. the raw LoRA before the "sandwich" pass-2 nukes it (lora_scale=2.0, prompt="cat")
  • C: current production sandwich (lora=2.0, pass2_strength=0.9)
  • D: single-pass with style prompt, critic's suggestion #1 (lora=1.0, prompt="cat, matchbox poster style, 1960s Soviet, woodcut, halftone, limited red-black palette")
  • E: edit-style, pure FLUX → img2img with style prompt, critic's suggestion #2 (init=A, lora=1.0, strength=0.5)

Categories: cat, fox, owl, lion, wolf. Seeds: 42, 1337, 80085 (chosen before runs; three repeats to catch seed-dependence).
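Enumerating the run matrix makes the bookkeeping trivial; a sketch, with variant letters, categories, and seeds exactly as above:

```python
from itertools import product

VARIANTS = ["A", "B", "C", "D", "E"]
CATEGORIES = ["cat", "fox", "owl", "lion", "wolf"]
SEEDS = [42, 1337, 80085]  # fixed before any runs to avoid cherry-picking

# One job per (variant, category, seed) cell of the ablation grid.
jobs = [
    {"variant": v, "category": c, "seed": s}
    for v, c, s in product(VARIANTS, CATEGORIES, SEEDS)
]
assert len(jobs) == 5 * 5 * 3  # 75 images total
```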

Findings, in order of how much they hurt

Variant B — LoRA at scale=2.0, bare prompt (snapshot)

Total collapse. On every seed, all 5 categories look almost identical — colored texture noise:

  • seed=42: red-orange wavy stripes
  • seed=1337: green "forest noise"
  • seed=80085: gold smear

No anatomy. The LoRA at scale=2.0 does not generate animals. It generates poster-texture, because I overcooked the inference weight. Which is exactly why I invented the sandwich — I was watching this catastrophe and trying to hide it behind pass-2.

The critic saw it instantly. I did not.

Variant D — single-pass with style prompt at scale=1.0 (suggestion #1)

A different kind of catastrophe. On seed=42, several output images contain literal Cyrillic gibberish text: "СТАДИНАМ" or similar, baked into the image. On seed=1337, all 5 categories collapse into nearly-identical "red silhouette on dark" compositions. On seed=80085, again all 5 collapse to "red silhouette on white."

What happened: the training set (~300 examples) included Soviet posters with Cyrillic text and red dominant backgrounds. At lora_scale=1.0 plus a long, "correct" style-prompt, the LoRA starts recalling whole posters from training rather than transferring style. Textbook training-set leakage.

This is the most interesting observation in the series. The critic's advice — "use scale=1.0 with a proper style-prompt" — is theoretically right, but on this LoRA it just exposes how badly it's overfit to specific training examples.

Variant E — edit-style refinement (suggestion #2)

Style barely visible. At strength=0.5 + lora=1.0 the LoRA can't punch through the FLUX prior. Output looks like A with a faint illustrative tint. Not matchbox.

To get the style to come through I'd need strength≥0.7 — which lands us back in i2i sandwich territory, where the same Cyrillic leakage and silhouette collapse reappear via img2img.

Variant C — current sandwich

Works adequately. Recognizable animals with visible matchbox aesthetic: woodcut linework, halftone backgrounds, limited palette, sometimes Morris-style floral patterns. Stable across all 3 seeds.

Mechanism: pass-2 at strength=0.9 takes the broken pass-1 (B), adds 90% noise, redraws. From pass-1 only a low-frequency signal survives — overall composition and color profile. That injects style without leaving room for anatomy to break.
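The "adds 90% noise" claim follows from how img2img strength is conventionally implemented: it sets how far into the noise schedule the init image is pushed, and therefore how many denoising steps actually run. A minimal sketch of that mapping (the common diffusers-style convention, not FLUX-specific code):

```python
def img2img_schedule(num_steps: int, strength: float) -> tuple[int, int]:
    """Return (steps_actually_run, steps_skipped) for img2img.

    The init image is noised to timestep ~strength * num_steps and denoised
    from there: strength=0.9 with 28 steps re-runs 25 of them, so almost
    the whole image is redrawn and only low-frequency structure
    (composition, palette) survives from the init.
    """
    steps_run = min(int(num_steps * strength), num_steps)
    return steps_run, num_steps - steps_run
```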

The headline conclusion

The current sandwich (C) wins this matchup — but it's a patch on top of a poorly-trained LoRA, not the right architecture.

All three "alternative" approaches (B raw, D single-pass-styled, E edit-style) revealed the same underlying problem: the LoRA at scale=1.0 tries to reproduce training set examples wholesale instead of transferring style. The sandwich works precisely because pass-2 at strength=0.9 burns that memorized content down to a low-frequency residual.

So:

  1. Critic's suggestion #1 (single-pass + scale=1.0 + style-prompt) is theoretically right but on this LoRA produces worse results than the sandwich, because it triggers leakage.
  2. Critic's suggestion #2 (edit features) doesn't bite at moderate strength and reverts to leakage at high strength.
  3. Critic's suggestion #3 (5× the dataset, cleaner captions) is the only real fix. And it's exactly what I didn't do.

What's next

  1. Rebuild the dataset to 1500+ images. No Cyrillic at all (or behind a separate "soviet-text" token if it ever has to come back). Hard filters: halftone present, limited palette (≤5 colors), flat geometry. Captions via Qwen2.5-VL using a template like matchbox poster of a {category}, {dominant colors}, {composition}, woodcut linework.

  2. Retrain at rank 32, targeting attention+MLP modules instead of attention-only. The current LoRA touches only the attention blocks, which is too narrow for compositional features (woodcut, halftone); the MLP modules give the style more room.

  3. After v2 — re-run the same ablation. If single-pass at scale=1.0 + style-prompt produces clean recognizable animals on v2, the sandwich gets deleted. Generation time drops from ~30s to ~10-15s. I can crank resolution from 512 to 1024 (the 3090 has the headroom). The VAE round-trip between passes (currently saving pass-1 to JPEG and reading back) goes away too.
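The caption template from step 1 is easy to pin down as code. In practice the color and composition slots would be filled by the Qwen2.5-VL captioner; make_caption here is a hypothetical helper, not production code:

```python
CAPTION_TEMPLATE = (
    "matchbox poster of a {category}, {colors}, {composition}, woodcut linework"
)

def make_caption(category: str, colors: list[str], composition: str) -> str:
    """Render a training caption; `colors` is capped at 5 to match the
    limited-palette filter (<=5 colors) on the rebuilt dataset."""
    if len(colors) > 5:
        raise ValueError("palette filter: at most 5 dominant colors")
    return CAPTION_TEMPLATE.format(
        category=category,
        colors=" and ".join(colors),
        composition=composition,
    )
```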

Side findings worth a paragraph each

FastAPI + SQLite + cursor pagination in search. The search endpoint originally hard-capped output at 60 results — 581 cats in the database, but the frontend only ever saw 60. Added ?cursor=<id> (filter id < cursor, ORDER BY id DESC), and disabled auto-generation on paginated requests so the queue isn't flooded by pagination.

Auto-prompt variety. For automated generation (when the queue is empty), I added three pools — adjectives (proud, fierce, sleepy…), actions (running, perched, watching…), scenes (in winter forest, at sunset…) — with a 55/20/15/10 distribution: 55% bare category name, 20% adj+animal, 15% animal+action, 10% animal+scene. Before this, all "cat" auto-generations looked the same.
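That split is a plain weighted choice. A sketch with truncated stand-in pools (the production pools are longer):

```python
import random

ADJECTIVES = ["proud", "fierce", "sleepy"]
ACTIONS = ["running", "perched", "watching"]
SCENES = ["in winter forest", "at sunset"]

def auto_prompt(category: str, rng: random.Random) -> str:
    """Pick one of four prompt shapes with the 55/20/15/10 split."""
    shape = rng.choices(
        ["bare", "adj", "action", "scene"],
        weights=[55, 20, 15, 10],
    )[0]
    if shape == "adj":
        return f"{rng.choice(ADJECTIVES)} {category}"
    if shape == "action":
        return f"{category} {rng.choice(ACTIONS)}"
    if shape == "scene":
        return f"{category} {rng.choice(SCENES)}"
    return category
```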

Real cost. vast.ai 3090 ~$0.20/hr → ~$5/day → at ~1500 images/day = $0.003/image GPU cost. Plus backend/storage ~$2/day. Total <$0.01 per image at current scale.

What I take from this

  1. "Empirically works" is not the same as "optimal." I picked the sandwich by trial and error and stopped questioning it. I never asked "why did I have to crank scale to 2.0 in the first place?" The Reddit critic asked.
  2. Ablation should be day-one. 5 variants × 3 seeds is 15 runs — roughly 15 minutes on a borrowed GPU. I would not have shipped the sandwich as "the solution" if I'd done this.
  3. External criticism is the cheapest source of truth. A month ago I would have second-guessed posting. One Reddit post and one long comment from a stranger who ran his own parallel work on a 4080 changed the entire architecture plan.
  4. Training-set leakage is not theoretical. In my case it manifested as literal Cyrillic letters in the output. If I'd only ever inspected the sandwich result (where the leakage is hidden), I would never have seen it.

Links

If you train v2 LoRAs on small datasets and have advice on how to avoid the training-set-leakage trap I fell into, I'm all ears in comments. Especially curious whether anyone has seen text-leakage manifest this literally before.
