DEV Community: Yuka Kust

We trained a personal voice DoRA on Qwen3-8B for $1.50 — beat stock model 100% in blind A/B

Yuka Kust — Mon, 25 May 2026 11:59:45 +0000

TL;DR. We trained a DoRA adapter on Qwen3-8B using 6128 message pairs from one person's Telegram export. Cost: $1.50 on a single Vast.ai RTX 3090. In blind head-to-head A/B, the DoRA-tuned model beat stock Qwen3-8B 100% of the time. Zero catastrophic forgetting on 50 general-knowledge tasks. On one prompt, the rater picked the model's reply over her own real reply as sounding more like her.

Full long-form write-up lives on the canonical URL: aiconic.company/en/journal/dora-personal-voice. This post is the dev.to-flavored version with the practical bits.

What we did

Took one person's Telegram export (DataExport JSON, 1047 personal chats), wrote a custom pairs extractor (other_person_message, author_reply), capped 12 pairs per chat so a few active chats don't dominate, deduplicated. Final dataset: 6128 train + 322 valid pairs.
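A minimal sketch of that extraction step, not the actual extractor. It assumes the export's chat objects carry a `messages` list whose items have `type`, `from_id`, and `text` fields, and that `text` is either a string or a list mixing strings and `{"text": ...}` entities; check those against your own export. The function names and the `cap_per_chat` parameter are hypothetical.

```python
def flatten_text(text):
    """Join a Telegram-style text field (str, or list of str/entity dicts) into one string."""
    if isinstance(text, str):
        return text
    return "".join(p if isinstance(p, str) else p.get("text", "") for p in text)


def extract_pairs(chats, author_id, cap_per_chat=12):
    """Return deduplicated (other_person_message, author_reply) pairs.

    A pair is an author message that directly follows a message from someone else.
    At most cap_per_chat pairs are kept per chat so busy chats do not dominate.
    """
    seen = set()
    pairs = []
    for chat in chats:
        kept = 0
        prev = None  # last non-empty message seen in this chat
        for msg in chat.get("messages", []):
            if msg.get("type") != "message":
                continue  # skip service events (joins, calls, ...)
            text = flatten_text(msg.get("text", "")).strip()
            if not text:
                continue  # skip media-only messages
            sender = msg.get("from_id")
            if prev is not None and sender == author_id and prev[0] != author_id:
                pair = (prev[1], text)
                if pair not in seen and kept < cap_per_chat:
                    seen.add(pair)
                    pairs.append(pair)
                    kept += 1
            prev = (sender, text)
    return pairs
```

Deduplication here is global across chats, which also removes stock exchanges ("hi" / "hey") that repeat in many conversations. Whether the original extractor dedups per chat or globally is not stated.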

Trained a DoRA adapter on top of Qwen/Qwen3-8B. DoRA (Weight-Decomposed Low-Rank Adaptation, Liu et al. 2024) decomposes pretrained weights into magnitude and direction, then applies LoRA-style updates only to the direction component while learning magnitude as a separate trainable vector. In practice it matches full fine-tuning more closely than LoRA at the same rank.
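To make the decomposition concrete, here is a toy pure-Python sketch of the merged weight W' = m * (W0 + B·A) / ||W0 + B·A||, with the norm taken per column as in the paper's notation. It is illustrative only: matrices are lists of rows, the helper names are made up, and the real PEFT implementation works on tensors and differs in details.

```python
import math


def column_norms(W):
    """L2 norm of each column of a matrix stored as a list of rows."""
    return [math.sqrt(sum(row[j] ** 2 for row in W)) for j in range(len(W[0]))]


def matmul(B, A):
    """Plain matrix product of B (d x r) and A (r x k)."""
    return [[sum(B[i][t] * A[t][j] for t in range(len(A)))
             for j in range(len(A[0]))] for i in range(len(B))]


def dora_weight(W0, B, A, m):
    """Merged DoRA weight: direction from W0 + B@A, magnitude from the vector m."""
    delta = matmul(B, A)
    V = [[W0[i][j] + delta[i][j] for j in range(len(W0[0]))]
         for i in range(len(W0))]
    norms = column_norms(V)
    # Normalize each column to unit length, then rescale by its learned magnitude.
    return [[m[j] * V[i][j] / norms[j] for j in range(len(V[0]))]
            for i in range(len(V))]
```

With B initialized to zeros and m set to the column norms of W0, `dora_weight` returns W0 unchanged, so training starts from the pretrained model. After that, B and A steer direction while m scales each column independently.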

The training config

from peft import LoraConfig
from transformers import TrainingArguments

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    use_dora=True,           # the only line that turns LoRA into DoRA
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_steps=50,
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,   # effective batch = 16
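    # max_seq_length=1024 belongs on TRL's SFTConfig/SFTTrainer (named
    # max_length in newer TRL releases); TrainingArguments has no such field.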
    bf16=True,
    gradient_checkpointing=True,
    optim="adamw_torch_fused",
)

Trainable params: ~30M / 8B = 0.4%. Adapter file on disk: 63 MB. Total wall time: 3.5h on a single Vast.ai RTX 3090 spot (~$0.30/h, ~$1.50 total).

Critical detail: apply loss only on the author's assistant tokens, not on the prompt. Without this mask the model spends half its capacity learning what other people say to you, which dilutes voice signal noticeably. Non-optional for personal voice work.
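The mask itself is small. A minimal sketch, assuming you already have the token ids of the prompt (everything up to and including the assistant header) and of the author's reply as two separate lists. The function name is hypothetical; `-100` is the label value that PyTorch's cross-entropy loss ignores by default, which is also the Hugging Face convention.

```python
IGNORE_INDEX = -100  # label value skipped by the loss


def build_labels(prompt_ids, reply_ids):
    """Concatenate prompt and reply; supervise only the reply tokens."""
    input_ids = list(prompt_ids) + list(reply_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(reply_ids)
    return input_ids, labels
```

The model still attends to the other person's message as context. It just receives no gradient for predicting it.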

The evaluation (blind 3-way A/B)

Loss numbers are useless for personal voice. The relevant question is whether a human who knows you thinks it sounds like you. So:

30 hold-out prompts — real recent messages from real people, where we knew what the author actually replied. Held out of train.
Three responses per prompt: stock Qwen3-8B reply, DoRA reply, real human reply.
Randomized A/B/C labels per prompt. secret.json mapped labels back to sources and was kept hidden from the rater.
HTML rating UI asking "which one sounds most like you?"
Catastrophic forgetting check: separate 50-task suite (capitals, math, code, translations).
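The blinding step can be sketched as follows. This is an illustration under assumptions, not the actual eval harness: the source keys (`stock`, `dora`, `human`), the function names, and the shape of the input dict are all made up here.

```python
import random

SOURCES = ("stock", "dora", "human")
LABELS = ("A", "B", "C")


def blind_prompt(replies, rng):
    """Shuffle the three sources behind labels A/B/C for one prompt.

    Returns (shown, secret): shown maps label -> reply text for the rating UI,
    secret maps label -> source and must stay away from the rater.
    """
    order = list(SOURCES)
    rng.shuffle(order)
    shown = {label: replies[src] for label, src in zip(LABELS, order)}
    secret = dict(zip(LABELS, order))
    return shown, secret


def build_rating_set(items, seed=0):
    """items maps prompt id -> {source: reply}. Returns (rating, secret) dicts."""
    rng = random.Random(seed)  # fixed seed keeps the blinding reproducible
    rating, secret = {}, {}
    for pid, replies in items.items():
        rating[pid], secret[pid] = blind_prompt(replies, rng)
    return rating, secret
```

Dump `secret` to its own file (for example with `json.dump`) and only join it with the rater's picks after rating is finished. Shuffling per prompt matters: with a fixed order the rater learns which slot is the model within a few prompts.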

Results

Comparison	Result
DoRA vs stock (head-to-head)	DoRA 100%
Full 3-way (real / DoRA / stock)	Real 71% / DoRA 29% / Stock 0%
One specific prompt (p07)	DoRA beat the real human
Catastrophic forgetting	0 pp (stock 49/50, DoRA 49/50)

The p07 case is the one that gets me. Author looked at her own real reply, looked at DoRA, picked DoRA over herself. Her comment: "Honestly the DoRA one sounds more like a representative thing I'd say than what I actually wrote that day."

Reading it as: DoRA samples from a smoothed manifold of typical replies and can produce a closer-to-mean instance than the human did on a specific Tuesday afternoon.

What broke (so you don't waste an evening)

1. `enable_thinking=False` is mandatory

Qwen3 is a reasoning model by default: it emits <think>...</think> traces before its final answer. Chat training data has none. At inference, the base prior pulls toward reasoning prefixes while the DoRA pulls toward chat style, so the output ends up as a Frankenstein of reasoning trace plus short colloquial reply.

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    enable_thinking=False,   # MANDATORY for chat-style adapters
    return_tensors="pt",
)

If you're training a chat-style adapter on Qwen3, set this in your training data tokenization too — aligns training prefix with inference prefix and probably helps eval loss further.

2. transformers version dance

Qwen3 lands in 4.51. 4.55+ wants torch ≥2.5. Working pin for Vast 3090 image: transformers==4.53.0. Boring but cost two hours.

3. Cerebras can't load adapters

Cerebras hosted inference (where we run prod) does not support runtime LoRA/DoRA loading. So this adapter is a research artifact for us, not a prod swap. For prod personalization, either self-host on vLLM (~$300/mo for a single 3090 running 24/7) or stay on a hosted backbone + system prompt + RAG. We ship the latter today; the DoRA result convinces us self-hosting is worth building once user demand justifies it.

Reproducibility

Adapter on HuggingFace: aiconiccompany/yuka-dora-v1 (gated CC BY-NC 4.0 because training data is one person's private chats).

Hardware to reproduce on your own messages:

Single RTX 3090 (24 GB VRAM) — about $0.30/h on Vast.ai
3.5 hours of GPU time
Your own Telegram export (Settings → Advanced → Export Telegram data → JSON)
~6000 message pairs for solid voice capture, 1000 minimum

Total cost on your own messaging history: $1–$3.

Why it matters

The thesis we keep restating: the right granularity of personalization is the individual, not the segment. Companies have been trying personalized AI by clustering users into 50 personas and routing to slightly-tuned base models. That's segment-level. The destination is one small adapter per user, trained on their own continuous data stream, owned by the user.

yuka-dora-v1 is the first concrete piece of evidence we have that the unit economics work: $1.50 of GPU time turns a frontier model into your specific voice with no measurable capability loss. Multiply by users-who-would-pay for personalized AI and the cost structure starts looking very different from "rent OpenAI by the token."

Full write-up

The long version with the full code, the loss curve, the complete p07 sample, the v2 backlog, and the bigger personal-AI thesis lives on the canonical:

→ aiconic.company/en/journal/dora-personal-voice

If you want a custom DoRA trained for your product (voice-of-the-brand, customer-support style, founder-voice): hi@aiconic.company.

Otherwise — train one for yourself. The README is there. The GPU is cheap. The result is worth it.

Aiconic is a research-grade AI engineering shop. Three engineers, AI tooling. Custom adapters, personal AI engines, production ML systems. aiconic.company

I shipped a free AI-art site with a flawed LoRA and ran a 75-image ablation to prove it

Yuka Kust — Tue, 05 May 2026 21:15:35 +0000

TL;DR. I built pinock.io — an endless feed of AI-generated animals in 1960s Soviet matchbox poster style. Free, no signup, no watermark. Under the hood: FLUX.2-klein + a custom LoRA + a two-pass "sandwich" pipeline. I posted it on r/StableDiffusion, got a long technical critique with three specific complaints, and ran a 75-image ablation (5 pipeline variants × 5 categories × 3 seeds) to verify. The critic was right — and the ablation surfaced one finding I did not expect: my LoRA literally renders Cyrillic gibberish into the output at the "textbook-correct" inference settings. This is a postmortem.

What pinock.io does

Open the site → see a feed of AI-generated animals in vintage Soviet/Eastern-European matchbox label illustration style. New image every 30 seconds. ~6,700 images so far. You can like, download, share, search ("cat", "owl"), or queue your own one-word prompt. No accounts, no watermarks, no paywalls.

Stack (deliberately tiny so one person can maintain it):

Frontend: vanilla JS, Caddy, static
Backend: FastAPI + SQLite (WAL mode) on a cheap Ubuntu box
FLUX worker: one RTX 3090 on vast.ai (~$0.20/hr), tunneled in via SSH
Caption worker: Qwen2.5-VL-7B INT4 on a secondary box
Real-ESRGAN x2 for upscaling Hall-of-Fame images
Stripe for paid edit-tokens (Gemini 3.1 Flash Image)

Cost per generated image: ~$0.01.

The "two-pass sandwich" — and why it's a hack

Each generation runs two passes:

prompt = "cat"
   │
   ├─ Pass 1: FLUX.2-klein + matchbox LoRA (rank=32, alpha=64, scale=2.0)
   │             text2image, 28 steps
   │             → output_b1 (stylized but with broken anatomy)
   │
   └─ Pass 2: FLUX.2-klein, no LoRA
                 img2img from output_b1, strength=0.9, 28 steps
                 → output_b (final)

Why? I trained the LoRA on ~300 matchbox samples. At lora_scale=1.0 the style was barely visible. At lora_scale=2.0 the style appeared but anatomy broke (extra limbs, fused heads). I patched it: pass-2 takes the broken pass-1 as init and at strength=0.9 essentially redraws the image from scratch, leaving only a low-frequency "style fingerprint." It works empirically.

It also sounds like a trick.

The Reddit critique that made me sit down

Posted on r/StableDiffusion. Got a long, technically-precise comment from u/DelinquentTuna. Three points:

1. lora_scale=2.0 over-cooks the LoRA, and you then nuke it with strength=0.9 in pass-2: you're discarding ~90% of the LoRA's output.
2. FLUX.2-klein has native edit/style-transfer features. I (the critic) ran your images through it on a 4080 16GB and got 4× larger output (1024×1024) in 9 seconds with more cohesive style. Use the edit feature, not your handrolled i2i.
3. ~300 examples is too few for matchbox aesthetic (halftone, limited palette, lithographic textures). You need 5× the dataset and proper captions.

All three were technically correct. I sat down to ablate.

The ablation — 5 variants × 5 animals × 3 seeds = 75 images

Tested on the prod rig (RTX 3090 + FLUX.2-klein + matchbox LoRA, same stack as production). Two tmux scripts, ~30 minutes total, results gridded with PIL.

Code	Description	Params
A	Pure FLUX, no LoRA, bare prompt	baseline
B	LoRA t2i pass-1 snapshot (raw LoRA before "sandwich" pass-2 nukes it)	lora_scale=2.0, prompt="cat"
C	Current production sandwich	lora=2.0, pass2_strength=0.9
D	Single-pass with style prompt (critic's suggestion #1)	lora=1.0, prompt="cat, matchbox poster style, 1960s Soviet, woodcut, halftone, limited red-black palette"
E	Edit-style: pure FLUX → img2img with style prompt (critic's suggestion #2)	init=A, lora=1.0, strength=0.5

Categories: cat, fox, owl, lion, wolf. Seeds: 42, 1337, 80085 (chosen before runs; three repeats to catch seed-dependence).
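The grid is easy to enumerate up front so that every variant sees the same categories and seeds. A minimal sketch; the job dict layout and the filename pattern are hypothetical, only the variant codes, categories, and seeds come from the setup above.

```python
from itertools import product

VARIANTS = ["A", "B", "C", "D", "E"]
CATEGORIES = ["cat", "fox", "owl", "lion", "wolf"]
SEEDS = [42, 1337, 80085]


def ablation_jobs():
    """One job per (variant, category, seed) combination."""
    return [
        {"variant": v, "category": c, "seed": s, "filename": f"{v}_{c}_{s}.png"}
        for v, c, s in product(VARIANTS, CATEGORIES, SEEDS)
    ]
```

`len(ablation_jobs())` is 5 × 5 × 3 = 75. Encoding all three axes in the filename makes it simple to build one contact sheet per seed afterwards.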

Findings, in order of how much they hurt

Variant B — LoRA at scale=2.0, bare prompt (snapshot)

Total collapse. On every seed, all 5 categories look almost identical — colored texture noise:

seed=42: red-orange wavy stripes
seed=1337: green "forest noise"
seed=80085: gold smear

No anatomy. The LoRA at scale=2.0 does not generate animals. It generates poster-texture, because I overcooked the inference weight. Which is exactly why I invented the sandwich — I was watching this catastrophe and trying to hide it behind pass-2.

The critic saw it instantly. I did not.

Variant D — single-pass with style prompt at scale=1.0 (suggestion #1)

A different kind of catastrophe. On seed=42, several output images contain literal Cyrillic gibberish text: "СТАДИНАМ" or similar, baked into the image. On seed=1337, all 5 categories collapse into nearly-identical "red silhouette on dark" compositions. On seed=80085, again all 5 collapse to "red silhouette on white."

What happened: the training set (~300 examples) included Soviet posters with Cyrillic text and red dominant backgrounds. At lora_scale=1.0 plus a long, "correct" style-prompt, the LoRA starts recalling whole posters from training rather than transferring style. Textbook training-set leakage.

This is the most interesting observation in the series. The critic's advice — "use scale=1.0 with a proper style-prompt" — is theoretically right, but on this LoRA it just exposes how badly it's overfit to specific training examples.

Variant E — edit-style refinement (suggestion #2)

Style barely visible. At strength=0.5 + lora=1.0 the LoRA can't punch through the FLUX prior. Output looks like A with a faint illustrative tint. Not matchbox.

To get the style to come through I'd need strength≥0.7 — which lands us back in i2i sandwich territory, where the same Cyrillic / collapse will reappear via img2img.

Variant C — current sandwich

Works adequately. Recognizable animals with visible matchbox aesthetic: woodcut linework, halftone backgrounds, limited palette, sometimes Morris-style floral patterns. Stable across all 3 seeds.

Mechanism: pass-2 at strength=0.9 takes the broken pass-1 (B), noises it most of the way back (about 90% of the schedule), and redraws. Only a low-frequency signal survives from pass-1: overall composition and color profile. That injects style without leaving room for anatomy to break.
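For intuition on what strength means in step terms, here is a sketch of one common diffusers-style convention: img2img skips the early part of the schedule and runs roughly `steps × strength` denoising steps. Treat it as an assumption. Exact rounding differs between pipelines, and whether the pinock worker follows this convention is not stated.

```python
def img2img_schedule(num_inference_steps, strength):
    """Return (steps_skipped, steps_run) for a given img2img strength."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    steps_run = min(int(num_inference_steps * strength), num_inference_steps)
    steps_skipped = max(num_inference_steps - steps_run, 0)
    return steps_skipped, steps_run
```

Under this convention, 28 steps at strength=0.9 skips 3 steps and runs 25, while strength=0.5 runs only 14. That fits the observations: at 0.9 almost the entire image is regenerated, at 0.5 half the structure of the init image is kept.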

The headline conclusion

The current sandwich (C) wins this matchup — but it's a patch on top of a poorly-trained LoRA, not the right architecture.

All three "alternative" approaches (B raw, D single-pass-styled, E edit-style) point at the same underlying problem: a poorly trained LoRA. At scale=2.0 it collapses into texture (B); at scale=1.0 it either barely registers (E) or reproduces training-set examples wholesale instead of transferring style (D). The sandwich works precisely because pass-2 at strength=0.9 burns that memorized content down to a low-frequency residual.

So:

Critic's suggestion #1 (single-pass + scale=1.0 + style-prompt) is theoretically right but on this LoRA produces worse results than the sandwich, because it triggers leakage.
Critic's suggestion #2 (edit features) doesn't bite at moderate strength and reverts to leakage at high strength.
Critic's suggestion #3 (5× the dataset, cleaner captions) is the only real fix. And it's exactly what I didn't do.

What's next

Rebuild the dataset to 1500+ images. No Cyrillic at all (or behind a separate "soviet-text" token if it ever has to come back). Hard filters: halftone present, limited palette (≤5 colors), flat geometry. Captions via Qwen2.5-VL using a template like matchbox poster of a {category}, {dominant colors}, {composition}, woodcut linework.
Retrain on rank 32 + attention+MLP modules, not attention-only. The current LoRA only touches attention blocks, which is too narrow for compositional features (woodcut, halftone). MLP gives more "room" for style.
After v2 — re-run the same ablation. If single-pass at scale=1.0 + style-prompt produces clean recognizable animals on v2, the sandwich gets deleted. Generation time drops from ~30s to ~10-15s. I can crank resolution from 512 to 1024 (the 3090 has the headroom). The VAE round-trip between passes (currently saving pass-1 to JPEG and reading back) goes away too.

Side findings worth a paragraph each

FastAPI + SQLite + cursor pagination in search. The search endpoint originally hard-capped output at 60 results — 581 cats in the database, but the frontend only ever saw 60. Added ?cursor=<id> (filter id < cursor, ORDER BY id DESC), and disabled auto-generation on paginated requests so the queue isn't flooded by pagination.
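A minimal sketch of that keyset pagination with the standard-library `sqlite3` module. The `images` table, its `id` and `prompt` columns, and the `LIKE` match are assumptions for illustration; the real schema and search logic are not shown in this post.

```python
import sqlite3

PAGE_SIZE = 60


def search_page(conn, term, cursor=None, limit=PAGE_SIZE):
    """Return (rows, next_cursor) for one page of search results, newest first."""
    sql = "SELECT id, prompt FROM images WHERE prompt LIKE ?"
    params = [f"%{term}%"]
    if cursor is not None:
        sql += " AND id < ?"  # continue strictly below the last id already seen
        params.append(cursor)
    sql += " ORDER BY id DESC LIMIT ?"
    params.append(limit)
    rows = conn.execute(sql, params).fetchall()
    # A full page suggests more rows may exist; hand back the last id as cursor.
    next_cursor = rows[-1][0] if len(rows) == limit else None
    return rows, next_cursor
```

Unlike OFFSET, the `id < cursor` filter stays stable while new images keep landing at the top of the feed, so pages never shift or repeat.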

Auto-prompt variety. For automated generation (when the queue is empty), I added three pools — adjectives (proud, fierce, sleepy…), actions (running, perched, watching…), scenes (in winter forest, at sunset…) — with a 55/20/15/10 distribution: 55% bare category name, 20% adj+animal, 15% animal+action, 10% animal+scene. Before this, all "cat" auto-generations looked the same.
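The 55/20/15/10 split maps directly onto weighted sampling. A sketch with placeholder pools (only the few words listed above; the real pools are longer) and hypothetical names:

```python
import random

ADJECTIVES = ["proud", "fierce", "sleepy"]
ACTIONS = ["running", "perched", "watching"]
SCENES = ["in winter forest", "at sunset"]

KINDS = ["bare", "adj", "action", "scene"]
WEIGHTS = [55, 20, 15, 10]  # percent of auto-generations per kind


def auto_prompt(category, rng):
    """Pick a prompt shape by weight, then fill it from the matching pool."""
    kind = rng.choices(KINDS, weights=WEIGHTS, k=1)[0]
    if kind == "adj":
        return f"{rng.choice(ADJECTIVES)} {category}"
    if kind == "action":
        return f"{category} {rng.choice(ACTIONS)}"
    if kind == "scene":
        return f"{category} {rng.choice(SCENES)}"
    return category  # bare category name
```

Call it as `auto_prompt("cat", random.Random())`. Passing the generator in keeps the function deterministic under a seeded `random.Random`, which helps when debugging a run of near-identical images.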

Real cost. vast.ai 3090 ~$0.20/hr → ~$5/day → at ~1500 images/day = $0.003/image GPU cost. Plus backend/storage ~$2/day. Total <$0.01 per image at current scale.
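The arithmetic behind those numbers, spelled out. The inputs are the figures quoted above; the variable names are just for illustration.

```python
GPU_PER_HOUR = 0.20      # vast.ai 3090, USD
BACKEND_PER_DAY = 2.00   # backend + storage, USD
IMAGES_PER_DAY = 1500

gpu_per_day = GPU_PER_HOUR * 24                      # 4.80, rounded to ~$5 in the text
gpu_per_image = gpu_per_day / IMAGES_PER_DAY         # about 0.0032
total_per_image = (gpu_per_day + BACKEND_PER_DAY) / IMAGES_PER_DAY  # about 0.0045
```

Both per-image figures sit under one cent, and the total scales inversely with daily volume: halve the images per day and the cost per image doubles, since the GPU is rented by the hour either way.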

What I take from this

"Empirically works" is not the same as "optimal." I picked the sandwich by trial and error and stopped questioning it. I never asked "why did I have to crank scale to 2.0 in the first place?" The Reddit critic asked.
Ablation should be day-one. A grid of 5 variants × 3 seeds is about 15 minutes on a borrowed GPU. I would not have shipped the sandwich as "the solution" if I'd done this.
External criticism is the cheapest source of truth. A month ago I would have second-guessed posting. One Reddit post and one long comment from a stranger who ran his own parallel work on a 4080 changed the entire architecture plan.
Training-set leakage is not theoretical. In my case it manifested as literal Cyrillic letters in the output. If I'd only ever inspected the sandwich result (where the leakage is hidden), I would never have seen it.

Links

pinock.io — https://pinock.io
LoRA on HuggingFace — yukakst/pinock-matchbox-flux2-klein
HuggingFace Space (live demo) — yukakst/pinock-matchbox-demo
LoRA on Civitai — civitai.com/models/2598394
Original Russian writeup on Habr (with full Cyrillic example screenshots) — habr.com/ru/articles/1031338/
Reddit thread with the original critique — r/StableDiffusion

If you train v2 LoRAs on small datasets and have advice on how to avoid the training-set-leakage trap I fell into, I'm all ears in comments. Especially curious whether anyone has seen text-leakage manifest this literally before.