[Day 5] I Trained My Cat-LoRA on 22 vs 213 Photos and the Results Were Basically Identical

PEPPERCORN

Intro

Day 5!

Today was originally going to be "have AI analyze a year of my Amazon order history," but downloading the Amazon purchase history just wouldn't work no matter what I tried. So that was a bust.

Pivoted.

On Day 2, I trained an AI to memorize my cat from 22 photos (Day 2 article). That thing is called a "LoRA."

What's a LoRA? = A small add-on that teaches an AI to recognize a specific subject. Pair photos with a trigger word like ohwx cat, train, and then writing ohwx cat in any prompt makes the AI draw my cat.

On Day 4, I had AI sort through 25,000 photos on my iPhone (Day 4 article). It found 999 photos it identified as cats.

Today's experiment: Will using those 999 photos make my cat-LoRA stronger?

A simple expectation, really. 22 photos → 999 photos is 45x more data. Surely the LoRA gets stronger, right?


TL;DR

The spoilers, up front:

  • Training with 999 photos made things worse, not better
  • After removing "other people's cats" from the dataset (down to 213 photos), I got LoRA quality matching my original 22-photo version
  • 22 photos and 213 photos produced basically the same quality

I came in thinking "more photos = stronger LoRA." Turns out that's not really how it works, and today I learned why.


What I actually did

Trained on 999 photos → got worse (v2)

Same base model and trigger word (ohwx cat) as Day 2. Just bumped the photo count from 22 to 999. Kohya_ss training, 14 minutes. Calling this v2.

Generated test images and…

No LoRA / v1 / v2 comparison

Photorealistic scene (left: no LoRA, center: v1=22 photos, right: v2=999 photos). v2 looks barely different from no-LoRA. 45x more data, but the cat identity is gone.

Creative prompts were worse:

Chef v1 vs v2

Prompt: "ohwx cat as a cute chef." v2 produced a human woman as the chef, with the cat reduced to a tiny illustration on her apron.

Astronaut v1 vs v2

Prompt: "ohwx cat as an astronaut." v2 produced a tabby (orange-striped) cat — the fur color is straight up wrong. My cat is black and white.

More data made the LoRA broadly worse, across both photorealistic and creative prompts.

Cause: "other cats" had snuck into the dataset

Once I thought about it, it was obvious.

Day 4's classifier labels images as "contains a cat or not" — it does NOT verify "is this MY cat." So the 999-photo "cat" folder included:

  • My cat
  • Friends' and family's cats
  • Stray cats from around town
  • Cats at pet stores

All mixed together. When I trained with the label ohwx cat = my cat, the model basically learned ohwx cat ≈ generic cat-shape.

Pulled out just my cat → 213 photos (v3)

To curate, I borrowed another AI — CLIP.

What's CLIP? = An OpenAI image-understanding model that turns images (and text) into embeddings. Compare the embeddings of two images and you get a similarity score.

I used the 22 confirmed-my-cat photos from Day 2 as a reference set, then asked CLIP to score how similar each of the 999 candidates was. Sorted by score, threw the thumbnails into a single HTML page, and went through visually — checking "this one's a different cat", "this has a person in it", and so on, marking exclusions as I went.

Final cut: 213 photos, all confirmed to be my cat. Re-trained → v3.

Result:

No LoRA / v1 / v2 / v3

v3 is as sharp as v1. Tuxedo pattern, white chest, the works.

Creative prompts came back too:

Chef v1 / v2 / v3

The human chef from v2 is gone, replaced by my cat. The astronaut and forest cat similarly snapped back (more comparisons in the collapsible section below).

Cleaning the data was enough to fix everything.

Bonus: also tried natural-language captions (v4)

One more thing I wanted to test.

v1 (Day 2) and v3 (today) differ in their captions — the text labels paired with each training photo:

  • v1: hand-written natural sentences (ohwx cat, walking on a metal kitchen counter, side profile, indoor kitchen with spice bottles...)
  • v3: just the trigger word (ohwx cat) repeated for every image

What's a caption? = A short English text describing what's in each photo, paired with that photo during training.

Would adding richer captions on top of clean data push v3 further? Hand-writing 213 captions wasn't realistic, so I had another AI (Qwen2-VL) auto-generate them. Calling this v4.

Result: v4 looked basically identical to v3. Small differences here and there but nothing substantial.

Caption granularity barely matters once the data is clean.


The actual question: does more data make a stronger LoRA?

Now for the real comparison. v1 (22 photos) vs v4 (213 photos):

|    | Photos | Data purity | Captions |
| --- | --- | --- | --- |
| v1 | 22 | My cat only | Hand-written natural language |
| v4 | 213 (10x!) | My cat only | VLM natural language (same style) |

The only meaningful difference is photo count.

Five-way comparison:

No LoRA / v1 / v2 / v3 / v4

Left to right: no LoRA, v1 (22), v2 (999, contaminated), v3 (213, trigger-only), v4 (213, natural captions).

v1 and v4 are essentially the same quality. To my eye, v1 has a slightly more painterly feel on the chef prompt, but otherwise — same.

Same pattern across all the other prompts:

Chef v1 / v2 / v3 / v4

10x more photos. No visible improvement. This was today's main finding.


After the fact, I looked it up. Turns out this is common knowledge.

I found "more photos doesn't help" interesting enough to look up afterward, and:

  • Character LoRAs are typically trained on 25–40 images, with 40–80 as a soft cap
  • "Over 30 images shows diminishing returns; dataset quality matters more than dataset size"
  • "15–20 well-curated images beat 50 mediocre ones"
  • Too many images can actually overfit and degrade the result
  • DreamBooth (a closely related technique) was designed around 3–5 images

→ It's established consensus in the field: photo count saturates fast, and dataset purity is the real lever.

Day 2's 22 photos? Turns out that was already a healthy amount.


What I learned today

Quality > Quantity, apparently

  • 22 photos (v1) ≈ 213 photos (v4): photo count doesn't push quality much
  • 999 photos (v2): contamination made things worse
  • 213 photos (v3): cleaning brought everything back

"More photos = better LoRA" runs out of road fast. What actually moves the needle is the right photos, not more photos.

A working playbook (so far)

From today's experiments:

  1. Source photos that match the goal (photos of MY cat, not "any cat")
  2. Aim for 20–30 photos — past that, diminishing returns
  3. Captions help, but don't sweat the wording — auto-generated is fine
  4. If you must use a big dataset, curate aggressively first — contamination is brutal

💡 Tip: when you want to use a big dataset anyway

If you're starting from a large unfiltered pile and want to keep it that way, pre-curation is essential. The approach that worked today:

  • Pick a small "ground truth" set (~20 confirmed examples)
  • Use CLIP image similarity to score the big pile against the ground truth
  • Browse thumbnails sorted by score, eyeball-exclude the misses
  • Train on what's left

Details in the collapsible section below.


Technical details (the AI explains)

The implementation details, walked through by Claude.

:::details 1. More v2 failure examples

Skipped from the main body for length, but worth seeing:

Fantasy forest v1 vs v2

Prompt: "ohwx cat in a magical forest." v2 produced a black-bear-style illustration — the cat identity is completely gone.

Balcony v1 vs v2

The one photorealistic-ish prompt where v2 sort-of held it together.

:::

:::details 2. Data prep and CLIP similarity ranking

Day 4's _review/cat/ had 1,009 symlinks (503 HEIC, 505 JPG, 1 other). Resized to short-side 512px:

python3 shared/utils/resize-shortside.py \
  --src private-data/iphone-photos-classified/_review/cat \
  --dst private-data/cat-lora-v2/images-512 \
  --short-side 512

1,009 → 999 after collisions (9 stem collisions where IMG_XXXX.HEIC and IMG_XXXX.JPG produced the same .jpg name) and 1 resize failure.
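
Just to make the collision part concrete, here's a minimal sketch of how those stem collisions could be counted up front. The source directory is the one from the resize command above; the check itself is not from the post.

from collections import Counter
from pathlib import Path

src = Path("private-data/iphone-photos-classified/_review/cat")
# IMG_1234.HEIC and IMG_1234.JPG share a stem, so both would land on IMG_1234.jpg
stems = Counter(p.stem for p in src.iterdir() if p.suffix.lower() in {".heic", ".jpg", ".jpeg"})
collisions = [s for s, n in stems.items() if n > 1]
print(f"{len(collisions)} stem collisions")   # 9 in this dataset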

CLIP similarity scoring with openai/clip-vit-base-patch32:

from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to("cuda")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# embed() is a small helper (not shown) that returns one feature vector per image;
# assuming it L2-normalizes the features, sim below is cosine similarity
ref_feats = embed(model, processor, ref_paths)     # 22 reference photos of my cat
cand_feats = embed(model, processor, cand_paths)   # 999 candidates
sim = cand_feats @ ref_feats.T                     # (999, 22) pairwise similarity
score = sim.mean(dim=1)                            # (999,) mean similarity per candidate

Score distribution:

| Score band | Contents |
| --- | --- |
| ≥ 0.85 | Almost all solo shots of my cat |
| 0.76 – 0.85 | Mostly my cat, with occasional other-cat or human contamination |
| < 0.76 | Mostly other cats or photos with people |

Cut at 0.76 and reviewed everything above visually. 312 manual exclusions later: 213 photos.
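
The cut itself is simple to script. A minimal sketch building on score and cand_paths from the snippet above; clip-ranking.tsv is a hypothetical filename, not something from the post:

from pathlib import Path

# rank the 999 candidates by mean similarity to the 22 reference photos
order = score.argsort(descending=True).tolist()
ranked = [(Path(cand_paths[i]).name, score[i].item()) for i in order]

# keep everything at or above the 0.76 cut for the manual review pass
keep_for_review = [(name, s) for name, s in ranked if s >= 0.76]

# write a simple manifest for the review page (hypothetical filename)
with open("clip-ranking.tsv", "w", encoding="utf-8") as f:
    for name, s in keep_for_review:
        f.write(f"{name}\t{s:.3f}\n")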

:::

:::details 3. Browser-based curation UI

A single HTML page laying out all 999 thumbnails in score order, served via python3 -m http.server. Each thumbnail has a checkbox:

<div class="cell" data-name="IMG_2906.jpg">
  <img src="thumbs-256/IMG_2906.jpg">
  <div class="meta">#1 0.871</div>
  <input type="checkbox" onchange="toggleExclude(this)">
</div>
<script>
function exportExcluded(){
  const names = [...document.querySelectorAll('.cell.excluded')]
    .map(c => c.dataset.name);
  download('excluded.txt', names.join('\n'));
}
</script>

Click "Export excluded list" to download excluded.txt, then use that to filter the training dir.

:::

:::details 4. Training configs (Kohya_ss / TOML)

The core LoRA hyperparameters are shared across v1/v2/v3/v4; only the dataset, the output name, and the repeats/epochs (tuned to keep total steps comparable) change:

output_name = "ohwx_cat_v3"   # or v4
max_train_epochs = 2
network_dim = 32
network_alpha = 16
unet_lr = 1e-4
text_encoder_lr = 5e-5

Step count is also matched:

|    | Math (images × repeats × epochs ÷ batch) | Steps |
| --- | --- | --- |
| v1 | 22 × 10 × 10 ÷ 2 | 1,100 |
| v2 | 999 × 1 × 2 ÷ 2 | 999 |
| v3 / v4 | 213 × 5 × 2 ÷ 2 | 1,065 |

All runs land within roughly 1,000–1,100 steps, so the variables actually in play are photo count, data purity, and caption granularity.
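
To make the "Math" column explicit, a tiny sanity check, assuming the usual Kohya accounting of images × repeats × epochs ÷ batch size (the repeats/epochs split per version is inferred from the table, not stated in the post):

def lora_steps(images, repeats, epochs, batch_size=2):
    # total optimizer steps = (images * repeats * epochs) / batch size
    return images * repeats * epochs // batch_size

print(lora_steps(22, 10, 10))   # v1      -> 1100
print(lora_steps(999, 1, 2))    # v2      -> 999
print(lora_steps(213, 5, 2))    # v3 / v4 -> 1065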

cd ~/Kohya_ss && source venv/bin/activate
accelerate launch --num_cpu_threads_per_process 8 train_network.py \
  --config_file configs/train_v3.toml \
  --dataset_config configs/dataset_v3.toml

DGX Spark, 1.4 it/s, ~14 minutes per training run.

:::

:::details 5. Qwen2-VL caption auto-generation

Reusing Day 4's Qwen2-VL 7B Instruct setup. The prompt:

Describe what is happening in this cat photo using short comma-separated
phrases. Cover: (1) the cat's pose or action, (2) the view angle,
(3) the setting and notable background details. Keep it under 25 words.
Do NOT describe the cat's appearance (color, breed, fur, markings) — focus
only on the scene. Output the description directly without any preamble.
Example: walking on a metal kitchen counter, side profile, indoor kitchen
with spice bottles and shelves in the background

The "do not describe the cat's appearance" line is intentional: identity is supposed to come from the trigger word ohwx cat, so captions should only describe context.

desc = vlm_caption(model, processor, img)
caption = f"ohwx cat, {desc}"
txt_path.write_text(caption + "\n", encoding="utf-8")
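
vlm_caption() is the Day 4 helper and its body isn't shown in the post. As a rough sketch of what an equivalent could look like with the standard transformers / qwen_vl_utils API from the Qwen2-VL model card (the loading code and the CAPTION_PROMPT constant name are my assumptions, not the author's code):

from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

CAPTION_PROMPT = "Describe what is happening in this cat photo ..."  # the full prompt quoted above

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

def vlm_caption(model, processor, img_path):
    # one user turn: the photo plus the scene-only captioning prompt
    messages = [{"role": "user", "content": [
        {"type": "image", "image": str(img_path)},
        {"type": "text", "text": CAPTION_PROMPT},
    ]}]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, _ = process_vision_info(messages)
    inputs = processor(text=[text], images=image_inputs, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=64)
    # keep only the newly generated tokens, not the echoed prompt
    gen = out[0][inputs.input_ids.shape[1]:]
    return processor.batch_decode([gen], skip_special_tokens=True)[0].strip()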

213 captions in 6 minutes. Sample output:

ohwx cat, sitting, side view, indoor setting, wooden floor,
folding chair, curtain, air conditioner

Stylistically very close to Day 2's hand-written captions.

:::

:::details 6. Version summary

|    | v1 (Day 2) | v2 | v3 | v4 |
| --- | --- | --- | --- | --- |
| Photos | 22 | 999 | 213 | 213 |
| Cat content | My cat only | My cat + many others | My cat only | My cat only |
| Captions | Hand-written natural | ohwx cat only | ohwx cat only | VLM natural |
| Total steps | 1,100 | 999 | 1,065 | 1,065 |
| Training time | 13m 3s | 14m 0s | 14m 0s | 14m 0s |

What each pair isolates:

  • v2 vs v3 → effect of data purity (same captions, only purity differs)
  • v3 vs v4 → effect of caption granularity (same data, only captions differ)
  • v1 vs v4 → effect of photo count (clean data, natural captions, only count differs)

:::

:::details 7. References on LoRA training dataset size

The "diminishing returns past ~30 photos" claim has multiple sources:

:::


Tomorrow's preview: Day 6

Day 6: still undecided. Decision tomorrow morning.


#100ExperimentsWithDGX #LocalLLM #LoRA #StableDiffusion
