[Day 5] I Trained My Cat-LoRA on 22 vs 213 Photos and the Results Were Basically Identical

PEPPERCORN

Intro

Day 5!

Today was originally going to be "have AI analyze a year of my Amazon order history," but downloading the Amazon purchase history just wouldn't work no matter what I tried. So that was a bust.

Pivoted.

On Day 2, I trained an AI to memorize my cat from 22 photos (Day 2 article). That thing is called a "LoRA."

What's a LoRA? = A small add-on that teaches an AI to recognize a specific subject. Pair photos with a trigger word like ohwx cat, train, and then writing ohwx cat in any prompt makes the AI draw my cat.

On Day 4, I had AI sort through 25,000 photos on my iPhone (Day 4 article). It found 999 photos it identified as cats.

Today's experiment: Will using those 999 photos make my cat-LoRA stronger?

A simple expectation, really. 22 photos → 999 photos is 45x more data. Surely the LoRA gets stronger, right?


TL;DR

The spoilers, up front:

  • Training with 999 photos made things worse, not better
  • After removing "other people's cats" from the dataset (down to 213 photos), I got LoRA quality matching my original 22-photo version
  • 22 photos and 213 photos produced basically the same quality

I came in thinking "more photos = stronger LoRA." Turns out that's not really how it works, and today I learned why.


What I actually did

Trained on 999 photos → got worse (v2)

Same base model and trigger word (ohwx cat) as Day 2. Just bumped the photo count from 22 to 999. Kohya_ss training, 14 minutes. Calling this v2.

Generated test images and…

No LoRA / v1 / v2 comparison

Photorealistic scene (left: no LoRA, center: v1=22 photos, right: v2=999 photos). v2 looks barely different from no-LoRA. 45x more data, but the cat identity is gone.

Creative prompts were worse:

Chef v1 vs v2

Prompt: "ohwx cat as a cute chef." v2 produced a human woman as the chef, with the cat reduced to a tiny illustration on her apron.

Astronaut v1 vs v2

Prompt: "ohwx cat as an astronaut." v2 produced a tabby (orange-striped) cat — the fur color is straight up wrong. My cat is black and white.

More data made the LoRA broadly worse, across both photorealistic and creative prompts.

Cause: "other cats" had snuck into the dataset

Once I thought about it, it was obvious.

Day 4's classifier labels images as "contains a cat or not" — it does NOT verify "is this MY cat." So the 999-photo "cat" folder included:

  • My cat
  • Friends' and family's cats
  • Stray cats from around town
  • Cats at pet stores

All mixed together. When I trained with the label ohwx cat = my cat, the model basically learned ohwx cat ≈ generic cat-shape.

Pulled out just my cat → 213 photos (v3)

To curate, I borrowed another AI — CLIP.

What's CLIP? = An OpenAI image-understanding model that turns images (and text) into embeddings. Compare the embeddings of two images and you get a similarity score.

I used the 22 confirmed-my-cat photos from Day 2 as a reference set, then asked CLIP to score how similar each of the 999 candidates was. Sorted by score, threw the thumbnails into a single HTML page, and went through visually — checking "this one's a different cat", "this has a person in it", and so on, marking exclusions as I went.

Final cut: 213 photos, all confirmed to be my cat. Re-trained → v3.

Result:

No LoRA / v1 / v2 / v3

v3 is as sharp as v1. Tuxedo pattern, white chest, the works.

Creative prompts came back too:

Chef v1 / v2 / v3

The human chef from v2 is gone, replaced by my cat. The astronaut and forest cat similarly snapped back (more comparisons in the collapsible section below).

Cleaning the data was enough to fix everything.

Bonus: also tried natural-language captions (v4)

One more thing I wanted to test.

v1 (Day 2) and v3 (today) differ in their captions — the text labels paired with each training photo:

  • v1: hand-written natural sentences (ohwx cat, walking on a metal kitchen counter, side profile, indoor kitchen with spice bottles...)
  • v3: just the trigger word (ohwx cat) repeated for every image

What's a caption? = A short English text describing what's in each photo, paired with that photo during training.

Would adding richer captions on top of clean data push v3 further? Hand-writing 213 captions wasn't realistic, so I had another AI (Qwen2-VL) auto-generate them. Calling this v4.

Result: v4 looked basically identical to v3. Small differences here and there but nothing substantial.

Caption granularity barely matters once the data is clean.


The actual question: does more data make a stronger LoRA?

Now for the real comparison. v1 (22 photos) vs v4 (213 photos):

|    | Photos | Data purity | Captions |
| --- | --- | --- | --- |
| v1 | 22 | My cat only | Hand-written natural language |
| v4 | 213 (10x!) | My cat only | VLM natural language (same style) |

The only meaningful difference is photo count.

Five-way comparison:

No LoRA / v1 / v2 / v3 / v4

Left to right: no LoRA, v1 (22), v2 (999, contaminated), v3 (213, trigger-only), v4 (213, natural captions).

v1 and v4 are essentially the same quality. To my eye, v1 has a slightly more painterly feel on the chef prompt, but otherwise — same.

Same pattern across all the other prompts:

Chef v1 / v2 / v3 / v4

10x more photos. No visible improvement. This was today's main finding.


After the fact, I looked it up. Turns out this is common knowledge.

I found "more photos doesn't help" interesting enough to look up afterward, and:

  • Character LoRAs are typically trained on 25–40 images, with 40–80 as a soft cap
  • "Over 30 images shows diminishing returns; dataset quality matters more than dataset size"
  • "15–20 well-curated images beat 50 mediocre ones"
  • Too many images can actually overfit and degrade the result
  • DreamBooth (a closely related technique) was designed around 3–5 images

→ It's established consensus in the field: photo count saturates fast, and dataset purity is the real lever.

Day 2's 22 photos? Turns out that was already a healthy amount.


What I learned today

Quality > Quantity, apparently

  • 22 photos (v1) ≈ 213 photos (v4): photo count doesn't push quality much
  • 999 photos (v2): contamination made things worse
  • 213 photos (v3): cleaning brought everything back

"More photos = better LoRA" runs out of road fast. What actually moves the needle is the right photos, not more photos.

A working playbook (so far)

From today's experiments:

  1. Source photos that match the goal (photos of MY cat, not "any cat")
  2. Aim for 20–30 photos — past that, diminishing returns
  3. Captions help, but don't sweat the wording — auto-generated is fine
  4. If you must use a big dataset, curate aggressively first — contamination is brutal

💡 Tip: when you want to use a big dataset anyway

If you're starting from a large unfiltered pile and want to keep it that way, pre-curation is essential. The approach that worked today:

  • Pick a small "ground truth" set (~20 confirmed examples)
  • Use CLIP image similarity to score the big pile against the ground truth
  • Browse thumbnails sorted by score, eyeball-exclude the misses
  • Train on what's left

Details in the collapsible section below.


Technical details (the AI explains)

The implementation details, walked through by Claude.

:::details 1. More v2 failure examples

Skipped from the main body for length, but worth seeing:

Fantasy forest v1 vs v2

Prompt: "ohwx cat in a magical forest." v2 produced a black-bear-style illustration — the cat identity is completely gone.

Balcony v1 vs v2

The one photorealistic-ish prompt where v2 sort-of held it together.

:::

:::details 2. Data prep and CLIP similarity ranking

Day 4's _review/cat/ had 1,009 symlinks (503 HEIC, 505 JPG, 1 other). Resized to short-side 512px:

python3 shared/utils/resize-shortside.py \
  --src private-data/iphone-photos-classified/_review/cat \
  --dst private-data/cat-lora-v2/images-512 \
  --short-side 512

1,009 → 999 after collisions (9 stem collisions where IMG_XXXX.HEIC and IMG_XXXX.JPG produced the same .jpg name) and 1 resize failure.
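
Just to make the collision part concrete, here's a minimal sketch of how those stem collisions could be counted up front. The source directory is the one from the resize command above; the check itself is not from the post.

from collections import Counter
from pathlib import Path

src = Path("private-data/iphone-photos-classified/_review/cat")
# IMG_1234.HEIC and IMG_1234.JPG share a stem, so both would land on IMG_1234.jpg
stems = Counter(p.stem for p in src.iterdir() if p.suffix.lower() in {".heic", ".jpg", ".jpeg"})
collisions = [s for s, n in stems.items() if n > 1]
print(f"{len(collisions)} stem collisions")   # 9 in this dataset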

CLIP similarity scoring with openai/clip-vit-base-patch32:

from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to("cuda")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# embed() is a small helper (not shown) that returns one feature vector per image;
# assuming it L2-normalizes the features, sim below is cosine similarity
ref_feats = embed(model, processor, ref_paths)     # 22 reference photos of my cat
cand_feats = embed(model, processor, cand_paths)   # 999 candidates
sim = cand_feats @ ref_feats.T                     # (999, 22) pairwise similarity
score = sim.mean(dim=1)                            # (999,) mean similarity per candidate

Score distribution:

| Score band | Contents |
| --- | --- |
| ≥ 0.85 | Almost all solo shots of my cat |
| 0.76 – 0.85 | Mostly my cat, with occasional other-cat or human contamination |
| < 0.76 | Mostly other cats or photos with people |

Cut at 0.76 and reviewed everything above visually. 312 manual exclusions later: 213 photos.
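
The cut itself is simple to script. A minimal sketch building on score and cand_paths from the snippet above; clip-ranking.tsv is a hypothetical filename, not something from the post:

from pathlib import Path

# rank the 999 candidates by mean similarity to the 22 reference photos
order = score.argsort(descending=True).tolist()
ranked = [(Path(cand_paths[i]).name, score[i].item()) for i in order]

# keep everything at or above the 0.76 cut for the manual review pass
keep_for_review = [(name, s) for name, s in ranked if s >= 0.76]

# write a simple manifest for the review page (hypothetical filename)
with open("clip-ranking.tsv", "w", encoding="utf-8") as f:
    for name, s in keep_for_review:
        f.write(f"{name}\t{s:.3f}\n")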

:::

:::details 3. Browser-based curation UI

A single HTML page laying out all 999 thumbnails in score order, served via python3 -m http.server. Each thumbnail has a checkbox:

<div class="cell" data-name="IMG_2906.jpg">
  <img src="thumbs-256/IMG_2906.jpg">
  <div class="meta">#1 0.871</div>
  <input type="checkbox" onchange="toggleExclude(this)">
</div>
<script>
function exportExcluded(){
  const names = [...document.querySelectorAll('.cell.excluded')]
    .map(c => c.dataset.name);
  download('excluded.txt', names.join('\n'));
}
</script>

Click "Export excluded list" to download excluded.txt, then use that to filter the training dir.

:::

:::details 4. Training configs (Kohya_ss / TOML)

The core LoRA hyperparameters are shared across v1/v2/v3/v4; only the dataset, the output name, and the repeats/epochs (tuned to keep total steps comparable) change:

output_name = "ohwx_cat_v3"   # or v4
max_train_epochs = 2
network_dim = 32
network_alpha = 16
unet_lr = 1e-4
text_encoder_lr = 5e-5

Step count is also matched:

|    | Math (images × repeats × epochs ÷ batch) | Steps |
| --- | --- | --- |
| v1 | 22 × 10 × 10 ÷ 2 | 1,100 |
| v2 | 999 × 1 × 2 ÷ 2 | 999 |
| v3 / v4 | 213 × 5 × 2 ÷ 2 | 1,065 |

All runs land within roughly 1,000–1,100 steps, so the variables actually in play are photo count, data purity, and caption granularity.
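
To make the "Math" column explicit, a tiny sanity check, assuming the usual Kohya accounting of images × repeats × epochs ÷ batch size (the repeats/epochs split per version is inferred from the table, not stated in the post):

def lora_steps(images, repeats, epochs, batch_size=2):
    # total optimizer steps = (images * repeats * epochs) / batch size
    return images * repeats * epochs // batch_size

print(lora_steps(22, 10, 10))   # v1      -> 1100
print(lora_steps(999, 1, 2))    # v2      -> 999
print(lora_steps(213, 5, 2))    # v3 / v4 -> 1065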

cd ~/Kohya_ss && source venv/bin/activate
accelerate launch --num_cpu_threads_per_process 8 train_network.py \
  --config_file configs/train_v3.toml \
  --dataset_config configs/dataset_v3.toml

DGX Spark, 1.4 it/s, ~14 minutes per training run.

:::

:::details 5. Qwen2-VL caption auto-generation

Reusing Day 4's Qwen2-VL 7B Instruct setup. The prompt:

Describe what is happening in this cat photo using short comma-separated
phrases. Cover: (1) the cat's pose or action, (2) the view angle,
(3) the setting and notable background details. Keep it under 25 words.
Do NOT describe the cat's appearance (color, breed, fur, markings) — focus
only on the scene. Output the description directly without any preamble.
Example: walking on a metal kitchen counter, side profile, indoor kitchen
with spice bottles and shelves in the background

The "do not describe the cat's appearance" line is intentional: identity is supposed to come from the trigger word ohwx cat, so captions should only describe context.

desc = vlm_caption(model, processor, img)
caption = f"ohwx cat, {desc}"
txt_path.write_text(caption + "\n", encoding="utf-8")
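
vlm_caption() is the Day 4 helper and its body isn't shown in the post. As a rough sketch of what an equivalent could look like with the standard transformers / qwen_vl_utils API from the Qwen2-VL model card (the loading code and the CAPTION_PROMPT constant name are my assumptions, not the author's code):

from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

CAPTION_PROMPT = "Describe what is happening in this cat photo ..."  # the full prompt quoted above

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

def vlm_caption(model, processor, img_path):
    # one user turn: the photo plus the scene-only captioning prompt
    messages = [{"role": "user", "content": [
        {"type": "image", "image": str(img_path)},
        {"type": "text", "text": CAPTION_PROMPT},
    ]}]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, _ = process_vision_info(messages)
    inputs = processor(text=[text], images=image_inputs, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=64)
    # keep only the newly generated tokens, not the echoed prompt
    gen = out[0][inputs.input_ids.shape[1]:]
    return processor.batch_decode([gen], skip_special_tokens=True)[0].strip()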

213 captions in 6 minutes. Sample output:

ohwx cat, sitting, side view, indoor setting, wooden floor,
folding chair, curtain, air conditioner

Stylistically very close to Day 2's hand-written captions.

:::

:::details 6. Version summary

|    | v1 (Day 2) | v2 | v3 | v4 |
| --- | --- | --- | --- | --- |
| Photos | 22 | 999 | 213 | 213 |
| Cat content | My cat only | My cat + many others | My cat only | My cat only |
| Captions | Hand-written natural | ohwx cat only | ohwx cat only | VLM natural |
| Total steps | 1,100 | 999 | 1,065 | 1,065 |
| Training time | 13m 3s | 14m 0s | 14m 0s | 14m 0s |

What each pair isolates:

  • v2 vs v3 → effect of data purity (same captions, only purity differs)
  • v3 vs v4 → effect of caption granularity (same data, only captions differ)
  • v1 vs v4 → effect of photo count (clean data, natural captions, only count differs)

:::

:::details 7. References on LoRA training dataset size

The "diminishing returns past ~30 photos" claim has multiple sources:

:::


Tomorrow's preview: Day 6

Day 6: still undecided. Decision tomorrow morning.


#100ExperimentsWithDGX #LocalLLM #LoRA #StableDiffusion
