PEPPERCORN

[Day 2] I Trained an AI on 22 Photos of My Cat — Now It Draws Her in Any Scene

So, yesterday I generated "some cat"

Day 1 ended with "I made my DGX draw a cat" — but the cat that came out was just "a cat from somewhere". Today, the goal is to teach the AI about my actual cat (who's currently being looked after at my parents' place back in Japan).

This is what people call LoRA training.

LoRA: A technique that teaches an AI model "specific features" using a small set of images, without touching the base model itself. Apparently. The output is a small "diff" file (tens of MB).
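To make the "small diff file" idea concrete, here is a conceptual sketch of what LoRA does, written by me in plain Python (this is not Kohya's actual code). The base weight matrix W is frozen; training only learns two small matrices A and B, and the effective weight becomes W plus a scaled low-rank product. The diff file stores only A and B, which is why it's tens of MB instead of gigabytes.

```python
# Conceptual LoRA sketch: effective weight = W + (alpha / r) * B @ A.
# W is the frozen base weight; only the small A and B are trained.

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_effective_weight(W, A, B, alpha, r):
    """Add the scaled low-rank update (alpha / r) * B @ A onto W."""
    scale = alpha / r
    BA = matmul(B, A)  # (out x r) @ (r x in) -> same shape as W
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, BA)]

# Toy 2x2 example with rank r = 1:
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]         # r x in  = 1 x 2
B = [[0.5], [0.25]]      # out x r = 2 x 1
print(lora_effective_weight(W, A, B, alpha=1, r=1))
# [[1.5, 1.0], [0.25, 1.5]]
```

The `network_dim = 32` / `network_alpha = 16` pair in the training config later in this post is exactly this `r` and `alpha`, giving a 0.5 scale on the learned update.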

This is experiment #2.


The training data

Source material: 22 photos of my cat.

Training photo collage

I picked a mix of angles — front-facing, full body, sleepy poses, varying lighting — to give the AI a fair shot at recognizing the cat's defining features (tuxedo black-and-white pattern, white socks, the black smudge on the nose).


Training pipeline

1. Pre-processing

iPhone HEIC files don't work directly with most AI tools, so the first step is converting them to JPG. 10 of the 22 were HEIC.

Then resize to 512px on the short side for training. This is where I tripped over a sneaky bug — details in the collapsible section below.
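The "512px on the short side" math is just a proportional scale. A small helper of my own (not from any library) shows the idea:

```python
# Scale so the shorter edge lands on the target size, keeping aspect ratio.

def short_side_size(width, height, target=512):
    """Return (new_width, new_height) with the short side == target."""
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)

# 12 MP iPhone frames in both orientations:
print(short_side_size(4032, 3024))  # landscape -> (683, 512)
print(short_side_size(3024, 4032))  # portrait  -> (512, 683)
```

With Pillow, the actual resize would then be something like `img.resize(short_side_size(*img.size), Image.LANCZOS)`.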

2. Captions

Every image gets a text description like "ohwx cat, sitting on a wooden floor, indoor, soft lighting". The four-letter ohwx is a meaningless token that becomes the trigger word for "my specific cat" after training.

Drafting 22 captions by hand would be tedious — but Claude can read images directly, so it drafted them while I just reviewed. The accuracy was uncanny. For example:

Cat on a kitchen counter

ohwx cat, walking on a metal kitchen counter, side profile, indoor kitchen with spice bottles and shelves in the background

Mid-yawn cat

ohwx cat, in a loaf pose on a gray carpet, mouth open showing teeth, mid-yawn, indoor with shelves and warm lights in the background

Cat by a window

ohwx cat, sitting on a wooden floor by a balcony window, viewed from behind, sharp sunlight casting long shadows, indoor

SUGOI.
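The mechanics of the caption step are simple: Kohya's dataset loader (see `caption_extension = ".txt"` and `keep_tokens = 1` in the dataset config later in this post) reads one `.txt` file per image, sharing the image's filename stem, with the trigger token first. A sketch of my own helper for enforcing that layout:

```python
# One caption .txt per image, same filename stem, trigger token first.
# keep_tokens = 1 in dataset.toml protects the leading token if caption
# shuffling is ever enabled.

from pathlib import Path

TRIGGER = "ohwx cat"

def ensure_trigger(caption, trigger=TRIGGER):
    """Prepend the trigger token if the caption doesn't start with it."""
    caption = caption.strip()
    if not caption.startswith(trigger):
        caption = f"{trigger}, {caption}"
    return caption

def write_caption(image_path, caption):
    """Write the caption next to the image as <stem>.txt."""
    txt = Path(image_path).with_suffix(".txt")
    txt.write_text(ensure_trigger(caption) + "\n")

print(ensure_trigger("sitting on a wooden floor, indoor"))
# ohwx cat, sitting on a wooden floor, indoor
```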

3. Kohya_ss training

Kohya_ss is the de-facto LoRA training tool. Set up a TOML config, run one command:

$ accelerate launch train_network.py \
    --config_file configs/train.toml \
    --dataset_config configs/dataset.toml

Training logs scroll by, and the loss value gradually drops. Lower loss = the model is learning, apparently.

4. Done

1100 steps in 13 minutes 3 seconds on the DGX Spark.


Result 1: just typing "ohwx cat" gives me my cat

The first thing I tried was a "without LoRA vs with LoRA" comparison. Same prompt — "ohwx cat as a chef in a kitchen, ..." — first without the LoRA, then with it:

Without (left) vs With (right) LoRA

Left: no LoRA. Right: with LoRA.

Without LoRA, ohwx is gibberish to the model, so it's ignored and only "a chef in a kitchen" carries weight. Result: a human chef. A nice woman cooking in a pink kitchen.

With LoRA, ohwx becomes a real token that points at my cat. Same prompt, but now my cat is the chef.

This was the moment that hit.


Result 2: novel scene reproduction

The training set has no photo of the cat sitting on a wooden floor in this exact composition. So I tried it:

My cat sitting on a wooden floor

White socks: present. Nose smudge: present.


My cat, in places she's never been

ohwx cat in various scenes.

Sunny balcony

Cat on a sunny balcony

Cozy.

Chef (reprise)

Cat as a chef

The chef hat fits suspiciously well. Cooking ability unverified.

Autumn forest

Cat in an autumn forest

A painterly take.

Astronaut

Cat as an astronaut

A doppelgänger via the helmet glass — but sci-fi all the same.


Today's takeaway

"Build your own AI from your own data" turned out to be way more accessible than I'd assumed.


Tech details (Claude explains)

The technical bits, written up by my AI pair.

  1. HEIC → JPG conversion and the EXIF orientation trap

Reading iPhone HEIC files in Python is straightforward with pillow-heif. JPG conversion is a few lines:

from PIL import Image, ImageOps
from pillow_heif import register_heif_opener
register_heif_opener()

with Image.open("IMG_1234.HEIC") as img:
    oriented = ImageOps.exif_transpose(img)  # ← critical line
    rgb = oriented.convert("RGB")
    rgb.save("IMG_1234.jpg", quality=95)

What I tripped on

My first version skipped ImageOps.exif_transpose(). Result: 8 of 22 photos came out rotated 90° in the resized output.

iPhones save portrait shots with the actual pixels stored landscape-ways, plus an EXIF Orientation tag saying "rotate 90° on display". Pillow's default Image.open() ignores that tag — you have to call exif_transpose() explicitly.

Caught it before training started. If I hadn't, the LoRA would have learned "sideways cat" and generation would be weird.

  2. Kohya_ss setup on ARM64 (DGX Spark)

There are two repos commonly referred to as "Kohya_ss":

  • bmaltais/kohya_ss — GUI wrapper, xformers dependency (clashes with ARM64)
  • kohya-ss/sd-scripts — the actual training engine, CLI/TOML driven

DGX Spark is ARM64, so I went with the latter:

git clone --depth 1 https://github.com/kohya-ss/sd-scripts.git ~/Kohya_ss
cd ~/Kohya_ss
python3 -m venv venv && source venv/bin/activate
pip install --upgrade pip
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128
pip install -r requirements.txt

DGX Spark uses CUDA 12.8 + ARM64 (sbsa), so the PyTorch cu128 channel works directly. Surprisingly painless.

Training config (TOML)

# train.toml (excerpt)
pretrained_model_name_or_path = ".../Realistic_Vision_V6.0_NV_B1.safetensors"
vae = ".../vae-ft-mse-840000-ema-pruned.safetensors"

network_module = "networks.lora"
network_dim = 32
network_alpha = 16

optimizer_type = "AdamW8bit"
unet_lr = 1e-4
text_encoder_lr = 5e-5
lr_scheduler = "cosine_with_restarts"

max_train_epochs = 10
save_every_n_epochs = 2

mixed_precision = "bf16"
sdpa = true
cache_latents = true
# dataset.toml
[general]
shuffle_caption = false
caption_extension = ".txt"
keep_tokens = 1

[[datasets]]
resolution = 512
batch_size = 2
enable_bucket = true

  [[datasets.subsets]]
  image_dir = "/path/to/cat-photos-512"
  num_repeats = 10

22 photos × 10 repeats × 10 epochs ÷ batch 2 = 1100 steps. 13 minutes.
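That step arithmetic, as a sanity-check helper (my own, mirroring how sd-scripts counts steps under these settings):

```python
# Steps per epoch = (images * repeats) / batch size; total = that * epochs.

def total_steps(n_images, num_repeats, epochs, batch_size):
    """Total optimizer steps for a simple repeat-based dataset."""
    steps_per_epoch = (n_images * num_repeats) // batch_size
    return steps_per_epoch * epochs

print(total_steps(22, 10, 10, 2))  # 1100
```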

Base model: Realistic Vision V6.0 B1 noVAE (a photo-realistic SD 1.5 derivative). External VAE: sd-vae-ft-mse-original. The combination is good at fur detail.

  3. Hitting the ComfyUI HTTP API for batch generation

Clicking through the GUI for one image at a time gets old fast. ComfyUI exposes an HTTP API that's easy to drive from Python — urllib.request from the standard library is enough (no extra deps).

import json, urllib.request, time

COMFY_URL = "http://127.0.0.1:8188"

def queue_prompt(workflow):
    payload = json.dumps({"prompt": workflow}).encode()
    req = urllib.request.Request(
        f"{COMFY_URL}/prompt",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    return json.loads(urllib.request.urlopen(req).read())["prompt_id"]

def wait_for_history(prompt_id, timeout=180):
    start = time.time()
    while time.time() - start < timeout:
        with urllib.request.urlopen(f"{COMFY_URL}/history/{prompt_id}") as resp:
            data = json.loads(resp.read())
            if prompt_id in data:
                return data[prompt_id]
        time.sleep(0.5)
    raise TimeoutError(f"no history entry for {prompt_id} within {timeout}s")

The workflow is ComfyUI's API format (a dict of node IDs with their connections). To use a LoRA, insert a LoraLoader node between the checkpoint loader and KSampler.
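A minimal sketch of that wiring in API format. The node IDs are arbitrary strings I chose, and the LoRA filename is a placeholder; the `class_type` names and input fields follow ComfyUI's API-format export:

```python
# Fragment of an API-format workflow: LoraLoader sits between the
# checkpoint loader and everything downstream. Connections are
# [source_node_id, output_index] pairs.
workflow_fragment = {
    "1": {"class_type": "CheckpointLoaderSimple",
          "inputs": {"ckpt_name": "Realistic_Vision_V6.0_NV_B1.safetensors"}},
    "2": {"class_type": "LoraLoader",
          "inputs": {"lora_name": "ohwx_cat.safetensors",  # placeholder name
                     "strength_model": 0.9,
                     "strength_clip": 0.9,
                     "model": ["1", 0],    # checkpoint's MODEL output
                     "clip": ["1", 1]}},   # checkpoint's CLIP output
    "3": {"class_type": "CLIPTextEncode",
          "inputs": {"text": "ohwx cat on a sunny balcony",
                     "clip": ["2", 1]}},   # CLIP now routes through the LoRA
}
# The KSampler's "model" input then references ["2", 0] instead of
# ["1", 0], so sampling runs through the LoRA-patched model.
```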

DGX Spark generates one 512×768 image in about 3 seconds. With seed/strength/prompt parametrized in a script, all 12 grid images came out in under a minute.


Tomorrow: Day 3

Day 3 plan: have a local AI analyze my credit card history.

The kind of data I'd rather not send to a cloud AI, but absolutely want to understand. Quintessential local-AI territory.


#100ExperimentsWithDGX #LocalLLM
