[Day 2] I Trained an AI on 22 Photos of My Cat — Now It Draws Her in Any Scene
So, yesterday I generated "some cat"
Day 1 ended with "I made my DGX draw a cat" — but the cat that came out was just "a cat from somewhere". Today, the goal is to teach the AI about my actual cat (who's currently being looked after at my parents' place back in Japan).
This is what people call LoRA training.
LoRA: A technique that teaches an AI model "specific features" using a small set of images, without touching the base model itself. Apparently. The output is a small "diff" file (tens of MB).
This is experiment #2.
The training data
Source material: 22 photos of my cat.
I picked a mix of angles — front-facing, full body, sleepy poses, varying lighting — to give the AI a fair shot at recognizing the cat's defining features (tuxedo black-and-white pattern, white socks, the black smudge on the nose).
Training pipeline
1. Pre-processing
iPhone HEIC files don't work directly with most AI tools, so the first step is converting them to JPG. 10 of the 22 were HEIC.
Then resize to 512px on the short side for training. This is where I tripped over a sneaky bug — details in the collapsible section below.
2. Captions
Every image gets a text description like "ohwx cat, sitting on a wooden floor, indoor, soft lighting". The four-letter ohwx is a meaningless token that becomes the trigger word for "my specific cat" after training.
Drafting 22 captions by hand would be tedious — but Claude can read images directly, so it drafted them while I just reviewed. The accuracy was uncanny. For example:
ohwx cat, walking on a metal kitchen counter, side profile, indoor kitchen with spice bottles and shelves in the background
ohwx cat, in a loaf pose on a gray carpet, mouth open showing teeth, mid-yawn, indoor with shelves and warm lights in the background
ohwx cat, sitting on a wooden floor by a balcony window, viewed from behind, sharp sunlight casting long shadows, indoor
Amazing.
3. Kohya_ss training
Kohya_ss is the de-facto standard LoRA training tool. Set up a TOML config and run one command:
$ accelerate launch train_network.py \
--config_file configs/train.toml \
--dataset_config configs/dataset.toml
Training logs scroll by, and the loss value gradually drops. Lower loss = the model is learning, apparently.
4. Done
1100 steps in 13 minutes 3 seconds on the DGX Spark.
Result 1: just typing "ohwx cat" gives me my cat
The first thing I tried was a "without LoRA vs with LoRA" comparison. Same prompt — "ohwx cat as a chef in a kitchen, ..." — first without the LoRA, then with it:
Left: no LoRA. Right: with LoRA.
Without LoRA, ohwx is gibberish to the model, so it's ignored and only "a chef in a kitchen" carries weight. Result: a human chef. A nice woman cooking in a pink kitchen.
With LoRA, ohwx becomes a real token that points at my cat. Same prompt, but now my cat is the chef.
This was the moment it really hit me.
Result 2: novel scene reproduction
The training set has no photo of the cat sitting on a wooden floor in this exact composition. So I tried it:
White socks: present. Nose smudge: present.
My cat, in places she's never been
ohwx cat in various scenes.
Sunny balcony
Cozy.
Chef (reprise)
The chef hat fits suspiciously well. Cooking ability unverified.
Autumn forest
A painterly take.
Astronaut
A doppelgänger via the helmet glass — but sci-fi all the same.
Today's takeaway
"Build your own AI from your own data" turned out to be way more accessible than I'd assumed.
Tech details (Claude explains)
The technical bits, written up by my AI pair.
- HEIC → JPG conversion and the EXIF orientation trap
Reading iPhone HEIC files in Python is straightforward with pillow-heif. JPG conversion is a few lines:
from PIL import Image, ImageOps
from pillow_heif import register_heif_opener

register_heif_opener()

with Image.open("IMG_1234.HEIC") as img:
    oriented = ImageOps.exif_transpose(img)  # ← critical line
    rgb = oriented.convert("RGB")
    rgb.save("IMG_1234.jpg", quality=95)
What I tripped on
My first version skipped ImageOps.exif_transpose(). Result: 8 of 22 photos came out rotated 90° in the resized output.
iPhones save portrait shots with the pixels stored in landscape orientation, plus an EXIF Orientation tag that says "rotate 90° on display". Pillow's Image.open() doesn't apply that tag by default; you have to call exif_transpose() explicitly.
Caught it before training started. If I hadn't, the LoRA would have learned "sideways cat" and generation would be weird.
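For completeness, the resize step is only a few more lines. A minimal sketch assuming Pillow, with hypothetical directory names (512px on the short side, matching the dataset config):

from pathlib import Path
from PIL import Image, ImageOps

SRC = Path("cat-photos-jpg")   # hypothetical: the converted JPGs
DST = Path("cat-photos-512")   # hypothetical: the training copies
DST.mkdir(exist_ok=True)

for path in sorted(SRC.glob("*.jpg")):
    with Image.open(path) as img:
        img = ImageOps.exif_transpose(img)   # keep the orientation fix here too
        w, h = img.size
        scale = 512 / min(w, h)              # 512px on the short side
        size = (round(w * scale), round(h * scale))
        img.resize(size, Image.LANCZOS).save(DST / path.name, quality=95)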
- Kohya_ss setup on ARM64 (DGX Spark)
There are two repos commonly referred to as "Kohya_ss":
- bmaltais/kohya_ss — GUI wrapper, xformers dependency (clashes with ARM64)
- kohya-ss/sd-scripts — the actual training engine, CLI/TOML driven
DGX Spark is ARM64, so I went with the latter:
git clone --depth 1 https://github.com/kohya-ss/sd-scripts.git ~/Kohya_ss
cd ~/Kohya_ss
python3 -m venv venv && source venv/bin/activate
pip install --upgrade pip
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128
pip install -r requirements.txt
DGX Spark uses CUDA 12.8 + ARM64 (sbsa), so the PyTorch cu128 channel works directly. Surprisingly painless.
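A quick sanity check that the install actually sees the GPU (a minimal sketch; the version strings will vary by box):

import torch

print(torch.__version__)              # installed PyTorch build
print(torch.version.cuda)             # CUDA version it was built against (12.8 here)
print(torch.cuda.is_available())      # True if the GPU is visible
print(torch.cuda.get_device_name(0))  # GPU name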
Training config (TOML)
# train.toml (excerpt)
pretrained_model_name_or_path = ".../Realistic_Vision_V6.0_NV_B1.safetensors"
vae = ".../vae-ft-mse-840000-ema-pruned.safetensors"
network_module = "networks.lora"
network_dim = 32
network_alpha = 16
optimizer_type = "AdamW8bit"
unet_lr = 1e-4
text_encoder_lr = 5e-5
lr_scheduler = "cosine_with_restarts"
max_train_epochs = 10
save_every_n_epochs = 2
mixed_precision = "bf16"
sdpa = true
cache_latents = true
# dataset.toml
[general]
shuffle_caption = false
caption_extension = ".txt"
keep_tokens = 1
[[datasets]]
resolution = 512
batch_size = 2
enable_bucket = true
[[datasets.subsets]]
image_dir = "/path/to/cat-photos-512"
num_repeats = 10
22 photos × 10 repeats × 10 epochs ÷ batch 2 = 1100 steps. 13 minutes.
Base model: Realistic Vision V6.0 B1 noVAE (a photo-realistic SD 1.5 derivative). External VAE: sd-vae-ft-mse-original. The combination is good at fur detail.
- Hitting the ComfyUI HTTP API for batch generation
Clicking through the GUI for one image at a time gets old fast. ComfyUI exposes an HTTP API that's easy to drive from Python — urllib.request from the standard library is enough (no extra deps).
import json, urllib.request, time

COMFY_URL = "http://127.0.0.1:8188"

def queue_prompt(workflow):
    # POST the workflow (API format) to the queue and get back its prompt_id
    payload = json.dumps({"prompt": workflow}).encode()
    req = urllib.request.Request(
        f"{COMFY_URL}/prompt",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    return json.loads(urllib.request.urlopen(req).read())["prompt_id"]

def wait_for_history(prompt_id, timeout=180):
    # Poll /history until the prompt shows up there, i.e. generation finished
    start = time.time()
    while time.time() - start < timeout:
        with urllib.request.urlopen(f"{COMFY_URL}/history/{prompt_id}") as resp:
            data = json.loads(resp.read())
        if prompt_id in data:
            return data[prompt_id]
        time.sleep(0.5)
    raise TimeoutError(f"no result for {prompt_id} after {timeout}s")
The workflow is ComfyUI's API format (a dict of node IDs with their connections). To use a LoRA, insert a LoraLoader node between the checkpoint loader and KSampler.
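Roughly what that graph looks like in API form. This is a minimal sketch, not the exact workflow from this post; the node IDs, the LoRA filename, and the sampler settings are illustrative:

workflow = {
    "1": {"class_type": "CheckpointLoaderSimple",
          "inputs": {"ckpt_name": "Realistic_Vision_V6.0_NV_B1.safetensors"}},
    "2": {"class_type": "VAELoader",
          "inputs": {"vae_name": "vae-ft-mse-840000-ema-pruned.safetensors"}},
    "3": {"class_type": "LoraLoader",  # patches both the UNet and the text encoder
          "inputs": {"model": ["1", 0], "clip": ["1", 1],
                     "lora_name": "ohwx_cat.safetensors",  # hypothetical filename
                     "strength_model": 0.8, "strength_clip": 0.8}},
    "4": {"class_type": "CLIPTextEncode",
          "inputs": {"clip": ["3", 1], "text": "ohwx cat as a chef in a kitchen"}},
    "5": {"class_type": "CLIPTextEncode",
          "inputs": {"clip": ["3", 1], "text": "blurry, low quality"}},
    "6": {"class_type": "EmptyLatentImage",
          "inputs": {"width": 512, "height": 768, "batch_size": 1}},
    "7": {"class_type": "KSampler",
          "inputs": {"model": ["3", 0], "positive": ["4", 0], "negative": ["5", 0],
                     "latent_image": ["6", 0], "seed": 42, "steps": 25, "cfg": 7.0,
                     "sampler_name": "dpmpp_2m", "scheduler": "karras", "denoise": 1.0}},
    "8": {"class_type": "VAEDecode",
          "inputs": {"samples": ["7", 0], "vae": ["2", 0]}},
    "9": {"class_type": "SaveImage",
          "inputs": {"images": ["8", 0], "filename_prefix": "ohwx_cat"}},
}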
DGX Spark generates one 512×768 image in about 3 seconds. With seed/strength/prompt parametrized in a script, all 12 grid images came out in under a minute.
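The batch script then boils down to cloning that dict and swapping a couple of fields per image. A sketch building on the helpers and the workflow above, with made-up prompts and seeds:

import copy

scenes = ["on a sunny balcony", "as a chef in a kitchen",
          "in an autumn forest", "as an astronaut"]

for i, scene in enumerate(scenes):
    wf = copy.deepcopy(workflow)
    wf["4"]["inputs"]["text"] = f"ohwx cat {scene}"  # positive-prompt node from the sketch
    wf["7"]["inputs"]["seed"] = 1000 + i             # new seed per image
    wait_for_history(queue_prompt(wf))               # blocks until SaveImage has written the file
    print("done:", scene)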
Tomorrow: Day 3
Day 3 plan: have a local AI analyze my credit card history.
The kind of data I'd rather not send to a cloud AI, but absolutely want to understand. Quintessential local-AI territory.