DEV Community

sm1ck

Posted on • Originally published at honeychat.bot

Character consistency in AI image generation — where prompts break down and LoRA helps

📦 Training template: github.com/sm1ck/honeychat/tree/main/tutorial/03-lora — a generic Kohya SDXL config with <tune> placeholders and a dataset curation guide. No docker-compose (LoRA training is GPU-heavy) — you bring your own GPU or rent one.

Here's a failure mode many AI companion apps run into on launch day: users send two requests in a row for the same character, get two different faces, and conclude the product is broken. They're not wrong to feel that way. Character identity is part of the product.

This post is about why that happens, why the obvious fixes (seed-pinning, more prompt detail, reference images) often don't fully solve it, and what class of solution works better.

TL;DR

  • Identical seed + identical prompt + different batch size = different face. Seeds only help within the same sampler run.
  • Prompt detail plateaus fast. Past a certain tag count, the model interpolates anyway.
  • Reference image (IP-Adapter) works but can bleed stylistic features — outfit, lighting, background — into generations where you only wanted identity.
  • Custom LoRA per character makes identity much more stable by encoding it at the weights level instead of relying only on prompt text.

Train your own character LoRA — the short walkthrough

LoRA training is GPU-heavy and doesn't belong in a docker-compose, so the tutorial folder at tutorial/03-lora ships the config template and recipe. You bring the GPU.

1. Get a GPU

24 GB VRAM (RTX 3090/4090) fits SDXL LoRA at batch size 2–4 comfortably. Don't own one? Rent a spot instance — Vast.ai, RunPod, Modal, Paperspace, Lambda. A full training run costs a few dollars.

2. Install Kohya_ss

git clone https://github.com/bmaltais/kohya_ss ~/kohya_ss
cd ~/kohya_ss && ./setup.sh

3. Grab the template

git clone https://github.com/sm1ck/honeychat
cp -r honeychat/tutorial/03-lora ./my-character-lora
cd my-character-lora

4. Prepare your dataset

Drop 15–30 varied images of your subject into dataset/train/5_character/ (the 5_ is the repeat count). For each image, create a same-named .txt caption describing the scene — not the character. See dataset/README.md for the full curation checklist.
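A quick pre-flight check catches pairing mistakes before they cost you a paid GPU hour. A minimal sketch of one (the folder layout follows the tutorial's `dataset/train/5_character/` convention; the script itself is not part of the repo):

```python
from pathlib import Path

IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp"}

def check_dataset(train_dir: str) -> list[str]:
    """Return a list of problems: images missing captions, captions missing images."""
    problems = []
    for sub in Path(train_dir).iterdir():
        if not sub.is_dir():
            continue
        # Kohya convention: folder name starts with "<repeats>_", e.g. 5_character
        if "_" not in sub.name or not sub.name.split("_", 1)[0].isdigit():
            problems.append(f"{sub.name}: folder name lacks a numeric repeat prefix")
        images = {p.stem for p in sub.iterdir() if p.suffix.lower() in IMAGE_EXTS}
        captions = {p.stem for p in sub.iterdir() if p.suffix.lower() == ".txt"}
        for stem in sorted(images - captions):
            problems.append(f"{sub.name}/{stem}: image has no .txt caption")
        for stem in sorted(captions - images):
            problems.append(f"{sub.name}/{stem}.txt: caption has no image")
    return problems
```

Run it before training; an empty list means every image has its caption and the repeat-count prefix is in place.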

5. Fill the <tune> slots in kohya-config.toml

Every hyperparameter is a placeholder you pick based on your dataset and base model. Read the inline comments, then replace each <tune> with a real value. The safety check in train.sh will refuse to run if any placeholder remains.
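The guard is simple enough to sketch: scan the config for any surviving `<tune>` marker and refuse to proceed. The real check lives in train.sh; this Python version just shows the idea:

```python
import re

def unfilled_placeholders(config_text: str) -> list[str]:
    """Return the config lines that still contain a <tune> placeholder."""
    hits = []
    for lineno, line in enumerate(config_text.splitlines(), start=1):
        if re.search(r"<tune>", line):
            hits.append(f"line {lineno}: {line.strip()}")
    return hits

# train.sh's idea in miniature: refuse to train if anything is unfilled.
config = 'learning_rate = "<tune>"\nmax_train_steps = "4000"'
problems = unfilled_placeholders(config)
assert problems == ['line 1: learning_rate = "<tune>"']
```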

6. Train

export KOHYA_DIR=~/kohya_ss
bash train.sh

The checkpoint lands at ./output/<your-character>.safetensors. Load it into ComfyUI or Diffusers like any other SDXL LoRA. Generate a test grid, iterate, retrain if needed.


Why "same prompt, same face" doesn't hold

Users naturally assume this works:

"anime girl, long silver hair, green eyes, Arknights operator outfit"
+ seed=12345
→ Anna, always. Or so it seems.

Not reliably. Three reasons.

Batch size changes the output. In most Stable Diffusion runs, batch_size=1 and batch_size=4 with the same seed produce different images for position 0, because the RNG state depends on the batch dimension.
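One way to sidestep the batch dependence is to seed each image's noise independently instead of seeding the whole sampler run (Diffusers accepts a list of per-image generators for exactly this reason). The idea reduces to this sketch, using a pure-Python stand-in for latent noise with hypothetical helper names:

```python
import random

def image_noise(seed: int, n: int = 8) -> list[float]:
    """Stand-in for one image's initial latent: draws depend only on its own seed."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

def batch_noise(seeds: list[int]) -> list[list[float]]:
    # One RNG per image: position 0 is identical no matter the batch size.
    return [image_noise(s) for s in seeds]

solo  = batch_noise([12345])
batch = batch_noise([12345, 67890, 11111, 22222])
assert solo[0] == batch[0]  # same seed, same image, regardless of batch size
```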

Provider-side sampler drift. If you're calling a managed API (fal.ai, Replicate, Together), provider-side changes — model updates, sampler tweaks, default parameter shifts — can produce visually different outputs across weeks. Your "locked" character can drift.

Prompt detail saturates. At some point, adding more tags ("sharp nose, high cheekbones, narrow eyes, specific mole position") stops helping much. The model has a rough template and interpolates within it.

The in-between fix that doesn't quite work: IP-Adapter

IP-Adapter lets you pass a reference image alongside the prompt. The model bakes the reference's features into the cross-attention. For product photography, excellent.

For character identity, it has a practical drawback: IP-Adapter can carry stylistic baggage. A reference photo with specific lighting, pose, outfit, and background can bleed those into the generated image. You can turn the weight down, but then identity may weaken; turn it up, and the reference can dominate.

IP-Adapter is a good fit when the reference is what you want preserved (e.g., rendering a shop item on a character — next post in the series). It's usually a poor fit when what you want preserved is only the face.

The solution: custom LoRA per character

A LoRA (Low-Rank Adaptation) is a small set of additional weights layered on top of a base model. A character-specific LoRA trained on a curated dataset — consistent face, varied pose/outfit/lighting — encodes the identity into the weights themselves, not into the prompt.
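"Small set of additional weights" is literal: instead of storing a full d×k weight update, a LoRA stores two factor matrices of rank r, so the parameter count per adapted layer drops from d·k to r·(d+k). A rough back-of-envelope (the layer dimensions here are illustrative, not SDXL's actual shapes):

```python
d, k = 1280, 1280   # illustrative attention projection shape
r = 16              # the rank, i.e. network_dim in the Kohya config

full_update = d * k           # a dense delta-W would cost this many parameters
lora_update = r * (d + k)     # A (r x k) plus B (d x r)

print(full_update, lora_update, round(lora_update / full_update, 3))
# → 1638400 40960 0.025

# At inference the update is applied as W + (alpha / r) * B @ A,
# which is why network_alpha and network_dim are tuned together.
```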

Inference pipeline:

workflow = [
    "Checkpoint",           # base SDXL model
    f"LoRA: {char.lora}",   # the character's custom LoRA
    "FreeU",                # quality touch-up
    "KSampler",             # actual diffusion
]

Now Anna is much more likely to stay Anna across pose, outfit, and lighting changes. The face is represented in the weights, not only in the words.

Training a character LoRA (public-friendly template)

The conceptual shape of the training job using the publicly available Kohya_ss SDXL trainer:

# Kohya_ss SDXL LoRA training config — generic template
# Replace every <tune> value based on your dataset and base model.
# See Kohya docs for the full parameter reference.

[model_arguments]
pretrained_model_name_or_path = "<path/to/sdxl-base-or-finetune.safetensors>"

[dataset_arguments]
train_data_dir = "./dataset/train"
resolution     = "1024,1024"
caption_extension = ".txt"

[training_arguments]
output_dir      = "./output"
output_name     = "<your_character_v1>"
save_model_as   = "safetensors"

# Training steps and batch — VRAM-bound. Tune for your hardware.
learning_rate    = "<tune>"
max_train_steps  = "<tune>"
train_batch_size = "<tune>"

[network_arguments]
network_module = "networks.lora"
network_dim    = "<tune>"
network_alpha  = "<tune>"

Full template: github.com/sm1ck/honeychat/tree/main/tutorial/03-lora.

The parameters that matter — LR, step count, rank, alpha, dataset size — are subject-dependent. Anime faces converge differently than realistic faces. There is no universal "best" setting.

What to actually optimize for:

  • Dataset quality over dataset size. 20 clean, varied, captioned images beat 100 messy ones.
  • Varied pose and lighting, constant face. Same angle 30 times teaches "this angle," not "this character."
  • Clean captions — describe the scene, not the character. "Woman standing in a garden" is better than "Anna standing in a garden" because you want the model to learn the face from context, not from the token.
  • Appropriate rank for face detail. Too low a rank underfits the identity; too high a rank overfits and kills flexibility.
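The "describe the scene, not the character" rule is easy to enforce mechanically: lint the captions for the character's name and any aliases before training. A small sketch, assuming captions sit next to the images as in step 4 (the function name is mine, not the tutorial's):

```python
from pathlib import Path

def find_leaky_captions(caption_dir: str, forbidden: list[str]) -> list[str]:
    """List caption files that name the character instead of describing the scene."""
    lowered = [w.lower() for w in forbidden]
    leaks = []
    for path in sorted(Path(caption_dir).glob("*.txt")):
        text = path.read_text().lower()
        if any(word in text for word in lowered):
            leaks.append(path.name)
    return leaks
```

For example, `find_leaky_captions("dataset/train/5_character", ["anna"])` flags every caption that should say "woman" but says "Anna".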

Marginal cost: usually manageable

If you're running inference on a rented or owned GPU, training one character LoRA is a one-time cost usually measured in minutes to hours of GPU time, depending on dataset and settings. Inference with the LoRA attached often adds little overhead compared with the base generation. At scale, the per-character cost is dominated by dataset curation, not just training compute.

This is why a LoRA-per-character pipeline can be viable for products with many characters: once the pipeline exists, adding a new character is mostly a dataset and QA exercise, not a research project.

Production concerns

LoRA hot-swapping. Load the base checkpoint once, swap LoRAs per request. ComfyUI and Diffusers both support this natively.
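At the serving layer this usually looks like a thin wrapper around the pipeline: the base checkpoint stays resident, and each request unloads the previous adapter and loads the requested one, skipping the swap when consecutive requests hit the same character. A sketch with the Diffusers-style calls stubbed out (load_fn and unload_fn stand in for calls like pipe.load_lora_weights and pipe.unload_lora_weights):

```python
from typing import Callable, Optional

class LoraSwapper:
    """Keep the base model loaded; hold one character LoRA at a time."""

    def __init__(self, load_fn: Callable[[str], None], unload_fn: Callable[[], None]):
        self._load = load_fn        # e.g. lambda p: pipe.load_lora_weights(p)
        self._unload = unload_fn    # e.g. pipe.unload_lora_weights
        self.current: Optional[str] = None
        self.swaps = 0

    def ensure(self, lora_path: str) -> None:
        if self.current == lora_path:
            return                  # same character as the last request: no-op
        if self.current is not None:
            self._unload()          # drop the previous adapter first
        self._load(lora_path)
        self.current = lora_path
        self.swaps += 1
```

The no-op path matters in practice: chat traffic is bursty per character, so back-to-back requests for the same LoRA are the common case.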

Dataset hygiene. LoRAs memorize whatever's in the dataset. Enforce licensing upstream — the LoRA is downstream of the decision.

Storage at scale. LoRA file size depends on base model and rank; expect anything from a few MB to much larger checkpoints. Object storage + hot-LoRA pinning on inference workers keeps latency down.

Face ≠ body. A LoRA trained on face crops will not lock body proportions. Include full-body shots in the dataset if you need full-body consistency.

What I'd change if starting over

  • Ship the LoRA pipeline from day 1, even for three characters. Inconsistent visuals in the free tier can hurt activation before users ever see the stronger parts of the product.
  • Curate datasets manually, don't scrape. Five iterations of a hand-picked set of 20 images beat a scraped 200.
  • Store base-model version with each LoRA. When you update the base, you need to know which LoRAs need retraining.
  • Version LoRAs (v1, v2) and keep old versions live. If v2 ships with a regression, roll back per-character without reverting a whole release.

Where this lives

HoneyChat uses custom LoRA per character for visual identity in image and video generation. Public architecture: github.com/sm1ck/honeychat.

Previous: LLM routing per tier via OpenRouter.
Next: IP-Adapter Plus for a product catalog — how to put arbitrary shop items on a character while keeping the character's face locked.

If you've trained character LoRAs in production and have opinions on rank selection or caption strategy, I'd love to hear them in the comments. There's very little public writing on this outside the anime generation community.
