Stephen Batifol

Posted on Jun 11 • Originally published at x.com

Training a LoRA on FLUX.2 [klein] with Hermes Agent

#ai #hermes #finetune

If you've trained a LoRA before, for FLUX or other models, you know that creating the dataset is the annoying part: you need to find images, check their licenses, and write captions before you can train anything. A lot of people have been talking about Hermes Agent lately, so I figured I'd see if I could automate most of the work of training a LoRA for FLUX.2 [klein].

For training the LoRA, I'll use ai-toolkit, but you can use any other LoRA training framework.

The LoRA: medieval marginalia

Usually you create a LoRA for something the base model is weak at. I asked Hermes to search for ideas that met three criteria: the base FLUX model is weak at it, it can be trained on, and it is visually distinct enough to see at a glance. It came back with a shortlist: risograph misregistration, Soviet technical schematics, ukiyo-e weather diagrams, brutalist concrete textures, medieval marginalia.

I haven't seen many medieval-oriented LoRAs, so I figured I could try to make it work with FLUX and went with medieval marginalia. If you don't know what medieval marginalia is: it's the weird creatures drawn in the margins of old manuscripts, particularly those from the 13th to 15th centuries.

Example images from the training set.

The base model can do a generic "medieval illustration", but not to the extent that we have in our training data. It's missing the parchment tone, the weird proportions, the page artifacts, the small figures, and the decorative ground lines, which makes this a good candidate for a LoRA.

Same prompt, base vs LoRA:

Creating a Dataset

The images come from Wikimedia Commons. It has good metadata, licensing information and plenty of material. Hermes first pulled 199 images along with their license and resolution.

We then filtered out the unusable ones and removed the huge full-res scans, which left us with 104 usable images.

Hermes ran a check on each one:

is this actually public domain or CC0/CC-BY?
is the image high enough resolution?
is it actually visual marginalia, or just a manuscript page with text?
is this a duplicate crop?
is the image too noisy to train on?
can the source be traced?
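
The mechanical half of that checklist (license, resolution, traceable source) is easy to express in code. Here is a minimal sketch, not the check Hermes actually ran: the `Candidate` record, its field names, and the 768 px threshold are all assumptions made for illustration. The other questions (is it really marginalia, is it too noisy, is it a duplicate crop) need a vision model or a human and are left out.

```python
from dataclasses import dataclass

# Hypothetical record for one candidate image. The field names are
# illustrative and do not mirror the Wikimedia Commons API schema.
@dataclass
class Candidate:
    title: str
    license: str      # e.g. "Public domain", "CC0 1.0", "CC BY 4.0"
    width: int
    height: int
    source_url: str   # empty string when the source cannot be traced

MIN_SHORT_SIDE = 768  # assumed threshold, not a value from the post

def license_ok(license_name: str) -> bool:
    """Accept public domain, CC0, and plain CC BY; reject SA/NC/ND variants."""
    tokens = license_name.lower().replace("-", " ").split()
    if tokens[:2] == ["public", "domain"] or tokens[:1] == ["cc0"]:
        return True
    is_cc_by = tokens[:2] == ["cc", "by"]
    restricted = {"sa", "nc", "nd"} & set(tokens)
    return is_cc_by and not restricted

def passes_mechanical_checks(c: Candidate) -> bool:
    """License allowed, short side large enough, and a non-empty source."""
    return (
        license_ok(c.license)
        and min(c.width, c.height) >= MIN_SHORT_SIDE
        and bool(c.source_url)
    )

candidates = [
    Candidate("Snail joust", "Public domain", 1600, 1200, "https://example.org/a"),
    Candidate("Tiny thumbnail", "CC0 1.0", 320, 240, "https://example.org/b"),
    Candidate("Share-alike scan", "CC BY-SA 4.0", 2000, 1500, "https://example.org/c"),
    Candidate("Untraceable", "CC BY 4.0", 2000, 1500, ""),
]
kept = [c.title for c in candidates if passes_mechanical_checks(c)]
print(kept)  # ['Snail joust']
```

Only the first record survives: the second is too small, the third carries a share-alike clause, and the fourth has no source.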

Here are some that got cut:

illuminated initial: a decorated letter, not a margin creature. Great image, wrong thing. A style LoRA trained on these would learn "ornate capital", not "drollery".
Coptic manuscript: marginal illustration, but a completely different tradition. Wrong script, wrong palette, wrong look. It would just add noise.
near-duplicate: a cropped copy of an image already in the keep set. Duplicates overweight one composition and waste a dataset slot.
signature marginalia: technically "marginalia", but it's an early-modern handwritten signature with a flourish. No creature, and the search term matched on the wrong sense of the word.
labor scene: a real Luttrell Psalter folio, fine quality, but it's a genre/labor scene with weak drollery signal. Close, but not the thing I'm teaching.

Picking the final images

For a style LoRA, 20 to 40 images is usually the sweet spot. At this point we still had more than 100 images left, so we needed to filter some out.

Normally you'd open the folder, go through all the images by hand and filter them down. It's not a hard task but it's not a fun one either, so I figured I would let Hermes do it. It decided to write a whole script to score them for quality: resolution, how on-topic the Wikimedia title and category were, the state of the scan.
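
To give an idea of what such a scoring script can look like, here is a minimal sketch. It is not the script Hermes wrote: the keyword list, the weights, the 1024 px target, and the `scan_state` rating (a 0 to 1 number you would get from some other check) are all assumptions.

```python
# All keywords, weights, and caps below are illustrative assumptions.
ON_TOPIC_KEYWORDS = ("marginalia", "drollery", "bas-de-page", "grotesque")

def resolution_score(width, height, target=1024):
    """0..1, saturating once the short side reaches the target size."""
    return min(min(width, height) / target, 1.0)

def topic_score(title, categories):
    """Fraction of keywords found in the title or the category names."""
    text = " ".join([title, *categories]).lower()
    hits = sum(1 for kw in ON_TOPIC_KEYWORDS if kw in text)
    return hits / len(ON_TOPIC_KEYWORDS)

def quality_score(width, height, title, categories, scan_state):
    """Weighted sum of the three signals; scan_state is a hypothetical 0..1 rating."""
    return round(
        0.3 * resolution_score(width, height)
        + 0.4 * topic_score(title, categories)
        + 0.3 * scan_state,
        3,
    )

on_topic = quality_score(
    2048, 1536, "Drollery of a snail", ["Medieval marginalia"], 0.8
)
off_topic = quality_score(
    400, 300, "Signature with flourish", ["Signatures"], 0.9
)
print(on_topic, off_topic)
```

With these made-up weights, a sharp, well-titled drollery outranks a small off-topic scan even when the latter is in better physical condition, which is the ordering you want before handing the top of the pile to a vision model.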

These are the questions that decide a style dataset:

do I have enough isolated creatures?
do I have enough full manuscript context?
is there too much text?
are the colors varied enough?
is the dataset all rabbits and snails, or does it cover more shapes?

Hermes then sent the images to a VLM to pick the ones that would best teach FLUX the style we were looking for.

The final 30 was a deliberate mix:

isolated drolleries on plain backgrounds
partial and full folio crops with page layout and text
hybrid creatures and grotesques for figural variety
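
A mix like this can be enforced with per-category quotas rather than a single global ranking. The sketch below assumes each image already has a category label and a score; the category names, the quotas, and the tiny example data are hypothetical, and the post does not say how the 30 were actually split.

```python
from collections import defaultdict

def pick_balanced(images, quotas):
    """Take the top-scored images per category, up to each category's quota.

    images: list of (name, category, score) tuples
    quotas: dict mapping category -> number of images to keep
    """
    by_category = defaultdict(list)
    for name, category, score in images:
        by_category[category].append((score, name))

    selected = []
    for category, quota in quotas.items():
        ranked = sorted(by_category[category], reverse=True)  # best first
        selected.extend(name for _, name in ranked[:quota])
    return selected

# Hypothetical scored images and quotas, scaled down for readability.
images = [
    ("a.jpg", "isolated", 0.90),
    ("b.jpg", "isolated", 0.70),
    ("c.jpg", "folio", 0.80),
    ("d.jpg", "folio", 0.60),
    ("e.jpg", "hybrid", 0.95),
    ("f.jpg", "isolated", 0.50),
]
quotas = {"isolated": 2, "folio": 1, "hybrid": 1}

final = pick_balanced(images, quotas)
print(final)  # ['a.jpg', 'b.jpg', 'c.jpg', 'e.jpg']
```

A plain top-N by score would let one category crowd out the others; quotas keep the folio crops in even when the isolated creatures score higher.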

Captioning

For a style LoRA, captions should describe the subject in plain structural terms and say nothing about the style. The model learns the style from the images. If you put it in the caption too, you split the signal.

Bad caption:

MRGN_DRLR. A medieval illuminated manuscript drawing in ink on vellum of a rabbit jousting a snail.

Better caption:

MRGN_DRLR. Bas-de-page of a rabbit in chainmail jousting a giant snail with a lance. The rabbit charges from the left on a small horse, the snail rears its body up on the right with antennae extended. Green ground line with stylized plants below.

The second caption never says "medieval," "manuscript," "ink," "vellum," or "illuminated." Those are exactly the things I want the model to absorb from the images themselves. The trigger word MRGN_DRLR carries the style; the rest of the caption just describes what's there.

For 30 images I could have just written the captions by hand but where is the fun in that? Hermes built a captioning pipeline instead, which I can point at the next dataset.
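
One piece of such a pipeline is easy to sketch: a lint pass that flags style words in a caption and makes sure the trigger is present. This is an illustration, not Hermes' pipeline. The word list is the one from the paragraph above, and the sketch only flags captions for rewriting instead of deleting words, since blind deletion tends to leave broken sentences.

```python
import re

TRIGGER = "MRGN_DRLR"
# Style words the captions should avoid, taken from the list above.
STYLE_WORDS = ("medieval", "manuscript", "ink", "vellum", "illuminated")

def find_style_words(caption):
    """Return the style words present in a caption, matched as whole words."""
    found = []
    for word in STYLE_WORDS:
        if re.search(rf"\b{re.escape(word)}\b", caption, flags=re.IGNORECASE):
            found.append(word)
    return found

def ensure_trigger(caption):
    """Prefix the caption with the trigger word if it is missing."""
    body = caption.strip()
    if body.startswith(TRIGGER):
        return body
    return f"{TRIGGER}. {body}"

bad = ("MRGN_DRLR. A medieval illuminated manuscript drawing in ink "
       "on vellum of a rabbit jousting a snail.")
good = ("MRGN_DRLR. Bas-de-page of a rabbit in chainmail jousting a "
        "giant snail with a lance.")

print(find_style_words(bad))   # every style word is flagged
print(find_style_words(good))  # []
print(ensure_trigger("A fox playing a bagpipe."))
```

Matching on word boundaries matters here: a plain substring test for "ink" would also fire on words like "drinking".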

The ai-toolkit config

Trainer: ostris/ai-toolkit. Model: black-forest-labs/FLUX.2-klein-base-4B. Train on the base, not the distilled checkpoint.

The config is based on the BFL Klein training example, retargeted for this dataset:

model:
  name_or_path: black-forest-labs/FLUX.2-klein-base-4B
  arch: flux2_klein_4b
  quantize: true

network:
  type: lora
  linear: 128
  linear_alpha: 64
  conv: 64
  conv_alpha: 32

datasets:
  - folder_path: /workspace/datasets/marginalia
    caption_ext: txt
    caption_dropout_rate: 0.05
    resolution: [512, 768, 1024]

train:
  steps: 2000
  lr: 0.0001
  optimizer: adamw8bit
  timestep_type: shift
  content_or_style: balanced

save:
  save_every: 250
  max_step_saves_to_keep: 8
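
A few numbers in this config are worth sanity-checking against each other. The short sketch below does the arithmetic. The batch size of 1 is an assumption, since the excerpt doesn't state one, and the LoRA scale assumes the common alpha / rank convention.

```python
# Values copied from the config excerpt and the dataset section.
steps = 2000
save_every = 250
max_step_saves_to_keep = 8
linear_rank = 128
linear_alpha = 64
dataset_size = 30

batch_size = 1  # assumption: the excerpt does not state a batch size

# Steps at which a checkpoint is written.
checkpoint_steps = list(range(save_every, steps + 1, save_every))

# If this is False, early checkpoints would be deleted before you compare them.
all_kept = len(checkpoint_steps) <= max_step_saves_to_keep

# Roughly how many times each image is seen over the run.
passes_per_image = steps * batch_size / dataset_size

# Effective LoRA scale, assuming the usual alpha / rank convention.
lora_scale = linear_alpha / linear_rank

print(checkpoint_steps)
print(all_kept)
print(round(passes_per_image, 1))
print(lora_scale)
```

With save_every at 250 and 8 saves kept, every checkpoint from the run is still on disk at the end, which is what makes a side-by-side comparison across steps possible.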

Training & Picking the right checkpoint

RunPod has an official AI toolkit template, so I figured I could let Hermes run the training there directly on an RTX 4090.

Picking the right checkpoint

The loss curve is worth monitoring, but it isn't the whole story. In this case the best checkpoint was at step 1000, halfway through the 2000 steps we configured.

Here's the progression I saw across checkpoints, same prompts and seeds:

step 0: base Klein, too clean and generic
step 500: parchment tone and line work start to appear
step 1000: best balance of style, subject, and composition
step 1500: text artifacts and muddy colors start creeping in
step 2000: usable, but worse than step 1000

When training a LoRA, it's important to look at the samples and not only the loss to make sure we aren't overfitting.

At step 1000 the figure is clean. By 1500 fake script crowds the page, and by 2000 the palette turns to mud, all while the loss number kept dropping.
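
Comparing checkpoints fairly comes down to holding the prompts and seeds fixed and only varying the checkpoint. Here is a minimal sketch of how to plan that grid. The prompts, seeds, and filename pattern are hypothetical, and the image generation call itself is left out because it depends on your inference setup.

```python
from itertools import product

CHECKPOINT_STEPS = [0, 500, 1000, 1500, 2000]  # the steps compared above
PROMPTS = [                                    # hypothetical test prompts
    "MRGN_DRLR. A rabbit in chainmail jousting a giant snail.",
    "MRGN_DRLR. A fox playing a bagpipe beside a vine border.",
]
SEEDS = [0, 1]                                 # hypothetical fixed seeds

def build_jobs(steps, prompts, seeds):
    """One job per (checkpoint, prompt, seed), so every checkpoint
    is sampled with exactly the same prompts and seeds."""
    return [
        {
            "step": step,
            "prompt": prompt,
            "seed": seed,
            "filename": f"step{step:04d}_p{prompt_idx}_seed{seed}.png",
        }
        for step, (prompt_idx, prompt), seed in product(
            steps, enumerate(prompts), seeds
        )
    ]

jobs = build_jobs(CHECKPOINT_STEPS, PROMPTS, SEEDS)
print(len(jobs))            # 5 checkpoints x 2 prompts x 2 seeds
print(jobs[0]["filename"])

# One grid row: the same prompt and seed across every checkpoint.
row = [j["filename"] for j in jobs if j["prompt"] == PROMPTS[0] and j["seed"] == 0]
print(row)
```

Laying the files out one row per prompt and seed, one column per step, makes drift like the fake script at 1500 visible at a glance.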

What Hermes actually did

The training run itself was 37 minutes, and I did almost none of it. Hermes found the images on Wikimedia and checked the licenses, downloaded and renamed them, scored the pile down and built the contact sheet, ran the vision pass, captioned the set and scrubbed the style words the VLM kept sneaking in, wrote the ai-toolkit config and the RunPod instructions, and put together the checkpoint grids I used to pick the final one.

The day of scraping, license-checking and file-wrangling in front of a run like this is usually what kills the idea. It's dull, and something more interesting always wins.

If you want to train your own

A few things are worth knowing before you start.

Train against the base model, not the distilled one, and you don't need a huge dataset. 20 to 40 images is the sweet spot, since piling on more tends to wash the style out rather than strengthen it.

Captions are important for a LoRA. Describe what's in the image and say nothing about the style, so the model has to learn the look from the pixels instead of from a word, and give it a trigger that isn't a real English word so it doesn't collide with everything the model already knows.

When you train, save often and trust your eyes over the loss curve. Mine kept improving long after the images had started to fall apart, and the checkpoint I ended up shipping was halfway through the run. Generate the same prompts at the same seed across every checkpoint so you're comparing the same thing each time.

The last part matters most to me: if an agent can handle the sourcing, the captioning and the packaging, hand it over. That's what I wanted to test here, and it took care of most of it.

Try it

The LoRA and its training data are both on Hugging Face. The marginalia LoRA loads straight onto FLUX.2 [klein] with diffusers (trigger word MRGN_DRLR), and the training dataset is published too, so you can retrain it yourself or build something on top of the same public-domain images.
