<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Stephen Batifol</title>
    <description>The latest articles on DEV Community by Stephen Batifol (@stephen_btl).</description>
    <link>https://dev.to/stephen_btl</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3979267%2F1c8da7a1-1d34-401a-9e9e-bb2c8ac2b57b.png</url>
      <title>DEV Community: Stephen Batifol</title>
      <link>https://dev.to/stephen_btl</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/stephen_btl"/>
    <language>en</language>
    <item>
      <title>Training a LoRA on FLUX.2 [klein] with Hermes Agent</title>
      <dc:creator>Stephen Batifol</dc:creator>
      <pubDate>Thu, 11 Jun 2026 10:19:16 +0000</pubDate>
      <link>https://dev.to/stephen_btl/training-a-lora-on-flux2-klein-with-hermes-agent-2k05</link>
      <guid>https://dev.to/stephen_btl/training-a-lora-on-flux2-klein-with-hermes-agent-2k05</guid>
      <description>&lt;p&gt;If you've trained a LoRA before, for FLUX or other models, you know that creating the dataset is the annoying part: you need to find images, check their licenses, and write captions before you can train anything. A lot of people have been talking about Hermes Agent lately, so I figured I'd see if I could automate most of the work to train a LoRA for FLUX.2 [klein].&lt;/p&gt;

&lt;p&gt;For training the LoRA, I'll use &lt;a href="https://github.com/ostris/ai-toolkit" rel="noopener noreferrer"&gt;ai-toolkit&lt;/a&gt;, but you can use any other LoRA training framework.&lt;/p&gt;

&lt;h2&gt;
  
  
  The LoRA: medieval marginalia
&lt;/h2&gt;

&lt;p&gt;Usually you create a LoRA for something the base model is weak at. I asked Hermes to search for ideas that met three criteria: the base FLUX model is weak at it, there is enough material to train on, and it is visually distinct enough to see at a glance. It came back with a shortlist: risograph misregistration, Soviet technical schematics, ukiyo-e weather diagrams, brutalist concrete textures, medieval marginalia.&lt;/p&gt;

&lt;p&gt;I haven't seen many medieval-oriented LoRAs, so I figured I could try to make it work with FLUX and went with medieval marginalia. If you don't know what medieval marginalia is: it's the weird creatures drawn in the margins of old manuscripts, in particular those from the 13th to 15th century.&lt;/p&gt;

&lt;p&gt;Example images from the training set.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs685no0rj7t5s59qmwtm.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs685no0rj7t5s59qmwtm.jpg" alt="Twelve source images from the medieval marginalia dataset: manuscript folios, rabbit jousts, hybrid creatures, and isolated drolleries." width="800" height="574"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The base model can do a generic "medieval illustration", but it doesn't match what we have in our training data: it's missing the parchment tone, the weird proportions, the page artifacts, the small figures and the decorative ground lines. That makes it a good candidate for a LoRA.&lt;/p&gt;

&lt;p&gt;Same prompt, base vs LoRA:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fciuf65cr66h2b4ggvz3l.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fciuf65cr66h2b4ggvz3l.jpg" alt="Side-by-side comparison of FLUX.2 klein base output and the trained marginalia LoRA output." width="800" height="422"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Creating a Dataset
&lt;/h2&gt;

&lt;p&gt;The images come from Wikimedia Commons. It has good metadata, licensing information and plenty of material. Hermes first pulled 199 images with their license and resolution.&lt;/p&gt;

&lt;p&gt;We then filtered out the unusable ones and removed the huge full-res scans, which left us with 104 usable images.&lt;/p&gt;

&lt;p&gt;Hermes ran a check on each one:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;is this actually public domain or CC0/CC-BY?&lt;/li&gt;
&lt;li&gt;is the image high enough resolution?&lt;/li&gt;
&lt;li&gt;is it actually visual marginalia, or just a manuscript page with text?&lt;/li&gt;
&lt;li&gt;is this a duplicate crop?&lt;/li&gt;
&lt;li&gt;is the image too noisy to train on?&lt;/li&gt;
&lt;li&gt;can the source be traced?&lt;/li&gt;
&lt;/ul&gt;
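
&lt;p&gt;The mechanical checks in that list (license, resolution, traceable source) can be sketched in a few lines of Python. This is a minimal illustration, not Hermes' actual script: the &lt;code&gt;Candidate&lt;/code&gt; record, its field names and the 512-pixel threshold are all assumptions. The on-topic, duplicate and noise checks need a visual pass and are left out.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical record for one Wikimedia Commons image (illustrative fields)."""
    title: str
    license: str     # e.g. "Public domain", "CC0 1.0", "CC BY 4.0"
    width: int
    height: int
    source_url: str  # empty string when the source cannot be traced


# Assumed threshold: the smallest training resolution used later in the config.
MIN_SHORT_SIDE = 512


def license_ok(license_name: str) -> bool:
    """Accept public domain, CC0 and plain CC BY; reject everything else."""
    tokens = license_name.lower().replace("-", " ").split()
    if not tokens:
        return False
    if tokens[0] in ("pd", "cc0") or tokens[:2] == ["public", "domain"]:
        return True
    # Plain CC BY only: share-alike, non-commercial and no-derivatives are out.
    restricted = {"sa", "nc", "nd"}
    return tokens[:2] == ["cc", "by"] and not restricted.intersection(tokens)


def keep(c: Candidate) -> bool:
    """Apply the license, resolution and traceability checks."""
    big_enough = min(c.width, c.height) >= MIN_SHORT_SIDE
    return license_ok(c.license) and big_enough and bool(c.source_url)
```

&lt;p&gt;Running &lt;code&gt;keep&lt;/code&gt; over the pulled records gives the subset worth sending to the slower visual checks.&lt;/p&gt;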

&lt;p&gt;Here are some that got cut:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faly37grazt6t2wg81lq5.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faly37grazt6t2wg81lq5.jpg" alt="Five rejected candidates" width="800" height="202"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;illuminated initial&lt;/strong&gt;: a decorated letter, not a margin creature. Great image, wrong thing. A style LoRA trained on these would learn "ornate capital", not "drollery".&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coptic manuscript&lt;/strong&gt;: marginal illustration, but a completely different tradition. Wrong script, wrong palette, wrong look. It would just add noise.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;near-duplicate&lt;/strong&gt;: a cropped copy of an image already in the keep set. Duplicates overweight one composition and waste a dataset slot.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;signature marginalia&lt;/strong&gt;: technically "marginalia", but it's an early-modern handwritten signature with a flourish. No creature, and the search term matched on the wrong sense of the word.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;labor scene&lt;/strong&gt;: a real Luttrell Psalter folio, fine quality, but it's a genre/labor scene with weak drollery signal. Close, but not the thing I'm teaching.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Picking the final images
&lt;/h2&gt;

&lt;p&gt;For a style LoRA, 20 to 40 images is usually the sweet spot. At this point we still had more than 100 images left, so we needed to filter some out.&lt;/p&gt;

&lt;p&gt;Normally you'd open the folder, go through all the images by hand and filter them down. It's not a hard task but it's not a fun one either, so I figured I would let Hermes do it. It decided to write a whole script to score them for quality: resolution, how on-topic the Wikimedia title and category were, the state of the scan.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fte4f8lbkqrl7l3lyrpid.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fte4f8lbkqrl7l3lyrpid.jpg" alt="Contact sheet of 50 top-scored candidates" width="800" height="1760"&gt;&lt;/a&gt;&lt;/p&gt;
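
&lt;p&gt;A scoring script of the kind described above could look roughly like this. It is a sketch under assumptions: the keyword list, the weights and the 0-to-1 &lt;code&gt;scan_state&lt;/code&gt; rating are all made up for illustration, and the real script may weigh things differently.&lt;/p&gt;

```python
# Illustrative keywords; the real script's term list is an unknown.
ON_TOPIC_TERMS = ("marginalia", "drollery", "drolleries", "bas-de-page", "grotesque")


def topic_score(title: str, categories: list[str]) -> float:
    """Fraction of on-topic keyword hits in the title and categories, capped at 1."""
    text = " ".join([title, *categories]).lower()
    hits = sum(1 for term in ON_TOPIC_TERMS if term in text)
    return min(hits, 3) / 3


def resolution_score(width: int, height: int, target: int = 1024) -> float:
    """Short side relative to an assumed target size, capped at 1."""
    return min(min(width, height) / target, 1.0)


def quality_score(title: str, categories: list[str], width: int, height: int,
                  scan_state: float) -> float:
    """Weighted sum of the three signals; the weights are arbitrary assumptions.

    scan_state is a 0-to-1 rating of scan condition from a separate check.
    """
    return (0.4 * topic_score(title, categories)
            + 0.3 * resolution_score(width, height)
            + 0.3 * scan_state)


def top_candidates(records: list[dict], n: int = 50) -> list[dict]:
    """Return the n highest-scoring records (dicts holding the fields above)."""
    return sorted(records, key=lambda r: quality_score(**r), reverse=True)[:n]
```

&lt;p&gt;A score like this only narrows the pile. It says nothing about whether the set as a whole is balanced, which is what the next questions are about.&lt;/p&gt;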

&lt;p&gt;These are the questions that decide a style dataset:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;do I have enough isolated creatures?&lt;/li&gt;
&lt;li&gt;do I have enough full manuscript context?&lt;/li&gt;
&lt;li&gt;is there too much text?&lt;/li&gt;
&lt;li&gt;are the colors varied enough?&lt;/li&gt;
&lt;li&gt;is the dataset all rabbits and snails, or does it cover more shapes?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Hermes sent the images to a VLM to pick the ones that would best teach FLUX the style we are looking for.&lt;/p&gt;

&lt;p&gt;The final 30 was a deliberate mix:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;isolated drolleries on plain backgrounds&lt;/li&gt;
&lt;li&gt;partial and full folio crops with page layout and text&lt;/li&gt;
&lt;li&gt;hybrid creatures and grotesques for figural variety&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Captioning
&lt;/h2&gt;

&lt;p&gt;For a style LoRA, captions should describe the subject in plain structural terms and say nothing about the style. The model learns the style from the images. If you put it in the caption too, you split the signal.&lt;/p&gt;

&lt;p&gt;Bad caption:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MRGN_DRLR. A medieval illuminated manuscript drawing in ink on vellum of a rabbit jousting a snail.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Better caption:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MRGN_DRLR. Bas-de-page of a rabbit in chainmail jousting a giant snail with a lance. The rabbit charges from the left on a small horse, the snail rears its body up on the right with antennae extended. Green ground line with stylized plants below.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The second caption never says "medieval," "manuscript," "ink," "vellum," or "illuminated." Those are exactly the things I want the model to absorb from the images themselves. The trigger word &lt;code&gt;MRGN_DRLR&lt;/code&gt; carries the style; the rest of the caption just describes what's there.&lt;/p&gt;

&lt;p&gt;For 30 images I could have just written the captions by hand, but where is the fun in that? Hermes built a captioning pipeline instead, which I can point at the next dataset.&lt;/p&gt;
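
&lt;p&gt;One small piece of such a pipeline is a check that flags style words in a generated caption and makes sure the trigger is present. The sketch below is an assumption about how that could be done, not the pipeline Hermes built. The word list is the one named above, and the helper names are hypothetical.&lt;/p&gt;

```python
import re

TRIGGER = "MRGN_DRLR."
# Style words the captions should avoid; extend the tuple as needed.
STYLE_WORDS = ("medieval", "manuscript", "ink", "vellum", "illuminated")
_STYLE_RE = re.compile(r"\b(" + "|".join(STYLE_WORDS) + r")\b", re.IGNORECASE)


def style_leaks(caption: str) -> list[str]:
    """Return the style words found in a caption, lowercased and sorted."""
    return sorted({match.lower() for match in _STYLE_RE.findall(caption)})


def with_trigger(caption: str) -> str:
    """Prepend the trigger word unless the caption already starts with it."""
    body = caption.strip()
    return body if body.startswith(TRIGGER) else f"{TRIGGER} {body}"
```

&lt;p&gt;Flagging rather than deleting is deliberate here: cutting words out of a sentence tends to leave broken grammar, so a flagged caption is better sent back for a rewrite.&lt;/p&gt;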

&lt;h2&gt;
  
  
  The ai-toolkit config
&lt;/h2&gt;

&lt;p&gt;Trainer: &lt;a href="https://github.com/ostris/ai-toolkit" rel="noopener noreferrer"&gt;ostris/ai-toolkit&lt;/a&gt;. Model: &lt;code&gt;black-forest-labs/FLUX.2-klein-base-4B&lt;/code&gt;. Train on the base, not the distilled checkpoint.&lt;/p&gt;

&lt;p&gt;The config is based on the &lt;a href="https://docs.bfl.ai/flux_2/flux2_klein_training_example" rel="noopener noreferrer"&gt;BFL Klein training example&lt;/a&gt;, retargeted for this dataset:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name_or_path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;black-forest-labs/FLUX.2-klein-base-4B&lt;/span&gt;
  &lt;span class="na"&gt;arch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;flux2_klein_4b&lt;/span&gt;
  &lt;span class="na"&gt;quantize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

&lt;span class="na"&gt;network&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;lora&lt;/span&gt;
  &lt;span class="na"&gt;linear&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;128&lt;/span&gt;
  &lt;span class="na"&gt;linear_alpha&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;64&lt;/span&gt;
  &lt;span class="na"&gt;conv&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;64&lt;/span&gt;
  &lt;span class="na"&gt;conv_alpha&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;32&lt;/span&gt;

&lt;span class="na"&gt;datasets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;folder_path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/workspace/datasets/marginalia&lt;/span&gt;
    &lt;span class="na"&gt;caption_ext&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;txt&lt;/span&gt;
    &lt;span class="na"&gt;caption_dropout_rate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.05&lt;/span&gt;
    &lt;span class="na"&gt;resolution&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;512&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;768&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;1024&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;train&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2000&lt;/span&gt;
  &lt;span class="na"&gt;lr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.0001&lt;/span&gt;
  &lt;span class="na"&gt;optimizer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;adamw8bit&lt;/span&gt;
  &lt;span class="na"&gt;timestep_type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;shift&lt;/span&gt;
  &lt;span class="na"&gt;content_or_style&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;balanced&lt;/span&gt;

&lt;span class="na"&gt;save&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;save_every&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;250&lt;/span&gt;
  &lt;span class="na"&gt;max_step_saves_to_keep&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
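
&lt;p&gt;A quick sanity check on the numbers in that config, as a sketch. It assumes a checkpoint is written at every multiple of &lt;code&gt;save_every&lt;/code&gt;, and that the LoRA update is scaled by alpha divided by rank, which is the common convention but is not confirmed here for ai-toolkit.&lt;/p&gt;

```python
# Values copied from the config above.
steps, save_every, max_step_saves_to_keep = 2000, 250, 8
linear, linear_alpha = 128, 64
conv, conv_alpha = 64, 32

# Assumption: one checkpoint at every multiple of save_every.
checkpoint_steps = list(range(save_every, steps + 1, save_every))
# Only the most recent max_step_saves_to_keep checkpoints stay on disk.
kept = checkpoint_steps[-max_step_saves_to_keep:]

# Assumption: the usual LoRA convention of scaling the update by alpha / rank.
linear_scale = linear_alpha / linear
conv_scale = conv_alpha / conv

print("checkpoints:", checkpoint_steps)
print("all kept:", kept == checkpoint_steps)
print("scales:", linear_scale, conv_scale)
```

&lt;p&gt;Under those assumptions the run writes eight checkpoints and keeps all of them, so every step compared later (500, 1000, 1500, 2000) is still available at the end.&lt;/p&gt;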



&lt;h2&gt;
  
  
  Training &amp;amp; Picking the right checkpoint
&lt;/h2&gt;

&lt;p&gt;RunPod has an official AI toolkit template, so I figured I could let Hermes run the training there directly on an RTX 4090.&lt;/p&gt;

&lt;h3&gt;
  
  
  Picking the right checkpoint
&lt;/h3&gt;

&lt;p&gt;The loss curve is worth monitoring, but it isn't the whole story. In this case, the best checkpoint was at step 1000, halfway through the 2000 steps we configured.&lt;/p&gt;

&lt;p&gt;Here's the progression I saw across checkpoints, same prompts and seeds:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmy8qqtjyaqqyxclff3au.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmy8qqtjyaqqyxclff3au.jpg" alt="Checkpoint progression during training" width="800" height="103"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;step 0:    base Klein, too clean and generic&lt;/li&gt;
&lt;li&gt;step 500:  parchment tone and line work start to appear&lt;/li&gt;
&lt;li&gt;step 1000: best balance of style, subject, and composition&lt;/li&gt;
&lt;li&gt;step 1500: text artifacts and muddy colors start creeping in&lt;/li&gt;
&lt;li&gt;step 2000: usable, but worse than step 1000&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When training a LoRA, it's important to look at the samples and not only the loss to make sure we aren't overfitting.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3b86n19t0cect2hd3t63.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3b86n19t0cect2hd3t63.jpg" alt="Three-frame overfit demo showing the same prompt at steps 1000, 1500, and 2000" width="799" height="308"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At step 1000 the figure is clean. By 1500 fake script crowds the page, and by 2000 the palette turns to mud, all while the loss number kept dropping.&lt;/p&gt;
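
&lt;p&gt;To make that comparison fair, every checkpoint has to see exactly the same prompts and seeds. Here is a minimal sketch of how the job list for such a grid could be built. The prompts, seeds and file naming are hypothetical, and the actual image generation call is left out on purpose.&lt;/p&gt;

```python
from itertools import product
from pathlib import Path

# Hypothetical prompts and seeds; only the trigger word comes from the article.
PROMPTS = [
    "MRGN_DRLR. A fox playing a lute on a green ground line.",
    "MRGN_DRLR. A knight riding a giant snail.",
]
SEEDS = [0, 1]
CHECKPOINT_STEPS = [500, 1000, 1500, 2000]  # step 0 would be the base model without the LoRA


def grid_jobs(steps, prompts, seeds, out_dir="samples"):
    """Build one job per (checkpoint, prompt, seed) so each step sees identical inputs."""
    jobs = []
    for step, (p_idx, prompt), seed in product(steps, enumerate(prompts), seeds):
        jobs.append({
            "step": step,
            "prompt": prompt,
            "seed": seed,
            # Zero-padded step so files sort in training order.
            "out_path": Path(out_dir) / f"step{step:05d}_p{p_idx}_s{seed}.png",
        })
    return jobs


jobs = grid_jobs(CHECKPOINT_STEPS, PROMPTS, SEEDS)
```

&lt;p&gt;Each job can then be handed to whatever inference code you use. Because the prompt and seed are fixed per column, any difference between rows comes from the checkpoint alone.&lt;/p&gt;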

&lt;h2&gt;
  
  
  What Hermes actually did
&lt;/h2&gt;

&lt;p&gt;The training run itself took 37 minutes, and I did almost none of the work around it. Hermes found the images on Wikimedia and checked the licenses, downloaded and renamed them, scored and narrowed down the pile and built the contact sheet, ran the vision pass, captioned the set and scrubbed the style words the VLM kept sneaking in, wrote the ai-toolkit config and the RunPod instructions, and put together the checkpoint grids I used to pick the final one.&lt;/p&gt;

&lt;p&gt;The day of scraping, license-checking and file-wrangling in front of a run like this is usually what kills the idea. It's dull, and something more interesting always wins.&lt;/p&gt;

&lt;h2&gt;
  
  
  If you want to train your own
&lt;/h2&gt;

&lt;p&gt;A few things are worth knowing before you start.&lt;/p&gt;

&lt;p&gt;Train against the base model, not the distilled one, and you don't need a huge dataset. 20 to 40 images is the sweet spot, since piling on more tends to wash the style out rather than strengthen it.&lt;/p&gt;

&lt;p&gt;Captions are important for a LoRA. Describe what's in the image and say nothing about the style, so the model has to learn the look from the pixels instead of from a word, and give it a trigger that isn't a real English word so it doesn't collide with everything the model already knows.&lt;/p&gt;

&lt;p&gt;When you train, save often and trust your eyes over the loss curve. Mine kept improving long after the images had started to fall apart, and the checkpoint I ended up shipping was halfway through the run. Generate the same prompts at the same seed across every checkpoint so you're comparing the same thing each time.&lt;/p&gt;

&lt;p&gt;The last part matters most to me: if an agent can handle the sourcing, the captioning and the packaging, hand it over. That's what I wanted to test here, and it took care of most of it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;The LoRA and its training data are both on Hugging Face. The &lt;a href="https://huggingface.co/stephenbtl/marginalia-drolleries-klein-4b-lora" rel="noopener noreferrer"&gt;marginalia LoRA&lt;/a&gt; loads straight onto FLUX.2 [klein] with diffusers (trigger word &lt;code&gt;MRGN_DRLR.&lt;/code&gt;), and the &lt;a href="https://huggingface.co/datasets/stephenbtl/marginalia-drolleries-dataset" rel="noopener noreferrer"&gt;training dataset&lt;/a&gt; is published too, so you can retrain it yourself or build something on top of the same public-domain images.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>hermes</category>
      <category>finetune</category>
    </item>
  </channel>
</rss>
