Intro
Day 12!
This time I tried to take my own hand-drawn animation (a short video) and build a
line-art LoRA that learns its art style and characters.
The plan was a little lazy, honestly. The usual way to train this is to prepare
character stills one by one, by hand. But I thought:
"I already have the video — why not just rip frames out of it and collect the
training material the easy way?"
Short version: the lines got clean, but the one thing that mattered, reproducing my
characters, failed completely. Why it failed is the real lesson I took away from
today.
What I used: my home AI machine (DGX Spark) + a training tool (Kohya) + my own
hand-drawn animation (two characters).
Note: everything shown here is LoRA-generated line art only. I'm not showing the
source video itself or where it's published.
Result first: failure on the left, and the right is also a failure
Left is the first attempt. A second body grows upside-down out of the top of the head.
Right is after I tracked down the causes and rebuilt the data: the lines came out
clean. But as you'll see, it's also a failure. It looks nothing like my original
characters; it's a totally different person.
How can "the breakage got fixed" still be a failure? Let me walk through it.
What I did: rip frames from the video and train (v1)
Simple steps:
- Extract still frames from my hand-drawn animation video
- Roughly select ~300 of them as training material
- Train a LoRA on them
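The post doesn't say how the frames were pulled or picked, so this is only a minimal sketch of one way the "roughly select ~300" step could work: take evenly spaced frame indices across the clip. The function name and the 7,200-frame clip length are made up for illustration.

```python
def evenly_spaced_indices(total_frames: int, wanted: int) -> list[int]:
    """Return up to `wanted` frame indices spread evenly over the clip."""
    if total_frames <= 0 or wanted <= 0:
        return []
    wanted = min(wanted, total_frames)
    step = total_frames / wanted
    # Take the middle of each of the `wanted` equal slices of the clip.
    return [int(step * i + step / 2) for i in range(wanted)]


# Hypothetical clip: 7,200 frames (for example 5 minutes at 24 fps).
indices = evenly_spaced_indices(7200, 300)
print(len(indices), indices[:3])
```

Sampling like this only looks at position in the clip. It says nothing about what is actually in each frame.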
The training itself took 17 minutes on the DGX. Lightning fast.
"Oh, this is easy," I thought — for about five minutes.
Then I generated, and it was a mess
Asking the finished LoRA for "single character" and "two-person scenes" gave me this:
| Symptom | How it broke |
|---|---|
| Fused heads | A second body sprouts from the head / multiple faces merge into one |
| Backgrounds won't go away | Even asking for "white background," blue/pink backgrounds show up |
| Thick, muddy lines | No clean line art, everything is heavy and blurry |
| Ghost text | Meaningless characters (leftover captions) get baked into the image |
It could tell the characters apart (A and B were recognized as different people).
But the looks were wrecked. Lined up, the actual outputs were quite the horror show:

▲ When two characters show up, you can't tell what's what anymore

▲ Leftover captions bake in as "ghost text" all over the frame

▲ Train on high-motion frames and everything just melts
Why did it break? (the real point)
The culprit was the choice of source material itself: video frames.
A video, if you think about it, is footage with many things happening at once. Rip a
frame out of it and the model doesn't just learn the character's shape; it learns all
the surrounding noise too.
| Symptom | Cause |
|---|---|
| Fused heads | Video has lots of frames where two people move in the same shot. The model learns an instant where bodies overlap as "one single body" |
| Backgrounds stick | Tons of background-laden frames get in. The model learns "character + this colored background" as a package, and you can't override it later |
| Thick lines | Mid-motion frames are blurred; that blur bakes in as a "thick-line style" |
| Ghost text | Caption text sitting on the frames sneaks into the material and gets learned |
In one sentence: video is noise-laden material — motion, overlap, backgrounds, text
all baked in — and it's a poor way to cleanly extract just a character's shape.

▲ Even asking for "white background," the training background color (pink) won't peel off — and the character multiplies for good measure
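One of the causes in the table above, motion blur, is something you can score per frame. Here is a rough sketch of the common "variance of the Laplacian" sharpness measure, where a low score suggests a blurry frame. It's written in plain Python to stay dependency-free (in practice an image library would do this), and the test image is synthetic. I didn't use this in the runs described here, and any cutoff would have to be tuned by eye.

```python
from statistics import pvariance


def laplacian_variance(gray: list[list[float]]) -> float:
    """Variance of a 4-neighbour Laplacian over the interior pixels.

    Expects a grayscale image of at least 3x3 pixels.
    Low values mean few sharp edges, which is typical of motion blur.
    """
    height, width = len(gray), len(gray[0])
    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            responses.append(
                gray[y - 1][x] + gray[y + 1][x]
                + gray[y][x - 1] + gray[y][x + 1]
                - 4 * gray[y][x]
            )
    return pvariance(responses)


def box_blur(gray: list[list[float]]) -> list[list[float]]:
    """3x3 mean filter with edge clamping, used here only to fake blur."""
    height, width = len(gray), len(gray[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            total = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), height - 1)
                    xx = min(max(x + dx, 0), width - 1)
                    total += gray[yy][xx]
            row.append(total / 9)
        out.append(row)
    return out


# Synthetic "line art": white canvas with one black vertical stroke.
sharp = [[0.0 if x == 8 else 255.0 for x in range(16)] for _ in range(16)]
blurred = box_blur(sharp)

sharp_score = laplacian_variance(sharp)
blurred_score = laplacian_variance(blurred)
print(sharp_score > blurred_score)
```

The crisp stroke scores far higher than its blurred copy, which is the signal you'd sort or filter frames on.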
Fixing it (v2)
Now that I knew the causes, I rebuilt the data side and trained again.
- Auto-remove frames with caption text
- Drop "no character" frames — pure backgrounds, transition frames (~300 → 141 frames)
- Split into three groups — "A only," "B only," "the two together" — to stop the characters from bleeding into each other
- Switch the base model to an anime one (good at line art) and tune the settings
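The filtering and three-way split in the list above can be sketched as follows, assuming each frame already has a set of tags from some tagger. The tag names (`char_a`, `char_b`, `text`) and file names are hypothetical, not the ones from my actual pipeline.

```python
from collections import defaultdict
from typing import Optional

# Hypothetical tag names; substitute whatever your tagger emits.
CHAR_A, CHAR_B, TEXT = "char_a", "char_b", "text"


def assign_group(tags: set) -> Optional[str]:
    """Return the training group for a frame, or None to drop it."""
    if TEXT in tags:
        return None  # caption text is on screen
    has_a, has_b = CHAR_A in tags, CHAR_B in tags
    if has_a and has_b:
        return "both"
    if has_a:
        return "a_only"
    if has_b:
        return "b_only"
    return None  # pure background or transition frame


def split_frames(frame_tags: dict) -> dict:
    """Map group name -> sorted list of frame names that belong to it."""
    groups = defaultdict(list)
    for name, tags in sorted(frame_tags.items()):
        group = assign_group(tags)
        if group is not None:
            groups[group].append(name)
    return dict(groups)


frames = {
    "f001.png": {"char_a"},
    "f002.png": {"char_a", "text"},   # dropped: caption text
    "f003.png": {"char_b"},
    "f004.png": {"char_a", "char_b"},
    "f005.png": set(),                # dropped: no character
}
print(split_frames(frames))
```

A split like this is only as good as the tags feeding it. A mis-tagged frame lands in the wrong group without any warning.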
The result:
| Aspect | v1 (first) | v2 (rebuilt) |
|---|---|---|
| Fused heads | ✗ frequent | ✓ gone |
| Thick/muddy lines | ✗ | ✓ thin and clean |
| Line-art look | △ | ✓✓ clearly line art |
| Backgrounds | ✗ | △ white now, but color bleeds onto clothes |
| Ghost text | ✗ | △ far less, a little remains |
| Resemblance to my characters (the whole point) | ✗ | ✗ a different person — sometimes missing an arm |
Look at just the top of that table and you think "oh, it's fixed!" I did too, for a second.
The noise problems really were fixed, or at least much reduced: fused heads and thick
lines are gone, and backgrounds now come out white (though color still bleeds onto clothes).
But look closer and there's no trace of my original characters. The lines are clean,
but what comes out is "some vaguely anime-style stranger." The worst ones are even missing
an arm.

▲ The lines are clean, sure — but it looks nothing like my original character. A stranger.

▲ Stable as a drawing. Still zero resemblance (and color still bleeds onto the clothes)
The real wall I never got past
Even with the noise gone, the one thing I actually wanted — reproducing my own
characters — was completely out of reach. All I got was a clean-looking stranger.
And since even a single character came out this much of a different person, "two of them
together, in a scene with a relationship" was even more hopeless. Ask for the two of them
and only one shows up, or it falls apart.
I chased down why two-person was especially bad, too:
- There were only 9 real frames in the entire set where the two were naturally side by side
- I tried to get more by re-checking another 478 frames, but nearly every "two-person" hit was a false positive (the detector reacting to on-screen text or body fragments)
- → In other words, you cannot grow "two-person scenes" out of video material
If the composition you want (the two of them cleanly together) doesn't happen to exist in
the video, you can't extract it after the fact. Obvious in hindsight — but it really sank
in once I'd hit the wall.
Today's takeaway
Ripping frames from a video teaches the model "vaguely anime-ish" at best.
It couldn't reproduce my characters even for a single figure (let alone two / a
relationship). In the end, you have to hand-draw the stills you want it to learn.
This isn't a sour-grapes conclusion — it's the answer after exhausting every way to grow
the data. I took the long way around trying to be lazy, but because of it I now
understand, first-hand, why hand-drawn stills are necessary.
Someday I'll act on this conclusion and prepare the compositions by hand — but that's
a project for another day.
The details
Training settings (v1 → v2)
| Item | v1 | v2 |
|---|---|---|
| Base model | SD1.5 (plain) | anime-style (good at line art) |
| clip_skip | 1 | 2 |
| Data | ~300 lumped together | split into 3 groups (68 / 64 / 9) |
| Epochs | 2 | 3 |
| LoRA dim / alpha | 32 / 16 | 32 / 16 (kept) |
| Training time (DGX) | ~17 min | ~14 min |
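For reference, here is a rough sketch of how the v2 column above might map onto a kohya-ss sd-scripts command line. The flags are standard `train_network.py` options, but all paths are hypothetical, and `--shuffle_caption` and `--keep_tokens 1` are assumed values (the table doesn't list them). Settings not shown in the table, such as learning rate and resolution, are left out.

```python
import shlex


def build_train_command(base_model: str, data_dir: str, output_dir: str) -> list[str]:
    """Assemble an sd-scripts style command line for the v2 settings."""
    return [
        "accelerate", "launch", "train_network.py",
        "--pretrained_model_name_or_path", base_model,
        "--train_data_dir", data_dir,
        "--output_dir", output_dir,
        "--network_module", "networks.lora",
        "--network_dim", "32",       # LoRA dim from the table
        "--network_alpha", "16",     # LoRA alpha from the table
        "--clip_skip", "2",          # v2 value from the table
        "--max_train_epochs", "3",   # v2 value from the table
        "--shuffle_caption",         # assumption, not stated in the post
        "--keep_tokens", "1",        # assumption: pin only the first tag
    ]


cmd = build_train_command(
    base_model="models/anime-base.safetensors",  # hypothetical path
    data_dir="data/v2",                          # hypothetical path
    output_dir="output/lineart-lora-v2",         # hypothetical path
)
print(shlex.join(cmd))
```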
In v2 I put a required character-name tag at the front of each group's captions and
pinned it there with keep_tokens, to keep the characters from bleeding into each other.
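To make the pinning idea concrete: when caption tags are shuffled during training, keep_tokens leaves the first N comma-separated tags where they are. The sketch below only simulates that behavior; it is not Kohya's code, and the tag names (`mychar_a` and friends) are invented for the example.

```python
import random


def pin_trigger(caption: str, trigger: str) -> str:
    """Put `trigger` first in a comma-separated caption, without duplicates."""
    tags = [t.strip() for t in caption.split(",") if t.strip()]
    rest = [t for t in tags if t != trigger]
    return ", ".join([trigger] + rest)


def shuffle_keeping_first(caption: str, keep_tokens: int, rng: random.Random) -> str:
    """Simulate tag shuffling that leaves the first `keep_tokens` tags in place."""
    tags = [t.strip() for t in caption.split(",")]
    head, tail = tags[:keep_tokens], tags[keep_tokens:]
    rng.shuffle(tail)
    return ", ".join(head + tail)


# Hypothetical caption for a frame from the "A only" group.
caption = pin_trigger("lineart, 1girl, mychar_a, white background", "mychar_a")
print(caption)

shuffled = shuffle_keeping_first(caption, keep_tokens=1, rng=random.Random(0))
print(shuffled)
```

Whatever order the remaining tags end up in, the character tag stays first, so it is the one stable anchor across every caption in that group.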
Why "two-person scenes" couldn't be grown
I re-tagged another 478 video frames looking for "two people in frame." Tag
co-occurrence flagged 25 candidates, but on full-resolution inspection almost all of
them held only one person: the tagger was misfiring on on-screen text labels and body
fragments. The real "two together" frames were the 9 I'd hand-picked at the start,
which was basically the whole supply.
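To put those numbers in proportion, here is the share each v2 group takes of the 141 training frames, computed from the counts in the settings table (68 / 64 / 9). The group labels are just descriptive names.

```python
groups = {"A only": 68, "B only": 64, "A and B together": 9}
total = sum(groups.values())

# Fraction of the training set that each group makes up.
shares = {name: count / total for name, count in groups.items()}

for name, count in groups.items():
    print(f"{name}: {count} frames ({shares[name]:.1%})")
```

The two-person group comes to roughly 6% of the set, so the model sees that composition far less often than either character alone.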
What's still left (the homework)
- Color bleeding onto clothes (likely from color tags in some groups)
- Leftover ghost text (a little text-like noise remains)
- And the big one: reproducing the characters at all. Neither single figures nor pairs actually look like "my" characters → hand-draw the compositions I need
Next up
Next time I'm switching things up with a completely different experiment 🎬

