DEV Community

PEPPERCORN

[Day 12] I tried to build a line-art LoRA from video frames, and the characters' heads fused together

Intro

Day 12!

This time I tried to take my own hand-drawn animation (a short video) and build a
line-art LoRA that learns its art style and characters.

The plan was a little lazy, honestly. The usual way to train this is to prepare
character stills one by one, by hand. But I thought:
"I already have the video — why not just rip frames out of it and collect the
training material the easy way?"

Short version: the lines got clean, but the one thing that mattered, actually
reproducing my characters, failed completely.
And the reason it failed is what I actually took away from today.

What I used: my home AI machine (DGX Spark) + a training tool (Kohya) + my own
hand-drawn animation (two characters).
Note: everything shown here is LoRA-generated line art only. I'm not showing the
source video itself or where it's published.


Result first: failure on the left, and the right is also a failure

Left: v1 trained on video frames (heads fuse). Right: v2 after rebuilding the dataset (clean lines, but a different person)

Left is the first attempt. A second body grows upside-down out of the top of the head.
Right is after I tracked down the causes and rebuilt the data: the lines came out
clean. But as you'll see, it's also a failure: it looks nothing like my original
characters. It's a totally different person.

How can "the breakage got fixed" still be a failure? Let me walk through it.


What I did: rip frames from the video and train (v1)

Simple steps:

  1. Extract still frames from my hand-drawn animation video
  2. Roughly select ~300 of them as training material
  3. Train a LoRA on them
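
One way to do the "roughly select ~300" step is to sample evenly across the whole video instead of grabbing frames in clumps. The sketch below is an illustration under that assumption, not the exact selection used for v1; the function name `pick_frame_indices` and the 7,200-frame total in the usage note are hypothetical.

```python
def pick_frame_indices(total_frames: int, target: int) -> list[int]:
    """Return up to `target` frame indices spread evenly across a video.

    Hypothetical helper: the video is cut into `target` equal-width slices
    and the middle frame of each slice is taken.
    """
    if total_frames <= 0 or target <= 0:
        return []
    if target >= total_frames:
        # Fewer frames than requested: just take them all.
        return list(range(total_frames))
    step = total_frames / target
    return [int(step * i + step / 2) for i in range(target)]
```

For example, `pick_frame_indices(7200, 300)` returns 300 distinct indices, one every 24 frames. Note that even sampling does nothing about the noise described below: it picks blurred, crowded, and captioned frames just as happily as clean ones.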

The training itself took 17 minutes on the DGX. Lightning fast.
"Oh, this is easy," I thought — for about five minutes.


Then I generated, and it was a mess

Asking the finished LoRA for "single character" and "two-person scenes" gave me this:

| Symptom | How it broke |
| --- | --- |
| Fused heads | A second body sprouts from the head / multiple faces merge into one |
| Backgrounds won't go away | Even asking for "white background," blue/pink backgrounds show up |
| Thick, muddy lines | No clean line art, everything is heavy and blurry |
| Ghost text | Meaningless characters (leftover captions) get baked into the image |

It could tell the characters apart (A and B were recognized as different people).
But the looks were wrecked. Lined up, the actual outputs were quite the horror show:

Line art where bodies tangle together, you can't tell whose limbs are whose
▲ When two characters show up, you can't tell what's what anymore

A dinner-table scene where multiple faces fuse and multiply
▲ Faces fuse and multiply

Line art covered in meaningless floating text
▲ Leftover captions bake in as "ghost text" all over the frame

A figure screaming as its body dissolves
▲ Train on high-motion frames and everything just melts


Why did it break? (the real point)

The culprit was the decision to use video frames as the source material in the first place.

A video, if you think about it, is footage with many things happening at once. Rip a
frame out of it and the model doesn't just learn the character's shape; it learns all
the surrounding noise too.

| Symptom | Cause |
| --- | --- |
| Fused heads | Video has lots of frames where two people move in the same shot. The model learns an instant where bodies overlap as "one single body" |
| Backgrounds stick | Tons of background-laden frames get in. It learns "character = with this colored background" as a set, and you can't override it later |
| Thick lines | Mid-motion frames are blurred; that blur bakes in as a "thick-line style" |
| Ghost text | Caption text sitting on the frames sneaks into the material and gets learned |

In one sentence: video is noise-laden material, with motion, overlap, backgrounds,
and text all baked in, and it's a poor way to cleanly extract just a character's shape.

Line art with a pink background that won't go away, character multiplied
▲ Even asking for "white background," the training background color (pink) won't peel off — and the character multiplies for good measure


Fixing it (v2)

Now that I knew the causes, I rebuilt the data side and trained again.

  1. Auto-remove frames with caption text
  2. Drop "no character" frames — pure backgrounds, transition frames (~300 → 141 frames)
  3. Split into three groups — "A only," "B only," "the two together" — to stop the characters from bleeding into each other
  4. Switch the base model to an anime one (good at line art) and tune the settings
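
Steps 1 to 3 above boil down to a filter followed by a three-way split. Here is a minimal sketch of that logic, assuming each frame already carries a "has caption text" flag and a set of visible characters. The `Frame` type, its fields, and the group names are hypothetical; the text detector and character tagger that would fill them in are not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    name: str
    has_text: bool          # hypothetical output of a caption/text detector
    characters: frozenset   # visible characters: {"A"}, {"B"}, {"A", "B"}, or empty


def split_frames(frames):
    """Drop noisy frames, then bucket the rest into three training groups."""
    groups = {"a_only": [], "b_only": [], "both": []}
    for f in frames:
        if f.has_text:
            continue  # step 1: remove frames with caption text
        if not f.characters:
            continue  # step 2: drop pure backgrounds and transition frames
        # step 3: keep the characters in separate groups so they don't blend
        if f.characters == {"A", "B"}:
            groups["both"].append(f.name)
        elif f.characters == {"A"}:
            groups["a_only"].append(f.name)
        elif f.characters == {"B"}:
            groups["b_only"].append(f.name)
    return groups
```

Each group can then go into its own training folder. The split is only as good as the labels it receives: a tagger that misfires on text or body fragments puts frames in the wrong group.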

The result:

| Aspect | v1 (first) | v2 (rebuilt) |
| --- | --- | --- |
| Fused heads | ✗ frequent | ✓ gone |
| Thick/muddy lines | | ✓ thin and clean |
| Line-art look | | ✓✓ clearly line art |
| Backgrounds | | △ white now, but color bleeds onto clothes |
| Ghost text | | △ far less, a little remains |
| Resemblance to my characters (the whole point) | | ✗ a different person, sometimes missing an arm |

Look at just the top of that table and you think "oh, it's fixed!" I did too, for a second.
The noise problems (fused heads, thick lines, background color) really were mostly fixed; only some color bleed and a little ghost text remain.

But look closer and there's no trace of my original characters. The lines are clean,
but what comes out is "some vaguely anime-style stranger." The worst ones are even missing
an arm.

v2 line art of a single character cooking
▲ The lines are clean, sure — but it looks nothing like my original character. A stranger.

v2 line art of a single character standing
▲ Stable as a drawing. Still zero resemblance (and color still bleeds onto the clothes)


The real wall I never got past

Even with the noise gone, the one thing I actually wanted — reproducing my own
characters — was completely out of reach.
All I got was a clean-looking stranger.

And since even a single character came out this much of a different person, "two of them
together, in a scene with a relationship" was even more hopeless.
Ask for the two of them and only one shows up, or it falls apart.

I also chased down why two-person scenes were especially bad:

  • There were only 9 real frames in the entire set where the two were naturally side by side
  • I tried to get more by re-checking another 478 frames, but nearly every "two-person" hit was a false positive (the tagger reacting to on-screen text or body fragments)
  • → In other words, you cannot grow "two-person scenes" out of video material

If the composition you want (the two of them cleanly together) doesn't happen to exist in
the video, you can't extract it after the fact. Obvious in hindsight — but it really sank
in once I'd hit the wall.


Today's takeaway

Ripping frames from a video teaches the model "vaguely anime-ish" at best.
It couldn't reproduce my characters even for a single figure (let alone two / a
relationship). In the end, you have to hand-draw the stills you want it to learn.

This isn't a sour-grapes conclusion — it's the answer after exhausting every way to grow
the data. I took the long way around trying to be lazy, but because of it I now
understand, first-hand, why hand-drawn stills are necessary.

Someday I'll act on this conclusion and prepare the compositions by hand — but that's
a project for another day.


The details

Training settings (v1 → v2)

| Item | v1 | v2 |
| --- | --- | --- |
| Base model | SD1.5 (plain) | anime-style (good at line art) |
| clip_skip | 1 | 2 |
| Data | ~300 lumped together | split into 3 groups (68 / 64 / 9) |
| Epochs | 2 | 3 |
| LoRA dim / alpha | 32 / 16 | 32 / 16 (kept) |
| Training time (DGX) | ~17 min | ~14 min |

In v2 I pinned a required character-name tag to the front of each group's captions
(keep_tokens) to suppress the characters bleeding into each other.
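
Kohya's `keep_tokens` setting holds the first N comma-separated caption tags in place when captions are shuffled, so the character-name tag has to sit at the very front of every caption. A minimal sketch of that caption rewrite follows. The helper name `pin_trigger_tags` and the tag names `chara_a` / `chara_b` are placeholders, not the tags actually used.

```python
def pin_trigger_tags(caption: str, triggers: list[str]) -> str:
    """Move the required character tags to the front of a comma-separated caption.

    `triggers` are hypothetical character-name tags; with keep_tokens set to
    len(triggers), they would stay fixed while the other tags get shuffled.
    """
    tags = [t.strip() for t in caption.split(",") if t.strip()]
    # Remove any existing copies of the triggers so they are not duplicated.
    rest = [t for t in tags if t not in triggers]
    return ", ".join(triggers + rest)
```

For an "A only" caption, `pin_trigger_tags("lineart, chara_a, white background", ["chara_a"])` gives `"chara_a, lineart, white background"`. For the two-person group, both names would be pinned and `keep_tokens` raised to 2.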

Why "two-person scenes" couldn't be grown

I re-tagged another 478 video frames looking for "two people in frame." Co-occurrence
flagged 25, but on full-resolution inspection almost all of them held only one person —
the tagger was misfiring on on-screen text labels and body fragments. The real "two
together" frames were the 9 I'd hand-picked at the start, basically the whole supply.
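
The co-occurrence check and the manual review that followed can be sketched as below. This is an illustration only: the score format, the 0.5 threshold, and both function names are assumptions, and the numbers in the usage note are toy data rather than the real tagger output.

```python
def flag_two_person(scores: dict, threshold: float = 0.5) -> list[str]:
    """Flag frames where the tagger is confident that BOTH characters are present.

    `scores` maps frame name -> {"A": confidence, "B": confidence} (assumed format).
    """
    return sorted(
        name
        for name, s in scores.items()
        if s.get("A", 0.0) >= threshold and s.get("B", 0.0) >= threshold
    )


def precision(flagged: list[str], verified: set) -> float:
    """Share of flagged frames that a manual full-resolution check confirmed."""
    if not flagged:
        return 0.0
    return sum(1 for name in flagged if name in verified) / len(flagged)
```

With toy scores `{"f1": {"A": 0.9, "B": 0.8}, "f2": {"A": 0.9, "B": 0.6}, "f3": {"A": 0.9, "B": 0.1}}`, the first call flags `f1` and `f2`. If only `f1` survives manual inspection, precision is 0.5. A low precision like that is the signal that the tagger is reacting to something other than a second character.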

What's still left (the homework)

  • Color bleeding onto clothes (likely from color tags in some groups)
  • Leftover ghost text (a little text-like noise remains)
  • And the big one: reproducing the characters at all. Neither single figures nor pairs actually look like "my" characters → hand-draw the compositions I need

Next up

Next time I'm switching things up with a completely different experiment 🎬

#100ExperimentsWithDGX #LocalLLM
