PEPPERCORN

Posted on Jun 23

[Day 12] I tried to build a line-art LoRA from video frames, and the characters' heads fused together

#localllm #ai #dgxspark #stablediffusion

Intro

Day 12!

This time I tried to take my own hand-drawn animation (a short video) and build a
line-art LoRA that learns its art style and characters.

The plan was a little lazy, honestly. The usual way to train this is to prepare
character stills one by one, by hand. But I thought:
"I already have the video — why not just rip frames out of it and collect the
training material the easy way?"

Short version: the lines got clean, but the one thing that mattered — actually
reproducing my characters — failed completely. And the reason it failed is what I
actually took away from today.

What I used: my home AI machine (DGX Spark) + a training tool (Kohya) + my own
hand-drawn animation (two characters).
Note: everything shown here is LoRA-generated line art only. I'm not showing the
source video itself or where it's published.

Result first: failure on the left, and the right is also a failure

Left is the first attempt. A second body grows upside-down out of the top of the head.
Right is after I tracked down the causes and rebuilt the data — the lines came out
clean. But as you'll see, it's also a failure: it looks nothing like my original
characters — it's a totally different person.

How can "the breakage got fixed" still be a failure? Let me walk through it.

What I did: rip frames from the video and train (v1)

Simple steps:

1. Extract still frames from my hand-drawn animation video
2. Roughly select ~300 of them as training material
3. Train a LoRA on them
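
For anyone who wants to try the same shortcut, here is a minimal sketch of the "roughly select ~300" step. The post doesn't say how the frames were picked, so even spacing across the clip is purely an assumption, and `pick_frame_indices` is a hypothetical helper name. The resulting indices would then be handed to whatever frame extractor you use (not shown here).

```python
def pick_frame_indices(total_frames: int, target: int) -> list[int]:
    """Return up to `target` frame indices spread evenly across a clip.

    Assumption: uniform spacing. The original selection method is unknown.
    """
    if total_frames <= 0 or target <= 0:
        return []
    if target >= total_frames:
        # Fewer frames than requested: keep every frame.
        return list(range(total_frames))
    step = total_frames / target  # > 1 here, so the indices never repeat
    return [int(i * step) for i in range(target)]


# Example: a hypothetical 3000-frame clip sampled down to 300 frames.
indices = pick_frame_indices(3000, 300)
print(len(indices), indices[:3])
```

Note that even spacing does nothing about the problems described later (overlap, motion blur, on-screen text); it only controls how many frames you end up with.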

The training itself took 17 minutes on the DGX. Lightning fast.
"Oh, this is easy," I thought — for about five minutes.

Then I generated, and it was a mess

Asking the finished LoRA for "single character" and "two-person scenes" gave me this:

Symptom	How it broke
Fused heads	A second body sprouts from the head / multiple faces merge into one
Backgrounds won't go away	Even asking for "white background," blue/pink backgrounds show up
Thick, muddy lines	No clean line art, everything is heavy and blurry
Ghost text	Meaningless characters (leftover captions) get baked into the image

It could tell the characters apart (A and B were recognized as different people).
But the looks were wrecked. Lined up, the actual outputs were quite the horror show:

▲ When two characters show up, you can't tell what's what anymore

▲ Faces fuse and multiply

▲ Leftover captions bake in as "ghost text" all over the frame

▲ Train on high-motion frames and everything just melts

Why did it break? (the real point)

The culprit was the source material itself: video frames.

A video, if you think about it, is footage with many things happening at once. Train on a
frame ripped out of it and the model doesn't just learn the character's shape; it learns
all the surrounding noise too.

Symptom	Cause
Fused heads	Video has lots of frames where two people move in the same shot. The model learns an instant where bodies overlap as "one single body"
Backgrounds stick	Tons of background-laden frames get in. It learns "character = with this colored background" as a set, and you can't override it later
Thick lines	Mid-motion frames are blurred; that blur bakes in as a "thick-line style"
Ghost text	Caption text sitting on the frames sneaks into the material and gets learned

In one sentence: video is noise-laden material — motion, overlap, backgrounds, text
all baked in — and it's a poor way to cleanly extract just a character's shape.

▲ Even asking for "white background," the training background color (pink) won't peel off — and the character multiplies for good measure

Fixing it (v2)

Now that I knew the causes, I rebuilt the data side and trained again.

1. Auto-remove frames with caption text
2. Drop "no character" frames: pure backgrounds, transition frames (~300 → 141 frames)
3. Split into three groups ("A only," "B only," "the two together") to stop the characters from bleeding into each other
4. Switch the base model to an anime one (good at line art) and tune the settings (clip_skip 1 → 2, epochs 2 → 3)
The result (✓ = good, △ = partial, ✗ = bad):

Aspect	v1 (first)	v2 (rebuilt)
Fused heads	✗ frequent	✓ gone
Thick/muddy lines	✗	✓ thin and clean
Line-art look	△	✓✓ clearly line art
Backgrounds	✗	△ white now, but color bleeds onto clothes
Ghost text	✗	△ far less, a little remains
Resemblance to my characters (the whole point)	✗	✗ a different person — sometimes missing an arm

Look at just the top of that table and you think "oh, it's fixed!" I did too, for a second.
The noise problems (fused heads, thick lines, background color) really were fixed.

But look closer and there's no trace of my original characters. The lines are clean,
but what comes out is "some vaguely anime-style stranger." The worst ones are even missing
an arm.

▲ The lines are clean, sure — but it looks nothing like my original character. A stranger.

▲ Stable as a drawing. Still zero resemblance (and color still bleeds onto the clothes)

The real wall I never got past

Even with the noise gone, the one thing I actually wanted — reproducing my own
characters — was completely out of reach. All I got was a clean-looking stranger.

And since even a single character came out this much of a different person, "two of them
together, in a scene with a relationship" was even more hopeless. Ask for the two of them
and only one shows up, or it falls apart.

I chased down why two-person was especially bad, too:

There were only 9 real frames in the entire set where the two were naturally side by side
I tried to get more by re-checking another 478 frames, but almost every "two-person" hit was a false positive (the detector reacting to on-screen text or body fragments)
→ In other words, you cannot grow "two-person scenes" out of video material

If the composition you want (the two of them cleanly together) doesn't happen to exist in
the video, you can't extract it after the fact. Obvious in hindsight — but it really sank
in once I'd hit the wall.

Today's takeaway

Ripping frames from a video teaches the model "vaguely anime-ish" at best.
It couldn't reproduce my characters even for a single figure (let alone two / a
relationship). In the end, you have to hand-draw the stills you want it to learn.

This isn't a sour-grapes conclusion. It's the answer I reached after trying every way I
could think of to grow the data. I took the long way around trying to be lazy, but because
of it I now understand, first-hand, why hand-drawn stills are necessary.

Someday I'll act on this conclusion and prepare the compositions by hand — but that's
a project for another day.

The details

Training settings (v1 → v2)

Item	v1	v2
Base model	SD1.5 (plain)	anime-style (good at line art)
clip_skip	1	2
Data	~300 lumped together	split into 3 groups (68 / 64 / 9)
Epochs	2	3
LoRA dim / alpha	32 / 16	32 / 16 (kept)
Training time (DGX)	~17 min	~14 min

In v2 I put a required character-name tag at the front of each group's captions and
pinned it there with keep_tokens, to suppress the characters bleeding into each other.
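
A minimal sketch of the caption side of that, assuming comma-separated tag captions. As I understand it, Kohya's `keep_tokens` setting keeps the first N comma-separated tags in place when captions are shuffled, so the character tag has to be first for it to help. The helper name `pin_trigger_tag` and the tag `chara_a` are made up for illustration; my real tags are not shown here.

```python
def pin_trigger_tag(caption: str, trigger: str) -> str:
    """Move (or add) `trigger` to the front of a comma-separated caption."""
    tags = [t.strip() for t in caption.split(",") if t.strip()]
    # Remove any existing copy so the trigger appears exactly once, at the front.
    tags = [t for t in tags if t != trigger]
    return ", ".join([trigger, *tags])


# Hypothetical caption for a frame in the "A only" group.
print(pin_trigger_tag("1girl, lineart, chara_a, white background", "chara_a"))
```

With the tag at the front, a keep_tokens value of 1 would cover exactly that one tag.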

Why "two-person scenes" couldn't be grown

I re-tagged another 478 video frames looking for "two people in frame." Tag co-occurrence
flagged 25 candidates, but on full-resolution inspection almost all of them held only one
person: the tagger was misfiring on on-screen text labels and body fragments. The real "two
together" frames were the 9 I'd hand-picked at the start, which was basically the whole supply.
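
To put numbers on how thin the two-person supply was, here is the arithmetic using only the counts from this post (68 / 64 / 9 for the three groups, and 25 flagged out of 478 re-checked frames). The function name is just for illustration.

```python
def dataset_summary():
    """Compute group shares and the re-check flag rate from the post's counts."""
    groups = {"A only": 68, "B only": 64, "A and B together": 9}
    total = sum(groups.values())  # 141 frames in the v2 set
    shares = {name: count / total for name, count in groups.items()}

    rechecked, flagged = 478, 25  # frames re-tagged / flagged as two-person
    flag_rate = flagged / rechecked
    return total, shares, flag_rate


total, shares, flag_rate = dataset_summary()
for name, share in shares.items():
    print(f"{name}: {share:.1%} of {total} frames")
print(f"flagged as two-person on re-check: {flag_rate:.1%}")
```

So the pairs make up roughly 6% of the training set, and even the flagged candidates were only about 5% of the re-checked frames before the false positives were thrown out.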

What's still left (the homework)

Color bleeding onto clothes (likely from color tags in some groups)
Leftover ghost text (a little text-like noise remains)
And the big one: reproducing the characters at all. Neither single figures nor pairs actually look like "my" characters → hand-draw the compositions I need

Next up

Next time I'm switching things up with a completely different experiment 🎬

#100ExperimentsWithDGX #LocalLLM

DEV Community