PEPPERCORN

Posted on Jul 3

[Day 13] I got a cat to "talk." The biggest wall: the AI couldn't recognize the cat's face

#localllm #ai #dgxspark #machinelearning

Intro

Day 13!

Today's experiment: take a single cat image, lay human facial motion on top of it, and make a "talking cat." It's the usual "talking avatar" idea, except I use a cat instead of my own face. The tool is LivePortrait (still image + a "driving video" of motion → it transfers the video's expressions onto the still).

The result: a cat that properly talks. The hard part wasn't the animation — it was the step before it, getting the AI to recognize the cat's face. Here's where it snagged, and how I got past it.

What I used: DGX Spark (my home AI machine) / LivePortrait / one cat image (AI-generated) / a driving video (ships with LivePortrait).

Result first: a talking cat

The mouth and eye movement from a human talking-video landed on the cat's face. But there were a few snags on the way here.

Snag #1: it won't recognize a "face" at all

To make a cat talk, the first step is to find where the face is in the image (the positions of the eyes, nose, and mouth). That is the face detector's job.

The tool I reached for first didn't have a single detector installed, so it stopped with an error every time. I added detectors and tried again, but the answer didn't change:

Detector I added	Commercial use	Recognized the cat's face?
MediaPipe	OK	❌ no
InsightFace	Not allowed (non-commercial)	❌ no

None could recognize the cat's face. They're not broken — they're all built to find human faces, so a cat's face doesn't register as a "face."

Snag #2: why it couldn't recognize the face

The "animal mode" of the tool I started with only swaps the motion part for an animal version — the crucial "find the animal's face" detector was never bundled in. All that's left is the human one.

That was why work had stalled here last time, too. Not disk space, not the GPU — just a tool that couldn't look for an animal's face.

The fix: use the original tool, and skip the build

The upstream (original) LivePortrait does ship an animal-specific face detector, called XPose. So I set it up in a separate folder and used that.

The catch: XPose normally needs you to compile (build) one part on your own machine, and on this new-generation machine there was no guarantee the build would go through. Reading the code, I found a slower fallback bundled inside that needs no build. I rewrote three files to route to it, dodging compilation entirely. For a short clip, the slowness doesn't matter.

The exact files and edits are in "The details" at the bottom.

It worked — but at first it just looked like the cat stuck its tongue out

The detector ran, the cat's face was recognized, and a video generated. But the first result looked like the cat just gave a little tongue-out blep. The expression transfer was clearly working — so why?

The cause was the driving video (= the input you feed the AI). My first one was a short "just open the mouth" sample — and if the reference only opens its mouth, the cat only opens its mouth. Swapping in a video of someone actually talking gave me a cat whose eyes and mouth both moved.

The quality of the driving video pretty much decides the result.

Today's takeaway

The hard part isn't the motion engine, it's recognizing the animal's face.
A tool's "animal support" can be a label with the actual part (the detector) missing. When it won't run, tracking down which part is missing is the fast path.
If you hit a "must compile" part on a too-new machine, look first for a no-build fallback route.
The quality of the result is mostly decided by the quality of the input (the driving motion).

A note on licensing

The detector that recognized the cat's face (XPose) and the InsightFace models the upstream pipeline reuses are both under non-commercial licenses. So I avoid commercial use of the footage itself, and this article keeps the focus on the method and the gotchas.
The commercially-OK detector (MediaPipe) couldn't recognize the cat this time.

The details

What was missing, and how it was solved

The "animal mode" of the ComfyUI node I used first only swaps in the animal motion model; the animal face detector (XPose) is not bundled. Human detectors (InsightFace / MediaPipe / FaceAlignment) can't detect a cat's face, so it stops with the error "No face detected".
The fix: set up upstream KwaiVGI/LivePortrait in a separate folder, fetch the official weight set (including xpose.pth), and use inference_animals.py.
The InsightFace and landmark models I already had could be reused.
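
As a rough sketch of that setup, here is a small Python helper that checks the weight set actually contains xpose.pth and then builds the command line for inference_animals.py. The folder names, image name, and video name are hypothetical placeholders, and the -s / -d flags are how I remember the upstream README; confirm them with the script's --help before relying on this.

```python
import shlex
from pathlib import Path


def find_xpose_weights(weights_dir):
    """Return the first xpose.pth found under weights_dir, or None if absent."""
    root = Path(weights_dir)
    if not root.is_dir():
        return None
    # Search recursively so the exact subfolder layout does not matter here.
    return next(root.rglob("xpose.pth"), None)


def build_animal_inference_command(source_image, driving_video, python="python"):
    """Build the argv list for one run; meant to be executed from the repo folder."""
    return [
        python,
        "inference_animals.py",
        "-s", str(Path(source_image).resolve()),   # still image to animate (the cat)
        "-d", str(Path(driving_video).resolve()),  # video whose motion is transferred
    ]


# Hypothetical file names, for illustration only.
cmd = build_animal_inference_command("cat.png", "talking.mp4")
print(shlex.join(cmd))
# To actually run it (assuming the repo lives in ./LivePortrait):
#   subprocess.run(cmd, cwd="LivePortrait", check=True)
```

Checking for xpose.pth up front is cheap insurance: a missing detector weight is exactly the kind of "missing part" that cost me time with the first tool.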

The no-compile patch for XPose

XPose is built to compile its own CUDA custom op called MultiScaleDeformableAttention. On the newest GPU/CUDA generation there's no guarantee that build succeeds, so I routed it to the bundled pure-PyTorch fallback instead.

Three files edited (under XPose's ops/):

functions/ms_deform_attn_func.py: wrap the compiled-version import in try/except, set the flag to False on failure.
modules/ms_deform_attn.py: when that flag is False, branch forward through the pure-PyTorch ms_deform_attn_core_pytorch.
(if needed) add weights_only=False to the torch.load in animal_landmark_runner.py.

Now animal detection runs with no compilation at all. It's slower, but for a short clip it's fine (one generation finished in ~8 seconds).
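
The shape of that patch, reduced to plain Python, looks roughly like the sketch below. It is not the literal diff: the module name MultiScaleDeformableAttention is my assumption about what the compiled extension is imported as, and compiled_impl / fallback_impl are hypothetical stand-ins for the compiled kernel and for ms_deform_attn_core_pytorch.

```python
import importlib

# Assumed import name of the compiled CUDA extension.
COMPILED_OP_MODULE = "MultiScaleDeformableAttention"


def try_import_compiled_op(module_name=COMPILED_OP_MODULE):
    """Return the compiled extension module, or None if it cannot be imported."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # Never built, or built against a mismatched toolchain.
        return None


MSDA = try_import_compiled_op()
HAS_COMPILED_OP = MSDA is not None  # the flag the attention module branches on


def deform_attn_forward(inputs, compiled_impl, fallback_impl,
                        has_compiled=HAS_COMPILED_OP):
    """Use the compiled kernel when it is available, else the pure-PyTorch path."""
    if has_compiled:
        return compiled_impl(inputs)
    return fallback_impl(inputs)
```

Keeping the flag in one place means the rest of the code never needs to know whether a build happened; it just calls forward and gets the slower path when the extension is missing.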

Environment gotchas

Each time I swapped detectors, the base library (numpy) version see-sawed (mediapipe wants numpy<2, insightface wants 2.x). The existing core (cv2/torch) survived, but the clean approach is to keep upstream LivePortrait in its own isolated environment.
When stopping the server, killing by process name took out my own command too. Stopping by port number was safe.
Disk and GPU had plenty of headroom the whole time — not once was the snag a resource shortage.
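
For the server-stopping gotcha, a minimal sketch of the port-based approach follows. One plausible reason the name-based kill hit my own command is that a pattern matched against full command lines also matches the shell doing the killing; looking up the listener on a port avoids that. This assumes lsof is installed, and port 8188 in the usage comment is only an example, not necessarily the port I used.

```python
import os
import signal
import subprocess


def lsof_listen_command(port):
    """argv for lsof: print only PIDs (-t) of processes listening on a TCP port."""
    return ["lsof", "-t", f"-iTCP:{port}", "-sTCP:LISTEN"]


def parse_pids(lsof_output):
    """Turn lsof -t output (one PID per line) into a sorted, de-duplicated list."""
    return sorted({int(token) for token in lsof_output.split()})


def pids_listening_on(port):
    """Return PIDs listening on the port, or an empty list if lsof is unavailable."""
    try:
        result = subprocess.run(lsof_listen_command(port),
                                capture_output=True, text=True, check=False)
    except FileNotFoundError:
        return []
    return parse_pids(result.stdout)


def stop_server_on_port(port, sig=signal.SIGTERM):
    """Send sig to every listener on the port except this process itself."""
    stopped = []
    for pid in pids_listening_on(port):
        if pid == os.getpid():
            continue
        os.kill(pid, sig)
        stopped.append(pid)
    return stopped


# Example (not executed here): stop_server_on_port(8188)
```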

Next up

Next time I'm switching things up again with a different kind of experiment 🎬

#100ExperimentsWithDGX #LocalLLM

DEV Community