Israel Martins

Posted on Jul 1

Building Anime Lip Sync in ComfyUI: A Detection-Guided Diffusion Pipeline


Lip syncing anime footage is deceptively difficult. A realistic face already gives a model many visual cues, but anime faces often reduce the mouth to a few lines, flatten facial depth, and change proportions dramatically between styles. A generic lip-sync system may produce unstable shapes, paste skin around the lips, mistake an eyelash for a closed mouth, or place a valid mouth at the wrong location when the character moves.

The Anime Lip Sync custom node for ComfyUI approaches the problem as a controlled, multi-stage pipeline. Instead of asking one model to solve audio alignment, face editing, mouth generation, segmentation, tracking, and compositing at once, it gives each problem to a specialized component.

The result is a workflow that combines phoneme analysis, YOLO detection, diffusion, ControlNet guidance, BEN2 background removal, anatomical validation, temporal caching, and final video composition.

The Core Idea

The node receives an audio track, a batch of video frames, and the models needed for diffusion. It then produces three mouth states:

neutral_closed
half_open
fully_open

These are intentionally broad categories. Anime animation often communicates speech with a small vocabulary of mouth shapes, so generating a stable set of reusable states can look more coherent than trying to synthesize a unique mouth for every phoneme.

The pipeline is organized into four phases:

Detect and prepare the face.
Generate the required mouth states with diffusion.
Isolate and track the generated mouth.
Composite the mouth and build the final video.

Phase 0: Preparing a Clean Face

The first phase uses a custom YOLO model to identify facial regions.

Model Link: https://huggingface.co/israelmarmar/anime_face_mouth/tree/main

The current anatomical validator expects the following classes:

0 = face
1 = mouth
2 = nose
3 = chin

The face bounding box defines the working crop. The original mouth is detected, masked, and inpainted when mouth removal is enabled. This produces a clean base frame that can receive a newly generated mouth later.

For scenes with multiple characters, the node can optionally use Florence-2 phrase grounding. A text query such as "the girl with blue hair" identifies the target character before YOLO searches for the face. This prevents the pipeline from switching to another character simply because that face has a higher detection score.

All intermediate assets are written through a disk-backed store. Original frames, mouth masks, face masks, bounding boxes, generated faces, mouth crops, and output frames remain separated. This structure limits VRAM pressure and makes the pipeline easier to inspect.

Phase 1: Audio-Driven Diffusion

Audio is analyzed with Allosaurus. Detected sounds are mapped to the three mouth categories, including a neutral closed state for silence. The node then groups visually similar face crops so it does not need to run diffusion independently for every video frame.
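
The mapping from sounds to mouth states can be sketched as follows. This is a minimal illustration, assuming the recognizer output has already been parsed into (start, duration, phoneme) tuples; the phoneme groups and both function names are hypothetical and may not match the node's actual tables.

```python
# Hypothetical phoneme groups; the node's real mapping may differ.
FULLY_OPEN = {"a", "ɑ", "æ", "ɔ", "o"}
CLOSED = {"m", "b", "p"}


def mouth_state_for(phoneme):
    """Map one phoneme label (or None for silence) to a mouth state."""
    if phoneme is None or phoneme in CLOSED:
        return "neutral_closed"
    if phoneme in FULLY_OPEN:
        return "fully_open"
    # Every other sound falls back to the middle state.
    return "half_open"


def states_per_frame(segments, fps, frame_count):
    """Build one mouth state per video frame.

    segments: list of (start_sec, duration_sec, phoneme) tuples.
    Frames not covered by any segment stay neutral_closed (silence).
    """
    states = ["neutral_closed"] * frame_count
    for start, duration, phoneme in segments:
        first = int(start * fps)
        last = int((start + duration) * fps)
        for i in range(first, min(last + 1, frame_count)):
            states[i] = mouth_state_for(phoneme)
    return states
```

Keeping silence as the default state means gaps between detected phonemes close the mouth without any extra handling.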

Face similarity is based on facial structure and scribble-like contours rather than color alone. HED guidance preserves important anime line work, head pose, and silhouette while allowing diffusion to change the mouth. The generated face is conditioned through a Qwen Image DiffSynth ControlNet workflow and sampled with Z-Image Turbo.

Each mouth state can use its own:

prompt or external conditioning
LoRA
step count
CFG value
denoise strength
ControlNet strength
sampler and scheduler

This matters because a closed mouth and a fully open mouth are different generation problems. A closed mouth needs a crisp, restrained line, while an open mouth needs enough denoising freedom to create a visible interior.
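
One way to picture the per-state options is a small settings record per mouth state. The sketch below is only an illustration: the class name, field names, prompts, and every numeric value are assumptions chosen to show the idea that a closed mouth gets less denoising freedom than an open one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MouthStateSettings:
    """Hypothetical container for the per-state generation options."""
    prompt: str
    lora: Optional[str]
    steps: int
    cfg: float
    denoise: float
    controlnet_strength: float
    sampler: str
    scheduler: str


# Illustrative values only; real settings depend on the model and the style.
STATE_SETTINGS = {
    "neutral_closed": MouthStateSettings(
        prompt="closed mouth, thin lip line", lora=None, steps=8,
        cfg=1.0, denoise=0.45, controlnet_strength=0.9,
        sampler="euler", scheduler="simple",
    ),
    "half_open": MouthStateSettings(
        prompt="slightly open mouth", lora=None, steps=8,
        cfg=1.0, denoise=0.55, controlnet_strength=0.8,
        sampler="euler", scheduler="simple",
    ),
    "fully_open": MouthStateSettings(
        prompt="wide open mouth, visible interior", lora=None, steps=8,
        cfg=1.0, denoise=0.70, controlnet_strength=0.7,
        sampler="euler", scheduler="simple",
    ),
}
```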

Do Not Cache a Bad Generation

Diffusion is stochastic. Even with a strong prompt and LoRA, a requested closed mouth may appear slightly open, or no usable mouth may be generated at all. Caching that result would spread the failure across many frames.

The node therefore validates every generated face before saving it to the history cache. The YOLO model searches for a mouth, but confidence alone is not enough. Anime eyelashes and eye contours can look like closed mouths to an object detector.

The validator uses the detected nose and chin to build an anatomical mouth region. A mouth candidate must be:

inside the detected face
below the nose
above the chin
horizontally aligned with the nose and chin

If the highest-confidence candidate lies near the eyes, it is rejected. The node performs a second YOLO pass with a lower confidence threshold and a larger inference resolution to look for another candidate inside the valid region.
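
The anatomical rules above can be sketched with plain bounding-box arithmetic. This is a simplified illustration, not the node's code: boxes are assumed to be (x1, y1, x2, y2) tuples in image coordinates, and the function names and the max_x_offset tolerance are hypothetical.

```python
def center(box):
    """Return the (x, y) center of an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def is_anatomically_valid(mouth, face, nose, chin, max_x_offset=0.25):
    """Check a mouth candidate against the face, nose, and chin boxes."""
    mx, my = center(mouth)
    nx, ny = center(nose)
    cx, cy = center(chin)
    fx1, fy1, fx2, fy2 = face

    inside_face = fx1 <= mx <= fx2 and fy1 <= my <= fy2
    below_nose = my > ny  # image y grows downward
    above_chin = my < cy
    # Horizontal alignment: stay close to the nose-chin axis,
    # with a tolerance expressed as a fraction of the face width.
    axis_x = (nx + cx) / 2.0
    aligned = abs(mx - axis_x) <= max_x_offset * (fx2 - fx1)
    return inside_face and below_nose and above_chin and aligned


def pick_mouth(candidates, face, nose, chin):
    """candidates: list of (confidence, box). Return the best valid box or None."""
    valid = [(conf, box) for conf, box in candidates
             if is_anatomically_valid(box, face, nose, chin)]
    if not valid:
        return None
    return max(valid, key=lambda item: item[0])[1]
```

With this filter, a high-confidence eyelash detection near the eyes is discarded before confidence is compared, so a weaker but correctly placed mouth can still win.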

If no anatomically valid mouth is found, the KSampler runs again with a new seed. The mouth_regen_attempts option controls how many additional samples may be generated.
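
The retry behavior can be pictured as a simple loop around the sampler. In this sketch, sample and validate are hypothetical callables standing in for the KSampler call and the mouth validator; only the mouth_regen_attempts name comes from the node.

```python
import random


def generate_valid_mouth(sample, validate, mouth_regen_attempts, rng=None):
    """Sample until the validator accepts a result or the attempts run out.

    sample(seed) -> image, validate(image) -> (ok, message).
    Returns (image, seed, attempt_index) on success, or None on failure.
    """
    if rng is None:
        rng = random.Random()
    # One initial sample plus the allowed number of regenerations.
    for attempt in range(1 + mouth_regen_attempts):
        seed = rng.randrange(2 ** 32)
        image = sample(seed)
        ok, _message = validate(image)
        if ok:
            return image, seed, attempt
    return None
```

Returning None instead of the last failed image keeps a bad generation out of the cache, which is the point of validating before saving.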

Validating a Truly Closed Mouth

A closed-mouth detection needs stricter rules. It should be wide enough to be visible, horizontally shaped, and thin enough to resemble a lip contour rather than a filled cavity.

For neutral_closed, the validator checks:

relative width
relative height
width-to-height aspect ratio
detected area
internal fill
foreground thickness
the presence of a solid inner core

The final checks analyze the pixels inside the detected mouth bounding box. A thin lip line has low internal fill and little or no eroded core. A slightly open mouth usually contains a thicker dark or reddish region. When that filled region exceeds the allowed thresholds, the candidate is rejected and diffusion samples again.
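
A rough sketch of the fill and core checks follows. It assumes the mouth crop has already been thresholded into a binary mask (1 for dark or reddish foreground pixels, 0 for skin); the function names and the max_fill and max_core thresholds are illustrative assumptions, and a real implementation would more likely use an image library than nested lists.

```python
def fill_ratio(mask):
    """Fraction of foreground pixels in a binary mask (list of rows)."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total if total else 0.0


def erode(mask):
    """3x3 binary erosion; pixels on the border are always removed."""
    if not mask or not mask[0]:
        return []
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            if all(mask[y + dy][x + dx]
                   for dy in (-1, 0, 1) for dx in (-1, 0, 1)):
                out[y][x] = 1
    return out


def looks_closed(mask, max_fill=0.35, max_core=0.02):
    """A thin lip line has low fill and almost no core left after erosion."""
    return (fill_ratio(mask) <= max_fill
            and fill_ratio(erode(mask)) <= max_core)
```

A one-pixel line vanishes completely under erosion, while a filled cavity leaves a solid core, which is why the eroded mask separates the two cases even when their bounding boxes look similar.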

After a closed mouth passes validation, a local contrast enhancement makes the lip contour easier to read. CLAHE and mild sharpening are applied only around the accepted anatomical mouth region, preserving the rest of the face.

Phase 2: Extracting the Mouth with BEN2

Once a generated face passes validation, the mouth must be isolated without carrying a rectangle of differently colored skin into the final frame.

The node calls the same easy imageRemBg custom node used by ComfyUI-Easy-Use, with the following settings:

rem_mode = BEN2
add_background = none
refine_foreground = false

The returned RGBA image is used directly. This is important because multiplying the BEN2 alpha by additional heuristic masks can make the mouth unnecessarily transparent. BEN2 remains responsible for separating the foreground mouth from surrounding skin, while fallback masking is used only when the custom node is unavailable.

The final mouth detector also uses the nose-and-chin anatomical selector. This prevents Phase 1 from validating the correct mouth only for Phase 2 to crop an eyelash with a higher confidence score.

Tracking Mouth Position Relative to the Face

Absolute coordinates are not enough for video. If a character moves to the right, a cached mouth should move with the face rather than remain at the old screen position.

The mouth cache stores normalized coordinates relative to the face bounding box:

rx, ry = relative mouth position
rw, rh = relative mouth size

When a visually similar face appears at a different position or scale, the node projects those normalized values into the new face bounding box. This allows a generated mouth state to be reused while still following head movement.
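
The normalization and projection steps can be sketched as two small functions. This is an illustration under stated assumptions: boxes are (x1, y1, x2, y2) tuples, rx and ry are taken to be the mouth's top-left corner relative to the face box (the node may use the center instead), and the function names are hypothetical.

```python
def to_relative(mouth, face):
    """Convert an absolute mouth box into face-relative (rx, ry, rw, rh)."""
    fx1, fy1, fx2, fy2 = face
    fw, fh = fx2 - fx1, fy2 - fy1
    mx1, my1, mx2, my2 = mouth
    return ((mx1 - fx1) / fw, (my1 - fy1) / fh,
            (mx2 - mx1) / fw, (my2 - my1) / fh)


def project(rel, face):
    """Project face-relative values into a new face box, in whole pixels."""
    rx, ry, rw, rh = rel
    fx1, fy1, fx2, fy2 = face
    fw, fh = fx2 - fx1, fy2 - fy1
    x1 = fx1 + rx * fw
    y1 = fy1 + ry * fh
    return (round(x1), round(y1), round(x1 + rw * fw), round(y1 + rh * fh))


# Example: the face moves and shrinks, and the cached mouth follows it.
rel = to_relative((180, 240, 220, 250), (100, 100, 300, 300))
moved = project(rel, (400, 50, 500, 150))
```

Because both position and size are stored as fractions of the face box, the same cached mouth works when the character moves closer to or farther from the camera.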

Cache epochs are updated when facial appearance or motion changes enough to make previous mouth data unsafe. The pipeline can also process mouth extraction and composition concurrently with diffusion to reduce total runtime.

Phase 3: Compositing and Video Output

The generated RGBA mouth is resized to its tracked destination and alpha-composited over the inpainted frame. Weak residual alpha values are removed to reduce halos and skin contamination. The output frame sequence is then encoded with FFmpeg and combined with the original audio.
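
The compositing step, including the removal of weak alpha values, can be sketched per pixel. This simplified example assumes the mouth has already been resized, represents images as nested lists of tuples, and uses a hypothetical alpha_floor threshold; the node itself presumably works on image arrays.

```python
def composite_pixel(dst_rgb, src_rgba, alpha_floor=0.08):
    """Blend one RGBA mouth pixel over one RGB frame pixel."""
    r, g, b, a = src_rgba
    alpha = a / 255.0
    if alpha < alpha_floor:
        # Weak residual alpha is dropped to avoid halos and skin tint.
        return dst_rgb
    return tuple(round(s * alpha + d * (1.0 - alpha))
                 for s, d in zip((r, g, b), dst_rgb))


def composite(frame, mouth, left, top, alpha_floor=0.08):
    """Paste an RGBA mouth onto a copy of an RGB frame at (left, top).

    frame: rows of (r, g, b) tuples; mouth: rows of (r, g, b, a) tuples.
    Pixels that fall outside the frame are skipped.
    """
    out = [list(row) for row in frame]
    for y, row in enumerate(mouth):
        for x, pixel in enumerate(row):
            fy, fx = top + y, left + x
            if 0 <= fy < len(out) and 0 <= fx < len(out[fy]):
                out[fy][fx] = composite_pixel(out[fy][fx], pixel, alpha_floor)
    return out
```

Thresholding before blending means faint fringe pixels leave the inpainted frame untouched instead of adding a slightly tinted outline around the mouth.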

The node returns:

the final video path
VHS-compatible video information
an optional persistent debug directory

Debugging the Pipeline

Enabling save_debug_folder copies the temporary working directory into the ComfyUI output folder before cleanup. This provides access to:

original and inpainted frames
face and mouth masks
face bounding boxes
generated diffusion faces
generated mouth history
BEN2 RGBA mouth caches
final composed frames
validation metadata

History JSON files include whether the generated mouth passed validation and a message containing confidence, anatomy, dimensions, and contour metrics. Mouth cache metadata also records the alpha source, making it possible to confirm that easy imageRemBg with BEN2 was used.

Practical Tuning

The most useful controls are:

mouth_conf for detector sensitivity
verify_generated_mouth to enable generation validation
mouth_regen_attempts for diffusion retries
sim_threshold for face grouping
motion_variance_factor for cache invalidation
per-state denoise and ControlNet strength
per-state mouth padding and brightness
compose_feather_px for final blending

A low mouth_conf value may recover subtle closed-mouth lines but can increase false positives. Anatomical validation helps make lower thresholds practical. A higher mouth_regen_attempts value improves the chance of obtaining a valid mouth at the cost of inference time.

Why This Architecture Works for Anime

The pipeline does not assume that one neural network will always make the correct decision. It combines generative and discriminative models in a feedback loop:

audio chooses the requested mouth state
diffusion proposes a visual result
YOLO checks whether the result contains a plausible mouth
nose and chin detections verify its location
pixel analysis verifies that a closed mouth is actually closed
BEN2 extracts the accepted mouth
face-relative tracking places it consistently over time

That layered approach is especially useful for stylized animation, where a small line can represent a mouth, an eyelash can look like the same line, and a few pixels can determine whether an expression reads correctly.

The custom node still needs some improvements, such as refining the shape of the mouths and stabilizing their position to prevent flickering. Any suggestions for improvement are welcome.

Repository: https://github.com/israelmarmar/anime-lipsync