Lip syncing anime footage is deceptively difficult. A realistic face already gives a model many visual cues, but anime faces often reduce the mouth to a few lines, flatten facial depth, and change proportions dramatically between styles. A generic lip-sync system may produce unstable shapes, paste skin around the lips, mistake an eyelash for a closed mouth, or place a valid mouth at the wrong location when the character moves.
The Anime Lip Sync custom node for ComfyUI approaches the problem as a controlled, multi-stage pipeline. Instead of asking one model to solve audio alignment, face editing, mouth generation, segmentation, tracking, and compositing at once, it gives each problem to a specialized component.
The result is a workflow that combines phoneme analysis, YOLO detection, diffusion, ControlNet guidance, BEN2 background removal, anatomical validation, temporal caching, and final video composition.
The Core Idea
The node receives an audio track, a batch of video frames, and the models needed for diffusion. It then produces three mouth states:
- neutral_closed
- half_open
- fully_open
These are intentionally broad categories. Anime often communicates speech with a small vocabulary of mouth shapes, so generating a stable set of reusable states can look more coherent than trying to synthesize a unique mouth for every phoneme.
The pipeline is organized into four phases:
- Detect and prepare the face.
- Generate the required mouth states with diffusion.
- Isolate and track the generated mouth.
- Composite the mouth and build the final video.
Phase 0: Preparing a Clean Face
The first phase uses a custom YOLO model to identify facial regions.
Model Link: https://huggingface.co/israelmarmar/anime_face_mouth/tree/main
The current anatomical validator expects the following classes:
0 = face
1 = mouth
2 = nose
3 = chin
The face bounding box defines the working crop. The original mouth is detected, masked, and inpainted when mouth removal is enabled. This produces a clean base frame that can receive a newly generated mouth later.
For scenes with multiple characters, the node can optionally use Florence-2 phrase grounding. A text query such as "the girl with blue hair" identifies the target character before YOLO searches for the face. This prevents the pipeline from switching to another character simply because that face has a higher detection score.
All intermediate assets are written through a disk-backed store. Original frames, mouth masks, face masks, bounding boxes, generated faces, mouth crops, and output frames remain separated. This structure limits VRAM pressure and makes the pipeline easier to inspect.
Phase 1: Audio-Driven Diffusion
Audio is analyzed with Allosaurus. Detected sounds are mapped to the three mouth categories, including a neutral closed state for silence. The node then groups visually similar face crops so it does not need to run diffusion independently for every video frame.
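As a rough illustration of the mapping step, the sketch below turns timed phone segments into one mouth state per video frame. The phone sets, the `(start, duration, phone)` segment layout, and both function names are assumptions made for this example; the node's actual table and data structures may differ.

```python
from typing import List, Optional, Tuple

# Hypothetical phone groups, chosen only for illustration.
CLOSED_PHONES = {"m", "b", "p"}   # bilabials bring the lips together
OPEN_PHONES = {"a", "ɑ", "æ"}     # wide vowels


def mouth_state_for_phone(phone: Optional[str]) -> str:
    """Map one phone label (None means silence) to a mouth state."""
    if phone is None or phone in CLOSED_PHONES:
        return "neutral_closed"
    if phone in OPEN_PHONES:
        return "fully_open"
    return "half_open"


def states_per_frame(
    segments: List[Tuple[float, float, str]], fps: float, n_frames: int
) -> List[str]:
    """Expand (start_sec, duration_sec, phone) segments to per-frame states.

    Frames not covered by any segment keep the silent neutral_closed state.
    """
    states = ["neutral_closed"] * n_frames
    for start, duration, phone in segments:
        first = round(start * fps)
        last = max(first + 1, round((start + duration) * fps))  # at least one frame
        for i in range(first, min(last, n_frames)):
            states[i] = mouth_state_for_phone(phone)
    return states
```

Because only three states exist, consecutive phones often collapse into the same state, which is part of what keeps the output visually stable.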
Face similarity is based on facial structure and scribble-like contours rather than color alone. HED guidance preserves important anime line work, head pose, and silhouette while allowing diffusion to change the mouth. The generated face is conditioned through a Qwen Image DiffSynth ControlNet workflow and sampled with Z-Image Turbo.
Each mouth state can use its own:
- prompt or external conditioning
- LoRA
- step count
- CFG value
- denoise strength
- ControlNet strength
- sampler and scheduler
This matters because a closed mouth and a fully open mouth are different generation problems. A closed mouth needs a crisp, restrained line, while an open mouth needs enough denoising freedom to create a visible interior.
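One way to picture the per-state options is a small settings record per mouth state. The class name, field names, prompts, and every numeric value below are placeholders invented for this sketch, not the node's defaults.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class MouthStateSettings:
    """Illustrative container for the options listed above."""
    prompt: str
    lora: Optional[str] = None
    steps: int = 8
    cfg: float = 1.0
    denoise: float = 0.5
    controlnet_strength: float = 0.8
    sampler: str = "euler"
    scheduler: str = "simple"


# Placeholder values: a closed mouth gets less denoising freedom and
# stronger ControlNet guidance than an open one.
STATE_SETTINGS: Dict[str, MouthStateSettings] = {
    "neutral_closed": MouthStateSettings(
        prompt="closed mouth, thin lip line",
        denoise=0.35,
        controlnet_strength=0.9,
    ),
    "half_open": MouthStateSettings(
        prompt="slightly open mouth",
        denoise=0.5,
    ),
    "fully_open": MouthStateSettings(
        prompt="wide open mouth, visible interior",
        denoise=0.65,
        controlnet_strength=0.7,
    ),
}
```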
Do Not Cache a Bad Generation
Diffusion is stochastic. Even with a strong prompt and LoRA, a requested closed mouth may appear slightly open, or no usable mouth may be generated at all. Caching that result would spread the failure across many frames.
The node therefore validates every generated face before saving it to the history cache. The YOLO model searches for a mouth, but confidence alone is not enough. Anime eyelashes and eye contours can look like closed mouths to an object detector.
The validator uses the detected nose and chin to build an anatomical mouth region. A mouth candidate must be:
- inside the detected face
- below the nose
- above the chin
- horizontally aligned with the nose and chin
If the highest-confidence candidate lies near the eyes, it is rejected. The node performs a second YOLO pass with a lower confidence threshold and a larger inference resolution to look for another candidate inside the valid region.
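The four placement rules can be sketched as a simple geometric filter over detection boxes. This is a minimal sketch that assumes image coordinates with y growing downward; the `Box` type, the `max_offset_ratio` tolerance, and the function names are hypothetical and not taken from the node's source.

```python
from typing import NamedTuple, Optional, Sequence


class Box(NamedTuple):
    """Axis-aligned detection box (hypothetical helper type)."""
    x1: float
    y1: float
    x2: float
    y2: float
    conf: float = 1.0

    @property
    def cx(self) -> float:
        return (self.x1 + self.x2) / 2

    @property
    def cy(self) -> float:
        return (self.y1 + self.y2) / 2


def is_anatomically_valid(
    mouth: Box, face: Box, nose: Box, chin: Box, max_offset_ratio: float = 0.25
) -> bool:
    """Apply the four rules: in face, below nose, above chin, aligned."""
    inside_face = face.x1 <= mouth.cx <= face.x2 and face.y1 <= mouth.cy <= face.y2
    below_nose = mouth.cy > nose.cy   # larger y means lower in the image
    above_chin = mouth.cy < chin.cy
    axis_x = (nose.cx + chin.cx) / 2  # approximate vertical face axis
    aligned = abs(mouth.cx - axis_x) <= max_offset_ratio * (face.x2 - face.x1)
    return inside_face and below_nose and above_chin and aligned


def pick_mouth(
    candidates: Sequence[Box], face: Box, nose: Box, chin: Box
) -> Optional[Box]:
    """Return the most confident candidate that passes the anatomy check."""
    valid = [c for c in candidates if is_anatomically_valid(c, face, nose, chin)]
    return max(valid, key=lambda c: c.conf, default=None)
```

With this filter, a high-confidence eyelash above the nose loses to a low-confidence line between the nose and chin, which is the behavior described above.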
If no anatomically valid mouth is found, the KSampler runs again with a new seed. The mouth_regen_attempts option controls how many additional samples may be generated.
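The retry behavior amounts to a bounded resampling loop. In the sketch below, `sample` and `validate` are hypothetical callables standing in for the KSampler call and the mouth validator, and deriving each new seed by incrementing the base seed is an assumption made for determinism in the example.

```python
from typing import Callable, Optional, Tuple, TypeVar

T = TypeVar("T")


def sample_until_valid(
    sample: Callable[[int], T],
    validate: Callable[[T], bool],
    mouth_regen_attempts: int,
    base_seed: int,
) -> Tuple[Optional[T], int]:
    """Sample once, then retry up to mouth_regen_attempts more times.

    Returns (accepted_result_or_None, number_of_samples_drawn).
    """
    total = 1 + mouth_regen_attempts
    for attempt in range(total):
        result = sample(base_seed + attempt)  # new seed on every attempt
        if validate(result):
            return result, attempt + 1
    return None, total
```

Returning `None` instead of the last failed sample keeps an invalid generation out of the history cache.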
Validating a Truly Closed Mouth
A closed-mouth detection needs stricter rules. It should be wide enough to be visible, horizontally shaped, and thin enough to resemble a lip contour rather than a filled cavity.
For neutral_closed, the validator checks:
- relative width
- relative height
- width-to-height aspect ratio
- detected area
- internal fill
- foreground thickness
- the presence of a solid inner core
The final checks analyze the pixels inside the detected mouth bounding box. A thin lip line has low internal fill and little or no eroded core. A slightly open mouth usually contains a thicker dark or reddish region. When that filled region exceeds the allowed thresholds, the candidate is rejected and diffusion samples again.
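The internal fill and inner core checks can be illustrated on a binary foreground mask of the mouth box. This is a dependency-free sketch: the 3x3 erosion, the `max_fill` and `max_core` thresholds, and the function names are assumptions for this example, and a real implementation would more likely use an image library.

```python
from typing import List

Mask = List[List[int]]  # 1 = mouth foreground pixel, 0 = background


def fill_ratio(mask: Mask) -> float:
    """Fraction of pixels in the box that are foreground."""
    total = sum(len(row) for row in mask)
    return sum(map(sum, mask)) / total if total else 0.0


def erode(mask: Mask) -> Mask:
    """3x3 binary erosion; border pixels are always cleared."""
    if not mask or not mask[0]:
        return []
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            if all(mask[y + dy][x + dx] for dy in (-1, 0, 1) for dx in (-1, 0, 1)):
                out[y][x] = 1
    return out


def looks_closed(mask: Mask, max_fill: float = 0.35, max_core: float = 0.02) -> bool:
    """A thin lip line has low fill and almost no core left after erosion."""
    return fill_ratio(mask) <= max_fill and fill_ratio(erode(mask)) <= max_core
```

A one-pixel line disappears completely under erosion, while a band three or more pixels thick leaves a solid core, so the second test catches slightly open mouths that the fill ratio alone would accept.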
After a closed mouth passes validation, a local contrast enhancement makes the lip contour easier to read. CLAHE and mild sharpening are applied only around the accepted anatomical mouth region, preserving the rest of the face.
Phase 2: Extracting the Mouth with BEN2
Once a generated face passes validation, the mouth must be isolated without carrying a rectangle of differently colored skin into the final frame.
The node calls the easy imageRemBg custom node from ComfyUI-Easy-Use with the following settings:
rem_mode = BEN2
add_background = none
refine_foreground = false
The returned RGBA image is used directly. This is important because multiplying the BEN2 alpha by additional heuristic masks can make the mouth unnecessarily transparent. BEN2 remains responsible for separating the foreground mouth from surrounding skin, while fallback masking is used only when the custom node is unavailable.
The final mouth detector also uses the nose-and-chin anatomical selector. This prevents a mismatch in which Phase 1 validates the correct mouth but Phase 2 then crops an eyelash that has a higher confidence score.
Tracking Mouth Position Relative to the Face
Absolute coordinates are not enough for video. If a character moves to the right, a cached mouth should move with the face rather than remain at the old screen position.
The mouth cache stores normalized coordinates relative to the face bounding box:
rx, ry = relative mouth position
rw, rh = relative mouth size
When a visually similar face appears at a different position or scale, the node projects those normalized values into the new face bounding box. This allows a generated mouth state to be reused while still following head movement.
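The normalization and projection steps reduce to a few lines of arithmetic. The sketch assumes boxes in `(x, y, width, height)` form; the function names are illustrative and not taken from the node.

```python
from typing import Tuple

Rect = Tuple[float, float, float, float]  # (x, y, width, height)


def to_relative(mouth: Rect, face: Rect) -> Rect:
    """Express a mouth box as (rx, ry, rw, rh) relative to the face box."""
    mx, my, mw, mh = mouth
    fx, fy, fw, fh = face
    return ((mx - fx) / fw, (my - fy) / fh, mw / fw, mh / fh)


def project(rel: Rect, face: Rect) -> Rect:
    """Place cached relative coordinates into a new face box."""
    rx, ry, rw, rh = rel
    fx, fy, fw, fh = face
    return (fx + rx * fw, fy + ry * fh, rw * fw, rh * fh)
```

For example, a mouth at `(150, 150, 50, 25)` inside a face at `(100, 50, 200, 200)` normalizes to `(0.25, 0.5, 0.25, 0.125)`. Projecting that into a smaller face at `(300, 100, 100, 100)` yields `(325, 150, 25, 12.5)`, so the mouth follows both the move and the change in scale.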
Cache epochs are updated when facial appearance or motion changes enough to make previous mouth data unsafe. The pipeline can also process mouth extraction and composition concurrently with diffusion to reduce total runtime.
Phase 3: Compositing and Video Output
The generated RGBA mouth is resized to its tracked destination and alpha-composited over the inpainted frame. Weak residual alpha values are removed to reduce halos and skin contamination. The output frame sequence is then encoded with FFmpeg and combined with the original audio.
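The compositing step can be sketched per pixel as standard "over" blending with a cutoff for weak alpha. The `min_alpha` threshold, the nested-list image layout, and the function names are assumptions for this example; resizing and FFmpeg encoding are left out.

```python
from typing import List, Tuple

RGB = Tuple[int, int, int]
RGBA = Tuple[int, int, int, int]


def composite_pixel(fg: RGBA, bg: RGB, min_alpha: int = 16) -> RGB:
    """Blend one RGBA pixel over an RGB pixel, dropping weak alpha."""
    r, g, b, a = fg
    if a < min_alpha:          # residual alpha: treat as fully transparent
        return bg
    t = a / 255
    return tuple(round(f * t + k * (1 - t)) for f, k in zip((r, g, b), bg))


def composite(
    fg: List[List[RGBA]], bg: List[List[RGB]], top: int, left: int, min_alpha: int = 16
) -> List[List[RGB]]:
    """Paste an RGBA patch onto a copy of bg at (top, left), clipping at edges."""
    out = [row[:] for row in bg]
    for y, fg_row in enumerate(fg):
        for x, px in enumerate(fg_row):
            ty, tx = top + y, left + x
            if 0 <= ty < len(out) and 0 <= tx < len(out[ty]):
                out[ty][tx] = composite_pixel(px, out[ty][tx], min_alpha)
    return out
```

Zeroing weak alpha before blending is what removes the faint skin-colored fringe that would otherwise surround the mouth as a halo.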
The node returns:
- the final video path
- VHS-compatible video information
- an optional persistent debug directory
Debugging the Pipeline
Enabling save_debug_folder copies the temporary working directory into the ComfyUI output folder before cleanup. This provides access to:
- original and inpainted frames
- face and mouth masks
- face bounding boxes
- generated diffusion faces
- generated mouth history
- BEN2 RGBA mouth caches
- final composed frames
- validation metadata
History JSON files include whether the generated mouth passed validation and a message containing confidence, anatomy, dimensions, and contour metrics. Mouth cache metadata also records the alpha source, making it possible to confirm that easy imageRemBg with BEN2 was used.
Practical Tuning
The most useful controls are:
- mouth_conf for detector sensitivity
- verify_generated_mouth to enable generation validation
- mouth_regen_attempts for diffusion retries
- sim_threshold for face grouping
- motion_variance_factor for cache invalidation
- per-state denoise and ControlNet strength
- per-state mouth padding and brightness
- compose_feather_px for final blending
Low mouth confidence may recover subtle closed lines but can increase false positives. Anatomical validation helps make lower thresholds practical. Higher regeneration counts improve the chance of receiving a valid mouth at the cost of inference time.
Why This Architecture Works for Anime
The pipeline does not assume that one neural network will always make the correct decision. It combines generative and discriminative models in a feedback loop:
- audio chooses the requested mouth state
- diffusion proposes a visual result
- YOLO checks whether the result contains a plausible mouth
- nose and chin detections verify its location
- pixel analysis verifies that a closed mouth is actually closed
- BEN2 extracts the accepted mouth
- face-relative tracking places it consistently over time
That layered approach is especially useful for stylized animation, where a small line can represent a mouth, an eyelash can look like the same line, and a few pixels can determine whether an expression reads correctly.
The custom node still needs some improvements, such as refining the shape of the mouths and stabilizing their position to prevent flickering. Any suggestions for improvement are welcome.
Repository: https://github.com/israelmarmar/anime-lipsync