Wlad Radchenko

Posted on Jun 27

Stopping the flicker when you restyle a video frame by frame

#python #computervision #deeplearning #machinelearning

Run a diffusion restyle on every frame of a clip, one frame at a time, and the still images look great. Then you play them back and the whole thing boils. Textures crawl, colors pulse, a brick wall shifts its grout lines every frame. The model did nothing wrong on any single frame. It just made a slightly different choice each time, and at 24 frames a second your eye reads those differences as flicker.

This is a walkthrough of the code that kills that flicker. The trick is to stop restyling every frame. Stylize a few frames, then carry the style to the rest by following the motion. The interesting part is the bookkeeping and the blending that make the carry invisible, so I will spend most of the post there.

Each frame is a separate sample from the model, so each frame lands in a slightly different place. Played in sequence, those small differences become flicker. Photo: Unsplash.

Why frame-by-frame flickers

A diffusion model starts from noise and walks toward an image. Two frames of a video that look almost identical to you are still two different starting points and two different walks. The model has no memory of what it drew last frame, so it picks a fresh interpretation of "oil painting" or "anime" each time. On a still you never notice. In motion you see the model changing its mind 24 times a second.

You can lower the denoise strength so the model stays close to the input, but then you barely restyle anything. You can feed the previous frame back in, which helps a little and drifts a lot. The cleaner answer is structural: restyle only a sparse set of frames, and fill the gaps by warping a real stylized frame into place. A warped frame cannot disagree with itself between neighbors, because it is the same pixels pushed along the motion. This is the idea behind Rerender A Video (Yang et al., SIGGRAPH Asia 2023) and behind EbSynth (Jamriška et al., ACM ToG 2019), and it is what the code below implements.

Step 1. Pick the keyframes

Keyframes come from scene detection plus a fixed interval. Inside the scene detector:

if interval:
    scene_frames.extend(range(start_frame, end_frame, interval))
    scene_frames.append(end_frame - 1)

The default interval is 10. So inside each detected scene you take every tenth frame, plus the last frame of the scene, as keyframes. Those are the only frames the diffusion model ever touches. Everything between two keyframes is going to be synthesized by warping, not by the model. Pick the interval too large and motion outruns the warp; pick it too small and you pay for diffusion you did not need. Ten is a reasonable middle for most footage.
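Here is a minimal sketch of that selection rule, assuming scenes arrive as (start_frame, end_frame) pairs with an exclusive end. pick_keyframes is a hypothetical helper, not the repo's function. It collects into a set because the scene's last frame can coincide with an interval hit, and you do not want the same keyframe twice.

```python
def pick_keyframes(scenes, interval=10):
    """Return sorted, de-duplicated keyframe indices.

    scenes: list of (start_frame, end_frame) pairs, end exclusive (assumed).
    """
    keyframes = set()
    for start_frame, end_frame in scenes:
        if end_frame <= start_frame:
            continue  # skip empty scenes
        keyframes.update(range(start_frame, end_frame, interval))
        keyframes.add(end_frame - 1)  # always include the scene's last frame
    return sorted(keyframes)


keys = pick_keyframes([(0, 25), (25, 46)], interval=10)
print(keys)  # [0, 10, 20, 24, 25, 35, 45]
```

In the second scene the interval lands exactly on frame 45, which is also the scene's last frame, so the set keeps it once.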

Step 2. The bookkeeping that nobody talks about

Once you commit to "stylize keyframes, propagate the rest," you inherit a filing problem. For every gap between keyframe i and keyframe i+1 you need the right input frames, the right output paths, the right optical-flow files, and the right guide images, in both the forward and backward direction. Get one index off and a frame lands in the wrong folder. VideoSequence in video_sequence.py is the class that does this filing.

It is constructed with the list of keyframes (here called frame_files_with_interval) and it makes one output subdirectory per keyframe:

self.__frame_files_with_interval = [f for f in frame_files_with_interval if ".png" in f]
self.__n_seq = len(self.__frame_files_with_interval)
# ...
out_subdir = self.__get_out_subdir(frame_file)   # out_<keyframe-name>

The core method is get_input_sequence. Given a gap index i, it returns the list of input frame paths in that gap:

def get_input_sequence(self, i, is_forward=True):
    if i + 1 > len(self.__frame_files_with_interval) - 1:
        # last gap: run from the final keyframe to the true last frame of the video
        last_input_frame = self.__input_frames[-1]
        last_interval_frame = self.__frame_files_with_interval[i]
        if last_input_frame == last_interval_frame:
            return None
        else:
            beg_id = int(last_interval_frame.split(".")[0])
            end_id = int(last_input_frame.split(".")[0])
    else:
        beg_id = self.get_sequence_beg_id(i)
        end_id = self.get_sequence_beg_id(i + 1)
    if is_forward:
        id_list = list(range(beg_id, end_id))
    else:
        id_list = list(range(end_id, beg_id, -1))
    return [os.path.join(self.__input_dir, self.__input_format % id)
            for id in id_list if self.__input_format % id in self.__input_frames]

Two things to notice. First, beg_id and end_id come straight from the keyframe filenames, which are named by frame number (%04d.jpg). The filename is the index. That is why the whole class can do its math on int(name.split(".")[0]) instead of carrying a separate table. Second, the is_forward flag reverses the range. The pipeline propagates style from the left keyframe rightward, and from the right keyframe leftward, then meets in the middle. The backward pass needs the same frames in reverse, and this one flag gives both.
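The two points above fit in a few lines. This is a sketch with hypothetical helpers (frame_id, gap_ids), assuming filenames like 0010.jpg; it mirrors the range logic of get_input_sequence without the directory handling.

```python
def frame_id(filename):
    """Parse the frame number out of a name like '0010.jpg'."""
    return int(filename.split(".")[0])


def gap_ids(beg_file, end_file, is_forward=True):
    """Frame ids for one gap, in the order the propagation visits them."""
    beg_id, end_id = frame_id(beg_file), frame_id(end_file)
    if is_forward:
        return list(range(beg_id, end_id))   # left keyframe up to end - 1
    return list(range(end_id, beg_id, -1))   # right keyframe down to beg + 1


forward = gap_ids("0010.jpg", "0014.jpg")
backward = gap_ids("0010.jpg", "0014.jpg", is_forward=False)
print(forward)   # [10, 11, 12, 13]
print(backward)  # [14, 13, 12, 11]
```

Each direction starts on its own keyframe and stops one short of the other, so the two lists have the same length.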

The same shape repeats for every artifact the propagation needs:

get_output_sequence builds the destination paths inside the keyframe's out_ folder.
get_flow_sequence builds flow_f_%04d.npy for forward and flow_b_%04d.npy for backward.
get_edge_sequence, get_temporal_sequence, get_pos_sequence build the per-frame guide paths in the keyframe's tmp folder.

One detail worth flagging: the flow lists are one element shorter than the frame lists. There are N frames but only N-1 motions between them:

if is_forward:
    id_list = list(range(beg_id, end_id - 1))      # forward flows: N-1
else:
    id_list = list(range(end_id, beg_id + 1, -1))  # backward flows: N-1

If you ever zip flows against frames and get an off-by-one, this is where it comes from. The class is built so the flow list and the warp loop line up.
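A small sketch of how the two lists pair up. flow_ids copies the ranges above; warp_steps is a hypothetical helper, and the idea that flow file k describes the motion from frame k to frame k + 1 is my assumption for illustration, not something the excerpt states.

```python
def flow_ids(beg_id, end_id, is_forward=True):
    """Ids of the flow files for one gap; one fewer than the frames."""
    if is_forward:
        return list(range(beg_id, end_id - 1))
    return list(range(end_id, beg_id + 1, -1))


def warp_steps(frame_ids, flows):
    """Pair each flow with the (source, target) frames it connects."""
    if len(flows) != len(frame_ids) - 1:
        raise ValueError("expected exactly one flow per consecutive frame pair")
    return list(zip(frame_ids, frame_ids[1:], flows))


frames = list(range(10, 14))                                # 4 frames
flows = ["flow_f_%04d.npy" % i for i in flow_ids(10, 14)]   # 3 flows
steps = warp_steps(frames, flows)
for src, dst, flow in steps:
    print(src, dst, flow)
```

Zipping frames against frames[1:] makes the N-1 explicit, and the length check turns a silent off-by-one into an error.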

The last gap is the awkward one. The final keyframe is rarely the literal last frame of the video, so the code special-cases it: when i+1 runs past the keyframe list, it uses self.__input_frames[-1], the true last frame, as the end of the gap. Without that branch the tail of every clip would go unstyled.

Step 3. The guides that steer the propagation

Propagation here is done EbSynth-style: you give EbSynth a source stylized image and a set of guide channels, and it synthesizes the target frame so it matches the style of the source while respecting the guides. The guides live in guide.py. Each one answers a different question for the synthesizer.

The positional guide answers "where did this pixel come from?" It starts from a synthetic image where each pixel encodes its own coordinate as color, then warps that image along the optical flow, frame after frame:

@staticmethod
def __generate_first_img(H, W):
    Hs = np.linspace(0, 1, H)
    Ws = np.linspace(0, 1, W)
    i, j = np.meshgrid(Hs, Ws, indexing='ij')
    r = (i * 255).astype(np.uint8)   # row -> red
    g = (j * 255).astype(np.uint8)   # col -> green
    b = np.zeros(r.shape)
    return np.stack((b, g, r), 2)

Red is the row, green is the column. After you warp this map by the flow, a pixel's color tells you which original pixel ended up there. That is a dense, smooth correspondence field, which is exactly what a synthesizer wants so it does not invent new texture in moving regions.
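To see the decoding work, here is a NumPy-only sketch. coord_map and decode are hypothetical names, and np.roll stands in for a real flow warp (it wraps around at the border, which a real warp does not). The sketch keeps the map in floats so the decode is exact; the 8-bit version above quantizes each axis to 256 levels.

```python
import numpy as np


def coord_map(H, W):
    """Float map: channel 0 holds the row, channel 1 the column, both in [0, 1]."""
    rows, cols = np.meshgrid(np.linspace(0, 1, H), np.linspace(0, 1, W), indexing="ij")
    return np.stack((rows, cols), axis=2)


def decode(pos, H, W):
    """Recover integer (row, col) source coordinates from a coordinate map."""
    r = np.rint(pos[..., 0] * (H - 1)).astype(int)
    c = np.rint(pos[..., 1] * (W - 1)).astype(int)
    return r, c


H, W = 4, 6
pos = coord_map(H, W)
# Stand-in for a flow warp: the whole image shifts two pixels to the right.
warped = np.roll(pos, shift=2, axis=1)
src_r, src_c = decode(warped, H, W)
print(src_r[3, 5], src_c[3, 5])  # the pixel now at (3, 5) came from (3, 3)
```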

The temporal guide answers "what did the previous stylized frame look like, moved to here?" It takes the previous stylized frame and warps it forward by the flow:

def get_cmd(self, i, weight) -> str:
    if i == 0:
        warped_img = self.stylized_imgs[0]
    else:
        prev_img = cv2.imread(self.stylized_imgs[i - 1])
        warped_img = self.flow_calc.warp(prev_img, self.flows[i - 1], 'nearest').astype(np.uint8)
        warped_img = cv2.inpaint(warped_img, self.masks[i - 1], 30, cv2.INPAINT_TELEA)
        cv2.imwrite(self.imgs[i], warped_img)
    return super().get_cmd(i, weight)

This is the anti-flicker guide. It pushes the synthesizer to make frame i look like frame i-1 carried along the motion, so the style stays put on a surface as it moves instead of re-rolling every frame.

Both warps leave holes. Where motion uncovers a region the camera could not see last frame, the warp has no data, and the optical-flow mask marks those pixels. The fix is the same in both guides:

cur_img = cv2.inpaint(cur_img, mask, 30, cv2.INPAINT_TELEA)

cv2.INPAINT_TELEA fills the disoccluded holes from their surroundings so the guide has no black gaps. A radius of 30 pixels is generous, which suits the smooth guide maps; you do not need sharp inpainting here, just plausible filler.
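To make the hole problem concrete, here is a NumPy-only sketch of a nearest-neighbor backward warp. warp_nearest is a hypothetical stand-in for flow_calc.warp, and it only flags samples that fall outside the frame; in the pipeline the mask comes from the optical-flow step. The returned mask is the kind of 8-bit, nonzero-means-fill image you would hand to cv2.inpaint.

```python
import numpy as np


def warp_nearest(img, flow):
    """Backward-warp img: out[y, x] = img[y + flow[y, x, 1], x + flow[y, x, 0]].

    Returns the warped image and a uint8 mask that is 255 where the
    source location fell outside the frame (no data to copy).
    """
    H, W = img.shape[:2]
    ys, xs = np.mgrid[0:H, 0:W]
    src_x = np.rint(xs + flow[..., 0]).astype(int)
    src_y = np.rint(ys + flow[..., 1]).astype(int)
    inside = (src_x >= 0) & (src_x < W) & (src_y >= 0) & (src_y < H)
    out = np.zeros_like(img)
    out[inside] = img[src_y[inside], src_x[inside]]
    holes = np.where(inside, 0, 255).astype(np.uint8)
    return out, holes


prev = np.arange(12, dtype=np.uint8).reshape(3, 4)
flow = np.zeros((3, 4, 2), dtype=np.float32)
flow[..., 0] = -1.0  # every output pixel samples one pixel to its left
warped, holes = warp_nearest(prev, flow)
print(warped)
print(holes)
```

The first column has nothing to sample from, so it stays zero and the mask marks it. Those are the pixels the inpaint step has to fill.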

The other two guides are simpler. The edge guide runs a Laplacian-style filter so the synthesizer keeps structure aligned to the input:

filter = np.array([[0, -1, 0], [-1, 4, -1], [0, -1, 0]])
res = cv2.filter2D(img, -1, filter)

And the color guide is just the raw frames, so the synthesizer has the original colors to refer to. Each guide carries a -weight, so you can dial how strongly the synthesizer listens to motion versus structure versus color.
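If you want to see what the edge kernel responds to, this NumPy-only sketch applies the same 3x3 kernel by hand. edge_response is a hypothetical helper; it works in floats and pads with reflection, which I believe matches OpenCV's default border mode. One difference to keep in mind: the sketch keeps negative responses, while cv2.filter2D with ddepth -1 on an 8-bit image saturates them to 0.

```python
import numpy as np

KERNEL = np.array([[0, -1, 0], [-1, 4, -1], [0, -1, 0]], dtype=np.float32)


def edge_response(gray):
    """Apply the 3x3 kernel to a 2-D image with reflected borders."""
    padded = np.pad(gray.astype(np.float32), 1, mode="reflect")
    H, W = gray.shape
    out = np.zeros((H, W), dtype=np.float32)
    for dy in range(3):
        for dx in range(3):
            out += KERNEL[dy, dx] * padded[dy:dy + H, dx:dx + W]
    return out


flat = np.full((5, 5), 7.0)
step = np.zeros((5, 6))
step[:, 3:] = 10.0  # vertical edge between columns 2 and 3

print(edge_response(flat))      # all zeros: the kernel sums to 0
print(edge_response(step)[2])   # nonzero only on the two columns touching the edge
```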

Step 4. Where forward and backward meet, blend the colors

Now you have two stylized versions of every in-between frame: one propagated forward from the left keyframe, one propagated backward from the right keyframe. They agree on geometry but rarely on color, because each picked up the tint of a different keyframe along the way. Stack them naively and you get a visible color step. histogram_blend.py fixes the color before the seam gets stitched.

It works in Lab color space, which separates lightness from color so you can match tone without muddying brightness:

a = cv2.cvtColor(a, cv2.COLOR_BGR2Lab)
b = cv2.cvtColor(b, cv2.COLOR_BGR2Lab)
# normalize each to a common mean/std
t_mean_val = 0.5 * 256
t_std_val = (1 / 36) * 256
a = histogram_transform(a, a_mean, a_std, t_mean, t_std)
b = histogram_transform(b, b_mean, b_std, t_mean, t_std)
# average them, then re-key to the reference frame's statistics
ab = (a * weight1 + b * weight2 - t_mean_val) / 0.5 + t_mean_val
ab = histogram_transform(ab, ab_mean, ab_std, min_error_mean, min_error_std)

The shape is: push both images to the same neutral mean and standard deviation, average them, then push the average to the statistics of min_error, the frame the pipeline trusts most for this position. The division by 0.5 in the averaging line doubles each pixel's distance from the neutral mean, stretching the average back out, since averaging two different images shrinks contrast. The two odd constants are a target mean of 0.5 * 256 (mid-gray) and a target std of (1/36) * 256. They are just a stable common ground to average in; the final re-keying is what makes the result match a real frame rather than a washed-out midpoint. This is per-channel mean/std transfer, the same idea as classic Reinhard color transfer, done twice.
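The same three moves as a NumPy-only sketch. The helper names are hypothetical, and the inputs are assumed to be float H x W x 3 arrays already in Lab, so the color conversion, rounding, and clipping are left out. It also skips the / 0.5 stretch, because in exact arithmetic the final re-keying normalizes scale anyway.

```python
import numpy as np


def channel_stats(img):
    """Per-channel mean and std over all pixels of an H x W x 3 image."""
    return img.mean(axis=(0, 1)), img.std(axis=(0, 1))


def match_stats(img, target_mean, target_std, eps=1e-6):
    """Shift and scale each channel so it has the target mean and std."""
    mean, std = channel_stats(img)
    return (img - mean) * (target_std / (std + eps)) + target_mean


def blend_to_reference(a, b, ref, w1=0.5, w2=0.5):
    """Neutralize a and b, average them, then re-key to ref's statistics."""
    neutral_mean = np.full(3, 0.5 * 256)
    neutral_std = np.full(3, 256 / 36)
    a_n = match_stats(a, neutral_mean, neutral_std)
    b_n = match_stats(b, neutral_mean, neutral_std)
    ab = a_n * w1 + b_n * w2
    ref_mean, ref_std = channel_stats(ref)
    return match_stats(ab, ref_mean, ref_std)


rng = np.random.default_rng(0)
a = rng.uniform(0, 255, (8, 8, 3))
b = rng.uniform(0, 255, (8, 8, 3))
ref = rng.uniform(50, 200, (8, 8, 3))
out = blend_to_reference(a, b, ref)
print(channel_stats(out))
print(channel_stats(ref))
```

The two printed pairs should agree closely: whatever tint a and b carried, the blend ends up with the reference frame's per-channel mean and spread.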

Step 5. Hide the seam with Poisson fusion

Color matching makes the two halves agree on average. It does not hide the actual boundary where forward meets backward. For that the pipeline pastes in the gradient domain, the same Poisson Image Editing idea from Pérez, Gangnet and Blake (SIGGRAPH 2003) that photo tools use to drop an object into a new background without a halo.

The principle: do not copy pixels, copy differences between pixels. Build a target gradient field by taking gradients from image 1 outside the mask and image 2 inside it, then solve for the image whose gradients match that field. Seams disappear because you never enforce an absolute pixel value at the boundary, only the slope across it.

def poisson_fusion(blendI, I1, I2, mask, grad_weight=[2.5, 0.5, 0.5]):
    Iab = cv2.cvtColor(blendI, cv2.COLOR_BGR2LAB).astype(float)
    Ia  = cv2.cvtColor(I1, cv2.COLOR_BGR2LAB).astype(float)
    Ib  = cv2.cvtColor(I2, cv2.COLOR_BGR2LAB).astype(float)
    m = (mask > 0).astype(float)[:, :, np.newaxis]

    # gx, gy are arrays shaped like Ia, allocated before this point (elided in this excerpt)
    # gradient from I1 outside the mask, from I2 inside it
    gx[:-1] = (Ia[:-1] - Ia[1:]) * (1 - m[:-1]) + (Ib[:-1] - Ib[1:]) * m[:-1]
    gy[:, :-1] = (Ia[:, :-1] - Ia[:, 1:]) * (1 - m[:, :-1]) + (Ib[:, :-1] - Ib[:, 1:]) * m[:, :-1]

Then for each channel it solves a least-squares system Ax = b where A stacks the gradient operators and an identity term, and b stacks the target gradients and the original intensities:

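A = As[i]                                            # cached operator for channel i
b = np.vstack([im_dx * weight, im_dy * weight, im])  # weighted target gradients, then intensities
out = scipy.sparse.linalg.lsqr(A, b)                 # lsqr returns a tuple; out[0] is the solution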
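The idea is easier to see in one dimension. This is a sketch, not the repo's solver: poisson_fuse_1d is a hypothetical function, it uses a small dense matrix with np.linalg.lstsq instead of a sparse matrix with lsqr, and it anchors to a naive paste so the seam is easy to measure.

```python
import numpy as np


def poisson_fuse_1d(blend, sig_a, sig_b, mask, grad_weight=2.5):
    """Least-squares fusion of two 1-D signals in the gradient domain.

    Gradients come from sig_a where mask == 0 and from sig_b where mask == 1;
    blend supplies the soft intensity anchor (the identity rows).
    """
    n = len(blend)
    m = mask[:-1].astype(float)
    grad_a = sig_a[:-1] - sig_a[1:]
    grad_b = sig_b[:-1] - sig_b[1:]
    target_grad = grad_a * (1 - m) + grad_b * m

    # D @ x gives x[i] - x[i + 1] for i = 0 .. n - 2
    D = np.eye(n)[:-1] - np.eye(n, k=1)[:-1]
    A = np.vstack([grad_weight * D, np.eye(n)])
    rhs = np.concatenate([grad_weight * target_grad, blend])
    x, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return x


n = 40
mask = (np.arange(n) >= n // 2).astype(int)
sig_a = np.full(n, 100.0)                    # flat, darker
sig_b = np.full(n, 130.0)                    # flat, brighter
pasted = np.where(mask == 1, sig_b, sig_a)   # naive paste: hard step at the seam
fused = poisson_fuse_1d(pasted, sig_a, sig_b, mask)

print(np.abs(np.diff(pasted)).max())  # 30.0
print(np.abs(np.diff(fused)).max())   # noticeably smaller than the pasted step
```

Both sources are flat, so the target gradient is zero everywhere, including across the seam. The solver spreads the 30-level step over neighboring samples instead of leaving it as one hard edge, and a larger grad_weight spreads it further.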

Two things make this practical. First, grad_weight=[2.5, 0.5, 0.5] weights the L (lightness) channel five times harder than the two color channels. Lightness carries the structure your eye locks onto, so the solver is told to preserve lightness gradients tightly and let color relax. Second, the big sparse matrix A depends only on image size and weights, not on pixel values, so it is built once and cached:

crt_states = (h, w, grad_weight)
if As is None or crt_states != prev_states:
    As = construct_A(*crt_states)
    prev_states = crt_states

Building A walks every pixel to wire up the gradient operators, which is slow. For a video you run poisson_fusion on hundreds of frames at the same resolution, so caching it across calls turns a per-frame cost into a one-time cost. That global cache is the difference between a restyle that finishes and one you abandon.
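The caching pattern itself is small. This sketch uses functools.lru_cache instead of the module-level globals above, and construct_ops is a hypothetical dense stand-in for construct_A, assuming a row-major flattened image and the same "this pixel minus the next" gradient convention as the excerpt.

```python
from functools import lru_cache

import numpy as np

build_count = 0


def diff_matrix(n):
    """n x n forward difference; the last row is zero (no next neighbor)."""
    D = np.eye(n) - np.eye(n, k=1)
    D[-1] = 0
    return D


@lru_cache(maxsize=4)
def construct_ops(h, w, grad_weight):
    """Build one stacked operator per channel weight (expensive, so cached)."""
    global build_count
    build_count += 1
    Dx = np.kron(diff_matrix(h), np.eye(w))   # differences between rows
    Dy = np.kron(np.eye(h), diff_matrix(w))   # differences between columns
    identity = np.eye(h * w)
    return tuple(np.vstack([g * Dx, g * Dy, identity]) for g in grad_weight)


def get_operators(h, w, grad_weight=(2.5, 0.5, 0.5)):
    # lru_cache needs hashable arguments, so a weight list becomes a tuple.
    return construct_ops(h, w, tuple(grad_weight))


for _ in range(100):       # 100 "frames" at the same resolution
    ops = get_operators(4, 6)
print(build_count)         # 1
get_operators(4, 8)        # a new resolution triggers one more build
print(build_count)         # 2
```

The cache key is exactly what the matrix depends on: height, width, and weights. Pixel values never enter it, which is why one build can serve every frame of the clip.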

About the author. I'm Wlad Radchenko, a software engineer. The code in this article comes from Wunjo Make, open-source local software for video makers. I also build Wunjo Design, an offline PWA for designers. To get in touch or find out more, look for me on GitHub and LinkedIn.

Gotchas you will actually hit

The interval is the dial that matters most. With fast motion or a moving camera, optical flow gets unreliable and a wide interval lets the warp smear. Drop the interval so keyframes are closer together and the propagation has less work to do per gap.

The flow list is N-1, not N. If you write your own warp loop against get_flow_sequence, remember there is one fewer flow than frame, and the temporal guide already accounts for it by special-casing i == 0.

Inpaint radius is forgiving here. The TELEA radius of 30 looks large, but it fills guide maps, not final pixels, so a soft fill is fine. The real frame quality comes from the synthesizer, the histogram match, and the Poisson solve downstream.

Watch the last gap. The final keyframe is almost never the last frame of the clip. The i + 1 > len(...) branch in every VideoSequence method exists to run that tail against the real last frame. If you reimplement the bookkeeping and skip it, your output will be a few frames short and the cut will be obvious.

Wrap-up

Per-frame restyle flickers because the model re-decides the look on every frame. The fix is to decide rarely and propagate. Stylize sparse keyframes, warp them across the gaps with EbSynth-style positional and temporal guides, match colors in Lab with a double histogram transfer, and stitch the forward and backward halves with a gradient-domain Poisson solve that weights lightness heavily and caches its matrix. None of the temporal-coherence work is diffusion. It is flow, inpainting, and two classic blends, sequenced carefully.

The code is in visual_generation/restyle/blender/ in the Wunjo Make repo: video_sequence.py for the bookkeeping, guide.py for the guides, histogram_blend.py and poisson_fusion.py for the blends. If your own restyle boils, start by cutting the per-frame diffusion down to keyframes.

References

Yang, Zhou, Liu, Loy. "Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation." SIGGRAPH Asia 2023. arXiv:2306.07954
Jamriška et al. "Stylizing Video by Example." (EbSynth.) ACM ToG 2019.
Xu, Zhang, Cai, Rezatofighi, Tao. "GMFlow: Learning Optical Flow via Global Matching." CVPR 2022. arXiv:2111.13680
Zhang, Rao, Agrawala. "Adding Conditional Control to Text-to-Image Diffusion Models." (ControlNet.) ICCV 2023. arXiv:2302.05543
Pérez, Gangnet, Blake. "Poisson Image Editing." ACM SIGGRAPH 2003.

DEV Community