Latency remains the main obstacle to using diffusion models in interactive editing tools. HiLo‑Token reports that input‑adaptive high‑low frequency token compression can roughly halve that latency while keeping generation quality unchanged, with up to a 3.13× DiT speedup on small‑mask edits [1].
In current pipelines the Diffusion Transformer (DiT) consumes the bulk of compute, accounting for about 73 % of total model latency even after distillation from 50 timesteps down to 8. Earlier optimizations focused on timestep reduction and left the per‑token cost untouched.
The authors allocate full‑resolution tokens to a dilated edit mask, prune low‑frequency tokens outside it, and supplement the remainder with a 16× downsampled representation. Average token ratios drop to 6.38 % (small masks), 15.92 % (medium) and 35.36 % (large), corresponding to DiT speedups of 3.13×, 2.59× and 1.67× respectively [1].
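The token budget described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the square dilation kernel, the `radius` parameter, and the reading of "16×" as a 16‑fold reduction in token count for the region outside the mask are all assumptions, and the frequency‑based pruning step is left out entirely.

```python
import math
import numpy as np

def dilate(mask: np.ndarray, radius: int) -> np.ndarray:
    """Binary dilation with a (2*radius+1) x (2*radius+1) square kernel."""
    h, w = mask.shape
    padded = np.pad(mask.astype(bool), radius, mode="constant", constant_values=False)
    out = np.zeros((h, w), dtype=bool)
    # OR together every shifted copy of the mask within the kernel window.
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

def token_budget(mask: np.ndarray, radius: int = 1, coarse_factor: int = 16):
    """Count full-resolution and coarse tokens for a token-grid edit mask.

    Assumption: tokens inside the dilated mask stay at full resolution, and
    the remaining area is covered by 1/coarse_factor as many coarse tokens.
    """
    full = dilate(mask, radius)
    n_total = mask.size
    n_full = int(full.sum())
    n_coarse = math.ceil((n_total - n_full) / coarse_factor)
    ratio = (n_full + n_coarse) / n_total
    return n_full, n_coarse, ratio

# Hypothetical example: a 64x64 token grid with a 4x4 edit region.
grid = np.zeros((64, 64), dtype=bool)
grid[30:34, 30:34] = True
n_full, n_coarse, ratio = token_budget(grid)
print(f"full={n_full} coarse={n_coarse} ratio={ratio:.2%}")
```

Under these assumptions the retained token ratio grows with the mask area, which is the behaviour the reported small, medium and large figures reflect.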
The reported gains diminish as the mask grows, falling to a 1.67× speedup for large masks, where about 35 % of tokens are retained: the achievable compression, and with it the runtime saving, depends on how much of the image the edit touches. Moreover, the evaluation is limited to mask‑guided editing; how the same token budget strategy fares on unrestricted image‑to‑image or text‑to‑image diffusion remains an open question.
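To see what the DiT speedups mean for whole‑model latency, Amdahl's law gives a rough estimate. The sketch below assumes that the 73 % DiT share applies to the uncompressed 8‑step baseline and that the remaining 27 % of latency is unaffected by token compression; neither assumption is confirmed by the source, and the function name is illustrative.

```python
def end_to_end_speedup(dit_share: float, dit_speedup: float) -> float:
    """Amdahl's law: only the DiT share of latency is accelerated."""
    return 1.0 / ((1.0 - dit_share) + dit_share / dit_speedup)

DIT_SHARE = 0.73  # DiT fraction of total model latency, from the text
DIT_SPEEDUPS = {"small": 3.13, "medium": 2.59, "large": 1.67}  # reported in [1]

for name, s in DIT_SPEEDUPS.items():
    print(f"{name:>6}: {end_to_end_speedup(DIT_SHARE, s):.2f}x end to end")
```

Under these assumptions the estimate is about 1.99× for small masks, 1.81× for medium and 1.41× for large, so the roughly halved latency corresponds to the small‑mask case.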
Deployments that power Photoshop’s Remove feature can cut their AWS p5.48xlarge node count by a third thanks to the reduced token budget. Practitioners should therefore consider adaptive token allocation as a default optimization layer when building mask‑guided, diffusion‑based vision editors for constrained hardware.