DEV Community: Kumar K Jha

A 1.125× FLUX speedup was real. The harder question is where diffusion gives back time.

Kumar K Jha — Sun, 19 Jul 2026 20:08:58 +0000

In the previous post, I benchmarked training-free block-residual caching for 4-bit FLUX.1-dev on an Apple M5 Max.

The short version: caching helped, but it did not give me the clean win I wanted.

The longer version was more interesting. A fixed-interval cache produced real speedups, then failed a same-seed PSNR gate so badly that I almost wrote the wrong conclusion. The images were not necessarily bad. They had drifted from the uncached denoising trajectory. That sent me down the rabbit hole of metric floors, windowed SSIM, LPIPS, CLIP scoring, prompt stratification, and interleaved timing.

That work is here:

github.com/kkjcodes/m5-flux-block-cache-benchmark

This post is the handoff from that experiment. Not a victory lap. Not a promise that I have the next speedup solved. More like: here is where block caching stopped being the most interesting question.

Where the cache experiment landed

The best policy I found was a warmup split:

joint blocks 12-18  -> start_step=8
single blocks 28-37 -> start_step=6
interval            -> 2
value mode          -> residual

In the code, start_step is the warmup boundary. That boundary step is still a refresh point; reuse is allowed after it on interval misses.
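A minimal sketch of how a warmup-delayed schedule like this could be expressed. The class name `WarmupPolicy`, the 28-step count (carried over from the previous post's config), and anchoring the interval phase at `start_step` are my assumptions, not the repo's actual code. For the even `start_step` values above, anchoring at step 0 would select the same steps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WarmupPolicy:
    """Hypothetical policy: no reuse until after the warmup boundary."""
    start_step: int    # warmup boundary; this step itself is still recomputed
    interval: int = 2  # refresh every `interval` steps after the boundary

    def should_reuse(self, step_index: int) -> bool:
        # Warmup steps and the boundary step always recompute.
        if step_index <= self.start_step:
            return False
        # After the boundary, reuse on the interval misses only.
        return (step_index - self.start_step) % self.interval != 0


NUM_STEPS = 28  # assumed, taken from the earlier post's configuration
joint = WarmupPolicy(start_step=8)    # joint blocks 12-18
single = WarmupPolicy(start_step=6)   # single blocks 28-37

joint_reuse = [s for s in range(NUM_STEPS) if joint.should_reuse(s)]
single_reuse = [s for s in range(NUM_STEPS) if single.should_reuse(s)]
print("joint reuse steps: ", joint_reuse)
print("single reuse steps:", single_reuse)
```

Under these assumptions the joint span reuses on 10 of 28 steps and the single span on 11, against 14 of 28 when reuse starts at step 1.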

That policy passed an interleaved timing gate at 1.125× and scored:

Metric	Result
Median PSNR	29.06
Median windowed SSIM	0.94575
Median LPIPS Alex	0.05519
Median CLIP delta	0.00181

That is a real result. It is also a bounded result.

The cache is not pixel-identical. It still fails a strict same-seed trajectory gate. And the improvement is content-dependent: portraits, product renders, and typography are close to the uncached references, while dense interiors remain the hard case.

Here is the prompt-stratified picture:

Prompt class	PSNR	Windowed SSIM	LPIPS Alex
dense_interior	22.62	0.83994	0.12326
portrait	29.90	0.95402	0.04808
product_render	29.10	0.96124	0.03791
typography	31.80	0.94660	0.05070

That table is the actual frontier. Not "cache works." Not "cache fails." More like: warmup-delayed residual reuse is near-fidelity on some content classes, but dense interiors expose the weakness.

Here is one of the easier cases. Same prompt, same seed, uncached reference on the left and warmup-split cached variant on the right:

Uncached reference	Warmup-split cached variant

And here is the kind of case that kept the result honest. Dense interiors were still hard, even after the residual gate. The two cached versions are near-indistinguishable from each other, which is the point: the gate fired, but it did not change this failure mode enough.

Uncached reference	Warmup split	Residual-gated m2

The adaptive gate I thought would help

The obvious next move was residual-volatility-gated reuse.

The idea was simple enough: keep the warmup split schedule, but when a block's residual changes too much relative to its recent history, veto reuse and recompute the block.
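To make the idea concrete, here is a rough sketch of one way a volatility veto could work. Everything in it is hypothetical: the `VolatilityGate` class, the relative-change signal, the history length, and the multiplier are illustrations of the general shape, not the signal or thresholds used in the repo.

```python
import math
from collections import deque
from statistics import fmean


def relative_change(previous, current, eps=1e-8):
    """L2 distance between two residuals, scaled by the older one's norm."""
    diff = math.sqrt(sum((c - p) ** 2 for p, c in zip(previous, current)))
    scale = math.sqrt(sum(p * p for p in previous))
    return diff / (scale + eps)


class VolatilityGate:
    """Hypothetical veto layer that sits on top of a fixed reuse schedule."""

    def __init__(self, multiplier, history=4):
        self.multiplier = multiplier
        self.history = deque(maxlen=history)  # changes seen at earlier refreshes
        self.latest = None                    # change seen at the last refresh

    def record_refresh(self, change):
        # Called whenever the block is recomputed and a new residual is stored.
        if self.latest is not None:
            self.history.append(self.latest)
        self.latest = change

    def allows_reuse(self):
        # With no history there is nothing to compare against: recompute.
        if self.latest is None or not self.history:
            return False
        # Veto when the latest change spikes relative to recent history.
        return self.latest <= self.multiplier * fmean(self.history)


gate = VolatilityGate(multiplier=2.0)
for change in (0.10, 0.11, 0.10):
    gate.record_refresh(change)
calm = gate.allows_reuse()    # latest change is in line with history
gate.record_refresh(0.50)     # residual suddenly moves a lot
spiky = gate.allows_reuse()   # gate vetoes the scheduled reuse
print(calm, spiky)
```

Note the structural limit this makes visible: the gate can only turn a scheduled reuse into a recompute. It never adds information about which content is fragile.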

I pre-registered the target in plain English: lift dense-interior PSNR from 22.62 toward roughly 29, without dropping below the split policy's timing class.

That did not happen.

On the matched interleaved timing comparison, the split re-measured at 1.121× in the same run as the gated policy. The gated threshold measured 1.093×. So the gate gave back speed.

Quality moved only a little:

Policy	Dense PSNR	Dense LPIPS
split warmup	22.62	0.1233
gated m2	22.82	0.1146
gated m3	22.78	0.1166

That is not close. It needed roughly 6 dB of dense-interior improvement. It found about 0.2 dB.

Then I checked whether the gate was even firing. It was. But it did not fire more on dense interiors than on easier prompt classes. Product renders and portraits often fired at similar or higher rates.

So the conclusion is narrow but useful:

This residual-volatility signal, at these thresholds and block spans, did not target the dense-interior failure mode.

That is not the same as "adaptive reuse can never work." It is just enough evidence to stop spending time on this particular gate.

What this changed about the next question

At the start, I thought the path was probably:

find cacheable blocks -> tune schedule -> add adaptive gate -> get speed

After the experiments, that feels too small.

A veto layer can only block reuse decisions that a fixed schedule already admitted. If the failure is created by the schedule itself, then a gate on top can only claw back part of the damage. That may be exactly what happened here.

Dense interiors may not need "reuse unless volatility spikes." They may need a different schedule, a different approximation surface, or a different way of spending compute across the denoising trajectory.

Which leads to the question I actually want to ask next:

Can we make diffusion faster without treating transformer blocks as the only reusable unit?

I do not have the answer yet. I am not claiming the next project will beat the warmup split. I am saying the cache benchmark changed what I think is worth investigating.

Why block caching may be the wrong level

Block-residual caching is attractive because it is mechanical. You can hook transformer blocks, store deltas, and measure wall time. It does not require training. It does not require changing the model. It gives you a clean benchmark surface.

But diffusion is not just a stack of interchangeable block calls. The denoising trajectory has phases. Early steps decide composition. Later steps refine detail. Different content classes stress different parts of the trajectory. A fixed block schedule ignores most of that structure.

The warmup result made that obvious. Keeping reuse out of the early steps mattered more than almost anything else.

The residual-gate result made the next problem obvious. A block-level volatility score did not know enough about the failure mode to protect dense interiors.

So the next acceleration surface I want to study is narrower than "try more caching tricks": trajectory-aware compute scheduling.

The question is whether the denoising process can decide, step by step, when a full expensive transformer evaluation is worth paying for. Early steps carry composition. Later steps refine detail. Dense interiors may be sensitive in places where portraits and product renders are not. A fixed block-reuse schedule flattens all of that structure into a yes/no decision per block.

That feels like the wrong abstraction. The next question is whether cheap signals from the latent trajectory can guide the amount of compute spent across denoising phases without moving the image onto a different composition path.

That is a direction, not a commitment. The last project taught me to be careful with promises before the metric harness exists.

What I am taking forward

The useful artifact from this work is not just the 1.125× speedup. It is the measurement discipline.

For the next round, I want the same rules from day one:

interleaved timing for candidate-vs-baseline comparisons;
prompt-stratified quality summaries;
cross-seed floors for reference metrics;
perceptual metrics, not only PSNR;
visual samples in the report;
explicit distinction between trajectory preservation and image quality;
negative results written down instead of quietly discarded.

That last one matters. The residual gate was a miss, but it was not wasted. It told me that a simple per-block residual-volatility signal is probably not the lever. It also told me that dense interiors are the content class to keep in the loop for every future speed claim.

A speedup that only works on simple prompts is a demo. A speedup that survives dense interiors is a result.

The current state

The block-residual caching milestone is closed.

What I am comfortable saying publicly:

Fixed-interval block-residual caching exposes real acceleration headroom on this setup.
Same-seed PSNR is a trajectory metric, not an image-quality verdict.
Warmup-delayed reuse is the best tested cache policy so far.
The warmup split reaches a timing-pass near-fidelity point at 1.125×.
Dense interiors remain the hard case.
Residual-volatility gating did not solve that hard case.

What I am not claiming:

that the cache is generally quality-preserving;
that 1.125× transfers to other hardware, quantization settings, or model variants;
that adaptive reuse is dead;
that I already know the next acceleration method.

The next thing is a question:

If block caching gives a bounded win, can trajectory-aware scheduling give diffusion a better way to spend compute?

That is where I want to go next.

The benchmark repo is here: github.com/kkjcodes/m5-flux-block-cache-benchmark

The README now includes the curated reports, visual samples, and the residual-gated negative result. If you are doing similar work, my strongest recommendation is still the boring one: measure your floor, interleave your timing, and do not let a single prompt write your headline.

I made FLUX 1.16× faster on an M5 Max. Then I found out my quality gate was measuring the wrong thing.

Kumar K Jha — Thu, 16 Jul 2026 13:13:42 +0000

I spent a few weeks benchmarking training-free block-residual caching for FLUX.1-dev on an Apple M5 Max. The timing result was real: a stable ~1.16× speedup, tight variance, reproducible across prompts.

Then the quality sweep came back and every policy failed my gate. Median PSNR of 14.27 dB against a gate of 30 dB. Decisive failure.

I almost published that as "block-residual caching doesn't preserve quality on Apple Silicon." That would have been wrong — not because the numbers were wrong, but because PSNR wasn't measuring what I thought it was measuring. This post is about the speedup, the trap, and the one cheap calibration that caught it.

Everything here is reproducible — harness, raw per-pair data, and the two image pairs the argument turns on:
github.com/kkjcodes/m5-flux-block-cache-benchmark

The setup

FLUX.1-dev has 19 joint transformer blocks (image and text streams attending together) and 38 single blocks (a fused stream). Phase attribution on my machine said:

DiT forward: 95.2% of uncached wall time
Single-stream blocks: 63.8%
Joint blocks: 31.2%

So essentially all the time is in the transformer, and the single blocks are where the money is. That's the headroom.

The idea behind block-residual caching is simple. A transformer block computes out = f(in). Instead of caching out (which is wrong the moment the input changes), you cache the residual out - in, then on a reuse step you apply the stale residual to the fresh input:

def _store(self, kind, value, **kwargs):
    if self.value_mode == "raw":
        return value
    if kind == "joint":
        # joint blocks return (encoder_hidden_states, hidden_states)
        return value[0] - kwargs["encoder_hidden_states"], value[1] - kwargs["hidden_states"]
    return value - kwargs["hidden_states"]

def _restore(self, kind, cached, **kwargs):
    if self.value_mode == "raw":
        return cached
    if kind == "joint":
        return kwargs["encoder_hidden_states"] + cached[0], kwargs["hidden_states"] + cached[1]
    return kwargs["hidden_states"] + cached

Reuse is on a fixed interval — every other step:

def should_reuse(self, step_index: int, timestep: float) -> bool:
    return step_index > 0 and step_index % self.interval != 0
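Put together, the store/restore pair and the interval rule behave roughly like the toy loop below. This is a sketch under stated assumptions: `ResidualCache`, `run_block`, and `toy_block` are illustrative names, and plain Python lists stand in for tensors. It is not the harness code.

```python
class ResidualCache:
    """Toy single-block cache: stores out - in, reapplies it to fresh input."""

    def __init__(self, interval=2):
        self.interval = interval
        self.residual = None
        self.recomputes = 0

    def should_reuse(self, step_index):
        # Same rule as above: step 0 always computes, then every other step.
        return step_index > 0 and step_index % self.interval != 0

    def run_block(self, block, hidden, step_index):
        if self.should_reuse(step_index) and self.residual is not None:
            # Reuse step: stale residual applied to the fresh input.
            return [h + r for h, r in zip(hidden, self.residual)]
        out = block(hidden)
        self.recomputes += 1
        self.residual = [o - h for o, h in zip(out, hidden)]
        return out


def toy_block(hidden):
    # Stand-in for a transformer block: input plus an input-dependent update.
    return [h + 0.1 * h + 1.0 for h in hidden]


cache = ResidualCache(interval=2)
hidden = [1.0, 2.0, 3.0]
reused_steps = []
for step in range(28):
    if cache.should_reuse(step):
        reused_steps.append(step)
    hidden = cache.run_block(toy_block, hidden, step)

print(len(reused_steps), "of 28 steps reused; first reuse at step", reused_steps[0])
```

With `interval=2` the block is evaluated on half the steps, which is where the timing headroom comes from.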

Config throughout: 4-bit quantized FLUX.1-dev, 28 steps, linear scheduler, 1024×1024, Apple M5 Max. The 4-bit part matters — quantization shifts the compute/memory balance that determines how much caching can win, so none of these numbers should be assumed to transfer to a full-precision build.

The timing result

Twelve measured runs per policy, one warmup, acceptance gated on delta > 3 × sqrt(baseline_stdev² + candidate_stdev²):

Config	Blocks	Clean median	Stdev	Speedup
baseline_uncached	none	119.33s	2.89s	1.000×
joint_12_18_i2_residual	joint:12-18	134.82s	18.15s	0.885× ❌
joint_10_18_i2_residual	joint:10-18	98.82s	2.86s	1.208×
joint_8_18_i2_residual	joint:8-18	100.50s	2.60s	1.187×
single_28_37_i2_residual	single:28-37	101.03s	2.20s	1.181×
joint_12_18_single_28_37	joint:12-18,single:28-37	91.87s	0.99s	1.299×

Real speedup, cleanly separated from noise. But that table has a problem I only caught later, so hold onto it.
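For reference, the acceptance rule can be written down in a few lines. This is a sketch: the function name is mine, and I am assuming `delta` means baseline median minus candidate median, which the rule above does not spell out. The numbers are the medians and standard deviations from the table.

```python
import math


def passes_timing_gate(base_median, base_stdev, cand_median, cand_stdev, k=3.0):
    """True when the candidate is faster by more than k combined stdevs."""
    delta = base_median - cand_median  # assumed sign: positive means faster
    noise = math.sqrt(base_stdev ** 2 + cand_stdev ** 2)
    return delta > k * noise


BASELINE = (119.33, 2.89)  # (clean median seconds, stdev seconds)
CANDIDATES = {
    "joint_12_18_i2_residual": (134.82, 18.15),
    "joint_10_18_i2_residual": (98.82, 2.86),
    "joint_8_18_i2_residual": (100.50, 2.60),
    "single_28_37_i2_residual": (101.03, 2.20),
    "joint_12_18_single_28_37": (91.87, 0.99),
}

results = {}
for name, (median_s, stdev_s) in CANDIDATES.items():
    speedup = BASELINE[0] / median_s
    accepted = passes_timing_gate(*BASELINE, median_s, stdev_s)
    results[name] = (round(speedup, 3), accepted)
    print(f"{name:28s} {speedup:.3f}x  {'pass' if accepted else 'fail'}")
```

The gate only says a difference is separated from run-to-run noise on the prompt that was timed. It says nothing about whether the magnitude holds on other prompts.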

The quality sweep

4 prompts × 8 seeds × 3 surviving policies = 96 comparisons. Each variant is compared against an uncached reference generated with the same seed — same latents, same prompt, same everything but the cache. Gate: median PSNR >= 30 dB, median global-SSIM >= 0.98.

Policy	Median PSNR	Median Global-SSIM	Gate
joint_10_18_i2_residual	16.75	0.857	fail
joint_12_18_single_28_37	14.27	0.693	fail
single_28_37_i2_residual	14.23	0.727	fail

Total wipeout. 0 of 96 pairs cleared 30 dB. Zero cleared 25.

So: caching breaks the images, right?

The trap

I pulled the single worst pair in the sweep — 11.34 dB, a typography prompt — expecting mush.

It was a clean, well-composed "OPEN LATE" bookstore sign. Sharp letterforms, correct prompt adherence, nice rain-slicked reflections. It was a good image. It just wasn't the same image as the reference.

That reframes everything. PSNR against a same-seed reference doesn't measure quality. It measures trajectory divergence — how far the denoising path drifted from where it would have gone. A cached run that lands on a different-but-equally-good image scores catastrophically, and a genuinely degraded run scores catastrophically, and PSNR cannot tell you which one you're looking at.

Worse, the gate itself was unreachable. The single best pair in my whole sweep was 22.25 dB. I looked at it: same face, same pose, same lighting, same composition — differing in fingernail detail and pot texture. Visually near-identical output, and it still misses a 30 dB gate by 8 dB. A gate that rejects near-perfect output isn't a fidelity gate, it's a pixel-identity gate.

The calibration that caught it

Here's the cheap trick, and it's the most portable thing in this post.

If you don't know what your metric's numbers mean, measure the floor: score two images that are both good but share no trajectory at all. For me that's two reference images, same prompt, different seeds — no cache anywhere near them.

# both images uncached, both good, different seeds
for a, b in itertools.combinations(seeds, 2):
    psnr, global_ssim = compare(refs[a], refs[b])

Result over 112 unrelated pairs: 9.73 dB, global-SSIM 0.23.

Now the whole scale snaps into focus:

 9.73 dB  two unrelated good images  <- the floor
14–17 dB  my cached variants
22.25 dB  visually near-identical (verified by eye)
30.00 dB  my gate                    <- nothing reaches this, ever

Same story in global-SSIM units: floor 0.23, my variants 0.69–0.86, gate 0.98.

My variants at 14–17 dB sit meaningfully above "unrelated," so the cache does retain real trajectory structure. But the worst cases at 11.3 dB are creeping toward "might as well have used a different seed." That's a genuinely interesting, defensible finding — and I could only state it because I had the floor. Without it, "14.27 dB" is a number with no semantics.

If you're gating generative output on a reference metric, go measure your floor first. It takes ten minutes and it tells you whether your gate has any discriminating power at all. Mine didn't.
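Here is a self-contained sketch of the floor measurement, assuming 4 prompts and 8 seeds as in the sweep. The images are synthetic random noise produced by a hypothetical `fake_reference` helper, so the floor it prints describes noise, not FLUX output. The point is the pairing logic and the pair count.

```python
import itertools
import math
import random
from statistics import median


def psnr(a, b, peak=255.0):
    """PSNR in dB between two equal-length flat pixel lists."""
    mse = sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    return math.inf if mse == 0 else 10 * math.log10(peak ** 2 / mse)


def fake_reference(prompt, seed, n_pixels=4096):
    # Stand-in for an uncached reference render (hypothetical helper).
    rng = random.Random(f"{prompt}-{seed}")
    return [rng.randrange(256) for _ in range(n_pixels)]


prompts = ["dense_interior", "portrait", "product_render", "typography"]
seeds = range(8)
refs = {(p, s): fake_reference(p, s) for p in prompts for s in seeds}

# Same prompt, different seeds, no cache involved anywhere.
floor_scores = [
    psnr(refs[(p, a)], refs[(p, b)])
    for p in prompts
    for a, b in itertools.combinations(seeds, 2)
]

print(len(floor_scores), "cross-seed pairs")
print(f"floor (median PSNR): {median(floor_scores):.2f} dB")
```

Four prompts times C(8, 2) = 28 seed pairs gives the 112 unrelated pairs quoted above. Swapping `fake_reference` for real uncached renders is all it takes to turn this into an actual floor.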

Two more things I got wrong

My SSIM wasn't SSIM. I'd written a global SSIM over the whole flattened image. Real SSIM (Wang et al., which is where the 0.98 convention comes from) is an 11×11 windowed mean computed per channel. My numbers were inflated by ~0.15 across the board — 0.857 where skimage says 0.698. It didn't flip any verdict, but I'd have published a column labeled "SSIM" that wasn't. If you hand-roll a metric, diff it against the reference implementation before you put it in a table.
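To show why the two definitions disagree, here is a toy comparison on a synthetic grayscale image. Assumptions to flag: this uses uniform (not Gaussian) windows, a single channel, and a 4×4 window so the toy stays small. It is a sketch of the difference, not a replacement for the skimage implementation.

```python
from statistics import fmean

C1 = (0.01 * 255) ** 2  # stabilizing constants for an 8-bit range
C2 = (0.03 * 255) ** 2


def ssim_of(xs, ys):
    """SSIM of two equal-length pixel lists, treated as one window."""
    mx, my = fmean(xs), fmean(ys)
    vx = fmean([(x - mx) ** 2 for x in xs])
    vy = fmean([(y - my) ** 2 for y in ys])
    cov = fmean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
    return ((2 * mx * my + C1) * (2 * cov + C2)) / (
        (mx ** 2 + my ** 2 + C1) * (vx + vy + C2)
    )


def global_ssim(a, b):
    # One statistic over the whole flattened image.
    return ssim_of([p for row in a for p in row], [p for row in b for p in row])


def windowed_ssim(a, b, win=11):
    # Mean of SSIM over every win x win sliding window (uniform weights).
    h, w = len(a), len(a[0])
    scores = []
    for top in range(h - win + 1):
        for left in range(w - win + 1):
            cells = [(y, x) for y in range(top, top + win)
                     for x in range(left, left + win)]
            scores.append(ssim_of([a[y][x] for y, x in cells],
                                  [b[y][x] for y, x in cells]))
    return fmean(scores)


# Smooth gradient versus the same gradient with fine checkerboard noise.
clean = [[20 + 12 * x for x in range(16)] for _ in range(16)]
noisy = [[clean[y][x] + (20 if (x + y) % 2 == 0 else -20) for x in range(16)]
         for y in range(16)]

g = global_ssim(clean, noisy)
w = windowed_ssim(clean, noisy, win=4)
print(f"global: {g:.3f}  windowed: {w:.3f}")
```

The global score is dominated by the large-scale gradient both images share, so it comes out well above the windowed score, which sees the local noise in every window. That is the same direction as the inflation described above.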

My headline speedup didn't replicate. That 1.299× came from one prompt ("A simple red cube on a white table") at one seed. The quality sweep incidentally timed 32 prompt/seed cells per policy, so I ran a paired comparison:

Policy	Paired median (n=32)	IQR	Headline (n=1 prompt)
joint_10_18	1.085×	1.067–1.113	1.208×
single_28_37	1.095×	1.075–1.110	1.181×
joint_12_18_single_28_37	1.159×	1.141–1.185	1.299×

Ordering preserved, variance tight, and speedup is essentially prompt-independent (1.084–1.108 across four very different prompts) — so "stable" is actually better supported than the single-prompt run showed. But every magnitude lands ~10 points lower. The n=32 sample is the weaker timing protocol (no warmup, N=1 per cell, fixed ordering), so it's not a refutation — but it's my own data disagreeing with my own headline, and the honest move is to quote the range and name the protocol.
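The paired summary itself is cheap to compute. A sketch, with made-up timings standing in for the per-cell wall times (the real ones are in the repo's data files) and the "inclusive" quantile method as my choice rather than the one used for the table:

```python
from statistics import median, quantiles


def paired_speedup(baseline_s, candidate_s):
    """Median and IQR of per-cell speedup ratios (same prompt, same seed)."""
    ratios = [b / c for b, c in zip(baseline_s, candidate_s)]
    q1, _, q3 = quantiles(ratios, n=4, method="inclusive")
    return median(ratios), (q1, q3)


# Hypothetical per-cell wall times in seconds, one entry per prompt/seed cell.
baseline = [120.0, 110.0, 130.0, 100.0, 125.0]
candidate = [100.0, 100.0, 100.0, 100.0, 100.0]

mid, (q1, q3) = paired_speedup(baseline, candidate)
print(f"paired median {mid:.3f}x, IQR {q1:.3f}-{q3:.3f}")
```

Pairing by cell removes prompt-to-prompt variation from the comparison, which is why the interquartile range can be tight even when absolute times differ a lot between prompts.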

The big caveat: I never tested a warmup window

Look at that reuse schedule again:

step 0: compute → step 1: REUSE → step 2: compute → step 3: REUSE ...
14 of 28 steps reused. First reuse at step index 1.

Caching kicks in at the second denoising step — while composition is still being decided. And composition divergence is exactly what my PSNR was detecting.

Every established training-free cache method (DeepCache, TeaCache, FBCache) skips an early warmup window for precisely this reason. My policy dataclass has no start-step field, so I couldn't express "skip the first N steps" even if I'd wanted to. I tested the configuration most likely to diverge and then generalized from it.

That's the next experiment, and I'd bet it's where the actual frontier is.

What I'm actually claiming

Training-free block-residual caching exposes real acceleration headroom for 4-bit FLUX.1-dev on Apple M5 Max: broad single-stream and joint+single coverage produce stable, prompt-independent speedups in the 1.09×–1.30× range depending on policy and timing protocol.

But fixed-interval residual reuse — enabled from the first denoising step, with no warmup exclusion, at the block spans tested — does not preserve same-seed reference trajectories, and divergence grows with coverage.

Image quality remains unmeasured. I have no perceptual metric and no prompt-alignment metric in this run. The images I inspected by hand looked good. I'm not claiming they are, because I didn't measure it.

That's a timing and trajectory-divergence characterization. It is not a quality result, and I'm not going to dress it up as one.

Next up: LPIPS, real windowed SSIM, and CLIP prompt-alignment over the 32 references and 96 variants already sitting on disk — each with its own cross-seed floor, because now I know better than to report a number without one. Then a warmup-window knob.

The code and the data

Everything is on GitHub: github.com/kkjcodes/m5-flux-block-cache-benchmark

The repo ships the harness (a custom mflux denoising loop with block-level hooks, so the timing has no hidden synchronization in it), plus the audited evidence bundle — all 96 per-pair rows in quality_results.jsonl, the timing sweep JSON, the environment record, and the two reference/variant pairs this post argues from. The 153MB of remaining images stayed out, but the two that carry the argument are there, so you can look at the 11.34 dB "failure" and the 22.25 dB "near-identical" pair and judge my read for yourself.

If you disagree with my interpretation, the raw numbers are right there to disagree with.

If you're doing similar work on Apple Silicon: measure your floor, diff your hand-rolled metrics against reference implementations, and never let a headline number rest on one prompt. All three of my mistakes were free to catch and would have been expensive to publish.

What it actually costs to generate one AI cartoon video, line by line

Kumar K Jha — Fri, 26 Jun 2026 17:13:57 +0000

I sat down to look at my fal.ai billing dashboard last week for the first time in a few months. Not because something was wrong — just because I wanted to write a "here's what this costs" post and I'd been winging the numbers for a while.

A couple of hours later I had a spreadsheet, three tabs of receipts, and a slightly different view of my own pipeline than I started with. So I'm writing it up.

This is about the cost structure of a small AI-video product I built solo. The point isn't to argue it's a great cost structure — it isn't, particularly — but to break down where the money actually goes. If you're building one of these and trying to figure out where to optimize, the numbers might be useful. And if you're not, the surprise at the end is at least kind of fun.

What I built, briefly and honestly

A user uploads a photo. The pipeline glues together a few third-party models — image generator, image-to-video, TTS, a few LLM calls — and out the other end pops a personalized animated MP4 starring that person (and optionally up to four people in the same video).

It's atveanimation.com if you want to poke at it. Free tier is real. Free tier is also why this cost post exists — running a free tier means knowing where the money goes.

I did not invent any of the techniques here. Character consistency, keyframe anchoring, LoRA conditioning — those are all standard patterns at this point. Frontier models (Kling 3.0 Motion Control, Seedance 2.0, Hedra, Wan 2.7 multi-ref) do the consistency part natively and arguably better. What I'm posting is just the unit economics of a particular stack: WAN + Flux Kontext + flux-lora + Kokoro.

How I measured this

Prices come from the posted API rates of fal.ai, Replicate, and Anthropic, cross-referenced with averaged usage across the last ~100 generations on my account. For models priced by output (Kontext Pro) or character count (Kokoro), I used the per-call mean rather than the marginal-token rate.

A couple of definitions worth being precise about, because I tripped over them in an earlier draft:

Generated seconds: what you actually pay for. WAN i2v emits ~6-second clips at my frame settings, so 4 scenes = 24 generated seconds.
Finished seconds: what the user watches. My pipeline trims each scene to audio_length + 0.5s after merging, so a 6-second clip with a 3-second voice line becomes a 3.5-second finished scene. A 4-scene video usually finishes around 16 seconds.

Both numbers are real. The per-second figure depends on which one you divide by, and I'll keep them straight throughout.
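As a quick sketch of the two definitions: the function name, the cap at the clip length, and the four voice-line durations are my own illustration, chosen so the total lands on the typical 16 seconds.

```python
def finished_seconds(clip_seconds, audio_seconds, pad=0.5):
    """Length of a scene after the audio-aware trim."""
    # Assumption: a scene can never run longer than the clip that was generated.
    return min(clip_seconds, audio_seconds + pad)


CLIP_SECONDS = 6.0                    # roughly what one WAN clip emits
voice_lines = [3.0, 3.5, 4.0, 3.5]    # hypothetical audio lengths, seconds

generated = CLIP_SECONDS * len(voice_lines)
finished = sum(finished_seconds(CLIP_SECONDS, a) for a in voice_lines)

print(f"generated: {generated:.0f}s, finished: {finished:.0f}s")
print(f"share of generated footage that survives: {finished / generated:.0%}")
```

With these inputs a third of the footage that gets billed never reaches the viewer, which is why the two per-second figures sit so far apart.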

Amortization: I'm dividing per-character setup costs across 5 videos per character, which is what the repeat-use pattern looks like in my own data. At one video per character the per-second cost roughly doubles. I'll show the range.

What this excludes: failed-and-rerolled generations (handled separately below), Azure blob storage and egress (rounding error), Container Apps baseline (~$30–50/month, fixed, amortized across all traffic), developer time. What it includes: the vision call that auto-picks each character's voice, the Claude Haiku call that writes the scene brief.

The teardown

Per scene

Component	Model	Per call
Keyframe (solo character)	fal-ai/flux-lora	$0.04
Keyframe (multi-character)	fal-ai/flux-pro/kont
Keyframe (anchor scene)	FLUX Kontext Pro	$0.04
Animated clip (100 frames, ~6s)	fal-ai/wan-i2v	$0.50
Voice line	Kokoro TTS	$0.005
Prompt sanitization	Claude Haiku	$0.0008

Per multi-character scene: about $0.555. Four scenes: $2.22.

Worth noting: fal's posted price for wan-i2v at 720p is $0.40 per clip, not per second. The 1.25× multiplier kicks in for clips over 81 frames. My pipeline requests 100 frames (~6.25 seconds at 16 fps) per scene to give the audio room, which lands me in the multiplier band at $0.50 per clip. Dropping to 80 frames would save $0.10 per clip, and 5 seconds is plenty for most voice lines. That's a real optimization I should run, and I'll get to it below.

WAN dominates the per-scene line. No surprise: I expected the video model to be the expensive part. What I didn't expect was where the rest of the bill came from.

Per character (one-time, amortized)

Component	Model	Cost
Visual description	Claude Sonnet (vision)	$0.
Style transfer (4 cartoon options)	FLUX Kontext Pro × 4	$0.16
Training augmentations (35 images)	FLUX Kontext	$1.40
LoRA fine-tune (1500 steps)	fal-ai/flux-lora-fast-training	$0.40
Total per character		~$1.975

Here's the surprise that made me actually write this post: the augmentation step costs 3.5× as much as the LoRA training itself.

If you've never built one of these, you might assume LoRA training is the expensive part — it's the line item with "training" in the name. It's not. The expensive line item is generating the 35 cartoon variations you need to feed the training, because a LoRA fine-tuned on a single source photo overfits horribly and the resulting character looks generic and same-y across scenes.

So you need pose variation. Expression variation. Lighting variation. Each one costs ~$0.04 to generate via Kontext Pro. Stack 35 of them and you've spent $1.40 before you've trained a single weight.

I generate 20 variations from the cartoon style image (poses, expressions) plus 15 variations from the original selfie (anchored on the real face, to counterbalance the cartoon-side darkening of skin tone that I observed when I had a more skewed mix). That 20+15 split is what makes the LoRA actually produce a recognizable person.

It's the hidden cost nobody flags when they talk about "LoRA fine-tuning is cheap now."

Reconciliation

For a typical 4-scene, 2-character video, amortized over 5 videos per character:

Per-character setup (amortized): 2 × $1.975 / 5 = $0.79
Per-scene (4 × $0.555): = $2.22
Per-project brief: < $0.01
────────────────────────────────
Total per video ≈ $3.02
÷ 24 generated seconds (4 × ~6s clips) ≈ $0.126 / sec
÷ 16 finished seconds (after audio-aware trim) ≈ $0.189 / sec

So: about $0.13 per generated second (what fal/Replicate/Anthropic invoice me for), or $0.19 per finished second (what a viewer actually experiences). Both are real; the finished-second number is the more honest headline because it's what the user gets.
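The reconciliation above is simple enough to check in code. A sketch using the figures from this post; the names are mine, and the brief is entered at its $0.01 upper bound since the post only gives "< $0.01".

```python
PER_CHARACTER_SETUP = 1.975   # one-time cost per character, dollars
PER_SCENE = 0.555             # multi-character scene, dollars
PER_PROJECT_BRIEF = 0.01      # upper bound; the actual figure is below this


def cost_per_video(characters=2, scenes=4, videos_per_character=5):
    """All-in cost of one video with setup amortized across repeat videos."""
    setup = characters * PER_CHARACTER_SETUP / videos_per_character
    return setup + scenes * PER_SCENE + PER_PROJECT_BRIEF


total = cost_per_video()
per_generated_second = total / 24   # 4 clips of roughly 6 seconds
per_finished_second = total / 16    # after the audio-aware trim

print(f"total per video:     ${total:.2f}")
print(f"per generated sec:   ${per_generated_second:.3f}")
print(f"per finished sec:    ${per_finished_second:.3f}")
```

Changing `videos_per_character` or `characters` in the call is the fastest way to see how sensitive the headline is to the amortization assumption.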

Share of the bill

This is the part I think is actually useful:

Component	$ / video	Share
WAN i2v (4 clips)	$2.00	66%
Augmentation (amortized, 2 characters)	$0.56	19%
LoRA training (amortized)	$0.16	5%
Multi-Kontext keyframes (4 scenes)	$0.20	7%
Style transfer + vision describe (amortized)	$0.07	2%
Kokoro voice lines (4 scenes)	$0.02	<1%
Everything else (LLM, moderation)	<$0.01	<1%

The video model is the biggest line. It's not the only line, and it doesn't dominate the way I expected. Augmentation alone is almost a fifth of the bill.

Amortization range

Videos / character	Per generated sec (24s)	Per finished sec (16s)
1 (single video, new character)	~$0.26	~$0.39
3	~$0.15	~$0.22
5 (my measured average)	~$0.13	~$0.19
10	~$0.11	~$0.16
20	~$0.10	~$0.15

The generated-second column is what your accountant cares about (it matches the invoice). The finished-second column is what the user experiences. They diverge because each WAN clip generates ~6 seconds but the concat step trims to audio-length + 0.5s — most voice lines come back at 3-4 seconds, so a lot of generated frames get cut.

Shape worth noticing: going from 5 to 10 videos only saves $0.02–0.03/sec. Going from 1 to 2 saves about $0.08/sec. The biggest unit-economic win is getting a user to make their second video on an existing character, not their tenth. Most of the product features I've been building (preset scenes, group videos, "make a sequel") are essentially shaped by that math.
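The shape of that curve falls straight out of the arithmetic. A sketch that regenerates the table and the marginal saving per additional video, again with names of my own choosing and the two-character, four-scene video as the fixed case:

```python
SETUP_TWO_CHARACTERS = 2 * 1.975     # dollars, paid once per character pair
FIXED_PER_VIDEO = 4 * 0.555 + 0.01   # scenes plus brief, dollars


def per_second(videos_per_character, seconds):
    """Amortized cost per second for a 4-scene, 2-character video."""
    return (SETUP_TWO_CHARACTERS / videos_per_character + FIXED_PER_VIDEO) / seconds


table = {n: (per_second(n, 24), per_second(n, 16)) for n in (1, 3, 5, 10, 20)}
for n, (generated, finished) in table.items():
    print(f"{n:>2} videos/character: ${generated:.2f}/gen sec  ${finished:.2f}/finished sec")

# Marginal saving per generated second from one more video on the same character.
saving_1_to_2 = per_second(1, 24) - per_second(2, 24)
saving_5_to_10 = per_second(5, 24) - per_second(10, 24)
print(f"1 -> 2 saves ${saving_1_to_2:.3f}/sec, 5 -> 10 saves ${saving_5_to_10:.3f}/sec")
```

Because setup is divided by the video count, the saving is front-loaded: the first repeat video halves the setup share, and every later one recovers less.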

Effective vs. sticker cost

In my logs, scene image generation fails on the first try roughly 1 in 8–10 attempts. Two main causes: transient 5xx from fal, and WAN's content filter rejecting a scene description with action-y or fight-y language even after my Claude Haiku rewriter swaps the trigger words. With the rewriter, the practical reroll rate is about 10–12%.

So effective cost is ~1.10–1.12× sticker. Not nothing, but not the 1.4× you'd get with a stricter video model. If I were on Sora 2 Pro or early Veo this multiplier would be much bigger.

How this stacks up against just the video model

For raw per-second video-model pricing as of mid-20

Model	Posted price
Seedance 1.5 Pro	~$0.025/sec
Kling 3.0	~$0.029/sec
Runway Gen-4 Turbo	~$0.05/sec
Sora 2 base	~$0.10/sec
Veo 3.1 Fast	~$0.10–$0.15/sec
Veo 3.1 Standard	~$0.40/sec
Sora 2 Pro	~$0.30–$0.50/sec
WAN 2.1 i2v (720p, ≤81 frames)	$0.40 / clip
WAN 2.1 i2v (720p, 82–100 frames)	$0.50 / clip

A note on WAN pricing: fal bills it per clip, not per second. At 720p the base is $0.40 per clip, but clips over 81 frames incur a 1.25× multiplier — $0.50 per clip. My pipeline requests 100 frames per scene (~6.25 seconds raw) so I'm in the multiplier band. A clip generates 5-6 seconds of footage but I pay the same regardless of how short the finished cut is. Roundup posts that quote $0.04–$0.08/sec for WAN are usually referring to the 480p variant ($0.20/clip, halving the price) or dividing $0.40/clip by the maximum frame count rather than what the model actually emits at default settings.
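A small sketch of the per-clip pricing as described in this post. The function and its parameters are my own framing of the 720p numbers above, not fal's API or an official price formula.

```python
def wan_clip_price(frames, base=0.40, threshold=81, multiplier=1.25):
    """Per-clip price at 720p as described above: flat, with a long-clip band."""
    return base * multiplier if frames > threshold else base


FPS = 16  # 100 frames corresponds to about 6.25 seconds of raw footage

for frames in (80, 100):
    price = wan_clip_price(frames)
    raw_seconds = frames / FPS
    print(f"{frames} frames: ${price:.2f}/clip, "
          f"{raw_seconds:.2f}s raw, ${price / raw_seconds:.3f}/raw sec")

saving_per_video = 4 * (wan_clip_price(100) - wan_clip_price(80))
print(f"dropping to 80 frames saves ${saving_per_video:.2f} per 4-scene video")
```

Because the price is per clip, the per-second figure depends entirely on how many seconds you count, which is how the same model ends up quoted at very different rates.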

So in raw video-model terms my pipeline is mid-range: roughly on par with Sora 2 base, cheaper than Veo Standard, more expensive than Kling. The all-in cost works out to roughly 1.5–2× the video model alone. That overhead is structural to a multi-model personalization stack: keyframe conditioning, per-character training, voice, vision, brief.

Migrating to Seedance would save ~$0.04/sec on the video line and leave the other overhead untouched. The optimization question isn't "which cheaper video model?" — it's "how much can I cut from the wrapper around it?"

What I'd actually optimize

Honest, in rough order of impact:

Cut augmentation calls. Biggest non-video line item and the most room. Replacing the 35-image Kontext-Pro augmentation set with a Flux LoRA training pass directly on the source + selected style image would save ~$1.40 per character. The trade-off is real (less expression range in the LoRA) — that's the next A/B I want to run.
Drop WAN num_frames from 100 to 80. Per fal's posted pricing, 720p clips over 81 frames pay a 1.25× multiplier — so I'm paying $0.50 when I could be paying $0.40. The audio-aware trim downstream means my finished scenes rarely exceed 5 seconds anyway. Net savings: $0.40 per video, ~$0.025/finished second. This is the easiest unit-economic win in the whole stack and I have no excuse for not having shipped it.
Raise videos-per-character. Every additional video on an existing character drops per-second cost by ~$0.02/sec. Product features that bring users back to existing characters have a direct unit-economic lever. Cheaper than optimizing models.
Don't touch TTS. Kokoro is $0.005/scene. Anything cheaper would be rounding error and Kokoro sounds better than the alternatives at this price.

What I would not spend time on: switching the video model. Cheaper options exist, but the headroom isn't in the model. It's in the conditioning and training around it.

Try it if you want

If you want to see what $0.19/finished second looks like as an actual video — and figure out whether my math is right — the product is at atveanimation.com. Upload a photo, pick a style, hit generate. Free tier gives you 10 scenes a day, which is enough for two short videos.

If you find a way to crash the augmentation step or rack up a $20 bill on a single account, please tell me. I'd genuinely like to know.

Closing thought

I'm not going to pretend this teardown is novel — anyone with a billing dashboard and a calculator can produce one. But I hadn't seen one written publicly for a WAN + Kontext + flux-lora + Kokoro stack, and the augmentation surprise (3.5× the cost of the LoRA training it feeds) was non-obvious enough to me, after a year of building this, that it probably warranted writing down.

If you spot something off in the numbers, the comments are open. I'll fix the post rather than defend it.

Posted from my own desk on a Friday afternoon. Numbers reconcile to the nearest cent; if they don't reconcile to yours I'd love to know why.