Kumar K Jha

Posted on Jun 26

What it actually costs to generate one AI cartoon video, line by line

#ai #machinelearning #sideprojects

I sat down to look at my fal.ai billing dashboard last week for the first time in a few months. Not because something was wrong — just because I wanted to write a "here's what this costs" post and I'd been winging the numbers for a while.

A couple of hours later I had a spreadsheet, three tabs of receipts, and a slightly different view of my own pipeline than I started with. So I'm writing it up.

This is about the cost structure of a small AI-video product I built solo. The point isn't to argue it's a great cost structure — it isn't, particularly — but to break down where the money actually goes. If you're building one of these and trying to figure out where to optimize, the numbers might be useful. And if you're not, the surprise at the end is at least kind of fun.

What I built, briefly and honestly

A user uploads a photo. The pipeline glues together a few third-party models — image generator, image-to-video, TTS, a few LLM calls — and out the other end pops a personalized animated MP4 starring that person (and optionally up to four people in the same video).

It's atveanimation.com if you want to poke at it. Free tier is real. Free tier is also why this cost post exists — running a free tier means knowing where the money goes.

I did not invent any of the techniques here. Character consistency, keyframe anchoring, LoRA conditioning — those are all standard patterns at this point. Frontier models (Kling 3.0 Motion Control, Seedance 2.0, Hedra, Wan 2.7 multi-ref) do the consistency part natively and arguably better. What I'm posting is just the unit economics of a particular stack: WAN + Flux Kontext + flux-lora + Kokoro.

How I measured this

Prices come from the posted API rates at fal.ai, Replicate, and Anthropic, cross-referenced with averaged usage across the last ~100 generations on my account. For models priced by output (Kontext Pro) or character count (Kokoro), I used the per-call mean rather than the marginal-token rate.

A couple of definitions worth being precise about, because I tripped over them in an earlier draft:

Generated seconds: what you actually pay for. WAN i2v emits ~6-second clips at my frame settings, so 4 scenes = 24 generated seconds.
Finished seconds: what the user watches. My pipeline trims each scene to audio_length + 0.5s after merging, so a 6-second clip with a 3-second voice line becomes a 3.5-second finished scene. A 4-scene video usually finishes around 16 seconds.

Both numbers are real. The per-second figure depends on which one you divide by, and I'll keep them straight throughout.
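The two definitions can be pinned down in a few lines of Python. This is a minimal sketch, not the pipeline's actual code: the function names, the sample voice-line durations, and the cap at the raw clip length are illustrative assumptions. Only the ~6-second clip length and the audio_length + 0.5s trim rule come from the definitions above.

```python
TRIM_PADDING_S = 0.5   # padding kept after each voice line (from the post)
RAW_CLIP_S = 6.0       # approximate raw WAN clip length (from the post)


def generated_seconds(num_scenes, clip_length_s=RAW_CLIP_S):
    """Seconds of video that get billed: one raw clip per scene."""
    return num_scenes * clip_length_s


def finished_seconds(audio_lengths_s, clip_length_s=RAW_CLIP_S,
                     padding_s=TRIM_PADDING_S):
    """Seconds the viewer sees: each scene is trimmed to audio + padding.

    Capping at the raw clip length is an assumption: a trimmed scene
    cannot be longer than the clip it was cut from.
    """
    return sum(min(a + padding_s, clip_length_s) for a in audio_lengths_s)


# Hypothetical voice-line durations for a 4-scene video.
voice_lines = [3.0, 3.5, 4.0, 3.5]
print(generated_seconds(len(voice_lines)))  # 24.0
print(finished_seconds(voice_lines))        # 16.0
```

With voice lines in the 3-4 second range, roughly a third of the generated footage never reaches the viewer, which is why the two per-second figures diverge.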

Amortization: I'm dividing per-character setup costs across 5 videos per character, which is what the repeat-use pattern looks like in my own data. At one video per character the per-second cost roughly doubles. I'll show the range.

What this excludes: failed-and-rerolled generations (handled separately below), Azure blob storage and egress (rounding error), Container Apps baseline (~$30–50/month, fixed, amortized across all traffic), developer time. What it includes: the vision call that auto-picks each character's voice, the Claude Haiku call that writes the scene brief.

The teardown

Per scene

Component	Model	Per call
Keyframe (solo character)	fal-ai/flux-lora	$0.04
Keyframe (multi-character)	fal-ai/flux-pro/kont	$0.05
Keyframe (anchor scene)	FLUX Kontext Pro	$0.04
Animated clip (100 frames, ~6s)	fal-ai/wan-i2v	$0.50
Voice line	Kokoro TTS	$0.005
Prompt sanitization	Claude Haiku	$0.0008

Per multi-character scene: about $0.555. Four scenes: $2.22.

Worth noting: fal's posted price for wan-i2v at 720p is $0.40 per clip, not per second. The 1.25× multiplier kicks in for clips over 81 frames. My pipeline requests 100 frames (~6.25 seconds at 16 fps) per scene to give the audio room, which lands me in the multiplier band at $0.50 per clip. Dropping to 80 frames would save $0.10 per clip, and 5 seconds is plenty for most voice lines. That's a real optimization I should run, and I'll get to it below.
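The per-clip pricing rule is simple enough to write down as a function. The sketch below only encodes the numbers quoted in this post ($0.40 base at 720p, 1.25× above 81 frames, 16 fps); the function and constant names are mine, and it does not call any fal API.

```python
BASE_PRICE_720P = 0.40        # USD per clip at 720p (from the post)
LONG_CLIP_MULTIPLIER = 1.25   # applied to clips over 81 frames (from the post)
MULTIPLIER_THRESHOLD = 81     # frames
FPS = 16                      # frame rate used in this pipeline


def wan_clip_price(num_frames):
    """Price of one 720p clip; billed per clip, not per second."""
    price = BASE_PRICE_720P
    if num_frames > MULTIPLIER_THRESHOLD:
        price *= LONG_CLIP_MULTIPLIER
    return round(price, 4)


def clip_seconds(num_frames, fps=FPS):
    """Raw clip duration in seconds."""
    return num_frames / fps


for frames in (100, 80):
    print(frames, clip_seconds(frames), wan_clip_price(frames))
# 100 6.25 0.5
# 80 5.0 0.4
```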

WAN dominates the per-scene line. No surprise: I expected the video model to be the expensive part. What I didn't expect was where the rest of the bill came from.

Per character (one-time, amortized)

Component	Model	Cost
Visual description	Claude Sonnet (vision)	~$0.015
Style transfer (4 cartoon options)	FLUX Kontext Pro × 4	$0.16
Training augmentations (35 images)	FLUX Kontext Pro × 35	$1.40
LoRA fine-tune (1500 steps)	fal-ai/flux-lora-fast-training	$0.40
Total per character		~$1.975

Here's the surprise that made me actually write this post: the augmentation step costs 3.5× more than the LoRA training itself.

If you've never built one of these, you might assume LoRA training is the expensive part — it's the line item with "training" in the name. It's not. The expensive line item is generating the 35 cartoon variations you need to feed the training, because a LoRA fine-tuned on a single source photo overfits horribly and the resulting character looks generic and same-y across scenes.

So you need pose variation. Expression variation. Lighting variation. Each one costs ~$0.04 to generate via Kontext Pro. Stack 35 of them and you've spent $1.40 before you've trained a single weight.

I generate 20 variations from the cartoon style image (poses, expressions) plus 15 variations from the original selfie (anchored on the real face, to counterbalance the cartoon-side darkening of skin tone that I observed when I had a more skewed mix). That 20+15 split is what makes the LoRA actually produce a recognizable person.

It's the hidden cost nobody flags when they talk about "LoRA fine-tuning is cheap now."
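The arithmetic behind the 3.5× comparison fits in a few lines. This is a sketch using only the prices quoted above; the function name and the default 20 + 15 split as parameters are my own framing.

```python
KONTEXT_PRO_PER_IMAGE = 0.04   # USD per generated variation (from the post)
LORA_TRAINING = 0.40           # USD per 1500-step fine-tune (from the post)


def augmentation_cost(from_style=20, from_selfie=15,
                      per_image=KONTEXT_PRO_PER_IMAGE):
    """Cost of the training set: style-anchored plus selfie-anchored images."""
    return round((from_style + from_selfie) * per_image, 2)


aug = augmentation_cost()
print(aug)                            # 1.4
print(round(aug / LORA_TRAINING, 2))  # 3.5
```

Because the cost is linear in image count, every five images dropped from the set saves $0.20 per character, which is half the price of the training run itself.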

Reconciliation

For a typical 4-scene, 2-character video, amortized over 5 videos per character:

Per-character setup (amortized): 2 × $1.975 / 5 = $0.79
Per-scene (4 × $0.555): = $2.22
Per-project brief: < $0.01
────────────────────────────────
Total per video ≈ $3.02
÷ 24 generated seconds (4 × ~6s clips) ≈ $0.126 / sec
÷ 16 finished seconds (after audio-aware trim) ≈ $0.189 / sec

So: about $0.13 per generated second (what fal/Replicate/Anthropic invoice me for), or $0.19 per finished second (what a viewer actually experiences). Both are real; the finished-second number is the more honest headline because it's what the user gets.
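For anyone who wants to plug in their own numbers, the reconciliation can be written as one function. This is a sketch with the post's figures as defaults; the function name is hypothetical, and the brief is entered as $0.01, the upper bound of the "< $0.01" line.

```python
def cost_per_video(num_characters=2, videos_per_character=5, num_scenes=4,
                   setup_per_character=1.975, per_scene=0.555, brief=0.01):
    """All-in cost of one video in USD, with character setup amortized."""
    setup = num_characters * setup_per_character / videos_per_character
    return setup + num_scenes * per_scene + brief


total = cost_per_video()
print(round(total, 2))       # 3.02
print(round(total / 24, 3))  # 0.126  per generated second
print(round(total / 16, 3))  # 0.189  per finished second
```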

Share of the bill

This is the part I think is actually useful:

Component	$ / video	Share
WAN i2v (4 clips)	$2.00	66%
Augmentation (amortized, 2 characters)	$0.56	19%
LoRA training (amortized)	$0.16	5%
Multi-Kontext keyframes (4 scenes)	$0.20	7%
Style transfer + vision describe (amortized)	$0.07	2%
Kokoro voice lines (4 scenes)	$0.02	<1%
Everything else (LLM, moderation)	<$0.01	<1%

The video model is the biggest line. It's not the only line, and it doesn't dominate the way I expected. Augmentation alone is almost a fifth of the bill.
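The share column can be recomputed from the line items. This sketch uses the dollar figures from the table and the per-call prices quoted earlier; the dictionary labels are shortened by me, and the sub-cent "everything else" row is left out, so the total lands a cent below the reconciliation figure.

```python
# USD per video for a 4-scene, 2-character video amortized over 5 videos.
line_items = {
    "WAN i2v (4 clips)": 4 * 0.50,
    "Augmentation (amortized)": 2 * 1.40 / 5,
    "LoRA training (amortized)": 2 * 0.40 / 5,
    "Multi-Kontext keyframes": 0.20,
    "Style transfer + vision (amortized)": 0.07,
    "Kokoro voice lines": 4 * 0.005,
}

total = sum(line_items.values())
shares = {name: cost / total for name, cost in line_items.items()}

# Print the items from largest to smallest share of the bill.
for name, cost in sorted(line_items.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<38} ${cost:5.2f} {shares[name]:6.1%}")
```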

Amortization range

Videos / character	Per generated sec (24s)	Per finished sec (16s)
1 (single video, new character)	~$0.26	~$0.39
3	~$0.15	~$0.22
5 (my measured average)	~$0.13	~$0.19
10	~$0.11	~$0.16
20	~$0.10	~$0.15

The generated-second column is what your accountant cares about (it matches the invoice). The finished-second column is what the user experiences. They diverge because each WAN clip generates ~6 seconds but the concat step trims to audio-length + 0.5s — most voice lines come back at 3-4 seconds, so a lot of generated frames get cut.

Shape worth noticing: going from 5 to 10 videos only saves $0.02–0.03/sec. Going from 1 to 2 saves about $0.08 per generated second (about $0.12 per finished second). The biggest unit-economic win is getting a user to make their second video on an existing character, not their tenth. Most of the product features I've been building (preset scenes, group videos, "make a sequel") are essentially shaped by that math.
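The front-loaded shape of the curve is easy to reproduce. The sketch below assumes the same 4-scene, 2-character video as the table, with the brief entered as $0.01; the function name is hypothetical.

```python
SETUP_PER_CHARACTER = 1.975   # USD, one-time (from the post)
PER_SCENE = 0.555             # USD (from the post)
BRIEF = 0.01                  # USD, upper bound of the "< $0.01" line


def per_second_costs(videos_per_character, num_characters=2, num_scenes=4,
                     generated_s=24, finished_s=16):
    """Return (cost per generated second, cost per finished second)."""
    setup = num_characters * SETUP_PER_CHARACTER / videos_per_character
    total = setup + num_scenes * PER_SCENE + BRIEF
    return total / generated_s, total / finished_s


previous = None
for n in (1, 2, 3, 5, 10, 20):
    gen, fin = per_second_costs(n)
    # Saving per generated second relative to the previous row.
    saved = "" if previous is None else f"  (saves ${previous - gen:.3f}/gen sec)"
    print(f"{n:>2} videos: ${gen:.2f} generated, ${fin:.2f} finished{saved}")
    previous = gen
```

The setup term shrinks as 1/n, so each doubling of videos per character saves half as much as the one before it.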

Effective vs. sticker cost

In my logs, scene image generation fails on the first try roughly 1 in 8–10 attempts. Two main causes: transient 5xx from fal, and WAN's content filter rejecting a scene description with action-y or fight-y language even after my Claude Haiku rewriter swaps the trigger words. With the rewriter, the practical reroll rate is about 10–12%.

So effective cost is ~1.10–1.12× sticker. Not nothing, but not the 1.4× you'd get with a stricter video model. If I were on Sora 2 Pro or early Veo this multiplier would be much bigger.
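The step from reroll rate to cost multiplier can be made explicit. The ~1.10–1.12× figure matches a first-order estimate where each failure costs one extra call. If rerolls can themselves fail at the same rate, which is my assumption rather than something measured here, the expected number of attempts follows a geometric series and comes out slightly higher. Both function names are hypothetical.

```python
def first_order_multiplier(reroll_rate):
    """One extra call per failure, assuming the reroll always succeeds."""
    return 1 + reroll_rate


def geometric_multiplier(reroll_rate):
    """Expected attempts when every attempt fails independently at reroll_rate."""
    if not 0 <= reroll_rate < 1:
        raise ValueError("reroll_rate must be in [0, 1)")
    return 1 / (1 - reroll_rate)


for rate in (0.10, 0.12):
    print(rate, round(first_order_multiplier(rate), 3),
          round(geometric_multiplier(rate), 3))
# 0.1 1.1 1.111
# 0.12 1.12 1.136
```

At these rates the two estimates differ by about a cent on the dollar, so the first-order number is a fair shorthand; the gap only matters for models with much higher rejection rates.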

How this stacks up against just the video model

For raw per-second video-model pricing as of mid-20

Model	Posted price
Seedance 1.5 Pro	~$0.025/sec
Kling 3.0	~$0.029/sec
Runway Gen-4 Turbo	~$0.05/sec
Sora 2 base	~$0.10/sec
Veo 3.1 Fast	~$0.10–$0.15/sec
Veo 3.1 Standard	~$0.40/sec
Sora 2 Pro	~$0.30–$0.50/sec
WAN 2.1 i2v (720p, ≤81 frames)	$0.40 / clip
WAN 2.1 i2v (720p, 82–100 frames)	$0.50 / clip

A note on WAN pricing: fal bills it per clip, not per second. At 720p the base is $0.40 per clip, but clips over 81 frames incur a 1.25× multiplier — $0.50 per clip. My pipeline requests 100 frames per scene (~6.25 seconds raw) so I'm in the multiplier band. A clip generates 5-6 seconds of footage but I pay the same regardless of how short the finished cut is. Roundup posts that quote $0.04–$0.08/sec for WAN are usually referring to the 480p variant ($0.20/clip, halving the price) or dividing $0.40/clip by the maximum frame count rather than what the model actually emits at default settings.

So in raw video-model terms my pipeline is mid-range — about Sora 2 base, cheaper than Veo Standard, more expensive than Kling. The all-in cost works out to roughly 1.5–2× the video model alone. That overhead is structural to a multi-model personalization stack: keyframe conditioning, per-character training, voice, vision, brief.

Migrating to Seedance would save ~$0.04/sec on the video line and leave the other overhead untouched. The optimization question isn't "which cheaper video model?" — it's "how much can I cut from the wrapper around it?"

What I'd actually optimize

Honest, in rough order of impact:

Cut augmentation calls. Biggest non-video line item and the most room. Replacing the 35-image Kontext-Pro augmentation set with a Flux LoRA training pass directly on the source + selected style image would save ~$1.40 per character. The trade-off is real (less expression range in the LoRA) — that's the next A/B I want to run.
Drop WAN num_frames from 100 to 80. Per fal's posted pricing, 720p clips over 81 frames pay a 1.25× multiplier — so I'm paying $0.50 when I could be paying $0.40. The audio-aware trim downstream means my finished scenes rarely exceed 5 seconds anyway. Net savings: $0.40 per video, ~$0.025/finished second. This is the easiest unit-economic win in the whole stack and I have no excuse for not having shipped it.
Raise videos-per-character. The savings are front-loaded: the second video on an existing character drops the cost by about $0.08 per generated second, while going from 5 to 10 videos saves only ~$0.02. Product features that bring users back to existing characters are a direct unit-economic lever, and cheaper than optimizing models.
Don't touch TTS. Kokoro is $0.005/scene. Anything cheaper would be rounding error and Kokoro sounds better than the alternatives at this price.

What I would not spend time on: switching the video model. Cheaper options exist, but the headroom isn't in the model; it's in the conditioning and training around it.

Try it if you want

If you want to see what $0.19/finished second looks like as an actual video — and figure out whether my math is right — the product is at atveanimation.com. Upload a photo, pick a style, hit generate. Free tier gives you 10 scenes a day, which is enough for two short videos.

If you find a way to crash the augmentation step or rack up a $20 bill on a single account, please tell me. I'd genuinely like to know.

Closing thought

I'm not going to pretend this teardown is novel — anyone with a billing dashboard and a calculator can produce one. But I hadn't seen one written publicly for a WAN + Kontext + flux-lora + Kokoro stack, and the augmentation surprise (3.5× the cost of the LoRA training it feeds) was non-obvious enough to me, after a year of building this, that it probably warranted writing down.

If you spot something off in the numbers, the comments are open. I'll fix the post rather than defend it.

Posted from my own desk on a Friday afternoon. Numbers reconcile to the nearest cent; if they don't reconcile to yours I'd love to know why.

Top comments (4)

Alex Shev • Jun 27

Cost breakdowns are useful because they expose the hidden retry tax. The sticker price per generation is only one layer. The real production cost includes bad takes, continuity fixes, prompt iteration, upscaling, editing, and the human time spent deciding what is good enough. That is where budgets drift.

Kumar K Jha • Jun 27

This is exactly the layer the post doesn't get into, and you're right that it's where budgets actually drift. The $0.19/sec is the bill from fal/Replicate. It's not what it cost me to produce a video I was actually happy with.

A few categories I tried to design around, with mixed success:

1. Scene-level preview before stitching. Each scene renders its first frame before the WAN video call, so you can reject early at $0.05 instead of paying the full $0.55 to find out the character drifted. Cuts the retry tax meaningfully but doesn't eliminate it.

2. Anchor scenes for continuity. Scene 0 becomes the reference image for every subsequent scene's Kontext call, which kills the "why does her hair color change in scene 3" reroll category. Took me about 4 iterations to land on this. Before it existed I was burning 2-3 retries per multi-scene video on continuity alone.

The category I have no good answer for is script iteration. People rewrite the brief 3-4 times before they like it, and each rewrite regenerates everything downstream. That's pure retry tax and I haven't found a clean way around it short of making script edits cheap to preview without re-running the full pipeline. Open to ideas if anyone's solved this.

The 10-12% reroll number in the post is model-level retries (bad WAN clip, Kokoro timeout, fal queue hiccup). Human-decision rerolls are a separate number I don't have clean data on yet. Curious what that multiplier looks like on other people's stacks - is the real bill 2x the sticker, 3x, more?

Alex Shev • Jun 28

That early reject layer is the real cost control. The billed model seconds are easy to count, but the expensive part is letting a bad scene travel too far down the pipeline before anyone notices. Agent workflows need the same checkpoint design: cheap validation before the heavy call, not heroic cleanup after the expensive step.

Kumar K Jha • Jun 28

Yes, and the failure mode I keep hitting is that the cheap validator turns out to lie. First-frame preview catches character drift at $0.05, but it doesn't catch the motion artifacts that only show up in the full WAN clip. So the layer reduces retry tax in one dimension and is blind in another.
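A rough way to see "reduces retry tax in one dimension and is blind in another" is to model two failure types. This is a sketch, not measured data: the drift and motion failure rates are hypothetical, the function name is mine, and it assumes a rejected attempt restarts the scene from the keyframe. Only the $0.05 preview and $0.55 full-scene costs come from this thread.

```python
KEYFRAME_COST = 0.05   # first-frame preview (from the thread)
VIDEO_COST = 0.50      # WAN clip; keyframe + clip = the $0.55 full scene


def expected_scene_cost(p_drift, p_motion, gated):
    """Expected USD spent per accepted scene.

    p_drift:  chance the keyframe shows character drift (visible in preview)
    p_motion: chance the clip has motion artifacts (visible only after WAN)
    gated:    if True, drifted keyframes are rejected before the video call
    """
    # Probability that a single attempt ends in an accepted scene.
    p_success = (1 - p_drift) * (1 - p_motion)
    if gated:
        # The video call is only paid for when the keyframe passes the gate.
        cost_per_attempt = KEYFRAME_COST + (1 - p_drift) * VIDEO_COST
    else:
        cost_per_attempt = KEYFRAME_COST + VIDEO_COST
    # Expected total = expected cost per attempt × expected attempts (1 / p_success).
    return cost_per_attempt / p_success


# Hypothetical rates: 20% drift, 10% motion artifacts.
print(round(expected_scene_cost(0.2, 0.1, gated=True), 3))   # 0.625
print(round(expected_scene_cost(0.2, 0.1, gated=False), 3))  # 0.764
```

In this model the gate removes the cost of drift failures but leaves the motion-artifact term untouched: with p_drift set to zero, the gated and ungated costs are identical.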

The version that works cleanly is when the validator is a genuinely different signal from the heavy call, not a cheaper approximation of it. Brief preview catches script issues for free. Style preview ($0.16) catches style mismatch before you commit to LoRA training ($0.40). "Cheap proxy of the expensive thing" lies. "Different signal entirely" doesn't.

The unsolved version is video QA. No cheap automated "is this clip good" signal exists. Curious if you've found a checkpoint pattern that's actually cheap AND catches the real failure modes, not just the trivial ones.