DEV Community

AI Alleyway


Two architectures for "script to video", and why the credit meter follows from the design

Two AI video tools take the same input — a short script — and hand back the same shape of output: a captioned vertical clip with a voiceover. Feed the same brief to both and one produces a clip for the equivalent of about 2 credits; the other, on its premium setting, burns about 40 credits for a single 30-second clip on a 75-credit monthly plan.

A 20× spread on identical-looking output is the kind of thing that looks like arbitrary pricing until you look at the pipeline. It isn't arbitrary. The credit meter is a projection of the architecture, and once you see the two designs, the prices — and the free-tier policies, and where each tool spends its quality budget — all fall out of the diagram.

I tested both (Fliki and InVideo) hands-on for a comparison. To be clear about scope up front: I can't see either company's source. What follows is the architecture the observable behavior implies, based on the models each one exposes and the credit costs I actually watched tick down. Treat it as a systems read, not an internal spec.

Pipeline A: assemble a voice, backfill the picture

Fliki's design is a text-to-speech assembly pipeline. Trace a script through it and the stages look roughly like this:

  1. Parse the script and segment it into scenes.
  2. Synthesize narration from a large TTS voice library — 2,000+ voices across 80+ languages.
  3. Time captions to the audio.
  4. Backfill each scene's background from stock or an AI-generated still.
  5. Mux audio + captions + background into a clip.

The thing to notice is where the cost and the differentiation live. Every stage except one is cheap and mostly deterministic — segmentation, caption timing, and asset lookup are database-and-glue work. The one stage that's both expensive and the actual product is TTS synthesis. That's why Fliki pours its quality budget into the voice library and lets the visuals stay basic: the visuals are a lookup, the voice is the inference.
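To make the stage breakdown concrete, here is a minimal sketch of pipeline A written as data. The stage order follows the list above; the `Stage` class, the `kind` labels, and the `inference_stages` helper are illustrative assumptions, not anything Fliki publishes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    kind: str  # "glue" (lookup / deterministic) or "inference" (model call); labels are assumptions


# Stage order mirrors the observed behavior described above, not Fliki's internals.
PIPELINE_A = [
    Stage("segment_script", "glue"),
    Stage("synthesize_narration", "inference"),
    Stage("time_captions", "glue"),
    # Treated as a lookup here, as the article does; an AI-generated still
    # would be a small inference call of its own.
    Stage("backfill_background", "glue"),
    Stage("mux_clip", "glue"),
]


def inference_stages(pipeline):
    """Return the names of stages that run a model rather than a lookup."""
    return [s.name for s in pipeline if s.kind == "inference"]


print(inference_stages(PIPELINE_A))  # ['synthesize_narration']
```

Under these labels, only one stage is a model call, which is the stage the quality budget goes to.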

TTS is also, in compute terms, cheap inference relative to what's coming in pipeline B. A few seconds of neural speech is orders of magnitude less GPU time than a few seconds of generated video. So the marginal cost of one Fliki clip is close to flat, and low. Two consequences fall straight out:

  • The free tier can actually export a finished (watermarked, 720p) video. A free export costs Fliki almost nothing, so it can afford to give one away as a real test drive.
  • The paid floor is low — about $8/month for the entry plan — because the pipeline it's amortizing is cheap.

Pipeline B: generate the footage, then narrate

InVideo's design is a generative-model orchestration layer. The stages:

  1. Take a one-line prompt and expand it into a storyboard/plan (an LLM step).
  2. For each scene, call a text-to-video model — it reaches Veo 3.1, Sora 2, Kling, and Seedance, 200+ models in all — to generate original footage.
  3. Generate a voiceover.
  4. Assemble.

Here the expensive stage isn't the voice — it's step 2, and it's expensive by a different order of magnitude. Generating a few seconds of novel video from a frontier diffusion/video model is among the most compute-heavy inferences you can buy right now. That single stage dominates the cost function so completely that everything else rounds to zero.

Now the 20× makes sense. When InVideo pulls a stock clip, step 2 degrades to a lookup and the clip costs ~2 credits — same class of operation as Fliki's backfill. When it generates a premium Veo/Sora clip, you're paying for GPU-seconds of a frontier model, and that's the ~40-credit clip. Same tool, same UI, two completely different cost regimes depending on whether step 2 retrieves or generates.
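The two cost regimes can be sketched as a single branch on how step 2 runs. The credit figures are the approximate ones I observed; the function name and the mode names are hypothetical, chosen only for illustration.

```python
# Approximate credit costs observed in testing; names are hypothetical.
STEP2_CREDITS = {
    "stock": 2,        # step 2 retrieves an existing clip
    "generative": 40,  # step 2 calls a frontier text-to-video model
}


def clip_credits(mode: str) -> int:
    """Approximate credits for one clip, driven by how step 2 runs."""
    if mode not in STEP2_CREDITS:
        raise ValueError(f"unknown mode: {mode!r}")
    return STEP2_CREDITS[mode]


spread = clip_credits("generative") / clip_credits("stock")
print(spread)  # 20.0
```

Nothing else in the pipeline appears in the function, which is the point: in this model the other stages round to zero.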

And the same two consequences invert:

  • The free tier cannot export a usable video, because a single free generation is real, non-trivial GPU cost — you can't give that away the way you give away a TTS clip.
  • The paid floor is higher (about $20/month entry) and the meter is a monthly pool of ~75 credits that one ambitious clip can gut, rather than the slow-draining yearly pool an assembly tool can offer.
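A quick arithmetic sketch shows how fast the monthly pool drains. The numbers are the approximate figures quoted above; the helper names are made up for this example.

```python
MONTHLY_POOL = 75      # approximate credits on the entry plan
GENERATIVE_CLIP = 40   # approximate credits for one premium 30-second clip
STOCK_CLIP = 2         # approximate credits for one stock-backed clip


def clips_per_pool(pool: int, cost_per_clip: int) -> int:
    """Whole clips that fit in a credit pool."""
    return pool // cost_per_clip


def pool_share(pool: int, cost_per_clip: int) -> float:
    """Fraction of the pool consumed by a single clip."""
    return cost_per_clip / pool


print(clips_per_pool(MONTHLY_POOL, GENERATIVE_CLIP))        # 1
print(clips_per_pool(MONTHLY_POOL, STOCK_CLIP))             # 37
print(round(pool_share(MONTHLY_POOL, GENERATIVE_CLIP), 2))  # 0.53
```

One premium clip takes just over half the pool, and a second one does not fit in the same month.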

The meter is a shadow of the pipeline

Put the two side by side and the pricing stops looking like a marketing decision and starts looking like an accounting identity:

Observable             | Fliki (assembly)     | InVideo (generation)
-----------------------+----------------------+--------------------------------
Dominant-cost stage    | TTS synthesis        | per-scene video generation
Marginal cost per clip | low, ~flat           | low for stock, high for generative
Credit cost, one clip  | ~2-equivalent        | ~2 stock / ~40 generative
Free tier exports?     | yes (cheap pipeline) | no (a free gen is real GPU cost)
Meter shape            | slow yearly pool     | monthly pool one clip can drain
Entry price            | ~$8/mo               | ~$20/mo
Where quality is spent | the voice            | the footage

None of the right-hand column is a pricing "choice" in isolation. It's what the generative pipeline costs to operate, expressed as credits. The left column is the same story for a cheap pipeline.

This generalizes past these two tools, and it's the genuinely useful part if you build or buy in this space: find the dominant-cost stage of a pipeline and you've predicted its pricing model, its free-tier policy, and where it spends quality. A tool whose most expensive stage is still cheap (retrieval, or light inference such as TTS) will have a generous free tier and a flat, low meter. A tool whose expensive stage is frontier-model inference will gate the free tier and meter aggressively, because the unit economics don't allow anything else. When a pricing page confuses you, reverse the arrow: ask what the tool must be spending compute on, and the meter usually explains itself.
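As a rough sketch, the heuristic reduces to a lookup from dominant-cost stage to pricing shape. The category names and the returned fields are hypothetical labels for this example, not an established taxonomy, and the mapping encodes only the pattern argued here.

```python
def predict_pricing(dominant_stage: str) -> dict:
    """Map a pipeline's dominant-cost stage to the pricing shape argued above.

    The category names are hypothetical labels chosen for this sketch.
    """
    cheap = {"retrieval", "light_inference"}  # e.g. stock lookup, TTS
    if dominant_stage in cheap:
        return {"free_export": True, "meter": "flat and low"}
    if dominant_stage == "frontier_inference":  # e.g. text-to-video generation
        return {"free_export": False, "meter": "aggressive"}
    raise ValueError(f"unknown stage category: {dominant_stage!r}")


print(predict_pricing("light_inference"))
print(predict_pricing("frontier_inference"))
```

It is a heuristic, not a law: a vendor can subsidize an expensive stage for a while, but the mapping is the default to expect.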

For the buyer, the practical read is the same one the architecture predicts: if the voice carries your video, the assembly pipeline is the efficient match and you'll pay a low, predictable meter. If the footage is the product, you're buying GPU-time-as-credits and your job is to ration the generative stage — treat the 40-credit clip as a deliberate spend, not a default.

Both, incidentally, export clean 1080p on their paid plans, which is the tell that the resolution was never the differentiator — what fills the frame is. One holds a generic AI still; the other holds generated footage. That's the whole 20×.

I scored the two 4.3 and 4.2 respectively when I tested them — close, because the decision was never a scoreboard. It is one question: does the voice or the picture carry your video?

But the pricing itself you can now read straight off the diagram: cheap pipeline, cheap meter, free export; expensive pipeline, expensive meter, no free export. The credit gap was the architecture talking the whole time.
