A few months ago, I was working on a pet project.
YumCut is an end-to-end service for creating short vertical videos: from writing the text and generating images to editing and adding subtitles.
A critical problem showed up quickly: cost. One minute of video required about twenty generated images, which at roughly $0.04 per image comes to $0.80 per minute. Besides the visuals, you also need to generate audio - another $0.20 per minute - plus minor additional costs for editing and subtitle generation.
I started looking for a way out. This article is about the unconventional techniques that cut the cost severalfold, and about an open-source solution that makes image generation eight times cheaper than commercial APIs. Full code and instructions are available on GitHub.
First approach: multiple scenes in one frame
The logical solution seemed obvious: generate several images in a single request by placing scenes next to each other. In theory, this should reduce costs proportionally to the number of images.
The first attempt was to put all eight scenes into the prompt at once. The result was disastrous: the model simply mixed all elements into one blurry composition, unusable for video editing.
By reducing the number of scenes to two per image, I got an acceptable result. That already cut the cost in half, but it was still far from the target.
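As an aside, the splitting step on the receiving end is trivial. Here is a minimal sketch, assuming the model lays the two scenes out side by side in one image, using Pillow to crop the result back into separate frames (file names are illustrative):

```python
from PIL import Image

def split_scenes(path: str) -> list[Image.Image]:
    """Crop a side-by-side two-scene image into separate frames."""
    combined = Image.open(path)
    w, h = combined.size
    left = combined.crop((0, 0, w // 2, h))    # first scene
    right = combined.crop((w // 2, 0, w, h))   # second scene
    return [left, right]

for i, frame in enumerate(split_scenes("combined.png")):
    frame.save(f"scene_{i}.png")
```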
Key insight: borders must be literally visible
It turned out that AI models struggle to determine logical boundaries between separate areas. The solution was simple: use colored zones (red and blue).
Instead of an abstract description, I started sending a PNG template with clear borders and a matching instruction: "The first idea is in the red area, the second idea is in the blue area. Fill each area completely."
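The template itself takes a few lines to produce. A minimal sketch with Pillow; the canvas size and the left/right split are my choices for the two-scene case, not requirements:

```python
from PIL import Image, ImageDraw

WIDTH, HEIGHT = 1024, 1024  # illustrative canvas size

template = Image.new("RGB", (WIDTH, HEIGHT))
draw = ImageDraw.Draw(template)
draw.rectangle((0, 0, WIDTH // 2, HEIGHT), fill="red")       # zone for scene 1
draw.rectangle((WIDTH // 2, 0, WIDTH, HEIGHT), fill="blue")  # zone for scene 2
template.save("template.png")

INSTRUCTION = (
    "The first idea is in the red area, the second idea is in the blue area. "
    "Fill each area completely."
)
```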
This technique did not require additional costs, but it dramatically improved the quality of scene separation. The model now understood the structure and rarely mixed elements.
However, this was only a partial solution: I still needed to cut the price by another order of magnitude.
Second approach: migrating to open-source alternatives
The idea was to find open-source/open-weight image-generation models, run them in the cloud, and reduce costs that way.
First, I had to identify which models were freely available. There are many: Qwen-Image, FLUX, HunyuanImage, Stable Diffusion, and others. For my use case, I had one extra requirement - the ability to reuse characters across many generated images. That is why I chose Qwen-Image-Edit.
I audited the market of commercial generators:
Major APIs (OpenAI/Google Gemini/Stability AI): similar prices or higher
Alibaba cloud services: about $0.04 per image - roughly the same
Self-hosted options like RunPod: you need a large number of images per run to reach meaningful savings
The picture was disappointing - even the creators of Qwen-Image, Alibaba, were offering the model at inflated prices. But then I found runware.ai and together.ai, where generating images with Qwen-Image-Edit and Qwen-Image was almost eight times cheaper than Nano Banana - ~$0.005 vs. $0.04.
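For a sense of the integration effort, here is a minimal sketch against together.ai's OpenAI-compatible image endpoint. The model identifier and the response shape are assumptions on my part - verify both against the provider's current docs (runware.ai uses its own API format):

```python
import os
import requests

# Assumptions: Together's OpenAI-compatible images endpoint and an
# illustrative model identifier - check both against the current docs.
resp = requests.post(
    "https://api.together.xyz/v1/images/generations",
    headers={"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"},
    json={
        "model": "Qwen/Qwen-Image",  # hypothetical model id
        "prompt": (
            "The first idea is in the red area, the second idea is in the "
            "blue area. Fill each area completely. Red: a happy cat surfing. "
            "Blue: a happy cat building a sandcastle."
        ),
        "width": 1024,
        "height": 1024,
        "n": 1,
    },
    timeout=120,
)
resp.raise_for_status()
image_url = resp.json()["data"][0]["url"]  # assumed response shape
print(image_url)
```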
Third approach: improving image detail
There was a catch, though: the cheap model produced noticeably more uniform images - all scenes looked too similar to each other.
Here is an example of generated images with different prompts about a happy cat on the beach:
Even though the images look good, a layer for improving the story prompts is missing. The obvious solution is to add that layer between the story prompt and the image generator - but it has to be an LLM that does not increase the cost significantly.
To test many LLMs, I used openrouter.ai: once you write the wrapper code, you can switch to any available model. After testing a dozen models, I settled on openai/gpt-oss-120b with low reasoning effort. An improved image-generation prompt costs about $0.0003, and the images above turn into results like these.
The images became more diverse even with nearly the same prompt. GPT OSS improved the description for Qwen-Image, and you now have a lever that lets you control the style and mood of the images.
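The wrapper itself is short, since OpenRouter exposes an OpenAI-compatible API and the stock SDK works against it. A minimal sketch; the system prompt is my own, and the reasoning-effort field is an assumption about how OpenRouter accepts that setting, so check their docs:

```python
import os
from openai import OpenAI

# OpenRouter is OpenAI-compatible, so the stock SDK works with a custom base_url.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

def enrich_prompt(story_prompt: str) -> str:
    """Expand a terse story beat into a detailed image-generation prompt."""
    resp = client.chat.completions.create(
        model="openai/gpt-oss-120b",
        messages=[
            {
                "role": "system",
                "content": (
                    "Rewrite the scene description as a vivid, specific "
                    "prompt for an image generator: concrete composition, "
                    "lighting, mood, and style."
                ),
            },
            {"role": "user", "content": story_prompt},
        ],
        # Assumption: OpenRouter accepts reasoning effort via this field.
        extra_body={"reasoning": {"effort": "low"}},
    )
    return resp.choices[0].message.content

print(enrich_prompt("a happy cat on the beach"))
```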
Results: numbers that speak for themselves
Combining the cheap open-weight model with two scenes per frame cuts the price by another factor of two: a minute of video now needs ten generated images at ~$0.005 each, about $0.05/min versus the original $0.80/min. The trade-off is occasional artifacts along the line where the combined image is split.
This price was good enough for me, so I stopped there.