Aleksei Kim

How to vibe-code an AI SaaS, fix Google Cloud bugs, and hit an explosive $0 MRR.

I am a Vue/Nuxt frontend dev. On the side, while keeping my day job, I built
Reelsub, an automatic karaoke caption tool for short vertical videos (Reels,
Shorts, TikTok). It lives at reelsub.app. First I built it for myself, then I
put it out in public. Solo. No team, no investor, no designer. I vibe-coded
about 99 percent of it and ran the infrastructure on Google Cloud.

Why a captions tool

I post short videos on TikTok, Instagram and YouTube, so I edit them all the
time. Captions hold watch time: most people scroll the feed on mute, and
on-screen text is what catches them. Existing editors drove me crazy. Pay for
this, pay for that, and if you do not pay you get a 720p export with a watermark
across half the face. I wanted something simple: upload a clip, get accurate
captions, download it with no watermark, free at least a few times, with every
spoken word highlighted exactly on beat. That karaoke effect was the whole point.

The first version lived under my bed

The prototype ran locally: speech recognition was Whisper on the RTX 4070 in my
home machine. It worked for me, and I even built a small UI so I did not have to
run commands by hand. Then it hit me: I am a developer, I can ship this. So the
home hack became a service you open in a browser.

Everything runs on Google Cloud, in one region in western Europe. That is on
purpose. Speech recognition and video rendering sit in the same region so data
does not travel across the planet between them. One region means less latency
and lower inter-service traffic cost.

[Image: actual infrastructure costs to date]

Rendering runs on Cloud Run. It scales up under load and shuts down when there
is nothing to process, so I do not pay for idle hardware. For a solo project
with no revenue that is gold: the infra scales itself and the bill only grows
when someone actually renders.

In production the transcription is Google Speech-to-Text, the chirp_2 model. It
returns per-word timings, and without those you cannot sync captions. That is
the key to the karaoke effect, accurate to the millisecond.
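
To make the role of per-word timings concrete, here is a minimal sketch of how a karaoke highlight can be driven from them. The `Word` shape and the `active_word` helper are hypothetical names for illustration, not Reelsub's code or the Speech-to-Text response format; the only thing assumed from the API is a start and end time per word.

```python
from __future__ import annotations

from bisect import bisect_right
from dataclasses import dataclass
from typing import Optional


@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the clip
    end: float


def active_word(words: list[Word], t: float) -> Optional[int]:
    """Return the index of the word spoken at time t, or None during a pause."""
    starts = [w.start for w in words]
    i = bisect_right(starts, t) - 1  # last word that started at or before t
    if i >= 0 and t < words[i].end:
        return i
    return None


words = [Word("upload", 0.00, 0.42), Word("a", 0.42, 0.50), Word("clip", 0.55, 0.98)]
print(active_word(words, 0.45))  # index of "a"
print(active_word(words, 0.52))  # None: the gap between "a" and "clip"
```

A renderer would call something like this once per frame and style the returned word differently. Without start and end times for each word there is nothing to look up, which is why the timings matter so much.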

Google gave me 300 dollars in free credit. I have burned less than 100 so far,
so I still have runway. The plan is simple: reach the first customers before the
credit runs out. Even if it does run out, the most expensive piece is a small
VDS; everything else lives in the cloud and costs pennies.

The painful part: burning captions with ffmpeg

This almost killed the project. A cloud GPU is too expensive for a tool with no
customers. My VDS CPU cannot handle many users at once. I even considered
rendering in the browser with WebGL, but that is endless debugging across
devices and it just dies on old Android. I landed on Cloud Run: rendering moved
to the cloud, scales on its own, and my VDS is free again. It is still the
slowest step, around 1.2 seconds per second of video, slower than real time, and
I live with that for now.
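
For a sense of what 1.2 seconds per second of video means in practice, here is a rough back-of-the-envelope sketch. It assumes render time scales linearly with clip length, which the post does not claim; the clip lengths are illustrative.

```python
RENDER_SECONDS_PER_VIDEO_SECOND = 1.2  # figure quoted in the post


def estimated_render_seconds(clip_seconds: float) -> float:
    """Estimate render time, assuming it grows linearly with clip length."""
    return clip_seconds * RENDER_SECONDS_PER_VIDEO_SECOND


for clip in (15, 30, 60):
    print(f"{clip:>3}s clip: about {estimated_render_seconds(clip):.0f}s of rendering")
```

Under that assumption a one-minute clip takes a bit over a minute to burn, which is tolerable for a short-video tool but would add up quickly on longer footage.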

The real engineering: collapsed timings

This is where vibe-coding ended and engineering began. chirp_2 sometimes
collapses timings and returns a batch of words on a single timestamp. On screen
the captions either flash all at once or freeze. For a captions tool that is
death. Feeding the file to an LLM did not help: the models did not actually know
what was right. So I wrote a simple fix myself: when timings collapse, I take the
two nearest correct timestamps and spread the words evenly between them. On real
speech you cannot see the seam.
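
The idea above can be sketched in a few lines. This is an illustration under assumptions, not the code running in Reelsub: a "collapsed" run is taken to be two or more consecutive words with the same start time, the left anchor is the end of the last good word, the right anchor is the start of the next good word, and `Word`, `spread_collapsed` and `clip_end` are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Word:
    text: str
    start: float  # seconds
    end: float


def spread_collapsed(words: list[Word], clip_end: float) -> list[Word]:
    """Re-space runs of words that share one start time (hypothetical helper)."""
    out = list(words)
    i = 0
    while i < len(words):
        # Extend j to cover every following word stuck on the same timestamp.
        j = i
        while j + 1 < len(words) and words[j + 1].start == words[i].start:
            j += 1
        if j > i:
            # Nearest trusted timestamps on either side of the collapsed run.
            left = words[i - 1].end if i > 0 else 0.0
            right = words[j + 1].start if j + 1 < len(words) else clip_end
            if right > left:  # leave the run alone if the anchors are unusable
                slot = (right - left) / (j - i + 1)
                for k in range(i, j + 1):
                    start = left + (k - i) * slot
                    out[k] = replace(words[k], start=start, end=start + slot)
        i = j + 1
    return out


words = [
    Word("so", 0.5, 1.0),
    Word("I", 2.5, 2.5), Word("wrote", 2.5, 2.5), Word("a", 2.5, 2.5),  # collapsed
    Word("fix", 4.0, 4.5),
]
for w in spread_collapsed(words, clip_end=5.0):
    print(f"{w.text:<6} {w.start:.2f} -> {w.end:.2f}")
```

Equal slots ignore word length, so a long word gets the same time as a short one. Weighting each slot by character count would be an obvious refinement, but even spacing matches the fix described here.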

Takeaway: vibe-coding speeds up maybe 80 percent of the work. The other 20
percent is the core of the product, you build it by hand, and that is what
matters. Real production experience is what saves you there.

Honest traction

[Image: UI of the main page]

Five users. Friends and family. People do not show up even for free. I posted in
chats and on Threads, silence so far. Building the product turned out to be
easier than getting people to try it. That is part of the story, not something
to hide.

Why free and no card

I hate how big companies do trials: one nanosecond of access, or a card upfront.
I want to give people a real free trial, no watermark and no card. Value first,
money talk later.

Reelsub does automatic captions for Reels, Shorts and TikTok, no watermark, free
to start: reelsub.app. Open it, poke around, break it. Feedback means a lot.
