<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Aleksei Kim</title>
    <description>The latest articles on DEV Community by Aleksei Kim (@alyosha_kim).</description>
    <link>https://dev.to/alyosha_kim</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3977286%2F4f4c9872-d254-4430-a693-5d83eea10015.png</url>
      <title>DEV Community: Aleksei Kim</title>
      <link>https://dev.to/alyosha_kim</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/alyosha_kim"/>
    <language>en</language>
    <item>
      <title>How to vibe-code an AI SaaS, fix Google Cloud bugs, and hit an explosive $0 MRR.</title>
      <dc:creator>Aleksei Kim</dc:creator>
      <pubDate>Wed, 10 Jun 2026 12:00:26 +0000</pubDate>
      <link>https://dev.to/alyosha_kim/how-to-vibe-code-an-ai-saas-fix-google-cloud-bugs-and-hit-an-explosive-0-mrr-1p85</link>
      <guid>https://dev.to/alyosha_kim/how-to-vibe-code-an-ai-saas-fix-google-cloud-bugs-and-hit-an-explosive-0-mrr-1p85</guid>
      <description>&lt;p&gt;I am a Vue/Nuxt frontend dev. On the side, while keeping my day job, I built&lt;br&gt;
Reelsub, an automatic karaoke caption tool for short vertical videos (Reels,&lt;br&gt;
Shorts, TikTok). It lives at reelsub.app. First I built it for myself, then I&lt;br&gt;
put it out in public. Solo. No team, no investor, no designer. I vibe-coded&lt;br&gt;
about 99 percent of it and ran the infrastructure on Google Cloud.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why a captions tool&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I post short videos on TikTok, Instagram and YouTube, so I edit them all the&lt;br&gt;
time. Captions hold watch time: most people scroll the feed on mute, and&lt;br&gt;
on-screen text is what catches them. Existing editors drove me crazy. Pay for&lt;br&gt;
this, pay for that, and if you do not pay you get a 720p export with a watermark&lt;br&gt;
across half the face. I wanted something simple: upload a clip, get accurate&lt;br&gt;
captions, download it with no watermark, free at least a few times, with every&lt;br&gt;
spoken word highlighted exactly on beat. That karaoke effect was the whole point.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The first version lived under my bed&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The prototype ran locally: speech recognition with Whisper on the RTX 4070 in&lt;br&gt;
my home machine. It worked for me, and I even built a small UI so I did not&lt;br&gt;
have to run commands by hand. Then it hit me: I am a developer, I can ship&lt;br&gt;
this. So the home hack became a service you open in a browser.&lt;/p&gt;

&lt;p&gt;Everything runs on Google Cloud, in one region (West Europe). That is on&lt;br&gt;
purpose. Speech recognition and video rendering sit in the same region so data&lt;br&gt;
does not travel across the planet between them. One region means less latency&lt;br&gt;
and lower inter-service traffic cost.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F884h61ms3afhhp943ry3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F884h61ms3afhhp943ry3.png" alt="actual costs of infrastructure to date " width="800" height="460"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Rendering runs on Cloud Run. It scales up under load and shuts down when there&lt;br&gt;
is nothing to process, so I do not pay for idle hardware. For a solo project&lt;br&gt;
with no revenue that is gold: the infra scales itself and the bill only grows&lt;br&gt;
when someone actually renders.&lt;/p&gt;

&lt;p&gt;In production the transcription is Google Speech-to-Text, the chirp_2 model. It&lt;br&gt;
returns per-word timings, and without those you cannot sync captions. That is&lt;br&gt;
the key to the karaoke effect, accurate to the millisecond.&lt;/p&gt;
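
&lt;p&gt;To see why per-word timings matter, here is a rough Python sketch of how they&lt;br&gt;
could drive the highlight. It assumes the captions are written as an ASS&lt;br&gt;
subtitle line, where each \k tag holds a duration in centiseconds. The post does&lt;br&gt;
not say which format Reelsub actually uses, and the function names and the word&lt;br&gt;
dicts (word, start, end in seconds) are made up for the example.&lt;/p&gt;

```python
def ass_time(seconds):
    """Format seconds as an ASS timestamp, H:MM:SS.cc (centiseconds)."""
    cs = round(seconds * 100)
    h, rem = divmod(cs, 360000)
    m, rem = divmod(rem, 6000)
    s, c = divmod(rem, 100)
    return f"{h}:{m:02d}:{s:02d}.{c:02d}"


def karaoke_line(words):
    """Build one ASS Dialogue line with a \\k tag per word.

    words: list of dicts with "word", "start", "end" (assumed shape).
    """
    line_start = words[0]["start"]
    line_end = words[-1]["end"]
    # Boundaries in centiseconds: every word start, then the end of the line.
    # Rounding the boundaries (not the durations) keeps rounding drift out.
    bounds = [round(w["start"] * 100) for w in words]
    bounds.append(round(line_end * 100))
    parts = []
    for w, a, b in zip(words, bounds, bounds[1:]):
        # Each word stays highlighted until the next word begins.
        parts.append("{\\k%d}%s" % (b - a, w["word"]))
    text = " ".join(parts)
    return (
        f"Dialogue: 0,{ass_time(line_start)},{ass_time(line_end)},"
        f"Default,,0,0,0,,{text}"
    )
```

&lt;p&gt;Each \k duration runs from one word's start to the next word's start, so a&lt;br&gt;
pause between words does not push the highlight off beat.&lt;/p&gt;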

&lt;p&gt;Google gave me 300 dollars in credit that I do not have to pay back. I have&lt;br&gt;
burned less than 100 so far, so I still have runway. The plan is simple: reach&lt;br&gt;
the first customers before the credit runs out. Even if it does run out, the&lt;br&gt;
most expensive piece is a small VDS; everything else lives in the cloud and&lt;br&gt;
costs pennies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The painful part: burning captions with ffmpeg&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This almost killed the project. A cloud GPU is too expensive for a tool with no&lt;br&gt;
customers. My VDS CPU cannot handle many users at once. I even considered&lt;br&gt;
rendering in the browser with WebGL, but that is endless debugging across&lt;br&gt;
devices and it just dies on old Android. I landed on Cloud Run: rendering moved&lt;br&gt;
to the cloud, it scales on its own, and my VDS is free again. It is still the&lt;br&gt;
slowest step, at around 1.2 seconds of rendering per second of video, so a&lt;br&gt;
30-second clip takes about 36 seconds. I live with that for now.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The real engineering: collapsed timings&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is where vibe-coding ended and engineering began. chirp_2 sometimes&lt;br&gt;
collapses timings and returns a batch of words on a single timestamp. On screen&lt;br&gt;
the captions either flash all at once or freeze. For a captions tool that is&lt;br&gt;
death. Feeding the file to an LLM did not help: the models did not actually know&lt;br&gt;
which timings were right. So I wrote a simple fix myself: when timings collapse,&lt;br&gt;
I take the nearest correct timestamp on each side and spread the words evenly&lt;br&gt;
between them. On real speech you cannot see the seam.&lt;/p&gt;
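
&lt;p&gt;Here is a minimal sketch of that idea in Python. It is an illustration, not&lt;br&gt;
the production code: the word format (dicts with word, start and end in&lt;br&gt;
seconds), the function name spread_collapsed and the clip_end parameter are&lt;br&gt;
assumptions made for the example. A collapsed batch is taken to be a run of&lt;br&gt;
consecutive words that share one start time.&lt;/p&gt;

```python
def spread_collapsed(words, clip_end):
    """Return a copy of words with collapsed runs spread evenly.

    words:    list of dicts with "word", "start", "end" (assumed shape).
    clip_end: clip length in seconds, used only when a collapsed run
              sits at the very end and has no good word after it.
    """
    fixed = [dict(w) for w in words]  # do not mutate the caller's data
    n = len(fixed)
    i = 0
    while i != n:
        # Find the end of the run of words sharing fixed[i]'s start time.
        j = i + 1
        while j != n and fixed[j]["start"] == fixed[i]["start"]:
            j += 1
        run = j - i
        if run >= 2:
            # Nearest trusted timestamps on each side of the run.
            left = fixed[i - 1]["end"] if i else 0.0
            right = fixed[j]["start"] if j != n else clip_end
            step = (right - left) / run
            for k in range(run):
                fixed[i + k]["start"] = left + k * step
                fixed[i + k]["end"] = left + (k + 1) * step
        i = j
    return fixed
```

&lt;p&gt;The sketch trusts the end of the word before the batch and the start of the&lt;br&gt;
word after it. It does not handle the case where those two anchors overlap,&lt;br&gt;
which a real implementation would need to check.&lt;/p&gt;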

&lt;p&gt;Takeaway: vibe-coding speeds up maybe 80 percent of the work. The other 20&lt;br&gt;
percent is the core of the product. You build that part by hand, and it is the&lt;br&gt;
part that matters. Real production experience is what saves you there.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Honest traction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb8j82c399hb42ut0bgmy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb8j82c399hb42ut0bgmy.png" alt="UI of the main page" width="800" height="454"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Five users. Friends and family. People do not show up even for free. I posted in&lt;br&gt;
chats and on Threads: silence so far. Building the product turned out to be&lt;br&gt;
easier than getting people to try it. That is part of the story, not something&lt;br&gt;
to hide.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why free and no card&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I hate how big companies do trials: one nanosecond of access, or a card upfront.&lt;br&gt;
I want to give people a real free trial, no watermark and no card. Value first,&lt;br&gt;
money talk later.&lt;/p&gt;

&lt;p&gt;Reelsub does automatic captions for Reels, Shorts and TikTok, no watermark, free&lt;br&gt;
to start: reelsub.app. Open it, poke around, break it. Feedback means a lot.&lt;/p&gt;

</description>
      <category>buildinpublic</category>
      <category>indiehackers</category>
      <category>ai</category>
      <category>sideprojects</category>
    </item>
  </channel>
</rss>
