<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mohammed Fashan</title>
    <description>The latest articles on DEV Community by Mohammed Fashan (@mohammed_fashan_152d2c6a7).</description>
    <link>https://dev.to/mohammed_fashan_152d2c6a7</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3882977%2Fa0e02073-638f-461b-8dbf-a18532993968.jpg</url>
      <title>DEV Community: Mohammed Fashan</title>
      <link>https://dev.to/mohammed_fashan_152d2c6a7</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mohammed_fashan_152d2c6a7"/>
    <language>en</language>
    <item>
      <title>Building a local audio &amp; video transcription API with FastAPI and faster-whisper</title>
      <dc:creator>Mohammed Fashan</dc:creator>
      <pubDate>Thu, 16 Apr 2026 17:47:52 +0000</pubDate>
      <link>https://dev.to/mohammed_fashan_152d2c6a7/building-a-local-audio-video-transcription-api-with-fastapi-and-faster-whisper-47f8</link>
      <guid>https://dev.to/mohammed_fashan_152d2c6a7/building-a-local-audio-video-transcription-api-with-fastapi-and-faster-whisper-47f8</guid>
      <description>&lt;p&gt;I wanted a way to transcribe audio and video files without sending anything to the cloud. No OpenAI API key, no monthly bill, no data leaving my machine. So I built player2text — a local transcription API powered by faster-whisper.&lt;/p&gt;

&lt;p&gt;Here's what it does and how I built it.&lt;/p&gt;

&lt;h2&gt;The problem&lt;/h2&gt;

&lt;p&gt;Most transcription tools either cost money per minute, require an API key, or both. For personal projects, meeting recordings, or anything sensitive, that's not ideal. Whisper runs locally and it's surprisingly good — the challenge is just wrapping it in something usable.&lt;/p&gt;

&lt;h2&gt;The stack&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;FastAPI&lt;/strong&gt; — clean async API, great auto-generated docs at /docs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;faster-whisper&lt;/strong&gt; — up to 4x faster than the original Whisper at the same accuracy, with roughly half the RAM&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ffmpeg&lt;/strong&gt; — handles all the audio/video heavy lifting&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;The auto-compression trick&lt;/h2&gt;

&lt;p&gt;The best part of this project is the pre-processing step. Before transcription, every file gets passed through ffmpeg:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ffmpeg &lt;span class="nt"&gt;-i&lt;/span&gt; input.mp4 &lt;span class="nt"&gt;-vn&lt;/span&gt; &lt;span class="nt"&gt;-acodec&lt;/span&gt; pcm_s16le &lt;span class="nt"&gt;-ar&lt;/span&gt; 16000 &lt;span class="nt"&gt;-ac&lt;/span&gt; 1 output.wav
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This strips the video stream, downsamples to 16kHz mono (Whisper's native sample rate), and converts the audio to 16-bit PCM in a WAV container. A 300MB video becomes about 5MB, and transcription is dramatically faster as a result.&lt;/p&gt;
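&lt;p&gt;Inside the service this step is easy to wrap in a small helper. A minimal sketch, assuming ffmpeg is on the PATH; the function names here are illustrative, not necessarily what the repo uses:&lt;/p&gt;

```python
import subprocess
from pathlib import Path

SAMPLE_RATE = 16000  # Whisper's native sample rate

def ffmpeg_args(src: Path, dst: Path):
    """Build the ffmpeg command that strips video and downsamples audio."""
    return [
        "ffmpeg", "-y",
        "-i", str(src),
        "-vn",                    # drop the video stream
        "-acodec", "pcm_s16le",   # 16-bit little-endian PCM
        "-ar", str(SAMPLE_RATE),  # downsample to 16 kHz
        "-ac", "1",               # mix down to mono
        str(dst),
    ]

def extract_audio(src: Path, dst: Path):
    """Convert any audio/video file into a Whisper-ready WAV."""
    subprocess.run(ffmpeg_args(src, dst), check=True, capture_output=True)
    return dst
```

&lt;p&gt;Building the argument list in its own function keeps the command testable without actually invoking ffmpeg.&lt;/p&gt;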

&lt;h2&gt;The API&lt;/h2&gt;

&lt;p&gt;One endpoint does everything:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;POST /api/v1/transcribe
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Send a file (and optionally a language code), get back:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Full transcript here..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"language"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"en"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"duration"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;462.74&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"segments"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"start"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"end"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;9.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Hello and welcome..."&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Language is auto-detected if you don't specify it — Whisper supports 99 languages out of the box.&lt;/p&gt;
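&lt;p&gt;The segments map directly onto subtitle formats. A small sketch of turning the response into SRT (only the field names from the JSON above are assumed; nothing here is part of the API itself):&lt;/p&gt;

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp, e.g. 9.5 becomes 00:00:09,500."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments):
    """Render the transcript segments as an SRT subtitle file."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        start = srt_timestamp(seg["start"])
        end = srt_timestamp(seg["end"])
        blocks.append(f"{i}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)
```

&lt;p&gt;Write the output to a .srt file next to the video and most players will pick it up automatically.&lt;/p&gt;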

&lt;h2&gt;Running it&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/fashan7/audio-to-text
&lt;span class="nb"&gt;cd &lt;/span&gt;audio-to-text
python &lt;span class="nt"&gt;-m&lt;/span&gt; venv venv &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;source &lt;/span&gt;venv/bin/activate
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--upgrade&lt;/span&gt; pip setuptools wheel
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
python run.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then open &lt;a href="http://localhost:8000/docs" rel="noopener noreferrer"&gt;http://localhost:8000/docs&lt;/a&gt; and test it right in the browser.&lt;/p&gt;
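&lt;p&gt;If you'd rather script the call than click through the docs page, the upload is a plain multipart/form-data POST. A stdlib-only sketch that builds the request; the endpoint and the file/language field names come from the API above, and the filename is made up:&lt;/p&gt;

```python
import mimetypes
import urllib.request
import uuid

def build_upload_request(url, filename, payload, language=None):
    """Build a multipart/form-data POST for the transcription endpoint."""
    boundary = uuid.uuid4().hex
    ctype = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    # The uploaded file, under the form field name "file".
    parts = [
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: {ctype}\r\n\r\n".encode() + payload + b"\r\n"
    ]
    # Optional language hint; omit it to let Whisper auto-detect.
    if language:
        parts.append(
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="language"\r\n\r\n'
            f"{language}\r\n".encode()
        )
    body = b"".join(parts) + f"--{boundary}--\r\n".encode()
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
```

&lt;p&gt;Send it with urllib.request.urlopen and parse the JSON body from the response.&lt;/p&gt;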

&lt;h2&gt;What's next&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;React frontend (Lovable) for a proper UI with drag-and-drop upload&lt;/li&gt;
&lt;li&gt;Progress streaming for long files&lt;/li&gt;
&lt;li&gt;Deployment guide for Railway/Render&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The full code is on GitHub: &lt;a href="https://github.com/fashan7/audio-to-text" rel="noopener noreferrer"&gt;https://github.com/fashan7/audio-to-text&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Would love feedback — especially if you've dealt with long audio files on CPU and have ideas for speeding things up further!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>api</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
