<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Felipe Gazolla</title>
    <description>The latest articles on DEV Community by Felipe Gazolla (@felipegazolla).</description>
    <link>https://dev.to/felipegazolla</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F4011542%2F9f7fdee5-dd7b-471a-bb99-e0988110383e.png</url>
      <title>DEV Community: Felipe Gazolla</title>
      <link>https://dev.to/felipegazolla</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/felipegazolla"/>
    <language>en</language>
    <item>
      <title>Transcribing audio and generating chapters with AI (Whisper + GPT) without blowing up the cost</title>
      <dc:creator>Felipe Gazolla</dc:creator>
      <pubDate>Thu, 02 Jul 2026 02:10:56 +0000</pubDate>
      <link>https://dev.to/felipegazolla/transcrevendo-audio-e-gerando-capitulos-com-ia-whisper-gpt-sem-estourar-o-custo-3lk6</link>
      <guid>https://dev.to/felipegazolla/transcrevendo-audio-e-gerando-capitulos-com-ia-whisper-gpt-sem-estourar-o-custo-3lk6</guid>
      <description>&lt;p&gt;The other week I ran into a problem that looked trivial and wasn't: taking a long audio file, such as a podcast episode, and generating its chapters automatically, the kind that read "00:00 Intro" or "04:12 so-and-so talks about X". With AI, and without getting a nasty surprise on the OpenAI bill at the end of the month.&lt;/p&gt;

&lt;p&gt;I work as a full-stack developer, and this kind of AI automation has become part of my routine. I struggled for a while before finding an approach that works and stays cheap, so I figured it was worth writing up.&lt;/p&gt;

&lt;h2&gt;
  
  
  What needs to happen
&lt;/h2&gt;

&lt;p&gt;Basically three requirements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;transcribe the audio, keeping the start and end time of each segment&lt;/li&gt;
&lt;li&gt;turn that transcription into chapters that make sense&lt;/li&gt;
&lt;li&gt;do it without spending a fortune, because long audio turns into a truckload of tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Transcription with Whisper
&lt;/h2&gt;

&lt;p&gt;This is the part many people miss: you can request the transcription with per-segment timestamps already included by setting &lt;code&gt;response_format&lt;/code&gt; to &lt;code&gt;verbose_json&lt;/code&gt;. Whisper then tells you where each segment starts and ends, so you don't have to ask GPT to "guess" the minutes afterwards (spoiler: it guesses badly).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;OpenAI&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;openai&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;fs&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;node:fs&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;openai&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;transcription&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;transcriptions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createReadStream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;episodio.mp3&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;whisper-1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;response_format&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;verbose_json&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;timestamp_granularities&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;segment&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// transcription.segments -&amp;gt; [{ start, end, text }, ...]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Generating the chapters with GPT
&lt;/h2&gt;

&lt;p&gt;There is an expensive temptation here: sending the entire transcription, word for word, to the model. Don't. What I do is condense it first: for each segment, I send only the timestamp and the first 120 characters of the text. That is enough for the model to see where the subject changes, and it cuts a lot of tokens.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;resumoSegmentos&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;transcription&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;segments&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;`[&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nf"&gt;formatTime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;start&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;] &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;120&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;response_format&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;json_object&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`Gere de 5 a 8 capítulos a partir da transcrição com tempos.
Responda em JSON: { "capitulos": [{ "inicio": "mm:ss", "titulo": "..." }] }.

Transcrição:
&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;resumoSegmentos&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;capitulos&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
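
&lt;p&gt;The snippet above calls &lt;code&gt;formatTime&lt;/code&gt;, which the post never defines. Here is a minimal sketch of what it could look like, assuming &lt;code&gt;start&lt;/code&gt; is given in seconds. The &lt;code&gt;Segment&lt;/code&gt; interface and the &lt;code&gt;condenseSegments&lt;/code&gt; wrapper are hypothetical names added for illustration, not part of the original code:&lt;/p&gt;

```typescript
// Hypothetical shape of one transcription segment; only the fields used here.
interface Segment {
  start: number; // seconds from the beginning of the audio
  text: string;
}

// Formats a number of seconds as "mm:ss".
// Minutes are not wrapped at 60, so 3725 seconds becomes "62:05".
function formatTime(seconds: number): string {
  const total = Math.max(0, Math.floor(seconds));
  const mm = String(Math.floor(total / 60)).padStart(2, "0");
  const ss = String(total % 60).padStart(2, "0");
  return `${mm}:${ss}`;
}

// Builds one line per segment: the timestamp plus at most maxChars characters of text.
function condenseSegments(segments: Segment[], maxChars = 120): string {
  return segments
    .map((s) => `[${formatTime(s.start)}] ${s.text.trim().slice(0, maxChars)}`)
    .join("\n");
}
```

&lt;p&gt;With these in place, &lt;code&gt;condenseSegments(transcription.segments)&lt;/code&gt; would build the same kind of string as &lt;code&gt;resumoSegmentos&lt;/code&gt;, apart from trimming surrounding whitespace in each segment.&lt;/p&gt;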



&lt;p&gt;Two things saved me a lot of headaches: using &lt;code&gt;response_format: { type: "json_object" }&lt;/code&gt;, which gets you syntactically valid JSON back (no cleaning up stray text with regex), although it does not check that the keys match what you asked for, and letting Whisper handle the timing while GPT handles only the titles. Each one doing what it does well.&lt;/p&gt;
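
&lt;p&gt;JSON mode makes the reply parseable, but the shape of the object is still up to the model. A small validation step could look like the sketch below. &lt;code&gt;Capitulo&lt;/code&gt;, &lt;code&gt;isCapitulo&lt;/code&gt; and &lt;code&gt;parseCapitulos&lt;/code&gt; are hypothetical names, and the &lt;code&gt;mm:ss&lt;/code&gt; check assumes the format requested in the prompt:&lt;/p&gt;

```typescript
// Hypothetical type for one chapter, mirroring the keys requested in the prompt.
interface Capitulo {
  inicio: string; // "mm:ss"
  titulo: string;
}

// Type guard: true only for objects with a string titulo and an inicio like "04:12".
function isCapitulo(value: unknown): value is Capitulo {
  if (typeof value !== "object" || value === null) return false;
  const c = value as { inicio?: unknown; titulo?: unknown };
  if (typeof c.inicio !== "string" || typeof c.titulo !== "string") return false;
  return /^\d{2,}:[0-5]\d$/.test(c.inicio);
}

// Parses the model reply and keeps only well-formed chapters.
// Throws when the reply is empty, is not a JSON object, or lacks a "capitulos" array.
function parseCapitulos(raw: string | null): Capitulo[] {
  if (raw === null || raw.trim() === "") {
    throw new Error("empty model response");
  }
  const data: unknown = JSON.parse(raw);
  if (typeof data !== "object" || data === null) {
    throw new Error("model response is not a JSON object");
  }
  const list = (data as { capitulos?: unknown }).capitulos;
  if (!Array.isArray(list)) {
    throw new Error('missing "capitulos" array');
  }
  return list.filter(isCapitulo);
}
```

&lt;p&gt;Called as &lt;code&gt;parseCapitulos(res.choices[0].message.content)&lt;/code&gt;, this drops malformed entries silently; throwing on them instead is an equally reasonable choice.&lt;/p&gt;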

&lt;h2&gt;
  
  
  Where the cost actually drops
&lt;/h2&gt;

&lt;p&gt;When I looked at the bill, what weighed on it wasn't the "expensive model", it was me sending too much context. What changed the game:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pick the model for the task. Titling chapters doesn't need the top-of-the-line model; a "mini" does the job for a fraction of the price.&lt;/li&gt;
&lt;li&gt;Condense the segments before sending them, as shown above.&lt;/li&gt;
&lt;li&gt;Cache the transcription. The audio doesn't change once it's transcribed, so I store the result. If I want to regenerate the chapters, only the cheap part runs again.&lt;/li&gt;
&lt;/ol&gt;
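
&lt;p&gt;The caching step can be sketched with Node's built-in modules only. This is one possible layout, not the author's actual setup: the function names and the idea of keying the cache file by a SHA-256 hash of the audio bytes are assumptions made for illustration, and hashing reads the whole file into memory, which may not suit very large files:&lt;/p&gt;

```typescript
import * as crypto from "node:crypto";
import * as fs from "node:fs";
import * as path from "node:path";

// Cache file path derived from the audio bytes, so an edited or re-encoded file gets a new entry.
function cachePathFor(audioPath: string, cacheDir: string): string {
  const hash = crypto.createHash("sha256").update(fs.readFileSync(audioPath)).digest("hex");
  return path.join(cacheDir, `${hash}.json`);
}

// Returns the stored transcription, or null when this audio has not been cached yet.
function readCachedTranscription(audioPath: string, cacheDir: string): unknown {
  const file = cachePathFor(audioPath, cacheDir);
  return fs.existsSync(file) ? JSON.parse(fs.readFileSync(file, "utf8")) : null;
}

// Stores the transcription as JSON, creating the cache directory if needed.
function writeCachedTranscription(audioPath: string, cacheDir: string, transcription: unknown): void {
  fs.mkdirSync(cacheDir, { recursive: true });
  fs.writeFileSync(cachePathFor(audioPath, cacheDir), JSON.stringify(transcription));
}
```

&lt;p&gt;The intended flow is to call &lt;code&gt;readCachedTranscription&lt;/code&gt; first, call Whisper only when it returns &lt;code&gt;null&lt;/code&gt;, and then save the result with &lt;code&gt;writeCachedTranscription&lt;/code&gt;, so regenerating chapters never pays for transcription twice.&lt;/p&gt;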

&lt;h2&gt;
  
  
  Summing up
&lt;/h2&gt;

&lt;p&gt;In the end, "expensive" AI is almost always poorly orchestrated AI. A good share of the cost lies in how much context you throw at the model, not only in which model you use. And combining tools by what each does best (Whisper for timing, GPT for language) makes the result far more predictable than trying to make a single model handle everything.&lt;/p&gt;

&lt;p&gt;If you'd like to look at other things I've built (there are a few system demos you can open and try), my portfolio is here: &lt;a href="https://felipegazolla.dev" rel="noopener noreferrer"&gt;https://felipegazolla.dev&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>node</category>
      <category>typescript</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
