<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: pekiskol</title>
    <description>The latest articles on DEV Community by pekiskol (@pekisko).</description>
    <link>https://dev.to/pekisko</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3899236%2F00787efc-3dc8-4c56-a0fd-b48c741ae795.png</url>
      <title>DEV Community: pekiskol</title>
      <link>https://dev.to/pekisko</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/pekisko"/>
    <language>en</language>
    <item>
      <title>Fine-tuning Chatterbox on a Low-Resource Language: 7 Things That Mattered</title>
      <dc:creator>pekiskol</dc:creator>
      <pubDate>Sun, 26 Apr 2026 18:52:11 +0000</pubDate>
      <link>https://dev.to/pekisko/fine-tuning-chatterbox-on-a-low-resource-language-7-things-that-mattered-13e1</link>
      <guid>https://dev.to/pekisko/fine-tuning-chatterbox-on-a-low-resource-language-7-things-that-mattered-13e1</guid>
      <description>&lt;h1&gt;
  
  
  Fine-tuning Chatterbox on a Low-Resource Language: 7 Things That Mattered
&lt;/h1&gt;

&lt;p&gt;Resemble AI's &lt;a href="https://huggingface.co/ResembleAI/chatterbox" rel="noopener noreferrer"&gt;Chatterbox Multilingual TTS&lt;/a&gt; is one of the few SOTA open-source TTS models with a real MIT license — code &lt;em&gt;and&lt;/em&gt; weights — so it's a natural starting point if you want to build a commercially usable text-to-speech model for a language the official checkpoints don't cover.&lt;/p&gt;

&lt;p&gt;I fine-tuned it on Slovak. The 23 supported languages in the official multilingual checkpoint don't include it, and there was nothing comparable in the open-source ecosystem — every other halfway decent multilingual TTS I could find (XTTS-v2, F5-TTS, Fish Speech) ships under non-commercial licenses. So I trained my own and &lt;a href="https://huggingface.co/pekiskol/chatterbox-tts-slovak" rel="noopener noreferrer"&gt;published the weights&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The TTS sits at the end of a larger pipeline I'm building — an end-to-end Slovak video-dubbing tool (Whisper for transcription, a fine-tuned Gemma 3 for translation, MuseTalk for lip-sync) — and Slovak TTS was the missing piece. That's the reason "good enough on a single sentence" wasn't good enough; I needed audio that holds up across hours of generated speech.&lt;/p&gt;

&lt;p&gt;The fine-tuning itself isn't the hard part. The hard part is the dozen tiny things that turn a "weights file that loads" into "weights file that produces audio you'd actually ship." Here are the seven that mattered most for me.&lt;/p&gt;

&lt;p&gt;This post assumes you already know what fine-tuning a TTS is and have a base Chatterbox setup running. It's the practical-tuning notes I wish someone had written before I started.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. The base model and your fine-tune disagree on vocab size
&lt;/h2&gt;

&lt;p&gt;Chatterbox uses a SentencePiece text tokenizer, and depending on your training data your fine-tune may end up with a different vocab size than the base multilingual checkpoint. If you naively &lt;code&gt;load_state_dict(strict=True)&lt;/code&gt; the T3 weights into the base model, you get a shape mismatch.&lt;/p&gt;

&lt;p&gt;The fix is to pad or trim the affected matrices (&lt;code&gt;text_emb.weight&lt;/code&gt; and &lt;code&gt;text_head.weight&lt;/code&gt;) before loading:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_safetensors&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;my_finetune_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cpu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;target_vocab&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;t3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text_emb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;src_vocab&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text_emb.weight&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;src_vocab&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;target_vocab&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text_emb.weight&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text_emb.weight&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][:&lt;/span&gt;&lt;span class="n"&gt;target_vocab&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;:]&lt;/span&gt;
    &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text_head.weight&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text_head.weight&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][:&lt;/span&gt;&lt;span class="n"&gt;target_vocab&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;:]&lt;/span&gt;
&lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;src_vocab&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;target_vocab&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;pad&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;target_vocab&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;src_vocab&lt;/span&gt;
    &lt;span class="n"&gt;emb_pad&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text_emb.weight&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keepdim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;repeat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pad&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;head_pad&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text_head.weight&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keepdim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;repeat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pad&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text_emb.weight&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cat&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text_emb.weight&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;emb_pad&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text_head.weight&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cat&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text_head.weight&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;head_pad&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;t3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_state_dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;strict&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Padding with the row mean instead of zeros gives you a sensible "neutral" embedding for tokens the fine-tune never saw — better than zeros, which produce noise.&lt;/p&gt;
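
&lt;p&gt;The same trim-or-pad arithmetic, sketched standalone in NumPy (the code above works on torch tensors, and &lt;code&gt;pad_rows_with_mean&lt;/code&gt; is a name I'm using purely for illustration):&lt;/p&gt;

```python
import numpy as np

def pad_rows_with_mean(mat, target_rows):
    """Trim or mean-pad a 2-D matrix to exactly target_rows rows."""
    mat = mat[:target_rows]            # trim if too long (no-op otherwise)
    missing = target_rows - mat.shape[0]
    if missing:
        # New rows get the mean of the existing rows: a neutral embedding.
        filler = np.repeat(mat.mean(axis=0, keepdims=True), missing, axis=0)
        mat = np.concatenate([mat, filler], axis=0)
    return mat
```

&lt;p&gt;Applied to both &lt;code&gt;text_emb.weight&lt;/code&gt; and &lt;code&gt;text_head.weight&lt;/code&gt;, this is the whole of the resize step.&lt;/p&gt;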




&lt;h2&gt;
  
  
  2. The reference audio matters more than you think
&lt;/h2&gt;

&lt;p&gt;Chatterbox is a zero-shot voice-cloning model. You give it a few seconds of someone speaking, and the output mimics that voice. Most articles stop there. They don't tell you that &lt;strong&gt;whatever noise is in your reference clip will be baked into every generation&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I was using a 5.7-second Common Voice Slovak clip as the reference. The model output was almost perfect, except for a faint low-frequency hum throughout. It was the same hum present in the reference, which I hadn't noticed until I heard it stretched across thirty seconds of generated audio.&lt;/p&gt;

&lt;p&gt;The fix is to clean the reference &lt;em&gt;before&lt;/em&gt; you pass it to the model. Here's the ffmpeg chain I ended up with (based on what's in my production pipeline):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ffmpeg &lt;span class="nt"&gt;-i&lt;/span&gt; reference.wav &lt;span class="nt"&gt;-af&lt;/span&gt; &lt;span class="s2"&gt;"
  highpass=f=70,
  afftdn=nr=12:nt=w:om=o,
  lowpass=f=11000,
  equalizer=f=6800:t=q:w=1.2:g=-1.5,
  silenceremove=start_periods=1:start_silence=0.04:start_threshold=-50dB,
  areverse,
  silenceremove=start_periods=1:start_silence=0.06:start_threshold=-46dB,
  areverse
"&lt;/span&gt; reference_clean.wav
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What each stage does:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;highpass=f=70&lt;/code&gt; removes mains hum and low-frequency rumble&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;afftdn&lt;/code&gt; is FFT-based broadband denoising (12 dB reduction is gentle — pushing it higher starts to make speech metallic)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;lowpass=11000&lt;/code&gt; cuts hiss above 11 kHz, which Chatterbox doesn't reproduce anyway&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;equalizer&lt;/code&gt; notch around 6.8 kHz tames sibilance&lt;/li&gt;
&lt;li&gt;The pair of &lt;code&gt;silenceremove&lt;/code&gt; + &lt;code&gt;areverse&lt;/code&gt; blocks trims silence from both ends without complicated edge-case handling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This matters because cloning is essentially a "voice colour transfer" — anything in the reference &lt;em&gt;is&lt;/em&gt; part of the cloned voice. Garbage in, garbage out.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Generation parameters: defaults vs. tuned
&lt;/h2&gt;

&lt;p&gt;The default &lt;code&gt;model.generate()&lt;/code&gt; call works, but you can do better. After regression-testing on Slovak segments, I landed on these:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Param&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;th&gt;Tuned for stability&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;exaggeration&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0.5&lt;/td&gt;
&lt;td&gt;0.5&lt;/td&gt;
&lt;td&gt;Same&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;cfg_weight&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0.5&lt;/td&gt;
&lt;td&gt;0.5&lt;/td&gt;
&lt;td&gt;Same&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;temperature&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0.8&lt;/td&gt;
&lt;td&gt;0.6&lt;/td&gt;
&lt;td&gt;Lower → more stable, less variable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;top_p&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;td&gt;0.92&lt;/td&gt;
&lt;td&gt;Cuts the long tail of low-prob samples&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;repetition_penalty&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;td&gt;1.25&lt;/td&gt;
&lt;td&gt;Prevents the model from getting stuck on syllables&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;There's a real &lt;strong&gt;trade-off&lt;/strong&gt; here: tuned parameters produce more consistent, less-likely-to-fail output, but they also flatten the prosody. The voice sounds slightly more monotone. For a production pipeline doing thousands of segments per day where any failure is worse than a slightly less expressive read, tuned wins. For a demo on a model card where you want one perfect take, default temperature with a deterministic seed and a retry loop wins.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;RETRY_SEEDS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;123&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;99&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;min_dur&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;25.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;seed&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;RETRY_SEEDS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;manual_seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;wav&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;language_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sk&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;wav&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sr&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;min_dur&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;break&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The retry loop catches the cases where the model produces a too-short output (it sometimes EOSes early on hard inputs). If the first seed works, you stop; if not, try another. Five seeds is plenty in practice.&lt;/p&gt;
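
&lt;p&gt;For convenience, here is the tuned column collected as keyword arguments. This is a sketch: the parameter names assume the open-source Chatterbox &lt;code&gt;generate()&lt;/code&gt; signature, so verify them against the version you have installed.&lt;/p&gt;

```python
# Tuned-for-stability settings from the table above.
TUNED_PARAMS = dict(
    exaggeration=0.5,
    cfg_weight=0.5,
    temperature=0.6,
    top_p=0.92,
    repetition_penalty=1.25,
)

# Usage (assumes a loaded model, as in the retry loop above):
# wav = model.generate(text=text, language_id="sk", **TUNED_PARAMS)
```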




&lt;h2&gt;
  
  
  4. The first word can be garbage. Add a warmup prefix.
&lt;/h2&gt;

&lt;p&gt;Generating "Slovenský jazyk je úradný..." would sometimes produce "Zvolenský jazyk..." — the model's first token after the reference would morph. Sometimes the entire first word would be quiet noise.&lt;/p&gt;

&lt;p&gt;This is a "warmup" artifact: the model has to transition from the reference voice's prosody to its own generation, and that first ~0.3 s is where it can wobble. Hard words at position zero hit hardest.&lt;/p&gt;

&lt;p&gt;Two fixes, both work:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reword.&lt;/strong&gt; If "Slovenský" trips the model, start with "Slovenčina" instead. Easy, free, doesn't always work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Add a warmup prefix.&lt;/strong&gt; Put a short, easy phrase at the start that the model can land on cleanly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Vitajte v ukážke.
Slovenčina je úradný jazyk Slovenskej republiky.
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By the time the model reaches "Slovenčina," it has stabilised. The prefix becomes part of the generation but is short and natural, and you can always trim it in post if you don't want it.&lt;/p&gt;

&lt;p&gt;This isn't unique to Chatterbox — most autoregressive TTS models have warmup behaviour at the very start. The fix transfers.&lt;/p&gt;
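
&lt;p&gt;A minimal sketch of the prefix trick as helpers. The prefix text is the one from the example above; the 15 characters-per-second rate for estimating how much to trim in post is my own rough assumption, so calibrate it on your voice:&lt;/p&gt;

```python
WARMUP_PREFIX = "Vitajte v ukážke. "

def with_warmup(text, prefix=WARMUP_PREFIX):
    """Prepend a short, easy phrase the model can land on cleanly."""
    return prefix + text

def estimated_prefix_seconds(prefix=WARMUP_PREFIX, chars_per_sec=15.0):
    """Rough spoken duration of the prefix, for trimming it in post."""
    return len(prefix.strip()) / chars_per_sec
```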




&lt;h2&gt;
  
  
  5. The model can't read out individual letter names
&lt;/h2&gt;

&lt;p&gt;I tried generating "Má bohatú gramatiku, sedem pádov a špecifické hlásky ako ô, ľ alebo ŕ." — a sentence that names individual Slovak letters. The model produced nonsense for the letter-name part.&lt;/p&gt;

&lt;p&gt;This is a known limitation of TTS models trained on running speech: they rarely see "the letter ô" as a phrase, so they don't know it should be pronounced as a name rather than as the sound itself. The same is true of acronyms ("NDA" gets read as "enda") and units ("20 %" sometimes becomes "two-hundred percent" because of how digit-percent pairs were tokenised in training data).&lt;/p&gt;

&lt;p&gt;The fix is text normalisation &lt;em&gt;before&lt;/em&gt; TTS. In my pipeline I have a small Slovak-specific preprocessor that rewrites:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;20 %&lt;/code&gt; → &lt;code&gt;dvadsať percent&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Y100&lt;/code&gt; → &lt;code&gt;ypsilon sto&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;NDA&lt;/code&gt; → &lt;code&gt;eN-Dý-Á&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Letter names → spelled-out phonetic forms&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This lives as a preprocessing layer, not part of the model. It's easier to fix text once than to retrain the model to handle every edge case.&lt;/p&gt;
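
&lt;p&gt;A minimal sketch of such a preprocessor. The number words and acronym spellings here are a tiny illustrative subset, not the full table from the pipeline:&lt;/p&gt;

```python
import re

# Illustrative subsets only; a real normaliser needs a full number-to-word
# converter and a larger acronym table.
NUM_WORDS = {"7": "sedem", "20": "dvadsať", "100": "sto"}
ACRONYMS = {"NDA": "eN-Dý-Á"}

def normalize_sk(text):
    # "20 %" → "dvadsať percent"
    text = re.sub(
        r"(\d+)\s*%",
        lambda m: NUM_WORDS.get(m.group(1), m.group(1)) + " percent",
        text,
    )
    # "Y100" → "ypsilon sto" (only the letter Y is handled in this sketch)
    text = re.sub(
        r"\bY(\d+)\b",
        lambda m: "ypsilon " + NUM_WORDS.get(m.group(1), m.group(1)),
        text,
    )
    # Known acronyms → spelled-out phonetic forms
    for acro, spoken in ACRONYMS.items():
        text = text.replace(acro, spoken)
    return text
```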




&lt;h2&gt;
  
  
  6. &lt;code&gt;max_new_tokens&lt;/code&gt; matters for long generations
&lt;/h2&gt;

&lt;p&gt;Chatterbox auto-scales the generation budget:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;_max_toks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;max_new_tokens&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For a 350-character text that gives 2800 tokens, which sounds like plenty (at ~25 Hz token rate that's over 100 seconds of audio). But the model can EOS early on a particular word — even with a budget far above what the text needs. I had a 30-second narrative whose final word ("Amerike") got cut short despite a 4096-token budget.&lt;/p&gt;

&lt;p&gt;When you have a long input, &lt;strong&gt;set &lt;code&gt;max_new_tokens=4096&lt;/code&gt; explicitly&lt;/strong&gt; and check for premature EOS in postprocessing. If your output ends mid-word, treat it as a generation failure and retry with a different seed (see #3).&lt;/p&gt;

&lt;p&gt;For very long inputs, a more reliable strategy is to chunk into sentences, generate each separately, and concatenate with a short crossfade. Chatterbox doesn't have first-class chunking yet, so you do this at the application layer.&lt;/p&gt;
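
&lt;p&gt;The chunk-and-concatenate strategy can be sketched with NumPy. The 30 ms fade is a starting point to tune, and &lt;code&gt;crossfade_concat&lt;/code&gt; is a hypothetical helper, not part of Chatterbox:&lt;/p&gt;

```python
import numpy as np

def crossfade_concat(chunks, sr, fade_ms=30):
    """Concatenate 1-D float waveforms with a short linear crossfade."""
    n_fade = int(sr * fade_ms / 1000)
    out = chunks[0]
    for nxt in chunks[1:]:
        n = min(n_fade, len(out), len(nxt))
        if n == 0:
            out = np.concatenate([out, nxt])
            continue
        # Fade the tail of the running output into the head of the next chunk.
        ramp = np.linspace(0.0, 1.0, n)
        mixed = out[-n:] * (1.0 - ramp) + nxt[:n] * ramp
        out = np.concatenate([out[:-n], mixed, nxt[n:]])
    return out
```

&lt;p&gt;Generate one sentence per chunk, then join: the fade hides the tiny prosody discontinuity at each boundary.&lt;/p&gt;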




&lt;h2&gt;
  
  
  7. Use &lt;code&gt;prepare_conditionals&lt;/code&gt; separately from &lt;code&gt;generate&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;The Chatterbox API supports passing &lt;code&gt;audio_prompt_path=&lt;/code&gt; directly to &lt;code&gt;generate()&lt;/code&gt;, but in a production loop where you're generating many segments with the same voice, it's faster to call &lt;code&gt;prepare_conditionals&lt;/code&gt; once and reuse:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;prepare_conditionals&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reference_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Now generate many times without re-loading reference each time
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;wav&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;language_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sk&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The conditioning extracts speaker-identity features from the reference. Doing it once amortises the cost across the loop. For one-off demos it doesn't matter; for batch jobs it can shave noticeable time.&lt;/p&gt;

&lt;p&gt;A related gotcha: if you switch reference voices mid-loop, remember to re-run &lt;code&gt;prepare_conditionals&lt;/code&gt; (or reset &lt;code&gt;model.conds&lt;/code&gt;) before the next generation, or you'll keep cloning the previous voice.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final recipe
&lt;/h2&gt;

&lt;p&gt;Putting it together — the inference snippet I'd hand to someone starting fresh:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;torchaudio&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;chatterbox.mtl_tts&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatterboxMultilingualTTS&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;safetensors.torch&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_file&lt;/span&gt;

&lt;span class="n"&gt;device&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_available&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cpu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# 1. Clean the reference
&lt;/span&gt;&lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ffmpeg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-y&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-i&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reference.wav&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-af&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;highpass=f=70,afftdn=nr=12,lowpass=f=11000,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;equalizer=f=6800:t=q:w=1.2:g=-1.5,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;silenceremove=start_periods=1:start_silence=0.04:start_threshold=-50dB,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;areverse,silenceremove=start_periods=1:start_silence=0.06:start_threshold=-46dB,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;areverse&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reference_clean.wav&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;check&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Load base + patch in fine-tune
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ChatterboxMultilingualTTS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my_finetune_t3.safetensors&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cpu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# vocab resize block from #1 here
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;t3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_state_dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;strict&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;t3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;eval&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# 3. Prepare reference once
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;prepare_conditionals&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reference_clean.wav&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 4. Generate with retry
&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Vitajte v ukážke. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;your_actual_text&lt;/span&gt;  &lt;span class="c1"&gt;# warmup prefix from #4
&lt;/span&gt;&lt;span class="n"&gt;RETRY_SEEDS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;123&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;99&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;min_dur&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;25.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;wav&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;seed&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;RETRY_SEEDS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;manual_seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;inference_mode&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;candidate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;language_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sk&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_new_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;candidate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sr&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;min_dur&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;wav&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;candidate&lt;/span&gt;
        &lt;span class="k"&gt;break&lt;/span&gt;

&lt;span class="n"&gt;torchaudio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output.wav&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wav&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cpu&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is roughly what's running in my pipeline today.&lt;/p&gt;




&lt;h2&gt;
  
  
  What didn't work
&lt;/h2&gt;

&lt;p&gt;A few things I tried that turned out to be dead ends, in case you're tempted by them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Aggressive denoising&lt;/strong&gt; of the reference (FFT noise reduction at -18 dB or stronger). It removes hum reliably but starts producing a metallic, "phasey" cloned voice; -12 dB is the sweet spot.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;temperature=0.4&lt;/code&gt; or below&lt;/strong&gt;. Flattens prosody to the point of sounding like a robocall.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skipping the warmup prefix and trimming the first 0.3 s in post&lt;/strong&gt;. Works &lt;em&gt;most&lt;/em&gt; of the time, but occasionally cuts the start of an actually-clean first word. The prefix is more reliable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trying to fix the EOS-early problem with &lt;code&gt;repetition_penalty=1.5&lt;/code&gt; or higher&lt;/strong&gt;. Did not help; the model's stop decision is upstream of repetition logic.&lt;/li&gt;
&lt;/ul&gt;
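
&lt;p&gt;For context, the post-hoc trim from the third bullet is a one-liner. A minimal sketch (the fixed 0.3 s cut and the 24 kHz stand-in waveform are just for illustration):&lt;/p&gt;

```python
import numpy as np

def trim_head(wav: np.ndarray, sr: int, seconds: float = 0.3) -> np.ndarray:
    """Drop the first `seconds` of audio: the post-hoc trim I abandoned
    in favour of the warmup prefix."""
    return wav[..., int(seconds * sr):]

# 2 s of silence as a stand-in for a generated waveform at 24 kHz
sr = 24000
wav = np.zeros((1, 2 * sr), dtype=np.float32)
trimmed = trim_head(wav, sr)
print(trimmed.shape)  # (1, 40800) -- 7200 samples (0.3 s) removed
```

&lt;p&gt;It fails exactly when the model happens to start cleanly: you cut real phonemes instead of junk, which is why the prefix won.&lt;/p&gt;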




&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;If you're fine-tuning Chatterbox on a language it doesn't ship, the biggest things you can do &lt;em&gt;outside&lt;/em&gt; the training loop are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Clean your reference audio&lt;/li&gt;
&lt;li&gt;Tune your generation params (or accept the defaults' trade-offs)&lt;/li&gt;
&lt;li&gt;Normalise your text &lt;em&gt;before&lt;/em&gt; it reaches the model&lt;/li&gt;
&lt;li&gt;Use a warmup prefix for the first word&lt;/li&gt;
&lt;li&gt;Set &lt;code&gt;max_new_tokens&lt;/code&gt; explicitly and have a retry path&lt;/li&gt;
&lt;/ol&gt;
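
&lt;p&gt;Point 3 deserves a concrete example. A minimal sketch of the kind of pre-model normalisation I mean, expanding abbreviations and spelling out digits (the Slovak mappings below are illustrative, not my full rule set; real numerals need inflection handling):&lt;/p&gt;

```python
import re

# Illustrative Slovak expansions -- a real normaliser needs many more rules
DIGITS = {"0": "nula", "1": "jeden", "2": "dva", "3": "tri", "4": "štyri",
          "5": "päť", "6": "šesť", "7": "sedem", "8": "osem", "9": "deväť"}
ABBREVS = {"atď.": "a tak ďalej", "č.": "číslo"}

def normalize(text: str) -> str:
    """Expand abbreviations and digits so the TTS never sees raw symbols."""
    for abbr, full in ABBREVS.items():
        text = text.replace(abbr, full)
    # naive digit-by-digit expansion; proper numbers need real numeral words
    text = re.sub(r"\d", lambda m: " " + DIGITS[m.group(0)] + " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(normalize("Izba č. 42 je voľná."))  # Izba číslo štyri dva je voľná.
```

&lt;p&gt;The point is that this runs &lt;em&gt;before&lt;/em&gt; the model sees the text; whatever the tokenizer does with a bare "42" is not something you want to debug in the audio domain.&lt;/p&gt;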

&lt;p&gt;The Slovak fine-tune lives at &lt;a href="https://huggingface.co/pekiskol/chatterbox-tts-slovak" rel="noopener noreferrer"&gt;huggingface.co/pekiskol/chatterbox-tts-slovak&lt;/a&gt; under MIT — drop in the inference snippet above and it should work out of the box. Feedback (or different language fine-tunes that hit the same gotchas) welcome on the model's HF discussions tab.&lt;/p&gt;

&lt;p&gt;If you've discovered other Chatterbox tuning tricks I missed, leave a comment — I'd like to extend this list.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;The code in this post is also in the &lt;a href="https://huggingface.co/pekiskol/chatterbox-tts-slovak/tree/main" rel="noopener noreferrer"&gt;release scripts&lt;/a&gt; on the model page.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>tts</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
