<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Turbo Electric</title>
    <description>The latest articles on DEV Community by Turbo Electric (@turbo_electric_1c09f3bec0).</description>
    <link>https://dev.to/turbo_electric_1c09f3bec0</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3921443%2F1518e6f9-5d82-4e68-9fd3-7ac8f010cdd3.png</url>
      <title>DEV Community: Turbo Electric</title>
      <link>https://dev.to/turbo_electric_1c09f3bec0</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/turbo_electric_1c09f3bec0"/>
    <language>en</language>
    <item>
      <title>How human feedback actually steers TTS fine-tuning</title>
      <dc:creator>Turbo Electric</dc:creator>
      <pubDate>Sat, 09 May 2026 09:21:15 +0000</pubDate>
      <link>https://dev.to/turbo_electric_1c09f3bec0/how-human-feedback-actually-steers-tts-fine-tuning-1g4</link>
      <guid>https://dev.to/turbo_electric_1c09f3bec0/how-human-feedback-actually-steers-tts-fine-tuning-1g4</guid>
      <description>&lt;h1&gt;
  
  
  How human feedback actually steers TTS fine-tuning
&lt;/h1&gt;

&lt;p&gt;Notes on the iteration loop we ran while fine-tuning F5-TTS and StyleTTS2 on&lt;br&gt;
a small Northern English corpus. The headline finding is that the listening&lt;br&gt;
test isn't optional polish at the end — it's the &lt;strong&gt;only&lt;/strong&gt; measurement that&lt;br&gt;
catches the failure modes that matter, and each round of listening produces&lt;br&gt;
specific phonetic observations that map to specific engineering decisions.&lt;/p&gt;

&lt;p&gt;This is a write-up of the methodology, with the concrete examples that&lt;br&gt;
forced each decision.&lt;/p&gt;

&lt;h2&gt;
  
  
  The loop
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;        ┌────────────────────────┐
        │  render passage        │
        │  (baseline + ft)       │
        └──────────┬─────────────┘
                   ▼
        ┌────────────────────────┐         a feature is "right" if a native
        │  human listens against │         speaker recognises it. Record both
        │  marker list (BATH,    │  ◀───── what's working AND what's broken;
        │  FOOT-STRUT, …)        │         both are signal.
        └──────────┬─────────────┘
                   ▼
        ┌────────────────────────┐         translate audible features
        │  diagnose: why is the  │         to training-side cause:
        │  output the way it is? │  ◀───── · missing accent → under-trained
        └──────────┬─────────────┘         · right accent + glitches → over-trained
                   ▼                       · wrong accent → data or LR direction
        ┌────────────────────────┐         specific knobs:
        │  pick next training    │         · lr ↑/↓ (drift per step)
        │  move                  │  ◀───── · epochs ±N (cumulative drift)
        └──────────┬─────────────┘         · earlier ckpt (rewind)
                   ▼                       · data filter (cleaner signal)
              [iterate]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The verdict-to-action mapping
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Listening verdict&lt;/th&gt;
&lt;th&gt;What it implies physically&lt;/th&gt;
&lt;th&gt;Engineering response&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;"No discernible difference from baseline"&lt;/td&gt;
&lt;td&gt;Cumulative weight drift Σ lr·grad too small. Either lr too low, scheduler decayed it to ~0, or epochs too few.&lt;/td&gt;
&lt;td&gt;Increase lr or remove decay; add epochs.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"Accent is right but specific words mangled / dropped / truncated"&lt;/td&gt;
&lt;td&gt;Late-epoch overfitting on training-corpus pace or timing. Crossed from "learning the distribution" to "memorising peculiarities of small corpus".&lt;/td&gt;
&lt;td&gt;Step back: pick an earlier checkpoint, or continue at lower lr.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"Accent is wrong direction (e.g. American instead of Northern)"&lt;/td&gt;
&lt;td&gt;Training data misattributed, or model pulled toward different distribution than expected.&lt;/td&gt;
&lt;td&gt;Audit data: manifest pointing at right speakers? Diarisation clean? Speaker IDs correct?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"Specific phonetic feature still missing (e.g. monophthongisation absent on 'sunshine')"&lt;/td&gt;
&lt;td&gt;That pattern needs more training-distribution exposure. Some accent features are easier than others.&lt;/td&gt;
&lt;td&gt;Train more, keeping lr constant. Don't increase lr to chase one feature — risk catastrophic forgetting.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"Feature drifted past the target (e.g. 'down' → 'doon')"&lt;/td&gt;
&lt;td&gt;Over-fit on the broader cluster of related accents. Model has slid past the target sub-region.&lt;/td&gt;
&lt;td&gt;Step back to earlier checkpoint OR pick checkpoint &lt;em&gt;before&lt;/em&gt; the drift.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These categories aren't theoretical. We hit each of them in real training&lt;br&gt;
runs. Examples:&lt;/p&gt;

&lt;h3&gt;
  
  
  "No discernible difference" → &lt;strong&gt;LR scheduler decayed to zero&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Run 1 of F5-TTS used the trainer's default schedule: linear warmup to peak&lt;br&gt;
1e-5, then linear decay across the entire run to ~0. After 5 epochs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mean loss per epoch: 0.629, 0.677, 0.648, 0.642, 0.670 — &lt;strong&gt;flat&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Listening: indistinguishable from baseline&lt;/li&gt;
&lt;li&gt;Numerical: waveform correlation with baseline = 0.017 (essentially uncorrelated audio, as expected for diffusion sampling) — looked like the model was &lt;em&gt;doing something&lt;/em&gt;, but the perceptual output disagreed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Diagnosis: the schedule shape was wrong. Step-by-step LR values:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;step 1: lr = 1e-7 (warmup)&lt;/li&gt;
&lt;li&gt;step 100: lr = 1e-5 (peak, decay starts)&lt;/li&gt;
&lt;li&gt;step 1000: ≈ 5.5e-6&lt;/li&gt;
&lt;li&gt;step 2225: lr = 1e-13 — effectively zero&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Total weight drift is bounded by Σ lr·grad. With LR linearly decaying to ~0&lt;br&gt;
over 2225 steps, late-epoch gradients are multiplied by near-zero values.&lt;br&gt;
&lt;strong&gt;Most of run 1's "5 epochs" was a no-op.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Run 2 fix: 5× higher peak LR (5e-5), constant after warmup, no decay. 10&lt;br&gt;
epochs. Result: per-epoch mean loss decreased (0.701 → 0.683 → 0.661 →&lt;br&gt;
0.646 across the first 4), and listening verdict was &lt;em&gt;audibly Northern&lt;/em&gt; —&lt;br&gt;
"London" rendered as "Lundun" (FOOT-STRUT vowel collapse, a textbook&lt;br&gt;
Northern marker).&lt;/p&gt;
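&lt;p&gt;One way to see why run 1 went nowhere and run 2 moved is to sum the per-step LR of each schedule. Minimal sketch in plain Python; the step counts (445 steps/epoch, i.e. 2225 steps over 5 epochs) and peaks mirror the runs above, but the script is illustrative rather than the trainer's code.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Rough drift-budget comparison: the sum of per-step lr is an upper-bound
# proxy for total weight drift (sum over steps of lr * grad). Step counts
# approximate the runs above (445 steps per epoch); purely illustrative.
WARMUP = 100
TOTAL1, PEAK1 = 5 * 445, 1e-5     # run 1: 5 epochs, linear decay to ~0
TOTAL2, PEAK2 = 10 * 445, 5e-5    # run 2: 10 epochs, constant after warmup

def run1_lr(step):
    # linear warmup to PEAK1, then linear decay to ~0 at TOTAL1
    return PEAK1 * min(step / WARMUP, (TOTAL1 - step) / (TOTAL1 - WARMUP))

def run2_lr(step):
    # same warmup shape, 5x the peak, held constant afterwards
    return PEAK2 * min(step / WARMUP, 1.0)

budget1 = sum(run1_lr(s) for s in range(1, TOTAL1 + 1))
budget2 = sum(run2_lr(s) for s in range(1, TOTAL2 + 1))
print(f"run 1 drift budget: {budget1:.4f}")   # ~0.011
print(f"run 2 drift budget: {budget2:.4f}")   # ~0.22, roughly 20x more room to move
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;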

&lt;h3&gt;
  
  
  "Specific feature missing" → &lt;strong&gt;Train more, same LR&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;After 4 epochs of run 2, FOOT-STRUT had emerged ("Lundun") but&lt;br&gt;
monophthongisation hadn't ("sunshine" still carried the standard diphthong). Some&lt;br&gt;
phonetic patterns are easier to acquire than others — single-vowel&lt;br&gt;
substitutions vs global diphthong→monophthong shifts.&lt;/p&gt;

&lt;p&gt;Continuing 6 more epochs at the same constant 5e-5: monophthongisation&lt;br&gt;
strengthened ("laughing" landed correctly), but truncation appeared&lt;br&gt;
("sunshine" → "sunshinn", dropped function words like "her").&lt;/p&gt;

&lt;h3&gt;
  
  
  "Accent right but words mangled" → &lt;strong&gt;Pick earlier checkpoint&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Run 2 epoch 10 had the strongest accent but the most word-truncation.&lt;br&gt;
Rendering epochs 6, 7, 8, 9 with the same input passage and listening&lt;br&gt;
through revealed epoch 9 as the sweet spot — accent committed, mostly&lt;br&gt;
without truncation. Final shipping checkpoint.&lt;/p&gt;

&lt;h3&gt;
  
  
  "Drifted past the target" → &lt;strong&gt;StyleTTS2's late-epoch failure mode&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;StyleTTS2 epoch 5 introduced "down" → "doon" (Geordie/Scots realisation).&lt;br&gt;
That's &lt;em&gt;more Northern&lt;/em&gt; than the Bolton/Lancashire target. The model had&lt;br&gt;
slid past the target sub-region of accent space and was now drifting toward&lt;br&gt;
broader Scots/North-East phonetics. Stopped training; epoch 4 became the&lt;br&gt;
shipping checkpoint.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why loss alone can't replace listening
&lt;/h2&gt;

&lt;p&gt;Three reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Loss flatness is ambiguous.&lt;/strong&gt; A flat loss curve could mean "converged"&lt;br&gt;
or "not learning at all." Run 1's flat 0.65 was the latter; only listening&lt;br&gt;
("indistinguishable from baseline") disambiguated and pointed at the LR&lt;br&gt;
scheduler. No purely numerical metric on training loss could distinguish&lt;br&gt;
those two cases without an evaluation set.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Some failures look like wins on the loss curve.&lt;/strong&gt; Late-epoch&lt;br&gt;
overfitting drops training loss while degrading output. &lt;em&gt;Lower&lt;/em&gt; loss +&lt;br&gt;
&lt;em&gt;worse&lt;/em&gt; output. Only listening catches it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The thing being optimised isn't what you actually want.&lt;/strong&gt; Flow-matching&lt;br&gt;
loss measures velocity-field reconstruction quality on the training&lt;br&gt;
distribution. It doesn't directly measure "is this output Northern&lt;br&gt;
English-sounding to a native speaker." The model can get better at&lt;br&gt;
fitting the training speaker's (Sara's) mels while producing audio that&lt;br&gt;
sounds different from any actual recording of her.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is why every training run produces multiple per-epoch checkpoints and&lt;br&gt;
we render the same passage through several of them. The cost (~30s per&lt;br&gt;
render × 5–6 epochs = ~3 min) buys you a perceptual gradient across training&lt;br&gt;
time that no scalar loss provides.&lt;/p&gt;
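&lt;p&gt;In practice the render step is a short loop over checkpoint files. A minimal sketch of how we drive it; &lt;code&gt;synthesize&lt;/code&gt; is a hypothetical wrapper around whatever inference entry point your model exposes (F5-TTS and StyleTTS2 each have their own), and the paths are placeholders.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Render the same probe passage and reference clip through every per-epoch
# checkpoint so listening compares like-for-like. `synthesize` is a
# hypothetical helper wrapping your model's inference call.
from pathlib import Path
import soundfile as sf

PROBE = "It was a bright morning when the path through the grass led down ..."
REF_CLIP = "ref/reference.wav"   # fixed reference clip: never vary it between renders

out_dir = Path("ab_renders")
out_dir.mkdir(exist_ok=True)

for ckpt in sorted(Path("ckpts/run2").glob("epoch_*.pt")):
    wav = synthesize(checkpoint=ckpt, text=PROBE, ref_audio=REF_CLIP)  # hypothetical
    sf.write(str(out_dir / f"{ckpt.stem}.wav"), wav, 24000)  # sample rate depends on model
    print(f"rendered {ckpt.stem}: listen against the marker list")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;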

&lt;h2&gt;
  
  
  The phonetic-marker passage as deliberate probe
&lt;/h2&gt;

&lt;p&gt;The test passage is loaded with English-accent markers so a single rendering&lt;br&gt;
surfaces multiple aspects of the model's state. Our standard probe:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"It was a bright morning when the path through the grass led down to the&lt;br&gt;
running water. She ran her hand along the back of the chair before&lt;br&gt;
sitting down. The young children were laughing in the sunshine, dancing&lt;br&gt;
in patterns through the warm afternoon. After tea, the family walked up&lt;br&gt;
the hill to look at the view. One of them said, with a small smile: I&lt;br&gt;
cannot believe how lovely it is, our little corner of the world."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What we're probing&lt;/th&gt;
&lt;th&gt;Words that probe it&lt;/th&gt;
&lt;th&gt;"Wrong" sounds like&lt;/th&gt;
&lt;th&gt;"Northern" sounds like&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;BATH vowel&lt;/td&gt;
&lt;td&gt;path, grass, laughing, dancing, after, cannot&lt;/td&gt;
&lt;td&gt;/pɑːθ/ (RP "parth")&lt;/td&gt;
&lt;td&gt;/pæθ/ (rhymes with "trap")&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FOOT-STRUT&lt;/td&gt;
&lt;td&gt;running, sunshine, hand, hill, up, lovely, our&lt;/td&gt;
&lt;td&gt;distinct "put"/"putt"&lt;/td&gt;
&lt;td&gt;collapsed: both /ʊ/, so "London" → "Lundun"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Diphthong→monophthong&lt;/td&gt;
&lt;td&gt;sunshine (→sunshaan), morning, smile&lt;/td&gt;
&lt;td&gt;standard /aɪ/, /eɪ/&lt;/td&gt;
&lt;td&gt;flat, longer single vowel /aː/&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;happY-tense&lt;/td&gt;
&lt;td&gt;lovely, family, every, country&lt;/td&gt;
&lt;td&gt;tense /iː/&lt;/td&gt;
&lt;td&gt;laxer, more /ɪ/-like&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;R-intrusion / linking&lt;/td&gt;
&lt;td&gt;chair before, our little&lt;/td&gt;
&lt;td&gt;none&lt;/td&gt;
&lt;td&gt;often realised in connected Northern speech&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If only some markers come through, that tells us which &lt;em&gt;kinds&lt;/em&gt; of changes&lt;br&gt;
the model is finding easier vs harder to learn. In run 2 epoch 4 the&lt;br&gt;
FOOT-STRUT shift had emerged ("Lundun") but monophthongisation had not&lt;br&gt;
("sunshine" still diphthongised). That gap motivated continuing training&lt;br&gt;
rather than declaring done — specific phonetic gaps mapping to specific&lt;br&gt;
training decisions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical recommendations
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Save a checkpoint per epoch.&lt;/strong&gt; They're cheap on disk and you'll want&lt;br&gt;
the perceptual gradient across training time. The latest epoch isn't always&lt;br&gt;
the best one.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Curate one phonetic-marker passage&lt;/strong&gt; that targets the dialect features&lt;br&gt;
you care about. Reuse the same passage every render so you build a&lt;br&gt;
listening-memory of the model's progression.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Render with the same reference clip every time.&lt;/strong&gt; The only variable&lt;br&gt;
should be the model weights. If you change the reference clip you're&lt;br&gt;
asking two different questions at once.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Native-speaker listeners are the most reliable test instrument.&lt;/strong&gt;&lt;br&gt;
Their judgement catches features that numerical metrics miss — and&lt;br&gt;
importantly, also catches &lt;em&gt;over-fitting&lt;/em&gt; failures that look fine&lt;br&gt;
numerically.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Both wins and bugs are signal.&lt;/strong&gt; Don't just record what's working;&lt;br&gt;
record what's broken. The combination of "what improved" and "what got&lt;br&gt;
worse" defines the engineering response (continue / step back / change&lt;br&gt;
data).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Run more checkpoints than you think you need.&lt;/strong&gt; A/B-ing 6 different&lt;br&gt;
epochs of the same run takes 3 minutes of compute. The information&lt;br&gt;
gain — perceptual gradient over training time — is worth far more than&lt;br&gt;
that.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Provenance
&lt;/h2&gt;

&lt;p&gt;Worked example from a small TTS fine-tuning project: ~3 hours of single-speaker&lt;br&gt;
British (Bolton-area) audio + WhisperCPP for transcripts → fine-tuned F5-TTS&lt;br&gt;
and StyleTTS2 producing recognisably Northern-English output. Both&lt;br&gt;
architectures hit different late-epoch failure modes that only the listening&lt;br&gt;
loop caught. The companion piece F5 vs StyleTTS2 architecture trade-off&lt;br&gt;
documents what those failure modes implied about the architectures themselves.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally posted at &lt;a href="https://netlinux-ai.github.io/2026/05/09/tts-listening-loop/" rel="noopener noreferrer"&gt;netlinux-ai.github.io/2026/05/09/tts-listening-loop/&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>tts</category>
      <category>ai</category>
    </item>
    <item>
      <title>Running modern Python TTS toolchains on non-AVX2 CPUs</title>
      <dc:creator>Turbo Electric</dc:creator>
      <pubDate>Sat, 09 May 2026 09:21:12 +0000</pubDate>
      <link>https://dev.to/turbo_electric_1c09f3bec0/running-modern-python-tts-toolchains-on-non-avx2-cpus-1ejb</link>
      <guid>https://dev.to/turbo_electric_1c09f3bec0/running-modern-python-tts-toolchains-on-non-avx2-cpus-1ejb</guid>
      <description>&lt;h1&gt;
  
  
  Running modern Python TTS toolchains on non-AVX2 CPUs
&lt;/h1&gt;

&lt;p&gt;Notes from getting &lt;strong&gt;F5-TTS, StyleTTS2, kokoro/Misaki, and whisper.cpp&lt;/strong&gt; to work&lt;br&gt;
on an AMD Phenom II X6 1090T (2010 K10/Family-10h architecture).&lt;/p&gt;

&lt;p&gt;The CPU has SSE/SSE2/SSE3/SSE4a, plus CX16/POPCNT/LAHF — but &lt;strong&gt;no SSE4.1, no&lt;br&gt;
SSE4.2, no AVX, no AVX2, no FMA, no F16C&lt;/strong&gt;. That puts it below the modern&lt;br&gt;
&lt;strong&gt;x86-64-v2&lt;/strong&gt; baseline. A growing share of binary Python wheels in the AI&lt;br&gt;
ecosystem assume v2 or v3, so they SIGILL or SIGFPE at import. This is a&lt;br&gt;
ground-truth list of what we hit and what worked.&lt;/p&gt;
&lt;h2&gt;
  
  
  Quick triage
&lt;/h2&gt;

&lt;p&gt;If your CPU is below x86-64-v2 (in particular, missing &lt;strong&gt;SSE4.1&lt;/strong&gt;), expect:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;pyarrow&lt;/code&gt; static-init &lt;code&gt;pinsrq&lt;/code&gt; SIGILL on import&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;numpy 2.x&lt;/code&gt; wheel SIGILL on import (numpy 1.26.4 still has a fallback path)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;torch 2.10+&lt;/code&gt; wheel SIGFPE in &lt;code&gt;torch._dynamo&lt;/code&gt; on import&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pandas&lt;/code&gt; modern wheels SIGILL on tokenisation&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;monotonic_align&lt;/code&gt; and other Cython extensions: build-from-source SIGILL&lt;/li&gt;
&lt;li&gt;DataLoader subprocess workers SIGFPE re-importing torch&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your CPU is x86-64-v2 (Nehalem ~2008 or newer Intel; Bulldozer ~2011 or&lt;br&gt;
newer AMD) but missing AVX/AVX2, you'll still hit some of these but fewer.&lt;/p&gt;
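&lt;p&gt;Before debugging import crashes, it's worth checking where the machine actually sits. A quick heuristic (the grep only looks for the flags that matter here; the second command needs a reasonably recent glibc, 2.33 or later):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# which of the relevant ISA extensions does this CPU report?
# empty output here means you're below x86-64-v2 (the Phenom II case)
grep -m1 -o -w -e sse4_1 -e sse4_2 -e avx -e avx2 -e fma /proc/cpuinfo

# glibc 2.33+ can report the supported micro-architecture levels directly
/lib64/ld-linux-x86-64.so.2 --help | grep -i 'x86-64-v'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;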
&lt;h2&gt;
  
  
  Working pin-set
&lt;/h2&gt;

&lt;p&gt;These are versions empirically verified to import and run on this CPU:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;package&lt;/th&gt;
&lt;th&gt;version&lt;/th&gt;
&lt;th&gt;why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;numpy&lt;/td&gt;
&lt;td&gt;1.26.4&lt;/td&gt;
&lt;td&gt;last with a non-AVX2 fallback path; from-source builds OK&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;torch&lt;/td&gt;
&lt;td&gt;2.7.0&lt;/td&gt;
&lt;td&gt;last with a usable &lt;code&gt;_dynamo&lt;/code&gt; init that doesn't SIGFPE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;torchaudio&lt;/td&gt;
&lt;td&gt;2.7.0&lt;/td&gt;
&lt;td&gt;last with the soundfile backend (2.10+ requires torchcodec)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;transformers&lt;/td&gt;
&lt;td&gt;4.57.3&lt;/td&gt;
&lt;td&gt;5.x triggers &lt;code&gt;torch._dynamo&lt;/code&gt; import-time via &lt;code&gt;torch.compiler.disable&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;numba / scipy / librosa&lt;/td&gt;
&lt;td&gt;latest binary wheels&lt;/td&gt;
&lt;td&gt;OK&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;pyarrow / pandas / datasets / torchcodec&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;uninstalled&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;wheels assume SSE4.1+; not actually needed for inference&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For a fresh install, layer the pins after the project install:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--prefer-binary&lt;/span&gt; &amp;lt;project&amp;gt;           &lt;span class="c"&gt;# whatever you actually want&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--prefer-binary&lt;/span&gt; &lt;span class="nt"&gt;--force-reinstall&lt;/span&gt; &lt;span class="nt"&gt;--no-deps&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="s2"&gt;"torch==2.7.0"&lt;/span&gt; &lt;span class="s2"&gt;"torchaudio==2.7.0"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="s2"&gt;"transformers==4.57.3"&lt;/span&gt; &lt;span class="s2"&gt;"numpy&amp;lt;2"&lt;/span&gt;
pip uninstall &lt;span class="nt"&gt;-y&lt;/span&gt; datasets pyarrow pyarrow-hotfix pandas torchcodec
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
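&lt;p&gt;A quick sanity check that the pins actually landed; on this class of CPU the import itself is the real test, since a wrong wheel dies before it can print anything:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python -c "import numpy, torch, torchaudio, transformers; \
print(numpy.__version__, torch.__version__, torchaudio.__version__, transformers.__version__)"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;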



&lt;h2&gt;
  
  
  Patches required
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Patch 1: &lt;code&gt;torch._dynamo&lt;/code&gt; SIGFPE on int division by zero
&lt;/h3&gt;

&lt;p&gt;Even after pinning to torch 2.7.0, the very first dynamo init still SIGFPEs&lt;br&gt;
on this CPU. Cause: &lt;code&gt;torch._dynamo.variables.torch_function.populate_builtin_to_tensor_fn_map()&lt;/code&gt;&lt;br&gt;
probes Python operators on dummy tensors, including &lt;code&gt;tensor // 0&lt;/code&gt; (integer&lt;br&gt;
floor-divide by zero). Newer Intel CPUs trap this into a Python&lt;br&gt;
&lt;code&gt;ZeroDivisionError&lt;/code&gt; via signal handler. AMD Phenom II just SIGFPEs.&lt;/p&gt;

&lt;p&gt;The function's output isn't actually needed for inference. Stub it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;F&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;python &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"import torch._dynamo.variables.torch_function as m; print(m.__file__)"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;cp&lt;/span&gt; &lt;span class="nv"&gt;$F&lt;/span&gt; &lt;span class="nv"&gt;$F&lt;/span&gt;.orig
&lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="s2"&gt;"0,/    global BUILTIN_TO_TENSOR_FN_MAP/s//    return  # patched: SIGFPE on Phenom II&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;    global BUILTIN_TO_TENSOR_FN_MAP/"&lt;/span&gt; &lt;span class="nv"&gt;$F&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is non-invasive — only affects code that uses &lt;code&gt;torch.compile()&lt;/code&gt; /&lt;br&gt;
dynamo paths, which most fine-tuning trainers don't.&lt;/p&gt;
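&lt;p&gt;To confirm the patch took, call the function it stubs directly. This assumes the module path and zero-argument signature described above; adjust if your torch build differs:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# exercise the exact code path that used to SIGFPE; after patching it just returns
python -c "from torch._dynamo.variables.torch_function import \
populate_builtin_to_tensor_fn_map as probe; probe(); print('dynamo init OK')"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;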
&lt;h3&gt;
  
  
  Patch 2: GPU-only mel-spectrogram computation
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;torch.matmul&lt;/code&gt; on the CPU backend SIGFPEs on this machine. Anything that calls torchaudio's&lt;br&gt;
&lt;code&gt;MelSpectrogram&lt;/code&gt; on CPU dies. For training pipelines that compute mels&lt;br&gt;
in the data loader, this is fatal.&lt;/p&gt;

&lt;p&gt;Two ways to fix:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;a)&lt;/strong&gt; Move the mel module to the GPU (one small audio-to-GPU and mel-back-to-CPU transfer per sample):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;to_mel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torchaudio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transforms&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;MelSpectrogram&lt;/span&gt;&lt;span class="p"&gt;(...).&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;preprocess&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wave&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;wave&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_numpy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wave&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;mel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;to_mel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wave&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;mel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cpu&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# back to CPU for DataLoader collator
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;b)&lt;/strong&gt; Pre-compute all mels once on GPU, save to disk, load at training time&lt;br&gt;
(&lt;a href="https://gist.github.com/netlinux-ai/a7bbf6c64487bdc9ae5ff66731c5646f" rel="noopener noreferrer"&gt;example script&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;(b) is faster overall — no per-sample audio→GPU transfer, just &lt;code&gt;torch.load&lt;/code&gt;.&lt;/p&gt;
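&lt;p&gt;A minimal sketch of (b), assuming a flat directory of wavs already at the model's sample rate; the mel parameters below are placeholders, so substitute whatever your model's config specifies:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Pre-compute every mel once on the GPU (where matmul works), save to disk,
# and let the training data loader just torch.load() them afterwards.
# Mel parameters are placeholders; match them to your model's config.
from pathlib import Path
import torch, torchaudio

to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=24000, n_fft=1024, hop_length=256, n_mels=100,   # placeholder values
).to("cuda")

for wav_path in Path("corpus/wavs").glob("*.wav"):
    wave, sr = torchaudio.load(str(wav_path))   # decode on CPU; soundfile backend is fine
    mel = to_mel(wave.to("cuda")).cpu()         # the matmul happens on the GPU
    torch.save(mel, wav_path.with_suffix(".pt"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;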
&lt;h3&gt;
  
  
  Patch 3: &lt;code&gt;num_workers=0&lt;/code&gt; everywhere
&lt;/h3&gt;

&lt;p&gt;DataLoader spawns subprocess workers that re-import torch and re-run&lt;br&gt;
&lt;code&gt;_dynamo&lt;/code&gt; init. Even with patch 1, the patched source isn't always picked up&lt;br&gt;
in the subprocesses. Set &lt;code&gt;num_workers=0&lt;/code&gt; to keep all loading in the main process.&lt;/p&gt;
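&lt;p&gt;Wherever the trainer builds its loader (a sketch; &lt;code&gt;train_dataset&lt;/code&gt;, the batch size and the collator are whatever the project already uses):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from torch.utils.data import DataLoader

# num_workers=0 keeps loading in the main, already-patched process, so no
# forked worker re-imports torch and re-runs the dynamo init on this CPU.
loader = DataLoader(train_dataset, batch_size=8, shuffle=True,
                    collate_fn=collate_fn, num_workers=0)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;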
&lt;h3&gt;
  
  
  Patch 4: &lt;code&gt;weights_only=False&lt;/code&gt; for older checkpoint formats
&lt;/h3&gt;

&lt;p&gt;PyTorch 2.6+ flipped the default. If you load checkpoints saved before 2.6&lt;br&gt;
that contain pickled Python objects, you need &lt;code&gt;torch.load(path, weights_only=False)&lt;/code&gt;.&lt;br&gt;
Affected: many published TTS pretrained models (StyleTTS2's ASR/JDC/PLBERT&lt;br&gt;
modules, F5-TTS in some cases).&lt;/p&gt;
&lt;h3&gt;
  
  
  Patch 5: Stub &lt;code&gt;datasets&lt;/code&gt; for transformers' lazy loader
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;transformers.utils.import_utils._is_package_available("datasets")&lt;/code&gt; calls&lt;br&gt;
&lt;code&gt;importlib.util.find_spec("datasets")&lt;/code&gt;, which raises &lt;code&gt;ValueError&lt;/code&gt; if&lt;br&gt;
&lt;code&gt;__spec__&lt;/code&gt; is &lt;code&gt;None&lt;/code&gt;. If you provide a stub &lt;code&gt;datasets&lt;/code&gt; module via&lt;br&gt;
&lt;code&gt;sys.modules&lt;/code&gt; (to avoid pulling pyarrow), it must have a real ModuleSpec:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;importlib.machinery&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;
&lt;span class="n"&gt;_stub&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ModuleType&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;datasets&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;_stub&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__spec__&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;importlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;machinery&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ModuleSpec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;datasets&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loader&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;_stub&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Dataset&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt;
&lt;span class="n"&gt;_stub&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;load_from_disk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kw&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;modules&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;datasets&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_stub&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Patch 6: &lt;code&gt;--no-build-isolation&lt;/code&gt; for Cython extensions
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;monotonic_align&lt;/code&gt; (used by StyleTTS2) and similar packages build with their&lt;br&gt;
own ephemeral build-env via pip's build isolation. That ephemeral env&lt;br&gt;
re-installs &lt;code&gt;numpy&lt;/code&gt; and &lt;code&gt;cython&lt;/code&gt; and may pull AVX2 wheels. Use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--no-build-isolation&lt;/span&gt; &lt;span class="nt"&gt;--no-deps&lt;/span&gt; &amp;lt;package&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This forces the build to use your already-installed (pinned) numpy+cython.&lt;/p&gt;

&lt;h2&gt;
  
  
  Per-project status
&lt;/h2&gt;

&lt;h3&gt;
  
  
  F5-TTS
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Inference and training both work after patches 1–5.&lt;/li&gt;
&lt;li&gt;See companion gist for a minimal trainer that bypasses &lt;code&gt;datasets&lt;/code&gt;/&lt;code&gt;accelerate&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Issue filed: SWivid/F5-TTS#1292 (EMA-only checkpoint structure).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  StyleTTS2
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Inference and fine-tune both work after patches 1, 2, 3, 4, 6.&lt;/li&gt;
&lt;li&gt;PRs filed: yl4579/StyleTTS2#361 (weights_only=False), #362 (drop pandas).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  kokoro
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Inference works (via the &lt;code&gt;kokoro-onnx&lt;/code&gt; ONNX runtime path; PyTorch path
blocked by upstream dep pinning, not CPU).&lt;/li&gt;
&lt;li&gt;Issue filed: hexgrad/kokoro#321 (broken &lt;code&gt;misaki&amp;gt;=0.7.16&lt;/code&gt; PyPI pin).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  whisper.cpp
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Works out of the box. Pure C++, no Python wheels involved. CUDA inference
on the GPU.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What does &lt;em&gt;not&lt;/em&gt; work
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;pyarrow&lt;/code&gt; source build: succeeds eventually but the resulting library
still uses SSE4.1 in places (Apache Arrow's CMake &lt;code&gt;ARROW_SIMD_LEVEL=NONE&lt;/code&gt;
doesn't cover everything). Not worth the multi-hour build.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;numpy 2.x&lt;/code&gt;: even a from-source build emits AVX-needing code via the bundled
OpenBLAS wheels. Stick with 1.26.4.&lt;/li&gt;
&lt;li&gt;Anything using &lt;code&gt;bitsandbytes&lt;/code&gt; int8/int4 quantisation: those kernels
hard-require AVX2.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Worth trying if you have AVX (no AVX2)
&lt;/h2&gt;

&lt;p&gt;A 2011-2012 era Sandy Bridge or Ivy Bridge Intel CPU has AVX but not AVX2&lt;br&gt;
(AVX2 arrived with Haswell in 2013). Most of the patches above still apply,&lt;br&gt;
but you may not need patch 1 (dynamo SIGFPE),&lt;br&gt;
and pyarrow/datasets/pandas may install (just not the AVX2-specific code&lt;br&gt;
paths). Try without the uninstalls first.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;If you want to do TTS fine-tuning on hardware below x86-64-v2:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Do inference work on the GPU. Keep CPU-side code to file I/O and JSON.&lt;/li&gt;
&lt;li&gt;Pin numpy 1.26 + torch 2.7 + transformers 4.57.&lt;/li&gt;
&lt;li&gt;Stub or uninstall &lt;code&gt;datasets&lt;/code&gt;/&lt;code&gt;pyarrow&lt;/code&gt;/&lt;code&gt;pandas&lt;/code&gt;/&lt;code&gt;torchcodec&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Patch &lt;code&gt;torch._dynamo&lt;/code&gt; once per torch install.&lt;/li&gt;
&lt;li&gt;Pre-compute mel-spectrograms offline.&lt;/li&gt;
&lt;li&gt;Train at &lt;code&gt;num_workers=0&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The rig produces useful output. It's not a fast-iteration machine — every&lt;br&gt;
upstream upgrade re-breaks something — but for fine-tuning (which doesn't&lt;br&gt;
need a fast-iteration machine) it's economical: an RTX 3060 12 GB on a&lt;br&gt;
2010-era CPU running real-world TTS workloads.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally posted at &lt;a href="https://netlinux-ai.github.io/2026/05/09/non-avx2-cpu-tts-compat/" rel="noopener noreferrer"&gt;netlinux-ai.github.io/2026/05/09/non-avx2-cpu-tts-compat/&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>machinelearning</category>
      <category>tts</category>
      <category>linux</category>
    </item>
  </channel>
</rss>
