
Turbo Electric

Posted on • Originally published at netlinux-ai.github.io

Running modern Python TTS toolchains on non-AVX2 CPUs


Notes from getting F5-TTS, StyleTTS2, kokoro/Misaki, and whisper.cpp to work
on an AMD Phenom II X6 1090T (2010 K10/Family-10h architecture).

The CPU has SSE/SSE2/SSE3/SSE4a, plus CX16/POPCNT/LAHF — but no SSE4.1, no
SSE4.2, no AVX, no AVX2, no FMA, no F16C. That puts it below the modern
x86-64-v2 baseline. A growing share of binary Python wheels in the AI
ecosystem assume v2 or v3, so they SIGILL or SIGFPE at import. This is a
ground-truth list of what we hit and what worked.

Quick triage

If your CPU is below x86-64-v2 (in particular, missing SSE4.1), expect:

  • pyarrow static-init pinsrq SIGILL on import
  • numpy 2.x wheel SIGILL on import (numpy 1.26.4 still has a fallback path)
  • torch 2.10+ wheel SIGFPE in torch._dynamo on import
  • pandas modern wheels SIGILL on tokenisation
  • monotonic_align and other Cython extensions: build-from-source SIGILL
  • DataLoader subprocess workers SIGFPE re-importing torch

If your CPU is x86-64-v2 (Nehalem ~2008 or newer Intel; Bulldozer ~2011 or
newer AMD) but missing AVX/AVX2, you'll still hit some of these, just fewer of them.
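You can check where a machine sits before installing anything by reading the flag list the Linux kernel reports in /proc/cpuinfo. A minimal sketch — the function names are illustrative, and the v3 set below is a subset of the full x86-64-v3 definition:

```python
import os

# Kernel flag names for (a subset of) the x86-64-v2 and -v3 feature levels.
V2_FLAGS = {"cx16", "lahf_lm", "popcnt", "sse4_1", "sse4_2", "ssse3"}
V3_FLAGS = {"avx", "avx2", "bmi1", "bmi2", "f16c", "fma", "movbe"}

def missing_flags(cpu_flags, required):
    """Return the required flags this CPU does not report, sorted."""
    return sorted(required - set(cpu_flags))

def read_cpu_flags(path="/proc/cpuinfo"):
    """Parse the first 'flags' line from /proc/cpuinfo (Linux only)."""
    with open(path) as f:
        for line in f:
            if line.startswith("flags"):
                return line.split(":", 1)[1].split()
    return []

if __name__ == "__main__" and os.path.exists("/proc/cpuinfo"):
    flags = read_cpu_flags()
    print("missing for v2:", missing_flags(flags, V2_FLAGS))
    print("missing for v3:", missing_flags(flags, V3_FLAGS))
```

If "missing for v2" lists sse4_1, expect the full breakage list above; if only the v3 set is missing, you're in the easier AVX-less-Intel territory covered at the end of this post.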

Working pin-set

These are versions empirically verified to import and run on this CPU:

package                                    version        why
numpy                                      1.26.4         last with a non-AVX2 fallback path; from-source builds OK
torch                                      2.7.0          last with a usable _dynamo init that doesn't SIGFPE
torchaudio                                 2.7.0          last with the soundfile backend (2.10+ requires torchcodec)
transformers                               4.57.3         5.x triggers dynamo init at import time via torch.compiler.disable
numba / scipy / librosa                    latest         binary wheels OK
pyarrow / pandas / datasets / torchcodec   (uninstalled)  wheels assume SSE4.1+; not actually needed for inference

For a fresh install, layer the pins after the project install:

pip install --prefer-binary <project>           # whatever you actually want
pip install --prefer-binary --force-reinstall --no-deps \
    "torch==2.7.0" "torchaudio==2.7.0" \
    "transformers==4.57.3" "numpy<2"
pip uninstall -y datasets pyarrow pyarrow-hotfix pandas torchcodec

Patches required

Patch 1: torch._dynamo SIGFPE on int division by zero

Even after pinning to torch 2.7.0, the very first dynamo init still SIGFPEs
on this CPU. Cause: torch._dynamo.variables.torch_function.populate_builtin_to_tensor_fn_map()
probes Python operators on dummy tensors, including tensor // 0 (integer
floor-divide by zero). Newer Intel CPUs trap this into a Python
ZeroDivisionError via a signal handler. AMD Phenom II just SIGFPEs.

The function's output isn't actually needed for inference. Stub it:

F=$(python -c "import torch._dynamo.variables.torch_function as m; print(m.__file__)")
cp "$F" "$F.orig"
sed -i "0,/    global BUILTIN_TO_TENSOR_FN_MAP/s//    return  # patched: SIGFPE on Phenom II\n    global BUILTIN_TO_TENSOR_FN_MAP/" "$F"

This is non-invasive — only affects code that uses torch.compile() /
dynamo paths, which most fine-tuning trainers don't.
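If you'd rather not edit installed sources, the same stub can be applied at runtime, assuming (as with torch 2.7.0 here) the probe runs at first dynamo init rather than at module import. A sketch of the pattern — stub_probe is an illustrative helper, demonstrated on a dummy module so it's self-contained:

```python
import types

def stub_probe(mod, name="populate_builtin_to_tensor_fn_map"):
    """Replace an init-time probe function with a no-op, keeping the
    original around under <name>_orig in case a code path needs it."""
    setattr(mod, name + "_orig", getattr(mod, name))
    setattr(mod, name, lambda *a, **kw: None)

# Demonstrated on a dummy module. With torch you would run, before any
# model/trainer import triggers dynamo init:
#   import torch._dynamo.variables.torch_function as tf
#   stub_probe(tf)
dummy = types.ModuleType("dummy")
dummy.populate_builtin_to_tensor_fn_map = lambda: 1 // 0  # stand-in for the crash
stub_probe(dummy)
dummy.populate_builtin_to_tensor_fn_map()  # now a harmless no-op
```

The sed patch is still the more reliable option, since it also covers code that imports torch before your script gets a chance to run.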

Patch 2: GPU-only mel-spectrogram computation

torch.matmul on the CPU SIGFPEs on this machine, so anything that calls
torchaudio's MelSpectrogram on the CPU dies. For training pipelines that
compute mels in the data loader, this is fatal.

Two ways to fix:

a) Move the mel module to GPU (cheap audio→mel transfer per sample):

to_mel = torchaudio.transforms.MelSpectrogram(...).to("cuda")
def preprocess(wave):
    wave = torch.from_numpy(wave).to("cuda")  # per-sample audio→GPU transfer
    mel = to_mel(wave)                        # computed on GPU, avoiding CPU matmul
    return mel.cpu()  # back to CPU for DataLoader collator

b) Pre-compute all mels once on GPU, save to disk, load at training time
(example script).

(b) is faster overall — no per-sample audio→GPU transfer, just torch.load.

Patch 3: num_workers=0 everywhere

DataLoader spawns subprocess workers that re-import torch and re-run
_dynamo init. Even with patch 1, the patched source isn't always picked up
in the subprocess. Set num_workers=0 to keep all loading in the main process.

Patch 4: weights_only=False for older checkpoint formats

PyTorch 2.6+ flipped the default. If you load checkpoints saved before 2.6
that contain pickled Python objects, you need torch.load(path, weights_only=False).
Affected: many published TTS pretrained models (StyleTTS2's ASR/JDC/PLBERT
modules, F5-TTS in some cases).

Patch 5: Stub datasets for transformers' lazy loader

transformers.utils.import_utils._is_package_available("datasets") calls
importlib.util.find_spec("datasets"), which raises ValueError if
__spec__ is None. If you provide a stub datasets module via
sys.modules (to avoid pulling pyarrow), it must have a real ModuleSpec:

import importlib.machinery, types, sys
_stub = types.ModuleType("datasets")
_stub.__spec__ = importlib.machinery.ModuleSpec("datasets", loader=None)
_stub.Dataset = type("Dataset", (), {})
_stub.load_from_disk = lambda *a, **kw: None
sys.modules["datasets"] = _stub
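To see why the real ModuleSpec matters: importlib.util.find_spec raises ValueError for any module already in sys.modules whose __spec__ is None, and a bare types.ModuleType has exactly that. A self-contained demonstration, using a made-up module name:

```python
import importlib.machinery
import importlib.util
import sys
import types

# A bare stub: ModuleType leaves __spec__ as None.
stub = types.ModuleType("fake_datasets")
sys.modules["fake_datasets"] = stub

try:
    importlib.util.find_spec("fake_datasets")
    raised = False
except ValueError:  # "fake_datasets.__spec__ is None"
    raised = True

# Give it a real (loader-less) spec and the same lookup succeeds.
stub.__spec__ = importlib.machinery.ModuleSpec("fake_datasets", loader=None)
spec = importlib.util.find_spec("fake_datasets")
```

This is the same failure mode transformers' _is_package_available hits when handed a spec-less stub.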

Patch 6: --no-build-isolation for Cython extensions

monotonic_align (used by StyleTTS2) and similar packages build with their
own ephemeral build-env via pip's build isolation. That ephemeral env
re-installs numpy and cython and may pull AVX2 wheels. Use:

pip install --no-build-isolation --no-deps <package>

This forces the build to use your already-installed (pinned) numpy+cython.

Per-project status

F5-TTS

  • Inference and training both work after patches 1–5.
  • See companion gist for a minimal trainer that bypasses datasets/accelerate.
  • Issue filed: SWivid/F5-TTS#1292 (EMA-only checkpoint structure).

StyleTTS2

  • Inference and fine-tune both work after patches 1, 2, 3, 4, 6.
  • PRs filed: yl4579/StyleTTS2#361 (weights_only=False), #362 (drop pandas).

kokoro

  • Inference works (via the kokoro-onnx ONNX runtime path; PyTorch path blocked by upstream dep pinning, not CPU).
  • Issue filed: hexgrad/kokoro#321 (broken misaki>=0.7.16 PyPI pin).

whisper.cpp

  • Works out of the box: pure C++, no Python wheels involved. Inference runs on the GPU via CUDA.

What does not work

  • pyarrow source build: succeeds eventually but the resulting library still uses SSE4.1 in places (Apache Arrow's CMake ARROW_SIMD_LEVEL=NONE doesn't cover everything). Not worth the multi-hour build.
  • numpy 2.x: even a from-source build pulls in AVX-requiring code through its bundled OpenBLAS. Stick with 1.26.4.
  • Anything using bitsandbytes int8/int4 quantisation: those kernels hard-require AVX2.

Worth trying if you have AVX (no AVX2)

A 2011–2012 Intel CPU (Sandy Bridge / Ivy Bridge) has AVX but no AVX2. Most of the
patches above still apply, but you may not need patch 1 (dynamo SIGFPE),
and pyarrow/datasets/pandas may install (just not the AVX2-specific code
paths). Try without the uninstalls first.

Summary

If you want to do TTS fine-tuning on hardware below x86-64-v2:

  1. Do inference work on the GPU. Keep CPU-side code to file I/O and JSON.
  2. Pin numpy 1.26 + torch 2.7 + transformers 4.57.
  3. Stub or uninstall datasets/pyarrow/pandas/torchcodec.
  4. Patch torch._dynamo once per torch install.
  5. Pre-compute mel-spectrograms offline.
  6. Train at num_workers=0.

The rig produces useful output. It's not a fast-iteration machine — every
upstream upgrade re-breaks something — but for fine-tuning (which doesn't
need a fast-iteration machine) it's economical: an RTX 3060 12 GB on a
2010-era CPU running real-world TTS workloads.


Originally posted at netlinux-ai.github.io/2026/05/09/non-avx2-cpu-tts-compat/.
