Running a non-English audiobook through an AI voice pipeline: what's involved

#ai #audio #tts #i18n

Most TTS-based audiobook pipelines are built around English. The voice library is stocked with English voices, the dialogue heuristics assume English punctuation, and the auto-assignment models are trained on English-language conventions. When a writer wants to run a French, German, Spanish, or Mandarin manuscript through the same pipeline, what actually changes? Some pieces port over cleanly. Others don't. This is a walk through what we've learned building multilingual support into AudioProducer.ai: what the pipeline does when the source isn't English, and where the rough edges still are.

Voice selection across languages

The voice library has 132 voices at the time of writing, and about 64 of them are tagged for the multilingual model. "Multilingual" here is a model capability, not a guarantee that the voice sounds equally good in every language. The underlying speech model handles phonetic mapping across the languages it was trained on, so a voice that ships as "American English neutral" can produce intelligible French or Spanish output. But cadence, intonation, and the small prosodic choices that make narration sound native are language-specific learned patterns. Some voices carry their non-English performance further than others.

For a writer starting a non-English project, the practical advice is to evaluate a few of the multilingual-tagged voices before committing to a narrator. Voice library previews give you a feel for each voice's range, but the honest test for a specific book is to generate a short sample paragraph in the target language inside the editor. That's the cheapest way to find out whether a voice carries the language well enough for your purpose.

Per-character voice routing when the prose has multiple speakers

Auto-Assign Characters tags every line in a chapter by speaker. Narrator, named characters, in-world labels - the AI walks the prose and attaches a speaker tag to each line. The mechanism is language-agnostic in shape: the model identifies dialogue boundaries and attribution patterns, then ties each tagged line to a character.

In practice, non-English prose introduces two adjustments.

First, dialogue punctuation conventions vary by language. French dialogue uses em-dashes and guillemets rather than the curly quotation marks typical of English prose; Spanish often uses em-dashes too; German uses either low-high quotation marks („…“) or inward-pointing guillemets (»…«), depending on house style. Auto-Assign reads these conventions, but the cleaner the source punctuation, the cleaner the first pass. Standardizing dialogue punctuation in the source manuscript, by picking one convention and applying it consistently, saves several rounds of hand-correction on the auto-assigned output.
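Before standardizing, it helps to know which conventions a manuscript actually mixes. The following is a minimal sketch of such an audit, not a description of how Auto-Assign parses dialogue: the function name, the convention labels, and the line-based heuristic (a line is classified by a leading dash, or else by the first quotation character it contains) are all assumptions made for illustration.

```python
from collections import Counter

# Opening characters mapped to a convention label. The labels are illustrative;
# they are not names used anywhere in the product.
OPENERS = {
    "\u00ab": "guillemets",           # « as in French «Bonjour»
    "\u00bb": "reversed guillemets",  # » as in German »Hallo«
    "\u201e": "low-high quotes",      # „ as in German „Hallo“
    "\u201c": "curly quotes",         # “ as in English “Hello”
    '"': "straight quotes",
}
DASHES = ("\u2014", "\u2013")  # em dash and en dash used as dialogue openers


def dialogue_conventions(text: str) -> Counter:
    """Count how many lines use each dialogue-marking convention.

    Heuristic: a line that starts with a dash counts as dash dialogue;
    otherwise the first quotation character on the line decides. Lines
    without any marker (plain narration) are ignored.
    """
    counts: Counter = Counter()
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith(DASHES):
            counts["leading dash"] += 1
            continue
        first = next((ch for ch in stripped if ch in OPENERS), None)
        if first is not None:
            counts[OPENERS[first]] += 1
    return counts
```

If the counter comes back with more than one key for a single-language manuscript, that is a hint the source would benefit from a normalization pass before running Auto-Assign Characters.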

Second, voice-per-character routing surfaces in the character panel after Auto-Assign completes. If a chosen voice doesn't carry a particular language well for a specific character, the panel is where you swap it out. Same workflow as English, with the cross-language constraint that the candidate voices need to come from the multilingual-tagged subset.
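The cross-language constraint amounts to a filter over the voice library. Here is a small sketch under stated assumptions: the `Voice` record, its `multilingual` flag, and `swap_candidates` are hypothetical names chosen for the example and do not reflect the platform's actual data model.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Voice:
    voice_id: str
    label: str
    multilingual: bool  # hypothetical flag mirroring the library's multilingual tag


def swap_candidates(
    library: List[Voice], current_voice_id: str, content_is_english: bool
) -> List[Voice]:
    """Return voices that could replace the current one for a character.

    For non-English content only multilingual-tagged voices qualify;
    for English content every other voice in the library is a candidate.
    """
    return [
        voice
        for voice in library
        if voice.voice_id != current_voice_id
        and (content_is_english or voice.multilingual)
    ]
```

The filter narrows the list but does not rank it; picking among the remaining candidates still comes down to listening to a sample in the target language.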

Sound design across languages

The Auto-Assign Sounds pass - music beds, ambient soundscapes, one-shot sound effects - is genuinely language-agnostic. Sound effects don't know what language the chapter was written in. A thunderclap is a thunderclap; rain over a city is rain over a city. The model that selects sounds reads the scene's content - storm, fight, quiet interior, scene transition - not its lexicon.

This is the part of the pipeline that ports across languages with no adjustment. A Spanish-language historical thriller and an English-language one built from similar scenes end up with broadly similar Auto-Assign Sounds output. Music selection rules (genre, energy, mood) operate at the same layer. Soundscapes earn their place by what they signal narratively, which is upstream of language.

The practical implication: when planning a non-English audiobook through the platform, the voice layer is where the language-specific work happens. The sound design layer is the same pipeline you'd run on an English book.

UI language vs. content language

The editor UI ships in 8 languages: English, French, German, Spanish, Portuguese, Chinese, Hindi, and Arabic. The UI language is independent of the content language. A French-speaking writer can drive an English audiobook project with the editor in French. A Hindi-speaking writer can drive a Spanish audiobook project with the editor in Hindi.

The two layers are decoupled by design. UI locale is picked from the Accept-Language header and an optional locale cookie at server-side render (SSR) time, with translation strings living in their own per-locale files. The content language is determined separately, per chapter, when audio is generated: the model auto-detects the source language of the prose.
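The UI-locale half of that split can be sketched in a few lines. This is an illustration rather than the editor's actual code: the locale codes, the English fallback, the function names, and the rule that a valid cookie takes precedence over the header are all assumptions, since the description above only says that both inputs are consulted.

```python
from typing import List, Optional

# Assumed codes for the eight UI languages listed above.
SUPPORTED = ("en", "fr", "de", "es", "pt", "zh", "hi", "ar")
DEFAULT = "en"  # assumed fallback


def parse_accept_language(header: str) -> List[str]:
    """Return primary language subtags ordered by descending q-value."""
    ranked = []
    for position, part in enumerate(header.split(",")):
        piece = part.strip()
        if not piece:
            continue
        tag, _, params = piece.partition(";")
        quality = 1.0  # a missing q parameter means full preference
        params = params.strip().lower()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0  # treat a malformed q-value as "not acceptable"
        primary = tag.strip().lower().split("-")[0]
        if quality > 0 and primary:
            # Sort key: higher quality first, then original header order.
            ranked.append((-quality, position, primary))
    ranked.sort()
    return [primary for _, _, primary in ranked]


def pick_ui_locale(accept_language: Optional[str], cookie_locale: Optional[str]) -> str:
    """Choose a UI locale; an explicit, supported cookie wins over the header."""
    if cookie_locale:
        primary = cookie_locale.strip().lower().replace("_", "-").split("-")[0]
        if primary in SUPPORTED:
            return primary
    for primary in parse_accept_language(accept_language or ""):
        if primary in SUPPORTED:
            return primary
    return DEFAULT
```

Nothing in this function looks at the manuscript, which is the point: a browser sending `hi` gets a Hindi interface regardless of what language the chapters are written in.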

The reason this matters: writers shouldn't have to use English buttons and English menus to produce a French or Spanish book just because that's how the platform happened to start. The two locales live in different parts of the stack and shouldn't be welded together in the user's workflow either.

What's still hard

The honest part. Multilingual TTS at the quality level audiobook listeners expect has rough edges, and it's worth being explicit about which ones.

Accent within a language is still hard. A "multilingual French" voice may render Parisian French well and Quebec French unevenly; a "multilingual English" voice may handle American narration cleanly and an Indian-English or Scottish-English character less convincingly. Audiobook listeners are sensitive to accent authenticity, and the available voices don't yet span every regional variant cleanly.

Code-switching in dialogue is also rough. A character who speaks two languages within one paragraph - common in immigrant fiction, regional literary fiction, and many real human conversations - pushes the model into edge cases. Sometimes the switch lands gracefully; sometimes the model forces one language across the boundary.
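One cheap way to find some of these paragraphs ahead of time is to flag any paragraph whose letters come from more than one writing system. The sketch below is a rough pre-review heuristic, not something the pipeline is known to do: it approximates a character's script by the first word of its Unicode name, so it only catches switches that cross scripts (Hindi and English, say). Switches between languages that share a script, such as French and English, would need a language-identification model instead, and Japanese text, which legitimately mixes several scripts, would be flagged spuriously.

```python
import unicodedata
from typing import List, Set


def scripts_in(text: str) -> Set[str]:
    """Approximate the set of scripts used by the letters in text.

    The first word of each letter's Unicode name (LATIN, DEVANAGARI,
    CJK, ARABIC, ...) stands in for its script. This is a heuristic,
    not the Unicode Script property.
    """
    found: Set[str] = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                found.add(name.split(" ")[0])
    return found


def flag_mixed_script(paragraphs: List[str]) -> List[int]:
    """Return indexes of paragraphs whose letters span more than one script."""
    return [i for i, p in enumerate(paragraphs) if len(scripts_in(p)) > 1]
```

Flagged paragraphs are candidates for a closer listen after generation, since they are where a forced single-language reading is most likely to show up.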

Idiomatic prosody is the third rough edge. Languages carry expectations about where a sentence's emphasis lands, how a question rises, how a punchline pauses. These are learned per-language and can drift on voices whose training data was thinner in the target language than in English.

What this means operationally: if you're producing in a language you speak natively or near-natively, you'll catch what's off and route around it. If you're producing in a language you don't speak, route the audio past a native-speaker reviewer before treating the production as final. The Auto-Assigns are starting points, not final answers - true in English, and more emphatically true outside it.

Wrapping up

Multilingual audiobook production through an AI pipeline is realistic for many languages and not yet realistic for all of them. The voice layer carries language-specific quality, the character routing layer carries language-specific punctuation conventions, and the sound design layer carries across languages without changes. Knowing which layer is sensitive to language and which isn't is most of the work in planning a non-English project on the platform.

If you want to try a non-English manuscript through the pipeline, the free tier supports 1,200 words per month - enough for a short chapter sample to evaluate voice quality in your target language before committing to a paid plan. The voice library at audioproducer.ai is where the multilingual-tagged voices live; preview is the cheapest way to see whether the pipeline handles your specific language well enough for your specific book.
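To stay inside that allowance, a sample can be cut at a paragraph boundary rather than mid-sentence. The following is a minimal sketch with assumptions worth flagging: `sample_within_budget` is a hypothetical helper, paragraphs are assumed to be separated by blank lines, and words are counted by splitting on whitespace, which may differ from how the platform meters usage and does not work for languages written without spaces, such as Mandarin.

```python
from typing import List

FREE_TIER_WORDS = 1200  # monthly allowance quoted above


def sample_within_budget(chapter: str, budget: int = FREE_TIER_WORDS) -> str:
    """Take whole paragraphs from the start of a chapter while they fit the budget.

    Stops at the first paragraph that would push the total over the budget,
    so the sample always ends on a paragraph boundary.
    """
    kept: List[str] = []
    used = 0
    for paragraph in chapter.split("\n\n"):
        words = len(paragraph.split())  # whitespace-delimited word count
        if words == 0:
            continue  # skip empty chunks produced by extra blank lines
        if used + words > budget:
            break
        kept.append(paragraph.strip())
        used += words
    return "\n\n".join(kept)
```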

Disclosure: this article was drafted by an AI agent working on behalf of the AudioProducer.ai team.