Tarun yadav

Posted on Apr 20 • Originally published at murmurtts.com

How to Create an AI Audiobook (Full Workflow)

#audiobook #tutorial #acx #audible

From manuscript to finished audiobook: a complete guide to producing audiobooks with AI text-to-speech, including ACX/Audible specs.

The Traditional Audiobook Problem

Producing an audiobook the traditional way is a significant investment. A 10-hour audiobook (roughly 80,000 words) requires 40+ hours of studio time for recording and editing. Professional narrators charge $200 to $400 per finished hour, putting the total cost between $2,000 and $4,000 for a typical novel. Add studio rental, mastering, and quality control, and you can easily reach $5,000+.

For indie authors who have already spent months writing their book, that cost is often prohibitive. Many books never become audiobooks simply because the economics do not work for titles that might sell a few hundred copies.

AI text-to-speech changes the math. With Murmur, you can generate a full audiobook on your Mac for the cost of the software ($49). The process takes hours instead of weeks, and you can iterate, re-generate, and update chapters at will.

Step 1: Prepare Your Manuscript

Good audiobook production starts with good source text. Your manuscript needs specific preparation before TTS generation:

Remove all formatting. Bold, italic, headers, and footnotes should be stripped. The TTS engine reads plain text.
Split into chapter files. One text file per chapter keeps your project organized and makes regeneration easier.
Handle dialogue tags explicitly. Quotation marks are not audible, so make sure attributions such as "she said" are present wherever a listener could otherwise lose track of who is speaking.
Spell out numbers, abbreviations, and special characters. "$2.5M" should become "two and a half million dollars." "Dr." should become "Doctor."
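The preparation steps above can be sketched in a few lines of Python. This is a minimal sketch, not part of Murmur: it assumes the manuscript is Markdown-style text, that chapters begin with a line like "Chapter 1", and that a small hand-maintained table of abbreviations is enough. The function names and the abbreviation table are illustrative.

```python
import re

# Illustrative replacements only; extend this table for your own manuscript.
ABBREVIATIONS = {
    "Dr.": "Doctor",
    "Mr.": "Mister",
    "vs.": "versus",
}

def strip_markdown(text: str) -> str:
    """Remove common Markdown markers so only plain text remains."""
    text = re.sub(r"^#{1,6}[ \t]*", "", text, flags=re.MULTILINE)  # header marks
    text = re.sub(r"\[\^[^\]]+\]", "", text)                       # footnote refs like [^1]
    text = re.sub(r"(\*\*|__)(.+?)\1", r"\2", text)                # bold
    text = re.sub(r"(\*|_)(.+?)\1", r"\2", text)                   # italic
    return text

def expand_abbreviations(text: str) -> str:
    """Replace each known abbreviation with its spoken form."""
    for short, spoken in ABBREVIATIONS.items():
        text = re.sub(r"\b" + re.escape(short), spoken, text)
    return text

def split_chapters(manuscript: str) -> list[str]:
    """Split before every line that starts with 'Chapter <number>' (assumed convention)."""
    parts = re.split(r"(?m)^(?=Chapter \d+)", manuscript)
    return [p.strip() for p in parts if p.strip()]

sample = "# Chapter 1\nDr. Lee was **very** *calm*.[^1]\n# Chapter 2\nThe end."
chapters = split_chapters(expand_abbreviations(strip_markdown(sample)))
for number, chapter in enumerate(chapters, start=1):
    # Zero-padded names keep the files in order when sorted.
    print(f"chapter_{number:02d}.txt:", repr(chapter))
```

Numbers and currency amounts such as "$2.5M" are deliberately left out of the sketch, since they usually need a manual pass or a dedicated number-to-words tool.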

Step 2: Choose Your Narrator Voice

For an audiobook, voice consistency is everything. Listeners will spend 6 to 12 hours with this voice. It needs to be pleasant, clear, and sustainable across long passages. In Murmur, audition voices with a full page of text, not just a sentence. A voice that sounds great for 10 seconds might become fatiguing after 10 minutes.

For non-fiction, choose a voice with authority and clarity. Kokoro voices excel here because they maintain consistent pacing and tone. For fiction, you want more expressiveness. Fish Audio S2 Pro produces the most natural prosody, handling dialogue and description shifts with genuine nuance. Chatterbox adds emotional range that works well for dramatic fiction.

Voice Cloning Option

If you want the audiobook in your own voice (author-narrated books sell well), Murmur's voice cloning feature lets you create a voice profile from a 10-second recording. The clone captures your pitch, timbre, and speaking rhythm. It will not be an exact replica, but it will be recognizably you. This gives you the personal touch of self-narration without the 40+ hours of recording.

Step 3: Generate Chapter by Chapter

Work through your book one chapter at a time. For each chapter: paste the text, verify the voice and speed settings match your previous chapters, generate, and listen to the first minute. If it sounds right, export and move to the next chapter. If a section sounds off (mispronunciation, odd pacing), edit that portion of the script and regenerate just that chapter.

For a 10-hour audiobook, expect generation to take anywhere from under an hour to a few hours on Apple Silicon hardware, depending on the engine. Kokoro generates fastest (30 to 45 seconds per 1,500 words). Fish Audio S2 Pro takes longer (2 to 3 minutes per 1,500 words) but produces more polished output. Choose based on your quality requirements.
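To turn those per-1,500-word figures into a whole-book estimate, a rough calculation like the one below may help. It uses only the numbers quoted in this article (an 80,000-word manuscript and the per-engine timings) and assumes generation time scales linearly with word count, which is a simplification.

```python
CHUNK_WORDS = 1_500

# (low, high) seconds per 1,500 words, taken from the figures quoted above.
ENGINE_SECONDS = {
    "Kokoro": (30, 45),
    "Fish Audio S2 Pro": (120, 180),
}

def generation_minutes(word_count, engine):
    """Return the (low, high) estimated generation time in minutes."""
    low, high = ENGINE_SECONDS[engine]
    chunks = word_count / CHUNK_WORDS
    return chunks * low / 60, chunks * high / 60

for engine in ENGINE_SECONDS:
    lo, hi = generation_minutes(80_000, engine)
    print(f"{engine}: {lo:.0f} to {hi:.0f} minutes")
```

For 80,000 words this works out to roughly 27 to 40 minutes with Kokoro and roughly 107 to 160 minutes with Fish Audio S2 Pro, before any regeneration of problem chapters.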

Step 4: Review and Polish

Listen to each chapter fully. Mark timestamps where you hear issues: mispronunciations, awkward pauses, unnatural emphasis. For most chapters, the output will be clean. For problem sections, adjust the script text (rephrase a sentence, add a comma for a pause, spell out a tricky word) and regenerate that chapter.

After all chapters pass review, normalize the volume across all files. Audio editing tools like Audacity (free) can batch-process volume normalization. This ensures chapter 1 is not noticeably louder or quieter than chapter 20.
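The idea behind normalization is simple arithmetic: measure each chapter's average (RMS) and peak level, then apply the gain that moves the average to a shared target without pushing the peak too high. The sketch below illustrates that calculation only. The target of -20 dB, the -3 dB peak ceiling, and the chapter measurements are assumptions for the example, and an audio editor would do the actual processing.

```python
def gain_to_target(rms_db, peak_db, target_rms_db=-20.0, peak_ceiling_db=-3.0):
    """Return the gain in dB to apply to one chapter.

    The gain moves the chapter's RMS level toward the target, but is capped
    so that the peak never rises above the ceiling.
    """
    wanted = target_rms_db - rms_db        # gain that would hit the RMS target
    headroom = peak_ceiling_db - peak_db   # largest gain the peak allows
    return min(wanted, headroom)

# Hypothetical measurements: (RMS dB, peak dB) per chapter.
measured = {
    "chapter_01": (-24.5, -6.0),
    "chapter_02": (-19.0, -2.5),
}

for name, (rms_db, peak_db) in measured.items():
    print(f"{name}: apply {gain_to_target(rms_db, peak_db):+.1f} dB")
```

In this example chapter_01 gets +3.0 dB rather than the +4.5 dB it would need to reach the target, because the peak ceiling limits it. Closing the remaining gap would take a limiter or compressor rather than plain gain.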

Step 5: Meet ACX/Audible Technical Specs

If you plan to distribute through ACX (Amazon's audiobook platform for Audible), your files must meet specific technical requirements:

Format: MP3 at 192kbps or higher, CBR (constant bit rate), with a 44.1kHz sample rate. Keep WAV files as your masters and encode to MP3 for upload.
Each chapter must be a separate file, named sequentially.
Peak volume: must not exceed -3dB.
RMS (average volume): between -23dB and -18dB.
Noise floor: below -60dB.
Each file must have 0.5 to 1 second of room tone (silence) at the beginning and 1 to 5 seconds at the end.
Opening credits file: must include the book title, author name, and narrator credit.
Closing credits file: must include an end-of-book announcement.

Murmur exports clean WAV files, which serve as masters for the MP3 files you upload. Audacity can handle the MP3 export, volume normalization, and room tone with its built-in tools. The ACX Check plugin for Audacity can verify peak level, RMS, and noise floor before submission.
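For readers who want to see what the level checks amount to, here is a minimal sketch that measures peak and RMS for a list of 16-bit PCM samples and compares them with the thresholds listed above. It is an approximation for illustration, not a replacement for ACX Check: the function names are made up for this example, the input is assumed to be mono 16-bit samples (the standard library's wave module can read these from a WAV file), and the noise floor is assumed to be measured separately on a stretch of room tone.

```python
import math

FULL_SCALE = 32768  # largest magnitude of a 16-bit PCM sample

def to_db(amplitude):
    """Convert a linear amplitude to dB relative to full scale."""
    return 20 * math.log10(amplitude / FULL_SCALE) if amplitude > 0 else float("-inf")

def measure(samples):
    """Return (peak_db, rms_db) for a sequence of 16-bit samples."""
    peak = max(abs(s) for s in samples)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return to_db(peak), to_db(rms)

def acx_level_problems(peak_db, rms_db, noise_floor_db):
    """Return a list of human-readable problems; empty means the levels pass."""
    problems = []
    if peak_db > -3.0:
        problems.append(f"peak {peak_db:.1f} dB exceeds -3 dB")
    if not -23.0 <= rms_db <= -18.0:
        problems.append(f"RMS {rms_db:.1f} dB is outside -23 to -18 dB")
    if noise_floor_db > -60.0:
        problems.append(f"noise floor {noise_floor_db:.1f} dB is above -60 dB")
    return problems

# Synthetic test signal: one second of a 440 Hz tone at 15% of full scale.
tone = [round(0.15 * FULL_SCALE * math.sin(2 * math.pi * 440 * n / 44100))
        for n in range(44100)]
peak_db, rms_db = measure(tone)
print(acx_level_problems(peak_db, rms_db, noise_floor_db=-70.0))
```

Real narration is not a steady tone, so measured values will vary across a chapter; the point of the sketch is only to show which numbers the three thresholds refer to.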

Timeline Comparison

Traditional audiobook recording: 40+ hours for a 10-hour book. AI generation with Murmur: a few hours of generation plus review time.

The timeline difference is dramatic. A traditional 10-hour audiobook requires 40+ hours of studio time for recording and editing, plus weeks of back-and-forth with narrators and engineers. With AI generation, you can go from manuscript to finished audiobook in a weekend. The actual generation takes a few hours at most. The bulk of your time goes to review and quality checking, which is the part that should take time.

Frequently Asked Questions

Does ACX accept AI-narrated audiobooks?

ACX updated its policies in 2024 to require disclosure of AI-generated narration. AI-narrated books are accepted but must be labeled accordingly. The technical quality standards (audio specs, consistency, clarity) still apply regardless of how the narration was produced.

Is AI narration good enough for fiction?

For single-narrator fiction (most novels), modern TTS handles the job well. Models like Fish Audio S2 Pro and Chatterbox deliver natural prosody and emotional range. Multi-character fiction with distinct character voices is more challenging. You can use different voices for dialogue sections, but this requires more editing work.

How do I handle character dialogue in fiction?

The simplest approach is single-narrator style, where one voice reads everything including dialogue. This is how most human-narrated audiobooks work. For distinct character voices, you can generate dialogue sections with different Murmur voices and splice them together in an audio editor. This takes more effort but creates a more immersive experience.
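If you go the multi-voice route, the first chore is separating quoted dialogue from narration so each piece can be generated with the right voice. The sketch below shows one way to do that split. It assumes dialogue is enclosed in straight or curly double quotes and does not try to work out which character is speaking; that assignment is still a manual step.

```python
import re

# Matches a straight-quoted or curly-quoted span, captured so re.split keeps it.
QUOTED = re.compile(r'("[^"]*"|“[^”]*”)')

def split_dialogue(text):
    """Return (kind, text) pairs where kind is 'dialogue' or 'narration'."""
    segments = []
    for piece in QUOTED.split(text):
        piece = piece.strip()
        if not piece:
            continue
        kind = "dialogue" if piece[0] in '"“' else "narration"
        segments.append((kind, piece))
    return segments

passage = 'She paused. "Are you sure?" he asked. "Yes."'
for kind, piece in split_dialogue(passage):
    print(f"{kind:9} {piece}")
```

Each narration segment would go to the narrator voice and each dialogue segment to a character voice, with the resulting clips spliced back together in order.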

What about non-fiction vs fiction quality?

AI narration works exceptionally well for non-fiction: memoirs, self-help, business books, guides. The consistent, clear delivery suits informational content. Fiction requires more emotional range, which models like Chatterbox and Fish Audio provide, but the quality ceiling is slightly lower than for non-fiction.

Can I sell the audiobook commercially?

Yes. Murmur's license covers commercial use of generated audio. You retain full rights to your audiobook. Distribute on ACX/Audible, sell directly, or use any other distribution channel.

Try Murmur - $49 one-time. No subscriptions, no cloud, no per-character fees.
