AudioProducer.ai

Posted on May 18

Auto-Assign Sounds: how AudioProducer.ai turns chapter text into music beds, ambience, and SFX

#ai #audio #tts #writing

If you read our Auto-Assign pipeline post from last week, you already know the shape: chapter text in, two AI passes (Characters, then Sounds), tweak in the editor, click Generate. That walkthrough was deliberately wide. This one is narrow.

This post zooms in on the second pass: Auto-Assign Sounds. What the AI is actually looking for in a chapter, how it picks which sounds to place where, what the editor lets you do with the result, and the edge cases that consistently trip it up. The goal is to make the pass legible enough that when you click the button on your own chapter, you can predict what it will do and where you will need to step in.

Why a separate Sounds pass

In our pipeline, Characters and Sounds are two distinct AI passes, not one. They could in principle be folded together. They are not, for two reasons that are worth knowing up front because they shape how the editor behaves.

First, the failure modes are different. Character attribution is mostly a parsing problem (who said this line?). Sound placement is mostly a scene-comprehension problem (what is happening here, and what would you hear if you were there?). Splitting the passes means you can re-run one without re-running the other when a chapter's character markup is fine but the sound placement needs another go.

Second, the user-facing controls are different. After Characters runs you typically tune voice assignments. After Sounds runs you typically tune which atmospheric layers play under which sections and where individual SFX land. Keeping the panels separate keeps the cognitive load lower for each tuning task.

What the Sounds pass detects

Input: a chapter, with characters already assigned. Output: three categories of audio, placed at specific positions in the text.

Music beds. Long-form atmospheric tracks that play under stretches of text. These are mood pieces, not tied to a single line. The AI looks at the emotional contour of a scene and picks a bed that fits the dominant register: tension, calm, dread, wonder, melancholy. A chase sequence gets a percussive driving bed; a quiet character moment gets something sparser.

Ambient soundscapes. Environmental layers tied to the place a scene happens in, not the mood of it. Wind under an outdoor scene. Distant traffic under a city scene. A crackling fire under a hearth scene. Surf under a beach scene. The AI infers location cues from descriptive prose (and from named locations when the text gives them) and lays in a soundscape that grounds the listener in the geography of the moment.

One-shot sound effects. Discrete events tied to a single moment in the text. A door slamming. Stones launching from a sling. A bottle shattering. Thunder cracking on a beat. These are the ones that show up as inline chips in the editor view with their duration in parentheses, sitting on the exact line they fire on. From real chapters in our examples: "Distant Thunder (4s)", "Wind Howl (6s)", "Wind Gust (5s)", "Stones Launching (3s)".

The three categories layer. A storm scene typically gets a tense music bed running underneath the whole thing, a wind-howl ambient soundscape over the storm description, and one-shot thunder cracks placed on the lines that describe the lightning. The editor shows all three layers stacked at the moments they overlap.
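One way to picture the layering is as two kinds of placement: range-based ones (music beds, ambience) and point-based ones (SFX chips). The sketch below is a mental model only. AudioProducer.ai does not publish its internal data model, so the class names, field names, line numbers, and the "Tense Low Strings" track are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RangePlacement:
    """Hypothetical: a music bed or ambient soundscape covering a span of lines (inclusive)."""
    kind: str          # "music_bed" or "ambience"
    sound: str
    start_line: int
    end_line: int


@dataclass(frozen=True)
class SfxChip:
    """Hypothetical: a one-shot effect pinned to a single line."""
    sound: str
    line: int
    duration_s: int


def layers_at(line, ranges, chips):
    """Return the names of every sound active on the given line."""
    active = [r.sound for r in ranges if r.start_line <= line <= r.end_line]
    active += [c.sound for c in chips if c.line == line]
    return active


# Illustrative storm scene; the line numbers and the bed name are made up.
ranges = [
    RangePlacement("music_bed", "Tense Low Strings", 1, 12),
    RangePlacement("ambience", "Wind Howl", 3, 9),
]
chips = [SfxChip("Distant Thunder", 5, 4)]

print(layers_at(5, ranges, chips))
# ['Tense Low Strings', 'Wind Howl', 'Distant Thunder']
```

Line 5 is where all three layers overlap, which is the stacked view the editor shows for the lightning line.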

How the editor surfaces the result

After the Sounds pass completes, the chapter view shows you what the AI placed and where. Three surfaces matter:

Inline SFX chips on the text. One-shot sounds appear as small chips inside the body of the chapter, on the line they fire on, labeled with the sound name and duration. You can see at a glance how dense the placement is and whether the moments feel right.
The Sounds panel. Lists every music bed and ambient soundscape the AI placed, along with which range of text they cover. This is the panel you use to swap a bed for a different track, change where a soundscape starts or ends, or remove a placement entirely.
The library browser. Same place you would use to add a sound by hand. You can preview anything in the library before assigning it, so swapping a bed is a low-risk operation.

Two behaviors that come up often:

Dragging chips. SFX chips can be moved to a different line in the text if the AI placed one a beat early or late.
Removing placements. Anything the AI put down can be deleted with no penalty. The point of the Sounds pass is to seed the chapter with reasonable choices, not to be irreversibly bolted in.

Whatever you change in the panel takes effect on the next Generate Audio click. You do not have to re-run Auto-Assign Sounds when you swap a bed or move a chip. You only re-run the pass if you have rewritten enough of the source text that the original placement is now stale.

A concrete example

To make this less abstract, here is roughly what happens on a few paragraphs of a typical fantasy chapter.

The chapter opens with two characters approaching a cave during a storm. The first paragraph describes wind tearing at their cloaks, rain coming sideways, distant thunder rolling. The second paragraph has them ducking into the cave mouth. The third paragraph has one character striking flint to start a fire.

After the Sounds pass:

A tense, low-register music bed lays in under the first two paragraphs. The AI reads "storm" plus the urgency in the action verbs and picks something that primes the listener for trouble.
A wind-howl ambient soundscape sits over the outdoor portion and tapers as the characters enter the cave.
One-shot SFX chips appear: a distant-thunder chip on the line about thunder rolling and a flint-strike chip on the line about striking sparks. A fire-crackle ambient layer starts after the fire catches.
The music bed shifts down (or out, depending on the AI's read) once the characters are safely inside, leaving the fire-crackle ambient and quiet dialogue.

That is the AI doing the placement work for you. Whether each piece is right is a different question, which is why the editor is built around fast review rather than starting from scratch.

Edge cases and where the AI misfires

The Sounds pass is good and uses the same underlying scene comprehension that powers the Characters pass. It also misfires in predictable ways.

Internal scenes get treated as external. When a paragraph reads as a character's internal monologue with vivid imagery (a memory of a battlefield, a dream of an ocean), the AI sometimes places real-world ambience as if the events were happening live. The fix is usually to remove the ambient layer for that range and let the music bed alone carry the mood. The rule of thumb: ambient soundscapes anchor place; if the place is imagined rather than present, the soundscape can land as too literal.

Mood mismatch on tonally ambiguous scenes. A funeral that opens with a joke. A reconciliation that ends with a fight. Scenes that pivot in tone partway through can get a single dominant bed that fits one half and clashes with the other. The fix in the Sounds panel is usually to split the placement: let the original bed cover the first range, swap a different bed for the second.
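In the product, the split fix is done by hand in the Sounds panel. As a rough illustration of what the operation amounts to, here is a sketch that splits one bed placement at the line where the scene pivots. `BedPlacement`, `split_bed`, and the track names are hypothetical, not part of any AudioProducer.ai API.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class BedPlacement:
    """Hypothetical: a music bed covering start_line..end_line (inclusive)."""
    sound: str
    start_line: int
    end_line: int


def split_bed(bed, pivot_line, new_sound):
    """Keep the original bed before pivot_line; use new_sound from pivot_line onward."""
    if not (bed.start_line < pivot_line <= bed.end_line):
        raise ValueError("pivot_line must fall inside the placement, after its first line")
    first = replace(bed, end_line=pivot_line - 1)
    second = BedPlacement(new_sound, pivot_line, bed.end_line)
    return first, second


# A funeral scene that turns at line 18 (illustrative numbers and track names).
first, second = split_bed(BedPlacement("Somber Piano", 1, 40), 18, "Uneasy Strings")
print(first)   # covers lines 1-17 with the original bed
print(second)  # covers lines 18-40 with the replacement bed
```

The two resulting ranges never overlap and together cover exactly the original span, so the rest of the chapter's placements are unaffected.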

SFX over-eagerness on dialogue-heavy chapters. When a chapter is mostly two characters talking in a static setting, the AI sometimes places an SFX on every motion verb (a cup setting down, a chair scraping), and the result adds up to noise rather than texture. The fix is to delete the chips that do not need to be there. As a starting heuristic: keep SFX on moments that turn the scene, drop them on moments that just keep it moving.

Repetition across long stretches. On long chapters, the AI can re-use the same ambient track across multiple scenes that happen in different places. The fix is to swap one of them; the variety reads as more produced. The Sounds panel makes this a one-click operation.
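If you keep your own notes on a long chapter, spotting this kind of repetition is a simple grouping exercise. The sketch below assumes you have jotted down (ambient track, scene) pairs yourself. The function name, the data shape, and the track names are hypothetical, and nothing here reads from the Sounds panel.

```python
from collections import defaultdict


def repeated_ambience(placements):
    """Given (sound, scene) pairs, return sounds that appear in more than one scene."""
    scenes_by_sound = defaultdict(set)
    for sound, scene in placements:
        scenes_by_sound[sound].add(scene)
    return {
        sound: sorted(scenes)
        for sound, scenes in scenes_by_sound.items()
        if len(scenes) > 1
    }


# Illustrative notes for one long chapter.
notes = [
    ("Wind Howl", "mountain pass"),
    ("Crackling Fire", "cave"),
    ("Wind Howl", "harbor"),
]
print(repeated_ambience(notes))
# {'Wind Howl': ['harbor', 'mountain pass']}
```

Any track that shows up under two different locations is a candidate for a swap.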

Genre register drift. Cozy mysteries and grimdark fantasies do not want the same musical palette. The AI gets the broad register right most of the time, but on chapter one of a new project, give the bed selections a closer look. By chapter three the pattern of swaps usually stabilizes into a register that fits your book, and the AI's later passes start landing closer to where you would have placed things.

What composes well, what does not

Two practical patterns we have seen across user chapters:

Auto-Assign Sounds composes well for action, exterior scenes, and high-mood passages. The denser the sensory description in the source text, the more cues the AI has to anchor placements to, and the more the result feels intentional. Storm scenes, chase sequences, battle scenes, ritual scenes all get strong starting points.

Hand-curation is faster than auto-assign for sparse, intimate, or stylized scenes. A two-character conversation in a quiet room with no environmental cues does not give the AI much to work with. You will usually end up wanting one ambient soundscape (the room, the rain outside, whatever the character notices) and maybe a single SFX on a key moment. In that case, the faster move is to drop those in yourself from the Sounds panel rather than running the pass and then deleting most of what it placed.

For most book-length manuscripts, the practical approach is to run Auto-Assign Sounds on every chapter and then keep the hand-curation muscle warm for the chapters where the AI is fighting the material rather than helping it.

Your own audio in the same library

One detail that matters for projects with a specific sonic identity: you are not limited to the built-in library. You can upload your own music and sound effects into your personal sound library and they sit alongside the built-in tracks for use in any of your projects. The Auto-Assign pass draws from the built-in catalogue, but anything you have uploaded shows up in the library browser and can be swapped in. For series with a recurring musical motif or a specific narrator-signature sting, that is the path.

Standard caveat: only upload audio you are authorized to use.

What this pass does not do

To keep the picture honest:

It does not write music or generate SFX from scratch. It picks from a library of existing tracks (built-in plus anything you have uploaded). When customers ask whether they can generate a custom score, the answer today is no.
It does not mix levels for you across tracks. The Sounds panel surfaces placements; final balance is what comes out of Generate Audio with the library's default mix.
It does not handle publishing or distribution. The output is export-ready and compatible with major audiobook platforms, but uploading to them is your step.

Try it

The free tier (1,200 words per month, no credit card) is enough to run Auto-Assign Sounds on a real chapter and develop a feel for what it places and where you push back. Pick a chapter that already has some action in it, click both Auto-Assigns, scan the inline SFX chips and the Sounds panel, swap one bed, delete one chip, click Generate. Forty-five minutes from a cold start to a finished audio drama of your own chapter.

You can do that at audioproducer.ai. If you have already read the Auto-Assign pipeline post, this is the natural next click; if not, that one sets up the broader picture this post drills into.

Disclosure: this article was drafted by an AI agent working on behalf of the AudioProducer.ai team.
