ElevenLabs Studio Workflow: 4 Patterns for 12-Minute Solo Episodes

#ai #productivity #claudecode #automation

Scripted-narration solo pod: 12 min finished in 38 min using ElevenLabs v3, costs 0.34 EUR per episode
Tutorial with code intercuts uses voice cloning + SSML pauses, 51 min total, 0.52 EUR
Multi-voice debate format runs 4 voices through one v3 Studio project, 1h 12 min, 0.71 EUR
B-roll narration for short-form pulls 6 clips from the long script in 14 min, 0.18 EUR per pack

I record almost zero of my own voice now. The four patterns below cover every piece of audio I ship for RAXXO Studios, from a 12-minute solo episode to a 30-second TikTok cut. Each one runs inside ElevenLabs Studio with a single project per episode. I track time-to-publish and cost per episode for each pattern because both numbers used to scare me off podcasting entirely.

Pattern 1: Scripted-Narration Solo Pod

This is the workhorse. One narrator, one script, one finished 12-minute MP3. I use it for the weekly RAXXO Lab audio companion that mirrors a blog article.

The flow is dead simple. I paste the script into v3 Studio, pick a single voice (a cloned one I made from 40 minutes of my own recordings), and let the engine render. The script is the blog post with three edits: contractions added, em dashes replaced with commas, and any inline code stripped out so the narrator does not try to read curly braces.
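Two of those three edits are mechanical and could be scripted. The following is a minimal sketch, assuming the blog post is Markdown with backtick-delimited code; `prepare_script` is a hypothetical helper name, and adding contractions is left as a manual pass.

```python
import re

def prepare_script(text: str) -> str:
    """Strip code and em dashes from a blog post before narration."""
    # Drop fenced code blocks first, then inline `code` spans.
    text = re.sub(r"```.*?```", "", text, flags=re.DOTALL)
    text = re.sub(r"`[^`]*`", "", text)
    # Replace each em dash (and any spaces around it) with a comma and a space.
    text = re.sub(r"\s*\u2014\s*", ", ", text)
    # Collapse the doubled spaces the removals leave behind.
    text = re.sub(r"[ \t]{2,}", " ", text)
    return text.strip()

print(prepare_script("Run `npm install` first \u2014 it is quick."))
# Run first, it is quick.
```

Removing a code span can leave a sentence that no longer makes sense, so the output still needs a read-through before it goes into the project.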

What v3 Studio does well here is paragraph-level emotion tagging. I drop a [thoughtful] or [laughing slightly] tag at the top of paragraphs that need a tone shift. Two or three tags per 12-minute script is the right amount. More than that and the voice starts to sound theatrical.

Time-to-publish from final script: 38 minutes. Breakdown: 6 minutes to paste and tag, 9 minutes to render in 1500-character chunks, 14 minutes to scrub and regenerate three lines that came out wrong, 9 minutes to export, normalize loudness in Auphonic, and upload to the host.
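The 1500-character chunking can be approximated outside the tool. This is a sketch only: it assumes paragraphs are separated by blank lines, and `chunk_script` is a hypothetical helper, not something Studio exposes. It packs whole paragraphs greedily so no chunk breaks mid-paragraph.

```python
def chunk_script(text: str, limit: int = 1500) -> list[str]:
    """Greedily pack whole paragraphs into chunks of at most `limit` characters."""
    chunks: list[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= limit:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single paragraph longer than the limit is kept whole here;
            # splitting it by sentence is left out of this sketch.
            current = para
    if current:
        chunks.append(current)
    return chunks
```

Keeping paragraphs intact means a line that came out wrong can be regenerated by re-rendering one chunk rather than the whole script.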

Cost per episode: 0.34 EUR. A 12-minute script is roughly 1,800 spoken words, or about 9,500 characters. At the Creator plan rate of about 18 EUR for 500,000 characters (0.000036 EUR per character), that lands at 0.34 EUR of consumed budget. Two regenerations push it to 0.36 EUR. For comparison, the same episode recorded the old way (record, edit, master) used to eat 2.5 hours of my time and roughly 70 EUR of opportunity cost.
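The arithmetic is easy to check. This sketch assumes the rounded plan figures quoted above (about 18 EUR for 500,000 characters); `episode_cost` is a hypothetical helper, and real billing may differ.

```python
PLAN_PRICE_EUR = 18.0      # approximate Creator plan price quoted above
PLAN_CHARACTERS = 500_000  # characters included at that price

def episode_cost(characters: int) -> float:
    """Return the share of the plan budget consumed, in EUR, rounded to cents."""
    return round(characters * PLAN_PRICE_EUR / PLAN_CHARACTERS, 2)

print(episode_cost(9_500))   # 0.34
print(episode_cost(14_500))  # 0.52
print(episode_cost(4_200))   # 0.15
```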

The killer feature is consistency. Episode 12 sounds exactly like episode 1. The cloned voice does not have a cold, did not stay up late, and never has a Berlin tram going past the window.

Pattern 2: Narrated Tutorial With Code Intercuts

Tutorials need to switch register. The narration explains the why in a warm tone, the code reading needs to be slower and more precise, and the wrap-up returns to the warm tone. v3 Studio handles this with a single project plus SSML pauses.

The setup: I segment the script into three voice blocks per section. Block A is the explanation, block B is the code walkthrough, block C is the takeaway. I drop a 500-millisecond SSML pause before each code block (`<break time="0.5s" />`) and slow the code reading by 8% using the speed slider on that block only. The voice stays the same; only the pacing changes.

For code with symbols I write the symbols phonetically inside the script. `const fn = (x) => x * 2` becomes "const f n equals open-paren x close-paren arrow x times two". Ugly but it reads correctly the first time. The first 6 tutorials I tried without this hack had at least 2 regenerations per code block. Now I get most of them in one render.
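The phonetic rewrite can be partly automated. The sketch below is one possible approach, not the workflow described above: the `SPOKEN` table, the tokenizer, and the `overrides` argument (for identifiers and digits that need spelling out) are all assumptions, and the symbol table covers only a handful of characters.

```python
import re
from typing import Optional

# Hypothetical symbol-to-speech table; extend it per language.
SPOKEN = {
    "=>": "arrow", "=": "equals", "(": "open-paren", ")": "close-paren",
    "*": "times", "+": "plus", "{": "open-brace", "}": "close-brace",
}
# Match the two-character arrow before single characters.
TOKEN = re.compile(r"=>|[A-Za-z_]\w*|\d+|\S")

def speak_code(code: str, overrides: Optional[dict] = None) -> str:
    """Turn a line of code into words a narrator can read aloud."""
    overrides = overrides or {}
    words = []
    for token in TOKEN.findall(code):
        # Per-call overrides win, then the symbol table, else the token as-is.
        words.append(overrides.get(token) or SPOKEN.get(token, token))
    return " ".join(words)

print(speak_code("const fn = (x) => x * 2", {"fn": "f n", "2": "two"}))
# const f n equals open-paren x close-paren arrow x times two
```

Symbols missing from the table pass through unchanged, so the result still needs a quick read before rendering.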

Time-to-publish: 51 minutes for a 12-minute tutorial episode. The extra 13 minutes over Pattern 1 is the phonetic code rewrite and the per-block speed adjustments.

Cost per episode: 0.52 EUR. Tutorials run longer in characters (more verbose explanations), so the same 12 minutes of finished audio comes to roughly 14,500 characters. The slower code blocks also push character billing up slightly because the engine pads silence.

Two things I learned the hard way. First, never paste real terminal output into the script. The narrator will try to read it. Replace it with a one-sentence summary. Second, the [whisper] tag does not survive code blocks. Use a separate voice block with manual volume reduction instead. That cost me 9 EUR of regenerations across three tutorials before I figured it out.

If you want the full story on getting voice consistency right, the deeper dive lives in the ElevenLabs vs Other AI Voice Tools comparison.

Pattern 3: Multi-Voice Debate Format

This is the most fun pattern and the only one that draws real listener feedback. Four characters debate a take, like "should solo studios run Postgres or SQLite," and each one has a distinct voice, accent, and pace. The whole 12 minutes runs inside one v3 Studio project, no audio editor needed.

Voice picks: I use one cloned voice for the host, plus three library voices chosen for clear separation. A clipped British male for the skeptic, a slow Midwest American female for the practitioner, and a fast New York male for the contrarian. The library has hundreds of options; I pick the three that sound least like each other and least like my clone.

The script is structured as a screenplay. Each line starts with [host], [skeptic], [practitioner], or [contrarian]. v3 Studio's project view lets me assign a voice to each tag globally, so I do it once at the top of the project. Reassigning later is one click.
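A quick lint pass on the screenplay can catch a mistyped speaker tag before any characters are spent on rendering. This is a sketch under stated assumptions: the voice labels in `VOICES` are placeholders, and `parse_screenplay` is a hypothetical helper that only parses text and does not talk to Studio.

```python
import re

# Placeholder voice labels; the real assignment happens in the project view.
VOICES = {
    "host": "cloned-host",
    "skeptic": "library-british-male",
    "practitioner": "library-midwest-female",
    "contrarian": "library-new-york-male",
}
# A speaker tag at the start of the line, then the spoken text.
LINE = re.compile(r"^\[(\w+)\]\s*(.+)$")

def parse_screenplay(script: str) -> list[tuple[str, str]]:
    """Split a screenplay into (speaker, text) turns, rejecting unknown tags."""
    turns = []
    for raw in script.splitlines():
        match = LINE.match(raw.strip())
        if not match:
            continue  # blank lines and untagged notes are skipped in this sketch
        speaker, text = match.group(1), match.group(2)
        if speaker not in VOICES:
            raise ValueError(f"unknown speaker tag: [{speaker}]")
        turns.append((speaker, text))
    return turns
```

Emotion tags such as [laughs] that follow the speaker tag stay inside the text of the turn, so they reach the renderer untouched.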

The trick that makes this not sound like a synthesized panel: I write in actual interruptions. Lines that get cut off mid-thought with a hyphen at the end. Lines that respond to the previous speaker by name. Beats of [laughs] and [skeptical] sprinkled in where a human would react. These tiny touches are what make the format feel produced, not algorithmic.

Time-to-publish: 1 hour 12 minutes for a 12-minute episode. Most of the extra time is writing the script. The rendering itself is no slower than Pattern 1.

Cost per episode: 0.71 EUR. Four voices, more emotion tags, and occasional [laughs] tags that count as characters. Higher than Pattern 1 but still under one euro per finished episode.

For the bigger picture on where AI voice is going and the ethics around it, AI Voice Generation: Tools, Ethics, and Practical Use Cases is the primer I send people first.

Pattern 4: B-Roll Narration for Short-Form Social Cuts

The fourth pattern is the multiplier. Every long episode I produce gets sliced into six short-form clips for Reels, Shorts, and TikTok. Instead of re-recording the narration for each cut, I extract six 25-to-40-second segments straight from the long-form script and re-render them with v3 Studio at a slightly punchier delivery.

The reason I re-render rather than slice the long-form audio: short-form needs different pacing. The long episode breathes. A Reel cannot afford a 600ms pause. So I take the same words, drop the SSML pauses, bump the speed by 5%, and add a [confident] tag at the start. Same voice, tighter delivery.
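The text side of that tightening can be scripted. This is a minimal sketch, assuming pauses are written as SSML `<break ... />` tags; `tighten_for_short_form` is a hypothetical helper, and the 5% speed bump is a per-block setting it does not cover.

```python
import re

# Any SSML break tag, plus the whitespace around it.
BREAK_TAG = re.compile(r"\s*<break\b[^>]*/?>\s*")

def tighten_for_short_form(segment: str, tag: str = "[confident]") -> str:
    """Remove SSML pauses and prepend a delivery tag for a short-form clip."""
    without_pauses = BREAK_TAG.sub(" ", segment).strip()
    return f"{tag} {without_pauses}"

print(tighten_for_short_form('Ship it. <break time="0.6s" /> Then rest.'))
# [confident] Ship it. Then rest.
```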

I batch all six into one v3 Studio project. Each segment is its own block so I can re-render any single one without touching the others. Total render time is under 3 minutes for six clips.

For B-roll footage I pair the audio with motion plates generated from Magnific and a few stock clips I keep in a per-episode folder. The visuals get queued separately, but the narration is locked in by the time I open the editor.

Time-to-publish for a 6-clip pack: 14 minutes. Breakdown: 4 minutes to slice the script, 3 minutes to render, 4 minutes to export and rename, 3 minutes to drop into the social scheduler.

Cost per pack: 0.18 EUR. Each clip averages 700 characters, so six clips is about 4200 characters. Roughly 0.15 EUR plus a regeneration or two.

Distribution is handled by Buffer. One scheduler holds the long-form post on Spotify, the YouTube clip on the studio channel, and the six short-form pieces across Instagram, TikTok, and YouTube Shorts. The whole pack goes from voice render to scheduled in under 30 minutes.

Bottom Line

Four patterns, one tool, every audio piece I ship for RAXXO Studios. Total cost for a long episode plus its six short-form children: about 0.52 EUR for the workhorse case, 0.89 EUR for the debate format. Total time-to-publish for the full package, long plus shorts: 52 to 86 minutes depending on the pattern.
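Those totals are the long-episode figures plus one short-form pack. A small sketch of the sum, using only the numbers reported in this post (the dictionary keys and `full_package` are illustrative names):

```python
# (minutes to publish, cost in EUR) per long-form pattern, as reported above.
PATTERNS = {
    "solo": (38, 0.34),
    "tutorial": (51, 0.52),
    "debate": (72, 0.71),
}
SHORTS_PACK = (14, 0.18)  # six short-form clips

def full_package(pattern: str) -> tuple[int, float]:
    """Return total minutes and EUR for one long episode plus its shorts pack."""
    minutes, cost = PATTERNS[pattern]
    return minutes + SHORTS_PACK[0], round(cost + SHORTS_PACK[1], 2)

print(full_package("solo"))    # (52, 0.52)
print(full_package("debate"))  # (86, 0.89)
```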

What used to be a podcast production stack with a microphone, a DAW, a noise reducer, a leveler, and three hours of attention is now a paste-tag-render loop inside a browser. The voice sounds like me because the cloned voice IS me, just one that does not get tired. The math works at small scale (one episode a week) and at twenty-episode scale (a launch week).

If you want the studio storefront where all the audio companions live alongside the products they pair with, that runs on Shopify. The hub for everything in the AI Voice/Video cluster is here.