Jakub

Posted on Jun 29 • Originally published at magicalsong.com

Magical Song by Inithouse: the text-to-song pipeline that turns a short story into studio-quality music

#ai #audio #machinelearning #webdev

At Inithouse we build a portfolio of products in parallel. One of them, Magical Song, takes a short personal story (a birthday memory, a wedding toast, a bedtime tale) and produces a finished song with real vocals and professional production. The whole thing runs in under five minutes.

This post breaks down how the pipeline actually works: four stages, where latency hides, and what still breaks regularly.

The pipeline: story in, song out

The generation pipeline has four sequential stages. Each one feeds into the next, and each one can fail independently.

User story (free text, 20-300 words)
  → Stage 1: Story analysis + lyric generation
  → Stage 2: Melody and arrangement
  → Stage 3: Vocal synthesis
  → Stage 4: Mix + master + delivery
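
As a rough sketch of that shape (not the production orchestrator), a sequential runner can pass each stage's output to the next and record which stage failed. The names `run_pipeline` and `StageError` and the stage callables are hypothetical, chosen for illustration.

```python
from typing import Any, Callable

# A stage is a (name, function) pair; the function takes the previous output.
Stage = tuple[str, Callable[[Any], Any]]


class StageError(Exception):
    """Raised when one stage fails; records which stage it was."""

    def __init__(self, stage: str, cause: Exception):
        super().__init__(f"{stage} failed: {cause}")
        self.stage = stage
        self.cause = cause


def run_pipeline(story: str, stages: list[Stage]) -> Any:
    """Run stages in order, feeding each result into the next stage."""
    payload: Any = story
    for name, fn in stages:
        try:
            payload = fn(payload)
        except Exception as exc:
            # Wrap the error so the caller knows where the pipeline broke.
            raise StageError(name, exc) from exc
    return payload
```

Wrapping the failure with the stage name is what makes "each one can fail independently" actionable: a caller can retry only the failed stage instead of restarting from the story.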

Stage 1: story to lyrics

The user writes a short story. Could be three sentences about their dog. Could be a paragraph about meeting their partner at a gas station in 2014. The input is messy, personal, and unpredictable.

We parse the story for emotional arc, key images, and the names/details that make it personal. Then we generate lyrics that preserve those specifics. "Rocky" stays Rocky, not "my beloved pet." The gas station stays a gas station, not "the place where fate intervened."

This is where most competing tools lose the plot. They strip out the weird, concrete details and replace them with generic sentiment. We do the opposite: the weird details ARE the song.

Typical latency for stage 1: 3-8 seconds depending on story length. The variance comes from retry logic when the first lyric draft scores below our internal coherence threshold.
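
The retry loop could look roughly like the sketch below. The draft and scoring functions are passed in as placeholders, and the `threshold` and `max_attempts` values are assumptions for illustration, not the real settings.

```python
from typing import Callable


def generate_lyrics_with_retry(
    story: str,
    draft_fn: Callable[[str], str],
    score_fn: Callable[[str, str], float],
    threshold: float = 0.7,   # assumed value, not the real threshold
    max_attempts: int = 3,    # assumed value
) -> tuple[str, int]:
    """Return the best draft seen and the number of attempts used."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    best_draft, best_score = "", float("-inf")
    attempts = 0
    for attempts in range(1, max_attempts + 1):
        draft = draft_fn(story)
        score = score_fn(story, draft)
        if score > best_score:
            best_draft, best_score = draft, score
        if score >= threshold:
            break  # good enough, stop paying for more drafts
    return best_draft, attempts
```

Each extra attempt costs a full generation call, which is one plausible reason the latency range is as wide as it is.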

Stage 2: melody and arrangement

Lyrics go in. We match genre, tempo, and key to the emotional profile from stage 1. A nostalgic birthday story gets a different treatment than a rowdy bachelor party toast.

The user picks a genre preference upfront (pop, rock, acoustic, R&B, country, a few others), which constrains the arrangement. But the actual melodic line is generated to fit the lyric stress patterns. English lyrics with lots of monosyllables get different melodic contours than, say, Spanish lyrics with natural four-syllable words.

We spent weeks on the stress-matching problem alone. Early versions would put emphasis on the wrong syllable constantly: "re-MEM-ber" sung as "RE-mem-ber." Small thing, but it made every song sound off.
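
One way to catch this class of bug is a simple check that compares each word's natural stress with the beat strength of the notes it lands on. The sketch below is an illustration only: the data layout (syllables plus a stressed index, one beat strength per syllable) is an assumption, not the real representation.

```python
# (syllables, index of the naturally stressed syllable)
Word = tuple[list[str], int]


def misplaced_stress(words: list[Word], beat_strengths: list[float]) -> list[str]:
    """List words whose strongest beat falls on the wrong syllable."""
    total = sum(len(syllables) for syllables, _ in words)
    if total != len(beat_strengths):
        raise ValueError("need exactly one beat strength per syllable")
    problems = []
    pos = 0
    for syllables, stressed in words:
        strengths = beat_strengths[pos:pos + len(syllables)]
        pos += len(syllables)
        if len(syllables) < 2:
            continue  # monosyllables cannot be mis-stressed internally
        # Index of the syllable the melody emphasizes most.
        sung = max(range(len(strengths)), key=strengths.__getitem__)
        if sung != stressed:
            problems.append("-".join(
                s.upper() if i == sung else s.lower()
                for i, s in enumerate(syllables)
            ))
    return problems


remember = (["re", "mem", "ber"], 1)
print(misplaced_stress([remember], [1.0, 0.3, 0.5]))  # ['RE-mem-ber']
```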

Latency here: 5-12 seconds. Genre selection is fast; melodic fitting is the bottleneck.

Stage 3: vocal synthesis

This is the expensive stage, both in compute and in quality risk. We use AI vocal models that produce singing (not speech-to-song conversion, actual singing synthesis). The output sounds like a real vocalist recorded in a treated room.

What still breaks here:

Breath placement. Vocal models sometimes forget to breathe. A 16-bar phrase with no breaths sounds robotic no matter how good the timbre is. We inject breath markers between phrases, but edge cases (very fast lyrics, unusual phrase lengths) still slip through.
Vibrato on held notes. Too much vibrato and it sounds like a parody. Too little and it sounds flat. We cap vibrato depth and rate, but the sweet spot varies by genre. Country vocals need more than pop vocals, for instance.
Consonant clarity. Plosives (p, b, t, d) at phrase starts sometimes get swallowed. We boost transients in post-processing, but it is an ongoing tuning problem.
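
To make the breath-marker idea concrete, here is a minimal sketch. The `<breath>` token, the phrase representation (text plus length in beats), and the `max_beats` limit are all hypothetical; a real vocal model will have its own control format.

```python
BREATH = "<breath>"  # hypothetical marker token


def inject_breaths(
    phrases: list[tuple[str, float]],
    max_beats: float = 16.0,  # assumed limit for a phrase sung in one breath
) -> tuple[list[str], list[str]]:
    """Insert a breath between phrases and flag phrases that run too long."""
    marked: list[str] = []
    too_long: list[str] = []
    for i, (text, beats) in enumerate(phrases):
        if i > 0:
            marked.append(BREATH)
        marked.append(text)
        if beats > max_beats:
            # Between-phrase breaths will not help here; needs a mid-phrase one.
            too_long.append(text)
    return marked, too_long
```

The second return value is the interesting part: phrases flagged there are the edge cases that between-phrase markers alone cannot fix.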

Latency: 15-40 seconds. This is the longest stage by far. Vocal synthesis is GPU-bound and we batch requests, so queue depth matters.

Stage 4: mix, master, delivery

The raw vocal track plus the arrangement go through automated mixing: EQ, compression, reverb, stereo imaging. Then comes a loudness-normalized master pass targeting -14 LUFS, a common normalization target on streaming platforms such as Spotify.
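
The gain needed to hit a loudness target is simple arithmetic once the integrated loudness of the mix has been measured. The sketch below assumes that measurement already exists (for example from an ITU-R BS.1770 style meter); the function name is made up for illustration.

```python
def gain_to_target(measured_lufs: float, target_lufs: float = -14.0) -> tuple[float, float]:
    """Return (gain in dB, linear amplitude multiplier) to reach the target."""
    gain_db = target_lufs - measured_lufs
    linear = 10 ** (gain_db / 20)  # dB to amplitude ratio
    return gain_db, linear


gain_db, linear = gain_to_target(-20.0)
print(gain_db, round(linear, 3))  # 6.0 1.995
```

A static gain like this can push peaks into clipping, so a real master pass would normally follow it with a limiter.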

The output is a downloadable MP3 (and WAV for paid users). We also generate a shareable player page with the lyrics displayed, synced roughly to playback.

Latency: 4-6 seconds. This stage is fast because it is deterministic signal processing with no model inference.

Total wall-clock time

Adding it up: 27-66 seconds for the full pipeline, with a median around 40 seconds. We tell users "a few minutes" because we would rather under-promise. Network overhead, queue wait, and the occasional retry push real-world delivery to 1-3 minutes for most requests.
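
The 27-66 second range is just the sum of the per-stage ranges quoted above, which is easy to check:

```python
# Per-stage latency ranges in seconds, as quoted in each stage section.
STAGE_LATENCY_S = {
    "lyrics": (3, 8),
    "melody": (5, 12),
    "vocals": (15, 40),
    "mix": (4, 6),
}


def total_latency(stages: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Sum the best-case and worst-case latency across sequential stages."""
    best = sum(lo for lo, _ in stages.values())
    worst = sum(hi for _, hi in stages.values())
    return best, worst


print(total_latency(STAGE_LATENCY_S))  # (27, 66)
```

Vocal synthesis accounts for 40 of the 66 worst-case seconds, so it is the stage where optimization pays off most.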

What still breaks

Three recurring failure modes:

Genre mismatch. A user writes a somber memorial story and picks "party pop" as the genre. The system generates something tonally confused. We added a soft warning ("your story seems reflective; acoustic or ballad might fit better") but we do not override the user's choice.

Names in lyrics. Unusual names sometimes get mispronounced by the vocal model. "Květa" or "Xiaoling" are harder than "Sarah." We maintain a pronunciation override table for common edge cases but it is never complete enough.

Very short stories. "Happy birthday Mom" gives us almost nothing to work with. The lyrics end up generic because the input was generic. We prompt users for more detail, but many skip past it. This is a UX problem more than a technical one.
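
The first two mitigations are both table lookups at heart. Here is a hedged sketch of each: the mood labels, the clash table, and the phonetic respellings are illustrative placeholders, not the real data.

```python
import re
import unicodedata
from typing import Optional

# Hypothetical tables; the real ones would be larger and tuned over time.
CLASHES = {"reflective": {"party pop"}}
SUGGESTIONS = {"reflective": "acoustic or ballad"}
PRONUNCIATIONS = {
    "kv\u011bta": "KVYEH-tah",   # "květa", rough respelling for illustration
    "xiaoling": "SHYAU-ling",    # rough respelling for illustration
}


def genre_warning(mood: str, genre: str) -> Optional[str]:
    """Return a soft warning on a mood/genre clash; never change the genre."""
    if genre in CLASHES.get(mood, set()):
        return f"your story seems {mood}; {SUGGESTIONS[mood]} might fit better"
    return None


def apply_overrides(lyric: str, overrides: dict[str, str]) -> str:
    """Swap known hard names for respellings before vocal synthesis."""
    # NFC so that "e" + combining caron matches the precomposed key.
    lyric = unicodedata.normalize("NFC", lyric)

    def swap(match: re.Match) -> str:
        word = match.group(0)
        return overrides.get(word.casefold(), word)

    return re.sub(r"\w+", swap, lyric)
```

Names missing from the table pass through unchanged, which is exactly the "never complete" problem: the lookup only helps for names someone has already added.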

Stack notes

The web frontend is a React SPA built with Lovable. Backend orchestration runs on Supabase Edge Functions. The AI inference stages call external model APIs (we do not host our own GPU fleet). Storage and delivery go through Supabase Storage plus a CDN.

The whole thing costs us roughly $0.30-0.50 per song in API calls and compute. We charge $1.90 per song, one-time, no subscription.
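
For a quick sense of the unit economics, the gross margin implied by those figures can be worked out directly. This only counts the per-song cost quoted above; payment processing fees and free previews are not included.

```python
def gross_margin(price: float, cost: float) -> float:
    """Fraction of the price left after per-song API and compute cost."""
    return (price - cost) / price


PRICE = 1.90
for cost in (0.30, 0.50):
    print(f"cost ${cost:.2f}: margin {gross_margin(PRICE, cost):.0%}")
# cost $0.30: margin 84%
# cost $0.50: margin 74%
```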

Try it

You can generate a song at magicalsong.com. Paste a story, pick a genre, wait a minute or two. The free preview lets you hear a clip; full songs are $1.90.

If you are building something similar (text-to-audio, generative media pipelines), happy to compare notes in the comments.
