Lee Stuart

Posted on Jun 10

14 Observations From Rebuilding My Ad Workflow Around AI Tools (The Unglamorous Version)

A friend who does similar work — content production, ad creative, the whole messy stack — sent me a voice note last week. Three minutes long. The gist: "I don't understand why you're still doing voiceovers manually. You're spending four hours on something that takes four minutes now."

She wasn't wrong. She was annoying about it, but she wasn't wrong.

So I spent the better part of two weeks pulling AI Ad Builder tools and AI Text to Speech engines into my existing workflow. Not replacing it. Pulling them in, which is a different and much more friction-filled process than the demos make it look.

Here's what I actually noticed. No rankings. No scores. Just observations, in the order I wrote them down.

1. The workflow friction is front-loaded, not ongoing.

The first three days were genuinely unpleasant. Figuring out where the AI output slots into my existing pipeline — what it replaces, what it feeds into, what it breaks — took longer than I expected. After that, it got easier. The pain is real but it's not permanent. Worth knowing before you quit on day two.

2. AI Text to Speech has a "valley of almost-good" problem.

The voices aren't bad. That's the issue. They're good enough that you keep thinking one more tweak will get them to broadcast quality. It won't. There's a ceiling, and it's lower than the demos suggest. For internal content or low-stakes social ads, fine. For anything where voice is part of the brand identity, you'll feel the gap.

3. Dynamic caption generation is the sleeper feature nobody talks about enough.

I expected to care most about the visual generation. I ended up caring most about dynamic caption generation. Accurate, auto-timed captions that adapt to pacing changes: this alone saved me somewhere between 45 minutes and two hours per project, depending on length. It's not exciting to write about. It's extremely exciting to experience at 6pm when you're tired.

4. The AI Ad Builder treats "brand voice" as a style parameter. It isn't.

Every tool I tested had some version of a brand voice input. Tone selectors, adjective fields, example copy uploads. The output was consistently adjacent to brand voice, not inside it. Brand voice isn't a style setting. It's accumulated context — years of decisions about what a company doesn't say as much as what it does. AI doesn't have that context. You have to hold it yourself.

5. TTS pacing is harder to control than pitch or tone.

Most AI Text to Speech interfaces give you decent control over speed and emotion. What they don't give you is good control over micro-pacing — the half-beat pause before a key word, the slight acceleration through a list, the breath before a CTA. These are the things that make a voiceover feel like a person made it. SSML tags help but they're fiddly and the results are inconsistent across engines.
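
To make the micro-pacing point concrete, here is a minimal sketch of the kind of SSML involved, built in Python. It uses only standard SSML elements (`speak`, `break`, `prosody`). The helper names (`pause`, `faster`, `build_ssml`) and the ad copy are made up for illustration, and the sketch does not assume any particular engine honors these tags the same way, which is exactly the inconsistency described above.

```python
from xml.sax.saxutils import escape


def pause(ms: int) -> str:
    """A silent beat of the given length, e.g. before a key word or CTA."""
    return f'<break time="{ms}ms"/>'


def faster(text: str, rate: str = "110%") -> str:
    """Wrap text so it is read slightly faster, e.g. through a list."""
    return f'<prosody rate="{rate}">{escape(text)}</prosody>'


def build_ssml(parts: list[str]) -> str:
    """Join already-escaped text and tag fragments into one SSML document."""
    return "<speak>" + "".join(parts) + "</speak>"


# Hypothetical ad line: speed up through the list, then breathe before the CTA.
ssml = build_ssml([
    escape("Three plans: "),
    faster("starter, team, and studio."),
    pause(350),
    escape("Try it free today."),
])
print(ssml)
```

The pause length and rate are the fiddly part: expect to tune them by ear, per engine, per line.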

6. Generating caption styles that match video tone is still a manual job.

The tool auto-generates the text. It does not auto-generate the right visual treatment for that text. Font weight, color contrast, animation timing, whether the captions sit inside or outside the safe zone: all of that still requires human decisions. The generation handles the what. How it looks is on you.
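
One small piece of that list can at least be sanity-checked mechanically: color contrast. The sketch below computes the WCAG 2.x contrast ratio between a caption color and a background color. It assumes a flat background color, which real video rarely has, so treat the number as a rough floor to check against, not a substitute for looking at the frame. The function names and sample colors are illustrative.

```python
def _linear(channel: int) -> float:
    """Convert one 0-255 sRGB channel to linear light (WCAG 2.x formula)."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def luminance(rgb: tuple[int, int, int]) -> float:
    """Relative luminance of an sRGB color, from 0.0 (black) to 1.0 (white)."""
    r, g, b = (_linear(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(fg: tuple[int, int, int], bg: tuple[int, int, int]) -> float:
    """WCAG contrast ratio; argument order does not matter."""
    lighter, darker = sorted((luminance(fg), luminance(bg)), reverse=True)
    return (lighter + 0.05) / (darker + 0.05)


# Hypothetical caption colors: white text over a mid-grey frame region.
print(round(contrast_ratio((255, 255, 255), (120, 120, 120)), 2))
```

White on black gives the maximum ratio of 21:1. WCAG's AA guideline for normal-size text is 4.5:1, which is a reasonable starting threshold for captions too.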

7. The best use case I found wasn't the one I went in looking for.

I went in trying to speed up final production. The actual win was in early-stage iteration. Using the AI Ad Builder to rough out six or seven structural variants of an ad before I committed to any direction — that changed how I present options to clients. I show up with more, I commit to less, and the conversations are better. Didn't expect that.

8. Voice cloning is available. I didn't use it. I'm not sure I should.

Several platforms I tested offered some version of voice cloning or "custom voice" training. The output quality was impressive. I still haven't used it for a client project. Partly because the consent and disclosure questions aren't resolved in my head yet. Partly because I don't want to be the person who has to explain it when someone asks. Maybe that's overcautious. I'm leaving it here anyway.

9. The caption sync breaks on music-heavy segments.

When background audio gets loud or rhythmically complex, auto-caption timing drifts. Not catastrophically, but enough to require manual correction. This is a known limitation and I'm noting it because three tools I tested didn't mention it anywhere in their documentation. You find out when you're reviewing a finished export at 7pm.
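
The manual correction can be made less tedious. A minimal sketch, assuming you can export cue timings and that you spot-check a few points by ear: record how far off the captions are at those points, then interpolate the correction across everything in between. The `Cue` type, the function names, and the anchor values are all hypothetical, not taken from any of the tools tested.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Cue:
    start: float  # seconds
    end: float    # seconds
    text: str


def offset_at(t: float, anchors: list[tuple[float, float]]) -> float:
    """Correction in seconds at time t.

    anchors is a list of (time, offset) pairs sorted by strictly increasing
    time. Offsets are linearly interpolated between anchors and held
    constant before the first and after the last one.
    """
    times = [a[0] for a in anchors]
    i = bisect_right(times, t)
    if i == 0:
        return anchors[0][1]
    if i == len(anchors):
        return anchors[-1][1]
    (t0, o0), (t1, o1) = anchors[i - 1], anchors[i]
    return o0 + (o1 - o0) * (t - t0) / (t1 - t0)


def retime(cues: list[Cue], anchors: list[tuple[float, float]]) -> list[Cue]:
    """Return new cues with start and end shifted by the interpolated offset."""
    return [
        Cue(c.start + offset_at(c.start, anchors),
            c.end + offset_at(c.end, anchors),
            c.text)
        for c in cues
    ]


# Hypothetical drift: in sync at 10s, captions 0.4s early by 20s (music section).
anchors = [(10.0, 0.0), (20.0, 0.4)]
fixed = retime([Cue(15.0, 17.0, "Now with free shipping.")], anchors)
print(fixed[0])
```

Two or three anchors per music-heavy segment is usually a sensible starting point; add more only where the drift is visibly uneven.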

10. AI-generated voiceovers change client expectations in ways you don't anticipate.

I delivered a rough cut with an AI voiceover as a placeholder. Client came back asking if we could "just keep this one." The AI voice was cheaper and faster than booking a real VO artist. I said yes. I've thought about that decision more than I expected to. Not sure what the right answer is, but the question is real and it's coming for everyone doing this kind of work.

11. Nextify.ai handles the TTS-to-caption pipeline more smoothly than most.

I tested a handful of platforms end-to-end. Most required me to export audio, run it through a separate transcription tool, then reimport. One platform — Nextify.ai — kept the TTS output and caption generation in the same environment. Less context-switching. Fewer export/import errors. Small thing. Meaningful at volume.
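
For context on why the single-environment setup matters: when a TTS engine already knows when each line is spoken, captions can be written straight from those timings with no transcription pass at all. The sketch below is a generic illustration of that idea, not a description of how Nextify.ai or any other platform works internally. It assumes the engine hands back `(start, end, text)` segments in seconds, and writes them out as SubRip (`.srt`) text.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: list[tuple[float, float, str]]) -> str:
    """Turn (start, end, text) segments into SRT text, numbering cues from 1."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    return "\n".join(blocks)


# Hypothetical segments, as a TTS engine with timing output might return them.
segments = [
    (0.0, 1.2, "Tired of slow edits?"),
    (1.5, 3.0, "Cut your turnaround in half."),
]
print(to_srt(segments))
```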

12. The "natural pause" problem in TTS is real and underreported.

AI voices don't breathe. They don't hesitate. They don't do the small human things that signal to a listener that a thought is complete before the next one starts. You can fake some of this with punctuation and SSML. You can't fully replicate it. For short-form ads this is manageable. For anything over 60 seconds, it accumulates into something that feels slightly off in a way listeners can't name but definitely feel.

13. The workflow I ended up with looks nothing like the one I planned.

I went in with a neat diagram of where AI would slot in. The actual workflow that emerged was messier, more iterative, and honestly more interesting. AI draft → human edit → AI refinement → human final check. Not a replacement loop. A conversation loop. That framing helped me stop resenting the tool for not doing everything and start using it for what it actually does well.

14. My friend was right, but only about the easy part.

She was right that I was wasting time on things AI can handle. She was right that the transition was worth doing. What she didn't mention — because it's not the kind of thing that fits in a voice note — is that the interesting work starts after you've automated the easy stuff. The judgment calls, the brand instincts, the knowing-when-something's-wrong-but-not-why. That part didn't get faster. It got more visible.

The tools handle the volume. You handle the meaning. That's not a consolation prize — it's actually the job.

Posted from my desk at 7:15pm. The monitor is still the brightest thing in the room.
