Abhinav Goyal

The reason your live AI demo spins has nothing to do with your model

There's a specific kind of fear before a live demo.

Not general anxiety. The 30-seconds-before-you-hit-run kind. Where
you're suddenly aware of every API call, every network hop, every
dependency you didn't stress-test. You smile. You keep talking.
Somewhere in the background, something is computing.

And you're just hoping it finishes before the silence gets awkward.

I've been in that position more times than I want to admit. I work
at a global health non-profit. My audience is usually program managers,
M&E specialists, researchers. People who are genuinely skeptical about
whether AI belongs in their workflows. People who need to see it work.

One spinner and you've set back AI adoption in that room by six months.

So I started asking a question I should have asked much earlier:

What actually has to happen live — and what am I running live out of habit?

That question rewired everything.


The split

Most live AI demos fail for the same reason. Not bad models. Not flaky
APIs. The architecture itself — generation and performance running in
the same process, at the same time, in front of people.

Audio synthesis. Script generation. API calls. All happening live,
while a room waits.

The fix isn't faster tools. It's removing the dependency entirely.

Split it into two phases:

Phase 1 — before anyone arrives: narration scripts, audio synthesis,
slide timing, the entire orchestration layer. All of it. Pre-computed,
cached, done.

Phase 2 — during the talk: only the things that genuinely have to
happen live. Real workflows. Real data. Real outputs.

By the time I walk into the room, the orchestration overhead is gone.
What's left is only the actual work — and it runs with nothing in its way.


Phase 1 — Before the room fills (15–20 min)

python -m core.pre_generate

Per slide, in order:

  1. slide_reader.py pulls text and structure from the PPTX
  2. script_agent.py sends it to Claude → 3–4 sentence narration back
  3. voice_engine.py synthesises audio via Edge TTS (free, no API key)
  4. ffprobe measures the actual duration of each generated file
  5. Everything lands in cache/manifest.json

That last step matters more than it sounds. The first version estimated
timing from word counts, and it was wrong constantly — natural pauses
and variation in delivery pace threw it off. Switching to ffprobe,
which measures the actual media duration, fixed it permanently. Now
timing is a property of the content, not a guess about it.
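A minimal sketch of that measurement — the ffprobe flags are standard, but the helper names are mine:

```python
import json
import subprocess

def parse_ffprobe_duration(ffprobe_json: str) -> float:
    """Extract the duration (in seconds) from ffprobe's JSON output."""
    return float(json.loads(ffprobe_json)["format"]["duration"])

def media_duration_seconds(path: str) -> float:
    """Ask ffprobe for the real duration of a generated audio file."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "json", path],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_ffprobe_duration(out)
```

The duration then goes straight into the manifest, so the runtime never has to guess.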

The pipeline is resumable. Re-run it and completed slides skip.
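The resume check is the whole trick. A sketch of the cache-first logic — the manifest schema and helper names are assumptions, and the actual generation calls are elided:

```python
import json
from pathlib import Path

CACHE = Path("cache")
MANIFEST = CACHE / "manifest.json"

def load_manifest() -> dict:
    return json.loads(MANIFEST.read_text()) if MANIFEST.exists() else {"slides": {}}

def slide_done(manifest: dict, n: int) -> bool:
    """Complete means a manifest entry exists AND the audio file is on disk."""
    entry = manifest["slides"].get(str(n))
    return bool(entry) and Path(entry["audio"]).exists()

def pre_generate(slide_texts: list) -> None:
    CACHE.mkdir(exist_ok=True)
    manifest = load_manifest()
    for n, text in enumerate(slide_texts, start=1):
        if slide_done(manifest, n):
            continue  # resumable: finished slides are skipped on re-run
        # ...call Claude for the script, Edge TTS for audio, ffprobe for duration...
        manifest["slides"][str(n)] = {"audio": str(CACHE / f"slide_{n:02d}.mp3")}
        MANIFEST.write_text(json.dumps(manifest, indent=2))  # checkpoint per slide
```

Writing the manifest after every slide, not at the end, is what makes a crash at slide 14 cost you one slide instead of fourteen.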


Phase 2 — During the talk

python -m core.orchestrator
manifest → play audio (pygame)
         → wait actual duration
         → PyAutoGUI advances slide
         → fire n8n webhook (if demo slide)
         → next

No API calls. No generation. Reads the manifest, plays files, fires
webhooks at the right moments.
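Condensed into a sketch, the runtime loop looks something like this — the manifest fields (`audio`, `duration`, `webhook`) are my assumed shape, and the third-party imports sit inside the function so the sketch loads without them:

```python
import json
import time

def playback_order(slides: dict) -> list:
    """Manifest keys are strings; play them in numeric order."""
    return sorted(slides, key=int)

def run_show(manifest_path: str = "cache/manifest.json") -> None:
    # Imported here so the module is importable without the runtime deps.
    import pygame
    import pyautogui
    import requests

    with open(manifest_path) as f:
        slides = json.load(f)["slides"]
    pygame.mixer.init()
    for no in playback_order(slides):
        entry = slides[no]
        pygame.mixer.music.load(entry["audio"])
        pygame.mixer.music.play()                # non-blocking playback
        time.sleep(entry["duration"])            # measured by ffprobe, not estimated
        if entry.get("webhook"):                 # demo slide: fire the n8n workflow
            requests.post(entry["webhook"], json={"slide": no}, timeout=5)
        pyautogui.press("right")                 # advance PowerPoint
```

Nothing in that loop can stall on a model. The slowest thing it does is an HTTP POST to localhost.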

One thing that took longer than expected: PyAutoGUI focus management.
Controlling another application's window from a separate Python process
sounds trivial. It isn't. Window focus, keypress timing, different
PowerPoint versions — all of it needed explicit settle delays and
focus checks. Budget an afternoon for this.

pynput handles keyboard controls globally without stealing PowerPoint
focus. SPACE pauses, D skips a demo countdown, Q quits.
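A sketch of how those keys might be wired — the state flags are illustrative, and the dispatch is split out from the pynput listener so the logic is testable on its own:

```python
state = {"paused": False, "skip": False, "quit": False}

def handle_key(name: str) -> bool:
    """Pure key dispatch; returning False tells the listener to stop."""
    if name == "space":
        state["paused"] = not state["paused"]
    elif name == "d":
        state["skip"] = True      # skip the demo countdown
    elif name == "q":
        state["quit"] = True
        return False
    return True

def start_listener():
    """Wire the dispatcher to a global pynput listener (background thread)."""
    from pynput import keyboard

    def on_press(key):
        name = "space" if key == keyboard.Key.space else (getattr(key, "char", "") or "")
        return handle_key(name)

    listener = keyboard.Listener(on_press=on_press)
    listener.start()
    return listener
```

Because `pynput` hooks keys globally, PowerPoint keeps focus the whole time — which is exactly why it's the right tool here and PyAutoGUI isn't.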


Phase 3 — Q&A

Final slide, mic opens automatically.

SpeechRecognition → Google STT
→ Claude (full conversation history maintained across turns)
→ Edge TTS → plays answer aloud → loops

The decision to maintain the full messages array across Q&A turns
was the smallest change with the biggest effect. Follow-up questions
get answered in full session context. Claude remembers what it said
two questions ago. That made the Q&A feel like a completely different
product from the single-turn version.
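The post doesn't show the turn loop itself; a minimal sketch with the Anthropic Python SDK, where the client wiring, system prompt, and model name are my assumptions:

```python
history = []  # the full messages array, preserved across Q&A turns

def make_client():
    # Imported lazily so the sketch stays importable without the SDK installed.
    from anthropic import Anthropic
    return Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def answer(client, question: str) -> str:
    """Send the whole conversation each turn so follow-ups keep context."""
    history.append({"role": "user", "content": question})
    reply = client.messages.create(
        model="claude-sonnet-4-20250514",  # model name is illustrative
        max_tokens=400,
        system="You are the presenter. Answer audience questions in persona.",
        messages=list(history),            # everything said so far, every turn
    )
    text = reply.content[0].text
    history.append({"role": "assistant", "content": text})
    return text
```

The cost is a slightly larger prompt each turn; the payoff is that "what about the second pipeline you mentioned?" actually resolves.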


The three live workflows

These are the parts that actually run live. They fire via webhook at
configured slide numbers. All three ship as importable JSON — connect
credentials and they run immediately.

Email Pipeline

Webhook (triggered by orchestrator)
→ Gmail: fetch latest email
→ Claude: classify intent + draft reply
→ Escalation Router
→ Gmail Draft + CC logic + Sheets log

This is the slide where the phone went face down.

n8n fired. Gmail opened on the secondary screen. Claude read an
incoming email, classified the intent, drafted a reply, routed it
for escalation. 4 seconds. Nobody touched a keyboard.

Meeting Pipeline

Webhook (paste any transcript)
→ Claude: action items, decisions, risks, follow-up email
→ Pull attendees from Sheets
→ Gmail + Slack + Sheets log

I pasted a transcript from a meeting that had happened that morning. By the time I finished talking over the slide, it was done: action items extracted, follow-ups written to 6 attendees, Slack notified, a row logged in Sheets.

Someone in the back: "wait, that just actually happened?"

Yeah.

Evidence Intelligence Engine

Webhook (research question)
→ Claude: decompose into sub-queries
→ Perplexity Web Search  ─┐
→ Perplexity Academic    ─┘ parallel
→ Claude: is this evidence strong enough?
→ [yes] → Google Doc brief
→ [no]  → rewrite queries → retry (max 2 rounds)
→ Slack + Sheets log

The quality gate is what gets the most questions. Claude evaluates
whether it actually has enough evidence to synthesise before writing
the brief. If not, it rewrites the queries and runs again. Caps at
two retries then forces synthesis anyway.
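In n8n this is a loop of nodes, but the control flow reduces to a small function. A sketch with the searches collapsed into one callable (the real pipeline runs the web and academic searches in parallel; all the names here are mine):

```python
MAX_ROUNDS = 2  # retries after the initial pass, then synthesis is forced

def research(question, search, evaluate, rewrite, synthesise):
    """Quality gate: retry with rewritten queries, then force synthesis."""
    queries = [question]
    for round_no in range(MAX_ROUNDS + 1):
        evidence = [search(q) for q in queries]
        if evaluate(evidence) or round_no == MAX_ROUNDS:
            return synthesise(evidence)  # strong enough, or out of retries
        queries = rewrite(queries, evidence)
```

The forced-synthesis branch is deliberate: a brief that admits thin evidence beats a workflow that loops forever on stage.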

Live. In the room. Under 90 seconds.

The question afterward wasn't "how did you build this?"

It was: "can I have this?"

That's the only metric that matters.


One thing I'd change immediately

I built this with ElevenLabs initially, then swapped to Microsoft Edge
TTS when I realised the quality was close enough for narration and the
price was literally $0. If you're building voice into an AI pipeline
for a resource-constrained organisation, start there. Upgrade only if
you actually hit a ceiling.


Structure

ai-presentation-orchestrator/
├── core/
│   ├── orchestrator.py      main runtime controller
│   ├── pre_generate.py      pre-generation pipeline
│   ├── regenerate.py        selective slide regeneration
│   ├── diagnose.py          pre-flight checks
│   └── logger.py
│
├── agents/
│   ├── script_agent.py
│   ├── slide_controller.py
│   └── slide_reader.py
│
├── integrations/
│   ├── voice_engine.py      Edge TTS + pygame
│   ├── heygen_engine.py     avatar video (optional)
│   ├── n8n_trigger.py
│   └── slack_notifier.py
│
├── n8n/                     3 importable workflow JSONs
├── docs/
│   ├── RUNBOOK.md           day-of checklist
│   └── architecture.svg
└── cache/                   auto-created, gitignored

Run it

git clone https://github.com/TrippyEngineer/ai-presentation-orchestrator
cd ai-presentation-orchestrator
pip install -r requirements.txt

cp env.example .env
# ANTHROPIC_API_KEY is the only required key

python -m core.diagnose      # before anything else
python -m core.pre_generate  # 15–20 min before the talk
python -m core.orchestrator  # when the room fills up

Install ffmpeg separately (it's a system binary, not a pip package).


TrippyEngineer / ai-presentation-orchestrator

Fully automated AI presentation system — pre-generates narration, fires live n8n demo workflows, and answers audience questions via voice Q&A. Zero manual input during the talk.

AI Presentation Orchestrator

Separate the compute from the performance.
Pre-generate everything. Cache it. Walk in and press Enter.

Python 3.10+ · MIT license


What This Project Demonstrates

  • Two-phase system design — generation and runtime are fully decoupled. All LLM calls, TTS synthesis, and video rendering happen before the presentation. The runtime reads a manifest and executes deterministically.
  • Cache-first pipeline — every output is stored with a manifest. Partial runs resume from where they stopped. Nothing is regenerated unless explicitly forced.
  • Multi-modal orchestration — synchronises audio playback, slide advancement, and live webhook triggers from a single timing source (actual media duration).
  • Modular integration layer — TTS, avatar video, n8n, Slack, and Google APIs are all isolated modules. Swapping any one of them does not affect the core pipeline.
  • Live voice Q&A — final slide activates a mic listener. Audience questions are captured via speech recognition, answered by Claude in persona, and spoken aloud via TTS. Conversation history is kept across turns.

Six talks since I rebuilt around this principle. Zero spinners.

The audience doesn't know there's a manifest. They don't know about
the cache. They see real workflows execute in 4 seconds. They hear
spoken answers before they've finished processing that it worked.

Nothing hesitates — because by the time the lights go down, the only
thing left to compute is the actual work.


Abhinav Goyal — building AI infrastructure for global health programs,
where reliability isn't optional. Drop questions below, especially if you've hit the live demo
reliability problem from a different angle.
