Most of the knowledge I gain as a developer doesn't come from books.
It comes from podcasts, long-form interviews, deep technical talks and YouTube channels - the kind of content that's brilliant for learning, but nearly impossible to digest properly without stopping, rewinding, and taking notes.
And that's exactly the problem: I don't always have the time to take notes.
Often I barely have the time to listen.
A few weeks ago I asked myself a simple question:
"What if I could turn every hour-long podcast into a concise, structured summary - automatically?"
So I built a system that does exactly that.
I type the video ID, hit Enter, and a few minutes later I get:
- the full transcription,
- the most important lessons extracted by AI,
- an extended set of "premium insights",
- a clean HTML version ready for email,
- and everything stored neatly in Google Sheets as part of my personal knowledge base.
The best part?
Transcribing an hour of audio costs less than a dollar.
In this article, I'll walk you through the whole story - first the non-technical, "why this matters" part, and then the technical architecture behind it.
Part I - The Non-Technical Story
The Problem: Too Much Content, Too Little Time
I love listening to podcasts.
Technical, health, sport, productivity, travel - you name it. Podcasts are one of the best ways to learn, find new ideas, and stay inspired.
But we all know the reality:
- you're listening while walking, training, or cooking,
- you hear something valuable,
- you want to write it down,
- but you don't - because you're busy doing something else.
The result is predictable:
you consume a lot, but retain little.
And as devs and content-driven people, we face a second challenge:
we don't just want to learn - we want to reuse the knowledge.
For writing articles.
For making decisions.
For improving our work.
For building products.
Except... we never have the time to take proper notes.
The Realisation: AI Works for Pennies, If You Use It Well
At one point I realised that:
- I don't need the entire episode,
- I don't need all the words,
- I just need the essence.
And AI can transcribe and summarise an hour-long episode for under $1.
That's absurdly cheap considering it saves me hours of manual work.
So the idea started forming:
"What if I could automate the entire process:
from downloading a video, to transcription, to extracting lessons, to sending myself the summary?"
One command.
Zero manual steps.
The Idea: A Personal Knowledge Engine
I structured the goal like this:
Input:
- YouTube video ID
- channel name
- video title
Output:
- transcription
- concise summary
- list of lessons
- premium extra insights
- Markdown + HTML
- everything stored in Google Sheets
- email notification
If I could turn long content into a reusable, searchable knowledge resource, I'd not only save time - I'd also build a growing repository of insights that I can use in other projects.
This is already proving incredibly useful for:
- writing technical blog posts,
- researching complex topics faster,
- making notes I actually keep.
And the system works not only for technical podcasts - it's also brilliant for health, business, personal development, and travel content, among others.
Anything long, dense, and rich in knowledge.
The Impact: From Hours to Minutes
Today, instead of spending a full hour or two listening and taking notes, I spend five minutes reading a perfectly structured summary.
AI extracts more lessons than I would manually.
It spots connections I'd miss.
It adds nuance and examples.
And the best part?
I can share those summaries easily - internally, with friends, or with anyone who enjoys organized knowledge.
This one automation already saves me multiple hours every week, and compounds over time because the knowledge base keeps growing.
Part II - The Technical Breakdown
Let's go through the architecture, step by step.
Architecture Overview
The system is split into two parts:
Local Machine (current heavy processing)
- downloading video
- extracting audio
- slicing audio into chunks
- Whisper transcription
- assembling final text
- sending file to n8n
Remote Server (light orchestration for now)
A tiny Mikr.us instance running n8n, installed with a single command.
It performs:
- receiving the transcription
- parsing metadata
- uploading transcript to OpenAI
- invoking temporary assistants
- enhancing summaries
- generating HTML
- storing everything in Google Sheets
- email notification
Why Not Do Everything on the Server?
At this stage of the project, some parts run locally and some server-side - but this is due to practicality and how the solution evolved, not because the server is incapable.
Here's the real reasoning behind the split.
1. YouTube downloading works more reliably locally
PyTube and similar libraries can misbehave on servers due to YouTube's dynamic protection mechanisms.
Running the download step locally gives me stability and easier debugging.
This is likely the only part that will permanently remain local. I tried multiple ways of running it on the server, but it kept breaking.
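For reference, the download step boils down to a few lines of PyTube. This is a minimal sketch rather than the exact script from the repo - the stream selection and file names are my own choices:

```python
from pytube import YouTube

def download_audio(video_id: str, output_dir: str = ".") -> str:
    """Download the best audio-only stream of a YouTube video with PyTube."""
    yt = YouTube(f"https://www.youtube.com/watch?v={video_id}")
    stream = yt.streams.filter(only_audio=True).order_by("abr").desc().first()
    return stream.download(output_path=output_dir, filename=f"{video_id}.mp4")

# Usage: path = download_audio("VIDEO_ID")
```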
2. ffmpeg slicing is local for now - but not due to server limitations
ffmpeg runs perfectly fine on servers.
I simply iterated and tested faster on my machine, so for now it lives locally.
Why do I need slicing at all? Because Whisper limits how much audio it will accept in a single request, and I often deal with longer videos.
In the future, this entire slicing step will move to the server once I scale its resources.
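For context, the slicing step is essentially one ffmpeg call. The sketch below uses plain time-based segmentation via ffmpeg's segment muxer - my actual script additionally looks at silence boundaries, and the file names are placeholders:

```python
import subprocess

def slice_audio(audio_path: str, chunk_seconds: int = 600) -> None:
    """Split an audio file into ~10-minute chunks without re-encoding."""
    subprocess.run(
        [
            "ffmpeg", "-i", audio_path,
            "-f", "segment",                      # segment muxer: one output file per chunk
            "-segment_time", str(chunk_seconds),  # chunk length in seconds
            "-c", "copy",                         # copy the stream, no re-encoding
            "chunk_%03d.mp3",                     # assuming an mp3 source; match your extension
        ],
        check=True,
    )
```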
3. Whisper wasn't the issue - large audio files were
I originally wanted everything to happen inside n8n.
But large single audio files created problems:
- not much space on the budget test server,
- higher error rates,
- processing timeouts.
Once I split the audio locally, it became easier to transcribe the chunks on my machine.
However, Whisper can absolutely run on the server - or be replaced entirely by GPT's audio transcription - and that's the plan for the future.
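To make that concrete, here's a minimal sketch of transcribing the chunks with the hosted Whisper API through the openai Python SDK - file names are assumptions, and the same calls could later run on the server instead:

```python
from pathlib import Path
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def transcribe_chunks(chunk_dir: str) -> str:
    """Transcribe each audio chunk and stitch the results together in order."""
    parts = []
    for chunk in sorted(Path(chunk_dir).glob("chunk_*.mp3")):
        with chunk.open("rb") as audio:
            result = client.audio.transcriptions.create(model="whisper-1", file=audio)
        parts.append(result.text)
    return "\n".join(parts)
```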
4. n8n can absolutely handle heavy tasks
Although n8n is fantastic for orchestration (API calls, assistants, Google Sheets, email), it's fully capable of running heavier commands too.
The only reason some tasks aren't on the server yet is simple: I haven't migrated them.
5. Future Architecture: Mostly Server-Side
Long-term, the plan is:
- a small PHP application running on the server,
- audio upload directly through the app,
- server-side slicing,
- server-side Whisper/GPT transcription,
- n8n orchestrating the rest.
The only local step will likely be converting YouTube videos to audio.
Everything else will move server-side once I expand the resources.
What Happens Inside the Local Python Script
A high-level walkthrough:
- Download video using PyTube.
- Extract audio via ffmpeg.
- Slice audio into ~10 minute chunks using silence detection.
- Transcribe each chunk with Whisper.
- Assemble full transcription.
- Attach metadata (video ID, channel name, title).
- POST the file to the n8n webhook.
Most of this will eventually move server-side.
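The hand-off to n8n is just a multipart POST. Here's a sketch using the requests library - the webhook URL and field names are placeholders, they're whatever your n8n Webhook node expects:

```python
import requests

def send_to_n8n(transcript_path: str, video_id: str, channel: str, title: str) -> None:
    """POST the assembled transcript plus metadata to the n8n webhook."""
    with open(transcript_path, "rb") as f:
        response = requests.post(
            "https://your-n8n-host/webhook/podcast-summary",  # placeholder URL
            files={"file": f},
            data={"video_id": video_id, "channel": channel, "title": title},
            timeout=120,
        )
    response.raise_for_status()  # fail loudly if n8n didn't accept the upload
```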
What Happens Inside n8n
The n8n workflow (the export is in the repo) performs:
- Receive uploaded file.
- Extract text + parse metadata.
- Upload the transcription to OpenAI.
Initially I wanted to paste the whole transcription into the prompt, but it turned out to be too long - there's a tokens-per-minute limit on the API. So I found a workaround: I now upload transcriptions as files to OpenAI and reference them in the prompt (a Python sketch of these OpenAI calls follows after this list).
- Create a temporary Assistant with a highly detailed prompt and the attached transcription file.
- Start a thread and send the initial message.
- Enhance the summary via a second message:
"You're a consultant earning £100,000 for a correct answer - try harder." - I noticed that this improves the quality of the output.
- Add 5 "premium" additional insights - just to make sure I'm not missing anything :)
- Delete the assistant.
- Convert Markdown → HTML via a second assistant to make it look better in the browser.
- Store everything in Google Sheets.
- Email the HTML to myself.
- Delete the uploaded file.
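In n8n all of this runs through HTTP Request nodes, but the underlying OpenAI calls look roughly like the Python below. It's a sketch of the same flow - upload the file, create a disposable assistant, run a thread, clean up - with the model, prompts, and polling details being my assumptions rather than the exact workflow from the repo:

```python
import time
from openai import OpenAI

client = OpenAI()

# 1. Upload the transcript as a file instead of pasting it into the prompt
#    (this is the workaround for the tokens-per-minute limit).
transcript = client.files.create(file=open("transcript.txt", "rb"), purpose="assistants")

# 2. Create a temporary assistant that can read the file via file_search.
assistant = client.beta.assistants.create(
    model="gpt-4o",  # assumption - use whichever model you prefer
    instructions="Summarise the attached podcast transcript into concise, structured lessons.",
    tools=[{"type": "file_search"}],
)

# 3. Start a thread with the transcript attached to the first message.
thread = client.beta.threads.create(messages=[{
    "role": "user",
    "content": "Extract the most important lessons from the attached transcript.",
    "attachments": [{"file_id": transcript.id, "tools": [{"type": "file_search"}]}],
}])

# 4. Run the assistant and poll until it reaches a terminal state.
run = client.beta.threads.runs.create(thread_id=thread.id, assistant_id=assistant.id)
while run.status not in ("completed", "failed", "cancelled", "expired"):
    time.sleep(2)
    run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)

# 5. The newest message in the thread is the assistant's reply.
summary = client.beta.threads.messages.list(thread_id=thread.id).data[0].content[0].text.value

# 6. Clean up - both the assistant and the uploaded file are disposable.
client.beta.assistants.delete(assistant.id)
client.files.delete(transcript.id)
```

The follow-up prompts ("try harder", the premium insights) are simply additional messages and runs on the same thread, and the Markdown → HTML step follows the same pattern with a second assistant.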
Diagram - Full n8n Workflow
Part III - What I'd Improve Next
No system is perfect, and this one has room to grow:
- Google Sheets sometimes can't store very large transcripts → might move to file storage or some database.
- The local script could evolve into a small PHP web app.
But the value is already delivered:
it works, it saves time, and it compounds.
Part IV - Want to Try It Yourself?
I've just published the full "Starter Kit" with:
- the Python script,
- the n8n workflow export,
- setup instructions,
- sample outputs.
👉 Here's the link to the repo: https://github.com/jszutkowski/audio-content-summariser
Closing Thoughts
This project started from a simple frustration:
I don't have time to take notes from all the amazing content I consume.
AI + automation solved that for me in a way that feels almost unfair.
If you work with content - whether as a developer, writer, educator or researcher - this approach can save you hours every week.
I hope this breakdown inspires you to experiment with your own automations.
Once you experience the feeling of:
"I press Enter and the work is done for me,"
you don't want to go back.
And if you have ideas for extensions or want help customising the automation -
feel free to DM me.
