This blog post is a translation of the original post, with help from Gemini and some human reviewers. Pictures come from the original blog post and are left in French.
Artificial intelligence is everywhere. Its use is increasing and is transforming the way we approach our work. It is proving to be a valuable assistant, helping people accomplish their daily activities more quickly. Particularly in speech-to-text, it provides quick transcriptions of audio recordings, although human intervention remains vital to ensure that the generated content is accurate and on topic.
In this article, we will tell you about our experience integrating AI into the production of our podcast, Zenikast, and more specifically, using Google Gemini to generate transcripts of episodes.
🎙️ Zenikast, a new adventure
When we decided to launch Zenikast, Zenika's new podcast, we pondered how to make our episodes accessible.
Unfortunately, accessibility, especially concerning podcasts, is not often a priority. At Zenika, it's a topic we consider important. We are dedicated to bringing accessibility into our services, audits, training, and even our side activities like the podcast!
On podcast platforms like Apple Podcasts or YouTube, automatic transcriptions are generated, often containing errors and various hallucinations. We know that AI-based tools improve every day, but we were determined to provide near-perfect transcripts. To make this possible, we opted for a simple and effective solution: a Google Doc. We have an internal tool that lets us create meaningful links to web pages or documents, so we don’t have to share the obscure raw Google Doc link. This means we can quickly and easily share a link like https://links.zenika.com/link/zenikast/episode-1.
Many thanks to Emmanuelle Aboaf for taking the time to answer our questions and for her advice 🙏.
🧪 Our first tests: Whisper Transcribe
For Season 1, the podcast was managed by several people with technical backgrounds. We experimented with various tools, including Node.js and Java libraries, as well as directly installable software. However, Zenikast is intended to be available to everyone at Zenika, whether they have a technical background or not. Therefore, the tool and method for creating our transcripts needed to be accessible to all Zenika employees.
We chose Whisper Transcribe and were able to test our first transcriptions using the free version. A big thank you to everyone who worked on this fantastic project 💪.
A significant feature in Whisper Transcribe is diarization: the automatic detection of the different speakers in the podcast. This feature is incredibly useful for our episodes, which can have up to six different voices.
Specific words, abbreviations, or technical terms can be added to Whisper's "known" word list.
Despite its quality, each transcript still required several hours of proofreading to optimize the text’s readability. Additionally, Whisper Transcribe is a paid service for more intensive use.
💡 Towards generative AI with Vertex AI
At Devoxx France in April, a conversation with Valentin Deleplace convinced us to continue our experiments with Gemini and Vertex AI Studio. It's possible to import an audio file (.mp3 or .wav) and ask Gemini to provide the episode's transcript, all without writing a single line of code.
The result was quite satisfactory and piqued our curiosity to try it out during our next podcast episode, which was recorded at Devoxx (see the episode's video).
🧪 Feedback: testing Gemini 2.5 Pro
We went straight to the latest model. Available in preview via Vertex AI, Gemini 2.5 Pro can process multimodal prompts: text, image, audio, and video. Unlike previous generations, it integrates much more advanced reasoning capabilities.
Getting started is quick: after importing an mp3 or wav file, we interacted with the model directly in Vertex AI Studio, and in a few seconds Gemini generated solid text that closely matched the audio. Whereas Whisper Transcribe needed painstaking proofreading, Gemini’s output was a significant improvement. More importantly, it is also capable of diarization.
🪜 Step 1: Transcription with voice detection
We imported a wav file into Vertex AI Studio and gave it this simple instruction (this is a translation of our initial prompt in French).
Can you generate a transcript of this audio file for me?
Context: for Zenika’s podcast, Zenikast, we are recording an episode at Devoxx France 2025. The first person speaking is Jean-Philippe and the second person is Benjamin.
Can you provide a transcript in text format?
In a few seconds, Gemini produced a structured transcript, partitioned by speaker ("Speaker 1", "Speaker 2", etc.). The raw result was really good: no hallucinations, very few mistakes, and above all, a good understanding of the transitions between speakers. This trial convinced us of the potential. But we wanted to go further: to improve the readability of the text, to make it more pleasant to read, or even publishable as is.
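Everything above happened in the Vertex AI Studio UI, but the same request can be scripted. Here is a minimal sketch using the google-genai Python SDK against Vertex AI; the project ID, bucket path, and model ID are placeholders for illustration, not our actual setup.

```python
# pip install google-genai
from google import genai
from google.genai import types

# Placeholder project/region; authentication goes through
# Application Default Credentials (gcloud auth application-default login).
client = genai.Client(vertexai=True, project="my-gcp-project", location="us-central1")

# The episode audio, uploaded beforehand to a Cloud Storage bucket (placeholder path).
audio = types.Part.from_uri(
    file_uri="gs://my-bucket/zenikast/episode-devoxx.wav",
    mime_type="audio/wav",
)

prompt = (
    "Can you generate a transcript of this audio file for me?\n"
    "Context: for Zenika's podcast, Zenikast, we are recording an episode at "
    "Devoxx France 2025. The first person speaking is Jean-Philippe and the "
    "second person is Benjamin.\n"
    "Can you provide a transcript in text format?"
)

# Multimodal prompt: the audio part plus the text instruction.
response = client.models.generate_content(
    model="gemini-2.5-pro",  # the exact model ID may differ for the preview version
    contents=[audio, prompt],
)
print(response.text)
```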
🪜 Step 2: Automatic cleaning of verbal tics
In a second prompt, we asked for exactly the same transcript, but with verbal tics such as “uh” and “so” removed. Gemini preserved the overall gist of the discussion and made the text much more fluid to read.
This is the prompt:
Can you generate a transcript of this audio file for me?
Context: for Zenika’s podcast, Zenikast, we are recording an episode at Devoxx France 2025. The first person speaking is Jean-Philippe and the second person is Benjamin.
Can you provide the transcript in text format? Can you also clean up the text by removing verbal tics like "uh", word repetitions, etc., to make it readable while remaining as faithful as possible to the original text?
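In code, this second pass only changes the prompt. Assuming the `client` and `audio` objects from the previous sketch, it could look like this:

```python
cleanup_prompt = (
    "Can you generate a transcript of this audio file for me?\n"
    "Context: for Zenika's podcast, Zenikast, we are recording an episode at "
    "Devoxx France 2025. The first person speaking is Jean-Philippe and the "
    "second person is Benjamin.\n"
    "Can you provide the transcript in text format? Can you also clean up the "
    'text by removing verbal tics like "uh", word repetitions, etc., to make '
    "it readable while remaining as faithful as possible to the original text?"
)

# Reuses the `client` and `audio` objects defined in the previous sketch.
cleaned = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=[audio, cleanup_prompt],
)
print(cleaned.text)
```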
🪜 Step 3: Two complementary versions
This dual output now gives us two complementary versions: one that remains faithful to the original audio and another that is more pleasant to read.
🧭 Grounding: verified answers from reliable sources
In the previous screenshots, you'll notice that we activated an option called "Grounding" in the right-hand panel.
The idea behind this is to connect Gemini to verifiable data sources, such as Google Search, Google Maps, or even your own data. This drastically reduces the risk of hallucinations.
For example, if you ask Gemini to summarize a recent news article, generate up-to-date content, or even answer a very precise question on a constantly evolving topic, grounding with Google Search allows the output to be based on recent and relevant web content. Even better, citations and confidence scores can be added to the output which enhances auditability and transparency.
It's possible to connect Gemini to a RAG Engine, Vertex AI Search, or Elasticsearch.
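In the Studio UI, grounding is a simple toggle; with the SDK, it is declared as a tool on the request. Here is a minimal sketch, assuming the same google-genai client as before and grounding with Google Search (the question is just an example):

```python
from google import genai
from google.genai import types

client = genai.Client(vertexai=True, project="my-gcp-project", location="us-central1")

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents="Summarize the most recent Gemini model announcements.",
    config=types.GenerateContentConfig(
        # Declares Google Search as a grounding source for this request,
        # so the answer is based on recent web content.
        tools=[types.Tool(google_search=types.GoogleSearch())],
    ),
)
print(response.text)
# The sources and supporting segments come back as grounding metadata,
# which is what enables citations and auditability.
print(response.candidates[0].grounding_metadata)
```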
🚀 What's next, based on AI agents?
Using artificial intelligence, whether Whisper Transcribe or more recently Gemini, has saved us several hours on transcribing our episodes. During our initial tests, we spent 3 to 4 hours copyediting an episode. Today, even if the process isn’t as smooth and automated as we’d like, we spend only about 30 minutes.
Given the announcements made by Google in April and May 2025, improvements will quickly be available and could further accelerate the process. This will allow any collaborator at Zenika to create transcripts in just a few clicks (via the AgentSpace platform, the Agent2Agent (A2A) protocol, and the Agent Development Kit, ADK). Stay tuned for more about this in a future blog post.
👉 Post written with Benjamin Bourgeois
🙏 Thanks to William for his review