As many of you know, I have been hosting a podcast for the last three and a half years. It has been one of the most exciting experiences of my life, and over the years I have produced approximately 200 hours of audio content.
Audio content has a drawback, though: if you want to find something that was said, you often have to listen to hours and hours of old episodes without ever reaching the point you were searching for.
How to solve this problem?
Transcript
The first step is to transcribe the episodes. From the beginning, I had a simple pipeline in place to transcribe each episode.
It turned out to be a total failure! The transcriptions were based on AWS Transcribe, and the service couldn't transcribe Italian audio correctly, perhaps because of the technical English words or the remarkable Sardinian accent.
The outcome was terrible: the transcripts were impossible to read and understand, and they were not usable for my primary purpose. On top of that, each transcription cost around 1 euro, and despite the low price, it was an absolute waste of money.
After one year, I stopped running the Lambda functions and decided not to transcribe the episodes anymore.
But the wise man retraces his steps, and who am I not to review my decisions?
Reviewing a decision is an outstanding practice, and doing it when the context changes can help us stay in tune with the world around us and catch new opportunities.
Since OpenAI started releasing its products, our industry has been swept into a whirlwind of astonishment; ChatGPT, Copilot, and DALL·E were perceived as masterpieces, but another service caught my attention.
Whisper, as its name suggests, arrived without making much noise. While all the attention was on ChatGPT, Whisper was exceptionally interesting. Its quality, compared to other transcription services, is remarkable. It performs excellently in Italian, accurately recognising English words and technical jargon. I have never seen such precision before. Moreover, there is another non-trivial aspect—it is open source and released under the MIT license!
After conducting a test, I quickly embraced Whisper to transcribe the episodes. At first, I was tempted to set up a machine on AWS to run the entire process in the cloud. However, Whisper requires a massive amount of resources and time. Ultimately, I chose to run it on my gaming machine, which had been dormant for a while. My desktop computer, equipped with an RTX 3060, was the perfect candidate to put it to the test, especially since I had stopped playing video games.
Whisper offers different pre-trained models, ranging from small to large. The largest one, which provides the best quality but is also the slowest, can utilise the mathematical capabilities of the GPU.
Since I am not a Python developer and PyTorch is unfamiliar to me, starting from scratch to implement the transcription script was nearly impossible.
Thankfully, a simple Docker image came to my rescue. This container simplifies all the steps and provides a REST API directly.
https://github.com/ahmetoner/whisper-asr-webservice
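For completeness, starting the container looks roughly like this. The image name, the ASR_MODEL variable, and the GPU tag below are my recollection of the project's README, so check the repository for the exact, current options:
docker run -d --gpus all -p 9000:9000 -e ASR_MODEL=large onerahmet/openai-whisper-asr-webservice:latest-gpu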
Now, it is enough to navigate to the web port exposed by the container to reach the Swagger UI. From there, select the audio file, select the language (in my case, Italian), and wait around 30 minutes. Each episode is one and a half hours long, so Whisper needs quite a bit of time to transcribe it. In the end, we receive a well-structured JSON file containing the transcription with time references.
Cool, but now it's time to play with the code!
Make it searchable
With Whisper, I have completed the first half of the problem. Now it's time to discuss how to implement the search functionality.
Whisper can export in multiple formats, including TXT, VTT, SRT, TSV, and JSON.
In my case, I will be using the JSON format. It contains both the raw transcribed text and the time-coded segments. The raw text is displayed on the episode page and is crucial for SEO purposes. The time-coded segments power the search functionality, which is one of the main pillars of this project.
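For reference, here is a trimmed (and invented, purely illustrative) example of what the Whisper JSON output looks like; the plugin described later only reads the segments array and each segment's start and text fields:
{
  "text": "Ciao a tutti e benvenuti su Gitbar...",
  "segments": [
    { "id": 0, "start": 0.0, "end": 7.5, "text": " Ciao a tutti e benvenuti su Gitbar..." },
    { "id": 1, "start": 7.5, "end": 14.2, "text": " Oggi parliamo di..." }
  ]
}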
The search process is fairly straightforward. There will be an input box where users can enter the words they wish to search for.
After submitting the search query, a list of audio samples that match the searched terms will be displayed.
Clicking the "play" button next to each text slice plays the episode starting from the moment those words are pronounced.
The Gitbar website has no backend; it is entirely static and built using Astro, an excellent framework.
https://github.com/brainrepo/gitbar-2023
How to manage the search feature? Should I install Elasticsearch? How much will it cost? Or should I consider using Algolia?
These questions arose as I started implementing the feature.
From the beginning, I ruled out Elasticsearch as an option. Managing an Elasticsearch instance is not trivial, as it requires a server and computational capacity. Similarly, Algolia incurs additional costs, and since we rely on donations to cover Gitbar's expenses, we need to keep expenditures to a minimum.
Therefore, I needed to find an alternative solution.
I have been following Michele's project, Orama Search, since its inception, and I believe he, Paolo, Angela, and their "partners in crime" are doing incredible work with it.
If JavaScript has democratised software development, I would say that Orama Search (also known as Lyra for nostalgic folks like me) has done the same for the search experience.
Initially, JavaScript may seem limiting, but thanks to it, we can run Orama Search everywhere, from the client to the server. It's truly amazing!
Another appealing aspect of Orama is its immutable nature, which makes it the perfect fit for my use case.
Since the Gitbar website is statically generated, it is not an issue for me to build the search index during the page generation process and share it as a simple JSON file.
To accomplish that, I created an Astro plugin inspired by the official one.
Now, let's dive into the details of what I have implemented.
Creating an Astro Plugin for Orama Search
Orama Search provides built-in support for Astro. It takes the generated files of the website and creates an index or database from the content within the HTML pages. However, my use case had specific requirements that differed from the common ones.
You can refer to the Astro plugin documentation for Orama for more information.
To meet my specific needs, I had to index a particular data structure that included the following fields:
- text: The transcribed fragment, usually consisting of 10-15 words.
- title: The episode title.
- from: The timestamp indicating when the words are pronounced.
- episodePath: The path of the episode page.
Given this requirement, I had to create a plugin from scratch to support it.
Astro provides a plugin API that allows us to extend its capabilities. It's important to note that the plugin API is relatively low-level. While it grants access to many internal details, it also requires caution when making changes to avoid unintended consequences.
Now, let's go through the steps involved in creating the plugin.
Initial Setup:
To start, create a new folder in the root directory of your project called /plugins. This folder will hold all the plugins for this project. Each Astro plugin is a JavaScript (or TypeScript) file that exports a single function.
export default () => ({
name: "GITBAR-ASTRO-SEARCH",
hooks: {
"astro:server:start": async () => {
await initDB("dev");
},
"astro:build:done": async () => {
await initDB("prod");
},
},
});
Astro's core functionalities can be extended using hooks, which allow us to run custom JavaScript code at specific moments in Astro's lifecycle. In this case, we want to hook into two pivotal moments: the server start phase and the build done phase, when the website is fully built.
- astro:server:start: Since Gitbar is a static website served by Netlify, we don't need a Node server to run it in production. However, in the development environment, we want the plugin to build the Orama database for us so we can use it while developing.
- astro:build:done: We use this hook to build the production database. When we release the website, along with the static pages we also publish a JSON file that contains a serialised Orama database.
Data Preparation and Ingestion
To prepare the data for seeding the Orama database, I followed a multi-step process. Here's a breakdown of the steps I took:
Fetch the episodes from the podcast feed using the @podverse/podcast-feed-parser library. This allowed me to retrieve the necessary episode data.
const { episodes } = await getPodcastFeed(podcastFeed);
Iterate over the list of episodes and check if there is a corresponding transcription file in the transcriptions folder. I looked for a file with the episode number as its filename.
const { episodes } = await getPodcastFeed(podcastURL);
const episodeSegments = episodes.map(async (episode) => {
const episodeNumber = extractEpisodeNumber(episode.title);
try {
const json = await import(`../transcriptions/${episodeNumber}.json`);
const segments = json.segments.map((s) => ({
title: episode.title,
path: getSlug(episode),
from: s.start,
text: s.text,
mp3url: episode.enclosure.url,
}));
return segments;
} catch (e) {
console.log(`Transcription ${episodeNumber} not found`);
}
return [];
});
const results = await Promise.all(episodeSegments);
const segmentsToInsert = results.flat();
After obtaining the segments for each episode, I flattened them into a single array that contains all the segments of all the episodes.
After that, I created the Orama database by calling the create function, providing the desired schema.
const db = await create({
schema: {
title: "string",
text: "string",
from: "number",
path: "string",
mp3url: "string",
},
});
To efficiently insert the segments into the Orama database, I divided them into chunks of 500 units and used the insertMultiple function for batch insertion.
await insertMultiple(db, segmentsToInsert, 500);
Finally, to complete the plugin, I serialised the database to a JSON file. This allowed me to share the database as a simple JSON file.
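That serialisation step might look roughly like the sketch below. It assumes a save counterpart to the load function used later on the frontend is exported by @orama/orama (the exact persistence API may differ across versions; the official @orama/plugin-data-persistence package is an alternative), and the output path is only an example:
import { writeFile } from "node:fs/promises";
import { save } from "@orama/orama";

// Serialise the in-memory index and write it where Astro serves static assets
const serialisedDb = await save(db);
await writeFile("public/in_episode.db.json", JSON.stringify(serialisedDb), "utf-8");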
With these steps, I am able to prepare and ingest the necessary data into the Orama search index, using the appropriate schema and chunking techniques to optimise performance.
Search component
Now it's time to come out of the shadows and create the component for our search form. Since Astro supports React components, let's write our search component.
const Search = () => {
return ...
}
I will skip some parts to keep the focus on the interesting ones.
The first thing I want to do is fetch the Orama database that we built earlier from the network. I want to do this during the mounting phase of the component.
I'll use a useEffect
hook where:
- I initialise the Orama instance with the same schema used before.
- I load the database file and track the loading state to disable the search UI during the loading process.
- I load the fetched data into the Orama instance.
- I update the DB state to make the Orama instance available to the component.
import { search, create, load } from "@orama/orama";
...
const Search = () => {
// State that holds the database
const [DB, setDB] = useState(null);
// State that holds the loading status
const [isLoading, setIsLoading] = useState(true);
useEffect(() => {
const getData = async () => {
const _db = await create({
schema: {
title: "string",
text: "string",
from: "number",
path: "string",
mp3url: "string",
},
});
setIsLoading(true);
const resp = await fetch(`/in_episode.db.json`);
const data = await resp.json();
// Load the serialised index into the fresh Orama instance
load(_db, data);
setDB(_db);
setIsLoading(false);
};
getData();
}, []);
return ...
}
Lastly, we need to create our search function, which we'll call when there is a change in the search field input.
To avoid creating a new find function on every render, we'll use the useCallback hook to cache it and update it only when DB or setResults changes.
The rest of the function calls Orama's search function with the search term and runs the search on all properties, retrieving the first 100 results.
const find = useCallback(
async (term: string) => {
const res = (
await search(DB, {
term,
properties: "*",
limit: 100,
})
)?.hits?.map((e) => e.document);
setResults(res);
},
[DB, setResults]
);
Now all that's left is to attach this function to an input field change event, and we're done!
<input
className="bg-transparent w-full text-white p-6 text-2xl outline-yellow-300 outline-3 placeholder:text-yellow-300"
placeholder="Search..."
onChange={(e) => find(e.target.value)}
/>
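As a side note that is not part of the original implementation: because find runs on every keystroke, it can be worth debouncing the handler so the index isn't queried for each character. A minimal, hypothetical sketch:
import { useMemo } from "react";

// Tiny debounce helper: waits until typing pauses for `ms` before calling `fn`
const debounce = (fn, ms = 200) => {
  let timer;
  return (...args) => {
    clearTimeout(timer);
    timer = setTimeout(() => fn(...args), ms);
  };
};

// Inside the component: memoise the debounced wrapper so it survives re-renders
const debouncedFind = useMemo(() => debounce(find, 200), [find]);
// <input onChange={(e) => debouncedFind(e.target.value)} ... />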
The audio player features are beyond the scope of this article. Let me know if you would like me to write an article on that.
I intentionally left out some other implementation details. If you want a running code example, take a look at https://github.com/brainrepo/gitbar-2023/blob/main/src/plugins/lyra_in_episode.ts for the database creation and https://github.com/brainrepo/gitbar-2023/blob/main/src/components/searchAdvanced.tsx for the frontend JSX code.
The limits
Currently, the search feature is in production on this URL, but if you check the network inspector, you will see that the Orama database is more than 40MB. Note that not all the episodes are indexed: the transcription process takes a lot of time, and so far I have transcribed and indexed episodes 130 to 156.
The size is not negligible, and I expect that the database can reach 200MB pretty soon.
Bootstrapping a Node.js server to run the search seems like the most reasonable way to solve the problem, but I don't want to do that, so I need a plan B.
Plan B
My Plan B has two levels:
- The first solution is very straightforward. The JavaScript ecosystem offers some fantastic libraries that can assist with GZIP compression. It appears that the Netlify servers don't handle compression, so I can leverage libraries such as pako, tiny-inflate, uzip.js, or fflate to achieve this goal (see the sketch just after this list). I conducted some tests, and the compressed database size was reduced to just 10% of the original. Implementing this solution requires only a few lines of code, fewer than four. With it, I can easily handle up to 300 episodes with sustainable download times. Considering that I have recorded around 160 episodes in the past three and a half years, I can sleep soundly because I have ample time ahead.
- Whenever I encounter an upper limit, I usually engage in a mental exercise to find a workaround. What if I group the episodes into chunks of 10 or 20 elements and create an Orama database for each group? Furthermore, what if I begin my search with the most recent episodes and, once I have fetched and searched within the first group, proceed to the older databases until the result limit is reached? This approach would force the results to be sorted by date (which could even be considered a feature), and the sequential style of this search prevents the need to download all the databases in advance.
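To make the first idea concrete, here is a minimal client-side sketch. It assumes the index is published pre-compressed under a hypothetical name (in_episode.db.json.gz) and uses fflate's gunzipSync and strFromU8 helpers:
import { gunzipSync, strFromU8 } from "fflate";
import { create, load } from "@orama/orama";

// Fetch the gzipped index and decompress it in the browser
const resp = await fetch(`/in_episode.db.json.gz`);
const compressed = new Uint8Array(await resp.arrayBuffer());
const data = JSON.parse(strFromU8(gunzipSync(compressed)));

// Re-create the instance with the same schema as before and load the data
const db = await create({
  schema: { title: "string", text: "string", from: "number", path: "string", mp3url: "string" },
});
load(db, data);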
There is always a Plan C
OK, these two Plan B ideas have stimulated my rambling. What if I incorporate the ChatGPT API to provide answers in natural language, using Orama for the time-stamped results? That way, I can have a contextual conversation with the excerpt generated by ChatGPT and rely on the accurate source of information and timestamp references from the Orama search results.
Alright, alright, it's time to come back to reality now. @micheleriva, I'm blaming you for this mind-bending journey 😂!
Before wrapping up this article, I want to extend my heartfelt appreciation to Michele, Paolo, Angela, and everyone involved in the hard work on Orama. Hey folks, keep up the great work—I'm a big fan!