This article is part of a tutorial series on txtai, an AI-powered semantic search platform.
This article covers the transcription of audio files to text using models provided by Hugging Face.
Install txtai and all dependencies. Since this article uses optional pipelines, we need to install the pipeline extras package.
pip install txtai[pipeline]

# Get test data
wget -N https://github.com/neuml/txtai/releases/download/v3.5.0/tests.tar.gz
tar -xvzf tests.tar.gz
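Before transcribing, it can help to check how an audio file is encoded. The sketch below is not part of txtai; it uses Python's standard `wave` module and assumes the files are plain PCM WAV. The helper names `wav_info` and `write_silence` are made up for this illustration, and the demo runs on a generated file of silence so it works without the test data. The wav2vec2 960h checkpoints were trained on 16 kHz speech, so the sample rate is the value worth looking at.

```python
import os
import tempfile
import wave


def wav_info(path):
    """Return (channels, sample_rate_hz, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        frames = wav.getnframes()
    return channels, rate, frames / float(rate)


def write_silence(path, rate=16000, seconds=1):
    """Write a mono, 16-bit WAV file of silence (a stand-in for real audio)."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * rate * seconds)


# Demo on a generated file. Point wav_info at a real file, such as
# "txtai/Beijing_mobilises.wav", to inspect the test data instead.
demo_path = os.path.join(tempfile.mkdtemp(), "silence.wav")
write_silence(demo_path)
print(wav_info(demo_path))  # (1, 16000, 1.0)
```

If a file reports a different sample rate, resampling it to 16 kHz before transcription is a reasonable first thing to try when results look poor.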
The Transcription instance is the main entry point for transcribing audio to text. The pipeline abstracts transcribing audio into a one-line call.
The pipeline executes logic to read audio files into memory, run the data through a machine learning model and output the results to text.
from txtai.pipeline import Transcription

# Create transcription model
transcribe = Transcription("facebook/wav2vec2-large-960h")
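To give a feel for the "model output to text" step, here is a minimal sketch of greedy CTC decoding, the general technique wav2vec2 CTC models rely on. This is not txtai's or Hugging Face's actual code, which handles this step internally. The function name `ctc_greedy_decode` and the six-token vocabulary are assumptions made up for this illustration. The model predicts one token per audio frame; decoding collapses repeated tokens, drops the blank token and turns the word delimiter into spaces.

```python
def ctc_greedy_decode(frame_ids, vocab, blank=0, delimiter="|"):
    """Collapse per-frame token ids into text using greedy CTC rules."""
    chars = []
    previous = None
    for token in frame_ids:
        # Keep a token only if it differs from the previous frame and is not the blank
        if token != previous and token != blank:
            chars.append(vocab[token])
        previous = token
    # The delimiter token marks word boundaries
    return "".join(chars).replace(delimiter, " ").strip()


# Toy vocabulary (hypothetical): id 0 is the CTC blank, id 1 the word delimiter
vocab = {0: "<pad>", 1: "|", 2: "A", 3: "B", 4: "E", 5: "R"}

# One predicted token id per audio frame
frames = [2, 2, 1, 0, 3, 3, 4, 0, 2, 5, 5, 1]
print(ctc_greedy_decode(frames, vocab))  # A BEAR
```

Because each character is picked from the acoustics alone, with no dictionary or language model involved, this style of decoding tends to produce text that sounds right even when the spelling is off.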
The example below transcribes a list of audio files to text. Let's play each audio clip and look at its transcription.
from IPython.display import Audio, display

files = ["Beijing_mobilises.wav", "Canadas_last_fully.wav", "Maine_man_wins_1_mil.wav", "Make_huge_profits.wav", "The_National_Park.wav", "US_tops_5_million.wav"]
files = ["txtai/%s" % x for x in files]

for x, text in enumerate(transcribe(files)):
    display(Audio(files[x]))
    print(text)
    print()
Baging mobilizes invasion craft along coast as tiwan tensions escalates

Canada's last fully intact ice shelf has suddenly collapsed forming a manhatten sized ice berg

Main man wins from lottery ticket

Make huge profits without working make up to one hundred thousand dollars a day

National park service warns against sacrificing slower friends in a bare attack

U s virus cases top a million
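One rough way to put a number on transcription quality is word error rate (WER): the word-level edit distance between a reference text and the transcription, divided by the number of reference words. The sketch below is a plain Python implementation and is not part of txtai. The reference sentence is an assumption, reconstructed from the file name and the transcription rather than taken from the test data.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current

    return previous[-1] / len(ref)


# Assumed reference text (not from the test data) and the transcription above
reference = "Beijing mobilises invasion craft along coast as Taiwan tensions escalate"
hypothesis = "Baging mobilizes invasion craft along coast as tiwan tensions escalates"
print(word_error_rate(reference, hypothesis))  # 0.4
```

WER is strict: spelling variants such as "mobilizes" for "mobilises" count as full errors, so the score understates how close the transcription sounds to the audio.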
Overall, the results are solid. Each result sounds phonetically like the audio, even where the spelling is off (for example, "Baging" for Beijing and "bare" for bear). There is an open task with the Hugging Face models to decode the model outputs with a language model, which should further improve accuracy.
Keep an eye out for those updated models!