Amazon Transcribe provides automatic speech recognition (ASR) with support for speaker diarization: the process of labeling individual speakers in audio recordings.
## 🛠️ Prerequisites

- ✅ AWS account
- ✅ AWS CLI or SDK installed and configured
- ✅ An S3 bucket to store audio files
- ✅ Audio file in a supported format (e.g., `.wav`, `.mp3`, `.flac`)
## 🎤 Step 1: Upload Audio to Amazon S3

```bash
aws s3 cp your_audio_file.wav s3://your-bucket-name/
```
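If you prefer the SDK over the CLI, the upload can be done with boto3. Below is a minimal sketch; the `upload_audio` helper is hypothetical (not part of the AWS SDK), and the bucket and file names are placeholders:

```python
import os

def upload_audio(s3_client, path, bucket, key=None):
    """Upload a local audio file to S3 via the given client; return its s3:// URI."""
    key = key or os.path.basename(path)  # default the object key to the filename
    s3_client.upload_file(path, bucket, key)
    return f"s3://{bucket}/{key}"

# Usage (assumes configured AWS credentials and an existing bucket):
# import boto3
# uri = upload_audio(boto3.client('s3'), 'your_audio_file.wav', 'your-bucket-name')
```

Returning the `s3://` URI is convenient because the next step's `MediaFileUri` parameter expects exactly that format.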
## 🧠 Step 2: Start Transcription Job with Speaker Diarization Enabled

```bash
aws transcribe start-transcription-job \
  --transcription-job-name "diarization-job-001" \
  --language-code "en-US" \
  --media MediaFileUri=s3://your-bucket-name/your_audio_file.wav \
  --output-bucket-name your-output-bucket \
  --settings ShowSpeakerLabels=true,MaxSpeakerLabels=5
```

- 🔑 `ShowSpeakerLabels=true` enables speaker diarization
- 🔑 `MaxSpeakerLabels=5` sets an upper limit on the number of speakers
## ⏳ Step 3: Check Transcription Job Status

```bash
aws transcribe get-transcription-job \
  --transcription-job-name "diarization-job-001"
```

Once the job status becomes `COMPLETED`, the transcription JSON is available in your S3 output bucket.
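The same check can be scripted as a polling loop. The `wait_for_job` helper below is hypothetical (it is not part of the AWS SDK), but it calls the real `get_transcription_job` API through whatever client you pass in:

```python
import time

def wait_for_job(transcribe_client, job_name, poll_seconds=10, timeout_seconds=600):
    """Poll get_transcription_job until the job reaches a terminal status."""
    elapsed = 0
    while elapsed <= timeout_seconds:
        resp = transcribe_client.get_transcription_job(TranscriptionJobName=job_name)
        status = resp['TranscriptionJob']['TranscriptionJobStatus']
        if status in ('COMPLETED', 'FAILED'):
            return status
        time.sleep(poll_seconds)
        elapsed += poll_seconds
    raise TimeoutError(f"Job {job_name} did not finish within {timeout_seconds}s")

# Usage (assumes configured AWS credentials):
# import boto3
# status = wait_for_job(boto3.client('transcribe'), 'diarization-job-001')
```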
## 📄 Step 4: View Diarized Transcription Output

Sample excerpt from the output JSON:

```json
{
  "results": {
    "speaker_labels": {
      "segments": [
        {
          "speaker_label": "spk_0",
          "start_time": "0.0",
          "end_time": "2.5"
        }
      ]
    },
    "items": [
      {
        "start_time": "0.0",
        "end_time": "0.7",
        "alternatives": [
          {
            "confidence": "1.0",
            "content": "Hello"
          }
        ],
        "type": "pronunciation",
        "speaker_label": "spk_0"
      }
    ]
  }
}
```
## 🐍 Optional: Python Script to Start Job

```python
import boto3

transcribe = boto3.client('transcribe')

transcribe.start_transcription_job(
    TranscriptionJobName='diarization-job-001',
    LanguageCode='en-US',
    Media={'MediaFileUri': 's3://your-bucket-name/your_audio_file.wav'},
    OutputBucketName='your-output-bucket',
    Settings={
        'ShowSpeakerLabels': True,
        'MaxSpeakerLabels': 5
    }
)
```
## 📝 Optional: Convert Output to Readable Text

Example post-processed output:

```text
Speaker 1: Hello, how are you?
Speaker 2: I'm doing well, thanks. And you?
Speaker 1: I'm great!
```

You can write a script to process the JSON and reformat it into readable dialogue using speaker labels and timestamps.
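As a sketch, such a script could walk `results.items`, group consecutive words by their `speaker_label`, and reattach punctuation to the preceding word. The `format_dialogue` helper below is hypothetical and assumes the JSON shape shown in Step 4:

```python
def format_dialogue(results):
    """Turn a Transcribe `results` dict into 'speaker: text' dialogue lines."""
    lines, current_speaker, words = [], None, []
    for item in results['items']:
        if item['type'] == 'punctuation':
            if words:
                words[-1] += item['alternatives'][0]['content']  # glue ',' '.' etc. to previous word
            continue
        speaker = item.get('speaker_label')
        if speaker != current_speaker and words:
            lines.append(f"{current_speaker}: {' '.join(words)}")  # flush finished turn
            words = []
        current_speaker = speaker
        words.append(item['alternatives'][0]['content'])
    if words:
        lines.append(f"{current_speaker}: {' '.join(words)}")  # flush the last turn
    return '\n'.join(lines)
```

Mapping `spk_0`, `spk_1`, … to friendlier names like "Speaker 1" is then a simple string substitution.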
## 🧩 Notes

- Speaker diarization is only supported in batch mode, not real-time.
- Accuracy depends on the quality of the audio and the clarity of speaker voices.
- Diarization is supported for select languages (e.g., English).
## 🤖 Bonus: Create a Transcriber Agent using LangChain and AWS

You can automate the transcription and diarization process using a LangChain agent!

### 🧩 Requirements

- `langchain`
- `boto3`
- `openai` (for natural language post-processing or QA)

### 📦 Install Dependencies

```bash
pip install langchain boto3 openai
```
### 🤖 Sample LangChain Agent Setup

```python
from langchain.agents import initialize_agent, Tool
from langchain.llms import OpenAI
import boto3

# Tool to trigger a transcription job
def start_transcription_job(file_uri):
    transcribe = boto3.client('transcribe')
    transcribe.start_transcription_job(
        TranscriptionJobName="LangChainDiarizationJob",
        LanguageCode="en-US",
        Media={'MediaFileUri': file_uri},
        OutputBucketName='your-output-bucket',
        Settings={
            'ShowSpeakerLabels': True,
            'MaxSpeakerLabels': 5
        }
    )
    return "Started transcription job: LangChainDiarizationJob"

# Register the tool with LangChain
tools = [
    Tool(
        name="AWSTranscribeDiarizer",
        func=start_transcription_job,
        description="Start a diarization transcription job using AWS Transcribe given an S3 audio URL"
    )
]

# Initialize the agent with OpenAI and the tools
llm = OpenAI(temperature=0)
agent = initialize_agent(tools, llm, agent="zero-shot-react-description", verbose=True)

# Run the agent with a prompt
agent.run("Transcribe the file at s3://your-bucket-name/your_audio.wav with speaker labels")
```
## 🧠 What This Agent Does

- Accepts a prompt to trigger AWS Transcribe
- Starts diarization on a given audio URL
- Can be extended to fetch and format output, or even generate summaries!
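For example, a "fetch output" tool could download the finished job's JSON from the output bucket and return the plain transcript text, which the batch job writes under `results.transcripts`. This is a sketch; `fetch_transcript` is a hypothetical helper, and the object key shown in the usage line is an assumption based on the job name used above:

```python
import json

def fetch_transcript(s3_client, bucket, key):
    """Download a Transcribe output JSON from S3 and return the full transcript text."""
    obj = s3_client.get_object(Bucket=bucket, Key=key)
    doc = json.loads(obj['Body'].read())
    return doc['results']['transcripts'][0]['transcript']

# Usage (assumes configured AWS credentials):
# import boto3
# text = fetch_transcript(boto3.client('s3'), 'your-output-bucket', 'LangChainDiarizationJob.json')
```

Wrapped in a second `Tool`, this would let the agent both start a job and read its result in one conversation.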