DEV Community

Chandrani Mukherjee

Who Said What? Build a Smart Transcriber Agent with AWS & LangChain

Amazon Transcribe provides automatic speech recognition (ASR) with support for speaker diarization, the process of labeling individual speakers in audio recordings.


๐Ÿ› ๏ธ Prerequisites

  • โœ… AWS Account
  • โœ… AWS CLI or SDK installed and configured
  • โœ… An S3 bucket to store audio files
  • โœ… Audio file in supported format (e.g., .wav, .mp3, .flac)

📤 Step 1: Upload Audio to Amazon S3

aws s3 cp your_audio_file.wav s3://your-bucket-name/
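
The same upload can be done from Python. The sketch below is one possible helper, not part of the original walkthrough: `upload_audio` and `SUPPORTED_EXTENSIONS` are hypothetical names, and the extension list is an assumption that should be checked against the current Amazon Transcribe documentation. The S3 client is passed in so the helper can be exercised without AWS credentials.

```python
from pathlib import Path

# Extensions for formats Amazon Transcribe accepts in batch jobs
# (assumed list; verify against the current AWS documentation).
SUPPORTED_EXTENSIONS = {".amr", ".flac", ".m4a", ".mp3", ".mp4", ".ogg", ".webm", ".wav"}


def upload_audio(s3_client, local_path, bucket):
    """Upload local_path to the root of the bucket and return its S3 URI."""
    path = Path(local_path)
    # Reject files whose extension is not in the assumed supported set
    if path.suffix.lower() not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported audio format: {path.suffix!r}")
    # boto3's S3 client exposes upload_file(Filename, Bucket, Key)
    s3_client.upload_file(str(path), bucket, path.name)
    return f"s3://{bucket}/{path.name}"


# Usage (requires boto3 and configured credentials):
#   import boto3
#   uri = upload_audio(boto3.client("s3"), "your_audio_file.wav", "your-bucket-name")
```

The returned URI is the value to pass as MediaFileUri in the next step.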

🧠 Step 2: Start Transcription Job with Speaker Diarization Enabled

aws transcribe start-transcription-job \
  --transcription-job-name "diarization-job-001" \
  --language-code "en-US" \
  --media MediaFileUri=s3://your-bucket-name/your_audio_file.wav \
  --output-bucket-name your-output-bucket \
  --settings ShowSpeakerLabels=true,MaxSpeakerLabels=5

📌 ShowSpeakerLabels=true enables speaker diarization.

📌 MaxSpeakerLabels=5 sets an upper limit on the number of speakers to label.


โณ Step 3: Check Transcription Job Status

aws transcribe get-transcription-job \
  --transcription-job-name "diarization-job-001"

Once the job status becomes COMPLETED, the transcription JSON is available in your S3 output bucket.
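
Rather than re-running the status command by hand, the check can be put in a polling loop. This is a minimal sketch: `wait_for_job` is a hypothetical helper, the poll interval and limit are arbitrary defaults, and the client is passed in so the loop can be tried with a stand-in object. It relies on the `TranscriptionJob` / `TranscriptionJobStatus` fields that boto3's `get_transcription_job` returns.

```python
import time


def wait_for_job(client, job_name, poll_seconds=10, max_polls=180, sleep=time.sleep):
    """Poll a transcription job until it is COMPLETED or FAILED.

    Returns the final TranscriptionJob dict; raises TimeoutError if the
    job is still running after max_polls checks.
    """
    for _ in range(max_polls):
        response = client.get_transcription_job(TranscriptionJobName=job_name)
        job = response["TranscriptionJob"]
        # Possible statuses: QUEUED, IN_PROGRESS, FAILED, COMPLETED
        if job["TranscriptionJobStatus"] in ("COMPLETED", "FAILED"):
            return job
        sleep(poll_seconds)
    raise TimeoutError(f"Job {job_name!r} did not finish after {max_polls} polls")


# Usage (requires boto3 and configured credentials):
#   import boto3
#   job = wait_for_job(boto3.client("transcribe"), "diarization-job-001")
#   print(job["TranscriptionJobStatus"])
```

A FAILED job is returned rather than raised so the caller can inspect the job details before deciding what to do.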


📄 Step 4: View Diarized Transcription Output

Sample excerpt from the output JSON:

{
  "results": {
    "speaker_labels": {
      "segments": [
        {
          "speaker_label": "spk_0",
          "start_time": "0.0",
          "end_time": "2.5"
        }
      ]
    },
    "items": [
      {
        "start_time": "0.0",
        "end_time": "0.7",
        "alternatives": [
          {
            "confidence": "1.0",
            "content": "Hello"
          }
        ],
        "type": "pronunciation",
        "speaker_label": "spk_0"
      }
    ]
  }
}

๐Ÿ Optional: Python Script to Start Job

import boto3

transcribe = boto3.client('transcribe')

transcribe.start_transcription_job(
    TranscriptionJobName='diarization-job-001',
    LanguageCode='en-US',
    Media={'MediaFileUri': 's3://your-bucket-name/your_audio_file.wav'},
    OutputBucketName='your-output-bucket',
    Settings={
        'ShowSpeakerLabels': True,
        'MaxSpeakerLabels': 5
    }
)

๐Ÿ“ Optional: Convert Output to Readable Text

Example post-processed output:

Speaker 1: Hello, how are you?
Speaker 2: I'm doing well, thanks. And you?
Speaker 1: I'm great!

You can write a script to process the JSON and reformat it into readable dialogue using speaker labels and timestamps.
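
One way such a script might look is sketched below. It assumes the output has the shape shown in Step 4, with a `speaker_label` on each word item and labels of the form `spk_N`; `to_dialogue` and `display_name` are hypothetical helper names. Punctuation items are attached to the preceding word, and consecutive words from the same speaker are merged into one line. Timestamps are ignored here for brevity.

```python
def display_name(label):
    """Map a label such as 'spk_0' to 'Speaker 1' (assumes the spk_N form)."""
    if label is None:
        return "Unknown"
    return f"Speaker {int(label.split('_')[1]) + 1}"


def to_dialogue(transcript):
    """Turn a parsed Transcribe JSON document into 'Speaker N: text' lines."""
    turns = []      # each entry is [speaker_label, text]
    current = None  # label of the speaker whose turn is being built
    for item in transcript["results"]["items"]:
        content = item["alternatives"][0]["content"]
        if item.get("type") == "punctuation":
            # Attach punctuation to the previous word without a space
            if turns:
                turns[-1][1] += content
            continue
        # Fall back to the current speaker if an item carries no label
        speaker = item.get("speaker_label", current)
        if not turns or speaker != current:
            turns.append([speaker, content])
            current = speaker
        else:
            turns[-1][1] += " " + content
    return "\n".join(f"{display_name(s)}: {text}" for s, text in turns)


# Usage with a transcript file downloaded from the output bucket:
#   import json
#   with open("diarization-job-001.json") as f:
#       print(to_dialogue(json.load(f)))
```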


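🧩 Notes

  • This guide uses batch transcription. Amazon Transcribe also offers speaker partitioning for streaming audio, but that goes through a separate streaming API not covered here.
  • Accuracy depends on the quality of the audio and how clearly the speakers' voices can be told apart.
  • Diarization is supported for select languages (e.g., English).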



🤖 Bonus: Create a Transcriber Agent using LangChain and AWS

You can automate the transcription and diarization process using a LangChain agent!

🧩 Requirements

  • langchain
  • boto3
  • openai (for natural language post-processing or QA)

📦 Install Dependencies

pip install langchain boto3 openai

🤖 Sample LangChain Agent Setup

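# Note: these imports follow the legacy LangChain API; newer releases
# move the OpenAI wrapper into a separate package and may need adjusting.
from langchain.agents import initialize_agent, Tool
from langchain.llms import OpenAI
import boto3
import uuid

# Tool to trigger transcription job
def start_transcription_job(file_uri):
    """Start a diarization job for the audio at file_uri (an S3 URI)."""
    transcribe = boto3.client('transcribe')
    # Job names must be unique within an account, so add a random suffix;
    # a fixed name would fail the second time the tool is called.
    job_name = f"LangChainDiarizationJob-{uuid.uuid4().hex[:8]}"
    transcribe.start_transcription_job(
        TranscriptionJobName=job_name,
        LanguageCode="en-US",
        Media={'MediaFileUri': file_uri},
        OutputBucketName='your-output-bucket',
        Settings={
            'ShowSpeakerLabels': True,
            'MaxSpeakerLabels': 5
        }
    )
    return f"Started transcription job: {job_name}"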

# Register tool with LangChain
tools = [
    Tool(
        name="AWSTranscribeDiarizer",
        func=start_transcription_job,
        description="Start a diarization transcription job using AWS Transcribe given an S3 audio URL"
    )
]

# Initialize agent with OpenAI and tools
llm = OpenAI(temperature=0)
agent = initialize_agent(tools, llm, agent="zero-shot-react-description", verbose=True)

# Run agent with a prompt
agent.run("Transcribe the file at s3://your-bucket-name/your_audio.wav with speaker labels")

🧠 What This Agent Does

  • Accepts a prompt to trigger AWS Transcribe
  • Starts diarization on a given audio URL
  • Can be extended to fetch and format output, or even generate summaries!
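
As a rough sketch of the "fetch output" extension, the helper below reads the finished transcript from the output bucket and lists the speakers it contains. `fetch_transcript` and `speakers_in` are hypothetical names. The object key `<job-name>.json` is an assumption that holds when the job was started with an output bucket and no custom output key, as in the examples above. A function like this could be registered as a second Tool alongside the one that starts the job.

```python
import json


def fetch_transcript(s3_client, bucket, job_name):
    """Download and parse the transcript JSON for a completed job.

    Assumes the transcript was written to '<job_name>.json' at the
    root of the output bucket (no custom output key was set).
    """
    obj = s3_client.get_object(Bucket=bucket, Key=f"{job_name}.json")
    return json.loads(obj["Body"].read())


def speakers_in(transcript):
    """Return the sorted speaker labels found in the diarization segments."""
    segments = transcript["results"]["speaker_labels"]["segments"]
    return sorted({segment["speaker_label"] for segment in segments})


# Usage (requires boto3 and configured credentials):
#   import boto3
#   transcript = fetch_transcript(boto3.client("s3"), "your-output-bucket", "diarization-job-001")
#   print(speakers_in(transcript))
```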
