Majdi Dhissi

Serverless Speech-to-Text with AssemblyAI

This is a submission for the AssemblyAI Challenge: Sophisticated Speech-to-Text.

What I Built

This project is a serverless solution for performing speech-to-text using AWS Lambda, SQS, and the AssemblyAI API, integrated with a front-end Blazor Web project. When a user uploads an audio file through the web application, the file is stored in an S3 bucket, which triggers a Lambda function to process the file. The speech-to-text conversion is handled by the AssemblyAI API, and the results are communicated back to the front-end via SQS and a background polling service.

The solution demonstrates a scalable and efficient way to leverage cloud technologies for advanced AI-powered transcription.

Demo

Front-End

DynamoDB Table

Lambda logs

The target S3 bucket used to store transcript data in JSON

Custom JSON schema

Journey

Incorporating AssemblyAI’s Universal-2 Model

The Universal-2 model provided by AssemblyAI played a central role in this project. Its robust transcription capabilities ensured that audio files were processed with high accuracy.

Architecture

Cloud Architecture

The solution was designed as follows:

  1. Blazor Web Front-End:
    The web project serves as the user interface for uploading audio files. Users can drag and drop files or use a browse button to select them. The selected file is uploaded to an S3 bucket, and status updates and results are displayed dynamically.

  2. AWS Lambda Function:
    Built using .NET 8, the Lambda function is triggered by an S3 event whenever a new file is uploaded. This function downloads the file, processes it using the AssemblyAI API, and sends a success message containing transcription results to an SQS queue.

  3. SQS Integration:
    SQS acts as a communication bridge, decoupling the Lambda function from the Blazor application. This ensures that the system remains robust and scalable, handling spikes in audio processing without impacting the UI.

  4. Blazor Background Service:
    A background polling service in the Blazor project checks the SQS queue for new messages. When results are fetched, they are displayed in the web application in real time.

Technical Highlights

Background Service Overview

The Blazor front-end includes a background service that interacts with AWS SQS to fetch and process transcription results. This service ensures that messages from the SQS queue are retrieved and used to update the UI dynamically.

Here is a summary of its key components:

1. SqsService Class

The SqsService class encapsulates logic for communicating with AWS SQS.

Core Functionality:
  • Uses the AWS SDK for .NET to fetch messages from the SQS queue with long polling to reduce unnecessary API calls.
  • After processing a message, it deletes the message from the queue to ensure it is not reprocessed.
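
The receive-then-delete pattern described above can be sketched as follows. This is a minimal illustration in Python rather than the project's .NET code, and `InMemoryQueue`, `Message`, and `fetch_message` are hypothetical names: the queue class is an in-memory stand-in that mimics the relevant SQS behaviour, not the AWS SDK.

```python
import queue
import uuid
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Message:
    receipt_handle: str
    body: str


class InMemoryQueue:
    """Hypothetical in-memory stand-in for an SQS queue (not the AWS SDK)."""

    def __init__(self) -> None:
        self._pending = queue.Queue()
        self._in_flight = {}

    def send(self, body: str) -> None:
        self._pending.put(Message(receipt_handle=str(uuid.uuid4()), body=body))

    def receive(self, wait_seconds: float) -> Optional[Message]:
        # Long polling: block for up to wait_seconds instead of returning
        # immediately when the queue is empty, which avoids tight retry loops.
        try:
            message = self._pending.get(timeout=wait_seconds)
        except queue.Empty:
            return None
        self._in_flight[message.receipt_handle] = message
        return message

    def delete(self, receipt_handle: str) -> None:
        # Deleting acknowledges the message so it is not processed again.
        self._in_flight.pop(receipt_handle, None)

    def in_flight_count(self) -> int:
        return len(self._in_flight)


def fetch_message(q: InMemoryQueue, handle: Callable[[str], None],
                  wait_seconds: float = 0.1) -> bool:
    """Receive one message, process it, then delete it. True if one was handled."""
    message = q.receive(wait_seconds)
    if message is None:
        return False
    handle(message.body)              # process first
    q.delete(message.receipt_handle)  # delete only after processing succeeded
    return True
```

Deleting only after the handler returns means that a failure during processing leaves the message undeleted, so it can be retried rather than lost.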

2. SqsBackgroundService Class

The SqsBackgroundService is a hosted service that continuously polls the SQS queue for messages.

Core Functionality:
  • Calls FetchMessageAsync from SqsService to retrieve messages.
  • Upon receiving a message, invokes a delegate (Func) that triggers a refresh in the Blazor UI to display transcription results.
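
As a rough sketch of this polling loop, the Python below runs a fetch function on a background thread and passes each message to a callback. It is an analogue of the idea, not the author's hosted service: `PollingService`, `fetch`, `on_message`, and `idle_delay` are all hypothetical names chosen for illustration.

```python
import threading
from typing import Callable, Optional


class PollingService:
    """Hypothetical analogue of a hosted background polling service."""

    def __init__(self, fetch: Callable[[], Optional[str]],
                 on_message: Callable[[str], None],
                 idle_delay: float = 0.01) -> None:
        self._fetch = fetch
        self._on_message = on_message
        self._idle_delay = idle_delay
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        # Signal the loop to exit, then wait for the current pass to finish.
        self._stop.set()
        self._thread.join()

    def _run(self) -> None:
        while not self._stop.is_set():
            body = self._fetch()
            if body is None:
                # Nothing available: pause briefly, waking early if stopped.
                self._stop.wait(self._idle_delay)
                continue
            # The callback plays the role of the UI refresh delegate.
            self._on_message(body)
```

Passing the callback in from outside keeps the polling loop unaware of the UI, which is the same decoupling the delegate provides in the design described above.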

Blazor Front-End Overview

The Blazor front-end component serves as the user interface for uploading files, displaying data from DynamoDB, and reflecting updates in real time via integration with the background SQS service.

Here is a breakdown of the main functionalities:

1. File Upload Feature

  • A file input component lets the user select a file for upload. Once a file is selected, details such as the file name and extension are displayed to the user.
  • Files are uploaded to an S3 bucket using the TransferUtility from the AWS SDK for .NET.
  • A unique file key is generated using Guid to ensure no name conflicts.
  • The UI displays a modal during the upload process, indicating that the operation is in progress.
  • After the upload is complete, the SuccessMessage is updated to notify the user of the outcome.
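
The unique-key step can be illustrated with a short Python sketch. The post only says that a Guid is used, so the exact key layout here (a UUID followed by the original extension) is an assumption, and `make_object_key` is a hypothetical helper name.

```python
import uuid
from pathlib import PurePosixPath


def make_object_key(original_name: str) -> str:
    """Build a collision-resistant object key that keeps the file extension.

    Assumption: the key is a random UUID plus the lower-cased extension.
    """
    extension = PurePosixPath(original_name).suffix.lower()  # e.g. ".mp3"
    return f"{uuid.uuid4()}{extension}"
```

Two users uploading files with the same name therefore get different keys, so neither upload overwrites the other.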

2. DynamoDB Integration

  • The ListDynamoDBItems method retrieves all items from the DynamoDB table and loads them into the DynamoDBItems list.
  • The table is displayed on the page, showing each item's Id, Transcribed text, and Timestamp.
  • A refresh button allows the table to be updated dynamically.
  • Text fields with more than 2,000 characters are truncated for readability.
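
The truncation rule is simple enough to show directly. In this Python sketch the 2,000-character limit comes from the post, while the trailing marker and the function name `truncate_for_display` are assumptions made for illustration.

```python
def truncate_for_display(text: str, limit: int = 2000, marker: str = "...") -> str:
    """Shorten long transcripts for table display; short text is unchanged."""
    if len(text) <= limit:
        return text
    # Keep the first `limit` characters and flag that text was cut off.
    return text[:limit] + marker
```

Only the displayed value is shortened; the full transcript stays in the table.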

3. SQS Background Service Integration

  • The component starts the SqsBackgroundService during initialization, allowing real-time updates when new transcription results are available in SQS.
  • When the service receives a message, it triggers the RefreshDynamoDbTable method, which reloads the data and refreshes the UI.
  • The service runs in the background and is gracefully stopped when the component is disposed.

AWS Lambda Function Overview

The Lambda function integrates S3, DynamoDB, and AssemblyAI to handle audio transcription, storage, and processing. Here is a breakdown of its functionality:

1. S3 Event Trigger

The function is triggered by S3 events, such as an object being created in a bucket.

  • Metadata is retrieved for validation purposes.
  • A pre-signed URL is generated to provide external access to the file for the AssemblyAI API.
  • The audio file is then sent to the AssemblyAI API for transcription through an HTTP client.
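
To make the trigger concrete, here is a hedged Python sketch of reading the bucket and key out of an S3 event notification. The field layout follows the S3 event message structure, in which object keys arrive URL-encoded; the function name and the sample bucket and key in the usage note are hypothetical.

```python
from urllib.parse import unquote_plus


def extract_uploaded_objects(event: dict) -> list:
    """Return (bucket, key) pairs for every object-created record in the event."""
    results = []
    for record in event.get("Records", []):
        # Skip anything that is not an upload, such as delete notifications.
        if not record.get("eventName", "").startswith("ObjectCreated:"):
            continue
        bucket = record["s3"]["bucket"]["name"]
        # Keys are URL-encoded in the notification, with spaces sent as "+".
        key = unquote_plus(record["s3"]["object"]["key"])
        results.append((bucket, key))
    return results
```

For example, a record whose key is `my+file%281%29.mp3` in a bucket named `audio-in` yields `("audio-in", "my file(1).mp3")`. Decoding the key matters because the raw value would not match the stored object name.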

2. AssemblyAI Integration

The function initializes an AssemblyAIClient using an API key from environment variables.

  • The StabilityAIProcessor handles file transcription via the pre-signed S3 URL.
  • The transcription result includes the text and metadata, which are logged and processed further.
  • Logging: transcription text is unescaped and logged for debugging or auditing.
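
For readers who want to see what such a request might look like without the SDK, the Python sketch below builds (but does not send) a transcription request against AssemblyAI's REST endpoint. The project itself uses AssemblyAIClient, so this is an illustration rather than the author's code: the environment variable name `ASSEMBLYAI_API_KEY` and the set of enabled options are assumptions, and the endpoint and option names should be checked against the current AssemblyAI API reference.

```python
import json
import os
import urllib.request

ASSEMBLYAI_TRANSCRIPT_URL = "https://api.assemblyai.com/v2/transcript"


def build_transcript_request(audio_url: str) -> urllib.request.Request:
    """Prepare a POST request asking AssemblyAI to transcribe audio_url."""
    # Hypothetical variable name; the key is never hard-coded.
    api_key = os.environ["ASSEMBLYAI_API_KEY"]
    payload = {
        "audio_url": audio_url,       # e.g. the pre-signed S3 URL
        "speaker_labels": True,       # speaker identification
        "sentiment_analysis": True,   # per-sentence sentiment
        "auto_highlights": True,      # key phrases
        "auto_chapters": True,        # chapter summaries
        "language_detection": True,   # detected language
    }
    return urllib.request.Request(
        ASSEMBLYAI_TRANSCRIPT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": api_key, "Content-Type": "application/json"},
        method="POST",
    )
```

Transcription is asynchronous, so a real client would submit this request and then poll for the finished transcript, which the SDK handles on the caller's behalf.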

3. DynamoDB Integration

Each transcription result is converted into a DynamoDB document and stored in the AssemblyAI table.
The table stores:

  • Id: The unique transcription ID.
  • Text: The transcribed text.
  • Timestamp: The upload time in UTC.

Any issues during the database operation are logged and re-thrown.
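
The shape of a stored item can be sketched in a few lines of Python. The three attribute names come from the list above; the ISO 8601 timestamp format and the helper name `to_table_item` are assumptions, since the post does not say how the timestamp is serialized.

```python
from datetime import datetime, timezone
from typing import Optional


def to_table_item(transcript_id: str, text: str,
                  uploaded_at: Optional[datetime] = None) -> dict:
    """Build the item stored for one transcription result."""
    uploaded_at = uploaded_at or datetime.now(timezone.utc)
    return {
        "Id": transcript_id,
        "Text": text,
        # Normalize to UTC so timestamps compare consistently (assumed format).
        "Timestamp": uploaded_at.astimezone(timezone.utc).isoformat(),
    }
```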

4. Enhanced S3 Functionality

  • The transcription results are uploaded to a designated bucket (e.g., assemblyai-challenge-transcripts) with a .json extension.
  • The uploaded transcription files use a consistent naming convention: -transcription.json.
  • A short-lived pre-signed URL (120 seconds) is generated for the uploaded audio file. Because the S3 bucket is not public, this URL gives AssemblyAI temporary access to download the file for processing.

In addition to the transcribed text, we also retrieve several data items such as:

  • The Confidence Score
  • Total number of words
  • Audio duration
  • Number of speakers
  • List of highlights
  • Sentiment Analysis (Negative, Neutral, Positive) for each sentence
  • Detected language
  • Number of chapters
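
One way to picture the custom JSON document is as a summary derived from the transcript response. The Python sketch below is a guess at that mapping, not the author's schema: the output key names are hypothetical, and the input field names (`words`, `utterances`, `auto_highlights_result`, `sentiment_analysis_results`, `language_code`, `chapters`) reflect my understanding of AssemblyAI's transcript JSON and should be verified against its API reference.

```python
def summarize_transcript(transcript: dict) -> dict:
    """Reduce a transcript response to the data items listed above."""
    words = transcript.get("words") or []
    utterances = transcript.get("utterances") or []
    highlights = (transcript.get("auto_highlights_result") or {}).get("results") or []
    sentiments = transcript.get("sentiment_analysis_results") or []
    return {
        "id": transcript.get("id"),
        "text": transcript.get("text"),
        "confidence": transcript.get("confidence"),
        "wordCount": len(words),
        "audioDuration": transcript.get("audio_duration"),
        # Count distinct speaker labels across all utterances.
        "speakerCount": len({u["speaker"] for u in utterances}),
        "highlights": [h["text"] for h in highlights],
        # One entry per sentence, with its sentiment label.
        "sentiments": [{"text": s["text"], "sentiment": s["sentiment"]}
                       for s in sentiments],
        "language": transcript.get("language_code"),
        "chapterCount": len(transcript.get("chapters") or []),
    }
```

Features that were not enabled for a transcript come back empty or missing, so every field falls back to an empty list or `None` instead of raising an error.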

Deployment

The solution is deployed using the following:

Terraform IaC

Terraform provisions the infrastructure needed to host the solution, including:

Source and Target S3 buckets

Serve as secure storage locations where data is ingested and processed. The source bucket contains raw data, while the target bucket stores the transformed or processed output.

SQS Queue

A highly reliable and scalable messaging service that decouples components by queuing messages between producers and consumers, ensuring asynchronous communication.

DynamoDB Table

A fully managed NoSQL database optimized for high availability and low-latency access to application data, structured by a key-value or document data model.

IAM Policies and Roles

Define and enforce granular access permissions, ensuring resources and services are accessed securely. IAM roles enable temporary, controlled access to AWS services by trusted entities.

Lambda Function

A serverless compute service that executes custom logic in response to events or triggers. In this solution, it transcribes audio files uploaded to the source bucket.

ECR repository for the respective Docker images

A secure, scalable container registry to store and manage Docker images required for deployment, facilitating seamless integration with ECS and other AWS services.

ECS and Task Definitions that will run the images

ECS orchestrates containerized applications, while task definitions specify the configuration for running containers, including image, CPU, memory, and networking requirements.

ALB for exposing the front-end running on ECS Fargate

An Application Load Balancer distributes incoming HTTP/HTTPS traffic across front-end tasks running on ECS Fargate, ensuring high availability, scalability, and secure access.

This submission was crafted for the challenge, and the full source code is available on GitHub. Try it out, and feel free to share feedback!

Conclusion

AssemblyAI's advanced features make it a powerful tool for creating sophisticated speech-to-text solutions. With capabilities such as sentiment analysis, speaker identification, language detection, and detailed transcription metadata, it goes beyond basic transcription to deliver valuable insights from audio content. These features enable developers to build robust, intelligent, and scalable applications tailored to diverse use cases.

This article focused on the solution's overall architecture and integration. Detailed infrastructure and code are outside its scope, which is to demonstrate a serverless speech-to-text solution built on the AssemblyAI API.

Feel free to explore the provided repositories and try out the solution. Feedback is always welcome!

If you find this article insightful, please

  • Share it on your feed or social media
  • Follow me to receive updates
  • Keep in touch on LinkedIn
