Sophisticated Speech-to-Text Submission for the AssemblyAI Challenge

#devchallenge #assemblyaichallenge #ai #api

This is a submission for the AssemblyAI Challenge: Sophisticated Speech-to-Text.

What I Built

A speech-to-text transcription web application that uses Flask for the backend and AssemblyAI's real-time API for live audio transcription. The frontend, built with HTML, CSS, and jQuery, gives users buttons to start and stop transcription and shows the transcribed text as it arrives, refreshed once per second.

Demo

Here is the link to my app

Journey

Key Features

Real-Time Transcription:

Utilizes AssemblyAI's real-time API to process live audio input from the user's microphone and convert it to text.
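Handles both partial and final transcripts: final transcripts are appended to the displayed text, while partial transcripts are only logged on the server.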
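To illustrate how partial and final transcripts differ, here is a minimal sketch that keeps finalized lines and shows the latest partial as a provisional last line. The `TranscriptView` class and its method names are hypothetical and not part of the app or the AssemblyAI SDK; in the app, the `is_final` flag would come from the `isinstance(transcript, aai.RealtimeFinalTranscript)` check in `on_data`.

```python
from dataclasses import dataclass, field


@dataclass
class TranscriptView:
    """Hypothetical helper: finalized lines plus the latest in-progress partial."""
    final_lines: list = field(default_factory=list)
    partial: str = ""

    def update(self, text: str, is_final: bool) -> None:
        if not text:
            return  # ignore empty messages, as on_data does
        if is_final:
            self.final_lines.append(text)
            self.partial = ""  # the final transcript supersedes the partial
        else:
            self.partial = text  # each partial replaces the previous one

    def render(self) -> str:
        lines = list(self.final_lines)
        if self.partial:
            lines.append(self.partial + " ...")  # mark provisional text
        return "\n".join(lines)
```

With something like this, the page could show words while they are still being spoken instead of waiting for the end of each utterance.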

Web Interface:

Clean and intuitive design with buttons to start and stop transcription.
Displays the transcribed text dynamically in a formatted `<pre>` block.

Flask Backend:

Handles routes for starting (/start), stopping (/stop), and retrieving the transcript (/transcript).
Runs transcription in a separate thread to ensure non-blocking operations.
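The non-blocking pattern can be sketched on its own with the standard library. This is an illustrative sketch, not the app's code: `BackgroundWorker` and `endless_chunks` are hypothetical names, and appending to `processed` stands in for streaming an audio chunk to the transcriber. It uses a `threading.Event` as the stop signal and refuses to start a second worker while one is running.

```python
import threading
import time


def endless_chunks():
    """Hypothetical stand-in for a microphone stream."""
    n = 0
    while True:
        time.sleep(0.01)
        yield n
        n += 1


class BackgroundWorker:
    def __init__(self, chunks):
        self._chunks = chunks
        self._stop_event = threading.Event()
        self._thread = None
        self.processed = []

    def start(self) -> bool:
        """Start the worker; return False if it is already running."""
        if self._thread is not None and self._thread.is_alive():
            return False
        self._stop_event.clear()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()
        return True

    def stop(self) -> None:
        """Signal the loop to exit and wait briefly for the thread to finish."""
        self._stop_event.set()
        if self._thread is not None:
            self._thread.join(timeout=2)

    def _run(self) -> None:
        for chunk in self._chunks:
            if self._stop_event.is_set():
                break
            self.processed.append(chunk)  # stand-in for streaming the chunk
```

A route handler would call `start()` and return immediately, while `stop()` ends the loop cleanly from another request.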

Polling Mechanism:

Implements a JavaScript-based polling system using jQuery to fetch the latest transcribed text every second.
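The app's polling runs in the browser with jQuery. The same loop can be sketched in Python, for example for a small script client. Everything here is hypothetical: `poll_transcript` takes a `fetch` callable (which in practice would request `/transcript`) and only reports text that changed since the previous poll.

```python
import time
from typing import Callable, Optional


def poll_transcript(
    fetch: Callable[[], str],
    on_change: Callable[[str], None],
    interval: float = 1.0,
    max_polls: Optional[int] = None,
) -> int:
    """Call fetch() every `interval` seconds; report only changed text."""
    last = None
    polls = 0
    while max_polls is None or polls < max_polls:
        text = fetch()
        if text != last:  # skip redundant updates
            on_change(text)
            last = text
        polls += 1
        time.sleep(interval)
    return polls
```

Comparing against the last value avoids redrawing the transcript when nothing new has arrived, which the jQuery version could do as well.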

Customizable Word Boost:

Boosts recognition accuracy for specific words like "AWS," "Azure," and "Google Cloud."
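One way to make the boosted words configurable without touching the code is to read them from a comma-separated setting. This is only a sketch: `parse_word_boost` is a hypothetical helper, and the lowercasing simply mirrors the list used in this post's code rather than a documented requirement.

```python
def parse_word_boost(raw: str) -> list:
    """Turn 'AWS, Azure, Google Cloud' into a clean, de-duplicated list."""
    seen = set()
    words = []
    for item in raw.split(","):
        word = item.strip().lower()
        if word and word not in seen:  # drop blanks and duplicates
            seen.add(word)
            words.append(word)
    return words
```

The result could then be passed as the `word_boost` argument when the transcriber is created.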

Responsive Design:

Keeps the interface usable across devices with a centered, easy-to-use layout.

Technology Stack

Backend:

Python (Flask): Manages the web server and API interactions.
AssemblyAI API: Handles speech-to-text transcription.

import assemblyai as aai
from flask import Flask, render_template, jsonify
import os
from dotenv import load_dotenv
import threading

app = Flask(__name__)
load_dotenv()

aai.settings.api_key = os.getenv('API_KEY')

transcriber = None
transcribed_text = ""

def on_open():
    print("Transcription started!")

def on_data(transcript: aai.RealtimeTranscript):
    global transcribed_text
    if not transcript.text:
        return

    if isinstance(transcript, aai.RealtimeFinalTranscript):
        transcribed_text += transcript.text + "\n"
        print("Transcribed:", transcript.text)  # Verify text here
    else:
        print("Received partial:", transcript.text)


def on_error(error):
    print("Error:", error)

def on_close():
    print("Transcription stopped!")

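def start_transcription():
    global transcriber
    transcriber = aai.RealtimeTranscriber(
        # MicrophoneStream yields 16-bit linear PCM, so use pcm_s16le (not mu-law)
        encoding=aai.AudioEncoding.pcm_s16le,
        sample_rate=16_000,
        word_boost=["aws", "azure", "google cloud"],
        end_utterance_silence_threshold=500,
        on_open=on_open,
        on_data=on_data,
        on_error=on_error,
        on_close=on_close,
    )
    # Open the real-time session before sending any audio
    transcriber.connect()

    microphone_stream = aai.extras.MicrophoneStream(sample_rate=16_000)
    try:
        for audio_data in microphone_stream:
            # /stop sets the global to None; copy it to avoid a race with that route
            active = transcriber
            if active is None:
                break
            active.stream(audio_data)
    finally:
        # Release the microphone once streaming ends
        microphone_stream.close()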

@app.route('/')
def index():
    return render_template('index.html')

@app.route('/start')
def start():
    global transcribed_text
    transcribed_text = ""  # Clear previous transcript
    threading.Thread(target=start_transcription).start()
    return jsonify({"message": "Transcription started!"})


@app.route('/stop')
def stop():
    global transcriber
    if transcriber is not None:
        transcriber.close()
        transcriber = None
        print("Transcriber closed")
    return jsonify({"message": "Transcription stopped!"})

@app.route('/transcript')
def transcript():
    global transcribed_text
    return jsonify({"transcript": transcribed_text})


if __name__ == "__main__":
    app.run(debug=True)
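
In the code above, `transcribed_text` is written by the transcription callback thread and read by the Flask request handlers. A lock-guarded container is one way to make that sharing explicit. This is a sketch under assumptions: `TranscriptStore` is a hypothetical class, not something the app or the SDK provides.

```python
import threading


class TranscriptStore:
    """Hypothetical thread-safe replacement for the global transcript string."""

    def __init__(self):
        self._lock = threading.Lock()
        self._lines = []

    def append(self, text: str) -> None:
        with self._lock:
            self._lines.append(text)

    def clear(self) -> None:
        with self._lock:
            self._lines.clear()

    def text(self) -> str:
        # Same shape as the original: one final transcript per line
        with self._lock:
            return "".join(line + "\n" for line in self._lines)
```

The `on_data` callback would call `append()`, `/start` would call `clear()`, and `/transcript` would return `text()`.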

Frontend:

HTML & CSS: Provides structure and styling for the user interface.
jQuery: Handles AJAX requests for starting, stopping, and polling the transcription.

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Speech to Text App</title>
    <script src="https://code.jquery.com/jquery-3.5.1.min.js"></script>
    <style>
        body {
            margin: 0;
            display: flex;
            justify-content: center;
            align-items: center;
            height: 100vh; /* Full viewport height */
            font-family: Arial, sans-serif;
            background-color: #f4f4f4; /* Light background for better readability */
        }

        #container {
            text-align: center;
            background: #ffffff;
            padding: 20px;
            box-shadow: 0 4px 6px rgba(0, 0, 0, 0.1);
            border-radius: 8px;
        }

        button {
            margin: 10px;
            padding: 10px 20px;
            font-size: 16px;
            border: none;
            border-radius: 5px;
            background-color: #007bff;
            color: white;
            cursor: pointer;
        }

        button:hover {
            background-color: #0056b3;
        }

        pre {
            padding: 10px;
            background-color: #e9ecef;
            border-radius: 5px;
            overflow: auto;
        }
    </style>
</head>
<body>
    <div id="container">
        <h1>Speech-to-Text Transcription</h1>
        <button id="start">Start Transcription</button>
        <button id="stop">Stop Transcription</button>
        <h2>Transcribed Text:</h2>
        <pre id="transcript"></pre>
    </div>

    <script>
        $(document).ready(function() {
            let pollInterval; // Variable to hold the interval ID

            // Start transcription
            $('#start').click(function() {
                $.get('/start', function(data) {
                    console.log(data.message);

                    // Start polling for transcripts if not already polling
                    if (!pollInterval) {
                        pollInterval = setInterval(function() {
                            $.ajax({
                                type: 'GET',
                                url: '/transcript',
                                dataType: 'json',
                                success: function(data) {
                                    console.log(data);
                                    if (data && data.transcript) {
                                        $('#transcript').text(data.transcript);
                                    } else {
                                        $('#transcript').text('No transcription available yet.');
                                    }
                                },
                                error: function(err) {
                                    console.error('Error fetching transcript:', err);
                                }
                            });
                        }, 1000);
                    }
                });
            });

            // Stop transcription
            $('#stop').click(function() {
                $.get('/stop', function(data) {
                    console.log(data.message);

                    // Stop polling for transcripts
                    if (pollInterval) {
                        clearInterval(pollInterval);
                        pollInterval = null; // Reset the interval variable
                    }
                });
            });
        });
    </script>

</body>
</html>

Audio Input:

AssemblyAI's MicrophoneStream (from aai.extras): Captures audio from the microphone of the machine running the Flask app and streams it for real-time processing.

I utilized additional prompts to enhance the project. I used the Flask web framework for rendering templates and returning JSON responses, and the dotenv library to load environment variables, such as the API key, from the .env file. On the frontend, I used CSS to style the user interface.
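Because `os.getenv` returns `None` when a variable is missing, a misconfigured .env file only shows up later as an authentication error. A small guard can fail fast instead. This is a sketch: `require_env` is a hypothetical helper, and `API_KEY` is the variable name used in this post's code.

```python
import os


def require_env(name: str) -> str:
    """Return the variable's value or raise a clear error if it is unset."""
    value = os.getenv(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; add it to your .env file"
        )
    return value
```

In the app this would be called after `load_dotenv()`, for example `aai.settings.api_key = require_env("API_KEY")`.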

Lastly, I want to thank my team, @devnenyasha and @lindiwe09, for their UI idea. If not for them, my UI would have been a mess.