This is a submission for the AssemblyAI Challenge: Sophisticated Speech-to-Text.
What I Built
I built Speech-to-Note, a web application that combines speech recognition with musical note detection. It lets users record audio (either speech or singing) and processes it in two ways:
- Converts spoken words into text using AssemblyAI's Speech-to-Text API
- Analyzes the audio to detect musical notes, including their pitch, octave, and duration
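To illustrate the note-detection side, mapping a detected pitch to a note name and octave is a small calculation. This is a hedged sketch, not the project's actual code: the function name `frequency_to_note` and the A4 = 440 Hz reference are my assumptions.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def frequency_to_note(freq_hz: float) -> tuple[str, int]:
    """Map a frequency in Hz to the nearest note name and octave (A4 = 440 Hz)."""
    # MIDI note number: 69 is A4; each semitone is a factor of 2**(1/12)
    midi = round(69 + 12 * math.log2(freq_hz / 440.0))
    return NOTE_NAMES[midi % 12], midi // 12 - 1

# frequency_to_note(440.0)  -> ("A", 4)
# frequency_to_note(261.63) -> ("C", 4), i.e. middle C
```

Duration would come from how long consecutive analysis frames stay on the same note, which is a separate step.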
The application features a modern, responsive UI built with React and TailwindCSS, and a robust backend powered by FastAPI. It's particularly useful for musicians, music teachers, and anyone interested in analyzing the musical properties of their voice or instruments.
Demo
Live site: https://speech.vicentereyes.org/
GitHub: Speech to Musical Notes Converter
This application converts spoken words into musical notes using FastAPI, React, and AssemblyAI.
Prerequisites
- Python 3.8+
- Node.js and npm
- AssemblyAI API key
Setup
1. Clone the repository

2. Set up the backend:

   ```bash
   # Install Python dependencies
   pip install -r requirements.txt

   # Set up your AssemblyAI API key in a .env file
   # Replace 'your_api_key_here' with your actual API key
   ```

3. Set up the frontend:

   ```bash
   cd frontend
   npm install
   ```
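For reference, here is a minimal sketch of how a `.env` file could be read with only the standard library. This is an assumption about the setup, not the project's code: the real backend may use a package like python-dotenv, and the variable name `ASSEMBLYAI_API_KEY` is hypothetical.

```python
from pathlib import Path

def load_env(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and comments."""
    env = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip()
    return env

# Example .env contents (hypothetical variable name):
# ASSEMBLYAI_API_KEY=your_api_key_here
```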
Running the Application
1. Start the backend server:

   ```bash
   uvicorn main:app --reload
   ```

2. Start the frontend development server:

   ```bash
   cd frontend
   npm run dev
   ```

3. Open your browser and navigate to the URL shown in the frontend terminal output (usually http://localhost:5173)
Usage
- Click the "Start Recording" button to begin recording audio
- Speak into your microphone
- Click "Stop Recording" when finished
- Click "Process Audio" to send the recording to the server
- The transcribed text will appear below
Features
- Audio recording using the Web Audio API
- Real-time…
Screenshots: Landing Page · Audio Processing · Result
Journey
AssemblyAI's Universal-2 Speech-to-Text model is integrated into the application through their Python SDK. The implementation lives in the upload_audio endpoint of our FastAPI backend:
- When a user records audio, it's sent to our backend as a WAV file
- The audio file is processed in parallel:
  - Sent to AssemblyAI's API for transcription
  - Analyzed locally using librosa for musical note detection
- The transcribed text and detected musical notes are returned to the frontend
The AssemblyAI integration was straightforward thanks to their well-documented SDK:
```python
import assemblyai as aai  # aai.settings.api_key must be set first (see Setup)

transcriber = aai.Transcriber()
transcript = transcriber.transcribe(audio_file_path)
transcribed_text = transcript.text
```
What makes this implementation sophisticated is the dual-processing approach:
- Using AssemblyAI's advanced speech recognition for accurate text transcription
- Complementing it with custom pitch detection algorithms to extract musical information
- Providing a synchronized playback experience where users can hear the detected notes while seeing the transcribed text
This creates a unique tool that bridges the gap between spoken word and musical notation, making it valuable for various musical applications, from education to composition.
The application qualifies for additional prompts as it implements:
- Real-time audio processing
- Custom pitch detection algorithms
- Interactive audio playback
- Modern, responsive UI with TailwindCSS
- Full-stack implementation with React and FastAPI
The project demonstrates how AssemblyAI's technology can be combined with custom audio processing to create innovative applications that go beyond simple speech-to-text conversion.