VaishakhVipin

Posted on Jul 28

Whispers - A Real-Time Voice Journaling Agent Built with AssemblyAI

#devchallenge #assemblyaichallenge #ai #api

AssemblyAI Voice Agents Challenge: Real-Time

This is a submission for the AssemblyAI Voice Agents Challenge

Whispers - A Real-Time Voice Journaling Agent

What I Built

Whispers is a voice-first journaling application powered by AssemblyAI's universal-streaming API. It enables users to speak their thoughts in real-time, intelligently formatting their words into reflective, readable journal entries. The app serves as a personal wellness companion—part therapist, part mirror, part coach—helping users capture their daily reflections through natural speech.

This project falls under the Real-Time Performance category, demonstrating advanced real-time audio processing with sub-300ms latency for live transcription display. The application showcases how AssemblyAI's universal-streaming technology can create seamless, responsive voice experiences that feel natural and immediate.

Demo

🎥 Video Demo: https://drive.google.com/file/d/1RHyqpW434EeTGdP6xMRYbZCfifNatZd7/view?usp=sharing

GitHub Repository

The complete source code is available at: (https://github.com/VaishakhVipin/whispers-final)

Key files demonstrating AssemblyAI integration:

backend/services/assembly.py - Python WebSocket streaming implementation
frontend/src/components/NotionLikeEditor.tsx - Frontend WebSocket integration
backend/routes/stream.py - Backend API endpoints for voice processing
frontend/src/lib/api.ts - Frontend API integration

Technical Implementation & AssemblyAI Integration

AssemblyAI's universal-streaming WebSocket API is the core of Whispers' real-time voice processing capabilities. The implementation streams microphone audio and receives live, formatted transcripts with exceptional accuracy and minimal latency.

Key AssemblyAI Features Implemented:

Real-time WebSocket Connection: Direct streaming to AssemblyAI's v3 streaming endpoint with formatted finals
Live Transcription: Continuous audio processing with immediate text output and partial transcript display
Auto-formatting: Clean, punctuated transcripts with proper sentence boundaries using formatted_finals=true
Streaming State Management: Robust connection handling with proper cleanup and error recovery
Duplicate Detection: Intelligent handling to prevent transcription artifacts and repeated content
Paragraph Logic: Smart paragraph spacing based on content analysis and sentence boundaries

Code Snippet - Python WebSocket Implementation:

async def stream_to_assemblyai(audio_generator):
    """
    Streams PCM audio chunks to AssemblyAI Universal-Streaming API and yields transcript text results.
    :param audio_generator: async generator yielding raw PCM audio bytes
    :yield: transcript text (str)
    """
    token = get_assemblyai_token_universal_streaming()
    ws_url = ASSEMBLYAI_WS_BASE + token

    async with websockets.connect(ws_url) as ws:
        async def send_audio():
            async for chunk in audio_generator:
                await ws.send(chunk)
            await ws.send(json.dumps({"terminate_session": True}))

        async def receive_transcripts():
            async for msg in ws:
                data = json.loads(msg)
                if data.get("message_type") == "FinalTranscript":
                    yield data.get("text", "")

        send_task = asyncio.create_task(send_audio())
        async for transcript in receive_transcripts():
            yield transcript
        await send_task

Frontend JavaScript Integration:

// Connect to AssemblyAI WebSocket
const ws = new WebSocket(`wss://streaming.assemblyai.com/v3/ws?sample_rate=16000&formatted_finals=true&token=${token}`);

ws.onmessage = (event) => {
  const data = JSON.parse(event.data);

  if (data.type === "Turn") {
    const transcript = data.transcript || "";
    const turnIsFormatted = data.turn_is_formatted || false;

    if (turnIsFormatted && transcript.trim()) {
      // Final, formatted version - add to main transcription
      console.log("📝 Clean transcription:", transcript);

      // Check for duplicates and add with proper paragraph spacing
      const shouldStartNewParagraph = shouldStartNewParagraphLogic(transcript, transcriptionText);
      const separator = shouldStartNewParagraph ? "\n\n" : " ";

      setTranscriptionText(prev => {
        const trimmedTranscript = transcript.trim();
        const trimmedPrev = prev.trim();

        // Robust duplicate detection
        if (trimmedTranscript && 
            !trimmedPrev.endsWith(trimmedTranscript) && 
            !trimmedPrev.includes(trimmedTranscript + " " + trimmedTranscript)) {
          return prev + (prev && !prev.endsWith('\n\n') ? separator : "") + transcript;
        }
        return prev;
      });
    } else if (!turnIsFormatted && transcript.trim()) {
      // Partial version - show in real-time stream
      setCurrentStreamText(transcript);
    }
  }
};

UX Design & Features

Voice-First Interface:

Minimalist journaling canvas with vintage paper aesthetic
Pulsing recording indicator for live microphone status
Real-time word count and session duration tracking
Intelligent duplicate detection to prevent transcription artifacts

Smart Journaling Features:

Daily Reflection Prompts: Curated prompts that refresh daily at 12 AM GMT
Tone Rewriting: AI-powered text transformation (optimistic, technical, formal, etc.)
Session Management: Edit sessions created on the same day, read-only after that
Content Analysis: Automatic title generation, summaries, and key theme extraction
Search & Discovery: Full-text search across all journal entries

Technical Architecture:

Frontend: React + TypeScript + Vite + Tailwind CSS + Shadcn/ui
Backend: FastAPI + Python for API endpoints and AI processing
Database: Supabase for user authentication and session storage
Search: Algolia for fast, semantic search across journal entries
AI Processing: Google Gemini for content summarization and tone rewriting

Key Technical Achievements

Real-Time Performance:

Sub-200ms latency for live transcription display
Seamless WebSocket connection management
Efficient audio processing with proper resource cleanup
Responsive UI updates synchronized with audio state

Domain Expertise:

Specialized journaling workflow optimized for voice input
Intelligent content organization with automatic categorization
User behavior analysis with session statistics and trends
Privacy-focused design with user data isolation

Robust Error Handling:

Graceful microphone permission management
Connection recovery mechanisms
Comprehensive logging for debugging
Fallback modes for degraded performance

Key Takeaways

AssemblyAI's Real-time Capabilities: The universal-streaming API provides exceptional low-latency transcription with remarkable accuracy, making voice journaling feel natural and responsive.
WebSocket Management is Critical: Proper cleanup of WebSocket connections and audio resources is essential, especially when users navigate between pages or close the application.
Voice Journaling Requires Context: Beyond simple text capture, voice journaling benefits from emotional context, prompting, and intelligent content organization.
Immutable Journals Encourage Honesty: Locking journal entries after creation (read-only after the same day) encourages more authentic, unfiltered self-reflection.
Real-time UX Demands Attention: Users expect immediate feedback when speaking, requiring careful attention to UI state management and audio-visual synchronization.

What's Next

Immediate Roadmap:

Deploy live version with enhanced security and RLS re-enabled
Implement user streak tracking and habit formation features
Add sentiment analysis for emotional trend tracking
Create memory timelines and reflection insights

Future Enhancements:

Voice emotion detection for mood tracking
Collaborative journaling features
Integration with wellness apps and calendars
Advanced AI coaching and reflection prompts

Technical Stack

Frontend:

Typescript (React)
Vite for fast development and building
Tailwind CSS for styling
Shadcn/ui for component library
React Router for navigation

Backend:

FastAPI for RESTful API endpoints
Python for server-side processing
Supabase for authentication and database
Algolia for search indexing

Voice & AI:

AssemblyAI Universal Streaming for real-time transcription
Google Gemini for content analysis and rewriting
WebSocket for real-time communication

Deployment:

Vercel for frontend hosting
Vercel Functions for backend API
Environment-based security configuration

Final Note

Whispers is built for people who think best out loud. It transforms the traditional journaling experience into a dynamic conversation with yourself—live, raw, and authentically yours. By leveraging AssemblyAI's cutting-edge voice technology, Whispers makes capturing daily reflections as natural as having a conversation, while providing the structure and insights that make journaling truly meaningful.

The project demonstrates how real-time voice technology can enhance personal wellness applications, creating a more intuitive and engaging way for users to document their thoughts, emotions, and personal growth journey.

DEV Community