alfchee
From MVP to Production: Scaling a Speech AI Service

Hey fellow builders! 🚀 After 7 articles diving into the technical details of my real-time speech service, I want to take a step back and share the bigger picture: how we scaled from a shaky MVP to a production-ready system handling 150+ concurrent users. This is the story that ties everything together.

The MVP Mindset (And Its Pitfalls)

Our first version worked—like most MVPs do. We had Flask handling WebSockets, Riva doing transcription, and basic TTS playback. Users could connect, speak, and get responses.

But "working" and "production-ready" are completely different categories:

# MVP code that "worked"
@app.websocket("/transcribe")
async def transcribe(websocket):
    await websocket.accept()
    while True:
        data = await websocket.receive_text()
        result = riva.transcribe(data)  # Fire and forget
        await websocket.send_text(result)

This code has:

  • No error handling
  • No reconnection logic
  • No resource cleanup
  • No monitoring

It worked in demos. It would fail spectacularly in production.

The Migration to FastAPI (Article #1 Recap)

The first major shift was moving to FastAPI. But here's what nobody tells you: the code migration was the easy part.

The hard part was rethinking everything:

# Production-ready approach
from fastapi import WebSocket, WebSocketDisconnect

@app.websocket("/transcribe/{session_id}")
async def transcribe(websocket: WebSocket, session_id: str):
    await websocket.accept()

    try:
        session = await create_session(session_id)
        async for message in websocket.iter_text():
            result = await process_audio(message, session)
            await websocket.send_text(result)
    except WebSocketDisconnect:
        logger.info(f"Session {session_id} disconnected")
    finally:
        await cleanup_session(session_id)  # Critical!

Lesson: Async isn't just syntax—it's a different mental model for resource management.
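One way to make that mental model concrete is an async context manager that guarantees teardown no matter how the connection ends. A minimal sketch — the `sessions` dict and `managed_session` helper are illustrative, not the service's actual code:

```python
import asyncio
from contextlib import asynccontextmanager

# Hypothetical in-memory session registry (illustrative only).
sessions = {}

@asynccontextmanager
async def managed_session(session_id: str):
    # Setup: register the session so other coroutines can find it.
    sessions[session_id] = {"chunks": 0}
    try:
        yield sessions[session_id]
    finally:
        # Teardown always runs, even if the WebSocket loop raises.
        sessions.pop(session_id, None)

async def handle(session_id: str):
    async with managed_session(session_id) as session:
        session["chunks"] += 1
        raise RuntimeError("simulated disconnect")

try:
    asyncio.run(handle("abc"))
except RuntimeError:
    pass

print(sessions)  # empty: cleaned up even though the handler raised
```

The `try/finally` in the handler above does the same job; wrapping it in a reusable context manager just keeps every endpoint from reimplementing cleanup by hand.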

Adding Resilience Layer by Layer (Article #7)

Error handling evolved in phases:

Phase 1: Basic try-catch

try:
    result = riva.transcribe(audio)
except Exception as e:
    logger.error(f"Error: {e}")

Phase 2: Categorized errors

try:
    result = riva.transcribe(audio)
except RivaConnectionError as e:
    emit_reconnection_event(session_id)
except AudioFormatError as e:
    send_error_message(websocket, "Invalid audio format")
except RateLimitError:
    queue_request_for_retry(session_id)

Phase 3: Graceful degradation

async def transcribe_with_fallback(audio):
    try:
        return await riva.transcribe(audio)
    except RivaConnectionError:
        logger.warning("Riva unavailable, using fallback service")
        return await fallback_service.transcribe(audio)

Lesson: Error handling isn't a layer you add—it's a journey that evolves with your system.
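The `queue_request_for_retry` step from Phase 2 can be sketched as a retry loop with exponential backoff and jitter. Everything here is illustrative — `RivaConnectionError` is redefined locally and `flaky_transcribe` stands in for the real client:

```python
import asyncio
import random

class RivaConnectionError(Exception):
    """Stand-in for the service's connection error type."""

async def transcribe_with_retry(transcribe, audio, attempts=3, base_delay=0.01):
    # Retry transient connection failures with exponential backoff plus jitter.
    for attempt in range(attempts):
        try:
            return await transcribe(audio)
        except RivaConnectionError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            await asyncio.sleep(delay)

# Fake service: fails twice, then succeeds.
calls = {"n": 0}

async def flaky_transcribe(audio):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RivaConnectionError
    return "hello world"

result = asyncio.run(transcribe_with_retry(flaky_transcribe, b"\x00\x01"))
print(result)  # "hello world" after two retries
```

The jitter matters in production: without it, every client that failed at the same moment retries at the same moment, and you hammer a recovering service in lockstep.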

The Logging Evolution (Article #14)

In the MVP, we had print("here") scattered everywhere. Production required structure:

import structlog

structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),
    ]
)

# Now we can correlate logs across services
logger = structlog.get_logger()
logger.info(
    "transcription_complete",
    session_id=session_id,
    latency_ms=latency,
    audio_chunks=chunk_count,
)

Lesson: Debugging production issues without structured logging is like trying to find a needle in a haystack—blindfolded.
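If pulling in structlog isn't an option, the core idea — one JSON object per log line with arbitrary key/value fields — can be approximated with the standard library. A minimal sketch; the `fields` extra key is my own convention here, not a built-in logging feature:

```python
import io
import json
import logging

class JsonFormatter(logging.Formatter):
    # Render each record as one JSON object, merging in custom fields.
    def format(self, record):
        payload = {
            "level": record.levelname.lower(),
            "event": record.getMessage(),
        }
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("speech")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Structured call: key/value pairs travel via `extra`.
logger.info("transcription_complete",
            extra={"fields": {"session_id": "abc", "latency_ms": 142}})

entry = json.loads(stream.getvalue())
print(entry["session_id"], entry["latency_ms"])
```

Once every line is machine-parseable JSON with a `session_id`, correlating a user's path across services becomes a query instead of an archaeology dig.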

Docker: From "It Works On My Machine" to Production (Article #9)

Our Dockerfile went through iterations:

# v1 - The MVP version
FROM python:3.9
COPY . /app
RUN pip install -r requirements.txt
CMD ["python", "app.py"]

# v2 - Production version
FROM nvidia/cuda:11.8.0-runtime-ubuntu22.04

WORKDIR /app

# Install Python and dependencies
RUN apt-get update && apt-get install -y python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN python3 -m pip install --no-cache-dir -r requirements.txt

# Copy only needed files
COPY ./app ./app

# Run as non-root user
RUN useradd -m appuser
USER appuser

CMD ["python3", "-m", "uvicorn", "app:app", "--host", "0.0.0.0"]

Lesson: Docker isn't just packaging—it's defining your production environment.

Audio Quality: The Unexpected Challenge (Article #4)

We thought audio streaming was "solved." Then users started complaining about gaps and choppiness.

The solution involved:

  1. Buffering with sequence numbers
  2. Fade-out/fade-in for seamless transitions
  3. Adaptive buffering based on network conditions

class AudioBuffer:
    def __init__(self):
        self.chunks = {}
        self.sequence = 0

    def add_chunk(self, audio_data: bytes, seq: int):
        self.chunks[seq] = audio_data
        self.sequence = max(self.sequence, seq)

    def get_ordered_audio(self) -> bytes:
        ordered = b"".join(
            self.chunks[i] 
            for i in sorted(self.chunks.keys())
        )
        return ordered

Lesson: The "last 10%" of quality takes 90% of the effort.
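To see the reordering in action, here's the buffer exercised with chunks arriving out of sequence (the class is re-declared compactly so the snippet runs standalone):

```python
# Compact copy of the article's buffer so this example is self-contained.
class AudioBuffer:
    def __init__(self):
        self.chunks = {}

    def add_chunk(self, audio_data: bytes, seq: int):
        # Chunks may arrive in any order; keyed by sequence number.
        self.chunks[seq] = audio_data

    def get_ordered_audio(self) -> bytes:
        # Reassemble in sequence order regardless of arrival order.
        return b"".join(self.chunks[i] for i in sorted(self.chunks))

buf = AudioBuffer()
# Network delivers chunk 2 before chunk 1...
buf.add_chunk(b"world", seq=2)
buf.add_chunk(b"hello ", seq=1)
print(buf.get_ordered_audio())  # b'hello world'
```

A production version would also need to decide what to do about a chunk that never arrives — wait, skip, or concealment — which is where the adaptive buffering from point 3 comes in.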

What I Would Do Differently

If I could restart this journey:

  1. Start with FastAPI - Don't accumulate technical debt with Flask
  2. Add structured logging from day 1 - You will need it
  3. Design for failure - Every component will fail; plan for it
  4. Monitor everything - If you can't measure it, you can't improve it
  5. Document decisions - Future you will thank present you
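On point 4, monitoring can start as small as a timing context manager feeding an in-memory dict. Illustrative only — a real deployment would ship these numbers to something like Prometheus instead of holding them in process memory:

```python
import time
from contextlib import contextmanager

# Hypothetical in-process metrics store: operation name -> list of latencies.
metrics = {}

@contextmanager
def timed(name: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        # Record latency in milliseconds, even if the block raised.
        metrics.setdefault(name, []).append(
            (time.perf_counter() - start) * 1000.0)

with timed("transcription"):
    time.sleep(0.01)  # stand-in for real work

print(f"{metrics['transcription'][0]:.1f} ms")
```

Even this crude version answers the question that matters most in early scaling: "did that change make latency better or worse?"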

The Takeaway

Building an MVP is fun. Scaling it to production is where the real engineering happens. The 7 articles in this series covered the technical pieces, but the overarching lesson is:

Production systems aren't built—they're evolved.

Each challenge we faced—async programming, error handling, logging, GPU integration, audio quality, Docker—added another layer of resilience. That's what makes a service "production-ready."

Thanks for following this journey! What's your scaling story?
