alfchee
From MVP to Production: Scaling a Speech AI Service

Hey fellow builders! 🚀 After 7 articles diving into the technical details of my real-time speech service, I want to take a step back and share the bigger picture: how we scaled from a shaky MVP to a production-ready system handling 150+ concurrent users. This is the story that ties everything together.

The MVP Mindset (And Its Pitfalls)

Our first version worked—like most MVPs do. We had Flask handling WebSockets, Riva doing transcription, and basic TTS playback. Users could connect, speak, and get responses.

But "working" and "production-ready" are completely different categories:

# MVP code that "worked"
@app.websocket("/transcribe")
async def transcribe(websocket):
    await websocket.accept()
    while True:
        data = await websocket.receive_text()
        result = riva.transcribe(data)  # Fire and forget
        await websocket.send_text(result)

This code has:

  • No error handling
  • No reconnection logic
  • No resource cleanup
  • No monitoring

It worked in demos. It would fail spectacularly in production.

The Migration to FastAPI (Article #1 Recap)

The first major shift was moving to FastAPI. But here's what nobody tells you: the code migration was the easy part.

The hard part was rethinking everything:

# Production-ready approach
from fastapi import WebSocket, WebSocketDisconnect

@app.websocket("/transcribe/{session_id}")
async def transcribe(websocket: WebSocket, session_id: str):
    await websocket.accept()

    try:
        session = await create_session(session_id)
        async for message in websocket.iter_text():
            result = await process_audio(message, session)
            await websocket.send_text(result)
    except WebSocketDisconnect:
        logger.info(f"Session {session_id} disconnected")
    finally:
        await cleanup_session(session_id)  # Critical!

Lesson: Async isn't just syntax—it's a different mental model for resource management.
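One way to make that mental model concrete is an async context manager that guarantees teardown no matter how the connection ends. A minimal sketch — the `sessions` dict and `managed_session` helper are illustrative, not the service's actual code:

```python
import asyncio
from contextlib import asynccontextmanager

# Hypothetical in-memory session registry (illustrative only).
sessions = {}

@asynccontextmanager
async def managed_session(session_id: str):
    # Setup: register the session so other coroutines can find it.
    sessions[session_id] = {"chunks": 0}
    try:
        yield sessions[session_id]
    finally:
        # Teardown always runs, even if the WebSocket loop raises.
        sessions.pop(session_id, None)

async def handle(session_id: str):
    async with managed_session(session_id) as session:
        session["chunks"] += 1
        raise RuntimeError("simulated disconnect")

try:
    asyncio.run(handle("abc"))
except RuntimeError:
    pass

print(sessions)  # empty: cleaned up even though the handler raised
```

The `try/finally` in the handler above does the same job; wrapping it in a reusable context manager just keeps every endpoint from reimplementing cleanup by hand.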

Adding Resilience Layer by Layer (Article #7)

Error handling evolved in phases:

Phase 1: Basic try-catch

try:
    result = riva.transcribe(audio)
except Exception as e:
    logger.error(f"Error: {e}")

Phase 2: Categorized errors

try:
    result = riva.transcribe(audio)
except RivaConnectionError as e:
    emit_reconnection_event(session_id)
except AudioFormatError as e:
    send_error_message(websocket, "Invalid audio format")
except RateLimitError:
    queue_request_for_retry(session_id)

Phase 3: Graceful degradation

async def transcribe_with_fallback(audio):
    try:
        return await riva.transcribe(audio)
    except RivaConnectionError:
        logger.warning("Riva unavailable, using fallback service")
        return await fallback_service.transcribe(audio)

Lesson: Error handling isn't a layer you add—it's a journey that evolves with your system.
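The `queue_request_for_retry` step from Phase 2 can be sketched as a retry loop with exponential backoff and jitter. Everything here is illustrative — `RivaConnectionError` is redefined locally and `flaky_transcribe` stands in for the real client:

```python
import asyncio
import random

class RivaConnectionError(Exception):
    """Stand-in for the service's connection error type."""

async def transcribe_with_retry(transcribe, audio, attempts=3, base_delay=0.01):
    # Retry transient connection failures with exponential backoff plus jitter.
    for attempt in range(attempts):
        try:
            return await transcribe(audio)
        except RivaConnectionError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            await asyncio.sleep(delay)

# Fake service: fails twice, then succeeds.
calls = {"n": 0}

async def flaky_transcribe(audio):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RivaConnectionError
    return "hello world"

result = asyncio.run(transcribe_with_retry(flaky_transcribe, b"\x00\x01"))
print(result)  # "hello world" after two retries
```

The jitter matters in production: without it, every client that failed at the same moment retries at the same moment, and you hammer a recovering service in lockstep.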

The Logging Evolution (Article #14)

In the MVP, we had print("here") scattered everywhere. Production required structure:

import structlog

structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),
    ]
)

# Now we can correlate logs across services
logger = structlog.get_logger()
logger.info(
    "transcription_complete",
    session_id=session_id,
    latency_ms=latency,
    audio_chunks=chunk_count,
)

Lesson: Debugging production issues without structured logging is like trying to find a needle in a haystack—blindfolded.
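If pulling in structlog isn't an option, the core idea — one JSON object per log line with arbitrary key/value fields — can be approximated with the standard library. A minimal sketch; the `fields` extra key is my own convention here, not a built-in logging feature:

```python
import io
import json
import logging

class JsonFormatter(logging.Formatter):
    # Render each record as one JSON object, merging in custom fields.
    def format(self, record):
        payload = {
            "level": record.levelname.lower(),
            "event": record.getMessage(),
        }
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("speech")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Structured call: key/value pairs travel via `extra`.
logger.info("transcription_complete",
            extra={"fields": {"session_id": "abc", "latency_ms": 142}})

entry = json.loads(stream.getvalue())
print(entry["session_id"], entry["latency_ms"])
```

Once every line is machine-parseable JSON with a `session_id`, correlating a user's path across services becomes a query instead of an archaeology dig.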

Docker: From "It Works On My Machine" to Production (Article #9)

Our Dockerfile went through iterations:

# v1 - The MVP version
FROM python:3.9
COPY . /app
RUN pip install -r requirements.txt
CMD ["python", "app.py"]

# v2 - Production version
FROM nvidia/cuda:11.8.0-runtime-ubuntu22.04

WORKDIR /app

# Install Python and dependencies
RUN apt-get update && apt-get install -y python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN python3 -m pip install --no-cache-dir -r requirements.txt

# Copy only needed files
COPY ./app ./app

# Run as non-root user
RUN useradd -m appuser
USER appuser

CMD ["python3", "-m", "uvicorn", "app:app", "--host", "0.0.0.0"]

Lesson: Docker isn't just packaging—it's defining your production environment.

Audio Quality: The Unexpected Challenge (Article #4)

We thought audio streaming was "solved." Then users started complaining about gaps and choppiness.

The solution involved:

  1. Buffering with sequence numbers
  2. Fade-out/fade-in for seamless transitions
  3. Adaptive buffering based on network conditions

class AudioBuffer:
    def __init__(self):
        self.chunks = {}
        self.sequence = 0

    def add_chunk(self, audio_data: bytes, seq: int):
        self.chunks[seq] = audio_data
        self.sequence = max(self.sequence, seq)

    def get_ordered_audio(self) -> bytes:
        ordered = b"".join(
            self.chunks[i] 
            for i in sorted(self.chunks.keys())
        )
        return ordered

Lesson: The "last 10%" of quality takes 90% of the effort.
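To see the reordering in action, here's the buffer exercised with chunks arriving out of sequence (the class is re-declared compactly so the snippet runs standalone):

```python
# Compact copy of the article's buffer so this example is self-contained.
class AudioBuffer:
    def __init__(self):
        self.chunks = {}

    def add_chunk(self, audio_data: bytes, seq: int):
        # Chunks may arrive in any order; keyed by sequence number.
        self.chunks[seq] = audio_data

    def get_ordered_audio(self) -> bytes:
        # Reassemble in sequence order regardless of arrival order.
        return b"".join(self.chunks[i] for i in sorted(self.chunks))

buf = AudioBuffer()
# Network delivers chunk 2 before chunk 1...
buf.add_chunk(b"world", seq=2)
buf.add_chunk(b"hello ", seq=1)
print(buf.get_ordered_audio())  # b'hello world'
```

A production version would also need to decide what to do about a chunk that never arrives — wait, skip, or concealment — which is where the adaptive buffering from point 3 comes in.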

What I Would Do Differently

If I could restart this journey:

  1. Start with FastAPI - Don't accumulate technical debt with Flask
  2. Add structured logging from day 1 - You will need it
  3. Design for failure - Every component will fail; plan for it
  4. Monitor everything - If you can't measure it, you can't improve it
  5. Document decisions - Future you will thank present you
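On point 4, monitoring can start as small as a timing context manager feeding an in-memory dict. Illustrative only — a real deployment would ship these numbers to something like Prometheus instead of holding them in process memory:

```python
import time
from contextlib import contextmanager

# Hypothetical in-process metrics store: operation name -> list of latencies.
metrics = {}

@contextmanager
def timed(name: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        # Record latency in milliseconds, even if the block raised.
        metrics.setdefault(name, []).append(
            (time.perf_counter() - start) * 1000.0)

with timed("transcription"):
    time.sleep(0.01)  # stand-in for real work

print(f"{metrics['transcription'][0]:.1f} ms")
```

Even this crude version answers the question that matters most in early scaling: "did that change make latency better or worse?"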

The Takeaway

Building an MVP is fun. Scaling it to production is where the real engineering happens. The 7 articles in this series covered the technical pieces, but the overarching lesson is:

Production systems aren't built—they're evolved.

Each challenge we faced—async programming, error handling, logging, GPU integration, audio quality, Docker—added another layer of resilience. That's what makes a service "production-ready."

Thanks for following this journey! What's your scaling story?
