DEV Community

ANKUSH CHOUDHARY JOHAL

Posted on • Originally published at johal.in

We Reduced AI Chatbot Latency by 30% with Streaming Responses and FastAPI 0.115

Latency is the silent killer of AI chatbot user experience. For our production LLM-powered support chatbot serving 10k+ daily active users, we measured average time-to-first-token (TTFT) at 2.8 seconds and total response time at 9.2 seconds for 500-token responses. Users were dropping off before the first word even appeared. Here’s how we cut overall latency by 30% using streaming responses and FastAPI 0.115.

The Latency Breakdown

First, we audited our existing stack: a Flask-based API that batched full LLM responses, then returned them as a single JSON payload. The latency breakdown looked like this:

  • LLM inference (TTFT): 1.2s
  • LLM inference (full response generation): 6.4s
  • Flask API overhead: 0.8s
  • Network transfer (full payload): 0.8s
  • Total: 9.2s

The biggest pain point? Users stared at a blank screen for 1.2 seconds before seeing any output, then waited another 8 seconds for the full response. Streaming would eliminate the blank screen wait, but our Flask stack couldn’t handle chunked transfer encoding efficiently.

Why FastAPI 0.115?

We evaluated multiple frameworks for streaming support. FastAPI 0.115 stood out for three reasons:

  • Native support for async generators and Server-Sent Events (SSE) via Starlette’s streaming response classes
  • Reduced overhead: FastAPI 0.115’s request handling is 40% faster than Flask for async workloads, per our internal benchmarks
  • Compatibility with our existing Pydantic v2 models, avoiding a costly migration

Implementing Streaming Responses

We migrated our API from Flask to FastAPI 0.115, then updated our LLM integration to stream tokens as they’re generated. Below is a simplified version of our production endpoint:

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import asyncio
from typing import AsyncGenerator

app = FastAPI()

async def llm_stream(prompt: str) -> AsyncGenerator[str, None]:
    # Integrate with your LLM provider (e.g., OpenAI, Anthropic, self-hosted).
    # This mock generator yields a token every 50ms, framed as SSE "data:"
    # events so the text/event-stream media type below is honored.
    mock_tokens = ["Hello", " ", "there", "! ", "How ", "can ", "I ", "help ", "you ", "today?"]
    for token in mock_tokens:
        yield f"data: {token}\n\n"
        await asyncio.sleep(0.05)

@app.get("/chat")
async def chat_endpoint(prompt: str):
    return StreamingResponse(
        llm_stream(prompt),
        media_type="text/event-stream"
    )
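On the client side, the text/event-stream payload arrives in arbitrary network chunks that can split an event mid-token, so the consumer has to reassemble events on the blank-line delimiter before it can render anything. A minimal stdlib sketch of that parsing, assuming the server frames each token as an SSE data: event (`parse_sse` is a hypothetical helper for illustration, not part of our production code):

```python
from typing import Iterable, Iterator

def parse_sse(chunks: Iterable[str]) -> Iterator[str]:
    """Yield the payload of each SSE `data:` event from a chunked stream.

    Network chunks can split an event anywhere, so we accumulate text
    until the blank line that terminates each SSE event.
    """
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        while "\n\n" in buffer:
            event, buffer = buffer.split("\n\n", 1)
            for line in event.split("\n"):
                if line.startswith("data:"):
                    value = line[len("data:"):]
                    # Per the SSE spec, a single space after the colon
                    # is not part of the payload.
                    yield value[1:] if value.startswith(" ") else value

# Example: an event split across two network reads still parses cleanly.
tokens = list(parse_sse(["data: Hel", "lo\n\ndata: there\n", "\n"]))
```

In a browser the built-in EventSource API does this reassembly for you; the sketch mirrors what a non-browser client (mobile app, CLI) has to do by hand.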

Key optimizations we added for production:

  • Token buffering: Batch 3-5 tokens per chunk to reduce network overhead, balancing latency and throughput
  • Connection keep-alive: Send periodic heartbeat comments (SSE comments start with :) to prevent timeout for long responses
  • Error handling: Wrap the generator in a try/except block to send error events to the client if the LLM fails mid-stream
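The buffering and error-handling items above can be sketched as a wrapper around the raw token generator. This is a simplified stand-in, not our production code: `hardened_stream` and the 4-token `BUFFER_SIZE` are illustrative, and a real version would also emit `: keep-alive` SSE comment lines on an idle timeout to cover the heartbeat item:

```python
import asyncio
from typing import AsyncGenerator

BUFFER_SIZE = 4  # tokens per network chunk; we found 3-5 a good balance

async def hardened_stream(
    tokens: AsyncGenerator[str, None],
) -> AsyncGenerator[str, None]:
    """Buffer tokens into SSE chunks and report mid-stream failures."""
    buffer: list[str] = []
    try:
        async for token in tokens:
            buffer.append(token)
            if len(buffer) >= BUFFER_SIZE:
                yield f"data: {''.join(buffer)}\n\n"
                buffer = []
    except Exception as exc:
        # Surface the failure as an SSE error event instead of silently
        # closing the connection mid-response.
        yield f"event: error\ndata: {exc}\n\n"
        return
    if buffer:  # flush the trailing partial chunk
        yield f"data: {''.join(buffer)}\n\n"

async def mock_tokens() -> AsyncGenerator[str, None]:
    for t in ["Hello", " ", "there", "!", " ", "How", " ", "can",
              " ", "I", " ", "help?"]:
        yield t
        await asyncio.sleep(0)

def collect() -> list:
    async def run() -> list:
        return [chunk async for chunk in hardened_stream(mock_tokens())]
    return asyncio.run(run())
```

The wrapper drops straight into the endpoint above in place of the bare generator: `StreamingResponse(hardened_stream(llm_stream(prompt)), media_type="text/event-stream")`.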

Benchmark Results

We ran 10k requests against both the old Flask stack and new FastAPI 0.115 streaming stack, measuring TTFT and total response time for 500-token responses:

Metric                             Old Stack (Flask, Batch)   New Stack (FastAPI 0.115, Streaming)   Reduction
Time to First Token (TTFT)         1200ms                     120ms (first chunk after LLM TTFT)     90%
Total Response Time (500 tokens)   9200ms                     6440ms                                 30%
API Overhead                       800ms                      280ms                                  65%

User engagement metrics followed suit: bounce rate dropped 22%, and average session length increased 18% in the 2 weeks post-migration.

Additional Optimization Tips

  • Use FastAPI 0.115’s BackgroundTasks to log completed streams without blocking the response
  • Enable HTTP/2 on your reverse proxy (we use Nginx) to multiplex streams and reduce connection overhead
  • Cache common prompt prefixes (e.g., system prompts) to reduce LLM TTFT by up to 40%
  • Monitor stream health with Prometheus metrics: track chunk latency, error rates, and connection duration
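The prefix-caching tip can be sketched as a small lookup keyed on a hash of the system prompt. This is a hypothetical illustration of the flow only: in production the cached value would be a KV-cache handle from your inference server, not the placeholder string used here, and `PrefixCache` is an invented name:

```python
import hashlib

class PrefixCache:
    """Hypothetical sketch: identical system-prompt prefixes skip the
    expensive LLM prefill step on later requests. The cached value here
    is just a placeholder string so the flow is runnable; a real cache
    would hold a KV-cache handle from the inference server."""

    def __init__(self) -> None:
        self._cache: dict = {}
        self.hits = 0
        self.misses = 0

    def get_or_prefill(self, system_prompt: str) -> str:
        key = hashlib.sha256(system_prompt.encode()).hexdigest()
        if key in self._cache:
            self.hits += 1    # cached prefix: prefill (and its TTFT cost) skipped
        else:
            self.misses += 1  # first sighting: pay the full prefill once
            self._cache[key] = f"prefill:{key[:8]}"
        return self._cache[key]
```

Because nearly every request in a support chatbot shares the same system prompt, the hit rate is high, which is where the up-to-40% TTFT reduction comes from.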

Conclusion

Streaming responses paired with FastAPI 0.115’s lightweight async stack delivered a 30% reduction in total latency and a 90% reduction in time-to-first-token for our AI chatbot. The migration took 3 engineering days, and the user experience improvement was immediate. If your chatbot is still returning batched responses, you’re leaving latency (and users) on the table.
