The Problem: LLM Latency is Killing Your User Experience
Picture this: Your user clicks "Generate Report" in your AI-powered app. They wait. And wait. And wait some more. 10 seconds later (which feels like an eternity in user time), a complete response finally appears. By then, they've probably already started questioning whether your app is broken.
This is the reality of working with Large Language Models (LLMs). Whether you're using OpenAI's GPT, Google's Gemini, or any other LLM API, response times typically range from 5-15 seconds for complex queries. In today's instant-gratification world, that's simply too slow.
The Solution: Streaming Creates the Illusion of Speed
Here's where streaming comes to the rescue. Instead of waiting for the complete response, streaming allows you to show partial results as they're generated. This creates a powerful psychological effect - users see immediate progress, making the wait feel much shorter.
Think about how ChatGPT works. It doesn't wait to generate the entire response before showing it to you. Instead, it streams the response token by token, creating that satisfying typewriter effect that keeps you engaged.
Basic Text Streaming Example
Here's how you can implement basic text streaming with Python using the new Google Gen AI SDK:
# Example with Google Gen AI SDK (Recommended)
from google import genai

client = genai.Client(api_key="YOUR_GEMINI_API_KEY")

response = client.models.generate_content_stream(
    model="gemini-2.0-flash",
    contents="Write a story about AI"
)

for chunk in response:
    if chunk.text:
        # Update UI with each chunk
        updateUI(chunk.text)
Alternatively, you can use the OpenAI-compatible endpoint in Python:
# Example with Gemini via OpenAI-compatible endpoint
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_GEMINI_API_KEY",
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)

response = client.chat.completions.create(
    model="gemini-1.5-flash",
    messages=[{"role": "user", "content": "Write a story about AI"}],
    stream=True
)

for chunk in response:
    content = chunk.choices[0].delta.content
    if content:
        # Update UI with each chunk
        updateUI(content)
This works beautifully for plain text responses. But what happens when you need structured data?
The JSON Streaming Challenge
Modern applications often require structured responses from LLMs. You might ask for:
- A list of recommendations in JSON format
- User profiles with specific fields
- Complex data structures for dashboards
Here's where things get tricky. When you stream JSON, you get something like this:
[{"name": "Jo
Try running JSON.parse() on that, and you'll get an error. The JSON is malformed because it's incomplete. You could wait for the complete response, but then you lose all the benefits of streaming.
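To see it fail concretely, here's a quick sketch - the partial string below is just an illustration of what an in-flight buffer might look like:
// A snapshot of the stream mid-flight: the array, object, and string are all unterminated
const partialBuffer = '[{"name": "Jo';

try {
  JSON.parse(partialBuffer);
} catch (err) {
  // SyntaxError - the buffer only becomes valid JSON once the stream completes
  console.error("Not parseable yet:", err);
}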
Enter JSON Streaming: The Best of Both Worlds
This is where specialized libraries like http-streaming-request come in handy. They solve the JSON streaming problem by providing well-formed JSON objects even when the underlying stream is incomplete.
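Conceptually, you can think of it as "completing" the partial buffer before parsing - closing any open strings, objects, and arrays. Here's a deliberately simplified sketch of that idea (illustrative only, not the library's actual implementation, and it ignores edge cases like trailing commas or dangling keys):
// Illustrative only - NOT how http-streaming-request is actually implemented.
// Track unclosed strings, braces, and brackets, then append the missing
// closers so the partial buffer parses as valid JSON.
function completePartialJson(partial: string): string {
  const closers: string[] = [];
  let inString = false;
  let escaped = false;

  for (const ch of partial) {
    if (escaped) { escaped = false; continue; }
    if (ch === "\\") { escaped = true; continue; }
    if (ch === '"') { inString = !inString; continue; }
    if (inString) continue;
    if (ch === "{") closers.push("}");
    else if (ch === "[") closers.push("]");
    else if (ch === "}" || ch === "]") closers.pop();
  }

  return partial + (inString ? '"' : "") + closers.reverse().join("");
}

// '[{"name": "Jo'  ->  '[{"name": "Jo"}]', which JSON.parse accepts
console.log(JSON.parse(completePartialJson('[{"name": "Jo')));
In practice you'll want the library rather than a hand-rolled completer, but the sketch shows why a partial stream can still yield usable JSON snapshots.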
Installation
npm install http-streaming-request
# or
yarn add http-streaming-request
For Python backends, you'll also need the appropriate Gemini SDK:
# New Google Gen AI SDK - recommended for Gemini 2.0
pip install google-genai
# Or legacy Google Generative AI SDK
pip install google-generativeai
# For OpenAI-compatible endpoints
pip install openai
Basic JSON Streaming Example
import { makeStreamingJsonRequest } from "http-streaming-request";

const stream = makeStreamingJsonRequest({
  url: "/api/generate-users",
  method: "POST",
  payload: { count: 10 }
});

for await (const data of stream) {
  // Even if the API only returns [{"name": "Jo
  // this will give you [{ "name": "Jo" }] - valid JSON!
  console.log(data);
  updateUserList(data);
}
Real-World React Example
Here's how you might implement this in a React application:
import React, { useState } from 'react';
import { makeStreamingJsonRequest } from "http-streaming-request";

const UserGenerator = () => {
  const [users, setUsers] = useState([]);
  const [isLoading, setIsLoading] = useState(false);

  const generateUsers = async () => {
    setIsLoading(true);
    setUsers([]);

    try {
      for await (const usersSoFar of makeStreamingJsonRequest({
        url: "/api/generate-users",
        method: "POST",
        payload: {
          count: 20,
          prompt: "Generate diverse user profiles with names, ages, and locations"
        }
      })) {
        setUsers(usersSoFar);
      }
    } catch (error) {
      console.error('Streaming error:', error);
    } finally {
      setIsLoading(false);
    }
  };

  return (
    <div>
      <button onClick={generateUsers} disabled={isLoading}>
        {isLoading ? 'Generating...' : 'Generate Users'}
      </button>
      <div className="user-grid">
        {users.map((user, index) => (
          <div key={index} className="user-card">
            <h3>{user.name}</h3>
            <p>Age: {user.age}</p>
            <p>Location: {user.city}, {user.country}</p>
          </div>
        ))}
      </div>
    </div>
  );
};
Using React Hooks for Cleaner Code
The library also provides a React hook for even cleaner implementation:
import { useJsonStreaming } from "http-streaming-request";

const UserGenerator = () => {
  const { data: users, run } = useJsonStreaming({
    url: "/api/generate-users",
    method: "POST",
  });

  const handleGenerate = () => {
    run({
      payload: {
        count: 20,
        prompt: "Generate diverse user profiles"
      }
    });
  };

  return (
    <div>
      <button onClick={handleGenerate}>Generate Users</button>
      {users && users.map((user, index) => (
        <div key={index} className="user-card">
          <h3>{user.name}</h3>
          <p>Age: {user.age}</p>
          <p>Location: {user.city}, {user.country}</p>
        </div>
      ))}
    </div>
  );
};
Advanced: Handling Malformed JSON
Sometimes LLMs generate slightly malformed JSON that can't be parsed. The library allows you to provide a repair function:
const stream = makeStreamingJsonRequest({
  url: "/api/generate-data",
  method: "POST",
  jsonRepairFunction: (data) => {
    // Fix common LLM JSON mistakes
    return data
      .replace(/([{,]\s*)(\w+):/g, '$1"$2":')  // Add quotes to keys
      .replace(/,\s*}/g, '}')                   // Remove trailing commas
      .replace(/,\s*]/g, ']');                  // Remove trailing commas in arrays
  }
});
Backend Implementation Examples
Python with New Google Gen AI SDK
Google has released a new unified SDK for Gemini 2.0 that provides a cleaner interface. Here's how to use it for streaming:
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from google import genai
from typing import Dict, Any

app = FastAPI()

# Configure the new Gen AI client
client = genai.Client(api_key="YOUR_GEMINI_API_KEY")

@app.post("/api/generate-users-v2")
async def generate_users_v2(request: Dict[str, Any]):
    count = request.get("count", 10)
    prompt = request.get("prompt", "Generate user profiles")

    # Create the prompt for structured JSON output
    full_prompt = f"""
    {prompt}. Return exactly {count} users as a JSON array with fields: name, age, city, country.
    Example format:
    [
        {{"name": "John Doe", "age": 30, "city": "New York", "country": "USA"}},
        {{"name": "Jane Smith", "age": 25, "city": "London", "country": "UK"}}
    ]
    Return only valid JSON, no additional text.
    """

    def generate():
        try:
            # Use the new streaming method
            response = client.models.generate_content_stream(
                model="gemini-2.0-flash",
                contents=full_prompt
            )
            for chunk in response:
                if chunk.text:
                    yield chunk.text
        except Exception as e:
            yield f"Error: {str(e)}"

    return StreamingResponse(
        generate(),
        media_type="application/json",
        headers={"Cache-Control": "no-cache"}
    )

# Simple example for testing
@app.get("/api/test-stream")
async def test_stream():
    def generate():
        try:
            response = client.models.generate_content_stream(
                model="gemini-2.0-flash",
                contents="Write a story about a magic backpack."
            )
            for chunk in response:
                if chunk.text:
                    yield chunk.text
                    yield "\n" + "_" * 80 + "\n"  # Separator for clarity
        except Exception as e:
            yield f"Error: {str(e)}"

    return StreamingResponse(
        generate(),
        media_type="text/plain",
        headers={"Cache-Control": "no-cache"}
    )
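If you want to sanity-check that the backend really is streaming before wiring in any JSON-aware library, you can consume the raw bytes in the browser with the standard Fetch API and its ReadableStream reader. A minimal sketch, assuming a same-origin server and a placeholder element with id "output":
// Minimal consumer for the /api/test-stream endpoint above
async function readTestStream(): Promise<void> {
  const response = await fetch("/api/test-stream");
  if (!response.body) throw new Error("Streaming not supported in this environment");

  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  const output = document.getElementById("output"); // assumed placeholder element

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // Each value is a Uint8Array chunk; decode and append it as it arrives
    if (output) output.textContent += decoder.decode(value, { stream: true });
  }
}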
Python with Flask and Structured Output
For more control over JSON structure, you can use the new Gen AI SDK with Flask:
from flask import Flask, request, Response
from google import genai
from google.genai import types
import json

app = Flask(__name__)

# Configure the new Gen AI client
client = genai.Client(api_key="YOUR_GEMINI_API_KEY")

@app.route('/api/generate-structured', methods=['POST'])
def generate_structured():
    data = request.json
    count = data.get('count', 10)
    prompt = data.get('prompt', 'Generate user profiles')

    # Create structured prompt for JSON output
    full_prompt = f"""
    {prompt}. Generate exactly {count} diverse user profiles.
    Return a JSON array where each user has these fields:
    - name (string): Full name
    - age (integer): Age between 18-80
    - city (string): City name
    - country (string): Country name
    - occupation (string): Job title
    Example format:
    [
        {{"name": "Alice Johnson", "age": 28, "city": "Seattle", "country": "USA", "occupation": "Software Engineer"}},
        {{"name": "Carlos Silva", "age": 35, "city": "São Paulo", "country": "Brazil", "occupation": "Marketing Manager"}}
    ]
    Return only valid JSON, no additional text.
    """

    def generate():
        try:
            # Use the new streaming method
            response = client.models.generate_content_stream(
                model="gemini-2.0-flash",
                contents=full_prompt
            )
            for chunk in response:
                if chunk.text:
                    yield chunk.text
        except Exception as e:
            yield json.dumps({"error": str(e)})

    # Flask streams any generator wrapped in a Response object
    return Response(
        generate(),
        mimetype='application/json',
        headers={'Cache-Control': 'no-cache'}
    )

# Alternative approach with generation config for more control
@app.route('/api/generate-with-config', methods=['POST'])
def generate_with_config():
    data = request.json
    count = data.get('count', 10)

    def generate():
        try:
            # You can also pass generation configuration
            response = client.models.generate_content_stream(
                model="gemini-2.0-flash",
                contents=f"Generate {count} diverse user profiles as a JSON array with name, age, city, country, and occupation fields.",
                config=types.GenerateContentConfig(
                    temperature=0.7,
                    max_output_tokens=2048,
                    response_mime_type="application/json"
                )
            )
            for chunk in response:
                if chunk.text:
                    yield chunk.text
        except Exception as e:
            yield json.dumps({"error": str(e)})

    return Response(
        generate(),
        mimetype='application/json',
        headers={'Cache-Control': 'no-cache'}
    )

if __name__ == '__main__':
    app.run(debug=True)
Node.js with Gemini Streaming
For comparison, here's the Node.js equivalent using Gemini:
// Express.js with Gemini API
import express from 'express';
import { GoogleGenerativeAI } from '@google/generative-ai';

const app = express();
const genAI = new GoogleGenerativeAI('YOUR_GEMINI_API_KEY');

app.use(express.json());

app.post('/api/generate-users', async (req, res) => {
  res.setHeader('Content-Type', 'application/json');
  res.setHeader('Transfer-Encoding', 'chunked');
  res.setHeader('Cache-Control', 'no-cache');

  const { count, prompt } = req.body;

  try {
    const model = genAI.getGenerativeModel({ model: 'gemini-1.5-flash' });
    const fullPrompt = `${prompt}. Return exactly ${count} users as a JSON array with fields: name, age, city, country.`;

    const result = await model.generateContentStream(fullPrompt);

    for await (const chunk of result.stream) {
      const chunkText = chunk.text();
      if (chunkText) {
        res.write(chunkText);
      }
    }
    res.end();
  } catch (error) {
    res.write(JSON.stringify({ error: error.message }));
    res.end();
  }
});

// Using OpenAI-compatible Gemini endpoint
app.post('/api/generate-users-openai', async (req, res) => {
  const { OpenAI } = await import('openai');
  const openai = new OpenAI({
    apiKey: 'YOUR_GEMINI_API_KEY',
    baseURL: 'https://generativelanguage.googleapis.com/v1beta/openai/'
  });

  res.setHeader('Content-Type', 'application/json');
  res.setHeader('Transfer-Encoding', 'chunked');

  const { count, prompt } = req.body;

  try {
    const completion = await openai.chat.completions.create({
      model: 'gemini-1.5-flash',
      messages: [{
        role: 'user',
        content: `${prompt}. Return exactly ${count} users as a JSON array.`
      }],
      stream: true,
    });

    for await (const chunk of completion) {
      const content = chunk.choices[0]?.delta?.content || '';
      if (content) {
        res.write(content);
      }
    }
    res.end();
  } catch (error) {
    res.write(JSON.stringify({ error: error.message }));
    res.end();
  }
});

app.listen(3000, () => {
  console.log('Server running on port 3000');
});
Performance Benefits
The impact of streaming on user experience is significant:
- Perceived Performance: Users see results immediately, which dramatically reduces perceived wait time
- Engagement: Users are more likely to wait when they see progress
- Scalability: The server can flush each chunk as it is generated instead of buffering entire responses in memory
- Error Recovery: You can detect and handle errors earlier in the process
Best Practices
- Always show loading states: Even with streaming, let users know something is happening
- Handle errors gracefully: Network issues can interrupt streams
- Implement retry logic: Streams can fail, so have a fallback
- Use TypeScript: Define your expected JSON structure for a better development experience (a combined sketch of this tip and the retry tip appears after this list)
- Test with slow connections: Ensure your streaming works well on slower networks
- Choose the right model: Gemini 1.5 Flash is optimized for speed, while Pro offers better quality
- Use structured output: When possible, use schema-based generation for more reliable JSON
- Consider the new SDK: For new projects, consider using the experimental google-genai SDK for Gemini 2.0 - it provides a cleaner, more unified interface
- Monitor token usage: Streaming doesn't reduce token consumption, so implement proper usage tracking
- Implement proper error boundaries: Streaming failures should gracefully fall back to non-streaming responses
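To make the TypeScript and retry tips above concrete, here's a small sketch that wraps the same streaming call used earlier. The UserProfile shape, attempt count, and backoff delay are illustrative assumptions - adapt them to your own API:
import { makeStreamingJsonRequest } from "http-streaming-request";

// Shape we expect each streamed snapshot to converge toward (assumed for illustration)
interface UserProfile {
  name: string;
  age: number;
  city: string;
  country: string;
}

// Restart the stream a few times before giving up; attempt count and
// backoff are illustrative values, not fixed requirements.
async function generateUsersWithRetry(
  onUpdate: (users: UserProfile[]) => void,
  maxAttempts = 3
): Promise<void> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      for await (const usersSoFar of makeStreamingJsonRequest({
        url: "/api/generate-users",
        method: "POST",
        payload: { count: 20 },
      })) {
        onUpdate(usersSoFar as UserProfile[]);
      }
      return; // stream finished successfully
    } catch (error) {
      if (attempt === maxAttempts) throw error;
      // Simple linear backoff before restarting the stream from scratch
      await new Promise((resolve) => setTimeout(resolve, 1000 * attempt));
    }
  }
}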
Conclusion
Streaming transforms the LLM user experience from frustrating waits to engaging real-time interactions. While streaming plain text is straightforward, structured JSON responses require specialized handling. Libraries like http-streaming-request make it possible to stream JSON responses while maintaining data integrity.
Whether you choose Python with FastAPI, Node.js with Express, or any other stack, the principles remain the same. And with APIs like Gemini offering both native streaming and OpenAI-compatible endpoints, you have flexibility in implementation.
The next time you're building an AI-powered feature, remember: your users don't want to wait for perfection—they want to see progress. Give them streaming, and watch your user engagement soar.
Ready to implement streaming in your next project? Start with the examples above and adapt them to your specific use case. Your users will thank you for the improved experience!