The Problem: LLM Latency is Killing Your User Experience
Picture this: Your user clicks "Generate Report" in your AI-powered app. They wait. And wait. And wait some more. 10 seconds later (which feels like an eternity in user time), a complete response finally appears. By then, they've probably already started questioning whether your app is broken.
This is the reality of working with Large Language Models (LLMs). Whether you're using OpenAI's GPT, Google's Gemini, or any other LLM API, response times typically range from 5-15 seconds for complex queries. In today's instant-gratification world, that's simply too slow.
The Solution: Streaming Creates the Illusion of Speed
Here's where streaming comes to the rescue. Instead of waiting for the complete response, streaming allows you to show partial results as they're generated. This creates a powerful psychological effect - users see immediate progress, making the wait feel much shorter.
Think about how ChatGPT works. It doesn't wait to generate the entire response before showing it to you. Instead, it streams the response token by token, creating that satisfying typewriter effect that keeps you engaged.
Basic Text Streaming Example
Here's how you can implement basic text streaming with Python using the new Google Gen AI SDK:
# Example with Google Gen AI SDK (Recommended)
from google import genai

client = genai.Client(api_key="YOUR_GEMINI_API_KEY")

response = client.models.generate_content_stream(
    model="gemini-2.0-flash",
    contents="Write a story about AI"
)

for chunk in response:
    if chunk.text:
        # Update UI with each chunk
        updateUI(chunk.text)
Alternatively, you can use the OpenAI-compatible endpoint in Python:
# Example with Gemini via OpenAI-compatible endpoint
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_GEMINI_API_KEY",
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)

response = client.chat.completions.create(
    model="gemini-1.5-flash",
    messages=[{"role": "user", "content": "Write a story about AI"}],
    stream=True
)

for chunk in response:
    content = chunk.choices[0].delta.content
    if content:
        # Update UI with each chunk
        updateUI(content)
This works beautifully for plain text responses. But what happens when you need structured data?
The JSON Streaming Challenge
Modern applications often require structured responses from LLMs. You might ask for:
- A list of recommendations in JSON format
- User profiles with specific fields
- Complex data structures for dashboards
Here's where things get tricky. When you stream JSON, you get something like this:
[{"name": "Jo
Try running JSON.parse() on that, and you'll get an error. The JSON is malformed because it's incomplete. You could wait for the complete response, but then you lose all the benefits of streaming.
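To see it fail concretely, here's a quick sketch - the partial string below is just an illustration of what an in-flight buffer might look like:
// A snapshot of the stream mid-flight: the array, object, and string are all unterminated
const partialBuffer = '[{"name": "Jo';

try {
  JSON.parse(partialBuffer);
} catch (err) {
  // SyntaxError - the buffer only becomes valid JSON once the stream completes
  console.error("Not parseable yet:", err);
}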
Enter JSON Streaming: The Best of Both Worlds
This is where specialized libraries like http-streaming-request come in handy. They solve the JSON streaming problem by providing well-formed JSON objects even when the underlying stream is incomplete.
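Conceptually, you can think of it as "completing" the partial buffer before parsing - closing any open strings, objects, and arrays. Here's a deliberately simplified sketch of that idea (illustrative only, not the library's actual implementation, and it ignores edge cases like trailing commas or dangling keys):
// Illustrative only - NOT how http-streaming-request is actually implemented.
// Track unclosed strings, braces, and brackets, then append the missing
// closers so the partial buffer parses as valid JSON.
function completePartialJson(partial: string): string {
  const closers: string[] = [];
  let inString = false;
  let escaped = false;

  for (const ch of partial) {
    if (escaped) { escaped = false; continue; }
    if (ch === "\\") { escaped = true; continue; }
    if (ch === '"') { inString = !inString; continue; }
    if (inString) continue;
    if (ch === "{") closers.push("}");
    else if (ch === "[") closers.push("]");
    else if (ch === "}" || ch === "]") closers.pop();
  }

  return partial + (inString ? '"' : "") + closers.reverse().join("");
}

// '[{"name": "Jo'  ->  '[{"name": "Jo"}]', which JSON.parse accepts
console.log(JSON.parse(completePartialJson('[{"name": "Jo')));
In practice you'll want the library rather than a hand-rolled completer, but the sketch shows why a partial stream can still yield usable JSON snapshots.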
Installation
npm install http-streaming-request
# or
yarn add http-streaming-request
For Python backends, you'll also need the appropriate Gemini SDK:
# New Google Gen AI SDK - recommended for Gemini 2.0
pip install google-genai
# Or legacy Google Generative AI SDK
pip install google-generativeai
# For OpenAI-compatible endpoints
pip install openai
Basic JSON Streaming Example
import { makeStreamingJsonRequest } from "http-streaming-request";

const stream = makeStreamingJsonRequest({
  url: "/api/generate-users",
  method: "POST",
  payload: { count: 10 }
});

for await (const data of stream) {
  // Even if the API only returns [{"name": "Jo
  // this will give you [{ "name": "Jo" }] - valid JSON!
  console.log(data);
  updateUserList(data);
}
Real-World React Example
Here's how you might implement this in a React application:
import React, { useState } from 'react';
import { makeStreamingJsonRequest } from "http-streaming-request";

const UserGenerator = () => {
  const [users, setUsers] = useState([]);
  const [isLoading, setIsLoading] = useState(false);

  const generateUsers = async () => {
    setIsLoading(true);
    setUsers([]);

    try {
      for await (const usersSoFar of makeStreamingJsonRequest({
        url: "/api/generate-users",
        method: "POST",
        payload: {
          count: 20,
          prompt: "Generate diverse user profiles with names, ages, and locations"
        }
      })) {
        setUsers(usersSoFar);
      }
    } catch (error) {
      console.error('Streaming error:', error);
    } finally {
      setIsLoading(false);
    }
  };

  return (
    <div>
      <button onClick={generateUsers} disabled={isLoading}>
        {isLoading ? 'Generating...' : 'Generate Users'}
      </button>
      <div className="user-grid">
        {users.map((user, index) => (
          <div key={index} className="user-card">
            <h3>{user.name}</h3>
            <p>Age: {user.age}</p>
            <p>Location: {user.city}, {user.country}</p>
          </div>
        ))}
      </div>
    </div>
  );
};
Using React Hooks for Cleaner Code
The library also provides a React hook for even cleaner implementation:
import { useJsonStreaming } from "http-streaming-request";

const UserGenerator = () => {
  const { data: users, run } = useJsonStreaming({
    url: "/api/generate-users",
    method: "POST",
  });

  const handleGenerate = () => {
    run({
      payload: {
        count: 20,
        prompt: "Generate diverse user profiles"
      }
    });
  };

  return (
    <div>
      <button onClick={handleGenerate}>Generate Users</button>
      {users && users.map((user, index) => (
        <div key={index} className="user-card">
          <h3>{user.name}</h3>
          <p>Age: {user.age}</p>
          <p>Location: {user.city}, {user.country}</p>
        </div>
      ))}
    </div>
  );
};
Advanced: Handling Malformed JSON
Sometimes LLMs generate slightly malformed JSON that can't be parsed. The library allows you to provide a repair function:
const stream = makeStreamingJsonRequest({
  url: "/api/generate-data",
  method: "POST",
  jsonRepairFunction: (data) => {
    // Fix common LLM JSON mistakes
    return data
      .replace(/([{,]\s*)(\w+):/g, '$1"$2":')  // Add quotes to keys
      .replace(/,\s*}/g, '}')                   // Remove trailing commas
      .replace(/,\s*]/g, ']');                  // Remove trailing commas in arrays
  }
});
Backend Implementation Examples
Python with New Google Gen AI SDK
Google has released a new unified SDK for Gemini 2.0 that provides a cleaner interface. Here's how to use it for streaming:
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from google import genai
from typing import Dict, Any

app = FastAPI()

# Configure the new Gen AI client
client = genai.Client(api_key="YOUR_GEMINI_API_KEY")

@app.post("/api/generate-users-v2")
async def generate_users_v2(request: Dict[str, Any]):
    count = request.get("count", 10)
    prompt = request.get("prompt", "Generate user profiles")

    # Create the prompt for structured JSON output
    full_prompt = f"""
    {prompt}. Return exactly {count} users as a JSON array with fields: name, age, city, country.
    Example format:
    [
        {{"name": "John Doe", "age": 30, "city": "New York", "country": "USA"}},
        {{"name": "Jane Smith", "age": 25, "city": "London", "country": "UK"}}
    ]
    Return only valid JSON, no additional text.
    """

    def generate():
        try:
            # Use the new streaming method
            response = client.models.generate_content_stream(
                model="gemini-2.0-flash",
                contents=full_prompt
            )
            for chunk in response:
                if chunk.text:
                    yield chunk.text
        except Exception as e:
            yield f"Error: {str(e)}"

    return StreamingResponse(
        generate(),
        media_type="application/json",
        headers={"Cache-Control": "no-cache"}
    )

# Simple example for testing
@app.get("/api/test-stream")
async def test_stream():
    def generate():
        try:
            response = client.models.generate_content_stream(
                model="gemini-2.0-flash",
                contents="Write a story about a magic backpack."
            )
            for chunk in response:
                if chunk.text:
                    yield chunk.text
                    yield "\n" + "_" * 80 + "\n"  # Separator for clarity
        except Exception as e:
            yield f"Error: {str(e)}"

    return StreamingResponse(
        generate(),
        media_type="text/plain",
        headers={"Cache-Control": "no-cache"}
    )
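If you want to sanity-check that the backend really is streaming before wiring in any JSON-aware library, you can consume the raw bytes in the browser with the standard Fetch API and its ReadableStream reader. A minimal sketch, assuming a same-origin server and a placeholder element with id "output":
// Minimal consumer for the /api/test-stream endpoint above
async function readTestStream(): Promise<void> {
  const response = await fetch("/api/test-stream");
  if (!response.body) throw new Error("Streaming not supported in this environment");

  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  const output = document.getElementById("output"); // assumed placeholder element

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // Each value is a Uint8Array chunk; decode and append it as it arrives
    if (output) output.textContent += decoder.decode(value, { stream: true });
  }
}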
Python with Flask and Structured Output
For more control over JSON structure, you can use the new Gen AI SDK with Flask:
from flask import Flask, request, Response
from google import genai
from google.genai import types
import json

app = Flask(__name__)

# Configure the new Gen AI client
client = genai.Client(api_key="YOUR_GEMINI_API_KEY")

@app.route('/api/generate-structured', methods=['POST'])
def generate_structured():
    data = request.json
    count = data.get('count', 10)
    prompt = data.get('prompt', 'Generate user profiles')

    # Create structured prompt for JSON output
    full_prompt = f"""
    {prompt}. Generate exactly {count} diverse user profiles.
    Return a JSON array where each user has these fields:
    - name (string): Full name
    - age (integer): Age between 18-80
    - city (string): City name
    - country (string): Country name
    - occupation (string): Job title
    Example format:
    [
        {{"name": "Alice Johnson", "age": 28, "city": "Seattle", "country": "USA", "occupation": "Software Engineer"}},
        {{"name": "Carlos Silva", "age": 35, "city": "São Paulo", "country": "Brazil", "occupation": "Marketing Manager"}}
    ]
    Return only valid JSON, no additional text.
    """

    def generate():
        try:
            # Use the new streaming method
            response = client.models.generate_content_stream(
                model="gemini-2.0-flash",
                contents=full_prompt
            )
            for chunk in response:
                if chunk.text:
                    yield chunk.text
        except Exception as e:
            yield json.dumps({"error": str(e)})

    # Flask streams any generator wrapped in a Response object
    return Response(
        generate(),
        mimetype='application/json',
        headers={'Cache-Control': 'no-cache'}
    )

# Alternative approach with generation config for more control
@app.route('/api/generate-with-config', methods=['POST'])
def generate_with_config():
    data = request.json
    count = data.get('count', 10)

    def generate():
        try:
            # You can also pass generation configuration
            response = client.models.generate_content_stream(
                model="gemini-2.0-flash",
                contents=f"Generate {count} diverse user profiles as a JSON array with name, age, city, country, and occupation fields.",
                config=types.GenerateContentConfig(
                    temperature=0.7,
                    max_output_tokens=2048,
                    response_mime_type="application/json"
                )
            )
            for chunk in response:
                if chunk.text:
                    yield chunk.text
        except Exception as e:
            yield json.dumps({"error": str(e)})

    return Response(
        generate(),
        mimetype='application/json',
        headers={'Cache-Control': 'no-cache'}
    )

if __name__ == '__main__':
    app.run(debug=True)
Node.js with Gemini Streaming
For comparison, here's the Node.js equivalent using Gemini:
// Express.js with Gemini API
import express from 'express';
import { GoogleGenerativeAI } from '@google/generative-ai';

const app = express();
const genAI = new GoogleGenerativeAI('YOUR_GEMINI_API_KEY');

app.use(express.json());

app.post('/api/generate-users', async (req, res) => {
  res.setHeader('Content-Type', 'application/json');
  res.setHeader('Transfer-Encoding', 'chunked');
  res.setHeader('Cache-Control', 'no-cache');

  const { count, prompt } = req.body;

  try {
    const model = genAI.getGenerativeModel({ model: 'gemini-1.5-flash' });
    const fullPrompt = `${prompt}. Return exactly ${count} users as a JSON array with fields: name, age, city, country.`;

    const result = await model.generateContentStream(fullPrompt);

    for await (const chunk of result.stream) {
      const chunkText = chunk.text();
      if (chunkText) {
        res.write(chunkText);
      }
    }
    res.end();
  } catch (error) {
    res.write(JSON.stringify({ error: error.message }));
    res.end();
  }
});

// Using OpenAI-compatible Gemini endpoint
app.post('/api/generate-users-openai', async (req, res) => {
  const { OpenAI } = await import('openai');
  const openai = new OpenAI({
    apiKey: 'YOUR_GEMINI_API_KEY',
    baseURL: 'https://generativelanguage.googleapis.com/v1beta/openai/'
  });

  res.setHeader('Content-Type', 'application/json');
  res.setHeader('Transfer-Encoding', 'chunked');

  const { count, prompt } = req.body;

  try {
    const completion = await openai.chat.completions.create({
      model: 'gemini-1.5-flash',
      messages: [{
        role: 'user',
        content: `${prompt}. Return exactly ${count} users as a JSON array.`
      }],
      stream: true,
    });

    for await (const chunk of completion) {
      const content = chunk.choices[0]?.delta?.content || '';
      if (content) {
        res.write(content);
      }
    }
    res.end();
  } catch (error) {
    res.write(JSON.stringify({ error: error.message }));
    res.end();
  }
});

app.listen(3000, () => {
  console.log('Server running on port 3000');
});
Performance Benefits
The impact of streaming on user experience is significant:
- Perceived Performance: Users see results immediately, which dramatically reduces perceived wait time
- Engagement: Users are more likely to wait when they see progress
- Scalability: The server can flush each chunk as it is generated instead of buffering entire responses in memory
- Error Recovery: You can detect and handle errors earlier in the process
Best Practices
- Always show loading states: Even with streaming, let users know something is happening
- Handle errors gracefully: Network issues can interrupt streams
- Implement retry logic: Streams can fail, so have a fallback
- Use TypeScript: Define your expected JSON structure for a better development experience (a combined sketch of this tip and the retry tip appears after this list)
- Test with slow connections: Ensure your streaming works well on slower networks
- Choose the right model: Gemini 1.5 Flash is optimized for speed, while Pro offers better quality
- Use structured output: When possible, use schema-based generation for more reliable JSON
- Consider the new SDK: For new projects, consider using the experimental google-genai SDK for Gemini 2.0 - it provides a cleaner, more unified interface
- Monitor token usage: Streaming doesn't reduce token consumption, so implement proper usage tracking
- Implement proper error boundaries: Streaming failures should gracefully fall back to non-streaming responses
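To make the TypeScript and retry tips above concrete, here's a small sketch that wraps the same streaming call used earlier. The UserProfile shape, attempt count, and backoff delay are illustrative assumptions - adapt them to your own API:
import { makeStreamingJsonRequest } from "http-streaming-request";

// Shape we expect each streamed snapshot to converge toward (assumed for illustration)
interface UserProfile {
  name: string;
  age: number;
  city: string;
  country: string;
}

// Restart the stream a few times before giving up; attempt count and
// backoff are illustrative values, not fixed requirements.
async function generateUsersWithRetry(
  onUpdate: (users: UserProfile[]) => void,
  maxAttempts = 3
): Promise<void> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      for await (const usersSoFar of makeStreamingJsonRequest({
        url: "/api/generate-users",
        method: "POST",
        payload: { count: 20 },
      })) {
        onUpdate(usersSoFar as UserProfile[]);
      }
      return; // stream finished successfully
    } catch (error) {
      if (attempt === maxAttempts) throw error;
      // Simple linear backoff before restarting the stream from scratch
      await new Promise((resolve) => setTimeout(resolve, 1000 * attempt));
    }
  }
}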
Conclusion
Streaming transforms the LLM user experience from frustrating waits to engaging real-time interactions. While streaming plain text is straightforward, structured JSON responses require specialized handling. Libraries like http-streaming-request make it possible to stream JSON responses while maintaining data integrity.
Whether you choose Python with FastAPI, Node.js with Express, or any other stack, the principles remain the same. And with APIs like Gemini offering both native streaming and OpenAI-compatible endpoints, you have flexibility in implementation.
The next time you're building an AI-powered feature, remember: your users don't want to wait for perfection—they want to see progress. Give them streaming, and watch your user engagement soar.
Ready to implement streaming in your next project? Start with the examples above and adapt them to your specific use case. Your users will thank you for the improved experience!