Malloc72P

Posted on • Originally published at blog.malloc72p.com

How Can We Stream AI Chat Messages Like ChatGPT?

Note: This post is a translated version of an article originally published on my personal blog. You can read the original Korean post here.

Can We Do It With Traditional HTTP?

When using services like ChatGPT, Claude, or Gemini, you'll notice that the AI's response is printed out on the screen bit by bit. How exactly is this implemented?

Fundamentally, web services operate on the HTTP protocol, which follows a request-response model: the client sends a request to the server, and the server sends back a single response.

However, what we want is for the server to send AI message tokens down to the client as soon as they are ready. The traditional request-response pair we are familiar with isn't quite cut out for this.

Can We Use WebSockets?

As an alternative, we could use WebSockets, but this isn't a great approach either. Here's why:

  • Streaming messages doesn't actually require a bidirectional channel. We only need the server to be able to respond in multiple chunks after the client makes a single request.
  • Since it's not HTTP, we can't leverage existing HTTP features out of the box, such as authentication (cookies, tokens), CORS, caching, and logging.
  • It makes scaling difficult.

Load balancer configuration is the biggest headache. In a typical web service, a load balancer distributes traffic across multiple servers. But if you use WebSockets, a specific server must maintain a persistent connection with the client. Therefore, you have to keep the connection alive using Sticky Sessions, which makes horizontal scaling difficult because traffic won't be distributed evenly.

The Answer is Server-Sent Events (SSE)

Actually, this problem can be solved with the HTTP protocol. A typical HTTP response sends the entire payload at once, like this:

HTTP/1.1 200 OK
Content-Length: 42
...

Because the response declares a Content-Length, the browser treats the response as complete once it has received exactly that many bytes.

On the other hand, an SSE response looks like this:

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: text/event-stream

data: chunk 1

data: chunk 2

...

data: chunk N

Notice that there is no Content-Length, and Transfer-Encoding is set to chunked. In this case, the browser assumes the response is not yet finished, keeps the connection open, and processes the data whenever it arrives.

Just by doing this, we can maintain the standard HTTP protocol while allowing the server to send a response in multiple chunks for a single request. This is how we can stream AI messages.

Simpler than you thought, right?

Well, seeing is believing, so let's implement it ourselves.

Implementing Message Streaming with SSE

In this tutorial, we will create a Next.js app and implement a Route Handler that responds with messages using SSE. First, let's set up the Next.js app.

Next.js App Setup

# Create a next app using boilerplate
pnpm create next-app@16.2.1 sse-exam --yes

# Navigate to the generated project
cd sse-exam

# Run the dev server
pnpm dev
We're all set. Now let's create the Route Handler.

Creating the Route Handler 1: Preparation

Create the app/api/route.ts file and paste the following code:
// Dummy tokens for testing
const tokens = [
  'Never gonna give you up ',
  'Never gonna let you down ',
  'Never gonna run around and desert you ',
  'Never gonna make you cry ',
  'Never gonna say goodbye ',
  'Never gonna tell a lie and hurt you ',
];

/**
 * A function to simulate an external API call by delaying execution.
 * @param ms Delay time in milliseconds
 */
async function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => {
    setTimeout(() => {
      resolve();
    }, ms);
  });
}
  • tokens is a list of dummy tokens we'll use for testing. We need some data for the server to respond with!
  • sleep is used to simulate the latency of calling an external API. In reality, calling the GPT or Anthropic API introduces latency, so we use this to mimic that behavior.

Creating the Route Handler 2: GET Request Handler

Now let's implement the handler for the GET request. Add the following function to the app/api/route.ts file:

export async function GET() {
  // Create a ReadableStream for SSE
  const stream = new ReadableStream({
    async start(controller) {
      const encoder = new TextEncoder();

      // Encode tokens and enqueue them into the controller
      for (const token of tokens) {
        // Format the data according to the SSE standard
        const encodedToken = encoder.encode(`data: ${JSON.stringify({ text: token })}\n\n`);

        await sleep(100);
        controller.enqueue(encodedToken);
      }

      controller.enqueue(encoder.encode('data: [DONE]\n\n'));
      controller.close();
    },
  });

  // Create and return a Response object containing the Stream
  return new Response(stream, {
    headers: {
      'Content-Type': 'text/event-stream',
      'Cache-Control': 'no-cache',
    },
  });
}

It might look complicated, but it's actually quite simple. You just create a ReadableStream object, put it inside a Response object, and return it.

Let's dig a little deeper.

Deep Dive into ReadableStream

  • ReadableStream is an object that reads and transmits data in chunks, rather than sending the entire payload at once.
  • You can provide an asynchronous start function as an argument to the constructor, where you can write the logic for reading and sending data in chunks.
  • To read and send a chunk, simply call enqueue on the controller parameter of the start callback.
  • Once all transmissions are complete, you can call close to end the stream.
  • A simplified version looks like this:
const stream = new ReadableStream({
  start(controller) {
    // Called once when the stream is created
    controller.enqueue('first chunk'); // Push data
    controller.enqueue('second chunk');
    controller.close(); // Close stream
  },
});
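As a variation on the simplified version above (a sketch, assuming a runtime with a global ReadableStream such as Node 18+ or a modern browser; streamFromGenerator is a hypothetical helper, not a built-in API), the same stream can also be fed from an async generator via the pull callback, which only produces data as fast as the consumer reads it:

```typescript
// Sketch: build a ReadableStream from an async generator.
async function* chunks() {
  yield 'first chunk';
  yield 'second chunk';
}

// streamFromGenerator is a hypothetical helper name, not a standard API.
function streamFromGenerator<T>(gen: AsyncGenerator<T>): ReadableStream<T> {
  return new ReadableStream<T>({
    // pull is called whenever the consumer is ready for more data,
    // so the generator is only advanced on demand (built-in backpressure).
    async pull(controller) {
      const { done, value } = await gen.next();
      if (done) {
        controller.close();
      } else {
        controller.enqueue(value as T);
      }
    },
  });
}
```

Using pull instead of start means the server never races ahead of a slow client, which can matter once the token list is replaced by a real model API.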

Deep Dive into Response

// Create and return a Response object containing the Stream
return new Response(stream, {
  headers: {
    'Content-Type': 'text/event-stream',
    'Cache-Control': 'no-cache',
  },
});
  • When creating the response object, we define a few attributes in the headers.
  • As mentioned earlier, we set Content-Type to text/event-stream.
  • Also, since we passed a ReadableStream as the body of the Response, the transfer-encoding is automatically set to chunked.
  • The client calling this API will read these attributes from the HTTP response headers and realize, "Ah, the server is sending the message in chunks. I should treat the result as a ReadableStream and read it accordingly."
  • Cache-Control isn't strictly required, but it's added defensively to prevent unexpected caching in production environments.
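One detail worth noting about the data: lines themselves: in the SSE wire format, every line of a payload needs its own data: prefix, and the blank line alone terminates the event. The tutorial sidesteps this by JSON-encoding each token, but if you ever send raw multi-line text, a small helper can do the formatting (a minimal sketch; formatSSEEvent is a hypothetical name, not part of the tutorial code):

```typescript
// Hypothetical helper (not part of the tutorial code): format a payload
// as one SSE event. Every line of the payload gets its own "data: "
// prefix; the trailing blank line marks the end of the event.
function formatSSEEvent(payload: string): string {
  return (
    payload
      .split('\n')
      .map((line) => `data: ${line}`)
      .join('\n') + '\n\n'
  );
}
```

A two-line payload becomes two data: lines followed by the blank terminator, which the browser's SSE parser joins back together with a newline.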

Midpoint Summary

That was a long explanation, so let's summarize before moving on.

First, the advantages and characteristics of SSE:

  • To stream fragmented AI messages down to the client, you should use the SSE approach.
  • Because SSE operates over the standard HTTP protocol, you can use existing authentication methods (cookies, tokens) as is.
  • Unlike WebSockets, it remains unidirectional, making horizontal scaling much easier.

How to implement SSE on the server:

  • Create a ReadableStream object and place it in the body of a Response object to send it back.
  • Set the Content-Type header to text/event-stream. By placing a stream object in the body, the transfer-encoding automatically becomes chunked.

Client-Side Implementation

Now let's see how the client calls the API that sends messages via SSE.

First, open app/page.tsx and modify it as follows:

'use client';

import { useState } from 'react';

export default function Home() {
  const [message, setMessage] = useState<string>('');

  const onClick = async () => {
    // Write the API call logic here!
  };

  return (
    <div className="w-160 h-175 bg-gray-50 border border-gray-300 shadow-md rounded-md mx-auto mt-10 overflow-hidden">
      <div className="flex justify-between items-center bg-gray-100 p-5 border-b border-b-gray-300">
        <h1 className="font-bold text-xl">SSE Exam</h1>
        <button
          className="px-4 py-2 bg-blue-700 rounded-md text-white font-bold hover:brightness-105 cursor-pointer"
          onClick={onClick}
        >
          Request AI
        </button>
      </div>

      <div className="p-5 overflow-auto">
        {message.length === 0 && <p className="text-gray-500 text-2xl">no data!</p>}
        {message.length > 0 && message}
      </div>
    </div>
  );
}
This is a simple UI that displays the message sent from the server via SSE when the Request AI button is clicked.

Now let's implement onClick. Write it like this:
const onClick = async () => {
  const response = await fetch('/api', { method: 'GET' });
  const reader = response.body?.getReader();

  if (!reader) {
    throw new Error('Invalid response. Failed to get reader.');
  }

  const decoder = new TextDecoder();

  while (true) {
    const { done, value } = await reader.read();

    if (done) break;

    // stream: true keeps multi-byte characters intact across chunk boundaries
    const lines = decoder.decode(value, { stream: true }).split('\n\n');

    for (const line of lines) {
      if (!line.startsWith('data: ') || line === 'data: [DONE]') continue;

      const { text } = JSON.parse(line.slice(6));
      setMessage((prev) => prev + text);
    }
  }
};

Since the server sent a ReadableStream in the body, the client must also read it as a ReadableStream.

You can easily extract the stream reader object by calling response.body.getReader().

We run a loop reading messages until there's no more data left (done).
At this point, you might worry that entering a while(true) loop will block the UI thread, but it's perfectly fine.
If you look closely, there is an await on reader.read(). Because this method is asynchronous, the event loop will process other tasks until the server sends a chunk.
Therefore, this code will not block the UI.
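One caveat about the parsing above: splitting each decoded chunk on "\n\n" assumes every read ends exactly on an event boundary. That usually holds for small local responses, but over a real network an event can be split across two reads. A stateful parser fixes this by buffering the trailing partial event (a sketch; createSSEParser is a hypothetical helper, not a standard API):

```typescript
// Hypothetical helper: an SSE event parser that tolerates chunks
// that end in the middle of an event.
function createSSEParser() {
  let buffer = '';

  return function parse(chunk: string): string[] {
    buffer += chunk;

    const events = buffer.split('\n\n');
    // The last element is either '' (chunk ended on a boundary) or an
    // incomplete event; keep it around until the next chunk completes it.
    buffer = events.pop() ?? '';

    return events
      .filter((event) => event.startsWith('data: ') && event !== 'data: [DONE]')
      .map((event) => event.slice(6));
  };
}
```

Inside the while loop, you would feed each decoded chunk to parse and then JSON.parse each string it returns.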

Checking Network Requests in DevTools

Now, when you press the Request AI button, you'll see the message being printed out piece by piece on the screen.
To see how the transmission actually happens, let's open the Network tab in DevTools.

First, here are the response message headers:

Header              Value
cache-control       no-cache
content-type        text/event-stream
transfer-encoding   chunked
As configured earlier, you can see that the content type and transfer encoding are set correctly.

Next, here is the list of messages actually received in the EventStream tab:

Type Data Time
message {"text":"Never gonna give you up "} 11:04:21.967
message {"text":"Never gonna let you down "} 11:04:22.179
message {"text":"Never gonna run around and desert you "} 11:04:22.280
message {"text":"Never gonna make you cry "} 11:04:22.380
message {"text":"Never gonna say goodbye "} 11:04:22.482
message {"text":"Never gonna tell a lie and hurt you "} 11:04:22.583

The EventStream tab is specific to SSE transmissions, showing exactly which messages were sent in chronological order.
It's truly fascinating that you can respond in chunks as soon as the data is ready using simple HTTP configurations and ReadableStream, without any complex WebSocket setup.

Honestly, before researching how this feature was implemented, I naturally assumed it used WebSockets. Finding out that it can be done over HTTP was a pleasant surprise.
Even outside of AI chatbots, HTTP chunked transfer—which underpins SSE—can be used when transmitting large files bit by bit. I read that combining this with the MediaSource API allows browsers to receive and play video or audio chunks seamlessly.

Conclusion

In this post, we explored how to leverage HTTP specifications to stream AI messages to the client, much like ChatGPT or Claude.
Before looking into it, I thought it would be difficult, assuming WebSockets were mandatory. However, I was surprised by how easy and intuitive it turned out to be.
I definitely want to use this approach when building my next AI-powered project.

Check out the complete source code for this tutorial in the sse-exam repository.

Thanks for reading this post!
