Note: This post is a translated version of an article originally published on my personal blog. You can read the original Korean post here.
How Can We Stream AI Chat Messages Like ChatGPT?
Can We Do It With Traditional HTTP?
When using services like ChatGPT, Claude, or Gemini, you'll notice that the AI's response is printed out on the screen bit by bit. How exactly is this implemented?
Fundamentally, web services operate on the HTTP protocol. This protocol works as unidirectional communication: the client sends a request to the server, and the server sends back a single response.
However, what we want is for the server to send AI message tokens down to the client as soon as they are ready. The traditional request-response pair we are familiar with isn't quite cut out for this.
Can We Use WebSockets?
As an alternative, we could use WebSockets, but this isn't a great approach either. Here's why:
- Streaming messages doesn't actually require a bidirectional channel. We only need the server to be able to respond in multiple chunks after the client makes a single request.
- Since it's not HTTP, we can't leverage existing HTTP features out of the box, such as authentication (cookies, tokens), CORS, caching, and logging.
- It makes scaling difficult.
Load balancer configuration is the biggest headache. In a typical web service, a load balancer distributes traffic across multiple servers. But if you use WebSockets, a specific server must maintain a persistent connection with the client. Therefore, you have to keep the connection alive using Sticky Sessions, which makes horizontal scaling difficult because traffic won't be distributed evenly.
The Answer is Server-Sent Events (SSE)
Actually, this problem can be solved with the HTTP protocol. A typical HTTP response sends the entire payload at once, like this:
HTTP/1.1 200 OK
Content-Length: 42
...
Because there is a Content-Length, the browser closes the connection once it receives that amount of data.
On the other hand, an SSE response looks like this:
HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: text/event-stream
data: chunk 1
data: chunk 2
...
data: chunk N
Notice that there is no Content-Length, and Transfer-Encoding is set to chunked. In this case, the browser assumes the response is not yet finished, keeps the connection open, and processes the data whenever it arrives.
Just by doing this, we can maintain the standard HTTP protocol while allowing the server to send a response in multiple chunks for a single request. This is how we can stream AI messages.
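To make the wire format concrete, here is a tiny sketch of how a server frames each chunk as an SSE event: a `data:` prefix followed by a blank line. (The helper name `toSseEvent` is mine, not part of any spec.)

```typescript
// Frame a payload as a single SSE event: "data: <payload>\n\n".
// The trailing blank line is what separates one event from the next on the wire.
function toSseEvent(payload: string): string {
  return `data: ${payload}\n\n`;
}

console.log(JSON.stringify(toSseEvent('chunk 1'))); // → "data: chunk 1\n\n"
```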
Simpler than you thought, right?
Well, seeing is believing, so let's implement it ourselves.
Implementing Message Streaming with SSE
- In this tutorial, we will create a Next.js app and implement a Route Handler that responds with messages using SSE. First, let's set up the Next.js app.
Next.js App Setup
# Create a next app using boilerplate
pnpm create next-app@16.2.1 sse-exam --yes
# Navigate to the generated project
cd sse-exam
# Run the dev server
pnpm dev
- We're all set. Now let's create the Route Handler.
Creating the Route Handler 1: Preparation
- Create the `app/api/route.ts` file and paste the following code:
// Dummy tokens for testing
const tokens = [
'Never gonna give you up ',
'Never gonna let you down ',
'Never gonna run around and desert you ',
'Never gonna make you cry ',
'Never gonna say goodbye ',
'Never gonna tell a lie and hurt you ',
];
/**
* A function to simulate an external API call by delaying execution.
* @param ms Delay time in milliseconds
*/
async function sleep(ms: number): Promise<void> {
return new Promise((resolve) => {
setTimeout(() => {
resolve();
}, ms);
});
}
- `tokens` is a list of dummy tokens we'll use for testing. We need some data for the server to respond with!
- `sleep` is used to simulate the latency of calling an external API. In reality, calling the GPT or Anthropic API introduces latency, so we use this to mimic that behavior.
Creating the Route Handler 2: GET Request Handler
Now let's implement the handler for the GET request. Add the following function to the app/api/route.ts file:
export async function GET() {
// Create a ReadableStream for SSE
const stream = new ReadableStream({
async start(controller) {
const encoder = new TextEncoder();
// Encode tokens and enqueue them into the controller
for (const token of tokens) {
// Format the data according to the SSE standard
const encodedToken = encoder.encode(`data: ${JSON.stringify({ text: token })}\n\n`);
await sleep(100);
controller.enqueue(encodedToken);
}
controller.enqueue(encoder.encode('data: [DONE]\n\n'));
controller.close();
},
});
// Create and return a Response object containing the Stream
return new Response(stream, {
headers: {
'Content-Type': 'text/event-stream',
'Cache-Control': 'no-cache',
},
});
}
It might look complicated, but it's actually quite simple. You just create a ReadableStream object, put it inside a Response object, and return it.
Let's dig a little deeper.
Deep Dive into ReadableStream
- `ReadableStream` is an object that reads and transmits data in chunks, rather than sending the entire payload at once.
- You can provide an asynchronous `start` function as an argument to the constructor, where you can write the logic for reading and sending data in chunks.
- To read and send a chunk, simply call `enqueue` on the `controller` parameter of the `start` callback.
- Once all transmissions are complete, call `close` to end the stream.
- A simplified version looks like this:
const stream = new ReadableStream({
start(controller) {
// Called once when the stream is created
controller.enqueue('first chunk'); // Push data
controller.enqueue('second chunk');
controller.close(); // Close stream
},
});
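As a quick sanity check, you can also consume such a stream yourself. Node 18+ ships the same WHATWG `ReadableStream`, so this sketch (the `readAll` helper is my own) runs outside the browser too:

```typescript
// A stream that emits two chunks and then closes, as in the simplified example
const stream = new ReadableStream<string>({
  start(controller) {
    controller.enqueue('first chunk');
    controller.enqueue('second chunk');
    controller.close();
  },
});

// Drain a stream into an array of chunks via its reader
async function readAll(source: ReadableStream<string>): Promise<string[]> {
  const reader = source.getReader();
  const chunks: string[] = [];
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    chunks.push(value);
  }
  return chunks;
}

readAll(stream).then((chunks) => console.log(chunks)); // → [ 'first chunk', 'second chunk' ]
```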
Deep Dive into Response
// Create and return a Response object containing the Stream
return new Response(stream, {
headers: {
'Content-Type': 'text/event-stream',
'Cache-Control': 'no-cache',
},
});
- When creating the response object, we define a few attributes in the headers.
- As mentioned earlier, we set `Content-Type` to `text/event-stream`.
- Also, since we passed a `ReadableStream` as the body of the `Response`, the `Transfer-Encoding` is automatically set to `chunked`.
- The client calling this API will read these attributes from the HTTP response headers and realize, "Ah, the server is sending the message in chunks. I should treat the result as a `ReadableStream` and read it accordingly."
- `Cache-Control` isn't strictly required, but it's added defensively to prevent unexpected caching in production environments.
Midpoint Summary
That was a long explanation, so let's summarize before moving on.
First, the advantages and characteristics of SSE:
- To stream fragmented AI messages down to the client, you should use the SSE approach.
- Because SSE operates over the standard HTTP protocol, you can use existing authentication methods (cookies, tokens) as is.
- Unlike WebSockets, it remains unidirectional, making horizontal scaling much easier.
How to implement SSE on the server:
- Create a `ReadableStream` object and place it in the body of a `Response` object to send it back.
- Set the `Content-Type` header to `text/event-stream`. By placing a stream object in the body, the `Transfer-Encoding` automatically becomes `chunked`.
Client-Side Implementation
Now let's see how the client calls the API that sends messages via SSE.
First, open app/page.tsx and modify it as follows:
'use client';
import { useState } from 'react';
export default function Home() {
const [message, setMessage] = useState<string>('');
const onClick = async () => {
// Write the API call logic here!
};
return (
<div className="w-160 h-175 bg-gray-50 border border-gray-300 shadow-md rounded-md mx-auto mt-10 overflow-hidden">
<div className="flex justify-between items-center bg-gray-100 p-5 border-b border-b-gray-300">
<h1 className="font-bold text-xl">SSE Exam</h1>
<button
className="px-4 py-2 bg-blue-700 rounded-md text-white font-bold hover:brightness-105 cursor-pointer"
onClick={onClick}
>
Request AI
</button>
</div>
<div className="p-5 overflow-auto">
{message.length === 0 && <p className="text-gray-500 text-2xl">no data!</p>}
{message.length > 0 && message}
</div>
</div>
);
}
- This is a simple UI that displays the message sent from the server via SSE when the `Request AI` button is clicked.
- Now let's implement `onClick`. Write it like this:
const onClick = async () => {
  const response = await fetch('/api', { method: 'GET' });
  const reader = response.body?.getReader();
  if (!reader) {
    throw new Error('Invalid response. Failed to get reader.');
  }
  const decoder = new TextDecoder();
  let buffer = '';
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // stream: true keeps multi-byte characters intact across chunk boundaries
    buffer += decoder.decode(value, { stream: true });
    // A single read may deliver several events, or only part of one
    const events = buffer.split('\n\n');
    // Keep any incomplete trailing event in the buffer for the next read
    buffer = events.pop() ?? '';
    for (const event of events) {
      if (!event.startsWith('data: ') || event === 'data: [DONE]') continue;
      const { text } = JSON.parse(event.slice(6));
      setMessage((prev) => prev + text);
    }
  }
};
Since the server sent a ReadableStream in the body, the client must also read it as a ReadableStream.
You can easily extract the stream reader object by calling response.body.getReader().
We run a loop reading messages until there's no more data left (done).
At this point, you might worry that entering a while(true) loop will block the UI thread, but it's perfectly fine.
If you look closely, there is an await on reader.read(). Because this method is asynchronous, the event loop will process other tasks until the server sends a chunk.
Therefore, this code will not block the UI.
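The parsing step inside the loop can also be pulled out into a pure function, which makes the logic easy to unit-test in isolation. This is a simplified sketch (the name `extractTexts` is my own) that assumes it receives complete events:

```typescript
// Pull the `text` payloads out of one decoded SSE chunk.
// A chunk may carry several events, each terminated by a blank line.
function extractTexts(chunk: string): string[] {
  return chunk
    .split('\n\n')
    .filter((event) => event.startsWith('data: ') && event !== 'data: [DONE]')
    .map((event) => JSON.parse(event.slice(6)).text as string);
}

console.log(extractTexts('data: {"text":"Never gonna give you up "}\n\ndata: [DONE]\n\n'));
// → [ 'Never gonna give you up ' ]
```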
Checking Network Requests in DevTools
Now, when you press the Request AI button, you'll see the message being printed out piece by piece on the screen.
To see how the transmission actually happens, let's open the Network tab in DevTools.
First, here are the response message headers:
| Header | Value |
|---|---|
| cache-control | no-cache |
| content-type | text/event-stream |
| transfer-encoding | chunked |
- As configured earlier, you can see that the content type and transfer encoding are set correctly.
Next, here is the list of messages actually received in the EventStream tab:
| Type | Data | Time |
|---|---|---|
| message | {"text":"Never gonna give you up "} | 11:04:21.967 |
| message | {"text":"Never gonna let you down "} | 11:04:22.179 |
| message | {"text":"Never gonna run around and desert you "} | 11:04:22.280 |
| message | {"text":"Never gonna make you cry "} | 11:04:22.380 |
| message | {"text":"Never gonna say goodbye "} | 11:04:22.482 |
| message | {"text":"Never gonna tell a lie and hurt you "} | 11:04:22.583 |
The EventStream tab is specific to SSE transmissions, showing exactly which messages were sent in chronological order.
It's truly fascinating that you can respond in chunks as soon as the data is ready using simple HTTP configurations and ReadableStream, without any complex WebSocket setup.
Honestly, before researching how this feature was implemented, I naturally assumed it used WebSockets. Finding out that it can be done over HTTP was a pleasant surprise.
Even outside of AI chatbots, HTTP chunked transfer—which underpins SSE—can be used when transmitting large files bit by bit. I read that combining this with the MediaSource API allows browsers to receive and play video or audio chunks seamlessly.
Conclusion
In this post, we explored how to leverage HTTP specifications to stream AI messages to the client, much like ChatGPT or Claude.
Before looking into it, I thought it would be difficult, assuming WebSockets were mandatory. However, I was surprised by how easy and intuitive it turned out to be.
I definitely want to use this approach when building my next AI-powered project.
Check out the complete source code for this tutorial in the sse-exam repository.
Thanks for reading this post!