Introduction
In the previous post, we built a simple pizza chatbot using the OpenAI SDK that could handle pizza orders. We discussed several features necessary for a consistent chatbot experience, such as streaming LLM responses, parsing markdown content generated by the LLM into HTML, and exploring alternative LLMs and SDKs for potentially more cost-effective options.
In this second part on building a pizza chatbot with Node.js, we'll implement these features. If you haven’t read the first part yet, I recommend doing so before diving into this post.
Streaming OpenAI Responses in an Express API
To enable streaming responses in our Express API, we need to pass the stream: true parameter to the OpenAI SDK. Instead of waiting for the LLM to generate the entire response, this approach streams the data as it arrives, enhancing the chat experience—especially for longer responses. This is the same technique used in ChatGPT-like interfaces to provide a smoother and more interactive user experience.
app.post("/api/chat-streaming", async (req, res) => {
  const conversations = req.body.conversations;
  try {
    const response = await openai.chat.completions.create({
      model: "gpt-4o-mini",
      messages: [...context, ...conversations],
      stream: true,
    });

    // Convert the SDK response into a stream of newline-delimited JSON chunks
    const stream = response.toReadableStream();
    const reader = stream.getReader();
    const decoder = new TextDecoder();
    let assistantMessage = "";

    while (true) {
      const { done, value } = await reader.read();
      if (done) break;

      const chunk = decoder.decode(value);
      const lines = chunk.split("\n");

      for (const line of lines) {
        if (line.trim() === "") continue;
        const json = JSON.parse(line);
        const content = json.choices[0]?.delta?.content || "";
        assistantMessage += content;
        // Forward each piece of the response to the client as soon as it arrives
        res.write(content);
      }
    }

    return res.end();
  } catch (error) {
    console.error("Error during streaming:", error);
    return res.status(500).json({ error: "An error occurred during streaming" });
  }
});
The streaming part involves getting a continuous flow of data from OpenAI's API and sending it to the client in real-time. Here’s how it works:
- response.toReadableStream(): The response object from the OpenAI API call has a method called toReadableStream(). This converts the response into a stream that can be read piece by piece instead of all at once. Think of a stream like a faucet where water (data) flows gradually, instead of filling a whole bucket and then using it. You get the data as soon as it arrives.
- const reader = stream.getReader(): The stream is read using a reader. The getReader() method gives us a way to read the stream's data chunk by chunk.
- const decoder = new TextDecoder(): The TextDecoder is used to convert the chunks of data (which arrive in a raw binary format) into readable text.
- Reading the Stream in a Loop: The while (true) loop continuously reads from the stream until there is no more data. Each call to await reader.read() returns the next chunk along with two values: done, a flag that tells us whether the stream has finished (true when there's no more data), and value, the actual chunk of data.
- Processing the Data: Each chunk of data is decoded into text using decoder.decode(value). The decoded text (which might contain several lines) is split into individual lines using chunk.split("\n"). The loop goes through each line and skips any line that is empty (just a newline character).
- Sending Data to the Client: Each non-empty line is parsed as JSON (JSON.parse(line)), and the relevant content (the AI's message) is extracted. res.write(content) sends this content to the client as soon as it's available, so the client starts receiving parts of the AI's response immediately, without waiting for the whole response to be ready.
- Ending the Response: Once the stream has no more data (done becomes true), the loop exits, and res.end() is called to signal that the response is complete.
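For reference, each line the handler passes to JSON.parse is one serialized chat-completion chunk. The exact fields depend on the SDK and model version, so take this as a rough sketch of the shape rather than the exact payload:

// Rough sketch of a single parsed line — the field values here are made up
const exampleChunk = {
  id: "chatcmpl-abc123",
  object: "chat.completion.chunk",
  choices: [
    {
      index: 0,
      delta: { content: "Sure! Here's our menu" }, // the piece we append and forward
      finish_reason: null, // becomes "stop" on the final chunk
    },
  ],
};

// This shape is why the handler reads json.choices[0]?.delta?.content || ""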
Receiving Streams on the Frontend
Now that we have the backend code set up for streaming, let’s look at how to handle this data on the frontend. Replace the script code from the previous tutorial with the following:
<script>
  document.addEventListener("DOMContentLoaded", () => {
    const conversations = [];
    const initialMessage =
      "Hi there! 😊 Ready to place your pizza order 🍕 or have any questions? Let me know!";
    addMessageToChat("Assistant", initialMessage, "assistant");

    document
      .getElementById("chat-form")
      .addEventListener("submit", async (event) => {
        event.preventDefault();

        const userMessage = document.getElementById("message-input").value;
        if (!userMessage) return;

        addMessageToChat("User", userMessage, "user");
        conversations.push({ role: "user", content: userMessage });
        document.getElementById("message-input").value = "";

        const response = await fetch("/api/chat-streaming", {
          method: "POST",
          headers: { "Content-Type": "application/json" },
          body: JSON.stringify({ conversations }),
        });

        const reader = response.body.getReader();
        const decoder = new TextDecoder();
        let assistantMessage = "";

        // Create the assistant message element once
        const assistantMessageElement = addMessageToChat(
          "Assistant",
          "",
          "assistant"
        );

        while (true) {
          const { done, value } = await reader.read();
          if (done) break;

          const chunk = decoder.decode(value);
          assistantMessage += chunk;

          // Update the existing message element with the latest content
          updateMessageInChat(assistantMessageElement, assistantMessage);
        }

        conversations.push({
          role: "assistant",
          content: assistantMessage,
        });
      });

    function addMessageToChat(sender, message, className) {
      const chat = document.getElementById("chat-area");
      const messageElement = document.createElement("div");
      messageElement.className = `chat-message ${className}`;
      messageElement.innerHTML = `<strong>${sender}:</strong> <span class="message-content">${message}</span>`;
      chat.appendChild(messageElement);
      chat.scrollTop = chat.scrollHeight;
      return messageElement.querySelector(".message-content"); // Return the span where the message content is displayed
    }

    function updateMessageInChat(element, message) {
      element.innerHTML = message;
      const chat = document.getElementById("chat-area");
      chat.scrollTop = chat.scrollHeight; // Keep the chat scrolled to the bottom
    }
  });
</script>
Let's break down the streaming part of our frontend code, focusing on how messages from the backend are received and displayed in real-time.
- Sending the User Message and Making the POST Request: When the user submits a message in the chat, it's added to the conversations array and displayed in the chat window. A POST request is then sent to the /api/chat-streaming endpoint with the conversations array in the request body.
- Receiving the Stream: Once the POST request is sent, the frontend starts receiving a streamed response from the server. The response.body.getReader() method is used to create a reader that reads the incoming stream of data chunk by chunk.
- Decoding the Stream: A TextDecoder is used to convert each chunk of raw data into a readable text string.
- Creating the Assistant Message Element: When the streaming starts, an empty message element is created for the assistant using the addMessageToChat function. This element is added to the chat area, but it starts with no content because the full response hasn't been received yet. The addMessageToChat function returns the specific span element where the assistant's message content will be displayed.
- Updating the Assistant Message in Real Time: As chunks of the assistant's message are received from the stream, the message content is gradually built up in the assistantMessage variable. After each chunk is received, the updateMessageInChat function is called to update the content of the already-created message element in the chat. The assistant's message appears to type out in real time, giving the user immediate feedback as the AI generates its response.
- Finalizing the Message: Once the stream is fully read (indicated by done being true), the full message from the assistant is added to the conversations array.
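One small detail worth knowing about the decoding step: if a multi-byte character (an emoji in the assistant's reply, for example) happens to be split across two network chunks, decoding each chunk on its own can garble it. TextDecoder has a streaming mode for exactly this case. Here is a minimal sketch of the adjusted reading loop; only the decode() call changes:

// Minimal sketch: same loop as in the script above, but decoding in streaming mode.
// reader, assistantMessageElement, and updateMessageInChat come from that script.
const decoder = new TextDecoder();
let assistantMessage = "";

while (true) {
  const { done, value } = await reader.read();
  if (done) break;

  // { stream: true } tells the decoder more input may follow, so it buffers
  // incomplete byte sequences instead of emitting replacement characters
  assistantMessage += decoder.decode(value, { stream: true });
  updateMessageInChat(assistantMessageElement, assistantMessage);
}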
Parsing Markdown to HTML on the Frontend
One issue we’ve encountered is that while the LLM generates great markdown text, we're unable to display it properly on the frontend. To resolve this, we'll use the marked library to parse markdown into HTML. First, include it in your HTML header:
<script src="https://cdn.jsdelivr.net/npm/marked/marked.min.js"></script>
Next, update your updateMessageInChat function to integrate the marked library. This will convert the streamed content from markdown to HTML in real time:
function updateMessageInChat(element, message) {
  // Convert markdown to HTML using marked.js
  const htmlContent = marked.parse(message);
  element.innerHTML = htmlContent; // Update the message content with HTML
  const chat = document.getElementById("chat-area");
  chat.scrollTop = chat.scrollHeight; // Keep the chat scrolled to the bottom
}
Now, run the code and send this message: show me the menu please. Watch how the LLM streams the response in real time and how we beautifully render the markdown.
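If you're curious what marked.parse actually produces, here's a quick sketch using a made-up menu snippet (the exact output whitespace and attributes can differ between marked versions):

// Sketch: a hypothetical markdown fragment like the one the LLM streams back
const markdown = "## Menu\n\n- Margherita - $8\n- Pepperoni - $10";
const html = marked.parse(markdown);
// html is roughly:
// "<h2>Menu</h2>\n<ul>\n<li>Margherita - $8</li>\n<li>Pepperoni - $10</li>\n</ul>\n"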
Using LLaMA from Together AI
In one of our previous posts, we highlighted the importance of choosing the right LLM for the right job, as it directly impacts both pricing and performance. For instance, there's no need to use the latest and most expensive LLM for simple tasks such as generating emails.
In this section, we'll switch from OpenAI's GPT-4o Mini to Meta's LLaMA 3 8B from Together AI. Here’s a comparison of their pricing:
- GPT-4o Mini: $0.15 per million input tokens, $0.60 per million output tokens
- Meta's LLaMA 3 8B: $0.055 per million input tokens, $0.055 per million output tokens
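To put those numbers in perspective, here's a rough back-of-the-envelope comparison for a single chat request; the token counts are made up purely for illustration:

// Hypothetical request: ~2,000 input tokens and ~500 output tokens,
// using the per-million-token prices listed above
const inputTokens = 2000;
const outputTokens = 500;

const gpt4oMiniCost = (inputTokens / 1e6) * 0.15 + (outputTokens / 1e6) * 0.6;
const llama3Cost = ((inputTokens + outputTokens) / 1e6) * 0.055;

console.log(gpt4oMiniCost); // ≈ $0.0006 per request
console.log(llama3Cost);    // ≈ $0.00014 per request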
Together AI offers a more affordable option, and it’s compatible with the OpenAI SDK for streaming content. Here’s how to make the switch:
- Get your API keys from Together AI
- Update your API configuration:
const openai = new OpenAI({
  apiKey: process.env.TOGETHER_API_KEY,
  baseURL: "https://api.together.xyz/v1",
});
- Update the model in your request:
const response = await openai.chat.completions.create({
  model: "meta-llama/Llama-3-8b-chat-hf",
  messages: [...context, ...conversations],
  stream: true,
});
That’s it! You’ve successfully changed the API key, updated the URL, and switched the model. Test your app to see the new setup in action.
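If you'd like to flip between the two providers without editing code each time, one option is to drive the client configuration from environment variables. The sketch below assumes a hypothetical LLM_PROVIDER variable alongside your OPENAI_API_KEY and TOGETHER_API_KEY in .env:

import OpenAI from "openai"; // or require("openai"), matching how you import it elsewhere

// Sketch only: choose the provider via an environment variable
const useTogether = process.env.LLM_PROVIDER === "together";

const openai = new OpenAI({
  apiKey: useTogether ? process.env.TOGETHER_API_KEY : process.env.OPENAI_API_KEY,
  baseURL: useTogether ? "https://api.together.xyz/v1" : undefined, // undefined falls back to OpenAI's default URL
});

const model = useTogether ? "meta-llama/Llama-3-8b-chat-hf" : "gpt-4o-mini";

You would then pass model into the chat.completions.create call shown above; everything else, including streaming, stays the same.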
Conclusion
In this post, we built upon our previous work by implementing streaming for our chatbot, adding markdown parsing for enhanced content display, and exploring how to switch LLMs to optimize for cost and performance. In the next post, we'll dive into function calling with the OpenAI SDK and explore how to implement it effectively. Until then, keep coding!