Gokhan KOC

Chasing 240 FPS in LLM Chat UIs

I've been using LLM chat apps for a while now: ChatGPT, Claude, Copilot, Cline, RooCode, Cursor (you name it), and I've built a few myself.

One thing I've noticed: a lot of them start to lag after a while.

You know the feeling. The conversation gets long, and suddenly the cursor stutters, scrolling gets choppy, typing feels delayed. Sometimes the whole tab just crashes.

It makes the app feel... cheap.

Are there good examples out there? Sure, some feel smooth at first. But eventually, they all start to struggle. To be fair, things are getting better across the board.

So I started wondering: forget the bad ones—can we even hit 60 FPS consistently? And what about 240 FPS for us frame rate nerds?

TL;DR

I built a benchmark suite to test various optimizations for streaming LLM responses in a React UI. For those in a hurry, here are my findings:

1) Build proper state first, then optimize the rendering later. Ideally, keep the state outside React. You can use Zustand (or any library that lets you build a store outside React and adapt it to components later) to hold the whole chat state; a sketch of this follows the list.

2) Do not spend time micro-optimizing React re-renders or hooks; focus on windowing.
Wins from React internals (memoization, useTransition, useDeferredValue, etc.) are minimal compared to windowing.

3) Focus on CRP (Critical Rendering Path) optimizations using CSS properties like content-visibility: auto and contain: content.

If you are going to use animations, use the will-change property.

4) If you are designing the chatbot yourself, consider not responding with markdown at all. It is expensive to parse and render.

And if you must, you can use segments in your LLM response to separate markdown parts from plain text parts. Render plain text parts as raw text and only parse markdown parts.

5) If network speed is not an issue, consider slowing down the stream itself. A small delay between words (5-10ms) can help maintain higher FPS.
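
To make point 1 concrete, here is a minimal sketch of "state outside React" using a vanilla Zustand store. The store shape and the component below are illustrative, not the benchmark's exact code:

// Sketch: the chat state lives in a vanilla Zustand store, so incoming stream
// chunks never force a React render on their own.
import { createStore } from "zustand/vanilla";
import { useStore } from "zustand";

type ChatState = {
  messages: string[];
  appendChunk: (chunk: string) => void;
};

export const chatStore = createStore<ChatState>((set) => ({
  messages: [],
  appendChunk: (chunk) =>
    set((state) => {
      const messages = [...state.messages];
      if (messages.length === 0) messages.push("");
      messages[messages.length - 1] += chunk; // append to the message being streamed
      return { messages };
    }),
}));

// The stream handler talks to the store directly, outside React:
// chatStore.getState().appendChunk(chunk);

// Components subscribe only to the slice they actually render:
function LastMessage() {
  const last = useStore(chatStore, (s) => s.messages[s.messages.length - 1]);
  return <p>{last}</p>;
}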

How LLM Streaming Works

Before diving into optimizations, let's understand what we're dealing with. LLMs don't return a complete response in one shot—they stream it in chunks using Server-Sent Events (SSE).

Don't ask me why; ChatGPT started it.

Let's take a look at the most basic LLM stream response handling with fetch:

async function streamLLMResponse(url) {
  const response = await fetch(url);
  const reader = response.body.getReader();
  const decoder = new TextDecoder(); // one decoder, so multi-byte characters survive chunk boundaries

  while (true) {
    const { value, done } = await reader.read();
    if (done) break;

    const chunk = decoder.decode(value, { stream: true });
    // Process the chunk (can be one or multiple words)
    console.log(chunk);
  }
}

How do we handle the chunk in React, and why does it lag?

In a React application, you might handle each incoming chunk by updating the component's state. The most straightforward way is to append each chunk to a state variable:

const [response, setResponse] = useState("");

async function streamLLMResponse(url) {
  const res = await fetch(url); // renamed to avoid shadowing the `response` state
  const reader = res.body.getReader();
  const decoder = new TextDecoder();

  while (true) {
    const { value, done } = await reader.read();
    if (done) break;

    const chunk = decoder.decode(value, { stream: true });
    setResponse((prev) => prev + chunk); // one state update (and re-render) per chunk
  }
}

Let's take a look at why this approach leads to lag and stuttering:
(Raw text with the plain React setState approach):

Result: as the state (basically the chat history) grows, FPS drops significantly.

Min FPS: 0
Achieved in: 10 seconds

Optimizations

I have built a benchmark suite to test various optimizations for streaming LLM responses.

It includes:

  • A minimal Node + TypeScript server that streams words with a configurable delay (basically simulating an LLM SSE stream); see the sketch after this list.
  • A React + Vite + TypeScript frontend that implements various optimizations for handling and rendering the streamed response.
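
As a reference point, here is a minimal sketch of such a server using only node:http. The word source, port, and delay value are made up for illustration:

// Minimal SSE "LLM" simulator: streams one word at a time with a configurable delay.
import { createServer } from "node:http";

const TEXT = "Lorem ipsum dolor sit amet ".repeat(500);
const DELAY_MS = 5; // configurable per-word delay

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

createServer(async (_req, res) => {
  res.writeHead(200, {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
    Connection: "keep-alive",
    "Access-Control-Allow-Origin": "*",
  });

  for (const word of TEXT.split(" ")) {
    res.write(`data: ${word} \n\n`); // one SSE frame per word
    await sleep(DELAY_MS);
  }
  res.end();
}).listen(3001);

// Note: the fetch examples in this post read the raw bytes; a real SSE client
// would strip the `data:` framing (or use EventSource) before rendering.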

Going forward, I will describe each optimization I tested and its impact on performance. Performance is measured only in FPS (frames per second) during streaming; RAM usage varies between optimizations.

RAF Batching

The first optimization is to batch incoming chunks using requestAnimationFrame (RAF). Instead of updating the state with each chunk, we buffer the chunks and update the state once per animation frame.

It is not that complicated: you keep a buffer string in a ref, and on each chunk you append to that buffer. Then you schedule a RAF callback to flush the buffer into React state.

const bufferRef = useRef("");
const rafScheduledRef = useRef(false);

async function streamLLMResponse(url) {
  const response = await fetch(url);
  const reader = response.body.getReader();
  const decoder = new TextDecoder();

  while (true) {
    const { value, done } = await reader.read();
    if (done) break;

    bufferRef.current += decoder.decode(value, { stream: true }); // Collect, collect!

    if (rafScheduledRef.current) continue; // a flush is already queued for this frame
    rafScheduledRef.current = true;

    requestAnimationFrame(() => {
      rafScheduledRef.current = false;
      const pending = bufferRef.current; // snapshot before clearing; the updater runs later
      bufferRef.current = "";
      if (pending) setResponse((prev) => prev + pending); // one state update per frame
    });
  }
}

Let's see how this performs:

Min FPS: 15
Achieved in: 90 seconds

A bit better, but still far from great. Let's move on.

React 18 startTransition

This is a tough one. React 18 introduced the startTransition function to mark synchronous updates as non-urgent. This allows React to prioritize more urgent updates (like user input) over less urgent ones (like streaming text).

Basically, if you are constantly updating your state, the UI gets blocked.
startTransition tells React: "Hey, this update is not urgent, you can do it later."

A useful example scenario:

import { startTransition } from 'react';

function handleInputChange(e) {
  const value = e.target.value;

  // Urgent update, user input should never be blocked.
  setInputValue(value);

  // Non-urgent update, we can show filtered list later
  startTransition(() => {
    setFilteredList(filterList(value));
  });
}

Buuut, there is a catch. This function works with time slicing. If your updates are too frequent, React may not get enough idle time to process them. What I mean is:

Without startTransition (blocking):
|--chunk1--|--chunk2--|--chunk3--|--chunk4--|--chunk5--|
[█████████████████████████████████████████████████████]  ← Main thread blocked
                                                          User input? Wait.

With startTransition (non-blocking):
|--chunk1--|--chunk2--|--chunk3--|--chunk4--|--chunk5--|
[███]  [███]  [███]  [███]  [███]  [███]  [███]  [███]  ← Work split into slices
     ↑      ↑      ↑      ↑      ↑      ↑      ↑
   Yield  Yield  Yield  Yield  Yield  Yield  Yield     ← React yields to browser

But if chunks arrive faster than React can yield:
|chunk1|chunk2|chunk3|chunk4|chunk5|chunk6|chunk7|...   ← Too fast!
[█████████████████████████████████████████████████████]  ← No time to yield
                                                          Same as blocking!

React's time slicing works by breaking work into small chunks (~5ms) and yielding back to the browser between them. But if new updates arrive faster than React can process + yield, the queue grows infinitely and you get the same blocking behavior.

Basically, you can type in the input box without blocking (yay!), but:

  • If your network is fast, you will not see new text until the stream ends, because the updates keep piling up and React keeps discarding the previous low-priority updates to focus on the latest ones.

Bottom line: startTransition helps with occasional heavy updates, not continuous high-frequency streams. For streaming, combine it with RAF batching to reduce update frequency first.
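
A minimal sketch of that combination, reusing bufferRef, rafScheduledRef, and setResponse from the RAF example above (the scheduleFlush helper is my own naming, not the benchmark's exact code):

import { startTransition } from "react";

// Sketch: still at most one flush per animation frame, but the flush itself is
// marked non-urgent so user input can interrupt it.
function scheduleFlush() {
  if (rafScheduledRef.current) return;
  rafScheduledRef.current = true;

  requestAnimationFrame(() => {
    rafScheduledRef.current = false;
    const pending = bufferRef.current;
    bufferRef.current = "";
    if (pending) {
      startTransition(() => {
        setResponse((prev) => prev + pending); // low-priority append
      });
    }
  });
}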

Some interesting reads on this topic:
https://dev.to/mohamad_msalme_38f2508ea2/time-slicing-in-react-how-your-ui-stays-butter-smooth-the-frame-budget-secret-59lf

Let's see it in action:
(Raw Text with RAF + startTransition):

Min FPS: 20
Achieved in: 90 seconds

Another disappointment. Let's move on.

MillionJS

I stumbled upon MillionJS while researching high-performance React rendering. I was initially going to use it to improve Ag-Grid performance, but then I thought: why not try it here?

I won't go into details about MillionJS here, but in short, it compiles React components into highly optimized vanilla JS code that manipulates the DOM directly, bypassing React's reconciliation process.
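
For context, using it is mostly a matter of wrapping a component with block(). A rough sketch, assuming the illustrative Message component below:

// Sketch: Million compiles block()-wrapped components into direct DOM updates.
import { block } from "million/react";

const MessageBlock = block(function Message({ text }: { text: string }) {
  return <p className="message">{text}</p>;
});

// Used like any other component: <MessageBlock text={streamedText} />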

And it did not help at all. Benchmarks were identical.

CSS Optimizations

Finally, we come to CSS optimizations. These are not React-specific, but they can have an impact on rendering performance.

1) content-visibility: auto - This CSS property allows the browser to skip rendering elements that are off-screen.

2) contain: content - This property tells the browser that the element's layout and paint are independent of the rest of the page. This can help reduce layout thrashing.
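
As a minimal sketch, applied to each chat message via React inline styles (the component and the containIntrinsicSize hint are my assumptions, not the benchmark's exact code):

import type { ReactNode } from "react";

// Sketch: each message bubble opts into containment so the browser can skip
// layout/paint work for bubbles that are off-screen.
function MessageBubble({ children }: { children: ReactNode }) {
  return (
    <div
      style={{
        contentVisibility: "auto",          // skip rendering while off-screen
        containIntrinsicSize: "auto 120px", // estimated height so the scrollbar stays stable
        contain: "content",                 // layout/paint are independent of the rest of the page
      }}
    >
      {children}
    </div>
  );
}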

Let's see how these optimizations perform:

(Raw Text with RAF + CSS Optimizations):

Min FPS: 20
Achieved in: 90 seconds

Again...nothing significant.

Delaying the stream

One last test: what if we slow down the stream itself? Instead of sending words as fast as possible, we introduce a small delay between words.

This can be achieved with a simple RxJS operator on the UI side or a handmade function:

// ui
function delay(ms: number) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

async function streamLLMResponse(url) {
  const response = await fetch(url);
  const reader = response.body.getReader();
  const decoder = new TextDecoder();

  while (true) {
    const { value, done } = await reader.read();
    if (done) break;

    const chunk = decoder.decode(value, { stream: true });
    setResponse((prev) => prev + chunk); // Update state with each chunk

    await delay(5); // Introduce a 5ms delay between chunks
  }
}

Let's see how this performs:
(Raw Text with 5ms delay between words):

Min FPS: 40 (keeps dropping linearly)
Achieved in: 180 seconds

It looks like an improvement at first, but FPS keeps dropping linearly as the response grows. So not a real solution.

Windowing (Virtualization)

The final and most effective optimization is windowing (or virtualization). This technique involves rendering only the visible portion of the content, rather than the entire response.

Probably you were expecting this one.

I tried a few virtualization libraries, and @tanstack/react-virtual provided the smoothest UX (scrolling behavior, etc.).
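
A trimmed-down sketch of what virtualizing the message list looks like with @tanstack/react-virtual (the row height, container size, and rows prop are illustrative, not the benchmark's exact code):

import { useRef } from "react";
import { useVirtualizer } from "@tanstack/react-virtual";

function ChatList({ rows }: { rows: string[] }) {
  const parentRef = useRef<HTMLDivElement>(null);

  const virtualizer = useVirtualizer({
    count: rows.length,
    getScrollElement: () => parentRef.current,
    estimateSize: () => 48, // rough row height estimate
    overscan: 5,
  });

  return (
    <div ref={parentRef} style={{ height: 600, overflow: "auto" }}>
      {/* One tall spacer so the scrollbar reflects the full content height */}
      <div style={{ height: virtualizer.getTotalSize(), position: "relative" }}>
        {virtualizer.getVirtualItems().map((item) => (
          <div
            key={item.key}
            style={{
              position: "absolute",
              top: 0,
              left: 0,
              width: "100%",
              transform: `translateY(${item.start}px)`,
            }}
          >
            {rows[item.index]}
          </div>
        ))}
      </div>
    </div>
  );
}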

Let's go:
(Raw Text + Virtualization + No Other Optimizations):

Min FPS: 240 (stable)
Achieved in: 300 seconds

Okay, we got there. 240 FPS is achievable with proper windowing.

Now let's push it further:
(Raw Text + Virtualization + Text Animations + No Other Optimizations):

Min FPS: 230 (stable)
Achieved in: 300 seconds

It looks much smoother, at the cost of a few FPS.

Now also with markdown (segmented rendering, where only the markdown parts are parsed; this is the most realistic scenario):

(Raw Text + Virtualization + Text Animations + Markdown + No Other Optimizations):

FPS: 180 - 200 (stable-ish)
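
For context, "segmented rendering" here means the response arrives split into tagged parts, and only the parts flagged as markdown go through a markdown renderer. A rough sketch of the idea (the Segment shape and the use of react-markdown are my assumptions, not the benchmark's exact code):

import ReactMarkdown from "react-markdown"; // any markdown renderer works here

type Segment = { type: "text" | "markdown"; content: string };

function SegmentedMessage({ segments }: { segments: Segment[] }) {
  return (
    <>
      {segments.map((seg, i) =>
        seg.type === "markdown" ? (
          <ReactMarkdown key={i}>{seg.content}</ReactMarkdown> // expensive path, used only where needed
        ) : (
          <p key={i}>{seg.content}</p> // cheap path: plain text stays plain
        )
      )}
    </>
  );
}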

Now let's add optimizations like RAF batching, the CSS properties, and a realistic network delay:

(Raw Text + Virtualization + Text Animations + Markdown + RAF + CSS + Realistic Network Delay 5ms):

FPS: 200 - 235 (stable)

And with lightweight markdown parsing (no syntax highlighting):

FPS: 240 (stable)

Conclusion

After testing various optimizations for streaming LLM responses in a React UI, the key takeaway is that windowing (virtualization) is by far the most effective technique for achieving high FPS. It has trade-offs (complexity, memory usage, and breaking the browser's built-in find, Ctrl+F), but the performance gains are significant.

It was definitely a fun experiment and I hope these findings help you build smoother apps!

Repository

You can find the complete benchmark suite and code examples in this GitHub repository: how-to-handle-llm-responses
