DEV Community

Programming Central

Posted on • Originally published at programmingcentral.hashnode.dev

Stop the Spinners: How to Make AI Streams Feel Instant with Skeleton Loaders & Suspense

We’ve all been there. You ask an AI a question, hit send, and the interface freezes. Two seconds pass. Three. You start wondering if it crashed. Then, suddenly, a wall of text dumps onto the screen, causing the UI to jump and shift. It feels clunky, unresponsive, and frankly, broken.

In the world of generative AI, where Large Language Models (LLMs) inherently take seconds to generate responses, "speed" isn't just about raw latency—it's about perceived performance. If the user is left staring at a frozen screen while the model thinks, the magic is lost.

This guide explores the architecture of the "waiting game." We'll dive into the mechanics of React Suspense and Skeleton Loaders, specifically tailored for the unique challenges of AI streaming. By the end, you'll know exactly how to implement a UI that feels alive the moment the user clicks "Send."

The Illusion of Speed: Perceived Performance vs. Absolute Latency

When building a standard web app (like a CRUD dashboard), the server usually fetches data and sends a complete response. The user waits, but they wait for a finished product.

Generative AI changes this. It’s not an assembly line; it’s a chef cooking a complex dish one spoonful at a time. If the waiter waits for the whole pot to finish before bringing the first spoon, the customer starves. If the waiter brings the spoon but the plate keeps shifting around on the table, the soup spills.

This is the Uncanny Valley of Waiting. The goal is to decouple the user's perception from the backend processing time. We achieve this by shifting from a "loading state" mindset to a progressive rendering mindset.

The Restaurant Analogy

To visualize this, let's compare traditional web apps to AI streaming apps:

  • Traditional Apps (The Assembly Line): The kitchen cooks the entire dish. The waiter holds the plate until it's perfect. The user waits for the whole thing.
  • AI Streaming (The Open Kitchen): The chef ladles soup into the bowl as it cooks. The waiter runs the bowl to the table immediately. The user starts eating (reading) while the rest is still cooking.

However, there is still a delay between the order and the first spoonful (the time to generate the first token). If the waiter stands at the kitchen door staring at the chef, the customer sees nothing happening.

Suspense and Skeletons are the Bread Basket. The waiter brings a placeholder immediately upon receiving the order. It signals: "We got it. We are working. Here is the shape of what you're getting."

How React Suspense Handles AI Latency

In React, Suspense allows components to "wait" for something before rendering. In the context of AI, we use it to manage the initial latency gap—the time from the user clicking "Send" to the arrival of the first token.

The Critical Distinction

In traditional Suspense, you might wait for a database query to finish. In AI streaming, we do not want to wait for the entire stream. We only want to show the fallback (the skeleton) for the initial connection. Once the first token arrives, we want to seamlessly transition to the streaming content.
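To make that distinction concrete, here is a minimal sketch of splitting a token stream so that only the first token is awaited (the part a Suspense-integrated resource would suspend on), while the rest keeps streaming. The helper and the simulated stream are my own illustrations, not SDK API:

```typescript
// Hypothetical helper: await only the first token of an async stream.
// A Suspense boundary would wait on this promise; the remaining iterator
// is handed off to the progressive renderer.
async function splitFirstToken<T>(
  stream: AsyncIterable<T>,
): Promise<{ first: T | undefined; rest: AsyncIterator<T> }> {
  const it = stream[Symbol.asyncIterator]();
  const head = await it.next();
  return { first: head.done ? undefined : head.value, rest: it };
}

// Usage with a simulated token stream:
async function* fakeTokens() {
  yield 'Hello';
  yield ', world';
}

splitFirstToken(fakeTokens()).then(async ({ first, rest }) => {
  console.log(first); // first token -> resolve the Suspense fallback
  for (let r = await rest.next(); !r.done; r = await rest.next()) {
    console.log(r.value); // remaining tokens -> progressive rendering
  }
});
```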

The Flow:

  1. 0ms (Click): User sends a message. Without a fallback, the UI would sit frozen.
  2. Suspense Activation: React shows the <Skeleton /> immediately. The layout is reserved.
  3. 1500ms (First Token): The server connects to the LLM and pushes the first chunk.
  4. Suspense Resolution: React detects data. It swaps the Skeleton for the streaming component.
  5. 1500ms - 5000ms (Streaming): Tokens arrive. The text fills in the reserved space.
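The flow above can be sketched as a small state function. The millisecond values are illustrative, matching the hypothetical 1500 ms first-token latency used in the timeline:

```typescript
// Which UI the user sees at a given moment, for the flow above.
// firstTokenMs and doneMs are illustrative latencies, not SDK values.
type UiPhase = 'skeleton' | 'streaming' | 'complete';

function uiPhaseAt(tMs: number, firstTokenMs: number, doneMs: number): UiPhase {
  if (tMs < firstTokenMs) return 'skeleton'; // Suspense fallback visible
  if (tMs < doneMs) return 'streaming';      // tokens filling reserved space
  return 'complete';                         // full response rendered
}

// Example: with a 1500 ms first token and 5000 ms total generation time,
// the user sees the skeleton, then streaming text, then the final report.
console.log(uiPhaseAt(500, 1500, 5000));  // "skeleton"
console.log(uiPhaseAt(2000, 1500, 5000)); // "streaming"
console.log(uiPhaseAt(6000, 1500, 5000)); // "complete"
```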

Designing Effective Skeleton Loaders

A skeleton loader is not just a generic spinner. It is a structural placeholder. For generative UI, the skeleton must mimic the shape of the expected response.

Why does this matter?

  1. Layout Stability (CLS): It reserves the exact space the content will occupy, preventing Cumulative Layout Shift (CLS), a Core Web Vitals metric.
  2. Cognitive Expectation: It tells the user what is coming. A code-shaped skeleton implies code. A chat-bubble skeleton implies a conversation.

Visualizing the Flow

Imagine a timeline. A blocking UI waits for the full response, resulting in a long, empty gap. A non-blocking UI with Suspense and Skeletons fills that gap immediately, creating a continuous experience.

  • Blocking UI: [Click] -> [Waiting...] -> [Content Dumps]
  • Non-Blocking UI: [Click] -> [Skeleton Appears] -> [Content Streams In]

Code Example: Streaming a User Report with streamUI

Let's look at a practical implementation using the Vercel AI SDK and Next.js App Router. We will build a "User Report" generator that streams a structured React component.

We will use streamUI (which streams React components, not just text) and wrap the client-side consumption in a Suspense boundary.

1. The Server Action

This runs on the server. It connects to OpenAI, defines a tool (Zod schema), and streams back a React component.

// app/actions/generateReport.ts
'use server';

import { streamUI } from 'ai/rsc';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';

export async function generateReportStream(userId: string) {
  const result = await streamUI({
    model: openai('gpt-4o-mini'),
    prompt: `Generate a status report for user ${userId}. Include a summary and recommendation.`,

    // The AI can invoke this tool to render a specific UI card
    tools: {
      report_card: {
        description: 'A card displaying a user status report.',
        parameters: z.object({
          username: z.string(),
          status: z.enum(['active', 'inactive', 'pending']),
          summary: z.string(),
          recommendation: z.string(),
        }),
        generate: async function* ({ username, status, summary, recommendation }) {
          // Yield a loading state if generation takes time
          yield <div className="animate-pulse">Assembling report...</div>;

          // Return the final structured component
          return (
            <div className="border rounded-lg p-4 shadow-sm bg-white">
              <h3 className="font-bold text-lg">Report for {username}</h3>
              <div className="mt-2 space-y-2">
                <p><span className="font-semibold">Status:</span> {status}</p>
                <p className="text-gray-700">{summary}</p>
                <div className="mt-3 p-2 bg-blue-50 border-l-4 border-blue-500">
                  <p className="font-semibold text-blue-700">Recommendation</p>
                  <p className="text-blue-600">{recommendation}</p>
                </div>
              </div>
            </div>
          );
        },
      },
    },
  });

  // result.value is a streamable React node. Returning it lets the client
  // render it directly while chunks continue to arrive.
  return result.value;
}

2. The Client Component & The Skeleton

This component triggers the action. Crucially, it uses a Skeleton to handle the initial latency before the stream starts.

// app/components/ReportGenerator.tsx
'use client';

import { Suspense, useState, useTransition } from 'react';
import type { ReactNode } from 'react';
import { generateReportStream } from '@/app/actions/generateReport';

export default function ReportGenerator() {
  const [isPending, startTransition] = useTransition();
  const [uiState, setUiState] = useState<ReactNode | null>(null);

  const handleClick = () => {
    startTransition(async () => {
      // The action resolves with a streamable React node. We store it in
      // state once; React then updates it in place as new chunks arrive,
      // so no manual chunk-reading loop is needed.
      const ui = await generateReportStream('user_12345');
      setUiState(ui);
    });
  };

  return (
    <div className="max-w-md mx-auto p-6 space-y-6">
      <button
        onClick={handleClick}
        disabled={isPending}
        className="w-full py-2 px-4 bg-blue-600 text-white rounded hover:bg-blue-700 disabled:opacity-50"
      >
        {isPending ? 'Generating...' : 'Generate User Report'}
      </button>

      {/* 
        SUSPENSE BOUNDARY:
        1. While the action is still pending and no UI has arrived, we show
           <ReportSkeleton /> explicitly. If the streamed node itself
           suspends before its first chunk, the Suspense fallback shows
           the same skeleton.
        2. Once the first token arrives, the streamed component renders
           and keeps updating in place.
      */}
      <Suspense fallback={<ReportSkeleton />}>
        <div className="min-h-[200px]">
          {isPending && !uiState ? <ReportSkeleton /> : uiState}
        </div>
      </Suspense>
    </div>
  );
}

// The Skeleton mimics the structure of the final Report Card
function ReportSkeleton() {
  return (
    <div className="border rounded-lg p-4 shadow-sm bg-white animate-pulse">
      <div className="h-6 bg-gray-200 rounded w-1/2 mb-4"></div>
      <div className="space-y-2">
        <div className="h-4 bg-gray-200 rounded w-1/3"></div>
        <div className="h-4 bg-gray-200 rounded w-full"></div>
        <div className="h-4 bg-gray-200 rounded w-2/3"></div>
        <div className="mt-3 p-2 bg-gray-100 border-l-4 border-gray-300">
          <div className="h-3 bg-gray-200 rounded w-1/3 mb-1"></div>
          <div className="h-3 bg-gray-200 rounded w-3/4"></div>
        </div>
      </div>
    </div>
  );
}

Line-by-Line Breakdown

  1. useTransition: This React hook allows us to mark the state update as "non-blocking." This keeps the UI responsive even while the server action is processing.
  2. result.value: streamUI returns a streamable React node. The client stores it in state and renders it; React patches it in place as new chunks arrive, so no manual chunk-reading loop is needed.
  3. Suspense: This is the wrapper. It intercepts the rendering of its children. If the streamed node suspends (because its first chunk hasn't arrived), it renders the fallback. The explicit isPending check covers the round trip before the action even resolves.
  4. ReportSkeleton: Notice the animate-pulse class. It creates a shimmer effect. The div heights and widths are calculated to match the final output. This prevents the layout from "jumping" when the real text arrives.

The Underlying Mechanism: Server-Sent Events (SSE)

How does this actually work under the hood?

  1. The Request: The client calls the Server Action.
  2. The Bridge: The server connects to the LLM (OpenAI). It creates a ReadableStream (a Web Standard API).
  3. The Push: As the LLM generates tokens, the server writes them into the stream immediately. It uses SSE (Server-Sent Events) format to push these chunks over the HTTP connection.
  4. The Client: The browser reads this stream. The Vercel SDK parses the chunks.
  5. The Update: The client updates its state (setUiState) with the new data, triggering a React re-render.

Because the stream is pushed, the user sees content as soon as the server receives it. The Skeleton ensures that the time before that first push is visually handled.
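As a rough sketch of step 3 (not the SDK's actual internals), an SSE push can be built from the Web Streams API alone. The `formatSSE` helper and the hard-coded token list below are illustrative assumptions:

```typescript
// Format one chunk in Server-Sent Events wire format: a "data:" line,
// terminated by a blank line, delimits each event.
function formatSSE(data: unknown): string {
  return `data: ${JSON.stringify(data)}\n\n`;
}

// A minimal push stream: each token is encoded and enqueued as soon as
// it is "generated" (simulated here with a hard-coded list).
function tokenStream(tokens: string[]): ReadableStream<Uint8Array> {
  const encoder = new TextEncoder();
  return new ReadableStream({
    start(controller) {
      for (const token of tokens) {
        controller.enqueue(encoder.encode(formatSSE(token)));
      }
      controller.close();
    },
  });
}

// In a Next.js route handler, the stream would be returned like this:
// return new Response(tokenStream(tokens), {
//   headers: { 'Content-Type': 'text/event-stream' },
// });
```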

Summary: Architecting for "Perceived Instant"

High absolute latency is a constraint of LLMs, but it doesn't have to ruin the user experience. By combining Suspense (for initial latency), Skeleton Loaders (for layout stability), and Streaming (for progressive rendering), we can build AI applications that feel responsive and polished.

Key Takeaways:

  • Don't block: Never wait for the full response.
  • Reserve Space: Use skeletons that mimic the final UI shape to prevent layout shifts.
  • Bridge the Gap: Use Suspense to handle the "cold start" time before the first token arrives.

When implemented correctly, your users won't just see an AI generating content—they'll see an app that is thinking with them, in real-time.

The concepts and code demonstrated here are drawn directly from the comprehensive roadmap laid out in the book The Modern Stack: Building Generative UI with Next.js, Vercel AI SDK, and React Server Components (available on Amazon), part of the AI with JavaScript & TypeScript series.
The ebook is also available on Leanpub, along with many other ebooks: https://leanpub.com/u/edgarmilvus.
