rahul patwa
How I Built a Real-Time AI Dungeon Master with Claude API, Socket.io & Next.js

In this post, I’ll show you how I built a real-time AI Dungeon Master using Claude API, Socket.io, and Next.js. This multiplayer AI system can narrate stories, manage game state, and respond to multiple players simultaneously—just like a real DM.

The AI in gaming market sits at $4.54 billion in 2025 and is projected to hit $81.19 billion by 2035 (SNS Insider, 2025). That number isn't surprising when you think about what generative AI actually unlocks for games: infinite narrative branching, dynamic NPCs, and a Dungeon Master who never gets tired at midnight.

I built DnD AI, a multiplayer AI Dungeon Master running on Next.js 16, Claude API (claude-sonnet-4-6), Socket.io, and DALL-E 3. This post is a technical walkthrough of the six hardest problems I ran into, and how I solved them. No fluff, just the architecture decisions that actually mattered.

Why this matters
Most multiplayer RPGs fail because they depend on a human Dungeon Master. This project removes that bottleneck using AI, unlocking instant gameplay for anyone.

TL;DR:

  • Next.js App Router can't maintain persistent WebSockets, so a custom server.ts boots Socket.io and Next.js in one process
  • Claude streaming output pipes through Socket.io to all connected clients in real time, with chunk batching to avoid socket flooding
  • DALL-E 3 fires only on location changes and major story beats, not every message, keeping session cost under $0.25

Why a Custom Server Instead of Next.js API Routes?

Next.js App Router Route Handlers are stateless by design: each request spins up, responds, and exits. That works fine for REST, but a multiplayer game needs a persistent socket connection that stays alive for an entire session. There's no clean way to run Socket.io inside an App Router handler. The solution is a custom server.ts at the project root that boots both runtimes in a single Node process.

The key insight: Node's http createServer lets you hand every incoming request to the Next.js request handler (app.getRequestHandler()) while Socket.io attaches to the same HTTP server instance. Both share one process, one port, and one set of environment variables.

// server.ts (root of project, not inside /app)
import { createServer } from "http";
import { parse } from "url";
import next from "next";
import { Server as SocketIOServer } from "socket.io";
import { registerGameHandlers } from "./src/lib/socket/gameHandlers";

const dev = process.env.NODE_ENV !== "production";
const app = next({ dev });
const handle = app.getRequestHandler();

app.prepare().then(() => {
  const httpServer = createServer((req, res) => {
    const parsedUrl = parse(req.url!, true);
    handle(req, res, parsedUrl);
  });

  // Socket.io attaches to the same HTTP server
  const io = new SocketIOServer(httpServer, {
    cors: { origin: process.env.NEXT_PUBLIC_APP_URL },
  });

  // Register all game-related socket handlers
  registerGameHandlers(io);

  httpServer.listen(3000, () => {
    console.log("> Ready on http://localhost:3000");
  });
});

One gotcha: your package.json build script needs to compile server.ts separately with tsc, then run the compiled output, not next start. Keep that in mind before you hit deploy.
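For reference, here is one possible shape for those scripts. This is a sketch, assuming a separate tsconfig.server.json that compiles server.ts into dist/ and tsx for local dev; your file names and flags may differ:

```json
{
  "scripts": {
    "dev": "tsx server.ts",
    "build": "next build && tsc -p tsconfig.server.json",
    "start": "NODE_ENV=production node dist/server.js"
  }
}
```

The important part is that start runs your compiled custom server, so Socket.io boots in production exactly as it does in dev.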


The Game Loop Architecture

Every player action travels through a consistent five-step loop: client input, socket event, Claude API call, streamed narrative back to all clients, then stat resolution. Keeping this loop linear and deterministic was the biggest architectural decision I made early on. It made debugging a lot easier.

Player types/speaks action
        │
        ▼
[Socket.io] "player:action" event → server
        │
        ▼
[gameHandlers.ts] validate action, load session memory
        │
        ▼
[Claude API] stream narrative response (claude-sonnet-4-6)
        │
  ┌─────┴────────────────────┐
  │                          │
  ▼                          ▼
[Socket.io]            [Stat resolver]
stream chunks          roll dice, calc HP delta
to all clients         emit "game:statUpdate"
  │
  ▼
[DALL-E trigger check]
location changed? major beat?
→ fire async image gen

The stat resolver runs in parallel with the stream, not after it. Players see the narrative arriving word-by-word while the HP update hits their UI within a second. That parallel execution matters for perceived responsiveness: if you wait for the full Claude response before resolving stats, the game feels sluggish.
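The parallel step is just a Promise.all over the two branches. Here's a stubbed sketch of the shape; the real handlers call Claude and Socket.io, and the stub names and return values below are purely illustrative:

```typescript
// Stubbed stand-ins for the two branches of the loop (illustrative only)
async function streamNarrative(): Promise<string> {
  return "The goblin lunges..."; // real version streams from Claude to the room
}

async function resolveStats(): Promise<number> {
  return -3; // real version rolls dice and computes the HP delta
}

// Both branches start immediately; neither waits for the other to finish
export async function handleAction(): Promise<{ narrative: string; hpDelta: number }> {
  const [narrative, hpDelta] = await Promise.all([streamNarrative(), resolveStats()]);
  return { narrative, hpDelta };
}
```

Because both promises start before either awaits, the stat update can land on clients while narrative chunks are still streaming.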


The Hardest Part: Syncing Streaming LLM Output Across Multiple Clients

84% of developers now use AI tools daily (Stack Overflow Developer Survey, 2025), but most of those integrations are single-user. Streaming Claude's output to four simultaneous Socket.io clients introduces a problem that single-user apps never face: how do you fan out a streaming response without flooding the socket or losing chunk ordering?

The naive approach, emitting a socket event for every token, causes event queue saturation at ~4 clients with a fast model. I batched chunks into 50ms windows instead. Each batch emits one socket event with concatenated text. Clients append to their local buffer and re-render.

// src/lib/socket/streamHandler.ts
import Anthropic from "@anthropic-ai/sdk";
import type { Server as SocketIOServer } from "socket.io";

export async function streamNarrativeToRoom(
  io: SocketIOServer,
  roomId: string,
  messages: Anthropic.MessageParam[],
  systemPrompt: string
) {
  const client = new Anthropic();
  let chunkBuffer = "";
  let flushTimer: NodeJS.Timeout | null = null;

  const flush = () => {
    if (chunkBuffer.length > 0) {
      // Single emit per batch window; all clients in the room receive it
      io.to(roomId).emit("dm:narrative_chunk", { text: chunkBuffer });
      chunkBuffer = "";
    }
    flushTimer = null;
  };

  const stream = await client.messages.stream({
    model: "claude-sonnet-4-6",
    max_tokens: 1024,
    system: systemPrompt,
    messages,
  });

  for await (const chunk of stream) {
    if (
      chunk.type === "content_block_delta" &&
      chunk.delta.type === "text_delta"
    ) {
      chunkBuffer += chunk.delta.text;

      // Batch: flush every 50ms, not every token
      if (!flushTimer) {
        flushTimer = setTimeout(flush, 50);
      }
    }
  }

  // Flush any remaining buffer after stream ends
  if (flushTimer) clearTimeout(flushTimer);
  flush();

  io.to(roomId).emit("dm:narrative_end", { roomId });
}

The 50ms batch window is the sweet spot I landed on after testing. At 20ms the socket still floods at high token velocity. At 100ms the streaming effect feels choppy to users. Your mileage will vary depending on average token rate.


Dice Determinism in a Distributed Game

In a multiplayer game, dice rolls can't be client-side. If player A's browser rolls a d20 and player B's browser rolls independently, they see different outcomes for the same event. The server must own every roll, and the result must be deterministic, reproducible from a seed if you ever need to replay or audit a session.

I use a seeded pseudo-random number generator (PRNG) on the server, seeded per session. The seed gets stored in SQLite alongside the session record. When a roll happens, the server draws the next value from the session's PRNG sequence, increments a roll counter, emits the result to all clients simultaneously, and stores it. Replaying from the seed reproduces every roll in order.

// src/lib/dice/seededRng.ts
// Mulberry32: fast, seedable, good distribution for game use
function mulberry32(seed: number) {
  return function () {
    seed |= 0;
    seed = (seed + 0x6d2b79f5) | 0;
    let t = Math.imul(seed ^ (seed >>> 15), 1 | seed);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

export function createSessionDice(sessionSeed: number) {
  const rng = mulberry32(sessionSeed);
  return {
    roll: (sides: number) => Math.floor(rng() * sides) + 1,
  };
}

The first version used Math.random() server-side, and it worked fine until I added session replay for debugging. Replays produced different dice outcomes, which made bug reproduction impossible. Seeding costs nothing and saves you later.
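To make the payoff concrete: two dice instances created from the same seed produce an identical roll sequence, which is exactly what replay needs. A quick sketch, repeating the mulberry32 code from above so it stands alone:

```typescript
// Same mulberry32 + createSessionDice as above, duplicated so this snippet is self-contained
function mulberry32(seed: number) {
  return function () {
    seed |= 0;
    seed = (seed + 0x6d2b79f5) | 0;
    let t = Math.imul(seed ^ (seed >>> 15), 1 | seed);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

export function createSessionDice(sessionSeed: number) {
  const rng = mulberry32(sessionSeed);
  return { roll: (sides: number) => Math.floor(rng() * sides) + 1 };
}

// A live session and its replay share a seed, so their roll sequences match exactly
export function rollsMatch(seed: number, count: number): boolean {
  const live = createSessionDice(seed);
  const replay = createSessionDice(seed);
  for (let i = 0; i < count; i++) {
    if (live.roll(20) !== replay.roll(20)) return false;
  }
  return true;
}
```

Swap mulberry32 for Math.random() and rollsMatch fails immediately, which is exactly the replay bug described above.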


Persistent Memory Without Blowing the Context Window

Claude's context window is large, but sending full game history on every turn is expensive and eventually hits limits. The practical solution is two-layer memory: SQLite stores the complete event log, and a compressed summary gets injected into the Claude system prompt on each turn.

Prisma manages the schema. Each session has a GameSession record, a log of GameEvent rows (player actions, DM responses, stat changes), and a MemorySummary that gets regenerated every 10 turns using a separate, cheaper Claude call.
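The regeneration trigger itself is trivial; the interesting part is feeding the old summary plus the new events into the cheaper call. A sketch of the guard and the summarization prompt (the prompt wording here is illustrative, not the exact one I ship):

```typescript
// Regenerate the compressed summary every N turns (10 worked well for me)
export function shouldResummarize(turnCount: number, interval = 10): boolean {
  return turnCount > 0 && turnCount % interval === 0;
}

// Prompt for the cheaper summarization call: previous summary + events since then
export function buildSummarizationPrompt(oldSummary: string, newEvents: string[]): string {
  return [
    "Compress this D&D session history into a summary under 300 words.",
    "Preserve: named NPCs, promises or deals the players made, unresolved plot threads.",
    "",
    "## Previous summary",
    oldSummary,
    "",
    "## New events since last summary",
    ...newEvents,
  ].join("\n");
}
```

Note the "preserve" list does the heavy lifting: without it, the summarizer happily drops exactly the early plot details you need at turn 40.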

Most tutorials suggest simply truncating history. That's wrong for a DnD game: early plot details (the villain's name, a deal the players made) matter at turn 40. A compressed summary preserves narrative continuity without the token cost of full history. The summarization prompt is as important as the main DM prompt.

The system prompt structure:

// src/lib/ai/buildSystemPrompt.ts
export function buildDMSystemPrompt(
  campaignConfig: CampaignConfig,
  memorySummary: string,
  recentEvents: GameEvent[]  // last 5 events in full
): string {
  return `
You are the Dungeon Master for a ${campaignConfig.theme} campaign.
Setting: ${campaignConfig.worldDescription}
Players: ${campaignConfig.players.map(p => `${p.name} (${p.class}, HP: ${p.currentHp}/${p.maxHp})`).join(", ")}

## Story So Far (compressed)
${memorySummary}

## Recent Events (verbatim)
${recentEvents.map(e => `[${e.type}] ${e.content}`).join("\n")}

Rules: Stay in character. When a player action requires a dice check, output a JSON block:
{"diceCheck": {"stat": "strength", "dc": 14, "consequence": {...}}}
  `.trim();
}

DALL-E 3: When to Trigger and How to Manage Cost

DALL-E 3 at $0.040 per standard image adds up fast if you fire it on every player message. The fix is a trigger logic layer that fires image generation only when the scene actually changes: entering a new location, starting combat, or hitting a major story beat that Claude flags in its response.

In testing across 12 playthroughs averaging 45 turns each, trigger-gated generation fired DALL-E 3 an average of 6 times per session, keeping image cost under $0.25 per session. Ungated, the same sessions would have triggered 40+ image calls.

Image generation runs async; the game doesn't wait for it.

// src/lib/ai/imageOrchestrator.ts
type SceneSignal = { sceneChange: true; description: string } | null;

export async function maybeGenerateSceneImage(
  roomId: string,
  dmResponse: string,
  io: SocketIOServer
): Promise<void> {
  const signal = extractSceneSignal(dmResponse); // parses JSON block in response
  if (!signal) return;

  // Fire and forget: don't await in the main game loop
  generateAndEmit(roomId, signal.description, io).catch((err) =>
    console.error("Image gen failed silently:", err)
  );
}

async function generateAndEmit(
  roomId: string,
  description: string,
  io: SocketIOServer
) {
  const imageUrl = await generateImage(description);
  io.to(roomId).emit("scene:image_ready", { imageUrl });
}

The async fire pattern means players get a scene image 5–8 seconds after a location change, not blocking any game action. The UI shows a loading shimmer until scene:image_ready arrives.
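For completeness, here is one possible shape for extractSceneSignal, assuming Claude emits a flat JSON object with a sceneChange flag inside its narrative. The regex only handles non-nested objects, which is fine for this schema:

```typescript
type SceneSignal = { sceneChange: true; description: string } | null;

// Pull a {"sceneChange": true, "description": "..."} block out of the DM's prose
export function extractSceneSignal(dmResponse: string): SceneSignal {
  // Simple non-nested object match; the scene-signal schema is flat
  const match = dmResponse.match(/\{[^{}]*"sceneChange"[^{}]*\}/);
  if (!match) return null;
  try {
    const parsed = JSON.parse(match[0]);
    if (parsed.sceneChange === true && typeof parsed.description === "string") {
      return { sceneChange: true, description: parsed.description };
    }
  } catch {
    // Malformed JSON from the model: treat as no scene change rather than crash
  }
  return null;
}
```

Falling back to null on malformed JSON matters: a bad model output should cost you one scene image, not a crashed game loop.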


Web Speech API in Production

Web Speech API works well for the happy path: clear speech, modern Chrome, quiet environment. It breaks in ways that are hard to predict. Browser support is inconsistent outside Chrome and Edge. Background noise triggers false positives. Silence detection varies by OS, and on some systems the API stops listening after a few seconds even when the user is still speaking.

The two real-world failures I hit most: (1) mobile Safari doesn't support the API at all in some iOS versions, and (2) the onend event fires too aggressively in noisy environments, cutting off player actions mid-sentence. The fix for the second issue was a 1.5-second debounce on the onend event before submitting the transcript.
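The debounce itself can be a small wrapper around setTimeout. A sketch, with the SpeechRecognition event wiring and the submit callback assumed (names are mine, not a browser API):

```typescript
// Debounce transcript submission after recognition.onend fires
export function createEndDebouncer(submit: (text: string) => void, delayMs = 1500) {
  let timer: ReturnType<typeof setTimeout> | null = null;
  return {
    // Call from recognition.onend with the transcript accumulated so far
    onEnd(transcript: string) {
      if (timer) clearTimeout(timer);
      timer = setTimeout(() => submit(transcript), delayMs);
    },
    // Call when speech resumes (onresult / onspeechstart) to cancel the pending submit
    cancel() {
      if (timer) clearTimeout(timer);
      timer = null;
    },
  };
}
```

If onend fires spuriously mid-sentence, the next onresult cancels the pending submit, so the player's full action goes through as one message.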

The connection to the action pipeline is simple: the browser captures speech, converts to text, and fires the same player:action socket event that a typed message would send. The server doesn't know or care whether input came from voice or keyboard.

Always provide a text input fallback. Don't ship a voice-only interface.


What I'd Do Differently

Over-engineered: the Prisma schema. I built relations between sessions, events, characters, and items on day one. For a prototype, a flat JSON blob in SQLite would have been faster to iterate on and just as queryable for my needs.

Under-engineered: the Claude system prompt. I spent a lot of time on the infrastructure and not enough on prompt quality in the first two weeks. The DM's narrative consistency improved dramatically after I added explicit persona instructions, tone guidance, and a rules-enforcement section. Infrastructure is the easy part.

The biggest surprise: streaming to multiple clients is genuinely tricky. I expected it to be a minor detail. The 50ms batching window took three days of testing to land on. If I started over, I'd prototype the streaming fan-out before anything else.

90% of game developers already use AI in their workflows, and 97% say generative AI is reshaping the industry (Google Cloud / Harris Poll, Aug 2025). The tooling is mature. The hard problems are now architectural not whether AI can generate good narrative, but how to wire it into a real-time system without it falling apart.

FAQ

Can this architecture scale beyond 4 players?

The Socket.io room model scales to more players without code changes. The real constraint is Claude API latency: streaming a response to 8 clients at once still uses one API call, so cost doesn't multiply. In testing, the 50ms batch window held stable at 6 clients. Beyond that, you'd want to profile socket event queue depth under load.

Why Claude over GPT-4 for the DM role?

Claude's longer context window and instruction-following consistency made it the better fit for maintaining campaign continuity across a long session. In my tests, Claude adhered to custom rule constraints in the system prompt more reliably than GPT-4 Turbo, particularly for structured JSON output embedded in narrative responses. That JSON output drives the dice check and stat resolution pipeline.

How do you handle Claude going off-script or breaking game rules?

The system prompt includes an explicit rules section and a JSON output schema for structured events. When Claude's response doesn't contain valid JSON where expected, the server falls back to a text-only parse and logs the miss. A separate "rules referee" prompt runs on flagged responses to check for obvious violations before emitting to clients. It catches about 80% of off-script outputs without blocking the stream.
