More Than Just a Chatbot: How I Engineered an "Immersive" AI Host for Lateral Thinking Puzzles
When we talk about AI apps today, we often get stuck in a binary choice between RAG (Retrieval-Augmented Generation) and Prompt Engineering. But when building TurtleNoir (my multiplayer lateral thinking puzzle game), I realized that real-time inference alone wasn't enough.
Here is my journey of building an AI Host that actually has a "soul."
What are Lateral Thinking Puzzles?
For those unfamiliar, the game (often called "Turtle Soup" in Asia) works like this: players are presented with a strange scenario (the "Surface"), and the hidden backstory behind it is the "Soup." It requires at least two people: a Host who knows the truth, and Players who ask questions.
Example: A captain drinks a bowl of turtle soup, cries, and then commits suicide. Why?
Players ask Yes/No questions to deduce the backstory. The Host must answer: Yes / No / Irrelevant.
From Burning $5 in 30 Minutes to Sustainable Costs
Initially, I built this just to play with my girlfriend. I used the free Gemini API, which worked fine for two people. But after a TikTok influencer featured the site, hundreds of users flooded in.
The free tier quickly became insufficient. I switched to an API key pool, but the rate limits were still too tight. To maintain service, I moved to OpenRouter. Result: costs reached $5 in less than 30 minutes.
For an indie developer with a limited budget, this was clearly unsustainable.
I had to optimize. After experimenting with Gemini Thinking Mode (too slow) and Gemini 2.0 Flash, I finally settled on DeepSeek V3.2.
The game changer was Context Caching. A puzzle’s rules and backstory can consume thousands of tokens. Transmitting this context for every single question wastes money and adds latency. With Context Caching, I pay the "memory cost" once, and subsequent turns reuse that cache.
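Prefix-style context caches (like DeepSeek's) only hit when the earlier part of the request is byte-identical across turns, so the practical trick is to keep the expensive puzzle context in a stable prefix and only append new messages. A minimal sketch (the helper name and message shape are illustrative, not my production code):

```python
def build_turn_messages(system_context: str, history: list[dict], question: str) -> list[dict]:
    """Keep the expensive puzzle context in a byte-identical prefix so the
    provider's prefix cache can reuse it on every turn after the first."""
    return (
        [{"role": "system", "content": system_context}]  # stable: cached after turn 1
        + history                                        # grows append-only
        + [{"role": "user", "content": question}]        # only truly new tokens
    )
```

Because the system message and the history are append-only, every turn shares its prefix with the previous one, which is exactly what a prefix cache needs.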
Result: Costs dropped by over 90%. The platform was finally sustainable.
The Dilemma: From "Glitchy AI" to "Omniscient Host"
The charm of this game lies in Information Asymmetry. The Host must be omniscient (knowing the full truth) but tight-lipped (answering only Yes / No / Irrelevant).
At first, I lazily stuffed the story into a prompt:
"You are the Host. Here is the story... Here is the truth... Answer the player."
This approach had clear limitations:
- With Chain-of-Thought: The Host was too slow, and token costs doubled.
- Without Chain-of-Thought: The AI struggled with "narrative tricks." Mystery stories often rely on subtle wording. The AI couldn't build a rigorous logical defense in 500ms.
I realized: Asking one AI to understand a complex puzzle scenario, maintain logical consistency, and roleplay simultaneously is asking too much.
We needed to decouple "Thinking" from "Speaking."
Why RAG Failed in Practice
Before landing on my final architecture, I tried RAG (Vector Search). My idea was a "Smart Cache":
- Store every Player Question + AI Answer in a vector DB.
- If a new question is semantically similar to a past one, serve the cached answer.
This failed completely.
Why? Because RAG relies on Semantic Similarity, not Logical Precision.
In a lateral thinking game, nuances matter:
- Player A: "Did he kill himself?"
- Player B: "Was he killed?"
To a vector DB, these sentences can have very similar embeddings (both are about death and "kill"). If the AI previously answered "Yes" to "Did he kill himself?" and RAG serves that cached "Yes" to "Was he killed?" because of high similarity, the game logic collapses.
Lesson: You cannot rely on probability for logic puzzles. You need structured reasoning.
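The failure mode can be reproduced in a few lines. The similarity score below is a hypothetical stand-in for what a sentence-embedding model might return (exact values vary by model, but "kill"-related questions do cluster tightly):

```python
# Hypothetical similarity an embedding model might assign -- illustrative only.
SIMILARITY = {
    ("was he killed?", "did he kill himself?"): 0.91,
}

# Correct answer to the *cached* question only.
cache = {"did he kill himself?": "Yes"}

def semantic_cache_lookup(question: str, threshold: float = 0.85):
    """Serve a cached answer if any past question is 'similar enough'."""
    for cached_q, answer in cache.items():
        if SIMILARITY.get((question, cached_q), 0.0) >= threshold:
            return answer  # semantically close, but possibly logically opposite
    return None
```

Here `semantic_cache_lookup("was he killed?")` returns "Yes" from the cache, even though the logically correct answer to "Was he killed (by someone else)?" is "No." High cosine similarity, inverted meaning.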
The Solution: A Dual-Layer AI Architecture
To fix this, I engineered a two-step system:
Layer 1: The Architect (Offline Deep Thinking)
This is the heavy lifter. When a new puzzle is added to the database, a high-compute model (The Architect) kicks in. It doesn't talk to players. Instead, it spends 30+ seconds deconstructing the story to generate a Logic Profile.
It analyzes:
- What is the core trick?
- What is the physical evidence?
- What are the causal relationships?
It outputs a comprehensive JSON file containing the game's "metadata."
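To make this concrete, here is an illustrative shape for such a Logic Profile, using the classic answer to the turtle-soup puzzle from earlier. The field names are my guesses for this article, not TurtleNoir's actual schema:

```python
import json

# Illustrative Logic Profile -- field names are hypothetical.
logic_profile = {
    "core_trick": "The 'turtle soup' he once ate at sea was not turtle.",
    "physical_evidence": ["lifeboat log", "missing crewmate", "restaurant receipt"],
    "causal_chain": [
        "Shipwreck -> starvation",
        "Crew feeds the captain 'turtle soup' to keep him alive",
        "Years later he tastes real turtle soup and notices the difference",
        "Realization -> guilt -> suicide",
    ],
}
print(json.dumps(logic_profile, indent=2))
```

Because this is generated once per puzzle, the Architect can afford 30+ seconds and a reasoning-heavy model.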
Layer 2: The Host (Real-Time Response)
When players start a game, they interact with the Host.
The Host doesn't need to deduce the truth from scratch. It simply reads the Logic Profile created by the Architect. Since the heavy logical lifting is "pre-rendered," the Host can run on a faster, cheaper model, focusing solely on natural language interaction and roleplay.
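In practice this means the Host's system prompt is just a rendering of the profile plus strict output rules. A minimal sketch (function name and prompt wording are illustrative):

```python
def build_host_prompt(profile: dict) -> str:
    """Turn the Architect's pre-computed Logic Profile into a cheap system
    prompt. The Host model never has to re-deduce the trick itself."""
    return (
        "You are the Host of a lateral thinking puzzle.\n"
        f"Core trick (never reveal directly): {profile['core_trick']}\n"
        f"Causal chain: {'; '.join(profile['causal_chain'])}\n"
        "Answer every player question with only: Yes, No, or Irrelevant."
    )
```

The fast model's job shrinks from "solve the mystery" to "look up the pre-rendered logic and stay in character."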
How Pre-Caching Unlocked Advanced Features
This architecture didn't just save money; it enabled features that were previously impossible:
1. Seance Mode and "Epistemic Blind Spots"
In TurtleNoir, players can summon characters or objects in the story and talk to them.
The tricky part is perspective control. For example: does a dead character know exactly how they died? If they were attacked from behind, they should not know who did it.
The Architect pre-defines these "Epistemic Blind Spots" for each role inside the Logic Profile. The Host then follows that profile during roleplay, so each role only reveals what they could know from their own point of view.
This creates a Rashomon-style narrative: fragmented perspectives instead of omniscient answers.
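A blind spot is easy to enforce once it is pre-declared data rather than something the model must infer mid-roleplay. A toy sketch (the fact keys, role names, and canned line are all illustrative):

```python
# Ground truth the Architect extracted from the story.
STORY_FACTS = {"attacker_identity": "the first mate", "attack_location": "the deck"}

# Per-role Epistemic Blind Spots, pre-declared in the Logic Profile.
ROLES = {
    "victim":   {"blind_spots": {"attacker_identity"}},  # stabbed from behind
    "attacker": {"blind_spots": set()},
}

def answer_as(role: str, fact: str) -> str:
    """A summoned character may only speak to facts they could plausibly know."""
    if fact in ROLES[role]["blind_spots"]:
        return "I... I never saw."  # stays in character, reveals nothing
    return STORY_FACTS.get(fact, "I don't know.")
```

The Host consults the blind-spot set before the language model ever phrases a reply, so a leak would require violating explicit structure, not just subtle wording.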
2. Evidence Collection: Organized Randomness
Players can scratch off "Evidence Bags." These aren't random hallucinations. The Architect pre-generates:
- 1 Core Evidence (Points to the key truth)
- 3 Circumstantial Clues (Assist reasoning)
- 2 Red Herrings (Designed to mislead)
Real-time AI struggles to invent good Red Herrings on the fly. Offline generation ensures quality design.
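Since the evidence mix is fixed, it can be validated at generation time instead of trusted at play time. A sketch of what that check might look like (the bag contents are invented for illustration):

```python
# Example Architect output for one puzzle -- contents are illustrative.
evidence_bags = {
    "core":           ["lifeboat log mentioning a missing crewmate"],
    "circumstantial": ["old shipwreck report", "captain's scar", "faded crew photo"],
    "red_herrings":   ["unpaid restaurant bill", "love letter in his coat"],
}

def validate_evidence(bags: dict) -> bool:
    """Enforce the fixed mix: 1 core, 3 circumstantial, 2 red herrings."""
    counts = (len(bags["core"]), len(bags["circumstantial"]), len(bags["red_herrings"]))
    return counts == (1, 3, 2)
```

If the Architect's output fails validation, you can simply regenerate offline; there is no player waiting on the result.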
Immersion Beyond Text
As an indie dev, I also polished the frontend to match the backend logic:
- The Detective Workbench: A "Case File" tab collects unlocked clues, and players can tag evidence as "Critical," "Doubtful," or "Excluded."
- Dynamic Visuals: Using the character descriptions from the Architect, I use image generation models to create consistent character portraits without spoiling the gore or the twist.
Conclusion: Tech Serves the Experience
Building TurtleNoir taught me that AI-native apps risk being simple "Chatbot Wrappers" if they lack architectural depth.
By separating the Architect (Offline Reasoning) from the Host (Online Response) and using the Logic Profile as the bridge, we mimic the workflow of a human game designer. This allows the AI to handle a game that relies heavily on strict logic and information gaps.
Technology is just the tool. The goal is that late-night moment when you and your friends stay focused on the screen, piecing together a key character's testimony and feeling fully immersed in the story.
One More Thing:
The app is a PWA (Progressive Web App)—no download needed, perfect for mobile.
👉 Play the English Demo: turtlenoir.com
(Or check out the Chinese version: haiguitang.net)
I’d love to hear how other indie devs are structuring their AI apps in the comments!