rishabh pahwa
Why Your LLM Bot Forgets Everything

The "stateless microservice" mantra that has served backend teams for a decade breaks down in LLM-powered applications. Treating every LLM request as an independent, isolated transaction ignores the need for persistent, evolving context. The result is inflated token costs and a broken user experience.

Imagine you're building a customer support chatbot. A user asks: "My order #7890 is stuck, can you help?" Your API Gateway routes this to a stateless llm-processor microservice. This service pulls the order details from a database, adds them to the prompt, sends it to GPT-4, and returns a polite "I'm looking into order #7890."

The user then asks: "What's the estimated delivery date?"
If your architecture is purely stateless, that second request hits a new llm-processor instance, completely unaware of the previous interaction. It has no idea what "the estimated delivery date" refers to. It will likely respond with a generic "Please specify which order you're referring to," or worse, hallucinate.

This isn't just annoying; it's slow, expensive, and a waste of the user's patience. Every single turn of the conversation means:

  1. Re-fetching context: The system has to re-query databases for order #7890 details.
  2. Re-prompting: The LLM receives a prompt that likely needs to re-introduce previous context, consuming more tokens and increasing latency and cost.
  3. No conversational memory: The user experience is disjointed and frustrating. Your bot acts like it has severe amnesia. This drives user churn faster than any bug.

The Dedicated State Service: Your LLM's Memory Bank

A new generation of LLM architectures moves away from purely stateless services for core interaction flows. Instead, they introduce a dedicated State Service. This isn't just a database; it's an intelligent orchestrator of user-specific context, session history, and often, retrieved external information.

The core idea is to establish a persistent session context for each user interaction. When a user sends a query, the LLM Orchestrator service first retrieves relevant context from the State Service before composing the final prompt. After the LLM responds, the orchestrator updates the State Service with the latest turn, optionally summarizing or pruning older history.

Here's how it generally flows:

USER
  |
  V
[API Gateway]
  |
  V
[LLM Orchestrator] --- (User ID) ---> [State Service]
  |                                     ^      |
  | (Get Context)                       |      | (Store/Update Context)
  +-------------------------------------+      |
  |                                            |
  V (Context + Current Prompt)                 V (Session History, RAG Data, Preferences)
[LLM Provider] (e.g., OpenAI, Anthropic, OSS LLM)
  |
  V (LLM Response)
[LLM Orchestrator]
  |
  V (User Response)
USER

The State Service stores:

  • Conversation History: The raw turns of the conversation, potentially summarized.
  • User Preferences/Profile: Specific settings, roles, or persona details.
  • Retrieval Augmented Generation (RAG) Data: Documents, database records, or search results retrieved for the current session.
  • Intermediate Results: Partially completed tasks, user intentions.

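By doing this, the LLM Orchestrator can construct a lean, targeted prompt for the LLM instead of rebuilding context from scratch. As a rough estimate, this can cut token counts on subsequent turns by 50-80%, though the actual saving depends on conversation length and how aggressively older history is summarized. Fewer prompt tokens translate directly into lower API costs and faster response times.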
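The retrieve, compose, call, update loop described above can be sketched in a few lines. This is a minimal illustration, not a production design: `SessionState`, `InMemoryStateService`, `build_prompt`, and `handle_turn` are hypothetical names, the store is a plain dict standing in for a real State Service, and `call_llm` is whatever function wraps your LLM provider.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class SessionState:
    """What the orchestrator remembers about one session (assumed shape)."""
    history: list[tuple[str, str]] = field(default_factory=list)  # (role, text)
    preferences: dict[str, str] = field(default_factory=dict)
    rag_data: dict[str, str] = field(default_factory=dict)


class InMemoryStateService:
    """Stand-in for a real State Service: a dict keyed by session_id."""

    def __init__(self) -> None:
        self._sessions: dict[str, SessionState] = {}

    def get(self, session_id: str) -> SessionState:
        # Unknown sessions start with empty state.
        return self._sessions.setdefault(session_id, SessionState())

    def save(self, session_id: str, state: SessionState) -> None:
        self._sessions[session_id] = state


def build_prompt(state: SessionState, user_message: str) -> str:
    """Compose stored facts, prior turns, and the new message into one prompt."""
    lines = [f"[fact] {key}: {value}" for key, value in state.rag_data.items()]
    lines += [f"{role}: {text}" for role, text in state.history]
    lines.append(f"user: {user_message}")
    return "\n".join(lines)


def handle_turn(
    store: InMemoryStateService,
    session_id: str,
    user_message: str,
    call_llm: Callable[[str], str],
) -> str:
    state = store.get(session_id)                 # 1. get context
    prompt = build_prompt(state, user_message)    # 2. context + current prompt
    reply = call_llm(prompt)                      # 3. call the LLM provider
    state.history.append(("user", user_message))  # 4. store the latest turn
    state.history.append(("assistant", reply))
    store.save(session_id, state)
    return reply
```

With this in place, the second question from the order example ("What's the estimated delivery date?") reaches the model with the first turn, including the order number, already in the prompt.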

How Companies Handle Stateful LLM Interactions at Scale

Consider a platform like Intercom's Fin AI Bot or Zendesk's AI Agent Assist. Systems at this scale can't afford to rebuild context from scratch for every message across millions of conversations, so they need deliberate state management. What follows is a typical design for this class of system, not a description of either product's internals.

When a user initiates a chat, the system establishes a unique session_id. This session_id becomes the key for retrieving and storing conversational state in a dedicated, low-latency data store. Typical building blocks include:

  • Redis Enterprise for in-memory caching of active session data, providing sub-millisecond latency for context retrieval.
  • Amazon DynamoDB or Cassandra for more durable, sharded storage of full conversation histories, with an eviction policy for very old, inactive sessions.
  • Custom data structures within the State Service that intelligently summarize older conversation turns using an LLM itself (e.g., "Summarize the conversation so far for the LLM") to keep the active prompt window small and token-efficient.

They don't just dump raw text. They might store structured JSON objects representing key-value pairs of extracted entities (e.g., {"order_id": "7890", "issue": "delivery_delay"}) alongside the conversation history. This allows the orchestrator to quickly inject relevant, structured data into the prompt without re-parsing lengthy texts. This approach reduces the effective context window size passed to the LLM, directly saving compute and API costs, while maintaining a coherent conversation.
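A minimal sketch of both ideas, structured entities plus a rolling summary, might look like the following. Everything here is illustrative: `extract_entities` is a toy regex that only recognizes "order #<digits>" (a real system would more likely use an LLM or an NER model), and `summarize` is a caller-supplied function that would normally wrap an LLM call.

```python
import json
import re
from typing import Callable


def extract_entities(text: str) -> dict[str, str]:
    """Toy extractor (assumption): only finds 'order #<digits>'."""
    match = re.search(r"order #(\d+)", text, flags=re.IGNORECASE)
    return {"order_id": match.group(1)} if match else {}


def compact_history(
    turns: list[str],
    summary: str,
    keep_last: int,
    summarize: Callable[[str, list[str]], str],
) -> tuple[list[str], str]:
    """Fold everything except the newest `keep_last` turns into the summary."""
    if len(turns) <= keep_last:
        return turns, summary
    split = len(turns) - keep_last
    old, recent = turns[:split], turns[split:]
    return recent, summarize(summary, old)


def build_context(entities: dict[str, str], summary: str, turns: list[str]) -> str:
    """Structured facts first, then the summary, then the recent raw turns."""
    parts = [f"Known entities: {json.dumps(entities, sort_keys=True)}"]
    if summary:
        parts.append(f"Summary so far: {summary}")
    parts.extend(turns)
    return "\n".join(parts)
```

The design choice worth noting is that the entities live outside the summary. Even if the summarizer drops a detail, the order ID is still injected verbatim on every turn.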

What Most People Get Wrong

  1. Treating the State Service as just a Cache: This isn't temporary, easily discardable data. It's critical, active conversational context. A simple LRU cache is insufficient because it doesn't account for persistence, intelligent summarization, or the active lifecycle of a conversation. State needs to be durable enough to survive orchestrator restarts, and consistent enough that concurrent updates during a multi-turn operation don't overwrite each other.
  2. Storing Too Much, Unstructured State: Engineers often just dump the entire raw conversation history into the state store and replay all of it into every prompt. This quickly bloats the context window, leading to higher token costs and slower inference times. The State Service needs logic for:
    • Summarization: Periodically summarizing older parts of the conversation.
    • Pruning: Removing irrelevant or outdated information.
    • Structured Entity Extraction: Converting free-form text into key-value pairs (e.g., extracting order IDs, dates, user names) to provide concise, direct context.
  3. Lack of Distributed Coordination: In a scaled-out system, multiple LLM Orchestrator instances might try to read or update the same user's session state concurrently. Without proper distributed locks or optimistic concurrency controls, you can end up with race conditions, inconsistent state, or lost updates, making your bot "forget" recent turns.
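To make the third point concrete, here is a minimal sketch of optimistic concurrency control using a version number per session. `VersionedStore` and `append_turn` are hypothetical names, and the `threading.Lock` only stands in for the atomic conditional write that a real data store would provide (for example, a conditional write in DynamoDB or WATCH/MULTI in Redis).

```python
import threading


class VersionConflict(Exception):
    """Raised when another writer updated the session first."""


class VersionedStore:
    def __init__(self) -> None:
        # session_id -> (version, turns)
        self._data: dict[str, tuple[int, list[str]]] = {}
        self._lock = threading.Lock()  # simulates the store's atomic write

    def read(self, session_id: str) -> tuple[int, list[str]]:
        with self._lock:
            version, turns = self._data.get(session_id, (0, []))
            return version, list(turns)  # copy so callers can't mutate state

    def write_if_version(
        self, session_id: str, expected_version: int, turns: list[str]
    ) -> None:
        """Compare-and-set: write only if nobody else has written since the read."""
        with self._lock:
            current, _ = self._data.get(session_id, (0, []))
            if current != expected_version:
                raise VersionConflict(
                    f"expected v{expected_version}, found v{current}"
                )
            self._data[session_id] = (current + 1, list(turns))


def append_turn(
    store: VersionedStore, session_id: str, turn: str, max_retries: int = 5
) -> None:
    """Read, modify, conditionally write; re-read and retry on conflict."""
    for _ in range(max_retries):
        version, turns = store.read(session_id)
        turns.append(turn)
        try:
            store.write_if_version(session_id, version, turns)
            return
        except VersionConflict:
            continue  # someone else won the race; retry with fresh state
    raise VersionConflict(f"gave up after {max_retries} attempts")
```

A stale writer gets a `VersionConflict` instead of silently overwriting a newer turn, which is exactly the "lost update" that makes a bot forget what was just said.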

Interview Angle

When designing LLM-powered systems, interviewers will challenge your understanding of state management beyond simple caching.

"How would you handle state for a million concurrent users in a personalized LLM assistant?"
A strong answer goes beyond "use Redis." You'd discuss sharding the state service by user_id or session_id to distribute load and improve retrieval latency. Mention replication for high availability and durability. Crucially, talk about intelligent state management: implementing a policy for summarization and eviction (e.g., active sessions in-memory, older sessions in a persistent store like DynamoDB, with an LLM-powered summarizer pruning the context window dynamically). You'd discuss how to identify "inactive" sessions to move them to cheaper storage or expire them.
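If you want something concrete to whiteboard, the sharding and hot/cold tiering in that answer can be sketched as follows. This is an illustration under stated assumptions: `shard_for` and `TieredSessionStore` are hypothetical names, two dicts stand in for the in-memory tier and the persistent store, and timestamps are passed in explicitly so the behavior is easy to follow.

```python
import hashlib
from typing import Optional


def shard_for(session_id: str, num_shards: int) -> int:
    """Map a session to a shard with a stable hash.

    Python's built-in hash() is salted per process, so a digest is used instead.
    """
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class TieredSessionStore:
    """Active sessions in a hot tier, idle ones demoted to a cold tier."""

    def __init__(self, idle_seconds: float) -> None:
        self.idle_seconds = idle_seconds
        self.hot: dict[str, tuple[float, dict]] = {}  # id -> (last_seen, state)
        self.cold: dict[str, dict] = {}               # stand-in for durable store

    def put(self, session_id: str, state: dict, now: float) -> None:
        self.hot[session_id] = (now, state)
        self.cold.pop(session_id, None)

    def get(self, session_id: str, now: float) -> Optional[dict]:
        if session_id in self.hot:
            state = self.hot[session_id][1]
        elif session_id in self.cold:
            state = self.cold.pop(session_id)  # promote back on access
        else:
            return None
        self.hot[session_id] = (now, state)    # refresh last-seen time
        return state

    def demote_idle(self, now: float) -> int:
        """Move sessions idle for at least idle_seconds to the cold tier."""
        idle = [
            sid for sid, (seen, _) in self.hot.items()
            if now - seen >= self.idle_seconds
        ]
        for sid in idle:
            self.cold[sid] = self.hot.pop(sid)[1]
        return len(idle)
```

One trade-off to mention in the interview: plain modulo hashing remaps most sessions when the shard count changes, which is why consistent hashing is a common follow-up.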

"What are the trade-offs of storing full conversation history versus summarized history?"
Full History: Pros – complete context, no loss of nuance. Cons – high token cost, increased latency, storage bloat, hits LLM context window limits quickly. Good for debugging or very short, critical interactions.
Summarized History: Pros – significantly reduced token cost, faster inference, fits within smaller context windows. Cons – potential loss of nuance/detail, summarization itself consumes LLM tokens/compute, risk of "hallucinated summaries" if not carefully engineered. Good for long-running conversations where fine-grained detail isn't critical for every turn. The trade-off is often between token efficiency/latency and conversational coherence/accuracy.
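A back-of-the-envelope calculation makes the token side of this trade-off tangible. The model below is deliberately simple and every number in it is an assumption chosen for illustration: each past exchange costs a fixed number of tokens, the summarized variant keeps the last `keep_last` exchanges verbatim plus a fixed-size summary, and the cost of running the summarizer itself is not counted.

```python
def full_history_prompt_tokens(
    turns: int, exchange_tokens: int, message_tokens: int
) -> int:
    """Total prompt tokens over a conversation when every turn replays all history."""
    return sum(
        (t - 1) * exchange_tokens + message_tokens
        for t in range(1, turns + 1)
    )


def summarized_prompt_tokens(
    turns: int,
    exchange_tokens: int,
    message_tokens: int,
    keep_last: int,
    summary_tokens: int,
) -> int:
    """Total prompt tokens when older exchanges are replaced by a fixed-size summary."""
    total = 0
    for t in range(1, turns + 1):
        prior = t - 1
        recent = min(prior, keep_last)
        # The summary only appears once something has been folded into it.
        summary = summary_tokens if prior > keep_last else 0
        total += recent * exchange_tokens + summary + message_tokens
    return total


# Illustrative numbers only: 20 turns, 200 tokens per past exchange,
# 50 tokens per new message, last 4 exchanges kept, 150-token summary.
full = full_history_prompt_tokens(20, 200, 50)
compact = summarized_prompt_tokens(20, 200, 50, keep_last=4, summary_tokens=150)
print(full, compact)  # prints: 39000 17250
```

Full history grows quadratically with conversation length, while the summarized prompt levels off after the first few turns. Under these made-up numbers the gap is already more than 2x at 20 turns, and it widens as the conversation continues.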

"How does Retrieval Augmented Generation (RAG) fit into this state management?"
RAG isn't just a one-off query. The results of RAG (e.g., retrieved documents, database query outputs) become part of the session state. If a user asks about "order status" and your RAG system pulls order #7890's details, those details should be stored in the State Service. This ensures subsequent turns referencing "the order" can access those previously retrieved facts without hitting the RAG system again, further reducing latency and redundant work.
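As a minimal sketch of that idea, the helper below checks the session state before calling the retriever. The names are hypothetical: `get_order_facts` is not from any library, `session_state` is a plain dict, and `retrieve` is whatever function fronts your RAG pipeline or order database.

```python
from typing import Callable


def get_order_facts(
    session_state: dict,
    order_id: str,
    retrieve: Callable[[str], dict],
) -> dict:
    """Return facts for an order, retrieving them at most once per session."""
    cache = session_state.setdefault("rag_data", {})
    key = f"order:{order_id}"
    if key not in cache:
        cache[key] = retrieve(order_id)  # only the first turn pays this cost
    return cache[key]
```

One caveat: facts like delivery status can change mid-conversation, so a real implementation would probably also store a retrieval timestamp and refresh entries older than some threshold.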

Designing LLM applications successfully requires a fundamental shift from purely stateless paradigms to intelligent, distributed state management. Master this, and you'll build robust, cost-effective, and genuinely helpful AI experiences.


Want to level up your system design skills for LLM-powered applications? Book a 1:1 session with me on Topmate to dive deeper into these architectures and prepare for your next interview.


Want to Go Deeper?

I do 1:1 sessions on system design, backend architecture, and interview prep.
If you're preparing for a Staff/Senior role or cracking FAANG rounds — book a session here.
