Akhilesh

Posted on May 28

99. Build a Chatbot With Memory

#ai #productivity #beginners #python

You ask a chatbot: "What's the capital of France?"

It says: "Paris."

You ask: "What's the population there?"

It says: "Where?"

That's a stateless chatbot. Every message is treated as a completely new conversation. It has no idea what "there" refers to. It has no memory.

Real conversation doesn't work like this. Context carries forward. References accumulate. The chatbot needs to know what came before.

This post builds a chatbot with memory. One that knows what you said two messages ago, what topic you're discussing, and what decisions were made earlier.

What You'll Learn Here

Why LLMs are stateless and how to fake memory
The conversation history pattern: how it actually works
Context window limits and why they matter
Sliding window memory: keep the last N messages
Summary memory: compress old conversations
Entity memory: remember specific facts about the user
Building a full multi-turn chatbot with LangChain
Persisting memory across sessions

Why LLMs Are Stateless

Every time you call an LLM API, it starts fresh. It has zero memory of previous calls. The only context it has is what you put in the current prompt.

The trick that makes chatbots work: you include the entire conversation history in every prompt.

Turn 1:
  USER: What's the capital of France?
  → Send to LLM: "User: What's the capital of France?"
  → LLM replies: "Paris"

Turn 2:
  USER: What's the population there?
  → Send to LLM:
      "User: What's the capital of France?
       Assistant: Paris.
       User: What's the population there?"
  → LLM sees full context, knows "there" = Paris

Turn 3:
  → Send EVERYTHING from turns 1, 2, and now 3

Every message appends to a growing list. That list goes into every subsequent prompt. The LLM can refer back to it because it's in the current context.

Simple. But it has a hard limit: the context window.

The Context Window Problem

Every LLM has a maximum number of tokens it can process at once. GPT-3.5-turbo: 16k tokens. GPT-4: 128k tokens. LLaMA-7B: 4k tokens.

A long conversation fills up that window. When the conversation exceeds the limit, you can't just include everything. You need a strategy.

# Estimate token count (rough: 1 token ≈ 4 characters for English)
def estimate_tokens(text: str) -> int:
    return len(text) // 4

def estimate_conversation_tokens(messages: list) -> int:
    total = 0
    for msg in messages:
        total += estimate_tokens(msg['content'])
        total += 4   # overhead per message (role, formatting)
    return total

# Show how fast a conversation fills up
messages = []
example_turns = [
    ("user", "Tell me about machine learning."),
    ("assistant", "Machine learning is a field of artificial intelligence that enables computers to learn from data without being explicitly programmed. It includes supervised learning, where models are trained on labeled examples, unsupervised learning, where patterns are found without labels, and reinforcement learning, where agents learn through trial and error."),
    ("user", "What about deep learning specifically?"),
    ("assistant", "Deep learning is a subset of machine learning that uses neural networks with many layers. These networks learn hierarchical representations of data, making them especially powerful for images, audio, and text. The transformer architecture, introduced in 2017, has become the foundation for most modern deep learning systems."),
    ("user", "Can you give me examples of real applications?"),
    ("assistant", "Sure! Real applications include image classification in medical diagnosis, natural language processing for translation and chatbots, recommendation systems on Netflix and Spotify, fraud detection in banking, and autonomous driving. Deep learning powers most of these through pattern recognition at scale."),
]

print(f"{'Turn':<6} {'New tokens':<14} {'Total tokens':<14} {'% of 4k limit'}")
print("-" * 50)
for role, content in example_turns:
    messages.append({'role': role, 'content': content})
    total = estimate_conversation_tokens(messages)
    new   = estimate_tokens(content)
    print(f"{len(messages):<6} {new:<14} {total:<14} {total/4000:.1%}")

Output:

Turn   New tokens     Total tokens   % of 4k limit
--------------------------------------------------
1      12             16             0.4%
2      73             93             2.3%
3      13             110            2.8%
4      65             179            4.5%
5      15             198            5.0%
6      72             274            6.9%

A long conversation about a complex topic can easily hit 2000-3000 tokens. Add RAG context and system prompts, and you're at the limit fast.

Strategy 1: Sliding Window Memory

Keep only the last N messages. Simple and effective.

from collections import deque
from typing import List, Optional

class SlidingWindowChatbot:
    def __init__(self, model_pipeline, window_size: int = 10,
                 system_prompt: str = "You are a helpful assistant."):
        self.model         = model_pipeline
        self.window_size   = window_size  # max messages to keep
        self.system_prompt = system_prompt
        self.history       = deque(maxlen=window_size)

    def chat(self, user_message: str) -> str:
        # Add user message to history
        self.history.append({'role': 'user', 'content': user_message})

        # Build the prompt with history
        messages = [
            {'role': 'system', 'content': self.system_prompt}
        ] + list(self.history)

        # Call the model (using a simple text format for demo)
        prompt = self._format_prompt(messages)
        response = self.model(prompt)

        # Add assistant response to history
        self.history.append({'role': 'assistant', 'content': response})

        return response

    def _format_prompt(self, messages: List[dict]) -> str:
        formatted = ""
        for msg in messages:
            if msg['role'] == 'system':
                formatted += f"System: {msg['content']}\n\n"
            elif msg['role'] == 'user':
                formatted += f"Human: {msg['content']}\n"
            else:
                formatted += f"Assistant: {msg['content']}\n"
        formatted += "Assistant:"
        return formatted

    def get_history(self) -> list:
        return list(self.history)

    def clear(self):
        self.history.clear()
        print("Conversation history cleared.")

# Simulate a conversation (using a mock model for demo)
def mock_model(prompt: str) -> str:
    # In production: replace with real LLM call
    if "capital of france" in prompt.lower():
        return "The capital of France is Paris."
    elif "population" in prompt.lower() and "paris" in prompt.lower():
        return "Paris has a population of approximately 2.1 million in the city proper, and about 12 million in the greater metropolitan area."
    elif "famous landmark" in prompt.lower():
        return "Paris is famous for the Eiffel Tower, the Louvre Museum, Notre-Dame Cathedral, and the Arc de Triomphe."
    elif "eiffel tower" in prompt.lower():
        return "The Eiffel Tower was built between 1887 and 1889, designed by engineer Gustave Eiffel. It stands 330 meters tall."
    else:
        return "I understand. Could you tell me more?"

bot = SlidingWindowChatbot(mock_model, window_size=6)

# Simulate multi-turn conversation
turns = [
    "What's the capital of France?",
    "What's the population there?",
    "What are some famous landmarks in that city?",
    "Tell me more about the Eiffel Tower.",
    "When was it built?",
]

for user_input in turns:
    print(f"\nUser: {user_input}")
    response = bot.chat(user_input)
    print(f"Bot:  {response}")

print(f"\nHistory has {len(bot.get_history())} messages (max {bot.window_size})")

Output:

User: What's the capital of France?
Bot:  The capital of France is Paris.

User: What's the population there?
Bot:  Paris has a population of approximately 2.1 million in the city proper...

User: What are some famous landmarks in that city?
Bot:  Paris is famous for the Eiffel Tower, the Louvre Museum...

User: Tell me more about the Eiffel Tower.
Bot:  The Eiffel Tower was built between 1887 and 1889...

User: When was it built?
Bot:  I understand. Could you tell me more?

History has 6 messages (max 6)

The bot understands "there" (Paris) and "that city" (Paris) from context. The sliding window keeps the last 6 messages.

Strategy 2: Summary Memory

When history gets long, summarize old messages and keep recent ones in full.

class SummaryMemoryChatbot:
    def __init__(self, model_pipeline, summarizer_pipeline,
                 max_recent: int = 6, summary_threshold: int = 10,
                 system_prompt: str = "You are a helpful assistant."):
        self.model       = model_pipeline
        self.summarizer  = summarizer_pipeline
        self.max_recent  = max_recent
        self.threshold   = summary_threshold
        self.system      = system_prompt
        self.history     = []
        self.summary     = ""     # compressed memory of older turns

    def _maybe_summarize(self):
        if len(self.history) < self.threshold:
            return

        # Summarize the oldest half of history
        n_to_summarize = len(self.history) // 2
        old_messages   = self.history[:n_to_summarize]
        self.history   = self.history[n_to_summarize:]

        # Format old messages as text
        old_text = "\n".join([
            f"{m['role'].title()}: {m['content']}"
            for m in old_messages
        ])

        # Summarize (in production, call LLM to summarize)
        new_summary_input = f"{self.summary}\n\n{old_text}" if self.summary else old_text
        self.summary = self._summarize(new_summary_input)

        print(f"[Memory] Summarized {n_to_summarize} messages into summary")

    def _summarize(self, text: str) -> str:
        # In production: call LLM with a summarization prompt
        # Here: mock it
        return f"[Summary of earlier conversation: The user asked about France, Paris, its population (~2.1M), and Paris landmarks including the Eiffel Tower.]"

    def _format_prompt(self) -> str:
        parts = [f"System: {self.system}\n"]

        if self.summary:
            parts.append(f"[Earlier conversation summary]: {self.summary}\n")

        for msg in self.history[-self.max_recent:]:
            role = "Human" if msg['role'] == 'user' else "Assistant"
            parts.append(f"{role}: {msg['content']}")

        parts.append("Assistant:")
        return "\n".join(parts)

    def chat(self, user_message: str) -> str:
        self.history.append({'role': 'user', 'content': user_message})
        self._maybe_summarize()

        prompt   = self._format_prompt()
        response = self.model(prompt)
        self.history.append({'role': 'assistant', 'content': response})

        return response

    def memory_status(self):
        print(f"Summary: {'yes' if self.summary else 'none'}")
        print(f"Recent messages in full: {min(len(self.history), self.max_recent)}")
        print(f"Total history: {len(self.history)}")

summary_bot = SummaryMemoryChatbot(mock_model, None, max_recent=6, summary_threshold=8)

for user_input in turns * 2:  # repeat to trigger summarization
    response = summary_bot.chat(user_input)

summary_bot.memory_status()

Strategy 3: Entity Memory

Extract and store specific facts about the user or conversation entities.

import re
from typing import Dict

class EntityMemoryChatbot:
    def __init__(self, model_pipeline,
                 system_prompt: str = "You are a helpful assistant."):
        self.model   = model_pipeline
        self.system  = system_prompt
        self.history = []
        self.entities: Dict[str, str] = {}   # entity store

    def _extract_entities(self, message: str):
        # Simplified entity extraction (in production: use NER model or LLM)
        patterns = {
            'name':     r"(?:my name is|I am|I'm)\s+([A-Z][a-z]+)",
            'location': r"(?:I live in|I'm from|I'm in)\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+)?)",
            'job':      r"(?:I am a|I work as a|I'm a)\s+([a-z]+(?:\s+[a-z]+)?)",
            'topic':    r"(?:I want to learn about|I'm studying|I need help with)\s+([a-z\s]+)"
        }

        for entity_type, pattern in patterns.items():
            match = re.search(pattern, message, re.IGNORECASE)
            if match:
                self.entities[entity_type] = match.group(1).strip()

    def _build_entity_context(self) -> str:
        if not self.entities:
            return ""
        lines = ["Known facts about the user:"]
        for entity, value in self.entities.items():
            lines.append(f"  - {entity}: {value}")
        return "\n".join(lines)

    def _format_prompt(self) -> str:
        parts = [f"System: {self.system}"]

        entity_ctx = self._build_entity_context()
        if entity_ctx:
            parts.append(entity_ctx)

        for msg in self.history[-8:]:
            role = "Human" if msg['role'] == 'user' else "Assistant"
            parts.append(f"{role}: {msg['content']}")

        parts.append("Assistant:")
        return "\n".join(parts)

    def chat(self, user_message: str) -> str:
        self._extract_entities(user_message)
        self.history.append({'role': 'user', 'content': user_message})

        prompt   = self._format_prompt()
        response = self.model(prompt)
        self.history.append({'role': 'assistant', 'content': response})

        return response

# Test entity memory
def entity_mock_model(prompt: str) -> str:
    if "name" in prompt.lower() and "Alex" in prompt:
        return "Nice to meet you, Alex!"
    elif "Alex" in prompt and "recommend" in prompt.lower():
        return "Based on your interest in machine learning, Alex, I'd recommend starting with Python and scikit-learn."
    elif "course" in prompt.lower():
        return "For machine learning, the Andrew Ng Coursera course is excellent for beginners."
    else:
        return "Tell me more about what you'd like to learn."

entity_bot = EntityMemoryChatbot(entity_mock_model)

conversations = [
    "Hi, my name is Alex.",
    "I want to learn about machine learning.",
    "Can you recommend something?",
    "Are there any courses?",
]

for user_input in conversations:
    print(f"\nUser: {user_input}")
    response = entity_bot.chat(user_input)
    print(f"Bot:  {response}")

print(f"\nExtracted entities: {entity_bot.entities}")

Output:

User: Hi, my name is Alex.
Bot:  Nice to meet you, Alex!

User: I want to learn about machine learning.
Bot:  Tell me more about what you'd like to learn.

User: Can you recommend something?
Bot:  Based on your interest in machine learning, Alex, I'd recommend starting with Python and scikit-learn.

User: Are there any courses?
Bot:  For machine learning, the Andrew Ng Coursera course is excellent for beginners.

Extracted entities: {'name': 'Alex', 'topic': 'machine learning'}

The bot remembers the user's name and topic across all turns.

Full Chatbot With the OpenAI API

import openai
import json
from datetime import datetime

class ProductionChatbot:
    def __init__(
        self,
        system_prompt: str = "You are a helpful AI assistant.",
        model: str = "gpt-3.5-turbo",
        max_history: int = 20,
        max_tokens: int = 500,
        temperature: float = 0.7
    ):
        self.client      = openai.OpenAI()
        self.model       = model
        self.max_history = max_history
        self.max_tokens  = max_tokens
        self.temperature = temperature
        self.history     = []
        self.system      = system_prompt
        self.created_at  = datetime.now()

    def chat(self, user_message: str) -> str:
        self.history.append({'role': 'user', 'content': user_message})

        # Trim history if too long
        if len(self.history) > self.max_history:
            self.history = self.history[-self.max_history:]

        # Build message list for API
        messages = [
            {'role': 'system', 'content': self.system}
        ] + self.history

        # Call API
        response = self.client.chat.completions.create(
            model=self.model,
            messages=messages,
            max_tokens=self.max_tokens,
            temperature=self.temperature,
        )

        assistant_message = response.choices[0].message.content
        self.history.append({'role': 'assistant', 'content': assistant_message})

        return assistant_message

    def save_conversation(self, filepath: str):
        data = {
            'created_at': self.created_at.isoformat(),
            'saved_at':   datetime.now().isoformat(),
            'model':      self.model,
            'system':     self.system,
            'messages':   self.history
        }
        with open(filepath, 'w') as f:
            json.dump(data, f, indent=2)
        print(f"Saved {len(self.history)} messages to {filepath}")

    def load_conversation(self, filepath: str):
        with open(filepath, 'r') as f:
            data = json.load(f)
        self.history = data['messages']
        self.system  = data.get('system', self.system)
        print(f"Loaded {len(self.history)} messages from {filepath}")

    def reset(self):
        self.history = []
        print("Conversation reset.")

    def get_stats(self) -> dict:
        n_user      = sum(1 for m in self.history if m['role'] == 'user')
        n_assistant = sum(1 for m in self.history if m['role'] == 'assistant')
        total_chars = sum(len(m['content']) for m in self.history)

        return {
            'turns':            n_user,
            'total_messages':   len(self.history),
            'estimated_tokens': total_chars // 4,
            'history_depth':    len(self.history)
        }

# Usage
# bot = ProductionChatbot(
#     system_prompt="You are a helpful ML tutor specializing in practical examples.",
#     model="gpt-3.5-turbo",
#     max_history=20
# )
# response = bot.chat("Explain overfitting to me.")
# print(response)
# bot.save_conversation('session_001.json')
print("ProductionChatbot ready (requires OPENAI_API_KEY)")

LangChain Memory: The Easy Way

from langchain.memory import ConversationBufferMemory, ConversationSummaryMemory
from langchain.chains import ConversationChain
from langchain_community.llms import HuggingFacePipeline
from transformers import pipeline as hf_pipeline

# Create LLM
gen_pipe = hf_pipeline('text-generation', model='gpt2', max_new_tokens=100)
llm = HuggingFacePipeline(pipeline=gen_pipe)

# Buffer memory: keeps all messages
buffer_memory = ConversationBufferMemory()

# Summary memory: automatically summarizes when too long
# summary_memory = ConversationSummaryMemory(llm=llm)

# Build conversation chain
conversation = ConversationChain(
    llm=llm,
    memory=buffer_memory,
    verbose=False
)

# Chat
result = conversation.predict(input="Hello, my name is Alex.")
print(f"Bot: {result[:100]}...")

result = conversation.predict(input="What is my name?")
print(f"Bot: {result[:100]}...")

# Inspect memory
print(f"\nMemory buffer:\n{buffer_memory.buffer}")

Persisting Memory Across Sessions

import json
import os

class PersistentChatbot:
    def __init__(self, model_pipeline, session_id: str,
                 storage_dir: str = './chat_sessions',
                 max_history: int = 50):
        self.model       = model_pipeline
        self.session_id  = session_id
        self.storage_dir = storage_dir
        self.max_history = max_history
        self.history     = []
        self.metadata    = {}

        os.makedirs(storage_dir, exist_ok=True)
        self._load_session()

    def _session_path(self) -> str:
        return os.path.join(self.storage_dir, f"{self.session_id}.json")

    def _load_session(self):
        path = self._session_path()
        if os.path.exists(path):
            with open(path, 'r') as f:
                data = json.load(f)
            self.history  = data.get('history', [])
            self.metadata = data.get('metadata', {})
            print(f"Loaded session '{self.session_id}' with {len(self.history)} messages")
        else:
            print(f"New session '{self.session_id}' started")

    def _save_session(self):
        data = {
            'session_id':   self.session_id,
            'last_updated': datetime.now().isoformat(),
            'history':      self.history,
            'metadata':     self.metadata
        }
        with open(self._session_path(), 'w') as f:
            json.dump(data, f, indent=2)

    def chat(self, user_message: str) -> str:
        self.history.append({'role': 'user', 'content': user_message})

        if len(self.history) > self.max_history:
            self.history = self.history[-self.max_history:]

        response = self.model(self._format_prompt())
        self.history.append({'role': 'assistant', 'content': response})
        self._save_session()

        return response

    def _format_prompt(self) -> str:
        parts = []
        for msg in self.history[-10:]:
            role = "Human" if msg['role'] == 'user' else "Assistant"
            parts.append(f"{role}: {msg['content']}")
        parts.append("Assistant:")
        return "\n".join(parts)

    def list_sessions(self) -> list:
        sessions = []
        for f in os.listdir(self.storage_dir):
            if f.endswith('.json'):
                sessions.append(f.replace('.json', ''))
        return sessions

# Usage
persistent_bot = PersistentChatbot(mock_model, session_id='user_alex_001')
persistent_bot.chat("What's the capital of France?")
persistent_bot.chat("What's the population there?")

print(f"\nSaved sessions: {persistent_bot.list_sessions()}")
print(f"History length: {len(persistent_bot.history)} messages")

Chatbot Quality Checklist

checklist = {
    "Memory management": [
        "Does the bot remember context from 5+ turns ago?",
        "Does it handle coreferences correctly? ('there', 'it', 'they')",
        "Does it avoid repeating information the user already gave?"
    ],
    "Context window": [
        "Does it handle very long conversations without breaking?",
        "Is there a graceful fallback when history is too long?",
        "Are summarized messages accurate and not lossy?"
    ],
    "Conversation quality": [
        "Does it stay on topic through the conversation?",
        "Does it refer to earlier decisions correctly?",
        "Does it handle topic switches gracefully?"
    ],
    "Persistence": [
        "Does it save conversations for later use?",
        "Can it resume from a previous session?",
        "Is the storage format readable and debuggable?"
    ],
    "Edge cases": [
        "What happens if the user asks about something not in memory?",
        "What happens if the user contradicts themselves?",
        "Does it handle very short or very long user messages?"
    ]
}

for category, items in checklist.items():
    print(f"\n{category}:")
    for item in items:
        print(f"  [ ] {item}")

Quick Cheat Sheet

Memory type	When to use	How it works
Buffer (all history)	Short conversations	Keep all messages, pass everything
Sliding window	Medium conversations	Keep last N messages only
Summary memory	Long conversations	Summarize old messages, keep recent in full
Entity memory	User-specific facts	Extract and store named entities
Persistent memory	Multi-session chatbots	Save/load from disk or database

Pattern	Code
Add to history	`history.append({'role': 'user', 'content': msg})`
Trim history	`history = history[-max_size:]`
Build messages	`[{'role': 'system', 'content': system}] + history`
Save session	`json.dump({'history': history}, f)`
Load session	`history = json.load(f)['history']`
LangChain buffer	`ConversationBufferMemory()`
LangChain summary	`ConversationSummaryMemory(llm=llm)`

Practice Challenges

Level 1:
Build a SlidingWindowChatbot that talks to GPT-2 locally. Have a 10-turn conversation about a topic of your choice. Print the full history at the end. Verify the bot correctly references things from earlier turns.

Level 2:
Implement SummaryMemoryChatbot with a real summarization call. After every 8 turns, summarize the first half using a small T5 model. Test with a 20-turn conversation. Print the summary after it triggers. Is the summary accurate?

Level 3:
Build PersistentChatbot that stores conversations to disk. Start a conversation, close it, restart the program, load the session, and continue the conversation. Verify the bot remembers what was said in the previous session. Add a /history command that prints a summary of previous sessions.

References

Final post, Post 100: OpenAI API: Build With GPT-4. API setup, chat completions, function calling, streaming, and cost management. The last post in the series wraps everything together.

DEV Community

99. Build a Chatbot With Memory

What You'll Learn Here

Why LLMs Are Stateless

The Context Window Problem

Strategy 1: Sliding Window Memory

Strategy 2: Summary Memory

Strategy 3: Entity Memory

Full Chatbot With the OpenAI API

LangChain Memory: The Easy Way

Persisting Memory Across Sessions

Chatbot Quality Checklist

Quick Cheat Sheet

Practice Challenges

References

Top comments (0)