You ask a chatbot: "What's the capital of France?"
It says: "Paris."
You ask: "What's the population there?"
It says: "Where?"
That's a stateless chatbot. Every message is treated as a completely new conversation. It has no idea what "there" refers to. It has no memory.
Real conversation doesn't work like this. Context carries forward. References accumulate. The chatbot needs to know what came before.
This post builds a chatbot with memory. One that knows what you said two messages ago, what topic you're discussing, and what decisions were made earlier.
What You'll Learn Here
- Why LLMs are stateless and how to fake memory
- The conversation history pattern: how it actually works
- Context window limits and why they matter
- Sliding window memory: keep the last N messages
- Summary memory: compress old conversations
- Entity memory: remember specific facts about the user
- Building a full multi-turn chatbot with the OpenAI API and LangChain
- Persisting memory across sessions
Why LLMs Are Stateless
Every time you call an LLM API, it starts fresh. It has zero memory of previous calls. The only context it has is what you put in the current prompt.
The trick that makes chatbots work: you include the entire conversation history in every prompt.
Turn 1:
USER: What's the capital of France?
→ Send to LLM: "User: What's the capital of France?"
→ LLM replies: "Paris"
Turn 2:
USER: What's the population there?
→ Send to LLM:
"User: What's the capital of France?
Assistant: Paris.
User: What's the population there?"
→ LLM sees full context, knows "there" = Paris
Turn 3:
→ Send EVERYTHING from turns 1, 2, and now 3
Every message appends to a growing list. That list goes into every subsequent prompt. The LLM can refer back to it because it's in the current context.
Simple. But it has a hard limit: the context window.
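As a minimal sketch of the pattern above, each turn appends to a list and the whole list is rendered into the next prompt. The names `build_prompt`, `take_turn`, and `fake_llm` are illustrative assumptions, not part of any library; `fake_llm` stands in for a real model call.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def build_prompt(history: List[Message]) -> str:
    """Render every stored message, oldest first, into one prompt string."""
    lines = [f"{m['role'].title()}: {m['content']}" for m in history]
    lines.append("Assistant:")  # cue the model to write the next reply
    return "\n".join(lines)


def take_turn(history: List[Message], user_text: str,
              llm: Callable[[str], str]) -> str:
    """Append the user message, call the model with the full history, store the reply."""
    history.append({"role": "user", "content": user_text})
    reply = llm(build_prompt(history))
    history.append({"role": "assistant", "content": reply})
    return reply


def fake_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real model: reports how many lines it was shown.
    return f"(saw {len(prompt.splitlines())} lines)"


history: List[Message] = []
take_turn(history, "What's the capital of France?", fake_llm)
take_turn(history, "What's the population there?", fake_llm)
print(build_prompt(history))
```

The second call sees more lines than the first because the first exchange is replayed in the prompt. That replay is the only "memory" the model has.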
The Context Window Problem
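Every LLM has a maximum number of tokens it can process at once, and the limit varies by model and version. For example, GPT-3.5-turbo supports 16k tokens, GPT-4 Turbo supports 128k (the original GPT-4 had 8k), and Llama 2 7B supports 4k (the original LLaMA-7B had 2k).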
A long conversation fills up that window. When the conversation exceeds the limit, you can't just include everything. You need a strategy.
# Estimate token count (rough: 1 token ≈ 4 characters for English)
def estimate_tokens(text: str) -> int:
    return len(text) // 4

def estimate_conversation_tokens(messages: list) -> int:
    total = 0
    for msg in messages:
        total += estimate_tokens(msg['content'])
        total += 4  # overhead per message (role, formatting)
    return total

# Show how fast a conversation fills up
messages = []
example_turns = [
    ("user", "Tell me about machine learning."),
    ("assistant", "Machine learning is a field of artificial intelligence that enables computers to learn from data without being explicitly programmed. It includes supervised learning, where models are trained on labeled examples, unsupervised learning, where patterns are found without labels, and reinforcement learning, where agents learn through trial and error."),
    ("user", "What about deep learning specifically?"),
    ("assistant", "Deep learning is a subset of machine learning that uses neural networks with many layers. These networks learn hierarchical representations of data, making them especially powerful for images, audio, and text. The transformer architecture, introduced in 2017, has become the foundation for most modern deep learning systems."),
    ("user", "Can you give me examples of real applications?"),
    ("assistant", "Sure! Real applications include image classification in medical diagnosis, natural language processing for translation and chatbots, recommendation systems on Netflix and Spotify, fraud detection in banking, and autonomous driving. Deep learning powers most of these through pattern recognition at scale."),
]

print(f"{'Turn':<6} {'New tokens':<14} {'Total tokens':<14} {'% of 4k limit'}")
print("-" * 50)
for role, content in example_turns:
    messages.append({'role': role, 'content': content})
    total = estimate_conversation_tokens(messages)
    new = estimate_tokens(content)
    print(f"{len(messages):<6} {new:<14} {total:<14} {total/4000:.1%}")
Output (approximate):
Turn   New tokens     Total tokens   % of 4k limit
--------------------------------------------------
1      12             16             0.4%
2      73             93             2.3%
3      13             110            2.8%
4      65             179            4.5%
5      15             198            5.0%
6      72             274            6.9%
A long conversation about a complex topic can easily hit 2000-3000 tokens. Add RAG context and system prompts, and you're at the limit fast.
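One way to act on that estimate is to trim by token budget rather than by message count. The sketch below reuses the same rough 4-characters-per-token heuristic. `trim_to_budget` and its parameters are hypothetical names, and a real tokenizer would give different counts.

```python
from typing import Dict, List

Message = Dict[str, str]


def estimate_tokens(text: str) -> int:
    # Rough heuristic only: about 4 characters per token for English text.
    return len(text) // 4


def trim_to_budget(messages: List[Message], max_tokens: int,
                   per_message_overhead: int = 4) -> List[Message]:
    """Keep the newest messages whose estimated total fits within max_tokens."""
    kept: List[Message] = []
    used = 0
    for msg in reversed(messages):  # walk from newest to oldest
        cost = estimate_tokens(msg["content"]) + per_message_overhead
        if used + cost > max_tokens:
            break  # this message and everything older is dropped
        kept.append(msg)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


demo = [
    {"role": "user", "content": "a" * 40},       # 10 + 4 = 14 estimated tokens
    {"role": "assistant", "content": "b" * 80},  # 20 + 4 = 24
    {"role": "user", "content": "c" * 20},       # 5 + 4 = 9
]
trimmed = trim_to_budget(demo, max_tokens=35)
print([m["content"][0] for m in trimmed])
```

With a budget of 35 estimated tokens, the two newest messages fit (9 + 24 = 33) and the oldest is dropped. In practice you would also reserve room for the system prompt and the model's reply.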
Strategy 1: Sliding Window Memory
Keep only the last N messages. Simple and effective.
from collections import deque
from typing import List, Optional

class SlidingWindowChatbot:
    def __init__(self, model_pipeline, window_size: int = 10,
                 system_prompt: str = "You are a helpful assistant."):
        self.model = model_pipeline
        self.window_size = window_size  # max messages to keep
        self.system_prompt = system_prompt
        self.history = deque(maxlen=window_size)

    def chat(self, user_message: str) -> str:
        # Add user message to history
        self.history.append({'role': 'user', 'content': user_message})
        # Build the prompt with history
        messages = [
            {'role': 'system', 'content': self.system_prompt}
        ] + list(self.history)
        # Call the model (using a simple text format for demo)
        prompt = self._format_prompt(messages)
        response = self.model(prompt)
        # Add assistant response to history
        self.history.append({'role': 'assistant', 'content': response})
        return response

    def _format_prompt(self, messages: List[dict]) -> str:
        formatted = ""
        for msg in messages:
            if msg['role'] == 'system':
                formatted += f"System: {msg['content']}\n\n"
            elif msg['role'] == 'user':
                formatted += f"Human: {msg['content']}\n"
            else:
                formatted += f"Assistant: {msg['content']}\n"
        formatted += "Assistant:"
        return formatted

    def get_history(self) -> list:
        return list(self.history)

    def clear(self):
        self.history.clear()
        print("Conversation history cleared.")
# Simulate a conversation (using a mock model for demo)
def mock_model(prompt: str) -> str:
    # In production: replace with a real LLM call.
    # Intent is read from the latest user turn; references such as "there"
    # are resolved against the whole prompt (the conversation history).
    # Matching the whole prompt for intent would make the first branch fire
    # on every turn while "capital of France" is still in the window.
    text = prompt.lower()
    last_user = text.rsplit("human:", 1)[-1]
    if "capital of france" in last_user:
        return "The capital of France is Paris."
    elif "population" in last_user and "paris" in text:
        return "Paris has a population of approximately 2.1 million in the city proper, and about 12 million in the greater metropolitan area."
    elif "famous landmark" in last_user:
        return "Paris is famous for the Eiffel Tower, the Louvre Museum, Notre-Dame Cathedral, and the Arc de Triomphe."
    elif "eiffel tower" in last_user:
        return "The Eiffel Tower was built between 1887 and 1889, designed by engineer Gustave Eiffel. It stands 330 meters tall."
    else:
        return "I understand. Could you tell me more?"

bot = SlidingWindowChatbot(mock_model, window_size=6)

# Simulate multi-turn conversation
turns = [
    "What's the capital of France?",
    "What's the population there?",
    "What are some famous landmarks in that city?",
    "Tell me more about the Eiffel Tower.",
    "When was it built?",
]
for user_input in turns:
    print(f"\nUser: {user_input}")
    response = bot.chat(user_input)
    print(f"Bot: {response}")
print(f"\nHistory has {len(bot.get_history())} messages (max {bot.window_size})")
Output:
User: What's the capital of France?
Bot: The capital of France is Paris.
User: What's the population there?
Bot: Paris has a population of approximately 2.1 million in the city proper...
User: What are some famous landmarks in that city?
Bot: Paris is famous for the Eiffel Tower, the Louvre Museum...
User: Tell me more about the Eiffel Tower.
Bot: The Eiffel Tower was built between 1887 and 1889...
User: When was it built?
Bot: I understand. Could you tell me more?
History has 6 messages (max 6)
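The bot resolves "there" to Paris because the earlier answer is still in the prompt. The last turn shows the limit of the keyword mock: it has no rule for "When was it built?", so it falls back to the default reply, where a real LLM would resolve "it" from the history. The sliding window keeps the last 6 messages.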
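One thing to watch: `deque(maxlen=N)` counts individual messages, so appending a user message to a full window leaves an assistant reply as the oldest kept message, with its question already evicted. A possible variation, sketched here with a hypothetical `last_n_turns` helper, trims on turn boundaries instead so the window always starts with a user message.

```python
from typing import Dict, List

Message = Dict[str, str]


def last_n_turns(history: List[Message], n_turns: int) -> List[Message]:
    """Return the messages from the n_turns-th most recent user message onward."""
    if n_turns <= 0:
        return []
    user_indices = [i for i, m in enumerate(history) if m["role"] == "user"]
    if len(user_indices) <= n_turns:
        return list(history)  # fewer turns than the window: keep everything
    start = user_indices[-n_turns]  # index of the oldest user message to keep
    return history[start:]


history = [
    {"role": "user", "content": "q1"},
    {"role": "assistant", "content": "a1"},
    {"role": "user", "content": "q2"},
    {"role": "assistant", "content": "a2"},
    {"role": "user", "content": "q3"},  # reply still pending
]
window = last_n_turns(history, n_turns=2)
print([m["content"] for m in window])
```

The trade-off is that the window size is now expressed in turns, so the message count (and token count) per window varies a little more than with a fixed `maxlen`.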
Strategy 2: Summary Memory
When history gets long, summarize old messages and keep recent ones in full.
class SummaryMemoryChatbot:
    def __init__(self, model_pipeline, summarizer_pipeline,
                 max_recent: int = 6, summary_threshold: int = 10,
                 system_prompt: str = "You are a helpful assistant."):
        self.model = model_pipeline
        self.summarizer = summarizer_pipeline
        self.max_recent = max_recent
        self.threshold = summary_threshold
        self.system = system_prompt
        self.history = []
        self.summary = ""  # compressed memory of older turns

    def _maybe_summarize(self):
        if len(self.history) < self.threshold:
            return
        # Summarize the oldest half of history
        n_to_summarize = len(self.history) // 2
        old_messages = self.history[:n_to_summarize]
        self.history = self.history[n_to_summarize:]
        # Format old messages as text
        old_text = "\n".join([
            f"{m['role'].title()}: {m['content']}"
            for m in old_messages
        ])
        # Summarize (in production, call LLM to summarize)
        new_summary_input = f"{self.summary}\n\n{old_text}" if self.summary else old_text
        self.summary = self._summarize(new_summary_input)
        print(f"[Memory] Summarized {n_to_summarize} messages into summary")

    def _summarize(self, text: str) -> str:
        # In production: call LLM with a summarization prompt
        # Here: mock it
        return "[Summary of earlier conversation: The user asked about France, Paris, its population (~2.1M), and Paris landmarks including the Eiffel Tower.]"

    def _format_prompt(self) -> str:
        parts = [f"System: {self.system}\n"]
        if self.summary:
            parts.append(f"[Earlier conversation summary]: {self.summary}\n")
        for msg in self.history[-self.max_recent:]:
            role = "Human" if msg['role'] == 'user' else "Assistant"
            parts.append(f"{role}: {msg['content']}")
        parts.append("Assistant:")
        return "\n".join(parts)

    def chat(self, user_message: str) -> str:
        self.history.append({'role': 'user', 'content': user_message})
        self._maybe_summarize()
        prompt = self._format_prompt()
        response = self.model(prompt)
        self.history.append({'role': 'assistant', 'content': response})
        return response

    def memory_status(self):
        print(f"Summary: {'yes' if self.summary else 'none'}")
        print(f"Recent messages in full: {min(len(self.history), self.max_recent)}")
        print(f"Total history: {len(self.history)}")

summary_bot = SummaryMemoryChatbot(mock_model, None, max_recent=6, summary_threshold=8)
for user_input in turns * 2:  # repeat to trigger summarization
    response = summary_bot.chat(user_input)
summary_bot.memory_status()
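The `_summarize` method above is mocked. A real version might build a prompt like the one below and pass it to whatever model you use. This is only a sketch: `build_summary_prompt`, `summarize_with`, the prompt wording, and the `llm` callable are all assumptions for illustration, not part of any library.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def build_summary_prompt(previous_summary: str, old_messages: List[Message],
                         max_words: int = 80) -> str:
    """Compose a prompt that folds old messages into a running summary."""
    transcript = "\n".join(
        f"{m['role'].title()}: {m['content']}" for m in old_messages
    )
    parts = [
        f"Summarize the conversation below in at most {max_words} words.",
        "Keep names, numbers, and decisions. Do not add new facts.",
    ]
    if previous_summary:
        parts.append(f"Existing summary:\n{previous_summary}")
    parts.append(f"New messages:\n{transcript}")
    parts.append("Updated summary:")
    return "\n\n".join(parts)


def summarize_with(llm: Callable[[str], str], previous_summary: str,
                   old_messages: List[Message]) -> str:
    """Call the supplied model (any str -> str callable) and tidy its output."""
    return llm(build_summary_prompt(previous_summary, old_messages)).strip()


def echo_llm(prompt: str) -> str:
    # Stand-in model for demonstration only; swap in a real LLM call.
    return " User asked about Paris. "


old = [
    {"role": "user", "content": "What's the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]
print(build_summary_prompt("", old))
print(summarize_with(echo_llm, "", old))
```

Passing the existing summary back in each time keeps the summary cumulative, which mirrors how `_maybe_summarize` concatenates `self.summary` with the newly evicted messages.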
Strategy 3: Entity Memory
Extract and store specific facts about the user or conversation entities.
import re
from typing import Dict

class EntityMemoryChatbot:
    def __init__(self, model_pipeline,
                 system_prompt: str = "You are a helpful assistant."):
        self.model = model_pipeline
        self.system = system_prompt
        self.history = []
        self.entities: Dict[str, str] = {}  # entity store

    def _extract_entities(self, message: str):
        # Simplified entity extraction (in production: use NER model or LLM).
        # Only the trigger phrase is case-insensitive, via the scoped (?i:...) flag.
        # A global re.IGNORECASE would also relax [A-Z] in the capture group,
        # so "I'm tired" would be stored as the name "tired".
        patterns = {
            'name': r"(?i:my name is|I am|I'm)\s+([A-Z][a-z]+)",
            'location': r"(?i:I live in|I'm from|I'm in)\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+)?)",
            'job': r"(?i:I am a|I work as a|I'm a)\s+([a-z]+(?:\s+[a-z]+)?)",
            'topic': r"(?i:I want to learn about|I'm studying|I need help with)\s+([a-z\s]+)"
        }
        for entity_type, pattern in patterns.items():
            match = re.search(pattern, message)
            if match:
                self.entities[entity_type] = match.group(1).strip()

    def _build_entity_context(self) -> str:
        if not self.entities:
            return ""
        lines = ["Known facts about the user:"]
        for entity, value in self.entities.items():
            lines.append(f"  - {entity}: {value}")
        return "\n".join(lines)

    def _format_prompt(self) -> str:
        parts = [f"System: {self.system}"]
        entity_ctx = self._build_entity_context()
        if entity_ctx:
            parts.append(entity_ctx)
        for msg in self.history[-8:]:
            role = "Human" if msg['role'] == 'user' else "Assistant"
            parts.append(f"{role}: {msg['content']}")
        parts.append("Assistant:")
        return "\n".join(parts)

    def chat(self, user_message: str) -> str:
        self._extract_entities(user_message)
        self.history.append({'role': 'user', 'content': user_message})
        prompt = self._format_prompt()
        response = self.model(prompt)
        self.history.append({'role': 'assistant', 'content': response})
        return response
# Test entity memory
def entity_mock_model(prompt: str) -> str:
    # Intent is read from the latest user turn; stored facts (the name)
    # are read from the whole prompt. Checking the whole prompt for "name"
    # would match the entity context on every turn.
    last_user = prompt.rsplit("Human:", 1)[-1].lower()
    if "my name is" in last_user and "Alex" in prompt:
        return "Nice to meet you, Alex!"
    elif "recommend" in last_user and "Alex" in prompt:
        return "Based on your interest in machine learning, Alex, I'd recommend starting with Python and scikit-learn."
    elif "course" in last_user:
        return "For machine learning, the Andrew Ng Coursera course is excellent for beginners."
    else:
        return "Tell me more about what you'd like to learn."

entity_bot = EntityMemoryChatbot(entity_mock_model)
conversations = [
    "Hi, my name is Alex.",
    "I want to learn about machine learning.",
    "Can you recommend something?",
    "Are there any courses?",
]
for user_input in conversations:
    print(f"\nUser: {user_input}")
    response = entity_bot.chat(user_input)
    print(f"Bot: {response}")
print(f"\nExtracted entities: {entity_bot.entities}")
Output:
User: Hi, my name is Alex.
Bot: Nice to meet you, Alex!
User: I want to learn about machine learning.
Bot: Tell me more about what you'd like to learn.
User: Can you recommend something?
Bot: Based on your interest in machine learning, Alex, I'd recommend starting with Python and scikit-learn.
User: Are there any courses?
Bot: For machine learning, the Andrew Ng Coursera course is excellent for beginners.
Extracted entities: {'name': 'Alex', 'topic': 'machine learning'}
The bot remembers the user's name and topic across all turns.
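The regex patterns only catch a few fixed phrasings. The comment in `_extract_entities` suggests an NER model or an LLM instead. One possible shape for the LLM route is to ask the model for JSON and merge the result defensively. Everything named here (`merge_extracted_facts`, `ALLOWED_KEYS`, the JSON shape) is an assumption for illustration.

```python
import json
from typing import Dict

# Hypothetical whitelist of fact types the bot is willing to store.
ALLOWED_KEYS = {"name", "location", "job", "topic"}


def merge_extracted_facts(store: Dict[str, str], raw_json: str) -> Dict[str, str]:
    """Merge a JSON object of facts into the store, ignoring anything malformed."""
    try:
        facts = json.loads(raw_json)
    except json.JSONDecodeError:
        return store  # the model returned something that is not JSON
    if not isinstance(facts, dict):
        return store
    for key, value in facts.items():
        # Accept only known keys with non-empty string values.
        if key in ALLOWED_KEYS and isinstance(value, str) and value.strip():
            store[key] = value.strip()  # newer facts overwrite older ones
    return store


entities: Dict[str, str] = {"name": "Alex"}
merge_extracted_facts(entities, '{"topic": "machine learning", "age": 30}')
merge_extracted_facts(entities, "not json at all")
merge_extracted_facts(entities, '{"location": "  Berlin "}')
print(entities)
```

Validating keys and types matters because model output is untrusted text: anything stored here is injected into every later prompt.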
Full Chatbot With the OpenAI API
import openai
import json
from datetime import datetime

class ProductionChatbot:
    def __init__(
        self,
        system_prompt: str = "You are a helpful AI assistant.",
        model: str = "gpt-3.5-turbo",
        max_history: int = 20,
        max_tokens: int = 500,
        temperature: float = 0.7
    ):
        self.client = openai.OpenAI()
        self.model = model
        self.max_history = max_history
        self.max_tokens = max_tokens
        self.temperature = temperature
        self.history = []
        self.system = system_prompt
        self.created_at = datetime.now()

    def chat(self, user_message: str) -> str:
        self.history.append({'role': 'user', 'content': user_message})
        # Trim history if too long
        if len(self.history) > self.max_history:
            self.history = self.history[-self.max_history:]
        # Build message list for API
        messages = [
            {'role': 'system', 'content': self.system}
        ] + self.history
        # Call API
        response = self.client.chat.completions.create(
            model=self.model,
            messages=messages,
            max_tokens=self.max_tokens,
            temperature=self.temperature,
        )
        assistant_message = response.choices[0].message.content
        self.history.append({'role': 'assistant', 'content': assistant_message})
        return assistant_message

    def save_conversation(self, filepath: str):
        data = {
            'created_at': self.created_at.isoformat(),
            'saved_at': datetime.now().isoformat(),
            'model': self.model,
            'system': self.system,
            'messages': self.history
        }
        with open(filepath, 'w') as f:
            json.dump(data, f, indent=2)
        print(f"Saved {len(self.history)} messages to {filepath}")

    def load_conversation(self, filepath: str):
        with open(filepath, 'r') as f:
            data = json.load(f)
        self.history = data['messages']
        self.system = data.get('system', self.system)
        print(f"Loaded {len(self.history)} messages from {filepath}")

    def reset(self):
        self.history = []
        print("Conversation reset.")

    def get_stats(self) -> dict:
        n_user = sum(1 for m in self.history if m['role'] == 'user')
        n_assistant = sum(1 for m in self.history if m['role'] == 'assistant')
        total_chars = sum(len(m['content']) for m in self.history)
        return {
            'turns': n_user,
            'total_messages': len(self.history),
            'estimated_tokens': total_chars // 4,
            'history_depth': len(self.history)
        }

# Usage
# bot = ProductionChatbot(
#     system_prompt="You are a helpful ML tutor specializing in practical examples.",
#     model="gpt-3.5-turbo",
#     max_history=20
# )
# response = bot.chat("Explain overfitting to me.")
# print(response)
# bot.save_conversation('session_001.json')
print("ProductionChatbot ready (requires OPENAI_API_KEY)")
LangChain Memory: The Easy Way
# Note: ConversationChain and these memory classes are deprecated in recent
# LangChain releases. This example assumes a version that still ships them.
from langchain.memory import ConversationBufferMemory, ConversationSummaryMemory
from langchain.chains import ConversationChain
from langchain_community.llms import HuggingFacePipeline
from transformers import pipeline as hf_pipeline

# Create LLM
gen_pipe = hf_pipeline('text-generation', model='gpt2', max_new_tokens=100)
llm = HuggingFacePipeline(pipeline=gen_pipe)

# Buffer memory: keeps all messages
buffer_memory = ConversationBufferMemory()

# Summary memory: keeps a running summary, updated by the LLM after each turn
# summary_memory = ConversationSummaryMemory(llm=llm)

# Build conversation chain
conversation = ConversationChain(
    llm=llm,
    memory=buffer_memory,
    verbose=False
)

# Chat
result = conversation.predict(input="Hello, my name is Alex.")
print(f"Bot: {result[:100]}...")
result = conversation.predict(input="What is my name?")
print(f"Bot: {result[:100]}...")

# Inspect memory
print(f"\nMemory buffer:\n{buffer_memory.buffer}")
Persisting Memory Across Sessions
import json
import os
from datetime import datetime  # used by _save_session

class PersistentChatbot:
    def __init__(self, model_pipeline, session_id: str,
                 storage_dir: str = './chat_sessions',
                 max_history: int = 50):
        self.model = model_pipeline
        self.session_id = session_id
        self.storage_dir = storage_dir
        self.max_history = max_history
        self.history = []
        self.metadata = {}
        os.makedirs(storage_dir, exist_ok=True)
        self._load_session()

    def _session_path(self) -> str:
        return os.path.join(self.storage_dir, f"{self.session_id}.json")

    def _load_session(self):
        path = self._session_path()
        if os.path.exists(path):
            with open(path, 'r') as f:
                data = json.load(f)
            self.history = data.get('history', [])
            self.metadata = data.get('metadata', {})
            print(f"Loaded session '{self.session_id}' with {len(self.history)} messages")
        else:
            print(f"New session '{self.session_id}' started")

    def _save_session(self):
        data = {
            'session_id': self.session_id,
            'last_updated': datetime.now().isoformat(),
            'history': self.history,
            'metadata': self.metadata
        }
        with open(self._session_path(), 'w') as f:
            json.dump(data, f, indent=2)

    def chat(self, user_message: str) -> str:
        self.history.append({'role': 'user', 'content': user_message})
        if len(self.history) > self.max_history:
            self.history = self.history[-self.max_history:]
        response = self.model(self._format_prompt())
        self.history.append({'role': 'assistant', 'content': response})
        self._save_session()
        return response

    def _format_prompt(self) -> str:
        parts = []
        for msg in self.history[-10:]:
            role = "Human" if msg['role'] == 'user' else "Assistant"
            parts.append(f"{role}: {msg['content']}")
        parts.append("Assistant:")
        return "\n".join(parts)

    def list_sessions(self) -> list:
        sessions = []
        for f in os.listdir(self.storage_dir):
            if f.endswith('.json'):
                sessions.append(f.replace('.json', ''))
        return sessions

# Usage
persistent_bot = PersistentChatbot(mock_model, session_id='user_alex_001')
persistent_bot.chat("What's the capital of France?")
persistent_bot.chat("What's the population there?")
print(f"\nSaved sessions: {persistent_bot.list_sessions()}")
print(f"History length: {len(persistent_bot.history)} messages")
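`_save_session` writes directly to the session file, so a crash in the middle of a write could leave a truncated JSON file that fails to load next time. A common way around this, sketched here with a hypothetical `save_json_atomic` helper, is to write to a temporary file in the same directory and then swap it into place with `os.replace`.

```python
import json
import os
import tempfile


def save_json_atomic(path: str, data: dict) -> None:
    """Write data as JSON to path without exposing a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    # Create the temp file next to the target so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f, indent=2)
        os.replace(tmp_path, path)  # rename over the old file in one step
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)  # clean up the temp file on failure
        raise


with tempfile.TemporaryDirectory() as demo_dir:
    session_path = os.path.join(demo_dir, "user_alex_001.json")
    save_json_atomic(session_path, {"history": [{"role": "user", "content": "hi"}]})
    with open(session_path, encoding="utf-8") as f:
        loaded = json.load(f)
    leftover = [name for name in os.listdir(demo_dir) if name.endswith(".tmp")]
    print(loaded, leftover)
```

Readers of the session file then see either the old version or the new one, never a partial write. For multi-user deployments a database is usually a better fit than one JSON file per session.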
Chatbot Quality Checklist
checklist = {
    "Memory management": [
        "Does the bot remember context from 5+ turns ago?",
        "Does it handle coreferences correctly? ('there', 'it', 'they')",
        "Does it avoid repeating information the user already gave?"
    ],
    "Context window": [
        "Does it handle very long conversations without breaking?",
        "Is there a graceful fallback when history is too long?",
        "Do summaries stay accurate and keep the key facts and decisions?"
    ],
    "Conversation quality": [
        "Does it stay on topic through the conversation?",
        "Does it refer to earlier decisions correctly?",
        "Does it handle topic switches gracefully?"
    ],
    "Persistence": [
        "Does it save conversations for later use?",
        "Can it resume from a previous session?",
        "Is the storage format readable and debuggable?"
    ],
    "Edge cases": [
        "What happens if the user asks about something not in memory?",
        "What happens if the user contradicts themselves?",
        "Does it handle very short or very long user messages?"
    ]
}
for category, items in checklist.items():
    print(f"\n{category}:")
    for item in items:
        print(f"  [ ] {item}")
Quick Cheat Sheet
| Memory type | When to use | How it works |
|---|---|---|
| Buffer (all history) | Short conversations | Keep all messages, pass everything |
| Sliding window | Medium conversations | Keep last N messages only |
| Summary memory | Long conversations | Summarize old messages, keep recent in full |
| Entity memory | User-specific facts | Extract and store named entities |
| Persistent memory | Multi-session chatbots | Save/load from disk or database |

| Pattern | Code |
|---|---|
| Add to history | `history.append({'role': 'user', 'content': msg})` |
| Trim history | `history = history[-max_size:]` |
| Build messages | `[{'role': 'system', 'content': system}] + history` |
| Save session | `json.dump({'history': history}, f)` |
| Load session | `history = json.load(f)['history']` |
| LangChain buffer | `ConversationBufferMemory()` |
| LangChain summary | `ConversationSummaryMemory(llm=llm)` |
Practice Challenges
Level 1:
Build a SlidingWindowChatbot that talks to GPT-2 locally. Have a 10-turn conversation about a topic of your choice. Print the full history at the end. Verify the bot correctly references things from earlier turns.
Level 2:
Implement SummaryMemoryChatbot with a real summarization call. After every 8 turns, summarize the first half using a small T5 model. Test with a 20-turn conversation. Print the summary after it triggers. Is the summary accurate?
Level 3:
Build PersistentChatbot that stores conversations to disk. Start a conversation, close it, restart the program, load the session, and continue the conversation. Verify the bot remembers what was said in the previous session. Add a /history command that prints a summary of previous sessions.
References
- LangChain: Memory docs
- OpenAI: Chat Completions API
- LangChain: ConversationBufferMemory
- LlamaIndex: Chat Engine
Final post, Post 100: OpenAI API: Build With GPT-4. API setup, chat completions, function calling, streaming, and cost management. The last post in the series wraps everything together.