It was 2 AM. Our Slack was on fire with support tickets.
"Your AI is broken."
"It keeps forgetting what I just asked."
"This is unusable."
The irony? Our AI was working perfectly. The problem was something far more fundamental.
🔥 The Problem Nobody Talks About
We built SQLatte - an AI that converts natural language to SQL. The demo looked magical:
User: "Show me top 10 customers by revenue"
AI: ✨ Generates perfect SQL, returns beautiful data
Users loved it. For about 30 seconds.
Then they'd ask a follow-up question:
User: "Now show me their orders from last week"
AI: "Orders from who? Which customers? What table?"
The AI had Alzheimer's. It forgot everything after each response.
And here's the kicker: this is exactly how LLMs are designed to work.
Every major LLM - GPT-4, Claude, Gemini - is stateless. They process your prompt, generate a response, and immediately forget everything. Like a goldfish with a computer science degree.
But users don't know this. They expect a conversation.
So we had two choices:
- Educate users about LLM limitations (good luck with that)
- Make the AI actually remember
We chose option 2. Here's what happened.
💥 Problem #1: The Goldfish Effect
Let me show you the exact moment users started leaving.
Real conversation from our logs:
```
9:23 AM - User: "Show me customers in New York"
9:23 AM - AI:   [Returns 47 customers]
9:24 AM - User: "How many orders did they place?"
9:24 AM - AI:   "Could you specify which customers?"
9:24 AM - User: "THE ONES YOU JUST SHOWED ME"
9:25 AM - User: *closes tab*
```
This happened 67 times per day.
Why? Because LLMs keep no memory between requests. Every request is like talking to someone who just woke up from anesthesia.
```python
# What actually happens inside an LLM (conceptually)
def process_query(prompt):
    response = llm.generate(prompt)  # Process
    memory.clear()                   # FORGET EVERYTHING
    return response
```
The painful metrics:
- 67% of users abandoned after the 2nd question
- Average session: 1.2 queries (users gave up)
- Support tickets: "Is this thing even AI?"
We needed memory. But here's where it gets interesting...
🧠 The Wrong Solution (That Everyone Tries First)
Our first attempt was embarrassingly naive:
"Let's just save everything!"
```python
# DON'T DO THIS
conversation_history = []

def ask_ai(question):
    conversation_history.append(question)
    # Send the ENTIRE history to the LLM
    full_context = "\n".join(conversation_history)
    response = llm.generate(full_context)
    conversation_history.append(response)
    return response
Seemed logical. Worked for 3 questions.
Then it exploded.
What went wrong:
- Query #10: 2,341 tokens → $0.02 per query
- Query #20: 8,924 tokens → $0.09 per query
- Query #50: 34,512 tokens → $0.35 per query
Plus, response times went from 800ms to 8 seconds. Users thought the app was broken.
The real problem: We were building an infinite memory system with a finite token budget.
We needed something smarter.
💡 The Breakthrough: The 10-Message Rule
After burning through $847 in API costs testing different approaches, we discovered something surprising.
Humans don't remember everything either.
Think about your own conversations. You remember:
- The current topic
- Last few exchanges
- Maybe something important from earlier
You don't replay the entire conversation word-for-word in your head.
What if AI could do the same?
We ran an experiment. Tested different memory windows:
| Messages Saved | Token Cost | User Satisfaction | Success Rate |
|---|---|---|---|
| 5 messages | Low | 71% | 78% |
| 10 messages | Optimal | 94% | 96% |
| 20 messages | High | 91% | 94% |
| 50 messages | Insane | 89% | 93% |
The data was clear: 10 messages was the sweet spot.
Beyond 10? Diminishing returns. Below 10? Lost context.
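Mechanically, the rule is tiny. Here's a minimal sketch using Python's `collections.deque`, which drops the oldest message automatically once the window is full (the full session class comes later in this post):

```python
from collections import deque

MEMORY_WINDOW = 10  # the sweet spot from the table above

history = deque(maxlen=MEMORY_WINDOW)

def remember(role: str, content: str) -> None:
    """Append a message; past 10, the oldest one silently falls off."""
    history.append({'role': role, 'content': content})

def recall() -> list:
    """At most 10 messages, newest last."""
    return list(history)
```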
🎯 The Architecture That Changed Everything
Here's the system we built (stolen from how humans actually remember):
```
┌─────────────────────────────────────────┐
│  User asks: "Show me their orders"      │
│  (Who are "they"? AI doesn't know!)     │
└────────────┬────────────────────────────┘
             │
             ▼
┌─────────────────────────────────────────┐
│             SESSION MANAGER             │
│  "Oh, this is session ABC123"           │
│  "Let me check recent history..."       │
└────────────┬────────────────────────────┘
             │
             ▼
┌─────────────────────────────────────────┐
│          LOAD LAST 10 MESSAGES          │
│                                         │
│  [7 messages ago]                       │
│  User: "Show customers in NY"           │
│  AI: "SELECT * FROM customers..."       │
│                                         │
│  [Current]                              │
│  User: "Show me their orders"           │
└────────────┬────────────────────────────┘
             │
             ▼
┌─────────────────────────────────────────┐
│      CONTEXT INJECTION (The Magic)      │
│                                         │
│  System: You're SQLatte, an AI...       │
│  Context: User asked about NY customers │
│  Current: Show me "their" orders        │
│  → "their" = NY customers from before!  │
└────────────┬────────────────────────────┘
             │
             ▼
┌─────────────────────────────────────────┐
│              LLM PROCESSES              │
│  "Ah! They mean NY customers!"          │
│  *generates SQL with proper JOIN*       │
└────────────┬────────────────────────────┘
             │
             ▼
┌─────────────────────────────────────────┐
│             SAVE TO HISTORY             │
│     Keep last 10, drop older ones       │
└─────────────────────────────────────────┘
```
The key insight: We're not building infinite memory. We're building selective memory.
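Stitched together, the whole loop fits in a few lines. A minimal sketch (the `llm` client and `load_session` helper are stand-ins; the real pieces follow below):

```python
def handle_query(session_id: str, question: str) -> str:
    """One pass through the pipeline above (names are illustrative)."""
    session = load_session(session_id)                 # SESSION MANAGER
    context = build_llm_context(session_id, question)  # LOAD + INJECT CONTEXT
    answer = llm.generate(context)                     # LLM PROCESSES
    session.add_message('user', question)              # SAVE TO HISTORY
    session.add_message('assistant', answer)
    return answer
```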
💻 The Code: How We Actually Built It
Now the fun part. Let me show you exactly how we implemented this, including the mistakes we made.
Problem #2: "How Do We Track Sessions?"
First challenge: How do you track which conversation belongs to which user?
Naive approach (doesn't work):
```python
# Tracking by IP address
sessions = {}

def get_session(ip_address):
    return sessions.get(ip_address)
```
Why it fails:
- Multiple users behind same corporate proxy
- User switches devices
- IP changes during conversation
The fix: Session IDs + Browser Storage
```python
import uuid
from datetime import datetime

class ConversationSession:
    """One user's conversation thread"""

    def __init__(self, session_id: str):
        self.session_id = session_id  # Unique ID
        self.messages = []
        self.created_at = datetime.now()
        self.last_activity = datetime.now()

    def add_message(self, role: str, content: str):
        """
        Add a message to history.
        role: 'user' or 'assistant'
        """
        self.messages.append({
            'role': role,
            'content': content,
            'timestamp': datetime.now()
        })
        self.last_activity = datetime.now()

    def get_recent_context(self, max_messages: int = 10):
        """
        Get the last N messages for LLM context.

        This is the magic number: 10
        - Enough for context
        - Not too expensive
        - Fast inference
        """
        return self.messages[-max_messages:]
```
Client-side (the secret sauce):
```javascript
// Store the session ID in the browser
function getOrCreateSession() {
  let sessionId = localStorage.getItem('sqlatte_session_id');
  if (!sessionId) {
    // Generate a new session
    sessionId = crypto.randomUUID();
    localStorage.setItem('sqlatte_session_id', sessionId);
  }
  return sessionId;
}

// Send it with every request
fetch('/api/query', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    question: userQuestion,
    session_id: getOrCreateSession()  // ← Key!
  })
});
```
Why this works:
- Persists across page reloads
- User returns tomorrow → same session
- No login required
- Privacy-friendly (stays in browser)
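On the server, the matching piece is just a dictionary keyed by that ID. A minimal sketch of the endpoint side, assuming FastAPI (any framework works the same way):

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
sessions: dict[str, ConversationSession] = {}  # ConversationSession from above

class QueryRequest(BaseModel):
    question: str
    session_id: str

@app.post("/api/query")
def query(req: QueryRequest):
    # Look up the browser's session, creating it on first contact
    if req.session_id not in sessions:
        sessions[req.session_id] = ConversationSession(req.session_id)
    session = sessions[req.session_id]
    session.add_message('user', req.question)
    return {"session_id": req.session_id, "message_count": len(session.messages)}
```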
Problem #3: "How Do We Inject Context Without Confusing The AI?"
This is where most implementations fail. You can't just dump message history. The AI gets confused.
❌ Bad approach:
```python
# This confuses the AI
context = "Previous messages: " + "\n".join(all_messages)
prompt = context + "\n\nCurrent question: " + question
```
✅ Good approach (structured context):
```python
def build_llm_context(session_id: str, current_question: str):
    """Build structured context that the AI can understand."""
    session = get_session(session_id)
    recent = session.get_recent_context(max_messages=10)

    # Structure matters!
    context = []

    # 1. System prompt (who the AI is)
    context.append({
        'role': 'system',
        'content': 'You are SQLatte, an AI that converts natural language to SQL.'
    })

    # 2. Recent conversation (formatted clearly)
    if len(recent) > 1:
        context.append({
            'role': 'system',
            'content': f'''
Context: User is continuing a conversation.
Recent exchange:
{format_conversation_history(recent[-5:])}
Important: The current question may reference previous context.
'''
        })

    # 3. All recent messages (so the AI sees the flow)
    for msg in recent:
        context.append({
            'role': msg['role'],
            'content': msg['content']
        })

    # 4. The current question
    context.append({
        'role': 'user',
        'content': current_question
    })

    return context


def format_conversation_history(messages):
    """Make history readable for the AI."""
    formatted = []
    for msg in messages:
        if msg['role'] == 'user':
            formatted.append(f"User: {msg['content']}")
        else:
            # Summarize AI responses (don't repeat everything)
            summary = msg['content'][:100] + "..."
            formatted.append(f"Assistant: {summary}")
    return "\n".join(formatted)
```
The difference this makes:
Before (unstructured):

```
API Error: "Context too ambiguous"
AI: "I don't understand the reference"
```

After (structured):

```
AI: "Ah, they're asking about the NYC customers from 3 messages ago"

SQL: SELECT COUNT(*) FROM orders o
     JOIN customers c ON o.customer_id = c.id
     WHERE c.city = 'New York'
```
Problem #4: "This Is Getting Expensive!"
Remember our $847 API bill? Here's how we fixed it.
Token usage before optimization:
```
Message 1:      500 tokens → $0.001
Message 5:    2,400 tokens → $0.005
Message 10:   5,800 tokens → $0.012
Message 20:  12,300 tokens → $0.025
```
Three optimizations that saved us:
1. Smart Compression
```python
def compress_old_messages(messages):
    """
    Summarize messages older than 5 turns.
    Instead of full text, keep just the essence.
    """
    if len(messages) <= 5:
        return messages

    old_messages = messages[:-5]
    recent_messages = messages[-5:]

    # Summarize the old ones
    summary = extract_key_points(old_messages)
    compressed = [{
        'role': 'system',
        'content': f'Earlier in conversation: {summary}'
    }]

    return compressed + recent_messages


def extract_key_points(messages):
    """Extract just the table names and operations."""
    tables = set()
    operations = []
    for msg in messages:
        tables.update(extract_tables_from_sql(msg['content']))
        operations.append(extract_operation(msg['content']))
    return f"discussed tables: {', '.join(tables)}, operations: {', '.join(operations)}"
```
2. Token Budgeting
```python
MAX_CONTEXT_TOKENS = 2000  # Hard limit

def enforce_token_budget(messages):
    """
    If we're over budget, drop the oldest messages,
    but keep at least the last 3 (critical context).
    """
    total = sum(estimate_tokens(m) for m in messages)
    while total > MAX_CONTEXT_TOKENS and len(messages) > 3:
        messages.pop(0)  # Remove the oldest
        total = sum(estimate_tokens(m) for m in messages)
    return messages
```
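`estimate_tokens` does the counting there, per message. The cheapest version is the ~4-characters-per-token rule of thumb; for exact counts on OpenAI models you can swap in `tiktoken`. A sketch:

```python
def estimate_tokens(message: dict) -> int:
    # Rough heuristic: English text averages ~4 characters per token
    return len(message['content']) // 4

# Exact alternative for OpenAI models (pip install tiktoken):
# import tiktoken
# enc = tiktoken.encoding_for_model("gpt-4")
# def estimate_tokens(message):
#     return len(enc.encode(message['content']))
```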
3. Lazy Loading
```python
def get_context(session_id, current_question):
    """Don't load the full history unless it's needed."""
    session = get_session(session_id)

    # Start with minimal context
    messages = session.messages[-3:]

    # Does this question need more context?
    if needs_deeper_context(current_question):
        messages = session.messages[-10:]

    return messages


def needs_deeper_context(question):
    """Check whether the question references earlier context."""
    reference_words = ['that', 'those', 'them', 'their', 'previous', 'earlier']
    return any(word in question.lower() for word in reference_words)
```
Results:
- Average tokens per query: 5,800 → 1,200 (79% reduction)
- Monthly API costs: $847 → $178 (79% savings)
- Response time: 8s → 1.2s (85% faster)
🎨 Prompt Engineering: The Secret Sauce
The way you inject context into prompts matters. A LOT.
❌ Bad Approach (Naive):
```python
prompt = f"""
Previous messages: {all_messages}
Current question: {question}
Generate SQL.
"""
```
Problems:
- Too verbose
- No structure
- LLM gets confused
- Wastes tokens
✅ Good Approach (Structured):
```python
prompt = f"""
Context: User is continuing a conversation about database queries.

Recent conversation:
User: "Show me top customers"
Assistant: Generated SQL for top 10 customers by revenue
User: "Show orders from last week" [CURRENT]

Task: Generate SQL for the current question, considering:
1. Previously discussed: top customers
2. Implied context: orders from THE CUSTOMERS discussed above
3. Time range: last week

Generate SQL:
"""
```
Why this works:
- Clear structure
- Explicit context markers
- Task-oriented
- Inference guidance
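To produce that structure from live history instead of writing it by hand, a small template function does the job. A sketch, reusing `format_conversation_history` from earlier:

```python
def build_structured_prompt(recent: list, question: str) -> str:
    """Render history into the structured shape shown above (sketch)."""
    return f"""
Context: User is continuing a conversation about database queries.

Recent conversation:
{format_conversation_history(recent)}
User: "{question}" [CURRENT]

Task: Generate SQL for the current question, considering that it may
reference tables, filters, or time ranges discussed above.

Generate SQL:
"""
```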
📊 Real-World Examples
Example 1: Follow-up Questions
Without Memory:
User: "Show me customers in New York"
AI: SELECT * FROM customers WHERE city = 'New York'
User: "How many orders did they make?"
AI: "Who are 'they'? Please specify."
With Memory:
User: "Show me customers in New York"
AI: SELECT * FROM customers WHERE city = 'New York'
User: "How many orders did they make?"
AI: SELECT COUNT(*)
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE c.city = 'New York'
[AI remembers we're talking about NY customers!]
Example 2: Context-Aware Refinements
Without Memory:
User: "Show revenue by month"
AI: [Returns data]
User: "Now filter for products over $1000"
AI: "Filter what? Please provide complete query."
With Memory:
User: "Show revenue by month"
AI: SELECT
DATE_TRUNC('month', order_date) as month,
SUM(amount) as revenue
FROM orders
GROUP BY month
User: "Now filter for products over $1000"
AI: SELECT
DATE_TRUNC('month', order_date) as month,
SUM(amount) as revenue
FROM orders
WHERE product_price > 1000 -- Added filter!
GROUP BY month
[AI understands the refinement applies to previous query!]
⚡ Performance Optimization: The Trade-offs
Building conversational AI isn't free. Here are the trade-offs:
Token Usage
```
Without memory:          ~500 tokens/request
With memory (10 msgs): ~2,500 tokens/request

Cost increase:  5x
Value increase: 10x+ (UX improvement)
```
Response Time
```
Without context:  ~800ms
With context:   ~1,200ms (+50%)
```

Why? More tokens to process. Worth it? Absolutely.
Memory Management Strategy
```python
class ConversationManager:
    def cleanup_expired_sessions(self):
        """
        Remove old sessions to prevent memory bloat.

        Strategy:
        - Sessions expire after 60 minutes of inactivity
        - Periodic cleanup (every 10 minutes)
        - Keep favorites even if expired
        """
        for session_id in list(self.sessions.keys()):
            session = self.sessions[session_id]
            if session.is_expired(timeout_minutes=60):
                # Keep sessions that have favorites
                if not session.has_favorites():
                    del self.sessions[session_id]
```
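The "every 10 minutes" part needs a scheduler. A minimal sketch using a self-re-arming daemon timer (a real deployment might prefer APScheduler or a cron job):

```python
import threading

CLEANUP_INTERVAL_SECONDS = 600  # 10 minutes

def start_cleanup_loop(manager: ConversationManager) -> None:
    def tick():
        manager.cleanup_expired_sessions()
        # Re-arm; daemon=True so the timer never blocks shutdown
        timer = threading.Timer(CLEANUP_INTERVAL_SECONDS, tick)
        timer.daemon = True
        timer.start()
    tick()
```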
🎯 Optimization Strategies We Use
1. Progressive Context Loading
Don't load all history. Start with the last 5 messages and extend if needed (the lazy loading shown earlier).
2. Semantic Compression
Summarize old messages instead of keeping full text:
```python
def compress_old_messages(messages: list) -> str:
    """Compress messages older than 10 turns into a one-line summary."""
    if len(messages) > 10:
        old_messages = messages[:-10]
        summary = f"Earlier in conversation: discussed {extract_topics(old_messages)}"
        return summary
    return ""
```
3. Smart Token Budgeting
```python
MAX_CONTEXT_TOKENS = 2000
MAX_HISTORY_MESSAGES = 10

def get_context(session_id):
    messages = get_messages(session_id)

    # Estimate the total token load
    total_tokens = sum(estimate_tokens(m) for m in messages)

    # Over budget? Drop the oldest messages until we fit
    while total_tokens > MAX_CONTEXT_TOKENS:
        messages.pop(0)
        total_tokens = sum(estimate_tokens(m) for m in messages)

    return messages
```
4. Client-Side Session Persistence
Store the session ID in browser localStorage, exactly as in the `getOrCreateSession` snippet shown earlier.
Benefits:
- Conversations persist across page reloads
- User comes back later → continues same conversation
- No server-side user accounts needed
🔒 Edge Cases We Handle
1. Session Timeout
```python
def is_expired(self, timeout_minutes: int = 60) -> bool:
    """Check whether the session has expired."""
    elapsed = datetime.now() - self.last_activity
    return elapsed > timedelta(minutes=timeout_minutes)
```
Why 60 minutes?
- Users take coffee breaks
- Still relevant context
- Not too memory-hungry
2. Context Confusion
Sometimes AI misinterprets context. We add disambiguation:
```python
if user_question_is_ambiguous(question, context):
    prompt = f"""
IMPORTANT: The user's question might be ambiguous.
Question: "{question}"
Previous context: {context_summary}
If ambiguous, ask for clarification rather than guessing.
"""
```
3. Table Switching
User changes topic → clear context:
```python
def detect_topic_change(new_question, old_context):
    """Detect whether the user switched to different tables/topics."""
    old_tables = extract_tables(old_context)
    new_tables = extract_tables(new_question)

    if not set(old_tables).intersection(new_tables):
        return True  # Topic changed!
    return False
```
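Wiring it in is a single conditional when loading context: if the topic changed, start fresh rather than dragging the old tables along. A sketch:

```python
def get_context_for(session, question):
    recent = session.get_recent_context(max_messages=10)
    if recent and detect_topic_change(question, recent):
        return []  # new topic: the old exchange would only mislead the SQL
    return recent
```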
📈 Results: Before vs After
User Satisfaction
- Before: 33% completion rate (67% of users abandoned)
- After: 94% completion rate
- Why: Natural conversation flow
Query Complexity
- Before: Average 1.2 queries per session
- After: Average 4.8 queries per session
- Why: Users explore data more when follow-ups work
Error Rate
- Before: 23% queries failed (missing context)
- After: 8% queries failed
- Why: AI understands intent better
🎓 Key Lessons Learned
1. Don't Over-Context
10 messages is the sweet spot. More = diminishing returns.
2. Structure Your History
❌ Dump all messages
✅ Structured summary + recent details
3. Make Sessions Disposable
Don't try to keep everything forever. Clean up expired sessions.
4. Prompt Engineering > Architecture
How you inject context matters MORE than how much context.
5. Test Follow-Up Scenarios
Write tests for scenarios like these (one is sketched below):
- "Show me X" → "Now show me Y"
- "Filter by A" → "Remove that filter"
- "Top 10" → "Show details for #3"
💡 Try It Yourself
Full implementation in SQLatte:
👉 https://github.com/osmanuygar/sqlatte
🎯 Conclusion
Building AI that remembers isn't about storing everything - it's about storing the right things at the right time.
Key takeaways:
- ✅ Use session-based architecture
- ✅ Limit context window (10 messages sweet spot)
- ✅ Structure your prompts carefully
- ✅ Handle edge cases (timeouts, topic changes)
- ✅ Monitor token usage
The result? AI that feels intelligent, not just smart.
What's your experience with conversational AI? Drop a comment below! 👇
Found this useful? Star the repo: https://github.com/osmanuygar/sqlatte