osman uygar köse

The Art of Context Windows: Our AI Had Alzheimer's, and Here's How We Taught It to Remember

It was 2 AM. Our Slack was on fire with support tickets.

"Your AI is broken."

"It keeps forgetting what I just asked."

"This is unusable."

The irony? Our AI was working perfectly. The problem was something far more fundamental.

🔥 The Problem Nobody Talks About

We built SQLatte - an AI that converts natural language to SQL. The demo looked magical:

User: "Show me top 10 customers by revenue"

AI: Generates perfect SQL, returns beautiful data

Users loved it. For about 30 seconds.

Then they'd ask a follow-up question:

User: "Now show me their orders from last week"

AI: "Orders from who? Which customers? What table?"

The AI had Alzheimer's. It forgot everything after each response.

And here's the kicker: this is exactly how LLMs are designed to work.

Every major LLM - GPT-4, Claude, Gemini - is stateless. They process your prompt, generate a response, and immediately forget everything. Like a goldfish with a computer science degree.

But users don't know this. They expect a conversation.

So we had two choices:

  1. Educate users about LLM limitations (good luck with that)
  2. Make the AI actually remember

We chose option 2. Here's what happened.

💥 Problem #1: The Goldfish Effect

Let me show you the exact moment users started leaving.

Real conversation from our logs:

9:23 AM - User: "Show me customers in New York"
9:23 AM - AI: [Returns 47 customers]
9:24 AM - User: "How many orders did they place?"
9:24 AM - AI: "Could you specify which customers?"
9:24 AM - User: "THE ONES YOU JUST SHOWED ME"
9:25 AM - User: *closes tab*

This happened 67 times per day.

Why? Because LLMs have no memory between requests. Every request is like talking to someone who just woke up from anesthesia.

# Conceptually, what happens on every LLM call (pseudocode)
def process_query(prompt):
    response = llm.generate(prompt)  # Process the prompt in isolation
    memory.clear()  # FORGET EVERYTHING (there is no real memory object;
                    # statelessness is baked into the API)
    return response

The painful metrics:

  • 67% of users abandoned after 2nd question
  • Average session: 1.2 queries (users gave up)
  • Support tickets: "Is this thing even AI?"

We needed memory. But here's where it gets interesting...

🧠 The Wrong Solution (That Everyone Tries First)

Our first attempt was embarrassingly naive:

"Let's just save everything!"

# DON'T DO THIS
conversation_history = []

def ask_ai(question):
    conversation_history.append(question)

    # Send ENTIRE history to LLM
    full_context = "\n".join(conversation_history)
    response = llm.generate(full_context)

    conversation_history.append(response)
    return response

Seemed logical. Worked for 3 questions.

Then it exploded.

What went wrong:

  • Query #10: 2,341 tokens → $0.02 per query
  • Query #20: 8,924 tokens → $0.09 per query
  • Query #50: 34,512 tokens → $0.35 per query

Plus, response times went from 800ms to 8 seconds. Users thought the app was broken.

The real problem: We were building an infinite memory system with a finite token budget.

We needed something smarter.

💡 The Breakthrough: The 10-Message Rule

After burning through $847 in API costs testing different approaches, we discovered something surprising.

Humans don't remember everything either.

Think about your own conversations. You remember:

  • The current topic
  • Last few exchanges
  • Maybe something important from earlier

You don't replay the entire conversation word-for-word in your head.

What if AI could do the same?

We ran an experiment. Tested different memory windows:

Messages Saved    Token Cost    User Satisfaction    Success Rate
5 messages        Low           71%                  78%
10 messages       Optimal       94%                  96%
20 messages       High          91%                  94%
50 messages       Insane        89%                  93%

The data was clear: 10 messages was the sweet spot.

Beyond 10? Diminishing returns. Below 10? Lost context.
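The rule itself fits in three lines of Python; the rest of this post is about making those three lines work in production. A minimal sketch:

MAX_WINDOW = 10  # the sweet spot from the table above

def get_context_window(messages: list[dict]) -> list[dict]:
    """Keep only the most recent MAX_WINDOW messages."""
    return messages[-MAX_WINDOW:]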

🎯 The Architecture That Changed Everything

Here's the system we built (stolen from how humans actually remember):

┌─────────────────────────────────────────┐
│  User asks: "Show me their orders"      │
│  (Who are "they"? AI doesn't know!)     │
└────────────┬────────────────────────────┘
             │
             ▼
┌─────────────────────────────────────────┐
│    SESSION MANAGER                      │
│  "Oh, this is session ABC123"           │
│  "Let me check recent history..."       │
└────────────┬────────────────────────────┘
             │
             ▼
┌─────────────────────────────────────────┐
│   LOAD LAST 10 MESSAGES                 │
│                                         │
│  [7 messages ago]                       │
│  User: "Show customers in NY"           │
│  AI: "SELECT * FROM customers..."       │
│                                         │
│  [Current]                              │
│  User: "Show me their orders"           │
└────────────┬────────────────────────────┘
             │
             ▼
┌─────────────────────────────────────────┐
│   CONTEXT INJECTION (The Magic)         │
│                                         │
│  System: You're SQLatte, an AI...      │
│  Context: User asked about NY customers │
│  Current: Show me "their" orders        │
│  → "their" = NY customers from before!  │
└────────────┬────────────────────────────┘
             │
             ▼
┌─────────────────────────────────────────┐
│         LLM PROCESSES                   │
│  "Ah! They mean NY customers!"          │
│  *generates SQL with proper JOIN*       │
└────────────┬────────────────────────────┘
             │
             ▼
┌─────────────────────────────────────────┐
│    SAVE TO HISTORY                      │
│  Keep last 10, drop older ones          │
└─────────────────────────────────────────┘

The key insight: We're not building infinite memory. We're building selective memory.

💻 The Code: How We Actually Built It

Now the fun part. Let me show you exactly how we implemented this, including the mistakes we made.

Problem #2: "How Do We Track Sessions?"

First challenge: How do you track which conversation belongs to which user?

Naive approach (doesn't work):

# Tracking by IP address
sessions = {}

def get_session(ip_address):
    return sessions.get(ip_address)

Why it fails:

  • Multiple users behind same corporate proxy
  • User switches devices
  • IP changes during conversation

The fix: Session IDs + Browser Storage

from datetime import datetime

class ConversationSession:
    """One user's conversation thread"""

    def __init__(self, session_id: str):
        self.session_id = session_id  # Unique ID
        self.messages = []
        self.created_at = datetime.now()
        self.last_activity = datetime.now()

    def add_message(self, role: str, content: str):
        """
        Add message to history
        role: 'user' or 'assistant'
        """
        self.messages.append({
            'role': role,
            'content': content,
            'timestamp': datetime.now()
        })
        self.last_activity = datetime.now()

    def get_recent_context(self, max_messages: int = 10):
        """
        Get last N messages for LLM context

        This is the magic number: 10
        - Enough for context
        - Not too expensive
        - Fast inference
        """
        return self.messages[-max_messages:]
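For reference, here's a minimal in-memory store behind the `get_session` calls used later in this post (a sketch; a real deployment would add locking and persistence):

# Minimal in-memory session store (sketch)
sessions: dict[str, ConversationSession] = {}

def get_session(session_id: str) -> ConversationSession:
    """Fetch an existing session or create a new one."""
    if session_id not in sessions:
        sessions[session_id] = ConversationSession(session_id)
    return sessions[session_id]

# Usage
session = get_session("abc123")
session.add_message('user', 'Show me customers in New York')
session.add_message('assistant', "SELECT * FROM customers WHERE city = 'New York'")
print(session.get_recent_context())  # the last 10 messages (here, both of them)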

Client-side (the secret sauce):

// Store session ID in browser
function getOrCreateSession() {
    let sessionId = localStorage.getItem('sqlatte_session_id');

    if (!sessionId) {
        // Generate new session
        sessionId = crypto.randomUUID();
        localStorage.setItem('sqlatte_session_id', sessionId);
    }

    return sessionId;
}

// Send with every request
fetch('/api/query', {
    method: 'POST',  // fetch defaults to GET, which can't carry a body
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
        question: userQuestion,
        session_id: getOrCreateSession()  // ← Key!
    })
});

Why this works:

  • Persists across page reloads
  • User returns tomorrow → same session
  • No login required
  • Privacy-friendly (stays in browser)

Problem #3: "How Do We Inject Context Without Confusing The AI?"

This is where most implementations fail. You can't just dump message history. The AI gets confused.

❌ Bad approach:

# This confuses the AI
context = "Previous messages: " + "\n".join(all_messages)
prompt = context + "\n\nCurrent question: " + question

✅ Good approach (structured context):

def build_llm_context(session_id: str, current_question: str):
    """
    Build structured context that AI can understand
    """
    session = get_session(session_id)
    recent = session.get_recent_context(max_messages=10)

    # Structure matters!
    context = []

    # 1. System prompt (who the AI is)
    context.append({
        'role': 'system',
        'content': 'You are SQLatte, an AI that converts natural language to SQL.'
    })

    # 2. Recent conversation (formatted clearly)
    if len(recent) > 1:
        context.append({
            'role': 'system',
            'content': f'''
Context: User is continuing a conversation.

Recent exchange:
{format_conversation_history(recent[-5:])}

Important: The current question may reference previous context.
'''
        })

    # 3. All recent messages (for AI to see flow)
    for msg in recent:
        context.append({
            'role': msg['role'],
            'content': msg['content']
        })

    # 4. Current question
    context.append({
        'role': 'user',
        'content': current_question
    })

    return context

def format_conversation_history(messages):
    """Make history readable for AI"""
    formatted = []
    for msg in messages:
        if msg['role'] == 'user':
            formatted.append(f"User: {msg['content']}")
        else:
            # Summarize AI responses (don't repeat everything)
            summary = msg['content'][:100] + "..."
            formatted.append(f"Assistant: {summary}")

    return "\n".join(formatted)

The difference this makes:

Before (unstructured):

API Error: "Context too ambiguous"
AI: "I don't understand the reference"

After (structured):

AI: "Ah, they're asking about the NYC customers from 3 messages ago"
SQL: SELECT COUNT(*) FROM orders o 
     JOIN customers c ON o.customer_id = c.id
     WHERE c.city = 'New York'

Problem #4: "This Is Getting Expensive!"

Remember our $847 API bill? Here's how we fixed it.

Token usage before optimization:

Message 1:  500 tokens  → $0.001
Message 5:  2,400 tokens → $0.005
Message 10: 5,800 tokens → $0.012
Message 20: 12,300 tokens → $0.025

Three optimizations that saved us:

1. Smart Compression

def compress_old_messages(messages):
    """
    Summarize messages older than 5 turns
    Instead of full text, keep just the essence
    """
    if len(messages) <= 5:
        return messages

    old_messages = messages[:-5]
    recent_messages = messages[-5:]

    # Summarize old ones
    summary = extract_key_points(old_messages)
    compressed = [{
        'role': 'system',
        'content': f'Earlier in conversation: {summary}'
    }]

    return compressed + recent_messages

def extract_key_points(messages):
    """Extract just the table names and operations"""
    tables = set()
    operations = []

    for msg in messages:
        tables.update(extract_tables_from_sql(msg['content']))
        operations.append(extract_operation(msg['content']))

    return f"discussed tables: {', '.join(tables)}, operations: {', '.join(operations)}"
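`extract_tables_from_sql` and `extract_operation` are our own helpers. A rough, regex-based sketch of the first one (good enough for simple queries; a regex is not a SQL parser):

import re

def extract_tables_from_sql(text: str) -> set[str]:
    """Rough sketch: grab identifiers that follow FROM or JOIN.
    Misses subqueries, CTEs, and quoted names."""
    return set(re.findall(r'\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)',
                          text, re.IGNORECASE))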

2. Token Budgeting

MAX_CONTEXT_TOKENS = 2000  # Hard limit

def enforce_token_budget(messages):
    """
    If we're over budget, drop oldest messages
    Keep at least last 3 messages (critical context)
    """
    total = estimate_tokens(messages)

    while total > MAX_CONTEXT_TOKENS and len(messages) > 3:
        messages.pop(0)  # Remove oldest
        total = estimate_tokens(messages)

    return messages
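`estimate_tokens` is another helper we glossed over. A real tokenizer (e.g. tiktoken for OpenAI models) is more accurate, but a characters-divided-by-four heuristic is close enough for budgeting. A sketch:

def estimate_tokens(message: dict) -> int:
    """Rough heuristic: ~4 characters per token for English text.
    Swap in a real tokenizer (e.g. tiktoken) if you need accuracy."""
    return len(message['content']) // 4 + 1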

3. Lazy Loading

def get_context(session_id, current_question):
    """
    Don't load full history unless needed
    """
    session = get_session(session_id)

    # Start with minimal context
    messages = session.messages[-3:]

    # Does this question need more context?
    if needs_deeper_context(current_question):
        messages = session.messages[-10:]

    return messages

def needs_deeper_context(question):
    """
    Check if question references earlier context
    (simple substring check; a tokenizer would be more precise)
    """
    reference_words = ['that', 'those', 'them', 'they', 'their', 'previous', 'earlier']
    return any(word in question.lower() for word in reference_words)

Results:

Average tokens per query: 5,800 → 1,200 (79% reduction)
Monthly API costs: $847 → $178 (79% savings)
Response time: 8s → 1.2s (85% faster)

🎨 Prompt Engineering: The Secret Sauce

The way you inject context into prompts matters. A LOT.

❌ Bad Approach (Naive):

prompt = f"""
Previous messages: {all_messages}

Current question: {question}

Generate SQL.
"""

Problems:

  • Too verbose
  • No structure
  • LLM gets confused
  • Wastes tokens

✅ Good Approach (Structured):

prompt = f"""
Context: User is continuing a conversation about database queries.

Recent conversation:
User: "Show me top customers"
Assistant: Generated SQL for top 10 customers by revenue
User: "Show orders from last week" [CURRENT]

Task: Generate SQL for the current question, considering:
1. Previously discussed: top customers
2. Implied context: Orders from THE CUSTOMERS discussed above
3. Time range: Last week

Generate SQL:
"""

Why this works:

  • Clear structure
  • Explicit context markers
  • Task-oriented
  • Inference guidance

📊 Real-World Examples

Example 1: Follow-up Questions

Without Memory:

User: "Show me customers in New York"
AI: SELECT * FROM customers WHERE city = 'New York'

User: "How many orders did they make?"
AI: "Who are 'they'? Please specify."

With Memory:

User: "Show me customers in New York"
AI: SELECT * FROM customers WHERE city = 'New York'

User: "How many orders did they make?"
AI: SELECT COUNT(*) 
    FROM orders o
    JOIN customers c ON o.customer_id = c.id
    WHERE c.city = 'New York'

[AI remembers we're talking about NY customers!]

Example 2: Context-Aware Refinements

Without Memory:

User: "Show revenue by month"
AI: [Returns data]

User: "Now filter for products over $1000"
AI: "Filter what? Please provide complete query."

With Memory:

User: "Show revenue by month"
AI: SELECT 
      DATE_TRUNC('month', order_date) as month,
      SUM(amount) as revenue
    FROM orders
    GROUP BY month

User: "Now filter for products over $1000"
AI: SELECT 
      DATE_TRUNC('month', order_date) as month,
      SUM(amount) as revenue
    FROM orders
    WHERE product_price > 1000  -- Added filter!
    GROUP BY month

[AI understands the refinement applies to previous query!]

⚡ Performance Optimization: The Trade-offs

Building conversational AI isn't free. Here are the trade-offs:

Token Usage

Without Memory: ~500 tokens/request
With Memory (10 msgs): ~2500 tokens/request

Cost increase: 5x
Value increase: 10x+ (UX improvement)

Response Time

Without Context: ~800ms
With Context: ~1200ms (+50%)

Why? More tokens to process
Worth it? Absolutely.

Memory Management Strategy

class ConversationManager:
    def cleanup_expired_sessions(self):
        """
        Remove old sessions to prevent memory bloat

        Strategy:
        - Sessions expire after 60 minutes of inactivity
        - Periodic cleanup (every 10 minutes)
        - Keep favorites even if expired
        """
        for session_id in list(self.sessions.keys()):
            session = self.sessions[session_id]

            # is_expired: defined under "Edge Cases We Handle" below
            if session.is_expired(timeout_minutes=60):
                # Keep if has favorites
                if not session.has_favorites():
                    del self.sessions[session_id]

🎯 Optimization Strategies We Use

1. Progressive Context Loading

Don't load all history. Start with the last 5 messages and extend only if the question needs it, as in the sketch below.
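A sketch, reusing the `needs_deeper_context` check from the lazy-loading snippet earlier (the 5/10 thresholds are ours to tune):

def load_context_progressively(session, question: str) -> list[dict]:
    """Start narrow, widen only when the question references earlier context."""
    window = session.messages[-5:]
    if needs_deeper_context(question):  # pronoun/reference check from earlier
        window = session.messages[-10:]
    return window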

2. Semantic Compression

Summarize old messages instead of keeping full text:

def summarize_old_messages(messages: list[dict]) -> str:
    """
    Lighter variant of compress_old_messages from earlier:
    one-line summary of everything older than 10 turns
    """
    if len(messages) > 10:
        old_messages = messages[:-10]
        # extract_topics: our helper, in the spirit of extract_key_points above
        return f"Earlier in conversation: discussed {extract_topics(old_messages)}"
    return ""

3. Smart Token Budgeting

MAX_CONTEXT_TOKENS = 2000
MAX_HISTORY_MESSAGES = 10

def get_budgeted_context(session_id):
    # Start from the most recent MAX_HISTORY_MESSAGES messages
    messages = get_messages(session_id)[-MAX_HISTORY_MESSAGES:]

    # Estimate tokens (estimate_tokens: see the heuristic sketch earlier)
    total_tokens = sum(estimate_tokens(m) for m in messages)

    # Drop oldest messages until we fit the budget (but keep at least one)
    while total_tokens > MAX_CONTEXT_TOKENS and len(messages) > 1:
        messages.pop(0)
        total_tokens = sum(estimate_tokens(m) for m in messages)

    return messages

4. Client-Side Session Persistence

Store session ID in browser localStorage:

function getOrCreateSession() {
    let sessionId = localStorage.getItem('sqlatte_session_id');

    if (!sessionId) {
        sessionId = crypto.randomUUID();  // same helper as in the earlier snippet
        localStorage.setItem('sqlatte_session_id', sessionId);
    }

    return sessionId;
}

Benefits:

  • Conversations persist across page reloads
  • User comes back later → continues same conversation
  • No server-side user accounts needed

🔒 Edge Cases We Handle

1. Session Timeout

from datetime import datetime, timedelta

def is_expired(self, timeout_minutes: int = 60) -> bool:
    """Check if session has expired"""
    elapsed = datetime.now() - self.last_activity
    return elapsed > timedelta(minutes=timeout_minutes)

Why 60 minutes?

  • Users take coffee breaks
  • Still relevant context
  • Not too memory-hungry

2. Context Confusion

Sometimes the AI misinterprets context. We add a disambiguation instruction:

if user_question_is_ambiguous(question, context):
    prompt = f"""
    IMPORTANT: User's question might be ambiguous.

    Question: "{question}"
    Previous context: {context_summary}

    If ambiguous, ask for clarification rather than guessing.
    """
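`user_question_is_ambiguous` is our own check. A crude sketch of one way to implement it, as a variant that takes the set of tables mentioned in recent context (hypothetical, and deliberately conservative):

def user_question_is_ambiguous(question: str, context_tables: set[str]) -> bool:
    """Crude sketch: a pronoun plus several candidate tables in context
    means a phrase like 'their orders' could point at more than one thing."""
    pronouns = {'they', 'them', 'their', 'those', 'that', 'it'}
    has_pronoun = any(word in pronouns for word in question.lower().split())
    return has_pronoun and len(context_tables) > 1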

3. Table Switching

User changes topic → clear context:

def detect_topic_change(new_question, old_context):
    """
    Detect if user switched to different tables/topic
    (extract_tables: table-name extraction, e.g. the regex sketch earlier)
    """
    old_tables = extract_tables(old_context)
    new_tables = extract_tables(new_question)

    if not set(old_tables).intersection(new_tables):
        return True  # Topic changed!

    return False

📈 Results: Before vs After

User Satisfaction

  • Before: 67% completion rate (users gave up)
  • After: 94% completion rate
  • Why: Natural conversation flow

Query Complexity

  • Before: Average 1.2 queries per session
  • After: Average 4.8 queries per session
  • Why: Users explore data more when follow-ups work

Error Rate

  • Before: 23% queries failed (missing context)
  • After: 8% queries failed
  • Why: AI understands intent better

🎓 Key Lessons Learned

1. Don't Over-Context

10 messages is the sweet spot. More = diminishing returns.

2. Structure Your History

❌ Dump all messages
✅ Structured summary + recent details

3. Make Sessions Disposable

Don't try to keep everything forever. Clean up expired sessions.

4. Prompt Engineering > Architecture

How you inject context matters MORE than how much context.

5. Test Follow-Up Scenarios

Write tests for:

  • "Show me X" → "Now show me Y"
  • "Filter by A" → "Remove that filter"
  • "Top 10" → "Show details for #3"

💡 Try It Yourself

Full implementation in SQLatte:
👉 https://github.com/osmanuygar/sqlatte

🎯 Conclusion

Building AI that remembers isn't about storing everything - it's about storing the right things at the right time.

Key takeaways:

  1. ✅ Use session-based architecture
  2. ✅ Limit context window (10 messages sweet spot)
  3. ✅ Structure your prompts carefully
  4. ✅ Handle edge cases (timeouts, topic changes)
  5. ✅ Monitor token usage

The result? AI that feels intelligent, not just smart.


What's your experience with conversational AI? Drop a comment below! 👇

Found this useful? Star the repo: https://github.com/osmanuygar/sqlatte


Tags

#ai #llm #promptengineering #python #architecture #claude #gpt #chatgpt #memory #contextwindow #machinelearning #devops #backend #sql #database
