zhongqiyue

Posted on Jun 13

I almost gave up on my AI assistant — here’s how I fixed context handling

#ai #python #webdev #api

I’ve been building a personal AI assistant for the past few months. You know the kind: you chat with it, it remembers what you said, and it helps with tasks like summarizing emails, answering questions about your notes, or just being a sounding board.

It started as a weekend project. A few Python scripts, an OpenAI-compatible API endpoint, and a simple loop in the terminal. I was smug. "Look, I built an AI!" But then things got ugly.

The moment I started having longer conversations, the bot became useless. It would forget what I said three messages ago, contradict itself, or start repeating the same advice. I was throwing more and more tokens at the API, and my wallet was crying. Something had to change.

The naive approach (and why it failed)

My first attempt was trivial: just append every new message to a list and send the whole history as the messages array to the API. That worked… for about 10 exchanges. Then token limits kicked in. The API started truncating the oldest messages, breaking the conversation flow.

I tried a sliding window approach—keep only the last N messages. Better, but the assistant lost the long-term context. If I asked it to "remind me of that book I mentioned yesterday," it had no idea. I was essentially lobotomizing my bot every few turns.
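A sliding window boils down to a single slice. Here’s a minimal sketch of the idea; the function name `trim_to_window` and the `max_messages` parameter are made up for illustration and are not taken from my original scripts:

```python
from typing import Dict, List


def trim_to_window(history: List[Dict], max_messages: int = 6) -> List[Dict]:
    """Return only the newest max_messages entries; everything older is dropped."""
    if max_messages <= 0:
        return []
    return history[-max_messages:]


history = [{"role": "user", "content": f"message {i}"} for i in range(10)]
window = trim_to_window(history, max_messages=4)
print([m["content"] for m in window])
```

Anything that falls out of the window is gone for good, which is exactly why the "book I mentioned yesterday" question fails.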

Another dead end was summarizing earlier parts of the conversation on every turn. That worked technically, but it added latency and cost. Each turn, I had to re-summarize the entire history. Not sustainable.

What I needed (but didn't have words for)

I needed a system that could:

Keep the most recent N messages intact (for precise responses)
Maintain a compressed summary of earlier parts of the conversation
Automatically decide when to summarize vs. when to pass raw messages
Work without an expensive vector database or fine-tuning

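This turned out to be a well-known pattern in conversational AI, sometimes called hierarchical context management (LangChain’s ConversationSummaryBufferMemory is one implementation of the same idea). I just didn’t know the name then.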

The approach that finally clicked

Here’s the high-level design:

[Messages]
  ├─ Recent (last 5-10 messages) → passed raw to the API
  └─ Older history → periodically summarized into a static summary string

The key insight is that you don’t need to summarize after every message. You only need to rotate the summary when the conversation has grown enough to push out important content. For my use case, I set a threshold: once the recent window exceeds 6 messages AND enough time has passed since the last summarization (60 seconds in the code below), I trigger a summarization.

Here’s the Python class that implements this:

import time
from typing import Dict, List

class ContextManager:
    def __init__(self, max_recent: int = 6, summary: str = ""):
        self.max_recent = max_recent
        self.summary = summary
        self.recent_messages: List[Dict] = []
        self.last_summary_time = time.time()

    def add_message(self, role: str, content: str):
        self.recent_messages.append({"role": role, "content": content})
        if len(self.recent_messages) > self.max_recent:
            self._maybe_summarize()

    def _maybe_summarize(self):
        # Summarize only if the buffer overflowed and the 60-second cooldown has passed
        if time.time() - self.last_summary_time < 60:
            return
        # Keep the newest (max_recent - 2) messages raw; fold the rest into the summary
        keep = max(self.max_recent - 2, 1)
        older = self.recent_messages[:-keep]
        if older:
            # Carry the previous summary along so it is merged, not overwritten
            carried = [{"role": "system", "content": self.summary}] if self.summary else []
            new_summary = self._summarize_messages(carried + older)
            self.summary = new_summary if new_summary else self.summary
            self.recent_messages = self.recent_messages[-keep:]
            self.last_summary_time = time.time()

    def _summarize_messages(self, msgs: List[Dict]) -> str:
        # This is where you call an LLM to produce a concise summary.
        # For minimal dependencies, I used simple concatenation + truncation,
        # but a real LLM call is better.
        text = "\n".join(m["content"] for m in msgs)
        # Naive fallback: keep the newest 500 chars so the carried-over summary
        # cannot crowd out new content (better to use real summarization)
        return text[-500:]

    def build_context(self, system_prompt: str) -> List[Dict]:
        content = system_prompt
        # Only mention a summary once there is one
        if self.summary:
            content += f"\nSummary of earlier conversation: {self.summary}"
        return [{"role": "system", "content": content}] + self.recent_messages

This class builds the context array that you send to the API. The system prompt now includes a compressed summary, and the recent messages are raw. The trade-off? The summary can lose nuance. But it has been good enough for most of my day-to-day conversations.
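The `_summarize_messages` stub above is where a real model call would go. Here’s a rough sketch of what that could look like, assuming you can wrap your client in a function that takes a prompt and returns text. The `complete` hook, the prompt wording, and the `max_chars` cap are all assumptions for illustration, not something the class above already has:

```python
from typing import Callable, Dict, List


def summarize_with_llm(
    previous_summary: str,
    msgs: List[Dict],
    complete: Callable[[str], str],
    max_chars: int = 500,
) -> str:
    """Fold msgs into previous_summary using a text-completion callable.

    `complete` is a hypothetical hook: it takes a prompt string and returns
    the model's reply. Wire it to whatever client you use.
    """
    transcript = "\n".join(f'{m["role"]}: {m["content"]}' for m in msgs)
    prompt = (
        "Update the running summary of a conversation.\n"
        f"Current summary:\n{previous_summary or '(none)'}\n\n"
        f"New messages:\n{transcript}\n\n"
        f"Reply with the updated summary only, in at most {max_chars} characters."
    )
    reply = complete(prompt).strip()
    # Fall back to the old summary if the model returns nothing,
    # and hard-cap the length in case it ignores the instruction.
    return reply[:max_chars] if reply else previous_summary


# Stand-in for a real model call, so the sketch runs offline
fake_llm = lambda prompt: "User is reading Dune and wants a reminder about it."
print(summarize_with_llm("", [{"role": "user", "content": "I'm reading Dune"}], fake_llm))
```

Passing the callable in keeps the summarizer testable without network access and lets you swap models without touching the context logic.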

Real code with an API call

Here’s how I hook it into an actual OpenAI-compatible API (I used the endpoint from ai.interwestinfo.com in my config):

from openai import OpenAI  # openai>=1.0 client; the old openai.ChatCompletion API was removed in 1.0

# The API key is read from the OPENAI_API_KEY environment variable by default
client = OpenAI(base_url="https://ai.interwestinfo.com/v1")  # my custom endpoint

context = ContextManager(max_recent=6)
# ... after some conversation
user_input = "What were we discussing about the book?"
context.add_message("user", user_input)

messages = context.build_context("You are a helpful assistant.")

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=messages,
)
assistant_reply = response.choices[0].message.content
context.add_message("assistant", assistant_reply)

This pattern worked for me. The bot now remembers key points from ten minutes ago, and I’m not going bankrupt on tokens.

Lessons learned and trade-offs

Summarization quality matters. If your summarizer is crap (like my naive truncation), the assistant will misinterpret context. I later switched to a dedicated fine-tuned model for summarization, which improved recall significantly.
Time-based triggering is fragile. One hour without activity, and the summary might be stale. I added a check: if the last summary is older than 5 minutes and the recent buffer is full, re-summarize.
Not suitable when you need exact recall. If you need every previous message verbatim (e.g., legal or medical), this approach loses information. You’d need to keep the full transcript and retrieve from it, for example with a vector database.
It’s a tunable trade-off. max_recent, threshold for summarization, and summary length are all knobs you can turn. Start small and increase until you meet your quality/cost balance.

What I’d do differently next time

If I were to start over, I’d build the summarization step as an async background job. Right now, the _maybe_summarize call blocks the main thread when it triggers. Not a big deal for a CLI assistant, but for a web app with many concurrent users, that’s a problem.
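Here’s roughly what the non-blocking version could look like with `asyncio`. Treat it as a sketch under assumptions rather than what I actually run: `BackgroundSummarizer`, the async `summarize` hook, and `wait_idle` are hypothetical names, and `add_message` has to be called from inside a running event loop because it uses `asyncio.create_task`.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Optional


class BackgroundSummarizer:
    """Runs summarization as an asyncio task so add_message never waits on it."""

    def __init__(
        self,
        summarize: Callable[[str, List[Dict]], Awaitable[str]],
        max_recent: int = 6,
    ):
        self.summarize = summarize  # hypothetical async hook: (old_summary, msgs) -> new summary
        self.max_recent = max_recent
        self.summary = ""
        self.recent: List[Dict] = []
        self._task: Optional[asyncio.Task] = None

    def add_message(self, role: str, content: str) -> None:
        self.recent.append({"role": role, "content": content})
        overflow = len(self.recent) > self.max_recent
        idle = self._task is None or self._task.done()
        if overflow and idle:
            # Snapshot the overflow now; the task works on its own copy.
            older = self.recent[:-self.max_recent]
            self._task = asyncio.create_task(self._run(older))

    async def _run(self, older: List[Dict]) -> None:
        new_summary = await self.summarize(self.summary, older)
        if new_summary:
            self.summary = new_summary
            # Drop only the messages that were summarized; anything appended
            # while the task was running stays in the buffer.
            self.recent = self.recent[len(older):]

    async def wait_idle(self) -> None:
        """Wait for an in-flight summarization, if any (handy in tests and on shutdown)."""
        if self._task is not None:
            await self._task


async def demo() -> BackgroundSummarizer:
    async def fake_summarize(old: str, msgs: List[Dict]) -> str:
        await asyncio.sleep(0)  # stand-in for a network call
        return (old + " " + " ".join(m["content"] for m in msgs)).strip()

    ctx = BackgroundSummarizer(fake_summarize, max_recent=2)
    for i in range(4):
        ctx.add_message("user", f"m{i}")
    await ctx.wait_idle()
    return ctx


ctx = asyncio.run(demo())
print(ctx.summary, [m["content"] for m in ctx.recent])
```

Only one summarization runs at a time, so the buffer can briefly hold more than `max_recent` messages while a task is in flight. The next `add_message` after the task finishes picks up the remaining overflow.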

I’d also validate the summary length against the model’s token limit before sending. In my current version, the summary plus the recent messages can outgrow the context window, at which point the request gets truncated or rejected, depending on the endpoint. I need to enforce a token budget.
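A token budget could be enforced along these lines. This is a sketch with loud assumptions: `estimate_tokens` uses a crude "about 4 characters per token" heuristic instead of a real tokenizer, `max_tokens` covers only the summary plus the recent messages (not the system prompt or the reply), and `summary_share` is a made-up knob.

```python
from typing import Dict, List, Tuple


def estimate_tokens(text: str) -> int:
    """Crude estimate: roughly 4 characters per token (an assumption, not a tokenizer)."""
    return max(1, len(text) // 4)


def fit_to_budget(
    summary: str,
    recent: List[Dict],
    max_tokens: int,
    summary_share: float = 0.5,
) -> Tuple[str, List[Dict]]:
    """Cap the summary at its share of the budget, then drop the oldest raw messages."""
    summary_chars = int(max_tokens * summary_share) * 4
    summary = summary[:summary_chars]
    used = estimate_tokens(summary) if summary else 0
    kept = list(recent)
    # Always keep at least the newest message, even if it alone exceeds the budget
    while len(kept) > 1 and used + sum(estimate_tokens(m["content"]) for m in kept) > max_tokens:
        kept.pop(0)  # drop the oldest raw message
    return summary, kept


long_summary = "s" * 400
recent = [{"role": "user", "content": "x" * 80} for _ in range(5)]
trimmed_summary, kept = fit_to_budget(long_summary, recent, max_tokens=100)
print(len(trimmed_summary), len(kept))
```

For real numbers you’d swap the heuristic for the tokenizer that matches your model, and leave headroom for the system prompt and the reply.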

Finally, I’d persist the context to a database explicitly. Right now the context lives in memory, so if the server restarts, the assistant forgets everything. A simple Redis store would fix that.
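The state that has to survive a restart is small: the summary string and the recent messages. Here’s a minimal sketch that serializes both to JSON. The helper names `dump_context` and `load_context` are hypothetical, and the storage step is left out so the example runs without a server:

```python
import json
from typing import Dict, List, Tuple


def dump_context(summary: str, recent: List[Dict]) -> str:
    """Serialize the two pieces of state that need to survive a restart."""
    return json.dumps({"summary": summary, "recent_messages": recent})


def load_context(payload: str) -> Tuple[str, List[Dict]]:
    """Inverse of dump_context; tolerates missing keys in older payloads."""
    data = json.loads(payload)
    return data.get("summary", ""), data.get("recent_messages", [])


payload = dump_context("We talked about a book.", [{"role": "user", "content": "hi"}])
summary, recent_messages = load_context(payload)
print(summary, recent_messages)
```

With the redis-py client, the payload could be written with `set` and read back with `get` under a per-user key. The key scheme is up to you.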

Where do you handle context?

I’m curious how other devs solve this. Do you use a fixed token window? A vector store? Or do you just lean on a long context window and send everything (and pay the price)? Let me know in the comments. I’d love to compare notes.

Top comments (2)

Ebony Martin • Jul 6

I know this thread is a bit old, but I'm curious if anyone has tried integrating this with a vector database for exact-recall queries? It seems like it could complement the hierarchical context management nicely. Also, has anyone experimented with async summarization? I'd love to hear about your experiences or any updates on improving the efficiency of your AI assistant!

Nicholas Flores • Jul 7

Bringing this one back up because I'm curious — has anyone tried combining this hierarchical approach with a basic keyword-based retrieval system to keep track of specific topics? It might work well to retrieve specific points without bogging down with too much summarization. Also, any updates on using external storage like Redis to maintain context through restarts? Would love to hear how others are tackling this!