Dumping the entire chat history into your LLM prompt is the fastest way to bankrupt your token budget and degrade model reasoning. Here is how to build a smart, stateful memory layer that retrieves only what your agent needs to know.

## Why this matters

When building AI tools, developers almost always start by appending every new user message to a continuously growing messages array. This naive approach scales terribly. As the context window fills up, your API costs skyrocket, latency spikes to unusable levels, and the LLM suffers from the "lost in the middle" phenomenon, forgetting crucial system instructions buried under dozens of irrelevant chat turns.

By decoupling memory from the active prompt and pushing state to a fast datastore like Redis, you can separate short-term conversational context from long-term user preferences. This keeps your context window lean, reduces hallucination, and makes your application feel like a cohesive product rather than a goldfish.

## How it works

Let's look at an internal DevOps Slack bot for a SaaS platform. If an engineer asks, "Why is my database migration failing?", the bot needs the last few messages for immediate conversational flow (short-term memory). But if the engineer stated three weeks ago, "I always use PostgreSQL 15 on staging," that fact shouldn't be lost just because it rolled out of the recent chat history (long-term memory).

We solve this with a dual-layer Redis strategy:

- **Sliding-window chat history:** a Redis list holding only the last N messages (e.g., the last 6) for immediate context.
- **User preference profile:** a Redis hash (`HSET`) storing extracted, persistent facts about the user. These facts are injected dynamically into the system prompt.

When a user interacts with the bot, the application fetches the preference profile to build the system prompt, retrieves the sliding-window history, and appends the new query.
After the LLM replies, a lightweight background task evaluates the conversation to see if any new long-term facts need to be saved to the profile.

## The Code: Dual-Layer Memory with Redis

Here is a practical Python implementation using the redis client. We use a standard Redis list to cap the conversation history and a hash to store persistent traits.
```python
import json

import redis


class AgentMemory:
    def __init__(self, redis_url: str = "redis://localhost:6379/0"):
        self.client = redis.Redis.from_url(redis_url, decode_responses=True)
        # Keep only the last 6 messages (3 user/assistant turns)
        self.max_history = 6

    def get_user_profile(self, user_id: str) -> dict:
        """Fetch long-term preferences to inject into the system prompt."""
        profile_key = f"user:{user_id}:profile"
        data = self.client.hgetall(profile_key)
        return data if data else {}

    def update_user_preference(self, user_id: str, key: str, value: str):
        """Update a specific fact about the user."""
        profile_key = f"user:{user_id}:profile"
        self.client.hset(profile_key, key, value)

    def add_message(self, user_id: str, role: str, content: str):
        """Add a message to the sliding window history."""
        history_key = f"user:{user_id}:history"
        message = json.dumps({"role": role, "content": content})
        # Push new message to the right
        self.client.rpush(history_key, message)
        # Trim list to retain only the latest `max_history` items
        self.client.ltrim(history_key, -self.max_history, -1)
        # Set a TTL so abandoned conversations expire after 24 hours
        self.client.expire(history_key, 86400)

    def build_context(self, user_id: str, new_query: str) -> list:
        """Assemble the final payload for the LLM."""
        # 1. Inject long-term memory into the system prompt
        profile = self.get_user_profile(user_id)
        system_content = "You are a helpful internal DevOps agent."
        if profile:
            system_content += f"\nUser Preferences: {json.dumps(profile)}"
        messages = [{"role": "system", "content": system_content}]

        # 2. Append short-term sliding window history
        history_key = f"user:{user_id}:history"
        raw_history = self.client.lrange(history_key, 0, -1)
        for msg in raw_history:
            messages.append(json.loads(msg))

        # 3. Append the new user query
        messages.append({"role": "user", "content": new_query})
        return messages
```
```python
# --- Usage Example ---
memory = AgentMemory()
user_id = "eng_402"

# Simulating a background process discovering a user preference
memory.update_user_preference(user_id, "preferred_db", "PostgreSQL 15")
memory.update_user_preference(user_id, "environment", "staging")

# Simulating conversation
memory.add_message(user_id, "user", "How do I check the logs?")
memory.add_message(user_id, "assistant", "You can use kubectl logs for the current pod.")

# Building context for the next request
final_prompt = memory.build_context(user_id, "Why is my migration failing?")
print(json.dumps(final_prompt, indent=2))
```
## Pitfalls and gotchas
- **Stale fact conflicts:** If a user changes their mind ("Actually, I moved to MySQL"), your fact-extraction logic needs to overwrite the old preference rather than blindly appending it. Otherwise, the system prompt will contain contradictory instructions.
- **Profile token creep:** While the sliding window caps short-term history, the user profile can grow indefinitely. Establish a strict schema or length limit for the `HSET` to prevent the system prompt from quietly ballooning in size.
- **Fact-extraction latency:** Do not block the user's response while trying to extract long-term facts using an LLM. Return the chat response first, and run the extraction prompt as an asynchronous background worker.
- **Race conditions:** If your agent makes parallel tool calls or concurrent requests, updating the Redis list simultaneously can lead to out-of-order messages. Always append messages in a strict sequence or use atomic transactions.
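For the race-condition point, the push/trim/expire sequence can be sent as a single `MULTI`/`EXEC` transaction so a concurrent writer can never sneak a message in between the push and the trim. A minimal sketch; the `append_message_atomic` helper name is hypothetical, and `client` is assumed to be a redis-py connection:

```python
import json


def append_message_atomic(client, user_id: str, role: str, content: str,
                          max_history: int = 6, ttl: int = 86400):
    """Append a chat message and trim the sliding window atomically.

    pipeline(transaction=True) wraps the queued commands in MULTI/EXEC,
    so all three run back-to-back on the server with no interleaving.
    """
    key = f"user:{user_id}:history"
    pipe = client.pipeline(transaction=True)
    pipe.rpush(key, json.dumps({"role": role, "content": content}))
    pipe.ltrim(key, -max_history, -1)
    pipe.expire(key, ttl)
    pipe.execute()
```

As a bonus, batching the three commands into one round trip also shaves a little latency off every message write.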
## What to try next
- **Implement a background summarizer:** Instead of a strict sliding window, write a Celery or RQ task that triggers when the history hits 10 messages, uses a smaller model to summarize the oldest 5 messages, and stores that summary at the top of the history list.
- **Vectorize old chats:** For internal tools where past debugging sessions are highly valuable, dump expired Redis chat histories into a vector DB (like Qdrant or Pinecone). Give your agent a `search_past_conversations` tool to retrieve them.
- **Add TTL to preferences:** Not all user preferences are permanent. Extend the `update_user_preference` method to include an expiration time so ephemeral states (e.g., "I am currently debugging the auth service") automatically clear themselves out after a few hours.
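One wrinkle with the preference-TTL idea: fields inside a Redis hash share the hash's lifetime, and per-field expiry (`HEXPIRE`) only arrived in Redis 7.4. A portable workaround is to store each ephemeral fact as its own key with its own TTL. A sketch under that assumption; the helper names are illustrative, and `client` is a redis-py connection created with `decode_responses=True`:

```python
def set_ephemeral_preference(client, user_id: str, key: str, value: str,
                             ttl_seconds: int = 14400):
    """Store a short-lived fact as a standalone key so it expires on its own.

    Default TTL is 4 hours; the persistent profile hash is left untouched.
    """
    client.set(f"user:{user_id}:pref:{key}", value, ex=ttl_seconds)


def get_ephemeral_preferences(client, user_id: str) -> dict:
    """Collect whichever ephemeral preferences have not yet expired."""
    prefix = f"user:{user_id}:pref:"
    prefs = {}
    # SCAN (not KEYS) so we don't block Redis on large keyspaces
    for k in client.scan_iter(match=prefix + "*"):
        prefs[k[len(prefix):]] = client.get(k)
    return prefs
```

The result can be merged with the `hgetall` output in `build_context`, so expired facts simply drop out of the system prompt on their own.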