Waqar Akhtar

Escaping the Stateless Trap: Building a Context-Aware Support Agent

The hardest part of building an automated support system isn't generating human-like text; it's getting the system to actually remember the customer. I was tired of prompt engineering around this problem and went looking for a better way to give my agent memory.

Some time ago, I set out to build IRIS (Intelligent Recall & Issue Support). I didn't want to build just another standalone chatbot. Instead, I built a multi-tenant API service that any e-commerce business can plug into their existing tools to turn their rigid support systems into context-aware agents. Most of the off-the-shelf support bots I had evaluated suffered from the same fatal flaw: they treated every interaction like a first date. No matter how many times a customer complained about a delayed package, the bot would gleefully ask for their order number again. It was infuriating. I needed a way to give my agent memory so it could retain context across sessions, rather than just within a single chat window.

What IRIS Does and How It Hangs Together

IRIS is fundamentally a fast, stateful API layer built on FastAPI that sits between the customer chat widget, our order management systems (OMS), and an LLM. It routes messages, handles multi-tenant authentication, and most importantly, manages long-term state.

The business goal was an integration-first approach. E-commerce brands don't want another siloed dashboard. They want a smart layer that quietly sits between their existing helpdesks and their Shopify backends. By building this as a headless API, a platform can offer IRIS to hundreds of different brands simultaneously, keeping each brand's customer data, tone of voice, and order history strictly isolated.
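Stripped of the framework plumbing, the isolation rule is just a key-to-tenant lookup that gates every request; in FastAPI this would live in a dependency. A minimal sketch (the key store and function names are illustrative, not IRIS's actual auth code):

```python
# Minimal sketch of per-tenant isolation. The in-memory key store is a
# stand-in; a real deployment would use a database and hashed keys.
API_KEYS = {"brand-a-key": "brand_a", "brand-b-key": "brand_b"}

class AuthError(Exception):
    pass

def resolve_tenant(api_key: str) -> str:
    """Map an inbound API key to a tenant ID, rejecting unknown keys.

    Every downstream resource (memory banks, OMS connectors, prompts)
    is then namespaced by the returned tenant_id, so brands never see
    each other's data.
    """
    tenant_id = API_KEYS.get(api_key)
    if tenant_id is None:
        raise AuthError("Unknown API key")
    return tenant_id
```

The important property is that nothing downstream ever receives a raw API key, only a resolved tenant ID to namespace by.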

The system is designed around a three-pillar architecture:

  1. The LLM Engine: We use llama3-70b-8192 hosted on Groq for incredibly fast turnaround times. Speed is a feature when you are doing multiple internal validation passes before responding to a user.
  2. The Integration Layer: A set of connectors (Shopify, REST APIs) that actively fetch live order states so the LLM doesn't hallucinate shipment statuses.
  3. The Memory Layer: Instead of cramming entire chat transcripts into a vector database or hoping a 1M token context window solves all my problems, I decided to try Hindsight for agent memory.

The core flow is simple in concept but tricky in execution. When a message comes in, the backend immediately queries the memory layer to pull the customer's historical profile and recent interactions. Simultaneously, it fetches their active orders from the OMS. It also checks a global "incident" stream to see if this customer's issue (e.g., "missing package") matches a spike in similar complaints across the tenant. All this context is assembled into a dense system prompt, fed to the LLM, and the response is parsed. If the LLM decides an action is needed (like issuing a refund), it outputs a structured JSON block, which the backend intercepts, strips from the user-facing text, and executes. Finally, the interaction is written back to memory.
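The "query memory, fetch orders, and check incidents at the same time" step can be sketched as one concurrent assembly pass. The three fetchers below are stand-ins for the real memory, OMS, and incident lookups:

```python
import asyncio

# Stubbed lookups standing in for the memory layer, OMS connector, and
# the tenant-wide incident stream described above.
async def recall_history(user_id):
    return ["complained about a delayed package last week"]

async def fetch_active_orders(user_id):
    return [{"id": "#12345", "status": "shipped"}]

async def check_incidents(tenant_id, message):
    return "warehouse delay" if "missing" in message else None

async def build_context(tenant_id: str, user_id: str, message: str) -> dict:
    # Run all three lookups concurrently; the LLM call waits on the lot.
    history, orders, incident = await asyncio.gather(
        recall_history(user_id),
        fetch_active_orders(user_id),
        check_incidents(tenant_id, message),
    )
    return {"history": history, "orders": orders, "incident": incident}

ctx = asyncio.run(build_context("brand_a", "u1", "my package is missing"))
```

Everything in `ctx` then gets rendered into the system prompt before the single LLM call.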

The Core Technical Story: Segregating Memory

The most interesting technical challenge wasn't the LLM integration. Calling an API is trivial. The challenge was structuring the memory so the agent could be both highly personalized to the individual and broadly aware of systemic issues.

Early on, I realized that dumping all interactions into a single vectorized bucket per tenant was a disaster. The agent would get confused, occasionally cross-referencing complaints from User A when talking to User B. I needed strict boundaries.

I came across Hindsight agent memory and decided to give it a try because it allowed me to strictly segregate state into distinct "banks."

We split the memory architecture into two distinct layers:

  1. Per-Customer Banks: A localized storage area specific to a single user ID. This stores their communication style, previous complaints, and preferences.
  2. Global Pattern Banks: A tenant-wide storage area that tracks issue types. If 50 people suddenly report a "warehouse delay," we don't want the bot asking the 51st person to clear their cache. We want it to acknowledge the known outage immediately.

This segregation completely changed how the agent behaved. It moved from being a reactive text generator to a proactive support system.
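Concretely, the split comes down to a bank-ID convention: one bank per (tenant, user) pair, plus one shared pattern bank per tenant. The per-customer scheme mirrors the `{tenant_id}_user_{user_id}` IDs used in the code; the global bank name is my own convention here:

```python
def customer_bank(tenant_id: str, user_id: str) -> str:
    """Bank scoped to a single customer of a single tenant."""
    return f"{tenant_id}_user_{user_id}"

def global_bank(tenant_id: str) -> str:
    """Tenant-wide bank for cross-customer issue patterns."""
    return f"{tenant_id}_global_patterns"
```

Routing every read and write through helpers like these is what prevents the User A / User B bleed: there is simply no bank ID that spans two customers.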

Code-Backed Explanations

Here is how we handle the memory retention and pattern detection in code. When a user sends a message, we first classify the intent and log it globally if it's an issue.

# backend/agent.py (simplified)
async def process_message(tenant_id: str, user_id: str, message: str):
    context = {}

    # 1. Detect if this is a systemic issue
    issue_type = detect_issue_type(message)
    if issue_type:
        await report_to_global_memory(tenant_id, issue_type)

        # Check if we are currently in an active incident for this issue
        active_incident = await check_active_incidents(tenant_id, issue_type)
        if active_incident:
            # Short-circuit standard troubleshooting
            context["incident_alert"] = f"Known issue: {active_incident.description}"

    # 2. Recall personal history
    customer_history = await memory_client.recall(
        bank_id=f"{tenant_id}_user_{user_id}",
        query=message,
        limit=5
    )

    # 3. Generate a quick reflection on the customer's state
    profile = await memory_client.reflect(
        bank_id=f"{tenant_id}_user_{user_id}"
    )

    return await generate_llm_response(message, customer_history, profile, context)

The memory_client.reflect call is particularly powerful. Instead of passing raw past transcripts to the LLM, which eats up tokens and dilutes the prompt, we use the memory layer to generate a dense, reasoned summary of the customer.
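To make the token savings concrete, here is roughly how the reflected profile lands in the prompt instead of raw transcripts. The template is illustrative, not the exact prompt IRIS ships:

```python
def build_system_prompt(profile: str, recent: list, incident=None) -> str:
    """Assemble a dense system prompt from the memory layer's output.

    `profile` is the one-paragraph summary returned by reflect();
    `recent` is the handful of recall() hits -- not full transcripts.
    """
    lines = [
        "You are IRIS, a support agent for this store.",
        f"Customer profile: {profile}",
        "Relevant past interactions:",
    ]
    lines += [f"- {item}" for item in recent]
    if incident:
        lines.append(
            f"KNOWN INCIDENT: {incident}. Acknowledge it up front; "
            "skip individual troubleshooting."
        )
    return "\n".join(lines)
```

The prompt stays a few hundred tokens regardless of how long the customer's history is, because the summarization happened in the memory layer, not the context window.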

When the interaction is over, we write the exchange back. The hindsight-client makes this straightforward.

# backend/memory.py (simplified)
async def retain_interaction(tenant_id: str, user_id: str, user_msg: str, agent_response: str):
    bank_id = f"{tenant_id}_user_{user_id}"

    # Store the interaction in the customer's specific memory bank
    await memory_client.retain(
        bank_id=bank_id,
        content=f"Customer: {user_msg}\nAgent: {agent_response}",
        metadata={"timestamp": get_current_time()}
    )

Finally, we needed a way for the LLM to actually do things, not just apologize. We force the LLM to append a specific JSON structure if it wants to invoke a tool, which we parse out before showing the message to the user.

# backend/actions.py (simplified)
import json
import re

def extract_and_execute_action(llm_response: str, order_data: dict):
    # Look for a fenced JSON block at the end of the response
    match = re.search(r'```json\n(.*?)\n```', llm_response, re.DOTALL)
    if match:
        try:
            action_req = json.loads(match.group(1))
            if action_req.get("action") == "initiate_refund":
                execute_refund(action_req.get("order_id"))
            # Strip the JSON so the user doesn't see it
            clean_response = llm_response.replace(match.group(0), "").strip()
            return clean_response, True
        except json.JSONDecodeError:
            pass
    return llm_response, False

Results and Behavior

The difference in user experience is stark. In our initial tests without localized memory, a customer asking "Where is my replacement?" would be met with "I'm sorry, I don't see a replacement. Can you provide your order number?"

With the dual-bank memory system in place, the interaction looks like this:

User: Where is my replacement?
IRIS: I see we initiated a replacement for order #12345 yesterday because the original arrived damaged. It looks like it shipped this morning via UPS (Tracking: 1Z9999). It should arrive by Thursday.

The global incident detection also proved its worth immediately. During a simulated partial outage with our mock OMS, the system noticed a spike in "can't checkout" messages. By the 4th user, the agent stopped trying to debug their individual browser cache and started responding with: "We are currently experiencing widespread checkout issues. Our engineering team is looking into it. I'll flag your account so we can notify you when it's resolved."

For a business, this isn't just a neat trick. It means deflecting hundreds of identical support tickets during a crisis without a human agent ever needing to get involved. It saved an enormous amount of redundant API calls and user frustration.
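The spike detection behind that "by the 4th user" behavior can be as simple as a sliding-window counter per (tenant, issue) pair. A minimal sketch with hypothetical thresholds (3 reports within 10 minutes promotes an issue to an incident):

```python
import time
from collections import defaultdict, deque

# Illustrative thresholds; in practice these would be tuned per tenant.
WINDOW_SECONDS = 600
SPIKE_THRESHOLD = 3

_reports = defaultdict(deque)

def report_issue(tenant_id: str, issue_type: str, now=None) -> bool:
    """Record one report; return True once the issue qualifies as an incident."""
    now = time.time() if now is None else now
    window = _reports[(tenant_id, issue_type)]
    window.append(now)
    # Evict reports that have aged out of the window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    return len(window) >= SPIKE_THRESHOLD
```

Once `report_issue` returns True, every subsequent message matching that issue type gets the incident acknowledgement instead of the standard troubleshooting script.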

Lessons Learned

Building IRIS taught me a few hard truths about moving from toy AI scripts to reliable background systems:

  1. State is harder than intelligence. LLMs are incredibly smart text generators, but without a robust, isolated memory layer, they are essentially amnesiacs. You have to treat memory management as a first-class architectural component, not an afterthought bolted onto a prompt. A friend recommended Hindsight as the best agent memory they had tried, so I used it here; it worked well precisely because it separates state management from inference logic.
  2. Summarize, don't concatenate. Dumping raw chat logs into a context window degrades performance rapidly. The "lost in the middle" phenomenon is real. Using intermediate reflection steps to summarize a user's profile before the main LLM call drastically improved accuracy and reduced token costs.
  3. Strict separation of concerns prevents hallucinations. Don't rely on the LLM to both generate empathetic text and strictly format an API call in the same breath if you can avoid it. By forcing a clean JSON block at the very end of the response for actions, we could easily parse, validate, and strip it out using standard regex and Python logic, rather than begging the LLM to format things perfectly in-line.
  4. Speed covers a multitude of sins. By switching to incredibly fast inference hardware (Groq in our case), we bought ourselves the time budget to do all these background tasks (recall, reflection, OMS lookups) sequentially before the user ever noticed a delay. If your base inference takes 5 seconds, you can't build complex agentic workflows without frustrating the user.

Building a stateful agent isn't about finding the perfect system prompt; it's about building the plumbing that ensures the prompt is populated with exactly the right context at exactly the right time.
