Aditya
Teaching an AI System to Remember What It Learned: Building a Conflict Mediator with Persistent Agent Memory

I built a system to mediate roommate conflicts, and it taught me that AI agents without memory are essentially useless for real problems.

The premise was straightforward: create a platform that logs disputes between roommates, identifies root causes, and offers mediation strategies backed by data. But the real challenge wasn't the recommendation logic—it was building an agent that could actually learn from patterns instead of starting from scratch on the same conflicts every session.

The Problem I Wasn't Expecting

When you build an AI system without persistent state, you're betting the entire learning mechanism on your prompt. Each time the agent runs, it forgets everything it learned last session. For a conflict mediator specifically, this is catastrophic. How can you identify that "Alice's frustration always peaks on Monday mornings" if you start from zero knowledge every time? How do you spot that "noise complaints spike after Bob's work schedule changes to evening shifts" if you don't remember the temporal correlation?

Most projects I'd seen solved this with a database—typically dumping conversations into PostgreSQL and querying them at inference time to build context. That works, but it forces you into two bad patterns: either you engineer static retrieval queries (brittle and task-specific), or you stuff everything into a prompt and pray that the LLM can reason over gigabytes of context.

I needed something different: a system that could retain facts about the household, recall relevant patterns using multiple strategies, and apply agentic reasoning with configurable personality traits. That's when I looked at persistent agent memory—specifically Hindsight.

Architecture: Components Playing Well Together

The system has four main parts:

RoommateMediator orchestrates everything. It's the main entry point that coordinates conflict logging, analysis, recommendations, and insight generation.

HindsightManager wraps the Hindsight API and handles three critical operations:

  • Retain: Store conflict facts with temporal context
  • Recall: Pull back relevant historical conflicts and patterns using multiple retrieval strategies
  • Reflect: Ask the agent to reason about a specific pair and generate recommendations based on personality configuration

ConflictAnalyzer does local pattern detection—finding temporal clusters, trigger correlations, and behavioral trends.

RecommendationEngine combines historical data with agentic reasoning to output mediation strategies.
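The Conflict record that moves between these components can be a plain dataclass. Here's a minimal sketch; the field names mirror the snippets below, but to_hindsight_facts is my guess at the flattening step, not the project's exact implementation:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List

@dataclass
class Conflict:
    person_a: str
    person_b: str
    topic: str
    severity: str
    timestamp: datetime
    description: str

    def to_hindsight_facts(self) -> List[str]:
        """Flatten the conflict into plain-text facts for retention."""
        return [
            f"{self.person_a} reported a {self.severity} conflict with "
            f"{self.person_b} about {self.topic}",
            f"Details: {self.description}",
            f"Occurred on a {self.timestamp:%A} at {self.timestamp:%H:%M}",
        ]
```

Keeping the temporal context ("on a Monday at 9:15") in the facts themselves is what later lets temporal retrieval do its job.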

Here's how a conflict flows through the system:

def log_conflict(self, person_a: str, person_b: str, topic: str,
                 severity: str, description: str, context: str = "") -> str:
    """Log a new conflict incident - stored in Hindsight memory."""

    conflict = Conflict(
        person_a=person_a,
        person_b=person_b,
        topic=topic,
        severity=severity,
        timestamp=datetime.now(),
        description=description
    )

    # Local analysis
    self.analyzer.add_conflict(conflict)

    # Persistent memory - this is where learning happens
    conflict_facts = conflict.to_hindsight_facts()
    conflict_id = self.hindsight.retain_conflict(conflict_facts)

    return conflict_id
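For completeness, here's roughly what retain_conflict looks like on the HindsightManager side. The endpoint path and payload field names ("facts", "fact_type", "memory_id") are my assumptions, mirrored off the recall call shown next; check the Hindsight API docs for the real shapes:

```python
import requests

def retain_conflict(self, facts: list) -> str:
    """Store conflict facts in the memory bank; return the new memory id."""
    headers = {
        "Authorization": f"Bearer {self.api_key}",
        "Content-Type": "application/json",
    }
    # "facts" / "fact_type" / "memory_id" are assumed field names
    payload = {"facts": facts, "fact_type": "world_fact"}
    response = requests.post(
        f"{self.api_url}/memory_banks/{self.memory_bank_id}/retain",
        headers=headers,
        json=payload,
        timeout=10,
    )
    response.raise_for_status()
    return response.json().get("memory_id", "")
```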

When you call get_recommendations() later, it doesn't just echo back the logged conflict. It queries Hindsight using multiple retrieval strategies—temporal, semantic, keyword-based, and graph-based—to find related patterns:

def recall_conflicts(self, person_a: str, person_b: str) -> List[Dict]:
    """Recall past conflicts for this pair using multi-strategy search."""

    headers = {
        "Authorization": f"Bearer {self.api_key}",
        "Content-Type": "application/json"
    }

    payload = {
        "search_strategies": ["temporal", "semantic", "keyword", "graph"],
        "query": f"conflicts between {person_a} and {person_b}",
        "max_results": 10
    }

    response = requests.post(
        f"{self.api_url}/memory_banks/{self.memory_bank_id}/recall",
        headers=headers,
        json=payload,
        timeout=15
    )

    if response.status_code == 200:
        print("✅ [Hindsight API] Conflicts recalled successfully")
        return response.json().get("results", [])
    else:
        print(f"⚠️ [Hindsight API] Recall failed: {response.status_code}")
        return []

This parallel retrieval is important. Temporal search finds conflicts within a specific time window. Semantic search gets conceptually similar disputes (substance abuse vs. noise are both "lifestyle incompatibility"). Keyword search catches exact matches. Graph search follows relationship links. The results are fused and re-ranked, giving you a much richer signal than any single retrieval method alone.
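Hindsight does the fusion and re-ranking server-side, but the idea is easy to illustrate. Here's a client-side sketch using reciprocal rank fusion—my choice of fusion method, not necessarily what Hindsight uses internally:

```python
from collections import defaultdict
from typing import Dict, List

def fuse_results(strategy_results: Dict[str, List[str]], k: int = 60) -> List[str]:
    """Reciprocal rank fusion: items ranked highly by several
    strategies float to the top of the merged list."""
    scores = defaultdict(float)
    for ranking in strategy_results.values():
        for rank, item in enumerate(ranking):
            scores[item] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

A conflict that shows up in three strategies' top results outranks one that a single strategy ranked first, which is exactly the "richer signal" effect described above.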

The Memory Bank: Configuration as Code

Hindsight treats agent personality as first-class configuration. When I initialize the system, I set up what the agent believes and values:

mission: >
  I am a fair, empathetic roommate conflict mediator. My role is to
  understand underlying patterns and root causes of conflicts, not just
  surface complaints.

directives:
  - "Always base recommendations on historical evidence"
  - "Never take sides; remain neutral and objective"
  - "Focus on identifying root causes, not symptoms"
  - "Consider each roommate's stress levels and work schedule"
  - "Prioritize sustainable solutions over quick fixes"

disposition:
  empathy: 4        # high empathy
  objectivity: 5    # very objective
  skepticism: 3     # moderate skepticism
  assertiveness: 4  # direct but not pushy
  patience: 5       # very patient

When the agent calls reflect() to generate a recommendation, it uses this configuration. It's not just reasoning over facts—it's reasoning through a specific personality lens. The same conflict data generates different recommendations depending on disposition. A highly skeptical agent might recommend "gather more data before intervening." An empathetic one might recommend "initiate a conversation to understand underlying stress."
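The reflect() wrapper itself is thin—the personality lives in the memory bank configuration, so the call just poses a question. As with retain, the endpoint path and field names here are assumptions mirroring the recall call:

```python
import requests

def reflect(self, person_a: str, person_b: str) -> str:
    """Ask the agent to reason over its memories, through its configured
    disposition, and return a recommendation."""
    # Endpoint path and "query"/"answer" field names are assumed
    response = requests.post(
        f"{self.api_url}/memory_banks/{self.memory_bank_id}/reflect",
        headers={
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json",
        },
        json={"query": f"How should the recurring conflicts between "
                       f"{person_a} and {person_b} be mediated?"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json().get("answer", "")
```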

Real Data Flow: What Actually Happens

Let's trace a concrete example. Alice logs a conflict:

Person A: Alice
Person B: Bob
Topic: dishes
Severity: medium
Description: "Bob leaves dirty dishes in sink for days without washing"
Context: "Alice finds them every morning before work"

The system:

  1. Retains: Stores this as a world_fact in Hindsight with temporal context (Monday, 9:15 AM).

  2. Analyzes locally: Checks if this is part of a cluster. Is it the third dishes conflict this week? Does it correlate with Bob's work schedule?

  3. Recalls related patterns: Queries Hindsight for:

    • Past conflicts between Alice and Bob (temporal strategy)
    • Conflicts about responsibility/chores (semantic strategy)
    • Other "dishes" disputes (keyword strategy)
    • Who else has conflicts with Bob around household tasks (graph strategy)
  4. Reflects with personality: The agent reviews the recall results through its configured disposition:

    • "The pattern shows Bob tends to let dishes pile up after 7 PM shifts"
    • "This has happened 7 times in the past month"
    • "Alice is consistently the one who gets frustrated"
    • "Underlying cause is likely work exhaustion, not indifference"
    • Recommendation: "Suggest batch dishwashing after meals rather than nightly. Bob's evening energy is low—reframe as logistics problem, not character flaw."

The recommendation isn't a generic template. It's grounded in actual observed patterns about these two specific people.
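Step 2's cluster check doesn't need anything fancy. Here's a sketch of the kind of counting ConflictAnalyzer can do—illustrative only; the real analyzer also looks at trigger correlations and trends:

```python
from datetime import datetime, timedelta

class ConflictAnalyzer:
    """Local pattern detection over logged conflicts."""

    def __init__(self):
        self.conflicts = []

    def add_conflict(self, conflict):
        self.conflicts.append(conflict)

    def topic_cluster_count(self, topic: str, days: int = 7) -> int:
        """How many conflicts on this topic in the last `days` days?"""
        cutoff = datetime.now() - timedelta(days=days)
        return sum(
            1 for c in self.conflicts
            if c.topic == topic and c.timestamp >= cutoff
        )
```

If topic_cluster_count("dishes") comes back at 3 for the week, that's the signal to escalate from "log it" to "recommend an intervention."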

Why This Matters: The Production Challenge

Here's where I hit reality. I built this initially with mock Hindsight—everything was a local dictionary. The system worked fine for demos. But taking it to production required actually integrating with the real Hindsight API.

That meant:

  • Actual HTTP calls instead of function stubs
  • Bearer token authentication
  • Timeout handling and retry logic
  • Error handling when the API returns "not found" for a memory bank or conflict
  • Thinking carefully about payload structure for multi-strategy retrieval

The transition forced me to think about resilience. What happens if Hindsight is slow? I added timeouts (10-30 seconds depending on operation). What if the memory bank ID is wrong? Clear error messages. What if the API key expired? Explicit logging.
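The timeout-and-retry pattern is worth showing concretely. This is a generic wrapper I'd put around the Hindsight calls—a sketch, with retry count and backoff to be tuned to your latency budget:

```python
import time
import requests

def post_with_retry(url, *, headers, payload, timeout=15, retries=3):
    """POST with exponential backoff for transient failures."""
    for attempt in range(retries):
        try:
            response = requests.post(url, headers=headers, json=payload,
                                     timeout=timeout)
            if response.status_code < 500:
                return response  # success, or a client error worth surfacing
        except requests.RequestException:
            pass                 # timeout / connection error: retry
        time.sleep(2 ** attempt)  # back off: 1s, 2s, 4s, ...
    raise RuntimeError(f"Hindsight API unreachable after {retries} attempts")
```

Note that 4xx errors are returned, not retried—an expired API key or wrong memory bank ID won't fix itself, so the caller should see it immediately.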

But more importantly, it made me realize that persistent agent memory isn't just a nice-to-have—it's essential infrastructure for agents that need to learn. Without it, you're forever living in the present tense. With it, you have continuity. Patterns emerge. Learning happens.

Lessons from Building This

1. Don't architect your agent to be stateless unless you have no other choice.

The fantasy of "pure stateless reasoning" sounds elegant until you realize you're re-discovering the same insights every inference. Conflict mediation is inherently temporal and pattern-based. An agent that forgets can't mediate.

2. Multiple retrieval strategies beat single search.

When I first built this, I just did semantic search—"find conflicts conceptually similar to this one." It missed temporal clustering (conflicts that spike on Mondays), relationship patterns (Bob's conflicts cluster with specific people), and exact matches. Using multi-strategy retrieval forced me to think about what "relevant" actually means in context.

3. Configuration matters as much as code.

Swapping the agent's disposition from "high empathy, high patience" to "high skepticism, low patience" changes the recommendations. This is a feature, not a bug. But it means you need to think about personality as carefully as you think about retrieval logic. I stored this as structured config rather than hidden in prompts, which made it auditable and testable.

4. Mock implementations are fine—until they're not.

I shipped the system working with mock Hindsight. That was good for getting the architecture right, testing the CLI and GUI, and understanding the data flow. But there's a hard threshold where mocks break down. Once you care about actually learning from real data, you need the real thing.

5. The problem isn't LLM reasoning—it's continuity.

I spent way more time worrying about whether the language model would make good recommendations than I should have. The hard part is actually pattern detection and memory management. Once Hindsight is holding the facts and retrieving them intelligently, the reasoning part is straightforward.

What's Next

The system currently handles conflict logging and recommendation—the core mediation loop. Where it gets interesting is building on that foundation. Temporal analysis could trigger proactive interventions: "Hey, the past three Mondays have had conflicts. Want to set up a preventive conversation?" Graph analysis could identify household-level dynamics: "Person C is the common denominator in 60% of conflicts—they might be the source of stress everyone else is reacting to."
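The proactive-intervention idea needs nothing beyond the timestamps already in memory. A tiny sketch of the Monday-spike check:

```python
from collections import Counter
from datetime import datetime
from typing import List

def weekday_hotspots(timestamps: List[datetime], threshold: int = 3) -> List[str]:
    """Flag weekdays where conflicts cluster, e.g. to suggest a
    preventive conversation before the next one lands."""
    counts = Counter(ts.strftime("%A") for ts in timestamps)
    return [day for day, n in counts.items() if n >= threshold]
```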

But right now, I'm focused on one insight: an AI system that actually learns needs actual memory. Not just a database. Not just context injection. A proper persistent memory layer that the agent can retain facts into, recall patterns from, and reflect upon with personality. Hindsight makes that possible without building it from scratch.

If you're building any system where your agent needs to learn over time, this is the real problem to solve. Everything else is optimization.
