DEV Community

VIJAY SAMANT

I Stopped Treating Coaching as a Chat Problem and Built a Memory System Instead

The hard part of building a coaching system was not generating advice. The hard part was deciding what deserved to be remembered.

I built ArenaMind Royale because most game coaching tools, and honestly most LLM products, have the same failure mode. They reset every time you ask a new question. A player loses to Hog cycle three nights in a row, writes the same note after every loss, switches decks twice, and the system still answers like it has never seen them before.

That is not a reasoning problem. It is a memory problem.

What the system does

ArenaMind Royale is a Clash Royale coaching system that pulls live player data, summarizes recent battles, stores self-reported notes, and uses long-term memory to produce coaching plans that actually compound over time.

The stack is deliberately plain:

  • FastAPI for the backend
  • SQLite for local note storage and sync bookkeeping
  • Clash Royale API for player and battle history
  • A small Jinja + vanilla JavaScript frontend for the dashboard
  • Hindsight for long-term memory, reflection, and mental models

The main user flow is simple.

A player enters a tag. The backend fetches their profile and battle log. I summarize those battles into things that are actually coachable: primary deck, inferred archetype, matchup record, recurring trouble cards, and a few defensive patterns that show up in wins. Then I merge in manual notes the player wrote after matches, things like “I kept overcommitting Balloon into a building” or “I defended Lavaloon better when I held Musketeer for second phase.”

That blended view becomes the basis for memory. Some of it is stored locally. The durable coaching context is retained in Hindsight agent memory, and later queried through reflect calls that return a structured coaching plan.

The frontend is intentionally not clever. It is a dashboard, not a second brain. The backend does the synthesis. The client just lets a player load their history, browse matchup trends, save notes after a game, and ask for a plan.

The core technical story: memory only works if you are selective

The design choice that mattered most was this: I refused to store raw conversation history as “memory.”

That sounds obvious, but a lot of agent systems still do it. They dump prompts, replies, logs, and user text into one giant context bucket and hope retrieval makes it coherent later. That approach falls apart fast in coaching. Match histories are repetitive. Self-notes are noisy. Many events matter for one session and then should disappear. A coaching system that remembers everything usually remembers the wrong things.

So I built ArenaMind around selective retention.

Every time I load a player, I compute a condensed battle snapshot from recent matches. That snapshot is not just “what happened.” It is a reduction of what is likely to matter in the future: deck identity, matchup trends, recurrent losses, and any note patterns. Then I hash that snapshot and write the hash to a sync log before retaining anything.
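The hashing step can be sketched like this. This is a hypothetical reconstruction of `snapshot_hash`, assuming the snapshot reduces to a JSON-serializable dict; the real field selection may differ:

```python
import hashlib
import json

def snapshot_hash(player_tag: str, profile: dict, battles: list[dict]) -> str:
    """Hypothetical sketch: derive a stable digest for a condensed battle snapshot.

    Only fields that should trigger a new memory are included, and keys are
    sorted so the same underlying state always hashes to the same value.
    """
    condensed = {
        "tag": player_tag,
        "name": profile.get("name"),
        "battles": [
            {"result": b.get("result"), "opponent": b.get("opponent"), "deck": b.get("deck")}
            for b in battles
        ],
    }
    payload = json.dumps(condensed, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

The point of canonicalizing (sorted keys, fixed separators) is that "nothing changed" and "same hash" become the same statement, which is what makes the dedupe check below trustworthy.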

```python
if hindsight_client.enabled:
    await hindsight_client.ensure_bank(player_tag, profile.get("name"))
    doc_hash = snapshot_hash(player_tag, profile, analysis["recent_battles"])
    if db.record_sync(player_tag, doc_hash, "battle-snapshot"):
        retention_text = build_retention_text(profile, analysis, notes)
        await hindsight_client.retain(
            player_tag,
            content=retention_text,
            document_id=f"snapshot-{doc_hash}",
            tags=[
                f"player:{player_tag}",
                f"player_name:{profile.get('name', '').lower()}",
                f"archetype:{analysis.get('primary_archetype', 'unknown').lower().replace(' ', '-')}",
                "source:battlelog",
                "source:notes",
            ],
        )
```

That one block captures the real posture of the system. I do not retain on every request. I retain only when the underlying state changes enough to justify a new memory. That saves cost, cuts duplication, and, more importantly, prevents long-term memory from drifting into a pile of near-identical battle snapshots.

The dedupe mechanism is tiny and boring, which is why I like it. SQLite keeps a sync_log table keyed by (player_tag, sync_hash, source). If the same derived snapshot shows up twice, the second write is ignored.

```sql
CREATE TABLE IF NOT EXISTS sync_log (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    player_tag TEXT NOT NULL,
    sync_hash TEXT NOT NULL,
    source TEXT NOT NULL,
    created_at TEXT NOT NULL,
    UNIQUE(player_tag, sync_hash, source)
);
```
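The write path on top of that table can be as small as an `INSERT OR IGNORE`. This is a hypothetical sketch of the `record_sync` helper, assuming the schema above; the real implementation may be wrapped differently:

```python
import sqlite3
from datetime import datetime, timezone

def record_sync(conn: sqlite3.Connection, player_tag: str, sync_hash: str, source: str) -> bool:
    """Hypothetical sketch: return True only if this (tag, hash, source) is new.

    The UNIQUE constraint turns the second identical write into a no-op,
    so the caller can skip memory retention when nothing has changed.
    """
    cur = conn.execute(
        "INSERT OR IGNORE INTO sync_log (player_tag, sync_hash, source, created_at) "
        "VALUES (?, ?, ?, ?)",
        (player_tag, sync_hash, source, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()
    return cur.rowcount == 1  # rowcount is 0 when the row already existed
```

Returning a boolean keeps the call site readable: `if db.record_sync(...)` reads as "only retain if this snapshot is new."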

I have become mildly opinionated about this pattern. If you are building any kind of agent system with durable context, you should have an explicit dedupe layer before memory retention. Do not ask your vector store or retrieval step to fix write amplification that your application caused in the first place.

What I actually chose to remember

The summarization step is where the project stopped feeling like “LLM glue code” and started feeling like software.

I do not send raw battle objects into memory. I first map them into a battle summary with fields that reflect gameplay decisions: deck lists, archetypes, result, crowns, opponent identity, and some feature counts like building density, cheap cycle cards, reset cards, and anti-air coverage.
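That mapping step might look roughly like this. The card sets and field names here are illustrative stand-ins, not the real schema, though the `team`/`opponent`/`cards`/`crowns` shape follows the Clash Royale API battle log:

```python
# Hypothetical sketch of reducing a raw battle object to coachable fields.
# BUILDINGS and CHEAP_CYCLE are illustrative subsets, not complete lists.
BUILDINGS = {"Cannon", "Tesla", "Inferno Tower", "Bomb Tower", "Goblin Cage"}
CHEAP_CYCLE = {"Skeletons", "Ice Spirit", "Ice Golem", "The Log"}

def summarize_battle(battle: dict) -> dict:
    my_deck = [card["name"] for card in battle["team"][0]["cards"]]
    opp_deck = [card["name"] for card in battle["opponent"][0]["cards"]]
    my_crowns = battle["team"][0]["crowns"]
    opp_crowns = battle["opponent"][0]["crowns"]
    return {
        "result": "Win" if my_crowns > opp_crowns else "Loss" if my_crowns < opp_crowns else "Draw",
        "crowns": (my_crowns, opp_crowns),
        "my_deck": my_deck,
        "opponent_deck": opp_deck,
        # Feature counts that reflect gameplay decisions, not raw events.
        "building_density": sum(1 for c in opp_deck if c in BUILDINGS),
        "cheap_cycle_count": sum(1 for c in my_deck if c in CHEAP_CYCLE),
    }
```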

Then I aggregate those summaries into a player-level view.

```python
return {
    "record": {
        "wins": record.get("Win", 0),
        "losses": record.get("Loss", 0),
        "draws": record.get("Draw", 0),
        "total": len(summaries),
        "win_rate": round((record.get("Win", 0) / len(summaries)) * 100, 1) if summaries else 0.0,
    },
    "recent_battles": summaries,
    "primary_archetype": my_archetypes.most_common(1)[0][0] if my_archetypes else "Unknown",
    "primary_deck": list(top_deck_tuple),
    "matchups": matchup_rows,
    "trouble_cards": [{"card": card, "count": count} for card, count in recurring_loss_cards.most_common(6)],
    "winning_defenses": [{"pattern": name, "count": count} for name, count in winning_defenses.most_common(4)],
    "manual_mistakes": [{"text": text, "count": count} for text, count in mistake_counter.most_common(8)],
    "manual_defenses": [{"text": text, "count": count} for text, count in defense_counter.most_common(8)],
}
```

This is the part I would not skip if I rebuilt the system from scratch. A lot of “agent memory” discussions jump straight to embeddings and retrieval, but the real leverage came from reducing the world into reusable coaching primitives before any memory API saw it.

That preprocessing makes later reflection much more stable. Instead of asking a model to infer a player’s bad habits from twenty raw battles and six emotional notes, I am asking it to reason over a compressed, already opinionated representation of play.

Why I used Hindsight

I was tired of prompt engineering around a stateless model and went looking for a better way to help the agent remember. Hindsight fit the way I wanted to structure coaching memory: banks with clear missions, retained memories with tags, reflect queries with structured output, and mental models that consolidate recurring patterns.

The bank setup is one of my favorite parts of the code because it forces the system to declare what “good memory” means.

```python
payload = {
    "name": f"ArenaMind Coach - {player_name or player_tag}",
    "mission": "Help a Clash Royale player improve over time through personalized memory and evidence-based coaching.",
    "retain_mission": (
        "Extract useful memories about decks, matchup patterns, recurring mistakes, timings, successful defenses, "
        "counterpush plans, and self-reported habits. Ignore fluff and generic statements."
    ),
    "reflect_mission": (
        "Act as a high-level Clash Royale coach. Give personalized, matchup-aware, concrete strategy advice. "
        "Use long-term memory and current evidence together."
    ),
}
return await self._request("PUT", f"/banks/{bank_id}", payload)
```

That is more than configuration. It is a contract. It tells the memory layer what to retain, what to ignore, and how to behave when asked to synthesize across sessions.

I also create mental models for three specific long-running questions: recurring mistakes, winning defenses, and matchup map. I like this because it avoids the trap of a single undifferentiated “player summary.” Mistakes and successful defenses are not the same kind of knowledge, and they should not be mashed into the same blob.

What coaching looks like in practice

A typical interaction goes like this.

I load a player. The system pulls their profile and recent battles, infers that their most-used deck is a Balloon cycle variant, and shows a weak record into fast building decks. Their recent manual notes say they keep spending support troops too early and then cannot rebuild a defense.

When I ask for coaching with a focus like “I keep losing to Hog Cycle,” the reflect query ties together both time horizons: the fresh battle snapshot and the long-term memory bank.

```python
query = (
    f"Coach Clash Royale player {profile.get('name')} ({player_tag}). "
    f"Current focus: {focus}. "
    f"Use long-term memory plus current battle trends. "
    f"Return an actionable personalized coaching plan with concrete priorities, matchup alerts, defensive patterns, a training block, and deck-adjustment thoughts. "
    f"Recent snapshot: primary archetype {analysis.get('primary_archetype')}, recent record {analysis['record']['wins']}-{analysis['record']['losses']}-{analysis['record']['draws']}, "
    f"top trouble cards {[item['card'] for item in analysis.get('trouble_cards', [])[:5]]}."
)
```

The output is structured, which matters. I do not want a motivational paragraph. I want a headline, priority fixes, matchup alerts, defensive patterns, a training plan, and deck-adjustment thoughts. That shape then drives the UI directly.
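Pinning that shape down in code is what lets the UI consume it directly. The field names below are my reconstruction of the shape described, not the exact schema the system uses:

```python
from dataclasses import dataclass, field

@dataclass
class CoachingPlan:
    """Hypothetical sketch of the structured plan a reflect call maps into."""
    headline: str
    priority_fixes: list[str] = field(default_factory=list)
    matchup_alerts: list[str] = field(default_factory=list)
    defensive_patterns: list[str] = field(default_factory=list)
    training_plan: list[str] = field(default_factory=list)
    deck_adjustments: list[str] = field(default_factory=list)
```

With a typed shape like this, each section of the dashboard renders one field, and a missing section is an empty list rather than a parsing failure.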

The useful behavior here is not “the model answered a question.” The useful behavior is that the answer can reference the player’s repeated self-reported issue, current loss pattern, and historically effective defensive patterns in one pass. In other words, the system can coach with continuity.

And when Hindsight is unavailable, I still return a local coaching plan based on the same analysis pipeline. I like that fallback more than I expected. It forced me to build deterministic coaching logic first instead of using the model as a crutch.
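A deterministic fallback can be derived straight from the analysis dict. This is a hypothetical sketch of the idea, assuming the aggregated fields shown earlier (`trouble_cards`, `winning_defenses`, `matchups`); the matchup row keys are illustrative:

```python
def local_coaching_plan(analysis: dict, focus: str = "") -> dict:
    """Hypothetical sketch: build a coaching plan from the analysis pipeline
    alone, with no model call, so the system still coaches when memory is offline."""
    trouble = [item["card"] for item in analysis.get("trouble_cards", [])[:3]]
    defenses = [item["pattern"] for item in analysis.get("winning_defenses", [])[:2]]
    plan = {
        "headline": f"Local plan for a {analysis.get('primary_archetype', 'Unknown')} player",
        "priority_fixes": [f"Drill answers to {card}" for card in trouble],
        "defensive_patterns": [f"Keep using: {d}" for d in defenses],
        # Flag only the matchups the player is currently losing.
        "matchup_alerts": [
            f"{row['archetype']}: {row['wins']}-{row['losses']}"
            for row in analysis.get("matchups", [])
            if row.get("losses", 0) > row.get("wins", 0)
        ],
    }
    if focus:
        plan["headline"] += f" (focus: {focus})"
    return plan
```

Because every field comes from counted evidence, this path is trivially testable, which is exactly why it makes the model-backed path easier to trust.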

Lessons I would reuse on the next system

First, memory is not a transcript. It is a product decision implemented in code. The useful question is never “what can I store?” It is “what will still matter in two weeks?”

Second, dedupe before retention. Snapshot hashing plus a tiny sync table did more for memory quality than any retrieval tweak.

Third, compress domain events before you ask a model to reason over them. A battle log is data. A matchup map, recurring trouble cards, and stable defensive patterns are coaching state.

Fourth, structure the model output aggressively. Free-form prose is pleasant in demos and annoying in software. Lists of priority fixes and training steps are easier to inspect, render, and test.

Fifth, keep a deterministic fallback. The local coaching path made the whole system easier to debug and easier to trust.

I think that is the main thing I learned from building this. The interesting part of long-term AI systems is not that they can remember. It is that they force you to decide what forgetting should look like.

That is where the engineering starts.
