
Daniel Romitelli

Posted on • Originally published at craftedbydaniel.com

The Day My AI Forgot Everything (So I Built a Context-Continuity Inference Stack)

The hardest failure mode I’ve seen in enterprise AI systems isn’t hallucination. It’s amnesia.

Not “the model wasn’t smart enough.” Not “prompting is hard.” Something more mundane and more expensive: continuity broke, context evaporated, and a human had to become the database.

That realization is why this series exists.

The emotional thesis (and the part nobody wants to admit)

I build enterprise AI systems. The kind that sit in the middle of real workflows—Outlook email intake, CRM records, enrichment, validation, search, voice, Teams. They’re deployed. They have SLAs. They have people waiting on them.

A session resets, a new conversation starts, and suddenly the assistant that was deep in the weeds yesterday is back to:

  • “Can you share the repo structure?”
  • “What’s the architecture?”
  • “What did we decide about X?”

So the user does what users always do when the system won’t remember: they patch it with labor. They re-explain. They paste. They screenshot. They reconstruct the world.

That’s not a model problem.

That’s an architecture problem.

The key insight (it shows up as a rule, not a feature)

The non-obvious part is that “memory” isn’t a single thing.

If you treat it like a chat feature—some extra tokens, some summary, a longer thread—you’ll still lose. Because the real enemy isn’t forgetfulness inside one conversation.

It’s resets between conversations.

I eventually wrote the system down as a diagram:

flowchart LR
  S1["Session 1<br/>Full context<br/>50+ tasks"] -->|RESET| lost["All context<br/>LOST"]
  lost -->|"New session"| S2["Session 2<br/>Zero context<br/>Re-ask everything"]

Once I saw it that way, the engineering decision became obvious: I needed a continuity architecture that survives resets.

Not “better prompts.” Not more clever agents. A system that anchors the truth somewhere outside the conversation.

That’s where my context-continuity inference stack came from.

In my repo it’s documented as an explicit system (see docs/context_continuity_system.md) and operationalized through a Context API (see CONTEXT_API_GUIDE.md, plus the “store new context” snippet living in the project configuration as part of the mandatory session startup protocol).

The context-continuity inference stack: persistent memory as infrastructure

This stack is my session continuity architecture: a multi-layer design that preserves assistant context across sessions so you don’t get the “blank slate” problem.

In production I treat it as defensive engineering. Continuity fails in messy ways:

  • a chat thread gets too long
  • someone starts a new session
  • an agent tool crashes mid-step
  • a deployment rolls
  • a user switches devices
  • a Teams conversation splits
  • a background job retries and forks state

So I built layers that degrade cleanly:

  1. Database (authoritative store of structured context)
  2. Context API (deterministic read/write interface)
  3. Helper scripts (bulk export/import, backfills, validation)
  4. Progress files (cheap “current state” snapshots that survive restarts)
  5. Session handoffs (the boot sequence that restores context at the start of work)

Here’s the shape of that dataflow.

flowchart TD
  subgraph persistenceLayers[Context-Continuity Inference Stack – Persistence Layers]
    database[(PostgreSQL + pgvector)] --> contextApi[Context API]
    contextApi --> scripts[Helper Scripts]
    scripts --> progressFiles[Progress Files]
    progressFiles --> sessionHandoffs[Session Handoffs]
  end
  sessionReset[Session Reset / New Chat] --> contextApi
  contextApi --> restoredContext[Restored System Context]

The analogy I use once—and only once—is this: this stack is a ship’s log, not a conversation. Conversations are weather. The log is navigation.

What changed when I stopped treating memory like chat-state

Before I built this, I kept trying to “make the assistant remember” by inflating the prompt. Bigger system messages. Thread summaries. Carefully worded reminders.

It worked in demos.

It failed in week three.

Because the reset isn’t a rare edge case. It’s the default state of real usage:

  • people jump between tasks
  • the model context window fills
  • tool outputs blow up token budgets
  • coworkers continue the work later
  • threads fork (“can you also…”) until nobody knows what the mainline is

So I inverted the responsibility:

  • The model is not the memory.
  • The system is the memory.
  • The model is a compute layer that queries memory.

That framing is why the first thing I shipped wasn’t “an agent.”

It was an API.

The Context API: one deterministic interface that restores the world

I didn’t want “memory” to be a vibe. I wanted a deterministic interface that could restore context fast.

So I shipped a Context API backed by PostgreSQL that stores structured context and lets future sessions retrieve it.

The operational instruction—written directly into the way we work—is blunt:

Don’t read everything—search first.

That rule exists because the failure mode isn’t missing data—it’s flooding the model with irrelevant data and then acting surprised when it drifts.

Canonical endpoints (the ones I actually use)

The read path is a search endpoint:

  • GET /api/v1/knowledge/search?query=...

And the write path is a structured upsert:

  • POST /api/v1/knowledge/context

That contract is what makes session boot predictable.

A “good memory system” is not one that stores a lot.

It’s one you can program against without negotiating with it.

Secure curl examples (no secrets in the article)

# Required
API_KEY="${API_KEY:?Set API_KEY in your environment}"
BASE="https://<CONTEXT_API_HOST>/api/v1/knowledge"

# 1) SEARCH - fastest way to restore relevant context
curl -sS -H "X-API-Key: $API_KEY" \
  "$BASE/search?query=vault" | jq

And storing new context (same shape I standardized and documented in CONTEXT_API_GUIDE.md):

API_KEY="${API_KEY:?Set API_KEY in your environment}"
BASE="https://<CONTEXT_API_HOST>/api/v1/knowledge"

curl -sS -X POST \
  -H "X-API-Key: $API_KEY" \
  -H "Content-Type: application/json" \
  "$BASE/context" \
  -d '{
    "feature_name": "infrastructure",
    "context_type": "reference",
    "context_key": "azure-resource-topology",
    "context_data": {
      "title": "Azure resource topology",
      "content": "Container App → Context API → Postgres; search-first boot; progress snapshots."
    }
  }' | jq

Two rules are doing most of the work here:

  • context_type is not decoration; it’s a retrieval lever.
  • context_key is the stable address that lets me update and re-use context without creating duplicates.

When you’re trying to resume work, “everything” is the enemy.

Context types: how I keep retrieval surgical

This stack only works if stored context stays structured. Otherwise you build a junk drawer.

In docs/context_continuity_system.md, I codified the types we actually store:

  • implementation_plan — approved strategies and phased plans
  • technical_decision — architectural choices with rationale
  • code_pattern — correct/incorrect examples with explanation
  • user_feedback — corrections from users and iteration history
  • reference — static documentation (topology, configs, runbooks)

This is what turns “memory” from a chat transcript into an operational substrate.

A typical workflow creates a few durable artifacts:

  • an implementation plan keyed by a feature or epic
  • a small set of technical decisions keyed by decision name
  • a handful of code patterns keyed by “what to do” and “what not to do”
  • user feedback keyed by “what changed in the business rule”

When a new session starts, the assistant doesn’t beg the model to remember. It runs a repeatable boot:

  1. Search for the feature.
  2. Pull the latest implementation plan + decisions.
  3. Pull any “do/don’t” code patterns.
  4. Pull the latest user feedback.
  5. Start work.

The key here is that each step produces a bounded payload. You’re never rebuilding the entire world; you’re pulling the handful of artifacts that constrain the next action.
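As a sketch of that boot sequence (the `search` callable and `BOOT_TYPES` names are mine, not the repo's actual client), the loop reduces to one bounded query per context type:

```python
# Illustrative session-boot loop: one bounded query per context type.
# `search` is any callable that returns results shaped like the Context API's
# /search response; in production it would wrap the HTTP call.
from typing import Any, Callable, Dict, List

BOOT_TYPES = ["implementation_plan", "technical_decision", "code_pattern", "user_feedback"]

def boot_context(
    feature: str,
    search: Callable[..., List[Dict[str, Any]]],
) -> Dict[str, List[Dict[str, Any]]]:
    """Run the repeatable boot: restore a bounded slice of context per type."""
    restored: Dict[str, List[Dict[str, Any]]] = {}
    for ctx_type in BOOT_TYPES:
        hits = search(query=feature, context_type=ctx_type)
        restored[ctx_type] = hits[:3]  # bounded payload, never "everything"
    return restored
```

Injecting the search function keeps the boot logic testable without a live API, which matters when the boot is the one step everything else depends on.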

Minimal schema: the table that makes this boring (in a good way)

Here’s the core schema I use for stored context. It’s intentionally plain: types and keys first, JSON payload for flexibility, and optional vector + full-text indexing for retrieval.

This SQL runs as-is on PostgreSQL.

-- context_items: authoritative store for the context-continuity stack
-- Requires PostgreSQL 12+ (generated columns); pgvector is optional for embeddings.

CREATE TABLE IF NOT EXISTS context_items (
  id              BIGSERIAL PRIMARY KEY,
  feature_name    TEXT NOT NULL,
  context_type    TEXT NOT NULL,
  context_key     TEXT NOT NULL,
  context_data    JSONB NOT NULL,
  content_text    TEXT GENERATED ALWAYS AS (
    COALESCE(context_data->>'title','') || E'\n' || COALESCE(context_data->>'content','')
  ) STORED,
  created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
  updated_at      TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Prevent duplicates: stable address per feature/type/key
CREATE UNIQUE INDEX IF NOT EXISTS ux_context_items
  ON context_items(feature_name, context_type, context_key);

-- Fast filtering
CREATE INDEX IF NOT EXISTS ix_context_items_feature
  ON context_items(feature_name);

CREATE INDEX IF NOT EXISTS ix_context_items_type
  ON context_items(context_type);

-- Full-text search (quick win)
CREATE INDEX IF NOT EXISTS ix_context_items_fts
  ON context_items USING GIN (to_tsvector('english', content_text));

If you add embeddings (I do—semantic search is the difference between “I remember the word” and “I remember the meaning”), you add one column and one index:

-- Optional: semantic search with pgvector
CREATE EXTENSION IF NOT EXISTS vector;

ALTER TABLE context_items
  ADD COLUMN IF NOT EXISTS embedding vector(1536);

CREATE INDEX IF NOT EXISTS ix_context_items_embedding
  ON context_items USING ivfflat (embedding vector_cosine_ops)
  WITH (lists = 100);

The point isn’t the exact dimensionality. The point is that retrieval has two gears:

  • Full-text: fast and predictable for obvious keywords.
  • Vector similarity: resilient when the user’s phrasing changes.

Search strategy: hybrid retrieval that behaves under pressure

The search endpoint is not magic. It’s disciplined ranking.

My strategy is hybrid:

  1. Metadata filters first: feature_name and/or context_type if provided.
  2. Lexical search using to_tsvector / ts_rank_cd for high-precision hits.
  3. Vector search (pgvector cosine similarity) for semantic matches.
  4. Merge + re-rank into a single list with scores and snippets.

This is exactly why I insist on keys and types. If you don’t structure inputs, the “smart” layer has nothing stable to grab.

One subtle design choice: the search response is optimized for decision-making, not for dumping data.

  • you get scores
  • you get snippets
  • you get stable identifiers

Then the client decides whether to fetch the full JSON payload (or to just use the snippet for bootstrapping and pull full payload only for the top 1–3 hits).
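A minimal sketch of the merge step: the weights and max-normalization here are illustrative assumptions, not the production ranking.

```python
# Illustrative hybrid merge: normalize lexical and vector scores to [0, 1],
# then combine with fixed weights. Weights are assumed values for the sketch.
from typing import Dict, List, Tuple

def merge_rank(
    lexical: List[Tuple[int, float]],   # (id, ts_rank_cd score)
    vector: List[Tuple[int, float]],    # (id, cosine similarity)
    w_lex: float = 0.4,
    w_vec: float = 0.6,
) -> List[Tuple[int, float]]:
    """Merge two score lists into a single ranked list of (id, score)."""
    def normalize(hits: List[Tuple[int, float]]) -> Dict[int, float]:
        if not hits:
            return {}
        top = max(s for _, s in hits) or 1.0  # guard all-zero scores
        return {i: s / top for i, s in hits}

    lex, vec = normalize(lexical), normalize(vector)
    combined = {
        i: w_lex * lex.get(i, 0.0) + w_vec * vec.get(i, 0.0)
        for i in set(lex) | set(vec)
    }
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)
```

The useful property is that an item found by both retrievers outranks an item found by only one, which is usually exactly the behavior you want at boot time.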

Example response shape from GET /search

When I say “deterministic interface,” I mean the response is shaped so callers can program against it.

Here’s a representative JSON payload:

{
  "query": "vault",
  "count": 3,
  "results": [
    {
      "id": 1842,
      "feature_name": "infrastructure",
      "context_type": "reference",
      "context_key": "azure-resource-topology",
      "score": 0.92,
      "title": "Azure resource topology",
      "snippet": "Container App → Context API → Postgres; search-first boot; progress snapshots.",
      "updated_at": "2026-02-28T19:11:22Z"
    },
    {
      "id": 1750,
      "feature_name": "vault_chatbot",
      "context_type": "technical_decision",
      "context_key": "search-ranking-hybrid",
      "score": 0.87,
      "title": "Hybrid search ranking",
      "snippet": "Combine full-text rank and vector similarity; filter by type; store durable keys.",
      "updated_at": "2026-02-20T03:44:10Z"
    },
    {
      "id": 1603,
      "feature_name": "voice",
      "context_type": "implementation_plan",
      "context_key": "phase-3-streaming",
      "score": 0.81,
      "title": "Phase 3: voice streaming",
      "snippet": "SignalR streaming plan; failure modes; retry + idempotency notes.",
      "updated_at": "2026-01-19T16:03:55Z"
    }
  ]
}

A few details are non-negotiable:

  • Stable identifiers (feature_name, context_type, context_key) so the client can request exactly what it needs next.
  • A snippet so humans can sanity-check the hit before pulling full payloads.
  • A score so the boot sequence can implement rules like “top 5 above 0.75.”
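Client-side, a rule like that is only a few lines. The threshold and cap below are the illustrative values from the rule above, not fixed constants from the repo:

```python
# The "top 5 above 0.75" boot rule, applied client-side to /search results.
from typing import Any, Dict, List

def select_boot_hits(
    results: List[Dict[str, Any]],
    threshold: float = 0.75,
    cap: int = 5,
) -> List[Dict[str, Any]]:
    """Keep at most `cap` hits whose score clears `threshold`, best first."""
    strong = [r for r in results if r["score"] >= threshold]
    strong.sort(key=lambda r: r["score"], reverse=True)
    return strong[:cap]
```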

Minimal runnable API implementation (FastAPI)

My production implementation is split across route handlers and service modules (including dedicated search logic under app/services/, and operational wiring under app/api/). But the pattern is simple enough to show as a complete, runnable slice.

This example runs in isolation:

  • pip install fastapi uvicorn sqlalchemy psycopg2-binary pydantic
  • set DATABASE_URL (securely)
  • set CONTEXT_API_KEY
  • uvicorn app:app --reload

from __future__ import annotations

import json
import os
from typing import Any, Dict, List, Optional

from fastapi import FastAPI, Header, HTTPException, Query
from pydantic import BaseModel, Field
from sqlalchemy import create_engine, text

# Do not embed credentials in code. Provide a real connection string via environment.
# Example (set in your shell/secret manager, not in the repo):
#   export DATABASE_URL="postgresql+psycopg2://<user>:<password>@<host>:5432/<db>"
DATABASE_URL = os.environ.get("DATABASE_URL")
if not DATABASE_URL:
    raise RuntimeError("DATABASE_URL must be set (do not hard-code credentials in source)")

API_KEY = os.environ.get("CONTEXT_API_KEY")
if not API_KEY:
    raise RuntimeError("CONTEXT_API_KEY must be set")

engine = create_engine(DATABASE_URL, pool_pre_ping=True)
app = FastAPI(title="Context API", version="1.0")


class ContextUpsertRequest(BaseModel):
    feature_name: str = Field(min_length=1)
    context_type: str = Field(min_length=1)
    context_key: str = Field(min_length=1)
    context_data: Dict[str, Any]


class SearchResult(BaseModel):
    id: int
    feature_name: str
    context_type: str
    context_key: str
    score: float
    title: str
    snippet: str
    updated_at: str


class SearchResponse(BaseModel):
    query: str
    count: int
    results: List[SearchResult]


def require_key(x_api_key: Optional[str]) -> None:
    if not x_api_key:
        raise HTTPException(status_code=401, detail="Missing X-API-Key")
    if x_api_key != API_KEY:
        raise HTTPException(status_code=403, detail="Invalid API key")


@app.get("/api/v1/knowledge/search", response_model=SearchResponse)
def search(
    query: str = Query(..., min_length=1),
    feature_name: Optional[str] = None,
    context_type: Optional[str] = None,
    x_api_key: Optional[str] = Header(default=None, alias="X-API-Key"),
):
    require_key(x_api_key)

    # Deterministic lexical search using PostgreSQL full-text.
    # In production, I merge this with vector results and re-rank.
    where = ["to_tsvector('english', content_text) @@ plainto_tsquery('english', :q)"]
    params: Dict[str, Any] = {"q": query}

    if feature_name:
        where.append("feature_name = :feature_name")
        params["feature_name"] = feature_name
    if context_type:
        where.append("context_type = :context_type")
        params["context_type"] = context_type

    sql = text(
        f"""
        SELECT
          id,
          feature_name,
          context_type,
          context_key,
          ts_rank_cd(to_tsvector('english', content_text), plainto_tsquery('english', :q)) AS score,
          COALESCE(context_data->>'title','') AS title,
          left(COALESCE(context_data->>'content',''), 180) AS snippet,
          updated_at
        FROM context_items
        WHERE {' AND '.join(where)}
        ORDER BY score DESC, updated_at DESC
        LIMIT 10;
        """
    )

    with engine.begin() as conn:
        rows = conn.execute(sql, params).mappings().all()

    results = [
        {
            "id": int(r["id"]),
            "feature_name": r["feature_name"],
            "context_type": r["context_type"],
            "context_key": r["context_key"],
            "score": float(r["score"] or 0.0),
            "title": r["title"],
            "snippet": r["snippet"],
            "updated_at": r["updated_at"].isoformat(),
        }
        for r in rows
    ]

    return {"query": query, "count": len(results), "results": results}


@app.post("/api/v1/knowledge/context")
def upsert_context(
    body: ContextUpsertRequest,
    x_api_key: Optional[str] = Header(default=None, alias="X-API-Key"),
):
    require_key(x_api_key)

    sql = text(
        """
        INSERT INTO context_items (feature_name, context_type, context_key, context_data)
        VALUES (:feature_name, :context_type, :context_key, CAST(:context_data AS jsonb))
        ON CONFLICT (feature_name, context_type, context_key)
        DO UPDATE SET
          context_data = EXCLUDED.context_data,
          updated_at = now()
        RETURNING id;
        """
    )

    with engine.begin() as conn:
        row = conn.execute(
            sql,
            {
                "feature_name": body.feature_name,
                "context_type": body.context_type,
                "context_key": body.context_key,
                "context_data": json.dumps(body.context_data),
            },
        ).mappings().one()

    return {"ok": True, "id": int(row["id"])}

That’s the heart of it: stable keys, structured JSON, and a search endpoint that can be called as the first step of every session.

A small implementation note that matters in real teams: the upsert pattern is not just convenience. It’s what makes context updates idempotent. If a background job retries, or two sessions attempt to store the same decision, you don’t spawn duplicates—you converge on one address.

The rule that made it stick: “mandatory first action”

I learned the hard way that a continuity system only works if it’s used before you need it.

So I wrote operational guidance directly into the team’s working docs and tooling (including the session startup checklist in the project configuration):

  • the assistant’s chat memory is unreliable
  • the database + progress snapshots are authoritative
  • always resume from system state

That became muscle memory: when a session starts, you don’t ask the model to remember—you ask the system to restore.

This is the engineer’s job: turning a best practice into a default.

If you’re building this into an agent loop, the simplest enforcement mechanism is also the most effective one: make the first tool call mandatory. No “thinking” step happens until the search step happens.
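One way to enforce that, sketched below. `ContextGate` and the tool names are hypothetical, not from the repo:

```python
# Illustrative enforcement: gate every tool behind one mandatory first call.
from typing import Any, Callable, Dict

class ContextGate:
    """Refuses any tool call until the context search has run once."""

    def __init__(self, tools: Dict[str, Callable[..., Any]]):
        self._tools = tools
        self._booted = False

    def call(self, name: str, **kwargs: Any) -> Any:
        if name != "context_search" and not self._booted:
            raise RuntimeError("context_search must run before any other tool")
        result = self._tools[name](**kwargs)
        if name == "context_search":
            self._booted = True
        return result
```

The point of putting the rule in code rather than in a prompt is that prompts degrade under pressure; a raised exception does not.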

Progress files and session handoffs: the glue people forget to build

APIs are great, but day-to-day development has a more boring need: “What was I doing when I stopped?”

So I keep progress snapshots alongside the code and automation. The pattern is simple:

  • A small JSON/YAML file that captures current stage, last successful step, open decisions, and pointers to relevant context keys.
  • Updated at the end of any meaningful work session.
  • Read at the beginning of the next session to drive the boot search.

Here’s a representative shape (this is the kind of file that keeps a feature from becoming a weekly re-explanation ritual):

{
  "feature": "voice",
  "stage": "phase-3-streaming",
  "last_success": "websocket-prototype-running",
  "open_questions": [
    "Do we need server-side VAD or client-side only?",
    "What is our retry/backoff policy for dropped streams?"
  ],
  "context_pointers": [
    {"feature_name": "voice", "context_type": "implementation_plan", "context_key": "phase-3-streaming"},
    {"feature_name": "voice", "context_type": "technical_decision", "context_key": "stream-transport-signalr"},
    {"feature_name": "infrastructure", "context_type": "reference", "context_key": "azure-resource-topology"}
  ]
}

This is also the layer that survives outages. If the API is down, the progress snapshot still tells you what to restore once it’s back.

It’s not glamorous. It’s the difference between “we lost a day” and “we lost five minutes.”
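Turning a snapshot into a boot plan is mechanical. Here's a sketch that reads the shape shown above (the function name is mine):

```python
# Read a progress snapshot and turn its context_pointers into the exact
# (feature_name, context_type, context_key) fetches the boot should run.
import json
from typing import Any, Dict, List, Tuple

def boot_plan_from_snapshot(raw: str) -> List[Tuple[str, str, str]]:
    """Return the context triples a new session should restore first."""
    snapshot: Dict[str, Any] = json.loads(raw)
    return [
        (p["feature_name"], p["context_type"], p["context_key"])
        for p in snapshot.get("context_pointers", [])
    ]
```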

What went wrong (the wasted work that triggered the build)

Before this stack, resets created a predictable failure cascade:

  • a session would end mid-implementation
  • the next session would start with missing assumptions
  • the assistant would re-derive decisions that had already been made
  • humans would patch the gap with screenshots, copy/paste, and “here’s the context again”

That’s when I wrote down the rule that governs whether an assistant can own a workflow:

If your assistant can’t resume state, it can’t be trusted to own a workflow.

Trust isn’t about being right once. It’s about being consistent over time.

In the enterprise recruitment platform I’m building, this shows up everywhere:

  • intake pipelines where an email thread forks into multiple candidate records
  • CRM updates that must be replayable when a step fails
  • enrichment workers that retry on transient vendor errors
  • Teams experiences where different users arrive with different assumptions and urgency

A model can generate a plausible answer in any one of those moments.

A system has to carry the timeline.

Why a model wouldn’t build this for itself

This is the central tension of the whole series.

Models are extremely good at generating output inside the frame you give them. They can write code. They can propose designs. They can explain tradeoffs.

But they don’t feel the cost of wasted continuity.

They don’t experience the slow bleed of “re-explain the repo” across weeks.

They don’t wake up to a production platform where the hardest part isn’t generating a response—it’s keeping the system coherent across time.

I built this stack because I recognized that the hardest problem wasn’t capability.

It was continuity.

And continuity is architecture.

Performance numbers that mean something (and how I measured them)

I previously wrote down a single number—136ms—and that’s not good enough without the measurement story.

Here’s the actual measurement I use when I talk about latency for the Context API:

  • Metric reported: p50 latency for GET /api/v1/knowledge/search?query=...
  • Result: p50 = 136ms, p95 = 412ms, p99 = 861ms
  • SLA target: < 3000ms p95 for search during session boot (boot is a prerequisite; it has to be dependable more than it has to be fast)

Environment (the one that produced those numbers):

  • Cloud: Azure
  • Region: East US
  • Compute: Azure Container Apps (single active revision), 0.5 vCPU / 1GiB memory per replica, autoscaling enabled
  • Database: Azure Database for PostgreSQL (General Purpose), single instance, same region

Workload:

  • Query pattern: 1–3 keywords (e.g., vault, voice streaming, SignalR)
  • Result set: top 10
  • Data size at the time: ~8k context rows across features, with ~1–4KB context_data payloads on average
  • Concurrency: 10 virtual users, steady-state
  • Warm cache (typical after first request in a work session)

Measurement method:

  • Synthetic load test using k6, 5-minute run after a 1-minute warmup
  • Latency measured at the HTTP client, not just server-side timing

Those numbers aren’t a trophy; they’re a sanity check. The system’s job is to restore state consistently, under normal team usage, without becoming a new bottleneck.

One more operational detail: I care about tail latency here because session boot is serialized. If boot takes 5 seconds, it doesn’t matter that the model can draft an email in 300ms—you’ve already broken the flow.

Nuances: continuity doesn’t mean “store everything”

There’s a trap here: if you hear “persistent memory” and think “log every message,” you’ll build a system that’s technically impressive and operationally useless.

This stack is structured context:

  • implementation plans
  • technical decisions
  • code patterns
  • user feedback
  • references

It’s the stuff that keeps future work aligned.

The filtering rule I follow is simple: if it changes what we would do next week, it’s context. If it’s just narration, it’s noise.

That’s also why I store both a type and a key. It’s what keeps updates clean:

  • If a decision changes, I overwrite the same (feature_name, context_type, context_key) row.
  • If a decision forks, I mint a new key and keep both.

This is how you preserve history without turning retrieval into archaeology.

The series frame

This is post 0 of 13 because everything else I’m going to show sits downstream of this insight.

Every post in this series is about a decision I made that a model wouldn’t make on its own—not because the model is bad, but because the model doesn’t pay the continuity tax.

This is Post 0 of a 13-part series called “How to Architect an Enterprise AI System (And Why the Engineer Still Matters).” Every post is a real decision from a production system—55+ Azure resources, a LangGraph orchestration layer, a six-tier enrichment cascade, a $16/month Redis instance that outperforms expectations, and a recruiting platform that processes thousands of emails a month. Post 1 starts where every enterprise pipeline starts: the email that breaks it.


🎧 Listen to the Enterprise AI Architecture audiobook
📖 Read the full 13-part series with an AI assistant
