
Sayantan Bhattacharjee


RecallPal

I Built a Real-Time Face Recognition Memory Aid for Dementia Patients — Here's How the Memory Layer Actually Works

My grandmother stopped recognizing my father about two years before she passed. She knew someone was in the room. She could tell he cared about her. But the name, the relationship, the decades of shared history — gone. That experience sat with me for a long time.

So when I started thinking about what kind of system I actually wanted to build, the answer was straightforward: something that bridges that gap in real time. Not a research prototype. A working system that a caregiver could set up in an afternoon and that would tell a dementia patient — quietly, clearly — who just walked into the room and why they matter.

The result is Dementia Assist: a real-time face recognition system backed by a persistent memory layer that stores not just who someone is, but what you know about them — their relationship to the patient, their interests, how long it's been since their last visit, and a dynamically generated conversation starter to help break the ice.


How the System Hangs Together

The architecture is deliberately simple. A Next.js frontend captures webcam frames once per second and POSTs them as base64 JPEGs to a Flask backend. The backend runs them through a FaceNet embedding model via DeepFace, compares the result against a pickled embedding database, and — if there's a match — queries a memory store for everything we know about that person. The response comes back with a name, a confidence score, relationship context, and a suggestion like "Ask Sarah about her painting class."

The stack:

  • Face recognition: DeepFace + FaceNet (128-dimensional embeddings, Euclidean distance matching)
  • Memory: Hindsight Cloud with a local JSON fallback
  • Backend: Python 3.9, Flask, flask-cors
  • Frontend: Next.js 14, TypeScript, Tailwind CSS
  • Voice: Web Speech API for audio announcements when a face is recognized

The whole thing starts with a single script — ./run.sh on Linux/macOS, run.bat on Windows — which installs dependencies, starts both servers, and opens the browser. I wanted caregivers to be able to run this without touching a terminal again after setup.
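The frame hand-off itself is simple enough to sketch. Here's an illustrative decoder for the base64 payload the frontend POSTs — the function name and data-URL handling are my assumptions, not the project's exact code:

```python
import base64

def decode_frame(data_url: str) -> bytes:
    """Return raw JPEG bytes from either a bare base64 string or a
    'data:image/jpeg;base64,...' data URL sent by the frontend."""
    b64 = data_url.split(",", 1)[-1]  # drop the data-URL prefix if present
    return base64.b64decode(b64)
```

The resulting bytes then go through something like cv2.imdecode to become the NumPy frame the recognition pipeline expects.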


The Core Technical Problem: Recognition Alone Is Not Enough

Early on I made a mistake that I think a lot of people make when building systems like this: I treated face recognition as the hard problem and memory as an afterthought.

Recognition is actually fairly tractable. DeepFace with FaceNet gives you solid 128-d embeddings. You store them in a pickle file, compute Euclidean distances at inference time, set a threshold, done. What I underestimated was what you do with the name once you have it.

A dementia patient doesn't just need to know a name. They need context. Who is this person? When did I last see them? What do we talk about? If you flash "Sarah" on a screen with no surrounding information, you've built a party trick, not an aid.

That's where I knew I needed agent memory — not a database query, but something that could store rich, evolving context about each person and retrieve it intelligently at recognition time.

I looked at a few options and landed on Hindsight for the memory layer. What sold me was the combination of a clean Python client, an isolated bank-per-application model, and the fact that it degrades gracefully — if you don't have an API key, it falls back to a local JSON store with the exact same interface. That meant I could build and test everything offline without mocking anything out.
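To make that concrete, here's a minimal sketch of the fallback pattern — the class and method names are my assumptions for illustration, not the actual Hindsight client API:

```python
import json
import os
from typing import Optional

class LocalMemoryStore:
    """JSON-file store exposing the same interface as the cloud client,
    so the rest of the app never has to care which one it got."""

    def __init__(self, path: str = "memory_store.json"):
        self.path = path
        self._store = {"people": {}}
        if os.path.exists(path):
            with open(path) as f:
                self._store = json.load(f)

    def get_person(self, name: str) -> Optional[dict]:
        return self._store["people"].get(name.lower())

    def save(self) -> None:
        with open(self.path, "w") as f:
            json.dump(self._store, f, indent=2)

def make_memory_store(api_key: Optional[str] = None):
    # Prefer the cloud client when a key is configured; otherwise the
    # local JSON store is a first-class offline mode, not an error path.
    if api_key:
        try:
            from hindsight_client import Client  # hypothetical import
            return Client(api_key)
        except ImportError:
            pass
    return LocalMemoryStore()
```

Because both stores share one interface, everything downstream — seeding, lookups, the suggestion engine — is written once.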


How Memory Storage and Retrieval Actually Works

Each person in the system has two representations:

  1. A face embedding (or several) stored in face_db.pkl
  2. A memory record stored in Hindsight (or memory_store.json locally)

The memory record looks like this:

PREDEFINED_PEOPLE: dict[str, dict] = {
    "Sayantan": {
        "text": (
            "Sayantan is a 21-year-old computer science student interested in AI and "
            "coding. He enjoys chess and music and often helps with technical work."
        ),
        "age":      21,
        "relation": "Friend",
        "likes":    ["AI", "Coding", "Chess", "Music"],
        "notes":    "Helpful and calm personality",
    },
    ...
}

On startup, seed_initial_data() checks whether each predefined person already exists in Hindsight and skips them if so — idempotent seeding, no duplicates. New people enrolled live go through the same store_person() path:

def store_person(self, name, relation, notes, age=None, likes=None) -> bool:
    key = name.lower()  # canonical lowercase key at every write path
    now = _now_iso()
    existing = self._store["people"].get(key, {})
    self._store["people"][key] = {
        "name":       name,
        "age":        age,
        "relation":   relation,
        "likes":      likes or [],
        "notes":      notes,
        "last_seen":  existing.get("last_seen", now),
        "first_seen": existing.get("first_seen", now),
        "added":      existing.get("added", now),
    }
    return True
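For reference, the idempotent seeding loop is roughly this shape — a sketch, so the real seed_initial_data() may differ in detail:

```python
def seed_initial_data(store, predefined: dict) -> int:
    """Store each predefined person only if absent. Safe to call on
    every startup: existing records are skipped, so no duplicates."""
    seeded = 0
    for name, record in predefined.items():
        if store.get_person(name) is not None:
            continue  # already in the memory store; skip
        store.store_person(
            name,
            record.get("relation", ""),
            record.get("notes", ""),
            age=record.get("age"),
            likes=record.get("likes"),
        )
        seeded += 1
    return seeded
```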

One thing I got wrong initially: I was storing names with their original casing in the face database but querying Hindsight with whatever case the recognition engine returned. This caused silent misses — the face would be recognized, the Hindsight lookup would return nothing, and the UI would show a name with no context. The fix was straightforward but important: normalize everything to lowercase at every write path. The face engine now lowercases all keys on load and on every add_person() call.
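The fix boils down to one canonicalization point used on every read and write — the helper names here are illustrative:

```python
def canonical_key(name: str) -> str:
    # The single normalization point shared by the face DB and the
    # memory store; every key passes through here at the boundary.
    return name.strip().lower()

def normalize_keys(face_db: dict) -> dict:
    # Applied once when the pickled DB is loaded, so legacy mixed-case
    # entries can no longer cause silent memory-store misses.
    return {canonical_key(k): v for k, v in face_db.items()}
```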


The Recognition Pipeline: Crop First, Embed Second

The other place I lost hours was "No face detected" errors from DeepFace. The original approach sent the full webcam frame directly to DeepFace.represent() — which works fine when the person is close and centered, and fails silently when they're not.

The fix was to introduce an OpenCV Haar cascade as a pre-processing step. The cascade detects the face first, crops it with 20% padding, and only then hands the region to DeepFace:

def _crop_largest_face(frame: np.ndarray) -> Optional[np.ndarray]:
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = _FACE_CASCADE.detectMultiScale(
        gray, scaleFactor=1.1, minNeighbors=5, minSize=(60, 60)
    )
    if len(faces) == 0:
        return None
    # Keep the largest detection (by area) and pad the crop by 20%
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])
    pad_x, pad_y = int(w * 0.20), int(h * 0.20)
    return frame[
        max(0, y - pad_y):min(frame.shape[0], y + h + pad_y),
        max(0, x - pad_x):min(frame.shape[1], x + w + pad_x),
    ]

With enforce_detection=False on the cropped input (the cascade has already confirmed a face exists), DeepFace failure rates dropped dramatically. Matching also improved once I compared the query embedding against the average embedding per person rather than against each stored sample individually — averaging suppresses noise from outlier captures.
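The averaging step is easy to sketch with NumPy. The distance threshold below is illustrative, not the project's tuned value:

```python
import numpy as np

def build_centroids(face_db: dict) -> dict:
    # Collapse each person's stored samples into one averaged embedding,
    # so a single bad capture can't dominate nearest-neighbor matching.
    return {name: np.mean(np.stack(samples), axis=0)
            for name, samples in face_db.items()}

def match(query: np.ndarray, centroids: dict, threshold: float = 10.0):
    # FaceNet embeddings are compared with Euclidean distance; below the
    # threshold we report a match, above it the face is treated as unknown.
    best_name, best_dist = None, float("inf")
    for name, centroid in centroids.items():
        d = float(np.linalg.norm(query - centroid))
        if d < best_dist:
            best_name, best_dist = name, d
    return (best_name, best_dist) if best_dist <= threshold else (None, best_dist)
```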


The Suggestion Engine

Once a face is matched and memory is retrieved, the backend generates a conversation starter. The logic is layered:

  1. Recency first: if the patient hasn't seen this person in over a month, say so — "You haven't seen Sarah in over a month. They might have news to share!"
  2. Likes second: pick a random item from the likes list — "Ask Sarah about her painting class."
  3. Notes keyword scan: if the notes mention "dog", "travel", "wedding", surface a relevant prompt.
  4. Generic fallback: "Ask Sarah about their day."

def generate_suggestion(name: str, memory: Optional[dict]) -> str:
    if not memory:
        return f"Ask {name} about their day"
    last_seen_dt = _parse_last_seen(memory.get("last_seen") or "")
    if last_seen_dt:
        delta_days = (datetime.now(timezone.utc) - last_seen_dt).days
        if delta_days >= 30:
            return f"You haven't seen {name} in over a month. They might have news to share!"
    likes = memory.get("likes") or []
    if likes:
        return f"Ask {name} about {random.choice(likes)}"
    ...

This is deliberately simple. The goal isn't to be clever — it's to give a patient a single, low-friction way to start a conversation without having to remember context they no longer have.
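The remaining two layers — the notes keyword scan and the generic fallback — look roughly like this; the keyword list and phrasing are illustrative:

```python
# Hypothetical keyword-to-prompt table for the notes scan; the real
# project's keywords and wording may differ.
NOTE_KEYWORDS = {
    "dog": "Ask {name} how their dog is doing",
    "travel": "Ask {name} about their latest trip",
    "wedding": "Ask {name} about the wedding",
}

def suggestion_from_notes(name: str, notes: str) -> str:
    lowered = (notes or "").lower()
    for keyword, template in NOTE_KEYWORDS.items():
        if keyword in lowered:
            return template.format(name=name)
    return f"Ask {name} about their day"  # generic fallback
```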


What a Real Interaction Looks Like

A caregiver sets up the laptop in the living room. The patient's daughter walks in. Within about one second:

  • The camera crops her face, generates an embedding, matches it against the database
  • The backend queries Hindsight for her memory record
  • The UI displays: Sarah · Daughter · 87% match
  • Below that: "Ask Sarah about her painting class"
  • The Web Speech API reads aloud: "This is Sarah, your daughter."

If someone completely new walks in, the UI shows an amber badge: "Unknown person detected." A caregiver can click "Add This Person," capture 5–10 photos from different angles using the in-modal webcam, fill in name, relationship, and notes, and the person is enrolled immediately. The next scan cycle will recognize them.
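Under the hood, enrollment is an append to the pickled database under the canonical lowercase key — sketched here with assumed names:

```python
import pickle

def enroll_person(face_db: dict, name: str, embeddings: list,
                  path: str = "face_db.pkl") -> None:
    """Append the new capture embeddings under a canonical lowercase key,
    then persist, so the next scan cycle can match immediately."""
    key = name.strip().lower()
    face_db.setdefault(key, []).extend(embeddings)
    with open(path, "wb") as f:
        pickle.dump(face_db, f)
```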


Lessons Learned

1. Memory is the product, recognition is the plumbing. Face recognition is a solved problem with off-the-shelf tools. What makes a system like this genuinely useful is the richness of the context layer. Invest there.

2. Normalize your keys at the boundary. The lowercase bug cost me a full afternoon. Any time you have two systems storing the same identifier (face DB and Hindsight in this case), pick a canonical form at the write path and never deviate from it.

3. Crop before you embed. Sending full frames to a model that expects a face is a reliability trap. A lightweight cascade detector as a pre-filter is cheap and it eliminates an entire class of errors.

4. Average embeddings beat per-sample matching. A single bad capture — eyes closed, bad angle — will skew your nearest-neighbor result if you're comparing against all stored samples individually. Averaging collapses noise. It's one extra line per person on load and it makes matching noticeably more stable.

5. Graceful degradation is a feature, not a fallback. The local JSON memory store isn't a hack — it's a first-class deployment mode. Some caregivers will not have API keys and will not create accounts. If your system requires a cloud service to function at all, you've locked out your most constrained users. Build the offline path properly from day one.


The codebase is available on GitHub. If you're building anything in the assistive technology space and want to talk through the memory architecture, I'm reachable. There's a lot more to do here — multi-patient support, mobile-first UI, more nuanced memory retrieval — but the core loop works, and the people it's designed for don't have time to wait for perfect.
