Muhammad Ayaan

Posted on Jun 8

Finally Finished Memex: Turning a CLI-Based MVP into a Production-Ready Web Service

#devchallenge #githubchallenge #ai

GitHub “Finish-Up-A-Thon” Challenge Submission

This is a submission for the GitHub Finish-Up-A-Thon Challenge

What I Built

Memex is a personal memory app. You tell it things in plain language, and you ask it questions later.

Most notes apps ask you to organise before you can capture. You pick the folder, name the file, decide whether this belongs in tasks or references or ideas. Memex skips all of that. You type what you heard, what occurred to you, what you want to remember. It saves it. Later you ask "where did I hear about those jackets?" and it tells you. Same account, same memories, whether you're at your terminal or in a browser.

Under the hood it's a RAG system. Every memory gets embedded using Gemini's gemini-embedding-001 model at 768 dimensions, stored in Supabase with pgvector, and retrieved by cosine similarity when you ask a question. Gemini synthesises an answer grounded only in what you actually told it. If it finds nothing relevant, it says so, and the model never gets to guess.
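
As a rough illustration of the retrieval step, the sketch below ranks stored embeddings by cosine similarity in plain Python. It is not Memex's code: in the real system the comparison runs inside pgvector, and the names cosine_similarity and top_matches, along with the idea of passing (text, vector) pairs, are assumptions made for illustration. The default limit of five mirrors the "up to five results" mentioned later in this post.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)

def top_matches(query_vec, memories, limit=5):
    """Return (score, text) pairs sorted from most to least similar.

    `memories` is a list of (text, embedding) pairs (hypothetical shape).
    """
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in memories]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:limit]
```

Real embeddings would have 768 dimensions rather than the handful you would use to try this out, but the ranking logic is the same.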

Two clients share one backend: a Python CLI and a Next.js web app, both reading and writing the same Supabase account.

The finished version of the web app has a chat page, a Memory Library, and a settings panel. The chat handles three kinds of input: storing a memory, recalling one (with optional time-scoping), and general conversation. Greetings get a warm reply. "What did I save this week?" lists everything from that window. "Forget what I said about the dentist" finds the matching memory, shows it to you, and only deletes it when you confirm. Answers stream in token by token. Save confirmations rotate through a small set of acknowledgements so the app doesn't feel like a form submission every time.

The Memory Library lets you browse, search, pin, edit, and delete memories directly, without going to the CLI or the Supabase dashboard. Memories get auto-tagged at save time into lightweight categories like idea, task, person, and place. Ones that carry a future date — "renew passport next month", "dentist on Tuesday" — get a due date extracted and show up in a due/upcoming view at the right time. The settings panel has a working dark mode, an export option that dumps everything to JSON, and an import path that re-embeds on the way back in with deduplication.

Here's what the finished version ships with:

Three-way intent routing: storing a memory, recalling one, and general conversation are handled separately — greetings never pollute your store
Temporal recall with timezone-aware windows: "what did I save today/this week/yesterday" works correctly in the user's local day, not UTC
Due-date extraction: memories with a future date show up in an upcoming view at the right time
Natural-language forget with two-step confirmation: the server never deletes on the first request
Lossless streaming via JSON-encoded SSE: answers grow token by token without dropping multi-line content
Relevance floor before synthesis: weak matches are filtered out before reaching the model, so one unrelated memory can't derail an answer
Memory Library: browse, search, pin, edit, and delete — no CLI required
Auto-tagging into categories (idea, task, person, place) at save time without blocking the core flow
Dark mode with full token-based theming, no hardcoded hex values left
Export to JSON and re-import with deduplication
Python CLI and Next.js web app sharing one Supabase backend and Gemini API

The initial version worked — but only in a terminal, only for one user, with no auth, no web UI, greetings stored as memories, and no way to ask what you'd saved today. It proved the core idea. It wasn't something you could hand to anyone else.

Turning that into a finished product — with a web app, real multi-device auth, a Memory Library, dark mode, due-date reminders, natural-language deletion, and all the correctness work that makes it trustworthy — is what this submission is really about.

Demo

Live site: https://memex-web-eta.vercel.app/
Finished repo: https://github.com/Raiden505/memex-web
Before version repo: https://github.com/Raiden505/memex-cli

The Comeback Story

The original Memex was a terminal tool. It stored memories, retrieved them semantically, and answered questions — but only from a command line, only for one account with no real auth, with no way to use it from a browser or a phone. Saying "hi" tried to store "hi" as a memory. There was no concept of today or this week. Deletion required knowing the memory's UUID. I could demo it. I couldn't give it to anyone.

That's the gap nobody talks about when they ship a working concept. The demo path works great. Everything slightly off the demo path doesn't. Saying "hi" stores "hi" as a memory. Clicking a memory card to ask about it sometimes re-saves it instead. "What did I tell you today?" returns nothing because the regex only matches the bare word, not the word inside a sentence. An answer about one thing gets quietly distorted by a weakly related memory that happened to score high enough to be included. None of those kill the demo. All of them kill the product.

The next 17 phases were about finding and fixing that category of problem — the things that work in a controlled walkthrough and break in real use.

The first thing Memex had to get right was correctness. A memory tool that confidently tells you the wrong thing is a different kind of failure than most software. It fails by lying about your own past. So the no-hallucination rule went in from the start: if a personal-recall question returns no relevant memories, the system short-circuits before calling the model and returns a fixed message. The LLM never gets the chance to guess at a personal fact. That constraint stayed intact through every subsequent phase.

Early versions classified every message as either "store" or "query." That worked for the obvious cases but produced embarrassing results at the edges. Say "hi" and the app would try to store "hi" as a memory. Ask "what's the capital of France?" and it would either store the question or return "I don't have anything saved about that." Neither is acceptable for something you want to use every day.

So a third intent, general, got added. Greetings and general-knowledge questions get a short conversational reply from Gemini. They're never stored and never trigger a memory search. A fast-path handles the obvious cases without any LLM call: there's a small hardcoded set of known strings — "hi", "thanks", "good morning", and a few more — that resolve directly to general before the classifier even runs. The three-way LLM classifier handles everything ambiguous, running at temperature 0 and defaulting to store on any failure.
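
The routing order described above can be sketched roughly as follows. This is a minimal sketch, not the actual Memex router: route_intent, GENERAL_FAST_PATH, and the injected classify callable (standing in for the temperature-0 Gemini call) are hypothetical names, and the fast-path set contains only the three strings quoted in this post.

```python
from typing import Callable

# Hypothetical fast-path set; the real list has "a few more" entries.
GENERAL_FAST_PATH = {"hi", "thanks", "good morning"}
VALID_INTENTS = {"store", "query", "general"}

def route_intent(message: str, classify: Callable[[str], str]) -> str:
    """Return "store", "query", or "general" for a chat message."""
    normalized = message.strip().lower().rstrip("!.? ")
    if normalized in GENERAL_FAST_PATH:
        return "general"  # resolved without any LLM call
    try:
        intent = classify(message).strip().lower()
    except Exception:
        return "store"  # default to store on any classifier failure
    # An unexpected label is treated like a failure.
    return intent if intent in VALID_INTENTS else "store"
```

Passing the classifier in as a function keeps the fast-path testable without network access; whether the real code is structured that way is an assumption.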

The no-hallucination rule got sharpened by this change. The GENERAL fallback applies only to messages classified general up front. A query that finds no matching memories still returns the fixed "nothing saved" message, every time. "What's the capital of France?" gets answered from general knowledge. "Where did I park?" does not get guessed at. Those are handled by different branches and that separation is deliberate.

Temporal recall came from actually using the app. "What did I tell you today?" is one of the most natural questions to ask a memory tool. The original system couldn't answer it because semantic search has no sense of when something was saved. A temporal.extract_range function now parses time-window phrases from queries, and list_memories_in_range handles the date-filtered fetch from Supabase. One bug took a while to notice: the original implementation used re.fullmatch("today", text), which only matched the bare word. "What did I tell you today?" as a natural sentence never triggered it. Switching to re.search(r"\btoday\b", text) was the fix.
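
The difference between the two regex calls is easy to reproduce. In the sketch below, mentions_today is a hypothetical helper name and the re.IGNORECASE flag is my own addition; only the two patterns come from the description above.

```python
import re

text = "What did I tell you today?"

# fullmatch succeeds only if the entire string is exactly "today",
# so a natural sentence never matches.
bare_word_only = re.fullmatch("today", text)  # None

# search with word boundaries finds "today" anywhere in the sentence,
# but not inside a longer word such as "todays".
TODAY = re.compile(r"\btoday\b", re.IGNORECASE)

def mentions_today(query: str) -> bool:
    return TODAY.search(query) is not None
```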

The timezone handling matters more than it might look. "Today" means the user's local day, not UTC. Someone in UTC+5 asking about today at 10pm expects their local calendar day, not a UTC window that cut off five hours ago. The web client reads the timezone from Intl.DateTimeFormat().resolvedOptions().timeZone and sends it with every chat request. All the window calculations run in the supplied timezone before converting to UTC for the database query.
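
A minimal sketch of the "today" window calculation, assuming the server receives the current instant in UTC and a tzinfo object for the user. The function name local_day_window_utc is hypothetical and this is not the project's temporal.extract_range.

```python
from datetime import datetime, time, timedelta, timezone, tzinfo

def local_day_window_utc(now_utc: datetime, tz: tzinfo):
    """Return (start, end) of the user's local calendar day, expressed in UTC."""
    local_today = now_utc.astimezone(tz).date()
    # Midnight at the start and end of the local day, in the user's zone.
    start_local = datetime.combine(local_today, time.min, tzinfo=tz)
    end_local = datetime.combine(local_today + timedelta(days=1), time.min, tzinfo=tz)
    # Convert to UTC only at the end, for the database query.
    return start_local.astimezone(timezone.utc), end_local.astimezone(timezone.utc)
```

For a user in UTC+5 at 10pm local time (17:00 UTC), the window runs from 19:00 UTC the previous day to 19:00 UTC the same day. With an IANA name sent by the browser, a zoneinfo.ZoneInfo instance from the standard library can be passed as tz.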

The streaming bug was invisible until you looked for it. Replies appeared to work (answers came back, content was correct), but nothing ever actually streamed, and some multi-line answers arrived truncated. The cause was in how SSE frames were being parsed. The backend emitted each model token as data: <raw token>\n\n. The frontend split the buffer on "\n" and discarded any line that didn't start with "data: ". A token containing a newline therefore produced a second line without the "data: " prefix, and the parser silently dropped it, truncating multi-line content. The non-streaming fallback returned the full response, which is why the product appeared correct but never actually grew text on screen.

The fix was JSON-encoding every payload instead of putting raw text directly in the SSE field. The backend now emits data: {"t": "<token>"}\n\n for each chunk, so newlines inside tokens are escaped as \n and never break SSE framing. The frontend parses events by splitting on "\n\n" (event boundaries) and JSON-parses each one. Backend and frontend had to ship together since the wire format changed, but after that streaming worked properly for the first time.
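
Both halves of that wire format can be sketched in a few lines. The real frontend parser is part of the Next.js app, so the Python version below only illustrates the framing logic. The names sse_event and parse_sse are hypothetical (the backend helper in the repo is called _sse), and returning the unfinished tail is an assumption about how a partial chunk would be handled.

```python
import json

def sse_event(payload: dict) -> str:
    """Serialize one SSE event; json.dumps escapes newlines inside the token."""
    return f"data: {json.dumps(payload)}\n\n"

def parse_sse(buffer: str):
    """Split a buffer on event boundaries.

    Returns (payloads, tail), where tail is any incomplete event left
    over at the end of the buffer, to be prepended to the next chunk.
    """
    *complete, tail = buffer.split("\n\n")
    payloads = []
    for event in complete:
        if event.startswith("data: "):
            payloads.append(json.loads(event[len("data: "):]))
    return payloads, tail
```

Because the only raw newlines left in the stream are the two that end each event, a token such as "Line one\n" survives the round trip intact.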

The accuracy of answers also improved significantly after filtering weak matches before synthesis. The original retrieval returned up to five results and passed all of them to the model regardless of relevance. One weakly related memory could distort an answer about something completely different. A relevance floor now filters results before they reach the synthesiser: anything within 0.2 of the top similarity score survives, with a hard floor at 0.2. If nothing clears the floor, the no-hallucination short-circuit fires and the model is never called. The quality improvement on real-world questions was noticeable immediately.
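
The floor and the short-circuit can be sketched like this. The two 0.2 values come from the paragraph above; the function names, the (score, text) tuple shape, the injected synthesize callable, and the exact wording of the fixed message are assumptions for illustration.

```python
NOTHING_SAVED = "I don't have anything saved about that."

def apply_relevance_floor(matches, margin=0.2, hard_floor=0.2):
    """Keep matches scoring at least hard_floor and within margin of the best."""
    if not matches:
        return []
    top = max(score for score, _ in matches)
    cutoff = max(top - margin, hard_floor)
    return [(score, text) for score, text in matches if score >= cutoff]

def answer(matches, synthesize):
    """Call the synthesizer only when something clears the floor."""
    kept = apply_relevance_floor(matches)
    if not kept:
        return NOTHING_SAVED  # short-circuit: the model is never called
    return synthesize(kept)
```

With scores of 0.82, 0.70, and 0.35, the cutoff lands near 0.62, so the first two memories reach the model and the third is dropped.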

Clicking a saved memory card to ask about it exposed a routing problem. The click was inserting the memory content into the chat as a plain query string, which then went through intent routing. If the note contained certain words — a date, "due", "task" — the router could classify the input as STORE and try to save the memory again instead of recalling it. A mode: "recall" parameter on the API now bypasses intent routing entirely for those requests, going straight to semantic search.
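
In code terms the bypass is a single early return ahead of the router. The sketch below assumes a hypothetical resolve_intent function and an injected route callable; only the mode: "recall" parameter comes from the actual API.

```python
from typing import Callable, Optional

def resolve_intent(message: str, mode: Optional[str], route: Callable[[str], str]) -> str:
    """An explicit mode="recall" skips the router; anything else is routed."""
    if mode == "recall":
        return "query"  # go straight to semantic search
    return route(message)
```

The point of the design is that a memory card click states its intent explicitly, so the content of the note can no longer change how the request is classified.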

Natural-language deletion was designed around a two-step confirm protocol. The first request resolves candidates but deletes nothing. The server returns forget_candidates in the response: a list of memories that matched, with content and dates. The client shows them in a confirm card. On confirm, the client re-posts the original message with confirm_forget: [ids]. The server re-checks ownership via user_id before deleting anything. "Forget everything" style requests get an even stronger confirmation step. The rule is that the server never deletes on the first turn, no exceptions. Deletion is irreversible and the confirm step is not optional.
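
The two turns of that protocol can be sketched against an in-memory list. This is an illustration under stated assumptions, not the server code: handle_forget is a hypothetical name, the store is a plain list of dicts rather than Supabase, and candidate_ids stands in for whatever the semantic search resolved from the message. The forget_candidates and confirm_forget field names are the ones described above.

```python
def handle_forget(store, user_id, candidate_ids, confirm_forget=None):
    """First call lists candidates; only a call carrying confirm_forget deletes."""
    if confirm_forget is None:
        # Turn one: resolve candidates owned by this user, delete nothing.
        owned = [m for m in store
                 if m["id"] in candidate_ids and m["user_id"] == user_id]
        return {"forget_candidates": owned, "deleted": []}
    # Turn two: re-check ownership before deleting anything.
    deletable = {m["id"] for m in store
                 if m["id"] in confirm_forget and m["user_id"] == user_id}
    store[:] = [m for m in store if m["id"] not in deletable]
    return {"forget_candidates": [], "deleted": sorted(deletable)}
```

Even if a client sends IDs it does not own in confirm_forget, the ownership re-check means only that user's memories are removed.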

My Experience with GitHub Copilot

Copilot was genuinely impressive throughout this project, and not just for boilerplate. It understood the shape of the problems I was working through and produced accurate, specific fixes — not generic suggestions I had to rework.

The FastAPI backend is a good example of how well it read context. The Python package already had well-defined module contracts — add_memory, search_memories, list_memories_in_range, and so on — because the CLI called them directly. I didn't have to explain the isolation rules. Copilot looked at what was already there and scaffolded the route handlers, request and response models, JWT dependency injection, and streaming wiring in a way that preserved the existing boundaries. The CLI and the web backend ended up calling the same Python modules without any duplication, and Copilot kept that consistent as the API surface grew.

The streaming transport fix is where it really stood out. I described the symptom — replies appearing correct but never actually streaming, and multi-line answers arriving truncated — and Copilot traced the cause correctly: raw token text in the SSE data: field breaks framing when a token contains a newline. It suggested JSON-encoding the payload so newlines get escaped, and produced the correct backend change (_sse({"t": token})) and the matching frontend parser in one shot. That's not a simple find-and-replace; it's understanding an SSE framing contract across two different runtimes.

The word-boundary fix for temporal.extract_range was similar. I mentioned that "what did I tell you today?" wasn't triggering temporal recall even though the word "today" was in it. Copilot immediately identified that re.fullmatch was the issue and replaced it with re.search(r"\btoday\b", text) — exactly the right change, first try.

It was also consistently good at the fixes that are easy to skip when moving fast: the relevance floor filtering before synthesis, the mode: "recall" bypass for memory card clicks, the atomic user_id re-check in the forget confirm flow, the reasons === "install" guard to show onboarding once. Each of those required understanding what invariant was being protected, not just what line to change. Copilot got them right.

The part it couldn't do was figure out which problems were worth solving in the first place. Whether the relevance floor belonged at 0.2. Whether "forget everything from yesterday" should still bulk-delete after adding single-item precision. Whether clicking a memory card should trigger a recall mode or just populate the input. Those came from using the product and noticing what felt wrong — and once I knew what I was trying to fix, Copilot was fast and accurate at fixing it.

What I Learned

The biggest thing this project taught me is that "working" and "finished" are completely different states, and the distance between them is easy to underestimate.

Before this Finish-Up-A-Thon, the concept was proven and everything "worked". Every core feature I'd set out to build was there. But working means the happy path runs cleanly. Finished means you can hand it to real users without standing behind them explaining what to avoid. That's a much higher bar, and almost none of the work that gets you there shows up in a demo.

The greeting problem is the most obvious example. In a concept, it doesn't matter that saying "hi" stores "hi" as a memory — you're demonstrating the store-and-recall loop, not stress-testing edge inputs. But hand it to a new user and the first thing they do is say hello. If their first interaction with the product is watching it store "hi" and reply "got it, I'll remember that," you've lost them before they've tried the actual feature. Fixing it meant adding a whole new intent class, a greeting fast-path, and a three-way LLM classifier. That's a non-trivial amount of work for something that looks like a papercut.

The relevance floor was subtler. Answers seemed correct — the right topic, reasonable phrasing, grounded in real memories. But occasionally an answer felt slightly off, like it was pulling from something adjacent rather than the actual relevant note. The cause was that retrieval returned up to five matches and handed all of them to the model regardless of how relevant they actually were. One weakly related memory, scored just high enough to be included, could quietly skew the answer. You'd only notice after using the app extensively enough to know what it should say. A concept doesn't need that level of correctness. A finished product does.

The same pattern showed up in smaller places too: the memory card click that sometimes re-saved instead of recalled, the confirm dialog that sat below the chat input on mobile and was barely tappable, the timezone handling that made "today" mean UTC midnight instead of the user's local day. Each one was fine in a controlled walkthrough. Each one would have broken trust within the first few minutes of real use.

Finishing is mostly invisible work. It's transport correctness and timezone math and confirm dialogs and cursor focus behaviour and relevance filtering and error messages that say something useful instead of a stack trace. None of it makes the feature list longer. All of it is the difference between something you can demo and something you can actually give to people.

The web UX details are a good example of how subtle this gets. None of these appear in a feature list, but all of them matter:

Cursor returns to the input automatically after a reply, so you don't have to click back
Textarea stays editable during streaming, even though sending is gated until the response finishes
Login navigates optimistically to the chat shell and loads prior memories into it in the background, so there's no blank wait

The concept proved the idea was worth building. The finish-up proved it was worth using. Those are genuinely different achievements, and the second one took longer.

What's Next Beyond Development

The longer-term goal is a proper mobile app. Most people aren't going to capture memories from a terminal or even a browser tab — they're going to want to open their phone, say or type something quickly, and move on. The FastAPI backend already serves both the CLI and the web client through the same endpoints, so a React Native or native app would slot in without any backend changes.

Mobile also unlocks the features that make a second brain actually sticky over time. Due-date reminders as push notifications. Background checks that surface overdue items. Voice input so you can capture something while you're walking. The architecture supports all of it — the due_at column is indexed, the temporal query paths exist, and the reminder logic is already in the backend. What's missing is the client that can actually deliver a notification to your lock screen.

Beyond that, the roadmap includes:

Voice input so you can speak a memory instead of typing it
Push notifications for due-date reminders, delivered at the right time in the user's local timezone
An installable PWA as a stepping stone while native apps are in progress
Export to Markdown in addition to JSON, so your data stays readable outside the app

If anyone has built second-brain or personal-knowledge-management tools before and has thoughts on what actually drives long-term retention, I'd genuinely like to hear them.