Vektor Memory

Posted on Jun 19

We built a privacy-focused vector memory mobile app. And here is what it can do for you.

#mobile #reactnative #javascript #programming

On sovereignty, minimalism, and the architecture of thinking.

Published by Vektor Memory — 13 min read

The Design Constraint That Determined Everything

There are approximately 4.37 million apps in the world. We know this because we looked it up while building ours. The number is not discouraging. It is clarifying. That many apps means the space is thoroughly colonised by things that capture your attention.

What is almost entirely absent is a free, privacy-focused memory app that extends your thinking. That is the gap we built for: a fast AI note-taking app with a four-layer graph built on our own Vektor Memory technology.

Before the architecture, the memory graph, or any of the technical decisions described in this article, there was a single physical constraint: phone screens are small, slippery glass rectangles with no tactile resistance. Apps with nested submenus and three taps to reach the thing you actually wanted are not just annoying — they actively interrupt the cognitive state you came to the app to support. The entire direction of good mobile UX over the last decade has pointed the same way: fewer taps, more capability, complexity handled invisibly in the background.

Apple understood this when they made the iPhone feel simple while hiding extraordinary engineering underneath. The interface is the magic trick. The architecture is what makes the experience feel easy to use. We like that philosophy.

We applied the same principle to note-taking with our memory experience. The interface is a minimal surface. The architecture underneath is a four-layer graph with hybrid retrieval, semantic synthesis, and persistent relational edges. You should not have to be aware of any of that. You should just think and discover your ideas as they unfold.

What Vektor Notes Actually Does

You already have somewhere to put your thoughts. What you don’t have is a way to get them back when it matters. Vektor Notes is built for retrieval — the part every other notes app leaves to you.

The core interface has two modes that you swipe between: JOT and CHAT.

JOT is the writing surface. A clean text area, no menus, no formatting toolbar demanding attention. An LLM watches quietly, and after 900 milliseconds of silence — long enough that it fires only when you have genuinely paused, short enough that the suggestion arrives before you lose the thread — it offers a ghost suggestion. A connection to something you have written before. Further guidance or possibly a deep question. You accept it with a tap or dismiss it with no trace. If you are in flow, it stays entirely out of the way.

CHAT is the memory conversation. You talk to your accumulated notes. You ask questions. You expand ideas. The system knows what you have stored and uses it. The retrieval is not keyword search. It is a fused pipeline that combines BM25 full-text matching with vector similarity search, merged via Reciprocal Rank Fusion, so the system finds what you meant as well as what you wrote. If you don't have a need for a specific memory, just delete it; it's that easy.

Swipe between Chat and Jot. That is the entire interface. No hamburger menu. No drawer full of settings. No notification asking you to rate the app, no telemetry, and no ads. The architecture is invisible until you need it.

The Provider Model: Your Keys, Your Data

The foundational decision, made before anything else: your notes contain your actual thinking, and they should not share a screen with ads or pop-ups, which are a distraction. Vektor Notes keeps your notes and memory graph on-device, talks to the LLM provider you choose using your own API keys held in encrypted local storage, and never touches our servers.

Anthropic, OpenAI, Gemini, Groq, OpenRouter: you configure whichever provider you want in settings, paste in your own API key, and the app uses it. The key is stored on-device using SecureStore, Expo's encrypted key-value store (expo-secure-store). It never touches our servers. We do not log your conversations.

Groq pairs near-instant response speeds with daily limits generous enough to cover average usage.

This is a deliberate sovereignty choice, and it costs us something: there is no frictionless “just works” onboarding for people without API keys. We decided that small tradeoff was acceptable. The people who think seriously about where their context goes deserve tools that do not harvest it quietly in exchange for convenience. That deal has been struck too many times already.

The Ghost Suggestion Engine

The JOT surface looks like a blank text editor. Under the hood, a debounce timer resets on every keystroke. The goal is to capture ideas quickly, expand them, save, export, and move on.

The logic: when you stop typing for 900 milliseconds and your note is at least 20 characters long, a micro-response request fires. Below 20 characters, the system says nothing. You are still forming the thought, and interrupting a half-formed idea is the exact failure mode that made Clippy — the animated Office assistant that offered help you had not asked for — one of the most reliably cited examples of bad software in the history of the medium. We did not want to make another Clippy.

The request itself is a small, separate LLM call that does not block the editor and runs entirely in the background. The prompt is constrained: given this note in progress, offer a single short suggestion — a next thought, a question, a connection — in under 30 words. We deliberately capped the response length. An LLM that generates a paragraph every time you pause for a second is insufferable. Thirty words or fewer. If it cannot say something useful in thirty words, it says nothing.
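The gating rules above can be sketched as a few pure functions. This is a minimal illustration, not the app's actual code: the function names and the exact prompt wording are hypothetical, and only the thresholds (900 ms, 20 characters, 30 words) come from the description above.

```javascript
// Thresholds quoted in the article; every name below is hypothetical.
const SUGGEST_DELAY_MS = 900;      // pause length before a request fires
const MIN_NOTE_CHARS = 20;         // below this, the system says nothing
const MAX_SUGGESTION_WORDS = 30;   // cap on the ghost suggestion length

// True when the note is long enough to justify a suggestion request.
// Trimming whitespace before counting is an assumption.
function shouldRequestSuggestion(noteText) {
  return noteText.trim().length >= MIN_NOTE_CHARS;
}

// Builds a constrained prompt for the small background LLM call.
function buildSuggestionPrompt(noteText) {
  return [
    "Given this note in progress, offer a single short suggestion",
    "(a next thought, a question, or a connection)",
    `in under ${MAX_SUGGESTION_WORDS} words.`,
    "",
    noteText,
  ].join("\n");
}

// Returns the suggestion if it respects the word cap, otherwise null,
// which the editor would treat as "say nothing".
function acceptSuggestion(rawSuggestion) {
  const text = rawSuggestion.trim();
  if (text === "") return null;
  const wordCount = text.split(/\s+/).length;
  return wordCount <= MAX_SUGGESTION_WORDS ? text : null;
}
```

Enforcing the cap on the client as well as in the prompt is a design choice in this sketch: a model can ignore a length instruction, so the editor drops anything over the limit rather than truncating it mid-thought.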

The suggestion appears as ghost text beneath your current content: different opacity, a soft label reading “suggestion” in a small mono typeface. Two actions only: accept, which appends the suggestion to your note, or dismiss, which clears it with no trace. If you want to develop an idea further, that is what CHAT is for. Different users will prefer different ways of interacting.

A parallel 2000-millisecond save debounce runs the entire time you are writing. After two seconds of inactivity, the note saves automatically to SQLite. You never lose content to a crash or a navigation gesture. The active note ID persists to AsyncStorage, so returning to JOT from CHAT returns you to exactly where you left off.
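The two debounce timers (900 ms for suggestions, 2000 ms for saves) can be sketched with one small helper. This is an illustrative sketch under stated assumptions: `createDebouncer` and `createEditorTimers` are hypothetical names, and the `timers` parameter exists only so the logic can be exercised without waiting on real clocks.

```javascript
// Default timer functions, wrapped so they can be swapped out in tests.
const realTimers = {
  setTimeout: (fn, ms) => setTimeout(fn, ms),
  clearTimeout: (handle) => clearTimeout(handle),
};

// Trailing-edge debounce: fn runs once, delayMs after the last trigger().
function createDebouncer(fn, delayMs, timers = realTimers) {
  let handle = null;
  return {
    trigger(...args) {
      if (handle !== null) timers.clearTimeout(handle); // restart the wait
      handle = timers.setTimeout(() => {
        handle = null;
        fn(...args);
      }, delayMs);
    },
    cancel() {
      if (handle !== null) {
        timers.clearTimeout(handle);
        handle = null;
      }
    },
  };
}

// Wires both timers to a single text-change handler.
// onSuggest and onSave are callbacks supplied by the caller (hypothetical).
function createEditorTimers({ onSuggest, onSave, timers }) {
  const suggest = createDebouncer(onSuggest, 900, timers);
  const save = createDebouncer(onSave, 2000, timers);
  return {
    onChangeText(text) {
      suggest.trigger(text); // fires after 900 ms of silence
      save.trigger(text);    // fires after 2000 ms of silence
    },
  };
}
```

Each keystroke restarts both waits, so neither callback runs while you are still typing.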

The Save Architecture: Two Paths, One Principle

When you decide a JOT is worth keeping, two paths are available. The distinction matters more than it sounds.

Quick Save routes the raw text directly into the memory database as a semantic layer entry with a default importance score of 0.75. No extraction. No transformation. The text goes in whole, preserved verbatim.

Synthesise triggers a structured LLM call with a specific extraction prompt: pull out a title, a list of tags, named entities, a one-paragraph summary, and infer a layer classification — semantic for ideas and facts, temporal for events and time-based context, causal for cause-and-effect observations, entity for people and things. The result comes back as a JSON object and gets written as a proper memory node into the MAGMA graph, with typed edges connecting it to semantically related existing nodes.

Critically: the raw text also goes in, alongside the structured record. Both paths persist. You never lose the original by choosing to synthesise.
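A rough sketch of the two save paths, assuming the synthesis call returns a JSON string with the fields listed above. The record shapes, field names, and the fallback to the semantic layer for an unrecognised classification are all assumptions made for illustration, not the app's actual schema.

```javascript
// The four layer names come from the article; everything else is hypothetical.
const LAYERS = ["semantic", "temporal", "causal", "entity"];

// Quick Save: raw text stored verbatim as a semantic entry at importance 0.75.
function quickSaveRecord(rawText) {
  return { kind: "raw", content: rawText, layer: "semantic", importance: 0.75 };
}

// Validates the JSON object returned by the extraction prompt.
function parseSynthesis(jsonText) {
  const data = JSON.parse(jsonText); // throws on malformed JSON
  if (typeof data.title !== "string" || typeof data.summary !== "string") {
    throw new Error("synthesis result is missing a title or summary");
  }
  return {
    title: data.title,
    tags: Array.isArray(data.tags) ? data.tags.map(String) : [],
    entities: Array.isArray(data.entities) ? data.entities.map(String) : [],
    summary: data.summary,
    // Assumption: unknown layer values fall back to "semantic".
    layer: LAYERS.includes(data.layer) ? data.layer : "semantic",
  };
}

// Synthesise: returns BOTH the verbatim record and the structured node,
// so the original wording is never lost.
function synthesiseRecords(rawText, llmJsonText) {
  const s = parseSynthesis(llmJsonText);
  const structured = {
    kind: "synthesised",
    content: s.summary,
    layer: s.layer,
    metadata: { title: s.title, tags: s.tags, entities: s.entities },
  };
  return [quickSaveRecord(rawText), structured];
}
```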

This is a direct consequence of a principle the paper True Memory (arXiv:2605.04897) makes explicit, and that we had arrived at independently: content discarded before the query is known cannot be recovered at retrieval time. The synthesis structure is useful. The original wording is irreplaceable. Keep both.

MAGMA: The Memory Graph Underneath Everything

MAGMA is the four-layer graph architecture that separates Vektor Notes from a notes app with a chat window bolted on.

The four layers are not organisational categories. Each represents a different type of relationship between memories, which determines how they are retrieved.

Semantic layer — facts, ideas, concepts, observations. Edges connect semantically related nodes. The relationship is similarity-based but persists as an explicit graph edge, not a vector lookup recomputed on every query. The edge is a durable connection in the database.

Temporal layer — events, sequences, time-based context. The temporal layer preserves ordering: these events are connected not just by topic similarity but by when they occurred relative to each other. This is the layer that makes “what was I thinking about before I made that decision?” answerable with something more than a keyword search.

Causal layer — cause and effect, reasoning chains, decisions and rationale. Edges in this layer are directional and typed: one node causes or influences another. This is the layer that most agent memory systems skip. It is the layer that makes the app feel less like a notebook and more like an externalised reasoning history.

Entity layer — people, organisations, projects, locations. Named entities extracted during synthesis get their own nodes here, with edges connecting them to every memory node that mentions them. “What is my history with this project?” becomes a graph traversal, not a full-text scan.

The entire graph lives in a single SQLite file, on-device. The schema: a memories table for nodes with content, layer, importance score, and metadata; a memory_edges table for typed directional relationships; an FTS5 virtual table for BM25 full-text search; and a vec_memories table using the sqlite-vec extension for float32 vector embeddings via approximate nearest-neighbour search.
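To make the "graph traversal, not a full-text scan" idea concrete, here is an in-memory sketch over plain arrays standing in for the memories and memory_edges tables. The edge type names `mentions` and `causes`, and the field names `from`, `to`, and `type`, are hypothetical; the real schema may differ.

```javascript
// nodes: [{ id, content, layer }]   edges: [{ from, to, type }]

// Entity layer: every memory node with a "mentions" edge to the entity.
function memoriesMentioning(entityId, nodes, edges) {
  const byId = new Map(nodes.map((n) => [n.id, n]));
  return edges
    .filter((e) => e.type === "mentions" && e.to === entityId)
    .map((e) => byId.get(e.from))
    .filter(Boolean); // drop edges that point at missing nodes
}

// Causal layer: follow directional "causes" edges forward from a node.
// Breadth-first, with a visited set so cycles cannot loop forever.
function causalDescendants(startId, edges) {
  const seen = new Set([startId]);
  const queue = [startId];
  const order = [];
  while (queue.length > 0) {
    const current = queue.shift();
    for (const e of edges) {
      if (e.type === "causes" && e.from === current && !seen.has(e.to)) {
        seen.add(e.to);
        order.push(e.to);
        queue.push(e.to);
      }
    }
  }
  return order;
}
```

In SQLite the same questions would be joins or recursive queries over the edge table; the arrays here only illustrate the shape of the traversal.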

No Pinecone. No Neo4j. No cloud database. No GPU. The whole thing is a file on your phone; queries execute in milliseconds, and it backs up with your device backup.

If you no longer want to use the app, take your memories and move them somewhere else. We also give you VEX, a free open-source tool for migrating them into another database format.

The Retrieval Pipeline: How CHAT Finds What You Stored

When you ask something in CHAT, the retrieval pipeline runs before the LLM sees your question. This is the part that determines whether the app feels like it actually knows you, or like a polite chatbot with amnesia.

The pipeline runs two parallel paths that are then fused.

BM25 keyword search. Your query is tokenised and run against the SQLite FTS5 index, applying the BM25 ranking algorithm to the raw text of every stored memory. BM25 executes in single-digit milliseconds and excels at exact and near-exact term matching. If you wrote about a configuration issue three weeks ago and now ask about it, BM25 finds it immediately because the specific phrase is there.

Vector similarity search. Your query gets embedded using the same model as your stored memories, and the sqlite-vec ANN index returns the k most semantically similar nodes. This catches what BM25 misses. You stored something as “the authentication flow that kept breaking in production” and you ask about “login problems”: the vocabulary is different, the meanings overlap, and the vector path bridges the gap.

The two result sets are merged using Reciprocal Rank Fusion. RRF combines ranked lists by summing the reciprocal of each result’s rank position across all lists: for each document, 1/(k + rank_in_list) for every list it appears in, where k is a smoothing constant of 60. Documents appearing highly in multiple lists score best. RRF is stable across different corpus sizes without per-user calibration. It works.
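Reciprocal Rank Fusion as described above fits in a few lines. This is a generic sketch of the technique with k = 60, not the app's code; the function name and the use of plain ID arrays as ranked lists are assumptions.

```javascript
// rankedLists: array of arrays of document IDs, best match first.
// Each document scores the sum of 1 / (k + rank) over the lists it appears in,
// where rank is 1-based.
function reciprocalRankFusion(rankedLists, k = 60) {
  const scores = new Map();
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1]) // highest fused score first
    .map(([id, score]) => ({ id, score }));
}

// Example: "b" is ranked 2nd by BM25 and 1st by vector search,
// so it edges out "a" (ranked 1st and 3rd).
const bm25Results = ["a", "b", "c"];
const vectorResults = ["b", "d", "a"];
const fused = reciprocalRankFusion([bm25Results, vectorResults]);
console.log(fused.map((r) => r.id)); // [ 'b', 'a', 'd', 'c' ]
```

Because only rank positions enter the formula, the raw BM25 scores and vector distances never need to be normalised against each other.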

The top-k fused results — typically five to ten memory nodes — get formatted into a structured context block that prefixes your question in the LLM prompt. The LLM sees your question and a curated selection of relevant memories, not your entire history. Context windows are not free. Stuffing every note into every request would be slow, expensive, and would dilute the relevant signal with noise. The retrieval pipeline does the filtering work so the LLM does not have to.
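A sketch of how the fused results might be turned into the context block that prefixes the question. The block layout, the function name, and the default of eight memories (a value inside the five-to-ten range mentioned above) are illustrative assumptions.

```javascript
// fusedMemories: [{ content, layer }] already sorted best-first by the
// retrieval pipeline. Only the first topK are sent to the LLM.
function buildChatPrompt(question, fusedMemories, topK = 8) {
  const selected = fusedMemories.slice(0, topK);
  const context = selected
    .map((m, i) => `[${i + 1}] (${m.layer}) ${m.content}`)
    .join("\n");
  return `Relevant memories:\n${context}\n\nQuestion: ${question}`;
}
```

Capping the slice before formatting keeps the prompt size bounded no matter how large the memory store grows.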

The Toolbar: Eight Items, One Active Label

The bottom toolbar has Jot, Chat, Notes, Inbox, Graph, Memory, Search, and Config Settings.

Eight items at 380 pixels wide is tight. The first implementation labelled every item at all times. The result was eight-and-a-half-point monotype trying to render “Settings” in approximately 41 pixels next to seven identically weighted neighbours. The version the heading describes labels only the active item and leaves the other seven as icons.

All icons are custom inline SVG components, not Unicode characters or icon library imports. Icon libraries look exactly like icon libraries. Custom SVG means the weight and style are native to this specific product.

The Memory icon is concentric circles with a centre dot, referencing the hippocampal imagery from the HippoRAG research on neurobiological memory architecture. The Graph icon is four nodes and three edges, readable at 18 pixels, which takes more iterations to get right than you would expect. The users who notice will understand immediately.

The Technical Build: React Native and Expo

The app is built in React Native with Expo. One codebase, two targets — Android now, iOS in the next release — with an ecosystem mature enough that the specific problems we would encounter (gesture handling, keyboard avoidance, safe area insets, hardware-accelerated animations) all had documented solutions.

The swipe gesture between JOT and CHAT uses react-native-gesture-handler and react-native-reanimated, driving a shared translation value on the UI thread. The animation runs in Reanimated worklets on that UI thread, hardware-accelerated, meaning it does not drop frames even when the JS thread is processing an LLM response in parallel. The naive implementation had a visible stutter when a CHAT response arrived mid-swipe. The fix was isolating the animation shared value from the response state so neither thread blocks the other.

Keyboard avoidance is different on each platform. On Android, softwareKeyboardLayoutMode: "pan" in app.json handles the viewport correctly without a KeyboardAvoidingView wrapper. On iOS, KeyboardAvoidingView with behavior="padding" and a measured safe-area offset is required. The same code does not work for both platforms, and getting it wrong is immediately visible to any user who types anything.
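One way to keep the platform split in a single place is a small helper that returns the props for the wrapper. This is a sketch, not the app's code: the helper name is hypothetical, and rendering a disabled wrapper on Android (rather than omitting it, as described above) is an assumption made so both platforms can share one component tree. `behavior`, `keyboardVerticalOffset`, and `enabled` are standard KeyboardAvoidingView props; in a real component `platformOS` would come from `Platform.OS`.

```javascript
// Returns props to spread onto React Native's KeyboardAvoidingView.
// safeAreaOffset is the measured inset in points, supplied by the caller.
function keyboardAvoidingProps(platformOS, safeAreaOffset) {
  if (platformOS === "ios") {
    return {
      enabled: true,
      behavior: "padding",
      keyboardVerticalOffset: safeAreaOffset,
    };
  }
  // Android: the softwareKeyboardLayoutMode "pan" setting in app.json
  // handles the viewport, so the wrapper is switched off.
  return { enabled: false };
}
```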

All SQLite operations go through expo-sqlite, with sqlite-vec loaded as a native extension for vector operations. Every database call is asynchronous and awaited.

The Ecosystem

Vektor Notes is the consumer face of a wider set of memory infrastructure tools built over the past year. The app makes more sense in that context.

The Memory SDK is the npm package that powers the memory layer in Notes and can be dropped into any Node.js agent or workflow. It exposes the MAGMA graph, the hybrid retrieval pipeline, and a full MCP (Model Context Protocol) server with 50+ tools across memory, browser automation, SSH, and multimodal categories. Configure it in claude_desktop_config.json with a licence key and the SDK path, and any MCP-compatible agent has persistent memory across sessions: store facts during a session, retrieve them in any future session, with semantic, keyword, or graph-traversal retrieval depending on query type.

The intelligence layer includes recall tuning that adjusts retrieval weights based on past session utility; contradiction detection using an ADD-only policy (new contradictions are flagged as conflicts rather than silently overwriting older memories, because overwriting is how agents develop false beliefs and then act confidently on them); deduplication; namespace isolation for multi-project use; and a background consolidation cycle that fires when the app is idle.
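The ADD-only policy can be illustrated with an append-only store. This is a sketch under loud assumptions: the store shape and function name are hypothetical, and the `contradicts` predicate is injected by the caller because how the SDK actually detects a contradiction is not described here.

```javascript
// store: { nextId, memories: [...], conflicts: [{ olderId, newerId }] }
// contradicts(existing, incoming) -> boolean, supplied by the caller.
function addMemory(store, incoming, contradicts) {
  const record = { ...incoming, id: store.nextId };
  const newConflicts = store.memories
    .filter((existing) => contradicts(existing, incoming))
    .map((existing) => ({ olderId: existing.id, newerId: record.id }));
  return {
    nextId: store.nextId + 1,
    memories: [...store.memories, record],          // never overwrite
    conflicts: [...store.conflicts, ...newConflicts], // flag instead
  };
}
```

Older memories are never edited or removed in this sketch; a later review step, human or automated, would decide which side of each flagged conflict to trust.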

VEX CLI is the open-source migration tool, built around a .vmig.jsonl interchange format for agent memory portability. Export from one agent, import into another. Backup. Migrate. Merge two separate memory stores. The v0.3 roadmap includes Drift Adaptor — cross-model vector translation for migrating memories between embedding models that do not share a geometric space.
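The export, import, and merge flow can be sketched with generic JSONL helpers. The only property relied on is that a .jsonl file holds one JSON object per line; the record fields shown in the tests and the merge-by-content rule are hypothetical and are not the actual .vmig.jsonl schema.

```javascript
// Serialise records as JSON Lines: one JSON object per line.
function toJsonl(records) {
  return records.map((r) => JSON.stringify(r)).join("\n") + "\n";
}

// Parse JSON Lines back into records, skipping blank lines.
function fromJsonl(text) {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line));
}

// Merge two stores, keeping the first record seen for each distinct content.
// Deduplicating on content rather than id is an assumption: ids from two
// separate stores may collide.
function mergeByContent(a, b) {
  const seen = new Set();
  const merged = [];
  for (const record of [...a, ...b]) {
    if (seen.has(record.content)) continue;
    seen.add(record.content);
    merged.push(record);
  }
  return merged;
}
```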

Via CLI is the integration layer, at v0.3.0, that wraps common agent workflow patterns into composable commands: handoff generates a structured session summary (decisions made, things changed, things pending), memory queries or stores memories, and serve starts a local MCP-compatible server. At the end of a session, via handoff writes the context. At the start of the next, via memory loads it. The agent continues where it left off without re-explanation.

Why Memory Is the Whole Point

Notes apps are not productivity tools. They are memory prosthetics. The reason you take a note is that biological memory is fallible, context-dependent, and lossy. The note is a retrieval aid. The app is a retrieval system.

The problem with current note apps on the market is that they optimise for capture and treat retrieval as your problem. You put things in; getting them back out requires a keyword bar and your own memory of how you organised everything. The app is a very expensive text file.

What we were trying to build is a notes app where retrieval is the product. The capture is the input. Everything else — the MAGMA graph, the hybrid retrieval pipeline, the synthesis architecture, the on-device vector search — exists so that what you put in is actually available to you when you need it, in the form most useful to the question you are currently asking.

This connects directly to the research underpinning the design. HAGE (arXiv:2605.09942) argues that stored notes should form a weighted graph with learned relational edges, and retrieval should be query-conditioned — following temporal edges for time-based questions, entity edges for people-based questions, causal edges for why-questions. MAGMA is this structure without the RL training layer yet. That is where the roadmap leads.

True Memory (arXiv:2605.04897) argues you should not extract too aggressively at save time. Keep the raw event. Build structure later, at query time, when you know what question is being asked. This is why synthesis is optional in Notes, why quick save preserves the raw text whole, and why both paths persist when you do synthesise.

Both insights run silently under what looks, to the user, like a clean text editor with a thoughtful companion. The simple interface is the magic trick that makes it usable. The architecture underneath is what makes it useful.

The tools are simple and upfront, and all of the complex tech stays hidden behind the scenes, where it needs to be. As we receive more feedback, we will adjust and refine the settings to suit the actual user experience.