<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jason Sosa</title>
    <description>The latest articles on DEV Community by Jason Sosa (@singularityjason).</description>
    <link>https://dev.to/singularityjason</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3772503%2Fecdb24be-eadd-4419-88dd-6af280aa334f.png</url>
      <title>DEV Community: Jason Sosa</title>
      <link>https://dev.to/singularityjason</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/singularityjason"/>
    <language>en</language>
    <item>
      <title>I Built an Agent That Trades on Bitcoin Lightning. It remembered nothing. So I built a brain.</title>
      <dc:creator>Jason Sosa</dc:creator>
      <pubDate>Thu, 05 Mar 2026 00:23:25 +0000</pubDate>
      <link>https://dev.to/singularityjason/i-built-an-agent-that-trades-on-bitcoin-lightning-it-remembered-nothing-so-i-built-a-brain-3l6j</link>
      <guid>https://dev.to/singularityjason/i-built-an-agent-that-trades-on-bitcoin-lightning-it-remembered-nothing-so-i-built-a-brain-3l6j</guid>
      <description>&lt;p&gt;L402 lets AI agents pay for API calls with Lightning. No credit cards, no OAuth, no subscription plans. An agent gets an invoice, pays it, gets access. It works. And it's growing fast.&lt;/p&gt;

&lt;p&gt;But every agent I've seen using L402 has the same problem: it forgets everything.&lt;/p&gt;

&lt;p&gt;An agent finds a transcription API that charges 2,000 sats (~$1.42) per request. Good price, fast response. The agent uses it a dozen times. Then the session ends, and the agent has no record of any of it. Next session, it's back to searching for transcription APIs from scratch. It'll probably find a worse one.&lt;/p&gt;

&lt;p&gt;Scale that up. An agent managing a real budget, spending millions of sats a week on compute, data, and API access, is flying blind between sessions. No vendor history. No spending patterns. No memory of which endpoints returned errors half the time.&lt;/p&gt;

&lt;p&gt;That's the gap I built Lightning Memory to fill.&lt;/p&gt;

&lt;h2&gt;What It Does&lt;/h2&gt;

&lt;p&gt;Lightning Memory is an MCP server. Nine tools. Install it, point Claude or any MCP-compatible agent at it, and the agent can store and query memories that persist across sessions.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;pip install lightning-memory
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Three layers:&lt;/p&gt;

&lt;p&gt;Memory. Agents store what happened. Transactions, vendor notes, errors, decisions, preferences. Everything goes into local SQLite with FTS5 full-text search. No cloud. Nothing leaves your machine.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;memory_store(
    content="transcribe-api.com, 2000 sats for 10min audio. "
            "Response under 3 seconds. Output was accurate.",
    type="transaction",
    metadata='{"vendor": "transcribe-api.com", "amount_sats": 2000}'
)
&lt;/code&gt;&lt;/pre&gt;
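&lt;p&gt;For illustration, here's how a local FTS5-backed store along these lines could look with Python's stdlib sqlite3 (an FTS5-enabled build is assumed; the schema and function bodies are a sketch, not Lightning Memory's actual implementation):&lt;/p&gt;

```python
import json
import sqlite3

# Illustrative sketch of a local memory store with FTS5 full-text search.
# Table and column names are hypothetical, not Lightning Memory's schema.
db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE memories USING fts5(content, type, metadata)")

def memory_store(content, type="note", metadata=None):
    db.execute(
        "INSERT INTO memories (content, type, metadata) VALUES (?, ?, ?)",
        (content, type, json.dumps(metadata or {})),
    )
    db.commit()

def memory_search(query):
    # FTS5 MATCH returns ranked full-text results, entirely on-device
    rows = db.execute(
        "SELECT content FROM memories WHERE memories MATCH ? ORDER BY rank",
        (query,),
    ).fetchall()
    return [r[0] for r in rows]

memory_store(
    "transcribe-api.com, 2000 sats for 10min audio",
    type="transaction",
    metadata={"vendor": "transcribe-api.com", "amount_sats": 2000},
)
```

&lt;p&gt;Everything stays in one local database file; swapping &lt;code&gt;:memory:&lt;/code&gt; for a path on disk makes it persist across sessions.&lt;/p&gt;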

&lt;p&gt;Intelligence. Four tools that turn raw memories into judgment. Vendor reputation scores. Spending breakdowns. Anomaly detection that flags a vendor charging 50,000 sats when their historical average is 2,000.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;ln_anomaly_check(vendor="transcribe-api.com", amount_sats=50000)
# {verdict: "high", context: "50,000 sats is 25x the historical average of 2,000"}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;At ~$71k per BTC, 50,000 sats is about $35. Not catastrophic for a single payment, but an agent making that mistake repeatedly across dozens of vendors burns through real money.&lt;/p&gt;
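&lt;p&gt;The arithmetic behind that verdict is simple enough to sketch (the thresholds here are hypothetical, not the cutoffs &lt;code&gt;ln_anomaly_check&lt;/code&gt; actually uses):&lt;/p&gt;

```python
# Back-of-envelope version of the anomaly verdict above.
SATS_PER_BTC = 100_000_000

def sats_to_usd(sats, btc_usd=71_000):
    return sats / SATS_PER_BTC * btc_usd

def anomaly_verdict(amount_sats, historical_avg_sats):
    # Hypothetical thresholds; the real tool's cutoffs may differ.
    ratio = amount_sats / historical_avg_sats
    if ratio >= 10:
        return "high"
    if ratio >= 3:
        return "medium"
    return "normal"
```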

&lt;p&gt;Identity. Every agent gets a Nostr keypair. BIP-340 Schnorr signatures, same crypto as the Nostr protocol. No email, no registration, no platform. Memories are signed NIP-78 events that can sync to any Nostr relay. Your agent's identity and data are portable. If a relay goes down, switch to another. Nothing is lost.&lt;/p&gt;

&lt;h2&gt;The L402 Gateway&lt;/h2&gt;

&lt;p&gt;This is the part I think has the most potential.&lt;/p&gt;

&lt;p&gt;Lightning Memory includes an HTTP gateway where other agents can pay sats to query your agent's memories. The protocol is L402: the gateway returns a 402 with a Lightning invoice, the client pays, and gets access.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;GET /ln/vendor/transcribe-api.com
-&amp;gt; 402 Payment Required (invoice for 3 sats)

(agent pays invoice)

-&amp;gt; 200 OK
-&amp;gt; {total_txns: 47, success_rate: 0.94, avg_sats: 2100}
&lt;/code&gt;&lt;/pre&gt;
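&lt;p&gt;A client for this flow can be sketched generically. Here &lt;code&gt;fetch&lt;/code&gt; and &lt;code&gt;pay_invoice&lt;/code&gt; are stand-ins for a real HTTP client and Lightning wallet, and the L402 header handling is simplified:&lt;/p&gt;

```python
# Generic L402 client loop. fetch() and pay_invoice() are injected stand-ins
# for a real HTTP client and Lightning wallet; header details are simplified.
def l402_get(url, fetch, pay_invoice):
    status, headers, body = fetch(url, auth=None)
    if status == 402:
        invoice = headers["WWW-Authenticate"]  # carries the Lightning invoice
        preimage = pay_invoice(invoice)        # pay it, keep proof of payment
        status, headers, body = fetch(url, auth=preimage)
    return status, body

# Stub transport simulating the gateway's 402-then-200 behavior
def fake_fetch(url, auth):
    if auth is None:
        return 402, {"WWW-Authenticate": "lnbc3..."}, None
    return 200, {}, {"total_txns": 47, "success_rate": 0.94, "avg_sats": 2100}

status, body = l402_get(
    "/ln/vendor/transcribe-api.com", fake_fetch, pay_invoice=lambda inv: "preimage"
)
```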

&lt;p&gt;Three sats for a vendor reputation check. An agent that has spent weeks interacting with L402 APIs has built up real intelligence about which vendors are reliable and what fair prices look like. That intelligence is valuable.&lt;/p&gt;

&lt;p&gt;The gateway lets it sell that knowledge.&lt;/p&gt;

&lt;p&gt;I haven't built the network layer yet. Right now each gateway is standalone. But the direction is clear: agents trading knowledge with each other, paying in sats, with no intermediary.&lt;/p&gt;

&lt;h2&gt;Why Nostr for Identity&lt;/h2&gt;

&lt;p&gt;I could have used API keys or JWTs or any standard auth scheme. I picked Nostr keypairs deliberately.&lt;/p&gt;

&lt;p&gt;The AI agent economy is going to be big. Agents will manage budgets, negotiate with vendors, build reputations over months of interactions. Whoever controls agent identity controls that economy. If agent identity lives on Google or OpenAI accounts, those platforms own your agents.&lt;/p&gt;

&lt;p&gt;Nostr keypairs are sovereign. A 32-byte private key stored on your machine. No platform can revoke it, no company needs to stay in business for it to keep working. And because the keys are the same standard used across Nostr, the entire relay infrastructure is already there for syncing and portability.&lt;/p&gt;
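&lt;p&gt;Generating that 32-byte key needs nothing beyond the standard library (deriving the BIP-340 x-only public key requires a secp256k1 library such as coincurve, which is omitted here):&lt;/p&gt;

```python
import secrets

# A Nostr identity reduces to a 32-byte secp256k1 private key. Generating it
# needs only the stdlib; deriving the BIP-340 x-only public key requires a
# secp256k1 library (e.g. coincurve) and is omitted here.
def generate_private_key():
    return secrets.token_bytes(32)

sk = generate_private_key()
sk_hex = sk.hex()  # 64 hex chars: the raw key material behind an nsec string
```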

&lt;h2&gt;Try It&lt;/h2&gt;

&lt;pre&gt;&lt;code&gt;pip install lightning-memory
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Nine tools, 156 tests, MIT licensed. Works with Claude and any other MCP-compatible agent. Listed on the MCP registry.&lt;/p&gt;

&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/singularityjason" rel="noopener noreferrer"&gt;
        singularityjason
      &lt;/a&gt; / &lt;a href="https://github.com/singularityjason/lightning-memory" rel="noopener noreferrer"&gt;
        lightning-memory
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Decentralized agent memory for the Lightning economy. Nostr identity, L402 payments, MCP server.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;Lightning Memory&lt;/h1&gt;
&lt;/div&gt;
&lt;p&gt;&lt;a href="https://pypi.org/project/lightning-memory/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/678cdfa5d1c6254762fc73999e70a38534294bdb0f7c89491338c4eb1577b36c/68747470733a2f2f696d672e736869656c64732e696f2f707970692f762f6c696768746e696e672d6d656d6f72792e737667" alt="PyPI version"&gt;&lt;/a&gt;
&lt;a href="https://www.python.org/downloads/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/93a33cfc2339ec3fa9be792576576fbaafc42b0c7031285662b02f3aca1e1c59/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f707974686f6e2d332e31302b2d626c75652e737667" alt="Python 3.10+"&gt;&lt;/a&gt;
&lt;a href="https://github.com/singularityjason/lightning-memory/LICENSE" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/8bb50fd2278f18fc326bf71f6e88ca8f884f72f179d3e555e20ed30157190d0d/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f6c6963656e73652d4d49542d677265656e2e737667" alt="License: MIT"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Persistent memory for AI agents in the Lightning economy.&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;The Problem&lt;/h2&gt;
&lt;/div&gt;
&lt;p&gt;AI agents spend sats over Lightning via L402 — but they can't remember what they bought. Every session starts from zero. Every vendor is a stranger. Every price is accepted at face value. An agent that paid 500 sats yesterday doesn't know if today's 5,000 sat invoice is a price spike or normal.&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;The Solution&lt;/h2&gt;
&lt;/div&gt;
&lt;div class="snippet-clipboard-content notranslate position-relative overflow-auto"&gt;&lt;pre class="notranslate"&gt;&lt;code&gt;L1: Bitcoin      — settles
L2: Lightning    — pays
L3: Lightning Memory — remembers
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Lightning Memory gives agents persistent memory, vendor intelligence, and payment safety gates. Agents learn from their spending history, track vendor reputations, detect price anomalies, enforce budgets, and share trust signals with other agents.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.jasonsosa.com/blog/agent-lightning-memory" rel="nofollow noopener noreferrer"&gt;Interactive Demo&lt;/a&gt;&lt;/strong&gt; — watch an agent learn, get rugged, and route around bad actors.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.jasonsosa.com/blog/agent-economy-trust-marketplace-bitcoin-lightning" rel="nofollow noopener noreferrer"&gt;Building the Agent Economy&lt;/a&gt;&lt;/strong&gt; — trust, budgets, compliance, and the memory marketplace.&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Who Is This For&lt;/h2&gt;

&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Agents making L402 payments&lt;/strong&gt; that need…&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/singularityjason/lightning-memory" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;




&lt;p&gt;The gateway needs Phoenixd (ACINQ's zero-config Lightning node) for invoice generation. The README has the full setup. Core memory and intelligence tools work without it.&lt;/p&gt;

&lt;p&gt;I'm looking for feedback from anyone building agents on Lightning. What would you store? What queries would be useful? Open an issue or find me on X (@jasonsosa).&lt;/p&gt;

</description>
      <category>bitcoin</category>
      <category>ai</category>
      <category>opensource</category>
      <category>lightning</category>
    </item>
    <item>
      <title>Why Flat Files Break as AI Agent Memory (And What We Built Instead)</title>
      <dc:creator>Jason Sosa</dc:creator>
      <pubDate>Fri, 27 Feb 2026 03:58:47 +0000</pubDate>
      <link>https://dev.to/singularityjason/why-flat-files-break-as-ai-agent-memory-and-what-we-built-instead-24gj</link>
      <guid>https://dev.to/singularityjason/why-flat-files-break-as-ai-agent-memory-and-what-we-built-instead-24gj</guid>
      <description>&lt;p&gt;Your AI coding agent has amnesia.&lt;/p&gt;

&lt;p&gt;Every Claude Code session, every Cursor chat, every Windsurf interaction starts from zero. The architectural decision you explained on Monday? Gone. The debugging lesson from Friday? Never happened. The style preferences you've stated twelve times? Say them again.&lt;/p&gt;

&lt;p&gt;The common fix is a flat file. Claude Code has &lt;code&gt;CLAUDE.md&lt;/code&gt;. Cursor has &lt;code&gt;.cursorrules&lt;/code&gt;. They work — for a while.&lt;/p&gt;

&lt;p&gt;Then they don't.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where flat files break
&lt;/h2&gt;

&lt;p&gt;I kept hitting the same five failures as my &lt;code&gt;CLAUDE.md&lt;/code&gt; grew past 200 lines:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Search is impossible.&lt;/strong&gt; You're grepping for context that may or may not be there. The phrase you used three weeks ago doesn't match how you'd describe it today.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Nothing is auto-captured.&lt;/strong&gt; Every lesson has to be manually written. You debug a Docker volume mount issue for 30 minutes, and unless you type "remember this," it's gone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. It grows forever.&lt;/strong&gt; No deduplication. No decay. No contradiction detection. "We use REST" from January sits next to "We migrated to WebSockets" from February. The agent picks whichever it attends to first.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. It's one file per project.&lt;/strong&gt; That debugging pattern you learned in project A? Invisible in project B. No cross-project learning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. No checkpoint/resume.&lt;/strong&gt; Stop mid-refactor, and there's no structured way to pick up where you left off.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;CLAUDE.md&lt;/code&gt; is fine for "always use tabs." It breaks when your agent needs to actually learn.&lt;/p&gt;

&lt;h2&gt;
  
  
  What a real memory pipeline looks like
&lt;/h2&gt;

&lt;p&gt;I built &lt;a href="https://github.com/omega-memory/omega-memory" rel="noopener noreferrer"&gt;OMEGA&lt;/a&gt; to solve this. It runs as an MCP server with 12 tools, and it handles the full memory lifecycle:&lt;/p&gt;

&lt;h3&gt;
  
  
  Store pipeline (12 sub-phases)
&lt;/h3&gt;

&lt;p&gt;When a memory comes in, it doesn't just get appended:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;input → validate → dedup check → evolution check → conflict detection
  → store → entity extraction → auto-relate → contradiction supersession
  → fact splitting → reminder check → feedback
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Dedup&lt;/strong&gt; runs three layers: SHA256 hash (exact match), content hash (near-exact), and embedding cosine similarity with per-type Jaccard thresholds (decisions at 0.80, lessons at 0.85). This catches the agent restating the same decision six times in six paraphrases.&lt;/p&gt;
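&lt;p&gt;A stripped-down sketch of those dedup layers (hash check plus token-level Jaccard, with the per-type thresholds mentioned above; OMEGA's real pipeline also compares embeddings):&lt;/p&gt;

```python
import hashlib

# Stripped-down dedup layers: exact SHA-256 match first, then token-level
# Jaccard similarity against the per-type thresholds mentioned above.
# (OMEGA's real pipeline also compares embeddings; omitted here.)
THRESHOLDS = {"decision": 0.80, "lesson": 0.85}

def sha256(text):
    return hashlib.sha256(text.encode()).hexdigest()

def jaccard(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta.intersection(tb)) / len(ta.union(tb))

def is_duplicate(new, existing, mem_type="lesson"):
    if sha256(new) == sha256(existing):
        return True  # layer 1: exact match
    return jaccard(new, existing) >= THRESHOLDS[mem_type]
```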

&lt;p&gt;&lt;strong&gt;Evolution&lt;/strong&gt; detects when new content overlaps 55-95% with an existing memory. Instead of creating a duplicate, it appends the new insights to the existing entry.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conflict detection&lt;/strong&gt; catches contradictions automatically. "We use JWT" followed by "We switched to session cookies" — the old decision gets superseded, not silently ignored.&lt;/p&gt;

&lt;h3&gt;
  
  
  Retrieval pipeline
&lt;/h3&gt;

&lt;p&gt;Search isn't keyword matching. It's a five-stage pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Simplified from omega/sqlite_store.py
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# 1. Vector similarity (bge-small-en-v1.5, 384-dim via sqlite-vec)
&lt;/span&gt;    &lt;span class="n"&gt;vec_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;cosine_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;limit&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 2. Full-text search via FTS5
&lt;/span&gt;    &lt;span class="n"&gt;fts_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fts5_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;limit&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 3. Reciprocal rank fusion
&lt;/span&gt;    &lt;span class="n"&gt;merged&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;rrf_merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vec_results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fts_results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 4. Type-weighted scoring (decisions/lessons weighted 2x)
&lt;/span&gt;    &lt;span class="n"&gt;scored&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;apply_type_weights&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;merged&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 5. Time-decay (old unaccessed memories rank lower)
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;apply_decay&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scored&lt;/span&gt;&lt;span class="p"&gt;)[:&lt;/span&gt;&lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;10 relevant memories out of 500, under 50ms. All local — SQLite + ONNX embeddings, no API keys, no cloud.&lt;/p&gt;
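&lt;p&gt;The &lt;code&gt;rrf_merge&lt;/code&gt; step is worth spelling out. Reciprocal rank fusion scores each result by the sum of 1/(k + rank) across the two lists; k=60 is the conventional constant, and this sketch may differ from OMEGA's exact implementation:&lt;/p&gt;

```python
# Reciprocal rank fusion: each list contributes 1/(k + rank) per item, so a
# memory that appears in both the vector and FTS5 results rises to the top.
def rrf_merge(*ranked_lists, k=60):
    scores = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vec_results = ["m1", "m2", "m3"]  # from cosine similarity
fts_results = ["m2", "m4"]        # from FTS5 keyword match
merged = rrf_merge(vec_results, fts_results)  # m2 ranks first: it's in both
```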

&lt;h3&gt;
  
  
  Forgetting
&lt;/h3&gt;

&lt;p&gt;This is the feature nobody talks about. Memories that aren't accessed lose ranking weight over time. The floor is 0.35 — nothing disappears completely — but stale context stops dominating retrieval.&lt;/p&gt;

&lt;p&gt;Preferences and error patterns are exempt from decay. Your "always use early returns" preference never fades. But that one-off debugging note about a dependency version from six months ago? It quietly drops out of relevance.&lt;/p&gt;
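&lt;p&gt;A minimal sketch of that decay rule (the half-life constant is illustrative; only the 0.35 floor and the exemptions come from the description above):&lt;/p&gt;

```python
import math

# Exponential time-decay with the 0.35 floor and type exemptions described
# above. The 30-day half-life is illustrative, not OMEGA's actual constant.
EXEMPT_TYPES = {"preference", "error_pattern"}
FLOOR = 0.35

def decay_weight(mem_type, days_since_access, half_life_days=30):
    if mem_type in EXEMPT_TYPES:
        return 1.0  # preferences and error patterns never fade
    w = math.exp(-math.log(2) * days_since_access / half_life_days)
    return max(w, FLOOR)
```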

&lt;h2&gt;
  
  
  Auto-capture: the part that actually matters
&lt;/h2&gt;

&lt;p&gt;The explicit &lt;code&gt;omega_store&lt;/code&gt; tool is useful, but the real value is what happens without being asked.&lt;/p&gt;

&lt;p&gt;OMEGA hooks into your editor's session lifecycle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SessionStart&lt;/strong&gt;: Surfaces relevant memories from past sessions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;UserPromptSubmit&lt;/strong&gt;: Detects decisions and lessons in the conversation and stores them automatically&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PostToolUse&lt;/strong&gt;: Surfaces memories relevant to files you're editing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stop&lt;/strong&gt;: Generates a session summary&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When you spend 30 minutes debugging and Claude says &lt;em&gt;"The node_modules volume mount was shadowing the container's node_modules. Fixed by adding an anonymous volume"&lt;/em&gt; — OMEGA auto-captures that as a lesson. Next time anyone hits the same Docker issue, it's already there.&lt;/p&gt;

&lt;h2&gt;
  
  
  Checkpoint/resume
&lt;/h2&gt;

&lt;p&gt;This is what convinced me flat files would never work:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You: "Checkpoint this — I'm halfway through migrating the auth middleware."

# OMEGA saves: task description, files modified, current state,
# remaining steps, and links to all related memories

# ...next day, new session...

You: "Resume the auth middleware task."

# OMEGA restores full context. Claude picks up exactly where you left off.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Task state survives session boundaries. No copy-pasting "here's where I was."&lt;/p&gt;

&lt;h2&gt;
  
  
  The benchmark
&lt;/h2&gt;

&lt;p&gt;OMEGA scores &lt;strong&gt;95.4% on LongMemEval&lt;/strong&gt; (ICLR 2025) — a 500-question benchmark testing extraction, multi-session reasoning, temporal understanding, knowledge updates, and preference tracking.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;System&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OMEGA&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;95.4%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mastra&lt;/td&gt;
&lt;td&gt;94.87%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Emergence&lt;/td&gt;
&lt;td&gt;86.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Zep/Graphiti&lt;/td&gt;
&lt;td&gt;71.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The full methodology is at &lt;a href="https://omegamax.co/benchmarks" rel="noopener noreferrer"&gt;omegamax.co/benchmarks&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip3 &lt;span class="nb"&gt;install &lt;/span&gt;omega-memory[server]
omega setup
omega doctor
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. &lt;code&gt;omega setup&lt;/code&gt; downloads the embedding model, registers the MCP server, and installs hooks. Start a Claude Code session and say "Remember that we always use early returns." Close the session. Open a new one. Ask "What are my code style preferences?"&lt;/p&gt;

&lt;p&gt;It's there.&lt;/p&gt;

&lt;p&gt;Works with Claude Code, Cursor, Windsurf, Zed, and any MCP client. Apache-2.0 licensed. No API keys. Everything runs on your machine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/omega-memory/omega-memory" rel="noopener noreferrer"&gt;omega-memory/omega-memory&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Website&lt;/strong&gt;: &lt;a href="https://omegamax.co" rel="noopener noreferrer"&gt;omegamax.co&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;PyPI&lt;/strong&gt;: &lt;a href="https://pypi.org/project/omega-memory/" rel="noopener noreferrer"&gt;omega-memory&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I built OMEGA because I was tired of re-explaining the same architectural decisions to an agent that forgot everything between sessions. If you're hitting the same problem, give it a try. And if you find bugs, the issue tracker is open.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>mcp</category>
      <category>devtools</category>
    </item>
    <item>
      <title>How I Built a Memory System That Scores 95.4% on LongMemEval (#1 on the Leaderboard)</title>
      <dc:creator>Jason Sosa</dc:creator>
      <pubDate>Sun, 15 Feb 2026 02:28:07 +0000</pubDate>
      <link>https://dev.to/singularityjason/how-i-built-a-memory-system-that-scores-954-on-longmemeval-1-on-the-leaderboard-2md3</link>
      <guid>https://dev.to/singularityjason/how-i-built-a-memory-system-that-scores-954-on-longmemeval-1-on-the-leaderboard-2md3</guid>
      <description>

&lt;p&gt;Every AI coding agent has the same problem: amnesia.&lt;/p&gt;

&lt;p&gt;You spend an hour with Claude Code debugging a Docker volume mount issue. You find the fix, explain your architectural reasoning, set coding preferences. Then you close the session. Next time you open it, the agent has no idea any of that happened. You start from zero.&lt;/p&gt;

&lt;p&gt;I got tired of spending the first 10-15 minutes of every session re-explaining context that was already established. So I built OMEGA — a persistent memory system that gives AI coding agents long-term memory across sessions. It runs entirely on your machine, scores 95.4% on the LongMemEval academic benchmark (#1 on the leaderboard), and you can install it with &lt;code&gt;pip install omega-memory&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This post is a technical walkthrough of what I built, how it works, and where it falls short.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem, Concretely
&lt;/h2&gt;

&lt;p&gt;AI coding agents today are stateless by design. The conversation context is the only "memory" they have, and it evaporates when the session ends.&lt;/p&gt;

&lt;p&gt;This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Decisions vanish.&lt;/strong&gt; "We chose PostgreSQL over MongoDB because we need ACID transactions for payment processing" — gone. Next session, the agent might suggest MongoDB.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mistakes repeat.&lt;/strong&gt; You debug the same &lt;code&gt;ECONNRESET&lt;/code&gt; error three sessions in a row because the agent doesn't remember it was caused by connection pool exhaustion.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Preferences reset.&lt;/strong&gt; "Always use early returns, never nest conditionals more than 2 levels deep" — you have to say this every single time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The root cause is that MCP (Model Context Protocol) gives agents access to tools, but no persistent state between sessions. There's no standard way for an agent to store and recall what it learned.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture: SQLite All the Way Down
&lt;/h2&gt;

&lt;p&gt;I went through a few iterations. The first version used an in-memory graph (NetworkX). At around 3,700 nodes it consumed 372 MB of RAM. That was unacceptable for something that runs in the background.&lt;/p&gt;

&lt;p&gt;The current architecture is much simpler:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────┐
│    Claude Code / Cursor  │
│    (any MCP host)        │
└───────────┬─────────────┘
            │ stdio/MCP protocol
┌───────────▼─────────────┐
│   OMEGA MCP Server       │
│   27 memory tools        │
│                          │
│  ┌─────────────────────┐ │
│  │  Hook Daemon (UDS)  │ │    ← Unix Domain Socket for
│  │  auto-capture +     │ │      &amp;lt;750ms hook dispatch
│  │  auto-surface       │ │
│  └─────────────────────┘ │
└───────────┬─────────────┘
            │
┌───────────▼─────────────┐
│  omega.db (SQLite + WAL) │
│                          │
│  ┌──────┐ ┌───────────┐ │
│  │nodes │ │ vec_nodes  │ │    ← sqlite-vec: 384-dim
│  │      │ │ (vectors)  │ │      cosine similarity
│  └──────┘ └───────────┘ │
│  ┌──────┐ ┌───────────┐ │
│  │edges │ │ nodes_fts  │ │    ← FTS5: full-text
│  │      │ │ (keywords) │ │      keyword search
│  └──────┘ └───────────┘ │
└─────────────────────────┘
            │
┌───────────▼─────────────┐
│  bge-small-en-v1.5       │
│  ONNX Runtime (CPU)      │
│  384-dim embeddings      │
│  ~90 MB model on disk    │
└─────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Everything is a single SQLite database (&lt;code&gt;~/.omega/omega.db&lt;/code&gt;) running in WAL mode with &lt;code&gt;sqlite-vec&lt;/code&gt; for vector search and FTS5 for keyword matching. The embedding model is bge-small-en-v1.5 running via ONNX Runtime on CPU — no GPU required, no cloud API calls.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why SQLite?&lt;/strong&gt; Because the access pattern is perfect for it. One machine, one user, mostly reads with occasional writes, and the entire database fits in a few megabytes. At ~250 memories, the database is about 10 MB. SQLite's WAL mode handles concurrent reads from multiple MCP server processes, and I added retry-with-backoff for the rare write contention under heavy multi-process usage.&lt;/p&gt;
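&lt;p&gt;The retry-with-backoff wrapper is a standard pattern; here's a sketch of what it might look like (OMEGA's actual retry parameters aren't shown):&lt;/p&gt;

```python
import random
import sqlite3
import time

# Retry-with-backoff around SQLite writes, for the occasional
# "database is locked" error under multi-process WAL usage.
def execute_with_retry(db, sql, params=(), retries=5, base_delay=0.05):
    for attempt in range(retries):
        try:
            db.execute(sql, params)
            db.commit()
            return
        except sqlite3.OperationalError:
            # exponential backoff with jitter before the next attempt
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))
    raise sqlite3.OperationalError("write failed after %d retries" % retries)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE nodes (id INTEGER PRIMARY KEY, content TEXT)")
execute_with_retry(db, "INSERT INTO nodes (content) VALUES (?)", ("hello",))
```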

&lt;p&gt;&lt;strong&gt;Why not a vector database?&lt;/strong&gt; I considered Chroma and Qdrant. But adding a separate database process for a system that stores hundreds (not millions) of vectors felt like overengineering. &lt;code&gt;sqlite-vec&lt;/code&gt; gives me cosine similarity search in the same process, with zero external dependencies.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Search Pipeline
&lt;/h2&gt;

&lt;p&gt;Retrieval accuracy is everything for a memory system. If you can't find the right memory when it matters, the whole system is useless. I landed on a six-stage pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Query: "Docker volume mount issue"
           │
           ▼
┌─────────────────────────┐
│ 1. Vector Similarity    │  cosine distance on 384-dim
│    (sqlite-vec)         │  embeddings, top-K candidates
└───────────┬─────────────┘
            ▼
┌─────────────────────────┐
│ 2. Full-Text Search     │  FTS5 keyword matching for
│    (FTS5)               │  terms the embeddings miss
└───────────┬─────────────┘
            ▼
┌─────────────────────────┐
│ 3. Type-Weighted Score  │  decisions and lessons get 2x
│                         │  weight (they're higher value)
└───────────┬─────────────┘
            ▼
┌─────────────────────────┐
│ 4. Contextual Re-rank   │  boost by tag match, project
│                         │  scope, and content overlap
└───────────┬─────────────┘
            ▼
┌─────────────────────────┐
│ 5. Time-Decay           │  old unaccessed memories rank
│                         │  lower (floor 0.35, exemptions
│                         │  for prefs + error patterns)
└───────────┬─────────────┘
            ▼
┌─────────────────────────┐
│ 6. Dedup                │  remove near-duplicates from
│                         │  the result set
└───────────┬─────────────┘
            ▼
        Results
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The two-source approach (vectors + FTS5) is key. Embeddings handle semantic similarity ("container networking issue" matches "Docker bridge network problem"), while FTS5 catches exact terms that embeddings sometimes miss (specific error codes, package names, config keys). Combining both gives better recall than either alone.&lt;/p&gt;

&lt;p&gt;Type-weighted scoring was a deliberate design choice. When you ask "what should I know about the orders service?", a prior architectural decision ("we chose PostgreSQL for ACID compliance") is almost always more relevant than a session summary from three weeks ago. Weighting decisions and lessons 2x in the scoring reflects this.&lt;/p&gt;

&lt;h2&gt;
  
  
  Memory Lifecycle: Why Forgetting Matters
&lt;/h2&gt;

&lt;p&gt;Most memory systems just accumulate. Store everything, search through everything, forever. This works at 50 memories. At 500, you start getting noise in every query. At 5,000, the system is actively harmful — surfacing outdated context that leads the agent astray.&lt;/p&gt;

&lt;p&gt;OMEGA has an explicit forgetting system with five mechanisms:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dedup on write.&lt;/strong&gt; SHA-256 hash for exact duplicates, plus embedding similarity (threshold 0.85) for semantic duplicates. If you store "use PostgreSQL for the orders DB" twice with different wording, OMEGA catches it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Evolution.&lt;/strong&gt; When new content is 55-95% similar to an existing memory, instead of creating a new entry, OMEGA appends the new information to the existing one. The memory evolves rather than duplicates.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;TTL expiry.&lt;/strong&gt; Session summaries expire after 1 day — they're useful for immediate context but stale quickly. Lessons and preferences are permanent. Everything else gets a configurable TTL.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Compaction.&lt;/strong&gt; Periodically, OMEGA clusters related memories (Jaccard similarity) and summarizes them into consolidated nodes, marking the originals as superseded. This is like garbage collection for knowledge.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Conflict detection.&lt;/strong&gt; When a new decision contradicts an existing one, OMEGA detects it automatically. Decisions auto-resolve (newer wins), while lessons are flagged for review. The old memory gets a &lt;code&gt;contradicts&lt;/code&gt; edge to the new one.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
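&lt;p&gt;The dedup-on-write check can be sketched in a few lines. The 0.85 threshold comes from the list above; everything else (the in-memory store shape, the raw cosine loop) is a simplification for illustration:&lt;/p&gt;

```python
import hashlib
import math

SEMANTIC_DUP_THRESHOLD = 0.85  # threshold from the article; the rest is a sketch

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def is_duplicate(text, embedding, store):
    """store is a list of (sha256_hex, embedding) pairs for existing memories."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    for existing_hash, existing_embedding in store:
        if digest == existing_hash:
            return True  # exact duplicate, caught by the hash
        if cosine(embedding, existing_embedding) >= SEMANTIC_DUP_THRESHOLD:
            return True  # same idea in different words
    return False

store = [(hashlib.sha256(b"use PostgreSQL for the orders DB").hexdigest(), [1.0, 0.0])]
```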

&lt;p&gt;Every deletion is audited. You can run &lt;code&gt;omega_forgetting_log&lt;/code&gt; and see exactly why each memory was removed — TTL expired, consolidation pruned, compaction superseded, LRU evicted, user deleted, or flagged via feedback.&lt;/p&gt;
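&lt;p&gt;The compaction grouping step is conceptually simple. Here is a hypothetical greedy version built on token-set Jaccard similarity; the 0.5 threshold and the single-pass strategy are my assumptions, not OMEGA's actual clustering:&lt;/p&gt;

```python
def jaccard(a, b):
    """Token-set Jaccard similarity between two memory texts."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a.union(tokens_b)
    if not union:
        return 0.0
    return len(tokens_a.intersection(tokens_b)) / len(union)

def cluster(memories, threshold=0.5):
    """Greedy single-pass clustering: each memory joins the first cluster
    whose seed is similar enough, otherwise it starts a new cluster."""
    clusters = []
    for memory in memories:
        for group in clusters:
            if jaccard(memory, group[0]) >= threshold:
                group.append(memory)
                break
        else:
            clusters.append([memory])
    return clusters

groups = cluster([
    "use postgres for the orders db",
    "postgres backs the orders db",
    "the cache layer runs on redis",
])
# The two postgres memories group together; the redis memory stands alone.
```

&lt;p&gt;Each resulting group is then a candidate for summarization into a consolidated node, with the originals marked as superseded.&lt;/p&gt;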

&lt;h2&gt;Auto-Capture: The Part That Actually Matters&lt;/h2&gt;

&lt;p&gt;The 27 MCP tools are nice, but the real value is in the hook system. In Claude Code, OMEGA installs four hooks:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Hook Event&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SessionStart&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Surfaces relevant memories as a welcome briefing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PostToolUse&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;After file edits, surfaces memories related to the file being changed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;UserPromptSubmit&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Analyzes the conversation and auto-captures decisions/lessons&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Stop&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Generates a session summary&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The hooks dispatch through a Unix Domain Socket daemon running inside the MCP server process. This is important — the first version spawned a new Python process per hook invocation, which added ~750ms of cold-start overhead. The UDS daemon eliminates that by reusing the warm MCP server with its already-loaded ONNX model and database connection.&lt;/p&gt;
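&lt;p&gt;The handoff pattern is easy to demonstrate. This self-contained sketch fakes the warm daemon with a thread; the socket path and the JSON payload shape are assumptions, not OMEGA's actual protocol:&lt;/p&gt;

```python
import json
import os
import socket
import tempfile
import threading

SOCKET_PATH = os.path.join(tempfile.mkdtemp(), "omega.sock")
ready = threading.Event()

def daemon():
    """Stands in for the warm MCP server process: model and DB already loaded."""
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(SOCKET_PATH)
    server.listen(1)
    ready.set()
    conn, _ = server.accept()
    event = json.loads(conn.recv(4096))
    conn.sendall(json.dumps({"ok": True, "hook": event["hook"]}).encode())
    conn.close()
    server.close()

threading.Thread(target=daemon, daemon=True).start()
ready.wait()

# Hook-script side: no interpreter cold start, no model load, just a
# quick round trip to the already-warm daemon.
client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
client.connect(SOCKET_PATH)
client.sendall(json.dumps({"hook": "SessionStart", "cwd": "/repo"}).encode())
reply = json.loads(client.recv(4096))
client.close()
```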

&lt;p&gt;The &lt;code&gt;UserPromptSubmit&lt;/code&gt; hook is the most interesting. It classifies the conversation context and extracts decisions ("we chose X because Y"), lessons ("the fix for X is Y"), and error patterns — all without the user explicitly saying "remember this." You just code normally, and OMEGA captures what matters in the background.&lt;/p&gt;
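&lt;p&gt;To make the idea concrete, here is a deliberately naive extractor built on the two phrasings quoted above. OMEGA's real classifier is much richer than a pair of regexes; this only shows the shape of the capture step:&lt;/p&gt;

```python
import re

# Toy patterns matching the two phrasings quoted above. OMEGA's actual
# extractor is a classifier, not a pair of regexes.
PATTERNS = [
    ("decision", re.compile(r"we chose (.+?) because (.+)", re.IGNORECASE)),
    ("lesson", re.compile(r"the fix for (.+?) is (.+)", re.IGNORECASE)),
]

def extract_memories(transcript):
    captured = []
    for line in transcript.splitlines():
        for memory_type, pattern in PATTERNS:
            match = pattern.search(line)
            if match:
                captured.append({
                    "type": memory_type,
                    "subject": match.group(1),
                    "detail": match.group(2),
                })
    return captured

notes = extract_memories(
    "We chose SQLite because it needs zero external services.\n"
    "The fix for the FTS5 syntax error is quoting the query."
)
```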

&lt;h2&gt;Benchmarking Against LongMemEval&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/xiaowu0162/LongMemEval" rel="noopener noreferrer"&gt;LongMemEval&lt;/a&gt; is an academic benchmark from ICLR 2025 that tests long-term memory systems with 500 questions across five categories:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Information Extraction (IE):&lt;/strong&gt; Can you recall specific facts from past conversations?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Session Reasoning (MS):&lt;/strong&gt; Can you synthesize information across multiple sessions?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Temporal Reasoning (TR):&lt;/strong&gt; Can you reason about when things happened and in what order?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Knowledge Update (KU):&lt;/strong&gt; When information changes, do you return the current state?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Preference Tracking (SP):&lt;/strong&gt; Do you remember and apply user preferences?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's how OMEGA stacks up:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;System&lt;/th&gt;
&lt;th&gt;Overall&lt;/th&gt;
&lt;th&gt;IE&lt;/th&gt;
&lt;th&gt;MS&lt;/th&gt;
&lt;th&gt;TR&lt;/th&gt;
&lt;th&gt;KU&lt;/th&gt;
&lt;th&gt;SP&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OMEGA&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;95.4%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;83.5%&lt;/td&gt;
&lt;td&gt;94.0%&lt;/td&gt;
&lt;td&gt;96.2%&lt;/td&gt;
&lt;td&gt;98.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mastra&lt;/td&gt;
&lt;td&gt;94.87%&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Emergence&lt;/td&gt;
&lt;td&gt;86.0%&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Zep/Graphiti&lt;/td&gt;
&lt;td&gt;71.2%&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A few notes on methodology and honesty:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What the benchmark does well:&lt;/strong&gt; It tests real conversational memory patterns — things like "what restaurant did I mention in our 3rd conversation?" or "I changed my coffee preference from latte to americano, what's my current preference?" These are practical tests of what a memory system needs to handle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it doesn't test:&lt;/strong&gt; It doesn't test auto-capture quality, retrieval latency under load, or how well the system handles adversarial/contradictory inputs over time. It also doesn't test multi-agent coordination, which is a significant part of OMEGA's feature set.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where OMEGA struggles:&lt;/strong&gt; Multi-session reasoning (83.5%) is the weakest category. These questions require counting or aggregating across many sessions ("how many times did I mention going to the gym?"), which is fundamentally harder for a retrieval-based system. My best result came from a simple "list all matches, then count" approach — more aggressive dedup strategies actually caused regressions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scoring methodology:&lt;/strong&gt; The 95.4% is task-averaged — the mean of per-category accuracies. This is the same methodology used by other systems on the leaderboard (including Mastra at 94.87%). The raw score is 466/500 (93.2%).&lt;/p&gt;
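&lt;p&gt;The gap between the two numbers is the usual macro-versus-micro averaging effect: when categories have different question counts, the mean of per-category accuracies and the raw accuracy diverge. A toy example with made-up counts:&lt;/p&gt;

```python
# Hypothetical counts, chosen only to show the averaging effect.
# Each entry is (correct, total) for one category.
results = {"small_category": (9, 10), "large_category": (80, 100)}

# Task-averaged (macro): mean of per-category accuracies.
macro = sum(c / n for c, n in results.values()) / len(results)

# Raw (micro): total correct over total questions.
micro = sum(c for c, _ in results.values()) / sum(n for _, n in results.values())

print(macro)  # 0.85 -- the small category counts as much as the large one
print(micro)  # 89/110, about 0.809
```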

&lt;p&gt;&lt;strong&gt;Cost of benchmarking:&lt;/strong&gt; Each full run costs real money in LLM API calls (the benchmark uses GPT-4 for evaluation). I ran about 8 iterations to get from 85% to 95.4%, each time targeting specific failure modes. The improvements were incremental: better temporal prompting (+5 questions), knowledge-update current-state prompting (+4), query augmentation (+2), preference personalization (+2).&lt;/p&gt;

&lt;h2&gt;The Competition&lt;/h2&gt;

&lt;p&gt;I want to be fair here, because these are all legitimate projects solving the same problem from different angles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Mem0&lt;/strong&gt; (47K GitHub stars): Cloud-first approach. More polished product, larger team, established user base. Requires an API key for the cloud version. Their local mode exists but is more limited. They haven't published LongMemEval scores.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Zep/Graphiti&lt;/strong&gt; (22.8K stars): Neo4j-backed knowledge graph approach. Sophisticated architecture but requires running Neo4j. Published 71.2% on LongMemEval in their paper, which I respect — most systems don't publish benchmark numbers at all.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Letta&lt;/strong&gt; (21.1K stars): Agent framework with memory as a component. Different scope — they're building a full agent platform, not just memory.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Claude's built-in memory&lt;/strong&gt; (CLAUDE.md files): Works without any setup, but it's a flat markdown file with no semantic search, no auto-capture, and no cross-session learning.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;OMEGA's differentiator is being local-first with zero external services while still scoring competitively on benchmarks. Whether that tradeoff matters to you depends on your threat model and workflow.&lt;/p&gt;

&lt;h2&gt;Honest Tradeoffs and Limitations&lt;/h2&gt;

&lt;p&gt;I'd rather you know the sharp edges before you try it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Auto-capture hooks only work with Claude Code.&lt;/strong&gt; Other MCP clients (Cursor, Windsurf, Zed) get the 27 tools but not the automatic memory capture. You have to explicitly tell the agent to remember things.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Memory footprint.&lt;/strong&gt; ~31 MB at startup, ~337 MB after the ONNX model loads on first query. The model unloads after 10 minutes of inactivity, but if you're memory-constrained, this matters.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;English only.&lt;/strong&gt; The bge-small-en-v1.5 embedding model is trained on English text. It will work poorly for other languages.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Solo maintainer.&lt;/strong&gt; This is a passion project, not a VC-backed company. I maintain it because I use it every day, but I can't promise the same velocity as a funded team.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Not stress-tested at scale.&lt;/strong&gt; I've been running it at ~600 memories with no issues. I haven't tested at 10K+ memories. SQLite can handle it, but the search pipeline might need tuning at that scale.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Python 3.11+ only.&lt;/strong&gt; No support for older Python versions. macOS and Linux are supported; Windows works through WSL.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Try It&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;omega-memory
omega setup          &lt;span class="c"&gt;# auto-detects Claude Code&lt;/span&gt;
omega doctor         &lt;span class="c"&gt;# verify everything works&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For Cursor or Windsurf:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;omega setup &lt;span class="nt"&gt;--client&lt;/span&gt; cursor
omega setup &lt;span class="nt"&gt;--client&lt;/span&gt; windsurf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three commands, no API keys, no cloud accounts, no Docker containers. Everything runs locally.&lt;/p&gt;

&lt;p&gt;The source is at &lt;a href="https://github.com/omega-memory/core" rel="noopener noreferrer"&gt;github.com/omega-memory/core&lt;/a&gt; under Apache-2.0. Stars are appreciated — the project has about 5 right now. Contributions, bug reports, and questions are welcome.&lt;/p&gt;

&lt;p&gt;If you want to see the benchmark methodology in detail or how OMEGA compares to alternatives with sources, check &lt;a href="https://omegamax.co" rel="noopener noreferrer"&gt;omegamax.co&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I built OMEGA because I use it every day. The best test of a developer tool is whether the developer actually uses it — and I haven't opened a Claude Code session without OMEGA in months. If you're spending time re-explaining context to your AI coding agent, give it a try.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>python</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
