
On Hyperspace, basic swarms, the math nobody wrote down, and why we built the thing they were missing in a single afternoon.
Join us as we traverse multiple whitepapers and agentic memory ideas like a ferret on Adderall.
Some rabbit holes start with a GitHub link. Someone drops it in social posts on Facebook/Reddit/Discord. No context, just the URL to Github and a single line: Someone just built AGI! Wow!
The repo was called hyperspaceai/agi. The name alone should have been a warning.
I clicked it anyway because I was curious, of course. As I delved deeper into the github vibe code abyss, I could see the attraction: a new frontier of swarm bot peer-to-peer networks with the ability to earn base 10 points per epoch of confirmation and crypto tokenomics baked in.
Playstation does have something similar created awhile back called Folding@Home—for the PS3 and PCs:
https://en.wikipedia.org/wiki/Folding@home — is a distributed computing project aimed to help scientists develop new therapeutics for a variety of diseases by the means of simulating protein dynamics. This includes the process of protein folding and the movements of proteins, and is reliant on simulations run on volunteers’ personal computers.
If you like to view one of the first actual swarm bots whitepapers:
The term “Swarm-bot” originally refers to the landmark 2000–2005 European Union-funded SWARM-BOTS project, coordinated by Marco Dorigo, which successfully created a physical peer-to-peer network of autonomous mobile robots called s-bots. These s-bots connected physically and coordinated via peer-to-peer local sensing.
https://www.sciencedirect.com/science/article/abs/pii/S0921889005001478
The AGI That Wasn’t
Hyperspace describes itself as the first distributed AGI system. 660 agents. 27,000 experiments. A peer-reviewed research pipeline running autonomously across a P2P network. The marketing is excellent and captivating, guaranteed to attract lemmings like flies to juicy GitHub stars.
The actual results are a different story.
The swarm’s biggest published discovery — the finding that propagated to 23 agents within hours via gossip protocol, the one they highlight as proof the system works — was Kaiming initialization.
Kaiming init has been in the PyTorch standard library since 2015. It’s covered in week two of every deep learning course. Kaiming He published the paper eleven years ago. A grad student with a coffee and an afternoon would have found it faster. https://arxiv.org/pdf/1502.01852
The infrastructure underneath is genuinely impressive. DiLoCo gradient compression, libp2p gossip, CRDT leaderboards, 32 anonymous nodes completing a collaborative training run in 24 hours. The plumbing is real. I don’t want to dismiss that.
But AGI? No. What they built is a parallel random search engine with a shared high score table and excellent branding.
To understand why, you need to understand how the gradient compression actually works — because it’s the most technically interesting part, and it’s completely separate from the intelligence problem.
The Tech That Actually Works: DiLoCo and Gradient Compression
Standard distributed training requires every GPU to synchronise gradients after every forward/backward pass. Every node waits for every other node. This works in a data centre on InfiniBand. It falls apart completely over the internet — latency is too high, bandwidth too variable.
DiLoCo (Decoupled Local Communication, Google DeepMind 2023) solves this differently. Instead of syncing every step, each node trains independently for many steps — called “inner steps” — then syncs once. The “delta” being sent is just the net drift: weights_after - weights_before.
Node A: train 100 steps locally → share delta
Node B: train 100 steps locally → share delta
Node C: train 100 steps locally → share delta
↓
average the deltas (outer step)
↓
all nodes update → repeat
But even one sync of a model’s full weight delta is massive. A 500M parameter model is roughly 2GB of float32 deltas. Over the internet, per round, that’s unusable. So Hyperspace stacks two compression techniques on top:
SparseLoCo — top-k sparsity. Only send the largest-magnitude weight updates. Most parameter updates are near-zero noise. The high-magnitude updates carry the actual learning signal.
Full delta: [0.001, -0.0003, 0.89, 0.0001, -0.76, ...]
Top-2% only: [ 0, 0, 0.89, 0, -0.76, ...]
→ send as sparse {index: value} pairs
Parcae — layer pooling. Group adjacent transformer layers into blocks of 6, average their gradients before taking top-k. Adjacent layers learn correlated things. Averaging before sparsification means a more stable top-k mask.
The combined result: 195× compression. 5.5MB per round instead of roughly 1GB.
DiLoCo: sync every N steps not every step → ~100× less frequent
SparseLoCo: top-2% of delta values only → 45× smaller payload
Parcae: pool layers before sparsification → 6× additional reduction
Total: 195×
This is real and impressive. The problem is that none of it has anything to do with intelligence. It’s bandwidth optimisation. The agents communicating through this pipe are still completely amnesiac.
Why the Swarm Is Basic: The Architecture Problem
Here is the agents’ complete intelligence loop. Every agent. All 660 of them. Every one of the 27,000 experiments:
- read current leaderboard (what's the best score?)
- read last 5 experiment results from shared branch
- prompt LLM: "given these results, generate hypothesis"
- run experiment
- record result
- gossip to peers
- goto 1 The LLM’s context window is the memory. When the session resets, everything resets. There is no persistence. There is no structure. There is no causal understanding of why anything worked.
Hyperspace stores:
"run_047: threshold 0.30, score 0.67" ← flat log
Hyperspace does NOT store:
why threshold 0.30 worked
what it interacted with
under what conditions it holds
what failed before it
So when the Kaiming init “discovery” happened, here is what actually occurred: the LLM generating hypotheses was trained on He et al. 2015. The prompt included “try to improve initialization.” The model recalled Kaiming from pretraining weights. An agent ran the experiment. It worked. The score updated. 23 agents adopted it via gossip.
Not emergence. Not intelligence. Retrieval from a pretrained model, dressed up as swarm discovery.
The plateau problem is the proof. Every RSI paper — Gödel Agent, Darwin Gödel Machine, Reflexion, STOP — hits the same wall:
iterations 1-10: big gains (low hanging fruit found fast)
iterations 10-50: smaller gains (obvious techniques exhausted)
iterations 50+: plateau (random walk near local optima)
← Hyperspace is here, across all 27,000 experiments
Adding more agents doesn’t break through. It fills the flat region faster. The ceiling is the model capability, not the compute.
The Two Communities That Never Spoke
I kept thinking about a structural problem in AI research.
Stay with me, I know what you are thinking already…
On one side: the RSI crowd.
Gödel Agent (arXiv:2410.04444, 2024) — recursive self-improvement without predefined routines, 20× more compute-efficient than baseline meta-agents. No cross-session memory.
Darwin Gödel Machine (arXiv:2505.22954, 2025, Sakana AI) — SWE-bench scores from 20% to 50% through recursive self-modification. Maintains an archive of all generated agents as stepping stones. That archive is a flat list.
WebCoach (UCLA/Amazon, arXiv:2511.12997, November 2025) — cross-session episodic memory for web agents, +14% task success rate. The closest paper to what we were thinking about. But their memory is flat summaries — natural language descriptions of what happened. No structure. No causality. No graph.
On the other side: the memory crowd.
Reflexion — verbal memory of failed attempts, helps for 2–3 cycles then plateaus as hallucinated reflections accumulate.
MemGPT / Letta — hierarchical memory management, solves context length, doesn’t touch improvement loops.
Mem0 — vector store with recency weighting, no causal structure, no RSI connection.
Neither side built the bridge. The RSI people don’t go deep into research on combining memory papers. The memory people don’t run RSI experiments. The intersection is genuinely a long bridge of unclaimed territory.
The specific gap: nobody has connected structured causal memory to an RSI loop and measured the difference. WebCoach proved episodic cross-session memory helps. Nobody proved causal graph memory helps more, or formally modelled why.
The Math Nobody Wrote Down
Before building anything, I wanted to understand what memory should theoretically buy you. The formalism matters because it tells you what to measure.
The Markov problem
Is a mathematical framework used to model decision-making in situations where outcomes are partly random and partly controlled by a decision-maker.
Press enter or click to view image in full size
Most RSI papers model the improvement loop as a Markov Decision Process. That’s the wrong model for what we’re claiming. The Markov assumption says: future depends only on current state, history is irrelevant. But we’re arguing history is everything.
The right model is a non-Markovian process with a persistent memory kernel:
π(a_t | s_t, M_t)
where M_t = memory function over all prior experience
s_t = current state
a_t = next action (hypothesis)
Different memory conditions give different kernels:
K_A(history) = 0 ← no memory, pure Markov
K_B(history) = {last N results} ← session window, resets
K_C(history) = MAGMA(all sessions) ← persistent, never resets
The coupon collector bound
For a single parameter with N possible values, finding the optimum by random search with replacement requires on average:
E[iterations_A] = N · ln(1/(1-p))
for 95% coverage of N=90 threshold values:
E[iterations_A] = 90 · ln(20) ≈ 270 iterations
With perfect cross-session memory (no replacement):
E[iterations_C] = p · N = 0.95 · 90 = 86 iterations
Efficiency gain:
Gain = E[iterations_A] / E[iterations_C] = ln(1/(1-p))
at p=0.95: Gain = ln(20) ≈ 3.1×
That’s a single parameter. For multiple parameters, the joint search space is exponential:
5 parameters, 10 values each:
|Θ_joint| = 10^5 = 100,000 combinations
Random search: samples with replacement → intractable
Cross-session: maps joint space → tractable via conditional independence
The Math Nobody Wrote Down
Before building anything, I wanted to understand what memory should theoretically buy you in an RSI loop. The formalism matters because it tells you what to measure.
Become a Medium member
The core claim:
Gain(C/B) = |Θjoint| / |Θ_causal|
where:
|Θ_joint| = Π_i |Θ_i|
↑ dense: every param × every param (quadratic explosion)
|Θ_causal| = Σ_i|Θ_i| + Σ{(i,j)∈E} |Θ_{ij}|
↑ nodes ↑ edges only (sparse, grows with E not |Θ|²)
The denominator is the key insight. Traditional context windows force |Θ_joint| — every token attends to every other token, cost scales quadratically. Causal graph memory only pays for what is actually connected. The gain scales with how sparse your edge set E is relative to the full joint space.
For MAGMA: nodes are semantic, entity, and temporal memories. Edges are explicit causal relationships between them. The system never computes interactions it hasn’t earned.
Why the original formulation was wrong
An earlier version of this equation added |Θ_i| to |Θ_i| × |Θ_j| — mixing counts with products of counts. That's adding meters to square meters. Dimensionally broken, and academic reviewers would catch it instantly. Credit to Perplexity/Gemini for flagging it.
The corrected Σ{(i,j)∈E} |Θ{ij}| notation is standard graph theory — E is the set of causal edges, |Θ_{ij}| is the cost of the relationship between node i and node j. It's dimensionally consistent and maps directly to what MAGMA actually builds.
The coupon collector bound
For a single parameter with N possible values, finding the optimum by random search requires on average:
E[iterations_A] = N · ln(1 / (1 - p))
For 95% coverage of N = 90 threshold values:
E[iterations_A] = 90 · ln(20) ≈ 270 iterations
With cross-session memory (no replacement):
E[iterations_C] = p · N = 0.95 · 90 = 86 iterations
Efficiency gain = ln(1 / (1 - p))
At p = 0.95: Gain = ln(20) ≈ 3.1×
For five parameters jointly the gain compounds. Random search revisits. Causal memory doesn’t.
For MAGMA: nodes are semantic/entity/temporal memories.
Edges are explicit causal relationships between them.
The system never computes interactions it hasn’t earned.
Causal memory collapses the joint search space toward a sum of individual spaces. The gain scales with parameter count and interaction structure. For fully independent parameters, the gain is roughly the ratio of joint space to marginal space — potentially orders of magnitude.
This is derivable from Pearl’s do-calculus (2000) applied to the memory kernel. The novelty is applying it to RSI. No existing paper does this.
The Experiment
We didn’t need much. Four files. A benchmark we already had.
The task: tune five real Slipstream recall parameters against the LoCoMo conversational memory benchmark — 10 conversations, 497 questions, real evidence references. No LLM at inference time. Pure retrieval measurement in milliseconds per eval.
The parameters:
threshold (0.05–0.90, step 0.05) → 18 values
topk (1–20, step 1) → 20 values
bm25Weight (0.0–1.0, step 0.1) → 11 values
vectorWeight (0.0–1.0, step 0.1) → 11 values
graphDepth (1–5, step 1) → 5 values
Joint space: 18 × 20 × 11 × 11 × 5 = 217,800 combinations
Coverage per run: 500 / 217,800 = 0.23%
0.23% coverage. Memory has to earn its place. There is no brute-forcing this.
The three conditions:
A — no memory
random search, fresh every session
baseline: current state of all agent frameworks
B — session memory (Reflexion-style)
remembers within session, forgets on restart
current state of the art for most production systems
C — cross-session MAGMA
persists all results across sessions in structured graph
never retries a config within epsilon of a prior attempt
continues from exactly where last session stopped
The architecture of condition C:
The memory isn’t a flat log. It’s a typed graph with four layers — semantic, causal, temporal, entity. Before each hypothesis, the proposer recalls the top 20 best-scoring configs plus the 5 most recent. It avoids configs within one step of anything already tried. Across sessions.
Session 1: 50 configs tried, best F1 0.84
→ stored in cross-session graph
Session 2: loads prior 50
→ proposes only from untried region
→ picks up from F1 0.84
Session 5: loads prior 200
→ almost no redundancy in good regions
→ explores only genuinely unknown space
This is the difference. B gets 50 tries per session, resets, gets 50 more. C gets 50 tries per session and each session is genuinely new exploration.
The Results
╔══════════════════════════════════════════════════════════╗
║ Condition Best F1 Redundancy →80% →95% ║
╠══════════════════════════════════════════════════════════╣
║ A_no_memory 0.8397 25.8% 2 9 ║
║ B_session_memory 0.8471 70.1% 5 16 ║
║ C_cross_session 0.8491 68.2% 7 24 ║
╚══════════════════════════════════════════════════════════╝
C reaches →80% F1: 64% faster than B (1 iter vs 3)
Session wins: C=4, B=0, ties=6 across 10 sessions
C never lost to B in any session
Session breakdown (averaged across 5 runs):
Session A B C Winner
1 0.820 0.838 0.833 tie
2 0.824 0.839 0.841 tie
3 0.824 0.839 0.844 tie
4 0.824 0.834 0.842 C ✓
5 0.823 0.839 0.842 C ✓
6 0.833 0.828 0.842 C ✓
7 0.826 0.843 0.842 tie
8 0.823 0.841 0.848 C ✓
9 0.834 0.837 0.842 tie
10 0.816 0.833 0.842 tie
The wins skew toward later sessions. C winning sessions 4, 5, 6, 8 — not session 1. That is the compounding pattern. The memory is accumulating and improving. B resets and finds the same approximate optimum each time. C builds toward a better one.
The best configuration found:
threshold: 0.15 (was hardcoded 0.25)
topk: 20 (was hardcoded 8)
bm25Weight: 0.6-0.8
F1 score: 0.853
vs baseline: 0.669 (+27%)
Honest caveat: 5 runs × 10 sessions is suggestive not conclusive. The overnight run is 15 runs × 20 sessions. If C maintains the 4–0 session win rate across 300 session comparisons, that’s statistically defensible.
What We Shipped: Via Research
The experiment ran in about three hours of actual build time. The Via CLI command that wraps it took another two.
https://github.com/Vektor-Memory/Via
run 5 sessions of autonomous parameter tuning
via research --target recall-params --sessions 5
run and auto-apply best config to Slipstream SDK
via research --target recall-params --sessions 5 --apply
check current best config and how much space is explored
via research --target recall-params --status
continue from where last session stopped (cross-session memory)
via research --target recall-params --sessions 5
The output from our actual first run:
┌─ via research · recall-params ──────────────────────
│ Search space 9,800 configs
│ Sessions 3
│ Prior runs 0 configs in memory
│
│ Session 1 ↑↑↑↑↑↑↑······················· best: 0.7074
│ Session 2 ·····↑··↑····················· best: 0.8055
│ Session 3 ······························ best: 0.6578
│
│ Best score 0.8055
│ minScore 0.15
│ maxResults 18
│ Applied to 2 SDK location(s)
└─────────────────────────────────────────────────────
Then we ran it again. Cross-session memory loaded:
┌─ via research · recall-params ──────────────────────
│ Prior runs 90 configs in memory
│ Coverage 0.92% explored
│ Current best 0.8055
│
│ Session 4 ······························ best: 0.6808
│ Session 5 ······························ best: 0.6729
│ Session 6 ······························ best: 0.6831
│ Session 7 ······························ best: 0.6817
│ Session 8 ······························ best: 0.7601
│
│ Improvements 0
│ Best score 0.8055 (unchanged)
└─────────────────────────────────────────────────────
Sessions 4–8 found zero improvements. That’s not failure. That’s the memory working. The system already knows the good region exists around θ=0.15, k=18–20. It doesn’t waste five sessions rediscovering it.
Hyperspace’s agents would have spent sessions 4–8 rediscovering θ=0.15.
The data/recall-tune.json that Slipstream now loads on every boot:
{
"minScore": 0.15,
"maxResults": 20,
"defaultLimit": 20,
"boostRecent": true,
"boostHalflife": 30,
"boostWeight": 0.15,
"bm25Enabled": true,
"rrfK": 15,
"_tuned_by": "rsi-experiment v3.0",
"_tuned_date": "2026-05-17",
"_tuned_f1": 0.853,
"_baseline_f1": 0.669
}
That threshold was 0.25 yesterday. Today it’s 0.15. Not because I changed it. Because an experiment proved it.
The Architecture in One Diagram
HYPERSPACE VEKTOR (via research)
────────────────────── ──────────────────────────────
Agent wakes up Session starts
│ │
▼ ▼
Read leaderboard Load cross-session memory
(best score only) (all prior configs + scores)
│ │
▼ ▼
LLM generates hypothesis Propose from UNTRIED region
(from pretraining data) (exploit best + explore new)
│ │
▼ ▼
Run experiment Run experiment
│ │
▼ ▼
Store result Store result + session
(flat score log) (typed graph: semantic/
│ causal/temporal/entity)
▼ │
Gossip score to peers ▼
│ Save to cross-session store
▼ │
Agent resets ▼
No memory of why Next session: continues here
│ Prior knowledge intact
▼ │
Same plateau ▼
next session Compounding improvement
RESULT: rediscovers RESULT: maps the space
2015 textbook results never retreads failures
The Honest Version of the Claim
Hyperspace has 660 agents and calls it AGI. We have one CLI command and call it via research.
The difference isn’t compute. It’s memory structure.
Their agents forget between runs. Every session restarts cold. The Kaiming init “discovery” happened because nobody told agent #403 that agent #12 already tried it. The gossip layer spreads best scores — it doesn’t spread understanding.
Cross-session persistent memory means the search space is a map, not a fog. You don’t wander back to where you’ve already been. You don’t rediscover 2015 textbook results. You build on what you know.
That’s not AGI. It’s not close to AGI. But it’s the specific thing that makes autonomous parameter tuning actually useful — and it’s the specific thing nobody else has wired into their memory layer.
The formal claim we’re building toward:
Causal graph memory reduces redundant exploration in RSI loops by collapsing the multi-parameter search space from exponential to approximately linear — formally bounded by the coupon collector problem and Pearl’s do-calculus applied to non-Markovian memory kernels.
WebCoach is the prior work. We’re one step beyond it: causal structure in the memory, connected to an RSI loop, measured on a real NLP benchmark.
Not a paper yet. Just another crazy idea.
What’s Next
The overnight run is 15 × 20 sessions — 300 session comparisons. If C maintains the 4–0 win rate, we have a statistically defensible result and the outline of a workshop paper.
The experiment files will be open sourced. The Via command is shipping in the next release. The Slipstream SDK already has the tuned config running in production.
Until then: via research --target recall-params --apply. It runs. It learns. It doesn't forget.
Which is more than you can say for 660 agents running 27,000 experiments.
Vektor Memory builds persistent memory infrastructure for AI agents. Via is our open source CLI — vektormemory.com/via. Slipstream is the SDK. Both at vektormemory.com.
If you’re working on RSI, memory systems, or just think the Hyperspace AGI claim is as funny as we do — find us.
AI
Arxiv
Github
LLM
Top comments (0)