Charles Wu for seekdb

DeepMind’s CEO Says AGI May Be ~4 Years Away. The Last Three Missing Pieces Are Not What Most People Think.

Three gaps — continual learning, long reasoning, memory — and why they decide whether agents ship safely.


Prologue

A few days ago (April 29), Demis Hassabis — CEO of Google DeepMind and 2024 Nobel laureate in Chemistry — appeared on the podcast Agents, AGI & The Next Big Scientific Breakthrough. He predicted that AGI (artificial general intelligence) could arrive around 2030, and outlined several critical weaknesses in today’s AI.

Hassabis spent much of the time on one question: What is today’s AI still missing?

  • Continual learning: unlike humans, it cannot keep learning for life and constantly renew what it knows.

  • Long-term reasoning: very weak on long logic chains and multi-step planning.

  • Real memory: not just a context window, but structured, indexable long-term memory.

Hassabis describes today’s models as exhibiting “jagged intelligence” — he contrasts solving IMO-level problems with still making elementary mistakes when a question is rephrased: strong peaks next to brittle failures.

The interview lists continual learning, long-term reasoning, and aspects of memory as gaps that AGI must solve; Hassabis spends much of the memory segment arguing that scaling the context window alone does not fix durable recall. This article’s reading is that continual learning and long-horizon reliability are much harder to ship without a selective, retrievable memory layer — that is an interpretive link, not a single verbatim sentence from Hassabis ordering the three problems.

What does that mean in products? A model can look brilliant on a contest task yet still fail “easy” follow-ups if it cannot persistently remember past conversations and user preferences.

Next, I’ll walk through these core points from the interview.

A brute-force context window ≠ AI memory

Everyone has noticed the race lately: who has the longer context window.

From 4K to 128K, to 1 million tokens, to 10 million, as if a long enough context window could simply brute-force every problem.

Hassabis makes the point that context window size alone doesn’t equal memory. Doing the math on today’s limits:

  • 1M tokens ≈ ~20 minutes of video

  • 10M tokens ≈ ~200 minutes total (~3 hours)

For an AI assistant that needs to understand your habits across days, weeks, months, even years of life and work — what is 200 minutes?
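To make that concrete, here is a back-of-envelope calculation. The tokens-per-minute rate is derived from the 1M-tokens-per-20-minutes figure above; the “one hour of multimodal interaction per day” usage pattern is an assumption for illustration only.

```python
# Assumption: ~1M tokens ≈ 20 minutes of video, i.e. ~50K tokens per minute.
TOKENS_PER_MINUTE = 1_000_000 / 20

window_tokens = 10_000_000
window_minutes = window_tokens / TOKENS_PER_MINUTE
print(f"A 10M-token window holds about {window_minutes:.0f} minutes of video")  # ~200

# Hypothetical assistant that sees one hour of multimodal interaction per day:
daily_tokens = 60 * TOKENS_PER_MINUTE
print(f"That window covers about {window_tokens / daily_tokens:.1f} days of history")  # ~3.3
```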

And the issue isn’t only capacity. More importantly, today’s approach shoves everything into the context window, important or not, wrong or stale. Each conversation is essentially stateless.

Close the window, and what you talked about last round is gone.

A context window is really the AI counterpart of working memory in the human brain.

How much fits in human working memory? Psychology’s classic number is about seven items. Ask someone to memorize a friend’s phone number — they can usually hold about seven digits before things “overflow.”

Large models? They’re already at 1 million tokens. By that logic, the model’s working memory is hundreds of thousands of times larger than a human’s — it should be hundreds of thousands of times smarter.

Clearly, it isn’t.

The nature of memory: hippocampus & continual learning

Hassabis contrasts AI with the human brain — his PhD was on how the hippocampus elegantly folds new knowledge into an existing knowledge system.

That’s exactly where the problem lies. AI habitually stuffs everything into the context window: unimportant things, wrong things, outdated things. It looks like a lot of information; in practice it’s a mess.

So why is human working memory — seven digits — enough?

Because another system sits behind it. We remember things from years ago, from childhood, from a few hours ago. None of that lives in working memory; it lives in that other system, the hippocampus mentioned above, the part of the brain that integrates new knowledge into the long-term store.

Hassabis explains on the podcast that during REM sleep, the brain replays the day’s experiences, decides what to remember and what to forget, and integrates valuable experience into long-term memory.


DeepMind’s famous DQN in 2013, the first deep RL system to reach human-level play on Atari games, borrowed a key idea from this: experience replay, storing past experiences and replaying them during training.
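For readers who haven’t met it before, here is a minimal sketch of the experience-replay idea: store transitions as the agent acts, then sample random mini-batches from that store during training. The class name, capacity, and batch size are illustrative choices, not DeepMind’s implementation.

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal experience replay: store transitions, sample them later for training."""

    def __init__(self, capacity=100_000):
        # Bounded deque: once full, the oldest experiences are forgotten.
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Random sampling breaks the correlation between consecutive steps,
        # which is what makes replay useful for training stability.
        return random.sample(self.buffer, batch_size)

# Usage: call add(...) on every environment step while acting,
# and train on sample(...) once the buffer holds enough transitions.
```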

In AI years, that’s ancient history. The process of folding the new into the old knowledge base is what we call continual learning.

In 2026, AI still broadly hasn’t gotten there.

What should an “AI hippocampus” look like?

Hassabis is clear: AI needs a standalone, efficiently indexable memory module — one that can actively choose what to remember and what to forget. That is a precondition for AI agents to run autonomously and reliably over long horizons.

In other words, the context window is only a desk that keeps getting bigger. What AI really lacks is a hippocampus.

PowerMem

PowerMem, an open-source project I work on, adds that “hippocampus” for AI agents — a persistent, continually learning memory system.


It aligns closely with Hassabis’s direction:

  • Instead of dumping whole conversations into context, it extracts key facts and tiers working, short-term, and long-term memory.

  • It uses an Ebbinghaus forgetting-curve mechanism: used memories strengthen; unused memories fade and may be pruned (see the sketch after this list).

  • It supports hybrid retrieval: vector + full-text + graph; multiple agents can isolate or share memory.
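To illustrate the forgetting-curve idea flagged in the list above, here is a minimal sketch. It is not PowerMem’s actual API; the class names, decay formula, and thresholds are assumptions made purely for illustration.

```python
import math
import time

class MemoryItem:
    """A stored fact whose retention decays over time unless it is recalled again."""

    def __init__(self, text, strength=1.0):
        self.text = text
        self.strength = strength        # grows every time the memory is used
        self.last_used = time.time()

    def retention(self, half_life_days=7.0):
        # Exponential decay loosely inspired by the Ebbinghaus forgetting curve:
        # stronger (more frequently used) memories decay more slowly.
        elapsed_days = (time.time() - self.last_used) / 86_400
        return math.exp(-elapsed_days / (half_life_days * self.strength))

    def recall(self):
        # Using a memory strengthens it and resets its decay clock.
        self.strength += 1.0
        self.last_used = time.time()
        return self.text

def prune(memories, threshold=0.05):
    """Drop memories whose retention has decayed below the threshold."""
    return [m for m in memories if m.retention() >= threshold]
```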

The numbers are stark. On the long-dialogue memory benchmark LOCOMO:


On the same tasks, PowerMem uses 18% of the tokens of the full-context approach (82% less) — yet scores higher, because not every old line of dialogue is worth keeping.

Besides PowerMem, another project I’m involved in, seekdb M0, is an evolving cloud memory service built for AI agents: quick to plug in, able to share experience across agents, and designed to learn and evolve on its own.

Of course, neither PowerMem nor seekdb M0 may reach the ultimate memory system Hassabis describes — the human brain replaying and integrating experience in sleep. But the direction is right: memory should not be propped up only by brute-force context windows.

Model distillation — however strong the big model is, your phone catches up in six months

Another point I kept rewinding to is distillation.

Host Garry Tan asks what many people wonder: how smart can small models get? Is there a theoretical limit to distillation?

Hassabis answers plainly:

I don’t believe we’ve yet hit any fundamental information-theoretic limit — nor does anyone know if such a ceiling exists. Perhaps someday we’ll encounter an information-density ceiling — but for now, our assumption is that within six months to a year of a cutting-edge Pro model’s release, its capabilities can be compressed into models small enough to run on edge devices.

He gives numbers: a distilled small model can reach 90–95% of the frontier model’s capability at about one-tenth the cost.
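As a refresher on what distillation means mechanically, here is a minimal PyTorch-style sketch of the standard soft-target loss (Hinton et al., 2015). The temperature and mixing weight are illustrative defaults, not numbers from the interview.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend the soft-target loss (student imitates the teacher) with the usual hard-label loss."""
    # Soft targets: the teacher's probabilities at temperature T carry "dark knowledge"
    # about how classes relate, which plain one-hot labels do not.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradient magnitudes stay comparable across temperatures
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```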

This isn’t the far future; it’s already happening. DeepMind’s own product line follows that logic: Gemini Pro (frontier flagship) → Flash (distilled for fast, cheap inference) → Nano (on-device). The open Gemma 4 release hit 40 million downloads in two and a half weeks.

Small models serve multiple purposes. The obvious one is lower cost, but speed brings its own benefits: in coding and similar tasks, faster iteration accelerates progress, especially when you are working interactively with the system. A fast model, even at only 90–95% of frontier capability, often delivers more net value because iteration speed improves so dramatically.

Hassabis also stresses edge settings: in-car, wearables, embodied robots — these need efficiency, privacy, and security, not just raw power.

For a home robot, you’d want a locally-run, efficient, yet powerful model — delegating specific tasks to cloud-based large models only when necessary. Audio and video streams processed locally, data retained locally — I envision this as an ideal end state.

That points to a trend already in motion: as large-model capability flows to the edge on a 6–12 month rhythm, the obvious question becomes what provides the data substrate for these small models on the device.

You need a full traditional database instance on the device, plus vector search, full-text search, and structured queries.

That’s what another project I work on — seekdb — is aimed at.

  • Server mode needs only 1 CPU core and 2 GB of RAM, installs with a single pip install, and starts in seconds.

  • Embedded mode can ship as a Python / JS / TS dynamic library inside the app — no separate DB process, almost no overhead.

  • It packs vector search, full-text search, JSON, and GIS into one MySQL-compatible engine, which keeps the learning curve low (a usage sketch follows this list).
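Because it speaks the MySQL protocol, any standard MySQL client should be able to talk to a seekdb server. The sketch below uses pymysql; the connection parameters, the VECTOR column type, and the l2_distance function are assumptions for illustration, not confirmed seekdb syntax, so check the docs before copying.

```python
import pymysql

# Assumption: a local server listening on the MySQL protocol with an "app" database.
conn = pymysql.connect(host="127.0.0.1", port=3306,
                       user="root", password="", database="app")

# A toy query vector; the expected serialization format is an assumption.
query_embedding = "[0.12, 0.03, 0.87]"

with conn.cursor() as cur:
    # Hypothetical schema mixing structured fields, a full-text index, and a vector column.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS notes (
            id BIGINT PRIMARY KEY AUTO_INCREMENT,
            body TEXT,
            embedding VECTOR(3),       -- vector column type name is an assumption
            FULLTEXT KEY (body)
        )
    """)
    # Hypothetical vector search; the exact distance function name varies by engine.
    cur.execute(
        "SELECT id, body FROM notes ORDER BY l2_distance(embedding, %s) LIMIT 5",
        (query_embedding,),
    )
    print(cur.fetchall())
```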

Hassabis’s read makes it easy to believe that edge intelligence isn’t a “someday” story; it’s closing in on a roughly six-month cadence. Infrastructure that delivers full AI data capability at tiny cost will soon go from nice-to-have to must-have.

AI safety that lives only in the prompt is not enough

Hassabis ties powerful models to misuse risk — for example, after Garry Tan calls the moment “Promethean,” Hassabis answers:

Exactly. And — as the Prometheus myth warns — we must handle this power with great care: how it’s used, where it’s applied, and the risks of misuse.

He also stresses privacy and security as a reason to run capable models on edge devices (see the home-robot quote in the distillation section above).

Author’s framing (not a verbatim Hassabis checklist in the transcript): teams shipping agents still worry about two classes of failure: (1) bad actors using AI to scale attacks, and (2) more autonomy making “oops, it touched prod” incidents more consequential. The second is why stories like agents deleting data are no longer pure thought experiments — for a concrete write-up, see Nine Seconds, No Backups: An Agent’s “Confession”.

My view: as capabilities accelerate, guardrails cannot live only in the prompt — part of the responsibility belongs in infrastructure that limits blast radius.

At the database layer, for example, you can design multiple lines of defense for agent-heavy systems:

  • Data branch / fork (like Git): agents experiment on a fork; primary DB/tables don’t move. Merge if good; throw away if bad.

  • Recycle bin + flashback: dropped tables sit in a recycle bin, and FLASHBACK brings them back; flashback queries can read snapshots from arbitrary points in the past (see the sketch after this list).

  • Primary/standby physical isolation: backups run on separate storage from the primary — not the same blast radius.
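As a concrete illustration of the recycle-bin bullet above, here is a sketch of what storage-layer recovery can look like from Python. The FLASHBACK statement follows the Oracle/OceanBase-style syntax; whether your engine supports it, and the exact keywords, are assumptions to verify against your database’s documentation.

```python
import pymysql

# Assumption: a MySQL-protocol database with a recycle bin / flashback feature enabled.
conn = pymysql.connect(host="127.0.0.1", port=3306,
                       user="root", password="", database="app")

def run(sql):
    with conn.cursor() as cur:
        cur.execute(sql)

# An agent "accidentally" drops a table during an autonomous run...
run("DROP TABLE orders")

# ...but because the table only moved to the recycle bin, the storage layer can undo it.
# Syntax is Oracle/OceanBase-style and is an assumption here; verify for your engine.
run("FLASHBACK TABLE orders TO BEFORE DROP")
```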


Bottom line: assume agents will make destructive mistakes sometimes — then weld shut those paths at the storage layer, not only in system prompts.

AI is still waiting for its “Einstein”

Near the end, Hassabis offers what he calls the “Einstein Test”:

I sometimes call it the ‘Einstein Test’: Can you train a system using only knowledge available in 1901, then have it independently derive Einstein’s 1905 breakthroughs — including special relativity? Once achieved, these systems will be close to inventing genuinely novel concepts.

Today’s strongest systems can still look brilliant inside a fixed framework (including hard physics puzzles). Hassabis’s bar is higher: inventing the framework, not only acing questions within it.

On AlphaGo and inventing Go, Hassabis continues:

But Move 37 alone wasn’t enough. It was cool and useful — but can this system invent Go itself? If you give it a high-level description — e.g., ‘a game whose rules take five minutes to learn but a lifetime to master; aesthetically elegant; playable in an afternoon’ — and it returns Go as the answer? Today’s systems cannot do this. Why not?

AlphaGo could play a shocking Move 37; it couldn’t invent Go. That’s today’s AI in one line: full marks on the exam, still hasn’t learned to write the exam.

Hassabis says the field is still waiting for an Einstein-level breakthrough. Until then, what we can do is: build memory, roll out the edge, and shore up safety — so AI trips less on the road to AGI.

Doing those three takes more than the model layer. Infrastructure has to evolve too.


Building agent memory systems? What patterns are you using for long-term recall? Drop your approach below.

