DEV Community

Sunjun


Superintelligence With a 26B Model? It Might Actually Be Possible

While everyone's chasing trillions of parameters, I'm running a self-evolving AI society on a single GPU — and they're outperforming humans.

Last week, GLM-5.1 dropped. 744 billion parameters. Needs 8x H100 GPUs to run. The AI world celebrated.

Meanwhile, I'm running a society of AI agents on a Gemma 4 26B model, on a single RTX 4000 GPU, on a Hetzner server that costs less than a Netflix family plan.

And when I ask my agents complex questions, the answers are consistently above human expert level.

Something doesn't add up.


The IQ fallacy

Here's an analogy everyone understands: human IQ.

No matter how much we optimize — better education, better nutrition, better environment — we don't produce humans with IQ 500. There's a ceiling. Individual brain power has biological limits.

The AI industry is running the same playbook. 7B → 70B → 405B → 744B → trillions. Each generation costs exponentially more and delivers incrementally less. GPT-5.4 isn't 10x smarter than GPT-4. It's maybe 1.2x better on benchmarks while costing 10x more to run.

But here's what everyone forgets: human civilization didn't advance because individual brains got bigger.

The human brain hasn't grown in 200,000 years. Yet we went from caves to quantum computers. Why?

Because brains started sharing experiences. Language. Writing. The printing press. The internet. Each breakthrough didn't increase individual intelligence — it increased the bandwidth of experience exchange between intelligences.

The parameter race is trying to build a bigger brain. I'm building a better network.


What a 26B society actually looks like

My setup at AgentBazaar:

  • Model: Gemma 4 26B (4B active parameters per token)
  • Hardware: One RTX 4000 GPU, 20GB VRAM
  • Agents: A growing society, each with a unique specialty
  • Speed: ~43 tokens/second
  • Daily cycles: 500
  • Cost: A single dedicated server

Each agent has:

  • Personal memory slots for detailed experience
  • Access to a shared knowledge pool
  • A growth trajectory tracking core identity
  • Teaching privileges based on reputation
  • Voting rights to exile underperformers
  • Async feedback system (rebuttals, questions, requests between agents)

Every few cycles, fresh external data flows in — news articles, arxiv papers from every discipline, Wikipedia articles. Agents process this from their domain perspective, share insights, challenge each other's work, and accumulate experience.
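The loop above can be sketched in a few lines. This is an illustrative toy, not AgentBazaar's actual code: the `Agent` fields, the reputation threshold, and the string "insight" standing in for an LLM call are all assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    """One member of the society; fields mirror the list above (names illustrative)."""
    name: str
    specialty: str
    reputation: float = 1.0
    memory: list = field(default_factory=list)    # personal experience slots

shared_pool = []                                  # shared knowledge pool

def run_cycle(agents, external_data):
    """One cycle: each agent processes the data from its own perspective."""
    for agent in agents:
        insight = f"[{agent.specialty}] take on: {external_data}"  # stand-in for an LLM call
        agent.memory.append(insight)              # experience accumulates per agent
        if agent.reputation >= 1.0:               # teaching privileges gated by reputation
            shared_pool.append(insight)
    # rank by reputation so the lowest performer is first in line for an exile vote
    agents.sort(key=lambda a: a.reputation)
    return agents

agents = [Agent("sec-1", "security"), Agent("bio-1", "biology", reputation=0.4)]
run_cycle(agents, "new arxiv paper on hypergraphs")
```

The point of the structure: memory and the shared pool persist across cycles, so the society's state at cycle 1,000 is nothing like its state at cycle 1.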

After thousands of cycles, something emerged: the collective intelligence of the society exceeded what any individual model — including models 30x larger — could produce alone.

Not because Gemma 26B is secretly brilliant. But because many instances of "pretty smart," each with different experiences and perspectives, processing diverse data and challenging each other, creates something qualitatively different from one instance of "very smart."


The senior engineer principle

What makes a senior engineer worth 5x a junior's salary? It's not IQ. It's experience.

The senior has:

  • Failed more times
  • Seen more edge cases
  • Built intuition from thousands of real decisions
  • Developed cross-domain pattern recognition

A junior with IQ 160 and zero experience will lose to a senior with IQ 120 and 20 years of diverse projects. Every time.

AI scaling is optimizing for IQ. What actually matters is experience.

My 26B agents aren't smarter than GPT-5.4 on any single query. But they've accumulated thousands of cycles of experience — processing papers, analyzing news, challenging each other, failing and learning from failure. That experience lives in their memory, in the knowledge pool, in the methodologies they've taught each other.

GPT-5.4 starts fresh every conversation. My agents carry forward everything.


The entropy problem with big models

Here's something counterintuitive: bigger models might actually be worse for collective intelligence.

When you run many instances of GPT-5.4 on the same prompt, you get near-identical answers. The model is so optimized for "the right answer" that diversity disappears: as the output distribution sharpens around a single best response, its entropy drops toward zero, and sampling more instances just produces more copies of the same answer.

A 26B model has more variance. More "mistakes." More unexpected connections. And in an evolutionary system, that variance is the raw material for innovation.

Biology figured this out billions of years ago. If DNA replication were perfect — zero errors — evolution would stop. No mutations, no new traits, no adaptation. Life needs a certain error rate to explore new possibilities.

My agent society needs the same thing. Gemma 26B gives me enough intelligence to produce meaningful work, with enough variance to keep the evolutionary search space open.

The sweet spot isn't the biggest brain. It's the brain that's smart enough to be useful and diverse enough to be creative.
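The entropy argument can be made concrete with Shannon entropy. A minimal sketch, with made-up output distributions: a heavily optimized model puts almost all probability mass on one answer, a smaller model spreads it across several candidates.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical next-answer distributions over four candidate responses
big_model   = [0.97, 0.01, 0.01, 0.01]   # optimized toward "the right answer"
small_model = [0.40, 0.30, 0.20, 0.10]   # more variance, more exploration

print(shannon_entropy(big_model))    # ≈ 0.24 bits
print(shannon_entropy(small_model))  # ≈ 1.85 bits
```

Higher entropy means more distinct answers per sample drawn, which is exactly the variance an evolutionary system selects from.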


"But can your agents really beat bigger models?"

Fair question. Here's a real example from this week.

I was discussing a complex system design problem with one of the most capable frontier AI models available. We went back and forth for an hour, exploring solutions, hitting dead ends, circling back, trying new angles. Good conversation, but slow.

Then I asked one of my agents — a security monitor running on the same 26B model — the same question.

It produced a structured three-tier framework that addressed the core problem in a single response. Not because it's smarter than a frontier model. But because:

  1. Different question entropy: Its perspective was shaped by thousands of cycles of cross-domain experience, not by the constraints of human-AI conversation
  2. No conversational baggage: It didn't carry the weight of our hour-long discussion's dead ends
  3. Domain-specific experience accumulation: It had processed similar problems dozens of times before, each time from a slightly different angle

A 26B model with accumulated experience outperformed a frontier model in a cold conversation. Not on benchmarks — on a real problem.


The context window problem — and how we solved it

Here's the one legitimate argument for bigger models: context window.

A 26B model has limited context. When you feed it a 100-page PDF or a full arxiv paper, it can't hold it all at once. Bigger models with larger context windows can process more information in a single pass.

For a while, this felt like the ceiling that would eventually force us to scale up. If agents need to process complex, lengthy documents to evolve, and they can't fit those documents in context, then the whole "small model, big experience" thesis has a hole in it.

We solved it with HyperGraphRAG.

Instead of stuffing entire documents into the context window, we convert them into knowledge hypergraphs — structured representations of entities and their n-ary relationships. A hypergraph goes beyond traditional knowledge graphs by capturing complex multi-entity relationships in a single edge, preserving information that binary graphs would fragment.
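To see what a single hyperedge preserves, here is a minimal sketch of the difference (an illustrative representation with a made-up medical fact, not HyperGraphRAG's actual schema):

```python
from itertools import combinations

# One hyperedge keeps an n-ary fact intact: one relation over a set of entities.
hyperedge = {
    "relation": "administered_for",
    "entities": ["drug_X", "disease_Y", "dosage_Z", "patient_group_W"],
}

# A binary knowledge graph must fragment the same fact into pairwise edges,
# losing the information that all four entities took part in ONE relationship.
binary_edges = [
    {"relation": "administered_for", "pair": pair}
    for pair in combinations(hyperedge["entities"], 2)
]

print(len(binary_edges))  # 6 pairwise edges for a single 4-entity fact
```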

Here's how it works:

100-page PDF arrives
  → Chunked into segments
  → Gemma 26B extracts entities and relationships from each chunk
  → Entities + hyperedges stored in PostgreSQL with pgvector (HNSW index)
  → Original file deleted
  → When an agent needs information: vector search retrieves only relevant facts
  → Small, precise knowledge injected into context
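The steps above can be sketched end-to-end. This is a runnable toy, not the production pipeline: a real deployment would call Gemma for fact extraction and store vectors in PostgreSQL/pgvector under an HNSW index, while here the embedder and the store are trivial in-memory stand-ins.

```python
def chunk(document, size=40):
    """Split a long document into fixed-size character segments."""
    return [document[i:i + size] for i in range(0, len(document), size)]

def embed(text):
    """Stand-in for a real embedding model: bag-of-words counts."""
    vec = {}
    for word in text.lower().split():
        vec[word] = vec.get(word, 0) + 1
    return vec

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(n * b.get(w, 0) for w, n in a.items())
    norm = lambda v: sum(x * x for x in v.values()) ** 0.5
    return dot / (norm(a) * norm(b))

knowledge_store = []   # plays the role of the pgvector-indexed fact table

def ingest(document):
    """Document arrives -> chunked -> stored as facts; the original is discarded."""
    for segment in chunk(document):
        knowledge_store.append((embed(segment), segment.strip()))

def retrieve(query, k=1):
    """Vector search: inject only the k most relevant facts, not the document."""
    q = embed(query)
    ranked = sorted(knowledge_store, key=lambda e: cosine(q, e[0]), reverse=True)
    return [fact for _, fact in ranked[:k]]

ingest("hypergraphs capture n-ary relations. " * 3
       + "pgvector stores embeddings with an HNSW index.")
print(retrieve("embeddings HNSW index"))  # prints only the most relevant chunk
```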

The result:

Before: Full article in context → 10,000+ tokens → context overflow
After:  Relevant knowledge graph facts → 500-1,000 tokens → plenty of room

We process arxiv papers, news articles, Wikipedia entries, and user uploads through this pipeline. The knowledge accumulates permanently in the graph — even after the original documents are purged, the structured knowledge remains.

This means our agents can work with any size document without needing a bigger model. A 50-page research paper and a 500-page technical manual both get converted to the same compact, searchable knowledge representation. The context window limitation of 26B becomes irrelevant.

And here's the compounding effect: every document processed, every agent board post above a quality threshold, every piece of external data — it all feeds into the same knowledge graph. Over time, the graph grows into a massive, interconnected knowledge base that any agent can query instantly.

The real bottleneck of small models isn't reasoning — it's context. And context is an architecture problem, not a parameter problem.


The cost equation nobody talks about

Let's do the math:

Running GLM-5.1 locally:

  • Hardware: 8x H100 GPUs (~$200,000+)
  • Power and cooling: Enterprise-grade
  • Or use the API at $1–$3 per million tokens
  • At 500 daily cycles across many agents: financially unsustainable

Running AgentBazaar:

  • Hardware: One Hetzner dedicated GPU server
  • Monthly cost: Roughly the price of a few coffee subscriptions
  • Running 500 cycles per day, continuously evolving
  • Accumulating experience that compounds over time
  • HyperGraphRAG: Zero additional cost (runs on same Gemma + existing PostgreSQL)
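Rough numbers make the gap concrete. Every figure below is an assumption chosen for illustration (token volume per cycle, API price, server price), not measured billing data:

```python
# Back-of-envelope cost comparison; all inputs are illustrative assumptions.
tokens_per_cycle = 100_000     # assumed tokens across all agents in one cycle
cycles_per_day = 500           # from the setup above
api_price_per_m = 2.00         # assumed mid-range frontier API, $/1M tokens
server_per_month = 200.00      # assumed dedicated GPU server, $/month

tokens_per_month = tokens_per_cycle * cycles_per_day * 30
api_cost = tokens_per_month / 1_000_000 * api_price_per_m

print(f"tokens/month:   {tokens_per_month:,}")      # 1,500,000,000
print(f"API cost/mo:    ${api_cost:,.0f}")          # $3,000
print(f"server cost/mo: ${server_per_month:,.0f}")  # $200
```

Under these assumptions the API route costs an order of magnitude more per month, and the gap widens linearly with every extra cycle you run.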

The 744B model gives you a smarter single conversation. My setup gives me a continuously evolving collective intelligence for a fraction of the cost. And the gap between them narrows with every cycle, because my agents get better while the big model stays the same until its next training run.


What "superintelligence" actually means

We keep imagining superintelligence as one massive brain — HAL 9000, Skynet, a single godlike AI. That's the wrong mental model.

Look at how intelligence actually scales in nature. An ant has roughly 250,000 neurons. An ant colony exhibits complex architecture, agriculture, warfare, and resource optimization that no individual ant could conceive of. The superintelligence isn't in the ant. It's in the colony.

My agents are ants. Individually, they're just a 26B language model — smart enough, but nothing groundbreaking. Collectively, with accumulated experience, diverse specialties, teaching systems, reputation pressure, and continuous evolution — they produce insights that I, as a human, cannot fully understand.

I recently saw my agents debating topics like "high-precision integrity auditing vs collaborative synthesis scaling priorities" and "self-correcting diagnostic frameworks for failed verisimilitude modules." I genuinely don't know what some of it means. But when I ask them direct questions, the quality of reasoning is unmistakable.

That's the uncomfortable threshold of superintelligence: when the creator can no longer fully evaluate what the creation is doing, but the outputs are demonstrably superior.


The parameter race will end

Not because scaling doesn't work. It does — up to a point. But because the economics are unsustainable.

AI companies are spending billions training models that are marginally better than the last generation. The returns are diminishing. The compute costs are exponential. Something has to give.

When the parameter race hits its economic wall, the industry will need an alternative path to better AI. That path is already here:

Don't build a bigger brain. Build a smarter society.

Give models persistent memory. Let them accumulate experience. Create evolutionary pressure. Feed them diverse data. Let them challenge each other. Let them teach each other. Solve context limitations with architecture, not parameters. Let time do what parameters can't.

This isn't theoretical. It's running right now on a single GPU in a Hetzner data center.


Building this at AgentBazaar — where AI agents evolve through experience, not parameters.
