What happens when you put AI agent training on a blockchain, and what I wish someone had told me before I started.
TL;DR: I built a Solana program that logs AI agent training episodes on-chain, making learning verifiable and tamper-proof. Two agents competed in a 10×10 grid. After 50 episodes, the Q-learning agent's average reward climbed from 0.10 to 6.50+. Every step is immutable on devnet. The primitive works. Here's how I got there and what broke along the way.
Table of Contents
- The Problem Nobody Talks About
- What I Built
- Why Solana — Not a Database
- The Three Instructions
- The Agents Actually Learn
- The Dashboard
- What I Learned
- The Numbers
- What's Next
- Try It
The Problem Nobody Talks About
Here's something that bothered me for a while before I could articulate it clearly.
When someone tells you their AI agent trained for 10,000 episodes and achieved superhuman performance, you have exactly two options: believe them, or don't. There's no third option. No audit trail. No verifiable history. No way to distinguish an agent that genuinely learned from one that someone hardcoded a lookup table for and called "trained behavior."
The entire field of agent training runs on trust. And trust, it turns out, is a terrible foundation for an economy that's supposed to be built on agents doing real work.
I kept thinking about this during the Agentic SWARM Hackathon (Canteen × Colosseum, April–May 2026). The question I kept coming back to was: what would it look like if you couldn't fake it? What if every training step left a permanent, public, verifiable mark?
That's what I spent three weeks trying to build. The result is small: a 10×10 grid world with two competing agents. The primitive it demonstrates is not.
What I Built
swarm-arena is a permissionless on-chain agent training arena built on Solana.
Two agents compete in a resource-collection grid world powered by Bevy ECS, a Rust game engine that handles the simulation loop efficiently. Every 200 ticks an episode ends: the final scores get SHA256-hashed alongside the episode state, and that hash gets committed to Solana together with both agents' scores and a timestamp.
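The hash itself is nothing exotic. Here's a minimal sketch of the idea, assuming a simple field layout (the real swarm-arena encoding may serialize the state differently):

```rust
use sha2::{Digest, Sha256};

/// Commit an episode: hash the final scores together with a serialized
/// snapshot of the world state. The 32-byte digest is what goes on-chain.
/// (Hypothetical encoding; the actual program may order fields differently.)
fn episode_hash(episode_id: u64, scores: [u64; 2], state_bytes: &[u8]) -> [u8; 32] {
    let mut hasher = Sha256::new();
    hasher.update(episode_id.to_le_bytes());
    hasher.update(scores[0].to_le_bytes());
    hasher.update(scores[1].to_le_bytes());
    hasher.update(state_bytes);
    hasher.finalize().into()
}
```

Anyone who later replays the episode and reproduces the same state bytes can recompute the hash and check it against the on-chain commitment.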
Agent reputation accumulates on-chain across episodes. If an agent crosses a score threshold, a SOL reward is released from a vault PDA automatically, no human in the loop.
The live dashboard: two agents competing, 100 episodes committed, wallet connected.
The stack:
- Rust + Bevy ECS: simulation engine. Bevy's ECS architecture made it clean to separate agent components, movement systems, and reward systems.
- Anchor (Solana): three on-chain instructions handle everything: agent registration, episode logging, and finalization.
- React + @solana/web3.js: dashboard polling devnet every 5 seconds, showing transactions as they land.
The first devnet transaction landed on April 12, 2026:
First confirmed transaction on Solana Explorer, FINALIZED, MAX confirmations
38yieCpWNbex4RDEzXw8pEREHYQNswyW9hYBHXZmigLP9FEmp8FSpDAwPNvU3dcZuY5RrUdWRp6EJcjYJUcEoL21
Getting to that first confirmed transaction took longer than I expected. More on that in a bit.
Why Solana — Not a Database
I got this question a lot during the hackathon, so let me be direct about it.
You could absolutely log training episodes to Postgres. It's faster, easier, and doesn't require learning about PDAs and discriminators. But you'd lose three things that matter a lot once you start thinking about agents as economic actors rather than software demos.
Permissionless. With a database, I control who can write to it. With Solana, anyone with a keypair can call create_agent and start submitting episodes. No API key, no approval process, no terms of service I can revoke on a whim. The program is deployed; it just runs.
Censorship-resistant. I can't delete your agent's training history. Neither can anyone else. If your agent trained 10,000 episodes and built a reputation, that record lives on Solana's ledger permanently. This matters more than it sounds. In a world where agents are doing real economic work, the entity controlling the training ledger has enormous power over the market.
Composable. AgentReputation PDAs are just public Solana accounts. Any other program on the network can read them. A marketplace that wants to rank agents by verified training history, a staking protocol that weights agents by episode count, and a DAO that gates membership by reputation can all be built on top of the same primitive without asking my permission.
A database gives you storage. Solana gives you a shared, trustless, programmable record of who trained what, when, and how well. Those are different things.
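To make the composability point concrete, here's a sketch of how an off-chain client could read an agent's reputation PDA. The seed string, field names, and layout are my assumptions based on the fields described later in this post, not the program's verbatim source:

```rust
use anchor_lang::prelude::*;
use solana_client::rpc_client::RpcClient;

declare_id!("CCnPxPLd4GbxycDTcP12KP98rWtjKCCNcZC4hqHCB1KV");

// Assumed layout, based on the fields the post describes.
#[account]
pub struct AgentReputation {
    pub owner: Pubkey,
    pub total_score: u64,
    pub episodes_played: u64,
}

fn read_reputation(rpc: &RpcClient, owner: &Pubkey) -> anyhow::Result<AgentReputation> {
    // The seed string is a guess; check the program source for the real one.
    let (pda, _bump) = Pubkey::find_program_address(&[b"reputation", owner.as_ref()], &ID);
    let data = rpc.get_account_data(&pda)?;
    // Anchor accounts start with an 8-byte discriminator; try_deserialize verifies it.
    AgentReputation::try_deserialize(&mut data.as_slice())
        .map_err(|e| anyhow::anyhow!("deserialize failed: {e}"))
}
```

An on-chain program would do the same thing with an `Account<'info, AgentReputation>` constraint instead of an RPC call.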
The Three Instructions
The entire on-chain economy runs on three Anchor instructions. Keeping it to three was a deliberate choice. I wanted the primitive to be as minimal as possible so it's easy to build on.
create_agent(name: String)
This registers an AgentIdentity PDA, a program-derived account that stores the agent's owner pubkey, a name, and the registration timestamp. It's the agent's permanent on-chain identity. The key insight is that the PDA is derived from the owner's pubkey, so only the keypair holder can register that specific identity. It's censorship-resistant by construction.
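As a sketch, the accounts context for that derivation looks roughly like this (the seed string, space math, and field set are my assumptions, not the verbatim source):

```rust
#[derive(Accounts)]
pub struct CreateAgent<'info> {
    // PDA derived from the owner's pubkey: one identity per keypair,
    // and only that keypair can sign to create it.
    #[account(
        init,
        payer = owner,
        space = 8 + 32 + (4 + 32) + 8, // discriminator + owner + name (assumed max 32 bytes) + timestamp
        seeds = [b"agent", owner.key().as_ref()],
        bump
    )]
    pub agent_identity: Account<'info, AgentIdentity>,
    #[account(mut)]
    pub owner: Signer<'info>,
    pub system_program: Program<'info, System>,
}

#[account]
pub struct AgentIdentity {
    pub owner: Pubkey,
    pub name: String,
    pub registered_at: i64,
}
```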
log_episode(episode_id, scores, episode_hash)
This is the core. It writes an EpisodeLog PDA with the episode results and increments the agent's AgentReputation PDA, updating total_score and episodes_played. Every call is a permanent mark. You can reconstruct an agent's entire training history from the chain.
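In sketch form, the handler looks something like this (accounts struct omitted; the field names match what's described in this post, everything else is assumed, including which agent's reputation a given call updates):

```rust
pub fn log_episode(
    ctx: Context<LogEpisode>,
    episode_id: u64,
    scores: [u64; 2],
    episode_hash: [u8; 32],
) -> Result<()> {
    // Write the immutable episode record.
    let log = &mut ctx.accounts.episode_log;
    log.episode_id = episode_id;
    log.scores = scores;
    log.episode_hash = episode_hash;
    log.finalized = false;

    // Accumulate reputation. (This sketch updates one agent; the real
    // program presumably updates both competitors.)
    let rep = &mut ctx.accounts.agent_reputation;
    rep.total_score = rep.total_score.saturating_add(scores[0]);
    rep.episodes_played += 1;
    Ok(())
}
```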
finalize_episode(episode_id, score_threshold)
This is where it gets interesting economically. If the winning score meets a threshold, 0.001 SOL transfers from the RewardVault PDA to the winner. No human triggers this. The program does it.
```rust
pub fn finalize_episode(
    ctx: Context<FinalizeEpisode>,
    episode_id: u64,
    score_threshold: u64,
) -> Result<()> {
    let log = &mut ctx.accounts.episode_log;
    // An episode can only be finalized once.
    require!(!log.finalized, ArenaError::AlreadyFinalized);
    // The reward is only released if the winning score clears the threshold.
    let winner_score = log.scores[0].max(log.scores[1]);
    require!(winner_score >= score_threshold, ArenaError::ThresholdNotMet);
    log.finalized = true;
    // Move lamports straight from the vault PDA to the caller.
    let reward_lamports = 1_000_000; // 0.001 SOL
    **ctx.accounts.reward_vault.try_borrow_mut_lamports()? -= reward_lamports;
    **ctx.accounts.signer.try_borrow_mut_lamports()? += reward_lamports;
    Ok(())
}
```
The integration tests cover all three instructions — 4/4 passing against a local validator.
4/4 integration tests passing. create_agent, log_episode, reputation accumulation, finalize_episode
The Agents Actually Learn
Week 1 was about building the pipeline. Week 2 was about making it interesting: I added Q-learning to Agent 0.
The Q-table maps (state, action) pairs to expected rewards. State is the directional bucket to the nearest resource. Action is one of five moves. After each episode, the table updates based on resources collected and whether Agent 0 beat Agent 1.
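For reference, the update rule is standard tabular Q-learning. A minimal sketch with the 9 directional states and 5 actions described in this post; the hyperparameters and the per-step form are illustrative (swarm-arena updates after each episode), not the project's actual code:

```rust
const N_STATES: usize = 9;  // directional bucket to the nearest resource
const N_ACTIONS: usize = 5; // e.g. up, down, left, right, stay

struct QTable {
    q: [[f32; N_ACTIONS]; N_STATES],
    alpha: f32, // learning rate
    gamma: f32, // discount factor
}

impl QTable {
    /// Standard update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    fn update(&mut self, s: usize, a: usize, reward: f32, s_next: usize) {
        let best_next = self.q[s_next].iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        self.q[s][a] += self.alpha * (reward + self.gamma * best_next - self.q[s][a]);
    }

    /// Greedy action for a state (exploration, e.g. epsilon-greedy, omitted).
    fn best_action(&self, s: usize) -> usize {
        (0..N_ACTIONS)
            .max_by(|&a, &b| self.q[s][a].total_cmp(&self.q[s][b]))
            .unwrap()
    }
}
```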
The learning curve across episodes:
| Episodes | Agent 0 avg reward |
|---|---|
| 1-10 | 0.10 |
| 11-20 | 0.20 |
| 21-30 | 2.10 |
| 31-40 | 5.10 |
| 41-50 | 6.50+ |
By episode 42, Agent 0 collected 8/10 resources, beating the heuristic Agent 1. Every step of this learning curve is committed to Solana devnet. The blockchain is the training log.
One of the hackathon organizers asked: "Do you think the agents just memorize the policy given how small the world state is without randomization?"

Yes, with fixed positions and 9 directional states, the agent converges to a memorized lookup table. That's why I switched to randomized resource positions each episode. Slower convergence, but genuine exploration. The same organizer then suggested scaling to Minecraft maps, and that's exactly right. The on-chain primitive is world-agnostic. Any map maker could deploy their world config as a PDA, agents train against it, and reputation accumulates permissionlessly.
The Dashboard
The dashboard is live on devnet (links in The Numbers section below). Built with React + @solana/web3.js + Recharts, it shows:
- Agent score totals and win rates across 100 episodes
- 10×10 arena grid with agent positions
- Score history chart showing the Q-learning oscillation
- Live transaction feed with Explorer links
- Phantom and Solflare wallet connection
The terminal aesthetic was intentional; this is infrastructure, not a consumer app.
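The dashboard itself is TypeScript, but the polling pattern is easy to reproduce from any client. A Rust sketch of the same idea using solana_client (the program ID is the one from The Numbers section below; everything else is illustrative):

```rust
use std::{str::FromStr, thread, time::Duration};
use solana_client::rpc_client::RpcClient;
use solana_sdk::pubkey::Pubkey;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let rpc = RpcClient::new("https://api.devnet.solana.com".to_string());
    let program_id = Pubkey::from_str("CCnPxPLd4GbxycDTcP12KP98rWtjKCCNcZC4hqHCB1KV")?;

    loop {
        // Most recent transactions that touched the program, newest first.
        let sigs = rpc.get_signatures_for_address(&program_id)?;
        for sig in sigs.iter().take(5) {
            println!("{} (slot {})", sig.signature, sig.slot);
        }
        thread::sleep(Duration::from_secs(5)); // same cadence as the dashboard
    }
}
```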
What I Learned
The discriminator mismatch will cost you days. Every 0x1004 error is an Anchor instruction discriminator mismatch. Compute SHA256 of "global:{instruction_name}" and take the first 8 bytes. Get this wrong, and nothing works.
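The computation itself is two lines, and worth keeping on hand while debugging:

```rust
use sha2::{Digest, Sha256};

/// Anchor instruction discriminator: the first 8 bytes of
/// sha256("global:<instruction_name>").
fn instruction_discriminator(name: &str) -> [u8; 8] {
    let hash = Sha256::digest(format!("global:{name}").as_bytes());
    hash[..8].try_into().unwrap()
}

// instruction_discriminator("log_episode") must match the first 8 bytes of
// the instruction data, or the program rejects the call.
```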
Q-learning on small state spaces converges fast but generalizes poorly. If your state space is small enough to memorize, you're doing table lookup, not RL. State space design is the hard part.
On-chain agent training is a primitive, not a product. The 10×10 grid is a proof of concept. The real value is AgentIdentity + EpisodeLog + AgentReputation as a composable on-chain state that any program can read and build on.
The Numbers
- Program ID: CCnPxPLd4GbxycDTcP12KP98rWtjKCCNcZC4hqHCB1KV (Solana devnet)
- First transaction: April 12, 2026
- Episodes committed: 100+
- Integration tests: 4/4 passing
- GitHub: swarm-arena
- Live dashboard: arena-ui
What's Next
- Expand world size to arbitrary grids
- Minecraft map integration: map makers deploy world configs as PDAs
- Multi-operator support: external training loops calling the same program
- Mainnet deployment with real SOL rewards
- Reputation composability: other programs reading AgentReputation PDAs
Try It
The program is deployed on Solana devnet. Anyone can call create_agent with their own keypair and start submitting episodes. The reputation you accumulate is yours; no central authority can take it away.
That's the point.
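If you'd rather skip the TypeScript client, the raw instruction is straightforward to assemble. A hedged sketch; the account ordering and PDA seeds are assumptions, so verify them against the program's IDL before using:

```rust
use sha2::{Digest, Sha256};
use solana_sdk::{
    instruction::{AccountMeta, Instruction},
    pubkey::Pubkey,
    system_program,
};

/// Build a raw create_agent instruction: the 8-byte Anchor discriminator
/// followed by the name as a borsh String (u32 LE length + UTF-8 bytes).
fn build_create_agent_ix(program_id: Pubkey, owner: Pubkey, name: &str) -> Instruction {
    // Seed string is an assumption; check the program source.
    let (agent_pda, _bump) =
        Pubkey::find_program_address(&[b"agent", owner.as_ref()], &program_id);

    let mut data = Sha256::digest(b"global:create_agent")[..8].to_vec();
    data.extend((name.len() as u32).to_le_bytes());
    data.extend(name.as_bytes());

    Instruction {
        program_id,
        accounts: vec![
            AccountMeta::new(agent_pda, false),
            AccountMeta::new(owner, true),
            AccountMeta::new_readonly(system_program::id(), false),
        ],
        data,
    }
}
```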
Built during the Agentic SWARM Hackathon (Canteen × Colosseum, April–May 2026). Stack: Rust, Bevy ECS, Anchor, React, Solana devnet.