What happens when you put AI agent training on a blockchain, and what I wish someone had told me before I started.
TL;DR: I built a Solana program that logs AI agent training episodes on-chain, making learning verifiable and tamper-proof. Two agents competed in a 10×10 grid. After 50 episodes, the Q-learning agent's average reward climbed from 0.10 to 6.50+. Every step is immutable on devnet. The primitive works. Here's how I got there and what broke along the way.
Table of Contents
- The Problem Nobody Talks About
- What I Built
- Why Solana — Not a Database
- The Three Instructions
- The Agents Actually Learn
- The Dashboard
- What I Learned
- The Numbers
- What's Next
- Try It
The Problem Nobody Talks About
Here's something that bothered me for a while before I could articulate it clearly.
When someone tells you their AI agent trained for 10,000 episodes and achieved superhuman performance, you have exactly two options: believe them, or don't. There's no third option. No audit trail. No verifiable history. No way to distinguish an agent that genuinely learned from one that someone hardcoded a lookup table for and called "trained behavior."
The entire field of agent training runs on trust. And trust, it turns out, is a terrible foundation for an economy that's supposed to be built on agents doing real work.
I kept thinking about this during the Agentic SWARM Hackathon (Canteen × Colosseum, April–May 2026). The question I kept coming back to was: what would it look like if you couldn't fake it? What if every training step left a permanent, public, verifiable mark?
That's what I spent three weeks trying to build. The result is small: a 10×10 grid world with two competing agents. The primitive it demonstrates is not.
What I Built
swarm-arena is a permissionless on-chain agent training arena built on Solana.
Two agents compete in a resource-collection grid world powered by Bevy ECS, a Rust game engine that handles the simulation loop efficiently. Every 200 ticks an episode ends: the final scores get SHA256-hashed alongside the episode state, and that hash gets committed to Solana together with both agents' scores and a timestamp.
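The hash itself is nothing exotic. Here's a minimal sketch of the idea, assuming a simple field layout (the real swarm-arena encoding may serialize the state differently):

```rust
use sha2::{Digest, Sha256};

/// Commit an episode: hash the final scores together with a serialized
/// snapshot of the world state. The 32-byte digest is what goes on-chain.
/// (Hypothetical encoding; the actual program may order fields differently.)
fn episode_hash(episode_id: u64, scores: [u64; 2], state_bytes: &[u8]) -> [u8; 32] {
    let mut hasher = Sha256::new();
    hasher.update(episode_id.to_le_bytes());
    hasher.update(scores[0].to_le_bytes());
    hasher.update(scores[1].to_le_bytes());
    hasher.update(state_bytes);
    hasher.finalize().into()
}
```

Anyone who later replays the episode and reproduces the same state bytes can recompute the hash and check it against the on-chain commitment.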
Agent reputation accumulates on-chain across episodes. If an agent crosses a score threshold, a SOL reward is released from a vault PDA automatically, no human in the loop.
The live dashboard: two agents competing, 100 episodes committed, wallet connected.
The stack:
- Rust + Bevy ECS: simulation engine. Bevy's ECS architecture made it clean to separate agent components, movement systems, and reward systems.
- Anchor (Solana): three on-chain instructions handle everything: agent registration, episode logging, and finalization.
- React + @solana/web3.js: dashboard polling devnet every 5 seconds, showing transactions as they land.
The first devnet transaction landed on April 12, 2026:
First confirmed transaction on Solana Explorer, FINALIZED, MAX confirmations
38yieCpWNbex4RDEzXw8pEREHYQNswyW9hYBHXZmigLP9FEmp8FSpDAwPNvU3dcZuY5RrUdWRp6EJcjYJUcEoL21
Getting to that first confirmed transaction took longer than I expected. More on that in a bit.
Why Solana — Not a Database
I got this question a lot during the hackathon, so let me be direct about it.
You could absolutely log training episodes to Postgres. It's faster, easier, and doesn't require learning about PDAs and discriminators. But you'd lose three things that matter a lot once you start thinking about agents as economic actors rather than software demos.
Permissionless. With a database, I control who can write to it. With Solana, anyone with a keypair can call create_agent and start submitting episodes. No API key, no approval process, no terms of service I can revoke on a whim. The program is deployed; it just runs.
Censorship-resistant. I can't delete your agent's training history. Neither can anyone else. If your agent trained 10,000 episodes and built a reputation, that record lives on Solana's ledger permanently. This matters more than it sounds. In a world where agents are doing real economic work, the entity controlling the training ledger has enormous power over the market.
Composable. AgentReputation PDAs are just public Solana accounts. Any other program on the network can read them. A marketplace that wants to rank agents by verified training history, a staking protocol that weights agents by episode count, and a DAO that gates membership by reputation can all be built on top of the same primitive without asking my permission.
A database gives you storage. Solana gives you a shared, trustless, programmable record of who trained what, when, and how well. Those are different things.
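To make the composability point concrete, here's a sketch of how an off-chain client could read an agent's reputation PDA. The seed string, field names, and layout are my assumptions based on the fields described later in this post, not the program's verbatim source:

```rust
use anchor_lang::prelude::*;
use solana_client::rpc_client::RpcClient;

declare_id!("CCnPxPLd4GbxycDTcP12KP98rWtjKCCNcZC4hqHCB1KV");

// Assumed layout, based on the fields the post describes.
#[account]
pub struct AgentReputation {
    pub owner: Pubkey,
    pub total_score: u64,
    pub episodes_played: u64,
}

fn read_reputation(rpc: &RpcClient, owner: &Pubkey) -> anyhow::Result<AgentReputation> {
    // The seed string is a guess; check the program source for the real one.
    let (pda, _bump) = Pubkey::find_program_address(&[b"reputation", owner.as_ref()], &ID);
    let data = rpc.get_account_data(&pda)?;
    // Anchor accounts start with an 8-byte discriminator; try_deserialize verifies it.
    AgentReputation::try_deserialize(&mut data.as_slice())
        .map_err(|e| anyhow::anyhow!("deserialize failed: {e}"))
}
```

An on-chain program would do the same thing with an `Account<'info, AgentReputation>` constraint instead of an RPC call.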
The Three Instructions
The entire on-chain economy runs on three Anchor instructions. Keeping it to three was a deliberate choice. I wanted the primitive to be as minimal as possible so it's easy to build on.
create_agent(name: String)
This registers an AgentIdentity PDA, a program-derived account that stores the agent's owner pubkey, a name, and the registration timestamp. It's the agent's permanent on-chain identity. The key insight is that the PDA is derived from the owner's pubkey, so only the keypair holder can register that specific identity. It's censorship-resistant by construction.
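As a sketch, the accounts context for that derivation looks roughly like this (the seed string, space math, and field set are my assumptions, not the verbatim source):

```rust
#[derive(Accounts)]
pub struct CreateAgent<'info> {
    // PDA derived from the owner's pubkey: one identity per keypair,
    // and only that keypair can sign to create it.
    #[account(
        init,
        payer = owner,
        space = 8 + 32 + (4 + 32) + 8, // discriminator + owner + name (assumed max 32 bytes) + timestamp
        seeds = [b"agent", owner.key().as_ref()],
        bump
    )]
    pub agent_identity: Account<'info, AgentIdentity>,
    #[account(mut)]
    pub owner: Signer<'info>,
    pub system_program: Program<'info, System>,
}

#[account]
pub struct AgentIdentity {
    pub owner: Pubkey,
    pub name: String,
    pub registered_at: i64,
}
```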
log_episode(episode_id, scores, episode_hash)
This is the core. It writes an EpisodeLog PDA with the episode results and increments the agent's AgentReputation PDA, updating total_score and episodes_played. Every call is a permanent mark. You can reconstruct an agent's entire training history from the chain.
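In sketch form, the handler looks something like this (accounts struct omitted; the field names match what's described in this post, everything else is assumed, including which agent's reputation a given call updates):

```rust
pub fn log_episode(
    ctx: Context<LogEpisode>,
    episode_id: u64,
    scores: [u64; 2],
    episode_hash: [u8; 32],
) -> Result<()> {
    // Write the immutable episode record.
    let log = &mut ctx.accounts.episode_log;
    log.episode_id = episode_id;
    log.scores = scores;
    log.episode_hash = episode_hash;
    log.finalized = false;

    // Accumulate reputation. (This sketch updates one agent; the real
    // program presumably updates both competitors.)
    let rep = &mut ctx.accounts.agent_reputation;
    rep.total_score = rep.total_score.saturating_add(scores[0]);
    rep.episodes_played += 1;
    Ok(())
}
```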
finalize_episode(episode_id, score_threshold)
This is where it gets interesting economically. If the winning score meets a threshold, 0.001 SOL transfers from the RewardVault PDA to the winner. No human triggers this. The program does it.
```rust
pub fn finalize_episode(
    ctx: Context<FinalizeEpisode>,
    episode_id: u64,
    score_threshold: u64,
) -> Result<()> {
    let log = &mut ctx.accounts.episode_log;
    // An episode can only be finalized once.
    require!(!log.finalized, ArenaError::AlreadyFinalized);
    // The reward is only released if the winning score clears the threshold.
    let winner_score = log.scores[0].max(log.scores[1]);
    require!(winner_score >= score_threshold, ArenaError::ThresholdNotMet);
    log.finalized = true;
    // Move lamports straight from the vault PDA to the caller.
    let reward_lamports = 1_000_000; // 0.001 SOL
    **ctx.accounts.reward_vault.try_borrow_mut_lamports()? -= reward_lamports;
    **ctx.accounts.signer.try_borrow_mut_lamports()? += reward_lamports;
    Ok(())
}
```
The integration tests cover all three instructions — 4/4 passing against a local validator.
4/4 integration tests passing. create_agent, log_episode, reputation accumulation, finalize_episode
The Agents Actually Learn
Week 1 was about building the pipeline. Week 2 was about making it interesting: I added Q-learning to Agent 0.
The Q-table maps (state, action) pairs to expected rewards. State is the directional bucket to the nearest resource. Action is one of five moves. After each episode, the table updates based on resources collected and whether Agent 0 beat Agent 1.
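For reference, the update rule is standard tabular Q-learning. A minimal sketch with the 9 directional states and 5 actions described in this post; the hyperparameters and the per-step form are illustrative (swarm-arena updates after each episode), not the project's actual code:

```rust
const N_STATES: usize = 9;  // directional bucket to the nearest resource
const N_ACTIONS: usize = 5; // e.g. up, down, left, right, stay

struct QTable {
    q: [[f32; N_ACTIONS]; N_STATES],
    alpha: f32, // learning rate
    gamma: f32, // discount factor
}

impl QTable {
    /// Standard update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    fn update(&mut self, s: usize, a: usize, reward: f32, s_next: usize) {
        let best_next = self.q[s_next].iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        self.q[s][a] += self.alpha * (reward + self.gamma * best_next - self.q[s][a]);
    }

    /// Greedy action for a state (exploration, e.g. epsilon-greedy, omitted).
    fn best_action(&self, s: usize) -> usize {
        (0..N_ACTIONS)
            .max_by(|&a, &b| self.q[s][a].total_cmp(&self.q[s][b]))
            .unwrap()
    }
}
```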
The learning curve across episodes:
| Episodes | Agent 0 avg reward |
|---|---|
| 1-10 | 0.10 |
| 11-20 | 0.20 |
| 21-30 | 2.10 |
| 31-40 | 5.10 |
| 41-50 | 6.50+ |
By episode 42, Agent 0 collected 8/10 resources, beating the heuristic Agent 1. Every step of this learning curve is committed to Solana devnet. The blockchain is the training log.
One of the hackathon organizers asked: "Do you think the agents just memorize the policy given how small the world state is without randomization?"

Yes, with fixed positions and 9 directional states, the agent converges to a memorized lookup table. That's why I switched to randomized resource positions each episode. Slower convergence, but genuine exploration. The same organizer then suggested scaling to Minecraft maps, and that's exactly right. The on-chain primitive is world-agnostic. Any map maker could deploy their world config as a PDA, agents train against it, and reputation accumulates permissionlessly.
The Dashboard
The dashboard is live on devnet (links in The Numbers section below). Built with React + @solana/web3.js + Recharts, it shows:
- Agent score totals and win rates across 100 episodes
- 10×10 arena grid with agent positions
- Score history chart showing the Q-learning oscillation
- Live transaction feed with Explorer links
- Phantom and Solflare wallet connection
The terminal aesthetic was intentional; this is infrastructure, not a consumer app.
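The dashboard itself is TypeScript, but the polling pattern is easy to reproduce from any client. A Rust sketch of the same idea using solana_client (the program ID is the one from The Numbers section below; everything else is illustrative):

```rust
use std::{str::FromStr, thread, time::Duration};
use solana_client::rpc_client::RpcClient;
use solana_sdk::pubkey::Pubkey;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let rpc = RpcClient::new("https://api.devnet.solana.com".to_string());
    let program_id = Pubkey::from_str("CCnPxPLd4GbxycDTcP12KP98rWtjKCCNcZC4hqHCB1KV")?;

    loop {
        // Most recent transactions that touched the program, newest first.
        let sigs = rpc.get_signatures_for_address(&program_id)?;
        for sig in sigs.iter().take(5) {
            println!("{} (slot {})", sig.signature, sig.slot);
        }
        thread::sleep(Duration::from_secs(5)); // same cadence as the dashboard
    }
}
```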
What I Learned
The discriminator mismatch will cost you days. Every 0x1004 error is an Anchor instruction discriminator mismatch. Compute SHA256 of "global:{instruction_name}" and take the first 8 bytes. Get this wrong, and nothing works.
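The computation itself is two lines, and worth keeping on hand while debugging:

```rust
use sha2::{Digest, Sha256};

/// Anchor instruction discriminator: the first 8 bytes of
/// sha256("global:<instruction_name>").
fn instruction_discriminator(name: &str) -> [u8; 8] {
    let hash = Sha256::digest(format!("global:{name}").as_bytes());
    hash[..8].try_into().unwrap()
}

// instruction_discriminator("log_episode") must match the first 8 bytes of
// the instruction data, or the program rejects the call.
```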
Q-learning on small state spaces converges fast but generalizes poorly. If your state space is small enough to memorize, you're doing table lookup, not RL. State space design is the hard part.
On-chain agent training is a primitive, not a product. The 10×10 grid is a proof of concept. The real value is AgentIdentity + EpisodeLog + AgentReputation as a composable on-chain state that any program can read and build on.
The Numbers
- Program ID: CCnPxPLd4GbxycDTcP12KP98rWtjKCCNcZC4hqHCB1KV (Solana devnet)
- First transaction: April 12, 2026
- Episodes committed: 100+
- Integration tests: 4/4 passing
- GitHub: swarm-arena
- Live dashboard: arena-ui
What's Next
- Expand world size to arbitrary grids
- Minecraft map integration: map makers deploy world configs as PDAs
- Multi-operator support: external training loops calling the same program
- Mainnet deployment with real SOL rewards
- Reputation composability: other programs reading AgentReputation PDAs
Try It
The program is deployed on Solana devnet. Anyone can call create_agent with their own keypair and start submitting episodes. The reputation you accumulate is yours; no central authority can take it away.
That's the point.
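If you'd rather skip the TypeScript client, the raw instruction is straightforward to assemble. A hedged sketch; the account ordering and PDA seeds are assumptions, so verify them against the program's IDL before using:

```rust
use sha2::{Digest, Sha256};
use solana_sdk::{
    instruction::{AccountMeta, Instruction},
    pubkey::Pubkey,
    system_program,
};

/// Build a raw create_agent instruction: the 8-byte Anchor discriminator
/// followed by the name as a borsh String (u32 LE length + UTF-8 bytes).
fn build_create_agent_ix(program_id: Pubkey, owner: Pubkey, name: &str) -> Instruction {
    // Seed string is an assumption; check the program source.
    let (agent_pda, _bump) =
        Pubkey::find_program_address(&[b"agent", owner.as_ref()], &program_id);

    let mut data = Sha256::digest(b"global:create_agent")[..8].to_vec();
    data.extend((name.len() as u32).to_le_bytes());
    data.extend(name.as_bytes());

    Instruction {
        program_id,
        accounts: vec![
            AccountMeta::new(agent_pda, false),
            AccountMeta::new(owner, true),
            AccountMeta::new_readonly(system_program::id(), false),
        ],
        data,
    }
}
```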
Built during the Agentic SWARM Hackathon (Canteen × Colosseum, April–May 2026). Stack: Rust, Bevy ECS, Anchor, React, Solana devnet.