Sanjeev Kumar

There are MCP servers for building on Solana. I built one for operating the validators underneath.

There are plenty of MCP servers for building on Solana: querying chain data, sending transactions, talking to programs. They all assume the nodes underneath are simply there. But someone has to run those nodes, and for independent operators that someone is doing it by hand. That is the part I spend the most time stressed about.

Agave v3.0 dropped prebuilt validator binaries, so now I build from source. Restart a voting validator during its own leader slots and it skips blocks. Remove the last record from a DNS pool and the endpoint goes dark. Push a binary whose geyser plugin does not match and the node will not start. None of these has an undo button.

MCP made me want to hand that work to an agent. The hard part was never wiring up tools. It was making sure an agent operating a live validator could not break it. So I built solfleet, an MCP server (and CLI) for operating independent Solana validators and RPC nodes, designed so an agent can drive it without being able to cause an outage by accident.

What it does

solfleet describes a fleet across devnet, testnet, and mainnet in one config file, and exposes a set of operations: Solana-aware status, in-place upgrades that build agave from source and distribute it, voting-validator provisioning, and health-driven DNS failover. The read operations are open. Every operation that changes a node is dry-run by default, checked against a policy, written to an audit log, and never goes near a keypair.

What it looks like

Status is the first place the design shows up. A generic health check sees HTTP 200 and calls it healthy. A Solana node can be 500 slots behind and still return 200, so the status is Solana-aware: slot lag against the cluster head, delinquency, and version drift.

CLUSTER  NODE   ROLE  HEALTH  VERSION     SLOT LAG  VOTE
devnet   rpc-1  rpc   ok      4.1.0-rc.1  0         -
devnet   rpc-2  rpc   ok      4.1.0-rc.1  0         -
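
The slot-lag part of that check can be sketched in a few lines. This is a minimal illustration, not solfleet's actual code: the names rpc_call, classify_lag, and node_status are hypothetical, and the 150-slot threshold is an assumed value picked for the example. The only real interfaces used are the Solana JSON-RPC methods getSlot and getVersion.

```python
import json
import urllib.request


def rpc_call(url, method, params=None, timeout=5.0):
    """POST a single JSON-RPC 2.0 request and return its `result` field."""
    payload = json.dumps(
        {"jsonrpc": "2.0", "id": 1, "method": method, "params": params or []}
    ).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    if "error" in body:
        raise RuntimeError(f"{method} failed: {body['error']}")
    return body["result"]


def classify_lag(node_slot, head_slot, max_lag=150):
    """Return (health, lag). max_lag is an assumed threshold, not a solfleet default.

    A node that is ahead of the reference endpoint counts as lag 0.
    """
    lag = max(0, head_slot - node_slot)
    return ("ok" if lag <= max_lag else "behind", lag)


def node_status(node_url, reference_url, max_lag=150):
    """Compare a node's slot with a reference endpoint's slot (needs network)."""
    node_slot = rpc_call(node_url, "getSlot")
    head_slot = rpc_call(reference_url, "getSlot")
    health, lag = classify_lag(node_slot, head_slot, max_lag)
    version = rpc_call(node_url, "getVersion")["solana-core"]
    return {"health": health, "slot_lag": lag, "version": version}
```

With these assumptions, a node that answers HTTP 200 but sits 500 slots behind the reference is classified as behind rather than healthy. Delinquency and version drift would need further checks that this sketch leaves out.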

The more important behavior is what happens when you ask for a change. Asking to upgrade a node does not upgrade it. It returns the ordered plan and the gate decision:

{
  "decision": {
    "operation": "upgrade",
    "node": "rpc-1",
    "mode": "dry-run",
    "allowed": true,
    "plan": [
      "on builder 'build-1': build agave 4.1.0 from source",
      "distribute artifact set to rpc-1; checksum-verify each (abort on mismatch)",
      "stop solana-validator, swap, start",
      "swap /usr/local/bin/agave-validator + geyser .so + version marker atomically",
      "wait until healthy + caught up to https://api.devnet.solana.com",
      "verify reported version == 4.1.0; record before/after"
    ],
    "reasons": ["dry-run: preflight checks pass; pass confirm=true to execute"]
  }
}

To actually run it, the call needs confirm=true. A model that forgets to confirm, or decides it already did, produces a plan, not an outage.
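
The confirm behavior is easy to show in miniature. The sketch below is an assumption about the shape of the idea, not solfleet's implementation: request_upgrade, Decision, and the executor callback are hypothetical names, and the plan strings are shortened. The point is that the executor runs only when the call is both allowed and explicitly confirmed.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Decision:
    operation: str
    node: str
    mode: str
    allowed: bool
    plan: List[str]
    reasons: List[str] = field(default_factory=list)


def request_upgrade(
    node: str,
    version: str,
    preflight_errors: List[str],
    confirm: bool = False,  # dry-run unless the caller opts in
    executor: Optional[Callable[[Decision], None]] = None,
) -> Decision:
    """Always build the plan; run the executor only when allowed AND confirmed."""
    plan = [
        f"build agave {version} from source",
        f"distribute artifact set to {node}; checksum-verify each",
        "swap verified files, cycle the node, wait until caught up",
    ]
    allowed = not preflight_errors
    reasons = list(preflight_errors)
    if allowed and not confirm:
        reasons.append("dry-run: preflight checks pass; pass confirm=true to execute")
    decision = Decision(
        operation="upgrade",
        node=node,
        mode="execute" if confirm else "dry-run",
        allowed=allowed,
        plan=plan,
        reasons=reasons,
    )
    if confirm and allowed and executor is not None:
        executor(decision)
    return decision
```

Calling request_upgrade("rpc-1", "4.1.0", []) with no confirm argument returns a decision in dry-run mode and never reaches the executor, which is the failure mode you want from a forgetful caller.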

Why MCP, not a CLI

I wrote the CLI first, and it still exists. But the reason I reached for MCP is that the work I want help with is conversational and judgment-heavy, not scripted. "Is anything in the fleet behind?" "Plan an upgrade of the Frankfurt RPC to 4.1.0 and tell me what it would do." "One node looks delinquent, what does its vote account show?" Those are questions, and an agent that can call fleet_status, vote_status, and plan_node_upgrade and reason over the results is genuinely useful in a way a flag-heavy CLI is not.

The CLI is the right tool for cron and CI. MCP is the right tool for the operator sitting with the fleet at 2am trying to understand what is actually wrong. solfleet ships both over the same core.

Architecture and deployment

solfleet runs on the operator's machine (or a small VM), not on the nodes. It talks to the fleet over JSON-RPC to read and over SSH to act. Upgrades are build-and-distribute: a dedicated builder host compiles agave together with the ABI-matched Yellowstone geyser plugin (the .so is locked to the agave version, so the two move in lockstep or the node will not start), caches the artifact set, and the executor distributes it. Each target recomputes the sha256 of every file and compares it to the digest computed on the builder before any swap. Only verified files are swapped, atomically, then the node is cycled and watched until it catches up.
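
The verify-then-swap step can be illustrated with the standard library alone. This is a sketch under assumptions, not the project's executor: sha256_of and verify_then_swap are hypothetical names, and it presumes the staged file already sits on the same filesystem as the live one, which is what makes os.replace an atomic rename on POSIX systems.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large binaries are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_then_swap(staged_path, live_path, expected_sha256):
    """Swap staged_path over live_path only if its digest matches the builder's.

    On mismatch nothing is touched and ValueError is raised.
    """
    actual = sha256_of(staged_path)
    if actual != expected_sha256:
        raise ValueError(
            f"checksum mismatch for {staged_path}: "
            f"expected {expected_sha256}, got {actual}"
        )
    # Atomic on POSIX when both paths are on the same filesystem.
    os.replace(staged_path, live_path)
```

Staging next to the live path and renaming means a reader sees either the old file or the new one, never a partial write. Swapping several files as one set (binary, plugin, version marker) needs more than this, for example a versioned directory behind a symlink.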

The MCP server uses the stdio transport, so it runs as a local process next to your keys and config. There is no hosted endpoint holding your fleet's access.

Why this is differentiated

The interesting part is not that it manages Solana nodes. It is the set of rules that make it safe to put in front of an agent, and those rules are the whole point:

  • Dry-run by default. Every mutation returns its plan and preflight and changes nothing without confirm=true. This single default removes most of the risk of pointing an agent at the tool.
  • A policy gate that runs in both modes. It checks per-cluster allowed versions, a disk-free floor, and a minimum leader-free window for voting validators. The gate runs the same checks in dry-run, so the preview predicts whether the real run would be allowed instead of leaving policy failures to surface at execute time.
  • It never touches keys. solfleet does not read, move, or generate identity or vote keypairs. Voting-validator identity failover is deliberately out of scope, because automating it invites double-signing.
  • Everything is audited. Every call, dry-run or execute, goes to a SQLite log: operation, node, decision, reasons.
  • The domain knowledge is in the safety, not just the status. Restarts are leader-aware. The failover loop refuses to ever empty a pool, even if every member is failing, because serving a degraded node beats serving NXDOMAIN.
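
The never-empty-a-pool rule from the last bullet is small enough to sketch. The function below is hypothetical (plan_pool is not a solfleet name), and what to serve when every member is failing is an assumption: here the pool is simply left as it is.

```python
def plan_pool(members, healthy):
    """Decide which DNS records stay in the pool.

    members: current pool members, in order.
    healthy: set of members that pass the health check.
    Returns (records_to_serve, guard_triggered).
    """
    keep = [m for m in members if m in healthy]
    if keep:
        return keep, False
    # Every member is failing. Removing them all would leave the name with
    # no records, so (by assumption) the pool is left unchanged instead.
    return list(members), True
```

For example, plan_pool(["rpc-1", "rpc-2"], set()) keeps both records and reports that the guard fired, so a degraded node keeps answering instead of the name resolving to nothing.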

That last category is what a generic "run commands over SSH" MCP server cannot give you. The safety has to understand validators to be safe.
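
To make the gate and audit rules concrete, here is a minimal sketch of a policy check that runs identically in both modes and records every call in SQLite. Everything in it is an assumption for illustration: the Policy fields, the evaluate and gate functions, the table layout, and the example thresholds are not taken from solfleet's source.

```python
import json
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    allowed_versions: frozenset
    min_disk_free_gb: float
    min_leader_free_slots: int  # applies to voting validators only


def evaluate(policy, version, disk_free_gb, slots_until_leader=None):
    """Return the reasons to deny; an empty list means allowed.

    slots_until_leader is None for nodes that never lead (plain RPC nodes).
    """
    reasons = []
    if version not in policy.allowed_versions:
        reasons.append(f"version {version} is not allowed on this cluster")
    if disk_free_gb < policy.min_disk_free_gb:
        reasons.append(f"disk free {disk_free_gb} GB is below the floor")
    if (
        slots_until_leader is not None
        and slots_until_leader < policy.min_leader_free_slots
    ):
        reasons.append(f"next leader slot is only {slots_until_leader} slots away")
    return reasons


def open_audit(path=":memory:"):
    """Open (or create) the audit log; the schema here is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS audit ("
        "ts TEXT DEFAULT CURRENT_TIMESTAMP, operation TEXT, node TEXT, "
        "mode TEXT, allowed INTEGER, reasons TEXT)"
    )
    return conn


def gate(conn, policy, operation, node, confirm=False, **facts):
    """Run the same checks in dry-run and execute, and audit both."""
    reasons = evaluate(policy, **facts)
    allowed = not reasons
    mode = "execute" if confirm else "dry-run"
    with conn:  # commits the insert, or rolls back if it raises
        conn.execute(
            "INSERT INTO audit (operation, node, mode, allowed, reasons) "
            "VALUES (?, ?, ?, ?, ?)",
            (operation, node, mode, int(allowed), json.dumps(reasons)),
        )
    return allowed, reasons
```

Because confirm only changes the recorded mode and never the checks, a dry-run and an execute over the same facts return the same verdict. The facts themselves can still move between the two calls, for example as a leader slot gets closer.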

What's next

The read path, upgrades, provisioning, and the DNS driver are tested live on a disposable devnet node and a real Cloudflare zone. The autonomous failover loop and the Route53 driver are still unit-tested only, and an HTTP transport (for a team sharing one server over a tailnet) is the next milestone. The README has an honest status breakdown of what is proven live versus not.

It is open source under Apache-2.0, on PyPI, and in the official MCP registry:

pipx install solfleet

Source and design notes: github.com/sanjeevkkansal/solfleet.

If you operate your own validators, I would like to hear how you handle upgrades and failover today, and what a tool like this would have to do before you would let an agent near it. And if you have a story about a validator upgrade that went sideways, those are the cases I want to design against.
