Miroslav Šotek

Posted on Jun 8

I Built a Tokamak Control System Alone. Here's What I Learned About Writing Software That Can't Fail.

#rust #fusionenergy #formalverification #lowlatencycontrolsystems

What happens when you apply production-grade security, formal verification, and 3,300+ tests to a physics problem most people solve with MATLAB scripts.

The Audacity

A year ago, I started building SCPN-Control — a framework for controlling tokamak fusion reactors. By myself. As a solo developer.

If you know anything about fusion energy, you know this is absurd. Tokamak control is the domain of national labs, billion-dollar collaborations, and teams of fifty physicists. The DIII-D Plasma Control System has been in production for twenty years. RAPTOR at EPFL runs on actual hardware at TCV. OMFIT at General Atomics has a thousand users.

I have none of that. I have a laptop, a workstation (a former mining rig), two rather old servers, a GitHub account, and an unhealthy tolerance for partial differential equations.

But I also have something those projects don't: a neuro-symbolic controller architecture with formal verification, end-to-end differentiable physics, and a safety-case infrastructure that treats a research codebase like flight software.

This post isn't a sales pitch. It's a technical autopsy of what happens when you build production-grade scientific software without a production team — and why the result might matter even if it never touches a real tokamak.

What It Actually Is

SCPN-Control is a multi-layered framework. At the bottom: physics kernels. At the top: a compiler that turns formal specifications into stochastic computing circuits. In between: everything.

Layer 1: The Physics Kernel

The core/ module solves the Grad-Shafranov equation — the elliptic PDE that describes magnetohydrodynamic equilibrium in a tokamak. This isn't a toy implementation. It supports:

Fixed and free-boundary solvers (SOR, multigrid, Anderson acceleration, Newton-Kantorovich)
H-mode pedestal profiles via the modified hyperbolic tangent (mtanh)
Coil optimization with shape, X-point, and divertor target constraints
Toroidal 1/R stencil (not the naive Cartesian Laplacian that most educational codes use)

The solver is written in Python with a Rust acceleration backend, but it falls back to Python gracefully if the native library isn't available. This matters because fusion researchers run code on everything from laptops to HPC clusters.

# From src/scpn_control/core/fusion_kernel.py
# The GS* operator — note the sign-corrected toroidal term
a_E = 1.0 / dR2 - 1.0 / (2.0 * R_safe * self.dR)   # East coefficient
a_W = 1.0 / dR2 + 1.0 / (2.0 * R_safe * self.dR)   # West coefficient
a_N = 1.0 / dZ2                                      # North
a_S = 1.0 / dZ2                                      # South
a_C = -2.0 * (1.0 / dR2 + 1.0 / dZ2)                # Center

The free-boundary solver is experimental and documented as such. The fixed-boundary solver is robust enough to generate equilibria for controller testing.
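One cheap way to check the sign convention in the stencil above is to apply it to functions whose GS* image is known in closed form. The sketch below is a standalone illustration, not code from the repository: `gs_star_at` is a hypothetical helper that evaluates the five-point stencil for a callable `psi(R, Z)` at a single point. With GS* defined as d²ψ/dR² − (1/R) dψ/dR + d²ψ/dZ², the test function ψ = R² should map to 0 and ψ = Z² should map to 2.

```python
def gs_star_at(psi, R, Z, dR, dZ):
    """Apply the five-point GS* stencil to a callable psi(R, Z) at one point."""
    dR2, dZ2 = dR * dR, dZ * dZ
    a_E = 1.0 / dR2 - 1.0 / (2.0 * R * dR)   # East: minus sign from -(1/R) d/dR
    a_W = 1.0 / dR2 + 1.0 / (2.0 * R * dR)   # West
    a_N = a_S = 1.0 / dZ2                    # North / South
    a_C = -2.0 * (1.0 / dR2 + 1.0 / dZ2)     # Center
    return (a_E * psi(R + dR, Z) + a_W * psi(R - dR, Z)
            + a_N * psi(R, Z + dZ) + a_S * psi(R, Z - dZ)
            + a_C * psi(R, Z))


# psi = R^2: d2/dR2 gives 2, -(1/R) d/dR gives -2, so GS* psi = 0.
residual_R2 = gs_star_at(lambda R, Z: R * R, 1.7, 0.3, 0.01, 0.01)

# psi = Z^2: only the d2/dZ2 term survives, so GS* psi = 2.
value_Z2 = gs_star_at(lambda R, Z: Z * Z, 1.7, 0.3, 0.01, 0.01)
```

Central differences are exact for quadratics, so both checks hold to rounding error. If the East and West signs were swapped, the R² case would return 4 instead of 0, which makes this a useful regression test.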

Layer 2: The Controller Stack

The control/ module implements five NMPC (Nonlinear Model Predictive Control) backends:

Internal — projected gradient descent (fallback, zero dependencies)
SciPy — SLSQP for general nonlinear problems
OSQP — first-order ADMM for sparse, real-time problems
CasADi — IPOPT with exact Hessian for research
acados — structure-exploiting SQP with HPIPM, the solver used in autonomous vehicles and robotics

The acados integration is the one I care about. It uses an augmented-state formulation for slew-rate constraints, validates dynamics residuals post-solve, and fails closed if the solver status isn't zero.

# From src/scpn_control/control/nmpc_controller.py
# Augmented state: [x_k, u_{k-1}] for slew-rate constraints
model.disc_dyn_expr = ca.vertcat(x_next, u)  # u becomes next u_last
model.con_h_expr = u - u_last                # slew-rate constraint
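The augmented-state idea can be shown without CasADi or acados. The following is a minimal sketch in plain Python, assuming a hypothetical scalar plant; `augmented_step` and `slew_within_limit` are illustrative names, not functions from the project. The state carries the previous input along, so the slew `u_k - u_{k-1}` becomes an ordinary function of the current state and input that a solver can bound.

```python
def augmented_step(dynamics, x_aug, u, n_x):
    """Advance the augmented state [x_k, u_{k-1}] and return the slew."""
    x, u_last = x_aug[:n_x], x_aug[n_x:]
    x_next = dynamics(x, u)
    slew = [u_i - u_prev for u_i, u_prev in zip(u, u_last)]
    # The applied input is stored so it becomes u_last at the next step.
    return list(x_next) + list(u), slew


def slew_within_limit(slew, max_step):
    """True when every input moved by at most max_step in this step."""
    return all(abs(s) <= max_step for s in slew)


# Hypothetical scalar plant: x_next = 0.9 * x + 0.1 * u
plant = lambda x, u: [0.9 * x[0] + 0.1 * u[0]]

x_aug = [1.0, 0.0]                      # x_0 = 1.0, u_{-1} = 0.0
x_aug, slew = augmented_step(plant, x_aug, [0.5], n_x=1)
```

In an NMPC formulation the check in `slew_within_limit` would be expressed as bounds on the constraint expression rather than evaluated after the fact.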

But the part that's actually unique is the transport gradient tuning. Using JAX, the controller can backpropagate through the gyrokinetic transport model to optimize source schedules. This means the MPC doesn't just optimize control actions — it optimizes the physics model it's using.

# JAX autodiff through transport — unique in fusion control
def tune_transport_coefficients_for_tracking(self, ...):
    # Gradient of tracking error w.r.t. transport coefficients
    # Finite-difference audit enforced by default

No other fusion control framework has end-to-end differentiable transport. This is either brilliant or insane, and I'm not sure which yet.
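A finite-difference audit of the kind mentioned in the excerpt can be illustrated on a toy problem. This sketch makes several assumptions: the 0-D relaxation model, the target value, and all function names are hypothetical, and a hand-derived sensitivity recursion stands in for JAX autodiff so the example needs only the standard library. The gradient of the tracking error with respect to a transport coefficient is compared against a central difference and accepted only if the two agree.

```python
def simulate(chi, source=1.0, dt=0.1, steps=50):
    """Toy 0-D transport: T_{k+1} = T_k + dt * (source - chi * T_k)."""
    T, history = 0.0, []
    for _ in range(steps):
        T = T + dt * (source - chi * T)
        history.append(T)
    return history


def tracking_error(chi, target=2.0):
    return sum((T - target) ** 2 for T in simulate(chi))


def tracking_error_grad(chi, target=2.0, source=1.0, dt=0.1, steps=50):
    """Exact gradient, propagating the sensitivity dT/dchi alongside T."""
    T, dT, grad = 0.0, 0.0, 0.0
    for _ in range(steps):
        dT = dT * (1.0 - dt * chi) - dt * T   # uses T_k, before the update
        T = T + dt * (source - chi * T)
        grad += 2.0 * (T - target) * dT
    return grad


def finite_difference_audit(chi, eps=1e-6, rtol=1e-4):
    """Compare the exact gradient with a central difference."""
    fd = (tracking_error(chi + eps) - tracking_error(chi - eps)) / (2.0 * eps)
    exact = tracking_error_grad(chi)
    passed = abs(fd - exact) <= rtol * max(1.0, abs(exact))
    return passed, fd, exact


audit_passed, fd_grad, exact_grad = finite_difference_audit(0.5)
```

Running the audit by default, as the excerpt describes, means a broken gradient fails loudly before an optimizer ever consumes it.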

Layer 3: The Neuro-Symbolic Compiler

This is the part that makes SCPN-Control different from every other plasma control project.

The scpn/ module implements a Stochastic Petri Net compiler. You write a control policy as a Petri net — places, transitions, inhibitor arcs, timing delays — and the compiler turns it into a stochastic computing circuit that runs on deterministic bitstreams.

Why stochastic computing? Because it's inherently fault-tolerant. A single bit-flip in a stochastic bitstream of length N shifts the encoded value by at most 1/N, whereas a flip in the high-order bit of a binary word can change the value by half its range. For a fusion reactor where a controller failure means a $2 billion machine eats itself, this matters.
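The fault-tolerance argument comes down to simple arithmetic, sketched here with hypothetical helper names. A unipolar stochastic stream encodes a value as the fraction of ones, so one flipped bit moves the value by exactly 1/N. A flipped most-significant bit in a binary fixed-point word moves it by half the representable range.

```python
def stochastic_value(bits):
    """Unipolar stochastic encoding: value = fraction of ones."""
    return sum(bits) / len(bits)


def binary_value(bits):
    """Unsigned fixed-point fraction, most significant bit first."""
    return sum(b * 2.0 ** -(i + 1) for i, b in enumerate(bits))


def flip(bits, index):
    """Return a copy of bits with one position inverted."""
    out = list(bits)
    out[index] ^= 1
    return out


stream = [1] * 64 + [0] * 192        # 256-bit stream encoding 0.25
word = [0, 1, 0, 0, 0, 0, 0, 0]      # 8-bit binary fraction encoding 0.25

stream_error = abs(stochastic_value(flip(stream, 0)) - stochastic_value(stream))
word_error = abs(binary_value(flip(word, 0)) - binary_value(word))
```

Here `stream_error` is 1/256 and `word_error` is 0.5. The price is precision per bit: the stream needs 256 bits to resolve what the binary word resolves in 8.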

The compiler produces two paths:

Oracle path: standard floating-point arithmetic for debugging
Stochastic path: bitstream-based computation with antithetic variates

# From src/scpn_control/scpn/controller.py
# Antithetic variates for variance reduction
base = rng.random((n_pairs, self._nT))
low_hits = np.sum(base < p_fire[None, :], axis=0)
high_hits = np.sum(base > (1.0 - p_fire)[None, :], axis=0)
counts = low_hits + high_hits
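To see why pairing each uniform draw `u` with its mirror `1 - u` helps, the sketch below compares the spread of an antithetic estimate of a firing probability against an estimate built from independent draws. It is a pure-Python illustration with hypothetical function names and an arbitrary seed, not the project's implementation. The test `u > 1 - p` is the same event as `(1 - u) < p`, which is how the excerpt counts the mirrored sample.

```python
import random
import statistics


def antithetic_estimate(p, n_pairs, rng):
    """Estimate p from n_pairs draws, each paired with its mirror 1 - u."""
    hits = 0
    for _ in range(n_pairs):
        u = rng.random()
        hits += (u < p) + (u > 1.0 - p)
    return hits / (2 * n_pairs)


def independent_estimate(p, n_pairs, rng):
    """Estimate p from 2 * n_pairs independent draws."""
    hits = sum(rng.random() < p for _ in range(2 * n_pairs))
    return hits / (2 * n_pairs)


rng = random.Random(1234)            # seeded, so the run is reproducible
p_fire = 0.4
anti = [antithetic_estimate(p_fire, 50, rng) for _ in range(2000)]
indep = [independent_estimate(p_fire, 50, rng) for _ in range(2000)]

var_anti = statistics.pvariance(anti)
var_indep = statistics.pvariance(indep)
```

For p at or below 0.5 the two events in a pair cannot both occur, so the paired samples are negatively correlated and the estimator variance drops from p(1 − p)/(2n) to p(1 − 2p)/(2n) for n pairs.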

The stochastic path is deterministic (seeded RNG), reproducible, and formally verifiable. Which brings me to the part I'm actually proud of.

The Security Story

Scientific software is usually terrible at security. Researchers write code that reads arbitrary files, executes shell commands, and exposes internal state over the network because "it's just for internal use."

I treated this like flight software from day one.

WebSocket Hardening

The phase/ module exposes tokamak state over WebSockets for real-time monitoring. The original implementation was an unauthenticated pipe. The current version has:

Bearer token + API key authentication
Token-bucket rate limiting (20 commands/sec per client)
TLS enforcement with loopback-only default
Browser origin allowlisting
Command allowlisting (set_psi, set_pac_gamma, reset, stop)
64KB payload caps
Timeout-based backpressure with explicit disconnect counters

# From src/scpn_control/phase/ws_phase_stream.py
def _bucket_rate_limited(self, buckets, key, now):
    capacity = float(self.command_rate_limit)
    refill_rate = capacity / self.command_rate_window_s
    updated_at, tokens = buckets.get(key, (now, capacity))
    elapsed = max(0.0, now - updated_at)
    tokens = min(capacity, tokens + elapsed * refill_rate)
    limited = tokens < 1.0
    if not limited:
        tokens -= 1.0
    buckets[key] = (now, tokens)
    return limited
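The behaviour of this limiter is easiest to see with a simulated clock. The sketch below restates the method as a free function so it runs on its own; the defaults of 20 commands per 1-second window mirror the numbers quoted above, while the function signature and client keys are assumptions made for the example.

```python
def bucket_rate_limited(buckets, key, now, rate_limit=20, window_s=1.0):
    """Token bucket: returns True when the request should be rejected."""
    capacity = float(rate_limit)
    refill_rate = capacity / window_s
    updated_at, tokens = buckets.get(key, (now, capacity))
    elapsed = max(0.0, now - updated_at)          # ignore clock going backwards
    tokens = min(capacity, tokens + elapsed * refill_rate)
    limited = tokens < 1.0
    if not limited:
        tokens -= 1.0
    buckets[key] = (now, tokens)
    return limited


buckets = {}
# 21 commands arriving at the same instant from one client.
burst = [bucket_rate_limited(buckets, "client-a", now=0.0) for _ in range(21)]
# 100 ms later the bucket has refilled by two tokens.
after_refill = bucket_rate_limited(buckets, "client-a", now=0.1)
# A different client has its own full bucket.
other_client = bucket_rate_limited(buckets, "client-b", now=0.0)
```

The first 20 commands of the burst pass, the 21st is rejected, and the client is served again once time has advanced enough to refill at least one token.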

C++ Compilation Sandbox

The Rust/C++ solver is compiled on demand if the prebuilt binary isn't available. The compilation path now has:

SHA-256 source verification against a manifest
hmac.compare_digest for timing-safe comparison
Stack canaries (-fstack-protector-strong)
Full RELRO binding (-Wl,-z,relro -Wl,-z,now)
-mtune=generic instead of -march=native (CPU feature leak eliminated)
120-second compilation timeout
Minimal environment (only PATH, TMPDIR, SystemRoot preserved)
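The items above could be combined roughly as follows. This is a sketch under stated assumptions rather than the project's build code: the function names, the `-shared -fPIC` flags, and the calling convention are hypothetical, and only the hardening flags, the 120-second timeout, and the three preserved environment variables come from the list. The source bytes are hashed and compared in constant time before the compiler is invoked.

```python
import hashlib
import hmac
import os
import subprocess

HARDENING_FLAGS = [
    "-fstack-protector-strong",
    "-Wl,-z,relro",
    "-Wl,-z,now",
    "-mtune=generic",
]


def source_matches_manifest(source_bytes, expected_sha256_hex):
    """Timing-safe comparison of the source digest with the manifest entry."""
    actual = hashlib.sha256(source_bytes).hexdigest()
    return hmac.compare_digest(actual, expected_sha256_hex.lower())


def minimal_env():
    """Pass through only the variables the compiler needs to run."""
    keep = ("PATH", "TMPDIR", "SystemRoot")
    return {name: os.environ[name] for name in keep if name in os.environ}


def compile_verified(compiler, source_path, output_path, expected_sha256_hex):
    """Refuse to compile unless the source matches its manifest digest."""
    with open(source_path, "rb") as handle:
        source_bytes = handle.read()
    if not source_matches_manifest(source_bytes, expected_sha256_hex):
        raise ValueError("Source digest does not match the manifest.")
    # "-shared -fPIC" is an assumption about how the library is built.
    command = [compiler, *HARDENING_FLAGS, "-shared", "-fPIC",
               "-o", output_path, source_path]
    subprocess.run(command, env=minimal_env(), timeout=120, check=True)
```

Passing the command as a list avoids shell interpretation of file names, and `check=True` turns a non-zero compiler exit into an exception instead of a silently missing binary.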

Fault Injection Gating

The stochastic controller has a bit-flip fault injection mode for testing. It requires two independent gates to enable:

# Double-gated — nuclear safety standard
if self._sc_bitflip_rate > 0.0 and not self._allow_fault_injection:
    raise ValueError("sc_bitflip_rate > 0 requires allow_fault_injection=True.")
if self._sc_bitflip_rate > 0.0 and os.environ.get("SCPN_ALLOW_CONTROLLER_FAULT_INJECTION") != "1":
    raise ValueError("sc_bitflip_rate > 0 requires SCPN_ALLOW_CONTROLLER_FAULT_INJECTION=1.")

This is how SCRAM systems work. You don't accidentally enable safety overrides.

Path Traversal Elimination

JSONL logging is constrained to a verified root directory with symlink protection:

def _resolve_jsonl_log_path(log_path, log_root):
    resolved = candidate.resolve(strict=False)
    try:
        resolved.relative_to(root)
    except ValueError as exc:
        raise ValueError("log_path must resolve under log_root.") from exc
    if resolved.suffix != ".jsonl":
        raise ValueError("log_path must use a .jsonl suffix.")

Is this overkill for a research project? Yes. But the discipline carries over. When you write every path resolution as if an attacker controls the input, you stop writing bugs even when no attacker exists.
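The excerpt above omits how `candidate` and `root` are derived. A self-contained version might look like the sketch below, where that setup is an assumption on my part: relative paths are joined onto the root, and both sides are resolved before the containment check so that `..` segments and symlinks cannot escape. The helper `is_rejected` exists only for the demonstration.

```python
import tempfile
from pathlib import Path


def resolve_jsonl_log_path(log_path, log_root):
    """Return the resolved log path, or raise if it escapes log_root."""
    root = Path(log_root).resolve(strict=False)
    candidate = Path(log_path)
    if not candidate.is_absolute():
        candidate = root / candidate          # assumed handling of relative paths
    resolved = candidate.resolve(strict=False)  # collapses ".." and symlinks
    try:
        resolved.relative_to(root)
    except ValueError as exc:
        raise ValueError("log_path must resolve under log_root.") from exc
    if resolved.suffix != ".jsonl":
        raise ValueError("log_path must use a .jsonl suffix.")
    return resolved


def is_rejected(log_path, log_root):
    try:
        resolve_jsonl_log_path(log_path, log_root)
    except ValueError:
        return True
    return False


with tempfile.TemporaryDirectory() as tmp:
    accepted_name = resolve_jsonl_log_path("runs/shot.jsonl", tmp).name
    traversal_rejected = is_rejected("../escape.jsonl", tmp)
    suffix_rejected = is_rejected("shot.txt", tmp)
```

Resolving before checking is the important ordering: a containment test on the unresolved string would accept `logs/../../etc/passwd.jsonl`.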

The Formal Verification Layer

This is where I think SCPN-Control is genuinely ahead of the curve — not just for fusion, but for control systems in general.

The scpn/ module includes a Z3 bounded model checking integration for compiled Petri net controllers. It proves:

Place invariants: Markings never exceed bounds
Temporal response: If condition A fires, condition B responds within N steps
Recurrence: Certain states are always revisited
Exclusivity: Conflicting actions never fire simultaneously
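The place-invariant check can be illustrated on a two-place net. To keep the example dependency-free, the sketch below uses exhaustive breadth-first search over reachable markings instead of a Z3 encoding, so it is a stand-in for the idea of a bounded check with a counterexample, not the project's SMT-based method. The net, the transition names, and `check_place_bounds` are all hypothetical.

```python
from collections import deque


def check_place_bounds(initial, transitions, bounds, max_steps):
    """Bounded check: None if all markings reachable within max_steps respect
    the bounds, otherwise the firing sequence that reaches a violation."""
    start = tuple(initial)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        marking, path = queue.popleft()
        if any(m > b for m, b in zip(marking, bounds)):
            return path                                  # counterexample
        if len(path) == max_steps:
            continue
        for name, (consume, produce) in transitions.items():
            if all(m >= c for m, c in zip(marking, consume)):
                nxt = tuple(m - c + p
                            for m, c, p in zip(marking, consume, produce))
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, path + [name]))
    return None


# Places: (idle, active). Each transition is (tokens consumed, tokens produced).
safe_net = {"start": ((1, 0), (0, 1)), "stop": ((0, 1), (1, 0))}
leaky_net = dict(safe_net, leak=((0, 1), (0, 2)))

safe_result = check_place_bounds((1, 0), safe_net, bounds=(1, 1), max_steps=10)
leaky_result = check_place_bounds((1, 0), leaky_net, bounds=(1, 1), max_steps=10)
```

The safe net yields `None`, while the leaky net yields the shortest violating firing sequence. An SMT-based checker answers the same question symbolically, which scales to nets whose state space is too large to enumerate.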

Each proof produces a manifest with SHA-256 digests, schema versioning, and mandatory counterexample paths for failed proofs. The safety-case infrastructure requires:

Formal controller proof (Z3/SMT)
Audited differentiable-transport evidence (JAX + finite-difference validation)
Digital-twin update evidence (TRANSP/TSC backed)
All bound to a canonical controller artifact digest
Explicit readiness gate — blocked until all evidence is present

# Safety-case admission — fail-closed
if not manifest.has_all_required_evidence():
    raise SafetyCaseNotReadyError("Controller artifact lacks required evidence.")
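A fail-closed gate of this shape can be sketched with a small dataclass. Only `has_all_required_evidence` and `SafetyCaseNotReadyError` appear in the excerpt; the evidence kinds, field names, and `admit` function below are hypothetical. Each piece of evidence records the artifact digest it was produced for, so evidence generated against an older controller build does not count.

```python
from dataclasses import dataclass, field

# Hypothetical labels for the three evidence types listed above.
REQUIRED_EVIDENCE = ("formal_proof", "transport_audit", "digital_twin_update")


class SafetyCaseNotReadyError(RuntimeError):
    pass


@dataclass
class SafetyCaseManifest:
    artifact_digest: str
    # Maps evidence kind -> digest of the artifact the evidence was made for.
    evidence: dict = field(default_factory=dict)

    def has_all_required_evidence(self):
        return all(self.evidence.get(kind) == self.artifact_digest
                   for kind in REQUIRED_EVIDENCE)


def admit(manifest):
    """Fail closed: anything missing or stale blocks admission."""
    if not manifest.has_all_required_evidence():
        raise SafetyCaseNotReadyError(
            "Controller artifact lacks required evidence.")
    return True


complete = SafetyCaseManifest(
    "sha256:aaa", {kind: "sha256:aaa" for kind in REQUIRED_EVIDENCE})
stale = SafetyCaseManifest(
    "sha256:bbb", {kind: "sha256:aaa" for kind in REQUIRED_EVIDENCE})
```

Binding evidence to a digest means rebuilding the controller invalidates the safety case automatically, which is the property a readiness gate needs.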

This follows the kind of safety-case documentation discipline that 10 CFR Part 50 imposes on nuclear facilities. For a solo project. Written by one person.

I don't know if this makes me disciplined or delusional. But I know of no other fusion control project that has formal verification of controller logic.

The Honesty Problem

Here's where I stop talking about what's good and tell you what's broken.

49 out of 50 physics fidelity gaps are still open.

The ROADMAP explicitly states:

"Local-dispersion path overpredicts the GENE CBC reference"
"Latest 2000-step adiabatic run did not reach saturated chi_i"
"Do not publish a saturated CBC chi_i value until the longer campaign passes"
"Must not be presented as quantitative cross-code agreement"

The native TGLF-equivalent transport model exists. The nonlinear gyrokinetic solver exists. But they haven't been quantitatively compared against real TGLF or GENE runs. The infrastructure is there — interfaces to GACODE, GENE, GS2, CGYRO, QuaLiKiz — but the evidence isn't.

There is no real hardware timing evidence yet.

The E2E latency benchmark infrastructure is tamper-evident and schema-versioned, but all measurements are synthetic. I haven't run the control loop on a Raspberry Pi 4 or Jetson Nano and published p50/p95/p99 distributions. The <1ms real-time claim is theoretical.
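Collecting the missing distribution is mechanically simple. The sketch below times a placeholder workload with `time.perf_counter_ns` and reports nearest-rank percentiles. The function names and the workload are assumptions, and numbers from a desktop Python loop say nothing about the real control path on target hardware.

```python
import math
import time


def nearest_rank_percentile(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def measure_latencies_ns(step, iterations):
    """Time each call to step() individually, in nanoseconds."""
    samples = []
    for _ in range(iterations):
        start = time.perf_counter_ns()
        step()
        samples.append(time.perf_counter_ns() - start)
    return samples


def summarize(samples):
    return {name: nearest_rank_percentile(samples, pct)
            for name, pct in (("p50", 50), ("p95", 95), ("p99", 99))}


# Placeholder workload standing in for one control-loop iteration.
latencies = measure_latencies_ns(lambda: sum(range(100)), 1000)
report = summarize(latencies)
```

Tail percentiles matter more than the median here: a real-time claim is only as good as the p99 and the worst case under load.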

The H-mode Newton Jacobian is incomplete.

The fixed-boundary Newton solver uses a Jacobian derived for L-mode linear profiles. For H-mode mtanh profiles, the Jacobian is wrong, which means convergence is unreliable for the most physically relevant regime.
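One building block for that fix is the analytic derivative of the pedestal function. The sketch below assumes a commonly used mtanh form, ((1 + αz)e^z − e^(−z)) / (e^z + e^(−z)). The project's exact parameterization is not shown in this post, so treat the formula and the function names as assumptions. Rewriting the form as tanh(z) + αz·σ(2z), with σ the logistic function, makes the derivative easy to write down and to audit against a central difference.

```python
import math


def _sigmoid_2z(z):
    """e^z / (e^z + e^-z), written as a logistic function of 2z."""
    return 1.0 / (1.0 + math.exp(-2.0 * z))


def mtanh(z, alpha):
    """Assumed form: ((1 + alpha*z) e^z - e^-z) / (e^z + e^-z)."""
    return math.tanh(z) + alpha * z * _sigmoid_2z(z)


def mtanh_dz(z, alpha):
    """Analytic derivative of mtanh with respect to z."""
    sig = _sigmoid_2z(z)
    sech2 = 1.0 - math.tanh(z) ** 2
    return sech2 + alpha * sig + 2.0 * alpha * z * sig * (1.0 - sig)


def max_fd_mismatch(alpha, points, h=1e-6):
    """Largest gap between the analytic and central-difference derivative."""
    return max(
        abs((mtanh(z + h, alpha) - mtanh(z - h, alpha)) / (2.0 * h)
            - mtanh_dz(z, alpha))
        for z in points
    )


mismatch = max_fd_mismatch(alpha=0.1, points=[-2.0, -0.5, 0.0, 0.7, 2.5])
```

How this derivative enters the Newton Jacobian depends on how the pressure and current profiles are built from mtanh, which is outside what this sketch covers. For very negative `z` the exponential in `_sigmoid_2z` can overflow, so a production version would need a numerically stable branch.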

I document all of this. The ROADMAP is 50+ entries of "not done yet." But documentation doesn't close gaps. Only work does.

What I Learned About Building Software Alone

If you're a solo developer building something ambitious, here's what actually matters:

1. Test Coverage Is a Forcing Function

I have 3,300+ tests and 99%+ coverage. Not because I'm a testing zealot, but because without a team to catch my mistakes, the tests are the team. The CI runs 25 jobs across Linux, Windows, and macOS. Every PR gate fails if coverage drops by 0.1%.

The ratchet effect is real: once you have 99% coverage, you can't justify a lazy commit that drops it to 98%. The number forces discipline.

2. Security Hardening Is Just Input Validation at Scale

Every security fix I implemented was fundamentally about validating assumptions:

Is this file path where I think it is?
Is this config value finite and positive?
Is this library the one I compiled?

When you write every function as if the caller is malicious, you write better code even for internal APIs. The WebSocket hardening made the streaming layer more robust against network partitions. The C++ compilation sandbox caught a -march=native portability bug. Security and correctness are the same thing viewed from different angles.

3. Type Hints Are Documentation That Doesn't Lie

The entire codebase uses from __future__ import annotations and strict type hints. Pydantic v2 models validate configs at the boundary. This isn't just for IDE autocomplete — it's for catching physics bugs.

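When plasma_current_target is typed as float and validated as > 0, you can't accidentally pass a negative current or a non-numeric string. (Pydantic's default lax mode still coerces numeric strings such as "1.5" to float; strict mode rejects every string.) In scientific computing where a sign error means the plasma goes the wrong way, this matters.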
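The boundary check itself is small. To keep the example free of third-party dependencies, the sketch below uses a standard-library dataclass as a stand-in for a Pydantic model. The class name is hypothetical, and the behaviour shown (rejecting all strings) corresponds to strict validation rather than Pydantic's default coercion.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PlasmaTargets:
    plasma_current_target: float   # units are not stated in the post

    def __post_init__(self):
        value = self.plasma_current_target
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise TypeError("plasma_current_target must be a real number.")
        if not math.isfinite(value) or value <= 0.0:
            raise ValueError("plasma_current_target must be finite and > 0.")


def rejects(value):
    """True when constructing the config with this value fails validation."""
    try:
        PlasmaTargets(plasma_current_target=value)
    except (TypeError, ValueError):
        return True
    return False
```

The finiteness check matters as much as the sign check: `nan > 0` is false but so is `nan <= 0`, so a bare `value <= 0` test would let NaN through.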

4. Honest Documentation Builds Trust Faster Than Hype

The ROADMAP says "49 open fidelity gaps." The README says "requires external resources for experimental validation." The code says raise ValueError("degenerate equilibrium") instead of silently returning NaN.

This is terrible marketing. It's excellent engineering. The fusion community is skeptical of unvalidated claims — with good reason. I'd rather have ten people trust the code because the limitations are explicit than a thousand people distrust it because I oversold.

5. The "Kitchen Sink" Problem Is Real (for now)

The project has 57+ modules. Equilibrium, transport, MHD, edge physics, neural nets, RL, stochastic computing, FPGA export, real-time control, federated learning, formal verification.

Each of these is a career. Together, they're a maintenance burden. The risk isn't that any single module is wrong — it's that the integration surface becomes too large to validate.

My current strategy: extract a scpn-core package with just the GS solver + basic controllers, and keep scpn-control as the full framework. The Unix philosophy applies even to tokamaks.

Where This Goes Next

The immediate priorities are unglamorous:

Hardware timing evidence. Run the E2E benchmark on a Raspberry Pi 4. Measure p50/p95/p99. Publish the results.
Close 5-10 physics fidelity gaps. Install GACODE. Run native TGLF-equivalent vs real TGLF. Document agreement.
Fix the H-mode Jacobian. Implement mtanh derivative for the Newton solver.

The medium-term goals are where the project gets interesting:

End-to-end differentiable scenario: Couple JAX GS solver → differentiable transport → NMPC. Gradient-through-equilibrium is genuinely unique — no one has this.
Certified neuro-symbolic control: Expand Z3 proofs to CTL/LTL specifications. Auto-generate safety certificates. This could be the basis for ITER safety documentation.
Cross-facility federated learning: Extend the FedAvg/FedProx disruption predictor with differential privacy guarantees. Multi-site learning without data sharing.

Why You Should Care (Even If You Don't Care About Fusion)

If you're a software engineer, SCPN-Control is a case study in what happens when you apply production discipline to a research problem. Most scientific code is written to produce a paper. This is written to produce a system.

If you're a physicist, the neuro-symbolic controller architecture is a genuinely new approach to safety-critical control. Stochastic computing + formal verification + differentiable physics is a combination that doesn't exist anywhere else.

If you're a solo developer wondering if you can build something that competes with teams of fifty people: you can. But you have to be more disciplined than they are. You don't have a colleague to catch your sign errors. You have tests, types, and the humility to document what you don't know.

The Code

SCPN-Control is open source under AGPL-3.0-or-later with commercial licensing available.

GitHub: github.com/anulum/scpn-control
Docs: README + ROADMAP (the ROADMAP is the most honest document I've ever written)
Tests: pytest with 3,300+ cases, 99%+ coverage
Install: pip install scpn-control (with optional dependencies for acados, JAX, etc.)

I'm not asking for stars (though I don't mind them). I'm asking for scrutiny. If you know plasma physics, tear apart the transport solver. If you know control theory, break the NMPC. If you know security, find the holes I missed.

The project is only useful if it's correct. And it's only correct if people prove it wrong.

I appreciate shares, likes, contributions, and sponsorship or donations to keep me going. We are open to collaboration.

Miroslav Šotek builds software for problems that are supposed to require teams. He is usually wrong about how hard things are, but occasionally right about how to build them.

DEV Community