Ruslan Manov

Posted on Apr 6

Building a Drone Swarm Orchestrator That Gets Scared — 20 Subsystems in 35K Lines of Rust

#rust #robotics #opensource #algorithms


The Observation That Started Everything

A few years ago I was working on a particle filter for tracking hidden state in financial time series — the usual quantitative trading toolkit. Particles representing possible market regimes, a Bayesian update step when new price data arrives, a resampling step to prevent weight degeneracy. Standard stuff.

Then someone asked me to look at a drone swarm coordination problem. I expected something completely different. What I found instead was the same math, wearing different clothes.

Drones tracking uncertain positions in 3D space? That's a particle filter, same as tracking a hidden volatility regime. Allocating scarce drone resources to competing tasks? That's a combinatorial auction, same as portfolio optimization under constraints. Protecting the swarm against catastrophic attrition? That's drawdown protection, same as risk management in a leveraged portfolio. The math didn't change. Only the domain changed.

That observation became STRIX: a 34,889-line Rust + 7,493-line Python drone swarm orchestration library (~42,400 LOC total) that treats the battlefield as a market, implements swarm coordination from ant colony research, and uses a "fear meta-parameter" borrowed from behavioral economics to modulate every subsystem in real time. Designed to scale toward 2000+ drones.

This article is a deep dive into the architecture, the technical decisions, and what 20 subsystems across 9 crates taught me about building complex autonomous systems.

The Problem Space

Drone swarm coordination is hard in a specific way: the difficulty is not computational but architectural. You need to simultaneously solve:

State estimation under uncertainty — where are we, where are threats, what's the ground truth when GPS is jammed?
Task allocation under contention — which drone does which task when you have more tasks than drones and capabilities don't match uniformly?
Safety with formal guarantees — how do you ensure collision avoidance and no-fly zone compliance without a central controller that becomes a single point of failure?
Coordination without centralization — how does the swarm share state when you lose nodes, when comms are degraded, when the network topology changes every second?
Human-in-the-loop — how do you keep a human meaningfully in the decision loop when the swarm acts at millisecond timescales?
Explainability — if the swarm makes a decision you didn't expect, how do you reconstruct exactly why?

Most existing approaches handle one or two of these well. The rest are left as "future work" or "out of scope." STRIX tries to handle all six in a unified architecture.

Architecture: The 10-Step Tick Loop

The core of STRIX is a deterministic tick loop that runs every timestep. Each tick executes exactly 10 steps in order. Every subsystem runs on every tick. No exceptions, no optional steps.

┌─────────────────────────────────────────────────────────────────┐
│                    STRIX TICK LOOP (per drone)                   │
├─────────────────────────────────────────────────────────────────┤
│  Step 1:  EW Threat Scan                                         │
│           Classify: GpsJamming | CommJamming | Spoofing |        │
│                     DirectedEnergy | CyberIntrusion              │
│           Modulate noise params + gossip fanout via fear F∈[0,1] │
├─────────────────────────────────────────────────────────────────┤
│  Step 2:  Dual Particle Filter                                   │
│           Friendly: 200 particles, state [x,y,z,vx,vy,vz]       │
│           Threats:  100 particles, adversarial tracking          │
│           Predict + Measurement update + Resample                │
├─────────────────────────────────────────────────────────────────┤
│  Step 3:  CUSUM Anomaly Detection                               │
│           Per-drone sequential change detection                  │
│           Regime transitions: Patrol → Engage → Evade            │
│           3×3 Markov transition matrix                           │
├─────────────────────────────────────────────────────────────────┤
│  Step 4:  Formation Correction                                   │
│           7 formation types (Vee, Line, Wedge, Column,           │
│           EchelonLeft, EchelonRight, Spread)                     │
│           Proportional control law with deadband                 │
├─────────────────────────────────────────────────────────────────┤
│  Step 5:  Threat Tracker Update                                  │
│           Intent detection: motion pattern → behavior class      │
│           Hysteresis gate prevents classification oscillation    │
│           Adversarial doctrines: PROBING, FEINT, COORDINATED    │
├─────────────────────────────────────────────────────────────────┤
│  Step 6:  ROE Authorization Gate                                 │
│           WeaponsHold | WeaponsTight | WeaponsFree               │
│           Pipeline: classify → IFF confidence → collateral risk  │
│           CVaR risk scoring integration                          │
├─────────────────────────────────────────────────────────────────┤
│  Step 7:  Combinatorial Task Auction  [strix-auction]            │
│           Drones bid on tasks; winner-takes-assignment           │
│           Kill-zone repricing after losses                       │
│           Dark pool compartmentalization for classified tasks     │
├─────────────────────────────────────────────────────────────────┤
│  Step 8:  Gossip State Propagation  [strix-mesh]                 │
│           O(log N) convergence, priority-queued messages         │
│           Pheromone update: Danger/Explored/Rally/Resource       │
├─────────────────────────────────────────────────────────────────┤
│  Step 9:  CBF Safety Clamp  [strix-core]                        │
│           TTC-aware CBF with deadlock detection                  │
│           Collision avoidance + altitude bounds + NFZ exclusion  │
│           APF trajectory planning integration                    │
├─────────────────────────────────────────────────────────────────┤
│  Step 10: XAI Decision Trace  [strix-xai]                       │
│           Record every decision with full causal chain           │
│           Machine traces → human-readable narrative              │
│           Deterministic replay for after-action review           │
└─────────────────────────────────────────────────────────────────┘

The 9 crates correspond roughly to this structure:

strix-core: steps 1–6, 9 (15 modules, the heaviest crate)
strix-auction: step 7
strix-mesh: step 8
strix-xai: step 10
strix-swarm: swarm-level coordination across drones
strix-adapters: MAVLink/ROS2 hardware adapters
strix-python: PyO3 bindings
strix-playground: scenario DSL for testing
strix-optimizer: SMCO multi-objective parameter optimization with Pareto front, 62 tunable parameters
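
To make the "fixed order, no optional steps" idea concrete, here is a minimal sketch of such a pipeline. The `TickStep` trait, `TickContext`, and the stage names are assumptions made for illustration; they are not taken from the STRIX crates.

```rust
/// Shared per-tick state. The single field is a placeholder,
/// not STRIX's real context type.
#[derive(Default)]
pub struct TickContext {
    pub log: Vec<&'static str>,
}

/// One stage of the pipeline (hypothetical trait for illustration).
pub trait TickStep {
    fn name(&self) -> &'static str;
    fn run(&mut self, ctx: &mut TickContext);
}

/// A trivial stage that only records that it ran.
pub struct Named(pub &'static str);

impl TickStep for Named {
    fn name(&self) -> &'static str {
        self.0
    }
    fn run(&mut self, ctx: &mut TickContext) {
        ctx.log.push(self.0);
    }
}

/// Runs every stage, in order, on every tick: no skipping, no reordering.
pub fn run_tick(steps: &mut [Box<dyn TickStep>], ctx: &mut TickContext) {
    for step in steps.iter_mut() {
        step.run(ctx);
    }
}

/// Builds ten placeholder stages in the order listed in the diagram.
pub fn default_pipeline() -> Vec<Box<dyn TickStep>> {
    [
        "ew_scan", "particle_filter", "cusum", "formation", "threat_tracker",
        "roe_gate", "auction", "gossip", "cbf_clamp", "xai_trace",
    ]
    .iter()
    .map(|&n| Box::new(Named(n)) as Box<dyn TickStep>)
    .collect()
}
```

Keeping the order in one place makes a replayed tick visit the stages in the same sequence as the original run, which is one ingredient of deterministic replay.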

Deep Dive 1: The Battlefield is a Market

The auction system in strix-auction is the most intellectually loaded module in STRIX. The core insight: task allocation in a drone swarm is mathematically equivalent to portfolio optimization in a market with constraints.

In a financial portfolio:

You have scarce capital to allocate
You have a set of available assets with different risk/return profiles
Some assets have correlations you need to track
Drawdown protection prevents you from loading into catastrophic positions
Some trades are only visible to certain participants (dark pools)

In a drone swarm:

You have scarce drone-capacity to allocate
You have a set of tasks with different capability requirements
Some tasks must be done together (bundles)
Kill-zone repricing prevents you from sending more drones into a slaughter
Some tasks are classified and visible only to specific sub-swarms (compartmentalization)

The Task structure makes this mapping explicit:

pub struct Task {
    pub id: u32,
    pub location: Position,
    pub required_capabilities: Capabilities,  // sensor/weapon/EW/relay
    pub priority: f64,
    pub urgency: f64,
    pub bundle_id: Option<u32>,   // tasks that must go together
    pub dark_pool: Option<u32>,   // compartmentalized visibility
}

The auction mechanism is loosely inspired by VCG (Vickrey 1961, Clarke 1971, Groves 1973), a mechanism family that is also used in some digital advertising auctions, adapted for multi-unit combinatorial assignment with physical constraints. A drone's bid on a task is a function of its distance to the task location, its capability match score, its current load, and its fear state. High-fear drones bid more conservatively. They're risk-averse, like a trader protecting a drawdown.

Kill-zone repricing is the anti-fragility mechanism: after a drone is lost in a location, the perceived cost of that location increases for all subsequent bidders. The auction organically routes the swarm around high-attrition zones. The swarm doesn't need a central commander to say "stop flying over that hill" — the auction figures it out through price signals.
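
A rough sketch of how a bid might combine those inputs with a kill-zone surcharge is shown below. The weights, the `BidInputs` struct, and the greedy single-task winner selection are all assumptions chosen for illustration. This is not VCG and not the scoring function in strix-auction.

```rust
/// Inputs to one bid. All fields and weights are illustrative assumptions.
pub struct BidInputs {
    pub distance_m: f64,       // drone-to-task distance in metres
    pub capability_match: f64, // 0.0 (no match) to 1.0 (perfect match)
    pub load: f64,             // 0.0 (idle) to 1.0 (fully loaded)
    pub fear: f64,             // F in [0, 1]
    pub losses_nearby: u32,    // drones already lost near the task location
}

/// Higher score means a more attractive bid.
pub fn bid_score(b: &BidInputs, priority: f64) -> f64 {
    let value = priority * b.capability_match * (1.0 - b.load);
    // Kill-zone repricing: each prior loss adds a surcharge to the travel cost.
    let kill_zone_surcharge = 1.0 + 0.5 * f64::from(b.losses_nearby);
    let cost = (b.distance_m / 1000.0) * kill_zone_surcharge;
    // Fear makes the bidder weigh cost more heavily (more conservative bids).
    value - (1.0 + b.fear) * cost
}

/// Index of the highest-scoring bidder for a single task, if any.
pub fn winner(bids: &[BidInputs], priority: f64) -> Option<usize> {
    bids.iter()
        .map(|b| bid_score(b, priority))
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(&b.1))
        .map(|(i, _)| i)
}
```

With these made-up weights, two losses near a task 500 m away double the effective travel cost, so an otherwise identical drone bidding on a clean location wins.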

Dark pool compartmentalization solves a harder problem: in real operations, some tasks are classified above the clearance of certain drones. The auction system supports dark_pool visibility groups, where only drones within the same dark pool can see and bid on compartmentalized tasks. This is architecturally identical to how dark pools work in equity markets — non-public order flow visible only to approved participants.
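
The visibility rule can be sketched as a simple filter applied before bidding. `TaskView` and `DroneView` are hypothetical stand-ins; only the `dark_pool: Option<u32>` field mirrors the `Task` struct shown earlier.

```rust
/// Minimal stand-ins for illustration. The real `Task` has more fields.
pub struct TaskView {
    pub id: u32,
    pub dark_pool: Option<u32>,
}

pub struct DroneView {
    pub id: u32,
    pub pools: Vec<u32>, // dark pools this drone is cleared for (assumed field)
}

/// A task is biddable if it is public or the drone belongs to its pool.
pub fn can_see(drone: &DroneView, task: &TaskView) -> bool {
    match task.dark_pool {
        None => true,
        Some(pool) => drone.pools.contains(&pool),
    }
}

/// Tasks a given drone is allowed to bid on.
pub fn visible_tasks<'a>(drone: &DroneView, tasks: &'a [TaskView]) -> Vec<&'a TaskView> {
    tasks.iter().filter(|t| can_see(drone, t)).collect()
}
```

Filtering before scoring means a drone outside the pool never computes a bid for a compartmentalized task, so nothing about that task shows up in its bids.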

Benchmark: 465 µs for a full auction cycle with 50 drones competing on 20 tasks. At scale: 4.86 ms for 100 drones competing on 50 tasks. Both run inside the tick loop.

Deep Dive 2: Fear as a Control Signal

The PhiSim integration is the part of STRIX that gets the strongest reactions from people who encounter it for the first time: "You put fear into a drone swarm? Why?"

The answer starts with Kahneman. Prospect theory (Kahneman & Tversky, 1979) shows that human decision-making under uncertainty is not utility-maximizing — it's loss-averse, context-sensitive, and heavily influenced by current emotional state. A trader who just suffered a significant drawdown behaves differently than a trader who is up on the month, even when facing mathematically identical choices. That's not irrational. It's adaptive.

The same logic applies to autonomous systems. A swarm that has lost 30% of its drones to jamming should not behave identically to a full-strength swarm approaching the same objective. It should be more cautious — wider formations, longer evade distances, stronger avoidance signals, more aggressive information sharing. Not because a human operator told it to be cautious, but because the system's own estimate of its situation warrants it.

Fear F ∈ [0, 1] in STRIX is computed from a dual adversarial process: fear and courage as opposing forces, with tension = |fear - courage|. The inputs map financial concepts to combat:

| Financial concept | Swarm analog |
| --- | --- |
| Portfolio drawdown | Attrition rate (drones lost / total) |
| Volatility ratio | Threat intensity ratio |
| Anomaly count | CUSUM change-point breaks |
| Consecutive losses | Loss ticks (sustained attrition) |

When fear rises, every subsystem responds:

// Evade distance: 150m at F=0 → ~495m at F=1 (150 × 3.3)
evade_distance: base.evade_distance * (1.0 + f * 2.3)

// Formation spacing: +50% at maximum fear
spacing: base.spacing * (1.0 + axes.bias * 0.5)

// Pheromone persistence: 3.3x longer at F=1
modulated_decay = decay * axes.threshold  // threshold ≈ 0.3 at F=1

// Gossip fanout: 2x at F=1 (the 3x cap only binds if f exceeds 1)
(base_fanout + floor(f * base_fanout)).min(base_fanout * 3)

The effect is that a high-fear swarm is simultaneously more cautious (wider formations, longer evade distances) and more communicative (higher gossip fanout, stronger pheromones). This matches what military doctrine recommends for degraded units: pull back, increase information sharing, wait for situation clarity before re-engaging.
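
The four modulation rules can be collected into one function, as in the sketch below. The multipliers come from the snippet above. The `SwarmParams` struct and the linear mappings for `bias` (0 to 1) and `threshold` (1.0 down to 0.3) are assumptions, since the article only gives their values at the endpoints.

```rust
#[derive(Clone, Copy, Debug)]
pub struct SwarmParams {
    pub evade_distance: f64,  // metres
    pub spacing: f64,         // metres
    pub pheromone_decay: f64, // 1/s
    pub gossip_fanout: u32,   // peers contacted per gossip round
}

/// Applies fear F in [0, 1] to a baseline parameter set.
pub fn modulate(base: SwarmParams, fear: f64) -> SwarmParams {
    // Treat NaN as "no fear" and clamp everything else into [0, 1].
    let f = if fear.is_nan() { 0.0 } else { fear.clamp(0.0, 1.0) };
    // ASSUMPTION: both axes are linear in F.
    let bias = f; // 0 at F=0, 1 at F=1
    let threshold = 1.0 - 0.7 * f; // 1.0 at F=0, about 0.3 at F=1
    let extra = (f * f64::from(base.gossip_fanout)).floor() as u32;
    SwarmParams {
        evade_distance: base.evade_distance * (1.0 + f * 2.3),
        spacing: base.spacing * (1.0 + bias * 0.5),
        pheromone_decay: base.pheromone_decay * threshold,
        gossip_fanout: (base.gossip_fanout + extra).min(base.gossip_fanout * 3),
    }
}
```

With a 150 m baseline, `modulate` returns roughly 495 m of evade distance at F=1, and a fanout of 3 becomes 6.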

Courage is the opposing force. A high-courage swarm can tolerate tighter formations, closer engagement distances, more aggressive auction bids. Tension (|fear - courage|) drives ROE posture suggestions — not automated ROE changes, but recommendations surfaced to human operators through the XAI layer.

Deep Dive 3: Bio-Inspired Coordination

The gossip protocol in strix-mesh and the pheromone system in strix-core are the two bio-inspired coordination mechanisms. They solve the same problem from different angles: how does a decentralized swarm maintain coherent collective behavior without a central coordinator?

Digital Pheromones are borrowed directly from Dorigo's ant colony optimization (1992). Ants leave chemical traces that guide other ants toward food sources and away from dead ends. STRIX implements four pheromone types:

Danger — repulsive, deposited near threats and loss events
Explored — marks already-covered terrain to avoid redundant paths
Rally — attractive, marks gathering points for regrouping
Resource — marks objectives and high-value areas

Each pheromone has exponential decay, but the decay rate is modulated by fear. At high fear, pheromones persist 3.3x longer — the swarm's collective memory of dangerous areas stays fresh longer when it's actively scared. At low fear, pheromones fade quickly, allowing the swarm to be bolder about re-exploring previously dangerous terrain.
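
The 3.3x figure follows directly from the decay law: scaling the decay rate by about 0.3 stretches the half-life by 1/0.3. A minimal sketch, with function names chosen for illustration:

```rust
/// Concentration after `dt` seconds of exponential decay.
/// `threshold` is the fear-dependent multiplier on the decay rate
/// (about 1.0 at F=0 and about 0.3 at F=1).
pub fn decay_step(value: f64, decay: f64, threshold: f64, dt: f64) -> f64 {
    value * (-decay * threshold * dt).exp()
}

/// Time for a deposit to fall to half its initial strength.
pub fn half_life(decay: f64, threshold: f64) -> f64 {
    std::f64::consts::LN_2 / (decay * threshold)
}
```

For a base decay of 0.1 per second, the half-life grows from about 6.9 s at `threshold = 1.0` to about 23.1 s at `threshold = 0.3`.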

Gossip Protocol implements epidemic information spreading. Each drone periodically selects a random set of neighbors and exchanges state updates. The mathematical guarantee: when each node can reach peers chosen uniformly at random (or the topology is otherwise well connected), gossip reaches all nodes in O(log N) rounds with high probability. Sparse topologies such as long chains converge more slowly. STRIX implements bandwidth-aware priority queuing:

| Priority | Message Type | Rationale |
| --- | --- | --- |
| 0 (highest) | ThreatAlert | Immediate danger, time-critical |
| 1 | TaskAssignment | Coordination, semi-urgent |
| 2 | StateUpdate | Position/status, routine |
| 3 | PheromoneDeposit | Environmental update, low urgency |
| 4 | Heartbeat | Keepalive, lowest priority |

When fear is high and bandwidth is constrained, low-priority messages are dropped first. The swarm preferentially shares threat information when it most needs to. When fear drops and bandwidth recovers, state updates and pheromone deposits fill in the collective picture.
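
One way to get "low-priority messages are dropped first" is a min-heap keyed on the priority column plus a per-tick send budget. The sketch below uses only the standard library. The `Outbox` type and the drop-everything-unsent policy are assumptions, not the strix-mesh implementation.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Message kinds in priority order (0 = most urgent), matching the table.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub enum MsgKind {
    ThreatAlert = 0,
    TaskAssignment = 1,
    StateUpdate = 2,
    PheromoneDeposit = 3,
    Heartbeat = 4,
}

#[derive(Default)]
pub struct Outbox {
    // `Reverse` turns the max-heap into a min-heap. The sequence number
    // keeps first-in-first-out order within one priority level.
    heap: BinaryHeap<Reverse<(MsgKind, u64)>>,
    next_seq: u64,
}

impl Outbox {
    pub fn push(&mut self, kind: MsgKind) {
        self.heap.push(Reverse((kind, self.next_seq)));
        self.next_seq += 1;
    }

    /// Sends up to `budget` messages, most urgent first, and drops the rest.
    pub fn drain_tick(&mut self, budget: usize) -> Vec<MsgKind> {
        let mut sent = Vec::with_capacity(budget);
        while sent.len() < budget {
            match self.heap.pop() {
                Some(Reverse((kind, _))) => sent.push(kind),
                None => break,
            }
        }
        self.heap.clear(); // whatever did not fit in the budget is dropped
        sent
    }
}
```

A tighter budget therefore sheds heartbeats and pheromone deposits before it ever touches a threat alert.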

The combination of pheromones and gossip produces emergent behavior that no single subsystem explicitly implements: without a central coordinator, the swarm learns the shape of the threat environment, avoids areas that have been costly, concentrates toward objectives, and maintains information coherence across node losses.

Deep Dive 4: Safety and Trajectory Planning

TTC-aware CBF with deadlock detection extends the standard CBF formulation by incorporating Time-to-Collision estimates. Instead of only enforcing static separation distances, the CBF considers the closing velocity between drones — two drones approaching each other head-on at high speed trigger the safety clamp earlier than two drones drifting slowly toward each other. Deadlock detection identifies situations where CBF constraints from multiple drones create a gridlock and applies a resolution strategy.
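
As a hedged illustration of the idea, the sketch below applies a textbook pairwise CBF velocity filter in 2D, assuming single-integrator dynamics (the drone commands its velocity directly) and a known velocity for the other drone. The barrier is h = |p|² − d², and closing speed inflates d so that fast head-on pairs are clamped earlier. The parameters `tau` and `alpha` are assumptions, and deadlock detection is left out.

```rust
pub type Vec2 = [f64; 2];

fn dot(a: Vec2, b: Vec2) -> f64 {
    a[0] * b[0] + a[1] * b[1]
}

/// Left-hand side of the CBF condition 2 p.(u - v_other) + alpha * h >= 0,
/// with h = |p|^2 - d_eff^2 and p = own position minus other position.
pub fn cbf_margin(p: Vec2, u: Vec2, v_other: Vec2, d_eff: f64, alpha: f64) -> f64 {
    let h = dot(p, p) - d_eff * d_eff;
    2.0 * dot(p, [u[0] - v_other[0], u[1] - v_other[1]]) + alpha * h
}

/// Separation inflated by closing speed times a reaction time `tau`.
pub fn effective_separation(p: Vec2, u: Vec2, v_other: Vec2, d_min: f64, tau: f64) -> f64 {
    let dist = dot(p, p).sqrt();
    if dist == 0.0 {
        return d_min;
    }
    // closing > 0 means the gap between the two drones is shrinking
    let closing = -dot(p, [u[0] - v_other[0], u[1] - v_other[1]]) / dist;
    d_min + closing.max(0.0) * tau
}

/// Smallest change to `u_des` that satisfies the single pairwise constraint.
/// `d_eff` is computed once from the desired velocity and then held fixed.
pub fn safe_velocity(p: Vec2, u_des: Vec2, v_other: Vec2, d_min: f64, tau: f64, alpha: f64) -> Vec2 {
    let d_eff = effective_separation(p, u_des, v_other, d_min, tau);
    let margin = cbf_margin(p, u_des, v_other, d_eff, alpha);
    let a_norm2 = 4.0 * dot(p, p); // |a|^2 for constraint normal a = 2p
    if margin >= 0.0 || a_norm2 == 0.0 {
        return u_des; // already safe (or degenerate geometry): pass through
    }
    // Project onto the half-space boundary: u = u_des - margin * a / |a|^2
    let k = margin / a_norm2;
    [u_des[0] - k * 2.0 * p[0], u_des[1] - k * 2.0 * p[1]]
}
```

With several neighbours, the single projection would typically be replaced by a small quadratic program over all pairwise constraints.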

APF trajectory planning (Artificial Potential Fields) provides smooth, obstacle-aware paths. Attractive potentials pull drones toward objectives; repulsive potentials push them away from obstacles, NFZs, and other drones. The APF output feeds into the CBF as a desired velocity, which the CBF then projects onto the safe set.
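
A minimal sketch of the classic potential-field velocity (in the style of Khatib's formulation) is shown below. The gains `k_att` and `k_rep` and the influence radius `rho0` are illustrative parameters, not values from STRIX.

```rust
pub type Vec3 = [f64; 3];

fn sub(a: Vec3, b: Vec3) -> Vec3 {
    [a[0] - b[0], a[1] - b[1], a[2] - b[2]]
}

fn norm(a: Vec3) -> f64 {
    (a[0] * a[0] + a[1] * a[1] + a[2] * a[2]).sqrt()
}

/// Desired velocity = attraction toward `goal` plus repulsion from every
/// obstacle closer than the influence radius `rho0`.
pub fn apf_velocity(
    pos: Vec3,
    goal: Vec3,
    obstacles: &[Vec3],
    k_att: f64,
    k_rep: f64,
    rho0: f64,
) -> Vec3 {
    let to_goal = sub(goal, pos);
    let mut v = [k_att * to_goal[0], k_att * to_goal[1], k_att * to_goal[2]];
    for &obs in obstacles {
        let away = sub(pos, obs);
        let rho = norm(away);
        if rho > 0.0 && rho < rho0 {
            // Negative gradient of 0.5 * k_rep * (1/rho - 1/rho0)^2
            let gain = k_rep * (1.0 / rho - 1.0 / rho0) / (rho * rho * rho);
            for i in 0..3 {
                v[i] += gain * away[i];
            }
        }
    }
    v
}
```

The result is only a desired velocity. In the architecture described above it would still pass through the CBF clamp before being commanded.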

CVaR risk scoring (Conditional Value-at-Risk) quantifies the tail risk of mission plans. Instead of optimizing for expected outcomes, CVaR averages over the worst tail of the outcome distribution: what is the mean loss across the worst 5% of scenarios? This integrates with the auction's bid evaluation and the ROE authorization pipeline.
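
For a finite set of simulated outcomes, an empirical CVaR is just the mean of the worst slice. The sketch below assumes losses where larger means worse. The function name and the `tail` parameter (0.05 for the worst 5%) are illustrative.

```rust
/// Mean of the worst `tail` share of `losses` (larger loss = worse).
/// Returns None for empty input or a `tail` outside (0, 1].
pub fn cvar(losses: &[f64], tail: f64) -> Option<f64> {
    if losses.is_empty() || !(tail > 0.0 && tail <= 1.0) {
        return None;
    }
    let mut sorted = losses.to_vec();
    sorted.sort_by(|a, b| b.total_cmp(a)); // descending: worst first
    // Number of scenarios in the tail, at least one.
    let k = ((tail * sorted.len() as f64).round() as usize).clamp(1, sorted.len());
    Some(sorted[..k].iter().sum::<f64>() / k as f64)
}
```

For losses 1 through 100 and `tail = 0.05`, the worst five outcomes are 96 to 100, so the result is 98, well above the mean of 50.5.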

NaN hardening ensures that numerical corruption cannot silently propagate through the tick pipeline. Every subsystem includes guards that detect NaN values in inputs and outputs, preventing a single sensor glitch from cascading into nonsensical decisions across the entire swarm.
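
One simple form such a guard could take is a wrapper that holds the last finite reading and refuses to pass NaN or infinity downstream. The `Guarded` type below is a hypothetical sketch, not a type from the STRIX codebase.

```rust
/// Keeps the last finite value and rejects NaN or infinite inputs.
#[derive(Debug, Clone, Copy)]
pub struct Guarded {
    last_good: f64,
    pub rejected: u32, // how many bad readings were filtered out
}

impl Guarded {
    pub fn new(initial: f64) -> Self {
        assert!(initial.is_finite(), "initial value must be finite");
        Self { last_good: initial, rejected: 0 }
    }

    /// Returns `x` if it is finite, otherwise the previous good value.
    pub fn accept(&mut self, x: f64) -> f64 {
        if x.is_finite() {
            self.last_good = x;
        } else {
            self.rejected += 1;
        }
        self.last_good
    }
}
```

Counting rejections, rather than silently substituting, leaves a signal that a later stage or the decision trace can surface.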

Performance: What the Numbers Mean

| Metric | Value | Context |
| --- | --- | --- |
| Full swarm tick (20 drones) | 1.15 ms | 869 Hz max tick rate, well above any real-time control requirement |
| Combinatorial auction (50 drones, 20 tasks) | 465 µs | Fits inside a single 1ms control loop |
| Combinatorial auction (100 drones, 50 tasks) | 4.86 ms | Scales to larger swarms within real-time bounds |
| Particle filter (200 particles) | 75 µs | Leaves 925 µs for everything else in a 1ms tick |

The 1.15 ms full tick number is the most important one. Real-time control for drone swarms typically requires update rates of 10–100 Hz (10–100 ms per tick). At 1.15 ms, STRIX runs roughly 9–87x faster than required, which means:

Margin for hardware latency: you can afford significant communication and sensor latency without missing deadlines
Headroom for scaling: a 20-drone tick at 1.15 ms leaves roughly 999 ms of budget before hitting 1-second ticks. The architecture is designed to scale toward 2000+ drones.
Predictable tail latency: Rust has no garbage collector, so there are no GC pauses. The 1.15 ms number doesn't have hidden tail latency spikes from heap compaction.

The 75 µs particle filter number deserves context: this is a 200-particle bootstrap filter (Gordon, Salmond, Smith 1993) running in Rust with no SIMD optimization. The equivalent Python/NumPy implementation runs approximately 10x slower. For real-time operation, the Rust implementation matters.

What the Trenches Taught Me

Building 20 subsystems in a single library across 8 months taught specific lessons that don't appear in textbooks.

1. The Dual Particle Filter Architecture is Not Optional

I initially implemented a single particle filter tracking swarm state. The problem: you're trying to use the same filter to track both where your drones are and where threats are. These are fundamentally different inference problems. Friendly state has known dynamics (you control the drones), high-frequency updates (onboard sensors), and low observation noise. Threat state has unknown dynamics (you don't control the threats), sparse updates (radar/EO/IR glimpses), and high observation noise.

Separating into two filters — 200 particles for friendly navigation, 100 particles for adversarial threat tracking — immediately improved both. Each filter could be tuned for its specific dynamics without compromising the other.
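
For readers who have not written one, here is a minimal 1D bootstrap filter showing the predict, update, and resample cycle that each of the two filters runs. It is a sketch under simplifying assumptions: a scalar state, a random-walk motion model, Gaussian measurement noise, at least one particle, and a small built-in PRNG to avoid external crates. STRIX's filters track a six-dimensional state and are tuned differently.

```rust
/// Tiny deterministic PRNG (xorshift64*) so the sketch needs no crates.
pub struct Rng(u64);

impl Rng {
    pub fn new(seed: u64) -> Self {
        Rng(seed.max(1)) // the state must be non-zero
    }
    fn next_u64(&mut self) -> u64 {
        self.0 ^= self.0 >> 12;
        self.0 ^= self.0 << 25;
        self.0 ^= self.0 >> 27;
        self.0.wrapping_mul(0x2545F4914F6CDD1D)
    }
    /// Uniform in (0, 1].
    pub fn uniform(&mut self) -> f64 {
        ((self.next_u64() >> 11) as f64 + 1.0) / (1u64 << 53) as f64
    }
    /// Standard normal via Box-Muller.
    pub fn normal(&mut self) -> f64 {
        let (u1, u2) = (self.uniform(), self.uniform());
        (-2.0 * u1.ln()).sqrt() * (2.0 * std::f64::consts::PI * u2).cos()
    }
}

pub struct ParticleFilter {
    pub particles: Vec<f64>,
    pub weights: Vec<f64>,
}

impl ParticleFilter {
    /// `n` particles (n >= 1) spread uniformly over [lo, hi].
    pub fn new(n: usize, lo: f64, hi: f64, rng: &mut Rng) -> Self {
        let particles = (0..n).map(|_| lo + (hi - lo) * rng.uniform()).collect();
        Self { particles, weights: vec![1.0 / n as f64; n] }
    }

    /// Predict: random-walk motion model with standard deviation `q`.
    pub fn predict(&mut self, q: f64, rng: &mut Rng) {
        for p in &mut self.particles {
            *p += q * rng.normal();
        }
    }

    /// Update: reweight by a Gaussian likelihood of measurement `z` (std `r`).
    pub fn update(&mut self, z: f64, r: f64) {
        for (w, p) in self.weights.iter_mut().zip(&self.particles) {
            let e = (z - p) / r;
            *w *= (-0.5 * e * e).exp();
        }
        let sum: f64 = self.weights.iter().sum();
        let n = self.weights.len() as f64;
        for w in &mut self.weights {
            // If every weight collapsed to zero, fall back to uniform.
            *w = if sum > 0.0 && sum.is_finite() { *w / sum } else { 1.0 / n };
        }
    }

    /// Systematic resampling, then reset the weights to uniform.
    pub fn resample(&mut self, rng: &mut Rng) {
        let n = self.particles.len();
        let step = 1.0 / n as f64;
        let mut u = rng.uniform() * step;
        let (mut i, mut cum) = (0, self.weights[0]);
        let mut next = Vec::with_capacity(n);
        for _ in 0..n {
            while u > cum && i + 1 < n {
                i += 1;
                cum += self.weights[i];
            }
            next.push(self.particles[i]);
            u += step;
        }
        self.particles = next;
        self.weights = vec![step; n];
    }

    /// Weighted mean of the particles.
    pub fn estimate(&self) -> f64 {
        self.weights.iter().zip(&self.particles).map(|(w, p)| w * p).sum()
    }
}
```

The friendly and threat filters would share this skeleton and differ in the motion model, the noise levels `q` and `r`, and the particle count, which is what makes tuning them separately possible.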

2. Anti-Fragility is Not Resilience

Resilience is returning to the prior state after a disturbance. Anti-fragility is being stronger after a disturbance. These are categorically different.

A resilient swarm loses two drones, falls back to a smaller formation, and tries to continue the original mission. An anti-fragile swarm loses two drones, updates kill-zone pricing (future drones avoid that area), increases gossip fanout (more information sharing under threat), triggers formation widening, and potentially performs better on subsequent engagements because it now has better threat map data.

STRIX's kill-zone repricing mechanism is what makes the swarm anti-fragile rather than merely resilient. Every loss is a price signal that updates the auction's cost model. The swarm learns from attrition.

3. Glass-Box XAI is Non-Negotiable for Autonomous Weapons

Every time the auction assigns a task differently than a human operator would expect, that operator needs to understand why. Every time the ROE engine declines an engagement authorization, a human needs to be able to reconstruct the causal chain: what did the threat classification show, what was the friend-foe ID confidence, what was the collateral risk estimate, why did the logic resolve to "hold"?

Without this, autonomous systems operating in contested domains will lose human trust — and rightly so. The strix-xai trace/narrator/replay pipeline is not a nice-to-have feature. For systems with kinetic authority, it's a prerequisite for legitimate use. Deterministic replay makes every decision reproducible.
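
The trace, narrator, and replay split can be pictured with a small sketch. The `DecisionTrace` fields, the template wording, and the `TraceLog` type are assumptions for illustration and do not reflect the actual strix-xai schema.

```rust
/// One recorded decision. Field names are illustrative only.
#[derive(Debug, Clone, PartialEq)]
pub struct DecisionTrace {
    pub tick: u64,
    pub subsystem: &'static str,
    pub decision: String,
    /// (input name, value) pairs that drove the decision.
    pub causes: Vec<(String, f64)>,
}

/// Template-based narration of a single trace.
pub fn narrate(t: &DecisionTrace) -> String {
    let causes: Vec<String> = t
        .causes
        .iter()
        .map(|(k, v)| format!("{}={:.2}", k, v))
        .collect();
    format!(
        "tick {}: {} chose '{}' because {}",
        t.tick,
        t.subsystem,
        t.decision,
        causes.join(", ")
    )
}

/// Append-only log. Replaying the same inputs should yield the same entries.
#[derive(Default)]
pub struct TraceLog {
    entries: Vec<DecisionTrace>,
}

impl TraceLog {
    pub fn record(&mut self, t: DecisionTrace) {
        self.entries.push(t);
    }
    pub fn entries(&self) -> &[DecisionTrace] {
        &self.entries
    }
}
```

Because the narration is a pure function of the trace, the machine-readable record stays the source of truth and the text can be regenerated at any time.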

4. Mathematical Elegance vs. Practical Heuristics

The CBF (Control Barrier Function) safety system has formal mathematical guarantees under its modeling assumptions: forward invariance of the safe set, provably collision-free velocity fields. The auction system is heuristic: the scoring function is engineered to work well in practice, but there are no optimality proofs. The fear meta-parameter is empirically calibrated, not analytically derived from first principles.

This tension is real. In practice, the formally-correct CBF runs in every tick and you trust it. The heuristic auction you test extensively (the full suite is 671 tests across 37+ files) and you trust the tests. Different parts of a complex system warrant different levels of formal rigor. The trick is knowing which parts need proofs and which parts need thorough empirical validation. The strix-optimizer crate with its 62 tunable parameters and isotonic confidence calibration helps bridge this gap.

5. Fear as a Control Signal Produces Coherent Emergent Behavior

When fear first went into the simulation, I expected it to make the swarm more conservative across the board — slower, more cautious, less effective. What actually happened was more interesting: high-fear swarms were more communicative. More gossip, stronger pheromones, wider information sharing. They were slower at engaging, but they were much better at maintaining collective situational awareness.

A high-fear swarm that backs off and talks to itself is often in a better position 30 seconds later than an overconfident swarm that pressed the engagement and took losses. Fear, implemented correctly, is not cowardice — it's a sophisticated information aggregation mechanism.

6. The DSL Was An Afterthought That Became The Most Important Interface

strix-playground started as a test harness — a way to run scenarios without writing full integration tests every time. The Playground builder DSL emerged from repeatedly typing the same setup code:

let report = Playground::new()
    .name("Ambush")
    .drones(20)
    .threats(vec![
        ThreatSpec::approaching(400.0, 8.0),
        ThreatSpec::flanking(500.0, 6.0, 45.0),
        ThreatSpec::flanking(500.0, 6.0, -60.0),
    ])
    .wind([2.0, -1.0, 0.0])
    .cbf(CbfConfig::default())
    .run_for(60.0)
    .run();

It's now the primary way anyone interacts with STRIX for evaluation and experimentation. Four preset scenarios (Ambush, GPS Denied, Attrition Cascade, Stress Test) cover the most important operational conditions. The DSL made the system approachable in a way that raw API calls to 15-module strix-core never could.

7. Adversarial Doctrines Change Everything

Modeling threats as individual entities with simple motion patterns (Approaching, Circling, Retreating) was sufficient for basic scenarios. But real adversaries use structured tactics. Adding PROBING, FEINT, and COORDINATED doctrines forced a fundamental rethink of the threat tracker — it now has to distinguish between a genuine attack and a feint designed to draw resources away from the real objective. This is where CVaR risk scoring becomes critical: evaluating not just the expected threat but the tail-risk scenarios.

What It Doesn't Do — Honest Limitations

This section matters more than any benchmark.

STRIX is a research prototype. It has never flown a real drone, in any context. The performance numbers are simulation results, not flight test data.

Platform adapters are stubs. The MAVLink and ROS2 adapters in strix-adapters are architectural placeholders. Unless you compile with the mavlink-hw feature flag (which requires additional hardware-specific dependencies), these are compile-time stubs that satisfy the interface but don't communicate with real hardware.

The edge LLM architecture is defined but empty. The XAI narrator converts decision traces to human-readable text using a templating approach. The architecture includes provisions for an edge-deployed language model to produce richer narrative explanations, but no pretrained model is included.

No optimality proofs for the auction. The auction scoring is heuristic. It performs well across the test scenarios. It does not have mathematical guarantees of optimality. The SMCO optimizer helps tune parameters but does not provide theoretical bounds.

Not ITAR-controlled. STRIX is open-source (Apache 2.0). The codebase does not contain classified algorithms, export-controlled technologies, or information that would trigger ITAR restrictions. It is a research implementation of publicly available algorithms — particle filters, auction theory, CBF, ant colony optimization — applied to the drone swarm domain.

671 tests is the minimum honest number, not a boast. The test suite (560 Rust + 111 Python) covers the major subsystems and the known failure modes. It does not constitute exhaustive verification of a safety-critical system. Anyone deploying STRIX in a real operational context would need substantially more validation.

NaN hardening is applied across subsystems but not formally verified for all edge cases.

Why Doesn't This Already Exist?

The honest answer: some of it does exist, in parts, in proprietary systems operated by defense contractors and national laboratories. What doesn't exist is an open-source, unified implementation of all these subsystems working together, with a permissive license and a readable codebase.

PX4 and ArduPilot are excellent flight stacks, and MAVROS (the MAVLink bridge for ROS) connects them to the ROS ecosystem, but all three are focused on single-drone operation and don't include swarm coordination, task auctions, or EW response. MAVSDK is a clean interface layer but has no intelligence. Most academic swarm research is MATLAB or Python: correct in theory, 10–100x too slow for real-time operation.

STRIX occupies a specific niche: close enough to production-quality Rust to be benchmarkable, open enough to be studied and modified, comprehensive enough to be a complete research platform rather than a single-paper implementation. With scalability toward 2000+ drones as a design goal, it's built for the next generation of swarm research.

What's Next

The current architecture has clear extension points:

Real hardware integration — completing the MAVLink adapter with actual serial communication and testing with real flight controllers
Distributed auction — the current auction runs centrally per swarm tick; a fully distributed variant using the gossip network would eliminate the remaining centralized component
Reinforcement learning integration — replacing heuristic auction scoring with a learned policy, using the particle filter state as the observation space
Multi-swarm coordination — the fractal hierarchy architecture supports nested swarm-of-swarms coordination; this is partially implemented but not fully tested
Edge LLM narrator — completing the XAI narrator with a deployed language model for richer after-action reports
Scale validation at 2000+ drones — stress-testing the architecture at full target scale

Try It

git clone https://github.com/RMANOV/strix
cd strix

# Run the Rust test suite (560 of the 671 tests; the remaining 111 are Python)
cargo test

# Run a specific scenario
cargo run --example ambush_scenario

# Run the full stress test (50 drones, 10 threats, 2 NFZs, GPS denial, drone losses)
cargo run --example stress_test

# Python bindings
cd strix-python
pip install maturin
maturin develop
python examples/basic_swarm.py

The strix-playground crate has four preset scenarios that cover the main operational conditions: Ambush, GPS Denied, Attrition Cascade, and Stress Test. Start there.

Codebase

Repo: github.com/RMANOV/strix
License: Apache 2.0
Lines: 34,889 Rust + 7,493 Python (~42,400 total)
Tests: 671 (560 Rust + 111 Python)
Crates: 9

The code is readable. The architecture is documented. The limitations are honest.

If you find the approach interesting — or if you find something wrong — open an issue.

STRIX is a research prototype. Not flight-tested. Not production-ready. Not combat-proven. Open source. Apache 2.0.
