I've recently been deep diving into Fly.io's infrastructure, particularly their flyd orchestration server and the superfly/fsm library that powers its stateful operations. To truly grasp the operational challenges, I built an interactive simulation game: flyd Operator Sim.
Play it here: https://flydsim.wsoltani.com/
Repo: https://github.com/wSoltani/flyd-operator-sim
Why Build a Simulation?
Fly.io's platform is impressive. Reading their insightful blog posts and their public infra-log revealed the complexities of flyd. The superfly/fsm library also highlighted their focus on robust state management.
I wanted to explore:
- What kind of incidents can actually occur on a worker node running flyd? 
- How does an operator diagnose and respond to these issues? 
- What's the impact of different actions on system health and application uptime? 
- How do Finite State Machines (FSMs) play a role in managing complex operations like machine migrations, even if it's abstracted away from the operator in a crisis? 
Building a sim felt like the best way to learn.
✨ Introducing: flyd Operator Sim!
In flyd Operator Sim, you're an on-call engineer for a Fly.io region. Your goal:
- Monitor worker health (CPU, memory, flyd status).
- Respond to incidents like flyd stalls, containerd sync issues, network partitions, and storage corruption (many inspired by the infra-log).
- Act using tools like flyd restarts, worker drains, log inspection, and (risky!) FSM overrides.
- Maintain uptime over a simulated period.
Game Objective & Progression:
Your main goal is to maintain high application uptime across your workers for 7 simulated days. Each day lasts about 5 minutes in real time. To make things more interesting, you start with one worker, and an additional worker is added each day, up to a maximum of four, increasing your responsibilities and potential points of failure!
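To make those numbers concrete, here is a minimal TypeScript sketch of the progression. It is an illustration, not the game's actual code: the constant and function names are hypothetical, and it assumes the extra worker joins at the start of each new day.

```typescript
// Illustrative constants based on the rules described above (names are hypothetical).
const TOTAL_DAYS = 7;
const REAL_MINUTES_PER_DAY = 5; // "about 5 minutes", so treat this as approximate
const MAX_WORKERS = 4;

// Workers online on a given simulated day (1-indexed).
// Assumption: the new worker joins at the start of each day.
function workersOnDay(day: number): number {
  if (Math.floor(day) !== day || day < 1 || day > TOTAL_DAYS) {
    throw new RangeError("day must be an integer from 1 to " + TOTAL_DAYS);
  }
  return Math.min(day, MAX_WORKERS);
}

// Build the worker count for every day of a run.
const schedule: number[] = [];
for (let day = 1; day <= TOTAL_DAYS; day++) {
  schedule.push(workersOnDay(day));
}

// Approximate real-time length of a full run, in minutes.
const approxRunMinutes = TOTAL_DAYS * REAL_MINUTES_PER_DAY;

console.log(schedule.join(","), approxRunMinutes);
```

Under those assumptions a full run takes roughly 35 minutes of real time, and from day 4 onward you are juggling all four workers at once.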
What I learned
- Orchestration is Complex: Simulating even a part of it showed me the immense challenge of managing global infrastructure.
- State Management is Crucial (and complicated): The game reinforced how vital accurate state is for flyd and why a solid FSM library like superfly/fsm is essential, especially seeing potential containerd desync issues.
- Observability is Non-Negotiable: Good metrics and logs (which the game simulates access to) are critical for diagnosing issues, a theme evident in Fly.io's own infra-log.
- Operational Trade-offs: The sim touches on the pressure of quick fixes versus safer, slower solutions.
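To illustrate the state management point, here is a small TypeScript sketch of a finite state machine guarding a machine migration. Everything in it is an assumption made for illustration: the state names and the `MigrationFsm` class are invented, and it does not mirror the superfly/fsm API (which is a Go library) or flyd's real migration steps. The idea it shows is simply that only transitions listed in a table are allowed, so an operation cannot silently jump into an inconsistent state.

```typescript
// Hypothetical migration states; not taken from flyd or superfly/fsm.
type MigrationState =
  | "pending"
  | "draining"
  | "copying"
  | "starting"
  | "done"
  | "failed";

// Transition table: each state lists the states it is allowed to move to.
const allowedTransitions: Record<MigrationState, readonly MigrationState[]> = {
  pending: ["draining", "failed"],
  draining: ["copying", "failed"],
  copying: ["starting", "failed"],
  starting: ["done", "failed"],
  done: [],   // terminal
  failed: [], // terminal
};

class MigrationFsm {
  private current: MigrationState = "pending";
  // Every state visited so far, useful when inspecting what happened.
  readonly history: MigrationState[] = ["pending"];

  getState(): MigrationState {
    return this.current;
  }

  // Move to `next` only if the transition table allows it; otherwise throw.
  transition(next: MigrationState): void {
    if (allowedTransitions[this.current].indexOf(next) === -1) {
      throw new Error("illegal transition: " + this.current + " -> " + next);
    }
    this.current = next;
    this.history.push(next);
  }
}

// Happy path: walk a migration through every step in order.
const migration = new MigrationFsm();
const happyPath: MigrationState[] = ["draining", "copying", "starting", "done"];
for (const step of happyPath) {
  migration.transition(step);
}
console.log(migration.history.join(" -> "));
```

In a sketch like this, something like the game's risky FSM override would amount to setting the state while bypassing the transition table, which is exactly why it can leave things out of sync.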
Tech Stack
Built with: Next.js, TypeScript, Tailwind CSS, Radix UI (shadcn), and React Context.
Try It & Share Your Thoughts!
This was a personal learning project, but I hope others find it useful or fun.
- Play: https://flydsim.wsoltani.com/
- Repo: https://github.com/wSoltani/flyd-operator-sim (give it a star!)
What incidents should I add next? How can it be a better learning tool? Let me know!
Thanks for reading!
Top comments (6)
Love how you turned complex infra ops into an interactive sim - I feel like more platforms need stuff like this for onboarding.
Have you thought about adding cascading failures or partial network outages to make it even closer to real-life chaos?
That sounds like the perfect next step!
I'd love to keep adding incident types and gameplay mechanics that can potentially help teach more infra concepts and mirror real-life chaos.
It would be awesome if more people decide to jump in and help improve the sim!
This project really flies above and beyond! I had a bug-tastic time simulating those incidents. 🪰
Not the fly emoji!
This is honestly genius, props for building something to actually see how it works in practice.
Thanks for the comment! Would love to see more projects like this. I think it makes learning quicker, easier and a lot more fun!