Wassim Soltani

Posted on Jun 3

I Built a Game to Understand Fly.io's Orchestrator: flyd Operator Sim!

#webdev #gamedev #learning #fly

I've recently been diving deep into Fly.io's infrastructure, particularly their flyd orchestration server and the superfly/fsm library that powers its stateful operations. To get a real feel for the operational challenges, I built an interactive simulation game: flyd Operator Sim.

Play it here: https://flydsim.wsoltani.com/
Repo: https://github.com/wSoltani/flyd-operator-sim

🤔 Why Build a Simulation?

Fly.io's platform is impressive. Reading their insightful blog posts and their public infra-log revealed the complexities of flyd. The superfly/fsm library also highlighted their focus on robust state management.

I wanted to explore:

What kind of incidents can actually occur on a worker node running flyd?
How does an operator diagnose and respond to these issues?
What's the impact of different actions on system health and application uptime?
What role do Finite State Machines (FSMs) play in managing complex operations like machine migrations, even if they're abstracted away from the operator in a crisis?

Building a sim felt like the best way to learn.

✨ Introducing: flyd Operator Sim!

In flyd Operator Sim, you're an on-call engineer for a Fly.io region. Your goal:

Monitor worker health (CPU, memory, flyd status).
Respond to incidents like flyd stalls, containerd sync issues, network partitions, and storage corruption (many inspired by the infra-log).
Act using tools like flyd restarts, worker drains, log inspection, and (risky!) FSM overrides.
Maintain uptime over a simulated period.

Game Objective & Progression:
Your main goal is to maintain high application uptime across your workers for 7 simulated days. Each day lasts about 5 minutes in real time. To make things more interesting, you start with one worker, and an additional worker is added each day, up to a maximum of four, increasing your responsibilities and potential points of failure!
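
To make the pacing concrete, here is a small sketch of the schedule described above. The constant and function names are hypothetical (they are not taken from the repo), and it assumes the extra worker arrives at the start of each new day:

```typescript
// Hypothetical constants based on the description above; these names are
// illustrative and are not identifiers from the flyd-operator-sim repo.
const TOTAL_DAYS = 7;
const MINUTES_PER_DAY = 5; // approximate real-time length of one simulated day
const MAX_WORKERS = 4;

// Assumption: day 1 starts with one worker and one more joins at the start
// of each following day until the cap is reached.
function workersOnDay(day: number): number {
  if (!Number.isInteger(day) || day < 1 || day > TOTAL_DAYS) {
    throw new RangeError(`day must be an integer between 1 and ${TOTAL_DAYS}`);
  }
  return Math.min(day, MAX_WORKERS);
}

const schedule = Array.from({ length: TOTAL_DAYS }, (_, i) => workersOnDay(i + 1));
const totalMinutes = TOTAL_DAYS * MINUTES_PER_DAY;
const workerDays = schedule.reduce((sum, n) => sum + n, 0);

console.log(schedule);     // [1, 2, 3, 4, 4, 4, 4]
console.log(totalMinutes); // 35
console.log(workerDays);   // 22
```

So a full run takes roughly 35 minutes of real time, and from day 4 onward you are watching four workers at once.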

🎓 What I Learned

Orchestration is Complex: Simulating even a part of it showed me the immense challenge of managing global infrastructure.
State Management is Crucial (and complicated): The game reinforced how vital accurate state is for flyd and why a solid FSM library like superfly/fsm is essential, especially when flyd's view of a machine can drift from what containerd reports.
Observability is Non-Negotiable: Good metrics and logs (which the game simulates access to) are critical for diagnosing issues, a theme evident in Fly.io's own infra-log.
Operational Trade-offs: The sim touches on the pressure of quick fixes versus safer, slower solutions.
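
To illustrate the state management and trade-off points, here is a minimal FSM sketch in TypeScript. It is not the superfly/fsm API and not code from the sim: the state names, the `MigrationFsm` class, and the `forceState` override are all hypothetical, meant only to show why a validated transition is safer than a manual override:

```typescript
// Hypothetical migration states; flyd's real states are not described in this
// post, and this is not the superfly/fsm API.
type MigrationState = "pending" | "draining" | "copying" | "starting" | "done" | "failed";

// Allowed edges: any transition not listed here is rejected.
const transitions: Record<MigrationState, readonly MigrationState[]> = {
  pending: ["draining", "failed"],
  draining: ["copying", "failed"],
  copying: ["starting", "failed"],
  starting: ["done", "failed"],
  done: [],
  failed: ["pending"], // a failed migration may be retried from the start
};

class MigrationFsm {
  private current: MigrationState = "pending";
  readonly history: MigrationState[] = ["pending"];

  get state(): MigrationState {
    return this.current;
  }

  // Safe path: only moves along a declared edge.
  transition(next: MigrationState): void {
    if (!transitions[this.current].includes(next)) {
      throw new Error(`illegal transition: ${this.current} -> ${next}`);
    }
    this.current = next;
    this.history.push(next);
  }

  // Risky path: skips validation entirely, like a manual override.
  forceState(next: MigrationState): void {
    this.current = next;
    this.history.push(next);
  }
}

const fsm = new MigrationFsm();
fsm.transition("draining");
fsm.transition("copying");
try {
  fsm.transition("done"); // skips "starting", so it is rejected
} catch (err) {
  console.log((err as Error).message); // illegal transition: copying -> done
}
fsm.forceState("done"); // the override succeeds, but nothing was actually started
console.log(fsm.state); // done
```

The override gets you to "done" quickly, but the recorded state no longer matches what actually happened. That gap between recorded and real state is the quick-fix risk described above.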

🤓 Tech Stack

Built with: Next.js, TypeScript, Tailwind CSS, Radix UI (via shadcn/ui), and React Context.

💭 Try It & Share Your Thoughts!

This was a personal learning project, but I hope others find it useful or fun.

Play: https://flydsim.wsoltani.com/
Repo: https://github.com/wSoltani/flyd-operator-sim (give it a 🌟!)

What incidents should I add next? How can it be a better learning tool? Let me know!

Thanks for reading 💖

Top comments (6)

Dotallio • Jun 8

Love how you turned complex infra ops into an interactive sim - I feel like more platforms need stuff like this for onboarding.

Have you thought about adding cascading failures or partial network outages to make it even closer to real-life chaos?

Wassim Soltani • Jun 8

That sounds like the perfect next step!

I'd love to keep adding incident types and gameplay mechanics that can potentially help teach more infra concepts and mirror real-life chaos.

It would be awesome if more people decide to jump in and help improve the sim!

Fraser Young • Jun 6

This project really flies above and beyond! I had a bug-tastic time simulating those incidents. 🪰

Wassim Soltani • Jun 8

Not the fly emoji 😂

Nathan Tarbert • Jun 8

This is honestly genius, props for building something to actually see how it works in practice.

Wassim Soltani • Jun 13

Thanks for the comment! Would love to see more projects like this. I think it makes learning quicker, easier and a lot more fun!