<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Fly.io</title>
    <description>The latest articles on DEV Community by Fly.io (@flyio).</description>
    <link>https://dev.to/flyio</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F4233%2Fb3be93db-4df0-4acf-bf3e-1fb5453f4ef0.png</url>
      <title>DEV Community: Fly.io</title>
      <link>https://dev.to/flyio</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/flyio"/>
    <language>en</language>
    <item>
      <title>Games as Model Eval: 1-Click Deploy AI Town on Fly.io</title>
      <dc:creator>Daniel Botha</dc:creator>
      <pubDate>Tue, 26 Aug 2025 12:11:01 +0000</pubDate>
      <link>https://dev.to/flyio/games-as-model-eval-1-click-deploy-ai-town-on-flyio-29o1</link>
      <guid>https://dev.to/flyio/games-as-model-eval-1-click-deploy-ai-town-on-flyio-29o1</guid>
<description>&lt;p&gt;Recently, in &lt;a href="https://fly.io/blog/the-future-isn-t-model-agnostic/" rel="noopener noreferrer"&gt;The Future Isn't Model Agnostic&lt;/a&gt;, I suggested that it's better to pick one model that works for your project and build around it, rather than engineering for model flexibility. If you buy that, you also have to acknowledge how important comprehensive model evaluation becomes.&lt;/p&gt;

&lt;p&gt;Benchmarks tell us almost nothing about how a model will actually behave in the wild, especially with long contexts, or when trusted to deliver the tone and feel that defines the UX we’re shooting for. Even the best evaluation pipelines usually end in subjective, side-by-side output comparisons. Not especially rigorous, and more importantly, boring af.&lt;/p&gt;

&lt;p&gt;Can we gamify model evaluation? Oh yes. And not just because we get to have some fun for once. Google backed me up this week when it announced the &lt;a href="https://blog.google/technology/ai/kaggle-game-arena/" rel="noopener noreferrer"&gt;Kaggle Game Arena&lt;/a&gt;, a public platform where we can watch AI models duke it out in a variety of classic games. Quoting Google: "Current AI benchmarks are struggling to keep pace with modern models... it can be hard to know if models trained on internet data are actually solving problems or just remembering answers they've already seen."&lt;/p&gt;

&lt;p&gt;When models boss reading comprehension tests, or ace math problems, we pay attention. But when they fail to navigate a simple conversation with a virtual character or completely botch a strategic decision in a game environment, we tell ourselves we're not building a game anyway and develop strategic short-term memory loss. &lt;br&gt;
Just like I've told my mom a thousand times, games are great at testing brains, and it's time we take this seriously when it comes to model evaluation. &lt;/p&gt;

&lt;h2&gt;
  
  
  Why Games Don't Lie
&lt;/h2&gt;

&lt;p&gt;Games provide what benchmarks can't: a clear, unambiguous signal of success. They give us observable behavior in dynamic environments, the kind that would be extremely difficult (and tedious) to simulate with prompt engineering alone.&lt;/p&gt;

&lt;p&gt;Games force models to demonstrate the skills we actually care about: strategic reasoning, long-term planning, and dynamic adaptation in interactions with an opponent or a collaborator.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pixel Art Meets Effective Model Evaluation – AI Town on Fly.io
&lt;/h2&gt;

&lt;p&gt;AI Town is a brilliant project by &lt;a href="https://github.com/a16z-infra" rel="noopener noreferrer"&gt;a16z-infra&lt;/a&gt;, based on the mind-bending paper &lt;a href="https://arxiv.org/pdf/2304.03442" rel="noopener noreferrer"&gt;Generative Agents: Interactive Simulacra of Human Behavior&lt;/a&gt;. It's a beautifully rendered little town in which tiny people with AI brains and engineered personalities go about their lives, interacting with each other and their environment. Characters need to remember past conversations, maintain relationships, react dynamically to new situations, and stay in character while doing it all.&lt;/p&gt;

&lt;p&gt;I challenge you to find a more entertaining way of evaluating conversational models. &lt;/p&gt;

&lt;p&gt;I've &lt;a href="https://github.com/fly-apps/ai-town_on_fly.io" rel="noopener noreferrer"&gt;forked the project&lt;/a&gt; to make it absurdly easy to spin up your own AI Town on Fly Machines. You've got a single deploy script that will set everything up for you and some built-in cost and performance optimizations, with our handy scale-to-zero functionality as standard (so you only pay for the time spent running it). This makes it easy to share with your team, your friends, and your mom.&lt;/p&gt;

&lt;p&gt;In its current state, the fork makes it as easy as possible to test any OpenAI-compatible service, any model on Together.ai, and even custom embedding models. Simply set the relevant API key in your secrets.&lt;/p&gt;

&lt;p&gt;Games like AI Town give us a window into how models actually think, adapt, and behave beyond the context of our prompts. You move past performance metrics and begin to understand a model’s personality, quirks, strengths, and weaknesses: all factors that ultimately shape your project's UX.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>softwaredevelopment</category>
      <category>llm</category>
    </item>
    <item>
      <title>The Future Isn't Model Agnostic</title>
      <dc:creator>Daniel Botha</dc:creator>
      <pubDate>Tue, 26 Aug 2025 12:01:24 +0000</pubDate>
      <link>https://dev.to/flyio/the-future-isnt-model-agnostic-46po</link>
      <guid>https://dev.to/flyio/the-future-isnt-model-agnostic-46po</guid>
<description>&lt;p&gt;Your users don't care that your AI project is model agnostic.&lt;/p&gt;

&lt;p&gt;In my last project, I spent countless hours ensuring that the LLMs running my services could be swapped out as easily as possible. I couldn't touch a device with an internet connection without hearing about the latest benchmark-breaking model, and it felt like a clear priority to ensure I could hot-swap models with minimal collateral damage.&lt;/p&gt;

&lt;p&gt;So yeah. That was a waste of time.&lt;/p&gt;

&lt;p&gt;The hype around new model announcements feels more manufactured with each release. In reality, improvements are becoming incremental. As major providers converge on the same baseline, the days of one company holding a decisive lead are numbered.&lt;/p&gt;

&lt;p&gt;In a world of model parity, the differentiation moves entirely to the product layer. Winning isn't about ensuring you're using the best model, it's about understanding your chosen model deeply enough to build experiences that feel magical: knowing exactly how to prompt for consistency, which edge cases to avoid, and how to design workflows that play to your model's particular strengths.&lt;/p&gt;

&lt;p&gt;Model agnosticism isn't just inefficient, it's misguided. Fact is, swapping out your model is not just changing an endpoint. It's rewriting prompts, rerunning evals, users telling you things just feel... different. And if you've won users on the way it feels to use your product, that last one is a really big deal.&lt;/p&gt;

&lt;h2&gt;
  
  
  Model &amp;lt; Product
&lt;/h2&gt;

&lt;p&gt;Recently, something happened that fully solidified this idea in my head. Claude Code is winning among people building real things with AI. We even have evangelists in the Fly.io engineering team, and those guys are weird smart. Elsewhere, whole communities have formed to share and compare claude.md's and fight each other over which MCP servers are the coolest to use with Claude.&lt;/p&gt;

&lt;p&gt;Enter stage right, Qwen 3 Coder. It takes Claude to the cleaners in benchmarks. But the response from the Claude Code user base? A collective meh.&lt;/p&gt;

&lt;p&gt;This is nothing like 2024, when everyone would have dropped everything to get the hot new model running in Cursor. And it's not because we've learned that benchmarks are performance theater for people who've never shipped a product.&lt;/p&gt;

&lt;p&gt;It's because products like Claude Code are irrefutable evidence that the model isn't the product. We've felt it firsthand when our pair programmer's behavior changes in subtle ways. The product is in the rituals. The trust. The predictability. It's precisely because Claude Code's model behavior, UI, and user expectations are so tightly coupled that its users don't really care that a better model might exist.&lt;/p&gt;

&lt;p&gt;I'm not trying to praise Anthropic here. The point is, engineering for model agnosticism is a trap that will eat up time that could be better spent … anywhere else.&lt;/p&gt;

&lt;p&gt;Sure, if you're building infra or anything else that lives close to the metal, model optionality still matters. But people entrusting legwork to AI tools are building deeper relationships with, and expectations of, those tools than they care to admit. AI product success stories are written when products become invisible parts of users' daily rituals, not showcases for engineering flexibility.&lt;/p&gt;

&lt;h2&gt;
  
  
  Make One Model Your Own
&lt;/h2&gt;

&lt;p&gt;As builders, it's time we stop hedging our bets and embrace the convergence reality. Every startup pitch deck with 'model-agnostic' as a feature should become a red flag for investors who understand product-market fit. Stop putting 'works with any LLM' in your one-liner. It screams 'we don't know what we're building.'&lt;/p&gt;

&lt;p&gt;If you're still building model-agnostic AI tools in 2025, you're optimizing for the wrong thing. Users don't want flexibility; they want reliability. And in a converged model landscape, reliability comes from deep specialization, not broad compatibility.&lt;/p&gt;

&lt;p&gt;Pick your model like you pick your therapist: for the long haul. Find the right model, tune deeply, get close enough to understand its quirks and make them work for you. Stop architecting for the mythical future where you'll seamlessly swap models. That future doesn't exist, and chasing it is costing you the present.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bonus level: All-in On One Model Means All-out On Eval
&lt;/h2&gt;

&lt;p&gt;If any of this is landing for you, you'll agree that we have to start thinking of model evaluation as architecture, not an afterthought. The good news is, rigorous model eval doesn't have to be mind-numbing anymore.&lt;/p&gt;

&lt;p&gt;Turns out, games are really great eval tools! Now you can spin up your very own little &lt;a href="https://github.com/fly-apps/ai-town_on_fly.io" rel="noopener noreferrer"&gt;AI Town&lt;/a&gt; on Fly.io with a single-click deploy to test different models as pixel people in an evolving environment. I discuss the idea further in &lt;a href="https://fly.io/blog/games-as-model-eval/" rel="noopener noreferrer"&gt;Games as Model Eval: 1-Click Deploy AI Town on Fly.io&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>models</category>
    </item>
    <item>
      <title>Build Better Agents With MorphLLM</title>
      <dc:creator>Daniel Botha</dc:creator>
      <pubDate>Tue, 26 Aug 2025 11:34:45 +0000</pubDate>
      <link>https://dev.to/flyio/build-better-agents-with-morphllm-4mc8</link>
      <guid>https://dev.to/flyio/build-better-agents-with-morphllm-4mc8</guid>
      <description>&lt;p&gt;I'm an audiophile, which is a nice way to describe someone who spends their children's college fund on equipment that yields no audible improvement in sound quality. As such, I refused to use wireless headphones for the longest time. The fun thing about wired headphones is when you forget they're on and you stand up, you simultaneously cause irreparable neck injuries and extensive property damage. This eventually prompted me to buy good wireless headphones and, you know what, I break fewer things now. I can also stand up from my desk and not be exposed to the aural horrors of the real world. &lt;/p&gt;

&lt;p&gt;This is all to say, sometimes you don't know how big a problem is until you solve it. This week, I chatted to the fine people building &lt;a href="https://morphllm.com/" rel="noopener noreferrer"&gt;MorphLLM&lt;/a&gt;, which is exactly that kind of solution for AI agent builders. &lt;/p&gt;

&lt;h2&gt;
  
  
  Slow, Wasteful and Expensive AI Code Changes
&lt;/h2&gt;

&lt;p&gt;If you’re building AI agents that write or edit code, you’re probably accepting the following as "the way it is": Your agent needs to correct a single line of code, but rewrites an entire file to do it. Search-and-replace, right? It’s fragile, breaks formatting, silently fails, or straight up leaves important functions out. The result is slow, inaccurate code changes, excessive token use, and an agent that feels incompetent and unreliable.&lt;/p&gt;

&lt;p&gt;Full file rewrites are context-blind and prone to hallucinations, especially when editing that 3000+ line file that you've been meaning to refactor. And every failure and iteration is wasted compute, wasted money, and, worst of all, wasted time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why We Aren’t Thinking About This (or why I wasn't)
&lt;/h2&gt;

&lt;p&gt;AI workflows are still new to everyone. Best practices are still just opinions and most tooling is focused on model quality, not developer velocity or cost. This is a big part of why we feel that slow, wasteful code edits are just the price of admission for AI-powered development.&lt;/p&gt;

&lt;p&gt;In reality, these inefficiencies become a real bottleneck for coding agent tools. The hidden tax on every code edit adds up and your users pay with their time, especially as teams scale and projects grow more complex.&lt;/p&gt;

&lt;h2&gt;
  
  
  Better, Faster AI Code Edits with Morph Fast Apply
&lt;/h2&gt;

&lt;p&gt;MorphLLM's core innovation is Morph Fast Apply. It's an edit merge tool that is semantic, structure-aware, and designed specifically for code. Those are big words to describe a tool that will empower your agents to make single-line changes without rewriting whole files or relying on brittle search-and-replace. Instead, your agent applies precise, context-aware edits, and it does it ridiculously fast.&lt;/p&gt;

&lt;p&gt;It works like this: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You add an &lt;code&gt;edit_file&lt;/code&gt; tool to your agent's toolset.&lt;/li&gt;
&lt;li&gt;Your agent outputs tiny &lt;code&gt;edit_file&lt;/code&gt; snippets, using &lt;code&gt;//...existing code...&lt;/code&gt; placeholders to indicate unchanged code.&lt;/li&gt;
&lt;li&gt;Your backend calls Morph’s Apply API, which merges the changes semantically. It doesn't just replace raw text; it makes targeted merges with the codebase as context.&lt;/li&gt;
&lt;li&gt;You write back the precisely edited file. No manual patching, no painful conflict resolution, no context lost.&lt;/li&gt;
&lt;/ul&gt;
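&lt;p&gt;A rough sketch of what that backend step might look like in Python. The endpoint URL, model name, and merge-prompt layout below are my own illustrative assumptions (MorphLLM is OpenAI-compatible, but check &lt;a href="https://docs.morphllm.com/" rel="noopener noreferrer"&gt;docs.morphllm.com&lt;/a&gt; for the real interface):&lt;/p&gt;

```python
# Hypothetical sketch of calling an OpenAI-compatible Apply endpoint.
# The URL, model name, and prompt layout are illustrative assumptions,
# not Morph's documented API -- consult docs.morphllm.com before use.
import json
import urllib.request

APPLY_URL = "https://api.morphllm.com/v1/chat/completions"  # assumed endpoint

def build_apply_request(original_file: str, edit_snippet: str) -> dict:
    """Package the full file and the agent's tiny edit snippet (which uses
    // ... existing code ... placeholders) into one chat-style request."""
    return {
        "model": "morph-fast-apply",  # illustrative model name
        "messages": [{
            "role": "user",
            "content": (
                "Merge this edit into the file.\n"
                f"ORIGINAL FILE:\n{original_file}\n"
                f"EDIT SNIPPET:\n{edit_snippet}"
            ),
        }],
    }

def apply_edit(original_file: str, edit_snippet: str, api_key: str) -> str:
    """POST the request and return the semantically merged file."""
    req = urllib.request.Request(
        APPLY_URL,
        data=json.dumps(build_apply_request(original_file, edit_snippet)).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

&lt;p&gt;The key point is that the agent only ever emits the snippet; the Apply call returns the whole merged file, which you write back to disk.&lt;/p&gt;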

&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;p&gt;MorphLLM's Apply API processes over 4,500 tokens per second and their benchmark results are nuts. We're talking 98% accuracy in ~6 seconds per file. Compare this to 35s (with error corrections) at 86% accuracy for traditional search-and-replace systems. Files up to 9k tokens in size take ~4 seconds to process. &lt;/p&gt;

&lt;p&gt;Just look at the damn &lt;a href="https://morphllm.com/benchmarks" rel="noopener noreferrer"&gt;graph&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhbpmtj7k71u04j2qzqkx.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhbpmtj7k71u04j2qzqkx.webp" alt="time performance graph on MorphLLM" width="800" height="452"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These are game-changing numbers for agent builders. Real-time code UIs become possible. Dynamic codebases can self-adapt in seconds, not minutes. Scale to multi-file edits, documentation, and even large asset transformations without sacrificing speed or accuracy.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Get in on the MorphLLM Action
&lt;/h2&gt;

&lt;p&gt;Integration with your project is easy peasy. MorphLLM is API-compatible with OpenAI, Vercel AI SDK, MCP, and OpenRouter. You can run it in the cloud, self-host, or go on-prem with enterprise-grade guarantees. &lt;/p&gt;

&lt;p&gt;I want to cloud host mine, if only I could think of somewhere I could quickly and easily deploy wherever I want and pay only when I'm using the infra 😉.&lt;/p&gt;

&lt;h2&gt;
  
  
  Get Morphed
&lt;/h2&gt;

&lt;p&gt;MorphLLM feels like a plug-in upgrade for code agent projects that will instantly make them faster and more accurate. Check out the docs, benchmarks, and integration guides at &lt;a href="https://docs.morphllm.com/" rel="noopener noreferrer"&gt;docs.morphllm.com&lt;/a&gt;. Get started for free at &lt;a href="https://morphllm.com/dashboard" rel="noopener noreferrer"&gt;https://morphllm.com/dashboard&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Games as Model Eval: 1-Click Deploy AI Town on Fly.io</title>
      <dc:creator>Daniel Botha</dc:creator>
      <pubDate>Tue, 19 Aug 2025 08:46:14 +0000</pubDate>
      <link>https://dev.to/flyio/games-as-model-eval-1-click-deploy-ai-town-on-flyio-13n</link>
      <guid>https://dev.to/flyio/games-as-model-eval-1-click-deploy-ai-town-on-flyio-13n</guid>
<description>&lt;p&gt;Recently, in &lt;a href="https://fly.io/blog/the-future-isn-t-model-agnostic/" rel="noopener noreferrer"&gt;The Future Isn't Model Agnostic&lt;/a&gt;, I suggested that it's better to pick one model that works for your project and build around it, rather than engineering for model flexibility. If you buy that, you also have to acknowledge how important comprehensive model evaluation becomes.&lt;/p&gt;

&lt;p&gt;Benchmarks tell us almost nothing about how a model will actually behave in the wild, especially with long contexts, or when trusted to deliver the tone and feel that defines the UX we’re shooting for. Even the best evaluation pipelines usually end in subjective, side-by-side output comparisons. Not especially rigorous, and more importantly, boring af.&lt;/p&gt;

&lt;p&gt;Can we gamify model evaluation? Oh yes. And not just because we get to have some fun for once. Google backed me up this week when it announced the &lt;a href="https://blog.google/technology/ai/kaggle-game-arena/" rel="noopener noreferrer"&gt;Kaggle Game Arena&lt;/a&gt;, a public platform where we can watch AI models duke it out in a variety of classic games. Quoting Google: "Current AI benchmarks are struggling to keep pace with modern models... it can be hard to know if models trained on internet data are actually solving problems or just remembering answers they've already seen."&lt;/p&gt;

&lt;p&gt;When models boss reading comprehension tests, or ace math problems, we pay attention. But when they fail to navigate a simple conversation with a virtual character or completely botch a strategic decision in a game environment, we tell ourselves we're not building a game anyway and develop strategic short-term memory loss. &lt;br&gt;
Just like I've told my mom a thousand times, games are great at testing brains, and it's time we take this seriously when it comes to model evaluation. &lt;/p&gt;

&lt;h2&gt;
  
  
  Why Games Don't Lie
&lt;/h2&gt;

&lt;p&gt;Games provide what benchmarks can't: a clear, unambiguous signal of success. They give us observable behavior in dynamic environments, the kind that would be extremely difficult (and tedious) to simulate with prompt engineering alone.&lt;/p&gt;

&lt;p&gt;Games force models to demonstrate the skills we actually care about: strategic reasoning, long-term planning, and dynamic adaptation in interactions with an opponent or a collaborator.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pixel Art Meets Effective Model Evaluation – AI Town on Fly.io
&lt;/h2&gt;

&lt;p&gt;AI Town is a brilliant project by &lt;a href="https://github.com/a16z-infra" rel="noopener noreferrer"&gt;a16z-infra&lt;/a&gt;, based on the mind-bending paper &lt;a href="https://arxiv.org/pdf/2304.03442" rel="noopener noreferrer"&gt;Generative Agents: Interactive Simulacra of Human Behavior&lt;/a&gt;. It's a beautifully rendered little town in which tiny people with AI brains and engineered personalities go about their lives, interacting with each other and their environment. Characters need to remember past conversations, maintain relationships, react dynamically to new situations, and stay in character while doing it all.&lt;/p&gt;

&lt;p&gt;I challenge you to find a more entertaining way of evaluating conversational models. &lt;/p&gt;

&lt;p&gt;I've &lt;a href="https://github.com/fly-apps/ai-town_on_fly.io" rel="noopener noreferrer"&gt;forked the project&lt;/a&gt; to make it absurdly easy to spin up your own AI Town on Fly Machines. You've got a single deploy script that will set everything up for you and some built-in cost and performance optimizations, with our handy scale-to-zero functionality as standard (so you only pay for the time spent running it). This makes it easy to share with your team, your friends, and your mom.&lt;/p&gt;

&lt;p&gt;In its current state, the fork makes it as easy as possible to test any OpenAI-compatible service, any model on Together.ai, and even custom embedding models. Simply set the relevant API key in your secrets.&lt;/p&gt;

&lt;p&gt;Games like AI Town give us a window into how models actually think, adapt, and behave beyond the context of our prompts. You move past performance metrics and begin to understand a model’s personality, quirks, strengths, and weaknesses: all factors that ultimately shape your project's UX.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>opensource</category>
      <category>testing</category>
    </item>
    <item>
      <title>Trust Calibration for AI Software Builders</title>
      <dc:creator>Daniel Botha</dc:creator>
      <pubDate>Tue, 19 Aug 2025 08:42:45 +0000</pubDate>
      <link>https://dev.to/flyio/trust-calibration-for-ai-software-builders-1ifj</link>
      <guid>https://dev.to/flyio/trust-calibration-for-ai-software-builders-1ifj</guid>
<description>&lt;p&gt;Trust calibration is a concept from the world of human-machine interaction design, one that is super relevant to AI software builders. Trust calibration is the practice of aligning the level of trust that users have in our products with their actual capabilities.&lt;/p&gt;

&lt;p&gt;If we build things that our users trust too blindly, we risk facilitating dangerous or destructive interactions that can permanently turn users off. If they don't trust our product enough, it will feel useless or less capable than it actually is. &lt;/p&gt;

&lt;p&gt;So what does trust calibration look like in practice, and how do we achieve it? A 2023 study reviewed over 1,000 papers on trust and trust calibration in human-automated systems (properly referenced at the end of this article). It holds some pretty eye-opening insights – and some inconvenient truths – for people building AI software. I've tried to extract just the juicy bits below.&lt;/p&gt;

&lt;h2&gt;
  
  
  Limiting Trust
&lt;/h2&gt;

&lt;p&gt;Let's begin with a critical point. There is a limit to how deeply we want users to trust our products. Designing for calibrated trust is the goal, not more trust at any cost. Shoddy trust calibration leads to two equally undesirable outcomes: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Over-trust&lt;/strong&gt; causes users to rely on AI systems in situations where they shouldn't (I told my code assistant to fix a bug in prod and went to bed).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Under-trust&lt;/strong&gt; causes users to reject AI assistance even when it would be beneficial, resulting in reduced perception of value and increased user workload.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What does calibrated trust look like for your product? It’s important to understand that determining this is less about trying to diagram a set of abstract trust parameters and more about helping users develop accurate mental models of your product's capabilities and limitations. In most cases, this requires thinking beyond the trust calibration mechanisms we default to, like confidence scores. &lt;/p&gt;

&lt;p&gt;For example, Cursor's most prominent trust calibration mechanism is its change-suggestion highlighting. The code that the model suggests we change is highlighted in red, followed by suggested changes highlighted in green. This immediately communicates that "this is a suggestion, not a command."&lt;/p&gt;

&lt;p&gt;In contrast, Tesla's Autopilot is a delegative system. It must calibrate trust differently through detailed capability explanations, clear operational boundaries (only on highways), and prominent disengagement alerts when conditions exceed system limits. &lt;/p&gt;

&lt;h2&gt;
  
  
  Building Cooperative Systems
&lt;/h2&gt;

&lt;p&gt;Perhaps the most fundamental consideration in determining high-level trust calibration objectives is deciding whether your project is designed to be a cooperative or a delegative tool.&lt;/p&gt;

&lt;p&gt;Cooperative systems generally call for lower levels of trust because users can choose whether to accept or reject AI suggestions. But these systems also face a unique risk. It’s easy for over-trust to gradually transform user complacency into over-reliance, effectively transforming what we designed as a cooperative relationship into a delegative one, only without any of the required safeguards.&lt;/p&gt;

&lt;p&gt;If you're building a coding assistant, content generator, or design tool, implement visible "suggestion boundaries" which make it clear when the AI is offering ideas versus making decisions. Grammarly does this well by underlining suggestions rather than auto-correcting, and showing rationale on hover. &lt;/p&gt;

&lt;p&gt;For higher-stakes interactions, consider introducing friction. Require explicit confirmation before applying AI suggestions to production code or publishing AI-generated content.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building Delegative Systems
&lt;/h2&gt;

&lt;p&gt;In contrast, users expect delegative systems to replace human action entirely. Blind trust in the system is a requirement for it to be considered valuable at all. &lt;/p&gt;

&lt;p&gt;If you're building automation tools, smart scheduling, or decision-making systems, invest heavily in capability communication and boundary setting. Calendly's smart scheduling works because it clearly communicates what it will and won't do (I'll find times that work for both of us vs. I'll reschedule your existing meetings). Build robust fallback mechanisms and make system limitations prominent in your onboarding.  &lt;/p&gt;

&lt;h2&gt;
  
  
  Timing Is Everything
&lt;/h2&gt;

&lt;p&gt;The study suggests that when we make trust calibrations is at least as important as how we make them. There are three critical windows for trust calibration, each with its own opportunities and challenges.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pre-interaction calibration&lt;/strong&gt; happens before users engage with the system. Docs and tutorials fall into this category. Setting expectations up front can prevent initial over-trust, which is disproportionately more difficult to correct later.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Pre-interaction calibrations could look like capability-focused onboarding that shows both successes and failures. Rather than just demonstrating perfect AI outputs, show users examples where the AI makes mistakes and how to catch them. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;During-interaction calibration&lt;/strong&gt; is trust adjustment through real-time feedback. In the study, dynamically updated cues calibrated trust better than static displays, and adaptive calibration that responds to user behavior outperformed systems that show every user the same information.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Build confidence indicators that are updated based on context, not just model confidence. For example, if you're building a document AI, show higher confidence for standard document types the system has seen thousands of times, and lower confidence for unusual formats. &lt;/p&gt;
&lt;/blockquote&gt;
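&lt;p&gt;As a toy illustration of that idea (the numbers and names below are made up, not from the study), the displayed confidence might scale the model's raw confidence by how familiar the system is with the input type:&lt;/p&gt;

```python
# Toy sketch: discount displayed confidence for document types the
# system has rarely processed. The 1000-example saturation point and
# the 30% maximum discount are arbitrary, illustrative choices.
def displayed_confidence(model_conf: float, doc_type: str,
                         seen_counts: dict[str, int]) -> float:
    # Familiarity ramps from 0.0 to 1.0 over the first 1000 examples.
    familiarity = min(1.0, seen_counts.get(doc_type, 0) / 1000)
    # Rarely-seen formats have their confidence discounted by up to 30%.
    return model_conf * (0.7 + 0.3 * familiarity)
```

&lt;p&gt;A standard invoice the system has seen thousands of times displays close to the model's raw confidence, while a novel format gets a visible discount.&lt;/p&gt;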

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Post-interaction calibration&lt;/strong&gt; focuses on learning and adjustment that helps users understand the system's successes and failures after interactions. These calibrations are the least reliable, since by the time users receive the information, their trust patterns are set and hard to change.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Post-interaction feedback can still be valuable for teaching. Create "reflection moments" after significant interactions. Midjourney does this by letting users rate image outputs, helping users learn what prompts work best while calibrating their expectations for future generations. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Trust is front-loaded and habit-driven. The most effective calibration happens before and during use, when expectations are still forming and behaviors can still be shifted. Any later and you’re mostly fighting entrenched patterns.&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance vs. Process Information
&lt;/h2&gt;

&lt;p&gt;Users can be guided through performance-oriented signals (what the system can do) or process-oriented signals (how it works). The real challenge is matching the right kind of explanation to the right user, at the right moment.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Performance-oriented calibration&lt;/strong&gt; focuses on communicating capability through mechanisms like reliability statistics, confidence scores, and clear capability boundaries. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Process-oriented calibration&lt;/strong&gt; offers detailed explanations of decision-making processes, breakdowns of which factors influenced decisions, and reasoning transparency. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Process transparency seems like the obvious go-to at first glance, but the effectiveness of process explanations varies wildly based on user expertise and domain knowledge. If we are designing for a set of users that may fall anywhere on this spectrum, we have to avoid creating information overload for novice users while providing sufficient information to expert users who want the detail.  &lt;/p&gt;

&lt;p&gt;The most effective systems in the study combined both approaches, providing layered information that allows users to access the level of detail most appropriate for their expertise and current needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Static vs. Adaptive Calibration
&lt;/h2&gt;

&lt;p&gt;I really wanted to ignore this part, because it feels like the study’s authors are passive-aggressively adding todos to my projects. In a nutshell, adaptive calibration – where a system actively monitors user behavior and adjusts its communication accordingly – is orders of magnitude more effective than static calibration, which delivers the same information to every user regardless of differences in expertise, trust propensity, or behavior.&lt;/p&gt;

&lt;p&gt;Static calibration mechanisms are easy to build and maintain, which is why we like them. But the stark reality is that they put the burden of appropriate calibration entirely on our users. We’re making it their job to adapt their behavior based on generic information.&lt;/p&gt;

&lt;p&gt;This finding has zero respect for our time or mental health, but it also reveals a legit opportunity for clever builders to truly separate their product from the herd.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical adaptive calibration techniques
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Behavioral adaptation:&lt;/strong&gt; Track how often users accept vs. reject suggestions and adjust confidence thresholds accordingly. If a user consistently rejects high-confidence suggestions, lower the threshold for showing uncertainty.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context awareness:&lt;/strong&gt; Adjust trust signals based on use context. A writing AI might show higher confidence for grammar fixes than creative suggestions, or lower confidence late at night when users might be tired.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Detect expertise:&lt;/strong&gt; Users who frequently make sophisticated edits to AI output probably want more detailed explanations than those who typically accept entire file rewrites.&lt;/li&gt;
&lt;/ul&gt;
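The behavioral-adaptation idea above can be sketched in a few lines. This is a hypothetical illustration, not anything from the study: all class names, thresholds, and step sizes here are made up. The core move is tracking acceptance vs. rejection of high-confidence suggestions and shifting the threshold at which uncertainty gets surfaced.

```python
from collections import defaultdict

class TrustCalibrator:
    """Hypothetical sketch of behavioral adaptation: if a user keeps
    rejecting 'high-confidence' output, start surfacing uncertainty earlier."""

    def __init__(self, threshold=0.8, step=0.05, min_t=0.5, max_t=0.95):
        self.threshold = threshold  # confidence above this is shown without a caveat
        self.step = step
        self.min_t, self.max_t = min_t, max_t
        self.stats = defaultdict(lambda: {"accepted": 0, "rejected": 0})

    def record(self, confidence, accepted):
        band = "high" if confidence >= self.threshold else "low"
        self.stats[band]["accepted" if accepted else "rejected"] += 1
        high = self.stats["high"]
        total = high["accepted"] + high["rejected"]
        if total >= 10:  # wait for a minimal sample before adapting
            accept_rate = high["accepted"] / total
            if accept_rate < 0.5:
                # User distrusts our confident answers: hedge more often.
                self.threshold = min(self.threshold + self.step, self.max_t)
            elif accept_rate > 0.9:
                # User trusts confident answers: hedge less often.
                self.threshold = max(self.threshold - self.step, self.min_t)

    def show_uncertainty(self, confidence):
        return confidence < self.threshold
```

A real system would want per-task-type stats and some decay on old observations, but even this crude loop stops delivering identical signals to every user.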

&lt;h2&gt;
  
  
  The Transparency Paradox
&lt;/h2&gt;

&lt;p&gt;The idea that transparency and explainability can actually harm trust calibration is easily the point that hit me the hardest. While explanations can improve user understanding, they can also create information overload that reduces users' ability to detect and correct trash output. What's worse, explanations can create a whole new layer of trust calibration issues, with users over-trusting the explanation mechanism itself, rather than critically evaluating the actual output.&lt;/p&gt;

&lt;p&gt;This suggests that quality over quantity should be our design philosophy when it comes to transparency. We should provide carefully crafted, relevant information rather than comprehensive but overwhelming detail. The goal should be enabling better decision-making rather than simply satisfying user curiosity about system internals.&lt;/p&gt;

&lt;h2&gt;
  
  
  Anthropomorphism and Unwarranted Trust
&lt;/h2&gt;

&lt;p&gt;It seems obvious that we should make interactions with our AI project feel as human as possible. Well, it turns out that systems that appear more human-like through design, language, or interaction patterns are notoriously good at increasing user trust beyond actual system capabilities. &lt;/p&gt;

&lt;p&gt;So it’s entirely possible that building more traditional human-computer interactions can actually make our AI projects safer to use and therefore more user-friendly.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Use tool-like language:&lt;/strong&gt; Frame outputs as "analysis suggests" rather than "I think" or "I believe"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embrace machine-like precision:&lt;/strong&gt; Show exact confidence percentages rather than human-like hedging ("I'm pretty sure that...")&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Trust Falls Faster Than It Climbs
&lt;/h2&gt;

&lt;p&gt;Nothing particularly groundbreaking here, but the findings are worth mentioning if only to reinforce what we think we know. &lt;/p&gt;

&lt;p&gt;Early interactions are critically important. Users form mental models quickly and then react slowly to changes in system reliability.&lt;/p&gt;

&lt;p&gt;More critically, trust drops much faster from system failures than it builds from successes. These asymmetries suggest that we should invest disproportionately in onboarding and first-use experiences, even if they come with higher development costs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Measurement is an Opportunity for Innovation
&lt;/h2&gt;

&lt;p&gt;The study revealed gaping voids, for both researchers and builders, where effective measurement mechanisms and protocols should be. There is a clear need to move beyond simple user-satisfaction metrics and adoption rates toward measurement frameworks that can actively detect miscalibrated trust patterns.&lt;/p&gt;

&lt;p&gt;The ideal measurement approach would combine multiple indicators. A few examples of viable indicators are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Behavioral signals:&lt;/strong&gt; Track acceptance rates for different confidence levels. Well-calibrated trust should show higher acceptance rates for high-confidence outputs and lower rates for low-confidence ones.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context-specific metrics:&lt;/strong&gt; Measure trust calibration separately for different use cases. Users might be well-calibrated for simple tasks but poorly calibrated for complex ones.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User self-reporting:&lt;/strong&gt; Regular pulse surveys asking "How confident are you in your ability to tell when this AI makes mistakes?" can reveal calibration gaps.&lt;/li&gt;
&lt;/ul&gt;
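The behavioral-signals indicator above is easy to prototype. Here's a minimal sketch with hypothetical bucket boundaries (the study doesn't prescribe any): compute acceptance rates per confidence bucket and flag the inverted ordering that suggests miscalibrated trust.

```python
def calibration_gap(events, buckets=((0.0, 0.5), (0.5, 0.8), (0.8, 1.01))):
    """Hypothetical behavioral-signal metric. events is a list of
    (confidence, accepted) pairs. Well-calibrated trust should show
    acceptance rising with confidence; returns per-bucket acceptance
    rates plus a flag when a higher bucket underperforms a lower one."""
    rates = []
    for lo, hi in buckets:
        hits = [accepted for conf, accepted in events if lo <= conf < hi]
        rates.append(sum(hits) / len(hits) if hits else None)
    observed = [r for r in rates if r is not None]
    # Any drop in acceptance as confidence rises is a miscalibration signal.
    miscalibrated = any(later < earlier for earlier, later in zip(observed, observed[1:]))
    return rates, miscalibrated
```

Sliced per use case (simple vs. complex tasks), the same function covers the context-specific metrics bullet as well.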

&lt;h2&gt;
  
  
  The Calibrated Conclusion
&lt;/h2&gt;

&lt;p&gt;It's clear, at least from this study, that there’s no universal formula, or single feature that will effectively calibrate trust. It's up to every builder to define and understand their project's trust goals and to balance timing, content, adaptivity, and transparency accordingly. That’s what makes it both hard and worth doing. Trust calibration has to be a core part of our product’s identity, not a piglet we only start chasing once it has escaped the barn.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Study:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Magdalena Wischnewski, Nicole Krämer, and Emmanuel Müller. 2023. Measuring and Understanding Trust Calibrations for Automated Systems: A Survey of the State-Of-The-Art and Future Directions. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI '23), April 23–28, 2023, Hamburg, Germany. ACM, New York, NY, USA, 16 pages. &lt;a href="https://doi.org/10.1145/3544548.3581197" rel="noopener noreferrer"&gt;https://doi.org/10.1145/3544548.3581197&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>programming</category>
      <category>ux</category>
    </item>
    <item>
      <title>The Future Isn't Model Agnostic</title>
      <dc:creator>Daniel Botha</dc:creator>
      <pubDate>Tue, 12 Aug 2025 09:31:28 +0000</pubDate>
      <link>https://dev.to/flyio/the-future-isnt-model-agnostic-447k</link>
      <guid>https://dev.to/flyio/the-future-isnt-model-agnostic-447k</guid>
      <description>&lt;p&gt;&lt;strong&gt;Your users don't care that your AI project is model &lt;br&gt;
agnostic.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In my last project, I spent countless hours ensuring that the LLMs running my services could be swapped out as easily as possible. I couldn't touch a device with an internet connection without hearing about the latest benchmark-breaking model, and it felt like a clear priority to ensure I could hot-swap models with minimal collateral damage.&lt;/p&gt;

&lt;p&gt;So yeah. That was a waste of time.&lt;/p&gt;

&lt;p&gt;The hype around new model announcements feels more manufactured with each release. In reality, improvements are becoming incremental. As major providers converge on the same baseline, the days of one company holding a decisive lead are numbered.&lt;/p&gt;

&lt;p&gt;In a world of model parity, the differentiation moves entirely to the product layer. Winning isn't about ensuring you're using the best model, it's about understanding your chosen model deeply enough to build experiences that feel magical. Knowing exactly how to prompt for consistency, which edge cases to avoid, and how to design workflows that play to your model's particular strengths.&lt;/p&gt;

&lt;p&gt;Model agnosticism isn't just inefficient, it's misguided. Fact is, swapping out your model is not just changing an endpoint. It's rewriting prompts, rerunning evals, users telling you things just feel... different. And if you've won users on the way it feels to use your product, that last one is a really big deal.&lt;/p&gt;

&lt;h2&gt;
  
  
  Model &amp;lt; Product
&lt;/h2&gt;

&lt;p&gt;Recently, something happened that fully solidified this idea in my head. Claude Code is winning among people building real things with AI. We even have evangelists in the Fly.io engineering team, and those guys are weird smart. Elsewhere, whole communities have formed to share and compare claude.md's and fight each other over which MCP servers are the coolest to use with Claude.&lt;/p&gt;

&lt;p&gt;Enter stage right, Qwen 3 Coder. It takes Claude to the cleaners in benchmarks. But the response from the Claude Code user base? A collective meh.&lt;/p&gt;

&lt;p&gt;This is nothing like 2024, when everyone would have dropped everything to get the hot new model running in Cursor. And it's not because we've learned that benchmarks are performance theater for people who've never shipped a product.&lt;/p&gt;

&lt;p&gt;It's because products like Claude Code are irrefutable evidence that the model isn't the product. We've felt it firsthand when our pair programmer's behavior changes in subtle ways. The product is in the rituals. The trust. The predictability. It's precisely because Claude Code's model behavior, UI, and user expectations are so tightly coupled that its users don't really care that a better model might exist.&lt;/p&gt;

&lt;p&gt;I'm not trying to praise Anthropic here. The point is, engineering for model agnosticism is a trap that will eat up time that could be better spent … anywhere else.&lt;/p&gt;

&lt;p&gt;Sure, if you're building infra or anything else that lives close to the metal, model optionality still matters. But people entrusting legwork to AI tools are building deeper relationships with, and expectations of, those tools than they care to admit. AI product success stories are written when products become invisible parts of users' daily rituals, not showcases for engineering flexibility.&lt;/p&gt;

&lt;h2&gt;
  
  
  Make One Model Your Own
&lt;/h2&gt;

&lt;p&gt;As builders, it's time we stop hedging our bets and embrace the convergence reality. Every startup pitch deck with 'model-agnostic' as a feature should become a red flag for investors who understand product-market fit. Stop putting 'works with any LLM' in your one-liner. It screams 'we don't know what we're building.'&lt;/p&gt;

&lt;p&gt;If you're still building model-agnostic AI tools in 2025, you're optimizing for the wrong thing. Users don't want flexibility; they want reliability. And in a converged model landscape, reliability comes from deep specialization, not broad compatibility.&lt;/p&gt;

&lt;p&gt;Pick your model like you pick your therapist: for the long haul. Find the right model, tune deeply, get close enough to understand its quirks and make them work for you. Stop architecting for the mythical future where you'll seamlessly swap models. That future doesn't exist, and chasing it is costing you the present.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bonus level: All-in On One Model Means All-out On Eval
&lt;/h2&gt;

&lt;p&gt;If any of this is landing for you, you'll agree that we have to start thinking of model evaluation as architecture, not an afterthought. The good news is, rigorous model eval doesn't have to be mind-numbing anymore.&lt;/p&gt;

&lt;p&gt;Turns out, games are really great eval tools! Now you can spin up your very own little &lt;a href="https://github.com/fly-apps/ai-town_on_fly.io" rel="noopener noreferrer"&gt;AI Town&lt;/a&gt; on Fly.io with a one-click deploy to test different models as pixel people in an evolving environment. I discuss the idea further in &lt;a href="https://dev.to/flyio/games-as-model-eval-1-click-deploy-ai-town-on-flyio-29o1" rel="noopener noreferrer"&gt;Games as Model Eval: 1-Click Deploy AI Town on Fly.io&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Better Business Intelligence in Elixir with Livebook</title>
      <dc:creator>Mark Ericksen</dc:creator>
      <pubDate>Fri, 30 Jul 2021 16:57:02 +0000</pubDate>
      <link>https://dev.to/flyio/better-business-intelligence-in-elixir-with-livebook-pik</link>
      <guid>https://dev.to/flyio/better-business-intelligence-in-elixir-with-livebook-pik</guid>
      <description>&lt;p&gt;As a developer, has your manager ever come and asked a question like, "How much money are we making?" If you were a line-of-business developer at a global insurance company, you'd reach for your handy, nosebleed-expensive Business Intelligence (BI) suite to answer this question. But you're not, so how did you answer it for them?&lt;/p&gt;

&lt;p&gt;Obviously, you'd do what we all do. You'd SSH into your server, start an Elixir &lt;code&gt;iex&lt;/code&gt; session or a Rails console, then run some scripts to query data, sum numbers, and come up with answers.&lt;/p&gt;

&lt;p&gt;Well, give yourself a raise! Because you just built a  BI suite.&lt;/p&gt;

&lt;p&gt;It may not seem super sophisticated, but it solves the business need. And for problems like this, Livebook can be a better BI tool for Elixir developers.&lt;/p&gt;

&lt;h2&gt;
  
  
  BI What?
&lt;/h2&gt;

&lt;p&gt;What is a &lt;a href="https://en.wikipedia.org/wiki/Business_intelligence"&gt;BI tool&lt;/a&gt;?&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Business intelligence (BI) comprises the strategies and technologies used ... for the data analysis of business information.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Translated from Gartner-speak, that means any tools you run to get a picture of how the business is doing are BI tools.&lt;/p&gt;

&lt;p&gt;In the last bootstrapped startup I worked at, management routinely asked backend developers for business numbers. It was simple stuff, like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How many new clients did we add this week?&lt;/li&gt;
&lt;li&gt;What was our customer's spend?&lt;/li&gt;
&lt;li&gt;Who are our top 10 customers this week and what were their numbers?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As simple as this stuff seems, it's really important for helping business-focused leaders understand the business and make better decisions. That's why global insurance companies, whose applications are too complicated to bring up a Rails console on, spend six figures on BI suites.&lt;/p&gt;

&lt;p&gt;How did we get those answers? Using our Elixir &lt;code&gt;iex&lt;/code&gt;, or interactive shell. We ran some scripts and gave them CSV-friendly rows they could add to their spreadsheets. In that early-stage startup, we were using &lt;code&gt;iex&lt;/code&gt; as our BI tool. At a startup before that, we did the same thing but using the Rails console.&lt;/p&gt;

&lt;p&gt;If you're using the Rails console, Elixir's &lt;code&gt;iex&lt;/code&gt;, or another REPL to examine your data, then that's your BI tool for now. But with Elixir, we can do better. Livebook gives you data, charts and graphs too, but because it's executing your Elixir code, it can also call out to your other integrations and pull in even more.&lt;/p&gt;

&lt;p&gt;To understand why Livebook can be a better tool, let's go further and talk about BI tools in general, not just your REPL.&lt;/p&gt;

&lt;h2&gt;
  
  
  Old School Business Intelligence
&lt;/h2&gt;

&lt;p&gt;Our premise in this post is that we can give "serious" BI tools a run for their money with Elixir and Livebook. Let's see what we're up against.&lt;/p&gt;

&lt;p&gt;Companies spend lots of money every year on their BI tools. You hear some of the numbers and it seems bananas. But it's because they &lt;em&gt;add&lt;/em&gt; a lot of value. Spotting trends in your data and customer behavior can make the difference between success and failure for a company.&lt;/p&gt;

&lt;p&gt;Most BI tools are commercial. But there's a handful of credible open source projects. &lt;a href="https://www.metabase.com/"&gt;Metabase&lt;/a&gt;, for example, is an open source BI tool that works quite well. It connects directly to an application's database and helps you do some spelunking, aggregation, and shiny graphing. You can even create and share custom dashboards. Think of it as Grafana, but for MBAs – it's a great tool.&lt;/p&gt;

&lt;p&gt;Deploying Metabase alongside your app might look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9vJCePKi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/eqn1qzsfh0ff7d8x93nx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9vJCePKi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/eqn1qzsfh0ff7d8x93nx.png" alt="Metabase app setup"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Metabase is an application that probably shouldn't be exposed publicly and it needs direct access to your database. It's also a bit of a mammoth – one doesn't just walk into Metabase and expect to get anything done, there's a real learning curve even when you know how to write SQL. It can be a heavy tool when you just want to do some quick poking around.&lt;/p&gt;

&lt;p&gt;It's also an app you need to keep running. We sell hosting, so we're generally OK with that. In fact, &lt;a href="https://www.metabase.com/docs/latest/operations-guide/running-metabase-on-docker.html"&gt;Metabase ships a Docker image&lt;/a&gt; and Fly lets you quickly &lt;a href="https://fly.io/docs/app-guides/run-a-global-image-service/#deploying-docker-images-to-fly"&gt;deploy apps using Docker&lt;/a&gt;. Money.&lt;/p&gt;

&lt;p&gt;This is fine when you want a dedicated data dashboard or you want to let non-developers see reports and graphs and be business-intelligent. However, when a project is young and you're a developer, digging with code is powerful. This is where Livebook can help!&lt;/p&gt;

&lt;h2&gt;
  
  
  Why is Livebook Better?
&lt;/h2&gt;

&lt;p&gt;Let's start with what Livebook is.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/elixir-nx/livebook"&gt;Livebook&lt;/a&gt; started out as Elixir's version of &lt;a href="https://jupyter.org/"&gt;Jupyter Notebooks&lt;/a&gt;. Jupyter is pretty great. But it turns out that code notebooks on Elixir are something special; they do something you usually can't pull off in Python. That's because Elixir has powerful built-in clustering, built on Erlang's BEAM/OTP runtime. Livebook notebooks can talk directly to running Elixir apps.  And so we can do analysis and visualization directly off the models in our applications.&lt;/p&gt;

&lt;p&gt;Livebook really sings on Fly.io. We make it easy to deploy clusters of Elixir applications that can talk privately between themselves. More importantly: it's easy to bring up a secure &lt;a href="https://www.wireguard.com/"&gt;WireGuard&lt;/a&gt; tunnel to those applications. So I can run Livebook locally on my machine, and attach it to any of my apps running on Fly.io!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qFIxdd-Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/df8dx3q7mbi8gujmfcut.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qFIxdd-Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/df8dx3q7mbi8gujmfcut.png" alt="Livebook remote connection"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For lean-and-mean startups, this is a win. You only need your app and your database running. Then, on an "as-needed" basis, you connect to your app with Livebook for analysis. Inside Livebook, analysis is done using your app's Elixir code. Livebook therefore doesn't need to connect directly to the database to run queries, and, even better, we get to re-use the business logic, associations, and schemas our apps already have.&lt;/p&gt;

&lt;p&gt;As a BI tool, Livebook notebooks  have these benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;They use your application's code, like a REPL does.&lt;/li&gt;
&lt;li&gt;They have the ability to generate charts, graphics, and tables.&lt;/li&gt;
&lt;li&gt;Notebooks are markdown files and can be checked-in with your project. They can be shared with the team! No more, "I can't run the numbers today because Bill is out and he has the scripts."&lt;/li&gt;
&lt;li&gt;They are self-documenting because it's just markdown.&lt;/li&gt;
&lt;li&gt;They are designed to be highly reproducible. They are easy to re-run when you want updated numbers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some of this you can pull off using just your REPL and some raw SQL queries. But why would you?   It's easier to use your project's code, database models (Ecto schemas for Elixir) and associations.&lt;/p&gt;

&lt;p&gt;Further,  your apps probably rely on external services like Stripe. Because Livebook talks directly to your Elixir code, you can query those external services with  that code, and then combine the results with your data. This is a combination that even dedicated tools like Metabase have a hard time beating.&lt;/p&gt;

&lt;h2&gt;
  
  
  Connecting to Your App on Fly
&lt;/h2&gt;

&lt;p&gt;Great! You have a notebook that loads and visualizes some data! To get the benefits, you need to connect it to your app running on Fly.io.&lt;/p&gt;

&lt;p&gt;Follow the Fly.io &lt;a href="https://fly.io/docs/app-guides/elixir-livebook-connection-to-your-app/"&gt;Elixir Guide for Connecting Livebook to your App in Production&lt;/a&gt; to connect Livebook to your app.&lt;/p&gt;

&lt;p&gt;With Livebook connected to your app, you can run your notebook and start gaining insight into your data!&lt;/p&gt;

&lt;h2&gt;
  
  
  Gaining Intelligence
&lt;/h2&gt;

&lt;p&gt;How you use Livebook depends on your application and your industry.&lt;/p&gt;

&lt;p&gt;Here are some ideas to get the brain juices flowing. Each of these example notebooks would be a "module" in a serious commercial BI suite, and you'd pay $45k for a license for it.&lt;/p&gt;

&lt;h3&gt;
  
  
  User Account Setup
&lt;/h3&gt;

&lt;p&gt;How many of your users have fully set up their accounts?&lt;/p&gt;

&lt;p&gt;Build and share a notebook that tracks where users are in your onboarding process.&lt;/p&gt;

&lt;h3&gt;
  
  
  Users Bouncing From The App
&lt;/h3&gt;

&lt;p&gt;Where are accounts stalling out in onboarding?&lt;/p&gt;

&lt;p&gt;Graph the stages accounts are at and let analysts drill into the onboarding funnel.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sales Analysis
&lt;/h3&gt;

&lt;p&gt;How were sales for your products or services last month?&lt;/p&gt;

&lt;p&gt;Graph a multi-series chart comparing sales across products.&lt;/p&gt;

&lt;p&gt;What about total sales per week?&lt;/p&gt;

&lt;p&gt;Build a notebook with a date input making it easy to switch the week being charted.&lt;/p&gt;

&lt;h3&gt;
  
  
  Integrations
&lt;/h3&gt;

&lt;p&gt;What external financial systems are you integrated with?&lt;/p&gt;

&lt;p&gt;Notebooks can execute your code to query those services and visualize refund rates, processing fees, and more.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;What's great about the Livebook approach is you are writing working Elixir code. When you are ready to build an Admin Dashboard page in your app, you've already done the hard work of figuring out what data is valuable and even the code for how to get it!&lt;/p&gt;

</description>
      <category>elixir</category>
      <category>livebook</category>
      <category>startup</category>
    </item>
    <item>
      <title>Monitoring Elixir Apps on Fly.io With Prometheus and PromEx</title>
      <dc:creator>Mark Ericksen</dc:creator>
      <pubDate>Tue, 29 Jun 2021 20:06:15 +0000</pubDate>
      <link>https://dev.to/flyio/monitoring-elixir-apps-on-fly-io-with-prometheus-and-promex-5gn6</link>
      <guid>https://dev.to/flyio/monitoring-elixir-apps-on-fly-io-with-prometheus-and-promex-5gn6</guid>
      <description>&lt;p&gt;Fly.io takes Docker containers and converts them into fleets of Firecracker micro-vms running in racks around the world. If you have a working Docker container, you can run it close to your users, whether they're in Singapore or Amsterdam, with just a couple of commands. Fly.io is particularly nice for Elixir applications, because Elixir's first-class support for distributed computing meshes perfectly with Fly.io's first-class support for clusters of applications.&lt;/p&gt;

&lt;p&gt;This post is about another cool Fly.io feature --- built-in Prometheus metrics --- and how easy it is to take advantage of them in an Elixir application. I wrote and maintain &lt;a href="https://github.com/akoutmos/prom_ex"&gt;an Elixir library, PromEx&lt;/a&gt;, that makes it a snap to export all sorts of metrics from your Elixir applications and get them on dashboards in Grafana. Let's explore some of the concepts surrounding Prometheus and see how we can leverage the Fly.io monitoring&lt;br&gt;
tools in an Elixir application to get slick looking dashboards like this one:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MmAL6tG_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/89aey5grf5lmm23gjl00.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MmAL6tG_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/89aey5grf5lmm23gjl00.png" alt="Phoenix Dashboard"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Why Application Monitoring is Important
&lt;/h2&gt;

&lt;p&gt;When customers are paying for your application or service, they expect it to work every time they reach for it. When things break or errors occur, your customers will not be happy. If you are lucky, your customers send you an email letting you know that things are not working as expected. Unfortunately, many of these occurrences go unreported.&lt;/p&gt;

&lt;p&gt;Knowing exactly when things are going wrong is key to keeping your customers happy. This is the problem that monitoring tools solve. They keep an eye on your application, and let you know exactly when things are behaving suboptimally.&lt;/p&gt;

&lt;p&gt;Imagine for example that you have an HTTP JSON API. You deploy a new version that changes a bunch of endpoints. Assume it's  infeasible to go through every single route of your application every time you deploy,  or to test each endpoint individually with every permutation of input data. That would take far too much time, and it doesn't scale from an organizational perspective: it would keep engineers constantly context switching between feature work and testing new deployments. &lt;/p&gt;

&lt;p&gt;A more scalable solution: briefly smoke test the application after a deployment (as a sanity check), and then use monitoring tooling to pick up on and report any errors. If your monitoring solution reports that your HTTP JSON API is now responding with 400 or 500 errors, you know you have a problem and you can either roll back the application or stop it from propagating across the cluster. The key point is that you can proactively address issues as opposed to being blind to them, and at the same time you can avoid sinking precious engineer time into testing all the things.&lt;/p&gt;

&lt;p&gt;While ensuring that production users are not experiencing issues is a huge benefit of application monitoring, there are lots of other benefits. They include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Quantifying stress testing results&lt;/li&gt;
&lt;li&gt;Business priority planning based on real usage data&lt;/li&gt;
&lt;li&gt;System performance and capacity planning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's dig into how Prometheus achieves these goals at the technical level.&lt;/p&gt;
&lt;h2&gt;
  
  
  How Does Prometheus Work?
&lt;/h2&gt;

&lt;p&gt;At its core, Prometheus is a time-series database that enables you to persist metrics in an efficient and performant manner. Once your metrics are in the Prometheus time-series database, you can create alerting rules in Grafana. Those alerts can then be triggered once certain thresholds and criteria are met, letting you know that something has gone wrong.&lt;/p&gt;

&lt;p&gt;"But how exactly do my application metrics end up in Prometheus?" Well, your Prometheus instance is configured to scrape all of your configured applications. At a regular interval, each of their instances is &lt;a href="https://fly.io/blog/measuring-fly/"&gt;queried for metrics data, which is stored in a database&lt;/a&gt;. Specifically, it makes a GET HTTP call to &lt;code&gt;/metrics&lt;/code&gt; (or wherever your metrics are exposed) and that endpoint will contain a snapshot in time of the state of your application. Once your metrics are in Prometheus, you can query the time-series database with Grafana to plot the data over time; Grafana uses PromQL to refresh data and update its panels.&lt;/p&gt;

&lt;p&gt;Given that Prometheus scrapes your applications at a regular interval, the resolution of your time-series data is bound to that interval. In other words, if you get 1,000 requests in&lt;br&gt;
the span of 10 seconds, you don't know exactly at what timestamps those 1,000 requests came in, you just know that you got 1,000 requests in a 10 second time window. While this may seem limiting, it is actually a benefit in disguise. Since Prometheus doesn't need to keep track of every single timestamp, it is able to store all the time-series data very efficiently.&lt;/p&gt;
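&lt;p&gt;To make the resolution point concrete, here is a small illustrative sketch (plain Python, not Prometheus internals, and the function name is made up) of what interval-based sampling gives you: a counter delta per scrape window, never individual request timestamps.&lt;/p&gt;

```python
def rate_per_second(prev_sample, curr_sample, scrape_interval_s):
    """Prometheus-style rate from two successive counter scrapes.
    All you can recover is the delta over the window, so the rate is
    an average across the whole scrape interval."""
    delta = curr_sample - prev_sample
    return delta / scrape_interval_s

# 1,000 requests arrived somewhere inside a 10-second scrape window:
print(rate_per_second(5000, 6000, 10))  # 100.0 requests/s on average
```

Whether those 1,000 requests arrived in a single burst or evenly spread out is invisible at this resolution, which is exactly the trade-off that lets Prometheus store time series so compactly.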

&lt;p&gt;Luckily with Fly.io, the administration and management of Prometheus can be taken care of for you!&lt;/p&gt;
&lt;h2&gt;
  
  
  Turning On Prometheus On Fly
&lt;/h2&gt;

&lt;p&gt;Managing, configuring and administering your own Prometheus instance can be a bit of a tall order if you have never worked with Prometheus before. Fortunately, all you need to do to enable Prometheus metrics for your application is add a couple of lines to your &lt;code&gt;fly.toml&lt;/code&gt; manifest file. All Fly.io needs to know is what port and path your metrics will be available at. For the &lt;a href="https://github.com/fly-apps/elixir_prom_ex_example/tree/master/todo_list"&gt;TODO List Elixir application&lt;/a&gt; for example, the following configuration was all that was needed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[metrics]&lt;/span&gt;
&lt;span class="py"&gt;port&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4000&lt;/span&gt;
&lt;span class="py"&gt;path&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"/metrics"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In order to visualize your Prometheus metrics, you'll need to have an instance of Grafana running somewhere. You could deploy your own Grafana instance on Fly.io by &lt;a href="https://github.com/fly-apps/grafana"&gt;following this guide&lt;/a&gt;, but you can also use &lt;a href="https://grafana.com/products/cloud/"&gt;Grafana Cloud&lt;/a&gt; (it has a free plan) --- Grafana Cloud works fine with Fly. Whichever route you take, all you then need to do is configure Grafana to &lt;a href="https://fly.io/docs/reference/metrics/#grafana"&gt;communicate with the Fly.io managed Prometheus instance&lt;/a&gt; and you are good to go!&lt;/p&gt;

&lt;p&gt;Now that we've got Prometheus hooked up, we need to get our Elixir application to start providing metrics.&lt;/p&gt;

&lt;h2&gt;
  
  
  Monitoring Elixir with PromEx
&lt;/h2&gt;

&lt;p&gt;Whenever I write a production-grade Elixir application that needs monitoring, I reach for &lt;a href="https://github.com/akoutmos/prom_ex"&gt;PromEx&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I wrote PromEx and maintain it because I wanted something that made it easy to manage both the collection of metrics and the lifecycle of a bunch of Grafana dashboards. That's to say: PromEx doesn't just export Prometheus metrics; it also provides you with dashboards you can import into Grafana to immediately get value out of those metrics. I think this is a pretty ambitious goal and I'm happy with how it turned out. Let's dig in.&lt;/p&gt;

&lt;p&gt;At a library design level, PromEx is a plugin-style library, where you enable a plugin for whatever library you want to monitor. For example, PromEx has plugins to capture metrics for Phoenix, Ecto, the Erlang VM itself, Phoenix LiveView, and several more. Each of these plugins also has a dashboard to present all the captured metrics for you. In addition, PromEx can communicate with Grafana using the Grafana HTTP API, so it will upload the dashboards automatically for you on application start (if you configure it to, that is). What this means is that you can go from zero to complete application metrics and dashboards in less than 10 minutes!&lt;/p&gt;

&lt;p&gt;In the &lt;a href="https://github.com/fly-apps/elixir_prom_ex_example/tree/master/todo_list"&gt;Elixir example application&lt;/a&gt;, you can see that the PromEx module definition specifies what plugins PromEx should initialize, and what dashboards should be uploaded to Grafana:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight elixir"&gt;&lt;code&gt;&lt;span class="k"&gt;defmodule&lt;/span&gt; &lt;span class="no"&gt;TodoList&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="no"&gt;PromEx&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
  &lt;span class="kn"&gt;use&lt;/span&gt; &lt;span class="no"&gt;PromEx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;otp_app:&lt;/span&gt; &lt;span class="ss"&gt;:todo_list&lt;/span&gt;

  &lt;span class="n"&gt;alias&lt;/span&gt; &lt;span class="no"&gt;PromEx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="no"&gt;Plugins&lt;/span&gt;

  &lt;span class="nv"&gt;@impl&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="n"&gt;plugins&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="c1"&gt;# PromEx built in plugins&lt;/span&gt;
      &lt;span class="no"&gt;Plugins&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="no"&gt;Application&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="no"&gt;Plugins&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="no"&gt;Beam&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="no"&gt;Plugins&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="no"&gt;Phoenix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;router:&lt;/span&gt; &lt;span class="no"&gt;TodoListWeb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="no"&gt;Router&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="no"&gt;Plugins&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="no"&gt;PhoenixLiveView&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;

  &lt;span class="nv"&gt;@impl&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="n"&gt;dashboard_assigns&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="ss"&gt;datasource_id:&lt;/span&gt; &lt;span class="s2"&gt;"prometheus"&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;

  &lt;span class="nv"&gt;@impl&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="n"&gt;dashboards&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="c1"&gt;# PromEx built in Grafana dashboards&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="ss"&gt;:prom_ex&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"application.json"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="ss"&gt;:prom_ex&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"beam.json"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="ss"&gt;:prom_ex&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"phoenix.json"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="ss"&gt;:prom_ex&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"phoenix_live_view.json"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With a little bit of configuration in &lt;code&gt;runtime.exs&lt;/code&gt;, PromEx can communicate with Grafana to take care of the graph annotations and dashboard uploads:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight elixir"&gt;&lt;code&gt;&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="ss"&gt;:todo_list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;TodoList&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="no"&gt;PromEx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="ss"&gt;manual_metrics_start_delay:&lt;/span&gt; &lt;span class="ss"&gt;:no_delay&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="ss"&gt;grafana:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="ss"&gt;host:&lt;/span&gt; &lt;span class="no"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_env&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"GRAFANA_HOST"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="k"&gt;raise&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"GRAFANA_HOST is required"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="ss"&gt;auth_token:&lt;/span&gt; &lt;span class="no"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_env&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"GRAFANA_TOKEN"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="k"&gt;raise&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"GRAFANA_TOKEN is required"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="ss"&gt;upload_dashboards_on_start:&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="ss"&gt;folder_name:&lt;/span&gt; &lt;span class="s2"&gt;"Todo App Dashboards"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="ss"&gt;annotate_app_lifecycle:&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
  &lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the managed Prometheus instance from Fly.io, and the metrics collection from PromEx, you have an instrumented application in record time! Here are some snapshots from the auto generated dashboards for the &lt;a href="https://github.com/fly-apps/elixir_prom_ex_example/tree/master/todo_list"&gt;Todo List application&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8YBkv94i--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0wn93dmlk2qled56zcev.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8YBkv94i--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0wn93dmlk2qled56zcev.png" alt="Erlang VM Dashboard"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7z8t1RAk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1yr51wqs389uwdunl2l8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7z8t1RAk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1yr51wqs389uwdunl2l8.png" alt="Phoenix LiveView Dashboard"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  And That's It!
&lt;/h2&gt;

&lt;p&gt;Elixir makes it easy to run ambitious, modern applications that take advantage of distributed computing. It should be just as easy to see what those applications are actually doing, and to have alerts go off when they misbehave. Between Fly.io's built-in Prometheus and the PromEx library, it's easy to get this kind of visibility. Your application can be instrumented with dashboards and annotations in a coffee break's worth of time.&lt;/p&gt;

&lt;p&gt;Be sure to check out the &lt;a href="https://github.com/fly-apps/elixir_prom_ex_example/tree/master/todo_list"&gt;Todo List application repo&lt;/a&gt; for more technical details and all the code necessary to do this yourself. What used to take a few days to set up and run now takes only a few hours, so be sure to give it a test drive!&lt;/p&gt;

</description>
      <category>elixir</category>
      <category>metrics</category>
    </item>
    <item>
      <title>Build a CDN in about 5 hours</title>
      <dc:creator>Kurt Mackey</dc:creator>
      <pubDate>Wed, 16 Jun 2021 15:47:43 +0000</pubDate>
      <link>https://dev.to/flyio/build-a-cdn-in-about-5-hours-pkm</link>
      <guid>https://dev.to/flyio/build-a-cdn-in-about-5-hours-pkm</guid>
      <description>&lt;p&gt;The term "CDN" ("content delivery network") conjures Google-scale companies managing huge racks of hardware, wrangling hundreds of gigabits per second. But CDNs are just web applications. That's not how we tend to think of them, but that's all they are. You can build a functional CDN on an 8-year-old laptop while you're sitting at a coffee shop. I'm going to talk about &lt;a href="https://github.com/fly-apps/nginx-cluster" rel="noopener noreferrer"&gt;what you might come up with&lt;/a&gt; if you spend the next five hours building a CDN.&lt;/p&gt;

&lt;p&gt;It's useful to define exactly what a CDN does. A CDN hoovers up files from a central repository (called an &lt;code&gt;origin&lt;/code&gt;) and stores copies close to users. Back in the dark ages, the origin was a CDN's FTP server. These days, origins are just web apps and the CDN functions as a proxy server. So that's what we're building: a distributed caching proxy.&lt;/p&gt;
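&lt;p&gt;To give a feel for what "a distributed caching proxy" boils down to on a single node, here is a minimal NGINX sketch (not taken from the linked repo; the cache zone name, sizes, and origin URL are all placeholders):&lt;/p&gt;

```nginx
# declare an on-disk cache: 10 MB of cache keys, up to 10 GB of content,
# evicting anything untouched for 24 hours
proxy_cache_path /var/cache/nginx keys_zone=cdn:10m max_size=10g inactive=24h;

server {
  listen 80;

  location / {
    proxy_cache cdn;                        # use the cache zone declared above
    proxy_cache_valid 200 301 10m;          # cache happy responses for 10 minutes
    proxy_pass https://origin.example.com;  # hoover files up from the origin
  }
}
```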

&lt;h2&gt;
  
  
  Caching proxies
&lt;/h2&gt;

&lt;p&gt;HTTP defines a whole infrastructure of &lt;a href="https://portswigger.net/research/practical-web-cache-poisoning" rel="noopener noreferrer"&gt;intricate and fussy caching features&lt;/a&gt;. It's all very intimidating and complex. So we're going to resist the urge to build from scratch and use the work other people have done for us.&lt;/p&gt;

&lt;p&gt;We have choices. We could use &lt;a href="https://varnish-cache.org/" rel="noopener noreferrer"&gt;Varnish&lt;/a&gt; (scripting! edge side includes! PHK blog posts!). We could use &lt;a href="https://trafficserver.apache.org/" rel="noopener noreferrer"&gt;Apache Traffic Server&lt;/a&gt; (being the only new team this year to use ATS!). Or we could use &lt;a href="https://www.nginx.com/" rel="noopener noreferrer"&gt;NGINX&lt;/a&gt; (we're already running it!). The only certainty is that you'll come to hate whichever one you pick. Try them all and pick the one you hate the least.&lt;/p&gt;

&lt;p&gt;(We kid! Netlify is built on ATS. Cloudflare uses NGINX. Fastly uses Varnish.)&lt;/p&gt;

&lt;p&gt;What we're talking about building is not basic. But it's not so bad. All we have to do is take our antique Rails setup and run it in multiple cities. If we can figure out how to get people in Australia to our server in Sydney and people in Chile to our server in Santiago, we'll have something we could reasonably call a CDN.&lt;/p&gt;

&lt;h2&gt;
  
  
  Traffic direction
&lt;/h2&gt;

&lt;p&gt;Routing people to nearby servers is a solved problem. You basically have three choices:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Anycast: acquire routable address blocks, advertise them in multiple places with BGP4, and then pretend that you have opinions about "communities" and "route reflectors" on Twitter. Let the Internet do the routing for you. Downside: it's harder to do, and the Internet is sometimes garbage. Upside: you might become insufferable.&lt;/li&gt;
&lt;li&gt;DNS: Run trick DNS servers that return specific server addresses based on IP geolocation. Downside: the Internet is moving away from geolocatable DNS source addresses. Upside: you can deploy it anywhere without help.&lt;/li&gt;
&lt;li&gt;Be like a game server: Ping a bunch of servers and use the best. Downside: gotta own the client. Upside: doesn't matter, because you don't own the client.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You're probably going to use a little of (1) and a little of (2). DNS load balancing is pretty simple. You don't really even have to build it yourself; you can host DNS on companies like DNSimple, and then define rules for returning addresses. Off you go!&lt;/p&gt;

&lt;p&gt;Anycast is more difficult. We have more to say about this — but not here. In the meantime, you can use &lt;a href="https://fly.io" rel="noopener noreferrer"&gt;us&lt;/a&gt;, and deploy an app with an Anycast address in about 2 minutes. This is bias. But also: true.&lt;/p&gt;

&lt;p&gt;Boom, CDN. Put an NGINX in each of a bunch of cities, run DNS or Anycast for traffic direction, and you're 90% done. The remaining 10% will take you months.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Internet is breaking
&lt;/h2&gt;

&lt;p&gt;The briny deeps are filled with undersea cables, crying out constantly to nearby ships: &lt;a href="https://www.theguardian.com/business/2008/feb/01/internationalpersonalfinancebusiness.internet" rel="noopener noreferrer"&gt;"drive through me"&lt;/a&gt;! Land isn't much better, as the old networkers shanty goes: "backhoe, backhoe, digging deep — make the backbone go to sleep". When you run a server in a single location, you don't so much notice this. Run two servers and you'll start to notice. Run servers around the world and you'll notice it to death.&lt;/p&gt;

&lt;p&gt;What's cool is: running a single NGINX in multiple cities gives you a lot of ready-to-use redundancy. If one of them dies for some reason, there are a bunch more to send traffic to. When one of your servers goes offline, the rest are still there serving most of your users.&lt;/p&gt;

&lt;p&gt;It's tedious but straightforward to make this work. You have health checks (aside: when CDN regions break, they usually break by being slow, so you'd hope your health checks catch that too). They tell you when your NGINX servers fail. You script DNS changes or withdraw BGP routes (perhaps just by stopping your BGP4 service on those regions) in response.&lt;/p&gt;

&lt;p&gt;That's server failure, and it's easy to spot. Internet burps are harder to detect. You'll need to run external health checks, from multiple locations. It's easy to get basic, multi-perspective monitoring – we use &lt;a href="https://www.datadoghq.com/product/synthetic-monitoring/" rel="noopener noreferrer"&gt;Datadog&lt;/a&gt; and &lt;a href="https://updown.io/" rel="noopener noreferrer"&gt;updown.io&lt;/a&gt;, and we're building out our own &lt;a href="https://fly.io/blog/building-clusters-with-serf/" rel="noopener noreferrer"&gt;home-grown service&lt;/a&gt;. You're not asking for much more than what &lt;code&gt;cURL&lt;/code&gt; will tell you. Again: the thing you're super wary about in a CDN is a region getting slow, not falling off the Internet completely. &lt;/p&gt;

&lt;p&gt;Quick aside: notice that all those monitoring options work from someone else's data center to your data center. DC-DC traffic is a good start, enough for a lot of jobs. But it isn't representative. Your users aren't in data centers (I hope). When you're really popular, what you want is monitoring from the vantage point of actual clients. For this, you can find hundreds of companies selling RUM (real user monitoring), which usually takes the form of surreptitiously embedded Javascript bugs. There's one rum we like. It's sold by a company called Plantation and it's aged in wine casks. Drink a bunch of it, and then do your own instrumentation with &lt;a href="https://www.honeycomb.io/" rel="noopener noreferrer"&gt;Honeycomb&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Ridiculous Internet problems are the worst. But the good news about them is, everyone is making up the solutions as they go along, so we don't have to talk about them so much. Caching is more interesting. So let's talk about onions. &lt;/p&gt;

&lt;h2&gt;
  
  
  The Golden Cache Hit Ratio
&lt;/h2&gt;

&lt;p&gt;The figure of merit in cache measurement is "cache ratio". Cache ratio measures how often we're able to serve from our cache, versus the origin.&lt;/p&gt;

&lt;p&gt;A cache ratio of 80% just means "when we get a request, we can serve it from cache 80% of the time, and the remaining 20% of the time we have to proxy the request to the origin". If you're building something that wants a CDN, high cache ratios are good, and low cache ratios are bad.&lt;/p&gt;
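&lt;p&gt;If your NGINX log format includes &lt;code&gt;$upstream_cache_status&lt;/code&gt;, you can eyeball your cache ratio with a one-liner. A toy sketch (the log lines here are made up, with the cache status in the third field):&lt;/p&gt;

```shell
# count HITs as a fraction of all requests in a (fake) access log
printf 'GET /a.jpg HIT\nGET /b.jpg MISS\nGET /a.jpg HIT\nGET /c.jpg HIT\nGET /b.jpg EXPIRED\n' |
  awk '{ total++; if ($3 == "HIT") hits++ } END { printf "cache ratio: %.0f%%\n", 100 * hits / total }'
# prints: cache ratio: 60%
```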

&lt;p&gt;If you followed the link earlier in the post to the Github repository, you might've noticed that our &lt;a href="https://github.com/fly-apps/nginx/" rel="noopener noreferrer"&gt;naïve NGINX setup&lt;/a&gt; is an isolated single server. Deploying it in twenty places gives us twenty individual servers. It's dead simple. But the simplicity has a cost – there's no per-region redundancy. All twenty servers will need to make requests to the origin. This is brittle, and cache ratios will suffer. We can do better.&lt;/p&gt;

&lt;p&gt;The simple way to increase redundancy is to add a second server in each region. But doing that might wreck cache ratios. The single server has the benefit of hosting a single cache for all users; with two, you've got twice the number of requests per origin, and twice the number of cache misses.&lt;/p&gt;

&lt;p&gt;What you want to do is teach your servers to talk to each other, and make them ask their friends for cache content. The simplest way to do this is to create cache shards – split the data up so each server is responsible for a chunk of it, and everyone else routes requests to the cache shard that owns the right chunk.&lt;/p&gt;

&lt;p&gt;That sounds complicated, but NGINX's built-in load balancer supports hash-based load balancing. It hashes requests, and forwards the "same request" to the same server, assuming that server is available. If you're playing the home version of this blog post, here's a &lt;a href="https://github.com/fly-apps/nginx-cluster" rel="noopener noreferrer"&gt;ready to go example&lt;/a&gt; of an NGINX cluster that discovers its peers, hashes the URL, and serves requests through available servers.&lt;/p&gt;
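&lt;p&gt;The relevant NGINX bits are small. A sketch, assuming peers reachable at made-up internal hostnames:&lt;/p&gt;

```nginx
upstream cache_shards {
  hash $request_uri consistent;  # same URL always maps to the same shard
  server cdn-1.internal:8080;    # placeholder peer addresses
  server cdn-2.internal:8080;
  server cdn-3.internal:8080;
}

server {
  listen 80;
  location / {
    proxy_pass http://cache_shards;  # route to whichever shard owns this URL
  }
}
```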

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffly.io%2Fblog%2F2021-03-16%2Fconsistent-hashing.png%3F2%2F3%26centered" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffly.io%2Fblog%2F2021-03-16%2Fconsistent-hashing.png%3F2%2F3%26centered" alt="Consistent hashing diagram"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When requests for &lt;code&gt;a.jpg&lt;/code&gt; hit our NGINX instances, they will all forward the request to the same server in the cluster. Same for &lt;code&gt;b.jpg&lt;/code&gt;. This setup has servers serve as both the load balancing proxy and the storage shard. You can separate these layers, and you might want to if you're building more advanced features into your CDN.&lt;/p&gt;

&lt;blockquote&gt;
&lt;h2&gt;A small, financially motivated aside&lt;/h2&gt;

&lt;p&gt;Our clustered NGINX example uses Fly features we think are really cool. &lt;a href="https://fly.io/blog/persistent-storage-and-fast-remote-builds/" rel="noopener noreferrer"&gt;Persistent volumes&lt;/a&gt; help keep cache ratios high between NGINX upgrades. &lt;a href="https://fly.io/blog/incoming-6pn-private-networks/" rel="noopener noreferrer"&gt;Encrypted private networking&lt;/a&gt; makes secure NGINX-to-NGINX communications simple and keeps you from having to do complicated &lt;a href="https://twitter.com/colmmacc/status/1057018254940504064?lang=en" rel="noopener noreferrer"&gt;mTLS gymnastics&lt;/a&gt;. Built-in DNS service discovery helps keep the clusters up to date when we add and remove servers. If it sounds a little too perfectly matched, it's because we built these features specifically for CDN-like workloads.&lt;/p&gt;

&lt;p&gt;But of course, you can do all this stuff anywhere, not just on Fly. But &lt;a href="https://fly.io/docs/speedrun/" rel="noopener noreferrer"&gt;it's easy on Fly&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Onions have layers
&lt;/h2&gt;

&lt;p&gt;Two truths: a high cache ratio is good, the Internet is bad. If you like killing birds and conserving stones, you'll really enjoy solving for cache ratios and garbage Internet. The answer to both of those problems involves getting the Internet's grubby hands off our HTTP requests. A simple way to increase cache ratios: bypass the out-of-control Internet and proxy origin requests through networks you trust to behave themselves.&lt;/p&gt;

&lt;p&gt;CDNs typically have servers in regions close to their customers' origins. If you put our NGINX example in Virginia, you suddenly have servers close to AWS's largest region. And you definitely have customers on AWS. That's the advantage of existing alongside a giant powerful monopoly!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffly.io%2Fblog%2F2021-03-16%2Fproxy-region-to-origin.png%3F2%2F3%26centered" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffly.io%2Fblog%2F2021-03-16%2Fproxy-region-to-origin.png%3F2%2F3%26centered" alt="Proxy through region diagram"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can, with a little NGINX and proxy magic, send &lt;em&gt;all&lt;/em&gt; requests through Virginia on their way to the origin servers. This is good. There are fewer Internet bear traps between your servers in Virginia and your customers' servers in &lt;code&gt;us-east-1&lt;/code&gt;. And now you have a single, canonical set of servers to handle a specific customer's requests.&lt;/p&gt;

&lt;p&gt;Good news. This setup improves your cache ratio AND avoids bad Internet. For bonus points, it's also the foundation for extra CDN features.&lt;/p&gt;

&lt;p&gt;If you've ever gone CDN shopping, you've come across things like "Shielding" and "Request Coalescing".  Origin shielding typically just means sending all traffic through a known data center. This can minimize traffic to origin servers, and &lt;em&gt;also&lt;/em&gt;, because you probably know the IPs your CDN regions use, you can control access with simple L4 firewall rules.&lt;/p&gt;

&lt;p&gt;Coalescing requests also minimizes origin traffic, especially during big events when many users are trying to get at the same content. When 100,000 users request your latest cleverly written blog post at once, and it's not yet cached, that &lt;em&gt;could&lt;/em&gt; end up meaning 100k concurrent requests to your origin. That's a face melting level of traffic for most origins. Solving this is a matter of "locking" a specific URL to ensure that if an NGINX server is making an origin request, the other clients pause until the cache is filled. In our clustered NGINX example, this is a &lt;a href="https://github.com/fly-apps/nginx-cluster/blob/main/nginx.conf#L114-L115" rel="noopener noreferrer"&gt;two line configuration&lt;/a&gt;.&lt;/p&gt;
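&lt;p&gt;In stock NGINX, that lock is the &lt;code&gt;proxy_cache_lock&lt;/code&gt; directive (the timeout value here is just an example):&lt;/p&gt;

```nginx
proxy_cache_lock on;          # only one request per URL populates the cache
proxy_cache_lock_timeout 5s;  # waiters stop waiting and hit the origin after 5s
```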

&lt;h2&gt;
  
  
  Oh no, slow
&lt;/h2&gt;

&lt;p&gt;Proxying through a single region to increase cache ratios is a little bit of a cheat. The entire purpose of a CDN is to speed things up for users. Sending requests from Singapore to Virginia will make things &lt;em&gt;barely&lt;/em&gt; faster, because a set of NGINX servers with cached content is almost always faster than origin services. But, really, it's slow and undesirable.&lt;/p&gt;

&lt;p&gt;You can solve this with more onion layers:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffly.io%2Fblog%2F2021-03-16%2Fproxy-region-to-origin-2.png%3F2%2F3%26centered" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffly.io%2Fblog%2F2021-03-16%2Fproxy-region-to-origin-2.png%3F2%2F3%26centered" alt="Proxy through more regions diagram"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Requests in Australia could run through Singapore on the way to Virginia. Even light is slow over 14,624 kilometers (Australia to Virginia), so Australia to Singapore (4,300 kilometers) with a cache cuts a perceptible amount of latency. It will be a little slower on cache misses. But we're talking about the difference between "irritatingly slow" and "150ms worse than irritatingly slow".&lt;/p&gt;
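&lt;p&gt;The arithmetic is easy to check. Light in fiber covers roughly 200,000 km/s, so a back-of-the-envelope best-case round trip (ignoring routing, queueing, and the fact that cables don't run in straight lines) is:&lt;/p&gt;

```shell
# theoretical best-case round-trip time over fiber, in milliseconds
rtt_ms() { awk -v km="$1" 'BEGIN { printf "%.0f\n", 2 * km / 200000 * 1000 }'; }

rtt_ms 14624  # Australia to Virginia: prints 146
rtt_ms 4300   # Australia to Singapore: prints 43
```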

&lt;p&gt;If you are building a general purpose CDN, this is a nice way to do it. You can create a handful of super-regions that aggregate cache data for part of the world.&lt;/p&gt;

&lt;p&gt;If you're not building a general purpose CDN, and are instead just trying to speed up your application, this is a brittle solution. You are &lt;em&gt;probably&lt;/em&gt; better off distributing portions of your application to multiple regions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where are we now?
&lt;/h2&gt;

&lt;p&gt;The basic ideas of a CDN are old, and easy to understand. But building out a CDN has historically been an ambitious team enterprise, not a weekend project for a single developer.&lt;/p&gt;

&lt;p&gt;But the building blocks for a capable CDN have been in tools like NGINX for a long time. If you've been &lt;a href="https://github.com/fly-apps/nginx-cluster" rel="noopener noreferrer"&gt;playing along at home with the GitHub repo&lt;/a&gt;, we hope you've noticed that even the most complicated iteration of the design we're talking about, a design that has per-region redundancy and that allows for rudimentary control of request routing between regions, is mostly just NGINX configuration --- and not an especially complicated configuration. The "code" we've added is just enough &lt;code&gt;bash&lt;/code&gt; to plug in addresses.&lt;/p&gt;

&lt;p&gt;So that's a CDN. It'll work just great for simple caching. For complicated apps, it's only missing a few things.&lt;/p&gt;

&lt;p&gt;Notably, we didn't address cache expiration &lt;em&gt;at all&lt;/em&gt;. One ironclad rule of using a CDN is: you will absolutely put an embarrassing typo on a launch release, notice it too late, and discover that all your cache servers have a copy titled "A Better Amercia". Distributed cache invalidation is a big, hairy problem for a CDN. Someone could write a whole article about it.&lt;/p&gt;

&lt;p&gt;The CDN layer is also an exceptionally good place to add app features. Image optimization, WAF, API rate limiting, bot detection, we could go on. Someone could turn these into ten more articles.&lt;/p&gt;

&lt;p&gt;One last thing. Like we mentioned earlier: this whole article is bias. We're highlighting this CDN design because we built a platform that makes it very easy to express (you should play with it). Those same platform features that make it trivial to build a CDN on Fly also make it easy to distribute your whole application; an application designed for edge distribution may not need a CDN at all.&lt;/p&gt;

</description>
      <category>nginx</category>
      <category>cdn</category>
      <category>performance</category>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>Docker without Docker</title>
      <dc:creator>Fly.io</dc:creator>
      <pubDate>Wed, 02 Jun 2021 15:42:48 +0000</pubDate>
      <link>https://dev.to/flyio/docker-without-docker-590i</link>
      <guid>https://dev.to/flyio/docker-without-docker-590i</guid>
      <description>&lt;p&gt;We’re Fly.io. We take container images and run them on our hardware around the world.  It’s pretty neat, and you &lt;a href="https://fly.io/docs/speedrun/" rel="noopener noreferrer"&gt;should check it out&lt;/a&gt;; with an already-working Docker container, you can be up and running on Fly in well under 10 minutes.&lt;/p&gt;

&lt;p&gt;Even though most of our users deliver software to us as Docker containers, we don’t use Docker to run them. Docker is great, but we’re high-density multitenant, and despite strides, Docker’s isolation isn’t strong enough for that. So, instead, we transmogrify container images into &lt;a href="https://fly.io/blog/sandboxing-and-workload-isolation/" rel="noopener noreferrer"&gt;Firecracker micro-VMs&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Let's demystify that.&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s An OCI Image?
&lt;/h2&gt;

&lt;p&gt;They do their best to make it look a lot more complicated, but OCI images — OCI is &lt;a href="https://github.com/opencontainers/image-spec" rel="noopener noreferrer"&gt;the standardized container format used by Docker&lt;/a&gt; — are pretty simple.  An OCI image is just a stack of tarballs.&lt;/p&gt;

&lt;p&gt;Backing up: most people build images from Dockerfiles. A useful way to look at a Dockerfile is as a series of shell commands, each generating a tarball; we call these “layers”. To rehydrate a container from its image, we just start with the first layer and unpack each one on top of the next.&lt;/p&gt;

&lt;p&gt;You can write a shell script to pull a Docker container from its registry, and that might clarify things. Start with some configuration; by default, we’ll grab the base image for &lt;code&gt;golang&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;1&lt;/span&gt;&lt;span class="k"&gt;:-&lt;/span&gt;&lt;span class="nv"&gt;golang&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nv"&gt;registry_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'https://registry-1.docker.io'&lt;/span&gt;
&lt;span class="nv"&gt;auth_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'https://auth.docker.io'&lt;/span&gt;
&lt;span class="nv"&gt;svc_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'registry.docker.io'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We need to authenticate to pull public images from a Docker registry – this is boring but relevant to the next section – and that’s easy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;function &lt;/span&gt;auth_token &lt;span class="o"&gt;{&lt;/span&gt; 
  curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;auth_url&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/token?service=&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;svc_url&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;amp;scope=repository:library/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;image&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;:pull"&lt;/span&gt; | jq &lt;span class="nt"&gt;--raw-output&lt;/span&gt; .token
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That token will allow us to grab the “manifest” for the container, which is a JSON index of the parts of a container.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;function &lt;/span&gt;manifest &lt;span class="o"&gt;{&lt;/span&gt; 
  &lt;span class="nv"&gt;token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$1&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  &lt;span class="nv"&gt;image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$2&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  &lt;span class="nv"&gt;digest&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;3&lt;/span&gt;&lt;span class="k"&gt;:-&lt;/span&gt;&lt;span class="nv"&gt;latest&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

  curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer &lt;/span&gt;&lt;span class="nv"&gt;$token&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Accept: application/vnd.docker.distribution.manifest.list.v2+json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Accept: application/vnd.docker.distribution.manifest.v1+json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Accept: application/vnd.docker.distribution.manifest.v2+json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;registry_url&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/v2/library/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;image&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/manifests/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;digest&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first query we make returns the “manifest list”, which contains pointers to images for each supported architecture:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nl"&gt;"manifests"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"digest"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sha256:3fc96f3fc8a5566a07ac45759bad6381397f2f629bd9260ab0994ef0dc3b68ca"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"platform"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"architecture"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"amd64"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"os"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"linux"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
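&lt;p&gt;Picking the right digest out of that list is a one-liner with &lt;code&gt;jq&lt;/code&gt;. The pull loop further down calls a &lt;code&gt;linux_version&lt;/code&gt; helper that isn’t shown in these excerpts (it’s in the full gist); a minimal reconstruction might look like this:&lt;/p&gt;

```shell
# Sketch: select the linux/amd64 image digest from a manifest list.
# The function name matches the later call; the body is a reconstruction.
function linux_version {
  echo "$1" | jq --raw-output '
    .manifests[]
    | select(.platform.os == "linux" and .platform.architecture == "amd64")
    | .digest'
}
```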



&lt;p&gt;Pull the &lt;code&gt;digest&lt;/code&gt; out of the matching architecture entry and perform the same fetch again with it as an argument, and we get the manifest: JSON pointers to each of the layer tarballs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="nl"&gt;"config"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"digest"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sha256:0debfc3e0c9eb23d3fc83219afc614d85f0bc67cf21f2b3c0f21b24641e2bb06"&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"layers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
         &lt;/span&gt;&lt;span class="nl"&gt;"digest"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sha256:004f1eed87df3f75f5e2a1a649fa7edd7f713d1300532fd0909bb39cd48437d7"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It’s as easy to grab the actual data associated with these entries as you’d hope:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;function &lt;/span&gt;blob &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="nv"&gt;token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$1&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  &lt;span class="nv"&gt;image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$2&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  &lt;span class="nv"&gt;digest&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$3&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  &lt;span class="nv"&gt;file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$4&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

  curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$file&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer &lt;/span&gt;&lt;span class="nv"&gt;$token&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
        &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;registry_url&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/v2/library/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;image&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/blobs/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;digest&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And with those pieces in place, pulling an image is simply:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;function &lt;/span&gt;layers &lt;span class="o"&gt;{&lt;/span&gt; 
  &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$1&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | jq &lt;span class="nt"&gt;--raw-output&lt;/span&gt; &lt;span class="s1"&gt;'.layers[].digest'&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="nv"&gt;token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;auth_token &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$image&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;amd64&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;linux_version &lt;span class="si"&gt;$(&lt;/span&gt;manifest &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$token&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$image&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;))&lt;/span&gt;
&lt;span class="nv"&gt;mf&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;manifest &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$token&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$image&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$amd64&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="nv"&gt;i&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0
&lt;span class="k"&gt;for &lt;/span&gt;L &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;layers &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$mf&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;blob &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$token&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$image&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$L&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="s2"&gt;"layer_&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;i&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.tgz"&lt;/span&gt;
  &lt;span class="nv"&gt;i&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;$((&lt;/span&gt;i &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt; &lt;span class="k"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Unpack the tarballs in order and you’ve got the filesystem layout the container expects to run in. Pull the “config” JSON and you’ve got the entrypoint to run for the container; you could, I guess, pull and run a Docker container with nothing but a shell script, which I’m probably the &lt;a href="https://github.com/p8952/bocker" rel="noopener noreferrer"&gt;1,000th person to point out&lt;/a&gt;. At any rate &lt;a href="https://gist.github.com/tqbf/10006fae0b81d7c7c93513890ff0cf08" rel="noopener noreferrer"&gt;here’s the whole thing&lt;/a&gt;.&lt;/p&gt;
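&lt;p&gt;The unpacking step can be sketched in a few lines. One caveat: a real OCI unpacker also honors “whiteout” (&lt;code&gt;.wh.*&lt;/code&gt;) entries that delete files from lower layers, which this toy version ignores:&lt;/p&gt;

```shell
# Sketch: expand layer tarballs, lowest layer first, into one root filesystem.
# Later layers overwrite files from earlier ones; that's the layering model.
function unpack_layers {
  rootfs="$1"; shift
  mkdir -p "$rootfs"
  for layer in "$@"; do
    tar -xzf "$layer" -C "$rootfs"
  done
}
```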

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffly.io%2Fblog%2F2021-04-08%2Fgwg.png%3F2%2F3%26centered" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffly.io%2Fblog%2F2021-04-08%2Fgwg.png%3F2%2F3%26centered" alt="Vitally important system diagram"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You’re likely of one of two mindsets about this: (1) that it’s extremely Unixy and thus excellent, or (2) that it’s extremely Unixy and thus horrifying.&lt;/p&gt;

&lt;p&gt;Unix &lt;code&gt;tar&lt;/code&gt; is problematic. Summing up &lt;a href="https://www.cyphar.com/blog/post/20190121-ociv2-images-i-tar" rel="noopener noreferrer"&gt;Aleksa Sarai&lt;/a&gt;: &lt;code&gt;tar&lt;/code&gt; isn’t well standardized, can be unpredictable, and is bad at random access and incremental updates. Tiny changes to large files between layers pointlessly duplicate those files; the poor job &lt;code&gt;tar&lt;/code&gt; does managing container storage is part of why people burn so much time optimizing container image sizes.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Another fun detail is that OCI containers share a security footgun with git repositories: it’s easy to accidentally build a secret into a public container, and then inadvertently hide it with an update in a later image.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We’re of a third mindset regarding OCI images, which is that they are horrifying, and that’s liberating. They work pretty well in practice! Look how far they’ve taken us! Relax and make crappier designs; they’re all you probably need.&lt;/p&gt;

&lt;p&gt;Speaking of which:&lt;/p&gt;

&lt;h2&gt;
  
  
  Multi-Tenant Repositories
&lt;/h2&gt;

&lt;p&gt;Back to Fly.io. Our users need to give us OCI containers, so that we can unpack and run them. There’s standard Docker tooling to do that, and we use it: we host a &lt;a href="https://docs.docker.com/registry/spec/api/" rel="noopener noreferrer"&gt;Docker registry&lt;/a&gt; our users push to.&lt;/p&gt;

&lt;p&gt;Running an instance of the Docker registry is very easy. You can do it right now; &lt;code&gt;docker pull registry &amp;amp;&amp;amp; docker run registry&lt;/code&gt;. But our needs are a little more complicated than the standard Docker registry: we need multi-tenancy, and authorization that wraps around our API. This turns out not to be hard, and we can walk you through it.&lt;/p&gt;
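&lt;p&gt;Spelled out a bit more (this assumes a local Docker daemon; the port and repository names are just for illustration), that experiment looks like:&lt;/p&gt;

```shell
# Run a throwaway registry on localhost:5000, then push an image into it.
docker run -d -p 5000:5000 --name registry registry:2
docker tag golang localhost:5000/library/golang
docker push localhost:5000/library/golang
```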

&lt;p&gt;A thing to know off the bat: our users drive Fly.io with a command line utility called &lt;code&gt;flyctl&lt;/code&gt;. &lt;code&gt;flyctl&lt;/code&gt; is a Go program (with &lt;a href="https://github.com/superfly/flyctl" rel="noopener noreferrer"&gt;public source&lt;/a&gt;) that runs on Linux, macOS, and Windows. A nice thing about working in Go in a container environment is that the whole ecosystem is built in the same language, and you can get a lot of stuff working quickly just by importing it. So, for instance, we can drive our Docker repository clientside from &lt;code&gt;flyctl&lt;/code&gt; just by calling into Docker’s clientside library.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If you're building your own platform and you have the means, I highly recommend the CLI-first tack we took. It is so choice. &lt;code&gt;flyctl&lt;/code&gt; made it very easy to add new features, like &lt;a href="https://fly.io/docs/reference/postgres/" rel="noopener noreferrer"&gt;databases&lt;/a&gt;,&lt;br&gt;
&lt;a href="https://fly.io/blog/incoming-6pn-private-networks/" rel="noopener noreferrer"&gt;private networks&lt;/a&gt;, &lt;a href="https://fly.io/blog/persistent-storage-and-fast-remote-builds/" rel="noopener noreferrer"&gt;volumes&lt;/a&gt;, and our &lt;a href="https://fly.io/blog/ssh-and-user-mode-ip-wireguard/" rel="noopener noreferrer"&gt;bonkers SSH access system&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;On the serverside, we started out simple: we ran an instance of the standard Docker registry with an authorizing proxy in front of it. &lt;code&gt;flyctl&lt;/code&gt; manages a bearer token and uses the Docker APIs to initiate Docker pushes that pass that token; the token authorizes repositories serverside using calls into our API.&lt;/p&gt;

&lt;p&gt;What we do now isn’t much more complicated than that. Instead of running a vanilla Docker registry, we built a custom repository server. As with the client, we get a Docker registry implementation just by importing Docker’s registry code as a Go dependency.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://gist.github.com/tqbf/ebd504a625813e6b8c5913fc28cc9515" rel="noopener noreferrer"&gt;We’ve extracted and simplified some of the Go code we used to build this here&lt;/a&gt;, just in case anyone wants to play around with the same idea. This isn’t our production code (in particular, all the actual authentication is ripped out), but it’s not far from it, and as you can see, there’s not much to it.&lt;/p&gt;

&lt;p&gt;Our custom server isn’t architecturally that different from the vanilla registry/proxy system we had before. We wrap the Docker registry API handlers with authorizer middleware that checks tokens, references, and rewrites repository names. There are some very minor gotchas:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Docker is content-addressed, with blobs “named” for their SHA256 hashes, and attempts to reuse blobs shared between different repositories. You need to catch those cross-repository mounts and rewrite them.&lt;/li&gt;
&lt;li&gt;Docker’s registry code generates URLs with  &lt;code&gt;_state&lt;/code&gt; parameters that embed references to repositories; those need to get rewritten too. &lt;code&gt;_state&lt;/code&gt; is HMAC-tagged; our code just shares the HMAC key between the registry and the authorizer.&lt;/li&gt;
&lt;/ul&gt;
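&lt;p&gt;If HMAC-tagged parameters are unfamiliar: the registry signs the state it hands out, so it can trust that state when it comes back. The idea, with &lt;code&gt;openssl&lt;/code&gt; standing in and a made-up key and payload:&lt;/p&gt;

```shell
# Sketch of an HMAC-tagged parameter: tag the payload with a shared key.
# Anyone holding the key can recompute and verify the tag; nobody else can forge it.
key='shared-secret'                    # assumed key, for illustration
state='{"repo":"org-123/app"}'         # assumed payload
tag=$(printf '%s' "$state" | openssl dgst -sha256 -hmac "$key" -r | cut -d' ' -f1)
```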

&lt;p&gt;In both cases, the source of truth for who has which repositories and where is the database that backs our API server. Your push carries a bearer token that we resolve to an organization ID, and the name of the repository you’re pushing to, and, well, our design is what you’d probably come up with to make that work. I suppose my point here is that it’s pretty easy to slide into the Docker ecosystem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building And Running VMs
&lt;/h2&gt;

&lt;p&gt;The pieces are on the board:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We can accept containers from users&lt;/li&gt;
&lt;li&gt;We can store and manage containers for different organizations.&lt;/li&gt;
&lt;li&gt;We've got a VMM engine, Firecracker, that &lt;a href="https://fly.io/blog/sandboxing-and-workload-isolation" rel="noopener noreferrer"&gt;we've written about already&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What we need to do now is arrange those pieces so that we can run containers as Firecracker VMs.&lt;/p&gt;

&lt;p&gt;As far as we're concerned, a container image is just a stack of tarballs and a blob of configuration (we layer additional configuration in as well). The tarballs expand to a directory tree for the VM to run in, and the configuration tells us what binary in that filesystem to run when the VM starts.&lt;/p&gt;

&lt;p&gt;Meanwhile, what Firecracker wants is a set of block devices that Linux will mount as it boots up.&lt;/p&gt;

&lt;p&gt;There's an easy way on Linux to take a directory tree and turn it into a block device: create a file-backed &lt;a href="https://man7.org/linux/man-pages/man4/loop.4.html" rel="noopener noreferrer"&gt;loop device&lt;/a&gt;, and copy the directory tree into it. And that's how we used to do things. When our orchestrator asked to boot up a VM on one of our servers, we would:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pull the matching container from the registry.&lt;/li&gt;
&lt;li&gt;Create a loop device to store the container's filesystem on.&lt;/li&gt;
&lt;li&gt;Unpack the container (in this case, using Docker's Go libraries) into the mounted loop device.&lt;/li&gt;
&lt;li&gt;Create a second block device and inject our init, kernel, configuration, and other goop into it.&lt;/li&gt;
&lt;li&gt;Track down any &lt;a href="https://fly.io/blog/persistent-storage-and-fast-remote-builds/" rel="noopener noreferrer"&gt;persistent volumes&lt;/a&gt; attached to the application, unlock them with LUKS, and collect their unlocked block devices.&lt;/li&gt;
&lt;li&gt;Create a &lt;a href="https://en.wikipedia.org/wiki/TUN/TAP" rel="noopener noreferrer"&gt;TAP device&lt;/a&gt;, configure it for our network, and &lt;a href="https://fly.io/blog/bpf-xdp-packet-filters-and-udp/" rel="noopener noreferrer"&gt;attach BPF code to it&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Hand all this stuff off to Firecracker and tell it to boot.&lt;/li&gt;
&lt;/ol&gt;
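&lt;p&gt;Steps 2 and 3 are ordinary Linux plumbing; in illustrative shell (this needs root, and the sizes and paths are made up), they amount to:&lt;/p&gt;

```shell
# Make a file-backed block device and fill it with the container's rootfs.
dd if=/dev/zero of=fs.img bs=1M count=512   # backing file for the loop device
mkfs.ext4 fs.img                            # format it
mount -o loop fs.img /mnt/vmroot            # kernel allocates a loop device
tar -xzf layer_0.tgz -C /mnt/vmroot         # ...repeat for each layer, in order
umount /mnt/vmroot                          # fs.img is now a rootfs image
```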

&lt;p&gt;This is all a few thousand lines of Go.&lt;/p&gt;

&lt;p&gt;This system worked, but wasn't especially fast. Part of &lt;a href="https://www.usenix.org/system/files/nsdi20-paper-agache.pdf" rel="noopener noreferrer"&gt;the point of Firecracker&lt;/a&gt; is to boot so quickly that you (or AWS) can host Lambda functions in it and not just long-running programs. A big problem for us was caching; a server in, say, Dallas that's asked to run a VM for a customer is very likely to be asked to run more instances of that VM (Fly.io apps scale trivially; if you've got 1 of something running and would be happier with 10 of them, you just run &lt;code&gt;flyctl scale count 10&lt;/code&gt;). We did some caching to try to make this faster, but it was of dubious effectiveness.&lt;/p&gt;

&lt;p&gt;The system we'd been running was, as far as container filesystems are concerned, not a whole lot more sophisticated than the shell script at the top of this post. So Jerome replaced it.&lt;/p&gt;

&lt;p&gt;What we do now is run, on each of our servers, an instance of &lt;a href="https://containerd.io/" rel="noopener noreferrer"&gt;&lt;code&gt;containerd&lt;/code&gt;&lt;/a&gt;. &lt;code&gt;containerd&lt;/code&gt; does a whole bunch of stuff, but we use it as a cache.&lt;/p&gt;

&lt;p&gt;If you're a Unix person from the 1990s like I am, and you just recently started paying attention to how Linux storage works again, you've probably noticed that &lt;em&gt;a lot has changed&lt;/em&gt;. Sometime over the last 20 years, the block device layer in Linux got interesting. LVM2 can pool raw block devices and create synthetic block devices on top of them. It can treat block device sizes as an abstraction, chopping a 1TB block device into 1,000 5GB synthetic devices (so long as you don't actually use 5GB on all those devices!). And it can create snapshots, preserving the blocks on a device in another synthetic device, and sharing those blocks among related devices with copy-on-write semantics.&lt;/p&gt;
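&lt;p&gt;For the curious, the LVM2 features described above map to commands like these (they need root and a real volume group, so this is just a sketch with invented names):&lt;/p&gt;

```shell
# Thin provisioning: a pool whose volumes can promise more space than exists.
lvcreate --type thin-pool -L 900G -n pool vg0
lvcreate --thin -V 5G -n base vg0/pool        # a 5GB "virtual" block device
# Copy-on-write snapshot: shares blocks with vg0/base until either side writes.
lvcreate --snapshot -n instance vg0/base
```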

&lt;p&gt;&lt;code&gt;containerd&lt;/code&gt; knows how to drive all this LVM2 stuff, and while I guess it's out of fashion to use the &lt;code&gt;devmapper&lt;/code&gt; backend these days, it works beautifully for our purposes. So now, to get an image, we pull it from the registry into our server-local &lt;code&gt;containerd&lt;/code&gt;, configured to run on an LVM2 thin pool. &lt;code&gt;containerd&lt;/code&gt; manages snapshots for every instance of a VM/container that we run. Its API provides a simple "lease"-based garbage collection scheme; when we boot a VM, we take out a lease on a container snapshot (which synthesizes a new block device based on the image, which containerd unpacks for us); LVM2 COW means multiple containers don't step on each other. When a VM terminates, we surrender the lease, and containerd eventually GCs.&lt;/p&gt;

&lt;p&gt;The first deployment of a VM/container on one of our servers does some lifting, but subsequent deployments are lightning fast (the VM build-and-boot process on a second deployment is faster than the logging that we do). &lt;/p&gt;

&lt;h2&gt;
  
  
  Some Words About Init
&lt;/h2&gt;

&lt;p&gt;Jerome wrote our &lt;code&gt;init&lt;/code&gt; in Rust, and, after being cajoled by Josh Triplett, we released the code (&lt;a href="https://github.com/superfly/init-snapshot" rel="noopener noreferrer"&gt;https://github.com/superfly/init-snapshot&lt;/a&gt;), which you can go read.&lt;/p&gt;

&lt;p&gt;The filesystem that Firecracker is mounting on the snapshot checkout we create is pretty raw. The first job our &lt;code&gt;init&lt;/code&gt; has is to fill in the blanks to fully populate the root filesystem with the mounts that Linux needs to run normal programs. &lt;/p&gt;
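&lt;p&gt;“Filling in the blanks” means the pseudo-filesystems every Linux userland expects. In shell terms (our &lt;code&gt;init&lt;/code&gt; does the equivalent in Rust), that's roughly:&lt;/p&gt;

```shell
# The mounts an init typically sets up before running normal programs.
mount -t devtmpfs dev  /dev     # device nodes
mount -t proc     proc /proc    # process info
mount -t sysfs    sys  /sys     # kernel and device attributes
mount -t tmpfs    tmp  /tmp     # scratch space
```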

&lt;p&gt;We inject a configuration file into each VM that carries the user, network, and entrypoint information needed to run the image. &lt;code&gt;init&lt;/code&gt; reads that and configures the system. We use our own DNS server for private networking, so &lt;code&gt;init&lt;/code&gt; overrides &lt;code&gt;resolv.conf&lt;/code&gt;. We run a tiny SSH server for user logins over WireGuard; &lt;code&gt;init&lt;/code&gt; spawns and monitors that process. We spawn and monitor the entry point program. That’s it; that’s an init.&lt;/p&gt;

&lt;h2&gt;
  
  
  Putting It All Together
&lt;/h2&gt;

&lt;p&gt;So, that's about half the idea behind Fly.io. We run server hardware in racks around the world; those servers are tied together with an orchestration system that plugs into our API. Our CLI, &lt;code&gt;flyctl&lt;/code&gt;, uses Docker's tooling to push OCI images to us. Our orchestration system sends messages to servers to convert those OCI images to VMs. It's all pretty neato, but I hope also kind of easy to get your head wrapped around.&lt;/p&gt;

&lt;p&gt;The other "half" of Fly is our Anycast network, which is a CDN built in Rust that uses BGP4 Anycast routing to direct traffic to the nearest instance of your application. About which: more later.&lt;/p&gt;

</description>
      <category>docker</category>
      <category>go</category>
      <category>bash</category>
      <category>linux</category>
    </item>
    <item>
      <title>Building a Distributed Turn-Based Game System in Elixir</title>
      <dc:creator>Fly.io</dc:creator>
      <pubDate>Wed, 26 May 2021 17:38:25 +0000</pubDate>
      <link>https://dev.to/flyio/building-a-distributed-turn-based-game-system-in-elixir-24p1</link>
      <guid>https://dev.to/flyio/building-a-distributed-turn-based-game-system-in-elixir-24p1</guid>
      <description>&lt;p&gt;One of the best things about building web applications in Elixir is LiveView, the &lt;a href="https://www.phoenixframework.org/"&gt;Phoenix Framework&lt;/a&gt; feature that makes it easy to create live and responsive web pages without all the layers people normally build.&lt;/p&gt;

&lt;p&gt;Many great &lt;a href="https://github.com/phoenixframework/phoenix_live_view"&gt;Phoenix LiveView&lt;/a&gt; examples exist. They often show the ease and power of LiveView but stop at multiple browsers talking to a single web server. I wanted to go further and create a fully clustered, globally distributed, privately networked, secure application. What's more, I wanted to have fun doing it.&lt;/p&gt;

&lt;p&gt;So I set out to see if I could create a fully distributed, clustered, privately networked, global game server system. &lt;strong&gt;Spoiler Alert: I did&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I didn't have to build
&lt;/h2&gt;

&lt;p&gt;What I find remarkable is what I &lt;strong&gt;didn't&lt;/strong&gt; need to build.&lt;/p&gt;

&lt;p&gt;I &lt;strong&gt;didn't&lt;/strong&gt; build a Javascript front end using something like React.js or Vue.js. That is the typical approach. Building a JS front-end means I need JS components, a front-end router, a way to model the state in the browser, a way to transfer player actions to the server and a way to receive state updates from the server.&lt;/p&gt;

&lt;p&gt;On the server, I &lt;strong&gt;didn't&lt;/strong&gt; build an API. Typically that would be REST or GraphQL with a JSON structure for transferring data to and from the front-end.&lt;/p&gt;

&lt;p&gt;I &lt;strong&gt;didn't&lt;/strong&gt; need other external systems like Amazon SQS, Kafka, or even just Redis to pass state between servers. This means the entire system requires less cross-technology knowledge and fewer specialized skills to build and maintain. I used &lt;code&gt;Phoenix.PubSub&lt;/code&gt;, which is built on technology already in Elixir's VM, called the BEAM. I used the Horde library to provide a distributed process registry for finding and interacting with GameServers.&lt;/p&gt;

&lt;p&gt;As for &lt;a href="https://fly.io/docs/reference/privatenetwork"&gt;Fly.io's WireGuard connected private network&lt;/a&gt; between &lt;a href="https://fly.io/docs/reference/regions/"&gt;geographically distant regions&lt;/a&gt; and data centers? I don't even know how I would have done that in AWS, which is why I've always given up on the idea.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I did build
&lt;/h2&gt;

&lt;p&gt;What I built was just a proof of concept, but I'm surprised at how it came together. I ended up with a platform that can host many different types of games, all of which:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Can be multi-player&lt;/li&gt;
&lt;li&gt;Offer a &lt;a href="https://en.wikipedia.org/wiki/Jackbox_Games"&gt;Jackbox&lt;/a&gt;-style 4-letter game code system&lt;/li&gt;
&lt;li&gt;Have on-demand game and match creation&lt;/li&gt;
&lt;li&gt;Have a fast, responsive UI&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And, just one little extra detail: the platform supports multiple connected servers operating together in clusters. Elixir for the win!&lt;/p&gt;

&lt;p&gt;I created this as an open source project on GitHub, so you can check it out yourself.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/fly-apps/tictac"&gt;https://github.com/fly-apps/tictac&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Technology
&lt;/h2&gt;

&lt;p&gt;I've worked with enough companies and teams to imagine several different approaches to build a system like this. Those approaches would all require large multi-disciplinary teams like a front-end JS team, a backend team, a DevOps team, and more. In contrast, I set out to do this by myself, in my spare time, and with a whole lot of "life" happening too.&lt;/p&gt;

&lt;p&gt;Here's what I chose to use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://elixir-lang.org/"&gt;Elixir programming language&lt;/a&gt; – A dynamic, functional language for building scalable and maintainable applications.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.phoenixframework.org/"&gt;Phoenix Framework&lt;/a&gt; – Elixir's primary web framework&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/phoenixframework/phoenix_live_view"&gt;Phoenix LiveView&lt;/a&gt; – Rich, real-time user experiences with server-rendered HTML delivered by websockets&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/bitwalker/libcluster"&gt;libcluster&lt;/a&gt; – Automatic cluster formation/healing for Elixir applications.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/derekkraan/horde"&gt;Horde&lt;/a&gt; – Elixir library that provides a distributed and supervised process registry.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://fly.io/"&gt;Fly.io&lt;/a&gt; – Hosting platform that enables private networked connections and multi-region support.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Application Architecture
&lt;/h2&gt;

&lt;p&gt;There are many guides to &lt;a href="https://www.google.com/search?hl=en&amp;amp;q=getting%20started%20with%20phoenix%20liveview"&gt;getting started with LiveView&lt;/a&gt;, so I'm not focusing on that here. However, for context, this demonstrates the application architecture when running on a local machine.&lt;/p&gt;

&lt;p&gt;The "ABCD" in the graphic is a running game identified by the 4-letter code "ABCD".&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--oTcQWh26--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://fly.io/public/images/tictac-single-node-game-state-and-gen-server.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--oTcQWh26--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://fly.io/public/images/tictac-single-node-game-state-and-gen-server.png" alt="local system architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's walk it through.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A player uses a web browser to view the game board. The player clicks to make a move.&lt;/li&gt;
&lt;li&gt;The browser click triggers an event in the player's LiveView. There is a bi-directional websocket connection from the browser to LiveView.&lt;/li&gt;
&lt;li&gt;The LiveView process sends a message to the game server for the player's move.&lt;/li&gt;
&lt;li&gt;The GameServer uses &lt;code&gt;Phoenix.PubSub&lt;/code&gt; to publish the updated state of game ABCD.&lt;/li&gt;
&lt;li&gt;The player's LiveView is subscribed to notifications for any updates to game ABCD. The LiveView receives the new game state. This automatically triggers LiveView to re-render the game, immediately pushing the UI changes out to the player's browser.&lt;/li&gt;
&lt;li&gt;All connected players see the new state of the board and game.&lt;/li&gt;
&lt;/ol&gt;
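&lt;p&gt;Steps 4 and 5 are classic publish/subscribe. As a rough analogy, here is that fan-out pattern sketched in Go (an illustration only, not the app's actual &lt;code&gt;Phoenix.PubSub&lt;/code&gt; code, which is Elixir; the names here are made up): each LiveView registers a handler on the game's topic, and publishing the new state invokes every handler.&lt;/p&gt;

```go
package main

import "fmt"

// Hub is a minimal Go analogy for Phoenix.PubSub: it maps a topic (the
// 4-letter game code) to subscriber handlers. Publishing the new game
// state invokes every handler, just as PubSub messages every
// subscribed LiveView process.
type Hub struct {
	subs map[string][]func(state string)
}

func NewHub() Hub {
	return Hub{subs: make(map[string][]func(string))}
}

// Subscribe registers a handler for updates to one game's topic.
func (h Hub) Subscribe(topic string, handler func(state string)) {
	h.subs[topic] = append(h.subs[topic], handler)
}

// Publish fans the new state out to every subscriber of the topic.
func (h Hub) Publish(topic, state string) {
	for _, handler := range h.subs[topic] {
		handler(state)
	}
}

func main() {
	hub := NewHub()
	// Two players' LiveViews subscribe to game "ABCD".
	for _, player := range []string{"player1", "player2"} {
		name := player
		hub.Subscribe("ABCD", func(state string) {
			fmt.Println(name, "re-renders with:", state)
		})
	}
	// The GameServer publishes the updated state after a move.
	hub.Publish("ABCD", "X takes the center square")
}
```

In the real app, the subscribers are LiveView processes and delivery is an Erlang message send, so the fan-out is concurrent and works across clustered nodes; the shape of the pattern is the same.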

&lt;h2&gt;
  
  
  We need a game
&lt;/h2&gt;

&lt;p&gt;I needed a simple game to play and model for this game system. I chose Tic-Tac-Toe. Why?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It's simple to understand and play.&lt;/li&gt;
&lt;li&gt;Easy to model.&lt;/li&gt;
&lt;li&gt;Doesn't bog down the project with designing a game.&lt;/li&gt;
&lt;li&gt;Quick to play through, making it easy to test the game being "over".&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I want to emphasize that this system can be used to build &lt;strong&gt;many&lt;/strong&gt; turn-based, multi-user games! This simple Tic-Tac-Toe game covers all of the basics we will need. Besides, &lt;a href="https://www.youtube.com/watch?v=xHObMqUdBa8"&gt;Tic-Tac-Toe was even made into a TV Show&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;This is what the game looks like with 2 players.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hnduopNf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://fly.io/public/images/tictac_local_playing.gif%3Fcard" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hnduopNf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://fly.io/public/images/tictac_local_playing.gif%3Fcard" alt="animated gif demoing game play"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The game system works great locally. Let's get it deployed!&lt;/p&gt;

&lt;h2&gt;
  
  
  Hosting on Fly.io
&lt;/h2&gt;

&lt;p&gt;Following the &lt;a href="https://fly.io/docs/getting-started/elixir/"&gt;Fly.io Getting Started Guide for Elixir&lt;/a&gt;, I created a Dockerfile to generate a release for my application. Check out the repo here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/fly-apps/tictac"&gt;https://github.com/fly-apps/tictac&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The README file outlines both how to run it locally and deploy it globally on Fly.io.&lt;/p&gt;

&lt;p&gt;What is special about hosting it on Fly.io? Fly makes it easy to deploy a server geographically closer to the users I want to reach. When a user visits my website, they are directed to &lt;strong&gt;the nearest of my servers&lt;/strong&gt;. This means any responsive LiveView updates and interactions will be even faster and smoother, because the regular TCP and websocket connections are that much physically closer.&lt;/p&gt;

&lt;p&gt;But for the game, I wanted there to be a single source of truth. That GameServer can only exist in one place. Supporting a private, networked, and fully clustered environment means my server in the EU can communicate with the GameServer that might be running in the US. But my EU players have a fast and responsive UI connection close to them. This provides a better user experience!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dYMKyghO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://fly.io/public/images/tictac-fly-region-cluster.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dYMKyghO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://fly.io/public/images/tictac-fly-region-cluster.png" alt="Fly region clustering"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is what I find compelling about Fly.io for hosting Elixir applications:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Secure HTTPS automatically using Let's Encrypt. I didn't do anything to set that up!&lt;/li&gt;
&lt;li&gt;Distributed nodes use &lt;a href="https://fly.io/docs/reference/privatenetwork/"&gt;private network&lt;/a&gt; connections through &lt;a href="https://www.wireguard.com/"&gt;WireGuard&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Nodes auto-clustered using &lt;code&gt;libcluster&lt;/code&gt; and the &lt;code&gt;DNSPoll&lt;/code&gt; strategy. (See &lt;a href="https://github.com/fly-apps/tictac/blob/main/config/runtime.exs#L25"&gt;here for details&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Geographically distributed servers near my users are clustered together.&lt;/li&gt;
&lt;li&gt;This was the easiest multi-region yet still privately networked solution I've ever seen! (I have experience with AWS, DigitalOcean, and Heroku)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;For a proof-of-concept, I couldn't be happier! In a short time, by myself, I created a working, clustered, distributed, multi-player, globe-spanning gaming system!&lt;/p&gt;

&lt;p&gt;The pairing of Elixir + LiveView + Fly.io is excellent. Using Elixir and LiveView, I built a powerful, resilient, and distributed system with orders of magnitude less time and effort. Deploying it on Fly.io let me easily do something I would never have even tried before, namely, deploying servers in regions around the globe while keeping the application privately networked and clustered together.&lt;/p&gt;

&lt;p&gt;Whenever I've thought of creating a service with a global audience, I'd usually talk myself out of the idea, saying, "Well I don't know how I'd get the translations, so I'll just stick with the US. It's a huge market anyway." In short, I've never even considered a globally connected application because it would be "way too hard".&lt;/p&gt;

&lt;p&gt;But here, with Elixir + LiveView + Fly.io, I did something by myself in my spare time that larger teams using more technologies struggle to deliver. I'm still mind blown by it!&lt;/p&gt;

&lt;h2&gt;
  
  
  What will you build?
&lt;/h2&gt;

&lt;p&gt;Tic-Tac-Toe is a simple game and doesn't provide "hours of fun". I know &lt;strong&gt;you&lt;/strong&gt; can think of a much cooler and more interesting multi-player, turn-based game that you could build on a system like this. What do you have in mind?&lt;/p&gt;

</description>
    </item>
    <item>
      <title>You should know about Server-Side Request Forgery</title>
      <dc:creator>Fly.io</dc:creator>
      <pubDate>Mon, 25 Jan 2021 18:09:17 +0000</pubDate>
      <link>https://dev.to/flyio/practical-smokescreen-sanitizing-your-outbound-web-requests-2519</link>
      <guid>https://dev.to/flyio/practical-smokescreen-sanitizing-your-outbound-web-requests-2519</guid>
      <description>&lt;p&gt;This is a post about the most dangerous vulnerability most web applications face, one step that we took at Fly to mitigate it, and how you can do the same.&lt;/p&gt;




&lt;p&gt;Server-side request forgery (SSRF) is application security jargon for “attackers can get your app server to make HTTP requests on their behalf”. Compared to other high severity vulnerabilities like SQL injection, which allows attackers to take over your database, or filesystem access or remote code injection, SSRF doesn’t sound that scary. But it is, and you should be nervous about it.&lt;/p&gt;

&lt;p&gt;The deceptive severity of SSRF is one of two factors that makes SSRF so insidious. The reason is simple: your app server is behind a security perimeter, and can usually reach things an ordinary Internet user can’t. Because HTTP is a relatively flexible protocol, and URLs are so expressive, attackers can often use SSRF to reach surprising places; in fact, leveraging HTTP SSRF to reach non-SSRF protocols has become a sport among security researchers. A meaty, complicated example of this is &lt;a href="https://portswigger.net/daily-swig/when-tls-hacks-you-security-friend-becomes-a-foe"&gt;Joshua Maddux’s TLS SSRF trick&lt;/a&gt; from last August. Long story short: in serious applications, SSRF is usually a game-over vulnerability, meaning attackers can use it to gain full control over an application’s hosting environment.&lt;/p&gt;

&lt;p&gt;The other factor that makes SSRF nerve-wracking is its prevalence. As an industry, we’ve managed to drastically reduce instances of vulnerabilities like SQL injection by updating our libraries and changing best practices; for instance, it would be weird to see a mainstream SQL library that didn’t use parameterized queries to keep attacker meta-characters out of query parsing. But applications of all shapes and sizes make server-side HTTP queries; in fact, if anything, that’s becoming more common as we adopt more and more web APIs. &lt;/p&gt;

&lt;p&gt;There are two common patterns of SSRF vulnerabilities. The first, simplest, and most dangerous comprises features that allow users to provide URLs for the web server to call directly; for instance, your app might offer “web hooks” to call back to customer web servers. The second pattern involves features that incorporate user data into URLs. In both cases, an attacker will try to leverage whatever control you offer over URLs to trick your server into hitting unexpected URLs.&lt;/p&gt;

&lt;p&gt;Fortunately, there’s a mitigation that frustrates attackers trying to exploit either pattern of SSRF vulnerabilities: SSRF proxies.&lt;/p&gt;

&lt;h2&gt;
  
  
  You should know about Smokescreen
&lt;/h2&gt;

&lt;p&gt;Imagine if your application code didn’t have to be relentlessly vigilant about every URL it reached out to, and could instead assume that a basic security control existed to make sure that no server-side HTTP query would be able to touch internal resources? It’s easy if you try! What you want is to run your server-side HTTP through a proxy.&lt;/p&gt;

&lt;p&gt;We've been putting &lt;a href="https://github.com/stripe/smokescreen"&gt;Smokescreen&lt;/a&gt; to work at Fly, and it's so useful, we thought we should share. Smokescreen is an egress proxy that was built at Stripe to help manage outgoing connections in a sensible and safe way.&lt;/p&gt;

&lt;p&gt;Smokescreen’s job is to make sure your outgoing requests are sane, sensible, and safe. Stripe created it to know where all of its outgoing requests were going. Specifically, it makes sure that the requested IP address is publicly routable, which means checking that a request isn't destined for &lt;code&gt;10.0.0.0/8&lt;/code&gt;, &lt;code&gt;172.16.0.0/12&lt;/code&gt;, &lt;code&gt;192.168.0.0/16&lt;/code&gt;, or &lt;code&gt;fc00::/7&lt;/code&gt; (the IPv6 "ULA" space, which includes &lt;a href="https://fly.io/blog/incoming-6pn-private-networks/"&gt;Fly's 6PN private addresses&lt;/a&gt;).&lt;/p&gt;
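&lt;p&gt;Go's standard library makes the core of that destination check easy to sketch. This is an illustration of the idea with a made-up function name, not Smokescreen's actual implementation (which, among other things, resolves hostnames before checking the resulting addresses):&lt;/p&gt;

```go
package main

import (
	"fmt"
	"net"
)

// isPubliclyRoutable sketches the heart of an egress check: refuse
// private ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16, and the
// IPv6 ULA space fc00::/7), plus loopback and link-local addresses.
// net.IP.IsPrivate (Go 1.17+) covers exactly the RFC 1918 and
// RFC 4193 ranges listed above.
func isPubliclyRoutable(addr string) bool {
	ip := net.ParseIP(addr)
	if ip == nil {
		return false // unparsable destination: refuse it
	}
	if ip.IsPrivate() || ip.IsLoopback() || ip.IsLinkLocalUnicast() || ip.IsUnspecified() {
		return false
	}
	return true
}

func main() {
	for _, addr := range []string{"10.1.2.3", "192.168.0.5", "fc00::1", "1.1.1.1"} {
		fmt.Printf("%-12s publicly routable: %v\n", addr, isPubliclyRoutable(addr))
	}
}
```

The crucial detail in a real proxy is doing this check on the *resolved* IP at connect time, not on the hostname the client supplied, so DNS tricks can't smuggle a request into the private ranges.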

&lt;h2&gt;
  
  
  Out of the box
&lt;/h2&gt;

&lt;p&gt;There's more Smokescreen can do, but before we get to that, let's talk about how Smokescreen determines who you are. By default, Smokescreen takes the client certificate from a TLS connection, extracts the certificate's common name, and uses that as the role. There is another documented mechanism for non-TLS connections using a header, but it doesn't appear to be actually wired up in Smokescreen. That means you'd have to use TLS client certificates for all the systems connecting through Smokescreen, and that is an administrative pain.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Basic
&lt;/h2&gt;

&lt;p&gt;We wanted Smokescreen to be simpler to enable, and with Fly we have the advantage of supporting Secrets for all applications. Rather than repurposing TLS CAs to provide a name, we can store a secret with the Smokescreen proxy and with the app that sends requests to the outside world. For this example, that secret is a &lt;code&gt;PROXY_PASSWORD&lt;/code&gt; that we can distribute to every app inside the Fly network.&lt;/p&gt;

&lt;p&gt;Here's the &lt;a href="https://github.com/fly-examples/smokescreen"&gt;GitHub repository for the Fly Smokescreen example&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  I'm on the list...
&lt;/h2&gt;

&lt;p&gt;In all cases, what Smokescreen does is turn the identity of an incoming request into a role. That role is then looked up in the &lt;code&gt;acl.yaml&lt;/code&gt; file. Here's the Fly example ACL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;authed&lt;/span&gt;
    &lt;span class="na"&gt;project&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;users&lt;/span&gt;
    &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;report&lt;/span&gt;


&lt;span class="na"&gt;default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;project&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;other&lt;/span&gt;
    &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;enforce&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We've gone super simple on the roles here. There's one and that's &lt;code&gt;authed&lt;/code&gt;. You're either &lt;code&gt;authed&lt;/code&gt; or you fall through to default. The project field is there to make logging more meaningful by associating roles with projects. &lt;/p&gt;

&lt;p&gt;The control of what happens with requests comes from the &lt;code&gt;action&lt;/code&gt; field; this has three settings: &lt;code&gt;open&lt;/code&gt; lets all traffic through, &lt;code&gt;report&lt;/code&gt; lets all traffic through but logs the request if it's not on the list, and &lt;code&gt;enforce&lt;/code&gt; only lets through traffic on the list. The list in this example isn't there, so &lt;code&gt;report&lt;/code&gt; logs all requests and &lt;code&gt;enforce&lt;/code&gt; blocks all requests. &lt;/p&gt;

&lt;p&gt;Adding &lt;code&gt;allowed-domains&lt;/code&gt; and a list of domains lets you fine-tune these options. For a general-purpose block-or-log egress proxy, this example is enough. Smokescreen has more &lt;a href="https://github.com/stripe/smokescreen#acls"&gt;ACL control options&lt;/a&gt;, including global allow and deny lists if you want to maintain simple but specific rules while still blocking a long list of sites.&lt;/p&gt;
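&lt;p&gt;The three actions are easy to pin down precisely. Here's a small Go paraphrase of those documented semantics (my own sketch with a made-up function name, not Smokescreen's code), deciding whether a request is allowed through and whether it should be flagged in the logs as off-list:&lt;/p&gt;

```go
package main

import "fmt"

// decide paraphrases the three ACL actions: "open" allows everything,
// "report" allows everything but flags requests not on the list, and
// "enforce" only allows requests whose destination is on the list.
// (Illustrative sketch, not Smokescreen's actual code.)
func decide(action string, onList bool) (allowed, flagged bool) {
	switch action {
	case "open":
		return true, false
	case "report":
		return true, !onList
	case "enforce":
		return onList, !onList
	default:
		// Unknown action: fail closed.
		return false, true
	}
}

func main() {
	// With no allowed-domains list, nothing is ever "on the list",
	// so report logs every request and enforce blocks every request.
	for _, action := range []string{"open", "report", "enforce"} {
		allowed, flagged := decide(action, false)
		fmt.Printf("%-8s allowed=%v flagged=%v\n", action, allowed, flagged)
	}
}
```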

&lt;h2&gt;
  
  
  Smokescreen inside
&lt;/h2&gt;

&lt;p&gt;If you are interested in how this modified Smokescreen works, look in the &lt;a href="https://github.com/fly-examples/smokescreen/blob/master/main.go"&gt;main.go&lt;/a&gt; file. This is where the Smokescreen code is loaded as a Go package. The program creates a new configuration for Smokescreen with an alternative &lt;code&gt;RoleFromRequest&lt;/code&gt; function. It's this function that extracts the password from the &lt;code&gt;Proxy-Authorization&lt;/code&gt; header and checks it against the &lt;code&gt;PROXY_PASSWORD&lt;/code&gt; environment variable. If it passes that test, it returns &lt;code&gt;authed&lt;/code&gt; as the role. Otherwise, it returns an empty string, denoting no role. This is the function you may want to customize to create your own mappings from username and password combinations to Smokescreen roles.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deploy now
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Fly
&lt;/h3&gt;

&lt;p&gt;First, here's how to deploy on Fly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fly init mysmokescreen --import source.fly.toml --org personal
fly secrets set PROXY_PASSWORD="somesecret"
fly deploy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And that's it for Fly; there'll be a &lt;code&gt;mysmokescreen&lt;/code&gt; app set up with Fly's internal private networking DNS (and external DNS if we needed it, which we don't here), and it'll be up and running. Turn on your &lt;a href="https://fly.io/docs/reference/privatenetwork/#private-network-vpn"&gt;Fly 6PN (Private Networking) VPN&lt;/a&gt; and test it with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -U anyname:somesecret -x mysmokescreen.internal:4750 https://fly.io
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And that will return the Fly homepage to you. Run &lt;code&gt;fly logs&lt;/code&gt; and you'll see entries for the opening and closing of the proxy's connection to fly.io. What's neat with the Fly deployment is that with just two commands you can deploy the same application globally.&lt;/p&gt;

&lt;h3&gt;
  
  
  Docker - locally
&lt;/h3&gt;

&lt;p&gt;If you're on another platform, you should be able to reuse the Dockerfile. Running locally, you just need to do:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker build -t smokescreen .
docker run -p 4750:4750 --env PROXY_PASSWORD=somesecret smokescreen
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And to test, in another session, do:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -U anyname:somesecret -x localhost:4750 https://fly.io
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll see the log output appearing in the session where you ran &lt;code&gt;docker run&lt;/code&gt;. We leave it as an exercise for readers to deploy the application to their own cloud.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using a Proxy from an app
&lt;/h2&gt;

&lt;p&gt;To wrap up this article, we present two code examples, one in Go and one in Node, that take from the environment a &lt;code&gt;PROXY_URL&lt;/code&gt; pointed at our Smokescreen and a &lt;code&gt;PROXY_PASSWORD&lt;/code&gt; for that Smokescreen, and issue a simple GET for an &lt;code&gt;https:&lt;/code&gt; URL.&lt;/p&gt;

&lt;p&gt;On Fly, the &lt;code&gt;PROXY_URL&lt;/code&gt; can be as simple as "&lt;a href="http://mysmokescreen.internal:4750/"&gt;http://mysmokescreen.internal:4750/&lt;/a&gt;". Fly's 6PN network automatically maps deployed applications' names and instances into the &lt;code&gt;.internal&lt;/code&gt; TLD for DNS. On other platforms, you'll have to configure a hostname for your Smokescreen and make sure you change it everywhere if you move your proxy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Calling through an authenticated Proxy from Go
&lt;/h3&gt;

&lt;p&gt;This example uses only the system libraries. There are no extra modules needed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"encoding/base64"&lt;/span&gt;
    &lt;span class="s"&gt;"fmt"&lt;/span&gt;
    &lt;span class="s"&gt;"io/ioutil"&lt;/span&gt;
    &lt;span class="s"&gt;"log"&lt;/span&gt;
    &lt;span class="s"&gt;"net/http"&lt;/span&gt;
    &lt;span class="s"&gt;"net/url"&lt;/span&gt;
    &lt;span class="s"&gt;"os"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// Set the proxy's URL&lt;/span&gt;
    &lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ok&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LookupEnv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"PROXY_URL"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;ok&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fatal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Set PROXY_URL environment variable"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// And parse it.&lt;/span&gt;
    &lt;span class="n"&gt;proxyURL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fatal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"The proxyURL is unparsable: "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;proxyPASS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ok&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LookupEnv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"PROXY_PASSWORD"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;ok&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fatal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Set PROXY_PASSWORD environment variable"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// Get you a transport that understands Proxies and Proxy authentication&lt;/span&gt;
    &lt;span class="n"&gt;transport&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Transport&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Proxy&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ProxyURL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;proxyURL&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;

    &lt;span class="c"&gt;// Create a usename:password string&lt;/span&gt;
    &lt;span class="n"&gt;auth&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="s"&gt;"anyname:"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;proxyPASS&lt;/span&gt;

    &lt;span class="c"&gt;// Base64 that string with "Basic " prepended to it&lt;/span&gt;
    &lt;span class="n"&gt;basicAuth&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="s"&gt;"Basic "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StdEncoding&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EncodeToString&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;auth&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="c"&gt;// Put a header into the proxy connect header&lt;/span&gt;
    &lt;span class="n"&gt;transport&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ProxyConnectHeader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Header&lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;

    &lt;span class="c"&gt;// And then add in the Proxy-Authorization header with our auth string&lt;/span&gt;
    &lt;span class="n"&gt;transport&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ProxyConnectHeader&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Proxy-Authorization"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;basicAuth&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c"&gt;// Now we are ready to get pages, just create HTTP clients which use the&lt;/span&gt;
    &lt;span class="c"&gt;// Proxy transport.&lt;/span&gt;

    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Transport&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;transport&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;rawURL&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="s"&gt;"https://fly.io"&lt;/span&gt;

    &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"GET"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rawURL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Println&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Do&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Println&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Println&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Read ok"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StatusCode&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="m"&gt;200&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Println&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Status&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;bs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;ioutil&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ReadAll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Body&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Println&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bs&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Calling through an authenticated Proxy from Node.js
&lt;/h3&gt;

&lt;p&gt;This example uses the &lt;a href="https://www.npmjs.com/package/https-proxy-agent"&gt;https-proxy-agent&lt;/a&gt; package.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;url&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;https&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;HttpsProxyAgent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https-proxy-agent&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Create a URL for our proxy from the env var&lt;/span&gt;
&lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;proxyOpts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;URL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;PROXY_URL&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Get a password from the environment too&lt;/span&gt;
&lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;proxyPass&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;PROXY_PASSWORD&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// Inject a Proxy-Authentication header into the proxy using the password&lt;/span&gt;
&lt;span class="nx"&gt;proxyOpts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Proxy-Authentication&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Basic &lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`anyname:&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;proxyPass&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;toString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;base64&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="c1"&gt;// Create an HTTPS Proxy Agent with our accumulated options&lt;/span&gt;
&lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;HttpsProxyAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;proxyOpts&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;options&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;URL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://fly.io&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="nx"&gt;options&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="nx"&gt;https&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;options&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;"response" event!&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Smokescreen summarized&lt;/h2&gt;

&lt;p&gt;We've shown you examples of setting up a custom Smokescreen with password authentication. You'll find all the code for setting that up in the &lt;a href="https://github.com/fly-examples/smokescreen"&gt;Fly GitHub repository for this Smokescreen&lt;/a&gt;. Have fun sanitizing your outgoing web requests.&lt;/p&gt;

</description>
      <category>cloud</category>
      <category>security</category>
      <category>ssrf</category>
    </item>
  </channel>
</rss>
