Many folks who are new to operations approach things with an absolutist mindset. They see things as correct or incorrect, secure or insecure. In reality, operations of all types involves balancing opposing forces. After all, the most secure website in the world wouldn’t be accessible to anyone and certainly wouldn’t have any users.
With that in mind, I turned the typical format for the “how/where do I run my stuff” talk on its side. Instead of covering each product and the pros and cons of using it, I organized my talk around seven questions. I chose the questions based on personal experience taking apps to production.
These may not be the right questions to use, but I received feedback that the questions format helped folks understand trade-offs and equipped them to make architecture decisions with more confidence.
The most crucial question is, what are your priorities? Operations is all about trade-offs. Because of this, it is vital that you can clearly state your goals, what is critical, and the things you aren’t concerned about. Priorities can be practical matters like “saving money” or “robustness.” They can also be more scenario-based, like “support our series B funding round by allowing investors to see our proof of concept.” Alternatively, maybe your priority is “get valuable feedback from a limited set of friendly beta users.” Most of the time you’ll have several priorities, and it can be helpful to roughly stack-rank them.
Once you have your priorities, think about what implications they have. For example, if your focus is on saving money, you may not want to pay extra for a highly managed platform-as-a-service system. If you are supporting beta users or investor rounds, it may be crucial that your system scales down to zero, so you aren’t paying when no one is looking at your app. These are just examples. There are many other ways that priorities can influence the decisions that you make.
The shape of your load can also have a profound impact on what compute options, and even what architectures, make sense for your app. The load for many apps follows the work day. Either they’re applications people use during the workday, or they’re applications people use outside of work. In both cases, being able to scale efficiently and dynamically as load changes is essential. Some applications have loads that are bursty. They may see several hundred thousand requests in a few minutes and then only a hundred requests a minute for a few hours. Other loads require large amounts of batch processing on a regular schedule but have more modest computing requirements at other times.
Most cloud providers allow for auto-scaling (and to a lesser degree auto-healing) of your application to adjust to current load. When using autoscaling, you usually specify a minimum and a maximum number of instances. Personally, I set my maximum number of instances to 5x my predicted peak load. I use 5x because your frontend web servers aren’t the only thing that must scale. If I get close to that 5x upper bound, I should double-check that other parts of the system, like the database and my mailer, are still handling the load okay. Also, at 5x I ordinarily have enough warning that we’re approaching that limit to adjust it further upwards if everything is going well.
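As a sketch, the 5x rule of thumb might look like the helper below. The function name, the per-instance capacity figure, and the minimum-instance heuristic are all illustrative assumptions, not any provider’s API:

```python
import math

def autoscale_bounds(predicted_peak_rps, capacity_per_instance_rps, headroom=5):
    """Suggest autoscaler bounds using the 5x headroom rule of thumb.

    predicted_peak_rps: expected peak requests per second
    capacity_per_instance_rps: requests/sec one instance handles comfortably
    headroom: multiplier above the predicted baseline (5x in the text)
    """
    baseline = math.ceil(predicted_peak_rps / capacity_per_instance_rps)
    return {
        # Assumption: keep half the baseline warm so scale-up isn't cold.
        "min_instances": max(1, math.ceil(baseline / 2)),
        "max_instances": baseline * headroom,
    }

# e.g. 1,000 req/s predicted peak, 100 req/s per instance
bounds = autoscale_bounds(1000, 100)
```

Whatever numbers you choose, feed them to your provider’s autoscaler config and alert well before traffic approaches the maximum, so you have time to raise the ceiling deliberately.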
Not all compute choices allow you to scale down to zero. If you know you will have no traffic sometimes, scaling to zero can be an essential way to save money. Bursty loads may be best served by a serverless compute option like Google Cloud Functions or AWS Lambda. Also, batch processing is usually cheaper using a discounted compute option like preemptible VMs. You can also save money by running latency-tolerant jobs in less expensive regions (if data locality requirements allow). There are lots of ways load can influence your choices. Your priorities should help you decide which factors matter most right now so you can choose products that make sense for your application.
This is a question that some folks don’t consciously think about, but it always comes up when I’m talking to successful entrepreneurs. Think about what languages and frameworks your team already knows. Does anyone have experience setting up, maintaining, and securing VMs? What products are you most comfortable with?
You don’t necessarily have to stick with what you know; sometimes a new project is a great chance to learn something new. If you go outside your comfort zone though, things may take longer. You may want to find places you can get real-time help or support before you choose to go with something you aren’t comfortable with yet.
Sometimes folks are eager to use a specific technology just because they heard about it on a podcast or in a conference talk. That isn’t always an incorrect choice, but jumping into an unfamiliar technology may not be the best use of your time and money. As an example, I see many folks these days clamoring to use containers. Containers are awesome. I really like GKE. But containers add a layer of indirection and complexity to your deployments. If you aren’t entirely comfortable with your stack outside a container, you need to be aware that debugging it inside a container is harder.
If you aren’t familiar with the term “happy path,” it means the path you want users to go down. For e-commerce, that’s probably the purchase path. For a social app, that might mean adding friends or sending messages. A good test for figuring out what your happy path or paths are is, “if this goes down, should I be paged at 2 AM?”
Once you’ve identified your happy path, you need to instrument it. Most folks know to use an uptime monitoring tool to ensure that their site is still rendering pages. Ideally, that monitor should make sure that there’s some content on the page, not just that the page returns a 2XX status code. You should also think about what other indicators may show that something is wrong. Some examples: number of successful purchases, number of logins, number of 4XX errors, and average load times for critical pages. When you’ve found these other indicators, instrument them as well. They may not all have a “page me at 2 AM” importance level, but these other indicators can detect errors that simple uptime checks fail to find.
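A minimal sketch of a content-aware uptime check, using only the Python standard library; the URL and the expected text are whatever your happy path actually serves:

```python
import urllib.request

def check_happy_path(url, expected_text, timeout=10):
    """Uptime check that verifies real page content, not just a 2XX status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return 200 <= resp.status < 300 and expected_text in body
    except OSError:
        # DNS failures, timeouts, connection errors, and HTTP errors
        # (urllib raises them as URLError/HTTPError, both OSErrors)
        # all count as "down".
        return False
```

In practice you’d run a check like this from outside your own network on a schedule, and page on repeated failures rather than a single blip.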
For one app I worked on I had a dashboard showing the number of successful payments on a week-over-week graph. Except during holiday periods, the two lines trended together. One day we noticed that the “today” line was well below the “last week” line. We investigated and found an issue with our payment processing. Since payment was one of the happy paths we monitored, we were able to address the issue quickly. We would have found the issue eventually, likely when a customer reported it. By putting the monitoring and application analytics in place up front, we heard about it before too many customers were affected.
Once you’ve instrumented the happy path, you need to instrument the rough spots. By rough spots, I mean the parts of your app that you didn’t have a chance to test thoroughly. Or the places where you integrate with services that don’t feel very reliable. Sometimes rough spots are just the places that seem to break most often.
Monitoring these spots can be tricky. Often they fail in benign ways frequently enough that, if you aren’t careful, you’ll cause alert fatigue. I usually take an approach centered around the question, “is it getting worse?” I’ll set up dashboards showing the 50th percentile latency and perhaps put alerts on the 95th percentile latency going significantly above a reasonable threshold. I also set up application analytics that allow me to dig into requests and see what the call stack looks like or how a request propagated through a series of microservices. Sometimes this more detailed information can help me fix the code issues and smooth out these rough spots.
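Here is a minimal nearest-rank sketch of those two dashboard numbers; the sample latencies and the 500 ms alert threshold are illustrative assumptions, not recommendations:

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a list of latency samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Illustrative latency samples, in milliseconds.
latencies = [120, 95, 110, 400, 105, 98, 102, 1250, 115, 101]

p50 = percentile(latencies, 50)  # typical experience -> dashboard
p95 = percentile(latencies, 95)  # tail experience -> alert candidate

# Alert only when the tail drifts well past an agreed threshold,
# rather than on every individual slow request.
tail_alert = p95 > 500
```

The point of watching p50 and p95 side by side is the “is it getting worse?” question: a rising tail with a flat median usually points at a rough spot, not the whole system.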
Any time I mention fault tolerance in a talk some folks get a little bit tense. Most people want their application to be up all the time, and never have any errors. I get it. Success feels awesome. Having a conversation about error budgets with your boss or CEO can be challenging. But technology professionals know that an always-up service isn’t realistic. Even if your service is perfect, natural disasters and sharks biting undersea cables can cause errors.
So before you go live, think about how much downtime, and what sorts of downtime, you can tolerate. Is a small amount of downtime in the middle of the night okay? Is it okay to take downtime in the middle of the day to apply critical OS-level patches? When Heartbleed happened, many operations folks had to explain to the business side of their company why taking a small outage in the middle of the day was prudent under the circumstances. Decide with the business leaders on an acceptable error budget and some basic guidelines for what critical bugs and issues look like.
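The arithmetic behind an error budget is simple enough to sketch; the 30-day window and the 99.9% target below are illustrative numbers for the conversation, not recommendations:

```python
def downtime_budget_minutes(availability_target, period_hours=30 * 24):
    """Allowed downtime, in minutes, for an availability target over a window.

    availability_target: e.g. 0.999 for "three nines"
    period_hours: length of the window; default is a 30-day month
    """
    return (1 - availability_target) * period_hours * 60

# 99.9% over a 30-day month leaves roughly 43.2 minutes of error budget.
monthly_budget = downtime_budget_minutes(0.999)
```

Putting a concrete minutes-per-month number on the table tends to make the conversation with business leaders much easier than arguing about abstract “nines.”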
If the error budget you are working with is small, you’ll need to ensure that your architecture is resilient and flexible. You may want to choose technologies that make things like Canary Deployment easy. You’ll probably want automated rollbacks, and you’ll need to test your rollback procedure regularly. Deploying to multiple regions may be prudent. Moreover, you’ll need to come up with, and test, a disaster recovery plan.
High availability/high resiliency isn’t a wrong choice, but doing it correctly requires much up-front work. That work makes you prepared when things inevitably go wrong. You can reduce some of the prep work by choosing technologies, architectures, and tools that are built to support your goals. There’s also nothing wrong with deciding that based on your priorities, occasional downtime isn’t a big deal. Once your priorities change, you can revisit this decision and make necessary changes.
The last question is a reminder to audit your code, logs, and other places to ensure that your secrets are secret. By secrets, I mostly mean API keys and customer information. Most folks know not to check in their API keys, yet folks do it all the time. Think about what level of security is enough for your application secrets. Some folks need a tool like HashiCorp Vault or Google Cloud KMS with support for encryption and key rotation. Other users are content ensuring that their API keys are in environment variables on production servers and not checked into source control. There are many ways to handle secrets; please make sure you are handling them purposefully after thinking through the risks and use cases.
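A minimal sketch of the environment-variable approach; the variable name `PAYMENT_API_KEY` is hypothetical:

```python
import os

def get_api_key(name="PAYMENT_API_KEY"):
    """Read a secret from the environment instead of hardcoding it.

    Fail fast if the variable is missing so a misconfigured server
    can't silently run without credentials.
    """
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"missing required environment variable: {name}")
    return key
```

The fail-fast check matters: a server that boots without its credentials and errors only on the first real purchase is much harder to debug than one that refuses to start.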
You are also a steward of your customers’ information. It is surprisingly easy for credit card numbers, usernames, and passwords to end up in logs, internal error messages, and alerts. There have been several high-profile cases of this happening even at large companies with security auditing in place. The most common culprit I’ve seen is logging POST bodies, either to the application log, Syslog, or a third-party logging service. I’ve also seen companies use downloads of the production database for local testing. Just don’t do that. Obfuscate the data in some non-reversible way or build a fresh test database. The risks of a hard drive or laptop being compromised are much higher than most of us realize. Also, continually monitor your application, logs, and other resources to ensure secrets haven’t made their way into places they don’t belong. Many well-known data breaches were due to bug fixes or other modifications years after the initial launch.
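One sketch of scrubbing POST bodies before they reach a log; the field names and the card-number regex are illustrative assumptions and certainly not exhaustive:

```python
import re

# Fields that should never reach logs. Illustrative, not exhaustive.
SENSITIVE_KEYS = {"password", "credit_card", "ssn", "api_key"}

# Rough match for 13-16 digit card numbers with optional separators.
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")

def redact(payload):
    """Return a copy of a request payload that is safe to log."""
    clean = {}
    for key, value in payload.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, str):
            # Catch card numbers embedded in free-text fields too.
            clean[key] = CARD_RE.sub("[REDACTED]", value)
        else:
            clean[key] = value
    return clean
```

Scrubbing at the logging boundary like this is a backstop, not a substitute for not collecting or logging the data in the first place.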
Self care is a hot topic these days, and I’m not just talking about face masks. There is a growing movement that underscores the importance of taking time to take care of yourself (in addition to all the other things that you already take time for). You can prevent problems down the road by taking proactive steps to ensure your health and happiness.