The phrase "you build it, you run it" has become a mantra for many as we build more microservice architectures and shift to a DevOps mindset.
Moving from monolithic applications to a distributed set of microservices comes with many challenges, among them operations. Maybe you have (or had) an operations team that was responsible for operating and maintaining your monolithic application in production. What happens when that monolith is broken down into hundreds of microservices, all operating independently? How is one team supposed to operate all of those services?
What does "You run it" really mean?
Now that we're adjusting to building microservices, we also need to adjust to thinking with a DevOps mindset in our development teams. Running an application requires more than just deploying it successfully to Production. We need to put the "Ops" in "DevOps" and take ownership when it comes to operating our services.
Running or operating a service includes many things. Among them are:
- Responding to incidents (we'll save this one for another day)
- Monitoring the service
- Ensuring the service is meeting its service level objectives (SLOs)
- Conducting operational reviews
- Managing disaster recovery
How can we prepare to take on these responsibilities within our teams as we learn to think with a DevOps mindset?
We need to learn about operations, then automate and document how to operate our microservices. Your team may not own your service forever, and there could be turnover or new hires... how to operate your service needs to be documented in a way that's easy for people outside of your immediate team to understand.
A great way to start is by creating an Operations Guide.
What is an Operations Guide?
An Operations Guide should contain everything a person would need to know to operate your microservice. Let's dive into that a bit further.
Note: The questions and methods below are reflective of what I have experimented with, and what has worked well for the teams that I have worked with. They are not exhaustive. I encourage you to take them as inspiration, and add your own twists to create an Operations Guide that is applicable to your team, organization, and architecture.
Deployment
This section should contain everything someone needs to know about where the service is deployed, and how. Having it all in a single, easy-to-reference place is helpful when responding to an incident, so you don't have to dig through different files or lots of text to find the information you need. Consider using tables to make the data easy to read (there's an example after the questions below).
Consider the following questions:
- How is the service deployed?
- Where is it deployed to?
- If the service is deployed in the Cloud, which regions is it deployed to?
- What environments is it deployed to? (Ex. Staging, Production)
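As a purely illustrative example (the regions, environments, and deployment mechanism here are hypothetical), a deployment summary table might look something like this:

| Environment | Region(s) | How it's deployed |
| ----------- | --------- | ----------------- |
| Staging | us-east-1 | CI/CD pipeline, on every merge to `main` |
| Production | us-east-1, eu-west-1 | CI/CD pipeline, with a manual approval step |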
Contact info
Anyone who is operating the service will need to know how to escalate issues to the team or community supporting the service.
Consider the following questions:
- Where can the team/community be found?
- What is the best way to contact the team/community for a general inquiry?
- What is the best way to contact the team/community during an incident?
- What are the team/community's time zone and hours of operation?
- Who should they escalate to if the team/community doesn't respond in a timely manner?
Service levels
Service Level Objective (SLO): a target value or range of values for a service level that is measured by an SLI.
Service Level Indicator (SLI): a carefully defined quantitative measure of some aspect of the level of service that is provided.
Service Level Agreement (SLA): an explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain.
📚 The above definitions are quoted from Google's SRE Book, chapter 4 - Service Level Objectives. Check out the resources section below for recommended reading.
Your microservice may not have a customer-facing service level agreement (SLA), though the product it's a part of might. Either way, we should always be striving to set and maintain a high level of service in all of our microservices.
Think about your service...
- What would indicate that it is highly available?
- What should the response time of your service be?
- Does availability look different across different environments?
These are examples of Service Level Objectives (SLOs). Next, how can we measure and understand if our service is meeting those objectives? Those metrics and measurements are the Service Level Indicators (SLIs). Every SLO must have an associated SLI to indicate how the objective will be measured. Different environments might have different SLOs - for example, Development may be slightly less stable than Production, so it might have a slightly lower SLO for availability.
What SLOs and SLIs would be meaningful to track for your service? How will you track them?
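To make that concrete, here's a minimal sketch of tracking one SLO - availability - against its SLI. The numbers and the `get_request_counts` helper are hypothetical; in practice the counts would come from your metrics system.

```python
# Hypothetical sketch: compute an availability SLI and compare it to an SLO target.

def get_request_counts(window_days: int = 30) -> tuple[int, int]:
    """Hypothetical helper: returns (successful_requests, total_requests) for the window."""
    return 999_450, 1_000_000  # placeholder numbers

AVAILABILITY_SLO = 0.999  # target: 99.9% of requests succeed over a 30-day window

successful, total = get_request_counts(window_days=30)
availability_sli = successful / total  # the SLI: measured fraction of successful requests

print(f"Availability SLI: {availability_sli:.4%} (SLO target: {AVAILABILITY_SLO:.1%})")
if availability_sli < AVAILABILITY_SLO:
    print("SLO missed - dig into the errors and review the error budget.")
else:
    print("SLO met for this window.")
```

Different environments could simply use a different target value - for example, a lower `AVAILABILITY_SLO` in Development than in Production.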
Additional resources
- Google's SRE Book Chapter 4 - Service Level Objectives
- Google's SRE Workbook Chapter 2 - Implementing SLOs
- SRE Fundamentals: SLIs, SLAs and SLOs
- SLOs, SLIs, SLAs, oh my—CRE life lessons
Observability and monitoring
Now that we have service level objectives and can measure them with service level indicators, how do we know if we're meeting them? In fact, how do we know our service is operational at all?
This is where Observability and Monitoring come in.
Observability is tooling or a technical solution that allows teams to actively debug their system. Observability is based on exploring properties and patterns not defined in advance.
Monitoring is tooling or a technical solution that allows teams to watch and understand the state of their systems. Monitoring is based on gathering predefined sets of metrics or logs.
📚 The above definitions are quoted from DevOps measurement: Monitoring and observability.
Essentially, observability includes things such as logs and traces. These tools give you the ability to debug and work through unknowns (things we aren't aware of) when they appear. Monitoring, on the other hand, helps you manage knowns (things we are aware of), and includes things like dashboards, pre-defined metrics, and alarms.
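As a rough illustration (the service name, log fields, and threshold below are all hypothetical, and a real system would use dedicated tracing and metrics tooling), the difference might look like this: structured logs you can explore later versus a pre-defined metric checked against a known threshold.

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("orders-service")  # hypothetical service name

# Observability: emit structured events that can be queried and explored later,
# even to answer questions you didn't think to ask in advance.
def handle_request(order_id: str) -> None:
    start = time.monotonic()
    # ... the actual work would happen here ...
    duration_ms = (time.monotonic() - start) * 1000
    logger.info(json.dumps({
        "event": "order_processed",
        "order_id": order_id,
        "duration_ms": round(duration_ms, 2),
    }))

# Monitoring: watch a pre-defined metric against a known threshold, and alarm on it.
ERROR_RATE_THRESHOLD = 0.01  # hypothetical: alarm if more than 1% of requests fail

def check_error_rate(failed: int, total: int) -> None:
    error_rate = failed / total if total else 0.0
    if error_rate > ERROR_RATE_THRESHOLD:
        logger.warning(json.dumps({"alarm": "HighErrorRate", "error_rate": error_rate}))

handle_request("abc-123")
check_error_rate(failed=12, total=1_000)
```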
Consider the following questions:
- What is your observability strategy?
- What tools are incorporated for observability?
- What is your monitoring strategy?
- What types of dashboards are available for monitoring the service?
- Where are the dashboards?
- What sort of information do they contain?
- How should the information on the dashboards be used?
- What alarms exist?
- Who do they notify? (during and outside of working hours)
- Where are the playbooks for responding to these alarms? (tip - put them in the repo with the service as markdown docs!)
- How does someone know which playbook to use when responding to an alarm? (Ex. Is a link included in the alarm description? One way to keep that mapping honest is sketched after this list.)
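On that last point, a lightweight option (a sketch only - the alarm names and file paths here are made up) is to keep a small alarm-to-playbook catalogue in the repo and check that every alarm points at a playbook that actually exists:

```python
from pathlib import Path

# Hypothetical catalogue mapping alarm names to playbook docs in the repo.
ALARM_PLAYBOOKS = {
    "HighErrorRate": "docs/playbooks/high-error-rate.md",
    "HighLatencyP99": "docs/playbooks/high-latency.md",
}

def missing_playbooks(repo_root: Path = Path(".")) -> list[str]:
    """Return the alarms whose playbook file can't be found in the repo."""
    return [
        alarm
        for alarm, playbook in ALARM_PLAYBOOKS.items()
        if not (repo_root / playbook).is_file()
    ]

if __name__ == "__main__":
    missing = missing_playbooks()
    if missing:
        raise SystemExit(f"Alarms without a playbook: {', '.join(missing)}")
    print("Every alarm has a playbook.")
```

Running a check like this in CI is one way to make sure a new alarm never ships without a corresponding playbook.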
Operational reviews
Once we have observability, monitoring and alarms in place, how do we know that our service is operating as we intend it to, and only as we intend it to?
Monitoring lets us know when things that we expect might happen actually occur. When our service starts behaving outside of what we know to be normal we should get alerts. What about when our service is behaving, or being used, in a way that we didn't predict? Observability allows us to see things that we might not have anticipated or predicted... but we must be looking in order to see them!
This is why it's important to have regularly scheduled operational reviews. We need to check in on the service even when nothing has alerted us that something has gone wrong. We need to analyze data to understand how the service is behaving, and how our consumers are using it.
Consider the following questions:
- How often will an operational review be run?
- Who will lead the operational reviews?
- Who will be involved in your operational reviews?
- What will you look at during your operational reviews?
- What questions might you ask during your operational reviews?
Disaster recovery (DR)
Disasters can strike at any time, and can take on many forms. We need to be prepared and know how to respond when a disaster strikes - such as an AWS service becoming unresponsive in a region, a server going down, or an account becoming compromised from a security breach.
Consider the following questions:
- What else is running where your service operates?
- What sort of redundancies does your service have?
- What other services depend on this service being available?
- What services does this service depend on in order to operate?
- What else might be impacted if your account is compromised?
- What happens if the service stops working, or is compromised in a single region in production?
- What happens if production is compromised?
- How have you mitigated the impact on customers or other teams, should something happen to part or all of this service?
- How would you redeploy the service to a new AWS account if needed?
- How would you redirect traffic from the affected service to a version that's working?
- What does your deployment strategy look like? (Ex. blue/green deployments, rolling updates, etc.)
- Where are the runbooks located to handle these disaster scenarios?
- How do you know the runbooks work?
- How often will you practice recovering from a disaster scenario? (Ex. game days; a simple health-probe sketch you could start from follows this list.)
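As a small starting point for that practice (the endpoints below are hypothetical placeholders, not real URLs), a cross-region health probe might look something like this:

```python
import urllib.request

# Hypothetical per-region health endpoints for the service.
REGION_HEALTH_ENDPOINTS = {
    "us-east-1": "https://orders.us-east-1.example.com/health",
    "eu-west-1": "https://orders.eu-west-1.example.com/health",
}

def probe_regions(timeout_seconds: float = 5.0) -> dict[str, bool]:
    """Return a map of region -> whether its health endpoint responded with HTTP 200."""
    results = {}
    for region, url in REGION_HEALTH_ENDPOINTS.items():
        try:
            with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
                results[region] = response.status == 200
        except OSError:  # covers URLError, HTTPError, timeouts, and connection failures
            results[region] = False
    return results

if __name__ == "__main__":
    for region, healthy in probe_regions().items():
        print(f"{region}: {'healthy' if healthy else 'UNHEALTHY - time to pull out the runbook'}")
```

Pairing a probe like this with your runbooks during a game day helps confirm both that the redundancy works and that the runbooks themselves are up to date.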
Tips and tricks
- There is a lot to think about when writing an Operations Guide, and defining operating procedures for your service. Take it one step at a time - start small and build up!
- Don't worry about getting things perfect the first time. As unknowns become knowns, iterate and improve upon your service.
- Be clear and concise. Provide information in a way that's easy to read, and straight to the point. Make use of tables, bullet points, and other formatting to help.
- Ask for feedback from your team; everyone should be aligned when operating a service.
What have we missed?
After writing your Operations Guide, take a step back and ask your team, "What have we missed?" Reflect and align as a team. Try to picture yourselves debugging a production outage. What else might you need to know?
There is always room for improvement, and as I mentioned earlier, this guide is not exhaustive. So now I ask you...
What have I missed?
Let me know in the comments!