DEV Community: Better Uptime

How To View And Configure Apache Access & Error Logs

Better Uptime — Wed, 19 May 2021 13:29:26 +0000

Introduction

In this tutorial, you will learn everything you need to know about Apache logging to help you troubleshoot and quickly resolve any problem you may encounter on your server. Logging is a very powerful tool that will give you valuable data about all the operations of your servers. You will learn where logs are stored, how to access them, and how to customize log output to fit your needs.

Prerequisites

Apache web server.
Sudo privileges.

Step 1 — Getting To Know Apache Log Types

Apache writes logs of its events in two different log files.

Access Log - In this file, Apache stores information about incoming requests.

Error Log - This file contains information about errors that the web server encountered while processing requests.

Step 2 — Locating Apache Log Files

The location of the log files depends on the operating system on which the Apache web server is running.

Location Of The Access Log

On Debian-based operating systems like Ubuntu, the access log file is located at /var/log/apache2/access.log

On CentOS, the access log file is stored in /var/log/httpd/access_log
A typical access log entry might look like this:

Output:
::1 - - [13/Nov/2020:11:32:22 +0100] "GET / HTTP/1.1" 200 327 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"

Location Of The Error Log

On Debian-based operating systems like Ubuntu, the error log file is located at /var/log/apache2/error.log

On CentOS, the error log file is stored in /var/log/httpd/error_log
A typical error log entry might look like this:

Output:
[Thu May 06 12:03:28.470305 2021] [php7:error] [pid 731] [client ::1:51092] script '/var/www/html/missing.php' not found or unable to stat
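
If you want to process such entries in a script, a regular expression can split them into fields. The following is only a minimal sketch that assumes the error log layout shown above; the names ERROR_LOG and fields are illustrative, and a server with a customized ErrorLogFormat would need a different pattern.

```python
import re
from datetime import datetime

# Pattern for entries shaped like the sample above:
# [time] [module:level] [pid N] [client addr] message
ERROR_LOG = re.compile(
    r'\[(?P<time>[^\]]+)\] \[(?P<module>[^:\]]*):(?P<level>[^\]]+)\] '
    r'\[pid (?P<pid>\d+)[^\]]*\] (?:\[client (?P<client>[^\]]+)\] )?(?P<message>.*)'
)

line = (
    "[Thu May 06 12:03:28.470305 2021] [php7:error] [pid 731] "
    "[client ::1:51092] script '/var/www/html/missing.php' not found or unable to stat"
)

match = ERROR_LOG.match(line)
fields = match.groupdict()

# Day and month names are parsed using the current locale (English by default)
when = datetime.strptime(fields["time"], "%a %b %d %H:%M:%S.%f %Y")

print(fields["module"], fields["level"], fields["pid"])
print(when.isoformat())
print(fields["message"])
```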

Step 3 — Viewing Apache Logs

If you are working on an operating system with a graphical interface, the easiest way to view stored logs is to open the files in a text editor. However, sometimes you need to view the content of the files directly in the terminal. In this case, there are a few ways to do it.

You can use the tail command to view logs in real time:

tail -f /var/log/apache2/access.log

By default, the tail command prints the last 10 lines of the selected file. With the -f option, it keeps running and prints new lines as they are appended to the file.

To view the full content of the file, you can use the cat command:

cat /var/log/apache2/access.log

You may also want to find a specific term in the file. In that case, you can use the grep command:

grep GET /var/log/apache2/access.log

First, specify the term you want to search for, then specify the actual log file. In this case, we are looking for lines in the access log file that contain the term GET.

Step 4 — Configuring Apache Access Logs

In the access log, you can see which pages users are visiting, the status of their requests, and, if you add the relevant format specifier, how long it took to process their requests.

Log Formats

As mentioned earlier, logs are a powerful tool. To use them, you need to understand the format in which they are stored. The format of the access log and the location of the log file are set with the CustomLog directive. This directive can be used in the server configuration file (/etc/apache2/apache2.conf) or in your virtual host entry. Be aware that defining the same CustomLog directive in both files may cause problems.

Common Log Format

The common log format is a standardized text file format used by many web servers. It's popular because it is easy to read and contains just the necessary information. It's defined in the /etc/apache2/apache2.conf configuration file and looks like this:

LogFormat "%h %l %u %t \"%r\" %>s %O" common

The entry in the log file will look like this:

Output:
127.0.0.1 alice Alice [06/May/2021:11:26:42 +0200] "GET / HTTP/1.1" 200 3477
This is the information that the log message contains:

%h - 127.0.0.1 - Hostname or IP address of the client that made the request
%l - alice - Remote logname supplied by identd, if available. If not set, a hyphen (-) is logged instead
%u - Alice - Remote username, if the request was authenticated. If not set, a hyphen (-) is logged instead
%t - [06/May/2021:11:26:42 +0200] - Date and time the request was received
\"%r\" - "GET / HTTP/1.1" - First line of the request
%>s - 200 - Final response status code
%O - 3477 - Size of the response in bytes, including headers
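
The field list above maps naturally onto a regular expression with one named group per field. The following Python sketch assumes the common format exactly as defined above; COMMON_LOG and parse_common are illustrative names, and requests containing escaped quotes are not handled.

```python
import re

# One named group per field of: %h %l %u %t "%r" %>s %O
COMMON_LOG = re.compile(
    r'(?P<host>\S+) (?P<logname>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_common(line):
    """Return a dict of fields, or None if the line does not match."""
    match = COMMON_LOG.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # Some size specifiers write "-" instead of a number; treat that as 0
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    return fields

entry = parse_common(
    '127.0.0.1 alice Alice [06/May/2021:11:26:42 +0200] "GET / HTTP/1.1" 200 3477'
)
print(entry["host"], entry["request"], entry["status"], entry["size"])
```

Lines that do not follow the format return None, so a script can skip or count them instead of crashing.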

Combined Log Format

The combined log format is very similar to the common log format but contains a few extra pieces of information.

It's defined in the /etc/apache2/apache2.conf configuration file and looks like this:

LogFormat "%h %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\"" combined
The entry in the log file will look like this:

Output:
127.0.0.1 alice Alice [06/May/2021:11:18:36 +0200] "GET / HTTP/1.1" 200 3477 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"

These are the extra pieces of information (aside from those present in the common format):

\"%{Referer}i\" - "-" - URL of the referer. A hyphen (-) means that no referer was sent
\"%{User-Agent}i\" - "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36" - Detailed information about the browser of the user that made the request.
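
Because the combined format records the user agent, it can be used to see which clients visit the server most often. This is a rough Python sketch rather than a complete log analyzer; the sample lines, COMBINED_LOG, and count_user_agents are made up for illustration.

```python
import re
from collections import Counter

# Combined format: the common fields followed by "referer" and "user agent"
COMBINED_LOG = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def count_user_agents(lines):
    """Count how many matching log lines each user agent produced."""
    counts = Counter()
    for line in lines:
        match = COMBINED_LOG.match(line)
        if match:
            counts[match.group("agent")] += 1
    return counts

# Illustrative sample lines, not real traffic
sample = [
    '127.0.0.1 - - [06/May/2021:11:18:36 +0200] "GET / HTTP/1.1" 200 3477 "-" "curl/7.68.0"',
    '127.0.0.1 - - [06/May/2021:11:18:40 +0200] "GET /about HTTP/1.1" 200 512 "-" "curl/7.68.0"',
    '::1 - - [06/May/2021:11:19:02 +0200] "GET / HTTP/1.1" 200 3477 "http://localhost/" "ExampleBrowser/1.0"',
]

agents = count_user_agents(sample)
print(agents.most_common(1))
```

To run it against a real log, pass an open file object instead of the sample list, since iterating over a file yields one line at a time.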

Custom Log Format

You can define your own log format in /etc/apache2/apache2.conf using the LogFormat directive, followed by the format of the output and a nickname that will be used as the format identifier.

For this example, we will create a custom log format named custom that will only print the user's browser information. The format will look like this:

LogFormat "%{User-agent}i" custom

In the virtual host file, we will use the CustomLog directive to set the format of the log messages to custom and the log file to the default access log.

CustomLog ${APACHE_LOG_DIR}/access.log custom

Now, we make a request and the Apache server will log the information about the browser that made the request into the access.log file. The log message will look like this:

Output:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36

Logging Into Multiple Files
You can also write log messages into multiple files by using the CustomLog directive more than once. Note that Apache creates a custom log file itself when it starts, but the directory you point it to must already exist.

CustomLog ${APACHE_LOG_DIR}/custom.log custom
CustomLog ${APACHE_LOG_DIR}/access.log common

Step 5 — Configuring Apache Error Logs

The error log contains information about the errors the web server encountered while processing the request. A common error while processing the request is a request for a missing file.

You can choose the file in which error messages are stored using the ErrorLog directive in your virtual host configuration file. This directive takes one argument: the path to the log file. Here is an example from the default virtual host configuration file, /etc/apache2/sites-available/000-default.conf:

ErrorLog ${APACHE_LOG_DIR}/error.log

You can choose a custom file. Apache creates the file on startup, but the directory that contains it must already exist.

In the virtual host configuration file, you can also specify the severity of the messages that will be logged using the LogLevel directive. When this option is set to a specific value, the server ignores messages with a lower severity than the one set in the LogLevel directive. It is not recommended to set it to a level more severe than error.

These are the possible values:

trace1 - trace8 - Trace messages (LOWEST)
debug - messages used for debugging
info - informational messages
notice - notices
warn - warnings
error - errors while processing the request (doesn't require immediate action)
crit - Critical error that requires prompt action
alert - Error that requires immediate action
emerg - System is unusable
You can set the log level using the LogLevel directive like this:

LogLevel info

If the log level is not set, the server will set the log level to warn by default.
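
The filtering rule can be expressed in a few lines of Python. This is a simplified model of the behaviour described above, not Apache's actual implementation; LEVELS and is_logged are illustrative names. It also ignores special cases, such as notice messages, which Apache always writes when logging to a regular file.

```python
# Severity order used by LogLevel, from most to least severe
LEVELS = ["emerg", "alert", "crit", "error", "warn", "notice", "info", "debug"]
LEVELS += [f"trace{n}" for n in range(1, 9)]

def is_logged(message_level, configured_level="warn"):
    """Return True if a message of message_level passes the configured LogLevel."""
    return LEVELS.index(message_level) <= LEVELS.index(configured_level)

# With "LogLevel info", errors are kept and debug messages are dropped
print(is_logged("error", "info"))
print(is_logged("debug", "info"))

# With the default level (warn), informational messages are dropped
print(is_logged("info"))
```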

Conclusion

In this tutorial, you learned what types of logs the Apache web server stores, where to find them, how to read their format, and how to create your own custom log formats. You can now log into multiple files and set the severity level of the messages the server records. At this point, you know what you need to efficiently debug your web application.

You can explore more on Linux logging in the Logtail tutorial library.

Top Open-Source Status Page Tools for 2021

Better Uptime — Wed, 19 May 2021 11:47:43 +0000

Status pages are a must for any online business today. In case of incidents or downtime, status pages provide a modern platform for communication with users.

Now let's have a look at some open source status page tools you can use to build, publish, and maintain your status page and start communicating your downtime in a proper way.

Want to host your status page for free? Read our article on best free and paid hosted status page tools.

No. 1 Open Source Status Page:

Upptime

Upptime allows users to use GitHub Actions to schedule workflows to run automatically in pre-set time intervals. The shortest interval that is allowed is 5 minutes. This means that Upptime checks your website automatically every 5 minutes and reflects your website status on the status page.

Once a day, Upptime generates graphs of the site's response times, so you can easily see and broadcast your long-term stats. Lastly, the Upptime status website offers some customization options, including the option to change the logo, copy, graphs, and more.

Overall Upptime is a very nicely designed tool, with plenty of functionality, customization options, and well-maintained documentation.

Main benefits:

Runs reliably with GitHub Actions
Neat design and loads of customisation options

Explore more on your own:

Upptime demo page
Upptime installation docs

No. 2 Open Source Status Page:

Cachet

Cachet uses Bootstrap 3 to deliver responsive status pages that work well on any device. They offer basic uptime monitors and a great chart dashboard. With their API you can easily set up any metrics you want; be it uptime, error rates, or response times.

There is also an option to schedule maintenance and communicate it easily to users or other stakeholders.

A great benefit to anyone looking for extra security is that Cachet offers two-factor authentication, which is compatible with the Google Authenticator app.

Main benefits:

Ability to show any metric in a chart
Offers two-factor authentication

Explore more on your own:

Cachet demo page
Cachet installation docs

No. 3 Open Source Status Page:

Statping

Statping has slightly more features included in their dashboard compared to Cachet and Upptime. The main benefit of Statping is that it offers status announcements, which come in different color schemes to quickly inform users of the current situation. The 3 main announcements are downtime, update, and resolved messages.

Visually, Statping is also slightly better, as it offers a dedicated chart for each monitored site. These charts include average response time, uptime, and a time picker to allow for detailed exploration of the historical data.

Another major benefit of Statping is its built-in notifiers, which include Slack, Discord, Telegram, webhooks, and email.

For those who don't want to host and maintain their own status page, there is a hosted option as well, which costs $6/month.

Main benefits:

Notification options integrated
Option to go for a hosted version as well

Explore more on your own:

Statping demo page
Statping installation docs

No. 4 Open Source Status Page:

Statusfy

Statusfy is another tool to consider, especially when looking for advanced announcement options. Compared to other tools on this list, Statusfy offers tagging, timestamps, categorization, and timelines for incident and status update announcements. This comes in handy when you need to communicate with your users and want to use the status page as the main way to do so.

On the other hand, Statusfy doesn't have charts, which is a significant downside for anyone looking to broadcast uptime or incident duration data.

The notification options are also quite limited with only basic subscription options via Web Push, iCalendar, and Twitter available.

Main benefits:
Advanced announcement options

Explore more on your own:
Statusfy demo page
Statusfy installation docs

Other open source tools to consider:

If you want to explore more tools, feel free to check this list from Awesome Open Source. Note that some of the status page tools on this list are no longer maintained and might not be suitable for commercial use.

Not sure about open source options?

There are many open source options to get a nice status page. However there are also a few completely free and sufficient hosted solutions that are worth exploring. Let's have a look at 3 status page providers that offer a status page on their free plans:

Better Uptime

Better Uptime combines status pages, uptime monitoring, and incident management into a single beautifully designed product.

Their status page is available to all free users and can even be published on a custom subdomain with HTTPS.

The free plan also offers uptime monitoring with e-mail, Slack, and Microsoft Teams alerts, as well as a basic incident management tool. The paid plans start at $30/month and offer customizable design, e-mail & API subscriptions, and password-protected status pages.

A great feature about Better Uptime is also the embeddable system status notice, which can be used to communicate any incidents directly on the website, without the need to redirect users to the status page.

Overall, Better Uptime offers a great way of getting a status page quickly and for free. The built-in uptime monitoring and incident management come as a plus, which is handy if you want to save money on expensive dedicated uptime monitoring solutions.

Main benefits:

Free status page for all users on custom domain
Uptime monitoring built-in
Embeddable system status notice

Explore more on your own:

Better Uptime pricing
Better Uptime docs

Instatus

Instatus is a new alternative to Atlassian's Statuspage. They offer a free status page with unlimited subscribers and unlimited teams, but the catch is that it is not on a custom domain.

Their paid plan then starts at $20/month and offers the same product but with the option to get it on a custom domain.
Instatus is a very well designed tool that is quite similar to Statuspage and focuses on distinguishing itself mainly with reasonable pricing for smaller teams.

The feature list includes things you would expect, like email subscriptions, scheduled maintenance, or incident templates. Instatus also has an API and integrates with incident management tools like PagerDuty.

Main benefits:

Free status page with unlimited team members on instatus domain
Clean and simple design

Explore more on your own:

Instatus pricing
Instatus docs

Atlassian Statuspage

Statuspage made by Atlassian is the main player on the status page market. Statuspage's free plan offers 100 subscribers, 2 team members, email & Slack notifications, and limited access to their API.

The main limitation of the free plan is that it doesn't offer the ability to have it on a custom domain. For such functionality the hobby plan starting at $29/month is necessary. However, this plan also has severe limitations, when it comes to customization as CSS and HTML can't be changed. With Statuspage this is only possible at the business plan that comes at a staggering $399/month.

Overall, Statuspage provides a great tool, but with a very high price tag. Considering the other free, paid, and open source alternatives, it is up to each team to decide whether it's really worth it.

Main benefits:

Established tool made for enterprise

Explore more on your own:

Atlassian statuspage pricing
Atlassian statuspage docs

What is the difference between open source and paid solutions?

There are two main differences between the open source and paid status pages. The first one is that open source pages are not hosted, while the paid are. The second one is that paid pages provide subscription abilities for both users as well as admins.

There are of course plenty of other differences, like customisability, team access, or integration availability, which are usually provided by the paid solutions but not by the open source ones.

When considering which solution to pick, the hosting and update subscription questions should be answered first. The hosted vs. self-hosted question really depends on your technical capabilities and willingness to set it up.

When it comes to subscription capabilities, it is slightly more complicated. As a rule of thumb, if you have users or customers that rely heavily on your service in their day-to-day operations, you should opt for subscriptions. The reason is that once you set up the status page, you can either subscribe them yourself or ask them to subscribe to status updates. When there is an incident, they will all receive a notification about it, and you don't have to worry about your support channels getting overwhelmed.

If you have an e-commerce site or a hobby project, you can go with an open source tool, as subscriptions are probably not necessary for you. However, please be careful. With a hosted solution (especially one that provides a reasonable uptime SLA), you can rely on it working all the time, but with an open source one, all the responsibility lies with you.

What are the benefits of having a status page?

There are two main benefits of having a status page: Lower support cost and higher customer trust.

The lower support costs will come as a result of users and customers checking your status page and reading your system announcements instead of just directly going on your support page and submitting a ticket.

To achieve this, you will first need a reachable and easy-to-remember URL for your page. The best practice is to go for the status.yourdomain.com format. Since it's used by major companies, many people try this URL by default.

For less tech-savvy people it's recommended to also include a link to your status page on your website or in your product to make sure they can easily reach it. Of course, in case of downtime, this won't be an option and because of that, it's recommended to have a subscription option for your status page users.

A status page subscription lets everyone receive a notification (usually an email) whenever your website goes down.

Once a status page is set up and its existence communicated to users, you can start building trust by being transparent about incidents and communicating them before users even notice them. When this becomes the standard, users will know that if something goes wrong, you will be the first to let them know, which marks a first step towards building trust with your users.

MTTR: Mean time to recovery

Better Uptime — Wed, 20 Jan 2021 15:49:16 +0000

What is Mean time to recovery (MTTR)?

Mean time to recovery (or mean time to restore) is the average time it takes to recover from a product or system failure.

It is an essential metric in incident management as it tells how quickly you solve downtime incidents and get your systems back up and running.

Calculating Mean time to recovery

Time to recovery (TTR) is the full duration of the outage, from the time the system fails to the time it is fully functioning again. The average of all times it took to recover from failures then gives the MTTR for a given system.

MTTR = sum of all time to recovery periods / number of incidents

For example, if a system was down for a total of 20 minutes across 2 separate incidents, the MTTR of that system would be 10 minutes.

10 minutes = 20 minutes / 2 incidents
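
The same calculation can be written as a small Python function. This is a minimal sketch; the function name and the split of the 20 minutes into incidents of 12 and 8 minutes are assumptions made for illustration.

```python
def mean_time_to_recovery(recovery_minutes):
    """Average the recovery time (in minutes) over all incidents."""
    if not recovery_minutes:
        raise ValueError("at least one incident is required")
    return sum(recovery_minutes) / len(recovery_minutes)

# Two incidents adding up to 20 minutes of downtime
mttr = mean_time_to_recovery([12, 8])
print(mttr)  # 10.0
```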

Other meanings of MTTR

MTTR usually stands for mean time to recovery, but it can also represent other KPIs (key performance indicators) in the incident management process. Because of those multiple meanings, it is recommended to use the full names to prevent any misunderstandings. The other possible meanings of MTTR are:

Mean time to respond (MTTR)

Mean time to respond is the average time it takes to recover from a product or service failure from the time the first failure alert is received. The difference between the mean time to recovery and mean time to respond gives the time it takes for an alert to come in.

This metric helps you to see how much time of the recovery period comes down to alerting systems and how much is down to the actual work of the repair team.

Calculating Mean time to respond

The time to respond is the period between the moment an alert is received and the resolution of the incident. The average of all incident response times then gives the mean time to respond.

MTTR = sum of all time to respond periods / number of incidents

Mean time to repair (MTTR)

Mean time to repair is the average time it takes to repair a system. In comparison to mean time to respond, it starts not after an alert is received, but when the incident repairs actually begin.

It is useful when comparing with mean time to respond as it shows how much time the team spends on diagnostics vs. how much they spend on the actual repairs.

Calculating Mean time to repair

The time to repair is the period between the moment the repairs begin and the moment they finish and the system is fully operational again. The average of all incident repair times then gives the mean time to repair.

MTTR = sum of all time to repair periods / number of incidents

Mean time to resolve (MTTR)

Mean time to resolve is the average time it takes to resolve a product or service failure. The resolution is defined as a point in time when the cause of an incident is identified and fixed. This incident resolution prevents similar incidents from occurring in the future.

The mean time to resolve metric gives great insight into the full scope of fixing and resolving incidents, as it goes beyond the downtime and includes the work done after the downtime is over.

Calculating Mean time to resolve

The time to resolve is the period between the moment the incident begins and the resolution of that incident. The average of all incident resolve times then gives the mean time to resolve.

MTTR = sum of all time to resolve periods / number of incidents
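
The four meanings of MTTR differ only in which two timestamps of an incident are compared. The Python sketch below makes that explicit under the definitions given above; the Incident fields and the sample times are hypothetical and would need to be mapped to whatever your incident tool records.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Incident:
    started: datetime         # the system fails
    alerted: datetime         # the first alert is received
    repair_started: datetime  # the repair work begins
    recovered: datetime       # the system is fully functioning again
    resolved: datetime        # the cause is identified and fixed

def mean_minutes(durations):
    """Average a list of timedeltas and return the result in minutes."""
    total = sum(durations, timedelta())
    return total.total_seconds() / 60 / len(durations)

def mttr_report(incidents):
    """Compute all four MTTR variants for a non-empty list of incidents."""
    return {
        "recovery": mean_minutes([i.recovered - i.started for i in incidents]),
        "respond": mean_minutes([i.recovered - i.alerted for i in incidents]),
        "repair": mean_minutes([i.recovered - i.repair_started for i in incidents]),
        "resolve": mean_minutes([i.resolved - i.started for i in incidents]),
    }

t0 = datetime(2021, 1, 20, 12, 0)
incident = Incident(
    started=t0,
    alerted=t0 + timedelta(minutes=2),
    repair_started=t0 + timedelta(minutes=10),
    recovered=t0 + timedelta(minutes=30),
    resolved=t0 + timedelta(minutes=90),
)
report = mttr_report([incident])
print(report)
```

For this single incident, the gap between the recovery and respond values (2 minutes) is the time it took for the alert to come in, and the gap between respond and repair (8 minutes) is the time spent before repairs began.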

Other incident management KPIs

Mean time to acknowledge (MTTA)

Mean time to acknowledge is the average time it takes for the team responsible for the given product or service to acknowledge the incident from when the alert is triggered.

The main use of MTTA is to track team responsiveness and alert system effectiveness. If your team is receiving too many alerts, they might become overwhelmed and get to important alerts later than desirable. This situation is called alert fatigue and is one of the major problems in incident management. Luckily, thanks to MTTA, it can be tracked and assessed before it becomes an issue.

Mean time between failures (MTBF)

Mean time between failures is the average time between one product or system failure and the next. It is a great metric to see how your team is doing in the long term in terms of preventing potential incidents as it gives an overview of system reliability.
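
Both KPIs are simple averages over timestamps. The following Python sketch assumes that MTBF is measured from the start of one failure to the start of the next, as described above; the function names and sample dates are illustrative.

```python
from datetime import datetime, timedelta

def mean_time_to_acknowledge(alerts):
    """alerts: non-empty list of (triggered_at, acknowledged_at) pairs."""
    waits = [acknowledged - triggered for triggered, acknowledged in alerts]
    return sum(waits, timedelta()) / len(waits)

def mean_time_between_failures(failure_times):
    """Average gap between consecutive failure start times."""
    ordered = sorted(failure_times)
    if len(ordered) < 2:
        raise ValueError("at least two failures are required")
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps, timedelta()) / len(gaps)

t = datetime(2021, 1, 20, 9, 0)
mtta = mean_time_to_acknowledge([
    (t, t + timedelta(minutes=4)),
    (t, t + timedelta(minutes=6)),
])
mtbf = mean_time_between_failures([
    datetime(2021, 1, 1), datetime(2021, 1, 11), datetime(2021, 1, 31),
])
print(mtta)  # 0:05:00
print(mtbf)  # 15 days, 0:00:00
```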

How to lower your MTTR?

Use faster monitors: Monitoring for incidents is the first part of any incident resolution process, as it tells your team that something is not working properly. Using monitors with a high check frequency (30 seconds is considered the best practice for general uptime checks) can decrease the time between the moment your users start experiencing downtime and the moment your team gets alerted.
Improve your alerting: Prevent alert fatigue in your team by setting alerts that reflect the importance of the monitored systems. For example, phone call alerts are great for vital systems, but for lower-priority systems, Slack/Microsoft Teams messages or push notifications might be enough. Improving alerting this way can significantly reduce your mean time to acknowledge (MTTA).
Understand your incidents: Improving the information that your team gets in the incident alerts could significantly decrease the time they spend on diagnostics. Quality debugging data like helpful event logs, error screenshots, and system performance graphs can make the diagnostics process noticeably easier.
Automate on-call management process: It is crucial to set up an automated on-call scheduling process integrated into your team members’ calendars. This ensures that the right person, on the right team, in the right time zone, and at the right time, is always alerted.
Create an action plan: Sometimes, the assigned on-call team members might not answer the alert or might not be able to solve it on their own. In those cases, it is important to have a solid action plan for escalating incidents so that they are solved as soon as possible. Smaller organizations often have an ad hoc response process when solving incidents, while enterprises employ rigorous procedures. It is, however, recommended even for smaller teams to create an actionable troubleshooting plan for when incidents happen.
Designate team roles and responsibilities: On-call duties are often a dreaded responsibility. Because of that, it is important to clearly define responsibilities for each team member so that when an incident occurs, everyone knows what they should do.
Don’t underestimate postmortems: Postmortems are often overlooked, as they are only written after everything is back to normal and no immediate action is necessary. But in-depth postmortems and incident analysis can make the difference between solving an incident once and preventing it from ever occurring again.

Incident Management in 2021: from Basics to Best Practices

Better Uptime — Wed, 20 Jan 2021 15:28:32 +0000

Covering the basics

What is incident management?

Incident management is the process used by developer and IT operations teams to respond to system failures (incidents) and restore normal service operation as quickly as possible.

What is an incident?

An incident is a broad term describing any event that causes either a complete disruption or a decrease in the quality of a given service. Incidents usually require an immediate response from the development or operations team, often referred to as the on-call or response team in incident management.

5 parts of the incident management process

1. Monitoring

What is incident monitoring?

Monitoring for incidents is the first part of any incident management process. Monitoring spots problems within the system and verifies that they are indeed being experienced by the end-users. Once a problem has been identified, an incident is created, and, depending on the alerting setup, the relevant team members are notified.

A common example is monitoring accessibility of a company’s homepage. Such automated checking of a specific website is called a monitor. This monitor will automatically check the website every 30 seconds, and if there is a problem and the website becomes unavailable, it will trigger an alert.

An alert is essentially a notification that includes information about the incident, for example, that the website server was overwhelmed, which might suggest an unexpected spike in traffic.

Example of a monitor that checks the availability of google.com.

2. On-call scheduling

What is on-call?

On-call is a practice where designated team members are available to respond to alerts during specific times. Setting up on-call schedules is vital for any incident management as it assures that the correct person will receive the incident alert from the monitor. When someone is ‘on-call’ it means they are the person who will respond to service issues if they arise.

For example, if someone is on-call for the whole of Tuesday, from midnight to midnight, it means that if there is an incident during this time, be it at 2 in the afternoon or 3 in the morning, they have to be ready to respond.

The on-call setup is individual for each organization. However, the goal remains the same: make sure that someone on the team is always ready to fix urgent service issues.

Example of an on-call duty system.

3. Alerting

What is IT alerting?

After the monitor spots an incident, it needs to be passed on to the team that is going to solve it. The incident alerting process ensures that the right person is alerted at the right time and in the right way.

An alert is a notification that is automatically sent to a specific team or team member. It can be in the form of an SMS, phone call, push notification, and more depending on the team’s communication processes.

But alerts are not just plain notifications. They often provide detailed information about the incident that might help the team to find the cause and resolve it faster.

Example of an email incident alert for a spacex.com website.

4. Communication

What is incident communication?

Once an incident happens, it is necessary to communicate it properly with everyone who is affected by it. The response team is automatically notified by the alerting system, but what about other teams inside the company, product users, clients, or potential customers?

In order to communicate with everyone internally and externally, there are several communication channels available. The most common one is a dedicated status page, which shows the current status of the website.

For users and customers, an embedded status widget on the affected page is often put to use. Twitter and other social media are also useful channels for broader incident communication.

Example of a dedicated status page.

5. Response

What is incident response?

Incident response is a process describing how the team collaborates on solving the cause of an incident. This part of the process is very specific to each team as different companies use very different tools and software.

In general, most of the actual troubleshooting will take place within the specific software, which is believed to be the cause of the incident.

The thing that incident responses have in common is that they are all being directed from one centralized tool. In this incident management tool, individual team members communicate with each other and share critical updates. It is also a single source of information as it shows the detailed timeline of the incident as well as all the actions that were taken to solve it.

Example of centralized incident communication with a detailed incident timeline.

5 steps to a bulletproof incident management process

1. Best incident monitoring practices

Any alerts are only as good as the monitoring tool triggering them. The three main things you want to focus on when setting up monitoring solutions are incident verification, check frequency and alert thresholds.

Incident verification

Incident verification is essentially how the tool ensures that the incident is indeed occurring. Proper incident verification ensures that no false positives happen and you don’t get meaningless alerts.

Check frequency

Check frequency is important as it determines how often the monitor checks the desired service. This determines how quickly the potential incidents get spotted and how quickly you get the alert. For example, for uptime checks, the 30-second check frequency is considered to be best practice.

Alert thresholds

Alert thresholds are the conditions under which an alert is triggered. It is vital to set these incident-triggering thresholds realistically so that only real incidents create an alert. A correct threshold setup can ensure that none of the on-call team's time goes to waste.
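
One common way to combine incident verification with an alert threshold is to alert only after several consecutive failed checks. The article does not prescribe a specific method, so the Python sketch below is just one possible approach; should_alert and the default of 3 failures are assumptions chosen for illustration.

```python
def should_alert(check_results, failures_required=3):
    """Decide whether to alert, given check results with the newest last.

    Each result is True if the check passed and False if it failed.
    An alert is raised only when the last `failures_required` checks
    all failed, which filters out one-off glitches.
    """
    if len(check_results) < failures_required:
        return False
    return not any(check_results[-failures_required:])

# A single recovery within the window suppresses the alert
print(should_alert([True, False, True, False]))

# Three failures in a row trigger it
print(should_alert([True, False, False, False]))
```

With a 30-second check frequency and a threshold of 3, an alert would be sent roughly a minute after the first failed check, which is the trade-off between speed and false positives.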

2. Best on-call practices

When it comes to on-call schedules, there is no one-size-fits-all solution. In order to create the most suitable on-call system for your organization, it is important to consider your team size, team locations, individual team members’ abilities, and preferred working hours.

On-call rotation

An on-call rotation is a pre-set repeating on-call schedule. On-call rotations are useful because they eliminate the ad hoc approach and create a system that, once established, repeats throughout the year.

Team size

The first thing to consider when drafting an on-call rotation is the team size. For teams of two, it is common to go with every other day rotations. This means that one person does Monday, Wednesday, Friday, and Sunday and the other one Tuesday, Thursday, and Saturday, with the Sunday duty changing every other week. In the case of larger teams, weekly rotations are a popular practice.
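The every-other-day rotation for a team of two can be sketched in a few lines. The names `Ann` and `Bob` and the function `on_call_for` are hypothetical, and which person takes the first Sunday is an arbitrary choice in this sketch.

```python
from datetime import date


def on_call_for(day, first, second):
    """Return who is on call on the given day in a two-person rotation.

    `first` covers Monday, Wednesday, and Friday, `second` covers Tuesday,
    Thursday, and Saturday, and the Sunday duty alternates every week.
    """
    weekday = day.weekday()  # Monday is 0, Sunday is 6
    if weekday == 6:
        # Sundays have ordinals divisible by 7, so this counter grows by
        # one each week and its parity flips from Sunday to Sunday.
        week_number = day.toordinal() // 7
        return first if week_number % 2 == 0 else second
    return first if weekday % 2 == 0 else second


monday = on_call_for(date(2021, 5, 17), "Ann", "Bob")   # Ann
tuesday = on_call_for(date(2021, 5, 18), "Ann", "Bob")  # Bob
```

Counting weeks from a fixed origin rather than using the calendar week number avoids the same person getting two Sundays in a row at the turn of the year.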

Team locations

When your team is spread across the world, you might be able to mitigate the effects of the dreaded night shifts. With team members in different time zones, a follow-the-sun approach can ensure that most of the on-call time is spent during daylight hours. This creates a better work-life balance for the team members and should be applied when possible.
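A follow-the-sun schedule can be sketched as a lookup from the current UTC hour to the team on duty. The three teams and their shift boundaries below are made up for illustration. They are chosen so that each eight-hour shift falls roughly within local working hours.

```python
from datetime import datetime, timezone

# Hypothetical shifts as (start_hour, end_hour, team), hours in UTC.
SHIFTS = [
    (0, 8, "Sydney team"),
    (8, 16, "Prague team"),
    (16, 24, "San Francisco team"),
]


def on_call_team(moment):
    """Return the team on duty at the given timezone-aware moment."""
    hour = moment.astimezone(timezone.utc).hour
    for start, end, team in SHIFTS:
        if start <= hour < end:
            return team
    raise ValueError("shifts do not cover the full day")


morning_utc = on_call_team(datetime(2021, 5, 19, 9, 0, tzinfo=timezone.utc))
```

Here `morning_utc` is the Prague team, since 09:00 UTC falls in the second shift.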

Individual preferences

Before creating an on-call rotation, it is vital to talk to everyone involved. Differing individual preferences can often help avoid unnecessary compromises. For example, a morning person on the team might prefer a 4:00 AM to 4:00 PM duty, while a night owl might be happy taking the 4:00 PM to 4:00 AM one. This way, both can be relatively happy, and there is no need to force anyone into full-day duty rotations.

Team member abilities

In most cases, not all team members have the same knowledge of the different systems, and sometimes they need help from more senior colleagues or the specific system owners. To make that possible, on-call teams need to define what happens when an incident needs to be escalated to another employee.

Example of monthly on-call schedule for a team of two.

Escalation policies

An escalation policy describes how an on-call team handles incident escalations. An incident is escalated in two cases. The first is when the first responder isn’t able to solve the issue alone and needs assistance from another team member.

The second is when the first on-call person doesn’t acknowledge the incoming alert. This can happen during night shifts when the alert doesn’t wake up the designated team member. The issue is then automatically escalated to another colleague.

Seniority-based escalation

The most basic escalation is calling in a more experienced person. Ideally, all team members should be able to solve incidents on their own, but on rare occasions this is what it takes to ensure that the incident is solved.

Function-based escalation

In some cases, the incident is specific to a system that the first responder is not equipped to handle. The incident then needs to be escalated to a specific colleague who has the required knowledge of that system.

Automatic escalation

Sometimes the first-in-line person doesn’t respond to the alert within a pre-set time. In this case, the incident should be automatically escalated to another team member or, in critical cases, even the whole team.

Example of an automatic escalation policy.
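An automatic escalation policy of this kind can be sketched as a list of time-based steps. The delays of 5 and 15 minutes and the recipient labels are assumptions chosen for illustration. Real policies depend on the team and the severity of the incident.

```python
# Hypothetical policy: (minutes without acknowledgement, who gets notified).
POLICY = [
    (0, ["primary on-call"]),
    (5, ["secondary on-call"]),
    (15, ["entire team"]),
]


def notified_so_far(minutes_unacknowledged, policy=POLICY):
    """Return everyone who should have been notified by now.

    Each step fires once the alert has stayed unacknowledged for at least
    the step's delay, and earlier recipients stay on the list.
    """
    notified = []
    for after_minutes, recipients in policy:
        if minutes_unacknowledged >= after_minutes:
            notified.extend(recipients)
    return notified


at_start = notified_so_far(0)
after_six_minutes = notified_so_far(6)
after_twenty_minutes = notified_so_far(20)
```

Once someone acknowledges the alert, the escalation stops, so the later steps only matter when nobody has responded.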

3. Best incident alerting practices

Compared to on-call schedules, incident alerting rules are less team-specific, and most of them can and should be adopted by all incident response teams. Overall, alerting is successful when your response team gets the minimum necessary number of alarms, with all the necessary information, via the right channel.

Use the right notification channels

There are many different ways to get notified about system downtime. The most common ones are phone calls, SMS, Slack and Microsoft Teams, email, and push notifications. Since some alerts are more important than others, it is necessary to distinguish how on-call teams get notified about incidents with different priority levels.

Phone calls and SMS are a great way to get alerted about critical issues. Slack and email, on the other hand, are preferred for low-priority incidents, which might be informative in nature rather than something needing an immediate fix.

The right notification channels depend on the on-call schedules and on individual team preferences. For example, phone calls might not be useful during an on-call duty in the office, but at home they might be the best option.
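The mapping from priority to channel can be kept as plain configuration. This is a minimal sketch using the two priority levels mentioned above. The priority names and the function `channels_for` are hypothetical.

```python
# Hypothetical routing table: priority level -> notification channels.
CHANNELS_BY_PRIORITY = {
    "critical": ["phone call", "SMS"],
    "low": ["Slack", "email"],
}


def channels_for(priority):
    """Return the channels used for an incident of the given priority."""
    try:
        return CHANNELS_BY_PRIORITY[priority]
    except KeyError:
        # Fail loudly so a misspelled priority never silently drops an alert.
        raise ValueError(f"unknown priority: {priority!r}") from None


critical_channels = channels_for("critical")
low_channels = channels_for("low")
```

Raising an error on an unknown priority is a design choice in this sketch. A team might instead prefer to fall back to the loudest channel.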

De-duplicate and group your alerts

When more significant problems happen, multiple alerts are often triggered. A proper alerting system will automatically de-duplicate those alerts. As a result, related alerts will be grouped into a single one, so no redundant or unactionable alerts reach the on-call team.
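De-duplication can be sketched as grouping alerts that share the same source and arrive close together in time. The alert fields (`service`, `check`, `timestamp`) and the five-minute window are assumptions for illustration. Real alerting systems use their own fingerprinting rules.

```python
def group_alerts(alerts, window_seconds=300):
    """Merge alerts with the same service and check into one group.

    alerts: dicts with 'service', 'check', and 'timestamp' (in seconds).
    An alert joins the latest group for its key if it arrives within
    window_seconds of that group's most recent alert.
    """
    groups = []
    latest_by_key = {}
    for alert in sorted(alerts, key=lambda a: a["timestamp"]):
        key = (alert["service"], alert["check"])
        group = latest_by_key.get(key)
        if group and alert["timestamp"] - group["last_seen"] <= window_seconds:
            group["count"] += 1
            group["last_seen"] = alert["timestamp"]
        else:
            group = {
                "service": alert["service"],
                "check": alert["check"],
                "first_seen": alert["timestamp"],
                "last_seen": alert["timestamp"],
                "count": 1,
            }
            latest_by_key[key] = group
            groups.append(group)
    return groups


raw_alerts = [
    {"service": "api", "check": "http", "timestamp": 0},
    {"service": "db", "check": "ping", "timestamp": 30},
    {"service": "api", "check": "http", "timestamp": 60},
    {"service": "api", "check": "http", "timestamp": 1000},
]
grouped = group_alerts(raw_alerts)
```

The four raw alerts collapse into three groups. The two `api` alerts a minute apart are merged, while the one at 1000 seconds falls outside the window and starts a new group.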

Create actionable alerts

Getting an alert stating that there is a problem is great, but having the insight into how to solve it is equally important. That is why alerts need to include quality debugging data like helpful event logs, error screenshots, and system performance graphs. This extra information can make the diagnostics process noticeably easier.

Example of an actionable alert sent via Slack.

Avoid alert fatigue

Alert fatigue, or alarm fatigue, is a situation where an overwhelming number of alerts received by the on-call team leads to increased response time and, in more severe cases, to missed alerts. The psychological reason behind this is that the more people are exposed to false alerts, the more likely they are to normalize incoming alerts, tolerate them, and neglect or even purposefully ignore them.

By de-duplicating alerts, making them as actionable as possible, and only using the most relevant notification channels, the likelihood of alert fatigue can be significantly reduced. Read more about how to avoid alert fatigue by measuring what matters in our MTTR and incident management KPIs article.

4. Best incident communication practices

Incidents happen, and any modern company must be transparent about them because, if they are communicated properly, the damage can often be mitigated. Communication needs to be as fast as possible, so as soon as the response team confirms that there is an incident, incident resolution should go hand in hand with incident communication.

When it comes to distributing honest and timely incident updates, three major channels are considered best practice.

Create useful status pages

The main incident communication channel for the majority of companies is a dedicated status page. A dedicated status page is a webpage that displays updates about ongoing incidents. When you subscribe to a status page, you automatically receive updates the moment they are posted there by the response team.

Leverage embedded status

The easiest way to communicate incidents to your website visitors or users is via embedded status. This embedded widget appears at the top of the website and shows users the incident details. It is usually clickable and leads to a dedicated status page that provides all the necessary information. Communication via widgets can also be used for incidents that decrease the performance of the system but don’t cause downtime.

Don’t overlook social media

Social media are another way of transparently communicating incidents. Many companies choose Twitter to broadcast downtime. It is also possible to combine social media with previously mentioned channels by integrating updates to your status page.

Stripe’s status profile on Twitter.

5. Best incident response practices

When it comes to actually solving incidents, it is best to remove any unnecessary manual tasks and diagnostic processes. Manual tasks such as gathering information from different sources can be eliminated by using a centralized mission control tool. The diagnosis process can be standardized with an action plan or a runbook.

Since not all team members are experts in all of the systems that might potentially go down, it is best practice to have an action plan that everyone can follow to diagnose the root cause properly.

Have an action plan

When it comes to incident response, all teams should have an action plan of what steps to take in a given scenario. An action plan helps any on-call person assess the given problem and gives the response a relevant course even when the on-call expert is not available.

Centralize mission control

A centralized workplace prevents team members from having to search multiple tools and documents to find the necessary information like contact lists, on-call schedules, or escalation policies.

Centralized mission control also means that a precise timeline of the incident is recorded. This includes critical information such as the steps different team members took to resolve the incident and what was communicated to the public. Having a single source of truth like this prevents repetition of the same tasks and serves well in assessing the KPIs of the incident resolution process.

Don’t underestimate postmortems

Postmortems are often overlooked because they are only written after everything is back to normal and no immediate action is necessary. But in-depth postmortems and incident analysis can make the difference between solving an incident once and preventing it from ever occurring again.