A status page is a communication tool that displays the current working status of your services: fully functional, partially degraded, severely affected, and so on. You define the nomenclature for the service status yourself. On the status page, you can also view and update the uptime and incident history data for all your internal-facing or customer-impacting components. During an outage, you update the status page to tell your teammates or customers about the service status and the resolution activities under way, so that they can assess how the outage affects their own systems and, in turn, communicate it to their stakeholders efficiently.
Status pages are useful because, during an outage, the involvement of several disparate teams alone can add significant complexity to the incident communication process. One way to break this down is to look at the level of transparency required for each data point.
Keeping things simple, there are two broad levels of data transparency:
- Internal Communication (Private Status Page)
- External Communication (Public Status Page)
Internal communication can be further subdivided into two types, which determine the degree of collaboration that goes into resolving an incident and also reflect the culture of the organization.
- Engineering Transparency: Communication that is exclusive to your engineering teams and stays among the members of the incident resolution team who collaborate on resolving an incident. For instance, you might include data such as SLOs, SLIs, logs and traces, which engineering teams familiar with them readily understand. Other examples are a shared knowledge base of runbooks and incident timelines, incident response basics, a glossary, etc.
- Organizational Transparency: The bridge between customers and engineers is usually formed by other teams such as marketing, support and product. It is crucial to keep them informed of any customer-impacting issues. This helps them prepare the external communication that needs to go out to the impacted customers and gives support teams a heads-up. It also gives product teams enough information about the current state of the systems and helps them understand how to tweak or improve the Service Level Objectives (SLOs) set for the impacted services.
External communication is any information that needs to be directly relayed to customers or other external stakeholders. This builds trust between you and your customers.
The most essential information that a customer looks for during an outage is the operational status of your services, the severity of the impact, the impacted dependent services and the steps being taken to resolve the issue. You can make a massive impact on customer experience just by ensuring that your customer has all of this information.
In essence, status pages can be used in different forms for internal or external communication, aligning all teams towards a culture of transparency with your customers and outside stakeholders as well as with your colleagues and peers.
Incident management is always a mix of teams, tools and processes. There are many popular tools for incident alerting and scheduling, but most of them miss a key functionality: incident communication. Incident communication is an often-overlooked part of the incident response process that can positively impact customer experience.
Communication during an incident can be easily overlooked. When there is a fire, most of the focus goes into putting it out rather than informing the various parties. Incident responders find it difficult and distracting to switch between resolution activity and communicating the outage to customers. This is how the ‘external communications liaison’ came about as one of the important roles in incident management. This person communicates all relevant information to the support and other customer-facing teams, and is also the one who posts updates to public status pages.
With companies taking reliability seriously and moving towards having SLAs and SLOs in place, it has become extremely important to have a proactive communication system that lets customers know in advance when something is wrong or might go down, as opposed to waiting for a support ticket to be raised and only then informing them.
Status pages are an effective and flexible way to improve internal and external incident communication. A status page can act as the source of your service reliability data by hosting downtime information and making this information accessible to anyone through various means.
Do not fall into the rabbit hole of building and hosting your own status page. Although this is possible, you will end up spending countless hours getting it to work exactly the way you want. The time, effort and money that go into maintaining and updating your own status page and keeping it reliable are just not worth the outcome. Typically, you’ll need at least a few dedicated engineers just to build and maintain the status page. So, if there’s a service that gives you a ready-made status page and also makes sure that it’s up and running at all times, it’s a win-win!
When it comes to choosing a status page, there are several paid services and even a few basic open source options, but the most important questions to keep in mind are:
- How easily can you set up the status page?
- Does it allow for both public and private hosting options?
- Does it accommodate multiple communication channels?
While most tools cover some of these requirements, none treats the status page as an integral part of the incident response process, allowing seamless communication without context switching between your incident response tool and your status communication tool. We believe that, given the crucial role that internal and external communication play in incident response, status pages should be a core functionality within your incident response tool. This also adds to the ease of use, as the status of each component can be auto-updated based on a predefined mapping from the severities of the incidents triggered for a service.
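To make the idea of a predefined severity-to-status mapping concrete, here is a minimal sketch in Python. It is purely illustrative and not Squadcast's implementation: the severity labels (`minor`, `major`, `critical`), the `SEVERITY_TO_STATUS` table and the `component_status` function are all assumptions made for this example, and the status names are borrowed from the examples earlier in this post. The sketch assumes that a component shows the worst status implied by any of its open incidents.

```python
from enum import IntEnum


class Status(IntEnum):
    """Component statuses, ordered from best to worst."""
    FULLY_FUNCTIONAL = 0
    PARTIALLY_DEGRADED = 1
    SEVERELY_AFFECTED = 2


# Hypothetical mapping: in practice you would define your own
# severity labels and the status each one should trigger.
SEVERITY_TO_STATUS = {
    "minor": Status.PARTIALLY_DEGRADED,
    "major": Status.PARTIALLY_DEGRADED,
    "critical": Status.SEVERELY_AFFECTED,
}


def component_status(open_incident_severities):
    """Return the worst status implied by a component's open incidents.

    With no open incidents, the component is fully functional.
    """
    return max(
        (SEVERITY_TO_STATUS[severity] for severity in open_incident_severities),
        default=Status.FULLY_FUNCTIONAL,
    )


if __name__ == "__main__":
    print(component_status([]).name)
    print(component_status(["minor", "critical"]).name)
```

With a mapping like this, responders never touch the status page by hand: triggering or resolving an incident changes the list of open severities, and the displayed status follows from it.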
We started off with a first version of the status page that includes public and private status pages, with the option to offer email subscriptions and to restrict access exclusively to your team.
Public Status Page: You can configure your public-facing services and their dependent components and show their status in real time directly within Squadcast itself. Customers can subscribe to email updates by entering their contact information on the status page.
Private Status Page: You can expose the status of your internal services privately to other internal teams. You can also check who is working on an incident if the service is facing issues. You can even page teams responsible for specific services.
We intend to add a few more features very soon, such as:
- More channels of subscription communication for the public status page
- Uptime history chart
- Visual representation of operational status
- Linking your Twitter updates feed with your public status page