<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Pagerly</title>
    <description>The latest articles on DEV Community by Pagerly (@pagerlyio).</description>
    <link>https://dev.to/pagerlyio</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F9039%2Fab323ca5-deed-4365-9c44-f55e91ced026.png</url>
      <title>DEV Community: Pagerly</title>
      <link>https://dev.to/pagerlyio</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/pagerlyio"/>
    <language>en</language>
    <item>
      <title>Journey of Streamlining Oncall and Incident Management</title>
      <dc:creator>Falit Jain</dc:creator>
      <pubDate>Fri, 12 Jul 2024 07:30:44 +0000</pubDate>
      <link>https://dev.to/pagerlyio/journey-of-streamlining-oncall-and-incident-management-3043</link>
      <guid>https://dev.to/pagerlyio/journey-of-streamlining-oncall-and-incident-management-3043</guid>
      <description>&lt;p&gt;For many engineering and operations teams, being available when needed is essential to maintaining the dependability and availability of their services; one of their main responsibilities is helping to meet various SLAs. This article discusses the key principles of on-call work, along with real-world examples from industry on how to plan and carry out these tasks for a worldwide team of site reliability engineers (SREs).&lt;/p&gt;

&lt;h2&gt;
  
  
  An overview of the main ideas
&lt;/h2&gt;

&lt;p&gt;An on-call engineer is available to respond to production incidents promptly and can prevent breaches of service level agreements (SLAs) that could have a major negative impact on the organisation. An SRE team usually dedicates at least 25% of its time to being on call; for instance, each engineer might be on call for one week out of every month.&lt;/p&gt;

&lt;p&gt;Historically, SRE teams have approached on-call work as more than merely an engineering challenge. Some of the most difficult aspects are managing workloads, scheduling, and keeping up with technological change. Any organisation must also instil a site reliability engineering culture.&lt;/p&gt;

&lt;p&gt;The following are the essential components that SRE teams must take into account in order to handle on-call shifts successfully.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On-call schedule: creating on-call plans with the right amount of work-life balance&lt;/li&gt;
&lt;li&gt;Shift composition: typical tasks that on-call engineers should complete&lt;/li&gt;
&lt;li&gt;Handoff: issues should be summarised and handed to the person taking the following shift&lt;/li&gt;
&lt;li&gt;Post-mortem meetings: weekly conversation about platform stability-related incidents&lt;/li&gt;
&lt;li&gt;Escalation plans: an effective escalation flow with turnaround deadlines&lt;/li&gt;
&lt;li&gt;Paging optimisation: creating effective pager policies&lt;/li&gt;
&lt;li&gt;Runbook upkeep: a synopsis that serves as a "Swiss army knife" for SREs who are on call&lt;/li&gt;
&lt;li&gt;Change management: coordinating the introduction of platform modifications&lt;/li&gt;
&lt;li&gt;Training and documentation: establishing documentation for training both new and current SREs and onboarding additional team members&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Scheduling for on-call and using Slack
&lt;/h2&gt;

&lt;p&gt;Traditionally, SRE teams have maintained complicated distributed software systems, which may be deployed across a number of data centres worldwide. Teams can use tools like &lt;a href="https://pagerly.io" rel="noopener noreferrer"&gt;Pagerly&lt;/a&gt; to create on-call schedules in Slack.&lt;/p&gt;

&lt;p&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxlfit49f2966dz3v1udu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxlfit49f2966dz3v1udu.png" alt="Image description" width="800" height="596"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  On-call handover
&lt;/h2&gt;

&lt;p&gt;A "handover" procedure is required at the conclusion of every shift, during which the team taking over on-call responsibilities is informed about on-call matters as well as other pertinent matters. For a total of five consecutive working days, this cycle is repeated. If traffic is less than it is during the week, SRE on-call staff could work one shift per weekend. An extra day off the following week or, if they would choose, cash payment should be given to this person as compensation.&lt;/p&gt;

&lt;p&gt;‍&lt;/p&gt;

&lt;p&gt;While most SRE teams have the aforementioned structure, some are set up differently, with extremely small satellite teams supplementing a disproportionately big centralised office. If the duties are distributed among several regions in that case, the small satellite teams could feel overburdened and excluded, which could eventually lead to demoralisation and a high turnover rate. The expense of having to deal with on-call concerns outside of regular business hours is then seen as justified by having complete ownership of responsibilities at a single place.&lt;/p&gt;

&lt;p&gt;‍&lt;/p&gt;

&lt;p&gt;Arranging a rotation with a team from a particular region&lt;br&gt;
If the company does not have a multi-region team, the on-call schedule can be created by dividing the year into quarters: January to March, April to June, July to September, and October to December. One of these quarters is allocated to the current team, and the night-shift workload is rotated every three months.&lt;/p&gt;
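&lt;p&gt;As a hypothetical illustration (the format below is not any particular tool's schema), such a quarterly rotation could be written down as a simple schedule definition that the team reviews at the start of each quarter.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Illustrative quarterly on-call rotation for a single-region team
rotation: sre-night-shift
timezone: UTC
quarters:
  - period: Jan-Mar
    team: team-alpha
  - period: Apr-Jun
    team: team-bravo
  - period: Jul-Sep
    team: team-charlie
  - period: Oct-Dec
    team: team-delta
handover:
  # Summary of open issues is posted to the shared channel at each shift change
  channel: "#sre-oncall-handover"
&lt;/code&gt;&lt;/pre&gt;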


&lt;p&gt;A timetable like this supports the human sleep cycle: a well-structured team that rotates every three months is far less strenuous on people's schedules than being on call a few days per week.&lt;/p&gt;

&lt;p&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiw8wv6a6ntpepotr7vz5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiw8wv6a6ntpepotr7vz5.png" alt="Image description" width="800" height="664"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Management of PTO and vacation time
&lt;/h2&gt;

&lt;p&gt;Managing the personal time off (PTO) plan is essential, since the availability and dependability of the platform as a whole depend on the SRE role. On-call support needs to take precedence over development work, and the team should be large enough for those who are not on call to cover for absentees.&lt;/p&gt;

&lt;p&gt;There are local holidays specific to each region, such as Thanksgiving in the USA and Diwali in India. Globally distributed SRE teams should be allowed to swap shifts at these times. Holidays common around the world, such as New Year's Day, should be handled like weekends, with minimal staffing and a relaxed pager response time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.pagerly.io/blog/navigating-on-call-compensation-in-the-tech-industry-in-2023#:~:text=Additionally%2C%20compensation%20for%20oncall%20work,their%20expectations%20and%20financial%20needs." rel="noopener noreferrer"&gt;Here is a blog about on-call compensation and ways to set up rotations&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Shift composition
&lt;/h2&gt;

&lt;p&gt;Every shift begins with a briefing given to the on-call engineer about the major events, observations, and any outstanding problems that need to be fixed as part of the handover from the previous shift. Next, the SRE opens the command line terminal, monitoring consoles, dashboards, and ticket queue in preparation for the on-call session.&lt;/p&gt;


&lt;h2&gt;
  
  
  Alerts with Slack and Teams
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fieun5975ycgctxpqi8cp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fieun5975ycgctxpqi8cp.png" alt="Image description" width="800" height="596"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Based on metrics, the SRE and development teams identify SLIs and produce alerts. Event-based monitoring systems can also be set up to send alerts based on events in addition to metrics. Consider the following scenario: the engineering team and the SREs (during their off-call development time) decide to use the metric cassandra_threadpools_activetasks as an SLI to track the performance of the production Cassandra cluster. In this case, the SRE can configure the alert in the Prometheus alerting rules YAML file. Annotations such as the one in the sample below can be used to publish alerts; this is one way to interface with contemporary incident response management systems.&lt;/p&gt;
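&lt;p&gt;A minimal sketch of such an alerting rule follows; the threshold, alert name, and annotation text are illustrative assumptions rather than values from a real production configuration.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;groups:
  - name: cassandra-sli
    rules:
      - alert: CassandraActiveTasksHigh
        # SLI: active tasks across the Cassandra thread pools
        expr: cassandra_threadpools_activetasks &amp;gt; 100
        for: 5m
        labels:
          severity: page
          team: sre
        annotations:
          # This annotation is what the incident response platform shows the on-call SRE
          summary: "Cassandra active task count is {{ $value }} on {{ $labels.instance }}"
          runbook_url: "https://example.com/runbooks/cassandra-active-tasks"
&lt;/code&gt;&lt;/pre&gt;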


&lt;p&gt;When the alert condition is satisfied, the Prometheus Alertmanager forwards the alert to the incident response management platform. The on-call SRE then needs to examine the dashboard's active task count closely and ascertain the cause of the excessive task count by delving into the metrics. Once the cause has been identified, corrective action must be taken.&lt;/p&gt;
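&lt;p&gt;As a rough sketch of that forwarding step, an Alertmanager route can send page-severity alerts to an incident response platform via a webhook receiver; the receiver names and URLs below are placeholders, not details from the article.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;route:
  receiver: incident-platform
  group_by: ['alertname', 'team']
  # Only page-severity alerts wake the on-call SRE; lower-severity alerts become tickets
  routes:
    - matchers:
        - severity = "ticket"
      receiver: ticket-queue

receivers:
  - name: incident-platform
    webhook_configs:
      # Placeholder webhook endpoint for the incident response / on-call platform
      - url: https://incident-platform.example.com/api/v1/alerts
        send_resolved: true
  - name: ticket-queue
    webhook_configs:
      # Placeholder webhook endpoint for the ticketing system
      - url: https://ticketing.example.com/api/v1/issues
        send_resolved: true
&lt;/code&gt;&lt;/pre&gt;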


&lt;p&gt;Integrating these alerting systems with a ticketing system, such as Atlassian's Jira, is recommended across the organisation. Each alert should automatically generate a ticket, and the SRE and any other stakeholders who responded to the alert must update the ticket once it has been handled.&lt;/p&gt;

&lt;h2&gt;
  
  
  Troubleshooting
&lt;/h2&gt;

&lt;p&gt;At the start of the on-call shift, the SRE should have a terminal console available to use SSH or any other CLI tools provided by the organisation. The engineering or customer service staff may contact the on-call SRE for assistance with a technical problem. Assume, for instance, that a user's action on the platform, such as hitting the cart's checkout button, produces a distinct request ID. This request passes through the distributed system's various components, such as the load balancing service, compute service, and database service. When the SRE receives a request ID, they are expected to provide information about its life cycle, such as the machines and components that logged the request.&lt;/p&gt;

&lt;p&gt;An SRE might need to look into network problems if the cause isn't immediately evident, for example when the request ID mentioned above wasn't logged by any service. The SRE can use tcpdump or Wireshark, two open-source packet analysis tools, to capture packets and rule out network-related problems. For this laborious task, the SRE may enlist the network team's assistance in examining the packets.&lt;/p&gt;

&lt;p&gt;When troubleshooting such challenges, the cumulative knowledge gathered from the varied backgrounds of SREs in a given team would undoubtedly be helpful. As part of the onboarding process, all of these actions ought to be recorded and subsequently utilised for training new SREs. &lt;/p&gt;


&lt;h2&gt;
  
  
  Deployments
&lt;/h2&gt;

&lt;p&gt;The on-call SRE should be in charge of the deployment procedure during business hours. If something goes wrong, the SRE needs to be able to resolve the problem and roll the change back or forward. The SRE should only perform emergency deployments that affect production, and should have sufficient knowledge to help the development team prevent any negative effects on production. Deployment procedures should be thoroughly documented and closely integrated with the change management process.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ticket administration (Jira, Slack, JSM, PagerDuty)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flk4d7uht804zwl4jypfh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flk4d7uht804zwl4jypfh.png" alt="Image description" width="800" height="504"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;SREs keep a close eye on the tickets that are being routed through their queues and are awaiting action. These tickets have a lesser priority than those created by ongoing production issues since they were either generated by the alerting programme or escalated by other teams.&lt;/p&gt;


&lt;p&gt;It is standard procedure for every ticket to record the SRE actions that have been completed. If no action can be taken on a ticket, it must be escalated to the appropriate team. When one SRE shift hands over to the next, the queue should ideally be empty. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.pagerly.io/workflow/jira-slack-2-way-sync" rel="noopener noreferrer"&gt;Pagerly can help with tracking tickets in tools like Jira, PagerDuty, and Opsgenie&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Encouraging escalation
&lt;/h2&gt;

&lt;p&gt;Issues that escape monitoring affect on-call SREs the most. If the monitoring and alerting software fails to notify them, or an issue goes undetected until a customer reports it, the on-call SRE must use all of their skills to resolve it swiftly, enlisting the relevant development teams to help.&lt;/p&gt;

&lt;p&gt;After the problem is fixed, all pertinent stakeholders should be given a ticket with comprehensive incident documentation, a summary of the actions taken to fix the problem, and an explanation of what could be automated and alerted on. This ticket should be prioritised above all other development work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Handover conventions
&lt;/h2&gt;

&lt;p&gt;During every transition of SRE responsibilities, certain conventions must be observed to enable a seamless handover of on-call responsibilities to the next person. One such convention is giving the next on-call engineer a summary packet. Every ticket in the shift has to be updated and documented with the steps taken to resolve it, any additional comments or queries from the SRE staff, and any clarifications from the development team. These tickets ought to be categorised according to how they affect the dependability and availability of the platform. Tickets that affect production should be marked and categorised, particularly for the post-mortem. Once compiled and classified by the on-call SRE, this list of tickets should be posted to a shared communication channel for handover. The team may decide to use a dedicated Slack channel, Microsoft Teams, or the collaborative features found in incident response platforms like &lt;a href="https://pagerly.io" rel="noopener noreferrer"&gt;Pagerly&lt;/a&gt; for this. The entire SRE organisation should be able to access this summary. &lt;/p&gt;


&lt;h2&gt;
  
  
  Post-mortem meeting
&lt;/h2&gt;

&lt;p&gt;Every week, all of the engineering leads and SREs attend this meeting to review the incidents from recent on-call shifts.&lt;/p&gt;

&lt;p&gt;The post-mortem's most crucial outcome is making sure the problems highlighted don't happen again. This is not an easy task: the corrective actions could involve anything as basic as creating a script or introducing more code checks, or something as complex as redesigning the entire application. To get as near as possible to the objective of preventing recurring issues, SRE and development teams should collaborate closely. &lt;/p&gt;

&lt;h2&gt;
  
  
  Create an escalation strategy
&lt;/h2&gt;

&lt;p&gt;Every time a problem arises, a ticket with the necessary action items, documentation, and feature requests needs to be filed. It is frequently unclear right away which team should handle a specific ticket. Pagerly's routing and tagging features make it possible to automate this step, which helps to overcome this obstacle. A ticket may begin as a customer service ticket and, based on the type and severity of the event, be escalated to the engineering or SRE teams. The ticket is returned to customer service after the relevant team has addressed it; the resolution may involve communicating with the customer or simply documenting and closing the issue. This ticket will also be reviewed in the post-mortem for additional analysis and will be part of the handoff procedure. To lessen alert fatigue, it is recommended practice to classify the issue accurately and assign it to the team with the most relevant expertise.&lt;/p&gt;

&lt;h2&gt;
  
  
  Optimising paging
&lt;/h2&gt;

&lt;p&gt;Members of on-call engineering teams are paged often. To minimise the frequency of paging, pages must be carefully targeted at the right teams. When a customer-facing problem warrants a page, the customer service team is typically the first point of contact; from there, the investigation and escalation procedures covered in the previous section must be followed.&lt;/p&gt;

&lt;p&gt;Every alert needs a written plan and resolution procedure. When a client reports a problem, for instance, the SRE should check whether any alerts are currently firing and determine whether they are connected in any way to the customer's problem.&lt;/p&gt;

&lt;p&gt;The aim should be to restrict pages to the SRE to genuine issues. When it is impossible to decide which team to page, a central channel of communication must be established so that SRE, engineering, customer service, and other team members can participate, discuss, and address the problem. &lt;/p&gt;

&lt;h2&gt;
  
  
  Runbook upkeep
&lt;/h2&gt;

&lt;p&gt;A runbook is a summary of the precise actions—including commands—that an SRE engineer has to take in order to address a specific incident, as opposed to a cheat sheet that lists every command for a certain platform (like Kubernetes or Elasticsearch). When solving difficulties, time is of the essence. If the engineer on call is well-versed in the subject and has a list of instructions and action items at hand, they can implement solutions much more rapidly.&lt;/p&gt;


&lt;p&gt;As a team, you may administer a single runbook centrally, or each engineer may prefer to maintain their own. A sketch of what a runbook entry can look like is shown below.&lt;/p&gt;
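&lt;p&gt;This is a minimal, hypothetical runbook entry; the alert name, service, commands, and escalation target are illustrative assumptions, not content from a real runbook.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Illustrative runbook entry for one alert
alert: CassandraActiveTasksHigh
service: cassandra-production
severity: page
diagnosis:
  - description: Check active and pending tasks per thread pool
    command: nodetool tpstats
  - description: Look for long GC pauses in the Cassandra system log
    command: grep -i "GCInspector" /var/log/cassandra/system.log | tail -n 50
mitigation:
  - description: If a single node is saturated, drain it and let traffic rebalance
    command: nodetool drain
escalation:
  - after: 30m
    contact: "#cassandra-dev"
&lt;/code&gt;&lt;/pre&gt;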


&lt;p&gt;One could argue that many of the tasks in a runbook could be automated, carried out with Jenkins, or built as separate continuous integration jobs. Although that is the ideal scenario, not all SRE teams are experienced enough to carry it out flawlessly, and an effective runbook is also a helpful manual for automating tasks later. An engineer will always need the command line, so anything that must be regularly entered into a terminal is a perfect candidate for a runbook. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.pagerly.io/workflow/annonate-alerts-on-slack" rel="noopener noreferrer"&gt;Pagerly has a workflow that helps annotate each alert and keep the runbook updated&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqhr8zov52pscinkj3lxo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqhr8zov52pscinkj3lxo.png" alt="Image description" width="800" height="596"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Concluding remarks
&lt;/h2&gt;

&lt;p&gt;Creating an on-call rotation plan that works and is sustainable requires careful thought. The key elements include planning shift changes and scheduling, creating an escalation mechanism, and holding post-mortem analysis meetings after every incident. Pagerly is built to plan, carry out, monitor, and automate duties related to standard site reliability engineering responsibilities, such as on-call rotations.&lt;/p&gt;

</description>
      <category>oncall</category>
      <category>devops</category>
      <category>incident</category>
      <category>sre</category>
    </item>
    <item>
      <title>Developing more efficient on-call procedures for engineering teams</title>
      <dc:creator>Harshita Jain</dc:creator>
      <pubDate>Mon, 08 Jul 2024 08:37:31 +0000</pubDate>
      <link>https://dev.to/pagerlyio/developing-more-efficient-on-call-procedures-for-engineering-teams-28li</link>
      <guid>https://dev.to/pagerlyio/developing-more-efficient-on-call-procedures-for-engineering-teams-28li</guid>
      <description>&lt;p&gt;On-call discussions are typically greeted with complaints. Here's how you change the story &lt;br&gt;
and create a more sympathetic and resilient procedure.&lt;/p&gt;

&lt;p&gt;I assumed responsibility for an important piece of our infrastructure a few years ago. It was a complicated project that could have caused significant outages and widespread business disruption if it hadn't been carefully managed and controlled. These kinds of projects need a strong team of engineers who are available for emergencies during off-peak hours, along with well-equipped engineers who understand the systems and their dependencies.&lt;/p&gt;

&lt;p&gt;In my experience, I encountered a well-meaning team that was unprepared for the unpredictable nature of on-call work. My goal was to transform on-call from a perpetual source of stress into a model of steadiness.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Being on call can be stressful&lt;/strong&gt;&lt;br&gt;
On-call responsibilities are a major socio-technical difficulty and are frequently one of the most stressful parts of a developer's career, according to Honeycomb CTO Charity Majors. This stems from some fundamental issues that can make being on call especially challenging:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unpredictable work schedule:&lt;/strong&gt; To be on call, engineers must be available to respond to crises after hours. Personal time is interrupted, and work-life balance becomes erratic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;High stakes and high pressure:&lt;/strong&gt; Being on call means having to handle failures that could have a big impact on business operations. There may be a lot of strain because of the stakes. Taking care of difficult problems by yourself after work also adds to a feeling of loneliness.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lack of preparation:&lt;/strong&gt; Engineers will feel ill-prepared to deal with any problems they may face if they do not receive the necessary training, preparation, and experience. This intensifies the problem by causing worry and the fear of making the wrong choices.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Alert fatigue:&lt;/strong&gt; Engineers may get desensitised to frequent, non-critical alarms, which may make it harder for them to identify serious problems. This may result in more stress, delayed reaction times, missing important alarms, and weakened system dependability. This leads to general job discontent as well as impairs their capacity to act promptly when a real problem does arise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why are on-call procedures necessary?&lt;/strong&gt;&lt;br&gt;
Being on call is crucial for preserving the robustness and health of production systems, notwithstanding its difficulties. Your services must be covered by someone during off-peak hours.&lt;/p&gt;

&lt;p&gt;As an on-call engineer, being near production systems is crucial since it guarantees:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Business continuity:&lt;/strong&gt; Quick response times are essential to reduce downtime and lessen the impact on end users and business operations. They are the difference between a small hiccup and a significant outage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deep comprehension of systems:&lt;/strong&gt; Being fully integrated into the system enables you to fully appreciate the subtleties of the production environment. Creating a culture where team members prioritise performance optimisation and problem prevention above just responding to them is beneficial. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Enhancing soft skills:&lt;/strong&gt; Engineers who are on call are required to have a wide range of abilities, such as crisis management, rapid decision-making, and effective communication. These abilities are beneficial for job advancement in a wider range of professional settings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Adopting a resilient on-call structure&lt;/strong&gt;&lt;br&gt;
The reality check for those on call&lt;br&gt;
When my team first started adjusting to the on-call system, we encountered a patchwork of temporary solutions and a clear lack of confidence during deployments. Despite their resilience, the engineers worked in isolation; every night spent on call was a solitary (mis)adventure. It was also evident that, given the fragility of our system, a big outage would occur soon if greater planning and support were not provided. We were dealing with a ticking time bomb.&lt;/p&gt;

&lt;p&gt;Small, basic foundational steps marked the beginning of our metamorphosis. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Putting together a pre-on-call checklist&lt;/strong&gt;&lt;br&gt;
Using a pre-on-call checklist is an easy yet effective approach to ensure that engineers have completed all required tasks prior to starting their shift. It reduces the possibility of being unprepared, guards against mistakes, and encourages a proactive approach to incident management. The following categories, each with specific tasks underneath, made up the list we created:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Squad-level instruction&lt;/strong&gt;&lt;br&gt;
Instruction tailored to the squad's on-call duties. Role-specific assignments, standard on-call protocols, getting acquainted with the required tools and architectures, and simulation exercises are all included in this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Onboarding records&lt;/strong&gt;&lt;br&gt;
Have precise documentation of the roles and responsibilities involved in the on-call procedure. Ideally, this should include thorough descriptions of each function, incident handling protocols, escalation routes, and important contacts. We also need to confirm that this documentation is updated frequently and is easily accessible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Guidelines for responding to incidents&lt;/strong&gt;&lt;br&gt;
Rules for responding to and managing incidents. These must be thorough, possibly encompassing escalation protocols, organisational policies, and training programmes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Timetable for on-call&lt;/strong&gt;&lt;br&gt;
Give specifics regarding the on-call rotation, such as who is available and when. Additionally, make sure that each engineer's calendar has all scheduled shifts added to it using the appropriate scheduling tools. This keeps everyone ready and informed.  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Tools and access for on-call situations&lt;/strong&gt;&lt;br&gt;
Ensure that engineers have access to all tools and levels of access needed for on-call activities. This could involve keeping an eye on OpsGenie, AWS queues, and dashboards.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Routes of communication&lt;/strong&gt;&lt;br&gt;
Registering for relevant Google Groups and Slack channels. This could include department-specific incident channels or channels where information about incidents is shared or discussed. There may even be a #all-incidents channel in your organisation. Join the channels for your organisation's stability groups, cross-functional updates (like #marketing-updates) that may affect your services, and any other channels that offer important information that may affect users. Real-time communication during crises can also be facilitated by setting up a dedicated channel for on-call coordination with all of the participants.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Runbooks and manuals for troubleshooting&lt;/strong&gt;&lt;br&gt;
Ensure that every team member has access to the pertinent runbooks and that they are routinely reviewed to become familiar with frequent problems and their fixes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- The postmortem procedure&lt;/strong&gt;&lt;br&gt;
Create or strengthen an analysis and learning process for incidents. Post-mortem meetings, where incidents are examined in detail to determine the underlying causes and opportunities for change, should ideally be a part of this process. It is imperative for engineers to adopt a blameless approach, prioritising learning and development over placing blame.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Contacts for emergencies&lt;/strong&gt;&lt;br&gt;
Make certain that a list of emergency contacts is available to every member of the on-call team. It is imperative that the team understands when and how to use these contacts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Regular evaluation&lt;/strong&gt;&lt;br&gt;
Attend your squad's or organisation's operations, stability, and on-call process review meetings. For ongoing improvement, these discussions ought to happen on a regular basis, for example every two weeks or every month. Frequent reviews support efficient operations and help improve the on-call procedure.&lt;/p&gt;

&lt;p&gt;We added the engineers to the rotation once they were comfortable using the pre-on-call checklist. Although they were aware of the difficulties in theory, we cautioned them that managing an actual crisis might be more difficult and involved.&lt;/p&gt;

&lt;p&gt;Make sure that engineers are correctly included in any on-call compensation system that your company maintains. A good resource for different on-call payment approaches is &lt;a href="https://www.pagerly.io/blog/navigating-on-call-compensation-in-the-tech-industry-in-2023" rel="noopener noreferrer"&gt;Pagerly&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Wheel of misfortune: Practice role-playing exercises&lt;/strong&gt;&lt;br&gt;
We presented a role-playing game called "wheel of misfortune," which was modelled after Google's Site Reliability Engineering (SRE) methodology. The concept is straightforward: we mimic service outages to evaluate and enhance on-call engineers' reaction times in a safe setting. Some of these simulations were carried out to assist teams in becoming better equipped to handle actual emergencies.&lt;br&gt;
According to a 2019 post by Google Cloud systems engineer Jesus Climent, "If you have played any role-playing game, you probably already know how it works: a leader such as the Dungeon Master, or DM, runs a scenario where some non-player characters get into a situation (in our case, a production emergency) and interact with the players, who are the people playing the role of the emergency responders."&lt;/p&gt;

&lt;p&gt;It's a good technique to make sure engineers are capable of handling important but uncommon occurrences with assurance and skill.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Using data to improve on-call preparation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Data verification&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The next thing to do was to make sure we had the proper data for useful dashboards and alerts. This was a crucial step, because proper monitoring and incident response depend on having pertinent and accurate data. We also carefully examined whether our initial perceptions of what constitutes "good" dashboards and alerts were accurate. Validating data quality and relevance ensures that alerts are trustworthy indicators of real problems and prevents false positives.&lt;/p&gt;

&lt;p&gt;Capturing and interpreting critical system performance and business metrics from incidents is essential to an effective on-call system. We made sure we had the appropriate metrics for every service, recorded them regularly, and kept a dashboard for quick overviews.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Improving dashboards and notifications&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In order to address alert fatigue, which is a prevalent problem in on-call systems, we redesigned our monitoring procedures:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Review of alerts:&lt;/strong&gt; To cut down on superfluous noise and fine-tune alert sensitivity, we established routine reviews of false positives. The primary goal was to avoid the desensitisation that can result from receiving numerous, pointless pages.&lt;br&gt;
&lt;strong&gt;Review of the dashboards:&lt;/strong&gt; We started prioritising dashboards to determine which are the most beneficial overall, with the best widgets, access, and visibility. Some important questions we asked were: Is the dashboard readable when it needs to be? Is it practical? Is it actually used? Adrian Howard provided a set of excellent questions in a recent blog post that can be used to further filter your dashboards in an organised manner.&lt;br&gt;
&lt;strong&gt;Using golden signals:&lt;/strong&gt; We monitored traffic, error rates, and key business indicators using our golden signals dashboard. The four golden signals are latency, traffic, errors, and saturation: latency measures response times, traffic tracks demand, errors track unsuccessful requests, and saturation captures resource utilisation. This makes it easier for on-call engineers to evaluate the system's health and understand the effects of any modifications or problems.&lt;/p&gt;
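&lt;p&gt;As a rough sketch of what can back such a dashboard, the four golden signals are often kept as Prometheus recording rules; the metric names and label conventions below are generic placeholders rather than the ones our team actually used.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;groups:
  - name: golden-signals
    rules:
      # Traffic: requests per second per service
      - record: service:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (service)
      # Errors: fraction of requests returning 5xx
      - record: service:http_errors:ratio5m
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service)
      # Latency: 99th percentile request duration in seconds
      - record: service:http_latency_seconds:p99_5m
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le))
      # Saturation: average CPU utilisation per service
      - record: service:cpu_utilisation:avg5m
        expr: avg(rate(container_cpu_usage_seconds_total[5m])) by (service)
&lt;/code&gt;&lt;/pre&gt;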

&lt;p&gt;&lt;strong&gt;Step 3: Raising awareness&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After the team had a firm understanding of our deployments, flow, and health data, we started looking for further areas that might be optimised.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Understanding and keeping an eye on your dependencies:&lt;/strong&gt; For incident response, it's critical to understand both your upstream and downstream dependencies. We began by drawing architecture diagrams to help everyone comprehend the flow of information. These diagrams made the interactions between various components clear, for example services A, B, and C, where service A depends on service B and service B depends on service C. After that, we put up dashboards to keep an eye on these dependencies, which enables us to promptly spot and handle any odd behaviour and makes it easier to keep track of the interdependencies within the system.&lt;br&gt;
&lt;strong&gt;Communication about dependencies:&lt;/strong&gt; We made sure that every engineer understood how to get in touch with the teams and services that we rely on in the event of an emergency.&lt;br&gt;
&lt;strong&gt;Runbooks:&lt;/strong&gt; We improved our runbooks by making them more accessible. Each runbook had sufficient information and sample scenarios to allow knowledgeable on-call engineers to dive right into problem-solving. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stress reduction for on-call&lt;/strong&gt;&lt;br&gt;
As soon as things settled into a routine, we began to optimise the areas that required more work or had gaps in our understanding. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Handover ritualization:&lt;/strong&gt; We made the handover procedure a ritual. The exiting engineer formally reviews the system status, existing issues, recent occurrences, and potential risks before the start of the incoming engineer's on-call shift. This ensures the new engineer is thoroughly prepared for their shift.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Honouring on-call accomplishments:&lt;/strong&gt; We started honouring on-call accomplishments. We recognised the difficulties encountered and the successes attained in every post-on-call review. This may involve anything from proactively resolving an issue before it got out of hand to managing a serious incident.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Constant development and information exchange&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Frequent post-incident evaluations:&lt;/strong&gt; In order to grow and learn from each event, we conducted impartial, frequent post-incident reviews. With the involved on-call team, each session included a root-cause analysis that prioritised understanding problems above placing blame.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monthly operations review meetings:&lt;/strong&gt; We hold monthly operational review meetings in order to keep our on-call procedures up to date and improved. These meetings' agenda items included:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Going through earlier action items&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;On-call rotation status:&lt;/strong&gt; evaluating the effectiveness of the current on-call system. This includes adding new members to the team rotation, evaluating the group's performance, confirming that all shifts are covered, and estimating the workload.&lt;br&gt;
&lt;strong&gt;Recent incidents:&lt;/strong&gt; Talk about any incidents that have happened since the previous meeting.&lt;br&gt;
&lt;strong&gt;Reviews of alerts:&lt;/strong&gt; Examine reports on mean time to recovery (MTTR). &lt;br&gt;
&lt;strong&gt;Lessons learned:&lt;/strong&gt; Share observations gleaned from recent incidents or alerts.&lt;br&gt;
&lt;strong&gt;Pain points:&lt;/strong&gt; Discuss the difficulties the group is facing and come up with ways to resolve them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Knowledge base and documentation:&lt;/strong&gt; Playbooks, runbooks, and documentation are all included in our knowledge base. It is updated frequently and connected to our monitoring systems, giving on-call engineers incident-specific guidance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What comes next?&lt;/strong&gt;&lt;br&gt;
You should begin actively investing in SLOs (Service Level Objectives) and SLIs (Service Level Indicators) as soon as the firefighting has ceased and you have some breathing room. They are the only way your team can stop reacting to everything that occurs and start acting proactively. SLOs, such as 99.9% uptime, specify the desired level of reliability for your services. SLIs, such as the measured uptime percentage, are the metrics that indicate how well you're meeting those objectives.&lt;/p&gt;
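&lt;p&gt;A small sketch of how this can be expressed follows; the service name, metric names, SLO target, and window are illustrative assumptions rather than recommendations from this article.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;groups:
  - name: checkout-slo
    rules:
      # SLI: fraction of successful checkout requests over a 30-day window
      - record: sli:checkout_availability:ratio30d
        expr: 1 - (sum(rate(http_requests_total{service="checkout",status=~"5.."}[30d])) / sum(rate(http_requests_total{service="checkout"}[30d])))
      # SLO: alert when availability drops below the 99.9% objective
      - alert: CheckoutSLOBreached
        expr: sli:checkout_availability:ratio30d &amp;lt; 0.999
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Checkout availability {{ $value }} is below the 99.9% SLO"
&lt;/code&gt;&lt;/pre&gt;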

&lt;p&gt;Your team may stop constantly responding to issues and concentrate on attaining predetermined performance targets by establishing and maintaining SLOs and SLIs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Concluding remarks&lt;/strong&gt;&lt;br&gt;
“It is the engineering's responsibility to be on call and own their code. It is management’s responsibility to make sure that on-call does not suck. This is a handshake, it goes both ways...” – Charity Majors&lt;/p&gt;

&lt;p&gt;Engineering leaders demonstrate how much they value effective on-call by how they organise and invest in their systems. Your on-call strategy will be based on the unique requirements and history of your company. Take into account what you need and what you can live without as you adapt the plan to suit your team. &lt;/p&gt;

&lt;p&gt;By taking a proactive approach, you may raise the bar for your operational resilience in addition to preserving stability. However, it is going to require work, upkeep, continuous improvement, and care. The good news is that there is a solution available.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
