Page It to the Limit

Real-Time Learnings With the PagerDuty Community Team

Why Real-Time Operations Matters

Alex Solomon (CTO and Co-Founder of PagerDuty) kicks us off with a definition of real-time operations and why it matters.

Alex: “Real-time operations to me, what that means is, it’s about dealing with problems and incidents and alerts in real-time. Making sure that the right people are pulled in whenever you have an issue with your production software, and only the right people. Those teams and individuals are looped in quickly, looped in via multiple channels to make sure they get there fast. Then once they are paged and looped in it’s about collaborations, it’s about communication, it’s about coordinations, it’s about defining clear roles for all the individuals and making sure they can collaborate and communicate effectively to make decisions quickly and resolve the underlying problems with those systems.”

Matt hops in to point out that real-time operations also encompasses how we learn about incidents and how we continue to learn from them.

George talks about how broad the definition of real-time operations is: it extends to every facet of online operations that might impact our teams, whether that is web services or the code we write and how it operates in production.

The Myth of Real-Time Operations

Alex talks about the main myth he sees with real-time operations.

Alex: “The myth that you can buy a software platform like a PagerDuty or a DataDog or a New Relic or any of these toolboxes that we all have when running digital systems, and that buying the platforms will solve all your problems and be a silver bullet. In my experience what I see over and over is that yes you can buy the platform but the hard part is changing culture and transforming culture and transforming the way people work, and that comes down to people and process.”

Alex goes on to mention that it’s about the people supporting the services and full-service ownership.

Matt talks about the myth that we can prevent failure.

Matt: “The reality is we can do a lot to kind of steady ourselves and be ready to respond and take information we’ve already had, but our systems are so complex there’s no way to be fully predictive, and we need to understand how to make our systems - our socio-technical systems - more resilient rather than thinking if we just build in enough failover, enough automation, or write the best runbook ever, we’ll be able to prevent failure.”

The discussion moves to how systems should be designed for failure, with ways to detect problems and resolve them quickly.

Sharing What We Have Learned at PagerDuty

The conversation moves to what we have each learned during our collective time at PagerDuty, whether it is the incident response process or postmortems.

Scott talks about how his time at PagerDuty has been entirely remote and how to be successful as a remote worker by being vocal about your wins, taking time for yourself and helping others learn about what you are doing by being an internal advocate.

George mentions that advocating internally and externally is about how you communicate with different folks that are distributed.

Julie discusses her experience with this being her first remote job and how the PagerDuty culture of having video on all the time makes being remote much easier by helping to build a great team relationship.

The Shift to Remote Work

The conversation shifts to how real-time operations are impacted by the shift to remote work.

Alex discusses how, for the last 20-30 years, operations were about data centers and folks being on-site, but remote tools now make it easier for companies to move to remote work. However, the challenge and the gap can be the culture of remote work if teams and companies aren’t used to that experience.

Julie talks about what it is like to work remotely with our families at home. She mentions that she packs her son a lunch just as she would if he were physically going to school.

Matt offers his story of how he has trained his kids to understand that he is working when he is at home.

Matt: “What I used to do is I used to wear a special baseball hat if I was going to be in the main room and it was like, if daddy had that hat on he was working, and for all practical purposes he was invisible, and that worked about half the time.”

Matt continues to talk about how we can be empathetic towards our co-workers and get to know them a little better.

Julie shares her biggest learning at PagerDuty:

Julie: “Every organization feels they have a very unique story to tell, but it’s not as unique as they may think. A lot of these organizations, they may have a different journey but they are still on kind of the same level as to what they deal with.”

Julie goes on to talk about how organizations are dealing with a lot of HybridOps situations.

George hops in to discuss how his background as a first responder applies to managing real-time operations:

George: “A lot of that comes down to preparedness, to having a plan, to knowing what you are going to do when those unexpected surprises come up.”

George goes on to say that you cannot plan for everything, such as COVID-19, but you can build repetition and practice around how to respond when a given type of crisis occurs.

George: “Having a plan is not about following that plan to the letter, because we never know what we are going to expect. Real-time operations is completely unpredictable, but what is important is just knowing how you might approach a situation like something we can reasonably infer.”

The hosts talk about how practicing everything helps with times of uncertainty.

HybridOps

Alex shifts to discussing how HybridOps has been a big learning over his 11 years of building PagerDuty. He talks about how, early on, a lot of the customers were digital natives and cloud-first, and how they helped us develop our product and vision. Alex mentions that HybridOps comes into play because some organizations have both legacy systems and newer digital systems. They also have central operations teams alongside DevOps-oriented teams that build, run, and maintain their own systems.

Alex: “That’s what HybridOps is all about, it is the situation that these companies are in, that they need to operate in both modes at the same time, while working on modernizing their older applications.”

Wrap Up

The episode wraps up by asking Matt the final two questions on his last Page It to the Limit episode.

Matt talks about how, for the majority of his career, he felt like his job was to defend production from DevOps, and how that perception changed once he got into the DevOps mindset.

Matt closes by pointing out that he is really happy that in all the time he has worked for Alex he has never been asked to do anything with regular expressions.

Additional Resources

Episode source