Reliability as an Inseparable Part of Software Engineering

#devops #cloudnative #sre #saas

The world is changing:
I feel it in the water,
I feel it in the earth,
and I smell it in the air.
J.R.R. Tolkien, The Fellowship of the Ring

For someone who has spent the past 20+ years as a Software Engineer building enterprise software, the changes our world is undergoing are as clear and evident as they were for Lady Galadriel of Lothlórien in the epic by J.R.R. Tolkien. What “developing and delivering a software product” meant in the 1990s, when I began my professional career as a Software Engineer, and what it means today are very different things.

The differences start with the definition of a “software product”. Traditionally, software vendors were responsible for developing software and packaging it in a form that allowed their customers (businesses or consumers) to deploy it on their own IT infrastructure and use it. This paradigm, also known as “on-premises” software, is now being ousted in many B2B and B2C scenarios in favor of delivering “software services”. The distinction between delivering an on-premises product and delivering (potentially the same software as) a service is not just about the form of delivery or consumption; it changes the core set of expectations between the customer and the provider or vendor.

When a customer acquires a “service”, the expectations stretch far beyond its functional capabilities into the domain of availability. The customer rightfully expects that the service is available for use with a defined level of user experience at all times (according to the Service Level Agreement with the provider). The shared responsibility model that was in place for on-premises products, where the vendors had to provide working software products but the customers had to deploy and operate them according to the vendors’ specifications, has shifted heavily toward the vendors’ side.

As per normal industrial evolution, both the customers and the vendors are supposed to benefit from this new state. Both of them, however, have new challenges to resolve (or new realities to adapt to) in order to be able to reap the benefits:

Software Vendors (turning into Providers of Software Services) need to adopt a new set of responsibilities related to operationalizing the services and delivering guarantees on the service availability
Customers of Software Services need to be able to verify (to the extent possible) that the service providers are operating within the expectations in terms of service quality (availability, responsiveness, etc.)

For providers of software services, Site Reliability Engineering is the discipline that allows adopting this new set of responsibilities, guaranteeing their execution and proving that their services are delivered reliably. Delivering a reliable service and proving its reliability are two very different notions.

In the past, traditional IT operations have ensured reliability of their assets and infrastructure by introducing and following strict operational procedures. Those procedures were applied to “complete products” that were delivered to the IT organizations for operation. The operators could influence the infrastructure, the configuration and the procedures related to the product, but not the product itself. This limitation has clearly introduced a cap on the effectiveness of operations that could be delivered by such organizations.

Site Reliability Engineering is changing the rules of the game by removing this “glass ceiling” and allowing reliability considerations to influence the product from its architecture and design, through the various stages of implementation and testing, all the way to its deployment in production. As this approach is adopted by a growing number of organizations, it is becoming increasingly clear that this is not just a dedicated “profession” (carried out by dedicated engineers called SREs - Site Reliability Engineers), but a pillar of Software Engineering in general, alongside aspects such as Software Architecture, User Experience and Automated Testing.

Just as the DevOps philosophy has merged the domains of Software Development and Software (Service) Operations, Site Reliability Engineering relies on this approach to define reliability challenges not as operational ones (where regulations, processes and escalation procedures would be the way of delivering results), but as engineering ones - where a set of mechanisms engineered into the product would guarantee the desired results.

Reliability engineering, therefore, is a notion that should affect every stage of software development:

Architecture and Design with reliability metrics and goals in mind
Implementation that ensures reliable operations when deployed to production
Testing that introduces verifications and gates focusing on reliability and not just functionality
Deployment that ensures reliable roll-out
Monitoring that surfaces potential reliability issues ahead of time (or, at least, when they, unfortunately, take place)
Operationalization that both ensures reliable functioning of the service and provides a continuous feedback loop back to the Design and Implementation stages to improve the reliability of future product iterations
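
To make the idea of a reliability-focused gate in the Testing and Deployment stages a little more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not a description of how any particular platform works: the names SloWindow, error_budget_remaining and release_gate, as well as the 25% budget threshold, are hypothetical. The sketch treats availability as the fraction of successful requests in a window and blocks a roll-out once too much of the error budget implied by the SLO target has been spent.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Request counts observed over one evaluation window (hypothetical shape)."""
    total_requests: int
    failed_requests: int


def availability(window: SloWindow) -> float:
    """Fraction of requests served successfully; an empty window counts as fully available."""
    if window.total_requests == 0:
        return 1.0
    return 1.0 - window.failed_requests / window.total_requests


def error_budget_remaining(window: SloWindow, slo_target: float) -> float:
    """Share of the error budget still unspent: 1.0 means untouched, <= 0.0 means exhausted."""
    # The SLO target (e.g. 0.999) implies how many failures the window may contain.
    allowed_failures = (1.0 - slo_target) * window.total_requests
    if allowed_failures == 0:
        # No failures are permitted at all (100% target or empty window).
        return 1.0 if window.failed_requests == 0 else 0.0
    return 1.0 - window.failed_requests / allowed_failures


def release_gate(window: SloWindow, slo_target: float, min_budget: float = 0.25) -> bool:
    """Allow a roll-out only while at least min_budget of the error budget remains."""
    return error_budget_remaining(window, slo_target) >= min_budget
```

A pipeline step could call release_gate with the counts from the last evaluation window and stop the roll-out when it returns False. The threshold and the window length are design choices each team would have to make for itself.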

Building a Reliability Platform the way we do it at StackPulse means, first and foremost, that we have to adopt the above principles as the foundation for our own Software Development Lifecycle (SDLC). We firmly believe that this is the only way to ensure that the users of our services have the best and the most reliable product at their disposal.

This also means that the platform we are delivering to our customers allows them to adopt a similar state of mind and to build mechanisms that will drive the reliability aspects in their respective software architectures and development processes. Becoming an integral part of the SDLC means being able to operate within the main principles upon which the lifecycle is based, such as:

Single “Source of Truth” is based on the software version control repositories
“Fail fast” verification of all changes prior to release to production
GitOps flow initiating and managing the progressive deployment of changes released to production
Full transparency and metrics are built into each process for detailed telemetry

Additionally, we believe that the best ideas are not coming from a single source, but are a result of a collaborative effort. That is why our platform allows developers to share the mechanisms they build for reliability, to open them to community feedback and contributions and to make the world a more reliable place as a result of a joint effort. We firmly believe that together we can make a real difference in the reliability of different software services in the world, by sharing our challenges and our practices. We see ourselves as an enabler of such discussions and as a facilitator of knowledge sharing in the field.

The platform that we are about to launch consists of the following components:

Alert Enrichment & Analysis - To reduce alert noise and fatigue with real-time enrichment, triage, and analysis
Automated Incident Response - Automated, code-based playbooks to investigate and remediate production incidents, reducing toil and MTTR and helping to meet SLOs
Incident Lifecycle Management - To streamline communications, speed remediation, and automate data collection during incidents

This is just the beginning. Our mission is clear: we are delivering a holistic platform that allows engineering organizations to interleave reliability into all stages of their software development and operations. Doing so has to be a continuous agile process, constantly evolving and relying on data-driven insights. This is the way we are doing it ourselves, and this is the way we would like to do this together with our users.

We are StackPulse! Join us in making the world a more reliable place, one software service at a time!
