Mario Loria of CartaX talked with Ambassador about the changing developer experience, the changing role of the SRE, and delivering visibility and a self-service platform for developers
The application development landscape has fundamentally changed in recent years. In a recent interview with Ambassador Labs, Mario Loria from CartaX said he believes this is still uncharted territory, particularly for developers in the cloud-native space. As he sees it, site reliability engineers (SREs) play a key role in guiding developers through the learning curve toward comprehensive self-service of the supporting platforms and ecosystem, and ultimately to service ownership. This requires a major shift in company and management culture, and developer (and SRE) mindset and tooling as well as insight to make the journey to full lifecycle ownership not just smoother and more transparent but also technically feasible.
The traditional monolith continues to exist in parallel with cloud-native application development. The operations side of the equation, according to Mario, understands that this has caused a big shift in deploying, releasing, and operating applications, and now the role of SREs is to help developers understand and own this shift. Developers know how to code, but building in the necessary understanding (and ownership) of the “ship” and “run” aspects of the lifecycle introduces a steep learning curve. For developers, this means taking on new responsibilities with the support of SREs.
In a cloud-native, service-oriented architecture, how much responsibility for an application's lifecycle should a developer own?
According to Mario, developers should own the full life cycle of services but in most cases don't: "It should not be up to me as an SRE to define how your application gets deployed or at what point it needs to be rolled back, or at what point it needs to be changed, or when its health check should be modified." Developers should be capable of — and empowered — to make these determinations.
Many SREs' experience has forced them to "ride to the rescue" when issues arise. Developers have generally been conditioned to hand over deployment and operations to SRE teams rather than investigate how they might handle issues on their own. This isn't illogical; this is how pre-cloud (and pre-DevOps) monolithic application development worked. Getting to a place in which developers can own the full application life cycle isn't always a linear path, nor is it always organizationally sanctioned or supported.
How does an organization and its developers round the bend on the learning curve?
Mario shared a set of prescriptions on how organizations, SRE teams, and developers might better embrace and support the new development paradigm.
An organization and its leadership needs to get behind the end-to-end "developer-as-service-owner" mindset as part of their higher-level strategy. Leadership needs to make this clear from the top down that there is a focus (and accountability) on ownership. Developers learning this new culture come to it from their own history and experience, which can vary considerably. Changing the developer mindset starts with empathy — understanding their goals, practices and skills — and closely follows with good communication.
Mario believes that supporting developer ownership, and getting developers to embrace this model as well, comes down to communication. Communication has to change from an "it's their [SRE] problem" to a "this is our [developer] problem" frame of reference. Even down to the individual developer and their experience, each problem they encounter is likely something that affects the business. Everyone is pushing toward the same goals. If that fundamental idea of shared responsibility is broken, it's not really possible to change the culture to adopt and sustain the new developer experience.
Mario's views on the role of SRE in supporting the new developer experience, and more fundamentally, developer education can be summed up with the old adage: feed someone a fish, they eat for a day; teach someone to fish, they eat for a lifetime. Mario's take is that developers should start triaging and debugging themselves when they run into issues. As he frames it, it's not that developers should never ask for SRE support, but instead that the role of SREs in cloud-native organizations should support developers learning to gather intelligence, troubleshoot basic issues, and understand what the components of their applications are doing without SRE intervention.
The SRE role, in this framework, supports the idea of SRE-supported education to help developers gain a better understanding of how to handle their services themselves rather than asking SREs to fix broken issues. Mario believes this is the way SREs and developers should work together in the long term: helping put the developer in the driver's seat while training them to drive high-performance machines (applications).
To persuade developers to take on the responsibility, the SRE function needs to provide the platform surface area -- or control plane -- and visibility to make this possible. That is, giving developers control and providing visibility into what the repercussions of their actions can be. A lot of questions surface that developers will need to know how to answer effectively to, as Mario put it, “shift left”, make more decisions, and do more autonomously with interactive self-service.
SREs can create the platforms, or pure infrastructure, to empower developers. Developers need to ship their code safely, and the more they understand about how to do this, the more self-sufficient they become. Meanwhile the more SREs can focus on creating the "platform as a service", or “paved path”, that helps reduce cognitive load for a developer taking on full code-ship-run ownership, the more clarity the developer gets into the processes they need to ship and run their code without breaking anything.
In many ways, this reshapes not just the developer experience but also the SRE experience.
"In terms of developing, deploying, and operating [in the cloud]... especially if you think of someone you just onboarded into the team, there's a mountain they have to climb to actually understand how we manage our services end-to-end. Going back to the frontend we should aim to create something like a single pane of glass, where instead of having all these different places to look or tune, or to try and understand what's going on, we can see everything transparently…".
Giving the developer the tools needed to bring everything together is likely to hinge on something like a developer control plane (DCP), where information can be centralized and made visible.
From an operational standpoint, considerable thought needs to go into how much time and effort the business wants to invest in building or running all the constituent components themselves. While dependent on use case, cloud-native development is all about achieving speed, and this is facilitated by the centralization of distributed information and the creation of a single pane of glass to make a frictionless means for pulling disparate sources of information together to, in a sense "ease the climb".
For developers who want to thrive in the cloud-native development space, learning to rely on SREs in a new way will be a key success factor. SREs should become trusted partners for deploying, releasing, and running services, and not just treated as first responders who are responsible for dealing with incidents. Developers should take the opportunity to share their pain points and also learn about tooling and best practices from SRE teams, with the goal of “paving the path” to developer autonomy, self-service, and full service ownership.