Ethan J. Jackson

Posted on Aug 1, 2020 • Edited on Aug 10, 2020 • Originally published at kelda.io

Why SREs Should be Responsible for Development Environments

#devops #architecture #productivity #sre

Let's discuss an extremely common anti-pattern I've noticed with teams that are relatively new to containers, cloud-native, Kubernetes, and so on. Cloud-native applications can be far more complex than traditional monoliths and, as a result, need a relatively sophisticated development environment. Unfortunately, this need often isn't evident at the beginning of the cloud-native journey. Development environments become an afterthought: a cumbersome, heavy, brittle drag on productivity.

The best teams treat development environments as a priority and devote significant DevOps/SRE time to perfecting them. In doing so, they end up with development environments that "just work" for every developer, not just those who are experienced with containers and Kubernetes. For these teams, every developer has a fast, easy-to-use development environment that works every time.

What's a development environment?

Before we go further, let's get on the same page about what we mean by a development environment in this context. When working with cloud-native applications, each service depends on numerous containers, serverless functions, and cloud services to operate. For this post, a development environment is a sandbox in which developers can run their code and dependencies for testing. It's not the IDE, compiler, debugger, or any of those other tools.

Sound Familiar?

You're working on a new project or planning to modernize an old one. The team has read all about the whiz-bang nifty new cloud-native technologies, like containers, Kubernetes, etc. So, you decide to take the plunge and build a cloud-native app.

The team realizes that a core group of DevOps/SREs will be necessary to get everything running in a scalable, reliable, and automated setup. Site reliability engineers are hired/trained and get to work. They set up Kubernetes, CI/CD, monitoring, logging, and all of the other tools we've learned are critical for a modern application.

Everyone knows that it's the DevOps/SRE team's job to get all of this stuff up and running. However, development environments aren't top of mind. The site reliability team considers it their duty to focus on production and CI/CD; development is the developers' job. At the same time, the developers think it's their job to deliver application features, not to maintain infrastructure. It's not really anyone's responsibility to focus on developer experience before CI/CD, so it's neglected.

Unfortunately, an ad hoc approach to development environments tends to emerge. Whenever there's a new service, whichever developer happens to be working on it realizes they need some way to boot their dependencies and test their code. They Google around and figure that Docker Compose is a reasonable way to do this. They copy and paste some example, tweak it through trial and error until it's working, and move on. The quality of this initial compose file varies widely depending on the DevOps knowledge of the engineer who happened to write it. Sometimes it's pretty solid; sometimes it's brittle and slow.

Worse, this process repeats. Every time there's a new service, it gets a new git repository, and some new engineer finds themselves writing a compose file. Perhaps this new file is copied from an existing project. Perhaps it's developed from scratch. Either way, now we have two compose files that need to be maintained and updated as the app changes over time. This process repeats and repeats until all services have their own ever so slightly different configuration files that are a nightmare to maintain.

As a result of this (all too common) process, we see several typical issues:

Development environments are unmanageable. They spread across dozens of repositories in dozens of subtly different copy-and-pasted docker-compose files. Keeping these up to date in a fast-changing application is impossible.
Development environments are incomplete. They only deal with containers because they are the easiest for an individual developer to get up and running with docker-compose. Everything else developers need to test (serverless functions, databases, specialty cloud services) requires manual effort.
Developers waste time focusing on things that aren't their specialization. Just as most backend engineers can't CSS their way out of a paper bag, there's no reason for every frontend/AI/data engineer to be an expert on current DevOps trends. Developers shouldn't spend time configuring and debugging development environments; they should spend time building features.

Managed Development Environments

So how do we avoid this all-too-common scenario? The good news is that it's not particularly challenging to do so if you're intentional and proactive. The best teams tend to follow a couple of principles to ensure a great experience.

Clear Responsibility:

There's a team that is explicitly responsible for providing development environments for all developers. That team can be the DevOps/SRE team, or a dedicated developer productivity team. The key is that it's someone's job to focus on this issue. Furthermore, that someone likely has deep DevOps expertise and will produce better outcomes more efficiently.

Central Management:

The development environment must be managed centrally by the site reliability team responsible for it. A single git repository contains all of the configuration and scripts necessary for a developer to get going. When the site reliability team changes something, they do so once in that central repository, and all developers benefit. Typically, the development environments also run in a centrally managed cluster in the cloud. As a result, it's easy for the site reliability team to ensure things work consistently for everyone, and to debug problems when they do arise.
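
To make "one central place" a bit more concrete, here's a minimal sketch in Python. Everything in it is an assumption for illustration: the manifest layout, the service names, and the `required_services` and `startup_order` helpers are hypothetical, and a real team would more likely keep this information in whatever config format their tooling already reads. The point is only that when dependencies are declared once, a script can work out what a given developer needs to boot.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical central manifest: each service maps to its direct dependencies.
MANIFEST = {
    "web": ["api"],
    "api": ["postgres", "queue"],
    "worker": ["queue", "postgres"],
    "postgres": [],
    "queue": [],
}


def required_services(manifest, target):
    """Return the target service plus all of its transitive dependencies."""
    seen = set()
    stack = [target]
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        if name not in manifest:
            raise KeyError(f"unknown service: {name}")
        seen.add(name)
        stack.extend(manifest[name])
    return seen


def startup_order(manifest, target):
    """List the services needed for `target`, dependencies first."""
    needed = required_services(manifest, target)
    # TopologicalSorter expects node -> predecessors, which is exactly
    # "service -> things that must be running before it".
    graph = {name: set(manifest[name]) for name in sorted(needed)}
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    print(startup_order(MANIFEST, "web"))
```

A developer working on `web` gets `api`, `postgres`, and `queue` booted before it, and never has to think about `worker`. When a dependency changes, it changes in one file.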

Full Automation:

Development environments are fully automated. A single command brings up everything a developer needs to test their code. Developers need almost no manual setup beyond the code changes they're actively working on.
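
As a rough sketch of what that single command could look like, here's a small Python wrapper. It assumes Docker Compose v2 (the `docker compose` subcommand) is installed, and it assumes a compose file at the hypothetical path `dev/compose.yaml` in the central repository. The `build_up_command` and `dev_up` names and the service names are made up for this example. It defaults to a dry run, so nothing is started unless you ask for it.

```python
import shlex
import subprocess


def build_up_command(compose_file, services):
    """Build the `docker compose` invocation that starts the given services."""
    if not services:
        raise ValueError("at least one service is required")
    # -f selects the compose file; `up -d` starts the services in the background.
    return ["docker", "compose", "-f", compose_file, "up", "-d", *services]


def dev_up(compose_file, services, dry_run=True):
    """Print the command, and run it unless dry_run is set."""
    command = build_up_command(compose_file, services)
    print(shlex.join(command))
    if not dry_run:
        # check=True raises CalledProcessError if docker exits non-zero.
        subprocess.run(command, check=True)
    return command


if __name__ == "__main__":
    # Hypothetical path and service names, for illustration only.
    dev_up("dev/compose.yaml", ["postgres", "queue", "api"])
```

The wrapper itself is trivial on purpose. What matters is that developers run one command that the responsible team maintains, instead of each remembering their own set of flags.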

Conclusion

Achieving these goals isn't easy. It requires a significant and sustained investment from the site reliability team, and buy-in from developers and management to succeed. However, while the cost can be significant, it's small relative to the time and effort saved by giving every developer a fast environment that just works every time. At Kelda, we're working hard to make this dream attainable for every developer.

By: Ethan Jackson

Originally: https://kelda.io/blog/devops-should-manage-development-environments/

Latest comments (13)

katy lavallee • Aug 6 '20

We do it the "clear responsibility" way, and my team is... that team. It works quite well. My team also helps people when they have any difficulty with their dev environment. We maintain the base library that all our services are built on, and... well it's been written about before.

scott hutchinson • Aug 2 '20

I very much see your point; however, experts in all industries are needed to grab work over the wall. For example, if you have a heart problem, you probably won't go see a dentist; if you need a hurricane destruction estimate, you won't use an electrician; and if a power transformer gets hit, you would not call a nurse to turn the power back on. Silos are needed because they allow humans to become proficient at a skill.

scott hutchinson • Aug 2 '20

So one engineer should be proficient in Development, DevOps, Deployment, UI Testing, API Development, SecOps, Database, and QA Automation. This kind of person directly correlates to the over-bloated requirements companies put out in a job requisition.

scott hutchinson • Aug 2 '20

Really like this post, the more segmentation you have the more responsibility a team member has to help the team. My analogy is always a football team (US), a quarterback knows what the running back is doing but the running back has to make the plays to help the team.

Dave • Aug 2 '20

DevOps as a separate team is the problem here, not the management of different environments.

Simon Bracegirdle • Aug 2 '20

The development environment must be managed centrally by the DevOps team responsible for it.

I wouldn't recommend this, as you're creating a barrier between the development team and the tools that are essential for doing their job. The greater the separation between persons involved at various parts of the process, the longer the feedback loops, the lesser the understanding and the more often other teams will block yours.

For example: "We need to add a new service so we can consume things from this queue. Okay, but we need to wait for the SRE team to create an environment for us". This is now an impediment to the team, and it will increase the chance that they will not create a separate service in favour of less robust options because it's easier.

So, what's a better option? Let's remind ourselves of the three ways of DevOps:

First Way: Work always flows in one direction – downstream
Second Way: Create, shorten and amplify feedback loops
Third Way: Continued experimentation, to learn from mistakes, and achieve mastery

freshservice.com/itsm/phoenix-proj...

Returning to the earlier example, what is the best way to shorten feedback loops, ensure work flows in one direction and enable experimentation?

Don't have a separate "DevOps" team. Embed individuals with SRE/DevOps skills into your teams so that those teams are capable of delivering end-to-end solutions themselves.

Stephen Leyva (He/Him) • Aug 2 '20

I’ve found embedding SREs in teams has trade-offs as well. One is that knowledge sharing across teams (especially if there are a lot of teams) becomes difficult, and you arrive at a hundred different ways to solve the problems on each team.

Dedicating teams to build layers of abstractions on top of common tooling kind of gives the best of both worlds as long as the abstractions are clean. This way, you’re building tools for developers who may not have deep ops expertise. The developers still have to be familiar with your abstractions but not the implementation.

At scale, in my experience, the embedded model starts to break apart if you silo teams off and one guy becomes the “DevOps guy”.

Just my perspective: it’s ok to have an ops team with a different approach to solving problems through automation and being proactive as opposed to reactive. You can still follow the three ways by:

System thinking: Viewing yourself as a stage of the software pipeline. Are you facilitating velocity or becoming a bottleneck?
Shorten Feedback loop: release often and early, dog food (I hate this term :D) your own tooling where possible, and constantly collab and talk with development teams.
Continual improvement: learn from your developers as they learn from you :)

This is just to show there are different implementations of the devops philosophy, each with its own trade-offs. Just my humble opinion based on my experience :)

Simon Bracegirdle • Aug 2 '20

Yeah I completely agree that there are different implementations of the devops philosophy.

Theoretically you could do this with separate ops and dev teams, but in my experience it makes it harder because you don't have that mix of disciplines and the diversity of perspectives that it brings. There are also more hand-offs as you pass it to ops to run the thing after building.

It doesn't make it impossible, just less conducive in my experience.

At scale, in my experience, the embedded model starts to break apart if you silo teams off and one guy becomes the “DevOps guy”.

Yeah I agree that actually sounds worse. It doesn't sound like DevOps if it's just one person responsible for the "ops" part.

The first way mentions removing impediments; that's not possible if one person is a single point of failure.

The second way mentions feedback loops. Those loops are going to be longer if a single person is blocking the ops part of the process.

The third way mentions continuous learning, which isn't happening if one person is hoarding all of the ops knowledge. It's also not happening if teams are silo'd and not sharing their learning with other teams.

It has to be an organisation-wide change to be effective, not something that a single person or team can do in isolation.

Stephen Leyva (He/Him) • Aug 2 '20

Example: The ops team owns environment creation but builds a service on top of their infrastructure to make environment creation self service.

That increases velocity while you still own your problem domain with your specific expertise.

Daniel Ziltener • Aug 1 '20

"If you have a devops team, you have Ops,not DevOps"

Raphael Habereder • Aug 2 '20 • Edited

Thank you! This really bugged me while reading.

I am not one to read a lot into paradigms, so I don't really know what the "official" description of a DevOps entails nowadays, but the name says it right on the tin.
Be both, a dev and an ops. That's what I always understood a DevOps to be.

Splitting them off the rest of the team just seems wrong
