Mike Pfeiffer for CloudSkills.io

Originally published at cloudskills.io

DevOps vs. Site Reliability Engineering (SRE)

DevOps or SRE? That's the topic of this week's discussion. In this episode I catch up with Josh Duffney to discuss the differences between DevOps and Site Reliability Engineering (SRE).

Here are the resources we discuss on this episode:

Josh Duffney is a DevOps engineer with 10 years of systems administration and engineering experience. Josh is a Pluralsight author of several courses on the topic of automation and infrastructure development and blogs at duffney.io. Josh enjoys being an active member of the PowerShell and DevOps community where he gets to share his knowledge but more importantly enjoys learning from others in the industry. Outside of his professional work Josh is a practitioner of digital minimalism who strives to find balance in a digital world. Josh also spends his time weight lifting and training in the art of Brazilian jiu-jitsu.

Full Transcript

Mike Pfeiffer:
What’s up, everybody? It’s Mike Pfeiffer and Josh Duffney live streaming on LinkedIn. Hope you guys are doing good today. It looks like we’ve got a ton of people online. Today, Josh and I are going to be talking about the difference between DevOps and SRE. It’s a hot topic these days, isn’t it, Josh?

Josh Duffney:
It’s a loaded question, yeah.

Mike Pfeiffer:
Yeah, right. Let’s see. I just want to say, hey, what’s up, to Joyce and Vivek, Jeff Truman. What’s up, man? Ahmad, good to see you, my friend. All right, Josh, let’s talk about this, man, DevOps versus SRE. This is a question that’s coming up. People are applying for these new jobs and stuff, and job titles out there. You’ve been through this process. Before we get into it, maybe explain your background, who you are, what you do, and for everybody that’s on the live stream, start putting your questions in the comments and we’ll take them here.

Josh Duffney:
Sure. I’ve been in, well, IT operations for 10 years now, and so for the past three years I’ve been a DevOps engineer, or at least had that title. And so a lot of my focus the last couple years has been applying the DevOps principles and practices to mostly operations, so a lot of infrastructure as code and release pipeline engineering has been my focus.

Mike Pfeiffer:
Yeah, and so in that model, doing kind of a transition from a traditional technologist or a traditional systems administration role, right? I think is kind of what you were doing in the early days.

Josh Duffney:
Yep.

Mike Pfeiffer:
And then getting into DevOps and stuff, I’m sure you’ve been navigating the job boards along the way and looking at the differences, what people are looking for out there, so maybe we could break it down and talk about what SRE is and how it compares to DevOps and why people shouldn’t even pay attention.

Josh Duffney:
Sure, yeah. I mean, from my experience… I guess the first thing to notice or to be aware of is that DevOps and SRE, or at least DevOps, it’s an evolutionary transition for a lot of companies, so the organizational structure or the organizational design is going to be different everywhere you interview, even though the title is the same, it’s not the same as we need an IT administrator and you guys handle the backend systems. It’s not as cut and clear. There’s a lot of patterns and anti-patterns that organizations have adopted with both DevOps or SRE, and so you got to kind of ask questions to figure out where they are in that evolution and just realize that it’s not going to be the same everywhere. The better thing is to focus on more the skillset and the technology that you need to have in those different roles.

Mike Pfeiffer:
That’s actually good. I want to come back to that, but now I can actually see everybody’s comments. So, what’s up? Gregor Suddy is here. He was the first ever SRE at JP Morgan. That’s kind of interesting because the origins of SRE really are from Google. Right? I posted something on LinkedIn earlier this week that was a Google video of what was the difference between SRE and DevOps, and kind of going down that rabbit hole you’ll find that there was a guy that was hired at Google in 2003, and he was like the SRE czar. Let me see, do I have his name? Ben Treynor, and now he’s one of the main dudes at Google, like in the engineering teams. But they were kind of the pioneers of the SRE concept, so it’s interesting to hear that Gregor was an SRE at JP Morgan. That’s cool.

Mike Pfeiffer:
What’s up, Apu, Condi? Adeo, what’s up, man? Happy new year to you as well. Minoge, good to see you. Paul. Awesome. So, all right, Josh, when it comes to looking at jobs in the job market for DevOps versus SRE, one of the things that I noticed this morning, just going up to… kind of ironic, I didn’t look on the LinkedIn jobs. I went to indeed.com, but there was, let’s see, 1,925 jobs with the term SRE in the title, and for DevOps there was 25,000.

Josh Duffney:
Yeah. Quite a ratio disparity.

Mike Pfeiffer:
Kind of a small difference, right?

Josh Duffney:
A little.

Mike Pfeiffer:
So, I guess the big question is, what’s the difference?

Josh Duffney:
From what I’ve seen, SRE is more… it’s typically a more mature implementation or organizational design, and in the places that I’ve talked to and that I’ve worked in it’s usually an embedded team, so the SRE is an embedded team with the engineers or their own unit, small, usually fairly small, and they’re much more focused on the monitoring of the system. They have actually a pretty hardcore software engineering background, where DevOps is really more a loose term that fits a lot of different skill sets. So, you could be a cloud engineer, you could be just an automation engineer, or you could be an actual developer that’s just done a little bit of infrastructure stuff. So, DevOps covers a lot more, I don’t know, fundamental pillars. It’s less specific. It kind of goes back to that Google video where SRE is a specific implementation of DevOps, and so with that, it’s usually a little bit more mature in its scope and what it’s trying to do.

Mike Pfeiffer:
That’s been what I’ve gleaned from it too in talking to lots of different people. I mean, it’s still early for this stuff. One of the things that you hear when people talk about these two things is that these concepts have both evolved at different times, but kind of started in roughly the same time era, but were completely independent, and so there is some overlap, but it’s not like they were coming up at the same time and everybody on both sides of the fence was talking to each other. Right? The other thing that I’ve been able to pick up off of different conversations and stuff is that SRE, to your point, is usually people with software engineering discipline, right? It’s somebody that actually understands computer science at a deep level and can build applications. Right?

Josh Duffney:
Right.

Mike Pfeiffer:
Cool. So, any questions in the comments? Oh, what’s up, Chris Miller? Happy new year, man. Good to see you. Jeff Truman said SRE is ops heavy. DevOps is enabling both dev and ops. Yeah. So, I guess it also depends on where you’re at, right? Some places SRE might be dev heavy.

Josh Duffney:
Yeah.

Mike Pfeiffer:
It kind of depends. And then that’s the other thing too is what do people usually mean by DevOps engineer? There’s a lot of confusion about what that term actually means. Does that matter to you, and what does it mean? Should people be even getting caught up in the label to that?

Josh Duffney:
To me, from my experience, I kind of bucket these skills or these tendencies kind of like this. So, for DevOps engineer, there’s going to be some kind of a focus or emphasis on release engineering. There’s going to be a requirement in most places that you’re going to need to know how to build a CI system or a CD system, and that you’re going to handle changes through a release pipeline. So, there’s a big facet in DevOps around release engineering. A lot of DevOps engineers, it was just rebranded release engineering, it was rebranded DevOps engineer, and then they threw in some operational components like infrastructure as code with release engineering.
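As an editorial aside, the release pipeline described here can be pictured as an ordered list of stages where a change only moves forward if every earlier stage succeeds. The Python sketch below is purely illustrative: the stage names and the `run_pipeline` helper are hypothetical and stand in for whatever a real CI/CD system provides.

```python
from typing import Callable, List, Tuple

# A stage is a name plus a function that returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first failure.

    Returns the names of the stages that completed successfully.
    """
    completed: List[str] = []
    for name, step in stages:
        if not step():
            break  # a failed stage blocks everything after it
        completed.append(name)
    return completed


# Hypothetical stages; real ones would call build, test, and deploy tooling.
example_stages: List[Stage] = [
    ("build", lambda: True),
    ("test", lambda: True),
    ("deploy", lambda: True),
]

# Same pipeline, but the test stage fails, so deploy never runs.
failing_stages: List[Stage] = [
    ("build", lambda: True),
    ("test", lambda: False),
    ("deploy", lambda: True),
]
```

Calling `run_pipeline(failing_stages)` stops after the build stage, which is the core idea of handling changes through a pipeline: nothing reaches deployment unless the earlier checks pass.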

Josh Duffney:
It’s also typically more, from my experience, DevOps has been more operational heavy, where you haven’t needed to know as a DevOps engineer how to write or debug any of the code that’s actually at the application, where SRE is very much you’re kind of expected to be able to go in there and go look through memory dumps and stuff like that and kind of debug the application from that level and also monitor at a closer level than like the VMs off and on.

Josh Duffney:
And then also I’d say SRE is really focused, in my experience, around the monitoring. Like they’re typically the team that’s running platforms, so they’re not only just the monitoring, but they have more specific platform knowledge like with RabbitMQ and Redis, where some operations people coming from more of a traditional systems administration background, they don’t have that particular skill set of that platform. And it’s more focused on the monitoring of applications and the monitoring of the infrastructure stack.

Mike Pfeiffer:
Yeah. So, Jeff Truman mentions in the comments, right, he said, “So what does SRE mean then? Site reliability engineer, right? Is that a dev or an ops role or both?” And I think that’s an interesting question, and I think one of the interviews that I listened to this week after kind of sharing that first Google video and seeing so much traction on that and understanding the interest and the confusion, really. I listened to a 15 minute podcast from Pivotal with a guy from Google who’s actually… he works for the guy I was talking about earlier, works for the Ben Treynor guy. I don’t know if I wrote his name down. Oh, Dave Rensin from Google, senior director of engineering at Google. But he said something really interesting to me, which is SRE is a world where the machines work for you and you don’t work for them.

Mike Pfeiffer:
Meaning, if you’re an ops person right now, you’re probably in a world where you’re getting a text or a call or a page in the middle of the night or on a weekend, and you’re going out and putting out a fire, right? And I think to what Google’s trying to do and what they’ve been doing on their SRE teams is that it’s like you get to the office in the morning, something happened, but the system, the service itself is still going. And so I think that it’s a little bit of a gray area, obviously, but I thought that was interesting perspective, right? That things are going to absolutely fail 100%, and then they’re engineering around that concept. And then it’s not like we’re running around with our hair on fire. It’s more like, “Hey, stuff’s going to break. We’re going to be able to troubleshoot it because we’ve got all these practices in place.”
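As an editorial aside, one small, common way to "engineer around" failure is to assume an operation will sometimes fail and retry it with a growing delay instead of paging a person on the first error. The Python sketch below only illustrates that idea; it is not taken from Google's SRE practices, and the `call_with_retries` helper and its parameters are hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, retrying on failure with exponential backoff.

    The delay doubles after each failed attempt. If every attempt fails,
    the last error is re-raised so a human still finds out.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Exception = RuntimeError("operation was never attempted")
    for attempt in range(attempts):
        try:
            return operation()
        except Exception as error:  # broad on purpose for this sketch
            last_error = error
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    raise last_error
```

The `sleep` parameter is injectable so the behavior can be exercised without actually waiting; a production version would likely catch narrower exception types and add jitter.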

Mike Pfeiffer:
All right, cool. So, tons of kinds of comments, man. So, let’s go back, talk a little bit about some of these questions. This is a fun conversation. Obviously it’s a big concept, right? So, Richard Taylor, what’s up, man? Good to see you. He says, “I think perhaps we get caught up with titles.” Yeah, man, I think that too. And here’s what I think about all this personally, I’ll add my two cents and I’ll let Josh say what he thinks. But to me it’s like, especially as a person who is self-employed now but spent 15 years working for everybody else, if I’m trying to sell something, I got to understand what the customer really wants. And going back to what I said earlier, when you’re looking on indeed.com and there’s thousands of job titles with DevOps, it’s your job to figure out what the market wants, and then you add value. It doesn’t matter what the people are using, in my opinion, it’s just like find out how you’re going to add value. I don’t care what you call it, in my opinion. Call it whatever you want. I’m going to find a way to close the gap by adding value into the system, not bickering and arguing about job titles. What do you think, Josh?

Josh Duffney:
This is my absolute favorite quote from a podcast, I think in like 2015 from… is it Jeffrey Hacker? Who said, “Titles don’t matter, but they absolutely do matter.” And so the reason is is they don’t matter for the reason that they shouldn’t prevent you from doing anything, but they do matter in the sense of your pay, your searchability and keywords on your LinkedIn profile, in authority and just recognition, like they do have some meaning, and so you’ve got to take the best of both. So, I think when people really compare the two, they’re really asking the question, how do my skills align to these titles and where do I best fit and where are my gaps? And so with that you have to kind of understand the different needs of each of these roles, and it’s going to be dependent on the organization because this isn’t really set in stone. You know?

Josh Duffney:
Organizations are evolving, they’re understanding, and then also the tech stack underneath is different per organization. So, you kind of got to go a level higher, understand that there’s practices of both of these that you need to have, like infrastructure as code, some software engineering. You need to have some kind of fundamental scripting language like Bash, Python or PowerShell kind of under your belt and be really proficient with that. And if you can abstract it a little higher, then you can start to identify the gaps. That’s kind of my… it doesn’t matter but it does matter because it can help you identify where you need to grow.

Mike Pfeiffer:
Yeah, yep. I want to take a question here. There’s some really good ones in here. I want to take Kimberly Medina’s question here just in a second. But I did want to mention one thing. To me, if you’re going to work at Netflix or LinkedIn or Amazon or whatever, obviously SRE is going to mean something different there than it might mean to somebody that has 500 users and is a state agency or a small or medium size business. Sometimes HR and recruiters or even hiring managers don’t even know what they really need, and so they’re just trying to use a blanket term. So, the onus is on you to figure out what they mean by this job title, so I think that’s an element of it. Kimberly said… she had a really good question. She says, “I’ve seen many roles open recently that are listed in both these categories, and how do the tools overlap from one to the other?” What do you think, Josh?

Josh Duffney:
I think the tools overlap 100%. I mean, the same tool… so, infrastructure as code. They’re probably going to use the same tools for your DevOps or SRE, and it depends on probably what exists at the company. Are they going to use Chef or are they going to use Ansible? Are they using Terraform because they’re in the cloud and they want to be cloud agnostic kind of thing? A lot of the principles overlap too. Yeah, that’s a really good question because there is so much overlap that it’s hard to distinguish between the two. I’ve seen that too, and the job postings were like, DevOps engineer or SRE, and they actually posted it as both with the same job description, and so even the organization doesn’t really know. They’re just trying to get as many candidates to apply as they can.

Mike Pfeiffer:
I love that in the comments, since there’s no emojis, people are just putting in asterisks, like Gregor’s comment. That’s awesome. Man, there’s so many questions I can’t even keep up with them. I’ve got 95 people on the live stream. Really appreciate everybody being here. Paul, how’s it going? And thanks, Jeff, for answering Paul’s question. Neval, what’s up? Pedro, good to see you. Kevin, what’s up, man? Thanks for answering questions in the chat, you guys. That’s really cool. Josh, one of the things that we were chatting about before we went live was a book that I know you’re a fan of. You sent me some diagrams from that that I know that you feel like are pretty valuable. So, let’s get into that just a little bit, because I think it’s early, right? And so we’ve got to figure out what people mean by these terms, but we also need to work on our own careers, right?

Mike Pfeiffer:
So, we talk about this a lot, what your current job wants you to do and what they assign to you may not serve your career in the next five or 10 years, so it’s always going to be your responsibility to take care of your career, so you do a great job of that. I would love to hear about the book that you’re reading that kind of falls into this, and also some recommendations for the folks that are doing that that are thinking further out about their career versus the short term stuff, just what the job looks like today.

Josh Duffney:
Yeah. So, the book is Team Topologies. I think that’s the one that you’re referring to, and they have some really great diagrams. They’re the creators of the DevOps Topologies, so there’s… I’ll send them over to Mike so we can put them in links afterwards of anti-patterns and do’s of DevOps. So, it kind of identifies the common patterns that organizations have adopted. I was actually really reluctant to read the book because it’s about organizational design, and as an individual contributor, I felt like it didn’t really add much value to me. But it couldn’t have been farther from the truth, like having some fundamental knowledge of organizational design and how teams across the globe really have evolved their DevOps practices really gave me a better mental model to think about the changes that I wanted to advocate for inside the organization, and also what to look for elsewhere if you wanted to leave. And so it gives you a good framework to ask questions about what that particular organization that you’re applying for has, like what evolution they’re in, what kind of pattern. Are they in an anti-pattern where they’ve just rebranded system administration as DevOps, and there’s really no difference between your current job and that job? Yeah.

Mike Pfeiffer:
Yeah. So, one of the questions that came into the comments that I think is an important one, Kevin Sapp said, “Do DevOps related certifications matter in today’s job market? Microsoft, AWS, GCP.” And I would say absolutely 1000%, certifications are more important than they’ve ever been. They’ve always been important in my view because it takes you into darker corners of the technologies and it forces you to validate your skills. Some are more hands on than others, right? But I think, and I know at this point, especially looking at the last 12 to 18 months, five years ago, our customers never ever said anything about are your engineers certified. Today, they all ask, all of them, and that’s really fascinating to me, and it never used to be like that. So, the market wants certified individuals, and we’re all… if you’re vying for a job, you’re competing against a pretty competitive pool of people, and you’ve got to put your best foot forward. Branding yourself as someone that’s certified, massively important. What do you think, Josh? You’re big into certifications, aren’t you?

Josh Duffney:
Well, yeah, I just got the AZ-103. I’m happy about that, passed that last month to finish out 2019. It was really funny, I’ve got a… thank you. I’ve got a blog post queued up that stole from that quote that I shared earlier called Certifications Don’t Matter, But They Absolutely Do Matter, and it’s kind of my thoughts on this exact question of what are the pros and cons of certifications? And I think they absolutely do matter for a number of reasons. The primary one being that it pushes you. So, here, I’ll take a step back. A lot of organizations haven’t figured out the cloud. They don’t know their cloud strategy, they haven’t completely dived into it, and it’s ever evolving, so it’s constantly changing, and they’re struggling to figure out where they fit and how to apply it in there, and they’re making a lot of mistakes.

Josh Duffney:
And so what certifications can allow you to do as an individual is go out ahead and learn that technology stack to better guide those technical decisions for your company or future employers. So, I think more than ever right now, you’re right, Mike, they matter more than they have in the past because technology is moving so fast. In your day job you’re not going to have the time to go out and kind of go ahead of the curve and figure out the best implementations, the best use cases for implementing cloud technologies, and certifications give you a framework to learn those other than going out on your own.

Mike Pfeiffer:
Right.

Josh Duffney:
Some of them are a pretty loose framework that you can’t kind of figure out. You need some guide rails to learn the technology.

Mike Pfeiffer:
Yeah. I mean, you got to get in the game. I mean, I think that’s what it really boils down to. And maybe if you’re intent on not doing certs, then it’s like you got to get in the game by building open source like crazy. There’s got to be something on your resume that’s going to make it pop when I’m looking at it. Why should I hire you versus the other person? Right? And certification is part of it. You know? Sometimes you work at a partner, you have to have it so the partner can maintain their certification with the company. But there is actually SRE certification at least one company is putting out. I think it’s the DevOps Institute is working on one, or at least a course, but I haven’t seen any SRE certifications other than that. Have you, Josh? I know there’s tons of DevOps ones.

Josh Duffney:
No, I haven’t. Yeah.

Mike Pfeiffer:
Yeah. I think it’s still really early. There is several books out there, right? There’s an O’Reilly book or at least there might be a couple of books that the Google folks wrote on SRE stuff, right?

Josh Duffney:
Yeah. There’s a workbook, there’s an implementing one, and then there’s the actual, like the very first SRE book that came out.

Mike Pfeiffer:
And so what do they get into in those books? Is it just kind of like the framework of the job that they have at Google? Is that kind of what they’re exploring?

Josh Duffney:
Yeah. The first one is like the philosophy and then the implementation of it, and I haven’t delved into the workbook, but I think it’s more of how to apply it, apply the principles.

Mike Pfeiffer:
Chris Miller, what’s up, man? It’s good to see you. I’ve known Chris Miller for… I don’t know, it’s got to be close to 20 years now. Good to see you, Chris. He’s saying, “Which certification do you recommend for cloud?” And I think there’s so many variables there. It depends on what you’re working on and where you work. I know Chris, at least when I knew you in the old days and we used to work together, you were a big Microsoft guy. So, when I’m thinking about that kind of stuff, it’s like that’s part of the variables. What’s my experience been? Where am I trying to go? Maybe I’m trying to do something different. So, say if you’re currently doing Windows server administration, then Azure is going to be a natural and easy move for you, doing what Josh did and getting the AZ-103. Awesome way to start. Even AZ-900, which is the fundamentals one, teaches you an insane amount about the Azure platform.

Mike Pfeiffer:
But, you know, if you work at a place where it’s not a Microsoft shop, then Azure is still awesome, but if you’re a Linux person, it’s going to be maybe some more things to learn if you’re getting into a Microsoft based environment. So, it really depends on the variables of where you’re at, where you’re going, but any of the expert level ones, the one that was mentioned in the chat or the comments by Jeff, he mentioned he picked up the AZ-400, which is Microsoft’s DevOps expert certification, that actually has prerequisites. That’s a tough one to go after. You could do it, but you got to do AZ-103 or AZ-203 first, so admin or developer.

Mike Pfeiffer:
Amazon actually has their DevOps certification. And the thing that’s interesting is in the early days they required prerequisites, and then their customers complained so much, a lot of people just wanted to go straight into the DevOps cert, so they took off the prerequisites. So, you can just go straight into it now. It kind of depends on do you want to be in a DevOps position? I think most of us are going to get pulled into it if we’re going to remain practitioners anyway because everything’s virtual at this point. Obviously, some people still do hardware, and we’re doing a lot of hybrid, but we’re in a world now where you can automate anything, and everything can be defined in software. So, applying software practices to all this stuff’s going to be important. So, for me, my vision personally is to double down on that concept, whether we’re calling it DevOps or SRE. I think DevOps is pretty much, if you look at the job market, that’s what’s got the most attention right now. What do you think about the question that Chris had there, Josh, about how to start with certs? You think I was on point there?

Josh Duffney:
I think, yeah. It really depends on your background. I mean, I kind of went through this myself. I got a traditional Microsoft background, Windows server administration, huge PowerShell fan. I went down the AWS route for various reasons, and I struggled a little bit. In retrospect, the reason was I lost the community that I had grown up with kind of in my career. There weren’t as many… I mean, there was a few, and so I had some guidance there. So, I’d really reflect on your skillset and see what would be a good path. But then the AZ-103 for Azure, or the solutions architect, which is, I think, the entry one or one of the entry ones for AWS, and kind of go that route. But now that I understand both, I can float in between each cloud, I… Azure was a little bit easier for me to immediately grok, and mainly because it had a really awesome PowerShell module, so I was able to use my PowerShell experience as a catalyst to learn Azure.

Mike Pfeiffer:
Yeah. I think that just getting one under your belt is all that matters. It’s just like learning a programming language and then making a lateral move to one that’s kind of similar. It’s like once you did the hard work of learning the first thing, moving to another one’s way easier. I think that all these patterns are the important thing. The platform you’re deploying to or the technology that you’re using, less important. The patterns, the mindset, that’s the important piece. So, source control, even understanding a little bit more about applications and application development, even if you’re not a developer I think is important. But CI/CD, all that stuff, pipelines, security, all the automation things, right? Those are important patterns and practices to start getting on your radar. That’s awesome stuff. All right, tons of comments here. Over 100 people on the live stream. I think that’s more than… I don’t think we’ve ever gone over 100, so that’s…

Josh Duffney:
One quick comment on the certifications.

Mike Pfeiffer:
Yeah.

Josh Duffney:
The biggest benefit that I got from getting certified in the AZ-103 is it gave me confidence in the cloud, and so now I spend a lot of my time experimenting in my dev environment that I use. I have to do a lot of [inaudible 00:23:50] development. It’s now in Azure. So, that was actually the benefit is that it got me over that hurdle of comfortability, and now I understand I can keep my bill under a certain amount, and I use my own account. So, it got me through a lot of that uncomfortability. It got me more comfortable at using it, and now my experimentation is going to explode in that particular technology stack and I’m going to get way more confident in it because I took the time to study for that exam and I understand what I’m doing there and I’m not afraid of deploying a resource and it costing me $100 and stuff like that.

Mike Pfeiffer:
Yeah, definitely. I really appreciate Jeff Truman doing hard work in the comments section helping people and chiming in. Thanks Jeff, man. Really appreciate it. Henry said, “Hey Mike and Josh, it sort of appears like there’s nowhere to learn SRE unless you’re talking about the books Josh was mentioning earlier.” Right? Josh, is there anything other than the books that’s out there that you’ve seen?

Josh Duffney:
It’s kind of hidden. My absolute favorite book, I think it’s The Practice of Cloud System Administration. I always butcher his last name. I won’t even try to attempt it. But that actually models a lot of the SRE book. There’s a lot of overlap in those two, and so usually it’s there, but it’s branded something else. It’s not the specific SRE implementation, but there’s a lot of patterns and practices that are taken from SRE that have existed or are in different forms around DevOps. So, there’s the Effective DevOps book. It’s really just the keywords used in there that differ, but a lot of the practices inside are the same.

Mike Pfeiffer:
Yeah. Yeah. Peter was saying in the comments, I’ve gotten to appreciate certification as a way to show off your willingness to do continuous learning, evolving your skills. That’s actually what I was alluding to earlier, Peter, and I love that you made that comment, because when I’m looking at resumes, because I hire contractors all the time, and when I worked at AWS, it was like annoying the amount of people that I had to interview all the time. There was so much constant hiring. Right? And so when you’re looking at it from that view, you’re always looking for what’s this person better for? Right? And I think that willingness to get in the game, to be doing the stuff that’s outside the day job, even though that’s hard and it’s extra work, but it really is your responsibility.

Mike Pfeiffer:
You got to treat your career like a business, because if you don’t, your job could change immediately and you just lose it. And then if you weren’t thinking anything about the career, now you’re screwed. Right? So, I love that you put that in there, Peter. Nick Collier’s in the house. What’s up, man? [inaudible 00:26:18], friend. What has he said? “Hey, would love to hear your perspective on cross team automation.” You’ve seen lots of customers do automation in silos, not enough cross team automation to make DevOps practices effective. Yeah. You know, I think for me, the customers that we’re working with are very, very early, so I’ll let Josh answer this one and then I’ll chime in. Josh, go ahead.

Josh Duffney:
He hit the nail on the head of the hugest contention with a lot of the operations teams where I currently work. And it’s this, it’s like how do we collaborate together? And so that Team Topologies book gives you a really good framework. It talks about the different communication modes, how there’s collaboration, and that’s usually our default. We’re told to over communicate. Communication is very expensive, and so there’s another mode that’s introduced in the book called X as a service. And so what we’re working on is trying to define that interface between the other operations teams of that automation, and trying to define better ways to consume those resources, those shared pieces of automation through a better interface, and that interface is X as a service.

Josh Duffney:
So, in our case, it’s abstracting our Ansible playbooks into a role that can be consumed by the way that we deploy automation and the way that you do, but we’re still using that same couple RS shared automation, and so you have to put some more thought into how can I make whatever this is that I wrote more consumable, consumable from another team’s perspective, and you have to shift into what the book calls an enabling team at that point, understand their needs and figure out how you can make a clean interface between the two that’s automated, that doesn’t require collaboration or you to explain the script or to explain your Wiki page.
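As an editorial aside, the "X as a service" idea can be pictured as a small, validated contract that the consuming team fills in, with everything behind it hidden. The Python sketch below is a loose illustration only: `ServerRequest`, `ALLOWED_SIZES`, and `build_role_variables` are hypothetical names, and the step that would hand the resulting variables to the shared Ansible role is deliberately left out.

```python
from dataclasses import dataclass
from typing import Dict

# Hypothetical options the providing team has agreed to support.
ALLOWED_SIZES = ("small", "medium", "large")


@dataclass(frozen=True)
class ServerRequest:
    """The only inputs a consuming team has to supply."""

    name: str
    size: str = "small"


def build_role_variables(request: ServerRequest) -> Dict[str, str]:
    """Validate a request and translate it into variables for shared automation.

    Consumers never see how the automation works; they only see this contract
    and a clear error message when they break it.
    """
    if not request.name:
        raise ValueError("name is required")
    if request.size not in ALLOWED_SIZES:
        raise ValueError(f"size must be one of {ALLOWED_SIZES}")
    return {"server_name": request.name, "server_size": request.size}
```

The design choice is that the error messages and defaults do the explaining, so another team can consume the automation without a meeting or a wiki page.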

Mike Pfeiffer:
Yeah. And to Nick, Nick also followed up with something where he said he was getting into this in The Unicorn Project, the book that recently came out from Gene Kim. Nick was saying that this topic of cross team collaboration is a big focus in that book. He was talking about it with his team, and that’s one of the reasons why he brought it up. I’ll take a higher level road with that. I would say that it all comes back to leadership and culture. The leadership of the company has to enable that cross team collaboration through the culture, how the company rolls. You can’t just go in your own team and then go champion some cross team collaboration thing from the inside of one team. It has to come top down. I’ve been in so many situations where the customer contact who’s the leadership person or the executive, they just want to buy DevOps or they want to buy SRE, right?

Mike Pfeiffer:
So, I haven’t dealt with any customers that even have SRE on their radar, but you get the idea. It’s like you can’t just buy it off a shelf. No matter what you do, there is going to be a hill to climb, and you’re going to have to kind of earn it, man. It’s a big change for people, and people are struggling with it, to be honest. If you’re not Netflix, if you’re not Amazon or Microsoft, it’s going to be tough. You know? It’s going to be a lot of work. All right, let’s take some more questions. Really appreciate all the engagement. This is awesome. 116 people on the live stream, Josh.

Josh Duffney:
That’s cool.

Mike Pfeiffer:
Josh, you’re going to break the internet, man.

Josh Duffney:
Unlikely.

Mike Pfeiffer:
All right, well maybe we’re getting too excited with 115 people, but anyways, let’s see. Couple of questions here. Mohammad, “I feel DevOps role is more specific to individual services where SRE is more broad.” Okay, cool. It could be SRE manages two or more, blah, blah, blah. What about some questions? Oh, yeah, just stop, collaborate and listen. That’s awesome. Cool, man. So, what do you think Josh, for people that’s… or whether it’s SRE or DevOps, what are some resources beyond the stuff that we’ve talked about that people could go off and take a look at?

Josh Duffney:
I don’t know. That’s a good question. So, if you’re looking for certification exams, I can throw out the content that you’ve done. I’ll put it in your newsletter for you because I’ve gotten so much value from that. So, sign up for Mike’s newsletter for sure, because he’s got so much great stuff in there and the different tool chains, and pay attention to the tools that he’s mentioned there. So, the Terraform that was mentioned a couple of weeks back. It’s got a lot of Kubernetes content in there, and so understanding those tool chains or that technology would be greatly beneficial in those roles. I’d definitely dive into, if you haven’t… if you’re looking to go into DevOps and you don’t understand a release pipeline, there’s a fantastic white paper by Steve Murawski and Michael Greene called The Release Pipeline Model. Read that. That’ll give you a good framework, a mental model, for the different components of continuous integration (CI) and continuous deployment (CD).

Josh Duffney:
Those are both independent books that you can Google and Amazon, continuous integration, continuous deployment, that you can read. So, I would get familiar with software engineering if you’re trying to get into DevOps, from those two, the pipeline perspective. And then infrastructure as code is really a good foundational point for people that are coming from operations backgrounds, is how do I take my automation from infrastructure code and kind of put some software engineering around it?

Mike Pfeiffer:
Cool. So, Gregor said, “I got asked when I came up with the role, why do we need SREs?” And he wants to know how would you answer that question today?

Josh Duffney:
Why do we need SREs? I would say to improve the stability of whatever application. I think one thing that will get you in trouble is if you want to improve the stability of the environment or everything, like it has to be pretty specific to an application to start. That’d be my answer if I was trying to sell it.

Mike Pfeiffer:
That was a good one. I like that. But yeah, I think I would say to that, I think when you’re talking to people, what I would ask would be something that the Google guys were saying in some of those other podcasts we were talking about, which is when you’re talking to somebody as a potential SRE or even a DevOps candidate and you ask them when was the last time you got so frustrated by a manual task that you went out and automated it, they’re like, “Oh, well I haven’t,” or “It took me six months.” They might not be the right person for that, because that’s what we need is people thinking ahead of like how do we increase the stability? So, that mindset of how do we get ahead of this, you know? That puts you into a position where you’re a culture fit. Right?

Mike Pfeiffer:
So, I think that for people listening that are new to this stuff, getting into the game of getting your hands dirty, whether it’s open source projects or certifications, which are really important, that is going to position you as somebody that has got their eye on the ball, right? And they’re not just in their own world, staying in their lane in their current job, they’re paying attention to the industry. Those are the type of people that I would want on my team. So, that was awesome. Thanks so much, Gregor. Kieffer has a really important question, Josh. He says, “What’s a position to look for if you’re a junior cloud engineer or a junior DevOps engineer?”

Josh Duffney:
A position to look for?

Mike Pfeiffer:
Yeah. So, he’s pretty much saying, if he’s in a junior role, what should he be looking at to level up to the next pay grade?

Josh Duffney:
That’s good. I would continue to look for opportunities for automation. Just like what you were saying, that actually gave me a better answer or an idea for a better answer for the SRE question, which is always be looking to reduce that toil. That’s a concept in SRE where you want your toil to… basically your engineering work to your manual work to be 50/50, or the manual work lower than 50%. So, showcasing your ability to reduce that toil is going to be the best beacon of your achievement you can have at a junior level.
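To make the 50/50 idea concrete, here is a minimal Python sketch that computes the share of logged hours spent on toil and checks it against a 50% target. The `WorkLog` type, the two categories of hours, and the sample numbers are assumptions made for illustration, not something described in the conversation.

```python
from dataclasses import dataclass


@dataclass
class WorkLog:
    """Hours logged in one period, split by kind of work (hypothetical categories)."""
    toil_hours: float         # manual, repetitive operational work
    engineering_hours: float  # automation and project work


def toil_fraction(log: WorkLog) -> float:
    """Return the share of total hours spent on toil, between 0.0 and 1.0."""
    total = log.toil_hours + log.engineering_hours
    if total <= 0:
        raise ValueError("no hours logged")
    return log.toil_hours / total


def within_toil_target(log: WorkLog, target: float = 0.5) -> bool:
    """True when toil is at or below the target share (50% by default)."""
    return toil_fraction(log) <= target


# Made-up numbers: the same 40-hour week before and after automating a task.
before = WorkLog(toil_hours=26, engineering_hours=14)
after = WorkLog(toil_hours=12, engineering_hours=28)

print(f"before: {toil_fraction(before):.0%} toil, on target: {within_toil_target(before)}")
print(f"after:  {toil_fraction(after):.0%} toil, on target: {within_toil_target(after)}")
```

Tracking a before and after number like this is one way to record the toil-reduction stories that come up in a promotion or interview conversation.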

Josh Duffney:
Aside from that, the stuff that we’ve already talked about with certification tracks and getting in the game and improving your own skillsets, whether that be whatever the tool chain is that you’re currently using, but make sure that you write down those achievements of that toil reduction, because those are going to be the really good stories that you can tell when you go in for a promotion or you go in for another job. Those are the stories that people want to hear, is how did you apply this problem, how did you get rid of it forever, and how did you get it so you didn’t have to click that button anymore? Even if you wrote a script, what did you do to click the button for you?

Mike Pfeiffer:
Yeah, so when I worked at Amazon, Werner Vogels is the CTO of Amazon, right? He’s still there to this day. But when I was working there, I used to listen to him a lot because he was traveling all over the world and doing talks. I think he still does that, but he got famous, or he was famously quoted years ago, eight, nine years ago of saying, “Everything fails all the time.” And one of the first, I think the first cloud certification that AWS ever had was the architecture one. It’s all about eliminating single points of failure and stuff like that. But I think another aspect to this stuff is accepting that, that things will fail. Most of us are way too caught up in trying to make things perfect. Perfection isn’t going to happen. It’s not practical, and that’s something that the Google folks echoed as well.

Mike Pfeiffer:
So, Dave Rensin, he’s actually the guy at Google who I was talking about earlier from the podcast. He’s one of their big time engineering managers, and I listened to a show with him. He was talking about error budgets at Google, right? So, they have an error budget policy where it’s like we’ll accept a certain number of failures and then we’ll call that good. Right? And he’s like, “If you have a lot of leftover budget because you’re not having the failures, that’s not good. That’s actually a problem, because now you’ve over engineered the solution, you’re throwing more money at this thing, more people, more resources at it.” And so that’s kind of a fascinating concept there of embracing it and just being okay with it and knowing, “Hey, we got to engineer around this and know that this is going to tank at some point. How do we mitigate that?”
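A minimal Python sketch of the arithmetic behind an error budget, assuming the budget is derived from a target success rate over a window of requests. The 99.9% target, the request counts, and the function names are made up for illustration; the conversation does not say how Google computes its budgets.

```python
def error_budget(target: float, total_requests: int) -> float:
    """Number of failed requests the target tolerates over the window."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be between 0 and 1, exclusive")
    return (1.0 - target) * total_requests


def budget_remaining(target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent; negative means overspent."""
    budget = error_budget(target, total_requests)
    return (budget - failed_requests) / budget


# Made-up window: one million requests against an assumed 99.9% success target.
TARGET = 0.999
TOTAL = 1_000_000

allowed = error_budget(TARGET, TOTAL)
print(f"allowed failures: about {allowed:.0f}")

for failed in (50, 250, 1200):
    left = budget_remaining(TARGET, TOTAL, failed)
    print(f"{failed} failures -> {left:.0%} of budget left")
```

Read against the point above, a result that stays close to 100% left, window after window, is the "leftover budget" case that suggests the system may be over engineered, while a negative result means the accepted amount of failure has been exceeded.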

Mike Pfeiffer:
And so I think that mindset is going to be important for folks. Right? So, embracing the failure. Let’s see if we got any other questions here. Kevin Sapp said, “Is there normally tension within companies between cloud DevOps, SRE teams, since they use similar tools?” There’s definitely, in my view, tension within companies, 100%. It’s the guaranteed thing in my view. What do you think, Josh?

Josh Duffney:
I think it all depends on where they’re at. So, I can definitely see if you got your DevOps team that’s over on your ops camp, and you got your SRE that just emerged organically from the dev side where there’d be contention. In my current organization, we have SRO, site reliability operator, not engineer, and so they’re really focused… we basically functionally siloed a lot of things that we’re working through. But they mainly manage our monitoring platform and have a lot of expertise in that. Then we’re over here doing the infrastructure as code. And so there’s definitely some contention when we try to use the same tools because those interfaces aren’t clear. So, I definitely see that. I definitely see that as well.

Mike Pfeiffer:
Yeah, and I see it a lot too with just the resistance to change and not wanting to do the next thing, especially when you’ve got this other team that is threatening your time, right? Like they’re trying to change all this stuff, and if that happens, then it’s going to screw me over and I have to wake up and do all this stuff, and so I’m constantly fighting that as well. But again, that’s a thing that has to come from the top down. It’s a culture thing. But I got a call at the top of the hour that I got to get ready for, so I’d like to wrap this one up. Any outgoing thoughts here, Josh, before we close this one up?

Josh Duffney:
Let’s see here. I would just recommend that people really get a good grasp of the job descriptions that they’re applying for and try to identify what their gaps are. I mean, that’s what I spent probably the last six to eight months doing is going out there and just looking at a lot of different things to identify where my gaps are. I know I’ve had the DevOps engineer title for a while, but I definitely had some gaps, and one of those was the cloud. And so I’m glad that I went out and did that discovery. So, take an honest assessment of your skills and use job descriptions and interviews to do that for you because it’ll be very clear where your gaps are in that environment, and then make a plan to work through those and to fill those gaps.

Mike Pfeiffer:
Great point, Josh. I love it, man, because you can gather a ton of intel from job postings. That will tell you where this thing is going if you’re paying attention. If you’ll take the time to read it and research it, 100%. Great one, Josh. I appreciate you guys. Thanks so much everybody for the engagement. This one was awesome. We got over 120 people on here right now. Lots of good comments. I appreciate everybody in there answering questions and helping each other out, especially Jeff Truman. Thanks, man. And everybody, really appreciate your time. Josh, thank you so much.

Josh Duffney:
Thank you, [crosstalk 00:38:41].

Mike Pfeiffer:
And we’ll set up another live stream pretty soon.

Josh Duffney:
Sounds good. Thank you.

This episode was originally posted on CloudSkills.io. To view the original article please see Episode 058: DevOps vs. Site Reliability Engineering (SRE)
