<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Adam Hawkins</title>
    <description>The latest articles on DEV Community by Adam Hawkins (@ahawkins).</description>
    <link>https://dev.to/ahawkins</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F58283%2Ffe05439f-2758-4ba8-b510-6bbe6257f8d5.jpeg</url>
      <title>DEV Community: Adam Hawkins</title>
      <link>https://dev.to/ahawkins</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ahawkins"/>
    <language>en</language>
    <item>
      <title>Continuous Delivery with Dave Farley</title>
      <dc:creator>Adam Hawkins</dc:creator>
      <pubDate>Mon, 07 Sep 2020 20:05:33 +0000</pubDate>
      <link>https://dev.to/ahawkins/continuous-delivery-with-dave-farley-36a3</link>
      <guid>https://dev.to/ahawkins/continuous-delivery-with-dave-farley-36a3</guid>
      <description>&lt;p&gt;The latest episode of Small Batches is out!&lt;br&gt;
This is a special episode of Small Batches. I interview Dave Farley in this episode.&lt;/p&gt;

&lt;p&gt;Dave, along with Jez Humble, is the co-author of "Continuous Delivery", published in 2010. The book introduced the ideas that grew into DevOps. So it's no surprise that DevOps and continuous delivery mean the same thing to most people.&lt;/p&gt;

&lt;p&gt;Together Dave and Jez introduced continuous delivery to the world. The practices and ideas still hold true ten years on.&lt;/p&gt;

&lt;p&gt;Time and research have demonstrated that continuous delivery is the most effective way to develop software. If you’ve read Accelerate then you know what I’m talking about. That’s partially why I am so passionate about it, and that doesn’t even account for the fun I have working in that environment.&lt;/p&gt;

&lt;p&gt;Dave and I talked about different aspects of continuous delivery, beginning with the difference between software development and software engineering. Or, as Dave puts it: scientific rationalism.&lt;/p&gt;

&lt;p&gt;We also speak about the connection between delivery, feedback, and experimentation. Or, as he puts it: "just doing engineering".&lt;/p&gt;

&lt;p&gt;He also shared why he doesn’t like the term DevOps. I gotta say, I tend to agree with him after hearing his reasoning.&lt;/p&gt;

&lt;p&gt;Lastly I get his view on the Preflight Checks I mentioned in an early episode of this podcast. Go to &lt;a href="https://smallbatches.fm/11"&gt;smallbatches.fm/11&lt;/a&gt; for that episode.&lt;/p&gt;

&lt;p&gt;Now I give you my conversation with Dave Farley.&lt;/p&gt;

&lt;p&gt;You can find Dave at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;a href="https://www.amazon.com/Continuous-Delivery-Deployment-Automation-Addison-Wesley/dp/0321601912"&gt;Continuous Delivery&lt;/a&gt; book&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/channel/UCCfqyGl3nq_V0bo64CjZh8g"&gt;His YouTube channel&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.davefarley.net"&gt;His blog&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.continuous-delivery.co.uk"&gt;His consulting services&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://twitter.com/davefarley77"&gt;On Twitter&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Listen to Continuous Delivery with Dave Farley on &lt;a href="https://share.transistor.fm/s/675012fe"&gt;smallbatches.fm&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>continuousdelivery</category>
      <category>sre</category>
    </item>
    <item>
      <title>Parts Unlimited</title>
      <dc:creator>Adam Hawkins</dc:creator>
      <pubDate>Tue, 25 Aug 2020 02:46:06 +0000</pubDate>
      <link>https://dev.to/ahawkins/parts-unlimited-1imj</link>
      <guid>https://dev.to/ahawkins/parts-unlimited-1imj</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a podcast episode transcript. Visit the &lt;a href="https://share.transistor.fm/s/a56cfe60"&gt;podcast website&lt;/a&gt; for the episode, show notes, and other freebies.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Hello, everybody. Welcome back to the next episode of Small Batches. I'm going to begin this episode with some housekeeping and an update on the show.&lt;/p&gt;

&lt;p&gt;I produce these episodes by first coming up with an idea. Each of these ideas relates to software delivery, velocity, quality, or reliability. Then I write a script. The idea is that the episodes are about five to eight minutes long. I think that's a good length because they fit into any schedule, and you can listen to one or a whole bunch from the back catalog.&lt;/p&gt;

&lt;p&gt;Last episode, I wanted to try something new for me, but also for you as the listener, as a way to bring some more liveliness and personality to the show. I published the episode and got the first feedback ever on any of the episodes. The feedback was positive. So I'm going to do more episodes in that direction, because I think it will add personality to the show. It will give you something different, and episodes might be a bit longer, but that's okay.&lt;/p&gt;

&lt;p&gt;You'll hear more context, more background, more thoughts, all that kind of stuff. But rest assured I will still do the five-to-eight-minute episodes written and recorded in that fashion, because that gives me a medium to convey a really specific amount of information. It's nutrient dense in that way.&lt;/p&gt;

&lt;p&gt;This brings me to my next point about the future of the show. The big news is that I'm going to start doing interviews on small batches. I think this will bring more personality to the show plus you'll get to see me interact with different guests as we work through ideas and all that.&lt;/p&gt;

&lt;p&gt;I already have a nice pipeline lined up for the next few months. The first interview episode will be coming out soon. I've already recorded a few actually, and I'm super excited to share the first one with you, but for now I'll let that be a surprise.&lt;/p&gt;

&lt;p&gt;Anyway, I'm sticking with the biweekly schedule for the time being. You'd be surprised how much time it takes to produce this podcast. So I'm proud of myself that I've been able to consistently put out episodes so far, at least. So fingers crossed.&lt;/p&gt;

&lt;p&gt;I considered maybe going to a weekly schedule, but frankly, that just takes too much time right now. But maybe this will change in the future. So let's see what happens.&lt;/p&gt;

&lt;p&gt;Alright, that's a wrap on the update. So look forward to more laid back conversational episodes and interviews all aimed at leveling up velocity, quality and reliability.  Now let's get into today's episode.&lt;/p&gt;

&lt;h2&gt;
  
  
  Parts Unlimited
&lt;/h2&gt;

&lt;p&gt;Today, I'm going to tell the story of Parts Unlimited. Parts Unlimited is a fictional company described in two different books. The first is The Phoenix Project, published in 2013. The second is The Unicorn Project, published in 2019. Both books are published by IT Revolution Press.&lt;/p&gt;

&lt;p&gt;IT Revolution is the publisher behind the DevOps Handbook, Accelerate, and a bunch of other books in this theme of software delivery, business performance, operations, all this kind of stuff.&lt;/p&gt;

&lt;p&gt;First off, spoilers ahead for these two books; I'm going to talk about everything that happens. These are fictional books written for software professionals and people who work in IT. They chronicle the stories of different people in these organizations: the challenges they have to overcome, what they do, how they react to situations, how they handle problems, this type of stuff.&lt;/p&gt;

&lt;p&gt;So let's begin with some context. Parts Unlimited is an auto parts company. You can think of them like O'Reilly or NAPA. They are definitely an established business. They've been around for decades and they are trying to come into the so-called digital world, trying to compete with other companies that have better online experiences. They're kind of a representation of an old incumbent player who needs to adopt new ways of working and thinking, becoming what we now call a technology-first company to compete in today's market.&lt;/p&gt;

&lt;p&gt;Both books begin at the same point, but cover different characters in the same story. The first book, The Phoenix Project, covers the story from an ops perspective. The second book, The Unicorn Project, covers the story from a development perspective.&lt;/p&gt;

&lt;p&gt;They start in the same place. Parts Unlimited has just experienced a huge production outage, a huge issue that has resulted in messed-up billing, firings, really, really bad stuff. The kind of thing that made me grimace and think, oh my God, I never want to work in a place like this. Luckily I haven't, but fingers crossed that it never happens.&lt;/p&gt;

&lt;p&gt;There are a few main characters in these books. The first character is Bill. Bill is the reluctant CTO who gets promoted after these firings and ultimately is in charge of righting the ship.&lt;/p&gt;

&lt;p&gt;In the second book you have Maxine. Maxine is a developer who is kind of an avatar for what we would consider a pretty good software developer: someone who has experienced different ways of working, knows what works and what doesn't, and does things like automated testing and continuous delivery. After this shakeup, she gets relegated to work on what is called the Phoenix project.&lt;/p&gt;

&lt;p&gt;The Phoenix project is a code name for the next-generation system of Parts Unlimited. No surprise, this project is years delayed and drastically over budget, but the company--of course--is betting their future on this project. As the project gets delayed more and more, it only gets delayed longer because of--well, you know what happens to projects that go on for a long time and never ship?&lt;/p&gt;

&lt;p&gt;Well, the scope gets bigger. Requirements change. Things happen to continually push back the project. As a result, releases take longer. They become more difficult. Things are more likely to break, with more negative consequences, yadda yadda yadda. It's a negative feedback loop: the opposite of short delivery cycles and working in small batches.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Characters
&lt;/h2&gt;

&lt;p&gt;That's Bill and Maxine, the two main characters from the different books. The characters are shared across these two stories, so whether you read one or the other, you'll see these people show up in different ways.&lt;/p&gt;

&lt;p&gt;One of the other main characters is Eric. He's actually my favorite character. He's this mysterious sensei-type character. In my mind, I pictured him, or at least the way he was portrayed, as one of those California hippie types with long hair: driving a convertible, living in Santa Cruz, surfing, kind of out there, but he knows what he's talking about. They call him a sensei, and he refers to the people he's learned from as senseis as well.&lt;/p&gt;

&lt;p&gt;He guides Bill and Maxine in the two different books toward different objectives. In The Phoenix Project, Eric guides Bill through what we now call the Three Ways of DevOps. In The Unicorn Project, Eric and others guide Maxine into uncovering the Five Ideals, which we'll get into later.&lt;/p&gt;

&lt;p&gt;Brent is the engineer who is totally overloaded because he works in a lot of different areas. He knows a lot of stuff and unfortunately many different things have to go through Brent because he's the only one who knows how to do it or really the only one who can get it done. He's a bottleneck in the process for sure.&lt;/p&gt;

&lt;p&gt;You have John, who is the annoying security engineer. He's always trying to stop releases and inject requirements at the end of the process. In The Phoenix Project they make no bones about it: they do not like John at all.&lt;/p&gt;

&lt;p&gt;It's really interesting to see the transformation in John as he realizes how much of a problem he is, and that the way he is approaching his goal of improving information security is having the exact opposite effect. As a result, people just avoid him and don't take him seriously.&lt;/p&gt;

&lt;p&gt;You also have the classic infighting of executives and project planners: the bureaucracy of people who want this or don't want that, and the people who are trying to provide cover to different teams or empower different things in the organization. The point is that these executives represent cross purposes in leadership, and that's a problem.&lt;/p&gt;

&lt;p&gt;One other main character is Kurt. Kurt is a pretty smart--what was he called--QA engineer. He plays a big role in both of the books as a powerful force for getting things done.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Plot
&lt;/h2&gt;

&lt;p&gt;The real story of these two books is the Phoenix project. As I mentioned earlier, the Phoenix project is the code name for the next-generation version of the system that they're going to release. In order to really get out of the swamp that they're in, they need to drastically change their ways of working.&lt;/p&gt;

&lt;p&gt;Both of these books, like I said, cover the transformation inside the company to move away from--for want of a better term--waterfall development and months- or years-long cycles. They adopt the first way of DevOps, flow, through continuous delivery; the second way, feedback, using information and telemetry to make empirical decisions about what's happening; and the third way, experimentation and learning, along with continuous improvement, to get into this virtuous feedback loop.&lt;/p&gt;

&lt;p&gt;That's pretty much the whole story.&lt;/p&gt;

&lt;p&gt;So in this fictional story, you have Parts Unlimited, who is in a horrible place in the beginning--and I really mean horrible. If you read either of these books, you'll think, oh my God, I don't want to be involved in anything like this.&lt;/p&gt;

&lt;p&gt;Then, surprise, surprise: they adopt DevOps and do these things. Things start to go well. Everybody's happy at the end. The Phoenix project is released. People get promotions. Everything is good. Major success.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Phoenix Project
&lt;/h2&gt;

&lt;p&gt;Now let's move on to the different high-level theories introduced in these books. The Phoenix Project introduces the Three Ways of DevOps. It actually came out before the DevOps Handbook; I think the DevOps Handbook was published a few years later. So in The Phoenix Project, Bill uncovers the Three Ways of DevOps guided by Eric, the experienced leader who has seen these things in practice.&lt;/p&gt;

&lt;p&gt;Eric takes Bill through the progression of lean manufacturing into DevOps. DevOps was certainly inspired by lean manufacturing. This is actually probably one of my favorite parts of The Phoenix Project: Eric takes Bill on gemba walks.&lt;/p&gt;

&lt;p&gt;A gemba walk is a concept from Toyota, where you go to the factory, see what's happening, watch how the work happens, and observe and learn from that.&lt;/p&gt;

&lt;p&gt;This is a real-life forum to see how things work, or how they don't. Eric uses this as a training session to introduce things like the theory of constraints to Bill, to help him understand Brent's situation. Remember that Brent is the overloaded engineer that everything has to go through.&lt;/p&gt;

&lt;p&gt;So imagine you're looking at a factory and you see an assembly line. If there's a bottleneck where everything in the factory has to go through one choke point, and that choke point is overloaded, then all the other things in the factory have to wait on it.&lt;/p&gt;

&lt;p&gt;That's an example of Eric teaching Bill how Brent is a bottleneck, and that's probably one of the first things he should work on improving in the organization.&lt;/p&gt;

&lt;p&gt;Eric shows Bill the Three Ways of DevOps through all these exercises, sometimes leaving him dead in the water when he has questions, letting him sink or swim for himself.&lt;/p&gt;

&lt;p&gt;That's The Phoenix Project and the Three Ways of DevOps: flow, feedback, and learning. We cover them a lot on the show, so I'm not really going to repeat them here. Just go to smallbatches.fm; the first three or four episodes of this podcast cover these things in depth.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Unicorn Project
&lt;/h2&gt;

&lt;p&gt;Next is The Unicorn Project, which is told from Maxine's perspective.&lt;/p&gt;

&lt;p&gt;Maxine is the great developer. She's relegated to the Phoenix project at the beginning of The Unicorn Project. The Unicorn Project is effectively a code name for taking the Phoenix project and transforming it into a better version of itself.&lt;/p&gt;

&lt;p&gt;This is where the name the Unicorn Project comes from: in their minds, it's almost impossible that this project could actually be transformed and they could succeed. Hence, the Unicorn Project.&lt;/p&gt;

&lt;p&gt;Maxine was actually inspired by Michael Nygard. If you don't know who Michael Nygard is, he's a pretty awesome guy. He wrote Release It!, first the first edition and now the second edition, which is a great book. Great book. You can go to my website, hawkins.io, to find a review and summary of Release It!&lt;/p&gt;

&lt;p&gt;If you haven't read it, then please do. It's great. It totally changed the way I thought about operations and releasability. So do check it out if you haven't. The second edition was released, I think, one or two years ago. It's new, covers all the stuff from the original book, and introduces some more stuff like chaos engineering.&lt;/p&gt;

&lt;p&gt;Plus, it's really fun to read. I don't know about you, but when I read some of these tech books, they're dry; they're just talking about ideas. But the way Michael writes, he adds so much life into the text. You can feel that this guy is with you, like you're having a beer with him and he's telling you these stories. It's actually fun to read, and that's rare in tech books.&lt;/p&gt;

&lt;p&gt;Back to the Unicorn Project. Let me pull up my notes here and tell you the progression, like sort of the outline for what happens in The Unicorn Project.&lt;/p&gt;

&lt;p&gt;First, Maxine is exiled to the Phoenix project as the scapegoat for the payroll outage. Scapegoat: immediate red flag. They had this whole organizational shakeup and they had to put the blame on somebody. Of course, Maxine had nothing to do with it, but they needed somebody to blame. They blame Maxine. They relegate her to the dustbin of the organization, which is the Phoenix project.&lt;/p&gt;

&lt;p&gt;And when Maxine gets there, she can't believe the ghetto she finds herself in: months to get a working environment on her laptop, hundreds of tickets filed, and so much bureaucracy to get anything done, like getting a test environment or deploying a change to production.&lt;/p&gt;

&lt;p&gt;Now bear in mind that Maxine is coming from a previous working environment where all this was the opposite. There she was able to clone the code, start working on her machine, and start making immediate changes. Now she's stuck on the Phoenix project, where everything takes months.&lt;/p&gt;

&lt;p&gt;Actually, if I remember correctly, the book has this internal dialogue with Maxine where she is just really frustrated, really sad, and probably borderline depressed about the whole situation. She wants to quit; overall it's just really not a good thing.&lt;/p&gt;

&lt;p&gt;The thing about Maxine is that she is not happy with the status quo. That's a good thing. She wants to do something about it. She knows that there's a better way to work. So she starts what they call a rebellion in the company.&lt;/p&gt;

&lt;p&gt;What she does is get a bunch of people together who are aligned to her cause. They meet at a place called the Dockside Bar after work, where they talk about what's happening and what to do, figure out ways to bring more people into their cause, and continue to expand this rebellion in the organization.&lt;/p&gt;

&lt;p&gt;The rebellion's goal is to effectively apply DevOps to the Unicorn Project. Kurt, who I mentioned earlier, is kind of the QA manager. He's part of this rebellion, and there's an internal shakeup that allows Kurt to get his own team.&lt;/p&gt;

&lt;p&gt;Now bear in mind that this is a hidden rebellion. People don't know that it's happening. So Kurt gets his team and uses it as organizational cover to just ignore the status quo, get things done, and allow his team to work in whatever way is best to deliver what they need to.&lt;/p&gt;

&lt;p&gt;One of the things I like about Kurt in this part of The Unicorn Project is that the book doesn't really tell you what he does, but he's a stand-in for the manager, the person higher up in the organization, who knows how to play the politics effectively: making sure his team is isolated from whatever politics and other things are going on that may inhibit them, such that they can just work effectively.&lt;/p&gt;

&lt;p&gt;Once they have this organizational cover, they start doing things like setting up continuous integration, automated environment provisioning, and automated deployments.&lt;/p&gt;

&lt;p&gt;They, being Parts Unlimited, are in a time crunch to get the Phoenix project out the door, because they're losing money and this thing eventually has to ship.&lt;/p&gt;

&lt;p&gt;They use this to cut through a lot of the red tape required to launch a successful Black Friday promotion that demonstrates the business value of this new way of working. They pull the whole thing off. It's a wild success. Everybody is happy. Maxine is promoted to distinguished engineer, with a job description of effectively what she did throughout the book.&lt;/p&gt;

&lt;p&gt;Cut to black.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Five Ideals
&lt;/h2&gt;

&lt;p&gt;Throughout this whole process she discovers what are called the Five Ideals.&lt;/p&gt;

&lt;p&gt;Let me just read off these five ideals to you.&lt;/p&gt;

&lt;p&gt;The first one is locality and simplicity. The idea here is that you shouldn't have to load up a huge amount of global context to make a small, pointed change.&lt;/p&gt;

&lt;p&gt;Again, coming back to Maxine, who was previously working in a better environment than the one she landed in: she was able to have the code on her machine, work with it, make a change, test it, and get it out. Simple, right? Not having to provision N huge things and wait for all of it, but being able to focus on local, small changes.&lt;/p&gt;

&lt;p&gt;The next one, number two: focus, flow, and joy. This, to me, is the first way of DevOps in action--continuous delivery. Developers should be able to enter the so-called flow state, where they're able to work through a problem, focus on it, and work in a way that makes them happy. Key thing: being happy.&lt;/p&gt;

&lt;p&gt;If you've worked in an organization where the work you do is soul-sucking, or just takes a long time, or you just think to yourself, man, I do not want to do this, this just sucks--well, let's avoid that. Let's create systems and ways of working that actually bring joy into people's work. This is what continuous delivery has been proven to do.&lt;/p&gt;

&lt;p&gt;They mention in Accelerate that continuous delivery is a direct contributor to employee happiness and satisfaction.&lt;/p&gt;

&lt;p&gt;Alright, number three: improvement of the daily work. If you've read Mike Rother's Toyota Kata, you know what I'm getting at with this one. The idea here is that we make the improvement of the daily work the daily work.&lt;/p&gt;

&lt;p&gt;I think this is episode four of Small Batches, so go to smallbatches.fm/4 to learn more about the Toyota Kata and improvement of the daily work.&lt;/p&gt;

&lt;p&gt;Ideal four: psychological safety. This is a new one for me, in that the first three I think are related to DevOps: the DevOps Handbook, Accelerate, all that.&lt;/p&gt;

&lt;p&gt;Psychological safety is the idea that people should feel safe in their work: secure in the sense that they don't have to worry about being fired for making a mistake or shipping some bug to production, or that the work they do is going to create issues.&lt;/p&gt;

&lt;p&gt;So bringing us back to the beginning of both The Unicorn Project and The Phoenix Project: horrible, right? Some poor soul made a mistake. We've all made mistakes. I've shipped plenty of bugs to production that have had negative consequences. I certainly wouldn't want to be fired over that, but that's what happens at Parts Unlimited. One facet of this ideal is that you shouldn't have to worry about huge negative ramifications from the things that are bound to happen to every person who works in software.&lt;/p&gt;

&lt;p&gt;We also have to build tools that promote safety, like building automated testing into the deployment pipelines, such that we don't ship regressions into production, or we're confident that the code or the change in question is not going to break production in any known way.&lt;/p&gt;

&lt;p&gt;If developers feel safe to make changes, they're more likely to make changes. If they're more likely to make changes, they're more likely to deliver on the business outcomes you're trying to achieve.&lt;/p&gt;

&lt;p&gt;The fifth and final one is customer focus. In The Unicorn Project and The Phoenix Project, the employees have to go to one of the physical storefronts: an actual brick-and-mortar place where the company sells parts to the customer in exchange for money. This was definitely different from working in a purely online business.&lt;/p&gt;

&lt;p&gt;In The Phoenix Project, I think Bill goes there and gets a feel for the current system, like the sales system running on the terminals in the stores. He gets a feel for the problems of the employees and what the Phoenix project actually needs to do to improve the workflow for these frontline employees.&lt;/p&gt;

&lt;p&gt;Same thing in The Unicorn Project: Maxine goes to a storefront and sees what's happening with the software she's responsible for and how it interplays with the frontline employees and the end customers.&lt;/p&gt;

&lt;p&gt;The idea behind the fifth ideal is that developers should focus on the customer, so they can see that the work they're doing is having an impact on the end users of the whole system.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;That's the five ideals.&lt;/p&gt;

&lt;p&gt;After I read The Unicorn Project, I started to think that there's actually a three-four-five ladder here. If you think of these things in sequence, you'll see how they all play together.&lt;/p&gt;

&lt;p&gt;When I say three, four, five, I mean the Three Ways of DevOps: flow, feedback, and learning; the four metrics of software delivery performance: deployment frequency, lead time, mean time to restore, and change failure rate; then the Five Ideals: locality and simplicity; focus, flow, and joy; improvement of the daily work; psychological safety; and customer focus. If you put these three things together, you have a good picture of how to think about this problem. It's just a good framing.&lt;/p&gt;

&lt;p&gt;I want to wrap up this episode by giving you some of my thoughts on these two books.&lt;/p&gt;

&lt;p&gt;First of all, I much prefer The Phoenix Project. I think that one was much more fun to read and much more interesting. Seeing things from Bill's perspective--the CTO, the manager-type person--was far more interesting to me because it was new territory. I also think the writing was better. It was just more fun.&lt;/p&gt;

&lt;p&gt;The Unicorn Project? Eh. I think you can pass on it. See some infographic or blog post about the Five Ideals and you've pretty much got the whole thing.&lt;/p&gt;

&lt;p&gt;Or take it from the other perspective. Say you're a manager, you associate yourself more with Bill, and you don't really know how developers think. Then reading The Unicorn Project would be more interesting than reading The Phoenix Project. So you can decide, based on what your role is, which one might make more sense to you; but I could pass on The Unicorn Project. I think The Phoenix Project is just more fun, and frankly, the content of The Unicorn Project was just more obvious to me.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who Should Read These Books?
&lt;/h2&gt;

&lt;p&gt;So who should read these books? Since you're listening to this podcast, you don't need to read them anymore, because I think I've given you all the information you need to know. However, you can recommend these books to your colleagues.&lt;/p&gt;

&lt;p&gt;If you're interested in this material from a fictional perspective--like, if you don't want to read, say, the DevOps Handbook or Accelerate, but you like reading stories--you can read both these books to get the high-level picture, then fill it in with more information by listening to podcasts like this one, reading the books, reading blog posts, whatever.&lt;/p&gt;

&lt;p&gt;So recommend The Unicorn Project to managers, and The Phoenix Project to more technical types, the engineers you may know.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrap-Up
&lt;/h2&gt;

&lt;p&gt;Alright, I think that's enough for this episode. We've covered The Unicorn Project, The Phoenix Project, the story of Parts Unlimited and their transformation of adopting DevOps principles, saving the Phoenix project, and going from an unprofitable, horrible place to work to a happy place to work that's making money; plus the Five Ideals and the Three Ways of DevOps.&lt;/p&gt;

&lt;p&gt;If you want to learn more about these books, head to smallbatches.fm for links. There are also plenty of blog posts with in-depth analysis of both these books, YouTube videos, and all that.&lt;/p&gt;

&lt;p&gt;That's a wrap on this one. I'll see you again for the next episode. It will be a special interview. So thank you for listening. See you later.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>continuousdelivery</category>
      <category>sre</category>
    </item>
    <item>
      <title>12.1 Factor Apps: Dev/Prod Parity</title>
      <dc:creator>Adam Hawkins</dc:creator>
      <pubDate>Mon, 01 Jun 2020 10:00:00 +0000</pubDate>
      <link>https://dev.to/ahawkins/12-1-factor-apps-dev-prod-parity-11m4</link>
      <guid>https://dev.to/ahawkins/12-1-factor-apps-dev-prod-parity-11m4</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a podcast episode transcript. Visit the &lt;a href="https://share.transistor.fm/s/e972e27b"&gt;podcast website&lt;/a&gt; for the episode, show notes, and other freebies.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Yo yo, Adam here to serve up the next episode of Small Batches in the 12.1 Factor App series. This episode covers the dev/prod parity factor. Let’s dive in.&lt;/p&gt;




&lt;p&gt;The original 12 factor app&lt;sup id="fnref1"&gt;1&lt;/sup&gt; states that applications should use the same versions of backing services in development and production. Recall that a "Backing Service&lt;sup id="fnref2"&gt;2&lt;/sup&gt;" is any service the app consumes over the network. Examples include datastores like MySQL or Redis and external APIs like Amazon S3. The original 12 factor app guidelines for the "dev/prod" and "backing services" factors mostly focus on datastores. This opens the dev/prod factor to interpretation regarding external APIs, which leads to wildly different outcomes depending on the interpretation.&lt;/p&gt;

&lt;p&gt;If you take the original guidelines at face value, then your development environment will make network calls to the same versions of backing services. That assumes all backing services are in fact running and accessible. This is problematic for multiple reasons.&lt;/p&gt;

&lt;p&gt;Consider a distributed system. Is it possible—ignoring whatever effort that requires—to run all services in a single development environment? If so, does your hardware have the necessary compute resources to support it? If not, then what’s the solution and how much dev/prod parity is there as a result?&lt;/p&gt;

&lt;p&gt;Consider a third party API like Twitter. Is the development environment going to make real tweets? If so, to which Twitter account? Does each development environment need a separate Twitter account? If so, does that change the utility of the development environment?&lt;/p&gt;

&lt;p&gt;These examples show the added complexity in assuming that all backing services are available in the development environment. In the best case, it’s functional but fragile. In the worst case, it creates a mess of dependencies that saddles the team with toil&lt;sup id="fnref3"&gt;3&lt;/sup&gt;. In my view, these outcomes come from the assumption that development environments should be fully integrated environments masquerading behind the banner of dev/prod parity.&lt;/p&gt;

&lt;p&gt;The 12.1 factor app takes a different approach. The 12.1 factor app strives for dev/prod parity where practical and eschews it when not. This requires differentiating between bounded and unbounded contexts.&lt;/p&gt;

&lt;p&gt;A typical service uses a datastore and interacts with external services. In this case the datastore is within the bounded context. It’s assumed that service consumers will access data via an API instead of accessing the datastore directly because doing so violates the bounded context. In this case, achieving dev/prod parity is practical and certainly useful. The service must be developed and tested against the same version used in production. Tools like Docker make it trivially easy to do, so there’s no argument to be made against it.&lt;/p&gt;

&lt;p&gt;This typical service also interacts with external services. These are outside the bounded context because the service has no control over them but is still dependent on them. The 12.1 factor app eschews dev/prod parity for these backing services. These backing services should not be used in development or test. They should be replaced with mocks in tests and fakes in development. Mocks enable unit tests of interactions at the boundary. Fakes in development promote isolated environments and support forcing the consumer through behaviors which may not be possible in real environments.&lt;/p&gt;
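&lt;p&gt;Here’s a minimal Python sketch of that swap. The client names and the &lt;code&gt;APP_ENV&lt;/code&gt; variable are hypothetical, not from any real library; the point is that the consumer depends on an interface while configuration decides whether the fake or the real client sits behind it.&lt;/p&gt;

```python
import os


class FakeTweetClient:
    """In-memory stand-in for a real Twitter-like API client."""

    def __init__(self):
        self.sent = []

    def post_status(self, text):
        # Record the call instead of making a network request.
        self.sent.append(text)
        return {"id": len(self.sent), "text": text}


class RealTweetClient:
    """Placeholder for the production client that would call the network."""

    def post_status(self, text):
        raise NotImplementedError("real network call lives here")


def build_tweet_client(env=None):
    # Pick the implementation from config rather than hard coding it.
    env = env or os.environ.get("APP_ENV", "development")
    if env == "production":
        return RealTweetClient()
    return FakeTweetClient()


client = build_tweet_client(env="development")
result = client.post_status("hello from dev")
```

&lt;p&gt;The fake also lets you force behaviors a real environment cannot, such as returning canned errors or rate limits on demand.&lt;/p&gt;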

&lt;p&gt;Combining these two practices works well with a single service and scales up to multiple services. The 12.1 factor approach prefers locality&lt;sup id="fnref4"&gt;4&lt;/sup&gt; over fully integrated environments. Doing so promotes fast and independent iterations on discrete services using automated tests to verify correctness. End-to-end issues that may have been identified with a fully integrated environment with dev/prod parity should be pushed downstream in the deployment pipeline in accordance with test pyramid&lt;sup id="fnref5"&gt;5&lt;/sup&gt; principles. If a regression is identified, then it may be quickly addressed by adding tests to the relevant service’s test suite.&lt;/p&gt;

&lt;p&gt;These recommendations build on a substantial amount of prior knowledge like:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://www.martinfowler.com/bliki/BoundedContext.html"&gt;Bounded context&lt;/a&gt; from Domain Driven Driven&lt;/li&gt;
&lt;li&gt;The &lt;a href="https://www.infoq.com/articles/unicorn-project/"&gt;Five Ideals&lt;/a&gt; in the &lt;a href="https://itrevolution.com/the-unicorn-project/"&gt;Unicorn Project&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;The role of team autonomy as discussed in &lt;a href="https://teamtopologies.com"&gt;Team Topologies&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;The &lt;a href="https://www.youtube.com/watch?v=_u2w57QBIkU&amp;amp;t=1s"&gt;Hexagonal architecture&lt;/a&gt; used to swap between mocks, fakes, and real versions of external services&lt;/li&gt;
&lt;li&gt;How to grow production ready software guided by tests&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Check the show notes for links on these topics. I hope this episode has given you some food for thought on the goal of dev/prod parity and its effects across the software development process.&lt;/p&gt;

&lt;p&gt;That’s all for this one. Good luck out there and happy shipping.&lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;&lt;a href="https://12factor.net/dev-prod-parity"&gt;https://12factor.net/dev-prod-parity&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;&lt;a href="https://12factor.net/backing-services"&gt;https://12factor.net/backing-services&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn3"&gt;
&lt;p&gt;&lt;a href="https://landing.google.com/sre/sre-book/chapters/eliminating-toil/"&gt;https://landing.google.com/sre/sre-book/chapters/eliminating-toil/&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn4"&gt;
&lt;p&gt;&lt;a href="https://www.infoq.com/articles/unicorn-project/"&gt;https://www.infoq.com/articles/unicorn-project/&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn5"&gt;
&lt;p&gt;&lt;a href="https://martinfowler.com/articles/practical-test-pyramid.html"&gt;https://martinfowler.com/articles/practical-test-pyramid.html&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>softwareengineering</category>
      <category>softwaredevelopment</category>
      <category>sitereliabilityengineering</category>
    </item>
    <item>
      <title>12.1 Factor Apps: Logs</title>
      <dc:creator>Adam Hawkins</dc:creator>
      <pubDate>Mon, 18 May 2020 10:00:00 +0000</pubDate>
      <link>https://dev.to/ahawkins/12-1-factor-apps-logs-2cfa</link>
      <guid>https://dev.to/ahawkins/12-1-factor-apps-logs-2cfa</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a podcast episode transcript. Visit the &lt;a href="https://share.transistor.fm/s/14ac62a4"&gt;podcast website&lt;/a&gt; for the episode, show notes, and other freebies.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Hey everyone. Adam here for the next episode in the 12.1 factor app series. I'm writing more episodes addressing my amendments to the original 12 factors. After that, I'll propose new factors worth considering.&lt;/p&gt;

&lt;p&gt;Also sorry about a mistake in the last episode. The podcast I mentioned is no longer available. The host told me he took the podcast completely offline. However, he did invite me onto his current podcast, Rails with Jason. I'll go on his show in the coming weeks to discuss continuous delivery, deployment pipelines, preflight checks, smoke tests, and all that good stuff. Jason said I can simulcast the episode on small batches, so that's one bonus episode for ya.&lt;/p&gt;

&lt;p&gt;OK, enough preamble for now. Time to talk logs.&lt;/p&gt;




&lt;p&gt;The 12 factor app states that applications should not concern themselves with storing their log stream. Simply log to standard out or standard error. This works in development because developers can see logs in their terminal. It also works in production because tooling can redirect logs to files or capture and process the streams independently of the application.&lt;/p&gt;

&lt;p&gt;My stance on the 12 factor app is that it's a great starting point but requires amendments. Just logging to standard out or standard error is not enough to build robust continuous delivery pipelines. We need to layer logging practices onto the original recommendations.&lt;/p&gt;

&lt;p&gt;So, the 12.1 factor app does three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Supports a LOG_LEVEL configuration option&lt;/li&gt;
&lt;li&gt;Uses a machine readable format, like JSON, in production&lt;/li&gt;
&lt;li&gt;Generates time series telemetry from logs&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let's consider each point.&lt;/p&gt;

&lt;p&gt;The first point relates to the config factor. More on that in the previous episode at &lt;a href="https://smallbatches.fm/6"&gt;https://smallbatches.fm/6&lt;/a&gt;. Applications must support log level configuration instead of hard coding it. Use a low log level like debug in development and info or higher in non-development environments.&lt;/p&gt;
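&lt;p&gt;As a minimal sketch in Python, the standard library logger can read the level straight from a LOG_LEVEL environment variable instead of hard coding it:&lt;/p&gt;

```python
import logging
import os

# Read the log level from LOG_LEVEL instead of hard coding it.
# Default to INFO, the sensible level outside development.
level_name = os.environ.get("LOG_LEVEL", "INFO").upper()
level = getattr(logging, level_name, logging.INFO)

logger = logging.getLogger("app")
logger.setLevel(level)

logger.debug("only emitted when LOG_LEVEL=debug")
logger.info("emitted at the default level")
```

&lt;p&gt;The same pattern applies in any language: resolve the level from configuration at startup, with a safe default.&lt;/p&gt;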

&lt;p&gt;Second, logs must be produced in a machine readable format such as JSON. Oh, and no multiline logs. Multiline entries are effectively syntax errors in a log stream. Just avoid them. Using a machine readable format enables new use cases. Error logs may contain stack traces. Contextual information--such as user IDs--may be added to all log entries. Log entries can generate time series telemetry. Log entries may be parsed and routed to different storage systems. warn and error logs may generate alerts. fatal logs may page someone. The list goes on and on.&lt;/p&gt;
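&lt;p&gt;Here's one way to sketch single-line JSON logging with Python's standard library. The field names are my own choice for illustration, not a standard:&lt;/p&gt;

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as a single line of JSON, never multiline."""

    def format(self, record):
        entry = {
            "level": record.levelname.lower(),
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Contextual information, such as a user ID, rides along when set.
        user_id = getattr(record, "user_id", None)
        if user_id is not None:
            entry["user_id"] = user_id
        return json.dumps(entry)


# Format one record by hand to show the resulting shape.
record = logging.LogRecord("app", logging.ERROR, "app.py", 1, "boom", None, None)
record.user_id = 42
line = JsonFormatter().format(record)
```

&lt;p&gt;Downstream tooling can now parse each line, route by level, and pull the contextual fields out without guessing at text formats.&lt;/p&gt;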

&lt;p&gt;The point regarding time series telemetry warrants extra attention. Consider nginx or Apache. Both output the well known "access log" format. The format includes latencies, response codes, and other information like the origin IP. This single log line contains wonderfully useful telemetry! Parsing the log can generate a histogram on response latencies, incoming request count, percentage of satisfied requests, internal server errors, backend errors, a leaderboard on response codes, and more. That's enough to understand how the HTTP service is operating.&lt;/p&gt;
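&lt;p&gt;To make that concrete, here's a toy Python sketch deriving telemetry from a simplified, assumed line format (not the real nginx or Apache access log format):&lt;/p&gt;

```python
from collections import Counter

# Simplified, assumed line format: "METHOD PATH STATUS LATENCY_MS".
lines = [
    "GET /health 200 3",
    "GET /api/items 200 41",
    "POST /api/items 500 387",
]

status_counts = Counter()
latencies = []
for line in lines:
    method, path, status, latency_ms = line.split()
    status_counts[status] += 1
    latencies.append(int(latency_ms))

# Derived telemetry: request count, error rate, and worst-case latency.
request_count = len(latencies)
error_rate = status_counts["500"] / request_count
max_latency_ms = max(latencies)
```

&lt;p&gt;A real pipeline would do the same parsing continuously and feed the results into histograms, counters, and alerts.&lt;/p&gt;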

&lt;p&gt;The same approach applies to internal telemetry. Applications can output time series data to standard out for consumption by downstream systems. This eliminates the need for third party libraries and external metric collection services in favor of infrastructure level log storage and metric generation. You can see this in action with products like DataDog and NewRelic. Both offer centralized log storage, searching, and metric generation. Once metrics are generated, then you have access to the full suite of tools around them such as graphing, monitoring, and alerting.&lt;/p&gt;

&lt;p&gt;Alright. Let's recap:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Support log level configuration&lt;/li&gt;
&lt;li&gt;Log in a machine readable format such as JSON&lt;/li&gt;
&lt;li&gt;Treat log streams as a telemetry source&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These practices are especially useful in growing distributed systems since they shift responsibility out of applications and onto horizontal support layers.&lt;/p&gt;

&lt;p&gt;That's all for this one. Go forth and log.&lt;/p&gt;

</description>
      <category>softwareengineering</category>
      <category>softwaredevelopment</category>
      <category>sitereliabilityengineering</category>
    </item>
    <item>
      <title>12.1 Factor Apps: Config</title>
      <dc:creator>Adam Hawkins</dc:creator>
      <pubDate>Mon, 04 May 2020 15:00:00 +0000</pubDate>
      <link>https://dev.to/ahawkins/12-1-factor-apps-config-3c78</link>
      <guid>https://dev.to/ahawkins/12-1-factor-apps-config-3c78</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a podcast episode transcript. Visit the &lt;a href="https://share.transistor.fm/s/50964f67"&gt;podcast website&lt;/a&gt; for the episode, show notes, and other freebies.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Hello friends, it’s me Adam. Welcome back to Small Batches. I introduced the 12 factor app in the previous episode along with areas where I think it may be improved. I’m calling these improvements the "12.1 factor app". So new features and fixes, but no breaking changes; more like clarifications. I’m going to cover these in coming episodes. Enough preamble for now.  On with the show.&lt;/p&gt;

&lt;p&gt;The 12 factor app states that applications should read config from environment variables. It implies separation of code and config. That’s about it, but there’s good bones here. I want something bigger from this factor. Specifically that applications may be deployed to new environments without any code changes. This requires a few additions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Configure the process through command options and environment variables&lt;/li&gt;
&lt;li&gt;Prefer explicit configuration over implicit configuration&lt;/li&gt;
&lt;li&gt;Use a dry run option to verify config sanity&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These points force applications to be explicit in configuration, which in turn requires engineers to take more responsibility for bootstrapping the process. This has proven to be a good thing in my experience.&lt;br&gt;
Consider the first point regarding command line options and environment variables. Developers interact with command line tools every single day. There’s a standard interface for passing flags: command line options. You’ve likely used &lt;code&gt;curl -X&lt;/code&gt;, &lt;code&gt;grep -E&lt;/code&gt;, or &lt;code&gt;mysql -u&lt;/code&gt;. These tools may even use values from environment variables when command line options are not provided. This is wonderful because processes may be configured globally with environment variables then overridden in specific scenarios with command line options.&lt;/p&gt;
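&lt;p&gt;A minimal Python sketch of that precedence, using a hypothetical &lt;code&gt;--db-url&lt;/code&gt; flag backed by a &lt;code&gt;DB_URL&lt;/code&gt; environment variable:&lt;/p&gt;

```python
import argparse
import os

# Precedence: command line option beats environment variable beats
# the built-in default. The flag and variable names are illustrative.
parser = argparse.ArgumentParser()
parser.add_argument("--db-url", default=os.environ.get("DB_URL", "sqlite://local"))

# An explicit flag wins over everything else.
args = parser.parse_args(["--db-url", "postgres://prod-db:5432/app"])

# No flag given: falls back to DB_URL, then the built-in default.
fallback = parser.parse_args([])
```

&lt;p&gt;And because the parser knows every option, &lt;code&gt;--help&lt;/code&gt; output comes for free.&lt;/p&gt;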

&lt;p&gt;This simple interface also supports another common use case: looking up configuration options. Running a command followed by &lt;code&gt;--help&lt;/code&gt; or &lt;code&gt;-h&lt;/code&gt; typically outputs a usage message listing all command options. How many times have you struggled to learn which configuration files or environment variables are required to start a service developed by other engineers in your company? Now compare that to how many times you have struggled to find all the options to the &lt;code&gt;grep&lt;/code&gt; command. There is no struggle because &lt;code&gt;grep --help&lt;/code&gt; tells you everything. On the other hand, you’re left hoping that your team members put something in the &lt;code&gt;README&lt;/code&gt; or on Confluence.&lt;/p&gt;

&lt;p&gt;Moving on to the second point. I’ll explain this by contrasting software produced by two ecosystems.&lt;/p&gt;

&lt;p&gt;Rails applications use a mix of configuration practices. They may use environment variables, but there’s also a mix of YAML files and environment specific configuration files (such as &lt;code&gt;production.rb&lt;/code&gt; or &lt;code&gt;staging.rb&lt;/code&gt;). Internal code uses a preset number of environments (namely &lt;code&gt;production&lt;/code&gt;, &lt;code&gt;test&lt;/code&gt;, and &lt;code&gt;development&lt;/code&gt;) to implicitly change configuration. Deploying to a new environment requires creating new configuration files and/or updating YAML files. Starting a Rails application requires running the &lt;code&gt;rails&lt;/code&gt; command. As a result, developers are disconnected from the code that bootstraps application internals then starts a web server.&lt;/p&gt;

&lt;p&gt;On the other hand, consider software produced by the Go ecosystem. It’s more common to write a main method that configures everything through command line options. In this case there is no need for extra configuration files or implicit configuration based on environment names since the concept is irrelevant here. Naturally this requires that developers take more responsibility, but as I said early on, it’s worth it in the end. Configuring these applications is easier to grok, as is deploying them to a variety of environments. That’s what the 12.1 factor app is going for.&lt;/p&gt;

&lt;p&gt;The command line interface approach enables DX improvements too. One of my pet peeves is when a process starts then fails at runtime due to some missing configuration options. This grinds my gears because developers devote huge effort to validating user input through web forms or API calls but tend to neglect configuration validation entirely! Plus, it’s just frustrating to learn which values are required through &lt;em&gt;runtime&lt;/em&gt; errors. The 12.1 factor app can do better than this.&lt;/p&gt;

&lt;p&gt;The 12.1 factor app will fail and exit non-zero if any required configuration value is missing. The main method that processes command line options and environment variables makes this possible. Does the process require a connection to a database with no &lt;code&gt;--db-url&lt;/code&gt; or &lt;code&gt;DB_URL&lt;/code&gt; provided? Then boom! Error message and exit non-zero. The goal is to make it impossible to start the process without sane configuration.&lt;/p&gt;
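&lt;p&gt;A sketch of that fail-fast bootstrap in Python, with hypothetical required values:&lt;/p&gt;

```python
import sys

REQUIRED = ("DB_URL", "API_KEY")  # hypothetical required config values


def load_config(environ):
    """Fail fast: print an error and exit non-zero on missing config."""
    missing = [name for name in REQUIRED if name not in environ]
    if missing:
        print("missing required config: " + ", ".join(missing), file=sys.stderr)
        sys.exit(1)
    return {name: environ[name] for name in REQUIRED}


config = load_config({"DB_URL": "postgres://db:5432/app", "API_KEY": "secret"})
```

&lt;p&gt;Because the process never gets past bootstrap with bad config, the deployment system sees the non-zero exit immediately.&lt;/p&gt;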

&lt;p&gt;Failing with a non-zero exit status integrates nicely with deployment systems. Recall that a 12 factor "release" is the combination of code and config. Therefore it’s possible for a config change to result in a broken release. Given that 12.1 factor apps fail fast, it’s possible for the deployment system to recognize the failed release then switch back to the previous release. Contrast this with a "fail later" approach. The release may be running but failing at runtime. This looks OK from a deployment perspective since the release started, but it’s totally broken from the user’s perspective. The 12.1 factor app easily avoids this scenario.&lt;/p&gt;

&lt;p&gt;The "fail fast" approach catches simple user errors such as all values provided. However that only solves part of the problem. Provided values are not necessarily correct. Here’s an example. Say the application requires a connection to Postgres, so the user sets &lt;code&gt;POSTGRESQL_URL&lt;/code&gt;. However the application cannot connect to the server for any reason. It could be networking, mismatched ports, or an authentication error.  Whatever the reason the result is the same: no database connection thus a nonfunctional application. This would cause downtime if deployed to production. I can’t tell how how many times this has happened to me for legitimate reasons or less so (like mistyping a hostname or specifying the wrong port).&lt;/p&gt;

&lt;p&gt;My point is this type of error may be eliminated by simply trying to use the provided configuration before starting the process. The idea here is to use a "dry run" mode to check these sorts of things. I’ve used the dry run mode to check connections to external resources like data stores and that API keys for external APIs are valid. This aligns nicely with the "trust but verify" motto. It’s simple. At the end of the day developers make mistakes. It’s our job to ensure those mistakes don’t enter production.&lt;/p&gt;
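&lt;p&gt;Here’s one possible shape for a dry run check in Python. It only verifies that a TCP connection to the datastore can be opened, which stands in for whatever deeper checks (auth, API key validity) your app actually needs:&lt;/p&gt;

```python
import socket


def preflight(host, port, timeout=2.0):
    """Dry run check: verify a TCP connection to the datastore opens."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def main(argv, host, port):
    # With --dry-run, verify config and exit instead of starting the app.
    if "--dry-run" in argv:
        return 0 if preflight(host, port) else 1
    return 0  # normal startup would continue here
```

&lt;p&gt;The deployment pipeline can run the process with &lt;code&gt;--dry-run&lt;/code&gt; first and abort the rollout on a non-zero exit.&lt;/p&gt;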

&lt;p&gt;Alright. That’s enough for the 12.1 config factor. Here’s a quick recap:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Configure your process through command line options and environment variables&lt;/li&gt;
&lt;li&gt;Fail fast on any configuration error&lt;/li&gt;
&lt;li&gt;Use a "dry run" mode to verify as much config as possible&lt;/li&gt;
&lt;li&gt;Prefer explicit configuration over implicit configuration based on environment names&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Well, what do you think of these practices? Have you done anything like this before, and if so, how did it turn out? Hit me up on Twitter at smallbatchesfm or email me at &lt;a href="mailto:hi@smallbatches.fm"&gt;hi@smallbatches.fm&lt;/a&gt;. Share this episode around your team too. It’s great reference material for the "new service checklist" or "best practices" section on Confluence you’re always trying to write.&lt;/p&gt;

&lt;p&gt;Also go to smallbatches.fm/6 for show notes. I’ll put a link to my appearance on the Rails Testing Podcast where I talk about this topic in more technical detail. If this episode piqued your interest, then definitely check that one out. We talk about preflight checks and smoke tests.&lt;/p&gt;

&lt;p&gt;Alright gang. That’s a wrap. See you in the next one. Good luck out there and happy shipping!&lt;/p&gt;

</description>
      <category>softwareengineering</category>
      <category>softwaredevelopment</category>
      <category>sitereliabilityengineering</category>
    </item>
    <item>
      <title>12 Factor Apps</title>
      <dc:creator>Adam Hawkins</dc:creator>
      <pubDate>Tue, 21 Apr 2020 03:15:00 +0000</pubDate>
      <link>https://dev.to/ahawkins/12-factor-apps-8p5</link>
      <guid>https://dev.to/ahawkins/12-factor-apps-8p5</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a podcast episode transcript. Visit the &lt;a href="https://share.transistor.fm/s/6d84c28a"&gt;podcast website&lt;/a&gt; for the episode, show notes, and other freebies.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Hey everyone. Welcome to the next episode of Small Batches. It’s me Adam Hawkins coming to you from sunny and warm Hawaii. Just like all of us, I’m quarantined inside with not much to do besides look longingly out my window and ponder the craft of software engineering. As a result this episode is longer than usual, so nurse that coffee while I serve up a small batch of software engineering theory and best practices. Today’s topic is 12 factor apps.&lt;/p&gt;

&lt;p&gt;The 12 factor app describes how to build software for deployment pipelines. These guidelines make software easier to run in development, staging, production, and any future environment. Engineers at Heroku wrote "The 12 Factor App" in 2012. The authors tried to enumerate best practices for running applications on their platform. These guidelines were especially helpful since they coincided with an industry shift towards DevOps, microservices, and continuous delivery.&lt;/p&gt;

&lt;p&gt;The 12 factor app guidelines are still relevant today since many software teams are building distributed systems, which require attention to detail as to how services are coupled, configured, and deployed. The 12 factor app is a wonderful starting point for best practices. In my experience though, the guidelines omit a few facets and leave room for antipatterns in others.&lt;/p&gt;

&lt;p&gt;My goal with this podcast is to share knowledge and practices for building, deploying, and operating software. The 12 factor app triangulates all three of those areas. This is partially why I love the 12 factor app so much. It’s a rare piece of work that hits on so many themes of this podcast. In this episode let’s review the 12 factor app and flag future discussion points.&lt;/p&gt;

&lt;p&gt;Codebase, the first factor, defines the relation between code, apps, and deployments. The gist is that a single codebase (i.e. a git repo) equates to a single deployed app (or service). Different versions of a codebase may be deployed to different environments. This implies that if there are multiple codebases, it’s not an app; it’s a distributed system. Each component in a distributed system is an app, which should also comply with twelve-factor. Let’s flag this for future episodes, since we need to adopt a distributed systems first perspective instead of the other way around.&lt;/p&gt;

&lt;p&gt;Dependencies, the second factor, states a twelve-factor app never relies on system wide packages. Instead, apps must leverage tools like Ruby’s bundler or Python’s virtualenv to manage their dependencies. This is where the second factor ends. Separating out application dependencies is just part of the problem. You must also isolate the app’s runtime. Admittedly this is more relevant for Ruby, Python, or Nodejs apps which rely on managed runtimes. The sticking point is the same for statically compiled applications, say written in Go. Just to reiterate, the "dependencies" factor focuses on application dependencies rather than application runtimes or execution environments. So where do those come from and what can deployment pipelines expect from dependencies? Let’s flag this for future episodes.&lt;/p&gt;

&lt;p&gt;Config, the third factor, states that applications should read configuration from environment variables. This makes it possible to deploy changes to production without altering code. A litmus test for whether an app has all config correctly factored out of the code is whether the codebase could be made open source at any moment, without compromising any credentials. Separating config and code is a great start but not enough. Personally I find the third factor the most wanting. There are great ways to work with config and horrible antipatterns; neither is discussed in the third factor. Let’s stick a big flag in this factor for discussion in future episodes.&lt;/p&gt;

&lt;p&gt;Backing services, the fourth factor, states applications make no distinction between local and third party services. For example, an application’s database and its connections to external APIs are treated the same way. The idea promotes loose coupling between apps and services. The 12 factor app uses an example of switching a local SMTP service for a third party service with only config changes and no code changes. Again, this is a good starting point but requires some clarifications, tying into the config factor and the future dev/prod parity factor.&lt;/p&gt;

&lt;p&gt;Build, release, run, the fifth factor, states apps use strict separation between the build, release, and run stages. The build stage converts code into an executable. The release stage combines the executable and config into something runnable. In other words, a "deploy" is the combination of code and config. It’s not possible to change one without creating a new release.&lt;/p&gt;

&lt;p&gt;Processes, the sixth factor, states apps execute as one or more stateless processes using a shared nothing architecture. If data needs to persist across restarts or releases, then it must be stored in a stateful backing service such as a database. I like this factor because it surfaces the idea that apps are composed of multiple processes such as a web server, maybe a background job worker, or a cron process. This model scales up to distributed systems which are composed of multiple apps interacting across various processes. This factor also implies the deployment pipeline must handle that releases contain 1-N processes which may require different semantics. In other words, the deployment pipeline cannot assume a single process or process type.&lt;/p&gt;

&lt;p&gt;Port binding, the seventh factor, states an app is completely self contained and exposes itself through ports. This seems obvious but it’s a good architectural principle to state outright. Given services are exposed through ports, it’s possible to configure others by providing the hostname and port of relevant services.&lt;/p&gt;

&lt;p&gt;Concurrency, the eighth factor, states that processes are first class citizens. The stateless, share nothing model naturally promotes horizontal scaling. Just start more processes. Scale vertically by allocating more resources to processes where need be. The corollary is processes should never daemonize or write PID files. Instead they are controlled by a process manager like systemd. In other words: write apps that start processes in the foreground; use a process manager to scale, start, and stop processes.&lt;/p&gt;

&lt;p&gt;Disposability, the ninth factor, states processes should be ready to start or stop at a moment’s notice. Fast startup times encourage smooth scaling. Conversely, processes must gracefully handle the &lt;code&gt;SIGTERM&lt;/code&gt; signal. Servers should stop listening on the relevant port, finish processing any in-flight requests, then exit. This approach ensures new releases or other infrastructure events do not impact user facing requests.&lt;/p&gt;
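&lt;p&gt;A minimal Python sketch of the &lt;code&gt;SIGTERM&lt;/code&gt; half of disposability: the handler flips a flag so the main loop can stop accepting work, drain what is in flight, then exit cleanly. This is a pattern sketch, not a full server:&lt;/p&gt;

```python
import signal

shutting_down = False


def handle_sigterm(signum, frame):
    # Flip a flag; the main loop checks it, drains work, then exits.
    global shutting_down
    shutting_down = True


signal.signal(signal.SIGTERM, handle_sigterm)

# Simulate the process manager sending SIGTERM to this process.
signal.raise_signal(signal.SIGTERM)
```

&lt;p&gt;A real server would check the flag between requests: stop accepting new connections, finish in-flight work, then return from main.&lt;/p&gt;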

&lt;p&gt;Dev/Prod parity, the tenth factor, states that apps are designed for continuous deployment by minimizing the gap between development and production. The original guideline states: the twelve factor developer resists the urge to use different backing services between development and production. This makes sense when applied to databases but has negative implications when applied to distributed systems. Does this mean a development environment for one service in a large distributed system mandates running all other services? Well if you’re targeting dev/prod parity then the answer may be yes. However answering yes is not always practical. Consider the case where the system in question is a single service. Then it’s simple enough to achieve dev/prod parity. Now scale up. What about dev/prod parity with N=5, 10, 20, 100? The tenth factor offers no advice or guidelines for how to handle inflection points as the system grows or which degrees of dev/prod parity to consider.&lt;/p&gt;

&lt;p&gt;Logs, the eleventh factor, states that an app never concerns itself with routing or storage of its output stream. Instead, all output should be sent unbuffered to standard out. This works well in development because developers can see output in their terminal. It also works well in production since tools can capture output streams for analytics and storage. Again, the eleventh factor is a wonderful starting point but needs to be improved upon for continuous delivery and production operations. The eleventh factor does not cover what or how things should be logged. In fact, this is the only mention of telemetry, a vital facet of continuous deployment largely uncovered by the 12 factor app.&lt;/p&gt;

&lt;p&gt;Admin processes, the twelfth factor, states that admin work, such as migrating databases or other out-of-band work, executes as a separate process. These processes are started using the same release (meaning the same config and code). I don’t have more to say about this factor, so let’s put a pin in it and wrap up.&lt;/p&gt;

&lt;p&gt;Future episodes will dive deeper into the individual factors with a focus on identifying and closing the gaps. My biggest gripe with the 12 factor app is around the config and dev/prod factors and the overall omission of anything telemetry or metrics related.&lt;/p&gt;

&lt;p&gt;I’m curious about your experience with 12 factor apps. How has it worked for you? What do you think it’s missing? Where could it be improved? Send your comments to &lt;a href="mailto:hi@smallbatches.fm"&gt;hi@smallbatches.fm&lt;/a&gt;. Also, tell me what you want covered in future episodes.&lt;/p&gt;

&lt;p&gt;With that, I leave you to it. Good luck out there and happy shipping.&lt;/p&gt;

</description>
      <category>softwareengineering</category>
      <category>softwaredevelopment</category>
      <category>sitereliabilityengineering</category>
    </item>
    <item>
      <title>The Principle of Improvement</title>
      <dc:creator>Adam Hawkins</dc:creator>
      <pubDate>Mon, 06 Apr 2020 19:00:00 +0000</pubDate>
      <link>https://dev.to/ahawkins/the-principle-of-improvement-3fmg</link>
      <guid>https://dev.to/ahawkins/the-principle-of-improvement-3fmg</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a podcast episode transcript. Visit the &lt;a href="https://share.transistor.fm/s/c5f31d67"&gt;podcast website&lt;/a&gt; for the episode, show notes, and other freebies.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;This episode completes our introduction to the three ways of DevOps. The previous two episodes introduced flow and feedback. Flow, the first way, establishes fast left to right flow from development to production. Feedback, the second way of DevOps, establishes right to left feedback from production to development, so the state of production informs development decisions. The third way of DevOps establishes a culture of experimentation and learning around improving flow and feedback.&lt;/p&gt;

&lt;p&gt;Let’s frame this discussion using the four metrics from Accelerate: lead time, deployment frequency, MTTR, and change failure rate. Lead time and deployment frequency measure flow. MTTR measures feedback. Change failure rate measures experimentation and learning. Trends in these four metrics also reflect experimentation and learning.&lt;/p&gt;
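&lt;p&gt;To make the mapping concrete, here’s a minimal Python sketch computing the four metrics from a deploy history. The deploy records and field layout are invented for illustration:&lt;/p&gt;

```python
from datetime import datetime, timedelta

# Hypothetical deploy records: (commit_time, deploy_time, failed, restore_time)
deploys = [
    (datetime(2020, 4, 1, 9), datetime(2020, 4, 1, 10), False, None),
    (datetime(2020, 4, 1, 13), datetime(2020, 4, 1, 15), True,
     datetime(2020, 4, 1, 16)),
    (datetime(2020, 4, 2, 9), datetime(2020, 4, 2, 9, 30), False, None),
    (datetime(2020, 4, 2, 11), datetime(2020, 4, 2, 12), False, None),
]

# Lead time: average commit-to-production duration (measures flow).
lead_time = sum(((d[1] - d[0]) for d in deploys), timedelta()) / len(deploys)

# Deployment frequency: deploys per day over the observed window (measures flow).
days = (deploys[-1][1] - deploys[0][1]).days or 1
deploy_frequency = len(deploys) / days

# MTTR: average time from failed deploy to restored service (measures feedback).
failures = [d for d in deploys if d[2]]
mttr = sum(((d[3] - d[1]) for d in failures), timedelta()) / len(failures)

# Change failure rate: share of deploys causing degraded service (measures learning).
change_failure_rate = len(failures) / len(deploys)
```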

&lt;p&gt;Consider this scenario. There’s a severe six-hour production outage. As a result the business has lost money and received angry emails from customers. This long outage window impacts the team’s MTTR. The bug—undetected by the deployment pipeline—caused an outage, which increases the change failure rate. The team meets as soon as possible after restoring production operations. What do they do? If they apply the third way of DevOps, they conduct a blameless post-mortem. The post-mortem identifies the root cause of how the bug entered production and what regression tests will prevent it in the future. Hopefully they also discuss why it took six hours to restore operations and how they can be quicker next time around.&lt;/p&gt;

&lt;p&gt;Here’s another scenario. A team ships new features on a regular basis but they don’t see the expected business results. They thought these features would deliver results but can’t figure out why they are not. What can be done? First, the team needs to step back and check their assumptions. Instead of going all in on big features, they can test their assumptions with tiny experiments released to a small segment of their users. If the results are positive, then the team should continue iterating. If not, then the team tries a new idea. Over time the team sees that they spend more time delivering on proven business ideas instead of ideas they assumed would just work. This approach is known as A/B testing or hypothesis-driven deployment from the lean IT school.&lt;/p&gt;
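&lt;p&gt;As a sketch of how such an experiment might be evaluated (the numbers and variant names here are invented for illustration), the team compares conversion rates between a control group and a treatment group:&lt;/p&gt;

```python
# Hypothetical results from exposing a feature to a small user segment.
variants = {
    "control":   {"users": 1000, "conversions": 50},
    "treatment": {"users": 1000, "conversions": 65},
}

def conversion_rate(variant: dict) -> float:
    return variant["conversions"] / variant["users"]

control = conversion_rate(variants["control"])
treatment = conversion_rate(variants["treatment"])

# Relative lift: a positive result suggests iterating further,
# a flat or negative one suggests trying a new idea.
lift = (treatment - control) / control
```

&lt;p&gt;A real experiment would also test the result for statistical significance before acting on it.&lt;/p&gt;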

&lt;p&gt;Both scenarios demonstrate a focus on improvement through experimentation and learning. However, this is only possible in a high-trust culture. It’s not possible to conduct a blameless post-mortem if people are afraid to say what they did to cause an outage. It’s not possible to conduct A/B tests if the organization does not see the value in validating business ideas through experiments. This is why leadership must promote these ideas.&lt;br&gt;
I’m going to read one of my favorite passages from the DevOps Handbook. There are &lt;em&gt;many&lt;/em&gt; wonderful passages in this book, but this is top tier without a doubt. The passage is a great example of leadership’s role:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Internally, we described our goal as creating “buoys, not boundaries.” Instead of drawing hard boundaries that everyone has to stay within, we put buoys that indicate deep areas of the channel where you’re safe and supported. You can go past the buoys as long as you follow the organizational principles. After all, how are we ever going to see the next innovation that helps us win if we’re not exploring and testing at the edges?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I just love that quote—there’s just so much good stuff there. It describes a high-trust culture guided by safety and aligned through principles. The four metrics (lead time, deployment frequency, MTTR, and change failure rate) are SLIs. DevOps is a set of principles that guide organizations to move those SLIs in the right direction, and when done right the results are outstanding. You just need to ask: how can we improve? If you can stick to that, then you’ll discover that improvement of daily work &lt;em&gt;is&lt;/em&gt; the daily work.&lt;/p&gt;

&lt;p&gt;Alright, that’s it for the principle of experimentation and learning. These three ideas will come back all the time on the podcast, but hey, you can always come back to these episodes if you need a refresher. Head over to the podcast website smallbatches.dev for links and a free ebook on putting continuous improvement into practice.&lt;/p&gt;

&lt;p&gt;Until the next one, good luck out there and happy shipping!&lt;/p&gt;

</description>
      <category>softwareengineering</category>
      <category>softwaredevelopment</category>
      <category>sitereliabilityengineering</category>
    </item>
    <item>
      <title>The Principle of Feedback</title>
      <dc:creator>Adam Hawkins</dc:creator>
      <pubDate>Mon, 23 Mar 2020 18:00:00 +0000</pubDate>
      <link>https://dev.to/ahawkins/the-principle-of-feedback-2e5m</link>
      <guid>https://dev.to/ahawkins/the-principle-of-feedback-2e5m</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a podcast episode transcript. Visit the &lt;a href="https://share.transistor.fm/s/b47c6bf1"&gt;podcast website&lt;/a&gt; for the episode, show notes, and other freebies.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;This episode continues our discussion of the Three Ways of DevOps. The previous episodes introduced DevOps, the "Three Ways", and flow, the first way. This episode covers the second way of DevOps: feedback.&lt;/p&gt;

&lt;p&gt;The DevOps philosophy works by establishing ways of working that feed back into each other. The first way establishes fast flow from development to production, or in other words, high velocity. Organizations can’t stop there. Imagine yourself driving a car but with one catch: there’s no speedometer, fuel gauge, or other indicators. You could accelerate to a point, but eventually you’d need to know how fast you’re driving so you can speed up or slow down; or you’d need to know how much fuel is left so you can decide whether to stop and fill up.&lt;/p&gt;

&lt;p&gt;This is the principle of feedback in action. We need telemetry about our ongoing actions to make subsequent decisions. Driving a car reliably without a gas gauge would be pretty hard. The same idea applies to software. Production systems provide a wealth of information about themselves. Let’s call this information telemetry. Telemetry may be time series data, alerts, or logs; it’s any data that provides insight into operational conditions. Once teams have telemetry, they can use it as the basis for subsequent development.&lt;/p&gt;

&lt;p&gt;This usually takes two forms: using telemetry to automatically detect error conditions and page an engineer, or using it to inform maintenance work.&lt;/p&gt;

&lt;p&gt;Here’s an example of each: automated monitoring detects that a critical job hasn’t run in 48 hours, then pages an engineer; engineers observe telemetry showing increased memory usage, then decide to allocate more memory to prevent future out-of-memory errors.&lt;/p&gt;
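&lt;p&gt;The first example can be sketched in a few lines of Python. The function name and threshold handling are assumptions; the 48-hour figure comes from the example above:&lt;/p&gt;

```python
from datetime import datetime, timedelta

# Page if the critical job has been silent for more than 48 hours.
MAX_SILENCE = timedelta(hours=48)

def should_page(last_run: datetime, now: datetime) -> bool:
    """Return True when the job's last successful run is too old."""
    return now - last_run > MAX_SILENCE
```

&lt;p&gt;In practice a monitoring system evaluates a check like this on a schedule and routes the page through an on-call rotation.&lt;/p&gt;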

&lt;p&gt;Both of these examples are technical. Teams tend to focus on these areas, but they need to aim higher. The principle of feedback must be applied to all layers in the value stream, so that means the business too.&lt;/p&gt;

&lt;p&gt;Good businesses track their "success" metrics. They’re likely enumerated by quarter in some huge Google Sheet. They’re numbers like "new user signups", "monthly recurring revenue", or "minutes watched". Management deeply cares about this telemetry since it drives short, medium, and long term planning.&lt;/p&gt;

&lt;p&gt;Plus, this telemetry has something that technical telemetry does not: it’s much easier to understand. That makes it a rallying point for more people in the organization. Let’s face it: it’s just harder to rally an entire business behind "Let’s drop the frontend to backend latency!" compared to "Let’s boost our signups!".&lt;/p&gt;

&lt;p&gt;There’s another important point here: something may only be improved if it’s measurable. This is where DevOps overlaps with the "Lean" philosophy. I’d like to do a future episode on lean because it’s so powerful and fits nicely alongside the second and third ways of DevOps. For now let’s focus on lean and its relation to the second way of DevOps.&lt;/p&gt;

&lt;p&gt;Lean proposes thinking about business as a series of hypotheses validated by real world experience. A common example is first putting up a landing page for a non-functional product, then seeing if the landing page converts users. If the page converts users, then there is interest, so it’s worth running subsequent experiments to further validate the idea. If the page does not convert, then try something else. Repeat as many times as necessary. This example demonstrates the principle of feedback: the team measures the conversion rate, then plots a course according to empirical data.&lt;/p&gt;

&lt;p&gt;I cannot overemphasize the importance of empirical data. If we are not using empirical data, then we’re effectively guessing—and that is dangerous! Here’s one of my favorite passages from the DevOps Handbook:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The outcomes of A/B tests are often startling. Ronny Kohavi, Distinguished Engineer and General Manager of the Analysis and Experimentation group at Microsoft, observed that after “evaluating well-designed and executed experiments that were designed to improve a key metric, only about one-third were successful at improving the key metric!” In other words, two-thirds of features either have a negligible impact or actually make things worse. Kohavi goes on to note that all these features were originally thought to be reasonable, good ideas, further elevating the need for user testing over intuition and expert opinions.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Whoa, scary, right? If Microsoft only had a one-third success rate for well-designed experiments, imagine their luck if they were shooting from the hip.&lt;/p&gt;

&lt;p&gt;As scary as it seems, this is a great rallying point for the principle of feedback. The principle of feedback calls for establishing automated telemetry across all phases of the value stream—from planning, through development, and into production—so that teams can monitor their health across the value stream and, ultimately, their progress towards business objectives. In other words, if teams have telemetry about their current course, they’re able to—and hopefully more likely to—take a step back and see if things are moving in the right direction.&lt;/p&gt;

&lt;p&gt;That’s a good stopping point because we can pick up this conversation again with the third way of DevOps in the next episode. Head over to the podcast website smallbatches.dev for a transcript and show notes. If you enjoyed this episode then please tweet it and share it with your friends and colleagues.&lt;/p&gt;

&lt;p&gt;Until the next one, good luck out there and happy shipping. Adios!&lt;/p&gt;

</description>
      <category>devops</category>
      <category>continuousdelivery</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>The Principle of Flow</title>
      <dc:creator>Adam Hawkins</dc:creator>
      <pubDate>Tue, 10 Mar 2020 00:00:00 +0000</pubDate>
      <link>https://dev.to/ahawkins/the-principle-of-flow-219h</link>
      <guid>https://dev.to/ahawkins/the-principle-of-flow-219h</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a podcast episode transcript. Visit the &lt;a href="https://share.transistor.fm/s/83f51319"&gt;podcast website&lt;/a&gt; for the episode, show notes, and other freebies.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;The process used to write code and deploy it to production is the biggest contributor to your team's velocity. You've probably been in the situation where something is seriously broken in production and you need to deploy a fix right now. You may have even tried to circumvent the existing process to get the fix out faster. Simply put, the faster your team can write and deploy code to production, the better. This is the principle of flow, or the "first way" of DevOps.&lt;/p&gt;

&lt;p&gt;The DevOps Handbook provides a two-step process for achieving fast flow from development to production.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Step 1: use trunk-based development and continuous integration&lt;/li&gt;
&lt;li&gt;Step 2: use continuous delivery&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You've likely heard these terms before. They're thrown around and often used incorrectly. Continuous integration is a prime example. I'll do my best to clarify the right and proper way to achieve fast flow from development to production.&lt;/p&gt;

&lt;p&gt;The goal at the end of each development cycle is to produce production-ready builds from master that have been verified in a production-like environment and validated with automated tests. That's continuous delivery in a nutshell. Continuous deployment takes it one step further by automatically pushing code to production, but that's a topic for another episode. For now let's start with trunk-based development.&lt;/p&gt;

&lt;p&gt;I bet the mere mention of "trunk" makes some of you shudder. Some of you may even be thinking: "Trunk? What is this madman talking about, SVN? We use git, so what's the point?" Well, the point is reducing cycle times from development to production. Trunk-based development optimizes for team productivity instead of individual productivity.&lt;/p&gt;

&lt;p&gt;Trunk-based development boils down to keeping branches small and maintaining an incremental, straight line of development. Branches should be merged to trunk (or master) at the end of the day. They must be covered by automated tests so it's clear which commits are broken. This is the origin of continuous integration (and there are a lot of "continuous" things in DevOps). This practice ensures commits are smaller, thus easier to write, test, and deploy to production, improving cycle times from development to production.&lt;/p&gt;

&lt;p&gt;I can hear some of you saying: "Adam, wait, what the hell are you talking about? How does this make any sense? What am I supposed to do with my feature branches? What about our epic branches that are open for weeks?" The answer to those questions lies in a perspective shift regarding individual roles and what a team values, but I want to put these questions another way.&lt;/p&gt;

&lt;p&gt;Would you rather work in your topic branch for as long as possible, or would you rather get your code out the door and into production? I choose production and you should too. Anyways, I don't want to get into the weeds on this topic because it's somewhat controversial, so check the show notes for more links. Let's move forward to continuous delivery.&lt;/p&gt;

&lt;p&gt;The idea here is to connect commits from trunk/master to an automated deployment pipeline that verifies builds are fit for production. Naturally this requires varying levels of tests and automation. Now don't get lost in the claims from the blogosphere that you need Docker or microservices to do this. These proclamations miss the point that technical practices like infrastructure as code and automated testing are more important than specific technologies. There's no prescriptive solution, but I'll provide an outline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Deploy code to a staging environment&lt;/li&gt;
&lt;li&gt;Run a smoke test against staging&lt;/li&gt;
&lt;li&gt;Deploy code to production, ideally using a blue/green or canary deploy&lt;/li&gt;
&lt;li&gt;Run smoke tests against production&lt;/li&gt;
&lt;li&gt;All good? You’re done. If not, rollback.&lt;/li&gt;
&lt;/ol&gt;
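&lt;p&gt;Here’s a sketch of that outline in Python. The &lt;code&gt;deploy&lt;/code&gt;, &lt;code&gt;smoke_test&lt;/code&gt;, and &lt;code&gt;rollback&lt;/code&gt; functions are stand-ins for whatever your tooling actually provides:&lt;/p&gt;

```python
# A sketch of the five-step pipeline above; the functions are hypothetical.

def deploy(env: str, build: str) -> None:
    print(f"deploying {build} to {env}")

def smoke_test(env: str) -> bool:
    print(f"smoke testing {env}")
    return True  # stand-in; a real check hits the environment's endpoints

def rollback(env: str) -> None:
    print(f"rolling back {env}")

def run_pipeline(build: str) -> bool:
    deploy("staging", build)            # 1. deploy to staging
    if not smoke_test("staging"):       # 2. smoke test staging
        return False
    deploy("production", build)         # 3. deploy to production (blue/green or canary)
    if smoke_test("production"):        # 4. smoke test production
        return True                     # 5. all good? done
    rollback("production")              #    otherwise, roll back
    return False
```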

&lt;p&gt;Expand out to more pre-production environments as necessary. You may have a dedicated performance testing environment, a manual QA environment, or whatever floats your boat. Honestly, it doesn't matter how many environments you have as long as promotion and verification are automated as much as possible. That said, your number of environments will likely grow over time as your deployment pipeline becomes more rigorous.&lt;/p&gt;

&lt;p&gt;Alright, that's enough for this batch. The principle of flow covers reducing cycle times from development to production. Trunk-based development backed by continuous integration and continuous delivery is the best way to do that.&lt;/p&gt;

&lt;p&gt;The book Accelerate provides two metrics to measure flow: lead time and deployment frequency. Lead time is how long it takes to go from commit to production. Deployment frequency is simply how often deploys happen. Accelerate also breaks down these metrics into tiers.&lt;/p&gt;

&lt;p&gt;Top tier lead times are under an hour. This means a developer can start working on code and deliver it to production in under an hour. Mid-tier lead times range between a week and a month. How does your team stack up?&lt;/p&gt;

&lt;p&gt;Anyways, that’s a wrap on this episode. Head over to the podcast’s website smallbatches.dev for a transcript, show notes, and links to my review and further analysis of both the DevOps Handbook and Accelerate.&lt;/p&gt;

&lt;p&gt;Until the next one, good luck out there and happy shipping.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>continuousdelivery</category>
      <category>leanit</category>
    </item>
    <item>
      <title>The Three Ways of DevOps</title>
      <dc:creator>Adam Hawkins</dc:creator>
      <pubDate>Tue, 25 Feb 2020 06:00:00 +0000</pubDate>
      <link>https://dev.to/ahawkins/the-three-ways-of-devops-42ph</link>
      <guid>https://dev.to/ahawkins/the-three-ways-of-devops-42ph</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a podcast episode transcript. Visit the &lt;a href="https://share.transistor.fm/s/c72e76bc"&gt;podcast website&lt;/a&gt; for the episode, show notes, and other freebies.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;This is it, the first episode of my podcast! The new year gave me the push I needed to start shipping.&lt;/p&gt;

&lt;p&gt;I launched this podcast to share ideas and practices that have helped me throughout my career. I write each episode to be short and informative so you can fit them in over a cup of coffee (or sometimes two). Think of Small Batches as a free form anthology on the wide world of software engineering and business. &lt;/p&gt;

&lt;p&gt;Let's set the stage before we dive into today's topic. My goal as a software engineer is to create systems that are easier to build, test, deploy, and operate in production. I achieve those goals through DevOps.&lt;/p&gt;

&lt;p&gt;DevOps connects our work as engineers to business value. Exploring, internalizing, and implementing DevOps changed my career. I've spent the last few years reading and writing about DevOps, so there's no better way to launch this podcast than introducing DevOps.&lt;/p&gt;

&lt;p&gt;For this, I turn to two of the best books on the topic: The DevOps Handbook and Accelerate. Both are written by Gene Kim (you may also know him from The Phoenix Project and now The Unicorn Project) and other coauthors. The DevOps Handbook introduces DevOps principles and their associated technical practices. Accelerate provides metrics to measure progress and evidence of DevOps' effectiveness. You need both to understand DevOps.&lt;/p&gt;

&lt;p&gt;The DevOps Handbook introduces the "Three Ways of DevOps": flow, feedback, and learning. Each builds on the others to create a feedback loop between technology and business. Accelerate defines four metrics: lead time, deployment frequency, mean-time-to-resolve, and change failure rate. Here's how the two fit together.&lt;/p&gt;

&lt;p&gt;Flow, the first way of DevOps, establishes fast flow from development to production. Organizations achieve this goal by breaking work into smaller batches, preventing defects with continuous integration, and automating deployments. These practices fall under the continuous delivery umbrella. Teams can measure their flow with two metrics: lead time and deployment frequency. Lead time is the time it takes to go from commit to production. Deployment frequency is simply how often deploys happen.&lt;/p&gt;

&lt;p&gt;Feedback, the second way of DevOps, establishes right-to-left flow of telemetry across the value stream to ensure early detection and recovery, and to prevent subsequent regressions. The idea is to use production learnings to drive subsequent development. Here are some examples: say you just shipped a new feature: how is engagement measured? Another: are there metrics or logs that engineers can use to monitor system health? One more: do we know how long our builds are taking? Teams can measure the second way with the well-known mean-time-to-resolve (or MTTR) metric.&lt;/p&gt;

&lt;p&gt;Learning, the third way of DevOps, enables a high-trust culture focused on scientific experimentation and learning. The idea is that once work ships to production quickly and telemetry is in place across the value stream, teams should improve their processes through scientific experimentation. The principle of flow enables teams to quickly ship new business ideas or process improvements. The principle of feedback ensures teams have the information to empirically validate their ideas. Organizations apply this principle through activities such as blameless postmortems and A/B tests. Teams can measure the third way by tracking their change failure rate: the percentage of changes that result in degraded service or require a follow-up action like a patch or rollback. Although I offer a different interpretation: I think of it as the percentage of changes that did not deliver the expected results. But that's a topic for a future episode.&lt;/p&gt;

&lt;p&gt;I have much more to say on all these topics but I'll leave that for future episodes. Let's recap: Apply continuous delivery for fast flow from development to production; add telemetry across your process and use it to drive future decisions, then strive to improve both those processes. Measure your progress with lead time, deployment frequency, MTTR, and change failure rate.&lt;/p&gt;

&lt;p&gt;Your trajectory should be to decrease lead time, increase deployment frequency, decrease MTTR, and decrease change failure rate. Or in other words, as you improve velocity, stability comes along for the ride.&lt;/p&gt;

&lt;p&gt;Alright, that's a wrap on this episode. Go to smallbatches.fm for show notes, a transcript, and links to my review and analysis of the DevOps Handbook and Accelerate. Also subscribe to this podcast to receive future episodes.&lt;/p&gt;

&lt;p&gt;Until the next one, good luck out there and happy shipping!&lt;/p&gt;

</description>
      <category>devops</category>
      <category>continuousdelivery</category>
      <category>devopshandbook</category>
    </item>
    <item>
      <title>How DevOps Increases System Security</title>
      <dc:creator>Adam Hawkins</dc:creator>
      <pubDate>Thu, 11 Apr 2019 07:35:47 +0000</pubDate>
      <link>https://dev.to/ahawkins/how-devops-increases-system-security-1cgj</link>
      <guid>https://dev.to/ahawkins/how-devops-increases-system-security-1cgj</guid>
      <description>&lt;p&gt;The perception of DevOps and its role in the IT industry has changed over the last five years due to research, adoption, and experimentation. &lt;a href="https://itrevolution.com/book/accelerate/"&gt;Accelerate: The Science of Lean Software and DevOps&lt;/a&gt; by Gene Kim, Jez Humble, and Nicole Forsgren makes data-backed predictions about how DevOps principles and practices yield better software in almost any measurable way and more successful businesses. Their research, along with others such as James Wickett and Josh Corman, former CTO of Sonartype and respected information security researcher, has centered around the concept of incorporating information &lt;a href="https://cloudacademy.com/blog/category/security/"&gt;security&lt;/a&gt; objectives into DevOps (a set of practices and principles they termed “Rugged DevOps”). Dr. Tapabrata Pal, Director and Platform Engineering Technical Fellow at Capital One, came up with similar ideas and described their processes as DevOpsSec, having dispelled the myth that DevOps and system security are orthogonal.&lt;/p&gt;

&lt;p&gt;In fact, it’s the opposite. DevOps practices &lt;em&gt;done right&lt;/em&gt; increase system security in the same way that continuous delivery increases stability.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://itrevolution.com/the-three-ways-principles-underpinning-devops/"&gt;The Three Ways&lt;/a&gt; of DevOps describe continuous delivery, production to development feedback, and constant learning. Continuous delivery requires developing software in incrementally small changes and verifying each change with automated &lt;a href="https://cloudacademy.com/blog/testing-through-the-deployment-pipeline/"&gt;tests across a deployment pipeline&lt;/a&gt;. The computerized pipeline offers teams multiple ways to improve security when compared to software development without an automated deployment pipeline.&lt;/p&gt;

&lt;p&gt;Security issues are like any other software regression. They can be tested for so that they don’t reach production. There are multiple ways to apply automated testing to InfoSec:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scan container or VM images for known software vulnerabilities and fail builds
that contain known problematic packages&lt;/li&gt;
&lt;li&gt;Run static analysis tools for calls to potentially dangerous system calls and
fail builds accordingly&lt;/li&gt;
&lt;li&gt;Lint code for plain text secrets like API tokens or SSH keys and fail builds
consequently&lt;/li&gt;
&lt;li&gt;Run end-to-end tests, like those from &lt;a href="https://www.owasp.org/"&gt;OWASP&lt;/a&gt;, against
build artifacts&lt;/li&gt;
&lt;/ul&gt;
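&lt;p&gt;As an example of the third item, here’s a naive secrets linter in Python. The patterns are illustrative only; real tools ship far more comprehensive rules:&lt;/p&gt;

```python
import re

# Naive patterns for plain-text secrets. Illustrative only.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),  # shaped like an AWS access key ID
    re.compile(r"-----BEGIN (?:RSA |OPENSSH )?PRIVATE KEY-----"),
    re.compile(r"(?i)api[_-]?token\s*=\s*['\"][^'\"]+['\"]"),
]

def find_secrets(source: str) -> list:
    """Return secret-looking matches; a non-empty result should fail the build."""
    return [m.group(0) for p in SECRET_PATTERNS for m in p.finditer(source)]
```

&lt;p&gt;A CI step would run this over every file in the repository and exit non-zero on any match.&lt;/p&gt;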

&lt;p&gt;Adding these tests to the deployment pipeline dramatically increases security since everything is automated; this is known as a “&lt;a href="https://smartbear.com/learn/automated-testing/shifting-left-in-testing/"&gt;shift left&lt;/a&gt;”. It ensures software is secure from the start, automatically, and throughout the pipeline.&lt;/p&gt;

&lt;p&gt;Organizations often do not have enough InfoSec engineers to go around. That has negative consequences: InfoSec checks are pushed to the end of the process and may only happen when there’s enough capacity. Consider for a moment only running your existing automated test suite when there happened to be an extra engineer on the team. Accepting that proposition for automated functional testing would be ludicrous in modern IT, so why allow it for InfoSec testing? Adding InfoSec tests to the pipeline verifies each change and scales with the organization. The deployment pipeline is a bigger force for change than a few engineers. More importantly, adding tests exposes issues to everyone and shifts responsibility to the code author to patch the regression.&lt;/p&gt;

&lt;p&gt;Automated tests ensure known regressions do not enter production. However, they do not guard against attacks and other malicious activity in production. Teams need to track and alert on telemetry data that indicates malicious activity or other red flags in production. This is the second way of DevOps: establishing feedback from production to development. Teams already have production telemetry for latency, request count, active users, and so on, so InfoSec telemetry should be integrated as well. Examples include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SSH connections&lt;/li&gt;
&lt;li&gt;User logins&lt;/li&gt;
&lt;li&gt;Password resets&lt;/li&gt;
&lt;li&gt;Malicious SQL queries&lt;/li&gt;
&lt;li&gt;Malformed requests that may indicate probing or other malicious activity&lt;/li&gt;
&lt;li&gt;Email address (or additional login information) changes&lt;/li&gt;
&lt;li&gt;Billing or payment information changes&lt;/li&gt;
&lt;li&gt;Infrastructure security group or firewall changes&lt;/li&gt;
&lt;li&gt;XSS attacks&lt;/li&gt;
&lt;li&gt;Infrastructure changes such as network, new system users, or modified file
checksums&lt;/li&gt;
&lt;li&gt;Privilege escalation (e.g., sudo calls)&lt;/li&gt;
&lt;/ul&gt;
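&lt;p&gt;Here’s a sketch of what tracking and alerting on such telemetry might look like. The event names and threshold are invented for illustration:&lt;/p&gt;

```python
from collections import Counter

# Hypothetical security event stream; real telemetry comes from logs or metrics.
events = [
    "user_login_failed", "user_login_failed", "user_login_failed",
    "password_reset", "ssh_connection", "user_login_failed",
]

counts = Counter(events)

# Crude threshold alert: a burst of failed logins may indicate brute forcing.
FAILED_LOGIN_THRESHOLD = 3
alert = counts["user_login_failed"] > FAILED_LOGIN_THRESHOLD
```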

&lt;p&gt;This kind of telemetry data is critical to understanding how the system is being used in production. Based on this insight, teams can act by adding regression tests to the pipeline once they’ve identified potential problems, resulting in an improved security posture for production. More importantly, it increases visibility. Security changes are more likely to occur when a team realizes they’re under attack.&lt;/p&gt;

&lt;p&gt;Nick Galbreath from Etsy echoes this sentiment after graphing security telemetry:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;One of the results of showing this graph was that developers realized that they were being attacked all the time! And that was awesome because it changed how developers thought about the security of their code as they were writing the&lt;br&gt;
code.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This practice also aids scenarios where pre-production testing and compliance checks are not enough. Accelerate includes a troubling case study of an ATM vendor that demonstrates the value of production InfoSec telemetry. The vendor noticed their ATMs were put into maintenance mode at unscheduled times, which allowed an attacker to physically extract cash from the machines. A developer had installed the backdoor years earlier. Apparently, backdoors of this type are difficult or near impossible to detect beforehand. However, the production telemetry detected the anomaly and alerted the team. The team proactively found the fraud and resolved the issue before the scheduled cash audit.&lt;/p&gt;

&lt;p&gt;These examples demonstrate how DevOps practices improve system security. First, like any other aspect of software, add automated tests to the deployment pipeline. Second, add telemetry to production to direct development changes. The third way calls for learning and experimentation to further improve the software development process. Unfortunately, teams sometimes miss this aspect. DevOps establishes feedback loops, and the third way continuously improves them to reduce toil, reduce bugs, and adapt to changing conditions.&lt;/p&gt;

&lt;p&gt;Compliance and auditing is a common pain point. It slows down the process since documentation has to be produced and manual reviews are required. This doesn’t have to halt the process. Automation can drastically improve the compliance and auditing process by removing toil. The &lt;a href="https://landing.google.com/sre/sre-book/chapters/eliminating-toil/"&gt;Google SRE Book&lt;/a&gt; defines toil as “the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.” Accelerate includes a case study of 18F and Cloud.gov.&lt;/p&gt;

&lt;p&gt;The case study demonstrates a government organization implementing an automated process for writing system security plans (SSPs) and obtaining an authority to operate from the designated authority. SSPs must be reviewed. They’re often hundreds of pages and highly detailed. Creating and maintaining them manually is impossible in a dynamic cloud environment. 18F created a tool that automatically generates an SSP as YAML, which can be transformed into PDFs or published as GitBooks for internal and external review, saving immeasurable amounts of man-hours (and increasing happiness in the process). Private sector IT companies tend to have a more relaxed level of regulation. Regardless, the same compliance and auditing techniques can and should be leveraged to reduce ongoing effort and toil.&lt;/p&gt;

&lt;p&gt;Similar approaches may be used in downstream auditing and compliance processes. Given that production telemetry systems contain InfoSec data, they may be exposed to auditors in a self-service way during reviews. Auditors can check controls like appropriate logging or specific event handling. The deployment pipeline also provides a complete change history for the application in production. It’s possible to generate compliance reports using the code, the deployment pipeline, and other automation. This approach again reduces toil for all involved, increases accuracy, and ideally leads to more completed audits.&lt;/p&gt;

&lt;p&gt;DevOps is the best way for modern IT to build, test, and ship software. The Three Ways provide a framework for understanding how and why to approach software development problems. Changing and improving InfoSec is not so different from what the cloud and continuous delivery did for software delivery. Everything stems from the idea that &lt;a href="https://martinfowler.com/bliki/FrequencyReducesDifficulty.html"&gt;increasing frequency decreases difficulty&lt;/a&gt;. That idea took teams from deploying quarterly to measuring deploys-per-day per developer. That’s an astonishing velocity improvement. InfoSec can effect the same change by applying the three ways: automated testing, production telemetry, and continuous learning and improvement. Applying all three builds a culture of continuous verification that ultimately raises the security floor across the industry. That sounds like a textbook case of increasing security today and in the future.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>infosec</category>
    </item>
    <item>
      <title>CD ChatOps with Slack &amp; Buildkite</title>
      <dc:creator>Adam Hawkins</dc:creator>
      <pubDate>Mon, 04 Mar 2019 11:54:58 +0000</pubDate>
      <link>https://dev.to/ahawkins/cd-chatops-with-slack--buildkite-55c2</link>
      <guid>https://dev.to/ahawkins/cd-chatops-with-slack--buildkite-55c2</guid>
      <description>&lt;p&gt;&lt;a href="https://buildkite.com/"&gt;Buildkite&lt;/a&gt; is my preferred deployment pipeline system. I prefer Buildkite because I can run agents on my own infrastructure. This means I can control AWS access with IAM and even bake AMIs with all dependencies for faster pipelines. &lt;/p&gt;

&lt;p&gt;Buildkite pipelines may be triggered from &lt;a href="https://developer.github.com/v3/guides/delivering-deployments/"&gt;GitHub deployments&lt;/a&gt;. The only catch is building a nice UI for triggering deployments. I recently started using &lt;a href="https://getslashdeploy.com/"&gt;SlashDeploy&lt;/a&gt; (same name, but no affiliation) to trigger deploys ChatOps-style from Slack. Here’s how it works.&lt;/p&gt;

&lt;h2&gt;
  
  
  ChatOps via Slack
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--acpTcUJE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/800/0%2AAQCajLErTwTYcdV7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--acpTcUJE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/800/0%2AAQCajLErTwTYcdV7.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;SlashDeploy adds the &lt;code&gt;/deploy&lt;/code&gt; command to Slack. &lt;code&gt;/deploy&lt;/code&gt; is useful because it is a small, sharp tool: it &lt;em&gt;only&lt;/em&gt; triggers GitHub deployments. That is a clear integration point with other systems, which means &lt;code&gt;/deploy&lt;/code&gt; can integrate with any deployment pipeline. &lt;code&gt;/deploy&lt;/code&gt; and Buildkite work especially well together because &lt;code&gt;/deploy&lt;/code&gt; maps directly to Buildkite pipelines. &lt;code&gt;/deploy&lt;/code&gt; can also specify the GitHub deployment task (such as migrate, seed, or activate maintenance), which may be processed inside Buildkite to trigger other pipelines. Plus, all of this happens in Slack, so anyone can &lt;code&gt;/deploy APP&lt;/code&gt; or &lt;code&gt;/deploy APP with TASK&lt;/code&gt;.&lt;/p&gt;
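&lt;p&gt;Under the hood, &lt;code&gt;/deploy APP with TASK&lt;/code&gt; creates a GitHub deployment through the API. Here’s a rough sketch of the equivalent request payload; the owner/repo names and token handling are placeholders, not SlashDeploy internals:&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Sketch of the GitHub Deployments API payload behind "/deploy APP with TASK".
# OWNER/REPO and the token are placeholders, not SlashDeploy internals.
set -euo pipefail

payload() {
    local ref="$1" task="$2" environment="$3"
    printf '{"ref":"%s","task":"%s","environment":"%s"}\n' \
        "${ref}" "${task}" "${environment}"
}

payload master seed dev
# Then POST it to the Deployments API (token handling omitted):
# curl -s -H "Authorization: token ${GITHUB_TOKEN}" \
#   -d "$(payload master seed dev)" \
#   https://api.github.com/repos/OWNER/REPO/deployments
```

&lt;p&gt;The &lt;code&gt;task&lt;/code&gt; and &lt;code&gt;environment&lt;/code&gt; fields are what Buildkite later receives as environment variables.&lt;/p&gt;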

&lt;h2&gt;
  
  
  Building the Deployment Pipeline
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BnzXj-Ns--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1000/0%2AIviGFW52G23iHipQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BnzXj-Ns--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1000/0%2AIviGFW52G23iHipQ.png" alt=""&gt;&lt;/a&gt;&lt;br&gt;
&lt;span class="figcaption_hack"&gt;&lt;em&gt;Buildkite Pipeline Overview.&lt;/em&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;I prefer continuous deployment with the option to trigger manual deployments when needed. &lt;code&gt;/deploy&lt;/code&gt; supports both scenarios. &lt;code&gt;/deploy&lt;/code&gt; is configured in &lt;code&gt;.slashdeploy.yml&lt;/code&gt; to trigger automatic deployments for my intended branch when tests pass:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="na"&gt;environments&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;production&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Important to everyone see how to deploy&lt;/span&gt;
    &lt;span class="na"&gt;respond_in_channel&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
    &lt;span class="c1"&gt;# For notifications&lt;/span&gt;
    &lt;span class="na"&gt;channel&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ops&lt;/span&gt;
    &lt;span class="na"&gt;checks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;buildkite/mono-tests&lt;/span&gt;
    &lt;span class="na"&gt;auto_deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="c1"&gt;# auto deploy master&lt;/span&gt;
      &lt;span class="na"&gt;ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;refs/heads/master&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;I use a custom pipeline script to process the particular deployment environment and task. This script is the deployment pipeline’s entry point. Buildkite pipelines can also trigger other pipelines as steps in larger pipelines. Here’s an example.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;production&lt;/code&gt; deploy does not call the seed pipeline, but a &lt;code&gt;dev&lt;/code&gt; deploy does. Team members can also invoke &lt;code&gt;/deploy app with seed&lt;/code&gt;; the deployment task is set to &lt;code&gt;seed&lt;/code&gt; in that case (the default is &lt;code&gt;deploy&lt;/code&gt;). My pipeline script checks these two values, then loads the relevant pipeline file via &lt;code&gt;buildkite-agent pipeline upload&lt;/code&gt;. Here's a skeleton:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#!/usr/bin/env bash

set -euo pipefail

main() {
    local environment="${BUILDKITE_GITHUB_DEPLOYMENT_ENVIRONMENT?required}"
    local task="${BUILDKITE_GITHUB_DEPLOYMENT_TASK?required}"

    local pipeline=".buidlkite/${environment}-${task}.yml"

    if [ -f "${pipeline}" ]; then
        buildkite pipeline upload "${pipeline}"
    else
        echo "Cannot handle ${envrionment}/${task} invocation!" 1&amp;gt;&amp;amp;2
        return 1
    fi
}

main "$@"
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This approach keeps it simple by mapping each environment/task to a specific &lt;a href="https://buildkite.com/docs/pipelines/defining-steps"&gt;pipeline file&lt;/a&gt;. The pipelines may also &lt;a href="https://buildkite.com/docs/pipelines/trigger-step"&gt;trigger other pipelines&lt;/a&gt; like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- label: ':rocket: :seedling:'
    trigger: mono-seed
    build:
      commit: "${BUILDKITE_COMMIT?}"
      branch: "${BUILDKITE_BRANCH?}"
      env:
        BUILDKITE_GITHUB_DEPLOYMENT_ENVIRONMENT: "${BUILDKITE_GITHUB_DEPLOYMENT_ENVIRONMENT?}"
        BUILDKITE_GITHUB_DEPLOYMENT_TASK: "${BUILDKITE_GITHUB_DEPLOYMENT_TASK?}"
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
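&lt;p&gt;For reference, a minimal environment/task pipeline file in this scheme might look like the following. The step label and script path are hypothetical, not taken from my actual setup:&lt;/p&gt;

```yaml
# .buildkite/production-deploy.yml (hypothetical example)
steps:
  - label: ':rocket: Deploy production'
    command: script/deploy production
    # Serialize production deploys so they never overlap
    concurrency: 1
    concurrency_group: production-deploy
```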



&lt;p&gt;Lastly, I used &lt;a href="https://buildkite.com/docs/agent/v3/cli-annotate"&gt;Buildkite annotations&lt;/a&gt; to decorate the pipeline UI with the environment and task. This information is otherwise hidden behind a few clicks. Annotations make it easy to find the relevant build when scrolling through pipeline views. Here’s a screenshot of a production deploy with an annotation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--R5_JKYg2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1000/0%2A3XxW_PEp73NrCcad.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--R5_JKYg2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1000/0%2A3XxW_PEp73NrCcad.png" alt=""&gt;&lt;/a&gt;&lt;br&gt;
&lt;span class="figcaption_hack"&gt;&lt;em&gt;Production deploy with annotation and a skipped step.&lt;/em&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Adding the annotation requires an additional pipeline step with an associated script. Buildkite annotations are Markdown, and it was easier to handle whitespace-sensitive strings and environment variable substitution in a separate file. Here are the relevant code snippets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# the pipeline step:
steps:
  - label: ':console: Annotate'
    command: script/buildkite/deploy-annotation | buildkite-agent annotate --style info

# The annotation script:
#!/usr/bin/env bash

set -euo pipefail

cat &amp;lt;&amp;lt;EOF
- Environment: **${BUILDKITE_GITHUB_DEPLOYMENT_ENVIRONMENT}**
- Task: **${BUILDKITE_GITHUB_DEPLOYMENT_TASK}**
EOF
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://buildkite.com/"&gt;Buildkite&lt;/a&gt; has long been my preferred deployment pipeline software. Now &lt;code&gt;/deploy&lt;/code&gt; combined with dedicated deploy pipelines and ChatOps style GitHub deployment triggers make the setup is better than ever. If you're building a new deployment pipeline then I highly recommend Buildkite (and their &lt;a href="https://github.com/buildkite/elastic-ci-stack-for-aws"&gt;Elastic Stack&lt;/a&gt; if running on AWS) paired with &lt;a href="https://www.getslashdeploy.com/"&gt;SlashDeploy&lt;/a&gt; for ChatOps and deploy triggers. It's easy and just works—a hard quality to find in software.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>chatops</category>
      <category>cd</category>
    </item>
  </channel>
</rss>
