DEV Community

CodingBlocks

The DevOps Handbook – Enable Daily Learning

We dive into the benefits of building daily learning into our processes, while it’s egregiously late for Joe, Michael’s impersonation is awful, and Allen’s speech is degrading.

For those reading this via their podcast player, this episode’s show notes can be found at https://www.codingblocks.net/episode143, where you can join the conversation.

Sponsors

  • Datadog – Sign up today for a free 14-day trial and get a free Datadog t-shirt after your first dashboard.
  • Teamistry – A podcast that tells the stories of teams who work together in new and unexpected ways, to achieve remarkable things.

Survey Says

How often do you change jobs?
  • Job? Why would I do that when I can boss myself around.
  • I don't wanna. Interviewing is awful.
  • Every 3 years, like the Stack Overflow Survey tells me to.
  • About every 5 years, after I've built up enough embarrassments.

News

  • Thank you to everyone who left us a new review!
    • iTunes: John Roland, Shefodorf, DevCT, Flemon001, ryanjcaldwell, Aceium
    • Stitcher: Helia
  • Allen saves your butt with his latest chair review on YouTube.

Enable and Inject Learning into Daily Work

  • To work on complex systems effectively and safely we must get good at:
    • Detecting problems,
    • Solving problems, and
    • Multiplying the effects by sharing the solutions within the organization.
  • The key is treating failures as an opportunity to learn rather than an opportunity to punish.

Establish a Just, Learning Culture

  • Promoting a culture where errors are treated as “just” encourages learning ways to remove and prevent those errors.
  • In contrast, an “unjust” culture promotes bureaucracy, evasion, and self-protection.
    • This is how most companies and management work, i.e. put processes in place to prevent and eliminate the possibility of errors.
  • Rather than blaming individuals, take moments when things go wrong as an opportunity to learn and improve the systems that will inevitably have problems.
    • Not only does this improve the organization’s systems, it also strengthens relationships between team members.
  • When developers who cause an error are encouraged to share the details of the error and how to fix it, everyone ultimately benefits: the fear of consequences is lowered, and solutions for ensuring that particular problem isn’t encountered again increase.

Blameless Post Mortem

  • Create timelines and collect details from many perspectives.
  • Empower engineers to provide details of how they may have contributed to the failures.
  • Encourage those who made mistakes to share them with the organization, along with how to avoid those mistakes in the future.
  • Don’t dwell on hindsight, i.e. the coulda, woulda, and shoulda comments.
  • Propose countermeasures to ensure similar failures don’t occur in the future and schedule a date to complete those countermeasures.

Stakeholders that should be present at these meetings

  • People who were a part of making the decisions that caused the problem.
  • People who found the problem.
  • People who responded to the problem.
  • People who diagnosed the problem.
  • People who were affected by the problem.
  • Anyone who might want to attend the meeting.

The meeting

  • Be rigorous about recording details throughout the process of finding, diagnosing, and fixing the problem.
  • Disallow phrases like “could have” or “should have” because they are counterproductive.
  • Reserve enough time to brainstorm countermeasures to implement.
    • These must be prioritized and given a timeline for implementation.
  • Publish the learnings and timelines, etc. from the meeting so the entire organization can gain from them.
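
One way to keep the timeline and countermeasures from such a meeting in a publishable form is a small record structure. The following is only a minimal sketch: the class names (PostMortem, TimelineEvent, Countermeasure) and their fields are hypothetical and are not taken from the book or from any particular tracker.

```python
from dataclasses import dataclass, field
from datetime import date, datetime


@dataclass
class TimelineEvent:
    at: datetime       # when it happened
    reporter: str      # whose perspective this entry records
    description: str


@dataclass
class Countermeasure:
    description: str
    owner: str
    due: date          # the scheduled completion date
    done: bool = False


@dataclass
class PostMortem:
    title: str
    timeline: list[TimelineEvent] = field(default_factory=list)
    countermeasures: list[Countermeasure] = field(default_factory=list)

    def sorted_timeline(self) -> list[TimelineEvent]:
        # Entries arrive from many people, so order them by time on read.
        return sorted(self.timeline, key=lambda event: event.at)

    def overdue(self, today: date) -> list[Countermeasure]:
        # Countermeasures that are still open after their scheduled date.
        return [c for c in self.countermeasures if not c.done and c.due < today]
```

Giving every countermeasure an owner and a due date makes it possible to ask later which ones slipped, which is the point of scheduling them in the first place.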

Finding more Failures as Time Moves on

  • As you get better at resolving egregious errors, the errors become more subtle and you need to modify your tolerances to find weaker signals indicating errors.
  • Treat applications as experiments where everything is analyzed, rather than enforcing stringent compliance and standardization.
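
As a rough illustration of tightening tolerances to catch weaker signals, the sketch below flags samples that deviate from a known-good baseline by more than a chosen number of standard deviations. The function name, the baseline data, and the thresholds are all assumptions made for the example; real monitoring would use whatever metrics and detection methods your tooling provides.

```python
from statistics import mean, pstdev


def find_anomalies(baseline, samples, sigmas):
    """Return the indexes of samples further than `sigmas` standard
    deviations from the mean of a known-good baseline window."""
    center = mean(baseline)
    tolerance = sigmas * pstdev(baseline)
    return [i for i, value in enumerate(samples) if abs(value - center) > tolerance]


# Hypothetical response times in milliseconds.
baseline = [98, 100, 102, 100]
samples = [100, 103, 99, 104, 120]

loose = find_anomalies(baseline, samples, sigmas=3)  # only the egregious spike
tight = find_anomalies(baseline, samples, sigmas=2)  # subtler deviations too
```

With these numbers, the looser tolerance flags only the 120 ms spike, while the tighter one also flags the 103 ms and 104 ms samples.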

Redefine Failure and Encourage Calculated Risk Taking

  • Create a culture where people are comfortable with surfacing and learning from failures.
  • It seems counter-intuitive, but allowing more failures also means that you’re moving the ball forward.

Inject Production Failures

  • The purpose is to make sure failures can happen in controlled ways.
    • We should think about making our systems fail in a way that keeps the key components protected as much as possible, i.e. graceful degradation.
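
A minimal sketch of graceful degradation, assuming a hypothetical product page whose recommendations come from a non-critical dependency. The function names and the page shape are invented for illustration; the idea is only that an injected failure in the optional part should not take down the key component.

```python
def render_product_page(product_id, fetch_recommendations):
    """Build the page, degrading gracefully if recommendations fail."""
    page = {"product_id": product_id, "degraded": False}
    try:
        page["recommendations"] = fetch_recommendations(product_id)
    except Exception:
        # The optional feature failed: serve the core page without it
        # and mark the response so monitoring can see the degradation.
        page["recommendations"] = []
        page["degraded"] = True
    return page


def healthy_recommendations(product_id):
    return [product_id + 1, product_id + 2]


def failing_recommendations(product_id):
    # Stands in for an injected failure of the dependency.
    raise ConnectionError("injected failure")


normal_page = render_product_page(42, healthy_recommendations)
degraded_page = render_product_page(42, failing_recommendations)
```

Here the degraded page still renders with an empty recommendations list and a flag set, rather than raising the error to the caller.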

Use Game Days to Rehearse Failures

“A service is not really tested until we break it in production.”

Jesse Robbins
  • Introduce large-scale fault injection across your critical systems.
  • These game days are scheduled with a goal, such as simulating the loss of connectivity to a data center.
    • This gives everyone time to prepare whatever is needed to keep the system functioning: failovers, monitoring, etc.
    • Take note of anything that goes wrong, then find, fix, and retest.
  • On game day, force an outage.
    • This exposes things you may have missed, not anticipated, etc.
    • Obviously the goal is to create more resilient systems.
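
The game day flow above can be rehearsed in miniature. The sketch below is a toy simulation, not a real fault-injection tool: the Router class, the data center names, and run_game_day are all hypothetical. It forces one data center offline, checks whether requests are still served, records anything that goes wrong, and restores the data center afterwards.

```python
class Router:
    """Toy router that sends traffic to the first healthy data center."""

    def __init__(self, data_centers):
        self.healthy = {dc: True for dc in data_centers}

    def fail(self, dc):
        self.healthy[dc] = False

    def restore(self, dc):
        self.healthy[dc] = True

    def route(self):
        for dc, ok in self.healthy.items():
            if ok:
                return dc
        raise RuntimeError("no healthy data center")


def run_game_day(router, target, requests=5):
    """Force an outage of `target` and return notes on anything that broke."""
    findings = []
    router.fail(target)
    try:
        for i in range(requests):
            try:
                router.route()
            except RuntimeError as exc:
                findings.append(f"request {i}: {exc}")
    finally:
        # Always end the exercise by bringing the target back.
        router.restore(target)
    return findings


resilient = run_game_day(Router(["us-east", "us-west"]), "us-east")
fragile = run_game_day(Router(["us-east"]), "us-east")
```

An empty findings list means the failover held; a non-empty one is the list of items to find, fix, and retest before the next game day.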

Resources We Like

  • The DevOps Handbook: How to Create World-Class Agility, Reliability, and Security in Technology Organizations (Amazon)
  • The Phoenix Project: A Novel about IT, DevOps, and Helping Your Business Win (Amazon)
  • The Unicorn Project: A Novel about Developers, Digital Disruption, and Thriving in the Age of Data (Amazon)
  • Netflix Chaos Monkey (GitHub)
  • Chaos Mesh, a cloud-native platform that orchestrates chaos on Kubernetes environments. (GitHub)
  • Alexey Golub’s Twitter response (thread) to our discussion of his article Unit Testing is Overrated during episode 141.
  • Etsy’s post mortem tracker: morgue (GitHub)
  • 1987 Crash Test Dummies PSA – Buckle Up (YouTube)

Tip of the Week

  • Firefox Relay – Hide your real email address to help protect your identity (relay.firefox.com)
  • How I Built This with Guy Raz – Khan Academy: Sal Khan (NPR)
  • Automate your world at the push of a button with the Elgato Stream Deck. (Elgato)
  • /the social dilemma – A hybrid documentary-drama that explores the dangerous human impact of social networking. (Netflix)
  • Migrate your repos from TFVC (aka Team Foundation Version Control) to Git using git-tfs. (GitHub)
