CodingBlocks

The DevOps Handbook – Enable Daily Learning

Oct 12 '20

We dive into the benefits of injecting daily learning into our processes, while it’s egregiously late for Joe, Michael’s impersonation is awful, and Allen’s speech is degrading.

For those reading this via their podcast player, this episode’s show notes can be found at https://www.codingblocks.net/episode143, where you can join the conversation.

Survey Says

How often do you change jobs?

Job? Why would I do that when I can boss myself around.
I don't wanna. Interviewing is awful.
Every 3 years, like the Stack Overflow Survey tells me to.
About every 5 years, after I've built up enough embarrassments.

News

Thank you to everyone that left us a new review!
- iTunes: John Roland, Shefodorf, DevCT, Flemon001, ryanjcaldwell, Aceium
- Stitcher: Helia
Allen saves your butt with his latest chair review on YouTube.

Enable and Inject Learning into Daily Work

To work on complex systems effectively and safely we must get good at:
- Detecting problems,
- Solving problems, and
- Multiplying the effects by sharing the solutions within the organization.
The key is treating failures as an opportunity to learn rather than an opportunity to punish.

Establish a Just, Learning Culture

Promoting a culture where errors are treated as “just” encourages people to learn ways to remove and prevent those errors.
In contrast, an “unjust” culture promotes bureaucracy, evasion, and self-protection.
- This is how most companies and management work, i.e. they put processes in place to prevent and eliminate the possibility of errors.
Rather than blaming individuals, take moments when things go wrong as an opportunity to learn and improve the systems that will inevitably have problems.
- Not only does this improve the organization’s systems, it also strengthens relationships between team members.
When developers who cause an error are encouraged to share the details of the error and how to fix it, everyone ultimately benefits: the fear of consequences is lowered, and the number of solutions for keeping that particular problem from happening again increases.

Blameless Post Mortem

Create timelines and collect details from many perspectives.
Empower engineers to provide details of how they may have contributed to the failures.
Encourage those who made the mistakes to share them with the organization, along with how to avoid those mistakes in the future.
Don’t dwell on hindsight, i.e. the coulda, woulda, and shoulda comments.
Propose countermeasures to ensure similar failures don’t occur in the future and schedule a date to complete those countermeasures.

Stakeholders that should be present at these meetings

People who were a part of making the decisions that caused the problem.
People who found the problem.
People who responded to the problem.
People who diagnosed the problem.
People who were affected by the problem.
Anyone who might want to attend the meeting.

The meeting

Be rigorous about recording the details of how the problem was found, diagnosed, fixed, etc.
Disallow phrases like “could have” or “should have” because they are counterproductive.
Reserve enough time to brainstorm countermeasures to implement.
- These must be prioritized and given a timeline for implementation.
Publish the learnings and timelines, etc. from the meeting so the entire organization can gain from them.

Finding more Failures as Time Moves on

As you get better at resolving egregious errors, the errors become more subtle and you need to modify your tolerances to find weaker signals indicating errors.
Treat applications as experiments where everything is analyzed, rather than enforcing stringent compliance and standardization.
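
As a rough illustration of tightening tolerances to catch weaker signals, here is a minimal Python sketch. It assumes a baseline of known-good response times and flags new samples by how many standard deviations they sit from that baseline. The function name `find_anomalies`, the sample values, and both thresholds are made up for this example and are not from the book or the episode.

```python
from statistics import mean, stdev


def find_anomalies(baseline, samples, threshold):
    """Return (index, value) pairs in samples that deviate from the
    baseline mean by more than `threshold` standard deviations."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline gives no scale to measure against.
        return []
    return [
        (i, value)
        for i, value in enumerate(samples)
        if abs(value - mu) / sigma > threshold
    ]


# Hypothetical response times in milliseconds from a known-good period.
baseline_ms = [100, 102, 98, 101, 99, 100, 103, 97]
# Hypothetical new measurements: one obvious spike and two subtle drifts.
new_ms = [101, 180, 105, 99, 107]

# A loose tolerance only surfaces the egregious error.
loose = find_anomalies(baseline_ms, new_ms, threshold=5.0)
# A tighter tolerance also surfaces the weaker signals.
tight = find_anomalies(baseline_ms, new_ms, threshold=2.0)

print("loose:", loose)
print("tight:", tight)
```

With the loose threshold only the 180 ms spike is reported; lowering the threshold also surfaces the 105 ms and 107 ms samples, which is the kind of weaker signal worth investigating once the obvious failures are under control.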

Redefine Failure and Encourage Calculated Risk Taking

Create a culture where people are comfortable with surfacing and learning from failures.
It seems counter-intuitive, but allowing more failures also means that you’re moving the ball forward.

Inject Production Failures

The purpose is to make sure failures can happen in controlled ways.
- We should think about making our systems crash in a way that keeps the key components protected as much as possible, i.e. graceful degradation.
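
To make graceful degradation a bit more concrete, here is a small Python sketch. It assumes a checkout page with one critical dependency (the cart) and one non-critical dependency (recommendations). All of the function names and data are hypothetical and only illustrate the idea: when the non-critical piece fails, the page still renders with reduced functionality.

```python
def fetch_recommendations(user_id):
    """Hypothetical non-critical dependency that is currently down."""
    raise ConnectionError("recommendation service unavailable")


def load_cart(user_id):
    """Hypothetical critical dependency that is healthy."""
    return ["keyboard", "mouse"]


def render_checkout(user_id, recommender=fetch_recommendations):
    # Critical path: if the cart cannot be loaded, let the error propagate.
    cart = load_cart(user_id)

    # Non-critical path: degrade to an empty list instead of failing the page.
    try:
        suggestions = recommender(user_id)
        degraded = False
    except (ConnectionError, TimeoutError):
        suggestions = []
        degraded = True

    return {"cart": cart, "suggestions": suggestions, "degraded": degraded}


page = render_checkout(42)
print(page)
```

The `degraded` flag is there so monitoring can still see that something is wrong even though the user's checkout kept working.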

Use Game Days to Rehearse Failures

“A service is not really tested until we break it in production.”
Jesse Robbins

Introduce large-scale fault injection across your critical systems.
These game days are scheduled with a goal, like maybe losing connectivity to a data center.
- This gives everyone time to prepare for what would need to be done to make sure the system still functions, e.g. failovers, monitoring, etc.
- Take note of anything that goes wrong, then find, fix, and retest.
On game day, force an outage.
- This exposes things you may have missed, not anticipated, etc.
- Obviously the goal is to create more resilient systems.
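
A game day like the one described above can be rehearsed in miniature. The following Python sketch is a toy simulation, not a real chaos tool: the `DataCenter`, `Router`, and `run_game_day` names, along with the data center names, are invented for illustration. It forces an outage on the primary data center, replays some requests, and checks that the router fails over.

```python
class DataCenter:
    """Toy stand-in for a data center that can be made unreachable."""

    def __init__(self, name):
        self.name = name
        self.reachable = True

    def handle(self, request):
        if not self.reachable:
            raise ConnectionError(f"{self.name} is unreachable")
        return f"{request} served by {self.name}"


class Router:
    """Sends each request to the first data center that responds."""

    def __init__(self, data_centers):
        self.data_centers = data_centers
        self.failovers = 0  # how many requests needed a fallback

    def handle(self, request):
        for position, dc in enumerate(self.data_centers):
            try:
                response = dc.handle(request)
            except ConnectionError:
                continue  # try the next data center
            if position > 0:
                self.failovers += 1
            return response
        raise RuntimeError("no data center available")


def run_game_day(router, target, requests):
    """Force an outage on `target`, replay requests, then restore it."""
    target.reachable = False
    try:
        return [router.handle(r) for r in requests]
    finally:
        target.reachable = True  # always end the exercise cleanly


primary = DataCenter("dc-east")
secondary = DataCenter("dc-west")
router = Router([primary, secondary])

results = run_game_day(router, primary, ["GET /health", "GET /orders"])
print(results)
print("failovers:", router.failovers)
```

In a real exercise the injected fault would come from tooling such as the chaos projects listed under Resources We Like, and the interesting output is the list of things that did not fail over as expected.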

Resources We Like

The DevOps Handbook: How to Create World-Class Agility, Reliability, and Security in Technology Organizations (Amazon)
The Phoenix Project: A Novel about IT, DevOps, and Helping Your Business Win (Amazon)
The Unicorn Project: A Novel about Developers, Digital Disruption, and Thriving in the Age of Data (Amazon)
Netflix Chaos Monkey (GitHub)
Chaos Mesh, a cloud-native platform that orchestrates chaos on Kubernetes environments. (GitHub)
Alexey Golub’s Twitter response (thread) to our discussion of his article Unit Testing is Overrated during episode 141.
Etsy’s post mortem tracker: morgue (GitHub)
1987 Crash Test Dummies PSA – Buckle Up (YouTube)

Tip of the Week

Firefox Relay – Hide your real email address to help protect your identity (relay.firefox.com)
- Honorable mention: Sign in with Apple (support.apple.com)
How I Built this with Guy Raz – Khan Academy: Sal Khan (NPR)
- Boost your student’s learning (Khan Academy)
Automate your world at the push of a button with the Elgato Stream Deck. (Elgato)
The Social Dilemma – A hybrid documentary-drama that explores the dangerous human impact of social networking. (Netflix)
Migrate your repos from TFVC (aka Team Foundation Version Control) to Git using git-tfs. (GitHub)
- Migrate from TFVC to Git (docs.microsoft.com)
- Use Azure DevOps to simplify the migration process: Import repositories from TFVC to Git (docs.microsoft.com)
