DEV Community: Charity Majors

The Future of Software is a Sociotechnical Problem

Charity Majors — Fri, 18 Dec 2020 21:19:14 +0000

I learned this word from Liz Fong-Jones recently, and it immediately entered my daily lexicon. You know exactly what it means as soon as you hear it, and then you wonder how you ever lived without it.

Our systems are sociotechnical systems. This is why technical problems are never just technical problems, and why social problems are never just social problems.

I work at a company, Honeycomb, which develops next-gen observability tooling. But I don't spend my time trying to figure out how to get more people to use observability tools. Observability alone can't solve anything; it's just a necessary part of the solution.

What I do spend my day thinking about is the future of building software. How can we convert the creative fuel of people’s labor into healthier teams and more reliable, resilient systems? We are incredibly wasteful of the creative fuel that people pour into the process, and the result is unreliable, opaque hairballs of systems that nobody understands, which are then operated by stressed, burned-out humans who are afraid to touch them.

What if we had:

a future where your code goes live a few seconds or minutes after you commit your changes, and this is all very predictable and boring
a future where everyone owns their code in production, and you actually look forward to your own turn on call
a future where all the energy you pour into writing code and building systems genuinely moves the business forward, and you are rarely frustrated or lost or misled by those systems
a future where the debugger of last resort is not the engineer who has been there the longest, but the most curious person
a future where shipping software is not scary.

What do you think, does this sound achievable? Easy? Or are you thinking “never gonna happen for my team in this lifetime?”

This is all much more attainable than you might think.

The future is here, it is just unevenly distributed.

I have lived in the future. It's why I started this company — I got a brief glimpse of what I now think of as ODD, or observability-driven development, a world where the best engineers wrote code with half their screen taken up by their editor, half by a tool where they were constantly watching and poking at and playing with that code live in production. The code they wrote was better. The systems they built were understandable, in a way I had never seen before.

Going back to a world where people write and ship blind was unthinkable. Not an option.

We hear echoes of this from Honeycomb customers now: "This is incredible. I can never go back."

Because the teams who invest in these sociotechnical practices are radically more productive and happy than those who don't. They move so much faster and with more confidence; their systems are more reliable and better understood; they amass dramatically less technical debt and can do far more with radically fewer people. They attract and retain better candidates.

And as a software company, this is how you win.

We are in the Middle Ages of software delivery.

The Stripe developer report finds that engineers spend at least 40% of their time (self-reported) on miscellaneous technical bullshit that keeps you busy, maybe blocks you from working on what you need to work on ... but does not move the business forward. Just sit with that a sec. Forty percent. Optimistically.

Or maybe you're familiar with the DORA report. The Honeycomb team’s engineering stats are an order of magnitude or two better than those of its Elite teams, which represent the top 20% of all teams. ("But the company is young, easy for you to say!" you may protest. Sure, we are relatively young ... a little over four years old. We are also a fast-growing platform with unpredictable, spiky traffic composed of user-generated streams of content that we have no control over.)

I wish I could tell you "just buy Honeycomb and voila! Get high-performing teams!"

That is not what I'm saying. It is not that easy.

It's a sociotechnical hole, and only a combination of technical fixes and social change will get us out of it.

A sociotechnical recipe for high-performing teams

But lots of smart, creative teams are out there working hard on this and sharing their findings. As a result, we know a LOT more about what contributes to a solution than we knew even just a year or two ago. You will be forgiven for skimming this very long list:

Blameless retrospectives
Automatic deployments triggered on each commit, single commit per deploy
Removing human gates in the deploy pipeline
Good test coverage, instrumented test harness
Shared conventions around instrumentation
Training, education, collaboration
Code reviews and mentoring
Promoting people for their value as team members and force multipliers, not just raw coding ability
Interview processes that value strengths, not lack of weaknesses
Shared value systems and organizational transparency
Welcoming of diverse viewpoints and fresh eyes
Teams that value juniors and know how to train them up
Tooling that rewards curiosity
Job ladders that value communication and independent initiative
Encouraging software engineers to own their code from end to end
Encouraging SRE types to work more like product teams
Adopting SLOs, SLIs, and aligning on call pain strictly with user pain
Making sure everyone gets enough sleep and time off
Observability tooling (in the technical sense, as I define it here; not in the old fashioned sense of "metrics, logs and traces")

Observability is only one piece of the solution ... but it is a necessary piece that should be actively frontloaded if your efforts are to have maximum impact.

Observability is about the unknown-unknowns

Rolling out o11y tooling is like turning on the light and putting on your glasses before you start swinging at the pinata.

To get at the candy inside — the real actionable user and technical insights — you need to be able to interactively slice and dice in real time, break down by high cardinality dimensions, and ask those new questions, the ones that you couldn’t have predicted you would need to ask. This is the minimum viable technical functionality you need in order to explore exactly what is happening in production, what happened when you deployed a particular piece of code, what happened when that user reported that bug. That’s why previous iterations of monitoring were not enough.
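As a rough illustration of what "break down by high cardinality dimensions" can mean in practice, here is a minimal sketch. It assumes each request is recorded as one wide event (a plain dict); the field names, the sample data, and the `break_down` helper are all hypothetical and not any particular tool's API.

```python
from collections import defaultdict

# Hypothetical wide events: one dict per request, with as many fields as you like.
EVENTS = [
    {"user_id": "u-1001", "endpoint": "/export", "build_id": "b41", "duration_ms": 120, "error": False},
    {"user_id": "u-1002", "endpoint": "/export", "build_id": "b42", "duration_ms": 4300, "error": True},
    {"user_id": "u-1002", "endpoint": "/home", "build_id": "b42", "duration_ms": 95, "error": False},
    {"user_id": "u-1003", "endpoint": "/export", "build_id": "b42", "duration_ms": 150, "error": False},
    {"user_id": "u-1002", "endpoint": "/export", "build_id": "b42", "duration_ms": 5100, "error": True},
]


def break_down(events, by, where=lambda event: True):
    """Group events by any field (even one with millions of values) and summarize each group."""
    groups = defaultdict(list)
    for event in events:
        if where(event):
            groups[event.get(by)].append(event)

    summary = {}
    for key, rows in groups.items():
        summary[key] = {
            "count": len(rows),
            "errors": sum(1 for row in rows if row["error"]),
            "max_ms": max(row["duration_ms"] for row in rows),
        }
    return summary


# A question nobody defined in advance: who is hurting on /export, and on which build?
by_user = break_down(EVENTS, "user_id", where=lambda e: e["endpoint"] == "/export")
by_build = break_down(EVENTS, "build_id", where=lambda e: e["endpoint"] == "/export")
```

The point of the sketch is that the grouping field is chosen at question time, not at instrumentation time: the same events answer "which user?" and "which build?" without anyone having predicted either question.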

Observability in the modern technical definition is about answering the unknown-unknowns, and it is necessary. With observability, all things become easier. It is a force amplifier for all your other efforts.

If you don't have observability -- if you only have metrics, logs, and/or traces -- all you can ask will be those questions that you predicted and defined in advance. You are swinging out at the pinata in the dark, or where you think it was yesterday or last week. It might not be completely prohibitively impossible, but it's a damn sight harder and a lot comes down to luck.

Observability is a necessary ingredient. But everything matters.

People often kvetch at me "yeah, but anything's easy when you have the best engineers." They have this exactly backwards. Observability-driven development is what makes great engineers. Observability is what enables you to peek under the hood of the abstractions; it grounds you in reality and forces you to think through the code all the way to how the user will use it. It tethers you to your users and lets you see the world through their eyes.

TDD → ODD

Learning to check your assumptions vs reality was the argument for TDD (test-driven development). That makes you write better code, indisputably. But tests stop at the edge of your laptop! Tests imperfectly mock a predictable subset of reality. Testing in production means replacing the artificial test sandbox with reality.

If you believe TDD makes you a better developer, you should be hungry for the developer you will become using ODD.

I am cautiously optimistic that the industry will embrace observability in far less time than it took to adopt TDD and metrics. Mostly because it is much, much easier to do things this way. It’s actually much harder to do things the bad old ways, what with all the hacks and workarounds.

And every little bit helps. Every one of these changes will, if you embrace them, make your people happier and more productive.

Observability-driven development is what creates great software engineers.

The greatest obstacle between us and a better tomorrow is this pervasive lack of hope. (The second greatest is our perverse pride in our Rube Goldberg hacks & the sunk cost fallacy.)

Most people still have not experienced what it's like to build software in a radically better way. Even worse, most people don't see themselves in the better world I describe. They don't think this world is meant for them.

I don’t know how to fix this yet. But if we only succeed in making life better for the elites, we will have failed.

Observability is for everyone, and it is easier if you do it first. Observability makes every technical effort that comes after it sooooo much easier to achieve. Observability is what creates great engineers, not vice versa. Start at the edge, instrument some code, and work in. Rinse and repeat. You got this.

Experience what Honeycomb can do for your business. Check out our short and sweet demo!

The Future of Ops Careers — Honeycomb

Charity Majors — Fri, 13 Nov 2020 16:36:21 +0000

Have you seen Lambda: A Serverless Musical?

If not, you really have to. I love Hamilton, I love serverless, and I’m not trying to be a crank or a killjoy or police people’s language. BUT, unfortunately, the chorus chose to double-down on one of the stupidest and most dangerous tendencies the serverless movement has had from day one: misunderstanding and trash-talking operations.

“I’m gonna reduce your… ops
I’m gonna reduce your… ops”

Well, I hate to tell you, but…

“No, I am not throwing away my… ops.
And you’re not throwing away my… ops.”

Or anyone else’s for that matter.

Even if you don’t run any servers or have any infrastructure of your own, you’ll still have to deal with operability and operations engineering problems. I hate to be the bearer of bad news (not really), but the role of operations isn’t going away. At best, the shifts that supposedly reduce your ops are simply delegating the operability of your stack to someone that does it better. The reality for most teams is that operations engineering is more necessary than ever.

Beyond Hamilton clap backs, that distinction matters because it has real career ramifications for engineers who, like me, are so operationally minded. Where are Ops careers heading?

Where Does Ops Fit, Anyway?

In some corners of engineering, “ops” is straight up used as a synonym for toil and manual labor. There is no good ops, only dead ops. The existence of ops is a technical failure: a blemish to be automated away, eradicated by adding more and more code. Code defeats toil. Dev makes ops obsolete. #NoOps!

If this is such an inexorable march towards utopia, maybe someone can explain to me why the shops that flirt the hardest with #NoOps have been, without exception, such humanitarian disasters?

Or, I’ll start. Operations is ridiculously important. When you denigrate it and diminish it, that’s the first sign that you aren’t doing it well. The way to do something well generally starts with adding focus and rigor, not writing it off.

Consider Business Development and Operations. Business is the why, development is the what, operations is the how. Operations is the constellation of your organizational memory: patterns, practices, habits, defaults, aspirations, expertise, tools, and everything else used to deliver business value to users.

The value of serverless isn’t found in “less ops.” Less ops doesn’t yield better systems than more ops, any more than fewer lines of code means better software. The value of serverless is unlocked by clear and powerful abstractions that let you delegate running large portions of your infrastructure to other people who can do it better than you — yes, because of economies of scale, but more so because that’s their core business model. YOUR core business model probably has nothing to do with infrastructure.

Because of that, a great sort is now happening between software engineering, infrastructure operations, and core business value.

What Is Infrastructure?

Infrastructure is software support. It’s the prerequisite thing you have to do, in order to get to the stuff you want to do. It’s not what you want to be doing, yet your business goals presume its existence.

An important quality of infrastructure is that it typically changes less often and is more stable than the software that constitutes your core business value. The features you ship to customers are typically under constant or frequent development, and they change at the rate of pull requests and commits (in fact, the velocity of these changes can be a critical competitive advantage). Infrastructure, on the other hand, changes at a more glacial pace — at the rate of package managers, OS updates, and new machine images. It’s seconds-to-minutes versus hours-to-days.

This dividing line between infrastructure and core business value even holds true for companies whose business model is building infrastructure for other companies. For example, a company providing email focuses on products that consist of email workflow features that are constantly being developed and shipped to users. There isn’t much new business value to be wrung out of modifying commodity SMTP transport layers or optimizing IMAP servers.

To its credit, serverless is perhaps the first trend to have really understood and powerfully leveraged that dividing line. IaaS, PaaS, and full-service suites like Gitlab were all germinal forms of this shift. “Cloud native” was also, arguably, another lurch in that direction. But where has that taken our industry?

*-As-a-Service Is Really Just Code for “Outsourcing”

IaaS, PaaS, and even FaaS/serverless are really all just types of outsourcing. So why don’t we call it “outsourcing” when we rely on companies like AWS to run our datacenter and provide compute or storage, or when we use Google apps for our email, documents, and spreadsheets?

Historically, “outsourcing” is what we call shifting work off-premises when we aren’t yet comfortable with the arrangement, whether because the fit is awkward, the support is incomplete, or the service isn’t on par with what we could do ourselves. With infrastructure outsourcing, service quality is now creeping up the stack. More and more complex subsystems are becoming commodity components, and other companies use them to build their own businesses (or other infrastructure!) on top.

When I started my career, I was a jack-of-all-trades systems person. I ran mail, web, db, DNS, cache, deploys, CI/CD, patched operating systems, built debs and rpms, etc, etc. Most engineers don’t do those things now, and neither do I. Why would I, when I can pay someone else to abstract those details away, so that I can spend my time focusing on delivering customer value?

Increasingly, as an industry, we are outsourcing any bits that we can.

As a more personal example, why would you want to run your own observability team or build your own in-house monitoring software, if that’s not your core business? Why split your focus to building a bespoke and unsustainable version of a thing when you can readily buy a world-class version? If my company has had ten or twenty full-time engineers working on that solution, how long will it be until your team of three or five can catch up?

In a post-cloud world, we’ve learned that it’s usually much better and far easier to buy than it is to build those things that don’t add business value.

How to Outsource Things Well

In my personal example, buying doesn’t mean that you shouldn’t have an observability team. It means that the observability team should turn their gaze inward. That team should take a page out of the SRE or test community’s books and focus on providing value for your org’s developers whenever they interact with this outsourced solution.

That team should write libraries, generate examples, and drive standardization; ushering in consistency, predictability, and usability. They should partner with internal teams to evaluate use cases. They should partner with your vendors as roadmap stakeholders. They might also write glue code and helper modules to connect disparate data sources and create cohesive visualizations. Basically, that team becomes an integration point between your organization and the outsourced work.

We already know from industry research that the key to success when outsourcing is to embed those off-prem contributions within cross-functional teams, which manage integrating that work back into the broader organization.

Monstrous amounts of engineering work create the stack that ships value to your customers. Trying to save work, some teams build complicated Rube Goldberg machines that are brutal to run, change, and debug. It’s much harder to build simple platforms with operable, intelligible components that provide a humane user experience. Bridging that gap requires quality operations engineering to streamline that outsourcing for successful user adoption.

That’s why even if you run no servers and have no infrastructure of your own, you still have operability and operations problems to contend with. Getting to the point where your org successfully has no infrastructure of its own takes a lot of world-class operations expertise. Staying there is even harder. Any jerk with a credit card can just go spin up a server you’re now responsible for. Try being any sort of roadblock and see how quickly that happens.

What This Means For Operationally Minded Engineers

The reality is that jack-of-all-trades systems infrastructure jobs are slowly vanishing: the world doesn’t need thousands of people who can expertly tune postfix, SpamAssassin and ClamAV — the world has Gmail. You might find your next job by following the trail of technologies you know, like getting hired as a MySQL expert. But technologies come and go, so you should think carefully before hitching your cart to any particular piece of software. What will this mean for your career?

The industry is bifurcating along an infrastructure fault line, and the long-held indistinguishability between infrastructure-oriented engineers and operationally-minded engineers is swiftly eroding. These are becoming two different roles and career paths at two different kinds of companies: infrastructure providers, and everyone else. Those of us who love optimizing, debugging, maintaining, and tackling weird systems problems far more than writing new greenfield code now have a choice to make: go deep and specialize in infrastructure, or go broad on operability.

If the mission of your company is to solve a category problem by providing infrastructure to the world, then operations will always be a core part of that mission: your company thrives by solving that particular operability problem better than anyone. So you are justified in going deep and specializing in it, and figuring out how to do it better and more efficiently than anyone else in the world — so that other people don’t have to. But know that even this infrastructure-heavy backend work also needs design, product management, and software engineering work — just like those non-infrastructure focused companies!

If your chosen company isn’t solving an infrastructure problem for the world, there are still loads of opportunities for ops generalists here too. But know that a core part of your job is critically examining the cycles your company devotes to infrastructure operations and finding effective ways to outsource or minimize their in-house developer cycles. Your job is not to go deep if there is any alternative.

I see operationally-minded engineers working cross-functionally with software development teams to help them grow in a few key areas: making outsourcing successful, speeding up time to value, and up-leveling their production chops.

They’re evolving very crude “build vs. buy” industry arguments (often based on little more than whimsical notions) into sophisticated understandings of how and when to leverage abstractions that radically accelerate development. They build and maintain the bridges that make outsourcing successful.

They’re evolving release engineering to fulfill the delivery part of CI/CD. Far too many teams are perfectly competent at writing software, yet perfectly remedial when it comes to shipping that software swiftly and safely.

They’re also up-leveling the production operational skills of software engineers by crafting on-call rotations, counseling teams on instrumentation, and teaching observability. As teams leave behind dated metrics and logs, they start using observability to dig themselves out of the ever-increasing massive hole where everyone constantly ships software they don’t understand to a production system they’ve never understood.

Everyone needs operational skills; even teams who don’t run any of their own infrastructure. Ops is the constellation of skills necessary for shipping software; it’s not optional. If you ship software, you have operations work that needs to be done. That work isn’t going away. It’s just moving up the stack and becoming more sophisticated, and you might not recognize it.

I look forward to the improved Lambda Serverless Musical chorus:

I’m going to improve your… ops.
Yes, I’m going to improve your… ops!

Read more about Honeycomb’s hiring methodology. P.S. We’re hiring!


A Next Step Beyond Test Driven Development

Charity Majors — Mon, 09 Nov 2020 17:08:26 +0000

The most successful software development movement of my lifetime is probably test-driven development or TDD. With TDD, requirements are turned into very specific test cases, then the code is improved so the tests pass. You know it, you probably use it; and this practice has helped our entire industry level up at code quality.

But it’s time to take a step beyond TDD in order to write better software that actually runs well in production. That step is observability driven development.

Using TDD to Drive Better Code

TDD has some powerful things going for it. It’s a very pure way of thinking about your software and the problems it’s trying to solve. TDD abstracts away the grimy circus of production and leaves you with deterministic, repeatable bits of code that you can run hundreds of times a day, giving you the warm, fuzzy assurance that your software continues to work today the same as it worked yesterday and the day before that. But that assurance quickly fades when you start considering whether having passing tests means that your users are actually having a good product experience. Do those passing tests mean that any errors and regressions can be crisply isolated and fixed before your code is released back into the wild?

TDD helps produce better code, but a fundamental limitation of TDD is exactly the thing that makes it most appealing. With TDD, your tests run in a hermetically sealed environment. Everything in that environment is erased and recreated from zero on each run: your data is dropped and seeded afresh, your storage and remote systems are empty mocks. There is no chaotic human element, only wave upon wave of precisely specified bots and mocks performing bounds checks, serializing and deserializing, and checking for expected results, again and again.

All of the interesting deviations that your code might encounter out in the wild have been excised from that environment. We remove those deviations in the interest of making your code testable and tractable. There are no charming surprises: the unexpected is violently unwelcome. Any deviation from the spec must be dealt with — immediately.

But just because something about the environment doesn’t go according to plan and gets excluded from TDD, that doesn’t mean it isn’t valuable. In fact, one might reasonably argue that those deviations are the most valuable parts of your system; the most interesting, valuable, and worthwhile things to surface, watch, stress, and test. Because it’s all of those things that are really going to shape how your software actually behaves when real people start interacting with it.

If this rings true to you, then you may be interested in another method of validating and gaining confidence in your code. I have been referring to that approach as “observability-driven development”, or ODD. That’s oh-dee-dee, because using real data obtained from operating your software in production to drive better code is an approach that no engineer should find odd.

Using Production to Drive Better Code

“But that’s not how it’s done! We have confidence in our tests!!!”

The tests in your code are still valuable. But there’s an additional step we need to take in order to extend our validation to encompass the reality of production. It requires shifting your mindset, developing a practice, and forming a habit.

Embrace failures. Instead of being afraid of failure and trying desperately to avoid it, try adopting a mindset of cheery fatalism. Everything will fail eventually, usually at the worst possible time, and in a way you failed to predict. The first step is admitting that you cannot possibly predict all the entertainingly disastrous ways that your precious code is going to fail in the real world. All the different scenarios you so painstakingly enumerated and wrote tests for are but grains of sand on a beach. Accepting this might take some time. Go on. I’ll wait.

Instrument as you go. Given that we can’t predict the future, the next step is to develop a practice that helps us better see that future as it starts to unfold. This is the practice of developing instrumentation as you go. The things you want to build might become fully broken, partially degraded, or end up in any number of unusual states — or depend on services that are. How will that non-optimal state affect other parts of the system and how might those failure modes manifest in novel ways?

Just as you wouldn’t accept a pull request without tests, you should never accept a pull request unless you can answer the question, “how will I know when this isn’t working?”
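One way to picture instrumenting as you go is the sketch below, which wraps a unit of work so that it always emits one structured event, whether it succeeds or fails. Everything here is an assumption for illustration: the `instrumented` helper, the `EMITTED` list standing in for a real telemetry backend, and the `resize_image` function are hypothetical names, not part of any library.

```python
import json
import time
from contextlib import contextmanager

# Stand-in sink (hypothetical); a real system would ship events to a backend.
EMITTED = []


@contextmanager
def instrumented(name, **fields):
    """Emit exactly one structured event per unit of work, on success or failure."""
    event = {"name": name, **fields}
    start = time.monotonic()
    try:
        yield event  # the caller attaches more context as it learns it
        event["ok"] = True
    except Exception as exc:
        event["ok"] = False
        event["error"] = type(exc).__name__
        raise  # instrumentation must never swallow the failure
    finally:
        event["duration_ms"] = round((time.monotonic() - start) * 1000, 3)
        EMITTED.append(json.dumps(event))


def resize_image(user_id, width):
    """Hypothetical feature code, written together with its instrumentation."""
    with instrumented("resize_image", user_id=user_id) as event:
        event["width"] = width
        if width <= 0:
            raise ValueError("width must be positive")
        return width * 2
```

With something like this in the pull request, "how will I know when this isn’t working?" has a concrete answer: look for `resize_image` events where `ok` is false, then break them down by `user_id` or `width`.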

Close the loop. The habit you then form is one of relentlessly circling back to check on your code once it has been released into the wild. It’s a habit of checking up on any code that has just been deployed through the lens of the instrumentation you just wrote. Is it working as intended? Are you sure? Does anything else look… weird? This should be as automatic as muscle memory. Your job is not done when you have merged to master. It is not done until you have watched it run in the wild, kicked the tires, and made sure it is working as intended.

This step, when followed regularly, will catch the overwhelming majority of problems in production before users can notice and before they’re big enough to trigger an alert. It also helps you catch those transient hard-to-find problems that will never cause big enough errors to trigger a monitoring alert. Plus it catches them at the optimum time: right after you’ve built it and shipped it, while your original intent is still warm and fresh in your mind, before it’s had the chance to decay or page out for all the other things competing for your attention throughout the day.

You need to follow that step so often that checking if your code is working as intended via instrumentation becomes muscle memory: it becomes a natural part of what happens every time you deploy code. It feels weird to not check how it’s running. You should have a nagging itch in the back of your mind that won’t simmer down until you close the loop on that deployment by checking to see how your code is doing in prod.
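Closing the loop can start as something very small. The sketch below assumes every event carries a `build_id` field and an `error` flag, and compares the build you just shipped against the one before it. The field names, the sample data, and the 5% tolerance are illustrative assumptions, not recommendations.

```python
def error_rate(events, build_id):
    """Fraction of events from one build that errored, or None if there are none yet."""
    rows = [event for event in events if event["build_id"] == build_id]
    if not rows:
        return None
    return sum(1 for event in rows if event["error"]) / len(rows)


def check_deploy(events, previous_build, new_build, tolerance=0.05):
    """Compare the build that just shipped against the one it replaced."""
    before = error_rate(events, previous_build)
    after = error_rate(events, new_build)
    if before is None or after is None:
        return "not enough data yet"
    if after > before + tolerance:
        return f"look closer: error rate {before:.0%} -> {after:.0%}"
    return f"looks healthy: error rate {before:.0%} -> {after:.0%}"


# Hypothetical events gathered shortly after deploying build b42 over b41.
SAMPLE = [
    {"build_id": "b41", "error": False},
    {"build_id": "b41", "error": False},
    {"build_id": "b41", "error": False},
    {"build_id": "b41", "error": False},
    {"build_id": "b42", "error": False},
    {"build_id": "b42", "error": True},
    {"build_id": "b42", "error": True},
    {"build_id": "b42", "error": False},
]

verdict = check_deploy(SAMPLE, "b41", "b42")
```

A check like this is no substitute for actually looking at your code in prod with your original intent in mind; it only makes the first glance cheap enough to become a habit.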

TDD + Prod = ODD

That’s what I’ve been calling Observability Driven Development. It’s the coding equivalent of wearing a headlamp to go for a hike in the darkness; anywhere you go, it lights up your feet on the path and two steps ahead of you.

With TDD, you rely on automated test suites to raise a hand and object if your code seems to be doing something wrong. All of the tests passed? That’s a green light! Your job is done when the branch is merged and tests have passed; that’s all the confidence you need to move on. Deploying that code is probably someone else’s job. Once it’s in prod, bugs will be surfaced by monitoring software (if you’re lucky) or unhappy users (if you’re not), and eventually make their way back to you or your team in the form of tasks or tickets.

This is a feedback loop that works, more or less, but it is long and slow and leaky. The person peering at your code in prod probably doesn’t know what they’re looking for or looking at, because they don’t have access to your original intent. By the time the bugs wend their way back to you — days, weeks, or months later — you too have probably forgotten a lot of relevant context.

With ODD, you’ve accepted that you can’t enumerate every failure, so you have far less confidence in the ability of any canned tests to surface behavioral anomalies. But you do have the greatest source of chaos and anomalies in the known universe to learn from: live users. Simply running your service with an open port to production invites chaos enough!

Your instrumentation doesn’t exist to serve a set of canned questions; it’s there to unlock your active, curious, novel exploration of the ways users are interacting with your systems: the beating heart of observability. If you make it a daily practice to engage with your code in prod, you will not only better serve your users, you will also hold your systems to a higher standard of cleanliness and understandability. You will develop keen technical instincts, you will write better code. You will be a better engineer.

Start going down the path of Observability Driven Development and follow your curiosity to wherever it leads.

Download our Guide to Achieving Observability and learn more about observability-driven development.


This post was originally featured on TheNewStack on 9 June 2020.

I'm Charity Majors, Ask Me Anything! [FINISHED]

Charity Majors — Wed, 21 Feb 2018 18:43:46 +0000

I am the cofounder and accidental CEO of honeycomb.io, where we are thinking hard about how to help you debug and understand the complex systems you have now && the even crazier systems you're going to have soon. I think a lot about observability and how to help software engineers own the code they write without losing their quality of life. Before this I was an engineering manager at Facebook, built systems at Parse and Second Life, and spent most of my time worrying about databases. I miss being on call.