DEV Community: Thomas Strömberg

The anatomy of a great playbook entry

Thomas Strömberg — Fri, 21 May 2021 00:07:50 +0000

What if you could easily reduce the length of outages by 3X?

According to the SRE book, "recording the best practices ahead of time in a playbook produces roughly a 3x improvement in MTTR". This improvement mirrors my experience with well-written playbooks.

So what makes a playbook entry "great"?

Philosophy

Remember how you felt in your first on-call rotation, when you were paged at 3am for a system you barely understood? Write your playbook entries for that person.

Playbooks should provide just enough context to confidently work through an incident, without providing extraneous content that will be a burden to keep up-to-date.

Be wary of playbooks that offer exact remediation steps: these are often a sign of sacrificing human blood to a system that should be automated.

Discovery

Alerts should always include the relevant playbook URL. Otherwise, you introduce the possibility of human error: the responder may follow the wrong playbook.

Consider including the alert name in the playbook URL to make it easier to find. This also allows the alert template to be templatized in some systems. For example: https://playbooks/%%ALERT_NAME%%
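
As a rough sketch of the same idea in Go, the snippet below builds a playbook link from an alert name. The base URL is the one from the example above; the playbookURL helper and the alert name DiskSpaceLow are made up for illustration, and most alerting systems have their own template syntax for this.

```go
package main

import (
	"fmt"
	"net/url"
)

// playbookBase mirrors the example URL above; adjust it for your own site.
const playbookBase = "https://playbooks/"

// playbookURL is a hypothetical helper that maps an alert name to its
// playbook entry. PathEscape keeps spaces and slashes in alert names from
// breaking the link.
func playbookURL(alertName string) string {
	return playbookBase + url.PathEscape(alertName)
}

func main() {
	// "DiskSpaceLow" is an invented alert name.
	fmt.Println(playbookURL("DiskSpaceLow"))
}
```

Escaping the name matters mostly when alert names are free-form; if your alert names are already URL-safe, plain concatenation does the same job.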

Structure

Playbooks are easiest to scan in an emergency when they have a consistent structure. The best structure depends on your organization and your team's culture, but this is what has worked for me:

Severity: How to assess the criticality of this alert from your team's point of view. Is it a slow-burning issue that generates tickets, a critical paging event, or does the severity depend on the duration?
Impact: How are your customers impacted by this alert? Often a one-liner, for example: "None immediately. If ignored, may result in revenue-impacting customer provisioning failures due to resource exhaustion"
Metrics: 1-2 graphs showing the impact, duration, and whether the effect is worsening. Inline live-updating graphs work best, as they can prevent the incident responder from making unnecessary changes when the problem is dissipating. Hyperlinks are nearly as good.
Background: What should a new person on the on-call rotation know about this system? Be terse, providing a hyperlink for more information and/or an architectural diagram. To reduce maintenance burden and cognitive load during incident response, share this section between multiple playbook entries via templating.
Mitigation: What are the recommended steps to mitigate the issue? This is often in checklist-style and may include steps for rolling back or redirecting traffic.
Debugging: How should one get started digging into why this alert is firing? For example:

 1. Check for recent fatal error messages:
 2. Check the cluster for free disk space:
 3. Check <url> to see when the last release went out

References: Links to the alert configuration, or code that generates the metric used by the alert, can be useful in understanding the underlying behavior. Post-mortems can also be valuable.

Formatting

Be concise
Bulleted or numbered lists instead of paragraphs.

The Kubernetes Documentation Style Guide has great recommendations for technical documentation, but the most important for playbooks is: make your commands trivial to copy and paste.

Do not include the command prompt.
- See: data loss due to > character in prompt
Separate commands from example output
Do not include real but unrelated host, site, or cluster names in your example command.
- I once saw an outage spread when a responder copied an example command with the intent to edit the hostnames before pressing enter. They pressed enter first.

Maintenance

Keep playbooks up to date by:

Regularly scheduled "Wheel of Misfortune" role-playing game sessions, where the previous on-call engineer walks the current on-call engineer through a pager response scenario.
Post-mortem action items that suggest playbook updates to decrease the resolution time for future pages for the same alert.

Big-bang efforts such as auditing all of the playbooks for relevance are best made once initially, to get the playbooks into the same structure. I have never seen quarterly playbook reviews work.

Special thanks to Joseph Bironas for editorial feedback and ideas for this article.

Motivating Software Engineering Teams

Thomas Strömberg — Tue, 09 Mar 2021 22:52:54 +0000

Empathy, purpose, craftsmanship.

Empathy

The key to motivating a team is to identify what motivates the people that make up the team, with enough empathy to put yourself in their shoes.

Everyone wants to be happy, but everyone has their unique path to happiness. Learning the career and life goals of everyone on the team allows you to prime the right tasks for them at the right time.

Asking people directly, "What motivates you as a software engineer?" will often unlock the right set of hints for how to frame messages in a way that works for them. However, to get an honest and complete answer, one needs to build rapport.

Building rapport

Building rapport with the team is critical to establishing the psychological safety required for folks to feel comfortable sharing their real thoughts. My advice is to model and build a real personal connection that supersedes the synthetic relationship of the manager to the direct report.

My strategy for building connection is: demonstrating care, showing vulnerability, and deep listening. Proving that you care is something that you cannot fake. If you cannot care deeply for each person on your team, you will fail to motivate your team in the long term.

A technique I have used successfully is to declare to each person my focus: it's them, and their long-term career. As a manager, the people you inspire are the legacy you will leave behind. This advice may seem antithetical to most business guidance, but if you genuinely care for your team, you will be in a better place to inspire them. After all, as humans, we are more important than the companies in which we serve.

It is important to recognize that the average tenure at a tech company is three years, so your manager/report relationship will last on average only a year and a half, or about 5% of their career. As a manager trying to build a high-functioning team, you should focus on making the most of this overlap to set them up for the other 95% of their career. Letting your direct report know that you are on their side and in it for the long-haul will build the rapport necessary for candor.

Life stories make honest goals approachable

To build mutual respect, candor, and empathy, I like to take time in an early 1:1 to model it through sharing my life story, career journey, goals, and missteps. Next, it's your turn to listen intently to your report's own life story, noting that they may not be ready to share all of it. What you hear though, may surprise and shock you. Life isn't always sunshine and roses.

For subsequent 1:1's, I ask for the report to record their goals, both within the company and outside of it. These goals will be kept at the top of our 1:1 notes to always stay fresh within our minds when we meet. Mine reads:

Career: make large-scale computing trivial for the world
to use in a sustainable manner.

Non-career: Accelerate human knowledge. Be a great father
and spouse.

With time, you should achieve the empathy and respect necessary to understand each other's personal and professional motivation.

Purpose

Many define successful management as the ability to get the most out of the team. Now that you have learned about each individual's motivation, it is time to find the common thread that binds the team together. This shared context will help you build a plan to harmoniously fit everyone's local maximum (personal motivation) into the global maximum (personal+team+company motivation).

To discover this thread, I recommend first sharing the mission and value statements of other teams, and then letting your team define their own statement of values. These values should be ideals or traits that your team cannot live without. Once defined, your team is ready to create a mission statement. A mission statement should:

Align with the mission of the group above, if applicable
Clarify who your team serves
Be worded precisely enough to not apply to other teams

Depending on the team size, expect that this may take three different 45-minute meetings to brainstorm and refine it. Your team may resist the idea of brainstorming a mission at first, but once they learn what everyone agrees is essential, they will appreciate the clarity.

For suggestions on how to run a successful exploration of mission statement & values, I highly recommend reading Tribal Leadership: Leveraging Natural Groups to Build a Thriving Organization by Logan, King, Fischer-Wright.

Craftsmanship

Software engineers at their heart are craftspeople. As with any other craft trade, software engineers intrinsically want to be proud of what they create. Craftsmanship is an important trait to cultivate as it paves the way for:

High morale
Product excellence
Prevention of technical debt

Even when one is racing the revenue clock, deprioritizing craftsmanship is never the answer. If you push engineers toward deadline-driven development, you may very well be causing them to lower the code quality bar. If not addressed immediately, the decreased code quality generates a vicious circle of low morale, low velocity, high turnover, and an ever-increasing pile of technical debt.

If you encourage your team to build artifacts (code, design docs) that they can be proud of throughout their career, you will no longer have to worry much about motivation.

For more on craftsmanship in software, I recommend The Software Craftsman: Professionalism, Pragmatism, Pride by Sandro Mancuso.

Tesla Model Y: family camping in below-zero temperatures

Thomas Strömberg — Mon, 08 Mar 2021 04:42:07 +0000

TL;DR: If you think outside of the box, you can sleep 4 with full climate control for a cost of 10-12% battery usage per night, even in freezing temperatures.

When we bought our Tesla Model Y in December, the intent was to roughly recreate the experiences of our trusty 1971 VW Bus, but in electric form.

With all the cargo space, and the availability of great mattress options, such as the Tesmat, one can get pretty close to a passenger-van conversion. What one cannot simulate easily however, is the 4-person sleeping experience one can get from a pop-up tent. The Tesla Model Y can only comfortably fit 2 adults, or 1 adult & two children.

The best option I've found for sleeping 4 is the Napier SUV Sportz Tent. This gargantuan tent attaches easily to the back of a Tesla Model Y, but it is a bit heavy and bulky to pack up. The biggest plus side to this option is that you can still run Camp Mode on the Tesla Model Y to heat the tent, which is something we tested this weekend.

Using Camp Mode to climate control the Napier tent is a party trick that is mostly only useful in extreme temperatures. In my experience, Camp Mode on the Tesla Model Y consumes ~10% battery life when set to 67'F with a 40'F exterior temperature, and the trunk closed.

My primary concern was power consumption due to the lack of insulation: tents are typically nowhere near as insulated as a Tesla. How much power would we waste by adding the tent?

To counter the insulation issue, I ordered 3mm 48"x50Ft, and a single roll of foil tape, to build a ~5'x5'x5' cube within the Napier SUV Tent, lovingly called the "space station":

The foil tape ran out quite quickly, so I finished it up with some painters tape.

On the first night of testing, we arrived at 10pm with 68% battery, while it was raining cats & dogs at a chilly 35'F. I was so busy trying to set up without getting soaked that I didn't bother sealing the gaps much.

I set the Tesla to Camp Mode @ 67'F, and used a screwdriver to flip the trunk latch to trick the car into thinking the trunk was shut. I had heard that otherwise, the climate control would shut off after 30 minutes. Our battery went from 67% to 47%: a 20% drop over 10 hours, which had me a bit worried for the second night.

By this point, we had acquired 2 extra rolls of foil tape (still not enough), and used some duct-tape to seal the gaps between the car and the space station. I was more conservative this time, setting the car to 60'F, setting the vent manually to 1. After 7 hours, we consumed only 5% battery, so I increased the heat to 66'F for the next 4 hours, which consumed another 5%. Not bad when the exterior temperature was 30'F!

One recommendation is a stuff sack, as packing the inner tent can consume a lot of space in the car otherwise (a little bit bigger than the packed size of the Napier SUV tent). Here's a photo of the inner tent before I rolled it up:

When we left the campsite, we were at 33% battery, which was just enough to get us to the supercharger in Ukiah. My backup plan was a slow charger just up the road in Lakeport, which I still tried for fun.

This experiment means that the insulated Napier tent is effectively as efficient, insulation-wise, as the Tesla itself: at worst, 10-20% more power consumption over the regular Camp Mode with the trunk shut. Depending on your target temperature, plan on 10-12% battery consumption per night (11 hours) with Camp Mode in the Tesla Model Y.

This result means that the Tesla Y can sleep 4 people in freezing temperatures for multiple nights in a row without a source of electricity. I still have a few gaps to fill with foil tape, and need to bring painters tape on the next trip to quickly seal it against the car.

Hope this post helps someone!

Go & secondary groups: a kaniko adventure!

Thomas Strömberg — Fri, 26 Feb 2021 02:21:37 +0000

Originally posted on my personal blog in April 2020

I wanted to get my feet wet with understanding Kaniko,
an open-source in-cluster builder for Docker images. I happen to work with one of the maintainers, Tejal,
and I asked her if there were any UNIX-internals sorts of bugs that might be interesting.

Here's the mystery issue: "The USER command does not set the correct gids, so extra groups are dropped". Here's an example to reproduce it:

FROM ubuntu:latest
RUN groupadd -g 20000 bar
RUN groupadd -g 10000 foo
RUN useradd -c "Foo user" -u 10000 -g 10000 -G bar -m foo
RUN id foo
USER foo
RUN id

In an ideal world, both "id" commands should give the same output, but the second one did not include foo's membership in bar. This definitely sounded
like a secondary group issue. I happened to know that secondary groups were bolted onto the UNIX implementation some 10 years later than primary groups (SVR4, by way of BSD).

How to reproduce

First, get a shell into the Kaniko debug image, mounting in the out/ and integration/ subdirectory:

docker run -it --entrypoint /busybox/sh -v "$HOME"/.config/gcloud:/root/.config/gcloud -v $(pwd)/integration:/workspace -v $(pwd)/out:/out gcr.io/kaniko-project/executor:debug

I placed their Dockerfile into kaniko/integration/1097, which was mounted as /workspace. I could then trivially reproduce their case using kaniko:

/kaniko/executor -f 1097 --context=dir://workspace --destination=gcr.io/kaniko/test --tarPath=/tmp/image.tar --no-push

Finding the culprit

The first question was: how does Kaniko implement user switching? Are they switching in such a way that populates secondary groups? I ask because the standard syscalls (seteuid, setegid) do not implement secondary groups: one has to instead call setgroups. Here's what I found:

cmd.SysProcAttr.Credential = &syscall.Credential{Uid: uid, Gid: gid}

[SysProcAttr](https://golang.org/pkg/syscall/#SysProcAttr) is not exactly a well-known feature in Go, but it's perfect for setting exec attributes such as:

Chroot - lock the process into a directory
Pdeathsig - Signal that the process will get when its parent dies (Linux only)
... and many options for user namespacing: handy for container tools.

So, I figured it would be easy enough to improve the function so that it performs secondary group impersonation. The challenge for you, dear reader, is to find the flaw!

func impersonate(userStr string) (*syscall.Credential, error) {
   ...
   groups := []uint32{}
   gidStr, err := u.GroupIds()
   logrus.Infof("groupstr: %s", gidStr)

   for _, g := range gidStr {
       i, err := strconv.ParseUint(g, 10, 32)
       if err != nil {
           return nil, errors.Wrap(err, "parseuint")
       }
       groups = append(groups, uint32(i))
   }

   return &syscall.Credential{
       Uid:    uid,
       Gid:    gid,
       Groups: groups,
   }, nil
}

After running make, I hop back into the container to run the repro case, and I'm perplexed by the log message:

INFO[0013] u.GroupIds returned: []

Is kaniko running in some alternate chroot universe where it can't see /etc/group? I double check by adding a shell command:

out, err = exec.Command("grep", "foo", "/etc/group").Output()

The answer is no. At this point, there are only two options in my mind. Either this is a Go bug, or, if Go is using libc to make this call (likely),
it's a libc bug, or at least a disagreement between the two systems. As soon as you have made the decision to blame the compiler, it's time to gather evidence, typically by making a simpler test case. I opt first to investigate if Go is using libc to look up the list of secondary groups, starting with:

os/user/listgroups_unix.go

A couple of nested functions later, and you can see that it's calling:

static int mygetgrouplist(const char* user, gid_t group, gid_t* groups, int* ngroups) {
  return getgrouplist(user, group, groups, ngroups);
}

This is almost the same implementation you see in busybox's id command source

static int get_groups(const char *username, gid_t rgid, gid_t *groups, int *n)
{
  int m;
   if (username) {
     m = getgrouplist(username, rgid, groups, n);
     return m;
   }

Now, it's possible that Go is setting ngroups to 0, so we just build a little test case program:

func main() {
    u, err := user.Lookup(os.Args[1])
    if err != nil {
     panic(fmt.Sprintf("lookup failed: %v", err))
    }

The test program runs great on macOS, but when I use xgo to cross-compile it for Linux, all it outputs is:

-rwxr-xr-x    1 0        0          2125099 Mar 28 20:26 ggroups-linux-amd64

# ./ggroups-linux-amd64
/busybox/sh: ./ggroups-linux-amd64: not found

If you ever see this error in UNIX, it usually means one of three things:

The program specifies an invalid #! line
The binary needs a shared library that does not exist
The binary is for the wrong architecture

In this case, I suspected #2, because I see that busybox is in use, chances are pretty high that this Docker image lacks libc. This environment
does not have ldd, but it has strings, so I can get some hints about the binary that was built:

strings /out/ggroups-linux-amd64  | head
bhkFaBAPgWy3KAp2RQcd/llKGprZSMM7cCxIzwmJ9/0QgnPM9q9pk--9IIyIXn/X9bTurj9MBmKtnVL-ANT
/lib64/ld-linux-x86-64.so.2
ATUSH
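
In an environment that lacks strings as well, the same hint can be read straight from the ELF program headers. This is a minimal sketch using Go's debug/elf package; the interpreter helper is a hypothetical name and not part of kaniko.

```go
package main

import (
	"debug/elf"
	"fmt"
	"io"
	"os"
	"strings"
)

// interpreter returns the dynamic loader path recorded in the binary's
// PT_INTERP program header, or "" if the binary does not request one.
func interpreter(path string) (string, error) {
	f, err := elf.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	for _, p := range f.Progs {
		if p.Type != elf.PT_INTERP {
			continue
		}
		data, err := io.ReadAll(p.Open())
		if err != nil {
			return "", err
		}
		// The segment holds a NUL-terminated path.
		return strings.TrimRight(string(data), "\x00"), nil
	}
	return "", nil
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: interp <binary>")
		os.Exit(2)
	}
	interp, err := interpreter(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if interp == "" {
		fmt.Println("no interpreter requested (likely statically linked)")
		return
	}
	// Report whether the requested loader exists on this system.
	_, statErr := os.Stat(interp)
	fmt.Printf("%s (present: %v)\n", interp, statErr == nil)
}
```

Built with CGO_ENABLED=0, a tool like this has no loader dependency of its own, so it can run inside the same minimal image it is inspecting.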

It looks like the right architecture, but yeah, that library doesn't exist. Just to confirm my sanity, I confirmed this program works great in an ubuntu container. I immediately suspect that either kaniko's user environment is trash, or kaniko is up to shenanigans in their Makefile. The latter is easier to check, and it doesn't take long to notice:

out/executor: $(GO_FILES)
    GOARCH=$(GOARCH) GOOS=linux CGO_ENABLED=0 go build -ldflags \
       $(GO_LDFLAGS) -o $@ $(EXECUTOR_PACKAGE)

God damnit. kaniko works because they disable cgo to work around the lack of a libc environment. Look back at listgroups_unix.go - it uses C code, and the build rule specifically states only to build with cgo. If we look at the fallback implementation, we see:

func listGroups(*User) ([]string, error) {
    if runtime.GOOS == "android" || runtime.GOOS == "aix" {
        return nil, fmt.Errorf("user: GroupIds not implemented on %s", runtime.GOOS)
    }
    return nil, errors.New("user: GroupIds requires cgo")
}

But wait - we didn't see an error in our impersonate function! I try to compile it without cgo:

env CGO_ENABLED=0 go run ggroups.go root
panic: groupids failed: user: GroupIds requires cgo

goroutine 1 [running]:
main.main()
    /Users/tstromberg/src/ggroups/ggroups.go:18 +0x117
exit status 2

The mystery deepens

If you see an error in one environment, and not the other, chances are either:

A compiler error
A kernel error
You forgot to check the error code.

It's almost always the last option. Sure enough:

   gidStr, err := u.GroupIds()
   logrus.Infof("groupstr: %s", gidStr)

As soon as I noticed this, I walked away from my computer for an hour. I suggest you do the same.

Persistent multi-user Docker on macOS

Thomas Strömberg — Mon, 01 Feb 2021 20:04:10 +0000

First, be aware that docker is not designed to be securely shared among multiple users. Please assume that anyone who has access to docker is effectively equivalent to `root'.

This guide assumes that users will interact with docker via the command line rather than graphically. It also assumes an environment where a single user can be automatically logged in via the GUI, though this is mostly out of laziness rather than an underlying technical restriction.

Choose an account that Docker Desktop will run as. I recommend creating a docker user, but it could be any account. This account does not need admin access.
Open Settings -> Users & Groups -> Login Options, and ensure that this user is automatically logged in.
Create a shared containers directory:

sudo mkdir -p /Users/Shared/Library/Containers
sudo chown docker:staff /Users/Shared/Library/Containers
sudo chmod -R 770 /Users/Shared/Library/Containers/

Login graphically with the account that will run Docker and start /Applications/Docker.app, answer any questions it might have.
Open Settings -> Users & Groups -> Login Items, and drag the Docker app to it.
Quit Docker Desktop via the menu item
Open Terminal and move your Docker data to a shared location that can be written to by other users:

mv ~/Library/Containers/com.docker.docker /Users/Shared/Library/Containers
chmod -R 770 /Users/Shared/Library/Containers/com.docker.docker
chmod -R +a "group:staff allow list,add_file,search,add_subdirectory,delete_child,readattr,writeattr,readextattr,writeextattr,readsecurity,file_inherit,directory_inherit" /Users/Shared/Library/Containers/com.docker.docker
chmod -R g+rw /Users/Shared/Library/Containers/com.docker.docker/Data

Then link your local Docker data to this shared source, and make sure that others can traverse into this folder to resolve the socket symlink:

ln -s /Users/Shared/Library/Containers/com.docker.docker ~/Library/Containers/com.docker.docker
chmod g+x ~/Library ~/Library/Containers

Restart /Applications/Docker.app to test
SSH into the host as another username, and run docker run mariadb to test.
Reboot host and reconnect via ssh to test (it may take a moment for Docker to start up)

This is the configuration we use for the #kernelcafe. Please add your improvements to the comments!

kernel café, toward alpha 1

Thomas Strömberg — Fri, 22 Jan 2021 05:22:32 +0000

I spent some time today in the data-hall, getting lighting set up, as well as a console. This weekend I aim to get the rack-boards installed on the walls.

Here are the alpha milestones I aim to reach before the service becomes initially shareable:

Week 1: Initial network setup, first node with SSH cert sync
Week 2: Tinkerbell setup and second node: Honeycomb LX2, installed via Tinkerbell. Trusted testers.
Week 3: Third node: Mac Mini
Week 4: Nodes 4 & 5, Public Kubernetes Cluster
Week 5: Segregate Firewall, resource controls

Beta!

For the SSH cert synchronization setup, I'm considering basing it on:

periblos - YAML to GitHub Org sync
sync-ssh-keys - GitHub Org to SSH keys

The missing component though is GitHub Org to UNIX users and groups, which should be easy enough to solve.

Supply vs Demand

Thomas Strömberg — Wed, 20 Jan 2021 01:38:36 +0000

When running a public service as a free-time effort, it is critical to consider supply & demand as early on in the process as feasible.

For most folks:

'supply' is the resources you can sink into the service
'demand' is what the public wants out of this service

If you structure your service in such a way that demand immediately outstrips supply, you may have to quickly pivot in a manner that is disruptive to your users.

Conversely, building a service that no one wants is a waste of your time entirely.

Addressing demand

For the kernel café, I'll focus initially on solving the use cases where I've seen demonstrable demand by open-source developers:

Access to mixed-architecture Kubernetes clusters
Interactive access to arm64 (Linux, macOS)

To reduce demand, I am considering the following limitations:

IPv6-only
Public data only
1-hour CPU limit
Unprivileged access
Subject to contribution of time or money

Addressing supply

Maintaining open infrastructure is incredibly time consuming, and potentially expensive.

Time

CNCF Projects, such as https://tinkerbell.org, make it substantially cheaper to work with physical infrastructure, by allowing it to be managed as cattle rather than pets. Treating the hosts as ephemeral, where they are reinstalled each reboot, goes a long way to addressing long-term maintainability of nodes. In situations where downtime is acceptable or encouraged, requiring that each node is rebooted on a schedule (weekly) also helps.

Money

Setting up a janky datacenter is not expensive, but does require capital outlay.

With my limited experience, the more worrying expense is that of power and cooling. Thankfully, both issues can be addressed by speccing out equipment that consumes no more power than necessary.

I intend to build the backbone of the cluster with Raspberry Pi 4's, which seem to be unusually cost-effective, even if there are I/O performance limitations. They only consume ~4W at idle.

I have 4x RockPro64's that I can throw into the mix, but they are a bit more exotic for a build-out, and are limited to 8GB. Similar power consumption properties. Interactive users will likely use a Honeycomb LX2 for now.

For x86 support (a necessary evil for platforms like Fuchsia), it is difficult to get similar power consumption numbers. Intel NUC's seem to enjoy the best balance of support and power consumption, even if AMD-based solutions trounce them in performance. Before I begin to acquire NUC's, I'll first need to collect data on consumption vs performance.

To be continued ...

Dreaming of a Kernel Cafe

Thomas Strömberg — Wed, 20 Jan 2021 00:35:36 +0000

Remember the 1990's?

When folks blessed with bandwidth set up shell servers, and invited their friends to share in the bounty?

I just moved into a new house, and discovered that I was in this scenario again, for the first time in 20 years.

Today, open-source developers need bandwidth less than access to dev & test environments that differ from their own. What if I combined the two ideas together to create something new?

Thus was born the idea of the Kernel Cafe: Public Cloud Native Infrastructure, run by volunteers.

It's not yet clear to me whether this idea has legs, but you can follow along here to watch.

Another puzzle to solve ...

Thomas Strömberg — Sat, 05 Dec 2020 17:10:12 +0000

I got into computing because I admired how computers could connect people from a variety of backgrounds, and loved solving the puzzles they introduced.

My latest hack: http://github.com/tstromberg/campwiz