How a CrashLoopBackOff turned into my first Microsoft OSS contribution

#opensource #kubernetes #ebpf #observability

Views expressed are my own and do not represent my employer.

I spent almost four years at Microsoft working on Azure Firewall. These days I work at another tech company. And a few months ago, I shipped my first commit to a Microsoft open-source project — from the outside.

That contrast is still funny to me. So I figured it was worth writing down.

Why Retina specifically

Retina is Microsoft's eBPF-based network observability platform for Kubernetes. It launched in 2024 and gives you flow-level visibility into pods, drops, DNS, the works — without sidecars, all on top of eBPF.

I'd spent years inside Microsoft on Azure networking, so when Retina dropped I immediately got why it mattered. Cluster operators have wanted this for a long time. The fact that Microsoft was building it in the open, with a real community around it, was the part that pulled me in.

So one evening in February 2026, I sat down with my Mac, kind, and the standard Retina Helm chart — just to play with it.

The first contribution wasn't planned

I wasn't trying to contribute. I was trying to get a local POC working.

The retina-agent pods kept crashing on startup and ended up in CrashLoopBackOff. The logs pointed straight at pluginmanager.go:119:

panic: Error running controller manager
error="pluginmanager requires a positive MetricsInterval in its config"

The "Standard" Helm chart didn't set metricsInterval. The Go code didn't have a default for it. So the agent panicked before it could do anything useful. Classic "works for the maintainers, breaks for the new person trying it for the first time" bug — the kind I've shipped plenty of myself.

So I filed issue #2029, said I'd be happy to take a stab at the fix, and a few weeks later opened PR #2084. The fix itself was small: default MetricsInterval to 10 seconds when it's missing or zero, log a warning, move on.
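To make the shape of that fix concrete, here is a minimal sketch in Go. It is not the merged code: Config, applyDefaults, and defaultMetricsInterval are hypothetical names I'm using for illustration, and treating a negative interval the same as a missing one is my assumption, based on the error message asking for a positive value.

```go
package main

import (
	"log"
	"time"
)

// defaultMetricsInterval mirrors the 10-second fallback described above.
const defaultMetricsInterval = 10 * time.Second

// Config is a hypothetical stand-in for the agent's configuration struct;
// only the field relevant to this sketch is included.
type Config struct {
	MetricsInterval time.Duration
}

// applyDefaults returns a copy of cfg with a usable MetricsInterval.
// A zero value means the field was never set; a negative value is
// treated the same way (an assumption made for this sketch).
func applyDefaults(cfg Config) Config {
	if cfg.MetricsInterval <= 0 {
		log.Printf("MetricsInterval missing or invalid (%v); defaulting to %v",
			cfg.MetricsInterval, defaultMetricsInterval)
		cfg.MetricsInterval = defaultMetricsInterval
	}
	return cfg
}

func main() {
	// A config as it would look when the chart omits metricsInterval.
	cfg := applyDefaults(Config{})
	log.Printf("effective MetricsInterval: %v", cfg.MetricsInterval)
}
```

The point of the pattern is that a missing setting degrades to a warning and a sane value, so a first-time user gets a running agent, not a panic.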

What surprised me about the process

A few things I wasn't expecting.

The CLA was easier than I'd built it up to be.
A bot pinged the PR, I commented @microsoft-github-policy-service agree, and it was done. That was it. I'd been mentally putting it off for weeks because I assumed it would be paperwork-heavy.

The first review made the fix bigger, not smaller.
My first cut put the guard inside pkg/plugin/dropreason — because that's where I'd seen it crash. The maintainer, @nddq, came back with something like: "The approach is sound, just move the fix so it applies to all plugins. And while you're at it, add the default in the Helm values too."

That review comment was probably the most valuable thing I got out of the whole experience. I'd been thinking like a person fixing one bug. They were thinking like a person who owns the project. The right place for the guard wasn't the plugin — it was the plugin manager. So one narrow crash fix became a fix in pkg/managers/pluginmanager plus a sensible default in the Helm chart. Strictly better outcome.

Nobody cared that I worked elsewhere.
I'd quietly worried about this. Engineer from another tech company contributing to a Microsoft project, etc. In practice: nobody mentioned it, nobody asked, nobody seemed to care. The review was about the code. That's how it should be, and I think it's worth saying out loud — because if you're sitting somewhere wondering whether your employer name on your GitHub profile will be weird, it almost certainly won't be.

A few force-pushes, four commits, twenty-nine green checks, and on March 20, 2026, it merged.

Where the rabbit hole led

I thought that was the end of the story. It wasn't.

Sitting with that fix made me actually dig into how Retina works under the hood — the eBPF programs, the plugin lifecycle, how it plumbs flows up to Prometheus. That ended up turning into a CloudNativeNow piece on why Kubernetes networking is still a black box, a Conf42 talk comparing Retina with Cilium on AWS EKS, and a couple more things in the pipeline — a PlatformCon talk in June and a Springer paper in July. I'll write those up properly once they're out.

None of that was the plan when I filed a small bug report from my couch. And that's kind of the point of this post.

How you can actually do this

If you're sitting on the fence about contributing to an OSS project, here's what I'd tell my pre-#2029 self.

1. Contribute to something you actually use.
The only reason I found that bug was that I was running Retina locally for a POC. I wasn't trawling issue trackers looking for a project to adopt. The path is: use it → hit a rough edge → fix it. Not the other way around. If you don't currently use any OSS project deeply enough to find a rough edge, that's the first thing to fix — not the contribution.

2. File the issue before the PR.
I filed #2029 first — clean bug report, reproduction steps, and one sentence at the bottom: "I am interested in contributing a fix." That signals intent without claiming the work, and it gives maintainers a chance to redirect you before you've written a single line. By the time I opened the PR a few weeks later, the conversation already had context.

3. The CLA is not the wall you think it is.
For Microsoft OSS, you comment @microsoft-github-policy-service agree once. That's it. I'd been mentally putting off the contribution for weeks because I assumed it would be a paperwork ordeal. It was thirty seconds.

4. Sign your commits and use the right git identity.
Microsoft repos require DCO sign-off (git commit -s) and most maintainers also appreciate signed commits (git commit -S). If your day job uses a different git identity than your personal OSS work, set user.name and user.email per repo, not globally. That saves you the awkward cleanup later of a PR authored from your work email.

5. Welcome the scope-up review.
The most useful review pattern I've seen, and the one that happened to me, is the maintainer broadening the fix. Don't argue with it. They have load-bearing context you don't. My one-plugin guard turned into a plugin-manager fix plus a Helm default, and the merged version is strictly better than what I opened with. Force-pushing a few times to get there is fine. Maintainers expect it.

6. Take the "good first issue" label literally.
There are unassigned issues labeled "good first issue" in microsoft/retina right now that have been waiting for an owner for months. Some are docs. Some are small CRD edge cases. Some are bigger refactors that have been split into chunks. Comment on the one you want, ask to be assigned, ship. The maintainers want you to succeed. That's literally why the label exists.

Closing thought

If you run a Kubernetes cluster and you've ever wished you could just see what packets are doing, give microsoft/retina a try. And if you find a rough edge, please file the issue — somebody like me, a few months from now, might want exactly that to start with.
