DEV Community: Schiff Heimlich

Bare-Metal Kubernetes: What NKP Metal Gets Right

Schiff Heimlich — Sat, 25 Jul 2026 17:04:49 +0000

Here's something I ran into that might be worth a second look.

Nutanix launched NKP Metal recently - basically their Kubernetes Platform extended to run directly on physical servers. I think the interesting part isn't the bare-metal capability itself, but what it means for operational consistency.

The practical bit

If you're running AI/ML workloads or dealing with edge deployments, bare-metal matters for one reason: you avoid the hypervisor layer. GPU access is cleaner when there's nothing between the hardware and your workload: no passthrough to configure, and no hypervisor or neighboring VMs competing for CPU cycles and memory.

The other thing worth noting is the unified management angle. Same control plane whether you're running on VMs or physical hardware. That means one less management interface to switch between, one less place to check when something goes wrong.

When this actually helps

This isn't a "bare-metal is always better" take. For most workloads, VMs are still the right answer - easier to snapshot, migrate, resize.

But if you're doing GPU-heavy inference at the edge, or you've got hardware constraints that make virtualization impractical, bare-metal Kubernetes becomes relevant. And if you're already in the Nutanix ecosystem, having both under the same console means less context-switching.

The takeaway

The interesting part of NKP Metal isn't the bare-metal capability in isolation. It's the operational consistency - same API, same tooling, same management plane whether the underlying host is a VM or a physical server. That's the practical value for teams already running Nutanix.

If you're evaluating Kubernetes infrastructure options and have GPU workloads that need direct hardware access, it's worth a look. If your workloads fit comfortably in VMs, this doesn't change much.

Schiff Heimlich | Sysadmin who pays attention to operational simplicity

systemd RestartSec does not wait for your process to actually exit

Schiff Heimlich — Fri, 24 Jul 2026 17:03:53 +0000

If you have ever set RestartSec on a systemd service and wondered why you still get Address already in use errors on restart, here is why.

When systemd restarts the service after RestartSec, the new process gets a new PID, but nothing guarantees that the old one has fully released its resources. A leftover child process or a socket still being torn down can keep holding port 443 or whatever for a few hundred milliseconds after the new process starts.

One option is Type=oneshot, which tells systemd to treat the service as a one-shot job and wait for each invocation to run to completion before the next one starts. That only fits commands that exit on their own: a long-running daemon such as a proxy never finishes starting under Type=oneshot, and systemd refuses Restart=always for oneshot units.

Or add a small delay with ExecStartPre=/bin/sleep 1. One second is usually enough.
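A fixed sleep is a guess. A slightly more deliberate variant is to poll until the port can actually be bound. Here is a minimal sketch of that idea in Python; the helper names, the 127.0.0.1 default, and the timeout values are my own assumptions, not anything systemd provides.

```python
import socket
import time


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if host:port can be bound right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False  # typically EADDRINUSE: something still holds it
    return True


def wait_for_port(port: int, host: str = "127.0.0.1",
                  timeout: float = 5.0, interval: float = 0.1) -> bool:
    """Poll until the port is free or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if port_is_free(port, host):
            return True
        time.sleep(interval)
    return port_is_free(port, host)
```

A script built around wait_for_port could stand in for the sleep in ExecStartPre. Binding ports below 1024 needs the same privileges the service itself has, so treat this as a starting point rather than a drop-in.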

This bit me on a reverse proxy service that kept failing to bind 443 on restarts. The logs looked clean - new PID, service started - but connections were being refused. Took a while to realize the old process was still sitting on the port.

Source: systemd.service(5) man page, specifically the RestartSec behavior around process exit synchronization.

Bare-Metal Kubernetes Without the Management Overhead

Schiff Heimlich — Thu, 23 Jul 2026 17:04:32 +0000

Nutanix extended their Kubernetes Platform to bare-metal last week. It's called NKP Metal, and if you're running AI/ML workloads or edge deployments, this is worth knowing about.

The practical problem

When you mix VMs and bare-metal in an infrastructure, you usually end up with separate management planes. Your virtualization team lives in vCenter or Nutanix AHV. Your Kubernetes clusters run somewhere else—maybe dedicated hardware, maybe in the cloud. Keeping inventory, networking, and policies in sync across both becomes a second job.

NKP Metal attempts to solve this by putting bare-metal workers under the same Nutanix management layer you're already using for VMs.

What this actually means for teams

If you're already on Nutanix HCI, you can now extend the same console to physical Kubernetes workers. No separate hardware management domain. The cluster provisioning and lifecycle tooling stays consistent whether you're deploying a VM-based development cluster or a bare-metal production cluster for GPU workloads.

The edge use case is straightforward: organizations running AI/ML inference at the edge (retail, manufacturing, telco) often need bare-metal for GPU access without the hypervisor overhead. Previously this meant separate tooling. NKP Metal collapses that.

Worth evaluating if:

You're already on Nutanix HCI and evaluating Kubernetes placement options
You have GPU workloads that need physical hosts but want unified operations
You're tired of maintaining separate management workflows for VMs vs. containers on bare-metal

Not a revolutionary change, but a practical one if you're already in that ecosystem.

Cover image: Unsplash

The BusyBox in Your Alpine Containers Is a Security Risk You Probably Didn't Know About

Schiff Heimlich — Wed, 22 Jul 2026 17:07:09 +0000

Here's something I ran into that might be worth a second look.

If you're running Docker containers based on Alpine Linux, you have BusyBox in your image. Most teams do, and most teams don't think twice about it. That might be a problem.

The Issue

BusyBox was designed for embedded systems — small, resource-constrained environments where you need a bunch of Unix utilities in a single binary. It wasn't designed for cloud production workloads where security matters.

The catch: when BusyBox has a vulnerability, your entire userspace is exposed. Unlike a traditional Linux distribution where each utility is a separate package with its own update cycle, BusyBox bundles everything into one binary. One CVE, and potentially every utility it provides is affected simultaneously.

Alpine Linux uses BusyBox for its init, its default shell, and the core userland utilities. That's efficient from a size perspective, and it is a big part of why Alpine images are small. But it means your attack surface is concentrated.

What You Can Check

Look at what's actually in your running containers:

# Check if BusyBox is present
docker exec your-container which busybox

# See what version
docker exec your-container busybox | head -1

If you're building from alpine:latest or a similar base, BusyBox is there. It's not automatically a problem — but it does mean you need to track Alpine security announcements more closely than you might for other distros.

The Build-Time Alternative

There's work happening in the container security space around shifting validation to build time rather than runtime. The idea is that instead of scanning running containers for vulnerabilities, you validate container composition at build time and make security decisions then.

For BusyBox specifically, this means checking whether your base image is tracking CVEs promptly, and whether your application actually needs everything BusyBox provides. If you're only using a subset of the utilities, you might be able to swap to a different base that provides those utilities as separate, independently-updateable packages.

When This Matters More

The risk profile changes depending on your exposure:

Internet-facing containers: Higher priority to track and update
Short-lived ephemeral containers: Still matters, but rotation helps
Privilege level: Containers running as root or with CAP_SYS_ADMIN need more attention

The Practical Takeaway

This isn't a reason to panic or rip out your Alpine-based images. Alpine is maintained by a competent team and they track security issues. But it's worth knowing what's in your containers, and having a process to update base images when BusyBox CVEs drop — because they do drop.

A quick audit of which containers are running what base images, and a check on how automated your base image updates are, is probably worth 20 minutes of your time.
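For the audit itself, something like the following gets you a container-per-image overview. It's a rough sketch: it only parses text you feed it, the container and image names in the usage note are made up, and the "alpine" check is a naming heuristic that misses images built on Alpine under another name.

```python
from collections import defaultdict
from typing import Dict, List

# Expected input: the tab-separated output of
#   docker ps --format '{{.Names}}\t{{.Image}}'


def group_by_image(docker_ps_output: str) -> Dict[str, List[str]]:
    """Map each image reference to the sorted names of containers using it."""
    images = defaultdict(list)
    for line in docker_ps_output.splitlines():
        if not line.strip():
            continue
        name, _, image = line.partition("\t")
        images[image.strip()].append(name.strip())
    return {image: sorted(names) for image, names in sorted(images.items())}


def alpine_by_name(images: Dict[str, List[str]]) -> List[str]:
    """Images whose reference mentions 'alpine' (heuristic only)."""
    return [image for image in images if "alpine" in image.lower()]
```

Feeding it three hypothetical containers, two on nginx:1.27-alpine and one on myorg/api:2.3, groups the first two together and flags only the nginx image.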

Image: The Linux kernel provides the foundation, but your container's userland is equally important to keep patched

Gitea's Docker Image Shipped a Dangerous Default, and It's Still Catching People

Schiff Heimlich — Tue, 21 Jul 2026 17:05:56 +0000

Gitea <=1.26.2 had a problem. The official Docker image set REVERSE_PROXY_TRUSTED_PROXIES=* by default, which means it trusted the X-WEBAUTH-USER header from any source. Ship it, run docker compose up, and your private repos were accessible to anyone who sent the right header.

CVE-2026-20896, CVSS 9.8. Exploitation was seen in the wild within days of the PoC dropping.

What the actual vulnerability looks like

When Gitea runs behind a reverse proxy, it uses the X-WEBAUTH-USER header to identify the authenticated user — if your proxy passes it through. With REVERSE_PROXY_TRUSTED_PROXIES=*, Gitea accepts that header from any IP, not just your proxy. So:

curl -H "X-WEBAUTH-USER: admin" https://your-gitea.example.com/api/v1/repos

If the instance has a user actually named admin (and a lot of them do), you just got read access to that user's private repos, no password needed.

That's it. No zero-day exploit chain, no stolen credentials. Just a header.

How to check if you're exposed

Look at how you start Gitea.

Docker Compose users:

environment:
  - GITEA__server__PROXY_MODE=true
  # Check if you have REVERSE_PROXY_TRUSTED_PROXIES set to *

Docker run:

docker run -e REVERSE_PROXY_TRUSTED_PROXIES=* gitea/gitea:1.26

If you see * anywhere in that config, you're trusting every IP that can reach your Gitea instance.

The second check: Who can actually reach your Gitea? If it's directly exposed to the internet (not behind a proper firewall or VPN-only access), that's the real problem. A wildcard proxy trust config on a publicly accessible instance is exactly as bad as it sounds.
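If you have more than a handful of instances, the first check is easy to script. A minimal sketch, assuming the value is a comma-separated list of IPs, CIDR ranges, or * as in the examples in this post; the function name is mine, not part of Gitea.

```python
import ipaddress
from typing import List


def risky_trusted_proxies(value: str) -> List[str]:
    """Return the entries that trust every possible source address."""
    risky = []
    for entry in (part.strip() for part in value.split(",")):
        if not entry:
            continue
        if entry == "*":
            risky.append(entry)
            continue
        # strict=False accepts host addresses written with a prefix length
        network = ipaddress.ip_network(entry, strict=False)
        if network.prefixlen == 0:  # 0.0.0.0/0 or ::/0
            risky.append(entry)
    return risky
```

An empty list means no wildcard-style entry was found. It does not mean the remaining ranges are as tight as they should be.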

The fix

Specify the actual IPs that are your reverse proxies. If your setup is a single Docker host with a reverse proxy on the same machine:

environment:
  - GITEA__server__PROXY_MODE=true
  - REVERSE_PROXY_TRUSTED_PROXIES=127.0.0.1

Or if your proxy is on a private network segment:

environment:
  - REVERSE_PROXY_TRUSTED_PROXIES=10.0.0.0/8,172.16.0.0/12,192.168.0.0/16

You can also set this in app.ini directly under [security] if you prefer config files over environment variables.

The real issue

This isn't a Gitea bug — the code works correctly given proper configuration. It's a shipped default that made sense in a narrow self-hosted scenario (single host, no exposure) but catches anyone who deploys "as documented" without auditing the security implications.

The same class of issue shows up elsewhere: default credentials, open S3 buckets, debug endpoints on production. These defaults are only safe while your threat model is "nobody tries to hit my services except through normal paths." That model breaks the moment something is internet-facing.

If you're running Gitea from the official Docker image, check your proxy trust config. It's a two-minute audit that might save you from an incident.

Schiff Heimlich | Sysadmin who has stopped being surprised by shipped defaults

Linux cgroups: Limiting Process Resources Without the Pain

Schiff Heimlich — Mon, 20 Jul 2026 17:04:32 +0000

Let me share something I ran into last week that might save you a headache.

Had a script that was eating too much memory and killing adjacent services. The fix was simpler than I expected — systemd-run.

The Quick Fix

Instead of chasing down every poorly-written script and adding manual resource limits, you can just run the command with resource constraints upfront:

systemd-run --scope -p MemoryMax=256M your-script.sh

That's it. The script gets its own scope with a 256MB memory cap. When it tries to allocate more, the kernel's OOM killer acts inside that scope only, so the runaway script dies instead of the services next to it.

Why This Is Handy

The thing I like about systemd-run is that you don't need to edit service files or reboot. It's just a wrapper around the cgroups interface that systemd already manages.

If you want persistent limits — like for a service that should always have constraints — you edit the unit file:

[Service]
MemoryMax=512M
CPUQuota=50%

A Couple of Flags Worth Knowing

MemoryHigh: the soft threshold. Above it, the kernel throttles the processes in the unit and reclaims their memory aggressively. Useful if you want to slow a service down before it hits the MemoryMax hard limit.

CPUQuota: takes a percentage of one CPU. CPUQuota=50% means the service never gets more than half a CPU core, even if the rest of the machine is idle. Values above 100% allow more than one core.
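The percentage turns into plain arithmetic on the cgroup side. A small sketch of the conversion, assuming cgroup v2 (where the limit appears in cpu.max as "quota period" in microseconds) and the default 100ms period; the function name is made up for illustration.

```python
def cpu_max_for_quota(quota_percent: float, period_us: int = 100_000) -> str:
    """Translate a CPUQuota percentage into a cpu.max style 'quota period' pair."""
    quota_us = int(period_us * quota_percent / 100)
    return f"{quota_us} {period_us}"


print(cpu_max_for_quota(50))   # half a core
print(cpu_max_for_quota(200))  # two full cores
```

So half a core means 50,000 microseconds of CPU time in every 100,000 microsecond window.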

The cgroupfs Path (If You Need It)

For debugging, you can see what's actually happening:

cat /sys/fs/cgroup/system.slice/your-service.service/memory.max

Each scope gets its own cgroup. You can read the limits, see current usage, and poke around without touching anything.
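If you don't know where a process landed, /proc/<pid>/cgroup tells you. Here is a minimal sketch for finding the directory and reading a limit, assuming a cgroup v2 (unified) hierarchy mounted at /sys/fs/cgroup; the helper names are my own.

```python
from pathlib import Path
from typing import Optional

CGROUP_ROOT = Path("/sys/fs/cgroup")


def cgroup_v2_dir(proc_cgroup_text: str, root: Path = CGROUP_ROOT) -> Optional[Path]:
    """Find the unified-hierarchy directory from /proc/<pid>/cgroup contents."""
    for line in proc_cgroup_text.splitlines():
        parts = line.split(":", 2)
        if len(parts) != 3:
            continue
        hierarchy_id, controllers, path = parts
        # The v2 entry always looks like "0::/some/path"
        if hierarchy_id == "0" and controllers == "":
            return root / path.lstrip("/")
    return None


def read_limit(cgroup_dir: Path, name: str = "memory.max") -> Optional[str]:
    """Read one control file, or None if it does not exist."""
    limit_file = cgroup_dir / name
    return limit_file.read_text().strip() if limit_file.exists() else None
```

For the current process you would pass in Path("/proc/self/cgroup").read_text(). A result of None means no v2 entry was found, which usually points at a cgroup v1 host.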

When I Reach for This

Scripts that call out to third-party binaries I don't trust
One-off batch jobs that might go sideways
Isolating services on shared homelab hardware
Testing how software behaves under memory pressure

It's not a silver bullet, but it's one of those tools that's cleaner than the alternatives I used to use (ulimit, nice, cgroups manually via /sys/fs/cgroup/).

Give systemd-run a shot next time you need to contain something. The manual pages are actually decent on this one.

Image: cgroups provide a hierarchical structure for resource control on Linux

Nginx Log Woes: When $remote_addr Lies to You

Schiff Heimlich — Mon, 20 Jul 2026 04:12:16 +0000

Here's a fun one that bit me last week.

You're running Nginx behind a reverse proxy or load balancer. You want to log the actual client IP for rate limiting. You check your logs and see... 127.0.0.1 for every single request. Your rate limiter is blocking localhost. Not ideal.

The Problem

When a request hits Nginx through a proxy, $remote_addr contains the proxy's IP, not the client's. Your config probably has something like:

log_format main '$remote_addr - $request';
access_log /var/log/nginx/access.log main;

That $remote_addr is your proxy. The real client IP is buried in a header.

The Fix

Your proxy should be forwarding the real client IP via X-Forwarded-For or X-Real-IP. Then in Nginx you use the right variable:

log_format main '$http_x_real_ip - $request';

Or if you trust the X-Forwarded-For chain:

log_format main '$http_x_forwarded_for - $request';

But be careful with X-Forwarded-For — it's a comma-separated list and can be spoofed if your proxy doesn't sanitize it.
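To make the spoofing point concrete, here is a sketch of the safe way to read that list: walk it from the right and stop at the first address that is not one of your own proxies. This is my own illustration in Python, not Nginx code; the function name and the example CIDR are assumptions.

```python
import ipaddress
from typing import List


def client_ip_from_xff(xff_header: str, peer_ip: str,
                       trusted_proxies: List[str]) -> str:
    """Pick the right-most address in X-Forwarded-For that is not a trusted proxy."""
    trusted = [ipaddress.ip_network(cidr) for cidr in trusted_proxies]

    def is_trusted(ip: str) -> bool:
        try:
            address = ipaddress.ip_address(ip)
        except ValueError:
            return False  # unparsable entries are never trusted
        return any(address in network for network in trusted)

    if not is_trusted(peer_ip):
        return peer_ip  # header came from an unknown peer: ignore it entirely

    hops = [hop.strip() for hop in xff_header.split(",") if hop.strip()]
    for hop in reversed(hops):
        if not is_trusted(hop):
            return hop
    return hops[0] if hops else peer_ip
```

Anything a client puts at the left of the list is ignored, because the walk never gets past the first address your own proxy appended.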

The Rate Limiting Piece

For rate limiting, you need the actual client IP too:

real_ip_header X-Real-IP;
set_real_ip_from 10.0.0.0/8;  # your proxy CIDR

This tells Nginx to trust that header from your known proxy range and rewrite $remote_addr accordingly.

After reloading, check your logs. You should see real IPs now.

Quick Debug

If it's not working:

# See what Nginx actually sees
tail -f /var/log/nginx/access.log | awk '{print $1}'

Compare what hits your upstream directly vs through the proxy. One of those headers should populate.

This bites teams regularly when they first set up a reverse proxy. The logs look fine until you need to debug or rate-limit, and then you're wondering why everyone's coming from the same IP. Happened to a client last month — their Cloudflare setup wasn't passing CF-Connecting-IP, so rate limiting was a no-op.

git absorb: the fixup workflow that sorts itself out

Schiff Heimlich — Wed, 15 Jul 2026 17:04:05 +0000

Here's a small thing that has made my git workflow less tedious.

When you're working on a feature branch and get review feedback, you usually end up doing this:

1. Make your fixes
2. Find the commit SHA that needs the fix
3. git commit --fixup <sha>
4. git rebase -i --autosquash

Step 2 is the annoying part. You're scanning git log, copying the SHA, maybe getting it wrong.

git-absorb automates the bookkeeping. You stage your files, run git absorb, and it figures out which commits your changes belong to and creates the fixup commits for you.

git add $FILES_YOU_FIXED
git absorb

If you trust it, --and-rebase squishes everything in one go:

git add $FILES_YOU_FIXED
git absorb --and-rebase

If you want to check first, just run git absorb without the flag, look at what it generated with git log, then run git rebase -i --autosquash yourself.

It's a Rust port of hg absorb, the Mercurial extension that came out of Facebook. Install it from the releases page or via cargo.

The workflow it enables is clean: make your changes, stage the files, let the tool sort out which commit gets what. No SHA hunting.

If you're still doing fixups manually, give it a try.

When Your Scheduled Job Takes Longer Than Its Interval

Schiff Heimlich — Tue, 14 Jul 2026 17:05:40 +0000

Had an interesting realization about job queues this week that I figured I would share since it came up in a code review.

The setup

You are running a scheduled job. It is configured to run every N hours. Most of the time it finishes in time, but sometimes it does not — maybe it hits a rate limit, maybe the data volume is higher than usual, whatever.

What happens when your job is still running when the scheduler tries to start it again?

The semantics you probably have not thought about

Turns out, job queue implementations typically give you a few options when this happens:

Prefer New: Cancel the running job, start the new one
Prefer Old: Let the running job finish, skip the new trigger
Wait: Queue the new trigger, run it after the current one finishes
Parallel: Run both concurrently (if you have concurrency > 1)

Most people, including me apparently, assume Prefer New is the sensible default. Newer runs should use fresher data, right?

Where that assumption breaks down

Here is the scenario that made me rethink this:

You are running a job that takes 7 hours on weekends but only 2 hours on weekdays (less data to process). You set the interval to 3 hours thinking the 2-hour job will finish well before the next trigger.

On the weekend, your 7-hour job starts. At the 3-hour mark, a new trigger fires. With Prefer New, you cancel the running job and start fresh. It gets canceled again at the 6-hour mark. And again. You will run the job 16 times over a weekend and none of them will ever finish.

With Prefer Old, the running job just continues. You skip a few triggers, but your job actually completes.
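The weekend scenario is easy to check with a toy simulation. This is a sketch under simplifying assumptions: time in whole hours, triggers at 0, 3, 6 and so on across a 48-hour weekend, a run counted as complete only if it ends inside that window, and policy names that are my own labels rather than any scheduler's API.

```python
def simulate(policy: str, duration: int, interval: int, horizon: int) -> dict:
    """Count started, completed, cancelled runs and skipped triggers."""
    if policy not in ("prefer_new", "prefer_old"):
        raise ValueError(f"unknown policy: {policy}")

    stats = {"started": 0, "completed": 0, "cancelled": 0, "skipped": 0}
    running_until = None  # end time of the active run, if any

    for now in range(0, horizon, interval):
        # Retire the active run if it finished before this trigger
        if running_until is not None and running_until <= now:
            stats["completed"] += 1
            running_until = None

        if running_until is None:
            stats["started"] += 1
            running_until = now + duration
        elif policy == "prefer_new":
            stats["cancelled"] += 1  # kill the active run, start over
            stats["started"] += 1
            running_until = now + duration
        else:  # prefer_old: leave the active run alone
            stats["skipped"] += 1

    if running_until is not None and running_until <= horizon:
        stats["completed"] += 1
    return stats


for policy in ("prefer_new", "prefer_old"):
    print(policy, simulate(policy, duration=7, interval=3, horizon=48))
```

Under these assumptions Prefer New starts 16 runs and completes none, while Prefer Old starts 6, completes 5, and skips 10 triggers.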

The practical takeaway

When you are configuring scheduled jobs, think about what should happen when the job outlives its interval. Prefer Old feels wrong intuitively, but in situations where:

Your job takes longer than expected due to external factors
Stale results are better than no results
You want to avoid wasted compute on canceled runs

...it might actually be the right choice.

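Check what semantics your job queue exposes. Kubernetes CronJobs make the choice explicit with concurrencyPolicy (Allow, Forbid, or Replace). Celery has task_acks_late and task_reject_on_worker_lost, which decide what happens to a run that gets interrupted. Sidekiq has lock options. Bull handles it differently again.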

The defaults might not match what you actually need.

Bare-Metal Kubernetes: What the Noise Is Actually About

Schiff Heimlich — Sun, 12 Jul 2026 17:04:50 +0000

You have probably seen the announcements. Nutanix launched NKP Metal. Several vendors are pushing bare-metal Kubernetes offerings. The press makes it sound like the next revolution.

Here's what's actually going on, and why you should care.

The Problem Being Solved

If you run Kubernetes in the cloud, you're several abstractions deep: containers on VMs on hypervisors on physical hardware. Each layer adds latency, consumes overhead, and introduces failure modes.

For most workloads, this doesn't matter. Your web app doesn't care if the hypervisor adds 0.3ms of latency.

For a specific class of workloads, it does matter:

Latency-sensitive applications at the edge — think content delivery, local inference, real-time processing
High-throughput data plane operations — network functions, storage controllers
GPU-direct workloads — where you want the container talking directly to the hardware

The push toward bare-metal Kubernetes is about removing the VM layer for these specific cases. Not all Kubernetes. Just the parts where the hypervisor tax actually costs you something.

What Changes When You Remove the VM Layer

Running Kubernetes directly on physical nodes changes a few things:

No more VM overhead. No KVM/QEMU tax. Your pod gets the full CPU, memory, and I/O of the physical host. For a GPU workload, this means direct PCI-e access without the virtualization layer introducing latency jitter.

Simplified resource allocation. You don't have to think about VM sizes and node counts separately. The node is the physical host. This sounds simpler but requires different operational thinking.

New operational requirements. No live migration. No VM-level snapshots. If a physical node fails, no hypervisor restarts it somewhere else: Kubernetes can reschedule the pods, but only onto surviving nodes with the right hardware and spare capacity. You need different health-check and recovery strategies.

Here's what that actually looks like operationally:

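```yaml
# With VMs, a node failure triggers live migration or restart.
# With bare metal, you're dealing with hardware.
# Health checking needs to be tighter.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: edge-workload
spec:
  replicas: 3
  selector:
    matchLabels:
      app: edge-workload
  template:
    metadata:
      labels:
        app: edge-workload
    spec:
      topologySpreadConstraints:
        # Force spread across physical hosts, not VMs
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: edge-workload
      containers:
        - name: edge-workload
          image: registry.example.com/edge-workload:1.0  # placeholder image
```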

The constraint whenUnsatisfiable: DoNotSchedule matters more on bare metal. You don't want two replicas landing on the same host if that host is your only option for that hardware profile.
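If maxSkew feels abstract, the check behind it is short. A simplified sketch of the rule as I understand it: a pod may land on a host only if that host's count after placement, minus the lowest count on any host, stays within maxSkew. It ignores label selectors, taints, and everything else the real scheduler considers, and the host names in the usage note are made up.

```python
from typing import Dict, List


def can_place(pods_per_host: Dict[str, int], host: str, max_skew: int = 1) -> bool:
    """True if adding one pod to `host` keeps the skew within max_skew."""
    count_after = pods_per_host[host] + 1
    return count_after - min(pods_per_host.values()) <= max_skew


def eligible_hosts(pods_per_host: Dict[str, int], max_skew: int = 1) -> List[str]:
    """Hosts that DoNotSchedule would still allow for the next replica."""
    return [host for host in sorted(pods_per_host)
            if can_place(pods_per_host, host, max_skew)]
```

With replicas already on host-a and host-b and none on host-c, only host-c is eligible for the third one. If host-c is down or full, the pod stays Pending, which is exactly the trade-off DoNotSchedule makes.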

The Practical Difference for Edge Deployments

Edge Kubernetes has a different failure profile than cloud. In cloud, a node failure is common, recoverable in seconds. At the edge, a node failure might mean:

Physical access required
Remote location with limited connectivity
Single-node "clusters" because hardware is expensive

Bare-metal Kubernetes at the edge means you're thinking about:

```yaml
# Edge cluster with local storage persistence
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
---
# Node affinity to keep stateful workloads pinned
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: edge-db
spec:
  serviceName: "edge-db"
  replicas: 1  # Sometimes you only have one node
  selector:
    matchLabels:
      app: edge-db
  template:
    metadata:
      labels:
        app: edge-db
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node-type
                    operator: In
                    values:
                      - compute-node
      tolerations:
        - key: "node.kubernetes.io/unreachable"
          operator: "Exists"
          effect: "NoExecute"
          tolerationSeconds: 300
      containers:
        - name: edge-db
          image: registry.example.com/edge-db:1.0  # placeholder image
```

The tolerationSeconds: 300 gives pods 5 minutes before they get evicted when a node becomes unreachable. That matches the Kubernetes default, so raise it if your links tend to drop for longer. At the edge, you want this grace period: you might be dealing with a temporary network partition, not an actual node failure.

Where This Makes Sense

Bare-metal Kubernetes is not a replacement for cloud Kubernetes. It's a specific tool for specific scenarios:

Makes sense:

Edge locations with latency-sensitive workloads
Telco/CNF workloads requiring low jitter
GPU clusters where you want direct PCI-e access
Remote locations where compute is expensive and you need efficiency

Does not make sense:

General web applications
Development/test environments
Workloads that scale horizontally in cloud
Teams without operational experience managing physical infrastructure

The Nutanix move is interesting because they span both worlds — VMs for general workloads, bare metal for specialized ones. Unified management across both is the actual value proposition for organizations already in that ecosystem.

The Operational Reality Check

Before jumping on this, understand what you're signing up for:

No live migration means planned maintenance is different. Upgrading the kernel on a bare-metal node means your pods go down. You need proper PodDisruptionBudgets and graceful draining:

```bash
# Check if your workload can tolerate node maintenance
kubectl get poddisruptionbudgets -A
kubectl describe poddisruptionbudgets -A

# Drain before maintenance (<node-name> is a placeholder)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
```

Hardware failures are not transparent. A bad DIMM, a failing SSD, a RAID controller glitch — these don't show up in kubectl get nodes the same way a VM failure does. You need IPMI/BMC access and hardware monitoring out of band.

Inventory management is physical. Server serial numbers, firmware versions, BIOS settings — this is not aws ec2 describe-instances. It's a spreadsheet at minimum, hardware tags and asset IDs at best.

The Actual Opportunity

For most teams, cloud Kubernetes is fine. The hypervisor tax is real but small, and the operational simplicity is worth it.

For teams running edge infrastructure, telco workloads, or specialized compute (GPU, FPGA, network accelerators), bare-metal Kubernetes solves a real problem. The latency and efficiency gains are measurable.

The key is knowing which category your workload falls into. If you can't articulate why you need bare metal, you probably don't. And if you can, you're already aware of the operational tradeoffs.

This isn't a revolution. It's a specialized deployment model getting better tooling support. That's worth paying attention to if you're in that space.

Sources:

Pedal to Bare-Metal Kubernetes: Nutanix Forges NKP Metal — Cloud Native Now

jq in Shell Scripts: The Small Things That Trip You Up

Schiff Heimlich — Sat, 11 Jul 2026 17:05:35 +0000

The Setup

If you work with JSON APIs, configs, or log processing, you have probably used jq in a shell script. It is solid. But there are a few edge cases that trip people up regularly — things that work fine in tests but break in production.

Here is what I have run into.

The Problem: Null vs. Missing Keys

Say you have JSON like this:

{"name": "web01", "status": "running"}

And you want to get the status field. Easy:

jq -r ".status" <<< "{\"name\": \"web01\", \"status\": \"running\"}"
# returns: running

But what if the field is missing or null?

{"name": "web01"}

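jq will return null. Your script might handle that fine, or it might not. The problem comes when you try to do arithmetic or comparisons on null.

jq ".count - 1" <<< "{\"count\": null}"
# fails: null and number cannot be subtracted

Not helpful. (Addition is the exception: null + 1 quietly returns 1, which hides the missing value instead of flagging it.) You need to handle this explicitly:

jq -r "(.count // 0) + 1" <<< "{\"count\": null}"
# returns: 1

The // operator provides a default value when the left side is null, false, or missing. It binds more loosely than +, so the parentheses matter: without them, .count // 0 + 1 means .count // (0 + 1), and a real count of 5 comes back as 5, not 6.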

The Problem: Piping Arrays Correctly

Another one. You have an array of objects and want to filter them:

[{"host": "web01", "cpu": 45}, {"host": "web02", "cpu": 12}, {"host": "web03", "cpu": 78}]

You want hosts with cpu > 50.

jq ".[] | select(.cpu > 50)" <<< "[{\"host\": \"web01\", \"cpu\": 45}, {\"host\": \"web02\", \"cpu\": 12}, {\"host\": \"web03\", \"cpu\": 78}]"

This outputs:

{"host": "web03", "cpu": 78}

Fine, right? But now try to get just the hostnames as a list:

jq "[.[] | select(.cpu > 50) | .host]" <<< "[{\"host\": \"web01\", \"cpu\": 45}, {\"host\": \"web02\", \"cpu\": 12}, {\"host\": \"web03\", \"cpu\": 78}]"

Returns:

["web03"]

Works. But if nothing matches:

jq "[.[] | select(.cpu > 100) | .host]" <<< "[{\"host\": \"web01\", \"cpu\": 45}]"

Returns:

[]

Not null, but an empty array. A script that only tests for null will miss this case: null means "no data", an empty array means "zero results". Different things.

The Workaround

When I write jq in shell scripts that handle production data, I usually wrap things defensively:

# Get a value with a default, even if key is missing
cpu=$(echo "$json" | jq -r ".cpu // \"unknown\"")

# Check if result is valid before proceeding
if [ "$cpu" = "null" ] || [ "$cpu" = "unknown" ]; then
    echo "No CPU data" >&2
    exit 1
fi

Or I use jq -e to check for null/empty results:

jq -e ".cpu" <<< "{\"cpu\": null}" > /dev/null
if [ $? -eq 1 ]; then
    echo "Key missing or null"
fi

With -e, exit status 1 means the last output was null or false. Status 4 means no output was produced at all.

The Point

jq is reliable. But when you mix shell scripting with JSON processing, you hit edge cases around null handling and empty results. A few defensive patterns save you from late-night debugging sessions.

These are not jq bugs — they are just things to know when you are writing scripts that process real data.

Schiff Heimlich | Sysadmin who learned this the hard way

Linux 7.2 cut pipe mutex contention — why your shell pipelines just got faster

Schiff Heimlich — Wed, 01 Jul 2026 17:04:58 +0000

A Meta engineer profiling caching code found something worth fixing in the kernel's pipe write path. The fix landed in Linux 7.2: anon_pipe_write now pre-allocates up to 8 pages before grabbing the lock, cutting the critical section down significantly.

What actually changed

Pipes use a ring buffer backed by anonymous pages. On write, the kernel would allocate pages under the same mutex used for the actual data copy — meaning every write had to wait for allocation, and allocation held the lock. Under load this caused measurable mutex contention.

The fix separates allocation from the critical section. The write path now allocates up to 8 pages before taking the mutex. Once it holds the lock, a write covered by those pages just copies data and updates the ring buffer pointer, so the slow allocation no longer happens while other writers are waiting.
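The pattern itself is not kernel-specific. Here is a userspace analogy in Python, purely to show the shape of it: do the slow allocation first, then take the lock only for the copy. This is not the kernel code; the class, the constants, and the short-write behavior are my own assumptions for illustration.

```python
import threading

PAGE_SIZE = 4096
MAX_PREALLOC = 8  # mirrors the "up to 8 pages" in the description above


class RingPipe:
    """Toy pipe buffer that allocates pages outside its lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._pages = []  # list of (page, bytes_used)

    def write(self, data: bytes) -> int:
        pages_needed = min(-(-len(data) // PAGE_SIZE), MAX_PREALLOC)
        # Slow path first: allocate without holding the lock
        fresh = [bytearray(PAGE_SIZE) for _ in range(pages_needed)]
        written = 0
        with self._lock:
            # Critical section: only copy data and publish the pages
            for page in fresh:
                chunk = data[written:written + PAGE_SIZE]
                page[:len(chunk)] = chunk
                self._pages.append((page, len(chunk)))
                written += len(chunk)
        return written  # may be short if data exceeds MAX_PREALLOC pages

    def read_all(self) -> bytes:
        with self._lock:
            pages, self._pages = self._pages, []
        return b"".join(bytes(page[:used]) for page, used in pages)
```

Writers that need more than 8 pages get a short write back and have to call write again, which keeps the time spent under the lock bounded.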

Numbers

The gains depend on your workload. Meta's testing showed meaningful improvements under memory pressure — the kind of situation where page allocation itself becomes expensive. Under lighter load the difference is smaller, but still present since you're avoiding the allocation path entirely when the pre-allocated pages cover the write.

Why this matters for pipeline work

If you run anything that pipes data through shell utilities — log processing, build artifact transforms, text munging, any sort | uniq | awk chain — you benefit from this. Pipes are the fundamental I/O primitive underneath all of it. Reducing contention at this layer makes the whole chain a bit more predictable under concurrent load.

You don't need to do anything. This lands in your kernel update. But it's worth knowing why those cat bigfile | sort | head runs feel a touch snappier on a recent kernel — it's not just compiler optimizations, it's a genuine kernel path improvement.

Kernel 7.2 or later required. Check with uname -r.