During the early hours of the morning, I started receiving Gitaly alerts — memory spikes that weren't being released automatically after the daily backup.
This article is about a self-hosted GitLab on EKS and the behavior of some GitLab components in Kubernetes.
If you also run GitLab on Kubernetes, it's worth understanding what's really happening — and why Cgroup v2 is the definitive solution to this kind of problem.
GITALY
Gitaly is the GitLab component responsible for all Git operations: clone, push, pull, merge, diff, and blame. It isolates repository storage from the web application and communicates with other services via gRPC, optimizing performance and concurrency control.
┌─────────────────────┐
│ GitLab Webservice │
│ │
└──────────┬──────────┘
│ gRPC
↓
┌─────────────────────┐
│ Gitaly │
│ - Git operations │
│ - Repository access │
└──────────┬──────────┘
↓
┌─────────────────────┐
│ Persistent Volume │
│ /home/git/repos │
└─────────────────────┘
GITLAB TOOLBOX BACKUP
The Toolbox is the GitLab component used to perform backups in Kubernetes environments (specifically when GitLab is deployed with the Helm chart). It's a pod that contains the tools and scripts needed to run GitLab backup and restore operations. When does it interact with Gitaly? During repository backups:
- Connects to Gitaly via gRPC
- Requests a backup of each repository
- Receives Git bundles from Gitaly
- Processes and compresses the data
- Sends everything to object storage (S3, GCS, etc.)
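If you want to exercise this flow by hand, the toolbox pod ships a `backup-utility` script, the same one the backup CronJob invokes. A minimal sketch (the `app=toolbox` label is the chart's default in recent versions; verify the label and pod name in your own cluster):

```shell
# Find the toolbox pod for your Helm release
# (label is an assumption based on the chart default; check with `kubectl get pods --show-labels`).
TOOLBOX=$(kubectl get pod -l app=toolbox -o jsonpath='{.items[0].metadata.name}')

# Run a backup from inside the toolbox pod; this is what triggers the
# gRPC conversation with Gitaly described above.
kubectl exec -it "$TOOLBOX" -- backup-utility
```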
During the execution of the gitlab-toolbox-backup cronjob in the early morning, I observed high memory usage on the Gitaly pod. This consumption is caused by the default behavior of the Linux kernel, which uses RAM as a cache for files read from disk (page cache).
In Kubernetes environments, this behavior can create resource allocation problems, since the kernel is shared across all pods on the node.
Symptom: high memory usage on the Gitaly pod that persists long after the backup has finished.
Critical implications:
- The kernel is shared across the entire node
- Page Cache is global and shared
- Cgroups v1 only limits how much each container can use
- The kernel has no concept of "pod" or "container": if the node has plenty of RAM, the kernel considers memory available even when an individual pod is about to be OOM-killed.
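You can see this global view directly: the kernel's file-cache counters in `/proc/meminfo` describe the whole node, no matter which container reads them.

```shell
# These counters are node-wide, not per-container; any pod on the node
# sees the same numbers:
grep -E '^(Cached|Active\(file\)|Inactive\(file\)):' /proc/meminfo
```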
┌─────────────────────────────────────────────────┐
│ NODE │
│ │
│ ┌──────────────────────────────────────────┐ │
│ │ LINUX KERNEL (single) │ │
│ │ - Manages ALL node RAM │ │
│ │ - Page Cache is SHARED │ │
│ │ - Has no concept of "pod" │ │
│ └──────────────────────────────────────────┘ │
│ │
│ ┌────────────┐ ┌────────────┐ ┌──────────┐ │
│ │ Pod A │ │ Pod B │ │ Pod C │ │
│ │ (Gitaly) │ │ (Redis) │ │ (Web) │ │
│ │ │ │ │ │ │ │
│ │ Sees: │ │ Sees: │ │ Sees: │ │
│ │ Limit:8GB │ │ Limit:4GB │ │ Limit:2GB│ │
│ └────────────┘ └────────────┘ └──────────┘ │
│ │
│ Total RAM: 32GB │
│ Total Cache: 20GB (visible to ALL) │
└─────────────────────────────────────────────────┘
Note: Page Cache is the RAM used by the kernel to cache files from disk.
Backup Flow
What happens?
During the backup (1h):
- Gitaly reads hundreds of Git repositories
- Kernel caches everything: "I'll keep these .git files in RAM"
- Backup ends: Gitaly process returns to normal (195MB)
- Kernel doesn't clean up: Cache stays marked as "active_file" = 35.6GB
- Kubernetes sees: Pod using 37GB → OOM danger!
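A quick way to see the gap between the two numbers, assuming a cgroup v1 node and the default StatefulSet pod name (`gitlab-gitaly-0` is an assumption; adjust to your release):

```shell
# RSS of the Gitaly process itself (small, a few hundred MB):
kubectl exec gitlab-gitaly-0 -- grep VmRSS /proc/1/status

# What the cgroup (and therefore Kubernetes) accounts for the container,
# page cache included (huge after a backup):
kubectl exec gitlab-gitaly-0 -- cat /sys/fs/cgroup/memory/memory.usage_in_bytes
```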
Why doesn't it clean up automatically?
The cache is marked as "active" (not "inactive"), so the kernel thinks:
"These files were recently used"
"They'll probably be used again soon"
"I'll keep them in RAM"
But since this is a backup that runs once a day, those files won't be accessed again until tomorrow!
Possible solutions evaluated:
| Option | Effort | Benefit | Recommendation |
|---|---|---|---|
| Migrate to cgroup v2 | High (node reboot) | Definitive fix | Best long-term option |
| Privileged CronJob to drop cache | Low (15min) | Solves the problem | If you need a quick fix |
| DaemonSet monitor | Medium (1h) | Automated | Optional |
| Increase memory limit | Low | Temporary workaround | Emergency only |
As we can see, there are workarounds — but the best long-term option is Cgroup v2. It requires a bit more effort to implement, but the benefits make it stand out.
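The quick fix from the table is essentially a one-liner, as long as it runs with host privileges (on the node over SSH, or from a privileged container that can see the host's `/proc`):

```shell
# Flush dirty pages to disk first, then ask the kernel to drop clean page cache.
# drop_caches only discards CLEAN pages, so no data is lost; the cost is a
# temporary performance hit while caches warm back up.
sync
echo 1 > /proc/sys/vm/drop_caches
```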
Current Cgroup v1 data:
cache: 38829035520 # 36.2 GB !!!!!
rss: 204779520 # 195 MB
inactive_file: 568246272 # 542 MB
active_file: 38260654080 # 35.6 GB !!!!!
**35.6GB of `active_file`** = actively cached files (page cache)!
Breakdown:
- Gitaly process (RSS): 195 MB
- Active file cache: 35.6 GB ← HERE!
- Inactive file cache: 542 MB
- Total cache: 36.2 GB
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Total pod usage: ~37 GB
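Those numbers come straight from the container's cgroup v1 `memory.stat`; a small awk filter converts the interesting counters to MB (the path below is the in-container cgroup v1 mount, so it only exists on v1 nodes):

```shell
# Print the page-cache-related counters from cgroup v1 memory.stat in MB.
# The anchored pattern deliberately skips the hierarchical total_* counters.
awk '$1 ~ /^(cache|rss|active_file|inactive_file)$/ {
       printf "%-14s %10.1f MB\n", $1, $2 / 1048576
     }' /sys/fs/cgroup/memory/memory.stat
```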
Cgroup v2
Cgroup v2 has a feature called PSI (Pressure Stall Information) that detects when there is memory "pressure":
# cgroup v2 exposes:
/sys/fs/cgroup/memory.pressure
# Content:
some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
When pressure is detected, the kernel automatically releases cache even if it's marked as "active"!
Cgroup v2 is the second generation of the Linux kernel's control groups system, with significant improvements over v1; our GitLab EKS cluster is still running Cgroup v1. Where Cgroup v1 uses multiple independent hierarchies (memory, cpu, io), which can cause inconsistencies, Cgroup v2 uses a single unified tree mounted at /sys/fs/cgroup.
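Before planning the migration, it's worth confirming what each node currently runs; the filesystem type mounted at `/sys/fs/cgroup` tells you:

```shell
# cgroup2fs -> unified hierarchy (cgroup v2)
# tmpfs     -> legacy split hierarchies (cgroup v1)
stat -fc %T /sys/fs/cgroup
```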
That's it! I had the chance to dig into this topic this week and wanted to share what I learned.
Docs:
- backup-restore
- Kernel Tuning and Optimization for Kubernetes: A Guide
- Linux Kernel Version Requirements

