During the early hours of the morning, I started receiving Gitaly alerts — memory spikes that weren't being released automatically after the daily backup.
This article is about a self-hosted GitLab on EKS and the behavior of some GitLab components in Kubernetes.
If you also run GitLab on Kubernetes, it's worth understanding what's really happening — and why Cgroup v2 is the definitive solution to this kind of problem.
GITALY
Gitaly is the GitLab component responsible for all Git operations: clone, push, pull, merge, diff, and blame. It isolates repository storage from the web application and communicates with other services via gRPC, optimizing performance and concurrency control.
┌─────────────────────┐
│ GitLab Webservice │
│ │
└──────────┬──────────┘
│ gRPC
↓
┌─────────────────────┐
│ Gitaly │
│ - Git operations │
│ - Repository access │
└──────────┬──────────┘
↓
┌─────────────────────┐
│ Persistent Volume │
│ /home/git/repos │
└─────────────────────┘
GITLAB TOOLBOX BACKUP
The Toolbox is the GitLab component used to perform backups in Kubernetes environments (specifically when GitLab is deployed with the Helm chart). It's a pod that contains the tools and scripts needed to run GitLab backup and restore operations. When does it interact with Gitaly? During repository backups:
- Connects to Gitaly via gRPC
- Requests a backup of each repository
- Receives Git bundles from Gitaly
- Processes and compresses the data
- Sends everything to object storage (S3, GCS, etc.)
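If you want to exercise this flow by hand, the toolbox pod ships a `backup-utility` script, the same one the backup CronJob invokes. A minimal sketch (the `app=toolbox` label is the chart's default in recent versions; verify the label and pod name in your own cluster):

```shell
# Find the toolbox pod for your Helm release
# (label is an assumption based on the chart default; check with `kubectl get pods --show-labels`).
TOOLBOX=$(kubectl get pod -l app=toolbox -o jsonpath='{.items[0].metadata.name}')

# Run a backup from inside the toolbox pod; this is what triggers the
# gRPC conversation with Gitaly described above.
kubectl exec -it "$TOOLBOX" -- backup-utility
```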
During the execution of the gitlab-toolbox-backup cronjob in the early morning, I observed high memory usage on the Gitaly pod. This consumption is caused by the default behavior of the Linux kernel, which uses RAM as a cache for files read from disk (page cache).
In Kubernetes environments, this behavior can create resource allocation problems, since the kernel is shared across all pods on the node.
Symptom: high memory usage on the Gitaly pod that persists long after the backup has finished.
Critical implications:
- The kernel is shared across the entire node
- Page Cache is global and shared
- Cgroups v1 only limits how much each container can use
- The kernel has no concept of "pod" or "container": if the node has plenty of RAM, the kernel considers memory available even when an individual pod is about to be OOM-killed.
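You can see this global view directly: the kernel's file-cache counters in `/proc/meminfo` describe the whole node, no matter which container reads them.

```shell
# These counters are node-wide, not per-container; any pod on the node
# sees the same numbers:
grep -E '^(Cached|Active\(file\)|Inactive\(file\)):' /proc/meminfo
```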
┌─────────────────────────────────────────────────┐
│ NODE │
│ │
│ ┌──────────────────────────────────────────┐ │
│ │ LINUX KERNEL (single) │ │
│ │ - Manages ALL node RAM │ │
│ │ - Page Cache is SHARED │ │
│ │ - Has no concept of "pod" │ │
│ └──────────────────────────────────────────┘ │
│ │
│ ┌────────────┐ ┌────────────┐ ┌──────────┐ │
│ │ Pod A │ │ Pod B │ │ Pod C │ │
│ │ (Gitaly) │ │ (Redis) │ │ (Web) │ │
│ │ │ │ │ │ │ │
│ │ Sees: │ │ Sees: │ │ Sees: │ │
│ │ Limit:8GB │ │ Limit:4GB │ │ Limit:2GB│ │
│ └────────────┘ └────────────┘ └──────────┘ │
│ │
│ Total RAM: 32GB │
│ Total Cache: 20GB (visible to ALL) │
└─────────────────────────────────────────────────┘
Note: Page Cache is the RAM used by the kernel to cache files from disk.
Backup Flow
What happens?
During the backup (1h):
- Gitaly reads hundreds of Git repositories
- Kernel caches everything: "I'll keep these .git files in RAM"
- Backup ends: Gitaly process returns to normal (195MB)
- Kernel doesn't clean up: Cache stays marked as "active_file" = 35.6GB
- Kubernetes sees: Pod using 37GB → OOM danger!
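A quick way to see the gap between the two numbers, assuming a cgroup v1 node and the default StatefulSet pod name (`gitlab-gitaly-0` is an assumption; adjust to your release):

```shell
# RSS of the Gitaly process itself (small, a few hundred MB):
kubectl exec gitlab-gitaly-0 -- grep VmRSS /proc/1/status

# What the cgroup (and therefore Kubernetes) accounts for the container,
# page cache included (huge after a backup):
kubectl exec gitlab-gitaly-0 -- cat /sys/fs/cgroup/memory/memory.usage_in_bytes
```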
Why doesn't it clean up automatically?
The cache is marked as "active" (not "inactive"), so the kernel thinks:
"These files were recently used"
"They'll probably be used again soon"
"I'll keep them in RAM"
But since this is a backup that runs once a day, those files won't be accessed again until tomorrow!
Possible solutions evaluated:
| Option | Effort | Benefit | Recommendation |
|---|---|---|---|
| Migrate to cgroup v2 | High (node reboot) | Definitive fix | Best long-term option |
| Privileged CronJob to drop cache | Low (15min) | Solves the problem | If you need a quick fix |
| DaemonSet monitor | Medium (1h) | Automated | Optional |
| Increase memory limit | Low | Temporary workaround | Emergency only |
As we can see, there are workarounds — but the best long-term option is Cgroup v2. It requires a bit more effort to implement, but the benefits make it stand out.
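The quick fix from the table is essentially a one-liner, as long as it runs with host privileges (on the node over SSH, or from a privileged container that can see the host's `/proc`):

```shell
# Flush dirty pages to disk first, then ask the kernel to drop clean page cache.
# drop_caches only discards CLEAN pages, so no data is lost; the cost is a
# temporary performance hit while caches warm back up.
sync
echo 1 > /proc/sys/vm/drop_caches
```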
Current Cgroup v1 data:
cache: 38829035520 # 36.2 GB !!!!!
rss: 204779520 # 195 MB
inactive_file: 568246272 # 542 MB
active_file: 38260654080 # 35.6 GB !!!!!
**35.6GB of `active_file`** = actively cached files (page cache)!
Breakdown:
- Gitaly process (RSS): 195 MB
- Active file cache: 35.6 GB ← HERE!
- Inactive file cache: 542 MB
- Total cache: 36.2 GB
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Total pod usage: ~37 GB
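Those numbers come straight from the container's cgroup v1 `memory.stat`; a small awk filter converts the interesting counters to MB (the path below is the in-container cgroup v1 mount, so it only exists on v1 nodes):

```shell
# Print the page-cache-related counters from cgroup v1 memory.stat in MB.
# The anchored pattern deliberately skips the hierarchical total_* counters.
awk '$1 ~ /^(cache|rss|active_file|inactive_file)$/ {
       printf "%-14s %10.1f MB\n", $1, $2 / 1048576
     }' /sys/fs/cgroup/memory/memory.stat
```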
Cgroup v2
Cgroup v2 has a feature called PSI (Pressure Stall Information) that detects when there is memory "pressure":
# cgroup v2 exposes:
/sys/fs/cgroup/memory.pressure
# Content:
some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
When pressure is detected, the kernel automatically releases cache even if it's marked as "active"!
Cgroup v2 is the second generation of the Linux kernel's control groups system, with significant improvements over v1; our GitLab EKS cluster is still running Cgroup v1. Where Cgroup v1 uses multiple independent hierarchies (memory, cpu, io), which can cause inconsistencies, Cgroup v2 uses a single unified tree mounted at /sys/fs/cgroup.
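Before planning the migration, it's worth confirming what each node currently runs; the filesystem type mounted at `/sys/fs/cgroup` tells you:

```shell
# cgroup2fs -> unified hierarchy (cgroup v2)
# tmpfs     -> legacy split hierarchies (cgroup v1)
stat -fc %T /sys/fs/cgroup
```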
That's it! I had the chance to dig into this topic this week and wanted to share what I learned.
Docs:
- backup-restore
- Kernel Tuning and Optimization for Kubernetes: A Guide
- Linux Kernel Version Requirements

