<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Katz Sakai</title>
    <description>The latest articles on DEV Community by Katz Sakai (@katz).</description>
    <link>https://dev.to/katz</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F192880%2Fa0f93e5d-730c-476d-835f-e071842ccc96.jpg</url>
      <title>DEV Community: Katz Sakai</title>
      <link>https://dev.to/katz</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/katz"/>
    <language>en</language>
    <item>
      <title>GKE's Noisy Neighbor Problem Can Be Invisible in Metrics Explorer</title>
      <dc:creator>Katz Sakai</dc:creator>
      <pubDate>Sun, 12 Apr 2026 06:53:14 +0000</pubDate>
      <link>https://dev.to/katz/gkes-noisy-neighbor-problem-can-be-invisible-in-metrics-explorer-1bf4</link>
      <guid>https://dev.to/katz/gkes-noisy-neighbor-problem-can-be-invisible-in-metrics-explorer-1bf4</guid>
      <description>&lt;p&gt;Google Cloud's Metrics Explorer has plenty of metrics, and for most monitoring needs, it's more than enough.&lt;/p&gt;

&lt;p&gt;However, the sampling interval of those metrics can hide real problems. I once ran into a situation where an API server on Google Kubernetes Engine (GKE) had intermittent response time spikes, yet Metrics Explorer showed nothing abnormal. The root cause turned out to be short-lived batch jobs on the same Node eating up all the CPU, a classic Noisy Neighbor problem.&lt;/p&gt;

&lt;p&gt;Here's how I fell into that trap.&lt;/p&gt;

&lt;h2&gt;An API server that was mysteriously slow from time to time&lt;/h2&gt;

&lt;p&gt;I had a development API server running on GKE that would occasionally slow down for no obvious reason.&lt;/p&gt;

&lt;p&gt;A request that normally completed in around 200 ms would sometimes take about 4 seconds, even under the same conditions. The slowdowns were intermittent and seemingly random, with no clear pattern to when they happened.&lt;/p&gt;

&lt;p&gt;When the issue occurred, CPU usage for the two GKE Nodes looked like this in Metrics Explorer:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbon0bu441t2a3i2k2bqz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbon0bu441t2a3i2k2bqz.png" alt="CPU usage for the two GKE Nodes" width="800" height="517"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;CPU utilization was sitting around 35%. Nothing suggested the CPUs were being saturated.&lt;/p&gt;

&lt;p&gt;I then checked the Load Average (1m) for the same Nodes:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpk7vm07tgn9qy885y8xx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpk7vm07tgn9qy885y8xx.png" alt="Load Average (1m) for the same Nodes" width="800" height="516"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There were some spikes, but on a 2-core Node a load average of around 1.5 to 2 is not obviously alarming, and combined with the unremarkable CPU utilization graph, it was hard to conclude that the CPU was saturated.&lt;/p&gt;

&lt;p&gt;(In hindsight, the Load Average spikes might have been a clue that something was happening in short bursts. But at the time, I couldn't connect the dots.)&lt;/p&gt;

&lt;h2&gt;What was actually happening on the API server node&lt;/h2&gt;

&lt;p&gt;Metrics Explorer wasn't giving me any clues. The database wasn't overloaded, and there were no notable error logs from the API server. I was stuck, and since this was a development environment, I let it sit for a while.&lt;/p&gt;

&lt;p&gt;One day, I finally decided to log in to the affected node over SSH and watch it in real time with &lt;code&gt;htop&lt;/code&gt;&lt;sup id="fnref1"&gt;1&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;That was the turning point.&lt;/p&gt;

&lt;p&gt;At some moments, both CPU cores were pinned at 100% for around 20 to 30 seconds, and the load average reached 7.44.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuj8i9g54c4ynsarhqrj4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuj8i9g54c4ynsarhqrj4.png" alt="htop" width="800" height="212"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The process list showed multiple Rails batch tasks running at the same time. These batch jobs were consuming all the CPU, starving the API Pod running on the same Node. &lt;/p&gt;

&lt;p&gt;That was the noisy neighbor.&lt;/p&gt;

&lt;p&gt;When the batch jobs were not running, CPU usage dropped back down to around 6% to 11%.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9iri63kr9dv5tntrsxoq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9iri63kr9dv5tntrsxoq.png" width="800" height="324"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So why didn't Metrics Explorer show this?&lt;/p&gt;

&lt;p&gt;Because the CPU-hungry batch jobs were short-lived. Each run finished in around 20 to 30 seconds. The VM CPU utilization metric in Metrics Explorer is sampled at 60-second intervals&lt;sup id="fnref2"&gt;2&lt;/sup&gt;. If CPU is fully saturated for only 20 to 30 seconds out of a 60-second window, the result can still look like only about 33% to 50% average utilization.&lt;/p&gt;

&lt;p&gt;That was exactly the trap: the node really was getting hammered, but only briefly, and the 1-minute metric smoothed it into something that looked unremarkable.&lt;/p&gt;
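&lt;p&gt;A back-of-the-envelope sketch of that smoothing effect, with hypothetical per-second samples (25 seconds fully saturated, the rest of the window near idle):&lt;/p&gt;

```ruby
# Hypothetical per-second CPU utilization samples for one 60 s window:
# 25 s fully saturated (1.0), 35 s at a quiet baseline (0.1).
samples = Array.new(25, 1.0) + Array.new(35, 0.1)

# A metric sampled at 60 s intervals effectively reports only the
# mean of the window, not the burst inside it.
window_average = samples.sum / samples.size

puts format("reported utilization: %.1f%%", window_average * 100)
# A fully saturated 25 s burst is reported as a ~48% average,
# which looks unremarkable on a dashboard.
```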

&lt;h2&gt;(Side note) Why the batch jobs were eating all the CPU&lt;/h2&gt;

&lt;p&gt;The batch pod had no CPU limits configured, so there was no upper bound on how much CPU it could use.&lt;/p&gt;

&lt;p&gt;As a result, when multiple batch jobs ran at the same time, they were able to consume most of the CPU available on the node and interfere with other pods running there.&lt;/p&gt;

&lt;p&gt;After I added CPU limits to the batch pod, the API response time became stable.&lt;/p&gt;
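&lt;p&gt;For reference, a minimal sketch of the kind of resource stanza that was missing from the batch Pod (names and values hypothetical):&lt;/p&gt;

```yaml
# Hypothetical batch-pod container spec fragment: without "limits.cpu",
# concurrent batch jobs can consume every core on the Node.
resources:
  requests:
    cpu: "250m"
  limits:
    cpu: "500m"   # cap so batch work cannot starve co-located Pods
```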

&lt;h2&gt;What I learned&lt;/h2&gt;

&lt;p&gt;Metrics Explorer is a powerful tool, but you need to be aware of its sampling intervals. Short-lived CPU spikes can get averaged out and won't show up clearly on the graphs.&lt;/p&gt;

&lt;p&gt;In this case, I had already noticed the symptom from the application side: “the dev API feels slow sometimes.” But the infrastructure metrics alone did not reveal the cause. I only found the real problem after looking at the node directly with &lt;code&gt;htop&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I took two lessons from this.&lt;/p&gt;

&lt;p&gt;First, monitor application latency itself, especially p95 and p99. Infrastructure metrics do not always tell you that users are already feeling pain.&lt;/p&gt;

&lt;p&gt;Second, know the sampling resolution of the metrics you rely on. If a problem is short-lived enough, aggregated infrastructure graphs may hide it. In those cases, direct inspection of the machine can still be the fastest way to understand what is happening.&lt;/p&gt;

&lt;p&gt;If you need a good starting point for that kind of live investigation, &lt;a href="https://netflixtechblog.com/linux-performance-analysis-in-60-000-milliseconds-accc10403c55" rel="noopener noreferrer"&gt;Netflix’s Linux Performance Analysis in 60,000 Milliseconds&lt;/a&gt; is still a useful reference.&lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;For how to run htop on a GKE Node, see &lt;a href="https://docs.cloud.google.com/container-optimized-os/docs/how-to/toolbox?hl=en" rel="noopener noreferrer"&gt;Debugging node issues using the toolbox&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;&lt;a href="https://docs.cloud.google.com/monitoring/api/metrics_gcp_c?hl=en#gcp-compute" rel="noopener noreferrer"&gt;https://docs.cloud.google.com/monitoring/api/metrics_gcp_c?hl=en#gcp-compute&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>monitoring</category>
      <category>googlecloud</category>
      <category>observability</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>How a Rails and K8s Newcomer Cut GKE Costs by 60% by Looking Across the Stack</title>
      <dc:creator>Katz Sakai</dc:creator>
      <pubDate>Tue, 31 Mar 2026 11:41:32 +0000</pubDate>
      <link>https://dev.to/katz/how-we-cut-rails-on-gke-costs-by-60-the-efficiency-first-roadmap-14pg</link>
      <guid>https://dev.to/katz/how-we-cut-rails-on-gke-costs-by-60-the-efficiency-first-roadmap-14pg</guid>
      <description>&lt;p&gt;&lt;strong&gt;tl;dr:&lt;/strong&gt; This is a journal of how an engineer with no prior Rails or Kubernetes experience cut Google Kubernetes Engine (GKE) costs by 60%. The following steps were taken to achieve this cost reduction:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Puma, which hosts the Rails app, was running with 1 worker and 33 threads. Because Ruby has the GVL, only one core was effectively being used. Changing to 4 workers and 8 threads let the Rails app take advantage of multi-core CPUs and process multiple requests more effectively per Pod.&lt;/li&gt;
&lt;li&gt;Every API request was running bcrypt for token authentication. This was clearly too much overhead for API token auth, so replacing it with a lighter scheme reduced per-request CPU load.&lt;/li&gt;
&lt;li&gt;The GKE Nodes running the Pods were on an older machine generation. Upgrading from n1 to n2d gave 56% better CPU performance, 23% more memory, and a 3% cost reduction, which improved Pod density per Node.&lt;/li&gt;
&lt;li&gt;There was no autoscaling for either Pods or Nodes, so capacity was always provisioned for peak traffic. By introducing KEDA for Pod autoscaling and GKE Cluster Autoscaler for Node autoscaling, capacity now scales with actual traffic, so we only pay for what we use.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As you can see, this cost reduction was achieved not only by changing Kubernetes infrastructure settings, but also by tuning the Rails application running on top of them. It required a comprehensive understanding of the whole stack, from application to infrastructure, and a thorough review of every layer. For example, introducing Kubernetes autoscaling on its own would merely have scaled inefficient Pods and Nodes up and down, yielding limited savings.&lt;br&gt;
Furthermore, each individual step might seem like an insignificant improvement on its own. Diligently accumulating these small improvements, however, is what ultimately led to a 60% cost reduction.&lt;/p&gt;



&lt;p&gt;I work on a B2B SaaS platform that runs on Google Kubernetes Engine. The API server is a Rails application, and it was the biggest cost driver on our GKE cluster.&lt;/p&gt;

&lt;p&gt;When I joined this project, I had virtually no experience with Ruby on Rails or Kubernetes. That lack of experience turned out to be a hidden strength: it let me look at every implementation and configuration with a fresh perspective and keep asking "why is it implemented/configured this way?" until I was satisfied.&lt;/p&gt;
&lt;h2&gt;The vicious cycle: why the cluster needed so many Pods&lt;/h2&gt;

&lt;p&gt;The high cost was not caused by a single problem. It was a cycle:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Inefficiency:&lt;/strong&gt; Running Puma as a single process meant CPUs were not being fully used. On top of that, running bcrypt on every API request added unnecessary CPU load. The Rails API was simply slow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Band-aid scaling:&lt;/strong&gt; Instead of tuning Rails, the response to traffic was to add more GKE Nodes and Pods.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Low cost-performance Nodes:&lt;/strong&gt; GKE Nodes were running on older generation instances, so the cost per unit of CPU and memory was poor.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Overprovisioning:&lt;/strong&gt; Without autoscaling, the Pod and Node count was fixed to match peak traffic and stayed that way around the clock, which wasted a lot of resources.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Several issues were driving costs up, and we had to tackle them one by one.&lt;/p&gt;
&lt;h2&gt;How the cost actually came down&lt;/h2&gt;

&lt;p&gt;Here's what happened to costs. The chart below shows monthly GKE spending by SKU.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ii39oeaq56gb6qipe1c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ii39oeaq56gb6qipe1c.png" alt="Monthly cost" width="666" height="272"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The reduction was not mainly due to lower traffic: API traffic stayed fairly stable during this period, and the savings came mostly from efficiency improvements.&lt;/p&gt;

&lt;p&gt;The cost came down in two steps.&lt;/p&gt;

&lt;p&gt;In the first step (early 2025 through June 2025), we simply reduced Pod and node counts that were clearly excessive. This brought costs down from the peak, but it was just trimming fat. The underlying inefficiencies remained.&lt;/p&gt;

&lt;p&gt;The second step (from July 2025 onward) is where most of the real savings came from. It breaks down into two parts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Part 1: Making each Pod and node more efficient&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Part 2: Scaling with demand&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;Part 1: Making each Pod and node more efficient&lt;/h2&gt;
&lt;h3&gt;1.1 GKE Node generation upgrade: n1-highmem-2 → n2d-highmem-2&lt;/h3&gt;

&lt;p&gt;Our GKE nodes had been running on n1-highmem-2 since the early days of the service. The n1 family is Google Cloud's first generation of general-purpose instances, based on older Intel Skylake/Broadwell processors.&lt;/p&gt;

&lt;p&gt;We migrated to n2d-highmem-2, which uses AMD EPYC processors. That alone made the upgrade worth doing. According to &lt;a href="https://docs.cloud.google.com/compute/docs/coremark-scores-of-vm-instances?hl=ja" rel="noopener noreferrer"&gt;Google's official CoreMark benchmarks&lt;/a&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Instance type&lt;/th&gt;
&lt;th&gt;CoreMark score&lt;/th&gt;
&lt;th&gt;RAM&lt;/th&gt;
&lt;th&gt;Monthly cost (asia-northeast1)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;n1-highmem-2&lt;/td&gt;
&lt;td&gt;26,293&lt;/td&gt;
&lt;td&gt;13 GiB&lt;/td&gt;
&lt;td&gt;~$124&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;n2d-highmem-2&lt;/td&gt;
&lt;td&gt;41,073&lt;/td&gt;
&lt;td&gt;16 GiB&lt;/td&gt;
&lt;td&gt;~$120&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Just upgrading the Node type gave 56% better CPU performance, 23% more RAM, and a 3% lower cost. The migration held no surprises: update the Node Pool configuration, cordon and drain the old Nodes, and switch to the new Node Pool. This was done in July 2025 and shows up as a SKU change in the cost graph above.&lt;/p&gt;
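&lt;p&gt;Those headline percentages follow directly from the table; a quick check with the same numbers:&lt;/p&gt;

```ruby
# Derive the headline percentages from the CoreMark/RAM/cost table above.
cpu_gain  = 41_073.0 / 26_293 - 1   # CoreMark: n2d-highmem-2 vs n1-highmem-2
ram_gain  = 16.0 / 13 - 1           # RAM: 16 GiB vs 13 GiB
cost_diff = 120.0 / 124 - 1         # monthly cost: ~$120 vs ~$124

puts format("CPU: %+.0f%%, RAM: %+.0f%%, cost: %+.0f%%",
            cpu_gain * 100, ram_gain * 100, cost_diff * 100)
```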

&lt;p&gt;For binary compatibility with existing Docker images, we stayed on x86-based machines rather than moving to Arm. Switching to Arm for further cost reduction is something we are considering for the future.&lt;/p&gt;
&lt;h3&gt;1.2 Rails process model: from 33 threads to 4 workers&lt;/h3&gt;

&lt;p&gt;This was where most of the savings came from.&lt;/p&gt;

&lt;p&gt;Our Rails application was running Puma with the following configuration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;WEB_CONCURRENCY&lt;/code&gt; was not set (defaults to 1, meaning a single worker process)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;RAILS_MAX_THREADS=33&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is one process with 33 threads. At first, it looked like this should handle up to 33 concurrent requests, but that was not how it worked in practice. The reason is how Ruby works.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ruby's Global VM Lock (GVL)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://rubykaigi.org/2023/presentations/KnuX.html" rel="noopener noreferrer"&gt;Ruby has a Global VM Lock (GVL)&lt;/a&gt;. Within a single process, only one thread can execute Ruby code at a time. (Threads waiting on I/O such as DB queries, HTTP requests, or file reads do release the GVL, allowing other threads to run.)&lt;/p&gt;

&lt;p&gt;The API server was CPU-bound, so even with 33 threads in one process, effective parallelism was essentially 1. Having 33 threads did not mean 33 requests being processed simultaneously.&lt;/p&gt;

&lt;p&gt;The Nodes running the Pods had multi-core CPUs, but the GVL meant each Pod was effectively using only one core.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The bcrypt problem&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;On top of the GVL issue, every API request was running bcrypt for token authentication. bcrypt is a password hashing algorithm deliberately designed to be CPU-intensive in order to resist brute-force attacks. Running that expensive hash on every API request was burning CPU across all Pods, so we replaced it with a lighter scheme for per-request token validation.&lt;/p&gt;
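&lt;p&gt;The article doesn't name the lighter scheme that was adopted. As one hedged illustration (all names below are hypothetical): unlike passwords, high-entropy random API tokens don't need a deliberately slow hash, so storing a plain SHA-256 digest of the token and comparing digests in constant time per request is far cheaper than bcrypt:&lt;/p&gt;

```ruby
require "digest"

# Hypothetical lighter scheme: store a SHA-256 digest of the API token
# when it is issued, and compare digests on each request.
def token_digest(token)
  Digest::SHA256.hexdigest(token)
end

# Constant-time comparison to avoid leaking match position via timing.
def digests_match?(a, b)
  return false unless a.bytesize == b.bytesize
  a.bytes.zip(b.bytes).reduce(0) { |acc, (x, y)| acc | (x ^ y) }.zero?
end

def valid_token?(presented_token, stored_digest)
  digests_match?(token_digest(presented_token), stored_digest)
end

stored = token_digest("example-api-token") # persisted when the token is issued
puts valid_token?("example-api-token", stored)  # true
puts valid_token?("wrong-token", stored)        # false
```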

&lt;p&gt;&lt;strong&gt;The fix&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We changed the Puma configuration to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;WEB_CONCURRENCY=4&lt;/code&gt; (4 worker processes)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;RAILS_MAX_THREADS=8&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With this setup, Puma spawns 4 worker processes, each with its own GVL, so multi-core CPUs can be used properly. Thread count was reduced from 33 to 8. For reference, since &lt;a href="https://github.com/rails/rails/pull/50669" rel="noopener noreferrer"&gt;Rails 7.2, Puma's default thread count per worker was reduced from 5 to 3&lt;/a&gt;, so further reductions may make sense. For more on tuning workers and threads, see &lt;a href="https://github.com/puma/puma/blob/main/docs/deployment.md" rel="noopener noreferrer"&gt;Deployment engineering for Puma&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;A fair question is whether going from 1 to 4 workers would also mean using 4x more memory. In practice it did not. Copy-on-Write (CoW) lets worker processes share program memory.&lt;br&gt;
(Note that &lt;code&gt;preload_app!&lt;/code&gt; needs to be set in the Puma configuration for CoW to work.)&lt;/p&gt;
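&lt;p&gt;A minimal sketch of the resulting Puma configuration (&lt;code&gt;config/puma.rb&lt;/code&gt;), assuming the environment variables described above:&lt;/p&gt;

```ruby
# config/puma.rb -- sketch; defaults mirror the values discussed above
workers Integer(ENV.fetch("WEB_CONCURRENCY", 4))

max_threads = Integer(ENV.fetch("RAILS_MAX_THREADS", 8))
threads max_threads, max_threads

# Load the app before forking so Copy-on-Write can share memory
# between the worker processes.
preload_app!
```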

&lt;p&gt;Setting &lt;code&gt;MALLOC_ARENA_MAX=2&lt;/code&gt; in the Rails container environment also reduced per-Pod memory usage by about 20%. The details are in a separate article: &lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/katz/why-rails-app-memory-bloat-happens-causes-and-solutions-2025-edition-3g61" class="crayons-story__hidden-navigation-link"&gt;Why Rails App Memory Bloat Happens: Causes and Solutions (2025 Edition)&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/katz" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F192880%2Fa0f93e5d-730c-476d-835f-e071842ccc96.jpg" alt="katz profile" class="crayons-avatar__image" width="500" height="500"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/katz" class="crayons-story__secondary fw-medium m:hidden"&gt;
              Katz Sakai
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                Katz Sakai
                
              
              &lt;div id="story-author-preview-content-3406459" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/katz" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F192880%2Fa0f93e5d-730c-476d-835f-e071842ccc96.jpg" class="crayons-avatar__image" alt="" width="500" height="500"&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;Katz Sakai&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/katz/why-rails-app-memory-bloat-happens-causes-and-solutions-2025-edition-3g61" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;Mar 26&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/katz/why-rails-app-memory-bloat-happens-causes-and-solutions-2025-edition-3g61" id="article-link-3406459"&gt;
          Why Rails App Memory Bloat Happens: Causes and Solutions (2025 Edition)
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/ruby"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;ruby&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/rails"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;rails&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/performance"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;performance&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/linux"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;linux&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
          &lt;a href="https://dev.to/katz/why-rails-app-memory-bloat-happens-causes-and-solutions-2025-edition-3g61" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left"&gt;
            &lt;div class="multiple_reactions_aggregate"&gt;
              &lt;span class="multiple_reactions_icons_container"&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/multi-unicorn-b44d6f8c23cdd00964192bedc38af3e82463978aa611b4365bd33a0f1f4f3e97.svg" width="24" height="24"&gt;
                  &lt;/span&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/raised-hands-74b2099fd66a39f2d7eed9305ee0f4553df0eb7b4f11b01b6b1b499973048fe5.svg" width="24" height="24"&gt;
                  &lt;/span&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/sparkle-heart-5f9bee3767e18deb1bb725290cb151c25234768a0e9a2bd39370c382d02920cf.svg" width="24" height="24"&gt;
                  &lt;/span&gt;
              &lt;/span&gt;
              &lt;span class="aggregate_reactions_counter"&gt;25&lt;span class="hidden s:inline"&gt; reactions&lt;/span&gt;&lt;/span&gt;
            &lt;/div&gt;
          &lt;/a&gt;
            &lt;a href="https://dev.to/katz/why-rails-app-memory-bloat-happens-causes-and-solutions-2025-edition-3g61#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              Comments


              &lt;span class="hidden s:inline"&gt;Add Comment&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            5 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;


&lt;p&gt;By combining CoW, &lt;code&gt;MALLOC_ARENA_MAX=2&lt;/code&gt;, and the reduced thread count, per-Pod memory usage decreased from 4.2 GiB to 3.5 GiB, even though the number of worker processes increased from 1 to 4.&lt;/p&gt;

&lt;p&gt;That said, the memory improvement was a side benefit. What mattered for cost was needing fewer Pods. With only one Puma worker per Pod and a CPU-bound API, each Pod could handle a limited number of concurrent requests. Moving to 4 workers per Pod meant the same traffic could be served with fewer Pods. Fewer Pods meant fewer Nodes, and that drove the cost down.&lt;/p&gt;




&lt;h2&gt;Part 2: Scaling with demand&lt;/h2&gt;

&lt;p&gt;Now that each Pod was running efficiently, the next step was to reduce costs by stopping unnecessary Pods and Nodes during off-peak hours. However, before enabling autoscaling, it was necessary to configure the system so that Pods could be safely started and stopped at any time.&lt;/p&gt;

&lt;h3&gt;2.0 Prerequisites: making Pods safe to autoscale&lt;/h3&gt;

&lt;p&gt;Autoscaling Pods means that Kubernetes creates and destroys Pods at any time. If your Pods are not set up for this, you trade cost savings for reliability problems.&lt;/p&gt;

&lt;p&gt;We configured the following before enabling any autoscaling:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Startup probe:&lt;/strong&gt; Our Rails applications can take time to start up (loading the framework, initializing gems, establishing DB connections, warming caches). Without a startup probe, the liveness probe runs during this initialization, and Kubernetes may decide the Pod is unhealthy and kill it before it finishes booting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Readiness probe:&lt;/strong&gt; This tells Kubernetes whether a Pod is ready to accept traffic. When a Pod temporarily cannot handle requests (during heavy processing or after a brief DB connection loss), the readiness probe fails and Kubernetes removes that Pod from the Service endpoints. Once the probe recovers, the Pod is added back and starts receiving traffic again.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Liveness probe:&lt;/strong&gt; This detects Pods that are running but stuck (a hung Rails process, for example). Kubernetes automatically restarts them. This is important for long-running Pods.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;terminationGracePeriodSeconds + &lt;code&gt;preStop&lt;/code&gt; hook:&lt;/strong&gt; When a Pod is scaled in (deleted), Kubernetes sends it &lt;code&gt;SIGTERM&lt;/code&gt; and then waits up to &lt;code&gt;terminationGracePeriodSeconds&lt;/code&gt; before forcibly killing it, giving in-flight requests time to finish gracefully.
However, this setting alone is not sufficient. Kubernetes sends &lt;code&gt;SIGTERM&lt;/code&gt; and detaches the Pod from the Service endpoints in parallel, so &lt;code&gt;SIGTERM&lt;/code&gt; may arrive before the detachment is complete (and once it arrives, the Pod begins graceful shutdown and stops accepting new requests). A common workaround is a &lt;code&gt;preStop&lt;/code&gt; hook that sleeps for about 10 seconds, delaying &lt;code&gt;SIGTERM&lt;/code&gt; so that removal from the Service endpoints happens first.
For more on this timing issue, see &lt;a href="https://engineering.rakuten.today/post/graceful-k8s-delpoyments/" rel="noopener noreferrer"&gt;Zero-Downtime Rolling Deployments in Kubernetes&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are the prerequisites we put in place before enabling autoscaling. Without them, autoscaling may look fine in theory but cause problems in production. For details on configuring each probe type, see the &lt;a href="https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/" rel="noopener noreferrer"&gt;Kubernetes documentation on liveness, readiness, and startup probes&lt;/a&gt;.&lt;/p&gt;
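&lt;p&gt;Put together, a hedged sketch of the relevant Pod template fragment (paths, ports, and timings are illustrative, not our actual values):&lt;/p&gt;

```yaml
# Illustrative Deployment pod-template fragment for a Rails container
spec:
  terminationGracePeriodSeconds: 60   # time allowed for graceful shutdown
  containers:
    - name: rails
      startupProbe:                   # tolerate slow boot (gems, DB, caches)
        httpGet: { path: /healthz, port: 3000 }
        periodSeconds: 5
        failureThreshold: 30          # up to 150 s to finish starting
      readinessProbe:                 # gate traffic on readiness
        httpGet: { path: /readyz, port: 3000 }
        periodSeconds: 10
      livenessProbe:                  # restart hung processes
        httpGet: { path: /healthz, port: 3000 }
        periodSeconds: 15
      lifecycle:
        preStop:
          exec:
            command: ["sh", "-c", "sleep 10"]  # let endpoint removal finish first
```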

&lt;h3&gt;2.1 KEDA with Cron trigger&lt;/h3&gt;

&lt;p&gt;As this is a B2B service, traffic is concentrated during weekday business hours (approximately 8:30 AM to 7:30 PM), and quiet at night and on weekends. This traffic pattern remained largely unchanged.&lt;/p&gt;

&lt;p&gt;However, the number of Pod replicas (the number of running Pods) was fixed to match the maximum load during peak hours and ran 24/7, 365 days a year. This was a significant waste of resources.&lt;/p&gt;

&lt;p&gt;Given this predictability, we chose &lt;a href="https://keda.sh/" rel="noopener noreferrer"&gt;KEDA&lt;/a&gt;'s Cron trigger. KEDA is an event-driven autoscaler for Kubernetes, and the Cron trigger is one of its simplest scaling options: it adjusts the replica count on a time-based schedule, rather than reacting to metrics like CPU or memory usage. If your traffic pattern is predictable, this is simpler and more reliable than reactive scaling. There is no lag waiting for metrics to cross a threshold, no risk of flapping, and the configuration is easy to understand.&lt;/p&gt;

&lt;p&gt;Our configuration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Weekdays 08:00-20:00 JST: about 3–4x more replicas than the baseline off-hours level&lt;/li&gt;
&lt;li&gt;All other times: baseline replica count&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No metrics, no thresholds, no reactive logic. For our traffic pattern, that simplicity was a strength.&lt;/p&gt;
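&lt;p&gt;A hedged sketch of what such a KEDA &lt;code&gt;ScaledObject&lt;/code&gt; looks like (the Deployment name and replica counts are hypothetical; the Cron trigger fields are KEDA's documented ones):&lt;/p&gt;

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: api-server              # hypothetical Deployment name
spec:
  scaleTargetRef:
    name: api-server
  minReplicaCount: 2            # baseline replica count off-hours
  triggers:
    - type: cron
      metadata:
        timezone: Asia/Tokyo
        start: "0 8 * * 1-5"    # weekdays 08:00 JST
        end: "0 20 * * 1-5"     # weekdays 20:00 JST
        desiredReplicas: "8"    # ~4x the baseline during business hours
```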

&lt;p&gt;Note that Cron-based scaling assumes a stable traffic pattern. If total request volume grows with business growth or the pattern itself changes, the current replica counts may not be enough. To catch that early, API response time and similar metrics are monitored externally on an ongoing basis. Degradation there is the signal to revisit the Cron scaling configuration.&lt;/p&gt;

&lt;p&gt;Cron scaling is not something you set once and never touch again. It just tends to need fewer updates over time.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.2 GKE Node Autoscaling
&lt;/h3&gt;

&lt;p&gt;With KEDA now adjusting Pod replica counts, the next step was to enable GKE's Cluster Autoscaler.&lt;/p&gt;

&lt;p&gt;The logic is simple: when KEDA scales Pods in, some nodes end up underutilized. The Cluster Autoscaler cordons those nodes and removes them. When KEDA scales Pods out, the autoscaler provisions new nodes to accommodate them. We configured a minimum node count for availability zone redundancy and a maximum to cap costs.&lt;/p&gt;
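
&lt;p&gt;For reference, enabling this on an existing node pool is a single &lt;code&gt;gcloud&lt;/code&gt; command; the cluster name, pool name, location, and node bounds below are placeholders:&lt;/p&gt;

```shell
# Placeholder names and bounds -- substitute your own cluster and pool.
gcloud container clusters update my-cluster \
  --location asia-northeast1 \
  --node-pool default-pool \
  --enable-autoscaling \
  --min-nodes 1 \
  --max-nodes 5
```

&lt;p&gt;Note that for regional clusters the bounds apply per zone, so the effective minimum is the per-zone value times the number of zones.&lt;/p&gt;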

&lt;p&gt;With KEDA controlling Pod count and Cluster Autoscaler controlling Node count, the cluster now uses only the capacity it actually needs.&lt;/p&gt;




&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;Comparing the H2 2024 average (before optimization) to the Q1 2026 average (after all changes were in place):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Node type&lt;/td&gt;
&lt;td&gt;n1-highmem-2&lt;/td&gt;
&lt;td&gt;n2d-highmem-2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CoreMark score&lt;/td&gt;
&lt;td&gt;26,293&lt;/td&gt;
&lt;td&gt;41,073 (+56%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Node RAM&lt;/td&gt;
&lt;td&gt;13 GiB&lt;/td&gt;
&lt;td&gt;16 GiB (+23%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Node cost (asia-northeast1)&lt;/td&gt;
&lt;td&gt;~$124/month&lt;/td&gt;
&lt;td&gt;~$120/month (-3%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Puma workers (WEB_CONCURRENCY)&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Threads per worker (RAILS_MAX_THREADS)&lt;/td&gt;
&lt;td&gt;33&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API auth per request&lt;/td&gt;
&lt;td&gt;bcrypt&lt;/td&gt;
&lt;td&gt;Lighter method&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MALLOC_ARENA_MAX&lt;/td&gt;
&lt;td&gt;Not set&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pod memory request&lt;/td&gt;
&lt;td&gt;4.2Gi&lt;/td&gt;
&lt;td&gt;3.5Gi (-17%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pod scaling&lt;/td&gt;
&lt;td&gt;Static (always at peak)&lt;/td&gt;
&lt;td&gt;KEDA Cron (higher during weekday daytime, minimum otherwise)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Node scaling&lt;/td&gt;
&lt;td&gt;Fixed count&lt;/td&gt;
&lt;td&gt;Cluster Autoscaler&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GKE monthly cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Baseline&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-60%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Costs have been stable since the latter half of 2025, once the initial optimization was complete (see the graph at the top of the page).&lt;/p&gt;




&lt;h2&gt;
  
  
  Lessons learned
&lt;/h2&gt;

&lt;p&gt;None of this was clever. Puma config, auth scheme, node type, a cron schedule. Each change looked minor in isolation. Together they cut costs by 60%.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ruby's GVL means adding threads to CPU-bound work does nothing.&lt;/strong&gt; More processes, not more threads. Thread count only matters once you know how I/O-heavy the workload actually is.&lt;/p&gt;
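
&lt;p&gt;As a sketch, the post-change settings from the Results table look like this in a standard &lt;code&gt;config/puma.rb&lt;/code&gt; using Puma's configuration DSL (the file itself is illustrative, not our exact config):&lt;/p&gt;

```ruby
# config/puma.rb (sketch): 4 processes for CPU parallelism despite the GVL,
# 8 threads per process to cover time the workload spends waiting on I/O.
workers Integer(ENV.fetch("WEB_CONCURRENCY", 4))

max_threads = Integer(ENV.fetch("RAILS_MAX_THREADS", 8))
threads max_threads, max_threads

preload_app!  # load the app once, share memory across workers via copy-on-write
```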

&lt;p&gt;&lt;strong&gt;bcrypt is for password hashing, not API tokens.&lt;/strong&gt; Running bcrypt on every API request was simply a mistake. It's slow by design; that slowness is the point for password hashing. Using it for per-request token verification was the wrong tool. We replaced it, and the CPU load dropped immediately.&lt;/p&gt;
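
&lt;p&gt;A minimal sketch of this kind of replacement, assuming high-entropy random API tokens (the method names here are hypothetical, not our actual code): store a SHA-256 digest of the token and verify with a constant-time comparison, which costs microseconds instead of bcrypt's deliberate tens of milliseconds:&lt;/p&gt;

```ruby
require "openssl"
require "securerandom"

# Hypothetical helpers -- not the article's actual implementation.
def digest_token(token)
  OpenSSL::Digest::SHA256.hexdigest(token)
end

def valid_token?(stored_digest, presented_token)
  # Constant-time comparison guards against timing side channels.
  OpenSSL.secure_compare(stored_digest, digest_token(presented_token))
end

token  = SecureRandom.hex(32)   # issued once, returned to the client
stored = digest_token(token)    # only the digest is persisted server-side

valid_token?(stored, token)     # => true
valid_token?(stored, "wrong")   # => false
```

&lt;p&gt;This trades bcrypt's brute-force resistance (unnecessary for long random tokens, unlike human-chosen passwords) for per-request CPU cost close to zero.&lt;/p&gt;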

&lt;p&gt;&lt;strong&gt;Simple autoscaling is often enough.&lt;/strong&gt; KEDA has a lot of trigger types, metrics, queues, custom event sources. We used none of that. A Cron schedule matched our traffic pattern well enough, and it's been lower maintenance than anything reactive would have been.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Node generation matters more than it looks.&lt;/strong&gt; Just upgrading from n1 to n2d gave 56% better CPU, 23% more memory, and 3% lower cost. If you are still on n1 instances, moving to n2d or similar is worth doing soon.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Optimize the whole system, not just one layer.&lt;/strong&gt; The key to this improvement was reviewing the infrastructure and the application together rather than in isolation. Revising the Puma worker configuration and the authentication method were application-side improvements; updating the node generation and adding autoscaling were infrastructure-side ones. Optimizing the whole while understanding how the layers of the stack connect, rather than tuning one part, is what produced the significant results.&lt;/p&gt;

</description>
      <category>rails</category>
      <category>kubernetes</category>
      <category>devops</category>
      <category>performance</category>
    </item>
    <item>
      <title>Non-native English is my hidden strength as a tech blogger</title>
      <dc:creator>Katz Sakai</dc:creator>
      <pubDate>Mon, 30 Mar 2026 12:16:20 +0000</pubDate>
      <link>https://dev.to/katz/non-native-english-is-my-hidden-strength-as-a-tech-blogger-53ga</link>
      <guid>https://dev.to/katz/non-native-english-is-my-hidden-strength-as-a-tech-blogger-53ga</guid>
      <description>&lt;p&gt;Every article about "writing in English as a non-native speaker" seems to give the same advice. Use Grammarly. Read more books. Don't use a translator. Practice.&lt;/p&gt;

&lt;p&gt;That's fine, but it treats the problem as "you're bad at English, fix it." And that misses something.&lt;/p&gt;

&lt;p&gt;Being a non-native speaker is not just an obstacle to work around. Parts of it are actually useful, if you know how to use them.&lt;/p&gt;

&lt;p&gt;My native tongue is Japanese and I write tech posts in English. At some point, I realized my English writing had strengths I didn't expect. My non-native English was actually helping me, not holding me back.&lt;/p&gt;

&lt;p&gt;Here's what I mean.&lt;/p&gt;

&lt;h2&gt;
  
  
  A smaller vocabulary can be a good thing
&lt;/h2&gt;

&lt;p&gt;Writing simply is hard when you know too many words. Native English speakers often use longer words and more complex sentences, not because they're bad writers, but because their brain has more options and picks the familiar ones.&lt;/p&gt;

&lt;p&gt;I don't have that problem. My vocabulary is smaller, so I end up picking clear words. My sentences are shorter because I can't handle seven clauses at once. I avoid idioms because I'm not sure I'll use them right.&lt;/p&gt;

&lt;p&gt;This is not a weakness. This is what every writing guide tells people to do, and it's surprisingly hard when you have a big vocabulary to choose from.&lt;/p&gt;

&lt;p&gt;The audience for English tech blogs is global. A large portion of readers are non-native speakers. Your "limited" English is often easier for them to read than a native speaker's writing.&lt;/p&gt;

&lt;p&gt;Don't try to write like a native speaker. Try to write clearly. They're different things, and clarity wins.&lt;/p&gt;

&lt;h2&gt;
  
  
  Your unique content is the stuff that doesn't exist in English yet
&lt;/h2&gt;

&lt;p&gt;Here's the real advantage nobody talks about.&lt;/p&gt;

&lt;p&gt;If you work in a non-English-speaking tech community, you have access to knowledge, war stories, and ideas that the English-speaking internet has never seen. Conference talks, blog posts, and debugging stories that only exist in your language.&lt;/p&gt;

&lt;p&gt;I translate Japanese technical blog posts into English. Posts about code signing pipelines built around tools and services that English-language blogs don't cover. Every single one of these gets more attention than another "how to do X" tutorial, because the content doesn't already exist in English 50 times over.&lt;/p&gt;

&lt;p&gt;You don't have to translate word by word. Take a problem you solved at work that was discussed in your language, and write about it in English. Add your own context. The idea is the value, not the words.&lt;/p&gt;

&lt;p&gt;Nobody else can write that post. That's your unfair advantage.&lt;/p&gt;

&lt;h2&gt;
  
  
  The "translation test" catches bad writing
&lt;/h2&gt;

&lt;p&gt;Here's a trick I use that I've never seen anyone else mention.&lt;/p&gt;

&lt;p&gt;After I write a draft in English, I mentally translate it back to my native language. If a sentence sounds weird or unclear when translated, it's usually because the English version is also unclear. I just couldn't tell because I was too close to it.&lt;/p&gt;

&lt;p&gt;This works in reverse too. If I can't figure out how to say something in English, I write it in Japanese first. Then I don't translate the words. I translate the idea. The English version almost always comes out simpler and better than if I'd tried to write it directly.&lt;/p&gt;

&lt;p&gt;For me, having two languages means I can use each one to check the other. It's like having an extra pair of eyes.&lt;/p&gt;

&lt;h2&gt;
  
  
  The way you write is part of your voice
&lt;/h2&gt;

&lt;p&gt;Your writing is probably a bit different from a native speaker's. That's not a problem.&lt;/p&gt;

&lt;p&gt;There's a natural urge to smooth out every rough part until your writing looks the same as a native speaker's. But in tech blogs, being yourself matters more than being perfect. A post that reads like it was written by a real person in Tokyo is more interesting than one that could have been written by anyone, anywhere.&lt;/p&gt;

&lt;p&gt;Small things show that you're a real person:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mentioning where you're based and what tech scene you're part of&lt;/li&gt;
&lt;li&gt;Mentioning tools, services, or ways of working that are common in your country but less known outside of it&lt;/li&gt;
&lt;li&gt;Adding local context sometimes ("In Japan, most companies still use X for this")&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These aren't flaws. They're detail. They make your post memorable.&lt;/p&gt;

&lt;h2&gt;
  
  
  What actually held me back (not language)
&lt;/h2&gt;

&lt;p&gt;These aren't non-native problems. They're writing problems. But I used to blame my English for them, so it's worth mentioning:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not being specific enough.&lt;/strong&gt; "I improved performance" means nothing. "I reduced memory usage from 3GB to 2.4GB by setting &lt;code&gt;MALLOC_ARENA_MAX=2&lt;/code&gt;" is a post people will bookmark.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Burying the point.&lt;/strong&gt; In Japanese writing, it's common to build up context before reaching the conclusion. English readers want the point up front. I had to teach myself to put the most interesting thing first. "Our Rails app was eating 3GB of RAM. One environment variable cut it by 20%" as the first line, not the conclusion.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Overthinking the writing, underthinking the title.&lt;/strong&gt; I'd spend days on the article body and 30 seconds on the title. This is backwards. The title is what decides if anyone reads the rest. A specific, result-based title ("How we cut our CI build time from 40 minutes to 8") does better than a general one ("Optimizing CI/CD Pipelines") every time. This is true no matter what your native language is.&lt;/p&gt;

&lt;h2&gt;
  
  
  My actual process
&lt;/h2&gt;

&lt;p&gt;For what it's worth, here's the way I do it now:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Outline in my native language.&lt;/strong&gt; The structure is the hard part. Deciding what to say and the order is easier in my own language.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Write the draft in English.&lt;/strong&gt; Not translating from the outline, but writing new sentences in English. The outline is just a guide. This produces more natural English than sentence-by-sentence translation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Run the translation test.&lt;/strong&gt; Mentally translate the whole draft to my language. If something sounds wrong, the English is probably unclear. If a sentence feels hard to follow, rewrite it shorter.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Write the title last.&lt;/strong&gt; After I know what the article actually says. It's always a bit different from what I planned. I write a title that shows the most interesting part.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Let it sit for a few hours.&lt;/strong&gt; Come back, read the first two paragraphs with fresh eyes. If they don't hook me, rewrite them. The opening matters more than everything else combined.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The point
&lt;/h2&gt;

&lt;p&gt;Most advice for non-native writers is about closing the language gap. And sure, getting better at English helps. But the gap gives you things too. You write simpler. You have content that doesn't exist in English yet. You can use your other language to check your writing. And the way you see things is different from everyone else's.&lt;/p&gt;

&lt;p&gt;Don't just work on your English. Work on what makes you different.&lt;/p&gt;

</description>
      <category>writing</category>
      <category>career</category>
      <category>devjournal</category>
      <category>learning</category>
    </item>
    <item>
      <title>Behind the Streams: Live at Netflix — 2025–10–22 Tokyo Video Tech #10 Session 2 Report</title>
      <dc:creator>Katz Sakai</dc:creator>
      <pubDate>Fri, 27 Mar 2026 05:38:37 +0000</pubDate>
      <link>https://dev.to/tokyo-video-tech/behind-the-streams-live-at-netflix-2025-10-22-tokyo-video-tech-10-session-2-report-5ed7</link>
      <guid>https://dev.to/tokyo-video-tech/behind-the-streams-live-at-netflix-2025-10-22-tokyo-video-tech-10-session-2-report-5ed7</guid>
      <description>&lt;p&gt;Most people know Netflix as the place you go to binge a series on a quiet evening. But since 2023, the company has been venturing into territory where there are no second takes — live streaming, at a scale that now reaches tens of millions of households simultaneously.&lt;/p&gt;

&lt;p&gt;At &lt;strong&gt;&lt;a href="https://luma.com/tokyo-video-tech?utm_source=tvt.devto" rel="noopener noreferrer"&gt;Tokyo Video Tech&lt;/a&gt; #10 “Continuous”&lt;/strong&gt;, held on October 22, 2025 at the Netflix Tokyo office, members of Netflix’s live streaming team gave their first talk in Tokyo. They walked the audience through what it actually takes to pull off a live broadcast at Netflix scale: the multi-path signal routing from venue to Broadcast Operations Center, the cloud encoding pipeline designed so that entire regions can fail without viewers noticing, and the culture of relentless rehearsal — including deliberate failure injection — that turns each one-shot event into something the team has already practiced dozens of times.&lt;/p&gt;

&lt;p&gt;Report by: &lt;a href="https://www.linkedin.com/in/ksakai/" rel="noopener noreferrer"&gt;Katz Sakai&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Launched in 2023, Netflix’s live streaming service has grown — &lt;strong&gt;through continuous trial and refinement&lt;/strong&gt; — into a platform capable of reaching tens of millions of households simultaneously.&lt;br&gt;
By skillfully building on the technical foundation and operational expertise developed through its VOD business, Netflix has created an architecture that balances the immediacy and reliability required for live broadcasting.&lt;br&gt;
Above all, the session conveyed a clear message: success in live streaming depends on constant preparation and disciplined operational practice behind the scenes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbcwf6oj72lybrmo9duvh.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbcwf6oj72lybrmo9duvh.webp" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Live Streaming and Netflix
&lt;/h2&gt;

&lt;p&gt;When people think of Netflix, on-demand streaming is usually what comes to mind.&lt;br&gt;
However, the company has also been taking on the challenge of live streaming as a new form of entertainment.&lt;br&gt;
Its first step in this direction was &lt;a href="https://about.netflix.com/en/news/chris-rock-will-make-history-as-the-first-artist-to-perform-live-on-netflix" rel="noopener noreferrer"&gt;the &lt;em&gt;Chris Rock: Selective Outrage&lt;/em&gt; comedy special, streamed live in 2023 — the first live event in Netflix’s history&lt;/a&gt;, which drew significant attention worldwide. Since then, Netflix has continued to scale its live streaming efforts.&lt;br&gt;
For example, &lt;a href="https://about.netflix.com/en/news/jake-paul-vs-mike-tyson-over-108-million-live-global-viewers" rel="noopener noreferrer"&gt;the boxing event &lt;em&gt;Jake Paul vs. Mike Tyson&lt;/em&gt; was streamed live to &lt;strong&gt;more than 60 million households around the globe&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  From the Venue to the BOC: Signal Aggregation
&lt;/h2&gt;

&lt;p&gt;One of the biggest challenges in live streaming is ensuring that the video signals from the event venue are reliably delivered to the Broadcast Operations Center (BOC).&lt;/p&gt;

&lt;p&gt;When Netflix first began its live streaming initiatives, all encoding was handled directly at the venue. However, due to various network and infrastructure constraints, achieving scalability proved difficult. Based on these lessons, Netflix transitioned to a model that &lt;strong&gt;minimizes on-site processing and consolidates control within the BOC&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Today, signals are transmitted from the venue to the BOC through multiple independent paths. Satellite links, dedicated lines, IP-based transmission (such as SRT), and bonded cellular connections — each with different technical characteristics — are combined to ensure that even if one path fails, the stream remains uninterrupted.&lt;br&gt;
Furthermore, Netflix conducts technical studies for each transmission method, compiling the resulting insights into recommended parameters and operational guidelines that are shared with vendor partners.&lt;/p&gt;

&lt;p&gt;At the BOC, the incoming signals are monitored and processed as needed.&lt;br&gt;
This may include inserting the Netflix watermark, adding slates such as “Starting Soon” before the event or “Thanks for Watching” after it, and in the event of a total signal loss, displaying an emergency slate to maintain a consistent viewing experience.&lt;/p&gt;

&lt;p&gt;For large-scale international events such as sports or entertainment broadcasts, the BOC also handles multilingual dubbing and audio mixing — producing simultaneous audio tracks in English, Japanese, German, and other languages to deliver regionally tailored streams to global audiences.&lt;/p&gt;
&lt;h2&gt;
  
  
  Converting BOC Signals into Delivery Formats in the Cloud
&lt;/h2&gt;

&lt;p&gt;When transmitting video signals from the Broadcast Operations Center (BOC) to the cloud, Netflix places particular emphasis on &lt;strong&gt;redundancy&lt;/strong&gt; and &lt;strong&gt;synchronization design&lt;/strong&gt; to ensure uninterrupted streaming.&lt;br&gt;
In live broadcasting, even a slight interruption can significantly impact the viewer experience, making it essential to balance both &lt;strong&gt;availability&lt;/strong&gt; and consistency across the system.&lt;/p&gt;

&lt;p&gt;For transmission from the BOC to the cloud, multiple independent paths — such as dedicated lines, IP-based links, and bonded cellular connections — are used in combination. This design ensures that if one route encounters a failure, others can immediately take over.&lt;br&gt;
Within the cloud, the same incoming signal is processed &lt;strong&gt;simultaneously across multiple regions&lt;/strong&gt;, preventing any single point of failure and ensuring high resilience.&lt;/p&gt;

&lt;p&gt;Once the signals reach the cloud, they are first processed by encoders and then passed to live packagers, which convert them into delivery-ready formats.&lt;br&gt;
These packagers leverage Netflix’s extensive knowledge accumulated through years of VOD delivery — specifically, its database of device-level capabilities and limitations. In practice, this means the system already “knows” which codecs, resolutions, and streaming formats are supported on each device, and applies that intelligence directly to live packaging.&lt;br&gt;
As a result, Netflix minimizes compatibility risks and ensures that live streams maintain consistent quality and playback success rates across the vast range of viewing devices.&lt;/p&gt;

&lt;p&gt;The packagers are also designed to handle the precise &lt;strong&gt;timing and segment synchronization&lt;/strong&gt; requirements unique to live streaming.&lt;br&gt;
All cloud encoders reference a shared epoch (a common timing source), ensuring that segments generated in different regions remain interchangeable. This enables instant failover between regions — if one region experiences an outage, another can seamlessly take over without disrupting the live stream.&lt;/p&gt;
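
&lt;p&gt;The shared-epoch scheme described above can be sketched in a few lines (the segment duration and epoch value are illustrative): each encoder derives the segment index purely from wall-clock time and the common epoch, so independently running regions number their segments identically.&lt;/p&gt;

```ruby
SEGMENT_SECONDS = 4.0                        # illustrative segment duration
SHARED_EPOCH    = Time.utc(2025, 1, 1).to_f  # the common timing source

# Any encoder, in any region, computes the same index for the same instant.
def segment_index(wallclock_seconds)
  ((wallclock_seconds - SHARED_EPOCH) / SEGMENT_SECONDS).floor
end

# An encoder that starts mid-event numbers its output exactly like one that
# has run since the beginning -- the basis for seamless regional failover.
segment_index(SHARED_EPOCH + 9.0)  # => 2 (third segment, covering seconds 8.0-12.0)
```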

&lt;p&gt;In this way, Netflix’s live streaming architecture — built on the three design pillars of &lt;strong&gt;redundancy&lt;/strong&gt;, &lt;strong&gt;time synchronization&lt;/strong&gt;, and &lt;strong&gt;device compatibility&lt;/strong&gt; — maintains high reliability while delivering a &lt;strong&gt;consistent playback experience across diverse devices worldwide&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3f92fdgje8ublztub2lz.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3f92fdgje8ublztub2lz.jpg" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Principles of Delivery: Compatibility, Scale, and Quality
&lt;/h2&gt;

&lt;p&gt;The three key elements that Netflix prioritizes in live delivery are &lt;strong&gt;compatibility&lt;/strong&gt;, &lt;strong&gt;scale&lt;/strong&gt;, and &lt;strong&gt;quality&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compatibility&lt;/strong&gt;&lt;br&gt;
The devices used by Netflix viewers are incredibly diverse — from smartphones released many years ago to the latest high-end devices — representing thousands of playback environments. As with VOD, Netflix’s live streaming platform is designed with the guiding principle of being playable on as many devices as possible.&lt;br&gt;
To achieve this, the delivery protocol is unified under an HTTPS-based approach, allowing existing Netflix players to stream live content without modification.&lt;br&gt;
Video is encoded in both the widely compatible &lt;strong&gt;AVC (H.264)&lt;/strong&gt; and the more efficient &lt;strong&gt;HEVC (H.265)&lt;/strong&gt; formats, while support for the next-generation &lt;strong&gt;AV1&lt;/strong&gt; codec continues to expand. This approach ensures that older devices can still play streams smoothly, while newer ones benefit from higher picture quality — achieving a balance between inclusivity and performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scale&lt;/strong&gt;&lt;br&gt;
Live streaming can attract tens of millions of concurrent viewers accessing the service simultaneously. To handle this immense load, Netflix leverages its proprietary CDN, &lt;strong&gt;Open Connect&lt;/strong&gt;. More than 18,000 servers deployed across 6,000 locations worldwide receive live segments from the origin and deliver them to viewers from the nearest node.&lt;br&gt;
The player references manifests built on a &lt;strong&gt;segment template model&lt;/strong&gt;, minimizing server queries and allowing efficient handling of massive request volumes.&lt;br&gt;
This scalable architecture enables stable delivery even during global events such as Jake Paul vs. Mike Tyson.&lt;/p&gt;
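
&lt;p&gt;The segment-template idea can be sketched as follows (the URL pattern below is hypothetical, loosely following the DASH &lt;code&gt;SegmentTemplate&lt;/code&gt; convention; real Netflix URLs are not public): the player expands each segment URL locally from a template plus an index, rather than asking the server for every new segment name.&lt;/p&gt;

```ruby
# Hypothetical template -- the player fetches the manifest once, then
# derives every subsequent segment URL locally from the template.
TEMPLATE = "https://cdn.example.com/live/event/video_720p_$Number$.m4s"

def segment_url(index)
  TEMPLATE.sub("$Number$", index.to_s)
end

segment_url(42)  # => "https://cdn.example.com/live/event/video_720p_42.m4s"
```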

&lt;p&gt;&lt;strong&gt;Quality&lt;/strong&gt;&lt;br&gt;
Just as in VOD, the Netflix player continuously monitors network conditions to select the optimal bitrate in real time. If throughput drops during playback, the system immediately switches to an alternative CDN node or adjusts the video bitrate to prevent buffering. Through these adaptive measures, Netflix ensures that viewers around the world can enjoy a &lt;strong&gt;smooth, uninterrupted live streaming experience&lt;/strong&gt; regardless of their network conditions.&lt;/p&gt;
&lt;h2&gt;
  
  
  Success in Live Streaming Starts with Preparation
&lt;/h2&gt;

&lt;p&gt;Another crucial factor in live streaming is &lt;strong&gt;preparation and operational discipline&lt;/strong&gt;. Unlike VOD, live streaming is a true one-shot performance — there are no retakes. For that reason, the team treats preparation like daily “training,” continuously running tests and rehearsals to stay ready for any scenario.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Building robust test environments&lt;/strong&gt;&lt;br&gt;
To simulate real-world live events, Netflix developed an internal system capable of generating virtual live streams. This system allows engineers to instantly launch test broadcasts and verify the entire pipeline — from encoding and packaging to CDN behavior — under production-like conditions. It enables safe experimentation with new features and architectural changes without risking live operations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Practicing resilience through failure testing&lt;/strong&gt;&lt;br&gt;
Netflix routinely performs &lt;strong&gt;failure injection&lt;/strong&gt; tests, deliberately introducing issues such as network latency, packet loss, and simulated cloud region outages. This culture of anticipating the unexpected strengthens system robustness, ensuring that even during large-scale global events, live streams remain stable and uninterrupted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Forecasting and real-time adaptability&lt;/strong&gt;&lt;br&gt;
Every live event begins with audience size predictions powered by machine learning models, allowing teams to pre-provision sufficient cloud and CDN resources. During and immediately before broadcasts, Netflix continuously analyzes real-time viewership trends. If the system detects audience growth exceeding expectations, capacity adjustments are made on the fly to prevent congestion. By integrating thorough &lt;strong&gt;pre-event simulations with agile real-time responses&lt;/strong&gt;, Netflix achieves stable and consistent live delivery under any conditions.&lt;/p&gt;
&lt;h2&gt;
  
  
  In Conclusion
&lt;/h2&gt;

&lt;p&gt;What stood out most in this session was how Netflix approaches live streaming — a relatively new domain for the company — without relying on individual know-how. Instead, the team has established clearly documented procedures and specifications to ensure operational consistency and reproducibility. Even the signal setup from the venue to the Broadcast Operations Center (BOC) has been systematized into operational guidelines rather than being treated as black magic, enabling the same level of quality and reliability across different countries and production environments.&lt;/p&gt;

&lt;p&gt;Equally impressive was how Netflix has effectively leveraged its long-standing technical assets and operational expertise from VOD streaming. By applying its accumulated knowledge of device compatibility and CDN scaling directly to live workflows, the company delivers live events with the same &lt;em&gt;“Netflix-quality”&lt;/em&gt; scalability and reliability that define its on-demand service.&lt;/p&gt;

&lt;p&gt;Above all, the session conveyed a powerful message: &lt;strong&gt;live success is built on daily preparation&lt;/strong&gt;. Through continuous testing in virtual live environments, deliberate failure injections, and regular rehearsals, Netflix ensures that each one-time broadcast runs smoothly when it matters most.&lt;/p&gt;

&lt;p&gt;For more details on Netflix’s live streaming technology and engineering practices, you can refer to their official Netflix Tech Blog.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://netflixtechblog.com/behind-the-streams-live-at-netflix-part-1-d23f917c2f40" rel="noopener noreferrer"&gt;Behind the Streams: Three Years Of Live at Netflix. Part 1.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://netflixtechblog.com/building-a-reliable-cloud-live-streaming-pipeline-for-netflix-8627c608c967" rel="noopener noreferrer"&gt;Behind the Streams: Building a Reliable Cloud Live Streaming Pipeline for Netflix. Part 2.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://netflixtechblog.com/behind-the-streams-real-time-recommendations-for-live-events-e027cb313f8f" rel="noopener noreferrer"&gt;Behind the Streams: Real-Time Recommendations for Live Events Part 3.&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Join Us
&lt;/h2&gt;

&lt;p&gt;Tokyo Video Tech is an open community — whether you're a seasoned video engineer or just curious about the field, you're welcome to join.&lt;br&gt;
Visit our page on Luma (an event platform) and hit &lt;strong&gt;Subscribe&lt;/strong&gt; to get notified about upcoming events.&lt;br&gt;
&lt;a href="https://luma.com/tokyo-video-tech?utm_source=tvt.devto" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;Subscribe on Luma&lt;/a&gt;
&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>cloud</category>
      <category>devops</category>
      <category>techtalks</category>
    </item>
    <item>
      <title>Live Streaming at "Spring Festival in Tokyo" — 2025–10–22 Tokyo Video Tech #10 Session 1 Report</title>
      <dc:creator>Katz Sakai</dc:creator>
      <pubDate>Fri, 27 Mar 2026 05:36:03 +0000</pubDate>
      <link>https://dev.to/tokyo-video-tech/live-streaming-at-spring-festival-in-tokyo-2025-10-22-tokyo-video-tech-10-session-1-report-4gfg</link>
      <guid>https://dev.to/tokyo-video-tech/live-streaming-at-spring-festival-in-tokyo-2025-10-22-tokyo-video-tech-10-session-1-report-4gfg</guid>
      <description>&lt;p&gt;A classical music festival and an internet infrastructure company might seem like an unlikely pairing. But for the past several years, &lt;strong&gt;IIJ&lt;/strong&gt; (Internet Initiative Japan) — one of Japan's first commercial internet service providers, founded in 1992 and now a major force in cloud, networking, and enterprise IT — has been quietly pushing the boundaries of live streaming technology. Not for esports or pop concerts, but for opera, chamber music, and orchestral performances in the heart of Tokyo's Ueno Park.&lt;/p&gt;

&lt;p&gt;At &lt;strong&gt;&lt;a href="https://luma.com/tokyo-video-tech?utm_source=tvt.devto" rel="noopener noreferrer"&gt;Tokyo Video Tech&lt;/a&gt; #10 "Continuous"&lt;/strong&gt;, held on October 22, 2025 at the Netflix Tokyo office, Fumitaka Watanabe from IIJ pulled back the curtain on what it really takes to deliver these streams: the hundreds of meters of cables laid and removed for every show, the custom subtitle tools built because AI couldn't match the nuance of opera, and the creative networking solutions required when your venue is two floors underground in a World Heritage site.&lt;/p&gt;

&lt;p&gt;Report by: &lt;a href="https://www.linkedin.com/in/ksakai/" rel="noopener noreferrer"&gt;Katz Sakai&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  “Spring Festival in Tokyo” and IIJ
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://www.tokyo-harusai.com/" rel="noopener noreferrer"&gt;The Spring Festival in Tokyo&lt;/a&gt;&lt;/em&gt; is a classical music festival founded in 2005 by maestro Seiji Ozawa and others. Each year, over a period of roughly 40 days from mid-March, around 70 performances are held across 12 to 15 venues throughout Ueno Park. In 2020, the festival was canceled due to the pandemic, but in 2021, IIJ embarked on an unprecedented challenge — &lt;strong&gt;live streaming every single concert&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp7x8myfudjqtdus8iax7.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp7x8myfudjqtdus8iax7.jpg" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Journey So Far&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;2021: The First Year — “Handcrafted Streaming of Every Performance”
A conference room inside Tokyo Bunka Kaikan was converted into a temporary streaming base. Video feeds from each venue were aggregated via NDI and fiber connections. Performances were streamed in multiple formats including 4K video, high-resolution audio, multi-angle views, and subtitles.&lt;/li&gt;
&lt;li&gt;2022: Transition to Remote Production
A permanent base was established at IIJ’s Iidabashi office, where video from the venues was transmitted via SRT. Encoding and distribution were handled from the office’s stable network environment. IIJ also developed a custom web player that allowed viewers to switch camera angles and check program information in real time.&lt;/li&gt;
&lt;li&gt;2023: Launch of “IIJ Studio TOKYO”
In October 2022, &lt;a href="https://www.iij.ad.jp/iij-studio/" rel="noopener noreferrer"&gt;IIJ built a dedicated in-house streaming studio — IIJ Studio TOKYO&lt;/a&gt; — enabling integrated management of video, audio, and network operations. This dramatically improved the overall stability and quality of the live streams.&lt;/li&gt;
&lt;li&gt;2024: Returning to the Field with Agile Operations
As long-term studio occupation proved difficult, the team brought switching equipment back to Ueno venues, producing finished video signals on-site and transmitting them to IIJ Studio TOKYO via SRT, which continued to serve as the encoding hub. Operations were reoptimized for rapid setup and teardown at each performance.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Challenges Faced in On-Site Operations
&lt;/h2&gt;

&lt;p&gt;The &lt;em&gt;Spring Festival in Tokyo&lt;/em&gt; presents unique challenges due to its long duration and multi-venue nature, requiring IIJ to tackle numerous operational and technical issues.&lt;/p&gt;

&lt;p&gt;One of the most demanding tasks was cable management and setup. Connecting multiple venues scattered across Ueno Park required laying and removing hundreds of meters of cables for every performance. Preventing tripping hazards and cable damage demanded meticulous on-site expertise and coordination.&lt;/p&gt;

&lt;p&gt;Staff allocation was another major concern. At least two operators were assigned to each concert to ensure continuous coverage, but this structure also increased personnel costs and highlighted the difficulty of balancing efficiency with budget constraints.&lt;/p&gt;

&lt;p&gt;The diversity of venue environments posed additional hurdles. Some locations had unstable power supplies, causing equipment malfunctions that required off-site reinspection. Others lacked permanent network connections, making it necessary to arrange mobile or temporary lines months in advance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftg1k28x0fusumz9o9gkr.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftg1k28x0fusumz9o9gkr.jpg" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Unstable networks also proved problematic — especially during the spring &lt;em&gt;hanami&lt;/em&gt; season, when mobile networks in Ueno Park became congested despite the use of bonded multi-carrier connections.&lt;/p&gt;

&lt;p&gt;A further challenge specific to classical music was the precision required for subtitles. In opera performances, the accuracy and timing of subtitles are integral to the artistic experience, and AI-generated captions proved inadequate. IIJ developed an in-house subtitle insertion tool, allowing human translators’ scripts to be synchronized precisely with the performance in real time.&lt;/p&gt;

&lt;p&gt;In summary, delivering these streams demanded not only technical reliability but also deep sensitivity to the cultural context of the event. The success of the project lay in IIJ’s ability to provide a &lt;strong&gt;“culturally informed live streaming experience”&lt;/strong&gt; — one that preserved the integrity and atmosphere of live classical performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  New Challenges in 2025
&lt;/h2&gt;

&lt;p&gt;In 2025, IIJ took its previous initiatives a step further by introducing several new technical and operational challenges.&lt;/p&gt;

&lt;p&gt;First, the team leveraged &lt;strong&gt;Vipe&lt;/strong&gt;, a cloud-based playout system from BCNEXXT, enabling operators to switch between program segments, intermissions, and closing screens with a single click based on a preset timetable. This greatly reduced operational workload and improved on-site efficiency, though it also revealed a new challenge — rising costs associated with cloud services.&lt;/p&gt;

&lt;p&gt;Next, IIJ collaborated with &lt;strong&gt;KORG&lt;/strong&gt; and &lt;strong&gt;NHK Technologies&lt;/strong&gt; to conduct a live stream using &lt;strong&gt;Dolby Atmos&lt;/strong&gt;, delivering an immersive audio experience that traditional stereo streaming could not achieve. This marked a new milestone in IIJ’s efforts to “reproduce the live atmosphere as it is.”&lt;/p&gt;

&lt;p&gt;The team also took on the ambitious task of &lt;strong&gt;testing live streaming from the National Museum of Western Art&lt;/strong&gt;, a World Cultural Heritage site. Due to strict regulations, every piece of equipment required prior approval, along with detailed reports on power usage and personnel. The concert hall’s location — two floors underground — posed further difficulties, as neither wired nor mobile networks were available, and new line installations were prohibited. IIJ overcame this by connecting the museum to the nearby Tokyo Bunka Kaikan across the street via a &lt;strong&gt;60 GHz wireless network&lt;/strong&gt;, successfully transmitting both video and audio.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc4a99wid9mua98ud3127.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc4a99wid9mua98ud3127.jpg" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Another major constraint was time. Since the museum remained open to the public until 5 p.m., while doors opened at 6:30 p.m. and the concert began at 7 p.m., setup time was limited to just &lt;strong&gt;one and a half hours&lt;/strong&gt;, demanding exceptional efficiency and coordination.&lt;/p&gt;

&lt;p&gt;Finally, to address mobile network congestion during the cherry blossom season, IIJ experimented with &lt;strong&gt;local 5G&lt;/strong&gt; connectivity. The trial confirmed that stable 60 fps video transmission was possible, though it also revealed that operating local 5G in Japan involves complex licensing procedures, equipment preparation, and substantial costs and lead time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Network Design Philosophy
&lt;/h2&gt;

&lt;p&gt;For the venue network, IIJ implemented redundancy using multiple connections such as &lt;strong&gt;FLET’S Hikari Cross&lt;/strong&gt; and &lt;strong&gt;FLET’S Hikari Next&lt;/strong&gt;, securing two routing paths — &lt;strong&gt;IPoE&lt;/strong&gt; and &lt;strong&gt;PPPoE&lt;/strong&gt; — and managing them with IIJ’s proprietary router, &lt;strong&gt;&lt;a href="https://www.seil.jp/" rel="noopener noreferrer"&gt;SEIL&lt;/a&gt;&lt;/strong&gt;. In addition, multiple venues around Ueno Park were interconnected in a &lt;strong&gt;mesh topology&lt;/strong&gt;, enabling remote control of cameras and encoders from any location.&lt;/p&gt;

&lt;p&gt;The decision to deploy multiple lines stemmed from practical considerations: during the roughly 40-day festival period, scheduled maintenance by network providers could overlap with event days, and relying on a single line would risk operational downtime.&lt;/p&gt;

&lt;p&gt;At the same time, IIJ emphasized that redundancy is &lt;strong&gt;not simply “the more, the better.”&lt;/strong&gt; Excessive investment in multi-layered redundancy could harm overall profitability. Redundancy is a safeguard — a means of maintaining peace of mind and operational continuity. The real challenge lies in striking the right balance between &lt;strong&gt;cost efficiency and service reliability&lt;/strong&gt;.&lt;/p&gt;
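&lt;p&gt;The balance described above can be made concrete with a toy failover rule (the talk did not show SEIL's actual configuration, so this is a generic sketch of the dual-path idea): switch away from the primary line only after several consecutive failed health checks, so that a single lost probe on an otherwise healthy path does not trigger a flap.&lt;/p&gt;

```python
# Toy dual-uplink failover with hysteresis. A generic illustration of
# the IPoE/PPPoE dual-path idea, not IIJ's SEIL router configuration.

class FailoverWatchdog:
    """Route over the secondary path only after `threshold`
    consecutive failed probes of the primary path."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = 0
        self.active = "primary"

    def observe(self, primary_ok: bool) -> str:
        """Feed in one health-check result; return the path to use."""
        if primary_ok:
            self.failures = 0
            self.active = "primary"   # recover as soon as the primary is healthy
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.active = "secondary"
        return self.active

w = FailoverWatchdog(threshold=3)
# Two failures are tolerated, the third switches paths, recovery is immediate.
print([w.observe(ok) for ok in (False, False, False, True)])
```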

&lt;h2&gt;
  
  
  International Collaboration — Partnership with Berlin Phil Media
&lt;/h2&gt;

&lt;p&gt;Since 2016, &lt;a href="https://www.iij.ad.jp/cd/berlin.html" rel="noopener noreferrer"&gt;IIJ has collaborated with &lt;strong&gt;Berlin Phil Media&lt;/strong&gt;, the streaming subsidiary of the &lt;strong&gt;Berlin Philharmonic Orchestra&lt;/strong&gt;&lt;/a&gt;, conducting joint experiments on &lt;strong&gt;uncompressed audio transmission&lt;/strong&gt; and &lt;strong&gt;high-quality international live relays&lt;/strong&gt;. Together, the two organizations have pursued the goal of achieving classical music streaming that surpasses traditional broadcast standards, continuing research and technical verification on advanced delivery systems and cross-border integration.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv845rdllyn2zlj811pg6.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv845rdllyn2zlj811pg6.jpg" alt="Silent robotic camera used by the Berlin Phil Media" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The greatest lesson IIJ has learned from working with Berlin Phil Media is their unwavering &lt;strong&gt;“orchestra-first” philosophy&lt;/strong&gt;. Every technical decision is made with the musicians in mind — for example, the robotic cameras installed in the concert hall operate completely silently so as not to disturb performers’ concentration. This meticulous attention to the artistic environment inspired IIJ to reaffirm the importance of &lt;strong&gt;harmonizing technology and artistry&lt;/strong&gt; in classical music streaming.&lt;/p&gt;

&lt;h2&gt;
  
  
  In Conclusion
&lt;/h2&gt;

&lt;p&gt;What left the deepest impression from this session was realizing just how much effort and ingenuity lie behind the seemingly effortless beauty of a concert live stream. From laying cables and designing network routes to fine-tuning subtitles, countless individual actions come together to create the seamless environment that allows us, the audience, to enjoy music online.&lt;/p&gt;

&lt;p&gt;It was also inspiring to see how the IIJ team never simply repeats the same routine each year, but continually embraces new technological frontiers — such as cloud-based operations, Dolby Atmos, and local 5G. Their pursuit of stability while still pushing the boundaries of innovation truly embodies IIJ’s mission: to evolve the Internet into a genuine social infrastructure.&lt;/p&gt;




&lt;h2&gt;
  
  
  Join Us
&lt;/h2&gt;

&lt;p&gt;Tokyo Video Tech is an open community — whether you're a seasoned video engineer or just curious about the field, you're welcome to join.&lt;br&gt;
Visit our page on Luma (an event platform) and hit &lt;strong&gt;Subscribe&lt;/strong&gt; to get notified about upcoming events.&lt;br&gt;
&lt;a href="https://luma.com/tokyo-video-tech?utm_source=tvt.devto" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;Subscribe on Luma&lt;/a&gt;
&lt;/p&gt;

</description>
      <category>techtalks</category>
      <category>video</category>
      <category>architecture</category>
      <category>cloud</category>
    </item>
    <item>
      <title>IIJ and Netflix Talk Live Streaming at Tokyo Video Tech #10 (Seminar Report)</title>
      <dc:creator>Katz Sakai</dc:creator>
      <pubDate>Fri, 27 Mar 2026 05:26:50 +0000</pubDate>
      <link>https://dev.to/tokyo-video-tech/2025-10-22-tokyo-video-tech-10-reporten-4de4</link>
      <guid>https://dev.to/tokyo-video-tech/2025-10-22-tokyo-video-tech-10-reporten-4de4</guid>
      <description>&lt;p&gt;On October 22, 2025, &lt;strong&gt;&lt;a href="https://luma.com/tokyo-video-tech?utm_source=tvt.devto" rel="noopener noreferrer"&gt;Tokyo Video Tech&lt;/a&gt; #10 “Continuous”&lt;/strong&gt; took place at the Netflix Tokyo office.&lt;/p&gt;

&lt;p&gt;This time, the theme was &lt;strong&gt;live streaming&lt;/strong&gt;, with engineers from IIJ and Netflix sharing what they’ve learned and the challenges they’ve faced in the field. It was also the &lt;strong&gt;first time Netflix’s live streaming team gave a talk in Tokyo&lt;/strong&gt;, which made for an especially lively Q&amp;amp;A session. Here’s a quick recap of the event.&lt;/p&gt;

&lt;p&gt;Report by: &lt;a href="https://www.linkedin.com/in/ksakai/" rel="noopener noreferrer"&gt;Katz Sakai&lt;/a&gt;&lt;br&gt;
(&lt;a href="https://medium.com/tokyo-video-tech/2025-10-22-tokyo-vide-tech-10-seminar-report-ja-c3201c09401a" rel="noopener noreferrer"&gt;本レポートの日本語版はこちらからご覧いただけます&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flfn3ugtmdzhm6yn5qk8i.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flfn3ugtmdzhm6yn5qk8i.jpg" alt="Group Photo" width="800" height="288"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Executive Summary
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Session 1 — Live Streaming at “Spring Festival in Tokyo”
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Since 2021, IIJ has been responsible for live streaming all performances of the Spring Festival in Tokyo. Covering multiple venues across Ueno Park over an extended period, IIJ recreated the sense of being at a live classical concert through innovations such as network redundancy and a custom subtitle tool.&lt;/li&gt;
&lt;li&gt;They continue to &lt;strong&gt;balance stability and innovation each year&lt;/strong&gt;, taking on new technologies like &lt;strong&gt;cloud playout, Dolby Atmos, local 5G, and 60 GHz wireless transmission&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Through international collaboration with &lt;strong&gt;Berlin Phil Media&lt;/strong&gt; and others, the team addresses the unique sensitivity of broadcasting cultural events, expanding the reach of live music through technology.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;▶️ &lt;a href="https://dev.to/tokyo-video-tech/live-streaming-at-spring-festival-in-tokyo-2025-10-22-tokyo-video-tech-10-session-1-report-4gfg"&gt;Read the full session report&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Session 2 — Behind the Streams: Live at Netflix
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Netflix launched its live streaming service in 2023, which has since grown to support &lt;strong&gt;simultaneous viewing by tens of millions of households worldwide&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;By connecting live venues to the cloud through multiple redundant paths with synchronized timing, Netflix ensures reliable signal transmission. &lt;strong&gt;Leveraging its long-standing VOD expertise&lt;/strong&gt;, the company supports playback across a wide variety of global devices. Through audience forecasting and real-time monitoring, Netflix dynamically manages capacity and bitrate to deliver a consistently high-quality viewing experience.&lt;/li&gt;
&lt;li&gt;Continuous testing and &lt;strong&gt;failure injection&lt;/strong&gt; have strengthened operational readiness, enabling Netflix to maintain stable live streaming performance even during one-shot live events.&lt;/li&gt;
&lt;/ul&gt;
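&lt;p&gt;Netflix's internal tooling was not shown in detail, but the failure-injection idea itself is easy to sketch: wrap a delivery call so that faults occur on a known schedule, forcing the retry path to be exercised in rehearsal rather than discovered during a one-shot event. Everything below is a generic illustration, not Netflix code.&lt;/p&gt;

```python
# Generic failure-injection sketch (illustrative, not Netflix tooling):
# force the first k calls of a delivery function to fail so the retry
# path is exercised deterministically.

from typing import Callable, TypeVar

T = TypeVar("T")

def fail_first(fn: Callable[[], T], k: int) -> Callable[[], T]:
    """Wrap fn so its first k invocations raise RuntimeError."""
    state = {"calls": 0}
    def wrapped() -> T:
        state["calls"] += 1
        if state["calls"] <= k:
            raise RuntimeError("injected fault")
        return fn()
    return wrapped

def send_with_retry(send: Callable[[], str], attempts: int = 3) -> str:
    """Retry send() up to `attempts` times before giving up."""
    last_err: Exception | None = None
    for _ in range(attempts):
        try:
            return send()
        except RuntimeError as err:
            last_err = err
    raise last_err

flaky = fail_first(lambda: "segment delivered", k=2)
print(send_with_retry(flaky))  # succeeds on the third attempt
```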

&lt;p&gt;▶️ &lt;a href="https://dev.to/tokyo-video-tech/behind-the-streams-live-at-netflix-2025-10-22-tokyo-video-tech-10-session-2-report-5ed7"&gt;Read the full session report&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Opening Remarks: Celebrating the 10th Edition
&lt;/h2&gt;

&lt;p&gt;The event kicked off with opening remarks from Hayashi-san, one of the organizers of Tokyo Video Tech.&lt;/p&gt;

&lt;p&gt;Tokyo Video Tech started back in October 2018, inspired by &lt;strong&gt;&lt;a href="https://www.demuxed.com/" rel="noopener noreferrer"&gt;Demuxed&lt;/a&gt;&lt;/strong&gt;, a global conference for video engineers held in San Francisco. The idea was simple but powerful — to create a place in Japan where video engineers could connect, share, and learn from each other. Since then, the community has grown beyond Japan, hosting guest sessions at &lt;strong&gt;&lt;a href="https://medium.com/tokyo-video-tech/report-taipei-video-technology-3-336a514cf3b8" rel="noopener noreferrer"&gt;Taiwan Multimedia Technology&lt;/a&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;a href="https://medium.com/tokyo-video-tech/vdd2023-dublin-report-en-27e181ecc9d3" rel="noopener noreferrer"&gt;Dublin&lt;/a&gt;&lt;/strong&gt;, expanding its reach worldwide.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpvxl8ixvk0gzr1i1zapt.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpvxl8ixvk0gzr1i1zapt.jpg" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The group has also been deeply involved in the broader video tech community — for instance, Tokyo Video Tech supported &lt;strong&gt;&lt;a href="https://images.videolan.org/videolan/events/vdd19/" rel="noopener noreferrer"&gt;Video Dev Days 2019&lt;/a&gt;&lt;/strong&gt;, an event that brought together over a hundred developers of &lt;strong&gt;VLC, ffmpeg&lt;/strong&gt;, and the &lt;strong&gt;AV1 decoder dav1d&lt;/strong&gt; for two days of in-depth discussions in Tokyo.&lt;/p&gt;

&lt;p&gt;So, how did this edition end up being hosted at Netflix Tokyo?&lt;br&gt;
It all started with a message from &lt;strong&gt;&lt;a href="http://flv.io/" rel="noopener noreferrer"&gt;Flávio Ribeiro&lt;/a&gt;&lt;/strong&gt; at Netflix about four weeks ago:&lt;/p&gt;

&lt;p&gt;“We’re planning to be in Tokyo — how about hosting a meetup together?”&lt;br&gt;
A quick “Sure, we can do it!” set everything in motion.&lt;/p&gt;

&lt;p&gt;Thanks to the support of the &lt;strong&gt;Netflix Tokyo office, IIJ&lt;/strong&gt;, and the &lt;strong&gt;Tokyo Video Tech team&lt;/strong&gt;, the meetup came together in just four weeks and welcomed about &lt;strong&gt;35 participants&lt;/strong&gt; on the day of the event.&lt;/p&gt;

&lt;p&gt;Hayashi-san wrapped up his opening remarks by saying:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“This event’s theme, Continuous, represents our determination to keep moving forward — to continue learning and engaging in dialogue, even after the pandemic. I hope today’s meetup will be a fruitful and inspiring one for everyone here.”&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  The Sessions
&lt;/h2&gt;

&lt;p&gt;What followed were two sessions that showed just how different, and how fascinating, the world of live streaming can be depending on who's behind it and what's at stake.&lt;/p&gt;

&lt;p&gt;In the first session, IIJ’s Fumitaka Watanabe took us inside the unlikely intersection of internet infrastructure and classical music — where hundreds of meters of cable are laid and removed for every performance, and a World Heritage venue two floors underground becomes a live streaming challenge unlike any other.&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/tokyo-video-tech/live-streaming-at-spring-festival-in-tokyo-2025-10-22-tokyo-video-tech-10-session-1-report-4gfg" class="crayons-story__hidden-navigation-link"&gt;Live Streaming at "Spring Festival in Tokyo" — 2025–10–22 Tokyo Video Tech #10 Session 1 Report&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;
          &lt;a class="crayons-logo crayons-logo--l" href="/tokyo-video-tech"&gt;
            &lt;img alt="Tokyo Video Tech logo" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F12820%2Fd25ef94f-c7d6-40ec-a1df-749ecc98004a.png" class="crayons-logo__image" width="400" height="400"&gt;
          &lt;/a&gt;

          &lt;a href="/katz" class="crayons-avatar  crayons-avatar--s absolute -right-2 -bottom-2 border-solid border-2 border-base-inverted  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F192880%2F8756053a-f9dd-4672-9c58-ce517bbd0b3c.png" alt="katz profile" class="crayons-avatar__image" width="256" height="256"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/katz" class="crayons-story__secondary fw-medium m:hidden"&gt;
              Katz Sakai
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                Katz Sakai
                
              
              &lt;div id="story-author-preview-content-3412065" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/katz" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F192880%2F8756053a-f9dd-4672-9c58-ce517bbd0b3c.png" class="crayons-avatar__image" alt="" width="256" height="256"&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;Katz Sakai&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

            &lt;span&gt;
              &lt;span class="crayons-story__tertiary fw-normal"&gt; for &lt;/span&gt;&lt;a href="/tokyo-video-tech" class="crayons-story__secondary fw-medium"&gt;Tokyo Video Tech&lt;/a&gt;
            &lt;/span&gt;
          &lt;/div&gt;
          &lt;a href="https://dev.to/tokyo-video-tech/live-streaming-at-spring-festival-in-tokyo-2025-10-22-tokyo-video-tech-10-session-1-report-4gfg" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;Mar 27&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/tokyo-video-tech/live-streaming-at-spring-festival-in-tokyo-2025-10-22-tokyo-video-tech-10-session-1-report-4gfg" id="article-link-3412065"&gt;
          Live Streaming at "Spring Festival in Tokyo" — 2025–10–22 Tokyo Video Tech #10 Session 1 Report
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag crayons-tag--filled  " href="/t/techtalks"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;techtalks&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/video"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;video&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/architecture"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;architecture&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/cloud"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;cloud&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
          &lt;a href="https://dev.to/tokyo-video-tech/live-streaming-at-spring-festival-in-tokyo-2025-10-22-tokyo-video-tech-10-session-1-report-4gfg" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left"&gt;
            &lt;div class="multiple_reactions_aggregate"&gt;
              &lt;span class="multiple_reactions_icons_container"&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/fire-f60e7a582391810302117f987b22a8ef04a2fe0df7e3258a5f49332df1cec71e.svg" width="24" height="24"&gt;
                  &lt;/span&gt;
              &lt;/span&gt;
              &lt;span class="aggregate_reactions_counter"&gt;1&lt;span class="hidden s:inline"&gt; reaction&lt;/span&gt;&lt;/span&gt;
            &lt;/div&gt;
          &lt;/a&gt;
            &lt;a href="https://dev.to/tokyo-video-tech/live-streaming-at-spring-festival-in-tokyo-2025-10-22-tokyo-video-tech-10-session-1-report-4gfg#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              Comments


              &lt;span class="hidden s:inline"&gt;Add Comment&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            6 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;


&lt;p&gt;In the second session, Netflix’s live streaming team — presenting in Tokyo for the first time — revealed the engineering behind broadcasts that reach tens of millions of households, from multi-path signal routing to a culture of deliberate failure injection that turns every one-shot event into something they’ve already rehearsed dozens of times.&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/tokyo-video-tech/behind-the-streams-live-at-netflix-2025-10-22-tokyo-video-tech-10-session-2-report-5ed7" class="crayons-story__hidden-navigation-link"&gt;Behind the Streams: Live at Netflix — 2025–10–22 Tokyo Video Tech #10 Session 2 Report&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;
          &lt;a class="crayons-logo crayons-logo--l" href="/tokyo-video-tech"&gt;
            &lt;img alt="Tokyo Video Tech logo" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F12820%2Fd25ef94f-c7d6-40ec-a1df-749ecc98004a.png" class="crayons-logo__image" width="400" height="400"&gt;
          &lt;/a&gt;

          &lt;a href="/katz" class="crayons-avatar  crayons-avatar--s absolute -right-2 -bottom-2 border-solid border-2 border-base-inverted  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F192880%2F8756053a-f9dd-4672-9c58-ce517bbd0b3c.png" alt="katz profile" class="crayons-avatar__image" width="256" height="256"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/katz" class="crayons-story__secondary fw-medium m:hidden"&gt;
              Katz Sakai
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                Katz Sakai
                
              
              &lt;div id="story-author-preview-content-3412124" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/katz" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F192880%2F8756053a-f9dd-4672-9c58-ce517bbd0b3c.png" class="crayons-avatar__image" alt="" width="256" height="256"&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;Katz Sakai&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

            &lt;span&gt;
              &lt;span class="crayons-story__tertiary fw-normal"&gt; for &lt;/span&gt;&lt;a href="/tokyo-video-tech" class="crayons-story__secondary fw-medium"&gt;Tokyo Video Tech&lt;/a&gt;
            &lt;/span&gt;
          &lt;/div&gt;
          &lt;a href="https://dev.to/tokyo-video-tech/behind-the-streams-live-at-netflix-2025-10-22-tokyo-video-tech-10-session-2-report-5ed7" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;Mar 27&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/tokyo-video-tech/behind-the-streams-live-at-netflix-2025-10-22-tokyo-video-tech-10-session-2-report-5ed7" id="article-link-3412124"&gt;
          Behind the Streams: Live at Netflix — 2025–10–22 Tokyo Video Tech #10 Session 2 Report
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag crayons-tag--filled  " href="/t/techtalks"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;techtalks&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/architecture"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;architecture&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/cloud"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;cloud&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/devops"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;devops&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
          &lt;a href="https://dev.to/tokyo-video-tech/behind-the-streams-live-at-netflix-2025-10-22-tokyo-video-tech-10-session-2-report-5ed7" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left"&gt;
            &lt;div class="multiple_reactions_aggregate"&gt;
              &lt;span class="multiple_reactions_icons_container"&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/raised-hands-74b2099fd66a39f2d7eed9305ee0f4553df0eb7b4f11b01b6b1b499973048fe5.svg" width="24" height="24"&gt;
                  &lt;/span&gt;
              &lt;/span&gt;
              &lt;span class="aggregate_reactions_counter"&gt;1&lt;span class="hidden s:inline"&gt; reaction&lt;/span&gt;&lt;/span&gt;
            &lt;/div&gt;
          &lt;/a&gt;
            &lt;a href="https://dev.to/tokyo-video-tech/behind-the-streams-live-at-netflix-2025-10-22-tokyo-video-tech-10-session-2-report-5ed7#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              Comments


              &lt;span class="hidden s:inline"&gt;Add Comment&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            7 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;





&lt;h2&gt;
  
  
  Join Us
&lt;/h2&gt;

&lt;p&gt;Tokyo Video Tech is an open community — whether you're a seasoned video engineer or just curious about the field, you're welcome to join.&lt;br&gt;
Visit our page on Luma (an event platform) and hit &lt;strong&gt;Subscribe&lt;/strong&gt; to get notified about upcoming events.&lt;br&gt;
&lt;a href="https://luma.com/tokyo-video-tech?utm_source=tvt.devto" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;Subscribe on Luma&lt;/a&gt;
&lt;/p&gt;

</description>
      <category>techtalks</category>
      <category>video</category>
      <category>webdev</category>
      <category>architecture</category>
    </item>
    <item>
      <title>OpenSSL Engine API Explained: Connecting Google Cloud KMS, YubiKey, and More</title>
      <dc:creator>Katz Sakai</dc:creator>
      <pubDate>Fri, 27 Mar 2026 00:22:50 +0000</pubDate>
      <link>https://dev.to/katz/what-is-the-openssl-engine-api-integrating-cloud-hsms-and-yubikeys-with-openssl-2db8</link>
      <guid>https://dev.to/katz/what-is-the-openssl-engine-api-integrating-cloud-hsms-and-yubikeys-with-openssl-2db8</guid>
      <description>&lt;p&gt;OpenSSL can do more than work with local key files. Through its Engine API, it can delegate signing and encryption to external hardware like cloud HSMs or YubiKeys, so your private key never has to leave secure storage. This matters more than ever: since June 2023, code signing certificates require that private keys be stored on FIPS 140-2 Level 2 compliant hardware. This post explains how the Engine API works, how it connects to Google Cloud KMS and YubiKey via PKCS#11, and why it has become essential for modern code signing workflows.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is the OpenSSL Engine API?
&lt;/h2&gt;

&lt;p&gt;While OpenSSL implements fundamental cryptographic operations such as encryption and signing on its own, it also provides a plugin-like mechanism called the Engine API that allows these operations to be delegated to external hardware. By using the Engine API, cryptographic operations provided by cloud-based HSMs (Hardware Security Modules) or other external hardware can be called transparently through OpenSSL.&lt;/p&gt;

&lt;p&gt;By delegating cryptographic operations to secure hardware such as HSMs, it becomes possible to perform operations like signing while keeping the private key stored securely on the hardware — all while still using OpenSSL.&lt;/p&gt;

&lt;h2&gt;
  
  
  Examples of Engine API Usage
&lt;/h2&gt;

&lt;p&gt;One example of the Engine API in action is the &lt;a href="https://github.com/OpenSC/libp11" rel="noopener noreferrer"&gt;pkcs11 engine plugin&lt;/a&gt;. This plugin enables OpenSSL to access cryptographic devices that implement the PKCS#11 interface.&lt;/p&gt;

&lt;p&gt;Google has published the &lt;a href="https://github.com/GoogleCloudPlatform/kms-integrations" rel="noopener noreferrer"&gt;Google PKCS #11 Cloud KMS Library&lt;/a&gt;, which allows Google Cloud HSMs to be operated via PKCS#11. By using this library, encryption and signing operations can be executed on Google Cloud's HSMs.&lt;/p&gt;

&lt;p&gt;Similarly, Yubico has published &lt;a href="https://developers.yubico.com/yubico-piv-tool/YKCS11/Supported_applications/openssl_engine.html" rel="noopener noreferrer"&gt;YKCS11&lt;/a&gt;, which enables YubiKey hardware to be operated via PKCS#11. Through this, OpenSSL can invoke operations that use asymmetric private keys stored on the YubiKey hardware.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0wjp9pcvjkat0wvutcxd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0wjp9pcvjkat0wvutcxd.png" alt="Conceptual diagram of the Engine API" width="322" height="392"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Benefits of Externalizing Signing and Other Operations via the Engine API
&lt;/h2&gt;

&lt;p&gt;By using the OpenSSL Engine API, it becomes possible to perform cryptographic operations such as signing on FIPS 140-2 Level 3 compliant hardware like Google Cloud HSM. As a result, private keys never leave the HSM, significantly reducing the risk of key leakage.&lt;/p&gt;

&lt;p&gt;A real-world example of a serious security incident caused by key leakage is the &lt;a href="https://www.threatdown.com/blog/stolen-nvidia-certificates-used-to-sign-malware-heres-what-to-do/" rel="noopener noreferrer"&gt;discovery in 2022 that Nvidia's code signing certificates were being used to sign malware&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Because incidents involving the leak of code signing keys have occurred repeatedly, an industry rule was established requiring that, as of June 1, 2023, private keys used for code signing must be stored on FIPS 140-2 Level 2 compliant hardware&lt;sup id="fnref1"&gt;1&lt;/sup&gt;&lt;sup id="fnref2"&gt;2&lt;/sup&gt;. The Engine API has become an essential means of meeting these industry requirements.&lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;&lt;a href="https://support.globalsign.com/code-signing/new-requirements-related-private-key-protection-codesigning-certificates" rel="noopener noreferrer"&gt;https://support.globalsign.com/code-signing/new-requirements-related-private-key-protection-codesigning-certificates&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;CA/Browser Forum document outlining code signing certificate requirements: &lt;a href="https://cabforum.org/working-groups/code-signing/requirements/" rel="noopener noreferrer"&gt;https://cabforum.org/working-groups/code-signing/requirements/&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>openssl</category>
      <category>security</category>
      <category>infosec</category>
    </item>
    <item>
      <title>Why Rails App Memory Bloat Happens: Causes and Solutions (2025 Edition)</title>
      <dc:creator>Katz Sakai</dc:creator>
      <pubDate>Thu, 26 Mar 2026 10:58:03 +0000</pubDate>
      <link>https://dev.to/katz/why-rails-app-memory-bloat-happens-causes-and-solutions-2025-edition-27c4</link>
      <guid>https://dev.to/katz/why-rails-app-memory-bloat-happens-causes-and-solutions-2025-edition-27c4</guid>
      <description>&lt;p&gt;After running a Rails application in production for a while, you may encounter a phenomenon where memory usage grows unexpectedly large. In July 2025, I investigated the causes of this behavior and explored countermeasures. Here is a summary of my findings.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The reason a Rails app "appears to keep consuming memory" is that glibc, which Ruby relies on, retains freed memory internally for future reuse rather than returning it to the OS. This is distinct from a typical memory leak.&lt;/li&gt;
&lt;li&gt;Starting with Ruby 3.3.0, you can optimize the heap by calling &lt;code&gt;Process.warmup&lt;/code&gt; after the Rails application has finished booting. However, since this mechanism is intended to be executed at the completion of the application boot sequence, it is not easily applicable to reducing memory in long-running Rails applications.&lt;/li&gt;
&lt;li&gt;As a countermeasure that requires zero changes to product code, setting the environment variable &lt;code&gt;MALLOC_ARENA_MAX=2&lt;/code&gt; remains effective. This prevents glibc from creating numerous arenas (memory pools) one after another, forcing memory reuse within existing pools and thereby preventing glibc from hoarding too much freed memory.&lt;/li&gt;
&lt;li&gt;Switching to the memory allocator jemalloc, which was previously recommended as an effective countermeasure, should now be avoided because jemalloc is no longer being maintained.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why Does Memory Usage in Rails Apps Appear to Keep Growing?
&lt;/h2&gt;

&lt;p&gt;Hongli Lai's article "What causes Ruby memory bloat?" covers this in detail.&lt;br&gt;
&lt;/p&gt;
&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__content"&gt;
        &lt;div class="c-embed__cover"&gt;
          &lt;a href="https://www.joyfulbikeshedding.com/blog/2019-03-14-what-causes-ruby-memory-bloat.html" class="c-link align-middle" rel="noopener noreferrer"&gt;
            &lt;img alt="" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.joyfulbikeshedding.com%2Fimages%2F2019%2Fruby_memory_bloat_banner-f0bdf762.png" height="auto" class="m-0"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="c-embed__body"&gt;
        &lt;h2 class="fs-xl lh-tight"&gt;
          &lt;a href="https://www.joyfulbikeshedding.com/blog/2019-03-14-what-causes-ruby-memory-bloat.html" rel="noopener noreferrer" class="c-link"&gt;
            What causes Ruby memory bloat? – Joyful Bikeshedding
          &lt;/a&gt;
        &lt;/h2&gt;
          &lt;p class="truncate-at-3"&gt;
            Ruby apps can use a lot of memory. But why? Various people in the community attribute it to memory fragmentation, and provide two “hacky” solutions. Dissatisfied by the current explanations and provided solutions, I set out on a journey to discover the deeper truth and to find better solutions.

          &lt;/p&gt;
        &lt;div class="color-secondary fs-s flex items-center"&gt;
            &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.joyfulbikeshedding.com%2Fimages%2Ffavicon-1b152eac.png"&gt;
          joyfulbikeshedding.com
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;


&lt;p&gt;According to that article, the reasons memory bloat "appears to occur" are as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The previously cited explanation of "heap page fragmentation on the Ruby side" was not, in fact, a major contributor to increased memory usage.&lt;/li&gt;
&lt;li&gt;The true cause was that "glibc's memory allocator, malloc, retains memory that Ruby has freed instead of returning it to the OS, holding onto it for future use." In particular, free pages that are not at the end of the heap are not returned to the OS, so unused memory continues to accumulate internally. From the OS's perspective, this makes it look like "Ruby keeps consuming memory."&lt;/li&gt;
&lt;li&gt;Calling glibc's &lt;code&gt;malloc_trim(0)&lt;/code&gt; ensures that memory freed by Ruby is returned to the OS, effectively reducing the process's memory usage (RSS) as seen by the OS.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, there is an extremely important caveat here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Memory that Ruby has allocated and freed during processing is usually fragmented. Calling &lt;code&gt;malloc_trim(0)&lt;/code&gt; does not resolve the fragmentation; it merely returns the fragmented regions to the OS as-is.&lt;/li&gt;
&lt;li&gt;Even though the memory is fragmented, it is still returned to the OS, so Ruby's memory usage (RSS) as seen by the OS does decrease. However, because other programs cannot allocate contiguous regions from fragmented free memory, an OOM (Out of Memory) error can occur even when there appears to be free memory available.&lt;/li&gt;
&lt;li&gt;Since returning fragmented memory to the OS does not make it easy to reuse effectively, malloc is designed to retain allocated but unused memory internally and reuse it, enabling stable allocation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is one of the reasons "Ruby has freed the memory, but malloc does not readily return it to the OS."&lt;sup id="fnref1"&gt;1&lt;/sup&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is the Memory Bloat Countermeasure Code Introduced in Ruby 3.3.0?
&lt;/h2&gt;

&lt;p&gt;Ruby 3.3.0 introduced the &lt;code&gt;Process.warmup&lt;/code&gt; method. This method is intended to signal to the Ruby virtual machine from an application server or similar that "the application's startup sequence has completed, making this an optimal time to perform GC and memory optimization."&lt;sup id="fnref2"&gt;2&lt;/sup&gt;&lt;br&gt;
When &lt;code&gt;Process.warmup&lt;/code&gt; is called, the Ruby virtual machine performs the following optimization operations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Forces a major GC&lt;/li&gt;
&lt;li&gt;Compacts the heap&lt;/li&gt;
&lt;li&gt;Promotes all surviving objects to the old generation&lt;/li&gt;
&lt;li&gt;Pre-computes string coderanges (to speed up future string operations)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This cleans up objects and caches that were generated during application startup but are no longer needed, improving memory sharing efficiency in Copy-on-Write (CoW) environments.&lt;/p&gt;

&lt;p&gt;Furthermore, since unnecessary objects have been garbage collected and the heap has been compacted, there is a high likelihood that fragmentation in the heap allocated by malloc has been reduced. This makes it an ideal time to call &lt;code&gt;malloc_trim(0)&lt;/code&gt;, and a patch that calls &lt;code&gt;malloc_trim(0)&lt;/code&gt; internally within &lt;code&gt;Process.warmup&lt;/code&gt; has been merged.&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_github-liquid-tag"&gt;
  &lt;h1&gt;
    &lt;a href="https://github.com/ruby/ruby/pull/8451" rel="noopener noreferrer"&gt;
      &lt;img class="github-logo" alt="GitHub logo" src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg"&gt;
      &lt;span class="issue-title"&gt;
        Process.warmup: invoke `malloc_trim` if available
      &lt;/span&gt;
      &lt;span class="issue-number"&gt;#8451&lt;/span&gt;
    &lt;/a&gt;
  &lt;/h1&gt;
  &lt;div class="github-thread"&gt;
    &lt;div class="timeline-comment-header"&gt;
      &lt;a href="https://github.com/casperisfine" rel="noopener noreferrer"&gt;
        &lt;img class="github-liquid-tag-img" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Favatars.githubusercontent.com%2Fu%2F19192189%3Fv%3D4" alt="casperisfine avatar"&gt;
      &lt;/a&gt;
      &lt;div class="timeline-comment-header-text"&gt;
        &lt;strong&gt;
          &lt;a href="https://github.com/casperisfine" rel="noopener noreferrer"&gt;casperisfine&lt;/a&gt;
        &lt;/strong&gt; posted on &lt;a href="https://github.com/ruby/ruby/pull/8451" rel="noopener noreferrer"&gt;&lt;time&gt;Sep 15, 2023&lt;/time&gt;&lt;/a&gt;
      &lt;/div&gt;
    &lt;/div&gt;
    &lt;div class="ltag-github-body"&gt;
      &lt;p&gt;Similar to releasing free GC pages, releasing free malloc pages reduce the amount of page faults post fork.&lt;/p&gt;
&lt;p&gt;NB: Some popular allocators such as &lt;code&gt;jemalloc&lt;/code&gt; don't implement it, so it's a noop for them.&lt;/p&gt;

    &lt;/div&gt;
    &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/ruby/ruby/pull/8451" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;


&lt;p&gt;An important point is that &lt;code&gt;Process.warmup&lt;/code&gt; is not automatically called behind the scenes like GC. It is the kind of method that should be explicitly called at an appropriate time on the application server side when a major GC would be acceptable (e.g., before forking, before worker startup). Therefore, there may not always be an appropriate time to call it in long-running Rails applications.&lt;/p&gt;
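&lt;p&gt;For example, in a Puma setup with application preloading, one natural call site is the master process just before workers are forked. This is a minimal sketch under that assumption, not code from the Ruby or Puma docs; the hook names are Puma's, and &lt;code&gt;Process.warmup&lt;/code&gt; requires Ruby 3.3+.&lt;/p&gt;

```ruby
# config/puma.rb (sketch) -- with preload_app!, workers fork from an
# already-booted master, so forked workers share the warmed-up heap via CoW.
preload_app!
workers 2

before_fork do
  # Signal "boot is finished": forces a major GC, compacts the heap,
  # promotes survivors to the old generation, and (per PR #8451)
  # calls malloc_trim(0) where available.
  Process.warmup if Process.respond_to?(:warmup) # Ruby >= 3.3
end
```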

&lt;h2&gt;
  
  
  Countermeasures for Memory Bloat in Long-Running Rails Apps
&lt;/h2&gt;

&lt;p&gt;So how can you suppress memory bloat without using &lt;code&gt;Process.warmup&lt;/code&gt; or &lt;code&gt;malloc_trim(0)&lt;/code&gt;?&lt;/p&gt;

&lt;p&gt;Online resources have recommended using jemalloc, a smarter memory allocator. However, &lt;a href="https://github.com/jemalloc/jemalloc" rel="noopener noreferrer"&gt;jemalloc's repository was archived in June 2025&lt;/a&gt;, and ongoing maintenance cannot be expected. It is best to avoid adopting it for new projects.&lt;/p&gt;

&lt;p&gt;As an alternative, setting the environment variable &lt;code&gt;MALLOC_ARENA_MAX=2&lt;/code&gt; remains effective. Here's why:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It reduces the number of arenas (memory management regions) that glibc allocates. glibc's malloc allocates numerous arenas as needed to prevent contention when multiple threads request memory simultaneously (normally, on 64-bit systems, the upper limit is 8 times the number of vCPU cores on the machine).&lt;/li&gt;
&lt;li&gt;As described above, glibc's memory allocator is reluctant to return memory to the OS. Therefore, the more arenas there are, the more "unreturned free memory" accumulates internally.&lt;/li&gt;
&lt;li&gt;By limiting the number of arenas, you can reduce the total amount of memory that glibc does not return to the OS (at the cost of slightly increased contention for memory allocation among threads).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;According to the following articles, setting &lt;code&gt;MALLOC_ARENA_MAX=2&lt;/code&gt; significantly reduces memory usage in exchange for a slight degradation in response time of a few percent.&lt;br&gt;
&lt;a href="https://www.speedshop.co/2017/12/04/malloc-doubles-ruby-memory.html" rel="noopener noreferrer"&gt;https://www.speedshop.co/2017/12/04/malloc-doubles-ruby-memory.html&lt;/a&gt;&lt;br&gt;
&lt;a href="https://techracho.bpsinc.jp/hachi8833/2022_06_23/50109" rel="noopener noreferrer"&gt;https://techracho.bpsinc.jp/hachi8833/2022_06_23/50109&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Additionally, &lt;code&gt;MALLOC_ARENA_MAX=2&lt;/code&gt; is the default setting on Heroku, which suggests it is a relatively safe configuration.&lt;br&gt;
&lt;a href="https://devcenter.heroku.com/changelog-items/1683" rel="noopener noreferrer"&gt;https://devcenter.heroku.com/changelog-items/1683&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Furthermore, the following reasoning supports why a value of &lt;code&gt;2&lt;/code&gt; for &lt;code&gt;MALLOC_ARENA_MAX&lt;/code&gt; is sufficient:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ruby has a GVL (Global VM Lock), so only one thread can execute Ruby code at any given time. Even if the application spawns many threads, the number actively running and allocating memory at any moment is very small, so glibc does not need many arenas; a small number (around &lt;code&gt;2&lt;/code&gt;), enough to serve the active threads, should be adequate. Setting &lt;code&gt;MALLOC_ARENA_MAX&lt;/code&gt; to &lt;code&gt;2&lt;/code&gt; therefore causes virtually no operational issues in practice, while effectively minimizing the total amount of free memory hoarded across multiple arenas.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you want to be thorough, you should measure memory usage and response time with each setting—unset, 2, 3, and 4—and determine the optimal value.&lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;A proposal was made to "call &lt;code&gt;malloc_trim(0)&lt;/code&gt; when a full GC is performed in Ruby to return memory to the OS," but it was not implemented because returning fragmented memory to the OS provides little benefit since the OS cannot effectively utilize it. &lt;a href="https://bugs.ruby-lang.org/issues/15667#note-10" rel="noopener noreferrer"&gt;Feature #15667: Introduce malloc_trim(0) in full gc cycles - Ruby - Ruby Issue Tracking System&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;The background behind the introduction of &lt;code&gt;Process.warmup&lt;/code&gt; is explained in &lt;a href="https://bugs.ruby-lang.org/issues/18885" rel="noopener noreferrer"&gt;Feature #18885: End of boot advisory API for RubyVM - Ruby - Ruby Issue Tracking System&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
    </item>
    <item>
      <title>Building a Cost-Effective Windows Code Signing Pipeline with Sectigo, Google Cloud KMS, and GitHub Actions</title>
      <dc:creator>Katz Sakai</dc:creator>
      <pubDate>Thu, 26 Mar 2026 05:29:51 +0000</pubDate>
      <link>https://dev.to/katz/building-a-cost-effective-windows-code-signing-pipeline-sectigo-google-cloud-kms-on-github-2ghf</link>
      <guid>https://dev.to/katz/building-a-cost-effective-windows-code-signing-pipeline-sectigo-google-cloud-kms-on-github-2ghf</guid>
      <description>&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;Most code signing services charge per signature. That sounds fine until your CI pipeline signs staging and production builds on every push, and your monthly bill starts climbing. We replaced per-signature billing with a flat-cost setup using a &lt;a href="https://www.sectigostore.com/code-signing/sectigo-code-signing-certificate?aid=52915358" rel="noopener noreferrer"&gt;Sectigo certificate&lt;/a&gt; and Google Cloud KMS. Key storage costs about $2.50/month, signing is $0.15 per 10,000 operations, and the private key never leaves the HSM. This post walks through the full architecture and setup on GitHub Actions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Table Of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;System Architecture of the Code Signing Environment on GitHub Actions&lt;/li&gt;
&lt;li&gt;Why We Chose Sectigo's Code Signing Certificate&lt;/li&gt;
&lt;li&gt;
Setup Instructions

&lt;ul&gt;
&lt;li&gt;Generate a Signing Private Key on Google Cloud KMS HSM&lt;/li&gt;
&lt;li&gt;Create a CSR on Your Local Machine&lt;/li&gt;
&lt;li&gt;Purchase the Code Signing Certificate&lt;/li&gt;
&lt;li&gt;Integrate into GitHub Actions&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

Notes

&lt;ul&gt;
&lt;li&gt;Key Types and Bit Lengths Commonly Used in Code Signing&lt;/li&gt;
&lt;li&gt;Instant SmartScreen Bypass via EV Certificate Is No Longer Possible&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  System Architecture of the Code Signing Environment on GitHub Actions
&lt;/h2&gt;

&lt;p&gt;The signing environment built on the GitHub Actions Windows Runner is structured as shown below.&lt;br&gt;
The signing process with SignTool.exe is performed using the key stored on the HSM via the KMS CNG (Cryptography Next Generation) Provider supplied by Google. This allows the signing process to be executed securely without ever holding the private key on the GitHub Actions runner.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu04mwzlumbp0qb6ic67q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu04mwzlumbp0qb6ic67q.png" alt="System Architecture" width="691" height="602"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Why We Chose Sectigo's Code Signing Certificate
&lt;/h2&gt;

&lt;p&gt;The reason is that it allows us to build a code signing environment without incurring unnecessary costs.&lt;/p&gt;

&lt;p&gt;For example, the code signing certificates offered by competitor SSL.com are subject to usage-based billing per number of signatures. This usage-based billing can become surprisingly burdensome.&lt;br&gt;
The screenshot below shows SSL.com's code signing pricing. At first glance, 20 signatures per month for 20 USD/month might seem sufficient. However, in practice, production and staging environments are often separate, and if you sign staging binaries each time as well, 20 signatures are quickly exhausted. As a result, you end up needing a higher-tier plan with 100 or 300 signatures, paying a substantial monthly fee.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1q0rwnfmn910w7ped1y7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1q0rwnfmn910w7ped1y7.png" alt="SSL.com codesign pricing" width="783" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While we used SSL.com as an example, most vendors that provide their own signing systems employ a similar usage-based billing model.&lt;/p&gt;

&lt;p&gt;In contrast, Sectigo code signing allows you to perform signing operations on your own Google Cloud KMS HSM. With this approach, key storage costs just 2.50 USD/month, and signing operations cost 0.15 USD per 10,000 signatures—an extremely cost-effective option.&lt;br&gt;
(All prices as of July 2025.)&lt;/p&gt;

&lt;p&gt;Because signing costs are essentially negligible, you can freely sign binaries for both production and staging environments. This guarantees the origin and integrity of binaries in both environments, enabling a more secure distribution pipeline. Furthermore, since signing costs are not a concern, you can flexibly handle re-signing and verification during incident response.&lt;/p&gt;
&lt;h2&gt;
  
  
  Setup Instructions
&lt;/h2&gt;

&lt;p&gt;Sectigo publishes the procedure as a PDF, and we followed it closely.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://certificategeneration.com/en/content/pdf/google-kms-code-signing.pdf" rel="noopener noreferrer"&gt;https://certificategeneration.com/en/content/pdf/google-kms-code-signing.pdf&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Generate a Signing Private Key on Google Cloud KMS HSM
&lt;/h3&gt;

&lt;p&gt;The private key used for code signing is generated on the Google Cloud KMS HSM (Hardware Security Module).&lt;br&gt;
As long as its security is maintained, this key never needs to be regenerated: you can keep using the same key when the certificate is renewed (i.e., you can skip this step at the next renewal).&lt;/p&gt;
&lt;h4&gt;
  
  
  (1) Create a Key Ring
&lt;/h4&gt;

&lt;p&gt;Create a key ring to group your keys.&lt;br&gt;
We selected the multi-region &lt;code&gt;asia1&lt;/code&gt; for the region.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ruw1yh55969pfp5178x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ruw1yh55969pfp5178x.png" width="580" height="512"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  (2) Create a Signing Key
&lt;/h4&gt;

&lt;p&gt;Next, create the key used for signing.&lt;br&gt;
On the screen where you specify the key name and protection level, you must select "HSM" as the protection level.&lt;br&gt;
This is because the code signing certificate issuance requirements mandate that "the private key must be generated within an HSM and must not be exportable from the HSM"&lt;sup id="fnref1"&gt;1&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fug2ss0at2vhtn7vbwwxs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fug2ss0at2vhtn7vbwwxs.png" width="597" height="563"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For the purpose and algorithm, select "Asymmetric sign" and "4096 bit RSA - PKCS#1 v1.5 padding - SHA256 digest" respectively.&lt;br&gt;
&lt;em&gt;We chose this key type because this combination is most commonly used in code signing for Windows binaries in practice (see the notes at the bottom of the page for details).&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fly6eyk1bxuqd7kveyx6i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fly6eyk1bxuqd7kveyx6i.png" width="582" height="546"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the settings are configured, execute the key generation.&lt;/p&gt;
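&lt;p&gt;The same two resources can also be created from the CLI instead of the Cloud Console. A sketch with illustrative key ring and key names (the algorithm and protection-level flags mirror the Console choices above):&lt;/p&gt;

```shell
# Key ring in the asia1 multi-region (names are illustrative).
gcloud kms keyrings create codesign --location asia1

# HSM-protected asymmetric signing key:
# RSA 4096, PKCS#1 v1.5 padding, SHA-256 digest.
gcloud kms keys create codesign-key \
  --location asia1 \
  --keyring codesign \
  --purpose asymmetric-signing \
  --default-algorithm rsa-sign-pkcs1-4096-sha256 \
  --protection-level hsm
```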
&lt;h4&gt;
  
  
  (3) Obtain the Attestation for the Generated Key
&lt;/h4&gt;

&lt;p&gt;An attestation is data that certifies that "a key was securely generated and is securely stored on the HSM." Certificate vendors such as Sectigo need to verify the key's integrity using the attestation when issuing a code signing certificate, so you must download the attestation from Google Cloud KMS.&lt;/p&gt;

&lt;p&gt;To download the attestation, first select the generated key, then click "Verify attestation" from the three-dot menu.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F60u4s184iouz32oltzm1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F60u4s184iouz32oltzm1.png" width="800" height="549"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, in the dialog that appears, click "Download attestation bundle" to download a zip file. Save this file and submit it when purchasing the certificate.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frim9uhwurqa6i2zw4w9o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frim9uhwurqa6i2zw4w9o.png" width="800" height="525"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Create a CSR on Your Local Machine
&lt;/h3&gt;

&lt;p&gt;To issue a certificate, you need to create a CSR (Certificate Signing Request) corresponding to the signing key.&lt;br&gt;
This procedure is easiest to perform in a Linux or WSL environment.&lt;/p&gt;

&lt;p&gt;The overall architecture for the CSR creation environment is as shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbrvxmhlo8u51bhnndse3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbrvxmhlo8u51bhnndse3.png" alt="System setup for CSR" width="592" height="542"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A PKCS#11 plugin hooks into OpenSSL's Engine API (its plugin mechanism)&lt;sup id="fnref2"&gt;2&lt;/sup&gt;; the plugin reaches the Google Cloud HSM through Google's PKCS#11 library, so the CSR is signed with a private key that never leaves the HSM.&lt;/p&gt;
&lt;h4&gt;
  
  
  Install the Required Tools
&lt;/h4&gt;

&lt;p&gt;First, install the tools needed to access Google Cloud KMS from OpenSSL via the PKCS#11 interface.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get &lt;span class="nb"&gt;install &lt;/span&gt;libengine-pkcs11-openssl opensc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, download Google's PKCS#11 library from the GitHub releases page and extract it to a directory of your choice.&lt;br&gt;
&lt;a href="https://github.com/GoogleCloudPlatform/kms-integrations/releases?q=pkcs%2311&amp;amp;expanded=true" rel="noopener noreferrer"&gt;pkcs#11 · Releases · GoogleCloudPlatform/kms-integrations&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Set the path to &lt;code&gt;libkmsp11.so&lt;/code&gt; in the extracted directory as the environment variable &lt;code&gt;PKCS11_MODULE_PATH&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;PKCS11_MODULE_PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"/path/to/libkmsp11.so"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, create a YAML file at any location that specifies the resource name of the key ring containing the key used for CSR creation.&lt;br&gt;
The YAML content should look like this.&lt;br&gt;
Replace &lt;code&gt;YOUR_GCP_PROJECT_ID&lt;/code&gt; and &lt;code&gt;KEY_RING_NAME&lt;/code&gt; with your own Google Cloud project ID and key ring name.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;tokens&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key_ring&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;projects/YOUR_GCP_PROJECT_ID/locations/asia1/keyRings/KEY_RING_NAME"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set the path of the created YAML file as the environment variable &lt;code&gt;KMS_PKCS11_CONFIG&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;KMS_PKCS11_CONFIG&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"/path/to/google_hsm_config.yaml"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, obtain the Application Default Credentials (ADC) using the Google Cloud CLI. The account used for login must have the necessary permissions to access the created key.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud auth application-default login
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the ADC has been obtained and the plugin and YAML configuration described above are correctly set up, you should be able to retrieve the key information with the following command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pkcs11-tool &lt;span class="nt"&gt;--module&lt;/span&gt; /path/to/libkmsp11.so &lt;span class="nt"&gt;--list-objects&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, generate the CSR using the openssl command. Below is an example command for CSR creation.&lt;br&gt;
Replace "EXAMPLE, Inc." with your company's name, and replace &lt;code&gt;test-authenticode-key&lt;/code&gt; with the name of the key you created.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openssl req &lt;span class="nt"&gt;-new&lt;/span&gt; &lt;span class="nt"&gt;-subj&lt;/span&gt; &lt;span class="s1"&gt;'/CN=EXAMPLE, Inc./'&lt;/span&gt; &lt;span class="nt"&gt;-sha256&lt;/span&gt; &lt;span class="nt"&gt;-engine&lt;/span&gt; pkcs11 &lt;span class="nt"&gt;-keyform&lt;/span&gt; engine &lt;span class="nt"&gt;-key&lt;/span&gt; pkcs11:object&lt;span class="o"&gt;=&lt;/span&gt;test-authenticode-key
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
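&lt;p&gt;Before running the request against the HSM, you can dry-run the same subject formatting with a throwaway local key (a local-key stand-in for the pkcs11 engine; the file names here are illustrative and should not be used for the real request):&lt;/p&gt;

```shell
# Dry-run of the CSR request using a throwaway local RSA-4096 key.
# The real run replaces -key with the pkcs11 engine reference.
openssl genpkey -algorithm RSA -pkeyopt rsa_keygen_bits:4096 -out local-test.key
openssl req -new -subj '/CN=EXAMPLE, Inc./' -sha256 -key local-test.key -out local-test.csr
# Inspect the subject and verify the CSR's self-signature before submission
openssl req -in local-test.csr -noout -subject -verify
```

The same `-noout -subject -verify` inspection also works on the HSM-generated CSR before you submit it to the CA.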



&lt;h3&gt;
  
  
  Purchase the Code Signing Certificate
&lt;/h3&gt;

&lt;p&gt;Once the attestation and CSR are prepared, proceed to purchase the code signing certificate.&lt;br&gt;
When purchasing from &lt;a href="https://www.sectigostore.com/code-signing/sectigo-code-signing-certificate?aid=52915358" rel="noopener noreferrer"&gt;Sectigo Code Signing Certificates&lt;/a&gt;, select "Install on Existing HSM" as the Certificate Delivery Method.&lt;br&gt;
There is little reason to choose Extended Validation for the Validation Option (see the notes at the bottom of the page), so Standard Validation should suffice.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwxxup0ngt09eqnhxs1ir.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwxxup0ngt09eqnhxs1ir.png" alt=" " width="800" height="732"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You will be prompted to submit the attestation and CSR during the purchase process.&lt;/p&gt;

&lt;p&gt;After the purchase, there is a review process for issuing the certificate.&lt;br&gt;
When issuing a certificate under a corporate entity, the following steps are required:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Applicant identity verification

&lt;ul&gt;
&lt;li&gt;Submit a photo of the passport itself&lt;/li&gt;
&lt;li&gt;Submit a selfie holding the passport next to your face&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Company existence verification

&lt;ul&gt;
&lt;li&gt;Provide your corporate number, and Sectigo will verify it by querying the government's corporate registry&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Company phone number verification

&lt;ul&gt;
&lt;li&gt;An automated voice call is placed to the company phone number set during purchase, providing a numeric code that you enter into a web form to complete the verification&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once this review process is complete, the certificate is issued.&lt;/p&gt;
&lt;h3&gt;
  
  
  Integrate into GitHub Actions
&lt;/h3&gt;

&lt;p&gt;Once the certificate has been issued, integrate the code signing process into the Windows Runner on GitHub Actions.&lt;br&gt;
The signing process uses signtool.exe on the Windows Runner. The signing key stored in Google Cloud KMS is accessed through the KMS CNG (Cryptography Next Generation) Provider supplied by Google. Additionally, a timestamp can be applied to the signature, allowing the signature to remain valid even after the certificate expires&lt;sup id="fnref3"&gt;3&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;Below is an example GitHub Actions configuration for performing the signing process.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Install Google Cloud KMS CNG provider for code signing with Sectigo BYO HSM, and add SignTool.exe to PATH&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
        &lt;span class="s"&gt;curl -L https://github.com/GoogleCloudPlatform/kms-integrations/releases/download/cng-v1.2/kmscng-1.2-windows-amd64.zip -o kmscng.zip&lt;/span&gt;
        &lt;span class="s"&gt;unzip kmscng.zip&lt;/span&gt;
        &lt;span class="s"&gt;cd kmscng-1.2-windows-amd64&lt;/span&gt;
        &lt;span class="s"&gt;msiexec /i kmscng.msi /quiet /qn /norestart&lt;/span&gt;

        &lt;span class="s"&gt;# Create a working directory&lt;/span&gt;
        &lt;span class="s"&gt;New-Item -Path "C:\Windows\KMSCNG" -ItemType Directory -Force&lt;/span&gt;

        &lt;span class="s"&gt;# Add SignTool.exe to PATH&lt;/span&gt;
        &lt;span class="s"&gt;echo "C:\Program Files (x86)\Windows Kits\10\App Certification Kit" &amp;gt;&amp;gt; $env:GITHUB_PATH&lt;/span&gt;
      &lt;span class="na"&gt;shell&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pwsh&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Decode the BASE64-encoded Sectigo code signing certificate and save it to a file&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
        &lt;span class="s"&gt;[IO.File]::WriteAllBytes(&lt;/span&gt;
          &lt;span class="s"&gt;'C:\Windows\KMSCNG\cert.p12',&lt;/span&gt;
          &lt;span class="s"&gt;[Convert]::FromBase64String("${{ vars.sectigo_code_signing_cert_base64 }}")&lt;/span&gt;
        &lt;span class="s"&gt;)&lt;/span&gt;
      &lt;span class="na"&gt;shell&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pwsh&lt;/span&gt;

    &lt;span class="c1"&gt;# The account used here must be granted the roles/cloudkms.signerVerifier role&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Authenticate to Google Cloud for access to the Sectigo code signing private key on Google Cloud KMS&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;google-github-actions/auth@v2&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;project_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;YOUR_PROJECT_ID&lt;/span&gt;
        &lt;span class="na"&gt;create_credentials_file&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
        &lt;span class="na"&gt;workload_identity_provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SAMPLE_PROVIDER&lt;/span&gt;
        &lt;span class="na"&gt;service_account&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;YOUR_SERVICE_ACCOUNT@PROJECT.iam.gserviceaccount.com&lt;/span&gt;
        &lt;span class="na"&gt;export_environment_variables&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
        &lt;span class="na"&gt;access_token_lifetime&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;600&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Sign the code&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
        &lt;span class="s"&gt;signtool.exe sign /v /debug /as /fd sha256 /tr http://timestamp.sectigo.com /td SHA384 /f C:\Windows\KMSCNG\cert.p12 /csp "Google Cloud KMS Provider" /kc projects/YOUR_GCP_PROJECT_ID/locations/asia1/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME/cryptoKeyVersions/1 C:\Temp\binary-to-be-signed.exe&lt;/span&gt;
      &lt;span class="na"&gt;shell&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cmd&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
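&lt;p&gt;The BASE64 value stored in &lt;code&gt;vars.sectigo_code_signing_cert_base64&lt;/code&gt; can be produced on any Linux machine. Here is a minimal round-trip sketch, with a dummy file standing in for the real certificate:&lt;/p&gt;

```shell
# Round-trip a (dummy) certificate file through BASE64, mirroring what the
# workflow's decode step reconstructs on the runner.
printf 'dummy-p12-bytes' > cert.p12       # stand-in for the real cert.p12
base64 -w0 cert.p12 > cert.p12.b64        # single-line value for the GitHub variable
base64 -d cert.p12.b64 > roundtrip.p12    # what the workflow step writes back out
cmp cert.p12 roundtrip.p12                # byte-identical round trip
```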



&lt;h2&gt;
  
  
  Notes
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Key Types and Bit Lengths Commonly Used in Code Signing
&lt;/h3&gt;

&lt;p&gt;We surveyed the key types and bit lengths used for code signing in real-world Windows applications. The results from examining 30 Windows application installers on hand as of May 2025 are as follows.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Key Type / Bit Length&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;RSA 4096-bit&lt;/td&gt;
&lt;td&gt;17&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RSA 3072-bit&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ECC 384-bit&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ECC 256-bit&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These results show that choosing RSA 4096-bit is the safe bet.&lt;/p&gt;
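&lt;p&gt;If you want to run the same key-size check on a certificate you have (for example, one extracted from an installer), openssl prints the key type and size. A self-contained sketch using a throwaway certificate:&lt;/p&gt;

```shell
# Self-contained demo: generate a throwaway 4096-bit certificate, then print
# its public key line. Point -in at a real certificate to run the same check.
openssl req -x509 -newkey rsa:4096 -nodes -keyout demo.key \
  -subj '/CN=demo/' -days 1 -out demo-cert.pem
openssl x509 -in demo-cert.pem -noout -text | grep 'Public-Key'
# prints the key line, e.g. "Public-Key: (4096 bit)"
```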

&lt;h3&gt;
  
  
  Instant SmartScreen Bypass via EV Certificate Is No Longer Possible
&lt;/h3&gt;

&lt;p&gt;Previously, performing code signing with an EV (Extended Validation) code signing certificate could bypass the Microsoft SmartScreen warning screen.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fadgo02hknmvrlhzelbdc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fadgo02hknmvrlhzelbdc.png" alt=" " width="642" height="604"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, due to a change in Windows behavior in March 2024, even EV certificates can no longer instantly bypass SmartScreen. As of now, SmartScreen warnings may still appear until sufficient reputation has been built up for the certificate itself or the signed binary.&lt;/p&gt;

&lt;p&gt;This is also noted on &lt;a href="https://www.sectigostore.com/code-signing/sectigo-code-signing-certificate?aid=52915358" rel="noopener noreferrer"&gt;Sectigo's EV code signing certificate purchase page&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; In March 2024, Microsoft changed the way MS SmartScreen interacts with EV Code Signing certificates. EV Code Signing certificates remain the highest trust certificates available, but they no longer instantly remove SmartScreen warnings.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Therefore, there is no longer a reason to choose an EV certificate for the purpose of SmartScreen mitigation.&lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;This became an industry requirement because there were numerous incidents where private keys were leaked and code signing certificates were abused, as exemplified by &lt;a href="https://www.malwarebytes.com/blog/news/2022/03/stolen-nvidia-certificates-used-to-sign-malware-heres-what-to-do" rel="noopener noreferrer"&gt;NVIDIA's code signing private key leak&lt;/a&gt;. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;For more on the OpenSSL Engine API, see the article &lt;a href="https://zenn.dev/inop/articles/c04c452ce645fc" rel="noopener noreferrer"&gt;What is the OpenSSL Engine API that enables integration with cloud HSMs and YubiKeys&lt;/a&gt;. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn3"&gt;
&lt;p&gt;When a trusted third-party timestamp is applied, the date and time of signing become provable, allowing a third party to verify that the certificate was indeed valid at the time of signing. Without a timestamp, the signing date and time are self-reported by the signer, making it impossible for a third party to verify whether the certificate was valid at the time of signing (since it would be possible to set the machine's clock to a past date and sign). Therefore, once the certificate expires, the code signature is also considered invalid. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>github</category>
      <category>security</category>
      <category>windows</category>
      <category>infosec</category>
    </item>
    <item>
      <title>The Missing Guide to Windows Code Signing in CI/CD (GitHub Actions Edition)</title>
      <dc:creator>Katz Sakai</dc:creator>
      <pubDate>Thu, 26 Mar 2026 04:36:15 +0000</pubDate>
      <link>https://dev.to/katz/how-to-set-up-code-signing-for-windows-apps-in-github-actions-cicd-pipelines-21ee</link>
      <guid>https://dev.to/katz/how-to-set-up-code-signing-for-windows-apps-in-github-actions-cicd-pipelines-21ee</guid>
      <description>&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;Code signing Windows binaries is straightforward in theory. In practice, doing it inside GitHub Actions introduces a real constraint: your private key must live on an HSM, and that HSM must be reachable from a CI runner. You can either use an HSM hosted by your certificate authority or bring your own cloud HSM. Each option has different trade-offs in cost, flexibility, and setup complexity.&lt;/p&gt;

&lt;p&gt;This post maps out the full architecture so you can choose the right approach before committing.&lt;/p&gt;

&lt;p&gt;As a more concrete follow-up to what is discussed in this article, I also published a separate article on how to set up a code signing environment on GitHub Actions using Google Cloud KMS and Sectigo:&lt;/p&gt;


&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/katz/building-a-cost-effective-windows-code-signing-pipeline-sectigo-google-cloud-kms-on-github-2ghf" class="crayons-story__hidden-navigation-link"&gt;Building a Cost-Effective Windows Code Signing Pipeline with Sectigo, Google Cloud KMS, and GitHub Actions&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/katz" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F192880%2F8756053a-f9dd-4672-9c58-ce517bbd0b3c.png" alt="katz profile" class="crayons-avatar__image" width="256" height="256"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/katz" class="crayons-story__secondary fw-medium m:hidden"&gt;
              Katz Sakai
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                Katz Sakai
                
              
              &lt;div id="story-author-preview-content-3406640" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/katz" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F192880%2F8756053a-f9dd-4672-9c58-ce517bbd0b3c.png" class="crayons-avatar__image" alt="" width="256" height="256"&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;Katz Sakai&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/katz/building-a-cost-effective-windows-code-signing-pipeline-sectigo-google-cloud-kms-on-github-2ghf" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;Mar 26&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/katz/building-a-cost-effective-windows-code-signing-pipeline-sectigo-google-cloud-kms-on-github-2ghf" id="article-link-3406640"&gt;
          Building a Cost-Effective Windows Code Signing Pipeline with Sectigo, Google Cloud KMS, and GitHub Actions
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/github"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;github&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/security"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;security&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/windows"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;windows&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/infosec"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;infosec&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
            &lt;a href="https://dev.to/katz/building-a-cost-effective-windows-code-signing-pipeline-sectigo-google-cloud-kms-on-github-2ghf#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              Comments


              &lt;span class="hidden s:inline"&gt;Add Comment&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            9 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;


&lt;h2&gt;
  
  
  The Big Picture
&lt;/h2&gt;

&lt;p&gt;The overall architecture of the system built on GitHub Actions looks roughly like the diagram below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjlnf21kzhk1tlvoh6qdp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjlnf21kzhk1tlvoh6qdp.png" alt="System Diagram" width="731" height="602"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When integrating a code signing process into a CI/CD pipeline on GitHub Actions, the code signing private key must be stored in a cloud-based HSM (Hardware Security Module) that is accessible from the Windows machine running on GitHub Actions.&lt;sup id="fnref1"&gt;1&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;There are two types of cloud-based HSMs: those provided by a Certificate Authority (CA), and those running in your own cloud environment. With the former, you may be charged based on the annual number of code signing operations, and those costs can be quite high. So, if managing your own infrastructure is not a burden, using your own cloud HSM is the recommended approach.&lt;/p&gt;

&lt;p&gt;It is common practice to attach a timestamp when code signing. Without a timestamp, the date and time of signing is self-reported by the signer, meaning third parties cannot verify whether the certificate was valid at the time of signing (since one could set the machine's clock to a past date and sign). As a result, once the certificate expires, the code signature is also considered invalid.&lt;/p&gt;

&lt;p&gt;In contrast, if a trusted third-party timestamp is attached, the signing date and time become verifiable, allowing third parties to confirm that the certificate had not been revoked at the time of signing. This means the code signature remains valid even after the certificate itself expires.&lt;/p&gt;
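&lt;p&gt;On a Windows machine with the Windows SDK installed, you can check whether a binary's signature carries such a timestamp with SignTool (the binary name here is hypothetical):&lt;/p&gt;

```shell
# Verify the Authenticode signature verbosely; when a timestamp is present,
# the output includes a "The signature is timestamped: ..." line.
signtool verify /pa /v signed-app.exe
```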

&lt;p&gt;The code signing certificate issued by the CA does not contain sensitive data such as private keys. Therefore, it can be placed on the GitHub Actions Runner as a file.&lt;/p&gt;

&lt;h2&gt;
  
  
  General Steps for Obtaining a Code Signing Certificate
&lt;/h2&gt;

&lt;p&gt;First and foremost, you need to have a code signing certificate issued.&lt;br&gt;
Here are the key points for this process:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The code signing private key must be generated on a FIPS 140-2 Level 2 compliant HSM and stored within the HSM (the private key must not be exportable from the HSM).

&lt;ul&gt;
&lt;li&gt;This is an industry requirement for how private keys must be handled. If this requirement is not met, the CA will not issue a code signing certificate.&lt;sup id="fnref1"&gt;1&lt;/sup&gt;
&lt;/li&gt;
&lt;li&gt;While it is technically possible under the requirements to store the private key on a FIPS 140-2 Level 2 compliant USB security token, GitHub Actions Runners cannot access USB devices, so this option is not viable for this architecture.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;The CA issues the code signing certificate against the public key corresponding to the private key.

&lt;ul&gt;
&lt;li&gt;In other words, the CA is certifying that "this public key indeed belongs to organization XXX."&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Once a private key is generated, it does not need to be regenerated as long as its security is not compromised. This means you can continue using the same key when renewing the certificate without creating a new one.&lt;/li&gt;
&lt;li&gt;Certificates have an expiration date, so they must be reissued each time the expiration arrives.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you generate the private key on your own cloud HSM, the general steps are as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Generate the code signing private key on a FIPS 140-2 Level 2 compliant HSM and store it within the HSM.

&lt;ul&gt;
&lt;li&gt;As long as the key's security has not been compromised, there is no need to regenerate it. This means that when reissuing the code signing certificate in subsequent years, you can start from step 2.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Obtain the Attestation for the private key.

&lt;ul&gt;
&lt;li&gt;A private key Attestation is data that allows a third party (in this case, the CA) to verify that the private key was generated in a trusted environment and has not been tampered with.&lt;/li&gt;
&lt;li&gt;The Attestation is downloaded from the HSM.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Create a CSR (Certificate Signing Request) using the private key.

&lt;ul&gt;
&lt;li&gt;The CSR contains the public key and a signature created with the private key, so the certificate issued by the CA is linked to the key pair through this CSR.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Submit both the CSR and the Attestation to the CA to have the code signing certificate issued.

&lt;ul&gt;
&lt;li&gt;The CA verifies the integrity of the private key based on the Attestation and then issues the code signing certificate based on the CSR.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Receive the certificate file from the CA and store it somewhere safe.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Step 3, creating the CSR, is typically done on a local Linux machine using an OpenSSL command like the following.&lt;br&gt;
For information on how to use OpenSSL with cloud HSMs, please refer to &lt;a href="https://zenn.dev/inop/articles/c04c452ce645fc" rel="noopener noreferrer"&gt;What is the OpenSSL Engine API that enables integration between OpenSSL and cloud HSMs or YubiKey&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PKCS11_MODULE_PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/tmp/libkmsp11-1.6-linux-amd64-fips/libkmsp11.so
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;KMS_PKCS11_CONFIG&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/tmp/pkcs11-config.yaml

openssl req &lt;span class="nt"&gt;-new&lt;/span&gt; &lt;span class="nt"&gt;-subj&lt;/span&gt; &lt;span class="s1"&gt;'/CN=example.com/'&lt;/span&gt; &lt;span class="nt"&gt;-sha256&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-key&lt;/span&gt; pub.pem &lt;span class="nt"&gt;-engine&lt;/span&gt; pkcs11 &lt;span class="nt"&gt;-keyform&lt;/span&gt; engine &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-key&lt;/span&gt; pkcs11:object&lt;span class="o"&gt;=&lt;/span&gt;sign_key_name &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; cert-request.csr
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If generating the private key and creating the CSR yourself seems too cumbersome, using an HSM provided by the CA may allow some of these steps to be semi-automated.&lt;/p&gt;

&lt;h2&gt;
  
  
  Code Signing Procedure
&lt;/h2&gt;

&lt;p&gt;Once the certificate has been issued, you need to set up the code signing environment on the Windows machine that serves as the GitHub Actions runner.&lt;br&gt;
Here are the key points for setting up the environment:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Code signing is performed using SignTool.exe, which is included in the Windows SDK (it is pre-installed on GitHub-hosted Windows machines).&lt;sup id="fnref2"&gt;2&lt;/sup&gt;
&lt;/li&gt;
&lt;li&gt;You need to install a library on the GitHub Actions Windows Runner that allows SignTool.exe to delegate its signing operations to the cloud-based HSM.

&lt;ul&gt;
&lt;li&gt;To connect with Google Cloud's HSM, install the Google Cloud CNG Provider.&lt;/li&gt;
&lt;li&gt;To connect with a CA-provided HSM, install the library provided by the respective CA:

&lt;ul&gt;
&lt;li&gt;DigiCert: &lt;a href="https://docs.digicert.com/en/software-trust-manager/client-tools/cryptographic-libraries-and-frameworks/ksp-library.html" rel="noopener noreferrer"&gt;https://docs.digicert.com/en/software-trust-manager/client-tools/cryptographic-libraries-and-frameworks/ksp-library.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;SSL.com: &lt;a href="https://www.ssl.com/guide/code-signing-automation/" rel="noopener noreferrer"&gt;https://www.ssl.com/guide/code-signing-automation/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Simply installing the library is not enough — you also need to configure which credentials to use for cloud access and which key to use, following the setup instructions for each library.

&lt;ul&gt;
&lt;li&gt;For example, with the Google Cloud CNG Provider, ADC (Application Default Credentials) is used for authenticating with Google Cloud, and the key to use is defined in a YAML configuration file.&lt;/li&gt;
&lt;li&gt;When connecting to proprietary HSM environments such as DigiCert or SSL.com, refer to the documentation published by each company.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once the environment is set up, all that remains is to call SignTool.exe to perform the code signing.&lt;br&gt;
The following is an example command for signing with a Sectigo certificate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;signtool.exe sign /v /fd sha256 /t http://timestamp.sectigo.com /f path/to/mysigncscertificate.crt /csp "Google Cloud KMS Provider" /kc projects/PROJECT_ID/locations/LOCATION/keyRings/KEY_RING/cryptoKeys/KEY_NAME/cryptoKeyVersions/1 path/to/file-tobe-signed.exe
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
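&lt;p&gt;In a GitHub Actions workflow, the pieces above might be wired together roughly as follows. This is a sketch: the action versions, identity pool, and file paths are illustrative, and the step that installs the Google Cloud KMS CNG Provider MSI is omitted:&lt;/p&gt;

```yaml
# Hypothetical workflow excerpt (names, versions, and paths are illustrative).
jobs:
  sign:
    runs-on: windows-latest
    steps:
      - uses: actions/checkout@v4
      # Obtain Google Cloud credentials (ADC) for the CNG provider.
      - uses: google-github-actions/auth@v2
        with:
          workload_identity_provider: projects/123456/locations/global/workloadIdentityPools/POOL/providers/PROVIDER
          service_account: signer@PROJECT_ID.iam.gserviceaccount.com
      # (Assumes the Google Cloud KMS CNG Provider MSI is already installed.)
      - run: |
          signtool.exe sign /v /fd sha256 /t http://timestamp.sectigo.com `
            /f path\to\mysigncscertificate.crt /csp "Google Cloud KMS Provider" `
            /kc projects/PROJECT_ID/locations/LOCATION/keyRings/KEY_RING/cryptoKeys/KEY_NAME/cryptoKeyVersions/1 `
            path\to\file-tobe-signed.exe
```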






&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;Prior to 2023, it was possible to store private keys in PEM files. However, due to ongoing private key leakage incidents — exemplified by the &lt;a href="https://www.malwarebytes.com/blog/news/2022/03/stolen-nvidia-certificates-used-to-sign-malware-heres-what-to-do" rel="noopener noreferrer"&gt;NVIDIA code signing private key leak&lt;/a&gt; — the industry now requires that private keys be stored on an HSM when issuing code signing certificates. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;You can view the list of software pre-installed on Runners at &lt;a href="https://github.com/actions/runner-images" rel="noopener noreferrer"&gt;https://github.com/actions/runner-images&lt;/a&gt;. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>github</category>
      <category>security</category>
      <category>windows</category>
      <category>infosec</category>
    </item>
    <item>
      <title>Why Rails App Memory Bloat Happens: Causes and Solutions (2025 Edition)</title>
      <dc:creator>Katz Sakai</dc:creator>
      <pubDate>Thu, 26 Mar 2026 04:15:01 +0000</pubDate>
      <link>https://dev.to/katz/why-rails-app-memory-bloat-happens-causes-and-solutions-2025-edition-3g61</link>
      <guid>https://dev.to/katz/why-rails-app-memory-bloat-happens-causes-and-solutions-2025-edition-3g61</guid>
      <description>&lt;p&gt;When running a Rails application in a production environment for a while, you may encounter a phenomenon where memory usage increases unexpectedly. In July 2025, I investigated the cause of this behavior and considered countermeasures, and this article summarizes my findings.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR: Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The reason a Rails application "appears to be constantly consuming memory" is the design of glibc's malloc, which Ruby uses: it holds onto freed memory internally for future reuse instead of returning it to the OS. This is not a typical memory leak.&lt;/li&gt;
&lt;li&gt;Since Ruby 3.3.0, you can optimize the heap by calling &lt;code&gt;Process.warmup&lt;/code&gt; once a Rails application has finished booting. However, because this mechanism is meant to run exactly at that point, it is of little help for reducing memory usage in Rails applications that have already been running for a long time.&lt;/li&gt;
&lt;li&gt;Setting the environment variable &lt;code&gt;MALLOC_ARENA_MAX=2&lt;/code&gt; remains an effective way to reduce memory usage without rewriting any application code. This setting stops glibc from creating arena after arena (memory pool after memory pool) and makes it reuse memory within the existing pools, so glibc accumulates less unreturned free memory.&lt;/li&gt;
&lt;li&gt;
&lt;del&gt;Switching to jemalloc, which used to be a common recommendation, should now be avoided because jemalloc is no longer being maintained.&lt;/del&gt; Update Apr 3rd, 2026: Meta has recently announced a renewed commitment to jemalloc&lt;sup id="fnref1"&gt;1&lt;/sup&gt;. If that leads to active releases again, jemalloc may become an option again.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why Does Memory Usage in Rails Apps Appear to Keep Growing?
&lt;/h2&gt;

&lt;p&gt;Hongli Lai's article "What causes Ruby memory bloat?" covers this in detail.&lt;br&gt;
&lt;/p&gt;
&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__content"&gt;
        &lt;div class="c-embed__cover"&gt;
          &lt;a href="https://www.joyfulbikeshedding.com/blog/2019-03-14-what-causes-ruby-memory-bloat.html" class="c-link align-middle" rel="noopener noreferrer"&gt;
            &lt;img alt="" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.joyfulbikeshedding.com%2Fimages%2F2019%2Fruby_memory_bloat_banner-f0bdf762.png" height="246" class="m-0" width="800"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="c-embed__body"&gt;
        &lt;h2 class="fs-xl lh-tight"&gt;
          &lt;a href="https://www.joyfulbikeshedding.com/blog/2019-03-14-what-causes-ruby-memory-bloat.html" rel="noopener noreferrer" class="c-link"&gt;
            What causes Ruby memory bloat? – Joyful Bikeshedding
          &lt;/a&gt;
        &lt;/h2&gt;
          &lt;p class="truncate-at-3"&gt;
            Ruby apps can use a lot of memory. But why? Various people in the community attribute it to memory fragmentation, and provide two “hacky” solutions. Dissatisfied by the current explanations and provided solutions, I set out on a journey to discover the deeper truth and to find better solutions.

          &lt;/p&gt;
        &lt;div class="color-secondary fs-s flex items-center"&gt;
            &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.joyfulbikeshedding.com%2Fimages%2Ffavicon-1b152eac.png" width="24" height="24"&gt;
          joyfulbikeshedding.com
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;


&lt;p&gt;According to that article, the reasons memory bloat "appears to occur" are as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The previously held belief that "heap page fragmentation on the Ruby side" was the primary cause of increased memory usage was not actually the main factor.&lt;/li&gt;
&lt;li&gt;The true cause was that "glibc's memory allocator, malloc, retains memory that Ruby has freed instead of returning it to the OS, holding onto it for future use." In particular, free pages that are not at the end of the heap are not returned to the OS, so unused memory continues to accumulate internally. From the OS's perspective, this makes it look like "Ruby keeps consuming memory."&lt;/li&gt;
&lt;li&gt;Calling glibc's &lt;code&gt;malloc_trim(0)&lt;/code&gt; ensures that memory freed by Ruby is returned to the OS, effectively reducing the process's memory usage (RSS) as seen by the OS.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Though there is one important caveat:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Memory that Ruby has allocated and freed during processing is usually fragmented. Calling &lt;code&gt;malloc_trim(0)&lt;/code&gt; does not resolve the fragmentation; it merely returns the fragmented regions to the OS as-is.&lt;/li&gt;
&lt;li&gt;Even if the memory is fragmented, it is still returned to the OS, so Ruby's memory usage (RSS) goes down. However, because other programs cannot allocate contiguous regions from fragmented free memory, an OOM (Out of Memory) error can occur even when there appears to be free memory available.&lt;/li&gt;
&lt;li&gt;Since returning fragmented memory to the OS does not make it easy to reuse effectively, malloc is designed to retain allocated but unused memory internally and reuse it, enabling stable allocation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is one of the reasons "Ruby has freed the memory, but malloc does not readily return it to the OS."&lt;sup id="fnref2"&gt;2&lt;/sup&gt;&lt;/p&gt;
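&lt;p&gt;If you want to observe this behavior yourself, glibc's &lt;code&gt;malloc_trim&lt;/code&gt; can be called directly from Ruby through the standard &lt;code&gt;fiddle&lt;/code&gt; library. The sketch below is a diagnostic experiment rather than a production recommendation, and it assumes a glibc-based system (elsewhere the symbol lookup fails and the rescue branch runs):&lt;/p&gt;

```ruby
require "fiddle"

# Sketch: ask glibc to return freed memory to the OS by calling
# malloc_trim(0) through Fiddle. Assumes glibc; on other libcs the
# symbol lookup raises Fiddle::DLError and we fall through.
released = nil
begin
  malloc_trim = Fiddle::Function.new(
    Fiddle::Handle::DEFAULT["malloc_trim"],
    [Fiddle::TYPE_INT],  # pad: bytes to leave untrimmed at the top of the heap
    Fiddle::TYPE_INT     # returns 1 if memory was released, 0 otherwise
  )
  released = malloc_trim.call(0)
  puts "malloc_trim(0) returned #{released}"
rescue Fiddle::DLError
  puts "malloc_trim is not provided by this libc"
end
```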

&lt;h2&gt;
  
  
  What Memory-Related Improvement Was Added in Ruby 3.3.0?
&lt;/h2&gt;

&lt;p&gt;Ruby 3.3.0 introduced the &lt;code&gt;Process.warmup&lt;/code&gt; method. This method is intended to signal to the Ruby virtual machine from an application server that "the application's startup sequence has completed, making this an optimal time to perform GC and memory optimization."&lt;sup id="fnref3"&gt;3&lt;/sup&gt;&lt;br&gt;
When &lt;code&gt;Process.warmup&lt;/code&gt; is called, the Ruby virtual machine performs the following optimizations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Forces a major GC&lt;/li&gt;
&lt;li&gt;Compacts the heap&lt;/li&gt;
&lt;li&gt;Promotes all surviving objects to the old generation&lt;/li&gt;
&lt;li&gt;Pre-computes string coderanges (to speed up future string operations)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This cleans up objects and caches that were generated during application startup but are no longer needed, improving memory sharing efficiency in Copy-on-Write (CoW) environments.&lt;/p&gt;

&lt;p&gt;Also, because unnecessary objects have already been collected and the heap has been compacted, malloc-side fragmentation is likely lower at this point. This makes it an ideal time to call &lt;code&gt;malloc_trim(0)&lt;/code&gt;, and a patch that calls &lt;code&gt;malloc_trim(0)&lt;/code&gt; internally within &lt;code&gt;Process.warmup&lt;/code&gt; has been merged.&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_github-liquid-tag"&gt;
  &lt;h1&gt;
    &lt;a href="https://github.com/ruby/ruby/pull/8451" rel="noopener noreferrer"&gt;
      &lt;img class="github-logo" alt="GitHub logo" src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg"&gt;
      &lt;span class="issue-title"&gt;
        Process.warmup: invoke `malloc_trim` if available
      &lt;/span&gt;
      &lt;span class="issue-number"&gt;#8451&lt;/span&gt;
    &lt;/a&gt;
  &lt;/h1&gt;
  &lt;div class="github-thread"&gt;
    &lt;div class="timeline-comment-header"&gt;
      &lt;a href="https://github.com/casperisfine" rel="noopener noreferrer"&gt;
        &lt;img class="github-liquid-tag-img" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Favatars.githubusercontent.com%2Fu%2F19192189%3Fv%3D4" alt="casperisfine avatar"&gt;
      &lt;/a&gt;
      &lt;div class="timeline-comment-header-text"&gt;
        &lt;strong&gt;
          &lt;a href="https://github.com/casperisfine" rel="noopener noreferrer"&gt;casperisfine&lt;/a&gt;
        &lt;/strong&gt; posted on &lt;a href="https://github.com/ruby/ruby/pull/8451" rel="noopener noreferrer"&gt;&lt;time&gt;Sep 15, 2023&lt;/time&gt;&lt;/a&gt;
      &lt;/div&gt;
    &lt;/div&gt;
    &lt;div class="ltag-github-body"&gt;
      &lt;p&gt;Similar to releasing free GC pages, releasing free malloc pages reduce the amount of page faults post fork.&lt;/p&gt;
&lt;p&gt;NB: Some popular allocators such as &lt;code&gt;jemalloc&lt;/code&gt; don't implement it, so it's a noop for them.&lt;/p&gt;

    &lt;/div&gt;
    &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/ruby/ruby/pull/8451" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;


&lt;p&gt;An important point is that &lt;code&gt;Process.warmup&lt;/code&gt; is not automatically called behind the scenes like GC. It is the kind of method that should be explicitly called at an appropriate time on the application server side when a major GC would be acceptable (e.g., before forking, before worker startup). Therefore, there may not always be an appropriate time to call it in long-running Rails applications.&lt;/p&gt;
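&lt;p&gt;That said, forking application servers do offer such a moment. With Puma in cluster mode, for example, one plausible place is the &lt;code&gt;before_fork&lt;/code&gt; hook in &lt;code&gt;config/puma.rb&lt;/code&gt; (a sketch, assuming Ruby 3.3 or later, with a guard for older Rubies):&lt;/p&gt;

```ruby
# config/puma.rb (excerpt) -- a hypothetical sketch, not the only valid placement.
preload_app!
workers 2

# In cluster mode this hook runs in the master process after the app has
# booted, just before workers fork: a natural end-of-boot moment for
# Process.warmup. The guard keeps Rubies older than 3.3 working.
before_fork do
  Process.warmup if Process.respond_to?(:warmup)
end
```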

&lt;h2&gt;
  
  
  Reducing Memory Bloat in Long-Running Rails Apps
&lt;/h2&gt;

&lt;p&gt;So how can you prevent memory bloat without using &lt;code&gt;Process.warmup&lt;/code&gt; or &lt;code&gt;malloc_trim(0)&lt;/code&gt;?&lt;/p&gt;

&lt;p&gt;&lt;del&gt;Online resources have recommended using jemalloc, a smarter memory allocator. However, &lt;a href="https://github.com/jemalloc/jemalloc" rel="noopener noreferrer"&gt;jemalloc's repository was archived in June 2025&lt;/a&gt;, and it does not appear to be actively maintained. It is best to avoid adopting it for new projects.&lt;/del&gt; Update Apr 3rd, 2026: Meta has recently announced a renewed commitment to jemalloc. If that leads to active releases again, jemalloc may become an option again.&lt;/p&gt;

&lt;p&gt;As an alternative, setting the environment variable &lt;code&gt;MALLOC_ARENA_MAX=2&lt;/code&gt; remains effective, for the following reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It reduces the number of arenas (memory management regions) that glibc allocates. glibc's malloc allocates numerous arenas as needed to prevent contention when multiple threads request memory simultaneously (normally, on 64-bit systems, the upper limit is 8 times the number of vCPU cores on the machine).&lt;/li&gt;
&lt;li&gt;As described above, glibc's memory allocator tends to hold on to memory instead of returning it to the OS. Therefore, the more arenas there are, the more "unreturned free memory" accumulates internally.&lt;/li&gt;
&lt;li&gt;Limiting the number of arenas can reduce the amount of memory glibc keeps, though it may slightly increase contention between threads during memory allocation.&lt;/li&gt;
&lt;/ul&gt;
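&lt;p&gt;Applying the setting is just a matter of exporting an environment variable before the Ruby process starts. In a Kubernetes Deployment manifest, for example, it might look like this (the container name is a placeholder):&lt;/p&gt;

```yaml
# Hypothetical Deployment excerpt: set MALLOC_ARENA_MAX for the Rails container.
spec:
  template:
    spec:
      containers:
        - name: rails-app        # placeholder name
          env:
            - name: MALLOC_ARENA_MAX
              value: "2"         # glibc reads this at process startup
```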

&lt;p&gt;The article below suggests that &lt;code&gt;MALLOC_ARENA_MAX=2&lt;/code&gt; can cut memory usage noticeably while increasing response time by only a few percent.&lt;br&gt;
&lt;a href="https://www.speedshop.co/2017/12/04/malloc-doubles-ruby-memory.html" rel="noopener noreferrer"&gt;https://www.speedshop.co/2017/12/04/malloc-doubles-ruby-memory.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Additionally, &lt;code&gt;MALLOC_ARENA_MAX=2&lt;/code&gt; is the default setting on Heroku, which suggests it is a relatively safe configuration.&lt;br&gt;
&lt;a href="https://devcenter.heroku.com/changelog-items/1683" rel="noopener noreferrer"&gt;https://devcenter.heroku.com/changelog-items/1683&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So why is &lt;code&gt;MALLOC_ARENA_MAX=2&lt;/code&gt; enough?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ruby has a GVL (Global VM Lock), so only one thread can execute Ruby code at any given time. Even if the application has many threads, only a few of them are likely to be allocating memory simultaneously, so glibc does not need to maintain many arenas; a small number (around &lt;code&gt;2&lt;/code&gt;) is enough to serve the active threads. For this reason, setting &lt;code&gt;MALLOC_ARENA_MAX=2&lt;/code&gt; usually does not cause problems, while reducing the amount of freed memory glibc keeps across multiple arenas.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you want to test it more carefully, compare memory usage and response time with the value unset, then with 2, 3, and 4, and see which works best for your app.&lt;/p&gt;

&lt;p&gt;We compared the memory usage per Pod before and after setting &lt;code&gt;MALLOC_ARENA_MAX=2&lt;/code&gt;. The solid line shows usage after the setting was applied, and the dashed line shows usage before. The difference is clear.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvm479lu8mxlbjfqh4zt6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvm479lu8mxlbjfqh4zt6.png" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;&lt;a href="https://engineering.fb.com/2026/03/02/data-infrastructure/investing-in-infrastructure-metas-renewed-commitment-to-jemalloc/" rel="noopener noreferrer"&gt;Investing in Infrastructure: Meta’s Renewed Commitment to jemalloc&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;A proposal was made to "call &lt;code&gt;malloc_trim(0)&lt;/code&gt; when a full GC is performed in Ruby to return memory to the OS," but it was not implemented because returning fragmented memory to the OS provides little benefit since the OS cannot effectively utilize it. &lt;a href="https://bugs.ruby-lang.org/issues/15667#note-10" rel="noopener noreferrer"&gt;Feature #15667: Introduce malloc_trim(0) in full gc cycles - Ruby - Ruby Issue Tracking System&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn3"&gt;
&lt;p&gt;The background behind the introduction of &lt;code&gt;Process.warmup&lt;/code&gt; is explained in &lt;a href="https://bugs.ruby-lang.org/issues/18885" rel="noopener noreferrer"&gt;Feature #18885: End of boot advisory API for RubyVM - Ruby - Ruby Issue Tracking System&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>ruby</category>
      <category>rails</category>
      <category>performance</category>
      <category>linux</category>
    </item>
  </channel>
</rss>
