Introduction
GC design has always had to wrestle with a hard trade-off: high throughput, low latency, and low memory usage are usually difficult to get at the same time.
Traditional approaches usually improve one side by paying somewhere else. If you want shorter pauses, you typically move more work into concurrent phases, and sometimes make ordinary object access more expensive. If you want higher throughput, you usually try to keep the steady-state fast path as cheap as possible, which means more work gets pushed into collection time. And if you want lower memory usage, you need to reclaim, compact, and return memory more aggressively.
What makes .NET's Satori GC interesting is not that it pushes one of those familiar strategies to the extreme. It starts by asking a different question: does the most frequent part of collection really have to be a global problem?
What a GC Actually Does
For a managed runtime, the GC's job is conceptually simple:
- Find the objects that are still alive
- Reclaim objects that are no longer reachable
- Move live objects when necessary to reduce fragmentation and keep allocation efficient
The third part is where things get complicated. Once objects move, every reference to them has to stay correct. To make that possible, the GC has to pause user threads at certain points. That whole-program pause is what we usually call a stop-the-world (STW) pause.
Generations
The reason a GC does not have to scan the entire heap every time is that it leans on a very practical observation: most objects die young.
That is why modern GCs are usually generational:
- Gen 0: newly allocated objects, and the ones most likely to die quickly
- Gen 1: objects that survived a few collections, but are not really old yet
- Gen 2: long-lived objects that have already proven they tend to stick around
The point of generations is to keep the most frequent collections focused on the youngest part of the heap. If most short-lived objects die in Gen 0, most GC cycles never need to touch older objects at all.
Write Barriers and Card Tables
Generational GC has another crucial problem to solve: old objects can reference young objects.
If the next collection only reclaims Gen 0, but the GC does not know that some Gen 2 object contains a reference into Gen 0, it could mistakenly collect an object that is still live.
So the runtime needs an incremental way to record those updates. The common solution looks like this:
- When the program writes an object reference, it also performs a tiny extra action called a write barrier
- The runtime marks the relevant memory area as worth rechecking later, typically through a card table
That lets the GC avoid rescanning the entire older generation during a young-generation collection. It only needs to inspect the areas that were dirtied by the write barrier.
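The card-table idea above can be sketched in a few lines. This is a conceptual model, not Satori's actual implementation: the heap is divided into fixed-size "cards", and the write barrier dirties the card covering the written object, so a young-generation collection only rescans dirty cards. The card size here is an assumption for illustration.

```python
CARD_SIZE = 512  # bytes covered by one card (an assumed value, for illustration)

class CardTable:
    def __init__(self, heap_size):
        # One dirty flag per card covering the heap.
        self.dirty = bytearray((heap_size + CARD_SIZE - 1) // CARD_SIZE)

    def write_barrier(self, obj_addr):
        # Called after every reference store: mark the covering card as dirty.
        self.dirty[obj_addr // CARD_SIZE] = 1

    def dirty_cards(self):
        # A young-gen collection scans only these cards, not the whole old gen.
        return [i for i, d in enumerate(self.dirty) if d]

table = CardTable(heap_size=4096)
table.write_barrier(1300)   # an old object at address 1300 now points into Gen 0
table.write_barrier(1400)   # same card as 1300, so no extra work is recorded
print(table.dirty_cards())  # -> [2]
```

The point of the sketch is the asymmetry: the write barrier is a single cheap store, while the cost of finding old-to-young references is deferred to collection time and bounded by the number of dirty cards.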
The Impossible Triangle
The painful part of GC design is that high throughput, low latency, and low memory usage are constantly pulling against each other.
If you optimize for low latency, you usually move more work into concurrent phases so the application and the GC can run at the same time. But concurrency is not free. To keep things correct, you often need more complicated barriers, stricter invariants, and sometimes more extra space.
If you optimize for throughput, you usually want the everyday path to stay as cheap as possible. Allocating an object, reading a reference, and updating a reference should not all carry a heavy tax. The downside is that if you avoid paying much during normal execution, more work piles up for collection time, and pauses can get longer.
If you optimize for low memory usage, you typically need to collect, compact, and release memory more aggressively. But those actions also consume time and resources.
So in a real sense, every GC is answering the same question: where do I want to pay the cost?
What Satori Is Trying to Do
Satori is not a toy collector detached from the real .NET runtime. It plugs into the actual runtime interfaces, which means it is trying to reorganize collection under real constraints, not in a simplified laboratory model.
Its goals are straightforward:
- Minimize tuning, and ideally adapt automatically to the workload
- Avoid long pauses
- Stay fully usable under real-world feature constraints
That last point matters. A GC can look elegant if it simply deletes all the hard parts. Satori does not do that. It still has to support internal pointers, finalizers, weak references, dependent handles, collectible types, precise root scanning, and conservative root scanning.
In other words, Satori is not dodging real-world scenarios. It is reorganizing GC work while still living in the real world.
Page, Region, and Generations
Satori organizes the heap around two core abstractions: Page and Region.
- Page: a larger reservation unit
- Region: a smaller unit inside a Page, better suited to independent management
A single Page can contain many Regions, which gives the GC a much finer-grained unit to reason about.
Region is especially important in Satori, because it is not just a physical subdivision of the heap. It is also the:
- Allocation unit
- Thread-local ownership unit
- Local Gen 0 collection unit
- Planning unit for relocation and compaction
- Unit used when reclaiming and returning free memory
Many GCs split the heap into smaller pieces, but those pieces are often mostly scheduling conveniences. In Satori, Region is a core abstraction.
Satori also keeps compact metadata around Page and Region. A Page contains structures such as a card table, a Region map, and coarser-grained card-group information. A Region contains bitmaps that record important object state, such as whether an object is marked, has escaped, or is pinned. That metadata is what makes thread-local collection, escape tracking, and local compaction practical later on.
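The per-Region object-state metadata described above can be modeled as a few parallel bit sets indexed by object slot. This is a deliberately simplified model (Satori's real layout is far more compact); it only illustrates that each Region carries enough local metadata to reason about its own objects without touching global structures.

```python
class Region:
    def __init__(self, num_slots):
        self.marked = [False] * num_slots   # live during the current collection?
        self.escaped = [False] * num_slots  # reachable from outside this thread?
        self.pinned = [False] * num_slots   # must not be moved by compaction?

    def is_movable(self, slot):
        # A live, unpinned object may be relocated during local compaction.
        return self.marked[slot] and not self.pinned[slot]

r = Region(num_slots=4)
r.marked[0] = True
r.pinned[0] = True   # live but pinned: must stay put
r.marked[1] = True   # live and movable
print([r.is_movable(s) for s in range(4)])  # -> [False, True, False, False]
```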
Turning Gen 0 into a Local Problem
Satori's most important idea can be summarized in one sentence: if a batch of newly allocated objects mostly exists for a short time on a single thread, why should collecting them require stopping the world?
Satori still uses per-thread fast allocation contexts for small-object allocation, so the normal allocation path remains direct. But it adds an important twist: threads do not simply grab space from a global heap whenever they need it. They try to allocate within Regions they already own.
If the current Region still has space, allocation just continues. If the Region is nearly full, Satori's first reaction is not "this chunk is exhausted, hand it to global GC and fetch a new one." It first asks whether that Region is a good candidate for local cleanup.
That is based on a pattern that shows up constantly in real programs:
- The objects were just created
- Their lifetimes are short
- They mostly stay within the current thread
If a Region looks like that, collecting it does not need to become a global problem. The current thread can perform a limited Gen 0 collection for that Region by itself. That is Satori's thread-local Gen 0.
An ordinary Gen 0 collection still thinks in whole-process terms: what are all the threads doing, where can old objects point into young ones, which card tables need to be scanned, and which global structures need to be synchronized?
Satori's thread-local Gen 0 shrinks the problem down to:
- Which objects the current thread still holds on its own stack
- Which objects in the current Region have established relationships with the outside world
That smaller scope is what makes local collection realistic.
Escape Tracking
Satori uses escape tracking to decide whether a Region can continue behaving like a thread-local one.
An object may start out thread-local and later escape. For example, it might:
- Be stored in a global cache
- Be attached to an object another thread can reach
- Be queued as work that another thread will process later
Once that happens, the object is no longer purely local to the thread. It now has an external connection. That is what escape means here.
Thread-local collection only pays off if most of the objects in a Region really do remain local to the current thread. If more and more objects escape, the Region loses that locality, and local collection becomes less attractive.
So Satori does not pretend thread-locality lasts forever. It tracks escape explicitly. Once an object escapes, Satori records more than just that object. It also follows that object's references within the current Region, so anything in the Region that is reachable through the escaped object is accounted for too.
If only a small number of objects escape, thread-local collection can still be well worth doing, because the Region is still mostly owned by one thread. The real question is what to do when escape becomes common.
Satori uses a practical threshold: once escape grows beyond a certain point, the Region is no longer treated as thread-local Gen 0. It gets moved into more global generational management instead.
That threshold is there for a good reason:
- If the collector gave up on thread-local Gen 0 at the first sign of escape, it would lose many cases where local collection is still a big win
- If it insisted on thread-local Gen 0 no matter how much escape occurred, local collection would become less and less worthwhile
Satori aims for the balance point between those two extremes.
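The escape-tracking behavior described above can be sketched as follows. The threshold value and the bookkeeping are hypothetical; the two things the sketch shows are real to the design: escape is recorded transitively through the escaped object's references within the Region, and a Region stops being treated as thread-local once the escaped fraction crosses a limit.

```python
ESCAPE_LIMIT = 0.25  # hypothetical fraction; Satori's actual policy differs

class LocalRegion:
    def __init__(self, objects):
        self.refs = objects   # obj -> list of objects it references in this Region
        self.escaped = set()

    def mark_escaped(self, obj):
        # Transitively mark everything reachable from the escaped object.
        stack = [obj]
        while stack:
            o = stack.pop()
            if o in self.escaped:
                continue
            self.escaped.add(o)
            stack.extend(self.refs.get(o, []))

    def still_thread_local(self):
        return len(self.escaped) / len(self.refs) < ESCAPE_LIMIT

region = LocalRegion({"a": ["b"], "b": [], "c": [], "d": []})
region.mark_escaped("a")            # "a" is published; "b" is reachable from it
print(sorted(region.escaped))       # -> ['a', 'b']
print(region.still_thread_local())  # 2 of 4 objects escaped -> False
```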
Thread-Local Collection
Once you understand thread-local Regions and escape tracking, the thread-local collection itself makes more sense.
It has a chance to be fast not because it magically does less work in the abstract, but because it works over a much smaller scope with a much smaller root set.
1. Decide Whether Collection Is Worth It
Satori does not immediately run a local collection the moment space gets tight. It first decides whether doing so is likely to pay off. For example:
- Was the last local collection too recent?
- Has escape already become too high?
- Has the surviving data in this Region already grown too large for a small local cleanup to be attractive?
If those signals suggest the local collection would not be a good trade, Satori skips it and lets a more global path take over.
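The three signals above can be combined into a simple gate. The signal names and cutoff values here are hypothetical, chosen only to mirror the questions in the list; they are not Satori's actual tuning constants.

```python
import time

def should_collect_locally(last_gc_time, escaped_fraction, survivor_bytes,
                           now=None,
                           min_interval=0.005,        # assumed 5 ms between local GCs
                           max_escape=0.25,           # assumed escape cutoff
                           max_survivors=64 * 1024):  # assumed survivor-size cutoff
    now = time.monotonic() if now is None else now
    if now - last_gc_time < min_interval:
        return False  # too soon after the last local collection
    if escaped_fraction > max_escape:
        return False  # Region has lost too much locality
    if survivor_bytes > max_survivors:
        return False  # too much live data for a cheap local cleanup
    return True

print(should_collect_locally(last_gc_time=0.0, escaped_fraction=0.1,
                             survivor_bytes=8 * 1024, now=1.0))  # -> True
```

If the gate says no, allocation simply proceeds on the more global path, exactly as described above.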
2. Mark from a Much Smaller Root Set
If the Region is still a good fit for thread-local collection, the root set is quite limited. It mainly consists of:
- Objects on the current thread's stack that still point into that Region
- Objects that have already escaped, plus the objects reachable from them within the Region
That is very different from a global Gen 0 collection. A global Gen 0 has to reason about all managed threads, references from older generations into younger ones, and a variety of global structures. Thread-local collection only has to look at the current thread's stack and the already-escaped part of the object graph inside the current Region.
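Marking from that small root set is an ordinary reachability walk, just over a tiny graph. The object names and graph shape below are invented for illustration:

```python
def mark(roots, refs):
    # Standard tracing: everything reachable from the roots is live.
    live, stack = set(), list(roots)
    while stack:
        o = stack.pop()
        if o in live:
            continue
        live.add(o)
        stack.extend(refs.get(o, []))
    return live

refs = {"req": ["ctx"], "ctx": [], "tmp1": ["tmp2"], "tmp2": [], "cached": []}
stack_roots = ["req"]       # still referenced from the current thread's stack
escaped_roots = ["cached"]  # previously published, so it must be kept alive
print(sorted(mark(stack_roots + escaped_roots, refs)))
# -> ['cached', 'ctx', 'req']   ("tmp1" and "tmp2" are garbage)
```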
3. Plan Compaction
After marking live objects, the job still is not finished, because the Region may now contain many holes.
At that point, Satori decides:
- Which objects are still live
- Where those live objects should move to make the layout denser
- Which references will need to be updated afterward
Conceptually, this is the planning phase: figure out what survives, where it should go, and which references need rewriting.
4. Update References and Compact Locally
Only then does the actual compaction happen. If objects move to a denser layout, every reference to them has to be updated as well. Once that is done, the Region becomes contiguous again, and future allocation stays smooth.
And importantly, Satori is compacting a small Region, not the whole heap. That is why it has a real shot at keeping pauses small.
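Steps 3 and 4 together amount to sliding compaction over one Region: compute a forwarding address for each live object, then rewrite references through the forwarding table. The addresses and sizes below are invented, and Satori's real planner is far richer; this only shows the shape of the plan.

```python
def plan_compaction(objects, live):
    # objects: list of (name, addr, size) in address order within one Region.
    forwarding, cursor = {}, 0
    for name, addr, size in objects:
        if name in live:
            forwarding[name] = cursor  # new, denser address after sliding
            cursor += size
    return forwarding, cursor  # cursor == bytes in use after compaction

objs = [("a", 0, 16), ("b", 16, 32), ("c", 48, 16), ("d", 64, 16)]
fwd, used = plan_compaction(objs, live={"a", "c"})
print(fwd)   # -> {'a': 0, 'c': 16}
print(used)  # -> 32

# Reference fixup: any field that pointed at "c" is rewritten to its new address.
new_addr_of_c = fwd["c"]  # 16
```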
A Simple Example
Imagine a web request being handled on thread A. Processing that request creates a lot of short-lived objects: request context data, route-matching results, temporary parse structures, short-lived strings, and a few temporary lists.
In Satori, those objects will likely be allocated into a Region currently owned by thread A.
If the request finishes and none of those objects gets stored into a global cache or handed off to another thread, then when space in that Region starts running low, thread A can simply perform a local collection, reclaim the dead objects, and keep allocating in the same Region.
If some objects are published to a global cache, or packaged into work that gets handed to the thread pool, then those objects have escaped.
If only a small amount of that happens, Satori can still preserve the Region's thread-local character, because most of the Region is still primarily used by thread A. But if that sharing keeps growing and escape crosses the threshold, the Region is no longer a good fit for thread-local Gen 0. It leaves that private state and moves onto the more global GC path.
That is the basic strategy: solve the problem locally when locality still exists, and fall back to the global path as soon as locality stops being true.
Global GC
Satori is not only a thread-local collector. It also has a full global GC pipeline to handle cases such as:
- Regions with heavy escape
- Older objects
- Global memory pressure
- Relocation and compaction across Regions
The point is not to do only local collection. The point is to pull the most frequent and most localizable work out of the global path first, and leave genuinely global work to the global collector.
Global GC Phases
A global GC still has to do the familiar jobs:
- Mark the objects that are still alive
- Decide which Regions are worth compacting and which are worth moving
- Update references
- Move and compact when needed
Satori's goal is not to erase those phases. It is to let as much of that work as possible happen concurrently with the application. It is not chasing an absolute promise of zero pauses. It is trying to avoid piling heap-size-dependent work into a blocking phase whenever possible.
Optional Relocation
This is where the philosophical difference between Satori and another class of low-latency collectors becomes easier to see.
Some ultra-low-pause GCs keep pause times very stable because they are willing to pay a higher steady-state cost in exchange for making objects concurrently movable at almost any time. In other words, ordinary execution has to live under stronger rules all the time.
Satori does not take that route by default. It treats relocation and compaction as important, but it does not require unconditional concurrent relocation as a universal rule. Relocation at the Region level is an optional capability, not an always-on mandate.
The trade-off is fairly clear:
- If you insist that any object must be movable concurrently at any moment, the normal execution path usually gets more expensive
- If you treat relocation as something you do when needed, the normal execution path has a better chance to stay cheap
Satori chooses the second option. That is one of the reasons it can preserve throughput.
Letting Application Threads Help Advance Collection
Satori also includes a very practical idea: application threads can help advance collection work.
If the program allocates memory extremely quickly while concurrent GC work is not keeping up, garbage will accumulate. If that gap grows too large, the runtime may end up needing a heavy blocking phase just to catch up.
Satori's answer is simple: when it detects that risk, the threads doing allocation will also perform a bit of GC progress work themselves.
That has two benefits:
- It keeps allocation from completely outrunning collection
- It reduces the chance that the runtime has to recover with one large pause at the end
So Satori is not spreading concurrency overhead evenly across every object access. It prefers to ask application threads for extra help when things are actually at risk of falling behind. That is one of the key techniques it uses to balance throughput and low latency.
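The assist idea can be sketched as a debt model: each allocating thread owes a proportional amount of GC progress, and pays it down before allocating when it falls behind. The ratio and accounting here are invented for illustration; the article only states that allocating threads perform some GC work when collection risks falling behind.

```python
class AssistingAllocator:
    def __init__(self, assist_ratio=0.5):  # assumed ratio of GC work to allocation
        self.alloc_bytes = 0   # bytes allocated since the cycle began
        self.gc_credit = 0     # bytes' worth of GC work already performed
        self.assist_ratio = assist_ratio

    def allocate(self, size):
        self.alloc_bytes += size
        debt = self.alloc_bytes * self.assist_ratio - self.gc_credit
        if debt > 0:
            self.do_gc_work(debt)  # mutator helps before it outruns the GC
        return size  # stand-in for the real allocation

    def do_gc_work(self, amount):
        self.gc_credit += amount  # e.g. mark or sweep roughly `amount` bytes

alloc = AssistingAllocator()
alloc.allocate(1000)
print(alloc.gc_credit)  # -> 500.0
alloc.allocate(1000)
print(alloc.gc_credit)  # -> 1000.0
```

The effect is exactly the two benefits listed above: allocation cannot run arbitrarily far ahead of collection, and the debt is paid in small slices rather than one large terminal pause.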
Low Memory Usage
Low-latency GCs often consume more memory, and that is not surprising. Keeping pauses short usually means:
- More concurrent phases
- More buffering
- More conservative memory reservation
- More metadata to coordinate correctness
Satori has a chance to keep memory usage down because it is not only rethinking how collection works. It is also attacking the problem from three directions at once.
1. Let Short-Lived Garbage Die Young
If an object only exists briefly on one thread, the ideal outcome is for it to die in thread-local collection before it ever ages into an older generation.
That directly helps in two ways:
- Older generations do not get polluted with large amounts of short-lived garbage
- Later global GC cycles have less surviving data to scan, move, and compact
The thread-local Gen 0 mechanism is already laying the groundwork for a smaller memory footprint.
2. Explicitly Return Free Memory to the OS
Satori also includes a dedicated thread whose job is to consolidate free Regions and return memory.
It is not the main collection engine. Its job is more practical than that:
- Identify Regions that are empty enough to reclaim
- Try to merge adjacent free Regions
- Return memory that no longer needs to stay committed back to the operating system
More importantly, it does this in a rate-limited way. It controls how aggressively it scans and returns memory so the system does not end up thrashing just to save a little footprint.
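The rate-limiting can be sketched as a per-pass byte budget: each pass decommits at most a fixed amount and defers the rest. The budget value is invented; in a real implementation the "return" step would call something like `madvise` or `VirtualFree`.

```python
def return_memory(free_regions, budget_bytes):
    # free_regions: list of (start, size) for fully free Regions.
    returned, remaining = [], []
    for start, size in free_regions:
        if budget_bytes >= size:
            returned.append((start, size))   # decommit this Region now
            budget_bytes -= size
        else:
            remaining.append((start, size))  # defer to a later pass
    return returned, remaining

free = [(0, 4096), (8192, 4096), (16384, 4096)]
done, left = return_memory(free, budget_bytes=8192)
print(done)  # -> [(0, 4096), (8192, 4096)]
print(left)  # -> [(16384, 4096)]
```

Metering the work this way is what keeps footprint trimming from turning into page-fault thrash when the application immediately needs the memory back.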
3. Keep Metadata Overhead Under Control
The GC has to manage more than just objects. It also carries a lot of auxiliary state. If that metadata grows without restraint, overall memory usage can still look bad even if object reclamation is otherwise efficient.
Satori has a neat trick here: on 64-bit systems, it reuses otherwise-underutilized space near the object header for temporary information, such as linkage needed during local compaction or forwarding information after relocation, instead of immediately allocating large side tables for everything.
Solving the Triangle: High Throughput, Low Latency, and Low Memory Usage
Once you put the pieces together, Satori's logic becomes much easier to see.
Why It Can Deliver Low Latency
Because it turns the most frequent young-object collection from a global coordination problem into a thread-local one.
As long as objects mostly stay on the thread that created them, a collection only needs to reason about one small Region, that thread's own stack, and a limited set of already-escaped objects, instead of stopping all managed threads to cooperate.
Why It Can Deliver High Throughput
Because it does not start by making every object access more expensive.
Satori's main strategy is:
- Localize the most common short-lived garbage first
- Use concurrent global GC plus mutator assistance to absorb larger pressure
Local collection also reduces the chance that short-lived garbage gets promoted into older generations. That, in turn, reduces later scanning and compaction pressure, which helps throughput too.
Why It Can Also Deliver Low Memory Usage
Because it is not relying on one trick. It combines several effects:
- Let short-lived garbage die earlier, before it pollutes older generations
- Explicitly return unused committed memory
- Keep metadata from ballooning
So Satori's small memory footprint is not the result of one isolated optimization. It is the outcome of the whole design working together.
How It Compares with Other GCs
When comparing GC designs, the most important question is usually not which collector won a specific benchmark. It is where each collector chooses to pay its costs.
Workstation GC and Server GC
These two are variations on the same architectural line, not two completely unrelated algorithms.
Both are generational collectors. Both rely on write barriers and card tables to track references from older objects into younger ones. The small object heap is still structured as Gen 0, Gen 1, and Gen 2, and older-generation collection still involves heavier marking, planning, and compaction work.
Workstation GC is better suited to lighter, more interactive scenarios. Its characteristics include:
- A more single-heap-oriented design and more restrained resource usage
- Gen 0 and Gen 1 still happening as foreground STW collections
- Gen 2 being able to run in the background, but with limited overall parallelism
Its strengths are maturity and relatively conservative resource usage. Its downside is also clear: under high concurrency and high allocation rates, the throughput ceiling shows up earlier.
Server GC keeps the same basic model but turns up the parallelism. It provisions separate heaps and stronger GC-thread resources for each logical processor, which makes it a better fit for server workloads. The trade-offs are usually:
- Larger heaps
- More threads
- Heavier overall resource usage
But the key common point remains: the most frequent Gen 0 and Gen 1 collections are still fundamentally part of the global STW path. The biggest difference with Satori is that Satori attacks that hottest path first.
DATAS
DATAS is not a brand-new GC architecture. It is a policy layer on top of the existing Server GC.
It is trying to answer questions such as:
- How much heap budget should this application get?
- How many heaps should it use?
- How should Gen 0 growth be controlled?
- How can heap size better track the true volume of long-lived data?
So DATAS changes policy, not mechanism. It makes the existing Server GC smarter, but it does not change the fact that the most frequent young-generation collection still runs on the global path.
Satori is addressing a different question: can the hottest young-object collection avoid becoming a global operation in the first place?
G1
G1 also manages the heap in Regions, but it uses them for a different purpose than Satori does.
G1's core model is:
- Split the heap into many fixed-size Regions
- Use cross-Region remembered sets to record references between Regions
- Use snapshot-at-the-beginning write barriers to support concurrent marking
- Copy surviving objects out of Regions selected for collection
In other words, G1's Regions primarily serve global balancing and pause-time control. Which Regions enter the collection set, which take part in mixed collection, and which participate in object copying are all driven by global scheduling.
Satori also uses Region, but it pushes the abstraction further. A Region is not just a scheduling unit. It is also the boundary for thread-local ownership, escape tracking, and local collection. That is the biggest philosophical difference between Satori and G1.
ZGC and Shenandoah
These low-latency collectors choose a different route.
What they have in common is that they are willing to pay for stronger runtime machinery on the normal execution path in exchange for more stable concurrent relocation. But the exact mechanisms are different.
ZGC's core design encodes some GC state directly into pointers and pairs that with read barriers. Generational ZGC adds more write-time barrier machinery on top. The goal is very clear: make concurrent relocation the default so STW time does not grow with heap size.
Shenandoah is closer to object indirection plus concurrent evacuation and compaction. It maintains an extra level of indirection per object to support concurrent relocation. Its cost profile is not identical to ZGC's, but the principle is similar: stronger runtime machinery in exchange for stronger low-pause behavior.
Satori does not take that route by default. It does not assume objects must always be movable concurrently. Instead, it makes relocation optional and localizes the hottest Gen 0 collections. That means it does not need to pay, by default, for an extra check on every object read the way ZGC does, or an extra level of indirection on every object the way Shenandoah does.
So the difference between Satori and collectors like ZGC or Shenandoah is not really about which one is more aggressive. It is about where the cost gets paid: on ordinary access paths, or in how the collector structures and scopes reclamation.
Conclusion
What makes Satori interesting is not simply that it is a concurrent GC. The more interesting point is that it rethinks how the most frequent part of collection should work.
Its core idea can be condensed into four lines:
- Keep short-lived objects local whenever possible
- Fall back to the global path quickly once objects start being broadly shared
- Keep global GC as concurrent as possible without forcing every ordinary execution path to pay for always-on relocation
- Keep memory usage low not just through faster reclamation, but also through active memory return and disciplined metadata design
If this direction matures, the significance may be larger than just "one more experimental GC." It could give .NET a meaningfully different direction for GC design.