amir

Posted on Jun 2

Demystifying the Linux Page Cache: The Kernel Optimization Hiding Behind Every Fast I/O

#linux #kernel #backend #performance

For the last few months, I have been spending a lot of time reading and researching Linux kernel internals.

Not just at the surface.

I mean going deeper into how the kernel actually manages memory, file I/O, processes, namespaces, cgroups, filesystems, and all the invisible mechanisms that make our backend systems feel fast.

As backend engineers, we usually talk about performance in terms of application code:

optimize the query
add Redis
reduce allocations
use a better index
improve concurrency
tune the API response time

And all of those things matter.

But after working on production systems for years, I have learned that sometimes the real bottleneck is not inside your application code.

Sometimes the real story is happening below your code, inside the operating system.

One of the most important examples of that is the Linux Page Cache.

It is one of those kernel features that quietly improves almost every backend system we run. It makes file reads faster, batches writes, reduces disk pressure, and gives us the illusion that storage is much faster than it actually is.

But it also comes with trade-offs.

And if you do not understand those trade-offs, you can easily misread performance numbers, misunderstand memory usage, or even risk losing data in the wrong failure scenario.

This article is my attempt to explain the Linux Page Cache from a backend engineer’s point of view.

Not as a kernel developer writing C inside mm/filemap.c.

But as someone who builds systems on top of Linux and wants to understand what the kernel is really doing behind the scenes.

Why Page Cache Exists

The reason Page Cache exists is simple:

Disk is slow. RAM is fast.

Even with modern SSDs and NVMe drives, accessing persistent storage is still much slower than accessing memory.

When our application reads from a file, the kernel could theoretically go to the disk every single time.

But that would be extremely inefficient.

So Linux uses available memory as a cache for file data.

That cache is called the Page Cache.

When a process reads a file, Linux usually does not just copy data directly from disk to the application and forget about it.

Instead, the kernel loads file data into memory pages and keeps those pages around.

So the next time the same file data is requested, Linux can serve it directly from RAM instead of touching the disk again.

That is the basic idea.

But the impact is huge.

Imagine a web server reading the same static files again and again.

Or a database repeatedly touching the same data files.

Or a log processor scanning files that were recently written.

Without Page Cache, every access would become much more expensive.

With Page Cache, many of those reads become memory-speed operations.

The First Important Idea: Linux Uses Free RAM Aggressively

One thing that confuses many developers is Linux memory usage.

You run free -h, and it looks like most of your RAM is used.

At first, it feels scary.

But often, that memory is not “wasted” or permanently consumed by applications.

A large part of it may be used by the kernel for cache.

Linux has a very practical philosophy:

Empty RAM is wasted RAM.

So if memory is available, the kernel uses it to cache useful data.

If an application later needs more memory, Linux can reclaim cache pages and give that memory back.

This is why a Linux server may look like it is using a lot of RAM even when your applications are not actually consuming that much.

The kernel is trying to help you.

It is using memory to avoid unnecessary disk I/O.

That is Page Cache doing its job.
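To make the difference between "free" and "available" concrete, here is a minimal sketch that reads /proc/meminfo, the file that free gets its numbers from. The helper names parse_meminfo and summarize are my own, and adding Cached and Buffers is only a rough approximation of what free reports as buff/cache.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: number} (most fields are in kB)."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def summarize(fields):
    """Return (free_kb, cache_kb, available_kb); missing fields count as 0."""
    cache_kb = fields.get("Cached", 0) + fields.get("Buffers", 0)
    return fields.get("MemFree", 0), cache_kb, fields.get("MemAvailable", 0)


if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        free_kb, cache_kb, available_kb = summarize(parse_meminfo(f.read()))
    print(f"free:      {free_kb // 1024} MiB (truly unused)")
    print(f"cache:     {cache_kb // 1024} MiB (Cached + Buffers, approximate)")
    print(f"available: {available_kb // 1024} MiB (kernel's estimate for new workloads)")
```

On a busy server, "free" is often small while "available" stays large. That gap is mostly reclaimable cache.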

What Actually Happens During a File Read?

Let’s say your application calls read() on a file.

At a high level, Linux checks whether the requested file data already exists in the Page Cache.

There are two possible outcomes.

Cache Hit

If the data is already in memory, the kernel can copy it to your application without going to disk.

This is fast.

Very fast compared to storage.

Cache Miss

If the data is not in memory, Linux has to read it from disk.

But after reading it, the kernel stores it in the Page Cache.

So the first read may be slower, but future reads can be much faster.

This is one reason benchmarks can be misleading.

If you run the same file-read benchmark twice, the second run may look much faster because the data is already cached.

Your application did not magically become better.

The kernel just remembered the file data.

The Kernel Also Predicts What You Might Read Next

Linux does not only cache what you already read.

It also tries to predict what you are going to read.

For sequential file access, the kernel may perform readahead.

That means when you read one part of a file, Linux may load the next parts into memory before you explicitly ask for them.

This is extremely useful for workloads like:

reading large logs
streaming files
scanning datasets
serving static assets
processing backups
importing CSV files

From the application’s point of view, it may feel like the disk is faster than expected.

But in reality, the kernel is doing smart work in the background.

It sees a pattern and tries to stay ahead of your application.
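The kernel detects sequential access on its own, but an application can also state its intent. The sketch below assumes Linux and Python's os.posix_fadvise, and uses POSIX_FADV_SEQUENTIAL as a hint. The function name read_sequentially is hypothetical, and the hint is advisory, so the kernel is free to ignore it.

```python
import os


def read_sequentially(path, chunk_size=1 << 20):
    """Read a file front to back after hinting sequential access; return bytes read."""
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        # Advisory only: offset 0 and length 0 mean "the whole file".
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        while True:
            chunk = os.read(fd, chunk_size)
            if not chunk:
                break
            total += len(chunk)
    finally:
        os.close(fd)
    return total
```

Whether this helps depends on the workload and the storage device, so I would measure before relying on it.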

Writes Are Even More Interesting

Reads are easy to understand:

If data is cached, serve it from RAM.

Writes are more subtle.

When your application writes data to a file, Linux usually does not immediately force that data to physical storage.

Instead, the kernel writes the data into memory and marks the affected pages as dirty.

A dirty page means:

This page has been modified in memory, but the change has not necessarily been persisted to disk yet.

This is where the kernel gives your application a very useful illusion.

Your application calls write().

The kernel accepts the data.

The call returns successfully.

But that does not always mean the data is already safely stored on disk.

It may only mean the data is now in the kernel’s memory and scheduled to be written later.

This is one of the biggest performance tricks in the operating system.

And also one of the most important trade-offs.

Why Delayed Writes Are So Powerful

Imagine an application appending tiny log entries to a file thousands of times per second.

If Linux forced every small write to disk immediately, performance would be terrible.

Storage devices are much better at handling larger, sequential writes than many tiny random writes.

So Linux delays and combines writes.

Many small writes can be collected in memory and later flushed to disk in a more efficient way.

This process is called writeback.

The kernel runs background flusher threads that periodically write dirty pages to storage.

This gives us much better throughput.

Instead of turning every write() call into an expensive physical I/O operation, Linux turns many small writes into fewer, larger writes.

That is a huge win for performance.

The Dangerous Part: `write()` Is Not `fsync()`

This is where many bugs and misunderstandings happen.

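A successful write() does not, by itself, mean your data is durable.

By default, it only means the kernel accepted your data into the Page Cache.

If the machine loses power or the kernel crashes before dirty pages are flushed, some recently written data may be lost. An application crash alone does not lose it, because the data already lives in kernel memory.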

That is why databases, queues, and storage engines care so much about fsync().

When durability matters, you need to force the kernel to flush data to stable storage.

For example:

write(fd, data, size);
fsync(fd);

write() says:

Please accept this data.

fsync() says:

Now make sure it is actually persisted.

This distinction is extremely important when building systems where data loss is not acceptable.

For normal logs, delayed writeback may be fine.

For a financial transaction, it is not something you can ignore.
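Here is the same write-then-fsync idea as a minimal Python sketch. The function name durable_append and the journal file in the usage note are hypothetical. It loops because os.write may write fewer bytes than requested, and it returns only after os.fsync has succeeded.

```python
import os


def durable_append(path, data):
    """Append bytes to path and return only after fsync() has succeeded."""
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        view = memoryview(data)
        while view:
            written = os.write(fd, view)  # may be a short write
            view = view[written:]
        os.fsync(fd)  # raises OSError if the flush fails
    finally:
        os.close(fd)
```

Calling durable_append("journal.log", b"entry\n") pays the cost of a real flush on every call, which is exactly the trade-off described above. For a newly created file, the directory entry may also need an fsync on the parent directory, and this sketch does not do that.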

Why Databases Behave Differently

Databases like PostgreSQL, MySQL, RocksDB, and others are very careful with disk I/O.

They know the kernel is caching data.

They know writes may be delayed.

They know crashes can happen.

So they use techniques like:

write-ahead logging
fsync
checkpoints
direct I/O in some configurations
controlled flushing
careful ordering of writes

A database cannot simply trust that because write() returned successfully, everything is safe.

It needs stronger guarantees.

This is also why database performance tuning is complicated.

You are not only tuning SQL queries.

You are tuning the interaction between:

application
database engine
filesystem
Linux kernel
Page Cache
storage device
cloud virtualization layer

That stack is deep.

And Page Cache sits right in the middle of it.

Page Cache and `mmap`

Another important part of this story is mmap.

With normal file I/O, you call read() and write().

With mmap, a file can be mapped into a process’s virtual memory.

Then the application can access file data almost like normal memory.

But behind the scenes, Page Cache is still involved.

When the process first touches a memory-mapped page, a page fault occurs. The kernel loads the corresponding file data into the Page Cache if it is not already there, and then maps that page into the process.

This is powerful because it can reduce copying and make file access feel very natural from the application side.

But it also means that memory-mapped I/O is deeply connected to the kernel’s virtual memory system.

This is where the boundary between “file” and “memory” becomes very thin.

And that is one of the reasons Linux I/O is such a fascinating topic.
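A small sketch of what this looks like from user space, using Python's mmap module. The function name count_newlines_mmap is my own. The file is never passed to read(); the scan touches mapped memory, and the kernel faults pages in from the Page Cache as needed.

```python
import mmap
import os


def count_newlines_mmap(path):
    """Count newline bytes by scanning a read-only memory mapping of the file."""
    with open(path, "rb") as f:
        if os.fstat(f.fileno()).st_size == 0:
            return 0  # an empty file cannot be mapped
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            count = 0
            pos = mm.find(b"\n")
            while pos != -1:
                count += 1
                pos = mm.find(b"\n", pos + 1)
            return count
```

This is only an illustration of the access pattern. Whether mmap beats read() for a given workload is something to measure, not assume.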

Page Cache Can Make Benchmarks Lie

When I started looking deeper into kernel internals, one of the first practical lessons was this:

Never trust a file I/O benchmark unless you understand the cache state.

For example, if you benchmark reading a large file, the first run may include real disk I/O.

The second run may mostly hit Page Cache.

So it looks much faster.

But that does not necessarily mean your code improved.

It may only mean the file is already cached.

This matters when testing:

file parsers
log processors
database imports
backup tools
search indexing jobs
media processing pipelines

If you want to test cold disk performance, you need to be intentional.

If you want to test warm cache performance, that is also valid.

But you should know which one you are measuring.

Otherwise, you are not benchmarking your application.

You are benchmarking a combination of your application and the kernel’s memory state.
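One way to be intentional about cache state for a single file is sketched below. It assumes Linux and uses os.posix_fadvise with POSIX_FADV_DONTNEED to ask the kernel to drop that file's cached pages before timing a read. The helper names are hypothetical, the advice is best effort, and on a memory-backed filesystem such as tmpfs there is no disk to be cold on.

```python
import os
import tempfile
import time


def evict_from_page_cache(path):
    """Ask the kernel to drop this file's cached pages (advisory, best effort)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)  # dirty pages cannot be dropped, so flush them first
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)


def timed_read(path, chunk_size=1 << 20):
    """Read the whole file sequentially; return (bytes_read, seconds)."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total, time.perf_counter() - start


if __name__ == "__main__":
    # Demo file of 64 MiB; results depend heavily on the filesystem and device.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(1 << 20) * 64)
    try:
        evict_from_page_cache(tmp.name)
        _, cold = timed_read(tmp.name)
        _, warm = timed_read(tmp.name)
        print(f"cold: {cold:.3f}s  warm: {warm:.3f}s")
    finally:
        os.unlink(tmp.name)
```

Reporting both numbers, and saying which is which, is usually more honest than reporting a single figure.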

Page Cache Can Also Hurt You

Page Cache is usually helpful.

But not always.

There are cases where it can create problems.

1. Cache Pollution

Imagine a server running a database.

Most of the time, the database benefits from hot data staying in memory.

Now imagine a backup process reads a huge 500 GB file sequentially.

That large read can fill the Page Cache with data that may never be used again.

As a result, more important cached pages may be evicted.

This is called cache pollution.

The kernel tries to manage this intelligently, but no heuristic is perfect.
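A one-pass job can also try to clean up after itself. The sketch below is one possible approach, not something the kernel or any particular backup tool does for you: it checksums a file and, after each chunk, uses POSIX_FADV_DONTNEED to tell the kernel those pages are no longer needed. The function name checksum_with_drop_behind is hypothetical and the advice is best effort.

```python
import hashlib
import os


def checksum_with_drop_behind(path, chunk_size=8 << 20):
    """SHA-256 a file while asking the kernel to drop each chunk after use."""
    digest = hashlib.sha256()
    fd = os.open(path, os.O_RDONLY)
    try:
        offset = 0
        while True:
            chunk = os.read(fd, chunk_size)
            if not chunk:
                break
            digest.update(chunk)
            # Advisory: we only read these pages, so they are clean and droppable.
            os.posix_fadvise(fd, offset, len(chunk), os.POSIX_FADV_DONTNEED)
            offset += len(chunk)
    finally:
        os.close(fd)
    return digest.hexdigest()
```

One caveat: if another process genuinely benefits from that file being cached, this drops its pages too, so it fits files that are read once and forgotten.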

2. Memory Pressure

Because Page Cache uses RAM, it competes with applications for memory.

Usually Linux can reclaim cache pages when needed.

But under heavy memory pressure, the system may start doing more reclaim work, causing latency spikes.

For backend services, this can show up as random performance drops.

3. Dirty Page Spikes

If an application writes faster than storage can flush, dirty pages can accumulate.

At some point, Linux may throttle writers to prevent memory from being overwhelmed by dirty data.

From the application’s point of view, write latency may suddenly increase.

This is not because your code changed.

It is because the kernel is protecting the system.
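The thresholds behind this behavior are visible under /proc/sys/vm. The sysctl names in the sketch below are not mentioned elsewhere in this article, but they are standard Linux knobs: dirty_background_ratio is roughly where background flushing starts, and dirty_ratio is roughly where writers themselves get throttled. The function name read_dirty_settings is hypothetical.

```python
from pathlib import Path

VM_DIR = Path("/proc/sys/vm")
KNOBS = (
    "dirty_background_ratio",
    "dirty_ratio",
    "dirty_expire_centisecs",
    "dirty_writeback_centisecs",
)


def read_dirty_settings(vm_dir=VM_DIR):
    """Return {knob: int} for the writeback sysctls that exist under vm_dir."""
    settings = {}
    for name in KNOBS:
        try:
            settings[name] = int((Path(vm_dir) / name).read_text().strip())
        except (OSError, ValueError):
            continue  # knob missing or unreadable on this system
    return settings


if __name__ == "__main__":
    for name, value in read_dirty_settings().items():
        print(f"{name} = {value}")
```

A ratio of 0 may simply mean the byte-based variant of that knob is in use instead. I would treat these values as something to read and understand before thinking about changing them.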

Useful Commands to Observe Page Cache Behavior

You do not need to be a kernel developer to observe some of this behavior.

Linux exposes useful information through /proc and common tools.

Check memory usage

free -h

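Look at the buff/cache column.

That is where you see memory used for kernel buffers and cache. The available column estimates how much memory applications could still get once reclaimable cache is taken into account.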

Check dirty pages

grep -E "^(Dirty|Writeback):" /proc/meminfo

Example fields:

Dirty:              123456 kB
Writeback:             0 kB

Dirty shows memory waiting to be written back.

Writeback shows memory currently being written.

Watch I/O activity

iostat -xz 1

This helps you see disk utilization, await time, and whether the storage device is under pressure.

Watch process I/O

pidstat -d 1

This helps you understand which processes are doing reads and writes.

A Simple Experiment

You can see Page Cache behavior with a simple test.

Create a large file:

dd if=/dev/zero of=testfile bs=1M count=1024

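Now read it. Keep in mind that dd just wrote this file through the Page Cache, so even this first read may already be served from memory: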

time cat testfile > /dev/null

Run the same command again:

time cat testfile > /dev/null

The second run may be faster because the file data is already in Page Cache.

Now, depending on your system and permissions, you can drop caches for testing:

sync
echo 3 | sudo tee /proc/sys/vm/drop_caches

Then read again, this time with a cold cache:

time cat testfile > /dev/null

Important note:

Do not randomly drop caches on production servers.

This is only for controlled testing.

How This Changed the Way I Think About Backend Performance

The more I study Linux internals, the more I realize that backend engineering is not only about writing application code.

A production system is a conversation between many layers:

your code
runtime
memory allocator
database
filesystem
kernel
storage
network
container runtime
orchestration platform

If you only look at your code, you may miss the real reason behind a performance issue.

For example:

An API becomes slow because disk writeback is saturated.
A database has latency spikes because dirty pages are being flushed.
A benchmark looks amazing because everything is cached.
A log processor slows down because it is causing cache pollution.
A container looks memory-heavy because the page cache generated by its own file I/O is charged to its cgroup.

Understanding Page Cache gives you better intuition.

It helps you ask better questions.

It helps you debug production issues with more confidence.

And it reminds you that the operating system is not just a passive layer.

The kernel is constantly making decisions on your behalf.

Practical Lessons for Backend Engineers

Here are the lessons I keep in mind now:

1. Do not panic when Linux uses RAM

High memory usage is not always bad.

Check whether memory is used by applications or by cache.

2. Benchmark carefully

Always understand whether your benchmark is testing cold reads or cached reads.

3. `write()` is not durability

If data must survive a crash, understand fsync() and the durability model of your storage system.

4. Watch dirty pages

Dirty pages can explain sudden write latency spikes.

5. Be careful with large sequential jobs

Backups, imports, and scans can affect cache behavior for other workloads.

6. Learn the kernel slowly

You do not need to understand everything at once.

But each concept you learn gives you better production intuition.

Final Thoughts

The Linux Page Cache is one of the most important performance features in the operating system.

It hides disk latency.

It makes repeated reads fast.

It batches writes.

It uses free memory intelligently.

And it does all of this quietly, behind almost every backend system we deploy.

But like every powerful abstraction, it has trade-offs.

It can make benchmarks misleading.

It can delay durability.

It can create latency spikes under write pressure.

It can affect databases, log processors, and large file workloads in unexpected ways.

For me, studying the Page Cache is part of a bigger journey: understanding Linux not just as a server environment, but as a complex engineering system.

The deeper I go into the kernel, the more respect I have for the invisible work it does every second.

And the more I believe that strong backend engineers should not stop at frameworks, databases, and APIs.

At some point, we also need to understand the machine underneath.

Because sometimes the bug is not in your code.

Sometimes the answer is in the kernel.
