Mirage of memory, part 2: PSS

#linux #postgres #yugabyte #performance

This is a blog series about linux kernel memory management in general, and how it relates to postgres specifically. This part is about the Linux memory statistic 'PSS': proportional set size.

The first part of this series was about RSS: resident set size. That is a fundamentally important statistic, and is the statistic that the OOM-killer (out of memory killer) uses.

However, the resident set size is not the "absolute" truth about memory used, at least on linux. The reason is that linux gives child processes COW (copy on write) access to the memory pages of their parent. (A threaded process inherently shares the memory areas of the process among all of its threads.)

This means that linux applications that fork() or clone() processes can share memory pages between those processes, for as long as the pages are only read. One such application is PostgreSQL:

[vagrant@centos8-pg11 ~]$ ps -ef | grep $(pgrep -f postmaster)
postgres     886       1  0 11:16 ?        00:00:00 /usr/pgsql-11/bin/postmaster -D /var/lib/pgsql/11/data/
postgres     888     886  0 11:16 ?        00:00:00 postgres: logger
postgres     890     886  0 11:16 ?        00:00:00 postgres: checkpointer
postgres     891     886  0 11:16 ?        00:00:00 postgres: background writer
postgres     892     886  0 11:16 ?        00:00:00 postgres: walwriter
postgres     893     886  0 11:16 ?        00:00:00 postgres: autovacuum launcher
postgres     894     886  0 11:16 ?        00:00:00 postgres: stats collector
postgres     895     886  0 11:16 ?        00:00:00 postgres: logical replication launcher

The postmaster process has PID 1 as parent, and all the background processes have the postmaster as parent.

That is all nice, but is there any proof this is happening? The proof can be found in the 'smaps' file in the proc meta-filesystem, which shows the memory allocations of a process, including the PSS figure. This is what that looks like:

[vagrant@centos8-pg11 ~]$ sudo grep -A6 00400000 /proc/886/smaps
00400000-00af7000 r-xp 00000000 fd:00 101318716                          /usr/pgsql-11/bin/postgres
Size:               7132 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:                3128 kB
Pss:                1053 kB
Shared_Clean:       2688 kB

This is the text (code) segment of the postgres executable for the postmaster process, which has a virtual size of 7132 kB. Because of the demand-paging mechanism I described in the previous post, only 3128 kB of it is paged in. And because the postmaster forked several processes which also use this code segment, its proportional use is 1053 kB.
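The three figures discussed above can be pulled out of an smaps entry with a few lines of Python. This is a minimal sketch: the sample text is copied from the excerpt above, and `parse_smaps_entry` is a hypothetical helper name, not part of any existing tool.

```python
import re

# Sample text copied from the smaps excerpt above. On a live system this would
# come from reading /proc/<pid>/smaps, which usually requires root for
# processes owned by another user.
SAMPLE = """\
00400000-00af7000 r-xp 00000000 fd:00 101318716                          /usr/pgsql-11/bin/postgres
Size:               7132 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:                3128 kB
Pss:                1053 kB
Shared_Clean:       2688 kB
"""

def parse_smaps_entry(text):
    """Return the 'Name: <number> kB' fields of one smaps entry as ints (kB)."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"^(\w+):\s+(\d+) kB$", line)
        if match:  # the address-range header line does not match and is skipped
            fields[match.group(1)] = int(match.group(2))
    return fields

entry = parse_smaps_entry(SAMPLE)
print(f"virtual size: {entry['Size']} kB")
print(f"resident:     {entry['Rss']} kB")
print(f"proportional: {entry['Pss']} kB")
```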

This is where things get a bit blurry. A process either uses a memory page or it does not, in other words: it uses the full 4 kB of a page or nothing. The PSS in the above example is 1053 kB. That means: 1053/4=263.25, which suggests a fractional page. A linux process does not use fractional pages of memory.

The kernel simply keeps a counter per memory page for the number of processes that "use" the page, that is, have the page paged in to their process space. When the PSS figure is requested, the size of each page is divided by the number of processes it is shared with, which gives the proportional size of memory instead of the real size.

This means that for actual space usage calculations, you must sum the PSS sizes of all the processes involved for a memory segment, taken at the same moment in time, to get a meaningful figure for the total memory used.
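The per-page division can be sketched in Python as follows. This is a simplified model and not the kernel's actual implementation; the list of per-page counters is made up for illustration, and `pss_kb` and `rss_kb` are hypothetical names.

```python
PAGE_KB = 4  # page size in kB, matching KernelPageSize in the smaps excerpt

def rss_kb(mapcounts, page_kb=PAGE_KB):
    """Resident set size: every resident page counts in full."""
    return len(mapcounts) * page_kb

def pss_kb(mapcounts, page_kb=PAGE_KB):
    """Proportional set size: every resident page counts as page_kb / sharers."""
    return sum(page_kb / count for count in mapcounts)

# Hypothetical process: 2 private pages, 4 pages shared by 2 processes,
# and 3 pages shared by 8 processes. Each number is the counter of one page.
mapcounts = [1, 1] + [2] * 4 + [8] * 3

print(rss_kb(mapcounts))  # 9 pages * 4 kB = 36
print(pss_kb(mapcounts))  # 2*4 + 4*2 + 3*0.5 = 17.5
```

The 17.5 kB result is not a whole number of 4 kB pages, which is the same effect as the 263.25 "pages" in the smaps example.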
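Summing the PSS of one memory segment over several processes could look like the sketch below. The helper names `pss_for_path` and `total_pss_for_path` are hypothetical, the postmaster numbers come from the smaps excerpt earlier in this post, and the checkpointer numbers are made up for the demonstration. Because the files are read one after another, the result is only approximately "at the same moment in time".

```python
import re

# Matches the header line of an smaps entry: address range, permissions,
# offset, device, inode, and an optional path (empty for anonymous mappings).
HEADER = re.compile(r"^[0-9a-f]+-[0-9a-f]+\s+\S+\s+\S+\s+\S+\s+\d+\s*(.*)$")

def pss_for_path(smaps_text, path):
    """Sum the Pss lines (in kB) of all mappings in smaps_text backed by path."""
    total = 0
    current = None
    for line in smaps_text.splitlines():
        header = HEADER.match(line)
        if header:
            current = header.group(1).strip()
        elif line.startswith("Pss:") and current == path:
            total += int(line.split()[1])
    return total

def total_pss_for_path(pids, path):
    """Read /proc/<pid>/smaps for every pid (needs root) and add up the PSS."""
    total = 0
    for pid in pids:
        with open(f"/proc/{pid}/smaps") as smaps:
            total += pss_for_path(smaps.read(), path)
    return total

# Demonstration on text instead of a live /proc.
postmaster = """\
00400000-00af7000 r-xp 00000000 fd:00 101318716   /usr/pgsql-11/bin/postgres
Rss:                3128 kB
Pss:                1053 kB
"""
checkpointer = """\
00400000-00af7000 r-xp 00000000 fd:00 101318716   /usr/pgsql-11/bin/postgres
Rss:                1200 kB
Pss:                 400 kB
7f0000000000-7f0000002000 rw-p 00000000 00:00 0
Rss:                   8 kB
Pss:                   8 kB
"""
binary = "/usr/pgsql-11/bin/postgres"
print(sum(pss_for_path(text, binary) for text in (postmaster, checkpointer)))
```

On the lab machine, a call such as `total_pss_for_path([886, 888, 890], "/usr/pgsql-11/bin/postgres")` would do the same against the real smaps files.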

If you wonder why I am mentioning memory usage calculations: this is something that I have seen being mentioned and tried for as long as I have worked in IT. Every once in a while someone wants to calculate the accurate, actual amount of memory in use by an application.

With PostgreSQL, the biggest taker of memory will most likely be the shared memory segment that it uses as the buffer cache. The buffer cache is explicitly defined as shared, without the somewhat hidden COW page sharing. Also mind that if the buffer cache is sized to take a significant percentage of memory, the page sharing for the executable and library memory segments might be marginal in comparison. However, the same proportional sharing applies to the shared memory segment.

It is a lot of work to inspect all the smaps files in proc. Luckily, there is a utility that can give the PSS figures, without having to go through all the 'smaps' files. That utility is 'smem'. The 'smem' utility is available in EPEL (extra packages for enterprise linux; added via yum install epel-release on most EL based linuxes).

For my lab Alma8 virtual machine with postgres 11, this is what that looks like:

[vagrant@alma-85 ~]$ sudo smem -U postgres -tk
  PID User     Command                         Swap      USS      PSS      RSS
 4727 postgres postgres: logger                   0   152.0K   537.0K     4.4M
 4733 postgres postgres: stats collector          0   144.0K   591.0K     4.7M
 4730 postgres postgres: background writer        0   216.0K     1.0M     6.4M
 4729 postgres postgres: checkpointer             0   272.0K     1.3M     7.3M
 4734 postgres postgres: logical replicati        0   460.0K     1.4M     6.6M
 4732 postgres postgres: autovacuum launch        0   492.0K     1.7M     7.6M
 4731 postgres postgres: walwriter                0   212.0K     2.7M     9.5M
 4724 postgres /usr/pgsql-11/bin/postmaste        0     6.9M    11.2M    24.0M
-------------------------------------------------------------------------------
    8 1                                           0     8.9M    20.5M    70.4M

This utility introduces another abbreviation: USS, which means 'unique set size'. Where PSS counts a proportional share of every shared page, USS counts only memory that is uniquely allocated to a process and leaves shared pages out entirely. Any memory page changed by a process is counted as USS, because the change makes the page unique to that process.

The above smem output shows quite well the double counting that can take place: the count of all RSS memory is 70.4M, whilst the actual allocation on this system for the postgres processes is the PSS total of 20.5M, of which 8.9M (USS) is private to a single process. PSS already includes the unique pages at their full size, so USS should not be added on top of it. I hope my explanation made you realize that, because of this clever page sharing for read-only pages, summing RSS counts shared memory more than once.
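How the three smem columns relate when they are totalled can be shown with a toy model. Everything in this sketch is hypothetical: the process names are borrowed from the listing above, but the page numbers are invented, and the functions are not how smem itself is implemented.

```python
PAGE_KB = 4

# Hypothetical model: each process maps a set of physical page numbers.
processes = {
    "postmaster":   {0, 1, 2, 3, 4, 5},
    "checkpointer": {0, 1, 2, 6},
    "walwriter":    {0, 1, 3, 7},
    "logger":       {0, 1, 8},
}

def mapcount(page):
    """Number of processes that have this physical page mapped."""
    return sum(page in pages for pages in processes.values())

def uss(pages):
    """Unique set size: only pages mapped by this process alone."""
    return sum(PAGE_KB for page in pages if mapcount(page) == 1)

def pss(pages):
    """Proportional set size: each page divided by its number of sharers."""
    return sum(PAGE_KB / mapcount(page) for page in pages)

def rss(pages):
    """Resident set size: every mapped page in full."""
    return len(pages) * PAGE_KB

physical_kb = len(set().union(*processes.values())) * PAGE_KB
total_uss = sum(uss(pages) for pages in processes.values())
total_pss = sum(pss(pages) for pages in processes.values())
total_rss = sum(rss(pages) for pages in processes.values())

print(f"physical memory in use: {physical_kb} kB")
print(f"sum of USS: {total_uss} kB, sum of PSS: {total_pss} kB, sum of RSS: {total_rss} kB")
```

In this model the PSS total equals the physical memory in use, the RSS total is larger because shared pages are counted once per process, and the USS total is smaller because shared pages are not counted at all.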

Top comments (1)

Martin Berger 💉💉💉👌 • Feb 7 '22

For a 2nd "proof this is happening", you can search /proc/<pid>/pagemap for the entry corresponding to the address in smaps. There the page frame number (PFN) is visible.
Now search all other processes for any smaps entry with the corresponding PFN ;)
As this is a lot of work to do, it's nice to have the PSS calculated in smaps already.