<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jarek Przygódzki</title>
    <description>The latest articles on DEV Community by Jarek Przygódzki (@jarekprzygodzki).</description>
    <link>https://dev.to/jarekprzygodzki</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F61821%2Fa9c54e37-11f0-419f-b73a-9100e97e6010.png</url>
      <title>DEV Community: Jarek Przygódzki</title>
      <link>https://dev.to/jarekprzygodzki</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jarekprzygodzki"/>
    <language>en</language>
    <item>
      <title>Monitoring new process creation</title>
      <dc:creator>Jarek Przygódzki</dc:creator>
      <pubDate>Thu, 04 Jul 2019 20:39:36 +0000</pubDate>
      <link>https://dev.to/jarekprzygodzki/monitoring-new-process-creation-568</link>
      <guid>https://dev.to/jarekprzygodzki/monitoring-new-process-creation-568</guid>
      <description>&lt;p&gt;Monitoring process creation and termination events is a useful skill to have in you toolbox. This article consists of two parts. The first introduces exiting tools for diffrent platforms. The second explains how these tools work internally.&lt;/p&gt;

&lt;h1&gt;
  
  
  Introducing tools
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Linux
&lt;/h2&gt;

&lt;h3&gt;
  
  
  forkstat
&lt;/h3&gt;

&lt;p&gt;Forkstat monitors process fork(), exec() and exit() activity. It is mature and available in most distributions' repositories. It uses the Linux netlink connector to gather process activity.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Ubuntu\Debian
sudo apt install forkstat
sudo forkstat -e exec
Time     Event   PID Info   Duration Process
21:20:15 exec   3378                 sleep 10

sudo forkstat -e exec,exit
Time     Event   PID Info   Duration Process
21:21:30 exec   3384                 sleep 10
21:21:40 exit   3384      0  10.003s sleep 10

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Requires root privilege (or CAP_NET_ADMIN capability). Developed by &lt;a href="https://twitter.com/colinianking"&gt;Colin Ian King&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  execsnoop  (eBPF)
&lt;/h3&gt;

&lt;p&gt;execsnoop traces process execution. It works by dynamically tracing the execve() kernel function. &lt;br&gt;
See the bcc &lt;a href="https://github.com/iovisor/bcc/blob/master/INSTALL.md"&gt;installation instructions&lt;/a&gt; for your OS. On Ubuntu, versions of bcc are available in the standard Ubuntu repository as of Ubuntu Bionic (18.04). The tools are installed in &lt;code&gt;/sbin&lt;/code&gt; (&lt;code&gt;/usr/sbin&lt;/code&gt; in Ubuntu 18.04) with a -bpfcc extension.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt-get install bpfcc-tools linux-headers-$(uname -r)

sudo execsnoop-bpfcc 
PCOMM            PID    PPID   RET ARGS
sleep            5380   5379     0 /usr/bin/sleep 10
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Originally created by &lt;a href="http://www.brendangregg.com/"&gt;Brendan Gregg&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  execsnoop
&lt;/h3&gt;

&lt;p&gt;Predecessor of the eBPF-based execsnoop. Still relevant because it has no dependencies other than awk and works on older Linux kernels (3.2+).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;wget https://raw.githubusercontent.com/brendangregg/perf-tools/master/execsnoop \
    -O /usr/local/bin/execsnoop  &amp;amp;&amp;amp; chmod +x /usr/local/bin/execsnoop

execsnoop
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;It does not work on many newer systems; try execsnoop (eBPF) first.&lt;/p&gt;

&lt;h2&gt;
  
  
  Windows
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Process Monitor
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://docs.microsoft.com/en-us/sysinternals/downloads/procmon"&gt;&lt;em&gt;Process Monitor&lt;/em&gt;&lt;/a&gt;, part of &lt;em&gt;Sysinternals Suite&lt;/em&gt; is an advanced monitoring tool for Windows that can be used to keep track of process creation events.  Can be downloaded as standalone executable from project's website or installed with chocolatey package manager: &lt;code&gt;choco install procmon&lt;/code&gt;. Also part of &lt;a href="https://chocolatey.org/packages/sysinternals"&gt;Sysinternals Suite&lt;/a&gt; package. Primarily created by &lt;a href="https://en.wikipedia.org/wiki/Mark_Russinovich"&gt;Mark Russinovich&lt;/a&gt; and Bryce Cogswell&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VtMtj8M9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/rh1v2gogj4saq3hlnhxj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VtMtj8M9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/rh1v2gogj4saq3hlnhxj.png" alt="Process Monitor "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  ProcMonX
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/zodiacon/ProcMonX"&gt;Process Monitor X&lt;/a&gt; (ProcMonX) is a alternative to ProcMon created by &lt;a href="https://twitter.com/zodiacon"&gt;Pavel Yosifovich&lt;/a&gt;. ProcMonX provides information on similar activities to ProcMon, but adds more events, such as networking, ALPC and memory. Can be downloaded as standalone executable &lt;a href="https://github.com/zodiacon/ProcMonX/releases/latest"&gt;from here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wwZhjPef--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/s2d7bpyl8tb78mbcdnl0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wwZhjPef--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/s2d7bpyl8tb78mbcdnl0.png" alt="Process Monitor X"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8MNBV83P--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/qtp8mduq0kprkyz1cuuh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8MNBV83P--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/qtp8mduq0kprkyz1cuuh.png" alt="Process Monitor X"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  PowerShell
&lt;/h3&gt;

&lt;p&gt;Microsoft Scripting Guy &lt;a href="https://edwilson.com/"&gt;Ed Wilson&lt;/a&gt; has shown that PowerShell can be used to monitor process creation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Register-CimIndicationEvent `
    -ClassName Win32_ProcessStartTrace `
    -SourceIdentifier "ProcessStarted"

Get-Event | `
    Select timegenerated, `
        @{L='Executable'; E = {$_.sourceeventargs.newevent.processname}}

TimeGenerated       Executable
-------------       ----------
12.06.2019 22:28:19 ps.exe
12.06.2019 22:29:13 bash.exe
12.06.2019 22:29:13 bash.exe
12.06.2019 22:29:13 bash.exe
12.06.2019 22:29:13 git.exe
12.06.2019 22:30:47 chrome.exe
12.06.2019 22:30:48 chrome.exe

# Cleanup
get-event | Remove-Event
Get-EventSubscriber | Unregister-Event

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;See &lt;a href="https://devblogs.microsoft.com/scripting/use-powershell-to-monitor-for-process-startup/"&gt;this&lt;/a&gt; article for details.&lt;/p&gt;

&lt;h2&gt;
  
  
  macOS
&lt;/h2&gt;

&lt;h3&gt;
  
  
  dtrace
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;sudo newproc.d&lt;/code&gt; will use &lt;a href="http://dtrace.org"&gt;DTrace&lt;/a&gt; to trace all new processes. It won't work if &lt;a href="https://support.apple.com/en-us/HT204899"&gt;System Integrity Protection&lt;/a&gt; is on.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo newproc.d
dtrace: system integrity protection is on, some features will not be available

dtrace: failed to compile script /usr/bin/newproc.d: line 22: probe description proc:::exec-success does not match any probes. System Integrity Protection is on
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;El Capitan introduced a security mechanism called System Integrity Protection to help ensure that no malicious parties can modify the operating system; it also severely limits what DTrace can do.&lt;/p&gt;

&lt;p&gt;SIP has to be partially disabled:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;csrutil enable --without dtrace # disable dtrace restrictions only
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;After a reboot, DTrace works again.&lt;/p&gt;

&lt;h2&gt;
  
  
  How do these tools work?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Forkstat
&lt;/h3&gt;

&lt;p&gt;Forkstat uses the kernel &lt;a href="https://en.wikipedia.org/wiki/Netlink"&gt;Netlink&lt;/a&gt; connector interface to gather process activity. It allows a program to receive notifications of process events such as fork, exec, exit and core dump, as well as changes to a process's name, UID, GID or SID, over a socket connection.&lt;/p&gt;

&lt;p&gt;With default parameters, forkstat reports fork, exec and exit events, but the -e option allows specifying one or more of the fork, exec, exit, core, comm, clone, ptrace, uid, sid or all events. When a fork event happens, forkstat reports the PID and process name of both the parent and the child, allowing one to easily identify where processes originate. Forkstat attempts to track the lifetime of a process and, where possible, logs its duration when it exits. Note that forkstat may miss events if the system is under heavy load. The Netlink connector also requires root privilege (or the CAP_NET_ADMIN capability).&lt;/p&gt;
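&lt;p&gt;The exec/exit correlation forkstat performs is easy to sketch. Below is a minimal, hypothetical Python model of that bookkeeping (the class and method names are mine, not forkstat's): on exec, remember when a PID appeared; on exit, look it up to compute the duration.&lt;/p&gt;

```python
import time

class ProcessTracker:
    """Correlate exec and exit events to report per-process lifetimes,
    in the spirit of forkstat's Duration column (a simplified sketch)."""

    def __init__(self):
        self.alive = {}  # pid -> (start timestamp, command line)

    def on_exec(self, pid, comm, ts=None):
        # Record when this PID started and what it is running.
        self.alive[pid] = (time.monotonic() if ts is None else ts, comm)

    def on_exit(self, pid, ts=None):
        # Pair the exit event with the earlier exec to get a duration.
        started = self.alive.pop(pid, None)
        if started is None:
            return None  # exec happened before we began listening
        begin, comm = started
        end = time.monotonic() if ts is None else ts
        return (comm, end - begin)
```

This also shows why a monitor can only report durations "where possible": a PID that exec'd before the listener attached has no recorded start time.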

&lt;p&gt;Netlink is a Linux kernel IPC mechanism, enabling communication between a userspace process and the kernel, or multiple userspace processes. Netlink sockets are the primitive which enables this communication.&lt;br&gt;
The &lt;code&gt;CONFIG_PROC_EVENTS&lt;/code&gt; kernel option enables the &lt;code&gt;Process Events Connector&lt;/code&gt;, which exposes process events to userland via a Netlink socket. It was introduced in 2005 in &lt;a href="https://lwn.net/Articles/157150/"&gt;this&lt;/a&gt; patch by Matt Helsley.&lt;/p&gt;

&lt;p&gt;Forkstat's source code is &lt;a href="https://github.com/ColinIanKing/forkstat"&gt;here&lt;/a&gt;, but it's very C-like in the sense that it obscures a relatively simple idea.&lt;/p&gt;

&lt;p&gt;To receive process events in userspace we have to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;make a netlink socket and bind it &lt;/li&gt;
&lt;li&gt;send the &lt;code&gt;PROC_CN_MCAST_LISTEN&lt;/code&gt; message to the kernel to let it know we want to receive events&lt;/li&gt;
&lt;li&gt;receive events by reading datagrams from socket&lt;/li&gt;
&lt;li&gt;parse event data and extract the relevant process information
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sock, _ := unix.Socket(unix.AF_NETLINK,
    // used interchangeably with SOCK_RAW
    unix.SOCK_DGRAM, unix.NETLINK_CONNECTOR)
addr := &amp;amp;unix.SockaddrNetlink{
        Family: unix.AF_NETLINK, Groups: C.CN_IDX_PROC, Pid: uint32(os.Getpid())}
unix.Bind(sock, addr)
send(sock, C.PROC_CN_MCAST_LISTEN)
for {
    p := make([]byte, 4096)
    nbytes, from, _ := unix.Recvfrom(sock, p, 0)
    nlmessages, _ := syscall.ParseNetlinkMessage(p[:nbytes])
    for _, m := range nlmessages {
            if m.Header.Type == unix.NLMSG_DONE {
                // netlink uses the host byte order
                cnhdr := (*C.struct_cn_msg)(unsafe.Pointer(&amp;amp;m.Data[0]))
                ptr := uintptr(unsafe.Pointer(cnhdr))
                ptr += unsafe.Sizeof(*cnhdr)
                pe := (*C.struct_proc_event)(unsafe.Pointer(ptr))
                switch pe.what {
                case C.PROC_EVENT_EXEC:
                    e := (*C.struct_exec_proc_event)(unsafe.Pointer(&amp;amp;pe.event_data))
                    fmt.Printf("Process started: PID %d\n", e.process_pid)
                case C.PROC_EVENT_EXIT:
                    e := (*C.struct_exit_proc_event)(unsafe.Pointer(&amp;amp;pe.event_data))
                    fmt.Printf("Process exited: PID %d\n", e.process_pid)
                }
            }
    }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;That's it! The only problem is that &lt;a href="https://github.com/torvalds/linux/blob/v5.0/include/uapi/linux/cn_proc.h#L80"&gt;&lt;code&gt;exec_proc_event&lt;/code&gt;&lt;/a&gt; contains little data. We could try to immediately read process information from &lt;code&gt;/proc/&amp;lt;PID&amp;gt;&lt;/code&gt;, but that wouldn't be reliable (it's racy): by the time we read the process information, the process may have already exited, or another one may even have taken its PID. The full example is &lt;a href="https://gist.github.com/jarek-przygodzki/0fd8b2c12a91d0141ca032794d08c05e"&gt;here&lt;/a&gt;.&lt;/p&gt;
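&lt;p&gt;To make the race concrete, here is a hedged Python sketch of the /proc lookup an event consumer might attempt; the helper name is illustrative. The point is that a vanished PID must be treated as a normal outcome, not an error.&lt;/p&gt;

```python
def read_cmdline(pid):
    """Best-effort read of a process's command line from /proc.
    Returns None when the process has already vanished - exactly the
    race a netlink event consumer has to tolerate."""
    try:
        with open("/proc/%d/cmdline" % pid, "rb") as f:
            raw = f.read()
    except (FileNotFoundError, ProcessLookupError):
        return None  # PID gone (or reused) between the event and our read
    # cmdline is a NUL-separated argument vector.
    return [arg.decode() for arg in raw.split(b"\0") if arg]
```

Note that even a successful read proves nothing: the PID may have been recycled by an unrelated process between the event and the lookup.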
&lt;h3&gt;
  
  
  execsnoop  (eBPF)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/iovisor/bcc/blob/master/tools/execsnoop.py"&gt;execsnoop&lt;/a&gt; is part of BCC. It's a suite tools that use eBPF tracing:  infrastructure to dynamically instrument the kernel. It allows to define programs that run in kernel. Learn about eBPF &lt;a href="http://www.brendangregg.com/blog/2019-01-01/learn-ebpf-tracing.html"&gt;here&lt;/a&gt; or read &lt;a href="https://github.com/iovisor/bcc/blob/master/tools/execsnoop.py"&gt;execsnoop source code&lt;/a&gt;. The only downside is that these tools require new-ish kernel.&lt;/p&gt;
&lt;h3&gt;
  
  
  execsnoop
&lt;/h3&gt;

&lt;p&gt;A hack from Brendan Gregg's &lt;a href="https://github.com/brendangregg/perf-tools"&gt;perf-tools&lt;/a&gt; collection. It traces &lt;code&gt;stub_execve()&lt;/code&gt; or &lt;code&gt;do_execve()&lt;/code&gt; and walks the &lt;code&gt;%si&lt;/code&gt; register as an array of strings. Details are on the author's blog &lt;a href="http://www.brendangregg.com/blog/2014-07-28/execsnoop-for-linux.html"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Process Monitor
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Process Monitor&lt;/em&gt; (ProcMon) installs a kernel driver on startup which performs system-wide monitoring of userland processes. The driver API provides the kernel routines &lt;a href="https://docs.microsoft.com/en-us/windows-hardware/drivers/ddi/content/ntddk/nf-ntddk-pssetcreateprocessnotifyroutine"&gt;&lt;code&gt;PsSetCreateProcessNotifyRoutine&lt;/code&gt;&lt;/a&gt;/&lt;a href="https://docs.microsoft.com/en-us/windows-hardware/drivers/ddi/content/ntddk/nf-ntddk-pssetcreateprocessnotifyroutineex"&gt;&lt;code&gt;PsSetCreateProcessNotifyRoutineEx&lt;/code&gt;&lt;/a&gt; that allow software to monitor process creation and termination events in the Windows kernel. No code here, but &lt;a href="https://github.com/microsoft/Windows-driver-samples/blob/41c29cb92feff490270b4ce31f67d7baddecc457/general/obcallback/README.md"&gt;this&lt;/a&gt; example from the Windows Driver Kit (WDK) 10 is close to what we want.&lt;/p&gt;
&lt;h3&gt;
  
  
  ProcMonX
&lt;/h3&gt;

&lt;p&gt;ProcMonX uses &lt;a href="https://docs.microsoft.com/pl-pl/windows/win32/etw/event-tracing-portal"&gt;Event Tracing for Windows (ETW)&lt;/a&gt;, a diagnostics and logging mechanism that has existed since Windows 2000, through the &lt;a href="https://github.com/microsoft/perfview/blob/master/documentation/TraceEvent/TraceEventLibrary.md"&gt;Microsoft.Diagnostics.Tracing.TraceEvent&lt;/a&gt; library.&lt;/p&gt;
&lt;h3&gt;
  
  
  PowerShell
&lt;/h3&gt;

&lt;p&gt;The PowerShell example uses WMI (&lt;a href="https://en.wikipedia.org/wiki/Windows_Management_Instrumentation"&gt;Windows Management Instrumentation&lt;/a&gt;) and the &lt;a href="https://docs.microsoft.com/en-us/previous-versions/windows/desktop/krnlprov/win32-processstarttrace"&gt;&lt;code&gt;Win32_ProcessStartTrace&lt;/code&gt;&lt;/a&gt; event.&lt;/p&gt;

&lt;p&gt;Creating your own monitoring tool requires only a few lines of code&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/*
 * csc procmon_wmi.cs
 */
using System;
using System.Management;

class ProcessMonitor
{
    static public void Main(String[] args)
    {
        var processStartEvent = 
            new ManagementEventWatcher("SELECT * FROM Win32_ProcessStartTrace");
        var processStopEvent = 
            new ManagementEventWatcher("SELECT * FROM Win32_ProcessStopTrace");

        processStartEvent.EventArrived += 
            new EventArrivedEventHandler(
                delegate (object sender, EventArrivedEventArgs e)
        {
            var processName = e.NewEvent.Properties["ProcessName"].Value;
            var processId = e.NewEvent.Properties["ProcessID"].Value;

            Console.WriteLine("{0} Process started. Name: {1} | PID: {2}", 
                DateTime.Now, processName, processId);
        });

        processStopEvent.EventArrived += 
            new EventArrivedEventHandler(
                delegate (object sender, EventArrivedEventArgs e)
        {
            var processName = e.NewEvent.Properties["ProcessName"].Value;
            var processId = e.NewEvent.Properties["ProcessID"].Value;

            Console.WriteLine("{0} Process stopped. Name: {1} | PID: {2}", 
                DateTime.Now, processName, processId);
        });

        processStartEvent.Start();
        processStopEvent.Start();

        Console.ReadKey();
    }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h2&gt;
  
  
  macOS
&lt;/h2&gt;

&lt;h3&gt;
  
  
  dtrace
&lt;/h3&gt;

&lt;p&gt;DTrace is a dynamic tracing framework for Solaris, macOS and FreeBSD. You can learn more about &lt;a href="http://www.brendangregg.com/dtrace.html"&gt;DTrace Tools&lt;/a&gt; and read the newproc.d source code &lt;a href="https://opensource.apple.com/source/dtrace/dtrace-168/DTTk/Proc/newproc.d.auto.html"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>linux</category>
      <category>windows</category>
      <category>macos</category>
    </item>
    <item>
      <title>Why is my Docker image so large?</title>
      <dc:creator>Jarek Przygódzki</dc:creator>
      <pubDate>Sun, 12 May 2019 12:20:45 +0000</pubDate>
      <link>https://dev.to/jarekprzygodzki/why-is-my-docker-image-so-large-3jja</link>
      <guid>https://dev.to/jarekprzygodzki/why-is-my-docker-image-so-large-3jja</guid>
      <description>&lt;p&gt;Keeping Docker images as small as possible has a lot of practical benefits. But even when following best practices &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;use a stripped-down base image&lt;/li&gt;
&lt;li&gt;utilize multi-stage builds&lt;/li&gt;
&lt;li&gt;don't install what you don't need&lt;/li&gt;
&lt;li&gt;optimize build context&lt;/li&gt;
&lt;li&gt;minimize the number of layers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;some images end up larger than they have to be by inadvertently including files that are not necessary.&lt;/p&gt;

&lt;h1&gt;
  
  
  How to troubleshoot issues with image size?
&lt;/h1&gt;

&lt;p&gt;Image filesystem changes are tracked in layers. Each layer is the representation of the filesystem changes for one instruction in the Dockerfile. Layers of a Docker image are essentially files generated from running some command during &lt;code&gt;docker build&lt;/code&gt; in an ephemeral intermediate container.&lt;/p&gt;

&lt;p&gt;In the past, I used to perform &lt;code&gt;docker history &amp;lt;image name&amp;gt;&lt;/code&gt; to view all the layers that make up the image, manually extract suspicious layers and inspect their content. It worked, but it was tedious.&lt;/p&gt;
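&lt;p&gt;That manual workflow can be partly scripted. As a rough sketch (assuming a layer tarball extracted from &lt;code&gt;docker save&lt;/code&gt; output; the function name is mine), Python's tarfile module is enough to rank the biggest files in a layer:&lt;/p&gt;

```python
import io
import tarfile
import tempfile

def largest_files(layer_tar, top=5):
    """List (size, name) for the biggest regular files inside a layer
    tarball, largest first - a quick way to spot what bloats a layer."""
    with tarfile.open(layer_tar) as tf:
        files = [(m.size, m.name) for m in tf.getmembers() if m.isfile()]
    return sorted(files, reverse=True)[:top]

# Tiny self-contained demo: build a fake "layer" and inspect it.
with tempfile.NamedTemporaryFile(suffix=".tar", delete=False) as demo:
    with tarfile.open(fileobj=demo, mode="w") as tf:
        for name, size in [("usr/bin/big", 4096), ("etc/small", 10)]:
            info = tarfile.TarInfo(name)
            info.size = size
            tf.addfile(info, io.BytesIO(b"\0" * size))
demo_result = largest_files(demo.name)
```

Dive automates exactly this kind of inspection, with layer-by-layer diffs on top.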

&lt;h1&gt;
  
  
  Dive
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://github.com/wagoodman/dive"&gt;Dive&lt;/a&gt; is a new tool for exploring a Docker images, inspecting layer contents and discovering ways to shrink your Docker image size - all that in a nice text-based user interface.&lt;/p&gt;

&lt;p&gt;I recently used it to diagnose an unexpected image size growth caused by a change of file ownership &amp;amp; permissions. One of the instructions in a &lt;code&gt;jboss/wildfly&lt;/code&gt;-based Dockerfile was &lt;code&gt;chown -R jboss:jboss /opt/jboss/wildfly/&lt;/code&gt;. It looks innocent, but these files are originally owned by &lt;em&gt;jboss:root&lt;/em&gt;. Docker doesn't know what changes have happened inside a layer, only which files are affected. As such, this causes Docker to create a new layer replacing all those files (same content as &lt;em&gt;/opt/jboss/wildfly/&lt;/em&gt; but with new ownership), adding hundreds of megabytes to the image size.&lt;/p&gt;
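&lt;p&gt;One common way to avoid the duplicated layer - assuming the files enter the image through a &lt;code&gt;COPY&lt;/code&gt; step rather than the base image - is to set ownership at copy time (an illustrative fragment, not the actual Dockerfile from this story):&lt;/p&gt;

```dockerfile
# Copying and then chown-ing duplicates every file in a new layer:
#   COPY wildfly/ /opt/jboss/wildfly/
#   RUN chown -R jboss:jboss /opt/jboss/wildfly/
# Setting ownership while copying avoids the extra layer:
COPY --chown=jboss:jboss wildfly/ /opt/jboss/wildfly/
```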

&lt;h1&gt;
  
  
  Resources
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/wagoodman/dive"&gt;Dive GitHub Page&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.docker.com/develop/develop-images/dockerfile_best-practices/"&gt;Best practices for writing Dockerfiles&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>docker</category>
      <category>dive</category>
    </item>
    <item>
      <title>A curious case of slow Docker image builds</title>
      <dc:creator>Jarek Przygódzki</dc:creator>
      <pubDate>Fri, 31 Aug 2018 19:36:18 +0000</pubDate>
      <link>https://dev.to/jarekprzygodzki/a-curious-case-of-slow-docker-image-builds-2o7k</link>
      <guid>https://dev.to/jarekprzygodzki/a-curious-case-of-slow-docker-image-builds-2o7k</guid>
      <description>&lt;h1&gt;
  
  
  Investigating slow Docker image builds
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;A rather short but hopefully interesting troubleshooting story that happened recently.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Lately, I was investigating a case of slow Docker image builds on a CI server (Oracle Linux 7.5 with the Docker devicemapper storage driver in direct-lvm mode). Each operation which altered layers (ADD, COPY, RUN) took up to 20 seconds - the larger the image, the longer.&lt;/p&gt;

&lt;p&gt;A typical way of dealing with an apparently stuck program is collecting thread stack traces - or goroutine stack traces in the case of a Go application.&lt;/p&gt;
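&lt;p&gt;The same all-threads dump can be produced in other runtimes too. As an illustrative sketch (not what dockerd does internally), here is a Python equivalent that renders a stack trace for every live thread:&lt;/p&gt;

```python
import sys
import traceback

def dump_all_stacks():
    """Render a stack trace for every live thread - the same idea as
    the goroutine dump dockerd writes on SIGUSR1, sketched in Python."""
    out = []
    for thread_id, frame in sys._current_frames().items():
        out.append("Thread %d:\n" % thread_id)
        # format_stack walks the frame back through its callers.
        out.extend(traceback.format_stack(frame))
    return "".join(out)
```

In practice you would register this to run on a signal (e.g. via the faulthandler module) so a stuck process can be inspected without restarting it.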

&lt;h2&gt;
  
  
  Dump of dockerd
&lt;/h2&gt;

&lt;p&gt;The Docker daemon will write goroutine stack traces to a file named &lt;code&gt;goroutine-stacks-&amp;lt;datetime&amp;gt;.log&lt;/code&gt; after receiving a SIGUSR1 signal (&lt;a href="https://github.com/docker/docker-ce/blob/18.09/components/engine/daemon/debugtrap_unix.go#L16" rel="noopener noreferrer"&gt;engine/daemon/debugtrap_unix.go&lt;/a&gt;)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pkill -SIGUSR1 dockerd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;A quick analysis showed that almost all the time was spent in &lt;code&gt;NaiveDiffDriver.Diff&lt;/code&gt;. Here is one of the dumps&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;



&lt;h2&gt;
  
  
  What is a NaiveDiffDriver and why is it naive?
&lt;/h2&gt;

&lt;p&gt;A Docker image consists of immutable layers that are based on ordered root filesystem changes (and some metadata). The storage driver implementation handles merging of layers into a single mount point and provides a writable layer (called the “container layer”) on top of the underlying layers. All filesystem changes are written to this thin writable container layer. Each time a container is committed (manually or as part of building a Dockerfile), the storage driver needs to provide a list of modified files and directories relative to the base image to create a new layer. Some drivers keep track of these changes at run time and can generate that list easily, but for drivers with no native handling for calculating changes Docker provides &lt;code&gt;NaiveDiffDriver&lt;/code&gt;. This driver produces a list of changes between the current container filesystem and its parent layer by recursively traversing both directory trees and comparing file metadata. This operation is expensive for big images with many files and directories. See &lt;a href="https://integratedcode.us/2016/08/30/storage-drivers-in-docker-a-deep-dive/" rel="noopener noreferrer"&gt;here&lt;/a&gt; and &lt;a href="https://portworx.com/lcfs-speed-up-docker-commit/" rel="noopener noreferrer"&gt;here&lt;/a&gt; for an in-depth description of storage drivers in Docker.&lt;/p&gt;
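&lt;p&gt;The naive approach is simple enough to sketch. The following hypothetical Python function walks two directory trees and compares file metadata, mirroring the idea (though not the implementation) of &lt;code&gt;NaiveDiffDriver&lt;/code&gt;:&lt;/p&gt;

```python
import os
import pathlib
import tempfile

def naive_diff(base, current):
    """Naively list files added, removed, or modified between two
    directory trees by walking both and comparing metadata - the same
    idea NaiveDiffDriver applies to a container filesystem and its
    parent layer."""
    def snapshot(root):
        seen = {}
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                full = os.path.join(dirpath, name)
                st = os.stat(full)
                seen[os.path.relpath(full, root)] = (st.st_size, st.st_mtime_ns)
        return seen

    before, after = snapshot(base), snapshot(current)
    added = sorted(set(after) - set(before))
    removed = sorted(set(before) - set(after))
    modified = sorted(p for p in set(before) & set(after)
                      if before[p] != after[p])
    return added, removed, modified

# Demo: two tiny trees differing by one extra file.
parent = pathlib.Path(tempfile.mkdtemp())
container = pathlib.Path(tempfile.mkdtemp())
for root in (parent, container):
    (root / "same.txt").write_text("unchanged")
    os.utime(root / "same.txt", ns=(0, 0))  # pin mtime so the trees match
(container / "new.txt").write_text("hello")
added, removed, modified = naive_diff(parent, container)
```

Both full tree walks plus a stat per file is exactly why this gets slow on images with hundreds of thousands of files.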

&lt;h2&gt;
  
  
  Solution
&lt;/h2&gt;

&lt;p&gt;The Device Mapper storage driver is a good choice for running containers in production on Red Hat and its derivatives, but not for building images, because it lacks native diff support. After some thought I chose &lt;a href="https://docs.docker.com/storage/storagedriver/overlayfs-driver/" rel="noopener noreferrer"&gt;overlay2&lt;/a&gt; as a replacement. It turned out that native diff support in overlay2 is incompatible with the &lt;a href="https://github.com/torvalds/linux/blob/v4.18/fs/overlayfs/Kconfig#L13" rel="noopener noreferrer"&gt;OVERLAY_FS_REDIRECT_DIR&lt;/a&gt; option enabled in modern kernels: &lt;a href="https://github.com/moby/moby/pull/34342" rel="noopener noreferrer"&gt;the storage driver falls back to NaiveDiffDriver with a warning when it's detected&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# https://github.com/docker/docker-ce/blob/18.09/components/engine/daemon/graphdriver/overlay2/overlay.go#L287
Not using native diff for overlay2, this may cause degraded performance for building images: kernel has CONFIG_OVERLAY_FS_REDIRECT_DIR enabled
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The workaround I came up with is to disable the &lt;em&gt;redirect_dir&lt;/em&gt; option of the overlay module&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;echo 'options overlay redirect_dir=off' &amp;gt; /etc/modprobe.d/disable_overlay_redirect_dir.conf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;which finally enables native diffs&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Storage Driver: overlay2
 Backing Filesystem: xfs
 Supports d_type: true
 Native Overlay Diff: true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No more time and CPU cycles lost computing layer diffs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bonus
&lt;/h2&gt;

&lt;p&gt;I created &lt;a href="https://github.com/jarek-przygodzki/docker-image-build-times" rel="noopener noreferrer"&gt;a few virtual machines&lt;/a&gt; to confirm the source of the problem and to work on the solution. &lt;/p&gt;

&lt;p&gt;A nice thing about Go is that it integrates pprof into the standard library, and dockerd has enabled pprof/debug endpoints by default since 17.07.0-ce (2017-08-29); earlier the profiler API was only available in debug mode (--debug/-D, &lt;a href="https://github.com/moby/moby/pull/32453" rel="noopener noreferrer"&gt;Enable pprof/debug endpoints by default #32453&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Docker in flames&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;go-torch --file docker-build.svg --title="docker build" \
    --url http://localhost:2375
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fjarek-przygodzki%2Fdocker-image-build-times%2Fraw%2Fmaster%2Fassets%2Fdocker-build-devicemapper.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fjarek-przygodzki%2Fdocker-image-build-times%2Fraw%2Fmaster%2Fassets%2Fdocker-build-devicemapper.png" alt="docker build flame graph"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Docker daemon is spending most of its time in &lt;code&gt;NaiveDiffDriver.Diff&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;go-torch --file docker-build-inversed.svg --inversed --title="docker build" \
    --url http://localhost:2375
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fjarek-przygodzki%2Fdocker-image-build-times%2Fraw%2Fmaster%2Fassets%2Fdocker-build-devicemapper-inversed.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fjarek-przygodzki%2Fdocker-image-build-times%2Fraw%2Fmaster%2Fassets%2Fdocker-build-devicemapper-inversed.png" alt="docker build flame graph inversed"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;... doing syscalls.&lt;/p&gt;

</description>
      <category>docker</category>
      <category>devicemapper</category>
      <category>overlay2</category>
      <category>linux</category>
    </item>
    <item>
      <title>Generating JVM memory dumps from JRE</title>
      <dc:creator>Jarek Przygódzki</dc:creator>
      <pubDate>Tue, 28 Aug 2018 19:14:59 +0000</pubDate>
      <link>https://dev.to/jarekprzygodzki/generating-jvm-memory-dumps-from-jre-h8c</link>
      <guid>https://dev.to/jarekprzygodzki/generating-jvm-memory-dumps-from-jre-h8c</guid>
      <description>&lt;h1&gt;
  
  
  Generating JVM memory dumps from JRE on Linux/Windows/OSX
&lt;/h1&gt;

&lt;p&gt;Generating a JVM heap memory dump with the JDK is straightforward, as almost every Java developer knows about the jmap and jcmd tools that come with it. But what about the JRE?&lt;/p&gt;

&lt;p&gt;Some people think you &lt;a href="https://stackoverflow.com/questions/24213491/how-do-i-produce-a-heap-dump-with-only-a-jre"&gt;need the JDK&lt;/a&gt;, or at least &lt;a href="https://medium.com/@chamilad/extracting-memory-and-thread-dumps-from-a-running-jre-based-jvm-26de1e37a080"&gt;part of it&lt;/a&gt;, but that's not true. The answer lies in &lt;a href="https://github.com/apangin/jattach"&gt;jattach&lt;/a&gt;, a tool that sends commands to the JVM via the Dynamic Attach mechanism, created by JVM hacker Andrei Pangin (&lt;a href="https://twitter.com/andreipangin"&gt;@AndreiPangin&lt;/a&gt;). It's tiny (24KB), works with just the JRE and supports Linux containers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Usage
&lt;/h2&gt;

&lt;p&gt;Most of the time it comes down to downloading a single file&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;wget -L -O /usr/local/bin/jattach \
    https://github.com/apangin/jattach/releases/download/v1.5/jattach &amp;amp;&amp;amp; \
    chmod +x /usr/local/bin/jattach
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can then send the &lt;code&gt;dumpheap&lt;/code&gt; command to the JVM process&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;jattach PID-OF-JAVA dumpheap &amp;lt;path to heap dump file&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;e.g.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;java_pid=$(pidof -s java) &amp;amp;&amp;amp; \
    jattach $java_pid dumpheap /tmp/java_pid$java_pid-$(date +%Y-%m-%d_%H-%M-%S).hprof
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  How does it work?
&lt;/h2&gt;

&lt;p&gt;Built-in JDK utilities like jmap and jstack have two execution modes: cooperative and forced. In the normal cooperative mode, these tools use the Dynamic Attach mechanism to connect to the target VM; the requested command is then executed by the target VM in its own process. This is the mode jattach uses. &lt;/p&gt;

&lt;p&gt;The forced mode (jmap -F, jstack -F) works differently: the tool suspends the target process and then reads its memory using the Serviceability Agent. See &lt;a href="https://stackoverflow.com/questions/26140182/running-jmap-getting-unable-to-open-socket-file/35963059#35963059"&gt;this answer&lt;/a&gt; for details.&lt;/p&gt;
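&lt;p&gt;Under the hood, the cooperative handshake on Linux is surprisingly simple: the client creates an .attach_pid trigger file, sends SIGQUIT, and the target JVM opens a UNIX domain socket at /tmp/.java_pidPID. Commands are then written to that socket as a tiny NUL-separated payload. The sketch below is my own illustration of that wire format, based on how I understand HotSpot's attach protocol - not code taken from jattach itself:&lt;/p&gt;

```shell
# Sketch of the payload a client writes to /tmp/.java_pidPID (an assumption
# based on HotSpot's attach protocol): protocol version "1", the command name,
# then exactly three argument slots, every field terminated by a NUL byte.
build_attach_payload() {
  printf '1\0%s\0%s\0%s\0%s\0' "$1" "${2:-}" "${3:-}" "${4:-}"
}

# "dumpheap" with one argument (the dump path); the two unused argument
# slots are empty but still NUL-terminated.
build_attach_payload dumpheap /tmp/dump.hprof | od -An -c
```

The target JVM reads these fields back in order, executes the command, and streams the result (a status code followed by output) back over the same socket.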

&lt;h2&gt;
  
  
  Docker
&lt;/h2&gt;

&lt;p&gt;Prior to Java 10, jmap, jstack and jcmd could not attach from a process on the host machine to a JVM running inside a Docker container, because of how the attach mechanism interacts with PID and mount namespaces. Java 10 &lt;a href="https://bugs.openjdk.java.net/browse/JDK-8179498"&gt;fixes this&lt;/a&gt; by having the JVM inside the container find its PID in the root namespace and use it to watch for attach requests.&lt;/p&gt;

&lt;p&gt;Jattach supports containers and is compatible with earlier JVM versions - all we need is the process ID in the host PID namespace. How can we get it?&lt;/p&gt;

&lt;p&gt;If the JVM is the main process of the container (PID 1), the needed information is included in the &lt;code&gt;docker inspect&lt;/code&gt; output&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cid=&amp;lt;container name or id&amp;gt;
host_pid=$(docker inspect --format {{.State.Pid}} $cid)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And if it's not? Then things get more interesting. The easiest way I know of is to use /proc/PID/sched - the kernel's per-process scheduling statistics.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cid=&amp;lt;container name or id&amp;gt;
docker exec -it $cid bash -c 'cat /proc/$(pidof -s java)/sched'

java (8251, #threads: 127)
------------------------------------------------------------------------
se.exec_start                                :        275669.207074
se.vruntime                                  :            80.606203
se.sum_exec_runtime                          :            57.897264
nr_switches                                  :                  157
nr_voluntary_switches                        :                  149
nr_involuntary_switches                      :                    8
se.load.weight                               :                 1024
se.avg.load_sum                              :              8883079
se.avg.util_sum                              :                 4424
se.avg.load_avg                              :                  181
se.avg.util_avg                              :                   90
se.avg.last_update_time                      :         275669207074
policy                                       :                    0
prio                                         :                  120
clock-delta                                  :                   52
mm-&amp;gt;numa_scan_seq                            :                    0
numa_migrations, 0
numa_faults_memory, 0, 0, 1, 0, -1
numa_faults_memory, 1, 0, 0, 0, -1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What interests us is the first line of the output (the format is defined in &lt;a href="https://github.com/torvalds/linux/blob/v4.18/kernel/sched/debug.c#L877"&gt;kernel/sched/debug.c#L877&lt;/a&gt;). The desired PID can be extracted with a little shell scripting&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker exec -it $cid sh -c 'head -1 /proc/$(pidof -s java)/sched | grep -P "(?&amp;lt;=\()\d+" -o'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
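&lt;p&gt;The extraction itself can be checked locally, without any container, by feeding a sample first line through sed (a portable alternative to the grep -P pattern above):&lt;/p&gt;

```shell
# First line of /proc/PID/sched as seen from inside the container's PID
# namespace; the number in parentheses is the PID in the host namespace.
line='java (8251, #threads: 127)'

# Pull out the number that follows the opening parenthesis.
host_pid=$(printf '%s\n' "$line" | sed -E 's/.*\(([0-9]+),.*/\1/')
echo "$host_pid"   # prints 8251
```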



&lt;p&gt;When the target container is bare (no shell, no cat, nothing at all), nsenter is a possible alternative to &lt;code&gt;docker exec&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;host_pid=$(docker inspect --format {{.State.Pid}} &amp;lt;container name or id&amp;gt;)
nsenter --target $host_pid  --pid --mount  sh -c 'cat /proc/$(pidof -s java)/sched'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What can go wrong?
&lt;/h2&gt;

&lt;p&gt;The jattach binary from the project's release page is linked against glibc, so it &lt;a href="https://wiki.alpinelinux.org/wiki/Running_glibc_programs"&gt;most likely&lt;/a&gt; won't work on Alpine Linux. But it is not too hard to make it work.&lt;/p&gt;
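&lt;p&gt;One option is simply building jattach from source on Alpine so it links against musl. A sketch of such a build stage - the package names and the location of the build output are my assumptions, not taken from the project's docs:&lt;/p&gt;

```dockerfile
FROM alpine:3.9
# Build jattach against musl instead of using the glibc-linked release binary.
RUN apk add --no-cache build-base git
RUN git clone --depth 1 https://github.com/apangin/jattach.git /tmp/jattach
RUN make -C /tmp/jattach
# Output path assumed from the project's Makefile layout.
RUN cp /tmp/jattach/build/jattach /usr/local/bin/jattach
```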

</description>
      <category>jvm</category>
      <category>jre</category>
      <category>jattach</category>
      <category>docker</category>
    </item>
  </channel>
</rss>
