I have a fairly long history of using Linux for a number of purposes. After being assigned a Linux development machine while working with the team at Fontis, a combination of curiosity and the need to urgently repair that machine after some curiosity-driven stick pokery meant I picked up a large amount of Linux knowledge fairly quickly. I built further on this while helping set up Sitewards' infrastructure tooling: a much more heterogeneous set of machines and providers, but with a standard approach emerging, built on Docker and Kubernetes.
The sum of this experience has left me heavily motivated to invest in the technologies associated with Linux. One of the more interesting technologies I've become peripherally aware of along the way is the "Extended Berkeley Packet Filter", or "eBPF".
I was introduced to this technology by Brendan Gregg's excellent videos on performance analysis with eBPF. This was somewhat experimentally useful, but required recent kernels and various other oddities that weren't consistent across our infrastructure. However, in parallel there was some interesting discussion about another eBPF project: Cilium. This project provides the underlying networking for Kubernetes, but does so in a way that appears to provide additional security and visibility that other network plugins do not; naively, it looks similar to Istio.
Very recently I've had the opportunity to help another team with scaling issues on a large, bespoke Kubernetes cluster. The cluster had a large number of services, and those services were being updated slowly due to performance issues with the iptables implementations underlying Calico and kube-proxy. That particular issue was addressed another way, but it led to an investigation of Calico and, subsequently, eBPF network tooling.
What is BPF?
The original "Berkeley Packet Filter" was derived from a paper written by Steve McCanne and Van Jacobson in 1992 for the Berkeley Software Distribution. Its purpose was to allow efficient capture of packets from the kernel into userland by compiling a program that filtered out packets that should not be copied across. It was subsequently employed in utilities such as tcpdump.
In 2011 Eric Dumazet considerably improved the performance of BPF by adding a Just In Time (JIT) compiler that compiled the BPF bytecode into an optimized instruction sequence. Later, in 2014, Alexei Starovoitov capitalised on this performant virtual machine to expose kernel tracing information more efficiently than was otherwise possible, extending BPF beyond its initial packet filtering purpose. Jonathan Corbet noted this work and published it on LWN, hinting that BPF programs may eventually not only be used internally in the kernel but compiled in userland and loaded into the kernel. Later that same year Alexei started work on the bpf() syscall, and the current notion of eBPF was kicked off.
eBPF is now an extension of the BPF tooling, converted into a more general purpose virtual machine and used in roles well beyond its initial packet filtering purpose. It is a quirk of history that it is still referred to as the Berkeley Packet Filter, but the name has stuck.
Because eBPF is an extension of the original specification, it is generally simply referred to as BPF. Programs in the older bytecode are translated in the kernel to the newer eBPF instructions before they're compiled, so the only BPF that actually runs in the kernel is eBPF.
How does BPF work?
BPF is a sequence of 64-bit instructions. These instructions are generally generated by an intermediary such as tcpdump (libpcap):
# See https://blog.cloudflare.com/bpf-the-forgotten-bytecode/
$ sudo tcpdump -i wlp2s0 'ip and tcp' -d
(000) ldh [12] # Load a half-word (2 bytes) from the packet at offset 12.
(001) jeq #0x800 jt 2 jf 5 # Check if the value is 0x0800, otherwise fail.
# This checks for the IP packet on top of an Ethernet frame.
(002) ldb [23] # Load byte from a packet at offset 23.
# That's the "protocol" field 9 bytes within an IP frame.
(003) jeq #0x6 jt 4 jf 5 # Check if the value is 0x6, which is the TCP protocol number,
# otherwise fail.
(004) ret #262144 # Accept: copy up to 262144 bytes of the packet to userland
(005) ret #0 # Drop: copy nothing
But BPF programs can also be written in a restricted subset of C and compiled.
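As a rough illustration, here is a hand-written sketch of the same "IPv4 and TCP" filter in restricted C. This is not what tcpdump generates; the SEC macro, the helper declaration and the little-endian byte order handling are assumptions of this sketch rather than anything from the original tooling.
/*
 * Sketch of the "IPv4 and TCP" filter above in restricted C.
 * Compile with something like:
 *   clang -O2 -target bpf -c tcp_filter.c -o tcp_filter.o
 */
#include <linux/bpf.h>

/* Place the program in a named ELF section for loaders to find. */
#define SEC(name) __attribute__((section(name), used))

/* Declare the bpf_skb_load_bytes() helper by its kernel-assigned number. */
static long (*bpf_skb_load_bytes)(const void *skb, __u32 offset, void *to,
                                  __u32 len) = (void *) BPF_FUNC_skb_load_bytes;

SEC("socket")
int tcp_only(struct __sk_buff *skb)
{
    __u8 proto;

    /* EtherType 0x0800 is IPv4; skb->protocol is in network byte order.
     * This mirrors the 'ldh [12]; jeq #0x800' pair above, and assumes a
     * little-endian machine. */
    if (skb->protocol != __builtin_bswap16(0x0800))
        return 0; /* drop */

    /* Byte 23 of the packet (offset 9 into the IP header) is the IP
     * protocol field; 6 is TCP ('ldb [23]; jeq #0x6'). */
    if (bpf_skb_load_bytes(skb, 23, &proto, sizeof(proto)) < 0)
        return 0; /* drop */
    if (proto != 6)
        return 0; /* drop */

    return 262144; /* accept: the same constant as 'ret #262144' above */
}
Attached to a raw socket, this accepts only IPv4 TCP packets, just like the bytecode version.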
BPF programs come with a set of guarantees, enforced by a kernel verifier, that make them safe to run in kernel land without risk of locking up or otherwise breaking the kernel. The verifier ensures that:
- The program does not loop
- There are no unreachable instructions
- Every register and stack state is valid
- Registers with uninitialized content are not read
- The program only accesses structures appropriate for its BPF program type
- (Optionally) pointer arithmetic is prevented
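For instance, on kernels of this era the verifier rejects any program containing a back-edge it cannot prove terminates. A sketch like the following (hypothetical, purely for illustration) would fail to load, with the verifier complaining about a back-edge between instructions:
#include <linux/bpf.h>

#define SEC(name) __attribute__((section(name), used))

SEC("socket")
int rejected(struct __sk_buff *skb)
{
    __u32 i, sum = 0;

    /* skb->len is unknown at verification time, so this loop can neither
     * be unrolled nor proven to terminate; the verifier refuses it. */
    for (i = 0; i < skb->len; i++)
        sum += i;

    return sum;
}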
The BCC tools repository contains a set of tools wrapping BPF programs that can do useful things. We can use one of those programs (dns_matching.py) to demonstrate how BPF is able to instrument the network:
# Clone the repository
$ git clone https://github.com/iovisor/bcc.git
Cloning into 'bcc'...
Receiving objects: 100% (17648/17648), 8.42 MiB | 1.21 MiB/s, done.
Resolving deltas: 100% (11460/11460), done.
# Pick the DNS matching
$ cd bcc/examples/networking/dns_matching
# Run it!
$ sudo ./dns_matching.py --domains fishfingers.io
>>>> Adding map entry: fishfingers.io
Try to lookup some domain names using nslookup from another terminal.
For example: nslookup foo.bar
BPF program will filter-in DNS packets which match with map entries.
Packets received by user space program will be printed here
Hit Ctrl+C to end...
In another window we can run:
$ dig fishfingers.io
Which will show in our first window:
Hit Ctrl+C to end...
[<DNS Question: 'fishfingers.io.' qtype=A qclass=IN>]
The domain is nonsense, but the question is still posed. Looking at the source file we can see the eBPF program written in C, which:
- Checks the type of the Ethernet frame
- Checks to see if it's UDP
- Checks to see if it's port 53
- Checks if the supplied DNS name is within the payload
That's it! Our eBPF program has successfully run in the kernel, and the matching packets have been copied out to the userland Python program, where they're subsequently printed.
While this example was associated with the networking subsystem (BPF_PROG_TYPE_SOCKET_FILTER), there are a whole series of kernel entry points that can execute eBPF programs. At the time of writing there are a total of 22 program types; unfortunately, they are currently poorly documented.
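For reference, here are a few of those program types as they appear in the kernel's UAPI header (include/uapi/linux/bpf.h); the comments are mine, not the kernel's:
enum bpf_prog_type {
    BPF_PROG_TYPE_UNSPEC,
    BPF_PROG_TYPE_SOCKET_FILTER, /* socket filtering, as used above */
    BPF_PROG_TYPE_KPROBE,        /* kernel function tracing */
    BPF_PROG_TYPE_SCHED_CLS,     /* tc traffic classifiers */
    BPF_PROG_TYPE_SCHED_ACT,     /* tc actions */
    BPF_PROG_TYPE_TRACEPOINT,    /* static kernel tracepoints */
    BPF_PROG_TYPE_XDP,           /* eXpress Data Path (see below) */
    BPF_PROG_TYPE_PERF_EVENT,    /* perf event handlers */
    BPF_PROG_TYPE_CGROUP_SKB,    /* per-cgroup packet filtering */
    /* ... and more with each kernel release */
};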
eBPF in the wild
To understand where eBPF sits in the infrastructure ecosystem, it's worth looking at where companies have chosen to use it over other, more conventional ways of solving the same problem.
Firewall
The de facto implementation of a Linux firewall uses iptables as its underlying enforcement mechanism. iptables allows configuring a set of netfilter tables that manipulate packets in a number of ways. For example, the following rule drops all connections from the IP address 10.10.10.10:
iptables -A INPUT -s 10.10.10.10/32 -j DROP
iptables can be used for a number of packet manipulation tasks, such as Network Address Translation (NAT) or packet forwarding. However, iptables runs into a couple of significant problems:
- iptables rules are matched sequentially
- iptables updates must be made by recreating and updating all rules in a single transaction
These two properties mean that under large, diverse traffic conditions (such as those experienced by any sufficiently large service, Facebook for example), or on a system that changes its iptables rules frequently, there will be an unacceptable performance overhead to running iptables, which can degrade or take offline an entire service.
There are already improvements to this subsystem in the Linux kernel by way of nftables. This system is designed to improve upon iptables, and is architecturally similar to BPF in that it implements a virtual machine in the kernel. nftables is a little older than eBPF and better supported in existing Linux distributions, and in the testing distributions it has even begun to entirely replace iptables. However, with the advent and optimization of BPF, nftables is perhaps a technology less worth investing in.
That leaves us with BPF. BPF has a couple of unique advantages over iptables:
- It's implemented as an instruction set in a virtual machine, and can be heavily optimized
- Matching can be done against the "closest" rule, rather than by iterating over the entire rule set
- It can introspect specific packet data when deciding whether to drop a packet
- It can be compiled and run in the Linux "eXpress Data Path" (XDP), the earliest possible point at which to interact with network traffic (a sketch follows below)
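To make the XDP point concrete, here is an illustrative sketch (assuming a little-endian machine and the standard kernel UAPI headers) of an XDP program enforcing the same policy as the iptables rule earlier: drop everything from 10.10.10.10.
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>

#define SEC(name) __attribute__((section(name), used))

SEC("xdp")
int drop_10_10_10_10(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;
    struct ethhdr *eth = data;
    struct iphdr *ip;

    /* The verifier insists on bounds checks before every packet access. */
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;
    if (eth->h_proto != __builtin_bswap16(ETH_P_IP)) /* assumes little-endian */
        return XDP_PASS;

    ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;

    /* 10.10.10.10; every octet is equal, so byte order is moot here. */
    if (ip->saddr == 0x0A0A0A0A)
        return XDP_DROP;

    return XDP_PASS;
}
Because XDP runs before the kernel allocates its usual socket buffer structures, packets from the banned address are discarded with close to the minimum possible work.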
These advantages can yield some staggering performance benefits. In Cloudflare's (artificial) tests, BPF with XDP was approximately 5x better at dropping packets than the next best solution (tc). Facebook saw much more predictable CPU usage with the use of BPF filtering.
In addition to the performance benefits, some applications use BPF in combination with userland proxies (such as Envoy) to allow or deny application protocols such as HTTP, gRPC, DNS or Kafka. This sort of application-specific filtering is otherwise only seen in service meshes such as Istio or Linkerd, which incur more of a performance penalty than the BPF based solution.
So, packet filtering based on BPF is both more flexible and more efficient (with XDP) than the existing iptables solution. While tc and nftables may provide similar performance now or in the future, BPF's combination of a large set of use cases and efficiency means it's perhaps a better place to invest.
Kernel tracing & instrumentation
After running Linux in production for some period of time we invariably run into issues. In the past I've had to debug:
- iptables performance problems
- Workload CPU performance
- Software not loading configuration
- Software becoming stalled
- Systems being "slow" for no apparent reason
In those cases we need to dig further into what's happening between kernel land and userland, and poke at what the system is doing and why.
There is an abundance of tools for this task. Brendan Gregg has an excellent image showing the many tools and what they're useful for. From that list, I'm familiar with:
- strace/ltrace
- top
- sysdig
- iotop
- df
- perf
These tools each have their own unique tradeoffs, and an in-depth analysis of them is beyond the scope of this article. However, the most useful tool is perhaps strace. strace provides visibility into which system calls (calls to the Linux kernel) a process is making. The following example shows which file-related system calls the command cat /tmp/foo will make:
$ strace -e file cat /tmp/foo
execve("/bin/cat", ["cat", "/tmp/foo"], 0x7fffc2c8c308 /* 56 vars */) = 0
access("/etc/ld.so.preload", R_OK) = 0
openat(AT_FDCWD, "/etc/ld.so.preload", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libsnoopy.so", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libpthread.so.0", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libdl.so.2", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/tmp/foo", O_RDONLY) = 3
hi
This allows us to debug a range of issues, including configuration not loading, what a process is sending or receiving over the network, and what child processes a given process spawns. However, it comes at a cost: strace will significantly slow down the traced process. Suddenly introducing large latency into the system will annoy users, and can block and stack up requests, eventually breaking the service. Accordingly, it needs to be used with caution.
However, a much more efficient way to trace these system calls is with BPF. This is made easy with the bcc tools git repository; specifically, the trace.py tool. The tool has a slightly different interface than strace, perhaps because BPF is compiled and executed based on events in the kernel rather than by interrupting a process at the kernel interface. However, the strace behaviour can be replicated as follows:
$ sudo ./trace.py 'do_sys_open "%s", arg2' | grep 'cat'
And then in another window:
$ cat /tmp/foo
This will yield:
13785 13785 cat do_sys_open /etc/ld.so.preload
13785 13785 cat do_sys_open /lib/x86_64-linux-gnu/libsnoopy.so
13785 13785 cat do_sys_open /etc/ld.so.cache
13785 13785 cat do_sys_open /lib/x86_64-linux-gnu/libc.so.6
13785 13785 cat do_sys_open /lib/x86_64-linux-gnu/libpthread.so.0
13785 13785 cat do_sys_open /lib/x86_64-linux-gnu/libdl.so.2
13785 13785 cat do_sys_open /usr/lib/locale/locale-archive
13785 13785 cat do_sys_open /tmp/foo
This fairly accurately replicates the functionality of strace; each of the files listed earlier is shown in the trace.py output, the same as it was in the strace output.
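Under the hood, trace.py generates a small C program from that command line expression, compiles it at runtime and attaches it to a kprobe on do_sys_open. A hand-written sketch of roughly that kind of BCC program (not trace.py's actual generated code; the function name and buffer size here are mine):
#include <uapi/linux/ptrace.h>

/* In BCC, parameters after ctx map onto the probed kernel function's
 * arguments; for do_sys_open, the second argument is the filename. */
int probe_do_sys_open(struct pt_regs *ctx, int dfd, const char *filename)
{
    char fname[64];

    /* Copy the string into BPF-accessible memory, then print it to the
     * kernel trace pipe, which the userland half of the tool reads. */
    bpf_probe_read(&fname, sizeof(fname), (void *)filename);
    bpf_trace_printk("do_sys_open %s\n", fname);
    return 0;
}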
BPF is not limited to strace-like tools. It can be used to introspect a whole series of both user and kernel level problems, and has been packaged into user-friendly tools in the BCC repository. Additionally, BPF now powers Sysdig, the tool used for spelunking into a machine to determine its behaviour by analysing system calls. There is even some work to export the results of BPF programs in the Prometheus format for aggregation as time series data.
Because of its high performance, flexibility and good support in more recent Linux kernels, BPF forms the foundation of a new set of tools that provide more flexible, performant systems introspection. Additionally, BPF seems simpler than the kernel hacking that would otherwise be required to provide this sort of introspection, and may democratize the design of such tools, leading to more innovation in this area.
Network visibility
Given the history of BPF in packet filtering, a logical next step is collecting statistics from the network for later analysis.
There are already a number of network statistics exposed via /proc (and sysfs) that can be read with little overhead. The Prometheus "node exporter" reads:
/proc/sys/net/netfilter/
/proc/net/ip_vs
/proc/net/ip_vs_stats
/sys/class/net/
/proc/net/netstat
/proc/net/sockstat
/proc/net/tcp
/proc/net/tcp6
However, as much as this exposes, there are still things about connections that can't be read directly from /proc or via the set of CLI tools that also read from it (ss, netstat, etc.). One such case was discussed by Julia Evans and Brendan Gregg on Twitter: statistics on the length of TCP connections on a given port.
This is useful for debugging what a system is connected to, and how long it spends in each connection. We can in turn use this to determine who our machine is talking to, and whether it's getting stuck on any given connection.
Brendan Gregg has a post that describes how this is implemented in detail but, to summarise, it listens to tcp_set_state() and queries the properties of the connection from struct tcp_info. There are various limitations to this approach, but it seems to work pretty well.
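As an illustrative sketch of that mechanism (not the real tcplife source; the names and the trace output here are mine), a BCC-style kprobe on tcp_set_state() might look like this. The kprobe__ prefix tells BCC to attach the function to that kernel symbol automatically:
#include <uapi/linux/ptrace.h>
#define KBUILD_MODNAME "tcplife_sketch"
#include <net/sock.h>
#include <net/tcp_states.h>

int kprobe__tcp_set_state(struct pt_regs *ctx, struct sock *sk, int state)
{
    /* The remote address and port live on the socket's common structure;
     * BCC rewrites these dereferences into bpf_probe_read() calls. */
    u32 daddr = sk->__sk_common.skc_daddr;
    u16 dport = sk->__sk_common.skc_dport; /* network byte order */

    /* The real tool stamps the time when the connection is established
     * and computes the lifetime once the state reaches TCP_CLOSE. */
    if (state == TCP_CLOSE)
        bpf_trace_printk("closed daddr=%x dport=%x\n", daddr, dport);

    return 0;
}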
The result has been committed to the bcc repository and looks like:
# Trace remote port 443
$ sudo ./tcplife.py -D 443
Then, in another window:
$ curl https://www.andrewhowden.com/ > /dev/null
The first window then shows:
PID COMM LADDR LPORT RADDR RPORT TX_KB RX_KB MS
7362 curl 10.1.1.247 43074 34.76.108.124 443 0 16 3369.32
This indicates that a process with ID 7362 connected to 34.76.108.124 over port 443 and took 3369.32ms to complete its transfer (Australian internet is a bit slow in some areas).
These kinds of ad-hoc debugging statistics are essentially impossible to gather any other way. Additionally, it should be possible (if desired) to express these statistics in such a way that the Prometheus exporter will load and export them for collection, making the network essentially arbitrarily introspectable.
Using BPF
Given the above, BPF seems like a compelling technology that's worth investing in learning more about. However, there are some difficulties in getting BPF to work properly:
BPF is only in "recent" kernels
BPF is an area undergoing rapid development in the Linux kernel. Accordingly, features may be incomplete or missing entirely, tools may not work as expected, and their failure conditions may be poorly documented. If the kernels used in production are fairly modern, then BPF may provide considerable utility. If not, it's perhaps worth waiting until development in this area slows down and an LTS kernel with good BPF compatibility is released.
It’s hard to debug
BPF is fairly opaque at the moment. While there are bits of documentation here and there, and one can go and read the kernel source, it's not as easy to debug as (for example) iptables or other system tools. It may be difficult to debug network issues created by improperly constructed BPF programs. The advice here is the same as for other new or bespoke technologies: ensure that multiple team members understand and can debug it, and if they can't, or those people are not available, pick another technology.
It’s an implementation detail
It's my suspicion that the vast majority of our interaction with BPF will not be interaction of our own design. BPF is useful in the design of analysis tools, but the burden is perhaps too large to place on the shoulders of systems administrators. Accordingly, to start reaping the benefits of BPF it's worth instead investing in tools that use this technology. These include:
- Cilium
- BCC Tools
- bpftrace
- Sysdig
More tools will arrive in future, though those are the only ones I would currently invest in.
Conclusion
BPF is an old technology that has had new life breathed into it through the extended instruction set, the implementation of a JIT, and the ability to execute BPF programs at various points in the Linux kernel. It provides a way to export information about, or modify, Linux kernel behaviour at runtime without needing to reboot or reload the kernel, including for purely transient systems introspection. BPF probably has its most immediate ramifications in network performance, as networks need to handle a truly bizarre level of both traffic and complexity, and BPF provides some concrete solutions to these problems. Accordingly, networks are a good place to start understanding BPF, particularly instead of investing in nftables or iptables. BPF additionally provides some compelling insights into both system and network visibility that are otherwise difficult or impossible to achieve, though this area is somewhat more nascent than the network implementations.
TL;DR: BPF is pretty damned cool.
References
- IOVisor project: a bunch of good eBPF and XDP reading and tools
- API aware networking and security, powered by eBPF and XDP
- Why is the kernel community replacing IPTables with eBPF
- Achieving high performance low latency networking with XDP
- Inside Facebook’s eBPF Firewall
- Cilium Architecture
- A brief introduction to XDP and eBPF
- eBPF: Past, present and future
- Debating the value of XDP
- Network debugging with eBPF
- A JIT for packet filters
- NFTables: A new packet filtering engine
- eBPF and XDP: A reference guide
- How to build a kernel with XDP support
- Cilium: rethinking Linux networking and security in the age of Microservices
- Cilium 1.4 release notes
- New approaches to network fast paths
- A thorough introduction to eBPF
- Sysdig: Now powered by eBPF
- Linux eBPF Superpowers
- BPF: the universal in-kernel virtual machine
- The BSD packet filter: A new architecture for User Level packet capture
- Unofficial eBPF spec
- Linux Socket Filtering aka Berkeley Packet Filter (BPF)
- BPF: The forgotten bytecode
- TC BPF man page
- L4Drop: XDP DDoS Mitigation
- The beginners guide to iptables and the Linux firewall
- Wikipedia: NFTables
- How to drop 10 million packets
- Introducing the p0f compiler