A few years back, I got really interested in understanding how Docker and containers work. One area that grabbed my attention was containers and networking. Docker handles networking pretty simply. It mainly relies on virtual interfaces, bridges, NAT routing, and other features provided by the Linux operating system. But I won't be focusing on Docker and networking in this article.
As I delved deeper, I became curious about how Docker does all this networking. If you're like me, a developer who spends almost all their time in user space, you're probably familiar with interacting with the kernel through syscalls or the filesystem (like /proc). At first, I had this naive idea that the OS had some kind of magic syscall or something in /proc that Docker was tapping into. But nope, I was way off. In Docker's code, I discovered another method: Netlink.
So, what is Netlink? It's another way of exchanging information between user space and the kernel, this time through a socket-like interface. Think of Netlink as a direct socket connection into the kernel, allowing you to send and receive messages. This approach is quite interesting because I can communicate with the kernel asynchronously, or simply listen in user space for messages from the kernel.
With Netlink, I can communicate with various kernel subsystems. For example, I can receive events from SELinux, updates about routing or network links, and even modify routing tables and IP addresses.
If you're comfortable with basic socket programming like me, Netlink should be easy to pick up. All you need to do is open a socket to the kernel, address the subsystem, and send or receive binary messages.
Bring interface UP
Let's get practical here. Let's start with something super basic, like a Netlink hello world. One of the simplest examples I could think of is enabling a network interface. On my system, I've got this veth0 interface sitting there, in the DOWN state:
$ ip link
...
78: veth0@veth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
link/ether de:8f:4e:7e:9c:cd brd ff:ff:ff:ff:ff:ff
I'd like to write my own simple Go program that does the same thing as ip link set veth0 up.
One note before I start: Since I want to use syscalls and related data structures, I'll need to install the Go golang.org/x/sys package.
Let's begin by establishing the socket. The Netlink socket is created using unix.Socket():
// open the Netlink socket
sock, err := unix.Socket(
	unix.AF_NETLINK,
	unix.SOCK_RAW,
	unix.NETLINK_ROUTE,
)
if err != nil {
	fmt.Printf("Error creating socket: %s\n", err)
	return
}
defer unix.Close(sock)
Instead of the AF_INET domain, which we typically use for TCP/IP communication, I'm using AF_NETLINK. What's also interesting here is the last parameter. This value determines which subsystem I want to communicate with. It could be NETLINK_NETFILTER, NETLINK_SELINUX, and so on. I'm opting for NETLINK_ROUTE, which is dedicated to interfaces, links, IP addresses, and such.
However, just having the socket isn't enough. In TCP/IP communication, we associate a socket with a specific network address using bind(). Similarly, in Netlink, I'll set which multicast groups and which port ID (PID) I want to use. For simplicity, I'll use 0 for both: no multicast groups, and a port ID of 0, which lets the kernel assign one.
// bind the socket to group and PID
err = unix.Bind(sock, &unix.SockaddrNetlink{
	Family: unix.AF_NETLINK,
	Groups: 0,
	Pid:    0,
})
if err != nil {
	fmt.Printf("Error in binding socket: %s\n", err)
	return
}
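Binding with a port ID of 0 leaves the choice of ID to the kernel. Here is a minimal sketch of how one might read the assigned ID back with getsockname. It uses the standard library's syscall package, which offers the same names as golang.org/x/sys/unix for the identifiers used here, so it runs without extra dependencies. boundPortID is a hypothetical helper and not part of the original program:

```go
package main

import (
	"fmt"
	"syscall"
)

// boundPortID (hypothetical helper) opens a NETLINK_ROUTE socket, binds it
// with Pid 0 and returns the port ID the kernel picked for the socket.
func boundPortID() (uint32, error) {
	fd, err := syscall.Socket(syscall.AF_NETLINK, syscall.SOCK_RAW, syscall.NETLINK_ROUTE)
	if err != nil {
		return 0, fmt.Errorf("socket: %w", err)
	}
	defer syscall.Close(fd)

	// Pid 0 and Groups 0: let the kernel assign the ID, join no multicast group.
	if err := syscall.Bind(fd, &syscall.SockaddrNetlink{Family: syscall.AF_NETLINK}); err != nil {
		return 0, fmt.Errorf("bind: %w", err)
	}

	// Ask the kernel which address the socket ended up with.
	sa, err := syscall.Getsockname(fd)
	if err != nil {
		return 0, fmt.Errorf("getsockname: %w", err)
	}
	nl, ok := sa.(*syscall.SockaddrNetlink)
	if !ok {
		return 0, fmt.Errorf("unexpected address type %T", sa)
	}
	return nl.Pid, nil
}

func main() {
	id, err := boundPortID()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("kernel-assigned port ID:", id)
}
```

Opening and binding a Netlink socket like this does not need root privileges; only the later request that changes the interface does.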
Create and send message
The easy part is done. At this point, I've got everything ready for sending and receiving messages. Now comes the tricky part—building and parsing Netlink messages. Like in many binary protocols, the Netlink message consists of a header and payload.
Let's start from the end, with the payload. The payload is all about what I want to do or get from Netlink. In my case, it's about enabling a network interface. Therefore, I'll use IfInfomsg, where I'll set both the Flags and Change fields to IFF_UP: Change is a mask of the flags I want to modify, and Flags carries their new values. The good news is that the structure is available in the golang.org/x/sys package, so I don't need to write it from scratch.
payload := unix.IfInfomsg{
	Family: unix.AF_UNSPEC,
	Change: unix.IFF_UP,
	Flags:  unix.IFF_UP,
	Index:  int32(ethIndex), // index of the network interface I would like to enable (in my case it's 78 - veth0)
}
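The snippet above takes ethIndex as given. As a rough sketch of where that value could come from, the hypothetical helper below accepts either a numeric index or an interface name from the command line. resolveIndex is my own name for illustration; net.InterfaceByName is the standard library lookup:

```go
package main

import (
	"fmt"
	"net"
	"os"
	"strconv"
)

// resolveIndex (hypothetical helper) accepts either a numeric interface
// index such as "78" or an interface name such as "veth0".
func resolveIndex(arg string) (int, error) {
	// A purely numeric argument is treated as the index itself.
	if idx, err := strconv.Atoi(arg); err == nil {
		if idx <= 0 {
			return 0, fmt.Errorf("invalid interface index %d", idx)
		}
		return idx, nil
	}
	// Anything else is looked up by name.
	ifi, err := net.InterfaceByName(arg)
	if err != nil {
		return 0, fmt.Errorf("lookup %q: %w", arg, err)
	}
	return ifi.Index, nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: main <interface index or name>")
		return
	}
	ethIndex, err := resolveIndex(os.Args[1])
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("interface index:", ethIndex)
}
```

With a helper like this, the program could be started with either 78 or veth0 as its argument.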
Then, I need to build a header. The header carries information like the type of the payload or the total length of the whole message. The structure for the header is NlMsghdr. The type I need to set is RTM_NEWLINK, which is related to the IfInfomsg payload.
// total length of message is size of header + size of payload
length := unix.SizeofNlMsghdr + unix.SizeofIfInfomsg
header := unix.NlMsghdr{
	Len:   uint32(length),
	Type:  uint16(unix.RTM_NEWLINK),
	Flags: uint16(unix.NLM_F_REQUEST) | uint16(unix.NLM_F_ACK),
	Seq:   1,
}
Alright, the message should be almost ready. I just need to put the header and payload into one message structure. I'll create an anonymous structure and fill it with the payload and header:
msg := struct {
	header  unix.NlMsghdr
	payload unix.IfInfomsg
}{
	header:  header,
	payload: payload,
}
The message is ready, and I could write the message data into the socket. But before I call Sendto(), I need to convert the message structure to an array (or slice) of bytes:
// first I need to convert the `msg` to a slice of bytes
var asByteSlice []byte = (*(*[unix.SizeofNlMsghdr + unix.SizeofIfInfomsg]byte)(unsafe.Pointer(&msg)))[:]
// write the data to the socket
err = unix.Sendto(sock, asByteSlice, 0, &unix.SockaddrNetlink{Family: unix.AF_NETLINK})
if err != nil {
	fmt.Printf("Could not write message to socket: %s\n", err)
	return
}
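The unsafe cast is compact, but the same bytes can also be produced without unsafe. The sketch below is an alternative, not the approach used in this article's program: it serializes the header and payload with encoding/binary, using the standard library's syscall types. It assumes a little-endian host (typical for x86-64 and arm64); on Go 1.21 or newer, binary.NativeEndian should remove that assumption. linkUpMessage and encodeLinkUp are hypothetical names:

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"syscall"
)

// linkUpMessage mirrors the anonymous struct built above. The fields are
// exported because encoding/binary works through reflection.
type linkUpMessage struct {
	Header  syscall.NlMsghdr
	Payload syscall.IfInfomsg
}

// encodeLinkUp (hypothetical helper) serializes an RTM_NEWLINK request that
// sets IFF_UP on the interface with the given index.
// Assumption: little-endian host, since Netlink uses host byte order.
func encodeLinkUp(index int32, seq uint32) ([]byte, error) {
	msg := linkUpMessage{
		Header: syscall.NlMsghdr{
			Len:   syscall.SizeofNlMsghdr + syscall.SizeofIfInfomsg,
			Type:  syscall.RTM_NEWLINK,
			Flags: syscall.NLM_F_REQUEST | syscall.NLM_F_ACK,
			Seq:   seq,
		},
		Payload: syscall.IfInfomsg{
			Family: syscall.AF_UNSPEC,
			Index:  index,
			Flags:  syscall.IFF_UP,
			Change: syscall.IFF_UP,
		},
	}
	var buf bytes.Buffer
	if err := binary.Write(&buf, binary.LittleEndian, msg); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	b, err := encodeLinkUp(78, 1)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%d bytes: % x\n", len(b), b)
}
```

The trade-off is a reflection-based encoder and an extra allocation in exchange for not depending on the in-memory layout of a Go struct.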
Receiving message
At this point, if I compile and run my simple program with root privileges, the code will bring the veth0 interface up. Mission accomplished. Right? But what about receiving messages?
Receiving messages might be complicated. The messages might be large, or the information might be broken into multiple pieces. There are various factors to consider. But I'll stick with my simple scenario. I just want to know if my up operation failed or if it was successful.
To receive the response, I'll use unix.Recvfrom(), which reads the next datagram from the socket into buf:
var buf [1024]byte
n, _, err := unix.Recvfrom(sock, buf[:], 0)
if err != nil {
	fmt.Printf("Could not read data from socket: %s\n", err)
	return
}
The next step is parsing the received raw data. Here, I'll use ParseNetlinkMessage() from the standard syscall package to do just that.
// parse data to messages
msgs, err := syscall.ParseNetlinkMessage(buf[:n])
if err != nil {
	fmt.Printf("Could not parse the response: %s\n", err)
	return
}
The function returns the parsed data as a slice of NetlinkMessage. NetlinkMessage is a simple structure with Header and Data. Header is an NlMsghdr, and Data is a slice of bytes. Based on the type in the Header, I can cast Data to the proper type. In my case, the first response message will be NLMSG_ERROR, so I'll cast Data to NlMsgerr. Despite the name, NLMSG_ERROR also carries the acknowledgement I asked for with NLM_F_ACK: an error code of 0 means success.
// the first received message must be `NLMSG_ERROR`
if len(msgs) == 0 || msgs[0].Header.Type != unix.NLMSG_ERROR {
	fmt.Printf("The first received message is not NLMSG_ERROR\n")
	return
}
// cast the data to NlMsgerr payload
errPayload := (*unix.NlMsgerr)(unsafe.Pointer(&msgs[0].Data[0]))
if errPayload.Error != 0 {
	// Error holds a negative errno value; 0 means the request was acknowledged
	fmt.Printf("Error returned by Netlink: %d\n", errPayload.Error)
	return
}
fmt.Printf("Interface is UP\n")
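The Error field of NlMsgerr holds a negated errno value, so it can be turned into a regular Go error. The sketch below shows one way to do that with the standard library's syscall types. ackError and fakeAck are hypothetical helpers; fakeAck only fabricates a synthetic message so the example can run without touching a real interface:

```go
package main

import (
	"fmt"
	"syscall"
	"unsafe"
)

// ackError (hypothetical helper) interprets an NLMSG_ERROR message.
// The kernel stores a negated errno in NlMsgerr.Error; 0 is a plain ACK.
func ackError(m syscall.NetlinkMessage) error {
	if m.Header.Type != syscall.NLMSG_ERROR {
		return fmt.Errorf("unexpected message type %d", m.Header.Type)
	}
	if len(m.Data) < syscall.SizeofNlMsgerr {
		return fmt.Errorf("short NLMSG_ERROR payload: %d bytes", len(m.Data))
	}
	e := (*syscall.NlMsgerr)(unsafe.Pointer(&m.Data[0]))
	if e.Error == 0 {
		return nil
	}
	return syscall.Errno(-e.Error)
}

// fakeAck builds a synthetic NLMSG_ERROR message for demonstration only.
func fakeAck(errno int32) syscall.NetlinkMessage {
	data := make([]byte, syscall.SizeofNlMsgerr)
	(*syscall.NlMsgerr)(unsafe.Pointer(&data[0])).Error = -errno
	return syscall.NetlinkMessage{
		Header: syscall.NlMsghdr{Type: syscall.NLMSG_ERROR},
		Data:   data,
	}
}

func main() {
	fmt.Println("ack:", ackError(fakeAck(0)))
	fmt.Println("failure:", ackError(fakeAck(1))) // errno 1 is EPERM
}
```

Running the real program without root privileges is a typical way to get a non-zero error code back, since changing link state requires elevated privileges.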
The full code is available on github.com/sn3d/netlink-example
Let's try...
It's time to play with my program. As I mentioned above, I have veth0 present in my system, and it is DOWN.
$ ip link
...
78: veth0@veth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
link/ether de:8f:4e:7e:9c:cd brd ff:ff:ff:ff:ff:ff
If you look closer, you might see that veth0 has index 78. I need to pass this index to my program. Now when I run my program with this index, I should get the message Interface is UP:
$ sudo go run main.go 78
Interface is UP
This is not just a message from my program. If I check veth0, I'll see that the interface is UP.
$ ip link
...
78: veth0@veth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether de:8f:4e:7e:9c:cd brd ff:ff:ff:ff:ff:ff
How to debug messages
As you noticed, creating a Netlink connection, reading from, and writing to the socket is the easy part. Maybe reading bigger chunks of data or data that's broken into smaller chunks is more complicated, but it's something we're familiar with from socket programming.
The tricky part for me was creating a proper Netlink message. But there's a pretty useful way to debug and observe Netlink messages: using strace. Modern strace has a great feature: it can parse and understand Netlink messages.
If you try to execute ip link set veth0 up with strace, you might see sendmsg with a parsed Netlink message:
$ sudo strace -Tfe trace=sendmsg ip link set veth0 up
sendmsg(4, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[{nlmsg_len=52, nlmsg_type=RTM_GETLINK, nlmsg_flags=NLM_F_REQUEST, nlmsg_seq=1714051690, nlmsg_pid=0}, {ifi_family=AF_UNSPEC, ifi_type=ARPHRD_NETROM, ifi_index=0, ifi_flags=0, ifi_change=0}, [[{nla_len=8, nla_type=IFLA_EXT_MASK}, RTEXT_FILTER_VF|RTEXT_FILTER_SKIP_STATS], [{nla_len=10, nla_type=IFLA_IFNAME}, "veth0"]]], iov_len=52}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 52 <0.000094>
sendmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[{nlmsg_len=32, nlmsg_type=RTM_NEWLINK, nlmsg_flags=NLM_F_REQUEST|NLM_F_ACK, nlmsg_seq=1714051690, nlmsg_pid=0}, {ifi_family=AF_UNSPEC, ifi_type=ARPHRD_NETROM, ifi_index=if_nametoindex("veth0"), ifi_flags=IFF_UP, ifi_change=0x1}], iov_len=32}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 32 <0.000064>
The output is a bit messy, but you can see two sendmsg calls. One is for RTM_GETLINK, and the second is for RTM_NEWLINK. In this output, you can see the header and payload. For instance, the header of the second message, RTM_NEWLINK, is:
{nlmsg_len=32, nlmsg_type=RTM_NEWLINK, nlmsg_flags=NLM_F_REQUEST|NLM_F_ACK, nlmsg_seq=1714051690, nlmsg_pid=0}
And the payload is:
{ifi_family=AF_UNSPEC, ifi_type=ARPHRD_NETROM, ifi_index=if_nametoindex("veth0"), ifi_flags=IFF_UP, ifi_change=0x1}
With strace, we could study requests like creating a bridge, etc., and reproduce the messages from our code.
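The RTM_GETLINK request in the trace carries more than a header and payload: it ends with attributes such as IFLA_IFNAME with nla_len=10 and the value "veth0". The sketch below shows how one such attribute might be encoded to reproduce that part of the message. It uses the standard library's syscall package, and encodeStringAttr and rtaAlign are hypothetical helpers. The length of 10 is the 4-byte attribute header plus "veth0" plus a terminating NUL, padded to 12 bytes on the wire, which fits the trace: 16 (header) + 16 (payload) + 8 (IFLA_EXT_MASK) + 12 (IFLA_IFNAME) = 52.

```go
package main

import (
	"fmt"
	"syscall"
	"unsafe"
)

// rtaAlign rounds n up to the 4-byte attribute boundary.
func rtaAlign(n int) int {
	return (n + syscall.RTA_ALIGNTO - 1) &^ (syscall.RTA_ALIGNTO - 1)
}

// encodeStringAttr (hypothetical helper) builds one attribute that carries
// a NUL-terminated string, including the trailing padding.
func encodeStringAttr(attrType uint16, value string) []byte {
	length := syscall.SizeofRtAttr + len(value) + 1 // header + string + NUL
	buf := make([]byte, rtaAlign(length))           // zero-filled, so NUL and padding are already set

	// Write the attribute header in host byte order.
	hdr := (*syscall.RtAttr)(unsafe.Pointer(&buf[0]))
	hdr.Len = uint16(length) // Len does not include the trailing padding
	hdr.Type = attrType

	copy(buf[syscall.SizeofRtAttr:], value)
	return buf
}

func main() {
	attr := encodeStringAttr(syscall.IFLA_IFNAME, "veth0")
	fmt.Printf("%d bytes: % x\n", len(attr), attr)
}
```

Attributes built this way would be appended after the IfInfomsg payload, with the header's Len field increased to cover them.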
Netlink library
Working with sockets and building and parsing our own messages requires quite a lot of work. Thanks to Vish Abrams, we can use the github.com/vishvananda/netlink package in our projects. It provides a lot of Netlink functionality without having to build our own message structures from scratch. It's well-maintained and used by many projects like Docker, Cilium, Flannel, Istio, etc.
Thanks to this library, adding a new bridge to the system is a matter of a few lines:
la := netlink.NewLinkAttrs()
la.Name = "docker0"
dockerBridge := &netlink.Bridge{LinkAttrs: la}
err := netlink.LinkAdd(dockerBridge)
A few words in conclusion...
One important aspect I didn't mention earlier is that Netlink messages use the byte order of the host's CPU architecture. This means we don't need to worry about converting between little and big-endian for integers. Additionally, it's crucial that our messages are padded to a four-byte boundary. For example, if a message is 33 bytes long, we'll need to send 36 bytes.
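The padding rule comes down to rounding a length up to the next multiple of four. A small sketch, with nlmsgAlign and padMessage as hypothetical helper names:

```go
package main

import "fmt"

// nlmsgAlignTo is the Netlink message alignment in bytes.
const nlmsgAlignTo = 4

// nlmsgAlign rounds n up to the next multiple of nlmsgAlignTo.
func nlmsgAlign(n int) int {
	return (n + nlmsgAlignTo - 1) &^ (nlmsgAlignTo - 1)
}

// padMessage returns a copy of msg extended with zero bytes up to the
// next 4-byte boundary.
func padMessage(msg []byte) []byte {
	padded := make([]byte, nlmsgAlign(len(msg)))
	copy(padded, msg)
	return padded
}

func main() {
	for _, n := range []int{32, 33, 36} {
		fmt.Printf("%d -> %d\n", n, nlmsgAlign(n))
	}
	// prints:
	// 32 -> 32
	// 33 -> 36
	// 36 -> 36
}
```

The 32-byte message from this article is already a multiple of four, which is why the example program never had to pad anything.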
I wrote about Netlink almost three years ago, but it was in Slovak. I decided to write this English version because, even after three years, I still find Netlink interesting and worth studying and trying.