From AF_UNIX to kdbus

You’re a developer and you know AF_UNIX? You’ve used it occasionally in your code, you know how high-level IPC layers marshaling on top of it, and you generally feel confident talking about it? But you actually have no clue what this fancy new kdbus is really about? During discussions you just nod along and hope nobody notices?

Good.

This is how it should be! As long as you don’t work on IPC libraries, there’s absolutely no requirement for you to have any idea what kdbus is. But as you’re reading this, I assume you’re curious and want to know more. So let’s pick you up at AF_UNIX and look at a simple example.

AF_UNIX

Imagine a handful of processes that need to talk to each other. You have two options: Either you create a separate socket-pair between each pair of processes, or you create just one socket per process and make sure you can address all others via this socket. The first option causes quadratic growth in the number of sockets and blows up as the number of processes rises. Hence, we choose the latter, so our socket allocation looks like this:

int fd = socket(AF_UNIX, SOCK_DGRAM | SOCK_CLOEXEC | SOCK_NONBLOCK, 0);

Simple. Now we have to make sure the socket has a name and others can find it. We choose not to pollute the file-system but rather use the managed abstract namespace. As we don’t care about the exact name right now, we just let the kernel choose one. Furthermore, we enable credential-transmission so we can identify the peers we get messages from:

struct sockaddr_un address = { .sun_family = AF_UNIX };
int enable = 1;

setsockopt(fd, SOL_SOCKET, SO_PASSCRED, &enable, sizeof(enable));
bind(fd, (struct sockaddr*)&address, sizeof(address.sun_family));

By omitting the sun_path part of the address, we tell the kernel to pick one itself. That was easy. Now we’re ready to go, so let’s see how we can send a message to a peer. For simplicity, we assume we know the address of the peer and that it’s stored in destination.

struct sockaddr_un destination = { .sun_family = AF_UNIX, .sun_path = "..." };

sendto(fd, "foobar", 7, MSG_NOSIGNAL, (struct sockaddr*)&destination, sizeof(destination));

…and that’s all that is needed to send our message to the selected destination. On the receiver’s side, we call into recvmsg to receive the first message from our queue. We cannot use recvfrom as we want to fetch the credentials, too. Furthermore, we cannot know how big the message is, so we query the kernel first and allocate a suitable buffer. This could be avoided if we knew the maximum packet size. But let’s be thorough and support unlimited packet sizes. Also note that recvmsg will return the next queued message, whichever peer sent it. We cannot know the sender beforehand, so we also pass a buffer to store the address of the sender of this message:

char control[CMSG_SPACE(sizeof(struct ucred))];
struct sockaddr_un sender = {};
struct ucred creds = {};
struct msghdr msg = {};
struct iovec iov = {};
struct cmsghdr *cmsg;
char *message;
ssize_t l;
int size;

ioctl(fd, SIOCINQ, &size);
message = malloc(size + 1);
iov.iov_base = message;
iov.iov_len = size;

msg.msg_name = (struct sockaddr*)&sender;
msg.msg_namelen = sizeof(sender);
msg.msg_iov = &iov;
msg.msg_iovlen = 1;
msg.msg_control = control;
msg.msg_controllen = sizeof(control);

l = recvmsg(fd, &msg, MSG_CMSG_CLOEXEC);
message[l < 0 ? 0 : l] = 0;     /* NUL-terminate; we allocated size + 1 above */

for (cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg)) {
        if (cmsg->cmsg_level == SOL_SOCKET && cmsg->cmsg_type == SCM_CREDENTIALS)
                memcpy(&creds, CMSG_DATA(cmsg), sizeof(creds));
}

printf("Message: %s (length: %zd uid: %u sender: %s)\n",
       message, l, creds.uid, sender.sun_path + 1);
free(message);

That’s it. With this in place, we can easily send arbitrary messages between our peers. We have no length restriction, we can identify the peers reliably, and we’re not limited by any marshaling. Sure, we dropped error-handling and event-loop integration, and we ignored some nasty corner cases, but all of that can be solved. The code stays mostly the same.

Congratulations! You now understand kdbus. In kdbus:

  • socket(AF_UNIX, SOCK_DGRAM, 0) becomes open("/sys/fs/kdbus/….", …)
  • bind(fd, …, …) becomes ioctl(fd, KDBUS_CMD_HELLO, …)
  • sendto(fd, …) becomes ioctl(fd, KDBUS_CMD_SEND, …)
  • recvmsg(fd, …) becomes ioctl(fd, KDBUS_CMD_RECV, …)

Granted, the code will look slightly different. However, the concept stays the same. You still transmit raw messages, no marshaling is mandated. You can transmit credentials and file-descriptors, you can specify the peer to send messages to, and you get to do all of that through a single file descriptor. Doesn’t sound complex, does it?

So if kdbus is actually just like AF_UNIX+SOCK_DGRAM, why use it?

kdbus

The AF_UNIX setup described above has some significant flaws:

  • Messages are buffered in the receive-queue, which is relatively small. If you send a message to a peer with a full receive-queue, you will get EAGAIN. However, since your socket is not connected, you cannot wait for POLLOUT as it does not exist for unconnected sockets (which peer would you wait for?). Hence, you’re basically screwed if remote peers do not clear their queues. This is a really severe restriction which makes this model useless for such setups.
  • There is no policy regarding who can talk to whom. In the abstract namespace, everyone can talk to anyone in the same network namespace. This can be avoided by placing sockets in the file system. However, then you lose the auto-cleanup feature of the abstract namespace. Furthermore, you’re limited to file-system policies, which might not be suitable.
  • You can only have a single name per socket. If you implement multiple services that should be reached via different names, then you need multiple sockets (and thus you lose global message ordering).
  • You cannot send broadcasts. You cannot even easily enumerate peers.
  • You cannot get notifications if peers die. However, this is crucial if you provide services to them and need to clean them up after they disconnected.

kdbus solves all these issues. Some of these issues could be solved with AF_UNIX (which, by the way, would look a lot like AF_NETLINK), some cannot. But more importantly, AF_UNIX was never designed as a shared bus and hence should not be used as such. kdbus, on the other hand, was designed with a shared bus model in mind and as such avoids most of these issues.

With that basic understanding of kdbus, next time a news source reports about the “crazy idea to shove DBus into the kernel”, I hope you’ll be able to judge for yourself. And if you want to know more, I recommend building the kernel documentation, or diving into the code.

10 thoughts on “From AF_UNIX to kdbus”

  1. linux_since_1_2_13

    “buffered in the receive-queue”

    It’s like you’ve never heard of POSIX message queues, which let you read messages out-of-order by requesting messages of a certain priority.

    “There is no policy”

    Only if you choose the wrong tool for the job – named pipes (FIFOs) and the rest of sysvipc (semaphores, message queues) are bound to a file handle, which you can use to limit access by standard file permissions. Almost every use-case where it is usually claimed that file permissions are not sufficient can be handled by creating more *groups*, and the rest can be handled by ACLs.

    “single name per socket”, “broadcasts”

    You do know that AF_TIPC has been in the kernel for some time now, right? How, *exactly* is it not sufficient, when it is already used for these purposes in high-speed compute clusters? Oh, right – it wouldn’t require new code (more NIH-syndrome), and wouldn’t require a daemon (dbus) to re-implement what the kernel already does.

    1. David Herrmann (post author)

      This story explains what kdbus does, based on an AF_UNIX example. No idea where your anger comes from and why this is in any way related to this story. But ok, let’s discuss your points:

      1) POSIX msgqueues are 1:1, not 1:n. If used for buses, it’d require n^2 endpoints (as discussed in the post).

      2) Regarding policy, I explicitly stated that this applies to the abstract namespace. If you place sockets in the file-system, normal unix permissions apply (did you even read the story?).

      3) Last, but not least, TIPC is unsuitable for shared-memory machines. Or at least bears significant overhead due to being designed as networking protocol (compared to shared-memory protocols like kdbus which allow zero-copy). It does not allow credential-passing, nor file-descriptor passing, not even speaking about the trust-model it enforces.

      If you can provide an existing transport that is suitable for DBus, I’d be more than happy to hear about it. I’m not aware of any, so we wrote kdbus.

  2. Jude Nelson

    Hi David, thank you for taking the time to write this up!

    I’m still trying to understand the functional difference between kdbus’s functionality and the functionality offered by a composition of existing IPC mechanisms (including but not limited to AF_UNIX sockets). You took a crack at this when discussing the flaws of AF_UNIX versus kdbus, but I’m hoping you could clarify a few things:

    “Messages are buffered in the receive-queue, which is relatively small. If you send a message to a peer with a full receive-queue, you will get EAGAIN. However, since your socket is not connected, you cannot wait for POLLOUT as it does not exist for unconnected sockets (which peer would you wait for?). Hence, you’re basically screwed if remote peers do not clear their queues. This is a really severe restriction which makes this model useless for such setups.”

    If I’m going to develop an application-level communication protocol on top of a kdbus file descriptor, I’m probably going to employ some sort of message framing anyway. This also implies that I’m probably going to be able to figure out how big the maximum message size will be, so it’s feasible to ask the kernel for an appropriately-sized buffer in this case.

    But for the sake of argument, suppose this is not the case. If I need the OS to provide me a bytestream abstraction, why is it not good enough for a “source” process to create a pipe and use an AF_UNIX socket to send the read end to the “sink” process? This seems to solve the problem of detecting when the peer has died or disconnected–the peer would get SIGPIPE. It also seems to solve the problem of the receiver not clearing its buffer fast enough–the sender can poll(2) to detect when there’s space. I can already share data across a pipe in a zero-copy manner via splice(2), and I can already increase the maximum pipe size to whatever I need (albeit not on a per-process basis, but I’d imagine that this functionality could be added with minimal effort). What am I missing here?
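    A minimal sketch of the scheme described above (single process for brevity; the “source” and “sink” would of course be separate processes, connected by a real AF_UNIX socket rather than a socketpair): the read end of a pipe is handed over as SCM_RIGHTS ancillary data, and data then flows through the pipe.

    ```c
    #define _GNU_SOURCE
    #include <assert.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
            int sv[2], pipefd[2];
            assert(socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) == 0);
            assert(pipe(pipefd) == 0);

            /* "source": send the pipe's read end as ancillary data */
            char dummy = 'p';
            struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
            char cbuf[CMSG_SPACE(sizeof(int))];
            struct msghdr msg = {
                    .msg_iov = &iov, .msg_iovlen = 1,
                    .msg_control = cbuf, .msg_controllen = sizeof(cbuf),
            };
            struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
            cmsg->cmsg_level = SOL_SOCKET;
            cmsg->cmsg_type = SCM_RIGHTS;
            cmsg->cmsg_len = CMSG_LEN(sizeof(int));
            memcpy(CMSG_DATA(cmsg), &pipefd[0], sizeof(int));
            assert(sendmsg(sv[0], &msg, 0) == 1);

            /* "sink": receive the descriptor */
            char rcbuf[CMSG_SPACE(sizeof(int))];
            struct msghdr rmsg = {
                    .msg_iov = &iov, .msg_iovlen = 1,
                    .msg_control = rcbuf, .msg_controllen = sizeof(rcbuf),
            };
            assert(recvmsg(sv[1], &rmsg, 0) == 1);
            int rfd = -1;
            cmsg = CMSG_FIRSTHDR(&rmsg);
            assert(cmsg && cmsg->cmsg_type == SCM_RIGHTS);
            memcpy(&rfd, CMSG_DATA(cmsg), sizeof(int));

            /* the sink now streams data out of the passed pipe */
            assert(write(pipefd[1], "stream", 6) == 6);
            char buf[16] = {};
            assert(read(rfd, buf, sizeof(buf)) == 6);
            printf("via passed pipe: %s\n", buf);
            return 0;
    }
    ```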

    “There is no policy regarding who can talk to whom. In the abstract namespace, everyone can talk to anyone in the same network namespace. This can be avoided by placing sockets in the file system. However, then you lose the auto-cleanup feature of the abstract namespace. Furthermore, you’re limited to file-system policies, which might not be suitable.”

    I’m not sure entirely what you mean by “policy” here–do you mean to say that the way a process treats a peer is predicated on information besides that provided by `struct ucred`? If so, then arguably you could solve this problem by sharing authentication tokens and shared secrets with peers out-of-band, and have the peers submit them (or proof that they have them, like an HMAC) when communicating.

    Also, the auto-cleanup feature is solvable in the filesystem, if you’re willing to use FUSE. With FUSE, each I/O operation the filesystem server receives includes the task ID of the calling process. Therefore, you can create a FUSE filesystem for holding data that shares fate with the process that creates it:

    * On create(2), record in the inode the relevant information about the calling process from /proc
    * On stat(2), use the inode’s process creator information to check if the calling process is still alive. If not, then return ENOENT and unlink the inode.

    As a result, once a process dies, no subsequent attempt to open its files will succeed. I have a proof-of-concept here, if you’re interested: https://github.com/jcnelson/runfs. It should be possible to extend the concept to do more clever things, like support an extended attribute to “reassign” the file to a different process, or mark a file as “unshared” from its creator’s fate, for example.

    “You can only have a single name per socket. If you implement multiple services that should be reached via different names, then you need multiple sockets (and thus you lose global message ordering).”

    This assumes that I want global message ordering in the first place. More often than not, I don’t need it or want it, since messages between peers that only interact with one another will suffer the convoy effect if the bus is under load.

    It seems to me that what I really need is causal message ordering–I need message M1 to arrive before message M2 only if there is some way that M1’s transmission caused M2’s transmission. But, the ordering of M1 and M2 relative to M3 is (usually) irrelevant if there is no causal link between M3 and the other two. If this assumption holds, then groups of processes communicating need only implement vector clocks as part of their application-level messages to one another–the IPC system does not need to be involved.
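    For concreteness, here is a toy version of the vector-clock check (the number of peers and the sample clocks are made up for illustration): a receiver compares the clock on an incoming message against what it has already seen, and only a genuine causal predecessor forces an ordering.

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define PEERS 3

    /* a happened-before b iff every component of a is <= the corresponding
       component of b, and at least one is strictly smaller */
    static bool happened_before(const int a[PEERS], const int b[PEERS])
    {
            bool strictly_less = false;
            for (int i = 0; i < PEERS; i++) {
                    if (a[i] > b[i])
                            return false;
                    if (a[i] < b[i])
                            strictly_less = true;
            }
            return strictly_less;
    }

    int main(void)
    {
            int m1[PEERS] = { 1, 0, 0 };  /* A's notification */
            int m2[PEERS] = { 1, 0, 1 };  /* C's request, sent after seeing m1 */
            int m3[PEERS] = { 0, 2, 0 };  /* unrelated traffic */

            assert(happened_before(m1, m2));   /* m1 must be processed first */
            assert(!happened_before(m3, m2));  /* no causal link either way: */
            assert(!happened_before(m2, m3));  /* their order is free */
            printf("causal checks passed\n");
            return 0;
    }
    ```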

    But, again, what am I missing here? Is there a non-trivial scenario where global message ordering is absolutely needed?

    “You cannot send broadcasts. You cannot even easily enumerate peers.”

    What if the processes that needed to communicate simply agreed on a naming convention for their AF_UNIX sockets? Then, there’s a trivial solution to enumeration: put all AF_UNIX sockets for the group in the same directory, and have peers list that directory to see who’s available.

    This also solves the broadcast problem (albeit inefficiently): the broadcaster simply enumerates each peer’s AF_UNIX socket and sends the message to each one. If memory bandwidth is a concern, the sender could instead put the message into a memfd, seal it, and send the memfd’s descriptor to each peer (replace “memfd” with whatever shared memory scheme fits your threat model). Granted, this would mean you couldn’t use pipes as described above, but I’d imagine that this would be an exceptional case already (I’d be curious to know how often this is a problem, when considering the distribution of dbus messages sizes on a typical desktop today?). Is there a case where even that would be too costly?
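    The sealed-memfd variant can be sketched like this (Linux-specific, single process for brevity): the broadcaster writes the payload once, seals it so no recipient can alter it, and would then pass the same descriptor to every peer.

    ```c
    #define _GNU_SOURCE
    #include <assert.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            int fd = memfd_create("broadcast", MFD_ALLOW_SEALING);
            assert(fd >= 0);
            assert(write(fd, "announcement", 12) == 12);

            /* seal: nobody holding this fd can grow, shrink or modify it */
            assert(fcntl(fd, F_ADD_SEALS,
                         F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE) == 0);
            assert(write(fd, "x", 1) < 0);  /* further writes are rejected */

            /* each "peer" (here: the same process) reads the immutable payload */
            char buf[16] = {};
            assert(pread(fd, buf, sizeof(buf), 0) == 12);
            printf("broadcast payload: %s\n", buf);
            close(fd);
            return 0;
    }
    ```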

    “You cannot get notifications if peers die. However, this is crucial if you provide services to them and need to clean them up after they disconnected.”

    If you use pipes as described above, you’d get SIGPIPE when the peer died.

    Thanks for reading, and I look forward to your response!
    -Jude

    1. David Herrmann (post author)

      The comment section isn’t really appropriate for discussions, but I’ll try to answer the key points briefly:

      1) If you use pipes as you described, you need n^2 endpoints if each peer talks to each other. This blows up if you get many peers. 1:1 communication is really not suitable to implement buses.

      2) “unknown maximum packet size” does *not* imply that you want stream-behavior. It just means that the payload of your packet might be of any size (e.g., you send strings or file contents). You still want packet-based behavior, not streams. Streams, on the other hand, are useful for _streaming_ of data and flow-control. Both have valid use-cases.

      3) With ‘policy’ I mean “peer protection”: if multiple different users talk to the same service, that service needs to be able to serve them fairly and dictate the access-restrictions regarding any communication to it. This is crucial for shared, unprivileged buses.

      4) Importance of global ordering: Imagine two services A and B, where B depends on the state of A. If client C modifies A and afterwards modifies B, then with global ordering, B can be sure to get the notification that A changed before the request from C comes in. If that was not true, B would have to synchronously ask A for its state before serving C.
      This is a pretty common model. One major example: Many services are only offered to the foreground session. Once that session changes, you want to reject requests from the old session. With global ordering, those services can be sure to get notifications about session-changes in-order.

      5) Broadcasts _require_ filters. Your file-system model does not work for that. So what you’d have to do is a publish-subscribe model (e.g., sub-directories with symlinks back to the subscribers). Not compatible with the dbus-spec, though.

      1. Jude Nelson

        Thank you for your reply! Will try to be brief this time.

        (1) Gotcha–kdbus adds efficient broadcasts.

        (2) I think we’re on the same page. The point I was trying to make was that conventionally the *application* handles message framing, not the IPC system (i.e. the end-to-end argument). My confusion was with what kdbus was trying to achieve here, but my understanding now is that kdbus strives to offer an easy-to-use framing mechanism?

        (3) I don’t see how this is incompatible with what I said above. If the application-level messaging protocol can already address this, then I guess the point of having kdbus do this as well is to implement peer protection in an application-agnostic manner?

        (4) In your example, A, B, and C each live in the same causal domain, which is different from my previous example. By contrast, messages in different sessions are usually in different causal domains, since applications in different sessions usually do not interact. For these messages, there is no need to impose a global order, since applications in one session should not be able to affect the others in the first place. This is why we can get away with a per-session bus in a lot of cases, for example.

        Within a causal domain, having A, B, and C include vector clocks in their messages would address the ordering problem without needing to get the bus involved. C’s message to B would include the timestamp A put into its reply to C, and if B sees that it’s greater than the timestamp of A’s last message to it (i.e. C raced ahead of A when it messaged B), then B knows to block C until it receives all of the messages from A up to that timestamp.

        (5) I agree that broadcasts require filtering, but the confusion is in where the filtering is most effectively applied. If the goal is to have system-wide filtering policy enforcement, then it makes sense to put the logic to do so at the IPC level (e.g. an “IPC firewall”), which I guess is what you’re trying to achieve?

      2. David Herrmann (post author)

        1) “everyone talks to each other” does not imply broadcasts. It might just be unicasts, but it’d still require n^2 endpoints. But I think we’re on the same page.

        2) Yes. DBus provides simple message framing without the app having to deal with buffer sizes.

        3) You still need to make sure your application is not flooded by a single user, and that the in-queue is shared fairly amongst your peers. You probably even want some rate-limiting at some point.

        4) I cannot see how your example (within the same domain) is reliable if you have malicious peers (you copy timestamps over?).

        5) Yes.

  3. Stanislav Tertin

    I love the list (at the end) of problems kdbus solves — it could double as a list of horrible software design practice. “kdbus lets you implement the following awful ideas:”

  4. wtarreau

    I’m particularly concerned about the semantics here:
    – open(“/sys/fs/kdbus/…”) => not compatible with a chroot, or you have to mount /sys into your chroots, which more or less defeats the isolation purpose of chroot
    – ioctls are not the cleanest way to exchange large amounts of data. I don’t understand why send/recv were not used instead. And it’s not natural to poll() for read/write when an ioctl reports EAGAIN.

    1. David Herrmann (post author)

      Hey!

      1) Regarding /sys: this is the default mount-point, but in no way mandatory. It is also not bound to sysfs at all. Containers can safely mount a tmpfs at /sys and then mount kdbusfs at /sys/fs/kdbus. They’re also free to mount it anywhere else. We just tried to follow the current standard default mount-points (like cgroupsfs, fuse, pstore, ..).

      2) We support transmitting metadata (just like sendmsg() / recvmsg() do). Hence, we need something similar to “struct msghdr”. Also note that sendmsg/recvmsg are not available to normal files, hence we used ioctl(). Also note that the difference between sendmsg() and ioctl() is marginal as both basically just transmit a command-blob.

