From AF_UNIX to kdbus

You’re a developer and you know AF_UNIX? You’ve used it occasionally in your code, you know how high-level IPC puts marshaling on top of it, and you generally feel confident talking about it? But you actually have no clue what this fancy new kdbus is really about? During discussions you just nod along and hope nobody notices?

Good.

This is how it should be! As long as you don’t work on IPC libraries, there’s absolutely no requirement for you to have any idea what kdbus is. But as you’re reading this, I assume you’re curious and want to know more. So let’s pick you up at AF_UNIX and look at a simple example.

AF_UNIX

Imagine a handful of processes that need to talk to each other. You have two options: either you create a separate socket-pair between each pair of processes, or you create just one socket per process and make sure you can address all others via this socket. The first option causes quadratic growth in the number of sockets and blows up as the number of processes rises. Hence, we choose the latter, so our socket allocation looks like this:

int fd = socket(AF_UNIX, SOCK_DGRAM | SOCK_CLOEXEC | SOCK_NONBLOCK, 0);

Simple. Now we have to make sure the socket has a name and others can find it. We choose to not pollute the file-system but rather use the managed abstract namespace. As we don’t care for the exact names right now, we just let the kernel choose one. Furthermore, we enable credential-transmission so we can recognize peers that we get messages from:

struct sockaddr_un address = { .sun_family = AF_UNIX };
int enable = 1;

/* request SCM_CREDENTIALS control-messages on incoming data */
setsockopt(fd, SOL_SOCKET, SO_PASSCRED, &enable, sizeof(enable));
/* passing only sun_family (no sun_path) triggers abstract-namespace autobind */
bind(fd, (struct sockaddr*)&address, sizeof(address.sun_family));

By omitting the sun_path part of the address, we tell the kernel to pick one itself. This was easy. Now we’re ready to go, so let’s see how we can send a message to a peer. For simplicity, we assume we know the address of the peer and it’s stored in destination.

struct sockaddr_un destination = { .sun_family = AF_UNIX, .sun_path = "..." };

sendto(fd, "foobar", 7, MSG_NOSIGNAL, (struct sockaddr*)&destination, sizeof(destination));

…and that’s all that is needed to send our message to the selected destination. On the receiver’s side, we call recvmsg to fetch the first message from our queue. We cannot use recvfrom as we want to fetch the credentials, too. Furthermore, we cannot know in advance how big the message is, so we query the kernel first and allocate a suitably sized buffer. This could be avoided if we knew the maximum message size. But let’s be thorough and support unlimited message sizes. Also note that recvmsg returns whichever message is queued next; we cannot know the sender beforehand, so we also pass a buffer that receives the address of the sender of this message:

char control[CMSG_SPACE(sizeof(struct ucred))];
struct sockaddr_un sender = {};
struct ucred creds = {};
struct msghdr msg = {};
struct iovec iov = {};
struct cmsghdr *cmsg;
char *message;
ssize_t l;
int size;

ioctl(fd, SIOCINQ, &size);    /* query the size of the next queued message */
message = malloc(size + 1);
iov.iov_base = message;
iov.iov_len = size;

msg.msg_name = (struct sockaddr*)&sender;
msg.msg_namelen = sizeof(sender);
msg.msg_iov = &iov;
msg.msg_iovlen = 1;
msg.msg_control = control;
msg.msg_controllen = sizeof(control);

l = recvmsg(fd, &msg, MSG_CMSG_CLOEXEC);

/* walk the control-messages to extract the sender's credentials */
for (cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg)) {
        if (cmsg->cmsg_level == SOL_SOCKET && cmsg->cmsg_type == SCM_CREDENTIALS)
                memcpy(&creds, CMSG_DATA(cmsg), sizeof(creds));
}

printf("Message: %s (length: %zd uid: %u sender: %s)\n",
       message, l, creds.uid, sender.sun_path + 1);
free(message);

That’s it. With this in place, we can easily send arbitrary messages between our peers. We have no length restriction, we can identify the peers reliably and we’re not limited by any marshaling. Sure, we dropped error handling and event-loop integration, and we ignored some nasty corner cases, but all of that can be solved. The code stays mostly the same.

Congratulations! You now understand kdbus. In kdbus:

  • socket(AF_UNIX, SOCK_DGRAM, 0) becomes open(“/sys/fs/kdbus/….”, …)
  • bind(fd, …, …) becomes ioctl(fd, KDBUS_CMD_HELLO, …)
  • sendto(fd, …) becomes ioctl(fd, KDBUS_CMD_SEND, …)
  • recvmsg(fd, …) becomes ioctl(fd, KDBUS_CMD_RECV, …)

Granted, the code will look slightly different. However, the concept stays the same. You still transmit raw messages, no marshaling is mandated. You can transmit credentials and file-descriptors, you can specify the peer to send messages to and you got to do that all through a single file descriptor. Doesn’t sound complex, does it?
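To make this concrete, here is a rough sketch of the kdbus flow. Take it with a grain of salt: the bus path is just an example and the exact struct layouts changed between kdbus revisions, so treat this as schematic pseudo-C rather than reference code:

/* schematic only — the kdbus structs carry many more fields than shown */
int fd = open("/sys/fs/kdbus/0-system/bus", O_RDWR | O_CLOEXEC);

struct kdbus_cmd_hello hello = { .size = sizeof(hello) };
ioctl(fd, KDBUS_CMD_HELLO, &hello);        /* register on the bus, like bind(2) */

/* messages are described by a struct kdbus_msg plus a list of items
 * (payload vectors, file-descriptors, credentials, …) */
struct kdbus_msg msg = { .size = sizeof(msg) };
ioctl(fd, KDBUS_CMD_SEND, &msg);           /* like sendto(2) */

struct kdbus_cmd_recv recv = { .size = sizeof(recv) };
ioctl(fd, KDBUS_CMD_RECV, &recv);          /* like recvmsg(2) */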

So if kdbus is actually just like AF_UNIX+SOCK_DGRAM, why use it?

kdbus

The AF_UNIX setup described above has some significant flaws:

  • Messages are buffered in the receive-queue, which is relatively small. If you send a message to a peer with a full receive-queue, you will get EAGAIN. However, since your socket is not connected, you cannot wait for POLLOUT as it does not exist for unconnected sockets (which peer would you wait for?). Hence, you’re basically screwed if remote peers do not clear their queues. This is a severe restriction that makes the model useless whenever you cannot rely on remote peers to drain their queues.
  • There is no policy regarding who can talk to whom. In the abstract namespace, everyone can talk to anyone in the same network namespace. This can be avoided by placing sockets in the file system. However, then you lose the auto-cleanup feature of the abstract namespace. Furthermore, you’re limited to file-system policies, which might not be suitable.
  • You can only have a single name per socket. If you implement multiple services that should be reached via different names, then you need multiple sockets (and thus you lose global message ordering).
  • You cannot send broadcasts. You cannot even easily enumerate peers.
  • You cannot get notifications if peers die. However, this is crucial if you provide services to them and need to clean them up after they disconnected.

kdbus solves all these issues. Some of these issues could be solved with AF_UNIX (which, btw., would look a lot like AF_NETLINK), some cannot. But more importantly, AF_UNIX was never designed as a shared bus and hence should not be used as such. kdbus, on the other hand, was designed with a shared bus model in mind and as such avoids most of these issues.

With that basic understanding of kdbus, next time a news source reports about the “crazy idea to shove DBus into the kernel”, I hope you’ll be able to judge for yourself. And if you want to know more, I recommend building the kernel documentation, or diving into the code.

GNOME Westcoast Summit 2014

Two months ago, in April ’14, I was in San Francisco to meet with other FOSS developers and discuss current projects. There were several events, including the first GNOME Westcoast Summit and a systemd hackfest at Pantheon. I’ve been working on a lot of stuff lately and it was nice to talk directly to others about it. I wrote in-depth articles (on this blog) for the most interesting stories, but below is a short overview of what I focused on in SF:

  • memfd: My most important project currently is memfd. We fixed several bugs and nailed down the API. It was also nice to get feedback from a lot of different projects about interesting use-cases that we didn’t think of initially. As it turns out, file-sealing is something a lot of people can make great use of.
  • MiracleCast: For about half a year I worked on the first Open-Source implementation of Miracast. It’s still under development and only working Sink-side, but there are plans to make it work Source-side, too. Miracast allows replacing HDMI cables with a wireless solution. You can connect your monitor, TV or projector via standard wifi to your desktop and use it as a mirror or desktop-extension. The monitor is the sink-side and MiracleCast can already provide a full Miracast stack for it. However, for the more interesting Source-side (e.g., a Gnome-Desktop) I had a lot of interesting discussions with Gnome developers about how to integrate it. I have some prototypes running locally, but it will definitely take a lot longer before it works properly. However, the current sink-side implementation has a latency of approx. 50ms and can run 1080p at 30fps. This is already pretty impressive and on-par with proprietary solutions.
  • kdbus: The new general-purpose IPC mechanism is already fleshed out, but we spent a lot of time fixing races in the code and doing some general code review. It is a very promising project and all of the criticism I’ve heard so far was rubbish. People tend to rant about moving dbus into the kernel, even though kdbus really has nearly nothing to do with dbus, except that it provides an underlying data-bus infrastructure. Seriously, the kernel-mode-setting helpers alone, not counting any driver-specific code, are already much bigger than kdbus… and in my opinion, kdbus will make dbus a lot more efficient and appealing to new developers.
  • GPU: GPU-switching, offload-GPUs and USB/wifi display-controllers are a few of the many new features in the graphics subsystem. They’re mostly unsupported in any user-space, so we decided to change that. It’s all highly technical and the way it is supposed to work is fairly obvious. Therefore, I will avoid discussing the details here. Let’s just say, on-demand and live GPU-switching is something I’m making possible as part of GSoC this summer.
  • User-bus: This topic sounds fairly boring and technical, but it’s not. The underlying question is: What happens if you log in multiple times as the same user on the same system? Currently, a desktop system either rejects multiple logins of the same user or treats them as separate, independent logins. The second approach has the problem that many applications cannot deal with this. Many per-user resources have to be shared (like the home-directory). Firefox, for instance, cannot run multiple times for the same user. However, no-one wants to prevent multiple logins of the same user, as it really is a nice feature. Therefore, we came up with a hybrid approach which basically boils down to a single session shared across all logins of the same user. So if you login twice, you get the same screen for both logins sharing the same applications. The window-manager can put you on a separate virtual desktop, but the underlying session is basically the same. Now if you do the same across multiple seats, you simply merge both sessions of these seats into a single huge session with the screen mirrored across all assigned monitors. A more in-depth article will follow once the details have been figured out.

A lot of the things I worked on deal with the low-level system and are hardly visible to the average Gnome user. However, without a proper system API, there’s no Gnome and I’m very happy the Gnome Foundation is acknowledging this by sponsoring my trip to SF: Thanks a lot! And hopefully I’ll see you again next year!

memfd_create(2)

For 4 months now we’ve been hacking on a new syscall for the linux-kernel, called memfd_create. The intention is to provide an easy way to get a file-descriptor for anonymous memory, without requiring a local tmpfs mount-point. The syscall takes 2 parameters, a name and a bunch of flags (which I will not discuss here):

int memfd_create(const char *name, unsigned int flags);

If successful, a new file-descriptor pointing to a freshly allocated memory-backed file is returned. That file is a regular file in a kernel-internal filesystem. Therefore, most filesystem operations are supported, including:

  • ftruncate(2) to change the file size
  • read(2), write(2) and all its derivatives to inspect/modify file contents
  • mmap(2) to get a direct memory-mapping
  • dup(2) to duplicate file-descriptors
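Putting these together, a minimal usage sketch (glibc gained a memfd_create wrapper only later, so this goes through syscall(2); the name "my-buffer" is arbitrary and purely for debugging):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int fd = syscall(SYS_memfd_create, "my-buffer", MFD_CLOEXEC);
ftruncate(fd, 4096);        /* give the file a size */
void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
/* p now points to 4096 bytes of anonymous, file-backed memory */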

Theoretically, you could achieve similar behavior without introducing new syscalls, like this:

int fd = open("/tmp/random_file_name", O_RDWR | O_CREAT | O_EXCL, S_IRWXU);
unlink("/tmp/random_file_name");

or this

int fd = shm_open("/random_file_name", O_RDWR | O_CREAT | O_EXCL, S_IRWXU);
shm_unlink("/random_file_name");

or this

int fd = open("/tmp", O_RDWR | O_TMPFILE | O_EXCL, S_IRWXU);

Therefore, the most important question is: why the hell do we need yet another way?

Two crucial differences are:

  • memfd_create does not require a local mount-point. It can create objects that are not associated with any filesystem and can never be linked into a filesystem. The backing memory is anonymous memory as if malloc(3) had returned a file-descriptor instead of a pointer. Note that even shm_open(3) requires /dev/shm to be a tmpfs-mount. Furthermore, the backing-memory is accounted to the process that owns the file and is not subject to mount-quotas.
  • There are no name-clashes and no global registry. You can create multiple files with the same name and they will all be separate, independent files. Therefore, the name is purely for debugging purposes so it can be detected in task-dumps or the like.

To be honest, the code required for memfd_create is 100 lines. It didn’t take us months to write these; instead, we added one more feature to memfd_create, called Sealing:

File-Sealing

File-Sealing is used to prevent a specific set of operations on a file. For example, after you wrote data into a file you can seal it against further writes. Any attempt to write to the file will fail with EPERM. Reading will still be possible, though. The crux of this matter is that seals can never be removed, only added. This guarantees that if a specific seal is set, the information that is protected by that seal is immutable until the object is destroyed.

To retrieve the current set of seals on a file, you use fcntl(2):

int seals = fcntl(fd, F_GET_SEALS);

This returns a signed 32-bit integer containing the bitmask of seals currently set on fd. Note that seals are per file, not per file-descriptor (nor per file-description). That means, any file-descriptor for the same underlying inode will share the same seals.

To seal a file, you use fcntl(2) again:

int error = fcntl(fd, F_ADD_SEALS, new_seals);

This takes a bitmask of seals in new_seals and adds these to the current set of seals on fd.

The current set of supported seals is:

  • F_SEAL_SEAL: This seal prevents the seal-operation itself. So once F_SEAL_SEAL is set, any attempt to add new seals via F_ADD_SEALS will fail. Files that don’t support sealing are initially sealed with just this flag. Hence, no other seals can ever be set on them and none have to be enforced.
  • F_SEAL_WRITE: This is the most straightforward seal. It prevents any content modifications once it is set. Any write(2) call will fail and you cannot get any shared, writable mappings for the file, anymore. Unlike the other seals, you can only set this seal if no shared, writable mappings exist at the time of sealing.
  • F_SEAL_SHRINK: Once set, the file cannot be reduced in size. This means, O_TRUNC, ftruncate(), fallocate(FALLOC_FL_PUNCH_HOLE) and friends will be rejected in case they would shrink the file.
  • F_SEAL_GROW: Once set, the file size cannot be increased. Any write(2) beyond file-boundaries, any ftruncate(2) that increases the file size, and any similar operation that grows the file will be rejected.
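Putting the two fcntl(2) calls together, a writer can fill an object and then make it immutable. A minimal sketch, assuming fd refers to a file that supports sealing (see the note on MFD_ALLOW_SEALING below):

/* fill the object, then seal it against any further modification */
write(fd, data, len);
fcntl(fd, F_ADD_SEALS, F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE | F_SEAL_SEAL);

/* from now on, any write(2), ftruncate(2) or further F_ADD_SEALS call
 * on this file will fail */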

Instead of discussing the behavior of each seal on its own, the following list shows some examples of how they can be used. Note that most seals are enforced somewhere low-level in the kernel, instead of directly in the syscall handlers. Therefore, side effects of syscalls I didn’t cover here are still accounted for and the syscalls will fail if they violate any seals.

  • IPC: Imagine you want to pass data between two processes that do not trust each other. That is, there is no hierarchy at all between them and they operate on the same level. The easiest way to achieve this is a pipe, obviously. However, to allow zero-copy (assuming splice(2) is not possible) the processes might decide to use memfd_create to create a shared memory object and pass the file-descriptor to the remote process. Now zero-copy only makes sense if the receiver can parse the data in-line. However, this is not possible in zero-trust scenarios as the source can retain a file-descriptor and modify the contents while the receiver parses it, causing all kinds of failures. But if the receiver requires the object to be sealed with F_SEAL_WRITE | F_SEAL_SHRINK, it can safely mmap(2) the file and parse it inline (see the sketch after this list). No attacker can alter file contents, anymore. Furthermore, this also allows safe multicasts of the message and all receivers can parse the same zero-copy file without affecting each other. Obviously, the file can never be modified again and is a one-shot object. But this is inherent to zero-trust scenarios. We did implement a recycle-operation in case you’re the last user of an object. However, that was dropped due to horrible races in the kernel. It might reoccur in the future, though.
  • Graphics-Servers: This is a very specific use-case of IPC and usually there is a one-way trust relationship from clients to servers. However, a server cannot blindly trust its clients. So imagine a client renders its window-contents into memory and passes a file-descriptor to that memory region (maybe using memfd_create) to the server. Similar to the previous scenario, the server cannot mmap(2) that object for read-access as the client might truncate the file simultaneously, causing SIGBUS on the server. A server can protect itself via SIGBUS-handlers, but sealing is a much simpler way. By requiring F_SEAL_SHRINK, the server can be sure the file will never shrink. At the same time, the client can still grow the object in case it needs bigger buffers for growing windows. Furthermore, writing is still allowed so the object can be re-used for the next frame.
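For the IPC scenario above, the receiving side boils down to a few lines: verify the seals, then map the object read-only (a sketch; the seal constants need _GNU_SOURCE):

int seals = fcntl(fd, F_GET_SEALS);
if (seals < 0 || !(seals & F_SEAL_WRITE) || !(seals & F_SEAL_SHRINK))
        return -EPERM;   /* sender didn’t seal the object; don’t trust it */

struct stat st;
fstat(fd, &st);          /* F_SEAL_SHRINK: the size can never drop below this */
void *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
/* contents are immutable now; safe to parse in-line */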

As you might imagine, there are a lot more interesting use-cases. However, note that sealing is currently limited to objects created via memfd_create with the MFD_ALLOW_SEALING flag. This is a precaution to make sure we don’t break existing setups. However, changing seals of a file requires WRITE-access, thus it is rather unlikely that sealing would allow attacks that are not already possible with mandatory POSIX locks or similar. Hence, it is possible that sealing will expand to other areas in case people request it. Further seal-types are also possible.

Current Status

As of June 2014 the patches for memfd_create and sealing have been publicly available for at least 2 months and are considered for merging. linux-3.16 will probably not include it, but linux-3.17 very likely will. Currently, there are still some issues to be figured out regarding AIO and Direct-IO races. But other than that, we’re good to go.

On Wifi-Display, Democratic Republics and Miracles

Connecting your TV or other external monitors to your laptop has always been a hassle with cables and drivers. Your TV had an HDMI-port, but your laptop only DVI, your projector requires VGA and your MacBook can only do DP. Once you got a matching adapter, you noticed that the cable is far too short for you to sit comfortably on your couch. Fortunately, we’re about to put an end to this. At least, that was my impression when I first heard of Miracast.

Miracast is a certification program of the Wifi-Alliance based on their Wifi-Display specification. It defines a protocol to connect external monitors via wifi to your device. A rough description would be “HDMI over Wifi” and in fact the setup is modeled closely to the HDMI standard. Unlike existing video/audio-streaming protocols, Miracast was explicitly designed for this use-case and serves it very well. When I first heard of it about one year ago, I was quite amazed that it took us until late 2012 to come up with such a simple idea. And it didn’t take me long until I started looking into it.

For 4 months now I have been hacking on wifi, gstreamer and the kernel to get a working prototype. Despite many setbacks, I didn’t lose my fascination for this technology. I did complain a lot about the hundreds of kernel dead-locks, panics and crashes during development and the never-ending list of broken wifi-devices. However, to be fair, I got the same amount of crashes and bugs on Android and competing proprietary Miracast products. I found a lot of other projects with the same goal, but the majority was closed-source, driver-specific, only a proof-of-concept or limited to sink-side. Nevertheless, I continued pushing code to the OpenWFD repository and registered as a speaker for FOSDEM-2014 to present my results. Unfortunately, it took me until 1 week before FOSDEM to get the wifi-P2P link working reliably with an upstream driver. So during these 7 days I hacked up the protocol and streaming part and luckily got a working demo just in time. I never dared pushing that code to the public repository, though (among others, it contains my birthday, which is surprisingly page-aligned, as magic mmap offset) and I doubt it will work on any setup but mine.

The first day after FOSDEM I trashed OpenWFD and decided to start all over again writing a properly integrated solution based on my past experience. Remembering the advice of a friend (“Projects with ‘Open’ in their name are just like countries named ‘Democratic Republic of …'”) I also looked for a new name. Given the fact that it requires a miracle to get Wifi-P2P working, it didn’t take me long to come up with something I liked. I hereby proudly announce: MiracleCast

MiracleCast

The core of MiracleCast is a daemon called miracled. It runs as system-daemon and manages local links, performs peer-discovery on request and handles protocol encoding and parsing. The daemon is independent of desktop environments or data transports and serves as local authority for all display-streaming protocols. While Miracast will stay the main target, other competing technologies like Chromecast, Airtame and AirPlay can be integrated quite easily.

The miraclectl command-line tool can be used to control the daemon. It can create new connections, modify parameters or destroy them again. It also supports an interactive mode where it displays events from miracled when they arrive, displays incoming connection attempts and allows direct user-input to accept or reject any requests.

miracled and miraclectl communicate via DBus so you can replace the command-line interface with graphical helpers. The main objects on this API are Links and Peers. Links describe local interfaces that are used to communicate with remote devices. This includes wifi-devices, but also virtual links like your local IP-Network. Peers are remote devices that were discovered on a local link. miracled hides the transport-type of each peer so you can use streaming protocols on-top of any available link-type (given the remote side supports the same). Therefore, we’re not limited to Wifi-P2P, but can use Ethernet, Bluetooth, AP-based Wifi and basically any other transport with an IP layer on top. This is especially handy for testing and development.

Local processes can now register Sources or Sinks with miracled. These describe capabilities that are advertised on the available links. Remote devices can then initiate a connection to your sink or, if you’re a source, you can connect to a remote sink. miracled implements the transmission protocol and hides it behind a DBus interface. It routes traffic from remote devices to the correct local process and vice versa. Regardless of whether the connection uses Miracast, Chromecast or plain RTSP, the same API is used.

Current Status

The main target still is Miracast! While all the APIs are independent of the protocol and transport layer, I want to get Miracast working as it is a nice way to test interoperability with Android. And for Miracast, we first need Wifi-P2P working. Therefore, that’s what I started with. The current miracled daemon implements a fully working Wifi-P2P user-space based on wpa_supplicant. Everyone interested is welcome to give it a try!

Everything on top, including the actual video-streaming is highly experimental and still hacked on. But if you hacked on Miracast, you know that the link-layer is the hard part. So after 1 week of vacation (that is, 1 week hacking on systemd!) I will return to MiracleCast and tackle local sinks.

As I have a lot more information than I could possibly summarize here, I will try to keep some dummy-documentation on the wiki until the first project release. Please feel free to contact me if you have any questions. Check out the git-repository if you want to see how it works! Note that MiracleCast is focused on proper desktop integration instead of fast prototyping, so please bear with me if API design takes some time. I’d really appreciate help on making Wifi-P2P work with as many devices as possible before we start spending all our efforts on the upper streaming layers.

During the next weeks, I will post some articles that explain step by step how to get Wifi-P2P working, how you can get some Miracast prototypes working and what competing technologies and implementations are available out there. I hope this way I can spread my fascination for Miracast and provide it to as many people as possible!

Splitting DRM and KMS device nodes

While most devices of the 3 major x86 desktop GPU-providers have GPU and display-controllers merged on a single card, recent development (especially on ARM) shows that rendering (via GPU) and mode-setting (via display-controller) are not necessarily bound to the same device. To better support such devices, several changes are being worked on for DRM.

In its current form, the DRM subsystem provides one general-purpose device-node for each registered DRM device: /dev/dri/card<num>. An additional control-node is also created, but it remains unused as of this writing. While in general a kernel driver is allowed to register multiple DRM devices for a single physical device, no driver has made use of this yet. That means, whatever hardware you use, both mode-setting and rendering are done via the same device node. This entails some rather serious consequences:

  1. Access-management to mode-setting and rendering is done via the same file-system node
  2. Mode-setting resources of a single card cannot be split among multiple graphics-servers
  3. Sharing display-controllers between cards is rather complicated

In the following sections, I want to look closer at each of these points and describe what has been done and what is still planned to overcome these restrictions. This is a highly technical description of the changes and serves as outline for the Linux-Plumbers session on this topic. I expect the reader to be familiar with DRM internals.

1) Render-nodes

While render-nodes have been discussed since 2009 on dri-devel, several mmap-related security issues had prevented them from being merged. Those have all been fixed and, 3 days ago, the basic render-node infrastructure was merged. While it’s still marked as experimental and hidden behind the drm.rnodes module parameter, I’m confident we will enable it by default in one of the next kernel releases.

What are render-nodes?

From a user-space perspective, render-nodes are “like a big FPU” (krh) that can be used by applications to speed up computations and rendering. They are accessible via /dev/dri/renderD<num> and provide the basic DRM rendering interface. Compared to the old card<num> nodes, they lack some features:

  • No mode-setting (KMS) ioctls allowed
  • No insecure gem-flink allowed (use dma-buf instead!)
  • No DRM-auth required/supported
  • No legacy pre-KMS DRM-API supported

So whenever an application wants hardware-accelerated rendering, GPGPU access or offscreen-rendering, it no longer needs to ask a graphics-server (via DRI or wl_drm) but can instead open any available render node and start using it. Access-control to render-nodes is done via standard file-system modes. It’s no longer shared with mode-setting resources and thus can be provided for less-privileged applications.
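In code this is as simple as opening the node. Which renderD<num> exists is driver-dependent (numbering starts at 128 by convention), so real code would enumerate /dev/dri/ or get the path from elsewhere rather than hardcode it:

/* render-node access is controlled by plain file-system modes */
int fd = open("/dev/dri/renderD128", O_RDWR | O_CLOEXEC);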

It is important to note that render-nodes do not provide any new APIs. Instead, they just split a subset of the already available DRM-API off to a new device-node. The legacy node is not changed but kept for backwards-compatibility (and, obviously, for mode-setting).

It’s also important to know that render-nodes are not bound to a specific card. While internally it’s created by the same driver as the legacy node, user-space should never assume any connection between a render-node and a legacy/mode-setting node. Instead, if user-space requires hardware-acceleration, it should open any node and use it. For communication back to the graphics-server, dma-buf shall be used. Really! Questions like “how do I find the render-node for a given card?” don’t make any sense. Yes, driver-specific user-space can figure out whether and which render-node was created by which driver, but driver-unspecific user-space should never do that! Depending on your use-cases, either open any render-node you want (maybe allow an environment-variable to select it) or let the graphics-server do that for you and pass the FD via your graphics-API (X11, wayland, …).

So with render-nodes, kernel drivers can now provide an interface only for off-screen rendering and GPGPU work. Devices without any display-controller can avoid any mode-setting nodes and just provide a render-node. User-space, on the other hand, can finally use GPUs without requiring any privileged graphics-server running. They’re independent of the kernel-internal DRM-Master concept!

2) Mode-setting nodes

While splitting off render-nodes from the legacy node simplifies the situation for most applications, we didn’t simplify it for mode-setting applications. Currently, if a graphics-server wants to program a display-controller, it needs to be DRM-Master for the given card. It can acquire it via drmSetMaster() and drop it via drmDropMaster(). But only one application can be DRM-Master at a time. Moreover, only applications with CAP_SYS_ADMIN privileges can acquire DRM-Master. This prevents some quite fancy features:

  • Running an XServer without root-privileges
  • Using two different XServers to control two independent monitors/connectors of the same card

The initial idea (and Ilija Hadzic’s follow-up) to support this were mode-setting nodes. A privileged ioctl on the control-node would allow applications to split mode-setting resources across different device-nodes. You could have /dev/dri/modesetD1 and /dev/dri/modesetD2 to split your KMS CRTC and Connector resources. An XServer could use one of these nodes to program the now reduced set of resources. We would have one DRM-Master per node and we’d be fine. We could remove the CAP_SYS_ADMIN restriction and instead rely on file-system access-modes to control access to KMS resources.

Another discussed idea, meant to avoid creating a bunch of file-system nodes, is to allocate these resources on-the-fly. All mode-setting resources would then be bound to a DRM-Master object. An application can only access the resources available on the DRM-Master that it is assigned to. Initially, all resources are bound to the default DRM-Master as usual, which everyone gets assigned to when opening a legacy node. A new ioctl DRM_CLONE_MASTER is used to create a new DRM-Master with the same resources as the previous DRM-Master of an application. Via DRM_DROP_MASTER_RESOURCE an application can drop KMS resources from its DRM-Master object. Due to their design, neither requires a CAP_SYS_ADMIN restriction as they only clone or drop privileges, they never acquire new privs! So they can be used by any application with access to the control node to create two new DRM-Master resources and pass them to two independent XServers. These use the passed FD to access the card, instead of opening the legacy or mode-setting nodes.

From the kernel side, the only thing that changes is that we can have multiple active DRM-Master objects. In fact, per DRM-Master, one open-file may be granted KMS access. However, this doesn’t require any driver modifications (drivers were mostly “master-agnostic”, anyway) and only a few core DRM changes (except for the vmwgfx ttm-lock…).

3) DRM infrastructure

The previous two chapters focused on user-space APIs, but we also want the kernel-internal infrastructure to account for split hardware. However, the fact is we already have everything we need. If some hardware exists without a display-controller, you simply omit the DRIVER_MODESET flag and only set DRIVER_RENDER. DRM core will then create only a render-node for this device. If your hardware only provides a display-controller, but no real rendering hardware, you simply set DRIVER_MODESET but omit DRIVER_RENDER (which is what SimpleDRM is doing).
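As a sketch, a driver for render-only hardware would announce itself roughly like this (field name as in the DRM driver interface of that time; everything else omitted):

static struct drm_driver my_driver = {
        /* GPU without display-controller: only a render-node is created */
        .driver_features = DRIVER_GEM | DRIVER_RENDER,
        /* a pure display-controller would instead set DRIVER_MODESET and
         * omit DRIVER_RENDER (which is what SimpleDRM does) */
};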

Yes, you currently get a bunch of unused DRM code compiled-in if you don’t use some features. However, this is not because DRM requires it, but only because no-one sent any patches for it, yet! DRM-core is driven by DRM-driver developers!

There is a reason why mid-layers are frowned upon in DRM land. There is no group of core DRM developers, but rather a bunch of driver-authors who write fancy driver-extensions. And once multiple drivers use them, they factor it out and move it to DRM core. So don’t complain about missing DRM features, but rather extend your drivers. If it’s a nice feature, you can count on it being incorporated into DRM-core at some point. It might be you doing most of the work, though!

Sane Session-Switching

In a previous article I talked about the history of VT switching. Created as a simple way to switch between text-mode sessions it has grown into a fragile API to protect one testosterone monster (also called XServer) from another. XServers used to poke in PCI bars, modified MMIO registers and messed around with DMA controllers. If they hauled out the big guns, it was almost absurd to believe a simple signal-flinging VT could ever successfully negotiate. Fortunately, today’s XServer is a repentant sinner. With common desktop hardware, all direct I/O is done in the kernel (thanks KMS!) and chances of screwing up your GPUs are rather minimal. This allows us to finally implement proper device-handover during session-switches.

If we look at sessions as a whole, the XServer isn’t special at all. A lot of session-daemons may run today that provide some service to the session as a whole. This includes pulseaudio, dbus, systemd --user, colord, polkit, ssh-keychain, and a lot more. All these daemons don’t need any special synchronization during session-switches. So why does the XServer require it?

Any graphics-server like the XServer is responsible for providing access to input and graphics devices to a session. When a session is activated, they need to re-initialize the devices. Before a session is deactivated, they need to clean up the devices so the to-be-activated session can access them. The reason they need to do this is missing infrastructure to revoke their access. If a session did not clean up graphics devices, the kernel would prevent any new session from accessing the graphics device. For input devices it is even worse: if a session doesn’t close the devices during deactivation, it will continue reading input events while the new session is active. So while typing in your password, the background session might send these key-strokes to your IRC client (which is exactly what XMir did). What we need is a kernel feature to forcibly revoke access to a graphics or input device. Unfortunately, it is not as easy as it sounds. We need to find someone who is privileged and trusted enough to do this. You don’t want your background session to revoke your foreground session’s graphics access, do you? This is where systemd-logind enters the stage.

systemd-logind is already managing sessions on a system. It keeps track of which session is active and sets ACLs in /dev to give the foreground session access to device nodes. To implement device-handover, we extend the existing logind-API by a new function: RequestDevice(deviceNode). A graphics-server can pass a file-system path for a device-node in /dev to systemd-logind, which checks permissions, opens the node and returns a file-descriptor to the caller. But systemd-logind retains a copy of the file-descriptor. This allows logind to disable it as long as the session is inactive. During a session-switch, logind can now disable all devices of the old session, re-enable the devices of the new session and notify both of the session-switch. We now have a clean handover from one session to the other. With this technology in place, we can start looking at real scenarios.

1) Session-management with VTs

[Figure: Session management with VTs and logind — VTs for foreground control, logind for device management]

Based on the graphs for VT-switching, I drew a new one considering logind. VTs are still used to switch between sessions, but sessions no longer open hardware devices directly. Instead, they ask logind as described above. The big advantage is that VT-switches are no longer fragile. If a VT is active, it can be sure that it has exclusive hardware-access. And if a session is dead-locked, we can force a VT-switch and revoke its device-access. This allows recovering from situations where your XServer hangs, without SSH’ing in from a remote machine or using SysRq.

2) Session-management without VTs

[Figure: Pure logind session management — session management based solely on logind]

While sane VT-switching is a nice feature, the biggest win is that we can implement proper multi-session support for seats without VTs. While previously only a single session could run on such seats, with logind device-management, we can now support session-switching on any seat.

Instead of using VTs to notify sessions when they are activated or deactivated, we use the logind-dbus-API. A graphics-server can now request input and graphics devices via the logind RequestDevice API and use it while active. Once a session-switch occurs, logind will disable the device file-descriptors and switch sessions. A dbus signal is sent asynchronously to the old and new session. The old session can stop rendering while inactive to save power.

3) Asynchronous events and backwards-compatibility

One thing changes almost unnoticed when using RequestDevice. An active graphics-server might be just about to display an image on screen when a session-switch occurs. logind revokes access to graphics devices and sends an asynchronous event that the session is now inactive. However, the graphics-server might not have received this event yet. Instead, it tries to invoke a system-call to update the screen. But this will fail with EACCES or EPERM as it no longer has access. Currently, for most graphics-servers this is a fatal error. Instead of handling EACCES and interpreting it as “this device is now paused”, they don’t care for the error code and abort. We could fix all the graphics-servers, but to simplify the transition, we introduced negotiated session-switches.

Whenever logind is asked to perform a session-switch, it first sends PauseDevice signals for every open device to the foreground graphics-server. The server must respond with a PauseDeviceComplete call to logind for each device. Once all devices are paused, the session-switch is performed. If the foreground session does not respond in a timely manner, logind will forcibly revoke device access and then perform the session-switch anyway.

Note that negotiated session-switches are only meant for compatibility. Any graphics-server is highly encouraged to handle EACCES just fine!
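As an illustration of what handling EACCES means in practice, here is a sketch using libdrm’s drmModePageFlip(); session_pause() stands in for whatever pause-logic the graphics-server implements:

int ret = drmModePageFlip(fd, crtc_id, fb_id,
                          DRM_MODE_PAGE_FLIP_EVENT, user_data);
if (ret == -EACCES || ret == -EPERM) {
        /* logind revoked our device access: we were deactivated; stop
         * rendering and wait for the resume notification */
        session_pause();
} else if (ret < 0) {
        /* a real error */
}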

All my local tests ran fine so far, but all this is still under development. systemd patches can be found at github (frequently rebased!). Most tests I do rely on an experimental novt library, also available at github (I will push it during next week; this is only for testing!). Feedback is welcome! The RFC can be found on systemd-devel. Now I need a day off…

Happy Switching!

How VT-switching works

Having multiple sessions on your system in parallel is a quite handy feature. It allows things like Fast User Switching or running two different DEs at the same time. Especially graphics developers like it, because they can test-run their experimental XServer/weston on the same machine they develop on.

To understand how it works, we need the concept of a session. See my previous introduction into session-management if you’re not familiar with it. I expect the reader to be familiar with basic session-management concepts (i.e., seats, systemd-logind, login-sessions, user-sessions, processes and daemons in a session).

1) Traditional text-mode VT-switching

Virtual terminals were introduced with linux-0.12. It’s the origin of multi-session support on linux. Before this, linux only supported a single TTY session (which was even available in the first tarball of linux-0.01). With virtual terminals, we have /dev/tty<num> devices, where <num> is between 1 and 63. They are always bound to seat0 and every session on seat0 is bound to a single VT. That means, there are at most 63 live sessions on seat0. Only one of them is active at a time (which can be read from /sys/class/tty/tty0/active).
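A minimal sketch of reading that file:

char active[16];
int fd = open("/sys/class/tty/tty0/active", O_RDONLY | O_CLOEXEC);
ssize_t n = read(fd, active, sizeof(active) - 1);
active[n > 0 ? n : 0] = 0;    /* now contains e.g. "tty2\n" */
close(fd);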

[Figure: Session management in text-mode — session management and device access with VTs]

The kernel listens for keyboard events and switches between VTs on ctrl+alt+Fx shortcuts. Alternatively, you can issue a VT_ACTIVATE ioctl to politely ask the kernel to switch sessions.
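Programmatically, with an open file-descriptor to a VT (e.g., /dev/tty0), that looks like this:

#include <linux/vt.h>
#include <sys/ioctl.h>

ioctl(fd, VT_ACTIVATE, 2);      /* ask the kernel to switch to VT 2 */
ioctl(fd, VT_WAITACTIVE, 2);    /* block until VT 2 is actually active */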

In the old days, all sessions ran in text-mode. In text-mode, a process can get keyboard input by reading from a VT and can write onto the screen by writing to a VT. Some rather ugly control-sequences are supported to allow colors or other advanced features. The kernel interprets these and instructs the graphics hardware to print the given text.

Important to note is that, in text-mode, sessions don’t have direct hardware access. The kernel merges all keyboard events into one stream and a session cannot tell which device they came from. In fact, it cannot even tell how many devices there are. Same for graphics devices. As a VT has only a single output stream, there is only a single screen to write to. Multi-head support is not available. Neither are any advanced graphics-operations. But this allows the kernel to serialize access to hardware devices. Only the active VT gets input events and only the buffer of the active VT is displayed on the screen. No resource-conflicts can occur.

While there are ways to detect when a session is activated/deactivated, in text-mode a session normally doesn’t care. It just stops receiving keyboard input. If the session writes to the VT while deactivated, it will affect the internal buffer of the VT, but not the screen. Only the buffer of the active VT is shown on screen.

2) Session-switching in graphics-mode

[Figure: Session management in gfx-mode — system session management and device access with VTs in graphics mode]

Very soon it became clear that text-mode is not enough. We wanted more! That’s when the VT graphics mode was introduced. Graphics mode doesn’t change the setup, we still have 63 VTs and each session is bound to a VT. But a VT can now be switched from text-mode KD_TEXT into graphics-mode KD_GRAPHICS (via KDSETMODE ioctl). This doesn’t do anything spectacular. Really! The only effect is it disables the kernel-internal graphics routines. As long as a VT is in graphics-mode, the kernel will not instruct the graphics hardware to display the VT on screen. Once it is reset to text-mode and the VT is active, the kernel will display it again.

So the graphics-mode itself is useless. But at the same time, the kernel started providing separate interfaces to input and graphics devices. So while a VT is in graphics-mode, a session can access the graphics devices directly and render whatever it wants. It can also ignore input from the VT and instead read input-events directly from the input-event interfaces. This is how the XServer works today.

But it is pretty obvious that this becomes problematic during session switches. If the kernel switches away from a graphics VT, the session needs to release the graphics hardware before the new session can be activated. Otherwise, the new session is active, but you still get the images from the old one. Or worse, it might flicker between the images of both sessions. Without kernel-mode-setting it might even hang your graphics hardware.

(A similar interface to KDSETMODE exists for keyboards with KDSKBMODE.)

Unfortunately, the kernel has no interface to forcibly revoke graphics or input access. So VTs were extended by the VT_SETMODE ioctl. A process can issue VT_SETMODE on a VT and pass two signal-numbers (usually SIGUSR1 and SIGUSR2). If the kernel wants to perform a VT-switch, it sends one such signal to the active VT. This VT can cleanup resources, stop using graphics/input devices and acknowledge the VT-switch via the VT_RELDISP ioctl. If the process dead-locked and doesn’t acknowledge the request, the VT-switch will not happen! This is why a crashed XServer can hang your system. But if it correctly issues the VT_RELDISP ioctl, the kernel will perform the VT-switch. Once the kernel switches back to the given VT it sends the second signal as notification that it is now active again.
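In code, the handshake looks roughly like this (a sketch; the signal handlers are reduced to the bare ioctl calls):

#include <linux/vt.h>
#include <signal.h>
#include <sys/ioctl.h>

struct vt_mode mode = {
        .mode = VT_PROCESS,    /* we negotiate switches ourselves */
        .relsig = SIGUSR1,     /* sent when the kernel wants to switch away */
        .acqsig = SIGUSR2,     /* sent when our VT becomes active again */
};
ioctl(fd, VT_SETMODE, &mode);

/* in the SIGUSR1 handler: release devices, then acknowledge the switch */
ioctl(fd, VT_RELDISP, 1);

/* in the SIGUSR2 handler: acknowledge the acquisition, re-open devices */
ioctl(fd, VT_RELDISP, VT_ACKACQ);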

However, on every VT (precisely, in every session) only a single process can call VT_SETMODE. This already shows that the concept is flawed. For example, if the XServer takes possession of the VT for graphics and input devices, another audio-server in the session couldn’t do the same for audio hardware. This applies to all other devices. An alternative involving logind is discussed in a followup article.

3) Seats without VTs

We discussed that VTs are always bound to seat0. So if you run a session on a seat other than seat0 or if you disabled VTs entirely, then the situation becomes pretty simple: no multi-session support is available. This is enforced by systemd-logind so the active session can run without interruptions.

Also important to note is that there is no text-mode. A session must run in graphics mode as the kernel facility to interpret text-commands is bound to VTs.

[Figure: Session management without VTs — single-session setup on seats without VTs]

4) logind integration

As we discussed in the previous article, today systemd-logind tracks and manages sessions. But how does this integrate with VTs? VTs pre-date systemd by years, so to preserve backwards-compatibility, we need to keep the infrastructure as it is. That’s why logind watches /sys/class/tty/tty0/active for changes. Once a VT switch happens, logind notices it and marks the old session as inactive and the new as active. It also adjusts ACLs in /dev to keep access-restrictions in sync with the active session. However, this cannot be done properly without a race-condition. Therefore, logind provides a new dbus-API to replace the ageing VT API. New session-daemons are advised to use it in favor of VTs.

Thoughts on Linux System Compositors

My work towards deprecating CONFIG_VT recently got stuck due to several kernel changes that are needed to proceed. DRM Render-nodes and Modeset-nodes are on their way and so I thought I’d provide a short update on what is being planned regarding session management in user-space.

This is no technical discussion or comparison between VTs and other session managers. Instead, I try to describe how we intend to proceed and why we need a new kind of user-space session manager. For technical details, feel free to browse through other articles in this blog or join the discussions on systemd-devel, wayland-devel and dri-devel MLs.

Virtual Terminals

While legacy systems with CONFIG_VT will always be supported, the current trend is towards a system without VTs. VTs are controversially discussed, but the major flaw of VTs is that they’re only available on the default seat. So if you’re a big fan of VTs, consider this whole article being about seats other than seat0, that is, seats where no VTs are available! If you’re not such a big fan of VTs, join my dream of a world without VTs (and without fbcon, hurray!).

Systemd

With Ubuntu (and upstart) considering emulating systemd-logind, it is becoming clear that the systemd-logind API is turning into the de-facto standard for seat and session management on desktop linux systems. I don’t care what people think of systemd; for the purpose of this article I will call the session manager logind, feel free to replace it by your favorite seat/session manager.

Non-desktop systems are beyond the scope of this document. However, as long as you have a monitor+keyboard connected to your system, your setup is targeted by this article.

#1: The Problem

Most of the time, on a common linux desktop system, we interact with our favorite Desktop Environment; let’s choose Gnome for today. Gnome manages the different screens, applications and tasks that we want to run. We can switch between them, stop them or start them. This is all managed inside a single Gnome-User-Session. The need for multiple sessions (and thus, multiple VTs) is barely noticeable to us. And that fact is very important! Whatever we do to extend session management, the single user session should still be the central part of the design. We must not make it, in any way, slower, less stable, or harder to implement. Instead, we design our system around it.

So let’s broaden our perspective and look at all the systems running in parallel to the user session: There is the boot-splash, linux-console, emergency-console, sleeping user-sessions, lock-screens, ssh-sessions, custom parallel Wayland/X11 sessions and probably a lot of crazy user-setups I am not even aware of. It is hard to work with such a diverse and moving target, however, they always have something in common: Only one desktop session is active at a time!

But which one?

#2: logind to the rescue

[Figure: Layout and drawbacks of the VT session manager]

It is pretty obvious that we need an IPC mechanism to notify the different sessions about who is active and when to switch between them. Historically, VTs provided a decentralized way by letting each session open a different VT and making the active VT responsible for session-management. To be fair, there is support for automatic session-switching, but most sessions take control over input devices and thus need to schedule session-switches themselves. The major flaw is that if the active session deadlocks, your system becomes unresponsive. While the VT system has several other drawbacks, discussion is beyond the scope of this document as we always have the situation where we don’t have VTs (think: seats other than seat0).

[Figure: Advanced access control with FD passing via systemd-logind]

Without VTs, we are currently left with only a single session per seat. Pretty boring! And that’s probably also what the desktop group at RedHat thought. A broadly discussed idea from several RedHat developers is to extend logind to arbitrate between the different sessions (Note: this article is about my perception, but the idea of using logind was originally not from me). logind already knows about all sessions on a system, knows about all seats and is the perfect place to implement session management. So instead of opening a VT, sessions now use dbus to tell logind that they provide a new user session. The API hasn’t been figured out, yet, but the big advantage over VTs is that we can pass file-descriptors directly to each session. logind can mute input-devices (patches pending on linux-input ML) for inactive sessions and can revoke DRM access. It guarantees that the active session has exclusive hardware access even if a session dead-locks.

Left for discussion is the design of the logind dbus API. It will include a set of calls to register, activate, deactivate and unregister sessions. Due to the fact that we have control over hardware resources, we can also implement forced session switching. Moreover, we could make logind listen to magic sysrq keys which deactivates the current session and instead activates the emergency console.

#3: Limitations

logind is no system compositor. logind will never be a system compositor. In fact, if logind becomes the central session manager, the concept of a system compositor will be dead and buried. The problem is, a session manager requires exclusive control over all sessions and hardware. The same is true for a system-compositor. Both concepts don’t play nice together. But what are the benefits of a system-compositor over logind?

While anything a system-compositor can do could theoretically be implemented in logind, we assume that logind is supposed to stay small and efficient. So logind will never provide any DRM display handling. All sessions must implement it themselves. The same holds true for input handling. logind can only provide FD forwarding, but the heavy lifting must be done directly in each session program. In every small session program! The boot-splash must include complex DRM and XKB code, the screen-lock, too. Even the emergency console has to do all the low-level hardware handling. This introduces complexity and, very likely, bugs. But on the other hand, it guarantees efficiency. An XServer can run with direct hardware access and no extra context-switch to logind is needed on every frame.

But DRM handling could be simplified via a shared library for every session that doesn’t need custom 3D acceleration. So what else do we abandon for the sake of simplicity?

Probably the most hurtful limitation is that global keyboard management is impossible. We cannot make logind listen for keyboard events (other than real emergency shortcuts) at the same time as a session is active. Otherwise, we might end up with logind and the session reacting to a single shortcut, which is unacceptable in any non-emergency situation. So session-switching will be left to the active session itself. But if the active session dead-locks, we have the same problem as with VTs: an unresponsive system.

We can implement emergency-shortcuts which violate this rule, but it will never be a clean solution. Moreover, we will never be able to have global Volume-Up and Volume-Down shortcut handling, no Brightness-keys, no global NumLock handling. Global keyboard handling will be reliant on policies, if implemented at all, and otherwise left to each session to implement.

#4: Quirks

[Figure: Layout of different quirks working with the logind session manager]

Upgrading logind to a session-manager has advantages on its own. Most of the code is already available, and no external manager that would have to synchronize with logind is needed. We have direct hardware access in each session and, as mentioned earlier, sessions are self-hosting. A session itself doesn’t need any major modifications to run under this scenario. A session controls itself over the whole runtime and, more importantly, doesn’t depend on another daemon to function properly. Yes, session-management depends on logind, but the session itself, once activated, runs standalone.

To overcome the issues mentioned above, I developed a set of quirks that sit between logind and a session:

UVTD

uvtd (User-Space Virtual-Terminal Daemon) is an implementation of VTs in user-space. It is available as an experimental branch in the official kmscon repository. Even if you disable kernel VTs, uvtd can provide legacy VT nodes for backwards compatibility. But in contrast to kernel VTs, uvtd controls the “server side” of each VT and provides a bridge between a fake VT and logind. This allows running legacy VT applications even if no VTs are available. The current code already allows running multiple XServers in parallel without any real VT and without modifying a single line of XServer code.

wlsystemc

wlsystemc (Wayland System Compositor) is another research project I was working on. It is a Wayland compositor that displays each client as a fullscreen surface. Due to the fast-moving Wayland codebase it got slightly out of date during the last year, but it was once working with basic SHM buffers. Anyhow, together with logind, wlsystemc can be redesigned to take over modesetting for basic sessions. For instance, a boot-splash would create a surface in wlsystemc, which itself registers a session in logind for this surface. Once the session gets activated, wlsystemc performs all the heavy DRM operations and the boot-splash does nothing more than simple, generic Wayland rendering (i.e., passing a custom buffer).

Kernel Log

“No VTs, no kernel log!” But developers especially want a kernel oops or panic printed on screen whenever one occurs. The solution is fblog or drmlog, a kernel module that does nothing but print the kernel log to all connected monitors. logind can register it like any other session and enable or disable it via sysfs. During an oops/panic, the kernel will activate them automatically and notify logind asynchronously via sysfs. Both modules are still under development, but at least fblog is already working.

Linux Console

For completeness: kmscon can easily replace the linux-console in this setup. It will register a text-console as a normal session just like any other session does.

#5: Conclusion

While I was an enthusiastic advocate of the system-compositor idea, it became pretty clear that the performance penalty of input-event forwarding and the huge kernel API extensions it requires make it very unlikely to get accepted in the community. Furthermore, apart from global keyboard handling, most technical concerns about the logind idea can be solved with quirks like wlsystemc and uvtd.

This post didn’t discuss or highlight the low-level technical concerns about kernel VTs, but I hope it explains why we get a lot more flexibility and possibilities by moving session handling into user-space. logind can be taught boot-splash awareness, emergency handling or forced session-switching, all of which would be very unlikely to ever get implemented in kernel VTs.

But independent of how we implement it, moving VT handling to user-space and disabling CONFIG_VT still has one major advantage: it simplifies kernel programming a lot. And I know nearly every DRM developer would be very happy to see this happen!

DRM Render- and Modeset-Nodes

Another year, another Google Summer of Code. This time I got the chance to work on something that had been on my TODO list for quite a long time: DRM Render- and Modeset-Nodes.

As part of the X.org Foundation mentoring organization, I will try to pick up the work from Ilija Hadzic, Dave Airlie, Kristian Hoegsberg, Martin Peres and others. The idea is to extend the DRM user-space API of the linux kernel to split the modeset and rendering interfaces apart. The main usage is to allow different access-modes for graphics-compositors (which require the modeset API) and client-side rendering or GPGPU-users (which both require the rendering API). We currently use the DRM-Master interface to restrict the modeset API to privileged applications. However, this requires CAP_SYS_ADMIN privileges, which is roughly equivalent to root-privileges. With two different interfaces for the modeset and rendering APIs, we can apply a different set of filesystem access-modes to each of them and thus get fine-grained access-control.
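For reference, this is roughly the authentication dance the current single-node model forces on every render client; it only works if a DRM-Master, i.e. a running compositor, is around to approve the client. A sketch using the existing libdrm API (link with -ldrm):

#include <fcntl.h>
#include <unistd.h>
#include <xf86drm.h>

int main(void)
{
        drm_magic_t magic;
        int fd;

        fd = open("/dev/dri/card0", O_RDWR | O_CLOEXEC);
        if (fd < 0)
                return 1;

        /* client side: fetch a magic token for this open file */
        if (drmGetMagic(fd, &magic) < 0)
                return 1;

        /* ...send the token to the DRM-Master via some IPC and wait
         * for it to call drmAuthMagic(master_fd, magic); only then
         * can this client use the rendering ioctls. Without a running
         * master, the client is stuck here forever... */

        close(fd);
        return 0;
}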

Apart from fine-grained access control, we also get some other nice features almost for free:

  • We will be able to run GPGPU clients without any running compositor, or even without any display controller (see the sketch after this list)
  • We can split modeset objects across multiple nodes to allow multi-seat setups with a single display controller
  • Efficient compositor-stacking by granting page-flip access or full modeset access temporarily to sub-compositors
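To make the first point concrete: with a separate render node, an offline GPGPU client would simply open that node directly, subject only to filesystem permissions; no master, no authentication. The node name below is illustrative only:

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
        int fd;

        /* illustrative node name; render nodes would be separate
         * device nodes next to the classic card0, with their own
         * filesystem access-modes */
        fd = open("/dev/dri/renderD128", O_RDWR | O_CLOEXEC);
        if (fd < 0)
                return 1;

        /* rendering/GPGPU ioctls work here; modeset ioctls do not,
         * so no compositor or display controller is needed */

        close(fd);
        return 0;
}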

There are actually a lot of other ideas for how to extend this. So I decided to concentrate on the modeset-node / render-node split first. Once that is done (and fully working), I will pick different ideas that depend on it and try to implement them. Considering the amount of work others have already put into this, I think that if I get the split merged into mainline, the project will already be a great success. Everything on top of that will be a bonus that will probably take more time to get merged. But lets see, maybe it turns out to be easier than I think and we end up with some of the use-cases merged upstream, too.

Thanks to the X.org Foundation, Google and my GSoC-mentor Dave Airlie for giving me the chance to work on the DRM API! I hope it will be a productive summer.

GSoC-Proposal

If someone is interested in more details, some excerpts from my original GSoC proposal:

Project Description:
For several years now, the xserver has not been the only user-space project that
makes use of the kernel DRM API. The introduction of KMS allowed many new projects
to emerge, including plymouth, weston and kmscon. On the other side, OpenCL support
allows applications to make use of DRM without requiring any KMS APIs. Even though
both use-cases work with the current APIs, there are a lot of restrictions that
need to be worked around.

The most problematic concept is DRM-Master. KMS applications are required to be
DRM-Master to perform modesetting, but DRM-Master is tightly coupled to
CAP_SYS_ADMIN/root. On the other side, render clients are required to be assigned
to a DRM-Master so they can get authenticated. This prevents off-screen/offline
rendering without a running compositor.

One possible solution is to split render- and modeset-nodes apart. The DRM control
node can be used as the management node (which is, as far as I understand, what
it was designed for, anyway). A separate static render-node is created for each
DRM device which is restricted to ioctls specific to rendering operations. Instead
of requiring drmAuth() for authorization, we can now use filesystem access-modes.
This allows slightly more dynamic access-control, but the biggest advantage is
that we can do off-screen/offline rendering without a running
compositor/DRM-Master.

On the other side, a modeset-node is a concept to have KMS separated from DRM.
The use-case is to split modeset objects (e.g., CRTCs, encoders, connectors) across
different modesetting applications. This allows one compositor to use one
CRTC+connector combination, while another compositor (maybe on another seat) can
use another CRTC+external-connector. This doesn't have to be a static setup. On
the contrary, one use-case I am very interested in is a dynamic modeset-object
assignment to temporary clients. This way, a fullscreen application can be granted
page-flip rights from the compositor to avoid context-switches to the compositor
for doing trivial page-flips only.

Deliverables:
* Working render-node clients: Preferably an offline OpenCL example and
  a wayland EGL client
* Merged kernel render-node implementation with at least i915 support
* Dynamic kernel modeset-nodes
* "Zero-context-switches" wayland/weston fullscreen client (optional)

Known Problems:
There were several attempts to push render-nodes into the kernel, but all failed
due to missing motivation to finish the user-space clients. Writing up new fancy
APIs is one part, but pushing API changes to such big projects requires the whole
environment to work well with the changes. That's why I want to concentrate on
the user-space side of render-nodes. And I want to finish the render-nodes project
before continuing with modeset-nodes. The idea has been around long enough that
it's time that we get it done.

However, one problem is that I never worked with the low-level X11 stack. The
wayland environment is great for experiments and quite active. I am very familiar
with it and know how to get examples easily running. The xserver, however, is a
huge black box to me. I know the concepts and understand the input and graphics
drivers design. But I never read xserver core code. That's something I'd like to
change during this project. I will probably be limited to DRI and graphics
drivers, but that's a good start.

Another idea that came up quite often is something like gem-fs. It's far beyond
the scope of this project, but it's something I'd like to keep in mind when
designing the API. It's hard to account for something that's only an idea, but
the concepts seem related so I will try to understand the reasons behind gem-fs
and avoid orthogonal implementations.

A few smaller implementation-specific problems are already known, including the
mmap-security problem, static "possible_encoders"/"possible_crtcs" bitsets and
missing MMUs on GPUs. However, there already have been ideas how to solve them
so I don't consider them blockers for render-nodes.