
memfd_create(2)

For 4 months now we’ve been hacking on a new syscall for the linux-kernel, called memfd_create. The intention is to provide an easy way to get a file-descriptor for anonymous memory, without requiring a local tmpfs mount-point. The syscall takes 2 parameters, a name and a bunch of flags (which I will not discuss here):

int memfd_create(const char *name, unsigned int flags);

If successful, a new file-descriptor pointing to a freshly allocated memory-backed file is returned. That file is a regular file in a kernel-internal filesystem. Therefore, most filesystem operations are supported, including:

  • ftruncate(2) to change the file size
  • read(2), write(2) and all its derivatives to inspect/modify file contents
  • mmap(2) to get a direct memory-mapping
  • dup(2) to duplicate file-descriptors
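
As a minimal sketch of the operations listed above (assuming a glibc new enough to ship a memfd_create wrapper; the helper name memfd_from_text is made up for this example), the following creates a memory-backed file, sizes it with ftruncate(2) and fills it through a shared mapping:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: create an anonymous memory file and copy `text`
 * into it via mmap(2). Returns the new file-descriptor, or -1 on error. */
int memfd_from_text(const char *name, const char *text)
{
    size_t len = strlen(text);
    assert(len > 0);            /* mmap(2) rejects zero-length mappings */

    int fd = memfd_create(name, MFD_CLOEXEC);
    if (fd < 0)
        return -1;

    /* A fresh memfd is empty, so give it a size first. */
    if (ftruncate(fd, (off_t)len) < 0)
        goto fail;

    char *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        goto fail;
    memcpy(map, text, len);
    munmap(map, len);
    return fd;

fail:
    close(fd);
    return -1;
}
```

The returned descriptor behaves like any other file: it can be read back with read(2) or pread(2), duplicated, or passed to another process.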

Theoretically, you could achieve similar behavior without introducing new syscalls, like this:

int fd = open("/tmp/random_file_name", O_RDWR | O_CREAT | O_EXCL, S_IRWXU);
unlink("/tmp/random_file_name");

or this

int fd = shm_open("/random_file_name", O_RDWR | O_CREAT | O_EXCL, S_IRWXU);
shm_unlink("/random_file_name");

or this

int fd = open("/tmp", O_RDWR | O_TMPFILE | O_EXCL, S_IRWXU);

Therefore, the most important question is why the hell do we need yet another way?

Two crucial differences are:

  • memfd_create does not require a local mount-point. It can create objects that are not associated with any filesystem and can never be linked into a filesystem. The backing memory is anonymous memory as if malloc(3) had returned a file-descriptor instead of a pointer. Note that even shm_open(3) requires /dev/shm to be a tmpfs-mount. Furthermore, the backing-memory is accounted to the process that owns the file and is not subject to mount-quotas.
  • There are no name-clashes and no global registry. You can create multiple files with the same name and they will all be separate, independent files. Therefore, the name is purely for debugging purposes so it can be detected in task-dumps or the like.
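
The second point can be checked with a small sketch. The function name same_name_is_distinct is an assumption of this example, and comparing inode numbers is just one way to tell the two files apart:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if two memfds created with the same name
 * are separate files, 0 if they are not, and -1 on error. */
int same_name_is_distinct(const char *name)
{
    assert(name != NULL);

    int result = -1;
    struct stat sa, sb;
    int a = memfd_create(name, MFD_CLOEXEC);
    int b = memfd_create(name, MFD_CLOEXEC);

    if (a < 0 || b < 0)
        goto out;
    /* Grow only the first file; the second one must stay empty. */
    if (ftruncate(a, 4096) < 0)
        goto out;
    if (fstat(a, &sa) < 0 || fstat(b, &sb) < 0)
        goto out;
    result = sa.st_ino != sb.st_ino && sa.st_size == 4096 && sb.st_size == 0;
out:
    if (a >= 0)
        close(a);
    if (b >= 0)
        close(b);
    return result;
}
```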

To be honest, the code required for memfd_create is 100 lines. It didn’t take us 2 months to write these, but instead we added one more feature to memfd_create called Sealing:

File-Sealing

File-Sealing is used to prevent a specific set of operations on a file. For example, after you wrote data into a file you can seal it against further writes. Any attempt to write to the file will fail with EPERM. Reading will still be possible, though. The crux of this matter is that seals can never be removed, only added. This guarantees that if a specific seal is set, the information that is protected by that seal is immutable until the object is destroyed.

To retrieve the current set of seals on a file, you use fcntl(2):

int seals = fcntl(fd, F_GET_SEALS);

This returns a signed 32bit integer containing the bitmask of currently set seals on fd. Note that seals are per file, not per file-descriptor (nor per file-description). That means, any file-descriptor for the same underlying inode will share the same seals.

To seal a file, you use fcntl(2) again:

int error = fcntl(fd, F_ADD_SEALS, new_seals);

This takes a bitmask of seals in new_seals and adds these to the current set of seals on fd.
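
Putting both fcntl(2) commands together, a rough sketch could look like this. The helper names add_seals and seals_are_shared are invented for illustration, and the second function only demonstrates the per-file semantics using dup(2):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: add `new_seals` to the file behind `fd` and return
 * the resulting seal set, or -1 on error (errno is set by fcntl(2)). */
int add_seals(int fd, int new_seals)
{
    if (fcntl(fd, F_ADD_SEALS, new_seals) < 0)
        return -1;
    return fcntl(fd, F_GET_SEALS);
}

/* Seals belong to the file, so a dup(2)'ed descriptor must report them too.
 * Returns 1 if that holds, 0 if it does not, -1 on error. */
int seals_are_shared(void)
{
    int result = -1;
    int fd = memfd_create("seal-demo", MFD_CLOEXEC | MFD_ALLOW_SEALING);
    if (fd < 0)
        return -1;

    int copy = dup(fd);
    if (copy >= 0 && add_seals(fd, F_SEAL_SHRINK) >= 0) {
        int seen = fcntl(copy, F_GET_SEALS);
        assert(seen >= 0);      /* the copy refers to the same sealable file */
        result = (seen & F_SEAL_SHRINK) != 0;
    }
    if (copy >= 0)
        close(copy);
    close(fd);
    return result;
}
```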

The current set of supported seals is:

  • F_SEAL_SEAL: This seal prevents the seal-operation itself. So once F_SEAL_SEAL is set, any attempt to add new seals via F_ADD_SEALS will fail. Files that don’t support sealing are initially sealed with just this flag. Hence, no other seals can ever be set and thus do not have to be enforced.
  • F_SEAL_WRITE: This is the most straightforward seal. It prevents any content modifications once it is set. Any write(2) call will fail and you cannot get any shared, writable mappings for the file, anymore. Unlike the other seals, you can only set this seal if no shared, writable mappings exist at the time of sealing.
  • F_SEAL_SHRINK: Once set, the file cannot be reduced in size. This means, O_TRUNC, ftruncate(), fallocate(FALLOC_FL_PUNCH_HOLE) and friends will be rejected in case they would shrink the file.
  • F_SEAL_GROW: Once set, the file size cannot be increased. Any write(2) beyond file-boundaries, any ftruncate(2) that increases the file size, and any similar operation that grows the file will be rejected.
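
To see the seals in action, here is a small self-check, written as a sketch rather than a test-suite. The function name is made up, and it assumes the EPERM behavior described above:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical self-check: seal a small memfd against writes and shrinking,
 * then try both. Returns 0 if both were rejected with EPERM, -1 otherwise. */
int sealed_ops_are_rejected(void)
{
    int ok = -1;
    int fd = memfd_create("sealed", MFD_CLOEXEC | MFD_ALLOW_SEALING);
    if (fd < 0)
        return -1;

    if (write(fd, "data", 4) != 4)
        goto out;
    if (fcntl(fd, F_ADD_SEALS, F_SEAL_WRITE | F_SEAL_SHRINK) < 0)
        goto out;

    /* Content modification must fail now. */
    errno = 0;
    if (pwrite(fd, "X", 1, 0) != -1 || errno != EPERM)
        goto out;

    /* So must any attempt to shrink the file. */
    errno = 0;
    if (ftruncate(fd, 0) != -1 || errno != EPERM)
        goto out;

    /* Growing is still allowed because F_SEAL_GROW is not set. */
    if (ftruncate(fd, 8192) < 0)
        goto out;

    ok = 0;
out:
    close(fd);
    return ok;
}
```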

Instead of discussing the behavior of each seal on its own, the following list shows some examples how they can be used. Note that most seals are enforced somewhere low-level in the kernel, instead of directly in the syscall handlers. Therefore, side effects of syscalls I didn’t cover here are still accounted for and the syscalls will fail if they violate any seals.

  • IPC: Imagine you want to pass data between two processes that do not trust each other. That is, there is no hierarchy at all between them and they operate on the same level. The easiest way to achieve this is a pipe, obviously. However, to allow zero-copy (assuming splice(2) is not possible) the processes might decide to use memfd_create to create a shared memory object and pass the file-descriptor to the remote process. Now zero-copy only makes sense if the receiver can parse the data in-line. However, this is not possible in zero-trust scenarios as the source can retain a file-descriptor and modify the contents while the receiver parses it, causing all kinds of failures. But if the receiver requires the object to be sealed with F_SEAL_WRITE | F_SEAL_SHRINK, it can safely mmap(2) the file and parse it inline. No attacker can alter file contents, anymore. Furthermore, this also allows safe multicasts of the message and all receivers can parse the same zero-copy file without affecting each other. Obviously, the file can never be modified again and is a one-shot object. But this is inherent to zero-trust scenarios. We did implement a recycle-operation in case you’re the last user of an object. However, that was dropped due to horrible races in the kernel. It might reappear in the future, though.
  • Graphics-Servers: This is a very specific use-case of IPC and usually there is a one-way trust relationship from clients to servers. However, a server cannot blindly trust its clients. So imagine a client renders its window-contents into memory and passes a file-descriptor to that memory region (maybe using memfd_create) to the server. Similar to the previous scenario, the server cannot safely mmap(2) that object for read-access as the client might truncate the file simultaneously, causing SIGBUS on the server. A server can protect itself via SIGBUS-handlers, but sealing is a much simpler way. By requiring F_SEAL_SHRINK, the server can be sure the file will never shrink. At the same time, the client can still grow the object in case it needs bigger buffers for growing windows. Furthermore, writing is still allowed so the object can be re-used for the next frame.
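
The IPC scenario could be sketched roughly as follows. Both helper names are invented here, and passing the descriptor between processes (for instance via SCM_RIGHTS) is left out. The receiver uses a private read-only mapping, which is a conservative choice for this sketch:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Sender side (hypothetical helper): put `data` into a memfd and seal it.
 * Returns the file-descriptor to hand over, or -1 on error. */
int make_sealed_message(const void *data, size_t len)
{
    int fd = memfd_create("message", MFD_CLOEXEC | MFD_ALLOW_SEALING);
    if (fd < 0)
        return -1;
    if (write(fd, data, len) != (ssize_t)len ||
        fcntl(fd, F_ADD_SEALS, F_SEAL_WRITE | F_SEAL_SHRINK) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Receiver side (hypothetical helper): map `fd` read-only, but only if it
 * is sealed against writes and shrinking. Returns the mapping and stores
 * its length in *len, or returns NULL if the object is not acceptable. */
const void *map_if_sealed(int fd, size_t *len)
{
    const int required = F_SEAL_WRITE | F_SEAL_SHRINK;
    struct stat st;

    assert(len != NULL);
    int seals = fcntl(fd, F_GET_SEALS);
    if (seals < 0 || (seals & required) != required)
        return NULL;            /* the sender could still change the data */
    if (fstat(fd, &st) < 0 || st.st_size == 0)
        return NULL;

    void *map = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED)
        return NULL;
    *len = (size_t)st.st_size;
    return map;
}
```

A graphics-server would do the same check with only F_SEAL_SHRINK in the required mask, since the client must remain able to write the next frame.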

As you might imagine, there are a lot more interesting use-cases. However, note that sealing is currently limited to objects created via memfd_create with the MFD_ALLOW_SEALING flag. This is a precaution to make sure we don’t break existing setups. However, changing seals of a file requires WRITE-access, thus it is rather unlikely that sealing would allow attacks that are not already possible with mandatory POSIX locks or similar. Hence, it is possible that sealing will expand to other areas in case people request it. Further seal-types are also possible.

Current Status

As of June 2014 the patches for memfd_create and sealing have been publicly available for at least 2 months and are considered for merging. linux-3.16 will probably not include it, but linux-3.17 very likely will. Currently, there are still some issues to be figured out regarding AIO and Direct-IO races. But other than that, we’re good to go.

Splitting DRM and KMS device nodes

While most devices of the 3 major x86 desktop GPU-providers have GPU and display-controllers merged on a single card, recent development (especially on ARM) shows that rendering (via GPU) and mode-setting (via display-controller) are not necessarily bound to the same device. To better support such devices, several changes are being worked on for DRM.

In its current form, the DRM subsystem provides one general-purpose device-node for each registered DRM device: /dev/dri/card<num>. An additional control-node is also created, but it remains unused as of this writing. While in general a kernel driver is allowed to register multiple DRM devices for a single physical device, no driver has made use of this yet. That means, whatever hardware you use, both mode-setting and rendering are done via the same device node. This entails some rather serious consequences:

  1. Access-management to mode-setting and rendering is done via the same file-system node
  2. Mode-setting resources of a single card cannot be split among multiple graphics-servers
  3. Sharing display-controllers between cards is rather complicated

In the following sections, I want to look closer at each of these points and describe what has been done and what is still planned to overcome these restrictions. This is a highly technical description of the changes and serves as outline for the Linux-Plumbers session on this topic. I expect the reader to be familiar with DRM internals.

1) Render-nodes

While render-nodes have been discussed since 2009 on dri-devel, several mmap-related security-issues have prevented them from being merged. Those have all been fixed and 3 days ago, the basic render-node infrastructure was merged. While it’s still marked as experimental and hidden behind the drm.rnodes module parameter, I’m confident we will enable it by default in one of the next kernel releases.

What are render-nodes?

From a user-space perspective, render-nodes are “like a big FPU” (krh) that can be used by applications to speed up computations and rendering. They are accessible via /dev/dri/renderD<num> and provide the basic DRM rendering interface. Compared to the old card<num> nodes, they lack some features:

  • No mode-setting (KMS) ioctls allowed
  • No insecure gem-flink allowed (use dma-buf instead!)
  • No DRM-auth required/supported
  • No legacy pre-KMS DRM-API supported

So whenever an application wants hardware-accelerated rendering, GPGPU access or offscreen-rendering, it no longer needs to ask a graphics-server (via DRI or wl_drm) but can instead open any available render node and start using it. Access-control to render-nodes is done via standard file-system modes. It’s no longer shared with mode-setting resources and thus can be provided for less-privileged applications.

It is important to note that render-nodes do not provide any new APIs. Instead, they just split a subset of the already available DRM-API off to a new device-node. The legacy node is not changed but kept for backwards-compatibility (and, obviously, for mode-setting).

It’s also important to know that render-nodes are not bound to a specific card. While internally it’s created by the same driver as the legacy node, user-space should never assume any connection between a render-node and a legacy/mode-setting node. Instead, if user-space requires hardware-acceleration, it should open any node and use it. For communication back to the graphics-server, dma-buf shall be used. Really! Questions like “how do I find the render-node for a given card?” don’t make any sense. Yes, driver-specific user-space can figure out whether and which render-node was created by which driver, but driver-unspecific user-space should never do that! Depending on your use-cases, either open any render-node you want (maybe allow an environment-variable to select it) or let the graphics-server do that for you and pass the FD via your graphics-API (X11, wayland, …).
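
A driver-unspecific client could follow that advice along these lines. This is only a sketch: the function name is made up, the environment variable is whatever the caller chooses, and the assumption that render minors start at 128 reflects current kernels rather than anything stated in this article:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: open a render node. If the environment variable
 * `env_name` is set, its value is used as the path; otherwise the first
 * render node that can be opened wins. Returns an fd or -1. */
int open_render_node(const char *env_name)
{
    assert(env_name != NULL);

    const char *override = getenv(env_name);
    if (override && *override)
        return open(override, O_RDWR | O_CLOEXEC);

    /* Assumption: render minors are numbered from 128 upwards. */
    for (int minor = 128; minor < 192; minor++) {
        char path[64];
        snprintf(path, sizeof(path), "/dev/dri/renderD%d", minor);
        int fd = open(path, O_RDWR | O_CLOEXEC);
        if (fd >= 0)
            return fd;
    }
    return -1;
}
```

Note that the function never tries to map a render node back to a card<num> node; buffers go back to the graphics-server via dma-buf instead.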

So with render-nodes, kernel drivers can now provide an interface only for off-screen rendering and GPGPU work. Devices without any display-controller can avoid any mode-setting nodes and just provide a render-node. User-space, on the other hand, can finally use GPUs without requiring any privileged graphics-server running. They’re independent of the kernel-internal DRM-Master concept!

2) Mode-setting nodes

While splitting off render-nodes from the legacy node simplifies the situation for most applications, we didn’t simplify it for mode-setting applications. Currently, if a graphics-server wants to program a display-controller, it needs to be DRM-Master for the given card. It can acquire it via drmSetMaster() and drop it via drmDropMaster(). But only one application can be DRM-Master at a time. Moreover, only applications with CAP_SYS_ADMIN privileges can acquire DRM-Master. This prevents some quite fancy features:

  • Running an XServer without root-privileges
  • Using two different XServers to control two independent monitors/connectors of the same card

The initial idea (and Ilija Hadzic’s follow-up) to support this were mode-setting nodes. A privileged ioctl on the control-node would allow applications to split mode-setting resources across different device-nodes. You could have /dev/dri/modesetD1 and /dev/dri/modesetD2 to split your KMS CRTC and Connector resources. An XServer could use one of these nodes to program the now reduced set of resources. We would have one DRM-Master per node and we’d be fine. We could remove the CAP_SYS_ADMIN restriction and instead rely on file-system access-modes to control access to KMS resources.

Another idea that was discussed, to avoid creating a bunch of file-system nodes, is to allocate these resources on-the-fly. All mode-setting-resources would now be bound to a DRM-Master object. An application can only access the resources available on the DRM-Master that it is assigned to. Initially, all resources are bound to the default DRM-Master as usual, which everyone gets assigned to when opening a legacy node. A new ioctl DRM_CLONE_MASTER is used to create a new DRM-Master with the same resources as the previous DRM-Master of an application. Via a DRM_DROP_MASTER_RESOURCE ioctl, an application can drop KMS resources from its DRM-Master object. Due to their design, neither requires a CAP_SYS_ADMIN restriction as they only clone or drop privileges, they never acquire new privs! So they can be used by any application with access to the control node to create two new DRM-Master resources and pass them to two independent XServers. These use the passed FD to access the card, instead of opening the legacy or mode-setting nodes.

From the kernel side, the only thing that changes is that we can have multiple active DRM-Master objects. In fact, per DRM-Master one open-file might be allowed KMS access. However, this doesn’t require any driver-modifications (which were mostly “master-agnostic”, anyway) and only a few core DRM changes (except for vmwgfx-ttm-lock..).

3) DRM infrastructure

The previous two chapters focused on user-space APIs, but we also want the kernel-internal infrastructure to account for split hardware. However, the fact is we already have everything we need. If some hardware exists without display-controller, you simply omit the DRIVER_MODESET flag and only set DRIVER_RENDER. DRM core will only create a render-node for this device then. If your hardware only provides a display-controller, but no real rendering hardware, you simply set DRIVER_MODESET but omit DRIVER_RENDER (which is what SimpleDRM is doing).

Yes, you currently get a bunch of unused DRM code compiled-in if you don’t use some features. However, this is not because DRM requires it, but only because no-one sent any patches for it, yet! DRM-core is driven by DRM-driver developers!

There is a reason why mid-layers are frowned upon in DRM land. There is no group of core DRM developers, but rather a bunch of driver-authors who write fancy driver-extensions. And once multiple drivers use them, they factor it out and move it to DRM core. So don’t complain about missing DRM features, but rather extend your drivers. If it’s a nice feature, you can count on it being incorporated into DRM-core at some point. It might be you doing most of the work, though!

Sane Session-Switching

In a previous article I talked about the history of VT switching. Created as a simple way to switch between text-mode sessions it has grown into a fragile API to protect one testosterone monster (also called XServer) from another. XServers used to poke in PCI bars, modified MMIO registers and messed around with DMA controllers. If they hauled out the big guns, it was almost absurd to believe a simple signal-flinging VT could ever successfully negotiate. Fortunately, today’s XServer is a repentant sinner. With common desktop hardware, all direct I/O is done in the kernel (thanks KMS!) and chances of screwing up your GPUs are rather minimal. This allows us to finally implement proper device-handover during session-switches.

If we look at sessions as a whole, the XServer isn’t special at all. A lot of session-daemons may run today that provide some service to the session as a whole. This includes pulseaudio, dbus, systemd --user, colord, polkit, ssh-keychain, and a lot more. All these daemons don’t need any special synchronization during session-switch. So why does the XServer require it?

Any graphics-server like the XServer is responsible for providing access to input and graphics devices to a session. When a session is activated, it needs to re-initialize the devices. Before a session is deactivated, it needs to clean up the devices so the to-be-activated session can access them. The reason it needs to do this is missing infrastructure to revoke its access. If a session did not clean up graphics devices, the kernel would prevent any new session from accessing the graphics device. For input devices it is even worse: If a session doesn’t close the devices during deactivation, it would continue reading input events while the new session is active. So while typing in your password, the background session might send these key-strokes to your IRC client (which is exactly what XMir did). What we need is a kernel feature to forcibly revoke access to a graphics or input device. Unfortunately, it is not as easy as it sounds. We need to find someone who is privileged and trusted enough to do this. You don’t want your background session to revoke your foreground session’s graphics access, do you? This is where systemd-logind enters the stage.

systemd-logind is already managing sessions on a system. It keeps track of which session is active and sets ACLs in /dev to give the foreground session access to device nodes. To implement device-handover, we extend the existing logind-API by a new function: RequestDevice(deviceNode). A graphics-server can pass a file-system path for a device-node in /dev to systemd-logind, which checks permissions, opens the node and returns a file-descriptor to the caller. But systemd-logind retains a copy of the file-descriptor. This allows logind to disable it as long as the session is inactive. During a session-switch, logind can now disable all devices of the old session, re-enable the devices of the new session and notify both of the session-switch. We now have a clean handover from one session to the other. With this technology in place, we can start looking at real scenarios.

1) Session-management with VTs

[Figure: Session management with VTs and logind (VTs for foreground control, logind for device management)]

Based on the graphs for VT-switching, I drew a new one considering logind. VTs are still used to switch between sessions, but sessions no longer open hardware devices directly. Instead, they ask logind as described above. The big advantage is that VT-switches are no longer fragile. If a VT is active, it can be sure that it has exclusive hardware-access. And if a session is dead-locked, we can force a VT-switch and revoke its device-access. This allows us to recover from situations where your XServer hangs without SSH’ing from a remote machine or using SysRq.

2) Session-management without VTs

[Figure: Pure logind session management (session management based solely on logind)]

While sane VT-switching is a nice feature, the biggest win is that we can implement proper multi-session support for seats without VTs. While previously only a single session could run on such seats, with logind device-management, we can now support session-switching on any seat.

Instead of using VTs to notify sessions when they are activated or deactivated, we use the logind-dbus-API. A graphics-server can now request input and graphics devices via the logind RequestDevice API and use it while active. Once a session-switch occurs, logind will disable the device file-descriptors and switch sessions. A dbus signal is sent asynchronously to the old and new session. The old session can stop rendering while inactive to save power.

3) Asynchronous events and backwards-compatibility

One thing changes almost unnoticed when using RequestDevice. An active graphics-server might be just about to display an image on screen while a session-switch occurs. logind revokes access to graphics devices and sends an asynchronous event that the session is now inactive. However, the graphics-server might not have received this event, yet. Instead, it tries to invoke a system-call to update the screen. But this will fail with EACCES or EPERM as it doesn’t have access to the device, anymore. Currently, for most graphics servers this is a fatal error. Instead of handling EACCES and interpreting it as “this device is now paused”, they don’t care for the error code and abort. We could fix all the graphics-servers, but to simplify the transition, we introduced negotiated session-switches.

Whenever logind is asked to perform a session-switch, it first sends PauseDevice signals for every open device to the foreground graphics-server. The graphics-server must respond with a PauseDeviceComplete call to logind for each device. Once all devices are paused, the session-switch is performed. If the foreground session does not respond in a timely manner, logind will forcibly revoke device access and then perform the session-switch, anyway.

Note that negotiated session-switches are only meant for compatibility. Any graphics-server is highly encouraged to handle EACCES just fine!
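
Handling the error “just fine” can be as simple as the following sketch. The enum and function names are invented for this example; a real graphics-server would hook this into whatever wrapper it uses around its device ioctls:

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical classification of a device system-call result. */
enum device_state {
    DEVICE_OK,      /* the call succeeded */
    DEVICE_PAUSED,  /* access was revoked; wait for the session to resume */
    DEVICE_FAILED   /* a real error; handle or abort as before */
};

/* `ret` is the return value of the system-call, `err` the errno it left. */
enum device_state classify_device_result(int ret, int err)
{
    if (ret >= 0)
        return DEVICE_OK;

    assert(err != 0);           /* a failing call must have set errno */

    /* Revoked access is not fatal, the device is merely paused. */
    if (err == EACCES || err == EPERM)
        return DEVICE_PAUSED;
    return DEVICE_FAILED;
}
```

On DEVICE_PAUSED the server would stop rendering and wait for the notification that its session is active again, instead of aborting.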

All my local tests ran fine so far, but all this is still under development. systemd patches can be found at github (frequently rebased!). Most tests I do rely on an experimental novt library, also available at github (I will push it during next week; this is only for testing!). Feedback is welcome! The RFC can be found on systemd-devel. Now I need a day off..

Happy Switching!

How VT-switching works

Having multiple sessions on your system in parallel is a quite handy feature. It allows things like Fast User Switching or running two different DEs at the same time. Especially graphics developers like it, because they can test-run their experimental XServer/weston on the same machine they develop on.

To understand how it works, we need the concept of a session. See my previous introduction into session-management if you’re not familiar with it. I expect the reader to be familiar with basic session-management concepts (i.e., seats, systemd-logind, login-sessions, user-sessions, processes and daemons in a session).

1) Traditional text-mode VT-switching

Virtual terminals were introduced with linux-0.12. It’s the origin of multi-session support on linux. Before this, linux only supported a single TTY session (which was even available in the first tarball of linux-0.01). With virtual terminals, we have /dev/tty<num> devices, where <num> is between 1 and 63. They are always bound to seat0 and every session on seat0 is bound to a single VT. That means, there are at most 63 live sessions on seat0. Only one of them is active at a time (which can be read from /sys/class/tty/tty0/active).
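
Reading the active VT from sysfs could look like this sketch. The helper names are made up, and the parser only assumes that the attribute contains a name of the form tty<num>:

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: turn the content of /sys/class/tty/tty0/active
 * (for example "tty3\n") into a VT number, or -1 if it is malformed. */
int parse_active_vt(const char *text)
{
    int num;

    assert(text != NULL);
    if (sscanf(text, "tty%d", &num) != 1 || num < 1 || num > 63)
        return -1;
    return num;
}

/* Read the currently active VT of seat0, or return -1 on error. */
int read_active_vt(void)
{
    char buf[32];
    FILE *f = fopen("/sys/class/tty/tty0/active", "r");
    if (!f)
        return -1;

    char *line = fgets(buf, sizeof(buf), f);
    fclose(f);
    return line ? parse_active_vt(line) : -1;
}
```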

[Figure: Session management in text-mode (session management and device access with VTs)]

The kernel listens for keyboard events and switches between VTs on ctrl+alt+Fx shortcuts. Alternatively, you can issue a VT_ACTIVATE ioctl to politely ask the kernel to switch sessions.
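
The polite variant might be sketched like this. The function name is made up, the file-descriptor must refer to a VT (for example /dev/tty0), and the caller needs sufficient privileges. VT_WAITACTIVE is added here so the function only returns once the switch has happened:

```c
#include <assert.h>
#include <linux/vt.h>
#include <sys/ioctl.h>

/* Hypothetical helper: ask the kernel to switch to VT `num` and wait
 * until that VT is active. Returns 0 on success, -1 on error. */
int switch_to_vt(int tty_fd, int num)
{
    assert(num >= 1 && num <= 63);

    if (ioctl(tty_fd, VT_ACTIVATE, num) < 0)
        return -1;
    return ioctl(tty_fd, VT_WAITACTIVE, num);
}
```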

In the old days, all sessions ran in text-mode. In text-mode, a process can get keyboard input by reading from a VT and can write onto the screen by writing to a VT. Some rather ugly control-sequences are supported to allow colors or other advanced features. The kernel interprets these and instructs the graphics hardware to print the given text.

Important to note is that in text-mode sessions don’t have direct hardware access. The kernel merges all keyboard events into one stream and a session cannot tell which device it came from. In fact, it cannot even tell how many devices there are. Same for graphics devices. As a VT has only a single output stream, there is only a single screen to write to. Multi-head support is not available. Neither are any advanced graphics-operations. But this allows the kernel to serialize access to hardware devices. Only the active VT gets input events and only the buffer of the active VT is displayed on the screen. No resource-conflicts can occur.

While there are ways to detect when a session is activated/deactivated, in text-mode a session normally doesn’t care. It just stops receiving keyboard input. If the session writes to the VT while deactivated, it will affect the internal buffer of the VT, but not the screen. Only the buffer of the active VT is shown on screen.

2) Session-switching in graphics-mode

[Figure: Session management in gfx-mode (system session management and device access with VTs in graphics mode)]

Very soon it became clear that text-mode is not enough. We wanted more! That’s when the VT graphics mode was introduced. Graphics mode doesn’t change the setup, we still have 63 VTs and each session is bound to a VT. But a VT can now be switched from text-mode KD_TEXT into graphics-mode KD_GRAPHICS (via KDSETMODE ioctl). This doesn’t do anything spectacular. Really! The only effect is it disables the kernel-internal graphics routines. As long as a VT is in graphics-mode, the kernel will not instruct the graphics hardware to display the VT on screen. Once it is reset to text-mode and the VT is active, the kernel will display it again.

So the graphics-mode itself is useless. But at the same time, the kernel started providing separate interfaces to input and graphics devices. So while a VT is in graphics-mode, a session can access the graphics devices directly and render whatever they want. It can also ignore input from the VT and instead read input-events directly from the input-event interfaces. This is how the XServer works today.

But it is pretty obvious that this becomes problematic during session switches. If the kernel switches away from a graphics VT, the session needs to release the graphics hardware before the new session can be activated. Otherwise, the new session is active, but you still get the images from the old one. Or worse, it might flicker between the images of both sessions. Without kernel-mode-setting it might even hang your graphics hardware.

(A similar interface to KDSETMODE exists for keyboards with KDSKBMODE.)

Unfortunately, the kernel has no interface to forcibly revoke graphics or input access. So VTs were extended by the VT_SETMODE ioctl. A process can issue VT_SETMODE on a VT and pass two signal-numbers (usually SIGUSR1 and SIGUSR2). If the kernel wants to perform a VT-switch, it sends one such signal to the active VT. This VT can cleanup resources, stop using graphics/input devices and acknowledge the VT-switch via the VT_RELDISP ioctl. If the process dead-locked and doesn’t acknowledge the request, the VT-switch will not happen! This is why a crashed XServer can hang your system. But if it correctly issues the VT_RELDISP ioctl, the kernel will perform the VT-switch. Once the kernel switches back to the given VT it sends the second signal as notification that it is now active again.
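
The handshake described above might be sketched as follows. The three function names are invented for this example. The signal handlers that call release_vt() and acquire_vt() are left out, as is any real device cleanup:

```c
#include <assert.h>
#include <linux/vt.h>
#include <signal.h>
#include <string.h>
#include <sys/ioctl.h>

/* Put the VT behind `tty_fd` into process-controlled switching. The kernel
 * will send SIGUSR1 before switching away and SIGUSR2 after switching back. */
int take_vt_control(int tty_fd)
{
    struct vt_mode mode;

    memset(&mode, 0, sizeof(mode));
    mode.mode = VT_PROCESS;
    mode.relsig = SIGUSR1;
    mode.acqsig = SIGUSR2;
    return ioctl(tty_fd, VT_SETMODE, &mode);
}

/* Call after SIGUSR1, once graphics and input devices are released:
 * this acknowledges the request and lets the VT-switch proceed. */
int release_vt(int tty_fd)
{
    return ioctl(tty_fd, VT_RELDISP, 1);
}

/* Call after SIGUSR2: acknowledge that this VT is active again. */
int acquire_vt(int tty_fd)
{
    return ioctl(tty_fd, VT_RELDISP, VT_ACKACQ);
}
```

If release_vt() is never called, the switch does not happen, which is exactly the hang described above.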

However, on every VT (precisely, on every session) only a single process can call VT_SETMODE. This already shows that the concept is flawed. For example, if the XServer takes possession of the VT for graphics and input devices, another audio-server in the session couldn’t do the same for audio-hardware. This applies to all other devices. An alternative involving logind is discussed in a followup article.

3) Seats without VTs

We discussed that VTs are always bound to seat0. So if you run a session on a seat other than seat0 or if you disabled VTs entirely, then the situation becomes pretty simple: no multi-session support is available. This is enforced by systemd-logind so the active session can run without interruptions.

Also important to note is that there is no text-mode. A session must run in graphics mode as the kernel facility to interpret text-commands is bound to VTs.

[Figure: Session management without VTs (single-session setup on seats without VTs)]

4) logind integration

As we discussed in the previous article, today systemd-logind tracks and manages sessions. But how does this integrate with VTs? VTs pre-date systemd by years, so to preserve backwards-compatibility, we need to keep the infrastructure as it is. That’s why logind watches /sys/class/tty/tty0/active for changes. Once a VT switch happens, logind notices it and marks the old session as inactive and the new as active. It also adjusts ACLs in /dev to keep access-restrictions in sync with the active session. However, this cannot be done properly without a race-condition. Therefore, logind provides a new dbus-API to replace the ageing VT API. New session-daemons are advised to use it in favor of VTs.

DRM Render- and Modeset-Nodes

Another year, another Google Summer of Code. This time I got the chance to work on something that I had on my TODO list for quite a long time: DRM Render- and Modeset-Nodes

As part of the X.org Foundation mentoring organization, I will try to pick up the work from Ilija Hadzic, Dave Airlie, Kristian Hoegsberg, Martin Peres and others. The idea is to extend the DRM user-space API of the linux kernel to split modeset and rendering interfaces apart. The main usage is to allow different access-modes for graphics-compositors (which require the modeset API) and client-side rendering or GPGPU-users (which both require the rendering API). We currently use the DRM-Master interface to restrict the modeset API to privileged applications. However, this requires CAP_SYS_ADMIN privileges, which is roughly equivalent to root-privileges. With two different interfaces for modeset and rendering APIs, we can apply a different set of filesystem-access-modes to each of them and thus get fine-grained access-control.

Apart from fine-grained access control, we also get some other nice features almost for free:

  • We will be able to run GPGPU clients without any running compositor or even without any display controller
  • We can split modeset objects across multiple nodes to allow multi-seat setups with a single display controller
  • Efficient compositor-stacking by granting page-flip access or full modeset access temporarily to sub-compositors

There are actually a lot of other ideas how to extend this. So I decided to concentrate on the modeset-node / render-node split first. Once that is done (and fully working), I will pick different ideas that depend on this and try to implement them. Considering the amount of work others have put into this already, I think if I get the split merged into mainline, the project will already be a great success. Everything on top of it will be some bonus that will probably take more time to get merged. But let’s see, maybe it turns out to be easier than I think and we end up with some of the use-cases merged upstream, too.

Thanks to the X.org Foundation, Google and my GSoC-mentor Dave Airlie for giving me the chance to work on the DRM API! I hope it will be a productive summer.

GSoC-Proposal

If someone is interested in more details, some excerpts from my original GSoC proposal:

Project Description:
For several years now, xserver has no longer been the only user-space project
that makes use of the kernel DRM API. The introduction of KMS allowed many new projects to
emerge, including plymouth, weston and kmscon. On the other side, OpenCL support
allows applications to make use of DRM without requiring any KMS APIs. Even though
both use-cases work with the current APIs, there are a lot of restrictions that
need to be worked around.

The most problematic concept is DRM-Master. KMS applications are required to be
DRM-Master to perform modesetting, but DRM-Master is tightly coupled to
CAP_SYS_ADMIN/root. On the other side, render clients are required to be assigned
to a DRM-Master so they can get authenticated. This prevents off-screen/offline
rendering without a running compositor.

One possible solution is to split render- and modeset-nodes apart. The DRM control
node can be used as the management node (which is, as far as I understand, what
it was designed for, anyway). A separate static render-node is created for each
DRM device which is restricted to ioctls specific to rendering operations. Instead
of requiring drmAuth() for authorization, we can now use filesystem access-modes.
This allows slightly more dynamic access-control, but the biggest advantage is
that we can do off-screen/offline rendering without a running
compositor/DRM-Master.

On the other side, a modeset-node is a concept to have KMS separated from DRM.
The use-case is to split modeset objects (eg., crtcs, encoders, connectors) across
different modesetting applications. This allows one compositor to use one
CRTC+connector combination, while another compositor (maybe on another seat) can
use another CRTC+external-connector. This doesn't have to be a static setup. On
the contrary, one use-case I am very interested in is a dynamic modeset-object
assignment to temporary clients. This way, a fullscreen application can be granted
page-flip rights from the compositor to avoid context-switches to the compositor
for doing trivial page-flips only.

Deliverables
* Working render-node clients: Preferably an offline OpenCL example and
  a wayland EGL client
* Merged kernel render-node implementation with at least i915 support
* Dynamic kernel modeset-nodes
* "Zero-context-switches" wayland/weston fullscreen client (optional)

Known Problems:
There were several attempts to push render-nodes into the kernel, but all failed
due to missing motivation to finish the user-space clients. Writing up new fancy
APIs is one part, but pushing API changes to such big projects requires the whole
environment to work well with the changes. That's why I want to concentrate on
the user-space side of render-nodes. And I want to finish the render-nodes project
before continuing with modeset-nodes. The idea has been around long enough that
it's time that we get it done.

However, one problem is that I never worked with the low-level X11 stack. The
wayland environment is great for experiments and quite active. I am very familiar
with it and know how to get examples easily running. The xserver, however, is a
huge black box to me. I know the concepts and understand the input and graphics
drivers design. But I never read xserver core code. That's something I'd like to
change during this project. I will probably be limited to DRI and graphics
drivers, but that's a good start.

Another idea that came up quite often is something like gem-fs. It's far beyond
the scope of this project, but it's something I'd like to keep in mind when
designing the API. It's hard to account for something that's only an idea, but
the concepts seem related so I will try to understand the reasons behind gem-fs
and avoid orthogonal implementations.

A few smaller implementation-specific problems are already known, including the
mmap-security problem, static "possible_encoders"/"possible_crtcs" bitsets and
missing MMUs on GPUs. However, there already have been ideas how to solve them
so I don't consider them blockers for render-nodes.

Linux Wii Remote Driver Updates

Linux Bluetooth HID handling has been slightly broken in the kernel for several years. Reconnecting devices caused a kernel oops nearly every time you tried. Two months ago I sat down and rewrote the HIDP session handling and fixed several bugs. The series has now been merged and will be part of linux-3.10. If you still encounter bugs please leave me a note.

Based on this I tried recovering my hid-wiimote work. Since the last time I reverse-engineered Nintendo devices, a lot has happened. The most significant change is probably the release of the Wii U. While the Gamepad is still an ongoing target for r/e, other devices were much easier to get working. Motivated by the new hardware I just got, I finally figured out a reliable way for extension hotplugging. This wasn’t supported in the kernel until now and required nasty polling-techniques to work around race-conditions in the proprietary protocol (who designs asynchronous protocols without protocol barriers?). So I rewrote most of the core device handling and moved the input parsers to a module based infrastructure. This allows us to easily extend the hid-wiimote driver to support new devices that are based on the same protocol.

The series is still pending on the linux-input ML but I hope to get basic hotplugging support into linux-3.10. Further support for the built-in speaker device or the Wii U Pro Controller will hopefully follow with 3.11.

The xwiimote user-space stack has already been updated and I will push the changes this weekend.

Wii U Gamepad Connection

I’ve been playing with the Wii-U Gamepad lately and am trying to figure out how it connects to the Wii U console. Once we get that working, we can start reverse-engineering the protocol and write a linux driver for it. It would make a great remote display for every linux box. So how does it work?

The communication between Wii-U and GamePad is done via 5 GHz Wi-Fi. It uses the range 5150-5250 MHz (Sony UWA-BR100 is a nice dual-band ath9k_htc USB dongle with perfect linux support). The console opens a soft AP without any encryption. SSID is set similar to “WiiU34af2c5fa6134af2c5fa61c_STA1”. The GamePad connects to this AP and then creates some private link. I haven’t figured out how this works, yet.
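The frequency range maps directly to 802.11 channel numbers, which is handy when configuring a dongle or a test AP. A minimal sketch, assuming the usual 5 GHz numbering (channel 0 at 5000 MHz, 5 MHz per channel); the function names and the exact range check are my own choices:

```c
#include <assert.h>

/*
 * Map a 5 GHz center frequency in MHz to its 802.11 channel number.
 * Returns -1 if the value is off the 5 MHz grid or outside the range
 * this sketch accepts (an assumption, not a regulatory limit).
 */
int wifi5_channel(int freq_mhz)
{
    if (freq_mhz <= 5000 || freq_mhz >= 5900 || freq_mhz % 5 != 0)
        return -1;
    return (freq_mhz - 5000) / 5;
}

/* Cross-check against the values quoted in this post. */
void wifi5_channel_check(void)
{
    assert(wifi5_channel(5180) == 36); /* "freq: 5180", "primary channel: 36" */
    assert(wifi5_channel(5240) == 48); /* last 20 MHz channel below 5250 MHz */
}
```

So the 5150-5250 MHz range covers channels 36, 40, 44 and 48, and the scans below show the console sitting on channel 36.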

The IE fields do not advertise any Wifi-Direct (P2P), Wifi-Display (WFD) or Direct-Link (TDLS) features. The only features found are WMM QoS fields.

How to proceed? I need to figure out how to create a soft-AP with the advertised features so I can make the GamePad connect to me. Two Nintendo vendor extensions are advertised (“OUI a4:c0:e1”), which probably identify the AP. The other vendor IEs are Broadcom/EPIGRAM IDs which can also be found on other networks. After that, I need to test P2P discovery, TDLS discovery or 802.11e DLS setup to find out what kind of direct-link Nintendo uses. According to Broadcom’s Dino Bekis, a form of Miracast is used, which would mandate P2P or TDLS.

If anyone has more information on that, I’d be very thankful!

Btw., dhcp is provided on the unprotected soft-AP and I can ping the console but my port-scans didn’t return any useful information. I will try connecting to WFD/RTSP default port 7236 next…
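Checking a single port such as 7236 does not need a full port-scanner. The following is a minimal sketch of a TCP connect probe with a timeout; probe_tcp() is a hypothetical helper, and it only tells whether something accepts connections, not whether it actually speaks RTSP.

```c
#define _POSIX_C_SOURCE 200809L
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Try a TCP connect to @ipv4:@port, waiting at most @timeout_ms.
 * Returns 1 if the connection was established, 0 if it was refused or
 * timed out, and -1 on a local error (bad address, no socket).
 */
int probe_tcp(const char *ipv4, uint16_t port, int timeout_ms)
{
    struct sockaddr_in addr;
    int fd, flags, is_open = 0;

    assert(ipv4 != NULL);
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    if (inet_pton(AF_INET, ipv4, &addr.sin_addr) != 1)
        return -1;

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    /* Non-blocking connect so that poll() can enforce the timeout. */
    flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) {
        close(fd);
        return -1;
    }

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0) {
        is_open = 1;
    } else if (errno == EINPROGRESS) {
        struct pollfd pfd = { .fd = fd, .events = POLLOUT };

        if (poll(&pfd, 1, timeout_ms) == 1) {
            int err = 0;
            socklen_t len = sizeof(err);

            /* SO_ERROR holds the final result of the connect attempt. */
            if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) == 0 &&
                err == 0)
                is_open = 1;
        }
    }

    close(fd);
    return is_open;
}
```

A call like probe_tcp("192.168.1.1", 7236, 1000) would test the WFD/RTSP port; the address is a placeholder for whatever the console's DHCP server hands out.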

 

Soft-AP during Synchronization:

BSS 34:af:2c:5f:a6:1c (on wlan1)
	TSF: 3126337 usec (0d, 00:00:03)
	freq: 5180
	beacon interval: 100
	capability: ESS (0x0001)
	signal: -55.00 dBm
	last seen: 3208 ms ago
	Information elements from Probe Response frame:
	SSID: WiiU34af2c5fa6134af2c5fa61c_STA1
	Supported rates: 6.0* 9.0 12.0* 18.0 24.0* 36.0 48.0 54.0 
	HT capabilities:
		Capabilities: 0x1c
			HT20
			SM Power Save disabled
			RX Greenfield
			No RX STBC
			Max AMSDU length: 3839 bytes
			No DSSS/CCK HT40
		Maximum RX AMPDU length 16383 bytes (exponent: 0x001)
		Minimum RX AMPDU time spacing: 8 usec (0x06)
		HT RX MCS rate indexes supported: 0-15
		HT TX MCS rate indexes are undefined
	HT operation:
		 * primary channel: 36
		 * secondary channel offset: no secondary
		 * STA channel width: 20 MHz
		 * RIFS: 1
		 * HT protection: no
		 * non-GF present: 0
		 * OBSS non-GF present: 0
		 * dual beacon: 0
		 * dual CTS protection: 0
		 * STBC beacon: 0
		 * L-SIG TXOP Prot: 0
		 * PCO active: 0
		 * PCO phase: 0
	Vendor specific: OUI a4:c0:e1, data: f5 00
	Vendor specific: OUI a4:c0:e1, data: f4 10 4a 00 01 10 10 44 00 01 02 10 41 00 01 01 10 12 00 02 00 00 10 53 00 02 00 84 10 3b 00 01 03 10 47 00 10 22 21 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f 10 21 00 08 42 72 6f 61 64 63 6f 6d 10 23 00 06 53 6f 66 74 41 50 10 24 00 01 30 10 42 00 01 30 10 54 00 08 00 06 a4 c0 e1 f4 00 01 10 11 00 10 57 69 69 55 33 34 61 66 32 63 35 66 61 36 31 63 10 08 00 02 00 84
	Vendor specific: OUI 00:10:18, data: 02 00 00 04 00 00
	WMM:	 * Parameter version 1
		 * u-APSD
		 * BE: CW 15-31, AIFSN 2, TXOP 1504 usec
		 * BK: CW 15-1023, AIFSN 7
		 * VI: CW 15-31, AIFSN 3, TXOP 3008 usec
		 * VO: CW 15-31, AIFSN 3, TXOP 1504 usec
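
The long “OUI a4:c0:e1” element in this scan is not opaque. After the leading f4 byte, the payload appears to be a sequence of type-length-value records with big-endian 16-bit type and length fields, which is the layout WPS (Wi-Fi Simple Configuration) attributes use. That reading is an assumption on top of the raw dump, and the helper names below are hypothetical. A minimal sketch that walks the records:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Payload of the second "OUI a4:c0:e1" element, copied from the scan. */
static const uint8_t nintendo_ie[] = {
    0xf4,
    0x10,0x4a,0x00,0x01,0x10,
    0x10,0x44,0x00,0x01,0x02,
    0x10,0x41,0x00,0x01,0x01,
    0x10,0x12,0x00,0x02,0x00,0x00,
    0x10,0x53,0x00,0x02,0x00,0x84,
    0x10,0x3b,0x00,0x01,0x03,
    0x10,0x47,0x00,0x10,0x22,0x21,0x02,0x03,0x04,0x05,0x06,0x07,
                        0x08,0x09,0x0a,0x0b,0x0c,0x0d,0x0e,0x0f,
    0x10,0x21,0x00,0x08,0x42,0x72,0x6f,0x61,0x64,0x63,0x6f,0x6d,
    0x10,0x23,0x00,0x06,0x53,0x6f,0x66,0x74,0x41,0x50,
    0x10,0x24,0x00,0x01,0x30,
    0x10,0x42,0x00,0x01,0x30,
    0x10,0x54,0x00,0x08,0x00,0x06,0xa4,0xc0,0xe1,0xf4,0x00,0x01,
    0x10,0x11,0x00,0x10,0x57,0x69,0x69,0x55,0x33,0x34,0x61,0x66,
                        0x32,0x63,0x35,0x66,0x61,0x36,0x31,0x63,
    0x10,0x08,0x00,0x02,0x00,0x84,
};

/* Count the records in @buf; -1 if the data does not split cleanly. */
int tlv_count(const uint8_t *buf, size_t len)
{
    size_t pos = 0;
    int n = 0;

    assert(buf != NULL);
    while (pos < len) {
        size_t vlen;

        if (len - pos < 4)
            return -1; /* truncated header */
        vlen = (size_t)buf[pos + 2] << 8 | buf[pos + 3];
        pos += 4;
        if (vlen > len - pos)
            return -1; /* truncated value */
        pos += vlen;
        n++;
    }
    return n;
}

/* Return the value of the first record of @type, or NULL if absent. */
const uint8_t *tlv_find(const uint8_t *buf, size_t len, uint16_t type,
                        size_t *out_len)
{
    size_t pos = 0;

    assert(buf != NULL && out_len != NULL);
    while (len - pos >= 4) {
        uint16_t t = (uint16_t)(buf[pos] << 8 | buf[pos + 1]);
        size_t vlen = (size_t)buf[pos + 2] << 8 | buf[pos + 3];

        pos += 4;
        if (vlen > len - pos)
            return NULL;
        if (t == type) {
            *out_len = vlen;
            return buf + pos;
        }
        pos += vlen;
    }
    return NULL;
}
```

Skipping the first byte, the payload splits into 14 complete records. Three of them carry plain ASCII: type 0x1021 holds “Broadcom”, 0x1023 holds “SoftAP” and 0x1011 holds “WiiU34af2c5fa61c”. In WPS terms these would be manufacturer, model name and device name, but treat that mapping as a guess until it is checked against the GamePad's behaviour.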

Soft-AP during normal operation with GamePad:

BSS 34:af:2c:5f:a6:1c (on wlan1)
	TSF: 413388858 usec (0d, 00:06:53)
	freq: 5180
	beacon interval: 100
	capability: ESS Privacy (0x0011)
	signal: -48.00 dBm
	last seen: 3201 ms ago
	Information elements from Probe Response frame:
	SSID: \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
	Supported rates: 6.0* 9.0 12.0* 18.0 24.0* 36.0 48.0 54.0 
	TIM: DTIM Count 1 DTIM Period 3 Bitmap Control 0x0 Bitmap[0] 0x0
	RSN:	 * Version: 1
		 * Group cipher: a4-c0-e1:4
		 * Pairwise ciphers: a4-c0-e1:4
		 * Authentication suites: a4-c0-e1:2
		 * Capabilities: 4-PTKSA-RC (0x0008)
	HT capabilities:
		Capabilities: 0x1c
			HT20
			SM Power Save disabled
			RX Greenfield
			No RX STBC
			Max AMSDU length: 3839 bytes
			No DSSS/CCK HT40
		Maximum RX AMPDU length 16383 bytes (exponent: 0x001)
		Minimum RX AMPDU time spacing: 8 usec (0x06)
		HT RX MCS rate indexes supported: 0-15
		HT TX MCS rate indexes are undefined
	HT operation:
		 * primary channel: 36
		 * secondary channel offset: no secondary
		 * STA channel width: 20 MHz
		 * RIFS: 1
		 * HT protection: no
		 * non-GF present: 1
		 * OBSS non-GF present: 0
		 * dual beacon: 0
		 * dual CTS protection: 0
		 * STBC beacon: 0
		 * L-SIG TXOP Prot: 0
		 * PCO active: 0
		 * PCO phase: 0
	Vendor specific: OUI 00:90:4c, data: 07 00 45 55 17
	Vendor specific: OUI 00:10:18, data: 02 01 00 04 00 00
	WMM:	 * Parameter version 1
		 * u-APSD
		 * BE: CW 15-31, AIFSN 2, TXOP 1504 usec
		 * BK: CW 15-1023, AIFSN 7
		 * VI: CW 15-31, AIFSN 3, TXOP 3008 usec
		 * VO: CW 15-31, AIFSN 3, TXOP 1504 usec
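
The most visible difference between the two scans is the capability word: 0x0001 during synchronization versus 0x0011 during normal operation. In the 802.11 capability field, bit 0 is ESS and bit 4 is Privacy, so the second AP requires encryption, which matches the RSN element with Nintendo-specific cipher suites that only shows up in the second scan. A minimal sketch of that check (the macro and function names are my own):

```c
#include <assert.h>
#include <stdint.h>

#define CAP_ESS     0x0001u /* bit 0: infrastructure network */
#define CAP_PRIVACY 0x0010u /* bit 4: data frames are encrypted */

/* Return 1 if the BSS advertises the Privacy bit, 0 otherwise. */
int bss_requires_encryption(uint16_t capability)
{
    return (capability & CAP_PRIVACY) != 0;
}

/* Compare the two capability words quoted in the scans. */
void compare_scans(void)
{
    assert(!bss_requires_encryption(0x0001)); /* synchronization */
    assert(bss_requires_encryption(0x0011));  /* normal operation */
}
```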

Advanced DRM Mode-Setting API

I recently wrote a short How-To that introduces the linux DRM Mode-Setting API. It didn’t use any advanced techniques but I got several responses that it is a great introduction if you want to get started with linux DRM Mode-Setting. So I decided to go further and extend the examples to use double-buffering and vsync’ed page-flips.

The first extension that I wrote can be found here:

https://github.com/dvdhrm/docs/blob/master/drm-howto/modeset-double-buffered.c

It extends the old example to use two buffers so we no longer render into the front-buffer. It reduces a lot of tearing that you get when using the single-buffered example. However, it is still not perfect as you might swap the front and back buffer during a scanout-period and the display-controller will use the new buffer in the middle of the screen. So I extended this further to do the page-flip during a vertical-blank period using drmModePageFlip(). You can find this example here:

https://github.com/dvdhrm/docs/blob/master/drm-howto/modeset-vsync.c

This example also shows how to wait for page-flip events and integrate it into any select(), poll() or epoll based event-loop. Everything regarding page-flip-timing beyond that point depends on the use-cases and can get very hard to get right. I recommend reading Owen Taylor’s posts #1 and #2 on frame-timing for compositors.
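
The bookkeeping behind vsync’ed flips is small enough to model without any hardware. The following is a simplified sketch and not the code from the linked examples: it tracks which of two buffers is on screen and whether a flip is still in flight. In a real program, queue_flip() is where drmModePageFlip() would be called with DRM_MODE_PAGE_FLIP_EVENT, and flip_done() is what the page-flip handler invoked by drmHandleEvent() would do. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

struct flip_state {
    unsigned int front; /* index (0 or 1) of the buffer being scanned out */
    bool pending;       /* a flip was queued but has not completed yet */
};

/* The buffer that is not on screen. Only draw into it while no flip
 * is pending, because a queued buffer is about to be scanned out. */
unsigned int back_buffer(const struct flip_state *s)
{
    return s->front ^ 1;
}

/* Queue a flip to the back buffer. Returns false if one is in flight,
 * in which case the caller must wait for the flip event first. */
bool queue_flip(struct flip_state *s)
{
    if (s->pending)
        return false;
    s->pending = true;
    return true;
}

/* Called when the flip-complete event arrives: the back buffer is now
 * on screen, so the roles of the two buffers swap. */
void flip_done(struct flip_state *s)
{
    assert(s->pending);
    s->front ^= 1;
    s->pending = false;
}
```

An event loop would then render into back_buffer(), call queue_flip(), and poll() on the DRM file descriptor until the event arrives before drawing the next frame.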

There are still many more things like “DRM hardware-accelerated rendering”, “DRM planes/overlays/sprites”, “DRM flink/dmabuf buffer-passing” that I want to write How-Tos for. But time is short around Christmas so that’ll have to wait until next year.

Feedback is always welcome and you can find my email address in all the examples. Happy reading!

KMSCON Introduction

KMSCON is a KMS/DRM-based system console with an integrated terminal emulator. It was designed to replace the linux-kernel-console and virtual terminals (VTs). When run in default-mode, KMSCON allocates a virtual terminal and provides a terminal-emulator on it. It can thus be used as drop-in replacement for the linux-console and agetty. Compared to the linux-console, KMSCON provides a rich set of enhanced features including full-internationalized keyboard handling, full UTF8 input/font support, hardware-accelerated rendering, multi-seat support and more.

If run in listen-mode, KMSCON can also replace virtual terminals. This allows running multiple graphics-servers (like kmscon, X-Server, Wayland Compositors) simultaneously on all seats, not only on the default seat seat0.

Parts of this article expect the reader to be familiar with the term multi-seat. KMSCON runs on single-seat and multi-seat setups, but to understand the terms used here, you should read up on Wikipedia. On linux, the default seat is called seat0. This is also the primary seat for single-seat setups or if KMSCON is compiled without multi-seat support.

Default Mode

In default-mode KMSCON opens a virtual terminal (VT) to provide a terminal-emulator. It first checks whether the controlling terminal is a VT. If it’s not, it tries to find an unused VT via VT_OPENQRY. You can also specify a VT on the command line via --vt=/dev/tty5, or --vt=tty5, or --vt=5. This syntax allows backwards compatibility to most getty implementations.

After startup the VT is automatically activated and switched to (this can be prevented with --no-switchvt). You can use the standard linux keyboard shortcuts ctrl+alt+F1 to ctrl+alt+F12 to switch between your VTs. KMSCON integrates seamlessly into existing setups and can be used in parallel with X-Servers, Wayland-Compositors or the linux-console. Only one VT is occupied by KMSCON, so you can still use the linux-console on all other VTs. But you can also run a KMSCON process on each VT replacing the linux-console everywhere.

By default, KMSCON spawns /bin/login on the new terminal session. You can change this with the --login option. However, for security reasons it is recommended to keep the default. If a session exits, it is automatically restarted. To terminate KMSCON send SIGTERM to the main process or use a daemon manager like systemd.

KMSCON tries to be backwards-compatible with the linux-console in terms of user-interaction. However, the internal terminal-emulator is developed to be compatible with xterm. This is because the linux-console’s terminal-emulation layer is very limited and incompatible with many other terminal-emulators. On the other hand, xterm is the de-facto standard in terminal-emulation under linux and provides many enhanced control-sequences. Therefore, KMSCON tries to behave exactly like xterm does. This isn’t always as easy as it sounds so please file bug-reports if you encounter any incompatibilities.

With this knowledge we can now replace agetty. So instead of:

  /sbin/agetty --noclear tty5 38400 linux

we now run:

  /usr/bin/kmscon --vt=tty5 --no-switchvt

to replace the linux-console with KMSCON on VT-5. KMSCON also ships optional systemd service files in ./kmscon/docs/*.service

Keyboard Input

KMSCON uses libxkbcommon for keyboard control. This library implements huge parts of the X11 Keyboard Specification without any dependencies to X11 source-code. It was developed to allow other applications than X-Server to provide internationalized keyboard handling. It recently saw its first release after several years of development but is currently not available in all major linux distributions. However, many distributions already provide experimental packages for it.

With XKB support on board, KMSCON can provide the same keyboard handling as the X-Server does. That is, keyboards are configured via the RMLVO options (Rules Model Layout Variant Options). The most commonly used option is probably --xkb-layout=<language> to change the keyboard layout. All other options are mostly unused.

Any keyboard setups that can be used with X11 can also be used with xkbcommon. Simply create your X11 keyboard layouts or use one of the many existing ones and configure KMSCON to use it.

Video Devices

KMSCON got its name from the linux kernel Direct Rendering Manager (DRM) subsystem (not to be confused with DRM: Digital Rights Management). DRM drivers provide a common set of APIs to perform Mode-Setting, called Kernel Mode-Setting (KMS). This is the API that is used by X-Server to program video output. KMSCON uses the same API and thus provides mostly the same compatibility to GPU drivers as X-Server does.

But KMSCON also supports linux fbdev devices for backwards-compatibility. In fact, the modular interface makes it possible to provide drivers for all kinds of video hardware. However, it is recommended to write DRM drivers for your video-hardware instead of writing user-space drivers for KMSCON or X-Server, because this allows all user-space programs to use the DRM API to access any type of video output device without extra user-space support for each driver.

Right from the beginning KMSCON provided multiple-GPU support. This means, KMSCON automatically picks up all connected GPUs and provides a system console on them. You can even hotplug new DisplayLink GPUs and KMSCON detects them during runtime and instantly provides a console on them. Unfortunately, GPU detection isn’t as straightforward as one would think. On some systems there are GPUs that shouldn’t be touched by KMSCON. That’s why KMSCON provides the --gpu switch to change the GPU selection algorithm. A GPU blacklist configuration file or option is also on its way.

On fbdev and dumb-DRM devices, KMSCON uses 2D rendering without any hardware acceleration. But if KMSCON is compiled with full DRM video backends, it can use mesa3D’s OpenGLESv2 implementation to provide hardware-accelerated rendering. This is disabled by default and must be enabled explicitly at runtime with --hwaccel. It can speed up rendering by multiple orders of magnitude. However, on older systems it might even slow down rendering.

Seat Selection

By default, KMSCON runs on the seat that it was started on. Most of the time, this is the default seat seat0. However, you can run KMSCON on other seats with the --seats option. It takes a list of seat names that KMSCON will run on. Each seat is independent and you can either run one KMSCON process for each seat or you can even run a single KMSCON process on multiple seats. The latter will allow sharing of system-resources like font glyph-caches. However, if 3D-hardware-acceleration is enabled, unexpected latencies may occur in the DRM drivers, so for a decent user-experience you shouldn’t share a single KMSCON process across multiple seats.

On seat0 KMSCON can make use of VTs. However, on seats other than seat0 (or even on seat0 if VTs are disabled) linux does not provide VTs. Therefore, KMSCON has to run without them. This means, KMSCON will automatically activate itself and run unconditionally on this seat. Unfortunately this has the side-effect that no other graphics-server (like X-Server) can run on this seat. Without the synchronization API of VTs there is no way to dispatch the graphics/input devices between multiple graphics servers. Therefore, most users will prefer running an X-Server on those seats. However, if no X11 is needed, KMSCON runs perfectly fine on those seats, too.

Listen-Mode

In default-mode KMSCON runs only on seats that are available during startup and terminates after all seats it runs on are either gone or their controlling VT hung up. This is perfect for VTs on seat0 but it requires another process to start KMSCON for each new seat showing up.

In listen-mode KMSCON runs as a daemon waiting for new seats to show up. It automatically spawns a new terminal session on each seat matching the --seats option. KMSCON will not exit until SIGTERM is caught. Note that KMSCON ignores seats with VTs (normally only seat0) in this mode as you should run KMSCON in default-mode on those seats.

VT Replacement

The linux-kernel-console and VTs are tightly bound to each other. There is only one kernel configuration option to control them both: CONFIG_VT. KMSCON can already replace the linux-kernel-console but to disable CONFIG_VT, we also need a proper replacement for virtual terminals. As previously noted, KMSCON can run smoothly without VTs, however, this means that you can only run a single graphics-server on a seat and most users will probably choose X-Server here (and I would do so, too).

Fortunately, KMSCON already provides a solution to this. The --cdev-session option provides fake VTs for all seats that do not have real VTs. This makes it possible to run X-Server, KMSCON, a Wayland Compositor and any other graphics-server simultaneously without changing a single line of code. However, this is beyond the scope of this introduction. Stay tuned for a follow-up post that introduces KMSCON session support and cdev-sessions in more detail.

Current State

KMSCON works! No special setup, no special system requirements and no weird dependencies are needed. You can check out KMSCON on any major linux distribution. Right now. What you need is:

  • libxkbcommon: This library has no external dependencies and can easily be installed on any distribution. Many distributions even provide experimental packages for it. See xkbcommon.org.
  • libudev: Udev is used for hotplugging and device detection. Every major distribution provides it.
  • kmscon: You can get KMSCON from github.

Everything else is optional! For proper font-rendering you also need pango or freetype2. For DRM video output you need libdrm. For 3D-hardware-acceleration mesa3D is needed. For multi-seat support systemd-logind is used.

However, please note that KMSCON is still experimental. I do my best to keep backwards-compatibility and fix all bugs that are reported. But I am unable to test KMSCON on all imaginable setups. So please report all bugs to github.com/dvdhrm/kmscon and I promise to have a look at it.

Furthermore, the TODO list grows steadily and I still have lots of ideas how to improve KMSCON. New ideas are very welcome, but please bear with me if it takes more time than expected to implement them. There is probably lots of stuff with higher priority on my list. But patches are highly appreciated even if they implement low-priority features.

There is a man-page kmscon(1) that explains many concepts and command-line arguments right away. But do not hesitate to contact me personally if you have any questions about KMSCON. And for all developers: I am giving a presentation about deprecating CONFIG_VT and KMSCON during FOSDEM-13 in Brussels next year. Hope to see you there!