Tag Archives: C

memfd_create(2)

For 4 months now we’ve been hacking on a new syscall for the linux-kernel, called memfd_create. The intention is to provide an easy way to get a file-descriptor for anonymous memory, without requiring a local tmpfs mount-point. The syscall takes 2 parameters, a name and a bunch of flags (which I will not discuss here):

int memfd_create(const char *name, unsigned int flags);

If successful, a new file-descriptor pointing to a freshly allocated memory-backed file is returned. That file is a regular file in a kernel-internal filesystem. Therefore, most filesystem operations are supported, including:

  • ftruncate(2) to change the file size
  • read(2), write(2) and all its derivatives to inspect/modify file contents
  • mmap(2) to get a direct memory-mapping
  • dup(2) to duplicate file-descriptors

Theoretically, you could achieve similar behavior without introducing new syscalls, like this:

int fd = open("/tmp/random_file_name", O_RDWR | O_CREAT | O_EXCL, S_IRWXU);
unlink("/tmp/random_file_name");

or this

int fd = shm_open("/random_file_name", O_RDWR | O_CREAT | O_EXCL, S_IRWXU);
shm_unlink("/random_file_name");

or this

int fd = open("/tmp", O_RDWR | O_TMPFILE | O_EXCL, S_IRWXU);

Therefore, the most important question is why the hell do we need a third way?

Two crucial differences are:

  • memfd_create does not require a local mount-point. It can create objects that are not associated with any filesystem and can never be linked into a filesystem. The backing memory is anonymous memory as if malloc(3) had returned a file-descriptor instead of a pointer. Note that even shm_open(3) requires /dev/shm to be a tmpfs-mount. Furthermore, the backing-memory is accounted to the process that owns the file and is not subject to mount-quotas.
  • There are no name-clashes and no global registry. You can create multiple files with the same name and they will all be separate, independent files. Therefore, the name is purely for debugging purposes so it can be detected in task-dumps or the like.

To be honest, the code required for memfd_create is 100 lines. It didn’t take us 2 months to write these, but instead we added one more feature to memfd_create called Sealing:

File-Sealing

File-Sealing is used to prevent a specific set of operations on a file. For example, after you wrote data into a file you can seal it against further writes. Any attempt to write to the file will fail with EPERM. Reading will still be possible, though. The crux of this matter is that seals can never be removed, only added. This guarantees that if a specific seal is set, the information that is protected by that seal is immutable until the object is destroyed.

To retrieve the current set of seals on a file, you use fcntl(2):

int seals = fcntl(fd, F_GET_SEALS);

This returns a signed 32bit integer containing the bitmask of currently set seals on fd. Note that seals are per file, not per file-descriptor (nor per file-description). That means, any file-descriptor for the same underlying inode will share the same seals.

To seal a file, you use fcntl(2) again:

int error = fcntl(fd, F_ADD_SEALS, new_seals);

This takes a bitmask of seals in new_seals and adds these to the current set of seals on fd.

The current set of supported seals is:

  • F_SEAL_SEAL: This seal prevents the seal-operation itself. So once F_SEAL_SEAL is set, any attempt to add new seals via F_ADD_SEALS will fail. Files that don’t support sealing are initially sealed with just this flag. Hence, no other seals can ever be set and thus do not have to be enforced.
  • F_SEAL_WRITE: This is the most straightforward seal. It prevents any content modifications once it is set. Any write(2) call will fail and you cannot get any shared, writable mappings for the file, anymore. Unlike the other seals, you can only set this seal if no shared, writable mappings exist at the time of sealing.
  • F_SEAL_SHRINK: Once set, the file cannot be reduced in size. This means, O_TRUNC, ftruncate(), fallocate(FALLOC_FL_PUNCH_HOLE) and friends will be rejected in case they would shrink the file.
  • F_SEAL_GROW: Once set, the file size cannot be increased. Any write(2) beyond file-boundaries, any ftruncate(2) that increases the file size, and any similar operation that grows the file will be rejected.

Instead of discussing the behavior of each seal on its own, the following list shows some examples how they can be used. Note that most seals are enforced somewhere low-level in the kernel, instead of directly in the syscall handlers. Therefore, side effects of syscalls I didn’t cover here are still accounted for and the syscalls will fail if they violate any seals.

  • IPC: Imagine you want to pass data between two processes that do not trust each other. That is, there is no hierarchy at all between them and they operate on the same level. The easiest way to achieve this is a pipe, obviously. However, to allow zero-copy (assuming splice(2) is not possible) the processes might decide to use memfd_create to create a shared memory object and pass the file-descriptor to the remote process. Now zero-copy only makes sense if the receiver can parse the data in-line. However, this is not possible in zero-trust scenarios as the source can retain a file-descriptor and modify the contents while the receiver parses it, causing any kinds of failure. But if the receiver requires the object to be sealed with F_SEAL_WRITE | F_SEAL_SHRINK, it can safely mmap(2) the file and parse it inline. No attacker can alter file contents, anymore. Furthermore, this also allows safe mutlicasts of the message and all receivers can parse the same zero-copy file without affecting each other. Obviously, the file can never be modified again and is a one-shot object. But this is inherent to zero-trust scenarios. We did implement a recycle-operation in case you’re the last user of an object. However, that was dropped due to horrible races in the kernel. It might reoccur in the future, though.
  • Graphics-Servers: This is a very specific use-case of IPC and usually there is a one-way trust relationship from clients to servers. However, a server cannot blindly trust its clients. So imagine a client renders its window-contents into memory and passes a file-descriptor to that memory region (maybe using memfd_create) to the server. Similar to the previous scenario, the server cannot mmap(2) that object for read-access as the client might truncate the file simultaneously, causing SIGBUS on the server. A server can protect itself via SIGBUS-handlers, but sealing is a much simpler way. By requiring F_SEAL_SHRINK, the server can be sure, the file will never shrink. At the same time, the client can still grow the object in case it needs bigger buffers for growing windows. Furthermore, writing is still allowed so the object can be re-used for the next frame.

As you might imagine, there are a lot more interesting use-cases. However, note that sealing is currently limited to objects created via memfd_create with the MFD_ALLOW_SEALING flag. This is a precaution to make sure we don’t break existing setups. However, changing seals of a file requires WRITE-access, thus it is rather unlikely that sealing would allow attacks that are not already possible with mandatory POSIX locks or similar. Hence, it is possible that sealing will expand to other areas in case people request it. Further seal-types are also possible.

Current Status

As of June 2014 the patches for memfd_create and sealing have been publicly available for at least 2 months and are considered for merging. linux-3.16 will probably not include it, but linux-3.17 very likely will. Currently, there’s still some issues to be figured out regarding AIO and Direct-IO races. But other than that, we’re good to go.

Advertisements

Linking binary data

Took me half a weak to figure this out, but here is a quick description how to link your binary data into a library/executable:

We want this to:

  • work with GNU-ld
  • work with GNU-gold
  • work with GNU-libtool (if required)
  • work with cross-compiling
  • work with LLVM
  • not require any external non-standard tools

For now lets assume we have a file src/mydata.bin that contains your binary data that you want to link into your executable as a huge C-array.

With GNU-autotools

If you use GNU-autotools, add the following rule to your Makefile.am:

CLEANFILES += src/*.bin.lo src/*.bin.o

src/%.bin.lo: src/%.bin
     $(AM_V_GEN)$(LD) -r -o "src/$*.bin.o" -z noexecstack --format=binary "$<"
     $(AM_V_at)$(OBJCOPY) --rename-section .data=.rodata,alloc,load,readonly,data,contents "src/$*.bin.o
     $(AM_V_at)echo "# $@ - a libtool object file" >"$@"
     $(AM_V_at)echo "# Generated by $(shell $(LIBTOOL) --version | head -n 1)" >>"$@"
     $(AM_V_at)echo "#" >>"$@"
     $(AM_V_at)echo "# Please DO NOT delete this file!" >>"$@"
     $(AM_V_at)echo "# It is necessary for linking the library." >>"$@"
     $(AM_V_at)echo >>"$@"
     $(AM_V_at)echo "# Name of the PIC object." >>"$@"
     $(AM_V_at)echo "pic_object='$*.bin.o'" >>"$@"
     $(AM_V_at)echo >>"$@"
     $(AM_V_at)echo "# Name of the non-PIC object" >>"$@"
     $(AM_V_at)echo "non_pic_object='$*.bin.o'" >>"$@"
     $(AM_V_at)echo >>"$@"

Now you can simply use src/mydata.bin.lo as libtool-object file in any _LIBADD automake variable. This rule causes $(LD) to link the binary into the object file src/mydata.bin.o (via partial-linking/relocatables). After that we create the corresponding libtool object file src/mydata.bin.lo. Note that the comments in the libtool file are mandatory. You must not remove them!

The objcopy line causes the .data section to be renamed to .rodata and marked as read-only. You can drop it if you require write access.

Without GNU-autotools

Obviously, this is much easier. You can simply use:

$(LD) -r -o "src/mydata.bin.o" -z noexecstack --format=binary "src/mydata.bin"
$(OBJCOPY) --rename-section .data=.rodata,alloc,load,readonly,data,contents "src/mydata.bin.o"

This produces src/mydata.bin.o which you can then link like any other object.

That’s already it!

You can now access the binary data via:

extern const char _binary_<path>_start[];
extern const char _binary_<path>_end[];

In our case this would be:

extern const char _binary_src_mydata_bin_start[];
extern const char _binary_src_mydata_bin_end[];

The linker automatically creates these symbols. It also creates *_size symbols but I never had to use these. The linker also alignes the start address correctly, but the content isn’t touched, of course.

The data is placed in .data by default and marked as read/write. The objcopy command renames it to .rodata and marks it as read-only. Drop it if you require write access.

If you think you know another way to do this, please let me know! I have tried lots of other ways and they all fail in some subtle way:

  • Converting it into a C-source file with a static array. Problem: It works but the C-compiler might take considerably longer to compile the sources. On my machine a 2MB array file took >10min to compile.
  • Using objcopy to convert binary files into object files. Problem: This requires the target architecture as -B command. Unfortunately, it has some weird format and doesn’t accept the standard GNU $host_cpu kind of strings. For instance it requires i386:x86-64 instead of x86_64. I haven’t found a way to get these BFD strings reliably via autoconf.
  • Use objcopy without -B to produce architecture=UNKNOWN object files and link via –accept-unknown-input-arch with GNU-ldProblem: This doesn’t work with GNU-gold as it doesn’t support this command-line option.
  • Avoid partial-linking and add this to your LDFLAGS: -Wl,--format=binary -Wl,src/mydata.bin -Wl,--format=default Problem: This marks the file as requiring an executable-stack. You can fix this with -Wl,-z,noexecstack but that forces all other linked files to also not require an executable stack. And also, GNU-gold doesn’t support --format=default, it requires --format=elf which itself isn’t supported by GNU-ld. Hilarious, right?

I also got some other hints to use GNU-as or other nice workarounds. However, I wanted a solution that uses the tools that were created for that task. I mean, why should I create an assembler or C source file to link binary data? Seriously, the only reason to do this is to work around crappy binutils interfaces. I want ld and objcopy to do the task and if they can’t, then we should fix them!