RSoC: improving drivers and kernel - part 7

By 4lDO2 on

Introduction

In my last blog post, I introduced the userspace_fexec/userspace_clone features. As the names suggest, they move the inherently complex implementations of fork(3) and execve(2) from the kernel into relibc, giving userspace much more freedom while simplifying the kernel. There has been considerable progress since the last post: the features userspace_fexec/userspace_clone, userspace_initfs, and userspace_cwd have now all been merged!

RMM

After having thoroughly debugged the orbital/orblogin memory corruption bug with little success, I decided to go as far as phasing out the old paging code (ActivePageTable/InactivePageTable/Mapper etc.) in favor of RMM (Redox Memory Manager). Surprisingly, this fixed the bug entirely. It turns out the issue was simply that parent page tables were not properly unmapped (causing use-after-free), most likely due to the coexistence of RMM and the old paging code, which did not agree on how the number of page table entries was counted.

Userspace initfs

I mentioned moving initfs to userspace as a TODO in the last post. The changes required were very simple: rather than having the bootloader pass a physical address range containing the initfs image, and then letting the kernel load bootstrap from within the filesystem, the kernel now simply loads a “bootstrap/initfs blob” into (virtual) memory at 0x0, and jumps to an address provided by the bootloader. The bootloader loads /kernel, /bootstrap, and /initfs, the latter two of which are placed adjacently in the physical address space.

This also means bootstrap now forks, with one side becoming a scheme handler serving initfs: from the initfs memory, and the other executing init.

Userspace cwd and userspace path canonicalization

Redox used to expose two system calls, chdir and getcwd, which set and get the current working directory (identical to POSIX); moving them to userspace was also a TODO from the last post. chdir would modify an internal cwd string in each kernel context, used for canonicalizing paths while handling path-based syscalls (open, chmod (now removed in favor of fchmod), unlink, and rmdir), allowing e.g. open("./foo", O_RDONLY) => open("file:/path/to/foo", O_RDONLY). Now that userspace_cwd is merged, however, the kernel only allows already-canonicalized paths, i.e. it enforces that both the scheme name and path are present. Hence, relibc instead canonicalizes the paths itself, and chdir/getcwd are implemented simply by accessing a global variable (although sigprocmask is run before and after). The value of this global variable is passed across execve using auxiliary vectors.
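To make the canonicalization step concrete, here is a minimal sketch of what resolving a path against a userspace cwd could look like. This is not relibc's actual code: the function name canonicalize_using_cwd and the exact rules (a path containing ':' is treated as already naming a scheme, and a leading '/' keeps the cwd's scheme) are assumptions made for illustration.

```rust
/// Hypothetical helper, not relibc's real implementation.
/// `cwd` must already be canonical, i.e. of the form "scheme:/abs/path".
/// Returns None if `cwd` has no scheme separator.
fn canonicalize_using_cwd(cwd: &str, path: &str) -> Option<String> {
    // Split the canonical cwd into its scheme and its absolute path.
    let (cwd_scheme, cwd_path) = cwd.split_once(':')?;

    // Pick the scheme and the base directory the path is resolved against.
    let (scheme, base, rest) = if let Some((scheme, rest)) = path.split_once(':') {
        // Assumption: a ':' means the caller already named a scheme.
        (scheme, "", rest)
    } else if path.starts_with('/') {
        // Absolute path: stay in the cwd's scheme, ignore the cwd's path.
        (cwd_scheme, "", path)
    } else {
        // Relative path: resolve against the full cwd.
        (cwd_scheme, cwd_path, path)
    };

    // Normalize "." and ".." segments; ".." at the root is simply dropped.
    let mut segments: Vec<&str> = Vec::new();
    for seg in base.split('/').chain(rest.split('/')) {
        match seg {
            "" | "." => {}
            ".." => {
                segments.pop();
            }
            other => segments.push(other),
        }
    }

    Some(format!("{}:/{}", scheme, segments.join("/")))
}
```

With a cwd of "file:/path/to", the relative path "./foo" becomes "file:/path/to/foo", which is the form the kernel now requires.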

But most importantly, the SYS_OPEN handler in the kernel no longer resolves cross-scheme symlinks (i.e. handles EXDEV internally); this has also been moved to relibc. Besides reducing the number of file operations initiated from the kernel, this reduces the amount of state needed for syscall handlers, which will be very helpful for a possible syscall multiplexing API (userspace-to-kernel io_uring).
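As a rough illustration of resolving cross-scheme symlinks in userspace, the sketch below retries an open whenever the underlying call reports a cross-scheme link. Everything here is hypothetical: OpenError, MAX_LINK_DEPTH, and open_resolving_links are invented names, and the error variant carries the link target directly, whereas a real implementation would have to read the link target itself after seeing EXDEV.

```rust
/// Stand-in error type for this sketch (not a real relibc or kernel type).
#[derive(Debug, PartialEq)]
enum OpenError {
    /// Models EXDEV: the path is a symlink pointing into another scheme.
    /// Simplification: the canonical target is carried in the error.
    CrossScheme(String),
    NotFound,
    TooManyLinks,
}

/// Hypothetical bound on how many links are followed before giving up.
const MAX_LINK_DEPTH: usize = 8;

/// Calls `raw_open` (a stand-in for the raw open syscall) and follows
/// cross-scheme links in userspace until a file descriptor or a
/// different error is returned.
fn open_resolving_links<F>(path: &str, mut raw_open: F) -> Result<usize, OpenError>
where
    F: FnMut(&str) -> Result<usize, OpenError>,
{
    let mut current = path.to_string();
    for _ in 0..MAX_LINK_DEPTH {
        match raw_open(&current) {
            // The scheme refused to follow the link: retry at the target.
            Err(OpenError::CrossScheme(target)) => current = target,
            // Success or any other error is passed straight to the caller.
            other => return other,
        }
    }
    Err(OpenError::TooManyLinks)
}
```

The point of the design is that the retry loop and its state (the current path, the depth counter) live entirely in the caller, so the kernel-side handler only ever sees a single open request at a time.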

Hopefully at some point, most if not all syscalls on Redox will be fully completion-based, i.e. the caller sends a request, waits (if blocking), and then asynchronously runs completion code (either returning from a blocking syscall, or in the future pushing an io_uring CQE). In the process, the kernel may become “stackless”, i.e. use the same kernel stack for all processes, and thereby reduce the memory footprint of contexts (threads) by an order of magnitude.

TODO

Luckily, both the userspace initfs TODO and the orbital/orblogin bug fix have been finished!

On-demand paging is not yet implemented, even though I have written a large part of it. This would allow optimizing ld.so and relibc’s execve, by calling mmap with CoW in order to load ELF segments, which is especially important when running rustc on Redox.

The syscall interface has not been reworked either, but there is a clear need for overcoming the limitations of Packet (such as being limited to 4 args per scheme op), so the io_uring SQE format will very likely be used by scheme handlers soon, with or without using fancy ring buffers for passing them.

PTRACE_EVENT_CLONE is now sent to tracers, although some work is still needed there, and the acid ptrace test should be re-introduced (this was an issue before userspace_fexec too).

I also tried implementing (basic) KASLR as a side project, and succeeded, although with a visible performance cost (not sure why). It would also be interesting to implement regular userspace ASLR in the relibc fexec handler and possibly ld.so.