RSoC: improving drivers and kernel - part 6
By 4lDO2 on
Introduction
So far it’s been two weeks since my third RSoC began, where so far I have mainly worked on moving large parts of kernel code, which deal with process management, to userspace. In this blog post I will try summarizing what has been accomplished so far, and exciting things I have started but not finished.
userspace_fexec
(and later userspace_clone
)
(This feature is not yet merged, but available on my respective
userspace_fexec
branches in kernel
, syscall
and relibc
.)
Currently, the kernel has many syscalls related to process management as would
be expected in a Unix-like system, such as fexec
, clone
, kill
, waitpid
,
exit
. Of the first two, fexec
inherently is not as general as it could be,
and while the reasons to allow other binfmts than ELF in the kernel may be few,
moving it to userspace simplifies the kernel and if implemented correctly does
not impair performance or functionality. clone
is not as complex as fexec
,
but still calls fmap
to re-obtain grants from the kernel.
The new implementation is in (my fork of) relibc, in execve
, fork
(does
what clone without CLONE_VM
used to do) and pte_clone
(for threads, and
does what clone with CLONE_VM
used to do). The execve
implementation is
shared with escalated
(a daemon run as root which is used to implement
setuid/setgid) and uses interfaces from the proc:
scheme (also used by
ptrace) to change e.g. the process name, signal stack, address space, and file
table in the case of clone
.
If the kernel no longer understands ELF (well, it still uses goblin
for
resolving symbols when printing backtraces within the kernel), then it must
obviously somehow load init
. The way I implemented it was to add a new
binary, named bootstrap
, which starts extremely simple. The kernel simply
loads initfs:bin/bootstrap
as read+write+execute into address 0x0 and jumps
to a fixed offset. A tiny stub is written in assembly, which sets up an
environment for relibc and calls the normal _start
entry point it provides.
As soon as it relibc calls back into main()
, it runs the userspace
implementation execve("initfs:bin/init", empty_args, kernel_envs)
, and init
continues as usual.
While syscall::process
has shrunk by approximately 1500 lines, scheme::proc
has gotten more powerful interfaces, including the ability to switch processes'
address spaces (used by fork
and execve
), switching file tables (used by
fork
), transferring grants between address spaces (used by fork
), and
setting uid/gid if you’re root (used by escalated
). The page table handling
code has also been partially refactored so that threads now share the entire
address space, and not simply what was previously the grant area. Some syscalls
may also be replaced by the proc:
scheme, for example chdir
and getcwd
.
TODO
Just because there is no longer io_uring
in the title does not mean it has
been abandoned! However, last summer, optimization of the nvmed driver using
io_uring turned out to be harder than expected, pointing out that there is room
for many other optimizations in the kernel, and perhaps that
userspace-to-userspace as I mentioned in the io_uring RFC might be preferable
for such situations, more than a syscall-multiplexing kernel would be (it could
also have something to do with how it was benchmarked). And, user threads for
any given process have temporarily been placed all on the same hardware
thread.
Meanwhile, I have also during the year been working a bit on a runqueue-based
O(1)
scheduler
(O(n) with respect to the number of timers though).
That said, with the introduction of file descriptor
forwarding,
and the possibilities for sandboxing that follows, the current syscall
interface may soon be reworked. For example, openat
may allow opening new
files from existing files even for processes in the null namespace, and there
is an existing limitation that syscalls handled by schemes can only use up to 4
arguments. For that, the kernel-to-userspace io_urings, also mentioned before,
can replace the current packet-based API with a ring-buffer interface (probably
the same as is already implemented in io_uring
) that would offer lower
latency. In that case, a potential syscall-multiplexing (as I halfway
implemented it last summer) kernel would also reduce complexity from 2x2
(syscalls initiated either from io_uring or blocking, and handled either by
packets or io_uring) to 2x1 (client is blocking/io_uring, scheme only uses
io_uring for handling requests).
The most exciting thing the new AddrSpace
refactor will simplify, is
implementing on-demand paging. First, page tables are now always locked upon
access, and, all userspace virtual memory allocations now occur via Grant
. In
the best-case scenario, and if I don’t prioritize drivers/io_uring/scheduler
instead, I’ll be able to implement on-demand paging before the end of this summer.
And of course, while both the userspace replacements of fexec
and clone
, as
I implemented them now, are capable of making the OS boot all the way to
desktop (and even implements setuid/setgid securely (in
theory)), there are some
things which still need to be finished. PTRACE_EVENT_CLONE
is no longer
generated (but could be, since creating a process via proc:new
still
(currently) copies e.g. uid from the caller (due to the lack of an interface to
set uid for other processes unless you’re root). vfork
is no longer supported
either (in clone and exec, but remains in exit), but could be implemented in
userspace as well and requires no complex kernel interface. And, there is
memory corruption (only) in orblogin
and background
, but (1) it may be
unrelated given that the new Grant “allocator” now reuses addresses more often
than before, which debugging suggests has something to do with it, and (2) it
was not present before userspace clone, which should make it easier to debug.
The kernel initfs implementation, which very recently was rewritten to use a
proper filesystem format (as opposed to a source-level hack that required
recompilation every time initfs had to be changed), can also be moved to
userspace, if the kernel loads the raw initfs slice rather than loading
initfs:bin/bootstrap
.