Significant performance and correctness improvements to the kernel
By Jacob Lorentzon (4lDO2)
Introduction
Ever since my demand paging RSoC project was completed last summer, there have been numerous additional incremental improvements to the kernel, including fixes, simplifications, and some significant optimizations. The changes can be divided into the “correctness” and “performance” categories, even though there is a relatively large overlap.
Correctness
Physalloc replacement and complete frame bookkeeping
Although the demand paging MR formalized the different types of grants, which previously merely indicated where user memory areas were, and gave frames proper bookkeeping such as refcounts, that was very much not the case for physallocated frames. Since allocating them required root, they were simply treated as “borrowed physical memory” (PhysBorrowed), a type intended for drivers mapping device memory or other pages not controlled by the frame allocator. As a result, some checks had to be worked around, and driver bugs could cause kernel memory corruption.
This was fixed in the phys_contiguous mmap MR, which replaces the physalloc syscalls with a special memory scheme file, /scheme/memory/phys_contiguous?<memory type>, used together with regular mmap instead. As a result, deallocation is fully automatic, and most importantly, the kernel will now deny any attempt to map allocator-owned memory as “physically borrowed”! This should protect against e.g. faulty ACPI AML code, but if this rule turns out to be too restrictive, it only takes one line of code to disable.
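A minimal sketch of what driver-side usage might look like, assuming the POSIX-style open/mmap exposed by relibc; the mmap flags are illustrative (and the memory-type query parameter is omitted), not taken from an actual driver:

```rust
fn alloc_dma_buffer(len: usize) -> *mut u8 {
    unsafe {
        // Open the special memory scheme file instead of calling physalloc.
        let path = std::ffi::CString::new("/scheme/memory/phys_contiguous").unwrap();
        let fd = libc::open(path.as_ptr(), libc::O_RDWR);
        assert!(fd >= 0, "failed to open scheme file");

        // A regular mmap on this file hands back physically contiguous,
        // allocator-owned frames; unmapping (or process exit) frees them
        // automatically, unlike the old physalloc syscalls.
        let ptr = libc::mmap(
            std::ptr::null_mut(),
            len,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_SHARED,
            fd,
            0,
        );
        assert_ne!(ptr, libc::MAP_FAILED);
        libc::close(fd); // the mapping outlives the file descriptor
        ptr.cast()
    }
}
```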
TLB shootdown
Virtually all CPUs supporting virtual memory have a Translation Lookaside Buffer (TLB) to cache commonly used mappings. This makes page unmaps in multithreaded programs more complex: before the frames can be freed, the kernel must ensure that every other CPU which may have cached the mapping has flushed it from its TLB, a mechanism called TLB shootdown. It involves interrupting the other CPUs, in a synchronized way, to get them to flush their TLBs. The signals MR described below initially did not work due to double frees elsewhere, caused by the lack of TLB shootdown. This has now been fixed.
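In essence, the unmapping CPU must not release the frames until every other CPU has acknowledged the flush. A minimal sketch of that handshake, with stand-in helpers (send_ipi, invlpg) rather than the actual Redox kernel API:

```rust
use core::sync::atomic::{AtomicUsize, Ordering};

type CpuId = usize;
type VirtAddr = usize;

static PENDING_ACKS: AtomicUsize = AtomicUsize::new(0);

fn send_ipi(_cpu: CpuId) { /* architecture-specific IPI delivery */ }
fn invlpg(_page: VirtAddr) { /* flush one TLB entry on this CPU */ }

// Initiator: runs on the CPU that just unmapped the page. The page address
// would in practice be passed along with the IPI.
fn tlb_shootdown(other_cpus: &[CpuId], _page: VirtAddr) {
    PENDING_ACKS.store(other_cpus.len(), Ordering::SeqCst);
    for &cpu in other_cpus {
        send_ipi(cpu);
    }
    // Only once every CPU has flushed may the frames be freed; otherwise
    // another CPU could still access them through a stale mapping.
    while PENDING_ACKS.load(Ordering::SeqCst) != 0 {
        core::hint::spin_loop();
    }
}

// Responder: runs in each interrupted CPU's IPI handler.
fn on_shootdown_ipi(page: VirtAddr) {
    invlpg(page);
    PENDING_ACKS.fetch_sub(1, Ordering::SeqCst);
}
```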
Improved signal handling
The previous signal code was problematic, in that it simply implemented userspace-like signals inside the kernel as well, by copying and restoring the kernel stacks. This meant that a Ctrl-C interrupt could cause a thread to exit() before running destructors, for example resulting in a bug where Orbital windows were not properly closed.
This was fixed in the signals MR. Some schemes need to implement scheme call cancellation, however, before interrupted syscalls can fully work.
Misc
Additionally, bjorn3 has fixed a lot of UB and other bugs in the kernel, including the part that bootstraps into userspace, and has deduplicated most i686 and x86_64 code. The userspace-bootstrap change makes it easier to add support for other bootloaders, such as GRUB, by integrating the bootstrap executable into the initfs image. (Bootstrap is responsible for setting up a stack, opening standard file descriptors, creating the environment necessary to start the relibc userspace, and executing init.)
An important fix as part of that is that the initfs image, including bootstrap, is no longer mapped at address 0x0. This proves once again how important it is never to create a Rust reference to NULL, which in this case confused bootstrap into treating a Some(&initfs_memory) as None, as the slice pointed to NULL!
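This happens because Rust guarantees that references are never null, and Option<&T> exploits that niche to encode None as the null pointer. A small illustration of the layout guarantee behind the bug:

```rust
fn main() {
    // Option<&[u8]> is no larger than &[u8]: None is encoded by storing
    // a null data pointer in the reference's slot.
    assert_eq!(
        core::mem::size_of::<Option<&[u8]>>(),
        core::mem::size_of::<&[u8]>(),
    );
    // Consequently, a (UB) reference that actually points at address 0
    // is indistinguishable from None, which is exactly what confused
    // bootstrap when initfs was mapped at 0x0.
}
```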
Performance
Kernel profiling
Normally, when optimizing userspace applications, profiling can be done easily using e.g. perf record or an integrated tool such as cargo flamegraph. When profiling a kernel, however, the sampling part needs to be implemented from scratch. The current implementation uses non-maskable interrupt (NMI) IPIs to interrupt other CPUs at scheduled intervals, which is necessary since interrupts are disabled almost everywhere in kernel mode. The NMI handler tracks whether the CPU was interrupted in user or kernel mode, and in the latter case, performs a stack trace and sends it to a ring buffer, which a new daemon, profiled, reads from and extracts into a perf-compatible text format.
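A condensed sketch of that sampling flow; the ring buffer type and the backtrace helper are stand-ins, not the actual kernel symbols:

```rust
const MAX_DEPTH: usize = 64;

struct Sample {
    frames: [usize; MAX_DEPTH], // return addresses from the stack walk
    len: usize,
}

struct RingBuffer { /* fixed-size per-CPU buffer, details omitted */ }
impl RingBuffer {
    fn push(&mut self, _sample: Sample) { /* overwrite-oldest semantics */ }
}

fn backtrace(_out: &mut Sample) { /* frame-pointer walk, details omitted */ }

// Called on each CPU when the profiling NMI IPI arrives.
fn on_profiling_nmi(interrupted_in_kernel: bool, ring: &mut RingBuffer) {
    // Only kernel-mode samples are collected for now; userspace profiling
    // is not yet implemented.
    if !interrupted_in_kernel {
        return;
    }
    let mut sample = Sample { frames: [0; MAX_DEPTH], len: 0 };
    backtrace(&mut sample);
    // The `profiled` daemon later drains the ring buffer and converts the
    // traces into a perf-compatible text format.
    ring.push(sample);
}
```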
Userspace profiling is yet to be implemented, possibly as combined userspace and kernel profiling, possibly on its own.
With profiling enabled, awesome flamegraphs can be produced using inferno! The following is what booting and starting NetSurf looked like at the time profiling was implemented:
(Click on the flamegraphs to open them interactively.)
As the flamegraph suggests, a very significant part of the time spent in the kernel is spent allocating frames, in rmm::BuddyAllocator. In fact, in some cases the massive performance boost from demand paging may have been amplified in part because frame allocation was simply very slow. But even with demand paging, frame allocation accounted for roughly 35% of total boot time (in the kernel, after the arch-specific initialization):
New p2buddy frame allocator
The obvious optimization target, apart from context switching (which at the time was blocked by the problematic signal handling code), was therefore the frame allocator.
Originally, pre-2020, the Redox kernel used a recycling allocator on top of a bump allocator, which essentially stored an array (a Vec) of (base, frame count) entries that could be merged or split. In 2020, the Redox Memory Manager (RMM) was introduced, which provided a more optimized buddy allocator.
However, both the recycling allocator and RMM’s buddy allocator were O(n) w.r.t. the total number of allocatable frames. The recycling allocator had to iterate through an array that could theoretically grow to half of the allocatable frames (maximum fragmentation), whereas the RMM buddy allocator had to iterate through usage arrays to find a sufficient number of contiguous frames, without an optimized way to allocate in larger units. Since the majority of frame allocations were single-frame, as frames had to be allocatable and deallocatable independently, even the RMM buddy allocator turned out to be relatively slow.
Although the RMM buddy allocator could probably have been optimized further, it would be advantageous to reuse the already existing PageInfos, which store the refcount of userspace-owned pages to implement both CoW MAP_PRIVATE and MAP_SHARED mmaps. That space is not used when the frame is free, so why not use it as the frame allocator data structure?
The power-of-two buddy (p2buddy) allocator uses the two words in each PageInfo to implement a doubly-linked list per order, i.e. the log2 of the number of frames in an allocation, and stores that order in the unused pointer bits. One bit is used to distinguish between used and free frames, and must not be overwritten except by the allocator. The list heads are stored as a global array. This p2buddy allocator is similar to the buddy allocators found in other kernels, such as Linux and FreeBSD.
The allocator now only supports power-of-two allocation sizes, called p2frames, where the orders can currently range from 0 to 10, inclusive. It looks for the smallest free p2frame large enough for the allocation, possibly splitting it into smaller p2frames. Freeing frames makes it look for free neighboring p2frames, merging them into a higher-order p2frame if possible. This algorithm is guaranteed to take O(k) steps w.r.t. the (constant) number of orders k. Thus, the new allocator is O(1) w.r.t. the total number of frames available.
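The following is a simplified, self-contained sketch of the scheme, not the actual kernel code: here the order and free bit live in explicit fields rather than in unused pointer bits, and frame numbers replace pointers, but the per-order intrusive free lists and the split/merge logic follow the same idea.

```rust
const MAX_ORDER: usize = 10; // orders 0..=10, i.e. 1..=1024 frames
const NONE: usize = usize::MAX;

// Stand-in for the kernel's PageInfo: two link words, plus the order and
// the used/free bit (packed into unused pointer bits in the real allocator).
#[derive(Clone, Copy)]
struct PageInfo {
    prev: usize,
    next: usize,
    order: u8,
    free: bool,
}

struct P2Buddy {
    pages: Vec<PageInfo>,          // one entry per allocatable frame
    heads: [usize; MAX_ORDER + 1], // free-list head per order
}

impl P2Buddy {
    fn new(frames: usize) -> Self {
        let unused = PageInfo { prev: NONE, next: NONE, order: 0, free: false };
        let mut this = Self { pages: vec![unused; frames], heads: [NONE; MAX_ORDER + 1] };
        // Seed the free lists with maximal aligned p2frames.
        let mut base = 0;
        while base < frames {
            let order = (base.trailing_zeros() as usize)
                .min((frames - base).ilog2() as usize)
                .min(MAX_ORDER);
            this.push(base, order);
            base += 1 << order;
        }
        this
    }

    fn push(&mut self, base: usize, order: usize) {
        let head = self.heads[order];
        self.pages[base] = PageInfo { prev: NONE, next: head, order: order as u8, free: true };
        if head != NONE { self.pages[head].prev = base; }
        self.heads[order] = base;
    }

    fn unlink(&mut self, base: usize) {
        let PageInfo { prev, next, order, .. } = self.pages[base];
        if prev != NONE { self.pages[prev].next = next; } else { self.heads[order as usize] = next; }
        if next != NONE { self.pages[next].prev = prev; }
        self.pages[base].free = false;
    }

    /// Allocate 2^order contiguous frames in at most O(MAX_ORDER) steps, i.e. O(1).
    fn alloc(&mut self, order: usize) -> Option<usize> {
        // Find the smallest free p2frame that is large enough.
        let found = (order..=MAX_ORDER).find(|&o| self.heads[o] != NONE)?;
        let base = self.heads[found];
        self.unlink(base);
        // Split off and re-free the upper halves until the requested order remains.
        for o in (order..found).rev() {
            self.push(base + (1 << o), o);
        }
        Some(base)
    }

    /// Free a p2frame, merging it with free buddies into higher orders.
    fn free(&mut self, mut base: usize, mut order: usize) {
        while order < MAX_ORDER {
            let buddy = base ^ (1 << order); // the neighbor at the same order
            match self.pages.get(buddy) {
                Some(p) if p.free && p.order as usize == order => {
                    self.unlink(buddy);
                    base = base.min(buddy);
                    order += 1;
                }
                _ => break,
            }
        }
        self.push(base, order);
    }
}

fn main() {
    let mut buddy = P2Buddy::new(1 << 14); // 16384 frames
    let a = buddy.alloc(0).unwrap(); // a single frame
    let b = buddy.alloc(3).unwrap(); // 8 contiguous frames
    buddy.free(a, 0);
    buddy.free(b, 3);
    assert_eq!(buddy.alloc(0), Some(a)); // the merged p2frame is reusable
}
```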
The new flamegraph when simply booting is now instead:
The time spent inside the frame allocator has almost entirely disappeared, comprising only the 0.9% slice in the bottom-left corner. Additionally, NetSurf now starts significantly faster, and the boot time has also been reduced. For Jeremy, the boot time was reduced from 4s to 3s!
Of course, memory allocation overhead can still be significant for other workloads, and there are lots of future optimizations to be made.
Syscall optimization
With the signal MR having flattened the kernel stacks, the syscall prologue and epilogue code that kept track of where the userspace registers were stored could be removed. Most of the ptrace and debug logic was moved to the context switch code, and some other improvements were made. On a CPU where the hardware syscall+sysret latency is 56 cycles, these optimizations changed the latency (roughly) as follows:
- August 2023, initial: 344 cycles (-0%, cumulative)
- After signal MR: 236 cycles (-31%)
- Caching ptrace breakpoint state: 184 cycles (-47%)
- Using saved regs directly for syscall debugging: 116 cycles (-66%)
Thus, almost the majority of the syscall latency on Redox is now spent in the syscall/sysret microcode itself. It can probably be optimized further, even though performance may be reduced slightly once Spectre mitigations are properly implemented. It is worth noting that this is the base syscall latency, i.e. the time it takes to do an invalid syscall that returns ENOSYS, and not the time spent in the various syscalls themselves. That said, most simple syscalls, such as sigprocmask, do not take more than a few hundred cycles to run.
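For reference, this kind of base latency can be measured by timing a deliberately invalid syscall with the TSC. A rough sketch for x86_64, assuming the usual syscall ABI in which only rax, rcx, and r11 are clobbered:

```rust
use core::arch::{asm, x86_64::_rdtsc};

fn main() {
    const ITERS: u64 = 1_000_000;
    let start = unsafe { _rdtsc() };
    for _ in 0..ITERS {
        unsafe {
            asm!(
                "syscall",
                // An invalid syscall number, so the kernel just returns ENOSYS.
                inout("rax") usize::MAX => _,
                out("rcx") _, // clobbered by the syscall instruction
                out("r11") _, // clobbered by the syscall instruction
                options(nostack),
            );
        }
    }
    let end = unsafe { _rdtsc() };
    println!("~{} cycles per ENOSYS round trip", (end - start) / ITERS);
}
```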
Conclusion
This year, there have been numerous improvements both to the kernel’s correctness and to its raw performance. The signal and TLB shootdown MRs have significantly improved kernel memory integrity and possibly eliminated many hard-to-debug, nontrivial heisenbugs. Nevertheless, there is still a lot of work to be done optimizing and fixing bugs in relibc, in order to improve compatibility with ported applications and, most importantly of all, to get closer to a self-hosted Redox.