RSoC: improving drivers and kernel - part 5 (largely io_uring)
By 4lDO2 on
Introduction
It has been some time since my last blog post, which was last summer, so in this one I will try to summarize everything I have done related to io_uring during this first month of RSoC, along with some work from earlier in the year.
Nearly full block-freedom
Last summer the interface was more or less usable, but many schemes still
blocked internally, which limited the number of SQEs that could run
simultaneously to one. With the help of some async Rust, all schemes, apart
from some ptrace logic in `proc:`, no longer block. That said, I am probably
not going to use async in the near future, as it may not be flexible enough for
kernel code, especially when the only existing “leaf future” is
`WaitCondition`.
Not directly related to blocking, but I have also fixed io_uring when
`#[cfg(feature = "multi_core")]` is used, where a synchronization problem
arises since you don’t want to have ten thousand locked mutexes. This requires
an additional flag in each context, and in fact I discovered a data race,
although it only became a problem on the io_uring branch.
Various interface improvements
The first thing I did was to clean up the interface itself. Previously I used
two different types of SQEs and CQEs, one each for 32-bit and 64-bit values.
This turned out to be quite messy in the codebase, in both kernel and
userspace, especially since every function had to be generic over the SQE and
CQE types, and it could also have hurt performance if runtime branching between
the two structures had been required. Now there is only the 64-byte `SqEntry`
and the 16-byte `CqEntry`, which are the same sizes as the Linux io_uring
entries. However, as Redox allows full 64-bit return values, e.g. from mmap,
two CQEs can be chained into an “extended” CQE if a single one is not
sufficient.
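To make the sizes and the chaining idea concrete, here is a minimal sketch of what such entries could look like. Only the names `SqEntry` and `CqEntry` and their sizes come from the text above; every field name, the `CQ_FLAG_EXTENDED` flag, and the way the upper 32 bits are carried in a second entry are assumptions made for illustration, not the actual Redox layout.

```rust
use std::mem::size_of;

/// Hypothetical 64-byte submission entry (field names are illustrative).
#[repr(C)]
#[derive(Clone, Copy, Debug, Default)]
pub struct SqEntry {
    pub opcode: u8,
    pub flags: u8,
    pub priority: u16,
    pub syscall_flags: u32,
    pub fd: u64,
    pub len: u64,
    pub user_data: u64,
    pub addr: u64,
    pub offset: u64,
    pub _reserved: [u64; 2],
}

/// Hypothetical 16-byte completion entry (field names are illustrative).
#[repr(C)]
#[derive(Clone, Copy, Debug, Default)]
pub struct CqEntry {
    pub user_data: u64,
    pub status: u32,
    pub flags: u32,
}

// Compile-time checks that the layouts match the sizes named in the post.
const _: () = assert!(size_of::<SqEntry>() == 64);
const _: () = assert!(size_of::<CqEntry>() == 16);

/// Assumed flag: this CQE is continued by the next one in the ring.
pub const CQ_FLAG_EXTENDED: u32 = 1;

/// Encode a 64-bit result as one CQE, or as two chained CQEs when the
/// value does not fit in 32 bits.
pub fn encode(user_data: u64, value: u64) -> (CqEntry, Option<CqEntry>) {
    if value <= u64::from(u32::MAX) {
        let only = CqEntry { user_data, status: value as u32, flags: 0 };
        (only, None)
    } else {
        // Low half goes in the first entry, high half in the second.
        let first = CqEntry { user_data, status: value as u32, flags: CQ_FLAG_EXTENDED };
        let second = CqEntry { user_data, status: (value >> 32) as u32, flags: 0 };
        (first, Some(second))
    }
}

/// Decode a result; returns `None` if an extended CQE lacks its second half.
pub fn decode(first: &CqEntry, second: Option<&CqEntry>) -> Option<u64> {
    let low = u64::from(first.status);
    if first.flags & CQ_FLAG_EXTENDED == 0 {
        return Some(low);
    }
    second.map(|s| low | (u64::from(s.status) << 32))
}
```

With this scheme, small return values cost a single 16-byte entry, and only results such as high mmap addresses pay for a second one.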
Additionally, I have removed the push/pop epochs, as they have very little benefit compared to simply reading the head and tail indices directly when deciding whether to notify in polling mode. In the extreme case of 64-bit index overflow, there would simply have to be a lock and a blocking notification somewhere. I have also added support for an indirect SQE array, which the Linux interface currently forces, but which will remain optional on Redox. Finally, I have optimized some operations at the ring structure level, mostly by reducing sometimes-expensive atomic operations.
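As a rough illustration of deciding on notification purely from the head and tail indices, here is a small single-producer, single-consumer ring. The `Ring` type, its methods, and the "notify only on the empty to non-empty transition" rule are assumptions for this sketch and are not taken from the actual Redox ring code.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical SPSC ring of u64 values with free-running 64-bit indices.
pub struct Ring {
    head: AtomicU64, // next index to pop, written only by the consumer
    tail: AtomicU64, // next index to push, written only by the producer
    slots: Vec<AtomicU64>,
}

impl Ring {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity.is_power_of_two(), "capacity must be a power of two");
        Ring {
            head: AtomicU64::new(0),
            tail: AtomicU64::new(0),
            slots: (0..capacity).map(|_| AtomicU64::new(0)).collect(),
        }
    }

    /// Producer side. On success, returns whether the ring was empty before
    /// the push, i.e. whether a sleeping consumer may need a notification.
    /// On a full ring, the value is handed back as the error.
    pub fn push(&self, value: u64) -> Result<bool, u64> {
        let tail = self.tail.load(Ordering::Relaxed);
        let head = self.head.load(Ordering::Acquire);
        if tail.wrapping_sub(head) == self.slots.len() as u64 {
            return Err(value);
        }
        let index = (tail as usize) & (self.slots.len() - 1);
        self.slots[index].store(value, Ordering::Relaxed);
        // Publish the slot contents before the new tail becomes visible.
        self.tail.store(tail.wrapping_add(1), Ordering::Release);
        Ok(tail == head)
    }

    /// Consumer side. Returns `None` when head and tail are equal.
    pub fn pop(&self) -> Option<u64> {
        let head = self.head.load(Ordering::Relaxed);
        let tail = self.tail.load(Ordering::Acquire);
        if head == tail {
            return None;
        }
        let index = (head as usize) & (self.slots.len() - 1);
        let value = self.slots[index].load(Ordering::Relaxed);
        // Release the slot back to the producer.
        self.head.store(head.wrapping_add(1), Ordering::Release);
        Some(value)
    }
}
```

No separate epoch counter is involved here: comparing the two indices already tells the producer whether the consumer could have gone idle.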
TODO
Outside of simply cleaning up a lot of code that currently uses async, I am
also going to spend a lot of time trying to optimize io_uring. The problem with
the current approach of using `async fn` is that regular blocking syscalls get
extra overhead because io_uring requires async, and at the moment the size of
the futures is quite embarrassing: 6192 bytes, although this is because some
functions normally allocate on the stack. If these allocations were moved to
the heap, the size would probably still be a couple hundred bytes due to the
complex compiler-generated state machines, and that would hurt the performance
of blocking syscalls as well. Instead, I will be using regular functions and
storing the state in a per-context runqueue, with state that will hopefully
take no more than 32 bytes.
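The following is only a toy sketch of that idea, to show how explicit state in a queue can replace a compiler-generated future. `Task`, `Step`, `RunQueue`, and the "transfer a chunk per step" behaviour are all hypothetical and do not reflect how the kernel will actually implement it.

```rust
use std::collections::VecDeque;
use std::mem::size_of;

/// Hypothetical per-request state, kept explicitly instead of in a future.
#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Task {
    pub user_data: u64, // echoed back in the completion
    pub fd: u64,
    pub total: u32, // bytes requested
    pub done: u32,  // bytes transferred so far
    pub opcode: u8,
}

// The whole resumable state fits in the 32 bytes hoped for in the text.
const _: () = assert!(size_of::<Task>() <= 32);

/// Result of advancing a task by one step.
pub enum Step {
    Pending,
    Done(u64),
}

/// A regular (non-async) function that advances a task as far as it can
/// right now. Here "progress" is simulated by moving up to `chunk` bytes.
fn step(task: &mut Task, chunk: u32) -> Step {
    let left = task.total - task.done;
    task.done += left.min(chunk);
    if task.done == task.total {
        Step::Done(u64::from(task.done))
    } else {
        Step::Pending
    }
}

/// Hypothetical per-context run queue of in-flight tasks.
#[derive(Default)]
pub struct RunQueue {
    tasks: VecDeque<Task>,
}

impl RunQueue {
    pub fn new() -> Self {
        Self::default()
    }

    pub fn submit(&mut self, task: Task) {
        self.tasks.push_back(task);
    }

    /// Step every queued task once; return `(user_data, result)` pairs for
    /// the tasks that completed, and requeue the rest.
    pub fn run_once(&mut self, chunk: u32) -> Vec<(u64, u64)> {
        let mut completed = Vec::new();
        for _ in 0..self.tasks.len() {
            let mut task = self.tasks.pop_front().expect("length checked above");
            match step(&mut task, chunk) {
                Step::Done(result) => completed.push((task.user_data, result)),
                Step::Pending => self.tasks.push_back(task),
            }
        }
        completed
    }
}
```

A blocking syscall in this model would just call the same `step`-style function in a loop, without paying for a future at all.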
As of now, the only real userspace component in which I have used io_uring is a
forked drivers branch. I used that mainly to test the interface, but now I
will instead start using it in the NVMe driver. The NVMe hardware interface is
conceptually similar to io_uring (as are many other hardware interfaces), so
ideally the driver would, at a high level, only have to “forward” submitted
SQEs to the device, and then forward the device’s completion events back as
CQEs.
Additionally, I plan on completely phasing out kernel handling of
userspace-to-userspace rings since, as the name implies, those are meant to be
independent of the kernel. Instead, these secondary rings will be handled by
userspace applications, which can choose which interface to use, and that
interface does not necessarily have to be identical to the current io_uring
interface. To that end, I have submitted a bunch of futex-related MRs:
applications will open shared memory via `ipcd`’s `shm:` scheme, and then wait
for new entries by cooperatively doing `FUTEX_WAIT64` and `FUTEX_WAKE` on the
head and tail indices, respectively (or abuse virtual memory to get the
processor to show which memory locations have been modified, although I am not
so sure that is practical enough). Just like the first kernel opcode,
`Waitpid`, `Futex` would also become an io_uring opcode, possibly allowing
direct notification where the only overhead is reading one CQE from shared
memory.
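The wait/wake pattern on a ring index can be sketched portably as below. This does not call the Redox futex syscall: `FutexWord` is a hypothetical stand-in built on a `Mutex` and `Condvar` that mimics the "sleep only while the word still has the expected value" semantics, and having the consumer sleep on the tail index is an assumption about which side waits on which index.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Condvar, Mutex};

/// Stand-in for a 64-bit word in shared memory that supports futex-style
/// wait and wake. A real implementation would use the kernel futex calls.
pub struct FutexWord {
    value: AtomicU64,
    lock: Mutex<()>,
    cond: Condvar,
}

impl FutexWord {
    pub fn new(initial: u64) -> Self {
        FutexWord {
            value: AtomicU64::new(initial),
            lock: Mutex::new(()),
            cond: Condvar::new(),
        }
    }

    pub fn load(&self) -> u64 {
        self.value.load(Ordering::Acquire)
    }

    /// Analogue of a 64-bit futex wait: block only while the word still
    /// equals `expected`, so a store that raced ahead is never missed.
    pub fn wait(&self, expected: u64) {
        let mut guard = self.lock.lock().unwrap();
        while self.value.load(Ordering::Acquire) == expected {
            guard = self.cond.wait(guard).unwrap();
        }
    }

    /// Analogue of "store the new index, then futex wake".
    pub fn store_and_wake(&self, new: u64) {
        let _guard = self.lock.lock().unwrap();
        self.value.store(new, Ordering::Release);
        self.cond.notify_all();
    }
}

/// Consumer side: given the consumer's current head, sleep until the
/// producer has advanced the tail, then return how many entries are ready.
pub fn wait_for_entries(tail: &FutexWord, head: u64) -> u64 {
    loop {
        let current = tail.load();
        if current != head {
            return current.wrapping_sub(head);
        }
        // Ring looks empty: sleep until the tail moves away from `head`.
        tail.wait(head);
    }
}
```

The producer only needs to call `store_and_wake` after publishing entries; if the consumer was not sleeping, the wake is simply a no-op.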
I’ll update the RFC as soon as possible too.