RSoC: on-demand paging II
By 4lDO2
Introduction
Today is the end of the last week of RSoC, and most importantly, I’m happy to announce that the MVP for demand paging has now been merged!
aarch64 and i686
Before merging, the demand paging implementation was ported to i686 and aarch64. The i686 port was trivial, due to its similarity to x86_64 (they are to some extent the same arch). The page fault code was modeled after x86_64.
When porting it to aarch64, however, I discovered that (on master) the x18
register was being overwritten for debug purposes each time there was an
exception or interrupt! Turns out that page faulting when accessing almost
every new page is a great way to stress-test saving/restoring registers!
Complete grant bookkeeping
The ownership of Grants, which are the Redox equivalent of the entries in
/proc/&lt;pid&gt;/maps on Linux, is now properly tracked, fixing this issue.
Each grant has a provider, which is one of the following types:
pub enum Provider {
    /// The grant is owned, but possibly CoW-shared.
    ///
    /// The pages this grant spans need not necessarily be initialized right away, and can be
    /// populated either from zeroed frames, the CoW zeroed frame, or from a scheme fmap call, if
    /// mapped with MAP_LAZY. All frames must have an available PageInfo.
    Allocated { cow_file_ref: Option<GrantFileRef> },
    /// The grant is owned, but possibly shared.
    ///
    /// The pages may only be lazily initialized, if the address space has not yet been cloned (when forking).
    ///
    /// This type of grant is obtained from MAP_SHARED anonymous or `memory:` mappings, i.e.
    /// allocated memory that remains shared after address space clones.
    AllocatedShared { is_pinned_userscheme_borrow: bool },
    /// The grant is not owned, but borrowed from physical memory frames that do not belong to the
    /// frame allocator.
    PhysBorrowed { base: Frame },
    /// The memory is borrowed directly from another address space.
    External { address_space: Arc<RwLock<AddrSpace>>, src_base: Page, is_pinned_userscheme_borrow: bool },
    /// The memory is MAP_SHARED borrowed from a scheme.
    ///
    /// Since the address space is not tracked here, all nonpresent pages must be present before
    /// the fmap operation completes, unless MAP_LAZY is specified. They are tracked using
    /// PageInfo, or treated as PhysBorrowed if any frame lacks a PageInfo.
    FmapBorrowed { file_ref: GrantFileRef, pin_refcount: usize },
}
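To make these variants concrete, here is a rough sketch of how a page fault handler might dispatch on a grant’s provider. This is not the kernel’s actual fault path: the helper functions and the Fault type are made up for illustration, based on the doc comments above.

// Hypothetical dispatch on Provider during a page fault; helpers are illustrative.
fn handle_page_fault(grant: &Grant, page: Page, write: bool) -> Result<(), Fault> {
    match grant.provider {
        // Owned memory: populate lazily from a zeroed frame (or from the file
        // in cow_file_ref), and un-share CoW frames on write faults.
        Provider::Allocated { ref cow_file_ref } => {
            if write {
                unshare_cow_frame(grant, page)
            } else {
                map_zeroed_or_file(grant, page, cow_file_ref.as_ref())
            }
        }
        // Shared allocations: every sharer sees the same frame, so map it directly.
        Provider::AllocatedShared { .. } => map_existing_frame(grant, page),
        // MMIO borrows are mapped eagerly, so a fault here is a genuine access error.
        Provider::PhysBorrowed { .. } => Err(Fault::AccessViolation),
        // Borrowed memory: look up the frame via the source address space or
        // scheme, bump its PageInfo refcount, and map it.
        Provider::External { .. } | Provider::FmapBorrowed { .. } => {
            map_existing_frame(grant, page)
        }
    }
}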
(almost) Complete frame bookkeeping
The kernel previously didn’t store any metadata about physical memory frames,
allowing malicious schemes to continue using munmapped pages that were
temporarily mapped to those schemes (automatically by the kernel, provided
those pages were used as syscall arguments to that scheme). A scheme that
unmapped its pages would also risk a use-after-free, if that scheme had
provided those pages when handling an fmap call. Although root is still
currently required to run schemes, this lack of frame bookkeeping was one of
the reasons root was required.
The current kernel stores a PageInfo
for each page that the kernel’s frame
allocator can return.
pub struct PageInfo {
    /// Stores the reference count to this page, i.e. the number of present page table entries that
    /// point to this particular frame.
    ///
    /// Bits 0..=N-2 are used for the actual reference count, whereas bit N-1 indicates the page is
    /// shared if set, and CoW if unset. The flag is not meaningful when the refcount is 0 or 1.
    pub refcount: AtomicUsize,
    // (not currently used)
    pub _flags: FrameFlags,
}
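For illustration, here is roughly how that packed word could be decoded; the exact bit layout and the method names are my assumptions, not the kernel’s code.

use core::sync::atomic::Ordering;

// Hypothetical decoding of the packed refcount word described above.
const SHARED_FLAG: usize = 1 << (usize::BITS - 1); // bit N-1

impl PageInfo {
    /// Number of present page table entries pointing to this frame.
    fn mappings(&self) -> usize {
        self.refcount.load(Ordering::Relaxed) & !SHARED_FLAG
    }
    /// Whether the frame is shared rather than CoW; only meaningful for refcounts above 1.
    fn is_shared(&self) -> bool {
        let word = self.refcount.load(Ordering::Relaxed);
        (word & !SHARED_FLAG) > 1 && (word & SHARED_FLAG) != 0
    }
}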
The way they are organized is very similar to Linux, at least according to
their documentation. A global variable, called SECTIONS, contains an array of
“sections”, i.e. (base_frame: Frame, pages: &'static [PageInfo]), based on
the bootloader memory map. The page arrays can be at most 32,768 entries, or
128 MiB with the x86_64 4096-byte page size (the optimal size is yet to be
determined).
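As a sketch (the field names, Frame’s Ord/Copy impls, and the offset_from helper are assumed), looking up the PageInfo of a frame then amounts to a binary search over the sections:

// Simplified sketch of the section-based lookup described above.
pub struct Section {
    base_frame: Frame,           // first frame this section covers
    pages: &'static [PageInfo],  // one entry per frame, at most 32,768
}

fn page_info(frame: Frame, sections: &[Section]) -> Option<&'static PageInfo> {
    // Sections are sorted by base frame: find the last section that starts
    // at or before `frame`, then index into its PageInfo slice.
    let i = match sections.binary_search_by_key(&frame, |s| s.base_frame) {
        Ok(i) => i,
        Err(0) => return None, // frame precedes every section
        Err(i) => i - 1,
    };
    let section = &sections[i];
    // `offset_from` (number of frames from `base_frame` to `frame`) is assumed.
    section.pages.get(frame.offset_from(section.base_frame))
}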
The refcount is incremented/decremented for every new mapping created to or
removed from any frame, and those updates are as atomic (wait-free) as
std::sync::Arc.
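In the style of Arc, an increment can be a relaxed fetch_add, while the final decrement is a release operation followed by an acquire fence before the frame is freed. A sketch, reusing SHARED_FLAG from above (again, not the kernel’s exact code):

use core::sync::atomic::{fence, Ordering};

// A new page table entry now points at this frame; Relaxed suffices, as in Arc::clone.
fn add_mapping(info: &PageInfo) {
    info.refcount.fetch_add(1, Ordering::Relaxed);
}

// Returns true if this was the last mapping, i.e. the frame can be freed.
fn remove_mapping(info: &PageInfo) -> bool {
    // Release on the decrement, then Acquire before freeing, mirroring Arc's Drop.
    if (info.refcount.fetch_sub(1, Ordering::Release) & !SHARED_FLAG) == 1 {
        fence(Ordering::Acquire);
        return true;
    }
    false
}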
However, there is one inconvenient exception to this: physalloc and physfree.
Until those syscalls are removed and replaced by e.g.
mmap(..., MMAP_PHYS_CONTIGUOUS), the kernel cannot currently enforce that all
allocator pages are properly tracked.
Once this is done, it will be possible to enforce that PhysBorrowed grants,
obtained mostly by drivers to access MMIO, cannot access any memory owned by
other processes on the system. In particular, this will naturally sandbox the
AML interpreter, preventing it from (directly) maliciously modifying memory
it’s not supposed to access.
Another, even more useful, possibility is to make PageInfo a union,
additionally encompassing other types of frames used by the kernel, such as
frames for the kernel heap, and most importantly, paging structures. By
tracking refcounts of paging structures, and exploiting the fact that x86’s
paging hardware ANDs the “writable” flag of all tree levels, it will be
possible to make the page tables CoW as well in Redox’s current fork
equivalent, possibly even allowing O(1) forks, with respect to the number of
mapped pages.
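The key fact is that a translation is writable only if the writable bit is set at every level of the page table tree, so clearing that bit once, near the root, write-protects an entire subtree. Roughly (the constant is the x86_64 R/W bit; the function is purely illustrative):

// Conceptual sketch: one store write-protects every page reachable through
// this top-level entry, without touching the thousands of entries below it.
// On fork, both address spaces could then share the child tables CoW.
const WRITABLE: u64 = 1 << 1; // bit 1 of an x86_64 page table entry

fn share_subtree_cow(top_level_entry: &mut u64) {
    *top_level_entry &= !WRITABLE;
}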
While Redox does not yet allow userspace to map large (2 MiB on x86_64) and/or
huge (1 GiB on x86_64) pages, storing 511, or in the extreme case 262,143,
useless PageInfos is of course not efficient. This can be solved either by
preallocating the expected number of PageInfos, by using the unused space for
e.g. opportunistic caches, or by allowing the PageInfo arrays to be
dynamically resized.
physmap deprecation
The physunmap system call was removed in the earlier usercopy MR, but now the
physmap system call has additionally been deprecated, replaced by mmapping
memory:physical@&lt;memory type&gt;. Possible memory types are uncacheable,
write-combining, and the regular writeback memory type.
This comes with the benefit, once the physmap syscall is removed, of being
able to restrict the ability to borrow device physical memory via namespaces,
even for processes running as root (the concept of a root user on Redox is
temporary).
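For example, a driver wanting uncacheable access to an MMIO region might do something like the following, using the syscall crate. The path syntax follows this post, but the exact memory type names in the path, and details like the made-up mmio_phys address, are assumptions:

use syscall::{Map, MapFlags};

// Hedged sketch: map one page of device MMIO as uncacheable memory.
fn map_mmio(mmio_phys: usize) -> syscall::Result<usize> {
    let fd = syscall::open("memory:physical@uc", syscall::O_CLOEXEC)?;
    let virt = unsafe {
        syscall::fmap(fd, &Map {
            offset: mmio_phys,
            size: 4096,
            flags: MapFlags::PROT_READ | MapFlags::PROT_WRITE,
            address: 0, // let the kernel choose the virtual address
        })?
    };
    let _ = syscall::close(fd);
    Ok(virt)
}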
improved fmap interface
The mmap interface used by schemes has been improved, from
fn fmap(&self, id: usize, map: &Map) -> Result<usize>;
fn funmap(&self, address: usize, length: usize) -> Result<usize>;
struct Map {
    offset: usize,
    size: usize,
    flags: MapFlags,
    address: usize, // bad API: only used by the syscaller
}
to
fn mmap_prep(&self, id: usize, offset: u64, size: usize, flags: MapFlags) -> Result<usize>;
fn munmap(&self, id: usize, offset: u64, size: usize, flags: MunmapFlags) -> Result<usize>;
The kernel no longer needs to create a temporary mapping for the Map struct
to be read. Schemes are now expected to track the number of mappings to each
file range, which the new range-tree crate can be used for.
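As an illustration of the bookkeeping now expected of schemes, here is a simplified stand-in using a plain BTreeMap instead of the range-tree crate (whose actual API is not shown here); a real implementation must also handle partially overlapping ranges, splitting and merging them, which is exactly what range-tree is for:

use std::collections::BTreeMap;

// Count live mappings per exact (offset, size) range, so munmap can tell
// when the last mapping of a range is gone and resources can be released.
#[derive(Default)]
struct MappingCounts {
    ranges: BTreeMap<(u64, usize), usize>,
}

impl MappingCounts {
    fn mmap_prep(&mut self, offset: u64, size: usize) {
        *self.ranges.entry((offset, size)).or_insert(0) += 1;
    }

    /// Returns true when the last mapping of the range was removed.
    fn munmap(&mut self, offset: u64, size: usize) -> bool {
        match self.ranges.get_mut(&(offset, size)) {
            Some(count) if *count > 1 => { *count -= 1; false }
            Some(_) => { self.ranges.remove(&(offset, size)); true }
            None => false,
        }
    }
}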
TODO
Some of the TODOs I mentioned in the previous blog post are still TODOs:
- Proper synchronized TLB shootdown is still unimplemented. While the current excessive amount of TLB flushing makes TLB use-after-free bugs very rare, I did notice that when omitting some unnecessary flushes, page fault heisenbugs sometimes started appearing.
- Although the current implementation is visibly faster in QEMU and on some real hardware, there has not yet been any significant performance tuning, such as eagerly mapping pages up to a certain limit, or making page faults map multiple sequential pages. Additionally, the CPU caches can be used better by using large or huge pages for the kernel’s linear mapping of physical frames (like Linux does), and possibly using an LRU cache for frame allocation as well.
- madvise and mlock are not implemented yet, and by extension, swap.
- The scheme traits (Scheme, SchemeMut, SchemeBlock, and SchemeBlockMut) are not ideal, as they rely on a shell script to autogenerate the latter three traits based on the first one. This makes extending the trait, e.g. to allow one-way kernel-to-scheme messages, much more time consuming than necessary.
- OOM is still not handled. But since the current signal handler relies on Vec::clone to clone kernel stacks, which allocates memory before signals are even possible to send, it might make more sense to wait for the signal MR to be merged first. (OOM handling is tracked in this issue.)
- It would be a good idea to document this new memory management code, as the current “memory management” Book section is currently empty.