devel@lists.tockos.org


RFC: Zero-copy & DMA-able Grant Proposal
by Amit Levy March 13, 2025


(also as an issue at https://github.com/tock/tock/issues/4370)

Grants and allowed buffers provide a mechanism for capsules and other kernel components to use and/or access memory bound to the lifetime of ("owned" by) a process. Commonly, allowed buffers are used to send/receive data on behalf of a process using lower-level drivers, sometimes using DMA or other asynchronous hardware, and grants are used to store metadata related to a process.

Grants were originally designed with four important goals in mind:

1. No dangling pointers: it is not possible for a capsule to access grant memory after the process has been reaped or an allowed buffer after it has been unallowed.
2. No zombies: it should be possible to deallocate process memory without waiting for outstanding hardware operations. I.e. the kernel should be able to reap a process's grants and allowed buffers as soon as it terminates the process.
3. Infallible deallocation: unallowing a buffer should always succeed.
4. Availability: the kernel should always be able to access grant memory at least once per call-stack.

Accomplishing all four simultaneously is challenging, and has led to a design that either precludes zero-copy operations on granted memory or makes them cumbersome and inefficient.

This proposal outlines a design for two new grant access mechanisms. The first achieves all four goals while enabling a limited form of zero-copy functionality. The second enables general-purpose zero-copy functionality (including through asynchronous DMA), but sacrifices the infallible deallocation, availability, and no-zombie guarantees and, in exchange, requires users to have elevated trust (via a capability).

## The case for zero-copying granted memory

The current grant (including allow-ed memory) access mechanism only allows references with lifetimes bounded by the scope of a function. A consequence of this is that many capsules adopt the following pattern:

1. Allocate a static buffer
2. Copy (some) grant data into the static buffer
3. Pass ownership of this buffer to a driver
4. Pass ownership back. Go to step 2.

The downsides to this pattern are:

- Extra copying
- Extra allocation
- Extra logic

Extra logic costs both implementation and maintenance complexity, as well as generated code size. There is no good solution to sizing allocations. Too small and the control logic will ruin performance. Too large consumes what is typically a precious resource. No size is large enough to accept arbitrarily sized allow-ed buffers. Copies are a performance concern in some applications. Zero-copy could address these downsides.

## Different use-cases, different trade-offs

The two main use cases for zero-copy we consider are those that do not require asynchronous hardware access to memory (e.g. DMA), and those that do. We can satisfy all four original requirements while allowing the former use case. However, concurrent, zero-copy DMA is fundamentally incompatible with the last three of the original requirements. If a DMA is ongoing to user memory, we cannot reap that process's memory, we cannot respect a user's wish to deallocate, and the kernel itself may not be able to concurrently access that memory while DMA is writing to it.

#### Memory-Mapped FIFO UART

This is a typical non-DMA capable UART (such as the ns16550). The CPU can read/write to the UART's FIFO by reading/writing to a memory-mapped register. The process-provided buffer for sending/receiving data has an arbitrary length, and is almost certainly bigger than the UART FIFO (typically just a few bytes), so a single send/receive operation will need to be done in steps.

The UART is slow compared to the CPU, and we do not want to wait for the FIFO to clear synchronously, instead relying on interrupts to resume transmission/reception asynchronously. As a result, it's also possible for a process to unallow or replace the allowed buffer while the UART is waiting for the FIFO to clear.
The UART should discover this and react appropriately (e.g. cut the operation short), but it need not necessarily be able to complete the operation.

We want to pass the allowed memory from, e.g., a console system call driver to the UART driver, and let the UART driver handle the logic necessary to track how much has been sent/received so far, and where to continue from, without having to allocate additional memory and without having to continually call back to the console driver.

#### DMA Ethernet

This is an Ethernet device that has one or more DMA channels for receiving. The CPU writes a base pointer and length to each channel's memory-mapped registers, and the hardware asynchronously populates that memory with received MTU-sized frames. The process provides a large, variably sized buffer for receiving data, potentially larger than the MTU size.

Because receiving one or more frames may take an arbitrary amount of time, the CPU should do other things while it's waiting rather than block. As a result, to effectively use the device, the driver should chop up the process buffer into MTU-sized slices and receive on as many slices as there are channels, then rely on interrupts to signal completion of DMA operations.

We want to pass the allowed memory from, e.g., a UDP system call driver to the Ethernet driver, and allow the Ethernet driver to reference this memory directly in the DMA, rather than copying it to a potentially large buffer. We _must_ retain goal one (a dangling pointer that is reused for something else might, for example, unintentionally leak a secret over the network), but the Ethernet driver is trusted enough (by the board) to release memory for reclamation that we can relax goals 2-4.

## Two New Grant Mechanisms

This section describes two new mechanisms intended to address the two different zero-copy patterns.
### `ARef` (Allow-Lifetime Reference) & `PRef` (Process-Lifetime Reference)

`ARef` & `PRef` are similar types that differ only in that the first is bounded by the lifetime of an allow, and the second by the lifetime of a process. They are both a sort of reference (generic in what they point to), and they have `'static` lifetime. Neither is directly dereferenceable. Instead, capsules and drivers must first convert them to a live version (`LiveARef<'a>` and `LivePRef<'a>`).

The difference between the live and non-live versions is that the live versions _can_ be dereferenced (they are smart pointers), and carry a lifetime that confines them to a scope within which no process can be deallocated and no buffer unallowed. The conversion to live is cheap (a load / compare), and fallible (it returns an `Option`). Live references have the same caveat as legacy entry in that they only work within a given function. However, a live reference can be frozen again (which is zero-cost and cannot fail), and so stored globally. Notably, while live, these references can be modified and, e.g., broken apart into packets or narrowed to the portion of a slice that still needs to be transmitted.

`ARef`/`PRef` work by storing a generation counter with each process / allow. An `ARef` / `PRef` holds both the counter value from when it was created and a reference to the current counter to compare against. All `ARef`/`PRef` can be immediately invalidated by incrementing the counter. This is done in a scope where no `LiveARef`/`LivePRef` are allowed to exist.

These types are applicable in the non-DMA case. For instance, we can pass a `LiveARef` to a UART driver. It can write as many bytes as it likes, then freeze the remaining sub-slice and store it. Later, on an interrupt path, it can try to convert to live and continue. If the conversion fails, the buffer must have been unallowed and it can report this outcome. `ARef`/`PRef` also support an iterator pattern for use cases like this.
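
To make the generation-counter idea concrete, here is a minimal sketch of how the freeze / convert-to-live cycle might look for the UART case. It is not the proposed implementation: `Scope`, `Generation`, `try_live`, `advance`, `freeze` and `send_chunk` are hypothetical names, the process buffer is modeled as a `&'static [u8]` rather than real process memory, and only `ARef` (not `PRef`) is shown.

```rust
use core::cell::Cell;
use core::marker::PhantomData;
use core::ops::Deref;

/// Hypothetical token standing for "a scope in which nothing is unallowed".
pub struct Scope(());

impl Scope {
    pub fn new() -> Self {
        Scope(())
    }
}

/// Generation counter stored with each allow (hypothetical name).
pub struct Generation(Cell<usize>);

impl Generation {
    pub const fn new() -> Self {
        Generation(Cell::new(0))
    }

    /// Invalidate every outstanding `ARef`. Requiring `&mut Scope` means no
    /// `LiveARef` (which borrows the scope) can exist while this runs.
    pub fn revoke_all(&self, _scope: &mut Scope) {
        self.0.set(self.0.get().wrapping_add(1));
    }
}

/// Frozen reference: storable anywhere, but not dereferenceable.
pub struct ARef {
    data: &'static [u8], // stand-in for allowed process memory
    created: usize,
    current: &'static Generation,
}

impl ARef {
    pub fn new(data: &'static [u8], current: &'static Generation) -> Self {
        ARef { data, created: current.0.get(), current }
    }

    /// Cheap, fallible conversion: one load and one compare.
    pub fn try_live<'a>(self, _scope: &'a Scope) -> Option<LiveARef<'a>> {
        if self.created != self.current.0.get() {
            return None; // the buffer was unallowed since this ARef was made
        }
        Some(LiveARef {
            data: self.data,
            created: self.created,
            current: self.current,
            _scope: PhantomData,
        })
    }
}

/// Live reference: dereferenceable, but cannot outlive the scope `'a`.
pub struct LiveARef<'a> {
    data: &'static [u8],
    created: usize,
    current: &'static Generation,
    _scope: PhantomData<&'a Scope>,
}

impl<'a> LiveARef<'a> {
    /// Narrow to the part of the slice that still needs transmitting.
    pub fn advance(&mut self, n: usize) {
        let n = n.min(self.data.len());
        self.data = &self.data[n..];
    }

    /// Infallible conversion back to a storable reference.
    pub fn freeze(self) -> ARef {
        ARef { data: self.data, created: self.created, current: self.current }
    }
}

impl<'a> Deref for LiveARef<'a> {
    type Target = [u8];
    fn deref(&self) -> &[u8] {
        self.data
    }
}

/// Models one interrupt-driven step of a FIFO UART: write up to `fifo`
/// bytes, then freeze the remainder. `None` means the buffer is gone.
pub fn send_chunk(frozen: ARef, scope: &Scope, fifo: usize, out: &mut Vec<u8>) -> Option<ARef> {
    let mut live = frozen.try_live(scope)?;
    let n = fifo.min(live.len());
    out.extend_from_slice(&live[..n]);
    live.advance(n);
    Some(live.freeze())
}
```

In this sketch a driver would call `send_chunk` once per interrupt, storing the returned `ARef` in between; after `revoke_all` the next call returns `None` without touching the buffer.
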

`ARef`/`PRef` do not suffer from availability problems as they are shared references. They can be requested as many times as required on a call stack. They can also all be immediately revoked, causing any future attempts to convert to a live reference to fail. They cannot dangle as the live versions have lifetimes that bound them sufficiently.

`ARef`/`PRef` are not appropriate for DMA, wherein the system is doing other things while DMA takes place. Hardware will not perform the checks that the live conversions do, and will also outlive the lifetimes that guard the live references.

### `Ref` / `RefMut`

For DMA, we need to ensure that memory does not have its lifetime end while DMA is ongoing. A DMA operation may take an unbounded amount of time. For this reason, we suggest reference counting by the kernel, which can also block deallocation and fail subsequent allows if it finds non-zero reference counts.

`RefCell`, and its reference types `Ref` and `RefMut`, are the types from the core library which implement reference counting. (Note the lifetime we provide for these references is `'static`.) These provide a sufficient implementation of reference counting for our purposes. We allow both `Ref` and `RefMut` to the grant data, and `Ref` to allow-ed data.

Misbehaving capsules and drivers cannot break safety, but if they leak a `Ref` or `RefMut` they will create zombie processes. Nothing short of a system reset is likely to save a sufficiently broken driver.

Because accessing these types breaks three of our original design goals, they are both locked off behind a capability, and can be turned off entirely at a configuration level. The intent is that these be used solely when other mechanisms do not work.

### Legacy mechanism

`ARef` / `PRef` are proposed as a general replacement for the legacy mechanism. `Ref` / `RefMut` are not, as they have drawbacks. However, to support the existing codebase, the legacy mechanism is still supported in parallel on a case-by-case basis.
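
As an aside on the `Ref` / `RefMut` mechanism described earlier, the way an outstanding (or leaked) reference blocks reclamation can be illustrated with plain `RefCell` from the standard library. This is only a sketch under stated assumptions: `GrantRegion`, `try_reap` and `start_dma` are hypothetical names, the grant is modeled as a fixed 8-byte array, and the capability gating is omitted.

```rust
use std::cell::{Ref, RefCell};

/// Hypothetical stand-in for one process's grant region.
pub struct GrantRegion {
    data: RefCell<[u8; 8]>,
}

impl GrantRegion {
    pub fn new() -> Self {
        GrantRegion { data: RefCell::new([0; 8]) }
    }

    /// Reclamation is only possible when no `Ref`/`RefMut` is outstanding,
    /// i.e. when an exclusive borrow could be taken right now.
    pub fn try_reap(&self) -> bool {
        self.data.try_borrow_mut().is_ok()
    }
}

/// Hand a `'static` shared reference to a (hypothetical) DMA driver.
/// Fails if a `RefMut` is currently outstanding.
pub fn start_dma(region: &'static GrantRegion) -> Option<Ref<'static, [u8; 8]>> {
    region.data.try_borrow().ok()
}
```

While the returned `Ref` is held, `try_reap` returns `false`; dropping it makes the region reclaimable again. If a driver leaks the `Ref` (for example with `std::mem::forget`), `try_reap` keeps returning `false`, which is the zombie case described above.
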
There are conditions on using them in a mix-and-match way. For any given grant:

- Using `ARef`/`PRef` at all, or having an active `Ref`/`RefMut`, blocks use of the legacy mechanism.
- Being inside a grant via the legacy mechanism blocks both of the two new mechanisms.
- `ARef`/`PRef` cannot be used in conjunction with `RefMut` (they can be used with `Ref`).

## Initial results / takeaways

### Code size savings

A simple "hello world" board was created with a console, UART mux, and UART. It was compiled with two versions of the console/UART: the existing version and a new zero-copy version. The new version saved 1K of text.

### Implementation simplicity

Insert image of side by side code here

## Nice externalities

Other things that this design solved at the same time (but are somewhat orthogonal):

- Reentrancy: Previously, if a grant was entered it could not be entered again on the same call stack. `ARef`/`PRef`/`Ref` are all inherently shared references and solve this problem by simply allowing grants to be accessed multiple times.
- Variable number of allows: Scatter/gather lists with an arbitrary number of ranges now work with a fixed (only 1!) allow number. Because changing the generation counter is somewhat orthogonal to changing the pointer, we can allow a new system call to change the pointer but not the counter.
- Drivers can notice buffers being ripped out from underneath them: The user no longer has the ability to change a slice address / length underneath a capsule, only to disallow entirely.
