RFC: Zero-copy & DMA-able Grant Proposal

March 13, 2025

      (also as an issue at https://github.com/tock/tock/issues/4370)

Grants and allowed buffers provide a mechanism for capsules and other
kernel components to use and/or access memory bound to the lifetime of
("owned" by) a process. Commonly, allowed buffers are used to
send/receive data on behalf of a process using lower-level drivers,
sometimes using DMA or other asynchronous hardware, and grants are used
to store metadata related to a process.

Grants were originally designed with four important goals in mind:

1. No dangling pointers: it is not possible for a capsule to access
grant memory after the process has been reaped or an allowed buffer
after it has been unallowed.

2. No zombies: It should be possible to deallocate process memory
without waiting for outstanding hardware operations. I.e. the kernel
should be able to reap a process's grants and allowed buffers as soon as
it terminates the process.

3. Infallible deallocation: unallowing a buffer should always succeed.

4. Availability: the kernel should always be able to access grant
memory at least once per call-stack.

Accomplishing all four simultaneously is challenging, and has led to a
design that either precludes zero-copy operations on granted memory or
makes them cumbersome and inefficient.

This proposal outlines a design for two new grant access mechanisms. The
first achieves all four goals while enabling a limited form of zero-copy
functionality. The second enables general-purpose zero-copy
functionality (including through asynchronous DMA), but sacrifices the
infallible deallocation, availability, and no-zombie guarantees and, in
exchange, requires users to have elevated trust (via a capability).

## The case for zero-copying granted memory

The current grant (including allow-ed memory) access mechanism only
allows references with lifetimes bounded by the scope of a function. A
consequence of this is that many capsules adopt the following pattern:

1. Allocate a static buffer

2. Copy (some) grant data into the static buffer

3. Pass ownership of this buffer to a driver

4. Receive ownership of the buffer back from the driver. Go to step 2.
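
The four steps above can be sketched roughly as follows. This is a
simplified model in plain Rust rather than Tock code: `Driver`,
`transmit`, and `send_via_bounce_buffer` are hypothetical names, and the
driver hands the buffer back immediately instead of through a completion
callback.

```rust
/// Hypothetical lower-level driver that takes ownership of a `'static`
/// buffer and hands it back when it is done (here: immediately).
struct Driver {
    sent: Vec<u8>, // records what was "transmitted"
}

impl Driver {
    fn transmit(&mut self, buf: &'static mut [u8], len: usize) -> &'static mut [u8] {
        self.sent.extend_from_slice(&buf[..len]);
        buf // step 4: ownership is passed back to the caller
    }
}

/// Push `process_data` through a fixed-size static buffer (step 1), one
/// chunk at a time. Returns the number of copy/transmit round trips.
fn send_via_bounce_buffer(
    driver: &mut Driver,
    mut bounce: &'static mut [u8],
    process_data: &[u8],
) -> usize {
    let mut round_trips = 0;
    for chunk in process_data.chunks(bounce.len()) {
        // Step 2: copy (some) grant data into the static buffer.
        bounce[..chunk.len()].copy_from_slice(chunk);
        // Step 3: pass ownership of the buffer to the driver.
        bounce = driver.transmit(bounce, chunk.len());
        round_trips += 1;
    }
    round_trips
}
```

Every byte is copied once, and the number of round trips grows as the
static buffer shrinks relative to the process buffer.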

The downsides to this pattern are:

- Extra copying

- Extra allocation

- Extra logic

Extra logic adds implementation and maintenance complexity, and also
increases generated code size.

There is no good solution to sizing allocations. Too small, and the
control logic will ruin performance. Too large, and the buffer consumes
what is typically a precious resource. No size is large enough to accept
arbitrarily sized allow-ed buffers.

Copies are a performance concern in some applications.

Zero-copy could address these downsides.

## Different use-cases, different trade-offs

The two main use cases for zero-copy we consider are those that do not
require asynchronous hardware access to memory (e.g. DMA), and those
that do.

We can satisfy all four original requirements while allowing the former
use case.

However, concurrent and zero-copy DMA is fundamentally incompatible with
the last three of the original requirements. If a DMA is ongoing to user
memory we cannot reap that process's memory, we cannot respect a user's
wish to deallocate, and the kernel itself may not be able to
concurrently access that memory while DMA is writing to it.

#### Memory-Mapped FIFO UART

This is a typical non-DMA capable UART (such as the ns16550). The CPU
can read/write to the UART's FIFO by reading/writing to a memory-mapped
register. The process-provided buffer for sending/receiving data is of
arbitrary length, and almost certainly bigger than the UART FIFO
(typically just a few bytes), so a single send/receive operation will
need to be done in steps. The UART is slow compared to the CPU, and we
do not want to wait for the FIFO to clear synchronously, instead relying
on interrupts to resume transmission/reception asynchronously.

As a result, it's also possible for a process to unallow or replace the
allowed buffer while the UART is waiting for the FIFO to clear. The UART
should discover this and react appropriately (e.g. cut the operation
short), but it need not necessarily be able to complete the
operation.

We want to pass the allowed memory from, e.g., a console system call
driver to the UART driver, and let the UART driver handle the logic
necessary to track how much has been sent/received so far, and where to
continue from, without having to allocate additional memory and without
having to continually call back to the console driver.

#### DMA Ethernet

This is an Ethernet device that has one or more DMA channels for
receiving. The CPU writes a base pointer and length to each channel's
memory-mapped registers, and the hardware asynchronously populates that
memory with received MTU-sized frames. The process provides a large,
variably sized buffer for receiving data, potentially larger than the
MTU size. Because receiving one or more frames may take an arbitrary
amount of time, the CPU should do other things while it's waiting rather
than block. As a result, to effectively
use the device, the driver should chop up the process buffer into
MTU-sized slices and receive on as many slices as there are channels,
then rely on interrupts to signal completion of DMA operations.
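
As a rough illustration of the slicing step, the sketch below chops a
receive buffer into MTU-sized slices and produces one descriptor per
available channel. `DmaDescriptor`, `arm_channels`, and the MTU value of
1500 are assumptions made for the example; a real driver would write the
base and length to hardware registers rather than return them.

```rust
/// Assumed frame size for the example.
const MTU: usize = 1500;

/// Hypothetical (base, length) pair a driver would program into one
/// channel's memory-mapped registers.
#[derive(Debug, PartialEq)]
struct DmaDescriptor {
    base: usize,
    len: usize,
}

/// Carve `buffer` into MTU-sized slices and build descriptors for at most
/// `channels` of them.
fn arm_channels(buffer: &mut [u8], channels: usize) -> Vec<DmaDescriptor> {
    buffer
        .chunks_exact_mut(MTU) // a trailing slice shorter than MTU is skipped
        .take(channels)
        .map(|slice| DmaDescriptor {
            base: slice.as_mut_ptr() as usize,
            len: slice.len(),
        })
        .collect()
}
```

Once the addresses are handed to hardware, the `&mut [u8]` borrow no
longer protects the memory, which is why the lifetime of the allowed
buffer has to be guarded some other way.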

We want to pass the allowed memory from, e.g., a UDP system call driver
to the ethernet driver, and allow the ethernet driver to reference this
memory directly in the DMA, rather than copying it to a potentially
large buffer. We _must_ retain goal one---a dangling pointer that's
reused for something else might result, for example, in leaking some
secret unintentionally over the network---but the ethernet driver is
trusted enough (by the board) to release memory for reclamation that we
can relax goals 2-4 for it.

## Two New Grant Mechanisms

This section describes two new mechanisms intended to address the two
different zero-copy patterns.

### `ARef` (Allow-Lifetime Reference) & `PRef` (Process-Lifetime Reference)

`ARef` & `PRef` are similar types that differ only in that the first is
bounded by the lifetime of an allow, and the second by the lifetime of a
process. They are both a sort of reference (generic in what they point
to), and they have `'static` lifetime.

Neither is directly dereferenceable. Instead, capsules and drivers must
first convert them to a live version (`LiveARef<'a>` and
`LivePRef<'a>`). The difference between the live and non-live versions
is that the live versions _can_ be dereferenced (they are smart
pointers), and carry a lifetime that confines them to a scope narrower
than any in which a process could be de-allocated or a buffer unallowed.

The conversion to live is cheap (a load / compare), and fallible (it
returns an `Option`). Live references have the same caveat as the legacy
grant entry mechanism in that they only work within a given function.
However, a live reference can be frozen again (which is zero-cost and
cannot fail), and so stored globally. Notably, while live, these
references can be modified and, e.g., broken apart into packets or
narrowed to the portion of a slice that still needs to be transmitted.

ARef/PRef work by storing a generation counter with each process /
allow. An ARef / PRef itself holds both the counter value from when it
was created and a reference to the current counter to compare against.
All ARef/PRef can be immediately invalidated by incrementing the
counter. This is done in a scope where no LiveARef/LivePRef are allowed
to exist.
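
A minimal sketch of the generation-counter idea, in safe Rust, might
look like the following. It is not the proposed implementation: the
`LiveScope` token, the field names, and the use of leaked `'static`
memory as a stand-in for process memory are all assumptions made so the
example is self-contained.

```rust
use std::cell::Cell;
use std::marker::PhantomData;

/// Hypothetical token standing in for "a scope in which nothing can be
/// revoked". Revoking needs `&mut LiveScope`, so the borrow checker rejects
/// revocation while any `LiveARef` borrowed from the scope still exists.
struct LiveScope;

/// Frozen reference: `'static` and storable, but not dereferenceable.
#[derive(Clone, Copy)]
struct ARef {
    created_at: u32,             // counter value when the buffer was allowed
    current: &'static Cell<u32>, // the counter to compare against
    data: &'static [Cell<u8>],   // stand-in for the allowed process memory
}

/// Live reference: dereferenceable, but confined to the scope `'a`.
struct LiveARef<'a> {
    inner: ARef,
    _scope: PhantomData<&'a LiveScope>,
}

impl ARef {
    fn new(current: &'static Cell<u32>, data: &'static [Cell<u8>]) -> ARef {
        ARef { created_at: current.get(), current, data }
    }

    /// Fallible conversion to live: one load and one compare.
    fn try_live<'a>(&self, _scope: &'a LiveScope) -> Option<LiveARef<'a>> {
        if self.current.get() == self.created_at {
            Some(LiveARef { inner: *self, _scope: PhantomData })
        } else {
            None
        }
    }
}

impl<'a> LiveARef<'a> {
    /// Access the memory; the result cannot outlive the scope `'a`.
    fn as_slice(&self) -> &'a [Cell<u8>] {
        self.inner.data
    }

    /// Infallible conversion back to the storable form.
    fn freeze(self) -> ARef {
        self.inner
    }
}

/// Invalidate every outstanding `ARef` created against this counter.
fn revoke(current: &Cell<u32>, _no_live_refs: &mut LiveScope) {
    current.set(current.get().wrapping_add(1));
}
```

After `revoke`, every stored `ARef` with the old counter value fails its
next `try_live`, without the kernel having to track where those
references are stored.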

These types are applicable in the non-DMA case. For instance, we can
pass a LiveARef to a uart driver. It can write as many bytes as it
likes, then freeze the remaining sub-slice and store it. Later, on an
interrupt path, it can try to convert it to live and continue. If the
conversion fails, the buffer must have been un-allowed and it can report
this outcome. ARef/PRef also support an iterator pattern for use cases
like this.
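
The UART flow described above could be modelled along these lines. To
keep the example short, the frozen reference is reduced to a counter
value plus the remaining sub-slice, and the conversion to live is
written as an inline comparison; `Frozen`, `Uart`, `pump`, and
`on_fifo_empty` are hypothetical names, and `wire` stands in for the
FIFO register.

```rust
use std::cell::Cell;

/// Frozen remainder of an allowed buffer (stand-in for a stored `ARef`).
struct Frozen {
    created_at: u32,
    remaining: &'static [Cell<u8>],
}

#[derive(Debug, PartialEq)]
enum TxStatus {
    InProgress, // FIFO is full; wait for the next interrupt
    Done,       // every byte has been written
    Unallowed,  // the buffer was revoked mid-transfer
}

/// Hypothetical FIFO UART driver.
struct Uart {
    fifo_depth: usize,
    generation: &'static Cell<u32>,
    pending: Option<Frozen>,
    wire: Vec<u8>, // bytes "written to the FIFO register"
}

impl Uart {
    /// Start or resume a transfer: fill the FIFO once, then freeze the rest.
    fn pump(&mut self, frozen: Frozen) -> TxStatus {
        // "Convert to live": a load and a compare.
        if self.generation.get() != frozen.created_at {
            return TxStatus::Unallowed;
        }
        let n = frozen.remaining.len().min(self.fifo_depth);
        let (now, later) = frozen.remaining.split_at(n);
        self.wire.extend(now.iter().map(|byte| byte.get()));
        if later.is_empty() {
            return TxStatus::Done;
        }
        // "Freeze": store only the unsent sub-slice until the next interrupt.
        self.pending = Some(Frozen { created_at: frozen.created_at, remaining: later });
        TxStatus::InProgress
    }

    /// Interrupt path: the FIFO has drained, so try to continue.
    fn on_fifo_empty(&mut self) -> TxStatus {
        match self.pending.take() {
            Some(frozen) => self.pump(frozen),
            None => TxStatus::Done,
        }
    }
}
```

The driver needs no extra buffer and no call back into the console
driver: the stored sub-slice itself records where to continue from.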

ARef/PRef do not suffer from availability problems as they are shared
references. They can be requested as many times as required on a call
stack. They can also all be immediately revoked, causing any future
attempt to convert to a live reference to fail. They cannot dangle, as
the live versions have lifetimes that bound them sufficiently.

ARef/PRef are not appropriate for DMA, wherein the system is doing other
things while DMA takes place. Hardware will not perform the checks that
the live conversions do, and its accesses will also outlive the
lifetimes that guard the live references.

### `Ref` / `RefMut`

For DMA, we need to ensure that memory does not have its lifetime end
while DMA is ongoing. The duration of a DMA operation is unbounded. For
this reason, we suggest reference counting by the kernel, which can also
block de-allocation and fail subsequent allows if it finds non-zero
reference counts.

`RefCell`, and its reference types `Ref` and `RefMut`, are the types
from the core library which implement reference counting. (Note the
lifetime we provide for these references is `'static`.) These provide a
sufficient implementation of reference counting for our purposes. We
allow both `Ref` and `RefMut` to the grant data, and `Ref` to allow-ed
data.

Misbehaving capsules and drivers cannot break safety, but if they leak
the `Ref` or `RefMut` they will create zombie processes. Nothing short
of a system reset is likely to save a sufficiently broken driver.
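
The blocking behaviour can be illustrated with the standard `RefCell`
API. The sketch below is an assumption-laden model rather than kernel
code: `ProcessMemory`, `DmaDriver`, and `try_reclaim` are hypothetical,
and leaked heap memory stands in for process memory so that the `Ref`
can be `'static`.

```rust
use std::cell::{Ref, RefCell};

/// Hypothetical stand-in for memory owned by a process.
struct ProcessMemory {
    bytes: RefCell<Vec<u8>>,
}

/// Hypothetical DMA driver that keeps a counted reference while hardware
/// is busy with the buffer.
struct DmaDriver {
    in_flight: Option<Ref<'static, [u8]>>,
}

impl DmaDriver {
    fn start(&mut self, mem: &'static ProcessMemory) {
        // Taking the `Ref` increments the borrow count on the `RefCell`.
        self.in_flight = Some(Ref::map(mem.bytes.borrow(), |v| v.as_slice()));
    }

    fn on_dma_complete(&mut self) {
        // Dropping the `Ref` decrements the count again.
        self.in_flight = None;
    }
}

/// Kernel-side reclamation: refuses while any reference is outstanding.
fn try_reclaim(mem: &ProcessMemory) -> bool {
    match mem.bytes.try_borrow_mut() {
        Ok(mut bytes) => {
            bytes.clear();
            true
        }
        Err(_) => false,
    }
}
```

If the driver never reaches `on_dma_complete`, `try_reclaim` keeps
returning `false`, which is the zombie scenario described above.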

Because accessing these types breaks three of our original design
goals, they are both locked off behind a capability, and can be turned
off entirely at a configuration level. The intent is that these be used
solely when other mechanisms do not work.

### Legacy mechanism

`ARef` / `PRef` are proposed as a general replacement for the legacy
mechanism. `Ref` / `RefMut` are not, as they have drawbacks.

However, to support the existing codebase, the legacy mechanism is
still supported in parallel on a case-by-case basis. There are
conditions on mixing and matching the mechanisms.

For any given grant:

- Using `ARef`/`PRef` at all, or having an active `Ref`/`RefMut`, blocks
  use of the legacy mechanism.

- Being inside a grant via the legacy mechanism blocks both of the two
  new mechanisms.

- `ARef`/`PRef` cannot be used in conjunction with `RefMut` (they can be
  used with `Ref`).
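
One possible reading of these rules, written as predicates over
hypothetical per-grant state, is sketched below. The struct, its fields,
and the method names are invented for illustration, and the exact
bookkeeping (for example, whether `ARef`/`PRef` use is sticky for the
life of the grant) is an assumption, not something this proposal
specifies.

```rust
/// Hypothetical per-grant bookkeeping; the field names are illustrative only.
#[derive(Default)]
struct GrantState {
    aref_pref_used: bool, // an ARef/PRef has been handed out for this grant
    active_refs: usize,   // outstanding `Ref`s
    active_ref_mut: bool, // an outstanding `RefMut`
    legacy_entered: bool, // currently inside a legacy grant entry
}

impl GrantState {
    fn may_enter_legacy(&self) -> bool {
        // The legacy mechanism is also not reentrant on its own.
        !self.legacy_entered
            && !self.aref_pref_used
            && self.active_refs == 0
            && !self.active_ref_mut
    }

    fn may_use_aref_pref(&self) -> bool {
        !self.legacy_entered && !self.active_ref_mut
    }

    fn may_take_ref(&self) -> bool {
        // Excluding `RefMut` here is ordinary `RefCell` behaviour.
        !self.legacy_entered && !self.active_ref_mut
    }

    fn may_take_ref_mut(&self) -> bool {
        !self.legacy_entered
            && !self.aref_pref_used
            && self.active_refs == 0
            && !self.active_ref_mut
    }
}
```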

## Initial results / takeaways

### Code size savings

A simple "hello world" board was created with a console, uart mux, and
uart. It was compiled with two versions of the console/uart: the
existing version and a new zero-copy version.

The new version saved 1K of text.

### Implementation simplicity

Insert image of side by side code here

## Nice externalities

Other things that this design solved at the same time (but are somewhat
orthogonal):

- Reentrancy

Previously, if a grant was entered it could not be entered again on the
same call stack. ARef/PRef/Ref are all inherently shared references and
solve this problem by simply allowing grants to be accessed multiple
times.

- Variable number of allows

Scatter/gather lists with an arbitrary number of ranges now work with a
fixed (only 1!) allow number.

Because changing the generation counter is somewhat orthogonal to
changing the pointer, we can allow a new system call to change the
pointer but not the counter.

- Drivers can notice buffers being ripped out from underneath them

The user no longer has the ability to change a slice address / length
underneath a capsule. They can only disallow it entirely.

Amit Levy
