- Feature Name: dma_and_device_mapping
- Start Date: 2022-08-12
- RFC PR: twizzler-rfcs/rfcs#0006
- Twizzler Issue: twizzler-operating-system/twizzler#0084
# Summary
This RFC introduces support for DMA (Direct Memory Access) and bus address mapping for device drivers. It provides kernel support for setting up mappings for objects, kernel APIs for getting lists of physical or bus mappings for object pages, and twizzler-driver APIs for managing DMA objects and mappings in a memory-safe manner.
# Motivation
DMA is a fundamental aspect of writing device drivers, as devices use DMA to transfer data to and from host memory. However, thinking of devices accessing host memory solely via single one-shot DMA transfers is an outdated and limited model. The goal of this RFC is to provide a unified mechanism for supplying devices with bus addresses that correspond to physical memory that backs object memory in such a way that drivers can program both "streaming" (e.g. buffers) and "long-term-bidirectional" (e.g. command rings) memory.
# Guide-level explanation
## Considerations for DMA
When programs access memory in Twizzler they do so via accessing object memory, which involves an MMU translating some kind of object address to a physical address. On x86, for example, this involves a software translation to a virtual address followed by a translation via the Memory Management Unit (MMU) to a physical address. Similarly, when a device accesses memory, it emits a memory address (likely programmed by the driver) that may undergo no translation or some other translation on the bus before attempting to access host memory. There are two important considerations that are the result of this alternate (or no) translation:
- Contiguous addresses. While object memory is contiguous (within an object), the physical memory that backs that object memory may not be. Thus devices and drivers need to be capable of handling access to memory in a scatter-gather manner.
- Access Control. Access control can be applied differently to host-side driver software and to devices. Thus driver software must be aware that it may have access to memory via the device that it should not be able to access directly. We can use hardware like the IOMMU to limit this effect.
In addition to the above, we need to consider coherence. While CPU caches are coherent across cores, devices that access host memory do not necessarily participate in cache coherence. Thus we must both flush data to main memory after the host writes it (and before the device reads it) and invalidate caches if a device writes to memory. Some systems invalidate caches automatically, but not all do.
## Memory Safety
Finally, we must consider memory safety, which is an issue because while we can control writes from host software to DMA buffers, we cannot necessarily control how the device will access that memory. To ensure memory safety of shared regions, we would need to ensure:
- The device and host software cannot both mutate shared state at the same time (thread safety). Note that this may be acceptable in some situations, such as atomic variables that the device updates without the possibility of tearing or touching neighboring memory; however, encoding this at compile time to prove safety may be impossible in general.
- The device mutates data such that each mutation is valid for the ABI of the type of the memory region.
Enforcing these at all times may cause overhead and increase API complexity. Another stance we could take is Rust's approach to "external influences on memory", such as accessing /proc/self/mem on UNIX, which is essentially to say that this is outside the scope of the compiler's ability to ensure safety. I think, though, that since programming shared access between driver software and the device is a fundamental part of driver development, some middle ground that provides some safety is desirable, even if it means reaching for some unsafe here and there (possibly merely for efficiency).
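As an illustration of the first point, a completion flag that the device writes can be read atomically on the host side. This is a minimal sketch; the type and field names are illustrative, not part of the proposed API:

```rust
use core::sync::atomic::{AtomicU32, Ordering};

// A 4-byte status word cannot tear, and the device only writes this one field, so a
// concurrent device write does not corrupt neighboring memory.
#[repr(C)]
struct Completion {
    status: AtomicU32, // device writes 1 on completion; host clears to 0 before reuse
}

fn is_done(c: &Completion) -> bool {
    // Treat the value as untrusted: anything other than 1 means "not done".
    c.status.load(Ordering::Acquire) == 1
}
```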
## Using DMA in a Device Driver
Twizzler will provide an interface for making a single Twizzler object accessible to a device by way of the `DmaObject` type exposed by the twizzler-driver crate. The `DmaObject` can be created from any Twizzler object, and exposes APIs for ensuring coherence and memory safety. Let's take as an example a device that has a command ring buffer that is used to submit commands and to indicate when a command has been completed. A command in the ring buffer can point to another DMA buffer that is used to transfer data, and may look like the following:
```rust
struct Command {
    op: u32,
    status: u32,
    buffer: u64,
}
```
The `op` field specifies some operation to perform (send packet, etc.), the `status` field specifies the result of the command (say, for example, it is set to 1 when the command is completed and must be cleared to zero for a command to be processed), and the `buffer` field points to the physical address of some buffer. Let's also imagine some mechanism for communicating the head of the ring to the device, so that we can hand it a collection of new commands to process via a write to some MMIO register. For the sake of simplicity, let's assume that the buffer is at most 1 page long.
Setting up some DMA regions may look like:
```rust
let object = create_new_object();
let dma = DmaObject::new(object);
let command_ring = dma.slice_region::<Command>(some_command_len, Access::BiDirectional, DmaOptions::default());
let buffer = dma.slice_region::<u8>(some_buffer_len, Access::HostToDevice, DmaOptions::default());
```
At this point, `command_ring` and `buffer` have types `DmaSliceRegion<Command>` and `DmaSliceRegion<u8>` respectively. Note that we distinguish between `DmaRegion` and `DmaSliceRegion`. Both serve a similar purpose, but have slightly different signatures on some functions. For example, both provide a `with` function (see below), but the `DmaSliceRegion` version allows specifying a sub-slice. The rest of this document will use `DmaRegion` to stand for both types to avoid duplicating the specification.
We can use `DmaRegion::pin()` to get a list of the physical pages associated with the region so that we may program the device to operate on this command ring. Then, submitting a command would look like:
```rust
buffer.with_mut(0..0x1000, |buf| {
    fill_out_buffer(buf);
});
// Grab a 'pin' of the buffer, which ensures that the associated physical addresses and IOMMU
// mappings will remain static until the DMA object is dropped.
let buffer_pin = buffer.pin().unwrap();
// Get the physical address of the first page.
let buffer_addr = buffer_pin[0].addr();
// Fill out a new command.
command_ring.with_mut(0..1, |ring| {
    ring[0] = Command::new(buffer_addr);
});
increment_head();
```
A pin object can manually release the pages it refers to, but otherwise the lifetime of pinned physical memory is the same as that of the `DmaObject` itself. Tying pin lifetime to the DMA object rather than to the pin object reduces the management complexity of avoiding accidentally programming a device with stale physical addresses.
The `DmaRegion::with_mut` function runs a closure while ensuring coherence between host and device. Before running the closure, it ensures any writes from the device are visible; after running the closure, it ensures that any writes made by driver software are visible to the device. A similar function, `with`, allows driver software to read the DMA region but not write it, allowing the system to skip the coherence operations for host-to-device writes.
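For example, polling a command's completion status only needs read access. A minimal sketch (assuming, as is typical for such APIs, that `with` returns the closure's result):

```rust
// Read-only access via `with`: no host-to-device coherence work is required afterwards.
let done = command_ring.with(0..1, |ring| ring[0].status == 1);
```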
## Simple Allocation
If a driver needs to allocate a large number of dynamically sized DMA regions, doing so with a single object may prove difficult, as we can easily run out of space. Thus twizzler-driver also provides a type for managing a collection of `DmaObject`s, all of a similar type: `DmaPool`. We can use it as follows:
```rust
let pool = DmaPool::new(DmaPool::default_spec(), Access::HostToDevice, DmaOptions::default());
let region = pool.allocate::<Foo>(Foo::default()).unwrap();
// Dropping region causes it to deallocate.
```
## Coherence Models and Memory Safety
In the above example, we used the default DMA options, which ensure the following:
- Writes by host software are readable by the device once the `with_mut` function returns.
- Coherence is synchronized at the start of the `with` or `with_mut` calls.
More relaxed models are available that do not do any synchronization unless the driver explicitly calls `DmaRegion::sync`. Note that we are not ensuring that no memory access conflicts occur between the device and driver software, since that is not possible to do at compile time or at runtime[^1]. We are further not ensuring that the device maintains the ABI of the `Command` type. In this example, this doesn't really matter, as all the constituents of this type are simple integers, but imagine instead that `status` were an enum with only a few defined values. The device could update the value of `status` to an undefined value, which would cause problems.
To avoid the type ABI problem, we require that a region's type implement the `DeviceSync` and `Copy` marker traits. The `DeviceSync` trait is a promise that the ABI of the type can handle any update to it that the device might make, and that it can handle possible memory conflicts with writes from the device.
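As an illustration, marking the example `Command` type might look like the sketch below. Whether the impl must be written as `unsafe impl`, or can be derived, depends on how `DeviceSync` is ultimately declared; this is an assumption, not settled API:

```rust
// Every field is a plain integer, so any bit pattern the device writes is still a valid
// `Command`, and concurrent device writes cannot break the type's ABI.
#[derive(Clone, Copy)]
#[repr(C)]
struct Command {
    op: u32,
    status: u32,
    buffer: u64,
}

impl DeviceSync for Command {}
```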
[^1]: Efficiently, anyway. We could use the IOMMU to ensure that physical addresses are only available for the device to access during certain windows. However, this would involve a LOT of system calls and IOMMU reprogramming, which is currently not terribly fast. Note, however, that as written, this API would allow for this kind of enforcement if we choose to do it in the future.
## Shared Objects
One final consideration is for drivers that want to point devices towards object memory that exists within an object that is shared across different programs. The twizzler-driver library cannot (at this level of the system) enforce mutability rules for these objects. Thus driver software should use the manual sync operations to ensure coherence (of course, parts of the object modified via the `with` functions will still have coherence rules applied as normal; see above).
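A minimal sketch of that pattern follows. The exact signature of `sync` is not specified by this RFC, so the range argument is an assumption, and `program_device` and `parse` are hypothetical helpers:

```rust
// The host wrote into the shared object outside of `with_mut`, so flush explicitly
// before handing the region to the device.
shared_region.sync(0..len);
program_device(&shared_region);

// ... later, after the device signals completion, synchronize again before reading.
shared_region.sync(0..len);
let result = shared_region.with(0..len, |buf| parse(buf));
```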
## `DmaOptions` and `Access`
`DmaOptions` modify how a region (or pool, see below) of DMA memory is treated by the host. The options are a bitwise-OR'd collection, with the following defined:
- `UNSAFE_MANUAL_COHERENCE`. Default: no. If set, the `with` functions do not perform any coherence operations.
The `Access` enum specifies the direction of the DMA transfers that will be made with this region, and can be used to optimize coherence and inform access control for IOMMU mappings. The options are listed below, followed by a short usage sketch:
- `HostToDevice` -- for transfers in which the device reads.
- `DeviceToHost` -- for transfers in which the device writes.
- `BiDirectional` -- for transfers in which the device reads and writes.
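For example, a receive-buffer pool that the device writes and the host reads, with coherence handled manually, might be set up as in the sketch below (this assumes `UNSAFE_MANUAL_COHERENCE` is exposed as an associated constant on `DmaOptions` in the usual bitflags style):

```rust
let rx_pool = DmaPool::new(
    DmaPool::default_spec(),
    Access::DeviceToHost,
    DmaOptions::UNSAFE_MANUAL_COHERENCE,
);
// 2 KiB receive buffers; the driver must sync before reading device-written data.
let rx_buf = rx_pool.allocate::<[u8; 2048]>([0u8; 2048]).unwrap();
```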
# Reference-level explanation
## Kernel API
Accessing physical mapping information is done, from consumers of the twizzler-driver API, via the `pin` function on a `DmaObject`. The pin function learns about physical mappings from the kernel by calling a KAction command on the underlying object for pinning pages, which returns a token along with information about physical addresses. That token is tied, in the kernel, to the list of physical mapping information that the call returns. After this call returns, the kernel ensures that the mapping information it has returned stays correct ("active") until the pin is manually released via another KAction call.
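The exact KAction encoding is left to the implementation; the shape of the data returned to userspace might look roughly like the following (all names here are hypothetical, for illustration only):

```rust
/// Handle returned by the pin call; used later to release the pin.
struct PinToken(u32);

/// Physical (or bus) address of one pinned page.
struct PhysInfo {
    addr: u64,
}

/// What the pin call hands back: a token plus one entry per pinned page.
struct PinResult {
    token: PinToken,
    pages: Vec<PhysInfo>,
}
```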
Internally, the kernel will manage per-object information on which pages are pinned so as to not evict or change such pages. Ensuring that these active pins remain correct requires some interaction with the copy-on-write (COW) mechanism in the object-copy functionality. In particular, pins do not get copied into new objects that source data from an existing object. However, if a pin applies to a source object, that object is copied (via COW) to a new object, the copied range intersects the pin, and a write is then performed to the pinned region in the source object while the underlying pages are still shared for COW, then the kernel will need to copy the page for all the other objects instead of just the source object. For this reason, we will break the kernel-side implementation into two feature gates:
- Basic pin support, but not supporting COW-intersecting-with-pins.
- Full support as described above.
## Userspace Implementation
Let's consider the examples in the previous section and discuss implementation.
### `DmaObject::slice_region` and `DmaObject::region`
These provide simple ways to return some object memory as a `[T; N]` or a `T`. They return a struct (`DmaRegion`) that manages a typed region of memory of a size determined by `T` (and `N`), and exposes the `pin` function.
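Concretely, usage might look like the sketch below (reusing the `dma` object from the guide-level example, and assuming `region` takes the same access and options arguments as `slice_region`, minus the length):

```rust
// A single typed value backed by DMA-able object memory.
let single: DmaRegion<Command> =
    dma.region::<Command>(Access::BiDirectional, DmaOptions::default());
// A slice of 16 such values.
let ring: DmaSliceRegion<Command> =
    dma.slice_region::<Command>(16, Access::BiDirectional, DmaOptions::default());
```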
### `pin`
The pin function calls the kernel to set up a pin and then manages the pin information in memory, allowing it to be called multiple times without having to negotiate with the kernel each time. Note that pins are not released to the kernel when the `DmaRegion` is dropped; instead, all pins on an object are released when the `DmaObject` is dropped.
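For example, a driver might walk the pinned pages to build a simple scatter-gather list, as in this sketch (it assumes the pin exposes a page count and per-page indexing as in the earlier example; `write_sg_entry` is a hypothetical device-specific helper):

```rust
let pin = region.pin().unwrap();
for i in 0..pin.len() {
    // One scatter-gather entry per pinned page.
    write_sg_entry(i, pin[i].addr());
}
```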
### `with` and `with_mut`
These functions provide a closure with access to the internal memory managed by a `DmaRegion`. Before the closure is run, they ensure coherence so the host can read the memory, and after the closure is run they ensure coherence so the device can read the memory. The `with` variant may skip the second step.
### Pools and Allocation
Regions can also be obtained from a `DmaPool`, which internally manages a collection of `DmaObject`s (derived from objects that it creates as needed). All regions created this way share the `DmaOptions` with which the pool was created. Allocation is managed internally via a memory allocation algorithm; however, all regions must be aligned to page size.
# Drawbacks
DMA is vital to any driver written for the vast majority of devices that we care about. However, the particular design choices herein do have some drawbacks:
- Pinning memory adds complexity to the eviction algorithms in the kernel and the pager, as they need to be made aware of pinned memory.
- There is currently no attempt to limit the amount of pinned memory an application can request, thus opening an easy door to denial of service attacks. We can mitigate this somewhat via access control.
- Currently we don't define a way to request that all physical addresses fit within 32 (or fewer) bits, as the kernel is not currently set up to manage memory in a way that would make this easy. Ensuring physical addresses stay under the 4G mark is useful (mostly) for older hardware that cannot work with 64-bit addresses. Currently, we don't have any immediate need to support such hardware. If the need arises, however, we can extend `DmaOptions` to include specifications for physical memory address limitations.
# Rationale and alternatives
## Pinned Memory
The overall goal is to make it possible for userspace code to program devices to access physical memory. We want to stay within the overall Twizzler object model to do this, thus the API herein is focused around making it possible to create objects that we can then use for DMA transfers.
## Pin Leaks
One immediate concern is pin leaks. Since the pins must be manually released to the kernel by the library (not the user), we can imagine a device driver crashing and causing a section of object memory to be pinned forever (at least, until the kernel restarts). The decision to allow this, however, is intentional.
Should a device driver crash, we have no guarantee about the state of the device it was programming. It is entirely possible that the device remains programmed with physical addresses that it continues to use after the driver crashes. If pins were removed when the program that created them crashed, the device could blindly write to memory that no longer refers to the object memory originally intended by the (now crashed) driver.
Thus the choice to allow leaks in the face of a driver malfunction is there to mitigate the possibility of corrupted memory. Of course, use of an IOMMU may be able to mitigate this as well; however, I do not wish to rely on it, and doing so would also introduce many inefficiencies. If we prove able to efficiently make use of IOMMU hardware in the future, this design may change.
# Prior art
Basically every major operating system provides some API for setting up DMA transfers. Most of them are quite similar, largely relying on the driver to manually synchronize for coherence and/or specify directionality of transfers. Some (e.g. Linux) additionally classify a region of memory-to-be-DMA'd as fully coherent or streaming, usually using this information to specify caching type for memory mappings.
Fuchsia uses a similar mechanism to what is outlined herein, also supporting pinned memory regions. There are a number of differences, however, that largely stem from our desire to (at least somewhat) follow Rust memory safety requirements. Fuchsia, in addition, does allow a limited ability to control contiguity of physical memory, which we do not (yet).
FreeBSD and Linux both have a significantly different style of interface, stemming from the fact that they implement their device drivers in-kernel, and so their DMA interfaces can be tightly coupled with their virtual memory subsystem. The two systems differ in the details of how they control coherence and synchronization (single-versus-all-cpus, streaming-versus-coherent, and FreeBSD allowing finer control over sync operations) and how they control contiguity and maximum address size. Otherwise, the differences are largely down to API specifics and not really functionality, with the exception of FreeBSD supporting a recursive-like pattern of region configuration inheritance, which is kinda cool.
Windows offers little additional insight into DMA operation and design tradeoffs, except as a case study of how not to name functions or other aspects of an API.
# Future possibilities
This RFC is only intended to cover "dumb" devices -- that is, devices that are fully programmed by host software and, while they may interact with memory via DMA, do not really "go off on their own". Essentially, this covers most devices on the market that do things like NVMe, networking, etc. Such devices are fully controlled by the host: all memory access is either initiated by the host or initiated by the device to a pre-programmed section of memory, and the device can be thought of as a simple state machine.
In the future it may be better to model devices as fully separate machines that access physical memory cooperatively with the host CPU and run their own programs. Should we reach that future, we will probably need a new model for programming such devices, one that exceeds in richness the model presented here.