- Feature Name: dma_and_device_mapping
- Start Date: 2022-08-12
- RFC PR: twizzler-rfcs/rfcs#0006
- Twizzler Issue: twizzler-operating-system/twizzler#0084
# Summary
This RFC introduces support for DMA (Direct Memory Access) and bus address mapping for device drivers. It provides kernel support for setting up mappings for objects, kernel APIs for getting lists of physical or bus mappings for object pages, and twizzler-driver APIs for managing DMA objects and mappings in a memory-safe manner.
# Motivation
DMA is a fundamental aspect of writing device drivers, as devices use DMA to transfer data to and from host memory. However, thinking of devices accessing host memory solely via single one-shot DMA transfers is an outdated and limited model. The goal of this RFC is to provide a unified mechanism for supplying devices with bus addresses that correspond to physical memory that backs object memory in such a way that drivers can program both "streaming" (e.g. buffers) and "long-term-bidirectional" (e.g. command rings) memory.
# Guide-level explanation
## Considerations for DMA
When programs access memory in Twizzler they do so via accessing object memory, which involves an MMU translating some kind of object address to a physical address. On x86, for example, this involves a software translation to a virtual address followed by a translation via the Memory Management Unit (MMU) to a physical address. Similarly, when a device accesses memory, it emits a memory address (likely programmed by the driver) that may undergo no translation or some other translation on the bus before attempting to access host memory. There are two important considerations that are the result of this alternate (or no) translation:
- Contiguous addresses. While object memory is contiguous (within an object), the physical memory that backs that object memory may not be. Thus devices and drivers need to be capable of handling access to memory in a scatter-gather manner.
- Access Control. Access control can be applied differently to host-side driver software and to devices. Thus driver software must be aware that it may have access to memory via the device that it should not be able to access directly. We can use hardware like the IOMMU to limit this effect.
In addition to the above, we need to consider coherence. While CPU caches are coherent across cores, devices that access host memory do not necessarily participate in cache coherence. Thus we must both flush data to main memory after the host writes it (and before the device reads it) and invalidate caches if a device writes to memory. Some systems invalidate caches automatically, but not all do.
## Memory Safety
Finally, we must consider memory safety, which is an issue because while we can control writes from host software to DMA buffers, we cannot necessarily control how the device will access that memory. To ensure memory safety of shared regions, we would need to ensure:
- The device and host software cannot both mutate shared state at the same time (thread safety). Note that this may be acceptable in some situations, such as atomic variables that the device updates without the possibility of tearing or touching neighboring memory; however, encoding this at compile time to prove safety may be impossible in general.
- The device mutates data such that each mutation is valid for the ABI of the type of the memory region.
Enforcing these at all times may cause overhead and increase API complexity. Another stance we could take is Rust's approach to "external influences on memory", such as accessing /proc/self/mem on UNIX, which is essentially to say that this is outside the scope of the compiler's ability to ensure safety. I think, though, that since programming shared access between driver software and the device is a fundamental part of driver development, some middle ground that provides some safety is desirable, even if it means reaching for some unsafe here and there (possibly merely for efficiency).
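As an illustration of the first point, a completion flag that the device writes can be read atomically on the host side. This is a minimal sketch; the type and field names are illustrative, not part of the proposed API:

```rust
use core::sync::atomic::{AtomicU32, Ordering};

// A 4-byte status word cannot tear, and the device only writes this one field, so a
// concurrent device write does not corrupt neighboring memory.
#[repr(C)]
struct Completion {
    status: AtomicU32, // device writes 1 on completion; host clears to 0 before reuse
}

fn is_done(c: &Completion) -> bool {
    // Treat the value as untrusted: anything other than 1 means "not done".
    c.status.load(Ordering::Acquire) == 1
}
```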
## Using DMA in a Device Driver
Twizzler will provide an interface for making a single Twizzler object accessible to a device by way of the `DmaObject` type exposed by the twizzler-driver crate. The `DmaObject` can be created from any Twizzler object, and exposes APIs for ensuring coherence and memory safety. Let's take as an example a device that has a command ring buffer that is used to submit commands and to indicate when a command has been completed. A command in the ring buffer can point to another DMA buffer that is used to transfer data, and may look like the following:
```rust
struct Command {
    op: u32,
    status: u32,
    buffer: u64,
}
```
The `op` field specifies some operation to perform (send packet, etc.), the `status` field specifies the result of the command (say, for example, it is set to 1 when the command is completed and must be cleared to zero for a command to be processed), and the `buffer` field points to the physical address of some buffer. Let's also imagine some mechanism for communicating the head of the ring to the device, so that we can hand it a collection of new commands to process via a write to some MMIO register. For the sake of simplicity, let's assume that the buffer is at most 1 page long.
Setting up some DMA regions may look like:
```rust
let object = create_new_object();
let dma = DmaObject::new(object);
let command_ring = dma.slice_region::<Command>(some_command_len, Access::BiDirectional, DmaOptions::default());
let buffer = dma.slice_region::<u8>(some_buffer_len, Access::HostToDevice, DmaOptions::default());
```
At this point, `command_ring` and `buffer` have types `DmaSliceRegion<Command>` and `DmaSliceRegion<u8>` respectively. Note that we distinguish between `DmaRegion` and `DmaSliceRegion`. Both serve a similar purpose, but have slightly different signatures on some functions. For example, both provide a `with` function (see below), but the `DmaSliceRegion` version allows specifying a sub-slice. The rest of this document will use `DmaRegion` to stand for both types to avoid duplicating the specification.
We can use `DmaRegion::pin()` to get a list of the physical pages associated with the region so that we may program the device to operate on this command ring. Then, submitting a command would look like:
```rust
buffer.with_mut(0..0x1000, |buf| {
    fill_out_buffer(buf);
});
// Grab a 'pin' of the buffer, which ensures that the associated physical addresses and IOMMU
// mappings will remain static until the DMA object is dropped.
let buffer_pin = buffer.pin().unwrap();
// Get the physical address of the first page.
let buffer_addr = buffer_pin[0].addr();
// Fill out a new command.
command_ring.with_mut(0..1, |ring| {
    ring[0] = Command::new(buffer_addr);
});
increment_head();
```
A pin object can manually release the pages it refers to, but otherwise the lifetime of pinned physical memory is the same as that of the `DmaObject` itself. Tying pin lifetime to the DMA object rather than to the pin object reduces the management complexity of avoiding accidentally programming a device with stale physical addresses.
The `DmaRegion::with_mut` function runs a closure while ensuring coherence between host and device. Before running the closure, it ensures any writes from the device are visible; after running the closure, it ensures that any writes made by driver software are visible to the device. A similar function, `with`, allows driver software to read the DMA region but not write it, allowing the system to skip the coherence operations for host-to-device writes.
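For example, polling a command's completion status only needs read access. A minimal sketch (assuming, as is typical for such APIs, that `with` returns the closure's result):

```rust
// Read-only access via `with`: no host-to-device coherence work is required afterwards.
let done = command_ring.with(0..1, |ring| ring[0].status == 1);
```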
## Simple Allocation
If a driver needs to allocate a large number of dynamically sized DMA regions, doing so with a single object may prove difficult, as we can easily run out of space. Thus twizzler-driver also provides a type for managing a collection of `DmaObject`s, all of a similar type: `DmaPool`. We can use it as follows:
```rust
let pool = DmaPool::new(DmaPool::default_spec(), Access::HostToDevice, DmaOptions::default());
let region = pool.allocate::<Foo>(Foo::default()).unwrap();
// Dropping region causes it to deallocate.
```
## Coherence Models and Memory Safety
In the above example, we used the default DMA options, which ensure the following:
- Writes by host software are readable by the device once the `with_mut` function returns.
- Coherence is synchronized at the start of the `with` or `with_mut` calls.
More relaxed models are available that do not do any synchronization unless the driver explicitly calls `DmaRegion::sync`. Note that we are not ensuring that no memory access conflicts occur between the device and driver software, since that is not possible to do at compile time or at runtime[^1]. We are further not ensuring that the device maintains the ABI of the `Command` type. In this example, this doesn't really matter, as all the constituents of this type are simple integers, but imagine instead that `status` were an enum with only a few defined values. The device could update the value of `status` to an undefined value, which would cause problems.
To avoid the type ABI problem, we require that a region's type implement the `DeviceSync` and `Copy` marker traits. The `DeviceSync` trait is a promise that the ABI of the type can handle any update to it that the device might make, and that it can handle possible memory conflicts with writes from the device.
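As an illustration, marking the example `Command` type might look like the sketch below. Whether the impl must be written as `unsafe impl`, or can be derived, depends on how `DeviceSync` is ultimately declared; this is an assumption, not settled API:

```rust
// Every field is a plain integer, so any bit pattern the device writes is still a valid
// `Command`, and concurrent device writes cannot break the type's ABI.
#[derive(Clone, Copy)]
#[repr(C)]
struct Command {
    op: u32,
    status: u32,
    buffer: u64,
}

impl DeviceSync for Command {}
```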
[^1]: Efficiently, anyway. We could use the IOMMU to ensure that physical addresses are only available for the device to access during certain windows. However, this would involve a LOT of system calls and IOMMU reprogramming, which is currently not terribly fast. Note, however, that as written, this API would allow for this kind of enforcement if we choose to do it in the future.
## Shared Objects
One final consideration is for drivers that want to point devices towards object memory that exists within an object that is shared across different programs. The twizzler-driver library cannot (at this level of the system) enforce mutability rules for these objects. Thus driver software should use the manual sync operations to ensure coherence (of course, parts of the object modified via the `with` functions will still have coherence rules applied as normal; see above).
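A minimal sketch of that pattern follows. The exact signature of `sync` is not specified by this RFC, so the range argument is an assumption, and `program_device` and `parse` are hypothetical helpers:

```rust
// The host wrote into the shared object outside of `with_mut`, so flush explicitly
// before handing the region to the device.
shared_region.sync(0..len);
program_device(&shared_region);

// ... later, after the device signals completion, synchronize again before reading.
shared_region.sync(0..len);
let result = shared_region.with(0..len, |buf| parse(buf));
```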
## `DmaOptions` and `Access`
`DmaOptions` modify how a region (or pool, see below) of DMA memory is treated by the host. The options are a bitwise-OR'd collection, with the following defined:
- `UNSAFE_MANUAL_COHERENCE`. Default: no. If set, the `with` functions do not perform any coherence operations.
The `Access` enum specifies the direction of the DMA transfers that will be made with this region, and can be used to optimize coherence and inform access control for IOMMU mappings. The options are listed below, followed by a short usage sketch:
- `HostToDevice` -- for transfers in which the device reads.
- `DeviceToHost` -- for transfers in which the device writes.
- `BiDirectional` -- for transfers in which the device reads and writes.
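For example, a receive-buffer pool that the device writes and the host reads, with coherence handled manually, might be set up as in the sketch below (this assumes `UNSAFE_MANUAL_COHERENCE` is exposed as an associated constant on `DmaOptions` in the usual bitflags style):

```rust
let rx_pool = DmaPool::new(
    DmaPool::default_spec(),
    Access::DeviceToHost,
    DmaOptions::UNSAFE_MANUAL_COHERENCE,
);
// 2 KiB receive buffers; the driver must sync before reading device-written data.
let rx_buf = rx_pool.allocate::<[u8; 2048]>([0u8; 2048]).unwrap();
```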
# Reference-level explanation
## Kernel API
Accessing physical mapping information is done, from consumers of the twizzler-driver API, via the `pin` function on a `DmaObject`. The pin function learns about physical mappings from the kernel by calling a KAction command on the underlying object for pinning pages, which returns a token along with information about physical addresses. That token is tied, in the kernel, to the list of physical mapping information that the call returns. After this call returns, the kernel ensures that the mapping information it has returned stays correct ("active") until the pin is manually released via another KAction call.
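The exact KAction encoding is left to the implementation; the shape of the data returned to userspace might look roughly like the following (all names here are hypothetical, for illustration only):

```rust
/// Handle returned by the pin call; used later to release the pin.
struct PinToken(u32);

/// Physical (or bus) address of one pinned page.
struct PhysInfo {
    addr: u64,
}

/// What the pin call hands back: a token plus one entry per pinned page.
struct PinResult {
    token: PinToken,
    pages: Vec<PhysInfo>,
}
```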
Internally, the kernel will manage per-object information on which pages are pinned so as to not evict or change such pages. Ensuring that these active pins remain correct requires some interaction with the copy-on-write (COW) mechanism in the object-copy functionality. In particular, pins do not get copied into new objects that source data from an existing object. However, if a pin applies to a source object, that object is copied (via COW) to a new object, the copied range intersects the pin, and a write is then performed to the pinned region in the source object while the underlying pages are still shared for COW, then the kernel will need to copy the page for all the other objects instead of just the source object. For this reason, we will break the kernel-side implementation into two feature gates:
- Basic pin support, but not supporting COW-intersecting-with-pins.
- Full support as described above.
## Userspace Implementation
Let's consider the examples in the previous section and discuss implementation.
### `DmaObject::slice_region` and `DmaObject::region`
These provide simple ways to return some object memory as a `[T; N]` or a `T`. They return a struct (`DmaRegion`) that manages a typed region of memory of a size determined by `T` (and `N`), and exposes the `pin` function.
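Concretely, usage might look like the sketch below (reusing the `dma` object from the guide-level example, and assuming `region` takes the same access and options arguments as `slice_region`, minus the length):

```rust
// A single typed value backed by DMA-able object memory.
let single: DmaRegion<Command> =
    dma.region::<Command>(Access::BiDirectional, DmaOptions::default());
// A slice of 16 such values.
let ring: DmaSliceRegion<Command> =
    dma.slice_region::<Command>(16, Access::BiDirectional, DmaOptions::default());
```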
### `pin`
The pin function calls the kernel to set up a pin and then manages the pin information in memory, allowing it to be called multiple times without having to negotiate with the kernel each time. Note that pins are not released to the kernel when the `DmaRegion` is dropped; instead, all pins on an object are released when the `DmaObject` is dropped.
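For example, a driver might walk the pinned pages to build a simple scatter-gather list, as in this sketch (it assumes the pin exposes a page count and per-page indexing as in the earlier example; `write_sg_entry` is a hypothetical device-specific helper):

```rust
let pin = region.pin().unwrap();
for i in 0..pin.len() {
    // One scatter-gather entry per pinned page.
    write_sg_entry(i, pin[i].addr());
}
```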
### `with` and `with_mut`
These functions provide a closure with access to the internal memory managed by a `DmaRegion`. Before the closure is run, they ensure coherence so the host can read the memory, and after the closure is run they ensure coherence so the device can read the memory. The `with` variant may skip the second step.
### Pools and Allocation
Regions can also be obtained from a `DmaPool`, which internally manages a collection of `DmaObject`s (derived from objects that it creates as needed). All regions created this way share the `DmaOptions` with which the pool was created. Allocation is managed internally via a memory allocation algorithm; however, all regions must be aligned to page size.
# Drawbacks
DMA is vital to any driver written for the vast majority of devices that we care about. However, the particular design choices herein do have some drawbacks:
- Pinning memory adds complexity to the eviction algorithms in the kernel and the pager, as they need to be made aware of pinned memory.
- There is currently no attempt to limit the amount of pinned memory an application can request, thus opening an easy door to denial of service attacks. We can mitigate this somewhat via access control.
- Currently we don't define a way to request that all physical addresses fit within 32 (or fewer) bits, as the kernel is not currently set up to manage memory in a way that would make this easy. Ensuring physical addresses stay under the 4G mark is useful (mostly) for older hardware that cannot work with 64-bit addresses. Currently, we don't have any immediate need to support such hardware. If the need arises, however, we can extend `DmaOptions` to include specifications for physical memory address limitations.
# Rationale and alternatives
## Pinned Memory
The overall goal is to make it possible for userspace code to program devices to access physical memory. We want to stay within the overall Twizzler object model to do this, thus the API herein is focused around making it possible to create objects that we can then use for DMA transfers.
## Pin Leaks
One immediate concern is pin leaks. Since the pins must be manually released to the kernel by the library (not the user), we can imagine a device driver crashing and causing a section of object memory to be pinned forever (at least, until the kernel restarts). The decision to allow this, however, is intentional.
Should a device driver crash, we have no guarantee about the state of the device it was programming. It is entirely possible that the device remains programmed with physical addresses that it continues to use after the driver crashes. If pins were removed when the program that created them crashed, the device could blindly write to memory that no longer refers to the object memory originally intended by the (now crashed) driver.
Thus the choice to allow leaks in the face of a driver malfunction is there to mitigate the possibility of corrupted memory. Of course, use of an IOMMU may be able to mitigate this as well; however, I do not wish to rely on it, and doing so would also introduce many inefficiencies. If we prove able to efficiently make use of IOMMU hardware in the future, this design may change.
# Prior art
Basically every major operating system provides some API for setting up DMA transfers. Most of them are quite similar, largely relying on the driver to manually synchronize for coherence and/or specify directionality of transfers. Some (e.g. Linux) additionally classify a region of memory-to-be-DMA'd as fully coherent or streaming, usually using this information to specify caching type for memory mappings.
Fuchsia uses a similar mechanism to what is outlined herein, also supporting pinned memory regions. There are a number of differences, however, that largely stem from our desire to (at least somewhat) follow Rust memory safety requirements. Fuchsia, in addition, does allow a limited ability to control contiguity of physical memory, which we do not (yet).
FreeBSD and Linux both have a significantly different style of interface, stemming from the fact that they implement their device drivers in-kernel, and so their DMA interfaces can be tightly coupled with their virtual memory subsystem. The two systems differ in the details of how they control coherence and synchronization (single-versus-all-cpus, streaming-versus-coherent, and FreeBSD allowing finer control over sync operations) and how they control contiguity and maximum address size. Otherwise, the differences are largely down to API specifics and not really functionality, with the exception of FreeBSD supporting a recursive-like pattern of region configuration inheritance, which is kinda cool.
Windows offers little additional insight into DMA operation and design tradeoffs, except as a case study of how not to name functions or other aspects of an API.
# Future possibilities
This RFC is only intended to cover "dumb" devices -- that is, devices that are fully programmed by host software and, while they may interact with memory via DMA, do not really "go off on their own". Essentially, this covers most devices on the market that do things like NVMe, networking, etc. Such devices are fully controlled by the host: all memory access is either initiated by the host or initiated by the device to a pre-programmed section of memory, and the device can be thought of as a simple state machine.
In the future it may be better to model devices as fully separate machines that access physical memory cooperatively with the host CPU and run their own programs. Should we reach that future, we will probably need a new model for programming such devices, one that exceeds in richness the model presented here.