The Twizzler RFCs Book

  • Start Date: 2020-06-01

This document is copied and modified from the Rust community's RFC 0002-rfc-process.md.

Summary

The "RFC" (request for comments) process is intended to provide a consistent and controlled path for new features to enter the language and standard libraries, so that all stakeholders can be confident about the direction the language is evolving in.

Motivation

For Twizzler to become a mature platform we need to employ self-discipline when it comes to changing the system. This is a proposal for a more principled RFC process to make it a more integral part of the overall development process, and one that is followed consistently to introduce features to Twizzler.

Detailed design

Many changes, including bug fixes and documentation improvements, can be implemented and reviewed via the normal GitHub pull request workflow.

Some changes though are "substantial", and we ask that these be put through a bit of a design process and produce a consensus among the Twizzler community and the [core team].

When you need to follow this process

You need to follow this process if you intend to make "substantial" changes to the Twizzler Operating System. What constitutes a "substantial" change is still evolving, but may include the following:

  • Any user visible breaking change (API change).
  • Anything that would violate the [Principle of Least Astonishment](https://docs.freebsd.org/en/books/handbook/glossary/#pola-glossary).
  • Removing user visible features.

Some changes do not require an RFC:

  • Rephrasing, reorganizing, refactoring, or otherwise "changing shape does not change meaning".
  • Additions that strictly improve objective, numerical quality criteria (warning removal, speedup, better platform coverage, more parallelism, trap more errors, etc.)
  • Additions only likely to be noticed by other developers-of-twizzler, invisible to users-of-twizzler.

If you submit a pull request to implement a new feature without going through the RFC process, it may be closed with a polite request to submit an RFC first.

What the process is

In short, to get a major feature added to Twizzler, one must first get the RFC merged into the RFC repo as a markdown file. At that point the RFC is 'active' and may be implemented with the goal of eventual inclusion into Twizzler.

  • Fork the RFC repo twizzler-operating-system/rfcs.git
  • Copy 0000-template.md to text/0000-my-feature.md (where 'my-feature' is descriptive; don't assign an RFC number yet).
  • Fill in the RFC
  • Submit a pull request. The pull request is the time to get review of the design from the larger community.
  • Build consensus and integrate feedback. RFCs that have broad support are much more likely to make progress than those that don't receive any comments.

Eventually, somebody on the [core team] will either accept the RFC by merging the pull request, at which point the RFC is 'active', or reject it by closing the pull request.

Whoever merges the RFC should do the following:

  • Assign an id, using the PR number of the RFC pull request. (If the RFC has multiple pull requests associated with it, choose one PR number, preferably the minimal one.)
  • Add the file in the text/ directory.
  • Create a corresponding issue on Twizzler repo
  • Fill in the remaining metadata in the RFC header, including links for the original pull request(s) and the newly created Twizzler issue.
  • Add an entry in the Active RFC List of the root README.md.
  • Commit everything.

Once an RFC becomes active, authors may implement it and submit the feature as a pull request to the Twizzler repo. An 'active' RFC is not a rubber stamp, and in particular still does not mean the feature will ultimately be merged; it does mean that in principle all the major stakeholders have agreed to the feature and are amenable to merging it.

Modifications to active RFCs can be done in follow-up PRs. An RFC that makes it through the entire process to implementation is considered 'complete' and is removed from the Active RFC List; an RFC that fails after becoming active is 'inactive' and moves to the 'inactive' folder.

Basic Objects

Objects in Twizzler form the basis for memory organization and data identity in a global address space. The core features that all objects share are:

  • Invariant references. Pointers in an object can reference any other object regardless of location.
  • Base structure. Objects have a "known entry point" that represents the overall type of the object.
  • Metadata. Objects have a metadata structure that contains core metadata.
  • Access control. Objects may be read/write/execute depending on security context.
  • Metadata extensions. An object may "respond" to various APIs.

This document serves to outline the core object layout and the design and rationale of the twizzler-object crate. In short, twizzler-object provides safe functions to get always-immutable information about objects, unsafe functions to get references to any object memory, a low-level part of the runtime that manages memory slots for objects, and some type definitions for object layout. Any safe (transactional) mutability of objects will be done via twizzler-nando, which will be described in an upcoming RFC.

Object Layout

Objects are 1GB contiguous, sparsely populated collections of pages, each identified by a unique 128-bit ID. Objects may be either mutable or immutable (this property is, itself, immutable)1. An object may be either shared or volatile. A shared object is one that may be operated on by multiple instances of a host2.

The high-level layout is:

+---+--------------------------+---+-------+
| N | B                      F | M | E     |
| U | A -->              <-- O | E | X --> |
| L | S -->              <-- T | T | T --> |
| L | E                      e | A | s     |
+---+--------------------------+---+-------+

The very first page of an object is left unmapped (the null page). This serves two purposes:

  • Invariant pointers can be NULL and always refer to an unmapped page.
  • An unmapped page between objects prevents runaway writes from escaping an object.

After the null page we have the base of the object, which acts as a type for the object. The base can be unit if there is no meaningful base type. The rationale for having a base type is that it provides a simple way to represent the type of an object and an entry point into it. It allows the kernel to interact with some kinds of objects without needing to understand the more complex aspects of object layout.

After the base, the object memory is application-defined up until the end of the foreign object table (FOT), which grows downward. Often the remaining memory after the base is given to a per-object allocator. Near the top of the object, there is a metadata struct that has the same layout for all objects. Following the metadata struct is a number of extensions that allow the object to specify that it responds to a certain API.

The metadata region is needed because we need a mechanism for invariant pointers (the FOT), so the overhead of the metadata is not merely justified, it is required. Even for objects with no outgoing pointers, the metadata struct provides information for object ID derivation, base type verification, and commonly used meta extensions.

1. Note that there's a chicken-and-egg problem here. An immutable object is immutable upon creation, and so it can only be created via the object create system call (which supports scatter-gather specification for how to construct object memory).

2. An instance of a host includes 1) different computers, 2) different power cycles of a single computer, or 3) different kernels running on different heterogeneous devices in a single computer. Another way to think about it is that volatile objects fate-share with the running kernel they are associated with. Note that this doesn't mean that volatile objects are limited to a single host, but that they must move instead of being shared.

Meta types

The meta struct is defined as follows:

struct MetaData {
    nonce: Nonce,
    kuid: ObjID,
    p_flags: u32,
    flags: u32,
    fotentries: u32,
    metaexts: u32,
    basetag: Tag,
    version: Version,
}

The nonce, kuid, and p_flags fields are related to security; flags is reserved for future extension; fotentries and metaexts count the number of FOT entries and meta extensions, respectively; and basetag and version are used for BaseType versioning and verification.

The FOT starts just below the metadata struct, and is an array of entries defined as follows:

struct FOTEntry {
    union {
        id: ObjID,
        nameinfo: { name: u64, resolver: u64 },
    },
    refs: u32, // ref count of this entry
    flags: u32, // requested protections
    resv0: u32, // unused (must be zero)
    resv1: u32, // unused (must be zero)
}

The meta extensions entries are defined as:

struct MetaExt {
    tag: u64,
    value: u64,
}

The tag field specifies which extension this is, and the value field is dependent on the extension. Tags have a simple constraint: if the top 32 bits are zero, the value is reserved for the system.

The twizzler-object Crate

The twizzler-object crate provides a foundation and interfaces for interacting with Twizzler objects. It exports a selection of core types and traits that make it possible to build a higher-level management system (if required):

  • The metadata types defined above (may be reexported from the twizzler-abi crate).
  • Object creation, deletion, and lifetime controls.
  • The Object<T> type.
  • Invariant pointers.
  • Object Safety trait.

Object Safety

The crate provides an auto marker trait: ObjSafe. The ObjSafe trait denotes two things:

  • The data structure does not contain a non-invariant pointer.
  • The data structure ensures that mutation is possible only via the twizzler-nando crate.

The first is done via impl<T> !ObjSafe for *const T (and similar for other raw pointer types), and the second is done by implementing !ObjSafe for UnsafeCell. Any memory that is located in an object should implement ObjSafe (which usually happens automatically). Of course, a data structure may unsafely implement ObjSafe.
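
A sketch of what these declarations could look like (the real definitions live in twizzler-object; the auto_traits and negative_impls nightly features are assumed here):

// Sketch only: mirrors the description above, not the actual crate source.
#![feature(auto_traits, negative_impls)]
use core::cell::UnsafeCell;

/// Automatically implemented for types whose fields are all object-safe.
pub unsafe auto trait ObjSafe {}

// Raw pointers are not invariant pointers, so they are never object-safe.
impl<T> !ObjSafe for *const T {}
impl<T> !ObjSafe for *mut T {}

// Interior mutability must go through twizzler-nando, so UnsafeCell is opted out.
impl<T> !ObjSafe for UnsafeCell<T> {}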

The BaseType trait

Object base types have a few constraints on top of simple object safety. They must be able to prove that some persistent value stored in memory is a valid instance of the type without any provenance, and they must provide an initialization mechanism for creating objects. The trait contains the following:

trait BaseType {
    /// Constructs a Self, optionally using some arguments.
    fn init<T>(args: T) -> Self;
    /// List of all tag, version pairs supported by this type.
    fn tags() -> &'static [(Tag, Version)];
}

A BaseType's init function is called by the object creation function as a way to create the initial object's base. An additional mechanism for creating an object by running an arbitrary transaction will also be supported.

The tag and version information is used as a runtime check for Object<T: BaseType>::base(), to ensure that an object actually has a T at its base. The version information is currently matched against but otherwise unused; the purpose is to later implement an upgrade mechanism.

Object<T>

The key type the twizzler-object crate provides is the Object<T> type, which represents a single Twizzler object. This type exposes the following interfaces:

fn id(&self) -> ObjID;
fn init_id(id: ObjID, prot: Protections, flags: InitFlags) -> Result<Self, InitError>;
fn base(&self) -> &T;

Note that the base function returns an immutable reference to the base, and there is no safe way to get a mutable reference. This is because all mutation should be done via the twizzler-nando crate. In addition to the above functions, the twizzler-object crate provides unsafe functions to get mutable references to (e.g.) the base, and functions to read immutable fields of the meta struct.
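
For illustration, opening an object by ID and reading its base might look like the sketch below (the import paths, protection and flag values are assumptions; BaseQueue is a made-up base type assumed to implement BaseType):

use twizzler_object::{InitFlags, ObjID, Object, Protections};

// A made-up base type; in real code this would implement BaseType (and be ObjSafe).
#[repr(C)]
struct BaseQueue {
    head: u64,
    tail: u64,
}

fn print_head(id: ObjID) {
    // Map the object; the protection and flag values here are placeholders.
    let obj: Object<BaseQueue> =
        Object::init_id(id, Protections::READ, InitFlags::empty()).expect("init failed");

    // base() performs the BaseType tag/version check and returns &BaseQueue.
    let base = obj.base();
    println!("head = {}", base.head);
}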

Slots

A key interface provided by twizzler-object is management of slots of memory. This allows programs following pointers to reuse already-mapped object slots for new references, reducing kernel overhead. The exact interface is unstable, as it is intended to be used internally only to assist in the implementation of twizzler-nando.

Standard Meta Extensions

Twizzler reserves meta extensions whose tag value has the top 32 bits all zero for system use and for standard universal extensions. Two system-use values, handled in the scan sketch after this list, are:

  • null (tag = 0x0): This marks the end of the metaext array (which may be before the end as specified by MetaData::metaexts).
  • tombstone (tag = 0xdeadbeef): A previous entry that has been deleted, and should be ignored. It may be reused, and it does not mark the end of the array.
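
A sketch of a scan that follows these rules (MetaExt is the struct defined earlier; locating the extension array in object memory is elided):

const TAG_NULL: u64 = 0x0;
const TAG_TOMBSTONE: u64 = 0xdeadbeef;

/// Return the value of the first extension with `tag`, honoring the null
/// terminator and skipping tombstoned entries.
fn find_meta_ext(exts: &[MetaExt], tag: u64) -> Option<u64> {
    for ext in exts {
        match ext.tag {
            TAG_NULL => return None,   // end of the array
            TAG_TOMBSTONE => continue, // deleted entry; may be reused later
            t if t == tag => return Some(ext.value),
            _ => continue,
        }
    }
    None
}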

Size (tag value: 0x1)

Some objects may have a notion of "size", where the amount of data in the data region grows and shrinks such that there is always a maximal size (e.g. file size in Unix). The value part of the extension contains this size value. The API will be explored in a future RFC.

Invariant Pointers

An invariant pointer functions similarly to a raw pointer in semantics. It does not convey lifetime or reference counting, and may be null.

// size: 64 bits
#[repr(transparent)]
struct InvPtr<T> {
    ...
}

impl<T> InvPtr<T> {
    fn null() -> Self;
    fn is_null(&self) -> bool; 
    fn from_parts(fote: usize, off: u64) -> Self;
    fn raw_parts(&self) -> (usize, u64);
}

The actual interesting aspects of the invariant pointers will be implemented in the twizzler-nando crate.
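
A minimal usage sketch of the (FOT entry, offset) representation through this interface (resolving a pointer to a real reference is left to twizzler-nando):

use twizzler_object::InvPtr;

fn invptr_example() {
    // A null pointer always refers into the unmapped null page.
    let p: InvPtr<u64> = InvPtr::null();
    assert!(p.is_null());

    // A pointer to offset 0x1000 within the object named by FOT entry 1.
    let q: InvPtr<u64> = InvPtr::from_parts(1, 0x1000);
    assert_eq!(q.raw_parts(), (1, 0x1000));
}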

Alternatives

The design of invariant pointers forms the basis for sharing and the global address space. Other invariant designs (fat pointers) have problems.

The twizzler-object crate is small, and provides only a limited view of objects. Any part of an object that can mutate cannot be exposed by the twizzler-object crate at all, even via immutable reference. The only exception is atomically reading an invariant pointer, but even this requires interpretation via the FOT to be useful, and this requires interaction with twizzler-nando to be safe. The twizzler-object crate does expose unsafe functions for getting access to mutable object memory for the purposes of implementing twizzler-nando atop twizzler-object.

Status

This document is a draft and must be completed. Initial exploratory work has begun in the dbittman-object branch on the twizzler repository.

Future

Future, planned RFCs include:

  • Names in FOT entries, and manual FOT entry specification.
  • Append-type objects and the Size meta extension.
  • The twizzler-nando crate.
  • More on versioning.

Summary

This document introduces a time sub-system into Twizzler. The time sub-system lays down the foundation for Twizzler's notion of time and interfaces for user space programs. We introduce a trait that abstracts the hardware, new system calls in the kernel, and types to represent clocks.

Motivation

We care about supporting user space programs, particularly Rust's standard library. A big part of that is providing APIs for time. User programs require services for time to do useful things such as benchmarking.

Rust exposes an interface for time to user space through various types.

  • Duration: represents a span of time (sec and ns, like timespec)
  • Instant: a monotonic non-decreasing type that is implemented using os-specific system calls
    • e.g. clock_gettime(CLOCK_MONOTONIC) for Linux
  • SystemTime: represents an anchor in time from the unix epoch, non-monotonic
    • e.g. clock_gettime(CLOCK_REALTIME) for Linux

To support the current APIs in Rust, we need to provide an implementation of a monotonic clock and system clock. Twizzler currently provides stubs so that Rust can call into the kernel.

The outcome of this work is to:

  • Provide strong support for user space Rust applications using its time APIs
  • Develop interfaces for Twizzler native programs
  • Develop standard interfaces for the kernel to manage the hardware used for clocks.

Guide-level explanation

The time interface exposed to users is traditional in the sense that all mechanisms in the kernel hide behind system calls. However, user programs are free to use libraries built on top of system calls, as is done in the Rust standard library.

One could imagine a simple Rust program that reads the elapsed time from a monotonic clock, ported over to Twizzler as so:

use std::time::Instant;
fn main() {
    // reads a moment in time
    let now = Instant::now();
    
    // elapsed time since now
    // this returns a Duration type
    let elapsed_time = now.elapsed();
    println!("Running main() took {} nanoseconds.", elapsed_time.as_nanos());
}

Clock and ClockInfo

To support the above functionality, Twizzler exposes a Clock abstraction to user space, and APIs revolve around this and ClockInfo. A Clock is a logical abstraction exposed to users that serves a particular purpose and ticks at a particular pace.


struct Clock {
    info: ClockInfo,
    group: ClockGroup,
    id: ClockID
}

ClockInfo is meant to describe a clock to users. It contains the last value read from the clock and properties of the clock, such as resolution.


struct ClockInfo {
    value: TimeSpan,
    resolution: FemtoSeconds,
    precision: FemtoSeconds,
}

The value is a TimeSpan type meant to represent a span of time. To support high precision clock hardware we represent some duration of time using two u64 values, one for seconds, and another for the remainder in femtoseconds. Each value is its own type meant to represent the units. The TimeSpan type can easily be converted to support legacy timestamps such as Duration or timespec/timeval. TimeSpan exposes an interface similar to Duration.

Thus, ClockInfo can be thought of as a POSIX timespec, but much better. It provides descriptions of what is being read, not just the value. This additional metadata is essential for scientific applications that want to use time as a tool for measuring some event. Applications need to know about the properties of clocks to get accurate measurements and reduce errors in experiments.

We define resolution as the period of a clock tick; in other words, the rate at which a clock advances time. The resolution is expressed in Femtoseconds, which is a simple wrapper type for a u64 value. We choose to do this to make the semantic meaning of the value clear. Furthermore, we define precision as the stability of the measurements, also measured in Femtoseconds. Another value of interest is accuracy, which tells us how close measurements are to the true value based on some reference. This is useful in determining clock error, and will be explored in a future RFC.


#[repr(transparent)]
struct Seconds(u64);

#[repr(transparent)]
struct FemtoSeconds(u64);

struct TimeSpan(Seconds, FemtoSeconds);
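
As a sketch of the conversion to Duration mentioned above (the real one would live alongside TimeSpan; the constant name is illustrative), truncating the femtosecond remainder to Duration's nanosecond precision:

use std::time::Duration;

const FEMTOS_PER_NANO: u64 = 1_000_000;

impl From<TimeSpan> for Duration {
    fn from(t: TimeSpan) -> Duration {
        let TimeSpan(Seconds(secs), FemtoSeconds(femtos)) = t;
        // Duration only carries nanoseconds, so the sub-nanosecond part is dropped.
        Duration::new(secs, (femtos / FEMTOS_PER_NANO) as u32)
    }
}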

ClockGroup is an enum that associates some semantic meaning with a clock. This gives users control over what type of clock they are reading from (when talking to the kernel), and they can expect certain invariants to be maintained, such as a clock being monotonic.


enum ClockGroup {
    Unknown,
    Monotonic,
    RealTime,
}

The kernel internally manages a list of usable clocks backed by hardware. The id identifies which clock source is used when interacting with the Clock. The ClockID is a simple wrapper type around a u64.


#[repr(transparent)]
struct ClockID(u64);

User programs can get time in a variety of ways: transparently through Rust's std::time module, directly through system calls, or through methods exposed by the Clock type.

System Call Interface

Revisiting our example from earlier, let's see how it would work when performing a system call:

use twizzler_abi::syscall::sys_read_clock_info;
use crate::clock::{ClockInfo, ClockSource, ReadClockFlags};

fn main() {
    // reads a moment in time
    // returns a ClockInfo type
    let now = sys_read_clock_info(ClockSource::BestMonotonic, ReadClockFlags::empty());
    
    // elapsed time since now
    let later = sys_read_clock_info(ClockSource::BestMonotonic, ReadClockFlags::empty());
    // ClockInfo.value() returns a TimeSpan type
    let elapsed_time = later.value() - now.value();
    println!("Running main() took {} nanoseconds.", elapsed_time.as_nanos());
}

A few things to note here. For starters, sys_read_clock_info is a new system call. We pass in the clock we want (ClockSource) and some flags. By default, this returns a ClockInfo object with all fields filled in. The returned ClockInfo, in this case, is generated from a ClockGroup-specific clock. We will get to the use of ClockID later.


enum ClockSource {
  BestMonotonic,
  BestRealTime,
  ID(ClockID)
}

There might be more than one piece of hardware that can be used to serve as the backing for a specific ClockGroup. Hence, sys_read_clock_info returns a value read from the best clock source available. The semantic meanings of ClockSource map directly to ClockGroup.

Functionally, this is the same program, except it uses different abstractions. The ClockInfo type has a set of methods to return internal values. We calculate the elapsed time by subtracting two TimeSpan types that sampled different points in time from the same clock. Internally, std::time does something like this using the Instant type.

Twizzler exposes a new set of system calls related to timekeeping. Other than sys_read_clock_info, which is helpful in reading a clock and learning about its properties, users need support to discover available clocks.


fn sys_read_clock_info(source: ClockSource, flags: ReadClockFlags) -> Result<ClockInfo, ReadClockError>;
fn sys_read_clock_list(clock: ClockGroup, flags: ReadClockFlags) -> Result<VecDeque<Clock>, ReadClockError>;

Should a user need detailed information about clocks exposed by the kernel to user space, they could use sys_read_clock_list. By default, it returns a list of clocks for every type of clock exposed (ClockGroup). All information in the ClockInfo except the current value is also returned. For clocks with more than one clock source, the first one is returned. Users can get a list of all clocks, and thus all clock sources, for a particular type by specifying the ClockGroup and setting the appropriate flag.

If a user wants to read an arbitrary clock's value, they could specify the ClockID given to them by sys_read_clock_list.


// reads all clocks as candidates for monotonic
let clocks = sys_read_clock_list(ClockGroup::Monotonic, ReadClockFlags::ClockGroup).expect("error message");

// reference to the last clock in list
// clock type has id of backing clock source
let clk = clocks.last().unwrap();

// reading from some arbitrary clock source
let now = sys_read_clock_info(ClockSource::ID(clk.id()), ReadClockFlags::empty());

// ClockInfo.value() returns a TimeSpan type
println!("Current value of clock: {} nanoseconds.", now.value().as_nanos());

The last thing a user might want to do is steer the clock to prevent drift. This is useful for systems that need precise values from an accurate clock. Real-world applications such as PTP/NTP need an interface like this. A full detailed explanation and implementation are out of scope for this RFC and will be explored in another RFC.

Clock Interface

The last way of accessing a clock is by using Twizzler's native library for clocks. Each Clock has a set of operations that map directly to the time-related system call interface.


fn read(&self) -> TimeSpan;
fn info(&self) -> ClockInfo;

Our running example would look something like this:

fn main() {
  // gets a reference to the monotonic clock
  let clock = Clock::get(ClockGroup::Monotonic);

  // read a moment in time
  // returns TimeSpan 
  let now = clock.read();

  // elapsed time since now
  let later = clock.read();
  // Clock.read() returns a TimeSpan type
  let elapsed_time = later - now;
  println!("Running main() took {} nanoseconds.", elapsed_time.as_nanos());
}

The benefit of doing things this way is that users interact with time at a much higher level than the system call interface, and they are given useful clock metadata. If all a user cares about is the passage of time, then Instant should suffice. However, if they are curious about the precision or accuracy of a clock, then this interface is one way of doing so. It is much cleaner.

Reference-level explanation

Twizzler needs an abstraction over hardware used for timekeeping to support the interfaces exposed to users. The purpose of the abstraction in Twizzler is so that the kernel can support different types of hardware. Time can come from many sources: some counter or programmable timer on the processor or board. Another thing the kernel needs is standard interfaces to manage timekeeping hardware.

ClockHardware and Ticks

The kernel achieves both of these through the ClockHardware trait defined as follows:


trait ClockHardware {
    fn read(&self) -> Ticks;
    fn info(&self) -> ClockInfo;
    // start, isEnabled, callback, etc.
}

Rather than having a concrete type, we provide an interface implemented by different architectures for different time sources. ClockHardware exposes methods to read time or get a description of the hardware backing it. This leaves more room to introduce other useful methods, such as disabling/enabling a hardware timer.

The necessary interfaces will be clear as we integrate this design with the existing kernel code. For example, the kernel currently has a stat clock which is programmed through a number of somewhat ad-hoc APIs. The idea would be to use this to manage the hardware backing the stat-clock in a hardware-agnostic way.

The purpose of read is to read a value provided by hardware, which requires some assembly, and returns Ticks meant to represent raw time:


struct Ticks {
    value: u64,
    rate: FemtoSeconds
}

Ticks represent some duration on the clock. The width of value, which is 64 bits, is not fundamental and could change. The value can be scaled to some unit of time by multiplying it by the rate of the time source. This multiplication produces a TimeSpan.
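
A sketch of that scaling (the method name and the u128 intermediate are assumptions; rate is the tick period in femtoseconds):

const FEMTOS_PER_SEC: u128 = 1_000_000_000_000_000;

impl Ticks {
    /// Scale the raw tick count by the tick period to produce a TimeSpan.
    fn to_timespan(&self) -> TimeSpan {
        // Widen to u128 so value * rate cannot overflow.
        let femtos = self.value as u128 * self.rate.0 as u128;
        TimeSpan(
            Seconds((femtos / FEMTOS_PER_SEC) as u64),
            FemtoSeconds((femtos % FEMTOS_PER_SEC) as u64),
        )
    }
}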

If we were on an x86-64 machine for example, and we wanted to use the TSC as ClockHardware, read would return the value of the TSC. Internally read would call the rdtsc instruction and return the value reported by hardware as Ticks.


impl ClockHardware for TSC {
    fn read(&self) -> Ticks {
        let t = unsafe { x86::time::rdtsc() };
        Ticks { value: t, rate: self.resolution() }
    }
}

Ticks can be converted to a TimeSpan, which is helpful to user space. info generates a ClockInfo, which describes the properties of the hardware timer; this is done in a time-source-specific way.

Integration

The kernel maintains a system-wide list of time sources (TICK_SOURCES) building on these abstractions. TICK_SOURCES is implemented as a vector or an array of ClockHardware. One can think of this as an array of methods determined at runtime, based on the system configuration.

TICK_SOURCES is generated when Twizzler starts up: kernel_main calls clock::init(). The kernel enumerates all hardware time sources available and chooses which ClockHardware serves as the backend supporting a particular Clock exposed to user space (e.g. Monotonic).

The enumeration of hardware is machine/architecture-specific. Moreover, the materialization of clocks exposed to user space will require an algorithm that understands the requirements of the clock and the functionality provided by the hardware. This is planned to be explored in a future RFC. For now, we could use well-known clock sources for specific platforms.
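
As an example of what that could look like, here is a sketch of clock::init on x86 (TICK_SOURCES, MONOTONIC_SOURCE, and Tsc::new are placeholders; the selection policy is deliberately naive):

fn init() {
    let mut sources: Vec<Box<dyn ClockHardware>> = Vec::new();

    // Architecture-specific probing, e.g. check for an invariant TSC on x86-64.
    #[cfg(target_arch = "x86_64")]
    if let Some(tsc) = Tsc::new() {
        sources.push(Box::new(tsc));
    }

    // Naive policy: back the user-visible monotonic clock with the source that has
    // the finest resolution. A real policy would also weigh read cost and stability.
    let monotonic_idx = sources
        .iter()
        .enumerate()
        .min_by_key(|(_, s)| s.info().resolution.0)
        .map(|(i, _)| i);

    *TICK_SOURCES.lock() = sources;
    *MONOTONIC_SOURCE.lock() = monotonic_idx;
}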

Integrating this into the kernel would be done using a set of files:

<root>/src/kernel/src/
  arch/
    x86/
      tsc.rs // implements clock based on tsc
      processor.rs // find/save ref to clocks on processor
    aarch64/
    mod.rs // decides what to compile
  machine/
    pc/dummy.rs // some clock source on platform
    morello/
  clock.rs // probing hw clocks (hw/config specific)
  time.rs // abstracting hw clock sources
  main.rs // initialize clock subsystem 

For each architecture subdirectory, we have a set of files implementing ClockHardware for specific hardware. Likewise, for timers on the motherboard, we have files that are board specific. The generic code lives in the main Twizzler source, which contains the ClockHardware trait and functions to initialize the time sub-system. At compile time, we decide what architecture to compile to and thus what time code we need to run. At run time, we discover hardware and choose the implementation as appropriate.
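
For example, arch/mod.rs could select the time code with the usual cfg pattern (the re-exported function name is a placeholder):

// arch/mod.rs -- sketch: pick the architecture-specific clock code at compile time.
#[cfg(target_arch = "x86_64")]
mod x86;
#[cfg(target_arch = "x86_64")]
pub use x86::enumerate_clocks;

#[cfg(target_arch = "aarch64")]
mod aarch64;
#[cfg(target_arch = "aarch64")]
pub use aarch64::enumerate_clocks;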

Circling back to our example from earlier, where a user program reads a monotonic clock, we could imagine that the time stamp counter (TSC) backs that clock on an x86 platform. A peek behind the curtain of Rust's std::time call to Instant::now() would look something like this:

use std::time::Instant;
fn main() {
  // reads a moment in time
  let now = Instant::now();
    // calls into os-implementation
    let ci = sys_read_clock_info(ClockSource::BestMonotonic, ReadClockFlags::empty());
    //======== jump to kernel space =========
      // os looks up the ClockHardware backing this clock
      USER_CLOCK[clock as usize].info();
      // time source specific generation of ClockInfo
      ClockHardware.info(self)
      // read the value given by hardware
      let t = self.read()
      // reading TSC (implementation)
      let tsc = unsafe { x86::time::rdtsc() };
      Ticks { value: tsc , rate: self.resolution() }
      // generate ClockInfo from value read
      ClockInfo::new(
       TimeSpan::from(t), // conversion to TimeSpan
       ClockGroup::Monotonic, 
      //  ...
      )
    //======== back to user space ========
    // read value from ClockInfo returned from sys_read_clock_info
    let now = ci.value()
    // Instant implemented as a TimeSpan
    return now;

  // elapsed time since now
  // this returns a Duration type
  let elapsed_time = now.elapsed();
  println!("Running main() took {} nanoseconds.", elapsed_time.as_nanos());
}

To illustrate this more clearly, we could imagine that the call stack up until the TSC is read would look something like this:

0: [kernel] x86::time::rdtsc()
1: [kernel] x86_64::tsc::ClockHardware::read()
2: [kernel] x86_64::tsc::ClockHardware::info()
3: [user]   twizzler_abi::syscall::sys_read_clock_info()
4: [user]   std::time::Instant::now()
5: [user]   main()

The actual calls in a real implementation would look different. We omit checks for values provided by users, and more efficient, possibly serialized reads of time.

Drawbacks

The biggest drawback might be the cost of abstraction. We plan to use Rust's dynamic dispatch, which involves a vtable call to the underlying interface. This layer of indirection may or may not be expensive for time APIs. There might be a way around this by having the compiler statically resolve all function calls, but it is unclear whether that is possible.

Another source of overhead is that all interfaces with time require a system call. This can be optimized later for some operations, such as reading a clock, by exposing read-only memory to user space, similar to a vDSO.

Using 64 bits for the implementation of Ticks may be relatively large for embedded systems. Likewise, the 64-bit values used by Rust's standard-library Duration are large. For most instances of Twizzler, this should not be a problem.

Rationale and alternatives

This design offers flexibility and provides standard interfaces in the kernel. We can implement ClockHardware for any hardware. Having a standard interface also makes porting easier. Not doing things this way makes this part of the system harder to maintain.

We have considered other legacy designs, and some of them have their drawbacks.

Prior art

It is pretty standard in Rust to use traits to abstract hardware. Such is the case for many projects in the Embedded Rust community.

As far as time is concerned, the Embedded Time crate and Tock Time HIL are related. Both abstract over hardware used as time sources, have tick and frequency abstractions, and Tock provides a notion of a clock. However, they are not general purpose.

They do not provide interfaces for programming hardware used for timekeeping. We may use aspects of their design in our work, but we cannot directly use these crates as presented. Another shortcoming is that they do not describe the time source, which is necessary if applications want to consider the accuracy of their time measurements. Our goals are to provide something general purpose and high level.

Linux provides system calls for accessing monotonic and real-time clocks with clock_gettime. The way we read time is similar, except that we provide more useful metadata to users with ClockInfo instead of timespec. Other researchers in the community have noted the importance of time as a tool and the shortcomings of timespec. Having additional information available, such as accuracy or clock error, is critical to science. Not knowing the properties of clocks can lead to errors in experiments and hurt reproducibility [1].

Resources On Time

Other than the paper mentioned above on the importance of time as a tool, George Neville-Neil gives a good overview of the importance of time and why we need an interface to adjust time [2]. Not only that, but synchronized clocks are important in many distributed systems. The FAQ page on NTP is also a good resource.

[1] Najafi, Ali, Amy Tai, and Michael Wei. "Systems research is running out of time." Proceedings of the Workshop on Hot Topics in Operating Systems. 2021. https://dl.acm.org/doi/pdf/10.1145/3458336.3465293.

[2] George Neville-Neil. 2015. Time is an Illusion., ACM Queue 13, 9 (November-December 2015), 57–72. https://doi.org/10.1145/2857274.2878574

[3] Ulrich Windl, et al. 2006. The NTP FAQ and HOWTO: Understanding and using the Network Time Protocol. https://www.ntp.org/ntpfaq

Unresolved questions

Resolved Through RFC

Currently, there is no guide on how to implement system calls. It is unclear where specific abstractions for user space should go, such as ClockInfo. However, after looking at the changes to the twizzler-abi crate, it makes sense that the user space clock abstractions belong there.

Resolved Through Implementation

  • We expect that the necessary interfaces for managing time will be revealed as we implement ClockHardware for different hardware timers. When integrating this with existing code, such as the stat clock, we may add more methods to ClockHardware.

  • We may want an additional ClockHardwareInfo that describes not the logical clock but the hardware clock source. It could answer questions such as "is this monotonic?" It does not seem necessary at this point and could probably be integrated somewhere else.

  • We also don't know the overall performance of the Clock APIs over calls through std::time.

Related Issues

These are issues out of scope for this RFC, and could be addressed later, possibly in a future RFC.

  • This RFC does not introduce a system to allow users to set timers/alarms. This feature could be useful for sleeping or setting a callback but is out of scope.
  • We need an algorithm for adjusting sources of time. Standard protocols for this exist, such as NTP/PTP. Support for this requires an interface for adjusting the time that makes sense. Related to this is a definition for accuracy of a clock which is meaningful to the programmer.
  • Selecting the appropriate source of time for a particular clock use case. Some hardware timers are unfit for one reason or another. Maybe the resolution is low, the cost of reading the timer is high, or the timer measurements are unstable.
  • This design does not consider heterogeneous hardware or even differences in hardware among homogenous processors. This may or may not be an issue.

Future possibilities

These APIs and abstractions are necessary to support user space applications. This feature is marked on the Twizzler roadmap as a milestone.

Some additional features or optimizations are listed below that might be explored in the future if deemed necessary.

  • ClockGroup expanded to expose more logical clock types
  • A system call to register hardware as a new clock source
  • By design, we do not implicitly serialize operations that read timers. We could provide a flag that makes reads use arch-specific instruction barriers.
  • Faster reads of clocks can be achieved through read-only shared memory with user space.
  • We might want to be able to dynamically add a hardware source if, say, a CPU suddenly came online.
  • We may explore designs that encapsulate the time sub-system within a microkernel-style user space service

Summary

This RFC defines the interface that the kernel provides for userspace device management and driver implementation. It provides a generic interface that allows for userspace access to device memory and features through Twizzler objects, including memory-mapped IO (MMIO) registers, interrupt handling, device event messages, and structured device trees.

This low-level interface is not intended to be used by application programmers, and even driver programmers are expected to use higher-level abstractions for some aspects of their work (coming in a future RFC), though many aspects of the interface presented here will be used directly.

Motivation

Devices and drivers are, by and large, very complicated and (along with the surrounding infrastructure) make up large parts of modern OS kernels. In a more microkernel fashion, we would like userspace to take as much responsibility for device management and drivers as possible. However, due to the aforementioned complexity, providing a generic interface that allows for programming all possible devices is tricky -- we need to provide an interface that allows generic access to device memory (and/or IO ports), interrupt programming services, etc., all while trying to minimize the responsibility of in-kernel code to handle these cases.

This RFC does not attempt to describe a higher-level, more convenient driver-programming API set; that is planned for a future RFC. Instead, we will cover only the basic user-kernel interface for device abstractions.

Guide-level explanation

Twizzler organizes devices and busses into a tree structure, where devices are children of busses, which are children of some root node (devices can have children too). These devices are each an object that contains information about the device as well as a mechanism of programming device interrupts. Each of these devices can also have a number of additional objects attached to it (called sub-objects) that provide access to device MMIO registers and bus-specific device information.

This low-level interface is not intended to be used by application programmers, and even driver programmers are expected to use higher-level abstractions for some aspects of their work (coming in a future RFC), though many aspects of the interface presented here will be used directly.

The meaning of "device" here is a little fluid, as it depends on the semantics of the hardware of the system. The following are "devices" according to this model:

  • Any "pseudo-devices" that the kernel might create as an abstraction atop some internal mechanism.
  • Any busses that the kernel identifies (e.g. a PCIe segment).
  • Devices that sit on a bus (e.g. present devices at a given bus:device.function on a PCIe segment, including bridges, etc.).

Example

As an example, let's consider a PCIe segment with several devices on it, say a NIC at 0:1.0 and an NVMe device at 0:2.0. The resulting tree would be:

[Bus Root]
  [PCIe Segment 0]
    [0:1.0 (NIC)]
    [0:2.0 (NVMe)]
  [...additional PCIe segments...]
  [...additional busses...]

Say the NVMe device has a region of device memory for programming the controller's MMIO registers, a region for the interrupt table, and the PCIe bus provides an info sub-object that contains PCIe information. The device would have sub-objects that describe the device, and sub-objects that map all the device's memory for access.

Why is the PCIe bus considered a device? The philosophy here is that the PCIe bus is programmed by accessing memory-mapped IO registers in device configuration space (including bridges, etc.). Thus we can allow userspace to program it like it's a device that has sub-devices.

Reference-level explanation

The core of the device model centers around the device object, a Twizzler object whose base is of type DeviceRepr. Internally, these device objects are organized into a Device Tree, where each device object may have up to 65Ki children. The root of this tree is a special object called the Bus Root, which has no data and only acts as the root of the device tree. Operations on a device object are abstracted by the Device type defined in the twizzler-driver crate.

Tree access

The Device type exposes a function, children(&self), which returns an iterator that iterates over all this device's children. This is a wrapper around a kaction (see: sys_kaction) request to the kernel on the device object with the command KactionCmd::Generic(KactionGenericCmd::GetChild(n)) where n is the nth child of the device in the tree. The kernel returns either the object ID of the child device, or KactionError::NotFound if n is larger than the number of children.
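
A sketch of the iterator that children(&self) might return (get_child and Device::from_id are stand-ins for the real twizzler-driver internals; the kaction command issued is the GetChild command described above):

struct Children<'a> {
    device: &'a Device,
    next: u16,
}

impl<'a> Iterator for Children<'a> {
    type Item = Device;

    fn next(&mut self) -> Option<Device> {
        let n = self.next;
        self.next += 1;
        // Issues KactionCmd::Generic(KactionGenericCmd::GetChild(n)) on the device object.
        match self.device.get_child(n) {
            Ok(id) => Some(Device::from_id(id)),
            // NotFound means n is past the last child, which ends iteration.
            Err(KactionError::NotFound) => None,
            Err(_) => None,
        }
    }
}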

The tree is manipulated via bus-specific APIs. For example, the PCIe bus uses userspace to perform device initialization, and thus exposes a kaction command for initializing a device on a given entry of the PCIe segment. Busses that support device removal can also expose functions to remove devices from the tree.

The DeviceRepr type

The device object's base type is a DeviceRepr, which has the following layout:

#[repr(C)]
pub struct DeviceRepr {
    kso_hdr: KsoHdr,
    device_type: DeviceType,
    bus_type: BusType,
    device_id: DeviceId,
    interrupts: [DeviceInterrupt; NUM_DEVICE_INTERRUPTS],
    mailboxes: [AtomicU64; MailboxPriority::Num as usize],
}

The DeviceType enum can be either Unknown (and should be ignored), Bus, or Device. The BusType enum specifies which bus this device is attached to (or, if it's a bus, what type of bus this is). Finally, the last identifier is the DeviceId, which is a newtype wrapping a u32 whose meaning is bus specific.

In the example from earlier, the NIC would have a device ID that encodes 0:1.0 as a 32-bit integer, bus type PCIe, and device type Device.

Interrupts

Each device supports up to NUM_DEVICE_INTERRUPTS interrupt entries. The actual mechanism for setting up interrupts with the kernel is bus-specific and relies on a small amount of in-kernel bus-driver code. The DeviceInterrupt struct is as follows:

#[repr(C)]
pub struct DeviceInterrupt {
    sync: AtomicU64,
    vec: InterruptVector,
    flags: DeviceInterruptFlags,
    taken: AtomicU16,
}

When a registered interrupt fires, the kernel's interrupt handler writes a non-zero value to sync and executes a thread_sync wake-up on that word of memory, causing any user threads sleeping on that variable to wake up. The DeviceInterruptFlags field is a u16 bitflags value and is currently unused. The vec field refers to the actual vector number allocated by the kernel, and can be used to program device interrupt functionality (e.g. MSI or MSI-X in PCIe). Finally, the taken field is used to indicate whether a given entry is used or free.

Mailboxes

A device has several mailbox entries. These are distinct from interrupts in that they are raised by software (either in the kernel or not) and are statically initialized. Messages sent to device mailboxes are not specified in this RFC.

The DeviceRepr has MailboxPriority::Num mailboxes, each a different priority level: High, Low, and Idle. Each mailbox is a single AtomicU64. Submitting a message to the mailbox involves performing a compare-and-swap (CAS) operation, writing a non-zero value if the value is zero. Should the CAS fail, the submitter may perform a thread_sync sleep operation on the mailbox word. Should the CAS succeed, the submitter must perform a thread_sync wake operation on the mailbox word. Mailbox communication is not intended to be efficient.

When checking the mailbox, driver software performs an atomic swap with 0, reading the value. A non-zero value means that the mailbox had a message. On swap returning non-zero, software must perform a thread_sync wake up on the mailbox. Software may sleep on this word when receiving.
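
Both sides of this protocol are small enough to sketch directly (thread_sync_wake and thread_sync_sleep stand in for the real thread_sync system call wrappers):

use std::sync::atomic::{AtomicU64, Ordering};

// Placeholders for the real thread_sync system call wrappers.
fn thread_sync_wake(_word: &AtomicU64) { /* ... */ }
fn thread_sync_sleep(_word: &AtomicU64, _seen: u64) { /* ... */ }

/// Submit a non-zero message to a mailbox word.
fn mailbox_send(mailbox: &AtomicU64, msg: u64) {
    assert_ne!(msg, 0);
    loop {
        match mailbox.compare_exchange(0, msg, Ordering::SeqCst, Ordering::SeqCst) {
            Ok(_) => {
                // CAS succeeded: wake any receiver sleeping on this word.
                thread_sync_wake(mailbox);
                return;
            }
            // Mailbox is occupied: sleep until the receiver drains it, then retry.
            Err(seen) => thread_sync_sleep(mailbox, seen),
        }
    }
}

/// Check a mailbox: swap in 0 and return any message that was present.
fn mailbox_recv(mailbox: &AtomicU64) -> Option<u64> {
    let msg = mailbox.swap(0, Ordering::SeqCst);
    if msg != 0 {
        // A sender may be sleeping because the mailbox was full; wake it.
        thread_sync_wake(mailbox);
        Some(msg)
    } else {
        None
    }
}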

Driver software should prioritize the High priority mailbox above device interrupts, the Low priority mailbox below interrupts and High priority mailbox messages (but still ensure that the mailbox gets checked even if the device is constantly sending interrupts), and the Idle mailbox lowest of all (it is allowed to ignore messages in this mailbox as it sees fit).

Sub-objects

Each device has some number of sub-objects. Each sub-object can be one of the types specified by SubObjectType: Info or Mmio. An info sub-object is an object whose base data is bus- or device-specific. An mmio sub-object contains a header that describes the mapping, followed by mapped MMIO space in the object space for accessing device memory. The Device type provides functions for accessing these sub-objects:

  • fn get_mmio(&self, idx: u8) -> Option<MmioObject>
  • unsafe fn get_info<T>(&self, idx: u8) -> Option<InfoObject>

The get_info function is unsafe because there is no checking that T is the correct type for the stored data. Because the index is a u8, there can be up to 256 sub-objects per type.

MMIO Sub-Objects

The MmioObject type wraps the mmio sub object and allows for accessing the information about this mapping and the mapping itself:

  • fn get_info(&self) -> &MmioInfo
  • unsafe fn get_mmio_offset<T>(&self, offset: usize) -> &T

The MmioInfo struct describes information about this mapping, including its length, a device-specific info field, and the caching type of the mapping (uncachable, write-back, etc.). The get_mmio_offset function returns a reference to a location within this mmio memory mapping.
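
A short sketch of reading a register through this interface (the offset and register width are purely illustrative; a real driver takes them from the device's programming manual):

use core::ptr;

fn read_example_register(mmio: &MmioObject) -> u32 {
    // Get a reference to a 32-bit register at byte offset 0x08 of the mapping.
    let reg: &u32 = unsafe { mmio.get_mmio_offset::<u32>(0x08) };
    // MMIO registers should be read with volatile semantics.
    unsafe { ptr::read_volatile(reg as *const u32) }
}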

Example: PCIe

PCIe is a common bus mechanism. It provides access to devices via a series of segments, each of which have a region of physical memory comprising its configuration space. Each device on the PCIe segment is assigned a bus, device, and function value that, together, indicate which page of the configuration space is associated with this particular device (the device's PCIe configuration registers). Each device on a PCIe bus has a set of six Base Address Registers (BARs) that describe the location and length of mappings to the device's memory (e.g. MMIO registers, interrupt tables, etc.). Finally, PCIe requires devices to support message-signaled interrupts (either via MSI or MSI-X).

On startup, the kernel enumerates the PCIe segments (e.g. via the ACPI tables) and creates a bus-type device per segment. The segment's device has one info sub-object that specifies the information the kernel knows about the segment (start and end bus numbers). The device also has an mmio sub-object that maps the entire configuration space for the segment as Uncachable. Thus userspace can get to all configuration memory for this segment.

The userspace device manager enumerates over all the PCIe segments, and for each one performs a standard PCIe probing algorithm by reading the configuration memory. For each device that it finds, it executes a kaction on the bus device object: KactionCmd::Specific(PcieKactionSpecific::RegisterDevice) and passes as argument the bus, device and function number encoded into a 64-bit value. The kernel then creates a new device entry as a child of the bus with the following sub-objects:

  • Info sub-object: PcieDeviceInfo (which contains identifying information about the device).
  • Mmio sub-object: this device's portion of the PCIe configuration space. The info field in the MmioInfo struct is 0xff.
  • Mmio sub-object: one for each mapped BAR. The info field in the MmioInfo struct is the BAR number.

When allocating an interrupt, software is responsible for programming the MSI or MSI-X tables and the device to generate a proper interrupt. To register a device interrupt with the kernel's PCIe driver, the kaction-specific command PcieKactionSpecific::AllocateInterrupt can be used. The 64-bit argument is an encoding of which DeviceInterrupt in the DeviceRepr to use, and additional flags about isolation.

This provides sufficient support for userspace to implement drivers for devices, allowing programming of:

  • Device interrupts via message-signaled interrupts.
  • Access to the MMIO registers for the device's PCIe configuration space.
  • Access to the MMIO registers for the device's BARs.
  • Access to any other device memory exposed via PCIe.

The NVMe device from the example at the start

The NVMe device I have maps the controller MMIO registers at BAR0 and the MSI-X table at BAR4. Thus its MMIO sub-objects would be:

  • 0: The PCIe configuration space, info = 0xff, uncachable.
  • 1: BAR0, info = 0, uncachable.
  • 2: BAR4, info = 4, uncachable.

Drawbacks

One major drawback is some additional complexity over in-kernel device management. Since we now need to coordinate between userspace and the kernel for device management, it complicates the user-kernel interface. However, the vast majority of users can ignore this.

By putting all driver software in userspace, we add overhead to interrupt handling, which now must (in the upper half anyway) schedule a thread and context switch in the worst-case. However, we expect driver software that is sensitive to this to implement at least partial polling for performance, and the additional overhead is minor anyway compared to monolithic kernel designs that still schedule threads to handle interrupts.

Rationale and alternatives

A device model is a required part of the operating system. While the large-scale device model, mechanisms, driver APIs, and management are larger in scope than this RFC, here we lay out a base level of support for devices atop which we can build additional abstractions.

This model is designed to be flexible while requiring little kernel support. The device tree model is a generic structuring of devices that can fit many different kinds of hardware (a bus doesn't have to use the tree model and can just be flat). The interrupt model provides an abstraction that is based around how message-signaled interrupts are designed, which are commonly used. The sub-object abstraction is designed to allow for numerous memory mappings for devices that have a lot of them, and the info objects allow arbitrary information about devices to be encoded. Finally, the mailbox system is designed for software to communicate with drivers (or between drivers, or between drivers and the bus) in a generic way.

An alternative mechanism would be to implement it all in the kernel and run driver software via loadable modules. While this model doesn't preclude loadable modules, putting things in userspace dramatically simplifies the kernel.

Prior art

  • Microkernel OSes often place driver responsibilities into userspace.
  • Monolithic kernels instead have drivers in the kernel, either compiled in or loaded as modules. These drivers enjoy higher privilege, but must rely on kernel infrastructure for programming.
  • The descriptions of PCIe were based on the latest official PCIe specifications.

Unresolved questions

  • Is the interrupt mechanism sufficient (for now) to allow effective programming of devices?
  • What is the correct security model to apply here? Probably a set of approved programs can access the device tree and that's it.
  • Some PCIe devices should have their memory mapped as write-combining (framebuffers, for example), but PCIe doesn't encode this information in the BARs. We need a way for userspace to perhaps remap a BAR as WC.

Future possibilities

One major aspect that has been left out is some kind of generic "reporting" facility, where a driver could report generic device status information back to the device manager (or the kernel). I have hesitated to design and include this, since I do not yet know what it would be used for or look like, and would prefer instead to save it for a future RFC.

Summary

This RFC extends RFC0004 with a set of higher-level abstractions that drivers can use to program devices. It describes centralized abstractions surrounding event streams, request-response queues, asynchrony, and interrupts that will be provided by the twizzler-driver crate.

Motivation

Writing device drivers is hard, in no small part due to complex models inherent to the asynchronous nature of interacting with the devices. Developers can make use of infrastructure surrounding device drivers in most operating systems to lessen this complexity. Devices can be (often, largely) modeled as separate computing devices that a) produce events for host software to consume, and b) receive and respond to commands sent by host software. Thus we want to provide a mechanism that allows these abstractions to be implemented by a majority of device drivers easily while providing a unified framework that higher-level aspects of a driver can use.

Guide-level explanation

The twizzler-driver crate provides a high-level wrapper around a collection of abstractions designed to manage a single device (for the definition of "device" and the Device type, see RFC0004), called a DeviceController. When taking control of a device, driver software can create a DeviceController (hereafter referred to as "the controller") from a Device. After this point, the controller manages the device and exposes several abstractions above the device:

  • A DeviceEventStream, which provides a way to get a stream of events, including mailbox events and device interrupts (see RFC0004).
  • A DmaAllocator (details for which will appear in a future RFC).
  • A RequestEngineManager, which provides a way to submit requests to the device and asynchronously await responses.

Device Event Stream

The device event stream provides an async interface for handling (upper-half) interrupts and mailbox messages. This means that some async executor is required; if this is too much overhead for the application, direct access to the underlying Device allows for lower-level interrupt handling without the async overhead.

The event stream allows driver software to handle incoming events from the device. For example, a NIC that receives a packet transfers the packet data to a receive buffer and then submits an interrupt. Driver software will then see this interrupt via the event stream and can handle it appropriately.

The DeviceEventStream provides an interface for allocating an interrupt, which returns an InterruptInfo. Internally, the twizzler-driver crate handles the allocation of the interrupt according to the transport mechanism (e.g. PCIe MSI/MSI-X) and registration with the kernel. The InterruptInfo type exposes an async function next, which returns the non-zero u64 value of the next interrupt that fires, or None if the interrupt is removed. Thus driver software can handle a stream of interrupts as follows:

let int = controller.events().allocate_interrupt(...).unwrap();
while let Some(x) = int.next().await {
    while still_work_to_do() {
        // handle interrupt
    }
}

The DeviceEventStream also provides a method for accessing mailboxes, which can be used as follows:

loop {
    handle_mailbox_message(controller.events().next_msg(MailboxPriority::High).await);
}

As in RFC0004, it is up to driver software to ensure correct prioritization of mailbox receivers.

Request Engine Manager

The other main abstraction models the device as something that receives requests and asynchronously responds. For example, submitting a packet for transmit to a NIC happens by the driver software constructing an entry in a transmit queue and then notifying the device that the queue has been appended to. After the device submits the packet, it responds by indicating that that queue entry has been consumed and sends an interrupt. Note that since this often requires interrupts, this abstraction sits logically above the event stream abstraction.

Driver software can create a new Requester through the controller, which acts as a manager for inflight requests. The requester has a generic type that implements RequestDriver, which is a trait that implements device-specific logic for submitting a request.

  • fn shutdown(&self) -- shuts down this requester, notifying the driver, and cancels any in-flight requests.
  • fn is_shutdown(&self) -> bool -- returns true if the requester is shutdown.
  • async fn submit(&self, requests: &mut [SubmitRequest<Driver::Request>]) -> Result<InFlightFuture<Driver::Response>, SubmitError<Driver::SubmitError>> -- submits a number of requests to the device via the driver, and returns a future for when the requests are completed.
  • async fn submit_for_response(&self, requests: &mut [SubmitRequest<Driver::Request>]) -> Result<InFlightFutureWithResponses<Driver::Response>, SubmitError<Driver::SubmitError>> -- same as submit but the output of the future contains a vector of responses (one for each request).
  • fn finish(&self, responses: &[ResponseInfo<Driver::Response>]) -- called by the driver to indicate which requests have completed.

The Driver in the API above is the generic in Requester<Driver: RequestDriver>. This trait is as follows:

#[async_trait::async_trait]
pub trait RequestDriver {
    type Request: Copy + Send;
    type Response: Copy + Send;
    type SubmitError;
    async fn submit(&self, reqs: &[SubmitRequest<Self::Request>]) -> Result<(), Self::SubmitError>;
    fn flush(&self);
    const NUM_IDS: usize;
}

Note: async functions in traits are unsupported by Rust at the time of writing, so we use the async_trait crate here to make it possible to write this.

The associated types Request, Response, and SubmitError refer to the types of objects that this requester will submit to the device, responses the device sends back, and errors that the request driver can generate when submitting requests.

The SubmitRequest type is a wrapper around the driver's Request associated type that includes an ID of type u64. The number of inflight requests will not exceed NUM_IDS, and all IDs will be less than that value (this allows the driver to specify backpressure and queue limits). Similarly, the ResponseInfo type is a wrapper around the Response type that also includes the ID of the request that is associated with this response, and a boolean indicating if this response is considered an error or not.
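
The exact definitions will live in twizzler-driver; as a rough sketch (the field names here are illustrative, not the final API), these wrappers might look like:

pub struct SubmitRequest<T: Copy + Send> {
    id: u64, // assigned internally by the requester; always less than Driver::NUM_IDS
    data: T, // the driver's Request type
}

pub struct ResponseInfo<R: Copy + Send> {
    data: R,      // the driver's Response type
    id: u64,      // ID of the request this response corresponds to
    is_err: bool, // whether the device reported this response as an error
}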

Example

Driver software is responsible for implementing the RequestDriver trait. As an example, say we have a NIC that has a transmit queue for packets. We submit to the transmit queue by writing entries starting at the head and then writing an MMIO register to indicate that the head has moved. Say the queue is 128 entries long. When a queue entry has been handled, the NIC writes a status word back to the queue entry and sends an interrupt to indicate a new tail position. The implementation of the RequestDriver would do the following (a code sketch appears after this list):

  • Set NUM_IDS to 128.
  • Define Request to be whatever a transmit queue entry looks like.
  • Define Response to be the status word.
  • Define SubmitError to be an enum of possible submit errors.
  • Implement submit such that it:
    • Keeps a map of queue position to SubmitRequest ID.
    • Submits the requests. If it cannot submit them all, it internally asynchronously awaits until it can.
    • Optionally writes the new head to the head MMIO register.
  • Implement flush to write the head register to the MMIO register.
  • Implement an interrupt handler that runs when this queue's interrupt fires. This routine goes through the queue starting from the old tail to the new tail and reads all the status words for those entries, constructing an array of ResponseInfo values, eventually calling finish on the requester. This routine may need to wake up the submit function after it reads out the status words and records the new queue tail.
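
Putting these steps together, a skeleton of such a driver might look roughly like the following. This is a sketch only: TxEntry, TxStatus, TxError, TxQueueState, and the helper methods (next_free_slot, record, write_queue_entry, write_head_mmio) are hypothetical stand-ins, and error handling is elided.

struct NicTxDriver {
    // Hypothetical driver state: queue memory, head/tail positions, and a
    // map from queue slot to SubmitRequest ID.
    inner: std::sync::Mutex<TxQueueState>,
}

#[async_trait::async_trait]
impl RequestDriver for NicTxDriver {
    type Request = TxEntry;     // whatever a transmit queue entry looks like
    type Response = TxStatus;   // the status word the NIC writes back
    type SubmitError = TxError; // enum of possible submit errors
    const NUM_IDS: usize = 128; // the queue is 128 entries long

    async fn submit(&self, reqs: &[SubmitRequest<TxEntry>]) -> Result<(), TxError> {
        for req in reqs {
            // Asynchronously wait for a free queue slot if the ring is full.
            let slot = self.next_free_slot().await;
            // Record which request ID occupies this slot so the interrupt
            // handler can later build ResponseInfo values for finish().
            self.inner.lock().unwrap().record(slot, req);
            self.write_queue_entry(slot, req);
        }
        Ok(())
    }

    fn flush(&self) {
        // Notify the device of the new head via the MMIO head register.
        self.write_head_mmio();
    }
}

// Elsewhere, the queue's interrupt handler walks completed entries from the
// old tail to the new tail, collects ResponseInfo values, and calls
// requester.finish(&responses).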

Another example we can consider is an NVMe driver, which differs from the NIC driver above by having a separate completion queue. This possibility is the reason behind the abstracted request IDs -- this lets the driver and requester handle out-of-order responses.

Usage

The requester uses this implementation to expose the interface above that lets higher-level driver software interact with the device via this async request-response API. For example, a caller could do the following:

let req = create_packet_tx_req();
let mut sreq = SubmitRequest::new(req);

let fut = requester.submit(&mut [sreq]);
let res = fut.await.unwrap(); // awaits until the requests are submitted (may be a SubmitError).
let res2 = res.await; // awaits until the device responds to the requests.
match res2 {
    SubmitSummary::Done => {...} // all requests were handled, none errored.
    SubmitSummary::Shutdown => {...} // the requester shut down while requests were in-flight.
    SubmitSummary::Errors(x) => {...} // all requests were handled; at least one errored; all errors occur after position x.
}

Use of the submit_for_response function looks similar to the above, except res2 could also be matched to Responses(Vec<Driver::Response>). The order of responses matches the order of requests. The reason two awaits are needed is that the first is for all requests having been submitted, and the second is for all requests having been responded to.
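
For completeness, a sketch of the response-collecting variant (req_a, req_b, and process are hypothetical placeholders; the types match the API described above):

let mut sreqs = [SubmitRequest::new(req_a), SubmitRequest::new(req_b)];
// First await: all requests have been submitted. Second await: all requests
// have been responded to (or the requester shut down).
let summary = requester.submit_for_response(&mut sreqs).await.unwrap().await;
if let SubmitSummaryWithResponses::Responses(responses) = summary {
    // responses[i] corresponds to sreqs[i].
    process(responses);
}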

Reference-level explanation

These abstractions will be implemented by the twizzler-driver crate, and will be optional features to allow for driver software that does not need them.

Events

The device event stream is largely a straightforward wrapper around the lower-level device interrupt and mailbox mechanisms, presenting them as an asynchronous stream instead of something more akin to a condition variable. The future returned by InterruptInfo's next function uses the twizzler-async crate along with the interrupt definitions discussed in RFC0004 to construct a blocking future that returns the next interrupt.

The interrupt allocation routine operates with some bus-specific knowledge to properly allocate an interrupt for the device before constructing an InterruptInfo. On drop, the InterruptInfo calls back into the event stream manager to deallocate the interrupt, thus tying the lifetime of the interrupt to the InterruptInfo struct.

The mailbox message system internally keeps a queue of messages that haven't been handled. This is so the event stream can receive the mailbox message and clear up the mailbox for reuse even if no thread has called next_msg. These queues should have a maximum size, causing old messages to be dropped if necessary. The next_msg function:

  • Checks the mailboxes of all priorities, enqueuing any messages found there.
  • Dequeues messages from the highest priority queue that has messages, if the queue is at least as high priority as min. Messages here are returned immediately.
  • If no messages are present, blocks this async task on the arrival of new messages.

Note: since High priority mailbox messages take priority over interrupts, interrupt next() functions will also check the High priority mailbox, enqueuing if a message is found.
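
In rough pseudocode, the behavior described above might be structured as follows (the helper methods and the MailboxMessage type are purely illustrative):

async fn next_msg(&self, min: MailboxPriority) -> MailboxMessage {
    loop {
        // Drain all hardware mailboxes into the internal per-priority queues,
        // freeing the mailboxes for reuse by the device.
        self.enqueue_pending_mailbox_messages();
        // Return the highest-priority queued message at or above `min`.
        if let Some(msg) = self.dequeue_highest(min) {
            return msg;
        }
        // Otherwise, block this async task until a new message arrives.
        self.wait_for_new_message().await;
    }
}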

Requester

The requester internally keeps track of:

  • Inflight requests.
  • Available request IDs.

Request IDs

Request IDs are a simple unique identifier for requests. When calling submit, the caller passes a slice of SubmitRequests, which internally contain an ID. The caller, however, is not responsible for assigning IDs -- that is handled internally (hence why it's a mutable slice).

The available request IDs are managed by an AsyncIdAllocator, which exposes three functions (note: this is all completely invisible to the user):

  • fn try_next(&self) -> Option<u64>
  • async fn next(&self) -> u64
  • fn release(&self, id: u64)

Both try_next and next get an available ID, but next asynchronously waits until one is available. Two adjacent calls to any next functions are not guaranteed to return adjacent ID numbers.
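
For illustration, the requester's submit path might use the allocator roughly like this (a sketch; the ids field and the set_id helper are illustrative, not the real implementation):

// Assign an ID to each request before handing the slice to the driver.
for sreq in requests.iter_mut() {
    // Asynchronously wait for a free ID; this is what bounds the number of
    // in-flight requests to Driver::NUM_IDS.
    let id = self.ids.next().await;
    sreq.set_id(id);
}
// ... later, once the response for `id` has been handled, the ID is returned:
self.ids.release(id);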

Inflight Request Management

The submit function returns an InFlightFuture, and the submit_for_response function returns an InFlightFutureWithResponses. These output a SubmitSummary and a SubmitSummaryWithResponses, respectively, when awaited. Each of these futures internally holds a reference to an InFlightInner that is shared with the requester and manages the state for the in-flight requests.

The InFlightInner contains:

  • A waker for the task that is awaiting the future.
  • An Option<SubmitSummary> for returning to the awaiter when it's ready, and additional state to construct this value.
    • e.g. a map of request IDs to indexes in the submit slice (may be unused if not tracking responses).
    • e.g. a vector of responses that gets filled out as responses come in (may be unused).
  • A count of how many requests have been responded to.

The requester internally keeps a map of request IDs to InFlightInner so that it can match a response with the in-flight state that manages it.
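
A rough sketch of this shared state (field names are illustrative; the real type will likely be wrapped in a lock and specialized per submit variant):

struct InFlightInner<R> {
    // Waker for the task that is awaiting the future.
    waker: Option<std::task::Waker>,
    // Set once all requests have completed (or the requester shut down).
    ready: Option<SubmitSummary>,
    // Map of request ID to index in the submit slice (unused if not tracking responses).
    id_to_index: std::collections::HashMap<u64, usize>,
    // Responses collected so far (unused for plain submit).
    responses: Vec<R>,
    // How many requests have been responded to.
    count: usize,
}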

Finally, the requester's shutdown function shall ensure that, after internally recording the shutdown status so that future calls will fail, it drains all internal InFlightInners and fills out their ready values to indicate shutdown, after which it wakes any waiting tasks.

Drawbacks

One major drawback to this design is overhead. Using these abstractions requires pulling in the twizzler-async runtime and tolerating the (small) overhead of async for interacting with the device. This may be hard to tolerate in embedded environments (the async executor size may be too large) and extremely high-performance environments (async overhead) [1].

A counterargument here is that these abstractions are optional, but the counter-counter argument is that the convenience they offer may make them non-optional in practice: if most drivers use them, their use becomes required even in embedded systems. However, in systems that are truly space-limited, one is often working within the confines of an exact hardware set, so manual implementation of small drivers is likely regardless, and the larger drivers used on server machines will not be applicable.

[1] This argument is less convincing to me. The interrupt handling routines, for example, only incur the async overhead when they are out of work. Polling or delaying calling the next() functions can nearly completely mitigate this overhead.

Rationale and alternatives

The main purpose here is to provide a common framework for common aspects of driver development, thus accelerating the creation of device driver software. By defining this RequestDriver trait, we allow driver software to be split into two parts: the lower half that submits requests on a queue, and the higher half, which submits requests via the Requester interface, simplifying driver software by handling the complexities of async and out-of-order responses. The use of async here is especially important, as it allows driver software to just submit requests and await responses.

Several rationales:

  • The request ID system is designed to allow the driver to internally manage its understanding of IDs and how they relate to the device queue(s). This makes it possible for the requester to internally manage in-flight state independent of the driver, and handle async and out of order responses.
  • There is a separate flush function to allow for enqueuing a number of requests without incurring the overhead of actually notifying the device.
  • The submit and submit_for_response functions are async and both return another future, meaning one needs two awaits to get the SubmitSummary. This is because we separate the successful submission of requests from their completion. Imagine we want to submit 200 requests to a queue that has 128 entries. We'll have to wait at some point. Thus we allow the caller to await full submission and then later await responses if it likes (or it can drop the future and not get any responses or information).
  • We separate submit and submit_for_response because collating the responses has additional overhead, and a given submit may not care about the actual responses. Thus we provide an option for just submitting and awaiting completion without recording responses.

Unresolved questions

  • One unresolved question is in the ergonomics of building a driver that uses all these types that reference each other. I plan to dogfood this interface by way of an NVMe driver.
  • The DmaAllocator is a major component of drivers, allowing the driver to talk about physical memory. That will be discussed in a future RFC.

Summary

This RFC introduces support for DMA (Direct Memory Access) and bus address mapping for device drivers by providing kernel support for setting up mappings for objects, kernel APIs for getting lists of physical or bus mappings for object pages, and twizzler-driver APIs for managing DMA objects and mappings in a memory safe manner.

Motivation

DMA is a fundamental aspect of writing device drivers, as devices use DMA to transfer data to and from host memory. However, thinking of devices accessing host memory solely via single one-shot DMA transfers is an outdated and limited model. The goal of this RFC is to provide a unified mechanism for supplying devices with bus addresses that correspond to physical memory that backs object memory in such a way that drivers can program both "streaming" (e.g. buffers) and "long-term-bidirectional" (e.g. command rings) memory.

Guide-level explanation

Considerations for DMA

When programs access memory in Twizzler they do so via accessing object memory, which involves translating some kind of object address to a physical address. On x86, for example, this involves a software translation to a virtual address followed by a translation via the Memory Management Unit (MMU) to a physical address. Similarly, when a device accesses memory, it emits a memory address (likely programmed by the driver) that may undergo no translation or some other translation on the bus before attempting to access host memory. There are two important considerations that result from this alternate (or absent) translation:

  • Contiguous addresses. While object memory is contiguous (within an object), the physical memory that backs that object memory may not be. Thus devices and drivers need to be capable of handling access to memory in a scatter-gather manner.
  • Access Control. Access control can be applied differently between host-side driver software and devices. Thus driver software must be aware that it may have access to memory via the device that it should not access directly. We can use devices like the IOMMU to limit this effect.

In addition to the above, we need to consider the issue of coherence. While CPU caches are coherent across cores, devices accessing host memory do not necessarily invalidate caches. Thus we have to handle both flushing data to main-memory after writing before the device reads it and invalidating caches if a device writes to memory. Some systems automatically invalidate caches, but not all do.

Memory Safety

Finally, we must consider memory safety, which is an issue because while we can control writes from host software to DMA buffers, we cannot necessarily control how the device will access that memory. To ensure memory safety of shared regions, we would need to ensure:

  1. The device and host software cannot both mutate shared state at the same time (thread safety). Note that this may be okay in some situations, such as atomic variables that the device updates without the possibility of tearing or of touching neighboring memory; however, encoding this at compile time to prove safety may be impossible in general.
  2. The device mutates data such that each mutation is valid for the ABI of the type of the memory region.

Enforcing these at all times may cause overhead and increase API complexity. Another stance we could take is Rust's approach to "external influences on memory", such as accessing /proc/self/mem on UNIX, which is basically to say that this is outside the scope of the compiler's ability to ensure safety. I think, though, that since programming shared access between driver software and the device is a fundamental part of driver software development, some middle ground that provides some safety is desirable, even if it means reaching for some unsafe here and there (possibly merely for efficiency).

Using DMA in a Device Driver

Twizzler will provide an interface for making a single Twizzler object accessible to a device by way of the DmaObject type exposed by the twizzler-driver crate. The DmaObject can be created from any Twizzler object, and exposes APIs for ensuring coherence and memory safety. Let's take as example a device that has a command ring buffer that is used to submit commands and to indicate when a command has been completed. A command in the ring buffer can point to another DMA buffer that is used to transfer data, and may look like the following:

struct Command {
    op: u32,
    status: u32,
    buffer: u64,
}

The op field specifies some operation to perform (send packet, etc.), and the status field specifies the result of the command (say, for example, it is set to 1 when the command is completed and must be cleared to zero before a command can be processed). Finally, the buffer field points to the physical address of some buffer. Let's also imagine some mechanism for communicating the head of the ring to the device, say a write to some MMIO register, so that we can hand the device a collection of new commands to process. For the sake of simplicity, let's assume that the buffer is at most 1 page long.

Setting up some DMA regions may look like:

let object = create_new_object();
let dma = DmaObject::new(object);

let command_ring = dma.slice_region::<Command>(some_command_len, Access::BiDirectional, DmaOptions::default());
let buffer = dma.slice_region::<u8>(some_buffer_len, Access::HostToDevice, DmaOptions::default());

At this point, command_ring and buffer have types DmaSliceRegion<Command> and DmaSliceRegion<u8> respectively. Note that we distinguish between DmaRegion and DmaSliceRegion. Both serve a similar purpose, but have slightly different signatures on some functions. For example, both provide a with function (see below), but the DmaSliceRegion allows specifying a sub-slice. The rest of this document will use DmaRegion to stand for both types to avoid duplication of specification.

We can use DmaRegion::pin() to get a list of physical pages associated with the region so that we may program the device to operate on this command ring. Then, submitting a command would look like:

buffer.with_mut(0..0x1000, |buf| {
    fill_out_buffer(buf);
});
// Grab a 'pin' of the buffer, which ensures that the associated physical addresses and IOMMU maps will remain
// static until the dma object is dropped.
let buffer_pin = buffer.pin().unwrap();
// Get the physical address of the first page.
let buffer_addr = buffer_pin[0].addr();
// Fill out a new command.
command_ring.with_mut(0..1, |ring| {
    ring[0] = Command::new(buffer_addr);
});
increment_head();

A pin object can manually release the pages it refers to, but otherwise the lifetime of pinned physical memory is the same as that of the DmaObject itself. Tying pin lifetime to the DMA object rather than to the pin object reduces the management complexity of avoiding accidentally programming a device with stale physical addresses.

The DmaRegion::with_mut function runs a closure while ensuring coherence between host and device. Before the closure, it ensures any writes from the device are visible, and after running the closure, it ensures that any writes made by driver software are visible to the device. A similar function, with, allows driver software to read the DMA region and not write it, allowing the system to skip ensuring coherent writes from host to device.
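
For example, polling a command's completion status only needs read access, so with suffices (assuming, as a sketch, that with returns the closure's value):

// Read the status word of the first command; no host-to-device flush is
// needed because the closure only reads the region.
let status = command_ring.with(0..1, |ring| ring[0].status);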

Simple Allocation

If a driver needs to allocate a large number of dynamically sized DMA regions, doing so with a single object may prove difficult as we can easily run out of space. Thus twizzler-driver also provides a type for managing a collection of DmaObjects all of a similar type: DmaPool. We can use it as follows:

let pool = DmaPool::new(DmaPool::default_spec(), Access::HostToDevice, DmaOptions::default());
let region = pool.allocate::<Foo>(Foo::default()).unwrap();
// Dropping region causes it to deallocate.

Coherence Models and Memory Safety

In the above example, we used default DMA options, which ensures the following:

  1. Writes by host software are readable by the device once the with_mut function returns.
  2. Coherence is synchronized at the start of the with or with_mut calls.

More relaxed models are available that do not do any synchronization unless the driver explicitly calls DmaRegion::sync. Note that we are not ensuring that no memory access conflicts occur between the device and driver software, since that is not possible to do at compile time or runtime [1]. We are further not ensuring that the device maintains the ABI of the Command type. In this example, this doesn't really matter, as all the constituents of this type are simple integers, but imagine instead that status was an enum with only a few defined values. The device could update the value of status to a non-defined value, which would cause problems.

To avoid the type ABI problem, we require that a region be a type that implements the DeviceSync and Copy marker traits. The DeviceSync trait is a promise that the ABI for the type can handle any update to it that the device might make and that it can handle possible memory conflicts with writes from the device.
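
As a sketch of what this could look like (the unsafe-marker-trait form shown here is an assumption for illustration, not the final API):

// Implementing this trait asserts that the type's ABI tolerates any update
// the device might make, including concurrent writes.
pub unsafe trait DeviceSync {}

#[derive(Copy, Clone)]
#[repr(C)]
struct Command {
    op: u32,
    status: u32, // a plain integer, so any device-written value is valid
    buffer: u64,
}

// Every bit pattern is a valid Command, so this assertion is sound.
unsafe impl DeviceSync for Command {}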

[1] Efficiently, anyway. We could use the IOMMU to ensure that physical addresses are only available for the device to access during certain windows. However, this would involve a LOT of system calls and IOMMU reprogramming, which is currently not terribly fast. Note, however, that as-written this API would allow for this kind of enforcement if we choose to do it in the future.

Shared Objects

One final consideration is for drivers that want to point devices towards object memory that exists within an object that is shared across different programs. The twizzler-driver library cannot (at this level of the system) enforce mutability rules for these objects. Thus driver software should use the manual sync operations to ensure coherence (of course, parts of the object modified via the with functions will still have coherence rules applied as normal, see above).

DmaOptions and Access

DmaOptions modify how a region (or pool, see below) of DMA memory is treated by the host. The options are a bitwise-or'd collection, with the following defined:

  • UNSAFE_MANUAL_COHERENCE. Default: No. If set, the with functions do not perform any coherence operations.

The Access enum specifies the direction of the DMA transfers that will be made with this region, and can be used to optimize coherence and inform access control for IOMMU mappings. The options are as follows (a sketch of possible declarations appears after the list):

  • HostToDevice -- for transfers in which the device reads.
  • DeviceToHost -- for transfers in which the device writes.
  • BiDirectional -- for transfers in which the device reads and writes.
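
A sketch of how these might be declared (the bitflags-style representation of DmaOptions is an assumption for illustration, not the final definition):

bitflags::bitflags! {
    pub struct DmaOptions: u64 {
        // If set, the `with` functions perform no coherence operations;
        // the driver must call `sync` manually.
        const UNSAFE_MANUAL_COHERENCE = 1 << 0;
    }
}

pub enum Access {
    HostToDevice,  // the device reads this region
    DeviceToHost,  // the device writes this region
    BiDirectional, // the device reads and writes this region
}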

Reference-level explanation

Kernel API

Consumers of the twizzler-driver API access physical mapping information via the pin function on a DmaObject. The pin function learns about physical mappings from the kernel by calling a KAction command on the underlying object for pinning pages, which returns a token along with information about physical addresses. That token is tied, in the kernel, to the list of physical mapping information that the call returns. After this call returns, the kernel ensures that the mapping information it has returned stays correct ("active") until the pin is manually released via another KAction call.

Internally, the kernel will manage per-object information on which pages are pinned so as to not evict or change such pages. Ensuring that these active pins remain correct requires some interaction with the copy-on-write (COW) mechanism in the object copy functionality. In particular, pins do not get copied into new objects that source data from an existing object. However, if a pin applies to a source object, that object is copied (via COW) to a new object, the copied range intersects with the pin, and a write is then performed to the pinned region in the source object while the underlying pages are still shared for COW, then the kernel will need to copy the page for all the other objects instead of just for the source object. For this reason, we will break the kernel-side implementation into two feature gates:

  1. Basic pin support, but not supporting COW-intersecting-with-pins.
  2. Full support as described above.

Userspace Implementation

Let's consider the examples in the previous section and discuss implementation.

DmaObject::slice_region and DmaObject::region

These provide simple ways to just return some object memory as a [T; N] or a T. They return a struct (DmaRegion) that just manages a typed region of memory of a size determined by T (and N), and expose the pin function.

pin

The pin function calls the kernel to setup a pin and then manages the pin information in memory, allowing it to be called multiple times without having to negotiate with the kernel each time. Note that pins are not released to the kernel when the DmaRegion is dropped; instead all pins on an object are released when the DmaObject is dropped.

with and with_mut

These functions provide access to the internal memory managed by a DmaRegion to a closure. Before the closure is run, it ensures coherence for the host reading the memory, and after the closure is run it ensures coherence for the device to read memory. The with variant may skip the second step.

Pools and Allocation

Regions can also be obtained from a DmaPool, which internally manages a collection of DmaObjects (derived from objects that it creates as needed). All regions created this way share the DmaOptions with which the pool is created. Allocation is managed internally via a memory allocation algorithm; however, all regions must be aligned on page size.

Drawbacks

DMA is vital to any driver written for the vast majority of devices that we care about. However, the particular design choices herein do have some drawbacks:

  1. Pinning memory adds complexity to the eviction algorithms in the kernel and the pager, as they need to be made aware of pinned memory.
  2. There is currently no attempt to limit the amount of pinned memory an application can request, thus opening an easy door to denial of service attacks. We can mitigate this somewhat via access control.
  3. Currently we don't define a way to request that all physical addresses fit within 32 (or fewer) bits, as the kernel is not currently set up to manage memory in a way that would make this easy. Ensuring physical addresses stay under the 4G mark is useful (mostly) for older hardware that cannot work with 64-bit addresses. Currently, we don't have any immediate need to support such hardware. If the need arises, however, we can extend DmaOptions to include specifications for physical memory address limitations.

Rationale and alternatives

Pinned Memory

The overall goal is to make it possible for userspace code to program devices to access physical memory. We want to stay within the overall Twizzler object model to do this, thus the API herein is focused around making it possible to create objects that we can then use for DMA transfers.

Pin Leaks

One immediate concern is pin leaks. Since the pins must be manually released to the kernel by the library (not the user), we can imagine a device driver crashing and causing a section of object memory to be pinned forever (at least, until the kernel restarts). The decision to allow this, however, is intentional.

Should a device driver crash, we have no guarantee on the state of the device that it was programming. It's entirely possible that the device was programmed with physical addresses that it continues to make use of after the device driver crashes. If pins were removed when the program that created them crashed, the device could blindly write to memory that no longer refers to the object memory originally intended by the (now crashed) driver.

Thus the choice of allowing leaks in the face of a driver malfunction is there to mitigate the possibility of corrupted memory. Of course, use of an IOMMU may be able to mitigate this, however I do not wish to rely on it, and this would also introduce many inefficiencies. If we prove to be able to efficiently make use of IOMMU hardware in the future, this design may change.

Prior art

Basically every major operating system provides some API for setting up DMA transfers. Most of them are quite similar, largely relying on the driver to manually synchronize for coherence and/or specify directionality of transfers. Some (e.g. Linux) additionally classify a region of memory-to-be-DMA'd as fully coherent or streaming, usually using this information to specify caching type for memory mappings.

Fuchsia uses a similar mechanism to what is outlined herein, also supporting pinned memory regions. There are a number of differences, however, that largely stem from our desire to (at least somewhat) follow Rust memory safety requirements. Fuchsia, in addition, does allow a limited ability to control contiguity of physical memory, which we do not (yet).

FreeBSD and Linux both have a significantly different style of interface, stemming from the fact that they implement their device drivers in-kernel, and so their DMA interfaces can be tightly coupled with their virtual memory subsystem. The two systems differ in the details of how they control coherence and synchronization (single-versus-all-cpus, streaming-versus-coherent, and FreeBSD allowing finer control over sync operations) and how they control contiguity and maximum address size. Otherwise, the differences are largely down to API specifics and not really functionality, with the exception of FreeBSD supporting a recursive-like pattern of region configuration inheritance, which is kinda cool.

Windows offers little additional insight into DMA operation and design tradeoffs, except as a case study of how not to name functions or other aspects of an API.

Future possibilities

This RFC is only intended to cover "dumb" devices -- that is, devices that are fully programmed by the host software and, while they may interact with memory via DMA, do not really "go off on their own". Essentially, this covers most devices on the market that do things like NVMe, networking, etc. Such devices are fully controlled by the host, and all memory access is either initiated by the host or initiated by the device to a pre-programmed section of memory; the device can be thought of as a simple state machine.

In the future it may be better to model devices as fully separate machines that access physical memory cooperatively with the host CPU and run their own programs. Should we reach that future, we will probably need a new model for programming such devices, one richer than the model presented here.

Summary

This RFC describes the core functionality of what may be described as the "core Twizzler monitor" program. It proposes a formalized definition of:

  0. How we can have a flexible runtime system that enables users to run "bare-metal" against twizzler-abi, against a more complete runtime, and even swap out the runtime system if they like.
  1. How programs (including libraries which may expose "nando"s, either secure or not) are linked and formed into objects.
  2. How programs are loaded, including how they are isolated from each other.
  3. What a program's runtime looks like, including the execution environment, threading, etc, and how the monitor supports programs.
  4. How we achieve (optional) isolation between the loaded components of a running execution.
  5. How users might interact with this runtime at a high level.

A huge portion of this program is, essentially, a dynamic linker, or will have to be. However, it's a dynamic linker that is supported by kernel-supported security contexts and gates, and a language (with runtime) that assists with both portability and abstraction.

Motivation

As we are working towards an environment that enables programmers to write software for Twizzler, we need to set ourselves up now for an extensible and well-defined runtime system and execution environment. In particular, this work will dramatically improve the extensibility of the userspace parts of Twizzler and provide a core foundation upon which we can build higher-level and secure software. In fact, a number of items in our respective roadmaps depend heavily on the functionality described by this RFC.

Additionally, this work will enable us to more easily use and demonstrate the usefulness of several core parts of the planned Twizzler programming model, namely Nandos and Secure Gates. We also plan, as part of this work, to reach a semi-stable version of twizzler-abi, engineering this crate so that the runtime can hook into the core aspects that Rust's standard library depends on and swap them out, allowing us more flexibility without having to recompile programs for different runtimes.

Now, you may be asking, "why do we need a monitor?" Well, let's consider the minimum required secure environment that some sensitive software is running in. In particular, notice that, in a traditional system, we must trust the kernel, the toolchain, the standard library (and all linked-to libraries), and the dynamic linker. The issue of a trusted toolchain is out of scope of this RFC. We will focus instead on how we can ensure that the kernel, library, and dynamic linker can work together to provide isolation. As a result, the dynamic linker for Twizzler will be Twizzler-specific. This isn't particularly weird -- most dynamic linkers have a bunch of OS-specific and arch-specific code. Note that a traditional dynamic linker is already, kinda, a monitor, loading programs and libraries as required, etc.

Guide-level explanation

The runtime is a program that acts as a security monitor and dynamic linker. It loads programs and libraries into memory and executes them. Any program or library that is loaded by the runtime is then under the control of that runtime. The runtime organizes programs and libraries that are loaded into compartments, each one providing a configurable level of isolation from the others for the programs and libraries residing within.

The execution environment

The basic environment of a Twizzler program can be supported either by twizzler-abi directly or by a runtime which provides some additional support on top of the basic twizzler-abi functionality. This may, for example, include access to a more useful naming service, logging, debugging, networking, etc. Each piece of the system is defined as follows:

  1. The twizzler-kernel provides the basic kernel services. It should only be accessed through twizzler-abi.
  2. The twizzler-abi crate defines the interaction with the kernel, including syscall ABI, API, etc. It also defines the Runtime trait.
  3. An implementation of the Runtime trait provides a Twizzler runtime. The trait exposes the interface that Rust's standard library uses to interact with the runtime. A given runtime implementation is selectable both at compile time and at load time.
  4. Twizzler ships with two runtimes. The default one, provided by twizzler-abi, implements a bare-minimum set of features to start up a program running against twizzler-abi. The second is the standard runtime, which acts as a security monitor, dynamic linker, and program loader; it also facilitates secure gate calls between components.

+------------------------+   +-------------------+
|"Bare Metal" environment|   |Runtime Environment|
+---------+-+------------+   +-------+-+---------+
          | ^                        | ^
          | |                        | |
          v |                        v |
     +----+-+-------+         +------+-+---------+
     |              +-------->+                  |
     | twizzler-abi |         | twizzler-runtime |
     |              +<--------+                  |
     +------+-+-----+         +------------------+
            ^ |
            | |
            | v
  +---------+-+-----+
  |                 |
  | twizzler-kernel |
  |                 |
  +-----------------+

Here are two examples of how a runtime may offer functionality over that of the twizzler-abi runtime:

Naming: The runtime provides a default service for resolving names into object IDs. This requires a fair bit of runtime to work, as it needs a service running to manage naming, persistence of naming, etc. However, we want Rust's std to be able to use this for paths too, so twizzler-abi needs to expose a hook for the more featureful runtime to provide that functionality. What, then, does the "bare-metal" environment get? Well, those programmers could still implement their own naming system or hook into one, but by default we'll have the twizzler-abi crate expose a read-only mapping of "initrd" names.

stdout/stderr: These, too, require a bit of runtime to make them as flexible as programmers generally expect them to be. The basic twizzler-abi runtime implements these by writing to the kernel log. What else could it do? The more featureful runtime could collect output, stream it to different programs, log it, etc.

As a result, you can write a program for twizzler-abi and it is loadable into the more featureful runtime as well. The reverse may also be true, though of course a program dependent upon a specific runtime may not function correctly elsewhere.

The Standard Runtime: loading programs

A core piece of the runtime is a dynamic linker. In particular, the runtime monitor loads and links programs and libraries both at load and run times. This means that the runtime needs to be able to load executable objects, map them, and (depending on the executable format) relocate them. Now, I'm not so crazy as to suggest we implement our own executable format. ELF is a sufficiently weird machine that we'll probably get all the functionality we need out of it.

So, what's a program?

The runtime supports both executables (statically linked, or traditional dynamically linked executables) and libraries (dynamically linked objects). Executable programs have a defined entry point; libraries expose a set of symbols that are linked when the library is loaded. A program or library is contained within a single object that contains an ELF file. This file is parsed and mapped into memory via Twizzler object APIs (notably, the copy-from primitive is used heavily along with direct mapping). Note that not every ELF file will work. It does need to target Twizzler and link using the Twizzler linker script built into the Rust compiler's Twizzler support.

The Loading Process: an example in a virtual address space

The monitor, when loading an executable program, does all the normal things a dynamic linker does, with the addition of a few things. First, it allows subcomponents of the program and libraries to isolate from each other even within the same address space (this uses security contexts and secure gates). Let's imagine we have the following setup:

Program A links to libstd (S), libruntime (R), and a helper library, L. The resulting address space in Twizzler looks like:

+-----------------+
|                 |
| A.text   A.data |
|                 |
| R.text   R.data |
|                 |
| L.text   L.data |
|                 |
| S.text   S.data |
|                 |
| stack    A.heap |
|                 |
| M.text   M.data |
|                 |
| thread          |
+-----------------+

This is a bit of a simplification. Also note that I didn't really try to order these, as with Twizzler's invariant pointers + position-independent code, it doesn't matter. Well. Okay, A.text and A.data, as the potentially statically-linked executable, may be forced into slots 0 and 1 respectively, but that's a small detail.

Each loaded executable object is split into two parts, text (executable code) and data (read-write data). We may add a third rodata object in the future. We also have a heap (A.heap, we'll see why it's named as such later), a stack, a thread repr object (see the thread model documentation). We've loaded A and its dependencies (R, L, and S). We also have ourselves, the monitor / loader, as M, loaded.

Once everything is loaded and relocated (linked symbols, etc), we can start executing at A's entry point. During execution, A may decide to load more executable objects, or perform an operation that causes the runtime to do so automatically. The result is the familiar programming environment we are all used to, in that we can construct statically-linked executables or dynamically linked ones. We can load libraries at load or runtime, and use their exported symbols.

The Standard Runtime: secure gates

One major functionality addition to the dynamic linker is compartmentalization. A given library may be called by a program, and that library may be fully isolated from the caller. This is useful for building secure OS services. For example, a network library may manipulate configuration data directly if it has write permission on that data, but your average program that calls that library shouldn't be given write permission to that data, as it could then trash it. Thus a key primitive that the runtime enables is for some insecure software to call a secure function in a library that has some additional permissions not granted to the caller: a secure gated API.

Let's use the same example program and dependencies as from above with the following addition: the library L has additional permissions to update some system state that the caller (A) doesn't have permission to update. Library L defines a function as follows:

#[secure_gate]
fn update_some_state(args...) -> ... {...}

The caller can then just call L::update_some_state(...). The runtime, during load, will ensure the following security properties:

  1. The library L is protected from arbitrary code execution by secure gates support in security contexts.
  2. The "irreducible Rust Runtime" (discussed below) remains isolated within this library (we'll discuss what this means below).
  3. The caller's stack cannot be corrupted (optional, adds overhead)
  4. The callee's stack frames cannot be leaked to the caller (optional, adds overhead)
  5. The caller's stack is not accessible at all by the callee (optional, limits API)

Once loaded, the virtual address space will contain:

+--------------------------------------------+
| thread(r--)                                |
| +--------------------------+               |
| |A.text(r-x)   A.data(rw-) |               |
| |                          |               |
| |A.heap(rw-)   stack(rw-)  |               |
| |                          |               |
| |S.text(r-x)   S.data(rw-) |               |
| |                          |               |
| |R.text(r-x)   R.data(rw-) |               |
| +--------------------------+               |
|                                            |
|                                            |
| +---------------------------+              |
| |L.text(r-x)   L.data(rw-)  |              |
| |                           |              |
| |L.heap(rw-)   L.stack(rw-) |              |
| |                           |              |
| |S.text(r-x)   S.data(rw-)  |              |
| |                           |              |
| |R.text(r-x)   R.data(rw-)  |              |
| +---------------------------+              |
|                                            |
+--------------------------------------------+

Note how there are multiple "compartments" of objects here. This has no actual bearing on the virtual address space layout; they are compartments for ensuring isolation. The specified permissions apply only within a given compartment. The resulting, high-level view of the runtime monitor is then:

  +-----------------------------------------------+
  |                                               |
  |C_A: A's data, text, and non-isolated libraries|
  |                                               |
  +-----------------------------------------------+    +-----------+
                                       ^               |           |
                                       |               |  Monitor  |
                                       +<--------------+           |
                                       |               +-----------+
                                       v
  +-----------------------------------------------+
  |                                               |
  |C_B: B's data, text, and non-isolated libraries|
  |                                               |
  +-----------------------------------------------+

Wherein the monitor controls isolation between compartments, and within a compartment, it adds no overhead to control transfer between components. The resulting programming model is one where a programmer can easily call functions exposed by system libraries that operate securely on protected data. These functions can go on to call other non-isolated or isolated functions, and the runtime will handle it.

The anatomy of a compartment is as follows:

           +---------+
           | program |
           +--------++
                    ^
                    |
                    |
                    v
+------------+     ++-------+
| libruntime +<--->+ libstd |
+----+-------+     +--------+
     |
     v
+----+--+     +------+     +-------+
| state +---->+ heap |     | stack |
+-------+     +------+     +-------+

A program (which may be a library) links against libstd, which links against twizzler-abi (not shown) and libruntime. The runtime library then uses the state object (one per compartment) to locate the heap and handle allocation requests from the standard library.

Isolation options

The minimum isolation required by a library can be set by the library at compile time, or by the caller of the library at load time. Once a library is loaded, there is no way to change isolation levels without spinning up another compartmentalization of the same library.

There are, broadly, two trust directions to consider: does the caller trust the callee, and does the callee trust the caller. The resulting table shows the options:

| Trust relationship | Model |
|---------|---------|
| Both trust each other | No isolation -- simple dynamic library call |
| Callee doesn't trust caller | Isolated callee (e.g. library to safely update system state) |
| Caller doesn't trust callee | Isolated caller (e.g. program that operates on protected data calls an untrusted library) |
| Neither trust the other | Full isolation |

Whether or not the callee trusts the caller is set by a flag in the secure gate itself, so the caller can never bypass the isolation by setting a different policy. Similarly, the caller can specify its trust level at load or runtime.

Selecting higher isolation levels comes with tradeoffs, in particular performance and restrictions on which APIs are possible. They are as follows:

  1. No isolation: no overhead above a standard dynamic library call
  2. Any amount of isolation requires security context switches and shadow stack construction.

Notes on "thread state" and secure gates

When switching between compartments, we must take care to 1) prevent improper transfer of control flow into a compartment, 2) keep sensitive information isolated within a compartment, and 3) prevent the isolated compartment from getting "tricked" into writing somewhere it didn't intend. The details of how we'll deal with these will be described in the reference-level explanation below, but here we'll quickly discuss some basic concepts.

The stack

One key aspect of thread runtime state that we don't have direct control over normally is the stack. Instructions generated by the compiler interact with the stack frequently and in automatic ways relative to our level of abstraction in programming Rust. Thus we, as programmers, have little control over when and how the stack is accessed, and cannot ensure that we get to "check" that the stack pointer is "okay" before it's used -- after all, a checking function will still use the stack!

Why do we care? Well, imagine an attack where an untrusted program calls a sensitive, isolated library to modify data in object O. Normally, the untrusted program cannot access object O, but imagine if we pointed our stack pointer into O and then called the secure gate! The isolated library will then corrupt O with stack frame data before it even gets a chance to do anything else.

Thus the runtime will ensure that the isolated library cannot accidentally corrupt some object using its stack pointer (details later).

Another concern is privacy -- any stack frames pushed to the stack by the isolated library could be visible to the untrusted program if they shared stacks. For this reason, the runtime sets up a shadow stack so that pushed stack frames by the isolated library stay within that library, and any stack arguments are still accessible. Note, however, that this now restricts using the stack for return values. We'll need to figure this one out.

The heap

The heap is another big issue. See, Rust follows the common programming model of a global heap, one per process. If we were to continue with this model, we'd need to put the APIs for accessing the global heap behind secure gates into the runtime, and even then we'd need hardware capabilities to prevent isolated compartments from corrupting heap data. Instead, we'll change the model to one heap per compartment, that way allocation can happen without huge overhead.

One thing this means is that heap data may, sometimes, not be sharable across compartments. We propose to default to a private-heap-per-compartment, so normal allocations are contained, however we can use Rust's allocator API to allocate from heaps that can be shared across compartments.

TLS

Thread-local storage is another major concern, as it involves the use of an architectural thread-pointer that we need to ensure has similar restrictions as the stack.

Unwinding

Unwinding across a foreign function interface is undefined behavior in Rust. While we're not exactly crossing an FFI boundary, we kinda are, since the heap and stack pointer change dramatically between compartments. On top of that, as we'll see later, the way we return from a compartment requires a bit of extra work. Thus we must always catch an unwind before going back across the compartment boundary. The runtime will allow the unwind to be caught, and resumed, across the boundary.

Global variables

Look, if you make a global variable, and then try to share it across a compartment boundary, it'll be restricted (either read-only or inaccessible), so I guess just plan for that. Or just don't use global, shared variables.

Reference-level explanation

Phew, okay. Let's start in on how we can make this happen.

The runtime is a combination security monitor and dynamic linker, managing the loading and linking of programs and libraries within and across isolation domains called compartments. It is, explicitly, part of the trusted computing base.

Security Contexts

Twizzler offers an abstraction of a security context with the following properties:

  1. A context contains capabilities that define the access rights of a thread that has this context as active.
  2. A thread can be attached to multiple contexts, though only one may be active at one time.
  3. A thread switches contexts automatically if a security fault occurs, and the thread is attached to a context in which the instruction would not cause a security fault.
  4. Transfer of control between contexts is limited by an object's secure gates, a set of allowed entry points into this object. If a security context switch occurs, and the instruction pointer is not pointing to a valid gate, that context is not considered a valid context to switch to.

The runtime uses this feature via a 1-1 mapping of compartment to security context. All programs and libraries within a given compartment run within the same security context (if possible), or in others with control flow transfer managed by the runtime if necessary. Note that the runtime itself is contained within a compartment and isolated from all other compartments, allowing only approved operations to be invoked (via security gates) should a loaded program wish to interact with the runtime monitor directly.

How to isolate a library

Let's go back to that running example: untrusted program A wishes to call isolated library L's function foo, which is a security gate. First, library L (and its associated runtime state) are contained within compartment CL, while A is contained within CA. Upon call to CL, via the call instruction, the processor pushes the return address to the stack and then jumps. The instruction pointer is now in L, pointing at the start of foo. But we are still within context CA! This triggers a page fault, since, because it is isolated, L is not executable in CA. The kernel finds CL, where the execution is valid, and then continues with the first instruction. Let's say it's push rbp, which it probably is. This triggers a security fault. Why?

See, we must protect against the callee (foo) corrupting some data that A didn't have write access to but L does. The caller could have pointed the stack pointer at some sensitive data and then called foo. To protect against this, the stack that A is using is not writable in CL. The security fault is handled by the runtime monitor, as it's registered itself as the fault handler with the kernel during setup. The runtime monitor then constructs a shadow stack for L, using the object copy-from primitive. Execution is then allowed to proceed normally.

Upon return, we have to do another thing -- we have to check to see if the return address is safe. The caller could have pushed a location within L and then jumped to foo, instead of calling it. This would re-enable arbitrary code execution in L! So the runtime, when constructing the shadow stack, checks the return address. Finally, what if foo just instantly and blindly executes ret? While unlikely, we still have to deal with this. By default, the stack is made not writable from CL; however, the runtime still constructs the shadow stack. This option doesn't prevent L from reading A's stack frames, it just prevents L from doing anything to A's stack data before the runtime can interpose.

Finally, we can optionally refuse to create a shadow stack and instead require that the callee has a new stack. This results in the construction of a new stack with a fake frame that returns to the call site, but contains no other data from A's frames.

So, how do we enforce that the stack isn't writable on context switch?

Yeah, we do need to do this! See, the "prevent corruption" motivation above does still require the runtime to interpose always, which means that the stack pointer cannot be writable (or, even readable sometimes!) when a gate is used. Thus we propose to extend the Twizzler gate model to include the ability to check the validity of architectural pointers:

  • Stack and base pointer: at least -w- in source and at most r-- in target
  • Thread pointer: at most r-- in both contexts. This covers TLS -- the thread pointer points to a region of memory that is used as the TLS block locator for dynamic memory objects. That index should not be writable by either context, as it is under the control of the runtime. Further, the thread pointer should not be changeable in the source context (denoted by the capability to write the thread repr object).
  • Upcall pointer (not really architectural, but): at most r-- in both contexts. The runtime needs to be the handler for upcalls, so we ensure that it cannot be executable in the target, nor do we want the source compartment to handle (e.g.) security violations for the target compartment. Same as thread pointer: not changeable in the source context.

For flexibility, these permissions should not be hardcoded, but instead will allow the gate's creator to specify both required and disallowed masks. However, the above may be the default.

What about the heap?

Each compartment has a heap, which means each compartment has an independent allocator and libstd. We can achieve this by linking to a compartment-local libruntime that is configured with enough information to manage its own heap independent of the other compartments. This is a different linking algorithm than is standard for dynamic libraries. We are first explicitly linking against a library that has "first dibs" on symbols, and falling back to global symbol resolution only after that (ok -- I suppose this is kinda like LD_PRELOAD).

This trick, of allowing heap allocation without the runtime, is necessary for correctness (isolation of heap data) but also for performance, since a context switch on every allocation would be a hard pill to swallow. Fortunately, we can make heap allocation just as cheap as it is now with this per-compartment heap trick, at the cost of some additional complexity to share heap data across compartments. Heaps are private by default, but we can always create additional heaps with different permissions and then use Rust's allocator API to allocate data from those heaps instead of the default.
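
As a sketch of what this could look like from driver or application code (this assumes Rust's unstable allocator_api feature and a hypothetical SharedHeap allocator handle provided by the runtime):

#![feature(allocator_api)]

// `SharedHeap` is a hypothetical allocator for a heap object mapped into
// more than one compartment; it must implement std::alloc::Allocator.
fn build_shared_buffer(shared: SharedHeap) -> Vec<u8, SharedHeap> {
    // Ordinary allocations still land in the compartment-private heap; only
    // this vector's storage lives in the shared heap.
    let mut buf = Vec::with_capacity_in(1024, shared);
    buf.extend_from_slice(b"hello across compartments");
    buf
}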

How are executable objects linked and loaded?

We use the linker provided by LLVM to link executables and libraries, but we provide a custom linker script that ensures the program's memory layout fits with the runtime. For example, a typical program might look like this:

| Section | vaddr  | len   | offset | perms |
|---------|--------|-------|-------:|:-----:|
| .text   | 0x1000 | 0x800 |      0 |  r-x  |
| .rodata | 0x1800 | 0x100 |  0x800 |  r--  |
| .data   | 0x2000 | 0x100 | 0x1000 |  rw-  |
| .bss    | 0x2100 | 0x130 |    N/A |  rw-  |

As you can see, the program's sections are loaded at specific virtual addresses, with their contents taken from the given offsets in the file (this is a simplified table compared to ELF). On Twizzler, however, the layout is more likely to be:

| Section | vaddr      | len   | offset | perms |
|---------|------------|-------|-------:|:-----:|
| .text   | 0x1000     | 0x800 |      0 |  r-x  |
| .rodata | 0x1800     | 0x100 |  0x800 |  r--  |
| .data   | 0x40001000 | 0x100 | 0x1000 |  rw-  |
| .bss    | 0x40001100 | 0x130 |    N/A |  rw-  |

Note that the major change is in the vaddr field: the data and bss sections are bumped into the second object slot of the address space (objects are 0x40000000 bytes in size). This is already how Twizzler operates; we just need to extend it to be a little more general for loading position-independent libraries. The example above loads the program into object slots 0 and 1. The loader first maps the executable object directly into slot 0 (the linker script ensures this direct mapping is correct), then creates a new data object whose initial image is copied from the executable's data section (using the copy-from primitive to leverage copy-on-write), and maps that object into slot 1. For a position-independent library, we'll allocate two consecutive slots and map the text and read-only data into the first, and the writable data and bss into the second.
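
The loading steps for the table above, as an illustrative sketch. Every type and function here is a hypothetical stand-in for the runtime's real object-creation and mapping interface; the offsets and lengths are the ones from the example table.

#[derive(Clone, Copy)]
struct ObjId(u128);
#[derive(Clone, Copy)]
enum Prot { ReadExec, ReadWrite }

fn map_into_slot(_slot: usize, _obj: ObjId, _prot: Prot) { /* kernel mapping call */ }
fn create_object_copy_from(_src: ObjId, _offset: usize, _len: usize) -> ObjId {
    ObjId(0) // would use the copy-from primitive, keeping pages copy-on-write
}

fn load_executable(exe: ObjId) {
    // Slot 0: the executable object itself (.text and .rodata), mapped directly;
    // the linker script guarantees the section addresses match this mapping.
    map_into_slot(0, exe, Prot::ReadExec);
    // Slot 1: a fresh object whose initial image is the executable's .data
    // (file offset 0x1000, length 0x100 in the example table). The .bss needs
    // no file backing; the new object's untouched pages read as zeros.
    let data = create_object_copy_from(exe, 0x1000, 0x100);
    map_into_slot(1, data, Prot::ReadWrite);
}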

At this point, we need to run the standard dynamic linking algorithm, with some small exceptions, to relocate and link any loaded programs and libraries. Intra-compartment symbol resolution results in standard dynamic library function calls, whereas inter-compartment resolution limits communication to secure gates. The main exceptions to the standard linking process are to ensure that allocations are performed intra-compartment by default, and that all calls stay within a compartment unless they go through a secure gate.

More Details about the Irreducible Rust Runtime

Above we talked a bit about the stack and the heap, two of the most important parts of the runtime. But there is more.

Thread Local Storage

The thread pointer, mentioned above, points to a per-thread data structure. The complication here is that we'll need to isolate each compartment's TLS data from the others. We can do this by using the higher-overhead TLS implementation that dynamic libraries use, where the thread pointer points to a vector of TLS regions, and we call a helper function to actually load a reference to a TLS variable.

This vector must then be protected appropriately: read-only for everyone except the runtime. The runtime sets up the TLS vector, and other libraries use it to find their regions. On compartment switch, the runtime could switch the thread pointer to a restricted vector. This means we'll need to protect updates to the thread pointer, which we do by saying that the thread pointer can be updated only in contexts that have write access to the thread's repr object. A thread doesn't need write access to its own repr, so we can prevent compartments from changing the thread pointer. Finally, the kernel can verify properties of the thread pointer (not writable in the source or target context) on compartment switch.
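
The lookup itself would resemble ELF dynamic TLS (__tls_get_addr). A sketch with a hypothetical layout:

/// Identifies a TLS variable by the module it belongs to and its offset within
/// that module's TLS block (what the helper function receives).
struct TlsIndex {
    module: usize,
    offset: usize,
}

/// The structure the thread pointer refers to: one TLS block base per loaded
/// module/compartment. Only the runtime may modify it; on a compartment switch
/// it could be swapped for a restricted view.
struct TlsVector {
    regions: Vec<*mut u8>,
}

/// Analogous to __tls_get_addr: resolve a TLS variable's address at runtime.
fn tls_get_addr(vec: &TlsVector, idx: &TlsIndex) -> *mut u8 {
    unsafe { vec.regions[idx.module].add(idx.offset) }
}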

Upcall pointer

This isn't an architectural pointer, but it is necessary for the kernel to deliver synchronous events to a thread. We can play the same trick: disallow updates to the upcall pointer, and set it so that upcalls enter the runtime.

Unwinding

Since it's undefined behavior to unwind across an FFI boundary, there's a good chance we'll need to catch unwinding panics in a trampoline for a security gate. So if you write something like this:

#[security_gate]
pub fn foo(x: u32) -> u32 {...}

Under the hood, it'll get rewritten to something like:

pub fn foo(x: u32) -> std::thread::Result<u32> {
    // Catch the panic here so it cannot unwind across the gate boundary.
    std::panic::catch_unwind(|| __do_foo(x))
}

fn __do_foo(x: u32) -> u32 {...}

Drawbacks

The security limitations do add overhead on a security context switch. However, the comparison should not be to library calls, but to invoking an entire new process on unix.

It is a huge undertaking. We could instead skip this, port a dynamic linker, etc. But I think that would miss out on an opportunity to leverage Twizzler's features to build a better programming and system model.

Rationale and alternatives

This design builds directly off the Twizzler security model and is, as far as I know, the simplest way to expose a secure programming interface such that secure programming is easy to do.

This RFC is careful to walk a line between being opinionated about how programming should be done on Twizzler and remaining sufficiently flexible (as would be expected of an OS). It does essentially propose a reference runtime, which raises worries about locking us into a particular ecosystem; the alternative, however, is essentially no standard programming environment for Twizzler, which is unacceptable.

Prior art

Some useful papers and concepts:

  • Lightweight Contexts (OSDI '16)
  • Jails
  • Solaris Doors
  • Dynamic linking
  • Compartmentalization

Future possibilities

The Reference REPL

As one example of something you could build atop this, let's consider an interactivity model for Twizzler. Instead of a "shell", we'll think of the interaction point as a REPL, broadly construed; it will be defined concretely in a different RFC. We'll use some example, shell-like syntax here as a placeholder to avoid bikeshedding. In a system where libraries explicitly expose calling points (Nandos, Secure Gates), we could expose these typed interaction points to the command line interface itself. For example, imagine a library exposes an interface for updating a password as follows:

#[secure_gate]
pub fn update_password(user: &User, password: String) -> Result<(), UpdatePasswordError>;
pub fn lookup_user(name: &str) -> Result<User, LookupUserError>;

Not saying this is a good interface, just roll with it. Let's imagine that the User type implements TryFrom<String>, and that the UpdatePasswordError and LookupUserError types are enums whose variants list the possible failures. Further, these error types implement the Error trait. Now, let's say these functions are in a library that gets compiled to an object named libpasswd. We can then expose this library to the REPL, which can enumerate the library's interface. If the source is provided (or, maybe if we look at Rust metadata??? or debug data??? or we generate type info in the nando struct???), the REPL knows the interface and all the types for the functions, so it can extrapolate a command line interface automatically:

Long form (the X ?= Y syntax is sugar for X = Result::unwrap(Y)):

twz> user ?= passwd::lookup_user bob
twz> passwd::update_password user changeme
Ok(())
twz>

Wrong username:

twz> user ?= passwd::lookup_user bobby
Error: No such user

If we hadn't implemented TryFrom<String> for User:

twz> passwd::update_password bob changeme
Type Error: Expected 'User' got 'String'

But, since we did, we get a shortcut:

twz> passwd::update_password bob changeme
Ok(())

Or, with the wrong username:

twz> passwd::update_password bobby changeme
Error: No such user

Or, if you don't have permission:

twz> passwd::update_password bob changeme
Security Error: Failed to call update_password in passwd: Permission denied

The basic REPL here has special handling for String, Into/From/TryInto/TryFrom String, impl Error, Result, Option, etc., but otherwise builds this command line interface, including arguments and error reporting, automatically from the types.
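
For instance, the shortcut above only works because of a conversion along these lines. This sketch reuses the hypothetical User, LookupUserError, and lookup_user from the libpasswd example; the real implementation could differ.

impl TryFrom<String> for User {
    type Error = LookupUserError;
    fn try_from(name: String) -> Result<Self, Self::Error> {
        // The REPL sees a String argument where a User is expected and applies
        // this conversion, so "bob" resolves to a User before the gate is called.
        lookup_user(&name)
    }
}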

Another example would be some library, foo, that wants to update some object by ID. So it exposes some function bar(id: ObjID) -> .... Now, typing an object ID is annoying, but doable if necessary, so we do allow that. But the REPL can see this type and automatically try to resolve that positional argument with a (default, but configurable) name resolution mechanism. So the user can still type a name, even if the actual programming interface takes an object by ID. And this can be extended to object handles, too. Those can even be typed:

#[nando]
pub fn baz<T: SomeKindaObject>(obj: Object<T>) -> ...;

If "name" resolves to an object that implements SomeKindaObject:

twz> foo::baz name
...

If "name" resolves to an object that does NOT implement SomeKindaObject:

twz> foo::baz name
Type Error: Object 'name' does not implement SomeKindaObject.

If "name" does not resolve:

twz> foo::baz name
Name Error: Object 'name' failed to resolve: ...

This means that system software is just the set of libraries written to operate or interact with the system. The command line interface is simply a translation from command-line interaction to library calls, and the type info is sufficient to generate that translation automatically.

And of course a more powerful REPL can just expose library calls that interact with the system and the data model directly in its programming language.

Note that you can recover the semantics of a standard unix-like program via fn cli_main(args: &[&str]) -> i32; in fact, this would be an effective way to wrap and invoke such programs (just load the program and pass args via ELF aux data, or whatever).

Also note that I'm just using the above syntax as an example. One powerful feature of this approach is not just that it makes scripting the OS even easier than shell scripting, since you get typed library calls instead of executable invocations, but that it does so without coupling deeply to the language.

Unresolved questions

  • What does the runtime trait look like?
  • How do we hot-plug that runtime?
  • What does the unix compatibility story look like?
  • More syntax bikeshedding about the cli stuff.
  • Key management...
  • Can we return things using the stack in the shadow stack setup?
  • Should the runtime be async by default?
  • What does it look like to write a command line interface nando? Should it be a #[cli_interface] macro that we can use to auto-gen a CliCall struct that contains type info, etc?
  • Can we implement this with ASLR?

Future possibilities

We can imagine the runtime providing deep introspection on the libraries and executables it loads and isolates. For example, imagine a debugger (controlled by the runtime) that can seamlessly transition from debugging a "script" in the REPL to debugging a loaded library (via step-in).

If all system software is written this way, and maintains OS configuration data and runtime telemetry data in objects, the REPL can expose a query-like interface for interacting with the OS, and can seamlessly be extended via nandos and secure gates.