User Space Primitive Documentation
Introduction
This documentation is largely still a collection of notes based off the previous C-based Twizzler system, and is being updated to reflect new APIs.
Where to begin
Twizzler introduces objects to organize persistent data, rather than files in traditional systems. This provides the benefit of not having to serialize and deserialize data to make it persistent.
Pages explaining the main abstractions of the OS are available at the following links: Objects (for the main data abstraction), Views (for thread environments), and Kernel State Objects (the security model). From these basics, there are a number of features provided by the Twizzler userspace that can be used to enhance programs, but are not necessary for understanding the fundamentals of the OS.
- To get a background on the motivations and a high level understanding of the goals of the operating system, we recommend Twizzler: a Data-Centric OS for Non-Volatile Memory. This is a research paper explaining the system for academic readers.
- To just jump in, follow the build guide and look at the code documentation (essentially manual pages) for primitive functions.
Building Twizzler
A bit of a time consuming process the first time, so make sure you have some nice tea or something before you start :)
Requirements
This build process has been tested on an Ubuntu 20.04 system with standard development tools installed, in addition to rustup (which is required). We require rustup because we will build our own toolchain during the build, and link the result through rustup for easier invocation of the Rust compiler.
To build a boot image, you'll need the limine bootloader installed. In particular, we need the EFI code to help boot Twizzler through their boot protocol.
To run qemu through the build system, you'll need qemu installed.
Overview
Installing the tools:
- sudo apt install build-essential
- sudo apt install python
- sudo apt install cmake
- sudo apt install ninja-build
- Install Rust https://www.rust-lang.org/tools/install
- Clone submodules:
git submodule update --init --recursive
Building Twizzler is done in several steps:
- Building xtask.
- Building the toolchain.
- Building Twizzler itself.
Fortunately, step 0 (building xtask) is handled automatically whenever we try to do anything. That's because xtask is the "build system orchestrator". Essentially, building Twizzler requires using the right toolchain, target specification, and compile flags at the right times, so we've placed that complexity in an automation tool to make builds easier. To get an idea of what xtask is doing, you can run
cargo xtask --help
Note that this repo's cargo config provides aliases for the common commands, as we will see below. In fact, it's advisable NOT to use the default cargo commands, and instead run everything through xtask.
Step 1: Building the Toolchain
This step takes the longest, but only has to happen once. Run
cd where/you/cloned/twizzler
cargo bootstrap
and then wait, while you sip your tea. This will compile LLVM and bootstrap the Rust compiler, both of which take a long time. At the end, you should see a "build completed successfully" message, followed by a few lines about building crti and friends.
Step 2: Building Twizzler
Now that we've got the toolchain built and linked, we can compile the rest of Twizzler. Run
cargo build-all
which will compile several "collections" of packages:
- The build tools, for things like making the initrd.
- The kernel.
- The userspace applications.
By default, everything is built in debug mode, which runs very slowly. You can build in release mode with:
cargo build-all --profile release
Step 3: Running Twizzler
You can start Twizzler in QEMU by running
cargo start-qemu
which will boot up a QEMU instance. If you want to run the release mode version, you can run
cargo start-qemu --profile release
Step 4: Exiting Twizzler
At the moment Twizzler does not have a shutdown command. To exit the QEMU-based emulation, use the Ctrl-a X key sequence, which is provided by QEMU itself rather than by Twizzler.
Objects
Definition
Objects are an abstraction of a set of related data with the same lifetime and permissions. This vague definition allows for applications to define what data is contained in a single object in a way that is most reasonable for the particular use case. For example, a B-Tree could contain all nodes in the same object given that the nodes likely have the same permissions and lifetime. However, another tree with different permissions for children could separate these nodes into different objects. For this second example, managing their lifetime can be done with ties.
Kernel interposition is only done when creating and deleting objects, leaving access and modification to userspace facilities and hardware. Access control works by specifying policies and letting hardware enforce them. This allows the kernel to avoid involvement in access, improving performance without sacrificing security. Objects maintain a reference count to prevent deletion of object data while multiple pointers reference it.
Object Creation
When creating objects, the medium storing the data can be chosen, for example volatile DRAM or non-volatile memory. While these options are supported by default, other types can be configured based on the hardware support of the particular machine. Different storage media provide different benefits and costs; a more in-depth discussion can be found under Object Lifetime.
When creating objects, a source object can be denoted, where the new object will be a copy of the original. This allows for easy versioning, as objects can be copied and kept as different versions. Copying an object uses copy-on-write, meaning another copy of the data is only created when a change is made, rather than immediately on creation.
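The copy-on-write behaviour described above can be illustrated with a small user-level analogy. This is only a sketch built on the Rust standard library's Rc::make_mut; it is not Twizzler's object-creation API, and the names ObjectData, copy_object, and write_byte are hypothetical.

```rust
use std::rc::Rc;

/// Hypothetical stand-in for an object's contents (not a Twizzler type).
type ObjectData = Vec<u8>;

/// "Create" a new object from a source: only the reference is copied.
fn copy_object(src: &Rc<ObjectData>) -> Rc<ObjectData> {
    Rc::clone(src)
}

/// Write one byte. The data is duplicated first only if it is still shared.
fn write_byte(obj: &mut Rc<ObjectData>, index: usize, value: u8) {
    Rc::make_mut(obj)[index] = value;
}

fn main() {
    let original = Rc::new(vec![1u8, 2, 3]);
    let mut version2 = copy_object(&original);
    // No data has been copied yet: both handles share one allocation.
    assert!(Rc::ptr_eq(&original, &version2));

    write_byte(&mut version2, 0, 42);
    // The write triggered the copy; the original version is unchanged.
    assert!(!Rc::ptr_eq(&original, &version2));
    assert_eq!(*original, vec![1, 2, 3]);
    assert_eq!(*version2, vec![42, 2, 3]);
}
```

The point of the analogy is that keeping many versions is cheap as long as they are not modified, since unmodified versions share the same underlying data.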
IDs
While 2^64 object IDs would provide a large enough space for a single computer's address space without worry of running out of object IDs, two further goals create the possibility of collisions: generating IDs without having to interact with a central authority, and Twizzler's planned transparent single ID space across a distributed set of computers. Thus 128 bits are used to reduce both the probability of collisions and the ability to guess an object ID, while also creating a large enough ID space to allow for distribution.
ID derivation
IDs are derived by inputting a nonce, 128 bits of random data, to a hash function. The nonce is provided for objects created using copy-on-write (that is, objects where src points to a valid object when calling twz_obj_new()), so as to create unique IDs despite having the same object content.
There is also the ability to create object IDs by hashing the contents of the object. This is most useful for conflict-free replicated data types (CRDTs), where multiple computers are running distributed Twizzler and can aggressively replicate objects without worrying about consistency issues. Hashing to obtain an object ID is designed for immutable objects.
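The two derivation modes can be sketched as follows. This is an illustration under stated assumptions, not Twizzler's implementation: the source does not name the hash function, so the standard library's DefaultHasher (which is not cryptographic) stands in for it, and the function names hash64, hash128, id_from_nonce, and id_from_content are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash `data` together with a domain-separation tag.
/// DefaultHasher is NOT cryptographic; it is only a placeholder here.
fn hash64(tag: u8, data: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    tag.hash(&mut h);
    data.hash(&mut h);
    h.finish()
}

/// Build a 128-bit value out of two differently tagged 64-bit hashes.
fn hash128(data: &[u8]) -> u128 {
    ((hash64(0, data) as u128) << 64) | (hash64(1, data) as u128)
}

/// Nonce-derived ID: the ID depends only on the nonce, so two copies of
/// the same content get different IDs when given different nonces.
fn id_from_nonce(nonce: u128) -> u128 {
    hash128(&nonce.to_le_bytes())
}

/// Content-derived ID: identical (immutable) content maps to the same ID.
fn id_from_content(content: &[u8]) -> u128 {
    hash128(content)
}

fn main() {
    let a = id_from_content(b"immutable payload");
    let b = id_from_content(b"immutable payload");
    println!("content IDs equal: {}", a == b);
    println!("nonce ID: {:032x}", id_from_nonce(0x1234));
}
```

Content-derived IDs let independent machines agree on an ID without coordination, which is why that mode only makes sense for objects whose contents never change.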
Object Lifetime
Volatile Memory
Placing objects in volatile memory (such as DRAM) limits the object lifetime to at most the time of the next power cycle. This can provide easy cleanup for temporary data, such as the result of computation or cached data kept in memory for locality (faster access).
Objects in volatile memory can be accessed and used in the same ways they are in non-volatile memory.
Ties
Ties handle object lifetime by allowing for automatically deleting objects once other constructs are deleted. For example, if an application crashes, the temporary computation might be useless, yet keeping the temporary computation in volatile memory until a power cycle occurs is a waste of that space. Instead, we can tie the lifetime of the object to other objects, such that the object is automatically deleted with the other, freeing the memory before a power cycle. This mechanism is convenient because the kernel does not have to maintain an understanding of the implied lifetime of an object, rather it can be specified relative to other objects.
Ties also provide the benefit of allowing temporary context (such as stacks and heaps) to be stored in persistent memory, allowing for recovery after a power cycle.
For example if we have two objects: koala and coldbrew, tying koala to coldbrew means that koala will not be deleted until after coldbrew is. While koala can be deleted immediately after coldbrew is, if koala is tied to multiple objects, it will only be fully deleted when the final object it is tied to is deleted. Additionally, if koala is tied to a handful of objects, once all of those objects are deleted, koala will be automatically deleted too. This is similar to the practice of creating a file and immediately unlinking it within Unix, so the file is automatically deleted once the file descriptor is closed.
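The koala and coldbrew example can be modelled with a small table of ties. This is a toy sketch of the automatic-deletion rule only (an object is deleted once every object it is tied to is gone); TieTable and its methods are hypothetical names, not a Twizzler API, and the model does not try to forbid deleting a tied object early.

```rust
use std::collections::{HashMap, HashSet};

/// Toy model of ties: `ties[x]` is the set of live objects that x is tied to.
#[derive(Default)]
struct TieTable {
    live: HashSet<&'static str>,
    ties: HashMap<&'static str, HashSet<&'static str>>,
}

impl TieTable {
    fn create(&mut self, name: &'static str) {
        self.live.insert(name);
    }

    /// Tie the lifetime of `obj` to `to`.
    fn tie(&mut self, obj: &'static str, to: &'static str) {
        self.ties.entry(obj).or_default().insert(to);
    }

    fn exists(&self, name: &str) -> bool {
        self.live.contains(name)
    }

    /// Delete `name`, then delete every object whose last tie just vanished.
    fn delete(&mut self, name: &'static str) {
        if !self.live.remove(name) {
            return; // already deleted
        }
        self.ties.remove(name);
        let mut orphaned = Vec::new();
        for (obj, targets) in self.ties.iter_mut() {
            if targets.remove(name) && targets.is_empty() {
                orphaned.push(*obj);
            }
        }
        for obj in orphaned {
            self.delete(obj);
        }
    }
}

fn main() {
    let mut t = TieTable::default();
    t.create("koala");
    t.create("coldbrew");
    t.create("latte");
    t.tie("koala", "coldbrew");
    t.tie("koala", "latte");

    t.delete("coldbrew");
    assert!(t.exists("koala")); // still tied to latte
    t.delete("latte");
    assert!(!t.exists("koala")); // last tie gone, so koala is deleted too
}
```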
Since most constructs in Twizzler are just various types of objects, we can use ties to establish a lifetime of objects based on the existence of other objects, such as threads or views. For a more detailed explanation of views, see the page on views. For temporary computation done in a thread (an instance of computation), objects can be tied to the thread. This provides similar semantics to creating and immediately unlinking files on Unix: in both cases, once the application exits, the data is deleted. A view provides an address space for an application to run in, possibly over multiple time periods. Tying an object to a view allows the object to exist for as long as the execution state of the application, which could be as long as the application is installed, or until all application data is deleted.
Ties to Volatile Memory
Tying volatile objects to each other implies that both objects will be deleted at a power cycle, which is to be expected. However, things get a little more complicated with ties between volatile and persistent objects. Tying a volatile object to a persistent one breaks the semantics of ties in the event of a power cycle, as the volatile object is deleted and the persistent one is not. However, this outcome is to be expected, and we assume programmers will use this when doing temporary computation. Tying a persistent object to a volatile object is dangerous as both objects will be deleted in the event of a power cycle, including an unexpected one.
Pointers
Definition
There are two types of pointers in Twizzler: persistent and dereferenceable. Persistent pointers are used to refer to external data with no extra context needed. They can be thought of as file names on a traditional operating system, where the data exists longer than any process or power cycle. Dereferenceable pointers refer to data in ways that act like traditional memory accesses when programming on other operating systems (such as stack or heap data). Unlike persistent pointers, the data can be acted upon, such as by reading or writing, but additional context is necessary.
Rationale
Persistent pointers are much more efficient than file I/O, as the pointer links directly to the data without the need to deserialize a file. This allows data structures to link directly to other data, which is exactly what objects hold.
Foreign Object Table
Persistent pointers work by indexing into a Foreign Object Table (FOT), which holds a longer reference to the data, allowing for late binding of names. Late binding in FOTs is explained in more detail in a later section. A persistent pointer is thus just an index identifying the external object (16 bits, allowing for 65,536 object references) and an offset within that foreign object (40 bits, allowing offsets of up to 1 terabyte). Because access control is at object granularity, multiple pointers to the same object can use the same FOT entry.
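The index-plus-offset encoding can be sketched as a packed integer. The bit widths (16 and 40) come from the text above; the exact placement of the fields within a 64-bit word, and the names PersistentPtr, fot_index, and offset, are assumptions made for illustration and may not match Twizzler's actual layout.

```rust
/// Number of bits used for the offset within the foreign object.
const OFFSET_BITS: u32 = 40;
/// Largest representable offset: 2^40 - 1, i.e. just under 1 TiB.
const OFFSET_MASK: u64 = (1 << OFFSET_BITS) - 1;

/// Hypothetical packed persistent pointer: FOT index above the offset.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct PersistentPtr(u64);

impl PersistentPtr {
    /// Pack an FOT index and an offset; fails if the offset needs more than 40 bits.
    fn new(fot_index: u16, offset: u64) -> Option<Self> {
        if offset > OFFSET_MASK {
            return None;
        }
        Some(Self(((fot_index as u64) << OFFSET_BITS) | offset))
    }

    /// Which FOT entry (and therefore which object) this pointer refers to.
    fn fot_index(self) -> u16 {
        (self.0 >> OFFSET_BITS) as u16
    }

    /// Byte offset within the referenced object.
    fn offset(self) -> u64 {
        self.0 & OFFSET_MASK
    }
}

fn main() {
    let p = PersistentPtr::new(3, 0x1000).expect("offset fits in 40 bits");
    assert_eq!(p.fot_index(), 3);
    assert_eq!(p.offset(), 0x1000);
    // An offset of exactly 2^40 does not fit.
    assert!(PersistentPtr::new(0, 1u64 << 40).is_none());
}
```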
Late-Binding
Late binding of names is used often with libraries to allow for updates of the library without requiring every program using the library to be recompiled. In Twizzler, late-binding is done by putting a name as the entry in the FOT. When creating the FOT entry, a name resolver can also be specified to allow for different objects to have different name resolvers. The actual name resolution happens when converting the persistent pointer to a dereferenceable one, allowing for different objects to be resolved at different times, just based on the name at dereference time.
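A minimal sketch of late binding through an FOT follows. It assumes an FOT entry holds either a fixed object ID or a name plus a per-entry resolver; the types FotEntry, Namespace, and Resolver, and the functions exact and resolve, are hypothetical and exist only to show that the name is looked up at dereference time rather than when the entry is created.

```rust
use std::collections::HashMap;

type ObjId = u128;
/// Hypothetical name service: maps names to the object they currently denote.
type Namespace = HashMap<&'static str, ObjId>;
/// A per-entry name resolver.
type Resolver = fn(&Namespace, &str) -> Option<ObjId>;

enum FotEntry {
    /// Early-bound: the entry names an object ID directly.
    Id(ObjId),
    /// Late-bound: the entry stores a name and how to resolve it.
    Name { name: &'static str, resolver: Resolver },
}

/// The simplest possible resolver: an exact lookup.
fn exact(ns: &Namespace, name: &str) -> Option<ObjId> {
    ns.get(name).copied()
}

/// Resolve FOT entry `index` at dereference time.
fn resolve(fot: &[FotEntry], index: u16, ns: &Namespace) -> Option<ObjId> {
    match fot.get(index as usize)? {
        FotEntry::Id(id) => Some(*id),
        FotEntry::Name { name, resolver } => resolver(ns, name),
    }
}

fn main() {
    let fot = vec![
        FotEntry::Id(7),
        FotEntry::Name { name: "libfoo", resolver: exact },
    ];
    let mut ns = Namespace::new();
    ns.insert("libfoo", 100);
    assert_eq!(resolve(&fot, 1, &ns), Some(100));

    // "Updating the library": the same FOT entry now resolves to a new object.
    ns.insert("libfoo", 200);
    assert_eq!(resolve(&fot, 1, &ns), Some(200));
    assert_eq!(resolve(&fot, 0, &ns), Some(7));
}
```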
Permissions
A thread has permission to access an object if:
- It has not been restricted by a mask (including the global mask).
- The thread has the capability, or a delegated capability (by attaching to a security context).
- The thread knows the object's name (security by obscurity).
Permission values for objects
There are 5 permissions an object can have: read, write, execute, use, and delete. Except for use (and to an extent delete), these permissions exist in Unix systems, and are used in the same way.
- Read: This allows a thread the ability to look at the contents of an object.
- Write: This allows a thread the ability to modify an object.
- Execute: The object can be run as a program.
- Delete: The object can be deleted. Usually Unix systems include this as part of write permissions, and Windows systems allow this to be a separate permission.
- Use: This marks the object as available for the kernel to operate on, such as a kernel state object (explained further under Kernel State Objects). Often this is used for attaching a thread to a security context.
Masks
Masks further restrict permissions to objects. This is similar to umask in Unix systems. For example, while by default any object may have access to an object called bloom, we may want a specific security context called Fall to not have access to the object.
We do not need signatures on masks because they are part of the security context, meaning threads can only modify the mask if they can modify the security context object.
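The interaction between granted permissions and masks can be sketched with bit flags. The five permission names come from the text; the bit values, the umask-like rule that mask bits are removed from the granted set, and the function names effective and allows are assumptions for illustration.

```rust
// Hypothetical bit assignments for the five permissions.
const READ: u8 = 1 << 0;
const WRITE: u8 = 1 << 1;
const EXEC: u8 = 1 << 2;
const USE: u8 = 1 << 3;
const DELETE: u8 = 1 << 4;

/// Apply a mask the way umask works: every bit set in `mask` is removed.
fn effective(granted: u8, mask: u8) -> u8 {
    granted & !mask
}

/// True if `perms` contains every permission in `wanted`.
fn allows(perms: u8, wanted: u8) -> bool {
    (perms & wanted) == wanted
}

fn main() {
    // Default permissions on the object "bloom".
    let bloom_default = READ | WRITE | EXEC;
    // The security context "Fall" masks out all access to bloom.
    let fall_mask = READ | WRITE | EXEC | USE | DELETE;
    // Another context only loses write access.
    let readonly_mask = WRITE;

    assert_eq!(effective(bloom_default, fall_mask), 0);
    let p = effective(bloom_default, readonly_mask);
    assert!(allows(p, READ | EXEC));
    assert!(!allows(p, WRITE));
}
```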
Capabilities
Capabilities provide permissions to objects as tokens: a program can access the data if it holds a valid token. Unlike previous implementations of capability systems, Twizzler includes an object ID as part of the capability signature to prevent a capability from being stolen by leaking the signature to malicious parties. While this does require identity to be checked in addition to the validity of the signature, it prevents simple leaks of secrets from breaking the security of an object.
Delegation
Delegation allows capabilities to be shared with other views and further restricted. In order to delegate a capability, the delegator must have high permissions within the object whose capability it wishes to delegate (enough to access the private key of the object).
Late Binding Access Control
Rather than checking access when an object is initially opened, as Unix does with a call to open(), Twizzler checks access at the time the operation is performed, such as a read or write. This means that a thread can open an object with more permissions than allowed and not cause a fault; the fault occurs only once the illegal operation is attempted.
This method of enforcing access control differs from Unix systems because the kernel is not involved in memory accesses, which is how all data is accessed in Twizzler. Protection still exists, however, because the MMU is programmed to limit access when a security context is loaded.
The Twizzler Reference Runtime
The primary runtime environment supported by Twizzler is the Reference Runtime. It provides userspace functionality for programs, the ability to load programs and libraries, and the ability to isolate those programs and libraries from each other.
It is a work in progress.
Stdio
The runtime provides three types of stream-like interfaces for basic IO, which should be familiar to most: stdin (for reading input), stdout (for writing output), and stderr (for reporting errors). Each of these can be handled by either a thread-local path, or a global path. When writing to stdout, for example, the runtime first checks if the thread-local path has a handler. If so, the output gets sent to that handler. If not, the runtime checks if there is a global handler registered for stdout. If so, the output goes there. If not, it gets sent to the fallback handler, which can be configured to either drop the output or send it to the kernel log.
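The lookup order described above (thread-local handler, then global handler, then fallback) can be sketched as follows. This is a simplified single-struct model rather than the Reference Runtime's actual interface; Stdout, Route, Fallback, Handler, and handler are hypothetical names, and the kernel log is simulated with eprintln!.

```rust
/// Which path ended up handling a write.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Route {
    ThreadLocal,
    Global,
    KernelLog,
    Dropped,
}

/// What the fallback handler does with output nobody claimed.
#[derive(Clone, Copy)]
enum Fallback {
    Drop,
    KernelLog,
}

type Handler = Box<dyn FnMut(&str)>;

/// Helper that boxes a closure as a handler.
fn handler(f: impl FnMut(&str) + 'static) -> Handler {
    Box::new(f)
}

struct Stdout {
    thread_local: Option<Handler>,
    global: Option<Handler>,
    fallback: Fallback,
}

impl Stdout {
    /// Try the thread-local path, then the global path, then the fallback.
    fn write(&mut self, msg: &str) -> Route {
        if let Some(h) = self.thread_local.as_mut() {
            h(msg);
            return Route::ThreadLocal;
        }
        if let Some(h) = self.global.as_mut() {
            h(msg);
            return Route::Global;
        }
        match self.fallback {
            Fallback::KernelLog => {
                eprintln!("[simulated kernel log] {}", msg);
                Route::KernelLog
            }
            Fallback::Drop => Route::Dropped,
        }
    }
}

fn main() {
    let mut out = Stdout { thread_local: None, global: None, fallback: Fallback::KernelLog };
    println!("{:?}", out.write("no handlers registered"));
    out.global = Some(handler(|m| println!("global: {}", m)));
    println!("{:?}", out.write("global handler registered"));
    out.thread_local = Some(handler(|m| println!("thread: {}", m)));
    println!("{:?}", out.write("thread-local handler registered"));
}
```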
Views
Views are an address space abstraction that sets an environment for threads to run in. Persistent objects are mapped into the view and given dereferenceable virtual pointers. These virtual pointers allow access to the object inside the view.
Because views are normal objects, they can be written to persistent memory and allow recovery of application state in the event of a crash or power cycle. The abstraction of views also allows easy sharing of thread state, such as references to data. This is convenient as it allows for sharing of data without requiring serialization through a construct like a pipe or file, and without the need for a call to mmap. Sharing views with other threads does pose a security threat, as one thread could corrupt the view for both. To deal with this security vulnerability, there is the abstraction of secure API calls/gates, which allows communication between threads without allowing one to corrupt another's data.
When a thread wants to map a new object into the view, it can call _____, and when it attempts to access the object, the kernel will automatically map it in. To change or remove an entry, the kernel must be involved through the function invalidate_view(), which updates references to the underlying memory.
To switch between views, the system call become() is needed.
Kernel State Objects
These are normal objects used by both userspace programs and the kernel. For them to be used by the kernel, the use permission must be set. To learn more about the use permission, see Permissions.
Security Contexts
A security context is an object that contains information about which objects can be accessed and how (such as managing capabilities). A thread attaches to the security context to gain access to the objects. This can be useful for operations similar to the sudo command on UNIX, where privileges are temporarily increased in order to perform certain privileged operations without fully changing user ID.
Additionally, security contexts can be used to limit permissions. To prevent a limited thread from shedding their limited permission state, attached contexts can be set as undetachable.
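The undetachable rule can be modelled with a short sketch. It only illustrates the idea that a thread cannot drop a context that was attached as undetachable; ThreadContexts, Attachment, and DetachError are hypothetical names, not Twizzler's kernel interface.

```rust
#[derive(Debug, PartialEq, Eq)]
enum DetachError {
    NotAttached,
    Undetachable,
}

/// One attached security context, identified by its object ID.
struct Attachment {
    context_id: u128,
    undetachable: bool,
}

/// The set of security contexts a thread is attached to.
#[derive(Default)]
struct ThreadContexts {
    attached: Vec<Attachment>,
}

impl ThreadContexts {
    fn attach(&mut self, context_id: u128, undetachable: bool) {
        self.attached.push(Attachment { context_id, undetachable });
    }

    /// Detach from a context, unless it was attached as undetachable.
    fn detach(&mut self, context_id: u128) -> Result<(), DetachError> {
        let pos = self
            .attached
            .iter()
            .position(|a| a.context_id == context_id)
            .ok_or(DetachError::NotAttached)?;
        if self.attached[pos].undetachable {
            return Err(DetachError::Undetachable);
        }
        self.attached.remove(pos);
        Ok(())
    }
}

fn main() {
    let mut t = ThreadContexts::default();
    t.attach(1, false); // temporary, sudo-like elevation
    t.attach(2, true); // sandbox the thread must not escape
    assert_eq!(t.detach(1), Ok(()));
    assert_eq!(t.detach(2), Err(DetachError::Undetachable));
}
```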
Extensions
Extensions correspond to the interface abstraction found in many programming languages, or to the set of functions that must be implemented for drivers, such as read() and write(). In practice, Twizzler's implementation of this applies to objects, where a set of methods is defined and noted such that external threads accessing the object can call the interface methods without needing to understand the specifics of the object's under-the-hood operations.
Examples
Two examples of extensions are IO and Event. IO is useful for reading and writing to an object. For an object to support the extension, it must implement the functions read(), write(), ioctl(), and poll(). When registering the extension, the object provides pointers to all of the functions, so a call to read() on the object, for example, is dispatched to the object-specific implementation.
Event is a way of waiting for something to happen to an object, similar to poll() on a file descriptor in Unix. Specific events can be waited for by using event_wait() with the object and event passed in as arguments. Because this is just an interface, an object can implement it in a way that makes sense to it, such as waiting for data from a network or for a write to an object to complete.
Tags
Tags are a way of uniquely identifying an extension, such as IO, and checking if the object supports the extension. These are stored in the metadata for the object, and when the tag is added to the metadata, a pointer to the functions that implement the interface is also added.
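Tag-based dispatch can be sketched as a table in the object's metadata that maps a tag to function pointers. For brevity this sketch covers only the read and write half of the IO extension; the tag value and the names Tag, IO_TAG, IoExt, Object, obj_read, and obj_write are hypothetical and do not reflect Twizzler's real metadata format.

```rust
use std::collections::HashMap;

/// Unique identifier for an extension.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct Tag(u64);

/// Hypothetical tag value for the IO extension.
const IO_TAG: Tag = Tag(1);

/// Function pointers implementing (part of) the IO extension.
struct IoExt {
    read: fn(&Object, &mut [u8]) -> usize,
    write: fn(&mut Object, &[u8]) -> usize,
}

struct Object {
    data: Vec<u8>,
    /// Metadata: which extensions this object supports, and how.
    extensions: HashMap<Tag, IoExt>,
}

/// Object-specific read: copy from the start of the object's data.
fn buf_read(obj: &Object, out: &mut [u8]) -> usize {
    let n = out.len().min(obj.data.len());
    out[..n].copy_from_slice(&obj.data[..n]);
    n
}

/// Object-specific write: append to the object's data.
fn buf_write(obj: &mut Object, input: &[u8]) -> usize {
    obj.data.extend_from_slice(input);
    input.len()
}

/// Generic entry points: look up the tag, then call through the pointer.
/// They return None if the object does not support the IO extension.
fn obj_read(obj: &Object, out: &mut [u8]) -> Option<usize> {
    let read = obj.extensions.get(&IO_TAG)?.read;
    Some(read(obj, out))
}

fn obj_write(obj: &mut Object, input: &[u8]) -> Option<usize> {
    let write = obj.extensions.get(&IO_TAG)?.write;
    Some(write(obj, input))
}

fn main() {
    let mut obj = Object { data: Vec::new(), extensions: HashMap::new() };
    assert_eq!(obj_write(&mut obj, b"hi"), None); // IO not registered yet

    obj.extensions.insert(IO_TAG, IoExt { read: buf_read, write: buf_write });
    assert_eq!(obj_write(&mut obj, b"hi"), Some(2));
    let mut buf = [0u8; 8];
    assert_eq!(obj_read(&obj, &mut buf), Some(2));
    assert_eq!(&buf[..2], b"hi");
}
```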
Gates
Gates, also known as secure API calls, are a means of exposing a limited set of functions available to an external user. This is similar to system calls, where the user can call into the kernel to do specific actions. Gates are used for interprocess communication.
Gates are a way of an object exposing a system call like interface. This allows an object to define arbitrary behavior other threads can call. Because external threads can only access the object through the gate, they are restricted from detrimental actions, provided the gate is correctly written. While this does place the responsibility for secure code in the hands of any programmer rather than the typical relegation of secure code to security experts, gates are optional and can be avoided if there is worry about security flaws.
When writing gates, best security practices are required to avoid vulnerabilities in the gates. As such, beware of timing attacks and other side channels that can be used to subtly exploit the object.
Direct Memory Access (DMA)
A key aspect of a device driver involves programming the device to access host memory. When a device accesses host memory, we usually call it Direct Memory Access (DMA). DMA is used, for example, by NICs to access transmit rings, or to copy packet data into main memory (memory that the CPU, and thus the OS and user programs can access). However, devices access main memory differently to how threads running on a CPU access memory. Before we discuss the API that Twizzler provides for DMA, we should discuss how devices access memory, and the implications this has for memory safety, translation, and coherence.
Considerations for DMA
When programs access memory in Twizzler they do so via accessing object memory, which involves an MMU translating some kind of object address to a physical address. On x86, for example, this involves a software translation to a virtual address followed by a translation via the Memory Management Unit (MMU) to a physical address. Similarly, when a device accesses memory, it emits a memory address (likely programmed by the driver) that may undergo no translation or some other translation on the bus before attempting to access host memory. There are two important considerations that are the result of this alternate (or no) translation:
- Contiguous addresses. While object memory is contiguous (within an object), the physical memory that backs that object memory may not be. Devices and drivers need to be capable of handling access to memory in a scatter-gather manner.
- Access Control. Access control can be applied differently between host-side driver software and devices. Thus driver software must be aware that it may have access to memory via the device that it should not directly. We can use devices like the IOMMU to limit this effect.
In addition to the above, we need to consider the issue of coherence. While CPU caches are coherent across cores, devices accessing host memory do not necessarily invalidate caches. Thus we have to handle both flushing data to main-memory after writing before the device reads it and invalidating caches if a device writes to memory. Some systems automatically invalidate caches, but not all do.
Memory Safety
Finally, we must consider memory safety. While we can control writes from host software to DMA buffers, we cannot necessarily control how the device will access that memory. To ensure memory safety of shared regions, we need to ensure:
- The device and host software cannot both mutate shared state at the same time (thread safety), or if this can happen, then the shared memory region that can be updated by both entities is comprised of atomic variables.
- The device mutates data such that each mutation is valid for the ABI of the type of the memory region.
Enforcing these at all times would add significant overhead. We take some inspiration from Rust's stance on external influences to memory, tempering this somewhat with the addition of a DeviceSync marker trait.
Overview of DMA System
The Twizzler DMA system is contained within the twizzler-driver crate in the dma module. The module exposes several types for using Twizzler objects in DMA operations along with an abstraction that enables easier allocation of DMA-able memory. The key idea behind Twizzler's DMA operation is that one can create a DmaObject, from which one can create a DmaRegion or a DmaSliceRegion.
These regions can then be "pinned", which ensures that all memory that backs them is locked in place
(the physical addresses do not change), and the list of physical addresses that back the region are
made available for the driver so that it may program the device.
Coherence and Accessing Memory
The primary way that the driver is expected to access DMA memory is through the with or with_mut methods of DmaRegion. These functions take a closure that expects a reference to the memory as argument. When called, the with function ensures coherence between the device and the CPU, and then calls the closure. The with_mut function is similar, except it passes a mutable reference to the closure and ensures coherence after the closure runs as well.
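The closure-based access pattern can be illustrated with a toy model. To be clear, this is not the twizzler-driver API: ToyRegion is a hypothetical stand-in whose sync method merely counts calls, so that the ordering of coherence operations around the closure (once before for with, before and after for with_mut) is visible.

```rust
/// Toy stand-in for a DMA region (NOT twizzler-driver's DmaRegion).
struct ToyRegion<T> {
    value: T,
    /// How many coherence operations have been performed so far.
    syncs: usize,
}

impl<T> ToyRegion<T> {
    fn new(value: T) -> Self {
        Self { value, syncs: 0 }
    }

    /// Placeholder for a real coherence operation (flush or invalidate).
    fn sync(&mut self) {
        self.syncs += 1;
    }

    /// Read access: make memory coherent, then run the closure.
    fn with<R>(&mut self, f: impl FnOnce(&T) -> R) -> R {
        self.sync();
        f(&self.value)
    }

    /// Write access: coherence before the closure and again after it,
    /// so the device observes what the closure wrote.
    fn with_mut<R>(&mut self, f: impl FnOnce(&mut T) -> R) -> R {
        self.sync();
        let result = f(&mut self.value);
        self.sync();
        result
    }
}

fn main() {
    let mut region = ToyRegion::new([0u8; 4]);
    region.with_mut(|buf| buf[0] = 7);
    assert_eq!(region.syncs, 2);
    let first = region.with(|buf| buf[0]);
    assert_eq!(first, 7);
    assert_eq!(region.syncs, 3);
}
```

Routing every access through a closure gives the library a well-defined point before and after the access at which to do this work, which is harder to guarantee if the driver holds a long-lived raw reference.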
The DmaSliceRegion type provides similar with functions, except they take an additional Range as argument that can be used to select only a subslice of the region that the closure gets access to. Allowing for subslicing here is useful because it allows the driver to communicate to the library which parts of the region need coherence before running the closure.
Access Directions and Other Options
Regions can be configured when they are created for various different use cases.
The Access Direction refers to which entities (the device and the CPU) may read and write the memory. Driver writers should pick the most restricted (but correct) mode they can, as it can have implications for maintaining coherence. It can have one of three values:
- HostToDevice: The memory is used for the host to communicate to the device. Only the host may write to the memory.
- DeviceToHost: The memory is used for the device to communicate to the host. The host may not write to the memory.
- BiDirectional: Either entity may write to the memory.
In addition to access direction, regions can be configured with additional options, a bitwise-or of the following flags:
- UNSAFE_MANUAL_COHERENCE: The with functions will not perform any coherence operations. The driver must manually ensure that memory is coherent.
Pinning Memory
Before a device can be programmed with a memory address for DMA, the driver must learn the physical address that backs the DMA region while ensuring that that address is stable for the lifetime of whatever operation it needs the device to perform. Both of these are taken care of with the pin function on a DmaRegion or DmaSliceRegion. The pin function returns a DmaPin object that provides an iterator over a list of PhysInfo types, which can provide the physical address of a page of memory.
A region of DMA memory that comprises some number of pages (contiguous in virtual memory) can list the (likely non-contiguous) physical pages that it maps to. The order that the pages are returned in is the order that they appear for backing the virtual region. In other words, the 4th PhysInfo entry in the iterator of a DmaPin for a region contains the physical address of the 4th virtual page in the DMA region.
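Because the physical pages arrive in virtual order but are not necessarily adjacent, a driver typically turns them into a scatter-gather list before programming the device. The sketch below shows one way to do that on plain addresses. It does not use the twizzler-driver types: the input is simply a slice of physical page addresses such as one might collect from a pin, and PAGE_SIZE (4 KiB), Segment, and coalesce are assumptions made for illustration.

```rust
/// Assumed page size; real hardware and configurations may differ.
const PAGE_SIZE: u64 = 4096;

/// One physically contiguous run of memory.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Segment {
    addr: u64,
    len: u64,
}

/// Merge physically adjacent pages (given in virtual order) into segments.
fn coalesce(pages: &[u64]) -> Vec<Segment> {
    let mut segments: Vec<Segment> = Vec::new();
    for &addr in pages {
        if let Some(last) = segments.last_mut() {
            if last.addr + last.len == addr {
                // This page directly follows the previous run: extend it.
                last.len += PAGE_SIZE;
                continue;
            }
        }
        // Otherwise start a new run at this page.
        segments.push(Segment { addr, len: PAGE_SIZE });
    }
    segments
}

fn main() {
    // Five virtual pages backed by three physically contiguous runs.
    let pages = [0x1000, 0x2000, 0x8000, 0x9000, 0x4000];
    let segments = coalesce(&pages);
    assert_eq!(
        segments,
        vec![
            Segment { addr: 0x1000, len: 0x2000 },
            Segment { addr: 0x8000, len: 0x2000 },
            Segment { addr: 0x4000, len: 0x1000 },
        ]
    );
}
```

Only pages that are adjacent in both virtual order and physical address are merged, so the segment list still describes the region in the order the device should consume it.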
Any future calls to pin return another DmaPin object, but the underlying pin information (that is, the physical addresses) may be the same, even if the DmaRegion is dropped and recreated. However, if the DmaObject is dropped and recreated, the driver cannot rely on the pin to be consistent. More specifically, the pin's lifetime is tied to the DmaObject, not the DmaRegion.
The reason for this somewhat conservative approach to releasing pins is to reduce the likelihood of
memory corruption from accidental mis-programming. Another consideration for pinned memory lifetime
is that it can leak if the driver crashes. Allowing for leaks in this case is intentional, as it
makes it less likely that the device will stomp over memory in the case of a driver crash.
Pools
While we can use a DmaObject to perform DMA on an existing Twizzler object, it is common for a device driver to need a simple pool of DMA-able memory that it can allocate from so that it may communicate with the device (e.g. DMA memory for a ring buffer). For this, twizzler-driver provides a DmaPool type that can be used to allocate DMA regions that share an access type and a set of DmaOptions. The pool will internally create new Twizzler objects that it uses to allocate DMA memory from, which it then uses to create DMA regions on-demand.
Cargo metadata for Twizzler crates built inside xtask
The xtask program organizes the build into a series of "collections" that get built in different environments. There are:
- Tools (targets build system)
- Kernel (targets arch-machine-none)
- Userspace (targets arch-machine-twizzler, optional, default yes)
- Userspace-static (targets arch-machine-twizzler-minruntime, optional, default yes)
- Userspace-tests (targets arch-machine-twizzler-minruntime, optional, default no)
- Kernel-tests (targets arch-machine-none, optional, default no)
Programs may select which collection to be compiled in based on the metadata value set in Cargo.toml, described in more detail below.
Static versus non-static builds
Twizzler currently builds packages in two target_env settings: "minruntime" and "". This translates to two triples that are used for userspace Twizzler programs: arch-machine-twizzler and arch-machine-twizzler-minruntime. The minruntime variant is defined to be for statically linked programs, using the default minimal runtime provided by twizzler-abi. Such crates can declare that they should be compiled only in the minruntime collection by setting the key package.metadata.twizzler-build to "static" in Cargo.toml:
[package.metadata]
twizzler-build = "static"
Tools
Tools should be placed in the tools subdirectory, and should set the package.metadata.twizzler-build key to "tool" in Cargo.toml:
[package.metadata]
twizzler-build = "tool"
The kernel and xtask
The kernel and xtask themselves set the package.metadata.twizzler-build key to "kernel" and "xtask", respectively. Programs should not use these values.