Or: The Most Expensive Linked List I’ve Ever Written

Some of you already know the contents of this post, especially if you’ve written embedded or unsafe code in Rust. But I didn’t, so I thought it was useful to write down my experience as accurately as I can. Without further ado…

Last year, I wrote Photohash, software to help me index my NAS and find duplicate photos with rotation-independent hashing and perceptual hashing. To make use of cores and keep the disks busy, it distributes work to compute and IO workers. Work is distributed with channels – synchronized work queues.

In Photohash, work tends to be discovered and processed in batches: enumerating directories returns multiple entries and the database is updated in multi-row transactions.

Rust has a rich selection of channel implementations: std::sync::mpsc, futures::channel, tokio::sync, crossbeam::channel, flume, and kanal are high-quality options.

Unfortunately, none of them exactly met my needs, so I nerd-sniped myself into writing my dream channel. My previous day job (EdenFS and Watchman) was full of ad-hoc channels, so I knew roughly what I wanted. kanal is closest, but it is riddled with unsafe code and uses spinlocks, which look great in microbenchmarks but have no place in userspace software.

Introducing batch-channel, a throughput-optimized channel. The design goals are:

  • Multi-producer, multi-consumer. Parallelism in both production and consumption.
  • Sync and async support for both consumers and producers. Mix-and-match provides flexibility for use in any type of thread pool, async runtime, or FFI.
  • Bounded or unbounded. Bounded for backpressure and limiting peak memory consumption. Unbounded for situations where you cannot guarantee deadlock freedom.
  • Sending and receiving multiple values. I often want to send multiple values, like reading all of the paths in a directory, or writing multiple rows to a database. Batching allows amortizing per-batch costs. It’s silly to acquire the channel lock N times to push N values. It’s the same on the consumption side: workers may want to pull all pending work items in one channel read. You might wonder about lock-free queues, and they have their place, but you’ll still contend on the head and tail, and atomic operations remain slow even on modern Intel cores. If you’re going to contend on the queue anyway, batch-channel’s philosophy is to stick the whole thing behind a mutex and maximize batch sizes on both ends. (A sketch of the idea follows this list.)
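To make the lock-amortization point concrete, here is a minimal sketch of the idea using a plain Mutex<VecDeque<T>> – this is not batch-channel’s API, just the batching argument in code:

use std::collections::VecDeque;
use std::sync::Mutex;

// One lock acquisition per batch, not per item, on both ends.
fn send_batch<T>(q: &Mutex<VecDeque<T>>, items: impl IntoIterator<Item = T>) {
    let mut q = q.lock().unwrap(); // lock once...
    q.extend(items);               // ...push the whole batch
}

fn recv_batch<T>(q: &Mutex<VecDeque<T>>) -> Vec<T> {
    let mut q = q.lock().unwrap();
    q.drain(..).collect()          // drain every pending item in one go
}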

At the time this was written, the following design goals weren’t yet implemented:

  • Priorities. Senders can influence processing order.
  • Bounding variable-sized items. For example, being able to say a queue can hold up to 20 MB of paths, no matter how long they are.

And, finally, the design goal that led to this post:

  • No allocations under steady-state use. Allocations are a source of contention, failure, and overhead, especially when using slow system allocators.

The Shape of a Channel

To explain why unsafe Rust is even involved, let’s look at the implementation of a channel.

pub struct Channel<T> {
  q: VecDeque<T>,
  // Blocked receivers
  waiting_for_elements: Vec<Waker>,
  // Blocked senders
  waiting_for_capacity: Vec<Waker>,
  // Synchronous use requires some condition variables too.
}

When an async receiver blocks on recv() because the channel is empty, the task’s Waker is stored so that the channel knows to wake it when a value arrives.

Waker is a handle to a blocked task. The channel can signal the async runtime to wake a task when it should poll the channel again. It’s a wide pointer, two words in size.

Waker is used like this:

pub struct Recv<'a, T> {
  channel: &'a Channel<T>,
}

impl<'a, T> Future for Recv<'a, T> {
  type Output = T;
  
  fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
    // Is the queue empty?
    if let Some(element) = self.channel.q.pop_front() {
      // The queue has an element, so return it.
      Poll::Ready(element)
    } else {
      // Queue is empty so block and try again later.
      self.channel.waiting_for_elements.push(cx.waker().clone());
      Poll::Pending
    }
  }
}

Note: The above code is illustrative. In reality, the channel has a Mutex and some condition variables, but that’s incidental for this post.

If the queue is empty when recv() is called, the waker is stored in the channel and the task enters the blocked state.

Later, when a value is added to the queue, any waiting tasks are woken:

fn send<T>(channel: &mut Channel<T>, value: T) {
  channel.q.push_back(value);
  let wakers = mem::take(&mut channel.waiting_for_elements);
  // NOTE: Mutexes are released here, before waking.
  // Unless we are clever with cancellation, we have to wake all futures,
  // because we don't know which, if any, will attempt the next poll.
  for waker in wakers {
    waker.wake();
  }
}

Here’s the issue: waiting_for_elements is a Vec<Waker>. The channel can’t bound how many tasks might block, so a fixed-size array is out. Using a Vec means allocating every time a waker is queued, and that allocation is freed again every time we wake.

The result is that a naive implementation will allocate and free repeatedly under steady-state send and recv. That’s a lot of memory allocator traffic.

Can We Use an Intrusive List?

The optimization I want is, rather than allocating in a Vec every time a task is blocked on the queue, can we store the list of Wakers within the blocked futures themselves? We know that we only need to store as many Wakers as blocked futures, so that should work.

It should look something like:

pub struct Channel<T> {
  q: VecDeque<T>,
  // Intrusive doubly-linked list head.
  waiting_for_elements: WakerList,
}

fn send<T>(channel: &mut Channel<T>, value: T) {
  channel.q.push_back(value);
  let wakers = channel.waiting_for_elements.extract_list();
  // Release any mutex before waking.
  for waker in wakers {
    waker.wake();
  }
}

pub struct Recv<'a, T> {
  channel: &'a Channel<T>,
  // Every Future gets a WakerSlot, which is an intrusive doubly-linked
  // list node protected by Channel's mutex.
  waker: WakerSlot,
}

impl<'a, T> Future for Recv<'a, T> {
  type Output = T;
  
  fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
    if let Some(element) = self.channel.q.pop_front() {
      Poll::Ready(element)
    } else {
      // Queue is empty so try again later.
      // Store the Waker in this Future by linking it into channel's list.
      self.channel.waiting_for_elements.link(&mut self.waker, cx.waker().clone());
      Poll::Pending
    }
  }
}

And that’s about the limit of my fake illustrative code. It’s time to get into details. How do we express an intrusive linked list in Rust?

Intrusive List Crates

This isn’t a new idea. I looked for existing crates:

  • intrusive-collections is popular, but list nodes must outlive the list itself. In my case, the future will never outlive the channel.
  • futures-intrusive is a nice-looking crate that performs the same optimization, but does not meet my design goals.
  • multilist is Patrick Walton’s pre-1.0 experiment. Interesting idea, but it allocates nodes on the heap.

There are two other production examples of this approach:

  • lilos-list is part of Cliff Biffle’s embedded operating system lilos. It stores wakers with an intrusive list. It’s close to what I wanted, but embedded OS code tends to have its own concurrency model. In particular, it would take some work to integrate it with standard library mutexes, and it calls wakers while locks are held, which is a bad idea. On the other hand, since there are no threads in lilos, it can avoid implementing Send and use std::cell::Cell to temporarily perform mutations on otherwise shared references.
  • tokio stores its channel wakers in an intrusive linked list too. Its implementation has a surprising amount of code, but it’s closest to what I want. The point is moot: it’s an implementation detail not visible outside of Tokio. (See Alice Rhyl’s musings on the challenges of intrusive structures in Rust.)

Pinning

It’s time to write a crate.

I want the channel to store a WakerList and each future to have a WakerSlot member. Slots can be linked into the list and unlinked either on wake or future cancellation.

WakerList and WakerSlot form a self-referential data structure. Self-referential data structures are a well-known challenge in Rust. They require unsafe code.

In C++, this is relatively easy. You delete the move and copy operations and fix up the link pointers as appropriate.

So, at this point in my Rust journey, still thinking in C++, I assume “easy!”

I just need to disable movement with !Unpin (actually PhantomPinned) and ensure all methods take Pin<&mut WakerList> and Pin<&mut WakerSlot>.

Once you observe a Pin<&mut T>, you can assume T will never move again. I’m not going to discuss Pin in depth – Jon Gjengset has an excellent video describing its rationale and usage.

Here’s where things start to get hard. Pinning was added well after Rust 1.0 was stabilized. The language pervasively assumes values of any type can be moved with a memcpy, so writing a data structure that violates that assumption makes the public APIs themselves awkward.

Here’s what I tried first:

struct WakerSlot(...);
struct WakerList(...);

impl WakerList {
  fn link<'list : 'slot, 'slot>(
    self: Pin<&'list mut WakerList>,
    slot: Pin<&'slot mut WakerSlot>,
  )
}

My thought was that the act of linking should constrain the “pinnedness” and the lifetimes: the list must outlive the slot. Alas, lifetimes don’t work like that. A function signature cannot constrain the actual lifetimes of the values passed to it. The borrow checker will happily shrink 'list and 'slot until the declared bound is satisfied, so the definition of link above has no practical effect.

The idea that lifetimes precisely match the lives of actual values is apparently a common misconception, and it resulted in some puzzling error messages.
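Here is a minimal demonstration of the misconception, with toy types standing in for the real ones: the 'list: 'slot bound constrains only the borrows created for the call, so this compiles even though the “list” value dies before the “slot” value.

fn link<'list: 'slot, 'slot>(_list: &'list mut u32, _slot: &'slot mut u32) {}

fn main() {
    let mut slot = 0;
    {
        let mut list = 0;
        // Both borrows end at this call, so the bound is trivially satisfied.
        // Nothing forces the *value* `list` to outlive the *value* `slot`.
        link(&mut list, &mut slot);
    } // `list` is dropped here while `slot` lives on.
    println!("{slot}");
}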

(Writing a post like this after the fact feels weird because “of course it doesn’t work that way” but I’m faithfully documenting my learning process.)

Could WakerSlot itself carry a lifetime parameter tying it to the list it belongs to?

struct WakerList(...);
struct WakerSlot<'list>(...);

impl WakerSlot<'_> {
  fn new(list: &WakerList) -> WakerSlot<'_>;
}

This doesn’t work either. If WakerSlot holds a reference to its WakerList, then you can never create a &mut WakerList anywhere else, because Rust’s core borrowing rule is that a value may have one mutable reference or many shared references, but never both at once.
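A tiny example of the conflict, again with toy types:

struct WakerList;
struct WakerSlot<'list> {
    list: &'list WakerList,
}

fn demo() {
    let mut list = WakerList;
    let slot = WakerSlot { list: &list };   // shared borrow of `list`
    // let excl = &mut list;                // ERROR while `slot` is alive:
    // cannot borrow `list` as mutable because it is also borrowed as immutable
    let _ = slot.list;
    let _ = &mut list;                      // fine only after `slot`'s borrow ends
}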

I’m hoping this is possible and a reader leaves me a note.

When Panicking Isn’t Safe Enough

Conceptually, link and unlink operations take mutable references to both the list and slot. But I never found a way to satisfy all of the rules in the type system:

  • WakerSlot’s lifetime parameter does not outlive its list.
  • Only one &mut reference at a time.
  • Never &mut and & simultaneously.

Here, I gave up on trying to express the rules in the type system and chose to assert at runtime. The runtime lifetime rules are:

  • WakerList must be empty when dropped. Otherwise, slots would have pointers to invalid memory.
  • WakerSlot must be unlinked when dropped. Otherwise, the list would reference deallocated memory.

Reporting these invariant violations with panic is not sufficient. Panics can be caught, but the program would remain in a state where safe code can access dangling pointers and cause undefined behavior (UB).

Therefore, when an invariant is violated, the program must abort.

But I can’t just call abort: I want this utility crate to be no_std, so it’s up to the calling program to decide how it aborts.

The simplest solution I found was to panic from an extern "C" function and let Rust translate that to an abort.

#[allow(non_snake_case)]
#[inline(never)]
#[cold]
extern "C" fn MUST_UNLINK_WakerSlot_BEFORE_DROP() -> ! {
    // panic! from extern "C" is an abort with an error message.
    panic!("Must unlink WakerSlot before drop")
    // Another option, at the cost of a tiny, stable, dependency, is
    // the `abort` crate.
    //abort::abort()
}
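For context, here is roughly how such a guard gets wired into Drop, building on the function above. The boolean `linked` flag and this stripped-down WakerSlot are illustrative, not wakerset’s real layout:

struct WakerSlot {
    linked: bool, // illustrative; the real crate checks its link pointers
}

impl Drop for WakerSlot {
    fn drop(&mut self) {
        if self.linked {
            // Unwinding would let safe code observe dangling list pointers,
            // so the panic-through-extern-"C" trick above aborts instead.
            MUST_UNLINK_WakerSlot_BEFORE_DROP();
        }
    }
}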

Structural Pinning

I elided this detail in the example code above, but WakerList is intended to be accessed behind a mutex. However, neither std::sync::Mutex nor parking_lot::Mutex supports structural pinning. That is, lock() goes from &Mutex<T> to &mut T, allowing T to be moved.

I needed a safe API for getting Pin<&mut T> from Pin<&Mutex<T>>.
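The core trick looks something like this – a sketch of the idea, not pinned-mutex’s actual implementation:

use std::pin::Pin;
use std::sync::{Mutex, MutexGuard};

// If the Mutex itself is pinned and the value is never moved out of the
// guard, handing back a pinned guard is sound; callers then use .as_mut()
// to get Pin<&mut T>.
fn lock_pinned<T>(mutex: Pin<&Mutex<T>>) -> Pin<MutexGuard<'_, T>> {
    // Safety: the Mutex is pinned and we never move T out of the guard.
    unsafe { Pin::new_unchecked(mutex.get_ref().lock().unwrap()) }
}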

So I wrote the pinned-mutex crate which provides structurally-pinned Mutex, MutexGuard, and Condvar wrappers.

Note that there is a pinarcmutex crate that offers a PinArcMutex<T> type roughly equivalent to Pin<Arc<Mutex<T>>> except with structural pinning. But it allocates and you can’t drop in parking_lot’s mutex, which is faster and lighter than the standard library’s.

We can imagine a future Rust version where pinning is more natural and has pervasive (or implicit) standard library support.

Pinning Ergonomics

Boats recently wrote a nice overview of why Pin is shaped the way it is and why it is painful to use.

And the internet is full of threads like “Pin Suffering Continues”.

If you want to use pinned APIs safely in your own code, you will need to depend on a pin-projection crate like pin-project or pin-project-lite. (And don’t forget pinned_drop!)
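For reference, here is what basic projection looks like with pin-project, using toy types rather than batch-channel’s actual futures:

use pin_project::pin_project;
use std::marker::PhantomPinned;
use std::pin::Pin;

struct Slot {
    _pinned: PhantomPinned,
}

#[pin_project]
struct Example {
    #[pin]
    slot: Slot,   // structurally pinned: projects to Pin<&mut Slot>
    count: usize, // not pinned: projects to &mut usize
}

impl Example {
    fn poke(self: Pin<&mut Self>) {
        let this = self.project();
        let _slot: Pin<&mut Slot> = this.slot;
        *this.count += 1;
    }
}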

It works fine, but you end up with code that looks like the following.

let mut wakers = state.as_mut().project().base.project().rx_wakers.extract_some_wakers();
while wakers.wake_all() {
    let mut state = self.project_ref().state.lock();
    wakers.extract_more(state.as_mut().base().project().rx_wakers);
}

Writing code this way is miserable. The compiler will guide you, but my mind was shouting “you know what I want, just do it” the whole time.

It Works!

Deep breath. Tests passed. WakerList and WakerSlot provide the interface I wanted, so I published them in a wakerset crate. It offers a safe, intrusive, no_std list of Wakers.

With it, I could remove the main source of steady-state allocations in batch-channel.

It was time to polish it up and ensure I didn’t miss anything – and this blog post is only half done.

Undefined Behavior, Sanitizers, and MIRI

One of my original goals for batch-channel was to avoid unsafe code.

Unfortunately, two optimizations required it. Besides the intrusive list described above, the MPMC channel objects themselves are managed with two reference counts, one for senders and one for receivers. A split reference count is required: when one half of the channel is dropped, the channel is closed.
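Here is the flavor of a split reference count – an illustrative sketch, not splitrc’s actual implementation:

use std::sync::atomic::{AtomicUsize, Ordering};

// The channel is "closed" as soon as either side's count hits zero; the
// allocation itself is freed only when both are zero.
struct Counts {
    senders: AtomicUsize,
    receivers: AtomicUsize,
}

impl Counts {
    // Returns true if this was the last sender; receivers should then
    // observe the channel as closed once the queue drains.
    fn drop_sender(&self) -> bool {
        self.senders.fetch_sub(1, Ordering::AcqRel) == 1
    }

    fn drop_receiver(&self) -> bool {
        self.receivers.fetch_sub(1, Ordering::AcqRel) == 1
    }
}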

To simplify auditing, I placed all of this new unsafe code behind safe APIs in separate crates. They are:

  • wakerset: the intrusive list of Wakers described above.
  • splitrc: the split reference count for the channel object.

Safe Rust, excepting compiler bugs and incorrectly-designed unsound APIs, has no undefined behavior. Unsafe Rust, on the other hand, removes the guardrails and opens a buffet of possible UB.

There are three ways you can deal with potential undefined behavior. In increasing order of happiness over time:

  • Hope for the best and deal with potential bugs when they come up.
  • Think carefully and convince yourself the code is correct.
  • Automated sanitizers.

Fortunately, Rust supports sanitizers that detect various types of undefined behavior. The two most useful are ASan and TSan. C++ programmers are quite familiar with them at this point, and I consider them table stakes for any C or C++ project.

But Rust has one even better: MIRI. It catches violations of the Rust aliasing model, which is pickier than ASan. Satisfying MIRI is where I spent most of my time.
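If you haven’t run them before, the invocations look roughly like this (nightly toolchain required; exact flags vary by target):

# AddressSanitizer and ThreadSanitizer
RUSTFLAGS=-Zsanitizer=address cargo +nightly test --target x86_64-unknown-linux-gnu
RUSTFLAGS=-Zsanitizer=thread cargo +nightly test --target x86_64-unknown-linux-gnu
# MIRI (after: rustup +nightly component add miri)
cargo +nightly miri test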

Rust Aliasing Model

The first time I ran MIRI, it failed:

test link_and_notify_all ...
error: Undefined Behavior: trying to retag from <254318> for SharedReadWrite permission at alloc88289[0x10], but that tag does not exist in the borrow stack for this location
...
trying to retag from <254318> for SharedReadWrite permission at alloc88289[0x10], but that tag does not exist in the borrow stack for this location
...
help: this indicates a potential bug in the program: it performed an invalid operation, but the Stacked Borrows rules it violated are still experimental
help: see https://github.com/rust-lang/unsafe-code-guidelines/blob/master/wip/stacked-borrows.md for further information

Stacked borrows? What’s all this?

Here, I realized my first mistake: I dove straight into unsafe Rust when I should have read more in advance.

My other mistake was “thinking in C”. Being deeply familiar with C’s semantics wasn’t helpful here. In hindsight, it feels foolish to have assumed Rust’s aliasing model was similar to C’s. In C, merely holding a pointer to a value usually carries no meaning; what you reason about are the actual pointer dereferences.

For example, in C, this is perfectly legal if p == q:

#include <stdio.h>

void foo(int* p, const int* q) {
  printf("%d\n", *q);
  *p = 456;
  printf("%d\n", *q); // if p == q, prints 456
  *(int*)q = 789;
  printf("%d\n", *p); // if p == q, prints 789
}

const is “meaningless” in that it does not prevent the pointee from changing, so the compiler can’t optimize based on it. The equivalent Rust is a different story:

fn foo(a: &u32, b: &mut u32) {
    println!("{a}");
    *b = 123;
    println!("{a}"); // always prints the same value as above
}

On the other hand, Rust’s primary aliasing rule is that, at any point, an object may have a unique &mut reference to it or any number of shared & references, but never both.

The optimizer will take advantage of that: in the Rust example above, a is not reloaded from memory, because the write through b cannot alias it.

The aliasing rules in Rust are not fully defined. That’s part of what makes this hard. You have to write code assuming the most pessimal aliasing model.

Under the most pessimal aliasing rules, you have to assume taking a &mut reference to a value immediately writes to it and continues to write to it as long as the reference lives. And if you have a shared reference, you have to assume the value is read at arbitrary times as long as any reference is held.

Box::leak

Let’s start with the first confusing example I ran into:

let p = Box::leak(Box::new(MyThing::new())) as *mut MyThing;
// later:
let r: &MyThing = unsafe { &*p };
r.method();

MIRI failed, complaining that p was formed from the perpetual &mut returned by Box::leak and therefore it’s UB to create a shared &MyThing reference to it at any point thenceforth.

The fix was to allocate without forming a reference:

let p = Box::into_raw(Box::new(MyThing::new()));
// later:
let r: &MyThing = unsafe { &*p };
r.method();

Note: This may have been a MIRI bug, or the rules have since been relaxed, because I can no longer reproduce it as of nightly-2024-06-12. Here’s where the memory model and aliasing rules not being defined caused some pain: when MIRI fails, it’s unclear whether it’s my fault or not. For example, given that the &mut was immediately turned into a pointer, does the &mut reference still exist? There are multiple valid interpretations of the rules.

Box::from_raw

OK, if you allocate some memory with Box::into_raw, you’d expect to deallocate with Box::from_raw, right? That failed MIRI too.

I ended up having to write:

unsafe {
    let ptr = ptr.as_ptr();
    std::ptr::drop_in_place(ptr);
    std::alloc::dealloc(
        ptr as *mut u8,
        std::alloc::Layout::new::<Inner<T>>());
}

Note: This may have also been a MIRI bug. It is no longer reproducible. I changed splitrc to use Box::into_raw and Box::from_raw and it passes MIRI. I enabled MIRI in my CI so we’ll see if it breaks again going forward.
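For completeness, here is the shape splitrc ended up with – roughly, paired Box::into_raw and Box::from_raw so the allocation’s provenance stays visible to MIRI. Inner is a stand-in type here:

struct Inner {
    value: u32,
}

fn main() {
    // Allocate without ever holding a long-lived reference.
    let ptr: *mut Inner = Box::into_raw(Box::new(Inner { value: 7 }));
    // ... raw pointers are handed out to senders and receivers ...
    unsafe {
        // When the last reference goes away, reconstitute the Box to free it.
        drop(Box::from_raw(ptr));
    }
}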

Linkage, References, and Interior Mutability

That’s channel allocation and deallocation covered. Now let’s look at the intrusive pointers in wakerset.

In a linked list, every node has a linkage struct.

struct Pointers {
    next: *mut Pointers,
    prev: *mut Pointers,
    pinned: PhantomPinned,
}

struct WakerList {
    pointers: Pointers,
}

struct WakerSlot {
    pointers: Pointers,
    // Required to assert that this slot is never
    // unlinked from an unrelated list.
    owner: *mut WakerList,
    waker: Option<Waker>,
}

Now imagine two threads. Thread A holds a &mut WakerList with the intent to extract pending Wakers. Thread B happens to hold a &WakerSlot at the same time.

It is UB for code traversing the pointers to form a &mut WakerSlot while thread B holds its &WakerSlot, and it is equally UB to form even a &WakerSlot while any thread holds a &mut WakerSlot, because both violate Rust’s aliasing rules. A &mut reference must always be exclusive, even if it is never dereferenced. This is the important difference from C.

Because Rust reorders reads and writes based on its aliasing rules, you must never convert a pointer into a reference unless you know that nobody else has a conflicting reference.

We need to prevent the compiler from optimizing a &WakerSlot into early reads of the pointers and waker fields.

UnsafeCell is the tool to reach for. It introduces a “mutability barrier”, and UnsafeCell<Pointers> tells Rust not to cache reads. We are responsible for ensuring we won’t violate Rust’s aliasing rules when accessing Pointers’s fields.

struct WakerList {
    pointers: Pointers,
}

struct WakerSlot {
    // UnsafeCell: written by WakerList independent of
    // WakerSlot references
    pointers: UnsafeCell<Pointers>,
    // Required to assert that this slot is never
    // unlinked from an unrelated list.
    owner: *mut WakerList,
    waker: Option<Waker>,
}

A circular linked list means that only a slot reference is required to mutate the list, so I needed to enforce the guarantee that only one thread may access the UnsafeCells at a time.

I did this with an important, if subtle, API guarantee: all link and unlink operations take &mut WakerList. If &mut WakerSlot was sufficient to unlink, it could violate thread safety if WakerList was behind a mutex. (This also means that WakerList does not require an UnsafeCell<Pointers>.)

The pinned-aliasable crate solves a related problem: how do we define self-referential data structures with mutable references that do not miscompile? Read the motivation in the crate’s doc comments. It’s a situation required by async futures, which are self-referential and thus pinned, but have no desugaring. See the open Rust Issue #63818.

Avoiding References Entirely

As mentioned, when traversing a linked list, it’s easy to form conflicting &mut references to nodes. Consider this slightly contrived unlink example:

// (Within an unsafe block.)
let p = &mut slot.pointers as *mut Pointers;
let next: &mut Pointers = &mut *(*p).next;
let prev: &mut Pointers = &mut *(*p).prev;
next.prev = prev as *mut Pointers;
prev.next = next as *mut Pointers;
(*p).next = ptr::null_mut();
(*p).prev = ptr::null_mut();

If slot is the only slot in the list, then next and prev both point to the WakerList and we’ve now formed two mut references to the same value, which is UB.

In this particular case, we could ensure every pointer dereference occurs as a temporary. That limits the scope of each reference to each line.

let p = &mut slot.pointers as *mut Pointers;
let next = (*p).next;
let prev = (*p).prev;
(*next).prev = prev;
(*prev).next = next;
(*p).next = ptr::null_mut();
(*p).prev = ptr::null_mut();

But I just don’t trust myself to ensure, under all code paths, that I never have two references overlap, violating Rust’s aliasing rules.

It’s kind of miserable, but the safest approach is to avoid creating references altogether and operate purely in the domain of pointer reads, writes, and offsets.

let nextp = addr_of_mut!((*node).next);
let prevp = addr_of_mut!((*node).prev);
let next = nextp.read();
let prev = prevp.read();
addr_of_mut!((*prev).next).write(next);
addr_of_mut!((*next).prev).write(prev);
nextp.write(ptr::null_mut());
prevp.write(ptr::null_mut());

(So much syntax. Makes you appreciate C.)

addr_of_mut! is key: it computes a pointer to a place expression without forming a reference. There are gotchas: you can still accidentally form a reference within an addr_of_mut! argument. Read the documentation.

Note: As I publish this, Rust 1.82 introduces new syntax that allows replacing addr_of_mut! with &raw mut and addr_of! with &raw const. It’s not yet clear to me how much this prevents accidental reference creation.

Despite the noise, to be safe, I ended up converting all of WakerSet to pointer reads and writes. It’s not greppable: the code looks like it’s still full of pointer dereferences, but they’re within addr_of_mut! and the place expressions have the right shape.

I think it was Patrick Walton who once proposed a sugar for unsafe Rust and pointers with a hypothetical -> operator. It would be convenient and easier on the eyes.

Until the Rust memory model stabilizes further and the aliasing rules are well-defined, your best option is to integrate ASan, TSan, and MIRI (both stacked borrows and tree borrows) into your continuous integration for any project that contains unsafe code.

If your project is safe Rust but depends on a crate which makes heavy use of unsafe code, you should probably still enable sanitizers. I didn’t discover all UB in wakerset until it was integrated into batch-channel.

MIRI: Stacked Borrows and Tree Borrows

MIRI supports two aliasing models: stacked borrows and tree borrows.

I won’t attempt to describe them. They are different approaches with the same goal: formalize and validate the Rust memory model. Ralf Jung, Neven Villani, and all are doing amazing work. Without MIRI, it would be hard to trust unsafe Rust.

I decided to run both stacked and tree borrows and haven’t hit any false positives so far.
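Running both is just a flag – stacked borrows is MIRI’s default, tree borrows is opt-in:

cargo +nightly miri test
MIRIFLAGS=-Zmiri-tree-borrows cargo +nightly miri test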

Active Research in Self-Referential Structures

This topic is an active work in progress. I hope this blog post is obsolete in two years.

Self-referential and pinned data structures are something of a hot topic right now. The Rust-for-Linux project, doing the kinds of things systems programs do, has Pin ergonomics and self-referential data structures near the top of their wishlist.

In particular, the pinned initialization problem: the Linux kernel has self-referential data structures, and it is currently hard to initialize them with safe code. In wakerset, I sidestepped this problem, at the cost of a small amount of runtime inefficiency, by giving WakerList two empty states: one that is moveable and one that is pinned. The former converts to the latter the first time it is used after pinning.
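Here is a sketch of the two-empty-states trick with a toy list head – illustrative, not wakerset’s actual code:

use std::marker::PhantomPinned;
use std::pin::Pin;
use std::ptr;

struct ListHead {
    next: *mut ListHead,
    prev: *mut ListHead,
    _pinned: PhantomPinned,
}

impl ListHead {
    // Moveable empty state: no self-referential pointers yet, so this can
    // be constructed and returned by value.
    const fn new() -> Self {
        ListHead {
            next: ptr::null_mut(),
            prev: ptr::null_mut(),
            _pinned: PhantomPinned,
        }
    }

    // Pinned empty state: the head points at itself. Only done after
    // pinning, so the self-pointers can never be invalidated by a move.
    fn ensure_circular(self: Pin<&mut Self>) {
        // Safety: we only write the link pointers; we never move the value.
        let this = unsafe { self.get_unchecked_mut() };
        if this.next.is_null() {
            let p: *mut ListHead = this;
            this.next = p;
            this.prev = p;
        }
    }
}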

y86-dev has a great blog post proposing Safe Pinned Initialization.

Conclusions

As much as this post might come across as a gripefest, I still think Rust is great. In particular, its composable safety. The result of my pain is a safe, efficient API. You can use wakerset without any risk of undefined behavior.

What I learned from this experience:

  • Be extra careful with any use of unsafe.
  • References, even if never dereferenced, are more dangerous in Rust than pointers are in C.
  • Pinning syntax is awful, but it feels like Rust could solve this someday.
  • UnsafeCell is required for intrusive structures.
  • I don’t know how to statically constrain lifetime relationships with intrusive structures, but maybe it’s possible? Avoiding the need for runtime assertions would be nice.
  • MIRI, especially under multithreaded stress tests, is critical.
  • Putting this in words was as hard as writing the code.