Unsafe Rust Is Harder Than C
Or: The Most Expensive Linked List I’ve Ever Written
Some of you already know the contents of this post, especially if you’ve written embedded or unsafe code in Rust. But I didn’t, so I thought it would be useful to write down my experience as accurately as I can. Without further ado…
Last year, I wrote Photohash, software to help me index my NAS and find duplicate photos with rotation-independent hashing and perceptual hashing. To make use of cores and keep the disks busy, it distributes work to compute and IO workers. Work is distributed with channels – synchronized work queues.
In Photohash, work tends to be discovered and processed in batches: enumerating directories returns multiple entries and the database is updated in multi-row transactions.
Rust has a rich selection of channel implementations: std::sync::mpsc, futures::channel, tokio::sync, crossbeam::channel, flume, and kanal are high-quality options.
Unfortunately, none of them exactly met my needs, so I nerd-sniped myself into writing my dream channel. My previous day job (EdenFS and Watchman) was full of ad-hoc channels, so I knew roughly what I wanted. kanal is closest, but it is riddled with unsafe code and uses spinlocks, which look great in microbenchmarks but have no place in userspace software.
Introducing batch-channel, a throughput-optimized channel. The design goals are:
- Multi-producer, multi-consumer. Parallelism in both production and consumption.
- Sync and async support for both consumers and producers. Mix-and-match provides flexibility for use in any type of thread pool, async runtime, or FFI.
- Bounded or unbounded. Bounded for backpressure and limiting peak memory consumption. Unbounded for situations where you cannot guarantee deadlock freedom.
- Sending and receiving multiple values. I often want to send multiple values, like reading all of the paths in a directory, or writing multiple rows to a database. Batching allows amortizing per-batch costs. It’s silly to acquire the channel lock N times to push N values. It’s the same on the consumption side: workers may want to pull all pending work items in one channel read. You might wonder about lock-free queues, and they have their place, but you’ll still contend on the head and tail, and atomic operations remain slow even on modern Intel cores. If you’re going to contend on the queue anyway, batch-channel’s philosophy is to stick the whole thing behind a mutex and maximize batch sizes on both ends.
At the time this was written, the following design goals weren’t yet implemented:
- Priorities. Senders can influence processing order.
- Bounding variable-sized items. For example, being able to say a queue can hold up to 20 MB of paths, no matter how long they are.
And, finally, the design goal that led to this post:
- No allocations under steady-state use. Allocations are a source of contention, failure, and overhead, especially when using slow system allocators.
The Shape of a Channel
To explain why unsafe Rust is even involved, let’s look at the implementation of a channel.
pub struct Channel<T> {
q: VecDeque<T>,
// Blocked receivers
waiting_for_elements: Vec<Waker>,
// Blocked senders
waiting_for_capacity: Vec<Waker>,
// Synchronous use requires some condition variables too.
}
When an async receiver blocks on recv() because the channel is empty, the task’s Waker is stored so that the channel knows to wake it when a value arrives.
A Waker is a handle to a blocked task. Through it, the channel can signal the async runtime to wake the task when it should poll the channel again. It’s a wide pointer, two words in size.
Waker is used like this:
pub struct Recv<'a, T> {
channel: &'a Channel<T>,
}
impl<'a, T> Future for Recv<'a, T> {
type Output = T;
fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
// Is the queue empty?
if let Some(element) = self.channel.q.pop_front() {
// The queue has an element, so return it.
Poll::Ready(element)
} else {
// Queue is empty so block and try again later.
self.channel.waiting_for_elements.push(cx.waker().clone());
Poll::Pending
}
}
}
Note: The above code is illustrative. In reality, the channel has a Mutex and some condition variables, but that’s incidental for this post.
If the queue is empty when recv() is called, the waker is stored in the channel and the task enters the blocked state.
Later, when a value is added to the queue, any waiting tasks are woken:
fn send<T>(channel: &mut Channel<T>, value: T) {
channel.q.push_back(value);
let wakers = mem::take(&mut channel.waiting_for_elements);
// NOTE: Mutexes are released here, before waking.
// Unless we are clever with cancellation, we have to wake all futures,
// because we don't know which, if any, will attempt the next poll.
for waker in wakers {
waker.wake();
}
}
Here’s the issue: waiting_for_elements is a Vec<Waker>. The channel cannot know how many tasks are blocked, so we can’t use a fixed-size array. Using a Vec means we allocate memory every time we queue a waker, and that allocation is taken and released every time we have to wake.
The result is that a naive implementation will allocate and free repeatedly under steady-state send and recv. That’s a lot of memory allocator traffic.
Can We Use an Intrusive List?
The optimization I want: rather than allocating in a Vec every time a task blocks on the queue, can we store the list of Wakers within the blocked futures themselves? We only ever need as many Wakers as there are blocked futures, so that should work.
It should look something like:
pub struct Channel<T> {
q: VecDeque<T>,
// Intrusive doubly-linked list head.
waiting_for_elements: WakerList,
}
fn send<T>(channel: &Channel<T>, value: T) {
channel.q.push_back(value);
let wakers = channel.waiting_for_elements.extract_list();
// Release any mutex before waking.
for waker in wakers {
waker.wake();
}
}
pub struct Recv<'a, T> {
channel: &'a Channel<T>,
// Every Future gets a WakerSlot, which is an intrusive doubly-linked
// list node protected by Channel's mutex.
waker: WakerSlot,
}
impl<'a, T> Future for Recv<'a, T> {
type Output = T;
fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
if let Some(element) = self.channel.q.pop_front() {
Poll::Ready(element)
} else {
// Queue is empty so try again later.
// Store the Waker in this Future by linking it into channel's list.
self.channel.waiting_for_elements.link(&mut self.waker, cx.waker().clone());
Poll::Pending
}
}
}
And that’s about the limit of my fake illustrative code. It’s time to get into details. How do we express an intrusive linked list in Rust?
Intrusive List Crates
This isn’t a new idea. I looked for existing crates:
- intrusive-collections is popular, but list nodes must outlive the list itself. In my case, the future will never outlive the channel.
- futures-intrusive is a nice-looking crate that performs the same optimization, but does not meet my design goals.
- multilist is Patrick Walton’s pre-1.0 experiment. Interesting idea, but it allocates nodes on the heap.
There are two other production examples of this approach:
- lilos-list is part of Cliff Biffle’s embedded operating system lilos. It stores wakers with an intrusive list. It’s close to what I wanted, but embedded OS code tends to have its own concurrency model. In particular, it would take some work to integrate it with standard library mutexes, and it calls wakers while locks are held, which is a bad idea. On the other hand, since there are no threads in lilos, it can avoid implementing Send and use std::cell::Cell to temporarily perform mutations on otherwise shared references.
- tokio stores its channel wakers in an intrusive linked list too. Its implementation has a surprising amount of code, but it’s closest to what I want. The point is moot: it’s an implementation detail not visible outside of Tokio. (See Alice Ryhl’s musings on the challenges of intrusive structures in Rust.)
Pinning
It’s time to write a crate.
I want the channel to store a WakerList and each future to hold a WakerSlot member. Slots can be linked into the list and unlinked either on wake or on future cancellation.
WakerList and WakerSlot form a self-referential data structure. Self-referential data structures are a well-known challenge in Rust. They require unsafe code.
In C++, this is relatively easy. You delete the move and copy operations and fix up the link pointers as appropriate.
So, at this point in my Rust journey, still thinking in C++, I assume “easy!”
I just need to disable movement with !Unpin (actually PhantomPinned) and ensure all methods take Pin<&mut WakerList> and Pin<&mut WakerSlot>.
Once you observe a Pin<&mut T>, you can assume T will never move again. I’m not going to discuss Pin in depth – Jon Gjengset has an excellent video describing its rationale and usage.
Here’s where things start to get hard. Pinning was added well after Rust 1.0 was stabilized. The language pervasively assumes values of any type can be moved with a memcpy, so writing a data structure that violates that assumption makes the public APIs themselves awkward.
Here’s what I tried first:
struct WakerSlot(...);
struct WakerList(...);

impl WakerList {
    fn link<'list: 'slot, 'slot>(
        self: Pin<&'list mut WakerList>,
        slot: Pin<&'slot mut WakerSlot>,
    );
}
My thought was that the act of linking should constrain the “pinnedness” and the lifetimes: the list must outlive the slot. Alas, lifetimes don’t work like that. A function call cannot constrain the actual lifetimes of its parameters. The borrow checker will happily shrink 'list and 'slot until it can prove whether the constraints are satisfiable. The result is that link’s definition above has no effect.
The idea that lifetimes precisely match the lives of actual values is apparently a common misconception, and it resulted in some puzzling error messages.
(Writing a post like this after the fact feels weird because “of course it doesn’t work that way” but I’m faithfully documenting my learning process.)
Can WakerSlot itself carry a lifetime parameter tying it to the list it references?
struct WakerList(...);
struct WakerSlot<'list>(...);
impl WakerSlot<'_> {
fn new(list: &WakerList) -> WakerSlot<'_>;
}
This doesn’t work. If WakerSlot holds a reference to the WakerList, then you can never create a &mut WakerList elsewhere, because Rust’s core aliasing rule is that you can have either one mutable reference or many shared references, but never both.
I’m hoping this is possible and a reader leaves me a note.
When Panicking Isn’t Safe Enough
Conceptually, link and unlink operations take mutable references to both the list and the slot. But I never found a way to satisfy all of the rules in the type system:
- WakerSlot’s lifetime parameter does not outlive its list.
- Only one &mut reference at a time.
- Never &mut and & simultaneously.
Here, I gave up on trying to express the rules in the type system and chose to assert at runtime. The runtime lifetime rules are:
- WakerList must be empty when dropped. Otherwise, slots would have pointers to invalid memory.
- WakerSlot must be unlinked when dropped. Otherwise, the list references deallocated memory.
Reporting these invariant violations with panic is not sufficient. Panics can be caught, but the program would remain in a state where safe code can access dangling pointers and cause undefined behavior (UB).
Therefore, when an invariant is violated, the program must abort.
But I can’t just call abort: I want this utility crate to be no_std, so it’s up to the calling program to decide how it aborts. The simplest solution I found was to panic from an extern "C" function and let Rust translate that to an abort.
#[allow(non_snake_case)]
#[inline(never)]
#[cold]
extern "C" fn MUST_UNLINK_WakerSlot_BEFORE_DROP() -> ! {
// panic! from extern "C" is an abort with an error message.
panic!("Must unlink WakerSlot before drop")
// Another option, at the cost of a tiny, stable, dependency, is
// the `abort` crate.
//abort::abort()
}
Structural Pinning
I elided this detail in the example code above, but WakerList is intended to be accessed behind a mutex. However, neither std::sync::Mutex nor parking_lot::Mutex supports structural pinning. That is, lock() maps &Mutex<T> to &mut T, allowing T to be moved.
I needed a safe API for getting Pin<&mut T> from Pin<&Mutex<T>>, so I wrote the pinned-mutex crate, which provides structurally-pinned Mutex, MutexGuard, and Condvar wrappers.
Note that there is a pinarcmutex crate that offers a PinArcMutex<T> type roughly equivalent to Pin<Arc<Mutex<T>>>, except with structural pinning. But it allocates, and you can’t drop in parking_lot’s mutex, which is faster and lighter than the standard library’s.
We can imagine a future Rust version where pinning is more natural and has pervasive (or implicit) standard library support.
Pinning Ergonomics
Boats recently wrote a nice overview of why Pin is shaped the way it is and why it is painful to use.
And the internet is full of threads like “Pin Suffering Continues”.
If you want to use pinned APIs safely in your own code, you will need to depend on a pin-projection crate like pin-project or pin-project-lite. (And don’t forget pinned_drop!)
It works fine, but you end up with code that looks like the following.
let mut wakers = state.as_mut().project().base.project().rx_wakers.extract_some_wakers();
while wakers.wake_all() {
let mut state = self.project_ref().state.lock();
wakers.extract_more(state.as_mut().base().project().rx_wakers);
}
Writing code this way is miserable. The compiler will guide you, but my mind was shouting “you know what I want, just do it” the whole time.
It Works!
Deep breath. Tests passed. WakerList and WakerSlot provide the interface I wanted, so I published them in a wakerset crate. It offers a safe, intrusive, no_std list of Wakers. With it, I could remove the main source of steady-state allocations in batch-channel.
It was time to polish it up and ensure I didn’t miss anything. And yet, this blog post is only half-done.
Undefined Behavior, Sanitizers, and MIRI
One of my original goals for batch-channel was to avoid unsafe code.
Unfortunately, two optimizations required it. Besides the intrusive list described above, the MPMC channel objects themselves are managed with two reference counts, one for senders and one for receivers. A split reference count is required: when one half of the channel is dropped, the channel is closed.
To simplify auditing, I placed all of this new unsafe code behind safe APIs in separate crates: wakerset, pinned-mutex, and splitrc.
Safe Rust, excepting compiler bugs and incorrectly-designed unsound APIs, has no undefined behavior. Unsafe Rust, on the other hand, removes the guardrails and opens a buffet of possible UB.
There are three ways you can deal with potential undefined behavior. In increasing order of happiness over time:
- Hope for the best and deal with potential bugs when they come up.
- Think carefully and convince yourself the code is correct.
- Automated sanitizers.
Fortunately, Rust supports sanitizers that detect various types of undefined behavior. The two most useful are ASan and TSan. C++ programmers are quite familiar with them at this point, and I consider them table stakes for any C or C++ project.
But Rust has one even better: MIRI. It catches violations of the Rust aliasing model, which is far pickier than ASan. Satisfying MIRI is where I spent most of my time.
Rust Aliasing Model
The first time I ran MIRI, it failed:
test link_and_notify_all ...
error: Undefined Behavior: trying to retag from <254318> for SharedReadWrite permission at alloc88289[0x10], but that tag does not exist in the borrow stack for this location
...
help: this indicates a potential bug in the program: it performed an invalid operation, but the Stacked Borrows rules it violated are still experimental
help: see https://github.com/rust-lang/unsafe-code-guidelines/blob/master/wip/stacked-borrows.md for further information
Stacked borrows? What’s all this?
Here I realized my first mistake: I dove straight into unsafe Rust, when I should have read more in advance.
My other mistake was “thinking in C.” Being deeply familiar with C’s semantics wasn’t helpful here. In hindsight, it was foolish to assume Rust’s aliasing model was similar to C’s. In C, merely holding a pointer to a value usually carries no meaning; you primarily reason about pointer dereferencing operations.
For example, in C, this is perfectly legal if p == q:
#include <stdio.h>

void foo(int* p, const int* q) {
printf("%d\n", *q);
*p = 456;
printf("%d\n", *q); // if p == q, prints 456
*(int*)q = 789;
printf("%d\n", *p); // if p == q, prints 789
}
const is “meaningless” in that it does not prevent the pointee from changing, so the compiler can’t optimize based on it. The equivalent Rust, on the other hand, lets the compiler assume a and b never alias:
fn foo(a: &u32, b: &mut u32) {
println!("{a}");
*b = 123;
println!("{a}"); // always prints the same value as above
}
On the other hand, Rust’s primary aliasing rule is that, at any point, an object may have either a unique &mut reference or any number of shared & references, but never both. The optimizer takes advantage of that: a is not reloaded from memory, because the write to b cannot alias it.
The aliasing rules in Rust are not fully defined. That’s part of what makes this hard. You have to write code assuming the most pessimal aliasing model.
Under the most pessimal aliasing rules, you have to assume that taking a &mut reference to a value immediately writes to it and continues to write to it as long as the reference lives. And if you have a shared reference, you have to assume the value is read at arbitrary times as long as any reference is held.
Box::leak
Let’s start with the first confusing example I ran into:
let p = Box::leak(Box::new(MyThing::new())) as *mut MyThing;
// later:
let r: &MyThing = unsafe { &*p };
r.method();
MIRI failed, complaining that p was formed from the perpetual &mut returned by Box::leak, and that it’s therefore UB to create a shared &MyThing reference to it at any point thenceforth.
The fix was to allocate without forming a reference:
let p = Box::into_raw(Box::new(MyThing::new()));
// later:
let r: &MyThing = unsafe { &*p };
r.method();
Note: This may have been a MIRI bug, or the rules have since been relaxed, because I can no longer reproduce it as of nightly-2024-06-12.
Here’s where the memory model and aliasing rules not being defined caused some pain: when MIRI fails, it’s unclear whether it’s my fault or not. For example, given that the &mut was immediately turned into a pointer, does the &mut reference still exist? There are multiple valid interpretations of the rules.
Box::from_raw
OK, if you allocate memory with Box::into_raw, you’d expect to deallocate it with Box::from_raw, right? That failed MIRI too.
I ended up having to write:
unsafe {
let ptr = ptr.as_ptr();
std::ptr::drop_in_place(ptr);
std::alloc::dealloc(
ptr as *mut u8,
std::alloc::Layout::new::<Inner<T>>());
}
Note: This may also have been a MIRI bug. It is no longer reproducible. I changed splitrc to use Box::into_raw and Box::from_raw, and it passes MIRI. I’ve enabled MIRI in CI, so we’ll see if it breaks again going forward.
Linkage, References, and Interior Mutability
That’s channel allocation and deallocation covered. Now let’s look at the intrusive pointers in wakerset.
In a linked list, every node has a linkage struct.
struct Pointers {
next: *mut Pointers,
prev: *mut Pointers,
pinned: PhantomPinned,
}
struct WakerList {
pointers: Pointers,
}
struct WakerSlot {
pointers: Pointers,
// Required to assert that this slot is never
// unlinked from an unrelated list.
owner: *mut WakerList,
waker: Option<Waker>,
}
Now imagine two threads. Thread A holds a &mut WakerList with the intent to extract pending Wakers. Thread B happens to hold a &WakerSlot at the same time.
It is UB for code traversing the pointers to form a &mut WakerSlot (or even a &WakerSlot) if any thread might have a &mut WakerSlot, because this violates Rust’s aliasing rules. A &mut reference must always be exclusive, even if it is never dereferenced. This is the important difference from C.
Because Rust reorders reads and writes based on its aliasing rules, you must never convert a pointer into a reference unless you know that nobody else has a conflicting reference.
We need to prevent the compiler from optimizing a &WakerSlot into early reads of the pointers and waker fields. UnsafeCell is the tool to reach for. It introduces a “mutability barrier”, and UnsafeCell<Pointers> tells Rust not to cache reads. We are responsible for ensuring we won’t violate Rust’s aliasing rules when accessing Pointers’s fields.
struct WakerList {
pointers: Pointers,
}
struct WakerSlot {
// UnsafeCell: written by WakerList independent of
// WakerSlot references
pointers: UnsafeCell<Pointers>,
// Required to assert that this slot is never
// unlinked from an unrelated list.
owner: *mut WakerList,
waker: Option<Waker>,
}
A circular linked list means that only a slot reference is required to mutate the list, so I needed to enforce the guarantee that only one thread may access the UnsafeCells at a time.
I did this with an important, if subtle, API guarantee: all link and unlink operations take &mut WakerList. If &mut WakerSlot were sufficient to unlink, thread safety could be violated when WakerList is behind a mutex. (This also means that WakerList does not require an UnsafeCell<Pointers>.)
The pinned-aliasable crate solves a related problem: how do we define self-referential data structures with mutable references that do not miscompile? Read the motivation in the crate’s doc comments. It’s a situation required by async futures, which are self-referential and thus pinned, but have no desugaring. See the open Rust Issue #63818.
Avoiding References Entirely
As mentioned, when traversing a linked list, it’s easy to form conflicting &mut references to nodes. Consider this slightly contrived unlink example:
// Within an unsafe block:
let p: *mut Pointers = slot.pointers.get();
let nextp = (*p).next;
let prevp = (*p).prev;
// Long-lived references that survive until the end of the block:
let next: &mut Pointers = &mut *nextp;
let prev: &mut Pointers = &mut *prevp;
next.prev = prevp;
prev.next = nextp;
(*p).next = ptr::null_mut();
(*p).prev = ptr::null_mut();
If slot is the only slot in the list, then next and prev both point to the WakerList, and we’ve now formed two mut references to the same value, which is UB.
In this particular case, we could ensure every pointer dereference occurs as a temporary. That limits the scope of each reference to each line.
let p: *mut Pointers = slot.pointers.get();
let next = (*p).next;
let prev = (*p).prev;
(*next).prev = prev;
(*prev).next = next;
(*p).next = ptr::null_mut();
(*p).prev = ptr::null_mut();
But I just don’t trust myself to ensure, under all code paths, that I never have two references overlap, violating Rust’s aliasing rules.
It’s kind of miserable, but the safest approach is to avoid creating references entirely and operate entirely in the domain of pointer reads, writes, and offsets.
let p: *mut Pointers = slot.pointers.get();
let nextp = addr_of_mut!((*p).next);
let prevp = addr_of_mut!((*p).prev);
let next = nextp.read();
let prev = prevp.read();
addr_of_mut!((*prev).next).write(next);
addr_of_mut!((*next).prev).write(prev);
nextp.write(ptr::null_mut());
prevp.write(ptr::null_mut());
(So much syntax. Makes you appreciate C.)
addr_of_mut! is key: it computes a pointer to a place expression without forming a reference. There are gotchas: you can still accidentally form a reference within an addr_of_mut! argument, so read the documentation.
Note: As I publish this, Rust 1.82 introduces new syntax that allows replacing addr_of_mut! with &raw mut and addr_of! with &raw const. It’s not yet clear to me how much this prevents accidental reference creation.
Despite the noise, to be safe, I ended up converting all of wakerset to pointer reads and writes. It’s not greppable: the code looks like it’s still full of pointer dereferences, but they’re within addr_of_mut!, and the place expressions have the right shape.
I think it was Patrick Walton who once proposed sugar for unsafe Rust and pointers with a hypothetical -> operator. It would be convenient and easier on the eyes.
Until the Rust memory model stabilizes further and the aliasing rules are well-defined, your best option is to integrate ASAN, TSAN, and MIRI (both stacked borrows and tree borrows) into your continuous integration for any project that contains unsafe code.
If your project is safe Rust but depends on a crate which makes heavy use of unsafe code, you should probably still enable sanitizers. I didn’t discover all UB in wakerset until it was integrated into batch-channel.
MIRI: Stacked Borrows and Tree Borrows
MIRI supports two aliasing models: stacked borrows and tree borrows.
I won’t attempt to describe them. They are different approaches with the same goal: formalize and validate the Rust memory model. Ralf Jung, Neven Villani, and their collaborators are doing amazing work. Without MIRI, it would be hard to trust unsafe Rust.
I decided to run both stacked and tree borrows and haven’t hit any false positives so far.
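In practice, running the test suite under both models looks like this (assuming a nightly toolchain with the miri component installed):

```shell
rustup +nightly component add miri

# Stacked borrows is MIRI's default model.
cargo +nightly miri test

# Tree borrows is opt-in via MIRIFLAGS.
MIRIFLAGS="-Zmiri-tree-borrows" cargo +nightly miri test
```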
Active Research in Self-Referential Structures
This topic is an active work in progress. I hope this blog post is obsolete in two years.
Self-referential and pinned data structures are something of a hot topic right now. The Rust-for-Linux project, doing the kinds of things systems programs do, has Pin ergonomics and self-referential data structures near the top of its wishlist – in particular, the pinned initialization problem.
The Linux kernel has self-referential data structures, and it is currently hard to initialize them with safe code. In wakerset, I sidestepped this problem, at the cost of a small amount of runtime inefficiency, by giving WakerList two empty states: one that is moveable and one that is pinned. The former converts to the latter the first time it is used after pinning.
y86-dev has a great blog post proposing Safe Pinned Initialization.
Conclusions
As much as this post might come across as a gripefest, I still think Rust is great. In particular, its composable safety: the result of my pain is a safe, efficient API. You can use wakerset without any risk of undefined behavior.
What I learned from this experience:
- Be extra careful with any use of unsafe.
- References, even if never used, are more dangerous than pointers in C.
- Pinning syntax is awful, but it feels like Rust could solve this someday.
- UnsafeCell is required for intrusive structures.
- I don’t know how to statically constrain lifetime relationships with intrusive structures, but maybe it’s possible? Avoiding the need for runtime assertions would be nice.
- MIRI, especially under multithreaded stress tests, is critical.
- Putting this in words was as hard as writing the code.