Reference Counting Things
Reference counting is cheap and easy. An integer starts at one, increments on every new reference, and whoever decrements it to zero is responsible for deallocation.
If references are shared across threads, increments and decrements must be atomic.
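Concretely, a minimal intrusive version might look like this sketch (not any particular library’s API):
#include <atomic>
#include <cstddef>

struct RefCounted {
    std::atomic<size_t> refs{1};  // the creator starts with the one reference
};

void incref(RefCounted* p) {
    // Taking a new reference only bumps the count; relaxed ordering is enough.
    p->refs.fetch_add(1, std::memory_order_relaxed);
}

void decref(RefCounted* p) {
    // acq_rel so the thread that deletes sees every other owner's writes.
    if (1 == p->refs.fetch_sub(1, std::memory_order_acq_rel)) {
        delete p;
    }
}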
Decades ago, I wrote an audio library that shipped in a couple of commercial games. Things you’d find on CD in the bargain bin at Walmart. The ABI was modeled after COM and most objects were reference-counted. At the time I’d never seen a dual-CPU system, and thought inc [refcount] and dec [refcount] were single instructions. It will be fine, right?!
Dual-core didn’t yet exist, but some people had dual-socket boards, and we started seeing crash reports after the CDs were burned… oops.
(On the bright side, since I was religious about maintaining stable ABIs, users could just drop the fixed DLL into place.)
Cost of Atomics
Atomics are more expensive than non-atomic operations. A plain inc is a handful of cycles; lock inc, even uncontended, can be dozens.
When C++ standardized std::shared_ptr in 2011, the committee didn’t even bother with a non-atomic version. C++ isn’t safe enough to guarantee a count stays on one thread, and there was a feeling that atomic increments and decrements were common enough that they’d get optimized in hardware. That was correct – it just took a while.
Rust’s safety guarantees, on the other hand, allow safe use of an unsynchronized Rc if you don’t want to pay for Arc.
It’s pretty easy for reference counting overhead to show up in profiles. Sometimes it’s the accidental shared_ptr copy in a hot loop or a recursive .clone() in Rust. Last time I wrote Swift, atomic reference counts were a major cost.
The hardware is getting better. On Apple Silicon and AMD Zen 3, uncontended atomic increments and decrements are almost as cheap as non-atomic. (Interestingly, atomics are also cheap on my 64-bit, 4-thread Intel Atom from 2011.) These optimizations are a big deal, and if all CPUs worked that way, maybe this blog post would end here.
Alas, data centers are still filled with years-old Intel CPUs and non-Apple ARM implementations. It’s worth spending some time in software to avoid synchronization if possible.
Avoid 0-to-1
Here’s an easy but commonly-missed trick. Initialize your reference counts to 1.
For whatever reason (symmetry?), it’s common to see implementations like:
struct Object {
    std::atomic<size_t> count{0};
};

struct ObjectPtr {
    ObjectPtr(Object* p): p{p} {
        p->count.fetch_add(1, std::memory_order_relaxed);
    }
    Object* p;
};
I haven’t seen a compiler realize it can replace the initial value with 1 and avoid atomics when new objects are allocated.
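The fix is to allocate the object with its count already at 1 and have the creating pointer adopt that reference instead of incrementing it. Here’s a sketch; the adopt_ref tag is hypothetical, not from any particular library:
#include <atomic>
#include <cstddef>

struct Object {
    std::atomic<size_t> count{1};  // the creator already holds one reference
};

struct adopt_ref_t {};  // tag: take ownership of the initial reference

struct ObjectPtr {
    // Adopt the reference established at construction: no atomic needed.
    ObjectPtr(Object* p, adopt_ref_t): p{p} {}

    // Sharing an existing object still pays the relaxed atomic increment.
    ObjectPtr(const ObjectPtr& other): p{other.p} {
        p->count.fetch_add(1, std::memory_order_relaxed);
    }

    Object* p;
};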
Avoid 1-to-0
A typical release implementation is written:
struct ObjectPtr {
    ~ObjectPtr() {
        if (1 == p->count.fetch_sub(1, std::memory_order_acq_rel)) {
            delete p;
        }
    }
    Object* p;
};
However, actually decrementing the count to zero is not necessary. We only need to know if we’re the last reference. Thus, we can write:
~ObjectPtr() {
    // If we hold the only reference, skip the atomic RMW entirely.
    if (1 == p->count.load(std::memory_order_acquire) ||
        1 == p->count.fetch_sub(1, std::memory_order_acq_rel)) {
        delete p;
    }
}
Maybe the impact on code size isn’t worth it. That’s your call. On older Intel CPUs, in situations where most objects only have one reference, it can be a meaningful optimization.
Maged Michael implemented a fancier version of this algorithm in gcc’s libstdc++.
Implementing these two optimizations in Watchman was a material win for code that allocated or deallocated large trees.
Biased Reference Counting
Swift implicitly reference-counts many of its objects. When I worked at Dropbox, we measured reference counting operations as a substantial portion of our overall CPU time.
In 2018, researchers at University of Illinois Urbana-Champaign proposed an algorithm called Biased Reference Counting that splits the reference count into two. One is biased to a specific thread and can be updated without atomic operations. The other reference count is atomic and shared among the remaining threads. Unifying these two counts requires extra bookkeeping, especially in languages like Swift or C++ where unsynchronized values can easily migrate across threads.
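In rough C++ terms, the owner-thread fast path looks something like the following. This is a sketch of the idea only, not the paper’s algorithm; merging the two counts when the owner drops its bias is exactly the extra bookkeeping mentioned above, and it is omitted here.
#include <atomic>
#include <cstdint>
#include <thread>

struct BiasedCount {
    std::thread::id owner = std::this_thread::get_id();
    uint32_t biased = 1;              // owner-thread references, no atomics
    std::atomic<uint32_t> shared{0};  // references held by other threads

    void incref() {
        if (std::this_thread::get_id() == owner) {
            ++biased;  // fast path: a plain increment, no lock prefix
        } else {
            shared.fetch_add(1, std::memory_order_relaxed);
        }
    }
};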
The hybrid_rc Rust crate has an implementation of this algorithm that takes advantage of Rust’s type system (in particular, by not providing Send for thread-local references) to avoid extra bookkeeping.
I’m curious if anyone uses biased reference counting in practice.
Split Reference Counting
Channel and promise implementations need to track two reference counts: one for readers and one for writers. When either reaches zero, the channel is closed. Waiting senders or receivers are notified that no more messages can be sent.
Rust’s built-in channels use two atomic counters and an atomic bit. The bit is necessary to determine which thread should deallocate in the case that a thread drops the last reader exactly as another thread drops the last writer.
It’s possible to pack all of these into a single 64-bit counter. If each half has 32 bits but the entire counter is updated atomically, no additional state is required to disambiguate who deallocates.
I have a Rust implementation of the above in the splitrc crate.
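In C++ terms, the packing might look like this sketch (not splitrc’s actual layout). A single fetch_sub reports atomically whether we just dropped the last writer, the last reference of either kind, or neither, so no extra bit is needed to decide who deallocates. Overflow checks are omitted and drop_reader is symmetric.
#include <atomic>
#include <cstdint>

constexpr uint64_t READER = 1;
constexpr uint64_t WRITER = uint64_t(1) << 32;

struct Channel {
    std::atomic<uint64_t> counts{READER + WRITER};  // one reader, one writer

    void drop_writer() {
        uint64_t prev = counts.fetch_sub(WRITER, std::memory_order_acq_rel);
        if ((prev >> 32) == 1) {
            // Last writer just left: close the channel and wake readers.
        }
        if (prev == WRITER) {
            // No readers remained either, so this thread deallocates.
            delete this;
        }
    }
};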
How Many Bits?
Rust is sound: safe code must not have undefined behavior. std::mem::forget is a safe function. Therefore, it’s possible to run up the reference count on some p in a tight loop such as:
loop {
    std::mem::forget(p.clone());
}
64-bit counters are effectively infinite. Let’s hypothesize a 4 GHz CPU where increments take one cycle. It would take almost 150 years to overflow.
In contrast, a modern CPU can overflow a 32-bit counter in seconds. You might say (and I’d agree) that a program that holds billions of references is pathological, and need not be supported. On the other hand, in Rust, safe code must never overflow and cause use-after-free.
Therefore, any 32-bit counter (even usize and AtomicUsize on 32-bit CPUs) must detect and handle overflow.
Rc uses usize::wrapping_add to detect wraparound.
Arc reserves half the range of usize to detect overflow. This is safe under the assumption that billions of threads aren’t simultaneously incrementing the counter.
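In C++ terms, that reserve-half-the-range check looks something like this sketch (the names are mine, not Arc’s internals):
#include <atomic>
#include <cstdint>
#include <cstdlib>

void incref(std::atomic<uint32_t>& count) {
    uint32_t prev = count.fetch_add(1, std::memory_order_relaxed);
    // Any value in the reserved upper half means increments have run away;
    // fail loudly before the count can wrap back through zero.
    if (prev > UINT32_MAX / 2) {
        std::abort();
    }
}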
Rust reference counts typically abort on overflow rather than panic. I assume this is because panics can be caught and ignored. There may be codegen benefits as well. However, in the context of long-lived server processes that concurrently handle requests, it’s nice to catch panics and fail the one buggy request instead of aborting.
splitrc allocates a panic range and an abort range to get the best of both worlds.
In practice, overflowing a reference count should never happen. Reference counts should never get that high. But that sounds like famous last words, and I’ll happily pay a branch and some cold code for a loud failure.
Older versions of the Linux kernel even had a use-after-free caused by reference count overflow.
Weak References
Like split reference counts, supporting weak references requires maintaining two counts: a strong count and a weak count. When the strong count reaches zero, the referenced value is destructed. But the memory can’t be deallocated until both counts reach zero.
The approach taken by Rust’s Arc is to maintain two separate counters. All strong references collectively share one extra weak reference; when the last strong reference is dropped, that extra weak reference is dropped too. Memory is deallocated when the weak count reaches zero.
libc++ takes a similar approach with the interesting caveat that it starts counting at zero and waits until the counts decrement to -1.
Supporting weak references has a small cost. You need space for two counters and some implementations actually perform two atomic decrements when the last strong reference is dropped.
It’s possible to do better: like splitrc, the strong and weak references can be packed into a single 64-bit integer with overflow detection on each half. Each new reference is a single atomic addition. As in the 1-to-0 optimization above, an optimistic load can avoid an atomic RMW in the common case that no weak references are alive.
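Here’s a sketch of dropping a strong reference under that scheme, with the strong count in the low half and the weak count in the high half of one 64-bit atomic. It’s an illustration of the idea, not any library’s control block.
#include <atomic>
#include <cstdint>

constexpr uint64_t STRONG = 1;
constexpr uint64_t WEAK = uint64_t(1) << 32;

struct Control {
    std::atomic<uint64_t> counts{STRONG};  // one strong, zero weak

    void destroy_value() { /* run the managed value's destructor */ }
    void deallocate() { delete this; /* free the allocation itself */ }

    void drop_strong() {
        // Optimistic load: if we hold the only reference of either kind,
        // no other thread can race us, so skip the atomic RMW.
        if (counts.load(std::memory_order_acquire) == STRONG) {
            destroy_value();
            deallocate();
            return;
        }
        uint64_t prev = counts.fetch_sub(STRONG, std::memory_order_acq_rel);
        if ((prev & 0xFFFFFFFFu) == 1) {
            destroy_value();  // we were the last strong reference
            if (prev == STRONG) {
                deallocate();  // and no weak references remain either
            }
        }
    }
};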
If you don’t need weak references, the Rust triomphe crate provides some faster alternatives to the standard library.
Count First Reference From Zero or One?
It’s typical for reference counts to start at one and decrement to zero. But that’s not the only option. As mentioned above, libc++ initializes its references to value zero, meaning one reference. Decrement checks whether the count underflows to -1.
Unless you’re the most standard of libraries, the tiny differences in instruction selection don’t matter. But they’re fun to look at, so let’s see. (Compiler Explorer)
Initializing values to zero is smaller in most ISAs:
struct RC {
    size_t s;
    size_t w;
};

void init_zero(RC& rc) {
    rc.s = 0;
    rc.w = 0;
}

void init_one(RC& rc) {
    rc.s = 1;
    rc.w = 1;
}
x86-64 (gcc 13.2):
init_zero(RC&):
        pxor xmm0, xmm0
        movups XMMWORD PTR [rdi], xmm0
        ret
init_one(RC&):
        movdqa xmm0, XMMWORD PTR .LC0[rip]
        movups XMMWORD PTR [rdi], xmm0
        ret
gcc chooses to load the pair of ones from a 128-bit constant. clang instead generates two stores.
x86-64 (clang 17):
init_zero(RC&): # @init_zero(RC&)
        xorps xmm0, xmm0
        movups xmmword ptr [rdi], xmm0
        ret
init_one(RC&): # @init_one(RC&)
        mov qword ptr [rdi], 1
        mov qword ptr [rdi + 8], 1
        ret
ARM64 gcc generates equivalent code to x86-64. clang on ARM64 instead broadcasts a constant 1 into a vector and stores it.
64-bit ARM (clang 17):
init_zero(RC&): // @init_zero(RC&)
        stp xzr, xzr, [x0]
        ret
init_one(RC&): // @init_one(RC&)
        mov w8, #1 // =0x1
        dup v0.2d, x8
        str q0, [x0]
        ret
As expected, zero-initialization is slightly cheaper.
Increment will generate the same instructions no matter where the count starts, of course. (If using a 32-bit counter, overflow checks are required. Choosing an overflow range that allows branching on the sign bit can generate a smaller hot path, but that’s almost independent of where to start counting.)
Decrement is a little interesting.
void dec_zero_exact(std::atomic<size_t>& c) {
    if (0 == c.fetch_sub(1, std::memory_order_acq_rel)) {
        dealloc();
    }
}

void dec_zero_less(std::atomic<size_t>& c) {
    using ssize_t = std::make_signed_t<size_t>;
    if (0 >= static_cast<ssize_t>(c.fetch_sub(1, std::memory_order_acq_rel))) {
        dealloc();
    }
}

void dec_one(std::atomic<size_t>& c) {
    if (1 == c.fetch_sub(1, std::memory_order_acq_rel)) {
        dealloc();
    }
}
Let’s look at x86-64:
dec_zero_exact(std::atomic<unsigned long>&): # @dec_zero_exact(std::atomic<unsigned long>&)
        mov rax, -1
        lock xadd qword ptr [rdi], rax
        test rax, rax
        je dealloc()@PLT # TAILCALL
        ret
dec_zero_less(std::atomic<unsigned long>&): # @dec_zero_less(std::atomic<unsigned long>&)
        lock dec qword ptr [rdi]
        jl dealloc()@PLT # TAILCALL
        ret
dec_one(std::atomic<unsigned long>&): # @dec_one(std::atomic<unsigned long>&)
        lock dec qword ptr [rdi]
        je dealloc()@PLT # TAILCALL
        ret
There are two atomic decrement instructions, lock dec and lock xadd. lock dec is slightly preferable: it has a similar cost, but its latency is one cycle less on Zen 4, and it’s smaller. (lock xadd also requires loading -1 into a register.) But, since it doesn’t return the previous value and only sets flags, it can only be used if a following comparison can use those flags.
Therefore, on x86-64, counting from 1 is slightly cheaper, at least with a naive comparison. However, if we sacrifice half the range of the counter type (again, two billion should be plenty), then we can get the same benefits in the counting-from-zero decrement.
Now let’s take a look at ARM64:
dec_zero_exact(std::atomic<unsigned long>&): // @dec_zero_exact(std::atomic<unsigned long>&)
        mov x8, #-1 // =0xffffffffffffffff
        ldaddal x8, x8, [x0]
        cbz x8, .LBB2_2
        ret
.LBB2_2:
        b dealloc()
dec_zero_less(std::atomic<unsigned long>&): // @dec_zero_less(std::atomic<unsigned long>&)
        mov x8, #-1 // =0xffffffffffffffff
        ldaddal x8, x8, [x0]
        cmp x8, #0
        b.le .LBB3_2
        ret
.LBB3_2:
        b dealloc()
dec_one(std::atomic<unsigned long>&): // @dec_one(std::atomic<unsigned long>&)
        mov x8, #-1 // =0xffffffffffffffff
        ldaddal x8, x8, [x0]
        cmp x8, #1
        b.ne .LBB4_2
        b dealloc()
.LBB4_2:
        ret
None of the atomic read-modify-writes on ARM64 set flags, so the value has to be explicitly compared anyway. The only difference is that comparing equality with zero is one fewer instruction.
So there we go. All of this instruction selection likely comes out in the wash. I was hoping for a Dijkstra-like clear winner. The strongest argument for starting counts at 1 is that the counter never underflows, allowing multiple counts to be packed into a single integer.
False Sharing
Where the reference count is positioned in the object can matter. If the reference count ends up in the same cache line as other modified data, concurrent workloads are penalized. By reducing false sharing and making RCU more scalable, Intel improved highly concurrent network performance in Linux by 2-130%.
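One mitigation is simple alignment: give the hot count its own cache line so its writes don’t invalidate lines holding read-mostly fields. A sketch (std::hardware_destructive_interference_size is C++17; 64 bytes is a common fallback):
#include <atomic>
#include <cstddef>
#include <new>

#ifdef __cpp_lib_hardware_interference_size
constexpr size_t kCacheLine = std::hardware_destructive_interference_size;
#else
constexpr size_t kCacheLine = 64;
#endif

struct Object {
    alignas(kCacheLine) std::atomic<size_t> count{1};
    // Read-mostly payload starts on a different cache line from the count.
    alignas(kCacheLine) char payload[256];
};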
There may be value in abstracting the count’s location through a vtable, like COM does.
I’ll Stop Here
Reference counting is a long-studied topic. There are saturating counts, counts that saturate into mark-and-sweep, counts that saturate into (logged) leaks, cycle detection, weighted reference counts, deferred increments, combining updates, and external counts, but you can read about those elsewhere.
I mostly wanted to share some things I’ve recently run into.