Reference counting is cheap and easy. An integer starts at one, increments on every new reference, and whoever decrements it to zero is responsible for deallocation.

If references are shared across threads, increments and decrements must be atomic.

Decades ago, I wrote an audio library that shipped in a couple of commercial games. Things you’d find on CD in the bargain bin at Walmart. The ABI was modeled after COM, and most objects were reference-counted. At the time I’d never seen a dual-CPU system, and figured that since inc [refcount] and dec [refcount] are single instructions, it would be fine, right?!

Dual-core didn’t yet exist, but some people had dual-socket boards, and we started seeing crash reports after the CDs were burned… oops.

(On the bright side, since I was religious about maintaining stable ABIs, users could just drop the fixed DLL into place.)

Cost of Atomics

Atomics are more expensive than non-atomic operations. inc is a handful of cycles. lock inc, even uncontended, can be dozens.

When C++ standardized std::shared_ptr in 2011, they didn’t even bother with a non-atomic version. C++ isn’t safe enough to keep an unsynchronized count from being shared across threads, and there was a feeling that atomic increments and decrements were common enough that they’d get optimized in hardware. That was correct – it just took a while.

Rust’s safety guarantees, on the other hand, allow safe use of an unsynchronized Rc if you don’t want to pay for Arc.

It’s pretty easy for reference counting overhead to show up in profiles. Sometimes it’s the accidental shared_ptr copy in a hot loop or a recursive .clone() in Rust. Last time I wrote Swift, atomic reference counts were a major cost.
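
For example (hypothetical names), a shared_ptr parameter taken by value copies the pointer on every call, which costs an atomic increment and decrement per iteration; a reference avoids the traffic entirely:

#include <memory>
#include <vector>

struct Texture { int id; };

// Taking shared_ptr by value bumps the atomic count on every call...
int use_by_value(std::shared_ptr<Texture> t) { return t->id; }

// ...while a const reference (or a raw pointer) leaves the count alone.
int use_by_ref(const std::shared_ptr<Texture>& t) { return t->id; }

int sum_ids(const std::vector<std::shared_ptr<Texture>>& textures) {
  int sum = 0;
  for (const auto& t : textures) {
    sum += use_by_ref(t);  // no refcount traffic inside the hot loop
  }
  return sum;
}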

The hardware is getting better. On Apple Silicon and AMD Zen 3, uncontended atomic increments and decrements are almost as cheap as non-atomic. (Interestingly, atomics are also cheap on my 64-bit, 4-thread Intel Atom from 2011.) These optimizations are a big deal, and if all CPUs worked that way, maybe this blog post would end here.

Alas, data centers are still filled with years-old Intel CPUs or non-Apple ARM implementations. It’s worth spending some time in software to avoid synchronization if possible.

Avoid 0-to-1

Here’s an easy but commonly missed trick: initialize your reference counts to 1.

For whatever reason (symmetry?), it’s common to see implementations like:

#include <atomic>
#include <cstddef>

struct Object {
  std::atomic<size_t> count{0};  // starts at zero...
};

struct ObjectPtr {
  ObjectPtr(Object* p): p{p} {
    // ...so even the very first reference pays for an atomic RMW.
    p->count.fetch_add(1, std::memory_order_relaxed);
  }
  Object* p;
};

I haven’t seen a compiler realize it can replace the initial value with 1 and avoid atomics when new objects are allocated.
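
Here’s a minimal sketch of the alternative. The Adopt tag is a hypothetical convention for “this pointer came straight from new, and its initial reference is already counted”:

#include <atomic>
#include <cstddef>

struct Object {
  // Start at 1: the creating reference is counted by a plain store at
  // construction time instead of an atomic RMW.
  std::atomic<size_t> count{1};
};

struct ObjectPtr {
  struct Adopt {};  // hypothetical tag: take ownership of the initial count

  // Adopting a freshly allocated object performs no atomic operation.
  ObjectPtr(Object* p, Adopt) : p{p} {}

  // Copies still pay for the increment.
  ObjectPtr(const ObjectPtr& other) : p{other.p} {
    p->count.fetch_add(1, std::memory_order_relaxed);
  }

  Object* p;
};

// Usage: ObjectPtr obj{new Object, ObjectPtr::Adopt{}};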

Avoid 1-to-0

A typical release implementation looks like this:

struct ObjectPtr {
  ~ObjectPtr() {
    // fetch_sub returns the previous value; 1 means we held the last reference.
    if (1 == p->count.fetch_sub(1, std::memory_order_acq_rel)) {
      delete p;
    }
  }
  Object* p;
};

However, actually decrementing the count to zero is not necessary. We only need to know if we’re the last reference. Thus, we can write:

  ~ObjectPtr() {
    // If we hold the only reference, no other thread can be incrementing it,
    // so a plain load suffices and the atomic RMW is skipped entirely.
    if (1 == p->count.load(std::memory_order_acquire) ||
        1 == p->count.fetch_sub(1, std::memory_order_acq_rel)) {
      delete p;
    }
  }

Maybe the impact on code size isn’t worth it. That’s your call. On older Intel CPUs, in situations where most objects only have one reference, it can be a meaningful optimization.

Maged Michael implemented a fancier version of this algorithm in gcc’s libstdc++.

Implementing these two optimizations in Watchman was a material win for code that allocated or deallocated large trees.

Biased Reference Counting

Swift implicitly reference-counts many of its objects. When I worked at Dropbox, we measured reference counting operations as a substantial portion of our overall CPU time.

In 2018, researchers at University of Illinois Urbana-Champaign proposed an algorithm called Biased Reference Counting that splits the reference count into two. One is biased to a specific thread and can be updated without atomic operations. The other reference count is atomic and shared among the remaining threads. Unifying these two counts requires extra bookkeeping, especially in languages like Swift or C++ where unsynchronized values can easily migrate across threads.
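
The following is a rough sketch of the idea, not the paper’s full algorithm: the owning thread bumps a plain counter, every other thread bumps an atomic one, and the hard part (merging the two counts when the bias has to move or the object dies) is omitted.

#include <atomic>
#include <cstddef>
#include <thread>

struct BiasedCount {
  std::thread::id owner = std::this_thread::get_id();
  size_t biased = 1;              // references held by the owner thread; no atomics
  std::atomic<size_t> shared{0};  // references held by every other thread

  void incref() {
    if (std::this_thread::get_id() == owner) {
      ++biased;  // plain, non-atomic increment on the owning thread
    } else {
      shared.fetch_add(1, std::memory_order_relaxed);
    }
  }

  // decref and the bookkeeping that reconciles `biased` and `shared`
  // (the tricky part) are omitted.
};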

The hybrid_rc Rust crate has an implementation of this algorithm that takes advantage of Rust’s type system (in particular, by not providing Send for thread-local references) to avoid extra bookkeeping.

I’m curious if anyone uses biased reference counting in practice.

Split Reference Counting

Channel and promise implementations need to track two reference counts: one for readers and one for writers. When either reaches zero, the channel is closed. Waiting senders or receivers are notified that no more messages can be sent.

Rust’s built-in channels use two atomic counters and an atomic bit. The bit is necessary to determine which thread should deallocate in the case that a thread drops the last reader exactly as another thread drops the last writer.

It’s possible to pack all of these into a single 64-bit counter. If each half has 32 bits but the entire counter is updated atomically, no additional state is required to disambiguate who deallocates.

I have a Rust implementation of the above in the splitrc crate.
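
For illustration, here is a rough C++ sketch of the packing trick (not splitrc’s actual implementation, and overflow checks are omitted): senders occupy the high 32 bits, receivers the low 32 bits, and a single fetch_sub tells the dropping thread whether it was the last reference of any kind.

#include <atomic>
#include <cstdint>

struct SplitCount {
  static constexpr uint64_t kReceiver = 1;
  static constexpr uint64_t kSender = uint64_t{1} << 32;

  // One receiver and one sender to start.
  std::atomic<uint64_t> packed{kReceiver + kSender};

  // Drop one reference of the given kind. Returns true if the caller dropped
  // the last reference overall and must deallocate; because both halves are
  // updated in one atomic operation, no extra flag is needed to decide who
  // frees the channel.
  bool drop(uint64_t unit) {
    uint64_t prev = packed.fetch_sub(unit, std::memory_order_acq_rel);
    return prev == unit;
  }
};

// Usage: a receiver being destroyed calls drop(SplitCount::kReceiver).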

How Many Bits?

Rust is sound: safe code must not have undefined behavior. std::mem::forget is a safe function. Therefore, it’s possible to run up the reference count on some value p in a tight loop such as:

loop {
  // Each leaked clone bumps the count and never decrements it.
  std::mem::forget(p.clone());
}

64-bit counters are effectively infinite. Hypothesize a 4 GHz CPU where each increment takes one cycle: 2^64 increments at 4 × 10^9 per second is about 4.6 × 10^9 seconds, or almost 150 years to overflow.

In contrast, a modern CPU can overflow a 32-bit counter in seconds. You might say (and I’d agree) that a program that holds billions of references is pathological, and need not be supported. On the other hand, in Rust, safe code must never overflow and cause use-after-free.

Therefore, any 32-bit counter (even usize and AtomicUsize on 32-bit CPUs) must detect and handle overflow.

Rc increments with usize::wrapping_add and aborts if the count wraps around to zero. Arc reserves the upper half of usize’s range to detect overflow. This is safe under the assumption that billions of threads aren’t simultaneously incrementing the counter.
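
As an illustration, here is a sketch of an Arc-style guard in C++ (the exact threshold and failure behavior are implementation choices, not what any standard library does):

#include <atomic>
#include <cstdlib>
#include <limits>

// Increment first, then abort if the previous value had already crossed into
// the reserved upper half of the range. This is sound as long as the count
// can't be pushed from the threshold all the way to wraparound before the
// abort fires, i.e. as long as billions of threads aren't incrementing
// simultaneously.
void incref(std::atomic<size_t>& count) {
  size_t old = count.fetch_add(1, std::memory_order_relaxed);
  if (old > std::numeric_limits<size_t>::max() / 2) {
    std::abort();
  }
}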

Rust reference counts typically abort on overflow rather than panic. I assume this is because panics can be caught and ignored. There may be codegen benefits as well. However, in the context of long-lived server processes that concurrently handle requests, it’s nice to catch panics and fail the one buggy request instead of aborting.

splitrc allocates a panic range and an abort range to get the best of both worlds.

In practice, overflowing a reference count should never happen. Reference counts should never get that high. But that sounds like famous last words, and I’ll happily pay a branch and some cold code for a loud failure.

Older versions of the Linux kernel even had a use-after-free caused by reference count overflow.

Weak References

Like split reference counts, supporting weak references requires maintaining two counts: a strong count and a weak count. When the strong count reaches zero, the referenced value is destructed. But the memory can’t be deallocated until both counts reach zero.

The approach taken by Rust’s Arc is to maintain two separate counters. All strong references share an extra weak reference. When the last strong reference is dropped, the extra weak reference is dropped too.

Memory is deallocated when the weak count reaches zero.
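
Here’s a sketch of that layout (assumed names, not Arc’s actual code): the weak count starts at 1, that extra weak reference is collectively owned by all strong references, the value is destroyed when the strong count hits zero, and the allocation is freed when the weak count hits zero.

#include <atomic>
#include <cstddef>

struct ControlBlock {
  std::atomic<size_t> strong{1};
  std::atomic<size_t> weak{1};  // the "+1" shared by all strong references
};

// Hypothetical drop paths; destroy_value and free_allocation stand in for
// running the destructor and releasing the memory.
void drop_strong(ControlBlock* cb,
                 void (*destroy_value)(ControlBlock*),
                 void (*free_allocation)(ControlBlock*)) {
  if (1 == cb->strong.fetch_sub(1, std::memory_order_acq_rel)) {
    destroy_value(cb);  // last strong reference: destruct the value...
    // ...then release the weak reference held on behalf of all strong refs.
    if (1 == cb->weak.fetch_sub(1, std::memory_order_acq_rel)) {
      free_allocation(cb);  // no weak references remain: free the memory
    }
  }
}

void drop_weak(ControlBlock* cb, void (*free_allocation)(ControlBlock*)) {
  if (1 == cb->weak.fetch_sub(1, std::memory_order_acq_rel)) {
    free_allocation(cb);
  }
}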

libc++ takes a similar approach with the interesting caveat that it starts counting at zero and waits until the counts decrement to -1.

Supporting weak references has a small cost. You need space for two counters, and some implementations actually perform two atomic decrements when the last strong reference is dropped.

It’s possible to do better: like splitrc, the strong and weak references can be packed into a single 64-bit integer with overflow detection on each half. Each new reference is a single atomic addition. As in the 1-to-0 optimization above, an optimistic load can avoid an atomic RMW in the common case that no weak references are alive.

If you don’t need weak references, the Rust triomphe crate provides some faster alternatives to the standard library.

Count First Reference From Zero or One?

It’s typical for reference counts to start at one and decrement to zero. But that’s not the only option. As mentioned above, libc++ initializes its reference counts to zero, meaning one reference, and its decrement checks whether the count underflows to -1.

Unless you’re the most standard of libraries, the tiny differences in instruction selection don’t matter. But they’re fun to look at, so let’s see. (Compiler Explorer)

Initializing values to zero is smaller in most ISAs:

struct RC {
    size_t s;
    size_t w;
};

void init_zero(RC& rc) {
    rc.s = 0;
    rc.w = 0;
}

void init_one(RC& rc) {
    rc.s = 1;
    rc.w = 1;
}

x86-64 (gcc 13.2):

init_zero(RC&):
        pxor    xmm0, xmm0
        movups  XMMWORD PTR [rdi], xmm0
        ret
init_one(RC&):
        movdqa  xmm0, XMMWORD PTR .LC0[rip]
        movups  XMMWORD PTR [rdi], xmm0
        ret

gcc chooses to load the pair of ones from a 128-bit constant. clang instead generates two stores.

x86-64 (clang 17):

init_zero(RC&):                       # @init_zero(RC&)
        xorps   xmm0, xmm0
        movups  xmmword ptr [rdi], xmm0
        ret
init_one(RC&):                        # @init_one(RC&)
        mov     qword ptr [rdi], 1
        mov     qword ptr [rdi + 8], 1
        ret

ARM64 gcc generates equivalent code to x86-64. clang on ARM64 instead broadcasts a constant 1 into a vector and stores it.

64-bit ARM (clang 17):

init_zero(RC&):                       // @init_zero(RC&)
        stp     xzr, xzr, [x0]
        ret
init_one(RC&):                        // @init_one(RC&)
        mov     w8, #1                          // =0x1
        dup     v0.2d, x8
        str     q0, [x0]
        ret

As expected, zero-initialization is slightly cheaper.

Increment will generate the same instructions no matter where the count starts, of course. (If using a 32-bit counter, overflow checks are required. Choosing an overflow range that allows branching on the sign bit can generate a smaller hot path, but that’s almost independent of where to start counting.)

Decrement is a little interesting.

#include <atomic>
#include <cstddef>
#include <type_traits>

void dealloc();  // assumed to be defined elsewhere

void dec_zero_exact(std::atomic<size_t>& c) {
    if (0 == c.fetch_sub(1, std::memory_order_acq_rel)) {
        dealloc();
    }
}

void dec_zero_less(std::atomic<size_t>& c) {
    using ssize_t = std::make_signed_t<size_t>;
    if (0 >= static_cast<ssize_t>(c.fetch_sub(1, std::memory_order_acq_rel))) {
        dealloc();
    }
}

void dec_one(std::atomic<size_t>& c) {
    if (1 == c.fetch_sub(1, std::memory_order_acq_rel)) {
        dealloc();
    }
}

Let’s look at x86-64:

dec_zero_exact(std::atomic<unsigned long>&):    # @dec_zero_exact(std::atomic<unsigned long>&)
        mov        rax, -1
        lock xadd  qword ptr [rdi], rax
        test       rax, rax
        je         dealloc()@PLT                # TAILCALL
        ret
dec_zero_less(std::atomic<unsigned long>&):     # @dec_zero_less(std::atomic<unsigned long>&)
        lock dec  qword ptr [rdi]
        jl        dealloc()@PLT                 # TAILCALL
        ret
dec_one(std::atomic<unsigned long>&):           # @dec_one(std::atomic<unsigned long>&)
        lock dec  qword ptr [rdi]
        je        dealloc()@PLT                 # TAILCALL
        ret

There are two atomic decrement instructions, lock dec and lock xadd. lock dec is slightly preferable: it has a similar cost, but its latency is one cycle less on Zen 4, and it’s smaller. (lock xadd also requires loading -1 into a register.)

But since lock dec doesn’t return the previous value, only setting flags, it can only be used if the following comparison can use those flags.

Therefore, on x86-64, counting from 1 is slightly cheaper, at least with a naive comparison. However, if we sacrifice half the range of the counter type (again, two billion should be plenty), then we can get the same benefits in the counting-from-zero decrement.

Now let’s take a look at ARM64:

dec_zero_exact(std::atomic<unsigned long>&):        // @dec_zero_exact(std::atomic<unsigned long>&)
        mov     x8, #-1                         // =0xffffffffffffffff
        ldaddal x8, x8, [x0]
        cbz     x8, .LBB2_2
        ret
.LBB2_2:
        b       dealloc()
dec_zero_less(std::atomic<unsigned long>&):         // @dec_zero_less(std::atomic<unsigned long>&)
        mov     x8, #-1                         // =0xffffffffffffffff
        ldaddal x8, x8, [x0]
        cmp     x8, #0
        b.le    .LBB3_2
        ret
.LBB3_2:
        b       dealloc()
dec_one(std::atomic<unsigned long>&):                // @dec_one(std::atomic<unsigned long>&)
        mov     x8, #-1                         // =0xffffffffffffffff
        ldaddal x8, x8, [x0]
        cmp     x8, #1
        b.ne    .LBB4_2
        b       dealloc()
.LBB4_2:
        ret

None of the atomic read-modify-writes on ARM64 set flags, so the value has to be explicitly compared anyway. The only difference is that comparing equality with zero is one fewer instruction.

So there we go. All of this instruction selection likely comes out in the wash. I was hoping for a Dijkstra-like clear winner. The strongest argument for starting counts at 1 is that the counter never underflows, which allows multiple counts to be packed into a single integer.

False Sharing

Where the reference count is positioned in the object can matter. If the reference count ends up in the same cache line as other frequently-modified data, concurrent workloads are penalized. By reducing false sharing and making RCU more scalable, Intel improved highly concurrent network performance in Linux by 2-130%.
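
One common mitigation, sketched below, is to give the count its own cache line so refcount traffic doesn’t invalidate neighboring read-mostly fields. (std::hardware_destructive_interference_size is C++17 and isn’t available in every standard library, so 64 is used as a fallback.)

#include <atomic>
#include <cstddef>
#include <new>

#ifdef __cpp_lib_hardware_interference_size
constexpr size_t kCacheLine = std::hardware_destructive_interference_size;
#else
constexpr size_t kCacheLine = 64;  // common cache line size on x86-64 and ARM64
#endif

struct SharedThing {
  // The frequently-written count gets its own cache line...
  alignas(kCacheLine) std::atomic<size_t> refcount{1};
  // ...so writes to it don't ping-pong the line holding read-mostly data.
  alignas(kCacheLine) int read_mostly_config = 0;  // hypothetical payload
};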

There may be value in abstracting the count’s location through a vtable, like COM does.

I’ll Stop Here

Reference counting is a long-studied topic. There are saturating counts, counts that saturate into mark-and-sweep, counts that saturate into (logged) leaks, cycle detection, weighted reference counts, deferred increments, combining updates, and external counts, but you can read about those elsewhere.

I mostly wanted to share some things I’ve recently run into.