RCU is Pretty Cool

2024-11-02

Random

Table of Contents

# “Do you know what is read-copy-update?”

Balázs randomly mumured on a Friday afternoon while I was doom scrolling the endless stream of despair on Memegen some time shortly after the Jan 2023 layoffs. Friday wasn’t one of the team designated RTO days, but I decided to work from office anyway for some focus time, as few people would be coming in.

B: “It’s pretty cool.”

Balázs proceeded to explain to me that read-copy-update, or RCU, is a lock-free mechanism to update data structures that are actively being consumed by asynchronous readers without the usage of locks. It was first introduced into the Linux kernel back in 2002, but discussions and designs date back to the mid 90s. The general idea is that instead of applying a mutex lock to update an asynchronously read data structure, instantiate a copy of the existing data structure and apply changes to the new instance. After the changes are applied, do an atomic update of the pointer from the old instance to the new one. This essentially creates a lock-free update of data change.

Me: “But wouldn’t the updating of the pointer itself also require a lock?”

B: “Not in the Linux kernel. The kernel uses a counter to track all readers. It only updates the pointer when there are no more readers.”

Me: “Okay that’s pretty cool.”

B: “Pretty cool.”

Me: “…”

B: “…”

After heading back to my seat from his window seat overseeing Credit Suisse bankers sipping cocktails on the terrace of their bankruptcy embroiled office, I decided to go read up a bit more about RCU. I think I got the gist of it and will try to explain it myself.

Typically when writing a program using parallelism or concurrency, we wouldn’t want to do something like this:

type FeatureFlags struct {
    configs map[string]bool
}

func(f *FeatureFlags) Update(newConfig map[string]bool) {
    f.configs = map[string]bool{}
    for k, v := range newConfigs {
        f.configs[k] = v
    }
}

func(f *FeatureFlags) Get(config string) (bool, error) {
    if result, ok := f.configs[config]; ok {
        return result, nil
    }
    return false, errors.New("invalid config")
}

The Update method is performing a pointer update that can be accessed asynchronously by multiple goroutines, which would result in race conditions. To prevent such errors, we may choose to use a mutex lock.

type SimpleMutexConfigs struct {
	m sync.RWMutex
	configs map[string]bool
}

func(c *SimpleMutexConfigs) Update(newConfigs map[string]bool) {
	c.m.Lock()
	defer c.m.Unlock()

	c.configs = map[string]bool{}
	for k, v := range newConfigs{
		c.configs[k] = v
	}
}

func(c *SimpleMutexConfigs) Get(config string) (bool, error) {
	c.m.RLock()
	defer c.m.RUnlock()

	if result, ok := c.configs[config]; ok {
		return result, nil
	}
	return false, errors.New("invalid config")
}

RLock() is a reader lock that multiple async goroutines may concurrently hold. Lock() is a write lock that blocks all read and write locks until it is unlocked. This would successfully prevent race conditions, but when you have a program that is high in reads, you create reader pileups leading to latency spikes every time there’s a config update. This would be especially pronounced in a system with high read requests of large configuration data structures.

Going by RCU logic, we could a make a copy before safely updating the pointer to the new config data structure.

type PseudoRCUMutexConfigs struct {
	m sync.RWMutex
	configs map[string]bool
}

func(c *PseudoRCUMutexConfigs) Update(newConfigs map[string]bool) {
	copyConfigs := map[string]bool{}
	for k, v := range newConfigs{
		copyConfigs[k] = v
	}

	c.m.Lock()
	defer c.m.Unlock()
	c.configs = copyConfigs
}

func(c *PseudoRCUMutexConfigs) Get(config string) (bool, error) {
	c.m.RLock()
	defer c.m.RUnlock()
	if result, ok := c.configs[config]; ok {
		return result, nil
	}
	return false, errors.New("invalid config")
}

The above would achieve the basic idea of RCU. It is not entirely accurate since actual RCU would have a queue of updates waiting for a kernel grace period before updating in sequence. Not to mention that the write lock for the pointer updates is still a blocking lock. We could possibly play around with Golang’s atomic library to give a more accurate representation, but I think these examples are close enough.

## Let’s run some benchmarks!

We want to test the performance of reads for:

Simple RW Mutex locking.
Pseudo RCU where we only do a lock to update the pointer.
Performance of small vs large configs.
Performance of reads based on frequency of updates.

It would be pretty clear that by shifting the mutex write lock to safeguard just the pointer update, we would be able to obse…

BenchmarkSimpleMutex/Update_Small_Config_frequent-8  4759899   253.5 ns/op
BenchmarkSimpleMutex/Update_Small_Config_seldom-8    5074174   237.0 ns/op
BenchmarkSimpleMutex/Update_Large_Config_frequent-8  4116626   289.5 ns/op
BenchmarkSimpleMutex/Update_Large_Config_seldom-8    5048846   238.9 ns/op
BenchmarkPseudoRCU/Update_Small_Config_frequent-8    4667325   255.6 ns/op
BenchmarkPseudoRCU/Update_Small_Config_seldom-8      5063089   239.1 ns/op
BenchmarkPseudoRCU/Update_Large_Config_frequent-8    4453814   268.4 ns/op
BenchmarkPseudoRCU/Update_Large_Config_seldom-8      5054824   238.5 ns/op

…👀

The shifting of the write lock to include just the pointer update only significantly matters if the config update is incredibly frequent (every 5 nanoseconds), and if the configuration to be copied is large. Such frequency of changes is rare, not to mention that the traditional kernel RCU wasn’t really built for frequent updates due to the piling up of updates awaiting grace periods. Decreasing the updates to 1 per second has already led to no difference between either methods of implementation.

This does not conform to my preconceived notions, I must juke the data.

type AtomicRCUConfigs struct{
	configs atomic.Value
}

func(c *AtomicRCUConfigs) Update(newConfigs map[string]bool) {
	copyConfigs := map[string]bool{}
	for k, v := range newConfigs{
		copyConfigs[k] = v
	}
	c.configs.Store(copyConfigs)
}

func(c *AtomicRCUConfigs) Get(config string) (bool, error) {
	currentConfigs, ok := c.configs.Load().(map[string]bool)
	if !ok {
		return false, errors.New("invalid config state")
	}

	if result, ok := currentConfigs[config]; ok {
		return result, nil
	}
	return false, errors.New("invalid config")
}

There we go. Now that we have entirely stripped out the eXpEnSiVe mutex locks in Golang by swapping in atomic operations, we would be able to confirm my biases.

BenchmarkSimpleMutex/Update_Small_Config_frequent-8   4757412   253.7 ns/op
BenchmarkSimpleMutex/Update_Small_Config_seldom-8     5080617   238.4 ns/op
BenchmarkSimpleMutex/Update_Large_Config_frequent-8   4214394   287.1 ns/op
BenchmarkSimpleMutex/Update_Large_Config_seldom-8     4991079   241.7 ns/op
BenchmarkPseudoRCU/Update_Small_Config_frequent-8     4639545   257.1 ns/op
BenchmarkPseudoRCU/Update_Small_Config_seldom-8       4944051   242.6 ns/op
BenchmarkPseudoRCU/Update_Large_Config_frequent-8     4428478   272.5 ns/op
BenchmarkPseudoRCU/Update_Large_Config_seldom-8       4927450   243.6 ns/op
BenchmarkAtomicRCU/Update_Small_Config_frequent-8     4776242   251.7 ns/op
BenchmarkAtomicRCU/Update_Small_Config_seldom-8       4967343   241.9 ns/op
BenchmarkAtomicRCU/Update_Large_Config_frequent-8     4559646   264.2 ns/op
BenchmarkAtomicRCU/Update_Large_Config_seldom-8       4966724   240.7 ns/op

…👀

…

It would appear that if the frequency of updates is anywhere grounded in reality, the optimisation gains are entirely inconsequential regardless of the size of the data.

Or is that truly the case here? If we look at how the benchmark code is written:

c := &AtomicRCUConfigs{}
wg := &sync.WaitGroup{}
stopChan := make(chan struct{})
go updater(c, configs, 5*time.Nanosecond, stopChan)
b.Run("Update Small Config frequent", func(b *testing.B) {
	for i := 0; i < b.N; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_, _ = c.Get("feature_0")
		}()
	}
	wg.Wait()
})
close(stopChan)

Golang benchmarking in the stdlib seems to assign some big number to b.N, loops until a certain timeout, then prints out the total number of operations it managed to run along with the average nanoseconds per op. As any experienced backend engineer might have already noticed, average latencies are mostly useless numbers as it smears the histogram into a single number, making it hard to profile the performance of the system.

Let’s try adding some percentile tracking logic to each reader goroutine.

func percentile(values []int64, pct float64) float64 {
	roughIndex := float64(len(values)/100) * pct
	a, b := roughIndex, roughIndex
	if math.Floor(roughIndex) != roughIndex {
		a = math.Floor(roughIndex)
		b = a + 1
	}
	if a >= float64(len(values)) {
		a = float64(len(values)) - 1
		b = a
	}
	return (float64(values[int(a)]) + float64(values[int(b)])) / 2
}

func printPercentiles() {
	for testCase, latencies := range percentiles {
		sort.Slice(latencies, func(i, j int) bool {
        return latencies[i] < latencies[j]
    })
		fmt.Printf("%-*s - \tP50:%-*dP90:%-*dP99:%-*dMax:%-*d\n",
			50, testCase,
			15, int64(percentile(latencies, 50.0)),
			15, int64(percentile(latencies, 90.0)),
			15, int64(percentile(latencies, 99.0)),
			15, int64(percentile(latencies, 100.0)),
		)
	}
}

Then we create a large channel to receive the metrics.

latencyCh := make(chan latencyMetric, 100000000)
percentiles = map[string][]int64{}
go func() {
	for {
		metric, ok := <-latencyCh
		if !ok {
			return
		}
		percentiles[metric.testCase] = append(percentiles[metric.testCase], metric.value)
	}
}()
// .
// .
// .
b.Run("Update Small Config frequent", func(b *testing.B) {
	for i := 0; i < b.N; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			now := time.Now().UnixNano()
			_, _ = c.Get("feature_0")
			latencyCh <- latencyMetric{"Simple Mutex - Update Small Config frequent", time.Now().UnixNano() - now}
		}()
	}
	wg.Wait()
})

Tip: Don’t just create async accessed channels with no buffer like so: make(chan latencyMetric). It would cause a choke point for the goroutines when they are all trying to pump back metric data.

Let’s run the benchmarks again! The results were rearranged and formatted for better legibility.

--Average benchmarks--

BenchmarkSimpleMutex/Update_Small_Config_frequent-8          3715759	       320.1 ns/op
BenchmarkPseudoRCU/Update_Small_Config_frequent-8          	 4084863	       297.8 ns/op
BenchmarkAtomicRCU/Update_Small_Config_frequent-8          	 4307348	       285.7 ns/op

BenchmarkSimpleMutex/Update_Small_Config_seldom-8            4277454	       286.3 ns/op
BenchmarkAtomicRCU/Update_Small_Config_seldom-8            	 4508586	       271.8 ns/op
BenchmarkPseudoRCU/Update_Small_Config_seldom-8            	 4509039	       271.6 ns/op

BenchmarkSimpleMutex/Update_Large_Config_frequent-8          3394640	       370.3 ns/op
BenchmarkPseudoRCU/Update_Large_Config_frequent-8          	 4468725	       300.7 ns/op
BenchmarkAtomicRCU/Update_Large_Config_frequent-8          	 4534879	       279.8 ns/op

BenchmarkSimpleMutex/Update_Large_Config_seldom-8            4260943	       285.0 ns/op
BenchmarkPseudoRCU/Update_Large_Config_seldom-8            	 4504276	       265.9 ns/op
BenchmarkAtomicRCU/Update_Large_Config_seldom-8            	 4470870	       264.5 ns/op

--Distribution benchmarks (nanoseconds)--

Simple Mutex - Update Small Config frequent         	P50:137            P90:32882          P99:289197         Max:2519006
Pseudo RCU - Update Small Config frequent           	P50:133            P90:31640          P99:226916         Max:1513739
Atomic RCU - Update Small Config frequent           	P50:84             P90:134            P99:207            Max:26497

Simple Mutex - Update Small Config seldom           	P50:105            P90:173            P99:281            Max:43536
Pseudo RCU - Update Small Config seldom             	P50:107            P90:174            P99:273            Max:24992
Atomic RCU - Update Small Config seldom             	P50:59             P90:123            P99:205            Max:7940

Simple Mutex - Update Large Config frequent         	P50:350            P90:523505         P99:1243037        Max:5655489
Pseudo RCU - Update Large Config frequent           	P50:103            P90:147            P99:475            Max:1549957
Atomic RCU - Update Large Config frequent           	P50:71             P90:101            P99:182            Max:6320

Simple Mutex - Update Large Config seldom           	P50:110            P90:172            P99:282            Max:282293
Pseudo RCU - Update Large Config seldom             	P50:108            P90:166            P99:269            Max:8000
Atomic RCU - Update Large Config seldom             	P50:69             P90:125            P99:204            Max:9743

Now it is extremely clear that while the benchmark averages seem to show almost no optimisation improvements, the distribution tells a very different story. The worst performer here is aligned with our intuition: Frequent locking updates with large configurations are going to cause “large” spikes on the tail latencies. I put “large” in quotes because it’s still just a 1-5ms increase that may not be significant enough to warrant attention for a web service loading data from MySQL.

The cool thing is that just by excluding the copying from the write locks and focus solely on locking the pointer update, we can already significantly reduce tail latencies for configs with seldom updates. The more surprising part was the performance of the atomic operations. Even with freqeunt large config updates, the true lock-free atomic readers are magnitudes faster than the other methods.

With that said, the “magnitudes” of improvements are still within the sub-microsecond range that no one functioning at the L7 networking layer would care for. But we can see how this would be extremely important for performance at the kernel level.

You’re right Balázs, RCU is pretty cool. 😎