
Alienchow

RCU is Pretty Cool

# “Do you know what is read-copy-update?”

Balázs randomly murmured on a Friday afternoon while I was doomscrolling the endless stream of despair on Memegen some time shortly after the Jan 2023 layoffs. Friday wasn’t one of the team-designated RTO days, but I decided to work from the office anyway for some focus time, as few people would be coming in.

B: “It’s pretty cool.”

Balázs proceeded to explain to me that read-copy-update, or RCU, is a mechanism for updating data structures that are actively being consumed by asynchronous readers without the use of locks. It was first introduced into the Linux kernel back in 2002, but discussions and designs date back to the mid 90s. The general idea is that instead of holding a mutex lock while modifying an asynchronously read data structure in place, you instantiate a copy of the existing data structure and apply the changes to the new instance. After the changes are applied, you atomically update the pointer from the old instance to the new one. This essentially makes the update lock-free: readers see either the old instance or the new one, never a half-applied change.
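To make the copy-then-swap idea concrete, here is a minimal sketch in Python. It is not how the kernel implements RCU; `RcuCell`, `read`, and `update` are hypothetical names made up for illustration. It assumes that rebinding a Python attribute is atomic from a reader's point of view, and it uses a small lock purely to stop two writers from losing each other's updates. Readers never touch that lock.

```python
import threading


class RcuCell:
    """Hypothetical holder for a snapshot that readers load without locking."""

    def __init__(self, initial):
        # The published snapshot. It is never mutated after publication.
        self._current = dict(initial)
        # Serialises writers only (an assumption of this sketch).
        self._writer_lock = threading.Lock()

    def read(self):
        # Read side: a single reference load, no lock taken.
        # Callers must treat the returned dict as read-only.
        return self._current

    def update(self, key, value):
        with self._writer_lock:
            new = dict(self._current)  # copy the existing instance
            new[key] = value           # apply the change to the private copy
            self._current = new        # publish by rebinding one reference


routes = RcuCell({"a": 1})
before = routes.read()   # a reader grabs the current snapshot
routes.update("b", 2)    # a writer publishes a new snapshot
after = routes.read()

# The reader's old snapshot is untouched; new readers see the new one.
assert before == {"a": 1}
assert after == {"a": 1, "b": 2}
```

In the kernel, the old instance can only be freed after a grace period, once every reader that might still hold the old pointer has finished. In this sketch, Python's garbage collector plays that role: the old dict lives for as long as some reader still references it.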

Debugging SRE #2: Pager Burnout


It goes without saying that even the most disciplined SRE functions eventually experience pager burnout. Over time, I’ve found that the causes can be condensed down to four main ones:

  • Lack of pager review
  • Undersized teams
  • Incompatible on-call shift length
  • Inadequate on-call compensation

# Pager Review

## Treat a Page Like a Page

The most mind-blowing thing to me was seeing teams use Slack notifications as pager alerts.

Debugging SRE #1: SOP Opera

# Reliability Theatrics Galore

Recently, I have observed several anti-patterns going on in teams:

  • A release engineer accidentally skipped a step during deployment, causing an incident.
  • No one knew how to roll back a service because no one knew where the SOP was.
  • Action items in the post-mortem reports added more manual checks to the ever-growing checklist.

The recurring theme in all the above is the over-reliance on Standard Operating Procedures (SOPs). SOPs have their place for basic sanity checks and release approvals, but several teams have been using them as a crutch to weasel out of building scalable, long-term solutions.

Debugging SRE

Debugging SRE is a series of low-effort brain dumps covering reliability practices that I have observed and anti-patterns masquerading as reliability diligence.

After resigning from the Google tech island to practise Site Reliability Engineering (SRE) elsewhere, I have come to realise that many organisations fancy the branding and engineering credibility of a tech organisation that has a dedicated SRE team.

Yet, few of the organisations that I’ve observed so far actually embrace the full implementation of an SRE function. Many are just rebranded DevOps or IT Sysadmins. More concerningly, some of these SRE orgs are made up of traditional Ops Engineers who barely know how to code beyond copy-pasting Bash or PowerShell scripts. The premise of the original Google SREs was to have SWEs work on ops from a software development perspective, so as to bridge the divide between Dev and Ops and focus on service stability.