
Alienchow

Tag: SRE

Debugging SRE #2: Pager Burnout


It goes without saying that even the most disciplined SRE functions eventually experience pager burnout. Over time, I’ve found that the causes boil down to 4 main ones:

  • Lack of pager review
  • Undersized teams
  • Incompatible on-call shift length
  • Inadequate on-call compensation

# Pager Review

## Treat a Page Like a Page

The most mind-blowing thing to me was seeing teams use Slack notifications as pager alerts.
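One way to picture the difference between a page and a chat notification is a routing rule that decides, per alert, whether a human has to be interrupted. The sketch below is purely illustrative: the `Alert` fields, the `Channel` names, and the policy in `route` are assumptions of mine, not something taken from any particular alerting tool.

```python
from dataclasses import dataclass
from enum import Enum


class Channel(Enum):
    PAGER = "pager"    # interrupts a human and expects an acknowledgement
    TICKET = "ticket"  # picked up during working hours
    CHAT = "chat"      # informational only, safe to miss


@dataclass(frozen=True)
class Alert:
    name: str
    user_facing: bool  # is a user-visible symptom happening right now?
    actionable: bool   # can the on-caller actually do something about it?


def route(alert: Alert) -> Channel:
    """Pick a delivery channel (hypothetical policy, for illustration only)."""
    if alert.user_facing and alert.actionable:
        return Channel.PAGER
    if alert.actionable:
        return Channel.TICKET
    return Channel.CHAT
```

Under this assumed policy, only alerts that are both urgent and actionable reach the pager, and nothing that needs an immediate response is ever delivered through chat alone.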

Debugging SRE #1: SOP Opera

# Reliability Theatrics Galore

Recently, I have observed several anti-patterns going on in teams:

  • A release engineer accidentally skipped a step during deployment, causing an incident.
  • No one knew how to roll back a service because no one knew where the SOP was.
  • Action items in post-mortem reports added more manual checks to an ever-growing checklist.

The recurring theme in all of the above is over-reliance on Standard Operating Procedures (SOPs). SOPs have their place for basic sanity checks and release approvals, but several teams have been using them as a crutch to weasel out of building scalable, long-term solutions.
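As a rough illustration of what replacing a checklist with software can look like, the sketch below encodes release steps as code, so a step cannot be skipped and rollback does not depend on anyone finding a document. The `run_release` function and the `(name, apply, undo)` step shape are hypothetical names I made up for this example; a real pipeline would have more moving parts.

```python
from typing import Callable, List, Tuple

# Each step is (name, apply, undo). All three are illustrative assumptions.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_release(steps: List[Step]) -> List[str]:
    """Run every step in order; on failure, undo completed steps in reverse."""
    log: List[str] = []
    done: List[Step] = []
    for step in steps:
        name, apply, _ = step
        try:
            apply()
        except Exception as exc:
            log.append(f"failed: {name} ({exc})")
            # Roll back whatever already succeeded, newest first.
            for prev_name, _, undo in reversed(done):
                undo()
                log.append(f"rolled back: {prev_name}")
            return log
        done.append(step)
        log.append(f"ok: {name}")
    return log
```

The design choice here is that the order of steps and the rollback path live in the same place as the release itself, instead of in a separate SOP that someone has to locate and follow by hand.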

Debugging SRE

Debugging SRE is a series of low-effort brain dumps covering reliability practices that I have observed and discussing anti-patterns masquerading as reliability diligence.

After resigning from the Google tech island to practise Site Reliability Engineering (SRE) elsewhere, I have come to realise that many organisations fancy the branding and engineering credibility of a tech organisation that has a dedicated SRE team.

Yet, few of the organisations that I’ve observed so far actually embrace the full implementation of an SRE function. Many are just rebranded DevOps or IT Sysadmin teams. More concerningly, some of these SRE orgs are made up of traditional Ops Engineers who barely know how to code beyond copy-pasting Bash or PowerShell scripts. The premise of the original Google SREs was to have SWEs approach ops work from a software development perspective, bridging the divide between Dev and Ops and focusing on service stability.