Debugging SRE
is a series of low effort brain dumps, consisting of reliability
practices that I have observed, and to discuss anti-patterns masquerading as
reliability diligence.
After resigning from the Google tech island to practise Site Reliability
Engineering (SRE) elsewhere, I have come to realise that many organisations
fancy the branding and engineering credibility of a tech organisation that has
a dedicated SRE team.
Yet, few of the organisations that I’ve observed so far actually embrace the
full implementation of an SRE function. Many are just rebranded DevOps or IT
Sysadmins. More concerningly, some of these SRE orgs are made up of traditional
Ops Engineers who barely know how to code beyond copy pasting Bash or Powershell
scripts. The premise of the original Google SREs was to have SWEs work on ops
using software development perspectives, so as to bridge the divide between Dev
and Ops to focus on service stability.