The field of site reliability engineering originated at Google with Ben Treynor Sloss,[5][6] who founded a site reliability team after joining the company in 2003.[7] In 2016, Google employed more than 1,000 site reliability engineers.[8] The concept subsequently spread into the broader software development industry, and other companies began to employ site reliability engineers.[9] The position is more common at larger web companies, as small companies often do not operate at a scale that would require dedicated SREs.[9] Organizations that have adopted the concept include Airbnb, Dropbox, IBM,[10] LinkedIn,[11] Netflix,[8] and Wikimedia.[12] According to a 2021 report by the DevOps Institute, 22% of organizations in a survey of 2,000 respondents had adopted the SRE model.[13][14]
Site reliability engineering, as a set of principles and practices, can be performed by anyone. Though everyone should contribute to good practices, as occurs in security engineering, a company may eventually hire specialists and engineers for the job.[citation needed]
Site reliability engineering has also been described as a specific implementation of DevOps, although the two differ slightly: SRE focuses specifically on building reliable systems, whereas DevOps has a broader scope.[2][3][4] Despite this difference in focus, some companies have rebranded their operations teams as SRE teams with little meaningful change.[9]
There have been multiple attempts to define a canonical list of site reliability engineering principles. Although consensus is lacking, most definitions include the following characteristics:[1][17]
Automation or elimination of anything repetitive in a cost-effective way.
Avoidance of pursuing far more reliability than is strictly necessary. Defining what is necessary is a practice in itself (see the list of practices below).
Systems designed with a bias toward the reduction of risks to availability, latency, and efficiency.
Observability, that is, the ability to ask arbitrary questions about a system without having to know ahead of time what to ask.[18]
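To illustrate the second principle above, the following minimal sketch converts an availability target into the amount of full downtime it tolerates. The function name `allowed_downtime_minutes` and the 30-day window are assumptions chosen for illustration and do not come from any cited source; the point is only the arithmetic, which shows that each additional "nine" cuts the tolerated downtime by a factor of ten.

```python
def allowed_downtime_minutes(availability_target: float, window_days: int = 30) -> float:
    """Return the minutes of full downtime that a target tolerates over a window.

    availability_target is a fraction such as 0.999 (hypothetical example value).
    window_days defaults to 30, an assumed measurement window.
    """
    if not 0.0 < availability_target <= 1.0:
        raise ValueError("availability_target must be in the interval (0, 1]")
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    window_minutes = window_days * 24 * 60
    # The tolerated unavailability is the complement of the target.
    return (1.0 - availability_target) * window_minutes


if __name__ == "__main__":
    # Compare a few illustrative targets over the same window.
    for target in (0.99, 0.999, 0.9999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target:.2%} availability -> {minutes:.1f} minutes per 30 days")
```

Under these assumptions, a stricter target leaves proportionally less room for maintenance, releases, and incidents, which is one reason the necessary level of reliability is decided deliberately rather than maximized.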
Site reliability engineering practices also vary widely, but the following are commonly implemented, at least in part:
Toil management as the implementation of the first principle outlined above.
Defining and measuring reliability goals: service level indicators (SLIs), service level objectives (SLOs), and error budgets.
Non-Abstract Large Scale Systems Design (NALSD) with a focus on reliability.
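The relationship between an SLI, an SLO, and an error budget in the list above can be sketched with a short calculation. This is a minimal illustration, not a standard implementation: it assumes a request-based SLI (the fraction of successful requests), which is only one possible choice, and the names `ErrorBudgetReport` and `error_budget` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudgetReport:
    sli: float               # measured fraction of successful requests
    allowed_failures: float  # failures the SLO tolerates over the window
    budget_remaining: float  # fraction of the budget left; negative if overspent


def error_budget(total_requests: int, failed_requests: int, slo: float) -> ErrorBudgetReport:
    """Compute a request-based SLI and the remaining error budget for one window."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")

    sli = (total_requests - failed_requests) / total_requests
    # The error budget is the complement of the SLO, expressed in requests.
    allowed_failures = (1.0 - slo) * total_requests
    budget_remaining = 1.0 - failed_requests / allowed_failures
    return ErrorBudgetReport(sli, allowed_failures, budget_remaining)


if __name__ == "__main__":
    # Hypothetical window: one million requests, 400 failures, 99.9% SLO.
    report = error_budget(total_requests=1_000_000, failed_requests=400, slo=0.999)
    print(f"SLI: {report.sli:.4%}")
    print(f"Allowed failures: {report.allowed_failures:.0f}")
    print(f"Budget remaining: {report.budget_remaining:.0%}")
```

In this sketch a negative `budget_remaining` means the window's failures exceeded what the SLO allows; how a team responds to that (for example, by slowing releases) is a policy decision outside the calculation.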
Site reliability engineering teams engage with other teams in their companies, and apply the SRE principles and practices, in various forms. Here is a high-level overview of common SRE team implementations:[19]
Infrastructure SRE teams focus on the reliability of behind-the-scenes systems that help make other teams' jobs more efficient. They are often confused with "Platform" teams or "Platform Operations" teams. Infrastructure SRE teams may pair up with one or more platform engineering teams, but they differ in that Infrastructure SRE teams focus on performing most, if not all, of the work described in the principles and practices listed above. Platform teams tend to focus on building the platform, and while reliability is desirable, it is not their sole priority.
Consulting SRE teams advise on how to implement SRE principles and practices. Their members are usually experienced SREs who have worked on teams in one or several of the implementations above. SREs on external-facing consulting SRE teams are sometimes called "Customer Reliability Engineers".
Large companies that have adopted SRE tend to combine the implementations described above, often with multiple teams of the same kind: for example, several Product/application SRE teams to meet the specific demands of different products, and an Infrastructure SRE team paired with a platform engineering group to meet the reliability goals of a platform common to those products and applications.
The USENIX organization has held an annual SREcon conference since 2014 for site reliability engineers in the industry and also holds regional conferences with similar themes.[20]
Beyer, Betsy; Murphy, Niall; Kawahara, Kent; Rensin, David; Thorne, Stephen (2018). The Site Reliability Workbook: Practical Ways to Implement SRE. O'Reilly. ISBN 978-1492029502.
Welch, Nat (2018). Real-World SRE: The Survival Guide for Responding to a System Outage and Maximizing Uptime. Packt. ISBN 978-1788628884.
Adkins, Heather; Beyer, Betsy; Blankinship, Paul; Lewandowski, Piotr; Oprea, Ana; Stubblefield, Adam (2020). Building Secure and Reliable Systems: Best Practices for Designing, Implementing, and Maintaining Systems. O'Reilly. ISBN 978-1-4920-8312-2. OCLC 1129470292.
Rosenthal, Casey; Jones, Nora (2020). Chaos Engineering: System Resiliency in Practice. O'Reilly. ISBN 978-1492043867.