Site reliability engineering







Site reliability engineering (SRE) is a set of principles and practices that applies aspects of software engineering to IT infrastructure and operations.[1] Its proponents describe it as a way to create highly reliable and scalable software systems. Although closely related to DevOps, SRE differs slightly from it.[2][3][4]

History

The field of site reliability engineering originated at Google with Ben Treynor Sloss,[5][6] who founded a site reliability team after joining the company in 2003.[7] In 2016, Google employed more than 1,000 site reliability engineers.[8] The concept subsequently spread into the broader software development industry, and other companies began to employ site reliability engineers.[9] The position is more common at larger web companies, as small companies often do not operate at a scale that would require dedicated SREs.[9] Organizations that have adopted the concept include Airbnb, Dropbox, IBM,[10] LinkedIn,[11] Netflix,[8] and Wikimedia.[12] According to a 2021 report by the DevOps Institute, 22% of organizations in a survey of 2,000 respondents had adopted the SRE model.[13][14]

Definition

Site reliability engineering, as a job role, may be performed by individual contributors or by teams. Within a broader engineering organization, they are responsible for a combination of the following: system availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.[15] Site reliability engineers often have backgrounds in software engineering, system engineering, or system administration.[16] Focuses of SRE include automation, system design, and improvements to system resilience.[16]

Site reliability engineering, as a set of principles and practices, can be performed by anyone. As in security engineering, everyone is expected to contribute to good practices, but a company may eventually hire dedicated specialists and engineers for the job.[citation needed]

Site reliability engineering has also been described as a specific implementation of DevOps, although the two differ slightly. SRE focuses specifically on building reliable systems, whereas DevOps has a broader focus.[2][3][4] Despite this difference, some companies have rebranded their operations teams as SRE teams with little meaningful change.[9]

Principles and practices

There have been multiple attempts to define a canonical list of site reliability engineering principles. While consensus is lacking, the following characteristics are usually included in most definitions:[1][17]

Site reliability engineering practices also vary widely, but the following are commonly implemented, at least in part:

Implementations

Site reliability engineering teams engage with other teams in their companies, and apply SRE principles and practices, in various forms. The following is a high-level overview of common SRE team implementations:[19]

Kitchen Sink, a.k.a. “Everything SRE”

The scope of services or workflows covered is usually unbounded.

Infrastructure

Infrastructure SRE teams focus on the reliability of behind-the-scenes systems that help make other teams' jobs more efficient. They are often confused with "Platform" teams or "Platform Operations" teams. Infrastructure SRE teams may pair up with one or more platform engineering teams, but they differ in that they perform most, if not all, of the work described in the principles and practices above. Platform teams tend to focus on building the platform, and while reliability is desirable, it is not their sole priority.

Tools

These teams focus on tools to measure, maintain, and improve system reliability, for example Nagios Core or Prometheus.

Product or application

These SRE teams are dedicated to a product or application. Some large companies tend to staff several of them.

Embedded

These are usually solo SRE practitioners or pairs staffed within a software engineering team, who apply most of the principles and practices described above.

Consulting

These teams consult on how to implement SRE principles and practices. They are usually made up of experienced SREs who have worked on teams in one or several of the implementations above. SREs on external-facing consulting teams are sometimes called "Customer Reliability Engineers".

Large companies that have adopted SRE tend to combine the implementations described above, often with multiple teams of the same kind. For example, a company may have several product or application SRE teams to meet the specific demands of its products, alongside an infrastructure SRE team that pairs with a platform engineering group to meet the reliability goals of a platform shared by those products or applications.

Industry

The USENIX organization has held an annual SREcon conference since 2014 for site reliability engineers in the industry and also holds regional conferences with similar themes.[20]

See also

  • Cloud computing
  • Data center
  • Disaster recovery
  • High availability software
  • Infrastructure as code
  • Operations, administration and management
  • Operations management
  • Reliability engineering
  • System administration

References

  1. "Evaluating where your team lies on the SRE spectrum". Google Cloud Blog. Retrieved June 26, 2021.
  2. Beyer, Betsy; Jones, Chris; Petoff, Jennifer; Murphy, Niall, eds. (2016). Site Reliability Engineering: How Google Runs Production Systems. Sebastopol, CA: O'Reilly Media. ISBN 978-1-4919-5118-7. OCLC 945577030.
  3. Vargo, Seth; Fong-Jones, Liz (March 1, 2018). What's the Difference Between DevOps and SRE? (class SRE implements DevOps) (Video). Google.
  4. "What is SRE? - SRE Explained - AWS". Amazon Web Services, Inc. Retrieved November 5, 2022.
  5. Hill, Patrick. "Love DevOps? Wait until you meet SRE". Atlassian. Retrieved June 17, 2021.
  6. "What is SRE?". Red Hat. Retrieved June 17, 2021.
  7. Treynor, Ben (2014). "Keys to SRE". USENIX SREcon14. Retrieved June 17, 2021.
  8. Fischer, Donald (March 2, 2016). "Are site reliability engineers the next data scientists?". TechCrunch. Retrieved June 17, 2021.
  9. Gossett, Stephen (June 1, 2020). "What Is a Site Reliability Engineer? What Does an SRE Do?". Built In. Retrieved June 17, 2021.
  10. "Site Reliability Engineering". IBM Cloud Education. IBM. November 12, 2020. Retrieved June 21, 2021.
  11. "Site Reliability Engineering (SRE)". engineering.linkedin.com. Retrieved March 12, 2024.
  12. "SRE - Wikitech". wikitech.wikimedia.org. Retrieved October 17, 2021.
  13. Oehrlich, Eveline; Groll, Jayne; Garbani, Jean-Pierre (2021). Upskilling 2021 Enterprise DevOps Skills Report (PDF) (Report). DevOps Institute. Retrieved June 17, 2021.
  14. Oehrlich, Eveline (May 4, 2021). "What it takes to be a site reliability engineer". TechBeacon. Micro Focus. Retrieved June 17, 2021.
  15. Treynor, Ben. "In Conversation" (Interview). Interviewed by Niall Murphy. Google Site Reliability Engineering.
  16. Jones, Chris; Underwood, Todd; Nukala, Shylaja (June 2015). "Hiring Site Reliability Engineers" (PDF). ;login:. Vol. 40, no. 3. pp. 35–39. Retrieved June 17, 2021.
  17. "The 7 SRE Principles [And How to Put Them Into Practice]". www.blameless.com. Retrieved June 26, 2021.
  18. "Learn about observability | Honeycomb". docs.honeycomb.io. Retrieved June 26, 2021.
  19. "SRE at Google: How to structure your SRE team". Google Cloud Blog. Retrieved June 26, 2021.
  20. "Usenix SREcon". USENIX. 2021. Retrieved June 17, 2021.


    Retrieved from "https://en.wikipedia.org/w/index.php?title=Site_reliability_engineering&oldid=1219096559"

     



    This page was last edited on 15 April 2024, at 18:37 (UTC).

    Text is available under the Creative Commons Attribution-ShareAlike License 4.0; additional terms may apply.