A curated list of Site Reliability and Production Engineering resources.
Updated Sep 25, 2021
Chaos engineering is the discipline of experimenting on a software system in production in order to build confidence in the system's capability to withstand turbulent and unexpected conditions. It is a disciplined approach to identifying failures before they become outages.
A curated collection of publicly available resources on how technology and tech-savvy organizations around the world practice Site Reliability Engineering (SRE)
A curated list of Chaos Engineering resources.
Is your feature request related to a problem? Please describe:
The kind version is too old and does not work well on kernel 5.12.2+ because of kubernetes-sigs/kind#2240.
Describe the feature you'd like:
Currently, the server can't connect to MongoDB in TLS mode.
Solution:
It seems to me that UTC is used for the on-the-wire representation of time as well as in the database (jaegertracing/jaeger#712), which sort of makes sense, at least with a somewhat naive handling of timezones. However, I think that the Jaeger UI should support displaying times in the timezone local to the user, i.e. that of the browser, so as to reduce the mental load when viewing
What to Read to Learn More About DevOps
A curated list of Site Reliability and Production Engineering Tools
Knowledge seeks no man
This repository includes resources which are more than sufficient to prepare for a Google interview if you are applying for a software engineer position or a site reliability engineer position.
Curated list of good SRE interview questions.
Google Site Reliability Engineering book converted to audio
A party card game for engineers caring about reliability. Based on Cards Against Humanity.
A curated list of awesome Site Reliability and Production Engineering resources.
The Skinny Distributed Lock Service
Although it's not a high priority, we could get a fancier, more modern wheel.
Calculate how much downtime should be permitted in your Service Level Agreement or Objective
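The arithmetic behind such a calculator can be sketched as follows. This is a minimal illustration, not the listed tool's implementation: the function name `allowed_downtime` and its parameters are hypothetical, and it assumes the availability target is a percentage applied uniformly over a fixed window.

```python
from datetime import timedelta


def allowed_downtime(slo_percent: float, window: timedelta) -> timedelta:
    """Return the downtime budget implied by an availability target.

    slo_percent: availability target as a percentage, e.g. 99.9 (assumed input form).
    window: the period the target applies to, e.g. 30 days.
    """
    if not 0 <= slo_percent <= 100:
        raise ValueError("slo_percent must be between 0 and 100")
    # The permitted unavailability is the complement of the target.
    return window * (1 - slo_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        budget = allowed_downtime(target, timedelta(days=30))
        print(f"{target}% over 30 days allows {budget} of downtime")
```

For a 99.9% target over a 30-day window, this gives a budget of roughly 43 minutes.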
A collection of SRE tools
My opinionated list of products and tools used for high-scalability projects
A collection of templates ported from the SRE Workbook
A list of common Disaster Recovery (DR) scenarios for software companies
This repository helps performance testers and engineers who want to dive into the DevOps and SRE world.
This repo contains all the SRE (Site Reliability Engineering) principles and guidelines for managing Operate First services
A combined introduction to operating systems and computer networks
Endpoint monitoring and DNS failover agent written in Go
Issue Description
Question
Describe what happened (or what feature you want)
I'm trying to evaluate ChaosBlade as an option for resiliency testing, but I'm not sure whether this is a feature request or a question. Actually, I have two questions: