Top 5 Resources to learn SRE – Site Reliability Engineering

SRE is a new concept for much of the industry, but site reliability engineering (SRE) has been an essential part of the engineering team at Google since 2003. It started making noise in 2016, when the Google engineering team announced that it employed more than 1,000 site reliability engineers. After that, it spread through the industry, and big tech companies such as LinkedIn, GitHub, Reddit, Dropbox, Airbnb, IBM, Netflix, and Facebook started adopting the concepts in their teams.

If some survey reports are to be believed, 28% of organizations in a survey of 4,000 respondents have adopted SRE concepts or employ SRE engineers in their teams.

So, why is the SRE role making so much noise in the industry?

In this age of the internet, almost every company and industry depends on the internet and on applications. Even if you’re a restaurant operating locally, you’re using technology for call routing, menu updates, offerings, reservation bookings, events, and more. If your application goes down for some time, you will potentially lose some business, and at the same time your nearby competitors will win over your prospective customers. Not only are you losing business, but you are likely also paying someone to maintain the website and look into the issue, costing you even more money. This is a very small example, but it shows the value of constant uptime and the opportunity cost involved in incident management.

These days every business is a technology-based business with many competitors. Customers’ expectations are sky-high, every service is just a click away, and customers are willing to churn if their needs are not promptly met.

According to Gartner, the average cost of downtime across the industry is approximately $300,000 per hour. For some companies, this number is considerably higher: Amazon, for example, lost approximately $90 million during its 2018 Prime Day outage, and that outage lasted only 75 minutes. So you can imagine how necessary it is for organizations to prioritize reliability, so they can innovate as quickly as possible on top of a strong foundation that won’t compromise the customer experience.
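To put those two figures side by side, here is a small Python sketch of the arithmetic. The function and variable names are our own, chosen only for illustration, and the inputs are simply the numbers quoted above; it assumes cost accrues evenly over the outage, which real incidents rarely do.

```python
def downtime_cost(cost_per_hour: float, minutes: float) -> float:
    """Estimated cost of an outage, assuming cost accrues evenly over time."""
    return cost_per_hour * minutes / 60.0


def implied_hourly_cost(total_loss: float, minutes: float) -> float:
    """Hourly cost implied by a known total loss over a known duration."""
    return total_loss / minutes * 60.0


# Figures quoted in the article (not independently verified here).
AVERAGE_COST_PER_HOUR = 300_000   # industry average, USD per hour
PRIME_DAY_LOSS = 90_000_000       # approximate loss, USD
PRIME_DAY_MINUTES = 75            # outage duration, minutes

# What a 75-minute outage would cost at the industry-average rate.
average_75 = downtime_cost(AVERAGE_COST_PER_HOUR, PRIME_DAY_MINUTES)

# What the Prime Day loss works out to per hour.
amazon_hourly = implied_hourly_cost(PRIME_DAY_LOSS, PRIME_DAY_MINUTES)

print(f"75 minutes at the average rate: ${average_75:,.0f}")
print(f"Implied Prime Day hourly cost:  ${amazon_hourly:,.0f}")
print(f"Multiple of the average rate:   {amazon_hourly / AVERAGE_COST_PER_HOUR:.0f}x")
```

By this rough estimate, the Prime Day outage works out to about $72 million per hour, roughly 240 times the quoted industry average, while the same 75 minutes at the average rate would cost about $375,000.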

The SRE concept is based on a customer-first approach. Each decision should protect or improve customer satisfaction. It is a mindset and a set of practices, metrics, and prescriptive methods aimed at reliability and resilience, in light of exponentially growing customer expectations and system complexity.

These are the skill sets SRE aspirants can work on to become good at it:-

  • Knowing How to Code:- Python/Java
  • Understanding Operating Systems:- CentOS/Ubuntu & VirtualBox & Vagrant
  • Implementing a CI/CD Pipeline:- Jenkins
  • Using Monitoring Tools
    • Infrastructure Monitoring – Datadog, Prometheus with Grafana
    • Log Monitoring Tool – Splunk & ELK stack
    • Performance & RUM Monitoring – New Relic
  • Understanding of Cloud Native Applications:- AWS, Docker, Kubernetes & Helm
  • Using Version Control Tools:- Git using GitHub
  • Gain a Deep Understanding of Databases
  • Master Distributed Computing
  • Communication & Collaboration:- Jira, Confluence, PagerDuty & Slack
  • Webserver:- Apache HTTP & Nginx
  • Configuration & Deployment Management:- Ansible
  • Infrastructure Coding:- Terraform
  • Service Mesh Data Planes & Control Planes:- Envoy & Istio
  • Network configurations and Service Discovery:- Consul
  • Securing credentials:- HashiCorp Vault & SSL & Certificates

Everyone talks about SRE and its benefits. SRE adoption is growing quickly (30-70% year-over-year growth in job openings), but there are not enough skilled professionals in the market to meet the demand, because not everyone knows the best places to start or how to implement SRE in an organization. Here are our top 5 resources to learn SRE, or Site Reliability Engineering, for getting started:-

SRE Ebook :- You can start with the first book written by Google engineers, which you can read online for free. The book explains how Google started, implemented, and handled the SRE concept in its organization. You can learn a lot from this book, but that doesn’t necessarily mean you should copy Google’s exact methods. The main consideration should be your organization’s needs.

Video Tutorial :- YouTube is one of the best places to find useful videos on SRE. I have linked one channel’s playlist where you can find all the videos on SRE concepts and topics in sequence.

Site Reliability Engineering Certified Professional (SRECP) :- Learning by yourself can be a daunting task. You may get stuck on some concepts and need help to understand them, and self-learning is time-consuming too. If you are ready to invest a few bucks to save time and learn faster, you may enroll in the DevOpsSchool SRE certification program.

GitHub sre-learning-resources + GitHub awesome-sre :- Follow these GitHub repositories; here you can find many materials, well curated and all in one place.

SRE Learning Kit :- This SRE learning kit is a really useful resource if you have no idea how to start learning SRE. All SRE topics are linked in sequence, so you can find tutorials, videos, quizzes/dumps, interview Q&A, notes, slides, and various other materials in one place.

These resources are intended to be a point of knowledge for everyone who wants to learn SRE concepts and the implementation process. However, if you have other resources, please drop them in the comment section. We will work on enhancing the list.
