
Tharun Shiv


#1 What's Site Reliability Engineering [SRE] | Roles & Responsibilities | Technologies involved


Site Reliability Engineering, commonly abbreviated as SRE, is a role in Computer Science Engineering whose main purpose is to provision, maintain, monitor, and manage infrastructure so that applications achieve maximum uptime and reliability. SRE is an emerging role, but the tasks it covers have existed ever since the first application was developed. The scope of software developers ends once they have written the code for the application. Everything from there on is the duty of an SRE: setting up the infrastructure and the various services that run on it, providing the required network connectivity, providing a platform for the application to run on, and making sure every part of the application is up and running reliably 24x7. In fact, we can consider Site Reliability Engineers a strong bridge between the users and a reliable application.

To explain the different responsibilities of an SRE, I have divided them into four categories. I have always seen SRE this way, and definitely not as some ad-hoc process. The four categories into which I classify the tasks of a Site Reliability Engineer are:

  1. Create
  2. Monitor
  3. Manage
  4. Destroy

Let's dive deep into each one of them, starting with Create.


1. Provision virtual machines / PXE bare-metal servers

SREs are responsible for provisioning virtual machines with the requested resources: CPU, memory, disks, network configuration, and operating system. When a bare-metal server needs to be set up, it is provisioned with the provided configuration as well. SREs use Linux commands and automation scripts to provision servers as quickly as possible. They are also responsible for being rack aware during provisioning, that is, for taking into account which physical rack each machine sits in. Example operating systems include Ubuntu, CentOS, and Windows.
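As a rough illustration of rack awareness, here is a minimal Python sketch that places a new machine on the rack currently holding the fewest hosts. The inventory format, the host names, the rack names, and the `pick_rack` function are all hypothetical, made up for this example. A real setup would read them from whatever inventory system is in use.

```python
from collections import Counter


def pick_rack(existing_hosts, racks):
    """Return the rack that currently holds the fewest hosts.

    existing_hosts: dict mapping hostname -> rack name (hypothetical inventory).
    racks: list of all rack names available for placement.
    Ties are broken by the order of `racks`.
    """
    load = Counter(existing_hosts.values())
    # Counter returns 0 for racks that hold no hosts yet.
    return min(racks, key=lambda rack: load[rack])


if __name__ == "__main__":
    inventory = {"db-1": "rack-a", "db-2": "rack-a", "db-3": "rack-b"}
    print(pick_rack(inventory, ["rack-a", "rack-b", "rack-c"]))  # rack-c
```

Spreading machines this way means a single rack failure (power, top-of-rack switch) takes down fewer replicas of the same service.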

2. Set up services

Once the machines are provisioned, the SRE also takes care of setting up the services on them, along with the disk setup. These services can be networking services, proxies or load balancers, container or orchestration services, message queues, databases, caching systems, big data services, and more. In this way, SREs are exposed to a variety of technologies and play an important role in the components that make up an application. Example technologies include NGINX, Apache, RabbitMQ, Kafka, Hadoop, Traefik, MySQL, PostgreSQL, Aerospike, MongoDB, Redis, MinIO, Kubernetes, Apache Mesos, Marathon, MariaDB, and Galera.

3. Optimize the infrastructure

Since several components and services are used in the infrastructure, there is scope for improvement in terms of performance, efficiency, and security. The SRE optimizes the components by keeping them up to date, choosing the right service for the right job, and patching the servers.

4. Write monitoring scripts

When SREs maintain an infrastructure of any size, they never underestimate any component of it, and they write monitoring scripts that track every component and its metrics. This makes it possible to get real-time alerts when any component malfunctions, and it also gives a better view of the infrastructure as a whole. SREs use programming languages like Bash, Python, Golang, and Perl, and tools like daemon processes, Riemann, InfluxDB, OpenTSDB, Kafka, Grafana, Prometheus, and APIs to monitor the infrastructure.
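To make this concrete, here is a minimal sketch of a monitoring script in Python, using only the standard library. It measures disk usage, checks whether a TCP port accepts connections, and turns metrics into alert messages. The metric name, the 90 percent threshold, and the `127.0.0.1:5432` service are assumptions for illustration. A real script would push these values to one of the tools above instead of printing them.

```python
import shutil
import socket


def disk_usage_percent(path="/"):
    """Return the percentage of disk space used on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def find_alerts(metrics, thresholds):
    """Return one alert message per metric whose value exceeds its threshold."""
    alerts = []
    for name, value in metrics.items():
        limit = thresholds.get(name)
        if limit is not None and value > limit:
            alerts.append(f"ALERT: {name} is {value:.1f}, above threshold {limit}")
    return alerts


if __name__ == "__main__":
    metrics = {"disk_used_percent": disk_usage_percent("/")}
    for alert in find_alerts(metrics, {"disk_used_percent": 90}):
        print(alert)
    # Hypothetical service check: a database expected on localhost:5432.
    if not port_is_open("127.0.0.1", 5432, timeout=1.0):
        print("ALERT: nothing is listening on 127.0.0.1:5432")
```

Run from cron or a small daemon loop, a script like this gives the kind of real-time signal described above.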

5. Write automation scripts

If a task takes more than 10 steps and is likely to be performed more than once, SREs never hesitate to automate it. This saves time and also prevents human error. SREs use programming languages like Bash, Python, Golang, and Perl, and tools like Ansible, to automate tasks.
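Here is a minimal Python sketch of what such automation can look like: a small runner that executes a list of steps in order and stops at the first failure, so that a half-finished procedure is never mistaken for a successful one. The `run_steps` function and the step names are hypothetical, and the steps in the demo only invoke the Python interpreter so that the example runs anywhere. Real steps would be the actual provisioning commands.

```python
import subprocess
import sys


def run_steps(steps):
    """Run each (name, argv) step in order and stop at the first failure.

    Returns the list of step names that completed successfully.
    """
    completed = []
    for name, argv in steps:
        result = subprocess.run(argv, capture_output=True, text=True)
        if result.returncode != 0:
            print(f"step {name!r} failed with exit code {result.returncode}")
            break
        completed.append(name)
    return completed


if __name__ == "__main__":
    demo_steps = [
        ("first", [sys.executable, "-c", "print('step one')"]),
        ("second", [sys.executable, "-c", "print('step two')"]),
    ]
    print(run_steps(demo_steps))  # ['first', 'second']
```

Passing each command as an argument list rather than a single shell string avoids quoting mistakes, which is one of the human errors automation is meant to prevent.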

6. Manage users on the machines

One of the main security precautions that SREs take is to restrict user access to the components of the infrastructure. They use various technologies such as VPN (Virtual Private Network), firewalls, configuration files, user management on machines, LDAP, sudoers configuration, PAM, OTP, two-factor authentication, SSH keys, and more to prevent unauthorized access to any component of the infrastructure.
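As one small example of user management on machines, the following Python sketch audits text in the `/etc/passwd` format and reports accounts that have a login shell but are not on an allow list. The allow list, the sample accounts, and the set of shells treated as "no login" are assumptions for illustration, and the function name is hypothetical.

```python
# Shells that are commonly used to disable interactive login (assumed list).
NOLOGIN_SHELLS = {"/usr/sbin/nologin", "/sbin/nologin", "/bin/false"}


def unexpected_login_users(passwd_text, allowed):
    """Return users that can log in but are not in the `allowed` set.

    passwd_text uses the /etc/passwd layout:
    name:password:UID:GID:GECOS:home:shell
    """
    found = []
    for line in passwd_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) != 7:
            continue  # skip malformed entries
        name, shell = fields[0], fields[6]
        if shell not in NOLOGIN_SHELLS and name not in allowed:
            found.append(name)
    return found


if __name__ == "__main__":
    sample = """\
root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
alice:x:1000:1000:Alice:/home/alice:/bin/bash
mallory:x:1001:1001::/home/mallory:/bin/sh
"""
    print(unexpected_login_users(sample, {"root", "alice"}))  # ['mallory']
```

On a real machine the same function could be fed the contents of `/etc/passwd`, with the allow list coming from LDAP or a configuration file.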

These are the Create aspects of a Site Reliability Engineer's work. In the next article, we will look at the Monitor aspect.



Thank you for reading. This is Tharun Shiv, a.k.a. Developer Tharun.
