Site reliability engineering (SRE) is a software engineering (developer) approach to IT operations (ops). SRE teams manage systems, handle scale, firefight incidents/problems and automate some operational tasks.
The term SRE was coined at Google, when engineers there realized that the duties and responsibilities the work required had diverged significantly from traditional IT operations. One of the key differences is the use of code to solve problems within cloud-native systems and infrastructure.
Any system that requires high availability or scalability benefits from treating SRE as a dedicated practice.
SRE can also stand for site reliability engineer: a person who practices site reliability engineering. SREs perform many tasks and focus on the production cloud environment. Some of the common tasks an SRE performs are:
- Scaling the system
- Optimizing cloud spend
- Remediating incidents (when things break)
- Patching and upgrades
SREs often write custom code (software) to link systems together, and create workflows (often called runbooks) to automate some or all of a cloud system's operational needs.
At a high level, an SRE is responsible for ensuring that systems run 24/7 and can handle scale as needed. Achieving this requires a lot of tools and expertise, not to mention often having to “carry the pager” and handle incidents at any time of the day or night.
Historically, SREs came from the software development or sysadmin worlds and became a bit of a hybrid of the two. SREs are responsible for several areas:
Deployment: How code is deployed into the production environment.
Monitoring: Using systems to verify that everything is operating properly.
Alerting: Using tools to alert the appropriate people when systems aren’t functioning properly (or are at risk of not functioning properly).
Configuration: Configuring systems appropriately for optimal performance or cost reduction.
Performance: Keeping system latency within acceptable limits.
Change management: Keeping track of changes in systems, both as a historical record and, in many cases, to comply with industry standards and certifications.
Emergency response: Quickly reacting to and mitigating cloud incidents as they happen.
Optimization: Optimizing systems, often with automation, to reduce MTTR (mean time to recovery, repair, or resolution), so that when things break they are fixed as quickly as possible.
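To make MTTR concrete, here is a minimal Python sketch that averages the time between when each incident started and when it was resolved. The function name and the (started, resolved) pair format are assumptions made for illustration; real incident data would come from whatever incident management tool a team uses.

```python
from datetime import datetime, timedelta
from typing import Iterable, Tuple


def mean_time_to_recovery(
    incidents: Iterable[Tuple[datetime, datetime]],
) -> timedelta:
    """Average the (resolved - started) duration across incidents.

    Each incident is a hypothetical (started, resolved) pair of datetimes.
    """
    durations = [resolved - started for started, resolved in incidents]
    if not durations:
        raise ValueError("at least one incident is required")
    # Summing timedeltas needs an explicit timedelta() start value.
    return sum(durations, timedelta()) / len(durations)
```

For instance, one incident resolved in 30 minutes and another resolved in 90 minutes give an MTTR of one hour.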
One of the primary outputs of an SRE is the runbook, or workflow. Many situations happen repeatedly, so it makes sense to create a repeatable process to handle them. Tying steps together in an automated way is how SREs optimize their processes. Common workflows deal with things like cost optimization or incident remediation. For example, an SRE might create a workflow that runs daily for cost optimization by scaling down underused instances. A simplified workflow for this could have the following steps:
- Check instance utilization
- If usage has remained under 50% for the last 24 hours, reduce the instance size
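The decision at the heart of this workflow can be sketched in a few lines of Python. This is only an illustration: the size ladder, the function names, and the idea of feeding in one CPU reading per hour are all assumptions, and a real workflow would pull utilization from a monitoring service and resize through the cloud provider's API.

```python
from typing import Optional, Sequence

# Hypothetical size ladder, smallest to largest; real instance families differ.
SIZE_LADDER = ["small", "medium", "large", "xlarge"]


def next_smaller_size(current: str) -> Optional[str]:
    """Return the size one step down the ladder, or None if already smallest."""
    idx = SIZE_LADDER.index(current)
    return SIZE_LADDER[idx - 1] if idx > 0 else None


def recommend_size(
    current: str,
    hourly_cpu_percent: Sequence[float],
    threshold: float = 50.0,
    window_hours: int = 24,
) -> str:
    """Return the size the instance should have after this daily run."""
    recent = hourly_cpu_percent[-window_hours:]
    if len(recent) < window_hours:
        return current  # not enough data to decide safely
    # "Remained under the threshold" means every sample is below it.
    if max(recent) < threshold:
        return next_smaller_size(current) or current
    return current
```

With 24 hourly readings of 30%, `recommend_size("large", ...)` returns `"medium"`; a single reading of 75% in the window leaves the instance at `"large"`.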
Similarly, for incident remediation, an SRE might create a workflow for replacing a bad EC2 instance:
- Alert from AWS Health
- Spin up new instance
- Reroute traffic
- Kill old instance
These examples are highly simplified. Real workflows typically have more steps and conditional branches, and may include what is called a “human in the loop”: a defined pause point in the workflow that lets a human verify the situation and authorize the appropriate actions.
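As a rough sketch, the instance replacement steps with a human-in-the-loop pause before the destructive step could look like this in Python. Every name here is hypothetical: the alert is a plain dictionary with an `instance_id` key rather than a real AWS Health event, and the launch, reroute, terminate, and approve steps are passed in as functions so that no particular cloud API is assumed.

```python
from typing import Callable, Dict, List


def replace_instance(
    alert: Dict[str, str],
    launch: Callable[[], str],
    reroute: Callable[[str, str], None],
    terminate: Callable[[str], None],
    approve: Callable[[str], bool],
) -> List[str]:
    """Run the replacement steps and return a log of what happened."""
    old_id = alert["instance_id"]  # hypothetical alert shape
    log = [f"alert received for {old_id}"]

    new_id = launch()
    log.append(f"launched {new_id}")

    reroute(old_id, new_id)
    log.append(f"traffic rerouted from {old_id} to {new_id}")

    # Human in the loop: wait for a person to authorize the destructive step.
    if approve(f"Terminate {old_id}?"):
        terminate(old_id)
        log.append(f"terminated {old_id}")
    else:
        log.append(f"termination of {old_id} skipped: not approved")
    return log


if __name__ == "__main__":
    terminated: List[str] = []
    steps = replace_instance(
        alert={"instance_id": "i-old"},
        launch=lambda: "i-new",
        reroute=lambda old, new: None,
        terminate=terminated.append,
        approve=lambda question: True,  # stand-in for a real approval prompt
    )
    print("\n".join(steps))
```

Passing the steps in as functions keeps the control flow visible and easy to test. In practice the `approve` step might post a message to a chat channel and wait for someone to respond.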
SREs look for repeatable processes and then try to automate as much of them as they can, both to simplify their jobs and to keep availability as high as possible. No SRE team expects systems to have 100% uptime, but they plan for incidents and create processes to address them quickly.
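To see why 100% uptime is not the goal, it helps to work out how much downtime a given availability target actually allows. The short Python sketch below does that arithmetic; the function name and the 30-day default period are assumptions chosen for illustration.

```python
def allowed_downtime_minutes(
    availability_percent: float, period_days: int = 30
) -> float:
    """Minutes of downtime a period can absorb while still meeting the target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)
```

A 99.9% target over 30 days leaves roughly 43.2 minutes for incidents, while 99.99% leaves only about 4.3 minutes, which is why fast, automated remediation matters.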
SREs use many categories of tools to maintain cloud operations effectively, including monitoring, logging, alerting, incident management, orchestration, and workflow automation and execution.
Fylamynt has created the world’s first enterprise-ready, low-code platform for building, running, and analyzing SRE cloud workflows.