DEV Community 👩‍💻👨‍💻

Site Reliability Engineering

Posts

Talking a little bit about Ansible's loops

Reactions 6 Comments
4 min read
Litmus 2.0 - Simplifying Chaos Engineering for Enterprises

Reactions 19 Comments
3 min read
Migrating Applications from VMs to K8s

Reactions 9 Comments
3 min read
How to continue a Jenkins build's execution when a stage fails

Reactions 6 Comments
4 min read
A different approach working with Ansible variables

Reactions 5 Comments
2 min read
Having On-call Nightmares? Runbooks can Help you Wake Up.

Reactions 7 Comments
5 min read
How to track your product's SLO/ErrorBudget: A simple tool to keep track of things!

Reactions 7 Comments
3 min read
Episode 3: To Boldly Debug

Reactions 3 Comments
1 min read
SRE2AUX: How Flight Controllers were the first SREs

Reactions 2 Comments
20 min read
So you Want an SRE Tool. Do you Build, Buy, or Open Source?

Reactions 3 Comments
6 min read
Kubernetes Health Checks - 2 Ways to Improve Stability in Your Production Applications

Reactions 9 Comments
10 min read
Understanding the ABCs of CD

Reactions 3 Comments
3 min read
Infracost diff - "git diff" but for cloud costs

Reactions 7 Comments
2 min read
How to: Pingdom super powered status page

Reactions 2 Comments
3 min read
Performance Engineering - The Reliability Edition

Reactions 3 Comments
5 min read
It's all Chaos! And it Makes for Resilience at Scale

Reactions 4 Comments
4 min read
How to Build an SRE Team with a Growth Mindset

Reactions 4 Comments
6 min read
How We Built and Use Runbook Documentation at Blameless

Reactions 15 Comments 2
5 min read
SigNoz : Open-source alternative to DataDog

Reactions 24 Comments 2
3 min read
Lessons from Slack, GCP and Snowflake outages

Reactions 4 Comments
3 min read
Deep Dive into Docker Internals - Union Filesystem

Reactions 28 Comments
10 min read
My DevOps learning path

Reactions 3 Comments
5 min read
Introduce Chaos Platform 2.0 for Azure

Reactions 7 Comments
2 min read
What Is Nix and Why You Should Use It

Reactions 8 Comments
7 min read
How do you wrap your head around observability?

Reactions 49 Comments 13
1 min read
Top Reliability and Scaling Practices from Experts at Citrix, Greenlight Financial, and Incognia

Reactions 2 Comments
14 min read
Reliability as an Inseparable Part of Software Engineering

Reactions 3 Comments
5 min read
Getting Started as an SRE? Here are 3 Things You Need to Know.

Reactions 5 Comments
5 min read
How They SRE

Reactions 7 Comments 1
1 min read
The Key Differences between SLI, SLO, and SLA in SRE

Reactions 15 Comments
9 min read
How to Backup your Applications Data to S3 with Walrus

Reactions 6 Comments
2 min read
What is the right AWS Kubernetes distribution for you?

Reactions 4 Comments
5 min read
Resilience Engineering – Don't Be Afraid to Show Your Vulnerable Side!

Reactions 4 Comments
4 min read
The True Cost of Building your Own Incident Management System (IMS)

Reactions 2 Comments
5 min read
Communication Tool Down? Here are 3 Ways to Handle it

Reactions 3 Comments
5 min read
GCP DevOps Certification - Pomodoro Ten

Reactions 4 Comments
3 min read
Quick Survey: IT on-call experience in an "Always-On" world

Reactions 5 Comments 2
1 min read
Azure Front Door: An Overview

Reactions 6 Comments
3 min read
Managing health checks at scale

Reactions 6 Comments
5 min read
"I'm Just Doing my Job," An SRE Myth

Reactions 3 Comments
5 min read
Running the AWS CLI across multiple accounts the easy way

Reactions 6 Comments
3 min read
What is a microservice catalog?

Reactions 2 Comments 1
5 min read
Top Observability tools for DevOps Engineers and SREs

Reactions 16 Comments
7 min read
Kubernetes gone bust. Now what?

Reactions 6 Comments
4 min read
From SysAdmin to SRE: How to evolve your skillset

Reactions 2 Comments
6 min read
How Kyverno helps with policy management

Reactions 2 Comments
3 min read
Argo CD

Reactions 6 Comments
2 min read
The Engineer's Guide to Preparing for Black Friday 2020

Reactions 2 Comments
8 min read
Choosing SLOs that users need, not the ones you want to provide

Reactions 6 Comments
6 min read
Blameless Book Club: Implementing Service Level Objectives, Part 1

Reactions 6 Comments
7 min read
Debugging incidents in Google's Distributed Systems

Reactions 1 Comments
2 min read
Resilience Engineering and Life

Reactions 4 Comments
4 min read
Testing ML incident detection using a cloud native microservices app

Reactions 11 Comments
10 min read
Operational Readiness Review Template

Reactions 6 Comments
7 min read
What is GitOps?

Reactions 2 Comments
3 min read
Google Down worldwide | Why is Google Down? Let's break it down

Reactions 15 Comments
4 min read
SREview Issue #7 November 2020

Reactions 4 Comments
2 min read
Making Instrumentation Extensible

Reactions 5 Comments
7 min read
SREview Issue #8 December 2020

Reactions 4 Comments
2 min read
Challenges with Implementing SLOs

Reactions 3 Comments
11 min read