Here's your Complete Definition of Software Reliability

Reactions 3
5 min read

Availability, Maintainability, Reliability: What's the Difference?

Reactions 4
4 min read

How to Become a Master at Incident Command

Reactions 5
12 min read

How to Build Your SRE Team

Reactions 5
7 min read

If you’re not using SSH certificates you’re doing SSH wrong | Episode 2: Certificates improve usability, operability, & security

Reactions 104 Comments 3
6 min read

If you’re not using SSH certificates you’re doing SSH wrong | Episode 1: Keys versus Certificates

Reactions 35
5 min read

If you’re not using SSH certificates you’re doing SSH wrong | Episode 3: An ideal SSH flow

Reactions 30 Comments 1
5 min read

What is a Kubernetes Operator and why it matters for SRE

Reactions 15 Comments 1
5 min read

Here are the Metrics you Need to Understand Operational Health

Reactions 5
7 min read

Why SREs Should be Responsible for Development Environments

Reactions 35 Comments 13
5 min read

Introduction to LitmusChaos

Reactions 24
11 min read

Monitoring Production Methodologically (Talk with the transcript)

Reactions 5
20 min read

5 DevOps Books to Read for FREE

Reactions 183 Comments 7
2 min read

Monitoring with Prometheus and Grafana

Reactions 7
10 min read

4 YouTube Resources to Get Started with Kubernetes

Reactions 58
2 min read

How to Classify Incidents

Reactions 7
6 min read

Building a Multi-Tenant gRPC Development Platform with Ambassador and AWS EKS

Reactions 6
9 min read

Incident Postmortem Template

Reactions 9
6 min read

Chaos Workflows with Argo and LitmusChaos

Reactions 25
8 min read

Introduction to IAM - Day #1 of Walking with an SRE

Reactions 5
6 min read

Chaos Engineering for cloud-native systems

Reactions 24
4 min read

Best Practices in Incident Management

Reactions 7
4 min read

Talking About SRE - Part 01 - A Brief Introduction

Reactions 8
7 min read

Optimizing your alerts to reduce Alert Noise

Reactions 6
8 min read

Retrying groups of tightly coupled tasks in Ansible

Reactions 4 Comments 1
3 min read

Cleaning up Zookeeper Logs and Snapshots

Reactions 5
1 min read

#discuss How does deployment work at your organization?

Reactions 71 Comments 72
1 min read

go apps + jaeger tracing

Reactions 5 Comments 2
1 min read

Incident Response in the time of Remote Work

Reactions 8
7 min read

SLOs with Stackdriver Service Monitoring

Reactions 7
8 min read

The Night Before Code Freeze

Reactions 50 Comments 1
4 min read

Rapid Docker on AWS: How to monitor the application?

Reactions 10
4 min read

Becoming a Site Reliability Engineer (SRE)

Reactions 14
14 min read

What Is a Site Reliability Engineer? Should You Become One?

Reactions 11
10 min read

What It Means To Be A Site Reliability Engineer

Reactions 298 Comments 13
5 min read

Tracking one metric opened a whole new world for me

Reactions 17
9 min read

Leaders, Here's how to Encourage Full Service Ownership

Reactions 3
5 min read

Using this one simple trick you can cut your GCP compute costs by as much as 80%!

Reactions 4
2 min read

Augment a PagerDuty Incident with Root Cause

Reactions 4
7 min read

SREview Issue #3

Reactions 3
2 min read

Choosing the Right SRE Tools

Reactions 6
6 min read

I’m a certified Associate Cloud Engineer!

Reactions 35 Comments 5
4 min read

#discuss Managing infra code ⚙️🛠🧰

Reactions 20 Comments 5
1 min read

Nobody likes to wait in a Queue

Reactions 4
2 min read

Using Automation and SLOs to Create Margin in your Systems

Reactions 4
4 min read

Delete unrelated files post-use

Reactions 3
2 min read

The Importance of Reliability Engineering

Reactions 4
5 min read

#techtalks Bringing Operational Excellence to Dev with GitHub's Lauren Rubin

Reactions 4
33 min read

Complete Docker Tutorial - FREE Video Training

Reactions 11 Comments 1
3 min read

How SLIs Help You Understand Users' Needs

Reactions 4
5 min read

#techtalks Resilience in Action SRE Podcast #4

Reactions 5
1 min read

How to Choose Monitoring Tools for DevOps and SRE

Reactions 5
5 min read

5 Tips for Getting Alert Fatigue Under Control

Reactions 24 Comments 1
9 min read

Explain IaC like I'm Five

Reactions 6
2 min read

#discuss SRE, DevOps Authors

Reactions 9
1 min read

Teamwork and Culture in the Era of Remote Work

Reactions 6
4 min read

Managing Burnout During COVID-19

Reactions 4
8 min read

#techtalks Conferences in the Time of COVID-19: Cloud and Infrastructure

Reactions 9
3 min read

Kafka Chaos Engineering With Litmus

Reactions 33
10 min read

Top Practices for Runbook Automation

Reactions 14 Comments 1
6 min read