Software at Scale

Software at Scale 28 - Tammy Butow: Principal SRE, Gremlin

Tammy Butow is a Principal SRE at Gremlin, a Failure as a Service platform company that helps engineers build more resilient software. She’s also the co-founder of Girl Geek Academy, an organization that encourages women to learn technology skills. She previously held IC and management roles in SRE at Dropbox and DigitalOcean.

In this episode, we talk about reliability engineering and Chaos Engineering, including the growing trend of outages across the internet and their underlying causes. We explore common themes in outages, such as marketing events and a lack of budgets or planning, the impact of these outages on businesses like online retailers, and how tools and methodologies from Chaos Engineering and SRE can help.

Highlights

01:00 - Starting as the seventh employee at Gremlin

04:00 - An analysis of recent outages and their root causes

09:00 - A mindset shift on software reliability

14:00 - If you’re suddenly in charge of the reliability of thousands of MySQL databases, what do you do? How do you measure your own success?

25:00 - Why is it important to know exactly how many nodes your service requires to run reliably?

30:00 - What attracts customers to Chaos Engineering? Do prospects get concerned when they hear “chaos” or “failure as a service”?

43:00 - Regression testing failure in CI/CD

51:00 - Trends of interest in Chaos Engineering over time
