Originally published on Failure is Inevitable.
Blameless recently had the pleasure of interviewing Yury Niño Roa, Site Reliability Engineer, Solutions Architect and Chaos Engineering Advocate at ADL Digital Labs. She’s worked in roles ranging from solutions architect, to software engineering professor, to DevOps engineer, to SRE. Additionally, Yury is an avid blogger and conference speaker who regularly presents at events such as Chaos Conf, DevOpsDays Bogotá, and more.
In this interview, we’ll delve into what draws Yury to SRE and chaos engineering, how she defines resilience, as well as her predictions on emerging trends in the SRE landscape.
Can you share your background and some highlights from your career journey to date? What is it that drew you to SRE and chaos engineering?
I am a Site Reliability Engineer and Chaos Engineering Advocate in Colombia. I love building software applications, reading blogs, writing articles, solving hard performance and resilience issues, and teaching software concepts.
I have 9+ years of experience designing and implementing software applications using agile methodologies such as Scrum and Kanban. Also, I have two years of hands-on experience supporting, automating, and optimizing mission-critical deployments.
After finishing my bachelor's degree, I was interested in solving programming and performance challenges. My first job was as a software developer in the Enrollment Department at the National University of Colombia. Each semester we received thousands of student applications, so we frequently had to face performance and reliability issues.
After my time at the National University of Colombia, I worked as a software engineer for six years at different companies.
In 2017, I was assigned to a team at Scotiabank that had the mission to solve a performance problem. Chaos Engineering became a useful tool for our team, so I began to study the discipline, looking for information on Google and asking the experts. It was fascinating how we were able to inject failures in the infrastructure and use the results for implementing resilience. I was excited by the idea of designing applications with the capacity to recover from inevitable failures, without human intervention.
In this journey, I discovered Site Reliability Engineering. I found that this field matched my interests, so I decided to change my role. I want to be part of the teams that will build a more reliable Internet.
What does SRE look like at ADL Digital Labs? What are some of the key operating tenets of your team?
Let me start with a short description of the company that I work for. ADL Digital Labs is a company in Colombia that provides technology and innovation services. We are driving digital transformation in Grupo Aval, a business conglomerate. We are building our business strategy around novel technologies and methodologies such as software development, DevOps culture, analytics, big data, and, of course, cloud services. This comes with significant operational challenges. We are addressing these challenges by following the Spotify organizational model and using the principles and practices of Site Reliability Engineering.
That means the engineering team is formed into squads, tribes, chapters, guilds, and a transversal team that supports them. We provide solutions for infrastructure, security, and automation challenges using software engineering. Our mission is to guarantee the reliability of the software products and assist those who are behind these products.
Although the company has been in the market for just over a year, we have already overcome many challenges, which has given us scenarios for applying almost all of the practices and processes described in the Google SRE books.
At this moment, our key focus is on moving ownership of the infrastructure and operations toward the software development teams. Our mantra is that if you participate in a cycle to build a digital solution, you are an owner of the process. You are part of the successes, and you are also responsible for the failures.
To reach this goal, our SRE automation, infrastructure, and security sub-teams are working on building great tools that are aligned with the SRE principles.
There are two tools that have been instrumental to this: Hefesto and SEP.
Hefesto is a Terraform vendor orchestrator that allows technology leaders to provide the tools that are part of the development process, such as repositories and pipelines. With Hefesto, engineers can manage the roles, access, and permissions on each of those tools, and, in the case of the pipelines, they can configure the stages and steps that are part of the flow. We named this tool in honor of Hephaestus, the Greek god of blacksmiths, metalworking, carpenters, craftsmen, artisans, sculptors, and metallurgy. In our case, Hefesto is the god of our development tools.
SEP (Single Environment Project) is an automated workflow for provisioning infrastructure. With SEP, engineers create the resources that they need in the cloud using infrastructure as code and Gitflow. For example, they write code to create a bucket and open a pull request (PR), which triggers a pipeline and which our team must approve. Once the pipeline has run and the PR is approved, the resource is created. This workflow guarantees that every resource meets our standards and the requirements of the well-architected framework, which governs our interaction with the cloud.
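As an editorial illustration, the gate described above (a resource is created only after the pipeline has run and the PR is approved) can be sketched in a few lines of Python. This is not SEP's actual implementation, which the interview does not show; `ChangeRequest`, `can_provision`, and `provision` are hypothetical names invented for this sketch.

```python
from dataclasses import dataclass


@dataclass
class ChangeRequest:
    """Hypothetical model of a pull request that asks for a cloud resource."""
    resource: str
    pipeline_passed: bool = False  # set once the triggered pipeline succeeds
    approved: bool = False         # set once a reviewer approves the PR


def can_provision(change: ChangeRequest) -> bool:
    """Both conditions from the workflow must hold before anything is created."""
    return change.pipeline_passed and change.approved


def provision(change: ChangeRequest, created: list) -> bool:
    """Record the resource as created only if the gate is open."""
    if not can_provision(change):
        return False
    created.append(change.resource)
    return True


created_resources = []
request = ChangeRequest(resource="example-bucket")

# Pipeline has passed, but nobody has approved yet: nothing is created.
request.pipeline_passed = True
first_attempt = provision(request, created_resources)

# After approval, the same request goes through.
request.approved = True
second_attempt = provision(request, created_resources)
```

The point of the sketch is the conjunction in `can_provision`: neither a green pipeline nor an approval is sufficient on its own.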
What does resilience mean to you?
Despite resilience being just one word, there is a lot to say about this. Let me share some definitions that I think are excellent:
David Woods grouped four basic concepts in his paper, “Four concepts for resilience and the implications for the future of resilience engineering”:
- Resilience is a rebound from trauma and return to equilibrium.
- Resilience is a synonym for robustness.
- Resilience is the opposite of brittleness.
- Resilience is a network architecture that can sustain the ability to adapt to future surprises as conditions evolve.
Another good definition was provided by John Wreathall in the book Resilience Engineering: Concepts and Precepts. “Resilience is the ability of an organization (system) to maintain or recover quickly to a stable state, allowing it to continue operations during and after a major mishap or in the presence of continuous significant stresses.”
For me, resilience is what happens when you or your system suffers a trauma, but your capacity to get up and continue working allows you to recover the stability that you had before the trauma without the intervention of third parties. This last part is the challenge in our systems: they must be able to recover their steady state without human intervention. Resilience is a requirement if you or your system wants to be reliable.
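As an editorial illustration of recovering the steady state without human intervention, here is a minimal Python sketch that injects a fixed number of failures into a dependency and lets a retry loop with exponential backoff absorb them. It is one simple recovery technique, not a method described in the interview; `FlakyService` and `call_with_retry` are hypothetical names, and the zero delay is chosen only so the example runs instantly.

```python
import time


class FlakyService:
    """Hypothetical dependency that fails a fixed number of times, then recovers."""

    def __init__(self, failures_to_inject: int):
        self.remaining_failures = failures_to_inject
        self.calls = 0

    def call(self) -> str:
        self.calls += 1
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise ConnectionError("injected failure")
        return "ok"


def call_with_retry(service: FlakyService, max_attempts: int = 5,
                    base_delay: float = 0.0) -> str:
    """Retry with exponential backoff; re-raise if the last attempt also fails."""
    for attempt in range(1, max_attempts + 1):
        try:
            return service.call()
        except ConnectionError:
            if attempt == max_attempts:
                raise
            # Wait longer after each failure: base_delay, 2x, 4x, ...
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")


# Inject two failures; the caller recovers on the third call with no operator involved.
service = FlakyService(failures_to_inject=2)
result = call_with_retry(service)
```

If more failures are injected than `max_attempts` allows, the error surfaces to the caller, which is the signal that this particular recovery mechanism is not enough on its own.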
Where do you feel the industry is currently at with respect to resilience practices such as SLOs, observability, and chaos? How do you see trends evolving in the coming years?
Proactively running experiments with chaos engineering has proven useful for both large organizations and startups, as described in the recent book “Chaos Engineering: System Resiliency in Practice”. The book contains perspectives, examples, and narratives from Slack, Google, Microsoft, LinkedIn, and Capital One on adopting Chaos Engineering.
However, to answer this question we should consider that enterprise organizations, which are the counterpart of startups, include many types of businesses, such as traditional banks, big transportation companies, medical industries, pharma firms, technology corporations, environmental agencies, educational establishments, and government. Each one has different challenges, regulations, and markets for adopting a discipline like Chaos Engineering.
For example, medical industries work to build mission-critical systems in which human lives are involved. It is hard to pass from one level to another in the chaos maturity model. In the government, innovation could be constrained by regulations and security measures, which are critical in the protection of the citizens' data. I have met with customers working on-call who consider topics such as Infrastructure as Code, Automation, Observability, and consequently, Chaos Engineering, as a utopia or panacea. So my answer is that it depends on the context. I think typical enterprises are not mature enough in this regard.
Banking and transportation companies, for example, belong to a world of heavy and increasing regulation, with software systems that depend on legacy systems that cannot be modernized. Still, they have progressed faster than organizations in other sectors, such as medical or pharma. In the case of banking, they are feeling competitive pressure from each other and from FinTechs. There are many success stories from large banks such as Capital One, which considers chaos engineering a very useful, solid practice.
There has also been a lot of buzz about areas such as the Internet of Things (IoT), Artificial Intelligence (AI), Cybersecurity, and Human Augmentation (HA) that have demonstrated a big potential. Medicine, transportation, software, education, and financial industries have benefited from the progress of these technologies.
Reaching success requires assuming risks and failing many times to gain resilience, so I hope we are evolving to study how we can use chaos engineering to manage the pains in the path toward providing those solutions. On the other side, we will see how Chaos Engineering can benefit from artificial intelligence. For example, AIOps is a hot new area, and it is being used by Observability providers to advertise their AIOps platforms. If the promise of AIOps proves to be true, Chaos Engineering will be impacted by marketing.
What is your advice to those looking to break into SRE?
My recommendation is just to start. It will be difficult, but that is the price of evolving. It is challenging, I know, but once you make the decision and start taking steps, the path gets lighter.
In the Google SRE book, there are three practices that are key principles of SRE, so I strongly recommend starting here:
- Define some objectives, but don’t allow them to take away your sleep. Take the time that you need.
- Remove harmful practices that could be promoting blame.
- And remember, “The best time to learn about fire is when you’re on fire.” It is a beautiful quote from Jen Hammond, an engineering manager at Slack.
There are excellent references to start and progress into SRE from Google as well, but you can look at this resource to get started.