How to Destroy your Terraform Infrastructure

Jill Ann — Tue, 20 Sep 2022 14:37:09 +0000

Working on a project recently, I was faced with the problem of how best to destroy Terraform infrastructure. There are a few ways to do it, and the best way depends on what you are actually trying to do.

Note: there are many providers that you can use with Terraform but I’ll be using AWS for these examples. The logic is the same whatever provider you are using.

Remove configuration

One way is to simply remove the resources from the configuration (this could mean deleting blocks of code, files, or directories) and then run terraform apply. The Terraform language is declarative, meaning that it describes the end goal rather than the steps needed to get there. So when you apply the changes, Terraform will see that the configuration is gone and will delete the corresponding resources from AWS (or whichever provider you are using). This is best for removing only part of your Terraform project.
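To make the declarative idea concrete, here is a toy model in Python. It is not how Terraform is implemented; it only sketches the comparison between what the configuration asks for and what has already been created. The function name and the resource addresses are made up for illustration.

```python
# Toy model of declarative reconciliation. This is NOT Terraform's real
# implementation; it only illustrates comparing the desired configuration
# with the resources that are already recorded as existing.


def plan(config: set[str], state: set[str]) -> dict[str, list[str]]:
    """Return the addresses to create and destroy so that state matches config."""
    return {
        "create": sorted(config - state),
        "destroy": sorted(state - config),
    }


# Hypothetical resource addresses, used purely for illustration.
state = {"aws_instance.web", "aws_instance.worker", "aws_s3_bucket.logs"}

# The worker block has been deleted from the configuration.
config = {"aws_instance.web", "aws_s3_bucket.logs"}

print(plan(config, state))
# {'create': [], 'destroy': ['aws_instance.worker']}
```

Nothing in the sketch says how to delete the worker. The difference between the two sets is enough to work out what has to go.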

Terraform destroy

If you want to destroy the whole project, use terraform destroy. Run it in the project's root directory, and only then delete the project files.

Heads up - you might be tempted to just delete the project without running terraform destroy first (like the method above, but for the entire project). However, if you do this, Terraform never runs, so nothing tells AWS to tear the infrastructure down. You would have to go to the AWS console and remove the resources manually, defeating the point of having Infrastructure as Code.

Destroy with a target

If you want to use the destroy command to tear down only part of your infrastructure, then use a target:

terraform destroy -target="aws_instance.example[0]"

The advantage here is that it’s easy to bring the resources back: just run terraform apply again. However, to remove them permanently, remember to delete the related config from the code! Otherwise, the next time you run terraform apply, the resources will be recreated.

Replacing an instance

You could even use this method of destroying and applying again to replace an instance (for example if the hardware is degraded).

However the recommended way to do this is by using the -replace option with terraform apply.

terraform apply -replace="aws_instance.example[0]"

I hope you found this explanation helpful, feel free to leave a comment below!

What is Site Reliability Engineering? A short intro

Jill Ann — Thu, 25 Mar 2021 16:44:31 +0000

I recently started learning about Site Reliability Engineering (SRE), a discipline that began at Google in the early 2000s and is now popular at companies around the world. Since one of the best ways to properly understand a topic is to explain it to someone else, I decided to write this article explaining some of SRE's core concepts. As such, this article is partly for me, and partly for others who are curious about SRE or just starting out on their SRE journey.

This is by no means intended to be a definitive guide. Instead, I'm going to focus on three of the core concepts that I found the most interesting and insightful: toil and how to eliminate it, SLOs, and the error budget. I'll start with a short intro on what SRE is, the problems it addresses, and then dive into each of the core concepts in turn.

What is SRE? Is it DevOps?

SRE is an approach that uses software engineering concepts to solve operations problems. It focuses heavily on automation and its key principles help align the goals of the development and operations teams.

SRE is not exactly DevOps, although they do share many common traits and have similar aims. For example, DevOps and SRE were both brought in to address the problem of conflict between the development and operations teams.

Both aim to break down the barrier between the two teams and foster a healthier, more effective, and more productive working environment. But where DevOps is a fairly general term, SRE is a definite set of principles. SRE can be thought of as a specific way of doing DevOps.

What problems does SRE fix?

To really understand SRE, we have to look at the problems organisations faced before SRE (and DevOps) were part of the picture.

Conflict between development and operations teams

In traditional organisations, product development and operations were two quite distinct teams. Development was responsible for writing code and building new features, whereas the goal of operations was to keep everything stable and running smoothly in production.

The problem is that this setup inherently creates tension between the two teams. The development teams want to move quickly and ship as many new features as possible. Operations, however, want to move slowly and limit changes, because changes break things and risk system downtime.

Over time, this conflict results in a number of problems such as bad communication, different goals regarding service reliability, and ultimately a lack of trust and respect between the two teams.

Scalability

Another problem that SRE addresses is scalability. The problem with a traditional operations team is that as the company launches new services, or as those services start getting more traffic, the operations team must scale with them. This is because the work of a traditional operations team is mostly manual. So, the more services or traffic, the more people are needed to help run those services and keep them stable.

At a company the size of Google, the operations team would therefore have to grow to a size that would quickly become unmanageable and extremely expensive.

How does SRE fix these problems?

SRE has a set of principles that focuses on aligning the goals of development and operations as much as possible.

In the words of Ben Treynor Sloss, who founded Google's SRE team: "SRE is what happens when you ask a software engineer to design an operations team."

So what might this look like? Let's take a look at some of the core concepts now.

Eliminating toil

Toil is the manual and repetitive work that comes with running a production service. So toil is not just any manual work, but rather work related specifically to keeping the site up and running. Toil is not something that moves you forward: if, after completing a task, you're in the same place as you started, chances are that task is toil. It's important to note that toil is also something that can be automated, so tasks that rely on human judgement are not toil.

So how does the SRE team eliminate toil? SRE tackles this by making automation a top priority. At Google, SRE team members have a cap of 50% on the time they can spend on operations work such as manual tasks and being on-call. And what do they do with the rest of their time? They spend it on engineering projects (hence the "Engineering" part of SRE). A big part of this involves automating manual tasks, with the goal of automating away that year's work. With more and more tasks automated, more toil is eliminated.

And how exactly does eliminating toil help solve the problems mentioned above? First of all, it directly tackles the problem of scalability, because with more and more tasks automated, the SRE team doesn’t need to scale in line with more services or traffic. Now the company can expand its services, but the size of the SRE team can stay the same.

Secondly, it intelligently addresses the issue of conflict between development and operations. You might be wondering what happens if there is simply more operational work than the SRE team can handle within its 50% limit. In that case, any excess operational work gets redirected back to the development team.

Now instead of being in conflict, the values of both teams are aligned and focused on reducing the overall amount of manual operations work. The development team will be more careful with the code they send to SRE, because they know that any badly-tested code that causes problems will only increase the SRE team’s workload. If it goes over the 50% limit, they'll have to pick up the excess.
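The 50% rule can be sketched as a small calculation. This is only an illustration of the idea described above, not a real Google tool: the function name and the hours are made up.

```python
OPS_CAP = 0.5  # the 50% cap on operational work described above


def split_ops_work(
    ops_hours: float, team_hours: float, cap: float = OPS_CAP
) -> tuple[float, float]:
    """Return (hours kept by the SRE team, hours redirected to development)."""
    limit = team_hours * cap
    kept = min(ops_hours, limit)
    return kept, ops_hours - kept


# Hypothetical week: 5 engineers x 40 hours, with 120 hours of operational work.
kept, redirected = split_ops_work(ops_hours=120, team_hours=200)
print(kept, redirected)  # 100.0 20.0
```

In this made-up week the SRE team keeps 100 hours of operational work and the remaining 20 hours land back with the developers, which is exactly the incentive to ship code that causes less operational load.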

SLOs

SLO stands for Service Level Objective. This basically means the level of reliability or availability a service aims to offer its users.

It's a common misconception that a company should aim to offer services with 100% availability. However, if you think about it, this is actually pointless. A user's computer or internet connection may only be 99% reliable, so they wouldn’t even notice if your service is only 99.99% reliable instead of 100%.

One major downside of aiming for 100% reliability is that it stifles innovation. If the system can't be down even for a second, it's going to be very hard for developers to launch new features, as this risks breaking things. Also, making a service 100% reliable requires great effort (if it's even possible), and that effort is almost certainly better spent on other things.

An SLO is therefore an agreement on what level of reliability is acceptable. It can be measured in terms of how often a service is available, as well as things like how long it takes to return a response to a request. These measurements are called SLIs (Service Level Indicators).
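As a rough sketch of how SLIs relate to an SLO, the snippet below computes an availability SLI and a latency SLI and compares the first against a target. The function names, the traffic numbers, and the 300 ms threshold are all assumptions chosen for illustration.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Fraction of requests that were served successfully."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic counts as meeting the objective
    return good_requests / total_requests


def latency_sli(latencies_ms: list[float], threshold_ms: float) -> float:
    """Fraction of requests answered within the latency threshold."""
    if not latencies_ms:
        return 1.0  # same assumption as above for an empty sample
    fast = sum(1 for ms in latencies_ms if ms <= threshold_ms)
    return fast / len(latencies_ms)


SLO = 0.999  # a 99.9% availability target

# Hypothetical traffic: 1,000,000 requests, 500 of which failed.
sli = availability_sli(good_requests=999_500, total_requests=1_000_000)
print(sli, sli >= SLO)  # 0.9995 True

# Hypothetical latencies: three of the four requests beat a 300 ms threshold.
print(latency_sli([120, 80, 450, 95], threshold_ms=300))  # 0.75
```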

SLOs are an important way of communicating to users what level of reliability they should expect. However, they’re also key in getting developers and operations on the same page. Since the SLO is negotiated in advance, there will be less conflict between operations, who want more reliability, and development teams, who want more freedom to ship changes.

Error budgets

The error budget is another clever way that SRE aligns the incentives of the developers and those concerned with reliability. It helps find a balance between releasing new features and making sure these features are reliable.

The error budget is the other side of the coin to the SLO. If the product team decides that the SLO of a service should be 99.9%, then the error budget is the remainder, in this case 0.1%.

The main benefit of an error budget is that it provides a common incentive that allows both product development and SRE to focus on finding the right balance between innovation and reliability.

The easiest way to think about this is in terms of time. If the SLO is 99.9%, then the remaining 0.1% is time that the service is allowed to fail. In this way, the development and SRE teams agree in advance what the acceptable level of unreliability is, thereby reducing conflict.
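To put numbers on that, the sketch below converts an availability SLO into the amount of downtime it allows over a given window. The function name and the 30-day and 90-day windows are assumptions for illustration; pick whatever window your SLO is actually measured over.

```python
from datetime import timedelta


def allowed_downtime(slo: float, window: timedelta) -> timedelta:
    """Downtime permitted by an availability SLO over the given window."""
    return window * (1 - slo)


# A 99.9% SLO leaves 0.1% of the window as error budget.
print(allowed_downtime(0.999, timedelta(days=30)))  # 0:43:12
print(allowed_downtime(0.999, timedelta(days=90)))  # 2:09:36
```

So a 99.9% SLO allows roughly 43 minutes of downtime in a 30-day month, or a little over two hours in a 90-day quarter.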

New features can be added until that quarter’s error budget is spent. For example, if the error budget is 0.1% and a change causes the system to fail 0.01% of the time, that problem uses up 10% of that quarter's error budget. Once this limit is reached, no more features can be launched.
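The arithmetic in that example can be sketched as a small tracker. This is a minimal illustration, not a real SRE tool: the class and its method names are made up, and it uses exact fractions so that the percentages do not pick up floating-point rounding.

```python
from fractions import Fraction


class ErrorBudget:
    """Tracks how much of a quarter's error budget has been spent."""

    def __init__(self, slo_percent: str) -> None:
        # A 99.9% SLO leaves a budget of 0.1 percentage points.
        self.budget = 100 - Fraction(slo_percent)
        self.spent = Fraction(0)

    def record(self, failure_percent: str) -> None:
        """Charge an incident, given as a percentage of the time it failed."""
        self.spent += Fraction(failure_percent)

    @property
    def consumed(self) -> Fraction:
        """Share of the budget used so far (1 means fully spent)."""
        return self.spent / self.budget

    @property
    def launches_allowed(self) -> bool:
        return self.spent < self.budget


budget = ErrorBudget("99.9")
budget.record("0.01")  # a change that makes the system fail 0.01% of the time
print(float(budget.consumed), budget.launches_allowed)  # 0.1 True
```

Ten incidents of that size would spend the whole budget, at which point launches_allowed turns False.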

This gets the developers thinking like SREs. If they know the error budget is almost used up, they’ll write better-tested code. This works to their advantage too, because if their code causes fewer problems, they can continue to publish new features.

However, if the developers are having a hard time launching new features because of a strict error budget, the SLO can be relaxed. This would increase the error budget and encourage more innovation. The important thing is to find a balance that works.

Of course, buggy code is not the only thing that can consume the error budget: the failure of a data center can too. Since this isn't the development team's fault, should it still affect their remaining error budget? The answer is that anything that causes the system to go down will eat up the error budget. However, this can be handled by splitting the budget into different parts: part can be reserved for the development team and part can be set aside for other types of outages.
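Splitting the budget might look something like the sketch below. The 75/25 split, the pool names, and the outage length are all assumptions picked for illustration; the total of 7,776 seconds is what a 99.9% SLO allows over a 90-day quarter.

```python
def split_budget(total_seconds: float, shares: dict[str, float]) -> dict[str, float]:
    """Divide an error budget (in seconds) into named pools by share."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("shares must add up to 1")
    return {name: total_seconds * share for name, share in shares.items()}


# 0.1% of a 90-day quarter is 7,776 seconds of allowed downtime.
# The 75/25 split is a hypothetical choice, not a recommendation.
pools = split_budget(7776, {"development": 0.75, "infrastructure": 0.25})

# A hypothetical 20-minute data center outage is charged to its own pool,
# so the development team's share is left untouched.
pools["infrastructure"] -= 1200

print(pools)  # {'development': 5832.0, 'infrastructure': 744.0}
```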

DEV Community: Jill Ann