<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jill Ann</title>
    <description>The latest articles on DEV Community by Jill Ann (@jillann).</description>
    <link>https://dev.to/jillann</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F557328%2F81a8f463-9b74-4191-9737-ee1993132587.jpeg</url>
      <title>DEV Community: Jill Ann</title>
      <link>https://dev.to/jillann</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jillann"/>
    <language>en</language>
    <item>
      <title>How to Destroy your Terraform Infrastructure</title>
      <dc:creator>Jill Ann</dc:creator>
      <pubDate>Tue, 20 Sep 2022 14:37:09 +0000</pubDate>
      <link>https://dev.to/jillann/how-to-destroy-your-terraform-infrastructure-k39</link>
      <guid>https://dev.to/jillann/how-to-destroy-your-terraform-infrastructure-k39</guid>
      <description>&lt;p&gt;Working on a project recently, I was faced with the problem of how best to destroy Terraform infrastructure. There are a few ways to do it, and the best way depends on what you are actually trying to do.&lt;/p&gt;

&lt;p&gt;Note: there are many providers that you can use with Terraform but I’ll be using AWS for these examples. The logic is the same whatever provider you are using.&lt;/p&gt;

&lt;h2&gt;
  
  
  Remove configuration
&lt;/h2&gt;

&lt;p&gt;One way is to simply remove the resources from the configuration (this could be blocks of code, files, or directories), then run &lt;code&gt;terraform apply&lt;/code&gt;. The Terraform language is declarative, meaning that it defines the end goal rather than the steps needed to get there. So when you apply the changes, Terraform will see that the configuration is gone and will delete the corresponding resources from AWS (or whichever provider you are using). This is best for removing only part of your Terraform project.&lt;/p&gt;

&lt;h2&gt;
  
  
  Terraform destroy
&lt;/h2&gt;

&lt;p&gt;If you want to &lt;a href="https://www.terraform.io/cli/commands/destroy#command-destroy"&gt;destroy&lt;/a&gt; the whole project then use &lt;code&gt;terraform destroy&lt;/code&gt;. Run it in the root directory then delete the project.&lt;/p&gt;

&lt;p&gt;Heads up: you might be tempted to just delete the project without running &lt;code&gt;terraform destroy&lt;/code&gt; first (like the method above but for the entire project). However, if you do this, Terraform never gets the chance to tell AWS that the config is gone, so the infrastructure won’t be torn down. You would have to go to the AWS console and remove the resources manually, defeating the point of having Infrastructure as Code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Destroy with a target
&lt;/h2&gt;

&lt;p&gt;If you want to use the destroy command to tear down only &lt;em&gt;part&lt;/em&gt; of your infrastructure, then use a &lt;a href="https://learn.hashicorp.com/tutorials/terraform/resource-targeting?in=terraform/state#destroy-your-infrastructure"&gt;target&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;terraform destroy -target="aws_instance.example[0]"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The advantage here is that it’s easy to bring the resources back (just run &lt;code&gt;terraform apply&lt;/code&gt; again). However, to remove them permanently, remember to delete the related config from the code! Otherwise, the next time you run &lt;code&gt;terraform apply&lt;/code&gt; the resources will be recreated.&lt;/p&gt;

&lt;h2&gt;
  
  
  Replacing an instance
&lt;/h2&gt;

&lt;p&gt;You &lt;em&gt;could&lt;/em&gt; even use this method of destroying and applying again to replace an instance (for example if the hardware is degraded).&lt;/p&gt;

&lt;p&gt;However, the recommended way to do this is by using the &lt;code&gt;-replace&lt;/code&gt; option with &lt;code&gt;terraform apply&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;terraform apply -replace="aws_instance.example[0]"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I hope you found this explanation helpful, feel free to leave a comment below!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>What is Site Reliability Engineering? A short intro</title>
      <dc:creator>Jill Ann</dc:creator>
      <pubDate>Thu, 25 Mar 2021 16:44:31 +0000</pubDate>
      <link>https://dev.to/jillann/an-intro-to-sre-cen</link>
      <guid>https://dev.to/jillann/an-intro-to-sre-cen</guid>
      <description>&lt;p&gt;I recently started learning about Site Reliability Engineering (SRE), a discipline that began at Google in the early 2000s and is now popular at companies around the world. Since one of the best ways to properly understand a topic is to explain it to someone else, I decided to write this article explaining some of SRE's core concepts. As such, this article is partly for me, and partly for others who are curious about SRE or just starting out on their SRE journey.&lt;/p&gt;

&lt;p&gt;This is by no means intended to be a definitive guide. Instead, I'm going to focus on three of the core concepts that I found the most interesting and insightful: toil and how to eliminate it, SLOs, and the error budget. I'll start with a short intro on what SRE is, the problems it addresses, and then dive into each of the core concepts in turn.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is SRE? Is it DevOps?
&lt;/h2&gt;

&lt;p&gt;SRE is an approach that uses software engineering concepts to solve operations problems. It focuses heavily on automation and its key principles help align the goals of the development and operations teams.&lt;/p&gt;

&lt;p&gt;SRE is not exactly DevOps, although they do share many common traits and have similar aims. For example, DevOps and SRE were both brought in to address the problem of conflict between the development and operations teams.&lt;/p&gt;

&lt;p&gt;Both aim to break down the barrier between the two teams and foster a healthier, more effective, and more productive working environment. But where DevOps is a fairly general term, SRE is a definite set of principles. SRE can be thought of as a specific way of doing DevOps.&lt;/p&gt;

&lt;h2&gt;
  
  
  What problems does SRE fix?
&lt;/h2&gt;

&lt;p&gt;To really understand SRE, we have to look at the problems organisations faced before SRE (and DevOps) were part of the picture.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conflict between development and operations teams
&lt;/h3&gt;

&lt;p&gt;In traditional organisations, product development and operations were two quite distinct teams. Development was responsible for writing code and building new features, whereas the goal of operations was to keep everything stable and running smoothly in production.&lt;/p&gt;

&lt;p&gt;The problem is that this setup inherently creates tension between the two teams. The development teams want to move quickly and ship as many new features as possible. However, operations wants to move slowly and limit changes, because changes can break things and risk system downtime.&lt;/p&gt;

&lt;p&gt;Over time, this conflict results in a number of problems such as bad communication, different goals regarding service reliability, and ultimately a lack of trust and respect between the two teams.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scalability
&lt;/h3&gt;

&lt;p&gt;Another problem that SRE addresses is scalability. The problem with a traditional operations team is that as the company launches new services or as those services start getting more traffic, the operations team must scale with it. This is because the work of a traditional operations team is mostly manual. So, the more services or traffic, the more people needed to help run those services and keep them stable.&lt;/p&gt;

&lt;p&gt;When you have a company the size of Google, the operations team would therefore have to grow to a size that would quickly become unmanageable and extremely expensive.&lt;/p&gt;

&lt;h2&gt;
  
  
  How does SRE fix these problems?
&lt;/h2&gt;

&lt;p&gt;SRE has a set of principles that focuses on aligning the goals of development and operations as much as possible.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;SRE is what happens when you ask a software engineer to design an operations team.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So what might this look like? Let's take a look at some of the core concepts now.&lt;/p&gt;

&lt;h3&gt;
  
  
  Eliminating toil
&lt;/h3&gt;

&lt;p&gt;Toil is the manual and repetitive work that comes with running a production service. So toil is not just &lt;em&gt;any&lt;/em&gt; manual work, but rather work related specifically to keeping the site up and running. Toil is not something that moves you forward. If, after completing a task, you're in the same place as you started, chances are that task is toil. It's important to note that toil is also something that is automatable. Therefore any tasks that rely on human judgement are not toil.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F15gqxwpolysfo2ibiq9q.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F15gqxwpolysfo2ibiq9q.jpg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So how does the SRE team eliminate toil? SRE tackles this by making automation a top priority. At Google, SRE team members have a cap of 50% on the time they can spend on operations work such as manual tasks and being on-call. And what do they do with the rest of their time? They spend it on engineering projects (hence the "Engineering" part of SRE). A big part of this involves automating manual tasks, with the goal of automating away that year's work. With more and more tasks automated, more toil is eliminated.&lt;/p&gt;

&lt;p&gt;And how exactly does eliminating toil help solve the problems mentioned above? First of all, it directly tackles the problem of scalability, because with more and more tasks automated, the SRE team doesn’t need to scale in line with more services or traffic. Now the company can expand its services, but the size of the SRE team can stay the same.&lt;/p&gt;

&lt;p&gt;Secondly, it intelligently addresses the issue of conflict between development and operations. You might be wondering what happens if there’s just more operational work to be done and the SRE team exceeds its limit of 50%? If this happens, any excess operational work gets redirected back to the development team.&lt;/p&gt;

&lt;p&gt;Now, instead of being in conflict, the values of both teams are aligned and focused on reducing the overall amount of manual operations work. The development team will be more careful with the code they send to SRE, because they know that any badly-tested code that causes problems will only increase the SRE team’s workload. If that workload goes over the 50% limit, the developers will have to pick up the excess.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff9iuflsy5m6cy9dfmtc4.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff9iuflsy5m6cy9dfmtc4.jpg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  SLOs
&lt;/h3&gt;

&lt;p&gt;SLO stands for Service Level Objective. This basically means the level of reliability or availability a service aims to offer its users.&lt;/p&gt;

&lt;p&gt;It's a common misconception that a company should aim to offer services with 100% availability. However, if you think about it, this is actually pointless. A user's computer or internet connection may only be 99% reliable, so they wouldn’t even notice if your service is only 99.99% reliable instead of 100%.&lt;/p&gt;

&lt;p&gt;One major downside of aiming for 100% reliability is that it stifles innovation. If the system can't be down even for a second, it's going to be very hard for developers to launch new features as this risks breaking things. Also, to make a service 100% reliable requires great effort (if it's even possible), and that effort is almost definitely better spent on other things.&lt;/p&gt;

&lt;p&gt;An SLO is therefore an agreement on what level of reliability is acceptable. It can be measured in terms of how often a service is available, as well as things like how long it takes to return a response to a request. These measurements are called SLIs (Service Level Indicators).&lt;/p&gt;
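
&lt;p&gt;To make the idea of an SLI concrete, here is a minimal sketch of an availability SLI computed as the share of successful requests. The function name and the sample data are hypothetical and only for illustration; real SLIs are usually derived from monitoring data rather than a list in memory.&lt;/p&gt;

```python
from fractions import Fraction


def availability_sli(outcomes):
    """Return the fraction of requests that succeeded."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("no requests observed")
    # Count the True entries (successful requests) and divide by the total.
    return Fraction(sum(1 for ok in outcomes if ok), len(outcomes))


# Hypothetical sample: 9,990 successful requests and 10 failures.
sample = [True] * 9990 + [False] * 10
print(f"availability SLI: {float(availability_sli(sample)):.2%}")
```

&lt;p&gt;The resulting number is what gets compared against the SLO to see whether the service is meeting its target.&lt;/p&gt;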

&lt;p&gt;SLOs are an important way of communicating to users what level of reliability they should expect. However, they’re also key in getting developers and operations on the same page. Since the SLO is negotiated in advance, there will be less conflict with operations wanting more reliability and development teams wanting less.&lt;/p&gt;

&lt;h3&gt;
  
  
  Error budgets
&lt;/h3&gt;

&lt;p&gt;The error budget is another clever way that SRE aligns the incentives of the developers and those concerned with reliability. It helps find a balance between releasing new features and making sure these features are reliable. The error budget is the other side of the coin to the SLO. If the product team decides that the SLO of a service should be 99.9%, then the error budget is the remainder, in this case 0.1%.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The main benefit of an error budget is that it provides a common incentive that allows both product development and SRE to focus on finding the right balance between innovation and reliability.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The easiest way to think about this is in terms of time. If the SLO is 99.9%, then the remaining 0.1% is time that the service is allowed to fail. In this way, the development and SRE teams agree in advance what the acceptable level of unreliability is, thereby reducing conflict.&lt;/p&gt;
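
&lt;p&gt;As a rough sketch of that arithmetic, the snippet below converts an availability SLO into minutes of allowed downtime. The function name and the 90-day quarter are assumptions made for illustration, not something defined in the SRE book.&lt;/p&gt;

```python
from fractions import Fraction


def error_budget_minutes(slo_percent, window_days):
    """Return the downtime allowed by an availability SLO, in minutes."""
    # Fraction parses decimal strings exactly, so "99.9" has no float rounding error.
    budget_fraction = 1 - Fraction(slo_percent) / 100
    window_minutes = window_days * 24 * 60
    return budget_fraction * window_minutes


# Assumed 90-day quarter with the 99.9% SLO from the example above.
allowed = error_budget_minutes("99.9", 90)
print(f"{float(allowed):.1f} minutes of allowed downtime")
```

&lt;p&gt;With these assumptions, a 99.9% SLO leaves a little over two hours of downtime per quarter.&lt;/p&gt;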

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ind8c9yrbve4qb84m24.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ind8c9yrbve4qb84m24.jpg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;New features can be added until that quarter’s error budget is spent. For example, if the error budget is 0.1% and a change causes the system to fail 0.01% of the time, that problem uses up 10% of that quarter's error budget. Once this limit is reached, no more features can be launched.&lt;/p&gt;
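
&lt;p&gt;The numbers in this example can be checked with a short sketch. Both function names are hypothetical, and treating the budget as shares that add up to one is a simplification for illustration.&lt;/p&gt;

```python
from fractions import Fraction


def budget_consumed(failure_percent, budget_percent):
    """Return the share of the error budget used up by one problem."""
    return Fraction(failure_percent) / Fraction(budget_percent)


def remaining_budget(consumed_shares):
    """Return the share of the budget still left, never going below zero."""
    return max(Fraction(0), 1 - sum(consumed_shares))


# The example above: a 0.1% budget and a change that fails 0.01% of the time.
share = budget_consumed("0.01", "0.1")
left = remaining_budget([share])
print(f"{float(share):.0%} of the budget used, {float(left):.0%} left")

# Once nothing is left, feature launches stop until the next quarter.
if left == 0:
    print("error budget spent: no more launches this quarter")
```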

&lt;p&gt;This gets the developers thinking like SREs. If they know the error budget is almost used up, they’ll write better-tested code. This works to their advantage too, because if their code causes fewer problems, they can continue to publish new features.&lt;/p&gt;

&lt;p&gt;However, if the developers are having a hard time launching new features because of a strict error budget, the SLO can be relaxed. This would increase the error budget and encourage more innovation. The important thing is to find a balance that works.&lt;/p&gt;

&lt;p&gt;Of course, things other than buggy code can consume the error budget: for example, the failure of a data center. Since this isn't the development team's fault, should it still affect their remaining error budget? The answer is that anything that causes the system to go down will eat up the error budget. However, it can be handled by splitting the budget into different parts: part can be reserved for the development team and part can be set aside for other types of outages.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;p&gt;I hope you've found this article useful. If you want to learn more about SRE, then check out this &lt;a href="https://sre.google/sre-book/table-of-contents/" rel="noopener noreferrer"&gt;free online book by Google&lt;/a&gt;, and also this &lt;a href="https://www.youtube.com/watch?v=uTEL8Ff1Zvk&amp;amp;list=PLIivdWyY5sqJrKl7D2u-gmis8h9K66qoj&amp;amp;index=1&amp;amp;ab_channel=GoogleCloudTech" rel="noopener noreferrer"&gt;YouTube playlist&lt;/a&gt;. Both of them explore the concepts mentioned here in more detail, plus a lot more.&lt;/p&gt;

&lt;p&gt;&lt;cite&gt;All quotes from the book 'Site Reliability Engineering' by Google.&lt;/cite&gt;&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
