<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Nir Sharma</title>
    <description>The latest articles on DEV Community by Nir Sharma (@dborta).</description>
    <link>https://dev.to/dborta</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F524746%2Fc72f84e4-d209-4b57-8d9f-657d80c48934.jpeg</url>
      <title>DEV Community: Nir Sharma</title>
      <link>https://dev.to/dborta</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dborta"/>
    <language>en</language>
    <item>
      <title>Most frequently asked questions surrounding Google’s Cloud Operations Sandbox</title>
      <dc:creator>Nir Sharma</dc:creator>
      <pubDate>Wed, 11 Aug 2021 11:16:22 +0000</pubDate>
      <link>https://dev.to/squadcast/most-frequently-asked-questions-surrounding-google-s-cloud-operations-sandbox-59a3</link>
      <guid>https://dev.to/squadcast/most-frequently-asked-questions-surrounding-google-s-cloud-operations-sandbox-59a3</guid>
      <description>&lt;p&gt;&lt;em&gt;Cloud Operations Sandbox serves as a simulation tool for budding SREs to learn the best practices from Google and apply them to real cloud services. In this blog, we have compiled a list of FAQs surrounding the use of Google's Cloud Operations Sandbox.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The Google SRE sandbox provides an easy way to get started with the core skills you need to become an SRE. It simulates the behavioural complexities of a real GCP (Google Cloud Platform) environment, so that budding SREs can practice hands-on while learning SRE best practices.&lt;/p&gt;

&lt;p&gt;The core skills you need to become a good SRE are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Observability of complex microservice-based cloud environments&lt;/li&gt;
&lt;li&gt;Performing quick root-cause analysis when things go wrong&lt;/li&gt;
&lt;li&gt;Automating rollbacks and monitoring deployments&lt;/li&gt;
&lt;li&gt;Tracking SLOs and SLIs over time&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3oQ4AjwG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/61027a952455fb6d68c02ee0_01.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3oQ4AjwG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/61027a952455fb6d68c02ee0_01.jpg" width="800" height="294"&gt;&lt;/a&gt;&lt;/p&gt;
Architecture of the demo application provided with the sandbox&lt;br&gt;&lt;a href="https://github.com/GoogleCloudPlatform/cloud-ops-sandbox/blob/master/docs/README.md" rel="noopener noreferrer"&gt;Image Source&lt;/a&gt;



&lt;p&gt;With Cloud Operations Sandbox, you can take your first steps towards SRE expertise and answer the question, ‘Will it work in my production environment?’ We have compiled a list of FAQs related to the Google SRE Sandbox and answered them below.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Q: What are the major features of the sandbox?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;While the sandbox has many features, in this blog we will focus on observability, root cause analysis, simulating user traffic and SLO/SLI tracking. The sandbox features used for learning about these are Cloud Trace, the Locust load generator, Cloud Profiler, Cloud Debugger and SRE Recipes.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Q: Can I track custom SLOs and SLAs with the sandbox?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The demo application that comes with the sandbox has microservices that are pre-instrumented with logging, monitoring, tracing, debugging, and profiling capabilities. The screenshot below shows how Service Level Indicators (SLIs) can be defined for the demo app.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VNoifbIe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/61027acac9168e1b961f2209_02.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VNoifbIe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/61027acac9168e1b961f2209_02.jpg" width="800" height="670"&gt;&lt;/a&gt;&lt;/p&gt;
Defining SLIs in the Google Sandbox&lt;br&gt;&lt;a href="https://github.com/GoogleCloudPlatform/cloud-ops-sandbox/blob/master/docs/README.md" rel="noopener noreferrer"&gt;Image Source&lt;/a&gt;



&lt;p&gt;You can pick SLIs based on availability, latency or even define your own custom metric for the demo application.&lt;/p&gt;

&lt;p&gt;If you have instead chosen to track SLIs for your replicated production environment, you will need to instrument the services yourself.&lt;/p&gt;
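
&lt;p&gt;To make the idea concrete, here is a minimal sketch of how an availability SLI and the remaining error budget could be computed from request counts. The function names, the 99.9% target and the sample counts are illustrative assumptions, not values or APIs taken from the sandbox.&lt;/p&gt;

```python
def availability_sli(good_requests, total_requests):
    """Return the fraction of requests that were served successfully."""
    if total_requests == 0:
        # No traffic means nothing failed; treat the SLI as fully met.
        return 1.0
    return good_requests / total_requests


def error_budget_remaining(good_requests, total_requests, slo_target):
    """Return the share of the error budget still unspent (negative if overspent)."""
    allowed_failures = (1 - slo_target) * total_requests
    actual_failures = total_requests - good_requests
    if allowed_failures == 0:
        # A 100% target (or no traffic at all) leaves no budget to spend.
        return 0.0
    return 1 - actual_failures / allowed_failures


# Hypothetical sample: 1,000,000 requests, 400 of them failed, SLO of 99.9%.
sli = availability_sli(999_600, 1_000_000)
budget = error_budget_remaining(999_600, 1_000_000, 0.999)
print(f"SLI: {sli:.4%}, error budget remaining: {budget:.0%}")
```

&lt;p&gt;A latency SLI could follow the same pattern by counting requests served under a chosen threshold as the good requests.&lt;/p&gt;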

&lt;h3&gt;
  
  
  &lt;strong&gt;Q: Which module is used to simulate traffic in the sandbox?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The artificial load generator used by the sandbox is Locust. Locust is mainly used for testing the load-bearing abilities of your infrastructure. With Locust, you define artificial user behaviour in Python code and run load tests that can simulate up to millions of concurrent users.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mKktVqIw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/61027af7b71f0d3cf9dd6e49_03.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mKktVqIw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/61027af7b71f0d3cf9dd6e49_03.jpg" width="800" height="626"&gt;&lt;/a&gt;&lt;/p&gt;
User interface of the Locust load generator&lt;br&gt;&lt;a href="https://github.com/GoogleCloudPlatform/cloud-ops-sandbox/blob/master/docs/README.md" rel="noopener noreferrer"&gt;Image Source&lt;/a&gt;



&lt;p&gt;Below is a snippet of the Python code used to simulate the behaviour of a user.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;locust&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HttpUser&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;between&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;WebsiteUser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;HttpUser&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Each simulated user waits 5 to 15 seconds between tasks&lt;/span&gt;
    &lt;span class="n"&gt;wait_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;between&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;on_start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Runs once for each simulated user before its tasks start&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/login&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;username&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;test_user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;password&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="nd"&gt;@task&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Hypothetical task (the "/" route is an assumption); Locust needs at least one task to run&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Q: What is ‘Google Cloud Debugger’ and how does it work in the sandbox?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;You have probably seen issues in production that cannot be reproduced in a test environment for root cause analysis. To discover the underlying cause, you must either dig through the source code or add more logging to the program while it runs in production. Typically, adding a log statement means re-deploying the code, with all of the risks that a production deployment involves.&lt;/p&gt;

&lt;p&gt;Cloud Debugger lets developers debug a running application using real-time request data, without re-deploying it. Breakpoints and log points can be defined while viewing the project. When a breakpoint is hit, a snapshot of the process state is taken so you can examine what went wrong.&lt;/p&gt;

&lt;p&gt;Adding a log statement to a running project with Cloud Debugger also does not slow the application down.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Q: What is ‘Google Cloud Profiler’ and how can it help me?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;You can use Cloud Profiler to profile your application statistically. It collects statistical information on CPU usage, heap size, threads and so on, depending on the programming language used. You can use the Profiler UI charts to identify performance bottlenecks in your application code.&lt;/p&gt;

&lt;p&gt;Once you have installed the Profiler library, you do not have to write any profiling code in your application; all you have to do is make the Profiler library available (the method depends on the language). This library will generate reports and allow you to conduct various analyses.&lt;/p&gt;

&lt;p&gt;Note that if you are not using the demo application, the profiler has to be configured for each microservice you want to profile.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Q: What tools are available to learn tracing in the sandbox?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Cloud Trace allows developers to examine distributed traces by graphically revealing request latency bottlenecks. Developers gather the trace information by instrumenting the application code. Traces also include environmental information added to the Cloud Logging records. The sandbox provides OpenCensus and OpenTelemetry to learn tracing within the platform.&lt;/p&gt;

&lt;p&gt;The solution the sandbox uses for instrumentation is &lt;a href="https://opencensus.io/" rel="noopener noreferrer"&gt;OpenCensus&lt;/a&gt;. The OpenCensus project is open-source and offers trace instrumentation in many languages. It also lets you export the trace data to the Google Cloud Operations dashboard, where you can examine it in the Cloud Trace UI.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LNHL8nIV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/61027b24843f189ec9a06b05_04.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LNHL8nIV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/61027b24843f189ec9a06b05_04.jpg" width="800" height="469"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Clicking on a trace in the timeline will give you a more detailed view and breakdown of the traced call and the subsequent calls that were made.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Q: Can I replicate my production/staging environment in the sandbox?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Your production/staging environment can be replicated if it is hosted on GCP (Google Cloud Platform).&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Q: Can I check for observability of my replicated environment?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The sandbox has a demo application (Hipster Shop) that comes pre-instrumented for observability. If you are using your own environment, you will need to instrument your microservices accordingly.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Q: Can I send alerts to an external platform?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;As of now the demo sandbox has an inbuilt incident management system with basic functionality. Sending alerts to an external platform can be done after creating a custom module.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Q: How much does the Sandbox cost?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The sandbox is provided free of charge. However, since it can only be used on Google Cloud Platform (GCP), any computing resources consumed will be billed.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Q: Can I improve my MTTR (Mean Time To Respond) with the sandbox?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The sandbox has a feature called “SRE Recipes” that automatically introduces issues into your environment. It is a good way to learn the skills needed to fix things in production. Note that SRE Recipes only work with the demo application provided with the sandbox; you will need to create your own scripts to generate problems in a custom setup. With practice, SREs can get better at fixing issues in production and reduce the MTTR (Mean Time To Respond) to incidents.&lt;/p&gt;
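
&lt;p&gt;If you want to check whether the practice is paying off, you can track the MTTR of your drills yourself. The sketch below assumes you record when each incident was triggered and when someone first responded; the timestamps are made up and the function name is hypothetical.&lt;/p&gt;

```python
from datetime import datetime
from statistics import mean

# Hypothetical incidents as (triggered_at, responded_at) pairs.
incidents = [
    (datetime(2021, 8, 1, 10, 0), datetime(2021, 8, 1, 10, 12)),
    (datetime(2021, 8, 3, 22, 30), datetime(2021, 8, 3, 22, 38)),
    (datetime(2021, 8, 7, 4, 15), datetime(2021, 8, 7, 4, 25)),
]


def mean_time_to_respond(records):
    """Return the average minutes between an alert firing and the first response."""
    minutes = [
        (responded - triggered).total_seconds() / 60
        for triggered, responded in records
    ]
    return mean(minutes)


print(f"MTTR: {mean_time_to_respond(incidents):.1f} minutes")
```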

&lt;h3&gt;
  
  
  &lt;strong&gt;Q: Can I test the performance of my production environment in the sandbox?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Yes. The sandbox environment can be used to test your production environment since it has a tool to generate synthetic traffic. However, the sandbox does not have any tools for thorough unit testing and performance testing.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Q: What new features will be added to the sandbox?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Runbooks are expected to be added to the sandbox in the near future. Creating effective runbooks is an important skill all SREs need to acquire.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The SRE sandbox is a great place to test out your skills for becoming a better SRE. To be effective in their work, SREs need expertise in the areas of observability, performance testing and distributed architecture. The sandbox provides a way for budding SREs to test out different scenarios. Some possible scenarios include checking the performance of your application under different user loads, getting better at resolving critical issues and testing out different on-call strategies.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://www.squadcast.com/" rel="noopener noreferrer"&gt;Squadcast&lt;/a&gt; is an incident management tool that’s purpose-built for SRE. Your team can get rid of unwanted alerts, receive relevant notifications, work in collaboration using the virtual incident war rooms, and use automated tools like runbooks to eliminate toil.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://app.squadcast.com/register/" rel="noopener noreferrer"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ljesGLP7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c51758c58939b30a6fd3d73/60d2ec8e2d9d85d17141958f_footer_banner-2000x761.png" width="800" height="304"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>sre</category>
      <category>bestpractices</category>
    </item>
    <item>
      <title>The True Cost of Building your Own Incident Management System (IMS)</title>
      <dc:creator>Nir Sharma</dc:creator>
      <pubDate>Tue, 09 Feb 2021 11:19:39 +0000</pubDate>
      <link>https://dev.to/squadcast/the-true-cost-of-building-your-own-incident-management-system-ims-391e</link>
      <guid>https://dev.to/squadcast/the-true-cost-of-building-your-own-incident-management-system-ims-391e</guid>
      <description>&lt;p&gt;&lt;em&gt;Is your organization on the lookout for an incident management tool? If yes, you may wonder- am I better off building my own? Our latest blog outlines some of the key factors to consider while choosing whether to build or buy an incident management software.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When your organisation realises that it needs an Incident Management System (IMS), the first question is almost always, “Build or Buy?” Superficially, the requirements seem simple, and as a technical organisation you probably have the skills you need as well. With your deep knowledge of your internal setup, surely you can build one that’s best suited to your needs? This may seem like a solid argument for building your own IMS. However, there are some hidden factors that you may not have considered. In this blog, we look at the costs involved in building your own IMS and help you determine whether the return on investment (ROI) makes it worth building one.&lt;/p&gt;

&lt;p&gt;First, let’s quickly look at some advantages of building your own IMS.&lt;/p&gt;

&lt;p&gt;The biggest advantage is that you can build an IMS that suits your needs perfectly. If your organization does not use &lt;a href="https://sensu.io/" rel="noopener noreferrer"&gt;Sensu&lt;/a&gt;, for example, then why build support for it? Instead, you can directly integrate with any on-premise monitoring tool you have in place. If you restrict access to your production network, an off-the-shelf SaaS IMS will be difficult to use. An in-house IMS will not face such issues.&lt;/p&gt;

&lt;p&gt;Now that we have had a look at the advantages, let’s look at the disadvantages.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Budgeting for an in-house incident management system (IMS)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;When building your own IMS, it can be very difficult to estimate the total cost of ownership. In general, it is easier to get approval for a one-time cost than for an open-ended project. The trouble with building an IMS is that, more often than not, the cost estimates do not include long-term maintenance, usability and reliability.&lt;/p&gt;

&lt;p&gt;While getting budgetary approval for the IMS, it's hard to communicate the benefits the system will bring. This is because many of the benefits of having a strong IMS platform in place are qualitative. While the on-call experience and effectiveness of engineers will definitely improve, it is hard to measure the benefits quantitatively. Convincing your management to increase the budget may become harder as time passes and additional features are required. Many organizations may not have the measuring tools necessary to decisively prove the return on investment. You may build your own dashboard that tracks the &lt;a href="https://www.squadcast.com/blog/how-squadcast-actions-help-you-reduce-mttr" rel="noopener noreferrer"&gt;MTTR&lt;/a&gt; (Mean Time To Respond) but unless such metrics have been tracked even earlier, it will be a hard sell to convince management.&lt;/p&gt;

&lt;p&gt;Off-the-shelf systems, on the other hand, often don’t have high upfront costs and require little commitment. A small pilot of a commercial product is an easier sell than a potentially long and expensive development project.&lt;/p&gt;

&lt;p&gt;But is it, in fact, expensive? Let’s break it down:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Development costs:&lt;/strong&gt; This includes the cost of assigning programmers to the task, tools required to build the IMS, and infrastructure to test and deploy it. Given a list of features, it is possible to estimate this particular cost, and it is reasonably straightforward to get funding for these expenses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Maintenance Costs:&lt;/strong&gt; Like any other piece of software, there will be maintenance costs associated with building an IMS. The costs associated with maintenance include fixing the bugs that crop up during the development and use of the IMS. You will also need to factor in costs when your requirements grow - this can be changes in the production applications, databases, vendor tools, or any other dependencies. As the underlying software is updated, you will also need to consider the associated security fixes. This involves setting aside time to ensure that any newly discovered security vulnerabilities don’t compromise your system. In certain scenarios, you may also need to hire external contractors to validate your security.&lt;/p&gt;

&lt;p&gt;Since it is a mission-critical piece of software that alerts you to any problems in your entire infrastructure, you cannot neglect its maintenance. You cannot afford to delay patching any vulnerabilities or making critical fixes to your IMS. Therefore you will need to set aside dedicated and continuous engineering capacity for the maintenance of the IMS. Even if it is a part-time team, there must be someone available at short notice to make any critical fixes required. This team and the overhead of maintaining it is likely to be your single highest cost and the one most difficult to sustain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Opportunity cost:&lt;/strong&gt; This is one of the hidden costs that are harder to measure. While developing your own IMS, you will take away engineering capacity from other aspects of your organization. These people could have been working on your organization’s product instead of working on the IMS.&lt;/p&gt;
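
&lt;p&gt;The three cost components above can be combined into a rough total cost of ownership. The sketch below is only an illustration: the function name and every figure in it are hypothetical, so substitute your own estimates.&lt;/p&gt;

```python
def in_house_cost(development, yearly_maintenance, yearly_opportunity, years):
    """One-time development cost plus the recurring costs over the given years."""
    recurring = years * (yearly_maintenance + yearly_opportunity)
    return development + recurring


# Hypothetical figures in USD; replace them with your own estimates.
DEVELOPMENT = 120_000
YEARLY_MAINTENANCE = 40_000
YEARLY_OPPORTUNITY = 30_000

for years in (1, 3, 5):
    total = in_house_cost(DEVELOPMENT, YEARLY_MAINTENANCE, YEARLY_OPPORTUNITY, years)
    # Share of the total that comes from recurring rather than one-time costs.
    recurring_share = 1 - DEVELOPMENT / total
    print(f"{years} year(s): total {total:,} USD, recurring share {recurring_share:.0%}")
```

&lt;p&gt;With these made-up numbers, the recurring costs overtake the one-time development cost within a few years, which is why a budget that only covers development understates the real expense.&lt;/p&gt;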

&lt;p&gt;Now that we have looked at the cost of developing your own in-house platform, let us have a look at the cost incurred if you opt for an off-the-shelf &lt;a href="https://www.squadcast.com/it-incident-management-tools" rel="noopener noreferrer"&gt;incident management platform&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Factors to consider for an off-the-shelf incident management platform&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Usually, off-the-shelf platforms are more expensive to develop because they have to be more flexible in terms of feature set and be able to scale to a higher number of users. Fortunately, you will end up paying only a fraction of that cost, because it is shared among all the customers of the product. In fact, if you have a small team, you can get many features free of cost from several incident management platforms. In general, for a particular feature set, the cost of acquisition will be far lower with off-the-shelf systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deployment and Training Cost:&lt;/strong&gt; Off-the-shelf systems are usually quite flexible, but you may have to spend some time and effort adapting your systems to them. You may have to change some of your processes or deprecate old, unsupported monitoring tools, for example. This also includes any training costs for the users in your organisation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Usability and Features:&lt;/strong&gt; Due to the competitive nature of the market, any off-the-shelf incident management platform will need to keep up and add features to ensure it does not fall behind. An in-house platform often stops being developed as soon as basic minimum functionality is in place. In-house platforms can have poor usability as they are built in an ad-hoc fashion by SREs without input from UX professionals. A better user interface ensures more efficiency and ease of use. Any external product will already have been used by hundreds if not thousands of users in other organizations and therefore will have a highly optimized layout. An external platform will also have the added benefit of a customer support team to answer any queries not covered by the support documentation.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;These were the costs and benefits of an in-house versus an external system. Once you factor in the hidden costs, compliance, and support issues, investing in an in-house incident management platform makes little sense unless you are operating at the scale of &lt;a href="https://sre.google/" rel="noopener noreferrer"&gt;Google&lt;/a&gt; or &lt;a href="https://www.facebook.com/" rel="noopener noreferrer"&gt;Facebook&lt;/a&gt;, or are running an esoteric system that is incompatible with external tools. In the majority of cases, be it a growing team or a small SRE team in a large organization, an off-the-shelf solution is the better choice. For most organizations, the return on investment is not substantial enough to warrant planning and developing an in-house incident management system.&lt;/p&gt;


</description>
      <category>sre</category>
      <category>incidentmanagement</category>
      <category>bestpractices</category>
    </item>
    <item>
      <title>Top Observability tools for DevOps Engineers and SREs</title>
      <dc:creator>Nir Sharma</dc:creator>
      <pubDate>Wed, 27 Jan 2021 11:55:40 +0000</pubDate>
      <link>https://dev.to/squadcast/top-observability-tools-for-devops-engineers-and-sres-15g0</link>
      <guid>https://dev.to/squadcast/top-observability-tools-for-devops-engineers-and-sres-15g0</guid>
      <description>&lt;p&gt;&lt;em&gt;Better visibility is the first step to improved system stability. Our latest blog outlines Top Observability tools for DevOps Engineers &amp;amp; SREs to help you get started on your journey to gain valuable insights into your infrastructure.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;“We can't fix something which we can't observe” - whether it's a steam engine or a complex microservice-based cloud deployment, great observability makes troubleshooting easier. Having a clear view of your system makes it possible to recognise problems early and solve them preemptively. Getting the right data at the right time with associated context is a game changer for those who want better system stability.&lt;/p&gt;

&lt;p&gt;In this blog post, we have collated a list of observability tools in the areas of log aggregation, APM, time series databases, distributed tracing, and metrics collection. While this is not an in-depth look at the strengths and weaknesses of these tools, it's a good starting point for your journey to better observability.&lt;/p&gt;

&lt;p&gt;The list contains a mix of on-premise, hybrid, and SaaS platforms. Also, some of the tools featured here are open-source products or built on the foundation of other open-source software.&lt;/p&gt;

&lt;p&gt;First up, we look at some log aggregation tools:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.fluentd.org/" rel="noopener noreferrer"&gt;Fluentd&lt;/a&gt; is an open-source data collection tool. It collects data from event and application logs and acts as a unifying layer that consolidates different log inputs and outputs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Flexible plugin system that allows the community to extend its usability.&lt;/li&gt;
&lt;li&gt;Fluentd is written in C and Ruby and requires very few system resources.&lt;/li&gt;
&lt;li&gt;Supports Unified Logging with JSON&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---N6DiDAY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fe3368cc29cf079deb6e807_image13.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---N6DiDAY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fe3368cc29cf079deb6e807_image13.png" width="800" height="502"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;a href="https://www.fluentd.org/architecture" rel="noopener noreferrer"&gt;Image Source&lt;/a&gt;



&lt;p&gt;&lt;a href="https://www.elastic.co/what-is/elk-stack" rel="noopener noreferrer"&gt;ELK&lt;/a&gt; is a stack that includes three common open source projects: Elasticsearch, Logstash, and Kibana. ELK allows you to collect logs from your applications, review and analyse these logs to create visualisations for better monitoring and troubleshooting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Highly scalable and resilient&lt;/li&gt;
&lt;li&gt;Encrypted communications are supported&lt;/li&gt;
&lt;li&gt;Role based access control&lt;/li&gt;
&lt;li&gt;Support for several integrations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--w1wBptUa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fe337eecc57a425c234f868_image3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--w1wBptUa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fe337eecc57a425c234f868_image3.png" width="800" height="548"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;a href="https://aws.amazon.com/elasticsearch-service/the-elk-stack/" rel="noopener noreferrer"&gt;Image Source&lt;/a&gt;



&lt;p&gt;&lt;a href="https://www.graylog.org/" rel="noopener noreferrer"&gt;Graylog&lt;/a&gt; is another centralised log aggregation tool that allows real-time search of large amounts of data. It is built on Elasticsearch and MongoDB. It also functions as a repository for capturing and storing machine data. Graylog has &lt;a href="https://www.graylog.org/downloads/try-enterprise" rel="noopener noreferrer"&gt;paid plans&lt;/a&gt; for enterprises.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Extended log collection using Sidecar&lt;/li&gt;
&lt;li&gt;Graphical log analysis&lt;/li&gt;
&lt;li&gt;Free marketplace of extensions&lt;/li&gt;
&lt;li&gt;Simple UI for administration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2JLzQGUh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fe3381a3d22be8d5d69fcfb_image8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2JLzQGUh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fe3381a3d22be8d5d69fcfb_image8.png" width="800" height="541"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;a href="https://docs.graylog.org/en/3.1/pages/dashboards.html" rel="noopener noreferrer"&gt;Image Source&lt;/a&gt;



&lt;p&gt;&lt;a href="https://www.loggly.com/" rel="noopener noreferrer"&gt;Loggly&lt;/a&gt; is a log data processing SaaS solution. It has log tracking tools to help you monitor and analyse the logs generated from your infrastructure. Since it is a SaaS product you can start using it without installing any additional hardware or software. Loggly has &lt;a href="https://www.loggly.com/plans-and-pricing/" rel="noopener noreferrer"&gt;freemium and paid plans&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Proactive monitoring: View app performance, system behavior, and unusual activity across the stack.&lt;/li&gt;
&lt;li&gt;Analyze and visualize data to answer key questions, track SLA compliance, and spot trends.&lt;/li&gt;
&lt;li&gt;Integrates with Slack, GitHub, Jira, Microsoft Teams, custom webhooks, and more.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Px65mJ7g--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fe338ccb3d5030fd8ab1ac2_image9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Px65mJ7g--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fe338ccb3d5030fd8ab1ac2_image9.png" width="800" height="418"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;a href="https://www.loggly.com/blog/loggly-3-0-connecting-dots-unified-log-analysis-and-monitoring/" rel="noopener noreferrer"&gt;Image Source&lt;/a&gt;



&lt;p&gt;Next up, here are some APM (Application Performance Monitoring) tools.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.opsview.com/" rel="noopener noreferrer"&gt;Opsview&lt;/a&gt; is a highly scalable monitoring platform that is used by enterprises. Opsview Cloud, gives its users an unified view of their organization's IT infrastructure as well as uncovering opportunities for automation. Opsview is suitable for small to medium businesses as well. Opsview is a paid tool with a free demo available.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automatically find hosts, identify them and bulk configure them with ease, saving time and effort.&lt;/li&gt;
&lt;li&gt;Visualize your on-premises or cloud infrastructure in your NOC with ease.&lt;/li&gt;
&lt;li&gt;Encrypt database connections, communication between slave and master servers, login credentials and more.&lt;/li&gt;
&lt;li&gt;Configure intelligent alerts using one of many built-in notification methods.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--psKoV9lr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fe338f78317b70781a0f175_image4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--psKoV9lr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fe338f78317b70781a0f175_image4.png" width="800" height="351"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;a href="https://www.opsview.com/resources/linux/blog/ubuntu-system-monitoring-opsview-increase-uptime" rel="noopener noreferrer"&gt;Image Source&lt;/a&gt;



&lt;p&gt;&lt;a href="https://www.zenoss.com/" rel="noopener noreferrer"&gt;Zenoss&lt;/a&gt; offers monitoring services for IT infrastructure. It is agentless and uses a collector tool to collect system information and sends it to a central server for analysis. Zenoss captures data in real-time and places it in context. Zenoss is a paid tool.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Monitoring of containers&lt;/li&gt;
&lt;li&gt;AI-guided anomaly detection &amp;amp; capacity planning&lt;/li&gt;
&lt;li&gt;Root-cause isolation with Service Impact&lt;/li&gt;
&lt;li&gt;Business intelligence and Log Analytics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5REtqPOP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fe339166ba7b75a3bf42727_image11.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5REtqPOP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fe339166ba7b75a3bf42727_image11.png" width="767" height="580"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;a href="https://www.zenoss.com/" rel="noopener noreferrer"&gt;Image Source&lt;/a&gt;



&lt;p&gt;Here is a list of top distributed tracing tools for monitoring microservice-based applications.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://tanzu.vmware.com/observability" rel="noopener noreferrer"&gt;Wavefront&lt;/a&gt; (Tanzu Observability) offers insight into your cloud platforms with detailed metrics, traces, logs, and relevant analytics. It has a host of integrations to major cloud hosting and incident management platforms.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Get instant insights, customized for each team, with one-click analytics-driven dashboards.&lt;/li&gt;
&lt;li&gt;Measure what matters most using advanced analytics-driven custom metrics.&lt;/li&gt;
&lt;li&gt;Identify the root cause in seconds across any cloud, any application or any siloed tool.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--r-0TaM6u--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fe3393fc29cf0cbe4b6ec47_image12.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--r-0TaM6u--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fe3393fc29cf0cbe4b6ec47_image12.png" width="542" height="374"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;a href="https://docs.wavefront.com/integrations_tkgi.html" rel="noopener noreferrer"&gt;Image Source&lt;/a&gt;



&lt;p&gt;&lt;a href="https://lightstep.com/" rel="noopener noreferrer"&gt;Lightstep&lt;/a&gt; is a product that provides visibility into complex deployments. This includes analysis of redundancies and automatic root causes analysis from collected data. It also has the ability to automatically detect changes in your infrastructure. Lightstep has paid as well as freemium versions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lightstep's correlation engine finds the cause for every effect, even across service boundaries.&lt;/li&gt;
&lt;li&gt;Instantly detect everything from minor fluctuations to major deployments anywhere in your system.&lt;/li&gt;
&lt;li&gt;Automatically detect the root cause of issues and resolve performance regressions immediately.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4SAz-t8M--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fe3396335f0e837ff13f51d_image10.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4SAz-t8M--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fe3396335f0e837ff13f51d_image10.png" width="800" height="615"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;a href="https://lightstep.com/blog/lightstep-xpm-architecture-explained/" rel="noopener noreferrer"&gt;Image Source&lt;/a&gt;



&lt;p&gt;&lt;a href="https://opentelemetry.io/" rel="noopener noreferrer"&gt;OpenTelemetry&lt;/a&gt; is an open-source, vendor-neutral set of tools, APIs, SDKs with broad support for most languages and frameworks. It lets you collect telemetry data from your applications and send it to other tools for analysis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automatic instrumentation agents that can collect telemetry from some applications without requiring code changes&lt;/li&gt;
&lt;li&gt;Language-specific integrations for popular web frameworks that capture relevant traces and metrics&lt;/li&gt;
&lt;li&gt;OpenTelemetry Collector, which can collect data from OpenTelemetry SDKs and other sources, and then export this telemetry to any supported backend&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MVsMfvm7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fe3398bfa68bfd7bc95b006_image5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MVsMfvm7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fe3398bfa68bfd7bc95b006_image5.png" width="768" height="671"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;a href="https://blog.newrelic.com/product-news/what-is-opentelemetry/" rel="noopener noreferrer"&gt;Image Source&lt;/a&gt;



&lt;p&gt;Next up are some time series databases.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.datastax.com/" rel="noopener noreferrer"&gt;Datastax&lt;/a&gt; is a time series database that is built using Apache Cassandra (No SQL). Cassandra is widely used when time series data needs to be stored. It is preferred since it allows for easy scalability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DSE graph and DSE search&lt;/li&gt;
&lt;li&gt;Advanced replication and analytics&lt;/li&gt;
&lt;li&gt;Tiered storage and DSE multi-instance capabilities&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7xcHLyPJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fe339a9271924499fc0fb1b_image6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7xcHLyPJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fe339a9271924499fc0fb1b_image6.png" width="800" height="396"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;a href="https://www.bloorresearch.com/company/datastax/" rel="noopener noreferrer"&gt;Image Source&lt;/a&gt;



&lt;p&gt;&lt;a href="https://www.warp10.io/" rel="noopener noreferrer"&gt;Warp 10&lt;/a&gt; is a time series database that has its own analytics language and engine (Warpscript). It can be used to collect, store and analyse data. It is used in the aggregation and analysis of sensor data for IoT applications and others that require time sensitive data. Due to its GTS (Geo-timestamped) data, it is preferred for use in IoT.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;WarpLib, a library dedicated to sensor data analysis with more than 1000 functions and extension capabilities&lt;/li&gt;
&lt;li&gt;Standalone version can run on a Raspberry Pi as well as on a beefy server, with no external dependencies&lt;/li&gt;
&lt;li&gt;Integration with Pig, Spark, Flink, NiFi, Kafka Streams and Storm for batch and streaming analysis&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xiwVg0yw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fe339c4770a091c11b3a79d_image2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xiwVg0yw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fe339c4770a091c11b3a79d_image2.png" width="800" height="629"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;a href="https://blog.senx.io/when-do-you-need-a-timeseries-database/" rel="noopener noreferrer"&gt;Image Source&lt;/a&gt;



&lt;p&gt;Lastly, here are some preferred tools used for metrics collection.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.elastic.co/logstash" rel="noopener noreferrer"&gt;Logstash&lt;/a&gt; is a lightweight, open source, server-side data processing framework for storing, converting and transmitting data from a number of sources to their target destination. It ingests, converts and transmits data dynamically independent of their format or complexity. Logstash also has tight integration with &lt;a href="https://www.squadcast.com/blog/top-observability-tools-for-devops-engineers-and-sres#" rel="noopener noreferrer"&gt;Elasticsearch&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Seamless integration with Elasticsearch, Beats, and Kibana&lt;/li&gt;
&lt;li&gt;Logstash is completely free and the source code is available freely on &lt;a href="https://github.com/elastic/logstash" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Highly extensible - it is easy to create additional filters for Logstash&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6M_yc3fK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fe339e9f680599d1a3c3929_image1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6M_yc3fK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fe339e9f680599d1a3c3929_image1.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;a href="https://www.elastic.co/logstash" rel="noopener noreferrer"&gt;Image Source&lt;/a&gt;



&lt;p&gt;&lt;a href="https://kafka.apache.org/" rel="noopener noreferrer"&gt;Kafka&lt;/a&gt; is an open-source distributed event dissemination platform with support for high-performance data pipelines, streaming analytics, data integration, and more. It is widely used for mission critical applications for its zero message loss capabilities. Kafka is widely used by organisations in the insurance, banking, manufacturing, and telecom industries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kafka supports deriving new data streams using the data streams from producers&lt;/li&gt;
&lt;li&gt;The Kafka cluster can easily manage failures&lt;/li&gt;
&lt;li&gt;Kafka uses a distributed commit log, so messages are persisted to disk&lt;/li&gt;
&lt;/ul&gt;
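
&lt;p&gt;To illustrate the commit log idea from the list above, here is a toy Python sketch. It is not Kafka's API and it keeps records in memory rather than on disk; the class and method names are assumptions made up for this example. It only shows the core idea: records are appended and never modified, and each consumer tracks its own offset.&lt;/p&gt;

```python
class CommitLog:
    """Toy append-only log (illustrative only, not the Kafka client API)."""

    def __init__(self):
        self._records = []   # records are appended, never changed or removed
        self._offsets = {}   # consumer name -> next offset that consumer will read

    def append(self, message):
        """Add a record and return its offset in the log."""
        self._records.append(message)
        return len(self._records) - 1

    def poll(self, consumer, max_records=10):
        """Return the next batch for a consumer and advance only its own offset."""
        start = self._offsets.get(consumer, 0)
        batch = self._records[start:start + max_records]
        self._offsets[consumer] = start + len(batch)
        return batch

    def seek(self, consumer, offset):
        """Rewind or fast-forward a consumer; the log itself is untouched."""
        self._offsets[consumer] = offset


log = CommitLog()
for event in ["order-created", "order-paid", "order-shipped"]:
    log.append(event)

first_batch = log.poll("billing", max_records=2)
```

&lt;p&gt;Because reading does not remove anything, a second consumer can read the same records independently, and a consumer that failed can seek back and replay them.&lt;/p&gt;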

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IVCoEEOq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fe33a070d953b0389d2af08_image7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IVCoEEOq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fe33a070d953b0389d2af08_image7.png" width="800" height="229"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;a href="https://aws.amazon.com/msk/what-is-kafka/" rel="noopener noreferrer"&gt;Image Source&lt;/a&gt;



&lt;p&gt;You can never have enough visibility into your infrastructure. With the advent of microservices architecture, observability tools must rise to the challenge of discovering and analysing dependencies.&lt;/p&gt;

&lt;p&gt;This is not an exhaustive list of the available tools or of their features. As stated earlier, it is important to identify the kind of metrics you need to observe and understand how you can make this data more actionable before choosing an observability tool. You can also visit the respective websites to learn more about each tool and how it can help you.&lt;/p&gt;

&lt;p&gt;Regardless of the kind of platform you are running, we are sure that the tools listed here will be useful to you. On similar lines, for a more detailed look at the top monitoring tools used by DevOps/SREs, head over to this &lt;a href="https://www.squadcast.com/blog/top-monitoring-tools-for-devops-engineers-and-sres" rel="noopener noreferrer"&gt;blog&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://www.squadcast.com/" rel="noopener noreferrer"&gt;Squadcast&lt;/a&gt; is an incident management tool that ingests data from various monitoring sources and supports tooling in your techstack to provide actionable alerts, reduce MTTR and eliminate unplanned downtime. &lt;a href="https://app.squadcast.com/register" rel="noopener noreferrer"&gt;Try for free now&lt;/a&gt; or &lt;a href="https://calendly.com/renuka-squadcast/30min" rel="noopener noreferrer"&gt;schedule a demo&lt;/a&gt; to explore SRE best practices in incident management with better collaboration and transparency, increasing the overall reliability of your service.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://app.squadcast.com/register/" rel="noopener noreferrer"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cj1VUnAS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c51758c58939b30a6fd3d73/5e16013f80ad26b00925d758_image--5--1.png" width="800" height="304"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>sre</category>
      <category>devops</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>From SysAdmin to SRE: How to evolve your skillset</title>
      <dc:creator>Nir Sharma</dc:creator>
      <pubDate>Tue, 12 Jan 2021 11:38:26 +0000</pubDate>
      <link>https://dev.to/squadcast/from-sysadmin-to-sre-how-to-evolve-your-skillset-75a</link>
      <guid>https://dev.to/squadcast/from-sysadmin-to-sre-how-to-evolve-your-skillset-75a</guid>
      <description>&lt;p&gt;&lt;em&gt;Are you wondering what it takes to become an SRE from a SysAdmin background? Our latest blog covers the growth areas and technical skills needed to successfully transition to an SRE role.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The last decade has seen widespread adoption of SRE practices based on the best practices laid out by Google. Many SysAdmins have observed this trend and are now evaluating becoming SREs. This gives rise to the question: how much of a skills overlap is there between an SRE and a SysAdmin?&lt;/p&gt;

&lt;p&gt;Both roles are concerned with IT operations and there is a significant overlap in their respective responsibilities. Broadly, Google has defined SRE to be software engineering principles applied to IT operations at scale. What does this mean in reality? SRE is essentially applying some key principles to IT operations. It frequently involves using various technologies that may be new to some SysAdmins.&lt;/p&gt;

&lt;p&gt;In this blog we look at some of the growth areas and skills a SysAdmin needs to pick up to become an SRE. This transition requires some mindset changes and the acquisition of some new technical skills as well, but it shouldn't be difficult for an experienced SysAdmin. So here are some of the changes you need to bring about in your mindset and skills to successfully transition to an SRE role.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Mindset Changes&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Embracing Risk&lt;/strong&gt;&lt;br&gt;
As a SysAdmin the primary focus of your work has been to maintain order and keep the systems under your care running smoothly. SysAdmins have traditionally focused on keeping their infrastructure stable and secure and on eliminating any risk of failure. On the other hand, SREs recognize that some amount of failure is inevitable. &lt;a href="https://www.squadcast.com/blog/managing-technical-risk-effectively-with-error-budgets" rel="noopener noreferrer"&gt;Error budget&lt;/a&gt; is an SRE concept that quantifies the amount of downtime your infrastructure can have before you are in breach of an &lt;a href="https://www.squadcast.com/blog/choosing-slos-that-users-need-not-the-ones-you-want-to-provide" rel="noopener noreferrer"&gt;SLO (service level objective)&lt;/a&gt;. Armed with that knowledge, an SRE can decide to support agility and allow riskier changes or be more safety conscious and risk averse. This allows SREs to leverage risk for the benefit of the product rather than futilely attempting to eliminate risk and potentially becoming a bottleneck.&lt;/p&gt;
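
&lt;p&gt;To make the error budget idea concrete, here is a minimal Python sketch of the arithmetic. It assumes an availability SLO measured over a rolling 30-day window; the function names and the 30-day default are illustrative assumptions, not part of any particular tool.&lt;/p&gt;

```python
def error_budget_minutes(slo_percent, window_days=30):
    """Allowed downtime, in minutes, for an availability SLO over a window.

    Assumption: the SLO is a simple availability percentage (for example 99.9)
    and the window is a fixed number of days.
    """
    if not (slo_percent > 0 and 100 > slo_percent):
        raise ValueError("slo_percent must be greater than 0 and below 100")
    total_minutes = window_days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100


def budget_remaining(slo_percent, downtime_minutes, window_days=30):
    """Fraction of the error budget still unspent (negative once the SLO is breached)."""
    budget = error_budget_minutes(slo_percent, window_days)
    return (budget - downtime_minutes) / budget


budget = error_budget_minutes(99.9)                      # budget for a 30-day window
remaining = budget_remaining(99.9, downtime_minutes=10.8)
print(f"budget: {budget:.1f} min, remaining: {remaining:.0%}")
```

&lt;p&gt;Under these assumptions, a 99.9% SLO over 30 days leaves roughly 43 minutes of downtime to spend. A team with most of its budget intact can afford riskier releases, while a team close to zero should slow down and focus on stability.&lt;/p&gt;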

&lt;p&gt;&lt;strong&gt;Reducing Toil&lt;/strong&gt;&lt;br&gt;
Much of SRE concerns itself with removing toil. In this context, toil refers to those tasks that are repetitive and don't add any enduring value to the upkeep of your infrastructure. Removing toil often means automating those jobs that are repetitive and time-consuming. By limiting toil to at most half of their working time, an SRE frees up time to improve other aspects of the system. Improvements in system stability and performance are encouraged, and creative solutions can materialize. SysAdmins are all too familiar with the repetitive configuration of hardware and software to fit the needs of their organisation. Most mature SysAdmins have developed automation practices that work well within their org but are not standardised. As an SRE you are expected to know standardization practices that will work for organizations of all types and major tech stacks. Automation using software such as Puppet, Chef and Ansible helps minimise repetitive steps and frees SysAdmins for more substantive and thorough work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Automate all the things&lt;/strong&gt;&lt;br&gt;
Automation is a substantial aspect of good SRE practice. It is used to automate those tasks that have been identified as toil in the system. This can include running scripts when certain events occur, monitoring clusters, automating full-scale code deployments (Infrastructure as code) and auto configuring virtual machines in the cloud. SREs seek to automate to regulate their workload and to ensure that their workload does not increase linearly with the addition of users or machines they are maintaining. Some of the other benefits of automation include more reliable deployments, improved performance and overall cost reduction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dealing with failure: Understanding SLOs and blameless postmortems&lt;/strong&gt;&lt;br&gt;
SysAdmins are familiar with the RCA (Root Cause Analysis) process - when a failure occurs the root cause is identified, and a solution is put in place. However, the SRE best practices that Google has created go beyond root causes and are concerned with understanding the weaknesses in the system that led to the breakdown. &lt;a href="https://www.squadcast.com/blog/towards-more-effective-incident-postmortems#:~:text=Successful%20postmortems%20are%20blameless&amp;amp;text=A%20culture%20that%20seeks%20to,postmortem%20in%20the%20first%20place." rel="noopener noreferrer"&gt;Blameless postmortems&lt;/a&gt; encourage one to pick out flaws in the existing reporting and operational processes. Good SRE practices insist on keeping people in the loop when failure occurs, including your customers. This is a cultural shift for SysAdmins, as they rarely tend to keep customers in the loop when things go down. These practices also include a formal written incident post-mortem process. The conclusions from an incident post-mortem must then be fed back to the planning process for future deployments. Failure takes on a fresh perspective from an SRE’s viewpoint - it is an opportunity to learn from your mistakes and do better next time around.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Soft Skills&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;SRE culture demands much greater collaboration with other parts of the organisation. While SLOs bring greater transparency to operations, achieving consensus on those objectives and deciding on the next step can often be challenging. Business teams, product management, developers and SREs all have slightly different goals and incentives. Bridging the gap between these various stakeholder perspectives may require &lt;em&gt;conflict resolution skills&lt;/em&gt;. Explaining the trade-off between feature development and stability, and how Error Budgets can help decide the best result, requires &lt;em&gt;strong communication skills&lt;/em&gt;. Finally, good &lt;em&gt;negotiation skills&lt;/em&gt; will ensure that SRE goals are accepted in the face of pressure from Business, Product or Development.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Technical Skills&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Transitioning from being a SysAdmin to an SRE requires brushing up or acquiring various technical skills.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Programming &amp;amp; Testing Skills:&lt;/strong&gt; The emphasis on toil reduction and automation in SRE will require significantly stronger programming and testing skills. Typically an SRE should know one highly productive scripting language like &lt;a href="https://en.wikipedia.org/wiki/Python_(programming_language)" rel="noopener noreferrer"&gt;Python&lt;/a&gt; and one high performance systems language like &lt;a href="https://golang.org/" rel="noopener noreferrer"&gt;Go&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Infrastructure as Code:&lt;/strong&gt; Traditionally, infrastructure deployment is a slow, manual, labour intensive process. Because of this, it is expensive, inelastic, inconsistent and unreliable. &lt;a href="https://docs.microsoft.com/en-us/azure/devops/learn/what-is-infrastructure-as-code" rel="noopener noreferrer"&gt;Infrastructure as Code&lt;/a&gt; (IaC) is an automation technique that brings the rigor of software engineering to infrastructure management. Tools like &lt;a href="https://github.com/ansible/ansible" rel="noopener noreferrer"&gt;Ansible&lt;/a&gt;, &lt;a href="https://learn.hashicorp.com/terraform" rel="noopener noreferrer"&gt;Terraform&lt;/a&gt;, &lt;a href="https://puppet.com/" rel="noopener noreferrer"&gt;Puppet&lt;/a&gt; or &lt;a href="https://www.chef.io/" rel="noopener noreferrer"&gt;Chef&lt;/a&gt; can be used to power an IaC initiative.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cloud, Containers &amp;amp; Container Orchestration:&lt;/strong&gt; Cloud and container services make something that was previously difficult to automate -- physical hardware -- manageable via standardised APIs. As an added benefit, they are usually far cheaper, more flexible and faster to provision than traditional hardware. They have also made the IaC technique far more powerful and useful. Knowledge of &lt;a href="https://www.amazon.com/Amazon-Web-Services/e/B007R6MVQ6" rel="noopener noreferrer"&gt;Amazon AWS&lt;/a&gt;, &lt;a href="https://www.redhat.com/en/topics/containers/what-is-kubernetes" rel="noopener noreferrer"&gt;Kubernetes&lt;/a&gt; and &lt;a href="https://en.wikipedia.org/wiki/Docker_(software)" rel="noopener noreferrer"&gt;Docker&lt;/a&gt; is now considered a basic skill for SREs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Modern Monitoring Tools:&lt;/strong&gt; Active checking systems, metrics collection, and log aggregation have been the traditional mainstays of monitoring. More recently, code instrumentation and distributed tracing have been added to this arsenal. Older de facto standard tools like Nagios, Ganglia and rsyslog have been surpassed by tools like &lt;a href="https://prometheus.io/" rel="noopener noreferrer"&gt;Prometheus&lt;/a&gt;, &lt;a href="https://www.datadoghq.com/" rel="noopener noreferrer"&gt;Datadog&lt;/a&gt;, and the &lt;a href="https://www.elastic.co/what-is/elk-stack" rel="noopener noreferrer"&gt;ELK stack&lt;/a&gt;. APMs like &lt;a href="https://github.com/newrelic" rel="noopener noreferrer"&gt;New Relic&lt;/a&gt; are now key for instrumentation and &lt;a href="https://opentelemetry.io/" rel="noopener noreferrer"&gt;OpenTelemetry&lt;/a&gt; seems very promising as a distributed tracing tool. Familiarity with these platforms is a significant requirement for a good SRE.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Statistical Analysis:&lt;/strong&gt; SRE culture demands hard data to support decision making. With the vast volumes of data being generated by monitoring tools, some basic statistical analysis is necessary to generate actionable data. This data can be used for capacity planning, release planning, continuous improvement and incident response.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
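
&lt;p&gt;As an example of the basic statistical analysis mentioned in the list above, here is a small Python sketch that summarises request latencies. The sample data, the function names and the nearest-rank percentile method are assumptions chosen for illustration; monitoring tools may compute percentiles differently.&lt;/p&gt;

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms):
    """Return the figures most often used when reviewing latency data."""
    return {
        "mean": statistics.mean(latencies_ms),
        "median": statistics.median(latencies_ms),
        "p95": percentile(latencies_ms, 95),
        "p99": percentile(latencies_ms, 99),
    }


# Hypothetical data: 98 fast requests and 2 very slow ones.
latencies = [10] * 98 + [500, 900]
summary = summarize(latencies)
print(summary)
```

&lt;p&gt;With data shaped like this, the median hides the slow requests and the mean blurs them, while a high percentile such as p99 exposes them. That is why SLOs are often written against percentiles rather than averages.&lt;/p&gt;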

&lt;h3&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;SysAdmins and SREs are expected to be drivers of reliability and change that is beneficial to the customers. If you are a SysAdmin, you have doubtless carried out many operations at the systems level that will be invaluable to you as an SRE. The necessary areas of growth include learning to adapt to change, since the SRE practices in vogue today may very well change tomorrow. An SRE is someone who brings practices that have been a mainstay of software development at scale to the operations side. This crossover brings dividends to the organisation as they find solutions to recurrent problems without investing in more manpower and hardware. The future of SRE is bright as more organisations are seeking to cut costs and streamline their IT operations.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://www.squadcast.com/" rel="noopener noreferrer"&gt;Squadcast&lt;/a&gt; is an incident management tool that’s purpose-built for SRE. Your team can get rid of unwanted alerts, receive relevant notifications, work in collaboration using the virtual incident war rooms, and use automated tools like runbooks to eliminate toil.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://app.squadcast.com/register/" rel="noopener noreferrer"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cj1VUnAS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c51758c58939b30a6fd3d73/5e16013f80ad26b00925d758_image--5--1.png" width="800" height="304"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>sysadmin</category>
      <category>sre</category>
    </item>
    <item>
      <title>Squadcast's Year in Review, 2020</title>
      <dc:creator>Nir Sharma</dc:creator>
      <pubDate>Mon, 04 Jan 2021 11:11:44 +0000</pubDate>
      <link>https://dev.to/squadcast/squadcast-s-year-in-review-2020-1mh2</link>
      <guid>https://dev.to/squadcast/squadcast-s-year-in-review-2020-1mh2</guid>
      <description>&lt;p&gt;&lt;em&gt;Thank you for inspiring us this year! As is becoming a tradition, we've put together a collection of product updates, case studies &amp;amp; content, that tell the story of Squadcast in 2020.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;2020 has been a difficult year for all of us. As the world saw lockdowns and organisations pivoted to working from home, the need to provide uninterrupted services (be it for online shopping, streaming movies, or banking) increased exponentially. At &lt;a href="https://www.squadcast.com/" rel="noopener noreferrer"&gt;Squadcast&lt;/a&gt;, we are proud to have played our role in helping our customers get better insights into their systems so they can become even more reliable.&lt;/p&gt;

&lt;p&gt;This year we helped OTT providers, &lt;a href="https://www.squadcast.com/case-studies" rel="noopener noreferrer"&gt;financial institutions, and e-commerce&lt;/a&gt; among many other organisations, make reliability a core and fundamental process. We have also been recognized by G2 as Momentum Leader &amp;amp; High performer in the IT Incident Management &amp;amp; IT alerting space.&lt;/p&gt;

&lt;p&gt;As is becoming a tradition, we've put together a collection of product updates, case studies, and content, that tell the story of the past year at Squadcast.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;&lt;a href="https://www.squadcast.com/" rel="noopener noreferrer"&gt;Squadcast's Year in Review&lt;/a&gt;&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;For us, the key product highlights of the year have been:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Reducing Alert Noise&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;(a)&lt;/strong&gt; &lt;a href="https://support.squadcast.com/docs/alert-suppression" rel="noopener noreferrer"&gt;Improved Alert Suppression Rules&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As part of our continuous improvement process, we rolled out completely new and improved Suppression Rules. This latest version provides more flexibility and fine control over the nature of alerts being suppressed, thus preventing alert fatigue by curbing unnecessary or non-actionable alerts.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UOHRAbRp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fe449508011fb5f8a7f2348_image23.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UOHRAbRp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fe449508011fb5f8a7f2348_image23.png" width="800" height="369"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;(b)&lt;/strong&gt; &lt;a href="https://headwayapp.co/squadcast-updates/service-dependency-based-deduplication-162440" rel="noopener noreferrer"&gt;Service Dependency Based Deduplication&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;(c)&lt;/strong&gt; &lt;a href="https://headwayapp.co/squadcast-updates/incident-status-based-deduplication-162438" rel="noopener noreferrer"&gt;Incident Status Based Deduplication&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. &lt;a href="https://apidocs.squadcast.com/?version=latest" rel="noopener noreferrer"&gt;Squadcast API V3&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We released the latest version of Squadcast Public API which helps you access Squadcast features within your account. It allows you to configure users, services, group incidents, and more to help further reduce noise and speed up response times.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. &lt;a href="https://headwayapp.co/squadcast-updates/improvement-escalation-policies-144884" rel="noopener noreferrer"&gt;Improvement: Escalation Policies&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. &lt;a href="https://headwayapp.co/squadcast-updates/view-who-is-on-call-147871" rel="noopener noreferrer"&gt;View Who is On-call&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. &lt;a href="https://headwayapp.co/squadcast-updates/schedule-overrides-147864" rel="noopener noreferrer"&gt;Schedule Overrides&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. &lt;a href="https://apidocs.squadcast.com/?version=latest" rel="noopener noreferrer"&gt;New REST APIs - Incidents &amp;amp; Squads&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We've published a few more REST APIs to help access some basic Incident and Squads functionalities within your Squadcast Account.&lt;/p&gt;

&lt;p&gt;Play around with our pre-made Squadcast APIs using our Postman collection.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7. &lt;a href="https://support.squadcast.com/docs/incident-list-table-view" rel="noopener noreferrer"&gt;Incident List Table View&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We released a beta version of our Incident List Table View which provides you with a single-pane view of all incidents along with their relevant information.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Ythv07sC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fe449e4e084a83d1d9dd570_image1%2520%281%29.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Ythv07sC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fe449e4e084a83d1d9dd570_image1%2520%281%29.png" width="800" height="402"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;8. &lt;a href="https://support.squadcast.com/docs/service-specific-slack-channel" rel="noopener noreferrer"&gt;Service Specific Slack Channels&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;9. &lt;a href="https://headwayapp.co/squadcast-updates/export-incidents-166132" rel="noopener noreferrer"&gt;Export Incidents&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;10. &lt;a href="https://support.squadcast.com/docs/import-users" rel="noopener noreferrer"&gt;Import users&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For a more comprehensive list of features and alert source integrations we shipped, check out our &lt;a href="https://headwayapp.co/squadcast-updates" rel="noopener noreferrer"&gt;Updates Page&lt;/a&gt;! To see what else we have planned for next year, check out our public &lt;a href="http://bit.ly/squadcast-roadmap" rel="noopener noreferrer"&gt;Product Roadmap&lt;/a&gt;. We’re always happy to hear new ideas and feature requests from the community - you can &lt;a href="mailto:support@squadcast.com"&gt;write&lt;/a&gt; to us.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Squadcast Impact&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UvBpgm4t--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fe44a1d2e6bd7bd136f6e5f_casestudy-collage-004.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UvBpgm4t--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fe44a1d2e6bd7bd136f6e5f_casestudy-collage-004.png" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Read our customer stories to learn how organisations integrated our streamlined incident management platform and benefited from it. Squadcast helped increase productivity &amp;amp; improve MTTR for teams across industries and of every size.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.squadcast.com/case-studies" rel="noopener noreferrer"&gt;&lt;center&gt;Read Case Studies&lt;/center&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Recognition&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Our commitment to quality has not gone unnoticed. We have been acknowledged by G2 as Momentum Leader in the Incident Management and IT alerting space. G2 uses verified customer reviews, social and web data to determine winners in each quadrant. This is in addition to awards in the following fields:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Qqf5mJaM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c51758c58939b30a6fd3d73/5ff2cd329f2331ce16b6aab0_g2-01.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Qqf5mJaM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c51758c58939b30a6fd3d73/5ff2cd329f2331ce16b6aab0_g2-01.png" width="800" height="169"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It would be remiss not to mention that this year was one of hardship for many small business owners facing an economic downturn and for everyone else affected by the pandemic. From the bottom of our hearts, thank you for trusting us and inspiring us. We feel more fortunate than ever and are excited about what's ahead in 2021!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ALZbx7W5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fe47e1fb7e629c29eccc8fa_nesletter-collage-005.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ALZbx7W5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fe47e1fb7e629c29eccc8fa_nesletter-collage-005.png" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;
2020 gave us tremendous product growth and many new teams depending on Squadcast for Reliability.&lt;br&gt;Also, our team grew by 2x :)



&lt;p&gt;Want to be part of our awesome team? We are &lt;a href="https://squadcast.breezy.hr/" rel="noopener noreferrer"&gt;hiring&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We couldn’t have made it this far without our incredible team and our super helpful community.&lt;/p&gt;

&lt;p&gt;Happy holidays and a beautiful new year from all of us here at &lt;a href="https://www.squadcast.com/" rel="noopener noreferrer"&gt;Squadcast&lt;/a&gt;! :)&lt;/p&gt;

</description>
      <category>events</category>
      <category>productupdates</category>
    </item>
    <item>
      <title>How to SRE without an SRE on your team</title>
      <dc:creator>Nir Sharma</dc:creator>
      <pubDate>Mon, 07 Dec 2020 10:58:22 +0000</pubDate>
      <link>https://dev.to/squadcast/how-to-sre-without-an-sre-on-your-team-2fdm</link>
      <guid>https://dev.to/squadcast/how-to-sre-without-an-sre-on-your-team-2fdm</guid>
      <description>&lt;p&gt;&lt;em&gt;Are terms like “Error budgets” and SLOs roadblocks on your way to adopting SRE practices for your organisation? Our latest blog talks of "How to SRE without an SRE on your team", where we look at some of the most elementary SRE concepts that you can start implementing right away! We help you pick SLOs, identify toil and touch base on Automation for SREs along with few best practices to get you started on your SRE journey.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;An organisation with mature Site Reliability Engineering (SRE) principles may conjure images of engineers with years of experience in DevOps and System Administration, having a suite of specialised tools and experts dissecting each service outage. For an organisation that is thinking of implementing SRE principles this is an intimidating image and may seem unattainable. The truth is everyone can get started on their SRE journey by following a few elementary principles, which are outlined here. While we are not claiming that this is the only way to go forward when you don't have an SRE as a job title or role in your team, it's a good place to start.&lt;/p&gt;

&lt;p&gt;In this blog, we go over some of the most basic steps you can take in your journey towards SRE adoption. Unlocking the full value of SRE practices, however, requires a deeper commitment than just investing in tools or training. Organisation-wide cultural change is needed as well. We also look at some of the ways you can implement SRE principles, including learning about error budgets, automation, and more. At the heart of great SRE practice lies a willingness to break down existing silos, to communicate and coordinate across cross-functional teams, and to automate redundant processes.&lt;/p&gt;

&lt;p&gt;Now that you have decided to adopt some SRE practices, the question comes up: where do I start? To start off, we look at some of the most useful concepts and practices in SRE and how they make deployments seamless and your engineers happy.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Error Budgets&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;An &lt;a href="https://www.squadcast.com/blog/managing-technical-risk-effectively-with-error-budgets" rel="noopener noreferrer"&gt;error budget&lt;/a&gt; can be defined as the maximum amount of time your system can be down without facing consequences. These consequences can be derived from external legal agreements (SLAs) or from internal organisational goals (SLOs). Error budgets are important because they let your development and IT operations teams make sure that shipping new features does not come at the cost of downtime and inconvenience for your users. For example, if your product is running faultlessly, your developers can spend the error budget to create and deploy more features. If you have run out of error budget for the month, then publishing new features takes a backseat until operational issues have been resolved.&lt;/p&gt;

&lt;p&gt;Deciding the error budget will depend on the kind of service you are providing. Are you running a banking platform that needs to be available almost 24/7? Are you running an OTT platform specialising in live streaming sports? An effective error budget must also take into account external problems that you have no control over, such as internet connectivity going down, your remote machines becoming the victim of a DDoS attack, and similar unforeseen events. Your error budget will be derived from the SLOs that you have picked. In the next section of this blog, we look at how you can pick SLOs that would be most helpful for your organisation.&lt;/p&gt;

&lt;p&gt;An error budget is calculated with the following formula.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Error Budget = (1 - SLO of the service)&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For instance, a service with a 99.9% SLO has a 0.1% error budget.&lt;/p&gt;
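
&lt;p&gt;To make the formula concrete, here is a minimal Python sketch that turns an SLO into an allowed amount of downtime. The function names and the 30-day window are assumptions made for illustration; pick the window that matches how you report on your SLOs.&lt;/p&gt;

```python
# Illustrative sketch only: the function names and the 30-day default
# window are assumptions, not part of any Squadcast or Google tooling.

def error_budget_fraction(slo_percent: float) -> float:
    """Error budget = 1 - SLO, with the SLO given as a percentage."""
    if slo_percent > 100 or 0 > slo_percent:
        raise ValueError("SLO must be between 0 and 100 percent")
    return 1 - slo_percent / 100


def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Downtime that the error budget permits over the window, in minutes."""
    window_minutes = window_days * 24 * 60
    return error_budget_fraction(slo_percent) * window_minutes


def budget_remaining_minutes(slo_percent: float,
                             downtime_so_far_minutes: float,
                             window_days: int = 30) -> float:
    """Budget left after subtracting downtime already incurred (may go negative)."""
    return allowed_downtime_minutes(slo_percent, window_days) - downtime_so_far_minutes


if __name__ == "__main__":
    # A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime.
    print(round(allowed_downtime_minutes(99.9), 1))
    # After a 30-minute outage, roughly 13.2 minutes of budget remain.
    print(round(budget_remaining_minutes(99.9, 30), 1))
```

&lt;p&gt;A team without SREs could run a check like this before each release and hold back riskier changes once the remaining budget approaches zero.&lt;/p&gt;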

&lt;p&gt;While error budgets are usually used as an unambiguous criterion for SREs to accept changes from the development team, they are still useful for organisations without SREs. Development teams can use the error budget to decide whether to deploy changes and whether to work on riskier features.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Measuring your service (SLOs, SLIs and SLAs)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The adoption of SRE practices should ideally evolve organically from your existing processes. The first question to ask is “Which metrics are most important to my customers?”. If you run a customer-facing website, the most common would be uptime, latency and volume. Those important measurements can become your SLIs (service level indicators) and SLOs (service level objectives). SLOs should ideally shape up as a natural outcome of customer requirements. It is of course tempting to let your IT staff determine your SLOs, but for a more sustainable solution the SLOs must come from things that directly impact your customers. The easiest way to pick SLOs is to work backwards from your business objectives. The metric you pick must be something that you can improve upon and something that does not depend on external factors. It is always better to start with fewer SLOs and pick the ones that you feel are most important for your product. The vital thing to remember here is that the SLO should be quantifiable.&lt;/p&gt;

&lt;p&gt;“SLOs have a formidable use as metric-based indicators that show you what needs to be improved in your systems, its capabilities, and where you can get your best “bang for buck” when it comes to focusing your work efforts. However, SLOs must be influenced by data, and that data can only come from your customers. A lot of IT professionals tend to think that they know the best metrics, and they do; &lt;strong&gt;the only problem is that they are the best metrics for monitoring systems, not for improving customer satisfaction.&lt;/strong&gt;” says Adam Hammond, DevOps Engineer at Megaport, in his extensive blog called “&lt;a href="https://www.squadcast.com/blog/choosing-slos-that-users-need-not-the-ones-you-want-to-provide" rel="noopener noreferrer"&gt;Choosing SLOs that users need, not the ones you want to provide&lt;/a&gt;"&lt;/p&gt;

&lt;p&gt;Finally, when you define your SLOs, remember that a good SLO should be &lt;strong&gt;S.M.A.R.T.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Specific:&lt;/strong&gt; the SLO should clearly and explicitly state what it measures (e.g. we want to measure availability by testing whether a request can be made to the backend server, and not that we want it to be up).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Measurable:&lt;/strong&gt; the SLO should be something that can be calculated easily (for example, the latency of the disk should be less than 5ms, not the disk should be quick to load data).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Achievable:&lt;/strong&gt; you should be able to fulfill the SLOs (e.g. if a service has an SLO of 95 percent, you cannot promise 100 percent).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Relevant:&lt;/strong&gt; your SLO should correspond to the user experience (e.g. an appropriate metric for a web server is response time, not CPU activity).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Timebound:&lt;/strong&gt; an SLO should cover a timeframe that is suitable for when your users operate your system (e.g. if your users only use your system between 9 AM and 5 PM, a 24-hour SLO will be counterproductive).&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
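
&lt;p&gt;As a rough illustration of a specific, measurable SLO, the sketch below computes an availability SLI from request counts and compares it with a target. The &lt;code&gt;RequestCounts&lt;/code&gt; type, the function names and the sample numbers are hypothetical; in practice the counts would come from your monitoring system.&lt;/p&gt;

```python
# Illustrative sketch only: RequestCounts, availability_sli and meets_slo
# are hypothetical names, and the sample numbers are made up.
from dataclasses import dataclass


@dataclass
class RequestCounts:
    total: int  # all requests seen in the measurement window
    good: int   # requests that succeeded (e.g. non-5xx, within the latency target)


def availability_sli(counts: RequestCounts) -> float:
    """SLI as the percentage of good requests; 100.0 when there was no traffic."""
    if counts.total == 0:
        return 100.0
    return 100.0 * counts.good / counts.total


def meets_slo(counts: RequestCounts, slo_percent: float) -> bool:
    """True when the measured SLI is at or above the SLO target."""
    return availability_sli(counts) >= slo_percent


if __name__ == "__main__":
    business_hours = RequestCounts(total=20000, good=19970)
    print(availability_sli(business_hours))   # percentage of good requests
    print(meets_slo(business_hours, 99.9))    # compare against a 99.9% target
```

&lt;p&gt;Restricting the counts to the hours your users are active, as in the timebound example above, is a matter of choosing which requests feed into the calculation.&lt;/p&gt;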

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ycZMKaIK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fbcda5db82f6508d8c43351_image1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ycZMKaIK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fbcda5db82f6508d8c43351_image1.png" width="800" height="408"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Additional reading:&lt;/strong&gt; A narrative case study -“&lt;a href="https://www.squadcast.com/blog/how-small-changes-to-your-slos-can-be-smart-for-your-business-a-narrative-case-study" rel="noopener noreferrer"&gt;How small changes to your SLOs can be SMART for your business&lt;/a&gt;"&lt;/p&gt;

&lt;p&gt;SLOs are a great way to generate metrics about what is important to the business or its customers. Those insights are critical even if you don’t have a separate SRE team.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Toil&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;If you have been running a production system, then toil will be familiar to you. Google’s SRE book defines Toil as “The kind of work that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.” All organisations that are scaling up their product have had to deal with toil at some point of their expansion. The first step to tackling toil in your organisation is to identify it properly. Sometimes this can be challenging, as toil is often disguised as seemingly important work. A significant aspect of good SRE implementation is recognising toil early and automating it away. There is a cultural aspect to toil as well. Many teams may have included “busy work” in their schedules without realising its potential for long-term damage. Naturally, all production systems require human intervention to run optimally, but the number of people managing them should not grow linearly with the addition of every new user, virtual machine or service. Toil has a detrimental effect on your engineering team's morale and productivity as well.&lt;/p&gt;

&lt;p&gt;Here are some characteristics of the toil commonly faced by growing organisations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Manual - Any task your engineers are performing by hand. For example, you may have a script that automates some tasks, but manually executing the script every time is still “toil”.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Lacking any enduring value - Things that constitute toil do not add any lasting value to your project.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Repetitive - Things that are repeated frequently. This can include manually configuring production servers before every deployment.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An effective way of identifying toil is to survey your engineers. Here are some sample questions you can ask your engineering team to pinpoint problem areas.&lt;/p&gt;

&lt;p&gt;Q: Approximately speaking over the past month, how much of your time was spent on toil?&lt;/p&gt;

&lt;p&gt;Q: What according to you were the top five sources of toil?&lt;/p&gt;

&lt;p&gt;Q: Is there toil that can be automated but you are not able to due to cultural reasons?&lt;/p&gt;

&lt;p&gt;Without a separate SRE team, toil is doubly dangerous: not only is your limited manpower being wasted on non-value-added activity, your engineers are probably spending more time on it than an experienced SRE would, since it’s work that they don’t specialize in.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Automation for SREs&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Automation is a substantial aspect of good SRE practices. It is used to eliminate the tasks that have been identified as toil in the system. This can include running scripts when certain events occur, monitoring clusters, automating full-scale code deployments (Infrastructure as Code) and auto-configuring virtual machines in the cloud. SREs automate to keep their workload in check and to ensure that it does not increase linearly with the number of users or machines they are maintaining.&lt;/p&gt;

&lt;p&gt;Automation helps in three ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;By being orders of magnitude faster than the equivalent manual effort, thus saving precious manpower.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;By being reliable. Once stable, automation is not subject to human error and can be relied upon to execute correctly every single time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;By being repeatable. Automation can ensure operations are 100% consistent across your infrastructure. Consistent behaviour is cheaper and easier to manage.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With good automation in place, you can delay the need for specialist SREs and infrastructure people.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Conclusion: Next steps&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Here are some other best practices that are good to follow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Have proper rollbacks in place&lt;/em&gt; - When a faulty deployment breaks things, it is easier to roll back to the last state in which things were working fine. Having proper rollbacks in place will save you hours of engineering time and productivity that would otherwise be lost.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;&lt;a href="https://www.squadcast.com/blog/things-to-do-to-make-on-call-less-stressful" rel="noopener noreferrer"&gt;Manage stress and burnout&lt;/a&gt;&lt;/em&gt; - Dealing with on-call stress, unhappy customers and teams is part and parcel of the SRE journey. There is always pressure on on-call engineers to resolve issues quickly; however, resolution can sometimes take days. The onus is on you to create a stress-free environment for your team with the help of SRE practices, and to build a culture of reliability that takes pride in shipping great software.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Don’t update code if you won’t be available for the next two days&lt;/em&gt; - This one is self-explanatory: if things go wrong while you are away, it will be much harder to get your system running again. Once code has been deployed, it's best to stay around and confirm that it is working smoothly in production.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;&lt;a href="https://www.squadcast.com/blog/keeping-your-teams-and-customers-in-the-loop-during-downtime" rel="noopener noreferrer"&gt;Keep your customers in the loop&lt;/a&gt;&lt;/em&gt; - Customer-facing teams like marketing, sales and support must always be aware of the limitations of your product. This ensures that your customers have realistic expectations of the product. Any website with over a thousand users has experienced breakdowns. These breakdowns usually prompt angry emails from your customers, especially if the site is down for unscheduled maintenance or if the breakdown in service is of a critical nature.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Teams and organisations of all sizes can begin adopting SRE best practices. If you are anticipating rapid expansion of your product or are plagued by frequent breakdowns, exploring the SRE process makes sense. It is important to remember that SRE is a journey and what works for others may not be the best solution for you. While it’s vital to learn from others, every organisation has its own unique journey towards adopting SRE. Once you adopt a dedicated incident management platform, it becomes easier to track the problem areas in your digital infrastructure. This includes learning from important metrics like MTTA (Mean time to acknowledge), &lt;a href="https://www.squadcast.com/blog/how-squadcast-actions-help-you-reduce-mttr" rel="noopener noreferrer"&gt;MTTR (Mean time to resolve)&lt;/a&gt; and others. This transition to a dedicated platform is the next big step on your SRE journey. More than just a set of metrics and practices, SRE envisions a culture where problems are resolved in a “blameless” environment, a culture where issues can be raised and fixed in a transparent manner.&lt;/p&gt;
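
&lt;p&gt;If you are not on a dedicated platform yet, MTTA and MTTR can still be approximated from incident timestamps you already record. The sketch below is a simplified illustration: the &lt;code&gt;Incident&lt;/code&gt; record and its field names are hypothetical, and the sample timestamps are made up.&lt;/p&gt;

```python
# Illustrative sketch only: the Incident record and its field names are
# hypothetical, not the schema of any particular incident management tool.
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Incident:
    triggered: datetime
    acknowledged: datetime
    resolved: datetime


def mtta_minutes(incidents: list) -> float:
    """Mean time to acknowledge: average of (acknowledged - triggered)."""
    return mean((i.acknowledged - i.triggered).total_seconds() for i in incidents) / 60


def mttr_minutes(incidents: list) -> float:
    """Mean time to resolve: average of (resolved - triggered)."""
    return mean((i.resolved - i.triggered).total_seconds() for i in incidents) / 60


if __name__ == "__main__":
    sample = [
        Incident(datetime(2020, 12, 1, 9, 0), datetime(2020, 12, 1, 9, 2), datetime(2020, 12, 1, 9, 30)),
        Incident(datetime(2020, 12, 2, 14, 0), datetime(2020, 12, 2, 14, 4), datetime(2020, 12, 2, 14, 50)),
    ]
    print(mtta_minutes(sample))  # acknowledged after 2 and 4 minutes
    print(mttr_minutes(sample))  # resolved after 30 and 50 minutes
```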

&lt;p&gt;This article is inspired by a talk originally given at LISA'19 by Squadcast with the same title "&lt;a href="https://www.slideshare.net/mobile/squadcastHQ/how-to-sre-when-you-have-no-sre" rel="noopener noreferrer"&gt;How to SRE without an SRE on your team&lt;/a&gt;".&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Some useful resources for SRE:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/blog/products/devops-sre/how-to-start-and-assess-your-sre-journey" rel="noopener noreferrer"&gt;Do you have an SRE team yet? How to start and assess your journey&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.coursera.org/learn/site-reliability-engineering-slos" rel="noopener noreferrer"&gt;Site Reliability Engineering: Measuring and Managing Reliability&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/squadcastHQ/awesome-sre-tools" rel="noopener noreferrer"&gt;Awesome SRE Tools&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/blog/products/devops-sre" rel="noopener noreferrer"&gt;DevOps &amp;amp; SRE&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://landing.google.com/sre/workbook/toc/" rel="noopener noreferrer"&gt;Site Reliability Engineering&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://sre.google/" rel="noopener noreferrer"&gt;What is Site Reliability Engineering (SRE)?&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What do you struggle with as a DevOps/SRE? Do you have ideas on how incident response could be done better in your organization? We would be thrilled to hear from you! Leave us a comment or reach out over a DM via &lt;a href="https://twitter.com/squadcasthq?lang=en" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt; and let us know your thoughts.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://www.squadcast.com/" rel="noopener noreferrer"&gt;Squadcast&lt;/a&gt; is an incident management tool that’s purpose-built for SRE. Your team can get rid of unwanted alerts, receive relevant notifications, work in collaboration using the virtual incident war rooms, and use automated tools like runbooks to eliminate toil.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://app.squadcast.com/register/" rel="noopener noreferrer"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cj1VUnAS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c51758c58939b30a6fd3d73/5e16013f80ad26b00925d758_image--5--1.png" width="800" height="304"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>sre</category>
      <category>bestpractices</category>
    </item>
    <item>
      <title>Top Open Source projects for SREs and DevOps</title>
      <dc:creator>Nir Sharma</dc:creator>
      <pubDate>Thu, 03 Dec 2020 11:07:25 +0000</pubDate>
      <link>https://dev.to/squadcast/top-open-source-projects-for-sres-and-devops-45ie</link>
      <guid>https://dev.to/squadcast/top-open-source-projects-for-sres-and-devops-45ie</guid>
      <description>&lt;p&gt;&lt;em&gt;Building scalable and highly reliable software systems is the ultimate goal of every SRE out there. Follow the path of continuous learning with the help of our latest blog which outlines some of the most sought out open source projects in the monitoring, deployment &amp;amp; maintenance space.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The path to becoming a successful SRE lies in continuous learning. There are a plethora of great open source projects out there for SREs/DevOps, each with new and exciting implementations and often tackling unique challenges. These open-source projects do the heavy lifting so you can do your job more easily.&lt;/p&gt;

&lt;p&gt;In this blog we look at some of the most sought-after open source projects in the areas of monitoring, deployment and maintenance. Among the projects we have covered are those that simulate network traffic and allow you to model unpredictable (chaotic) events to develop dependable systems.&lt;/p&gt;

&lt;p&gt;And, while you are at it, we thought we could help a little more by providing some essential &lt;a href="https://www.squadcast.com/blog/must-read-devops-sre-books-for-all-engineers" rel="noopener noreferrer"&gt;DevOps and SRE reading suggestions&lt;/a&gt; as well for all you tech folks out there.&lt;/p&gt;

&lt;p&gt;We hope this keeps you good company.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;&lt;a href="https://github.com/google/cloudprober" rel="noopener noreferrer"&gt;Cloudprober&lt;/a&gt;&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Cloudprober is an active &lt;strong&gt;tracking&lt;/strong&gt; and monitoring application that helps you spot malfunctions before your customers do. It uses an "active" monitoring model to check that your components are operating as intended. It runs probes proactively, for instance, to verify that your frontends can access your backends. Similarly, a probe can be run to verify that your on-premise systems can actually reach your in-Cloud VMs. This method of tracking makes it easy to monitor your application's interfaces independently of their implementation and lets you quickly pin down what is broken in your system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Native integration with the open source monitoring stack of &lt;a href="https://prometheus.io/" rel="noopener noreferrer"&gt;Prometheus&lt;/a&gt; and &lt;a href="https://grafana.com/" rel="noopener noreferrer"&gt;Grafana&lt;/a&gt;. Cloudprober can export probe results as well.&lt;/li&gt;
&lt;li&gt;Automatic target discovery for cloud targets. Out-of-the-box support is provided for GCE and Kubernetes; other cloud services can be easily configured.&lt;/li&gt;
&lt;li&gt;Strong focus on ease of deployment. Cloudprober is written entirely in &lt;a href="https://golang.org/" rel="noopener noreferrer"&gt;Go&lt;/a&gt; and compiles into a static binary. It can be deployed quickly by way of docker containers. Thanks to the automatic target discovery, there is normally no need to re-deploy or reconfigure Cloudprober in response to most changes.&lt;/li&gt;
&lt;li&gt;The Cloudprober docker image is small, containing only a statically compiled binary, and it requires very little CPU and RAM to run even a large number of probes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lZYPzXtU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fae5bd946575b4c9affe61b_1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lZYPzXtU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fae5bd946575b4c9affe61b_1.png" width="608" height="484"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;a href="https://github.com/google/cloudprober" rel="noopener noreferrer"&gt;Image Source&lt;/a&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;&lt;a href="https://github.com/GoogleCloudPlatform/cloud-ops-sandbox" rel="noopener noreferrer"&gt;Cloud Operations Sandbox (Alpha)&lt;/a&gt;&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Cloud Operations Sandbox is an open-source platform that lets practitioners learn about Google's Site Reliability Engineering practices and adapt them to their cloud systems using Ops Management (formerly &lt;a href="https://cloud.google.com/products/operations" rel="noopener noreferrer"&gt;Stackdriver&lt;/a&gt;). It is based on the Hipster Shop, a cloud-native microservices demo application. Note: This requires a Google cloud services account.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Demo Service - an application designed on a modern, cloud-native, microservice architecture.&lt;/li&gt;
&lt;li&gt;One-click deployment - a script handles the work of deploying the service to Google Cloud Platform.&lt;/li&gt;
&lt;li&gt;Load Generator - a component that produces simulated traffic against the demo service.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--O2iHJC3W--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fae5c14d10c2b76716fbe34_2.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--O2iHJC3W--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fae5c14d10c2b76716fbe34_2.jpg" width="800" height="379"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;a href="https://github.com/GoogleCloudPlatform/cloud-ops-sandbox" rel="noopener noreferrer"&gt;Image Source&lt;/a&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;&lt;a href="https://github.com/jetstack/version-checker#:~:text=version%2Dchecker%20is%20a%20Kubernetes,This%20tool%20is%20currently%20experimental." rel="noopener noreferrer"&gt;Version Checker for Kubernetes&lt;/a&gt;&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Version Checker is a Kubernetes utility that lets you observe the versions of the images currently running in a cluster. It can also show the current image versions in table format on a Grafana dashboard.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multiple self-hosted registries can be set up at once.&lt;/li&gt;
&lt;li&gt;Version information is exposed as Prometheus metrics.&lt;/li&gt;
&lt;li&gt;Support for registries such as ACR, Docker Hub and ECR.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xZ7wmc4c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fae46c71a5a0a24cfccbd8d_image6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xZ7wmc4c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fae46c71a5a0a24cfccbd8d_image6.png" width="800" height="581"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;a href="https://github.com/jetstack/version-checker" rel="noopener noreferrer"&gt;Image Source&lt;/a&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;&lt;a href="https://istio.io/" rel="noopener noreferrer"&gt;Istio&lt;/a&gt;&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Istio is an open platform that provides a uniform way to integrate microservices, manage traffic flow across them, enforce policies and aggregate telemetry data. Istio's control plane offers an abstraction layer over the underlying cluster management platform, such as Kubernetes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automatic load balancing for HTTP, gRPC, WebSocket, and TCP traffic.&lt;/li&gt;
&lt;li&gt;Fine-grained control of traffic behavior with rich routing rules, retries, failovers, and fault injection.&lt;/li&gt;
&lt;li&gt;A pluggable policy layer and configuration API supporting access controls, rate limits and quotas.&lt;/li&gt;
&lt;li&gt;Automatic metrics, logs, and traces for all traffic within a cluster, including cluster ingress and egress.&lt;/li&gt;
&lt;li&gt;Secure service-to-service communication in a cluster with strong identity-based authentication and authorization.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2D0zUwV8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fae5c3ad3b5421ccfa359f7_3.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2D0zUwV8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fae5c3ad3b5421ccfa359f7_3.jpg" width="800" height="593"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;a href="https://istio.io/" rel="noopener noreferrer"&gt;Image Source&lt;/a&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;&lt;a href="https://www.checkov.io/" rel="noopener noreferrer"&gt;Checkov&lt;/a&gt;&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Checkov is a static code analysis tool for Infrastructure-as-Code. It scans cloud infrastructure defined with Terraform, CloudFormation, Kubernetes, Serverless or ARM Templates, and detects security and compliance misconfigurations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;More than 400 built-in policies cover security and compliance best practices for AWS, Azure and Google Cloud.&lt;/li&gt;
&lt;li&gt;Evaluates Terraform provider settings to monitor the development, maintenance and updates of Terraform-managed IaaS, PaaS or SaaS.&lt;/li&gt;
&lt;li&gt;Detects AWS credentials in EC2 Userdata, Lambda environment variables and Terraform providers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7Rljw88k--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fae46b8a8cf376b15c27f31_image3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7Rljw88k--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fae46b8a8cf376b15c27f31_image3.png" width="800" height="842"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;a href="https://www.checkov.io/" rel="noopener noreferrer"&gt;Image Source&lt;/a&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;&lt;a href="https://github.com/litmuschaos/litmus" rel="noopener noreferrer"&gt;Litmus&lt;/a&gt;&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Cloud-Native Chaos Engineering&lt;/p&gt;

&lt;p&gt;Litmus is a toolkit for cloud-native chaos engineering. It provides tools to orchestrate chaos on Kubernetes to help SREs discover weaknesses in their deployments. SREs use Litmus to run chaos experiments first in staging and eventually in production to discover bugs and vulnerabilities. Fixing these weaknesses leads to improved system resilience.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Developers can run chaos experiments during application development as an extension of unit or integration testing.&lt;/li&gt;
&lt;li&gt;CI pipeline builders can run chaos as a pipeline stage to find bugs when the application is subjected to failure paths.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xlzj7EIl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fae5c80e9ad215032686770_4.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xlzj7EIl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fae5c80e9ad215032686770_4.jpg" width="800" height="652"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;a href="https://github.com/litmuschaos/litmus" rel="noopener noreferrer"&gt;Image Source&lt;/a&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;&lt;a href="https://github.com/locustio/locust" rel="noopener noreferrer"&gt;Locust&lt;/a&gt;&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Locust is an easy-to-use, scriptable and flexible performance testing tool. You define the behaviour of your users in standard Python code, instead of using a clunky UI or a domain-specific language. This makes Locust extensible and developer-friendly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Locust is distributed &amp;amp; scalable - easily supporting hundreds of thousands of users.&lt;/li&gt;
&lt;li&gt;Web-based UI that shows progress in real-time.&lt;/li&gt;
&lt;li&gt;Can test any system with a little tinkering.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--v_1Qdu3X--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fae487f1a5a0a3f0eccc78d_image1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--v_1Qdu3X--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fae487f1a5a0a3f0eccc78d_image1.png" width="800" height="319"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;a href="https://github.com/locustio/locust" rel="noopener noreferrer"&gt;Image Source&lt;/a&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;&lt;a href="https://github.com/prometheus/prometheus" rel="noopener noreferrer"&gt;Prometheus&lt;/a&gt;&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Prometheus, a &lt;a href="https://www.cncf.io/" rel="noopener noreferrer"&gt;Cloud Native Computing Foundation&lt;/a&gt; project, is a systems and service monitoring system. It collects metrics from configured targets at given intervals, evaluates rule expressions and displays the results. It can trigger alerts when specified conditions are observed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A multi-dimensional data model (time series defined by metric name and set of key/value dimensions)&lt;/li&gt;
&lt;li&gt;Targets are discovered via service discovery or static configuration&lt;/li&gt;
&lt;li&gt;No dependency on distributed storage; single server nodes are autonomous&lt;/li&gt;
&lt;li&gt;PromQL, a powerful and flexible query language to leverage this dimensionality&lt;/li&gt;
&lt;/ul&gt;
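
&lt;p&gt;To make the multi-dimensional data model concrete, here is a minimal Python sketch in which each time series is identified by a metric name plus a set of key/value labels. The TinyStore class, its methods and the sample metric are hypothetical names invented for illustration; this is not how Prometheus stores or queries data internally, only a way to picture label-based selection.&lt;/p&gt;

```python
from collections import defaultdict


def series_key(metric_name, labels):
    """Identify a time series by metric name plus its set of key/value labels."""
    return (metric_name, frozenset(labels.items()))


class TinyStore:
    """Illustrative in-memory store (hypothetical, not Prometheus' storage)."""

    def __init__(self):
        # Maps a series key to a list of (timestamp, value) samples.
        self.samples = defaultdict(list)

    def append(self, metric_name, labels, timestamp, value):
        self.samples[series_key(metric_name, labels)].append((timestamp, value))

    def select(self, metric_name, **matchers):
        """Return every series of this metric whose labels equal all matchers."""
        wanted = set(matchers.items())
        return {
            key: points
            for key, points in self.samples.items()
            if key[0] == metric_name and wanted.issubset(key[1])
        }


store = TinyStore()
store.append("http_requests_total", {"method": "GET", "code": "200"}, 1, 10.0)
store.append("http_requests_total", {"method": "GET", "code": "200"}, 2, 12.0)
store.append("http_requests_total", {"method": "POST", "code": "500"}, 1, 1.0)

# Selecting on one label dimension returns only the matching series.
get_series = store.select("http_requests_total", method="GET")
print(len(get_series))
```

&lt;p&gt;Two samples with identical labels land in the same series, while a different label value creates a separate series. This is the dimensionality that PromQL selectors work against.&lt;/p&gt;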

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1r7YHuBj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fae5cabe713151e3ba35284_5.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1r7YHuBj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fae5cabe713151e3ba35284_5.jpg" width="800" height="552"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;a href="https://github.com/prometheus/prometheus" rel="noopener noreferrer"&gt;Image Source&lt;/a&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;&lt;a href="https://github.com/asobti/kube-monkey" rel="noopener noreferrer"&gt;Kube-Monkey&lt;/a&gt;&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Kube-monkey is an implementation of &lt;a href="https://netflix.github.io/chaosmonkey/" rel="noopener noreferrer"&gt;Netflix's Chaos Monkey&lt;/a&gt; for Kubernetes clusters. It randomly deletes Kubernetes pods in the cluster, which both encourages and validates the development of failure-resilient services.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kube-monkey works on an opt-in model and only schedules terminations for Kubernetes (k8s) apps that have explicitly agreed to have their pods terminated by kube-monkey.&lt;/li&gt;
&lt;li&gt;Highly customisable scheduling features based on your requirements.&lt;/li&gt;
&lt;/ul&gt;
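
&lt;p&gt;The opt-in model can be pictured with a short Python sketch: only apps carrying an explicit opt-in label are ever considered, and one of their pods is picked at random. The label name, the dictionary layout and both function names are assumptions made for illustration; consult the kube-monkey documentation for the labels and scheduling options it actually uses.&lt;/p&gt;

```python
import random

# Assumed opt-in label for illustration; check the kube-monkey docs.
OPT_IN_LABEL = "kube-monkey/enabled"


def eligible_apps(deployments):
    """Keep only the apps that have explicitly opted in."""
    return [d for d in deployments if d["labels"].get(OPT_IN_LABEL) == "enabled"]


def pick_victim(deployments, rng):
    """Pick one pod at random from opted-in apps, or None if nobody opted in."""
    candidates = [pod for d in eligible_apps(deployments) for pod in d["pods"]]
    return rng.choice(candidates) if candidates else None


deployments = [
    {"name": "cart", "labels": {OPT_IN_LABEL: "enabled"}, "pods": ["cart-1", "cart-2"]},
    {"name": "billing", "labels": {}, "pods": ["billing-1"]},
]

victim = pick_victim(deployments, random.Random(0))
print(victim)
```

&lt;p&gt;Because the billing app carries no opt-in label, its pod can never be selected, whatever the random seed.&lt;/p&gt;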

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wLrekpvO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fae4976092c02d97f145670_image10.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wLrekpvO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fae4976092c02d97f145670_image10.png" width="638" height="359"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;a href="https://www.slideshare.net/arungupta1/chaos-engineering-with-kubernetes" rel="noopener noreferrer"&gt;Image Source&lt;/a&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;&lt;a href="https://github.com/powerfulseal/powerfulseal" rel="noopener noreferrer"&gt;PowerfulSeal&lt;/a&gt;&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;PowerfulSeal injects failure into Kubernetes clusters, helping you detect problems as early as possible. It lets you write scenarios that describe complete chaos experiments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Compatible with Kubernetes, OpenStack, AWS, Azure, GCP and local machines&lt;/li&gt;
&lt;li&gt;Connects with &lt;a href="https://prometheus.io/" rel="noopener noreferrer"&gt;Prometheus&lt;/a&gt; and &lt;a href="https://www.datadoghq.com/" rel="noopener noreferrer"&gt;Datadog&lt;/a&gt; for metrics collection&lt;/li&gt;
&lt;li&gt;Multiple modes allowed for custom use cases&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--51e07Px---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fae5cca5c87c3b80dcf9912_6.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--51e07Px---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c9200c49b1194323aff7304/5fae5cca5c87c3b80dcf9912_6.jpg" width="800" height="513"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;a href="https://github.com/powerfulseal/powerfulseal" rel="noopener noreferrer"&gt;Image Source&lt;/a&gt;



&lt;p&gt;The great benefit of open source technologies is their extensible nature. You can add features to the tool if required to better fit your custom architecture. These open source projects have extensive support documentation and a community of users. As microservice architecture is slated to dominate the cloud computing space, reliable tools to monitor and troubleshoot these instances are sure to become part of every developer's arsenal.&lt;/p&gt;

&lt;p&gt;You can also find more such awesome DevOps and SRE open source projects &lt;a href="https://awesomeopensource.com/projects/sre" rel="noopener noreferrer"&gt;here&lt;/a&gt;. Meanwhile, we’d love to hear from you on other projects/tools that should make this list! Leave us a comment or reach out over a DM via &lt;a href="https://twitter.com/squadcasthq?lang=en" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt; and let us know your thoughts.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://www.squadcast.com/" rel="noopener noreferrer"&gt;Squadcast&lt;/a&gt; is an incident management tool that’s purpose-built for SRE. Your team can get rid of unwanted alerts, receive relevant notifications, work in collaboration using the virtual incident war rooms, and use automated tools like runbooks to eliminate toil.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://app.squadcast.com/register/" rel="noopener noreferrer"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cj1VUnAS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://uploads-ssl.webflow.com/5c51758c58939b30a6fd3d73/5e16013f80ad26b00925d758_image--5--1.png" width="800" height="304"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>sre</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
