DEV Community: Nir Sharma

Most frequently asked questions surrounding Google’s Cloud Operations Sandbox

Nir Sharma — Wed, 11 Aug 2021 11:16:22 +0000

Cloud Operations Sandbox serves as a simulation tool for budding SREs to learn the best practices from Google and apply them to real cloud services. In this blog, we have compiled a list of FAQs surrounding the use of Google's Cloud Operations Sandbox.

The Google SRE sandbox provides an easy way to get started with the core skills you need to become an SRE. It simulates the behavioural complexities of a real Google Cloud Platform (GCP) environment, so that budding SREs can practice hands-on while learning SRE best practices.

The core skills you need to become a good SRE are:

Observability of complex microservice-based cloud environments
Performing quick root-cause analysis when things go wrong
Automating rollbacks and monitoring deployments
Tracking SLOs and SLIs over time

Architecture of the demo application provided with the sandbox

With Cloud Operations Sandbox, you can take the first steps towards SRE expertise and answer the question, ‘Will it work in my production environment?’ We have compiled a list of FAQs related to the Google SRE Sandbox and answered them below.

Q: What are the major features of the sandbox?

While the sandbox has many features, in this blog we will focus on observability, root-cause analysis, simulating user traffic and SLO/SLI tracking. The sandbox features used for learning about these are Cloud Trace, the Locust artificial load generator, Cloud Profiler, Cloud Debugger and SRE recipes.

Q: Can I track custom SLOs and SLAs with the sandbox?

The demo application that comes with the sandbox has microservices that are pre-instrumented with logging, monitoring, tracing, debugging, and profiling capabilities. The screenshot below shows how Service Level Indicators (SLIs) can be defined for the demo app.

Defining SLIs in the Google Sandbox

You can pick SLIs based on availability, latency or even define your own custom metric for the demo application.

If you have instead chosen to track SLIs for your replicated production environment, you will need to instrument the services separately.
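To make the availability option concrete, here is a minimal sketch of how an availability SLI and the remaining error budget could be computed from request counts. The function names, the 99.9% target and the request numbers are illustrative assumptions, not values taken from the sandbox.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Fraction of requests that were served successfully."""
    if total_requests == 0:
        return 1.0  # no traffic means nothing failed
    return good_requests / total_requests


def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Share of the error budget still unspent (1.0 = untouched, below 0 = overspent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    allowed_failure = 1.0 - slo_target
    actual_failure = 1.0 - sli
    return 1.0 - actual_failure / allowed_failure


# Hypothetical numbers: 1,000,000 requests, 400 of them failed, SLO of 99.9%.
sli = availability_sli(good_requests=999_600, total_requests=1_000_000)
budget = error_budget_remaining(sli, slo_target=0.999)
print(f"SLI: {sli:.4%}, error budget remaining: {budget:.0%}")
```

A latency SLI can be computed the same way by counting requests that finished under a chosen threshold as "good".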

Q: Which module is used to simulate traffic in the sandbox?

The artificial load generator used by the sandbox is Locust. Locust is mainly used for testing how much load your infrastructure can bear. With Locust you can define artificial user behaviour using Python code. Locust can run load tests that simulate up to millions of concurrent users.

User interface of the Locust load generator

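Below you will find a code snippet with the Python code used to simulate the behaviour of a user.

from locust import HttpUser, between, task
class WebsiteUser(HttpUser):
    # Each simulated user waits 5 to 15 seconds between tasks.
    wait_time = between(5, 15)

    def on_start(self):
        # Runs once per simulated user, before any task is scheduled.
        self.client.post("/login", {
            "username": "test_user",
            "password": ""
        })

    @task
    def index(self):
        # Added for illustration: Locust needs at least one task.
        self.client.get("/")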

Q: What is ‘Google Cloud Debugger’ and how does it work in the sandbox?

An issue seen in production often cannot be reproduced in the test environment for root-cause analysis. To discover the underlying cause, you would normally have to go into the source code or add more logs to the program while it is running in production.

The Cloud Debugger lets developers debug a running application using real-time request data. Breakpoints and logpoints can be defined while viewing the project. A snapshot of the process state is taken when a breakpoint is hit, so you can examine what went wrong.

Adding a log statement would typically mean re-deploying the code, with all of the risks that a production deployment involves. With the Cloud Debugger, a log statement can be added to a running application without re-deploying it and without slowing it down.

Q: What is ‘Google Cloud Profiler’ and how can it help me?

You can use Cloud Profiler to profile your application statistically. Depending on the programming language used, it collects statistical information on CPU usage, heap size, threads and so on. The charts in the Profiler UI help you identify performance bottlenecks in your application code.

You do not have to write any profiling code in your application; all you have to do is install the Profiler library and make it available to the application (the method depends on the language). The library then generates reports that you can analyse in different ways.

Note that if you are not using the demo application, the profiler has to be configured to work with the relevant microservice.
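For a Python microservice, making the library available usually comes down to one call at startup. The sketch below assumes the google-cloud-profiler package and its googlecloudprofiler.start() function; the start_profiler wrapper, the service name and the version string are hypothetical, and the fallback behaviour is a design choice of this example rather than something the sandbox prescribes.

```python
def start_profiler(service: str, version: str) -> bool:
    """Try to start Cloud Profiler; return True if the agent was started."""
    try:
        import googlecloudprofiler  # provided by the google-cloud-profiler package
    except ImportError:
        return False  # library not installed, so run without profiling
    try:
        # start() launches a background thread that collects and uploads profiles.
        googlecloudprofiler.start(service=service, service_version=version, verbose=1)
    except (ValueError, NotImplementedError) as exc:
        print(f"Profiler not started: {exc}")
        return False
    return True


# Hypothetical service name and version; call this once, early in startup.
profiling_enabled = start_profiler("checkout-service", "1.0.0")
```

Returning a flag instead of raising keeps a profiler problem from taking the service itself down.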

Q: What are the tools available to learn tracing across Sandbox?

Cloud Trace allows developers to examine distributed traces by graphically revealing request latency bottlenecks. Developers gather the trace information by instrumenting the application code. Traces also include environmental information added to the Cloud Logging records. The sandbox provides OpenCensus and OpenTelemetry for learning tracing within the platform.

The solution the sandbox uses for instrumenting is OpenCensus. The OpenCensus project is open-source and offers trace instrumentation in many languages. Furthermore, it enables the trace data to be exported to the Google Cloud Operations dashboard. To examine the data, you can use the Cloud Trace UI.

Clicking on a trace in the timeline will give you a more detailed view and breakdown of the traced call and the subsequent calls that were made.
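The breakdown of a call into subsequent calls corresponds to a tree of timed spans. The sketch below illustrates that structure using only the Python standard library; it is not the OpenCensus or OpenTelemetry API, and the span helper and the operation names are hypothetical.

```python
import time
from contextlib import contextmanager

spans = []   # finished spans, in completion order
_stack = []  # names of the spans that are currently open


@contextmanager
def span(name):
    """Time a block of work and record which span it ran inside."""
    parent = _stack[-1] if _stack else None
    _stack.append(name)
    start = time.perf_counter()
    try:
        yield
    finally:
        duration_ms = (time.perf_counter() - start) * 1000
        _stack.pop()
        spans.append({"name": name, "parent": parent, "ms": duration_ms})


# Hypothetical request: a checkout call that makes two downstream calls.
with span("checkout"):
    with span("get_cart"):
        time.sleep(0.01)
    with span("charge_card"):
        time.sleep(0.02)

for s in spans:
    print(f"{s['name']:<12} parent={s['parent']} {s['ms']:.1f} ms")
```

A real tracing library adds trace IDs, context propagation across services and an exporter, but the parent/child timing data has the same shape.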

Q: Can I replicate my production/staging environment in the sandbox?

Your production/staging environment can be replicated if it is hosted on Google Cloud Platform (GCP).

Q: Can I check for observability of my replicated environment?

The sandbox has a demo application (Hipster Shop) that comes pre-instrumented for observability. If you are using your own environment, you will need to instrument your microservices accordingly.

Q: Can I send alerts to an external platform?

As of now, the demo sandbox has a built-in incident management system with basic functionality. Sending alerts to an external platform is possible after creating a custom module.

Q: How much does the Sandbox cost?

The sandbox is provided free of charge. However, since it can only be used on Google Cloud Platform (GCP), any computing resources consumed will be billed.

Q: Can I improve my MTTR (Mean Time To Respond) with the sandbox?

The sandbox has a feature called “SRE recipes” that auto-generates issues in your environment. It is a good way to learn the skills needed to fix things in production. Note that SRE recipes work only with the demo application provided with the sandbox; you will need to create your own scripts to auto-generate problems in a custom setup. With practice, SREs can get better at fixing issues in production and reduce the MTTR (Mean Time To Respond) to incidents.
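To see whether practice is paying off, you can track MTTR across practice runs. This is a minimal sketch assuming you record when each alert fired and when an engineer responded; the function name and the timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def mean_time_to_respond(incidents):
    """Average delay between an alert firing and an engineer responding."""
    if not incidents:
        raise ValueError("no incidents recorded")
    delays = [responded - triggered for triggered, responded in incidents]
    return sum(delays, timedelta()) / len(delays)


# Hypothetical practice log: (alert triggered, engineer responded).
practice_runs = [
    (datetime(2021, 8, 2, 10, 0), datetime(2021, 8, 2, 10, 12)),
    (datetime(2021, 8, 4, 15, 30), datetime(2021, 8, 4, 15, 38)),
    (datetime(2021, 8, 9, 9, 45), datetime(2021, 8, 9, 9, 49)),
]
mttr = mean_time_to_respond(practice_runs)
print(f"MTTR: {mttr.total_seconds() / 60:.0f} minutes")
```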

Q: Can I test the performance of my production environment in the sandbox?

Yes. The sandbox environment can be used to test your production environment since it has a tool to generate synthetic traffic. However, the sandbox does not have any tools for thorough unit testing and performance testing.

Q: What new features will be added to the sandbox?

Runbooks are expected to be added in the sandbox in the near future. Creating effective runbooks is an important skill all SREs need to acquire.

Conclusion

The SRE sandbox is a great place to test out your skills for becoming a better SRE. To be effective in their work, SREs need expertise in the areas of observability, performance testing and distributed architecture. The sandbox provides a way for budding SREs to test out different scenarios. Some possible scenarios include checking the performance of your application under different user loads, getting better at resolving critical issues and testing out different on-call strategies.

Squadcast is an incident management tool that’s purpose-built for SRE. Your team can get rid of unwanted alerts, receive relevant notifications, work in collaboration using the virtual incident war rooms, and use automated tools like runbooks to eliminate toil.

The True Cost of Building your Own Incident Management System (IMS)

Nir Sharma — Tue, 09 Feb 2021 11:19:39 +0000

Is your organization on the lookout for an incident management tool? If so, you may wonder: am I better off building my own? Our latest blog outlines some of the key factors to consider when choosing whether to build or buy incident management software.

When your organisation realises that it needs an Incident Management System (IMS), the first question is almost always, “Build or Buy?” Superficially, the requirements seem simple, and as a technical organisation you probably have the skills you need as well. With your deep knowledge of your internal setup, surely you can build one that’s best suited to your needs? This may seem like a solid argument for building your own IMS. However, there are some hidden factors that you may not have considered. In this blog, we look at the costs involved in building your own IMS and help you determine if the return on investment (ROI) makes it worth building one.

First, let’s quickly look at some advantages of building your own IMS.

The biggest advantage is that you can build an IMS that suits your needs perfectly. If your organization does not use Sensu, for example, then why build support for it? Instead, you can directly integrate with any on-premise monitoring tool you have in place. If you restrict access to your production network, an off-the-shelf SaaS IMS will be difficult to use. An in-house IMS will not face such issues.

Now that we have had a look at the advantages, let’s look at the disadvantages.

Budgeting for an in-house incident management system (IMS)

When building your own IMS, it can be very difficult to estimate the total cost of ownership. In general, it is easier to get approval for a one-time cost than for an open-ended project. The trouble with building an IMS is that, more often than not, the cost estimates do not include long-term maintenance, usability and reliability.

While getting budgetary approval for the IMS, it's hard to communicate the benefits the system will bring. This is because many of the benefits of having a strong IMS platform in place are qualitative. While the on-call experience and effectiveness of engineers will definitely improve, it is hard to measure the benefits quantitatively. Convincing your management to increase the budget may become harder as time passes and additional features are required. Many organizations may not have the measuring tools necessary to decisively prove the return on investment. You may build your own dashboard that tracks the MTTR (Mean Time To Respond) but unless such metrics have been tracked even earlier, it will be a hard sell to convince management.

Off-the-shelf systems, on the other hand, often don’t have high upfront costs and require little commitment. A small pilot of a commercial product is an easier sell than a potentially long and expensive development project.

But is it, in fact, expensive? Let’s break it down:

Development costs: This includes the cost of assigning programmers to the task, tools required to build the IMS, and infrastructure to test and deploy it. Given a list of features, it is possible to estimate this particular cost, and it is reasonably straightforward to get funding for these expenses.

Maintenance costs: Like any other piece of software, an IMS has maintenance costs. These include fixing the bugs that crop up during the development and use of the IMS. You will also need to factor in costs as your requirements grow, whether through changes in the production applications, databases, vendor tools, or any other dependencies. As the underlying software is updated, you will also need to consider the associated security fixes. This involves setting aside time to ensure that any newly discovered security vulnerabilities don’t compromise your system. In certain scenarios, you may also need to hire external contractors to validate your security.

Since it is a mission-critical piece of software that alerts you to any problems in your entire infrastructure, you cannot neglect its maintenance. You cannot afford to delay patching any vulnerabilities or making critical fixes to your IMS. Therefore you will need to set aside dedicated and continuous engineering capacity for the maintenance of the IMS. Even if it is a part-time team, there must be someone available at short notice to make any critical fixes required. This team and the overhead of maintaining it is likely to be your single highest cost and the one most difficult to sustain.

Opportunity cost: This is one of the hidden costs that are harder to measure. While developing your own IMS, you will take away engineering capacity from other aspects of your organization. These people could have been working on your organization’s product instead of working on the IMS.
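The three cost categories above can be combined into a rough total cost of ownership. The sketch below is only an illustration of the arithmetic: the class name and every figure in it are placeholders, not benchmarks, and should be replaced with your own estimates.

```python
from dataclasses import dataclass


@dataclass
class InHouseIMS:
    development: float           # one-time cost of building the first version
    maintenance_per_year: float  # bug fixes, security patches, standby engineers
    opportunity_per_year: float  # product work those engineers could not do

    def total_cost(self, years: int) -> float:
        """Total cost of ownership after the given number of years."""
        recurring = self.maintenance_per_year + self.opportunity_per_year
        return self.development + years * recurring


# Placeholder figures, not benchmarks: substitute your own estimates.
ims = InHouseIMS(development=100_000, maintenance_per_year=50_000, opportunity_per_year=30_000)
for years in (1, 3, 5):
    print(f"after {years} year(s): {ims.total_cost(years):,.0f}")
```

Even with placeholder inputs, the shape of the result is the point: the one-time development cost is fixed, while maintenance and opportunity costs keep accumulating every year.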

Now that we have looked at the cost of developing your own in-house platform, let us have a look at the cost incurred if you opt for an off-the-shelf incident management platform.

Factors to consider for an off-the-shelf incident management platform

Usually, off-the-shelf platforms are more expensive to develop because they have to be more flexible in terms of feature set and be able to scale to a higher number of users. Fortunately, you will end up paying only a fraction of that cost, because it is shared among all the customers of the product. In fact, if you have a small team, you can get many features free of cost from several incident management platforms. In general, for a particular feature set, the cost of acquisition will be far lower with off-the-shelf systems.

Deployment and Training Cost: Off-the-shelf systems are usually quite flexible, but you may have to spend some time and effort to adapt your systems to them. You may have to change some of your processes or deprecate old, unsupported monitoring tools, for example. This also includes any training costs for the users in your organisation.

Usability and Features: Due to the competitive nature of the market, any off-the-shelf incident management platform will need to keep up and add features to ensure it does not fall behind. An in-house platform often stops being developed as soon as basic minimum functionality is in place. In-house platforms can have poor usability as they are built in an ad-hoc fashion by SREs without input from UX professionals. A better user interface ensures more efficiency and ease of use. Any external product will already have been used by hundreds if not thousands of users in other organizations and therefore will have a highly optimized layout. An external platform will also have the added benefit of a customer support team to answer any queries not covered by the support documentation.

Conclusion

These were the costs and benefits of having an in-house versus an external system. Once you factor in the hidden costs, compliance, and support issues, investing in an in-house incident management platform makes little sense unless you are operating at the scale of Google or Facebook, or are operating an esoteric system that is incompatible with external tools. In the majority of cases, be it a growing team or a small SRE team in a large organization, an off-the-shelf solution is the better choice. For most organizations, the return on investment is not substantial enough to warrant planning and developing an in-house incident management system.

Top Observability tools for DevOps Engineers and SREs

Nir Sharma — Wed, 27 Jan 2021 11:55:40 +0000

Better visibility is the first step to improved system stability. Our latest blog outlines Top Observability tools for DevOps Engineers & SREs to help you get started on your journey to gain valuable insights into your infrastructure.

“We can't fix something which we can't observe” - whether it's a steam engine or a complex microservice based cloud deployment, great observability makes troubleshooting things easier. Having a clear view of your system makes early recognition and preemptive solving of problems possible. Getting the right data at the right time with associated context is a game changer for those who want better system stability.

In this blog post, we have collated a list of observability tools in the areas of log aggregation, APM, time series databases, distributed tracing, and metrics collection. While this is not an in-depth look at the strengths and weaknesses of these tools, it is a good starting point on your journey to better observability.

The list contains a mix of on-premise, hybrid, and SaaS platforms. Also, some of the tools featured here are open-source products or built on the foundation of other open-source software.

First up, we look at some log aggregation tools:

Fluentd is an open-source data collection tool. It is used to analyse data from event and application logs. It is a centralizing layer for consolidating different log inputs and outputs.

Features:

Flexible plugin system that allows the community to extend its usability.
Fluentd is written in C and Ruby and requires very few system resources.
Supports Unified Logging with JSON
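Structured JSON events are easiest to collect when the application already emits them. As a minimal sketch using only Python's standard logging module, the formatter below writes one JSON object per line; the field names and the logger name are assumptions, and the collector configuration needed to parse this output is not shown.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")  # hypothetical service name
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order %s placed", "A-1001")
```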
