<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Instana</title>
    <description>The latest articles on DEV Community by Instana (@instanahq).</description>
    <link>https://dev.to/instanahq</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F441724%2F2ff21e8f-c22b-49f1-a1bd-4d0dc16102ea.jpg</url>
      <title>DEV Community: Instana</title>
      <link>https://dev.to/instanahq</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/instanahq"/>
    <language>en</language>
    <item>
      <title>GitHub CoPilot, Code Automation, and Software Health</title>
      <dc:creator>Instana</dc:creator>
      <pubDate>Mon, 15 Aug 2022 13:39:28 +0000</pubDate>
      <link>https://dev.to/instanahq/github-copilot-code-automation-and-software-health-2e0b</link>
      <guid>https://dev.to/instanahq/github-copilot-code-automation-and-software-health-2e0b</guid>
      <description>&lt;p&gt;At Instana, we naturally talk a lot about strategies for automating operations, but this is just one part of the picture. Developers can also benefit from increasingly intelligent forms of automation.&lt;/p&gt;

&lt;p&gt;By now you’ve likely heard of GitHub Copilot, which is the most popular in a growing field of automated code assistance tools for developers.&lt;/p&gt;

&lt;p&gt;In our eBook &lt;a href="https://www.instana.com/resources/achieving-software-health-in-the-microservices-age/"&gt;Achieving Software Health&lt;/a&gt;, we discuss how code assistance tools can improve maintenance and repair of services and applications. Let’s take a closer look at some of those tools…&lt;/p&gt;

&lt;h2&gt;
  
  
  Commercial Code Assistance Tools
&lt;/h2&gt;

&lt;p&gt;GitHub released the first iteration of Copilot as a technical preview in June 2021. Since then it has steadily added extensions for many popular editors. Amazon announced CodeWhisperer, which is currently in a preview program, on June 23rd of this year.&lt;/p&gt;

&lt;p&gt;GitHub Copilot is currently the easiest to get started with — if you don’t mind paying for it — as the only requirements are a GitHub account and a compatible editor. CodeWhisperer requires interested developers to apply for the preview program, although being approved within a few days is common.&lt;/p&gt;

&lt;p&gt;Both of these tools offer a very similar feature set. The core feature is the ability to complete entire functions or even modules based on a code comment describing the developer’s intent. CodeWhisperer depends more on these comments, while Copilot is more integrated into the editor’s typeahead autocompletion.&lt;/p&gt;
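
&lt;p&gt;For illustration, the interaction looks roughly like the following. The comment and signature come from the developer; the body is the kind of completion such a tool might suggest (a constructed example, not an actual tool transcript):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// the developer writes only this comment and signature:
// returns true if the given year is a leap year in the Gregorian calendar
static bool IsLeapYear(int year)
{
    // a completion a tool like Copilot might plausibly suggest:
    if (year % 400 == 0) return true;
    if (year % 100 == 0) return false;
    return year % 4 == 0;
}&lt;/code&gt;&lt;/pre&gt;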

&lt;p&gt;Copilot is almost aggressive in how quickly it will attempt to complete a line or definition, but you quickly adapt to the code hints. It is uncanny how well Copilot can guess the name of a new function or variable from context as soon as you begin the declaration.&lt;/p&gt;

&lt;p&gt;With either tool, you have the option to reject a suggestion or open a context menu to choose from a selection of alternative suggestions, which provides more experienced developers with the ability to choose their preferred solutions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Things Code Automation Tools Do Well
&lt;/h2&gt;

&lt;p&gt;So what can we do with this super-powered autocomplete? While it’s possible to have these tools build almost an entire (small) program from just a few comments, that is not how most developers will want to use them.&lt;/p&gt;

&lt;p&gt;If basic autocomplete as we know it in IDEs is like an autopilot — capable of maintaining a straight course and heading — then the Copilot name is spot on (although “CodeWhisperer” just sounds cool, doesn’t it?). It is a helpful aide, almost like having an always-available junior pair programmer.&lt;/p&gt;

&lt;p&gt;In this role, the code completion tools can look up documentation and combine that with the context of your program to create the exact suggestion you need.&lt;/p&gt;

&lt;p&gt;For junior developers, or even experienced developers working in an unfamiliar language, there is a didactic nature to the experience that informs and educates through assistance and suggestion.&lt;/p&gt;

&lt;p&gt;Another excellent use for AI-assisted code completion is writing boilerplate code. You, the developer, provide the intent and the architecture, and the code completion tool handles the bulk of the typing. In fact, according to Microsoft, Copilot now writes 40% of the code in files where it has been enabled.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open Source Options
&lt;/h2&gt;

&lt;p&gt;In addition to the commercial tools, a number of open source ML-driven code completion tools are in active development. The most general-purpose of these is &lt;a href="https://github.com/CodedotAl/gpt-code-clippy"&gt;GPT Code Clippy&lt;/a&gt;, which aims to be a complete open source alternative to the commercial code assistance tools. It is based on GPT-Neo, an open source implementation of the &lt;a href="https://arxiv.org/abs/2005.14165"&gt;GPT-3&lt;/a&gt; architecture, and has an extension available for VSCode.&lt;/p&gt;

&lt;p&gt;Other tools exist for specific editors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/HJLebbink/asm-dude"&gt;ASM Dude&lt;/a&gt;&lt;br&gt;&lt;br&gt;
A Visual Studio extension for Assembly&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/ycm-core/YouCompleteMe"&gt;YouCompleteMe&lt;/a&gt;&lt;br&gt;&lt;br&gt;
A code-completion plugin for Vim&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/samrawal/emacs-secondmate"&gt;SecondMate&lt;/a&gt;&lt;br&gt;&lt;br&gt;
A mini-copilot for Emacs&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Code Completion for Software Health
&lt;/h2&gt;

&lt;p&gt;You may be asking, “won’t these tools just generate garbage code that increases our technical debt?”&lt;/p&gt;

&lt;p&gt;No.&lt;/p&gt;

&lt;p&gt;Unlike the boilerplate generators built into many popular frameworks, these code completion tools provide a conversational interface that leaves the developer in the driver’s seat at all times.&lt;/p&gt;

&lt;p&gt;This is the best kind of automation — the kind that reduces tedium for developers and operators and leaves them free to focus on the higher order solutions.&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://www.instana.com/blog/github-copilot-code-automation-and-software-health/"&gt;GitHub CoPilot, Code Automation, and Software Health&lt;/a&gt; appeared first on &lt;a href="https://www.instana.com"&gt;Instana&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>githubcopilot</category>
      <category>automation</category>
      <category>ai</category>
    </item>
    <item>
      <title>Shift Left Testing: What Is It and Why Does It Matter?</title>
      <dc:creator>Instana</dc:creator>
      <pubDate>Fri, 29 Jul 2022 14:27:34 +0000</pubDate>
      <link>https://dev.to/instanahq/shift-left-testing-what-is-it-and-why-does-it-matter-47cb</link>
      <guid>https://dev.to/instanahq/shift-left-testing-what-is-it-and-why-does-it-matter-47cb</guid>
      <description>&lt;p&gt;Have you been involved in a software project that ran over budget and blew past every deadline? Of course, you have – we all have. In fact, if you haven’t, you are a unicorn and I would like to hear from you.&lt;/p&gt;

&lt;p&gt;Early in my software development career, I learned the importance of working backwards from a deadline. If a project must be done by a certain date and testing will take a certain amount of time, then we can use that information to work backwards and choose a due date for our project. Perfect, right?&lt;/p&gt;

&lt;p&gt;Well, not quite. While building in time for testing reduced some stress in the final days of projects, there were still too many surprises.&lt;/p&gt;

&lt;p&gt;Building in time for QA testing is great in theory but quickly falls apart in practice once the first bug or defect is identified.&lt;/p&gt;

&lt;p&gt;How long will this defect take to fix? How much will it impact the timeline? Will new bugs be introduced? How will we ensure each fix is verified with time to fix anything we broke while we were fixing the first thing?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://xkcd.com/1739/"&gt;relevant xkcd&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Ultimately, I was never able to find the correct amount of time to allocate for QA. Inevitably, rushed fixes were merged at the last minute; I learned to keep my calendar clear for a couple of weeks after big launches so that I could triage all of the issues we missed (or introduced) in our mad dashes to the finish.&lt;/p&gt;

&lt;p&gt;The problem, at the end of the day, was not the &lt;strong&gt;time&lt;/strong&gt; available for testing but rather the &lt;strong&gt;timing&lt;/strong&gt; of the testing. I needed testing sooner and more often. I needed shift left testing.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is Shift Left Testing?
&lt;/h2&gt;

&lt;p&gt;If we imagine our software development process as a timeline flowing from left to right, then “shift left testing” becomes somewhat self-explanatory. Simply put, it is the practice of testing earlier and more often in the development life cycle.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Vvd5yVa6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.instana.com/media/shift-left-testing-model-1024x281.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Vvd5yVa6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.instana.com/media/shift-left-testing-model-1024x281.png" alt='Two area graphs overlayed. One, labeled "shift left testing" has a peak on the left side. Traditional software development has a peak on the right side of the graph.' width="880" height="241"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;What is Shift Left Testing?&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The V-Model of Software Development
&lt;/h2&gt;

&lt;p&gt;The V-model is a useful way to conceptualize software development cycles. If we take the traditional waterfall flow and “flip” the Y-axis at the implementation phase, we get the V-model.&lt;/p&gt;

&lt;p&gt;A development cycle begins with high-level requirements. These requirements are narrowed down with each successive step down the “V” until we reach the code-level implementation itself. We then verify the implementation, starting with the most granular unit tests and working our way up the “V” to more abstract user acceptance testing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--z_S1kc6k--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.instana.com/media/software-development-v-model-1024x403.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--z_S1kc6k--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.instana.com/media/software-development-v-model-1024x403.png" alt="V model of software development" width="880" height="346"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In a waterfall process, the entire project is made up of a single “V.” As an industry, we have learned that when you leave all of your validation to the very end of a complex project, you are basically setting yourself up to fail.&lt;/p&gt;

&lt;p&gt;In an iterative process, we can think of each sprint or iteration as a smaller “V.” We have theoretically achieved our goals of shift left: testing sooner and more often. Problem solved, right? Well… not quite.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OoVNb36y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.instana.com/media/agile-software-development-1024x82.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OoVNb36y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.instana.com/media/agile-software-development-1024x82.png" alt="banner of the V model of software development in shift left testing" width="880" height="70"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Types of Shift Left Testing
&lt;/h2&gt;

&lt;p&gt;You may have noticed that there are two labels on the feedback channel built into the V-model: verification and validation. These are both important.&lt;/p&gt;

&lt;p&gt;We need to validate that our user requirements actually solve the problems we set out to solve. We also need to verify that our implementation matches the specifications we get from those user requirements.&lt;/p&gt;

&lt;p&gt;Automated testing can be applied to both validation and verification. BDD (behavior-driven development) has led to the creation of technologies such as Cucumber that can automate some parts of the validation process. For the purposes of this article, we will focus on automated testing for verification.&lt;/p&gt;

&lt;h3&gt;
  
  
  Unit Testing
&lt;/h3&gt;

&lt;p&gt;Unit tests verify the functionality of a specific module within a larger application. The module is tested in isolation, and any communication with other external processes is simulated or mocked. Unit testing and TDD represent the first phase in shift left testing.&lt;/p&gt;
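
&lt;p&gt;As a rough sketch of that isolation (all of the types here are illustrative, not from any particular codebase), the external dependency is replaced with a fake so the unit can be exercised entirely on its own:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;using Xunit;

// the unit under test depends on an abstraction, not a concrete service
public interface IRateProvider
{
    decimal GetRate(string currency);
}

public class PriceCalculator
{
    private readonly IRateProvider rates;

    public PriceCalculator(IRateProvider rates)
    {
        this.rates = rates;
    }

    public decimal ToLocal(decimal usd, string currency)
    {
        return usd * rates.GetRate(currency);
    }
}

// a hand-rolled fake: a canned response instead of a real external call
public class FakeRateProvider : IRateProvider
{
    public decimal GetRate(string currency)
    {
        return 2m;
    }
}

public class PriceCalculatorTests
{
    [Fact]
    public void ToLocal_MultipliesByTheProvidedRate()
    {
        var calculator = new PriceCalculator(new FakeRateProvider());
        Assert.Equal(20m, calculator.ToLocal(10m, "EUR"));
    }
}&lt;/code&gt;&lt;/pre&gt;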

&lt;h3&gt;
  
  
  Integration Testing
&lt;/h3&gt;

&lt;p&gt;Integration tests attempt to verify the overall functionality of a service or application, including side effects. This is an anti-pattern for reasons we will discuss later.&lt;/p&gt;

&lt;h3&gt;
  
  
  API Testing / Contract Testing
&lt;/h3&gt;

&lt;p&gt;API tests verify the external endpoints of a single service. The scope of API tests is similar to the scope of integration tests; however, in an SOA or microservices context, we can think of API tests as the new unit tests.&lt;/p&gt;

&lt;h3&gt;
  
  
  UI Testing
&lt;/h3&gt;

&lt;p&gt;UI tests verify the complete functionality of an application from the user interface layer. Tools like Selenium make automated UI testing widely accessible.&lt;/p&gt;

&lt;h3&gt;
  
  
  More Than Just Automation
&lt;/h3&gt;

&lt;p&gt;Shift left isn’t just about automation. Another way to test earlier and more often is to make sure that your QA specialists are involved in every step of your process, beginning with discovery and requirements gathering. Test engineers can do better when they have a greater understanding of the overall implementation, and their insights can help make the architecture more transparent and resilient.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Benefits of Shift Left Testing
&lt;/h2&gt;

&lt;p&gt;The shorter feedback loops built into shift left processes empower us in several ways. Defects can be found faster, fixes can be applied more efficiently, and lessons learned in one iteration can be applied in the next, to name a few.&lt;/p&gt;

&lt;p&gt;Whatever project management methodology or release cadence your team has, you can benefit from the shorter verification feedback loops from shift left testing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cost Savings
&lt;/h3&gt;

&lt;p&gt;A defect found by an automated unit test on a developer’s local machine is orders of magnitude cheaper to identify and fix than a defect that has made it all the way to a customer-facing environment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Developer Wellbeing
&lt;/h3&gt;

&lt;p&gt;When done properly, &lt;a href="https://www.instana.com/blog/making-the-case-for-complete-ci-cd-automation/"&gt;automated testing and CI&lt;/a&gt; can provide the confidence that software engineers need to deploy often — even on Fridays. Finding defects sooner means fewer panicked all-hands moments. Since releases are so painless, fixing the few errors that do make it through is faster and easier too.&lt;/p&gt;

&lt;h3&gt;
  
  
  Resilient Architecture
&lt;/h3&gt;

&lt;p&gt;Just as more accessible software is usually easier for all of us to use, more testable software can be easier to reason about and maintain. Thinking about testing early can lead to better separation of concerns and a more resilient overall architecture.&lt;/p&gt;

&lt;h3&gt;
  
  
  Greater Overall Quality
&lt;/h3&gt;

&lt;p&gt;Improving the customer experience is our ultimate goal. Shift left can eliminate some incidents that end users might experience and reduce the impact of other incidents. We can use observability to complete this feedback loop and improve our overall software health.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Dangers of Shift Left Testing
&lt;/h2&gt;

&lt;p&gt;With powerful automation tools at our disposal, it can be tempting to implement every kind of testing on every line of code. This is a dangerous path.&lt;/p&gt;

&lt;p&gt;Testing side effects — did that record actually get saved to the database? — is an attractive idea. But testing implementation details is an anti-pattern because these types of tests are extremely brittle. They might need to be changed every time your application is changed. The user interface is also an implementation detail, so UI tests land in this same boat.&lt;/p&gt;

&lt;p&gt;Verification tests care only about the “what,” not the “how” or the “why.” Ideally, the user requirements have been designed to validate the “why.” To answer the “how,” we can rely on more powerful automation in the form of an observability platform.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Get Started with Shift Left Testing
&lt;/h2&gt;

&lt;p&gt;When we think of testing “sooner and more often” a certain word comes to mind: continuous. Many (most) software development teams are practicing some form of continuous integration and &lt;a href="https://www.instana.com/blog/making-continuous-delivery-work-instanas-continuous-discovery-technology/"&gt;continuous delivery&lt;/a&gt;. Continuous testing is a vital feedback loop in this DevOps cycle.&lt;/p&gt;

&lt;h3&gt;
  
  
  Continuous Testing
&lt;/h3&gt;

&lt;p&gt;If we think of TDD as “shift left for monoliths,” then continuous testing is “shift left for distributed architectures.”&lt;/p&gt;

&lt;p&gt;TDD had us focus on unit testing. For continuous testing we should focus on API and contract tests. API tests have a number of benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;API tests can prevent one of the most common ways to introduce errors in a &lt;a href="https://www.instana.com/blog/scaling-microservices-understanding-performance-and-observability/"&gt;microservices&lt;/a&gt; application: changing a dependency out of sync with its upstream or downstream services.&lt;/li&gt;
&lt;li&gt;API tests can be owned by the same team that owns the tested service.&lt;/li&gt;
&lt;li&gt;API tests avoid the brittleness of testing side effects and implementation details.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Ideally, these API tests will be run continuously against both production and pre-production environments. Contract testing tools can help automate this process, but that requires additional infrastructure.&lt;/p&gt;
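
&lt;p&gt;As a hedged sketch of what such a test might look like (the base URL, endpoint, and response shape are all hypothetical, and xUnit is just one choice of test framework):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;using System;
using System.Net;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;
using Xunit;

public class UserApiContractTests
{
    private static readonly HttpClient Client = new HttpClient
    {
        BaseAddress = new Uri("https://staging.example.com")
    };

    [Fact]
    public async Task GetUser_HonorsTheContract()
    {
        var response = await Client.GetAsync("/api/users/123");

        // assert the "what" (status and shape), not the "how" (implementation)
        Assert.Equal(HttpStatusCode.OK, response.StatusCode);
        var body = await response.Content.ReadAsStringAsync();
        using var doc = JsonDocument.Parse(body);
        Assert.True(doc.RootElement.TryGetProperty("id", out _));
    }
}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Because the test asserts only the status code and response shape, it stays stable as the implementation behind the endpoint changes.&lt;/p&gt;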

&lt;p&gt;What if we could use continuous API testing built into our observability tool? The upcoming &lt;a href="https://www.instana.com/blog/synthetic-monitoring-is-coming-to-instana/"&gt;synthetic API testing&lt;/a&gt; feature from Instana will let you continuously run API tests against all of your environments with minimal effort.&lt;/p&gt;

&lt;h2&gt;
  
  
  Shift Left vs Shift Right
&lt;/h2&gt;

&lt;p&gt;Shift right testing is the practice of testing &lt;em&gt;later&lt;/em&gt; in the development process, usually in production environments. While it may seem strange, shift left and shift right testing are complementary.&lt;/p&gt;

&lt;p&gt;Shift right testing allows us to identify production issues before our customers do. The shorter feedback loops from shift left testing give us the ability to respond to and remediate these production issues rapidly.&lt;/p&gt;

&lt;p&gt;Synthetic API testing as a part of your observability platform is the perfect way to combine the benefits of shift left and shift right practices.&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://www.instana.com/blog/what-is-shift-left-testing/"&gt;Shift Left Testing: What Is It and Why Does It Matter?&lt;/a&gt; appeared first on &lt;a href="https://www.instana.com"&gt;Instana&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>testing</category>
      <category>qa</category>
      <category>devops</category>
    </item>
    <item>
      <title>Instana Introduces OpenTelemetry Exporter for .NET</title>
      <dc:creator>Instana</dc:creator>
      <pubDate>Wed, 20 Jul 2022 13:39:39 +0000</pubDate>
      <link>https://dev.to/instanahq/instana-introduces-opentelemetry-exporter-for-net-32pe</link>
      <guid>https://dev.to/instanahq/instana-introduces-opentelemetry-exporter-for-net-32pe</guid>
      <description>&lt;h3&gt;
  
  
  What is OpenTelemetry?
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://opentelemetry.io/"&gt;&lt;strong&gt;OpenTelemetry&lt;/strong&gt;&lt;/a&gt; is an open-source project hosted by the &lt;a href="https://www.cncf.io/"&gt;&lt;strong&gt;CNCF&lt;/strong&gt;&lt;/a&gt; that provides APIs and SDKs for a variety of programming languages to instrument and collect observability data from applications. OpenTelemetry gives a framework for instrumenting, generating, collecting, and exporting telemetry data for analysis and understanding of software performance and behavior.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.instana.com/blog/the-rise-of-distributed-tracing-with-opentelemetry/"&gt;&lt;strong&gt;OpenTelemetry&lt;/strong&gt;&lt;/a&gt; offers a vendor-neutral data format that can be integrated with any data processing backend. This is possible thanks to a concept called “exporters.” An exporter allows you to configure which backend(s) you want it sent to. The exporter decouples the instrumentation from the backend configuration. This makes it easy to switch backends without the pain of re-instrumenting your code.&lt;/p&gt;

&lt;h3&gt;
  
  
  OpenTelemetry and .NET
&lt;/h3&gt;

&lt;p&gt;.NET is one of many languages supported by OpenTelemetry.&lt;/p&gt;

&lt;p&gt;To instrument a .NET application, you add the corresponding &lt;a href="https://www.nuget.org/packages?q=opentelemetry.instrumentation"&gt;NuGet package&lt;/a&gt; for the target runtime.&lt;/p&gt;

&lt;p&gt;From then on, every collected trace will be exported by the Instana .NET exporter and sent directly to Instana’s backend.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Span Exporter
&lt;/h3&gt;

&lt;p&gt;Once an OpenTelemetry instrumentation package is added, it will trace the corresponding library, generating spans every time that instrumented library is called.&lt;/p&gt;

&lt;p&gt;But you have to tell the tracer what to do with these spans. That’s where the exporter comes in.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;exporter&lt;/strong&gt; understands the vendor-neutral span format, converts it to a vendor-specific format, and sends this data to a backend to be processed and displayed later.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example
&lt;/h3&gt;

&lt;p&gt;Let’s say we have a simple ASP.NET Core application that uses an MSSQL database as storage. The application exposes two endpoints to interact with the database. The first endpoint, &lt;code&gt;/init&lt;/code&gt;, initializes the database and inserts one random integer value. The second endpoint, &lt;code&gt;/read&lt;/code&gt;, reads the previously generated and inserted value.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--m_MLqsh---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.instana.com/media/image-20220603115251508-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--m_MLqsh---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.instana.com/media/image-20220603115251508-1.png" alt="" width="880" height="196"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we want to instrument this application using OpenTelemetry we have to add appropriate OpenTelemetry instrumentation packages. For this ASP .Net Core app we have to add &lt;code&gt;OpenTelemetry.Instrumentation.AspNetCore&lt;/code&gt; and because we are using MsSQL we have to add &lt;code&gt;OpenTelemetry.Instrumentation.SqlClient&lt;/code&gt; packages to generate spans every time those two libraries are called.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--O3x3bKOT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.instana.com/media/image-20220603115454335-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--O3x3bKOT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.instana.com/media/image-20220603115454335-1.png" alt="" width="880" height="196"&gt;&lt;/a&gt;&lt;br&gt;&lt;br&gt;
 &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uT_TeGWo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://ibm.ent.box.com/file/image-20220603115454335.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uT_TeGWo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://ibm.ent.box.com/file/image-20220603115454335.png" alt="" width="" height=""&gt;&lt;/a&gt;&lt;br&gt;&lt;br&gt;
At the end, we want to report it somewhere, so we need OTel exporter. In order to export traces from serverless environment to Instana backend we choose &lt;a href="https://www.nuget.org/packages/OpenTelemetry.Exporter.Instana/"&gt;OpenTelemetry.Exporter.Instana&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Crucially, in the OTel .NET world everything starts with the &lt;code&gt;TracerProvider&lt;/code&gt;, so during initialization of this object we have to list all instrumentations and exporters. In our case that means &lt;code&gt;SqlClientInstrumentation&lt;/code&gt; and &lt;code&gt;AspNetCoreInstrumentation&lt;/code&gt;, and because we want to report to the Instana backend we also add the &lt;code&gt;InstanaExporter&lt;/code&gt;.&lt;/p&gt;
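
&lt;p&gt;As a minimal sketch of that initialization (assuming the OpenTelemetry, OpenTelemetry.Instrumentation.AspNetCore, OpenTelemetry.Instrumentation.SqlClient and OpenTelemetry.Exporter.Instana NuGet packages; extension method names can vary between package versions):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;using OpenTelemetry;
using OpenTelemetry.Trace;

var tracerProvider = Sdk.CreateTracerProviderBuilder()
    .AddAspNetCoreInstrumentation()   // spans for incoming HTTP requests
    .AddSqlClientInstrumentation()    // spans for the MSSQL calls
    .AddInstanaExporter()             // ships the spans to the Instana backend
    .Build();&lt;/code&gt;&lt;/pre&gt;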

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--x9KV60aU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.instana.com/media/image-20220603115523444.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--x9KV60aU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.instana.com/media/image-20220603115523444.png" alt="Adding InstanaExporter" width="880" height="253"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We also have to provide two environment variables before or while the application starts. These variables are mandatory because the exporter needs to know where to report the generated spans:&lt;br&gt;&lt;br&gt;
&lt;code&gt;ENV INSTANA_ENDPOINT_URL=endpoint_url&lt;/code&gt;&lt;br&gt;&lt;br&gt;
&lt;code&gt;ENV INSTANA_AGENT_KEY=the_agent_key&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Each HTTP call that results in a database call will be captured and immediately reported to Instana’s backend. Here we can see the HTTP entry call &lt;code&gt;/init&lt;/code&gt; with all of the following DB calls reported to Instana’s backend.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FQ2mL0D6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.instana.com/media/InitCall-1024x485.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FQ2mL0D6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.instana.com/media/InitCall-1024x485.png" alt="Instana injesting Open Telemetry Distrubuted Traces" width="880" height="417"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Same for &lt;code&gt;/read&lt;/code&gt;:&lt;br&gt;&lt;br&gt;
 &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--q8pIkVk5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.instana.com/media/readCall-1024x488.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--q8pIkVk5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.instana.com/media/readCall-1024x488.png" alt="Instana and OpenTelemetry" width="880" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see the converted and processed span data from OpenTelemetry in the Stan dashboard.&lt;/p&gt;

&lt;p&gt;In conclusion, OpenTelemetry is rapidly gaining popularity in the observability world, especially given the flexibility for its data to be exported, consumed, and processed by any vendor-specific backend.&lt;/p&gt;

&lt;p&gt;Instana offers a convenient way to bring OpenTelemetry traces into the Instana platform for customers hosting the Instana Agent, and now also for .NET applications running in a serverless environment through the InstanaExporter.&lt;/p&gt;

&lt;p&gt;Instana is wherever our customers are. By introducing an OpenTelemetry exporter for span data, we continue supporting use cases of our customers and continue with our integration of OpenTelemetry as a first class citizen into our observability platform.&lt;/p&gt;

&lt;h3&gt;
  
  
  Check Out Our Solution in the OpenTelemetry GitHub Repository
&lt;/h3&gt;

&lt;p&gt;Let us know what you think! Check out our OTel solution in the &lt;a href="https://github.com/open-telemetry/opentelemetry-dotnet-contrib/tree/main/src/OpenTelemetry.Exporter.Instana"&gt;&lt;strong&gt;OTel GitHub repository&lt;/strong&gt;&lt;/a&gt; and leave us some feedback there or at &lt;a href="mailto:support@instana.com"&gt;&lt;strong&gt;support@instana.com&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Come back to see how we work with and support the OpenTelemetry project.&lt;/p&gt;

&lt;p&gt;For more on Instana’s support of OpenTelemetry, here’s some additional reading material:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.instana.com/blog/dissecting-the-opentelemetry-collector-overview/"&gt;&lt;strong&gt;Dissecting the OpenTelemetry Collector: An&lt;br&gt;
Overview&lt;/strong&gt; &lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.instana.com/docs/ecosystem/opentelemetry/#main"&gt;&lt;strong&gt;OpenTelemetry&lt;br&gt;
documentation&lt;/strong&gt; &lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.instana.com/blog/instana-autotrace-fully-embracing-opentelemetry/"&gt;&lt;strong&gt;Instana AutoTrace: Fully Embracing&lt;br&gt;
OpenTelemetry&lt;/strong&gt; &lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.instana.com/blog/the-rise-of-distributed-tracing-with-opentelemetry/"&gt;&lt;strong&gt;The Rise of Distributed Tracing with&lt;br&gt;
OpenTelemetry&lt;/strong&gt; &lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.instana.com/blog/latest-instana-openshift-operator-automatically-collects-opentelemetry-on-red-hat-openshift/"&gt;&lt;strong&gt;Latest Instana OpenShift Operator Automatically Collects&lt;br&gt;
OpenTelemetry on Red Hat&lt;br&gt;
OpenShift&lt;/strong&gt; &lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The post &lt;a href="https://www.instana.com/blog/instana-introduces-opentelemetry-exporter-for-net/"&gt;Instana Introduces OpenTelemetry Exporter for .NET&lt;/a&gt; appeared first on &lt;a href="https://www.instana.com"&gt;Instana&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>announcement</category>
      <category>engineering</category>
    </item>
    <item>
      <title>What is OpenTelemetry?</title>
      <dc:creator>Instana</dc:creator>
      <pubDate>Thu, 14 Jul 2022 19:57:43 +0000</pubDate>
      <link>https://dev.to/instanahq/what-is-opentelemetry-3a33</link>
      <guid>https://dev.to/instanahq/what-is-opentelemetry-3a33</guid>
      <description>&lt;p&gt;You’ve probably heard of OpenCensus and OpenTracing, but what about OpenTelemetry? &lt;/p&gt;

&lt;p&gt;Distributed and cloud-native environments make it challenging to monitor application performance. The bottom line is that if you’re looking to understand your system’s behavior, it’s crucial to collect telemetry data. The problem is that no product on the market has offered a single instrument for collecting this data across all of an organization’s applications and systems. That is, until OpenTelemetry hit the market.&lt;/p&gt;

&lt;p&gt;OpenTelemetry has finally standardized a way for DevOps and IT professionals to collect and transmit telemetry data to an observability backend. In this guide, we’ll take a deep dive into what OpenTelemetry is, how it’s used, its benefits, and everything else you need to know to get started with this framework.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What Is OpenTelemetry?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sSuurWDn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.instana.com/media/OpenTelemetry-03.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sSuurWDn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.instana.com/media/OpenTelemetry-03.jpg" alt="What is opentelemetry definition with Opentelemetry Symbol" width="880" height="284"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;OpenTelemetry is an open-source observability framework with a collection of software development kits (SDKs), vendor-neutral (or vendor-agnostic) APIs, and tools for instrumentation. This technology can instrument applications and generate, collect, and export telemetry data to analyze your platform’s behavior and performance. OpenTelemetry is also known as OTel.&lt;/p&gt;

&lt;p&gt;IT groups and DevOps professionals must use instrumentation to create an observable system in &lt;a href="https://www.instana.com/blog/the-evolution-of-apm-and-cloud-native-applications/"&gt;cloud-native applications&lt;/a&gt;. Instrumentation code used to vary from vendor to vendor, making it difficult for companies to change backends. Switching tools was hard because it meant re-instrumenting code and reconfiguring new agents to send telemetry data to the new tools.&lt;/p&gt;

&lt;p&gt;After seeing the need for a standardized system, &lt;a href="https://www.cncf.io/"&gt;Cloud Native Computing Foundation&lt;/a&gt; (CNCF) sponsored the OpenTelemetry project to create a standardized way to send, collect, and transfer telemetry data to backend observability platforms. OpenTelemetry was born from combining the distributed tracing technology of &lt;a href="https://opencensus.io/"&gt;OpenCensus&lt;/a&gt; and &lt;a href="https://opentracing.io/"&gt;OpenTracing&lt;/a&gt; into one tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What Is Telemetry Data?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To gain a deeper understanding of OpenTelemetry, let’s take a closer look at what telemetry data is and how your organization can use it.&lt;/p&gt;

&lt;p&gt;A key part of successful application performance is having observability through access to data. IT professionals use telemetry data to determine the health and performance of an application.&lt;/p&gt;

&lt;p&gt;OpenTelemetry creates a standard for collecting and transferring telemetry data in cloud-native applications. These metrics can then be analyzed and monitored by your organization to improve your platforms. &lt;/p&gt;

&lt;p&gt;Telemetry data is composed primarily of outputs collected from logs, metrics, and traces. These are often referred to as the three pillars of observability. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Logs:&lt;/strong&gt; Logs are a timestamp or record of events in your application. The important events identified using logs show errors or unpredictable behaviors within your system. This information will signal to your internal teams that a problem has occurred so you can fix it before more users experience the error. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Metrics:&lt;/strong&gt; Metrics are typically where you’ll see the first sign of an issue occurring in your system. These give you numerical values or sets of measurements that show your resource utilization and application performance. The three main types of metrics are delta, gauge, and cumulative. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Traces:&lt;/strong&gt; Traces evaluate how requests move through a server in distributed cloud environments by monitoring how an operation transfers from node to node. Traces can provide only limited visibility into application health because they focus solely on the application layers. To get a complete picture of what is going on in your system, it’s also essential to monitor your metrics and logs.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Collecting telemetry data is an important step in the OpenTelemetry and observability process. Next, we’ll discuss how OpenTelemetry is used in a dispersed cloud environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How Does OpenTelemetry Work?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5ScxrxI6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.instana.com/media/OpenTelemetry-04-1.jpg" alt="How OpenTelemetry works, with an explanation of each of its components" width="880" height="462"&gt;&lt;/p&gt;

&lt;p&gt;In a nutshell, OpenTelemetry works by combining an API, SDK, Collector, and automatic instrumentation to pull data and send it to its target system. In order to make your system more agnostic, there are several steps it needs to take using these components.&lt;/p&gt;

&lt;p&gt;An API will create traces by instrumenting your code and dictating which metrics need to be collected. Your SDK will then gather, translate, and send that data to the next stage. The OpenTelemetry Collector processes the data, filters it, and exports it to a supported backend.&lt;/p&gt;
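
&lt;p&gt;A minimal .NET sketch illustrates the split between the API and the SDK (it assumes the OpenTelemetry and OpenTelemetry.Exporter.Console NuGet packages; all names are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;using System.Diagnostics;
using OpenTelemetry;
using OpenTelemetry.Trace;

// API side: the code is instrumented against an ActivitySource,
// with no knowledge of where the telemetry will end up
var source = new ActivitySource("demo-service");

// SDK side: wire that source to an exporter; swapping the exporter
// changes the backend without touching the instrumentation above
using var tracerProvider = Sdk.CreateTracerProviderBuilder()
    .AddSource("demo-service")
    .AddConsoleExporter() // e.g. replace with an OTLP exporter for a Collector
    .Build();

using (var activity = source.StartActivity("do-work"))
{
    activity?.SetTag("example.tag", "example-value");
}&lt;/code&gt;&lt;/pre&gt;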

&lt;h3&gt;
  
  
  &lt;strong&gt;Components of OpenTelemetry&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;There are many moving pieces when it comes to making OTel’s data collection successful. Here is an in-depth explanation of the &lt;strong&gt;four major components of OpenTelemetry&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;API:&lt;/strong&gt; An application programming interface (API) enables different software components to communicate with each other. OpenTelemetry’s APIs define the data operations for logging, metrics, and tracing. Essentially, they decouple an application’s instrumentation from the infrastructure, giving developers the flexibility to change the backends that receive their telemetry. APIs are language-specific (Java, Ruby, JavaScript, Python, etc.).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SDK:&lt;/strong&gt; A software development kit (SDK) implements the OTel APIs to generate telemetry data in the language of your choice. After the data has been generated, the SDK exports it to your desired backend. OpenTelemetry SDKs also make it possible to combine the instrumentation of common libraries with an application’s manual instrumentation; they are the bridge between APIs and collectors.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Collector:&lt;/strong&gt; The OpenTelemetry Collector receives, processes, and exports telemetry data. It can support Prometheus, OTLP, Jaeger, and other proprietary tools, can send telemetry data to multiple observability backends, and can assist your organization in filtering and processing your data before exporting it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatic instrumentation:&lt;/strong&gt; Instrumentation libraries generate telemetry data from popular frameworks and libraries without requiring changes to application code.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Together, these components form the framework that makes OTel a winning addition to your application monitoring.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What Are The Benefits of OpenTelemetry?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;There are many benefits to using OpenTelemetry in your open source projects. Each one helps improve &lt;a href="https://www.instana.com/blog/observability-vs-monitoring/"&gt;observability and monitoring&lt;/a&gt;, and together they explain why OTel is the future of application performance monitoring (APM).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Consistency:&lt;/strong&gt; The main benefit of OpenTelemetry is consistent telemetry collection across different applications. The lack of a unified standard created problems for DevOps professionals and SREs. OTel now saves you time, gives you more observability, and collects telemetry data without changing your code. The technology’s broad adoption across organizations is similar to the embrace of Kubernetes as the standard for container orchestration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simplified Observability:&lt;/strong&gt; OTel simplifies observability because it can collect telemetry data without changing code. Now developers don’t have to stick to specific backends or vendors. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flexibility:&lt;/strong&gt; Developers can monitor performance metrics and usage from any web browser or device. The convenience of observing your application from any location makes it easier to track your analytics in real time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Overall, OpenTelemetry’s primary benefit is that it can help you achieve your optimal business goals. This software enables your organization to understand and fix issues that could negatively impact your customer experience. OpenTelemetry gives you the data needed to stop a problem in its tracks before your service is interrupted.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What is OpenTelemetry Used For?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--aPuq5_s3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.instana.com/media/OpenTelemetry-05.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--aPuq5_s3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.instana.com/media/OpenTelemetry-05.jpg" alt="Symbols showing that combining OpenTracing and OpenCensus created OpenTelemetry" width="880" height="329"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;OpenTelemetry’s primary goal is to collect and export telemetry data. OTel assists DevOps professionals in debugging and managing applications. Once they have this data, they can make informed coding decisions and adjust as their organization continues to change and grow. &lt;/p&gt;

&lt;p&gt;There are three main ways OpenTelemetry is used in DevOps to solve application problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prioritizes Requests:&lt;/strong&gt; OpenTelemetry has the unique ability to create a tier system for requests within your system. This is important because competing requests will be correctly prioritized. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Track Resource Usage:&lt;/strong&gt; Capture requests between microservices to attribute resource usage to groups. IT professionals can track this resource usage across shared systems.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability of Microservices:&lt;/strong&gt; Monitor the health of your application by recording telemetry data from applications in distributed systems. Having this information will help your team optimize and run your application correctly. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of these features helps organizations solve common errors when running applications across cloud-native systems. &lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;OpenTelemetry vs OpenTracing&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;OpenTracing is an open-source project that assists developers in instrumenting code for distributed tracing through vendor-neutral APIs. This is beneficial because it doesn’t force you to stick with one particular vendor or product.&lt;/p&gt;

&lt;p&gt;This project is available in nine different languages, including Ruby, Java, and Python. DevOps and IT professionals can use distributed tracing to optimize and debug software architecture code. It is especially useful when dealing with &lt;a href="https://www.instana.com/blog/scaling-microservices-understanding-performance-and-observability/"&gt;microservices&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;The CNCF created OpenTelemetry by merging OpenTracing and OpenCensus into one platform. There have been over 10,000 contributions from 300 companies since the project launched. This broad collaboration has created access to a set of instrumentation that is unmatched in the industry.&lt;/p&gt;

&lt;p&gt;If you were going to choose between the two open source platforms, it would be smart to go with OpenTelemetry since it has more capabilities. &lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Is OpenTelemetry The Future of Instrumentation?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;OpenTelemetry is changing the landscape of observability. Similar to Kubernetes becoming the standard for container orchestration, OpenTelemetry is becoming widely adopted for observability. OpenTelemetry’s adoption and popularity will continue to soar because of the benefits we described above.&lt;/p&gt;

&lt;p&gt;The OpenTelemetry project teams continue to work on stabilizing the software’s core components and building out automatic instrumentation. Once it is out of the beta stage, it’s projected to become the dominant observability framework in cloud-native ecosystems.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Achieve Your Business Goals With Instana and OpenTelemetry&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The bottom line is that OpenTelemetry is not an observability backend but a tool that makes collecting and sending telemetry data more streamlined. Instana is the final piece of this equation as the observability backend. OpenTelemetry formats and SDKs can be a migration path for legacy systems and unsupported technologies. &lt;/p&gt;

&lt;p&gt;Our organization is committed to fully embracing OTel to help you achieve business goals through simplified data collection. We are working on giving users the same visibility that they get with Instana’s AutoTrace through our integration with OpenTelemetry.&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://www.instana.com/blog/what-is-opentelemetry/"&gt;What is OpenTelemetry?&lt;/a&gt; appeared first on &lt;a href="https://www.instana.com"&gt;Instana&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>observability</category>
      <category>opentelemetry</category>
      <category>apm</category>
    </item>
    <item>
      <title>How We Optimize Complex Queries at Processing Time</title>
      <dc:creator>Instana</dc:creator>
      <pubDate>Mon, 27 Jun 2022 19:09:33 +0000</pubDate>
      <link>https://dev.to/instanahq/how-we-optimize-complex-queries-at-processing-time-1531</link>
      <guid>https://dev.to/instanahq/how-we-optimize-complex-queries-at-processing-time-1531</guid>
      <description>&lt;h2&gt;
  
  
  Context
&lt;/h2&gt;

&lt;p&gt;Instana aims to provide accurate and instant application monitoring metrics on dashboards and in &lt;a href="https://www.ibm.com/docs/en/obi/current?topic=capabilities-unbounded-analytics"&gt;Unbounded Analytics&lt;/a&gt;. These metrics are calculated from millions of &lt;a href="https://www.ibm.com/docs/en/obi/current?topic=monitoring-traces#call"&gt;calls&lt;/a&gt; collected from the systems under monitoring. Calls are stored in &lt;a href="https://clickhouse.com/docs/en/intro/"&gt;ClickHouse&lt;/a&gt;, a columnar database, and each call has hundreds of tags stored in columns.&lt;/p&gt;

&lt;p&gt;We use various techniques to speed up the querying of this data. The most important one is the use of &lt;a href="https://altinity.com/blog/clickhouse-materialized-views-illuminated-part-1"&gt;materialized views&lt;/a&gt;. The idea is to select a couple of the most frequently used tags, such as &lt;code&gt;service.name&lt;/code&gt;, &lt;code&gt;endpoint.name&lt;/code&gt; and &lt;code&gt;http.status&lt;/code&gt;, and pre-aggregate metrics (call count, latency, error rate, etc.) over these tags into buckets of different sizes (1 min, 1 h). The materialized view contains far less data than the original table, so it is much faster to read, filter, and aggregate from the view. You can also check out another technique in one of my previous &lt;a href="https://www.instana.com/blog/improve-query-performance-with-clickhouse-data-skipping-index/"&gt;blog posts on the data skipping index&lt;/a&gt;.&lt;/p&gt;
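
&lt;p&gt;As a rough sketch of the idea (the table and column names here are hypothetical, not our actual schema):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- pre-aggregate per-minute metrics over a few frequently used tags
CREATE MATERIALIZED VIEW calls_1min_mv
ENGINE = SummingMergeTree
ORDER BY (service_name, endpoint_name, http_status, bucket)
AS SELECT
    service_name,
    endpoint_name,
    http_status,
    toStartOfMinute(timestamp) AS bucket,
    count() AS call_count,
    sum(latency_ms) AS total_latency_ms,
    countIf(erroneous) AS error_count
FROM calls
GROUP BY service_name, endpoint_name, http_status, bucket&lt;/code&gt;&lt;/pre&gt;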

&lt;p&gt;However, this approach has a limitation. There are two types of tags that cannot be included in the materialized view, so queries filtering or grouping by these tags cannot be optimized:&lt;/p&gt;

&lt;h4&gt;
  
  
  Tags with very high cardinality
&lt;/h4&gt;

&lt;p&gt;Including tags like &lt;code&gt;http.url&lt;/code&gt; in the materialized view would greatly increase the number of rows in the view. For example, if we only include &lt;code&gt;endpoint.name&lt;/code&gt; in the view, endpoint &lt;code&gt;/api/users/{id}&lt;/code&gt; will have only one row per minute if the bucket size is one minute. However, if we additionally include &lt;code&gt;http.path&lt;/code&gt; and the endpoint receives requests with hundreds of different paths such as &lt;code&gt;/api/users/123&lt;/code&gt;, each unique path would generate a new row in the view.&lt;/p&gt;

&lt;h4&gt;
  
  
  Custom key-value tags defined by the user
&lt;/h4&gt;

&lt;p&gt;Users can add custom tags to an agent (&lt;code&gt;agent.tag&lt;/code&gt;), to a call through the SDK (&lt;code&gt;call.tag&lt;/code&gt;) or to a Docker container (&lt;code&gt;docker.label&lt;/code&gt;), or define a custom HTTP header (&lt;code&gt;call.http.header&lt;/code&gt;). Each tag has a custom key and value, e.g. &lt;code&gt;agent.tag.env=prod&lt;/code&gt;, &lt;code&gt;docker.label.version=1.0&lt;/code&gt;. The keys are dynamic and unknown to Instana, so we cannot create a static materialized view on top of these columns.&lt;/p&gt;

&lt;p&gt;We need a solution to optimize the latency of queries that use these tags.&lt;/p&gt;

&lt;h2&gt;
  
  
  Solution
&lt;/h2&gt;

&lt;p&gt;The solution we came up with is to automatically detect complex queries that cannot be optimized by the materialized views, register them as &lt;em&gt;precomputed filters&lt;/em&gt;, and tag calls matching these filters at processing time. The idea is to move the complexity from query time to processing time, which lets us better distribute the filtering and aggregation workload over time during call processing. When the load increases, it’s easier and less costly to scale out the processing component than the database.&lt;/p&gt;

&lt;p&gt;The general architecture looks as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4F38kRT7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.instana.com/media/complex-query-optimization-architecture-1024x657.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4F38kRT7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.instana.com/media/complex-query-optimization-architecture-1024x657.png" alt="" width="880" height="565"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Step 1: The reading component detects and registers the complex queries as &lt;em&gt;precomputed filters&lt;/em&gt; and pushes them to a shared database. A &lt;em&gt;precomputed filter&lt;/em&gt; is basically a mapping between a key and a complex filter on call tags, e.g. &lt;code&gt;filter1: endpoint.name=foo AND call.http.header.version=88 AND call.tag.os=android AND call.erroneous=true&lt;/code&gt;, plus some metadata such as creation time or last hit time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Step 2: The processing component reads the &lt;em&gt;precomputed filters&lt;/em&gt; from the shared database. Each incoming call will be matched against all the registered &lt;em&gt;precomputed filters&lt;/em&gt;. If there’s a match, the call will be tagged with the filter id. A call can be tagged with multiple ids if it matches multiple filters.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Step 3: Calls are stored in Clickhouse with an additional column &lt;code&gt;precomputed_filter_ids Array(String)&lt;/code&gt;. We then create a materialized view which groups calls by each precomputed filter id. The id will be the primary key and sorting key of the view table, followed by the bucket timestamp, so querying the view filtered by id is extremely fast.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LznGQmB2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.instana.com/media/complex-query-table-layout-1024x399.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LznGQmB2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.instana.com/media/complex-query-table-layout-1024x399.png" alt="" width="880" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Step 4: The reading component can transform a complex query into &lt;code&gt;precomputed_filter.id = xxx&lt;/code&gt;, and query the materialized view to return the metrics for calls matching the complex query.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sample pseudo query:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;SELECT SUM(call_count)
FROM precomputed_filter_view
WHERE time &amp;gt; toDateTime('2022-06-01 00:00:00')
AND time &amp;lt; toDateTime('2022-06-01 12:00:00')
AND precomputed_filter_id = '1'&lt;/code&gt;&lt;/pre&gt;

&lt;h4&gt;
  
  
  How do we handle grouping?
&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;precomputed_filter.id = xxx&lt;/code&gt; only handles the filtering part. If the query requests metrics grouped by a tag such as &lt;code&gt;endpoint.name&lt;/code&gt;, we need to handle this with additional steps:&lt;/p&gt;

&lt;p&gt;During processing, if a call matches the filter, we extract the value of the grouping tag &lt;code&gt;endpoint.name&lt;/code&gt; from the call and store it in an additional column. This column is also included in the materialized view, placed after the &lt;code&gt;precomputed_filter_id&lt;/code&gt; and &lt;code&gt;time&lt;/code&gt; columns in the sorting key.&lt;/p&gt;

&lt;p&gt;Sample pseudo query:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;SELECT precomputed_filter_group, SUM(call_count)
FROM precomputed_filter_view
WHERE time &amp;gt; toDateTime('2022-06-01 00:00:00')
AND time &amp;lt; toDateTime('2022-06-01 12:00:00')
AND precomputed_filter_id = '1'
GROUP BY precomputed_filter_group&lt;/code&gt;&lt;/pre&gt;

&lt;h2&gt;
  
  
  Result
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5P4nmA0u--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.instana.com/media/query_analysis-1024x390.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5P4nmA0u--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.instana.com/media/query_analysis-1024x390.png" alt="" width="880" height="335"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Above is a very coarse-grained analysis of the queries customers ran during one day in our EU region, broken down by the different tables and views. We can see that queries against the precomputed filter materialized view are almost 10 times faster than those against the original calls table, and 3 times faster than queries optimized by the standard materialized view of the same bucket size (1 min).&lt;/p&gt;

&lt;h2&gt;
  
  
  Limitations and future improvements
&lt;/h2&gt;

&lt;p&gt;The major limitation is that a query can only be optimized after it is first registered as a precomputed filter. This works well for recurrent queries that users run on a regular basis. However, if a user runs an ad-hoc query in Unbounded Analytics for the first time over the last day or so, the optimization cannot kick in immediately. To limit the load on the processing pipeline, we also disable a precomputed filter if it is not used over a certain period of time.&lt;/p&gt;

&lt;p&gt;Some complex queries are predictable if they are configured in a custom dashboard or alerting configuration. In these cases, we can use the configuration to create &lt;em&gt;precomputed filters&lt;/em&gt; so that users can see the metrics and charts quickly even if they open the custom dashboard or jump from an alert to Unbounded Analytics for the first time.&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://www.instana.com/blog/optimize-complex-columnar-queries/"&gt;How We Optimize Complex Queries at Processing Time&lt;/a&gt; appeared first on &lt;a href="https://www.instana.com"&gt;Instana&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>clickhouse</category>
      <category>programming</category>
      <category>observability</category>
      <category>database</category>
    </item>
    <item>
      <title>Full-Cycle Observability With Instana and Lightrun</title>
      <dc:creator>Instana</dc:creator>
      <pubDate>Fri, 25 Mar 2022 07:00:38 +0000</pubDate>
      <link>https://dev.to/instanahq/full-cycle-observability-with-instana-and-lightrun-12ec</link>
      <guid>https://dev.to/instanahq/full-cycle-observability-with-instana-and-lightrun-12ec</guid>
      <description>&lt;p&gt;We are excited to announce that Lightrun had partnered with Instana to enrich existing telemetry with real-time, code-level observability data and provide full-cycle Observability.&lt;/p&gt;

&lt;p&gt;Understanding everything that happens inside a production environment is a notoriously difficult task.&lt;/p&gt;

&lt;p&gt;Instana’s solution helps developers and DevOps become aware of problems quickly – problems that are rooted in both infrastructure-level information and application-level information. Lightrun, on the other hand, enables practitioners to drill deeper into line-by-line, debugger-grade information from your production systems – enriching the existing information Instana delivers.&lt;/p&gt;

&lt;p&gt;When application problems occur in production, it’s important to gain immediate access to information regarding the “full lay of the land”, including all the relevant components that could have been the root cause. There’s a term that refers to that level of comprehension of an application: Observability.&lt;/p&gt;

&lt;p&gt;Observability is a property of an application system. An observable system enables DevOps to answer any question about it from outside the system. Observability is a great determining factor in whether we can troubleshoot tough bugs quickly and whether our system is considered reliable. Fast issue resolution (usually measured as MTTR – mean time to resolve) is a great indicator of reliability.&lt;/p&gt;

&lt;p&gt;However, Observability is not just “one thing” – there isn’t a single button you can push to get all the information you want. In fact, when tackling tough issues we often rely on various types of telemetry data to clarify what is actually happening under the hood. We can divide this data, broadly, into two levels of granularity: Infrastructure-level information and application-level information.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--umrVru_R--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.instana.com/media/LR1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--umrVru_R--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.instana.com/media/LR1.png" alt="" width="468" height="273"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The integration between Instana and Lightrun allows us to create full-cycle observability, which in practice looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We’ll first use Instana to understand how the machine running our application or our application itself is feeling, and identify various issues (like performance degradations).&lt;/li&gt;
&lt;li&gt;Then, Developers and DevOps can use Lightrun from the IDE (Integrated Development Environment) to add real-time, on-demand logs, metrics and traces to the running application – without stopping the application or shipping new code.&lt;/li&gt;
&lt;li&gt;The information provided by Lightrun automatically makes its way to Instana and can be consumed right next to information provided by Instana – closing the aforementioned cycle.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These capabilities are important for Development, DevOps, and SRE practitioners for maintaining application performance and reliability.&lt;/p&gt;

&lt;p&gt;For DevOps teams, it helps them instantly triage problems in live application environments and streamline the procedures for delivering issue remediation.&lt;/p&gt;

&lt;p&gt;For SRE teams, it helps them rapidly identify and repair issues that impact application operations, scaling and reliability.&lt;/p&gt;

&lt;p&gt;Developers can debug application code, from their IDE, in production, test, and development without stopping the application or installing updates. This can significantly reduce issue MTTR.&lt;/p&gt;

&lt;p&gt;Combining these two tools to effectively tag-team the problem is a good idea, and will provide enough visibility into the running application to solve many critical production issues.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.lightrun.com/integrations/instana/"&gt;Check out this information&lt;/a&gt; about how to use Instana with Lightrun.&lt;/p&gt;

&lt;p&gt;Try out Instana with a guided tour in our &lt;a href="https://www.instana.com/apm-observability-sandbox/"&gt;Play With environment&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://www.instana.com/blog/full-cycle-observability-with-instana-and-lightrun/"&gt;Full-Cycle Observability With Instana and Lightrun&lt;/a&gt; appeared first on &lt;a href="https://www.instana.com"&gt;Instana&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>announcement</category>
      <category>developer</category>
      <category>devops</category>
      <category>microservices</category>
    </item>
    <item>
      <title>The SRE Guide to Hyperscale for Cloud-Native Applications</title>
      <dc:creator>Instana</dc:creator>
      <pubDate>Tue, 01 Mar 2022 15:19:25 +0000</pubDate>
      <link>https://dev.to/instanahq/the-sre-guide-to-hyperscale-for-cloud-native-applications-2231</link>
      <guid>https://dev.to/instanahq/the-sre-guide-to-hyperscale-for-cloud-native-applications-2231</guid>
      <description>&lt;p&gt;In my &lt;a href="https://www.instana.com/blog/the-sre-guide-to-hyper-resilient-hyperscale-for-cloud-native-applications/"&gt;previous post&lt;/a&gt;, I discussed the advantages of using Instana Enterprise Observability for achieving hyper-resiliency for applications, particularly cloud-native applications. Hyper-resiliency is usually defined as 99.99% system and application availability, or four 9s. Essentially, it is the ability to perform non-stop computing.&lt;/p&gt;

&lt;p&gt;In the cloud, high availability can be difficult, even with the ubiquitous use of cluster technology. Meanwhile, hyperscale for cloud-native applications occurs when infrastructure resources are properly allocated to applications as they scale. If resources are misallocated, especially if they’re under-allocated, application performance can degrade or the application can stop altogether.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.instana.com/enterprise-observability-platform"&gt;Instana Enterprise Observability&lt;/a&gt; helps keep applications available by notifying app teams when problems begin. Granular metrics, events, and traces with context enable teams to rapidly identify issues.&lt;/p&gt;

&lt;p&gt;If the availability or performance issues are caused by under-allocated or unbalanced resources (CPU, memory, network, and storage), Instana can pass that data to Turbonomic, another IBM company. &lt;a href="https://www.turbonomic.com/"&gt;Turbonomic&lt;/a&gt; provides Application Resource Management (ARM), which automatically and dynamically manages and allocates infrastructure resources for applications.&lt;/p&gt;

&lt;p&gt;Combining Turbonomic ARM with Instana Enterprise Observability keeps application resource allocation optimized to ensure Service Level Objectives for both performance and availability. ARM procedures can be fully automated or partially automated to enable server resource adjustments that enhance application resiliency and performance, and optimize resource allocation cost.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How ARM and observability work together&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Instana monitors application metrics, events, traces, and logs to provide a rich mosaic of application health information. It captures these measurements at unmatched one-second intervals. At this frequency, Instana can observe and identify any issues, either application or infrastructure, and match them with upstream and downstream dependencies in real time.&lt;/p&gt;

&lt;p&gt;One-second monitoring granularity is one of the most critical attributes for hyper-resiliency because longer sample times of 10 seconds or higher are not adequate for detecting anomalies. Events in microservice applications and the surrounding infrastructure take place in microseconds, meaning that they can go undetected for a long time with sampling.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.instana.com/media/turbonomic-blog-imagee.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FoBazJg9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.instana.com/media/turbonomic-blog-imagee.png" alt="Events in microservice applications and the surrounding infrastructure take place in microseconds, meaning that they can go undetected for a long time with sampling." width="880" height="588"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Instana’s Enterprise Observability powers rapid anomaly recognition so Turbonomic can apply problem remediation to provide the strongest SLO compliance. If it’s a code issue, Instana’s Auto Profiler identifies the problematic code within a few clicks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.turbonomic.com/resources/data-sheets/instana-turbonomic-overview/"&gt;The combination of Instana + Turbonomic&lt;/a&gt; creates a seamless and automatic remediation path for any issues that are attributable to mismatched application resources.&lt;/p&gt;

&lt;p&gt;For cloud-native applications, those mismatches happen frequently. One moment your applications are starved for resources due to a sudden surge in activity; moments later, they’re over-allocated as the demand surge drops.&lt;/p&gt;

&lt;p&gt;When application infrastructure resources are low for any microservice, performance degrades – or worse, service crashes. Instana identifies the slow application response time, highlights constrained resources that may be the root cause of the disruption, and passes that data to Turbonomic.&lt;/p&gt;

&lt;p&gt;Turbonomic knows exactly why the resources are constrained and the right adjustment to remediate the disruption. These actions are illustrated in the diagram below, which highlights how Turbonomic adjusts constrained resources based on a target response time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.instana.com/media/graphical-user-interface-website-description-aut-c.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--36362C_p--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.instana.com/media/graphical-user-interface-website-description-aut-c.png" alt="Turbonomic adjusts constrained resources based on a target response time." width="880" height="364"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Proper resource allocation is critical
&lt;/h3&gt;

&lt;p&gt;Turbonomic acts when resources are under-allocated to make sure that performance degradation (or worse) does not occur. Turbonomic automatically adjusts application resources to avoid resource contention or under-allocation that can negatively impact SLOs.&lt;/p&gt;

&lt;p&gt;Conversely, when resources are over-allocated, Turbonomic automatically makes adjustments based on thresholds you define. This helps dramatically reduce cloud overspend, which is equally problematic.&lt;/p&gt;

&lt;p&gt;Instana + Turbonomic is a power combo that will rapidly become an SRE’s best friend. The combination enables hyperscale with hyper-resiliency, cost effectively. It paves the path to automated SLO compliance and continuous performance consistency, especially for your cloud-native applications.&lt;/p&gt;

&lt;p&gt;Try out Instana with a guided tour in our &lt;a href="https://www.instana.com/apm-observability-sandbox/"&gt;Play With environment&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://www.instana.com/blog/the-sre-guide-to-hyperscale-for-cloud-native-applications/"&gt;The SRE Guide to Hyperscale for Cloud-Native Applications&lt;/a&gt; appeared first on &lt;a href="https://www.instana.com"&gt;Instana&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>sre</category>
      <category>devops</category>
      <category>arm</category>
      <category>apm</category>
    </item>
    <item>
      <title>Dissecting the OpenTelemetry Collector: An Overview</title>
      <dc:creator>Instana</dc:creator>
      <pubDate>Thu, 04 Nov 2021 15:53:14 +0000</pubDate>
      <link>https://dev.to/instanahq/dissecting-the-opentelemetry-collector-an-overview-2pnc</link>
      <guid>https://dev.to/instanahq/dissecting-the-opentelemetry-collector-an-overview-2pnc</guid>
      <description>&lt;p&gt;The OpenTelemetry Collector is the central data collection mechanism for the OpenTelemetry project. We’re going to focus on different angles in subsequent articles, but for now let’s look at it more generally.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deployment
&lt;/h3&gt;

&lt;p&gt;With “single agent” we refer to a scenario where the observability vendor provides a single agent that customers can deploy to their systems, which then acts as the data collection mechanism. In the case of Instana, this would be the &lt;a href="https://www.instana.com/docs/setup_and_manage/host_agent"&gt;Instana Agent&lt;/a&gt;. You would deploy it to your systems via one of the supported mechanisms and basically leave it at that.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://opentelemetry.io/docs/collector/"&gt;OpenTelemetry Collector&lt;/a&gt; supports this scenario, and &lt;a href="https://opentelemetry.io/docs/collector/getting-started/"&gt;the project describes it as follows&lt;/a&gt;: “A Collector instance running with the application or on the same host as the application (e.g. binary, sidecar, or daemonset)”&lt;/p&gt;

&lt;h3&gt;
  
  
  Components &amp;amp; Pipelines
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0QU03aCM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.instana.com/media/Bildschirmfoto-2021-10-28-um-11.58.18-1024x555.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0QU03aCM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.instana.com/media/Bildschirmfoto-2021-10-28-um-11.58.18-1024x555.png" alt="The OpenTelemetry collector architecture" width="880" height="477"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The OpenTelemetry (Otel) collector architecture – source: &lt;a href="https://opentelemetry.io/docs/collector/"&gt;https://opentelemetry.io/docs/collector/&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;OpenTelemetry recognizes four signal types. Three are explicit: spans, metrics, logs. The fourth is resource description. All three explicit signals are processed in a pipeline, which gives us the opportunity to post-process and export each of them individually to the desired target. If you were to implement a fully open source-based approach to observability, you might be using &lt;a href="https://www.instana.com/docs/ecosystem/prometheus/#main"&gt;Prometheus&lt;/a&gt; for metrics, &lt;a href="https://www.instana.com/docs/ecosystem/jaeger/#main"&gt;Jaeger&lt;/a&gt; for traces and an &lt;a href="https://www.instana.com/docs/ecosystem/elk/#main"&gt;ELK stack&lt;/a&gt; for your logging needs. These tools all expect their signals in a dedicated format.&lt;/p&gt;

&lt;p&gt;The collector nicely separates those signals into “pipelines,” which helps in tailoring the inputs for the desired output. An example definition for pipelines might look like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;service:
  pipelines:
    metrics:
      receivers: [opencensus, prometheus]
      exporters: [opencensus, prometheus]
    traces:
      receivers: [jaeger]
      processors: [batch]
      exporters: [zipkin]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This defines two pipelines, metrics and traces. In the metrics pipeline, the &lt;a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/opencensusreceiver"&gt;opencensus-receiver&lt;/a&gt; essentially opens an HTTP endpoint to receive OpenCensus data, while the &lt;a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/prometheusreceiver"&gt;prometheus-receiver&lt;/a&gt; can be used for scraping. The traces pipeline applies the &lt;a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/jaegerreceiver"&gt;jaeger-receiver&lt;/a&gt;, which opens a Jaeger-compatible endpoint on the Collector.&lt;/p&gt;

&lt;p&gt;The traces pipeline also applies some batching logic through the &lt;a href="https://github.com/open-telemetry/opentelemetry-collector/tree/main/processor/batchprocessor"&gt;batch-processor&lt;/a&gt;: the Collector accumulates many individual signals into batches before they are dispatched to the exporter, so we don’t hammer the receiving endpoints too much.&lt;/p&gt;

&lt;p&gt;After processing, the exporters are applied – the OpenCensus and Prometheus exporters forward metrics to compatible remote endpoints, and trace data is forwarded to a remote Zipkin endpoint.&lt;/p&gt;

&lt;p&gt;On a higher level, the components involved in handling these signals are the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;receivers:&lt;/strong&gt; receive data from other sources. We can consider them the inputs to the collector. The only core receiver is the OTLP receiver, which ingests the OpenTelemetry Protocol (OTLP).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;extensions:&lt;/strong&gt; add additional functionality to the collector executable, such as health checks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;processors:&lt;/strong&gt; work on the telemetry data in pipelines, for example controlling batching, adding attributes, performing necessary conversions, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;exporters:&lt;/strong&gt; take the data and make it available to outside consumers, for example an observability platform that can make the data useful by aggregating it and providing insights&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is important to note that the core OpenTelemetry Collector only ships OTLP receivers and exporters, so the project can concentrate on being compliant with OTLP and delegate other protocols to the community.&lt;/p&gt;

&lt;p&gt;The contrib collector distribution then bundles the &lt;a href="https://github.com/open-telemetry/opentelemetry-collector-contrib"&gt;opentelemetry-collector-contrib&lt;/a&gt; plugins, which extend the vanilla collector with more vendor-specific exporters, processors, and other components.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why is the Collector important for Instana?
&lt;/h3&gt;

&lt;p&gt;OpenTelemetry is an amazing project, and it’s great to see the community of observability vendors and developers coming together and further evolving the data collection process that we all tackle individually.&lt;/p&gt;

&lt;p&gt;Instana’s high-granularity data model is currently bound to our Instana Host Agent and the in-process collectors we provide. We opened our Agent for ingress of locally produced OTLP data for tracing a while ago, and we are currently analyzing where we can provide the most value for our customers going forward as OpenTelemetry continues to gain traction.&lt;/p&gt;

&lt;p&gt;As the OpenTelemetry Collector is the central piece of collection, it is a great target for us to make a dent in the universe. It provides good mechanisms to enrich the telemetry signals with the data points we need and to transform them into our own data model. The challenge, from a vendor perspective, is to strike the right balance between open ingress and precise output from the collector.&lt;/p&gt;

&lt;p&gt;Stay tuned as we continue to work with and support the OpenTelemetry project. For more on Instana’s support of OpenTelemetry, here’s some recreational reading material:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.instana.com/docs/ecosystem/opentelemetry/#main"&gt;OpenTelemetry documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.instana.com/blog/instana-autotrace-fully-embracing-opentelemetry/"&gt;Instana AutoTrace: Fully Embracing OpenTelemetry&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.instana.com/blog/the-rise-of-distributed-tracing-with-opentelemetry/"&gt;The Rise of Distributed Tracing with OpenTelemetry&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.instana.com/blog/latest-instana-openshift-operator-automatically-collects-opentelemetry-on-red-hat-openshift/"&gt;Latest Instana OpenShift Operator Automatically Collects OpenTelemetry on Red Hat OpenShift&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The post &lt;a href="https://www.instana.com/blog/dissecting-the-opentelemetry-collector-overview/"&gt;Dissecting the OpenTelemetry Collector: An Overview&lt;/a&gt; appeared first on &lt;a href="https://www.instana.com"&gt;Instana&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>observability</category>
      <category>monitoring</category>
      <category>opentelemetry</category>
    </item>
    <item>
      <title>DevOps Horror Stories to Slow Development and Freeze Operations</title>
      <dc:creator>Instana</dc:creator>
      <pubDate>Tue, 26 Oct 2021 21:06:04 +0000</pubDate>
      <link>https://dev.to/instanahq/devops-horror-stories-to-slow-development-and-freeze-operations-29gd</link>
      <guid>https://dev.to/instanahq/devops-horror-stories-to-slow-development-and-freeze-operations-29gd</guid>
      <description>&lt;p&gt;Halloween is a scary time to be in abandoned buildings, cemeteries, and dark forests… and DevOps teams. Developers, operations engineers, and SREs told us some DevOps horror stories that have haunted them to this day. Light some candles, gather your courage, and read the spine-chilling tales of terrifying errors, bone-chilling data loss, and nightmarish lost weekends.&lt;/p&gt;

&lt;h2&gt;
  
  
  Relax, It’s Just the Complete Loss of All Data
&lt;/h2&gt;

&lt;p&gt;During a routine attempt to gather information from our production MySQL DB, the script I was running did not have a relevant section commented out. It was a dual-purpose script: one section gathered information about the schemas, and another section was dedicated to database migration, with the engineer commenting out whichever section was not needed at the time of operation.&lt;/p&gt;

&lt;p&gt;I neglected to comment out the migration portion of the script during my attempt to do discovery, resulting in the immediate dropping of the entire production database. We had a read-replica in another AZ in AWS, but by the time I recognized the error, the drop of tables had already replicated, resulting in the complete loss of all data. Compounding this, our CTO was out of town, meaning it had to be reported directly to our CEO, who promptly spent the next hour watching over my shoulder as I spun up a new RDS database and restored data from the most recent snapshot – approximately 30 minutes old. I was a newly promoted DevOps engineer, having just moved up from the desktop support team, and I had made this colossal error.&lt;/p&gt;

&lt;p&gt;I still consider it to be the most stressful/terrifying event in my DevOps career.&lt;/p&gt;

&lt;p&gt;Petrified,&lt;/p&gt;

&lt;p&gt;Inopportune Deployment Engineer&lt;/p&gt;

&lt;h2&gt;
  
  
  These Credentials Don’t Work in Swedish
&lt;/h2&gt;

&lt;p&gt;I was responsible for setting up the flow for a tourist company. The flow had one queue which was protected by a simple username/password (a long time ago, this was normal). The credentials were shared with me in an email. The issue was: the guy who shared them was Swedish, and they were in the format: användare lösenord.&lt;/p&gt;

&lt;p&gt;I didn’t speak Swedish at that time and just added the credentials. When we launched it on prod we started losing all the messages. After investigation we found out that the actual user and password were hidden in whitespace and became visible on select… It was the way to “secure” them in an email.&lt;/p&gt;

&lt;p&gt;And those words just mean “user” and “password” in Swedish.&lt;/p&gt;

&lt;p&gt;Terrified,&lt;/p&gt;

&lt;p&gt;Senior Developer&lt;/p&gt;

&lt;h2&gt;
  
  
  A Not-So-Hot Fix in Production
&lt;/h2&gt;

&lt;p&gt;In the heat of applying a hot fix in production, I accidentally deleted all k8s deployments in a non-default namespace with just one command. With collaborative effort from development, we recovered quickly. But a simple kubectl command can wipe out almost your whole cluster without any request for confirmation.&lt;/p&gt;

&lt;p&gt;Paranoid,&lt;/p&gt;

&lt;p&gt;Senior Site Reliability Engineer&lt;/p&gt;

&lt;h2&gt;
  
  
  A Weekend Ruined By Floppy Disks
&lt;/h2&gt;

&lt;p&gt;A long time ago, when it was still a fairly common and feasible practice to put an entire app’s database on a few floppy disks, I made the mistake of fiddling with the .DBF files without first making a backup. Needless to say, I screwed something up and had to spend the rest of my weekend fixing the files using a C program I cobbled together to gather up all the old data into new tables.&lt;/p&gt;

&lt;p&gt;Luckily, I had enough information from reference materials on hand to be able to figure out the file format and where all the data was on the disks (this was pre-Internet times). It still wasn’t fun, and my supervisor rightfully chewed me out for not taking proper precautions.&lt;/p&gt;

&lt;p&gt;Freaked out,&lt;/p&gt;

&lt;p&gt;Consultant&lt;/p&gt;

&lt;h2&gt;
  
  
  A Case of Bad Timing
&lt;/h2&gt;

&lt;p&gt;I once deployed an application ahead of time and scheduled a cron to restart the webserver at 8am, but instead it was every 8 minutes. #DevOoops!&lt;/p&gt;

&lt;p&gt;With curdled blood,&lt;/p&gt;

&lt;p&gt;Developer&lt;/p&gt;

&lt;h2&gt;
  
  
  The Road to Prod Is Paved with Good Intentions
&lt;/h2&gt;

&lt;p&gt;One of my developers decided to “improve” a production deployment script. He started making changes directly in the production environment, not in development, against my advice; but management didn’t seem concerned. At 5pm on the dot he left work every day. This day he left as usual. The script changes were unfinished and untested, but live in production. All production deployments failed overnight, costing the company many tens of thousands of dollars.&lt;/p&gt;

&lt;p&gt;I came into the office in the morning, was confronted by livid Operations staff (and their manager), and quickly reverted his code. This helped convince Development management to see that code changes needed to be done in dev first. The developer was “convinced to resign from the company,” and he did. Many years later I still run into developers who want to take shortcuts to production, and I tell them this story.&lt;/p&gt;

&lt;p&gt;Alarmed,&lt;/p&gt;

&lt;p&gt;Architect&lt;/p&gt;

&lt;h2&gt;
  
  
  Wedding Day Fiasco
&lt;/h2&gt;

&lt;p&gt;Years ago, one of our developers added a new feature to our web application on Friday just before our release. The application was delivered to the customer that evening and deployed overnight. On Saturday morning, application users began to report that one of the main functions of the application was not responsive – essentially preventing them from doing their job. The incident was escalated, and I was called in early Saturday afternoon to help troubleshoot the problem.&lt;/p&gt;

&lt;p&gt;It took a couple of hours to find the problem because we didn’t really have great application telemetry at the time or a great way to debug the deployed application. There was an SQL query that was inadvertently pulling hundreds of thousands of records from the customer database, triggered by every application user. The problem wasn’t seen in development because the developers were using a tiny dataset in comparison. Patching the SQL statement with a LIMIT clause restored the application to its normal speedy self.&lt;/p&gt;

&lt;p&gt;Oh, and by the way, when I was called in to troubleshoot the problem, I was called away from a friend’s wedding.&lt;/p&gt;

&lt;p&gt;Shrieking,&lt;/p&gt;

&lt;p&gt;Senior Principal Software Developer&lt;/p&gt;

&lt;p&gt;Apologies in advance for your sleepless night tonight. But if you're like us, and can't keep your eyes away, we have some more unfortunate tales where these came from, like the &lt;a href="https://www.instana.com/blog/how-a-slack-zero-width-space-character-broke-a-kubernetes-deployment/"&gt;“Zero Width Space”&lt;/a&gt; character that broke a k8s deployment, and the fact that &lt;a href="https://www.instana.com/blog/life-of-an-sre-at-instana-things-break-all-the-time-in-distributed-systems-part-2-cassandra/"&gt;things break all the time when you’re an SRE&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>sre</category>
    </item>
    <item>
      <title>Improve Query Performance with Clickhouse Data Skipping Index</title>
      <dc:creator>Instana</dc:creator>
      <pubDate>Thu, 07 Oct 2021 06:59:46 +0000</pubDate>
      <link>https://dev.to/instanahq/improve-query-performance-with-clickhouse-data-skipping-index-2agm</link>
      <guid>https://dev.to/instanahq/improve-query-performance-with-clickhouse-data-skipping-index-2agm</guid>
      <description>&lt;h2&gt;
  
  
  Operating a Large Clickhouse Table
&lt;/h2&gt;

&lt;p&gt;At Instana, we process and store every single call collected by Instana tracers with no sampling over the last 7 days. Instana’s &lt;a href="https://www.instana.com/docs/unbounded_analytics/"&gt;Unbounded Analytics&lt;/a&gt; feature allows filtering and grouping calls by arbitrary tags to gain insights into the unsampled, high-cardinality tracing data. We are able to provide 100% accurate metrics such as call count, latency percentiles or error rate, and display the detail of every single call.&lt;/p&gt;

&lt;p&gt;For many of our large customers, over 1 billion calls are stored every day. This number reaches 18 billion for our largest customer now and it keeps growing. Calls are stored in a single table in &lt;a href="https://www.instana.com/supported-technologies/clickhouse-monitoring/"&gt;Clickhouse&lt;/a&gt; and each call tag is stored in a column. Filtering this large number of calls, aggregating the metrics and returning the result within a reasonable time has always been a challenge.&lt;/p&gt;

&lt;p&gt;Previously we have created materialized views to pre-aggregate calls by some frequently used tags such as &lt;a href="https://www.instana.com/docs/application_monitoring/"&gt;application/service/endpoint&lt;/a&gt; names or HTTP status code. However, we cannot include all tags into the view, especially those with high cardinalities because it would significantly increase the number of rows in the materialized view and therefore slow down the queries.&lt;/p&gt;

&lt;p&gt;Filtering on high cardinality tags not included in the materialized view still requires a full scan of the calls table within the selected time frame which could take over a minute.&lt;/p&gt;

&lt;h2&gt;
  
  
  Optimize filtering on HTTP URL
&lt;/h2&gt;

&lt;p&gt;Filtering on HTTP URL is a very frequent use case. The cardinality of HTTP URLs can be very high since we could have randomly generated URL path segments such as &lt;code&gt;/api/product/{id}&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Choose the data skipping index
&lt;/h2&gt;

&lt;p&gt;The Clickhouse &lt;a href="https://clickhouse.tech/docs/en/engines/table-engines/mergetree-family/mergetree/"&gt;MergeTree&lt;/a&gt; table engine provides a few &lt;a href="https://clickhouse.tech/docs/en/engines/table-engines/mergetree-family/mergetree/#table_engine-mergetree-data_skipping-indexes"&gt;data skipping indexes&lt;/a&gt;, which make queries faster by skipping granules of data (a granule is the smallest indivisible data set that ClickHouse reads when selecting data) and therefore reducing the amount of data to read from disk. &lt;code&gt;ngrambf_v1&lt;/code&gt; and &lt;code&gt;tokenbf_v1&lt;/code&gt; are two interesting indexes using &lt;a href="https://en.wikipedia.org/wiki/Bloom_filter"&gt;bloom filters&lt;/a&gt; to optimize filtering on Strings. A bloom filter is a space-efficient probabilistic data structure that allows testing whether an element is a member of a set.&lt;/p&gt;

&lt;h3&gt;
  
  
  ngrambf_v1
&lt;/h3&gt;

&lt;p&gt;A string is split into substrings of n characters. For example, the &lt;code&gt;n=3&lt;/code&gt; ngrams (trigrams) of &lt;code&gt;'hello world'&lt;/code&gt; are &lt;code&gt;['hel', 'ell', 'llo', 'lo ', 'o w', ...]&lt;/code&gt;. The ngrams of each column value will be stored in the bloom filter.&lt;/p&gt;

&lt;p&gt;When searching with a filter &lt;code&gt;column LIKE 'hello'&lt;/code&gt;, the string in the filter will also be split into ngrams &lt;code&gt;['hel', 'ell', 'llo']&lt;/code&gt; and a lookup is done for each of them in the bloom filter. If all the ngram values are present in the bloom filter, we can consider that the searched string may be present in the indexed block.&lt;/p&gt;

&lt;p&gt;Functions with a constant argument smaller than the ngram size can’t be used by &lt;code&gt;ngrambf_v1&lt;/code&gt; for query optimization. For example, searching for &lt;code&gt;'hi'&lt;/code&gt; will not trigger an &lt;code&gt;ngrambf_v1&lt;/code&gt; index with &lt;code&gt;n=3&lt;/code&gt;. A small n allows the index to support shorter search strings, but it also leads to more ngram values, which means more hashing and eventually more false positives. A false positive means reading data that does not contain any rows matching the searched string.&lt;/p&gt;
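&lt;p&gt;For illustration, this is how an &lt;code&gt;ngrambf_v1&lt;/code&gt; index with &lt;code&gt;n=3&lt;/code&gt; would be declared (the table and index names here are assumed); the remaining parameters are the bloom filter size in bytes, the number of hash functions and a random seed:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- ngrambf_v1(n, bloom_filter_size_in_bytes, number_of_hash_functions, random_seed)
ALTER TABLE calls ADD INDEX ngram_http_url_index http_url TYPE ngrambf_v1(3, 10240, 3, 0) GRANULARITY 4;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;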

&lt;p&gt;Since false positive matches are possible in bloom filters, the index cannot be used when filtering with negative operators such as &lt;code&gt;column_name != 'value'&lt;/code&gt; or &lt;code&gt;column_name NOT LIKE '%hello%'&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  tokenbf_v1
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;tokenbf_v1&lt;/code&gt; splits the string into tokens separated by non-alphanumeric characters and stores the tokens in the bloom filter. &lt;code&gt;'Hello world'&lt;/code&gt; is split into two tokens, &lt;code&gt;['hello', 'world']&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In addition to the limitation of not supporting negative operators, the searched string must contain at least a complete token. In the above example, searching for &lt;code&gt;hel&lt;/code&gt; will not trigger the index.&lt;/p&gt;

&lt;p&gt;Once we understand how each index behaves, &lt;code&gt;tokenbf_v1&lt;/code&gt; turns out to be a better fit for indexing HTTP URLs, because HTTP URLs are typically path segments separated by &lt;code&gt;/&lt;/code&gt;. Each path segment will be stored as a token. Splitting the URLs into ngrams would lead to many more substrings to store; the index would need to be larger and lookups would be less efficient.&lt;/p&gt;

&lt;h2&gt;
  
  
  Configure the index
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;tokenbf_v1&lt;/code&gt; index needs to be configured with a few parameters. First, the index granularity specifies how many granules of data will be indexed together in a single block using one bloom filter. The entire block will be skipped or not depending on whether the searched value appears in the block. The number of rows in each granule is defined by the &lt;code&gt;index_granularity&lt;/code&gt; setting of the table. Increasing the granularity would make the index lookup faster, but more data might need to be read because fewer blocks will be skipped.&lt;/p&gt;

&lt;p&gt;We also need to estimate the number of tokens in each granule of data. In our case, the number of tokens corresponds to the number of distinct path segments. Then we can use a &lt;a href="https://hur.st/bloomfilter"&gt;bloom filter calculator&lt;/a&gt;: after fixing n (the number of token values), p (the false positive rate) and k (the number of hash functions), it gives us the required size of the bloom filter.&lt;/p&gt;

&lt;p&gt;The index can be created on a column or on an expression if we apply some functions to the column in the query. In our case, searching for HTTP URLs is not case sensitive, so we have created the index on &lt;code&gt;lowerUTF8(http_url)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The final index creation statement looks something like this:&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ADD INDEX IF NOT EXISTS tokenbf_http_url_index lowerUTF8(http_url) TYPE tokenbf_v1(10240, 3, 0) GRANULARITY 4&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;
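&lt;p&gt;For completeness, here is the same statement in the context of a full &lt;code&gt;ALTER TABLE&lt;/code&gt; (the table name is assumed), together with a query shape that can trigger the index; the filter has to use the same &lt;code&gt;lowerUTF8(http_url)&lt;/code&gt; expression and contain at least one complete token:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALTER TABLE calls
    ADD INDEX IF NOT EXISTS tokenbf_http_url_index lowerUTF8(http_url)
    TYPE tokenbf_v1(10240, 3, 0) GRANULARITY 4;

-- 'api' and 'product' are complete tokens, so the index can skip blocks:
SELECT count() AS call_count
FROM calls
WHERE lowerUTF8(http_url) LIKE '%/api/product/%';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;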

&lt;h2&gt;
  
  
  Result
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Index size&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The size of the &lt;code&gt;tokenbf_v1&lt;/code&gt; index before compression can be calculated as follows:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;index_size = bloom_filter_size x number_of_blocks&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;number_of_blocks = number_of_rows / (table_index_granularity * tokenbf_index_granularity)&lt;/code&gt;&lt;/p&gt;
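&lt;p&gt;As a back-of-the-envelope example (the one billion rows are an assumption; 8192 is the Clickhouse default &lt;code&gt;index_granularity&lt;/code&gt;, and the 10240-byte bloom filter and &lt;code&gt;GRANULARITY 4&lt;/code&gt; come from the statement above), the arithmetic can even be done in Clickhouse itself:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    ceil(1000000000 / (8192 * 4))         AS number_of_blocks, -- ~30518 blocks
    ceil(1000000000 / (8192 * 4)) * 10240 AS index_size_bytes  -- ~312 MB before compression
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;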

&lt;p&gt;You can check the size of the index file in the directory of the partition in the file system. The file is named &lt;code&gt;skp_idx_{index_name}.idx&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In our case, the size of the index on the HTTP URL column is only 0.1% of the disk size of all data in that partition.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Query speed&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The query speed depends on two factors: the index lookup and how many blocks can be skipped thanks to the index.&lt;/p&gt;

&lt;p&gt;According to our testing, the index lookup time is not negligible: it can take up to a few seconds on our dataset if, for example, the index granularity is set to 1. We decided to set the index granularity to 4, which brings the lookup time down to within a second on our dataset.&lt;/p&gt;

&lt;p&gt;The number of blocks that can be skipped depends on how frequently the searched data occurs and how it’s distributed in the table. Our calls table is sorted by timestamp, so if the searched call occurs very regularly in almost every block, then we will barely see any performance improvement because no data is skipped. On the contrary, if the call matching the query only appears in a few blocks, a very small amount of data needs to be read which makes the query much faster.&lt;/p&gt;

&lt;h2&gt;
  
  
  Optimize filtering on HTTP header
&lt;/h2&gt;

&lt;p&gt;Now that we’ve looked at how to use Clickhouse data skipping index to optimize query filtering on a simple String tag with high cardinality, let’s examine how to optimize filtering on HTTP header, which is a more advanced tag consisting of both a key and a value.&lt;/p&gt;

&lt;p&gt;In Clickhouse, key value pair tags are stored in 2 &lt;code&gt;Array(LowCardinality(String))&lt;/code&gt; columns. For example, given a call with &lt;code&gt;Accept=application/json&lt;/code&gt; and &lt;code&gt;User-Agent=Chrome&lt;/code&gt; headers, we store &lt;code&gt;[Accept, User-Agent]&lt;/code&gt; in the &lt;code&gt;http_headers.key&lt;/code&gt; column and &lt;code&gt;[application/json, Chrome]&lt;/code&gt; in the &lt;code&gt;http_headers.value&lt;/code&gt; column.&lt;/p&gt;

&lt;p&gt;When filtering by a key value pair tag, the key must be specified, and we support filtering the value with different operators such as &lt;code&gt;EQUALS&lt;/code&gt;, &lt;code&gt;CONTAINS&lt;/code&gt; or &lt;code&gt;STARTS_WITH&lt;/code&gt;, e.g. &lt;code&gt;call.http.headers.Accept EQUALS application/json&lt;/code&gt;. This filter is translated into the Clickhouse expression:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;arrayExists((k, v) -&amp;gt; lowerUTF8(k) = 'accept' AND lowerUTF8(v) = 'application/json', http_headers.key, http_headers.value)&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Choose and configure the index
&lt;/h2&gt;

&lt;p&gt;We can add indexes to both the key and the value column. &lt;code&gt;tokenbf_v1&lt;/code&gt; and &lt;code&gt;ngrambf_v1&lt;/code&gt; indexes do not support Array columns, so the &lt;code&gt;bloom_filter&lt;/code&gt; index looks to be the best candidate, since it supports array functions such as &lt;code&gt;IN&lt;/code&gt; or &lt;code&gt;has&lt;/code&gt;. The limitation of the &lt;code&gt;bloom_filter&lt;/code&gt; index is that it only supports filtering values with the &lt;code&gt;EQUALS&lt;/code&gt; operator, which matches a complete String.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;bloom_filter&lt;/code&gt; index requires less configuration. Its only parameter, &lt;code&gt;false_positive&lt;/code&gt;, is optional and defaults to 0.025. Reducing the false positive rate will increase the bloom filter size.&lt;/p&gt;

&lt;p&gt;Since the filtering on key value pair tags is also case insensitive, the indexes are created on the lower-cased value expressions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ADD INDEX bloom_filter_http_headers_key_index arrayMap(v -&amp;gt; lowerUTF8(v), http_headers.key) TYPE bloom_filter GRANULARITY 4,

ADD INDEX bloom_filter_http_headers_value_index arrayMap(v -&amp;gt; lowerUTF8(v), http_headers.value) TYPE bloom_filter GRANULARITY 4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;/p&gt;

&lt;p&gt;The indexes will then be triggered when filtering with an expression such as &lt;code&gt;has(arrayMap((v) -&amp;gt; lowerUTF8(v), http_headers.key), 'accept')&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;When filtering on both key and value such as &lt;code&gt;call.http.header.accept=application/json&lt;/code&gt;, it would be more efficient to trigger the index on the value column because it has higher cardinality. The index on the key column can be used when filtering only on the key (e.g. &lt;code&gt;call.http.header.accept is present&lt;/code&gt;).&lt;/p&gt;
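&lt;p&gt;As an illustration (the table name and the literal values are assumed), a combined filter could look like the following; the &lt;code&gt;has(...)&lt;/code&gt; clause lets Clickhouse trigger the index on the value column, while &lt;code&gt;arrayExists(...)&lt;/code&gt; performs the exact key-value matching:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT count() AS call_count
FROM calls
WHERE arrayExists((k, v) -&amp;gt; lowerUTF8(k) = 'accept' AND lowerUTF8(v) = 'application/json',
                  http_headers.key, http_headers.value)
  AND has(arrayMap(v -&amp;gt; lowerUTF8(v), http_headers.value), 'application/json');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;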

&lt;h2&gt;
  
  
  Our Pragmatic Clickhouse Rollout
&lt;/h2&gt;

&lt;p&gt;Adding an index can be easily done with the &lt;code&gt;ALTER TABLE ADD INDEX&lt;/code&gt; statement. After the index is added, only new incoming data will get indexed. Clickhouse provides the &lt;code&gt;ALTER TABLE [db.]table MATERIALIZE INDEX name IN PARTITION partition_name&lt;/code&gt; statement to rebuild the index in an existing partition, but this would generate additional load on the cluster and might degrade the performance of writing and querying data. We decided not to do it and simply wait 7 days until all our calls data gets indexed.&lt;/p&gt;
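&lt;p&gt;For reference, rebuilding the index for one existing partition would look something like this (the partition name here is hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- partition name assumed for illustration
ALTER TABLE calls MATERIALIZE INDEX tokenbf_http_url_index IN PARTITION '20211001';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;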

&lt;h2&gt;
  
  
  Conclusion – Try (TRY) Index Skipping Yourself
&lt;/h2&gt;

&lt;p&gt;We have spent quite some time testing the best configuration for the data skipping indexes. But once we understand how they work and which one is more adapted to our data and use case, we can easily apply it to many other columns.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;bloom_filter&lt;/code&gt; index and its two variants &lt;code&gt;ngrambf_v1&lt;/code&gt; and &lt;code&gt;tokenbf_v1&lt;/code&gt; all have some limitations: they do not support filtering with all operators, and the performance improvement depends on how frequently the searched data occurs and how it is spread across the whole dataset, so it’s not guaranteed for all queries.&lt;/p&gt;

&lt;p&gt;Ultimately, I recommend you try the data skipping index yourself to improve the performance of your Clickhouse queries, especially since it’s relatively cheap to put in place. It only takes a bit more disk space depending on the configuration, and it could speed up the query by 4-5 times depending on the amount of data that can be skipped. BUT TEST IT to make sure that it works well for your own data. If it works for you – great! If not, pull it back or adjust the configuration. We also hope Clickhouse continuously improves these indexes and provides means to get more insights into their efficiency, for example by adding the index lookup time and the number of granules dropped to the query log.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>How We Handled Bugs in CockroachDB and JDBI</title>
      <dc:creator>Instana</dc:creator>
      <pubDate>Thu, 07 Oct 2021 06:32:33 +0000</pubDate>
      <link>https://dev.to/instanahq/how-we-handled-bugs-in-cockroachdb-and-jdbi-3e22</link>
      <guid>https://dev.to/instanahq/how-we-handled-bugs-in-cockroachdb-and-jdbi-3e22</guid>
      <description>&lt;p&gt;At Instana we appreciate, contribute, and extensively make use of open source projects like Kafka, ClickHouse, Elasticsearch, CockroachDB, and many others. In this post, I would like to share an outstanding team effort that led us to quickly detect, investigate, and remediate an issue that ended up with fixes in two different open source projects: CockroachDB and JDBI.&lt;/p&gt;

&lt;p&gt;Not surprisingly, we love metrics and logs at Instana. We also take several practices such as Automated Testing, Continuous Integration, and Continuous Delivery very seriously. As a result, we receive frequent feedback from every change in our code and infrastructure. When the issue I am writing about first happened in our test environments, the first signal we had was failing end-to-end tests due to a broken login at Instana.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Since the team I work on at Instana is responsible for authentication, authorization, and other security-related topics, we kicked off an investigation and quickly found the following error in our logs when trying to update a row in a table in CockroachDB:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;org.postgresql.util.PSQLException: ERROR: integer out of range for type int4&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Our initial thought was that we were updating a field with a number greater than the column’s capacity in the database, therefore, there were two possibilities:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The number sent from the application to CockroachDB is greater than the column capacity in the database due to a bug introduced in our code;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The number sent from the application to CockroachDB is fine, but we mistakenly set a column type in the database that is not big enough in certain scenarios.&lt;/p&gt;&lt;/li&gt;

&lt;p&gt;It turns out that both assumptions were wrong. The number was correct and the column type in CockroachDB was INT8 (64 bits), big enough to store it. It was then that we realized that fixing the issue would not be so straightforward.&lt;/p&gt;

&lt;h2&gt;
  
  
  Investigation
&lt;/h2&gt;

&lt;p&gt;We noticed that the CockroachDB version was bumped to 21.1.5 the day before in our test environments and the issue did not happen in production where the old CockroachDB version was still running. That was a good starting point for our investigation, and by checking the logs, it was clear that the issue started after updating CockroachDB:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Wd0nrvHD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8bxmw6b8khq0tdzzi7j2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Wd0nrvHD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8bxmw6b8khq0tdzzi7j2.png" alt="word-image-410"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The logs also showed a pattern where the exception was being raised every 10 minutes, and that helped us understand what triggered the issue so we could eventually write a test scenario to reproduce it.&lt;/p&gt;

&lt;p&gt;After some serious debugging, we confirmed that the issue was indeed in CockroachDB. It happened because the cached query plan for a prepared statement at some point received the wrong type for a column and kept using it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reproducing the issue
&lt;/h2&gt;

&lt;p&gt;To put it simply, consider a table in CockroachDB which contains a column C of type INT8 (64 bits). The scenario below would result in the error:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Set type INT8 and insert a number to C ✅&lt;/li&gt;
&lt;li&gt;Set type INT4 and update C with another number within the 
INT4 range (from -2147483648 to +2147483647) ✅&lt;/li&gt;
&lt;li&gt;Set type INT8 and update C with a number greater than INT4 capacity (&amp;gt; +2147483647) ✖ (fails, but should work since C is of type INT8)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Since we extensively use Testcontainers, coming up with an integration test reproducing the described scenario against CockroachDB 21.1.5 was not a problem:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Requires: java.sql.Connection, java.sql.PreparedStatement, java.sql.Statement,
// org.testcontainers.containers.CockroachContainer
int rowId = 1;
long numberGreaterThanMaxInt4Capacity = (long) Integer.MAX_VALUE + 1;

try (CockroachContainer cockroachContainer = new CockroachContainer("cockroachdb/cockroach:v21.1.5")) {
    cockroachContainer.start();

    Connection con = getConnection(cockroachContainer.getUsername(), cockroachContainer.getPassword(), cockroachContainer.getJdbcUrl());

    // The column is INT8 (64 bits), so it should hold any long value
    try (Statement createTable = con.createStatement()) {
        createTable.execute("CREATE DATABASE IF NOT EXISTS dbName;");
        createTable.execute("CREATE TABLE dbName.someNumber (id INT8, theNumber INT8);");
    }

    try (PreparedStatement firstInsert = con.prepareStatement("INSERT INTO dbName.someNumber (id, theNumber) VALUES (?, ?);")) {
        firstInsert.setLong(1, rowId);
        firstInsert.setLong(2, 100L);
        firstInsert.execute();
    }

    // Binding an int here makes CockroachDB cache a query plan that treats the column as INT4
    try (PreparedStatement firstUpdate = con.prepareStatement("UPDATE dbName.someNumber SET theNumber = ? WHERE id = ?;")) {
        int myNumber = 1234;
        firstUpdate.setInt(1, myNumber);
        firstUpdate.setLong(2, rowId);
        firstUpdate.execute();
    }

    // The cached plan is reused, so a perfectly valid INT8 value is rejected
    try (PreparedStatement secondUpdate = con.prepareStatement("UPDATE dbName.someNumber SET theNumber = ? WHERE id = ?;")) {
        secondUpdate.setLong(1, numberGreaterThanMaxInt4Capacity);
        secondUpdate.setLong(2, rowId);
        secondUpdate.execute(); // throws org.postgresql.util.PSQLException: ERROR: integer out of range for type int4
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not surprisingly, if &lt;code&gt;myNumber&lt;/code&gt; changed from &lt;code&gt;int&lt;/code&gt; to &lt;code&gt;short&lt;/code&gt; and &lt;code&gt;firstUpdate.setInt(1, myNumber)&lt;/code&gt; to &lt;code&gt;firstUpdate.setShort(1, myNumber)&lt;/code&gt;, the second update would raise the same exception, but now complaining about INT2 (16 bits), meaning that CockroachDB reused the query plan from the first update:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;org.postgresql.util.PSQLException: ERROR: integer out of range for type int2&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;On the other hand, if &lt;code&gt;myNumber&lt;/code&gt; changed from &lt;code&gt;int&lt;/code&gt; to &lt;code&gt;long&lt;/code&gt; and &lt;code&gt;firstUpdate.setInt(1, myNumber)&lt;/code&gt; to &lt;code&gt;firstUpdate.setLong(1, myNumber)&lt;/code&gt;, this issue would never happen in CockroachDB (although the bug would still exist).&lt;/p&gt;

&lt;h2&gt;
  
  
  Fixing the issue
&lt;/h2&gt;

&lt;p&gt;With that information in hand, we reported an issue to the CockroachDB GitHub repository and the CockroachDB team promptly took it over and came up with a fix (big thanks to the CockroachDB team!). The fix is available on CockroachDB version 21.1.7.&lt;/p&gt;

&lt;p&gt;However, a question was still open: this issue would never have happened if a &lt;code&gt;long&lt;/code&gt; type had been sent in the SQL updates. So, did that mean the application was mistakenly sending CockroachDB updates with type &lt;code&gt;int&lt;/code&gt; (32 bits) where &lt;code&gt;long&lt;/code&gt; (64 bits) was expected?&lt;/p&gt;

&lt;p&gt;To confirm that this was the case, we tracked down the network packets being sent over the wire from the application to CockroachDB and concluded that an integer type was indeed being sent where a long type was expected. Diving deep into the Postgres JDBC driver and JDBI code, we found a bug in JDBI where &lt;code&gt;long&lt;/code&gt; and &lt;code&gt;Long&lt;/code&gt; types were mistakenly being mapped to &lt;code&gt;int&lt;/code&gt; and &lt;code&gt;Integer&lt;/code&gt; respectively, so we created a pull request in the JDBI GitHub repository fixing it. The JDBI team quickly reviewed and accepted it (big thanks to the JDBI team too!). The fix is available in JDBI version 3.21.0.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;By detecting the issue in its early stages in our test environments, we were able to quickly fix it and prevent it from affecting our customers. The way our entire investigation was conducted highlighted the outstanding teamwork and cooperation within Instana, and once again showed that collaboration is key to achieve success – both within individual companies and within the open source community. A big thanks to everyone involved, including CockroachDB and JDBI teams!&lt;/p&gt;

&lt;p&gt;If you feel excited about how we solve problems and also love metrics and logs, we are hiring!&lt;/p&gt;

&lt;p&gt;My Twitter: &lt;a href="https://twitter.com/jorgeacetozi"&gt;https://twitter.com/jorgeacetozi&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;My LinkedIn: &lt;a href="https://www.linkedin.com/in/jorgeacetozi/"&gt;https://www.linkedin.com/in/jorgeacetozi/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Experience Instana for yourself in our guided demo sandbox environment: &lt;a href="https://www.instana.com/apm-observability-sandbox/"&gt;https://www.instana.com/apm-observability-sandbox/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>monitoring</category>
    </item>
    <item>
      <title>From Shipping to Scaling: How Goji Investments Masters Developer Experience Through Observability</title>
      <dc:creator>Instana</dc:creator>
      <pubDate>Wed, 09 Sep 2020 22:26:02 +0000</pubDate>
      <link>https://dev.to/instanahq/from-shipping-to-scaling-how-goji-investments-masters-developer-experience-through-observability-36l7</link>
      <guid>https://dev.to/instanahq/from-shipping-to-scaling-how-goji-investments-masters-developer-experience-through-observability-36l7</guid>
      <description>&lt;p&gt;This guest blog post was written by Dean Record, Engineer at Goji Investments.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.goji.investments/"&gt;Goji Investments&lt;/a&gt; launched in 2016. Our platform democratises access to real estate, business lending, renewables, and other alternative investments. It allows asset managers and investment firms to offer financial vehicles to global private investors seeking stability and better yields by looking beyond traditional equity markets. Our scalable modular platform is available as a white-label solution and also integrates into existing platforms using our API.&lt;/p&gt;

&lt;h3&gt;
  
  
  Creating an Excellent Developer Experience
&lt;/h3&gt;

&lt;p&gt;In the four years since we launched, Goji Investments’ developers have worked hard to help us expand our product offering and bring in new customers. In return, we’ve done everything to keep them happy, and that means supplying them with the right tools to help them transition from building our platform to scaling it.&lt;/p&gt;

&lt;p&gt;Our DevOps pipeline is a complex machine with many moving parts. We are continuously refining it and adding or subtracting tools based on developer feedback and our changing deliverables.&lt;/p&gt;

&lt;p&gt;We realised early on that the best way to ensure a superb end-user experience is to provide an outstanding developer experience. We help our people stay focused on performing at the highest level by automating repetitive low-level tasks, thus freeing them to work on features that add value to the company. To this end, we have fine-tuned our technology stack to give our eight-member developer team everything it needs to succeed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Automating Deployments Wasn’t Enough
&lt;/h3&gt;

&lt;p&gt;Our applications are written in a modern Java stack, and we are using Dropwizard as our framework. Our developers typically write code and then push it to GitHub for review. If it’s approved, the new code is merged into master, which automatically triggers a build pipeline using ThoughtWorks GoCD. We then run automated unit, integration, and acceptance tests and our developers also have the option to run manual tests to encourage exploratory testing.&lt;/p&gt;

&lt;p&gt;From there, we can promote new code into our pre-production environment, which mirrors our current production setup. Anything not covered in prior testing that makes its way into our production environment can be easily rolled back using the blue-green deployment functionality of HashiCorp Nomad, our container orchestration platform. We’re a small team, and our codebase is relatively compact, so we don’t need the complexity of Kubernetes GKE or EKS to deploy and manage our applications.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Combing through logs for application monitoring is a waste of valuable developer resources.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;One of the benefits of our technology stack is extensive automation. Although our deployment workflow was seamless, things were not as rosy when it came to monitoring. We were using New Relic to monitor our infrastructure. When an application was eating up too many CPU cycles or an EC2 instance crashed and rebooted, New Relic would trigger an alert. But if we wanted to troubleshoot slow response times, latency errors, and bottlenecks at the service level, we had to comb through logs manually, and that was a waste of time and human resources.&lt;/p&gt;

&lt;h3&gt;
  
  
  End-to-End Traceability
&lt;/h3&gt;

&lt;p&gt;The last thing you want to do with your day is analyse log files. You want to build the next killer feature, but if you do have to dive into the weeds, you want to do it as quickly and as painlessly as possible. That’s where observability and application performance monitoring (APM) come in. APM incorporates infrastructure monitoring but also provides service-level metrics, including information about database calls and end-user activity. It allows us to locate bottlenecks and other performance issues without having to divert our developers from their primary duties.&lt;/p&gt;
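
&lt;p&gt;To make the contrast concrete, here is a hedged sketch of what hand-rolled, service-level instrumentation looks like with the Dropwizard Metrics library; the class and metric names are invented for illustration. An APM tool collects equivalent timings automatically, without code like this scattered through every service.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;

// Illustrative only: timing a service call by hand with Dropwizard Metrics.
public class DividendService {
    private final MetricRegistry metrics = new MetricRegistry();
    private final Timer payoutTimer = metrics.timer("dividends.payout");

    public void distribute() {
        // Timer.Context is Closeable, so try-with-resources records the duration.
        try (Timer.Context ctx = payoutTimer.time()) {
            // ... database calls and payment logic would go here ...
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;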

&lt;p&gt;I’d used APM software at my previous position, and when it came time to adopt a platform at Goji Investments, I went with what I knew. Experience had shown me that Instana is a zero-touch solution that does everything we need straight out of the box. When you run 20 different microservices covering everything from taxes and payments to customer notifications, and you’re scaling to add even more, you need to optimise every element of your DevOps pipeline. Instana was the final piece in our puzzle, and it allowed us to automate one of the most cumbersome parts of the development cycle.&lt;/p&gt;

&lt;p&gt;Instana gives us full traceability from our front-end to our back-end. We can track user actions on our website and trace subsequent calls across multiple services. It automates alerts and can send an alarm to our developers’ Slack channel when it encounters an issue with CPU, container memory usage, or the JVM heap size.&lt;/p&gt;

&lt;p&gt;Instana also integrates with &lt;a href="https://www.instana.com/docs/ecosystem/humio/"&gt;Humio&lt;/a&gt;, a real-time logging solution we recently adopted. When we spot an issue in Humio, we can jump into &lt;a href="//www.instana.com"&gt;Instana&lt;/a&gt; and go directly to the Docker container or host that generated the error. We can also view AWS metrics in Instana and no longer have to pull information from the AWS console. Consolidating all of these functions saves time and simplifies troubleshooting.&lt;/p&gt;

&lt;h3&gt;
  
  
  Finding Our Bottlenecks
&lt;/h3&gt;

&lt;p&gt;When we’re testing a new build or new functionality, we can use Instana to view dozens of metrics in real time, and that is a complete game-changer. Recently, we developed a feature that distributes dividends to investors, and prior to release we wanted to run performance tests to ensure it would scale to meet the demands of our growing client base. Initially, we ran a performance test of several hundred thousand payments, and it took over 20 hours.&lt;/p&gt;

&lt;p&gt;During the performance tests as we iterated on our code, we found ourselves getting halfway through the process in five or ten minutes, hitting a bottleneck, and then taking 10 to 12 hours to complete the remainder.&lt;/p&gt;

&lt;p&gt;With Instana, we were able to see where the problem occurred. The instant Java profiling feature, &lt;a href="https://www.instana.com/blog/instana-announces-the-industrys-first-commercial-continuous-production-profiler/"&gt;AutoProfile™&lt;/a&gt;, showed us hotspots where the code was slow, and we determined that our database queries were the bottleneck. Once we knew where to look, we went back, fixed the code, and redeployed the application until we got it right.&lt;/p&gt;
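
&lt;p&gt;The fix itself is specific to our code, but the classic pattern for this kind of hotspot is to replace one database round trip per record with a batched statement. Below is a minimal, hypothetical sketch; the table and class names are invented for illustration.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import java.math.BigDecimal;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Hypothetical sketch: write all dividend payments in one batch
// instead of issuing a separate INSERT per investor.
public class DividendWriter {
    void writePayments(Connection conn, long[] investorIds, BigDecimal[] amounts)
            throws SQLException {
        String sql = "INSERT INTO payments (investor_id, amount) VALUES (?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (int i = 0; i &amp;lt; investorIds.length; i++) {
                ps.setLong(1, investorIds[i]);
                ps.setBigDecimal(2, amounts[i]);
                ps.addBatch();
            }
            ps.executeBatch(); // one round trip for the whole batch
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Whether the right fix is batching, an index, or a query rewrite depends on what the profiler shows; the point is that profiling told us where to spend the effort.&lt;/p&gt;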

&lt;p&gt;Instana also allowed us to compare test runs. We could pull up a previous trial run and compare its timings with those of the current version of the code. Using this iterative process, we streamlined the code, eliminated all the bottlenecks, and reduced the dividend processing time from over 20 hours to less than one hour.&lt;/p&gt;

&lt;p&gt;If we didn’t have Instana, it would have easily taken us four to five times longer to get this process right.&lt;/p&gt;

&lt;h3&gt;
  
  
  Giving Our Developers the Tools to Shine
&lt;/h3&gt;

&lt;p&gt;If there’s one thing developers love, it’s rising to the occasion. The best way to keep them happy is to give them the tools to contribute to your bottom line. Treat them like internal customers by relieving their pain points. Automate repetitive tasks and eliminate tedious, time-consuming work like going through error logs. Reward them with creative work and give them every opportunity to shine.&lt;/p&gt;

&lt;p&gt;Instana is helping Goji Investments streamline the DevOps process by giving us visibility into the inner workings of our applications and microservices. We can trace calls and database queries, and we can monitor hosts, Docker containers, and Java Virtual Machines in real time. All of this accelerates testing and deployment. We can do more, scale faster, and attract the world’s most talented developers.&lt;/p&gt;

&lt;p&gt;After all, the smartest people want to work for companies that challenge them and also give them the tools to shine.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Dean Record is an Engineer at Goji Investments. You can connect with him on &lt;a href="https://www.linkedin.com/in/dean-record/"&gt;LinkedIn&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>observability</category>
      <category>monitoring</category>
      <category>logging</category>
    </item>
  </channel>
</rss>
