<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Bryan Lee</title>
    <description>The latest articles on DEV Community by Bryan Lee (@kickingthetv).</description>
    <link>https://dev.to/kickingthetv</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F275279%2Fcf8d9d54-7ab4-4d32-94f4-62867d8d3a78.JPG</url>
      <title>DEV Community: Bryan Lee</title>
      <link>https://dev.to/kickingthetv</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kickingthetv"/>
    <language>en</language>
    <item>
      <title>Master is the new Prod, Devs are the new Ops</title>
      <dc:creator>Bryan Lee</dc:creator>
      <pubDate>Thu, 02 Apr 2020 19:23:32 +0000</pubDate>
      <link>https://dev.to/kickingthetv/master-is-the-new-prod-devs-are-the-new-ops-176b</link>
      <guid>https://dev.to/kickingthetv/master-is-the-new-prod-devs-are-the-new-ops-176b</guid>
      <description>&lt;p&gt;&lt;strong&gt;Written by &lt;a href="https://twitter.com/borja_burgos"&gt;Borja Burgos&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lz9NIP1h--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3660/1%2ARGbVXn8Ugr7OlnpQ7rnKGw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lz9NIP1h--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3660/1%2ARGbVXn8Ugr7OlnpQ7rnKGw.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Operational concerns of shipping software and keeping production up and running have been largely automated. Ultimately, even when things go wrong, modern-day monitoring and observability tools allow engineers to triage and fix nasty bugs faster than ever before. In part due to this, engineering teams and their budgets are shifting left. With the fear of downtime in the rearview mirror and operational challenges largely at bay, businesses are investing heavily in development to increase the quality and speed at which business value is delivered. The biggest bottleneck? Your failing master branch.&lt;/p&gt;

&lt;h2&gt;The Ops Wayback Machine&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;The year is 2004. It’s time for deployment. You’re confident about the software you’ve written. You clear your mind, get one last sip of coffee, and get ready to deploy. Before you proceed, you open up terminals, many terminals, and tail every log file on every server that could possibly be affected. You have business metrics up and running on your second monitor. Next to it, there are infrastructure and application-level metrics. You hit the return key and proceed to deploy the latest release to a single server. Now you watch.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;You scan rapidly across every terminal on your 19” CRT monitors to identify patterns and look for discrepancies: are there any errors in the logs? Has the conversion rate changed? What’s the load on the server? Was there a change in disk or network I/O? You wait a few minutes; if everything looks good, you proceed with a second server, else you roll back to the previous release as fast as humanly possible, hoping that’ll fix whatever it is that broke.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;While the excerpt above may sound ridiculous in this day and age, that’s how many deployments were done in the good old days of the internet after the &lt;a href="https://en.wikipedia.org/wiki/Dot-com_bubble"&gt;Dot-com bubble&lt;/a&gt;. Then came automation, hand-in-hand with the proliferation of “DevOps” and “microservices”. These practices were born from the ever-increasing speed and complexity of the applications being developed, a consequence of businesses’ competitive desires to deliver more value to customers, faster. Businesses could no longer afford to ship new features every 6 months; a slow release cycle represented an existential threat.&lt;/p&gt;

&lt;p&gt;But businesses weren’t always concerned with shipping new features at all costs. If anything, that wasn’t even a priority back in the early 2000s. The biggest fear of any company running software on the internet has always been downtime (ok, maybe the second biggest fear, with a security breach being at the top). And for this reason, among others, ops teams have always had sizable budgets for tech companies to sell into. After all, anything that minimizes downtime is likely cheaper than downtime itself.&lt;/p&gt;

&lt;p&gt;Fast-forward to today and you’ll see a &lt;a href="https://www.youtube.com/watch?v=hZ1Rb9hC4JY"&gt;whole new world&lt;/a&gt;. A world in which high-performing engineering organizations run as close to a fully-automated operation as possible. While every team does things a little bit differently, the sequence mimics the following: the moment a developer’s code is merged into master, a continuous integration (CI) job is triggered to build and test the application; upon success, a continuous delivery process is triggered to deploy the application in production; oftentimes this automated deployment is done in a minimal fashion, just a few nodes at a time or what’s known as a &lt;a href="https://martinfowler.com/bliki/CanaryRelease.html"&gt;canary deployment or release&lt;/a&gt;; in the meantime, the system, equipped with the knowledge of thousands of previous deployments, automatically performs multi-dimensional pattern matching to ensure no regressions have been introduced; at a certain degree of confidence, the system proceeds to automatically update the remaining nodes; and if there are any issues during deployment, rollbacks are automated, of course.&lt;/p&gt;
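&lt;p&gt;The canary flow described above can be sketched in a few lines of Python (a minimal illustration, not a real pipeline; the &lt;code&gt;deploy&lt;/code&gt;, &lt;code&gt;rollback&lt;/code&gt;, and &lt;code&gt;error_rate&lt;/code&gt; helpers are hypothetical stand-ins for real deployment and monitoring calls):&lt;/p&gt;

```python
def deploy(node):
    """Stand-in for the real deployment step (e.g. rolling out a new image)."""
    print(f"deployed to {node}")

def rollback(nodes):
    """Stand-in for the automated rollback path."""
    print(f"rolled back {nodes}")

def error_rate(nodes):
    """Stand-in for a monitoring query (errors/requests over the canary window)."""
    return 0.005  # pretend the canary looks healthy

def canary_release(nodes, baseline=0.01, tolerance=1.5):
    """Deploy to a small canary group first; promote only if no regression is seen."""
    canary, rest = nodes[:1], nodes[1:]
    for n in canary:
        deploy(n)
    # Multi-dimensional pattern matching reduced to one metric for illustration.
    if error_rate(canary) > baseline * tolerance:
        rollback(canary)  # rollbacks are automated, of course
        return False
    for n in rest:        # confidence reached: update the remaining nodes
        deploy(n)
    return True

print(canary_release(["node-1", "node-2", "node-3"]))  # True
```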

&lt;p&gt;Automation doesn’t necessarily stop with the deployment and release of an application. More and more operational tasks are being automated. It’s now possible, and not uncommon, to automate redundant tasks, like rotating security keys or fixing vulnerabilities the moment a patch is made available. And it’s been many years since the introduction of automatic scaling to handle spikes in CPU or memory utilization, to improve response time and user-experience, or even to take advantage of discounted preemptible/spot instances from IaaS providers.&lt;/p&gt;

&lt;p&gt;But, as anyone who has ever had to manage systems will be quick to tell you: &lt;a href="https://medium.com/scopedev/testing-in-the-cloud-native-era-41f63a0e101b"&gt;no amount of testing&lt;/a&gt; or automation will ever guarantee the desired availability, reliability, and durability of web applications. Fortunately enough, for those times when shit ultimately hits the fan (and it undoubtedly will), there is monitoring to help us find the root cause and quickly fix it. Modern-day monitoring tooling — nowadays being marketed as “observability” — offers fast multi-dimensional (metrics, logs, traces) analysis with which to quickly debug those pesky unknown-unknowns that affect production systems. Thanks to &lt;em&gt;observability&lt;/em&gt;, in just a handful of minutes, high-performing engineering organizations can turn “degraded performance”, and other problematic production regressions, to business as usual.&lt;/p&gt;

&lt;p&gt;The amount and complexity of the data that an engineer can rapidly process using a modern observability tool is truly astounding. In just a few keystrokes you can identify the specific combination of device id and version number that results in that troublesome backend exception, or which API endpoint is slowest for requests originating from Kentucky and why, or identify the root cause behind that seemingly random spike in memory consumption that occurs on the third Sunday of every month.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8J2HESlM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://cdn-images-1.medium.com/max/7680/1%2ATvnYwCNVWqiXd1WkVWkNOQ.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8J2HESlM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://cdn-images-1.medium.com/max/7680/1%2ATvnYwCNVWqiXd1WkVWkNOQ.gif" alt="Using an observability service (Datadog shown) to narrow down on problematic transactions."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While this level of automation and visibility isn’t achieved overnight, we’re unequivocally headed this way. And it is the right path forward, as demonstrated by the adoption of these tools and processes by the world’s most innovative tech companies. And so, with operational concerns at ease, what’s next? Business value! And what’s the biggest bottleneck we’re currently facing? Failing to keep the master branch green (i.e. healthy/passing CI). Let me explain.&lt;/p&gt;

&lt;h2&gt;Keeping master green&lt;/h2&gt;

&lt;p&gt;In terms of development, testing, and continuous integration, the closest software engineering concept to “keeping production up” is “keeping master green”, which essentially means master is always deployable.&lt;/p&gt;

&lt;p&gt;This is likely to strike a chord with most software developers out there. It makes sense, after all; if teams are going to cut releases from master, then master must be ready to run in production. Unlike years ago, when releases were cut few and far between, the adoption of automation (CI) and DevOps practices has development teams shipping new software at a much faster rate. So fast, in fact, that high-performing engineering organizations take this practice to its extreme by automatically releasing any and every commit that gets merged to master — resulting in hundreds, sometimes even thousands, of production deployments on a daily basis. It’s quite impressive.&lt;/p&gt;

&lt;p&gt;But you might be left wondering: if you’re not continuously deploying master, and instead doing so once a week, or maybe once a month, why bother keeping master green? The short answer is: to increase developer productivity, drive business value, and decrease infrastructure costs. Even if you’re not shipping every commit to master, tracking down and rolling back a faulty change is a tedious and error-prone task that more often than not requires human intervention. And it doesn’t stop there: a red (broken) master branch introduces the following problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;A red master leads to delayed feature rollouts, which themselves lead to a delay or decrease in business value and potential monetary loss. Under the assumption that CI is functioning as expected, breaking master means that a faulty or buggy code commit needs to be detected, rolled back, and debugged. And for many companies, the cost of delaying the release of a new feature, or [security] patch, has a direct correlation to a decrease in revenue.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A broken (red) master branch has a cascading negative effect that also hurts developer productivity. In most engineering organizations, new features or bug fixes are likely to start as a branch of master. With developers branching off a broken master branch, they might experience local build and test failures or end up working on code that is later removed or modified when a commit is rolled back.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A broken/failing build is also money wasted. Yes, automated builds and tests are performed precisely to catch errors, many of which are impractical to catch any other way. But keep in mind that for every failed build, there’s at least another build (often more) that needs to run to ensure the rollback works. With engineering teams merging thousands of commits every day, build infrastructure costs can no longer be disregarded — at some organizations, CI infrastructure costs already exceed those of production infrastructure.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Convinced of the perils of a red master branch, you may ask yourself, what can I do to keep things green? There are many different strategies to reduce the number of times that master is broken, and when it breaks, how often it stays broken. From &lt;a href="https://cacm.acm.org/magazines/2016/7/204032-why-google-stores-billions-of-lines-of-code-in-a-single-repository/fulltext"&gt;Google’s presubmit infrastructure&lt;/a&gt; and &lt;a href="http://dev.chromium.org/developers/tree-sheriffs/sheriff-details-chromium-os"&gt;Chromium’s Sheriff&lt;/a&gt;, to &lt;a href="https://www.dropbox.com/s/w88eccvm2gthyfk/eurosys19uber.pdf?dl=0"&gt;Uber’s “evergreen” SubmitQueue&lt;/a&gt;, there’s no doubt that the world’s highest performing software organizations understand the benefits of keeping master green.&lt;/p&gt;

&lt;p&gt;For those that aren’t dealing with the scale of Google, Facebook, and others of their size, the most widely established and easily automated approach is to simply build and test branches before merging to master; easy, huh? Not really. While this often works for relatively simple, monolithic codebases, the approach falls short when it comes to testing the microservice applications of today. Given the inherent distributed complexity of microservice applications, and the rate at which developers change the codebase, it is often impractical (i.e. too many builds that would take too long) to run every build, let alone the full integration and end-to-end test suites, on every commit for every branch for every service in the system. Due to this limitation, branch builds are often limited in the scope of their testing. Unfortunately, after the changes are merged and the complete test suite is executed, this approach commonly results in test failures. And back to square one: master is red. So what can you do?&lt;/p&gt;

&lt;h2&gt;There’s been an outage in master&lt;/h2&gt;

&lt;p&gt;At Undefined Labs, we spend a lot of our time meeting and interviewing software engineering teams of all sizes and maturity levels. These teams are developing systems and applications with varying degrees of complexity, from the simplest mobile applications to some of the largest distributed systems servicing millions of customers worldwide. We’ve seen an interesting new pattern emerge across several software engineering organizations. This pattern can be summarized as:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“Treat a failure (i.e. automated CI build failure) in your master branch as you would treat a production outage.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is not only exciting but indicative of how engineering organizations continue shifting left, further and earlier into the development lifecycle.&lt;/p&gt;

&lt;p&gt;Most software teams already aim to “keep master green”, so what’s the difference? Traditionally, keeping master green has been on a best-effort basis, with master often failing over multiple builds across different commits, and days going by before returning to green.&lt;/p&gt;

&lt;p&gt;As engineering teams mature, it becomes even more urgent to fix master in a timely manner. Now, if master is red, it’s a development outage, and it needs to be addressed with the utmost urgency. The sense of urgency and responsibility that development teams have taken upon themselves to get things back to normal as quickly as possible is truly transformative. Given the business repercussions, associated costs, and productivity losses of a failing master branch, we expect broad adoption of this pattern across engineering teams of all sizes in the near future.&lt;/p&gt;

&lt;p&gt;Let’s take a look at some practical things you can do to keep master green:&lt;/p&gt;

&lt;h3&gt;If you can’t measure it, you can’t improve it&lt;/h3&gt;

&lt;p&gt;This is true for many things, and it’s particularly true here. Before setting out on a journey to improve anything, one needs to be able to answer these questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;How long does it take us to fix master when it breaks?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How often do we break master?&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These questions may sound familiar. Change master to “production” and these questions have well-known acronyms in the operational world. For anyone that has hung around an SRE (site reliability engineer) long enough, the terms MTTR and MTBF come to mind.&lt;/p&gt;

&lt;p&gt;MTTR or mean time to repair/resolution is a measure of maintainability. In the context of this article, MTTR answers the question, how long does it take to fix master? MTTR starts a running clock the moment a build fails and only stops when it’s fixed. As failures continue to occur, the MTTR is averaged over a period of time.&lt;/p&gt;
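&lt;p&gt;As a concrete illustration, MTTR is just the average of the repair durations observed over a period (a hedged sketch; real tooling would pull these timestamps from your CI provider):&lt;/p&gt;

```python
from datetime import datetime, timedelta

def mttr(outages):
    """Mean time to repair: average of (fixed_at - broke_at) over past failures."""
    repairs = [fixed - broke for broke, fixed in outages]
    return sum(repairs, timedelta()) / len(repairs)

# Hypothetical build history: (build broke, build fixed) timestamp pairs.
outages = [
    (datetime(2020, 4, 1, 9, 0), datetime(2020, 4, 1, 9, 45)),   # 45 min red
    (datetime(2020, 4, 2, 14, 0), datetime(2020, 4, 2, 14, 15)), # 15 min red
]
print(mttr(outages))  # 0:30:00
```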

&lt;p&gt;MTBF, or mean time between failure, is a measure of reliability. In the context of this article, MTBF answers the question, how often do we break master? MTBF starts a running clock the moment a build goes from failing to passing and stops the next time a build fails. For example, for a team that breaks master once every week, the project’s MTBF will be approximately 7 days. As the project continues to experience failures, the MTBF is averaged over a period of time.&lt;/p&gt;
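&lt;p&gt;MTBF can be computed the same way, averaging the gap between each return to green and the next failure (again a simplified sketch over hypothetical CI timestamps):&lt;/p&gt;

```python
from datetime import datetime, timedelta

def mtbf(events):
    """Mean time between failures.

    events: (back_to_green_at, next_failure_at) pairs — the clock starts
    when a build goes from failing to passing and stops at the next failure.
    """
    gaps = [failed - green for green, failed in events]
    return sum(gaps, timedelta()) / len(gaps)

# A team that breaks master roughly once a week.
events = [
    (datetime(2020, 3, 1), datetime(2020, 3, 8)),   # green for 7 days
    (datetime(2020, 3, 8), datetime(2020, 3, 15)),  # green for 7 days
]
print(mtbf(events))  # 7 days, 0:00:00
```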

&lt;p&gt;Unfortunately, until now there has been no easy, automated way to monitor these metrics. Few CI providers/tools provide insights into related information, and even fewer provide an API with which to more accurately calculate these metrics over time.&lt;/p&gt;

&lt;p&gt;In Scope, we’re adding both MTTR and MTBF to our users’ service dashboards. That is, for every service for which Scope is configured, teams automatically see these values and their trends tracked over time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5Dqbld26--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/4600/1%2As1G3qmpuPdPnBW039sKNhA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5Dqbld26--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/4600/1%2As1G3qmpuPdPnBW039sKNhA.png" alt="Scope Service Dashboard showing MTTR &amp;amp; MTBF for Development."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now that you know what “normal” is, it’s time to figure out ways to improve MTTR and MTBF, in the never-ending quest to keep master evergreen.&lt;/p&gt;

&lt;h3&gt;MTTR — if you can’t debug it, you can’t fix it&lt;/h3&gt;

&lt;p&gt;MTTR is a measure of maintainability: how quickly a failure can be fixed. To fix a software problem, software engineers need to understand what the system was doing at the time of the error. To do this in production, SREs and developers rely on monitoring and observability tools to provide them with the information they need in the form of metrics, logs, traces, and more to understand the problem at hand. Once the problem is well understood, implementing a fix is usually the easy part.&lt;/p&gt;

&lt;p&gt;However, the contrast between production and development is quite stark. In CI, developers lack the detailed and information-rich dashboards of production. In CI, developers get a “build failed” notification, followed by a dump of the logs of the build. Here is an example output from a top CI provider:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OMjiX0T6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://cdn-images-1.medium.com/max/3976/1%2AaIMcqzY3Mqx8Fhi3Uz_FmA.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OMjiX0T6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://cdn-images-1.medium.com/max/3976/1%2AaIMcqzY3Mqx8Fhi3Uz_FmA.gif" alt="Example failure output from a CI provider."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The problem is evident. Whereas operationally we have an endless amount of data, visibility, and infinite cardinality to probe the system and quickly understand complex production issues, when it comes to CI, developers are left in the dark, with nothing more than a log dump.&lt;/p&gt;

&lt;p&gt;To deal with this problem, before moving to Scope, the organizations we’ve encountered had either built a custom solution, frustrated by the lack of options in the market, or were trying to jerry-rig their production monitoring tools to work for CI without much success. There are key differences between production (a long-lived system handling real transactions) and CI (short-lived, fast-changing environments running unit and integration tests) that make production tooling impractical in CI.&lt;/p&gt;

&lt;p&gt;Teams interested in reducing the mean time to resolution in development to increase productivity, reduce CI costs, and increase business value, ought to look no further. Scope provides low-level insights into builds, with visibility into each and every test, across commits and PRs. With distributed traces, logs, exceptions, performance tracking and much, much more, teams using Scope are able to cut down their MTTR by more than 90%.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--x9Q0JOsq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://cdn-images-1.medium.com/max/4988/1%2ACO-6xjKJeHGtPxuEN3qKhg.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--x9Q0JOsq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://cdn-images-1.medium.com/max/4988/1%2ACO-6xjKJeHGtPxuEN3qKhg.gif" alt="Debugging a flaky integration test using Scope."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;MTBF — a reliably flaky build&lt;/h3&gt;

&lt;p&gt;MTBF is a measure of reliability: how often master breaks. After meeting with countless teams, and reviewing their CI builds, the verdict was clear: the leading cause for build failures in today’s software development teams is flakiness. If something is flaky, by definition, it cannot be reliable. As such, to increase a project’s MTBF the best thing any engineering organization can do today is to improve how flakiness is managed in a test suite and its codebase.&lt;/p&gt;

&lt;p&gt;A lot has been written about flakiness. Most recently, Bryan Lee, from Undefined Labs, has written a &lt;a href="https://medium.com/scopedev/how-can-we-peacefully-co-exist-with-flaky-tests-3c8f94fba166"&gt;great primer on flakiness&lt;/a&gt;; you should read it. In that post, Bryan lists a clear set of patterns for successfully handling flakiness:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Identification of flaky tests&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Critical workflow ignores flaky tests&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Timely flaky test alerts routed to the right team or individual&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Flaky tests are fixed fast&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A public report of the flaky tests&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Dashboard to track progress&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Advanced: stability/reliability engine&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Advanced: quarantine workflow&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
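&lt;p&gt;The first pattern, identification, hinges on a simple observation: a test that both passes and fails on the same commit changed its outcome with no code change. A minimal detector might look like this (illustrative only; a real system would ingest results from CI history):&lt;/p&gt;

```python
from collections import defaultdict

def find_flaky_tests(runs):
    """runs: (commit_sha, test_name, passed) tuples from CI history.

    A test is flagged flaky if it has both passed and failed on the same
    commit, i.e. its outcome changed with no code change.
    """
    outcomes = defaultdict(set)
    for sha, test, passed in runs:
        outcomes[(sha, test)].add(passed)
    return sorted({test for (_, test), seen in outcomes.items() if len(seen) == 2})

runs = [
    ("abc123", "test_checkout", True),
    ("abc123", "test_checkout", False),  # same commit, different outcome -> flaky
    ("abc123", "test_login", True),
    ("abc123", "test_login", True),
]
print(find_flaky_tests(runs))  # ['test_checkout']
```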

&lt;p&gt;Due in part to the limited visibility into CI that most development teams have today, the way developers deal with flakiness is quite rudimentary. When a build fails and the developer suspects a flaky test is behind it, they simply hit the retry button. Again and again, until the build passes, at which point they can proceed with their job. Not only is this wasteful and costly from an infrastructure perspective, it’s also highly inefficient and unproductive, as developers may be blocked while a build is taking place.&lt;/p&gt;
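&lt;p&gt;A slightly better take on the retry button is to automate it while recording the flake instead of silently green-lighting the build. A toy sketch (the retry policy and bookkeeping here are assumptions, not any particular CI’s behavior):&lt;/p&gt;

```python
def run_with_retries(test_fn, retries=3):
    """Re-run a failing test; if it passes on a retry, report it as flaky
    rather than silently marking the build green."""
    flaked = False
    for _ in range(retries):
        if test_fn():
            return "flaky" if flaked else "passed"
        flaked = True  # at least one failure observed before a pass
    return "failed"

# A test that fails on its first execution, then passes.
attempts = {"n": 0}
def sometimes():
    attempts["n"] += 1
    return attempts["n"] >= 2

print(run_with_retries(sometimes))  # flaky
```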

&lt;p&gt;Other solutions, less rudimentary, but still largely ineffective, require manual intervention from developers to investigate builds in order to identify flakes; however, these flakes are not properly tracked, and rarely dealt with, given the overhead experienced by teams without testing visibility. Without the means to quarantine or exclude known flaky tests from builds in master, teams with flaky tests are quick to give up on any ambition to keep master green. This, of course, carries grave business consequences as the team’s ability to innovate and ship new features is hindered by their broken builds.&lt;/p&gt;

&lt;p&gt;While this may be the case for most, there are those high-performing engineering organizations that have built internal tooling to address these challenges. Google, Microsoft, Netflix, Dropbox, among others, have built custom solutions to deal with flakiness and minimize the frequency at which master is red. The problem? These solutions are custom-built and not readily available for the rest of us.&lt;/p&gt;

&lt;p&gt;To address this glaring problem, we’re adding flaky test management features right into Scope, starting with flaky test detection, automatic test retries, and a dashboard to track all flaky tests. In this dashboard, teams can track every flaky test, its flaky rate, the date and commit in which the test first exhibited flakiness, and its most recent flaky execution.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wN4e_-Qr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2880/1%2Ax9NO29gqCp3Ztjv2l7wBcA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wN4e_-Qr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2880/1%2Ax9NO29gqCp3Ztjv2l7wBcA.png" alt="Flaky Test Management in Scope."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One of the biggest struggles developers face when attempting to improve a very flaky codebase and test suite is: where do I even begin? While all flaky tests may seem like good candidates for fixing, prioritization is key when dealing with flaky test suites. If you’re actively experiencing flakiness across hundreds of tests, trying to fix them all at once is futile. Instead, development teams should prioritize the tests that experience the highest rate of flakiness and are also the slowest to execute. All else being equal, these are the tests with the most negative impact on your application and development processes.&lt;/p&gt;
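&lt;p&gt;One simple way to operationalize this prioritization is to score each test by flake rate times average duration and fix from the top down (an illustrative heuristic, not a prescribed formula):&lt;/p&gt;

```python
def prioritize(tests):
    """tests: dicts with name, flake_rate (0-1), avg_duration_s.

    Score = flake_rate * avg_duration_s, so tests that are both flakiest
    and slowest (the top-right quadrant) come first.
    """
    return sorted(tests,
                  key=lambda t: t["flake_rate"] * t["avg_duration_s"],
                  reverse=True)

tests = [
    {"name": "test_fast_flaky",  "flake_rate": 0.30, "avg_duration_s": 2},
    {"name": "test_slow_flaky",  "flake_rate": 0.20, "avg_duration_s": 120},
    {"name": "test_slow_stable", "flake_rate": 0.01, "avg_duration_s": 300},
]
print([t["name"] for t in prioritize(tests)])
# ['test_slow_flaky', 'test_slow_stable', 'test_fast_flaky']
```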

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MHlBTALY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2624/1%2AfermnUHfe1e-x9j4SlTKcA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MHlBTALY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2624/1%2AfermnUHfe1e-x9j4SlTKcA.png" alt="Scatterplot of flaky tests — top-right quadrant contains the most negatively impactful flaky tests."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Our journey to the left&lt;/h2&gt;

&lt;p&gt;At Undefined Labs we’re big believers in shifting left, and are always looking for innovative and transformative ways to improve application development. Treating master as production may still raise a few eyebrows in this day and age, but as shown in this article, there is clear business value in doing so: faster feature rollouts, increased developer productivity, and reduced costs. Similarly, once-reserved-for-production indicators like MTTR and MTBF feel right at home as part of the development process, and provide development teams with the responsibility and accountability they need to more efficiently run their operations.&lt;/p&gt;

&lt;p&gt;When building Scope, we often ask ourselves: what happens when you apply the modern, sophisticated tooling we have in production to problems in development? What happens when you close the feedback loop and use everything we know about our applications running in production to make more informed decisions during the development process? The possibilities are endless, and quite exciting! If this sounds interesting, make sure to follow along and stay tuned for more coming from us very soon!&lt;/p&gt;

&lt;p&gt;In the meantime, happy testing!&lt;/p&gt;




&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CVIX17lN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2ADC09ClmlhKC-Q5aHUNyaiQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CVIX17lN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2ADC09ClmlhKC-Q5aHUNyaiQ.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The way we build applications has drastically changed with the rise of DevOps, Microservices, and Cloud Native — but the pinnacle of developer testing has remained static: run every test for every commit in CI and get no visibility whatsoever when things go wrong.&lt;/p&gt;

&lt;p&gt;We’re building &lt;a href="https://scope.dev/?utm_source=dev.to"&gt;Scope&lt;/a&gt; to give teams a modern testing platform that provides a solution to the biggest pains in testing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Debugging unit and integration tests&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Identifying and managing flaky tests&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Reducing time spent testing by 50–90% (leading to a dramatic decrease in CI costs)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Detecting regressions before they reach production&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Keeping master green&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>Working at Undefined Labs</title>
      <dc:creator>Bryan Lee</dc:creator>
      <pubDate>Wed, 26 Feb 2020 20:53:19 +0000</pubDate>
      <link>https://dev.to/kickingthetv/working-at-undefined-labs-51b5</link>
      <guid>https://dev.to/kickingthetv/working-at-undefined-labs-51b5</guid>
      <description>&lt;p&gt;&lt;strong&gt;&lt;em&gt;Originally written by &lt;a href="https://twitter.com/JaVidalPe" rel="noopener noreferrer"&gt;Javier Vidal&lt;/a&gt;&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ACrDjd5Twu73fozSvN5Yocw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ACrDjd5Twu73fozSvN5Yocw.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you are planning to invest in or work with Undefined Labs, this blog post is for you. After almost a year working as a Frontend Engineer at Undefined Labs, I’m excited to share why it has been one of my best experiences working at a startup.&lt;/p&gt;

&lt;p&gt;This is my personal view of a promising startup inside the rich Spanish culture.&lt;/p&gt;

&lt;h2&gt;The Team&lt;/h2&gt;

&lt;p&gt;The team is led by &lt;a href="https://twitter.com/borja_burgos" rel="noopener noreferrer"&gt;@borjaburgos&lt;/a&gt; and &lt;a href="https://twitter.com/fernandomayo" rel="noopener noreferrer"&gt;@fernandomayo&lt;/a&gt;, very smart technologists with a clear vision of the future. &lt;a href="https://www.docker.com/blog/docker-acquires-tutum/" rel="noopener noreferrer"&gt;They already sold their previous startup, Tutum, to Docker&lt;/a&gt;, and since then they have been working in the same market, so few people in the world know the Docker ecosystem better than these two. They are always watching the industry, obsessed with converting ideas into business. And now they are trying to repeat the success of Tutum. Believe me, they know what they are doing.&lt;/p&gt;

&lt;p&gt;We are only eight engineers, all talented senior developers. Not having any junior or mid-level developers is a questionable decision, but I can say that it is working for us. We try to build as fast as possible: fewer people means less time spent communicating and organizing, so we can move and pivot quickly.&lt;/p&gt;

&lt;p&gt;Our strength is our differences. Except for the frontend team, which consists of three JavaScript developers, each of our engineers codes in a different language: .NET, Python, Java, Go, and Swift. This diversity in programming languages is required by the products we are building, but it is also a decision we made to support the multilingual systems of most engineering organizations. Since most of our developers work alone on their stack, the job requires a high degree of responsibility and ownership. And we don’t fall into the trap of arguing about which language or IDE is best. Brie Wolfson described it as Tupertine.&lt;/p&gt;



&lt;p&gt;Sometimes, having only one person resolving technical problems in each language can be painful, but it gives us considerable perspective when making strategic product decisions.&lt;/p&gt;

&lt;p&gt;And all of this was possible because of the hiring process. Fernando, the CTO, has tried very hard to find the best talent in Spain. He has a clear vision of the kind of players needed to form a perfect team, one that fits the way we do technology.&lt;/p&gt;

&lt;h2&gt;
  
  
  Software development
&lt;/h2&gt;

&lt;p&gt;Choosing technology is a vital decision, so we do it strategically. As we are building things from scratch, we have the freedom to choose whatever we want, with no dependencies or debt. But we also know that there are &lt;a href="https://en.wikipedia.org/wiki/There_are_known_knowns" rel="noopener noreferrer"&gt;unknown unknowns&lt;/a&gt; that could slow us down, so we always try to use proven technology. Our frontend is written in React, while our backend is in Python. We use Apollo for communications, so GraphQL is our daily language.&lt;/p&gt;

&lt;p&gt;While it’s easy to get overeager in our choices around technology, the team always grounds these decisions by holding the end-user experience as the leading priority. We always make decisions with the user in mind, careful not to add complexity that isn’t justified by the value delivered to the user. "Keep it simple" is our mantra. Unjustified complexity is forbidden.&lt;/p&gt;

&lt;p&gt;The benefit of building developer tools is that we are always our first users. Dogfooding is a widespread practice in the industry, and we embrace it. It increases the sense of ownership, as you feel the wins and pains of the product directly. It’s a common occurrence to see a colleague sharing a new dogfooding case in Slack, followed by questions and suggestions on how we could improve the experience.&lt;/p&gt;

&lt;p&gt;Another exciting prospect of building developer tools is the opportunity to impact how you and others in your profession will work. Few professions get to build the very tools they use on a day-to-day basis. It’s invigorating to have such a tight feedback loop, where every iteration of the product immediately enhances the way my coworkers and I work.&lt;/p&gt;

&lt;p&gt;Dream big, everything is possible. Fernando has taught — by doing — that no matter how far you are from the Bay Area, you can still out-compete technology incumbents. An inspirational ideal that gives us the confidence to compete in a demanding market.&lt;/p&gt;

&lt;h2&gt;
  
  
  The market
&lt;/h2&gt;

&lt;p&gt;We are developing tools for developers. The space for tooling is vast, so we are focusing on the CI/CD phases. Some of the hot topics at the office are observability, testing, and building.&lt;/p&gt;

&lt;p&gt;If you want to know more about the products we have already released, check out &lt;a href="https://undefinedlabs.com" rel="noopener noreferrer"&gt;https://undefinedlabs.com&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;When I joined Undefined Labs, I was skeptical about the market. I thought it was all about IDEs and plugins, which I find very boring. But the founders’ motivation surprised me, and after a couple of months, I understood why this market has so much potential.&lt;/p&gt;

&lt;p&gt;We are in a promising new market. As digital products evolve, so does the software behind them. The systems that power these products are getting more complex every day, so companies are spending more and more money developing, shipping, and maintaining them. At the same time, the DevOps culture is growing fast, and more and more developers are getting into the operations space. By addressing the needs of developers, your technology can reach millions of end-users.&lt;/p&gt;

&lt;p&gt;Many of the existing solutions at our disposal are brilliant, built by the best minds in the industry. You can learn a lot by just looking inside these new products and features. To have the opportunity to compete with these minds — and do it successfully — is a very humbling and fulfilling experience.&lt;/p&gt;

&lt;p&gt;Because the market is getting bigger, a lot of companies, from small startups to Fortune 500 giants, are entering it, and it can be exciting to be a part of this chaotic, fast-paced ecosystem. We usually watch the more significant industry conferences together, like &lt;a href="https://www.youtube.com/watch?v=9EoNqyxtSRM" rel="noopener noreferrer"&gt;GitHub Universe&lt;/a&gt; and &lt;a href="https://www.youtube.com/watch?v=7-31KgImGgU" rel="noopener noreferrer"&gt;AWS re:Invent&lt;/a&gt;, and pay close attention to how the announcements may affect us, either immediately or down the road. It is commonplace during these conferences for a big company to announce an acquisition or a new feature that renders the latest trendy tool in the market useless. The frenetic pace can seem crazy, but it gives you a sense of urgency. This adrenaline keeps us focused.&lt;/p&gt;

&lt;p&gt;Finally, our customers are smart people. Often, they are the innovators within their larger company. They know exactly what they need, so the feedback they give is invaluable. Also, we speak the same language as our customers, so &lt;a href="https://www.youtube.com/watch?v=mAiNdU1go1A&amp;amp;t=2791s" rel="noopener noreferrer"&gt;the gap between what we are building and why we are doing it&lt;/a&gt; is drastically reduced.&lt;/p&gt;

&lt;h2&gt;
  
  
  The environment
&lt;/h2&gt;

&lt;p&gt;At the time I’m writing this, there are ten of us in the company. Two of us are in the USA, handling meetings and looking for customers and partners. The developer team is in Madrid, the capital of Spain.&lt;/p&gt;

&lt;p&gt;VC investors from Silicon Valley back the company. The difference in the average salary between Silicon Valley and Madrid is significant, which allows us to be very money-efficient while still offering the best compensation in Spain. This allows us to enjoy and live in one of the best countries in the world.&lt;/p&gt;

&lt;p&gt;We also enjoy one of the best locations in Madrid. The office is located in the city center, near good restaurants and public transportation.&lt;/p&gt;

&lt;p&gt;Madrid is a vibrant city. Like other capitals, it is full of young people from all around Spain striving to leave their mark. It also draws influence from people arriving from Latin America. This makes Madrid a beautiful place to sample Spanish culture.&lt;/p&gt;

&lt;p&gt;We have the best weather in Europe, with clear skies and a temperate climate. The summer is hot, and the winter is cold, but it never snows. If you want to go to the beach, you can take a 3h train to Valencia or Barcelona. If you prefer the mountains, you have ski resorts less than 2h by car.&lt;/p&gt;

&lt;p&gt;We enjoy the city together. Some team members have lunch in the office while others go out daily, but on Fridays we all go out together for our team lunch.&lt;/p&gt;

&lt;p&gt;I’m looking forward to seeing what new products and ideas we bring to life. If you are also interested, follow us on &lt;a href="https://twitter.com/undefinedlabs" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt;, and keep an eye out for an opportunity to join us at &lt;a href="https://undefinedlabs.com/about-us/" rel="noopener noreferrer"&gt;https://undefinedlabs.com/about-us/&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F0%2AKCWB-TJGKvqLu_Od.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F0%2AKCWB-TJGKvqLu_Od.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Testing is a core competency to build great software. But testing has failed to keep up with the fundamental shift in how we build applications. Scope gives engineering teams production-level visibility on &lt;strong&gt;every test&lt;/strong&gt; for &lt;strong&gt;every app&lt;/strong&gt; — spanning mobile, monoliths, and microservices.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Your journey to better applications through better testing &lt;a href="https://scope.dev/?utm_source=dev.to"&gt;starts with Scope&lt;/a&gt;.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Testing Strategies for Modern Web Applications</title>
      <dc:creator>Bryan Lee</dc:creator>
      <pubDate>Fri, 21 Feb 2020 20:28:39 +0000</pubDate>
      <link>https://dev.to/kickingthetv/testing-strategies-for-modern-web-applications-16a5</link>
      <guid>https://dev.to/kickingthetv/testing-strategies-for-modern-web-applications-16a5</guid>
      <description>&lt;p&gt;&lt;strong&gt;Written by &lt;a href="https://twitter.com/soyguijarro"&gt;Ramón Guijarro&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--U3qSY_lZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AEFwDH-ld8g1iwNkAKcQKtw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--U3qSY_lZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AEFwDH-ld8g1iwNkAKcQKtw.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why frontend testing matters
&lt;/h2&gt;

&lt;p&gt;It’s no secret that websites are nowadays more complex than ever. The last decade has seen a big shift on the web: as user expectations have changed with the rise of the smartphone, we’ve effectively transitioned from web pages mostly based on HTML and CSS to web applications driven by tons of JavaScript code. Unlike traditional document-based websites, current webapps have rich interfaces that support heavy user interaction, &lt;a href="https://developer.mozilla.org/en-US/docs/Learn/JavaScript/Client-side_web_APIs/Fetching_data"&gt;async data fetching for partial content updates&lt;/a&gt;, &lt;a href="https://developer.mozilla.org/en-US/docs/Learn/JavaScript/Client-side_web_APIs/Client-side_storage"&gt;data caching&lt;/a&gt; and even &lt;a href="https://developer.mozilla.org/en-US/docs/Web/Progressive_web_apps/Offline_Service_workers"&gt;offline usage&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If only because of this complexity, we should be testing our web applications to make sure that they behave as expected. As developers, we want to &lt;a href="https://medium.com/scopedev/testing-in-the-cloud-native-era-41f63a0e101b"&gt;increase our confidence in the software we write&lt;/a&gt;, and that’s what tests provide. But making sure our webapps work is even more important if we think about the fact that they’re the entry point to our products. It can even be argued that your webapp &lt;strong&gt;is&lt;/strong&gt; your product since it’s the thing your users are actually using. When users think about your product, they think about your UI, and all they care about is accomplishing tasks through it. So how do we go about testing it?&lt;/p&gt;

&lt;h2&gt;
  
  
  Interpreting the test pyramid
&lt;/h2&gt;

&lt;p&gt;From a conceptual standpoint, besides picking specific testing tools and technologies, one of the first questions that arise is at which level we need to be testing and how many tests of each kind we should be writing. The classic &lt;a href="https://martinfowler.com/bliki/TestPyramid.html"&gt;test pyramid&lt;/a&gt; quickly comes to mind as an answer, but let’s see how it applies to modern component-based web applications — the kind built with libraries like &lt;a href="https://reactjs.org/"&gt;React&lt;/a&gt; or &lt;a href="https://vuejs.org/"&gt;Vue&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--k-AC6c9H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2A43POxwpBWejylBAb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--k-AC6c9H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2A43POxwpBWejylBAb.png" alt="Traditional testing pyramid, not practical for today’s applications (source: [https://martinfowler.com/articles/practical-test-pyramid.html](https://martinfowler.com/articles/practical-test-pyramid.html))"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The thin line between unit and integration
&lt;/h3&gt;

&lt;p&gt;Following the test pyramid will get us writing lots of unit tests, as those constitute its base. In our webapp context, they’re usually interpreted as testing a single component completely isolated from the rest of the tree, mocking out all of its dependencies and subcomponents. However, this kind of test doesn’t very accurately reflect how people will actually use our app.&lt;/p&gt;

&lt;p&gt;That last statement is especially true for those who believe that a test that writes to the DOM or receives user input cannot be considered a unit test but an integration test, because there is &lt;a href="https://en.wikipedia.org/wiki/Input/output"&gt;I/O&lt;/a&gt; involved or &lt;a href="https://en.wikipedia.org/wiki/Side_effect_%28computer_science%29"&gt;side effects&lt;/a&gt; in general. But while you can call it an integration test all you want, it’s probably not the kind of integration test the pyramid refers to, as it is not noticeably more expensive to run. Hence, the advice to write fewer of them arguably doesn’t apply.&lt;/p&gt;

&lt;p&gt;The technologies used for unit and integration tests of components are in fact usually the same — typically, a test runner like &lt;a href="https://jestjs.io/"&gt;Jest&lt;/a&gt; that uses an emulated &lt;a href="https://github.com/jsdom/jsdom"&gt;browser-like environment&lt;/a&gt; under the hood — so sometimes the distinction between them only comes down to who you’re asking.&lt;/p&gt;

&lt;h3&gt;
  
  
  End to end tests are fundamentally different
&lt;/h3&gt;

&lt;p&gt;On the other hand, end to end tests at the top of the pyramid much better reflect how users interact with the app. For web applications, these rely on tools that run your tests in an actual web browser, instead of an emulated DOM like the ones we just mentioned. This fact makes them conceptually different and forces you to test from the end user’s point of view. You can think of them as manual tests that are automated.&lt;/p&gt;

&lt;p&gt;Historically, end to end tests have been slow, prone to flakiness, and hard to debug. However, recent testing tools and frameworks, like &lt;a href="https://www.cypress.io/"&gt;Cypress&lt;/a&gt; or &lt;a href="https://pptr.dev/"&gt;Puppeteer&lt;/a&gt;, are improving all of these aspects to the point where some people are even advocating to invert the test pyramid altogether — something generally considered an antipattern — on the basis that we ought to be testing &lt;strong&gt;exactly&lt;/strong&gt; what the user is experiencing.&lt;/p&gt;

&lt;p&gt;I personally wouldn’t go as far as inverting the test pyramid, but what is definitely becoming clearer nowadays is that our webapps could benefit from raising the level at which you’d typically write tests; moving it above single isolated components. Let’s see why.&lt;/p&gt;

&lt;h2&gt;
  
  
  The case for higher-level tests
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why classic unit testing doesn’t cut it
&lt;/h3&gt;

&lt;p&gt;Testing our components in isolation and mocking everything around them is not only a poor reflection of their real-world usage — hence not providing that much value in terms of confidence — but it almost inevitably leads to coupling our tests with their implementation details. This can get particularly bad with libraries such as &lt;a href="https://airbnb.io/enzyme/"&gt;Enzyme&lt;/a&gt;, which lets developers select nodes based on component names, arbitrarily modify their internal state, and skip rendering of all children altogether with &lt;a href="https://airbnb.io/enzyme/docs/api/shallow.html"&gt;shallow rendering&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Under this approach, it’s common that making almost any change to a component will break its tests, even if its API stays the same. If end users or consumers of the component would not notice changes, why should tests fail? Also, fixing the tests will sometimes force you to basically rewrite them from scratch. This means that those tests will actually hinder your ability to refactor and will never be able to catch regressions. So what’s their value then?&lt;/p&gt;

&lt;h3&gt;
  
  
  Integration tests to the rescue
&lt;/h3&gt;

&lt;p&gt;A lot of components in our applications are meant to be working in conjunction with others to form a larger component, a certain screen, or a feature. Features are what users care about and how we should measure confidence in our app. So instead of testing the individual components at the leaves of the tree — what would be usually referred to as unit testing them — look for the higher-level components that constitute true units in terms of features and test these without mocking its children. You will cover the real use case, the tests will take less effort to write and maintain, and you will be able to refactor all the subcomponents without breaking multiple tests.&lt;/p&gt;

&lt;p&gt;To ensure that you’re testing the same way your users would use the app, it’s a good idea to rely on tools like &lt;a href="https://testing-library.com/"&gt;Testing Library&lt;/a&gt;, since it gives you utilities to query for nodes similarly to how users would find them. And because its queries are based on &lt;a href="https://developer.mozilla.org/en-US/docs/Web/Accessibility/ARIA/ARIA_Techniques"&gt;ARIA roles&lt;/a&gt;, testing with it will force you to improve the accessibility of your app as a bonus.&lt;/p&gt;

&lt;p&gt;This approach allows us to write tests that closely imitate real user interactions and are resilient to changes, just as end to end tests would do, which is exactly what we want. But they have the benefit of running much faster, since they’re not using a real browser.&lt;/p&gt;

&lt;h3&gt;
  
  
  The role of end to end tests
&lt;/h3&gt;

&lt;p&gt;So if we can get a similar level of confidence with integration tests, what are end to end tests good for then? An excellent use for them in our context is &lt;a href="https://en.wikipedia.org/wiki/Smoke_testing_%28software%29"&gt;smoke testing&lt;/a&gt;. Build processes of modern webapps have many moving pieces and involve sophisticated tools like &lt;a href="https://babeljs.io/"&gt;transpilers&lt;/a&gt;, &lt;a href="https://webpack.js.org/"&gt;bundlers&lt;/a&gt; or &lt;a href="https://developer.mozilla.org/en-US/docs/Glossary/Polyfill"&gt;polyfills&lt;/a&gt;, with non-trivial configurations that are often different for development and production. That means that your production build could fail whilst the development one is working fine. So a simple test that opens your webapp in a browser and checks that it loads actually gives you quite some value for little investment.&lt;/p&gt;

&lt;p&gt;Another use case for end to end tests is to cover the happy path of your most important user flows, exercising real APIs instead of having network requests mocked — either production ones, or in testing or staging environments. These tests might seem redundant, as well as more prone to exhibit flakiness since they hit real backend services, all on top of being slower. That’s why it’s advisable not to have a ton of them and maybe only run them before a deployment or as a nightly process. But they’re still relevant since they emulate the usage of your app in the most realistic way of all automated types of testing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Making responsible use of mocking
&lt;/h2&gt;

&lt;p&gt;We’ve discussed an approach to tests based on emulated DOM technology that tries to get as close as possible to the benefits of end to end browser-based tests, and we’ve seen how too much mocking goes against our goals in this area. But you will still need and want to mock some things in your non-browser tests. So let’s briefly discuss what to mock and how.&lt;/p&gt;

&lt;h3&gt;
  
  
  Global mechanisms
&lt;/h3&gt;

&lt;p&gt;You usually won’t want to render your whole app, but at the same time, you’ll want whatever global mechanisms you have in place to be available in your tests. This way, you can confidently rely on those in your code knowing that your tests won’t fail. For example, if you’re using React you will probably have some top-level &lt;a href="https://reactjs.org/docs/context.html#contextprovider"&gt;context providers&lt;/a&gt;; it can be a good idea to write mocked versions of them and mount them in all your tests.&lt;/p&gt;

&lt;p&gt;In a similar vein, you will want some functions used across your app to always be mocked. A typical example is date formatters: if you ever change the way you format dates in your app, you don’t want to have to modify assertions in every test under the sun. You can write unit tests for those functions to check that the formatting works as expected, and then have global mocks return a constant to ignore these in the rest of your tests. Testing frameworks like Jest allow you to &lt;a href="https://jestjs.io/docs/en/manual-mocks"&gt;define global mocks once&lt;/a&gt; for your own modules or third party dependencies. And you can always restore the original implementation for a particular test if you need to.&lt;/p&gt;
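
&lt;p&gt;As a sketch of that idea (the function names below are made up for illustration), the formatter has real behavior that its own unit tests can cover, while a constant-returning mock stands in for it everywhere else:&lt;/p&gt;

```typescript
// Hypothetical date formatter shared across the app. Its own unit tests
// verify the real formatting behavior.
function formatShortDate(date: Date): string {
  return date.toISOString().slice(0, 10); // e.g. "2020-01-02"
}

// Global mock used by every other test (with Jest, this would live in a
// __mocks__ manual mock). Returning a constant means no unrelated test
// breaks if the formatting rules ever change.
function mockFormatShortDate(_date: Date): string {
  return "2020-01-01";
}
```

&lt;p&gt;With Jest, placing the mocked version in a &lt;code&gt;__mocks__&lt;/code&gt; folder next to the module makes it the default in tests, and &lt;code&gt;jest.requireActual&lt;/code&gt; restores the real implementation when a specific test needs it.&lt;/p&gt;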

&lt;h3&gt;
  
  
  Network requests
&lt;/h3&gt;

&lt;p&gt;You will also want to mock network requests in your integration tests to ensure that they run fast and are not flaky. You can do that in exactly the same way, as long as all your requests eventually go through the same module. The key is to always encapsulate core functionality like this in reusable modules and consistently rely on them. This will not only make testing easier but improve the architecture of your application, avoiding duplication and reducing the risk of diverging implementations and duplicated errors.&lt;/p&gt;
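
&lt;p&gt;A minimal sketch of this encapsulation (the names are illustrative, not a specific library’s API): all requests go through one client module whose transport can be swapped for a stub in integration tests.&lt;/p&gt;

```typescript
// The transport abstracts over fetch; production code passes the real
// fetch, tests pass a stub, and no component ever issues a request itself.
type Transport = (url: string) => Promise<{ json(): Promise<unknown> }>;

function createApiClient(transport: Transport) {
  return {
    async getUser(id: string): Promise<unknown> {
      const res = await transport(`/api/users/${id}`);
      return res.json();
    },
  };
}

// A stubbed transport for tests: fast, deterministic, no network involved.
const stubTransport: Transport = async (url) => ({
  json: async () => ({ id: "42", requestedVia: url }),
});
```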

&lt;h3&gt;
  
  
  Specific components
&lt;/h3&gt;

&lt;p&gt;Finally, there are legitimate reasons to mock components in a particular test, besides top-level or global ones. For example, you might want to have actual unit tests for some core components. That’s fine, but instead of using techniques like &lt;a href="https://airbnb.io/enzyme/docs/api/shallow.html"&gt;shallow rendering&lt;/a&gt; to completely prevent rendering all child components — which might unknowingly hide errors from you — explicitly mock only what you need. Again, Jest makes it easy with &lt;a href="https://jestjs.io/docs/en/mock-functions#using-a-mock-function"&gt;mock functions&lt;/a&gt;.&lt;/p&gt;
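
&lt;p&gt;Conceptually, a mock function is just a callable that records its calls and returns a canned value. The sketch below illustrates the idea in a few lines (Jest’s &lt;code&gt;jest.fn()&lt;/code&gt; does this and much more):&lt;/p&gt;

```typescript
// Minimal stand-in for a mock function: records every call's arguments so
// a test can assert on them, and always returns the given canned value.
function createMockFn<R>(returnValue: R) {
  const calls: unknown[][] = [];
  const fn = Object.assign(
    (...args: unknown[]): R => {
      calls.push(args);
      return returnValue;
    },
    { calls },
  );
  return fn;
}
```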

&lt;h2&gt;
  
  
  Going beyond tests with static analysis
&lt;/h2&gt;

&lt;p&gt;The easiest tests to maintain are the ones that you don’t need to write. Static analysis tools can automatically catch a lot of bugs for us, effectively saving us from writing certain kinds of tests. These tests tend to be repetitive and cumbersome, so the help is even more welcome.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prevent common bugs
&lt;/h3&gt;

&lt;p&gt;A common source of bugs in modern web applications is the incorrect use of components by other consumers, like forgetting to set some mandatory property, or using a mismatching value for it — e.g., a string when a number is expected. This can be addressed with mechanisms like &lt;a href="https://reactjs.org/docs/typechecking-with-proptypes.html"&gt;PropTypes in React&lt;/a&gt;, but these only warn you in runtime during development, if the particular component happens to be mounted, and via the browser console — so you can miss the warning anyway. A better alternative is to declare types for properties and state of components with tools like &lt;a href="http://typescriptlang.org/"&gt;TypeScript&lt;/a&gt; or &lt;a href="https://flow.org/"&gt;Flow&lt;/a&gt;, that perform static checking of the types to ensure that these issues are all caught at build time. They will also support you while making changes in your components and allow you to refactor with greater confidence.&lt;/p&gt;

&lt;p&gt;Another relevant source of bugs in JavaScript applications is null pointer errors, such as &lt;a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Errors/Cant_access_property"&gt;cannot read property &lt;em&gt;x&lt;/em&gt; of undefined&lt;/a&gt; or &lt;a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Errors/Not_a_function"&gt;&lt;em&gt;x&lt;/em&gt; is not a function&lt;/a&gt;. Type checkers can also help prevent these errors, as they will for example statically check if a particular key that your code is trying to access actually exists in an object. They do this based on the &lt;a href="https://www.typescriptlang.org/docs/handbook/interfaces.html"&gt;object shapes you declare&lt;/a&gt;, and their value is multiplied thanks to &lt;a href="https://www.typescriptlang.org/docs/handbook/type-inference.html"&gt;type inference&lt;/a&gt; — so you don’t need to explicitly declare types all the time. Support for &lt;a href="https://www.typescriptlang.org/docs/handbook/advanced-types.html#nullable-types"&gt;nullable types&lt;/a&gt; completes the deal.&lt;/p&gt;
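
&lt;p&gt;For instance, in TypeScript with &lt;code&gt;strictNullChecks&lt;/code&gt; enabled, declaring a field as nullable makes unguarded access a compile-time error (the interface below is a made-up example):&lt;/p&gt;

```typescript
interface User {
  name: string;
  // Declared nullable: the compiler refuses `user.address.city`
  // unless the code checks for null first.
  address: { city: string } | null;
}

function cityOf(user: User): string {
  // Removing this guard would fail the build, not the user's session.
  return user.address ? user.address.city : "unknown";
}
```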

&lt;h3&gt;
  
  
  Define safe data models
&lt;/h3&gt;

&lt;p&gt;We can leverage this ability to declare our own types to make our applications safer in some other ways. For example, we can write factory functions that receive data fetched from our backend APIs and process and sanitize it, returning objects safe to use in our components — what we’d call &lt;a href="https://en.wikipedia.org/wiki/Data_model"&gt;models&lt;/a&gt;. And we can define custom types for the shape of these models and export them to use as prop types of our components. This can help us gracefully handle mistakes in backend responses so that our UI doesn’t break, while still being able to log custom errors to a monitoring service like &lt;a href="http://sentry.io/"&gt;Sentry&lt;/a&gt; to be aware of the issues.&lt;/p&gt;
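
&lt;p&gt;A sketch of such a factory (the model and its fields are invented for illustration): it coerces and defaults every field, so a malformed backend response yields a usable object instead of a broken UI.&lt;/p&gt;

```typescript
interface ProjectModel {
  id: string;
  name: string;
  tags: string[];
}

// Factory that turns a raw, untrusted API payload into a safe model.
function toProjectModel(raw: any): ProjectModel {
  // In real code, unexpected shapes could also be logged to a monitoring
  // service (e.g. Sentry) at this point.
  return {
    id: String(raw?.id ?? ""),
    name: typeof raw?.name === "string" ? raw.name : "Untitled",
    tags: Array.isArray(raw?.tags)
      ? raw.tags.filter((t: unknown): t is string => typeof t === "string")
      : [],
  };
}
```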

&lt;p&gt;We can go even further down this road by using tools that automatically generate type definitions for our API responses. This allows us to use the generated types directly in our components without any extra work, providing a solid safety layer. Since &lt;a href="https://graphql.org/"&gt;GraphQL&lt;/a&gt; has a type system at its core, APIs based on it are particularly well-suited for this. Tools like &lt;a href="https://graphql-code-generator.com/"&gt;GraphQL Code Generator&lt;/a&gt; can generate TypeScript and Flow types from a GraphQL schema, and next-generation systems like &lt;a href="https://www.prisma.io/with-graphql"&gt;Prisma&lt;/a&gt; go one step further and generate them from the database itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  In conclusion
&lt;/h2&gt;

&lt;p&gt;Writing tests for our web frontend code is important because it’s the part of our product that our users will be directly interacting with. We want to simulate that interaction as faithfully as possible, so browser-based tests would be the way to go, but they’re too slow to use at scale. However, we can get pretty close with non-browser-based tests if we write them at a high enough level, use selectors based on accessibility roles, and make sensible use of mocking. Adding a static type checker to the mix will further increase our confidence and prevent some common bugs.&lt;/p&gt;

&lt;p&gt;So go ahead and write some tests for your webapp. Your users will unknowingly thank you.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1GgsbC9t--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2AsuyXMYjgxQ315pyQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1GgsbC9t--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2AsuyXMYjgxQ315pyQ.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Testing is a core competency to build great software. But testing has failed to keep up with the fundamental shift in how we build applications. Scope gives engineering teams production-level visibility on &lt;strong&gt;every test&lt;/strong&gt; for &lt;strong&gt;every app&lt;/strong&gt; — spanning mobile, monoliths, and microservices.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Your journey to better applications through better testing &lt;a href="https://scope.dev/?utm_source=dev.to"&gt;starts with Scope&lt;/a&gt;.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Introduction to Profiling and Optimizing SQL Queries for Software Engineers</title>
      <dc:creator>Bryan Lee</dc:creator>
      <pubDate>Mon, 03 Feb 2020 20:58:37 +0000</pubDate>
      <link>https://dev.to/kickingthetv/introduction-to-profiling-and-optimizing-sql-queries-for-software-engineers-4dpo</link>
      <guid>https://dev.to/kickingthetv/introduction-to-profiling-and-optimizing-sql-queries-for-software-engineers-4dpo</guid>
      <description>&lt;p&gt;&lt;strong&gt;Written by &lt;a href="https://github.com/AdrianLC"&gt;Adrián López Calvo&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XiPQvlGT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/5180/1%2A8jwkhnvlkvOIVoDZU21VNg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XiPQvlGT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/5180/1%2A8jwkhnvlkvOIVoDZU21VNg.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Nowadays it is quite uncommon to have dedicated &lt;a href="https://en.wikipedia.org/wiki/Database_administrator"&gt;&lt;em&gt;Database Admins&lt;/em&gt; (DBAs)&lt;/a&gt; on application development teams. Whether it’s the adoption of microservices architectures, cloud, or DevOps processes that is to blame, more and more development teams are now responsible for their databases. Now more than ever, solid database skills are indispensable on your path to becoming a proficient programmer.&lt;/p&gt;

&lt;p&gt;Response time is a key indicator of the performance of any application. Database optimization can be one of the fastest ways to give a speed boost to your application, website, or API. Learning the basics of profiling and optimizing SQL queries is actually not as hard as it might seem, and you should not feel intimidated by it. In this post, I will explain the general steps I followed in a real-world scenario I ran into recently while working on &lt;a href="http://scope.dev"&gt;Scope&lt;/a&gt;. Armed with the skills described in this post, you’ll be optimizing your own database queries in no time!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: our main database is PostgreSQL and most of our backend stack is written in Python. While the concepts I cover should also be applicable to MySQL or other SQL databases, please bear in mind that all practical examples in this article will be for PostgreSQL. There will also be some recommendations or examples specific to Python.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;Finding slow queries to profile&lt;/h2&gt;

&lt;p&gt;First things first: before jumping in to profile any random query, we need to find a good candidate. Slow queries are usually the best place to start, but profiling can be used for much more than speeding up slow queries. If a query appears to behave erratically, or otherwise returns unexpected data, profiling will help you determine the reason for the abnormal behavior.&lt;/p&gt;

&lt;p&gt;There are many ways to find slow queries, from client-side instrumentation and automatic application-level monitoring to server-side configuration and server logs. Each has its pros and cons. Let’s take a look at some of our options:&lt;/p&gt;

&lt;h2&gt;Database Slow Query Log&lt;/h2&gt;

&lt;p&gt;Most SQL databases provide a &lt;a href="https://wiki.postgresql.org/wiki/Logging_Difficult_Queries"&gt;Slow Query Log&lt;/a&gt;, a log where the database server registers all queries that exceed a given threshold of execution time. With it enabled and configured, the database will automatically output warning messages and query information for all slow queries.&lt;/p&gt;
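&lt;p&gt;In PostgreSQL, for example, this behavior is controlled by the &lt;em&gt;log_min_duration_statement&lt;/em&gt; setting; the 500 ms threshold below is just an illustrative value:&lt;/p&gt;

```sql
-- Log every statement that takes longer than 500 ms (example threshold).
-- ALTER SYSTEM requires superuser privileges on the database.
ALTER SYSTEM SET log_min_duration_statement = 500;
SELECT pg_reload_conf();
```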

&lt;p&gt;There is no prescriptive way to access these logs, and most of the usual suspects will work here. You could use &lt;em&gt;grep&lt;/em&gt;, for example. But you can also get more sophisticated, and have a simple script that parses the contents of the file and notifies you via Slack.&lt;/p&gt;
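&lt;p&gt;As an illustration of that idea, here is a minimal Python sketch that scans a PostgreSQL log for slow statements; it assumes the default duration-prefixed log lines, and the threshold and any notification logic (Slack, email, etc.) are up to you:&lt;/p&gt;

```python
import re

# PostgreSQL prefixes timed statements with "duration: N ms" when
# log_min_duration_statement is enabled.
DURATION_RE = re.compile(r"duration: (\d+\.?\d*) ms\s+statement: (.*)")

def find_slow_queries(log_lines, threshold_ms=1000.0):
    """Yield (duration_ms, statement) pairs slower than threshold_ms."""
    for line in log_lines:
        match = DURATION_RE.search(line)
        if match and float(match.group(1)) >= threshold_ms:
            yield float(match.group(1)), match.group(2)
```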

&lt;p&gt;The Slow Query Log method, while somewhat rudimentary, does have an advantage. With this method you’re getting the information directly from the source of truth, thus eliminating the need for any third-party system or component. The downside? You may experience a &lt;a href="http://blog.symedia.pl/2016/10/performance-impact-general-slow-query-log.html"&gt;small performance hit&lt;/a&gt;, originating from the database server having to time every query. Furthermore, keep in mind that enabling this feature does require admin privileges in the database.&lt;/p&gt;

&lt;h2&gt;Middleware and Application Logs&lt;/h2&gt;

&lt;p&gt;Other alternatives are application-side middlewares and application-level logging. There are many ready-made solutions for this approach. For example, a popular choice for Python and Django is the &lt;a href="https://github.com/jazzband/django-debug-toolbar"&gt;&lt;em&gt;django-debug-toolbar&lt;/em&gt;&lt;/a&gt;. Alternatively, you can enable &lt;a href="https://docs.djangoproject.com/en/3.0/topics/logging/#django-db-backends"&gt;logging directly from the ORM&lt;/a&gt;. Simpler still, you can add logging to the code that interacts directly with the database and print or log the latency of each query. You can even mimic the behavior of the Slow Query Log and only log the queries exceeding a given execution-time threshold.&lt;/p&gt;
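&lt;p&gt;As a sketch of that last idea, a small wrapper around a DB-API cursor (the class name and threshold are hypothetical) could log only the slow statements:&lt;/p&gt;

```python
import logging
import time

logger = logging.getLogger("slow_queries")

class TimedCursor:
    """Wrap a DB-API cursor and log queries slower than a threshold."""

    def __init__(self, cursor, threshold_ms=500.0):
        self._cursor = cursor
        self._threshold_ms = threshold_ms

    def execute(self, sql, params=()):
        start = time.perf_counter()
        try:
            return self._cursor.execute(sql, params)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            if elapsed_ms >= self._threshold_ms:
                # Mimics the server-side Slow Query Log, but client-side:
                # the measured time also includes network transfer.
                logger.warning("slow query (%.1f ms): %s", elapsed_ms, sql)

    def __getattr__(self, name):
        # Delegate everything else (fetchall, close, ...) to the real cursor.
        return getattr(self._cursor, name)
```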

&lt;p&gt;Since these are client-side solutions, latencies include networking times. This can be advantageous. Oftentimes a slow query isn’t slow because the database takes too long to resolve it; instead, the slowness is due to the size of the payload. With large payloads, the bottleneck may actually be the data transfer between the database and the application server. This is more likely to occur while using an ORM, as it may default to fetching all the columns from the database table, even when not all of them are required.&lt;/p&gt;
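&lt;p&gt;For instance, being explicit about the columns you need instead of relying on the ORM default can shrink the payload considerably (table and column names are illustrative):&lt;/p&gt;

```sql
-- ORM default: every column, including wide varchar fields
SELECT * FROM scope_testexecution;

-- Explicit column list: a much smaller payload over the wire
SELECT id, agent_id FROM scope_testexecution;
```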

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uG_9GQWP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2A2x79sH2fxOAh72YV" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uG_9GQWP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2A2x79sH2fxOAh72YV" alt="Screenshot from *django-debug-toolbar, *which adds a debugging panel to your website displaying all the SQL queries with latency information. Really useful when you need to profile queries on a specific page from your website!"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;APM and Distributed Tracing&lt;/h2&gt;

&lt;p&gt;Last on this list is APM (Application Performance Management) and Distributed Tracing. While they aren’t one and the same, for the purpose of this article, and the value they provide in profiling database queries, I’ve decided to put them in the same category.&lt;/p&gt;

&lt;p&gt;The way most APMs work is through a library or agent that is installed alongside your application and automatically instruments its client libraries (HTTP, gRPC, SQL, etc.) to monitor and log transactions and queries.&lt;/p&gt;

&lt;p&gt;If you are up to speed with the latest advancements in &lt;a href="https://opentelemetry.io/"&gt;observability and distributed tracing&lt;/a&gt;, and already have instrumentation for your application’s database queries, there are open-source distributed trace visualization tools, like &lt;a href="https://www.jaegertracing.io/"&gt;Jaeger&lt;/a&gt;, that can help you easily identify slow queries.&lt;/p&gt;

&lt;p&gt;Let’s see an example of a request containing one such slow database query:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SiF9_Q61--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2940/0%2AIg60ywTcW8bs1oU1" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SiF9_Q61--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2940/0%2AIg60ywTcW8bs1oU1" alt="Searching recent traces by duration and noticing the latency breakdown by service. In this screenshot from Datadog, the green area is time running our backend code whilst purple is time spent on resolving PostgreSQL queries."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WvHQyx1r--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3200/0%2A8fJf2Ao8u9B6PU6-" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WvHQyx1r--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3200/0%2A8fJf2Ao8u9B6PU6-" alt="Detail screen of a trace for a slow request with an SQL query spanning over 10 seconds. The flame graph shows that &amp;gt;90% of the request time was spent on a single query."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, with APM or tracing data, and a good service or tool for visualization, finding slow database queries to profile is actually quite easy.&lt;/p&gt;

&lt;p&gt;In any case, no matter which method you choose, as long as you have a way to time your queries, you’re likely to find a slow query that is worth optimizing!&lt;/p&gt;

&lt;h2&gt;Profiling a SQL Query&lt;/h2&gt;

&lt;p&gt;Now that we’ve identified our slow query, the next step is profiling. Profiling, in the context of a SQL database, generally means &lt;a href="https://www.postgresql.org/docs/current/sql-explain.html"&gt;&lt;em&gt;explaining&lt;/em&gt;&lt;/a&gt; our query. The &lt;em&gt;EXPLAIN&lt;/em&gt; command outputs details from the &lt;a href="https://www.postgresql.org/docs/current/planner-optimizer.html"&gt;&lt;em&gt;query planner&lt;/em&gt;&lt;/a&gt;, giving us the additional visibility needed to understand what the database is actually doing to resolve the query.&lt;/p&gt;

&lt;p&gt;You can execute this command directly from a command-line shell in your database, but if you would rather avoid terminals, most database clients with a GUI also include &lt;em&gt;EXPLAIN&lt;/em&gt; capabilities. Following along with our earlier slow query, let’s now profile it. To do this, we simply prepend &lt;em&gt;EXPLAIN ANALYZE&lt;/em&gt; to the original slow query.&lt;/p&gt;
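&lt;p&gt;For example (the query below is an illustrative stand-in for the actual slow query, using the table names that appear later in this post):&lt;/p&gt;

```sql
EXPLAIN ANALYZE
SELECT te.id, te.fqn
FROM scope_testexecution te
JOIN scope_agent a ON te.agent_id = a.id
WHERE a.id = 42;
```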

&lt;p&gt;Attention! It’s important to understand the difference between &lt;em&gt;EXPLAIN&lt;/em&gt; and &lt;em&gt;EXPLAIN ANALYZE&lt;/em&gt;. While &lt;em&gt;EXPLAIN&lt;/em&gt; will only give details about the plan, without executing the query, &lt;em&gt;EXPLAIN ANALYZE&lt;/em&gt; actually executes the query, providing you with exact timing information under the current server load. You need to be very careful with this, particularly on production databases: not only could it impact performance, but you could also modify or delete data if you are profiling an &lt;em&gt;UPDATE&lt;/em&gt; or &lt;em&gt;DELETE&lt;/em&gt; operation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xDE7qvhC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2688/1%2AghHXc44RHPN2_ticWCZmFg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xDE7qvhC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2688/1%2AghHXc44RHPN2_ticWCZmFg.png" alt="Public gist [here](https://gist.github.com/AdrianLC/eaac58f4429d41deaa16a5529539bd34#file-sql-explain-sql)"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After running the EXPLAIN ANALYZE query on a PostgreSQL shell, you’ll get an output similar to this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--z-YudX3Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3398/1%2AnhpoTo7lBBhoAFDPAFGkjQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--z-YudX3Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3398/1%2AnhpoTo7lBBhoAFDPAFGkjQ.png" alt="Public gist [here](https://gist.github.com/AdrianLC/eaac58f4429d41deaa16a5529539bd34#file-sql-explain-txt)"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, the output is quite cryptic! What does all of this mean? PostgreSQL’s &lt;em&gt;EXPLAIN&lt;/em&gt; is very thorough: it shows us everything the database knows and plans to do with our query. As such, it is to be expected that we will not understand everything. While explaining all of the output is beyond the scope of this article, we can still learn quite a few things from it.&lt;/p&gt;

&lt;p&gt;The first tip is that, if we want to think about the steps the database follows &lt;em&gt;sequentially&lt;/em&gt;, it helps to read from the bottom upwards. The output is a breakdown of operations, so the last entries are actually the first steps of each parallelized block of the execution.&lt;/p&gt;

&lt;p&gt;Also, if you are obtaining your &lt;em&gt;EXPLAIN&lt;/em&gt; output from the shell, I highly recommend using an &lt;em&gt;EXPLAIN&lt;/em&gt; visualization tool; I personally like &lt;a href="https://explain.depesz.com/"&gt;this one&lt;/a&gt;. It provides syntax highlighting and table formatting to help with legibility. You can also easily share the report with other teammates! &lt;a href="https://explain.depesz.com/s/uaan"&gt;Here’s a link to an example query&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3NAe-85N--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3200/0%2AI_VqbN9x3_nAiyiA" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3NAe-85N--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3200/0%2AI_VqbN9x3_nAiyiA" alt="PostgreSQL EXPLAIN visualization [website](https://explain.depesz.com/s/uaan)."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://use-the-index-luke.com/sql/explain-plan/postgresql/operations"&gt;This site&lt;/a&gt; is also a great source to learn about what most of the operations on SQL plans are. But for this practical example, we’ll just be taking a look at two operations, &lt;em&gt;Seq Scan&lt;/em&gt; and &lt;em&gt;Index Scan&lt;/em&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Seq Scan&lt;/em&gt; is a full search on the table with the given conditions. These are usually bad because they will become slower as the data in your tables grow.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Index Scan&lt;/em&gt; is using one of the indexes that exist on your table. These are usually better since the database has a shortcut to the rows that match the conditions.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most often, optimizing your queries will be a matter of replacing a &lt;em&gt;Seq Scan&lt;/em&gt; that is too big with an &lt;em&gt;Index Scan&lt;/em&gt; by creating an index on the columns involved in the conditions. Though you should know that adding indexes to your tables is not always appropriate: when adding a new index, we trade write performance for query speed, since the database has to update the index on every write to the table. So you should keep the number of indexes as low as you can afford, and only create the ones you find unavoidable.&lt;/p&gt;

&lt;p&gt;As we continue to debug our slow query, we can see in the visualization above that a single step in the plan is the culprit for most of the time spent resolving the query. So we should focus on these lines:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KJYVtWqg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2876/1%2A__brbXGG2K9yjUx5krZU2Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KJYVtWqg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2876/1%2A__brbXGG2K9yjUx5krZU2Q.png" alt="Public gist [here](https://gist.github.com/AdrianLC/eaac58f4429d41deaa16a5529539bd34#file-sql-explain-snippet-txt)"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This step in the plan is finding rows in the table &lt;em&gt;scope_testexecution&lt;/em&gt; with a column &lt;em&gt;agent_id&lt;/em&gt; that matches the primary key on the table scope_agent &lt;strong&gt;through&lt;/strong&gt; the index called &lt;em&gt;scope_teste_agent_i_f352da_idx&lt;/em&gt;. It is resolving a foreign key from one table to another. Unfortunately, it is already using an index, so that is not the issue for this specific query.&lt;/p&gt;

&lt;p&gt;In PostgreSQL, you can check what kind of index is being used with the command &lt;em&gt;\d+ scope_testexecution&lt;/em&gt;, resulting in:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Indexes:

…

“scope_teste_agent_i_f352da_idx” btree (agent_id, fqn)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This is not how we usually index foreign keys. The database is using a compound index on two columns, &lt;em&gt;agent_id&lt;/em&gt; and &lt;em&gt;fqn&lt;/em&gt;, and we do not have a single-column index on &lt;em&gt;agent_id&lt;/em&gt;!&lt;/p&gt;

&lt;p&gt;Furthermore, the &lt;em&gt;fqn&lt;/em&gt; column is a varchar(1000) with very high cardinality (a lot of different values per agent), and the &lt;em&gt;EXPLAIN&lt;/em&gt; is actually telling us that this step involves 83967 &lt;em&gt;heap fetches&lt;/em&gt;. In the &lt;a href="https://www.postgresql.org/docs/current/indexes-index-only-scans.html"&gt;&lt;em&gt;Index Only Scan&lt;/em&gt;&lt;/a&gt; operation, PostgreSQL scans through the index but still needs access to the table storage. The area in storage where the table lives is called the &lt;em&gt;heap&lt;/em&gt;, so the number of &lt;em&gt;heap fetches&lt;/em&gt; is the number of read operations from the table. In summary, the query requires a lot of I/O to fetch these values from the table, and that is the likely cause of the poor performance.&lt;/p&gt;

&lt;p&gt;We could decrease the amount of data PostgreSQL needs to read to resolve this condition if we had a single-column index on agent_id. Having a dedicated index for each foreign key is the standard setup, usually autogenerated by ORMs or framework tooling, but in this case, it’s missing.&lt;/p&gt;

&lt;h2&gt;Testing an optimization&lt;/h2&gt;

&lt;p&gt;Now that we have an idea of what to change, we can experiment and see if the query performs better. It is better to do this on a testing or staging database when possible, since it will involve altering our indexes, and there’s always the chance we’re wrong, potentially affecting our production database.&lt;/p&gt;

&lt;p&gt;It’s likely that you have a mechanism in place to change your database schema; these are usually called &lt;a href="https://en.wikipedia.org/wiki/Schema_migration"&gt;database or schema migrations&lt;/a&gt;. Schema changes should involve code reviews, as well as be part of a continuous integration process to be tested and validated.&lt;/p&gt;

&lt;p&gt;While this isn’t the ideal way to debug, we unfortunately cannot test this kind of change on a development machine, since we need data resembling the actual size of the production database. With this in mind, a trick I’ve used to get past this limitation is to make the changes within a transaction. PostgreSQL allows transactional changes to indexes and table structures, which comes in handy for our use case:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XuQkJcuz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2660/1%2AfsUR7FQKwbp4koO6J8GDdw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XuQkJcuz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2660/1%2AfsUR7FQKwbp4koO6J8GDdw.png" alt="Public gist [here](https://gist.github.com/AdrianLC/eaac58f4429d41deaa16a5529539bd34#file-optimization-create-index-sql)"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With this change, we can test a different plan for our query without affecting how it currently works for other transactions. And if we’re wrong, nothing is affected; we can simply undo the index with a quick &lt;em&gt;ROLLBACK&lt;/em&gt;!&lt;/p&gt;
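&lt;p&gt;A sketch of this transactional experiment might look like the following (the index name is illustrative; note that a plain &lt;em&gt;CREATE INDEX&lt;/em&gt; blocks writes to the table while it builds):&lt;/p&gt;

```sql
BEGIN;

-- Candidate index; only this transaction sees it until commit.
CREATE INDEX scope_testexecution_agent_id_idx
    ON scope_testexecution (agent_id);

-- Re-run the slow query and compare the plans.
EXPLAIN ANALYZE
SELECT ...;  -- the original slow query goes here

-- Discard the experiment; the index is gone, nothing else is affected.
ROLLBACK;
```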

&lt;p&gt;Let’s try this out. The next step is to profile the query again, and to use our visualizer to see the report and compare to the previous results we obtained:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--oaahwcD1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3200/0%2AdkJLzAD6wbzEu05y" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--oaahwcD1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3200/0%2AdkJLzAD6wbzEu05y" alt="Visualization of the *explain* after our new index optimization. Public link [here](https://explain.depesz.com/s/Qxq6)."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see in the screenshot above, the same query now takes just over a second. Remember it initially took over 10 seconds! Our optimization, creating a separate index for &lt;em&gt;agent_id&lt;/em&gt;, has been successful!&lt;/p&gt;

&lt;p&gt;You may be curious as to why we did not have the standard index for the foreign key on the &lt;em&gt;agent_id&lt;/em&gt; column. It turns out that in previous iterations of our product, the table &lt;em&gt;scope_testexecution&lt;/em&gt; was used in several queries that aggregated results. Because of this, we had more queries doing &lt;em&gt;GROUP BY&lt;/em&gt; on &lt;em&gt;agent_id, fqn&lt;/em&gt; than queries accessing &lt;em&gt;agent_id&lt;/em&gt; directly. That index was necessary at the time. Fortunately, things have changed since then, and we can now revert to using the standard index.&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Optimization of SQL queries may be daunting at first, but it is fascinating and within every developer’s reach. Hopefully, this post has helped demonstrate that there are easy optimizations you can do today, such as proposing the addition of a missing index.&lt;/p&gt;

&lt;p&gt;But even if you’re just getting started with databases, you can already help your team by identifying slow queries and providing the output of &lt;em&gt;EXPLAIN&lt;/em&gt; to help debug issues — they will surely appreciate it!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Scope can help you identify slow SQL queries in testing before they ever make it to production!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---vVwxzQr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2AU0pSX6D8v1FM2pgq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---vVwxzQr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2AU0pSX6D8v1FM2pgq.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Testing is a core competency to build great software. But testing has failed to keep up with the fundamental shift in how we build applications. Scope gives engineering teams production-level visibility on &lt;strong&gt;every test&lt;/strong&gt; for &lt;strong&gt;every app&lt;/strong&gt; — spanning mobile, monoliths, and microservices.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Your journey to better applications through better testing &lt;a href="https://scope.dev/?utm_source=dev.to"&gt;starts with Scope&lt;/a&gt;.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>testing</category>
      <category>sql</category>
      <category>database</category>
      <category>postgres</category>
    </item>
    <item>
      <title>Limitations of xUnit in Cloud Native Testing</title>
      <dc:creator>Bryan Lee</dc:creator>
      <pubDate>Mon, 13 Jan 2020 17:05:27 +0000</pubDate>
      <link>https://dev.to/kickingthetv/limitations-of-xunit-in-cloud-native-testing-4jmc</link>
      <guid>https://dev.to/kickingthetv/limitations-of-xunit-in-cloud-native-testing-4jmc</guid>
      <description>&lt;p&gt;&lt;strong&gt;Written by &lt;a href="https://twitter.com/drodriguezhdez"&gt;Daniel Rodriguez&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kJPwdknh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/5180/1%2AM_6VvAuVl1c8l_4jc8cFIQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kJPwdknh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/5180/1%2AM_6VvAuVl1c8l_4jc8cFIQ.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;xUnit is the name used for the collection of test frameworks loosely based on the original Smalltalk Test Framework, proposed by Kent Beck in 1998, and later popularized by &lt;a href="https://junit.org/junit5/"&gt;JUnit&lt;/a&gt;. While it has become common nowadays, the idea of systematically checking an application’s correctness in code (e.g. a test) was quite novel at the time.&lt;/p&gt;

&lt;p&gt;Another revolutionary idea in those days was the concept of Continuous Integration: every developer merging code on a recurring basis to ensure everything was working as expected. And to help them check for correctness and identify potential regressions, teams leveraged these testing frameworks to test their applications during the continuous integration flow.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;But if you stop to think about it, we are talking about practices that were introduced somewhere between 20 and 25 years ago!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;With regard to CI, there is no shortage of options nowadays, from free and open-source solutions to commercial SaaS and on-premise products. And with the adoption and proliferation of Docker and containers, it is really easy for developers to run their custom tech stacks on any CI provider. But at the end of the day, CI services are still predominantly dumb runtimes, where a developer can define what to run, but the CI doesn’t actually understand what is being executed. And it is only due to “&lt;em&gt;standardization”&lt;/em&gt; around the xUnit XML format that some CI tools are able to parse and report on test data.&lt;/p&gt;

&lt;h3&gt;xUnit has failed to evolve for the cloud native world&lt;/h3&gt;

&lt;p&gt;But in the era of cloud native applications, innovation in testing has been stuck in a seemingly alternate universe where XML is the pinnacle of testing innovation. There are sophisticated engineering organizations running thousands of tests at planet scale, and they’re still generating XML reports and uploading them to their CI or reviewing them with a text editor.&lt;/p&gt;

&lt;p&gt;Furthermore, there is no XML schema standard to represent this report – every vendor has freedom in how they define their schema, which may have varying levels of compatibility with the most popular CIs. All of this, only to see a summary of your passing and failing tests. It makes no sense.&lt;/p&gt;
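&lt;p&gt;For reference, a typical JUnit-style XML report looks something like this (attribute names vary by vendor; the values are illustrative):&lt;/p&gt;

```
&lt;testsuite name="checkout-service" tests="2" failures="1" time="3.214"&gt;
  &lt;testcase classname="cart.CartTest" name="test_add_item" time="0.412"/&gt;
  &lt;testcase classname="cart.CartTest" name="test_apply_discount" time="2.802"&gt;
    &lt;failure message="AssertionError: expected 90, got 100"/&gt;
  &lt;/testcase&gt;
&lt;/testsuite&gt;
```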

&lt;p&gt;While this way of testing was sufficient when we only had unit tests, it falls predictably short with current software engineering needs that emphasize integration with other services. Suites of integration tests are now a must-have to reduce the risk of deploying bugs into production. As a matter of fact, &lt;a href="https://blog.usejournal.com/lean-testing-or-why-unit-tests-are-worse-than-you-think-b6500139a009"&gt;Lean Testing&lt;/a&gt;, a modern testing philosophy, questions the validity of the traditional &lt;a href="https://martinfowler.com/bliki/TestPyramid.html"&gt;testing pyramid&lt;/a&gt; and instead advocates for more integration tests, and fewer unit tests — resulting in the “Testing Trophy”.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--D8HX166c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2APW0wMKOY4DamEsK6mRYxew.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--D8HX166c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2APW0wMKOY4DamEsK6mRYxew.png" alt="Testing trophy by Kent C. Dodds"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Testing trophy by Kent C. Dodds.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;xUnit only provides superficial insights&lt;/h3&gt;

&lt;p&gt;XML test reports are insufficient to understand what is happening under the hood. Yes, the reports show when a test has failed, but developers won’t know the type of test (unit? integration? benchmark? end-to-end? other?), and if it is an integration test, which services it integrates with, which versions are running, or in which environment. Is a test failing due to a code change or a config change? Or is it failing because of a dependency? With today’s XML reports, we simply cannot answer these and many other questions. And consequently, developers end up spending more time trying to understand the problem than fixing it.&lt;/p&gt;

&lt;h3&gt;Observability in testing: a must-have in a cloud native world&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://medium.com/scopedev/what-happened-to-tests-9af43b1bc2be"&gt;Leveraging observability patterns in our integration tests&lt;/a&gt;, we cannot only start providing reliable information about what services are touched by our tests, but we could also &lt;a href="https://medium.com/scopedev/testing-in-the-cloud-native-era-41f63a0e101b"&gt;start executing controlled tests in production&lt;/a&gt;. Modern production observability tools give developers the visibility they need to understand complex systems, but why wait until production to actually understand how our applications are behaving? Given a choice, every developer would much rather catch a bug before it ships to production, &lt;a href="https://medium.com/scopedev/testing-in-the-cloud-native-era-41f63a0e101b"&gt;where the cost is always much greater.&lt;/a&gt; Yet, developers lack the tools to efficiently do this with today’s modern applications.&lt;/p&gt;

&lt;p&gt;It’s now clear that with the proliferation of containers, serverless, Kubernetes, and cloud native, both the way we develop and the applications we develop have changed. But developers are struggling to properly test and debug these new applications with the current tools at their disposal. Testing and testing frameworks need to evolve to better understand what is happening in our tests. As is, current methods for debugging integration tests are unreliable, time-consuming, require a high level of expertise, and don’t always lead to resolution.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--y4Vwd8pe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AxYv0oMQNQikyluNvN4EBig.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--y4Vwd8pe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AxYv0oMQNQikyluNvN4EBig.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Testing is a core competency to build great software. But testing has failed to keep up with the fundamental shift in how we build applications. Scope gives engineering teams production-level visibility on &lt;strong&gt;every test&lt;/strong&gt; for &lt;strong&gt;every app&lt;/strong&gt; — spanning mobile, monoliths, and microservices.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Your journey to better engineering through better testing &lt;a href="https://scope.dev?utm_source=dev.to"&gt;starts with Scope&lt;/a&gt;.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>testing</category>
      <category>observability</category>
      <category>cloudnative</category>
    </item>
    <item>
      <title>A Modern Approach to Manual Testing</title>
      <dc:creator>Bryan Lee</dc:creator>
      <pubDate>Tue, 17 Dec 2019 22:05:40 +0000</pubDate>
      <link>https://dev.to/kickingthetv/a-modern-approach-to-manual-testing-273h</link>
      <guid>https://dev.to/kickingthetv/a-modern-approach-to-manual-testing-273h</guid>
      <description>&lt;p&gt;&lt;strong&gt;Written by &lt;a href="https://twitter.com/JotaEscribe"&gt;Juan Fernandez&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TdWK3vVE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/4442/1%2A1fYbPC5ENyGl8ata91fBqg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TdWK3vVE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/4442/1%2A1fYbPC5ENyGl8ata91fBqg.png" alt="Bridging the gap between Dev and QA."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Setting the stage&lt;/h2&gt;

&lt;p&gt;We’ve all seen it: the bigger an organization gets, the more difficult it becomes to have meaningful and efficient communication between teams. This is particularly apparent when incentives are misaligned. Often, development teams are incentivized to build as much as possible in the shortest time possible. On the other hand, QA teams are incentivized to reduce the inherent risk in a codebase/application to an acceptable degree. The problem is obvious: whereas development teams optimize for speed, QA teams optimize for correctness. This misalignment of incentives results in frustration and burnout between and within these teams.&lt;/p&gt;

&lt;p&gt;This inherent friction between development and QA teams is most acute when a bug is found. At that point, QA professionals prioritize getting said bug fixed. Developers, in contrast, are torn between shipping a new feature on time and delaying it in exchange for a bugfix that could take five minutes or five weeks to implement and has no clear ROI.&lt;/p&gt;

&lt;p&gt;While an argument could be made that the right culture could potentially prevent this, our experience talking to hundreds of organizations is that, more often than not, teams with competing incentives behave this way.&lt;/p&gt;

&lt;h2&gt;
  
  
  The state of affairs
&lt;/h2&gt;

&lt;p&gt;Let’s take a look at what happens today, at any given organization, when QA finds a bug in an application. For the sake of this post, let’s pretend this is an e-commerce application.&lt;/p&gt;

&lt;p&gt;Step 1: QA contacts the developer or development team &lt;em&gt;supposedly&lt;/em&gt; responsible for the fix. They do this on a communications tool like Slack, or by creating an issue on a tracker like Jira.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5vwS3F2I--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2560/0%2AF1_UBK6DnbDyKXgX" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5vwS3F2I--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2560/0%2AF1_UBK6DnbDyKXgX" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Step 2: The QA professional then proceeds to talk with the “Checkout Dev Team.”&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NVo5yohK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2576/0%2A_dAtWesgd9GRasEs" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NVo5yohK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2576/0%2A_dAtWesgd9GRasEs" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Step 3: QA is left confounded and unsure of how to proceed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lJsQ4M-m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2AA5442di6_E4SY6Sv" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lJsQ4M-m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2AA5442di6_E4SY6Sv" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What follows is a back-and-forth between the QA and development teams: struggling to reproduce the issue across different development environments, sharing screenshots and recordings, and fishing for the relevant logs of the apparent issue at hand. All the while, everyone is banging their heads against the wall, because the problem only shows up under seemingly random circumstances that cannot be pinned down. On top of it all, it’s likely that after days of the issue lingering and bouncing around Jira, no development team has confidently claimed ownership of the fault, let alone started working on a fix.&lt;/p&gt;

&lt;p&gt;This sequence of events happens even in the most dedicated and collaborative organizations with robust processes in place. The problem is the depth and complexity of the software applications that we’re developing these days. Correctly identifying what team, microservice, configuration, or transaction is at fault for any given issue, and quickly debugging it to find a proper solution is one of the most difficult challenges in modern-day software engineering.&lt;/p&gt;

&lt;h2&gt;
  
  
  Can we do better?
&lt;/h2&gt;

&lt;p&gt;Of course, we can! But before I show you how, let me introduce myself: I’m Juan, and I work as a Software Engineer at Undefined Labs. We work on developer tools, and we believe it’s time to fundamentally change the way testing and development teams collaborate, and significantly improve how organizations test their applications. In this post, I’d like to show you the power of &lt;a href="http://scope.dev?utm_source=dev.to"&gt;Scope&lt;/a&gt; for manual web-based testing.&lt;/p&gt;

&lt;p&gt;To explain our approach better, let’s revisit our previous scenario, this time with Scope:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--oqTb9f4x--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2548/0%2A6PhD_-dpP1-uMl4e" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--oqTb9f4x--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2548/0%2A6PhD_-dpP1-uMl4e" alt="QA can provide a deep link to the test report for Developers to analyze."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;QA can provide a deep link to the test report for Developers to analyze.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YIntMUa2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3200/0%2A1ll9QywOvi1CR8BF" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YIntMUa2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3200/0%2A1ll9QywOvi1CR8BF" alt="*All reports in Scope include a trace showing the transaction recorded during the execution of the test.*"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;All reports in Scope include a trace showing the transaction recorded during the execution of the test.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The verdict:&lt;/em&gt; screenshots and logs, while handy, are hardly sufficient for most of the issues a development team faces on any given day in today’s modern applications. With Scope, we’re making it trivially easy for QA teams to give their development teams the rich, in-depth visibility they need to act on any manual test and fix any bug. Let’s see how.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introducing Scope for Chrome
&lt;/h2&gt;

&lt;p&gt;When we set out to build Scope, our main focus was solving problems for developers. And while developers are ultimately responsible for fixing a bug, we saw in Scope the potential to build a bridge between QA and development teams. Scope for Chrome is our first step towards building this bridge.&lt;/p&gt;

&lt;p&gt;Whereas many of the testing frameworks, SDKs, and agents we’ve built to date cater to developers, it was clear early on that the needs and tools of a QA professional are quite different. With Scope for Chrome, we wanted to build the easiest way for anyone to create meaningful manual browser tests. As such, the only requirements to use Scope for Chrome are to (1) install our browser extension, and (2) know how to use a web browser.&lt;/p&gt;

&lt;p&gt;It really is that easy:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SU3wRgmz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/http://g.recordit.co/428vzoGSrA.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SU3wRgmz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/http://g.recordit.co/428vzoGSrA.gif" alt="Performing a manual test using Scope for Chrome"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After anyone records a manual test using Scope for Chrome, a report is generated automatically. This test report includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;User actions like clicks or keyboard strokes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;HTTP requests with their headers and payloads.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Responses by the backend and even database queries being fired as a result of the request.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Console logs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Exceptions.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This detailed report includes everything a developer would want to see when trying to debug any given regression. Best of all, it eliminates the need to reproduce the issue at hand. There is no more “what browser were you using?” or “what user were you logged in as?” Everything you need to understand the problem is in a single pane of glass, and it is easier than ever to know which team is best suited for the fix, or which service is at fault.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;

&lt;p&gt;Simply put: when a user clicks on the “Start Recording” button, &lt;a href="https://home.undefinedlabs.com/goto/scope-for-chrome"&gt;Scope For Chrome&lt;/a&gt; starts listening and recording everything happening within your current tab.&lt;/p&gt;

&lt;p&gt;In more detail, this is accomplished in three ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Listen to and record user events such as mouse clicks, keyboard strokes, and exceptions happening within the tab.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Monkey patch &lt;a href="https://developer.mozilla.org/en-US/docs/Web/API/XMLHttpRequest"&gt;XHR&lt;/a&gt; and &lt;a href="https://developer.mozilla.org/en-US/docs/Web/API/Fetch_API/Using_Fetch"&gt;fetch&lt;/a&gt; by injecting code to the tab under test. Each request creates a new span (“individual unit of work” as per &lt;a href="https://opentracing.io/docs/overview/spans/"&gt;OpenTracing terminology&lt;/a&gt;) that then propagates to the backend.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Listen to &lt;em&gt;main_frame&lt;/em&gt; requests (a document that is loaded for a top-level frame). This is the first request that your browser does when going to a new page. For this, we use the &lt;a href="https://developer.chrome.com/extensions/webRequest"&gt;webRequest extension API&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Of course, this is just a glimpse at what Scope for Chrome does. None of it would have been possible without the work behind our &lt;a href="https://docs.scope.dev/docs/javascript-installation"&gt;javascript agent&lt;/a&gt;, our other agents (&lt;a href="https://docs.scope.dev/docs/python-installation"&gt;Python&lt;/a&gt;, &lt;a href="https://docs.scope.dev/docs/ios-installation"&gt;iOS&lt;/a&gt;, &lt;a href="https://docs.scope.dev/docs/dotnet-installation"&gt;.NET&lt;/a&gt;, &lt;a href="https://docs.scope.dev/docs/java-installation"&gt;Java&lt;/a&gt;, &lt;a href="https://docs.scope.dev/docs/go-installation"&gt;Golang&lt;/a&gt;), our backend capable of ingesting and processing test data and distributed traces, and our purpose-built web UI to display tests in an interactive and structured way.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;P.S. Check out the Technical Addendum below for more information.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s next?
&lt;/h2&gt;

&lt;p&gt;With the ever-increasing complexity of modern software comes bigger, more sophisticated software teams. In the same way teams work to improve the interfaces between distributed services, we should improve how we collaborate and communicate between teams. We believe &lt;a href="https://home.undefinedlabs.com/goto/scope-for-chrome"&gt;Scope for Chrome&lt;/a&gt; can help alleviate the most frustrating problems associated with the lack of visibility in manual testing, and help bridge the gap between QA and Dev.&lt;/p&gt;

&lt;p&gt;You can learn more about &lt;a href="https://home.undefinedlabs.com/goto/scope-for-chrome"&gt;Scope for Chrome here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Technical Addendum: challenges we faced while building Scope for Chrome
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Monkey patching is hard
&lt;/h3&gt;

&lt;p&gt;For those who haven’t delved into the world of monkey patching, it is &lt;a href="https://www.audero.it/blog/2016/12/05/monkey-patching-javascript/"&gt;a technique to add, modify, or suppress the default behavior of a piece of code at runtime without changing its original source code&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For example, consider a class with a method &lt;strong&gt;get_value&lt;/strong&gt; that performs a database query when called. When unit testing the class, you may not want to hit an actual database, so you dynamically replace &lt;strong&gt;get_value&lt;/strong&gt; with a stub that returns mock data.&lt;/p&gt;

&lt;p&gt;This technique extends to other uses. In a web application, for example, you might want to replace &lt;strong&gt;console.log&lt;/strong&gt; with a function that not only logs a message but also prepends the date at which it was called.&lt;/p&gt;

&lt;p&gt;Here’s an example:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const log = console.log

console.log = function() {

  log.apply(console, [new Date().toISOString(), ...arguments])

}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Basic monkey patching of &lt;strong&gt;window.fetch&lt;/strong&gt; is simple (note that this is just an example, not production-ready code):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const oldFetch = window.fetch

window.fetch = (...args) =&amp;gt; {
  // do something here
  return oldFetch(...args)
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Things get more interesting when you want to do async stuff in there, like communicating with a &lt;a href="https://developer.chrome.com/extensions/background_pages"&gt;background script&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const oldFetch = window.fetch

 window.fetch = (...args) =&amp;gt;

  new Promise(resolve =&amp;gt; {

   asyncCommWithBackground().then(newRequestInfo =&amp;gt; {

    const newFetchArgs = [...args, ...newRequestInfo]

    resolve(oldFetch(...newFetchArgs))

   })

  })
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This pattern is quite powerful. But with great power comes great responsibility: by monkey patching, we are slowing down every request in the active tab by however long &lt;em&gt;asyncCommWithBackground&lt;/em&gt; takes to resolve.&lt;/p&gt;

&lt;p&gt;And here’s an example of doing &lt;em&gt;something&lt;/em&gt; with the result of the fetch:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;window.fetch = (...args) =&amp;gt;

 new Promise(resolve =&amp;gt; {

  asyncCommWithBackground().then(newRequestInfo =&amp;gt; {

   const newFetchArgs = [...args, ...newRequestInfo]

   resolve(oldFetch(...newFetchArgs)).then(fetchResult =&amp;gt; {

    // do something with fetchResult

    return fetchResult 

   })

  })

 })

}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;You can simplify the code a bit with &lt;a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Statements/async_function"&gt;async/await&lt;/a&gt;, but you have to be careful. You probably want to know if your fetch has failed, which calls for try/catch. But a careless catch stops exceptions from propagating to the consumer, which is exactly the scenario you want to avoid. The most important thing to remember here is: monkey patching done right should be transparent.&lt;/p&gt;
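
&lt;p&gt;To make that concrete, here’s a minimal sketch of a transparent async/await patch. Note that &lt;em&gt;target&lt;/em&gt; and its stub &lt;em&gt;fetch&lt;/em&gt; are illustrative stand-ins for &lt;em&gt;window&lt;/em&gt; and the real &lt;em&gt;window.fetch&lt;/em&gt;:&lt;/p&gt;

```javascript
// Illustrative stand-in for window: a stub fetch we can patch and call.
const target = {
  fetch: async (url) => {
    if (url === "bad") throw new Error("network down");
    return { url, ok: true };
  },
};

const originalFetch = target.fetch;

// The patch observes what happens but stays transparent: results are
// returned unchanged and failures are rethrown to the consumer.
target.fetch = async (...args) => {
  try {
    const result = await originalFetch(...args);
    // ...record the successful call here (e.g. tag a span)...
    return result;
  } catch (err) {
    // ...record the failure here, then rethrow so the caller still sees it
    throw err;
  }
};
```

&lt;p&gt;Callers observe exactly the same behavior as before the patch, success and failure alike.&lt;/p&gt;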

&lt;p&gt;To do something with the response data, like adding it to the span as a tag or metadata, you need to be careful to &lt;a href="https://developer.mozilla.org/en-US/docs/Web/API/Response/clone"&gt;clone&lt;/a&gt; your response first, as its body is a stream and can only be consumed once.&lt;/p&gt;
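
&lt;p&gt;A small sketch of that cloning rule, using the WHATWG &lt;em&gt;Response&lt;/em&gt; class (global in modern browsers and in Node 18+); &lt;em&gt;recordBody&lt;/em&gt; is a hypothetical helper name:&lt;/p&gt;

```javascript
// A Response body is a stream and can only be read once: clone it BEFORE
// any read so the original can still be handed back to the caller.
async function recordBody(response) {
  const copy = response.clone(); // must happen before either body is read
  const body = await copy.text(); // consume the clone for our own records
  return { body, response };      // the original remains unconsumed
}

const original = new Response("hello world");
```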

&lt;p&gt;If you want to dig a bit deeper into this pattern, we have set up a repository with a &lt;a href="https://github.com/juan-fernandez/monkeypatch-example-scope"&gt;small project of a functioning chrome extension&lt;/a&gt; that delays all your &lt;em&gt;fetch&lt;/em&gt; requests. The pattern is highlighted &lt;a href="https://github.com/juan-fernandez/monkeypatch-example-scope/blob/master/background.js#L16-L26"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;An alternative to this solution is to use the &lt;a href="https://developer.chrome.com/extensions/webRequest"&gt;webRequest API&lt;/a&gt; with hooks like &lt;em&gt;onBeforeSendHeaders&lt;/em&gt;, but as of now, this API does not allow the capture of response payloads, which was a requirement for Scope.&lt;/p&gt;
&lt;h3&gt;
  
  
  2. Monkey patching in a browser is even harder
&lt;/h3&gt;

&lt;p&gt;As it turns out, if you want to affect the &lt;em&gt;window&lt;/em&gt; variable of the tab, which we need for monkey patching, a &lt;a href="https://developer.chrome.com/extensions/content_scripts"&gt;content script&lt;/a&gt; is not enough. You need to execute code like that shown &lt;a href="https://github.com/juan-fernandez/monkeypatch-example-scope/blob/master/background.js#L3-L9"&gt;here&lt;/a&gt;, which means handling your code as a string. There are alternatives, like using &lt;em&gt;function.toString()&lt;/em&gt; and &lt;a href="https://github.com/kentcdodds/babel-plugin-macros"&gt;babel macros&lt;/a&gt; to evaluate variables at build time, but the extra complexity defeats the purpose, as your monkey patched functions should not be big anyway.&lt;/p&gt;

&lt;p&gt;Utility functions that your monkey patched functions need, like random number generation or data parsing, must also be available in the tab, which again means handling your code as strings to inject it. The tab shares no execution environment with your background script, and while it would be possible to asynchronously request and wait for results, that would mean slowing down every request.&lt;/p&gt;
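
&lt;p&gt;One way to see the “code as strings” constraint: serialize the patch function yourself with &lt;em&gt;function.toString()&lt;/em&gt;. In this sketch, &lt;em&gt;eval&lt;/em&gt; stands in for the script element a real extension would append to the page:&lt;/p&gt;

```javascript
// The function we want to run inside the tab. It must be self-contained:
// it cannot reference helpers from the extension's own execution environment.
function patchGlobals(target) {
  target.__patched = true;
}

// Serialize it and wrap it in an immediately-invoked call.
const injectable = "(" + patchGlobals.toString() + ")(globalThis);";

// eval plays the role of "the page executes the injected script text".
eval(injectable);
```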
&lt;h3&gt;
  
  
  3. Sending responses asynchronously
&lt;/h3&gt;

&lt;p&gt;Your content scripts and injected code communicate with your background script through &lt;a href="https://developer.chrome.com/extensions/messaging"&gt;message passing&lt;/a&gt;. At some point, a response might require some async operation. To keep the communication channel open until then, your listener needs to return &lt;em&gt;true&lt;/em&gt; before &lt;em&gt;sendResponse&lt;/em&gt; is eventually called. More on this pattern &lt;a href="https://riptutorial.com/google-chrome-extension/example/7152/send-a-response-asynchronously"&gt;here&lt;/a&gt;.&lt;/p&gt;
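
&lt;p&gt;A simplified shim (not the real &lt;em&gt;chrome.runtime&lt;/em&gt; API) illustrating why returning &lt;em&gt;true&lt;/em&gt; matters: the dispatcher closes the channel immediately unless the listener returns &lt;em&gt;true&lt;/em&gt;:&lt;/p&gt;

```javascript
// Simplified stand-in for extension message passing, NOT the real API:
// the channel stays open after the listener returns only if it returned true.
function dispatch(listener, message) {
  return new Promise((resolve, reject) => {
    let open = true;
    const sendResponse = (value) => {
      if (open) resolve(value);
      else reject(new Error("channel already closed"));
    };
    // If the listener does not return true, the channel closes right away.
    if (listener(message, sendResponse) !== true) open = false;
  });
}

// An async listener: it returns true so sendResponse is still valid
// after the synchronous call stack has unwound.
const asyncListener = (message, sendResponse) => {
  Promise.resolve().then(() => sendResponse(message + " handled"));
  return true;
};
```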
&lt;h3&gt;
  
  
  4. Bundling your extension
&lt;/h3&gt;

&lt;p&gt;The window that appears when you click your browser extension’s icon, also known as the popup, will grow sooner than you expect, and a state management library will come in really handy.&lt;/p&gt;

&lt;p&gt;To avoid the hassle of managing a webpack configuration with all the perks (like hot reloading), there are some excellent starter projects and tools for this specific purpose.&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vJ70wriM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://practicaldev-herokuapp-com.freetls.fastly.net/assets/github-logo-ba8488d21cd8ee1fee097b8410db9deaa41d0ca30b004c0c63de0a479114156f.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/samuelsimoes"&gt;
        samuelsimoes
      &lt;/a&gt; / &lt;a href="https://github.com/samuelsimoes/chrome-extension-webpack-boilerplate"&gt;
        chrome-extension-webpack-boilerplate
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      A basic foundation boilerplate for rich Chrome Extensions using Webpack to help you write modular and modern Javascript code, load CSS easily and automatic reload the browser on code changes.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;h1&gt;
Chrome Extension Webpack Boilerplate&lt;/h1&gt;
&lt;p&gt;A basic foundation boilerplate for rich Chrome Extensions using &lt;a href="https://webpack.github.io/" rel="nofollow"&gt;Webpack&lt;/a&gt; to help you write modular and modern Javascript code, load CSS easily and &lt;a href="https://webpack.github.io/docs/webpack-dev-server.html#automatic-refresh" rel="nofollow"&gt;automatic reload the browser on code changes&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
Developing a new extension&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;I'll assume that you already read the &lt;a href="https://webpack.js.org" rel="nofollow"&gt;Webpack docs&lt;/a&gt; and the &lt;a href="https://developer.chrome.com/extensions/getstarted" rel="nofollow"&gt;Chrome Extension&lt;/a&gt; docs.&lt;/em&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Check if your Node.js version is &amp;gt;= 6.&lt;/li&gt;
&lt;li&gt;Clone the repository.&lt;/li&gt;
&lt;li&gt;Install &lt;a href="https://yarnpkg.com/lang/en/docs/install/" rel="nofollow"&gt;yarn&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;yarn&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Change the package's name and description on &lt;code&gt;package.json&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Change the name of your extension on &lt;code&gt;src/manifest.json&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;yarn run start&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Load your extension on Chrome following
&lt;ol&gt;
&lt;li&gt;Access &lt;code&gt;chrome://extensions/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Check &lt;code&gt;Developer mode&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Click on &lt;code&gt;Load unpacked extension&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Select the &lt;code&gt;build&lt;/code&gt; folder.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;Have fun.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
Structure&lt;/h2&gt;
&lt;p&gt;All your extension's development code must be placed in &lt;code&gt;src&lt;/code&gt; folder, including the extension manifest.&lt;/p&gt;
&lt;p&gt;The boilerplate is already prepared to have a popup, a options page and a background page. You can easily customize…&lt;/p&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/samuelsimoes/chrome-extension-webpack-boilerplate"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;




&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vJ70wriM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://practicaldev-herokuapp-com.freetls.fastly.net/assets/github-logo-ba8488d21cd8ee1fee097b8410db9deaa41d0ca30b004c0c63de0a479114156f.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/xpl"&gt;
        xpl
      &lt;/a&gt; / &lt;a href="https://github.com/xpl/crx-hotreload"&gt;
        crx-hotreload
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Chrome Extension Hot Reloader
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;h1&gt;
Chrome Extension Hot Reloader&lt;/h1&gt;
&lt;p&gt;Watches for file changes in your extension's directory. When a change is detected, it reloads the extension and refreshes the active tab (to re-trigger the updated scripts).&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://60devs.com/hot-reloading-for-chrome-extensions.html" rel="nofollow"&gt;a blog post explaining it&lt;/a&gt; (thanks to &lt;a href="https://habrahabr.ru/users/KingOfNothing/" rel="nofollow"&gt;KingOfNothing&lt;/a&gt; for the translation).&lt;/p&gt;
&lt;h2&gt;
Features&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Works by checking timestamps of files&lt;/li&gt;
&lt;li&gt;Supports nested directories&lt;/li&gt;
&lt;li&gt;Automatically disables itself in production&lt;/li&gt;
&lt;li&gt;And it's just a &lt;a href="https://github.com/xpl/crx-hotreload/blob/master/hot-reload.js"&gt;50 lines of code&lt;/a&gt;!&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
How To Use&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Drop &lt;a href="https://github.com/xpl/crx-hotreload/blob/master/hot-reload.js"&gt;&lt;code&gt;hot-reload.js&lt;/code&gt;&lt;/a&gt; to your extension's directory.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Put the following into your &lt;code&gt;manifest.json&lt;/code&gt; file:&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;background&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: { &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;scripts&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: [&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;hot-reload.js&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;] }&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Also, you can simply clone this repository and use it as a boilerplate for your extension.&lt;/p&gt;
&lt;h2&gt;
Installing From NPM&lt;/h2&gt;
&lt;p&gt;It is also available as NPM module:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;npm install crx-hotreload
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then use a &lt;code&gt;require&lt;/code&gt; (or &lt;code&gt;import&lt;/code&gt;) to execute the script.&lt;/p&gt;
&lt;/div&gt;



&lt;/div&gt;
&lt;br&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/xpl/crx-hotreload"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;br&gt;
&lt;/div&gt;
&lt;br&gt;


&lt;p&gt;We’ve used React, but these starters should work with any other framework or state management library.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Browser compatibility
&lt;/h3&gt;

&lt;p&gt;Browser compatibility is not a well-solved problem for extensions. Though we have not dug into it yet, there seems to be a lot of potential in &lt;a href="https://hacks.mozilla.org/2019/10/developing-cross-browser-extensions-with-web-ext-3-2-0/"&gt;web-ext&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Host commands
&lt;/h3&gt;

&lt;p&gt;Our javascript agent gets some of its metadata by running &lt;strong&gt;host&lt;/strong&gt; commands. But code running inside a browser extension can’t execute those, so that is not an option.&lt;/p&gt;

&lt;p&gt;Some questions then arise:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Where to get credentials from? Traces sent to the backend need an API endpoint and an API key.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How do I calculate the &lt;a href="http://www.ntp.org/"&gt;NTP&lt;/a&gt; offset of my machine, needed for precise timestamp measurements in the trace view? With distributed traces, resolution and precision on the order of microseconds or even nanoseconds matter, because the traces are often generated on different machines. Any small offset can ruin your data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In the future: how do I get the code of this specific file and line number that threw an exception?&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;There are different ways to answer these questions. For example, we can solve number 1 by logging our extension in using the browser’s cookies. This would allow the extension to send authenticated requests to our backend. It is risky, though, as it requires a &lt;a href="https://www.chromestatus.com/feature/5088147346030592"&gt;SameSite=None cookie&lt;/a&gt;. Number 2 is tricky, as there is no solution available from inside the browser: we need host rights. Number 3 could be solved the same way as number 1, but again, that is risky.&lt;/p&gt;
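
&lt;p&gt;For context on question 2, the offset in question is typically estimated NTP-style from four timestamps around a request/response round trip. A sketch of the arithmetic (timestamps in milliseconds, values illustrative):&lt;/p&gt;

```javascript
// NTP-style clock offset estimate:
// t0 = client send, t1 = server receive, t2 = server send, t3 = client receive.
function ntpOffset(t0, t1, t2, t3) {
  // How far the server clock is ahead of the client clock.
  return ((t1 - t0) + (t2 - t3)) / 2;
}

function roundTripDelay(t0, t1, t2, t3) {
  // Total network time, excluding server processing time.
  return (t3 - t0) - (t2 - t1);
}

// Example: server clock 100ms ahead, 20ms network each way, 5ms processing.
const offset = ntpOffset(1000, 1120, 1125, 1045); // 100
const delay = roundTripDelay(1000, 1120, 1125, 1045); // 40
```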

&lt;p&gt;An option that answers all 3 questions is using our Scope Native App via the &lt;a href="https://developer.chrome.com/extensions/nativeMessaging"&gt;native messaging API&lt;/a&gt;. The native app is already configured with the API endpoint of your choice and it has access to the API key, which solves question number 1. It can also run host commands, so that solves number 2 and number 3. The disadvantage is that we couple our extension with a different product, but with our current and future requirements in mind, this seems like the best possible alternative.&lt;/p&gt;

&lt;p&gt;Building browser extensions is very rewarding. The technical challenges we faced on this one were really thought-provoking, and we’re sure we will continue to face many more. You can also rest assured we’ll continue to invest and innovate in this space, as we’re confident in our ability to help bridge the gap between Dev &amp;amp; QA with tools like Scope for Chrome.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6nsUVNaq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2APyDT_bdqYdcNP7p1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6nsUVNaq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2APyDT_bdqYdcNP7p1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Testing is a core competency to build great software. But testing has failed to keep up with the fundamental shift in how we build applications. Scope gives engineering teams production-level visibility on &lt;strong&gt;every test&lt;/strong&gt; for &lt;strong&gt;every app&lt;/strong&gt; — spanning mobile, monoliths, and microservices.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Your journey to better engineering through better testing &lt;a href="https://scope.dev?utm_source=dev.to"&gt;starts with Scope&lt;/a&gt;.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>testing</category>
      <category>microservices</category>
      <category>devops</category>
    </item>
    <item>
      <title>Introduction To Observability For An iOS Developer</title>
      <dc:creator>Bryan Lee</dc:creator>
      <pubDate>Thu, 12 Dec 2019 18:53:34 +0000</pubDate>
      <link>https://dev.to/kickingthetv/introduction-to-observability-for-an-ios-developer-1icb</link>
      <guid>https://dev.to/kickingthetv/introduction-to-observability-for-an-ios-developer-1icb</guid>
      <description>&lt;p&gt;&lt;strong&gt;Written by &lt;a href="https://medium.com/@nacho_24898"&gt;Ignacio Bonafonte&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DWKpN5kp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AX1PqnQlUSHEF3dH88bndQg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DWKpN5kp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AX1PqnQlUSHEF3dH88bndQg.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Issues and failures in production
&lt;/h2&gt;

&lt;p&gt;Tracking down crashes and issues in asynchronous code is usually very hard. If your code crashes, the crash report and stack trace relate to the thread that crashed, but the context in which the crash happened is usually lost. If the app didn’t crash but misbehaved, it can be even worse, because the best information you can probably get is a log line in a hidden log file.&lt;/p&gt;

&lt;p&gt;To track these problems locally, Apple provides ActivityTracing.framework, which lets you group your application code into Activities and assign logs to those activities. ActivityTracing also lets you leave a trail of events to help you identify the path your code walked before the problem happened. This functionality is very helpful for identifying problems locally. If you are not yet using this technology in your app, take a look at Apple’s &lt;a href="https://developer.apple.com/documentation/os/activity_tracing"&gt;documentation&lt;/a&gt;; it can make your life as a developer easier.&lt;/p&gt;

&lt;p&gt;However, the usefulness of ActivityTracing is limited if your application interacts with multiple services and performs requests and receives responses asynchronously. As a developer, identifying the root cause for failure turns into a guessing game: if the failure happens in a repeatable manner, you can run the problematic code many times while monitoring the service to find it; but if the error arises in the hands of your users and is not reproducible then you lack the visibility to find a solution. Here comes Observability to the rescue.&lt;/p&gt;

&lt;h2&gt;
  
  
  Observability and distributed tracing
&lt;/h2&gt;

&lt;p&gt;Observability means being able to understand how and why an application reached its current state purely from its outputs, without modifying its current status. So, when something goes wrong in the wild, you should have all the data needed to know why it happened just by checking the output of the application and related services.&lt;/p&gt;

&lt;p&gt;It consists of several practices that must be followed in all the systems involved: monitoring, alerts, logs, and a common way to compose the information coming from your different systems into one picture.&lt;/p&gt;

&lt;p&gt;Distributed tracing is a method used to profile and monitor applications, and it is especially useful for those built on a microservices architecture. It monitors the transactions that happen between systems and reports the monitoring and log results of every system that participates in the communication to a central server, which unifies the different reports around the transaction. You could say that it works like ActivityTracing, but in a multi-system environment.&lt;/p&gt;
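&lt;p&gt;To make this concrete, here is a minimal sketch in Python of what a tracing client does under the hood (the names are illustrative, not any particular library’s API): each span records one timed operation, and child spans inherit the trace ID so a central collector can reassemble the whole transaction across systems.&lt;/p&gt;

```python
import time
import uuid


class Span:
    """A minimal span: one timed operation within a distributed trace."""

    def __init__(self, name, trace_id=None, parent_id=None):
        self.name = name
        # The trace ID is shared by every span in the transaction.
        self.trace_id = trace_id or uuid.uuid4().hex
        self.span_id = uuid.uuid4().hex
        self.parent_id = parent_id
        self.start = time.time()
        self.logs = []

    def log(self, message):
        self.logs.append((time.time(), message))

    def child(self, name):
        # A child span keeps the same trace_id, which is what lets a
        # collector stitch reports from different systems together.
        return Span(name, trace_id=self.trace_id, parent_id=self.span_id)

    def finish(self):
        self.duration = time.time() - self.start


# One logical transaction crossing two "services":
root = Span("checkout")
payment = root.child("payment-service.charge")
payment.log("card authorized")
payment.finish()
root.finish()

assert payment.trace_id == root.trace_id
assert payment.parent_id == root.span_id
```

&lt;p&gt;In a real system, the trace and span IDs would be propagated over the wire (e.g. in HTTP headers) and the finished spans reported to a collector.&lt;/p&gt;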

&lt;h2&gt;
  
  
  Why bother with microservices?
&lt;/h2&gt;

&lt;p&gt;IT and DevOps teams use distributed tracing to debug and monitor distributed software architectures, but you can also use it for your own development and debugging. When developing application functionality around an external service, the communication with that service is not always as smooth as desired: maybe you misimplemented the specification of a REST API and your application sometimes doesn’t work, maybe the server didn’t handle a corner case properly and your application receives a 500 error, or maybe the service changed under the hood and your code is no longer compatible.&lt;/p&gt;

&lt;p&gt;When your application is released and your code is running in users’ hands, if some feature doesn’t work as expected, the error reports will find their way back to you. You then have to start the debugging process to figure out why it is happening: probably trying to reproduce the issue yourself, checking the latest code added to that functionality, or checking the crash reports if you have them. Wouldn’t it be better if you could just look for the issue in the logs and see that the server returned an error because it hit an internal failure? Wouldn’t it be even better if you could hand the exact request that made the service fail to the colleague who’s writing the backend?&lt;/p&gt;

&lt;p&gt;This is what observability and distributed tracing can bring to your workflow: the complete context of every interaction your application or framework has with an external system. You can also add your existing ActivityTracing activities to get a complete picture of the code and narrow down issues even faster.&lt;/p&gt;

&lt;h2&gt;
  
  
  Meet OpenTelemetry
&lt;/h2&gt;

&lt;p&gt;Observability is mainly used in microservices environments, and each solution supports only a subset of systems and languages. Finding one that supports both your backend systems and your iOS platform may not be an easy task.&lt;/p&gt;

&lt;p&gt;OpenTelemetry is an open-source observability framework still in the works, formed through a merger of the two most popular distributed tracing standards (OpenTracing and OpenCensus). The goal of OpenTelemetry is to provide both the API and a vendor-neutral implementation so you won’t be tied to what the vendor of your solution provides for all your platforms.&lt;/p&gt;

&lt;p&gt;Right now, iOS is not among the officially supported platforms, but we at Undefined Labs are working on an open-source Swift client that will allow any iOS or macOS developer to use this technology in their products. We will publish the first alpha version of the code in the coming weeks.&lt;/p&gt;

&lt;p&gt;At &lt;a href="https://undefinedlabs.com/?utm_source=dev.to"&gt;Undefined Labs&lt;/a&gt;, we are interested in the standardization of distributed tracing in all platforms. One of our products in the works, &lt;a href="https://scope.dev?utm_source=dev.to"&gt;Scope&lt;/a&gt;, is a management and monitoring platform for all your testing needs. Scope provides these observability superpowers to all your tests, so when a test unexpectedly fails, the root cause of the failure can be easily found and fixed. Scope makes it easy for you to keep your test suites healthy and robust.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Y6IxUvDh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AcC11qx3AQwm65RAmcpV-Lg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Y6IxUvDh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AcC11qx3AQwm65RAmcpV-Lg.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Testing is a core competency to build great software. But testing has failed to keep up with the fundamental shift in how we build applications. Scope gives engineering teams production-level visibility on &lt;strong&gt;every test&lt;/strong&gt; for &lt;strong&gt;every app&lt;/strong&gt; — spanning mobile, monoliths, and microservices.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Your journey to better applications through better testing &lt;a href="https://scope.dev?utm_source=dev.to"&gt;starts with Scope&lt;/a&gt;.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ios</category>
      <category>observability</category>
      <category>testing</category>
      <category>devops</category>
    </item>
    <item>
      <title>Testing in the Cloud Native Era</title>
      <dc:creator>Bryan Lee</dc:creator>
      <pubDate>Wed, 11 Dec 2019 17:28:16 +0000</pubDate>
      <link>https://dev.to/kickingthetv/testing-in-the-cloud-native-era-5973</link>
      <guid>https://dev.to/kickingthetv/testing-in-the-cloud-native-era-5973</guid>
      <description>&lt;p&gt;&lt;strong&gt;Written by &lt;a href="https://twitter.com/fernandomayo"&gt;Fernando Mayo&lt;/a&gt;&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--w1D_ULZj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3094/1%2APpskAi_4ui7_4DMhBPKVaQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--w1D_ULZj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3094/1%2APpskAi_4ui7_4DMhBPKVaQ.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As applications are being broken down into smaller interdependent pieces and shipped at ever faster rates, we need to update our definition of software testing. We need to make sure our testing methods keep pace with how we develop, to ensure we continue shipping reliable and performant software, in a cheap and fast way.&lt;/p&gt;

&lt;h2&gt;
  
  
  You are always testing
&lt;/h2&gt;

&lt;p&gt;When you think about the “software development lifecycle”, you will probably picture something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1hWxOXvb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3094/1%2AZNsXWtuFtpnal6E1uqje6A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1hWxOXvb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3094/1%2AZNsXWtuFtpnal6E1uqje6A.png" alt="The traditional software development lifecycle"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;According to traditional wisdom, “testing” is something we do after we finish developing and before we start deploying. But in a world where monolithic applications are being broken down into smaller “services”, this traditional definition of testing no longer holds true. This is due to several factors: increasing complexity (number of deployable artifacts and APIs, independent release schedules, number of network calls, persistent stores, asynchronous communication, multiple programming languages…), higher consumption rates of third-party APIs of staggering variety, frequent deployments thanks to CI/CD pipelines, and a step-change in power when it comes to observability and monitoring tools.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--X_ovVAXW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3696/1%2AerGtTb2ss18e1w6xBwSKBw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--X_ovVAXW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3696/1%2AerGtTb2ss18e1w6xBwSKBw.png" alt="The new software development lifecycle"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You test when you run a few unit tests before pushing your code, or when CI automatically runs a suite of integration tests every night. But you also test when a product manager uses your staging environment to try out a new feature, or when you gradually send traffic to a new version of your application that was just deployed and you’re continuously monitoring for errors. You also test when you run periodic checks that drive your UI automatically to perform a synthetic transaction on your production instance. And yes, when your customers are using your application, they are helping you test it as well.&lt;/p&gt;

&lt;h2&gt;
  
  
  What do we test for?
&lt;/h2&gt;

&lt;p&gt;At a bare minimum, any service owner would want to ensure a certain level of quality of their service regarding these aspects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Correctness&lt;/strong&gt;: does it do what I want it to do without defects?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Performance&lt;/strong&gt;: does it respond to its consumers with an acceptable delay?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Robustness&lt;/strong&gt;: does it degrade gracefully when dependencies are unavailable?&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are many other non-functional aspects of your application you might want to test for depending on your application’s specific requirements (e.g. security, usability, accessibility). Any breach of expectations in any of these dimensions becomes something you want to be able to detect, troubleshoot, and fix as soon as possible, with the lowest possible effort, and have a way to prevent it from happening again in the future.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Correctness&lt;/strong&gt; is the aspect we typically associate &lt;em&gt;testing&lt;/em&gt; with. We immediately think of unit and integration test suites that ensure the application returns an expected response to a set of predefined inputs. There are also other very useful approaches to testing correctness, like &lt;a href="https://increment.com/testing/in-praise-of-property-based-testing/"&gt;property-based testing&lt;/a&gt;, fuzz testing, and &lt;a href="https://medium.com/appsflyer/tests-coverage-is-dead-long-live-mutation-testing-7fd61020330e"&gt;mutation testing&lt;/a&gt;, that can help us detect a wider range of defects in an automated way. But we should not stop there.&lt;/p&gt;
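&lt;p&gt;As a quick illustration of the property-based idea, hand-rolled in Python (real libraries generate and shrink inputs far more cleverly; &lt;code&gt;slugify&lt;/code&gt; is a made-up function under test): instead of a handful of hand-picked examples, we assert invariants over many random inputs.&lt;/p&gt;

```python
import random


def slugify(title):
    """Function under test: build a URL slug from a title."""
    return "-".join(title.lower().split())


random.seed(0)
words = ["Cloud", "Native", "Testing", "Era", "DEV"]
for _ in range(500):
    title = " ".join(random.choices(words, k=random.randint(1, 5)))
    slug = slugify(title)
    # Properties that must hold for *any* input, not just known examples:
    assert slug == slug.lower()                       # always lowercase
    assert " " not in slug                            # no spaces survive
    assert slug.split("-") == title.lower().split()   # word order preserved
```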

&lt;p&gt;Even with the most comprehensive test suite in the world, once the application goes live, a user is still going to find a defect. That’s why we should extend correctness testing into production as well, with techniques like canary deployments and feature flags. As we will discuss later, an efficient testing strategy uses multiple techniques across all environments to proactively prevent issues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance&lt;/strong&gt; is a very important quality aspect of any software application, yet we don’t actively test for it as much as we should. It is a complex endeavor to figure out which combination of data, access patterns, and environment configuration best represents “production”, and that combination is what ultimately dictates the performance of our application as seen by the end user.&lt;/p&gt;

&lt;p&gt;Benchmark testing is a good and cheap way to get early feedback about performance regressions in part of our code, and we should definitely use it as part of our strategy. But again, there’s more we can do to test the performance of our application.&lt;/p&gt;

&lt;p&gt;This is the perfect example of how expanding our definition of testing to the entire software lifecycle can help us increase software quality with less effort. Even if we invested in pre-production load and stress tests to give us an idea of the throughput of our application and to make sure we don’t introduce regressions, there’s nothing closer to production than production itself. That’s why a good performance testing strategy should also include adding the instrumentation and tools needed to detect and debug performance issues directly in production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Robustness&lt;/strong&gt; is very often overlooked, as we are biased towards testing for the “happy path”. This wasn’t much of an issue in the world of monoliths — failure modes were few and mostly well known. But in the brave new world of microservices, the number of ways our application can fail has exploded. This is also the aspect of our application that, if not properly and thoroughly tested, has the most direct impact on the end user experience.&lt;/p&gt;

&lt;p&gt;Making sure our services tolerate issues and degrade gracefully when dependencies fail is very important, and we should make sure we test for that. In this case, testing emphasis should be put on pre-production testing: failure handling code, by its nature, will not be exercised very frequently in production (if things go well), so having automated tests that programmatically simulate failure is essential.&lt;/p&gt;
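&lt;p&gt;For instance, a test can programmatically simulate a dependency being down by injecting a failing stub (a sketch; &lt;code&gt;fetch_profile&lt;/code&gt;, its URL, and its fallback are hypothetical):&lt;/p&gt;

```python
import urllib.error
from unittest import mock


def fetch_profile(http_get):
    """Return a user profile, degrading gracefully if the dependency fails."""
    try:
        return http_get("https://users.internal/profile")
    except urllib.error.URLError:
        # Fallback instead of a crash: the rest of the app keeps working.
        return {"name": "guest", "degraded": True}


# Simulate the dependency failing, without any real network involved:
failing_get = mock.Mock(side_effect=urllib.error.URLError("connection refused"))
profile = fetch_profile(failing_get)
assert profile["degraded"] is True
```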

&lt;p&gt;Investing in failure injection and &lt;a href="https://launchdarkly.com/blog/testing-in-production-the-netflix-way/"&gt;chaos engineering&lt;/a&gt; in production is another option if we consider that there are possible failures that we cannot reproduce in a controlled environment and we need to resort to testing them directly in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  The modern testing toolbox
&lt;/h2&gt;

&lt;p&gt;Software engineers are gradually becoming service owners, responsible for a specific part of an application all the way from development to production. This includes testing, but not just in the traditional sense: it starts with unit testing their code and extends to adding telemetry to effectively test in production.&lt;/p&gt;

&lt;p&gt;Just as the DevOps movement highlighted how important it is for developers to understand and be involved in the deployment and ongoing monitoring of the service they own, it is also important for them to understand the different testing techniques available to them, and to use them appropriately to increase the reliability of their application at the lowest possible cost.&lt;/p&gt;

&lt;p&gt;As we have seen earlier, some of these new techniques enable safely testing in production, if &lt;a href="https://medium.com/@copyconstruct/testing-in-production-the-safe-way-18ca102d0ef1"&gt;done right&lt;/a&gt;. For example, canary deployments allow an application to be tested in production on a small percentage of real user traffic. For this to work, the application must be built to support this kind of testing: with appropriate metrics to detect when there is an issue, logs and/or traces to troubleshoot what went wrong if the test fails, and safeguards to ensure there are no side effects on any datastore should the deployment need to be rolled back. While there are cases where these techniques are the most efficient way to test, they are complex to set up and execute, and one must understand all the prerequisites and implications of performing such tests.&lt;/p&gt;
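&lt;p&gt;The traffic-splitting half of a canary deployment can be sketched as deterministic bucketing (a simplified illustration, not a production router): each user is hashed into a stable bucket, so the same user always sees the same version.&lt;/p&gt;

```python
import hashlib


def serve_canary(user_id, percent):
    """Route a stable `percent` of users to the canary build."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return percent > bucket  # deterministic: a user never flips versions


# With a 5% canary, roughly 1 in 20 users exercises the new code:
users = [f"user-{i}" for i in range(10_000)]
share = sum(serve_canary(u, 5) for u in users) / len(users)
assert 0.08 >= share >= 0.02
```

&lt;p&gt;The other half, which the paragraph above stresses, is the instrumentation: metrics and alerts that compare the canary’s error rate against the stable version so a bad deployment is caught before a full release.&lt;/p&gt;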

&lt;p&gt;Including tools for testing in production in your toolbox will allow you to use them when they are the most efficient (cheapest) option for the feature or bug fix you want to test, since synthetically testing it earlier in the cycle might actually be more expensive (e.g. because of data requirements, or dependencies that cannot be mocked or replicated).&lt;/p&gt;

&lt;h2&gt;
  
  
  The value of testing
&lt;/h2&gt;

&lt;p&gt;In order to make sure our application is correct, performant, and robust, we have to make sure we take a holistic approach to testing and explore all the different testing options at our disposal. But how do we decide which type of test to use? It comes down to reducing &lt;em&gt;costs&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;We also know that, at some point, our application will not perform as expected, no matter how much we test. There are simply too many factors involved that we cannot anticipate: too many possible user inputs, too many states in which the application can find itself, too many dependencies outside of our control. So why test at all if we cannot avoid failure completely? And when should we stop? It comes down to managing &lt;em&gt;risk&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Testing is about &lt;em&gt;reducing the risk&lt;/em&gt; of your application performing unexpectedly, at the &lt;em&gt;lowest possible cost&lt;/em&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The costs associated with catching issues in production
&lt;/h2&gt;

&lt;p&gt;Let’s consider the cost of addressing issues in production. It comes in different forms:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Detection cost&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;How and when do I get notified if it is not working as intended?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Does the user need to notify us of the problem?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;What is the time delay between the issue being introduced in the application, and someone in the organization being alerted?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Do we have automated alerting, or do we have to actively monitor a dashboard?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Does the alerting work properly?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Do we have the right metrics?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How do we detect non-obvious issues that aren’t accompanied by a spike in latency or an increased error rate?&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Troubleshooting cost&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Once I know there is an issue, how do I know what caused it?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Did I add the appropriate instrumentation (metrics, logs, traces, exceptions) to debug the issue?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Do the metrics have the right tags and resolution to aid with debugging?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Do we have the logs, or have they been deleted because of retention policies?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Have the relevant traces for troubleshooting been sampled?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;What are the costs associated with processing and storing this information?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Do I have to reproduce the issue in another environment to find out more about it? How much time will that take?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Do I know who has the knowledge to debug it?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;When was it introduced?&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Fixing cost&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Who is the team responsible for fixing this?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Do they have the bandwidth to address the issue?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Can I roll back safely to temporarily fix the issue, or am I forced to come up with a hotfix ASAP and roll forward?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Did the issue affect any datastores or other services that now need cleaning up?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How long will the fix take to propagate through all affected environments?&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Verification cost&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Can I automate verifying the fix, or does it need manual verification?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How much time and resources does verifying the fix take?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Do I have to rely on an affected user for verification?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Can I verify all possible permutations of the issue?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Can I verify the fix without side effects on the production instance?&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;User impact cost&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Are users impacted by the issue?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If they are, how many, and for how long?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Is the business losing money?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Is the company’s brand or reputation being negatively impacted?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How many support tickets have resulted from this issue?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;What is the cost of processing and replying to these support tickets?&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5qmKl8w8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/5238/1%2ApYJBKnlUYWvlZ3u9HG5GBA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5qmKl8w8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/5238/1%2ApYJBKnlUYWvlZ3u9HG5GBA.png" alt="The code journey from the developer’s laptop to the end user"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As issues get closer to the end user of the application, they become more expensive to address. A bug detected by a unit test that a developer ran locally while working on a feature branch, for example, is the cheapest to address: the bug is detected immediately after the developer introduces it (detection cost); it is easy to debug, as the test pinpoints exactly where the problem is, along with rich debugging information (troubleshooting cost); the developer just introduced the issue, so they already have the context to quickly fix the problem (fixing cost); they can immediately verify the fix by re-running the test (verification cost); and the issue has resulted in zero user impact.&lt;/p&gt;

&lt;p&gt;On the other hand, an issue that comes up weeks after a new version has been released to users is arguably the most expensive one. Costs for detection, troubleshooting, fixing, verification, and user impact will all be at their highest.&lt;/p&gt;

&lt;p&gt;Note that I have &lt;a href="https://blog.turbinelabs.io/deploy-not-equal-release-part-one-4724bc1e726b"&gt;separated&lt;/a&gt; &lt;em&gt;production&lt;/em&gt; from &lt;em&gt;end user&lt;/em&gt;. By using canary deployments or feature flags, an issue can reach production while we control which end users, if any, are exposed to it. An issue detected after &lt;em&gt;deployment to production&lt;/em&gt; (the new code is running on production infrastructure) but before &lt;em&gt;release to all users&lt;/em&gt; (no users, or only a small fraction, are served by the new code) mitigates the user impact cost, but all of the other costs still apply.&lt;/p&gt;

&lt;p&gt;We should also note that engineering teams that are constantly interrupted to address issues, especially ones detected late in the cycle, incur an additional cost in the form of stress, which can ultimately lead to burnout. Regardless of whether engineers are on call for the services they own, unplanned work to debug and fix production issues, which forces context switching away from already planned and full sprints, has a negative effect that becomes readily apparent over time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing can be cheap, but it’s never free
&lt;/h2&gt;

&lt;p&gt;Addressing issues in production is expensive. Ideally, we want to utilize tests to catch them as cheaply and as early in the cycle as possible. But while tests can be cheap, they’re never free. Some of the associated costs include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Creation cost&lt;/strong&gt;: how much time and effort is needed to write the test, and make the system testable?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Execution cost&lt;/strong&gt;: how long does it take to actually run the test to get feedback? How many computing resources does it consume?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Maintenance cost&lt;/strong&gt;: if I change my application (refactoring, new feature, etc.), how much time and effort does it take to update the test accordingly?&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Any software development team must be on top of their testing costs and actively manage them, like any other aspect of the code they write. Reducing execution cost can be done in many ways: removing overlapping tests, making sure tests run quickly (by doing I/O only if absolutely necessary, and using mocks where possible), running only tests that cover the code that has &lt;a href="https://martinfowler.com/articles/rise-test-impact-analysis.html"&gt;changed&lt;/a&gt; (like &lt;code&gt;go test&lt;/code&gt;, &lt;code&gt;jest&lt;/code&gt; or &lt;code&gt;bazel&lt;/code&gt; do), or by “failing fast” and getting feedback before all tests finish running.&lt;/p&gt;
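&lt;p&gt;The “run only the tests that cover changed code” idea boils down to a coverage map (a toy sketch; the tools linked above derive and maintain this map automatically from previous runs):&lt;/p&gt;

```python
# Which modules each test exercises (in real tools this comes from
# coverage data collected on earlier test runs):
coverage_map = {
    "test_billing": {"billing.py", "tax.py"},
    "test_auth": {"auth.py"},
    "test_search": {"search.py", "index.py"},
}


def select_tests(changed_files):
    """Pick only the tests whose covered modules intersect the change."""
    changed = set(changed_files)
    return sorted(t for t, mods in coverage_map.items() if mods.intersection(changed))


assert select_tests(["tax.py"]) == ["test_billing"]
assert select_tests(["auth.py", "index.py"]) == ["test_auth", "test_search"]
assert select_tests(["README.md"]) == []
```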

&lt;p&gt;Flaky tests (defined as tests that both pass and fail with the same codebase) are especially costly, as they don’t just introduce noise and distraction, and the need for retries — they decrease developer confidence in the system, and will either slow down the workflow (the build needs to be green to continue), or increase risk (we know the tests failing are flaky — so let’s continue anyway). Flakiness should be &lt;a href="https://blogs.dropbox.com/tech/2019/05/athena-our-automated-build-health-management-system/"&gt;measured&lt;/a&gt; and reduced to a minimum. Some tests will need to perform I/O operations and will inherently have some degree of flakiness — in these cases, adding retries to the test or making the I/O operations more resilient to transient failures, can be the most efficient way to tackle them, as rewriting them to completely remove flakiness might be much more expensive, or even impossible. Techniques like mocking dependencies by &lt;a href="https://github.com/vcr/vcr"&gt;recording and replaying HTTP traffic&lt;/a&gt; (a kind of &lt;em&gt;snapshot testing&lt;/em&gt; but for integration tests), can also help reduce flakiness and speed up testing.&lt;/p&gt;
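&lt;p&gt;For a test whose flakiness comes from inherently unreliable I/O, a bounded retry can be as small as a decorator (a sketch; many test frameworks offer equivalent plugins). Note that it only masks &lt;em&gt;transient&lt;/em&gt; failures: a genuine bug fails on every attempt and still surfaces.&lt;/p&gt;

```python
import functools
import time


def retry(times=3, delay=0.0, exceptions=(OSError,)):
    """Retry a flaky, I/O-bound callable a bounded number of times."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, times + 1):
                try:
                    return fn(*args, **kwargs)
                except exceptions:
                    if attempt == times:
                        raise  # out of attempts: a real failure surfaces
                    time.sleep(delay)
        return wrapper
    return decorator


calls = {"n": 0}


@retry(times=3)
def flaky_check():
    calls["n"] += 1
    if calls["n"] == 1:  # first attempt hits a simulated transient failure
        raise OSError("connection reset")
    return "ok"


assert flaky_check() == "ok"
assert calls["n"] == 2  # one transient failure, one success
```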

&lt;h2&gt;
  
  
  Strategies to reduce testing cost
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MpXiRe5I--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/4112/1%2ALcjq7Lex2NGwtwFQz3pHAg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MpXiRe5I--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/4112/1%2ALcjq7Lex2NGwtwFQz3pHAg.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The execution cost of tests can be reduced by having an effective strategy for running them. For example, by running only fast unit and integration tests locally while working on a bug fix or new feature, we shorten the time to feedback for the developer. Then, after the code is pushed to the central repository, CI can kick off a round of more in-depth integration tests. The master branch can have a nightly run of longer and more expensive system or end-to-end tests, and so on.&lt;/p&gt;

&lt;p&gt;The idea is to balance execution cost against time to feedback. Reducing execution cost can increase troubleshooting and verification cost: by running tests less often, we create “gaps” in the history of test executions that, when a test breaks, leave a larger search space for the actual culprit of the failure. The less frequent the test executions, the wider the gaps in the history, as depicted in the graph above.&lt;/p&gt;

&lt;p&gt;Because testing is all about reducing risk, one must balance risk appetite with the cost of testing. Even within the same application, not all parts of the application will need the same amount of testing. For every scenario you want to test (e.g. a new feature, a bug fix, or a new dependency failure handler), try to think about how you can test it at the lowest cost possible. Would a simple unit test be sufficient? Do I need to test it against a real instance of a dependency? Or is the most efficient way to test it to add proper instrumentation and alerting, and do a canary deployment in production?&lt;/p&gt;

&lt;p&gt;This is something that must be evaluated by the developer and potentially the greater team, on a case by case basis. It highlights the importance of the developer being familiar with the entire range of testing methodologies available to them in order to choose the most cost-effective one.&lt;/p&gt;

&lt;h2&gt;
  
  
  How do we test what we don’t know can fail?
&lt;/h2&gt;

&lt;p&gt;As we have seen, testing allows us to detect, debug and fix issues in a cheaper manner than waiting for them to be surfaced by a user. But, how do we test for things that we are not aware of that can fail? How do we test for the &lt;em&gt;unknown unknowns&lt;/em&gt;?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--71kP4LfW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3718/1%2ApmrwcrlESwlNihh0TOX2rg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--71kP4LfW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3718/1%2ApmrwcrlESwlNihh0TOX2rg.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our objective should be to bring as much information as possible into the &lt;em&gt;known knowns&lt;/em&gt; quadrant: the facts about our service. When we change something in our application, previous known knowns are invalidated. Testing helps us move from &lt;em&gt;known unknowns&lt;/em&gt; to &lt;em&gt;known knowns&lt;/em&gt;, i.e. we know the questions to ask, and by executing the test, we get the answer. We can do this in an automatic, quick, and very efficient way. For example: “does this function return what I expect when I provide these specific arguments?”, “does my service return a 400 when the user sends an invalid payload?”, or “does the UI show an error message when a user tries to log in with an invalid password?”&lt;/p&gt;
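&lt;p&gt;Each of those questions can be encoded as a cheap automated check; the handler below is hypothetical, but the shape of the test is the point:&lt;/p&gt;

```python
def create_user(payload):
    """Hypothetical handler: returns (status_code, body)."""
    email = payload.get("email")
    if not isinstance(email, str) or "@" not in email:
        return 400, {"error": "invalid email"}
    return 201, {"email": email}


# "Does my service return a 400 when the user sends an invalid payload?"
assert create_user({"email": "ada@example.com"})[0] == 201
assert create_user({"email": "not-an-email"})[0] == 400
assert create_user({})[0] == 400
```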

&lt;p&gt;&lt;em&gt;Unknown unknowns&lt;/em&gt; are issues we didn’t anticipate because we didn’t even know they could happen. We can’t test for them: by definition, we don’t know what can fail until it does. In this case, good instrumentation and tooling in production will allow us to debug (and sometimes detect) new issues we couldn’t anticipate, but at a high cost. If the root cause turns out to be one that could come up again in the future (and not just a transitory operational issue), it’s always a good idea to write the &lt;em&gt;cheapest&lt;/em&gt; test possible for it, to avoid regressions and bring it into the &lt;em&gt;known knowns&lt;/em&gt; quadrant for future versions of the software, which will save us precious engineering time.&lt;/p&gt;

&lt;h2&gt;
  
  
  How much testing is enough?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FQ9oVGln--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3200/0%2ASfv1ianxBI4537SD" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FQ9oVGln--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3200/0%2ASfv1ianxBI4537SD" alt="The perfect application does not exist"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We know testing will never be able to tell us that our application is 100% reliable, as testing is about managing risk. That’s why traditional test coverage is not a good measure of quality or a target an engineering team should focus on. As we have seen, unit testing is just one of the many techniques we should be using to test our services — and code coverage measures only that. It can tell us how extensive our unit tests are, but not whether we are building a high-quality service.&lt;/p&gt;

&lt;p&gt;What can we use instead? We instinctively know that if we don’t test our application, its quality won’t meet our users’ standards. We also know that we could keep investing in testing forever and never reach perfection. There is a compromise somewhere in the middle, but how do we measure it?&lt;/p&gt;

&lt;p&gt;The performance (latency) and robustness (availability) aspects of your application should already be measured and monitored, with corresponding &lt;a href="https://landing.google.com/sre/sre-book/chapters/service-level-objectives/"&gt;SLOs&lt;/a&gt;. SLOs provide a target you should strive for in these dimensions, and testing should support hitting those goals, which depend on the application’s requirements. Critical services will have very aggressive SLOs, thus requiring a high level of investment in testing; non-critical services will have more relaxed requirements. Only by directly linking the testing budget to objective targets like SLOs will teams have the right incentives to decide how much risk they want to remove.&lt;/p&gt;
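&lt;p&gt;To make this concrete, here is a minimal sketch (our illustration, not from any team’s tooling) of how an availability SLO translates into a risk budget: each extra “nine” in the target shrinks the error budget tenfold, which is why aggressive SLOs justify a much larger testing investment.&lt;/p&gt;

```python
# Error budget implied by an availability SLO: the fraction of a
# window in which the service is allowed to be unavailable.

def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability per window for a given SLO."""
    return (1 - slo) * window_days * 24 * 60

# Each extra "nine" shrinks the budget tenfold.
for slo in (0.99, 0.999, 0.9999):
    print(f"{slo:.2%} SLO: {error_budget_minutes(slo):.1f} minutes/month")
```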

&lt;p&gt;The correctness aspect is harder to measure directly, but equally important. Your application might be extremely reliable and performant, yet your users might be unhappy because your application is just not doing what it’s supposed to do. A pragmatic approach to continuously measure correctness could be to have a target on the rate of new high-priority defects in production. Just like an availability SLO, that number can be a good proxy for whether defects are slipping through to end users too often, and guide the team to adjust their testing efforts accordingly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Testing has always helped us build applications that are more maintainable, debuggable, reliable and performant, and allowed us to ship faster and with more confidence. But as applications have become more and more complex and dynamic, new types of failure modes have been introduced which are increasingly more difficult to anticipate and troubleshoot. In order to be able to proactively and efficiently detect, debug and fix them, we should review and adapt how we use traditional testing techniques and embrace new ones that apply to all stages of the development lifecycle.&lt;/p&gt;

&lt;p&gt;Only then can testing return to being the invaluable ally it once was in delivering high-quality software.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Y6IxUvDh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AcC11qx3AQwm65RAmcpV-Lg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Y6IxUvDh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AcC11qx3AQwm65RAmcpV-Lg.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Testing is a core competency to build great software. But testing has failed to keep up with the fundamental shift in how we build applications. Scope gives engineering teams production-level visibility on &lt;strong&gt;every test&lt;/strong&gt; for &lt;strong&gt;every app&lt;/strong&gt; — spanning mobile, monoliths, and microservices.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Your journey to better engineering through better testing &lt;a href="https://scope.dev?utm_source=dev.to"&gt;starts with Scope&lt;/a&gt;.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>testing</category>
      <category>microservices</category>
      <category>cloudnative</category>
      <category>devops</category>
    </item>
    <item>
      <title>We Have A Flaky Test Problem</title>
      <dc:creator>Bryan Lee</dc:creator>
      <pubDate>Mon, 09 Dec 2019 16:43:50 +0000</pubDate>
      <link>https://dev.to/kickingthetv/we-have-a-flaky-test-problem-11ol</link>
      <guid>https://dev.to/kickingthetv/we-have-a-flaky-test-problem-11ol</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zP7SyMWw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2A1a6GmPIParXcVAejiwo3Gg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zP7SyMWw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2A1a6GmPIParXcVAejiwo3Gg.png" alt="A developer hitting retry on their CI build"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Flaky tests are insidious. Fighting flakiness can sometimes feel like trying to fight entropy; you know it’s a losing battle, and it’s a battle you must engage in again and again and again.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://testing.googleblog.com/2016/05/flaky-tests-at-google-and-how-we.html"&gt;Google’s definition&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We define a “flaky” test result as a test that exhibits both a passing and a failing result with the same code.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;At Undefined Labs, we’ve had a chance to talk to dozens of engineering organizations of all different sizes, ranging from 3-person startups to Fortune 500 companies. We listen and gather feedback around all things “test.”&lt;/p&gt;

&lt;p&gt;When we talk about the frustrations and major issues organizations encounter with testing, inevitably, flaky tests will always come up. And when we get to this part of the discussion, the people we’re talking to will have a visible shift in demeanor, wearing the expression of someone trying to put out more fires than they have water.&lt;/p&gt;

&lt;p&gt;We, too, once wore this same expression. While we were at Docker, the co-founders of Undefined Labs and I were spread across different products, from enterprise on-premise solutions and CLI developer tools to various SaaS (Software as a Service) solutions. We also got to see the work of our co-workers on various open-source projects like Docker Engine, Docker Swarm, and Docker Compose.&lt;/p&gt;

&lt;p&gt;It was this experience and the frustrations with testing in general that led to the creation of Undefined Labs and our first product, &lt;a href="https://scope.dev/"&gt;Scope&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Throughout our talks with these various engineering organizations, we’ve heard about all kinds of different solutions to tackle flakiness, with varying success. We noticed that the organizations that were best able to cope with flakiness had dedicated teams ready to create best practices, custom tools, and workflows to deal with flakiness.&lt;/p&gt;

&lt;p&gt;But not all teams had such lavish resources to throw at the problem without worrying about efficiency. We saw some of these teams hack together workflows out of existing tooling and scripts. And some teams did nothing at all; they threw their hands up and succumbed to the torrent of flaky tests.&lt;/p&gt;

&lt;p&gt;I think it’s paramount to have a plan in place to address flaky tests. Flaky tests are bad, but they’re even worse than you think.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why flaky tests are even worse than you think (is that even possible?)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--h1wWZ5tJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/4512/0%2Ao9MuWmXo7Xj-8kjK" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--h1wWZ5tJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/4512/0%2Ao9MuWmXo7Xj-8kjK" alt="A visual depiction of how developers feel while debugging a flaky test. Photo by [Elizaveta Korabelnikova](https://unsplash.com/@korabelnikova?utm_source=medium&amp;amp;utm_medium=referral) on [Unsplash](https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral)"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Testing plays a significant role in the productivity of engineers.
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://testing.googleblog.com/2016/05/flaky-tests-at-google-and-how-we.html"&gt;Google on developer productivity&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Productivity for developers at Google relies on the ability of the tests to find real problems with the code being changed or developed in a timely and reliable fashion.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The keywords here are “real problems,” “timely,” and “reliable fashion,” all of which seem to point a big fat finger directly at the consequences of flaky tests.&lt;/p&gt;

&lt;p&gt;When tests behave as expected, they’re a boon to productivity. But as soon as tests can’t find real problems, deliver results too slowly, or can’t be trusted, testing turns into one of the most miserable time-sucks known to modern humanity.&lt;/p&gt;

&lt;h3&gt;
  
  
  Flaky tests love collateral damage
&lt;/h3&gt;

&lt;p&gt;Not only do flaky tests hurt your own productivity, but there is a cascading loss of productivity experienced by everyone upstream.&lt;/p&gt;

&lt;p&gt;When master is broken, everything comes to a screeching halt.&lt;/p&gt;

&lt;p&gt;We’ve seen some of the highest performing engineering organizations implement various strategies to mitigate this collateral damage. They gate and prevent flaky tests from ever making it to master and/or have a zero-tolerance policy for flaky tests; once they’ve been identified, they’re quarantined until fixed.&lt;/p&gt;

&lt;p&gt;Other tests, and even the entire test suite, can also be collateral damage. Test flakiness left unabated can completely ruin the value of an entire test suite.&lt;/p&gt;

&lt;p&gt;There are even greater implications of flaky tests: the second- and third-order consequence is that, left unchecked, they spread apathy. We’ve talked to organizations whose codebases reached 50%+ flaky tests; developers there hardly ever write tests anymore and don’t bother looking at the results. Testing is no longer a useful tool for improving code quality within those organizations.&lt;/p&gt;

&lt;p&gt;Ultimately, it will be the end-users of your product that bear the brunt of this cost. You’ve essentially outsourced all testing to your users and have accepted the consequences of adopting the most expensive testing strategy as your only strategy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Increases Costs
&lt;/h3&gt;

&lt;p&gt;As mentioned by Fernando, our CTO, in his blog post &lt;a href="https://medium.com/scopedev/testing-in-the-cloud-native-era-41f63a0e101b"&gt;Testing in the Cloud Native Era&lt;/a&gt;, tests have an associated cost attached to them. And when it comes to flaky tests, there are hidden costs, and the cost is generally increased across the board: creation cost, execution cost, fixing cost, business cost, and psychological cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Creation cost:&lt;/strong&gt; this includes both the time and effort needed to write the test, as well as making the system testable. A flaky test requires you to re-visit this step more often than you would like, to either fix the test or make the system more testable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Execution cost:&lt;/strong&gt; if you’re interested in trying to generate signal from your flaky tests without necessarily fixing them, you can execute a test more than once.&lt;/p&gt;

&lt;p&gt;Additional executions can be manual — we’ve all hit the “retry build” button, with the hope that this time, things will be different. We’ve also seen some teams leverage testing frameworks that allow for automatic retries for failing tests.&lt;/p&gt;

&lt;p&gt;Execution cost can also potentially manifest itself as requiring a platform team to help keep the pipeline unblocked and moving at a fast enough pace to service the entire organization. Your team needs both infrastructure and scaling expertise if you want to reach high levels of execution.&lt;/p&gt;

&lt;p&gt;There’s also the cost of the infrastructure required to run your tests, and perhaps the most valuable resource of all, your time. Increased executions mean more time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fixing cost:&lt;/strong&gt; debugging and fixing flaky tests can potentially take hours of your workweek. Some of the most frustrating parts about flaky tests are reproducing them and determining the cause for flakiness.&lt;/p&gt;

&lt;p&gt;Fixing a flaky test also demands expertise and familiarity with the code and/or test. Junior developers brought in to work on a legacy codebase with many flaky tests will certainly require the oversight of a more senior developer who has spent enough time to build up sufficient context to fix these tests.&lt;/p&gt;

&lt;p&gt;This is all made worse if you have dependencies (it’s another team’s fault), run tests in parallel (good luck finding only your specific test’s logs), or have long feedback cycles (builds that take hours with results only available after the entire build is finished).&lt;/p&gt;

&lt;p&gt;And in the worst cases, you lack the information and visibility into your systems necessary to fix the test, or the fixing cost can be too high to be worth paying.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Business cost:&lt;/strong&gt; Flakiness consumes the time of developers investigating them, and developers represent one of the most expensive and scarce resources for any business. Fixing flaky tests adds accidental complexity, which ultimately leads to more of the developer’s time being taken away from working on new features.&lt;/p&gt;

&lt;p&gt;Other parts of the business will also be impacted due to potential delays in project releases. If the same number of development cycles are required to release a new product, but now every development cycle takes longer, products will be released late, impacting marketing, sales, customer success, and business development.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Psychological cost:&lt;/strong&gt; responsibility without the knowledge, tooling, and systems in place to actually carry out that responsibility is a great way to set someone up for failure and cause psychological stress to your developers.&lt;/p&gt;

&lt;p&gt;And what’s more, flaky tests will force you to undergo this cost cycle of a test more than just once while trying to remove or mitigate the flakiness. A great test can have just one up-front creation cost, minimal execution cost (because you trust the first signal it gives you), and very little maintenance cost. Every time a test flakes, it will require you to re-absorb the costs of the test.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reduces Trust, Leads to an Unhealthy Culture, and Hurts Job Satisfaction
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5sh7Zs_---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AacBnfY68spLEZDd1FR4iMg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5sh7Zs_---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AacBnfY68spLEZDd1FR4iMg.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When it comes to job satisfaction for engineers, I think it’s safe to say we all prefer working in an organization with a healthy culture. While we may each have our own definition of what a healthy culture looks like, having trust is a major factor. If you don’t trust your organization, your team, or your co-workers, you’ll most likely be looking for the exit sometime soon.&lt;/p&gt;

&lt;p&gt;If your tests seemingly pass and fail on a whim, it’s only a matter of time before trust is eroded. And once trust is eroded, it’s very difficult to build back up. I also think that trust works from the bottom up. You need a solid foundation in which trust can begin to gain traction.&lt;/p&gt;

&lt;p&gt;Organizations that try to increase trust among co-workers from the top down usually sow even more distrust. If management is telling me I need to trust my co-workers, then they must be untrustworthy; otherwise, why would they even bring this up?&lt;/p&gt;

&lt;p&gt;When it comes to engineering organizations, a major factor in that foundational level of trust often starts with testing. The testing strategy and general attitude towards testing speak volumes about an engineering organization’s culture.&lt;/p&gt;

&lt;p&gt;If you can’t trust your tests and testing processes, then everything else built on top of tests will slowly crumble. It may not happen quickly, but cracks will begin to form, and morale will suffer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2yofRIT0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2470/1%2APOpkBM_2PgnOR7LbkxOgew.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2yofRIT0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2470/1%2APOpkBM_2PgnOR7LbkxOgew.png" alt="Trust must be built from the bottom up"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is why the prevalence of flaky tests, and what you do when they occur, matters so much. Flakiness left unguarded will destroy the trust people have in their tests.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;If there is rampant, inadequately addressed flakiness in your tests, then you can’t trust the tests.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If you can’t trust the tests, then you can’t trust the code.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If you can’t trust the code, then you can’t trust developers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If you can’t trust developers, then you can’t trust your team.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If you can’t trust your team, then you can’t trust the organization.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If you can’t trust your organization, then there’s obviously a lack of trust in the culture, which means you work at a company with an unhealthy culture.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If your company has an unhealthy culture, your job satisfaction will steadily decline.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There will be a tipping point where you transition from thinking about your organization as an organization &lt;em&gt;that has flaky tests,&lt;/em&gt; to &lt;em&gt;the kind of organization&lt;/em&gt; that has flaky tests. Once you make the mental shift and believe it’s &lt;em&gt;because of the organization&lt;/em&gt; that there are so many flaky tests, trust in the organization has been eroded.&lt;/p&gt;

&lt;p&gt;With job satisfaction continuing to plummet, it’s at this time that your top engineering talent will begin looking elsewhere, actively searching, or maybe just a little more willing to open the emails and messages from recruiters that they had been previously ignoring.&lt;/p&gt;

&lt;p&gt;The saving grace of this situation is that this degradation happens over time. With the right strategies and tools, you can mitigate and even reverse the damage. But it’s not easy, and you can’t do it alone.&lt;/p&gt;

&lt;h3&gt;
  
  
  Flaky tests aren’t going away
&lt;/h3&gt;

&lt;p&gt;Flakiness is only getting worse, not better. As your codebase and test suites grow, so too will the number of flaky tests and results. As you transition, or if you’re already using a microservice architecture, you can have many dependencies. As dependencies increase, flakiness is magnified.&lt;/p&gt;

&lt;p&gt;For example, even if all of your microservices have 99.9% stability, if you have 20+ dependencies each with the same stability, you actually end up having a non-trivial amount of flakiness:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;98.02% stability for 20 dependencies&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;97.04% stability for 30 dependencies&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;95.12% stability for 50 dependencies&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
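&lt;p&gt;The compounding above is just the multiplication of independent probabilities: for the whole chain to behave, every dependency must behave at once. A small sketch:&lt;/p&gt;

```python
# Combined stability of a service whose N dependencies are each 99.9%
# stable: the probabilities multiply, so stability decays with N.

def combined_stability(per_dep: float, n_deps: int) -> float:
    """Probability that all n_deps dependencies behave at once."""
    return per_dep ** n_deps

for n in (20, 30, 50):
    print(f"{n} dependencies: {combined_stability(0.999, n):.2%}")
```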

&lt;p&gt;As your engineering organization scales, how it addresses flakiness will be one of the most important factors impacting overall productivity.&lt;/p&gt;

&lt;p&gt;Google released a presentation, &lt;a href="https://ai.google/research/pubs/pub45880"&gt;The State of Continuous Integration Testing @Google&lt;/a&gt;, in which they collected a large sample of internal test results over a one-month period and uncovered some interesting insights:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;84% of test transitions from Pass -&amp;gt; Fail were from flaky tests&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Only 1.23% of tests ever found a breakage&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Almost 16% of their 4.2 million tests have some level of flakiness&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Flaky failures frequently block and delay releases&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;They spend between 2–16% of their compute resources re-running flaky tests&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Google concluded:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Testing systems must be able to deal with a certain level of flakiness&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  In order to address flaky tests, you need systems thinking
&lt;/h2&gt;

&lt;p&gt;In Google’s conclusion, they said that testing &lt;strong&gt;systems&lt;/strong&gt; must be able to deal with a certain level of flakiness, not teams or engineers. Looking from the bottom up, from the perspective of an individual, will blind you to the larger universal patterns at work and how they must change.&lt;/p&gt;

&lt;p&gt;To understand why no one person can do it alone, I like to turn towards systems thinking. The power of systems thinking comes when you shift your perspective of the world away from a linear one, and towards a circular one. This reveals a world in which there is a much richer and complex interconnectedness between seemingly everything.&lt;/p&gt;

&lt;p&gt;While seeing things as they truly are can be eye-opening, it can also be a little daunting. Most of the common ways we know to affect change have little leverage.&lt;/p&gt;

&lt;p&gt;Donella Meadows in her book &lt;em&gt;&lt;a href="https://www.amazon.com/dp/B005VSRFEA/ref=dp-kindle-redirect?_encoding=UTF8&amp;amp;btkr=1"&gt;Thinking in Systems: A Primer&lt;/a&gt;&lt;/em&gt;, described the different ways one could influence a system. Here’s a great graphic that shows them all stack-ranked:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kGC-BgtJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2226/1%2AKmsIKH1O2ExG4FvSpWbAXg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kGC-BgtJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2226/1%2AKmsIKH1O2ExG4FvSpWbAXg.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What’s important to take away from this graphic is that the capacity of an individual is limited to the least powerful system interventions. This is why teams or organizations with notoriously high turnover rates continue to have high turnover rates, even as new individuals are placed within the system. By and large, teams don’t hire bad employees who turn over quickly; there are only bad teams with high turnover rates.&lt;/p&gt;

&lt;p&gt;If you want a high-performing culture and an organization that attracts the best engineers, you’re going to need a system in place that reinforces best practices, rewards behavior you want repeated, has all of the right feedback loops (to the right people, in the right context, in the right time frame), and has redundancy built in to ensure the system is resilient.&lt;/p&gt;

&lt;p&gt;A CI/CD pipeline, the team that manages it, and the various development teams that depend on it are all part of a greater system. Whether it was designed or grew organically, it is a system.&lt;/p&gt;

&lt;p&gt;So when it comes to handling flakiness effectively and with resilience, what are the common system patterns implemented by the best engineering organizations?&lt;/p&gt;

&lt;h2&gt;
  
  
  Common patterns found in systems successfully addressing flaky tests
&lt;/h2&gt;

&lt;p&gt;There were a handful of clear patterns that seemed to crop up in every instance of teams that successfully handled flaky tests:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Identification of flaky tests&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Critical workflow ignores flaky tests&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Timely flaky test alerts routed to the right team or individual&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Flaky tests are fixed fast&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A public report of the flaky tests&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Dashboard to track progress&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Advanced: stability/reliability engine&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Advanced: quarantine workflow&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s go over each of these a little bit more in-depth…&lt;/p&gt;

&lt;h3&gt;
  
  
  Identification of flaky tests
&lt;/h3&gt;

&lt;p&gt;The first step in dealing with flaky tests is knowing which tests are flaky and how flaky they are.&lt;/p&gt;

&lt;p&gt;It also helps to have your tests exist as first-class citizens, which will allow you to keep track of and identify flaky tests over time and across commits. This enables tagging tests as flaky, which kicks off all of the other patterns listed below.&lt;/p&gt;

&lt;p&gt;When working with flaky tests, it can also be very useful to have test flakiness propagate throughout your system, so you can filter your test list or results by “flaky”, manually mark tests as “flaky”, or remove the “flaky” tag when it’s no longer appropriate.&lt;/p&gt;

&lt;p&gt;We’ve also seen different organizations have particular dimensions of flakiness that were important to them: test flakiness per commit, flakiness across commits, test flakiness agnostic to the commit and only tied to the test, and flakiness across devices &amp;amp; configurations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test flakiness per commit:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Re-run failing tests multiple times using a test framework or by initiating multiple builds&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If the test shows both a passing and failing status, then this “test &amp;amp; commit” pairing is deemed flaky&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Test flakiness agnostic to the commit and only tied to the test&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;A test is tagged as flaky as soon as it exhibits flaky behavior&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Exhibiting flaky behavior may occur in a single commit, or aggregated across multiple commits (e.g. you never rebuild or rerun tests but still want to identify flaky tests)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The test will stay tagged as flaky until it has been deemed fixed&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The prior history and instances of flakiness will stay as metadata of the test&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;This can help with the verification of a test failure, to determine if it’s broken or likely a flake, i.e. if this test was previously flaky but hasn’t exhibited flakiness in the past two builds, this recent failure means it’s still likely flaky and was never fixed&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Test flakiness across devices &amp;amp; configurations&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Re-run a single test multiple times across many device types and/or whatever configurations are most important to you i.e. iPhone 11 Pro vs iPhone XS, iOS 12 vs iOS 13, or Python 2.7 vs Python 3.0&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A test can be flaky for a specific device or configuration, i.e. the test both passes and fails on the same device&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A test can be flaky across devices and configurations, i.e. a test passes on one device, but fails on another device&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The most common pattern we saw was teams rerunning failed tests anywhere from 3 to 200 times. A test was then labeled flaky either in a binary fashion (e.g. at least one fail and at least one pass) or via a threshold and flakiness score (e.g. the test must fail more than 5 out of 100 retries).&lt;/p&gt;
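&lt;p&gt;As a hypothetical sketch (the function names and the default threshold are ours, not any team’s actual tooling), the two labeling schemes look like this:&lt;/p&gt;

```python
# Two ways to label a test flaky from rerun results on the same commit.
# `results` holds the outcome of each rerun (True = passed).

def is_flaky_binary(results: list) -> bool:
    """Binary scheme: flaky if reruns on the same code disagree."""
    return len(set(results)) > 1

def is_flaky_threshold(results: list, max_failures: int = 5) -> bool:
    """Threshold scheme: flaky only if failures exceed a tolerated
    count (e.g. more than 5 of 100 retries) while at least one rerun
    still passes; a test that always fails is broken, not flaky."""
    failures = results.count(False)
    return failures > max_failures and results.count(True) > 0
```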

&lt;p&gt;&lt;strong&gt;Identifying flaky tests based on a single execution&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There are techniques and tools available (including the product I work on, &lt;a href="https://scope.dev/"&gt;Scope&lt;/a&gt;) that can help identify flaky tests, only requiring a single execution. Here’s a brief summary of how it works if you’ve just recently pushed a new commit and a test fails:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;You need access to the source code and the commit diff&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You need access to the test path, which is the code covered by the test that failed&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You can then cross-reference the two to identify whether the commit introduced a change to the code covered in this particular test’s test path&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If there was a change to the code in the test path, it’s likely that this is a test failure. And the reason for the test failure is likely at the intersection of the commit diff and the test path&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If there was no change to the code in the test path, it’s a likely sign the test is flaky&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
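&lt;p&gt;A minimal sketch of that cross-referencing step (the data structures here are assumptions; real tools derive them from the VCS and per-test code coverage):&lt;/p&gt;

```python
# Classify a single test failure by intersecting the commit diff with
# the failing test's "test path" (the code the test executes).
# Each element is a (filename, line_number) pair.

def classify_failure(changed_lines: set, test_path: set) -> str:
    overlap = changed_lines.intersection(test_path)
    if overlap:
        # The commit touched code this test executes: likely a real
        # failure, and the overlap points at the probable cause.
        return "likely real failure"
    # The failing test never executes the changed code: likely flaky.
    return "likely flaky"
```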

&lt;h3&gt;
  
  
  Critical workflow ignores flaky tests
&lt;/h3&gt;

&lt;p&gt;Once the test is identified as flaky, the results from this test are ignored when it comes to your critical workflow. We oftentimes see organizations set up multiple workflows, one dedicated to the initial testing of PRs and one dedicated to master.&lt;/p&gt;

&lt;p&gt;By catching flaky tests before they make it to master, you can then choose to ignore their results or quarantine them, only allowing them to run in specific pipelines, if any.&lt;/p&gt;
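&lt;p&gt;One hedged sketch of such a gate (the quarantine list and test names are hypothetical): quarantined tests may still run and report elsewhere, but their failures never block the critical pipeline.&lt;/p&gt;

```python
# Gate a build on test results, ignoring tests known to be flaky.

QUARANTINED = {"test_checkout_retries", "test_login_timeout"}  # hypothetical names

def gate_build(results: dict) -> bool:
    """Return True if the build may merge to master.

    `results` maps test name to pass/fail. Only non-quarantined
    failures block; quarantined tests report elsewhere instead of
    gating the merge.
    """
    blocking_failures = [name for name, passed in results.items()
                         if not passed and name not in QUARANTINED]
    return not blocking_failures
```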

&lt;h3&gt;
  
  
  Timely flaky test alerts routed to the right team or individual
&lt;/h3&gt;

&lt;p&gt;In order for a system to properly function and remain resilient, it needs feedback loops. Not just any feedback loop; these loops need to be timely, convey the right information and context, be delivered to the right actor, and this actor must be able to take the right action using the information and context delivered.&lt;/p&gt;

&lt;p&gt;The pieces when assembled:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Notification once the flaky behavior is identified (through email, Slack, JIRA ticket, GitHub issue, etc.)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Notification must be sent to the party who will be responsible for fixing this flaky test (test author, commit author, service owner, service team, etc.)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The notification must contain or point to the relevant information required to begin fixing the test&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
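&lt;p&gt;Assembled as code, the routing logic above might look like this minimal sketch (the field names and fallback rule are assumptions, not a real notification API):&lt;/p&gt;

```python
def build_flaky_alert(test_name, commit_author, owners, context_links):
    """Assemble a flaky-test notification (hypothetical schema)."""
    # Route to the test's owner if one is registered; otherwise fall back
    # to the author of the commit that surfaced the flaky behavior.
    recipient = owners.get(test_name, commit_author)
    return {
        "to": recipient,
        "subject": f"Flaky test detected: {test_name}",
        # Point at the context needed to begin fixing: traces, logs, history.
        "links": context_links,
    }
```

&lt;p&gt;The same payload can then be delivered over whichever channel the team uses: email, Slack, a JIRA ticket, or a GitHub issue.&lt;/p&gt;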

&lt;h3&gt;
  
  
  Flaky tests are fixed fast
&lt;/h3&gt;

&lt;p&gt;Most of these organizations place a high priority on fixing flaky tests. The teams we saw usually fixed flaky tests within the week. The longest turnaround we came across that still proved fairly effective was a month.&lt;/p&gt;

&lt;p&gt;The most important takeaway, though, is that whenever the time elapsed to fix a flaky test surpassed the explicit or implicit organizational threshold, the test was almost always forgotten and never fixed.&lt;/p&gt;

&lt;p&gt;In order to fix these tests, varying levels of tools and workflows were set up for developers to begin debugging. These systems always had some smattering of the following: traces, logs, exceptions, historical data points, diff analysis between the current failing and last passing execution, build info, commit info, commit diff, test path, and prior trends &amp;amp; analysis.&lt;/p&gt;

&lt;p&gt;Essentially, the more visibility and the more context you can provide to your developers for the specific test in question, while at the same time removing any noise from irrelevant tests, the better.&lt;/p&gt;

&lt;h3&gt;
  
  
  A public report of the flaky tests
&lt;/h3&gt;

&lt;p&gt;This can take many different forms, but the gist of this pattern is to make public the status of the flaky tests identified and the flaky tests fixed, with an emphasis on the flaky tests identified but not yet fixed.&lt;/p&gt;

&lt;p&gt;Some teams had this information available via a Slack bot, along with the “time since identified,” or would post the list in the team channel every Monday. A couple of organizations even surfaced these flaky tests within the dashboards used to display team performance metrics.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dashboard to track progress
&lt;/h3&gt;

&lt;p&gt;In addition to seeing the current state of flaky tests in your system, most organizations have a way to track progress over time. If your goal is to reduce flakiness, how will you know if you’ve actually hit your goal? How do you know if things are getting better and not worse?&lt;/p&gt;

&lt;p&gt;This high-level system feedback is necessary to make adjustments to the overall system and help identify patterns that aren’t working, or are broken and need fixing.&lt;/p&gt;

&lt;p&gt;At its most basic, this is just a timeline of your builds and the build statuses.&lt;/p&gt;

&lt;p&gt;In more advanced versions, you might get every test execution from every test, the ability to filter by flaky tests, multiple builds &amp;amp; commits worth of information, and the date each of these results were captured.&lt;/p&gt;
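&lt;p&gt;To make “getting better” measurable, a minimal trend metric might be computed like this (the input shape is a hypothetical export of per-execution results, one pair per test execution):&lt;/p&gt;

```python
from collections import Counter

def flaky_rate_by_week(executions):
    """executions: iterable of (iso_week, was_flaky) pairs, one per test execution."""
    totals, flaky = Counter(), Counter()
    for week, was_flaky in executions:
        totals[week] += 1
        if was_flaky:
            flaky[week] += 1
    # Fraction of executions that exhibited flaky behavior, per week.
    return {week: flaky[week] / totals[week] for week in sorted(totals)}
```

&lt;p&gt;Plotting this rate week over week is enough to tell whether flakiness is trending up or down against your goal.&lt;/p&gt;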

&lt;p&gt;Here’s a screenshot of &lt;a href="https://scope.dev/"&gt;Scope&lt;/a&gt;, viewing the Insights for a particular service:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--42qGQ7Cj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2276/1%2AfAsNA86s_kjztLX85XwUfA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--42qGQ7Cj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2276/1%2AfAsNA86s_kjztLX85XwUfA.png" alt="Filtering for tests that have had a flaky result in the past 30 commits, and viewing the aggregated test results for a given test &amp;amp; commit pair"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Advanced: stability/reliability engine
&lt;/h3&gt;

&lt;p&gt;This is an advanced use case, which we’ve only seen used at some of the biggest tech companies (Netflix, Dropbox, Microsoft, &amp;amp; Google), but which we think could be useful for any organization trying to deal with flaky tests.&lt;/p&gt;

&lt;p&gt;The general idea is to have two different testing workflows, one for the critical path, and one for the non-critical path. The primary objective of the stability engine is to keep the critical path green. This is done by creating a gating mechanism with specific rules around the definition of “stable tests” and “unstable tests”.&lt;/p&gt;

&lt;p&gt;A test is deemed “unstable” until proven “stable.” Every new test or “fixed” test is submitted to the stability engine, which exercises the test in different ways depending on your definition of flaky and ultimately determines whether the test is stable and, if unstable, how unstable.&lt;/p&gt;

&lt;p&gt;Unstable tests are either quarantined and never run in your critical path, or the test results of these unstable tests are just ignored.&lt;/p&gt;

&lt;p&gt;All stable tests are now included in your “stable” tests list and will run in your critical path.&lt;/p&gt;

&lt;p&gt;Any test deemed “unstable” must be remediated, and once fixed, re-submitted to the stability engine.&lt;/p&gt;

&lt;p&gt;For example, this may be as simple as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;For any new PR, for all &lt;strong&gt;new&lt;/strong&gt; and &lt;strong&gt;fixed&lt;/strong&gt; tests (tests that were previously unstable), execute each test 20 times&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If the test passes at least once, but fewer than 20 times, the test is marked as &lt;strong&gt;unstable&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fixed&lt;/strong&gt; tests that are still &lt;strong&gt;unstable&lt;/strong&gt; remain quarantined and their test results are ignored&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If there are any &lt;strong&gt;new&lt;/strong&gt; &lt;strong&gt;unstable&lt;/strong&gt; tests in the PR, the PR is prevented from being merged&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
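&lt;p&gt;The gating rules above can be sketched in a few lines (the input shapes are assumptions for illustration):&lt;/p&gt;

```python
def review_pr(new_tests, fixed_tests, pass_counts, runs=20):
    """Sketch of a stability-engine gate for one PR.

    pass_counts maps each test name to how many of its `runs` executions passed.
    """
    def unstable(test):
        # Anything short of passing every execution is treated as unstable.
        return pass_counts[test] != runs

    blocking = [t for t in new_tests if unstable(t)]
    return {
        # Fixed-but-still-unstable tests stay quarantined; their results are ignored.
        "still_quarantined": [t for t in fixed_tests if unstable(t)],
        # Any new unstable test prevents the PR from being merged.
        "blocking": blocking,
        "merge_allowed": not blocking,
    }
```

&lt;p&gt;A fixed test that passed 17 of 20 runs would remain quarantined, while a new test with the same record would block the merge outright.&lt;/p&gt;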

&lt;h3&gt;
  
  
  Advanced: quarantine workflow
&lt;/h3&gt;

&lt;p&gt;The most basic version of the quarantine workflow is to simply mark the test results of flaky tests as “ignored” in your critical path.&lt;/p&gt;

&lt;p&gt;However, we’ve seen some interesting workflows implemented by savvy companies. A quarantine workflow makes it easy for developers to follow all of the patterns listed above and helps keep your critical path green.&lt;/p&gt;

&lt;p&gt;For example, this is the workflow we’re currently working on for &lt;a href="https://scope.dev/"&gt;Scope&lt;/a&gt; (each of these steps is optional):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Flaky test identified&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The flaky test is added to the Quarantine List and skipped during testing&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;JIRA/GitHub issue is created&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Automatically remove the test from Quarantine List when the issue is closed&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ping commit author on Slack&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Remind author of their Quarantined test(s), every week&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ability to view Quarantine List&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ability to manually remove tests from the Quarantine List&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
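&lt;p&gt;The core of such a workflow is small; here is a minimal sketch (class and method names are hypothetical, not Scope’s implementation):&lt;/p&gt;

```python
class QuarantineList:
    """Minimal sketch of a quarantine workflow for flaky tests."""

    def __init__(self):
        self.entries = {}  # test name mapped to its tracking-issue id

    def quarantine(self, test, issue_id):
        # Flaky test identified: skip it in CI and link its tracking issue.
        self.entries[test] = issue_id

    def should_skip(self, test):
        # The test runner consults this before executing each test.
        return test in self.entries

    def on_issue_closed(self, issue_id):
        # Issue closed (test presumably fixed): automatically release the test.
        self.entries = {t: i for t, i in self.entries.items() if i != issue_id}
```

&lt;p&gt;Notifications, weekly reminders, and a viewer for the list can all be built on top of this one piece of state.&lt;/p&gt;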

&lt;p&gt;A proper quarantine workflow helps ensure that the collateral damage of a flaky test is minimized, and that the party responsible for fixing the test is properly notified.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;One of the best techniques to ensure high code quality is to use tests. Unfortunately, as applications become more complex and codebases grow larger, test flakiness will begin to rear its ugly head more often. How your organization handles flakiness will be a major factor in defining your engineering culture.&lt;/p&gt;

&lt;p&gt;Ultimately, no single person can fix flakiness alone; each is just one actor within a larger system. Systems thinking is required, and there are useful patterns already being implemented by many of the highest-performing organizations.&lt;/p&gt;

&lt;p&gt;A peaceful co-existence with flaky tests is available to anyone willing to invest in the tools and processes necessary.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=FrBN94gUn_I&amp;amp;t=1s"&gt;Netflix Automation Talks: Test Automation at Scale&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://testing.googleblog.com/2016/05/flaky-tests-at-google-and-how-we.html"&gt;Flaky Tests at Google and How We Mitigate Them&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://ai.google/research/pubs/pub45794"&gt;Who Broke the Build? Automatically Identifying Changes That Induce Test Failures In Continuous Integration at Google Scale&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://storage.googleapis.com/pub-tools-public-publication-data/pdf/45880.pdf"&gt;The State of Continuous Integration Testing @Google&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.deflaker.org/get-rid-of-your-flakes/"&gt;DeFlaker Tool&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://docs.microsoft.com/en-us/azure/devops/learn/devops-at-microsoft/eliminating-flaky-tests"&gt;Microsoft — Eliminating Flaky Tests&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://blogs.dropbox.com/tech/2019/05/athena-our-automated-build-health-management-system/"&gt;Dropbox — Athena: Our Automated Build Health Management System&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://arxiv.org/pdf/1907.01466.pdf"&gt;Understanding Flaky Tests: The Developer’s Perspective&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Y6IxUvDh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AcC11qx3AQwm65RAmcpV-Lg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Y6IxUvDh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AcC11qx3AQwm65RAmcpV-Lg.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Testing is a core competency to build great software. But testing has failed to keep up with the fundamental shift in how we build applications. Scope gives engineering teams production-level visibility on &lt;strong&gt;every test&lt;/strong&gt; for &lt;strong&gt;every app&lt;/strong&gt; — spanning mobile, monoliths, and microservices.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Your journey to better engineering through better testing &lt;a href="https://scope.dev?utm_source=dev.to"&gt;starts with Scope&lt;/a&gt;.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>testing</category>
      <category>flakytests</category>
      <category>microservices</category>
      <category>distributedsystems</category>
    </item>
  </channel>
</rss>
