<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: roesslerj</title>
    <description>The latest articles on DEV Community by roesslerj (@roesslerj).</description>
    <link>https://dev.to/roesslerj</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F27192%2F7407fd5b-61c9-4fcf-b58b-c72b674b71a9.jpeg</url>
      <title>DEV Community: roesslerj</title>
      <link>https://dev.to/roesslerj</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/roesslerj"/>
    <language>en</language>
    <item>
      <title>Your basic Allowlist-testing Options</title>
      <dc:creator>roesslerj</dc:creator>
      <pubDate>Wed, 11 Sep 2019 09:29:03 +0000</pubDate>
      <link>https://dev.to/roesslerj/your-basic-allowlist-testing-options-4c3</link>
      <guid>https://dev.to/roesslerj/your-basic-allowlist-testing-options-4c3</guid>
      <description>&lt;p&gt;I am big fan of allowlist-testing. I honestly think allowlist-testing is the future of testing for testing any interface. For those unfamiliar with the term: where typical assertion-based testing is denylist-testing (denying specified changes, ignores all else), allowlist-testing guards against all changes, except for the changes you specified as irrelevant, i.e. allowlisted. Other names for that technique are whitelist/blacklist testing (now &lt;a href="https://www.clockwork.com/news/creating-inclusive-naming-conventions-in-technology/"&gt;dismissed for being racist&lt;/a&gt;), difference testing, snapshot testing, Golden Master testing, approval testing or characterization testing.&lt;/p&gt;

&lt;p&gt;I wanted to come up with an executable demo of a tool with allowlist-testing capabilities. It should let you focus on the concept and play around with it to get a first impression, with the least amount of overhead possible. Such a demo can be used in workshops and at conferences, and it can be tweeted and otherwise easily shared. For that, there are three simple requirements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The tool should be hassle-free and simple to install and play around with, producing results with as little overhead and as few prerequisites in terms of knowledge and technology as possible.&lt;/li&gt;
&lt;li&gt;The tool should demo the many advantages of allowlist-testing.&lt;/li&gt;
&lt;li&gt;It should be vendor-agnostic, i.e. the tool should ideally be open source and as little commercial as possible.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This list of requirements is short and relatively straightforward, and there are many implementations of allowlist-testing. Yet in my opinion, candidates fulfilling those criteria are hard to come by.&lt;/p&gt;

&lt;h1&gt;Hassle-free and simple to try&lt;/h1&gt;

&lt;p&gt;That the tool should be hassle-free and simple to try disqualifies quite a bunch of interesting candidates. Most open source frameworks are test frameworks that are used by developers during development, which means that you need to set up a project environment with some build mechanism for test execution.&lt;/p&gt;

&lt;h3&gt;Jest&lt;/h3&gt;

&lt;p&gt;Being used in over 1,000,000 public repos on GitHub, Jest (&lt;a href="https://github.com/facebook/jest"&gt;https://github.com/facebook/jest&lt;/a&gt;) is a very popular example of allowlist-testing, if not the most popular one. However, it only works with JavaScript in a Node, React or Angular project environment. And setting up yarn or Node.js together with any such project just to execute some tests for a short demo is definitely neither simple nor hassle-free.&lt;/p&gt;

&lt;h3&gt;Approval Tests&lt;/h3&gt;

&lt;p&gt;The same applies to another great example of allowlist-testing: Approval Tests (&lt;a href="https://approvaltests.com/"&gt;https://approvaltests.com/&lt;/a&gt;). Approval Tests is available as a library on almost any platform (Java, C++, Node.js, Python, …). But it mainly focuses on technical interfaces, and it works great with XML and other technical formats. The barriers to using it are definitely much lower than with Jest: you only need a test that can be executed on any of the supported platforms, e.g. in Java. It is open source and free to use, so Approval Tests also fulfills the other criteria.&lt;/p&gt;

&lt;h3&gt;TextTest&lt;/h3&gt;

&lt;p&gt;Another great tool is TextTest (&lt;a href="http://texttest.sourceforge.net"&gt;http://texttest.sourceforge.net&lt;/a&gt;), which runs on Python and can be combined with a number of other platforms, including Java Swing, SWT and Tkinter. To set it up, you need to install Python and more, depending on the platform of your choice. However, it is very much geared towards tests written in a domain language. For a more visual test of e.g. a GUI (custom or web), it needs additional tooling to drive the GUI. For Python and Java GUIs there is StoryText, which is specifically designed to work with TextTest.&lt;/p&gt;

&lt;h3&gt;recheck-web&lt;/h3&gt;

&lt;p&gt;recheck-web (&lt;a href="https://github.com/retest/recheck-web"&gt;https://github.com/retest/recheck-web&lt;/a&gt;) is open source and comes with a Chrome extension that makes it easy to try. A Chrome extension is both simple to install and simple to remove. In order for the Chrome extension to run without any additional setup cost, it sends the data to retest.org. To guard your sensitive test data, you need to create a free account before trying it. The detailed results are in a proprietary format that you need an open source CLI or a free GUI to open. The GUI comes as a self-contained ZIP file.&lt;/p&gt;

&lt;h1&gt;Demo the Advantages of Allowlist-testing&lt;/h1&gt;

&lt;p&gt;Although pixel comparison is a form of Golden Master testing, a tool must implement some mechanism to allowlist changes (i.e. not notify the user when they occur) in order to also count as an allowlist-testing tool. This is important so the tool can be used in test automation on a regular basis, without reporting (too many) false positives. Many tools fail at this, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Appraise (&lt;a href="https://github.com/AppraiseQA/appraise"&gt;https://github.com/AppraiseQA/appraise&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Screenster.io (&lt;a href="https://screenster.io"&gt;https://screenster.io&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Diff.io (&lt;a href="https://diff.io"&gt;https://diff.io&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Shutter (&lt;a href="https://shutter.sh"&gt;https://shutter.sh&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the long term, mere pixel comparison is of limited value. A good tool should let you mass-accept a change, since allowlist-testing often creates redundancies, and should provide convenient ignore options.&lt;/p&gt;
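&lt;p&gt;What such an ignore mechanism boils down to can be sketched in a few lines, using nested lists as stand-in “screenshots” (plain Python; an illustration of the concept, not the API of any of the tools above):&lt;/p&gt;

```python
# Hypothetical sketch: pixel comparison with allowlisted (ignored) regions.
def pixel_diff(golden, current, ignore_regions=()):
    """Return coordinates of pixels that differ and are not inside an ignored region."""
    def ignored(x, y):
        # a region is (x0, y0, x1, y1), inclusive on both ends
        return any(x in range(x0, x1 + 1) and y in range(y0, y1 + 1)
                   for (x0, y0, x1, y1) in ignore_regions)
    return [(x, y)
            for y, row in enumerate(golden)
            for x, pixel in enumerate(row)
            if pixel != current[y][x] and not ignored(x, y)]

golden  = [[0, 0, 0],
           [0, 1, 0]]
current = [[0, 0, 9],   # changed "pixel" at (2, 0)
           [0, 1, 0]]

# The change is reported ...
assert pixel_diff(golden, current) == [(2, 0)]
# ... until the volatile region is allowlisted, and the run is green again.
assert pixel_diff(golden, current, ignore_regions=[(2, 0, 2, 0)]) == []
```

This is exactly the capability the four tools listed above lack: without a way to persistently allowlist a region, every irrelevant change keeps failing the test.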

&lt;h1&gt;Be Open Source and not Commercial&lt;/h1&gt;

&lt;p&gt;The remaining tools are commercial; none of them offers an open source version of its testing tool:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Screener.io (&lt;a href="https://screener.io"&gt;https://screener.io&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Applitools (&lt;a href="https://applitools.com"&gt;https://applitools.com&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Percy (&lt;a href="https://percy.io"&gt;https://percy.io&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Mabl (&lt;a href="https://www.mabl.com/visual-change-detection"&gt;https://www.mabl.com/visual-change-detection&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;Summary&lt;/h1&gt;

&lt;p&gt;So, what do you choose when you want to demonstrate allowlist-testing to a bunch of testers with diverse backgrounds (i.e. who are used to working on different platforms)? Do you happen to know any other tool option we forgot?&lt;/p&gt;

</description>
      <category>visualregressiontesting</category>
      <category>webtesting</category>
      <category>testautomation</category>
      <category>regressiontesting</category>
    </item>
    <item>
      <title>Your 2 basic visual regression testing options</title>
      <dc:creator>roesslerj</dc:creator>
      <pubDate>Fri, 23 Aug 2019 13:42:25 +0000</pubDate>
      <link>https://dev.to/roesslerj/your-2-basic-visual-regression-testing-options-13jg</link>
      <guid>https://dev.to/roesslerj/your-2-basic-visual-regression-testing-options-13jg</guid>
      <description>&lt;h1&gt;
  
  
  Pixel-comparison based visual regression testing
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1O1X4HUJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/n18x59qw8j4yxtbvnsto.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1O1X4HUJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/n18x59qw8j4yxtbvnsto.png" alt="Visual bug on Netflix"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When you see that Netflix has had a simple visual bug on their website for &lt;a href="https://twitter.com/beatngu1101/status/1126438865525514240/"&gt;over three months&lt;/a&gt; now (still live as of August 2019, visit &lt;a href="http://devices.netflix.com/en/"&gt;http://devices.netflix.com/en/&lt;/a&gt;), the trend towards visual regression testing of websites is understandable. This approach guards you against unexpected changes (for which writing assertions is impossible) and is much more complete than assertion-based testing. Most current approaches are pixel-based, meaning that they compare screenshots of the pages pixel by pixel. This makes a lot of sense:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The first version of a pure pixel-diffing tool is easy to implement.&lt;/li&gt;
&lt;li&gt;It works for any browser, app or other situation, as long as a screenshot can be retrieved.&lt;/li&gt;
&lt;li&gt;It gives instant results.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, there are also some downsides to pixel-based visual regression testing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Similar changes cannot easily be recognized: e.g. if the header or footer of the site changes, this affects all tests.&lt;/li&gt;
&lt;li&gt;You usually get a lot of false positives, as even a small change can cause many elements to change, e.g. their position.&lt;/li&gt;
&lt;li&gt;Filtering these differences is tricky: a filter that is too lax still reports irrelevant changes (false positives), while one that is too aggressive misses important changes (false negatives).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the example below, you can see all of this play out. The demoed tool uses an AI algorithm to filter artifacts out of the differences. As you can see, it produces both false negatives – the added colon after “Password” is missed – and false positives – the “Remember Me” checkbox did not change, it was merely moved to the left as a whole.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BYLZlDDM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/l06s4kudjl30kgcqlzba.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BYLZlDDM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/l06s4kudjl30kgcqlzba.png" alt="Pixel-Diff example"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;Deep visual regression testing&lt;/h1&gt;

&lt;p&gt;recheck-web goes a different route and compares all rendered elements and their respective CSS attributes. So instead of reporting that the pages differ in pixels, leaving a human to review and interpret the difference, recheck reports the exact way in which they differ:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DDLDDWmr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/io9yjey03del0oa8ut71.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DDLDDWmr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/io9yjey03del0oa8ut71.png" alt="Login example as reported by recheck"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, as the text of the button changed from “Sign in” to “Log in”, the type of the element changed from &lt;code&gt;a&lt;/code&gt; (a link) to &lt;code&gt;button&lt;/code&gt;. Also, the class changed from &lt;code&gt;btn-primary&lt;/code&gt; to &lt;code&gt;btn-secondary&lt;/code&gt;. All other changes to the button are probably a result of those two changes. The changes in the labels (added colons) were correctly reported.&lt;/p&gt;

&lt;p&gt;Since these are now semantic changes in contrast to pixel differences, it is easy to add rules and filters for handling them. Both the review GUI and CLI come with predefined filters. For instance, for the given situation, you could choose to ignore all invisible differences (class, type, background color, etc.). You could also choose to ignore all changes to any CSS style attributes, focusing only on relevant content changes (i.e. text).&lt;/p&gt;
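&lt;p&gt;The idea of filtering semantic differences can be sketched in a few lines (plain Python; this illustrates the concept, it is not recheck’s actual API):&lt;/p&gt;

```python
# Hypothetical sketch: elements are dicts of rendered attributes,
# and filter predicates decide which differences get reported.
def diff_element(golden, current, filters=()):
    """Compare two attribute dicts; drop differences matched by any filter."""
    changes = {attr: (golden[attr], current[attr])
               for attr in golden
               if golden[attr] != current[attr]}
    return {attr: change for attr, change in changes.items()
            if not any(accept(attr) for accept in filters)}

golden  = {"tag": "a",      "class": "btn-primary",   "text": "Sign in"}
current = {"tag": "button", "class": "btn-secondary", "text": "Log in"}

# Without filters, every semantic change is reported.
assert set(diff_element(golden, current)) == {"tag", "class", "text"}

# Focus on content only: filter out the invisible attributes.
def invisible(attr):
    return attr in {"tag", "class"}

assert diff_element(golden, current, filters=[invisible]) == {"text": ("Sign in", "Log in")}
```

Because the differences are named attributes rather than anonymous pixels, a filter is just a predicate over attribute names, which is what makes predefined and custom filters cheap to combine.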

&lt;h1&gt;Filtering changes&lt;/h1&gt;

&lt;p&gt;In the case of the changed login button, filtering all of these would show only the changes below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YOtEQxwm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/ct5qnb550a2f74tyweyg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YOtEQxwm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/ct5qnb550a2f74tyweyg.png" alt="Login example, filtered"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This powerful mechanism lets you quickly focus on the changes that are relevant to you. It works for CSS animations (try pixel-diffing that: &lt;a href="http://www.csszengarden.com/215/"&gt;http://www.csszengarden.com/215/&lt;/a&gt;) as well as for websites that are completely different in layout, but not in content. With this mechanism, you can easily ignore font but not text, or color but not size.&lt;br&gt;
It lets you see where these two differ in content:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uSWXCQem--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/p28w5cu1er2bjhrh1qgh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uSWXCQem--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/p28w5cu1er2bjhrh1qgh.png" alt="Budget differences, filtered"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Even better: you can create your own filters using a simple Git-like syntax. For instance, you can filter specific elements (e.g. of tag meta) with &lt;code&gt;matcher: type=meta&lt;/code&gt;. To filter an attribute globally (e.g. the class attribute), use &lt;code&gt;attribute=class&lt;/code&gt;. To ignore attributes of specific elements (e.g. alt of images), use &lt;code&gt;matcher: type=img, attribute: alt&lt;/code&gt;. You can also use regexes for both elements and attributes: &lt;code&gt;attribute-regex=data-.*&lt;/code&gt;.&lt;/p&gt;
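&lt;p&gt;Putting those rules together, a filter file could look like the fragment below (the grouping and comments are hypothetical; the individual rule lines are exactly the ones quoted above):&lt;/p&gt;

```text
# ignore all meta elements
matcher: type=meta
# ignore the class attribute globally
attribute=class
# ignore the alt attribute of images
matcher: type=img, attribute: alt
# ignore all data-* attributes, via regex
attribute-regex=data-.*
```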

&lt;p&gt;More details and examples can be found in the &lt;a href="https://docs.retest.de/recheck/how-ignore-works/"&gt;documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Use recheck in your own automated tests (&lt;a href="https://retest.de/recheck-open-source/"&gt;https://retest.de/recheck-open-source/&lt;/a&gt;) or demo it using the Chrome Extension (&lt;a href="https://retest.de/recheck-web-chrome-extension/"&gt;https://retest.de/recheck-web-chrome-extension/&lt;/a&gt;).&lt;/p&gt;

</description>
      <category>testing</category>
      <category>visualregressiontesting</category>
      <category>webtesting</category>
    </item>
    <item>
      <title>Assertions considered Harmful</title>
      <dc:creator>roesslerj</dc:creator>
      <pubDate>Wed, 08 Nov 2017 21:48:21 +0000</pubDate>
      <link>https://dev.to/roesslerj/assertions-considered-harmful-6pf</link>
      <guid>https://dev.to/roesslerj/assertions-considered-harmful-6pf</guid>
      <description>&lt;p&gt;Assertions are the go-to checking mechanism in unit tests. However, when applied to testing interfaces, specifically GUIs, I consider them to be toxic. Thankfully, there is a promising alternative.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://junit.org" rel="noopener noreferrer"&gt;JUnit&lt;/a&gt; was a huge success, being the &lt;a href="http://blog.takipi.com/the-top-100-java-libraries-in-2016-after-analyzing-47251-dependencies/" rel="noopener noreferrer"&gt;single most used library&lt;/a&gt; in all of Java. And JUnit brought with it the famous &lt;code&gt;Assert.assert&lt;/code&gt;... statement. This mechanism is designed to only check one thing at a time in isolation. And when testing a single unit, this is the most sensible approach: we want to ignore as much volatile context as possible. And we want to focus on ideally checking only a single aspect of only the unit under test. This creates maximally durable tests. If a test depends only on a single aspect of the code, then it only needs to change &lt;em&gt;if that aspect changes&lt;/em&gt;. Assertions are a natural and intuitive mechanism to achieve that. Being “inside of the software during the test, where practically all of the internals are exposed in one way or another, all else wouldn’t have made sense.&lt;/p&gt;

&lt;p&gt;Because of its success, JUnit is considered the state of the art of test automation – and rightfully so. As such, its mechanisms were also applied to non-unit testing, i.e. to interface testing (e.g. GUI testing). And intuitively, this makes sense: as the individual features stack up towards the interface, the interface becomes very volatile. Testing only individual aspects of the system seems to solve this problem.&lt;/p&gt;

&lt;p&gt;Except that it doesn’t. It is already hard, albeit still feasible, to achieve that degree of separation at the unit level. At the interface level, where integration is inevitable, it is outright impossible. And practice shows exactly that: one of the reasons for the shape of the famous test pyramid is that tests at that level tend to break often and require a lot of maintenance effort.&lt;/p&gt;

&lt;h2&gt;A practical example&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fbv91ddjdkiph461fuf5b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fbv91ddjdkiph461fuf5b.png" alt="Unit test"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Imagine that you want to test a single aspect of the code–the calculation of the number of items a single user has ever bought. On the unit level, all you need is a user object and some associated items or transactions. Depending on the complexity of the system, you can create these objects either on demand or mock them. Then you can test just the code that counts the items.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F98790rcqjhufrvhm0cpe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F98790rcqjhufrvhm0cpe.png" alt="GUI test"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, at the GUI level, you first need to log into the system as an existing user. Then you need to navigate to a certain page where the relevant information is shown. So even if you create only a single &lt;code&gt;assertion&lt;/code&gt; to check the number of items, your test still depends on a working persistence layer, on a predefined state (e.g. the user existing with the correct number of items), on the user being able to log in, and on the navigation. How well is this test isolated?&lt;/p&gt;

&lt;p&gt;In an integrated test, it is basically impossible to ignore context. Involuntarily, we always depend on numerous aspects that have nothing to do with what we want to test. We suffer from the multiplication of effects. This is the reason for the famous &lt;a href="https://martinfowler.com/bliki/TestPyramid.html" rel="noopener noreferrer"&gt;test-pyramid&lt;/a&gt;. However, if we cannot ignore context, maybe we should embrace it instead?&lt;/p&gt;

&lt;h1&gt;Embrace Context&lt;/h1&gt;

&lt;p&gt;Imagine, just for a second, we could somehow mitigate the multiplication of effects. Then we could check the complete state of the system instead of individual aspects. We could check everything at once!&lt;/p&gt;

&lt;p&gt;So because interfaces are fragile, we now want to include more context, making our tests even more fragile? Because instead of depending on single aspects, the test now depends on everything at once? Who would want that? Well ... everybody who wants to know if the interface changed. If you think about it, the same question applies to version control. A version control system is a system in which, every time you change &lt;em&gt;anything&lt;/em&gt; in &lt;em&gt;any file&lt;/em&gt;, you have to &lt;em&gt;manually&lt;/em&gt; approve that change. What a multiplication of effort! What a waste of time! Except that &lt;em&gt;not&lt;/em&gt; using one is a very bad idea.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F8ul8eh6ywdrwj65246u5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F8ul8eh6ywdrwj65246u5.png" alt="Automated test"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That is because people change things all the time without meaning to, and they change the behaviour of the system without meaning to. Which is why we have regression tests in the first place. But sometimes we really &lt;em&gt;wanted&lt;/em&gt; to change the behaviour. Then we have to update the regression test. Actually, regression tests are a lot like version control.&lt;/p&gt;

&lt;p&gt;With the mindset that software changes all the time, an assertion is just a means to detect a single such change. So writing assertions is like blacklisting changes. The alternative is to check everything at once, and then permanently ignore individual changes – effectively whitelisting them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fh85fwsj8vdugvgx9ll99.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fh85fwsj8vdugvgx9ll99.png" alt="Whitelisting vs blacklisting"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When creating a firewall configuration, which approach would you rather choose: blacklisting (i.e. “closing”) individual ports or whitelisting (i.e. “opening”) individual ports? Likewise with testing ... do you want to detect a change and later recognise that it isn’t problematic, or would you rather ignore all changes except the ones for which you manually created checks? Google introduced whitelist testing because they didn’t want to miss &lt;a href="https://www.youtube.com/watch?v=UMnZiTL0tUc&amp;amp;feature=youtu.be&amp;amp;t=325" rel="noopener noreferrer"&gt;the dancing pony on the screen&lt;/a&gt; &lt;em&gt;again&lt;/em&gt;. Whitelisting means to err on the side of caution.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fa6nu9mllh54172wl90jg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fa6nu9mllh54172wl90jg.png" alt="Pixel-diff tools"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Of course I am not the first one to come up with that idea. In his book &lt;a href="https://www.amazon.com/Working-Effectively-Legacy-Michael-Feathers/dp/0131177052" rel="noopener noreferrer"&gt;Working Effectively with Legacy Code&lt;/a&gt;, Michael Feathers called this approach &lt;em&gt;characterization testing&lt;/em&gt;, others call it &lt;em&gt;Golden Master testing&lt;/em&gt;. Today, there are two possibilities: pixel-based and text-based comparison. Because pixel-based comparison (often called visual regression testing) is easy to implement, there are &lt;a href="https://visualregressiontesting.com/tools.html" rel="noopener noreferrer"&gt;many tools&lt;/a&gt;. For text-based comparison, there are essentially two specific testing tools: &lt;a href="http://approvaltests.com/" rel="noopener noreferrer"&gt;ApprovalTests&lt;/a&gt; and &lt;a href="http://texttest.sourceforge.net/" rel="noopener noreferrer"&gt;TextTest&lt;/a&gt;. But both pixel-based and text-based approaches suffer from the multiplication of effects.&lt;/p&gt;

&lt;h1&gt;Multiplication of Effects&lt;/h1&gt;

&lt;p&gt;On the GUI level, many things depend on one another, because isolation is not really possible. Imagine you wrote automated tests naively, as a series of actions. Then, if someone changed the navigation or the login screen, this single change would most likely affect each and every test. This way, the implicit or explicit dependencies of the tests potentially cause a multiplication of the effects of a single change.&lt;/p&gt;

&lt;p&gt;How can we contain that multiplication of effects? One possibility is to create an additional layer of abstraction, as is done by &lt;a href="https://martinfowler.com/bliki/PageObject.html" rel="noopener noreferrer"&gt;page objects&lt;/a&gt; or &lt;a href="http://www.seleniumhq.org/docs/06_test_design_considerations.jsp#user-interface-mapping" rel="noopener noreferrer"&gt;object maps&lt;/a&gt;. But in order to later reap the fruits in the form of reduced effort when the anticipated change happens, this requires manual effort in advance. According to &lt;a href="https://en.wikipedia.org/wiki/You_aren%27t_gonna_need_it" rel="noopener noreferrer"&gt;YAGNI&lt;/a&gt;, implementing that abstraction “just in case” is actually a bad thing to do.&lt;/p&gt;

&lt;p&gt;What other possibilities do we have to contain the multiplication of effects? When refactoring code, we find ourselves in the same situation: one method may be called in dozens or even hundreds of places. So when renaming a single method (please only do that in internal, non-exposed APIs), we also need to change every place where that method is called. In some cases we can derive these places from the abstract syntax tree. In other cases (properties files, documentation, ...) we have to rely on text-based search and replace. If we forget or overlook something, this often shows only in certain situations – usually when executing the software. But for tests, this is different, because tests, by definition, are already executing the software. So all the places where something changed are shown to us (by failing tests). Now we just need a mechanism to “mass-apply” similar changes.&lt;/p&gt;

&lt;p&gt;There are two different kind of changes: &lt;em&gt;differences in layout&lt;/em&gt; and &lt;em&gt;differences in flow&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;Differences in Layout&lt;/h2&gt;

&lt;p&gt;If, for instance, the login button is now called “sign in”, or has a different internal name, XPath or xy-coordinates, this is a difference in layout. Differences in layout are relatively easy to address with object maps.&lt;/p&gt;

&lt;p&gt;But, surprisingly, differences in layout are also relatively easy to address if we have more context. If we know the whole puzzle instead of only individual pieces, we can create one-on-one assignments. This makes for very robust object recognition.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F9qzfjwo0r4jkrb8a6p9g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F9qzfjwo0r4jkrb8a6p9g.png" alt="One-on-one assignments"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Imagine we have a form to which some elements are added, and we want to recognize the “Accept” button that submits the form. Even if everything about the button changes, we can still recognize it, based on a one-on-one assignment of the remaining &lt;em&gt;unused&lt;/em&gt; UI components.&lt;/p&gt;

&lt;p&gt;And mass-applying these changes is easy as well: we can just apply every similar change at once, e.g. combine all instances of the change from “Accept” to “Save” into a single change that needs to be reviewed only once.&lt;/p&gt;

&lt;p&gt;With such a strong mechanism, redundancy is suddenly not a problem anymore. So we can collect many attributes of our UI components, making our recognition of them even &lt;em&gt;more&lt;/em&gt; robust.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F7i8msrbozqycuc045m4d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F7i8msrbozqycuc045m4d.png" alt="Redundant UI information"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So we can gather XPath, name, label and pixel coordinates. If some of the values change, we still have the remaining values to identify the element. And mass-applying keeps this easy to maintain.&lt;/p&gt;
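&lt;p&gt;A sketch of how redundant attributes make recognition robust (plain Python, all names hypothetical): pick the on-screen candidate that agrees with the recorded element on the most attribute values, so that no single changed value breaks the match:&lt;/p&gt;

```python
# Hypothetical sketch of attribute-based element recognition.
def best_match(target, candidates):
    """Return the candidate sharing the most attribute values with target."""
    def score(candidate):
        return sum(1 for attr, value in target.items()
                   if candidate.get(attr) == value)
    return max(candidates, key=score)

recorded = {"name": "accept", "label": "Accept", "xpath": "/form/button[1]"}
on_screen = [
    {"name": "cancel", "label": "Cancel", "xpath": "/form/button[2]"},
    # label and xpath changed, but the remaining attribute still identifies it:
    {"name": "accept", "label": "Save",   "xpath": "/form/div/button[1]"},
]

assert best_match(recorded, on_screen)["label"] == "Save"
```

A real implementation would additionally enforce the one-on-one assignment described above (each recorded element claims at most one candidate), which is what keeps recognition stable even when several similar elements change at once.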

&lt;h2&gt;Differences in Flow&lt;/h2&gt;

&lt;p&gt;Sometimes, the use cases or internal processes of the software change. These can be minor changes, e.g. an additional step is required, such as filling in a captcha or resetting a password. Sometimes these are major changes, where a workflow changes completely. In the latter case, it is probably easier to rewrite the tests. But this happens seldom; more often, we just need to slightly adapt the tests.&lt;/p&gt;

&lt;p&gt;Differences in flow cannot be addressed by object maps. Instead, we need other forms of abstraction: extracting recurring flows as “functions” or “procedures” and reusing them. This can be achieved with page objects, but it requires manual effort and the right abstraction.&lt;/p&gt;

&lt;p&gt;Instead, I propose a different approach: &lt;em&gt;passive update&lt;/em&gt;. What do I mean by that? Traditionally, we have to actively identify all occurrences of a specific situation in the tests and update them manually. So if we need to adjust the login process, we have to find all the instances where the tests log in and then manually change them accordingly. This is active update.&lt;/p&gt;

&lt;p&gt;Passive update is to instead specify the situation we need to update, together with a rule about how to update it. So instead of finding all the login attempts, we specify the situation: the login page is filled with credentials and the captcha is showing. Then we add a rule about how to update a test script that finds itself in that situation–filling the captcha. We do this by deleting or inserting individual actions, or a combination thereof. That update is then applied passively, upon execution of the tests. This means we are essentially turning the extraction of a procedure on its head.&lt;/p&gt;
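&lt;p&gt;A minimal sketch of what such a passive update could look like (the state shape, rule format and action names are illustrative assumptions): a rule pairs a situation predicate with a patch, and the patch is applied to the remaining actions while the test runs:&lt;/p&gt;

```python
# Illustrative sketch of passive update: a rule is a (situation, patch)
# pair that is evaluated during test execution, not edited into scripts.
# State keys and action names are made up for demonstration.

RULES = [
    # Situation: the login page is filled and a captcha is showing.
    # Patch: insert a captcha-solving step before continuing.
    (lambda state: state.get("page") == "login" and state.get("captcha_shown"),
     lambda actions: ["solve_captcha"] + actions),
]

def next_actions(state, remaining_actions):
    """Let every matching rule rewrite the remaining actions before execution."""
    for applies, patch in RULES:
        if applies(state):
            remaining_actions = patch(remaining_actions)
    return remaining_actions

print(next_actions({"page": "login", "captcha_shown": True}, ["click_login"]))
# → ['solve_captcha', 'click_login']
```

&lt;p&gt;No test script mentions the captcha explicitly; every test that runs into the situation is updated on the fly.&lt;/p&gt;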

&lt;p&gt;This approach has various advantages:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;By retaining the multiplication of effects, this approach requires &lt;em&gt;less effort&lt;/em&gt; to update your tests.&lt;/li&gt;
&lt;li&gt;By creating detailed rules, it allows the update to be &lt;em&gt;more nuanced&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;It affects only tests that are really in the specified situation &lt;em&gt;during execution&lt;/em&gt;–no need to statically analyse a test script and interpret whether the situation applies at runtime, or to manually debug it.&lt;/li&gt;
&lt;li&gt;General rules can be defined that apply to a variety of situations. So we can have a rule that says: whenever the test finds itself with a modal dialog and only one option (e.g. “ok”), click that option and continue with the test. This makes our tests much &lt;em&gt;more robust&lt;/em&gt; against unforeseen changes.&lt;/li&gt;
&lt;/ol&gt;
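&lt;p&gt;The generic rule from point 4 could be sketched like this (again, the state and action representations are my own assumptions for illustration):&lt;/p&gt;

```python
# Sketch of a general passive-update rule: whenever an unforeseen modal
# dialog with exactly one option appears, click it and continue the test.
# The dictionary-based state representation is assumed for demonstration.

def handle_modal(state, remaining_actions):
    """Prepend a click on the only option of a single-option modal dialog."""
    dialog = state.get("modal_dialog")
    if dialog and len(dialog["options"]) == 1:
        return [("click", dialog["options"][0])] + remaining_actions
    return remaining_actions

state = {"modal_dialog": {"options": ["ok"]}}
print(handle_modal(state, [("fill", "name")]))
# → [('click', 'ok'), ('fill', 'name')]
```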




&lt;p&gt;Being able to address the multiplication of effects allows us to embrace the whole context of a test, rather than trying to ignore it. This approach promises to make test automation and result checking both more powerful and more robust.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fns8kl9hspo6m1iu5lxq2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fns8kl9hspo6m1iu5lxq2.png" alt="rediff"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We have already implemented this approach for Java Swing. Now we want to create an open source tool to foster widespread adoption. Any support is highly appreciated–give us feedback, back us or spread the word.&lt;/p&gt;

&lt;p&gt;We have a &lt;a href="https://www.kickstarter.com/projects/1190853113/compare-your-documents-and-files-semantically-0?ref=a0k13q" rel="noopener noreferrer"&gt;Kickstarter campaign&lt;/a&gt; that allows you to fund us. Back us to vote on which technology gets implemented next. Or get an unbeatable price for premium. Or simply be the cool guy who can claim: I backed them in 2017.&lt;/p&gt;

&lt;p&gt;Thank you!&lt;/p&gt;

</description>
      <category>testing</category>
      <category>java</category>
      <category>discuss</category>
      <category>coding</category>
    </item>
    <item>
      <title>Test Automation is not Automated Testing</title>
      <dc:creator>roesslerj</dc:creator>
      <pubDate>Fri, 27 Oct 2017 12:02:32 +0000</pubDate>
      <link>https://dev.to/roesslerj/test-automation-is-not-automated-testing-9lc</link>
      <guid>https://dev.to/roesslerj/test-automation-is-not-automated-testing-9lc</guid>
      <description>&lt;p&gt;Testing as a craft is a highly complex endeavour, an interactive cognitive process. Humans are able to evaluate hundreds of problem patterns, some of which can only be specified in purely subjective terms. Many others are complex, ambiguous, and volatile. Therefore, we can only automate very narrow spectra of testing, such as searching for technical bugs (i.e. crashes).&lt;/p&gt;

&lt;p&gt;What is more important is that testing is not only about finding bugs. As the &lt;a href="https://www.growingagile.co.za/2015/04/the-testing-manifesto/" rel="noopener noreferrer"&gt;Testing Manifesto from Growing Agile&lt;/a&gt; summarises very illustratively and to the point, testing is about getting to understand the product and the problem(s) it tries to solve, and finding areas where the product or the underlying process can be improved. It is about preventing bugs rather than finding bugs, and about building the best system by iteratively questioning each and every aspect and underlying assumption, rather than breaking the system. A good tester is a highly skilled professional, constantly communicating with customers, stakeholders and developers. So talking about automated testing is abstruse to the point of being comical.&lt;/p&gt;

&lt;p&gt;Test automation on the other hand is the automated execution of predefined tests. A test in that context is a sequence of predefined actions interspersed with evaluations that &lt;a href="https://dev.to/roesslerj/testing-vs-checkingso-what"&gt;James Bach calls checks&lt;/a&gt;. These checks are manually defined algorithmic decision rules that are evaluated on specific and predefined observation points of a software product. And herein lies the problem. If, for instance, you define an automated test of a website, you might define a check that ascertains that a specific text (e.g. the headline) is shown on that website. When executing that test, this is exactly what is checked—and only this. So if your website looks like the one shown in the picture, your test still passes, making you think everything is ok.&lt;/p&gt;
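&lt;p&gt;To make this concrete (the page content and the check are invented for illustration, not taken from any real test suite), a headline-only check happily passes on a page that is otherwise destroyed:&lt;/p&gt;

```python
# Sketch of a classic assertion-based check: only the headline is checked,
# so every other breakage on the page goes unnoticed. Content is made up.

# What the (visibly broken) page renders: headline fine, the rest is not.
rendered_page = (
    "Welcome to Example Shop\n"
    "??? navigation failed to load ???\n"
    "raw template placeholders everywhere"
)

def check_headline(page_text, expected_headline):
    """The only observation point of the check: is the headline shown?"""
    return expected_headline in page_text

# The check passes, although a human spots the breakage at a glance.
print(check_headline(rendered_page, "Welcome to Example Shop"))  # → True
```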

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fs26y4bf8zgijk912c3is.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fs26y4bf8zgijk912c3is.png" alt="Broken website"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A human on the other hand recognises with a single glimpse that something has gone awry.&lt;/p&gt;

&lt;p&gt;But if test automation is so limited, why do we do it in the first place? Because we have to; there is simply no other way. Because &lt;a href="https://dev.to/roesslerj/why-there-is-no-way-around-test-automation-except-one"&gt;development adds up, testing doesn’t&lt;/a&gt;. Each iteration and release adds new features to the software (or so it should). And they need to be tested, manually. But new features also usually cause changes in the software that can break existing functionality. So existing functionality has to be tested, too. Ideally, you even want existing functionality to be tested continuously, so you quickly recognise when changes break existing functionality and rework is needed. But even if you only test before releases, in a team with a fixed number of developers and testers, the testers are bound to fall behind over time. This is why at some point, testing has to be automated.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F29jne9099v16qq02xrlx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F29jne9099v16qq02xrlx.png" alt="Work of developers adds up, work of testers not"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Considering all of its shortcomings, we are lucky that testing existing functionality isn’t really testing. As we said before, real testing is questioning each and every aspect and underlying assumption of the product. Existing functionality has already endured that sort of testing. Although it might be necessary to re-evaluate assumptions that were considered valid at the time of testing, this is typically not required before every release and certainly not continuously. Testing existing functionality is not really testing. It is called &lt;a href="https://en.wikipedia.org/wiki/Regression_testing" rel="noopener noreferrer"&gt;regression testing&lt;/a&gt;, and although it sounds the same, regression testing is to testing like pet is to carpet—not at all related. The goal of regression testing is merely to recheck that existing functionality still works as it did at the time of the actual testing. So regression testing is about controlling the changes of the behaviour of the software. In that regard it has more to do with version control than with testing. In fact, one could say that regression testing is the missing link between controlling changes of the static properties of the software (configuration and code) and controlling changes of its dynamic properties (the look and behaviour). Automated tests simply pin those dynamic properties down and transform them into a static artefact (e.g. a test script), which again can be governed by current version control systems.&lt;/p&gt;
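&lt;p&gt;This “version control of dynamic properties” can be sketched in a few lines (a golden-master store stands in for snapshot files kept under real version control; all names are illustrative assumptions):&lt;/p&gt;

```python
# Sketch of regression checking as version control of dynamic behaviour:
# the first run records a golden master, later runs are diffed against it.
# The in-memory store stands in for snapshot files in version control.

golden_masters = {}

def check(name, observed_behaviour):
    """Record behaviour on first sight, compare on every later run."""
    if name not in golden_masters:
        golden_masters[name] = observed_behaviour
        return True  # recorded: nothing to compare against yet
    return golden_masters[name] == observed_behaviour

print(check("login-page", {"title": "Login", "fields": ["user", "pass"]}))  # → True
print(check("login-page", {"title": "Login", "fields": ["user"]}))          # → False
```

&lt;p&gt;Any deviation from the recorded behaviour is flagged, exactly as a version control system flags any change to code or configuration.&lt;/p&gt;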

&lt;p&gt;This sort of testing (I’d rather call it checking) can be automated. And it should be automated for several reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;In the long run, it is cheaper to automate it.&lt;/li&gt;
&lt;li&gt;It can be done continuously, giving you faster feedback whether a change has broken existing functionality.&lt;/li&gt;
&lt;li&gt;As the software grows, your testers will not be able to perform it to the full extent necessary anymore, because development adds up—testing doesn’t.&lt;/li&gt;
&lt;li&gt;It is a trivial task, yet boring and exhausting in its repetitiveness, that insults the intelligence and abilities of any decent tester and keeps them from their actual work.&lt;/li&gt;
&lt;li&gt;Worse yet, testing the same functionality over and over again makes testers routine-blinded and makes them lose their ability to question assumptions and spot potential improvements.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Test automation is an important part of overall quality control, but since it is not really testing, the term “automated testing” is very misleading and should be avoided. This also emphasises that test automation and manual testing complement each other, rather than replace each other.&lt;/p&gt;

&lt;p&gt;Many people have tried to make this point in different ways (e.g. this is also the quintessence of the discussion about &lt;a href="http://www.satisfice.com/blog/archives/856" rel="noopener noreferrer"&gt;testing vs. checking&lt;/a&gt;, started by James Bach and Michael Bolton). But the emotionally loaded discussions (because it is about people’s self-image and their jobs) often split discussants into two broad camps: those who think test automation is “&lt;a href="http://www.satisfice.com/articles/test_automation_snake_oil.pdf" rel="noopener noreferrer"&gt;snake oil&lt;/a&gt;” and should be used sparingly and with caution, and those who think it is a silver bullet and the solution to all of our quality problems. Test automation is an indispensable tool of today’s quality assurance, but like every tool it can also be misused.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;TL;DR&lt;/em&gt;: Testing is a sophisticated task that requires a broad set of skills and with the means currently available cannot be automated. What can (and should) be automated is regression testing. This is what we usually refer to when we say test automation. Regression testing is not testing, but merely rechecking existing functionality. So regression testing is more like version control of the dynamic properties of the software.&lt;/p&gt;

</description>
      <category>testing</category>
      <category>testautomation</category>
    </item>
    <item>
      <title>What do Testers really do?</title>
      <dc:creator>roesslerj</dc:creator>
      <pubDate>Wed, 18 Oct 2017 21:59:34 +0000</pubDate>
      <link>https://dev.to/roesslerj/what-do-testers-really-do-7dl</link>
      <guid>https://dev.to/roesslerj/what-do-testers-really-do-7dl</guid>
      <description>&lt;h1&gt;
  
  
  Crash test dummies or judges–what do Testers really do?
&lt;/h1&gt;

&lt;p&gt;I read a lot about what the &lt;a href="http://qablog.practitest.com/the-simple-job-of-a-tester/" rel="noopener noreferrer"&gt;job of testers&lt;/a&gt; is. But no one wrote about what the job of testers &lt;em&gt;really&lt;/em&gt; is. The Ministry of Testing has an extensive &lt;a href="https://dojo.ministryoftesting.com/lessons/what-do-software-testers-do-version-0-1" rel="noopener noreferrer"&gt;list of what testers do&lt;/a&gt; on a daily basis. And if you Google the &lt;a href="https://www.google.de/search?q=job+of+a+tester" rel="noopener noreferrer"&gt;job of a tester&lt;/a&gt;, you only get &lt;a href="http://www.test-institute.org/Software_Testing_Roles_And_Responsibilities.php" rel="noopener noreferrer"&gt;task descriptions&lt;/a&gt;, like &lt;a href="http://istqbexamcertification.com/what-are-the-roles-and-responsibilities-of-a-tester/" rel="noopener noreferrer"&gt;test design&lt;/a&gt;, &lt;a href="https://www.prospects.ac.uk/job-profiles/software-tester" rel="noopener noreferrer"&gt;bug hunting&lt;/a&gt;, documentation or test automation. But these are all just the day-to-day tasks. They aren’t what testing is about. All of that is &lt;a href="https://www.youtube.com/watch?time_continue=70&amp;amp;v=UhDZPYu8piQ" rel="noopener noreferrer"&gt;incidental&lt;/a&gt;. What is the first and principal thing a tester does? What need does a tester serve by testing?&lt;/p&gt;

&lt;p&gt;Answering the question of what a tester does with “test design” is like answering what an architect does with “stroke-drawing” or answering what an artist does with “applying paint to canvas”. Although all of this is technically true, it doesn’t convey the big picture. It does not portray the larger purpose and higher meaning of architectural design or painting. Likewise, “test design” does not convey the inherent significance of what it means to be a tester. Yet, this is what people talk about. No wonder the job is &lt;a href="https://medium.com/@roesslerj/testing-vs-checking-so-what-9eb4c97c166c" rel="noopener noreferrer"&gt;underappreciated&lt;/a&gt;...&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fo93f91733jknbwnmgoo6.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fo93f91733jknbwnmgoo6.jpeg" alt="Paint of canvas"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The need a tester serves
&lt;/h2&gt;

&lt;p&gt;The purpose of a tester is to &lt;a href="https://gojko.net/2007/11/23/who-should-write-acceptance-tests/" rel="noopener noreferrer"&gt;represent the user&lt;/a&gt;. A good tester understands the needs and desires of the user and can speak in the name of the user. A good tester is a &lt;a href="http://visible-quality.blogspot.pt/2015/02/as-tester-i-can-represent-user.html" rel="noopener noreferrer"&gt;substitute for the user&lt;/a&gt;, a sparring partner for the developer who gets to see all the ugly warts of the application. A tester is a crash test dummy that crashes the application time and again, so the actual user can rest assured and safely take the next shortcut in the process that the developer didn’t foresee. This way, the tester spares the user anger and frustration, and spares the company brand damage and user churn.&lt;/p&gt;

&lt;p&gt;In most projects I know, there are typically (simplified) three main parties:&lt;/p&gt;

&lt;h3&gt;
  
  
  The project managers or product owners
&lt;/h3&gt;

&lt;p&gt;They often have a bird’s-eye view on things. They are mostly interested in the general functionality (e.g. “one has to be able to buy an item”). They mostly care about time and cost constraints. They usually don’t care if the user needs three more clicks to achieve some functionality or if the product looks ugly. Quality becomes an issue only if a certain functionality does not work and this poses a risk.&lt;/p&gt;

&lt;h3&gt;
  
  
  The developers
&lt;/h3&gt;

&lt;p&gt;They have a very &lt;a href="https://blog.prototypr.io/stop-it-with-as-a-user-5feb9b38d920" rel="noopener noreferrer"&gt;technical view&lt;/a&gt; on everything. What works for them usually doesn’t work for less technically inclined users. I see this in our own project regularly, when developers are baffled by how “normal” users struggle with e.g. using git from the command line. They often care more for technology and the beauty and elegance of technical solutions than for the real needs of users.&lt;/p&gt;

&lt;p&gt;“&lt;a href="https://www.theatlantic.com/technology/archive/2017/09/saving-the-world-from-code/540393/" rel="noopener noreferrer"&gt;The problem is that software engineers don’t understand the problem they’re trying to solve, and don’t careÂ to.&lt;/a&gt;”&lt;/p&gt;

&lt;h3&gt;
  
  
  The Testers
&lt;/h3&gt;

&lt;p&gt;They are responsible for ensuring that what developers claim works in principle, or works technically, really works in practice. But they do so from a user perspective. Projects where no end users are concerned (API design, technical interfaces, frameworks, etc.) are the only projects that can afford to have no testers.&lt;/p&gt;

&lt;p&gt;This also matches my experience of many projects: In most B2B projects, the user is not the one who pays the bill. Which sometimes means that the end-user of a system doesn’t have too much to say in its creation. The focus is on functionality–quality and usability are “add-ons”. In B2C projects on the other hand, where the end-user also is the one with the money, quality is often a priority and testers are premier team members.&lt;/p&gt;

&lt;p&gt;One could also say the tester is a power as in the separation of powers. The legislative power (the PM or PO) decides what has to be done in general. The executive power (developers) decides on the specific details and executes (implements) what has been decided. The executive power (developers) is answerable to the judiciary (testers), which in turn ultimately decides e.g. whether the taken measures are appropriate. This system of checks and balances helps to hold the developers accountable. It doesn’t say that developers aren’t committed to high standards of quality on their own. Just like many kings of old were benevolent and made some good decisions without being directly accountable. But the past has shown time and again that not all humans aspire to the highest of standards. The system of checks and balances tries to take individual commitment (the “human element”) out of the equation. And it adds an “outside perspective” that is often hard to attain from within–even with the best of intentions.&lt;/p&gt;

&lt;p&gt;There are more job roles in software development projects, you might say. What about the UX designers? They, like developers, think in terms of a future potential product. A tester on the other hand has a real, actual product. Take the Dyson Airblade, for example. This product was extremely well designed. But if you ever encountered an Airblade used by many people, you also know that it was badly tested. Why? Well, it doesn’t dry the hands, it blows the water away. And that water then has to go somewhere. With many wet hands, you always have puddles underneath. That is exactly what I mean: Foreseeing such an effect from blueprints or simulation alone is hard for a designer or developer. Recognising this problem is easy for a tester.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fe87rn9s9a8v46w19iey9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fe87rn9s9a8v46w19iey9.png" alt="Dyson Airblade"&gt;&lt;/a&gt;&lt;br&gt;
Even on &lt;a href="https://f1.media.brightcove.com/4/2310970329001/2310970329001_4231199563001_4231156866001.mp4?pubId=2310970329001&amp;amp;videoId=4231156866001" rel="noopener noreferrer"&gt;their own website&lt;/a&gt;, the problem becomes apparent&lt;/p&gt;

&lt;p&gt;Arguably, every situation is different, and so are the job and the daily tasks of every tester. But in my experience and understanding of most projects, testers create the most value by completing the trinity of perspectives: representing the user and holding developers accountable for their work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why do I ask?
&lt;/h2&gt;

&lt;p&gt;Why is this question even relevant? Because it defines what testers will do in the long term. If all a tester does is defined by small day-to-day tasks, and these tasks get automated ... the perspectives for testers are dim. But if these tasks are just the incidental work when trying to achieve a higher goal ... and the goal as such remains unchanged and important, then testers have nothing to fear. Then they should be fond of better tools and higher efficiency and e.g. &lt;a href="https://medium.com/@roesslerj/why-there-is-no-way-around-test-automation-except-one-9c51aefd7806" rel="noopener noreferrer"&gt;embrace automation&lt;/a&gt;. Then, also, the sheer thought of automating the user perspective and the judiciary “away” is almost comical.&lt;/p&gt;

&lt;p&gt;What do you think?&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I would love to hear your thoughts about that, preferably on Twitter (&lt;a class="mentioned-user" href="https://dev.to/roesslerj"&gt;@roesslerj&lt;/a&gt;). Thank you!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>testing</category>
    </item>
    <item>
      <title>Testing vs. Checking–so what?</title>
      <dc:creator>roesslerj</dc:creator>
      <pubDate>Fri, 08 Sep 2017 14:47:57 +0000</pubDate>
      <link>https://dev.to/roesslerj/testing-vs-checkingso-what</link>
      <guid>https://dev.to/roesslerj/testing-vs-checkingso-what</guid>
      <description>&lt;p&gt;If you are curious or bewildered by the discussion about “testing vs. checking”, this post is for you. I read &lt;a href="http://www.satisfice.com/blog/archives/856" rel="noopener noreferrer"&gt;&lt;em&gt;the article&lt;/em&gt;&lt;/a&gt; about testing versus checking by James Bach and Michael Bolton. At first I thought “so what?”. Then I thought it was some instance of dualism. Now I know better. And it even makes sense.&lt;/p&gt;

&lt;p&gt;If you are in the testing community, you probably know the discussion about testing versus checking. Roughly speaking, the idea behind “testing vs. checking” is that checking is a purely mechanical task, whereas testing can only be performed by humans. Since the distinction was propagated by &lt;a href="https://twitter.com/jamesmarcusbach" rel="noopener noreferrer"&gt;James Bach&lt;/a&gt; and &lt;a href="https://twitter.com/michaelbolton" rel="noopener noreferrer"&gt;Michael Bolton&lt;/a&gt;, it got pretty famous in the testing community.&lt;/p&gt;

&lt;p&gt;When I first read about the distinction, I couldn’t wrap my head around it. I thought “&lt;a href="https://www.techwell.com/techwell-insights/2017/06/debate-over-testing-versus-checking" rel="noopener noreferrer"&gt;so what, we still have to do both&lt;/a&gt;”. Like, what is the benefit of the distinction, other than hairsplitting? What value does it add to the discussion? Calling things differently doesn’t change facts in the real world, so why would anyone bother? And I was not the only one who thought so:&lt;/p&gt;

&lt;p&gt;&lt;iframe class="tweet-embed" id="tweet-865234381622738945-660" src="https://platform.twitter.com/embed/Tweet.html?id=865234381622738945"&gt;
&lt;/iframe&gt;

&lt;/p&gt;

&lt;p&gt;Then I recognized that the distinction is all about what a &lt;em&gt;machine&lt;/em&gt; can’t do and never will be able to do. Well, I know a few things about AI, and I don’t think there is &lt;em&gt;anything&lt;/em&gt; that a machine will never be able to do. So I thought, this is some weird instance of &lt;a href="https://en.wikipedia.org/wiki/Mind%E2%80%93body_dualism" rel="noopener noreferrer"&gt;dualism&lt;/a&gt;. Again, I was not the first one &lt;a href="http://infiniteundo.com/post/141887918393/is-testing-vs-checking-just-cartesian-dualism" rel="noopener noreferrer"&gt;to think that&lt;/a&gt;. Nor the &lt;a href="https://iodegradable.wordpress.com/2014/01/11/testing-vs-checking/" rel="noopener noreferrer"&gt;second one&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This is the point where I lost interest and shrugged it off as some weirdo tester eccentric thing. But when I was at &lt;a href="https://dojo.ministryoftesting.com/events/testbash-belfast-2017" rel="noopener noreferrer"&gt;TestBash&lt;/a&gt;, people I respected made a point about that distinction. So I went back to that article and thought “what did I miss”?&lt;/p&gt;

&lt;h2&gt;
  
  
  What lies at the heart of the discussion
&lt;/h2&gt;

&lt;p&gt;Turns out I did &lt;em&gt;indeed&lt;/em&gt; miss something.&lt;/p&gt;

&lt;p&gt;In most companies where I have been, testing was regarded as “low value”. Testers were mostly students, or even worse, people that had failed as developers. They “clicked through the application”, usually following a given script, and manually checked what they were told to check. There is nothing challenging or respectable in the task the way it is performed by most companies. This usually also reflects the stance of those companies on the topic of software quality: as something that is not that important, something that can be added later if time and budget allow it. These “testers” are (mis-)used as “clicking machines”. And this is the sort of “manual automation” that can easily be outsourced–and often is.&lt;/p&gt;

&lt;p&gt;But there are people and companies that put the quality of the software at the center of their effort. There is a whole &lt;a href="https://www.ministryoftesting.com" rel="noopener noreferrer"&gt;community&lt;/a&gt; of people who take pride in testing. People who treat testing as a craft. If you do testing well, it is an original, creative, challenging task. A task that tries to understand the &lt;em&gt;purpose&lt;/em&gt; of the product and tries to find ways in which the product can be misused or improved. These people are about as far from “clicking machines” as you can get. Yet they are lumped together in the same bucket. As testers. That do testing. Like, you know, low-value clicking and checking.&lt;/p&gt;

&lt;p&gt;It took me some time to understand, but at the heart of this discussion of “testing vs. checking” is the &lt;em&gt;esteem&lt;/em&gt; of real testing. And the self-esteem of people who take pride in contributing to a high-quality end product. So if you ever meet someone who is &lt;em&gt;emotional&lt;/em&gt; about the distinction of testing vs. checking, that person probably sees himself as a real tester, someone who honours the craft. Pay him respect by making that distinction.&lt;/p&gt;

&lt;p&gt;But it is even more: It is also about marketing. If you want to “sell” testing and promote it, you need to make that distinction. (It is no coincidence the term was coined by consultants.) To make an analogy: The situation is that the current notion of testing is a bag full of sand and diamonds. And when people need to “sell” testing (e.g. to management or customers), they have to sell testing as such. Of course nobody pays the full diamond price if there is a lot of sand in the mix. So if you want to sell the diamonds, it is important to separate them from the sand. By differentiating, you make clear that testing is important and valuable and needs expertise.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fg64ggfgkz1uo1m7fx1g7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fg64ggfgkz1uo1m7fx1g7.png" alt="Real testing is diamonds in sand"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Essentially, this is what this distinction is all about. Testing is diamonds–a valuable and important task. Checking on the other hand is a mere mechanical task with low value that you had &lt;a href="https://dev.to/roesslerj/why-there-is-no-way-around-test-automation-except-one"&gt;better automate&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Checking is just sand.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you liked this post, I would be thankful if you clapped (as much as you want), and tweeted or shared it. Thank you!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>testing</category>
      <category>checking</category>
    </item>
    <item>
      <title>Why there is no way around test automation (except one)</title>
      <dc:creator>roesslerj</dc:creator>
      <pubDate>Wed, 30 Aug 2017 16:06:26 +0000</pubDate>
      <link>https://dev.to/roesslerj/why-there-is-no-way-around-test-automation-except-one</link>
      <guid>https://dev.to/roesslerj/why-there-is-no-way-around-test-automation-except-one</guid>
      <description>&lt;p&gt;Right now, manual testers in software quality assurance basically fight a lost cause. With every sprint and every iteration, the number of features increases as the software grows, because the work of developers (ideally) adds up.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fiipmwwd23xo9e74kigdl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fiipmwwd23xo9e74kigdl.png" alt="Software “grows” as the number of features increases"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But the work of manual testers doesn’t add up. New features have to be tested as they are introduced. This part of the work scales with the number of developers. But software can break in interesting ways. So all features should be tested before a release, including existing ones. If a team has a fixed number of developers and testers, over time the testers are bound to fall behind.&lt;/p&gt;

&lt;p&gt;There are only three possible solutions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Add more manual testers as the software grows&lt;/li&gt;
&lt;li&gt;Let users find your bugs&lt;/li&gt;
&lt;li&gt;Automate tests&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Adding more manual testers
&lt;/h3&gt;

&lt;p&gt;Adding more manual testers means increasing the cost. More people in a team do not scale linearly, as the overhead for organisation and communication increases. So even with an unlimited budget (which most companies don’t have) this is only a limited option. And even if it were, there are many reasons why doing manual regression testing is &lt;a href="//exploringuncertainty.com/blog/archives/111"&gt;a bad idea&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Let users find your bugs
&lt;/h3&gt;

&lt;p&gt;Accepting the risk by testing only a small sample of the features, or testing them insufficiently, is only acceptable in certain situations. Google and Facebook do not implement life-critical software, so &lt;a href="https://www.darkcoding.net/software/facebooks-code-quality-problem/" rel="noopener noreferrer"&gt;showing an error to some small percentage of their users&lt;/a&gt; is a viable solution–for them. This is what lets them implement continuous delivery. But this possibility tends to be the exception. If your software is installed locally or is critical in any way, this is not an option.&lt;/p&gt;

&lt;h3&gt;
  
  
  Automate tests
&lt;/h3&gt;

&lt;p&gt;So eventually you end up in the situation that you &lt;em&gt;have to automate&lt;/em&gt; the tests in order to deal with this inequality. This is probably the reason why test automation has seen such a boom in recent years, and why test automation engineers are in such demand right now.&lt;/p&gt;

&lt;p&gt;There is a huge and recurring discussion between testers about whether test automation will eventually &lt;a href="https://www.quora.com/Can-test-automation-replace-manual-testing" rel="noopener noreferrer"&gt;replace human testers&lt;/a&gt;. Whoever thinks that misunderstands both the reason for test automation and its capabilities. You have to automate tests. But not in order to replace testers. You have to automate tests in order to enable testers to do their job: proper testing of new features.&lt;/p&gt;

&lt;p&gt;Test automation is &lt;a href="https://twitter.com/michaelbolton/status/902233941431787520" rel="noopener noreferrer"&gt;despised by some&lt;/a&gt;. Or at least it &lt;a href="http://www.satisfice.com/articles/cdt-automation.pdf" rel="noopener noreferrer"&gt;appears that way&lt;/a&gt;. They repeat over and over that test automation is not automated testing, and that test automation is of very limited value. I understand where this comes from and why they stress this point so much. I also understand that&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://www.stickyminds.com/book/lessons-learned-software-testing" rel="noopener noreferrer"&gt;a test tool is not a strategy; test automation is a development process; test automation is a significant investment; and test automation projects require skills in programming, testing, and project management.&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;While all of that is true, it mostly refers to specific tools and specific experiences. And it doesn’t address the issue I stress here. I acknowledge that most of today’s test automation tools are far from perfect, but that is another discussion.&lt;/p&gt;




&lt;h2&gt;
  
  
  What test automation can’t do
&lt;/h2&gt;

&lt;p&gt;Test automation is very much comparable to a version control system. It highlights changes in the behavior of the system and asks the user to verify or undo those changes. Thus it cannot find bugs that already existed when the test was created (historical bugs), because these bugs are baked into the tests. Hard-to-change existing tests can even be detrimental, as they &lt;em&gt;enforce defective&lt;/em&gt; behavior.&lt;/p&gt;

&lt;p&gt;Test automation can only find new bugs in old functionality, and only a specific kind of such bugs: those that manifest as changed behavior. It cannot reason about or understand the software, so it does not detect when the system becomes inconsistent. It does not find functionality that &lt;em&gt;should have changed&lt;/em&gt; to preserve consistency. So even with test automation, you still have to manually test and review old functionality to make sure it stays consistent with the overall system.&lt;/p&gt;

&lt;p&gt;Of course, testing exhaustively or testing everything is impossible. I would argue that testing is a risk–cost calculation. How much risk are you willing to take, or as others have put it “How long do you look into the rearview mirror”? So the number of tests to automate is always a cost-function.&lt;/p&gt;

&lt;h2&gt;
  
  
  What test automation can do
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fe7m0iiev1fqjxsoczylm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fe7m0iiev1fqjxsoczylm.png" alt="This is what test automation really does, albeit sub-optimal: it highlights changes."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is what test automation really does, albeit sub-optimally: it highlights changes.&lt;/p&gt;

&lt;p&gt;Test automation helps to detect when functionality that once worked (and was tested and approved) ceases to work. In other words, test automation is a way to help you find unwanted changes to the behavior of the system under test. These unwanted changes are also called side-effects or regressions. Seen that way, regression testing and test automation are version control in disguise, i.e. version control of the behavior of the software.&lt;/p&gt;

&lt;p&gt;Test automation is no silver bullet, but it can be of help. Test automation is a tool, a support, a utility function to help testers with what would otherwise be a sheer insurmountable amount of effort.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;TL;DR: With ever growing software, there is no way around test automation, unless you want your users to find your bugs for you. But this is meant to enable testers to do their real work: critically challenge the system instead of becoming routine-blinded.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This post has also been published on &lt;a href="https://medium.com/@roesslerj/why-there-is-no-way-around-test-automation-except-one-9c51aefd7806" rel="noopener noreferrer"&gt;medium&lt;/a&gt;. If you liked it, please press heart, twitter or otherwise spread the word.&lt;/p&gt;

</description>
      <category>testing</category>
      <category>regressiontesting</category>
      <category>testautomation</category>
      <category>automatedtesting</category>
    </item>
    <item>
      <title>Whitelist Testing vs. Blacklist Testing</title>
      <dc:creator>roesslerj</dc:creator>
      <pubDate>Fri, 25 Aug 2017 12:54:35 +0000</pubDate>
      <link>https://dev.to/roesslerj/whitelist-testing-vs-blacklist-testing</link>
      <guid>https://dev.to/roesslerj/whitelist-testing-vs-blacklist-testing</guid>
      <description>&lt;p&gt;From a IT security point of view, the current approach to GUI test automation is careless or even dangerous. And here is why...&lt;/p&gt;

&lt;p&gt;A general principle in IT security is to forbid everything and only allow what is really needed. This reduces your &lt;a href="https://en.wikipedia.org/wiki/Attack_surface" rel="noopener noreferrer"&gt;&lt;em&gt;attack surface&lt;/em&gt;&lt;/a&gt; and with it the number of problems you can encounter. For most situations (e.g. when configuring a firewall), this means applying a &lt;em&gt;whitelist&lt;/em&gt;: forbid everything and allow only individual, listed exceptions. And make sure to review and document them.&lt;/p&gt;

&lt;p&gt;Compare this to the current state of the art in test automation of software GUIs. With tools like &lt;a href="https://www.seleniumhq.org" rel="noopener noreferrer"&gt;Selenium&lt;/a&gt; (the quasi-standard in web test automation), it is the other way around. These tools allow &lt;em&gt;every change&lt;/em&gt; in the software under test (SUT), unless you manually create an explicit check. With regard to changes, this is a blacklisting approach. If you are familiar with software test automation, you know that this is for good reasons: the brittleness of every such check, and the maintenance effort it brings about. But apart from why it is that way, does it make sense? After all, false negatives (missing checks) will &lt;a href="https://www.stickyminds.com/article/start-trusting-your-test-automation-again" rel="noopener noreferrer"&gt;decay trust in your test automation&lt;/a&gt;.&lt;/p&gt;
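&lt;p&gt;In code, the blacklisting approach boils down to a handful of explicit assertions; everything not asserted passes unchecked. A minimal sketch in plain Python (deliberately not the real Selenium API; the page structure and checked properties are made up for illustration):&lt;/p&gt;

```python
def blacklist_check(page: dict) -> None:
    """Denylist-style test: only explicitly listed properties are checked."""
    assert page["title"] == "Welcome"
    assert page["login_button"] == "visible"
    # Every other property of the page may change freely, unnoticed,
    # including a dancing pony nobody wrote a check for.
```

&lt;p&gt;Any change outside the two asserted properties silently passes, which is exactly the risk of the blacklisting approach.&lt;/p&gt;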

&lt;p&gt;To be defensive would mean to check &lt;em&gt;everything&lt;/em&gt; and only allow individual and documented exceptions. Every other change to the software should be highlighted and reviewed. This is comparable to the “track changes” mode in Word or &lt;a href="https://en.wikipedia.org/wiki/Version_control" rel="noopener noreferrer"&gt;version control&lt;/a&gt; as used in source code. And it is the only way not to miss the &lt;a href="https://www.youtube.com/watch?v=UMnZiTL0tUc&amp;amp;feature=youtu.be&amp;amp;t=326" rel="noopener noreferrer"&gt;dancing pony on the screen&lt;/a&gt; for which you didn’t create a check. At the end of the day, this is what automated tests are for: to find regressions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fce9ggyojdfcxygshifxf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fce9ggyojdfcxygshifxf.png" alt="Tracking changes in Word is what we want for our software also"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Of course, for that approach to work in practice, there are a few necessary preconditions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We need the execution of the system under test (SUT) to be &lt;em&gt;repeatable&lt;/em&gt; (e.g. use the same test data). This is a very sensible idea anyway. And it is way easier with today’s tools of virtualization and containerization than it was a couple of years ago.&lt;/li&gt;
&lt;li&gt;We need to deal with the &lt;em&gt;multiplication of changes&lt;/em&gt;. Every change to the software shows up in multiple tests, probably multiple times. E.g. if the logo on a web site changes, this may well affect each and every test. Yet it should be necessary to review a change only once.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The dose makes the poison
&lt;/h2&gt;

&lt;p&gt;There is an ideal set of checks for every software. Everything that can change without ever being a problem should not be checked. And everything that must not change &lt;em&gt;should&lt;/em&gt; be checked.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6seq88fhzlrb5nsys2hl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6seq88fhzlrb5nsys2hl.png" alt="Whitelisting vs. blacklisting in reality"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are two important considerations when choosing between the two approaches:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;How do you reach that middle ground in the most effective way?&lt;/li&gt;
&lt;li&gt;From which “side” is it less risky to approach, in case the perfect spot is missed?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;IT security guidelines recommend erring on the side of caution. So if both approaches created an equal amount of effort, you should choose whitelisting. But, of course, the effort is usually not equal.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F93vfxe5vygm23ln32qaa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F93vfxe5vygm23ln32qaa.png" alt="Whitelisting vs. blacklisting in reality"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  A real-life example
&lt;/h3&gt;

&lt;p&gt;Imagine you have a software that features a table. In your GUI test, you should put a check on every column of every row. With seven columns and seven rows, this means 49 checks--just for the table. And if any of the displayed data ever changes, you have to manually copy &amp;amp; paste the changes to adjust the checks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu38c3l43umax0kqs7x2e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu38c3l43umax0kqs7x2e.png" alt="Software under test with a table"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Starting with a whitelisting approach, the complete table is checked by default. You then only need to exclude volatile data or components (typically the build number or the current date and time). And if the data ever changes, maintaining the test is way easier, because you usually (depending on the tool) have efficient ways to update the checks. Guess which of the two approaches is less effort...&lt;/p&gt;
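&lt;p&gt;The whitelist check over the whole table can be sketched in a few lines. This is an illustrative snippet in plain Python (not any specific tool’s API): every cell is compared by default, and only the explicitly allowlisted volatile cells may differ.&lt;/p&gt;

```python
def diff_table(approved, actual, ignore=frozenset()):
    """Whitelist-style table check: compare every cell by default;
    (row, column) pairs in `ignore` are allowlisted as volatile."""
    changes = []
    for r, (approved_row, actual_row) in enumerate(zip(approved, actual)):
        for c, (old, new) in enumerate(zip(approved_row, actual_row)):
            if (r, c) not in ignore and old != new:
                changes.append((r, c, old, new))
    return changes
```

&lt;p&gt;One entry in the ignore set replaces dozens of hand-written checks, and after an intended change the approved table is updated in a single step instead of 49.&lt;/p&gt;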

&lt;h2&gt;
  
  
  Text-based vs pixel-based whitelist tests
&lt;/h2&gt;

&lt;p&gt;There are already tools out there that let you create whitelist tests. Some are purely visual/pixel-based, such as &lt;a href="https://github.com/bslatkin/dpxdt" rel="noopener noreferrer"&gt;PDiff&lt;/a&gt;, &lt;a href="https://applitools.com" rel="noopener noreferrer"&gt;Applitools&lt;/a&gt; and &lt;a href="https://github.com/igorescobar/automated-screenshot-diff" rel="noopener noreferrer"&gt;the like&lt;/a&gt;. This approach comes with its benefits and drawbacks. It is universally applicable--no matter if you check a web site or a PDF document. But on the other hand, if the same change appears multiple times, it is hard to treat it in one go. Whitelisting of changes (i.e. excluding parts of the image) can be a problem, too.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://approvaltests.com" rel="noopener noreferrer"&gt;Approval Test&lt;/a&gt; and &lt;a href="http://texttest.sourceforge.net" rel="noopener noreferrer"&gt;TextTest&lt;/a&gt; are text-based tools that are much more robust. But PDFs, web sites, or software GUIs have to be converted to plain text or images for comparison. Ignoring changes is usually done via regular expressions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Shameless self-promotion:
&lt;/h3&gt;

&lt;p&gt;I am only aware of one tool that is semantic, can be applied to software GUIs, is not pixel-based (although it can be), and easily lets you ignore volatile elements: &lt;a href="http://retest.org" rel="noopener noreferrer"&gt;ReTest&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This is a crosspost from &lt;a href="https://medium.com/@roesslerj/whitelist-testing-vs-blacklist-testing-cac5b9435aa1" rel="noopener noreferrer"&gt;medium&lt;/a&gt;. &lt;em&gt;If you liked this post, like, twitter, share or otherwise help raise awareness. Thank you!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>testing</category>
      <category>guitesting</category>
      <category>testautomation</category>
      <category>regressiontesting</category>
    </item>
  </channel>
</rss>
