DEV Community: Saša Zejnilović

5 things to watch out for in automated regression tests

Saša Zejnilović — Wed, 06 Jan 2021 21:33:51 +0000

What are regression tests? (in short)
The problems
- 1. Change of the output formats
- 2. Designed-in assumptions about the test environment
- 3. Errors in maintenance
- 4. Changing operators.
- 5. Not treating your tests as any other codebase
Conclusion

What are regression tests? (in short)

You design regression tests to detect issues in features you have already tested, issues that might appear as side effects of implementing other features.

The problems

The biggest problems facing regression tests are:

1. Change of the output formats

This is the most common problem. The changes may be so minor that manual testers would barely notice them. Automated tests, however, are sensitive and brittle, and they cannot tell an improvement from a bug. The whole suite might have to be updated even if we only change some metadata from Map[String, String] to Map[String, Any] to accommodate all formats.

2. Designed-in assumptions about the test environment

Test suites may break when they are moved to a different environment or when the configuration changes, especially when the tests do not control the environment they run in.
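
One way to make such assumptions visible is to check them before the suite starts instead of letting tests fail halfway through. The sketch below is only an illustration: the setting names TEST_DATA_DIR and TEST_DB_URL and the helper require_env are hypothetical and not taken from any real suite.

```shell
#!/bin/sh
# Sketch only: TEST_DATA_DIR and TEST_DB_URL are hypothetical setting names.

# Succeeds only if every named variable is set and non-empty;
# otherwise lists all the missing ones at once.
require_env() {
  missing=""
  for name in "$@"; do
    eval "value=\${$name:-}"
    [ -n "$value" ] || missing="$missing $name"
  done
  if [ -n "$missing" ]; then
    echo "missing settings:$missing"
    return 1
  fi
  echo "environment ok"
}

# Example: one setting is provided, the other one is not.
TEST_DATA_DIR="/data/regression"
unset TEST_DB_URL
require_env TEST_DATA_DIR TEST_DB_URL || echo "refusing to run the suite"
```

A suite that starts with a check like this fails with a readable message in a new environment, rather than with dozens of unrelated test failures.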

3. Errors in maintenance

People who repair automated tests make mistakes too, introducing bugs into the test suites. Regression test suites then develop regression bugs of their own, which may only show up after some time.

4. Changing operators.

Test suites may require skilled people and specific knowledge to run and maintain, and people change positions and jobs. If person X disables a test and is then let go, person Y just keeps running the tests, unaware that there might be a problem.

5. Not treating your tests as any other codebase

Very often, and not only in regression testing, people treat their tests the way they treat their documentation. Tests should be treated like any other codebase, with standards and design principles applied. You stop just shy of writing tests for your tests. The two should actually work in unison: you can view it as your test code testing your product code and vice versa.

Conclusion

In conclusion, people tend to invest a lot in regression automation. Sadly, they often find that the tests stop working sooner rather than later. The tests are out of sync with the product. They demand repair. They no longer help find bugs. Testers respond by updating the tests or just adding new ones, ending up with 5,000 to 6,000 test cases that no one understands, while everyone just prays that it is OK. I know a company where two full-time SDETs were needed to patch the tests every day and add more. Hundreds of tests were disabled because nobody had time to fix them.

Of all these outcomes, uncontrolled maintenance cost is probably the most common. It leads companies to "forget" the regression suite, and the testing with it, rather than repair it.

Working with nested structures in Spark

Saša Zejnilović — Sun, 20 Sep 2020 11:01:55 +0000

Table of Contents

Intro
Add Column
Drop Column
Map column
Afterword

Intro

I want to introduce you to a library called spark-hats (full name: Spark Helpers for Array Transformations), but do not let the name fool you: it works with structs as well. This library saves me a lot of time and energy when developing new Spark applications that have to work with nested structures. I hope it will help you too.

The core of the library consists of three methods: add a column, map a column and drop a column. All of them are engineered so that you can turn this:

val dfOut = df.select(col("id"), transform(col("my_array"), c => {
  struct(c.getField("a").as("a"),
  c.getField("b").as("b"),
  (c.getField("a") + 1).as("c"))
}).as("my_array"))

into this:

val dfOut = df.nestedMapColumn("my_array.a","c", a => a + 1)

Let's get started with imports and the structure that will be used for examples.

I will use spark-shell with the package using this command in the shell:

$> spark-shell --packages za.co.absa:spark-hats_2.11:0.2.1

and then in the spark-shell:

scala> import za.co.absa.spark.hats.Extensions._
import za.co.absa.spark.hats.Extensions._

scala> df.printSchema()
root
 |-- id: long (nullable = true)
 |-- my_array: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- a: long (nullable = true)
 |    |    |-- b: string (nullable = true)

scala> df.show(false)
+---+------------------------------+
|id |my_array                      |
+---+------------------------------+
|1  |[[1, foo]]                    |
|2  |[[1, bar], [2, baz], [3, foz]]|
+---+------------------------------+

Now let's move to the methods.

Add Column

Add column comes in two variants: simple and extended. The simple variant adds a new field to a nested structure. The extended variant does the same while allowing you to reference other elements.

The simple one is pretty straightforward. You take your DataFrame and, instead of calling withColumn, you call nestedWithColumn. Let's add a literal to a struct.

scala> df.nestedWithColumn("my_array.c", lit("hello")).printSchema
root
 |-- id: long (nullable = true)
 |-- my_array: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- a: long (nullable = true)
 |    |    |-- b: string (nullable = true)
 |    |    |-- c: string (nullable = false)

scala> df.nestedWithColumn("my_array.c", lit("hello")).show(false)
+---+---------------------------------------------------+
|id |my_array                                           |
+---+---------------------------------------------------+
|1  |[[1, foo, hello]]                                  |
|2  |[[1, bar, hello], [2, baz, hello], [3, foz, hello]]|
+---+---------------------------------------------------+

The extended version can then use other elements of the array. The API also differs. Here the method nestedWithColumnExtended expects a function returning a column as a second parameter. Moreover, this function has an argument which is a function itself, the getField() function. The getField() function can be used in the transformation to reference other columns in the DataFrame by their fully qualified name.

scala> val dfOut = df.nestedWithColumnExtended("my_array.c", getField =>
         concat(col("id").cast("string"), getField("my_array.b"))
       )

scala> dfOut.printSchema
root
 |-- id: long (nullable = true)
 |-- my_array: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- a: long (nullable = true)
 |    |    |-- b: string (nullable = true)
 |    |    |-- c: string (nullable = true)

scala> dfOut.show(false)
+---+------------------------------------------------+
|id |my_array                                        |
+---+------------------------------------------------+
|1  |[[1, foo, 1foo]]                                |
|2  |[[1, bar, 2bar], [2, baz, 2baz], [3, foz, 2foz]]|
+---+------------------------------------------------+

Notice that for root-level columns it is enough to use col, but getField would still be fine.

Drop Column

By the second method, you might have already caught on to the naming convention. This method is called nestedDropColumn and is the most straightforward of the three. Just provide a fully qualified name.

scala> df.nestedDropColumn("my_array.b").printSchema
root
 |-- id: long (nullable = true)
 |-- my_array: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- a: long (nullable = true)

scala> df.nestedDropColumn("my_array.b").show(false)
+---+---------------+
|id |my_array       |
+---+---------------+
|1  |[[1]]          |
|2  |[[1], [2], [3]]|
+---+---------------+

Map column

Map column is probably the one with the most use cases. It applies a function to each element of your struct and puts the output on the same level by default, or somewhere else if specified.

If the input column is a primitive, then a simple function will suffice. If it is a struct, then you will have to use getField again.

scala> df.nestedMapColumn(inputColumnName = "my_array.a", outputColumnName = "c", expression = a => a + 1).printSchema
root
 |-- id: long (nullable = true)
 |-- my_array: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- a: long (nullable = true)
 |    |    |-- b: string (nullable = true)
 |    |    |-- c: long (nullable = true)

scala> df.nestedMapColumn(inputColumnName = "my_array.a", outputColumnName = "c", expression = a => a + 1).show(false)
+---+---------------------------------------+
|id |my_array                               |
+---+---------------------------------------+
|1  |[[1, foo, 2]]                          |
|2  |[[1, bar, 2], [2, baz, 3], [3, foz, 4]]|
+---+---------------------------------------+

Afterword

I hope these methods and the library will help you as much as they helped me. They make working with structures a lot easier and keep my code more concise, which in my head means less error-prone.

For more info go to https://github.com/AbsaOSS/spark-hats

Good luck and happy coding!

Black Box Testing Misconceptions

Saša Zejnilović — Thu, 28 May 2020 12:37:19 +0000

After some time in QA and being a self-proclaimed SDET, I have seen that there are a lot of misconceptions regarding testing. You can find articles all around the net with titles such as "Regression testing vs Retesting" or "Performance Done Right", but I have not seen one that adequately addresses Black Box Testing.

In my experience, Black Box Testing is seen as something that can be done quickly or by unskilled people. Some would label it as cheap while providing a lot of feedback. Let's break this down.

The easier misconception to disprove is that "it is cheap". That goes totally against the basic testing pyramid. Black box testing is done once you have some user interface, CLI or GUI, whatever you want to throw at the users. This means your integration could be garbage, and your supporting code could be garbage, but for some reason it all just worked together until a user sat down in front of it. Now, when something goes wrong, there is a chance you will have to dig really deep into your underlying code to make it work, and this could also break the integration with other modules. This seems very expensive to me, but then again, I am not a manager.

The second misconception is that it is "easy". I am not saying black box testing is one of the more complex types of testing, but I am sure it is not as ignorance-based as some would think. Yes, you can throw a team of 20 people at a UI and tell them to go nuts, but does this actually get the most value out of it? In my experience, proper black box testing pays off when the people setting it up are knowledgeable about the business use cases and issues, and when they understand the users. Give a tester a one-on-one with a user, let them chat, and see what happens. It also helps when the testers understand the technology and configuration of the system under test, which other software it will interact with, and what the expectations for the data flow are.

I hope this clears it up a bit. If I was not clear enough in everything I said, comment below and I will correct myself ASAP.

Good luck and happy coding!

Github Awesome Lists

Saša Zejnilović — Tue, 26 May 2020 12:38:49 +0000

TL;DR: Awesome repo

In the last couple of weeks, I have seen a lot of custom made lists and suggestions for apps, frameworks and others, which is fantastic. A good tip is an excellent way to jumpstart someone's work or project. A curated list from someone experienced is just irreplaceable.

With that said, allow me to present to you the GitHub repo called Awesome. This repository is full of different lists of impressive open-source applications, libraries and frameworks for a plethora of uses.

Please have a look. Give some of them a try. It has helped me many times. Sometimes it even sparks an idea.

Good luck and happy coding!

Short: The biggest mistake of juniors

Saša Zejnilović — Sat, 23 May 2020 16:57:24 +0000

In my approximately five years of professional IT experience (a weird mix of QA, Dev, DevOps) I had the honour of looking at CVs, reviewing candidates and teaching or guiding more junior team members. Let's say my "mentoring" started three years ago.

In these three years, I have seen a lot of people (teammates, colleagues and others, such as open-source participants) make the same fundamental mistake. This mistake is something which even the wise Vesemir told us not to do. He said: "Don't train alone; it only embeds your errors."

This might sound funny, taking quotes from fictional characters, but it is something that I think is crucial. Many times I have seen beginning programmers learn something the wrong way, burn it into their minds and then spread the harmful code like a disease throughout the codebase.

I want to emphasize that I don't think this is their fault. The internet is big, dark and full of errors. They are trying, they are learning alone, and they should be praised and encouraged. But they should also start with peer reviewing as soon as possible. This is the "cure" of sorts. The internet is also full of beginning programmers, and not everyone is lucky enough to get a mentor. What I am saying is: do things together as soon as possible. Work together, but try to learn separately. This will allow all of you to learn new things, share knowledge and discuss better ways of doing things. It keeps you from getting too comfortable with what you know and lowers the risk of embedding wrong ideas into your daily routine.

Now, how do you find a place where you can code and someone will review your code for free? GitHub is a start. You can explore repositories; there is a button for it on the main page. You can filter by topics and languages. Pick a smaller project, look through the issues, play with it a bit. Smaller projects tend to be more open to newcomers. Not only will you learn and build meaningful things, but it will also show on your CV.

Good luck and Happy Coding!

How to compare your data in/with Spark

Saša Zejnilović — Fri, 01 May 2020 08:18:54 +0000

Intro
The problem
The solution
Who exactly is behind this project
Hermes dataset comparison Features
Usage - Spark application
Summing-up

Intro

Apache Spark, as is, provides quite a lot of different capabilities and features, but it is missing one that I, as a self-proclaimed SDET, find pretty valuable: the comparison of data. I'm talking about comparing complex data and complex structures and generating a report that shows where the problem lies, which is more than just a plain true/false comparison.

The problem

The main problem that we are trying to solve is that when using standard solutions, running a comparison of sorts on a large dataset returns a basic true or false result, after which you then need to comb through all of the data and try to find the root cause.
There is no fast response. Fast feedback loops are essential, but that's for a different article. You also need some basic metrics about the dataset to be provided. Testing without proper results is putting your trust in hope, and hope alone cannot build your big data solutions.

The solution

For these reasons, my teammates from AbsaOSS and I have written a tool called Hermes. Hermes consists of three modules, one of which is a data comparison tool that works either as a Spark application or as a library and can compare any format supported by Apache Spark. The tool is written in Scala, so it should be possible to use it within any JVM application of your own. (I have even seen people using PySpark use our libraries, so it is not only JVM compatible. I am, however, not an expert on that, and I am not sure how "clean" a solution that is.)

In this article, I would like to give a brief overview of the features of this Spark comparison tool and how to use it as a Spark app. Usage as a library is a bit more complex, and I believe it deserves a full article of its own. Let me first explain who we are.

Who exactly is behind this project

AbsaOSS is an initiative of Absa Group Limited, a South African bank that wants to go open source. You want to standardize your data, move it from COBOL to something current or track what and how your data is handled? We do that (Enceladus), that (Cobrix) and that (Spline). And some other interesting stuff.

Hermes and all other projects are under the Apache License. Meaning, feel free to use it and contribute. The projects are active, and we spend almost our whole days on GitHub, so we are usually quite fast to respond. All of our projects are in some way currently used by ABSA in production.

Hermes' significant advantage is that even though it is used in production, it is quite young and still looking for ideas. It is still growing.

Current real-world usages are:

- A testing framework for the Enceladus project.
- A data check tool that gives us an assurance that new tools work as well as the old ones that are being decommissioned.

Hermes dataset comparison Features

This feature list should be the same for the people who use it as a library as for those that use it as a Spark app. The features are as follows:

- Can be used as an Apache Spark application or Spark library.
- Compares virtually any data type if you provide the needed library for the source type on the classpath. Spark already supports a lot of source types, but you might need to read Oracle, Hive or Avro. Just provide the application with proper packages, and you are good to go.
- JDBC, Spark and other packages are not packaged together with the application. They have a provided dependency. This allows us to keep the jar to 150 Kb and provide users with more flexibility.
- Can compare two different source types.
- Writes output as parquet (this is planned to be configurable, issue #72).
- Only compares data sets with the same schema. We have a complex schema comparison so the schema does not have to be aligned, but it has to be the same. (We have a plan for selective comparisons in the future.)
- Will write a _METRICS file at the end (this will be written next to the parquet), stating:
  - If you passed or failed
  - How many rows were processed
  - If any duplicate rows were found
  - Number of differences found
- Provides a precise path to what was wrong in the datasets, even if the structure is complex (arrays of structs and the like). This is written to the final parquet.
- Final parquet holds only the rows that were different.
- Prints summary to STDOUT.

Usage - Spark application

Disclaimer: I will try to cover all of the tool's functionalities, but I will be skipping over spark-submit configurations. That is beyond the scope of this text. I will also not cover how to set up your Hadoop and Apache Spark.

In this use case, I will try to show possibilities of Hermes's dataset comparison. This use case covers usage as a Spark application. For usage as a library, look forward to a second article.

To use Hermes's Dataset Comparison, you just need to know how to run spark-submit, the types of your data, their properties/options and where the data is located. Let's start with an easy example:

Example 1

spark-submit \
<spark-options> \
dataset-comparison-0.2.0.jar \
--new-format csv \
--new-header true \
--new-path /new/path \
--ref-format xml \
--ref-rowTag alfa \
--ref-path /ref/path \
--out-path /out/path \
--keys ID

Example 2

spark-submit \
<spark-options> \
dataset-comparison-0.2.0.jar \
--format xml \
--rowTag alfa \
--new-path /new/path \
--ref-path /ref/path \
--out-path /out/path \
--keys ID

Now, let's go over what these are. The job has one independent parameter, and that is --keys. Keys refers to the set of primary keys. You can provide either a single primary key or a number of keys as a comma-delimited list in the form ID1,ID2,ID3.

Next up is --out-path. For now, out-path can only be configured to specify the destination path for the parquet file which will contain the output differences and metrics. This is planned to change (#72), and it will have the same rules as --ref and --new prefixes.

Last and (probably) hardest to grasp are the --ref and --new parameters. These are only prefixes to the standard options of the Spark source type. Append -format to specify the source format (type), as in --ref-format. Append -path to set the input or output path, unless you are using the JDBC connector, in which case you use -dbtable. Any other option is passed the same way, prepended with the correct prefix (--ref or --new) depending on whether it belongs to the reference data or to the new data that you are testing.

These options can also be generalized. Taking a look at Example 2, it has only --format; no --new-format or --ref-format. This is because both source types are XML and both have the same rowTag.
In this case, there is no need to specify this twice. If both source types were XML but had different rowTags, then the --ref-rowTag and --new-rowTag options would need to be specified.
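
To make the prefix rule more concrete, here is a hedged sketch of a comparison where the reference data comes from a JDBC source and the new data is parquet. The options --ref-url and --ref-user are assumptions: they simply apply the prefix rule to Spark's standard JDBC options (url, dbtable, user) and are not copied from the Hermes documentation. The URL, user, table name and paths are made up. The script only assembles and prints the command, so nothing is submitted.

```shell
#!/bin/sh
# Sketch only: prints a dataset-comparison command instead of running it.
# Hypothetical values: the JDBC URL, user, table name and all paths.

# Join the primary keys into the comma-delimited form that --keys expects.
join_keys() (
  IFS=,
  echo "$*"
)

# Reference data from JDBC, new data from parquet.
# Add your own <spark-options> after spark-submit before running it for real.
build_compare_cmd() {
  echo spark-submit dataset-comparison-0.2.0.jar \
    --ref-format jdbc \
    --ref-url "jdbc:postgresql://db.example.com:5432/warehouse" \
    --ref-dbtable public.customers \
    --ref-user reader \
    --new-format parquet \
    --new-path /new/path \
    --out-path /out/path \
    --keys "$(join_keys "$@")"
}

build_compare_cmd ID1 ID2
```

Keeping the key list in one helper makes it harder to pass the keys as separate arguments by accident, since --keys takes a single comma-delimited value.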

After running this, just run hdfs dfs -ls /out/path and take a look at the results. If there were any differences, you should find a parquet file that has a new column added, called errCol. This error column is filled with paths highlighting the differences in your structure.
Its schema is (pretty simple):

root
 |-- errCol: array (nullable = true)
 |    |-- element: string (containsNull = true)

Summing-up

Hermes should be an easy-to-use testing tool and framework. Its dataset comparison module currently holds the most value, even outside of AbsaOSS, and I hope it can help you solve an issue or two. If you have any questions about this or any of our projects, just send us a message or create a Question issue on GitHub.

I am looking forward to your comments and see you in the next article - usage as a library.

Building Hadoop native libraries on Mac in 2019

Saša Zejnilović — Mon, 20 May 2019 15:37:17 +0000

TL;DR to be found at the end

Recently I got into a situation where I "needed" Hadoop native libraries. Well, when I say "needed", I mean I was just getting fed up with constant warnings like this one:

WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

So I thought I would build my own Hadoop native libraries. How hard can it be, right? Honest answer? Less than an hour if you don't have a tutorial. Fifteen minutes if you do, and most of that is compilation time. In my search, I found that a lot of tutorials and guides were either outdated or didn't offer everything needed for a full compilation and installation. That is why I wrote my own, which I tested on two independent Macs, so it should be "tested enough".

Why do it

There was no real-world issue I was hoping to solve. I just had a few minutes on my hands and I used them to learn something new. I did read, however, that native libraries bring speed improvements in some cases, which is good if you are developing or testing something locally, because local machines tend to be slow and any improvement is more than welcome. I also saw two articles a while back whose authors had some issues with the Java libraries, but the chances of you having the same issues are really small.

Dependencies

First of all, we need to install the dependencies for the build and I am including links so you can check what you are going to install exactly:

- gcc
- autoconf
- automake
- libtool
- cmake
- snappy
- gzip
- bzip2
- zlib
- wget
- openssl 1.0
  - 1.1 on Brew has an issue. More in the comments section. Thanks to @imasli
- protobuf 2.5.0

(Please note that I am skipping Maven, Java and other tools that I think you already have, as well as the Hadoop installation itself. If I am wrong, tell me and let's update the article. There is a beautiful article about Hadoop installation on Mac by Zhang Hao.)

For the installation of most of these, I will be using Homebrew. It's a good tool with a one-liner installation, and it takes very little time to become productive with it. The Homebrew website provides everything you need, so I am skipping its installation here.

If this is not your first time using Homebrew, update and upgrade your tools. If you have been using it for some time already and would like to keep some things at their current version, hold them back with brew pin <formula> before upgrading.

# Update
brew update
brew upgrade

# Then the installation
brew install wget gcc autoconf automake libtool cmake snappy gzip bzip2 zlib openssl

As you may have noticed, one of the listed dependencies is missing from the command above. Yes! It is protobuf, whose 2.5.0 version has been deprecated and can't be easily installed from Homebrew. So let's build our own. It's cleaner that way and much more fun than it sounds. We will first need to get it from GitHub and unarchive it somewhere. You can delete it right after, so you don't need a special folder structure.

wget https://github.com/google/protobuf/releases/download/v2.5.0/protobuf-2.5.0.tar.gz
tar -xzf protobuf-2.5.0.tar.gz
cd protobuf-2.5.0

Then comes the process of building and making sure everything went smoothly. It takes some time and I advise you to run it step by step to see and know what is happening. Some warnings here and there are normal so you can skip those.

./configure
make
make check
make install
# And just to check if everything is ok.
# This should print libprotoc 2.5.0
protoc --version

OpenSSL setup

Now we need to link OpenSSL by hand, because Homebrew refuses to link it and the compiler needs its headers. This is known behaviour, and the link has to be created by running ln.

cd /usr/local/include
ln -s ../opt/openssl/include/openssl .

This will solve an error that looks something like the output below.

[exec] -- Configuring incomplete, errors occurred!
[exec] See also /Users/user/github/hadoop/hadoop-tools/hadoop-pipes/target/native/CMakeFiles/CMakeOutput.log.
[exec] CMake Error at /usr/local/Cellar/cmake/3.14.3/share/cmake/Modules/FindPackageHandleStandardArgs.cmake:137 (message):
[exec]   Could NOT find OpenSSL, try to set the path to OpenSSL root folder in the
[exec]   system variable OPENSSL_ROOT_DIR (missing: OPENSSL_INCLUDE_DIR)
[exec] Call Stack (most recent call first):
[exec]   /usr/local/Cellar/cmake/3.14.3/share/cmake/Modules/FindPackageHandleStandardArgs.cmake:378 (_FPHSA_FAILURE_MESSAGE)
[exec]   /usr/local/Cellar/cmake/3.14.3/share/cmake/Modules/FindOpenSSL.cmake:413 (find_package_handle_standard_args)
[exec]   CMakeLists.txt:20 (find_package)
[exec]
[exec]

Building native libraries

And finally! The building of the libraries. Again, this will create a folder that you can delete in the end. Here is probably the first place you will need to modify something and that is the version of Hadoop you will be using.

git clone https://github.com/apache/hadoop.git
cd hadoop
# Change the version as needed
git checkout branch-<VERSION>
# And just package.
mvn package -Pdist,native -DskipTests -Dtar
# After build, move your newly created libraries.
cp -R hadoop-dist/target/hadoop-<VERSION>/lib $HADOOP_HOME

Setting up environment variables

Now the critical part, making your shell see the libraries. I don't know what kind of shell you are using, nevertheless, put this into your shell profile (.bashrc, .zshrc, etc.):

export HADOOP_OPTS="-Djava.library.path=${HADOOP_HOME}/lib/native"
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${HADOOP_HOME}/lib/native
export JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:${HADOOP_HOME}/lib/native

This will point everything to the right path and make it all fall into place. The last thing we need to do is check that everything is OK. By everything I mean almost everything, because bzip2 is acting up and I still have not found a way to solve that. When I do, I will update this article.

hadoop checknative -a

#The output should be something like this.
19/05/17 19:00:14 WARN bzip2.Bzip2Factory: Failed to load/initialize native-bzip2 library system-native, will use pure-Java version
19/05/17 19:00:14 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
Native library checking:
hadoop:  true /usr/local/Cellar/hadoop/2.7.5/lib/native/libhadoop.dylib
zlib:    true /usr/lib/libz.1.dylib
snappy:  true /usr/local/lib/libsnappy.1.dylib
lz4:     true revision:99
bzip2:   false
openssl: true /usr/lib/libcrypto.35.dylib
19/05/17 19:00:14 INFO util.ExitUtil: Exiting with status 1
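
If you would rather check this from a script than by eye, a small filter can pick out the libraries that were not loaded. This is only a sketch that assumes the output keeps the "name: true|false" layout shown above. The function name missing_native_libs is made up for this example.

```shell
#!/bin/sh
# Reads `hadoop checknative -a` output on stdin and prints the name of every
# library reported as "false". Log lines (WARN/INFO) are ignored because
# their first field does not end with a colon.
missing_native_libs() {
  awk '$1 ~ /:$/ && $2 == "false" { sub(/:$/, "", $1); print $1 }'
}

# Demo on a shortened copy of the output above; with Hadoop installed you
# would pipe the real command instead:
#   hadoop checknative -a 2>&1 | missing_native_libs
missing_native_libs <<'EOF'
19/05/17 19:00:14 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
Native library checking:
hadoop:  true /usr/local/Cellar/hadoop/2.7.5/lib/native/libhadoop.dylib
zlib:    true /usr/lib/libz.1.dylib
bzip2:   false
openssl: true /usr/lib/libcrypto.35.dylib
EOF
```

For the sample above, the only name printed is bzip2, which matches the one library that refuses to load.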

Afterword

Hopefully, everything is running smoothly and you no longer get those warnings. If I helped even one person with this, I am glad, because if there is no added value for the reader, then it is just me talking to my wall. On the other hand, if you did find some issues in the code or the article, please do tell me and I will fix everything I am capable of.

TL;DR

This is just a step-by-step shell script extracted from the text above.
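
A sketch of such a script, assembled only from the commands shown in the sections above. A few things are assumptions added for illustration: the function names, the dry-run switch (the script only prints the commands unless you set DRY_RUN=0) and the default HADOOP_VERSION of 2.7.5. It also assumes that the same version string names both the git branch and the folder under hadoop-dist/target, so adjust it if your build produces a different folder name.

```shell
#!/bin/sh
# Steps from the article collected in one place (sketch, dry run by default).
# Run with DRY_RUN=0 to actually execute the commands.
set -e
HADOOP_VERSION="${HADOOP_VERSION:-2.7.5}"   # assumption: change as needed
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

install_dependencies() {
  run brew update
  run brew upgrade
  run brew install wget gcc autoconf automake libtool cmake snappy gzip bzip2 zlib openssl
}

build_protobuf() {
  run wget https://github.com/google/protobuf/releases/download/v2.5.0/protobuf-2.5.0.tar.gz
  run tar -xzf protobuf-2.5.0.tar.gz
  run cd protobuf-2.5.0
  run ./configure
  run make
  run make check
  run make install
  run protoc --version   # should print libprotoc 2.5.0
  run cd ..
}

# Same relative link as in the OpenSSL section, created without changing directory.
link_openssl() {
  run ln -s ../opt/openssl/include/openssl /usr/local/include/openssl
}

build_hadoop_native() {
  run git clone https://github.com/apache/hadoop.git
  run cd hadoop
  run git checkout "branch-$HADOOP_VERSION"
  run mvn package -Pdist,native -DskipTests -Dtar
  run cp -R "hadoop-dist/target/hadoop-$HADOOP_VERSION/lib" "$HADOOP_HOME"
  run cd ..
}

# These lines belong in your shell profile (.bashrc, .zshrc, etc.).
print_profile_exports() {
  cat <<'EOF'
export HADOOP_OPTS="-Djava.library.path=${HADOOP_HOME}/lib/native"
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${HADOOP_HOME}/lib/native
export JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:${HADOOP_HOME}/lib/native
EOF
}

main() {
  install_dependencies
  build_protobuf
  link_openssl
  build_hadoop_native
  print_profile_exports
  # checknative exits with status 1 while bzip2 is not loaded, so ignore it.
  run hadoop checknative -a || true
}

main
```

Running the steps one by one, as advised earlier, is still the better way to see what is happening. The script is mostly useful as a checklist.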
