Looking at Challenges of Big Data Testing with Hadoop
Jasmine Morgan Dec 21, 2017
Have you ever imagined how a baggage sorting system works in a busy airport? Have you ever thought about how your GPS knows which part of the road is crowded? Do you ever wonder how social media companies make sure every post is processed fast and displayed to those most likely to interact with it? These are just a few examples of why Big Data testing is important. You could even go a bit further and think about what the world would look like if the information created every second were not adequately tested.
The Need to Test Big Data
The term Big Data is an umbrella concept that denotes any set of records too large to be stored and analyzed on a single machine. Analyzing such data has become a way to make better decisions, personalize content, automate responses and unveil hidden patterns that could offer a competitive advantage.
It differs from regular large databases in that most of the data doesn’t follow the structure of conventional relational databases and can’t be laid out in a table.
It comes mostly as unstructured (video, audio, maps) or semi-structured (CSV files, XML, JSON) data. This makes traditional testing methods inappropriate and inaccurate, and calls for specific ways to deal with accuracy and quality assurance.
How Is Big Data Testing Different?
As listed by software testing company A1QA, software testing includes various aspects related to data and its use:
● Functional testing
● Performance testing
● Security testing
● Compatibility testing
● Usability testing
These aspects are more about the use of software. Meanwhile, the most significant issue in Big Data testing is the quality of the data itself, so the focus is on validating it: checking that it is complete, that it has no missing records, and that it is ready to be transformed. All of this is necessary before the data is analyzed.
The testing process consists of three distinct stages: data validation (pre-Hadoop), checking the MapReduce logic, and validation of the output.
The problem with Big Data validation is that the tools used until now, namely the minus query and sampling methods, do not scale to these volumes.
The minus query method, applied to SQL databases, compares each row in the source set with every row in the destination, which means significant resource consumption. Multiply this by two, since the comparison must be performed both ways (source-target and target-source), and you are already falling behind. Moreover, if the data is unstructured, like CCTV footage, the method cannot be applied at all.
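To make the two-way comparison concrete, here is a minimal sketch using Python's built-in sqlite3 module. MINUS is Oracle's keyword; SQLite and standard SQL call the same operator EXCEPT. The table names `src` and `dst`, the sample rows and the helper `minus_both_ways` are made up for illustration.

```python
import sqlite3


def minus_both_ways(conn, source, target):
    """Run the minus (EXCEPT) query in both directions.

    Table names are interpolated directly, so they must be trusted
    identifiers, never user input.
    """
    cur = conn.cursor()
    only_in_source = cur.execute(
        f"SELECT * FROM {source} EXCEPT SELECT * FROM {target}"
    ).fetchall()
    only_in_target = cur.execute(
        f"SELECT * FROM {target} EXCEPT SELECT * FROM {source}"
    ).fetchall()
    return only_in_source, only_in_target


# Hypothetical source and destination tables with two discrepancies.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE dst (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO src VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])
conn.executemany("INSERT INTO dst VALUES (?, ?)", [(1, "a"), (2, "B"), (4, "d")])

missing, unexpected = minus_both_ways(conn, "src", "dst")
print("In source but not in target:", sorted(missing))
print("In target but not in source:", sorted(unexpected))
```

Each direction is a full pass over both tables, which is why the cost becomes a problem long before the data reaches Big Data scale.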
The sampling method (“stare and compare”) is also a waste of time and can’t be used for datasets that have millions of rows and are, in some cases, updated every second. For Big Data, automation is necessary, and manual sampling for accuracy should be eliminated.
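One possible way to automate such a check, offered as a sketch rather than a standard tool, is to reduce each dataset to a record count plus an order-independent digest and compare those instead of eyeballing rows. The function `fingerprint` and the sample records below are hypothetical.

```python
import hashlib

MODULUS = 1 << 256  # SHA-256 digests are 256 bits wide


def fingerprint(records):
    """Return (record_count, digest_sum) for an iterable of records.

    Digests are added modulo 2**256, so the result does not depend on
    the order in which records arrive.
    """
    count = 0
    combined = 0
    for record in records:
        digest = hashlib.sha256(repr(record).encode("utf-8")).digest()
        combined = (combined + int.from_bytes(digest, "big")) % MODULUS
        count += 1
    return count, combined


source = [("1", "alice"), ("2", "bob"), ("3", "carol")]
target_ok = list(reversed(source))   # same records, different order
target_bad = source[:2]              # one record went missing

print(fingerprint(source) == fingerprint(target_ok))
print(fingerprint(source) == fingerprint(target_bad))
```

Because the function consumes any iterable, each side can be streamed and fingerprinted independently, then only the two small results need to be compared.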
Challenges & Solutions
Since Big Data is analyzed through distributed and parallel systems, the problems that arise are linked to processing the entire volume of data, doing so in a reasonable time and safely, and ensuring that all the records are transmitted. The challenges here arise from the 3Vs that define Big Data: volume, velocity, and variety.
The Problem with Scale
The first V, volume, requires the system to be able to handle large incoming data streams. To be ready for this, the best approach is to use a distributed file system, such as HDFS, the storage layer of Hadoop. The great advantage of this framework is that it requires few changes to go from a few processing nodes to hundreds. Therefore, it is enough to have the testing logic in place; the same logic can then be run as the cluster grows.
The system includes redundancy, since each piece of data is replicated across several nodes to prevent information loss in case some of the network nodes fail. How much failure it can absorb depends on the replication factor: with the HDFS default of three copies per block, a block is lost only if all the nodes holding its replicas fail before it can be copied elsewhere. Additional problems here may come from synchronization between nodes.
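For a feel of how replication and node failures interact, here is a back-of-the-envelope estimate. It is a deliberately simplified model and not a description of how HDFS actually places replicas: it assumes each block has `replicas` copies on distinct nodes chosen uniformly at random, and that `failed` nodes go down at the same moment, before any copy can be re-created. The function name is hypothetical.

```python
from math import comb


def block_loss_probability(nodes, failed, replicas=3):
    """Probability that one block loses all of its replicas.

    Simplified model: replicas sit on `replicas` distinct nodes picked
    uniformly at random, and `failed` nodes fail simultaneously.
    """
    if not 0 <= failed <= nodes or not 1 <= replicas <= nodes:
        raise ValueError("need 0 <= failed <= nodes and 1 <= replicas <= nodes")
    # The block is lost only if every replica sits on a failed node.
    return comb(failed, replicas) / comb(nodes, replicas)


for failed in (2, 10, 50):
    p = block_loss_probability(nodes=100, failed=failed)
    print(f"{failed} of 100 nodes down: P(block lost) = {p:.6f}")
```

With fewer failed nodes than replicas, the estimate is exactly zero, which is the whole point of replication. Real clusters also re-replicate blocks after a failure and spread copies across racks, which this sketch ignores.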
The Problem with Speed
A system is only valuable if it performs fast. Parallelism can ensure the result is delivered promptly by spreading the task over multiple CPUs. This requires dividing the information in such a way that computations can be performed on segments without affecting the result. If two different nodes need the same information, it can be replicated or serialized.
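The idea of splitting data so that segments can be processed independently can be sketched with a small word count. This is only an illustration of the principle: threads stand in for cluster nodes to keep the example short, and the helper names are made up.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split(items, parts):
    """Cut a list into at most `parts` contiguous chunks."""
    size = max(1, -(-len(items) // parts))  # ceiling division, at least 1
    return [items[i:i + size] for i in range(0, len(items), size)]


def count_words(lines):
    """Count words in one segment; needs no data from other segments."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def parallel_word_count(lines, workers=4):
    """Count each chunk separately, then merge the partial results."""
    chunks = split(lines, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_words, chunks))
    total = Counter()
    for partial in partials:
        total.update(partial)  # adding counts is order-independent
    return total


lines = ["big data testing", "data testing with hadoop", "hadoop data"]
# A useful test: the partitioned result must equal the serial one.
print(parallel_word_count(lines, workers=2) == count_words(lines))
```

The comparison on the last line is the kind of check a tester would automate: whatever the number of segments, the merged result has to match the result of a single pass.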
The tool used to solve this is the NoSQL approach, which suits Big Data because it is well adapted to unstructured records. To speed up searches in such a store, indexing is necessary, and a NoSQL database such as Cassandra provides it.
The Problem with Security
The rising problem of cyberterrorism makes security testing an integral component of any testing suite. Any vulnerability in the infrastructure gathering the Big Data, such as Wi-Fi or sensors, could be exploited to get access to the data lake and compromise the organization’s records. Hadoop instances are known to be quite insecure, so a penetration test is required.
Verifying Accuracy & Compatibility
Since in a parallel computing system some nodes can fail, it should be an integral part of testing to check that data is replicated correctly across the nodes, processed appropriately in each node, and that the result is transmitted in due time to the master node of the Hadoop system.
Due to the varied nature of the data, testing should also check compatibility between the inputs, the system’s capacity to analyze those types of data, and the outputs.
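A compatibility check of this kind might, for example, parse each input format into a common record shape and flag records that the downstream analysis could not handle. The sketch below uses only the Python standard library; the required fields, sample payloads and function names are all hypothetical.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

REQUIRED_FIELDS = {"id", "city"}  # hypothetical schema expected downstream


def parse_csv(text):
    return list(csv.DictReader(io.StringIO(text)))


def parse_json(text):
    return json.loads(text)


def parse_xml(text):
    # Assumes one child element per record, one grandchild per field.
    root = ET.fromstring(text)
    return [{field.tag: field.text for field in row} for row in root]


def incompatible_records(records):
    """Return the records that lack one or more required fields."""
    return [r for r in records if not REQUIRED_FIELDS <= set(r)]


csv_text = "id,city\n1,Paris\n"
json_text = '[{"id": "2", "city": "Rome"}, {"id": "3"}]'
xml_text = "<rows><row><id>4</id><city>Oslo</city></row></rows>"

records = parse_csv(csv_text) + parse_json(json_text) + parse_xml(xml_text)
print(len(records), "records,", len(incompatible_records(records)), "incompatible")
```

The same pattern extends to the output side: parse what the system produced and run it through the check that its consumers depend on.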
Big Data testing should address each of the problems raised by the 3Vs in order to create the fourth: value. There are significant differences between standard software testing and data testing related to infrastructure, tools, processes and existing know-how. To cope with the challenges posed by Big Data, testers need to use parallel computing, automate testing and keep in mind issues related to data privacy and security. A clear strategy that covers database and infrastructure testing first, followed by performance and functional testing, also helps.