Tim Udoma
Managing Data at Scale

Imagine you are on a summer vacation at an exquisite resort. You constantly make sure your phone or camera is charged because you take lots of pictures and videos to preserve the memories. After a few days, you run out of disk space because of the countless high-definition pictures and videos you have taken. As a solution, you move files from your phone to your favorite cloud provider (maybe Google Drive or iCloud), and the problem is solved. In reality, while you have freed up space on your device, you have simply consumed space on another machine. To get a sense of how much data you may have shared on the internet over time, open the network settings on your phone and look for data usage. At the time of this writing, mine was 520 GB. Remarkably, I am just one among five billion internet users who collectively generate over 2.5 billion gigabytes each day.

Traditional computers are no longer the sole generators of data. Video game consoles, smart watches, CCTV networks, mobile phones, and weather sensors all generate data of varying sizes and types. For some devices, the data must be streamed at real-time or near real-time speeds. Unfortunately, now that we have nearly reached the limit of how many transistors can fit on a single CPU chip, how can processing power be scaled to match the exponential increase in data? More importantly, how will all of this data be stored for future retrieval?

It is a common saying that united we stand, divided we fall. Just as tiny ants can consume a dead animal many times their size by swarming together, massive datasets can be processed and analyzed by spreading the work across multiple computers (nodes). This is referred to as parallel processing. Furthermore, data storage in these systems uses a file system different from that of a regular operating system: it allocates storage in much larger units, called chunks or blocks.
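To make the idea concrete, here is a minimal sketch in plain Python (not Hadoop itself) that splits a dataset into chunks and lets several worker processes handle them in parallel before combining the partial results:

```python
from multiprocessing import Pool

def process_chunk(chunk):
    # Each "node" would run something like this on its own slice of the data.
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))   # pretend this is a massive dataset
    chunk_size = 100_000            # analogous to a storage/processing chunk
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

    with Pool(processes=4) as pool:          # four workers in parallel
        partial_sums = pool.map(process_chunk, chunks)

    total = sum(partial_sums)                # combine the partial results
    print(total)
```

The same divide, process, and combine pattern is what distributed systems like Hadoop apply at cluster scale.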

A popular technology for handling data at massive scale is Apache Hadoop. Hadoop enables the distributed storage and processing of large data sets across a cluster of computers, providing high availability and fault tolerance. It achieves this using four key components: Hadoop Common, the Hadoop Distributed File System (HDFS), Yet Another Resource Negotiator (YARN), and MapReduce.

Hadoop Common is the base layer of shared libraries on which the other three components depend. HDFS is the file system responsible for efficient distributed storage. Recall that big data calls for a larger storage allocation unit than a regular OS file system provides; HDFS supplies exactly that. It splits files into large blocks (64 MB by default in older versions, 128 MB in recent ones), and each block is replicated three times across the cluster by default. For example, to store a 1 GB file in a six-node Hadoop cluster with a block size of 100 MB, HDFS breaks the file into roughly ten blocks and keeps three copies of each block spread across the cluster. Both the block size and the replication factor can be configured to suit the needs of the system. However, avoid choosing a block size much larger than your average file: files far smaller than the block size lead to the well-known small file problem, where metadata overhead and inefficient I/O drag performance down.

Within a cluster, HDFS designates one machine as the master, called the NameNode. The NameNode keeps track of which files exist and where each of their blocks is stored, while the worker machines (DataNodes) hold the actual data. YARN handles job scheduling and resource allocation across the cluster; its ResourceManager took over the task-assignment duties of the older JobTracker from Hadoop 1. Finally, MapReduce coordinates the data processing itself in two essential phases. First, it breaks a large dataset into smaller pieces that each node processes independently (Map). Then it gathers and combines the intermediate outputs into a final result (Reduce).
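To see the Map and Reduce phases in action without a cluster, here is a toy word count, the canonical MapReduce example, simulated locally in Python rather than with Hadoop's actual Java API. Each worker process plays the role of a data node counting words in its own slice of the input:

```python
from collections import Counter
from multiprocessing import Pool

def map_phase(lines):
    # Map: each node counts words in its own slice of the input.
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

def reduce_phase(partial_counts):
    # Reduce: merge the per-node counts into a single result.
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total

if __name__ == "__main__":
    corpus = [
        "the quick brown fox",
        "the lazy dog",
        "the quick dog jumps",
        "a brown dog sleeps",
    ]
    # Split the "dataset" into two chunks, much as HDFS splits a file into blocks.
    chunks = [corpus[:2], corpus[2:]]

    with Pool(processes=2) as pool:
        partial = pool.map(map_phase, chunks)

    print(reduce_phase(partial).most_common(3))
```

In a real Hadoop job the intermediate results are shuffled between nodes over the network, because the data never fits on a single machine.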

Aside from the primary components of Hadoop, other open-source utilities exist to improve and extend its capabilities. Sqoop facilitates importing and exporting data to and from relational databases; Mahout provides a library of machine learning algorithms for predictive analysis; Kafka offers distributed data streaming; Hive allows querying data from different data stores, even flat files, using the SQL-like Hive Query Language (HQL); and HBase serves as a database for storing large volumes of real-time data.
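To give a flavour of HQL, the sketch below queries a Hive table from Python using the third-party PyHive client. The host, credentials, and the page_views table are assumptions made purely for illustration:

```python
from pyhive import hive  # third-party client: pip install pyhive

# Connection details are assumptions; point these at your own HiveServer2 instance.
conn = hive.Connection(host="localhost", port=10000, username="analyst")
cursor = conn.cursor()

# HQL reads almost exactly like SQL; "page_views" is a hypothetical table.
cursor.execute(
    "SELECT country, COUNT(*) AS visits "
    "FROM page_views "
    "GROUP BY country "
    "ORDER BY visits DESC "
    "LIMIT 10"
)

for country, visits in cursor.fetchall():
    print(country, visits)

conn.close()
```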

Other databases exist to store petabyte-scale data and support Online Analytical Processing (OLAP), a technique for processing and analyzing data, such as Cassandra and MongoDB. Like HBase, Cassandra is a column-oriented database. However, it has no master node, which removes the single point of failure that HBase inherits from its reliance on Hadoop. While MongoDB offers the best read performance of the three databases listed, it is less suited to large volumes of real-time data. Cassandra, by contrast, outperforms the others in write speed, making it a good fit for storing logs, for example. Cassandra also tends to perform better as a cluster grows beyond roughly 24 nodes, which makes it a strong choice for backing a large distributed sensor system.
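As a rough sketch of the kind of write-heavy, log-style workload Cassandra handles well, here is a hypothetical example using the DataStax cassandra-driver for Python. The contact point, keyspace, and table are assumptions:

```python
import uuid
from datetime import datetime, timezone

from cassandra.cluster import Cluster  # third-party: pip install cassandra-driver

# Contact point, keyspace, and table below are assumptions for illustration.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS logs
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS logs.events (
        source text,
        event_time timestamp,
        event_id uuid,
        message text,
        PRIMARY KEY (source, event_time, event_id)
    )
""")

# A prepared statement keeps high-volume log writes cheap.
insert = session.prepare(
    "INSERT INTO logs.events (source, event_time, event_id, message) VALUES (?, ?, ?, ?)"
)
session.execute(
    insert,
    ("sensor-42", datetime.now(timezone.utc), uuid.uuid4(), "temperature reading: 21.7C"),
)

cluster.shutdown()
```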

If you learned something new from this article, please like and share!
