Hadoop/Spark is too heavy, esProc SPL is light

With the advent of the era of big data, data volumes keep growing. It is difficult and costly to scale up a database running on a traditional minicomputer to keep pace, making it hard to support business development. To cope with this problem, many users have turned to the distributed computing route, that is, using a cluster of multiple inexpensive PC servers to perform big data computing tasks. Hadoop/Spark is one of the most important software technologies on this route, and it is popular because it is open source and free. After years of application and development, Hadoop has been widely accepted; not only can it be used for data computing directly, but many new databases have also been built on top of it, such as Hive and Impala.

The heaviness of Hadoop/Spark

The goal of Hadoop is to serve clusters of hundreds of nodes. To this end, its developers implemented many complex and heavy functional modules. However, except for a few Internet giants, national telecom operators and large banks, most scenarios do not involve such a huge amount of data. As a result, it is common to see a Hadoop cluster of only a few or a dozen nodes. Because of this misalignment between goal and reality, Hadoop becomes a heavy product for many users, whether in technology, use or cost. Let's explain why Hadoop is heavy in these three aspects.

The heaviness of technology

If a cluster consisting of thousands of computers really exists, it is impossible to rely on manual operation for personalized management. Imagine listing all of these computers: there would be more than the operation and maintenance staff could even take in at a glance, let alone manage individually or assign tasks to. Besides, with so many machines running, various failures will inevitably occur from time to time. How, then, can the smooth execution of computing tasks be ensured? To solve these problems, the Hadoop/Spark developers wrote a great deal of code to implement functions such as automated node management, task distribution, and strong fault tolerance.

However, these functions themselves take up a lot of computing resources (CPU, memory, hard disk, etc.). Using them on a cluster of several to a dozen nodes is too heavy: such a cluster does not have much capacity to begin with, yet Hadoop itself occupies a considerable share of its resources, which is very uneconomical.

Beyond that, Hadoop's product line is very long. To run all these functional modules on one platform, the interdependencies among the various products have to be sorted out, and an all-encompassing, complex architecture has to be implemented. Although most scenarios use only one or two products from the line, users still have to accept this complex and heavy platform.

Spark, which appeared later, makes up for Hadoop's weakness in memory utilization. Here comes a question: can it make the technology lighter? Unfortunately, Spark goes to the other extreme. From the perspective of its theoretical model, Spark only considers in-memory computing. In particular, the RDD in Spark adopts an immutable mechanism: a new RDD is produced at each calculation step, resulting in heavy occupation and waste of memory space and CPU. Spark can hardly run without large memory, so it is still technically heavy.
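
To make this concrete, here is a minimal Spark sketch in Scala (the data and names are made up for illustration). Each transformation returns a new immutable RDD instead of modifying its parent, so when intermediate results are persisted, every step holds its own copy of the data:

import org.apache.spark.{SparkConf, SparkContext}
object RddImmutabilitySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-immutability").setMaster("local[*]"))
    // Each transformation returns a brand-new immutable RDD; none of them
    // modifies its parent in place.
    val raw      = sc.parallelize(1 to 1000000)   // RDD #1
    val doubled  = raw.map(_ * 2)                 // RDD #2, derived from #1
    val filtered = doubled.filter(_ % 3 == 0)     // RDD #3, derived from #2
    // Persisting intermediate RDDs materializes each of these results
    // separately in memory, which is where the extra consumption comes from.
    doubled.cache(); filtered.cache()
    println(filtered.count())
    sc.stop()
  }
}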

The heaviness of use

Hadoop is technically so complex that installation, operation and maintenance are very troublesome. Even when a cluster has only a few computers, it still has to use the node management, task distribution and fault tolerance functions designed for clusters with thousands of nodes, so you can imagine how difficult installation, configuration and debugging, as well as day-to-day operation, maintenance and management, will be.

Even if these difficulties are solved and Hadoop runs normally, you will run into even bigger trouble when writing big data calculation code. The core programming framework in Hadoop is MapReduce: programmers only need to write Map and Reduce actions to get a parallel program. This is indeed effective for simple problems such as summation and counting, but when complex business logic is involved, programming in MapReduce becomes very difficult. For example, the JOIN calculation, which is very common in business computing, is hard to implement in MapReduce, and so are many order-related operations.
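
To see why even a simple join is awkward in MapReduce, here is a small Scala sketch that simulates the three mechanical phases of a reduce-side join on plain collections. It is not Hadoop API code, and the tables and field names are invented for illustration:

// Simulates the map / shuffle / reduce phases of a reduce-side join on plain
// Scala collections, to show the steps MapReduce forces on the programmer.
object ReduceSideJoinSketch extends App {
  val orders   = Seq((101, "order-1"), (102, "order-2"), (101, "order-3")) // (productId, orderId)
  val products = Seq((101, "Widget"), (102, "Gadget"))                     // (productId, productName)
  // "Map" phase: tag every record with its source table so that the reduce
  // phase can tell the two sides apart.
  val mapped: Seq[(Int, (String, String))] =
    orders.map   { case (pid, oid)  => (pid, ("O", oid)) } ++
    products.map { case (pid, name) => (pid, ("P", name)) }
  // "Shuffle" phase: group all tagged records by the join key.
  val shuffled: Map[Int, Seq[(String, String)]] =
    mapped.groupBy(_._1).map { case (k, v) => k -> v.map(_._2) }
  // "Reduce" phase: split each group back into its two sources and emit
  // the joined rows.
  val joined = shuffled.toSeq.flatMap { case (pid, values) =>
    val orderIds = values.collect { case ("O", oid)  => oid  }
    val names    = values.collect { case ("P", name) => name }
    for (o <- orderIds; n <- names) yield (pid, o, n)
  }
  joined.foreach(println) // e.g. (101,order-1,Widget), (101,order-3,Widget), (102,order-2,Gadget)
}

A real MapReduce job has to hand-code this tagging, grouping and re-splitting itself, which is why complex business logic quickly becomes cumbersome.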

Spark's Scala has a certain ability to compute structured data, so would it be simpler to code in Scala? Unfortunately, Scala is hard to use and learn, and even harder to master, and writing complex operation logic in it remains difficult.

Since both MapReduce and Scala are difficult, the computing syntax of Hadoop/Spark has been returning to SQL. Hive is very popular because it converts SQL to MapReduce, and Spark SQL is used more widely than Scala. But while SQL is relatively simple for regular queries, it is still very cumbersome for multi-step procedural calculations or order-related operations, which require very complex UDFs. Moreover, even when a computing scenario can barely be implemented in SQL, the computing speed is often far from ideal and the performance is difficult to optimize.
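
As a small illustration of what falling back to custom code looks like, here is a hedged Scala sketch of registering a Spark SQL UDF. The tiny table, the columns and the window_end function are invented for illustration; real funnel-style UDFs are considerably more involved than this:

import java.sql.Timestamp
import org.apache.spark.sql.SparkSession
object UdfSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("udf-sketch").master("local[*]").getOrCreate()
    import spark.implicits._
    // A tiny event table, purely for illustration.
    Seq(("u1", Timestamp.valueOf("2021-01-10 08:00:00")),
        ("u2", Timestamp.valueOf("2021-01-11 09:30:00")))
      .toDF("gid", "etime").createOrReplaceTempView("events")
    // Any step the SQL dialect cannot express conveniently ends up as a
    // hand-written, hand-registered UDF like this one.
    spark.udf.register("window_end", (start: Timestamp, days: Int) =>
      new Timestamp(start.getTime + days * 86400000L))
    spark.sql("select gid, window_end(etime, 7) as deadline from events").show()
    spark.stop()
  }
}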

The heaviness of cost

Although the Hadoop software itself is open source and free, it is technically complex and difficult to use, resulting in a high overall cost.

As mentioned earlier, Hadoop itself consumes a lot of CPU, memory and hard disk resources, and Spark needs large memory to run normally. To cope with this, you have to purchase higher-configuration servers for Hadoop/Spark, which increases the hardware expense.

Because Hadoop/Spark is difficult to use, more personnel are required for installation, operation and maintenance to keep it running normally, and more developers are required to program the various complex business calculations. As a result, the cost of human resources increases.

Because it is so difficult to use, many users end up purchasing the non-free editions of Hadoop/Spark from commercial companies. Since the prices are quite high, this greatly increases the cost of software procurement.

Since Hadoop is so heavy, why do many users still choose it? The reason is simple: for the time being they cannot find an alternative; Hadoop is the only thing that can barely do the job, and at least it has the better-known reputation.

Thus, users can only install and configure the heavy Hadoop stack and endure the huge consumption of computing resources by Hadoop itself. A small-scale cluster does not have many servers to begin with, and Hadoop occupies a large share of their capacity, leaving a cluster with too few effective resources to run a calculation task that is already beyond it; you can imagine how slowly it will run. In short, Hadoop is expensive and laborious, yet its actual computing performance is far from ideal.

Isn't there another choice?

Lightweight choice

The open-source esProc SPL is a lightweight big data computing engine that adopts a new implementation technology and boasts the advantages of being light in technology, simple to use, and low in cost.

Light in technology

As mentioned at the beginning of this article, the growing amount of data makes traditional databases unable to keep up, so users have to turn to distributed computing technology. A deeper reason is that it is difficult to implement high-speed algorithms in SQL: the computing performance on big data can only rely on the database's optimization engine, and for complex calculations these optimizers can often do nothing.

Therefore, we should find ways to design more efficient algorithms rather than blindly pursue distributed computing. Following this idea, SPL provides many high-performance algorithms (many of them pioneered in the industry) and efficient storage schemes, so that under the same hardware environment it can obtain computing performance far exceeding that of a database. SPL installed on a single machine can accomplish many big data computing tasks, and its architecture is much simpler than that of a cluster, so it is naturally much lighter in technology.

SPL’s high-performance algorithms include:

[Figure: an overview of SPL's built-in high-performance algorithms and storage schemes]

For larger amounts of data, SPL provides a lightweight cluster computing capability. It is designed for clusters of a few to a dozen nodes and adopts a completely different implementation method from Hadoop.

An SPL cluster does not provide complex and heavy automated management functions. Instead, it lets you configure each node individually: programmers decide what data each node stores and what calculation each node performs, based on the characteristics of the data and the objective of the calculation. This not only greatly decreases the architectural complexity, but is also an important means of improving performance.

Let's take order analysis as an example. Suppose the order table is large, we want to associate its product number field with the primary key of the smaller product table, and then group and aggregate the order amount by product supplier. An SPL cluster can store the order table in segments on the hard disks of the nodes and read the smaller product table into the memory of every node. When calculating, each node only needs to associate its local order segment with the product data and then group and aggregate, which shortens the total calculation time; the aggregated results are then transmitted for a secondary aggregation. Since what is transmitted is already aggregated, the amount of data is small and the network transmission time is short. Overall, this scheme achieves excellent performance. Although the programmer has to do some more detailed work, the extra workload is not large for a small-scale cluster.

Also, SPL does not provide strong fault tolerance. Unlike Hadoop, SPL does not have to guarantee that every task is executed successfully in the case of node failure. In fact, most computing tasks finish within a few hours, and a cluster of a few or a dozen machines can generally run for a long time without failing frequently. Even if a task fails because of an occasional node failure, recalculating is acceptable; after all, this does not happen often. Therefore, SPL's fault tolerance only ensures that the whole cluster can continue to work and accept new tasks (including recalculations) when a few nodes fail, which greatly reduces the complexity of an SPL cluster.

In terms of in-memory computing, SPL does not use the immutable mechanism of Spark's RDD; instead it uses a pointer-style reuse mechanism. This mechanism accesses memory through addresses (pointers) and, as long as the data structure does not change, directly uses the addresses of the original data to form a result set. Since the data does not need to be copied at each calculation step, and the only extra cost is storing one more address (pointer), it reduces both CPU consumption and memory usage, making it much lighter than Spark at run time. Moreover, SPL improves the current algorithm system of external storage computing, reducing complexity and expanding its scope of application. Furthermore, SPL can combine in-memory and external storage calculations to fully improve computing performance without relying on large memory as Spark does.
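
As a rough illustration of the difference (a plain Scala sketch of the idea, not SPL's actual internals): a copy-style pipeline rebuilds record objects at every step, while a reference-style pipeline only keeps pointers to the original records:

object PointerReuseSketch extends App {
  final case class Order(id: Int, amount: Double)
  val source: Array[Order] = Array.tabulate(1000000)(i => Order(i, i * 1.5))
  // Copy style: every step constructs new record objects, so each
  // intermediate result duplicates the data it keeps.
  val copied: Array[Order] = source.filter(_.amount > 100).map(o => Order(o.id, o.amount))
  // Reference (pointer) style: the result set only stores references to the
  // original records; nothing is duplicated except a small array of pointers.
  val referenced: Array[Order] = source.filter(_.amount > 100)
  println(copied.length == referenced.length)               // same logical result
  println(referenced(0) eq source.find(_.amount > 100).get) // true: the very same record object
}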

Simple in use

Because SPL adopts lightweight technology, it is naturally easier to install, configure, run and maintain. Not only can SPL be used as an independent server, but it can also be easily integrated into applications that need high-performance computing; an instant query system, for example, only needs to import a few jars. In contrast, Hadoop is difficult to integrate into such applications and has to run outside them as a data source. When some temporary data needs to be handled right away, you can use SPL's desktop IDE to calculate visually and get the result quickly; if you tried to handle such temporary tasks by installing and deploying Hadoop, they might already be outdated by the time the Hadoop environment was built.
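
As a rough idea of what "importing a few jars" looks like in practice, here is a hedged Scala sketch that calls an SPL script from a JVM application through JDBC. The driver class name, JDBC URL and the script name orderStats are assumptions made for illustration; check the esProc documentation for the exact values used by your version:

import java.sql.DriverManager
object SplJdbcSketch extends App {
  // Assumed driver class and local-connection URL; verify them against the
  // esProc documentation for your version.
  Class.forName("com.esproc.jdbc.InternalDriver")
  val conn = DriverManager.getConnection("jdbc:esproc:local://")
  try {
    // "orderStats" is a hypothetical SPL script deployed with the application;
    // it is invoked like a stored procedure and returns an ordinary result set.
    val stmt = conn.prepareCall("call orderStats(?)")
    stmt.setString(1, "2021-01-10")
    val rs = stmt.executeQuery()
    while (rs.next()) println(rs.getString(1))
  } finally conn.close()
}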

The high-performance algorithm functions shown in the figure above also make programming big data calculations simple. Programmers can master these functions in a relatively short time, so the learning cost is low, and it is easy to meet various complicated computing requirements with these ready-made functions. SPL is therefore simpler than MapReduce/Scala, and also simpler than SQL.

Let's take the common funnel analysis of an e-commerce platform as an example. The SQL code for implementing a three-step funnel is roughly as follows:

with e1 as (
     select gid, 1 as step1, min(etime) as t1
     from T
     where etime >= to_date('2021-01-10', 'yyyy-MM-dd') and etime < to_date('2021-01-25', 'yyyy-MM-dd')
          and eventtype = 'eventtype1' and …
     group by gid
),
e2 as (
     select e1.gid, 1 as step2, min(e1.t1) as t1, min(e2.etime) as t2
     from T as e2
     inner join e1 on e2.gid = e1.gid
     where e2.etime >= to_date('2021-01-10', 'yyyy-MM-dd') and e2.etime < to_date('2021-01-25', 'yyyy-MM-dd')
          and e2.etime > e1.t1 and e2.etime < e1.t1 + 7
          and e2.eventtype = 'eventtype2' and …
     group by e1.gid
),
e3 as (
     select e2.gid, 1 as step3, min(e2.t1) as t1, min(e3.etime) as t3
     from T as e3
     inner join e2 on e3.gid = e2.gid
     where e3.etime >= to_date('2021-01-10', 'yyyy-MM-dd') and e3.etime < to_date('2021-01-25', 'yyyy-MM-dd')
          and e3.etime > e2.t2 and e3.etime < e2.t1 + 7
          and e3.eventtype = 'eventtype3' and …
     group by e2.gid
)
select
  sum(step1) as step1,
  sum(step2) as step2,
  sum(step3) as step3
from
  e1
  left join e2 on e1.gid = e2.gid
  left join e3 on e2.gid = e3.gid

We can see that it takes more than 30 lines of SQL, and the code is quite difficult to understand. Implementing the same task in MapReduce/Scala would be even harder. Even in SQL, the code depends on the number of funnel steps: every extra step requires one more sub-query.

In contrast, SPL is much simpler, and the following SPL code can handle any number of steps:

    A   B
1   =["etype1","etype2","etype3"]   =file("event.ctx").open()
2   =B1.cursor(id,etime,etype;etime>=date("2021-01-10") && etime<date("2021-01-25") && A1.contain(etype) && …)
3   =A2.group(id).(~.sort(etime))   =A3.new(~.select@1(etype==A1(1)):first,~:all).select(first)
4   =B3.(A1.(t=if(#==1,t1=first.etime,if(t,all.select@1(etype==A1.~ && etime>t && etime<t1+7).etime, null))))
5   =A4.groups(;count(~(1)):STEP1,count(~(2)):STEP2,count(~(3)):STEP3)

The cluster calculation code of SPL is also very simple. Take the order analysis mentioned above as an example: we want to store the large order table in segments on four nodes, load the small product table into the memory of each node, and then, after associating the two tables, group and aggregate the order amount by product supplier. The SPL code is:

    A   B
1   ["192.168.0.101:8281","192.168.0.102:8281",…, "192.168.0.104:8281"]
2   fork to(4);A1   =file("product.ctx").open().import()
3                   >env(PRODUCT,B2)
4   =memory(A1,PRODUCT)
5   =file("orders.ctx":to(4),A1).open().cursor(p_id,quantity)
6   =A5.switch(p_id,A4)
7   =A6.groups(p_id.vendor;sum(p_id.price*quantity))

When this code executes, the computing resources required for task management (in-memory loading, task splitting, merging, etc.) are far less than those consumed by the association, grouping and aggregation themselves. The task management function is so light that it can be executed on any node, or even in the IDE.

Low in cost

Like Hadoop, SPL is open source and free. Unlike Hadoop, however, the overall cost of SPL is very low, for the following two reasons.

One is that the cost of human resources is reduced. For one thing, the installation, configuration, operation and maintenance of SPL are very easy, which greatly cuts the related labor cost. For another, SPL reduces the programming difficulty of big data computing, so programmers can implement various complicated calculations easily and development efficiency is significantly improved, which saves on programmer costs.

The other is that the hardware cost is reduced. The SPL technology stack is very light and the system itself occupies very little CPU, memory and hard disk resources, so more resources can be devoted to business computing and hardware utilization is greatly improved. In addition, SPL does not rely on large memory as Spark does. Overall, the cost of hardware procurement is greatly reduced.

Light and fast SPL

Because SPL is light in technology, consumes few resources itself, and provides many high-performance algorithms, it performs better than Hadoop/Spark on clusters of a few or dozens of machines, or even on a single machine.

Case 1: Funnel analysis and calculation of an e-commerce business.

Spark: 6 nodes with four CPU cores each, average computing time: 25 seconds.

SPL: a single machine using 8 threads, average computing time: 10 seconds. The amount of code is only half that of Spark Scala.

Case 2: Analyze the user profile of a large bank.

An OLAP server on Hadoop: virtual machine with 100 CPU cores, computing time: 120 seconds.

SPL: virtual machine with 12 CPU cores, computing time: only 4 seconds. Counting both the shorter time (30 times faster) and the roughly 8-times-fewer CPU cores used, the performance is improved by about 250 times.

Case 3: Query the details of current accounts via the mobile banking app of a commercial bank. The amount of data is large and the level of concurrency is high.

A commercial data warehouse based on Hadoop: because a second-level response speed could not be achieved under high concurrency, the warehouse had to be replaced with a 6-node ES cluster.

SPL on a single machine: it achieves the same response speed as the 6-node ES cluster under the same level of concurrency.

In summary, Hadoop/Spark originated as the heavyweight solution of top Internet enterprises and suits very large enterprises that need to deploy very large clusters. In many other scenarios the amount of data is not small, but it is far smaller than at those giants, and there are not nearly as many hardware devices or maintenance personnel; a small cluster, or even a single machine, is fully capable of handling it. In such cases, SPL, the lightweight big data computing engine, is the first choice: it delivers light technology, simple use, higher development efficiency and higher performance at a very low cost.

Origin: https://blog.scudata.com/hadoop-spark-is-too-heavy-esproc-spl-is-light/
SPL source code: https://github.com/SPLWare/esProc
