<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: jbx1279</title>
    <description>The latest articles on DEV Community by jbx1279 (@jbx1279).</description>
    <link>https://dev.to/jbx1279</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F991271%2Fb391581c-368a-4c66-bf96-2c0c6278d4f5.jpg</url>
      <title>DEV Community: jbx1279</title>
      <link>https://dev.to/jbx1279</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jbx1279"/>
    <language>en</language>
    <item>
      <title>Are There “Queries over Trillion-Row Tables in Seconds”? Is “N-Times Faster Than ORACLE” an Exaggeration?</title>
      <dc:creator>jbx1279</dc:creator>
      <pubDate>Sat, 13 Apr 2024 11:43:46 +0000</pubDate>
      <link>https://dev.to/jbx1279/are-there-queries-over-trillion-row-tables-in-seconds-is-n-times-faster-than-oracle-an-exaggeration-3033</link>
      <guid>https://dev.to/jbx1279/are-there-queries-over-trillion-row-tables-in-seconds-is-n-times-faster-than-oracle-an-exaggeration-3033</guid>
      <description>&lt;p&gt;We often hear about the advertisements for the performance of a big data product, saying that it is capable of running “queries over trillion-row tables in seconds”, which means they can get and return data meeting the specified condition from one trillion rows in seconds.&lt;/p&gt;

&lt;p&gt;Is this true?&lt;/p&gt;

&lt;p&gt;You probably will not think so if you have read the article &lt;a href="https://c.scudata.com/article/1703037423031"&gt;How Much Is One Terabyte of Data?&lt;/a&gt;. One trillion rows of data, which amounts to dozens or even a hundred terabytes, would require tens of thousands, even hundreds of thousands, of hard disks to process. This is almost impracticable.&lt;/p&gt;

&lt;p&gt;However, “queries in seconds” does not necessarily mean a full traversal. If the query uses a unique identifier (such as a phone number, though there are not one trillion of those), the amount of computation based on an index is small: about forty retrievals and comparisons. A contemporary computer can handle that effortlessly within one second, even under heavy concurrency. The computational complexity of this kind of target-search task is logarithmic, so when the data volume grows from one million rows to one trillion rows, the computation the search involves merely doubles. Almost any database can handle such a search as long as it can accommodate the data; it is not surprising at all. Of course, creating the index will be very slow, but that is another matter.&lt;/p&gt;
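&lt;p&gt;The “about forty” figure is just the base-2 logarithm of one trillion. A minimal Python sketch (illustrative, not tied to any particular database) checks the arithmetic with a binary search:&lt;/p&gt;

```python
import math
from bisect import bisect_left

# Index lookups are logarithmic: going from a million rows to a trillion rows
# roughly doubles the number of comparisons, from about 20 to about 40.
for rows in (10**6, 10**12):
    print(f"{rows:,} rows -> about {math.ceil(math.log2(rows))} comparisons")

# A binary search over one million sorted keys behaves exactly this way.
keys = list(range(0, 2_000_000, 2))   # 1,000,000 sorted even numbers
pos = bisect_left(keys, 1_337_336)    # ~20 probes, not a scan of a million rows
assert keys[pos] == 1_337_336
```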

&lt;p&gt;Yet what happens when the computation requires traversal? An index is useless for tasks such as calculating the sum of a column’s values.&lt;/p&gt;

&lt;p&gt;Well, achieving “queries in seconds” for such a task on trillion-row tables is impractical, but it is possible on TB-level data. If the table has one hundred columns, an aggregation on one column only needs to retrieve about 1% of the data, roughly 10GB. Ten hard disks are enough to scan that in seconds, and such a configuration, or even ten times that, is easy to achieve on contemporary server clusters.&lt;/p&gt;
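&lt;p&gt;The back-of-envelope arithmetic can be sketched as follows (the table size, column count, and per-disk throughput are assumed figures for illustration):&lt;/p&gt;

```python
# Back-of-envelope check of a one-column aggregation on a 1 TB, 100-column
# table, assuming 200 MB/s sequential throughput per disk (illustrative).
table_gb = 1000
columns = 100
scan_gb = table_gb / columns          # one column is ~1% of the data: ~10 GB
disks, mb_per_s = 10, 200

seconds = scan_gb * 1024 / (disks * mb_per_s)
print(f"scanning {scan_gb:.0f} GB on {disks} disks takes ~{seconds:.1f} s")
```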

&lt;p&gt;So this is probably what “being able to handle TB-level data in seconds” really means. Every database product can do it.&lt;/p&gt;

&lt;p&gt;Some database products like to compare themselves with famous old-brand databases such as Oracle. We often hear statements that a product is N times faster than Oracle, which sounds like bragging: if the world-class benchmark product could be outrun by N times that easily, it would not be Oracle.&lt;/p&gt;

&lt;p&gt;Well, they are not bragging.&lt;/p&gt;

&lt;p&gt;Oracle is mainly intended for transaction processing (TP) rather than analytical processing (AP). A TP database typically uses row-based storage, while an AP database usually uses columnar storage. Suppose an operation involves only two of the one hundred columns in a table: the row-based Oracle essentially has to read through all one hundred columns, whereas a columnar AP database reads only the two targeted columns. The amount of data retrieved is dozens of times smaller, so it is only natural that the computation is N times faster. That should not be a surprise; if the computation were not N times faster, that would be the real problem.&lt;/p&gt;
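&lt;p&gt;The effect is easy to quantify. Counting the values each layout must retrieve for a two-column query over a hundred-column table (sizes assumed for illustration):&lt;/p&gt;

```python
# Row storage reads whole rows; columnar storage reads only the columns
# the query touches. Count the values each layout must retrieve.
rows, total_cols, used_cols = 1_000_000, 100, 2

row_store_reads = rows * total_cols       # every column of every row
column_store_reads = rows * used_cols     # only the two targeted columns

ratio = row_store_reads // column_store_reads
print(f"row store reads {ratio}x more values than the column store")
```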

&lt;p&gt;Apart from using columnar storage instead of row-based storage, other strategies include adopting clusters instead of single machines, in-memory techniques instead of external storage, and so on. In a word, these products run faster because they use N times more resources. They outpace Oracle, but they do not really outsmart it.&lt;/p&gt;

&lt;p&gt;That’s probably what “being N times faster than Oracle” means. The speed is a fact but isn’t worth boasting about.&lt;/p&gt;

&lt;p&gt;In fact, Oracle’s optimizer is powerful. Without columnar storage and the support of plentiful resources, many specialized AP databases would not necessarily run faster than Oracle, particularly the Hadoop-based technologies.&lt;/p&gt;

&lt;p&gt;Above, we mentioned the power of logarithmic-level algorithms, which shrink the computation over one trillion rows down to a few dozen comparisons. The opposite situation also exists: a seemingly small volume of data accompanied by an astronomical amount of computation. On the SPL forum there is a computing scenario from the National Astronomical Observatories. The data involves eleven tables of only 500,000 rows each, less than 10GB in total, yet it takes a distributed database 3.8 hours and 100 CPUs to finish the computation. Its complexity is multiplication-level, with a total computation amount of 10*500,000*500,000 = 2.5 trillion, so getting the job done in that period of time is already pretty satisfactory.&lt;/p&gt;

&lt;p&gt;Therefore, when examining the performance of big data products, we cannot simply go by slogans about how fast they run over huge volumes of data. Those statements may be true, but they are meaningless. With an efficient algorithm, computations on 100TB can be executed in seconds; without one, computations on 10GB can take N hours.&lt;/p&gt;

&lt;p&gt;In this sense, the key criterion when examining the performance of a big data technology is whether it provides distinguished, efficient algorithms that reduce the amount of computation.&lt;/p&gt;

&lt;p&gt;Under the SQL system, however, a bad product can be extremely bad, while a good one is at best decent. SQL has been developed for decades, its mature optimization algorithms are known throughout the industry, and it is difficult for the language to produce anything new. As a result, vendors who lack the optimization techniques make extremely bad products, and those who have mastered them can only build mediocre ones; after all, they can only copy the techniques from one another.&lt;/p&gt;

&lt;p&gt;By breaking out of the SQL system, however, it is possible to invent new algorithms. For the Observatories scenario mentioned above, there is in fact an algorithm that logarithmizes one side of the multiplication. The computation amount becomes about 10*500,000*log500,000*2, roughly 10,000 times less. Using this algorithm, esProc SPL finishes the computation in 2 minutes on a 4-core laptop.&lt;/p&gt;
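&lt;p&gt;The claimed reduction can be checked arithmetically (a sketch of the computation amounts only, not of the algorithm itself):&lt;/p&gt;

```python
import math

n, tables = 500_000, 10

multiplication_level = tables * n * n            # ~2.5 trillion operations
logarithmized = tables * n * math.log2(n) * 2    # one side reduced to log n

print(f"reduction: ~{multiplication_level / logarithmized:,.0f}x")
```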

&lt;p&gt;Unfortunately, SQL cannot express this algorithm. Merely describing the computing logic correctly is laborious: the code runs to KBs, the database optimizer is rendered useless, and the execution takes N solid hours.&lt;/p&gt;

</description>
      <category>sql</category>
      <category>performance</category>
      <category>bigdata</category>
      <category>database</category>
    </item>
    <item>
      <title>How come there are tens of thousands of tables in a database</title>
      <dc:creator>jbx1279</dc:creator>
      <pubDate>Sat, 23 Mar 2024 10:45:29 +0000</pubDate>
      <link>https://dev.to/jbx1279/how-come-there-are-tens-of-thousands-of-tables-in-a-database-57m4</link>
      <guid>https://dev.to/jbx1279/how-come-there-are-tens-of-thousands-of-tables-in-a-database-57m4</guid>
      <description>&lt;p&gt;Many large databases accumulate a large number of data tables after running for many years, with severe cases reaching tens of thousands, making the database very bloated. These data tables are often many years old, for some of which the construction reasons may have been forgotten or they may no longer be useful, but they are difficult to confirm and you dare not delete them. This brings a huge burden to the operation and maintenance work. Along with these tables, there are still a large number of stored procedures constantly updating data to these tables, occupying computing resources and often forcing the database to scale up.&lt;/p&gt;

&lt;p&gt;Are these tables truly necessary for the business? Is the business so complex that it requires thousands of tables to describe?&lt;/p&gt;

&lt;p&gt;People with development experience know this is unlikely, as a few hundred tables can describe a fairly complex business. Most of these tables are so-called intermediate tables, not tables storing basic data.&lt;/p&gt;

&lt;p&gt;Then why do intermediate tables appear?&lt;/p&gt;

&lt;p&gt;Intermediate tables mostly serve data presentation (reporting or querying). When the raw data volume is large or the calculation process is complex, computing directly is troublesome and performs poorly, so some intermediate results are calculated first and stored. At presentation time, only simple filtering and aggregation based on parameters needs to be performed on them, and the user experience is much better. This intermediate data exists in the form of data tables and is updated regularly by the accompanying stored procedures. Front-end reporting is an unstable business that requires frequent modification and addition, so the number of intermediate tables keeps growing.&lt;/p&gt;

&lt;p&gt;Some intermediate tables are caused by external data sources. Sometimes an application needs to import data from outside the database before it can be computed together with the data inside, which also adds tables. Moreover, much external data comes in multi-layer JSON format, which requires multiple associated tables to store in a relational database, further exacerbating the problem of having too many intermediate tables.&lt;/p&gt;

&lt;p&gt;The main reason for putting intermediate data into the database is to obtain the database’s computing power. There is little strong computing power outside the database, while the database’s computing power is closed (it cannot compute data outside the database). To obtain that computing power, the data can only be imported into the database, thus forming intermediate tables.&lt;/p&gt;

&lt;p&gt;Two engineering structural characteristics of databases are accomplices in this matter:&lt;/p&gt;

&lt;p&gt;A database is an independent process whose computing power sits outside the applications and is not subordinate to any particular one. All applications share one database and can access it, so an intermediate table generated for one application may be referenced by another, creating coupling between applications. Even if the application that created an intermediate table has gone offline and no longer uses it, the table cannot be deleted because another application may still be using it, and so it remains.&lt;/p&gt;

&lt;p&gt;Database tables are organized and managed linearly. That works when the quantity is small, but it becomes chaotic when there are too many (thousands or even tens of thousands). People usually manage numerous items with a tree structure, but relational databases do not support this (the schema concept can be understood as supporting only two levels), so tables must be given longer names for classification. This is inconvenient to use and demands a high level of development management. When work is urgent, conventions are ignored, and once the system goes live these matters are forgotten. Business A leaves dozens of tables, business B leaves dozens more, and over time a large number of intermediate tables are left behind.&lt;/p&gt;

&lt;p&gt;Intermediate tables and related stored procedures consume a large amount of expensive data storage and computing resources, which is obviously not cost-effective.&lt;/p&gt;

&lt;p&gt;If an independent computing engine can be implemented so that calculations no longer rely on database services, then the database can be streamlined.&lt;/p&gt;

&lt;p&gt;This is esProc SPL.&lt;/p&gt;

&lt;p&gt;esProc SPL provides open and integrable computing capabilities. Openness means that computation is separated from storage and does not depend on a specific storage scheme; otherwise the bloat of the database would simply be replaced by bloat somewhere else. Integrability means the computing power can be embedded into an application and become part of it, rather than running as an independent process the way a database does, so it is not shared by other applications (modules) and coupling between applications is avoided.&lt;/p&gt;

&lt;p&gt;Intermediate data no longer needs to be stored in the database as tables; it can be placed in the file system, with SPL providing the computing power. For read-only intermediate data, file storage does not need to handle rewriting or the corresponding transaction-consistency issues, so the mechanism is greatly simplified and can achieve better performance than a database. The file system’s tree structure can also be used to classify and manage the intermediate data of each application (module), which is more convenient. Intermediate data then naturally belongs to a particular application module and is not accessed by other applications. When an application is modified or taken offline, its intermediate data can be modified or deleted accordingly, with no coupling caused by sharing to worry about. The stored procedures that generate the intermediate data can likewise move outside the database and become part of the application, again without causing coupling.&lt;/p&gt;
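&lt;p&gt;As a rough analogy in plain Python (the file contents and column names are hypothetical, and this is not SPL’s own mechanism), computing directly over intermediate data kept in a file looks like this:&lt;/p&gt;

```python
import csv, io

# Hypothetical intermediate data kept as a plain file instead of a database
# table (columns mirror a scores table: CLASS, English, Chinese, Math).
scores_file = io.StringIO(
    "CLASS,English,Chinese,Math\n"
    "10,80,75,90\n"
    "10,60,88,70\n"
    "11,95,70,85\n"
)

# Stream the rows like a cursor and aggregate per class.
stats = {}
for row in csv.DictReader(scores_file):
    c = row["CLASS"]
    e, ch, m = int(row["English"]), int(row["Chinese"]), int(row["Math"])
    if c not in stats:
        stats[c] = [e, ch, m]        # min(English), max(Chinese), sum(Math)
    else:
        s = stats[c]
        s[0] = min(s[0], e)
        s[1] = max(s[1], ch)
        s[2] += m

print(stats)
```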

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx705d0f41efz1o21rdfd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx705d0f41efz1o21rdfd.png" alt="Image description" width="800" height="389"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt; 
 &lt;thead&gt; 
  &lt;tr&gt;
   &lt;th&gt; &lt;/th&gt;
   &lt;th&gt; A&lt;/th&gt;
   &lt;th&gt; B&lt;/th&gt;
  &lt;/tr&gt; 
 &lt;/thead&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt;
   &lt;td&gt; 1&lt;/td&gt;
   &lt;td&gt; =file("/data/scores.btx").cursor@b()&lt;/td&gt;
   &lt;td&gt; /Read as a cursor&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 2&lt;/td&gt;
   &lt;td&gt; =A1.select(CLASS==10)&lt;/td&gt;
   &lt;td&gt; /Filter&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 3&lt;/td&gt;
   &lt;td&gt; =A1.groups(CLASS;min(English),max(Chinese),sum(Math))&lt;/td&gt;
   &lt;td&gt; /Group &amp;amp; Aggregation&lt;/td&gt;
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt;&lt;/div&gt; 

&lt;p&gt;SPL provides high-performance binary file formats that employ mechanisms such as compression, columnar storage, and indexing, and it supports segment-based parallelism and related high-performance algorithms to further improve computational performance.&lt;/p&gt;

&lt;p&gt;SPL can also directly perform mixed computation over data inside and outside the database, without first importing external sources into the database. Reading the data on demand gives better real-time performance and makes full use of the strengths of the original sources, as already discussed in the article on multi-source mixed computing.&lt;/p&gt;

&lt;p&gt;With the open and integrable computing capability of esProc SPL, it will be more convenient to design application frameworks. Computing can be placed in the most suitable position, without the need to deploy extra databases to obtain computing power. The database can focus on doing its most suitable thing, and complex and flexible computing can be left to SPL to solve, maximizing resource utilization.&lt;/p&gt;

&lt;p&gt;Finally, esProc SPL is open source. It is here &lt;a href="https://github.com/SPLWare/esProc"&gt;https://github.com/SPLWare/esProc&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>database</category>
      <category>bigdata</category>
      <category>sql</category>
    </item>
    <item>
      <title>What to do if the query calculation is moved out of database but is too slow using java</title>
      <dc:creator>jbx1279</dc:creator>
      <pubDate>Sat, 16 Mar 2024 13:04:41 +0000</pubDate>
      <link>https://dev.to/jbx1279/what-to-do-if-the-query-calculation-is-moved-out-of-database-but-is-too-slow-using-java-13ff</link>
      <guid>https://dev.to/jbx1279/what-to-do-if-the-query-calculation-is-moved-out-of-database-but-is-too-slow-using-java-13ff</guid>
      <description>&lt;p&gt;Many modern applications will move data computation and processing tasks away from databases and implement them in Java, which can gain framework benefits. Moreover, Java has comprehensive process processing capabilities and is more adept at handling increasingly complex business logic than SQL (although the code is not short). However, we often find that the performance of these Java codes in computing and processing data is not satisfactory, and they cannot even match the performance of SQL in the database.&lt;/p&gt;

&lt;p&gt;Normally, as a compiled language, Java may not match C++ in performance, but it should have an advantage over interpreted SQL. However, this is not the case.&lt;/p&gt;

&lt;p&gt;Why is this?&lt;/p&gt;

&lt;p&gt;There are two main reasons.&lt;/p&gt;

&lt;p&gt;One direct reason is IO. Java itself has no common storage mechanism and usually still relies on a database to store data. When computing, the data must first be read from the database, and the database access interface (JDBC) is not very fast, so with a large data volume the reading cost alone is significant.&lt;/p&gt;

&lt;p&gt;Then, can we drop database storage to achieve higher read performance? After all, most data is historical data that no longer changes, and the amount still changing is usually small. If the cold data were stored under a more efficient access scheme, only a small amount of hot data would need to be read on the fly. Could Java’s computational performance then be greatly improved?&lt;/p&gt;

&lt;p&gt;In theory, yes. But, for the aforementioned reason, Java itself has no common storage mechanism. Without a database, generally only public formats such as CSV/TXT can be used; their performance is not significantly better than a database’s, and they risk losing data type information. Designing your own binary storage format can indeed be much faster than a database, but doing it thoroughly is no easy task and exceeds the ability of many application programmers.&lt;/p&gt;
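&lt;p&gt;As a toy illustration of why a binary format preserves types and avoids parsing (this is not esProc’s actual format), fixed-width values can be packed and unpacked directly:&lt;/p&gt;

```python
import struct

# A toy binary column file: fixed-width doubles packed with struct, so values
# come back with their type intact and without any text parsing.
amounts = [1200.5, 850.0, 2999.99, 410.25]

blob = struct.pack(f"{len(amounts)}d", *amounts)            # write one column
restored = list(struct.unpack(f"{len(blob) // 8}d", blob))  # read it back

assert restored == amounts    # 8 bytes per value, no type or precision loss
print(len(blob), "bytes for", len(amounts), "values")
```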

&lt;p&gt;So Java programmers keep using databases or text files, enduring low-performance IO.&lt;/p&gt;

&lt;p&gt;The other reason is algorithm implementation. Running fast means finding ways to reduce computational complexity, which requires low-complexity algorithms. But while the computational complexity of these algorithms is lower, their implementation complexity is higher.&lt;/p&gt;

&lt;p&gt;Take the common grouping and join operations: databases generally use HASH algorithms rather than direct sorting or brute-force traversal. This algorithm is relatively difficult to implement, beyond the ability of many application programmers, so they often fall back on the simpler sorting or brute-force traversal approaches, which raise the computational complexity by orders of magnitude. It is then not surprising that compiled Java runs slower than interpreted SQL.&lt;/p&gt;
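&lt;p&gt;For a feel of the difference, here is hash-style grouping in Python over hypothetical order data; the same one-pass idea underlies database HASH grouping and joins:&lt;/p&gt;

```python
# HASH-style grouping in one O(n) pass, the approach databases use internally;
# sorting first would cost O(n log n), and a nested-loop match O(n^2).
orders = [("east", 120), ("west", 80), ("east", 45), ("north", 60), ("west", 30)]

totals = {}
for region, amount in orders:          # one hash lookup per row, no sorting
    totals[region] = totals.get(region, 0) + amount

print(totals)
```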

&lt;p&gt;In-memory operations fare slightly better, as some open-source libraries are now available (though, to be fair, their convenience is far inferior to SQL’s). But for external-storage computing over big data, the Java ecosystem offers almost no effective support, making even basic sorting difficult.&lt;/p&gt;

&lt;p&gt;Furthermore, utilizing the parallel capability of multiple CPUs requires multi-threaded code. Writing multithreaded Java is not impossible, but it is extremely troublesome: programmers must handle issues such as resource-sharing conflicts, which raises the implementation difficulty and the likelihood of errors. Weighing the costs, they often settle for a single thread, leaving CPU resources wasted.&lt;/p&gt;

&lt;p&gt;What should we do then?&lt;/p&gt;

&lt;p&gt;esProc SPL is here to help you.&lt;/p&gt;

&lt;p&gt;esProc SPL is a pure-Java open-source computing engine that provides computing power independent of the database and more powerful than SQL. It can be seamlessly integrated into Java applications, just like code written by the application programmers themselves, enjoying the advantages of mature Java frameworks together.&lt;/p&gt;

&lt;p&gt;esProc SPL supports access to databases and common public files such as CSV/TXT, with performance in this area not significantly different from direct Java development. In particular, esProc provides high-performance binary file formats that support compression, columnar storage, and indexing, as well as cursor and segment-parallel mechanisms for big data. Storing historical big data as binary files not only achieves far higher access performance than a database, but also makes the data easier to organize and manage using the tree structure of the file system.&lt;/p&gt;

&lt;p&gt;The computing power of esProc SPL does not rely on databases or other third-party services, making it easy to implement mixed computing of multiple data sources. Specifically, by simultaneously reading cold data from files and hot data from databases, real-time calculations on whole data can be achieved. Please refer to: &lt;a href="https://c.scudata.com/article/1700616464350"&gt;How to perform mixed computing with multiple data sources&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;esProc SPL comes with built-in rich structured data computing class libraries.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Filter：T.select(Amount&amp;gt;1000 &amp;amp;&amp;amp; Amount&amp;lt;=3000 &amp;amp;&amp;amp; like(Client,"*s*"))
Sort：T.sort(Client,-Amount)
Distinct：T.id(Client)
Group：T.groups(year(OrderDate);sum(Amount))
Join：join(T1:O,SellerId; T2:E,EId)
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Similar to databases, these libraries also use mature algorithms in the industry, which can efficiently perform calculations.&lt;/p&gt;

&lt;p&gt;SPL also supports big data cursors and parallel operations, using mature algorithms, with syntax almost the same as for in-memory data tables:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;file("T.btx").cursor@b().select(Amount&amp;gt;1000 &amp;amp;&amp;amp; Amount&amp;lt;=3000 &amp;amp;&amp;amp; like(Client,"*s*"))
file("T.ctx").open().cursor().groups@m(year(OrderDate);sum(Amount))
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This way, Java programmers no longer need to implement these complex algorithms themselves and can enjoy high performance similar to databases.&lt;/p&gt;

&lt;p&gt;In fact, SPL provides more structured data operations and high-performance algorithms than SQL. In many complex scenarios, the actual performance of SPL is much higher than that of SQL in the database, often achieving better performance on a single machine than SQL on a cluster: &lt;a href="https://c.scudata.com/article/1698972218102"&gt;Here comes big data technology that rivals clusters on a single machine&lt;/a&gt; .&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhggc077ao88grxoidibq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhggc077ao88grxoidibq.png" alt="Image description" width="800" height="314"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Take a practical case: a data processing task at a large international bank. The data volume involved is not large, just over one million rows, but the business rules are very complex: the primary table has over 100 columns, each with dozens of different calculation rules. Coding this in SQL/stored procedures was too cumbersome and chaotic. The Java code, though not short, was much clearer in structure and easier to maintain. The Java version took about 20 minutes in total, of which reading took about 1 minute. After switching to SPL, the 1 minute of reading could not be reduced, but the calculation time dropped to 1.5 minutes, for a total of 2.5 minutes, 8 times faster than the original 20 minutes!&lt;/p&gt;

&lt;p&gt;This case exploits SPL’s ordered cursor technique. Due to hardware limitations, the million-plus rows cannot be fully loaded and can only be read through a cursor, after which association operations must be performed on the groups. Even with a simple but inefficient multi-pass association algorithm, the Java code is cumbersome. SPL’s ordered cursor can group while reading, avoiding repeated traversal and association, and with less than 300 lines of code it still delivers a significant performance improvement.&lt;/p&gt;
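&lt;p&gt;The idea behind an ordered cursor can be sketched in Python with hypothetical account rows: because the input is already ordered by the grouping key, each group is summarized in a single forward pass and then discarded:&lt;/p&gt;

```python
from itertools import groupby
from operator import itemgetter

# Rows arrive ordered by account id, so each group is summarized as it is
# read and then dropped: one forward pass, no full in-memory load.
rows = iter([("A001", 100), ("A001", 250), ("A002", 40), ("A003", 75), ("A003", 25)])

summaries = [(acct, sum(v for _, v in grp))
             for acct, grp in groupby(rows, key=itemgetter(0))]

print(summaries)
```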

&lt;p&gt;SPL also has well-established process control statements, such as for loops and if branches, and supports subroutine calls, making it comparable to Java in procedural processing capability. SPL alone can implement very complex business logic and directly form a complete business unit, without needing upper-level Java code to cooperate; the main Java program simply calls the SPL script.&lt;/p&gt;

&lt;p&gt;SPL scripts are stored as files placed outside the main application. Code modifications can be made independently and take effect immediately, unlike Java libraries such as Stream/Kotlin, whose modified code must be recompiled with the main program and requires the entire application to be shut down and restarted. This enables hot swapping of business logic, which is especially suitable for supporting frequently changing businesses.&lt;/p&gt;

&lt;p&gt;esProc SPL also has a simple and easy-to-use development environment, providing single-step execution, breakpoint setting, and WYSIWYG result preview. Development efficiency is also better than programming in Java:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq9daft5gwvt32h6em4o6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq9daft5gwvt32h6em4o6.png" alt="Image description" width="768" height="408"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;(Here is a more detailed introduction to SPL: &lt;a href="https://c.scudata.com/article/1681713093064"&gt;A programming language coding in a grid&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SPL combines the advantages of SQL and Java&lt;/strong&gt;: it has the flexibility and process control ability of Java and enjoys the advantages of Java frameworks, while also encapsulating and extending the algorithms and storage mechanisms of SQL, allowing programmers to obtain, and even surpass, the simplicity and high performance of SQL within Java applications.&lt;/p&gt;

&lt;p&gt;Finally, esProc SPL is here &lt;a href="https://github.com/SPLWare/esProc"&gt;https://github.com/SPLWare/esProc&lt;/a&gt; .&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>java</category>
      <category>database</category>
    </item>
    <item>
      <title>How to perform mixed computing with multiple data sources</title>
      <dc:creator>jbx1279</dc:creator>
      <pubDate>Sat, 09 Mar 2024 12:43:08 +0000</pubDate>
      <link>https://dev.to/jbx1279/how-to-perform-mixed-computing-with-multiple-data-sources-p4n</link>
      <guid>https://dev.to/jbx1279/how-to-perform-mixed-computing-with-multiple-data-sources-p4n</guid>
      <description>&lt;p&gt;Early applications usually only connected to one database, and calculations were also performed in the database, with little or no problem of mixed calculations from multiple data sources. The data sources of modern applications have become very rich, and the same application may also access multiple data sources, such as various SQL and NoSQL databases, text/XLS, WebService/Restful, Kafka, Hadoop…. Mixed computing on multiple data sources is an unavoidable problem that needs to be addressed.&lt;/p&gt;

&lt;p&gt;Hard-coding this directly in the application is very cumbersome, and common application development languages such as Java are not good at such work; their conciseness is far inferior to SQL’s.&lt;/p&gt;

&lt;p&gt;Importing multi-source data into one database and then calculating is also inappropriate. Apart from the time the import takes and the resulting loss of data real-timeness, importing certain data (such as the multi-layer document data supported by MongoDB) into a relational database losslessly is a very difficult and costly task. After all, these diverse data sources exist for a reason and cannot easily be replaced by relational databases; otherwise there would have been no need to invent MongoDB, and MySQL would suffice.&lt;/p&gt;

&lt;p&gt;What about a logical data warehouse? It sounds very heavy: before use, metadata must be defined to map these diverse sources, which is very cumbersome. Moreover, most logical data warehouses are still SQL-based, making it difficult to map such diverse data losslessly.&lt;/p&gt;

&lt;p&gt;What about the many computing frameworks, especially stream computing frameworks? They can access many data sources, but provide little computing functionality of their own. Either they rely on SQL, with the same mapping difficulties as a logical data warehouse, or, to access various data sources freely, you have to write the calculation code yourself in Java.&lt;/p&gt;

&lt;p&gt;For mixed computing over multiple data sources, esProc SPL is a good approach.&lt;/p&gt;

&lt;p&gt;esProc SPL is an open source computing engine developed purely in Java, available at &lt;a href="https://github.com/SPLWare/esProc"&gt;https://github.com/SPLWare/esProc&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;How can esProc SPL solve this problem? There are mainly two aspects:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;An abstract access interface for diverse data sources, which maps a wide range of data into a few data objects.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Sufficiently rich computing power implemented on those data objects, independent of any particular data source.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;With these two abilities, supporting a new data source only requires encapsulating its access interface, and such interfaces are continuously being added.&lt;/p&gt;

&lt;p&gt;esProc SPL provides two basic data objects: table sequence and cursor, corresponding to in-memory data table and streaming data table, respectively.&lt;/p&gt;

&lt;p&gt;Almost all data sources, including relational databases, provide interfaces that return one of these two kinds of data objects: small data is read in one go into an in-memory data table (table sequence), while big data is returned gradually through a streaming data table (cursor). With these two data objects, almost all data sources can be covered.&lt;/p&gt;
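&lt;p&gt;As a rough Java analogy (this is not SPL’s actual API, just an illustration of the two kinds of objects): a table sequence behaves like an in-memory List that is loaded at once, while a cursor behaves like an Iterator that yields rows on demand.&lt;/p&gt;

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class TwoObjects {
    // "Cursor" style: fetch rows one by one from an iterator and count them.
    static int drain(Iterator it) {
        int n = 0;
        while (it.hasNext()) {
            it.next();
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // "Table sequence" style: small data loaded fully into memory.
        List inMemory = Arrays.asList("r1", "r2", "r3");
        System.out.println(inMemory.size());            // 3

        // Big data would instead be consumed gradually, cursor style.
        System.out.println(drain(inMemory.iterator())); // 3
    }
}
```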

&lt;p&gt;In this way, no metadata needs to be defined in advance for mapping: data is accessed directly through the methods the data source itself provides and then encapsulated into one of the two data objects. This preserves the characteristics of the data source and fully utilizes its storage and computing capabilities. Nor is any import step required first, so access is real-time. The two data objects capture the capabilities that diverse data source interfaces have in common, whereas the mapped-data-table approach of a logical data warehouse does not correctly abstract these common characteristics, so its applicability is much narrower.&lt;/p&gt;

&lt;p&gt;It should be noted that SPL’s table sequence and cursor both support multi-layer structured data as well as flat text data, which makes it possible to receive and process JSON data (and its binary variants).&lt;/p&gt;

&lt;p&gt;Take a look at some examples:&lt;/p&gt;

&lt;p&gt;Relational database, A2 returns a table sequence, A3 returns a cursor&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt; 
 &lt;thead&gt; 
  &lt;tr&gt;
   &lt;th&gt; &lt;/th&gt;
   &lt;th&gt; A&lt;/th&gt;
  &lt;/tr&gt; 
 &lt;/thead&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt;
   &lt;td&gt; 1 &lt;/td&gt;
   &lt;td&gt; =connect("MyCompany")&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 2 &lt;/td&gt;
   &lt;td&gt; =A1.query("select * from employees order by hire_date asc limit 100")&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 3 &lt;/td&gt;
   &lt;td&gt; =A1.cursor("select * from salaries")&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 4 &lt;/td&gt;
   &lt;td&gt; &amp;gt;A1.close()&lt;/td&gt;
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt;&lt;/div&gt; 

&lt;p&gt;Local file, A1/A3 returns a table sequence, A2 returns a cursor&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt; 
 &lt;thead&gt; 
  &lt;tr&gt;
   &lt;th&gt; &lt;/th&gt;
   &lt;th&gt; A&lt;/th&gt;
  &lt;/tr&gt; 
 &lt;/thead&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt;
   &lt;td&gt; 1 &lt;/td&gt;
   &lt;td&gt; =T("Orders.csv")&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 2 &lt;/td&gt;
   &lt;td&gt; =file("sales.txt").cursor@t()&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 3 &lt;/td&gt;
   &lt;td&gt; =file("Orders.xls").xlsimport@t()&lt;/td&gt;
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt;&lt;/div&gt; 

&lt;p&gt;Restful, A1 returns text in JSON format&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt; 
 &lt;thead&gt; 
  &lt;tr&gt;
   &lt;th&gt; &lt;/th&gt;
   &lt;th&gt; A&lt;/th&gt;
  &lt;/tr&gt; 
 &lt;/thead&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt;
   &lt;td&gt; 1 &lt;/td&gt;
   &lt;td&gt; =httpfile("http://127.0.0.1:6868/restful/emp_orders").read()&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 2 &lt;/td&gt;
   &lt;td&gt; =json(A1)&lt;/td&gt;
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt;&lt;/div&gt; 

&lt;p&gt;Elastic Search&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt; 
 &lt;thead&gt; 
  &lt;tr&gt;
   &lt;th&gt; &lt;/th&gt;
   &lt;th&gt; A&lt;/th&gt;
  &lt;/tr&gt; 
 &lt;/thead&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt;
   &lt;td&gt; 1 &lt;/td&gt;
   &lt;td&gt; &amp;gt;apikey="Authorization:ApiKey a2x6aEF……KZ29rT2hoQQ=="&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 2 &lt;/td&gt;
   &lt;td&gt; '{&lt;br&gt;"counter" : 1,&lt;br&gt;"tags" : ["red"]&lt;br&gt;,"beginTime":"2022-01-03"&lt;br&gt;,"endTime":"2022-02-15"&lt;br&gt;} &lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 3 &lt;/td&gt;
   &lt;td&gt; =es_rest("https://localhost:9200/index1/_doc/1", "PUT",A2;"Content-Type: application/x-ndjson",apikey)&lt;/td&gt;
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt;&lt;/div&gt; 

&lt;p&gt;Mongodb, A2 returns a table sequence, A3 returns a cursor&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt; 
 &lt;thead&gt; 
  &lt;tr&gt;
   &lt;th&gt; &lt;/th&gt;
   &lt;th&gt; A&lt;/th&gt;
  &lt;/tr&gt; 
 &lt;/thead&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt;
   &lt;td&gt; 1 &lt;/td&gt;
   &lt;td&gt; =mongo_open("mongodb://127.0.0.1:27017/mymongo")&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 2 &lt;/td&gt;
   &lt;td&gt; =mongo_shell(A1,"{'find':'orders',filter:{OrderID: {$gte: 50}},batchSize:100}")&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 3 &lt;/td&gt;
   &lt;td&gt; =mongo_shell@dc(A1,"{'find':'orders',filter:{OrderID: { $gte: 50}},batchSize:10}")&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 4 &lt;/td&gt;
   &lt;td&gt; =mongo_close(A1)&lt;/td&gt;
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt;&lt;/div&gt; 

&lt;p&gt;Kafka, A2 returns a table sequence containing JSON data, A3 returns a cursor&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt; 
 &lt;thead&gt; 
  &lt;tr&gt;
   &lt;th&gt; &lt;/th&gt;
   &lt;th&gt; A&lt;/th&gt;
  &lt;/tr&gt; 
 &lt;/thead&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt;
   &lt;td&gt; 1 &lt;/td&gt;
   &lt;td&gt; =kafka_open("/kafka/my.properties", "topic1")&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 2 &lt;/td&gt;
   &lt;td&gt; =kafka_poll(A1)&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 3 &lt;/td&gt;
   &lt;td&gt; =kafka_poll@c(A1)&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 4 &lt;/td&gt;
   &lt;td&gt; =kafka_close(A1)&lt;/td&gt;
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt;&lt;/div&gt; 

&lt;p&gt;HBase, A2/A3 returns a table sequence, A4 returns a cursor&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt; 
 &lt;thead&gt; 
  &lt;tr&gt;
   &lt;th&gt; &lt;/th&gt;
   &lt;th&gt; A&lt;/th&gt;
  &lt;/tr&gt; 
 &lt;/thead&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt;
   &lt;td&gt; 1 &lt;/td&gt;
   &lt;td&gt; =hbase_open("hdfs://192.168.0.8", "192.168.0.8")&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 2 &lt;/td&gt;
   &lt;td&gt; =hbase_get(A1,"Orders","row1","datas:Amount":number:amt,"datas:OrderDate"📅od)&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 3 &lt;/td&gt;
   &lt;td&gt; =hbase_scan(A1,"Orders")&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 4 &lt;/td&gt;
   &lt;td&gt; =hbase_scan@c(A1,"Orders")&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 5 &lt;/td&gt;
   &lt;td&gt; =hbase_close(A1)&lt;/td&gt;
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt;&lt;/div&gt; 

&lt;p&gt;Many data sources are already encapsulated in esProc SPL, and more are being added:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F533nvpmh5ytix2bch107.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F533nvpmh5ytix2bch107.png" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;esProc SPL provides comprehensive computing power for table sequences, including filtering, grouping, sorting, join, etc. Its richness far exceeds that of SQL, and most operations can be implemented in just one line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Filter：T.select(Amount&amp;gt;1000 &amp;amp;&amp;amp; Amount&amp;lt;=3000 &amp;amp;&amp;amp; like(Client,"*s*"))
Sort：T.sort(Client,-Amount)
Distinct：T.id(Client)
Group：T.groups(year(OrderDate);sum(Amount))
Join：join(T1:O,SellerId; T2:E,EId)
TopN：T.top(-3;Amount)
TopN in group：T.groups(Client;top(3,Amount))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cursors support similar calculations with almost identical syntax; we won’t give detailed examples here. Interested readers can refer to the materials on the esProc SPL official website.&lt;/p&gt;

&lt;p&gt;Based on these foundations, mixed computing is very easy to implement:&lt;/p&gt;

&lt;p&gt;Two relational databases&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt; 
 &lt;thead&gt; 
  &lt;tr&gt;
   &lt;th&gt; &lt;/th&gt;
   &lt;th&gt; A&lt;/th&gt;
  &lt;/tr&gt; 
 &lt;/thead&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt;
   &lt;td&gt; 1 &lt;/td&gt;
   &lt;td&gt; =oracle.query("select EId,Name from employees")&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 2 &lt;/td&gt;
   &lt;td&gt; =mysql.query("select SellerId, sum(Amount) subtotal from Orders group by SellerId")&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 3 &lt;/td&gt;
   &lt;td&gt; =join(A1:O,SellerId; A2:E,EId)&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 4 &lt;/td&gt;
   &lt;td&gt; =A3.new(O.Name,E.subtotal)&lt;/td&gt;
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt;&lt;/div&gt; 

&lt;p&gt;Relational Database and JSON&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt; 
 &lt;thead&gt; 
  &lt;tr&gt;
   &lt;th&gt; &lt;/th&gt;
   &lt;th&gt; A&lt;/th&gt;
  &lt;/tr&gt; 
 &lt;/thead&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt;
   &lt;td&gt; 1 &lt;/td&gt;
   &lt;td&gt; =json(file("/data/EO.json").read())&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 2 &lt;/td&gt;
   &lt;td&gt; =A1.conj(Orders)&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 3 &lt;/td&gt;
   &lt;td&gt; =A2.select(Amount&amp;gt;1000 &amp;amp;&amp;amp; Amount&amp;lt;=3000 &amp;amp;&amp;amp; like@c(Client,"*s*"))&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 4 &lt;/td&gt;
   &lt;td&gt; =db.query@x("select ID,Name,Area from Client")&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 5 &lt;/td&gt;
   &lt;td&gt; =join(A3:o,Client;A4:c,ID)&lt;/td&gt;
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt;&lt;/div&gt; 

&lt;p&gt;Mongodb and relational database&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt; 
 &lt;thead&gt; 
  &lt;tr&gt;
   &lt;th&gt; &lt;/th&gt;
   &lt;th&gt; A&lt;/th&gt;
  &lt;/tr&gt; 
 &lt;/thead&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt;
   &lt;td&gt; 1 &lt;/td&gt;
   &lt;td&gt; =mongo_open("mongodb://127.0.0.1:27017/mongo")&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 2 &lt;/td&gt;
   &lt;td&gt; =mongo_shell(A1,"test1.find()")&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 3 &lt;/td&gt;
   &lt;td&gt; =A2.new(Orders.OrderID,Orders.Client,Name,Gender,Dept).fetch()&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 4 &lt;/td&gt;
   &lt;td&gt; =mongo_close(A1)&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 5 &lt;/td&gt;
   &lt;td&gt; =db.query@x("select ID,Name,Area from Client")&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 6 &lt;/td&gt;
   &lt;td&gt; =join(A3:o,Orders.Client;A5:c,ID)&lt;/td&gt;
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt;&lt;/div&gt; 

&lt;p&gt;Restful and local text file&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt; 
 &lt;thead&gt; 
  &lt;tr&gt;
   &lt;th&gt; &lt;/th&gt;
   &lt;th&gt; A&lt;/th&gt;
  &lt;/tr&gt; 
 &lt;/thead&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt;
   &lt;td&gt; 1 &lt;/td&gt;
   &lt;td&gt; =httpfile("http://127.0.0.1:6868/api/getData").read()&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 2 &lt;/td&gt;
   &lt;td&gt; =json(A1)&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 3 &lt;/td&gt;
   &lt;td&gt; =T("/data/Client.csv")&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 4 &lt;/td&gt;
   &lt;td&gt; =join(A2:o,Client;A3:c,ClientID)&lt;/td&gt;
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt;&lt;/div&gt; 

&lt;p&gt;With these mixed computing capabilities, T+0 (real-time whole-data) queries can be solved along the way as well.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqfzaac40q0tlf452v17n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqfzaac40q0tlf452v17n.png" alt="Image description" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A single TP database naturally supports T+0 calculation, but too much accumulated data can hurt its performance. At that point, a portion of the historical data is usually moved to a professional AP database, which is known as hot-and-cold data separation: the TP database stores only recently generated hot data, while the AP database stores the historical cold data. With the pressure reduced, the TP database runs smoothly.&lt;/p&gt;

&lt;p&gt;But then, real-time whole-data statistics require cross-database calculations, which have always been a hassle, especially across heterogeneous databases (the TP and AP databases are usually not of the same type). With esProc SPL’s mixed computing capability across data sources, the problem is easily solved.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt; 
 &lt;thead&gt; 
  &lt;tr&gt;
   &lt;th&gt; &lt;/th&gt;
   &lt;th&gt; A &lt;/th&gt;
   &lt;th&gt; B&lt;/th&gt;
  &lt;/tr&gt; 
 &lt;/thead&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt;
   &lt;td&gt; 1 &lt;/td&gt;
   &lt;td colspan="2"&gt; =[[connect@l("oracle"),"ORACLE"],[connect@l("hive"),"HIVE"]]&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 2 &lt;/td&gt;
   &lt;td colspan="2"&gt; =SQL="select month(orderdate) ordermonth,sellerid,sum(amount) samount,count(amount) camount from sales group by month(orderdate),sellerid"&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 3 &lt;/td&gt;
   &lt;td&gt; fork A1 &lt;/td&gt;
   &lt;td&gt; =SQL.sqltranslate(A3(2))&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 4 &lt;/td&gt;
   &lt;td&gt; &lt;/td&gt;
   &lt;td&gt; =A3(1).query(B3)&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 5 &lt;/td&gt;
   &lt;td colspan="2"&gt; =A3.conj().groups(ordermonth,sellerid;sum(samount):totalamount,sum(camount):totalcount)&lt;/td&gt;
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt;&lt;/div&gt; 

&lt;p&gt;This is a mixed calculation between the TP database Oracle and the AP database Hive; SPL also translates the SQL into the dialect of each database.&lt;/p&gt;

&lt;p&gt;How, then, can code written in esProc SPL be integrated into an application?&lt;/p&gt;

&lt;p&gt;Very simple: esProc provides a standard JDBC driver, allowing Java programs to call SPL code just as they call SQL in a database.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Class.forName("com.esproc.jdbc.InternalDriver");
Connection conn =DriverManager.getConnection("jdbc:esproc:local://");
Statement statement = conn.createStatement();
ResultSet result = statement.executeQuery("=json(file(\"Orders.csv\")).select(Amount&amp;gt;1000 &amp;amp;&amp;amp; like(Client,\"*s*\"))");
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;More complex SPL scripts can be saved as files, just like calling stored procedures:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Class.forName("com.esproc.jdbc.InternalDriver");
Connection conn =DriverManager.getConnection("jdbc:esproc:local://");
CallableStatement statement = conn.prepareCall("call queryOrders()");
statement.execute();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is equivalent to providing a logical database without storage and without SQL.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>webdev</category>
      <category>json</category>
      <category>java</category>
    </item>
    <item>
      <title>When will the pre calculation of customer profile analysis be over</title>
      <dc:creator>jbx1279</dc:creator>
      <pubDate>Sat, 02 Mar 2024 10:42:43 +0000</pubDate>
      <link>https://dev.to/jbx1279/when-will-the-pre-calculation-of-customer-profile-analysis-be-over-14jo</link>
      <guid>https://dev.to/jbx1279/when-will-the-pre-calculation-of-customer-profile-analysis-be-over-14jo</guid>
<description>&lt;p&gt;Customer profiling is very fashionable in current business analysis. Simply put, it means putting various tags on customers, using the tags to define customer groups (so-called profiles), and then counting the customers in each group and tracking how the counts change.&lt;/p&gt;

&lt;p&gt;Logically speaking, tags are dimensions, i.e., fields of a data table, with relatively simple values. There are generally two types. Binary tags have only two values, usually represented by 0/1, such as marital status or gender. Enumeration tags take from a few to a few hundred values, which can be represented as small integers, such as educational status or age range. The two types can also be converted into each other.&lt;/p&gt;

&lt;p&gt;Profile statistics is nothing special: it is a standard multidimensional analysis task that filters (WHERE) on the tag conditions defining the customer group (profile) and then aggregates (mainly COUNT), sometimes with grouping.&lt;/p&gt;

&lt;p&gt;However, when the amount of data is particularly large (a large number of customers is the norm; otherwise there would be no need for analysis), the statistics may become very slow and often cannot be computed in real time. Every profile standard (i.e., set of tag conditions) then has to be worked out in advance and pre-calculated before business staff can query it.&lt;/p&gt;

&lt;p&gt;In practice, business managers may come up with a profile standard at any time, run the statistics, adjust the standard as soon as something looks wrong, and run them again. Only this kind of exploratory analysis yields meaningful results, and mandatory pre-calculation makes it all meaningless.&lt;/p&gt;

&lt;p&gt;Why, then, can it not be calculated quickly? After all, the industry has long handled this kind of simple calculation very skillfully.&lt;/p&gt;

&lt;p&gt;There are three main reasons for this.&lt;/p&gt;

&lt;p&gt;First, the number of tags is particularly large, especially binary tags, which may number in the hundreds or even thousands and keep increasing. This often exceeds the maximum number of fields a database table allows (usually only 256 or 512), so the tags cannot simply be designed as table fields, and workarounds must be taken.&lt;/p&gt;

&lt;p&gt;There are generally two workarounds. One is to design multiple tables with customer ID as the primary key, each storing a portion of the tags, and JOIN them during calculation. The other is to convert columns to rows: number the tags and use the structure {customer ID, tag number, tag value}, which brings GROUP BY, HAVING, or COUNT(DISTINCT) into the calculation.&lt;/p&gt;

&lt;p&gt;Neither method performs well. JOIN is a long-standing problem in databases, as are GROUP BY and COUNT(DISTINCT). Faced with a huge number of customer IDs, the database’s algorithms need a large amount of memory, which easily leads to crashes, or to buffering and multiple traversals with a sharp drop in performance. A third method represents all tag values in one string, with values such as “Tag 1=Value 1; Tag 2=Value 2; …”. Although flexible, reading and parsing the strings is very time-consuming, and the computational performance is often even worse than JOIN and GROUP; it merely tends not to crash.&lt;/p&gt;
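&lt;p&gt;A minimal Java sketch shows why the string method is slow (the "k=v;k=v" layout and tag names here are assumptions for illustration): reading one tag value forces the whole string to be split and parsed again.&lt;/p&gt;

```java
public class StringTags {
    // Re-parse the whole tag string just to read one tag value.
    static int readTag(String tags, String name) {
        for (String pair : tags.split(";")) {
            String[] kv = pair.split("=");
            if (kv[0].equals(name)) {
                return Integer.parseInt(kv[1]);
            }
        }
        return -1; // tag not present
    }

    public static void main(String[] args) {
        // All of a customer's tags crammed into one string, as described above.
        String tags = "married=1;education=3;region=7";
        // Every access pays the parsing cost again, per row, per condition.
        System.out.println(readTag(tags, "education")); // 3
    }
}
```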

&lt;p&gt;Second, the filtering condition on enumeration tags is usually an IN operation. Without optimization, the database compares N times per row (N being the length of the IN subset), a complexity of O(N). Even with binary search after sorting, it still compares logN times, O(logN). Moreover, binary search carries some fixed overhead of its own, so although the complexity drops, the gain on a small IN subset (common in profile analysis) is not significant.&lt;/p&gt;

&lt;p&gt;The most crucial reason is that profile analysis often targets multiple profile standards at once. Each standard corresponds to its own filtering conditions (WHERE) and grouping rules (GROUP), and in a database each requires executing one SQL statement that traverses the data table once. Multiple profile standards mean multiple SQL statements and multiple traversals. A typical profile analysis page may show dozens or even hundreds of profile indexes at the same time, so a huge table would have to be traversed dozens or hundreds of times, making real-time calculation impossible and leaving pre-calculation as the only option.&lt;/p&gt;

&lt;p&gt;With good optimization in all three aspects, profile analysis could run fast. Unfortunately, most relational databases offer none of them. The first factor is an engineering matter that could have been handled, but the closed nature of databases requires unified metadata management, and too many fields make the metadata exceptionally complex, so most databases simply refuse to support it. The second is half theoretical, half engineering: SQL can only describe enumeration-tag filters with IN, so the optimizer would have to recognize the pattern and adopt a scheme that avoids set-membership comparison, something most databases have never considered. The third is theoretical: a single SQL statement cannot return multiple summary results, and the optimizer cannot assume any correlation among separate SQL statements, so every database is forced to traverse repeatedly. As noted, this is the most critical factor.&lt;/p&gt;

&lt;p&gt;esProc SPL can address all three!&lt;/p&gt;

&lt;p&gt;Strictly speaking, esProc SPL is not a database but a professional computing engine. It provides a columnar-storage file format and, on top of it, computing power that does not rely on any database and can completely replace the computing capability of a data warehouse. Unlike traditional databases, esProc SPL abandons relational algebra and SQL; it is built on the theory of discrete datasets and a new programming language, SPL, which can express richer optimization logic than SQL allows.&lt;/p&gt;

&lt;p&gt;The esProc SPL columnar-storage file holds far more columns (up to thousands), and there is no global metadata concept, so a file (table) with many columns imposes no overall management burden. This naturally resolves the first factor. Even if there are so many tags that a table-splitting scheme is still needed, esProc SPL has a high-performance ordered merge algorithm for this case: the loss from multi-table association is very small, and the speed is close to single-table traversal, depending mainly on the number of tags involved.&lt;/p&gt;

&lt;p&gt;Specifically, for binary tags, esProc SPL also supports bitwise storage: a single binary bit stores one tag, and multiple tags are combined into a 16-, 32-, or 64-bit integer. This sharply reduces the number of columns and the storage size (by 16-64 times compared with conventional storage). When a profile standard involves many binary tags, it also cuts the amount of data read and computed: one 16-bit integer carries many of the binary tags a condition needs, so it is read and computed only once. ( &lt;a href="https://c.scudata.com/article/1645832582300"&gt;Performance Optimization - 8.5 [Multi-dimensional analysis] Flag bit dimension&lt;/a&gt; )&lt;/p&gt;
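&lt;p&gt;The idea of packing binary tags into machine words can be sketched in plain Java with java.util.BitSet (a rough illustration, not esProc’s implementation; the tag numbers are hypothetical):&lt;/p&gt;

```java
import java.util.BitSet;

public class BitTagDemo {
    // Profile condition "tag 3 = 1 and tag 17 = 1" becomes two bit tests.
    static boolean matches(BitSet tags) {
        boolean ok = tags.get(3);
        if (ok) {
            ok = tags.get(17);
        }
        return ok;
    }

    public static void main(String[] args) {
        // One customer's binary tags packed as bits; a BitSet keeps 64 tags
        // per machine word, so storage shrinks sharply compared with one
        // table field per tag.
        BitSet tags = new BitSet(1000);
        tags.set(3);   // e.g. "married"
        tags.set(17);  // e.g. "holds a credit card"
        System.out.println(matches(tags)); // true
    }
}
```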

&lt;p&gt;Some databases now support bit operations, but the SQL syntax for them remains cumbersome. esProc SPL provides pseudo table objects that make the combination of binary tags transparent: programmers keep operating on individual tag fields, which SPL actually maps to particular bits of a 16-bit integer. ( &lt;a href="https://c.scudata.com/article/1660287698431"&gt;SPL Pseudo Table Data Type Optimization&lt;/a&gt; )&lt;/p&gt;

&lt;p&gt;For conditions on enumeration tags, esProc SPL provides the boolean dimension sequence, which converts IN calculations into array lookups, reducing the complexity directly to O(1); that is, the calculation time is independent of the length of the IN subset. ( &lt;a href="https://c.scudata.com/article/1645771037111"&gt;Performance Optimization - 8.4 [Multi-dimensional analysis] Dimension of boolean sequence&lt;/a&gt; )&lt;/p&gt;
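&lt;p&gt;The trick can be sketched in plain Java (an assumed illustration, not SPL’s implementation; the tag codes are hypothetical): convert the IN subset once into a boolean array indexed by the tag code, so testing each row costs one array access instead of a set-membership search.&lt;/p&gt;

```java
public class BooleanDim {
    // Count rows whose tag code falls in the chosen subset:
    // the IN test is a single array access, O(1) per row.
    static int countIn(int[] rows, boolean[] inSet) {
        int n = 0;
        for (int code : rows) {
            if (inSet[code]) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        // Enumeration tag coded as small integers 0..9;
        // the IN subset {2, 5, 7} is converted once into a boolean array.
        boolean[] inSet = new boolean[10];
        inSet[2] = true;
        inSet[5] = true;
        inSet[7] = true;

        int[] rows = {1, 5, 7, 3, 2, 9};
        System.out.println(countIn(rows, inSet)); // 3
    }
}
```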

&lt;p&gt;For multi-index statistics, esProc SPL offers multi-purpose traversal syntax, which computes multiple statistical values in one pass ( &lt;a href="https://c.scudata.com/article/1643254690070"&gt;Performance Optimization - 4.2 [Traversal technology] Multipurpose traversal&lt;/a&gt; ). This is the key to solving the performance problem of profile analysis.&lt;/p&gt;
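&lt;p&gt;The effect of multi-purpose traversal can be sketched in plain Java (a toy illustration, not esProc’s code): one loop over the data feeds several aggregates at once, instead of one full scan per SQL statement.&lt;/p&gt;

```java
public class MultiIndex {
    // One traversal computes several "indexes" simultaneously:
    // total amount, count of large amounts, and the maximum.
    static long[] aggregates(int[] amounts) {
        long total = 0;
        long bigCount = 0;          // rows with amount above 1000
        long max = Long.MIN_VALUE;
        for (int a : amounts) {
            total += a;
            if (a > 1000) {
                bigCount++;
            }
            max = Math.max(max, a);
        }
        return new long[]{total, bigCount, max};
    }

    public static void main(String[] args) {
        int[] amounts = {500, 1500, 2500, 3500}; // toy data
        long[] r = aggregates(amounts);
        System.out.println(r[0] + " " + r[1] + " " + r[2]); // 8000 3 3500
    }
}
```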

&lt;p&gt;Here is a multi-index calculation test on a wide table generated from TPCH 100G data ( &lt;a href="https://c.scudata.com/article/1690190890595"&gt;SPL computing performance test series: multi-index aggregating&lt;/a&gt; ):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt; 
 &lt;thead&gt; 
  &lt;tr&gt;
   &lt;th&gt; &lt;/th&gt;
   &lt;th colspan="3"&gt; 4C16G&lt;/th&gt;
   &lt;th colspan="3"&gt; 8C32G&lt;/th&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;th&gt; No. of indexes &lt;/th&gt;
   &lt;th&gt; 1 &lt;/th&gt;
   &lt;th&gt; 2 &lt;/th&gt;
   &lt;th&gt; 3 &lt;/th&gt;
   &lt;th&gt; 1 &lt;/th&gt;
   &lt;th&gt; 2 &lt;/th&gt;
   &lt;th&gt; 3&lt;/th&gt;
  &lt;/tr&gt; 
 &lt;/thead&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt;
   &lt;td&gt; ClickHouse wide table &lt;/td&gt;
   &lt;td&gt; 77.4 &lt;/td&gt;
   &lt;td&gt; 156.0 &lt;/td&gt;
   &lt;td&gt; 249.6 &lt;/td&gt;
   &lt;td&gt; 34.7 &lt;/td&gt;
   &lt;td&gt; 69.0 &lt;/td&gt;
   &lt;td&gt; 106.4&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; Starrocks wide table &lt;/td&gt;
   &lt;td&gt; 135.7 &lt;/td&gt;
   &lt;td&gt; 253.6 &lt;/td&gt;
   &lt;td&gt; 402.6 &lt;/td&gt;
   &lt;td&gt; 62.2 &lt;/td&gt;
   &lt;td&gt; 104.6 &lt;/td&gt;
   &lt;td&gt; 156.2&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; esProc SPL wide table &lt;/td&gt;
   &lt;td&gt; 114.2 &lt;/td&gt;
   &lt;td&gt; 119.5 &lt;/td&gt;
   &lt;td&gt; 124.1 &lt;/td&gt;
   &lt;td&gt; 57.7 &lt;/td&gt;
   &lt;td&gt; 61.6 &lt;/td&gt;
   &lt;td&gt; 64.6&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; esProc SPL Join &lt;/td&gt;
   &lt;td&gt; &lt;/td&gt;
   &lt;td&gt; &lt;/td&gt;
   &lt;td&gt; 100.5 &lt;/td&gt;
   &lt;td&gt; &lt;/td&gt;
   &lt;td&gt; &lt;/td&gt;
   &lt;td&gt; 49.5&lt;/td&gt;
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt;&lt;/div&gt; 

&lt;p&gt;As the table shows, when only one index is calculated, esProc SPL cannot keep up with ClickHouse. With multiple indexes, however, it pulls ahead, and the more indexes there are, the more obvious the advantage. Facing the dozens or hundreds of indexes in profile analysis, esProc SPL’s multi-purpose traversal mechanism delivers an overwhelming advantage.&lt;/p&gt;

&lt;p&gt;The performance of esProc SPL is also excellent in practice. Here are two cases where SPL was used to eliminate pre-calculation:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://c.scudata.com/article/1643333049458"&gt;Open-source SPL speeds up intersection calculation of customer groups in bank user profile by 200+ times&lt;/a&gt; , using the aforementioned binary bits to represent the binary dimension, and boolean sequences to optimize the enumeration tag conditions, provides more than 200 times the performance than the MPP database.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://c.scudata.com/article/1672279362788"&gt;Open-source SPL optimizes bank pre-calculated fixed query to real-time flexible query&lt;/a&gt; , further using the multi-purpose traversal technology, real-time calculation of nearly 200 indexes on one same page, completely canceled the pre-calculation, and changed the original business form.&lt;/p&gt;

&lt;p&gt;esProc SPL also has a number of engineering advantages:&lt;/p&gt;

&lt;p&gt;esProc is pure Java software that runs in any JVM environment and can be seamlessly embedded in Java programs, bringing the computing power of a data warehouse to applications in a very lightweight manner.&lt;/p&gt;

&lt;p&gt;esProc provides a visual development environment that supports step-by-step execution, breakpoints, and WYSIWYG result preview. Developing and debugging are much more convenient than with SQL and stored procedures.&lt;/p&gt;

&lt;p&gt;SPL also has comprehensive process control statements, such as for loops and if branches, and supports subroutine calls. It has the procedural ability only available in stored procedures, and can comprehensively replace SQL and stored procedures.&lt;/p&gt;

&lt;p&gt;…&lt;/p&gt;

&lt;p&gt;Finally, esProc SPL is open source and free, available at &lt;a href="https://github.com/SPLWare/esProc"&gt;https://github.com/SPLWare/esProc&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>After retrieving JSON from ES Kafka Mongodb Restful...</title>
      <dc:creator>jbx1279</dc:creator>
      <pubDate>Sat, 24 Feb 2024 13:12:56 +0000</pubDate>
      <link>https://dev.to/jbx1279/after-retrieving-json-from-es-kafka-mongodb-restful-269g</link>
      <guid>https://dev.to/jbx1279/after-retrieving-json-from-es-kafka-mongodb-restful-269g</guid>
<description>&lt;p&gt;JSON is a useful format that carries rich structured data in a common text form. Many modern technologies prefer JSON for data transmission, such as Elastic Search, Restful, and Kafka; Mongodb, which is more concerned about performance, uses a binary form of JSON.&lt;/p&gt;




&lt;p&gt;Structured data usually comes in bulk and usually needs further computation.&lt;/p&gt;

&lt;p&gt;However, JSON class libraries are not very convenient for computation. JSONPath is fine for parsing JSON, but it has little computing power: simple filtering and aggregation work, but slightly more complex operations such as grouping and summarizing are beyond it. You basically have to hard-code them yourself.&lt;/p&gt;

&lt;p&gt;Load the data into a database to calculate? Too heavy. Moreover, JSON often has multiple layers of structure, and writing it into a relational database means creating several associated tables, so the cost of loading often exceeds the cost of the calculation itself.&lt;/p&gt;
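&lt;p&gt;As a rough illustration of what “hard-coding it yourself” looks like (all class and field names below are made up for the example), here is a hand-written group-and-sum over orders that some JSON library has already parsed into plain Java objects:&lt;/p&gt;

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: the kind of loop you must write by hand once
// JSONPath-style filtering runs out and you need grouping.
class ManualGrouping {
    record Order(String client, int year, double amount) {}

    // Group orders by year and total the amounts -- what SQL's GROUP BY
    // (or SPL's groups()) would express in a single line.
    static Map<Integer, Double> totalByYear(List<Order> orders) {
        Map<Integer, Double> totals = new TreeMap<>();
        for (Order o : orders) {
            totals.merge(o.year(), o.amount(), Double::sum);
        }
        return totals;
    }
}
```

&lt;p&gt;Every further grouping, filter, or join means another loop like this, which is exactly the hand-coding cost described above.&lt;/p&gt;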




&lt;p&gt;esProc SPL will help you.&lt;/p&gt;

&lt;p&gt;esProc SPL is an open source computing engine developed purely in Java, and it is here &lt;a href="https://github.com/SPLWare/esProc"&gt;https://github.com/SPLWare/esProc&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;esProc SPL encapsulates a JSON library and can parse JSON text into a computable SPL table sequence (SPL’s in-memory structured-data object) in just one line:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt; 
 &lt;thead&gt; 
  &lt;tr&gt;
   &lt;th&gt; &lt;/th&gt;
   &lt;th&gt; A&lt;/th&gt;
  &lt;/tr&gt; 
 &lt;/thead&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt;
   &lt;td&gt; 1 &lt;/td&gt;
   &lt;td&gt; =file("d:\xml\emp_orders.json").read()&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 2 &lt;/td&gt;
   &lt;td&gt; =json(A1)&lt;/td&gt;
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt;&lt;/div&gt; 

&lt;p&gt;An SPL table sequence naturally has a multi-layer structure: a field value can itself be another table sequence, which matches JSON’s structure naturally:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fft520p8n9u81q00q4752.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fft520p8n9u81q00q4752.png" alt="Image description" width="692" height="331"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the data is converted into a table sequence, computation is esProc’s strength: filtering, grouping, and joins are never a problem, and most calculation goals can be implemented in one line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Filter：T.select(Amount&amp;gt;1000 &amp;amp;&amp;amp; Amount&amp;lt;=3000 &amp;amp;&amp;amp; like(Client,"*s*"))
Sort：T.sort(Client,-Amount)
Distinct：T.id(Client)
Group：T.groups(year(OrderDate);sum(Amount))
Join：join(T1:O,SellerId; T2:E,EId)
TopN：T.top(-3;Amount)
TopN in group：T.groups(Client;top(3,Amount))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There is much more along these lines, which we won’t expand on here. Interested readers can refer to the relevant materials on the esProc SPL official website.&lt;/p&gt;

&lt;p&gt;esProc SPL has encapsulated many common access interfaces for JSON data sources.&lt;/p&gt;

&lt;p&gt;RESTful: plain-text JSON, which can be turned back into JSON text after calculation:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt; 
 &lt;thead&gt; 
  &lt;tr&gt;
   &lt;th&gt; &lt;/th&gt;
   &lt;th&gt; A&lt;/th&gt;
  &lt;/tr&gt; 
 &lt;/thead&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt;
   &lt;td&gt; 1 &lt;/td&gt;
   &lt;td&gt; =httpfile("http://127.0.0.1:6868/restful/emp_orders").read()&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 2 &lt;/td&gt;
   &lt;td&gt; =json(A1)&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 3 &lt;/td&gt;
   &lt;td&gt; =A2.conj(Orders).select(Amount&amp;gt;1000 &amp;amp;&amp;amp; Amount&amp;lt;=2000 &amp;amp;&amp;amp; like@c(Client,"*business*"))&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 4 &lt;/td&gt;
   &lt;td&gt; =json(A3)&lt;/td&gt;
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt;&lt;/div&gt; 

&lt;p&gt;Elasticsearch: JSON constants can be written directly in SPL code and take part in transmission and calculation:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt; 
 &lt;thead&gt; 
  &lt;tr&gt;
   &lt;th&gt; &lt;/th&gt;
   &lt;th&gt; A&lt;/th&gt;
  &lt;/tr&gt; 
 &lt;/thead&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt;
   &lt;td&gt; 1 &lt;/td&gt;
   &lt;td&gt; &amp;gt;apikey="Authorization:ApiKey a2x6aEF……KZ29rT2hoQQ=="&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 2 &lt;/td&gt;
   &lt;td&gt; '{&lt;br&gt;"counter" : 1,&lt;br&gt;"tags" : ["red"]&lt;br&gt;,"beginTime":"2022-01-03"&lt;br&gt;,"endTime":"2022-02-15"&lt;br&gt;}&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 3 &lt;/td&gt;
   &lt;td&gt; =es_rest("https://localhost:9200/index1/_doc/1", "PUT",A2;"Content-Type: application/x-ndjson",apikey)&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 4 &lt;/td&gt;
   &lt;td&gt; =json(A3.Content)&lt;/td&gt;
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt;&lt;/div&gt;
 

&lt;p&gt;MongoDB: binary JSON works too:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt; 
 &lt;thead&gt; 
  &lt;tr&gt;
   &lt;th&gt; &lt;/th&gt;
   &lt;th&gt; A&lt;/th&gt;
  &lt;/tr&gt; 
 &lt;/thead&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt;
   &lt;td&gt; 1 &lt;/td&gt;
   &lt;td&gt; =mongo_open("mongodb://127.0.0.1:27017/mymongo")&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 2 &lt;/td&gt;
   &lt;td&gt; =mongo_shell(A1,"{'find':'orders',filter:{OrderID: {$gte: 50}},batchSize:100}")&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 3 &lt;/td&gt;
   &lt;td&gt; =A2.cursor.firstBatch.select(Amount&amp;gt;1000 &amp;amp;&amp;amp; Amount&amp;lt;=2000 &amp;amp;&amp;amp; like@c(Client,"*business*"))&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 4 &lt;/td&gt;
   &lt;td&gt; =mongo_close(A1)&lt;/td&gt;
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt;&lt;/div&gt; 

&lt;p&gt;Kafka: SPL also encapsulates write interfaces for these data sources, closing the I/O loop:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt; 
 &lt;thead&gt; 
  &lt;tr&gt;
   &lt;th&gt; &lt;/th&gt;
   &lt;th&gt; A&lt;/th&gt;
  &lt;/tr&gt; 
 &lt;/thead&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt;
   &lt;td&gt; 1 &lt;/td&gt;
   &lt;td&gt; =kafka_open("/kafka/my.properties", "topic1")&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 2 &lt;/td&gt;
   &lt;td&gt; =kafka_poll(A1)&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 3 &lt;/td&gt;
   &lt;td&gt; =A2.derive(json(value):v).new(key, v.fruit, v.weight)&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 4 &lt;/td&gt;
   &lt;td&gt; =kafka_send(A1, "A100", json(A3))&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 5 &lt;/td&gt;
   &lt;td&gt; =kafka_close(A1)&lt;/td&gt;
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt;&lt;/div&gt; 

&lt;p&gt;For MongoDB, Kafka, and other data sources that may return large amounts of data, esProc SPL also provides cursor objects and methods that read batch by batch and process while reading. We won’t give detailed examples here; the official website has the details.&lt;/p&gt;




&lt;p&gt;Usually, JSON data does not exist in isolation; it needs to exchange data and perform mixed calculations with other data sources. esProc SPL was certainly not invented only to deal with JSON: it is a professional computing engine that supports a wide range of data sources:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxt88jzhovqlbif6fk8uh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxt88jzhovqlbif6fk8uh.png" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;SPL can read all of these data sources as table sequences or cursors, making mixed computation and data exchange very easy.&lt;/p&gt;




&lt;p&gt;Then, how can the code written in esProc SPL be integrated into the application?&lt;/p&gt;

&lt;p&gt;Very simply: esProc provides a standard JDBC driver, allowing Java programs to call SPL code just as they would execute database SQL.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Class.forName("com.esproc.jdbc.InternalDriver");
Connection conn =DriverManager.getConnection("jdbc:esproc:local://");
Statement statement = conn.createStatement();
ResultSet result = statement.executeQuery("=json(file(\"Orders.csv\")).select(Amount&amp;gt;1000 &amp;amp;&amp;amp; like(Client,\"*s*\"))");
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;More complex SPL scripts can be saved as files, just like calling stored procedures:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Class.forName("com.esproc.jdbc.InternalDriver");
Connection conn =DriverManager.getConnection("jdbc:esproc:local://");
CallableStatement statement = conn.prepareCall("call queryOrders()");
statement.execute();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;As software developed in pure Java, esProc SPL integrates seamlessly into Java applications, just like code written by the application programmers themselves, and shares the advantages of mature Java frameworks. SPL itself has well-established process-control statements, such as for loops and if branches, and supports subroutine calls. SPL alone can implement very complex business logic and form a complete business unit, with no need for cooperating Java code in the upper layer: the main program simply calls the SPL script.&lt;/p&gt;

&lt;p&gt;SPL scripts can be stored as files placed outside the main application. They can be modified independently and take effect immediately, unlike Java code, which must be recompiled after modification and requires shutting down and restarting the entire application. This achieves &lt;strong&gt;hot swap of business logic&lt;/strong&gt;, which is especially suitable for frequently changing business, exactly where JSON is widely used.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg8nf0qvk9w1kp610qpqc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg8nf0qvk9w1kp610qpqc.png" alt="Image description" width="800" height="276"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>json</category>
      <category>java</category>
      <category>webdev</category>
      <category>programming</category>
    </item>
    <item>
      <title>When TP database is too bloated, then add an AP database?</title>
      <dc:creator>jbx1279</dc:creator>
      <pubDate>Sat, 27 Jan 2024 12:30:15 +0000</pubDate>
      <link>https://dev.to/jbx1279/when-tp-database-is-too-bloated-then-add-an-ap-database-85e</link>
      <guid>https://dev.to/jbx1279/when-tp-database-is-too-bloated-then-add-an-ap-database-85e</guid>
      <description>&lt;p&gt;It is almost universally accepted in the industry and has been successfully practiced for many years that when the TP database is too bloated, then add an AP database. Is there anything to discuss?&lt;/p&gt;

&lt;p&gt;Using an AP database can indeed relieve the computational pressure on the TP database, and the AP database usually has better computational performance and gives users a better experience; it is indeed a common practice in the database field. However, that does not mean the road is entirely smooth.&lt;/p&gt;

&lt;p&gt;The first is cost. Operating and maintaining a database has never been simple. The TP database already entails plenty of work, and now another must be added, of a different type and requiring different expertise. The AP database is often a distributed MPP system, which is not only expensive but also highly complex to operate and maintain. Many new AP databases are indeed fast, but their functionality is incomplete: support for complex SQL or stored procedures is insufficient, which can mean a heavy transformation workload. Either way, soaring costs are inevitable.&lt;/p&gt;

&lt;p&gt;Secondly, migrating computing tasks to the AP database can be awkward. Usually, not all tasks should be migrated at once: doing so involves a huge workload and poses significant risks. A safer approach is to migrate gradually, first moving the most stressed tasks off the TP database. After all, the TP database is merely slow, not unusable; once some of the pressure is shared, it too runs more smoothly.&lt;/p&gt;

&lt;p&gt;But then selecting the AP database becomes a challenge. A database is a closed system in which all data and tasks are managed uniformly. As data and tasks keep being added, so does what the database must manage, and tasks that once ran smoothly may cease to. With few tasks migrated in the early stage, everything runs smoothly, of course; but if the AP database proves hard-pressed later on, a lot of work has already accumulated on it, and the only option is an awkward expansion.&lt;/p&gt;

&lt;p&gt;The closed nature of databases also causes T+0 (real-time calculation over the whole data) problems. A single database naturally computes over all its data, so there is no T+0 issue. Once the AP database is added, the data is split across two databases, and mixed computation across multiple databases is essentially impossible for the databases themselves, especially when the AP and TP databases are of different types. Either turn to professional data-synchronization tools, which rarely deliver smooth and timely results, or develop at the application layer, driving up development costs.&lt;/p&gt;

&lt;p&gt;Using an HTAP database is not a good solution either: the AP capability of such databases is usually insufficient, and it also means replacing the original TP database, which is too risky.&lt;/p&gt;

&lt;p&gt;So “when TP is too bloated, add an AP” looks beautiful, but in practice there are many problems.&lt;/p&gt;

&lt;p&gt;Is there another option?&lt;/p&gt;

&lt;p&gt;esProc SPL is a more lightweight and flexible solution.&lt;/p&gt;

&lt;p&gt;esProc SPL is not a database, but it has high computing performance and can serve in place of an AP database. Here is a TPCH 100G test result (in seconds):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt; 
 &lt;thead&gt; 
  &lt;tr&gt;
   &lt;th&gt; &lt;/th&gt;
   &lt;th&gt; esProc SPL &lt;/th&gt;
   &lt;th&gt; StarRocks &lt;/th&gt;
   &lt;th&gt; Clickhouse &lt;/th&gt;
   &lt;th&gt; Oracle&lt;/th&gt;
  &lt;/tr&gt; 
 &lt;/thead&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt;
   &lt;td&gt; q1 &lt;/td&gt;
   &lt;td&gt; 9.7 &lt;/td&gt;
   &lt;td&gt; 14.0 &lt;/td&gt;
   &lt;td&gt; 15.4 &lt;/td&gt;
   &lt;td&gt; 114.3&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; q2 &lt;/td&gt;
   &lt;td&gt; 1.3 &lt;/td&gt;
   &lt;td&gt; 0.6 &lt;/td&gt;
   &lt;td&gt; 17.3 &lt;/td&gt;
   &lt;td&gt; 1.9&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; q3 &lt;/td&gt;
   &lt;td&gt; 8.8 &lt;/td&gt;
   &lt;td&gt; 9.8 &lt;/td&gt;
   &lt;td&gt; memory overflow &lt;/td&gt;
   &lt;td&gt; 165.8&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; q4 &lt;/td&gt;
   &lt;td&gt; 4.9 &lt;/td&gt;
   &lt;td&gt; 5.7 &lt;/td&gt;
   &lt;td&gt; memory overflow &lt;/td&gt;
   &lt;td&gt; 158.4&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; q5 &lt;/td&gt;
   &lt;td&gt; 8.9 &lt;/td&gt;
   &lt;td&gt; 13.1 &lt;/td&gt;
   &lt;td&gt; memory overflow &lt;/td&gt;
   &lt;td&gt; 174.5&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; q6 &lt;/td&gt;
   &lt;td&gt; 4.5 &lt;/td&gt;
   &lt;td&gt; 3.9 &lt;/td&gt;
   &lt;td&gt; 4.8 &lt;/td&gt;
   &lt;td&gt; 126.7&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; q7 &lt;/td&gt;
   &lt;td&gt; 10.5 &lt;/td&gt;
   &lt;td&gt; 12.4 &lt;/td&gt;
   &lt;td&gt; memory overflow &lt;/td&gt;
   &lt;td&gt; 181.5&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; q8 &lt;/td&gt;
   &lt;td&gt; 6.9 &lt;/td&gt;
   &lt;td&gt; 8.3 &lt;/td&gt;
   &lt;td&gt; memory overflow &lt;/td&gt;
   &lt;td&gt; 209.7&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; q9 &lt;/td&gt;
   &lt;td&gt; 16.8 &lt;/td&gt;
   &lt;td&gt; 21.3 &lt;/td&gt;
   &lt;td&gt; memory overflow &lt;/td&gt;
   &lt;td&gt; 256.0&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; q10 &lt;/td&gt;
   &lt;td&gt; 8.3 &lt;/td&gt;
   &lt;td&gt; 11.1 &lt;/td&gt;
   &lt;td&gt; 58.3 &lt;/td&gt;
   &lt;td&gt; 195.6&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; q11 &lt;/td&gt;
   &lt;td&gt; 0.9 &lt;/td&gt;
   &lt;td&gt; 1.3 &lt;/td&gt;
   &lt;td&gt; 6.7 &lt;/td&gt;
   &lt;td&gt; 8.7&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; q12 &lt;/td&gt;
   &lt;td&gt; 4.9 &lt;/td&gt;
   &lt;td&gt; 4.8 &lt;/td&gt;
   &lt;td&gt; 10.7 &lt;/td&gt;
   &lt;td&gt; 186.0&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; q13 &lt;/td&gt;
   &lt;td&gt; 12.1 &lt;/td&gt;
   &lt;td&gt; 21.3 &lt;/td&gt;
   &lt;td&gt; 134.1 &lt;/td&gt;
   &lt;td&gt; 33.3&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; q14 &lt;/td&gt;
   &lt;td&gt; 3.3 &lt;/td&gt;
   &lt;td&gt; 4.6 &lt;/td&gt;
   &lt;td&gt; 10.2 &lt;/td&gt;
   &lt;td&gt; 170.0&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; q15 &lt;/td&gt;
   &lt;td&gt; 4.7 &lt;/td&gt;
   &lt;td&gt; 7.1 &lt;/td&gt;
   &lt;td&gt; 11.2 &lt;/td&gt;
   &lt;td&gt; 161.8&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; q16 &lt;/td&gt;
   &lt;td&gt; 2.7 &lt;/td&gt;
   &lt;td&gt; 2.9 &lt;/td&gt;
   &lt;td&gt; 4.0 &lt;/td&gt;
   &lt;td&gt; 10.8&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; q17 &lt;/td&gt;
   &lt;td&gt; 5.3 &lt;/td&gt;
   &lt;td&gt; 4.2 &lt;/td&gt;
   &lt;td&gt; 44.6 &lt;/td&gt;
   &lt;td&gt; 156.5&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; q18 &lt;/td&gt;
   &lt;td&gt; 6.4 &lt;/td&gt;
   &lt;td&gt; 20.8 &lt;/td&gt;
   &lt;td&gt; memory overflow &lt;/td&gt;
   &lt;td&gt; 416.8&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; q19 &lt;/td&gt;
   &lt;td&gt; 5.8 &lt;/td&gt;
   &lt;td&gt; 6.0 &lt;/td&gt;
   &lt;td&gt; &amp;gt;600 &lt;/td&gt;
   &lt;td&gt; 144.1&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; q20 &lt;/td&gt;
   &lt;td&gt; 5.2 &lt;/td&gt;
   &lt;td&gt; 5.2 &lt;/td&gt;
   &lt;td&gt; 31.2 &lt;/td&gt;
   &lt;td&gt; 171.0&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; q21 &lt;/td&gt;
   &lt;td&gt; 11.9 &lt;/td&gt;
   &lt;td&gt; 14.5 &lt;/td&gt;
   &lt;td&gt; syntax error &lt;/td&gt;
   &lt;td&gt; 360.7&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; q22 &lt;/td&gt;
   &lt;td&gt; 2.5 &lt;/td&gt;
   &lt;td&gt; 1.9 &lt;/td&gt;
   &lt;td&gt; 8.4 &lt;/td&gt;
   &lt;td&gt; 37.7&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; Total &lt;/td&gt;
   &lt;td&gt; 146.3 &lt;/td&gt;
   &lt;td&gt; 194.8 &lt;/td&gt;
   &lt;td&gt; - &lt;/td&gt;
   &lt;td&gt; 3441.8&lt;/td&gt;
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt;&lt;/div&gt; 

&lt;p&gt;It can be seen that esProc’s performance is orders of magnitude higher than that of TP databases represented by Oracle, and comparable to, or even surpassing, professional AP databases. For the detailed test report, see: &lt;a href="https://c.scudata.com/article/1684805351312"&gt;SPL computing performance test series: TPCH&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;With performance thus established, let’s look at how esProc SPL avoids the awkwardness of adding an AP database.&lt;/p&gt;

&lt;p&gt;Strictly speaking, esProc SPL is not a database but a professional computing engine. It stores data directly in files. esProc has no concept of a “(data)base”, hence no loading into or exporting out of a database, and no inside or outside of one. Data files can be stored in directories, moved freely, and made redundant, and can even be placed on the cloud, so management is very free and convenient. The operation and maintenance complexity of its data storage is far lower than that of a database.&lt;/p&gt;

&lt;p&gt;Unlike conventional databases, esProc, as a pure Java program, can be seamlessly embedded into Java applications. The entire computing engine is built into the JDBC driver package; no independent server process is required. The esProc core package is under 15MB, and even with various third-party data-source drivers it is only a few hundred MB; it can even run smoothly on Android. esProc behaves like code the application programmers wrote themselves, packaged into the application and running in the same process, sharing the advantages of mature Java frameworks, and the operation and maintenance of computing tasks are also very simple.&lt;/p&gt;

&lt;p&gt;esProc SPL needs no unified management system. Data can be stored dispersed across files, with no metadata concept and no constraints between data. Computing logic can likewise be dispersed across applications, and downstream tasks need not run on the same hardware resources as upstream ones at all, so resource contention and coupling cannot occur, let alone upstream tasks being affected by downstream ones. In this way, computing business can be migrated from the TP database to esProc easily and painlessly. (In fact, “to esProc” is not quite the right phrase; the correct statement is that esProc takes on the computing task. Unlike a database, esProc has no management domain of its own; it is just a computing engine, with no concept of “in” or “out”.)&lt;/p&gt;

&lt;p&gt;Of course, esProc also allows all computing tasks to be centralized and executed on designated resources. Decentralized, centralized, or hybrid: the mode depends entirely on the user’s needs and conditions. esProc is thus a very flexible solution that does not limit the user’s freedom.&lt;/p&gt;

&lt;p&gt;T+0 is even simpler. esProc SPL is an open computing engine that does not even count as a logical database. It requires no predefined metadata, and any accessible data source can be computed at any time, whether small data (in-memory tables) or large data (cursors). This makes cross-source computing easy, in particular mixed computation between the TP database and esProc files. T+0 is a natural benefit of esProc.&lt;/p&gt;

&lt;p&gt;esProc no longer uses SQL; it has its own programming language, SPL. SPL has all the computing power of SQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Orders.sort(Amount) // sorting
Orders.select(Amount*Quantity&amp;gt;3000 &amp;amp;&amp;amp; like(Client,"*S*")) // filtering
Orders.groups(Client; sum(Amount)) // grouping
Orders.id(Client) // distinct
join(Orders:o,SellerId ; Employees:e,EId) // join
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, it does require mastering a new programming language and rewriting the calculation tasks written in SQL into SPL. Users may worry that this means a large migration workload.&lt;/p&gt;

&lt;p&gt;At first glance it looks that way, but on closer analysis it becomes clear that the development workload with esProc is not much higher than with an AP database, and long-term development efficiency actually improves significantly.&lt;/p&gt;

&lt;p&gt;The SQL for simple calculation tasks is itself simple and ports easily to the AP database, sometimes even running unchanged. Rewriting it in SPL may be a little more troublesome: not difficult, but the syntax is completely different, so a rewrite is unavoidable. The real trouble lies in the complex tasks (the 80/20 principle: 80% of the workload goes into 20% of the difficult tasks). Their SQL often runs to hundreds or thousands of lines, and may use database-specific functions or even stored procedures, areas where the AP database and the TP database often differ significantly. Even staying in SQL, the workload approaches redevelopment, and for features the AP database does not support (such as stored procedures), external third-party programs or UDFs may be needed.&lt;/p&gt;

&lt;p&gt;SPL has much stronger computational capabilities than SQL, and for complex operations SPL is far more convenient. For example, to calculate the longest number of consecutive days a stock rises, SQL must be written as multiple nested layers, lengthy and hard to understand:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select max(ContinuousDays) from (
    select count(*) ContinuousDays from (
        select sum(UpDownTag) over (order by TradeDate) NoRisingDays from (
            select TradeDate,case when Price&amp;gt;lag(price) over ( order by TradeDate)then 0 else 1 end UpDownTag from Stock ))
    group by NoRisingDays )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The same calculation logic is much simpler to write in SPL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Stock.sort(TradeDate).group@i(Price&amp;lt;Price[-1]).max(~.len())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
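&lt;p&gt;For comparison, the same logic also reads naturally in plain Java once the declarative SQL is abandoned (a sketch; the price array is assumed to be already sorted by trade date):&lt;/p&gt;

```java
class RisingStreak {
    // Longest run of consecutive rises in a date-ordered price series --
    // the computation the SPL one-liner expresses with group@i.
    static int longestRisingRun(double[] prices) {
        int best = 0, run = 0;
        for (int i = 1; i < prices.length; i++) {
            run = prices[i] > prices[i - 1] ? run + 1 : 0; // reset on a non-rise
            best = Math.max(best, run);
        }
        return best;
    }
}
```

&lt;p&gt;The point is not that Java beats SQL here, but that the natural procedural statement of the problem survives intact, which is what SPL’s ordered-set operations capture in one line.&lt;/p&gt;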



&lt;p&gt;If this type of task is frequently encountered, which development workload is greater?&lt;/p&gt;

&lt;p&gt;The difficulty of debugging SQL and stored procedures is notorious worldwide, with no sign of improvement. SPL, by contrast, has a very convenient debugging interface, which further raises development efficiency:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frhgvo7ig6bs3zmerid76.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frhgvo7ig6bs3zmerid76.png" alt="Image description" width="768" height="408"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;(The SPL code is written in a grid, which is very different from ordinary programming languages. Please refer to here: &lt;a href="https://c.scudata.com/article/1681713093064"&gt;A programming language coding in a grid &lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;SPL can implement more high-performance algorithms, and its computational performance is also stronger than SQL’s. It often achieves cluster-level results on a single machine, and in the vast majority of cases can avoid going distributed, saving a large amount of procurement and operation costs. Due to space limitations, this topic will not be elaborated further; you can refer to &lt;a href="https://c.scudata.com/article/1698972218102"&gt;Here comes big data technology that rivals clusters on a single machine&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Finally, esProc SPL is open source and free. It is here &lt;a href="https://github.com/SPLWare/esProc"&gt;https://github.com/SPLWare/esProc&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>oltp</category>
      <category>olap</category>
      <category>database</category>
      <category>programming</category>
    </item>
    <item>
      <title>Another culprit behind the slow running and crashing of a database</title>
      <dc:creator>jbx1279</dc:creator>
      <pubDate>Sat, 20 Jan 2024 12:29:23 +0000</pubDate>
      <link>https://dev.to/jbx1279/another-culprit-behind-the-slow-running-and-crashing-of-a-database-3a0n</link>
      <guid>https://dev.to/jbx1279/another-culprit-behind-the-slow-running-and-crashing-of-a-database-3a0n</guid>
      <description>&lt;p&gt;That’s right, it’s the famous JOIN.&lt;/p&gt;

&lt;p&gt;JOIN has always been a major challenge in database computing, and the industry has devised many ways to compute it. Without optimization, it means looping over and traversing two associated tables, a multiplicative complexity that becomes unbearable once the data volume grows at all. Mature databases are certainly not so naive. For the most common equivalence JOIN (an association condition of equal key values), the HASH JOIN method is usually used, which reduces the computational complexity by a factor of K (K being the size of the HASH space; we won’t go into detail here, as there is plenty of material elsewhere).&lt;/p&gt;

&lt;p&gt;The HASH JOIN algorithm works well in memory. When the data is too large to fit, it typically involves partitioning into HASH buckets on disk and, with bad luck, a second round of hashing, making performance uncontrollable. Distributed computation is even more troublesome: the database industry has invented methods such as broadcast join and shuffle join to reduce the problem to single-machine JOINs (not elaborated here), which is cumbersome and degrades performance sharply, producing the phenomenon that more cluster nodes mean slower speed. Some databases implement only the in-memory algorithm for higher speed, so it is no surprise that they crash when the data volume is large.&lt;/p&gt;
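&lt;p&gt;As background, the in-memory HASH JOIN baseline is easy to sketch in Java (rows here are plain {key, value} int pairs, an illustration rather than a real storage format):&lt;/p&gt;

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Classic two-phase in-memory hash join: build a hash table on one
// table's join key, then probe it while scanning the other table.
// It works well only while the build side fits in memory.
class HashJoinDemo {
    static List<int[]> hashJoin(List<int[]> build, List<int[]> probe) {
        // Build phase: hash the (smaller) build-side table on its key.
        Map<Integer, Integer> hashed = new HashMap<>();
        for (int[] row : build) hashed.put(row[0], row[1]);
        // Probe phase: one scan of the other table, O(1) expected per lookup.
        List<int[]> joined = new ArrayList<>();
        for (int[] row : probe) {
            Integer v = hashed.get(row[0]);
            if (v != null) joined.add(new int[]{row[0], v, row[1]});
        }
        return joined;
    }
}
```

&lt;p&gt;Bucket caching, second rounds of hashing, broadcast and shuffle all exist to rescue this algorithm once the build side no longer fits in one machine’s memory.&lt;/p&gt;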




&lt;p&gt;Is JOIN really so difficult to deal with?&lt;/p&gt;

&lt;p&gt;If we implement JOIN strictly as SQL defines it, it is indeed quite difficult. Yet while everyone strives to solve JOIN performance problems, almost no one asks whether the definition itself is reasonable.&lt;/p&gt;

&lt;p&gt;In fact, the vast majority of equivalence JOINs with business significance involve primary keys (or logical primary keys) rather than arbitrary fields (if no primary key is involved, the code was most likely written incorrectly). Exploiting this feature makes it possible to optimize JOIN performance. Unfortunately, SQL’s definition of JOIN takes no notice of it.&lt;/p&gt;

&lt;p&gt;We can divide equivalence JOINs into two categories. One is the &lt;strong&gt;foreign key association&lt;/strong&gt; between an ordinary field of the fact table and the primary key of the dimension table, a many-to-one relationship. The other is the &lt;strong&gt;primary key association&lt;/strong&gt; between the primary key of the primary table and part of the primary key of the sub table, a one-to-many relationship (which may degenerate into a one-to-one homo-dimension-table relationship). Each category’s features can then be used to optimize performance.&lt;/p&gt;

&lt;p&gt;Dimension tables are usually relatively small and can be read entirely into memory. Then only the fact table needs to be traversed: no matter how large it is or how many dimension tables there are, no HASH bucketing is needed, let alone a second round of hashing. Once a table is identified as a dimension table, it can simply be replicated across cluster nodes, with no broadcast or shuffle during calculation.&lt;/p&gt;
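&lt;p&gt;The idea can be sketched in Java (a simplification with made-up names: the dimension table becomes an in-memory map built once, and the fact table is resolved against it in a single streaming pass):&lt;/p&gt;

```java
import java.util.Map;

// Foreign-key resolution against a dimension table held fully in memory:
// the fact table is merely traversed; no hash bucketing of either side.
class ForeignKeyLookup {
    // Total the amounts of fact rows whose city belongs to the given region.
    static double totalByRegion(Map<Integer, String> cityToRegion,
                                int[] factCityIds, double[] factAmounts,
                                String region) {
        double total = 0;
        for (int r = 0; r < factCityIds.length; r++) {
            // One in-memory lookup per fact row replaces the whole join.
            if (region.equals(cityToRegion.get(factCityIds[r]))) {
                total += factAmounts[r];
            }
        }
        return total;
    }
}
```

&lt;p&gt;However large the fact table is, the cost stays one traversal plus one lookup per row.&lt;/p&gt;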

&lt;p&gt;If a dimension table is too large to fit in memory, it can be stored (and distributed) ordered by its primary key. When the fact table is small, the JOIN becomes a search against the dimension table, again with no HASH partitioning or second HASH. When the fact table is also large, it can be buffered by intervals of the dimension table’s key: only one side needs partitioned buffering (reducing the amount buffered), and a second HASH can never occur. This case is not especially fast, but it still has clear advantages over HASH JOIN. In a distributed setting, the dimension table can be loaded into the memory of the cluster nodes and the fact table traversed separately at each node.&lt;/p&gt;
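&lt;p&gt;The small-fact-table case can be illustrated with a short Python sketch (hypothetical data): with the dimension table stored ordered by primary key, each fact row is resolved by a binary search, and no HASH partitioning is involved.&lt;/p&gt;

```python
import bisect

# Dimension table stored ordered by primary key (parallel arrays for brevity)
dim_keys = [10, 20, 30, 40]
dim_rows = ["a", "b", "c", "d"]

def lookup(key):
    # Binary search into the ordered dimension table
    i = bisect.bisect_left(dim_keys, key)
    return dim_rows[i] if i < len(dim_keys) and dim_keys[i] == key else None

facts = [30, 10, 30]                     # foreign-key values of a small fact table
joined = [(k, lookup(k)) for k in facts]
```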

&lt;p&gt;The primary-sub (homo-dimension) table case is even simpler. Once the tables are stored ordered by primary key, a merge algorithm of very low complexity applies: the JOIN needs only a little memory and one traversal, with no HASH partitioning, no second HASH, no shuffle, and not even any HASH computation. No matter how large the data volume is, it will not crash.&lt;/p&gt;
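&lt;p&gt;The merge algorithm mentioned here can be sketched as follows (Python, hypothetical order/detail data): one synchronized pass over two key-ordered tables, O(n+m) time and constant extra memory.&lt;/p&gt;

```python
# Merge join of a primary table and a sub table, both ordered by primary key
def merge_join(primary, sub):
    result, i = [], 0
    for pk, pval in primary:
        while i < len(sub) and sub[i][0] < pk:   # skip unmatched sub rows
            i += 1
        j = i
        while j < len(sub) and sub[j][0] == pk:  # collect all matches for this key
            result.append((pk, pval, sub[j][1]))
            j += 1
    return result

orders = [(1, "o1"), (2, "o2"), (3, "o3")]       # primary table
details = [(1, "d1"), (1, "d2"), (3, "d3")]      # sub table
result = merge_join(orders, details)
```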




&lt;p&gt;Unfortunately, relational databases cannot adopt these algorithms, as they must faithfully implement the definition in relational algebra. Some engines apply engineering optimizations, such as switching to a merge join when the data is found to be ordered. However, because relational algebra is founded on unordered sets, the physical order of the data cannot be guaranteed (logical order alone cannot avoid the huge cost of disk seeks), so in most cases these algorithms cannot be used.&lt;/p&gt;




&lt;p&gt;Well, esProc SPL can!&lt;/p&gt;

&lt;p&gt;Strictly speaking, esProc SPL is not a database but a professional computing engine. It defines equivalence JOINs in the two categories above and clearly defines the concepts of fact table, dimension table, primary table, and sub table (terms we often use when discussing databases, yet which relational algebra never strictly defines). esProc SPL no longer uses relational algebra and SQL; it has created a discrete dataset theory based on ordered sets and invented a new programming language, SPL, which naturally supports physically ordered storage (both in memory and on external storage) and can implement the efficient algorithms described above. SPL also provides dedicated functions for the different categories of JOINs, allowing programmers to easily use these algorithms to achieve high performance.&lt;/p&gt;




&lt;p&gt;For a foreign key association whose dimension table is small enough to be fully loaded into memory, if the join is done only once, esProc SPL’s method does not differ much from the database’s HASH JOIN, as a HASH value still has to be calculated to find the associated record. But the HASH values calculated by SPL can be reused: the next time the same dimension table participates in an association (a common occurrence), they need not be calculated again. Furthermore, if the fact table also fits in memory, the association can be established beforehand, so no HASH calculation or comparison is needed at JOIN time. All of this relies on the fact that it is the dimension table’s primary key that participates in the association, which a SQL system that does not recognize this feature cannot exploit.&lt;/p&gt;
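&lt;p&gt;The idea of establishing the association beforehand can be shown in Python with made-up data (a sketch of the concept, not SPL’s actual mechanism): the foreign-key value is resolved once into a direct reference to the dimension record, after which every later “join” is a plain dereference with no hashing or comparison.&lt;/p&gt;

```python
# Dimension table keyed by primary key
dim = {1: {"id": 1, "region": "East"}, 2: {"id": 2, "region": "West"}}
facts = [{"cust": 1, "amt": 10}, {"cust": 2, "amt": 20}]

# One-time pre-association: replace the foreign key with the record itself
for f in facts:
    f["cust"] = dim[f["cust"]]

# Later queries just follow the reference; no join computation is left
east_total = sum(f["amt"] for f in facts if f["cust"]["region"] == "East")
```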

&lt;p&gt;By utilizing SPL’s support for order, dimension table numbering can also be achieved: foreign keys are converted into sequence numbers, so dimension table records can be located directly by position, avoiding HASH calculation and comparison and further improving association performance. The order-unaware SQL system cannot implement this algorithm either.&lt;/p&gt;
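&lt;p&gt;Sequence numbering reduces the lookup to a positional array access. A minimal Python sketch with made-up data:&lt;/p&gt;

```python
# Dimension table where the row position itself is the key
dim = ["apple", "banana", "cherry"]

# Foreign keys already converted to 0-based sequence numbers
facts = [2, 0, 0, 1]

# The "join" is a direct array access: no hash computation, no key comparison
joined = [dim[fk] for fk in facts]
```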

&lt;p&gt;In other cases, such as when the data volume is too large for memory, the analysis above already shows the large advantages of the SPL algorithms. Due to space limitations, the principles of SPL’s JOIN execution are not explained in further detail here; interested readers can search the Scudata Forum for relevant material.&lt;/p&gt;




&lt;p&gt;Let’s look at the test results of join operations and a wide-table query (in seconds):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt; 
   &lt;td colspan="1" rowspan="2"&gt;&lt;br&gt;&lt;/td&gt; 
   &lt;td colspan="3" rowspan="1"&gt;4C16G&lt;/td&gt; 
   &lt;td colspan="3" rowspan="1"&gt;8C32G&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;esProc SPL&lt;/td&gt; 
   &lt;td&gt;StarRocks&lt;/td&gt; 
   &lt;td&gt;ClickHouse&lt;/td&gt; 
   &lt;td&gt;esProc SPL&lt;/td&gt; 
   &lt;td&gt;StarRocks&lt;/td&gt; 
   &lt;td&gt;ClickHouse&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;Wide table&lt;/td&gt; 
   &lt;td&gt;114.2&lt;/td&gt; 
   &lt;td&gt;129.9&lt;/td&gt; 
   &lt;td&gt;74.3&lt;/td&gt; 
   &lt;td&gt;57.7&lt;/td&gt; 
   &lt;td&gt;62.1&lt;/td&gt; 
   &lt;td&gt;33.2&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;Two table join&lt;/td&gt; 
   &lt;td&gt;21.5&lt;/td&gt; 
   &lt;td&gt;78.8&lt;/td&gt; 
   &lt;td&gt;204.1&lt;/td&gt; 
   &lt;td&gt;11.5&lt;/td&gt; 
   &lt;td&gt;35.1&lt;/td&gt; 
   &lt;td&gt;89.3&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;Seven table join&lt;/td&gt; 
   &lt;td&gt;55.6&lt;/td&gt; 
   &lt;td&gt;152.5&lt;/td&gt; 
   &lt;td&gt;Memory overflow&lt;/td&gt; 
   &lt;td&gt;30.6&lt;/td&gt; 
   &lt;td&gt;73.3&lt;/td&gt; 
   &lt;td&gt;Memory overflow&lt;/td&gt; 
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt;&lt;/div&gt; 

&lt;p&gt;As can be seen, esProc SPL’s wide-table performance is not as good as ClickHouse’s, but with these new algorithms (specifically multi-layer dimension table pre-association, dimension table primary key numbering, and primary-sub table merging), its JOIN performance has a significant advantage. ClickHouse’s poor JOIN performance is also no surprise: the two-table join involves only a small dimension table, so it is merely slow but still manageable; the seven-table join involves large primary and sub tables, and it collapses.&lt;/p&gt;

&lt;p&gt;The complete test report can be found in &lt;a href="https://c.scudata.com/article/1690170794600"&gt;SPL computing performance test series: associate tables and wide table&lt;/a&gt; .&lt;/p&gt;




&lt;p&gt;Now let’s look at an actual spatiotemporal collision case: identify the top 20 phones that have appeared most frequently in the same time period and location as a specified phone, over a data scale of approximately 25 billion rows. The SQL is written as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WITH DT AS ( SELECT DISTINCT id, ROUND(tm/900)+1 as tn, loc FROM T WHERE tm&amp;lt;3*86400)
SELECT * FROM (
    SELECT B.id id, COUNT( DISTINCT B.tn ) cnt
    FROM DT AS A JOIN DT AS B ON A.loc=B.loc AND A.tn=B.tn
    WHERE A.id=a AND B.id&amp;lt;&amp;gt;a
    GROUP BY id )
ORDER BY cnt DESC
LIMIT 20
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This query contains a self-JOIN. A single-node ClickHouse crashes outright, and a 5-node cluster takes more than 30 minutes to get the result. The SPL code gets the result in less than 6 minutes using just one node:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt; 
 &lt;thead&gt; 
  &lt;tr&gt;
   &lt;th&gt; &lt;/th&gt;
   &lt;th&gt; A&lt;/th&gt;
  &lt;/tr&gt; 
 &lt;/thead&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt;
   &lt;td&gt; 1 &lt;/td&gt;
   &lt;td&gt; =now()&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 2 &lt;/td&gt;
   &lt;td&gt; &amp;gt;NL=100000,NT=3*96&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 3 &lt;/td&gt;
   &lt;td&gt; =file("T.ctx").open()&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 4 &lt;/td&gt;
   &lt;td&gt; =A3.cursor(tm,loc;id==a).fetch().align(NL*NT,(loc-1)*NT+tm\900+1)&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 5 &lt;/td&gt;
   &lt;td&gt; =A3.cursor@mv(;id!=a &amp;amp;&amp;amp; A4((loc-1)*NT+tm\900+1))&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 6 &lt;/td&gt;
   &lt;td&gt; =A5.group@s(id;icount@o(tm\900):cnt).total(top(-20;cnt))&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 7 &lt;/td&gt;
   &lt;td&gt; =interval@ms(A1,now())&lt;/td&gt;
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt;&lt;/div&gt; 

&lt;p&gt;The SPL code shows no trace of JOIN at all, because after redefining JOIN, SPL can integrate it into other computational processes. Here A4 creates a dimension table, and A4((loc-1)*NT+tm\900+1) in A5 achieves the filtering effect of an inner join. It also uses the aforementioned sequence-numbering mechanism, effectively improving JOIN performance.&lt;/p&gt;




&lt;p&gt;Careful readers may notice that the effectiveness of the esProc SPL algorithm depends on the data being ordered by ID, while data is usually generated in time order rather than ID order. Does this mean the algorithm applies only to previously sorted historical data and fails for new data that cannot be sorted in time?&lt;/p&gt;

&lt;p&gt;esProc has taken this into account, and SPL’s multi-zone composite table can achieve incremental sorting when data enters, ensuring that the data is sorted by ID in real-time when read, allowing this ordered calculation scheme to be applied to the latest data. Moreover, statistics on fact tables usually involve time intervals, and SPL’s pseudo table supports a two-dimensional ordering mechanism, which can quickly filter out data outside the time interval and further improve computational performance.&lt;/p&gt;




&lt;p&gt;esProc is a pure Java software that can perform operations in any JVM environment and can be seamlessly embedded into Java programs, giving the computing power of a data warehouse to applications in various scenarios in a very lightweight manner.&lt;br&gt;
esProc provides a visual development environment that supports single step execution, breakpoint setting, and WYSIWYG result preview. Developing and debugging is much more convenient than SQL and stored procedures.&lt;/p&gt;

&lt;p&gt;SPL also has comprehensive process control statements, such as for loops and if branches, and supports subroutine calls. It has the procedural ability only available in stored procedures, and can comprehensively replace SQL and stored procedures.&lt;/p&gt;

&lt;p&gt;…&lt;/p&gt;




&lt;p&gt;Finally, esProc SPL is open source and free. It is here &lt;a href="https://github.com/SPLWare/esProc"&gt;https://github.com/SPLWare/esProc&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>A major culprit in the slow running and collapse of a database</title>
      <dc:creator>jbx1279</dc:creator>
      <pubDate>Sat, 13 Jan 2024 12:19:41 +0000</pubDate>
      <link>https://dev.to/jbx1279/a-major-culprit-in-the-slow-running-and-collapse-of-a-database-33jd</link>
      <guid>https://dev.to/jbx1279/a-major-culprit-in-the-slow-running-and-collapse-of-a-database-33jd</guid>
      <description>&lt;p&gt;It is the very inconspicuous account de-duplication count, written in SQL as COUNT (DISTINCT…).&lt;/p&gt;

&lt;p&gt;Account de-duplication count is common and has important business significance in business analysis. The account here may be a user ID, bank account, phone number, license plate number…. The calculation logic is basically the same, which is to calculate how many accounts meet a certain condition from the historical data of a certain period of time.&lt;/p&gt;

&lt;p&gt;For example, how many cars have been to New York last month? How many phones had calls between 2:00 am and 4:00 am last week? How many bank accounts have received overseas remittances this year? …&lt;/p&gt;

&lt;p&gt;Historical data usually only records events that have occurred, such as a car appearing in a certain place at a certain time, a phone calling someone at a certain time, a bank account receiving or sending a transfer on a certain date, and so on. Directly filtering the historical records with the specified criteria yields many records with the same account (a certain car may have been to New York multiple times, …), but each account can only be counted once, so de-duplication is needed.&lt;/p&gt;

&lt;p&gt;The filtering criteria in COUNT (DISTINCT …) are often not as simple as in the examples above. For example, how many credit card accounts have had a weekly total consumption exceeding 1,000 dollars this year? The account’s weekly consumption must be calculated before filtering. How many accounts have consumed over 100 dollars for three consecutive days in the past month? Even more complex existence judgments over the account’s transactions are required. However, regardless of the condition, it ultimately boils down to calculating COUNT (DISTINCT).&lt;/p&gt;

&lt;p&gt;The famous e-commerce funnel statistics are such complex COUNT (DISTINCT) operations, only more complex still. Funnel statistics involve multiple sequential events, each step corresponding to a COUNT (DISTINCT); together with the previous step’s COUNT (DISTINCT), it yields the customer churn rate of that step. Each step’s COUNT (DISTINCT) must be filtered based on the previous step’s, and the order in which the events occur must also be considered. The whole process is quite complex, but essentially it is still computing COUNT (DISTINCT).&lt;/p&gt;

&lt;p&gt;COUNT (DISTINCT) has always been a difficult problem in database computing, usually very slow. If the data volume is large (with a large number of accounts, which is also the norm), it may also lead to database crashes.&lt;/p&gt;

&lt;p&gt;Why is this?&lt;/p&gt;

&lt;p&gt;Because COUNT (DISTINCT) requires a large amount of computation. COUNT (DISTINCT ID) must save the distinct IDs traversed so far in a list; each incoming ID is compared against this list to determine whether it is new, and hence whether to increase the count and add it to the list. An ordinary COUNT (ID) saves no traversed IDs and does no comparisons, which is obviously much simpler. COUNT (DISTINCT) may look syntactically similar to COUNT and SUM, but it computes far more slowly.&lt;/p&gt;
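&lt;p&gt;The conventional algorithm just described amounts to this (Python sketch, hypothetical IDs): every distinct ID traversed so far must be kept and compared against, so memory use grows with the number of accounts.&lt;/p&gt;

```python
# Conventional COUNT(DISTINCT id): keep every distinct id seen so far
def count_distinct(ids):
    seen = set()                  # grows with the number of distinct accounts
    count = 0
    for i in ids:
        if i not in seen:         # compare against everything traversed so far
            seen.add(i)
            count += 1
    return count

n = count_distinct([3, 1, 3, 2, 1])   # 3 distinct ids
```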

&lt;p&gt;Moreover, many databases keep the above ID list in memory when calculating COUNT (DISTINCT ID), which allows high-speed access and comparison. However, if the number of accounts is large, memory may not hold the list, and a crash becomes inevitable. Caching the ID list to external storage avoids the crash, but access and comparison become inconvenient and performance drops sharply again.&lt;/p&gt;

&lt;p&gt;Fortunately, the ID in a COUNT (DISTINCT ID) calculation is usually a single column of data, and with a limit of 100GB of memory, billions of IDs can be held, exceeding the number of accounts in most applications. So although the conventional COUNT (DISTINCT) calculation is slow, it does not crash very often.&lt;/p&gt;

&lt;p&gt;But in more complex situations this may no longer hold. For example, funnel statistics mix multiple COUNT (DISTINCT)s and nest JOINs in the SQL. Making such a query run fast takes far more memory (JOINs also create a serious tension between memory and performance), and the probability of a crash rises sharply.&lt;/p&gt;

&lt;p&gt;The following is a three-step funnel analysis of a practical scenario implemented in SQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WITH e1 AS (
    SELECT userid, visittime AS step1_time, MIN(sessionid) AS sessionid, 1 AS step1
    FROM defined_events e1 JOIN eventgroup ON eventgroup.id = e1.eventgroup
    WHERE visittime &amp;gt;= DATE_ADD(arg_date,INTERVAL -14 day) AND visittime &amp;lt; arg_date AND eventgroup.name='SiteVisit'
    GROUP BY userid,visittime
), e2 AS (
    SELECT e2.userid, MIN(e2.sessionid) AS sessionid, 1 AS step2, MIN(visittime) AS step2_time, MIN(e1.step1_time) AS step1_time
    FROM defined_events e2 JOIN e1 ON e1.sessionid = e2.sessionid AND visittime &amp;gt; step1_time JOIN eventgroup ON eventgroup.id = e2.eventgroup
    WHERE visittime &amp;lt; DATE_ADD(step1_time ,INTERVAL +1 day) AND eventgroup.name = 'ProductDetailPage'
    GROUP BY e2.userid
), e3 AS (
    SELECT e3.userid, MIN(e3.sessionid) AS sessionid, 1 AS step3, MIN(visittime) AS step3_time, MIN(e2.step1_time) AS step1_time
    FROM defined_events e3 JOIN e2 ON e2.sessionid = e3.sessionid AND visittime &amp;gt; step2_time
    JOIN eventgroup ON eventgroup.id = e3.eventgroup
    WHERE visittime &amp;lt; DATE_ADD(step1_time ,INTERVAL +1 day) AND (eventgroup.name = 'OrderConfirmationType1')
    GROUP BY  e3.userid
)
SELECT s.devicetype AS devicetype,
    COUNT(DISTINCT CASE WHEN funnel_conversions.step1 IS NOT NULL THEN funnel_conversions.step1_userid  ELSE NULL END) AS step1_count,
    COUNT(DISTINCT CASE WHEN funnel_conversions.step2 IS NOT NULL THEN funnel_conversions.step2_userid  ELSE NULL END) AS step2_count,
    COUNT(DISTINCT CASE WHEN funnel_conversions.step3 IS NOT NULL THEN funnel_conversions.step3_userid  ELSE NULL END) AS step3_count,
    COUNT(DISTINCT CASE WHEN funnel_conversions.step3 IS NOT NULL THEN funnel_conversions.step3_userid  ELSE NULL END) 
        / COUNT(DISTINCT CASE WHEN funnel_conversions.step1 IS NOT NULL THEN funnel_conversions.step1_userid  ELSE NULL END) AS step3_rate
FROM (
    SELECT e1.step1_time AS step1_time, e1.userid AS userid, e1.userid AS step1_userid, e2.userid AS step2_userid,e3.userid AS step3_userid,
        e1.sessionid AS step1_sessionid, step1, step2, step3
    FROM e1 LEFT JOIN e2 ON e1.userid=e2.userid LEFT JOIN e3 ON e2.userid=e3.userid) funnel_conversions
LEFT JOIN sessions s ON funnel_conversions.step1_sessionid = s.id 
GROUP BY s.devicetype
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It can be seen that there are not only multiple COUNT (DISTINCT)s here, but also multiple self-join subqueries implementing the complex funnel-step judgments. This SQL statement ran for three minutes on a medium standard Snowflake cluster (4 nodes) without producing a result.&lt;/p&gt;

&lt;p&gt;Then, how can we solve this annoying COUNT (DISTINCT)?&lt;/p&gt;

&lt;p&gt;Actually, it’s not difficult: as long as the data is sorted by account, COUNT (DISTINCT) is easy to calculate.&lt;/p&gt;

&lt;p&gt;When the data is ordered by ID, calculating COUNT (DISTINCT ID) only requires saving the previous ID, a single value. If the next ID equals the saved one, it has already been counted and nothing changes; if it differs, the distinct count is increased and the saved ID is replaced. There is no large list to search and compare against, the memory for one saved ID is negligible, the calculation is fast, and a crash is impossible.&lt;/p&gt;
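&lt;p&gt;In code, the ordered algorithm looks like this (Python sketch over hypothetical sorted IDs): a single saved value replaces the whole ID list.&lt;/p&gt;

```python
# COUNT(DISTINCT id) over ids already sorted by id: remember only the last id
def count_distinct_ordered(sorted_ids):
    count, prev = 0, object()     # sentinel that equals no real id
    for i in sorted_ids:
        if i != prev:             # a new id appears: count it once
            count += 1
            prev = i
    return count

n = count_distinct_ordered([1, 1, 2, 2, 2, 5])   # 3 distinct ids
```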

&lt;p&gt;Complex tasks such as funnel analysis pose no problem either. When the data is ordered by ID, one ID’s data can be read into memory at a time and processed with arbitrarily complex calculations, without involving data from other IDs. In principle, it works as long as memory can hold one ID’s data, and keeping multiple sets of COUNT (DISTINCT) counts is no problem.&lt;/p&gt;

&lt;p&gt;Unfortunately, relational databases and SQL cannot do this.&lt;/p&gt;

&lt;p&gt;As the theoretical foundation of relational databases, relational algebra is based on unordered sets. In SQL, set members (records of tables) have no order, and databases do not theoretically support ordering when storing data. The above optimization logic cannot be implemented in SQL.&lt;/p&gt;

&lt;p&gt;esProc SPL can!&lt;/p&gt;

&lt;p&gt;Strictly speaking, esProc SPL is not a database, but a professional computing engine. It no longer adopts relational algebra, but has created its own discrete dataset theory based on ordered sets and invented a new programming language SPL. esProc deliberately supports ordering when storing data, and SPL also provides rich methods for ordered calculations, which enables high-performance and low resource implementation of COUNT (DISTINCT) calculation.&lt;/p&gt;

&lt;p&gt;The icount function of SPL is used for COUNT (DISTINCT) calculation, and by default, it will be implemented using the aforementioned method, which maintains a list of traversed different IDs, and the next ID will be compared to the list. If the data is ordered by ID, then use icount@o, in this case, SPL will adopt the ordered COUNT (DISTINCT) algorithm, which only maintains the previous ID value for comparison.&lt;/p&gt;

&lt;p&gt;In particular, SPL also supports ordered cursor. When data is ordered by ID, a batch of data with the same ID can be fetched each time, and complex calculations can be performed to determine whether the current ID meets the filtering criteria. The subsequent count can be implemented directly using COUNT, without the need to maintain the previous ID and the action of comparison (which has already been compared during cursor retrieval).&lt;/p&gt;
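&lt;p&gt;The effect of such an ordered cursor can be imitated in Python with made-up data: rows sorted by ID are consumed one account at a time, each batch is judged with arbitrary logic, and the final tally is a plain count, with no de-duplication anywhere.&lt;/p&gt;

```python
from itertools import groupby

# (id, amount) event rows, already stored ordered by id
rows = [(1, 50), (1, 200), (2, 30), (3, 150), (3, 120)]

# Count accounts with at least one amount over 100: read one id's batch
# at a time, apply the (possibly complex) per-account judgment, then COUNT
qualified = sum(
    1
    for _, batch in groupby(rows, key=lambda r: r[0])
    if any(amt > 100 for _, amt in batch)
)
```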

&lt;p&gt;Funnel analysis can be implemented using this mechanism:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt; 
 &lt;thead&gt; 
  &lt;tr&gt;
   &lt;th&gt; &lt;/th&gt;
   &lt;th&gt; A &lt;/th&gt;
  &lt;/tr&gt; 
 &lt;/thead&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt;
   &lt;td&gt; 1 &lt;/td&gt;
   &lt;td&gt; =now() &lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 2 &lt;/td&gt;
   &lt;td&gt; =eventgroup=file("eventgroup.btx").import@b() &lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 3 &lt;/td&gt;
   &lt;td&gt; =devicetype=file("devicetype.btx").import@b() &lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 4 &lt;/td&gt;
   &lt;td&gt; =long(elapse(arg_date,-14)) &lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 5 &lt;/td&gt;
   &lt;td&gt; =long(arg_date) &lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 6 &lt;/td&gt;
   &lt;td&gt; =long(arg_date+1) &lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 7 &lt;/td&gt;
   &lt;td&gt; =A2.(case(NAME,"SiteVisit":1,"ProductDetailPage":2,"OrderConfirmationType1":3;null)) &lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 8 &lt;/td&gt;
   &lt;td&gt; =file("defined_events.ctx").open() &lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 9 &lt;/td&gt;
   &lt;td&gt; =A8.cursor@m(USERID,SESSIONID,VISITTIME,EVENTGROUPNO;VISITTIME&amp;gt;=A4 &amp;amp;&amp;amp; VISITTIME &lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 10 &lt;/td&gt;
   &lt;td&gt; =sessions=file("sessions.ctx").open().cursor@m(USERID,ID,DEVICETYPENO;;A9) &lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 11 &lt;/td&gt;
   &lt;td&gt; =A9.joinx@m(USERID:SESSIONID,A10:USERID:ID,DEVICETYPENO) &lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 12 &lt;/td&gt;
   &lt;td&gt; =A11.group(USERID) &lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 13 &lt;/td&gt;
   &lt;td&gt; =A12.new(~.align@a(3,EVENTGROUPNO):e,e(1).select(VISITTIME &lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 14 &lt;/td&gt;
   &lt;td&gt; =A13.run(e=join@m(e1:e1,SESSIONID;e2:e2,SESSIONID).select( e2=e2.select(VISITTIME&amp;gt;e1.VISITTIME &amp;amp;&amp;amp; VISITTIME &lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 15 &lt;/td&gt;
   &lt;td&gt; =A14.run(e0=e1.id(DEVICETYPENO),e1=e.min(e1.VISITTIME),e2=e.min(e2),e=e.min(e1.SESSIONID),e3=e3.select(SESSIONID==e &amp;amp;&amp;amp; VISITTIME&amp;gt;e2 &amp;amp;&amp;amp; VISITTIME &lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 16 &lt;/td&gt;
   &lt;td&gt; =A15.news(e;~:DEVICETYPE,e2,e3) &lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 17 &lt;/td&gt;
   &lt;td&gt; =A16.groups(DEVICETYPE;count(1):STEP1_COUNT,count(e2):STEP2_COUNT,count(e3):STEP3_COUNT,null:STEP3_RATE) &lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 18 &lt;/td&gt;
   &lt;td&gt; =A17.run(DEVICETYPE=devicetype.m(DEVICETYPE).DEVICETYPE,STEP3_RATE=STEP3_COUNT/STEP1_COUNT) &lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 19 &lt;/td&gt;
   &lt;td&gt; =interval@s(A1,now()) &lt;/td&gt;
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt;&lt;/div&gt; 

&lt;p&gt;(The SPL code is written in a grid, which is very different from ordinary programming languages. Please refer to here: &lt;a href="https://c.scudata.com/article/1681713093064"&gt;A programming language coding in a grid&lt;/a&gt; ）&lt;/p&gt;

&lt;p&gt;A12 reads the data of one USERID at a time for the subsequent judgment, and in A17 the count function is used directly, with no icount needed. This code is not only more concise and more general (to add funnel steps, just change A7, whereas the SQL needs many more subqueries), but also runs faster: using an EC2 of the same specification as Snowflake’s, it completes in just 10 seconds on a single machine.&lt;/p&gt;

&lt;p&gt;In fact, COUNT (DISTINCT) is just one manifestation; the essence of this problem is grouping data by account and then processing each group. Grouping and aggregating by account is a similar operation, and some of the aggregations are not simple SUM/MAX but require a complex process to calculate. For example: counting, for each phone, the number of calls shorter than 3 seconds; or calculating each credit card account’s newly added points for last month, where the rule is that if the account spends more than 100 dollars for three consecutive days, the points for those days are doubled; …&lt;/p&gt;

&lt;p&gt;Almost all event data in information systems hangs under some account, so this type of operation is very common and appears in all kinds of queries and batch jobs; it can be called one of the most common business logic models. With esProc SPL’s ordered calculations based on ordered storage, this large class of problems can be implemented concisely and with high performance, while it remains very difficult for SQL-based relational databases.&lt;/p&gt;

&lt;p&gt;Here is a variant example, a spatiotemporal collision problem, to identify the top 20 mobile phones that have appeared the most frequently in the same time period and location as a specified mobile phone, with a data scale of approximately 25 billion rows. The SQL is written as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WITH DT AS ( SELECT DISTINCT id, ROUND(tm/900)+1 as tn, loc FROM T WHERE tm&amp;lt;3*86400)
SELECT * FROM (
    SELECT B.id id, COUNT( DISTINCT B.tn ) cnt
    FROM DT AS A JOIN DT AS B ON A.loc=B.loc AND A.tn=B.tn
    WHERE A.id=a AND B.id&amp;lt;&amp;gt;a
    GROUP BY id )
ORDER BY cnt DESC
LIMIT 20
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This query contains nested DISTINCT operations and self-JOINs. A single-node ClickHouse crashes outright, and a 5-node cluster takes more than 30 minutes to get the result.&lt;/p&gt;

&lt;p&gt;The SPL code utilizes ordered storage and the previously mentioned ordered cursor to effectively avoid these difficulties, getting the result in less than 6 minutes using just one node.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt; 
 &lt;thead&gt; 
  &lt;tr&gt;
   &lt;th&gt; &lt;/th&gt;
   &lt;th&gt; A &lt;/th&gt;
  &lt;/tr&gt; 
 &lt;/thead&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt;
   &lt;td&gt; 1 &lt;/td&gt;
   &lt;td&gt; =now() &lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 2 &lt;/td&gt;
   &lt;td&gt; &amp;gt;NL=100000,NT=3*96 &lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 3 &lt;/td&gt;
   &lt;td&gt; =file("T.ctx").open() &lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 4 &lt;/td&gt;
   &lt;td&gt; =A3.cursor(tm,loc;id==a).fetch().align(NL*NT,(loc-1)*NT+tm\900+1) &lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 5 &lt;/td&gt;
   &lt;td&gt; =A3.cursor@mv(;id!=a &amp;amp;&amp;amp; A4((loc-1)*NT+tm\900+1)) &lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 6 &lt;/td&gt;
   &lt;td&gt; =A5.group@s(id;icount@o(tm\900):cnt).total(top(-20;cnt)) &lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 7 &lt;/td&gt;
   &lt;td&gt; =interval@ms(A1,now()) &lt;/td&gt;
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt;&lt;/div&gt; 

&lt;p&gt;Careful readers may notice that the effectiveness of the esProc SPL algorithm depends on the data being ordered by ID, while data is usually generated in time order rather than ID order. Does this mean the algorithm applies only to previously sorted historical data and fails for new data that cannot be sorted in time?&lt;/p&gt;

&lt;p&gt;esProc has taken this into account, and SPL’s multi-zone composite table can achieve incremental sorting when data enters, ensuring that the data is sorted by ID in real-time when read, allowing this ordered calculation scheme to be applied to the latest data. Moreover, such operations usually involve time intervals, and SPL’s pseudo table supports a two-dimensional ordering mechanism, which can quickly filter out data outside the time interval and further improve computational performance.&lt;/p&gt;

&lt;p&gt;esProc is a pure Java software that can perform operations in any JVM environment and can be seamlessly embedded into Java programs, giving the computing power of a data warehouse to applications in various scenarios in a very lightweight manner.&lt;/p&gt;

&lt;p&gt;esProc provides a visual development environment that supports single step execution, breakpoint setting, and WYSIWYG result preview. Developing and debugging is much more convenient than SQL and stored procedures.&lt;/p&gt;

&lt;p&gt;SPL also has comprehensive process control statements, such as for loops and if branches, and supports subroutine calls. It has the procedural ability only available in stored procedures, and can comprehensively replace SQL and stored procedures.&lt;/p&gt;

&lt;p&gt;…&lt;/p&gt;

&lt;p&gt;Finally, esProc SPL is open source and free. It is here &lt;a href="https://github.com/SPLWare/esProc"&gt;https://github.com/SPLWare/esProc&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>bigdata</category>
      <category>database</category>
      <category>datawarehouse</category>
      <category>performance</category>
    </item>
    <item>
      <title>An alternative of SQLite for Java Programmer</title>
      <dc:creator>jbx1279</dc:creator>
      <pubDate>Sat, 06 Jan 2024 11:04:19 +0000</pubDate>
      <link>https://dev.to/jbx1279/an-alternative-of-sqlite-for-java-programmer-a8i</link>
      <guid>https://dev.to/jbx1279/an-alternative-of-sqlite-for-java-programmer-a8i</guid>
      <description>&lt;p&gt;Many small and micro applications also need some data processing &amp;amp; computing capability, but the framework becomes too heavy if we integrate a database in them. In view of this, the small and lightweight SQLite has become a good choice and been widely used.&lt;/p&gt;

&lt;p&gt;SQLite also has its inconveniences. Its support for external data files and other data sources is weak and complicated. It provides no stored procedures, so the main program must drive the procedural flow itself, leading to frequent data exchange between SQLite and the main program (because the flow depends on data) and to inefficient, tedious code. Complicated computations are also hard to express in SQL, so development efficiency is low.&lt;/p&gt;




&lt;p&gt;If you happen to be programming in Java or Android, esProc SPL might be a better choice. All those problems are gone.&lt;/p&gt;

&lt;p&gt;esProc is a free, open-source software, which is offered in &lt;a href="https://github.com/SPLWare/esProc"&gt;https://github.com/SPLWare/esProc&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;esProc is developed purely in Java. Users just need to import its jars into the Java application. The integration is completely seamless.&lt;/p&gt;




&lt;p&gt;esProc also supplies a standard JDBC driver, through which the Java main program can invoke it, much like accessing a database. However, the query language esProc uses is SPL instead of SQL.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Class.forName("com.esproc.jdbc.InternalDriver");
Connection conn =DriverManager.getConnection("jdbc:esproc:local://");
Statement statement = conn.createStatement();
ResultSet result = statement.executeQuery("=T(\"Orders.csv\").select(Amount&amp;gt;1000 &amp;amp;&amp;amp; like(Client,\"*s*\"))");
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;esProc also supports SQL and does not need metadata definitions; it simply treats the data file as a table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ResultSet result = statement.executeQuery("$SELECT * FROM Orders.csv WHERE Amount&amp;gt;1000 AND Client LIKE '%s%'")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, for more complex computational goals, SPL is much simpler than SQL. For example, to calculate the longest consecutive days a stock keeps rising, SQL has to be written as multiple nested subqueries that are verbose and difficult to understand:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select max(ContinuousDays) from (
    select count(*) ContinuousDays from (
       select sum(UpDownTag) over (order by TradeDate) NoRisingDays from (
            select TradeDate,case when Price&amp;gt;lag(Price) over (order by TradeDate) then 0 else 1 end UpDownTag from Stock ))
    group by NoRisingDays )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is much simpler and easier to write the same logic in SPL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Stock.sort(TradeDate).group@i(Price&amp;lt;Price[-1]).max(~.len())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is also okay to implement it in a more natural, sequential way (SQL cannot express this procedure and can only use the roundabout multi-layer nesting above):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt; 
 &lt;thead&gt; 
  &lt;tr&gt;
   &lt;th&gt; &lt;/th&gt;
   &lt;th&gt; A&lt;/th&gt;
  &lt;/tr&gt; 
 &lt;/thead&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt;
   &lt;td&gt; 1&lt;/td&gt;
   &lt;td&gt; =cnt=0&lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 2&lt;/td&gt;
   &lt;td&gt; =Stock.sort(TradeDate).max(cnt=if(Price&amp;gt;Price[-1],cnt+1,0))&lt;/td&gt;
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt;&lt;/div&gt; 
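&lt;p&gt;The sequential logic above can be sketched in plain Java as well (a minimal sketch over a hypothetical price array, not esProc code): a counter increments on each rise, resets on a fall, and the running maximum is tracked.&lt;/p&gt;

```java
public class RisingStreak {
    // Longest run of consecutive rises in a price series, mirroring the
    // sequential SPL logic: increment a counter on each rise, reset it
    // on a non-rise, and keep the running maximum.
    public static int longestRise(double[] prices) {
        int cnt = 0, max = 0;
        for (int i = 1; i < prices.length; i++) {
            cnt = prices[i] > prices[i - 1] ? cnt + 1 : 0;
            if (cnt > max) max = cnt;
        }
        return max;
    }
}
```

&lt;p&gt;For the series 10, 11, 12, 11, 12, 13, 14, 13 this returns 3, the longest run of consecutive rises.&lt;/p&gt;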

&lt;p&gt;Unlike most of the programming languages that write code as text, SPL code is written in a grid. Find more information &lt;a href="https://c.scudata.com/article/1681713093064"&gt;HERE&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--oqMa6noU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mwcv07wiytawtykavmrd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--oqMa6noU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mwcv07wiytawtykavmrd.png" alt="Image description" width="768" height="408"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;SPL itself has a complete set of flow control statements, such as for loops and if branches, and supports subroutine invocation. This amounts to stored-procedure capability: SPL alone can implement complex business logic, almost without cooperation from the main program, which just invokes the SPL code the way it calls a stored procedure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Class.forName("com.esproc.jdbc.InternalDriver");
Connection conn =DriverManager.getConnection("jdbc:esproc:local://");
CallableStatement statement = conn.prepareCall("{call queryOrders()}");
statement.execute();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The difference is that SPL scripts are interpreted, so any modification takes effect instantly, while a change to a stored procedure requires recompilation. In particular, SPL scripts can be stored separately from the main program, and modifying them requires no recompilation or redeployment of the latter, which enables real-time hot swapping. Logic implemented jointly by the main program and SQL in the database has no such benefit.&lt;/p&gt;

&lt;p&gt;SPL supports a large variety of data sources, familiar and unfamiliar alike, including text files in various formats, Excel files, relational databases, NoSQL databases, HTTP, Kafka, and JSON/XML data. esProc provides an access interface for each of them; users need just one or two lines of code to read from or write to them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;T("Orders.csv").select(Amount&amp;gt;2000 &amp;amp;&amp;amp; Amount&amp;lt;=3000)
Orders=json(httpfile("http://127.0.0.1:6868/api/orders").read())
db=connect("mysql")
db.query("select * from salesR where SellerID=?",10)
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To access external data, just read it directly; there is no need to create a table first, which is very convenient. In addition, these files and data sources are writable in SPL, so they can serve for data persistence, and the written data can be accessed by other applications.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;file("Orders.csv").export@t(A2)
file("Orders.xlsx").xlsexport@t(A2)
db.update(NewTable:OldTable)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;SPL also offers its own binary file formats, which enable higher read/write performance.&lt;/p&gt;




&lt;p&gt;esProc is also very lightweight: beyond the JVM, its core package is only about 15 MB, almost negligible, and it runs smoothly on Android as well.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>The Terminator of ORM Technology</title>
      <dc:creator>jbx1279</dc:creator>
      <pubDate>Sat, 30 Dec 2023 13:20:12 +0000</pubDate>
      <link>https://dev.to/jbx1279/the-terminator-of-orm-technology-1241</link>
      <guid>https://dev.to/jbx1279/the-terminator-of-orm-technology-1241</guid>
      <description>&lt;p&gt;Hibernate, Mybatis, and emerging ORM technologies such as JOOQ can easily map database tables into Java objects and provide automatic read and write methods. ORM technology makes developing database applications in Java more efficient.&lt;/p&gt;

&lt;p&gt;However, mapping data tables is only the most basic part; developing business logic also involves a lot of process control and data computation. Process control is a strong point of Java and poses no difficulty. Batch structured data computing, however, has always been a weakness of Java, and programming it directly in Java is very cumbersome.&lt;/p&gt;

&lt;p&gt;Unfortunately, the computing capability provided by these ORM technologies is also unsatisfactory. Hibernate relies almost entirely on translating HQL to SQL, and HQL corresponds to only a very small subset of SQL, so its computational capability is significantly weak. JOOQ is much stronger, providing many basic operations in Java style, such as filtering and grouping (one reason it has surpassed Hibernate in recent years), but its code is still not as concise as SQL.&lt;/p&gt;




&lt;p&gt;For developing database applications in Java, esProc SPL is a better choice.&lt;/p&gt;

&lt;p&gt;esProc SPL is an open-source software written in Java. As a pure Java software, esProc can be seamlessly integrated into Java applications like ORM, enjoying the advantages of mature Java frameworks together.&lt;/p&gt;




&lt;p&gt;Unlike ORM technologies, esProc provides a new JVM-based programming language, SPL, rather than using Java directly. Java programs call SPL scripts via the JDBC driver provided by esProc, just as they would call database SQL or stored procedures.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Class.forName("com.esproc.jdbc.InternalDriver");
Connection conn =DriverManager.getConnection("jdbc:esproc:local://");
Statement statement = conn.createStatement();
ResultSet result = statement.executeQuery("=T(\"Orders.csv\").select(Amount&amp;gt;1000 &amp;amp;&amp;amp; like(Client,\"*s*\"))");
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why design a new programming language instead of directly encapsulating it as Java APIs?&lt;/p&gt;

&lt;p&gt;Java is a compiled, static language, which makes it difficult to implement the dynamic data structures and convenient Lambda syntax that are ubiquitous in structured data operations and are the advantage of SQL.&lt;/p&gt;

&lt;p&gt;Any SELECT statement in SQL generates a new data structure, and fields can be added or removed without defining the structure (class) beforehand; this is commonplace in structured data operations. Languages like Java cannot do this well. All structures (classes) used must be defined at compile time, and for practical purposes new classes cannot be generated dynamically during execution (Java theoretically supports dynamic compilation, but the complexity is too high). If a single generic class is used to represent all data tables, with field names treated as data members, then fields cannot be referenced with the class’s attribute syntax, and the code becomes very cumbersome.&lt;/p&gt;

&lt;p&gt;Lambda syntax is widely used in SQL; the condition in WHERE, for instance, is essentially a Lambda expression. Although Java, a static language, now supports Lambda syntax, it is far less convenient than SQL. A function-header definition must tell the compiler, every time, that a Lambda function is about to be written, which makes the code look messy. Inside a Lambda function, field names of the data table cannot be referenced directly. For example, when calculating an amount from unit price and quantity, if the parameter representing the current member is named x, the expression must be written in the verbose form “x.unitPrice * x.quantity”, whereas in SQL it can be written more intuitively as “unitPrice * quantity”.&lt;/p&gt;
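&lt;p&gt;The verbosity is easy to reproduce. In this minimal Java sketch (with a hypothetical Order record, not from any real schema), every field reference inside the lambda must be qualified with the parameter name:&lt;/p&gt;

```java
import java.util.List;

public class LambdaVerbosity {
    // Hypothetical order structure; in Java it must be declared ahead of time.
    record Order(String client, double unitPrice, int quantity) {}

    // Every field reference inside the lambda must be qualified with the
    // parameter name: x.unitPrice() * x.quantity(), where SQL allows the
    // bare "unitPrice * quantity".
    static double totalAmount(List<Order> orders) {
        return orders.stream()
                     .mapToDouble(x -> x.unitPrice() * x.quantity())
                     .sum();
    }
}
```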

&lt;p&gt;Only an interpreted dynamic language can implement these features of SQL: it can generate new data structures at any time, and the host function itself can determine whether the current parameter is a Lambda function, so no definition header is needed and fields without table-name qualifiers can be resolved correctly from context.&lt;/p&gt;

&lt;p&gt;SQL is an interpreted dynamic language, and so is SPL. Java is not, so it is difficult to write concise code in Java; no matter how the APIs are designed, it cannot match them.&lt;/p&gt;




&lt;p&gt;On the basis of an interpreted dynamic language, SPL provides more comprehensive structured data objects (tables, records, cursors) and richer computational functions than SQL, including the basic operations of SQL such as filtering, grouping, and join, as well as the ordered and set operations missing from SQL. So SPL code is usually more concise and easier to maintain than SQL, and much stronger than ORM-based Java code. Here are some simple comparisons (using JOOQ, the strongest in computing capability, as the example):&lt;/p&gt;

&lt;p&gt;Filtering&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SQL: select * from Orders where ((SellerID=2 and Amount&amp;lt;3000) or (SellerID=3 and Amount&amp;gt;=2000 and Amount&amp;lt;5000)) and year(OrderDate)&amp;gt;2010
SPL: Orders.select( ((SellerID==2 &amp;amp;&amp;amp; Amount&amp;lt;3000) || (SellerID==3 &amp;amp;&amp;amp; Amount&amp;gt;=2000 &amp;amp;&amp;amp; Amount&amp;lt;5000)) &amp;amp;&amp;amp; year(OrderDate)&amp;gt;2010)
JOOQ:
    context.select().from(ORDERS)
        .where(((ORDERS.SELLERID.equal(2).and(ORDERS.AMOUNT.lessThan(3000.0)))
    .or((ORDERS.SELLERID.equal(3).and(ORDERS.AMOUNT.greaterOrEqual(2000.0).and(ORDERS.AMOUNT.lessThan(5000.0))))))
    .and(year(ORDERS.ORDERDATE).greaterThan(2010)))
    .fetch(); 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Grouping&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SQL: select Client, extract(year from OrderDate) y,count(1) cnt from Orders group by Client, extract(year from OrderDate) having amt&amp;lt;20000
SPL: Orders.groups(Client,year(OrderDate):y;sum(Amount):amt,count(1):cnt).select(amt&amp;lt;20000)
JOOQ:
    context.select(ORDERS.CLIENT,year(ORDERS.ORDERDATE).as("y"),sum(ORDERS.AMOUNT).as("amt"),count(one()).as("cnt"))
    .from(ORDERS)
    .groupBy(ORDERS.CLIENT,year(ORDERS.ORDERDATE))
    .having(field("amt").lessThan(20000)).fetch();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For more complex tasks, such as calculating the longest consecutive days a stock keeps rising, SQL has to be written as multiple nested subqueries that are lengthy and difficult to understand:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select max(ContinuousDays) from (
    select count(*) ContinuousDays from (
        select sum(UpDownTag) over (order by TradeDate) NoRisingDays from (
            select TradeDate,case when Price&amp;gt;lag(Price) over (order by TradeDate) then 0 else 1 end UpDownTag from Stock ))
    group by NoRisingDays )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;JOOQ also has to use window functions, and the result is even more complex than the SQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WindowDefinition woDay1 = name("woDay").as(orderBy(Stock.TradeDate));
Table&amp;lt;?&amp;gt;T0=table(select(Stock.TradeDate.as("TradeDate"),when(Stock.Price.greaterThan(lag(Stock.Price).over(woDay1)),0).otherwise(1).as("risingflag")).from(Stock).window(woDay1)).as("T0");
WindowDefinition woDay2 = name("woDay1").as(orderBy(T0.field("TradeDate")));
    Table&amp;lt;?&amp;gt;T1=table(select(sum(T0.field("UpDownTag").cast(java.math.BigDecimal.class)).over(woDay2).as("NoRisingdDys")).from(T0).window(woDay2)).as("T1");
Table&amp;lt;?&amp;gt;T2=table(select(count(one()).as("ContinuousDays")).from(T1).groupBy(T1.field("NoRisingDays"))).as("T2");
Result&amp;lt;?&amp;gt; result=context.select(max(T2.field("ContinuousDays"))).from(T2).fetch();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The same calculation logic is very simple in SPL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Stock.sort(TradeDate).group@i(Price&amp;lt;Price[-1]).max(~.len())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
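&lt;p&gt;For contrast, the flag-and-cumulative-sum detour taken by the SQL above can be replayed step by step in plain Java (a sketch over hypothetical data, not esProc code):&lt;/p&gt;

```java
import java.util.HashMap;
import java.util.Map;

public class SqlStyleStreak {
    // Replays the nested SQL as explicit steps: tag each day (1 = non-rise),
    // running-sum the tags into a group id, count each group's size, and take
    // the max. A group of n days contains n-1 consecutive rises.
    public static int longestRise(double[] prices) {
        Map<Integer, Integer> groupSizes = new HashMap<>();
        int groupId = 0, max = 0;
        for (int i = 0; i < prices.length; i++) {
            if (i == 0 || prices[i] <= prices[i - 1]) groupId++; // UpDownTag = 1
            int size = groupSizes.merge(groupId, 1, Integer::sum);
            if (size - 1 > max) max = size - 1;
        }
        return max;
    }
}
```

&lt;p&gt;Four conceptual passes are needed just to express what a single reset-on-fall counter does, which is exactly why the SQL reads so indirectly.&lt;/p&gt;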






&lt;p&gt;Convenient reading and writing of databases is the fundamental feature of ORM technology, and SPL certainly has no problem in this regard:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Hibernate:
    String hql ="from OrdersEntity where sellerId=10";
    Query query = session.createQuery(hql);
    List&amp;lt;OrdersEntity&amp;gt; orders = query.list();

    Transaction tx = session.beginTransaction();
    for ( int i=0; i&amp;lt;orders.size(); i++ ) {
        session.save(orders.get(i));
    }
    tx.commit();

JOOQ:
    java.sql.Connection conn = DriverManager.getConnection(url, userName, password);
    DSLContext context = DSL.using(conn, SQLDialect.MYSQL);
    Result&amp;lt;OrdersRecord&amp;gt; R1=context.select().from(ORDERS).fetchInto(ORDERS);

    R1.forEach(r-&amp;gt;{r.setValue(ORDERS.AMOUNT,r.getValue(ORDERS.AMOUNT).doubleValue()+100);}); R1.forEach(r-&amp;gt;{r.update();});

SPL:
    T=db.query("select * from test.Orders where sellerId=?",10)
    db.update(T,orders)
    db.commit()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These features are not significantly different, and SPL is still simpler due to its aforementioned language advantages.&lt;/p&gt;




&lt;p&gt;SPL itself has well-established process control statements, such as for loops and if branches, and also supports subroutine calls. SPL alone can implement very complex business logic, directly forming a complete business unit without cooperation from upper-level Java code; the main program simply calls the SPL script. This is equivalent to &lt;strong&gt;moving stored procedures out of the database and into Java&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;SPL is a pure Java program: it can be called by Java and can also call Java. In this way, code that is hard to write in SPL and better implemented in Java (such as certain external interfaces), or existing ready-made Java code, can be integrated into SPL. SPL scripts and the main Java application integrate seamlessly.&lt;/p&gt;

&lt;p&gt;As an interpreted language, SPL stores scripts as files that can be placed outside the main application. Code modifications can be made independently and take effect immediately, unlike ORM-based code, which must be recompiled with the main program after modification, requiring the entire application to be shut down and restarted. This achieves &lt;strong&gt;hot swapping of business logic&lt;/strong&gt;, especially suitable for frequently changing businesses.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dVdm-XbN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mu16cah8xvwoykz41uxd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dVdm-XbN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mu16cah8xvwoykz41uxd.png" alt="Image description" width="800" height="276"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The data sources supported by SPL are also very rich: whether a relational database, NoSQL, Kafka, or a RESTful service, and whether a regular two-dimensional table or multi-level JSON, SPL can compute and process them all. ORM technologies, by contrast, generally target only relational databases and cannot directly support other data sources.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OLi33Pic--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cerq7gcbvj65s3sevljq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OLi33Pic--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cerq7gcbvj65s3sevljq.png" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Distinctively, SPL code is written in a grid, quite unlike code typically written as text. Its independent development environment is simple and easy to use, providing single-step execution, breakpoint setting, and WYSIWYG result preview. Debugging and development are also more convenient than in Java.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--q0CwlkrO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/few1n34jdax5votll0py.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--q0CwlkrO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/few1n34jdax5votll0py.png" alt="Image description" width="768" height="408"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://c.scudata.com/article/1681713093064"&gt;A programming language coding in a grid&lt;/a&gt; is a more detailed introduction to SPL.&lt;/p&gt;

&lt;p&gt;Finally, esProc SPL is here: &lt;a href="https://github.com/SPLWare/esProc"&gt;https://github.com/SPLWare/esProc&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>orm</category>
      <category>java</category>
      <category>j2ee</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Here comes big data technology that rivals clusters on a single machine</title>
      <dc:creator>jbx1279</dc:creator>
      <pubDate>Sat, 23 Dec 2023 11:20:38 +0000</pubDate>
      <link>https://dev.to/jbx1279/here-comes-big-data-technology-that-rivals-clusters-on-a-single-machine-4ekc</link>
      <guid>https://dev.to/jbx1279/here-comes-big-data-technology-that-rivals-clusters-on-a-single-machine-4ekc</guid>
      <description>&lt;p&gt;Distributed data warehouses (such as MPP) in the era of big data are a popular technology, even to the point where data warehouses are referred to as distributed.&lt;/p&gt;

&lt;p&gt;But is distributed data warehousing really necessary? After all, these distributed data warehouse products are not cheap, with high procurement and maintenance costs. Is there a low-cost and lightweight solution?&lt;/p&gt;

&lt;p&gt;In fact, the amount of data involved in structured data computing tasks (the main job of data warehouses) is usually not very large. For example, a bank with tens of millions of accounts has a transaction volume of hundreds of millions of rows per year, roughly a few to several tens of gigabytes; the data an e-commerce system with millions of accounts can accumulate is on the same scale. Even the few top companies with truly huge data volumes still run a large number of tasks that involve only a small amount of data. Cases where a single computing task involves hundreds of gigabytes are rare, let alone the PB level claimed by many big data solution vendors.&lt;/p&gt;

&lt;p&gt;The same can be seen from another angle: the number of nodes in most distributed data warehouses is not large, often around ten or fewer. The computing centers of top enterprises may have thousands or even tens of thousands of nodes, but a single task uses only a few to a dozen of them. For example, the best-selling Medium standard warehouse in SnowFlake has only 4 nodes; this is the mainstream scale of distributed data warehousing. A PB-scale task, at 1 TB per node (usually several hours of processing), would require 1,000 nodes, which is obviously not the norm.&lt;/p&gt;

&lt;p&gt;In principle, a single database should easily handle data at the scale of tens of gigabytes, but in reality it often cannot: batch jobs running for hours and single queries taking minutes are common practice. As a result, users start thinking about distributed computing.&lt;/p&gt;

&lt;p&gt;Why is this?&lt;/p&gt;

&lt;p&gt;There are two reasons. On the one hand, although the data volume of these tasks is not large, they are considerably complex and often involve multiple associations. On the other hand, the SQL syntax used by databases cannot conveniently describe such complex operations, and the reluctantly written code cannot be optimized by the database, resulting in excessive computational work. In other words, &lt;strong&gt;SQL databases cannot fully utilize hardware resources&lt;/strong&gt; and can only hope for distributed expansion.&lt;/p&gt;

&lt;p&gt;esProc SPL can.&lt;/p&gt;

&lt;p&gt;esProc SPL is an open-source, lightweight computing engine, here: &lt;a href="https://github.com/SPLWare/esProc"&gt;https://github.com/SPLWare/esProc&lt;/a&gt;. Developed purely in Java, it can be seamlessly embedded into Java applications, providing a high-performance computing experience without a database.&lt;/p&gt;

&lt;p&gt;esProc SPL often outperforms MPP in terms of performance, with a single machine matching a cluster.&lt;/p&gt;

&lt;p&gt;In the star clustering task of the National Astronomical Observatory, the data scale is only about 50 million rows. A distributed database using 100 CPUs took 3.8 hours to process 5 million rows, and was estimated to need 15 days for the full 50 million (the algorithm has square complexity). esProc SPL ran the full 50 million rows on a single 16-CPU machine in under 3 hours.&lt;/p&gt;

&lt;p&gt;The batch job of a certain bank’s loan business ran for 4,300 seconds on a 10-node HIVE cluster with 1,300 lines of SQL; esProc SPL ran it in 1,700 seconds on a single machine with 34 lines of code.&lt;/p&gt;

&lt;p&gt;For a certain bank’s anti-money-laundering data preparation, Vertica ran for 1.5 hours on 11 nodes, while esProc SPL ran for 26 seconds on a single machine, unexpectedly turning the batch job into a query!&lt;/p&gt;

&lt;p&gt;In a certain e-commerce funnel analysis, SnowFlake’s Medium standard cluster (4 nodes) could not return a result within 3 minutes and the user gave up; esProc SPL completed it in 10 seconds on a single machine.&lt;/p&gt;

&lt;p&gt;A certain spatiotemporal collision task took 1,800 seconds on a 5-node ClickHouse cluster; after optimization with esProc SPL, it took 350 seconds on a single machine.&lt;/p&gt;

&lt;p&gt;…….&lt;/p&gt;

&lt;p&gt;These cases further illustrate that a large number of actual tasks run on clusters with only a few nodes, and almost all such scenarios can be handled by esProc on a single machine.&lt;/p&gt;

&lt;p&gt;How does esProc SPL achieve this?&lt;/p&gt;

&lt;p&gt;In engineering terms, esProc adopts the acceleration techniques commonly used by MPP products, such as compression, columnar storage, indexing, and vectorized computing. More importantly, esProc is no longer based on SQL; it uses its own programming language, SPL, which includes many high-performance storage mechanisms and algorithms that cannot be implemented on top of SQL theory:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1dC8wy0c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5s93ju4udhzx2j3qf4o8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1dC8wy0c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5s93ju4udhzx2j3qf4o8.png" alt="Image description" width="800" height="314"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With these foundations, it is easy to write code with lower computational complexity, effectively avoiding the excessive computation of SQL code, fully utilizing hardware resources, and achieving a single machine that matches a cluster.&lt;/p&gt;

&lt;p&gt;Regarding the performance advantages of esProc, &lt;a href="http://c.scudata.com/article/1644827578670"&gt;How the performance improvement by orders of magnitude happened&lt;/a&gt; has a popular explanation, &lt;a href="http://c.scudata.com/article/1642061412651"&gt;SPL: a database language featuring easy writing and fast running&lt;/a&gt;  explains in depth why SQL cannot write high-performance code.&lt;/p&gt;

&lt;p&gt;The figure above lists some of SPL’s high-performance technologies, and it shows that esProc also supports cluster computing. However, because esProc is so fast, practical tasks usually need only a single machine to achieve what the original cluster did. As a result, apart from some simple cluster scenarios for high concurrency and hot standby, esProc’s cluster computing capabilities have had little opportunity to be deeply honed and, to some extent, are not yet mature.&lt;/p&gt;

&lt;p&gt;For a specific example, the spatiotemporal collision problem mentioned earlier has a total data volume of about 25 billion rows, and the SQL does not look very complex:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WITH DT AS ( SELECT DISTINCT id, ROUND(tm/900)+1 as tn, loc FROM T WHERE tm&amp;lt;3*86400)
SELECT * FROM (
    SELECT B.id id, COUNT( DISTINCT B.tn ) cnt
    FROM DT AS A JOIN DT AS B ON A.loc=B.loc AND A.tn=B.tn
    WHERE A.id=a AND B.id&amp;lt;&amp;gt;a
GROUP BY id )
ORDER BY cnt DESC
LIMIT 20
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Traditional databases ran too slowly, so the users turned to ClickHouse. However, even on a 5-node cluster, it ran for more than 30 minutes without meeting expectations. With the same amount of data, the SPL code completes the calculation in under 6 minutes on just one node, exceeding user expectations. Accounting for the gap in hardware resources, SPL is equivalent to more than 25 times faster than ClickHouse.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt; 
 &lt;thead&gt; 
  &lt;tr&gt;
   &lt;th&gt; &lt;/th&gt;
   &lt;th&gt; A &lt;/th&gt;
  &lt;/tr&gt; 
 &lt;/thead&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt;
   &lt;td&gt; 1 &lt;/td&gt;
   &lt;td&gt; =now() &lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 2 &lt;/td&gt;
   &lt;td&gt; &amp;gt;NL=100000,NT=3*96 &lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 3 &lt;/td&gt;
   &lt;td&gt; =file("T.ctx").open() &lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 4 &lt;/td&gt;
   &lt;td&gt; =A3.cursor(tm,loc;id==a).fetch().align(NL*NT,(loc-1)*NT+tm\900+1) &lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 5 &lt;/td&gt;
   &lt;td&gt; =A3.cursor@mv(;id!=a &amp;amp;&amp;amp; A4((loc-1)*NT+tm\900+1)) &lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 6 &lt;/td&gt;
   &lt;td&gt; =A5.group@s(id;icount@o(tm\900):cnt).total(top(-20;cnt)) &lt;/td&gt;
  &lt;/tr&gt; 
  &lt;tr&gt;
   &lt;td&gt; 7 &lt;/td&gt;
   &lt;td&gt; =interval@ms(A1,now()) &lt;/td&gt;
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt;&lt;/div&gt; 

&lt;p&gt;(The SPL code is written in a grid, which is very different from ordinary programming languages. Please refer to here: &lt;a href="http://c.scudata.com/article/1681713093064"&gt;A programming language coding in a grid&lt;/a&gt;.)&lt;/p&gt;

&lt;p&gt;The DISTINCT calculation in SQL involves hashing and comparison, and when the data volume is large, so is the computational load. Worse, there is a self-join followed by a further COUNT(DISTINCT), which slows performance severely. SPL can fully exploit ordered grouping and sequence-number positioning, which SQL lacks, effectively avoiding the high complexity of the self-join and DISTINCT operations. Although it has no advantage in storage efficiency over ClickHouse, and Java may be slightly slower than C++, it still achieves an order-of-magnitude performance improvement.&lt;/p&gt;
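&lt;p&gt;The ordered-computation idea can be illustrated outside SPL too. In this minimal sketch (hypothetical data, not esProc code), once a sequence is sorted, counting its distinct values needs no hash table at all, only a comparison with the previous value:&lt;/p&gt;

```java
public class OrderedDistinct {
    // Count distinct values in an already-sorted array by counting value
    // changes: a single O(n) pass with no hash table, unlike the hashing
    // a generic DISTINCT over unordered data requires.
    public static int countDistinctSorted(int[] sorted) {
        if (sorted.length == 0) return 0;
        int cnt = 1;
        for (int i = 1; i < sorted.length; i++) {
            if (sorted[i] != sorted[i - 1]) cnt++;
        }
        return cnt;
    }
}
```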

&lt;p&gt;Running at 300 kilometers per hour does not necessarily require high-speed rail (a distributed MPP); a family sedan (esProc SPL) can do it too.&lt;/p&gt;

</description>
      <category>bigdata</category>
      <category>database</category>
      <category>performance</category>
      <category>sql</category>
    </item>
  </channel>
</rss>
