jbx1279

Posted on Oct 22, 2023

SPL computing performance test series: in-group accumulation

#performance #bigdata #database

1. Test task

This task originates from the counting of the age (storage time) of products in the ERP system of a large electrical appliance enterprise.

The original data consist of the following pair of tables that are in primary-sub relationship:

Store:

store	product	total
store_id_000001	1	700
store_id_000002	1	1346
store_id_000003	1	850

Inbound:

Field name	Data type	Field description
store	String	Store ID
product	Integer	Product ID
indate	Date	Inbound date
num	Integer	Inbound number

Now we are to calculate the age (the number of days from the inbound date to an end date) of products in the inbound within a time period (usually 90 days) before a specified end date:

Inbound_age:

Field name	Data type	Field description
store	String	Store ID
product	Integer	Product ID
indate	Date	Inbound date
num	Integer	Inbound number
age	Integer	Storage time

There are four details to know about the original data and computing process (take a 90-day time period as an example):

Products are shipped out in chronological order of receipt, that is, the earlier it is received, the earlier it will be shipped out;
When the total inbound number in the last 90 days is less than the actual stock, the age of the extra products is categorized as “over 90 days”;
When the total inbound number in the last 90 days exceeds the actual stock, it indicates part of products are already shipped out in the past 90 days and should be deleted, and their age is no longer calculated;
If the number of a certain product that is put into the store on a certain day is m, and the number of this product that is shipped out is n, then the number of the product should be recorded as m-n in the Inbound_age. Now let’s show the calculation details with actual data:

Store:

store	product	total
store_id_000001	1	700
store_id_000002	1	1346
store_id_000003	1	850

Inbound:

store	product	indate	num
store_id_000001	1	2016-09-07	100
store_id_000001	1	2016-08-13	200
store_id_000003	1	2016-09-23	300
store_id_000003	1	2016-09-15	200
store_id_000003	1	2016-08-25	400
store_id_000003	1	2016-08-07	100

The end date is set as 2016-9-30, and the time period is 90 days. The calculated inbound_age:

store	product	indate	num	age
store_id_000001	1	2016-09-07	100	23
store_id_000001	1	2016-08-13	200	48
store_id_000001	1	2016-07-01	400	91
store_id_000003	1	2016-09-23	300	7
store_id_000003	1	2016-09-15	200	15
store_id_000003	1	2016-08-25	350	36

In this table, there is a record whose indate is 2016-07-01, and the recorded age is 91, indicating that there are products with an age of over 90 days in current stock.

Let’s take a look at the calculation process of two groups of data:

For the product with store=store_id_000001, product=1, and the stock is 700, there are 2 inbound records:

Inbound date: 2016-09-07, age: 23 days, inbound number: 100; stock at that time: 700 - 100 = 600
Inbound date: 2016-08-13, age: 48 days, inbound number: 200; stock at that time: 600 - 200 = 400

The 400 products in the store were received before 2016-07-01, and their age is over 90 days.

For the product with store=store_id_000003, product=1, and the stock is 850, there are 4 inbound records:

Inbound date: 2016-09-23, age: 7 days, inbound number: 300; stock at that time: 850 - 300 = 550
Inbound date: 2016-09-15, age: 15 days, inbound number: 200; stock at that time: 550 - 200 = 350
Inbound date: 2016-08-25, age: 36 days, inbound number: 400, shipped number: 50, stock at that time: 350.

The 100 products received on 2016-08-07 are all shipped out, so their age is no longer calculated.

2. Technologies to be compared and test environment

We select two kinds of relational databases to make a comparison with SPL:

i. Oracle, which is often used as a benchmark in database performance test. Oracle 19 is to be tested.

ii). Starrocks, which is claimed to be a faster OLAP database. Starrocks 2.5.2 is to be tested.

SPL Community Edition (version 20230528) is tested.

Test environment: one physical server with the following configuration:

2 x Intel3014 CPUs, main frequency 1.7G, 12 cores in total, 64G memory, SSD (Solid State Drive)

To examine the sensitivity of these technologies to CPU and memory, we install two virtual machine environments on the physical server to conduct separate test. Virtual machine configurations:

VM1: 8 CPUs, 32G memory
VM2: 4 CPUs, 16G memory

For Starrocks, at least two nodes, BE and FE, need to be installed. The BE that undertakes computing task is installed on one VM, and the FE that undertakes management task is installed on the physical machine to avoid affecting the test results. For SPL and Oracle, they only need to be installed on VM.

3. Data preparation

Use generateData.splx to generate the store and inbound.

A   B   C   D   E
1   1000    100 1500        
2   200000000   20  60  10  100
3       =date(2015,10,1)        =date(2016,9,30)    
4                   
5   =create(store,product,total)
6   =file(“e:/store.ctx”).create@y(#store,#product,total)
7   =create(store,product,indate,num)
8   =file(“e:/inbound.ctx”).create@y(#store,#product,#indate,num)
9                   
10  func genOne (store,product)
11      =“store_id_”+mid(string(store+1000000),2)       
12      >A5.insert(0,B11,product,B1+rand(C1-B1))
13      =B2+rand(C2-B2) =to(1000).(rand(interval(B3,D3))).id@u().m(to(B13)).sort()
14      for B13 =elapse(B3,C13(B14))    =D2+rand(E2-D2) 
15          >A7.insert(0,B11,product,C14,D14)       
16          >C3 = C3+1      
17                  
18  for 1000000 for A1  =func(genOne,A18,B18)       
19          if (C3 >= A2)   >A6.append(A5.cursor()), A8.append(A7.cursor())
20              break A18   
21      >A6.append(A5.cursor()) >A5.reset()     
22      >A8.append(A7.cursor()) >A7.reset()     
23  >A6.close() >A8.close()         
24                  
25  =file(“e:/store.csv”).export@c(file(“e:/store.ctx”).open().cursor())
26  =file(“e:/inbound.csv”).export@c(file(“e:/inbound.ctx”).open().cursor())

The first three lines define the data scale and field’s data range of the two tables:

A1 defines the range of product IDs;

Current stock is random in the range of B1 ~ C1. A2 defines the number of records in inbound;

The number of inbound records of each product of each store is generated randomly between B2~C2, and the inbound number of each time is generated randomly between D2~E2;

Define the start and end dates. The inbound date is generated randomly between the start date and the end date (B3~D3);

A5 and A6 define the data structure of two tables. A6 and A8 define the file (composite table) for storing the data of two tables;

A10 defines a genOne function, and give the values of two primary key fields (store and product) to generate a primary table (store) record and multiple sub-table records associated with this record;

Line 18 performs a cross-multiplication and combination on three primary key fields, and then calls the genOne function to generate each group of primary-sub table data;

A25 and A26 convert two tables in composite table format (ctx) to text format (csv);

The actual scale of the generated test data is as follows:

Store: 5.06 million records, size of composite table: 33M, size of text file (csv): 123M;

Inbound: 200 million records. When the time period is 1 year, the size of composite table is 604M, and the size of text file (csv) is 6.8G;

When calculating the age of three months of data (2016-07-01~2016-09-30), the result set has 45 million records;

SPL calculates the age based on the composite table. For the data in text format, they will be imported into Oracle and Starrocks to calculates the age in SQL.

4. Test Oracle SQL

SQL

CREATE TABLE inbound_age AS
    WITH t1 AS (
        SELECT d.store,d.product,
            d.indate,
            d.num,
            SUM(d.num) OVER (PARTITION BY d.store,d.product ORDER BY d.indate DESC) AS accu_num,
            ROW_NUMBER()OVER(PARTITION BY d.store,d.product ORDER BY d.indate DESC) rn,
            c.total
        FROM inbound d JOIN store c ON d.store = c.store AND d.product = c.product
        WHERE d.indate>TO_DATE('2016-07-01','YYYY-MM-DD') AND d.indate<=TO_DATE('2016-09-30','YYYY-MM-DD')
    ),
    t11 AS (
        SELECT t1.* FROM
            t1 JOIN (
                SELECT store,product,MAX(rn) max_rn
                FROM t1 WHERE accu_num - num<total GROUP BY store,product
            )t2
            ON t1.store = t2.store AND t1.product = t2.product AND t1.rn = t2.max_rn
    ),
    t_result AS (
        SELECT t.*,
            CASE WHEN over90=0
                THEN ROUND(TO_NUMBER(TO_DATE('2016-09-30','YYYY-MM-DD') - indate))
                ELSE 91 END AS age
        FROM (
            SELECT store,product,indate,
                total-accu_num AS num,1 AS over90
            FROM t11 WHERE total>accu_num
            UNION
            SELECT store,product,indate,
                CASE WHEN total<accu_num
                THEN num-(accu_num-total) ELSE num END AS num,
                0 AS over90
            FROM t1 WHERE accu_num - num<total
        )t ORDER BY store,product,indate DESC
    )
    SELECT /*+ PARALLEL (t_result,8) */ * FROM t_result

This code employs the CREATETABLEASSELECT… statement to store the calculation result into inbound_age. The calculation process is somewhat complex, so the WITH subquery is used to define three intermediate tables. The final main query sets the parallel number as 8. Since VM2 installs a 4-core CPU, the parallel number should be changed to 4 during the test.

Do not store the calculation result
After calculating the age of three months of inbound data, the result set t_result contains 45 million records. Writing such a large result set back to hard disk would be time-consuming. In order to measure the time to write back the large result set, this SQL code is modified in a way that the last line only counts the records in the result set. In this way, it almost avoids the write-back action.

…
SELECT /*+ PARALLEL (t_result,8) */ COUNT(*) cnt FROM t_result

Full data calculation
The full inbound data refer to the data of a whole year, with a total of 200 million records. In the above test, only the data of the past 3 months (approximately 50 million records) are calculated, the calculated result set contains 45 million records.

If we change the time period to full year (2015-10-01 ~ 2016-09-30), all 200 million records will be included in the calculation, which will generate 77 million records. Change the date-related parameters in the SQL code:

…
WHERE d.indate>TO_DATE('2015-10-01','YYYY-MM-DD')AND d.indate<=TO_DATE('2016-09-30','YYYY-MM-DD')
…
ELSE 365 END AS age
…

This code indicates that the age is 1 year and above.

5. Test Starrocks SQL

SQL

CREATE TABLE inbound_age AS
    WITH t1 AS (
        SELECT d.store,d.product,
            d.indate,
            d.num,
            SUM(d.num) OVER (PARTITION BY d.store,d.product ORDER BY d.indate DESC) AS accu_num,
            ROW_NUMBER()OVER(PARTITION BY d.store,d.product ORDER BY d.indate DESC) rn,
            c.total
        FROM inbound d JOIN store c ON d.store = c.store AND d.product = c.product
        WHERE d.indate> '2016-07-01' AND d.indate<= '2016-09-30'
    ),
    t11 AS (
        SELECT t1.* FROM
            t1 JOIN (
                SELECT store,product,MAX(rn) max_rn
                FROM t1 WHERE accu_num - num<total GROUP BY store,product
            )t2
            ON t1.store = t2.store AND t1.product = t2.product AND t1.rn = t2.max_rn
    ),
    t_result AS (
        SELECT t.*,
            CASE WHEN over90=0
                THEN DATEDIFF('2016-09-30', indate)
                ELSE 91 END AS age
        FROM (
            SELECT store,product, indate,
                total-accu_num AS num,1 AS over90
            FROM t11 WHERE total>accu_num
            UNION
            SELECT store,product,indate,
                CASE WHEN total<accu_num
                THEN num-(accu_num-total) ELSE num END AS num,
                0 AS over90
            FROM t1 WHERE accu_num - num<total
        ) t ORDER BY store,product,indate DESC
    )
    SELECT * FROM t_result

There is no need for Starrocks to set the parallel number, as it will control automatically.

Do not store the calculation result
Like Oracle, modify the last line of the SQL code to only counting the records:

…
SELECT COUNT(*) cnt FROM t_result

Full data calculation
Modify the date-related parameter in the SQL code:

…
WHERE d.indate> '2015-10-01'AND d.indate<='2016-09-30'
…
ELSE 365 END AS age
…

This code indicates that the age is 1 year and above.

6. Test SPL

SPL script

A   B
1   =now()  
2   =date(2016,7,1) 
3   =date(2016,9,30)    
4   =file(“e:/store.ctx”).open().cursor@m(;;8)  
5   =file(“e:/inbound.ctx”).open().cursor(;indate>A2&&indate<=A3;A4)    
6   =A5.joinx@m(store:product,A4:store:product,total,0:age, 0:accu) 
7   =A6.group@o(store,product).(~.rvs())    
8   =A7.(~.run(cum(num;store,product):accu,num-if(accu>total,(accu-total),0):num))
9   =A8.(last=~.m(-1),~=if(last.accu>=last.total,~.select(num>0),~.derive@o().insert(0,last.store,last.product,A2,last.total-last.accu,last.total,0)))
10  =A9.(~.run(interval(indate,A3):age))    
11  =A8.conj().new(store,product,indate,num,age)    
12  fork A11    =file(“e:/age_”/#A12/“.btx”).export@b(A12)
13  =interval@ms(A1,now())

Store the data of two tables in order by the primary keys (store and product) to the composite table. A4/A5 loads the data of composite table as cursors, and the data loading and subsequent calculation are performed simultaneously. A4 specifies the parallel number as 8 (Since VM2 installs a 4-core CPU, the parallel number should be modified as 4). The cursor in A5 only loads the data of three months by means of filter condition, and segmentation is performed synchronously with the cursor in A4.

A6: perform an order-based merge join on the two tables to merge the total field of the store into the inbound.

A7: form the group to calculate age, and reverse the order in descending order by inbound date;

A8: calculate the in-group accumulated number, and the effective number of products that are still in store;

A9: remove the redundant inbound records and add the inbound records of over 90 days (set the inbound date as 2016-07-01, and calculated age is 91 days, representing the records of over 90 days), based on the comparison result of the accumulated number and the stock;

A10: calculate the age;

A11: aggregate the small result sets of each group using the conj()function, and then generate final large result set using the new() function.

Line 12 uses the parallel code block fork to save the parallel multi-cursor as multiple bin files, which will still be kept in order. The calculations before line 12 are all defined on cursor and will not be executed. A series of calculations are triggered only when the result is exported in line 12.

Do not store the calculation result
Like Oracle and Starrocks, if we don’t export the detailed data, and only count the number of records, just remove the fork code block in line 12 (A12/B12), and modify A12 to count the result cursor with skip():

    A
12  =A11.skip()

Full data calculation
Modify the start date of A2:

    A
2   =date(2015,10,1)

7. Test results

Unit: seconds

		VM1			VM2
		SPL	Oracle	Starrocks	SPL	Oracle	Starrocks
90-day data	Store result	10	181	92	22	252	Out of memory (OOM)
90-day data	Count only	9	92	63	15	167	OOM
Full data	Store result	30	451	OOM	48	800	OOM
Full data	Count only	23	295	OOM	43	601	OOM

8. Conclusions

SPL can adopt higher-performance algorithms to avoid association, while SQL can only use inefficient JOIN operation, and its performance depends heavily on the optimizer. For complex operations like the one in this test, the optimizer usually fails. For similar tests, SPL runs faster than Oracle SQL by more than 10 times.
The result set is relatively large, and it will consume more time to write it back to hard disk. Since SPL saves the result set as bin file, it is faster and has little impact on performance. In contrast, the performance of Oracle and Starrocks decreases significantly.
Starrocks, as a professional OLAP database, adopts the CPU optimization technology, and performs better than Oracle. However, Starrock still has to use inefficient algorithms, and hence its computing performance is still far behind SPL. Moreover, the CPU optimization technology relies heavily on in-memory computing, and cannot function normally when the memory capacity is small or the data amount is large.

Origin: https://blog.scudata.com/spl-computing-performance-test-series-in-group-accumulation/
SPL Source code: https://github.com/SPLWare/esProc

DEV Community

SPL computing performance test series: in-group accumulation

1. Test task

2. Technologies to be compared and test environment

3. Data preparation

4. Test Oracle SQL

5. Test Starrocks SQL

6. Test SPL

7. Test results

8. Conclusions

Top comments (0)

Read next

Turbocharging AWS Lambda: How to eliminate cold starts forever

Aurora Limitless - Sequences

Adding new columns - lowCalAlt_update5

Unidirectional associations for one-to-many