<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Judy</title>
    <description>The latest articles on DEV Community by Judy (@esproc_spl).</description>
    <link>https://dev.to/esproc_spl</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1191782%2Fdac3272f-c56a-41d9-914e-8f8fba86506b.jpg</url>
      <title>DEV Community: Judy</title>
      <link>https://dev.to/esproc_spl</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/esproc_spl"/>
    <language>en</language>
    <item>
      <title>SPL practice: solve space-time collision problem of trillion-scale calculations in only three minutes</title>
      <dc:creator>Judy</dc:creator>
      <pubDate>Mon, 30 Mar 2026 08:37:07 +0000</pubDate>
      <link>https://dev.to/esproc_spl/spl-practice-solve-space-time-collision-problem-of-trillion-scale-calculations-in-only-three-1ko5</link>
      <guid>https://dev.to/esproc_spl/spl-practice-solve-space-time-collision-problem-of-trillion-scale-calculations-in-only-three-1ko5</guid>
      <description>&lt;h2&gt;
  
  
  Problem description
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Definition of space-time collision&lt;/strong&gt;&lt;br&gt;
Dataset A contains the time and space information of n source objects A1, …, An, and each piece of information includes three attributes: ID (iA), location (lA) and time (tA). It can be assumed that the same Ai will not appear twice in A at the same time, or in other words, no two pieces of information have the same iA and tA. Dataset B, which has the same structure as A, contains m target objects B1, …, Bm (with the similar attributes iB, lB, tB) that are to be confirmed whether they collided with A. Likewise, it can be assumed that Bi will not appear twice in B at the same time.&lt;/p&gt;

&lt;p&gt;This article involves many set-oriented operations. Instead of using the term “record” to refer to a piece of information in a dataset, we use the set-related term “member”.&lt;/p&gt;

&lt;p&gt;Group the dataset A by iA to get n subsets, still named A1, …, An. Correspondingly, split the dataset B into m subsets B1, …, Bm. If ‘a’ belongs to the subset Ai and ‘b’ belongs to the subset Bj, and a.lA=b.lB and |a.tA-b.tB|&amp;lt;=1 minute (this time length can be changed), then we consider that ‘a’ collides with ‘b’, and that object Ai and object Bj collided once.&lt;/p&gt;

&lt;p&gt;Rule 1: The number of collisions of each Ai member is counted as once at most, which means that if a collides with b1 and b2, we consider that only one collision occurs between Ai and Bj.&lt;/p&gt;

&lt;p&gt;Rule 2: The member of Bj that once collided will no longer be identified as having had a collision. For example, if b collides with both a1 and a2, and assume a1.tA&amp;lt;a2.tA, then only the collision between a1 and b is identified as collision, and the collision between a2 and b is no longer identified as collision.&lt;/p&gt;

&lt;p&gt;Objective: find the top 10 objects Bj with the highest similarity for each Ai.&lt;/p&gt;

&lt;p&gt;The formula for calculating the similarity ‘r’ is: r(Ai, Bj)=E/U,&lt;br&gt;
where the numerator E refers to the total number of collisions between Ai and Bj calculated based on the above rules;&lt;/p&gt;

&lt;p&gt;The denominator U refers to the total number of members of Ai and Bj after deduplication, which can be calculated as |Ai|+|Bj|-E’, where E’ refers to the number of Bj members that collide with some Ai member.&lt;/p&gt;
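The rules and the formula can be sketched in a few lines of Python (a minimal illustration with made-up (time, location) members, not the article's actual SPL implementation):

```python
# Sketch of r(Ai, Bj) = E/U for one pair of subsets, following rules 1 and 2.
def similarity(Ai, Bj, window=60):
    matched_a = set()   # distinct Ai members identified as collided (rule 1)
    E_prime = 0         # E': number of collided Bj members
    for tB, lB in Bj:
        hits = sorted(tA for tA, lA in Ai
                      if lA == lB and window >= abs(tA - tB))
        if hits:
            E_prime += 1
            matched_a.add(hits[0])   # rule 2: the earliest a claims this b
    E = len(matched_a)               # numerator
    U = len(Ai) + len(Bj) - E_prime  # denominator: |Ai| + |Bj| - E'
    return E / U

Ai = [(100, 1), (200, 1)]
Bj = [(130, 1), (150, 1), (500, 2)]
print(similarity(Ai, Bj))   # both collided b's are claimed by a=100: E=1, U=3
```

Note how rule 2 matters: b=150 collides with both a=100 and a=200, but only the collision with the earlier a=100 is counted.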
&lt;h2&gt;
  
  
  Data structure and data scale
&lt;/h2&gt;

&lt;p&gt;Dataset A&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzt5ehzlwz8v691auyg4a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzt5ehzlwz8v691auyg4a.png" alt=" "&gt;&lt;/a&gt;&lt;br&gt;
Dataset B&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnh80lfg2efxc7q0yh349.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnh80lfg2efxc7q0yh349.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;tA and tB are accurate to the second, and the time span is 30 days.&lt;/p&gt;

&lt;p&gt;The total number of records of Dataset A is 21 million rows, with daily addition of approximately 700,000 rows.&lt;/p&gt;

&lt;p&gt;The total number of records of Dataset B is 15 million rows, with daily addition of approximately 500,000 rows.&lt;/p&gt;

&lt;p&gt;The scale of n (the number of Ai) is 2.5 million, and that of m (the number of Bj) is 1.5 million.&lt;/p&gt;

&lt;p&gt;The number of locations is 10; that is, lA and lB each take one of 10 possible values.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hardware environment and expectation&lt;/strong&gt;&lt;br&gt;
We hope to obtain the result within 15 minutes on a 40C128G server.&lt;/p&gt;

&lt;p&gt;The amount of data is not large and can be fully loaded into an in-memory database. However, the calculation process is complicated. Since it is difficult to work out in SQL alone, an external program (in Java or Python) is needed. As a result, the overall performance is poor: the task took more than two hours.&lt;/p&gt;
&lt;h2&gt;
  
  
  Problem analysis
&lt;/h2&gt;

&lt;p&gt;n is 2.5 million and m is 1.5 million. If we calculate the similarity of every pair of Ai and Bj according to the above definition, we have to calculate 2.5 million * 1.5 million = 3.75 trillion pairs. Even if each CPU could calculate the similarity of one pair of members in only one microsecond (in fact, such complex set-oriented calculations cannot be worked out that quickly), it would still take several hours in the current multi-CPU environment. Obviously, this brute-force traversal method is not feasible.&lt;/p&gt;

&lt;p&gt;For a pair of Ai and Bj, according to the similarity calculation formula, we know that if there is no collision between the members of Ai and Bj, then E=0, and the similarity is also 0, and hence there is no need to perform the subsequent TopN calculation.&lt;/p&gt;

&lt;p&gt;Assume that the data in A are evenly distributed, so the average number of members in each Ai is less than 10 (21 million/2.5 million). A Bj member that collides with an Ai member must satisfy |tA-tB|&amp;lt;=1 minute; if the data in B are also evenly distributed, there are approximately 350 (500,000/1440) B members per minute on average. The 10 members of Ai will therefore collide with at most 350*2*10=7000 B members (within one minute before and after each A member). The average number of members of Bj is also 10 (15 million/1.5 million), so distributing these 7000 B members into Bj leaves only 7000/10=700 Bj on average. In other words, on average only about 700 Bj have a similarity not equal to 0 with Ai, which is much smaller than the total number of Bj (1.5 million, a difference of over 2,000 times). Considering the condition lA=lB, if all objects are also evenly distributed across locations (which is unlikely), the average number of Bj with nonzero similarity can be further reduced by a factor of 10 (10 locations).&lt;/p&gt;
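These estimates can be checked with a few lines of arithmetic, using the rounded figures from the text:

```python
# Back-of-envelope check of the estimates above.
avg_Ai = 10                          # ~21 million rows / 2.5 million objects
b_per_min = 350                      # ~500,000 daily rows / 1440 minutes
candidates = b_per_min * 2 * avg_Ai  # 2-minute window per Ai member: 7000
avg_Bj = 10                          # ~15 million rows / 1.5 million objects
nonzero = candidates // avg_Bj       # ~700 Bj with nonzero similarity
ratio = 1_500_000 // nonzero         # over 2,000 times fewer than all Bj
print(candidates, nonzero, ratio)    # 7000 700 2142
```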

&lt;p&gt;Based on the above information, we design the following algorithms:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;For each Ai, find the set of members of B that may collide with each member of Ai based on the time and location conditions, and denote the set as B’:
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;B’=Ai.conj(B.select(tB&amp;gt;=tA-60 &amp;amp;&amp;amp; tB&amp;lt;=tA+60 &amp;amp;&amp;amp; lA==lB))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Note that during the calculation of B’, the corresponding B members will be filtered for each Ai member, so the members of both Ai and B may appear repeatedly in B’.&lt;/p&gt;
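Step 1 can be sketched in Python as a plain double loop (illustrative only; the SPL expression above is the article's actual formulation). Members are (time, location) pairs, and each result row keeps tA to mark which ‘a’ it matched:

```python
# Naive construction of B': for each member of Ai, collect the B members
# within the 1-minute window at the same location.
def build_B_prime(Ai, B, window=60):
    out = []
    for tA, lA in Ai:
        for tB, lB in B:
            if lA == lB and window >= abs(tA - tB):
                out.append((tA, tB))   # keep tA for the later steps
    return out   # members of both Ai and B may repeat here

print(build_B_prime([(100, 1), (200, 1)], [(130, 1), (150, 1), (150, 2)]))
# [(100, 130), (100, 150), (200, 150)]
```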

&lt;ol start="2"&gt;
&lt;li&gt;For each Ai, we need to filter B by tB to obtain B’. If B is sorted by tB in advance, binary search can be used to speed this up. Moreover, we need to add the tA attribute of Ai to facilitate subsequent calculations (since lA and lB are always the same and iA is a fixed value within Ai, there is no need to add these two attributes). The calculation of B’ can be changed as follows:
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;B’=Ai.conj(B.select@b(tB&amp;gt;=tA-60 &amp;amp;&amp;amp; tB&amp;lt;=tA+60).select(lA==lB).derive(Ai.tA))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
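Assuming B is pre-sorted by tB, the speedup of select@b can be imitated in Python with the bisect module: two binary searches bound the time window, and the location filter then runs over that small slice (a sketch under that assumption, not the article's code):

```python
import bisect

def b_prime_sorted(Ai, B_sorted, window=60):
    times = [tB for tB, _ in B_sorted]        # B_sorted is ordered by tB
    out = []
    for tA, lA in Ai:
        lo = bisect.bisect_left(times, tA - window)
        hi = bisect.bisect_right(times, tA + window)
        for tB, lB in B_sorted[lo:hi]:        # only the 2-minute window
            if lB == lA:
                out.append((tA, tB))          # derive tA, as in the SPL code
    return out
```

Sorting B once turns the per-member scan of 15 million rows into a handful of comparisons per lookup.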


&lt;p&gt;Group B’ by iB to get the subsets B’j. According to rule 2, within each B’j, only the member with the minimum tA is kept for every tB value; this determines the set composed of member pairs consisting of Ai members and B’j members that have collided (which is also the set of the member pairs consisting of Ai members and Bj members that have collided). In this set, the member of Ai is represented by the field tA, and the member of Bj by the field tB.&lt;/p&gt;

&lt;ol start="4"&gt;
&lt;li&gt;If A is sorted by iA and tA in advance, the members of Ai after grouping are also in order by tA; likewise, the members of B’ will be ordered by tA, and the members of each B’j will also be in order by tA. In this way, the member with the minimum tA in each grouped subset of B’j.group(tB) is always the first member. Therefore, the calculation of A’j can be simplified as:
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;A’j=B’j.group@1(tB)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol start="5"&gt;
&lt;li&gt;According to rule 1, the number of collisions of each Ai member is counted once at most; and based on the assumption made at the beginning of this article that the same Ai will not appear twice in A at the same time, we just need to deduplicate tA. That is, the numerator can be calculated as follows:
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;E=A’.icount(tA)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol start="6"&gt;
&lt;li&gt;In the formula U=|Ai|+|Bj|-E’ for calculating the denominator,&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A’j is the set composed of member pairs consisting of A members and B members, and these member pairs have been identified as having had a collision. Moreover, since the Bj members are already deduplicated (group@1(tB) means taking only one member from each grouped subset), the member of Bj will not appear repeatedly in A’j. Therefore, the members of A’j can correspond to the collided members of Bj one to one, and the equation E’=|A’j| holds.&lt;/p&gt;
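Steps 3 to 6 can be condensed into one Python sketch (hypothetical row layout: each B’ row is a tuple (iB, tB, tA) already ordered by tA, as step 4 assumes; sizeA is |Ai| and sizeB maps each iB to |Bj|):

```python
def score(B_prime, sizeA, sizeB):
    # group@1(iB, tB): keep only the first row of each (iB, tB) group; rows
    # arrive ordered by tA, so that first row carries the minimum tA (rule 2)
    first = {}
    for iB, tB, tA in B_prime:
        first.setdefault((iB, tB), tA)
    # split A'j by iB, then apply E = icount(tA) and U = |Ai| + |Bj| - E'
    groups = {}
    for (iB, tB), tA in first.items():
        groups.setdefault(iB, []).append(tA)
    result = {}
    for iB, tAs in groups.items():
        E = len(set(tAs))                # icount(tA): rule 1 deduplication
        E_prime = len(tAs)               # E' = |A'j|: collided Bj members
        U = sizeA + sizeB[iB] - E_prime
        result[iB] = E / U
    return result

print(score([(7, 130, 100), (7, 150, 100), (7, 150, 200)], 2, {7: 3}))
```

Here both B rows of object 7 are claimed by the Ai member with tA=100, so E=1, E’=2 and r=1/3.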

&lt;ol start="7"&gt;
&lt;li&gt;|Ai| and |Bj| can be calculated and saved in advance by grouping A and B by iA and iB respectively. Since the amount of data is not large, the results can all be stored in memory.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Finally, according to the formula E/U, we can get the similarity ‘r’ between Ai and Bj. What remains is the common task of calculating TopN for the similarity results.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Further optimization&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;8. In step 1 above, searching the full data of dataset B for each ‘a’ still requires a certain amount of computation even with binary search (2*log2(15 million) is about 50 comparisons). Since neither the number of minutes nor the number of locations is large (a 30-day span, 1440 minutes per day and 10 locations give only around 430,000 combinations), which can easily be held in memory, we can use an aligned sequence to locate candidates directly.&lt;/p&gt;

&lt;p&gt;When computing, let&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;G=B.align@a(30*1440,tB\60+1).(~.align@a(10,lB))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Because there are two dimensions, time and location, we use a two-layer aligned sequence. First, divide B into 43200 (30*1440) groups by the minute sequence number tB\60+1, and then divide each sub-group’s members into 10 groups by the location sequence number lB. For the tA of a certain member in Ai, we can use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;G’=G.m(tA\60)(lA) | G(tA\60+1)(lA) | G.m(tA\60+2)(lA)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to quickly and roughly screen out a small superset of B’. The time difference between tB and tA of any member in the superset is no more than 2 minutes (the difference between the minute sequence numbers of tA and tB is not greater than 1), thereby filtering out a large number of B members that cannot possibly collide on the time dimension. Since the small superset may still contain a few members whose difference between tB and tA is greater than 1 minute, we then use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;G’.select@b(tB&amp;gt;=tA-60 &amp;amp;&amp;amp; tB&amp;lt;=tA+60)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to screen out an exact B’, which can significantly reduce the computing amount.&lt;/p&gt;
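The two-layer aligned sequence can be imitated in Python with nested lists indexed by minute and location (toy sizes MINUTES and PLACES are assumptions for the example; the article uses 30*1440 minutes and 10 locations):

```python
MINUTES, PLACES = 5, 3   # toy sizes; the article uses 30*1440 and 10

def build_G(B):
    # two-layer aligned sequence: outer index = minute number, inner = location
    G = [[[] for _ in range(PLACES)] for _ in range(MINUTES)]
    for tB, lB in B:
        G[tB // 60][lB].append((tB, lB))
    return G

def candidates(G, tA, lA, window=60):
    m = tA // 60
    rough = []                            # superset: at most 2 minutes away
    for mm in (m - 1, m, m + 1):          # the three adjacent minute buckets
        if mm >= 0 and MINUTES > mm:
            rough.extend(G[mm][lA])
    # exact filter, as in the final select step of the article
    return [(tB, lB) for tB, lB in rough if window >= abs(tA - tB)]

G = build_G([(10, 0), (70, 0), (130, 0), (75, 1)])
print(candidates(G, 65, 0))   # [(10, 0), (70, 0)]
```

Locating candidates becomes pure array indexing with no search at all; only the small rough superset is filtered exactly.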

&lt;ol start="9"&gt;
&lt;li&gt;In addition, when calculating U, we need to look up the corresponding |Ai| by iA. If iA is a continuous integer sequence number, we can fetch it directly by position and avoid the search, that is:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nA=A.groups@n(iA;count(1)).(#2)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, |Ai|=nA(iA). B can be processed in the same way.&lt;/p&gt;
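Step 9 amounts to replacing a lookup table with positional access, as in this sketch (assuming iA is a contiguous sequence number starting at 1):

```python
# Per-object counts stored in an array so that |Ai| is fetched by position.
def counts_by_id(rows, n):
    nA = [0] * (n + 1)        # index 0 unused; nA[iA] holds |Ai|
    for iA, _tA in rows:
        nA[iA] += 1
    return nA

A = [(1, 100), (1, 160), (2, 90)]
nA = counts_by_id(A, 2)
print(nA[1], nA[2])   # 2 1
```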

&lt;h2&gt;
  
  
  Practice process
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Prepare the test data&lt;/strong&gt;&lt;br&gt;
We directly prepare data that are already converted to sequence numbers. Assuming a time span of 30 days and 10 enumerated locations, the script to simulate the data is as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdv4pjbgjow0a6tmeluam.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdv4pjbgjow0a6tmeluam.png" alt=" "&gt;&lt;/a&gt;&lt;br&gt;
In A1, K represents the number of days, nA represents the total data amount of dataset A, nB represents the total data amount of dataset B, t represents the number of seconds of 30 days, and LN represents the enumeration number of locations.&lt;/p&gt;

&lt;p&gt;In A2 and A3, the composite tables A.ctx and B.ctx are created, and the randomly generated data of datasets A and B are exported to the two composite tables respectively.&lt;/p&gt;

&lt;p&gt;tA (tB) refers to the number of seconds elapsed from the starting time point. For example, if the starting time point is 2023-08-23 00:00:00, then the value corresponding to the time point 2023-08-23 00:01:01 is 61.&lt;/p&gt;

&lt;p&gt;In A2, the @p option is used to create the composite table, indicating that the first field ‘iA’ is used as the segmentation key. During parallel computing, the composite table needs to be segmented. Since the records of the same ‘iA’ cannot be assigned to two segments, we use the @p option to ensure this during the segmentation of composite table.&lt;/p&gt;

&lt;p&gt;Special attention should be paid to the different sort orders when saving A and B: A is sorted by iA and tA (step 4 of ‘Problem analysis’), while B is sorted by tB and iB (step 2 of ‘Problem analysis’). In this way, we can read the ordered data directly in subsequent operations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Computing script&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw4lzfamwlm6qh9d035lf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw4lzfamwlm6qh9d035lf.png" alt=" "&gt;&lt;/a&gt;&lt;br&gt;
A1: K=30 represents the number of days of the time span to be counted; LN=10 represents the enumeration number of locations;&lt;/p&gt;

&lt;p&gt;A4, A5: correspond to step 7 of ‘Problem analysis’. Group the members of datasets A and B by ID respectively, count the number of members of each Ai and Bj, and store the results as sequences;&lt;/p&gt;

&lt;p&gt;A6: corresponds to step 8 of ‘Problem analysis’. Align and group the members of dataset B by minute, and then align and group the sub-group members by location to build the aligned sequence mentioned above;&lt;/p&gt;

&lt;p&gt;A8: corresponds to steps 1, 2 and 8 of ‘Problem analysis’. Divide dataset A into groups of Ai by iA, and loop through each Ai to obtain the corresponding B’; here we use ‘news’ instead of ‘conj’, which eliminates the derive action and obtains the same result. In addition, we also add iA to facilitate the subsequent lookup of |Ai|;&lt;/p&gt;

&lt;p&gt;A9: corresponds to steps 3 and 4 of ‘Problem analysis’. The method of calculating A’j:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;B’.group(iB).group@1(tB)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;can be simplified as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;B’.group@1(iB,tB)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A10: corresponds to steps 6 and 7 of ‘Problem analysis’. Deduplicating and counting tA over A’j gives the numerator E, which is equivalent to step 5 of ‘Problem analysis’. Adding the previously calculated |Ai| and |Bj| and then subtracting the number of records in the current group (i.e. E’) gives the denominator U. Finally, calculate the top 10 records based on the similarity ‘r’.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Convert to sequence number and restore&lt;/strong&gt;&lt;br&gt;
Convert the ID, location and time to sequence numbers (for time, this means calculating the number of seconds elapsed from the start time). The data structure after conversion is as follows:&lt;/p&gt;

&lt;p&gt;Dataset A&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0pkbpiatfzf8by703jem.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0pkbpiatfzf8by703jem.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Dataset B&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2tqiq1mypc9qqzsohkgq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2tqiq1mypc9qqzsohkgq.png" alt=" "&gt;&lt;/a&gt;&lt;br&gt;
The above code assumes that the values of all fields are already converted to the data structure shown above. Therefore, in practice, we need to perform data conversion and organization first, and then restore the data after calculation. For details, refer to the method described in SPL Practice: integerization during data dump. Since that method is not the focus of this article, we won’t describe it again here.&lt;/p&gt;

&lt;h2&gt;
  
  
  Actual effect
&lt;/h2&gt;

&lt;p&gt;When the total time span is 30 days (the data volume of dataset A is 21 million rows, and that of dataset B is 15 million rows), computing in SPL on a single machine (8C64G) takes 161 seconds, including exporting all results to a CSV file.&lt;/p&gt;

&lt;p&gt;In fact, achieving this performance requires a few of the column-wise computing options of SPL Enterprise Edition. Since these options don’t affect the principle analysis, we do not describe them in detail in this article.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Postscript&lt;/strong&gt;&lt;br&gt;
This article discusses a typical object counting problem, which generally has the following characteristics:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Count the number of objects that satisfy a certain condition.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The number of objects is very large, but the amount of data involved in each object is not large.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The condition is very complex, usually also related to the order, and requires some steps to determine.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Normally, solving such problems requires sorting the data by object. However, since the amount of data involved in this task is small, storage optimization becomes unimportant. The key to solving this task is powerful set-oriented computing ability, especially the ability to compute over ordered sets. For example, the data type should support sets of sets so that grouped subsets can be retained without having to aggregate them as SQL does. Moreover, two-layer aligned sequences should be supported, allowing us to access the members of a set by position, and ordered grouping functionality should be provided.&lt;/p&gt;

</description>
      <category>productivity</category>
      <category>devops</category>
      <category>opensource</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Ditch 10,000 Intermediate Tables—Compute Outside the Database with Open-Source SPL</title>
      <dc:creator>Judy</dc:creator>
      <pubDate>Fri, 13 Feb 2026 08:02:44 +0000</pubDate>
      <link>https://dev.to/esproc_spl/ditch-10000-intermediate-tables-compute-outside-the-database-with-open-source-spl-4j1k</link>
      <guid>https://dev.to/esproc_spl/ditch-10000-intermediate-tables-compute-outside-the-database-with-open-source-spl-4j1k</guid>
      <description>&lt;p&gt;Intermediate tables are data tables in databases specifically used to store intermediate results generated from processing the original data – which is why they are so named. They are summary tables usually created for speeding up or facilitating the front-end queries and analysis. For some large organizations, years of accumulation results in tens of thousands of intermediate tables, which is an incredible number, in their databases, bringing great trouble to database operation and usage.&lt;/p&gt;

&lt;p&gt;The large number of intermediate tables occupies too much database storage space, putting enormous pressure on storage capacity and increasing the demand for capacity expansion. But database space is expensive and capacity expansion is exceedingly costly; moreover, there are often restrictions on expansion. Paying an arm and a leg to store intermediate tables is a bad choice also because too many of them reduce database performance. Intermediate tables are not created out of thin air: they are generated from the original data through a series of computations that consume database computing resources. Sometimes a lot of intermediate tables are produced during a computation. This consumes a large amount of resources and, in serious cases, can slow down queries and transactions.&lt;/p&gt;

&lt;p&gt;Why are there so many intermediate tables? Below are the main reasons:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. More than one step is needed to get the final result&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The original data table needs to undergo complicated computations before being displayed in a report. It is hard to accomplish this with one SQL statement; instead, multiple consecutive SQL statements are used, where one statement generates an intermediate result that will be used by the next.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Long wait time in real-time computations&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For data-intensive and compute-intensive tasks, the wait time will be extremely long. So, report developers choose to run batch tasks at night and store results in intermediate tables. It is much faster to perform queries based on the intermediate tables.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Diverse data sources in a computation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Files, NoSQL databases and web services have almost no computing ability. Data originating from them needs to be computed using the database’s computing ability. Particularly for a mixed computation between such data and data stored in the database, the traditional approach is to load the external data into the database and store it as intermediate tables.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Intermediate tables are hard to get rid of&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As databases use a flat structure to arrange tables, it is very likely that an intermediate table is shared by multiple queries after it is created. Deleting it when one query is finished could affect other queries. Worse still, you cannot know exactly which applications are using the intermediate table. This makes deletion impossible, not because you do not want to get rid of it, but because you dare not. The consequence is that tens of thousands of intermediate tables accumulate in the database over time.&lt;/p&gt;

&lt;p&gt;But why do we use databases to store intermediate data? Judging from the above causes, the direct aim of storing intermediate data in the database as intermediate tables is to employ the database’s computing ability. The intermediate data will be further computed for subsequent use, and sometimes the computation is rather complicated. At present only databases (which are SQL-driven) offer relatively convenient computing ability. Although other storage formats like files have their own merits (high I/O performance, compressibility, and easy parallel processing), they have no computing ability. If you try to perform computations based on files, you need to hardcode them in applications, which is far less convenient than using SQL. So making use of databases’ computing abilities is the essential reason for the existence of intermediate tables.&lt;/p&gt;

&lt;p&gt;In some sense intermediate data is necessary. But consuming a huge amount of database resources just to obtain computing ability is obviously not a good strategy. If we can give files equal computing ability and store intermediate data in the file system outside the database, then the problems related to intermediate tables will be solved and the database will be relieved of the overload.&lt;/p&gt;

&lt;p&gt;The open-source SPL can help to turn it into reality.&lt;/p&gt;

&lt;p&gt;SPL is an open-source structured data computation engine. It can process data directly based on files, giving files computing ability. It is database-independent, offers specialized structured data objects and a wealth of class libraries for handling them, possesses all-around computational capability, and supports procedural control, which makes it convenient to implement complex computing logic. All these features qualify SPL to replace databases in handling intermediate data and subsequent data processing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2uyu0kl9g9kwewbq0vvr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2uyu0kl9g9kwewbq0vvr.png" alt=" " width="800" height="389"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  SPL file-based computations
&lt;/h2&gt;

&lt;p&gt;SPL can perform computations directly on files such as CSV and Excel, as well as multilevel data like JSON and XML. It is convenient to read and handle them in SPL. We can store intermediate data in one of these file formats and handle it in SPL. Below are some basic computations:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F61n0jtodog3k4tmlc2l2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F61n0jtodog3k4tmlc2l2.png" alt=" " width="800" height="335"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On top of its native syntax, SPL also offers support for the SQL92 standard, allowing programmers familiar with SQL to query files directly in SQL.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$select * from d:/Orders.csv where Client in ('TAS','KBRO','PNS')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Complicated queries, such as joins over subqueries, are also supported:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$select t.Client, t.s, ct.Name, ct.address from
(select Client ,sum(amount) s from d:/Orders.csv group by Client) t
left join ClientTable ct on t.Client=ct.Client
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;SPL has an edge in handling multilevel data like JSON and XML. For instance, to perform computations on orders data in JSON format:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy3pohvqwg9bpkyjzav32.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy3pohvqwg9bpkyjzav32.png" alt=" " width="800" height="341"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The SPL implementation is concise compared with that in other JSON libraries (like JSONPath).&lt;/p&gt;

&lt;p&gt;SPL also allows users to query JSON data directly in SQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$select * from {json(file("/data/EO.json").read())}
where Amount&amp;gt;=100 and Client like 'bro' or OrderDate is null
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;SPL is particularly suitable for handling complex computing logic thanks to its agile syntax and procedural control. For instance, to count the longest consecutive days of rise in a stock’s price based on stock records in txt format, the SPL code is as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0b4vreozgu23v2zpm4aw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0b4vreozgu23v2zpm4aw.png" alt=" " width="800" height="117"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One more example: to list the latest login interval for each user according to user login records in CSV format:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr0fw3aqpbd6hdl9r8s0f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr0fw3aqpbd6hdl9r8s0f.png" alt=" " width="800" height="174"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Such computing tasks are hard to code even in SQL inside a database, yet they become easy in SPL.&lt;/p&gt;

&lt;p&gt;The outside-database computing ability SPL supplies is an effective solution to the problems caused by an excess of intermediate tables in databases. Storing intermediate data in files releases database storage, reduces the need for database expansion and makes database management more convenient. Outside-database computations do not occupy database computing resources, so the unburdened database can better serve other transactions.&lt;/p&gt;

&lt;h2&gt;
  
  
  High-performance file formats
&lt;/h2&gt;

&lt;p&gt;Text files are a commonly used data storage format. They are general-purpose and easy to read, but their performance is extremely poor; traditionally, it is hard for text-based computations to achieve satisfactory performance.&lt;/p&gt;

&lt;p&gt;Text characters cannot be computed directly. They must first be parsed into in-memory data types such as integers, real numbers, dates and strings. Text parsing, however, is complicated and takes exceptionally long CPU time. While hard disk reading usually accounts for most of the time spent accessing external storage, the performance bottleneck of text files typically lies in the CPU-bound parsing phase. Because parsing is so complicated, the CPU time can even exceed the hard disk reading time (especially on a high-performance SSD). Text files are therefore rarely used for big data processing when high performance is demanded.&lt;/p&gt;
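The parsing cost can be made concrete with a small Python sketch (purely illustrative, not SPL code): text requires a character-level parse per value, while a binary encoding is decoded in bulk with no parsing at all:

```python
import struct

values = list(range(1000))

# Text form: every value must be parsed from characters back into an integer
text = ",".join(str(v) for v in values)
parsed_from_text = [int(s) for s in text.split(",")]

# Binary form: fixed-width integers are decoded in a single call,
# with no character-level parsing at all
binary = struct.pack("1000i", *values)
parsed_from_binary = list(struct.unpack("1000i", binary))

assert parsed_from_text == parsed_from_binary == values
```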

&lt;p&gt;SPL provides two high-performance binary storage formats – bin file and composite table. A bin file uses a binary format: it is compressed (to occupy less space and allow fast retrieval), stores data types (to enable faster reading without parsing), and supports the double increment segmentation technique for dividing an append-able file, which facilitates parallel processing and further increases computing performance.&lt;/p&gt;
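The double increment segmentation idea can be sketched as follows (a simplified model for illustration, not esProc’s actual file layout): keep a fixed-capacity index of segment offsets over an append-only file; whenever the index fills up, drop every other entry and double the recording step, so the growing file can always be split into roughly equal segments:

```python
class SegmentIndex:
    """Simplified double-increment segmentation over an append-only
    sequence of records (an illustrative model only)."""

    def __init__(self, capacity=8):
        self.capacity = capacity   # fixed number of index slots
        self.step = 1              # record an offset every `step` appends
        self.offsets = []          # starting record numbers of the segments
        self.count = 0             # total records appended so far

    def append(self, n=1):
        for _ in range(n):
            if self.count % self.step == 0:
                if len(self.offsets) == self.capacity:
                    # Index full: keep every other entry, double the step
                    self.offsets = self.offsets[::2]
                    self.step *= 2
                if self.count % self.step == 0:
                    self.offsets.append(self.count)
            self.count += 1

idx = SegmentIndex(capacity=8)
idx.append(1000)
# The index never outgrows its fixed capacity, yet still covers the whole file
assert 8 >= len(idx.offsets) and idx.offsets[0] == 0
```

Because the segment boundaries stay roughly evenly spaced however much data is appended, each thread in a parallel scan can be handed one segment of comparable size.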

&lt;p&gt;The composite table is an SPL file storage format that provides column-wise storage and an indexing mechanism. It shows a great advantage in scenarios where only a small number of columns (fields) is involved. A composite table is equipped with a min-max index and supports the double increment segmentation technique, letting computations both enjoy the advantages of column-wise storage and be parallelized more easily for better performance.&lt;/p&gt;

&lt;p&gt;The two high-performance file formats are convenient to use, and have basically the same uses as text files. To read a bin file and compute it, for instance:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb2kl96innunss4o0fshs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb2kl96innunss4o0fshs.png" alt=" " width="800" height="169"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When the data to be processed is large, SPL can use cursors to perform batch retrieval and multi-CPU parallel processing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;=file("/data/scores.btx").cursor@bm()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When storing data in files, whatever format the original data uses, it pays to convert it at least to a binary format (such as bin file) to gain advantages in both space usage and computing performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ease of management
&lt;/h2&gt;

&lt;p&gt;Moving intermediate data out of the database and into the file system not only reduces database workload but also makes the data far easier to manage. Files can be stored in the operating system’s tree-structured directories, which makes them convenient to use and manage. Placing the intermediate tables used by different systems and modules in separate directories keeps things neat and tidy. This eliminates shared references and thus the long-standing tight coupling between systems and modules caused by messy use of intermediate tables in the database. Intermediate tables can now be safely deleted, without side effects, once the corresponding modules are no longer used.&lt;/p&gt;

&lt;h2&gt;
  
  
  Support for diverse data sources
&lt;/h2&gt;

&lt;p&gt;In addition to the file sources, SPL can connect to and retrieve data from dozens of other data sources as well as perform mixed computations between different sources.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F565k2pkzys0djyikeuof.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F565k2pkzys0djyikeuof.png" alt=" " width="800" height="348"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After intermediate data is moved into files, full-data queries that span the files and the database holding real-time data become cross-data-source computations. Such T+0 queries are convenient to implement in SPL:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwvqhw9i6ua0ne2mwr55o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwvqhw9i6ua0ne2mwr55o.png" alt=" " width="800" height="326"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Ease of integration
&lt;/h2&gt;

&lt;p&gt;SPL provides standard JDBC and ODBC drivers for invocation by applications. For a Java program, SPL code can also be integrated as an embedded computing engine, giving the program the ability to handle intermediate data.&lt;/p&gt;

&lt;p&gt;Sample of invoking SPL code through JDBC:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;…
Class.forName("com.esproc.jdbc.InternalDriver");
Connection conn = DriverManager.getConnection("jdbc:esproc:local://");
CallableStatement st = conn.prepareCall("{call splscript(?, ?)}");
st.setObject(1, 3000);
st.setObject(2, 5000);
ResultSet result = st.executeQuery();
…
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;SPL is interpreted, so it naturally supports hot swapping. Computing logic written in SPL – including modifications made during operation and maintenance – takes effect in real time without restarting the application, making development, operation and maintenance convenient and efficient.&lt;/p&gt;

&lt;p&gt;With the outside-database computing capability SPL offers, we can move intermediate tables into files, ridding databases of their clutter. This relieves the database of its overload and makes it faster, more flexible and more scalable.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>database</category>
      <category>dataengineering</category>
      <category>opensource</category>
    </item>
    <item>
      <title>The Game-Changer Breaking Data Lake's Impossible Triangle</title>
      <dc:creator>Judy</dc:creator>
      <pubDate>Wed, 04 Feb 2026 08:11:55 +0000</pubDate>
      <link>https://dev.to/esproc_spl/the-game-changer-breaking-data-lakes-impossible-triangle-c1c</link>
      <guid>https://dev.to/esproc_spl/the-game-changer-breaking-data-lakes-impossible-triangle-c1c</guid>
      <description>&lt;h2&gt;
  
  
  A brief introduction to data lake
&lt;/h2&gt;

&lt;p&gt;Let’s start with the data warehouse. A data warehouse is a subject-oriented data management system that aggregates data from different business systems and is intended for data query and analysis. As data expands and the number of business systems grows, data warehousing becomes necessary. To meet business requirements, the raw data must be cleansed, transformed and deeply prepared before being loaded into the warehouse. Answering existing business questions is the data warehouse’s core task – and those questions must already be defined.&lt;/p&gt;

&lt;p&gt;But what about business questions that have not yet been defined – the potential data value? Under the data warehouse’s rule, a business question is asked first and then a model is built for it, so the chain of identifying, raising and answering questions becomes very long. Moreover, since the warehouse stores highly prepared data, a new question that requires fine data granularity forces it back to processing the raw data, which is extremely cost-ineffective. If many such new questions arise, the query process becomes overburdened.&lt;/p&gt;

&lt;p&gt;Against this background, the data lake was born. It is a technology (or strategy) intended to store and analyze massive amounts of raw data. It loads as much raw data as possible into the lake, stores it with the highest possible fidelity and, in theory, allows any potential value to be extracted from the full data. Two roles of the data lake follow immediately. One is data storage, because the lake must keep all raw data. The other is data analysis – from a technical point of view, data computing, the value extraction process.&lt;/p&gt;

&lt;p&gt;Let’s look at the data lake’s performance in the two aspects.&lt;/p&gt;

&lt;p&gt;The data lake stores full raw data – structured, semi-structured and unstructured – in its original state. The capacity to store massive and diverse data is thus the data lake’s essential feature, distinguishing it from the data warehouse, which typically uses databases to store structured data. Besides, loading data into the lake as early as possible helps fully extract value from associations between differently themed data and ensures data security and integrity.&lt;/p&gt;

&lt;p&gt;The good news is that, thanks to great advances in storage and cloud technologies, the need to store massive raw data can be fully met. Enterprises can choose a self-built storage cluster or a cloud vendor’s storage service according to their business demands.&lt;/p&gt;

&lt;p&gt;But the toughest nut to crack is data processing! The data lake stores various types of data, and each needs to be processed differently; structured data processing is the central and most complicated part. Whether for historical data or newly generated business data, processing focuses mainly on structured data, and on many occasions computations on semi-structured and unstructured data are eventually transformed into structured data computations.&lt;/p&gt;

&lt;p&gt;At present, SQL-based databases and related technologies – the same abilities data warehouses have – dominate the structured data processing field. In other words, the data lake depends on data warehouses (databases) to compute structured data. That is what nearly all data lake products do: build the lake to store all raw data, then build a data warehouse on top to add the processing capability the business needs. As a result, data in the lake must be loaded into the warehouse again through ETL. A more advanced approach automates the process to some degree: it identifies the lake data that needs to go into the warehouse and performs the loading while the system is idle. This is the main functionality behind the currently hot concept of Lakehouse. But however data is loaded into the warehouse (including the extremely inefficient method of letting the warehouse access the lake through external tables), today’s data lake is made up of three components – massive data storage, a data warehouse and a specialized engine (for, say, unstructured data processing).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fovuu467hwrtowwef9i6c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fovuu467hwrtowwef9i6c.png" alt=" " width="800" height="490"&gt;&lt;/a&gt;&lt;br&gt;
There are problems about this type of data lake framework.&lt;/p&gt;

&lt;h2&gt;
  
  
  The impossible triangle
&lt;/h2&gt;

&lt;p&gt;Data lakes are expected to meet three key requirements – storing data in its original state (loading high-fidelity data into the lake), sufficient computing capacity (extracting the maximum possible data value) and cost-effective development (for obvious reasons). The current technology stack, however, cannot achieve all three at the same time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkn7omgyef54uc86s3wx9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkn7omgyef54uc86s3wx9.png" alt=" " width="800" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Storing data as it is was the initial purpose of building the data lake, because keeping the original data unaltered helps extract the maximum value from it. The simplest way to achieve this is for the lake to use exactly the same storage media as the sources: a MySQL instance to hold data originally stored in MySQL, a MongoDB instance for data initially stored in MongoDB, and so on. This loads data into the lake with the highest possible fidelity and reuses each source’s computing ability. Though computations across data sources remain hard, this is enough for computations within a single source, meeting the basic requirement of sufficient computing power (as part i in the figure above shows).&lt;/p&gt;

&lt;p&gt;But the disadvantage is noticeable – development is too expensive. Users need to deploy the same storage media and copy into them all the data sources accumulated over years; the workload is enormous. If a data source runs on commercial software, purchasing that software pushes the cost up further. A relief strategy is to use a storage medium of the same general type, such as storing Oracle data in MySQL, but while the costs stay high it also brings a side effect: some computations that used to be possible become impossible or hard to achieve.&lt;/p&gt;

&lt;p&gt;Now, let’s lower the bar. Instead of demanding that each source be duplicated at loading, we simply store all the data in one database. This gains the database’s computing ability and meets the requirement of cheap development at the same time (as part ii in the figure above shows). But the approach is fragile, since it depends heavily on a single relational database into which all data must be loaded.&lt;/p&gt;

&lt;p&gt;Information is easily lost during such loading, which violates the first requirement of building a data lake (loading high-fidelity data into the lake). Storing MongoDB data in MySQL or Hive, for instance, is hard. Many MongoDB data types and relationships between collections do not exist in MySQL – set-like types such as nested structures, arrays and hashes, and many-to-many relationships. They cannot simply be copied over during migration; instead, the data structures must be reorganized beforehand. That requires a series of sophisticated reorganization steps and costs a lot of people and time to sort out the business target and design an appropriate form for the target data. Skip this work and information is lost, and errors then surface in the subsequent analysis – sometimes hidden too deeply to be easily visible.&lt;/p&gt;

&lt;p&gt;A general approach is to load data unaltered into large files (or into large fields in the database). The information loss is then within an acceptable range and the data remains basically intact. File storage has many advantages: it is more flexible, more open and has higher I/O efficiency. In particular, storing data in files (or a file system) is cheaper.&lt;/p&gt;

&lt;p&gt;Yet the problem with file storage is that files and large fields have no computing capacity of their own, so the requirement of convenient, sufficient computing power cannot be met. The impossible triangle seems too strong to break.&lt;/p&gt;

&lt;p&gt;No approach resolves the conflict between storing data in its initial state and using it conveniently. Under the requirement of cost-effective lake building (loading data into the lake fast), high-fidelity loading and convenient, sufficient computing power are mutually exclusive. This goes against the data lake’s goal of openness.&lt;/p&gt;

&lt;p&gt;The underlying cause of the conflict is the closedness of databases and their strict constraints. A database requires that data be loaded into it before computation, and the data must satisfy the database’s constraints before it can be loaded. To conform to these rules, data has to be cleansed and transformed, and information is lost in the process. Abandoning databases for other routes (such as files) fails the demand for sufficient computing power – unless you turn to hardcoding, which is too complicated and nowhere near as convenient as a database.&lt;/p&gt;

&lt;p&gt;Actually, an open computing engine can become the breaker of the impossible triangle. Such an engine possessing sufficient and convenient computing power can compute the raw data, including data stored in diverse data sources, in real time.&lt;/p&gt;

&lt;h2&gt;
  
  
  SPL – the open data lake computing engine
&lt;/h2&gt;

&lt;p&gt;The open-source SPL is a structured data computing engine that provides open computing power for data lakes. Its capability for mixed computations across diverse sources lets it compute raw data stored in different sources directly, in its original state. Whichever storage media the data lake uses – the same types as the data sources, or files – SPL can compute the data directly and perform the transformation step by step, making lake building easier.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv27spstzlvqd2ogewv7b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv27spstzlvqd2ogewv7b.png" alt=" " width="800" height="430"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Open and all-around computing power
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Diverse-source mixed computing ability&lt;/strong&gt;&lt;br&gt;
SPL supports various data sources – RDB, NoSQL, JSON/XML, CSV, WebService, etc. – and mixed computations between them. Any type of raw data stored in the data lake can thus be used directly, and its value extracted, without a separate loading or preparation step. Such flexible, efficient use of data is precisely one of the goals of data lakes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0dcfh22zz5a6o1j1yhx7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0dcfh22zz5a6o1j1yhx7.png" alt=" " width="800" height="348"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Being agile like this, the data lake will be able to provide data services to applications as soon as it is established rather than after the prolonged cycle of data preparation, loading and modeling. The more flexible data lake service enables real time response to business needs.&lt;/p&gt;

&lt;p&gt;In particular, SPL’s good support for files gives them powerful computing ability. Storing lake data in a file system thus obtains computing power nearly as good as – or even greater than – a database’s. This adds computing capacity on top of part iii and makes the originally impossible triangle feasible.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F77jc8t019aezj5806cal.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F77jc8t019aezj5806cal.png" alt=" " width="800" height="547"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Besides text files, SPL handles hierarchical data such as JSON naturally, so data stored in NoSQL databases and behind RESTful services can be used directly, without transformation. It’s really convenient.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnqmcx0heiuyugmgcsxtg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnqmcx0heiuyugmgcsxtg.png" alt=" " width="800" height="338"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;All-around computing capacity&lt;/strong&gt;&lt;br&gt;
SPL has all-around computational capability. The discrete dataset model it is based on (instead of relational algebra) arms it with a set of computing abilities as complete as SQL’s. Moreover, with agile syntax and procedural programming ability, data processing in SPL is simpler and more convenient than in SQL.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuyzo1l8hae2lpktxxpty.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuyzo1l8hae2lpktxxpty.png" alt=" " width="800" height="362"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;SPL boasts a wealth of class libraries for computations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Accessing source data directly&lt;/strong&gt;&lt;br&gt;
SPL’s open computing power extends beyond data lake. Generally, if the target data isn’t synchronized from the source to the lake but is needed right now, we have no choice but to wait for the completion of synchronization. Now with SPL, we can access the data source directly to perform computations, or perform mixed computations between the data source and the existing data in the lake. Logically, the data source can be treated as part of the data lake to engage in the computation so that higher flexibility can be achieved.&lt;/p&gt;

&lt;h2&gt;
  
  
  High-performance computations after data transformation
&lt;/h2&gt;

&lt;p&gt;With SPL joining the stack, the data warehouse becomes optional. SPL has all-around, remarkable computing power and offers high-performance file storage formats. ETLing the raw data into SPL’s storage formats achieves higher performance, and the file system brings further advantages such as flexible use and easy parallel processing.&lt;/p&gt;

&lt;p&gt;SPL provides two high-performance storage formats – bin file and composite table. A bin file is compressed (to occupy less space and allow fast retrieval), stores data types (to enable faster reading without parsing), and supports the double increment segmentation technique for dividing an append-able file, which facilitates parallel processing and further increases computing performance. The composite table uses column-wise storage, giving it a great advantage in scenarios where only a small number of columns (fields) is involved. It is also equipped with a min-max index and supports the double increment segmentation technique, letting computations both enjoy the advantages of column-wise storage and be parallelized more easily for better performance.&lt;/p&gt;

&lt;p&gt;It is easy to implement parallel processing in SPL and fully exploit multiple CPUs. Many SPL functions – file retrieval, filtering, sorting and more – support parallel processing: simply adding the @m option enables automatic multithreading. SPL also supports writing explicitly parallel programs to enhance computing performance.&lt;/p&gt;
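The effect of @m-style parallelism can be sketched in Python as a conceptual analogue (this is not SPL’s implementation): split the data into segments, aggregate each segment concurrently, then merge the partial results:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_sum(data, workers=4):
    """Segment the data, sum each segment in its own thread, then combine."""
    size = len(data)
    # Segment boundaries: worker i handles data[bounds[i][0]:bounds[i][1]]
    bounds = [(i * size // workers, (i + 1) * size // workers)
              for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(lambda b: sum(data[b[0]:b[1]]), bounds)
    return sum(partials)

data = list(range(100000))
assert parallel_sum(data) == sum(data)
```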

&lt;p&gt;In addition, SPL supports a variety of high-performance algorithms that SQL cannot express – the common TopN operation, for example. SPL treats TopN as a kind of aggregate operation, transforming a high-complexity full sort into a low-complexity aggregation while widening its range of application.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz5y7qxkvdn7rep7drn0j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz5y7qxkvdn7rep7drn0j.png" alt=" " width="800" height="192"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The SPL statements involve no sort-related keywords and never trigger a full sort. The statement for getting the top N of a whole set and that for grouped subsets are basically the same, and both perform well. SPL boasts many more such high-performance algorithms.&lt;/p&gt;
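The idea of computing TopN as an aggregation instead of a sort can be sketched with a bounded heap in Python (illustrative only; SPL’s internals are not shown): a single pass keeps just the N best values, so the cost is O(n log N) rather than the O(n log n) of a full sort:

```python
import heapq

def top_n(values, n):
    """One-pass TopN as an aggregation: maintain a min-heap of at most
    n elements and never sort the full input."""
    heap = []
    for v in values:
        if n > len(heap):
            heapq.heappush(heap, v)
        elif v > heap[0]:
            # v beats the smallest of the current top n; swap it in
            heapq.heapreplace(heap, v)
    return sorted(heap, reverse=True)

assert top_n([5, 17, 3, 99, 42, 8, 64, 23], 3) == [99, 64, 42]
```

Because the aggregation state is just the small heap, the same routine also works per group for grouped TopN, without ever materializing a sorted set.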

&lt;p&gt;Assisted by all these mechanisms, SPL can achieve performance orders of magnitude higher than traditional data warehouses. With storage and computation after data transformation both solved, a data warehouse is no longer a data lake necessity.&lt;/p&gt;

&lt;p&gt;Furthermore, SPL can perform mixed computations directly on and between transformed data and raw data, exploiting the value of each type of data source rather than preparing all data in advance. This creates highly agile data lakes.&lt;/p&gt;

&lt;p&gt;SPL allows the phases of lake building – loading, transformation and computation – to proceed side by side, where conventionally they could only run one after another. Data preparation and computation can be carried out concurrently, and any raw, irregular data can be computed directly. Handling computation and transformation in parallel rather than in serial order is the key to building an ideal data lake.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Lakehouse? More Like a Lake + Warehouse Parking Lot</title>
      <dc:creator>Judy</dc:creator>
      <pubDate>Thu, 22 Jan 2026 06:50:56 +0000</pubDate>
      <link>https://dev.to/esproc_spl/lakehouse-more-like-a-lake-warehouse-parking-lot-4hfg</link>
      <guid>https://dev.to/esproc_spl/lakehouse-more-like-a-lake-warehouse-parking-lot-4hfg</guid>
      <description>&lt;p&gt;From all-in-one machine, hyper-convergence, cloud computing to HTAP, we constantly try to combine multiple application scenarios together and attempt to solve this type of problem through one technology so as to achieve the goal of simple and efficient use. Lakehouse, which is very hot nowadays, is exactly such a technology; its goal is to integrate the data lake with the data warehouse to give play to their respective value at the same time.&lt;/p&gt;

&lt;p&gt;The data lake and the data warehouse have always been closely related, yet there are significant differences between them. The data lake pays more attention to retaining original information; its primary goal is to store the raw data “as is”. But raw data contains a lot of junk. Does storing it “as is” mean all the junk data ends up in the lake too? Yes – the data lake is like a junk yard where everything is stored, useful or not. The first problem the data lake faces is therefore the storage of massive (junk) data.&lt;/p&gt;

&lt;p&gt;Benefiting from the considerable progress of modern storage technology, the cost of storing massive data has dropped dramatically; a distributed file system, for example, can fully meet a data lake’s storage needs. But the ability to store data alone is not enough – computing ability is required to bring out the value. A data lake stores various types of data, each processed differently, and structured data processing matters most: whether for historical data or newly generated business data, processing focuses mainly on structured data, and computations on semi-structured and unstructured data are often eventually transformed into structured data computations. Unfortunately, since the data lake’s storage layer (a file system) has no computing ability of its own, data cannot be processed directly on the lake; other technologies (such as a data warehouse) must be used. The data lake’s main problem is being “capable of storing, but incapable of computing”.&lt;/p&gt;

&lt;p&gt;For the data warehouse, it is just the opposite. A data warehouse is based on the SQL system and usually has a powerful ability to compute structured data. However, raw data can be loaded into the warehouse only after it has been cleansed, transformed and deeply organized to meet the database’s constraints. In this process a large amount of original information is lost, and the data granularity may become coarser, so the value hidden in finer-grained data can no longer be obtained. Moreover, the data warehouse is highly subject-oriented and serves only one or a few subjects. Since data outside those subjects is not its target, the range of usable data is relatively narrow; the warehouse cannot explore the value of full, unknown data as a data lake does, let alone store massive raw data like a data lake. Compared with the data lake, the data warehouse is “capable of computing, but incapable of storing”.&lt;/p&gt;

&lt;p&gt;From the point of view of data flow, the warehouse’s data can be organized on top of the data lake, so a natural idea is to integrate the data lake with the data warehouse to become “capable of both storing and computing”: the so-called “Lakehouse”.&lt;/p&gt;

&lt;p&gt;So, how is it actually implemented today?&lt;/p&gt;

&lt;p&gt;The current method is oversimplified and crude: open data access rights on the data lake so that the data warehouse can access the data in “real time” (real time relative to the original ETL process that periodically moves data from the lake to the warehouse; in practice there is still some delay). Physically, the data is still stored in two places, and data interaction is performed over a high-speed network. Because it gains a certain ability to process the lake’s data in real time, this result (mostly at the architecture level) is now called a Lakehouse.&lt;/p&gt;

&lt;p&gt;That’s it? Is that a Lakehouse in the true sense?&lt;/p&gt;

&lt;p&gt;Well, as the saying goes: as long as the one claiming it is a Lakehouse doesn’t feel embarrassed, it is everyone who knows what a Lakehouse should be that ends up embarrassed for them.&lt;/p&gt;

&lt;p&gt;Then, how does the data warehouse read the data of the data lake? A common practice is to create an external table/schema in the warehouse that maps an RDB’s table or schema, or Hive’s metastore, just as a traditional RDB accesses external data through external tables. Although metadata information is retained, the disadvantages are obvious. It requires that the lake’s data can be mapped to tables and schemas under the relational model, and the data still has to be organized before it can be computed. The types of usable data sources also shrink (for example, NoSQL, text and Webservice sources cannot be mapped directly). Furthermore, even when other computable sources (such as an RDB) exist in the lake, the warehouse usually has to move the data to its local storage for computations such as grouping and aggregation, resulting in high data transmission cost, a performance drop, and many other problems.&lt;/p&gt;

&lt;p&gt;In the current Lakehouse, besides the “real-time” data interaction, the original channel for periodically organizing data in batches is still retained, so organized lake data can be stored into the warehouse for local computing. Of course, this has little to do with the Lakehouse itself, because it was done the same way before the “integration”.&lt;/p&gt;

&lt;p&gt;Either way, whether data is moved from lake to warehouse through traditional ETL or mapped externally in real time, both the data lake and the data warehouse change little (only the data transmission frequency improves, and even that requires many conditions to be met). Physically, the data is still stored in two places. The lake is still the original lake, the warehouse is still the original warehouse, and they are not essentially integrated! Consequently, not only are the data diversity and efficiency problems left unsolved (lack of flexibility), but the lake’s “junk” data still has to be organized and loaded into the warehouse before computing (poor real-time performance). Claiming that a “Lakehouse” implemented this way builds a real-time, efficient data processing ability on the data lake is, I’m afraid, a joke.&lt;/p&gt;

&lt;p&gt;Why?&lt;/p&gt;

&lt;p&gt;If we think about it a little, we will find that the problem lies in the data warehouse. The database system is too closed: data (including externally mapped data) must be loaded into the database before it can be computed, and because of database constraints, it must first be deeply organized to conform to the norms, while the lake’s raw data contains a lot of “junk”. Organizing this data is reasonable, but it makes responding to the lake’s real-time computing needs difficult. If the database were open enough to compute directly on the lake’s unorganized data, or even to perform mixed computations across a variety of data source types, while providing a high-performance mechanism to ensure computing efficiency, then implementing a real Lakehouse would be easy. Unfortunately, the database cannot achieve this.&lt;/p&gt;

&lt;p&gt;Fortunately, esProc SPL can.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SPL - an open computing engine - helps implement a real Lakehouse&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The open-source SPL is a structured data computing engine that provides open computing power for the data lake. It can compute directly on the lake’s raw data, with no constraints and no need for a database to hold the data. Moreover, SPL offers mixed computing across diverse data sources: whether the lake is built on a unified file system or on diverse sources (RDB, NoSQL, LocalFile, Webservice), SPL can perform mixed computation on them directly, so the lake’s value can be produced quickly. Furthermore, SPL provides high-performance file storage (the storage function of a data warehouse). Data can be organized unhurriedly while calculations are already running in SPL, and loading the raw data into SPL’s storage yields higher performance. Note in particular that data organized into SPL storage still resides in the file system and, in principle, can be stored in the same place as the data lake. In this way, a real Lakehouse can be implemented.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0wvwlud5aj37vdv58tcg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0wvwlud5aj37vdv58tcg.png" alt=" " width="800" height="470"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the whole architecture, SPL can perform unified storage and calculation directly on the data lake, connect to the diverse data sources inside it, and even read external production data sources directly. With these abilities, real-time calculation on the data lake can be implemented; in scenarios demanding high data timeliness (where data must be used before it lands in the lake), SPL can connect to the real-time source directly, achieving even higher timeliness.&lt;/p&gt;

&lt;p&gt;The original way of moving data from the lake to the warehouse can still be retained: ETLing raw data into SPL’s high-performance storage achieves higher computing performance. Meanwhile, using the file system for storage allows the data to be distributed on SPL servers, or we can keep using the lake’s unified file storage; that is, the work of the original data warehouse is completely taken over by SPL. As a result, the Lakehouse is implemented in one system.&lt;/p&gt;

&lt;p&gt;Let's take a look at these abilities of SPL.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open and all-around computing power
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Diverse-source mixed computing ability&lt;/strong&gt;&lt;br&gt;
SPL supports various data sources, including RDB, NoSQL, JSON/XML, CSV, Webservice, etc., and can perform mixed computation across different sources. This enables direct use of any type of raw data stored in the lake, unlocking its value without transforming it; the step of “loading into the database” is omitted. Flexible and efficient use of data is thus ensured, and a wider range of business requirements is covered.&lt;/p&gt;
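To make the idea of mixed computation concrete without SPL syntax, here is a minimal Python sketch (all data and field names are hypothetical) that joins a CSV source with a JSON source in one computation:

```python
import csv, json, io
from collections import defaultdict

# Hypothetical sample data standing in for two lake sources:
# a CSV file of orders and a JSON document of customers.
orders_csv = """order_id,customer_id,amount
1,C1,120.0
2,C2,80.0
3,C1,45.5
"""
customers_json = '[{"id": "C1", "region": "East"}, {"id": "C2", "region": "West"}]'

# Read both sources into in-memory records, each in its native format.
orders = list(csv.DictReader(io.StringIO(orders_csv)))
region_of = {c["id"]: c["region"] for c in json.loads(customers_json)}

# Mixed computation: total order amount per customer region,
# with neither source loaded into a database first.
totals = defaultdict(float)
for o in orders:
    totals[region_of[o["customer_id"]]] += float(o["amount"])

print(dict(totals))  # {'East': 165.5, 'West': 80.0}
```

The point is that each source is consumed as-is; no prior “loading into the database” step is needed.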

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbut3krbyg9tmc6gvyvzw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbut3krbyg9tmc6gvyvzw.png" alt=" " width="800" height="349"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With this ability, the data lake can provide data services for applications as soon as it is established, rather than after a prolonged cycle of data preparation, loading and modeling. Moreover, an SPL-based data lake is more flexible and can respond in real time to business needs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Supporting file computing&lt;/strong&gt;&lt;br&gt;
Particularly, SPL’s good support for files gives them powerful computing ability: lake data stored in a file system gains computing power nearly as good as, or even greater than, a database’s. Besides text files, SPL can handle hierarchical formats like JSON, so data from NoSQL stores and RESTful services can be used directly without transformation. It’s really convenient.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzntcgftagmd6c3opf6kr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzntcgftagmd6c3opf6kr.png" alt=" " width="800" height="334"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;All-around computing capacity&lt;/strong&gt;&lt;br&gt;
SPL provides all-around computational capability. The discrete dataset model it is based on (instead of relational algebra) arms it with a set of computing abilities as complete as SQL’s. Moreover, with agile syntax and procedural programming ability, data processing in SPL is simpler and more convenient than in SQL.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuqjh7anfgy5zzxhmmbh2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuqjh7anfgy5zzxhmmbh2.png" alt=" " width="800" height="362"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Rich computing library of SPL&lt;/p&gt;

&lt;p&gt;This gives the data lake the full computing ability of a data warehouse, achieving the first step of integrating the data lake with the data warehouse.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Accessing source data directly&lt;/strong&gt;&lt;br&gt;
SPL’s open computing power extends beyond the data lake. Normally, if target data has not yet been synchronized from its source into the lake but is needed right now, we have no choice but to wait for the synchronization to complete. With SPL, we can access the data source directly to perform computations, or perform mixed computations between the source and the data already in the lake. Logically, the source can be treated as part of the lake and take part in the computation, which brings higher flexibility.&lt;/p&gt;

&lt;h2&gt;
  
  
  High-performance computations after data organization
&lt;/h2&gt;

&lt;p&gt;In addition to its all-around and powerful computing abilities, SPL provides file-based high-performance storage. ETLing raw data into SPL storage achieves higher performance, and the file system brings a series of advantages such as flexibility of use and ease of parallel processing. Gaining this data storage ability amounts to the second step of integrating the data lake with the data warehouse: a new, open and flexible data warehouse is formed.&lt;/p&gt;

&lt;p&gt;Currently, SPL provides two high-performance file storage formats: the bin file and the composite table. The bin file adopts compression (faster reading due to smaller space occupation), stores data types (faster reading as types need not be parsed), and supports the double increment segmentation mechanism for appending data; since segmentation makes parallel computing easy to implement, computing performance is ensured. The composite table supports columnar storage, which has a great advantage in scenarios where only a small number of columns (fields) are involved. In addition, the composite table implements the min-max index and supports double increment segmentation, so it not only enjoys the advantages of columnar storage but also makes parallel computing easier, further improving performance.&lt;/p&gt;
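The advantage columnar storage brings when a query touches few fields can be sketched generically; this toy Python illustration (not SPL’s actual file format) contrasts row and column layouts:

```python
# Toy illustration of why columnar layout helps when a query
# touches few fields. Records and field names are made up.
rows = [
    {"id": 1, "amount": 10.0, "region": "E", "note": "long free text..."},
    {"id": 2, "amount": 20.0, "region": "W", "note": "long free text..."},
    {"id": 3, "amount": 30.0, "region": "E", "note": "long free text..."},
]

# Row storage: every query walks whole records, including unused fields.
row_total = sum(r["amount"] for r in rows)

# Columnar storage: each field is a separate contiguous array,
# so summing 'amount' never touches 'region' or 'note' at all.
columns = {k: [r[k] for r in rows] for k in rows[0]}
col_total = sum(columns["amount"])

print(row_total, col_total)  # 60.0 60.0
```

The same answer is computed either way; the saving is in how much data must be scanned, which on disk translates directly into less I/O.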

&lt;p&gt;Furthermore, parallel computing is easy to implement in SPL, fully exploiting multiple CPUs. Many SPL functions, such as file retrieval, filtering and sorting, support parallel processing: simply adding the @m option implements multithreaded processing automatically. Writing parallel programs explicitly to enhance computing performance is supported as well.&lt;/p&gt;

&lt;p&gt;In particular, SPL supports a variety of high-performance algorithms that SQL cannot express. For example, SPL treats the common TopN operation as an aggregation, transforming a high-complexity sorting operation into a low-complexity aggregation while also extending its range of application.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe06zzhwu5zp0d1vwr03d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe06zzhwu5zp0d1vwr03d.png" alt=" " width="800" height="176"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In these statements there are no sort-related keywords, and no full sorting is triggered. The statements for getting the top N from a whole set and from grouped subsets are basically the same, and both achieve higher performance. SPL offers many more such high-performance algorithms.&lt;/p&gt;
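The idea of treating TopN as an aggregation, keeping a small bounded set as values stream by instead of sorting everything, can be sketched with Python’s heapq (a generic analogue, not SPL’s implementation):

```python
import heapq, random

random.seed(7)
values = [random.randint(0, 10_000) for _ in range(100_000)]

# TopN as an aggregation: heapq.nlargest maintains a small k-element
# heap while scanning once -- O(n log k), with no full O(n log n) sort.
top3 = heapq.nlargest(3, values)

# A full sort gives the same answer at higher cost.
assert top3 == sorted(values, reverse=True)[:3]
print(top3)
```

Because the running state is just a tiny heap, the same aggregation can also be applied per group, which is why whole-set and grouped TopN look alike.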

&lt;p&gt;Relying on these mechanisms, SPL can achieve performance surpassing that of a traditional data warehouse, often by orders of magnitude; the full implementation of a Lakehouse on the data lake is thus realized by effective mechanisms, not just words.&lt;/p&gt;

&lt;p&gt;Furthermore, SPL can perform mixed computations on transformed data and raw data together, giving full play to the value of all types of data without preparing data in advance. In this way, the flexibility of the data lake is fully extended while it also gains the function of a real-time data warehouse. This achieves the third step of integrating the data lake with the data warehouse, taking into account both flexibility and high performance.&lt;/p&gt;

&lt;p&gt;Through the above three steps, the path to building a data lake is improved (the original path required loading and transforming data before computing): data preparation and computation can proceed at the same time, and the lake is built step by step. Moreover, while the lake is being built, the data warehouse is perfected, giving the data lake powerful computing ability. This is the correct way to implement a real Lakehouse.&lt;br&gt;
SPL is now open-source. You can obtain the source code from &lt;a href="https://github.com/SPLWare/esProc" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>data</category>
      <category>database</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Are Wide Tables Fast or Slow?</title>
      <dc:creator>Judy</dc:creator>
      <pubDate>Mon, 19 Jan 2026 07:54:58 +0000</pubDate>
      <link>https://dev.to/esproc_spl/are-wide-tables-fast-or-slow-3ba1</link>
      <guid>https://dev.to/esproc_spl/are-wide-tables-fast-or-slow-3ba1</guid>
      <description>&lt;p&gt;Wide tables are usually a standard component of a BI system; many BI projects prepare wide tables at the very beginning of construction. A wide table is formed by joining multiple tables that have a certain association relationship. The result set does not conform to the normal forms, and it contains a large amount of redundant data. Moreover, since wide tables must be pre-created, they are not very flexible to use.&lt;/p&gt;

&lt;p&gt;But why do people very much prefer wide tables even if they have many shortcomings?&lt;/p&gt;

&lt;p&gt;Because wide tables are &lt;strong&gt;FAST&lt;/strong&gt;!&lt;/p&gt;

&lt;p&gt;Querying a wide table is usually faster than performing a real-time multi-table join, so wide tables are built to avoid joins. Join operations are a long-standing problem in SQL: they are difficult to write and have poor performance. Find a detailed analysis of SQL joins &lt;a href="https://c.esproc.com/article/1653353923359" rel="noopener noreferrer"&gt;HERE&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;However, though wide tables avoid joins, the redundancy means much extra data may be read during computation, increasing I/O time. For example, suppose each order in an Orders table corresponds to 5 records in the OrderDetails table. Stretching the two into a wide table repeats each Orders row 5 times. What’s more, Orders has dimension tables such as Customer and Employee, Customer in turn has dimension tables such as Region, and so on; when all of them are flattened into one wide table, the entire data volume is multiplied many times. A query on that wide table, such as summing order amounts by customer region, retrieves a large volume of data and incurs huge I/O overhead.&lt;/p&gt;
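A back-of-the-envelope calculation makes the blow-up concrete (row and field counts below are hypothetical, chosen only to illustrate the Orders/OrderDetails example above):

```python
# Hypothetical sizes for the Orders / OrderDetails example.
orders = 1_000_000          # rows in Orders
details_per_order = 5       # OrderDetails rows per order
order_fields = 10           # fields carried by each Orders row
detail_fields = 4           # fields carried by each OrderDetails row

# Normalized layout: each Orders row stored once, plus the detail rows.
normalized_cells = (orders * order_fields
                    + orders * details_per_order * detail_fields)

# Wide table: every order's fields are repeated on all 5 detail rows.
wide_cells = orders * details_per_order * (order_fields + detail_fields)

ratio = wide_cells / normalized_cells
print(ratio)  # ~2.33x more data to scan, before any dimension tables
```

Joining in Customer, Employee and Region columns inflates the wide rows further, so the real-world ratio is typically larger than this two-table estimate.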

&lt;p&gt;By this analysis, wide tables should be slower. Why are they faster in real-world practice? Because relational database joins are so slow that even with the wide table’s I/O cost multiplied several times, querying it still beats the real-time joins.&lt;/p&gt;

&lt;p&gt;If we could optimize the join to run faster, could we get satisfactory performance while avoiding the wide table’s problems of redundant data, errors from non-normalized result sets, and inflexibility?&lt;/p&gt;

&lt;p&gt;The answer is yes. But it is a pity that &lt;strong&gt;SQL cannot do that&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The linked document already analyzes the SQL join in detail. In a nutshell, the JOIN defined by the Cartesian product is indeed very simple, and that simple definition is broad enough to cover all JOIN scenarios. But such a general definition makes targeted optimization of different kinds of joins impossible; people can only devise ad-hoc engineering workarounds rather than solve the problem fundamentally.&lt;/p&gt;

&lt;p&gt;Here is another fact. Since the debut of the database, vendors have stretched optimization of the simple SQL used in BI analysis to its limit, and yet wide tables are still needed to solve the performance problem. Evidently, the performance problem of join operations is hard to deal with in SQL; we might even say impossible.&lt;/p&gt;

&lt;p&gt;Is there anything we can do?&lt;/p&gt;

&lt;p&gt;We can use SPL to tackle the problem.&lt;/p&gt;

&lt;p&gt;SPL (Structured Process Language) is an open-source computing engine intended for structured data computations. It offers powerful computing ability independent of databases, and its join performance is much higher than that of both the SQL join and the wide-table approach. The language addresses the root of the join performance problem while avoiding the problems brought by wide tables.&lt;/p&gt;

&lt;p&gt;SPL categorizes the equi-joins commonly seen in BI analysis into two types: foreign-key joins and primary-key joins. Each is provided with its own performance optimization methods, explained in the second half of the above-mentioned post on join simplification and acceleration. Specifically, SPL offers dimension table preloading and numberization for the common foreign-key joins, and order-based merging for primary-key joins, which significantly reduce join complexity.&lt;/p&gt;
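The numberization idea, replacing per-row hash lookups with direct array indexing against a preloaded dimension table, can be sketched as follows (a generic illustration of the technique with made-up data, not SPL code):

```python
# Dimension table preloaded into memory; its primary keys are
# converted once to sequence numbers 0..n-1 ("numberization").
regions = ["East", "West", "North"]          # dimension records
key_to_seq = {"E": 0, "W": 1, "N": 2}        # built once, ahead of time

# Fact rows then carry the sequence number instead of the raw key,
# so each join lookup is a direct array access, not a hash probe.
facts = [("E", 10.0), ("W", 20.0), ("E", 5.0)]
numberized = [(key_to_seq[k], amt) for k, amt in facts]

# Aggregate by region via the O(1) indexed "join".
totals = [0.0] * len(regions)
for seq, amt in numberized:
    totals[seq] += amt

print(dict(zip(regions, totals)))  # {'East': 15.0, 'West': 20.0, 'North': 0.0}
```

The one-time numberization cost is amortized across every subsequent query against the same dimension table, which is where the bulk of the saving comes from.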

&lt;p&gt;Once we speed up the join operation, wide tables become useless and the volume of data to be read is reduced. The result is that SPL greatly increases BI performance.&lt;/p&gt;

&lt;p&gt;That’s the theory. How does SPL perform in the field?&lt;/p&gt;

&lt;p&gt;A comparison test was performed, covering a common multidimensional-analysis aggregation by dimension after joining a fact table with multiple, multilayer dimension tables, as well as the same aggregation on a pre-built wide table.&lt;/p&gt;

&lt;p&gt;The test data is a TPCH 100G data set, and computations joining one large fact table with multiple dimension tables are designed:&lt;/p&gt;

&lt;p&gt;A two-table join between one fact table and one dimension table;&lt;br&gt;
A seven-table join between a primary-sub fact table pair and four dimension tables, one of which is used twice;&lt;br&gt;
Converting the seven-table join result into a wide table and performing wide-table-based aggregation.&lt;br&gt;
The products tested against are two specialized OLAP databases famous for high-performance BI analysis: StarRocks and ClickHouse. Below is the test result:&lt;/p&gt;

&lt;p&gt;Time unit: Second&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm6lsfbnclnz98cvz9rp2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm6lsfbnclnz98cvz9rp2.png" alt=" " width="800" height="244"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Find detailed test report &lt;a href="https://c.esproc.com/article/1690170794600" rel="noopener noreferrer"&gt;HERE&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;According to the results, SQL’s wide table is faster than its join, which verifies the earlier analysis. SPL’s wide-table performance is actually not as fast as ClickHouse’s, but its real-time join performance is very high: 3-9 times faster than the joins in the two SQL databases, and it even &lt;strong&gt;surpasses the databases’ wide-table method by huge margins&lt;/strong&gt;. Taking the wide table’s shortcomings into account (redundant data, data errors, inflexibility), the advantage of SPL’s real-time join becomes even more obvious: it avoids wide-table defects while multiplying performance.&lt;/p&gt;

&lt;p&gt;A wide-table-based query is therefore not necessarily faster than a real-time join! With SPL, the costly wide tables created to obtain high performance in BI systems become unnecessary.&lt;/p&gt;

&lt;p&gt;SPL is now open-source. You can obtain the source code from &lt;a href="https://github.com/SPLWare/esProc" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; .&lt;/p&gt;

</description>
      <category>database</category>
      <category>dataengineering</category>
      <category>performance</category>
      <category>sql</category>
    </item>
    <item>
      <title>Tired of ETL Bottlenecks? Build a Logical Data Warehouse with SPL</title>
      <dc:creator>Judy</dc:creator>
      <pubDate>Fri, 16 Jan 2026 07:10:54 +0000</pubDate>
      <link>https://dev.to/esproc_spl/tired-of-etl-bottlenecks-build-a-logical-data-warehouse-with-spl-52la</link>
      <guid>https://dev.to/esproc_spl/tired-of-etl-bottlenecks-build-a-logical-data-warehouse-with-spl-52la</guid>
      <description>&lt;p&gt;A logical DW (data warehouse) offers users the ability to logically integrate a variety of different types of data sources without moving the original data, presenting itself as if it were a physical DW. It addresses the traditional DW’s inability to respond to real-time data processing needs, caused by the long data chain of data movement, so it suits fast-changing business scenarios well, and it has cross-source computing ability. However, lacking physical storage, a logical DW needs to map the data of each source to SQL tables in order to implement mixed computation over multiple sources.&lt;/p&gt;

&lt;p&gt;There is no problem with the idea and application scenarios of the logical DW, but the way it is implemented deserves discussion. Currently the external interface of most logical DWs is still SQL, because almost all traditional DWs are built on SQL and logical DWs follow suit. The benefit of SQL is universality, which lowers the learning and application thresholds.&lt;/p&gt;

&lt;p&gt;However, SQL weakens the “logical” ability: it lacks support for diverse data sources.&lt;/p&gt;

&lt;p&gt;Unlike a physical DW, a logical DW faces a much wider variety of data sources. Many of them do not meet the constraints of a DW (SQL), so mapping them to SQL tables is of little use, and there is no step of loading data into a database to make it meet the constraints (a physical DW performs this step and therefore never has to face diverse sources, whereas a logical DW does). As a result, the logical DW lacks support for diverse sources: it is relatively easy to support RDB-based sources, difficult to support multi-layer data structures such as NoSQL, Webservice and JSON, and more difficult still to support source types like the file system. In fact, most of today’s logical DWs support only RDBs well and other types of sources poorly.&lt;/p&gt;

&lt;p&gt;The lack of support for diverse sources also shows in functionality. We use multiple data sources precisely because different sources have different capabilities suited to different scenarios. Even among RDBs there are syntax differences, the databases’ respective dialects, and these dialects are not fabricated out of thin air: they are designed to leverage each database’s own abilities. Unless a logical DW could account for every dialect of every underlying database (obviously impossible), much of a database’s ability, which is reflected only in its dialect, goes unused. The problem is even worse for non-SQL databases; for example, MongoDB’s filtering syntax is quite different from SQL’s. Ideally, besides an automatic translation mechanism, the ability to use the data source’s own syntax directly should be provided. Unfortunately, SQL obviously does not have this ability.&lt;/p&gt;

&lt;p&gt;On the other hand, logical DWs are generally weak in physical computing ability.&lt;/p&gt;

&lt;p&gt;Reading data from diverse sources also raises performance issues. When the data amount is small, reading it on the fly is fine; when it is large, the I/O cost is high and on-the-fly reading makes performance intolerable. To ensure computing performance, a logical DW usually provides certain physical computing abilities (storing data on physical devices), but for reasons such as long-standing habit, a big gap remains between logical and physical DWs in adaptive storage and the resulting computing performance.&lt;/p&gt;

&lt;p&gt;Essentially, a logical DW should combine a physical DW with logical ability, and physical computing ability should be its foundation. Unfortunately, a purely logical DW is very weak at physical computing (suitable only for small data volumes and low-performance scenarios). In addition, a logical DW should be more open and flexible: besides connecting different sources for mixed computing, it should be able to fully leverage the advantages of each source according to the computing scenario.&lt;/p&gt;

&lt;p&gt;The current dilemma is that the logical DW is weak in physical computing while the physical DW is weak in logical data sources, so we need to combine the two and draw on each other’s merits.&lt;/p&gt;

&lt;p&gt;Based on these factors, using SPL to implement a logical DW is a better choice.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implement logical DW in SPL
&lt;/h2&gt;

&lt;p&gt;SPL is an open-source computing engine with computing ability open enough to integrate multiple types of data sources for mixed computation. Inherently powerful physical computing, together with high-performance guarantee mechanisms and logical cross-source computing ability, lets SPL fully implement a logical DW.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Logical data source ability&lt;/strong&gt;&lt;br&gt;
Currently, SPL can connect to dozens of data sources, not limited to RDBs: NoSQL, CSV, Excel, JSON, XML, HDFS, Elasticsearch, Kafka and more.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fccvkvwnhw0qsd0g1hht5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fccvkvwnhw0qsd0g1hht5.png" alt=" " width="783" height="363"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When connecting these data sources, SPL regards them as table sequences (small data) or cursors (big data) instead of mapping them to database tables. How to generate the table sequence or cursor is the business of each data source; any source offers such an interface, though it may not, and often cannot, provide a SQL access interface with unified syntax. In this way, each source’s own ability can be fully utilized.&lt;/p&gt;

&lt;p&gt;It is easy for SPL to perform cross-source mixed computation over these data sources. As an example, the following code uses SPL to handle a common cross-database scenario (mixed computation across different types of databases):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffevw4jdt7ut52zffekek.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffevw4jdt7ut52zffekek.png" alt=" " width="800" height="198"&gt;&lt;/a&gt;&lt;br&gt;
In this example, SPL doesn’t read all the raw data from MySQL; instead, it performs a grouping and aggregation in SQL before reading. As a result, the data volume of the large Orders table is significantly reduced, and I/O efficiency is greatly increased when fetching the data through interfaces (like JDBC).&lt;/p&gt;

&lt;p&gt;As mentioned earlier, SQL translation faces dialect issues, which prevents many of a database’s functionalities from being used. SPL also provides a similar translation function that can translate standard SQL into the corresponding database statements. More importantly, though, SPL supports the direct use of a data source’s own syntax, so its particular advantages can be fully exploited, whether it is a SQL dialect or a non-SQL data source.&lt;/p&gt;

&lt;p&gt;In addition to cross-database computation, SPL can perform mixed calculation between data sources of any type. For example, sometimes storing cold data in the file system is more cost-effective and more flexible to process (data can be redundantly stored at will), while the hot data stays in the database. To run a real-time query over the full data, the SPL code is:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F28qpv5047nkufur8swuq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F28qpv5047nkufur8swuq.png" alt=" " width="800" height="275"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;SPL is also capable of integrating data sources other than RDBs. In particular, SPL provides good support for multi-layer data structures, which makes it convenient to process data from Web interfaces, IoT, and NoSQL. For example, SPL reads multi-layer JSON data and performs an association query with a database:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5vobbf015zbez2qz8ml2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5vobbf015zbez2qz8ml2.png" alt=" " width="800" height="264"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Likewise, SPL supports NoSQL such as MongoDB:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsoyzjz8fgopotivf49bz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsoyzjz8fgopotivf49bz.png" alt=" " width="800" height="269"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Another example: mixed computation of RESTful data and text data:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbgrjvjizq6kif5wqk8uq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbgrjvjizq6kif5wqk8uq.png" alt=" " width="800" height="209"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By now, we can see that SPL provides an independent computing ability that does not depend on any particular data source, yet the ability of the data source itself can still be utilized. Users can choose where a calculation runs, at the data source end or in the logical DW (SPL), and this is where SPL’s flexibility comes in.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Physical computing ability&lt;/strong&gt;&lt;br&gt;
The previous section gave several examples of SPL’s ability to integrate multiple data sources. In addition, SPL offers powerful physical computing ability.&lt;/p&gt;

&lt;p&gt;SPL provides a professional structured data object, the table sequence, and offers a rich computing library based on it, giving SPL complete and simple structured data processing ability.&lt;/p&gt;

&lt;p&gt;Below are some common calculations written in SPL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Orders.sort(Amount)             // sort
Orders.select(Amount*Quantity&amp;gt;3000 &amp;amp;&amp;amp; like(Client,"*S*"))       // filter
Orders.groups(Client; sum(Amount))          // group
Orders.id(Client)               // distinct
join(Orders:o,SellerId ; Employees:e,EId)           // join

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By means of procedural computation and the table sequence, SPL can implement still more calculations. For example, SPL supports ordered operations more directly and thoroughly, and its grouping operation can retain the grouped subsets, that is, a set of sets, allowing us to conveniently perform further operations on the grouped result. SPL syntax differs from SQL in many ways; to be precise, these differences are advantages, which we will discuss later.&lt;/p&gt;
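&lt;p&gt;To illustrate the idea of retaining grouped subsets, here is a rough Python analogy (not SPL; the data and field names are invented for illustration):&lt;/p&gt;

```python
from collections import defaultdict

orders = [
    {"client": "A", "amount": 100},
    {"client": "B", "amount": 50},
    {"client": "A", "amount": 70},
]

# Group while retaining the grouped subsets (a set of sets) instead of
# forcing an immediate aggregation as SQL's GROUP BY does.
groups = defaultdict(list)
for o in orders:
    groups[o["client"]].append(o)

# Further operations can then run on each subset, e.g. each client's
# largest order amount.
largest = {c: max(g, key=lambda o: o["amount"])["amount"]
           for c, g in groups.items()}  # {"A": 100, "B": 50}
```

Because each group is kept as a real set of records, any follow-up computation can be applied per group without re-querying the source.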

&lt;p&gt;In addition to rich algorithms and libraries, SPL provides high-performance guarantee mechanisms. As mentioned earlier, a logical DW should be a combination of a physical DW and logical ability, and physical computing ability is very important; only the combination of the two can provide a sufficient performance guarantee. SPL designs a number of algorithms specifically for high-performance computing:&lt;/p&gt;

&lt;p&gt;In-memory computing: binary search, sequence number positioning, position index, hash index, multi-layer sequence number positioning…&lt;/p&gt;

&lt;p&gt;External storage search: binary search, hash index, sorting index, index-with-values, full-text retrieval…&lt;/p&gt;

&lt;p&gt;Traversal computing: delayed cursor, multipurpose traversal, parallel multi-cursor, ordered grouping and aggregating, sequence number grouping…&lt;/p&gt;

&lt;p&gt;Foreign key association: foreign key addressization, foreign key sequence-numberization, index reuse, aligned sequence, one-side partitioning…&lt;/p&gt;

&lt;p&gt;Merge and join: ordered merging, merge by segment, association positioning, attached table…&lt;/p&gt;

&lt;p&gt;Multidimensional analysis: partial pre-aggregation, time period pre-aggregation, redundant sorting, boolean dimension sequence, tag bit dimension…&lt;/p&gt;

&lt;p&gt;Cluster computing: cluster multi-zone composite table, duplicate dimension table, segmented dimension table, redundancy-pattern fault tolerance and spare-wheel-pattern fault tolerance, load balancing…&lt;/p&gt;

&lt;p&gt;Of course, both logical and physical calculations cannot be separated from data storage. Sometimes organizing the data according to the computing objective (such as sorting it by a specified field) yields higher computing performance, and conversely, some high-performance algorithms must be supported by storage. For this reason, SPL provides high-performance file storage. Please note that what SPL provides is file storage, which is completely different from the closed storage of traditional databases; SPL does not bind storage. From a logical point of view, SPL’s high-performance files are equal to any other data source, except that SPL applies engineering methods, such as compression, columnar storage and indexing, to the file storage to improve performance. Moreover, SPL provides many high-performance algorithms based on file storage.&lt;/p&gt;

&lt;p&gt;Physical storage gives SPL a physical computing ability that purely logical DWs cannot match, and also a significant performance advantage over other physical DWs. In real-world applications, SPL can often achieve a performance improvement of several times to dozens of times.&lt;/p&gt;

&lt;p&gt;Below are some performance improvement cases:&lt;/p&gt;

&lt;p&gt;Open-source SPL turns pre-association of query on bank mobile account into real-time association&lt;br&gt;
Open-source SPL Speeds up Query on Detail Table of Group Insurance by 2000+ Times&lt;br&gt;
Open-source SPL improves bank’s self-service analysis from 5-concurrency to 100-concurrency&lt;br&gt;
Open-source SPL speeds up intersection calculation of customer groups in bank user profile by 200+ times&lt;/p&gt;

&lt;p&gt;Another benefit of file storage is its flexibility, which allows us to organize the data at will according to the computing objective and avoids the situation where the data cannot be intervened in, as is the case with databases. Since storage is relatively cheap, we can make as many copies as we want; it is nothing more than a few more files. The same data can be organized in different forms (such as ordered by different fields) to adapt to different computing scenarios.&lt;/p&gt;

&lt;p&gt;SPL itself has complete, high-performance computing ability. Powerful physical computing ability, along with rich interfaces for diverse data sources, forms a complete logical DW, which is why we think SPL is more suitable for building a logical DW.&lt;/p&gt;

&lt;p&gt;The advantages of SPL don’t stop there. In building a logical DW, being lightweight and simple are also key characteristics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;More lightweight&lt;/strong&gt;&lt;br&gt;
Any mention of a DW suggests something heavy, a full server system even if it is just a logical DW. In fact, however, such operations occur in all kinds of scenarios, and many in-application cross-source calculations essentially fall within the scope of a logical DW. To this end, SPL supports not only independent deployment but also being integrated into applications.&lt;/p&gt;

&lt;p&gt;SPL has very low hardware requirements and is light to install. SPL can run on any operating system as long as a JVM environment with JDK 1.8 or a higher version (including common VMs and containers) is available, and it takes up less than 1GB of installation space. When integrated into an application, it only needs a few embedded jars to run, which is very convenient. Some reporting/BI tool vendors now also claim to support a logical DW, but the actual effect is poor, far inferior to professional DWs. Embedding SPL in such a tool can make up for this lack, enabling it to do logical calculations such as in-application cross-source computing.&lt;/p&gt;

&lt;p&gt;Multiple easy-to-use data source interfaces, together with physical file storage that keeps the use and management of files flexible and simple, as well as SPL’s agile syntax, make SPL very light to use as a logical DW.&lt;/p&gt;

&lt;p&gt;It is very convenient to manage data files in a file system. Specifically, files can be managed in a multi-level directory structure, and we can set up different directories for different businesses or modules: a certain directory and its subdirectories serve a single business, eliminating coupling between businesses; data modification will not affect other businesses; and if a certain business is retired, the corresponding data (directory) can be safely deleted, keeping overall management very neat. Moreover, SPL has no metadata and does not require a complex management system like a database. All of this makes O&amp;amp;M very light.&lt;/p&gt;

&lt;p&gt;Being light in installation/embedding, use, and operation and maintenance, SPL as a logical DW is very light overall.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lower development cost&lt;/strong&gt;&lt;br&gt;
Using SPL to perform data calculation makes development and debugging easier and lowers the development cost.&lt;/p&gt;

&lt;p&gt;SPL provides syntax that supports procedural computation, which greatly simplifies complex calculation. Clearly, for the same computing task, writing 100 lines of code as one statement (SQL) and writing 100 lines of code as 100 statements (SPL) present completely different complexities.&lt;/p&gt;

&lt;p&gt;Moreover, SPL provides an IDE that makes developing and debugging easier. Beyond the editing and debugging functions, the IDE allows us to code cell by cell and offers a panel to view the result of each step, which is very convenient.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frs866y0eme0wt4nd7oof.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frs866y0eme0wt4nd7oof.png" alt=" " width="800" height="417"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;More importantly, SQL does not have complete language ability and often cannot handle a pure data task alone. For example, calculating the maximum number of days a stock keeps rising, or the more complex e-commerce funnel calculation (such calculations are not rare and often appear in practice), is extremely difficult to implement in SQL and often requires resorting to Python or Java. Consequently, the technology stack becomes complex, bringing inconvenience to operation and maintenance.&lt;/p&gt;

&lt;p&gt;Compared to SQL, in which many scenarios are difficult or even impossible to implement, SPL provides more concise syntax and more complete ability. For example, to calculate the maximum number of days a stock keeps rising, SPL needs just one statement:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;stock.sort(trade_date).group@i(close_price&amp;lt;close_price [-1]).max(~.len())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In contrast, implementing this calculation in SQL requires nesting multiple layers of code in a very roundabout way.&lt;/p&gt;
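&lt;p&gt;To make the logic concrete, here is the same calculation sketched in plain Python (illustrative only; it assumes the price list is already sorted by trade date):&lt;/p&gt;

```python
def longest_rising_run(prices):
    """Length of the longest run of consecutive strictly rising prices."""
    best = run = 1
    for prev, cur in zip(prices, prices[1:]):
        # Extend the current rising run, or start a new one on a drop.
        run = run + 1 if cur > prev else 1
        best = max(best, run)
    return best

longest_rising_run([5, 6, 7, 3, 4, 5, 6, 2])  # the run 3,4,5,6 gives 4
```

Like the SPL one-liner, this walks the ordered data once and tracks the longest rising segment, which is exactly the computation SQL must express through layered nesting.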

&lt;p&gt;In addition to the conventional structured data computing library, SPL provides the ordered operations that SQL is not good at, the grouping operation that retains grouped subsets, various association methods, and so on. SPL syntax also offers many unique features, such as option syntax, cascaded parameters and advanced Lambda syntax, making complex calculations easier to implement.&lt;/p&gt;

&lt;p&gt;For more information about SPL syntax, visit: A programming language coding in a grid&lt;/p&gt;

&lt;p&gt;Concise syntax and complete language ability make development very efficient and eliminate the need to resort to other technologies, keeping the technology stack simpler: everything can be done in one system, which simplifies O&amp;amp;M and reduces cost.&lt;/p&gt;

&lt;p&gt;For a logical DW, the logical ability and the physical computing ability are equally important. Only by combining the two abilities can a logical DW fully play its role. In addition, the integration degree of data sources, support degree for data types, performance guarantee, ease of use, and development and O&amp;amp;M costs are also important considerations. Overall, using SPL to build logical DW is a good choice.&lt;/p&gt;

&lt;p&gt;SPL is now open-source. You can obtain the source code from &lt;a href="https://github.com/SPLWare/esProc" rel="noopener noreferrer"&gt;GitHub &lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Try it free~~ &lt;a href="https://www.esproc.com/download-esproc/" rel="noopener noreferrer"&gt;Download~&lt;/a&gt;&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>database</category>
      <category>dataengineering</category>
      <category>sql</category>
    </item>
    <item>
      <title>In-Memory Databases Are Overrated — Here’s What Actually Matters for Speed</title>
      <dc:creator>Judy</dc:creator>
      <pubDate>Tue, 13 Jan 2026 06:24:32 +0000</pubDate>
      <link>https://dev.to/esproc_spl/in-memory-databases-are-overrated-heres-what-actually-matters-for-speed-3ho1</link>
      <guid>https://dev.to/esproc_spl/in-memory-databases-are-overrated-heres-what-actually-matters-for-speed-3ho1</guid>
      <description>&lt;p&gt;It is easy to think about using an in-memory database to solve the performance problem of reporting, BI analysis, batch processing, and other data analysis tasks. An in-memory database allows storing all data permanently in the memory so that the computation external memory accesses (disk reads) are not needed, disk I/O can be avoided and data processing performance can be effectively improved.&lt;/p&gt;

&lt;p&gt;In-memory databases are advertised as high-performance and able to solve many performance problems in business data analysis. They are fast not only because of zero disk I/O costs but also because of specialized memory optimization techniques: memory supports random access and strong parallel processing, and techniques such as pre-loading and pre-indexing further enhance computing performance.&lt;/p&gt;

&lt;p&gt;But these techniques are not unique to in-memory databases. Actually, all computations are performed in memory: the CPU can only compute on data in memory, and data stored in external memory must be read into memory for processing. As general memory optimization techniques, they are adopted by many database products and computing engines, and as long as the memory optimization is done cleverly, performance catches up. In this sense, if the memory is large enough to hold all the data, every database could be called an in-memory database. The so-called “specialized in-memory databases” only offer optimization methods for in-memory computation; they are either not good at computations whose data volume exceeds memory capacity, or unable to handle them at all (external memory computation is much more difficult). To highlight their advantages while downplaying their disadvantages, vendors name these products “in-memory databases” to distinguish them from ordinary ones, while the truly specialized in-memory ability remains weak.&lt;/p&gt;

&lt;p&gt;Therefore, there’s no such thing as a “specialized in-memory database”; there are only specialized in-memory data processing techniques. It doesn’t matter whether a product claims to be an in-memory database. What really matters is whether the in-memory techniques the product uses are effective, which can be found out by testing and comparing computing performance under a large memory.&lt;/p&gt;

&lt;p&gt;As we have understood what in-memory databases are, let’s move on to discuss in-memory processing techniques and computing performance.&lt;/p&gt;

&lt;p&gt;We all know that SQL-based relational databases are still the mainstream database systems, whether a given database is a so-called in-memory database or not. But it’s a pity that SQL cannot make the best use of the characteristics of memory, leaving much room for performance improvement. Although certain optimization methods, such as compression, column-wise storage, indexing, parallel processing and vector computing, can be implemented well in SQL, other techniques that greatly help computing performance are inconvenient to implement in SQL, because SQL has intrinsic limits.&lt;/p&gt;

&lt;p&gt;SQL lacks a data type for representing records. A SELECT result is a new data table unrelated to the original table, even if it has only one record. As a result, a lot of data is copied during computation (requiring new memory space), and costs increase in both space usage and time consumption. With an explicit record type, we could form a data table directly from records that store the memory addresses of data in the original table; data would not be copied frequently or in large amounts during computation, memory would be used effectively, and performance would be higher.&lt;/p&gt;

&lt;p&gt;For multi-table join computation, if we convert the foreign keys in the fact table into the memory addresses of the corresponding records in the dimension table in advance (pre-join), no association is needed when the records are used, and we get performance as good as that of a single-table query. Yet it is a pity that SQL cannot exploit this characteristic.&lt;/p&gt;

&lt;p&gt;SQL also has poor support for order-based computation. The language is designed over unordered sets: it defines no ordinal numbers for set members and offers no location operations or mechanism for referencing neighboring members, making it difficult to exploit data orderliness for high-efficiency algorithms. For example, if data in memory is ordered, binary search can speed up queries, which noticeably increases performance on large data volumes. Location by ordinal number is a similar and even more efficient method, which uses memory’s high-speed random access to retrieve data directly by the specified ordinal number. Both methods are difficult to implement in SQL.&lt;/p&gt;
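&lt;p&gt;As a generic illustration (plain Python, not tied to any database), binary search over ordered in-memory data looks like this:&lt;/p&gt;

```python
from bisect import bisect_left

def ordered_lookup(sorted_keys, key):
    """Binary search on ordered data: O(log n) instead of a full scan."""
    i = bisect_left(sorted_keys, key)
    return i if i < len(sorted_keys) and sorted_keys[i] == key else -1

keys = [3, 7, 7, 15, 42, 99]
ordered_lookup(keys, 42)   # found at position 4
ordered_lookup(keys, 8)    # -1: not present
```

The lookup relies entirely on the data being kept ordered, which is exactly the kind of orderliness an unordered-set model cannot promise.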

&lt;p&gt;Making the most of memory also requires the ability to describe data structures. The relational model (SQL) relies on two-dimensional tables to represent relationships; it is difficult for it to describe and employ more complex data structures (such as multilevel JSON), which are very suitable for in-memory storage and have advantages in both space utilization and usage efficiency. Frankly, a language needs to support complex data structures to implement real grouping. SQL forcibly performs an aggregation after each grouping action, but sometimes it is the members of each group that we are concerned with. To achieve that, SQL must first invent group ordinal numbers using a subquery, producing cumbersome code, repeated queries and low efficiency.&lt;/p&gt;

&lt;p&gt;The root of SQL’s inability to make good use of memory features lies in its theory. When SQL (the relational model) was invented, computer memory was very small, so it is understandable that the language adapts awkwardly to today’s large-memory environment. Though contemporary databases have implemented many engineering optimizations and improved the situation to some extent, the optimization engine becomes useless once the scenario gets even a little complicated; after all, it is difficult to compensate for theoretical shortcomings with engineering. Moreover, the quality and application scope of an optimization engine can only be determined through strict testing and evaluation, and tests are generally limited to scenarios that SQL handles easily, not the complex ones. This makes database product selection failure-prone and highly risky.&lt;/p&gt;

&lt;p&gt;SPL can solve all these problems.&lt;/p&gt;

&lt;p&gt;SPL (Structured Process Language) is an open-source computing engine intended for computing structured and semi-structured data. To solve the SQL problems above, SPL is designed on a brand-new model instead of SQL’s relational algebra. The language supports both in-memory and external memory computation. It specifically offers memory optimization methods that achieve, and even surpass, the performance of an in-memory database. When the data volume exceeds memory capacity, SPL can switch to external memory computation (loading data into memory in batches), and in certain computing scenarios the performance is nearly as good as a full in-memory computation, enabling high performance even with little available memory. A unique thing about SPL is that it can be embedded into an application to supply high-efficiency computing ability.&lt;/p&gt;

&lt;p&gt;SPL also offers the mature engineering optimization techniques that in-memory databases use, such as the previously mentioned column-wise storage, vector computing, pre-loading and pre-indexing. These general engineering strategies are enough to achieve performance as good as an in-memory database’s, but they are not enough to bring the memory’s characteristics into full play, which, by contrast, is SPL’s core strength.&lt;/p&gt;

&lt;p&gt;As we said previously, we can make use of the characteristics of memory and computations to achieve higher performance when handling scenarios involving complex data structures or computations. In-memory databases cannot exploit those characteristics because of SQL’s limits.&lt;/p&gt;

&lt;p&gt;SPL can.&lt;/p&gt;

&lt;p&gt;Unlike SQL’s SELECT, which copies data, making the data volume bigger and efficiency lower, SPL only keeps the original records’ memory addresses and does not copy the records themselves, creating advantages in both space usage and computing efficiency. SPL can do this because it has a specialized record type for storing references (memory addresses) to the original data. SQL, however, has no record data type: a single record is actually a data table consisting of a single row, different data tables cannot share records, and filtering copies records into a new data table, producing unsatisfactory space utilization and time costs.&lt;/p&gt;
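&lt;p&gt;Python lists behave similarly: a filter result can hold references to the original record objects rather than copies, which is the effect described here (example data invented):&lt;/p&gt;

```python
orders = [{"id": 1, "amount": 500}, {"id": 2, "amount": 3500}]

# Filtering by reference: the new list stores references (addresses) to
# the original record objects; no record data is copied.
big = [o for o in orders if o["amount"] > 3000]

big[0] is orders[1]  # True: the very same object, not a copy
```

Because only references are stored, the filtered result costs almost nothing in extra memory, no matter how wide each record is.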

&lt;p&gt;We all know that the CPU accesses data in memory via addresses; if we store the addresses of certain data beforehand, accessing it later is very fast. Take the join operation as an example. If we store the addresses of the foreign key table (dimension table)’s records in the fact table, no join (hash computation) is needed when using data of the two tables, and we achieve performance equal to a single-table query. This pre-association method has obvious advantages in scenarios involving data reuse and multi-table association (a relatively large number of dimension tables). SPL offers such a mechanism: perform a pre-join between the fact table and the multiple dimension tables (store the memory addresses of the dimension tables’ records in the foreign key fields), save the result in memory, and speed up the computation by using the memory addresses. Find more information HERE.&lt;/p&gt;
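&lt;p&gt;A minimal Python sketch of the pre-join idea (hypothetical tables and field names):&lt;/p&gt;

```python
# Dimension table keyed by its primary key.
employees = {1: {"eid": 1, "name": "Alice"}, 2: {"eid": 2, "name": "Bob"}}
orders = [{"seller_id": 1, "amount": 100}, {"seller_id": 2, "amount": 200}]

# Pre-join: replace each foreign key with a direct reference to the
# dimension record, so later use needs no hash join at all.
for o in orders:
    o["seller"] = employees[o["seller_id"]]

# Using the association is now a plain field access:
names = [o["seller"]["name"] for o in orders]  # ["Alice", "Bob"]
```

The hash lookup happens once, up front; every subsequent use of the association follows a stored reference, as a single-table query would.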

&lt;p&gt;SQL is based on unordered sets, which makes many high-performance algorithms impossible to implement. To solve the problem, SPL directly offers ordered sets that bring the full potential of orderliness into play. For the very complicated celestial body contrast computation, the computation amount can be greatly reduced by first performing an initial filtering with order-based binary search, which makes the subsequent computation more convenient. SQL cannot describe this computation and thus cannot exploit orderliness; in actual practice, it is over 3 orders of magnitude slower than SPL. Find more information HERE.&lt;/p&gt;

&lt;p&gt;SPL can make the most of ordinal numbers to perform high-efficiency access. For a query task, if the queried key value is the target value’s ordinal number in the table sequence, or the target’s ordinal number can easily be derived from the queried value, the ordinal-number-based location method completes the computation in constant time, without any comparisons.&lt;/p&gt;

&lt;p&gt;What’s more, the association involving a large fact table (an external memory computation) can be implemented with ordinal numbers. When a fact table is too large to fit into memory, the address reference method above no longer works. In that case, we can convert the fact table’s foreign key values into the positions of their corresponding records in the dimension table, which is called numberization. When associating the numberized fact table with the dimension table, the more efficient ordinal-number-based location performs the search without any comparison, producing performance nearly as good as the memory address reference method. In SPL, the use of ordinal numbers is an important performance optimization strategy.&lt;/p&gt;
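&lt;p&gt;The numberization idea can be sketched in Python (hypothetical data; positions are 0-based here):&lt;/p&gt;

```python
# Dimension table as an ordered sequence; a record's position is its
# ordinal number.
dim = [{"name": "Alice"}, {"name": "Bob"}, {"name": "Carol"}]

# Fact rows whose foreign key has been pre-converted ("numberized")
# into positions in the dimension table.
facts = [{"seller_no": 2, "amount": 100}, {"seller_no": 0, "amount": 200}]

# The join lookup is a constant-time index access, with no comparisons.
names = [dim[f["seller_no"]]["name"] for f in facts]  # ["Carol", "Alice"]
```

Unlike the address-reference method, positions survive being written to and read back from external storage, which is why numberization suits fact tables that exceed memory.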

&lt;p&gt;We can see that various computing scenarios are taken into account when SPL is designed. It isn’t intended just for memory or external memory; it targets both so that problems of all scenarios can be well handled. Find more information HERE.&lt;/p&gt;

&lt;p&gt;Sometimes the computing goal itself has definite order requirements. For example, the securities industry’s consecutive rising/falling computations require comparisons between neighboring values. Databases that implement SQL well can use window functions for this type of computation, but even simple computations must be expressed in a roundabout way, and the efficiency is low.&lt;/p&gt;

&lt;p&gt;To count the longest consecutive rising days for a certain stock, for example, SQL needs a 3-layer nested query to express the computing process even with a window function, and the execution efficiency is poor (database optimization strategies become useless in complicated scenarios). By contrast, SPL codes the computation conveniently thanks to its support for order-based computation (order-based grouping):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;stock.sort(date).group@i(price&amp;lt;price[-1]).max(~.len())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;SPL references the directly previous/next record through the relative position.&lt;/p&gt;

&lt;p&gt;The performance of SPL’s order-based computation is particularly outstanding in handling the e-commerce industry’s customer churn rate. Find detailed explanations in the article SPL computing performance test series: funnel statistics&lt;/p&gt;

&lt;p&gt;SPL also shows its ability to make the most of memory in describing complicated data structures such as JSON. It supports multilevel data structures directly: with its generic type, SPL allows sets as members of a sequence, that is, members of a set may themselves be sets, giving the language a natural ability to describe multilevel data such as JSON/XML. This also enables SPL to keep the grouped subsets (a set of sets) and perform further computation on each subset, effectively avoiding SQL-style roundabout nested queries and helping achieve higher performance.&lt;/p&gt;
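&lt;p&gt;For instance, multilevel JSON-like data, where set members are themselves sets, can be traversed directly (example data invented):&lt;/p&gt;

```python
client = {
    "name": "A",
    "orders": [
        {"id": 1, "items": [{"sku": "x", "qty": 2}, {"sku": "y", "qty": 1}]},
        {"id": 2, "items": [{"sku": "x", "qty": 5}]},
    ],
}

# Sets nested inside sets: aggregate across both levels in one pass,
# with no need to flatten into two-dimensional tables first.
total_qty = sum(i["qty"] for o in client["orders"] for i in o["items"])  # 8
```

A relational model would first have to normalize this into separate orders and items tables and join them back, whereas the multilevel structure is computed on as-is.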

&lt;p&gt;By offering all these techniques that fully exploit the characteristics of memory, SPL outperforms in-memory databases, achieving higher computing performance with more concise code for phrasing the computing process.&lt;/p&gt;

&lt;p&gt;That being the case, is it still necessary for “specialized in-memory databases” to exist?&lt;/p&gt;

&lt;p&gt;Not any more. Though it does not advertise the concept of an in-memory database, SPL actually possesses in-memory computing techniques more powerful than those of in-memory databases.&lt;/p&gt;

&lt;p&gt;SPL also provides external-memory computation to handle data exceeding memory capacity, which broadens its range of application scenarios; combining the in-memory and external-memory abilities brings out the most power. A typical example of this combination is the previously mentioned funnel analysis. A product claiming to be an in-memory database usually lacks external-memory computation ability, which greatly limits its application scope; SPL, by contrast, covers a much wider scope.&lt;/p&gt;

&lt;p&gt;SPL can also work as an embedded computing engine, integrating with an application through its JAR files. It is deployed together with the application and offers in-memory database-like high-performance computations directly from within the application, while still supporting the cooperative use of the in-memory and external-memory abilities.&lt;/p&gt;

&lt;p&gt;With its high-efficiency in-memory computing ability, SPL is fully qualified to replace “specialized in-memory databases”. Together with its external-memory ability and flexible integration, SPL gains a broader application scope. When examining a product that claims to supply high-performance computation, whether it is called an in-memory database is not important. A product is good as long as its in-memory computing techniques are strong and its application scope is wide. By this standard, SPL is an ideal choice.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>From Heavyweight MPP to Lightweight SPL: Achieving High-Speed Data Processing Without the Cost</title>
      <dc:creator>Judy</dc:creator>
      <pubDate>Wed, 07 Jan 2026 08:21:28 +0000</pubDate>
      <link>https://dev.to/esproc_spl/from-heavyweight-mpp-to-lightweight-spl-achieving-high-speed-data-processing-without-the-cost-3om8</link>
      <guid>https://dev.to/esproc_spl/from-heavyweight-mpp-to-lightweight-spl-achieving-high-speed-data-processing-without-the-cost-3om8</guid>
      <description>&lt;p&gt;In order to obtain better computing performance, the MPP databases, such as Greenplum, Vertica, IQ, TD and AsterData, are often adopted. Although MPP can achieve better performance, the cost is high. Specifically, MPP consumes a large amount of hardware resources, resulting in high hardware cost, and it needs to pay expensive license fee if a commercial software is used. Moreover, it is very complicated to operate and maintain MPP, as each node needs to be maintained separately, and the uniform distribution and the consistency assurance of data under distributed framework will increase the O&amp;amp;M complexity. In short, it is heavy and expensive to use MPP.&lt;/p&gt;

&lt;p&gt;Then, is there any other solution?&lt;/p&gt;

&lt;p&gt;The main purpose of using MPP is to obtain better computing performance. If the performance can be improved in a lightweight and cost-effective way, then we can give up MPP. So, is there a way like this?&lt;/p&gt;

&lt;p&gt;After carefully analyzing current computing scenarios that deal with structured data (databases), we found that the data amount of most tasks is not particularly large. Take financial institutions, whose business data are usually large, as an example: a bank with tens of millions of accounts produces only hundreds of millions of transaction records a year, which is not considered large; an e-commerce system with millions of accounts accumulates data of roughly the same scale as the bank. Except for a few top companies, the computing scenarios of the vast majority of users do not involve particularly large amounts of data: the data scale of a single task is only tens of gigabytes, tasks involving 100 gigabytes or more are rare, let alone the petabyte-level tasks claimed by many big-data vendors.&lt;/p&gt;

&lt;p&gt;Normally, a conventional database should be able to handle tasks of this data scale easily, but that is not the case in reality. In the real world, it is very common for a batch job to take several hours, leaving no time to re-run it if something goes wrong; it is also common for a report query to take tens of seconds to minutes, and even longer once queries occur concurrently (even if the number of concurrent queries is small).&lt;/p&gt;

&lt;p&gt;To cope with this, users will consider using an MPP to speed up.&lt;/p&gt;

&lt;p&gt;Why do these situations still occur though the amount of data is not large?&lt;/p&gt;

&lt;p&gt;The main reason is that current databases do not fully utilize hardware resources. In other words, database performance is too low.&lt;/p&gt;

&lt;p&gt;There are two deeper reasons.&lt;/p&gt;

&lt;p&gt;One reason is that MPP adopts many engineering optimizations, such as data compression, columnar storage, indexes, and vector-based computing, to serve AP scenarios. These methods significantly improve computing efficiency. However, they are rarely found in traditional databases, whose performance is therefore naturally low. If traditional databases adopted such technologies, their performance would also improve; unfortunately, at present mainly MPP products use them.&lt;/p&gt;

&lt;p&gt;The other reason is that although these slow-running operations do not involve large data amounts, they are usually very complex. Moreover, due to the limitations of SQL itself, some complex operations are very difficult to implement, and even when they can be coded in SQL, the resulting amount of calculation is particularly large. For example, order-related multi-step operations are difficult to code in SQL and slow to run. SQL lacks features like record types, ordered operations, and procedural computation, making it impossible to express many high-performance algorithms; programmers can only resort to slow algorithms, so performance is poor.&lt;/p&gt;

&lt;p&gt;Running slowly requires more hardware to speed up. Therefore, even though the data scale is not large, the database cannot handle it and has to resort to a distributed MPP.&lt;/p&gt;

&lt;p&gt;Of course, we hope to obtain both the speed of a high-speed train (MPP) and the lightness of a car (a light solution). However, within the scope of current knowledge, a car that runs as fast as a high-speed train seems not to exist, so the heavy high-speed train becomes the solution that has to be taken.&lt;/p&gt;

&lt;p&gt;Fortunately, we now have the lightweight esProc SPL to fill the gap. Just like a car, SPL can achieve the speed of a high-speed train! Here are some of the advantages of esProc SPL:&lt;/p&gt;

&lt;h2&gt;
  
  
  No distributed framework required
&lt;/h2&gt;

&lt;p&gt;As an open-source computing engine, SPL is specifically designed for processing structured data. The high-performance mechanism provided in SPL can fully utilize hardware resources, allowing a single machine to exert the computing ability of a cluster, thus making it possible to handle most computing scenarios that previously required MPP without employing a distributed framework.&lt;/p&gt;

&lt;p&gt;In terms of engineering, SPL also adopts the common mechanisms of MPP, such as compression, columnar storage, indexes, and vector-based calculation, to ensure excellent performance. In addition, SPL provides high-performance file storage that supports these mechanisms, eliminating the need for a closed database management system. The file storage can be placed directly on any file system, making it more open. Not only does SPL deliver high computing performance, it is also an out-of-the-box tool and lighter to use.&lt;/p&gt;

&lt;p&gt;More importantly, because of the inherent defects of SQL, SPL does not continue with the SQL system but adopts an independent programming language, the Structured Process Language. SPL provides more data types and operations and innovates fundamentally (theoretical defects are difficult to address with engineering methods alone). As we know, software cannot change the speed of hardware; however, with low-complexity algorithms the hardware executes less computation, and performance improves naturally. SPL offers many such high-performance algorithms. For example, the complicated multi-step ordered operations mentioned above are easy and simple to code in SPL, and they run fast.&lt;/p&gt;

&lt;p&gt;High performance requires less hardware, a relationship we have discussed many times. In practice, for most scenarios that seem to require MPP, SPL can handle them on a single machine, which not only saves hardware cost but is also convenient for O&amp;amp;M.&lt;/p&gt;

&lt;p&gt;Here are some cases for reference:&lt;/p&gt;

&lt;p&gt;Open-source SPL turns pre-association of query on bank mobile account into real-time association&lt;br&gt;
Open-source SPL Speeds up Query on Detail Table of Group Insurance by 2000+ Times&lt;br&gt;
Open-source SPL improves bank’s self-service analysis from 5-concurrency to 100-concurrency&lt;br&gt;
Open-source SPL speeds up intersection calculation of customer groups in bank user profile by 200+ times&lt;/p&gt;

&lt;h2&gt;
  
  
  Able to handle multi-concurrency queries
&lt;/h2&gt;

&lt;p&gt;In addition to improving computing performance, distributed technology is sometimes used to handle multi-concurrency queries, which a single machine can indeed struggle to process. In such cases, do we have to use MPP?&lt;/p&gt;

&lt;p&gt;Not necessarily.&lt;/p&gt;

&lt;p&gt;SPL provides a cloud mode, allowing computing nodes to be started and stopped dynamically according to the concurrency situation, thereby implementing elastic computing. The cloud mode of SPL is completely different from the relatively fixed cluster mode of MPP: it can flexibly handle concurrent requests while consuming the fewest hardware resources.&lt;/p&gt;

&lt;p&gt;The high-performance file storage of SPL mentioned earlier fully ensures computing performance. Yet, unlike a database, which must keep its data inside the database, SPL does not bind the storage: the data files can be stored locally or remotely.&lt;/p&gt;

&lt;p&gt;We know that a database has metadata, which takes up a lot of resources. As data accumulate, the metadata becomes larger and larger and the whole system slower and slower, making techniques such as data redundancy and trading space for time difficult to apply. In contrast, SPL has no metadata and is light to use.&lt;/p&gt;

&lt;p&gt;Because SPL does not bind storage to computation and is not subject to metadata, it naturally supports the separation of storage and computation. Even when SPL’s high-performance file format is used, it is handled in the system the same way as a text file: it can be stored in a local or network file system, or directly in cloud object storage such as S3. With this separation of storage and computation, SPL can scale flexibly, making it very easy to cope with high-concurrency scenarios and more flexible and scalable than MPP.&lt;/p&gt;

&lt;h2&gt;
  
  
  More advantages
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Simple technology stack&lt;/strong&gt;&lt;br&gt;
Let’s start by comparing SPL and SQL. The similarity is that both are computing languages for structured data; the difference is that SQL lacks complete language ability, and even simple data tasks are often difficult to implement in it independently. For example, calculating the maximum number of days a stock keeps rising, or the more complex e-commerce funnel calculation (such calculations are not rare and appear often in practice), is extremely difficult in SQL and often requires resorting to Python or Java. This makes the technology stack complex and brings inconvenience to operation and maintenance.&lt;/p&gt;
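&lt;p&gt;To show what a funnel calculation involves, here is a minimal Python sketch: for each user, track how far they advance through the steps in order. The event format, step names and single-pass simplification are all assumptions of this example, not SPL’s or any database’s implementation.&lt;/p&gt;

```python
def funnel_counts(events, steps):
    """events: (user, step) pairs in time order.
    Returns, for each funnel step, how many users reached it
    after completing all previous steps in order."""
    progress = {}  # user -> index of the next step they need
    for user, step in events:
        i = progress.get(user, 0)
        if i < len(steps) and step == steps[i]:
            progress[user] = i + 1
    return [sum(1 for v in progress.values() if v > i)
            for i in range(len(steps))]

events = [("u1", "view"), ("u1", "cart"), ("u2", "view"),
          ("u1", "pay"), ("u2", "pay")]  # u2 skipped "cart"
print(funnel_counts(events, ["view", "cart", "pay"]))  # -> [2, 1, 1]
```

The per-user, order-sensitive state is what makes this awkward in set-oriented SQL, while a procedural, ordered traversal expresses it directly.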

&lt;p&gt;Compared to SQL (in which many scenarios are difficult or even impossible to implement), SPL provides more concise syntax and more complete abilities. For example, calculating the maximum number of days a stock keeps rising needs just one statement in SPL:&lt;/p&gt;

&lt;p&gt;stock.sort(trade_date).group@i(close_price&amp;lt;close_price[-1]).max(~.len())&lt;br&gt;
In contrast, doing this calculation in SQL requires nesting multiple layers of code in a very roundabout way.&lt;/p&gt;

&lt;p&gt;Beyond the conventional structured-data computing library, SPL provides the ordered operations that SQL is not good at, grouping operations that retain the grouped subsets, and various association methods. In addition, SPL syntax offers many unique features, such as option syntax, cascaded parameters and advanced Lambda syntax, making complex calculations easier to implement.&lt;/p&gt;

&lt;p&gt;For more information about SPL syntax, visit: A programming language coding in a grid&lt;/p&gt;

&lt;p&gt;Concise syntax together with complete language ability makes development very efficient and eliminates the need to resort to other technologies, thereby keeping the technology stack simple and allowing everything to be done in one system; naturally, operation and maintenance become simpler and more convenient.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Diverse data sources&lt;/strong&gt;&lt;br&gt;
We often encounter situations where adopting MPP improves performance but brings inconvenience: data from diverse sources can be processed only after being loaded into the database, which reduces data real-timeness, while loading and persisting the data increase cost, take up more space, and complicate operation and maintenance. Moreover, MPP cannot take the place of the original TP database, which forces a cross-database action, very difficult to implement, for real-time hot computing.&lt;/p&gt;

&lt;p&gt;SPL not only does not bind storage but also supports connecting to, and mixed calculation over, diverse sources. This gives SPL good openness, in total contrast to the closedness of databases, which can process data only after loading them.&lt;/p&gt;

&lt;p&gt;The data SPL is good at processing include conventional structured data, multilevel structured data (JSON, XML, etc.), strings, text, and mathematical data such as matrices and vectors. In particular, SPL provides powerful support for multilevel structured data such as JSON and XML, far surpassing traditional databases. Therefore, SPL works well with JSON-like data sources such as MongoDB and Kafka, can easily exchange data with HTTP/RESTful services and microservices and provide computing services, and can readily implement mixed operations with a TP database, making it highly suitable for real-time queries and statistics.&lt;/p&gt;
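&lt;p&gt;For instance, multilevel JSON can be computed on directly, without first flattening it into relational tables. A small Python illustration (the document below is invented for the example):&lt;/p&gt;

```python
import json

# Each order carries a nested list of items: a two-level structure.
doc = json.loads("""[
  {"cust": "A", "items": [{"sku": "x", "qty": 2}, {"sku": "y", "qty": 1}]},
  {"cust": "B", "items": [{"sku": "x", "qty": 5}]}
]""")

# Aggregate over the inner level while grouping by the outer level.
qty_by_cust = {o["cust"]: sum(i["qty"] for i in o["items"]) for o in doc}
print(qty_by_cust)  # -> {'A': 3, 'B': 5}
```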

&lt;p&gt;The benefits of openness are self-evident: it not only avoids the database capacity and performance problems caused by ETL, but also fully ensures the real-timeness of data and calculation. The openness of SPL is therefore very friendly to real-time computing scenarios.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;More lightweight&lt;/strong&gt;&lt;br&gt;
Besides not binding storage and having no metadata, the lightweight nature of SPL is reflected in its simple operating environment. SPL can run on any operating system where JDK 1.8 or a higher version is available, including common VMs and containers, and takes up less than 1GB of space after installation.&lt;/p&gt;

&lt;p&gt;What is even more special is that SPL can not only be deployed independently but can also be integrated into applications, providing powerful computing ability within the application. In this way, applications no longer have to rely on a central MPP for powerful computing ability, and the coupling of data processing between applications is eliminated, making SPL flexible to use and easy to manage while avoiding conflicts caused by multiple applications competing for central computing resources. All of this is impossible for MPP.&lt;/p&gt;

&lt;p&gt;SPL is now open-source. You can obtain the source code from &lt;a href="https://github.com/SPLWare/esProc" rel="noopener noreferrer"&gt;GitHub &lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Try it free~~ &lt;a href="https://www.esproc.com/download-esproc/" rel="noopener noreferrer"&gt;Download~&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Ditch the AP Database: Solve TP Overload with Lightweight SPL</title>
      <dc:creator>Judy</dc:creator>
      <pubDate>Mon, 29 Dec 2025 07:15:27 +0000</pubDate>
      <link>https://dev.to/esproc_spl/ditch-the-ap-database-solve-tp-overload-with-lightweight-spl-4a6n</link>
      <guid>https://dev.to/esproc_spl/ditch-the-ap-database-solve-tp-overload-with-lightweight-spl-4a6n</guid>
      <description>&lt;p&gt;At the beginning of information system construction, usually only one database is used, and the database combines TP (transaction processing) and AP (analytical processing) together. As the scale of business and the amount of data continue to grow, the database faces increasing pressure. In order not to affect transaction, a common practice in the industry is to move data (usually cold data) out of the database and use a dedicated database to handle AP business. This method effectively reduces the burden on TP database and ensures the smooth operation of transaction business.&lt;/p&gt;

&lt;p&gt;Normally, a professional AP database runs fast and does solve the performance problem to some extent, but it will cause some other problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Problems faced by AP database
&lt;/h2&gt;

&lt;p&gt;The first is cost.&lt;/p&gt;

&lt;p&gt;Currently, mainstream AP databases mostly adopt the MPP framework, which differs from that of TP databases. Although MPP can achieve better performance, both software and hardware costs are high. Specifically, MPP consumes a lot of hardware resources, resulting in high hardware cost, and expensive license fees must be paid if commercial software is used. Each node of MPP needs to be maintained separately, and both the uniform distribution and the consistency assurance of data under a distributed framework increase operational complexity. These factors drive up the cost of using an AP database.&lt;/p&gt;

&lt;p&gt;Adding an AP database also makes management more complicated. The management of the original TP database is already complicated enough, including designing the metadata, making data meet constraints before loading, controlling access privileges, and so on; all these tasks are also required when an AP database is added, and more resources are needed because the AP database is of a different type. As a result, two systems often require more than double the O&amp;amp;M cost, creating a cost problem.&lt;/p&gt;

&lt;p&gt;In addition, migrating data from TP database to AP database is not easy and will face a dilemma.&lt;/p&gt;

&lt;p&gt;The purpose of adding an AP database is to move all AP business from the TP database to the AP database, yet migrating everything at once poses significant risks. Leaving aside whether the AP database’s functionality is complete, many businesses originally implemented in one database may need to be redesigned after separation, and differences in database type and SQL compatibility increase the difficulty of migration. All of this makes migrating all AP business in one go too risky.&lt;/p&gt;

&lt;p&gt;Therefore, the safe approach is to migrate gradually, but it will encounter new problems.&lt;/p&gt;

&lt;p&gt;We all have the experience that, as use of a database deepens and the business keeps growing, originally normal queries become slower and slower. The reasons include not only the increase in data volume but also many factors such as table quantity, indexes, metadata, and storage space. For a centrally managed database system, it is difficult to determine whether the database is adequate for its own business before its load reaches a certain level.&lt;/p&gt;

&lt;p&gt;Migrating just a little business at the initial stage will certainly run fast, but it makes it difficult to determine whether the selected database is the right one. As the migration proceeds, later-migrated business may affect earlier business, which still causes great risk. And if it is found at a late stage that the AP database cannot handle the business, or needs to be scaled out substantially, a dilemma occurs, because a lot of work has accumulated by then.&lt;/p&gt;

&lt;p&gt;In addition to the aforementioned cost and dilemma problems, there is also a more troublesome real-timeness problem.&lt;/p&gt;

&lt;p&gt;Some businesses that used to run smoothly in one database can no longer be implemented after part of the data is separated out; the most typical is real-time query. Real-time query is naturally supported within one TP database, but the situation is completely different once the data are split across databases. For databases of the same type, cross-database queries can sometimes implement real-time query in an indirect way; the performance is usually poor, but at least it works. For an AP database, which is almost never the same type as the TP database, implementing real-time query becomes extremely difficult. As a result, real-time query business that was frequently used in the past has to fall back to T+1 or even T+N, and the impact on the business is self-evident.&lt;/p&gt;

&lt;p&gt;In fact, real-time query is essentially a cross-source computing problem. If a system is open, cross-source computing can be implemented easily. However, the closedness of databases requires that data be loaded before they can be calculated, making real-time query extremely difficult to implement. In other words, although the TP database overload problem is solved, a new problem is created, and the degree to which requirements are satisfied decreases.&lt;/p&gt;

&lt;p&gt;Surely, we could also resort to an HTAP database for real-time query, since the main goal of HTAP is to implement query and analysis in one database (does that sound like a TP database?). HTAP is of course an option, but in reality most HTAP databases are strong in TP while their AP capabilities are often weak, differing little from the original TP databases in many cases. More importantly, using an HTAP database still entails the cost and migration problems above. The original TP database is merely under pressure, not unusable; abandoning it completely would be a waste.&lt;/p&gt;

&lt;p&gt;Therefore, we need to find a relatively lightweight AP solution that will not incur high costs, preferably allowing for gradual migration while addressing real-time query problem.&lt;/p&gt;

&lt;p&gt;Fortunately, we can use SPL as a solution to reduce the burden on TP database.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open SPL solves various problems of AP business migration
&lt;/h2&gt;

&lt;p&gt;As an open-source computing engine dedicated to AP business, SPL has the following characteristics: it is simple, lightweight and low-cost; its open computing ability and file storage support gradual migration of business, without any impact on the business before and after each migration step; and its mixed computing over multiple data sources naturally supports real-time query, making it easy to meet business requirements even after the AP business is separated.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn1le17xabfmuxds2s5wu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn1le17xabfmuxds2s5wu.png" alt=" " width="579" height="289"&gt;&lt;/a&gt;&lt;br&gt;
SPL application framework&lt;/p&gt;
&lt;h2&gt;
  
  
  Lightweight and low cost
&lt;/h2&gt;

&lt;p&gt;One of the main differences between SPL and a database is its simple and lightweight nature, which reduces cost.&lt;/p&gt;

&lt;p&gt;SPL has very low hardware requirements and a very light overall footprint. SPL can run on any operating system with a JVM environment of JDK 1.8 or higher, including common VMs and containers, and takes up less than 1GB of installation space. Moreover, since SPL provides many high-performance mechanisms, a single SPL node can often achieve the same effect as an MPP cluster, directly reducing software and hardware costs (the high-performance mechanisms are explained in more detail later).&lt;/p&gt;

&lt;p&gt;What is even more special is that SPL can not only be deployed independently but can also be integrated into applications, providing powerful computing ability within the application. In this way, an application can obtain powerful computing ability without relying on a database. Migration of data out of the TP database can be initiated within the application, and SPL can meanwhile serve as the data mart or front-end computing engine for that application.&lt;/p&gt;

&lt;p&gt;SPL’s agile syntax also has advantages when implementing complex calculations. At the beginning of migration, we usually choose the businesses with low performance and high resource consumption. Such businesses are often complicated, and it is usually easier to remould them in SPL than to modify the SQL, bringing lower development and debugging costs (explained in more detail later).&lt;/p&gt;

&lt;p&gt;The lightweight nature of SPL is also reflected in its storage schema.&lt;/p&gt;

&lt;p&gt;Compared with the closed storage of databases, SPL uses files directly to store data. In fact, SPL does not bind storage; users can use any medium, but files have many advantages other forms of storage cannot match. Files are stored directly on the file system, either locally or on the network (cloud), so capacity is no concern: storage is cheap, and making another copy, whether for performance or for backup, is nothing more than a few more files. There is almost no upper limit on file storage.&lt;/p&gt;

&lt;p&gt;However, many open file formats do not offer high performance. For this reason, SPL provides dedicated high-performance file formats. Users can convert the source data directly and store it as SPL files, and can copy the data at will according to performance requirements during use.&lt;/p&gt;

&lt;p&gt;Another advantage of file storage is the ability to organize data flexibly. Sometimes we can achieve higher performance by organizing the data according to the computing objective and then applying a different algorithm. Compared to a database, where users cannot intervene in storage, files are much more flexible: data can not only be stored redundantly in multiple copies, but the same data can also be laid out in different organizational forms (for example, ordered by different fields) to suit different computing scenarios.&lt;/p&gt;

&lt;p&gt;Files in the file system can be managed in a multi-level directory structure, and we can set up different directories for different businesses or modules; a certain directory and subdirectories are dedicated to serving a single business, eliminating the coupling with each other; data modification will not affect other businesses; if a certain business goes down, the corresponding data (directory) can be safely deleted, making overall management very neat.&lt;/p&gt;
&lt;h2&gt;
  
  
  Gradual migration
&lt;/h2&gt;

&lt;p&gt;With the support of file storage and openness, we can proceed with gradual migration.&lt;/p&gt;

&lt;p&gt;As mentioned above, AP business is separated because the pressure on one database combining TP and AP is great. Although the database is under great pressure, it is still usable and can work well once the pressure is relieved. Since migrating all AP business in one go is too risky, the safe approach is to migrate gradually. Gradual migration not only reduces risk but also suits the characteristics of SPL: in the initial stage, migrating some statistical query scenarios with low performance and high resource consumption will significantly reduce the database’s workload.&lt;/p&gt;

&lt;p&gt;More importantly, migrating with SPL files will never cause later-migrated business to affect earlier-migrated business, because file storage creates no relationships between files, and adding a new file does not affect the use of existing ones. Moreover, the problems caused by the complex, closed database system simply do not exist in the SPL system, so migration can proceed safely.&lt;/p&gt;

&lt;p&gt;With this gradual migration approach, the risk is low and the performance is controllable, so there is no need to migrate all AP businesses in one go.&lt;/p&gt;
&lt;h2&gt;
  
  
  Solve real-timeness problem
&lt;/h2&gt;

&lt;p&gt;SPL has strong openness. In addition to file storage, SPL supports multiple data sources, and every data source is logically equivalent to SPL. Besides connecting to multiple sources, SPL can also perform mixed calculations over them, which makes real-time query easy to implement.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp4qjgnj47vwpa9cu2vkb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp4qjgnj47vwpa9cu2vkb.png" alt=" " width="498" height="278"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Due to its open computing ability, SPL can retrieve data from different databases respectively, and thus handles scenarios involving different database types well. SPL implements real-time query by performing a mixed calculation over the cold data stored in the file system (in place of an AP database) and the hot data stored in the TP database. During calculation, SPL’s agile syntax and procedural computation greatly simplify the complex calculations in real-time query and increase development efficiency. Moreover, SPL is an interpreted-execution language and supports hot deployment.&lt;/p&gt;

&lt;p&gt;The following is the SPL code to perform a mixed query of the historical cold data stored in files and the hot data stored in production database:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fugvrhgkzdf0iocdgh2vy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fugvrhgkzdf0iocdgh2vy.png" alt=" " width="800" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As we can see, the real-time query that is hard to implement after separating the databases takes only a few lines of script, which solves the real-time query problem that the separation created.&lt;/p&gt;
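&lt;p&gt;For readers without an SPL environment, the logic of such a cold/hot mixed query can be sketched in ordinary Python. Everything below (the table name, the account/amount schema, the in-memory SQLite database standing in for the TP source, a plain list standing in for SPL’s file storage) is invented for illustration; it only mirrors the idea of concatenating cold file data with hot database data before aggregating.&lt;/p&gt;

```python
import sqlite3
from collections import defaultdict

# Hypothetical cold side: historical rows already exported out of the
# production database (in SPL this would be high-performance file storage).
cold_rows = [("A", 120.0), ("B", 80.0), ("A", 40.0)]  # (account, amount)

# Hypothetical hot side: today's rows still in the production TP database.
db = sqlite3.connect(":memory:")
db.execute("create table orders (account text, amount real)")
db.executemany("insert into orders values (?, ?)", [("A", 10.0), ("C", 5.0)])
hot_rows = db.execute("select account, amount from orders").fetchall()

# Mixed calculation: read both sides, concatenate, then aggregate once.
totals = defaultdict(float)
for account, amount in list(cold_rows) + list(hot_rows):
    totals[account] += amount

print(dict(totals))  # per-account totals over the full data set, cold + hot
```

The point is that the aggregation runs once over the union of both sources, so the result always reflects the newest data in the TP database.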

&lt;p&gt;There are already many practices of migrating business from TP databases to SPL. The common scenarios are summarized in&lt;br&gt;
“SPL practice: migrate computing tasks out of database”, which you can consult when carrying out your own migration.&lt;/p&gt;
&lt;h2&gt;
  
  
  Higher performance
&lt;/h2&gt;

&lt;p&gt;In practice, SPL also employs the engineering mechanisms adopted by professional AP databases, such as compression, columnar storage, indexing, and vectorized calculation, to ensure excellent performance. As mentioned above, SPL provides high-performance file storage that supports these mechanisms without requiring a closed database management system; the files can be placed directly on any file system, making the approach more open.&lt;/p&gt;

&lt;p&gt;More importantly, because of SQL’s inherent defects, SPL does not stay within the SQL system: it is an independent programming language (Structured Process Language) that provides more data types and operations and innovates at the foundational level, since theoretical defects are hard to fix with engineering methods alone. Software cannot make hardware faster, but low-complexity algorithms can reduce the amount of computation the hardware must perform, and performance then improves naturally. SPL offers many such high-performance algorithms. The complicated multi-step ordered operation mentioned above, for example, is easy to implement in SPL: the code is simple and it runs fast.&lt;/p&gt;

&lt;p&gt;By means of these mechanisms, SPL requires fewer hardware resources and can often achieve cluster-level performance on a single machine. SPL also provides multi-thread parallelism and distributed computing, making it highly scalable when more performance is needed.&lt;/p&gt;

&lt;p&gt;In the following cases, SPL achieves the cluster effect on only a single machine:&lt;/p&gt;

&lt;p&gt;Open-source SPL turns pre-association of query on bank mobile account into real-time association&lt;br&gt;
Open-source SPL Speeds up Query on Detail Table of Group Insurance by 2000+ Times&lt;br&gt;
Open-source SPL improves bank’s self-service analysis from 5-concurrency to 100-concurrency&lt;br&gt;
Open-source SPL speeds up intersection calculation of customer groups in bank user profile by 200+ times&lt;/p&gt;
&lt;h2&gt;
  
  
  Lower development cost
&lt;/h2&gt;

&lt;p&gt;Since SPL does not adopt the SQL system, there is a learning cost before using it, and many people familiar with SQL may assume that migrating to SPL will cost more.&lt;/p&gt;

&lt;p&gt;In fact, this is not the case. From a long-term perspective, the development cost of SPL is lower!&lt;/p&gt;

&lt;p&gt;Because SPL is not compatible with SQL, migrating SQL code does require recoding, which incurs some modification cost. However, SPL is easy to learn and its syntax is concise, so the cost of rewriting SQL is not very high. In contrast, although AP databases also use SQL, moving from a TP database to an AP database still involves many SQL modifications because the database types differ; and since AP calculation logic is generally complex, rewrites are often necessary anyway. SQL itself is not easy to develop and debug, so the modification workload is considerable, far from the “seamless migration” claimed by vendors. Overall, the modification cost of adopting SPL is not much higher than that of adopting an AP database, which is acceptable.&lt;/p&gt;

&lt;p&gt;More importantly, SPL will bring long-term benefits.&lt;/p&gt;

&lt;p&gt;Since SQL’s language ability is incomplete, it is difficult to implement some complex calculations in SQL alone. Calculations such as the maximum number of consecutive days a stock keeps rising, or the more complex e-commerce funnel analysis (neither is rare; both appear often in practice), are extremely difficult to express in SQL and often force a fallback to Python or Java. That complicates the technology stack and burdens operation and maintenance. TP, AP, and HTAP databases all share this problem.&lt;/p&gt;
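&lt;p&gt;As a concrete illustration of why such calculations fight SQL, here is the “maximum consecutive rising days” logic written procedurally in Python (a sketch with made-up prices; the point is that it is a single ordered pass, which SQL’s set-oriented model expresses only awkwardly):&lt;/p&gt;

```python
# Maximum number of consecutive days a stock price rises:
# one ordered pass over the sequence, trivial procedurally.
def max_rising_days(prices):
    best = run = 0
    for prev, cur in zip(prices, prices[1:]):
        run = run + 1 if cur > prev else 0  # extend or reset the streak
        best = max(best, run)
    return best

print(max_rising_days([10, 11, 12, 11, 12, 13, 14, 13]))  # prints 3
```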

&lt;p&gt;In contrast, SPL provides richer data types and a complete computing library, making it easy to handle scenarios that are difficult or even impossible to implement in SQL.&lt;/p&gt;

&lt;p&gt;For example, consider a funnel analysis that calculates the user churn rate for an e-commerce company. Coding it in SQL is very complicated, the code varies greatly across database types, and this lack of portability raises the modification cost. The Oracle code implementing a three-step funnel analysis is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;with e1 as (
 select uid,1 as step1,min(etime) as t1
 from event
 where etime&amp;gt;= to_date('2021-01-10','yyyy-mm-dd')
 and etime&amp;lt;to_date('2021-01-25','yyyy-mm-dd')
 and eventtype='eventtype1' and …
 group by uid),
e2 as (
 select e2.uid,1 as step2,min(e1.t1) as t1,min(e2.etime) as t2
 from event e2
 inner join e1 on e2.uid = e1.uid
 where e2.etime&amp;gt;= to_date('2021-01-10','yyyy-mm-dd')
 and e2.etime&amp;lt;to_date('2021-01-25','yyyy-mm-dd')
 and e2.etime &amp;gt; e1.t1 and e2.etime &amp;lt; e1.t1 + 7
 and e2.eventtype='eventtype2' and …
 group by e2.uid),
e3 as (
 select e3.uid,1 as step3,min(e2.t1) as t1,min(e3.etime) as t3
 from event e3
 inner join e2 on e3.uid = e2.uid
 where e3.etime&amp;gt;= to_date('2021-01-10','yyyy-mm-dd')
 and e3.etime&amp;lt;to_date('2021-01-25','yyyy-mm-dd')
 and e3.etime &amp;gt; e2.t2 and e3.etime &amp;lt; e2.t1 + 7
 and e3.eventtype='eventtype3' and …
 group by e3.uid)
select
 sum(step1) as step1,
 sum(step2) as step2,
 sum(step3) as step3
from
 e1
 left join e2 on e1.uid = e2.uid
 left join e3 on e2.uid = e3.uid
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Coding in SPL:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5bv5fi49341hj33ebmxd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5bv5fi49341hj33ebmxd.png" alt=" " width="800" height="339"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Obviously, this code is far more concise. The SPL approach is also more general: a funnel with more steps only needs a parameter change, rather than more subqueries as in SQL, and it performs better.&lt;/p&gt;
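&lt;p&gt;To make the funnel logic itself concrete, here is a procedural sketch in Python. The event names, the 7-day window, and the greedy earliest-match rule are assumptions for illustration (this is not the article’s SPL code): group events by user, then walk each user’s ordered events once, advancing a step counter.&lt;/p&gt;

```python
from collections import defaultdict

WINDOW = 7                        # days allowed between step 1 and step 3
STEPS = ["view", "cart", "pay"]   # hypothetical event types, in funnel order

def funnel(events):
    """events: iterable of (uid, etype, day) in any order."""
    by_user = defaultdict(list)
    for uid, etype, day in events:
        by_user[uid].append((day, etype))
    counts = [0, 0, 0]
    for rows in by_user.values():
        rows.sort()               # walk each user's events in time order
        depth, t1, last = 0, None, None
        for day, etype in rows:
            if depth == 0:
                if etype == STEPS[0]:
                    depth, t1, last = 1, day, day   # funnel entry
            elif depth in (1, 2) and etype == STEPS[depth]:
                # next step must come after the previous one, inside the window
                if day > last and t1 + WINDOW > day:
                    depth, last = depth + 1, day
        for i in range(depth):    # user reached steps 1..depth
            counts[i] += 1
    return counts

events = [
    ("u1", "view", 1), ("u1", "cart", 3), ("u1", "pay", 5),
    ("u2", "view", 1), ("u2", "cart", 9),   # cart falls outside the window
    ("u3", "cart", 1), ("u3", "view", 2), ("u3", "cart", 4),
]
print(funnel(events))  # prints [3, 2, 1]
```

A deeper funnel only means a longer STEPS list, which is exactly the kind of parameterization the SPL version exploits instead of adding subqueries.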

&lt;p&gt;Besides a conventional structured-data computing library, SPL also provides the ordered operations that SQL is weak at, grouping that retains the grouped subsets, and a variety of association methods. Its syntax offers many unique features, such as option syntax, cascaded parameters, and advanced Lambda syntax, which make complex calculations easier to implement.&lt;/p&gt;

&lt;p&gt;Concise syntax and complete language ability make development efficient and eliminate the need to resort to other technologies, thereby keeping the technology stack simple. With everything done in one system, operation and maintenance are naturally simple and convenient. Therefore, for migrated AP businesses, especially those involving more complex calculations, SPL is very likely to be more efficient and cheaper to develop with than SQL.&lt;/p&gt;

&lt;p&gt;In conclusion, when the TP database is overloaded, its pressure needs to be reduced, but it is not necessary to migrate all AP businesses in one go: the goal is achieved as long as the pressure drops effectively. Migrating AP business gradually is a good approach, yet moving it to another database always brings problems of one kind or another, such as high migration cost, difficult management, and poor real-time ability. SPL’s openness and high performance solve these problems effectively, and SPL also provides complete, lightweight, high-performance computing capabilities within a consistent technology stack. From this point of view, replacing the AP database with SPL to reduce the burden on the TP database is a wise choice.&lt;/p&gt;

&lt;p&gt;SPL is now open-source. You can obtain the source code from &lt;a href="https://github.com/SPLWare/esProc" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Try it free~~ &lt;a href="https://www.esproc.com/download-esproc/" rel="noopener noreferrer"&gt;Download~&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>The Data Warehouse Doesn’t Need a ‘House’ — And That’s Why It’s Faster</title>
      <dc:creator>Judy</dc:creator>
      <pubDate>Mon, 22 Dec 2025 08:01:37 +0000</pubDate>
      <link>https://dev.to/esproc_spl/the-data-warehouse-doesnt-need-a-house-and-thats-why-its-faster-5f3f</link>
      <guid>https://dev.to/esproc_spl/the-data-warehouse-doesnt-need-a-house-and-thats-why-its-faster-5f3f</guid>
      <description>&lt;p&gt;We know that the early databases do not distinguish between TP and AP, and all tasks are handled in one database. When dealing with TP business, it is important to ensure the consistency of data, and consistency makes sense only when the data are limited within a certain range, which gives rise to the concept of “base”. The data to be loaded into database should meet some constraints, otherwise it cannot be loaded. There is a clear distinction between the data inside and outside the database, and this characteristic is called the closedness.&lt;/p&gt;

&lt;p&gt;In addition to guaranteeing the consistency of data, the closedness can guarantee the security of data by working with the database management system (DBMS).&lt;/p&gt;

&lt;p&gt;The data warehouse developed out of the database: when one database could no longer serve both OLTP and OLAP business, the AP business was split into a separate database, and the data warehouse was born. It therefore inherits many characteristics of the database, including closedness. Inheriting closedness means inheriting rules such as “data can be used only after being loaded into the database” and “data to be loaded must meet certain criteria”. This is how the “house” of the data warehouse came to be.&lt;/p&gt;

&lt;p&gt;Then, is this closed storage necessary?&lt;/p&gt;

&lt;p&gt;It is necessary for TP business, but not for a data warehouse focused on AP business. Although the name contains the word “house”, the main function of a data warehouse is actually to compute. Even though it can store data like a “house”, that storage serves calculation; data becomes valuable only when it is used, that is, calculated. The competition among the new data warehouses on the market today centers almost entirely on computing ability: performance above all, along with the completeness of the computing ability and the richness of functions. We can therefore say that the key point of a data warehouse is computation, not storage.&lt;/p&gt;

&lt;p&gt;In this case, is it feasible to only provide a rich and powerful computing engine, and not bind the storage function? In other words, does it work if there is no “house”?&lt;/p&gt;

&lt;p&gt;Unfortunately, this is not feasible for most of today’s SQL-based (relational-algebra) data warehouses, because the binding of storage and computation is inherited from the database the data warehouse originates from, and it cannot be changed.&lt;/p&gt;

&lt;p&gt;However, it is feasible for the new “no house” data warehouse - esProc SPL!&lt;/p&gt;

&lt;p&gt;As an open computing engine, esProc specializes in processing AP-business data. With its open computing ability, esProc can connect to diverse data sources and perform mixed calculations across them. It also provides its own high-performance file storage to guarantee computing performance. Instead of adopting SQL as its formal language, esProc uses the self-created SPL (Structured Process Language), which has advantages over SQL.&lt;/p&gt;

&lt;p&gt;The term “no house” in this article means that there is no closed, private storage function like that of a traditional data warehouse.&lt;/p&gt;

&lt;p&gt;Where are the data stored then?&lt;/p&gt;

&lt;p&gt;Let’s answer this question and related ones in detail below, and see what benefits “no house” brings (that is, what problems of the “house” it overcomes).&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-time computation of diverse data sources
&lt;/h2&gt;

&lt;p&gt;In fact, once data is generated, it is already stored in some medium: a database, a file, the web. In a broad sense, the data is stored the moment it exists. That being so, wouldn’t it be convenient to process the data directly where it lies? Enterprises today face a wide variety of data sources and data types, and being able to process them all directly is very convenient.&lt;/p&gt;

&lt;p&gt;esProc provides the ability to process these diverse data sources directly. No matter where the data is stored (RDB, NoSQL, file, Hadoop, RESTful, etc.), esProc can read and calculate on it directly. More importantly, esProc can connect to different data sources at once and perform mixed computations across them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhc078icccbfz97jdk2df.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhc078icccbfz97jdk2df.png" alt=" " width="800" height="348"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once this support for diverse data sources (mixed calculation) is available, the limitation of the “house” is broken: the development and time costs of loading data into a database are saved. Real-time calculation over multiple sources fully preserves the real-timeness of the data, enabling real-time queries even after data has been separated into different databases. And since data is no longer loaded into the database indiscriminately, the database’s storage cost and pressure drop greatly, which matters in the initial application stage of esProc, when the data warehouse and esProc coexist.&lt;/p&gt;

&lt;p&gt;esProc also fully retains the advantages of each data source. An RDB is stronger in computing, so in many scenarios we can let the RDB do part of the calculation first and let esProc do the rest; NoSQL sources and files have high IO efficiency, so esProc can read and compute on their data directly; MongoDB stores multi-layer data, which SPL can use as-is. All of these are benefits of openness.&lt;/p&gt;
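&lt;p&gt;The division of labor described above can be sketched in ordinary Python (the customer/order schema, table names, and the SQLite stand-in for the RDB are all invented for illustration): the RDB serves its own table, while the order records stay in a text file and are computed on directly, joined in the program without any loading step.&lt;/p&gt;

```python
import csv, io, sqlite3

# Hypothetical RDB side: a customer dimension table (SQLite stands in here).
db = sqlite3.connect(":memory:")
db.execute("create table customer (cid text, region text)")
db.executemany("insert into customer values (?, ?)",
               [("c1", "east"), ("c2", "west")])

# Hypothetical file side: raw order records, never loaded into the database.
orders_csv = io.StringIO("cid,amount\nc1,30\nc2,50\nc1,20\n")

# Let the database serve what it holds ...
region = dict(db.execute("select cid, region from customer"))

# ... and compute on the file directly, joining the two sources in memory.
sales = {}
for row in csv.DictReader(orders_csv):
    r = region.get(row["cid"])
    if r is not None:
        sales[r] = sales.get(r, 0) + int(row["amount"])

print(sales)  # per-region sales combining the RDB and the file
```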

&lt;p&gt;In contrast, a closed data warehouse cannot compute on data outside the database and must import it before computing, which adds an ETL step. This step not only increases the programmers’ workload and the database’s burden, but also loses the real-timeness of the data. Data outside the database usually has irregular formats that are hard to load into a strongly constrained database; often the raw data must be loaded first so that the database’s computing ability can be used to organize it, turning ETL into ELT and increasing the database’s burden further.&lt;/p&gt;

&lt;p&gt;Computation can be done regardless of where the data is stored; this is one of the benefits that the “no house” esProc brings.&lt;/p&gt;

&lt;h2&gt;
  
  
  High performance
&lt;/h2&gt;

&lt;p&gt;However, when esProc reads diverse data sources, although they all have the same logical status, the read performance (reflected in total computation time) varies, because the access interfaces provided by different sources differ in efficiency. Some interfaces, such as an RDB’s JDBC, read very slowly.&lt;/p&gt;

&lt;p&gt;While it is convenient to access various data sources directly, it may result in poor computing performance.&lt;/p&gt;

&lt;p&gt;To fully ensure computing performance, esProc offers a specialized binary file-storage format together with mechanisms such as compression, columnar storage, ordering, and parallel segmentation.&lt;/p&gt;
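&lt;p&gt;A toy Python sketch can illustrate why the columnar layout in particular helps analytics (the three-column schema is made up, and this shows only the access-pattern idea; esProc’s actual format additionally applies compression, ordering, and parallel segmentation):&lt;/p&gt;

```python
# Row layout vs. columnar layout for the same records.
rows = [("c1", "east", 30), ("c2", "west", 50), ("c1", "east", 20)]

# Row layout: summing one attribute still walks every field of every record.
row_total = sum(r[2] for r in rows)

# Columnar layout: each attribute is a contiguous array, so an aggregation
# reads just the one array it needs (and such arrays compress far better).
cid, region, amount = (list(col) for col in zip(*rows))
col_total = sum(amount)

print(row_total, col_total)  # identical results, different access patterns
```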

&lt;p&gt;It is worth noting that esProc’s file storage is not closed inside esProc (totally different from a data warehouse’s closed storage): the data is stored as ordinary files in the file system, with the same status as other files such as text or Excel. esProc does not own these files; it simply provides many optimization strategies to make access to them more efficient.&lt;/p&gt;

&lt;p&gt;In contrast, the performance of a data warehouse with a “house” is often not high. The computing efficiency of a data warehouse depends on the degree of optimization of its engine: a good database chooses a more efficient execution path according to the goal of a SQL statement rather than its literally expressed logic. However, this automatic optimization works only for simple calculations. Once the SQL becomes slightly more complex, the engine gives up and executes the statement according to its literal logic, and performance declines sharply. In such cases, higher performance could be achieved if the storage could be adjusted to suit the algorithm (for example, sorting the data by primary key). Unfortunately, the data warehouse is closed and its storage is private: we cannot intervene in the storage, so we cannot reach that performance.&lt;/p&gt;

&lt;p&gt;In comparison, esProc’s file storage is very flexible: we can design the storage around any algorithm, make the most of the advantages of file storage itself, and arrange the data to suit the algorithm, so high performance is not surprising.&lt;/p&gt;

&lt;h2&gt;
  
  
  Security and reliability
&lt;/h2&gt;

&lt;p&gt;Having open computing ability and storing data in files raises a question: the closedness of a traditional data warehouse ensures the security and reliability of the data inside it, so how can esProc, which no longer binds storage, do the same?&lt;/p&gt;

&lt;p&gt;In fact, there is no need to worry about this issue. esProc does not manage the data in principle, nor is it responsible for data security. To some extent, it can be said that esProc does not have and does not need a security mechanism.&lt;/p&gt;

&lt;p&gt;The security of persistent data is, in principle, the responsibility of the data source itself. A database, for example, provides security mechanisms such as user identification and authentication, authorization and verification, and auditing. For data files in esProc’s format, many file systems and VMs provide mature security mechanisms that can be used directly, such as access control, identity verification, and transmission encryption. The reliability of the data can be guaranteed by the data source or by professional storage technology itself.&lt;/p&gt;

&lt;p&gt;In addition, esProc supports retrieving data from object storage services such as S3 before computing, and can likewise use their security mechanisms. Cloud storage technologies like S3 have the advantage in security and reliability; few databases today can offer reliability guarantees surpassing these professional technologies, so it is reasonable to rely on their security mechanisms directly.&lt;/p&gt;

&lt;p&gt;In terms of application access, esProc runs as an independent service process that communicates over standard TCP/IP and HTTP, so it can be monitored and managed by professional network-security products, which take charge of the specific security measures. esProc specializes in data computation; its philosophy for non-computing tasks is to work with other specialized products.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frjqoi8yp001k71dzgplr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frjqoi8yp001k71dzgplr.png" alt=" " width="689" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In fact, security and reliability are often worse for a database with a “house”. Database permission management and control is often not fine-grained enough, so all users of an application end up as high-privilege users. For the convenience of “computing”, permissions that intervene in “storage” are handed out, such as the dangerous permission to compile stored procedures, so security is not well protected. In contrast, esProc focuses only on “computing”, not “storage”: its “computing” works on top of the storage’s security mechanism and neither affects nor damages the storage. As for reliability, it is directly proportional to investment, and even the extremely expensive “two sites, three centers” construction is still far inferior in reliability to today’s professional cloud storage. That being so, leave professional matters to the professionals.&lt;/p&gt;

&lt;p&gt;Therefore, we can say that “no house” can bring more security and more reliability than “with house”.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implement HTAP requirement
&lt;/h2&gt;

&lt;p&gt;In recent years, HTAP has become another hot spot in the database field. However, most databases implement HTAP only by attaching certain AP capabilities to a TP database or by binding the two technologies together in other ways. Whichever method is adopted, database migration is unavoidable; quite apart from the high risk, the closedness and performance problems of the original data warehouse remain unsolved.&lt;/p&gt;

&lt;p&gt;In fact, HTAP requirement is essentially to query the data in real-time after the separation of databases. If this ability is available, then this requirement can be implemented without modifying the original TP database (no migration risk).&lt;/p&gt;

&lt;p&gt;We can introduce esProc on top of the existing independent TP and AP systems, and use esProc’s open cross-source computing, high-performance storage and computation, and agile development to implement this requirement.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fazzluuvtr2lvvbl7v669.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fazzluuvtr2lvvbl7v669.png" alt=" " width="398" height="212"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;esProc implements HTAP by cooperating with the existing system. Only a few modifications to the existing system are needed, and there is almost no need to modify the TP database; even the original AP data source can remain in use while esProc gradually takes over the AP business. Once esProc has partially or completely taken over, the historical cold data is stored in esProc’s high-performance files, and the original ETL process that moved data from the business database to the data warehouse can be migrated to esProc directly.&lt;/p&gt;

&lt;p&gt;Cold data is large in volume and no longer changes, so storing it as esProc’s high-performance files yields higher computing performance. Hot data is small in volume, so keeping it in the original TP data source lets esProc read and calculate on it directly; since the amount is small, querying the TP source directly has little impact and does not take long. esProc’s mixed computing over cold and hot data then delivers real-time queries over the full data. All we need to do is periodically convert cold data into esProc’s high-performance files while the small amount of recently generated hot data stays in the original source. In this way, not just HTAP but a high-performance HTAP is implemented, with little impact on the application framework.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implement true Lakehouse
&lt;/h2&gt;

&lt;p&gt;A closed data warehouse cannot build a true Lakehouse. A data lake is rather like a junkyard for data: it should store the original raw data regardless of type, since it is impossible to predict which data will prove useful. The value of that data can only be realized through calculation, which requires the computing ability of a data warehouse. But the data warehouse is closed, and data must be deeply organized to meet its criteria before being loaded. The large amount of raw “junk data” in the lake cannot be calculated on directly, while organizing it both loses original information and runs into the diverse-data-source problems mentioned above. As a result, the real-timeness of the data cannot be guaranteed, and the ETL itself is costly, so timeliness is poor.&lt;/p&gt;

&lt;p&gt;Compared with the fake Lakehouse implemented on a traditional data warehouse, esProc can implement a true Lakehouse: it is open enough to calculate directly on the unorganized data in the lake, it can perform mixed computations across many kinds of data sources, and its high-performance mechanisms guarantee computing efficiency.&lt;/p&gt;

&lt;p&gt;esProc can calculate directly on the raw data in the data lake, with no constraints and no need to load data into a database. It can also perform mixed calculations across diverse sources: whether the lake is built on a unified file system or on diverse data sources (RDB, NoSQL, local files, web services), esProc can compute across them directly and quickly surface the lake’s value. Furthermore, esProc’s high-performance file storage (the “storage function” of a data warehouse) can be used to organize the data in an orderly way while computing; converting raw data into esProc storage yields higher performance. The converted data is still stored in the file system and can, in principle, live in the same place as the data lake. In this way, a true Lakehouse is implemented.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjv21l6t1rcstn418y5fd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjv21l6t1rcstn418y5fd.png" alt=" " width="800" height="470"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With the support of esProc’s computing ability, data can be organized and computed at the same time, and the data lake can be built up step by step in an orderly manner. The data warehouse is refined in the process of building the lake, giving the data lake strong computing ability as well, thereby implementing a true Lakehouse.&lt;/p&gt;

&lt;p&gt;From closed to open: this is how technology keeps progressing. The same goes for the data warehouse. Developing from “with house” to “no house” is an inevitable stage, and the data warehouse is about to enter the era of “no house”. esProc may not be perfect, but it has taken a big step toward the capabilities of the “no house” data warehouse, and it is definitely worth a try.&lt;/p&gt;

&lt;p&gt;SPL is now open-source. You can obtain the source code from &lt;a href="https://github.com/SPLWare/esProc" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Try it free~~ &lt;a href="https://www.esproc.com/download-esproc/" rel="noopener noreferrer"&gt;Download~&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Why Batch Jobs Suck — And How SPL Cuts Runtime from Hours to Seconds</title>
      <dc:creator>Judy</dc:creator>
      <pubDate>Fri, 19 Dec 2025 07:26:31 +0000</pubDate>
      <link>https://dev.to/esproc_spl/why-batch-jobs-suck-and-how-spl-cuts-runtime-from-hours-to-seconds-m82</link>
      <guid>https://dev.to/esproc_spl/why-batch-jobs-suck-and-how-spl-cuts-runtime-from-hours-to-seconds-m82</guid>
      <description>&lt;p&gt;The detail data produced in the business system usually needs to be processed and calculated to our desired result according to a certain logic so as to support the business activities of enterprise. In general, such data processing will involves many tasks, and it needs to calculate in batches. In the bank and insurance industries, this process is often referred to as “batch job”, and batch jobs are often needed in other industries like oil and power.&lt;/p&gt;

&lt;p&gt;Most business statistics take a certain day as the cutoff, and to avoid affecting the normal business of the production system, batch jobs are generally executed at night: only then can the new detail data produced in the production system that day be exported and transferred to a specialized database or data warehouse where the batch job runs. The next morning, the result of the batch job is available to business staff.&lt;/p&gt;

&lt;p&gt;Unlike online queries, a batch job is an offline task carried out automatically on a schedule. Multiple users never access one task at the same time, so there is no concurrency problem and no need to return results in real time. However, the batch job must finish within a specified window. For example, a bank’s batch window may run from 8:00 pm to 7:00 am the next day; if the job is not done by 7:00 am, the serious consequence is that business staff cannot work normally.&lt;/p&gt;

&lt;p&gt;The data volume involved in a batch job is very large, and it is likely to use all historical data. Moreover, since the computing logic is complex and involves many computing steps, the time of batch jobs is often measured in hours. It is very common to take two or three hours for one batch job, and it is not surprising to take ten hours. As business grows, the data volume increases. The rapid increase of computational load on database that handles batch job will lead to a situation where the batch job cannot be accomplished after the whole night, and this will seriously affect the business, which is unacceptable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem analysis
&lt;/h2&gt;

&lt;p&gt;To shorten the prolonged batch job time, we must carefully analyze the problems in the existing system architecture.&lt;/p&gt;

&lt;p&gt;A typical batch job architecture looks roughly like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyn3x370key2zy1emiiis.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyn3x370key2zy1emiiis.jpg" alt=" " width="554" height="307"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As the figure shows, the data has to be exported from the production database and imported into the database that handles the batch jobs. The latter is usually an RDB, and the batch job logic is implemented as stored procedure code. The batch job results are generally not used directly; they are exported from the RDB as intermediate files or imported into the databases of other systems. This is a typical architecture; the production database in the figure may also be a central data warehouse, Hadoop, and so on. The two databases in the figure are generally not the same database, and data is usually transferred between them as files, which helps reduce coupling. After the batch jobs finish, the results are used by multiple applications, again transferred as files.&lt;/p&gt;

&lt;p&gt;The first reason batch jobs are slow is that the RDB imports and exports data too slowly. Because an RDB's storage and computation are closed, it performs many constraint checks and security operations on every import and export. When the data volume is large, read/write efficiency becomes very low and takes a very long time. For the database handling batch jobs, both importing the file data and exporting the calculation results as files are therefore very slow.&lt;/p&gt;

&lt;p&gt;The second reason is the poor performance of stored procedures. The syntax system of SQL is old and has many limitations that prevent the implementation of many efficient algorithms, so the computing performance of the SQL statements in a stored procedure is unsatisfactory. Moreover, when the business logic is complex, it is hard to express in a single SQL statement; it has to be split into multiple steps, implemented with a dozen or even dozens of SQL statements. The intermediate result of each statement must be stored in a temporary table for the subsequent steps. When a temporary table holds a large amount of data, it must be written to disk, causing a large amount of data writing. Since write performance is much worse than read performance, this seriously slows down the whole stored procedure.&lt;/p&gt;

&lt;p&gt;Calculations that are even more complex are hard to express directly in SQL at all; instead, a database cursor must traverse and fetch the data in a loop. However, cursor traversal performs far worse than SQL statements, generally does not support multi-thread parallelism directly, and cannot exploit multiple CPU cores, making the computing performance worse still.&lt;/p&gt;

&lt;p&gt;Then, could we replace the traditional RDB with a distributed database (adding nodes) to speed up batch jobs?&lt;/p&gt;

&lt;p&gt;Unfortunately not. The main reason is that batch job logic is quite complex; it often needs thousands or even tens of thousands of lines of code even with the stored procedures of a traditional database, while the stored procedure capability of distributed databases is still relatively weak, making such complex batch operations difficult to implement.&lt;/p&gt;

&lt;p&gt;A distributed database also faces the intermediate-result problem when a complex computing task has to be split into multiple steps. Because the data may reside on different nodes, both writing the intermediate result in one step and re-reading it in a later step generate heavy cross-network traffic, making performance unpredictable.&lt;/p&gt;

&lt;p&gt;Speeding things up via data redundancy, as distributed databases do for queries, does not work here either. Redundant copies can be prepared in advance of a query, but the intermediate results of a batch job are generated on the fly; making them redundant means generating multiple copies on the fly as well, which slows the overall job down.&lt;/p&gt;

&lt;p&gt;Therefore, real-world batch jobs are usually executed inside one large single database. When the computational intensity is too high, an all-in-one machine such as ExaData is used (ExaData hosts multiple database instances, but thanks to Oracle's special optimization it can be regarded as one super-large single database). Although this approach is slow, there is no better choice for the time being: only such large databases have enough computing capacity for batch jobs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using SPL to perform batch jobs
&lt;/h2&gt;

&lt;p&gt;SPL, an open-source professional computing engine, provides computing power that does not depend on a database: it computes directly on the file system, which removes the extremely slow data import and export of the RDB. Moreover, SPL implements more optimized algorithms, far surpassing stored procedures in performance, and can significantly raise the computing efficiency of a single machine, which makes it well suited to batch jobs.&lt;/p&gt;

&lt;p&gt;The new architecture that uses SPL to implement batch jobs is shown as below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fckc7d0260jlrtg4whalk.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fckc7d0260jlrtg4whalk.jpg" alt=" " width="554" height="260"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this new architecture, SPL solves two bottlenecks that cause slow batch jobs.&lt;/p&gt;

&lt;p&gt;Let's start with the first bottleneck, data import and export. SPL can compute directly on the files exported from the production database, so there is no need to import the data into an RDB. After the batch jobs finish, SPL can store the final result directly in a general format such as a text file and pass it to other applications, avoiding the export step from the batch-job database as well. In this way the slow RDB reads and writes are eliminated entirely.&lt;/p&gt;

&lt;p&gt;Now let's look at the second bottleneck, that is, the computing process. SPL provides better algorithms (many of which are pioneered in the industry), and the computing performance far outperforms that of stored procedure and SQL statement. SPL’s high-performance algorithms include:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fysbbrpj18vjbug9dzwyp.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fysbbrpj18vjbug9dzwyp.jpg" alt=" " width="800" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These high-performance algorithms apply to the calculations common in batch jobs, such as JOINs, traversal, and grouping and aggregation, and can effectively improve computing speed. For example, batch jobs often traverse an entire history table, sometimes several times, to complete the calculations of multiple business logics. The history table is usually huge, and each traversal costs a lot of time. SPL's &lt;strong&gt;multi-purpose traversal&lt;/strong&gt; mechanism solves this: it completes multiple computations during a single traversal of the large table, saving a great deal of time.&lt;/p&gt;
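
&lt;p&gt;A minimal sketch of the multi-purpose traversal idea (plain Python, not SPL syntax; the data and aggregates are invented for illustration): several unrelated aggregates are computed in one pass over the data, instead of scanning the large table once per aggregate.&lt;/p&gt;

```python
def multi_purpose_traverse(amounts):
    total = 0          # aggregate 1: total amount
    count_big = 0      # aggregate 2: number of amounts over 100
    max_amount = None  # aggregate 3: maximum amount
    for amount in amounts:       # one traversal serves all three aggregates
        total += amount
        if amount > 100:
            count_big += 1
        if max_amount is None or amount > max_amount:
            max_amount = amount
    return total, count_big, max_amount

result = multi_purpose_traverse([50, 120, 80, 300])
```

&lt;p&gt;With three separate scans the data would be read three times; here it is read once, which is exactly the saving that matters when the "table" is hundreds of gigabytes on disk.&lt;/p&gt;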

&lt;p&gt;SPL's &lt;strong&gt;multi-cursor&lt;/strong&gt; reads and computes data in parallel. Even for complex batch job logic, multi-thread parallel computing can exploit multiple CPU cores, whereas a database cursor is hard to parallelize. As a result, SPL is often several times faster than a stored procedure.&lt;/p&gt;
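
&lt;p&gt;A rough analogue of the multi-cursor pattern in plain Python (an illustration, not SPL): split the data into segments, aggregate each segment on its own worker, then merge the partial results.&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_sum(data, workers=4):
    """Split data into segments, aggregate each segment concurrently,
    then merge the partial results."""
    size = max(1, len(data) // workers)
    segments = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum, segments))  # one "cursor" per segment
    return sum(partials)                          # merge step

result = parallel_sum(list(range(1, 101)))        # 1 + 2 + ... + 100
```

&lt;p&gt;The merge step works because summation is associative; the same split-aggregate-merge shape applies to counts, max/min and many grouping operations.&lt;/p&gt;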

&lt;p&gt;SPL's &lt;strong&gt;delayed cursor&lt;/strong&gt; mechanism defines multiple computation steps on one cursor and then lets the data stream flow through these steps in sequence, achieving &lt;strong&gt;chained calculation&lt;/strong&gt;. This effectively reduces the number of times intermediate results must be stored. Where data does have to be stored, SPL writes the intermediate result in its built-in high-performance format for use in the next step. SPL's high-performance storage is file based and adopts technologies such as &lt;strong&gt;ordered and compressed storage, free columnar storage, double increment segmentation, and its own compression encoding&lt;/strong&gt;, reducing disk usage and making reads and writes much faster than a database.&lt;/p&gt;
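
&lt;p&gt;The delayed-cursor idea can be sketched with Python generators (illustration only, not SPL): each step is defined lazily on the stream, and records flow through the whole chain one at a time, so no intermediate result set is materialized between steps.&lt;/p&gt;

```python
def read_rows():                         # the "cursor": yields rows lazily
    for i in range(1, 1_000_001):
        yield i

def keep_even(rows):                     # step 1: defined, not yet executed
    return (r for r in rows if r % 2 == 0)

def double(rows):                        # step 2: also lazy
    return (r * 2 for r in rows)

pipeline = double(keep_even(read_rows()))         # nothing has run yet
first_three = [next(pipeline) for _ in range(3)]  # pulls only 3 records through
```

&lt;p&gt;Although two transformation steps are chained, only the three requested records are ever read; nothing resembling a temporary table is written.&lt;/p&gt;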

&lt;h2&gt;
  
  
  Application effect
&lt;/h2&gt;

&lt;p&gt;In this new architecture, SPL removes the RDB's two bottlenecks for batch jobs, with very good results in practice. Three cases illustrate this.&lt;/p&gt;

&lt;p&gt;Case 1: Bank L ran its batch jobs on the traditional architecture, using an RDB as the batch-job database and stored procedures for the batch job logic. The stored procedure for the loan agreement batch job took 2 hours, yet it was merely the preparation step for many other batch jobs, so its long duration seriously affected the whole batch.&lt;/p&gt;

&lt;p&gt;With SPL, thanks to high-performance algorithms and storage mechanisms such as &lt;strong&gt;high-performance columnar storage, file cursors, multi-thread parallel processing, small-result in-memory grouping, and multi-purpose cursors&lt;/strong&gt;, the computing time fell from 2 hours to 10 minutes, a &lt;strong&gt;12-fold performance improvement&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Moreover, the SPL code is more concise. The original stored procedure had more than 3,300 lines of code, while SPL needs only 500 statement cells, &lt;strong&gt;cutting the code volume by more than 6 times&lt;/strong&gt; and greatly improving development efficiency.&lt;/p&gt;

&lt;p&gt;For details, visit: &lt;a href="https://c.esproc.com/article/1644215913288" rel="noopener noreferrer"&gt;Open-source SPL speeds up batch operating of bank loan agreements by 10+ times&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Case 2: In the car insurance business of insurance company P, historical policies from previous years must be associated with new policies, a task known as the historical policy association batch job. With an RDB and a stored procedure, associating 10 days of new policies took 47 minutes and 30 days took 112 minutes; with a longer time span the computation time became unbearable, essentially making the task impossible.&lt;/p&gt;

&lt;p&gt;With SPL, using technologies such as high-performance file storage, file cursors, ordered merging &amp;amp; segmented data fetching, in-memory association and multi-purpose traversal, associating 10 days of new policies takes only 13 minutes, and 30 days takes 17 minutes, nearly 7 times faster. Moreover, with the new algorithms the computation time grows only slightly as new policies accumulate, rather than in direct proportion to the number of days as with the stored procedure.&lt;/p&gt;

&lt;p&gt;In terms of total code volume, the original stored procedure had 2,000 lines of code, still more than 1,800 after removing comments, while the SPL code is fewer than 500 cells, less than a third of the original.&lt;/p&gt;

&lt;p&gt;For details, visit: &lt;a href="https://c.esproc.com/article/1644827119694" rel="noopener noreferrer"&gt;Open-source SPL optimizes batch operating of insurance company from 2 hours to 17 minutes&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Case 3: For the detail data of loans Bank T grants over the Internet, a daily batch job must count and aggregate all historical data as of a specified date. Implemented in SQL on an RDB, the total running time was 7.8 hours, which is far too long and even delayed other batch jobs, so optimization was necessary.&lt;/p&gt;

&lt;p&gt;With SPL, using technologies such as high-performance files, file cursors, ordered grouping, ordered association, delayed cursors and binary search, the running time fell from 7.8 hours to 180 seconds single-threaded, and to 137 seconds with 2 threads, a 204-fold speedup.&lt;/p&gt;

&lt;p&gt;SPL is now open-source. You can obtain the source code from &lt;a href="https://github.com/SPLWare/esProc" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Try it for free: &lt;a href="https://www.esproc.com/download-esproc/" rel="noopener noreferrer"&gt;Download&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Beyond SQL: Solving Data Warehouse Performance Bottlenecks with Smart Algorithms, Not Just Bigger Clusters</title>
      <dc:creator>Judy</dc:creator>
      <pubDate>Wed, 17 Dec 2025 08:09:32 +0000</pubDate>
      <link>https://dev.to/esproc_spl/beyond-sql-solving-data-warehouse-performance-bottlenecks-with-smart-algorithms-not-just-bigger-32n6</link>
      <guid>https://dev.to/esproc_spl/beyond-sql-solving-data-warehouse-performance-bottlenecks-with-smart-algorithms-not-just-bigger-32n6</guid>
      <description>&lt;p&gt;As the volume of data continues to grow and the complexity of business rises gradually, we are facing a big challenge in data processing efficiency. The most typical manifestation is that the performance problem of data warehouse is becoming more and more prominent when dealing with an analytical task, and some problems occur from time to time such as high computing pressure, low performance, long query time or even unable to find out result, production accident caused by failure to accomplish a batch job on time. When a data warehouse has performance problem, it doesn't serve the business well.&lt;/p&gt;

&lt;h2&gt;
  
  
  Solutions for performance problem of traditional data warehouse
&lt;/h2&gt;

&lt;p&gt;Let's start with the most common solution: the cluster, that is, using distributed technology and relying on hardware expansion to improve performance. Splitting a large task across the nodes of a cluster so they compute simultaneously certainly achieves better performance than running it on a single node; even without distributed computing, simply spreading concurrent tasks across nodes reduces the computing pressure on each one. The cluster's approach to performance is simple and brute-force: as long as the data warehouse supports clustering and the task can be split, performance problems can be attacked just by adding hardware. The improvement may not be linear, but the approach basically works.&lt;/p&gt;

&lt;p&gt;The disadvantage of the cluster is its high cost. In the era of big data we reach for a cluster whenever performance comes up, often ignoring whether a single node's performance is fully exploited, on the assumption that adding hardware is always an option as long as clustering is supported. The cluster thus looks like a panacea to many people. But a cluster needs more hardware, so the cost is naturally high, and its operation and maintenance require more investment too. Furthermore, some complex multi-step computing tasks cannot use a cluster at all because they cannot be split: most multi-step batch jobs over large data volumes, for example, can only run on a single node (a single database's stored procedure). A cluster is a good solution, but not a panacea; even with enough money to build one, we cannot solve every performance problem.&lt;/p&gt;

&lt;p&gt;For some time-consuming query tasks we can adopt pre-calculation, that is, trade space for time by processing the data to be queried in advance. The computing complexity of each query then drops to O(1), greatly improving efficiency. This solution also addresses many performance problems; pre-aggregating the data to be calculated is particularly effective for multidimensional analysis scenarios.&lt;/p&gt;
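
&lt;p&gt;Trading space for time can be sketched in a few lines of Python (the field names and figures are invented for illustration): pre-aggregate sales by (region, month) once, then answer each query with a single O(1) dictionary lookup instead of rescanning the detail rows.&lt;/p&gt;

```python
detail = [
    ("east", "2024-01", 100), ("east", "2024-01", 250),
    ("west", "2024-01", 300), ("east", "2024-02", 150),
]

# one-off pre-calculation pass over the detail rows
cube = {}
for region, month, amount in detail:
    cube[(region, month)] = cube.get((region, month), 0) + amount

# each later query is a constant-time dictionary lookup
jan_east = cube[("east", "2024-01")]
```

&lt;p&gt;The cost is the extra storage for `cube` and the obligation to keep it in sync with the detail data, which is exactly the flexibility problem discussed next.&lt;/p&gt;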

&lt;p&gt;The disadvantage of this solution is extremely poor flexibility. In multidimensional analysis, for instance, it is theoretically possible to pre-calculate all dimension combinations (which would satisfy every query), but in practice this is unrealistic because the pre-calculated results would need a huge amount of storage. We therefore have to settle for partial pre-calculation after sorting out the business requirements, which greatly limits the query scope and reduces flexibility.&lt;/p&gt;

&lt;p&gt;In fact, even full pre-calculation cannot handle certain situations, such as unconventional aggregations (e.g., median or variance), combined aggregations (e.g., average monthly sales), conditional measures (e.g., total sales of orders with transaction amounts over 100 dollars), or time-period aggregation (aggregating over a freely chosen time period). Real-world query requirements are diverse and highly flexible; pre-aggregation can only cover a part, perhaps only a small part, of them. Meeting diverse online query requirements over a wider range and more efficiently calls for a more effective computing approach.&lt;/p&gt;

&lt;p&gt;A more effective solution is the optimization engine, which makes the data warehouse run faster on the same hardware. This has become the focus of many vendors, who apply engineering means that are well known in the industry, such as columnar storage, vectorized execution, encoding compression and memory utilization (the cluster can also be regarded as an engineering means). With these means, computing performance can improve several-fold as long as the data volume stays within a certain range, fully meeting the requirements of some scenarios. Unfortunately, engineering means cannot change the computing complexity, and the improved performance still often falls short when the data volume is very large or the complexity is particularly high.&lt;/p&gt;

&lt;p&gt;The more powerful part of an optimization engine is improving performance at the algorithm level, that is, reducing complexity. A good data warehouse optimizer can guess the real intention of a query and execute it with a more efficient algorithm, instead of following the literally expressed logic. Algorithm-level improvements usually achieve much higher performance. Most data warehouses still use SQL as their main query language, and SQL-based optimization is already done well. However, the limitations of SQL's descriptive ability force very circuitous expressions for complex query tasks; once an SQL statement grows complex, the optimizer can no longer guess the real intention and has to execute the literal logic, failing to improve performance. In short, the optimization engine works only for simple query tasks.&lt;/p&gt;

&lt;p&gt;Let's take an example. The following code calculates the top 10 values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT TOP 10 x FROM T ORDER BY x DESC
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Most data warehouses will optimize the task instead of doing a real sorting. However, if we want to query the in-group TopN:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select * from
 (select y,*,row_number() over (partition by y order by x desc) rn from T)
where rn&amp;lt;=10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As we can see, although this calculation is not much more complex than the previous one, the optimization engine gets confused: unable to guess the real intention, it has to do a big sort according to the literal meaning, resulting in low performance. Therefore, in some scenarios we process the data into a wide table in advance to simplify the query and bring the optimization engine into play. This is costly, but sometimes we have to do it to use the optimization engine effectively.&lt;/p&gt;
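
&lt;p&gt;The algorithm-level alternative that the confused optimizer fails to apply can be sketched in plain Python (an illustration, not any engine's actual implementation): treat TopN as an aggregation, keeping a small heap per group during one streaming pass, so the in-group top N never requires sorting the whole table.&lt;/p&gt;

```python
import heapq
from collections import defaultdict

def group_topn(rows, n=10):
    """rows: iterable of (group, x). Returns top n x values per group,
    descending, using one pass and O(groups * n) memory."""
    heaps = defaultdict(list)            # group -> min-heap of at most n values
    for group, x in rows:                # single streaming pass, no big sort
        h = heaps[group]
        if len(h) < n:
            heapq.heappush(h, x)
        elif x > h[0]:
            heapq.heapreplace(h, x)      # evict the smallest of the current top n
    return {g: sorted(h, reverse=True) for g, h in heaps.items()}

top2 = group_topn([("a", 5), ("a", 9), ("a", 1), ("b", 7), ("b", 3)], n=2)
```

&lt;p&gt;Each row costs at most O(log n) work, versus O(log N) per row for a full sort of N rows, which is the complexity gap the paragraph describes.&lt;/p&gt;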

&lt;p&gt;Currently almost all data warehouse vendors compete on SQL ability: more comprehensive SQL support, stronger optimization, larger clusters, and so on. These efforts please the broad base of SQL users as much as possible, but as discussed, the means above are often ineffective for complex computing scenarios, and the performance problem remains. However hard the vendors work on SQL (engineering optimization), the effect is unsatisfactory and does not fundamentally solve the performance problem. This type of problem is common in practice; here are a few examples.&lt;/p&gt;

&lt;p&gt;Complicated ordered computing: analyzing user behavior through a conversion funnel involves multiple events, such as page browsing, searching, adding to cart, placing an order and paying. To count user churn after each event, the events must be completed within a specified time window and occur in a specified order; only then is the result valid. Implementing this in SQL is very difficult, requiring as many sub-queries as there are events plus repeated self-joins. The resulting SQL is so complex that some data warehouses cannot execute it at all, and even when they can, performance is very low and hard to optimize.&lt;/p&gt;
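
&lt;p&gt;A sketch of the per-user funnel logic just described (plain Python; the event names and window handling are invented for illustration): events must occur in the given order, all within a time window starting at the first event, and the function reports how deep into the funnel the user got.&lt;/p&gt;

```python
FUNNEL = ["view", "search", "add_cart", "order", "pay"]

def funnel_depth(events, window):
    """events: list of (time, name) for one user, sorted by time.
    Returns how many consecutive funnel steps were completed, in order,
    within `window` time units of the first matched step."""
    step, start = 0, None
    for t, name in events:
        if step < len(FUNNEL) and name == FUNNEL[step]:
            if step == 0:
                start = t                # window starts at the first event
            elif t - start > window:
                break                    # next step fell outside the window
            step += 1
    return step

depth = funnel_depth([(0, "view"), (5, "search"), (9, "add_cart")], window=10)
```

&lt;p&gt;Churn after each event is then a simple count over users of the depth they reached; the point is that this is one ordered pass per user, whereas SQL must simulate the same order dependency with one sub-query and join per event.&lt;/p&gt;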

&lt;p&gt;Multi-step batch jobs over large data volumes: SQL also handles complex batch jobs poorly. In stored procedures, SQL often has to read data step by step through a cursor, which is very slow and cannot compute in parallel, leading to high resource consumption and low performance. Moreover, dozens of operation steps in a stored procedure need thousands of lines of code and repeated buffering of intermediate results, which degrades performance further. As a result, batch jobs missing their deadlines at month-end or year-end, when data volumes and task counts peak, is a recurring phenomenon.&lt;/p&gt;

&lt;p&gt;Multi-metric calculation on big data: many industries need to calculate large numbers of metrics. In bank loan business, for example, there are multiple tiers of classification dimensions and multiple guarantee types, plus many other dimensions such as customer type, lending method, currency, branch, date, customer age range and education background. Combining them freely derives an enormous number of metrics. Aggregating these metrics over a large volume of detail data involves many mixed operations: large-table joins, conditional filtering, grouping and aggregation, distinct counting, and so on. Such calculations are flexible, large in data volume and complex, and come with high concurrency, making them very hard to implement in SQL. Pre-calculation is inflexible here, while real-time calculation is too slow.&lt;/p&gt;

&lt;p&gt;Since these problems are difficult to solve in SQL, extending SQL's computing ability becomes the fourth solution after the cluster, pre-calculation and the optimization engine. Many data warehouses support user-defined functions (UDF), letting users extend the engine to their actual requirements. However, UDFs are difficult to develop and demand high technical skill. More importantly, a UDF still cannot solve the data warehouse's performance problem, because it remains constrained by the database's storage and cannot design more efficient data organization to suit the computation's characteristics. Many high-performance algorithms therefore cannot be implemented, and high performance naturally cannot be achieved.&lt;/p&gt;

&lt;p&gt;Therefore, to solve these problems, we should adopt a non-SQL-based solution, and let programmers control the execution logic outside the database, so as to better utilize low-complexity algorithms and make full use of engineering means.&lt;/p&gt;

&lt;p&gt;Out of this analysis emerged big data computing engines such as Spark. Spark provides a distributed computing framework and is still intended to meet computing needs through a large-scale cluster. Its design around all-in-memory operation is not friendly to external-storage computation, and the immutable RDD mechanism copies the RDD after every calculation step, occupying and wasting large amounts of memory and CPU and yielding very low performance, so the engineering means are not fully exploited. In addition, Spark's computing library is not rich and lacks high-performance algorithms, making the goal of “low-complexity algorithms” hard to reach. Furthermore, Scala is hard to use, which makes coding the complex computing problems above extremely difficult. The difficulty of coding and the failure to achieve high performance may be among the reasons Spark turned back to SQL.&lt;/p&gt;

&lt;p&gt;Since traditional data warehouse doesn't work, and external programming (Spark) is difficult and slow, is there any other alternative?&lt;/p&gt;

&lt;p&gt;From the discussion above it is not hard to conclude that solving the data warehouse performance problem does require a computing system independent of SQL (like Spark), but one that is easy to code in and fast to run. Specifically, it should not be as complex as Spark for describing complicated computing logic, and should even be simpler than SQL; and its performance should not rely solely on a cluster, but should provide rich high-performance algorithms and engineering capabilities to make full use of hardware resources and maximize single-node performance. In short, it should offer both the ability to quickly express low-complexity algorithms and sufficient engineering means. Ideally, it would also be easy to deploy, operate and maintain.&lt;/p&gt;

&lt;h2&gt;
  
  
  The esProc SPL solution
&lt;/h2&gt;

&lt;p&gt;esProc SPL is a computing engine specialized for processing structured and semi-structured data, with the same capabilities as current data warehouses. Unlike traditional SQL-based data warehouses, however, esProc does not continue with relational algebra but designs a brand-new computing system, on which the SPL (Structured Process Language) syntax is built. Compared with SQL, SPL has many advantages: it provides more data types and operations, richer computing libraries, and stronger descriptive ability; with support for procedural computing, it lets us write algorithms along our natural line of thinking instead of coding in a roundabout way, keeping SPL code short; and it is fully capable of the multi-step complex calculations mentioned earlier, more simply than SQL or other hard-coded approaches.&lt;/p&gt;

&lt;p&gt;As for performance, esProc SPL provides many high-performance algorithms of lower complexity to guarantee computing performance. Software cannot change the performance of hardware; the only way to compute faster on the same hardware is to design lower-complexity algorithms so that the computer executes fewer basic operations, which naturally makes computation faster. But designing low-complexity algorithms is not enough; the ability to implement them is also required, and the simpler the coding, the better the performance. Easy to code and fast to run are, in fact, the same thing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj67q4jyebi9r2kabrlfu.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj67q4jyebi9r2kabrlfu.jpg" alt=" " width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This figure shows part of SPL-provided high-performance algorithms, many of which are the original inventions of SPL.&lt;/p&gt;

&lt;p&gt;Of course, high-performance algorithms cannot do without good data organization (the data storage schema). For example, the ordered merge and one-sided partitioning algorithms can be implemented only when data is stored in order. Yet the storage of a database is relatively closed and cannot be interfered with from outside, so it cannot be designed around the characteristics of the calculation. For this reason, SPL provides its own binary file storage: data is stored in files outside the database, which makes full use of storage schemes such as columnar storage, ordering, compression, and parallel segmentation. Data can then be organized flexibly according to the computing characteristics, allowing the high-performance algorithms to take full effect.&lt;/p&gt;
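&lt;p&gt;To make the role of ordered storage concrete, here is a minimal sketch in Python (illustrative only, not esProc internals) of an ordered merge join: because both inputs are already sorted by the join key, a single O(n+m) pass suffices, with no hashing and no sorting at query time.&lt;/p&gt;

```python
# Sketch: an ordered merge join. Both inputs must already be sorted by
# the join key; the join then completes in one forward pass.
def merge_join(left, right, key=lambda r: r[0]):
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        kl, kr = key(left[i]), key(right[j])
        if kl < kr:
            i += 1
        elif kl > kr:
            j += 1
        else:
            # emit all right rows sharing this key (handles duplicates)
            j0 = j
            while j < len(right) and key(right[j]) == kl:
                result.append((left[i], right[j]))
                j += 1
            i += 1
            if i < len(left) and key(left[i]) == kl:
                j = j0  # rewind for the next left row with the same key
    return result

orders = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]  # sorted by id
users = [(1, "Ann"), (2, "Bob"), (3, "Eve")]       # sorted by id
pairs = merge_join(orders, users)
```

&lt;p&gt;If either input were unsorted, the merge would have to be preceded by a sort at query time, which is exactly the cost that ordered storage avoids.&lt;/p&gt;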

&lt;p&gt;In addition to high-performance algorithms, esProc provides many engineering means to improve computing performance, such as columnar storage, coding-based compression, large memory, and vector-based computing. As mentioned above, although these engineering means cannot change the computational complexity, they can often improve performance several times over. Combined with the many low-complexity algorithms built into SPL, a performance improvement of one or two orders of magnitude becomes the norm.&lt;/p&gt;
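&lt;p&gt;As a toy illustration of why one of these engineering means helps (pure Python, not how esProc stores data): in a columnar layout, an aggregation touches only the column it needs, instead of every field of every row. As the text notes, this is a constant-factor engineering gain, not a change in complexity.&lt;/p&gt;

```python
# Row store: each record is a dict carrying every field.
row_store = [{"id": i, "name": f"n{i}", "amount": i % 100} for i in range(1000)]

# Column store: one list per field.
col_store = {
    "id": [r["id"] for r in row_store],
    "name": [r["name"] for r in row_store],
    "amount": [r["amount"] for r in row_store],
}

total_rows = sum(r["amount"] for r in row_store)  # must read whole rows
total_cols = sum(col_store["amount"])             # reads only one column
```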

&lt;p&gt;As mentioned above, to achieve high performance on a non-SQL system, we have to control the execution logic, adopt low-complexity algorithms, and make full use of various engineering means. Unlike SQL, the theoretical system of SPL gives us strong descriptive ability and a simpler way of coding that avoids roundabout constructs, and it lets us use its rich high-performance algorithm library and the corresponding storage mechanisms directly. We can therefore exploit the engineering optimizations while adopting low-complexity algorithms, achieving code that is both simple to write and fast to run.&lt;/p&gt;

&lt;p&gt;For example, SPL regards TopN as an ordinary aggregation operation. Whether we compute the TopN of the full set or of each grouped subset, the processing method is the same, and no big sort is needed. In this way, the goal of using a lower-complexity algorithm is achieved, and high performance follows.&lt;/p&gt;
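&lt;p&gt;The idea of TopN as aggregation can be sketched in Python (illustrative only, not SPL's implementation): a fixed-size heap keeps at most N candidates while streaming the data once, so the full O(n log n) sort is never performed, and the same aggregation applies per group just as easily.&lt;/p&gt;

```python
import heapq
from collections import defaultdict

def top_n(values, n):
    heap = []  # min-heap holding at most n items
    for v in values:
        if len(heap) < n:
            heapq.heappush(heap, v)
        elif v > heap[0]:
            heapq.heapreplace(heap, v)  # evict the current smallest
    return sorted(heap, reverse=True)

def top_n_by_group(rows, n):
    """rows: (group, value) pairs; keeps a small heap per group."""
    heaps = defaultdict(list)
    for g, v in rows:
        h = heaps[g]
        if len(h) < n:
            heapq.heappush(h, v)
        elif v > h[0]:
            heapq.heapreplace(h, v)
    return {g: sorted(h, reverse=True) for g, h in heaps.items()}
```

&lt;p&gt;For example, top_n([5, 1, 9, 3, 7], 3) returns [9, 7, 5] after a single pass, with memory bounded by N rather than by the data size.&lt;/p&gt;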

&lt;p&gt;Again, let's take the conversion funnel above as an example to see the difference between SQL and SPL.&lt;/p&gt;

&lt;p&gt;SQL code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;with e1 as (
 select uid,1 as step1,min(etime) as t1
 from event
 where etime&amp;gt;= to_date('2021-01-10') and etime&amp;lt;to_date('2021-01-25')
 and eventtype='eventtype1' and …
 group by 1),
e2 as (
 select e2.uid,1 as step2,min(e1.t1) as t1,min(e2.etime) as t2
 from event as e2
 inner join e1 on e2.uid = e1.uid
 where e2.etime&amp;gt;= to_date('2021-01-10') and e2.etime&amp;lt;to_date('2021-01-25')
 and e2.etime &amp;gt; e1.t1 and e2.etime &amp;lt; e1.t1 + 7
 and e2.eventtype='eventtype2' and …
 group by 1),
e3 as (
 select e3.uid,1 as step3,min(e2.t1) as t1,min(e3.etime) as t3
 from event as e3
 inner join e2 on e3.uid = e2.uid
 where e3.etime&amp;gt;= to_date('2021-01-10') and e3.etime&amp;lt;to_date('2021-01-25')
 and e3.etime &amp;gt; e2.t2 and e3.etime &amp;lt; e2.t1 + 7
 and e3.eventtype='eventtype3' and …
 group by 1)
select
 sum(step1) as step1,
 sum(step2) as step2,
 sum(step3) as step3
from
 e1
 left join e2 on e1.uid = e2.uid
 left join e3 on e2.uid = e3.uid
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;SPL code:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fql3dbxyam3xj23fd1pik.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fql3dbxyam3xj23fd1pik.png" alt=" " width="800" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As we can see, the SPL code is shorter thanks to its support for ordered computing, and because SPL allows us to code step by step (procedural computing) according to natural thinking. Moreover, this code can handle a funnel analysis involving any number of steps (3 steps in this example); when the number of steps increases, we only need to change a parameter. SPL is therefore clearly more advantageous than SQL, which needs an extra subquery for each additional step. This benefit comes from SPL being simple to code.&lt;/p&gt;
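&lt;p&gt;The procedural, parameterized shape of the funnel can be sketched in Python (a simplified stand-in for the SPL code above, not a translation of it): events are grouped per user and scanned once in time order, advancing a step counter; the step list is just a parameter, and for simplicity only the user's first step-1 event opens the 7-day window.&lt;/p&gt;

```python
from collections import defaultdict

def funnel_counts(events, steps, window=7):
    """events: (uid, event_type, day) tuples; returns, for each funnel
    step, how many users reached it."""
    by_user = defaultdict(list)
    for uid, etype, day in events:
        by_user[uid].append((day, etype))
    counts = [0] * len(steps)
    for rows in by_user.values():
        rows.sort()  # scan each user's events in time order
        reached, t1, prev = 0, None, None
        for day, etype in rows:
            if reached == len(steps):
                break
            if etype != steps[reached]:
                continue
            if reached == 0:
                t1 = prev = day  # step 1 opens the window
                reached = 1
            elif day > prev and day < t1 + window:
                prev = day       # later steps must stay inside the window
                reached += 1
        for s in range(reached):
            counts[s] += 1
    return counts

events = [
    (1, "view", 1), (1, "cart", 2), (1, "pay", 3),  # completes all 3 steps
    (2, "view", 1), (2, "cart", 9),                 # cart falls outside the window
    (3, "view", 1),                                 # stops at step 1
]
result = funnel_counts(events, ["view", "cart", "pay"])
```

&lt;p&gt;Adding a fourth step means appending one name to the steps list, whereas the SQL version above needs a whole new subquery; this is the parameterization the text describes.&lt;/p&gt;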

&lt;p&gt;In terms of performance, SPL also has an obvious advantage. This example is in fact a simplified real case; in the real case, the SQL code ran to almost 200 lines. The user did not get a result after 3 minutes of running on Snowflake's Medium server (equivalent to 4*8=32 cores), whereas the SPL code returned the result in under 10 seconds on a low-end server with 12 cores at 1.7GHz. This benefits from SPL’s high-performance algorithms and the corresponding engineering means.&lt;/p&gt;

&lt;p&gt;With these mechanisms, esProc SPL can make full use of hardware resources and push the performance of a single node to its maximum. As a result, esProc SPL not only solves many performance problems that originally occurred on a single node, but can also handle, on a single node, many calculations that previously required a cluster (possibly even faster), achieving the computing effect of a cluster with only one machine. Of course, a single node has limited computing capacity, so SPL also provides distributed technology, allowing us to scale out through a cluster when one node truly cannot meet the requirement. This is SPL's high-performance computing philosophy: first push the performance of a single node to the extreme, and turn to a cluster only when one node is not enough.&lt;/p&gt;

&lt;p&gt;To be sure, every technology has shortcomings, and SPL is no exception. SQL has been developed for decades, and many databases have strong optimization engines. For simple operations that suit SQL, an optimizer can rewrite the slow statements of ordinary programmers into better-performing ones, so the demands placed on programmers are relatively low. Some scenarios, such as multidimensional analysis, have been optimized for years, and certain SQL engines already implement them very well, with extreme performance. In contrast, SPL does little automatic optimization and depends almost entirely on the programmer writing low-complexity code to achieve high performance. Programmers therefore need some training to familiarize themselves with SPL’s philosophy and library functions before getting started. Although this is an extra step compared with SQL, it is usually worthwhile, as it can improve performance by orders of magnitude and cut costs several times over.&lt;/p&gt;

&lt;p&gt;SPL is now open source. You can obtain the source code from &lt;a href="https://github.com/SPLWare/esProc" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Try it free~~ &lt;a href="https://www.esproc.com/download-esproc/" rel="noopener noreferrer"&gt;Download~&lt;/a&gt;&lt;/p&gt;

</description>
      <category>algorithms</category>
      <category>database</category>
      <category>dataengineering</category>
      <category>performance</category>
    </item>
  </channel>
</rss>
