<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Judy</title>
    <description>The latest articles on DEV Community by Judy (@esproc_spl).</description>
    <link>https://dev.to/esproc_spl</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1191782%2Fdac3272f-c56a-41d9-914e-8f8fba86506b.jpg</url>
      <title>DEV Community: Judy</title>
      <link>https://dev.to/esproc_spl</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/esproc_spl"/>
    <language>en</language>
    <item>
      <title>SPL practice: solve space-time collision problem of trillion-scale calculations in only three minutes</title>
      <dc:creator>Judy</dc:creator>
      <pubDate>Mon, 30 Mar 2026 08:37:07 +0000</pubDate>
      <link>https://dev.to/esproc_spl/spl-practice-solve-space-time-collision-problem-of-trillion-scale-calculations-in-only-three-1ko5</link>
      <guid>https://dev.to/esproc_spl/spl-practice-solve-space-time-collision-problem-of-trillion-scale-calculations-in-only-three-1ko5</guid>
      <description>&lt;h2&gt;
  
  
  Problem description
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Definition of space-time collision&lt;/strong&gt;&lt;br&gt;
Dataset A contains the time and space information of n source objects A1, …, An, and each piece of information includes three attributes: ID (iA), location (lA) and time (tA). It can be assumed that the same Ai will not appear twice in A at the same time, or in other words, no two pieces of information have the same iA and tA. Dataset B, which has the same structure as A, contains m target objects B1, …, Bm (with the similar attributes iB, lB, tB) that are to be confirmed whether they collided with A. Likewise, it can be assumed that Bi will not appear twice in B at the same time.&lt;/p&gt;

&lt;p&gt;This article involves many set-oriented operations. Instead of using the term “record” to refer to a piece of information in a dataset, we use the set-related term “member”.&lt;/p&gt;

&lt;p&gt;Group the dataset A by iA to get n subsets, still named A1, …, An. Correspondingly, split the dataset B into m subsets B1, …, Bm. If ‘a’ belongs to the subset Ai and ‘b’ belongs to the subset Bj, and a.lA=b.lB and |a.tA-b.tB|&amp;lt;=1 minute (this time length can be changed), then we consider that ‘a’ collides with ‘b’, and that object Ai and object Bj collided once.&lt;/p&gt;

&lt;p&gt;Rule 1: The number of collisions of each Ai member is counted as once at most, which means that if a collides with b1 and b2, we consider that only one collision occurs between Ai and Bj.&lt;/p&gt;

&lt;p&gt;Rule 2: The member of Bj that once collided will no longer be identified as having had a collision. For example, if b collides with both a1 and a2, and assume a1.tA&amp;lt;a2.tA, then only the collision between a1 and b is identified as collision, and the collision between a2 and b is no longer identified as collision.&lt;/p&gt;

&lt;p&gt;Objective: find the top 10 objects Bj with the highest similarity for each Ai.&lt;/p&gt;

&lt;p&gt;The formula for calculating the similarity ‘r’ is: r(Ai, Bj)=E/U,&lt;br&gt;
where the numerator E refers to the total number of collisions between Ai and Bj calculated based on the above rules;&lt;/p&gt;

&lt;p&gt;The denominator U refers to the total number of members of Ai and Bj after deduplication, which can be calculated as |Ai|+|Bj|-E’, where E’ refers to the number of Bj members that collide with some Ai member.&lt;/p&gt;
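The rules and the formula can be sketched in a few lines of Python (a minimal illustration with made-up (time, location) members, not the article's actual SPL implementation):

```python
# Sketch of r(Ai, Bj) = E/U for one pair of subsets, following rules 1 and 2.
def similarity(Ai, Bj, window=60):
    matched_a = set()   # distinct Ai members identified as collided (rule 1)
    E_prime = 0         # E': number of collided Bj members
    for tB, lB in Bj:
        hits = sorted(tA for tA, lA in Ai
                      if lA == lB and window >= abs(tA - tB))
        if hits:
            E_prime += 1
            matched_a.add(hits[0])   # rule 2: the earliest a claims this b
    E = len(matched_a)               # numerator
    U = len(Ai) + len(Bj) - E_prime  # denominator: |Ai| + |Bj| - E'
    return E / U

Ai = [(100, 1), (200, 1)]
Bj = [(130, 1), (150, 1), (500, 2)]
print(similarity(Ai, Bj))   # both collided b's are claimed by a=100: E=1, U=3
```

Note how rule 2 matters: b=150 collides with both a=100 and a=200, but only the collision with the earlier a=100 is counted.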
&lt;h2&gt;
  
  
  Data structure and data scale
&lt;/h2&gt;

&lt;p&gt;Dataset A&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzt5ehzlwz8v691auyg4a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzt5ehzlwz8v691auyg4a.png" alt=" "&gt;&lt;/a&gt;&lt;br&gt;
Dataset B&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnh80lfg2efxc7q0yh349.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnh80lfg2efxc7q0yh349.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;tA and tB are accurate to the second, and the time span is 30 days.&lt;/p&gt;

&lt;p&gt;The total number of records of Dataset A is 21 million rows, with daily addition of approximately 700,000 rows.&lt;/p&gt;

&lt;p&gt;The total number of records of Dataset B is 15 million rows, with daily addition of approximately 500,000 rows.&lt;/p&gt;

&lt;p&gt;The scale of n (the number of Ai) is 2.5 million, and that of m (the number of Bj) is 1.5 million.&lt;/p&gt;

&lt;p&gt;The number of locations is 10; that is, lA and lB each take one of 10 possible values.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hardware environment and expectation&lt;/strong&gt;&lt;br&gt;
We hope to obtain the result within 15 minutes on a 40C128G server.&lt;/p&gt;

&lt;p&gt;The amount of data is not large and can be fully loaded into an in-memory database. However, the calculation process is complicated. Since it is difficult to work out in SQL alone, an external program (in Java or Python) is needed. As a result, the overall performance is poor: the task took more than two hours.&lt;/p&gt;
&lt;h2&gt;
  
  
  Problem analysis
&lt;/h2&gt;

&lt;p&gt;n is 2.5 million and m is 1.5 million. If we calculate the similarity of every pair of Ai and Bj according to the above definition, we have to calculate 2.5 million * 1.5 million = 3.75 trillion pairs. Even if each CPU could calculate the similarity of one pair of members in only one microsecond (in fact, such complex set-oriented calculations cannot be worked out that quickly), it would still take several hours in the current multi-CPU environment. Obviously, this brute-force traversal method is not feasible.&lt;/p&gt;

&lt;p&gt;For a pair of Ai and Bj, according to the similarity calculation formula, we know that if there is no collision between the members of Ai and Bj, then E=0, and the similarity is also 0, and hence there is no need to perform the subsequent TopN calculation.&lt;/p&gt;

&lt;p&gt;Assume that the data in A are evenly distributed, so the average number of members in each Ai is less than 10 (21 million/2.5 million). A Bj member that collides with an Ai member must satisfy |tA-tB|&amp;lt;=1 minute; if the data in B are also evenly distributed, there are approximately 350 (500,000/1440) B members per minute on average. The 10 members of Ai will therefore collide with at most 350*2*10=7000 B members (within one minute before and after each A member). The average number of members of Bj is also 10 (15 million/1.5 million), so distributing these 7000 B members into Bj leaves only 7000/10=700 Bj on average. In other words, on average only about 700 Bj have a similarity not equal to 0 with Ai, which is much smaller than the total number of Bj (1.5 million, a difference of over 2,000 times). Considering the condition lA=lB, if all objects are also evenly distributed across locations (which is unlikely), the average number of Bj with nonzero similarity can be further reduced by a factor of 10 (10 locations).&lt;/p&gt;
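These estimates can be checked with a few lines of arithmetic, using the rounded figures from the text:

```python
# Back-of-envelope check of the estimates above.
avg_Ai = 10                          # ~21 million rows / 2.5 million objects
b_per_min = 350                      # ~500,000 daily rows / 1440 minutes
candidates = b_per_min * 2 * avg_Ai  # 2-minute window per Ai member: 7000
avg_Bj = 10                          # ~15 million rows / 1.5 million objects
nonzero = candidates // avg_Bj       # ~700 Bj with nonzero similarity
ratio = 1_500_000 // nonzero         # over 2,000 times fewer than all Bj
print(candidates, nonzero, ratio)    # 7000 700 2142
```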

&lt;p&gt;Based on the above information, we design the following algorithms:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;For each Ai, find the set of members of B that may collide with each member of Ai based on the time and location conditions, and denote the set as B’:
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;B’=Ai.conj(B.select(tB&amp;gt;=tA-60 &amp;amp;&amp;amp; tB&amp;lt;=tA+60 &amp;amp;&amp;amp; lA==lB))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Note that during the calculation of B’, the corresponding B members will be filtered for each Ai member, so the members of both Ai and B may appear repeatedly in B’.&lt;/p&gt;
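Step 1 can be sketched in Python as a plain double loop (illustrative only; the SPL expression above is the article's actual formulation). Members are (time, location) pairs, and each result row keeps tA to mark which ‘a’ it matched:

```python
# Naive construction of B': for each member of Ai, collect the B members
# within the 1-minute window at the same location.
def build_B_prime(Ai, B, window=60):
    out = []
    for tA, lA in Ai:
        for tB, lB in B:
            if lA == lB and window >= abs(tA - tB):
                out.append((tA, tB))   # keep tA for the later steps
    return out   # members of both Ai and B may repeat here

print(build_B_prime([(100, 1), (200, 1)], [(130, 1), (150, 1), (150, 2)]))
# [(100, 130), (100, 150), (200, 150)]
```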

&lt;ol start="2"&gt;
&lt;li&gt;For each Ai, we need to filter B by tB to obtain B’. If B is sorted by tB in advance, binary search can be used to speed this up. Moreover, we need to add the tA attribute of Ai to facilitate subsequent calculations (since lA and lB are always the same and iA is a fixed value within Ai, there is no need to add these two attributes). The calculation of B’ can be changed as follows:
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;B’=Ai.conj(B.select@b(tB&amp;gt;=tA-60 &amp;amp;&amp;amp; tB&amp;lt;=tA+60).select(lA==lB).derive(Ai.tA))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
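Assuming B is pre-sorted by tB, the speedup of select@b can be imitated in Python with the bisect module: two binary searches bound the time window, and the location filter then runs over that small slice (a sketch under that assumption, not the article's code):

```python
import bisect

def b_prime_sorted(Ai, B_sorted, window=60):
    times = [tB for tB, _ in B_sorted]        # B_sorted is ordered by tB
    out = []
    for tA, lA in Ai:
        lo = bisect.bisect_left(times, tA - window)
        hi = bisect.bisect_right(times, tA + window)
        for tB, lB in B_sorted[lo:hi]:        # only the 2-minute window
            if lB == lA:
                out.append((tA, tB))          # derive tA, as in the SPL code
    return out
```

Sorting B once turns the per-member scan of 15 million rows into a handful of comparisons per lookup.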


&lt;p&gt;Group B’ by iB to get the subsets B’j. According to rule 2, within each B’j, only the member with the minimum tA is kept for every tB value; this determines the set composed of member pairs consisting of Ai members and B’j members that have collided (which is also the set of the member pairs consisting of Ai members and Bj members that have collided). In this set, the member of Ai is represented by the field tA, and the member of Bj by the field tB.&lt;/p&gt;

&lt;ol start="4"&gt;
&lt;li&gt;If A is sorted by iA and tA in advance, the members of Ai after grouping are also in order by tA; likewise, the members of B’ will be ordered by tA, and the members of each B’j will also be in order by tA. In this way, the member with the minimum tA in each grouped subset of B’j.group(tB) is always the first member. Therefore, the calculation of A’j can be simplified as:
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;A’j=B’j.group@1(tB)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol start="5"&gt;
&lt;li&gt;According to rule 1, the number of collisions of each Ai member is counted once at most; and based on the assumption made at the beginning of this article that the same Ai will not appear twice in A at the same time, we just need to deduplicate tA. That is, the numerator can be calculated as follows:
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;E=A’.icount(tA)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol start="6"&gt;
&lt;li&gt;In the formula U=|Ai|+|Bj|-E’ for calculating the denominator,&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A’j is the set composed of member pairs consisting of A members and B members, and these member pairs have been identified as having had a collision. Moreover, since the Bj members are already deduplicated (group@1(tB) means taking only one member from each grouped subset), the member of Bj will not appear repeatedly in A’j. Therefore, the members of A’j can correspond to the collided members of Bj one to one, and the equation E’=|A’j| holds.&lt;/p&gt;
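Steps 3 to 6 can be condensed into one Python sketch (hypothetical row layout: each B’ row is a tuple (iB, tB, tA) already ordered by tA, as step 4 assumes; sizeA is |Ai| and sizeB maps each iB to |Bj|):

```python
def score(B_prime, sizeA, sizeB):
    # group@1(iB, tB): keep only the first row of each (iB, tB) group; rows
    # arrive ordered by tA, so that first row carries the minimum tA (rule 2)
    first = {}
    for iB, tB, tA in B_prime:
        first.setdefault((iB, tB), tA)
    # split A'j by iB, then apply E = icount(tA) and U = |Ai| + |Bj| - E'
    groups = {}
    for (iB, tB), tA in first.items():
        groups.setdefault(iB, []).append(tA)
    result = {}
    for iB, tAs in groups.items():
        E = len(set(tAs))                # icount(tA): rule 1 deduplication
        E_prime = len(tAs)               # E' = |A'j|: collided Bj members
        U = sizeA + sizeB[iB] - E_prime
        result[iB] = E / U
    return result

print(score([(7, 130, 100), (7, 150, 100), (7, 150, 200)], 2, {7: 3}))
```

Here both B rows of object 7 are claimed by the Ai member with tA=100, so E=1, E’=2 and r=1/3.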

&lt;ol start="7"&gt;
&lt;li&gt;|Ai| and |Bj| can be calculated and saved in advance by grouping A and B by iA and iB respectively. Since the amount of data is not large, the results can all be stored in memory.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Finally, according to the formula E/U, we can get the similarity ‘r’ between Ai and Bj. What remains is the common task of calculating TopN for the similarity results.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Further optimization&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;8. In step 1 above, searching the full data of dataset B for each ‘a’ still requires a certain amount of computation even with binary search (2*log2(15 million) is about 50 comparisons). Since neither the number of minutes nor the number of locations is large (a 30-day span, 1440 minutes per day and 10 locations give only around 430,000 combinations), which can easily be held in memory, we can use an aligned sequence to locate candidates directly.&lt;/p&gt;

&lt;p&gt;When computing, let&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;G=B.align@a(30*1440,tB\60+1).(~.align@a(10,lB))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Because there are two dimensions, time and location, we use a two-layer aligned sequence. First, divide B into 43200 (30*1440) groups by the minute sequence number tB\60+1, and then divide each sub-group’s members into 10 groups by the location sequence number lB. For the tA of a certain member in Ai, we can use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;G’=G.m(tA\60)(lA) | G(tA\60+1)(lA) | G.m(tA\60+2)(lA)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to quickly and roughly screen out a small superset of B’. The time difference between tB and tA of any member in the superset is no more than 2 minutes (the difference between the minute sequence numbers of tA and tB is not greater than 1), thereby filtering out a large number of B members that cannot possibly collide on the time dimension. Since the small superset may still contain a few members whose difference between tB and tA is greater than 1 minute, we then use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;G’.select@b(tB&amp;gt;=tA-60 &amp;amp;&amp;amp; tB&amp;lt;=tA+60)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to screen out an exact B’, which can significantly reduce the computing amount.&lt;/p&gt;
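The two-layer aligned sequence can be imitated in Python with nested lists indexed by minute and location (toy sizes MINUTES and PLACES are assumptions for the example; the article uses 30*1440 minutes and 10 locations):

```python
MINUTES, PLACES = 5, 3   # toy sizes; the article uses 30*1440 and 10

def build_G(B):
    # two-layer aligned sequence: outer index = minute number, inner = location
    G = [[[] for _ in range(PLACES)] for _ in range(MINUTES)]
    for tB, lB in B:
        G[tB // 60][lB].append((tB, lB))
    return G

def candidates(G, tA, lA, window=60):
    m = tA // 60
    rough = []                            # superset: at most 2 minutes away
    for mm in (m - 1, m, m + 1):          # the three adjacent minute buckets
        if mm >= 0 and MINUTES > mm:
            rough.extend(G[mm][lA])
    # exact filter, as in the final select step of the article
    return [(tB, lB) for tB, lB in rough if window >= abs(tA - tB)]

G = build_G([(10, 0), (70, 0), (130, 0), (75, 1)])
print(candidates(G, 65, 0))   # [(10, 0), (70, 0)]
```

Locating candidates becomes pure array indexing with no search at all; only the small rough superset is filtered exactly.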

&lt;ol start="9"&gt;
&lt;li&gt;In addition, when calculating U, we need to look up the corresponding |Ai| by iA. If iA is a continuous integer sequence number, we can fetch it directly by position and avoid the search, that is:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nA=A.groups@n(iA;count(1)).(#2)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, |Ai|=nA(iA). B can be processed in the same way.&lt;/p&gt;
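Step 9 amounts to replacing a lookup table with positional access, as in this sketch (assuming iA is a contiguous sequence number starting at 1):

```python
# Per-object counts stored in an array so that |Ai| is fetched by position.
def counts_by_id(rows, n):
    nA = [0] * (n + 1)        # index 0 unused; nA[iA] holds |Ai|
    for iA, _tA in rows:
        nA[iA] += 1
    return nA

A = [(1, 100), (1, 160), (2, 90)]
nA = counts_by_id(A, 2)
print(nA[1], nA[2])   # 2 1
```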

&lt;h2&gt;
  
  
  Practice process
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Prepare the test data&lt;/strong&gt;&lt;br&gt;
We directly prepare data that are already converted to sequence numbers. Assuming a time span of 30 days and 10 enumerated locations, the script to simulate the data is as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdv4pjbgjow0a6tmeluam.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdv4pjbgjow0a6tmeluam.png" alt=" "&gt;&lt;/a&gt;&lt;br&gt;
In A1, K represents the number of days, nA represents the total data amount of dataset A, nB represents the total data amount of dataset B, t represents the number of seconds of 30 days, and LN represents the enumeration number of locations.&lt;/p&gt;

&lt;p&gt;In A2 and A3, the composite tables A.ctx and B.ctx are created, and the randomly generated data of datasets A and B are exported to the two composite tables respectively.&lt;/p&gt;

&lt;p&gt;tA (tB) refers to the number of seconds elapsed from the starting time point. For example, if the starting time point is 2023-08-23 00:00:00, then the value corresponding to the time point 2023-08-23 00:01:01 is 61.&lt;/p&gt;

&lt;p&gt;In A2, the @p option is used to create the composite table, indicating that the first field ‘iA’ is used as the segmentation key. During parallel computing, the composite table needs to be segmented. Since the records of the same ‘iA’ cannot be assigned to two segments, we use the @p option to ensure this during the segmentation of composite table.&lt;/p&gt;

&lt;p&gt;Special attention should be paid to the different sort orders when saving A and B: A is sorted by iA and tA (step 4 of ‘Problem analysis’), while B is sorted by tB and iB (step 2 of ‘Problem analysis’). In this way, we can read the ordered data directly in subsequent operations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Computing script&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw4lzfamwlm6qh9d035lf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw4lzfamwlm6qh9d035lf.png" alt=" "&gt;&lt;/a&gt;&lt;br&gt;
A1: K=30 represents the number of days of the time span to be counted; LN=10 represents the enumeration number of locations;&lt;/p&gt;

&lt;p&gt;A4, A5: correspond to step 7 of ‘Problem analysis’. Group the members of datasets A and B by ID respectively, count the number of members of each Ai and Bj, and store the results as sequences;&lt;/p&gt;

&lt;p&gt;A6: corresponds to step 8 of ‘Problem analysis’. Align and group the members of dataset B by minute, and then align and group the sub-group members by location to build the aligned sequence mentioned above;&lt;/p&gt;

&lt;p&gt;A8: corresponds to steps 1, 2 and 8 of ‘Problem analysis’. Divide dataset A into groups of Ai by iA, and loop through each Ai to obtain the corresponding B’; here we use ‘news’ instead of ‘conj’, which eliminates the derive action and obtains the same result. In addition, we also add iA to facilitate the subsequent lookup of |Ai|;&lt;/p&gt;

&lt;p&gt;A9: corresponds to steps 3 and 4 of ‘Problem analysis’. The method of calculating A’j:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;B’.group(iB).group@1(tB)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;can be simplified as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;B’.group@1(iB,tB)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A10: corresponds to steps 6 and 7 of ‘Problem analysis’. Deduplicating and counting tA over A’j gives the numerator E, which is equivalent to step 5 of ‘Problem analysis’. Adding the previously calculated |Ai| and |Bj| and then subtracting the number of records in the current group (i.e. E’) gives the denominator U. Finally, calculate the top 10 records based on the similarity ‘r’.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Convert to sequence number and restore&lt;/strong&gt;&lt;br&gt;
Convert the ID, location and time to sequence numbers (for time, this means calculating the number of seconds elapsed from the start time). The data structure after conversion is as follows:&lt;/p&gt;

&lt;p&gt;Dataset A&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0pkbpiatfzf8by703jem.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0pkbpiatfzf8by703jem.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Dataset B&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2tqiq1mypc9qqzsohkgq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2tqiq1mypc9qqzsohkgq.png" alt=" "&gt;&lt;/a&gt;&lt;br&gt;
The above code assumes that the values of all fields are already converted to the data structure shown above. Therefore, in practice, we need to perform data conversion and organization first, and then restore the data after calculation. For details, refer to the method described in SPL Practice: integerization during data dump. Since that method is not the focus of this article, we won’t describe it again here.&lt;/p&gt;

&lt;h2&gt;
  
  
  Actual effect
&lt;/h2&gt;

&lt;p&gt;When the total time span is 30 days (the data volume of dataset A is 21 million rows, and that of dataset B is 15 million rows), computing in SPL on a single machine (8C64G) takes 161 seconds, including exporting all results to a CSV file.&lt;/p&gt;

&lt;p&gt;In fact, achieving this performance requires a few of the column-wise computing options of SPL Enterprise Edition. Since these options don’t affect the principle analysis, we do not describe them in detail in this article.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Postscript&lt;/strong&gt;&lt;br&gt;
This article discusses a typical object counting problem, which generally has the following characteristics:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Count the number of objects that satisfy a certain condition.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The number of objects is very large, but the amount of data involved in each object is not large.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The condition is very complex, usually also related to the order, and requires some steps to determine.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Normally, solving such problems requires sorting the data by object. However, since the amount of data involved in this task is small, storage optimization becomes unimportant. The key to solving this task is powerful set-oriented computing ability, especially the ability to compute over ordered sets. For example, the data type should support sets of sets so that grouped subsets can be retained without having to aggregate them as SQL does. Moreover, two-layer aligned sequences should be supported, allowing us to access the members of a set by position, and ordered grouping functionality should be provided.&lt;/p&gt;

</description>
      <category>productivity</category>
      <category>devops</category>
      <category>opensource</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Ditch 10,000 Intermediate Tables—Compute Outside the Database with Open-Source SPL</title>
      <dc:creator>Judy</dc:creator>
      <pubDate>Fri, 13 Feb 2026 08:02:44 +0000</pubDate>
      <link>https://dev.to/esproc_spl/ditch-10000-intermediate-tables-compute-outside-the-database-with-open-source-spl-4j1k</link>
      <guid>https://dev.to/esproc_spl/ditch-10000-intermediate-tables-compute-outside-the-database-with-open-source-spl-4j1k</guid>
      <description>&lt;p&gt;Intermediate tables are data tables in databases specifically used to store intermediate results generated from processing the original data – which is why they are so named. They are summary tables usually created for speeding up or facilitating the front-end queries and analysis. For some large organizations, years of accumulation results in tens of thousands of intermediate tables, which is an incredible number, in their databases, bringing great trouble to database operation and usage.&lt;/p&gt;

&lt;p&gt;The large number of intermediate tables occupies too much database storage space, putting enormous pressure on storage capacity and increasing the demand for capacity expansion. But database space is expensive and capacity expansion is exceedingly costly; moreover, there are often restrictions on expansion. Paying an arm and a leg to store intermediate tables is a bad choice also because too many of them reduce database performance. Intermediate tables are not created out of thin air: they are generated from the original data through a series of computations that consume database computing resources. Sometimes a lot of intermediate tables are produced during a computation. This consumes a large amount of resources and, in serious cases, can slow down queries and transactions.&lt;/p&gt;

&lt;p&gt;Why are there so many intermediate tables? Below are the main reasons:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. More than one step is needed to get the final result&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The original data table needs to undergo complicated computations before being displayed in a report. It is hard to accomplish this with one SQL statement; instead, multiple consecutive SQL statements are used, where one statement generates an intermediate result that will be used by the next.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Long wait time in real-time computations&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For data-intensive and compute-intensive tasks, the wait time will be extremely long. So, report developers choose to run batch tasks at night and store results in intermediate tables. It is much faster to perform queries based on the intermediate tables.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Diverse data sources in a computation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Files, NoSQL databases and web services have almost no computing ability. Data originating from them needs to be computed using the database’s computing ability. Particularly for a mixed computation between such data and data stored in the database, the traditional approach is to load the external data into the database and store it as intermediate tables.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Intermediate tables are hard to get rid of&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As databases use a flat structure to arrange tables, it is very likely that an intermediate table is shared by multiple queries after it is created. Deleting it when one query is finished could affect other queries. Worse still, you cannot know exactly which applications are using the intermediate table. This makes deletion impossible, not because you do not want to get rid of it, but because you dare not. The consequence is that tens of thousands of intermediate tables accumulate in the database over time.&lt;/p&gt;

&lt;p&gt;But why do we use databases to store intermediate data? Judging from the above causes, the direct aim of storing intermediate data in the database as intermediate tables is to employ the database’s computing ability. The intermediate data will be further computed for subsequent use, and sometimes the computation is rather complicated. At present only databases (which are SQL-driven) offer relatively convenient computing ability. Although other storage formats like files have their own merits (high I/O performance, compressibility, and easy parallel processing), they have no computing ability. If you try to perform computations based on files, you need to hardcode them in applications, which is far less convenient than using SQL. So making use of databases’ computing abilities is the essential reason for the existence of intermediate tables.&lt;/p&gt;

&lt;p&gt;In some sense intermediate data is necessary. But consuming a huge amount of database resources just to obtain computing ability is obviously not a good strategy. If we can give files equal computing ability and store intermediate data in the file system outside the database, then the problems related to intermediate tables will be solved and the database will be relieved of the overload.&lt;/p&gt;

&lt;p&gt;The open-source SPL can help to turn it into reality.&lt;/p&gt;

&lt;p&gt;SPL is an open-source structured data computation engine. It can process data directly based on files, giving files computing ability. It is database-independent, offers specialized structured data objects and a wealth of class libraries for handling them, possesses all-around computational capability, and supports procedural control, which makes it convenient to implement complex computing logic. All these features qualify SPL to replace databases in handling intermediate data and subsequent data processing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2uyu0kl9g9kwewbq0vvr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2uyu0kl9g9kwewbq0vvr.png" alt=" " width="800" height="389"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  SPL file-based computations
&lt;/h2&gt;

&lt;p&gt;SPL can perform computations directly on files such as CSV and Excel, as well as multilevel data like JSON and XML. It is convenient to read and handle them in SPL. We can store intermediate data in one of these file formats and handle it in SPL. Below are some basic computations:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F61n0jtodog3k4tmlc2l2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F61n0jtodog3k4tmlc2l2.png" alt=" " width="800" height="335"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On top of its native syntax, SPL also offers support for the SQL92 standard, allowing programmers familiar with SQL to query files directly in SQL.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$select * from d:/Orders.csv where Client in ('TAS','KBRO','PNS')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Complicated queries, such as joins over subqueries, are also supported:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$select t.Client, t.s, ct.Name, ct.address from
(select Client ,sum(amount) s from d:/Orders.csv group by Client) t
left join ClientTable ct on t.Client=ct.Client
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;SPL has an edge in handling multilevel data like JSON and XML. For instance, to perform computations on orders data in JSON format:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy3pohvqwg9bpkyjzav32.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy3pohvqwg9bpkyjzav32.png" alt=" " width="800" height="341"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The SPL implementation is concise compared with that in other JSON libraries (like JSONPath).&lt;/p&gt;

&lt;p&gt;SPL also allows users to query JSON data directly in SQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$select * from {json(file("/data/EO.json").read())}
where Amount&amp;gt;=100 and Client like 'bro' or OrderDate is null
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;SPL is particularly suitable for handling complex computing logic thanks to its agile syntax and procedural control. For instance, to count the longest consecutive days of rise in a stock’s price based on stock records in txt format, the SPL code is as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0b4vreozgu23v2zpm4aw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0b4vreozgu23v2zpm4aw.png" alt=" " width="800" height="117"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One more example: to list the latest login interval for each user according to user login records in CSV format:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr0fw3aqpbd6hdl9r8s0f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr0fw3aqpbd6hdl9r8s0f.png" alt=" " width="800" height="174"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Such computing tasks are hard to code even in SQL inside a database, yet they become easy in SPL.&lt;/p&gt;

&lt;p&gt;The outside-database computing ability SPL supplies is an effective solution to the problems caused by an excess of intermediate tables in databases. Storing intermediate data in files releases database storage, reduces the need for database expansion and makes database management more convenient. Outside-database computations do not occupy database computing resources, so the unburdened database can better serve other transactions.&lt;/p&gt;

&lt;h2&gt;
  
  
  High-performance file formats
&lt;/h2&gt;

&lt;p&gt;Text files are a commonly used data storage format. They are general-purpose and easy to read, but their performance is extremely poor; traditionally, it is hard for text-based computations to achieve satisfactory performance.&lt;/p&gt;

&lt;p&gt;Text characters cannot be computed directly. They must first be parsed into in-memory data types such as integers, real numbers, dates and strings. Text parsing, however, is complicated and takes exceptionally long CPU time. While hard disk reading usually accounts for most of the time spent accessing external storage, the performance bottleneck of text files typically lies in the CPU-bound parsing phase. Because parsing is so complicated, the CPU time can even exceed the hard disk reading time (especially on a high-performance SSD). Text files are therefore rarely used for big data processing when high performance is demanded.&lt;/p&gt;
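The parsing cost can be made concrete with a small Python sketch (purely illustrative, not SPL code): text requires a character-level parse per value, while a binary encoding is decoded in bulk with no parsing at all:

```python
import struct

values = list(range(1000))

# Text form: every value must be parsed from characters back into an integer
text = ",".join(str(v) for v in values)
parsed_from_text = [int(s) for s in text.split(",")]

# Binary form: fixed-width integers are decoded in a single call,
# with no character-level parsing at all
binary = struct.pack("1000i", *values)
parsed_from_binary = list(struct.unpack("1000i", binary))

assert parsed_from_text == parsed_from_binary == values
```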

&lt;p&gt;SPL provides two high-performance binary storage formats – bin file and composite table. A bin file uses a binary format: it is compressed (to occupy less space and allow fast retrieval), stores data types (to enable faster reading without parsing), and supports the double increment segmentation technique for dividing an append-able file, which facilitates parallel processing and further increases computing performance.&lt;/p&gt;
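The double increment segmentation idea can be sketched as follows (a simplified model for illustration, not esProc’s actual file layout): keep a fixed-capacity index of segment offsets over an append-only file; whenever the index fills up, drop every other entry and double the recording step, so the growing file can always be split into roughly equal segments:

```python
class SegmentIndex:
    """Simplified double-increment segmentation over an append-only
    sequence of records (an illustrative model only)."""

    def __init__(self, capacity=8):
        self.capacity = capacity   # fixed number of index slots
        self.step = 1              # record an offset every `step` appends
        self.offsets = []          # starting record numbers of the segments
        self.count = 0             # total records appended so far

    def append(self, n=1):
        for _ in range(n):
            if self.count % self.step == 0:
                if len(self.offsets) == self.capacity:
                    # Index full: keep every other entry, double the step
                    self.offsets = self.offsets[::2]
                    self.step *= 2
                if self.count % self.step == 0:
                    self.offsets.append(self.count)
            self.count += 1

idx = SegmentIndex(capacity=8)
idx.append(1000)
# The index never outgrows its fixed capacity, yet still covers the whole file
assert 8 >= len(idx.offsets) and idx.offsets[0] == 0
```

Because the segment boundaries stay roughly evenly spaced however much data is appended, each thread in a parallel scan can be handed one segment of comparable size.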

&lt;p&gt;The composite table is an SPL file storage format that provides column-wise storage and an indexing mechanism. It shows a great advantage in scenarios where only a small number of columns (fields) is involved. A composite table is equipped with a min-max index and supports the double increment segmentation technique, letting computations both enjoy the advantages of column-wise storage and be parallelized more easily for better performance.&lt;/p&gt;

&lt;p&gt;The two high-performance file formats are convenient to use, and have basically the same uses as text files. To read a bin file and compute it, for instance:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb2kl96innunss4o0fshs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb2kl96innunss4o0fshs.png" alt=" " width="800" height="169"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When the data to be processed is large, SPL can use cursors to perform batch retrieval and multi-CPU parallel processing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;=file("/data/scores.btx").cursor@bm()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When storing data in files, whatever format the original data uses, it pays to convert it at least to a binary format (such as bin file) to gain advantages in both space usage and computing performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ease of management
&lt;/h2&gt;

&lt;p&gt;Moving intermediate data out of the database and into the file system not only reduces database workload but also makes the data far easier to manage. Files can be stored in the operating system’s tree-structured directories, which makes them convenient to use and manage. Placing the intermediate tables used by different systems and modules in separate directories keeps things neat and tidy. This eliminates shared references and thus the long-standing tight coupling between systems and modules caused by messy use of intermediate tables in the database. Intermediate tables can now be safely deleted, without side effects, once the corresponding modules are no longer used.&lt;/p&gt;

&lt;h2&gt;
  
  
  Support for diverse data sources
&lt;/h2&gt;

&lt;p&gt;In addition to the file sources, SPL can connect to and retrieve data from dozens of other data sources as well as perform mixed computations between different sources.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F565k2pkzys0djyikeuof.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F565k2pkzys0djyikeuof.png" alt=" " width="800" height="348"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After intermediate data is moved into files, full-data queries that span the files and the database holding real-time data become cross-data-source computations. Such T+0 queries are convenient to implement in SPL:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwvqhw9i6ua0ne2mwr55o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwvqhw9i6ua0ne2mwr55o.png" alt=" " width="800" height="326"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Ease of integration
&lt;/h2&gt;

&lt;p&gt;SPL provides standard JDBC and ODBC drivers for invocation by applications. For a Java program, SPL code can also be integrated as an embedded computing engine, giving the program the ability to handle intermediate data.&lt;/p&gt;

&lt;p&gt;Sample of invoking SPL code through JDBC:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;…
Class.forName("com.esproc.jdbc.InternalDriver");
Connection conn = DriverManager.getConnection("jdbc:esproc:local://");
CallableStatement st = conn.prepareCall("{call splscript(?, ?)}");
st.setObject(1, 3000);
st.setObject(2, 5000);
ResultSet result = st.executeQuery();
…
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;SPL is interpreted, so it naturally supports hot swapping. Computing logic written in SPL – including modifications made during operation and maintenance – takes effect in real time without restarting the application, making development, operation and maintenance convenient and efficient.&lt;/p&gt;

&lt;p&gt;With the outside-database computing capability SPL offers, we can move intermediate tables into files, ridding databases of their clutter. This relieves the database of its overload and makes it faster, more flexible and more scalable.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>database</category>
      <category>dataengineering</category>
      <category>opensource</category>
    </item>
    <item>
      <title>The Game-Changer Breaking Data Lake's Impossible Triangle</title>
      <dc:creator>Judy</dc:creator>
      <pubDate>Wed, 04 Feb 2026 08:11:55 +0000</pubDate>
      <link>https://dev.to/esproc_spl/the-game-changer-breaking-data-lakes-impossible-triangle-c1c</link>
      <guid>https://dev.to/esproc_spl/the-game-changer-breaking-data-lakes-impossible-triangle-c1c</guid>
      <description>&lt;h2&gt;
  
  
  A brief introduction to data lake
&lt;/h2&gt;

&lt;p&gt;Let’s start with the data warehouse. A data warehouse is a subject-oriented data management system that aggregates data from different business systems and is intended for data query and analysis. As data expands and the number of business systems grows, data warehousing becomes necessary. To meet business requirements, the raw data must be cleansed, transformed and deeply prepared before being loaded into the warehouse. Answering existing business questions is the data warehouse’s core task – and those questions must already be defined.&lt;/p&gt;

&lt;p&gt;But what about business questions that have not yet been defined – the potential data value? Under the data warehouse’s rule, a business question is asked first and then a model is built for it, so the chain of identifying, raising and answering questions becomes very long. Moreover, since the warehouse stores highly prepared data, a new question that requires fine data granularity forces it back to processing the raw data, which is extremely cost-ineffective. If many such new questions arise, the query process becomes overburdened.&lt;/p&gt;

&lt;p&gt;Against this background, the data lake was born. It is a technology (or strategy) intended to store and analyze massive amounts of raw data. It loads as much raw data as possible into the lake, stores it with the highest possible fidelity and, in theory, allows any potential value to be extracted from the full data. Two roles of the data lake follow immediately. One is data storage, because the lake must keep all raw data. The other is data analysis – from a technical point of view, data computing, the value extraction process.&lt;/p&gt;

&lt;p&gt;Let’s look at the data lake’s performance in the two aspects.&lt;/p&gt;

&lt;p&gt;The data lake stores full raw data – structured, semi-structured and unstructured – in its original state. The capacity to store massive and diverse data is thus the data lake’s essential feature, distinguishing it from the data warehouse, which typically uses databases to store structured data. Besides, loading data into the lake as early as possible helps fully extract value from associations between differently themed data and ensures data security and integrity.&lt;/p&gt;

&lt;p&gt;The good news is that, thanks to great advances in storage and cloud technologies, the need to store massive raw data can be fully met. Enterprises can choose a self-built storage cluster or a cloud vendor’s storage service according to their business demands.&lt;/p&gt;

&lt;p&gt;But the toughest nut to crack is data processing! The data lake stores various types of data, and each needs to be processed differently; structured data processing is the central and most complicated part. Whether for historical data or newly generated business data, processing focuses mainly on structured data, and on many occasions computations on semi-structured and unstructured data are eventually transformed into structured data computations.&lt;/p&gt;

&lt;p&gt;At present, SQL-based databases and related technologies – the same abilities data warehouses have – dominate the structured data processing field. In other words, the data lake depends on data warehouses (databases) to compute structured data. That is what nearly all data lake products do: build the lake to store all raw data, then build a data warehouse on top to add the processing capability the business needs. As a result, data in the lake must be loaded into the warehouse again through ETL. A more advanced approach automates the process to some degree: it identifies the lake data that needs to go into the warehouse and performs the loading while the system is idle. This is the main functionality behind the currently hot concept of Lakehouse. But however data is loaded into the warehouse (including the extremely inefficient method of letting the warehouse access the lake through external tables), today’s data lake is made up of three components – massive data storage, a data warehouse and a specialized engine (for, say, unstructured data processing).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fovuu467hwrtowwef9i6c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fovuu467hwrtowwef9i6c.png" alt=" " width="800" height="490"&gt;&lt;/a&gt;&lt;br&gt;
There are problems about this type of data lake framework.&lt;/p&gt;

&lt;h2&gt;
  
  
  The impossible triangle
&lt;/h2&gt;

&lt;p&gt;Data lakes are expected to meet three key requirements – storing data in its original state (loading high-fidelity data into the lake), sufficient computing capacity (extracting the maximum possible data value) and cost-effective development (for obvious reasons). The current technology stack, however, cannot achieve all three at the same time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkn7omgyef54uc86s3wx9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkn7omgyef54uc86s3wx9.png" alt=" " width="800" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Storing data as it is was the initial purpose of building the data lake, because keeping the original data unaltered helps extract the maximum value from it. The simplest way to achieve this is for the lake to use exactly the same storage media as the sources: a MySQL instance to hold data originally stored in MySQL, a MongoDB instance for data initially stored in MongoDB, and so on. This loads data into the lake with the highest possible fidelity and reuses each source’s computing ability. Though computations across data sources remain hard, this is enough for computations within a single source, meeting the basic requirement of sufficient computing power (as part i in the figure above shows).&lt;/p&gt;

&lt;p&gt;But the disadvantage is noticeable – development is too expensive. Users need to deploy the same storage media and copy into them all the data sources accumulated over years; the workload is enormous. If a data source runs on commercial software, purchasing that software pushes the cost up further. A relief strategy is to use a storage medium of the same general type, such as storing Oracle data in MySQL, but while the costs stay high it also brings a side effect: some computations that used to be possible become impossible or hard to achieve.&lt;/p&gt;

&lt;p&gt;Now, let’s lower the bar. Instead of demanding that each source be duplicated at loading, we simply store all the data in one database. This gains the database’s computing ability and meets the requirement of cheap development at the same time (as part ii in the figure above shows). But the approach is fragile, since it depends heavily on a single relational database into which all data must be loaded.&lt;/p&gt;

&lt;p&gt;Information is easily lost during such loading, which violates the first requirement of building a data lake (loading high-fidelity data into the lake). Storing MongoDB data in MySQL or Hive, for instance, is hard. Many MongoDB data types and relationships between collections do not exist in MySQL – set-like types such as nested structures, arrays and hashes, and many-to-many relationships. They cannot simply be copied over during migration; instead, the data structures must be reorganized beforehand. That requires a series of sophisticated reorganization steps and costs a lot of people and time to sort out the business target and design an appropriate form for the target data. Skip this work and information is lost, and errors then surface in the subsequent analysis – sometimes hidden too deeply to be easily visible.&lt;/p&gt;

&lt;p&gt;A general approach is to load data unaltered into large files (or into large fields in the database). The information loss is then within an acceptable range and the data remains basically intact. File storage has many advantages: it is more flexible, more open and has higher I/O efficiency. In particular, storing data in files (or a file system) is cheaper.&lt;/p&gt;

&lt;p&gt;Yet the problem with file storage is that files and large fields have no computing capacity of their own, so the requirement of convenient, sufficient computing power cannot be met. The impossible triangle seems too strong to break.&lt;/p&gt;

&lt;p&gt;No approach resolves the conflict between storing data in its initial state and using it conveniently. Under the requirement of cost-effective lake building (loading data into the lake fast), high-fidelity loading and convenient, sufficient computing power are mutually exclusive. This goes against the data lake’s goal of openness.&lt;/p&gt;

&lt;p&gt;The underlying cause of the conflict is the closedness of databases and their strict constraints. A database requires that data be loaded into it before computation, and the data must satisfy the database’s constraints before it can be loaded. To conform to these rules, data has to be cleansed and transformed, and information is lost in the process. Abandoning databases for other routes (such as files) fails the demand for sufficient computing power – unless you turn to hardcoding, which is too complicated and nowhere near as convenient as a database.&lt;/p&gt;

&lt;p&gt;Actually, an open computing engine can become the breaker of the impossible triangle. Such an engine possessing sufficient and convenient computing power can compute the raw data, including data stored in diverse data sources, in real time.&lt;/p&gt;

&lt;h2&gt;
  
  
  SPL – the open data lake computing engine
&lt;/h2&gt;

&lt;p&gt;The open-source SPL is a structured data computing engine that provides open computing power for data lakes. Its capability for mixed computations across diverse sources lets it compute raw data stored in different sources directly, in its original state. Whichever storage media the data lake uses – the same types as the data sources, or files – SPL can compute the data directly and perform the transformation step by step, making lake building easier.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv27spstzlvqd2ogewv7b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv27spstzlvqd2ogewv7b.png" alt=" " width="800" height="430"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Open and all-around computing power
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Diverse-source mixed computing ability&lt;/strong&gt;&lt;br&gt;
SPL supports various data sources – RDB, NoSQL, JSON/XML, CSV, WebService, etc. – and mixed computations between them. Any type of raw data stored in the data lake can thus be used directly, and its value extracted, without a separate loading or preparation step. Such flexible, efficient use of data is precisely one of the goals of data lakes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0dcfh22zz5a6o1j1yhx7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0dcfh22zz5a6o1j1yhx7.png" alt=" " width="800" height="348"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Being agile like this, the data lake will be able to provide data services to applications as soon as it is established rather than after the prolonged cycle of data preparation, loading and modeling. The more flexible data lake service enables real time response to business needs.&lt;/p&gt;

&lt;p&gt;In particular, SPL’s good support for files gives them powerful computing ability. Storing lake data in a file system thus obtains computing power nearly as good as – or even greater than – a database’s. This adds computing capacity on top of part iii and makes the originally impossible triangle feasible.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F77jc8t019aezj5806cal.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F77jc8t019aezj5806cal.png" alt=" " width="800" height="547"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Besides text files, SPL handles hierarchical data such as JSON naturally, so data stored in NoSQL databases and behind RESTful services can be used directly, without transformation. It’s really convenient.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnqmcx0heiuyugmgcsxtg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnqmcx0heiuyugmgcsxtg.png" alt=" " width="800" height="338"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;All-around computing capacity&lt;/strong&gt;&lt;br&gt;
SPL has all-around computational capability. The discrete dataset model it is based on (instead of relational algebra) arms it with a set of computing abilities as complete as SQL’s. Moreover, with agile syntax and procedural programming ability, data processing in SPL is simpler and more convenient than in SQL.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuyzo1l8hae2lpktxxpty.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuyzo1l8hae2lpktxxpty.png" alt=" " width="800" height="362"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;SPL boasts a wealth of class libraries for computations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Accessing source data directly&lt;/strong&gt;&lt;br&gt;
SPL’s open computing power extends beyond data lake. Generally, if the target data isn’t synchronized from the source to the lake but is needed right now, we have no choice but to wait for the completion of synchronization. Now with SPL, we can access the data source directly to perform computations, or perform mixed computations between the data source and the existing data in the lake. Logically, the data source can be treated as part of the data lake to engage in the computation so that higher flexibility can be achieved.&lt;/p&gt;

&lt;h2&gt;
  
  
  High-performance computations after data transformation
&lt;/h2&gt;

&lt;p&gt;With SPL joining the stack, the data warehouse becomes optional. SPL has all-around, remarkable computing power and offers high-performance file storage formats. ETLing the raw data into SPL’s storage formats achieves higher performance, and the file system brings further advantages such as flexible use and easy parallel processing.&lt;/p&gt;

&lt;p&gt;SPL provides two high-performance storage formats – bin file and composite table. A bin file is compressed (to occupy less space and allow fast retrieval), stores data types (to enable faster reading without parsing), and supports the double increment segmentation technique for dividing an append-able file, which facilitates parallel processing and further increases computing performance. The composite table uses column-wise storage, giving it a great advantage in scenarios where only a small number of columns (fields) is involved. It is also equipped with a min-max index and supports the double increment segmentation technique, letting computations both enjoy the advantages of column-wise storage and be parallelized more easily for better performance.&lt;/p&gt;

&lt;p&gt;It is easy to implement parallel processing in SPL and fully exploit multiple CPUs. Many SPL functions – file retrieval, filtering, sorting and more – support parallel processing: simply adding the @m option enables automatic multithreading. SPL also supports writing explicitly parallel programs to enhance computing performance.&lt;/p&gt;
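The effect of @m-style parallelism can be sketched in Python as a conceptual analogue (this is not SPL’s implementation): split the data into segments, aggregate each segment concurrently, then merge the partial results:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_sum(data, workers=4):
    """Segment the data, sum each segment in its own thread, then combine."""
    size = len(data)
    # Segment boundaries: worker i handles data[bounds[i][0]:bounds[i][1]]
    bounds = [(i * size // workers, (i + 1) * size // workers)
              for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(lambda b: sum(data[b[0]:b[1]]), bounds)
    return sum(partials)

data = list(range(100000))
assert parallel_sum(data) == sum(data)
```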

&lt;p&gt;In addition, SPL supports a variety of high-performance algorithms that SQL cannot express – the common TopN operation, for example. SPL treats TopN as a kind of aggregate operation, transforming a high-complexity full sort into a low-complexity aggregation while widening its range of application.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz5y7qxkvdn7rep7drn0j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz5y7qxkvdn7rep7drn0j.png" alt=" " width="800" height="192"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The SPL statements involve no sort-related keywords and never trigger a full sort. The statement for getting the top N of a whole set and that for grouped subsets are basically the same, and both perform well. SPL boasts many more such high-performance algorithms.&lt;/p&gt;
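The idea of computing TopN as an aggregation instead of a sort can be sketched with a bounded heap in Python (illustrative only; SPL’s internals are not shown): a single pass keeps just the N best values, so the cost is O(n log N) rather than the O(n log n) of a full sort:

```python
import heapq

def top_n(values, n):
    """One-pass TopN as an aggregation: maintain a min-heap of at most
    n elements and never sort the full input."""
    heap = []
    for v in values:
        if n > len(heap):
            heapq.heappush(heap, v)
        elif v > heap[0]:
            # v beats the smallest of the current top n; swap it in
            heapq.heapreplace(heap, v)
    return sorted(heap, reverse=True)

assert top_n([5, 17, 3, 99, 42, 8, 64, 23], 3) == [99, 64, 42]
```

Because the aggregation state is just the small heap, the same routine also works per group for grouped TopN, without ever materializing a sorted set.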

&lt;p&gt;Assisted by all these mechanisms, SPL can achieve performance orders of magnitude higher than traditional data warehouses. With storage and computation after data transformation both solved, a data warehouse is no longer a data lake necessity.&lt;/p&gt;

&lt;p&gt;Furthermore, SPL can perform mixed computations directly on and between transformed data and raw data, exploiting the value of each type of data source rather than preparing all data in advance. This creates highly agile data lakes.&lt;/p&gt;

&lt;p&gt;SPL allows the phases of lake building – loading, transformation and computation – to proceed side by side, where conventionally they could only run one after another. Data preparation and computation can be carried out concurrently, and any raw, irregular data can be computed directly. Handling computation and transformation in parallel rather than in serial order is the key to building an ideal data lake.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Lakehouse? More Like a Lake + Warehouse Parking Lot</title>
      <dc:creator>Judy</dc:creator>
      <pubDate>Thu, 22 Jan 2026 06:50:56 +0000</pubDate>
      <link>https://dev.to/esproc_spl/lakehouse-more-like-a-lake-warehouse-parking-lot-4hfg</link>
      <guid>https://dev.to/esproc_spl/lakehouse-more-like-a-lake-warehouse-parking-lot-4hfg</guid>
      <description>&lt;p&gt;From all-in-one machine, hyper-convergence, cloud computing to HTAP, we constantly try to combine multiple application scenarios together and attempt to solve this type of problem through one technology so as to achieve the goal of simple and efficient use. Lakehouse, which is very hot nowadays, is exactly such a technology; its goal is to integrate the data lake with the data warehouse to give play to their respective value at the same time.&lt;/p&gt;

&lt;p&gt;The data lake and the data warehouse have always been closely related, yet there are significant differences between them. The data lake pays more attention to retaining original information; its primary goal is to store the raw data “as is”. But raw data contains a lot of junk. Does storing it “as is” mean all the junk data ends up in the lake too? Yes – the data lake is like a junk yard where everything is stored, useful or not. The first problem the data lake faces is therefore the storage of massive (junk) data.&lt;/p&gt;

&lt;p&gt;Benefiting from the considerable progress of modern storage technology, the cost of storing massive data has dropped dramatically; a distributed file system, for example, can fully meet a data lake’s storage needs. But the ability to store data alone is not enough – computing ability is required to bring out the value. A data lake stores various types of data, each processed differently, and structured data processing matters most: whether for historical data or newly generated business data, processing focuses mainly on structured data, and computations on semi-structured and unstructured data are often eventually transformed into structured data computations. Unfortunately, since the data lake’s storage layer (a file system) has no computing ability of its own, data cannot be processed directly on the lake; other technologies (such as a data warehouse) must be used. The data lake’s main problem is being “capable of storing, but incapable of computing”.&lt;/p&gt;

&lt;p&gt;For the data warehouse, it is just the opposite. A data warehouse is based on the SQL system and usually has a powerful ability to compute structured data. However, raw data can be loaded into the warehouse only after it has been cleansed, transformed and deeply organized to meet the database’s constraints. In this process a large amount of original information is lost, and the data granularity may become coarser, so the value hidden in finer-grained data can no longer be obtained. Moreover, the data warehouse is highly subject-oriented and serves only one or a few subjects. Since data outside those subjects is not its target, the range of usable data is relatively narrow; the warehouse cannot explore the value of full, unknown data as a data lake does, let alone store massive raw data like a data lake. Compared with the data lake, the data warehouse is “capable of computing, but incapable of storing”.&lt;/p&gt;

&lt;p&gt;From the point of view of data flow, the warehouse’s data can be organized on top of the data lake, so a natural idea is to integrate the data lake with the data warehouse to become “capable of both storing and computing”: the so-called “Lakehouse”.&lt;/p&gt;

&lt;p&gt;So, how is it actually implemented today?&lt;/p&gt;

&lt;p&gt;The current method is oversimplified and crude: open data access rights on the data lake so that the data warehouse can access the data in “real time” (real time relative to the original ETL process that periodically moves data from the lake to the warehouse; in practice there is still some delay). Physically, the data is still stored in two places, and data interaction is performed over a high-speed network. Because it gains a certain ability to process the lake’s data in real time, this result (mostly at the architecture level) is now called a Lakehouse.&lt;/p&gt;

&lt;p&gt;That’s it? Is that a Lakehouse in the true sense?&lt;/p&gt;

&lt;p&gt;Well, as the saying goes: as long as the one claiming it is a Lakehouse doesn’t feel embarrassed, it is everyone who knows what a Lakehouse should be that ends up embarrassed for them.&lt;/p&gt;

&lt;p&gt;Then, how does the data warehouse read the data of the data lake? A common practice is to create an external table/schema in the warehouse that maps an RDB’s table or schema, or Hive’s metastore, just as a traditional RDB accesses external data through external tables. Although metadata information is retained, the disadvantages are obvious. It requires that the lake’s data can be mapped to tables and schemas under the relational model, and the data still has to be organized before it can be computed. The types of usable data sources also shrink (for example, NoSQL, text and Webservice sources cannot be mapped directly). Furthermore, even when other computable sources (such as an RDB) exist in the lake, the warehouse usually has to move the data to its local storage for computations such as grouping and aggregation, resulting in high data transmission cost, a performance drop, and many other problems.&lt;/p&gt;

&lt;p&gt;In the current Lakehouse, besides the “real-time” data interaction, the original channel for periodically organizing data in batches is still retained, so organized lake data can be stored into the warehouse for local computing. Of course, this has little to do with the Lakehouse itself, because it was done the same way before the “integration”.&lt;/p&gt;

&lt;p&gt;Either way, whether data is moved from lake to warehouse through traditional ETL or mapped externally in real time, both the data lake and the data warehouse change little (only the data transmission frequency improves, and even that requires many conditions to be met). Physically, the data is still stored in two places. The lake is still the original lake, the warehouse is still the original warehouse, and they are not essentially integrated! Consequently, not only are the data diversity and efficiency problems left unsolved (lack of flexibility), but the lake’s “junk” data still has to be organized and loaded into the warehouse before computing (poor real-time performance). Claiming that a “Lakehouse” implemented this way builds a real-time, efficient data processing ability on the data lake is, I’m afraid, a joke.&lt;/p&gt;

&lt;p&gt;Why?&lt;/p&gt;

&lt;p&gt;If we think about it a little, we will find that the problem lies in the data warehouse. The database system is too closed: data (including externally mapped data) must be loaded into the database before it can be computed, and because of database constraints, it must first be deeply organized to conform to the norms, while the lake’s raw data contains a lot of “junk”. Organizing this data is reasonable, but it makes responding to the lake’s real-time computing needs difficult. If the database were open enough to compute directly on the lake’s unorganized data, or even to perform mixed computations across a variety of data source types, while providing a high-performance mechanism to ensure computing efficiency, then implementing a real Lakehouse would be easy. Unfortunately, the database cannot achieve this.&lt;/p&gt;

&lt;p&gt;Fortunately, esProc SPL can.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SPL - an open computing engine - helps implement a real Lakehouse&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The open-source SPL is a structured data computing engine that provides open computing power for the data lake. It can compute directly on the lake’s raw data, with no constraints and no need for a database to hold the data. Moreover, SPL offers mixed computing across diverse data sources: whether the lake is built on a unified file system or on diverse sources (RDB, NoSQL, LocalFile, Webservice), SPL can perform mixed computation on them directly, so the lake’s value can be produced quickly. Furthermore, SPL provides high-performance file storage (the storage function of a data warehouse). Data can be organized unhurriedly while calculations are already running in SPL, and loading the raw data into SPL’s storage yields higher performance. Note in particular that data organized into SPL storage still resides in the file system and, in principle, can be stored in the same place as the data lake. In this way, a real Lakehouse can be implemented.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0wvwlud5aj37vdv58tcg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0wvwlud5aj37vdv58tcg.png" alt=" " width="800" height="470"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the whole architecture, SPL can perform unified storage and calculation directly on the data lake, connect to the diverse data sources inside it, and even read external production data sources directly. With these abilities, real-time calculation on the data lake can be implemented; in scenarios demanding high data timeliness (where data must be used before it lands in the lake), SPL can connect to the real-time source directly, achieving even higher timeliness.&lt;/p&gt;

&lt;p&gt;The original way of moving data from the lake to the warehouse can still be retained: ETLing raw data into SPL’s high-performance storage achieves higher computing performance. Meanwhile, using the file system for storage allows the data to be distributed on SPL servers, or we can keep using the lake’s unified file storage; that is, the work of the original data warehouse is completely taken over by SPL. As a result, the Lakehouse is implemented in one system.&lt;/p&gt;

&lt;p&gt;Let's take a look at these abilities of SPL.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open and all-around computing power
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Diverse-source mixed computing ability&lt;/strong&gt;&lt;br&gt;
SPL supports various data sources, including RDB, NoSQL, JSON/XML, CSV, Webservice, etc., and can perform mixed computation across different sources. This enables direct use of any type of raw data stored in the lake, unlocking its value without transforming it; the step of “loading into the database” is omitted. Flexible and efficient use of data is thus ensured, and a wider range of business requirements is covered.&lt;/p&gt;
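To make the idea of mixed computation concrete without SPL syntax, here is a minimal Python sketch (all data and field names are hypothetical) that joins a CSV source with a JSON source in one computation:

```python
import csv, json, io
from collections import defaultdict

# Hypothetical sample data standing in for two lake sources:
# a CSV file of orders and a JSON document of customers.
orders_csv = """order_id,customer_id,amount
1,C1,120.0
2,C2,80.0
3,C1,45.5
"""
customers_json = '[{"id": "C1", "region": "East"}, {"id": "C2", "region": "West"}]'

# Read both sources into in-memory records, each in its native format.
orders = list(csv.DictReader(io.StringIO(orders_csv)))
region_of = {c["id"]: c["region"] for c in json.loads(customers_json)}

# Mixed computation: total order amount per customer region,
# with neither source loaded into a database first.
totals = defaultdict(float)
for o in orders:
    totals[region_of[o["customer_id"]]] += float(o["amount"])

print(dict(totals))  # {'East': 165.5, 'West': 80.0}
```

The point is that each source is consumed as-is; no prior “loading into the database” step is needed.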

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbut3krbyg9tmc6gvyvzw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbut3krbyg9tmc6gvyvzw.png" alt=" " width="800" height="349"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With this ability, the data lake can provide data services for applications as soon as it is established, rather than after a prolonged cycle of data preparation, loading and modeling. Moreover, an SPL-based data lake is more flexible and can respond in real time to business needs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Supporting file computing&lt;/strong&gt;&lt;br&gt;
Particularly, SPL’s good support for files gives them powerful computing ability: lake data stored in a file system gains computing power nearly as good as, or even greater than, a database’s. Besides text files, SPL can handle hierarchical formats like JSON, so data from NoSQL stores and RESTful services can be used directly without transformation. It’s really convenient.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzntcgftagmd6c3opf6kr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzntcgftagmd6c3opf6kr.png" alt=" " width="800" height="334"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;All-around computing capacity&lt;/strong&gt;&lt;br&gt;
SPL provides all-around computational capability. The discrete dataset model it is based on (instead of relational algebra) arms it with a set of computing abilities as complete as SQL’s. Moreover, with agile syntax and procedural programming ability, data processing in SPL is simpler and more convenient than in SQL.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuqjh7anfgy5zzxhmmbh2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuqjh7anfgy5zzxhmmbh2.png" alt=" " width="800" height="362"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Rich computing library of SPL&lt;/p&gt;

&lt;p&gt;This gives the data lake the full computing ability of a data warehouse, achieving the first step of integrating the data lake with the data warehouse.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Accessing source data directly&lt;/strong&gt;&lt;br&gt;
SPL’s open computing power extends beyond the data lake. Normally, if target data has not yet been synchronized from its source into the lake but is needed right now, we have no choice but to wait for the synchronization to complete. With SPL, we can access the data source directly to perform computations, or perform mixed computations between the source and the data already in the lake. Logically, the source can be treated as part of the lake and take part in the computation, which brings higher flexibility.&lt;/p&gt;

&lt;h2&gt;
  
  
  High-performance computations after data organization
&lt;/h2&gt;

&lt;p&gt;In addition to its all-around and powerful computing abilities, SPL provides file-based high-performance storage. ETLing raw data into SPL storage achieves higher performance, and the file system brings a series of advantages such as flexibility of use and ease of parallel processing. Gaining this data storage ability amounts to the second step of integrating the data lake with the data warehouse: a new, open and flexible data warehouse is formed.&lt;/p&gt;

&lt;p&gt;Currently, SPL provides two high-performance file storage formats: the bin file and the composite table. The bin file adopts compression (faster reading due to smaller space occupation), stores data types (faster reading as types need not be parsed), and supports the double increment segmentation mechanism for appending data; since segmentation makes parallel computing easy to implement, computing performance is ensured. The composite table supports columnar storage, which has a great advantage in scenarios where only a small number of columns (fields) are involved. In addition, the composite table implements the min-max index and supports double increment segmentation, so it not only enjoys the advantages of columnar storage but also makes parallel computing easier, further improving performance.&lt;/p&gt;
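The advantage columnar storage brings when a query touches few fields can be sketched generically; this toy Python illustration (not SPL’s actual file format) contrasts row and column layouts:

```python
# Toy illustration of why columnar layout helps when a query
# touches few fields. Records and field names are made up.
rows = [
    {"id": 1, "amount": 10.0, "region": "E", "note": "long free text..."},
    {"id": 2, "amount": 20.0, "region": "W", "note": "long free text..."},
    {"id": 3, "amount": 30.0, "region": "E", "note": "long free text..."},
]

# Row storage: every query walks whole records, including unused fields.
row_total = sum(r["amount"] for r in rows)

# Columnar storage: each field is a separate contiguous array,
# so summing 'amount' never touches 'region' or 'note' at all.
columns = {k: [r[k] for r in rows] for k in rows[0]}
col_total = sum(columns["amount"])

print(row_total, col_total)  # 60.0 60.0
```

The same answer is computed either way; the saving is in how much data must be scanned, which on disk translates directly into less I/O.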

&lt;p&gt;Furthermore, parallel computing is easy to implement in SPL, fully exploiting multiple CPUs. Many SPL functions, such as file retrieval, filtering and sorting, support parallel processing: simply adding the @m option implements multithreaded processing automatically. Writing parallel programs explicitly to enhance computing performance is supported as well.&lt;/p&gt;

&lt;p&gt;In particular, SPL supports a variety of high-performance algorithms that SQL cannot express. For example, SPL treats the common TopN operation as an aggregation, transforming a high-complexity sorting operation into a low-complexity aggregation while also extending its range of application.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe06zzhwu5zp0d1vwr03d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe06zzhwu5zp0d1vwr03d.png" alt=" " width="800" height="176"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In these statements there are no sort-related keywords, and no full sorting is triggered. The statements for getting the top N from a whole set and from grouped subsets are basically the same, and both achieve higher performance. SPL offers many more such high-performance algorithms.&lt;/p&gt;
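The idea of treating TopN as an aggregation, keeping a small bounded set as values stream by instead of sorting everything, can be sketched with Python’s heapq (a generic analogue, not SPL’s implementation):

```python
import heapq, random

random.seed(7)
values = [random.randint(0, 10_000) for _ in range(100_000)]

# TopN as an aggregation: heapq.nlargest maintains a small k-element
# heap while scanning once -- O(n log k), with no full O(n log n) sort.
top3 = heapq.nlargest(3, values)

# A full sort gives the same answer at higher cost.
assert top3 == sorted(values, reverse=True)[:3]
print(top3)
```

Because the running state is just a tiny heap, the same aggregation can also be applied per group, which is why whole-set and grouped TopN look alike.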

&lt;p&gt;Relying on these mechanisms, SPL can achieve performance surpassing that of a traditional data warehouse, often by orders of magnitude; the full implementation of a Lakehouse on the data lake is thus realized by effective mechanisms, not just words.&lt;/p&gt;

&lt;p&gt;Furthermore, SPL can perform mixed computations on transformed data and raw data together, giving full play to the value of all types of data without preparing data in advance. In this way, the flexibility of the data lake is fully extended while it also gains the function of a real-time data warehouse. This achieves the third step of integrating the data lake with the data warehouse, taking into account both flexibility and high performance.&lt;/p&gt;

&lt;p&gt;Through the above three steps, the path to building a data lake is improved (the original path required loading and transforming data before computing): data preparation and computation can proceed at the same time, and the lake is built step by step. Moreover, while the lake is being built, the data warehouse is perfected, giving the data lake powerful computing ability. This is the correct way to implement a real Lakehouse.&lt;br&gt;
SPL is now open-source. You can obtain the source code from &lt;a href="https://github.com/SPLWare/esProc" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>data</category>
      <category>database</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Are Wide Tables Fast or Slow?</title>
      <dc:creator>Judy</dc:creator>
      <pubDate>Mon, 19 Jan 2026 07:54:58 +0000</pubDate>
      <link>https://dev.to/esproc_spl/are-wide-tables-fast-or-slow-3ba1</link>
      <guid>https://dev.to/esproc_spl/are-wide-tables-fast-or-slow-3ba1</guid>
      <description>&lt;p&gt;Wide tables are usually a standard component of a BI system; many BI projects prepare wide tables at the very beginning of construction. A wide table is formed by joining multiple tables that have a certain association relationship. The result set does not conform to the normal forms, and it contains a large amount of redundant data. Moreover, since wide tables must be pre-created, they are not very flexible to use.&lt;/p&gt;

&lt;p&gt;But why do people very much prefer wide tables even if they have many shortcomings?&lt;/p&gt;

&lt;p&gt;Because wide tables are &lt;strong&gt;FAST&lt;/strong&gt;!&lt;/p&gt;

&lt;p&gt;Querying a wide table is usually faster than performing a real-time multi-table join, so wide tables are built to avoid joins. Join operations are a long-standing problem in SQL: they are difficult to write and have poor performance. Find a detailed analysis of SQL joins &lt;a href="https://c.esproc.com/article/1653353923359" rel="noopener noreferrer"&gt;HERE&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;However, though wide tables avoid joins, the redundancy means much extra data may be read during computation, increasing I/O time. For example, suppose each order in an Orders table corresponds to 5 records in the OrderDetails table. Stretching the two into a wide table repeats each Orders row 5 times. What’s more, Orders has dimension tables such as Customer and Employee, Customer in turn has dimension tables such as Region, and so on; when all of them are flattened into one wide table, the entire data volume is multiplied many times. A query on that wide table, such as summing order amounts by customer region, retrieves a large volume of data and incurs huge I/O overhead.&lt;/p&gt;
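A back-of-the-envelope calculation makes the blow-up concrete (row and field counts below are hypothetical, chosen only to illustrate the Orders/OrderDetails example above):

```python
# Hypothetical sizes for the Orders / OrderDetails example.
orders = 1_000_000          # rows in Orders
details_per_order = 5       # OrderDetails rows per order
order_fields = 10           # fields carried by each Orders row
detail_fields = 4           # fields carried by each OrderDetails row

# Normalized layout: each Orders row stored once, plus the detail rows.
normalized_cells = (orders * order_fields
                    + orders * details_per_order * detail_fields)

# Wide table: every order's fields are repeated on all 5 detail rows.
wide_cells = orders * details_per_order * (order_fields + detail_fields)

ratio = wide_cells / normalized_cells
print(ratio)  # ~2.33x more data to scan, before any dimension tables
```

Joining in Customer, Employee and Region columns inflates the wide rows further, so the real-world ratio is typically larger than this two-table estimate.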

&lt;p&gt;By this analysis, wide tables should be slower. Why are they faster in real-world practice? Because relational database joins are so slow that even with the wide table’s I/O cost multiplied several times, querying it still beats the real-time joins.&lt;/p&gt;

&lt;p&gt;If we could optimize the join to run faster, could we get satisfactory performance while avoiding the wide table’s problems of redundant data, errors from non-normalized result sets, and inflexibility?&lt;/p&gt;

&lt;p&gt;The answer is yes. But it is a pity that &lt;strong&gt;SQL cannot do that&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The linked document already analyzes the SQL join in detail. In a nutshell, the JOIN defined by the Cartesian product is indeed very simple, and that simple definition is broad enough to cover all JOIN scenarios. But such a general definition makes targeted optimization of different kinds of joins impossible; people can only devise ad-hoc engineering workarounds rather than solve the problem fundamentally.&lt;/p&gt;

&lt;p&gt;Here is another fact. Since the debut of the database, vendors have stretched optimization of the simple SQL used in BI analysis to its limit, and yet wide tables are still needed to solve the performance problem. Evidently, the performance problem of join operations is hard to deal with in SQL; we might even say impossible.&lt;/p&gt;

&lt;p&gt;Is there anything we can do?&lt;/p&gt;

&lt;p&gt;We can use SPL to tackle the problem.&lt;/p&gt;

&lt;p&gt;SPL (Structured Process Language) is an open-source computing engine intended for structured data computations. It offers powerful computing ability independent of databases, and its join performance is much higher than that of both the SQL join and the wide-table approach. The language addresses the root of the join performance problem while avoiding the problems brought by wide tables.&lt;/p&gt;

&lt;p&gt;SPL categorizes the equi-joins commonly seen in BI analysis into two types: foreign-key joins and primary-key joins. Each is provided with its own performance optimization methods, explained in the second half of the above-mentioned post on join simplification and acceleration. Specifically, SPL offers dimension table preloading and numberization for the common foreign-key joins, and order-based merging for primary-key joins, which significantly reduce join complexity.&lt;/p&gt;
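The numberization idea, replacing per-row hash lookups with direct array indexing against a preloaded dimension table, can be sketched as follows (a generic illustration of the technique with made-up data, not SPL code):

```python
# Dimension table preloaded into memory; its primary keys are
# converted once to sequence numbers 0..n-1 ("numberization").
regions = ["East", "West", "North"]          # dimension records
key_to_seq = {"E": 0, "W": 1, "N": 2}        # built once, ahead of time

# Fact rows then carry the sequence number instead of the raw key,
# so each join lookup is a direct array access, not a hash probe.
facts = [("E", 10.0), ("W", 20.0), ("E", 5.0)]
numberized = [(key_to_seq[k], amt) for k, amt in facts]

# Aggregate by region via the O(1) indexed "join".
totals = [0.0] * len(regions)
for seq, amt in numberized:
    totals[seq] += amt

print(dict(zip(regions, totals)))  # {'East': 15.0, 'West': 20.0, 'North': 0.0}
```

The one-time numberization cost is amortized across every subsequent query against the same dimension table, which is where the bulk of the saving comes from.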

&lt;p&gt;Once we speed up the join operation, wide tables become useless and the volume of data to be read is reduced. The result is that SPL greatly increases BI performance.&lt;/p&gt;

&lt;p&gt;That’s the theory. How does SPL perform in the field?&lt;/p&gt;

&lt;p&gt;A comparison test was performed, covering a common multidimensional-analysis aggregation by dimension after joining a fact table with multiple, multilayer dimension tables, as well as the same aggregation on a pre-built wide table.&lt;/p&gt;

&lt;p&gt;The test data is a TPCH 100G data set, and computations joining one large fact table with multiple dimension tables are designed:&lt;/p&gt;

&lt;p&gt;A two-table join between one fact table and one dimension table;&lt;br&gt;
A seven-table join between a primary-sub fact table pair and four dimension tables, one of which is used twice;&lt;br&gt;
Converting the seven-table join result into a wide table and performing wide-table-based aggregation.&lt;br&gt;
The products tested against are two specialized OLAP databases famous for high-performance BI analysis: StarRocks and ClickHouse. Below is the test result:&lt;/p&gt;

&lt;p&gt;Time unit: Second&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm6lsfbnclnz98cvz9rp2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm6lsfbnclnz98cvz9rp2.png" alt=" " width="800" height="244"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Find detailed test report &lt;a href="https://c.esproc.com/article/1690170794600" rel="noopener noreferrer"&gt;HERE&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;According to the results, SQL’s wide table is faster than its join, which verifies the earlier analysis. SPL’s wide-table performance is actually not as fast as ClickHouse’s, but its real-time join performance is very high: 3-9 times faster than the joins in the two SQL databases, and it even &lt;strong&gt;surpasses the databases’ wide-table method by huge margins&lt;/strong&gt;. Taking the wide table’s shortcomings into account (redundant data, data errors, inflexibility), the advantage of SPL’s real-time join becomes even more obvious: it avoids wide-table defects while multiplying performance.&lt;/p&gt;

&lt;p&gt;A wide-table-based query is therefore not necessarily faster than a real-time join! With SPL, the costly wide tables created to obtain high performance in BI systems become unnecessary.&lt;/p&gt;

&lt;p&gt;SPL is now open-source. You can obtain the source code from &lt;a href="https://github.com/SPLWare/esProc" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; .&lt;/p&gt;

</description>
      <category>database</category>
      <category>dataengineering</category>
      <category>performance</category>
      <category>sql</category>
    </item>
    <item>
      <title>Tired of ETL Bottlenecks? Build a Logical Data Warehouse with SPL</title>
      <dc:creator>Judy</dc:creator>
      <pubDate>Fri, 16 Jan 2026 07:10:54 +0000</pubDate>
      <link>https://dev.to/esproc_spl/tired-of-etl-bottlenecks-build-a-logical-data-warehouse-with-spl-52la</link>
      <guid>https://dev.to/esproc_spl/tired-of-etl-bottlenecks-build-a-logical-data-warehouse-with-spl-52la</guid>
      <description>&lt;p&gt;A logical DW (data warehouse) offers users the ability to logically integrate a variety of different types of data sources without moving the original data, presenting itself as if it were a physical DW. It addresses the traditional DW’s inability to respond to real-time data processing needs, caused by the long data chain of data movement, so it suits fast-changing business scenarios well, and it has cross-source computing ability. However, lacking physical storage, a logical DW needs to map the data of each source to SQL tables in order to implement mixed computation over multiple sources.&lt;/p&gt;

&lt;p&gt;There is no problem with the idea and application scenarios of the logical DW, but the way it is implemented deserves discussion. Currently the external interface of most logical DWs is still SQL, because almost all traditional DWs are built on SQL and logical DWs follow suit. The benefit of SQL is universality, which lowers the learning and application thresholds.&lt;/p&gt;

&lt;p&gt;However, SQL weakens the “logical” ability: it lacks support for diverse data sources.&lt;/p&gt;

&lt;p&gt;Unlike a physical DW, a logical DW faces a much wider variety of data sources. Many of them do not meet the constraints of a DW (SQL), so mapping them to SQL tables is of little use, and there is no step of loading data into a database to make it meet the constraints (a physical DW performs this step and therefore never has to face diverse sources, whereas a logical DW does). As a result, the logical DW lacks support for diverse sources: it is relatively easy to support RDB-based sources, difficult to support multi-layer data structures such as NoSQL, Webservice and JSON, and more difficult still to support source types like the file system. In fact, most of today’s logical DWs support only RDBs well and other types of sources poorly.&lt;/p&gt;

&lt;p&gt;The lack of support for diverse sources also shows in functionality. We use multiple data sources precisely because different sources have different capabilities suited to different scenarios. Even among RDBs there are syntax differences, the databases’ respective dialects, and these dialects are not fabricated out of thin air: they are designed to leverage each database’s own abilities. Unless a logical DW could account for every dialect of every underlying database (obviously impossible), much of a database’s ability, which is reflected only in its dialect, goes unused. The problem is even worse for non-SQL databases; for example, MongoDB’s filtering syntax is quite different from SQL’s. Ideally, besides an automatic translation mechanism, the ability to use the data source’s own syntax directly should be provided. Unfortunately, SQL obviously does not have this ability.&lt;/p&gt;

&lt;p&gt;On the other hand, logical DWs are generally weak in physical computing ability.&lt;/p&gt;

&lt;p&gt;Reading data from diverse sources also raises performance issues. When the data amount is small, reading it on the fly is fine; when it is large, the I/O cost is high and on-the-fly reading makes performance intolerable. To ensure computing performance, a logical DW usually provides certain physical computing abilities (storing data on physical devices), but for reasons such as long-standing habit, a big gap remains between logical and physical DWs in adaptive storage and the resulting computing performance.&lt;/p&gt;

&lt;p&gt;Essentially, a logical DW should combine a physical DW with logical ability, and physical computing ability should be its foundation. Unfortunately, a purely logical DW is very weak at physical computing (suitable only for small data volumes and low-performance scenarios). In addition, a logical DW should be more open and flexible: besides connecting different sources for mixed computing, it should be able to fully leverage the advantages of each source according to the computing scenario.&lt;/p&gt;

&lt;p&gt;The current dilemma is that the logical DW is weak in physical computing while the physical DW is weak in logical data sources, so we need to combine the two and draw on each other’s merits.&lt;/p&gt;

&lt;p&gt;Based on these factors, using SPL to implement a logical DW is a better choice.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implement logical DW in SPL
&lt;/h2&gt;

&lt;p&gt;SPL is an open-source computing engine with computing ability open enough to integrate multiple types of data sources for mixed computation. Inherently powerful physical computing, together with high-performance guarantee mechanisms and logical cross-source computing ability, lets SPL fully implement a logical DW.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Logical data source ability&lt;/strong&gt;&lt;br&gt;
Currently, SPL can connect to dozens of data sources, not limited to RDBs: NoSQL, CSV, Excel, JSON, XML, HDFS, Elasticsearch, Kafka and more.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fccvkvwnhw0qsd0g1hht5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fccvkvwnhw0qsd0g1hht5.png" alt=" " width="783" height="363"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When connecting these data sources, SPL regards them as table sequences (small data) or cursors (big data) instead of mapping them to database tables. How to generate the table sequence or cursor is the business of each data source; any source offers such an interface, though it may not, and often cannot, provide a SQL access interface with unified syntax. In this way, each source’s own ability can be fully utilized.&lt;/p&gt;

&lt;p&gt;It is easy for SPL to perform cross-source mixed computation over these data sources. As an example, the following code uses SPL to handle a common cross-database scenario (mixed computation across different types of databases):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffevw4jdt7ut52zffekek.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffevw4jdt7ut52zffekek.png" alt=" " width="800" height="198"&gt;&lt;/a&gt;&lt;br&gt;
In this example, SPL doesn’t read all the raw data from MySQL; instead, it performs a grouping and aggregation in SQL before reading. As a result, the data volume of the large Orders table is significantly reduced, and I/O efficiency is greatly increased when fetching the data through interfaces (like JDBC).&lt;/p&gt;

&lt;p&gt;As mentioned earlier, SQL translation faces dialect issues, which prevents many of a database’s functionalities from being used. SPL also provides a similar translation function that can translate standard SQL into the corresponding database statements. More importantly, though, SPL supports the direct use of a data source’s own syntax, so its particular advantages can be fully exploited, whether it is a SQL dialect or a non-SQL data source.&lt;/p&gt;

&lt;p&gt;In addition to cross-database computation, SPL can perform mixed calculation between data sources of any type. For example, sometimes storing cold data in the file system is more cost-effective and more flexible to process (data can be redundantly stored at will), while the hot data stays in the database. To run a real-time query over the full data, the SPL code is:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F28qpv5047nkufur8swuq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F28qpv5047nkufur8swuq.png" alt=" " width="800" height="275"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;SPL is also capable of integrating data sources other than RDBs. In particular, SPL provides good support for multi-layer data structures, which makes it convenient to process data from Web interfaces, IoT, and NoSQL. For example, SPL reads multi-layer JSON data and performs an association query with a database:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5vobbf015zbez2qz8ml2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5vobbf015zbez2qz8ml2.png" alt=" " width="800" height="264"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Likewise, SPL supports NoSQL such as MongoDB:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsoyzjz8fgopotivf49bz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsoyzjz8fgopotivf49bz.png" alt=" " width="800" height="269"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Another example: mixed computation of RESTful data and text data:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbgrjvjizq6kif5wqk8uq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbgrjvjizq6kif5wqk8uq.png" alt=" " width="800" height="209"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By now, we can see that SPL provides an independent computing ability that does not depend on any particular data source, yet the ability of the data source itself can still be utilized. Users can choose where a calculation runs, at the data source end or in the logical DW (SPL), and this is where SPL’s flexibility comes in.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Physical computing ability&lt;/strong&gt;&lt;br&gt;
The previous section gave several examples of SPL’s ability to integrate multiple data sources. In addition, SPL offers powerful physical computing ability.&lt;/p&gt;

&lt;p&gt;SPL provides a professional structured data object, the table sequence, and offers a rich computing library based on it, giving SPL complete and simple structured data processing ability.&lt;/p&gt;

&lt;p&gt;Below are some common calculations written in SPL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Orders.sort(Amount)             // sort
Orders.select(Amount*Quantity&amp;gt;3000 &amp;amp;&amp;amp; like(Client,"*S*"))       // filter
Orders.groups(Client; sum(Amount))          // group
Orders.id(Client)               // distinct
join(Orders:o,SellerId ; Employees:e,EId)           // join

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By means of procedural computation and the table sequence, SPL can implement still more calculations. For example, SPL supports ordered operations more directly and thoroughly, and its grouping operation can retain the grouped subsets, that is, a set of sets, allowing us to conveniently perform further operations on the grouped result. SPL syntax differs from SQL in many ways; to be precise, these differences are advantages, which we will discuss later.&lt;/p&gt;
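&lt;p&gt;To illustrate the idea of retaining grouped subsets, here is a rough Python analogy (not SPL; the data and field names are invented for illustration):&lt;/p&gt;

```python
from collections import defaultdict

orders = [
    {"client": "A", "amount": 100},
    {"client": "B", "amount": 50},
    {"client": "A", "amount": 70},
]

# Group while retaining the grouped subsets (a set of sets) instead of
# forcing an immediate aggregation as SQL's GROUP BY does.
groups = defaultdict(list)
for o in orders:
    groups[o["client"]].append(o)

# Further operations can then run on each subset, e.g. each client's
# largest order amount.
largest = {c: max(g, key=lambda o: o["amount"])["amount"]
           for c, g in groups.items()}  # {"A": 100, "B": 50}
```

Because each group is kept as a real set of records, any follow-up computation can be applied per group without re-querying the source.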

&lt;p&gt;In addition to rich algorithms and libraries, SPL provides high-performance guarantee mechanisms. As mentioned earlier, a logical DW should be a combination of a physical DW and logical ability, and physical computing ability is very important; only the combination of the two can provide a sufficient performance guarantee. SPL designs a number of algorithms specifically for high-performance computing:&lt;/p&gt;

&lt;p&gt;In-memory computing: binary search, sequence number positioning, position index, hash index, multi-layer sequence number positioning…&lt;/p&gt;

&lt;p&gt;External storage search: binary search, hash index, sorting index, index-with-values, full-text retrieval…&lt;/p&gt;

&lt;p&gt;Traversal computing: delayed cursor, multipurpose traversal, parallel multi-cursor, ordered grouping and aggregating, sequence number grouping…&lt;/p&gt;

&lt;p&gt;Foreign key association: foreign key addressization, foreign key sequence-numberization, index reuse, aligned sequence, one-side partitioning…&lt;/p&gt;

&lt;p&gt;Merge and join: ordered merging, merge by segment, association positioning, attached table…&lt;/p&gt;

&lt;p&gt;Multidimensional analysis: partial pre-aggregation, time period pre-aggregation, redundant sorting, boolean dimension sequence, tag bit dimension…&lt;/p&gt;

&lt;p&gt;Cluster computing: cluster multi-zone composite table, duplicate dimension table, segmented dimension table, redundancy-pattern fault tolerance and spare-wheel-pattern fault tolerance, load balancing…&lt;/p&gt;

&lt;p&gt;Of course, both logical and physical calculations cannot be separated from data storage. Sometimes organizing the data according to the computing objective (such as sorting it by a specified field) yields higher computing performance, and conversely, some high-performance algorithms must be supported by storage. For this reason, SPL provides high-performance file storage. Please note that what SPL provides is file storage, which is completely different from the closed storage of traditional databases; SPL does not bind storage. From a logical point of view, SPL’s high-performance files are equal to any other data source, except that SPL applies engineering methods, such as compression, columnar storage and indexing, to the file storage to improve performance. Moreover, SPL provides many high-performance algorithms based on file storage.&lt;/p&gt;

&lt;p&gt;Physical storage gives SPL a physical computing ability that purely logical DWs cannot match, and also a significant performance advantage over other physical DWs. In real-world applications, SPL can often achieve a performance improvement of several times to dozens of times.&lt;/p&gt;

&lt;p&gt;Below are some performance improvement cases:&lt;/p&gt;

&lt;p&gt;Open-source SPL turns pre-association of query on bank mobile account into real-time association&lt;br&gt;
Open-source SPL Speeds up Query on Detail Table of Group Insurance by 2000+ Times&lt;br&gt;
Open-source SPL improves bank’s self-service analysis from 5-concurrency to 100-concurrency&lt;br&gt;
Open-source SPL speeds up intersection calculation of customer groups in bank user profile by 200+ times&lt;/p&gt;

&lt;p&gt;Another benefit of file storage is its flexibility, which allows us to organize the data at will according to the computing objective and avoids the situation where the data cannot be intervened in, as is the case with databases. Since storage is relatively cheap, we can make as many copies as we want; it is nothing more than a few more files. The same data can be organized in different forms (such as ordered by different fields) to adapt to different computing scenarios.&lt;/p&gt;

&lt;p&gt;SPL itself has complete, high-performance computing ability. Powerful physical computing ability, along with rich interfaces for diverse data sources, forms a complete logical DW, which is why we think SPL is more suitable for building a logical DW.&lt;/p&gt;

&lt;p&gt;The advantages of SPL don’t stop there. In building a logical DW, being lightweight and simple are also key characteristics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;More lightweight&lt;/strong&gt;&lt;br&gt;
Any mention of a DW suggests something heavy, a full server system even if it is just a logical DW. In fact, however, such operations occur in all kinds of scenarios, and many in-application cross-source calculations essentially fall within the scope of a logical DW. To this end, SPL supports not only independent deployment but also being integrated into applications.&lt;/p&gt;

&lt;p&gt;SPL has very low hardware requirements and is light to install. SPL can run on any operating system as long as a JVM environment with JDK 1.8 or a higher version (including common VMs and containers) is available, and it takes up less than 1GB of installation space. When integrated into an application, it only needs a few embedded jars to run, which is very convenient. Some reporting/BI tool vendors now also claim to support a logical DW, but the actual effect is poor, far inferior to professional DWs. Embedding SPL in such a tool can make up for this lack, enabling it to do logical calculations such as in-application cross-source computing.&lt;/p&gt;

&lt;p&gt;Multiple easy-to-use data source interfaces, together with physical file storage that keeps the use and management of files flexible and simple, as well as SPL’s agile syntax, make SPL very light to use as a logical DW.&lt;/p&gt;

&lt;p&gt;It is very convenient to manage data files in a file system. Specifically, files can be managed in a multi-level directory structure, and we can set up different directories for different businesses or modules: a certain directory and its subdirectories serve a single business, eliminating coupling between businesses; data modification will not affect other businesses; and if a certain business is retired, the corresponding data (directory) can be safely deleted, keeping overall management very neat. Moreover, SPL has no metadata and does not require a complex management system like a database. All of this makes O&amp;amp;M very light.&lt;/p&gt;

&lt;p&gt;Being light in installation/embedding, use, and operation and maintenance, SPL as a logical DW is very light overall.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lower development cost&lt;/strong&gt;&lt;br&gt;
Using SPL to perform data calculation makes development and debugging easier and lowers the development cost.&lt;/p&gt;

&lt;p&gt;SPL provides syntax that supports procedural computation, which greatly simplifies complex calculation. Clearly, for the same computing task, writing 100 lines of code as one statement (SQL) and writing 100 lines of code as 100 statements (SPL) present completely different complexities.&lt;/p&gt;

&lt;p&gt;Moreover, SPL provides an IDE that makes developing and debugging easier. Beyond the editing and debugging functions, the IDE allows us to code cell by cell and offers a panel to view the result of each step, which is very convenient.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frs866y0eme0wt4nd7oof.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frs866y0eme0wt4nd7oof.png" alt=" " width="800" height="417"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;More importantly, SQL does not have complete language ability and often cannot handle a pure data task alone. For example, calculating the maximum number of days a stock keeps rising, or the more complex e-commerce funnel calculation (such calculations are not rare and often appear in practice), is extremely difficult to implement in SQL and often requires resorting to Python or Java. Consequently, the technology stack becomes complex, bringing inconvenience to operation and maintenance.&lt;/p&gt;

&lt;p&gt;Compared to SQL, in which many scenarios are difficult or even impossible to implement, SPL provides more concise syntax and more complete ability. For example, to calculate the maximum number of days a stock keeps rising, SPL needs just one statement:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;stock.sort(trade_date).group@i(close_price&amp;lt;close_price [-1]).max(~.len())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In contrast, implementing this calculation in SQL requires nesting multiple layers of code in a very roundabout way.&lt;/p&gt;
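&lt;p&gt;To make the logic concrete, here is the same calculation sketched in plain Python (illustrative only; it assumes the price list is already sorted by trade date):&lt;/p&gt;

```python
def longest_rising_run(prices):
    """Length of the longest run of consecutive strictly rising prices."""
    best = run = 1
    for prev, cur in zip(prices, prices[1:]):
        # Extend the current rising run, or start a new one on a drop.
        run = run + 1 if cur > prev else 1
        best = max(best, run)
    return best

longest_rising_run([5, 6, 7, 3, 4, 5, 6, 2])  # the run 3,4,5,6 gives 4
```

Like the SPL one-liner, this walks the ordered data once and tracks the longest rising segment, which is exactly the computation SQL must express through layered nesting.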

&lt;p&gt;In addition to the conventional structured data computing library, SPL provides the ordered operations that SQL is not good at, the grouping operation that retains grouped subsets, various association methods, and so on. SPL syntax also offers many unique features, such as option syntax, cascaded parameters and advanced Lambda syntax, making complex calculations easier to implement.&lt;/p&gt;

&lt;p&gt;For more information about SPL syntax, visit: A programming language coding in a grid&lt;/p&gt;

&lt;p&gt;Concise syntax and complete language ability make development very efficient and eliminate the need to resort to other technologies, keeping the technology stack simpler: everything can be done in one system, which simplifies O&amp;amp;M and reduces cost.&lt;/p&gt;

&lt;p&gt;For a logical DW, the logical ability and the physical computing ability are equally important. Only by combining the two abilities can a logical DW fully play its role. In addition, the integration degree of data sources, support degree for data types, performance guarantee, ease of use, and development and O&amp;amp;M costs are also important considerations. Overall, using SPL to build logical DW is a good choice.&lt;/p&gt;

&lt;p&gt;SPL is now open-source. You can obtain the source code from &lt;a href="https://github.com/SPLWare/esProc" rel="noopener noreferrer"&gt;GitHub &lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Try it free~~ &lt;a href="https://www.esproc.com/download-esproc/" rel="noopener noreferrer"&gt;Download~&lt;/a&gt;&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>database</category>
      <category>dataengineering</category>
      <category>sql</category>
    </item>
    <item>
      <title>In-Memory Databases Are Overrated — Here’s What Actually Matters for Speed</title>
      <dc:creator>Judy</dc:creator>
      <pubDate>Tue, 13 Jan 2026 06:24:32 +0000</pubDate>
      <link>https://dev.to/esproc_spl/in-memory-databases-are-overrated-heres-what-actually-matters-for-speed-3ho1</link>
      <guid>https://dev.to/esproc_spl/in-memory-databases-are-overrated-heres-what-actually-matters-for-speed-3ho1</guid>
      <description>&lt;p&gt;It is easy to think about using an in-memory database to solve the performance problem of reporting, BI analysis, batch processing, and other data analysis tasks. An in-memory database allows storing all data permanently in the memory so that the computation external memory accesses (disk reads) are not needed, disk I/O can be avoided and data processing performance can be effectively improved.&lt;/p&gt;

&lt;p&gt;In-memory databases are advertised as high-performance and able to solve many performance problems in business data analysis. They are fast not only because of zero disk I/O costs but also because of specialized memory optimization techniques: memory supports random access and strong parallel processing, and techniques such as pre-loading and pre-indexing further enhance computing performance.&lt;/p&gt;

&lt;p&gt;But these techniques are not unique to in-memory databases. Actually, all computations are performed in memory: the CPU can only compute on data in memory, and data stored in external memory must be read into memory for processing. As general memory optimization techniques, they are adopted by many database products and computing engines, and as long as the memory optimization is done cleverly, performance catches up. In this sense, if the memory is large enough to hold all the data, every database could be called an in-memory database. The so-called “specialized in-memory databases” only offer optimization methods for in-memory computation; they are either not good at computations whose data volume exceeds memory capacity, or unable to handle them at all (external memory computation is much more difficult). To highlight their advantages while downplaying their disadvantages, vendors name these products “in-memory databases” to distinguish them from ordinary ones, while the truly specialized in-memory ability remains weak.&lt;/p&gt;

&lt;p&gt;Therefore, there’s no such thing as a “specialized in-memory database”; there are only specialized in-memory data processing techniques. It doesn’t matter whether a product claims to be an in-memory database. What really matters is whether the in-memory techniques the product uses are effective, which can be found out by testing and comparing computing performance under a large memory.&lt;/p&gt;

&lt;p&gt;As we have understood what in-memory databases are, let’s move on to discuss in-memory processing techniques and computing performance.&lt;/p&gt;

&lt;p&gt;We all know that SQL-based relational databases are still the mainstream database systems, whether a given database is a so-called in-memory database or not. But it’s a pity that SQL cannot make the best use of the characteristics of memory, leaving much room for performance improvement. Although certain optimization methods, such as compression, column-wise storage, indexing, parallel processing and vector computing, can be implemented well in SQL, other techniques that greatly help computing performance are inconvenient to implement in SQL, because SQL has intrinsic limits.&lt;/p&gt;

&lt;p&gt;SQL lacks a data type for representing records. A SELECT result is a new data table unrelated to the original table, even if it has only one record. As a result, a lot of data is copied during computation (requiring new memory space), and costs increase in both space usage and time consumption. With an explicit record type, we could form a data table directly from records that store the memory addresses of data in the original table; data would not be copied frequently or in large amounts during computation, memory would be used effectively, and performance would be higher.&lt;/p&gt;

&lt;p&gt;For multi-table join computation, if we convert the foreign keys in the fact table into the memory addresses of the corresponding records in the dimension table in advance (pre-join), no association is needed when the records are used, and we get performance as good as that of a single-table query. Yet it is a pity that SQL cannot exploit this characteristic.&lt;/p&gt;

&lt;p&gt;SQL also has poor support for order-based computation. The language is designed over unordered sets: it defines no ordinal numbers for set members and offers no location operations or mechanism for referencing neighboring members, making it difficult to exploit data orderliness for high-efficiency algorithms. For example, if data in memory is ordered, binary search can speed up queries, which noticeably increases performance on large data volumes. Location by ordinal number is a similar and even more efficient method, which uses memory’s high-speed random access to retrieve data directly by the specified ordinal number. Both methods are difficult to implement in SQL.&lt;/p&gt;
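&lt;p&gt;As a generic illustration (plain Python, not tied to any database), binary search over ordered in-memory data looks like this:&lt;/p&gt;

```python
from bisect import bisect_left

def ordered_lookup(sorted_keys, key):
    """Binary search on ordered data: O(log n) instead of a full scan."""
    i = bisect_left(sorted_keys, key)
    return i if i < len(sorted_keys) and sorted_keys[i] == key else -1

keys = [3, 7, 7, 15, 42, 99]
ordered_lookup(keys, 42)   # found at position 4
ordered_lookup(keys, 8)    # -1: not present
```

The lookup relies entirely on the data being kept ordered, which is exactly the kind of orderliness an unordered-set model cannot promise.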

&lt;p&gt;Making the most of memory also requires the ability to describe data structures. The relational model (SQL) relies on two-dimensional tables to represent relationships; it is difficult for it to describe and employ more complex data structures (such as multilevel JSON), which are very suitable for in-memory storage and have advantages in both space utilization and usage efficiency. Frankly, a language needs to support complex data structures to implement real grouping. SQL forcibly performs an aggregation after each grouping action, but sometimes it is the members of each group that we are concerned with. To achieve that, SQL must first invent group ordinal numbers using a subquery, producing cumbersome code, repeated queries and low efficiency.&lt;/p&gt;

&lt;p&gt;The root of SQL’s inability to make good use of memory features lies in its theory. When SQL (the relational model) was invented, computer memory was very small, so it is understandable that the language adapts awkwardly to today’s large-memory environment. Though contemporary databases have implemented many engineering optimizations and improved the situation to some extent, the optimization engine becomes useless once the scenario gets even a little complicated; after all, it is difficult to compensate for theoretical shortcomings with engineering. Moreover, the quality and application scope of an optimization engine can only be determined through strict testing and evaluation, and tests are generally limited to scenarios that SQL handles easily, not the complex ones. This makes database product selection failure-prone and highly risky.&lt;/p&gt;

&lt;p&gt;SPL can solve all these problems.&lt;/p&gt;

&lt;p&gt;SPL (Structured Process Language) is an open-source computing engine intended for computing structured and semi-structured data. To solve the SQL problems above, SPL is designed on a brand-new model instead of SQL’s relational algebra. The language supports both in-memory and external memory computation. It specifically offers memory optimization methods that achieve, and even surpass, the performance of an in-memory database. When the data volume exceeds memory capacity, SPL can switch to external memory computation (loading data into memory in batches), and in certain computing scenarios the performance is nearly as good as a full in-memory computation, enabling high performance even with little available memory. A unique thing about SPL is that it can be embedded into an application to supply high-efficiency computing ability.&lt;/p&gt;

&lt;p&gt;SPL also offers the mature engineering optimization techniques that in-memory databases use, such as the previously mentioned column-wise storage, vector computing, pre-loading and pre-indexing. These general engineering strategies are enough to achieve performance as good as an in-memory database’s, but they are not enough to bring the memory’s characteristics into full play, which, by contrast, is SPL’s core strength.&lt;/p&gt;

&lt;p&gt;As we said previously, we can make use of the characteristics of memory and computations to achieve higher performance when handling scenarios involving complex data structures or computations. In-memory databases cannot exploit those characteristics because of SQL’s limits.&lt;/p&gt;

&lt;p&gt;SPL can.&lt;/p&gt;

&lt;p&gt;Unlike SQL’s SELECT, which copies data, making the data volume bigger and efficiency lower, SPL only keeps the original records’ memory addresses and does not copy the records themselves, creating advantages in both space usage and computing efficiency. SPL can do this because it has a specialized record type for storing references (memory addresses) to the original data. SQL, however, has no record data type: a single record is actually a data table consisting of a single row, different data tables cannot share records, and filtering copies records into a new data table, producing unsatisfactory space utilization and time costs.&lt;/p&gt;
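&lt;p&gt;Python lists behave similarly: a filter result can hold references to the original record objects rather than copies, which is the effect described here (example data invented):&lt;/p&gt;

```python
orders = [{"id": 1, "amount": 500}, {"id": 2, "amount": 3500}]

# Filtering by reference: the new list stores references (addresses) to
# the original record objects; no record data is copied.
big = [o for o in orders if o["amount"] > 3000]

big[0] is orders[1]  # True: the very same object, not a copy
```

Because only references are stored, the filtered result costs almost nothing in extra memory, no matter how wide each record is.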

&lt;p&gt;We all know that the CPU accesses data in memory via addresses; if we store the addresses of certain data beforehand, accessing it later is very fast. Take the join operation as an example. If we store the addresses of the foreign key table (dimension table)’s records in the fact table, no join (hash computation) is needed when using data of the two tables, and we achieve performance equal to a single-table query. This pre-association method has obvious advantages in scenarios involving data reuse and multi-table association (a relatively large number of dimension tables). SPL offers such a mechanism: perform a pre-join between the fact table and the multiple dimension tables (store the memory addresses of the dimension tables’ records in the foreign key fields), save the result in memory, and speed up the computation by using the memory addresses. Find more information HERE.&lt;/p&gt;
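&lt;p&gt;A minimal Python sketch of the pre-join idea (hypothetical tables and field names):&lt;/p&gt;

```python
# Dimension table keyed by its primary key.
employees = {1: {"eid": 1, "name": "Alice"}, 2: {"eid": 2, "name": "Bob"}}
orders = [{"seller_id": 1, "amount": 100}, {"seller_id": 2, "amount": 200}]

# Pre-join: replace each foreign key with a direct reference to the
# dimension record, so later use needs no hash join at all.
for o in orders:
    o["seller"] = employees[o["seller_id"]]

# Using the association is now a plain field access:
names = [o["seller"]["name"] for o in orders]  # ["Alice", "Bob"]
```

The hash lookup happens once, up front; every subsequent use of the association follows a stored reference, as a single-table query would.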

&lt;p&gt;SQL is based on unordered sets, which makes many high-performance algorithms impossible to implement. To solve the problem, SPL directly offers ordered sets that bring the full potential of orderliness into play. For the very complicated celestial body contrast computation, the computation amount can be greatly reduced by first performing an initial filtering with order-based binary search, which makes the subsequent computation more convenient. SQL cannot describe this computation and thus cannot exploit orderliness; in actual practice, it is over 3 orders of magnitude slower than SPL. Find more information HERE.&lt;/p&gt;

&lt;p&gt;SPL can make the most of ordinal numbers to perform high-efficiency access. For a query task, if the queried key value is the target value’s ordinal number in the table sequence, or the target’s ordinal number can easily be derived from the queried value, the ordinal-number-based location method completes the computation in constant time, without any comparisons.&lt;/p&gt;

&lt;p&gt;What’s more, the association involving a large fact table (an external memory computation) can be implemented with ordinal numbers. When a fact table is too large to fit into memory, the address reference method above no longer works. In that case, we can convert the fact table’s foreign key values into the positions of their corresponding records in the dimension table, which is called numberization. When associating the numberized fact table with the dimension table, the more efficient ordinal-number-based location performs the search without any comparison, producing performance nearly as good as the memory address reference method. In SPL, the use of ordinal numbers is an important performance optimization strategy.&lt;/p&gt;
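&lt;p&gt;The numberization idea can be sketched in Python (hypothetical data; positions are 0-based here):&lt;/p&gt;

```python
# Dimension table as an ordered sequence; a record's position is its
# ordinal number.
dim = [{"name": "Alice"}, {"name": "Bob"}, {"name": "Carol"}]

# Fact rows whose foreign key has been pre-converted ("numberized")
# into positions in the dimension table.
facts = [{"seller_no": 2, "amount": 100}, {"seller_no": 0, "amount": 200}]

# The join lookup is a constant-time index access, with no comparisons.
names = [dim[f["seller_no"]]["name"] for f in facts]  # ["Carol", "Alice"]
```

Unlike the address-reference method, positions survive being written to and read back from external storage, which is why numberization suits fact tables that exceed memory.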

&lt;p&gt;We can see that various computing scenarios are taken into account when SPL is designed. It isn’t intended just for memory or external memory; it targets both so that problems of all scenarios can be well handled. Find more information HERE.&lt;/p&gt;

&lt;p&gt;Sometimes the computing goal itself has definite order requirements. For example, the securities industry’s consecutive rising/falling computations require comparisons between neighboring values. Databases that implement SQL well can use window functions for this type of computation, but even simple computations must be expressed in a roundabout way, and the efficiency is low.&lt;/p&gt;

&lt;p&gt;To count the longest consecutive rising days for a certain stock, for example, SQL needs a 3-layer nested query to express the computing process even with a window function, and the execution efficiency is poor (database optimization strategies become useless in complicated scenarios). By contrast, SPL codes the computation conveniently thanks to its support for order-based computation (order-based grouping):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;stock.sort(date).group@i(price&amp;lt;price[-1]).max(~.len())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;SPL references the directly previous/next record through the relative position.&lt;/p&gt;

&lt;p&gt;The performance of SPL’s order-based computation is particularly outstanding in handling the e-commerce industry’s customer churn rate. Find detailed explanations in the article SPL computing performance test series: funnel statistics&lt;/p&gt;

&lt;p&gt;SPL also shows its ability to make the most of memory in describing complicated data structures such as JSON. It supports multilevel data structures directly: with its generic type, SPL allows sets as members of a sequence, that is, members of a set may themselves be sets, giving the language a natural ability to describe multilevel data such as JSON/XML. This also enables SPL to keep the grouped subsets (a set of sets) and perform further computation on each subset, effectively avoiding SQL-style roundabout nested queries and helping achieve higher performance.&lt;/p&gt;
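&lt;p&gt;For instance, multilevel JSON-like data, where set members are themselves sets, can be traversed directly (example data invented):&lt;/p&gt;

```python
client = {
    "name": "A",
    "orders": [
        {"id": 1, "items": [{"sku": "x", "qty": 2}, {"sku": "y", "qty": 1}]},
        {"id": 2, "items": [{"sku": "x", "qty": 5}]},
    ],
}

# Sets nested inside sets: aggregate across both levels in one pass,
# with no need to flatten into two-dimensional tables first.
total_qty = sum(i["qty"] for o in client["orders"] for i in o["items"])  # 8
```

A relational model would first have to normalize this into separate orders and items tables and join them back, whereas the multilevel structure is computed on as-is.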

&lt;p&gt;By offering all these techniques that fully exploit the characteristics of memory, SPL outperforms in-memory databases, achieving higher computing performance with more concise code for phrasing the computing process.&lt;/p&gt;

&lt;p&gt;That being the case, is it still necessary for “specialized in-memory databases” to exist?&lt;/p&gt;

&lt;p&gt;Not any more. Though it does not advertise the concept of an in-memory database, SPL actually possesses in-memory computing techniques more powerful than those of in-memory databases.&lt;/p&gt;

&lt;p&gt;SPL also provides external-memory computation to handle data exceeding memory capacity, which broadens its range of application scenarios; combining the in-memory and external-memory abilities brings out the most power. A typical example of this combination is the previously mentioned funnel analysis. A product claiming to be an in-memory database usually lacks external-memory computation ability, which greatly limits its application scope; SPL, by contrast, covers a much wider scope.&lt;/p&gt;

&lt;p&gt;SPL can also work as an embedded computing engine, integrating with an application through its JAR files. It is deployed together with the application and offers in-memory database-like high-performance computations directly from within the application, while still supporting the cooperative use of the in-memory and external-memory abilities.&lt;/p&gt;

&lt;p&gt;With its high-efficiency in-memory computing ability, SPL is fully qualified to replace “specialized in-memory databases”. Together with its external-memory ability and flexible integration, SPL gains a broader application scope. When examining a product that claims to supply high-performance computation, whether it is called an in-memory database is not important. A product is good as long as its in-memory computing techniques are strong and its application scope is wide. By this standard, SPL is an ideal choice.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>From Heavyweight MPP to Lightweight SPL: Achieving High-Speed Data Processing Without the Cost</title>
      <dc:creator>Judy</dc:creator>
      <pubDate>Wed, 07 Jan 2026 08:21:28 +0000</pubDate>
      <link>https://dev.to/esproc_spl/from-heavyweight-mpp-to-lightweight-spl-achieving-high-speed-data-processing-without-the-cost-3om8</link>
      <guid>https://dev.to/esproc_spl/from-heavyweight-mpp-to-lightweight-spl-achieving-high-speed-data-processing-without-the-cost-3om8</guid>
      <description>&lt;p&gt;In order to obtain better computing performance, the MPP databases, such as Greenplum, Vertica, IQ, TD and AsterData, are often adopted. Although MPP can achieve better performance, the cost is high. Specifically, MPP consumes a large amount of hardware resources, resulting in high hardware cost, and it needs to pay expensive license fee if a commercial software is used. Moreover, it is very complicated to operate and maintain MPP, as each node needs to be maintained separately, and the uniform distribution and the consistency assurance of data under distributed framework will increase the O&amp;amp;M complexity. In short, it is heavy and expensive to use MPP.&lt;/p&gt;

&lt;p&gt;Then, is there any other solution?&lt;/p&gt;

&lt;p&gt;The main purpose of using MPP is to obtain better computing performance. If the performance can be improved in a lightweight and cost-effective way, then we can give up MPP. So, is there a way like this?&lt;/p&gt;

&lt;p&gt;After carefully analyzing current computing scenarios that deal with structured data (databases), we found that the data amount of most tasks is not particularly large. Take financial institutions, whose business data are usually large, as an example: a bank with tens of millions of accounts produces only hundreds of millions of transaction records a year, which is not considered large; an e-commerce system with millions of accounts accumulates data of roughly the same scale as the bank. Except for a few top companies, the computing scenarios of the vast majority of users do not involve particularly large amounts of data: the data scale of a single task is only tens of gigabytes, tasks involving 100 gigabytes or more are rare, let alone the petabyte-level tasks claimed by many big-data vendors.&lt;/p&gt;

&lt;p&gt;Normally, a conventional database should be able to handle tasks of this data scale easily, but that is not the case in reality. In the real world, it is very common for a batch job to take several hours, leaving no time to re-run it if something goes wrong; it is also common for a report query to take tens of seconds to minutes, and even longer once queries occur concurrently (even if the number of concurrent queries is small).&lt;/p&gt;

&lt;p&gt;To cope with this, users will consider using an MPP to speed up.&lt;/p&gt;

&lt;p&gt;Why do these situations still occur though the amount of data is not large?&lt;/p&gt;

&lt;p&gt;The main reason is that current databases do not fully utilize hardware resources. In other words, database performance is too low.&lt;/p&gt;

&lt;p&gt;There are two deeper reasons.&lt;/p&gt;

&lt;p&gt;One reason is that MPP adopts many engineering optimizations, such as data compression, columnar storage, indexes, and vector-based computing, to serve AP scenarios. These methods significantly improve computing efficiency. However, they are rarely found in traditional databases, whose performance is therefore naturally low. If traditional databases adopted such technologies, their performance would also improve; unfortunately, at present mainly MPP products use them.&lt;/p&gt;

&lt;p&gt;The other reason is that although these slow-running operations do not involve large data amounts, they are usually very complex. Moreover, due to the limitations of SQL itself, some complex operations are very difficult to implement, and even when they can be coded in SQL, the resulting amount of calculation is particularly large. For example, order-related multi-step operations are difficult to code in SQL and slow to run. SQL lacks features like record types, ordered operations, and procedural computation, making it impossible to express many high-performance algorithms; programmers can only resort to slow algorithms, so performance is poor.&lt;/p&gt;

&lt;p&gt;Running slowly requires more hardware to speed up. Therefore, even though the data scale is not large, the database cannot handle it and has to resort to a distributed MPP.&lt;/p&gt;

&lt;p&gt;Of course, we hope to obtain both the speed of a high-speed train (MPP) and the lightness of a car (a light solution). However, within the scope of current knowledge, a car that runs as fast as a high-speed train seems not to exist, so the heavy high-speed train becomes the solution that has to be taken.&lt;/p&gt;

&lt;p&gt;Fortunately, we now have the lightweight esProc SPL to fill the gap. Just like a car, SPL can achieve the speed of a high-speed train! Here are some of the advantages of esProc SPL:&lt;/p&gt;

&lt;h2&gt;
  
  
  No distributed framework required
&lt;/h2&gt;

&lt;p&gt;As an open-source computing engine, SPL is specifically designed for processing structured data. The high-performance mechanism provided in SPL can fully utilize hardware resources, allowing a single machine to exert the computing ability of a cluster, thus making it possible to handle most computing scenarios that previously required MPP without employing a distributed framework.&lt;/p&gt;

&lt;p&gt;In terms of engineering, SPL also adopts the common mechanisms of MPP, such as compression, columnar storage, indexes, and vector-based calculation, to ensure excellent performance. In addition, SPL provides high-performance file storage that supports these mechanisms, eliminating the need for a closed database management system. The file storage can be placed directly on any file system, making it more open. Not only does SPL deliver high computing performance, it is also an out-of-the-box tool and lighter to use.&lt;/p&gt;

&lt;p&gt;More importantly, because of the inherent defects of SQL, SPL does not continue with the SQL system but adopts an independent programming language, the Structured Process Language. SPL provides more data types and operations and innovates fundamentally (theoretical defects are difficult to address with engineering methods alone). As we know, software cannot change the speed of hardware; however, with low-complexity algorithms the hardware executes less computation, and performance improves naturally. SPL offers many such high-performance algorithms. For example, the complicated multi-step ordered operations mentioned above are easy and simple to code in SPL, and they run fast.&lt;/p&gt;

&lt;p&gt;High performance requires less hardware, a relationship we have discussed many times. In practice, for most scenarios that seem to require MPP, SPL can handle them on a single machine, which not only saves hardware cost but is also convenient for O&amp;amp;M.&lt;/p&gt;

&lt;p&gt;Here are some cases for reference:&lt;/p&gt;

&lt;p&gt;Open-source SPL turns pre-association of query on bank mobile account into real-time association&lt;br&gt;
Open-source SPL Speeds up Query on Detail Table of Group Insurance by 2000+ Times&lt;br&gt;
Open-source SPL improves bank’s self-service analysis from 5-concurrency to 100-concurrency&lt;br&gt;
Open-source SPL speeds up intersection calculation of customer groups in bank user profile by 200+ times&lt;/p&gt;

&lt;h2&gt;
  
  
  Able to handle multi-concurrency queries
&lt;/h2&gt;

&lt;p&gt;In addition to improving computing performance, distributed technology is sometimes used to handle multi-concurrency queries, which a single machine can indeed struggle to process. In such cases, do we have to use MPP?&lt;/p&gt;

&lt;p&gt;Not necessarily.&lt;/p&gt;

&lt;p&gt;SPL provides a cloud mode, allowing computing nodes to be started and stopped dynamically according to the concurrency situation, thereby implementing elastic computing. The cloud mode of SPL is completely different from the relatively fixed cluster mode of MPP: it can flexibly handle concurrent requests while consuming the fewest hardware resources.&lt;/p&gt;

&lt;p&gt;The high-performance file storage of SPL mentioned earlier fully ensures computing performance. Yet, unlike a database, which must keep its data inside the database, SPL does not bind the storage: the data files can be stored locally or remotely.&lt;/p&gt;

&lt;p&gt;We know that a database has metadata, which takes up a lot of resources. As data accumulate, the metadata becomes larger and larger and the whole system slower and slower, making techniques such as data redundancy and trading space for time difficult to apply. In contrast, SPL has no metadata and is light to use.&lt;/p&gt;

&lt;p&gt;Because SPL does not bind storage to computation and is not subject to metadata, it naturally supports the separation of storage and computation. Even when SPL’s high-performance file format is used, it is handled in the system the same way as a text file: it can be stored in a local or network file system, or directly in cloud object storage such as S3. With this separation of storage and computation, SPL can scale flexibly, making it very easy to cope with high-concurrency scenarios and more flexible and scalable than MPP.&lt;/p&gt;

&lt;h2&gt;
  
  
  More advantages
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Simple technology stack&lt;/strong&gt;&lt;br&gt;
Let’s start by comparing SPL and SQL. The similarity is that both are computing languages for structured data; the difference is that SQL lacks complete language ability, and even simple data tasks are often difficult to implement in it independently. For example, calculating the maximum number of days a stock keeps rising, or the more complex e-commerce funnel calculation (such calculations are not rare and appear often in practice), is extremely difficult in SQL and often requires resorting to Python or Java. This makes the technology stack complex and brings inconvenience to operation and maintenance.&lt;/p&gt;
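&lt;p&gt;To show what a funnel calculation involves, here is a minimal Python sketch: for each user, track how far they advance through the steps in order. The event format, step names and single-pass simplification are all assumptions of this example, not SPL’s or any database’s implementation.&lt;/p&gt;

```python
def funnel_counts(events, steps):
    """events: (user, step) pairs in time order.
    Returns, for each funnel step, how many users reached it
    after completing all previous steps in order."""
    progress = {}  # user -> index of the next step they need
    for user, step in events:
        i = progress.get(user, 0)
        if i < len(steps) and step == steps[i]:
            progress[user] = i + 1
    return [sum(1 for v in progress.values() if v > i)
            for i in range(len(steps))]

events = [("u1", "view"), ("u1", "cart"), ("u2", "view"),
          ("u1", "pay"), ("u2", "pay")]  # u2 skipped "cart"
print(funnel_counts(events, ["view", "cart", "pay"]))  # -> [2, 1, 1]
```

The per-user, order-sensitive state is what makes this awkward in set-oriented SQL, while a procedural, ordered traversal expresses it directly.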

&lt;p&gt;Compared to SQL (in which many scenarios are difficult or even impossible to implement), SPL provides more concise syntax and more complete abilities. For example, calculating the maximum number of days a stock keeps rising needs just one statement in SPL:&lt;/p&gt;

&lt;p&gt;stock.sort(trade_date).group@i(close_price&amp;lt;close_price[-1]).max(~.len())&lt;br&gt;
In contrast, doing this calculation in SQL requires nesting multiple layers of code in a very roundabout way.&lt;/p&gt;

&lt;p&gt;Beyond the conventional structured-data computing library, SPL provides the ordered operations that SQL is not good at, grouping operations that retain the grouped subsets, and various association methods. In addition, SPL syntax offers many unique features, such as option syntax, cascaded parameters and advanced Lambda syntax, making complex calculations easier to implement.&lt;/p&gt;

&lt;p&gt;For more information about SPL syntax, visit: A programming language coding in a grid&lt;/p&gt;

&lt;p&gt;Concise syntax together with complete language ability makes development very efficient and eliminates the need to resort to other technologies, thereby keeping the technology stack simple and allowing everything to be done in one system; naturally, operation and maintenance become simpler and more convenient.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Diverse data sources&lt;/strong&gt;&lt;br&gt;
We often encounter situations where adopting MPP improves performance but brings inconvenience: data from diverse sources can be processed only after being loaded into the database, which reduces data real-timeness, while loading and persisting the data increase cost, take up more space, and complicate operation and maintenance. Moreover, MPP cannot take the place of the original TP database, which forces a cross-database action, very difficult to implement, for real-time hot computing.&lt;/p&gt;

&lt;p&gt;SPL not only does not bind storage but also supports connecting to, and mixed calculation over, diverse sources. This gives SPL good openness, in total contrast to the closedness of databases, which can process data only after loading them.&lt;/p&gt;

&lt;p&gt;The data SPL is good at processing include conventional structured data, multilevel structured data (JSON, XML, etc.), strings, text, and mathematical data such as matrices and vectors. In particular, SPL provides powerful support for multilevel structured data such as JSON and XML, far surpassing traditional databases. Therefore, SPL works well with JSON-like data sources such as MongoDB and Kafka, can easily exchange data with HTTP/RESTful services and microservices and provide computing services, and can readily implement mixed operations with a TP database, making it highly suitable for real-time queries and statistics.&lt;/p&gt;
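&lt;p&gt;For instance, multilevel JSON can be computed on directly, without first flattening it into relational tables. A small Python illustration (the document below is invented for the example):&lt;/p&gt;

```python
import json

# Each order carries a nested list of items: a two-level structure.
doc = json.loads("""[
  {"cust": "A", "items": [{"sku": "x", "qty": 2}, {"sku": "y", "qty": 1}]},
  {"cust": "B", "items": [{"sku": "x", "qty": 5}]}
]""")

# Aggregate over the inner level while grouping by the outer level.
qty_by_cust = {o["cust"]: sum(i["qty"] for i in o["items"]) for o in doc}
print(qty_by_cust)  # -> {'A': 3, 'B': 5}
```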

&lt;p&gt;The benefits of openness are self-evident: it not only avoids the database capacity and performance problems caused by ETL, but also fully ensures the real-timeness of data and calculation. The openness of SPL is therefore very friendly to real-time computing scenarios.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;More lightweight&lt;/strong&gt;&lt;br&gt;
Besides not binding storage and having no metadata, the lightweight nature of SPL is reflected in its simple operating environment. SPL can run on any operating system where JDK 1.8 or a higher version is available, including common VMs and containers, and takes up less than 1GB of space after installation.&lt;/p&gt;

&lt;p&gt;What is even more special is that SPL can not only be deployed independently but can also be integrated into applications, providing powerful computing ability within the application. In this way, applications no longer have to rely on a central MPP for powerful computing ability, and the coupling of data processing between applications is eliminated, making SPL flexible to use and easy to manage while avoiding conflicts caused by multiple applications competing for central computing resources. All of this is impossible for MPP.&lt;/p&gt;

&lt;p&gt;SPL is now open-source. You can obtain the source code from &lt;a href="https://github.com/SPLWare/esProc" rel="noopener noreferrer"&gt;GitHub &lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Try it free~~ &lt;a href="https://www.esproc.com/download-esproc/" rel="noopener noreferrer"&gt;Download~&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Ditch the AP Database: Solve TP Overload with Lightweight SPL</title>
      <dc:creator>Judy</dc:creator>
      <pubDate>Mon, 29 Dec 2025 07:15:27 +0000</pubDate>
      <link>https://dev.to/esproc_spl/ditch-the-ap-database-solve-tp-overload-with-lightweight-spl-4a6n</link>
      <guid>https://dev.to/esproc_spl/ditch-the-ap-database-solve-tp-overload-with-lightweight-spl-4a6n</guid>
      <description>&lt;p&gt;At the beginning of information system construction, usually only one database is used, and the database combines TP (transaction processing) and AP (analytical processing) together. As the scale of business and the amount of data continue to grow, the database faces increasing pressure. In order not to affect transaction, a common practice in the industry is to move data (usually cold data) out of the database and use a dedicated database to handle AP business. This method effectively reduces the burden on TP database and ensures the smooth operation of transaction business.&lt;/p&gt;

&lt;p&gt;Normally, a professional AP database runs fast and does solve the performance problem to some extent, but it will cause some other problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Problems faced by AP database
&lt;/h2&gt;

&lt;p&gt;The first is cost.&lt;/p&gt;

&lt;p&gt;Currently, mainstream AP databases mostly adopt the MPP framework, which differs from that of TP databases. Although MPP can achieve better performance, both software and hardware costs are high. Specifically, MPP consumes a lot of hardware resources, resulting in high hardware cost, and expensive license fees must be paid if commercial software is used. Each node of MPP needs to be maintained separately, and both the uniform distribution and the consistency assurance of data under a distributed framework increase operational complexity. These factors drive up the cost of using an AP database.&lt;/p&gt;

&lt;p&gt;Adding an AP database also makes management more complicated. The management of the original TP database is already complicated enough, including designing the metadata, making data meet constraints before loading, controlling access privileges, and so on; all these tasks are also required when an AP database is added, and more resources are needed because the AP database is of a different type. As a result, two systems often require more than double the O&amp;amp;M cost, creating a cost problem.&lt;/p&gt;

&lt;p&gt;In addition, migrating data from TP database to AP database is not easy and will face a dilemma.&lt;/p&gt;

&lt;p&gt;The purpose of adding an AP database is to move all AP business from the TP database to the AP database, yet migrating everything at once poses significant risks. Leaving aside whether the AP database’s functionality is complete, many businesses originally implemented in one database may need to be redesigned after separation, and differences in database type and SQL compatibility increase the difficulty of migration. All of this makes migrating all AP business in one go too risky.&lt;/p&gt;

&lt;p&gt;Therefore, the safe approach is to migrate gradually, but it will encounter new problems.&lt;/p&gt;

&lt;p&gt;We all have the experience that, as use of a database deepens and the business keeps growing, originally normal queries become slower and slower. The reasons include not only the increase in data volume but also many factors such as table quantity, indexes, metadata, and storage space. For a centrally managed database system, it is difficult to determine whether the database is adequate for its own business before its load reaches a certain level.&lt;/p&gt;

&lt;p&gt;Migrating just a little business at the initial stage will certainly run fast, but it makes it difficult to determine whether the selected database is the right one. As the migration proceeds, later-migrated business may affect earlier business, which still causes great risk. And if it is found at a late stage that the AP database cannot handle the business, or needs to be scaled out substantially, a dilemma occurs, because a lot of work has accumulated by then.&lt;/p&gt;

&lt;p&gt;In addition to the aforementioned cost and dilemma problems, there is also a more troublesome real-timeness problem.&lt;/p&gt;

&lt;p&gt;Some businesses that used to run smoothly in one database can no longer be implemented after part of the data is separated out; the most typical is real-time query. Real-time query is naturally supported within one TP database, but the situation is completely different once the data are split across databases. For databases of the same type, cross-database queries can sometimes implement real-time query in an indirect way; the performance is usually poor, but at least it works. For an AP database, which is almost never the same type as the TP database, implementing real-time query becomes extremely difficult. As a result, real-time query business that was frequently used in the past has to fall back to T+1 or even T+N, and the impact on the business is self-evident.&lt;/p&gt;

&lt;p&gt;In fact, real-time query is essentially a cross-source computing problem. If a system is open, cross-source computing can be implemented easily. However, the closedness of databases requires that data be loaded before they can be calculated, making real-time query extremely difficult to implement. In other words, although the TP database overload problem is solved, a new problem is created, and the degree to which requirements are satisfied decreases.&lt;/p&gt;

&lt;p&gt;Surely, we could also resort to an HTAP database for real-time query, since the main goal of HTAP is to implement query and analysis in one database (does that sound like a TP database?). HTAP is of course an option, but in reality most HTAP databases are strong in TP while their AP capabilities are often weak, differing little from the original TP databases in many cases. More importantly, using an HTAP database still entails the cost and migration problems above. The original TP database is merely under pressure, not unusable; abandoning it completely would be a waste.&lt;/p&gt;

&lt;p&gt;Therefore, we need to find a relatively lightweight AP solution that will not incur high costs, preferably allowing for gradual migration while addressing real-time query problem.&lt;/p&gt;

&lt;p&gt;Fortunately, we can use SPL as a solution to reduce the burden on TP database.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open SPL solves various problems of AP business migration
&lt;/h2&gt;

&lt;p&gt;As an open-source computing engine dedicated to AP business, SPL has the following characteristics: it is simple, lightweight and low-cost; its open computing ability and file storage support gradual migration of business, without any impact on the business before and after each migration step; and its mixed computing over multiple data sources naturally supports real-time query, making it easy to meet business requirements even after the AP business is separated.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn1le17xabfmuxds2s5wu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn1le17xabfmuxds2s5wu.png" alt=" " width="579" height="289"&gt;&lt;/a&gt;&lt;br&gt;
SPL application framework&lt;/p&gt;
&lt;h2&gt;
  
  
  Lightweight and low cost
&lt;/h2&gt;

&lt;p&gt;One of the main differences between SPL and a database is its simple and lightweight nature, which reduces cost.&lt;/p&gt;

&lt;p&gt;SPL has very low hardware requirements and a very light overall footprint. SPL can run on any operating system with a JVM environment of JDK 1.8 or higher, including common VMs and containers, and takes up less than 1GB of installation space. Moreover, since SPL provides many high-performance mechanisms, a single SPL node can often achieve the same effect as an MPP cluster, directly reducing software and hardware costs (the high-performance mechanisms are explained in more detail later).&lt;/p&gt;

&lt;p&gt;What is even more special is that SPL can not only be deployed independently but can also be integrated into applications, providing powerful computing ability within the application. In this way, an application can obtain powerful computing ability without relying on a database. Migration of data out of the TP database can be initiated within the application, and SPL can meanwhile serve as the data mart or front-end computing engine for that application.&lt;/p&gt;

&lt;p&gt;SPL’s agile syntax also has advantages when implementing complex calculations. At the beginning of migration, we usually choose the businesses with low performance and high resource consumption. Such businesses are often complicated, and it is usually easier to remould them in SPL than to modify the SQL, bringing lower development and debugging costs (explained in more detail later).&lt;/p&gt;

&lt;p&gt;The lightweight nature of SPL is also reflected in its storage schema.&lt;/p&gt;

&lt;p&gt;Compared with the closed storage of databases, SPL uses files directly to store data. In fact, SPL does not bind storage; users can use any medium, but files have many advantages other forms of storage cannot match. Files are stored directly on the file system, either locally or on the network (cloud), so capacity is no concern: storage is cheap, and making another copy, whether for performance or for backup, is nothing more than a few more files. There is almost no upper limit on file storage.&lt;/p&gt;

&lt;p&gt;However, many open file formats do not offer high performance. For this reason, SPL provides dedicated high-performance file formats. Users can convert the source data directly and store it as SPL files, and can copy the data at will according to performance requirements during use.&lt;/p&gt;

&lt;p&gt;Another advantage of file storage is the ability to organize data flexibly. Sometimes we can achieve higher performance by organizing the data according to the computing objective and then applying a different algorithm. Compared to a database, where users cannot intervene in storage, files are much more flexible: data can not only be stored redundantly in multiple copies, but the same data can also be laid out in different organizational forms (for example, ordered by different fields) to suit different computing scenarios.&lt;/p&gt;

&lt;p&gt;Files in the file system can be managed in a multi-level directory structure, and we can set up different directories for different businesses or modules; a certain directory and subdirectories are dedicated to serving a single business, eliminating the coupling with each other; data modification will not affect other businesses; if a certain business goes down, the corresponding data (directory) can be safely deleted, making overall management very neat.&lt;/p&gt;
&lt;h2&gt;
  
  
  Gradual migration
&lt;/h2&gt;

&lt;p&gt;With the support of file storage and openness, we can proceed with gradual migration.&lt;/p&gt;

&lt;p&gt;As mentioned above, AP business is separated because the pressure on one database combining TP and AP is great. Although the database is under great pressure, it is still usable and can work well once the pressure is relieved. Since migrating all AP business in one go is too risky, the safe approach is to migrate gradually. Gradual migration not only reduces risk but also suits the characteristics of SPL: in the initial stage, migrating some statistical query scenarios with low performance and high resource consumption will significantly reduce the database’s workload.&lt;/p&gt;

&lt;p&gt;More importantly, migrating with SPL files will never cause later-migrated business to affect earlier-migrated business, because file storage creates no relationships between files, and adding a new file does not affect the use of existing ones. Moreover, the problems caused by the complex, closed database system simply do not exist in the SPL system, so migration can proceed safely.&lt;/p&gt;

&lt;p&gt;With this gradual migration approach, the risk is low and the performance is controllable, so there is no need to migrate all AP businesses in one go.&lt;/p&gt;
&lt;h2&gt;
  
  
  Solve real-timeness problem
&lt;/h2&gt;

&lt;p&gt;SPL has strong openness. In addition to file storage, SPL supports multiple data sources, and every data source is logically equivalent to SPL. Besides connecting to multiple sources, SPL can also perform mixed calculations over them, which makes real-time query easy to implement.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp4qjgnj47vwpa9cu2vkb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp4qjgnj47vwpa9cu2vkb.png" alt=" " width="498" height="278"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Due to its open computing ability, SPL can retrieve data from different databases respectively, and thus handles scenarios involving different database types well. SPL implements real-time query by performing a mixed calculation over the cold data stored in the file system (in place of an AP database) and the hot data stored in the TP database. During calculation, SPL’s agile syntax and procedural computation greatly simplify the complex calculations in real-time query and increase development efficiency. Moreover, SPL is an interpreted-execution language and supports hot deployment.&lt;/p&gt;

&lt;p&gt;The following is the SPL code to perform a mixed query of the historical cold data stored in files and the hot data stored in production database:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fugvrhgkzdf0iocdgh2vy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fugvrhgkzdf0iocdgh2vy.png" alt=" " width="800" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As we can see, the real-time query that is hard to implement after separating the databases takes only a few lines of script, which solves the real-time query problem that the separation created.&lt;/p&gt;
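&lt;p&gt;For readers without an SPL environment, the logic of such a cold/hot mixed query can be sketched in ordinary Python. Everything below (the table name, the account/amount schema, the in-memory SQLite database standing in for the TP source, a plain list standing in for SPL’s file storage) is invented for illustration; it only mirrors the idea of concatenating cold file data with hot database data before aggregating.&lt;/p&gt;

```python
import sqlite3
from collections import defaultdict

# Hypothetical cold side: historical rows already exported out of the
# production database (in SPL this would be high-performance file storage).
cold_rows = [("A", 120.0), ("B", 80.0), ("A", 40.0)]  # (account, amount)

# Hypothetical hot side: today's rows still in the production TP database.
db = sqlite3.connect(":memory:")
db.execute("create table orders (account text, amount real)")
db.executemany("insert into orders values (?, ?)", [("A", 10.0), ("C", 5.0)])
hot_rows = db.execute("select account, amount from orders").fetchall()

# Mixed calculation: read both sides, concatenate, then aggregate once.
totals = defaultdict(float)
for account, amount in list(cold_rows) + list(hot_rows):
    totals[account] += amount

print(dict(totals))  # per-account totals over the full data set, cold + hot
```

The point is that the aggregation runs once over the union of both sources, so the result always reflects the newest data in the TP database.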

&lt;p&gt;There are already many practices of migrating business from TP databases to SPL. The common scenarios are summarized in&lt;br&gt;
“SPL practice: migrate computing tasks out of database”, which you can consult when carrying out your own migration.&lt;/p&gt;
&lt;h2&gt;
  
  
  Higher performance
&lt;/h2&gt;

&lt;p&gt;In practice, SPL also employs the engineering mechanisms adopted by professional AP databases, such as compression, columnar storage, indexing, and vectorized calculation, to ensure excellent performance. As mentioned above, SPL provides high-performance file storage that supports these mechanisms without requiring a closed database management system; the files can be placed directly on any file system, making the approach more open.&lt;/p&gt;

&lt;p&gt;More importantly, because of SQL’s inherent defects, SPL does not stay within the SQL system: it is an independent programming language (Structured Process Language) that provides more data types and operations and innovates at the foundational level, since theoretical defects are hard to fix with engineering methods alone. Software cannot make hardware faster, but low-complexity algorithms can reduce the amount of computation the hardware must perform, and performance then improves naturally. SPL offers many such high-performance algorithms. The complicated multi-step ordered operation mentioned above, for example, is easy to implement in SPL: the code is simple and it runs fast.&lt;/p&gt;

&lt;p&gt;By means of these mechanisms, SPL requires fewer hardware resources and can often achieve cluster-level performance on a single machine. SPL also provides multi-thread parallelism and distributed computing, making it highly scalable when more performance is needed.&lt;/p&gt;

&lt;p&gt;In the following cases, SPL achieves the cluster effect on only a single machine:&lt;/p&gt;

&lt;p&gt;Open-source SPL turns pre-association of query on bank mobile account into real-time association&lt;br&gt;
Open-source SPL Speeds up Query on Detail Table of Group Insurance by 2000+ Times&lt;br&gt;
Open-source SPL improves bank’s self-service analysis from 5-concurrency to 100-concurrency&lt;br&gt;
Open-source SPL speeds up intersection calculation of customer groups in bank user profile by 200+ times&lt;/p&gt;
&lt;h2&gt;
  
  
  Lower development cost
&lt;/h2&gt;

&lt;p&gt;Since SPL does not adopt the SQL system, there is a learning cost before using it, and many people familiar with SQL may assume that migrating to SPL will cost more.&lt;/p&gt;

&lt;p&gt;In fact, this is not the case. From a long-term perspective, the development cost of SPL is lower!&lt;/p&gt;

&lt;p&gt;Because SPL is not compatible with SQL, migrating SQL code does require recoding, which incurs some modification cost. However, SPL is easy to learn and its syntax is concise, so the cost of rewriting SQL is not very high. In contrast, although AP databases also use SQL, moving from a TP database to an AP database still involves many SQL modifications because the database types differ; and since AP calculation logic is generally complex, rewrites are often necessary anyway. SQL itself is not easy to develop and debug, so the modification workload is considerable, far from the “seamless migration” claimed by vendors. Overall, the modification cost of adopting SPL is not much higher than that of adopting an AP database, which is acceptable.&lt;/p&gt;

&lt;p&gt;More importantly, SPL will bring long-term benefits.&lt;/p&gt;

&lt;p&gt;Since SQL’s language ability is incomplete, it is difficult to implement some complex calculations in SQL alone. Calculations such as the maximum number of consecutive days a stock keeps rising, or the more complex e-commerce funnel analysis (neither is rare; both appear often in practice), are extremely difficult to express in SQL and often force a fallback to Python or Java. That complicates the technology stack and burdens operation and maintenance. TP, AP, and HTAP databases all share this problem.&lt;/p&gt;
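&lt;p&gt;As a concrete illustration of why such calculations fight SQL, here is the “maximum consecutive rising days” logic written procedurally in Python (a sketch with made-up prices; the point is that it is a single ordered pass, which SQL’s set-oriented model expresses only awkwardly):&lt;/p&gt;

```python
# Maximum number of consecutive days a stock price rises:
# one ordered pass over the sequence, trivial procedurally.
def max_rising_days(prices):
    best = run = 0
    for prev, cur in zip(prices, prices[1:]):
        run = run + 1 if cur > prev else 0  # extend or reset the streak
        best = max(best, run)
    return best

print(max_rising_days([10, 11, 12, 11, 12, 13, 14, 13]))  # prints 3
```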

&lt;p&gt;In contrast, SPL provides richer data types and a complete computing library, making it easy to handle scenarios that are difficult or even impossible to implement in SQL.&lt;/p&gt;

&lt;p&gt;For example, consider a funnel analysis that calculates the user churn rate for an e-commerce company. Coding it in SQL is very complicated, the code varies greatly across database types, and this lack of portability raises the modification cost. The Oracle code implementing a three-step funnel analysis is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;with e1 as (
 select uid,1 as step1,min(etime) as t1
 from event
 where etime&amp;gt;= to_date('2021-01-10','yyyy-mm-dd')
 and etime&amp;lt;to_date('2021-01-25','yyyy-mm-dd')
 and eventtype='eventtype1' and …
 group by uid),
e2 as (
 select e2.uid,1 as step2,min(e1.t1) as t1,min(e2.etime) as t2
 from event e2
 inner join e1 on e2.uid = e1.uid
 where e2.etime&amp;gt;= to_date('2021-01-10','yyyy-mm-dd')
 and e2.etime&amp;lt;to_date('2021-01-25','yyyy-mm-dd')
 and e2.etime &amp;gt; e1.t1 and e2.etime &amp;lt; e1.t1 + 7
 and e2.eventtype='eventtype2' and …
 group by e2.uid),
e3 as (
 select e3.uid,1 as step3,min(e2.t1) as t1,min(e3.etime) as t3
 from event e3
 inner join e2 on e3.uid = e2.uid
 where e3.etime&amp;gt;= to_date('2021-01-10','yyyy-mm-dd')
 and e3.etime&amp;lt;to_date('2021-01-25','yyyy-mm-dd')
 and e3.etime &amp;gt; e2.t2 and e3.etime &amp;lt; e2.t1 + 7
 and e3.eventtype='eventtype3' and …
 group by e3.uid)
select
 sum(step1) as step1,
 sum(step2) as step2,
 sum(step3) as step3
from
 e1
 left join e2 on e1.uid = e2.uid
 left join e3 on e2.uid = e3.uid
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Coding in SPL:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5bv5fi49341hj33ebmxd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5bv5fi49341hj33ebmxd.png" alt=" " width="800" height="339"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Obviously, this code is far more concise. The SPL approach is also more general: a funnel with more steps only needs a parameter change, rather than more subqueries as in SQL, and it performs better.&lt;/p&gt;
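&lt;p&gt;To make the funnel logic itself concrete, here is a procedural sketch in Python. The event names, the 7-day window, and the greedy earliest-match rule are assumptions for illustration (this is not the article’s SPL code): group events by user, then walk each user’s ordered events once, advancing a step counter.&lt;/p&gt;

```python
from collections import defaultdict

WINDOW = 7                        # days allowed between step 1 and step 3
STEPS = ["view", "cart", "pay"]   # hypothetical event types, in funnel order

def funnel(events):
    """events: iterable of (uid, etype, day) in any order."""
    by_user = defaultdict(list)
    for uid, etype, day in events:
        by_user[uid].append((day, etype))
    counts = [0, 0, 0]
    for rows in by_user.values():
        rows.sort()               # walk each user's events in time order
        depth, t1, last = 0, None, None
        for day, etype in rows:
            if depth == 0:
                if etype == STEPS[0]:
                    depth, t1, last = 1, day, day   # funnel entry
            elif depth in (1, 2) and etype == STEPS[depth]:
                # next step must come after the previous one, inside the window
                if day > last and t1 + WINDOW > day:
                    depth, last = depth + 1, day
        for i in range(depth):    # user reached steps 1..depth
            counts[i] += 1
    return counts

events = [
    ("u1", "view", 1), ("u1", "cart", 3), ("u1", "pay", 5),
    ("u2", "view", 1), ("u2", "cart", 9),   # cart falls outside the window
    ("u3", "cart", 1), ("u3", "view", 2), ("u3", "cart", 4),
]
print(funnel(events))  # prints [3, 2, 1]
```

A deeper funnel only means a longer STEPS list, which is exactly the kind of parameterization the SPL version exploits instead of adding subqueries.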

&lt;p&gt;Besides a conventional structured-data computing library, SPL also provides the ordered operations that SQL is weak at, grouping that retains the grouped subsets, and a variety of association methods. Its syntax offers many unique features, such as option syntax, cascaded parameters, and advanced Lambda syntax, which make complex calculations easier to implement.&lt;/p&gt;

&lt;p&gt;Concise syntax and complete language ability make development efficient and eliminate the need to resort to other technologies, thereby keeping the technology stack simple. With everything done in one system, operation and maintenance are naturally simple and convenient. Therefore, for migrated AP businesses, especially those involving more complex calculations, SPL is very likely to be more efficient and cheaper to develop with than SQL.&lt;/p&gt;

&lt;p&gt;In conclusion, when the TP database is overloaded, its pressure needs to be reduced, but it is not necessary to migrate all AP businesses in one go: the goal is achieved as long as the pressure drops effectively. Migrating AP business gradually is a good approach, yet moving it to another database always brings problems of one kind or another, such as high migration cost, difficult management, and poor real-time ability. SPL’s openness and high performance solve these problems effectively, and SPL also provides complete, lightweight, high-performance computing capabilities within a consistent technology stack. From this point of view, replacing the AP database with SPL to reduce the burden on the TP database is a wise choice.&lt;/p&gt;

&lt;p&gt;SPL is now open-source. You can obtain the source code from &lt;a href="https://github.com/SPLWare/esProc" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Try it free~~ &lt;a href="https://www.esproc.com/download-esproc/" rel="noopener noreferrer"&gt;Download~&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>The Data Warehouse Doesn’t Need a ‘House’ — And That’s Why It’s Faster</title>
      <dc:creator>Judy</dc:creator>
      <pubDate>Mon, 22 Dec 2025 08:01:37 +0000</pubDate>
      <link>https://dev.to/esproc_spl/the-data-warehouse-doesnt-need-a-house-and-thats-why-its-faster-5f3f</link>
      <guid>https://dev.to/esproc_spl/the-data-warehouse-doesnt-need-a-house-and-thats-why-its-faster-5f3f</guid>
      <description>&lt;p&gt;We know that the early databases do not distinguish between TP and AP, and all tasks are handled in one database. When dealing with TP business, it is important to ensure the consistency of data, and consistency makes sense only when the data are limited within a certain range, which gives rise to the concept of “base”. The data to be loaded into database should meet some constraints, otherwise it cannot be loaded. There is a clear distinction between the data inside and outside the database, and this characteristic is called the closedness.&lt;/p&gt;

&lt;p&gt;In addition to guaranteeing the consistency of data, the closedness can guarantee the security of data by working with the database management system (DBMS).&lt;/p&gt;

&lt;p&gt;The data warehouse developed out of the database: when one database could no longer serve both OLTP and OLAP business, the AP business was split into a separate database, and the data warehouse was born. It therefore inherits many characteristics of the database, including closedness. Inheriting closedness means inheriting rules such as “data can be used only after being loaded into the database” and “data to be loaded must meet certain criteria”. This is how the “house” of the data warehouse came to be.&lt;/p&gt;

&lt;p&gt;Then, is this closed storage necessary?&lt;/p&gt;

&lt;p&gt;It is necessary for TP business, but not for a data warehouse focused on AP business. Although the name contains the word “house”, the main function of a data warehouse is actually to compute. Even though it can store data like a “house”, that storage serves calculation; data becomes valuable only when it is used, that is, calculated. The competition among the new data warehouses on the market today centers almost entirely on computing ability: performance above all, along with the completeness of the computing ability and the richness of functions. We can therefore say that the key point of a data warehouse is computation, not storage.&lt;/p&gt;

&lt;p&gt;In this case, is it feasible to only provide a rich and powerful computing engine, and not bind the storage function? In other words, does it work if there is no “house”?&lt;/p&gt;

&lt;p&gt;Unfortunately, this is not feasible for most of today’s SQL-based (relational-algebra) data warehouses, because the binding of storage and computation is inherited from the database the data warehouse originates from, and it cannot be changed.&lt;/p&gt;

&lt;p&gt;However, it is feasible for the new “no house” data warehouse - esProc SPL!&lt;/p&gt;

&lt;p&gt;As an open computing engine, esProc specializes in processing AP-business data. With its open computing ability, esProc can connect to diverse data sources and perform mixed calculations across them. It also provides its own high-performance file storage to guarantee computing performance. Instead of adopting SQL as its formal language, esProc uses the self-created SPL (Structured Process Language), which has advantages over SQL.&lt;/p&gt;

&lt;p&gt;The term “no house” in this article means that there is no closed, private storage function like that of a traditional data warehouse.&lt;/p&gt;

&lt;p&gt;Where are the data stored then?&lt;/p&gt;

&lt;p&gt;Let’s answer this question and related ones in detail below, and see what benefits “no house” brings (that is, what problems of the “house” it overcomes).&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-time computation of diverse data sources
&lt;/h2&gt;

&lt;p&gt;In fact, once data is generated, it is already stored in some medium: a database, a file, the web. In a broad sense, the data is stored the moment it exists. That being so, wouldn’t it be convenient to process the data directly where it lies? Enterprises today face a wide variety of data sources and data types, and being able to process them all directly is very convenient.&lt;/p&gt;

&lt;p&gt;esProc provides the ability to process these diverse data sources directly. No matter where the data is stored (RDB, NoSQL, file, Hadoop, RESTful, etc.), esProc can read and calculate on it directly. More importantly, esProc can connect to different data sources at once and perform mixed computations across them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhc078icccbfz97jdk2df.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhc078icccbfz97jdk2df.png" alt=" " width="800" height="348"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once this support for diverse data sources (mixed calculation) is available, the limitation of the “house” is broken: the development and time costs of loading data into a database are saved. Real-time calculation over multiple sources fully preserves the real-timeness of the data, enabling real-time queries even after data has been separated into different databases. And since data is no longer loaded into the database indiscriminately, the database’s storage cost and pressure drop greatly, which matters in the initial application stage of esProc, when the data warehouse and esProc coexist.&lt;/p&gt;

&lt;p&gt;esProc also fully retains the advantages of each data source. An RDB is stronger in computing, so in many scenarios we can let the RDB do part of the calculation first and let esProc do the rest; NoSQL sources and files have high IO efficiency, so esProc can read and compute on their data directly; MongoDB stores multi-layer data, which SPL can use as-is. All of these are benefits of openness.&lt;/p&gt;
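&lt;p&gt;The division of labor described above can be sketched in ordinary Python (the customer/order schema, table names, and the SQLite stand-in for the RDB are all invented for illustration): the RDB serves its own table, while the order records stay in a text file and are computed on directly, joined in the program without any loading step.&lt;/p&gt;

```python
import csv, io, sqlite3

# Hypothetical RDB side: a customer dimension table (SQLite stands in here).
db = sqlite3.connect(":memory:")
db.execute("create table customer (cid text, region text)")
db.executemany("insert into customer values (?, ?)",
               [("c1", "east"), ("c2", "west")])

# Hypothetical file side: raw order records, never loaded into the database.
orders_csv = io.StringIO("cid,amount\nc1,30\nc2,50\nc1,20\n")

# Let the database serve what it holds ...
region = dict(db.execute("select cid, region from customer"))

# ... and compute on the file directly, joining the two sources in memory.
sales = {}
for row in csv.DictReader(orders_csv):
    r = region.get(row["cid"])
    if r is not None:
        sales[r] = sales.get(r, 0) + int(row["amount"])

print(sales)  # per-region sales combining the RDB and the file
```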

&lt;p&gt;In contrast, a closed data warehouse cannot compute on data outside the database and must import it before computing, which adds an ETL step. This step not only increases the programmers’ workload and the database’s burden, but also loses the real-timeness of the data. Data outside the database usually has irregular formats that are hard to load into a strongly constrained database; often the raw data must be loaded first so that the database’s computing ability can be used to organize it, turning ETL into ELT and increasing the database’s burden further.&lt;/p&gt;

&lt;p&gt;Computation can be done regardless of where the data is stored; this is one of the benefits that the “no house” esProc brings.&lt;/p&gt;

&lt;h2&gt;
  
  
  High performance
&lt;/h2&gt;

&lt;p&gt;However, when esProc reads diverse data sources, although they all have the same logical status, the read performance (reflected in total computation time) varies, because the access interfaces provided by different sources differ in efficiency. Some interfaces, such as an RDB’s JDBC, read very slowly.&lt;/p&gt;

&lt;p&gt;While it is convenient to access various data sources directly, it may result in poor computing performance.&lt;/p&gt;

&lt;p&gt;To fully ensure computing performance, esProc offers a specialized binary file-storage format together with mechanisms such as compression, columnar storage, ordering, and parallel segmentation.&lt;/p&gt;
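&lt;p&gt;A toy Python sketch can illustrate why the columnar layout in particular helps analytics (the three-column schema is made up, and this shows only the access-pattern idea; esProc’s actual format additionally applies compression, ordering, and parallel segmentation):&lt;/p&gt;

```python
# Row layout vs. columnar layout for the same records.
rows = [("c1", "east", 30), ("c2", "west", 50), ("c1", "east", 20)]

# Row layout: summing one attribute still walks every field of every record.
row_total = sum(r[2] for r in rows)

# Columnar layout: each attribute is a contiguous array, so an aggregation
# reads just the one array it needs (and such arrays compress far better).
cid, region, amount = (list(col) for col in zip(*rows))
col_total = sum(amount)

print(row_total, col_total)  # identical results, different access patterns
```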

&lt;p&gt;It is worth noting that esProc’s file storage is not closed inside esProc (totally different from a data warehouse’s closed storage): the data is stored as ordinary files in the file system, with the same status as other files such as text or Excel. esProc does not own these files; it simply provides many optimization strategies to make access to them more efficient.&lt;/p&gt;

&lt;p&gt;In contrast, the performance of a data warehouse with a “house” is often not high. The computing efficiency of a data warehouse depends on the degree of optimization of its engine: a good database chooses a more efficient execution path according to the goal of a SQL statement rather than its literally expressed logic. However, this automatic optimization works only for simple calculations. Once the SQL becomes slightly more complex, the engine gives up and executes the statement according to its literal logic, and performance declines sharply. In such cases, higher performance could be achieved if the storage could be adjusted to suit the algorithm (for example, sorting the data by primary key). Unfortunately, the data warehouse is closed and its storage is private: we cannot intervene in the storage, so we cannot reach that performance.&lt;/p&gt;

&lt;p&gt;In comparison, esProc’s file storage is very flexible: we can design the storage around any algorithm, make the most of the advantages of file storage itself, and arrange the data to suit the algorithm, so high performance is not surprising.&lt;/p&gt;

&lt;h2&gt;
  
  
  Security and reliability
&lt;/h2&gt;

&lt;p&gt;Having open computing ability and storing data in files raises a question: the closedness of a traditional data warehouse ensures the security and reliability of the data inside it, so how can esProc, which no longer binds storage, do the same?&lt;/p&gt;

&lt;p&gt;In fact, there is no need to worry about this issue. esProc does not manage the data in principle, nor is it responsible for data security. To some extent, it can be said that esProc does not have and does not need a security mechanism.&lt;/p&gt;

&lt;p&gt;The security of persistent data is, in principle, the responsibility of the data source itself. A database, for example, provides security mechanisms such as user identification and authentication, authorization and verification, and auditing. For data files in esProc’s format, many file systems and VMs provide mature security mechanisms that can be used directly, such as access control, identity verification, and transmission encryption. The reliability of the data can be guaranteed by the data source or by professional storage technology itself.&lt;/p&gt;

&lt;p&gt;In addition, esProc supports retrieving data from object storage services such as S3 before computing, and can likewise use their security mechanisms. Cloud storage technologies like S3 have the advantage in security and reliability; few databases today can offer reliability guarantees surpassing these professional technologies, so it is reasonable to rely on their security mechanisms directly.&lt;/p&gt;

&lt;p&gt;In terms of application access, esProc runs as an independent service process that communicates over standard TCP/IP and HTTP, so it can be monitored and managed by professional network-security products, which take charge of the specific security measures. esProc specializes in data computation; its philosophy for non-computing tasks is to work with other specialized products.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frjqoi8yp001k71dzgplr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frjqoi8yp001k71dzgplr.png" alt=" " width="689" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In fact, security and reliability are often worse for a database with a “house”. Database permission management and control is often not fine-grained enough, so all users of an application end up as high-privilege users. For the convenience of “computing”, permissions that intervene in “storage” are handed out, such as the dangerous permission to compile stored procedures, so security is not well protected. In contrast, esProc focuses only on “computing”, not “storage”: its “computing” works on top of the storage’s security mechanism and neither affects nor damages the storage. As for reliability, it is directly proportional to investment, and even the extremely expensive “two sites, three centers” construction is still far inferior in reliability to today’s professional cloud storage. That being so, leave professional matters to the professionals.&lt;/p&gt;

&lt;p&gt;Therefore, we can say that “no house” can bring more security and more reliability than “with house”.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implement HTAP requirement
&lt;/h2&gt;

&lt;p&gt;In recent years, HTAP has become another hot spot in the database field. However, most databases implement HTAP only by attaching certain AP capabilities to a TP database or by binding the two technologies together in other ways. Whichever method is adopted, database migration is unavoidable; quite apart from the high risk, the closedness and performance problems of the original data warehouse remain unsolved.&lt;/p&gt;

&lt;p&gt;In fact, HTAP requirement is essentially to query the data in real-time after the separation of databases. If this ability is available, then this requirement can be implemented without modifying the original TP database (no migration risk).&lt;/p&gt;

&lt;p&gt;We can introduce esProc on top of the existing independent TP and AP systems, and use esProc’s open cross-source computing, high-performance storage and computation, and agile development to implement this requirement.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fazzluuvtr2lvvbl7v669.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fazzluuvtr2lvvbl7v669.png" alt=" " width="398" height="212"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;esProc implements HTAP by cooperating with the existing system. Only a few modifications to the existing system are needed, and there is almost no need to modify the TP database; even the original AP data source can remain in use while esProc gradually takes over the AP business. Once esProc has partially or completely taken over, the historical cold data is stored in esProc’s high-performance files, and the original ETL process that moved data from the business database to the data warehouse can be migrated to esProc directly.&lt;/p&gt;

&lt;p&gt;Cold data is large in volume and no longer changes, so storing it as esProc’s high-performance files yields higher computing performance. Hot data is small in volume, so keeping it in the original TP data source lets esProc read and calculate on it directly; since the amount is small, querying the TP source directly has little impact and does not take long. esProc’s mixed computing over cold and hot data then delivers real-time queries over the full data. All we need to do is periodically convert cold data into esProc’s high-performance files while the small amount of recently generated hot data stays in the original source. In this way, not just HTAP but a high-performance HTAP is implemented, with little impact on the application framework.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implement true Lakehouse
&lt;/h2&gt;

&lt;p&gt;A closed data warehouse cannot build a true Lakehouse. A data lake is rather like a junkyard for data: it should store the original raw data regardless of type, since it is impossible to predict which data will prove useful. The value of that data can only be realized through calculation, which requires the computing ability of a data warehouse. But the data warehouse is closed, and data must be deeply organized to meet its criteria before being loaded. The large amount of raw “junk data” in the lake cannot be calculated on directly, while organizing it both loses original information and runs into the diverse-data-source problems mentioned above. As a result, the real-timeness of the data cannot be guaranteed, and the ETL itself is costly, so timeliness is poor.&lt;/p&gt;

&lt;p&gt;Compared with the fake Lakehouse implemented on a traditional data warehouse, esProc can implement a true Lakehouse: it is open enough to calculate directly on the unorganized data in the lake, it can perform mixed computations across many kinds of data sources, and its high-performance mechanisms guarantee computing efficiency.&lt;/p&gt;

&lt;p&gt;esProc can calculate directly on the raw data in the data lake, with no constraints and no need to load data into a database. It can also perform mixed calculations across diverse sources: whether the lake is built on a unified file system or on diverse data sources (RDB, NoSQL, local files, web services), esProc can compute across them directly and quickly surface the lake’s value. Furthermore, esProc’s high-performance file storage (the “storage function” of a data warehouse) can be used to organize the data in an orderly way while computing; converting raw data into esProc storage yields higher performance. The converted data is still stored in the file system and can, in principle, live in the same place as the data lake. In this way, a true Lakehouse is implemented.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjv21l6t1rcstn418y5fd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjv21l6t1rcstn418y5fd.png" alt=" " width="800" height="470"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With the support of esProc’s computing ability, data can be organized and computed at the same time, and the data lake can be built up step by step in an orderly manner. The data warehouse is refined in the process of building the lake, giving the data lake strong computing ability as well, thereby implementing a true Lakehouse.&lt;/p&gt;

&lt;p&gt;From closed to open: this is how technology keeps progressing. The same goes for the data warehouse. Developing from “with house” to “no house” is an inevitable stage, and the data warehouse is about to enter the era of “no house”. esProc may not be perfect, but it has taken a big step toward the capabilities of the “no house” data warehouse, and it is definitely worth a try.&lt;/p&gt;

&lt;p&gt;SPL is now open-source. You can obtain the source code from &lt;a href="https://github.com/SPLWare/esProc" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Try it free~~ &lt;a href="https://www.esproc.com/download-esproc/" rel="noopener noreferrer"&gt;Download~&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Why Batch Jobs Suck — And How SPL Cuts Runtime from Hours to Seconds</title>
      <dc:creator>Judy</dc:creator>
      <pubDate>Fri, 19 Dec 2025 07:26:31 +0000</pubDate>
      <link>https://dev.to/esproc_spl/why-batch-jobs-suck-and-how-spl-cuts-runtime-from-hours-to-seconds-m82</link>
      <guid>https://dev.to/esproc_spl/why-batch-jobs-suck-and-how-spl-cuts-runtime-from-hours-to-seconds-m82</guid>
      <description>&lt;p&gt;The detail data produced in the business system usually needs to be processed and calculated to our desired result according to a certain logic so as to support the business activities of enterprise. In general, such data processing will involves many tasks, and it needs to calculate in batches. In the bank and insurance industries, this process is often referred to as “batch job”, and batch jobs are often needed in other industries like oil and power.&lt;/p&gt;

&lt;p&gt;Most business statistics take a certain day as the cutoff, and to avoid affecting the normal business of the production system, batch jobs are generally executed at night: only then can the new detail data produced in the production system that day be exported and transferred to a specialized database or data warehouse where the batch job runs. The next morning, the result of the batch job is available to business staff.&lt;/p&gt;

&lt;p&gt;Unlike online queries, a batch job is an offline task carried out automatically on a schedule. Multiple users never access one task at the same time, so there is no concurrency problem and no need to return results in real time. However, the batch job must finish within a specified window. For example, a bank’s batch window may run from 8:00 pm to 7:00 am the next day; if the job is not done by 7:00 am, the serious consequence is that business staff cannot work normally.&lt;/p&gt;

&lt;p&gt;The data volume involved in a batch job is very large, and it is likely to use all historical data. Moreover, since the computing logic is complex and involves many computing steps, the time of batch jobs is often measured in hours. It is very common to take two or three hours for one batch job, and it is not surprising to take ten hours. As business grows, the data volume increases. The rapid increase of computational load on database that handles batch job will lead to a situation where the batch job cannot be accomplished after the whole night, and this will seriously affect the business, which is unacceptable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem analysis
&lt;/h2&gt;

&lt;p&gt;To shorten the prolonged batch job time, we must carefully analyze the problems in the existing system architecture.&lt;/p&gt;

&lt;p&gt;A typical batch job architecture looks roughly like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyn3x370key2zy1emiiis.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyn3x370key2zy1emiiis.jpg" alt=" " width="554" height="307"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As the figure shows, the data has to be exported from the production database and imported into the database that handles the batch jobs. The latter is usually an RDB, and the batch job logic is implemented as stored procedure code. The batch job results are generally not used directly; they are exported from the RDB as intermediate files or imported into the databases of other systems. This is a typical architecture; the production database in the figure may also be a central data warehouse, Hadoop, and so on. The two databases in the figure are generally not the same database, and data is usually transferred between them as files, which helps reduce coupling. After the batch jobs finish, the results are used by multiple applications, again transferred as files.&lt;/p&gt;

&lt;p&gt;The first reason batch jobs are slow is that the RDB imports and exports data too slowly. Because an RDB's storage and computation are closed, it performs many constraint checks and security operations on every import and export. When the data volume is large, read/write efficiency becomes very low and takes a very long time. For the database handling batch jobs, both importing the file data and exporting the calculation results as files are therefore very slow.&lt;/p&gt;

&lt;p&gt;The second reason is the poor performance of stored procedures. The syntax system of SQL is old and has many limitations that prevent the implementation of many efficient algorithms, so the computing performance of the SQL statements in a stored procedure is unsatisfactory. Moreover, when the business logic is complex, it is hard to express in a single SQL statement; it has to be split into multiple steps, implemented with a dozen or even dozens of SQL statements. The intermediate result of each statement must be stored in a temporary table for the subsequent steps. When a temporary table holds a large amount of data, it must be written to disk, causing a large amount of data writing. Since write performance is much worse than read performance, this seriously slows down the whole stored procedure.&lt;/p&gt;

&lt;p&gt;Calculations that are even more complex are hard to express directly in SQL at all; instead, a database cursor must traverse and fetch the data in a loop. However, cursor traversal performs far worse than SQL statements, generally does not support multi-thread parallelism directly, and cannot exploit multiple CPU cores, making the computing performance worse still.&lt;/p&gt;

&lt;p&gt;Then, could we replace the traditional RDB with a distributed database (adding nodes) to speed up batch jobs?&lt;/p&gt;

&lt;p&gt;Unfortunately not. The main reason is that batch job logic is quite complex; it often needs thousands or even tens of thousands of lines of code even with the stored procedures of a traditional database, while the stored procedure capability of distributed databases is still relatively weak, making such complex batch operations difficult to implement.&lt;/p&gt;

&lt;p&gt;A distributed database also faces the intermediate-result problem when a complex computing task has to be split into multiple steps. Because the data may reside on different nodes, both writing the intermediate result in one step and re-reading it in a later step generate heavy cross-network traffic, making performance unpredictable.&lt;/p&gt;

&lt;p&gt;Speeding things up via data redundancy, as distributed databases do for queries, does not work here either. Redundant copies can be prepared in advance of a query, but the intermediate results of a batch job are generated on the fly; making them redundant means generating multiple copies on the fly as well, which slows the overall job down.&lt;/p&gt;

&lt;p&gt;Therefore, real-world batch jobs are usually executed inside one large single database. When the computational intensity is too high, an all-in-one machine such as ExaData is used (ExaData hosts multiple database instances, but thanks to Oracle's special optimization it can be regarded as one super-large single database). Although this approach is slow, there is no better choice for the time being: only such large databases have enough computing capacity for batch jobs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using SPL to perform batch jobs
&lt;/h2&gt;

&lt;p&gt;SPL, an open-source professional computing engine, provides computing power that does not depend on a database: it computes directly on the file system, which removes the extremely slow data import and export of the RDB. Moreover, SPL implements more optimized algorithms, far surpassing stored procedures in performance, and can significantly raise the computing efficiency of a single machine, which makes it well suited to batch jobs.&lt;/p&gt;

&lt;p&gt;The new architecture that uses SPL to implement batch jobs is shown as below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fckc7d0260jlrtg4whalk.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fckc7d0260jlrtg4whalk.jpg" alt=" " width="554" height="260"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this new architecture, SPL solves two bottlenecks that cause slow batch jobs.&lt;/p&gt;

&lt;p&gt;Let's start with the first bottleneck, data import and export. SPL can compute directly on the files exported from the production database, so there is no need to import the data into an RDB. After the batch jobs finish, SPL can store the final result directly in a general format such as a text file and pass it to other applications, avoiding the export step from the batch-job database as well. In this way the slow RDB reads and writes are eliminated entirely.&lt;/p&gt;

&lt;p&gt;Now let's look at the second bottleneck, that is, the computing process. SPL provides better algorithms (many of which are pioneered in the industry), and the computing performance far outperforms that of stored procedure and SQL statement. SPL’s high-performance algorithms include:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fysbbrpj18vjbug9dzwyp.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fysbbrpj18vjbug9dzwyp.jpg" alt=" " width="800" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These high-performance algorithms apply to the calculations common in batch jobs, such as JOINs, traversal, and grouping and aggregation, and can effectively improve computing speed. For example, batch jobs often traverse an entire history table, sometimes several times, to complete the calculations of multiple business logics. The history table is usually huge, and each traversal costs a lot of time. SPL's &lt;strong&gt;multi-purpose traversal&lt;/strong&gt; mechanism solves this: it completes multiple computations during a single traversal of the large table, saving a great deal of time.&lt;/p&gt;
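
&lt;p&gt;A minimal sketch of the multi-purpose traversal idea (plain Python, not SPL syntax; the data and aggregates are invented for illustration): several unrelated aggregates are computed in one pass over the data, instead of scanning the large table once per aggregate.&lt;/p&gt;

```python
def multi_purpose_traverse(amounts):
    total = 0          # aggregate 1: total amount
    count_big = 0      # aggregate 2: number of amounts over 100
    max_amount = None  # aggregate 3: maximum amount
    for amount in amounts:       # one traversal serves all three aggregates
        total += amount
        if amount > 100:
            count_big += 1
        if max_amount is None or amount > max_amount:
            max_amount = amount
    return total, count_big, max_amount

result = multi_purpose_traverse([50, 120, 80, 300])
```

&lt;p&gt;With three separate scans the data would be read three times; here it is read once, which is exactly the saving that matters when the "table" is hundreds of gigabytes on disk.&lt;/p&gt;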

&lt;p&gt;SPL's &lt;strong&gt;multi-cursor&lt;/strong&gt; reads and computes data in parallel. Even for complex batch job logic, multi-thread parallel computing can exploit multiple CPU cores, whereas a database cursor is hard to parallelize. As a result, SPL is often several times faster than a stored procedure.&lt;/p&gt;
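
&lt;p&gt;A rough analogue of the multi-cursor pattern in plain Python (an illustration, not SPL): split the data into segments, aggregate each segment on its own worker, then merge the partial results.&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_sum(data, workers=4):
    """Split data into segments, aggregate each segment concurrently,
    then merge the partial results."""
    size = max(1, len(data) // workers)
    segments = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum, segments))  # one "cursor" per segment
    return sum(partials)                          # merge step

result = parallel_sum(list(range(1, 101)))        # 1 + 2 + ... + 100
```

&lt;p&gt;The merge step works because summation is associative; the same split-aggregate-merge shape applies to counts, max/min and many grouping operations.&lt;/p&gt;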

&lt;p&gt;SPL's &lt;strong&gt;delayed cursor&lt;/strong&gt; mechanism defines multiple computation steps on one cursor and then lets the data stream flow through these steps in sequence, achieving &lt;strong&gt;chained calculation&lt;/strong&gt;. This effectively reduces the number of times intermediate results must be stored. Where data does have to be stored, SPL writes the intermediate result in its built-in high-performance format for use in the next step. SPL's high-performance storage is file based and adopts technologies such as &lt;strong&gt;ordered and compressed storage, free columnar storage, double increment segmentation, and its own compression encoding&lt;/strong&gt;, reducing disk usage and making reads and writes much faster than a database.&lt;/p&gt;
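
&lt;p&gt;The delayed-cursor idea can be sketched with Python generators (illustration only, not SPL): each step is defined lazily on the stream, and records flow through the whole chain one at a time, so no intermediate result set is materialized between steps.&lt;/p&gt;

```python
def read_rows():                         # the "cursor": yields rows lazily
    for i in range(1, 1_000_001):
        yield i

def keep_even(rows):                     # step 1: defined, not yet executed
    return (r for r in rows if r % 2 == 0)

def double(rows):                        # step 2: also lazy
    return (r * 2 for r in rows)

pipeline = double(keep_even(read_rows()))         # nothing has run yet
first_three = [next(pipeline) for _ in range(3)]  # pulls only 3 records through
```

&lt;p&gt;Although two transformation steps are chained, only the three requested records are ever read; nothing resembling a temporary table is written.&lt;/p&gt;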

&lt;h2&gt;
  
  
  Application effect
&lt;/h2&gt;

&lt;p&gt;In this new architecture, SPL removes the RDB's two bottlenecks for batch jobs, with very good results in practice. Three cases illustrate this.&lt;/p&gt;

&lt;p&gt;Case 1: Bank L ran its batch jobs on the traditional architecture, using an RDB as the batch-job database and stored procedures for the batch job logic. The stored procedure for the loan agreement batch job took 2 hours, yet it was merely the preparation step for many other batch jobs, so its long duration seriously affected the whole batch.&lt;/p&gt;

&lt;p&gt;With SPL, thanks to high-performance algorithms and storage mechanisms such as &lt;strong&gt;high-performance columnar storage, file cursors, multi-thread parallel processing, small-result in-memory grouping, and multi-purpose cursors&lt;/strong&gt;, the computing time fell from 2 hours to 10 minutes, a &lt;strong&gt;12-fold performance improvement&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Moreover, the SPL code is more concise. The original stored procedure had more than 3,300 lines of code, while SPL needs only 500 statement cells, &lt;strong&gt;cutting the code volume by more than 6 times&lt;/strong&gt; and greatly improving development efficiency.&lt;/p&gt;

&lt;p&gt;For details, visit: &lt;a href="https://c.esproc.com/article/1644215913288" rel="noopener noreferrer"&gt;Open-source SPL speeds up batch operating of bank loan agreements by 10+ times&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Case 2: In the car insurance business of insurance company P, historical policies from previous years must be associated with new policies, a task known as the historical policy association batch job. With an RDB and a stored procedure, associating 10 days of new policies took 47 minutes and 30 days took 112 minutes; with a longer time span the computation time became unbearable, essentially making the task impossible.&lt;/p&gt;

&lt;p&gt;With SPL, using technologies such as high-performance file storage, file cursors, ordered merging &amp;amp; segmented data fetching, in-memory association and multi-purpose traversal, associating 10 days of new policies takes only 13 minutes, and 30 days takes 17 minutes, nearly 7 times faster. Moreover, with the new algorithms the computation time grows only slightly as new policies accumulate, rather than in direct proportion to the number of days as with the stored procedure.&lt;/p&gt;

&lt;p&gt;In terms of total code volume, the original stored procedure had 2,000 lines of code, still more than 1,800 after removing comments, while the SPL code is fewer than 500 cells, less than a third of the original.&lt;/p&gt;

&lt;p&gt;For details, visit: &lt;a href="https://c.esproc.com/article/1644827119694" rel="noopener noreferrer"&gt;Open-source SPL optimizes batch operating of insurance company from 2 hours to 17 minutes&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Case 3: For the detail data of loans Bank T grants over the Internet, a daily batch job must count and aggregate all historical data as of a specified date. Implemented in SQL on an RDB, the total running time was 7.8 hours, which is far too long and even delayed other batch jobs, so optimization was necessary.&lt;/p&gt;

&lt;p&gt;With SPL, using technologies such as high-performance files, file cursors, ordered grouping, ordered association, delayed cursors and binary search, the running time fell from 7.8 hours to 180 seconds single-threaded, and to 137 seconds with 2 threads, a 204-fold speedup.&lt;/p&gt;

&lt;p&gt;SPL is now open-source. You can obtain the source code from &lt;a href="https://github.com/SPLWare/esProc" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Try it for free: &lt;a href="https://www.esproc.com/download-esproc/" rel="noopener noreferrer"&gt;Download&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Beyond SQL: Solving Data Warehouse Performance Bottlenecks with Smart Algorithms, Not Just Bigger Clusters</title>
      <dc:creator>Judy</dc:creator>
      <pubDate>Wed, 17 Dec 2025 08:09:32 +0000</pubDate>
      <link>https://dev.to/esproc_spl/beyond-sql-solving-data-warehouse-performance-bottlenecks-with-smart-algorithms-not-just-bigger-32n6</link>
      <guid>https://dev.to/esproc_spl/beyond-sql-solving-data-warehouse-performance-bottlenecks-with-smart-algorithms-not-just-bigger-32n6</guid>
      <description>&lt;p&gt;As the volume of data continues to grow and the complexity of business rises gradually, we are facing a big challenge in data processing efficiency. The most typical manifestation is that the performance problem of data warehouse is becoming more and more prominent when dealing with an analytical task, and some problems occur from time to time such as high computing pressure, low performance, long query time or even unable to find out result, production accident caused by failure to accomplish a batch job on time. When a data warehouse has performance problem, it doesn't serve the business well.&lt;/p&gt;

&lt;h2&gt;
  
  
  Solutions for performance problem of traditional data warehouse
&lt;/h2&gt;

&lt;p&gt;Let's start with the most common solution: the cluster, that is, using distributed technology and relying on hardware expansion to improve performance. Splitting a large task across the nodes of a cluster so they compute simultaneously certainly achieves better performance than running it on a single node; even without distributed computing, simply spreading concurrent tasks across nodes reduces the computing pressure on each one. The cluster's approach to performance is simple and brute-force: as long as the data warehouse supports clustering and the task can be split, performance problems can be attacked just by adding hardware. The improvement may not be linear, but the approach basically works.&lt;/p&gt;

&lt;p&gt;The disadvantage of the cluster is its high cost. In the era of big data we reach for a cluster whenever performance comes up, often ignoring whether a single node's performance is fully exploited, on the assumption that adding hardware is always an option as long as clustering is supported. The cluster thus looks like a panacea to many people. But a cluster needs more hardware, so the cost is naturally high, and its operation and maintenance require more investment too. Furthermore, some complex multi-step computing tasks cannot use a cluster at all because they cannot be split: most multi-step batch jobs over large data volumes, for example, can only run on a single node (a single database's stored procedure). A cluster is a good solution, but not a panacea; even with enough money to build one, we cannot solve every performance problem.&lt;/p&gt;

&lt;p&gt;For some time-consuming query tasks we can adopt pre-calculation, that is, trade space for time by processing the data to be queried in advance. The computing complexity of each query then drops to O(1), greatly improving efficiency. This solution also addresses many performance problems; pre-aggregating the data to be calculated is particularly effective for multidimensional analysis scenarios.&lt;/p&gt;
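
&lt;p&gt;Trading space for time can be sketched in a few lines of Python (the field names and figures are invented for illustration): pre-aggregate sales by (region, month) once, then answer each query with a single O(1) dictionary lookup instead of rescanning the detail rows.&lt;/p&gt;

```python
detail = [
    ("east", "2024-01", 100), ("east", "2024-01", 250),
    ("west", "2024-01", 300), ("east", "2024-02", 150),
]

# one-off pre-calculation pass over the detail rows
cube = {}
for region, month, amount in detail:
    cube[(region, month)] = cube.get((region, month), 0) + amount

# each later query is a constant-time dictionary lookup
jan_east = cube[("east", "2024-01")]
```

&lt;p&gt;The cost is the extra storage for `cube` and the obligation to keep it in sync with the detail data, which is exactly the flexibility problem discussed next.&lt;/p&gt;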

&lt;p&gt;The disadvantage of this solution is extremely poor flexibility. In multidimensional analysis, for instance, it is theoretically possible to pre-calculate all dimension combinations (which would satisfy every query), but in practice this is unrealistic because the pre-calculated results would need a huge amount of storage. We therefore have to settle for partial pre-calculation after sorting out the business requirements, which greatly limits the query scope and reduces flexibility.&lt;/p&gt;

&lt;p&gt;In fact, even full pre-calculation cannot handle certain situations, such as unconventional aggregations (e.g., median or variance), combined aggregations (e.g., average monthly sales), conditional measures (e.g., total sales of orders with transaction amounts over 100 dollars), or time-period aggregation (aggregating over a freely chosen time period). Real-world query requirements are diverse and highly flexible; pre-aggregation can only cover a part, perhaps only a small part, of them. Meeting diverse online query requirements over a wider range and more efficiently calls for a more effective computing approach.&lt;/p&gt;

&lt;p&gt;A more effective solution is the optimization engine, which makes the data warehouse run faster on the same hardware. This has become the focus of many vendors, who apply engineering means that are well known in the industry, such as columnar storage, vectorized execution, encoding compression and memory utilization (the cluster can also be regarded as an engineering means). With these means, computing performance can improve several-fold as long as the data volume stays within a certain range, fully meeting the requirements of some scenarios. Unfortunately, engineering means cannot change the computing complexity, and the improved performance still often falls short when the data volume is very large or the complexity is particularly high.&lt;/p&gt;

&lt;p&gt;The more powerful part of an optimization engine is improving performance at the algorithm level, that is, reducing complexity. A good data warehouse optimizer can guess the real intention of a query and execute it with a more efficient algorithm, instead of following the literally expressed logic. Algorithm-level improvements usually achieve much higher performance. Most data warehouses still use SQL as their main query language, and SQL-based optimization is already done well. However, the limitations of SQL's descriptive ability force very circuitous expressions for complex query tasks; once an SQL statement grows complex, the optimizer can no longer guess the real intention and has to execute the literal logic, failing to improve performance. In short, the optimization engine works only for simple query tasks.&lt;/p&gt;

&lt;p&gt;Let's take an example. The following code calculates the top 10 values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT TOP 10 x FROM T ORDER BY x DESC
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Most data warehouses will optimize the task instead of doing a real sorting. However, if we want to query the in-group TopN:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select * from
 (select y,*,row_number() over (partition by y order by x desc) rn from T)
where rn&amp;lt;=10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As we can see, although this calculation is not much more complex than the previous one, the optimization engine gets confused: unable to guess the real intention, it has to do a big sort according to the literal meaning, resulting in low performance. Therefore, in some scenarios we process the data into a wide table in advance to simplify the query and bring the optimization engine into play. This is costly, but sometimes we have to do it to use the optimization engine effectively.&lt;/p&gt;
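
&lt;p&gt;The algorithm-level alternative that the confused optimizer fails to apply can be sketched in plain Python (an illustration, not any engine's actual implementation): treat TopN as an aggregation, keeping a small heap per group during one streaming pass, so the in-group top N never requires sorting the whole table.&lt;/p&gt;

```python
import heapq
from collections import defaultdict

def group_topn(rows, n=10):
    """rows: iterable of (group, x). Returns top n x values per group,
    descending, using one pass and O(groups * n) memory."""
    heaps = defaultdict(list)            # group -> min-heap of at most n values
    for group, x in rows:                # single streaming pass, no big sort
        h = heaps[group]
        if len(h) < n:
            heapq.heappush(h, x)
        elif x > h[0]:
            heapq.heapreplace(h, x)      # evict the smallest of the current top n
    return {g: sorted(h, reverse=True) for g, h in heaps.items()}

top2 = group_topn([("a", 5), ("a", 9), ("a", 1), ("b", 7), ("b", 3)], n=2)
```

&lt;p&gt;Each row costs at most O(log n) work, versus O(log N) per row for a full sort of N rows, which is the complexity gap the paragraph describes.&lt;/p&gt;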

&lt;p&gt;Currently almost all data warehouse vendors compete on SQL ability: more comprehensive SQL support, stronger optimization, larger clusters, and so on. These efforts please the broad base of SQL users as much as possible, but as discussed, the means above are often ineffective for complex computing scenarios, and the performance problem remains. However hard the vendors work on SQL (engineering optimization), the effect is unsatisfactory and does not fundamentally solve the performance problem. This type of problem is common in practice; here are a few examples.&lt;/p&gt;

&lt;p&gt;Complicated ordered computing: analyzing user behavior through a conversion funnel involves multiple events, such as page browsing, searching, adding to cart, placing an order and paying. To count user churn after each event, the events must be completed within a specified time window and occur in a specified order; only then is the result valid. Implementing this in SQL is very difficult, requiring as many sub-queries as there are events plus repeated self-joins. The resulting SQL is so complex that some data warehouses cannot execute it at all, and even when they can, performance is very low and hard to optimize.&lt;/p&gt;
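
&lt;p&gt;A sketch of the per-user funnel logic just described (plain Python; the event names and window handling are invented for illustration): events must occur in the given order, all within a time window starting at the first event, and the function reports how deep into the funnel the user got.&lt;/p&gt;

```python
FUNNEL = ["view", "search", "add_cart", "order", "pay"]

def funnel_depth(events, window):
    """events: list of (time, name) for one user, sorted by time.
    Returns how many consecutive funnel steps were completed, in order,
    within `window` time units of the first matched step."""
    step, start = 0, None
    for t, name in events:
        if step < len(FUNNEL) and name == FUNNEL[step]:
            if step == 0:
                start = t                # window starts at the first event
            elif t - start > window:
                break                    # next step fell outside the window
            step += 1
    return step

depth = funnel_depth([(0, "view"), (5, "search"), (9, "add_cart")], window=10)
```

&lt;p&gt;Churn after each event is then a simple count over users of the depth they reached; the point is that this is one ordered pass per user, whereas SQL must simulate the same order dependency with one sub-query and join per event.&lt;/p&gt;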

&lt;p&gt;Multi-step batch jobs over large data volumes: SQL also handles complex batch jobs poorly. In stored procedures, SQL often has to read data step by step through a cursor, which is very slow and cannot compute in parallel, leading to high resource consumption and low performance. Moreover, dozens of operation steps in a stored procedure need thousands of lines of code and repeated buffering of intermediate results, which degrades performance further. As a result, batch jobs missing their deadlines at month-end or year-end, when data volumes and task counts peak, is a recurring phenomenon.&lt;/p&gt;

&lt;p&gt;Multi-metric calculation on big data: many industries need to calculate large numbers of metrics. In bank loan business, for example, there are multiple tiers of classification dimensions and multiple guarantee types, plus many other dimensions such as customer type, lending method, currency, branch, date, customer age range and education background. Combining them freely derives an enormous number of metrics. Aggregating these metrics over a large volume of detail data involves many mixed operations: large-table joins, conditional filtering, grouping and aggregation, distinct counting, and so on. Such calculations are flexible, large in data volume and complex, and come with high concurrency, making them very hard to implement in SQL. Pre-calculation is inflexible here, while real-time calculation is too slow.&lt;/p&gt;

&lt;p&gt;Since these problems are difficult to solve in SQL, extending SQL's computing ability becomes the fourth solution after the cluster, pre-calculation and the optimization engine. Many data warehouses support user-defined functions (UDF), letting users extend the engine to their actual requirements. However, UDFs are difficult to develop and demand high technical skill. More importantly, a UDF still cannot solve the data warehouse's performance problem, because it remains constrained by the database's storage and cannot design more efficient data organization to suit the computation's characteristics. Many high-performance algorithms therefore cannot be implemented, and high performance naturally cannot be achieved.&lt;/p&gt;

&lt;p&gt;Therefore, to solve these problems, we should adopt a non-SQL-based solution, and let programmers control the execution logic outside the database, so as to better utilize low-complexity algorithms and make full use of engineering means.&lt;/p&gt;

&lt;p&gt;Out of this analysis emerged big data computing engines such as Spark. Spark provides a distributed computing framework and is still intended to meet computing needs through a large-scale cluster. Its design around all-in-memory operation is not friendly to external-storage computation, and the immutable RDD mechanism copies the RDD after every calculation step, occupying and wasting large amounts of memory and CPU and yielding very low performance, so the engineering means are not fully exploited. In addition, Spark's computing library is not rich and lacks high-performance algorithms, making the goal of “low-complexity algorithms” hard to reach. Furthermore, Scala is hard to use, which makes coding the complex computing problems above extremely difficult. The difficulty of coding and the failure to achieve high performance may be among the reasons Spark turned back to SQL.&lt;/p&gt;

&lt;p&gt;Since traditional data warehouse doesn't work, and external programming (Spark) is difficult and slow, is there any other alternative?&lt;/p&gt;

&lt;p&gt;From the discussion above it is not hard to conclude that solving the data warehouse performance problem does require a computing system independent of SQL (like Spark), but one that is easy to code in and fast to run. Specifically, it should not be as complex as Spark for describing complicated computing logic, and should even be simpler than SQL; and its performance should not rely solely on a cluster, but should provide rich high-performance algorithms and engineering capabilities to make full use of hardware resources and maximize single-node performance. In short, it should offer both the ability to quickly express low-complexity algorithms and sufficient engineering means. Ideally, it would also be easy to deploy, operate and maintain.&lt;/p&gt;

&lt;h2&gt;
  
  
  The esProc SPL solution
&lt;/h2&gt;

&lt;p&gt;esProc SPL is a computing engine specialized for processing structured and semi-structured data, with the same capabilities as current data warehouses. Unlike traditional SQL-based data warehouses, however, esProc does not continue with relational algebra but designs a brand-new computing system, on which the SPL (Structured Process Language) syntax is built. Compared with SQL, SPL has many advantages: it provides more data types and operations, richer computing libraries, and stronger descriptive ability; with support for procedural computing, it lets us write algorithms along our natural line of thinking instead of coding in a roundabout way, keeping SPL code short; and it is fully capable of the multi-step complex calculations mentioned earlier, more simply than SQL or other hard-coded approaches.&lt;/p&gt;

&lt;p&gt;As for performance, esProc SPL provides many high-performance algorithms of lower complexity to guarantee computing performance. Software cannot change the performance of hardware; the only way to compute faster on the same hardware is to design lower-complexity algorithms so that the computer executes fewer basic operations, which naturally makes computation faster. But designing low-complexity algorithms is not enough; the ability to implement them is also required, and the simpler the coding, the better the performance. Easy to code and fast to run are, in fact, the same thing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj67q4jyebi9r2kabrlfu.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj67q4jyebi9r2kabrlfu.jpg" alt=" " width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This figure shows part of SPL-provided high-performance algorithms, many of which are the original inventions of SPL.&lt;/p&gt;

&lt;p&gt;Of course, high-performance algorithms cannot do without good data organization (the data storage schema). For example, the ordered merge and one-sided partitioning algorithms can be implemented only when data is stored in order. Yet the storage of a database is relatively closed and cannot be interfered with from outside, so it cannot be designed around the characteristics of the calculation. For this reason, SPL provides its own binary file storage: data is stored in files outside the database, which makes full use of storage schemes such as columnar storage, ordering, compression, and parallel segmentation. Data can then be organized flexibly according to the computing characteristics, allowing the high-performance algorithms to take full effect.&lt;/p&gt;
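&lt;p&gt;To make the role of ordered storage concrete, here is a minimal sketch in Python (illustrative only, not esProc internals) of an ordered merge join: because both inputs are already sorted by the join key, a single O(n+m) pass suffices, with no hashing and no sorting at query time.&lt;/p&gt;

```python
# Sketch: an ordered merge join. Both inputs must already be sorted by
# the join key; the join then completes in one forward pass.
def merge_join(left, right, key=lambda r: r[0]):
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        kl, kr = key(left[i]), key(right[j])
        if kl < kr:
            i += 1
        elif kl > kr:
            j += 1
        else:
            # emit all right rows sharing this key (handles duplicates)
            j0 = j
            while j < len(right) and key(right[j]) == kl:
                result.append((left[i], right[j]))
                j += 1
            i += 1
            if i < len(left) and key(left[i]) == kl:
                j = j0  # rewind for the next left row with the same key
    return result

orders = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]  # sorted by id
users = [(1, "Ann"), (2, "Bob"), (3, "Eve")]       # sorted by id
pairs = merge_join(orders, users)
```

&lt;p&gt;If either input were unsorted, the merge would have to be preceded by a sort at query time, which is exactly the cost that ordered storage avoids.&lt;/p&gt;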

&lt;p&gt;In addition to high-performance algorithms, esProc provides many engineering means to improve computing performance, such as columnar storage, coding-based compression, large memory, and vector-based computing. As mentioned above, although these engineering means cannot change the computational complexity, they can often improve performance several times over. Combined with the many low-complexity algorithms built into SPL, a performance improvement of one or two orders of magnitude becomes the norm.&lt;/p&gt;
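&lt;p&gt;As a toy illustration of why one of these engineering means helps (pure Python, not how esProc stores data): in a columnar layout, an aggregation touches only the column it needs, instead of every field of every row. As the text notes, this is a constant-factor engineering gain, not a change in complexity.&lt;/p&gt;

```python
# Row store: each record is a dict carrying every field.
row_store = [{"id": i, "name": f"n{i}", "amount": i % 100} for i in range(1000)]

# Column store: one list per field.
col_store = {
    "id": [r["id"] for r in row_store],
    "name": [r["name"] for r in row_store],
    "amount": [r["amount"] for r in row_store],
}

total_rows = sum(r["amount"] for r in row_store)  # must read whole rows
total_cols = sum(col_store["amount"])             # reads only one column
```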

&lt;p&gt;As mentioned above, to achieve high performance on a non-SQL system, we have to control the execution logic, adopt low-complexity algorithms, and make full use of various engineering means. Unlike SQL, the theoretical system of SPL gives us strong descriptive ability and a simpler way of coding that avoids roundabout constructs, and it lets us use its rich high-performance algorithm library and the corresponding storage mechanisms directly. We can therefore exploit the engineering optimizations while adopting low-complexity algorithms, achieving code that is both simple to write and fast to run.&lt;/p&gt;

&lt;p&gt;For example, SPL regards TopN as an ordinary aggregation operation. Whether we compute the TopN of the full set or of each grouped subset, the processing method is the same, and no big sort is needed. In this way, the goal of using a lower-complexity algorithm is achieved, and high performance follows.&lt;/p&gt;
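&lt;p&gt;The idea of TopN as aggregation can be sketched in Python (illustrative only, not SPL's implementation): a fixed-size heap keeps at most N candidates while streaming the data once, so the full O(n log n) sort is never performed, and the same aggregation applies per group just as easily.&lt;/p&gt;

```python
import heapq
from collections import defaultdict

def top_n(values, n):
    heap = []  # min-heap holding at most n items
    for v in values:
        if len(heap) < n:
            heapq.heappush(heap, v)
        elif v > heap[0]:
            heapq.heapreplace(heap, v)  # evict the current smallest
    return sorted(heap, reverse=True)

def top_n_by_group(rows, n):
    """rows: (group, value) pairs; keeps a small heap per group."""
    heaps = defaultdict(list)
    for g, v in rows:
        h = heaps[g]
        if len(h) < n:
            heapq.heappush(h, v)
        elif v > h[0]:
            heapq.heapreplace(h, v)
    return {g: sorted(h, reverse=True) for g, h in heaps.items()}
```

&lt;p&gt;For example, top_n([5, 1, 9, 3, 7], 3) returns [9, 7, 5] after a single pass, with memory bounded by N rather than by the data size.&lt;/p&gt;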

&lt;p&gt;Again, let's take the conversion funnel above as an example to see the difference between SQL and SPL.&lt;/p&gt;

&lt;p&gt;SQL code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;with e1 as (
 select uid,1 as step1,min(etime) as t1
 from event
 where etime&amp;gt;= to_date('2021-01-10') and etime&amp;lt;to_date('2021-01-25')
 and eventtype='eventtype1' and …
 group by 1),
e2 as (
 select e2.uid,1 as step2,min(e1.t1) as t1,min(e2.etime) as t2
 from event as e2
 inner join e1 on e2.uid = e1.uid
 where e2.etime&amp;gt;= to_date('2021-01-10') and e2.etime&amp;lt;to_date('2021-01-25')
 and e2.etime &amp;gt; e1.t1 and e2.etime &amp;lt; e1.t1 + 7
 and e2.eventtype='eventtype2' and …
 group by 1),
e3 as (
 select e3.uid,1 as step3,min(e2.t1) as t1,min(e3.etime) as t3
 from event as e3
 inner join e2 on e3.uid = e2.uid
 where e3.etime&amp;gt;= to_date('2021-01-10') and e3.etime&amp;lt;to_date('2021-01-25')
 and e3.etime &amp;gt; e2.t2 and e3.etime &amp;lt; e2.t1 + 7
 and e3.eventtype='eventtype3' and …
 group by 1)
select
 sum(step1) as step1,
 sum(step2) as step2,
 sum(step3) as step3
from
 e1
 left join e2 on e1.uid = e2.uid
 left join e3 on e2.uid = e3.uid
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;SPL code:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fql3dbxyam3xj23fd1pik.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fql3dbxyam3xj23fd1pik.png" alt=" " width="800" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As we can see, the SPL code is shorter thanks to its support for ordered computing, and because SPL allows us to code step by step (procedural computing) according to natural thinking. Moreover, this code can handle a funnel analysis involving any number of steps (3 steps in this example); when the number of steps increases, we only need to change a parameter. SPL is therefore clearly more advantageous than SQL, which needs an extra subquery for each additional step. This benefit comes from SPL being simple to code.&lt;/p&gt;
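&lt;p&gt;The procedural, parameterized shape of the funnel can be sketched in Python (a simplified stand-in for the SPL code above, not a translation of it): events are grouped per user and scanned once in time order, advancing a step counter; the step list is just a parameter, and for simplicity only the user's first step-1 event opens the 7-day window.&lt;/p&gt;

```python
from collections import defaultdict

def funnel_counts(events, steps, window=7):
    """events: (uid, event_type, day) tuples; returns, for each funnel
    step, how many users reached it."""
    by_user = defaultdict(list)
    for uid, etype, day in events:
        by_user[uid].append((day, etype))
    counts = [0] * len(steps)
    for rows in by_user.values():
        rows.sort()  # scan each user's events in time order
        reached, t1, prev = 0, None, None
        for day, etype in rows:
            if reached == len(steps):
                break
            if etype != steps[reached]:
                continue
            if reached == 0:
                t1 = prev = day  # step 1 opens the window
                reached = 1
            elif day > prev and day < t1 + window:
                prev = day       # later steps must stay inside the window
                reached += 1
        for s in range(reached):
            counts[s] += 1
    return counts

events = [
    (1, "view", 1), (1, "cart", 2), (1, "pay", 3),  # completes all 3 steps
    (2, "view", 1), (2, "cart", 9),                 # cart falls outside the window
    (3, "view", 1),                                 # stops at step 1
]
result = funnel_counts(events, ["view", "cart", "pay"])
```

&lt;p&gt;Adding a fourth step means appending one name to the steps list, whereas the SQL version above needs a whole new subquery; this is the parameterization the text describes.&lt;/p&gt;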

&lt;p&gt;In terms of performance, SPL also has an obvious advantage. This example is in fact a simplified real case; in the real case, the SQL code ran to almost 200 lines. The user did not get a result after 3 minutes of running on Snowflake's Medium server (equivalent to 4*8=32 cores), whereas the SPL code returned the result in under 10 seconds on a low-end server with 12 cores at 1.7GHz. This benefits from SPL’s high-performance algorithms and the corresponding engineering means.&lt;/p&gt;

&lt;p&gt;With these mechanisms, esProc SPL can make full use of hardware resources and push the performance of a single node to its maximum. As a result, esProc SPL not only solves many performance problems that originally occurred on a single node, but can also handle, on a single node, many calculations that previously required a cluster (possibly even faster), achieving the computing effect of a cluster with only one machine. Of course, a single node has limited computing capacity, so SPL also provides distributed technology, allowing us to scale out through a cluster when one node truly cannot meet the requirement. This is SPL's high-performance computing philosophy: first push the performance of a single node to the extreme, and turn to a cluster only when one node is not enough.&lt;/p&gt;

&lt;p&gt;To be sure, every technology has shortcomings, and SPL is no exception. SQL has been developed for decades, and many databases have strong optimization engines. For simple operations that suit SQL, an optimizer can rewrite the slow statements of ordinary programmers into better-performing ones, so the demands placed on programmers are relatively low. Some scenarios, such as multidimensional analysis, have been optimized for years, and certain SQL engines already implement them very well, with extreme performance. In contrast, SPL does little automatic optimization and depends almost entirely on the programmer writing low-complexity code to achieve high performance. Programmers therefore need some training to familiarize themselves with SPL’s philosophy and library functions before getting started. Although this is an extra step compared with SQL, it is usually worthwhile, as it can improve performance by orders of magnitude and cut costs several times over.&lt;/p&gt;

&lt;p&gt;SPL is now open source. You can obtain the source code from &lt;a href="https://github.com/SPLWare/esProc" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Try it free~~ &lt;a href="https://www.esproc.com/download-esproc/" rel="noopener noreferrer"&gt;Download~&lt;/a&gt;&lt;/p&gt;

</description>
      <category>algorithms</category>
      <category>database</category>
      <category>dataengineering</category>
      <category>performance</category>
    </item>
  </channel>
</rss>
