<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Serina</title>
    <description>The latest articles on DEV Community by Serina (@serina_8340).</description>
    <link>https://dev.to/serina_8340</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1487501%2F6bc73ccc-93ad-4a17-a851-1c154c6b455f.png</url>
      <title>DEV Community: Serina</title>
      <link>https://dev.to/serina_8340</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/serina_8340"/>
    <language>en</language>
    <item>
      <title>esProc SPL: Equivalent to the Python-enhanced DuckDB</title>
      <dc:creator>Serina</dc:creator>
      <pubDate>Fri, 06 Jun 2025 05:43:28 +0000</pubDate>
      <link>https://dev.to/serina_8340/esproc-spl-equivalent-to-the-python-enhanced-duckdb-29if</link>
      <guid>https://dev.to/serina_8340/esproc-spl-equivalent-to-the-python-enhanced-duckdb-29if</guid>
      <description>&lt;p&gt;For desktop data analysis users, if DuckDB is the handy ‘SQL Swiss Army Knife,’ then esProc SPL is an ‘all-in-one toolbox’ with built-in Python capabilities, maintaining SQL’s ease of use while overcoming its inherent limitations.&lt;/p&gt;

&lt;p&gt;Like DuckDB, esProc SPL offers excellent support for file handling. Common files like CSV and Excel can be used directly as databases, allowing you to run SQL queries immediately. For example, to query sales data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$SELECT region, SUM(amount) FROM sales.csv GROUP BY region
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
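&lt;p&gt;For a sense of what this one-liner replaces, here is a rough stdlib-Python sketch of the same aggregation (the region and amount columns are assumed from the query above; the inline data is made up for illustration and stands in for opening sales.csv):&lt;/p&gt;

```python
import csv
import io
from collections import defaultdict

# Made-up stand-in for sales.csv; a real run would use open("sales.csv").
data = io.StringIO("region,amount\nEast,100\nWest,50\nEast,30\n")

# SELECT region, SUM(amount) ... GROUP BY region, done by hand.
totals = defaultdict(float)
for row in csv.DictReader(data):
    totals[row["region"]] += float(row["amount"])

print(dict(totals))
```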



&lt;p&gt;Such lightweight operations are a breeze for esProc. Similar to DuckDB, esProc also supports data binarization, storing data in files with excellent compression ratios. Loading millions of rows takes just seconds, at least three times faster than reading CSV directly.&lt;/p&gt;

&lt;p&gt;Currently, esProc’s SQL does not support window functions, making it less comprehensive than DuckDB’s. However, esProc has a trump card: its native language, SPL, which significantly simplifies complex tasks compared to SQL, so you rarely need to write cumbersome window functions anyway.&lt;br&gt;
For example, calculating bonuses for the top 3 salespeople in each province requires a nested query in SQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WITH ranked_sales AS (
  SELECT province, salesman, amount,
    ROW_NUMBER() OVER(PARTITION BY province ORDER BY amount DESC) as rank
  FROM sales
)
SELECT * FROM ranked_sales WHERE rank &amp;lt;=3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With SPL, it’s much more straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sales.groups(province;top(-3;amount))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Such tasks can still be implemented in SQL. However, for more complex tasks, like calculating ‘automatic doubling of reward points when a customer’s consecutive purchase days exceed 5’, SQL often falls short. When it comes to flow control, such as implementing loops or dynamically adjusting calculation logic based on conditions, SQL’s IF and LOOP statements are too limited to be practical. The convoluted code you manage to write becomes incomprehensible even to yourself after just three days. That’s why DuckDB often relies on Python.&lt;/p&gt;
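&lt;p&gt;To make the ‘consecutive purchase days exceed 5’ rule concrete, here is a minimal pure-Python sketch of the streak logic (not esProc’s implementation; the customer IDs and dates are invented for illustration):&lt;/p&gt;

```python
from datetime import date

# Invented input: sorted, deduplicated purchase dates per customer.
purchases = {
    "C001": [date(2025, 2, d) for d in (1, 2, 3, 4, 5, 6, 10)],
    "C002": [date(2025, 2, 1), date(2025, 2, 3)],
}

def max_streak(days):
    """Length of the longest run of consecutive calendar days."""
    best = run = 1
    for prev, cur in zip(days, days[1:]):
        run = run + 1 if (cur - prev).days == 1 else 1
        best = max(best, run)
    return best

# Customers whose longest streak exceeds 5 days get points doubled.
doubled = [c for c, days in purchases.items() if max_streak(days) > 5]
```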

&lt;p&gt;DuckDB’s Python interface is remarkably smooth, but using the two in combination still creates a sense of fragmentation. You query data with SQL, load it into a DataFrame, and often end up writing it back to the database. These are two distinct systems with different development and debugging approaches, requiring constant mental context switching that feels jarring. It’s like ordering a steak in a Chinese restaurant—it works, but it just feels awkward.&lt;/p&gt;

&lt;p&gt;In contrast, esProc SPL directly integrates the core capabilities of Python.&lt;br&gt;
The calculation above, “a customer’s consecutive purchase days exceed 5”, is written in SPL as:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftn82ceq2gn6eoo82hlb8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftn82ceq2gn6eoo82hlb8.png" alt="Image description" width="800" height="176"&gt;&lt;/a&gt;&lt;br&gt;
This code is actually even more concise than the equivalent Python.&lt;/p&gt;

&lt;p&gt;esProc SPL, with its comprehensive computing capabilities, support for procedural computation, and robust flow control mechanisms, outperforms Python-enhanced DuckDB. It combines SQL’s agility with a programming language’s flexibility, while eliminating the need to juggle back and forth between multiple tools. For desktop analysts who frequently handle complex calculations, esProc SPL may be a more elegant solution than ‘SQL + Python’. After all, who wouldn’t want to handle everything in one window?&lt;/p&gt;

</description>
      <category>duckdb</category>
      <category>programming</category>
      <category>devops</category>
      <category>esproc</category>
    </item>
    <item>
      <title>esProc SPL &amp; MongoDB: A Match Made in Data Heaven</title>
      <dc:creator>Serina</dc:creator>
      <pubDate>Fri, 23 May 2025 07:51:58 +0000</pubDate>
      <link>https://dev.to/serina_8340/esproc-spl-mongodb-a-match-made-in-data-heaven-26p6</link>
      <guid>https://dev.to/serina_8340/esproc-spl-mongodb-a-match-made-in-data-heaven-26p6</guid>
      <description>&lt;p&gt;MongoDB, as a mainstream NoSQL database, has become a powerful tool for handling unstructured data thanks to its flexible document structure. However, as anyone who’s used it knows, its computing capabilities leave much to be desired.&lt;/p&gt;

&lt;p&gt;The trade-off of NoSQL is sacrificing the simplicity of SQL. Let’s take the example of finding the top 10 customers by order amount. With SQL, you can accomplish the task with a single SELECT TOP 10 statement.&lt;/p&gt;

&lt;p&gt;However, with MongoDB, you’ll need to use these three operators: $group, $sort, and $limit, and implementing cross-collection joins requires painstakingly chaining multiple $lookup stages – akin to manually assembling building blocks step by step. What’s even more frustrating is that advanced operations like window functions simply can’t be implemented using MongoDB’s native syntax.&lt;/p&gt;

&lt;p&gt;Clearly, it wouldn’t be appropriate to import the data into MySQL for calculation; otherwise, why take the trouble to use MongoDB? If you were to brute-force the calculations in Java at the application layer, that would be incredibly stressful and not worth the effort.&lt;/p&gt;

&lt;p&gt;With esProc SPL, it becomes much simpler as SPL can directly perform SQL-style computations on MongoDB data. For example, for a multi-layer nested order structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
{
  "_id": ObjectId("..."),
  "order_id": "12345",
  "customer": "C001",
  "order_date": ISODate("2025-02-12T00:00:00Z"),
  "order_details": [
    {
      "product_id": "P001",
      "product_name": "Laptop",
      "quantity": 2,
      "price": 150.00,
      "total": 300.00
    },
    ...
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To query the top 10 largest customers and their order amounts, the SPL code would be:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbdfz67m8y7vjtn2z05x3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbdfz67m8y7vjtn2z05x3.png" alt="Image description" width="800" height="211"&gt;&lt;/a&gt;&lt;br&gt;
The key is just one line in A3, which is as concise as SQL.&lt;br&gt;
esProc SPL even allows the direct use of SQL syntax:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$select TOP 10 Customer, SUM(Amount) AS TotalAmount from {mongo_shell@d(mongo_open("mongodb://127.0.0.1:27017/mongo"),"orders.find()")}
group by Customer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;SPL has a natural affinity for handling nested JSON data. In MongoDB, where documents with three levels of nesting are common, using SPL’s dot operator A.B.C allows direct traversal to the deepest level. When dealing with arrays, SPL can also use conj() for expansion and groups() for grouping and statistical analysis. This harmonious balance – ‘you ensure efficient storage; I guarantee rapid computation’ – is a genuine partnership.&lt;/p&gt;
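&lt;p&gt;To illustrate the shape of the computation (plain Python rather than SPL’s conj()/groups(); the documents below are invented), expanding the nested order_details array and totaling per customer looks like:&lt;/p&gt;

```python
from collections import defaultdict

# Invented documents mirroring the nested order structure above.
orders = [
    {"customer": "C001", "order_details": [{"total": 300.0}, {"total": 50.0}]},
    {"customer": "C002", "order_details": [{"total": 120.0}]},
    {"customer": "C001", "order_details": [{"total": 80.0}]},
]

totals = defaultdict(float)
for doc in orders:
    for line in doc["order_details"]:   # expand the nested array
        totals[doc["customer"]] += line["total"]

# Largest customers by order amount (top 2 here, for brevity).
top = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:2]
```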

&lt;p&gt;This is even better news for Java developers. Previously, manipulating MongoDB required creating a bunch of BasicDBObject objects. Now, simply adding a few jars allows you to execute SPL statements (and the aforementioned SQL statements), and even use a script file as a stored procedure. This essentially transforms MongoDB’s interface to something resembling a relational database, reducing code volume by 90% – shifting from cumbersome pipeline operations to the clean style of Class.forName(…).execute(…).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Class.forName("com.esproc.jdbc.InternalDriver");
Connection con = DriverManager.getConnection("jdbc:esproc:local://");
Statement st = con.prepareCall("call SPL_Mongo_Example()");
st.execute();
ResultSet rs = st.getResultSet();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, SPL_Mongo_Example is the name of the SPL script (SPL_Mongo_Example.splx). Just like calling a stored procedure, the code no longer involves data-processing details and stays simple and clear.&lt;/p&gt;

&lt;p&gt;The fly in the ointment is that esProc SPL is developed in Java. Calling it from other languages like Python/C++ requires an HTTP interface, unlike Java applications, which can integrate it seamlessly. But considering its ability to bring SQL-style computing directly to MongoDB, this minor flaw is entirely forgivable. After all, in this data-driven era, a tool that saves you from writing 200 lines of aggregation code deserves the compliment “That’s awesome.”&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/SPLWare/esProc" rel="noopener noreferrer"&gt;esProc SPL Github page.&lt;/a&gt;&lt;/p&gt;

</description>
      <category>mongodb</category>
      <category>programming</category>
      <category>devops</category>
      <category>esproc</category>
    </item>
    <item>
      <title>Analyzing DuckDB’s Performance Optimization through TOPN and COUNT DISTINCT Operations</title>
      <dc:creator>Serina</dc:creator>
      <pubDate>Tue, 20 May 2025 07:49:51 +0000</pubDate>
      <link>https://dev.to/serina_8340/analyzing-duckdbs-performance-optimization-through-topn-and-count-distinct-operations-218p</link>
      <guid>https://dev.to/serina_8340/analyzing-duckdbs-performance-optimization-through-topn-and-count-distinct-operations-218p</guid>
      <description>&lt;p&gt;In recent years, DuckDB has emerged as a popular choice for numerous data analysis scenarios. Its lightweight nature, ease of use, and simple integration also make it well-suited for programmers performing local analysis. The direct use of SQL provides both convenience and efficiency.&lt;/p&gt;

&lt;p&gt;However, ease of writing code isn’t the only consideration; fast execution and intelligent optimization are also key to the user experience.&lt;/p&gt;

&lt;p&gt;We will use TOPN and COUNT DISTINCT operations as examples to analyze DuckDB’s performance optimization.&lt;/p&gt;

&lt;p&gt;Test environment and data preparation&lt;br&gt;
Test hardware:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy3nxviyygwdf90gstehl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy3nxviyygwdf90gstehl.png" alt="Image description" width="800" height="181"&gt;&lt;/a&gt;&lt;br&gt;
This is not a high-end configuration; rather, it’s closer to the environment developers use daily, making the test results more relevant.&lt;/p&gt;

&lt;p&gt;Test data:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1x0d9dxzmb5qc8migacg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1x0d9dxzmb5qc8migacg.png" alt="Image description" width="800" height="266"&gt;&lt;/a&gt;&lt;br&gt;
Both tables exceed available memory to ensure realistic simulation of big data processing scenarios and avoid unfair advantages from caching.&lt;/p&gt;

&lt;p&gt;Entire-set TOPN: Excellent performance&lt;br&gt;
First, let’s look at a classic TOPN operation: retrieving the top 100 rows by amount field:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select * from topn order by amount desc limit 100;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the SQL is interpreted literally, the database would first sort the entire table and then select the top 100 rows. This results in an algorithm complexity of n*logn. If the data volume exceeds available memory, performance drops significantly.&lt;/p&gt;

&lt;p&gt;However, test results show DuckDB performs quite well in this scenario. After importing the 1 billion rows of raw text data (22GB), it compresses the data into an 8GB database file. The first query takes approximately 20 seconds (largely due to initial loading and caching), with subsequent queries stabilizing around 4 seconds.&lt;/p&gt;

&lt;p&gt;During the execution, no temporary files were generated, indicating that DuckDB’s optimizer avoids full-table sorting and instead uses an optimized algorithm: maintain a small set of size N and traverse the data only once. This strategy reduces the complexity from n*logn to n*logN, a substantial improvement when N is small.&lt;/p&gt;
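&lt;p&gt;The bounded-set strategy described above is what Python’s standard library exposes as heapq.nlargest; a small sketch of the same idea over a synthetic stream:&lt;/p&gt;

```python
import heapq
import random

# Synthetic one-pass stream standing in for a large table scan.
random.seed(42)
stream = (random.randint(0, 10**9) for _ in range(100_000))

# A size-100 heap is maintained during a single traversal:
# O(n * log N) instead of the O(n * log n) of a full sort.
top100 = heapq.nlargest(100, stream)
```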

&lt;p&gt;In this test, DuckDB performs excellently, demonstrating both intelligence and efficiency.&lt;/p&gt;

&lt;p&gt;Grouped TOPN: Shortcomings exposed&lt;br&gt;
Next, let’s look at a slightly more complex operation: grouped TOPN.&lt;/p&gt;

&lt;p&gt;The task is to group the table by the first letter of the id field and select the top 100 rows with the largest amount values in each group. The SQL is roughly as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select * from (
  select left(id,1) as gid, amount,
         row_number() over (partition by left(id,1) order by amount desc) rn
  from topn
) where rn &amp;lt;= 100;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In theory, this scenario is similar to regular TOPN. Since each group contains only tens of thousands of rows, maintaining a small top-N set per group and traversing the data once should avoid a full sort as well.&lt;/p&gt;

&lt;p&gt;However, in actual running, DuckDB seems to be “faithfully executing the SQL semantics”: fully sorting each group and then filtering using row_number. This strategy resulted in extensive read/write operations between memory and external storage, causing temporary files to swell to over 40GB during the execution. Even after 10 minutes, the query remained incomplete.&lt;/p&gt;

&lt;p&gt;In other words, although optimization is possible logically, DuckDB doesn’t actively recognize ‘grouped TOPN’ as a ‘group aggregation’ scenario and instead rigidly performs sorting. In this case, users are left to optimize the query themselves, either by rewriting the SQL to avoid sorting or by switching to a different tool.&lt;/p&gt;
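&lt;p&gt;The missed optimization can be sketched in plain Python: one bounded min-heap per group and a single traversal, with invented rows and the group key being the first letter of id, as in the query above:&lt;/p&gt;

```python
import heapq
from collections import defaultdict

# Invented rows: (id, amount); group key is the first letter of id.
rows = [("a1", 5), ("a2", 9), ("a3", 7), ("b1", 4), ("b2", 8)]
N = 2  # top N per group

heaps = defaultdict(list)  # group key -> min-heap of at most N amounts
for rid, amount in rows:
    h = heaps[rid[0]]
    if len(h) == N:
        if amount > h[0]:
            heapq.heapreplace(h, amount)  # evict the smallest of the top N
    else:
        heapq.heappush(h, amount)

# One traversal, no full sort: n*logN rather than n*logn.
top_per_group = {g: sorted(h, reverse=True) for g, h in heaps.items()}
```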

&lt;p&gt;For example, esProc SPL directly supports using groups in combination with the top function, treating TOPN as an aggregation operation. It obtains the result by traversing the data only once, and takes just 31 seconds to complete the same query, demonstrating a considerable performance advantage.&lt;/p&gt;

&lt;p&gt;SPL script – Entire-set TOPN:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F95vm3awacvsugnzi1vi0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F95vm3awacvsugnzi1vi0.png" alt="Image description" width="800" height="122"&gt;&lt;/a&gt;&lt;br&gt;
SPL script – Grouped TOPN:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqitphaqt19mbt73jwwd4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqitphaqt19mbt73jwwd4.png" alt="Image description" width="800" height="119"&gt;&lt;/a&gt;&lt;br&gt;
Summary of test results (Unit: seconds)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa8xza4jlk05kz11z5gvd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa8xza4jlk05kz11z5gvd.png" alt="Image description" width="800" height="123"&gt;&lt;/a&gt;&lt;br&gt;
COUNT DISTINCT: Unable to leverage ordered data&lt;br&gt;
Let’s now discuss another common operation: COUNT DISTINCT.&lt;/p&gt;

&lt;p&gt;In SQL, counting the number of distinct order IDs that meet a specific condition is typically expressed as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select count( distinct orderid ) from orderdetail where amount&amp;gt;50;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In DuckDB, the logic for this operation is simple and brute-force: loop through all data, collect the order IDs that meet the condition into a DISTINCT set, compare each new value with the current DISTINCT set to determine whether to add it, and finally count the size of the DISTINCT set.&lt;/p&gt;

&lt;p&gt;If the data is unordered, this approach is fine and consistent with the practices of most databases.&lt;/p&gt;

&lt;p&gt;However, in this test, we intentionally inserted the data in ascending order by orderid. In this case, an algorithm that leveraged this ordered characteristic could significantly improve performance: adjacent rows with the same orderid value can be skipped directly, eliminating the need to maintain a DISTINCT set.&lt;/p&gt;
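&lt;p&gt;The ordered-data shortcut can be sketched in a few lines of Python (invented rows, sorted by orderid): a value is new exactly when orderid changes, so no DISTINCT set is kept:&lt;/p&gt;

```python
def count_distinct_ordered(rows, cond):
    """COUNT(DISTINCT orderid) over rows already sorted by orderid.

    rows: iterable of (orderid, amount); cond: filter on amount.
    Tracks only the last counted orderid instead of a full set.
    """
    count, prev = 0, object()  # sentinel unequal to any orderid
    for orderid, amount in rows:
        if cond(amount) and orderid != prev:
            count += 1
            prev = orderid
    return count

# Invented data, ascending by orderid, as in the test setup.
rows = [(1, 60), (1, 30), (2, 70), (2, 80), (3, 10)]
n = count_distinct_ordered(rows, lambda a: a > 50)
```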

&lt;p&gt;Regrettably, DuckDB lacks the ability to actively recognize and leverage the ordered nature of the data. This is confirmed by the test results: it took over 200 seconds to process 2.4 billion rows using a single thread.&lt;/p&gt;

&lt;p&gt;In contrast, using esProc SPL and its icount@o function to perform COUNT DISTINCT specifically for ordered data, the test completed in approximately 50 seconds—about 4 times faster than DuckDB.&lt;/p&gt;

&lt;p&gt;For more complex operations, such as calculating two COUNT DISTINCT values with different conditions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select count( distinct case when amount&amp;gt;50 then orderid else null end ),
       count( distinct case when amount&amp;gt;100 then orderid else null end )
from orderdetail;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;DuckDB repeatedly builds DISTINCT sets, causing the execution time to skyrocket to over 800 seconds. Conversely, esProc SPL reuses the scan, so its execution time barely changes.&lt;/p&gt;
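&lt;p&gt;Reusing a single scan for several conditional COUNT DISTINCTs, as described above, can be sketched in Python by carrying one counter per condition through one pass over the (invented) ordered data:&lt;/p&gt;

```python
# Invented rows, ascending by orderid; one pass yields both counts.
rows = [(1, 60), (1, 120), (2, 70), (3, 150), (3, 20)]

conds = [lambda a: a > 50, lambda a: a > 100]
counts = [0] * len(conds)
prevs = [object() for _ in conds]  # last counted orderid per condition

for orderid, amount in rows:
    for i, cond in enumerate(conds):
        if cond(amount) and orderid != prevs[i]:
            counts[i] += 1
            prevs[i] = orderid
```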

&lt;p&gt;The root cause of this gap lies in SQL’s inability to express “ordered” information, leaving the optimizer no means to leverage such information. In contrast, languages like esProc SPL—which can explicitly express orderedness and utilize dedicated function calls—hold a natural advantage in such scenarios.&lt;/p&gt;

&lt;p&gt;SPL script - One count distinct query:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh6pm13wk7e05gk9egfti.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh6pm13wk7e05gk9egfti.png" alt="Image description" width="800" height="125"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;SPL script - Two count distinct queries:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F30wrd60fi43isujtah2w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F30wrd60fi43isujtah2w.png" alt="Image description" width="800" height="127"&gt;&lt;/a&gt;&lt;br&gt;
Summary of test results (Unit: seconds):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9y172spbc7xqbkgnuibh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9y172spbc7xqbkgnuibh.png" alt="Image description" width="800" height="120"&gt;&lt;/a&gt;&lt;br&gt;
As a lightweight analytical database, DuckDB boasts excellent usability and particularly excels at optimizing entire-set TOPN queries, delivering a great daily user experience.&lt;/p&gt;

&lt;p&gt;However, when faced with slightly more complex problems, such as grouped TOPN or COUNT DISTINCT on inherently ordered data, the “clumsiness” of SQL semantics is exposed: the optimizer cannot recognize the underlying logic, which leads to a significant performance gap.&lt;/p&gt;

&lt;p&gt;In these scenarios, using tools like esProc SPL, which can describe computational details and support the expression of ordered characteristics, can greatly improve performance.&lt;/p&gt;

&lt;p&gt;In conclusion, DuckDB is a good tool, provided the scenario is simple and gives the optimizer free rein. However, if the problem involves ordered characteristics or special aggregation logic, fast execution requires tools that can describe “smarter” computational methods.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/SPLWare/esProc" rel="noopener noreferrer"&gt;esProc SPL Github page.&lt;/a&gt;&lt;/p&gt;

</description>
      <category>duckdb</category>
      <category>devops</category>
      <category>programming</category>
      <category>esproc</category>
    </item>
    <item>
      <title>With SPL, It Seems We Don’t Need ORM Anymore</title>
      <dc:creator>Serina</dc:creator>
      <pubDate>Fri, 16 May 2025 07:10:27 +0000</pubDate>
      <link>https://dev.to/serina_8340/with-spl-it-seems-we-dont-need-orm-anymore-m31</link>
      <guid>https://dev.to/serina_8340/with-spl-it-seems-we-dont-need-orm-anymore-m31</guid>
      <description>&lt;p&gt;ORM technology indeed simplifies basic CRUD operations, but it also has many limitations when dealing with complex calculations. Hibernate’s HQL is notably inadequate, making it difficult to implement dynamic column calculations and multi-layer associations. While JOOQ improves flexibility through its DSL, its grouping calculations require multi-layer nesting, leading to more verbose code than native SQL.&lt;/p&gt;

&lt;p&gt;esProc SPL is like a “plug-in” for data computation! To write a multi-layer JOIN based on dynamic conditions with JOOQ, you could spend half an hour on chained calls in Java. Now you can implement this in just a few lines of SPL script, with syntax more intuitive than SQL. For example, calculating ‘top 3 sales by department’, which requires nesting even in SQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT dept, name FROM (
  SELECT dept, name, RANK() OVER (PARTITION BY dept ORDER BY sales DESC) as rank 
  FROM employee
) WHERE rank &amp;lt;=3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;can be implemented in SPL with just one line of code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;employee.groups(dept; top(-3, sales))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;SPL also supports dynamic data structures. It eliminates the need to predefine entity classes, and you can dynamically add fields in the script at any time: Orders.derive(Amount*0.1:tax, Amount+tax:total_amount). This is in contrast to JOOQ’s need for predefinition. Performing calculations in SPL is as simple as in SQL. For example, filtering conditions can be directly written as Orders.select(amount&amp;gt;1000 &amp;amp;&amp;amp; like(client,"*s*")), with field names used directly without object prefixes. In contrast, JOOQ’s approach, such as ORDERS.AMOUNT.gt(1000), pales in comparison.&lt;/p&gt;

&lt;p&gt;When performing cross-database analysis (e.g., MySQL user data + Elasticsearch logs), you used to have to write ETL scripts to extract and load the data. With SPL, you can process data across databases directly:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgqy3p8duagokox32r291.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgqy3p8duagokox32r291.png" alt="Image description" width="800" height="229"&gt;&lt;/a&gt;&lt;br&gt;
No need for extracting data, no need for creating intermediate tables. In this scenario, ORM is left helpless!&lt;/p&gt;

&lt;p&gt;ORM is more suitable for handling simple tasks, while complex calculations, cross-source operations, and dynamic logic can all be handed over to SPL. Whether it’s real-time risk control, dynamic reporting, or IoT stream processing, SPL handles them all with ease. Additionally, SPL’s cursor mechanism allows reading and computing simultaneously without exhausting memory, and its syntax is concise and flexible. Try processing Kafka stream data with JOOQ; the threading model in Java alone can drive you crazy. In contrast, SPL can perform real-time aggregation directly with kafka_open().kafka_poll@c().groups(hour(time);avg(value)). The gap is like the difference between a wagon and a Tesla.&lt;/p&gt;

&lt;p&gt;Similar to ORM, SPL is also developed purely in Java and can be seamlessly integrated into Java applications for deployment and distribution. Unlike ORM, however, using SPL for computations typically means writing the business logic in scripts, which Java then calls through JDBC.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Class.forName("com.esproc.jdbc.InternalDriver");
Connection con = DriverManager.getConnection("jdbc:esproc:local://");
Statement st = con.prepareCall("call SplScript()"); //SPL script name
st.execute();
ResultSet rs = st.getResultSet();

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This separates the calculation code from the Java code, which differs from ORM’s tight integration into Java applications, so ORM programmers may find it unfamiliar at first. In fact, SPL has full support for flow control, such as if and for, making business logic easier to implement than in Java.&lt;/p&gt;

&lt;p&gt;The advantage of independent SPL scripts is hot updating. SPL scripts are interpreted at runtime, so if the statistical logic changes, you can leisurely modify the SPL script and upload it directly to the server, and the change takes effect within seconds without even restarting the business system. With tools like JOOQ, by contrast, you have to recompile and redeploy after modifying the Java code, which makes for a poor experience.&lt;/p&gt;

&lt;p&gt;Essentially, SPL does not objectify data tables; rather, it directly manipulates the database using SPL. While this approach might be less convenient than MyBatis for simple single-table CRUD operations, SPL will absolutely rescue you from the ORM quagmire when facing the three major challenges: complex calculations, heterogeneous data, and frequently changing requirements. Programmers shouldn’t make things harder on themselves. Let ORM handle what it’s good at – object mapping – and leave computations to the specialized SPL. Isn’t that far more elegant than wrestling with SQL in Java?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/SPLWare/esProc" rel="noopener noreferrer"&gt;esProc SPL Github page.&lt;/a&gt;&lt;/p&gt;

</description>
      <category>programming</category>
      <category>devops</category>
      <category>opensource</category>
      <category>news</category>
    </item>
    <item>
      <title>Besides True Parallelism and Big Data, esProc SPL’s Conciseness Leaves Python in the Dust</title>
      <dc:creator>Serina</dc:creator>
      <pubDate>Thu, 15 May 2025 06:47:36 +0000</pubDate>
      <link>https://dev.to/serina_8340/besides-true-parallelism-and-big-data-esproc-spls-conciseness-leaves-python-in-the-dust-3dlo</link>
      <guid>https://dev.to/serina_8340/besides-true-parallelism-and-big-data-esproc-spls-conciseness-leaves-python-in-the-dust-3dlo</guid>
      <description>&lt;p&gt;Python, with its concise syntax and rich libraries, provides a significantly simpler way to data computation than Java, even surpassing SQL in convenience, which explains its immense popularity in the field of data analysis.&lt;/p&gt;

&lt;p&gt;However, the emergence of esProc SPL may disrupt this ranking.&lt;/p&gt;

&lt;p&gt;A typical example is big data processing. When memory cannot hold the full dataset, even a common aggregation or filtering operation takes roughly ten lines of Python code. Sorting or grouping is worse, with the code ballooning to hundreds of lines of functions, methods, and temporary file handling, a level of complexity beyond most data analysts. The root cause is that Python provides no native cursor: when data exceeds memory capacity, programmers must split the data themselves, producing verbose, messy code written with great effort.&lt;/p&gt;
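
&lt;p&gt;To make the contrast concrete, here is a minimal sketch of the manual, cursor-less style described above, using only the Python standard library (the file name and columns are hypothetical, and a tiny file stands in for one that would not fit in memory):&lt;/p&gt;

```python
import csv, os, tempfile

# build a tiny stand-in for "huge.txt"; the real file would not fit in memory
path = os.path.join(tempfile.mkdtemp(), "huge.txt")
with open(path, "w", newline="") as f:
    w = csv.writer(f, delimiter="\t")
    w.writerow(["area", "amount"])
    w.writerows([["east", 100], ["west", 250], ["east", 50]])

# manual "cursor": stream the file row by row and aggregate as we go
total = 0.0
with open(path, newline="") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        total += float(row["amount"])
print(total)  # 400.0
```

&lt;p&gt;Filtering works the same streaming way; sorting or grouping beyond memory would additionally require writing sorted runs to temporary files and merging them, which is where the hundreds of lines come from.&lt;/p&gt;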

&lt;p&gt;In contrast, SPL is much simpler. SPL has a built-in cursor data type, so aggregation and filtering can each be done in a single line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;file(“huge.txt”).cursor@t().total(sum(amount))
file(“huge.txt”).cursor@t().select(amount&amp;gt;=1000)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Even complex operations like sorting and grouping that are difficult to implement in Python can still be done with SPL in just one line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;file(“huge.txt”).cursor@t().sortx(area)
file(“huge.txt”).cursor@t().groups(area,amount)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;SPL cursors are not limited to loading data in batches; they also incorporate optimization techniques such as indexing and blocking to keep computation efficient even at large data volumes.&lt;/p&gt;

&lt;p&gt;Parallel computing is essential for big data processing. Python offers no true multi-thread parallelism (the GIL leaves multi-core CPUs largely idle), and writing multi-process programs instead is cumbersome. In contrast, SPL reduces parallel computing to a single option: just add @m to enable it automatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;file(“huge.txt”).cursor@tm().groups(area;sum(amount))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;SPL offers true parallel computing. Under the hood, it automatically splits the task by the number of CPU cores, lets each core independently process its portion of the data, and then merges the results, fully exploiting multi-core hardware. The entire process is transparent to users and requires no manual intervention from programmers.&lt;/p&gt;
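
&lt;p&gt;For comparison, the multi-process boilerplate Python needs for even a simple parallel sum looks roughly like this (a minimal sketch over hypothetical in-memory data; splitting, worker functions, and merging are all manual):&lt;/p&gt;

```python
from multiprocessing import Pool

def part_sum(chunk):
    # each worker process sums its own slice independently
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(100_000))
    # manual task splitting: one slice per worker
    slices = [data[i::4] for i in range(4)]
    # a process pool sidesteps the GIL, at the cost of all this ceremony
    with Pool(4) as pool:
        total = sum(pool.map(part_sum, slices))
    print(total)  # 4999950000
```

&lt;p&gt;Note the required top-level worker function and the __main__ guard; SPL's @m option hides all of this.&lt;/p&gt;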

&lt;p&gt;SPL also offers its own high-performance storage, employing mechanisms like binary format, compression, and columnar storage to significantly improve data read efficiency. Moreover, the storage can be flexibly designed based on computation goals, such as ordered storage by specified fields and appropriate redundancy. This ensures both high efficiency and flexibility.&lt;/p&gt;

&lt;p&gt;With SPL’s cursor and the rich set of computations it supports, data analysts can confidently tackle big data challenges. This alone puts SPL several steps ahead of Python; combined with SPL’s simple parallelism and its own high-performance storage, it leaves Python far behind.&lt;/p&gt;

&lt;p&gt;Even for non-big-data scenarios, SPL remains more concise than Python.&lt;br&gt;
For common, simple calculations, SPL and Python are similar, with no significant difference. For example, to get the top three salaries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;employee.top(-3;salary)  //SPL
employee.nlargest(3, 'salary')  //Python
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, once slightly more complex scenarios are involved, the difference becomes obvious. For example, to calculate the top three within each group, SPL retains its concise style:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;employee.groups(department;top(-3;salary))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;SPL treats topN as an aggregation operation, so it can be computed directly while grouping.&lt;/p&gt;

&lt;p&gt;In contrast, Python is considerably more complex:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;employee.groupby('department').apply(lambda group: group.nlargest(3, 'salary'))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not only does it require combining apply with lambda, but the style is also inconsistent with that of the whole-set calculation.&lt;/p&gt;

&lt;p&gt;Similar inconsistencies in syntax and style are common in Python, which adds extra memorization cost. For example, both login.groupby('user')['time'].min() and login.groupby('user').agg({'time': 'min'}) calculate the minimum value within each group, but they return completely different objects, and the subsequent supported operations also vary, requiring extra caution during use.&lt;/p&gt;
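
&lt;p&gt;A quick check of that inconsistency (the column values here are illustrative):&lt;/p&gt;

```python
import pandas as pd

login = pd.DataFrame({"user": ["a", "a", "b"],
                      "time": [3, 1, 2]})

s = login.groupby("user")["time"].min()         # returns a Series
d = login.groupby("user").agg({"time": "min"})  # returns a DataFrame

print(type(s).__name__, type(d).__name__)  # Series DataFrame
```

&lt;p&gt;The same logical calculation yields two different object types, so the operations you can chain afterwards differ as well.&lt;/p&gt;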

&lt;p&gt;Differences also arise in position-based calculations. For example, to extract the data for the 5th, 10th, 15th, … trading days of 2025, SPL directly uses # to represent position, making it simple and intuitive:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;stock.select(year(date)==2025).select(# % 5 == 0)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Python, however, requires a workaround: first filtering, then using reset_index() to re-number, and finally retrieving values based on position:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;stock_2025 = stock[stock['date'].dt.year == 2025].reset_index(drop=True)
selected_stock = stock_2025[stock_2025.index % 5 == 4]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The reset_index(drop=True) step alone is easy to forget, and omitting it silently selects the wrong rows because the original index is kept. SPL, on the other hand, natively supports sequence numbers, so there’s no need to worry about these details.&lt;/p&gt;

&lt;p&gt;In addition, when calculating growth rates or moving averages, it’s often necessary to reference adjacent records. SPL directly uses [-1] to retrieve the previous record, making the code natural. For example, calculate the maximum monthly sales growth:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sales.(if(#&amp;gt;1,~-~[-1],0)).max()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Python requires shift() or rolling() to generate a new Series before performing the calculation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sales.rolling(window=2).apply(lambda x: x[1] - x[0], raw=True).max()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To calculate if the three-month moving average is increasing, SPL remains straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sales.(~[-2,0].pselect(~&amp;lt;=~[-1])==null)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Python is much more cumbersome, requiring both rolling and lambda:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sales.rolling(window=3).apply(lambda x: (x[0] &amp;lt; x[1] and x[1] &amp;lt; x[2]), raw=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not only is the code complex, but the fixed rolling window is also inflexible.&lt;/p&gt;

&lt;p&gt;Both Python and SPL support lambda syntax, but SPL is simpler and more direct. For example, to label managers with salaries over 5000, the Python code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;employee['flag'] = employee.apply(lambda row: 'yes' if row['position'] == 'manager' and row['salary'] &amp;gt; 5000 else 'no', axis=1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;SPL doesn’t require the lambda keyword; its lambda syntax is implicit, enabling direct coding:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;employee.derive(if(position == "manager" &amp;amp;&amp;amp; salary &amp;gt; 5000, "yes", "no"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Code simplicity is sometimes even more important than performance. In this aspect, SPL leaves Python even further behind.&lt;/p&gt;

&lt;p&gt;With its cursor calculations, true parallel processing, and proprietary high-performance storage, SPL effortlessly handles massive data without struggling with complex code. SPL truly deserves the praise “It’s awesome!”&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Comparison of esProc SPL and DuckDB in Data Storage</title>
      <dc:creator>Serina</dc:creator>
      <pubDate>Mon, 12 May 2025 08:42:10 +0000</pubDate>
      <link>https://dev.to/serina_8340/comparison-of-esproc-spl-and-duckdb-in-data-storage-k22</link>
      <guid>https://dev.to/serina_8340/comparison-of-esproc-spl-and-duckdb-in-data-storage-k22</guid>
      <description>&lt;p&gt;Data storage is essentially about striking a balance between flexibility, performance, and ease of use. Both DuckDB and esProc SPL offer their own binary storage formats, but they differ significantly in data organization.&lt;/p&gt;

&lt;p&gt;DuckDB still employs a conventional database mechanism: data organization is logically holistic, data under a specific subject forms a database, and a set of metadata describes the structure of and relationships among the data in the database. A database is logically a whole, with clear distinctions between data inside and outside it and explicit import and export actions, a property often referred to as closedness. Closedness provides better manageability, but it also implies less freedom in data organization. While DuckDB can also process data outside the database, this falls under its multi-data-source functionality and often still requires mapping to tables.&lt;/p&gt;

&lt;p&gt;esProc is completely different; it is not a database. Its data organization is logically fragmented. It lacks the concept of a subject, and consequently, has no metadata. Of course, it makes no distinction between data inside and outside the database, nor does it involve import or export actions. Any data can participate in calculations as long as it’s accessible. The only difference lies in the access performance from different data sources. esProc designs a specialized binary data storage format for high performance (columnar storage, compression, etc.). However, from a logical standpoint, this storage format is treated the same as a text file or data extracted from other databases. esProc’s storage approach is strongly characterized by its openness, and data organization is more flexible and unconstrained, but this comes at the cost of data integrity and manageability.&lt;/p&gt;

&lt;p&gt;Now, let’s take a look at the binary format differences between DuckDB and esProc SPL.&lt;/p&gt;

&lt;p&gt;DuckDB’s .duckdb files adopt pure columnar storage, where all data is compressed in blocks by column. Columnar storage is well-suited for analytical queries (such as calculating total sales), as it only needs to read a single column of data, and the compressed data results in small file sizes and fast read speeds.&lt;/p&gt;

&lt;p&gt;esProc offers two storage formats to address different scenarios:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;btx files (row-based binary format): Data is stored by row, with a structure similar to CSV, but saved in binary format for faster read/write speed. This format is suitable for small-scale data or temporary storage (such as intermediate calculation results), as it requires no pre-defined structure and can be written to and used immediately.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;ctx files (columnar composite tables): Data is stored by column and blocked in order based on primary keys (such as time or ID). This format is suitable for large-scale data analysis, allowing it to directly skip blocks that do not meet the conditions, reducing IO consumption and significantly improving speed. The ctx file also offers the option of row-based storage or no compression, allowing for selection based on different computational scenarios.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This fit-for-purpose design enables esProc to strike a balance between convenience and performance: btx is convenient for small files, while ctx is efficient for big data.&lt;/p&gt;

&lt;p&gt;During big data analysis, esProc’s ctx, in addition to possessing the advantages of DuckDB’s columnar storage, also specifically supports ordered computation (as opposed to SQL’s unordered nature). ctx can also be designed according to the computation. For example, storing data in a specified order allows for the application of order-related algorithms like binary search to improve computational performance. Redundancy can also be implemented – after all, it’s just a matter of having an additional copy of a file.&lt;/p&gt;

&lt;p&gt;The flexibility of esProc storage is also manifested in its ability to allow the storage of different types of data within the same field. This greatly increases flexibility but sacrifices performance, requiring trade-offs based on specific needs.&lt;/p&gt;

&lt;p&gt;Moreover, the value of an esProc field can be another record or table, thereby better supporting multi-layered data storage and usage. DuckDB now also offers good support for JSON data, making the two systems comparable in this regard.&lt;/p&gt;

&lt;p&gt;esProc SPL and DuckDB each have their own distinctive characteristics in data management and storage. DuckDB employs a conventional database model, using metadata to uniformly manage structured data. Its closed mechanism provides strong manageability but limits flexibility. In contrast, esProc adopts fragmented data organization, with no metadata constraints, supporting mixed computations of data from any source, but requiring developers to manage the data themselves. In terms of storage, DuckDB only supports a pure columnar format, while esProc offers a dual mode of btx row-based storage and ctx columnar composite table, supporting multiple data types within the same field and multi-layer nested data, allowing storage strategies to be flexibly chosen according to the scenario.&lt;/p&gt;

</description>
      <category>duckdb</category>
      <category>programming</category>
      <category>devops</category>
      <category>discuss</category>
    </item>
    <item>
      <title>esProc SPL vs DuckDB: Which is more Lightweight for In-Application Computation</title>
      <dc:creator>Serina</dc:creator>
      <pubDate>Fri, 09 May 2025 07:58:03 +0000</pubDate>
      <link>https://dev.to/serina_8340/esproc-spl-vs-duckdb-which-is-more-lightweight-for-in-application-computation-25m2</link>
      <guid>https://dev.to/serina_8340/esproc-spl-vs-duckdb-which-is-more-lightweight-for-in-application-computation-25m2</guid>
      <description>&lt;p&gt;Both DuckDB and esProc SPL can be embedded in applications as computing engines. This article will compare which is more lightweight. The term “lightweight” refers not only to size, but also to simplicity in development and maintenance.&lt;/p&gt;

&lt;p&gt;DuckDB is indeed convenient to use; you can directly import it in Python and get started, and it integrates smoothly with the Java ecosystem via JDBC. esProc primarily targets the Java ecosystem – its 15MB jar can simply be dropped into a project and invoked directly, while non-Java programs call it through an HTTP interface. Both installation packages are quite small, exhibiting lightweight characteristics.&lt;/p&gt;

&lt;p&gt;esProc scripts are interpreted and support hot deployment, enabling computational logic modifications without service restart. In this aspect, esProc is on par with DuckDB.&lt;/p&gt;

&lt;p&gt;Differences in their lightweight nature are particularly evident in cross-data-source mixed computation scenarios. Although DuckDB supports common file formats such as CSV and Parquet, as well as some databases like MySQL, each data source requires a deeply customized connector developed separately. Consequently, mainstream relational databases like Oracle and SQL Server remain unsupported, and NoSQL databases like MongoDB are even harder to support. When users need to perform cross-source computations between MySQL and Oracle, the lack of official connectors typically forces them to fall back on Python to move the data. Such “glue code” not only complicates the technology stack but, more critically, burdens the system architecture, violating the lightweight principle.&lt;/p&gt;

&lt;p&gt;In contrast, esProc SPL employs a “native interface + light encapsulation” approach, achieving natural compatibility with all relational databases through JDBC and enabling access to unstructured data sources, such as MongoDB and Kafka, with only a shallow encapsulation based on native interfaces. This standardized extension mechanism enables support for dozens of data source types, covering all scenarios like files, databases, API interfaces, and message queues. Moreover, users can rapidly extend through reserved extension interfaces, truly realizing a lightweight experience of “connect and compute immediately”.&lt;/p&gt;

&lt;p&gt;In addition, DuckDB has a critical flaw when handling complex computations: SQL inherently lacks flow control capabilities. Basic functionalities like for/if are unavoidable for even moderately complex business logic. However, since SQL cannot handle these operations, and DuckDB provides no supplemental mechanisms like stored procedures, users have to resort to external languages with flow control capabilities, such as Python, to brute-force solutions when encountering such requirements. This not only results in fragmented and verbose code but also necessitates maintaining two distinct technology stacks. It’s akin to building a crane just to move bricks—far from being lightweight and agile.&lt;/p&gt;
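
&lt;p&gt;The hybrid pattern described above — SQL doing the queries while a host language supplies the loops and branches — looks roughly like this (sketched with the standard library’s sqlite3 standing in for DuckDB; the table and threshold are made up):&lt;/p&gt;

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales(region TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [("east", 120), ("west", 80), ("east", 40)])

# flow control lives in Python, not SQL: branch per region on the query result
labels = {}
for region, total in con.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"):
    labels[region] = "high" if total > 100 else "low"
print(labels)
```

&lt;p&gt;The query logic and the branching logic sit in two different languages with two different debugging experiences, which is exactly the split being described.&lt;/p&gt;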

&lt;p&gt;esProc’s SPL directly integrates flow control into the data processing language, including loops, conditionals, and exception handling. It can handle familiar SQL-style queries while also replacing Python for flow control, making it a complete language solution. Programmers no longer need to juggle SQL and Python, the technology stack is simplified, and the whole setup is more lightweight.&lt;/p&gt;

&lt;p&gt;Being lightweight isn’t just about having the smallest installation package; it’s like when moving—you can’t just look at the suitcase size, you need to see if one suitcase can hold all your belongings. DuckDB may appear small and nimble, but when it comes to scenarios requiring cross-data source association computations or writing business logic with loops and conditionals, it still needs external assistance. It’s like a rice cooker that promises one-touch cooking, but if you want to steam buns, you still need to connect an external steamer.&lt;/p&gt;

&lt;p&gt;The cleverness of esProc lies in its ability to handle complex tasks on its own. Whether it’s a database or an API, as long as it can connect, it can compute—even perform mixed computations. Whether it’s simple statistics or complex rules, a single set of script syntax handles it all. It’s like a transforming toolbox—it looks the size of a screwdriver, but when opened, it reveals wrenches, pliers, and drills. Most importantly, there’s no need to search everywhere for accessories, which is what truly makes it lightweight.&lt;/p&gt;

</description>
      <category>duckdb</category>
      <category>programming</category>
      <category>devops</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Comparison of esProc SPL and DuckDB in Multi-Data Source Capabilities</title>
      <dc:creator>Serina</dc:creator>
      <pubDate>Wed, 07 May 2025 05:57:07 +0000</pubDate>
      <link>https://dev.to/serina_8340/comparison-of-esproc-spl-and-duckdb-in-multi-data-source-capabilities-32op</link>
      <guid>https://dev.to/serina_8340/comparison-of-esproc-spl-and-duckdb-in-multi-data-source-capabilities-32op</guid>
      <description>&lt;p&gt;Both DuckDB and esProc SPL support diverse data sources, and this article will compare the differences between them.&lt;/p&gt;

&lt;p&gt;Types of supported data sources&lt;/p&gt;

&lt;p&gt;DuckDB supports a wide range of data source types, covering common file formats (such as CSV, Parquet, JSON, Excel), cloud storage (such as AWS S3, Azure Blob Storage), and relational databases (such as MySQL, PostgreSQL, SQLite). It can also access web data via https. Additionally, DuckDB supports some emerging data lake formats (such as Delta Lake, Iceberg).&lt;/p&gt;

&lt;p&gt;esProc supports a wider range of data source types, covering more local files, databases, and remote data sources. Here are some data sources supported by SPL:&lt;/p&gt;

&lt;p&gt;• Local files: CSV, Excel, JSON, XML, Parquet, ORC, etc.&lt;/p&gt;

&lt;p&gt;• All relational databases: MySQL, PostgreSQL, Oracle, SQL Server, etc. (via JDBC)&lt;/p&gt;

&lt;p&gt;• NoSQL databases: MongoDB, Cassandra, Redis, etc.&lt;/p&gt;

&lt;p&gt;• Cloud storage: HDFS, AWS S3, GCS, etc.&lt;/p&gt;

&lt;p&gt;• Remote data sources: RESTful API, WebService, FTP/SFTP, etc.&lt;/p&gt;

&lt;p&gt;• Others: Kafka, ElasticSearch, etc.&lt;/p&gt;

&lt;p&gt;In terms of the number of data sources, esProc supports more types of data sources, especially in non-relational databases (such as MongoDB, Redis) and support for Kafka, ES, etc., where esProc has a significant advantage.&lt;/p&gt;

&lt;p&gt;From a deeper perspective, DuckDB’s data source access relies on dedicated connectors that must be developed separately for each source, which is complex work, and it is very difficult for users to extend the open-source code themselves. As a result, the number of available connectors is limited, and even the most common relational databases are not fully covered. Currently, DuckDB supports MySQL, PostgreSQL, and SQLite, but not other common databases such as Oracle and SQL Server, which makes mixed queries across multiple data sources difficult. For example, when performing mixed calculations between MySQL and Oracle, if there is no suitable connector, users can only fall back on Python as a workaround.&lt;/p&gt;

&lt;p&gt;esProc utilizes the native interface of data sources. All relational databases can be connected via JDBC, which is naturally supported. Other data sources such as MongoDB and Kafka can also be simply encapsulated based on the native interface, resulting in high development speed and thus providing a richer connector library. Users can easily add their own connectors by implementing the reserved extension interface.&lt;/p&gt;

&lt;p&gt;With this rich support and extension capability, it is very easy to implement mixed calculations across multiple data sources in esProc: MySQL+Oracle can be computed directly, and extending to unsupported data sources is also simple.&lt;/p&gt;

&lt;p&gt;There is no obvious superiority or inferiority between DuckDB’s dedicated connector and esProc’s simple encapsulation using the native interface. The former can provide deeper support and optimization, achieving a certain level of transparency; the latter is more flexible, supporting a wide range of data sources and offering flexible extension. The specific preference depends on actual needs.&lt;/p&gt;

&lt;p&gt;Data type processing&lt;/p&gt;

&lt;p&gt;DuckDB has very mature support for CSV and Parquet files, enabling efficient reading and querying of these files. For example, DuckDB can directly load CSV files and execute SQL queries, making the operation straightforward and simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT * FROM 'data.csv' WHERE column_a &amp;gt; 100;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;esProc also makes it simple to process CSV files using SPL syntax:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;T("data.csv").select(column_a &amp;gt; 100)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In addition to SPL syntax, esProc also provides SQL syntax:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$SELECT * FROM data.csv WHERE column_a &amp;gt; 100;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use SQL for simple scenarios and SPL for complex ones. They can also be used in combination.&lt;/p&gt;

&lt;p&gt;Due to the limitations of SQL, many complex calculations are not easy to implement. DuckDB integrates well with Python, allowing complex requirements to be met with Python’s assistance; however, writing and debugging then span two different systems, which creates a strong sense of fragmentation. esProc provides SQL plus the more powerful SPL: anything SQL cannot handle can be implemented in SPL, often more simply, and performing all calculations within a single system keeps the work coherent.&lt;/p&gt;

&lt;p&gt;Another significant difference lies in JSON processing. esProc can better handle complex calculations and scenarios that require preserving JSON’s hierarchical structure. When performing multi-level structure calculations, SPL can directly access sublevel data using dots (.), which is very intuitive. There is no need to rely on UNNEST to unfold layer by layer or nested queries to preserve the integrity of the data structure, as in DuckDB. The support for multi-level data calculations is very thorough.&lt;/p&gt;

&lt;p&gt;Multi-layer and multi-condition data filtering in SPL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;json(file("orders.json").read()). select(order_details.product.category=="Electronics" &amp;amp;&amp;amp; order_details.sum(price*quantity)&amp;gt;200)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Compared to DuckDB, esProc supports a richer variety of data sources and is easier to extend, enabling mixed calculations across most data sources. In terms of data processing, esProc not only supports SQL syntax but also SPL, which can handle more complex scenarios within a single system, eliminating the sense of split between SQL and Python systems. Especially for processing multi-layered JSON data, SPL is simpler and more intuitive.&lt;/p&gt;

</description>
      <category>duckdb</category>
      <category>discuss</category>
      <category>devops</category>
      <category>esproc</category>
    </item>
    <item>
      <title>SPL Operates Multi-layer JSON Data Much More Conveniently than DuckDB</title>
      <dc:creator>Serina</dc:creator>
      <pubDate>Tue, 29 Apr 2025 06:29:35 +0000</pubDate>
      <link>https://dev.to/serina_8340/spl-operates-multi-layer-json-data-much-more-conveniently-than-duckdb-5536</link>
      <guid>https://dev.to/serina_8340/spl-operates-multi-layer-json-data-much-more-conveniently-than-duckdb-5536</guid>
      <description>&lt;p&gt;esProc SPL is much more convenient than DuckDB in operating multi-layer JSON data, particularly when preserving JSON hierarchy and performing complex calculations are required.&lt;/p&gt;

&lt;p&gt;DuckDB’s ability to operate JSON is quite good. The read_json_auto() function can directly parse JSON to a table structure, allowing you to operate on multi-layer data directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT order_id, order_date, json_extract(customer, '$.name') AS cusName,json_extract(customer, '$.city') AS cusCity FROM read_json_auto('orders.json')

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;SPL is simpler for such basic operations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;json(file("orders.json").read()).new(order_id, order_date,customer.name:cusname,customer.city:cuscity)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using the dot (.) to directly access sublevel data is very intuitive.&lt;/p&gt;

&lt;p&gt;For slightly complex calculations, such as determining the sales amount for the Electronics category in the order data, DuckDB requires expanding order_details, filtering for category='Electronics', and finally calculating SUM(price*quantity).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT sum(od.quantity*od.price) amount
FROM read_json_auto('orders.json') AS o,
LATERAL UNNEST(o.order_details) AS t(od),
LATERAL UNNEST([od.product]) AS u(p)
WHERE p.category = 'Electronics'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To implement this calculation, SQL needs to associate the sub table with the primary table using an inner join for filtering. This is a bit roundabout, but not too complicated.&lt;/p&gt;

&lt;p&gt;SPL, in contrast, can directly treat the sub table as a set for calculations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;json(file("order3.json").read()).conj(order_details).select(product.category=="Electronics").sum(quantity*price)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is just a single statement with no joins and simpler logic, so the advantage over DuckDB is clear.&lt;/p&gt;
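
&lt;p&gt;For reference, the same flatten-filter-sum expressed in plain Python over a hypothetical nested orders structure shows the set-oriented logic SPL applies directly (stdlib json only):&lt;/p&gt;

```python
import json

# hypothetical orders.json content, matching the structure the article assumes
orders = json.loads("""
[
  {"order_id": 1, "order_details": [
      {"product": {"category": "Electronics"}, "price": 100, "quantity": 2},
      {"product": {"category": "Books"}, "price": 20, "quantity": 1}
  ]},
  {"order_id": 2, "order_details": [
      {"product": {"category": "Electronics"}, "price": 50, "quantity": 1}
  ]}
]
""")

# flatten all detail rows, keep Electronics, then sum price*quantity
amount = sum(
    d["price"] * d["quantity"]
    for o in orders
    for d in o["order_details"]
    if d["product"]["category"] == "Electronics"
)
print(amount)  # 250
```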

&lt;p&gt;In more complex scenarios, such as filtering order details for the ‘Electronics’ category and excluding orders with amounts below $200, DuckDB SQL becomes difficult to write. You first need to expand the order_details and aggregate order amounts, then filter for eligible orders based on the aggregation result, and finally resort to nested queries or CTEs in order to preserve the integrity of the data structure. As the SQL becomes lengthy, it becomes less user-friendly for debugging. Using Lambda syntax can be simpler, but it is quite different from traditional SQL form.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    o.order_id, 
    LIST_FILTER(o.order_details, x -&amp;gt; x.product.category = 'Electronics') AS order_details
FROM read_json_auto('orders.json') AS o
WHERE 
    ARRAY_LENGTH(LIST_FILTER(o.order_details, x -&amp;gt; x.product.category = 'Electronics')) &amp;gt; 0
    AND list_sum(
        list_transform(
            LIST_FILTER(o.order_details, x -&amp;gt; x.product.category = 'Electronics'),
            x -&amp;gt; x.price * x.quantity
        )
    ) &amp;gt; 200;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The SPL code is still natural:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;=A2.select(order_details.select@1(product.category=="Electronics") &amp;amp;&amp;amp; order_details.sum(price*quantity)&amp;gt;200)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Still a single statement: simply treating the sub table as a set is enough. There are no complex subqueries or Lambda syntax; direct referencing, filtering, and aggregation work regardless of the number of layers. Moreover, SPL preserves the multi-layer structure of the JSON without requiring SQL constructs like GROUP BY and LATERAL UNNEST.&lt;/p&gt;
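&lt;p&gt;As a rough plain-Python sketch of the same filtering logic (the field names, data layout, and the $200 threshold follow the scenario above; everything else is assumed for illustration):&lt;/p&gt;

```python
def filter_orders(orders):
    """Keep orders whose Electronics detail lines total more than 200,
    preserving each order but retaining only its Electronics lines.
    (A sketch of the filtering described above; field names are assumed.)"""
    result = []
    for o in orders:
        elec = [d for d in o["order_details"]
                if d["product"]["category"] == "Electronics"]
        amount = sum(d["price"] * d["quantity"] for d in elec)
        if elec and amount > 200:
            result.append({**o, "order_details": elec})
    return result
```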

&lt;p&gt;While DuckDB does handle JSON well, it still requires UNNEST and similar SQL constructs, which become cumbersome as nesting deepens. In contrast, SPL operates directly on multi-layer JSON structures, making filtering and aggregation convenient while preserving the original data hierarchy. It is clearly better suited to complex JSON computation scenarios.&lt;/p&gt;

&lt;p&gt;It's free: &lt;a href="https://www.esproc.com/download-esproc/" rel="noopener noreferrer"&gt;download esProc&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>duckdb</category>
      <category>programming</category>
      <category>coding</category>
      <category>esproc</category>
    </item>
    <item>
      <title>Local Data Analysis: DuckDB or esProc SPL?</title>
      <dc:creator>Serina</dc:creator>
      <pubDate>Mon, 28 Apr 2025 03:42:05 +0000</pubDate>
      <link>https://dev.to/serina_8340/local-data-analysis-duckdb-or-esproc-spl-4462</link>
      <guid>https://dev.to/serina_8340/local-data-analysis-duckdb-or-esproc-spl-4462</guid>
<description>&lt;p&gt;DuckDB can directly read common files such as CSV, Parquet, and JSON. With just a single SQL statement, it can load a file and query it, such as SELECT * FROM 'data.csv' WHERE price &amp;gt; 100. For users accustomed to SQL, this “file-as-table” experience is very friendly, enabling rapid implementation of simple filtering and aggregation calculations.&lt;/p&gt;
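&lt;p&gt;For comparison, the same “file-as-table” filter can be imitated with Python's standard csv module (a sketch only; the file contents and the price column are assumptions):&lt;/p&gt;

```python
import csv
import io

# A small in-memory stand-in for data.csv (assumed columns).
data_csv = "item,price\napple,50\nlaptop,1200\nphone,800\n"

def rows_over(text, threshold):
    """Return rows whose price exceeds threshold,
    like SELECT * FROM 'data.csv' WHERE price > threshold."""
    reader = csv.DictReader(io.StringIO(text))
    return [row for row in reader if float(row["price"]) > threshold]
```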

&lt;p&gt;However, when faced with complex scenarios—such as cross-file iterative calculations, processing unstructured logs, or implementing dynamic conditional branches—relying solely on SQL often falls short. In such cases, you have to resort to Python for writing loops or UDF. This hybrid programming approach leads to a noticeable sense of split: You need to constantly switch between SQL’s logical thinking and Python’s procedural thinking when programming, and separately handle SQL snippets and Python variables when debugging, which is quite cumbersome.&lt;/p&gt;

&lt;p&gt;esProc SPL is a significantly better alternative. It supports over 20 file formats, including CSV and Excel, and can also parse semi-structured data such as RESTful and NoSQL, providing broader data source support than DuckDB. Moreover, esProc provides both SQL and SPL syntax, allowing you to use SQL for simple queries and SPL for complex tasks. All tasks can be handled within the same interface.&lt;/p&gt;

&lt;p&gt;For example, when calculating the number of consecutive rising days for a stock, SPL's ordered grouping makes the task easy, and it can be combined with SQL to accomplish it:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6ux85jgn2ir4cz8xwgi7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6ux85jgn2ir4cz8xwgi7.png" alt="Image description" width="800" height="161"&gt;&lt;/a&gt;&lt;br&gt;
This concise syntax based on sequence numbers is more intuitive and easier to understand than SQL’s multi-layer nested subqueries.&lt;/p&gt;
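&lt;p&gt;The underlying calculation can be sketched in plain Python (a minimal illustration of counting consecutive rises, not the SPL implementation):&lt;/p&gt;

```python
def max_rising_streak(prices):
    """Longest run of consecutive day-over-day rises in a price series."""
    best = cur = 0
    for prev, nxt in zip(prices, prices[1:]):
        cur = cur + 1 if nxt > prev else 0  # extend or reset the current run
        best = max(best, cur)
    return best
```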

&lt;p&gt;With esProc SPL’s IDE, developers can view the results of each step in real time, offering higher interactivity and far better debugging efficiency than DuckDB, and it is also more intuitive than Python’s IDEs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzv0kvmkzjc6kur075djz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzv0kvmkzjc6kur075djz.png" alt="Image description" width="800" height="484"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F657ilcgartjy5sf59u8n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F657ilcgartjy5sf59u8n.png" alt="Image description" width="800" height="483"&gt;&lt;/a&gt;&lt;br&gt;
With the support for diverse data sources and procedural computation, SPL fully covers the entire workflow from data loading to result output. For example, in e-commerce user behavior analysis, reading JSON logs, associating CSV product tables, calculating page dwell time, and generating funnel results can all be done with just a single script, without the need to switch interfaces.&lt;/p&gt;

&lt;p&gt;If the amount of data is larger, SPL can use cursors and parallel processing to handle it. Tests reveal that SPL’s multi-thread segmented loading is more than 3 times faster than DuckDB when grouping and aggregating a 100GB CSV file.&lt;/p&gt;
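&lt;p&gt;The segmented-parallel idea can be imitated with Python's standard library (a sketch over assumed in-memory rows, not esProc's actual mechanism): split the rows into segments, aggregate each segment in a worker, then merge the partial results.&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

def seg_sum(rows):
    # Aggregate one segment: per-group running totals.
    acc = {}
    for group, amount in rows:
        acc[group] = acc.get(group, 0) + amount
    return acc

def parallel_group_sum(rows, workers=4):
    """Split rows into segments, aggregate them in parallel, merge the parts."""
    step = max(1, len(rows) // workers)
    segments = [rows[i:i + step] for i in range(0, len(rows), step)]
    total = {}
    with ThreadPoolExecutor(max_workers=workers) as ex:
        for part in ex.map(seg_sum, segments):
            for k, v in part.items():
                total[k] = total.get(k, 0) + v
    return total
```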

&lt;p&gt;DuckDB is suitable for simple SQL file query scenarios, while esProc SPL is more appropriate for complex calculations or big data processing. It is a single system that combines SQL's simplicity, procedural computing power beyond Python's, and a more interactive IDE: truly an all-in-one tool.&lt;/p&gt;

</description>
      <category>duckdb</category>
      <category>programming</category>
      <category>devops</category>
      <category>esproc</category>
    </item>
    <item>
      <title>esProc SPL: Combining the Strengths of DuckDB and Python</title>
      <dc:creator>Serina</dc:creator>
      <pubDate>Wed, 19 Mar 2025 08:30:46 +0000</pubDate>
      <link>https://dev.to/serina_8340/esproc-spl-combining-the-strengths-of-duckdb-and-python-26cd</link>
      <guid>https://dev.to/serina_8340/esproc-spl-combining-the-strengths-of-duckdb-and-python-26cd</guid>
<description>&lt;p&gt;That DuckDB is gaining more and more attention is no accident. As a rising star in desktop analytics, it handles SQL with ease: CTE recursive queries, multi-layered window functions, and complex JOINs are all effortless. It can even run aggregations on datasets with hundreds of millions of rows: just issue a SELECT AVG(revenue) FROM terabyte_table GROUP BY region, and it’s done. Such capability is truly commendable. But even the strongest warriors have their Achilles’ heel, and DuckDB’s shortcomings lie in those areas where SQL falls short but business demands persist.&lt;/p&gt;

&lt;p&gt;Encountering “unconventional requirements”&lt;/p&gt;

&lt;p&gt;For example, the boss wants to calculate “the moving average of each customer’s last 3 order amounts, limited to weekend orders.” Want to do this in SQL? Be ready for a three-layer nested query: first, filter weekend orders; then, group and sort by customer; finally, calculate the moving average using a window function.&lt;/p&gt;
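&lt;p&gt;As a plain-Python sketch of that requirement (the input fields, weekend rule, and sample values are assumptions for illustration): filter to weekend orders per customer, then average the last 3 amounts.&lt;/p&gt;

```python
from datetime import date

def weekend_moving_avg(orders):
    """Per customer, average of the last 3 weekend-order amounts.
    `orders` is assumed to be (customer, order_date, amount) tuples."""
    weekend = {}
    for cust, d, amount in sorted(orders):
        if d.weekday() >= 5:  # Saturday = 5, Sunday = 6
            weekend.setdefault(cust, []).append(amount)
    return {c: sum(v[-3:]) / len(v[-3:]) for c, v in weekend.items()}
```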

&lt;p&gt;Of course, you can write it with some effort, but this isn’t a one-time job. Similar requirements are common, and putting in so much effort every time isn’t practical in the long run.&lt;/p&gt;

&lt;p&gt;And there’s an even more frustrating issue: flow control. Want to create a loop in SQL or dynamically adjust calculation logic based on conditions? For example, “If today’s sales increase exceeds 10%, then run promotional calculations; otherwise, skip.” Unfortunately, SQL’s IF and LOOP statements are too limited to be practical. The convoluted code you manage to write becomes incomprehensible even to you after just three days.&lt;/p&gt;

&lt;p&gt;Resorting to Python?&lt;/p&gt;

&lt;p&gt;Of course, DuckDB has a backup plan: “For tasks that SQL struggles with, hand them off to Python!” There’s nothing wrong with this plan. DuckDB’s Python API is indeed easy to use; a simple conn.sql().to_df() lets you seamlessly switch to pandas.&lt;/p&gt;

&lt;p&gt;For example, to calculate “double reward points when a client’s consecutive purchase days exceed 5,” SQL’s window functions can calculate consecutive dates, but handling dynamic conditions is cumbersome. In comparison, Python is more straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import duckdb
import pandas as pd

df = duckdb.sql("SELECT client, order_date FROM orders ORDER BY client, order_date").df()
df['order_date'] = pd.to_datetime(df['order_date'])
results = []
for client, group in df.groupby("client"):
    streak = 1
    prev_date = None
    for date in group["order_date"]:
        if prev_date is not None and (date - prev_date).days == 1:
            streak += 1
        else:
            streak = 1
        if streak == 5:  # record each client once, when the streak first reaches 5
            results.append({"client": client, "bonus": "Doubling"})
        prev_date = date
print(results)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code works, but the data must be exported from DuckDB to a DataFrame. A logically coherent business requirement has to be split into SQL preprocessing and Python post-processing, forcing you to keep switching between the two. This approach is not only awkward to develop and debug but also wastes significant time on data transfer, driving you crazy.&lt;/p&gt;

&lt;p&gt;Moreover, Python doesn’t excel at everything: it lacks native big-data processing and parallel computing support for large datasets, making it inferior to DuckDB in this regard.&lt;/p&gt;

&lt;p&gt;SPL is a good solution&lt;/p&gt;

&lt;p&gt;With esProc SPL, these problems vanish. It’s like a combined-evolved version of ‘DuckDB + Python’, handling everything in one system, making it simpler and more efficient.&lt;/p&gt;

&lt;p&gt;Like DuckDB, esProc SPL offers SQL support. SQL can be run directly on common files such as CSV and Excel. For example, query sales data:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$SELECT region, SUM(amount) FROM sales.csv GROUP BY region
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Such lightweight operations are a breeze for esProc. Moreover, it offers a binary file format with good compression ratio and fast data I/O – capabilities on par with DuckDB.&lt;/p&gt;

&lt;p&gt;Furthermore, for complex requirements, esProc provides native SPL syntax as a backup. For example, for the consecutive purchase reward task mentioned earlier, written in SPL:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbqgoqd75zjxawzmawl61.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbqgoqd75zjxawzmawl61.png" alt="Image description" width="800" height="170"&gt;&lt;/a&gt;&lt;br&gt;
This code not only identifies eligible clients for double rewards but also retrieves their purchase details (A3). Doesn’t it feel like you’re getting ahead of the business needs?&lt;/p&gt;

&lt;p&gt;SPL also provides far better JSON support compared to DuckDB, allowing you to directly navigate through nested data with simple dot notation: example.contact(1).metadata.verified, which is much cleaner than DuckDB’s json_extract(contact[1], '$.metadata.verified').&lt;/p&gt;
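&lt;p&gt;The dot-notation idea resembles plain nested indexing, sketched here in Python on a hypothetical record (note that SPL's contact(1) is 1-based, while Python lists are 0-based):&lt;/p&gt;

```python
import json

# Hypothetical nested record, mirroring the example path above.
record = json.loads("""
{"name": "example",
 "contact": [{"metadata": {"verified": true}},
             {"metadata": {"verified": false}}]}
""")

# Reads much like SPL's example.contact(1).metadata.verified.
verified = record["contact"][0]["metadata"]["verified"]
```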

&lt;p&gt;SPL offers robust support to address Python’s weakness in handling big data. For large datasets, SPL’s cursor mechanism will show you how easy it is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;=file("huge.log").cursor@t()
=A1.groups(;sum(amount):total, count(~):rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Data is read in a stream, with only the current batch held in memory. This allows files of hundreds of gigabytes to run smoothly. Additionally, it supports parallel and segmented processing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;=file("huge.log").cursor@tm(;4)  //4-thread parallel processing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
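&lt;p&gt;A cursor-like stream can be imitated in plain Python with a generator that yields fixed-size batches (a sketch of the idea, not esProc's actual mechanism):&lt;/p&gt;

```python
def batches(lines, size):
    """Yield successive batches so only `size` records are in memory at once."""
    batch = []
    for line in lines:
        batch.append(line)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

def streamed_sum(lines, size=1000):
    # Aggregate batch by batch, never materializing the whole input.
    total = 0.0
    rows = 0
    for batch in batches(lines, size):
        total += sum(float(x) for x in batch)
        rows += len(batch)
    return total, rows
```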



&lt;p&gt;DuckDB + Python is like using chopsticks for a steak dinner – each works well individually but the combination feels awkward. What about esProc SPL? Think of it as a full-stack kitchen suite: it combines SQL’s rigor with Python’s flexibility, enhanced with a universal toolkit for multi-source data mixed computation + big data processing—all seamlessly integrated into one system. Isn’t the ultimate dream of data engineers to write less code and slack off? SPL might just be the closest shortcut to achieving that.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.esproc.com/download-Desktop/" rel="noopener noreferrer"&gt;Free download esProc&lt;/a&gt;&lt;/p&gt;

</description>
      <category>duckdb</category>
      <category>programming</category>
      <category>coding</category>
      <category>esproc</category>
    </item>
    <item>
      <title>I tried out a new programming language for data analysts</title>
      <dc:creator>Serina</dc:creator>
      <pubDate>Thu, 23 Jan 2025 08:09:08 +0000</pubDate>
      <link>https://dev.to/serina_8340/i-tried-out-a-new-programming-language-for-data-analysts-2m52</link>
      <guid>https://dev.to/serina_8340/i-tried-out-a-new-programming-language-for-data-analysts-2m52</guid>
      <description>&lt;p&gt;The programming language esProc SPL recommended for data analysts is definitely worth a try.&lt;/p&gt;

&lt;p&gt;Let’s talk about its advantages and disadvantages. For some traits, it’s hard to say whether they are advantages or disadvantages; they are more like characteristics.&lt;/p&gt;

&lt;p&gt;Firstly, the usage threshold is very low: it can be used immediately after installation, without the need for database support. It can directly process files like CSV, reading and processing data in one go, which makes it easier to operate than other tools.&lt;/p&gt;

&lt;p&gt;Group and aggregate CSV files in one line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;T("order.csv").groups(area;sum(amount))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;SPL has an interesting characteristic: &lt;strong&gt;grid code&lt;/strong&gt;. Its code is written in a grid, similar to Excel. Users accustomed to SQL and Python may find it unfamiliar at first, since this writing style is rarely seen elsewhere. It may seem unconventional, but it is actually quite convenient.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffe45jl3xldlrpo49ajrt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffe45jl3xldlrpo49ajrt.png" alt="Image description" width="800" height="474"&gt;&lt;/a&gt;&lt;br&gt;
Besides being neat, this grid code has two advantages:&lt;br&gt;
One is that there is no need to define temporary variables: subsequent code can reference the results of earlier cells through cell names such as A2 and A3, which is very convenient. If rows or columns are added or deleted, the cell names adjust automatically, so there are no reference errors, much as in Excel.&lt;/p&gt;

&lt;p&gt;The second is good interactivity. A result panel on the right shows the calculation result of each cell after the code runs, with no manual output needed. The results are intuitive, and mistakes can be quickly spotted and corrected.&lt;/p&gt;

&lt;p&gt;The syntax of SPL is its own, different from SQL, but it has all the necessary functions such as grouping, filtering, and join. Specifically, SPL has made significant improvements in grouping and ordered operations, making it noticeably more concise than SQL and Python in handling complex analysis tasks.&lt;/p&gt;

&lt;p&gt;You can get a deeper feel for this through an official example: find the player who scores three consecutive times within one minute.&lt;/p&gt;

&lt;p&gt;SPL&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhf05knvl3b4fae9fuegv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhf05knvl3b4fae9fuegv.png" alt="Image description" width="800" height="133"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;SQL&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WITH numbered_scores AS (
    SELECT team, player, play_time, score,
           ROW_NUMBER() OVER (ORDER BY play_time) AS rn
    FROM ball_game)
SELECT DISTINCT s1.player
FROM numbered_scores s1
JOIN numbered_scores s2 ON s1.player = s2.player AND s1.rn = s2.rn - 1
JOIN numbered_scores s3 ON s1.player = s3.player AND s1.rn = s3.rn - 2
WHERE s3.play_time - s1.play_time &amp;lt; 60;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Python&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

df = pd.read_csv("../ball_game.csv")
df["play_time"] = pd.to_datetime(df["play_time"])
result_players = []
player = None
consecutive_scores = 0
for i in range(len(df)):
    current = df.iloc[i]
    if player != current["player"]:
        player = current["player"]
        consecutive_scores = 1
    else:
        consecutive_scores += 1
    if consecutive_scores &amp;gt;= 3:
        last2 = df.iloc[i - 2]  # first score of the current 3-score window
        if (current["play_time"] - last2["play_time"]).seconds &amp;lt; 60:
            result_players.append(player)
result_players = list(set(result_players))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;SPL uses some symbols, such as ~ and #, which may seem a bit strange at first glance. It’s not really difficult; look at a few more examples and you’ll be proficient.&lt;/p&gt;

&lt;p&gt;What I want to complain about is that code completion in the SPL development environment is not good or intelligent enough; this is an aspect that should be improved.&lt;/p&gt;

&lt;p&gt;And the graphical interface of the editor looks outdated. Was it developed with Swing? That toolkit is rarely used nowadays. Although the IDE has everything needed, its appearance doesn’t quite meet modern aesthetic preferences.&lt;/p&gt;

&lt;p&gt;The saved code files are not plain text, which makes version control and code review difficult outside the development environment. It also seems they cannot be opened in VS Code.&lt;/p&gt;

&lt;p&gt;That’s all for now. Overall, it’s pretty good: there are no major functional issues, it’s easy to get started with, and it’s convenient to use. It’s a worthwhile addition to a data analysis toolbox, and you may find it particularly useful at times. Of course, if it supported text-format code and modernized the graphical interface and editor, it would likely be more popular.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
