<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Serina</title>
    <description>The latest articles on DEV Community by Serina (@serina_8340).</description>
    <link>https://dev.to/serina_8340</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1487501%2F6bc73ccc-93ad-4a17-a851-1c154c6b455f.png</url>
      <title>DEV Community: Serina</title>
      <link>https://dev.to/serina_8340</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/serina_8340"/>
    <language>en</language>
    <item>
      <title>esProc SPL: Equivalent to the Python-enhanced DuckDB</title>
      <dc:creator>Serina</dc:creator>
      <pubDate>Fri, 06 Jun 2025 05:43:28 +0000</pubDate>
      <link>https://dev.to/serina_8340/esproc-spl-equivalent-to-the-python-enhanced-duckdb-29if</link>
      <guid>https://dev.to/serina_8340/esproc-spl-equivalent-to-the-python-enhanced-duckdb-29if</guid>
      <description>&lt;p&gt;For desktop data analysis users, if DuckDB is the handy ‘SQL Swiss Army Knife,’ then esProc SPL is an ‘all-in-one toolbox’ with built-in Python capabilities, maintaining SQL’s ease of use while overcoming its inherent limitations.&lt;/p&gt;

&lt;p&gt;Like DuckDB, esProc SPL offers excellent support for file handling. Common files like CSV and Excel can be used directly as databases, allowing you to run SQL queries immediately. For example, to query sales data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$SELECT region, SUM(amount) FROM sales.csv GROUP BY region
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
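&lt;p&gt;For a sense of what this one-liner replaces, here is a rough stdlib-Python sketch of the same aggregation (the region and amount columns are assumed from the query above; the inline data is made up for illustration and stands in for opening sales.csv):&lt;/p&gt;

```python
import csv
import io
from collections import defaultdict

# Made-up stand-in for sales.csv; a real run would use open("sales.csv").
data = io.StringIO("region,amount\nEast,100\nWest,50\nEast,30\n")

# SELECT region, SUM(amount) ... GROUP BY region, done by hand.
totals = defaultdict(float)
for row in csv.DictReader(data):
    totals[row["region"]] += float(row["amount"])

print(dict(totals))
```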



&lt;p&gt;Such lightweight operations are a breeze for esProc. Similar to DuckDB, esProc also supports data binarization, storing data in files with excellent compression ratios. Loading millions of rows takes just seconds, at least three times faster than reading CSV directly.&lt;/p&gt;

&lt;p&gt;Currently, esProc’s SQL does not support window functions, making it less comprehensive than DuckDB’s. However, esProc has a trump card: its native language, SPL, which significantly simplifies complex tasks compared to SQL, so you rarely need to write cumbersome window functions anyway.&lt;br&gt;
For example, calculating bonuses for the top 3 salespeople in each province requires a nested query in SQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WITH ranked_sales AS (
  SELECT province, salesman, amount,
    ROW_NUMBER() OVER(PARTITION BY province ORDER BY amount DESC) as rank
  FROM sales
)
SELECT * FROM ranked_sales WHERE rank &amp;lt;=3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With SPL, it’s much more straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sales.groups(province;top(-3;amount))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Such tasks can still be implemented in SQL. However, for more complex tasks, like calculating ‘automatic doubling of reward points when a customer’s consecutive purchase days exceed 5’, SQL often falls short. When it comes to flow control, such as implementing loops or dynamically adjusting calculation logic based on conditions, SQL’s IF and LOOP statements are too limited to be practical. The convoluted code you manage to write becomes incomprehensible even to yourself after just three days. That’s why DuckDB often relies on Python.&lt;/p&gt;
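&lt;p&gt;To make the ‘consecutive purchase days exceed 5’ rule concrete, here is a minimal pure-Python sketch of the streak logic (not esProc’s implementation; the customer IDs and dates are invented for illustration):&lt;/p&gt;

```python
from datetime import date

# Invented input: sorted, deduplicated purchase dates per customer.
purchases = {
    "C001": [date(2025, 2, d) for d in (1, 2, 3, 4, 5, 6, 10)],
    "C002": [date(2025, 2, 1), date(2025, 2, 3)],
}

def max_streak(days):
    """Length of the longest run of consecutive calendar days."""
    best = run = 1
    for prev, cur in zip(days, days[1:]):
        run = run + 1 if (cur - prev).days == 1 else 1
        best = max(best, run)
    return best

# Customers whose longest streak exceeds 5 days get points doubled.
doubled = [c for c, days in purchases.items() if max_streak(days) > 5]
```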

&lt;p&gt;DuckDB’s Python interface is remarkably smooth, but using the two in combination still creates a sense of fragmentation. You query data with SQL, load it into a DataFrame, and often end up writing it back to the database. These are two distinct systems with different development and debugging approaches, requiring constant mental context switching that feels jarring. It’s like ordering a steak in a Chinese restaurant—it works, but it just feels awkward.&lt;/p&gt;

&lt;p&gt;In contrast, esProc SPL directly integrates the core capabilities of Python.&lt;br&gt;
The calculation above, “a customer’s consecutive purchase days exceed 5”, is written in SPL as:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftn82ceq2gn6eoo82hlb8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftn82ceq2gn6eoo82hlb8.png" alt="Image description" width="800" height="176"&gt;&lt;/a&gt;&lt;br&gt;
This code is actually even more concise than the equivalent Python.&lt;/p&gt;

&lt;p&gt;esProc SPL, with its comprehensive computing capabilities, support for procedural computation, and robust flow control mechanisms, outperforms Python-enhanced DuckDB. It combines SQL’s agility with a programming language’s flexibility, while eliminating the need to juggle back and forth between multiple tools. For desktop analysts who frequently handle complex calculations, esProc SPL may be a more elegant solution than ‘SQL + Python’. After all, who wouldn’t want to handle everything in one window?&lt;/p&gt;

</description>
      <category>duckdb</category>
      <category>programming</category>
      <category>devops</category>
      <category>esproc</category>
    </item>
    <item>
      <title>esProc SPL &amp; MongoDB: A Match Made in Data Heaven</title>
      <dc:creator>Serina</dc:creator>
      <pubDate>Fri, 23 May 2025 07:51:58 +0000</pubDate>
      <link>https://dev.to/serina_8340/esproc-spl-mongodb-a-match-made-in-data-heaven-26p6</link>
      <guid>https://dev.to/serina_8340/esproc-spl-mongodb-a-match-made-in-data-heaven-26p6</guid>
      <description>&lt;p&gt;MongoDB, as a mainstream NoSQL database, has become a powerful tool for handling unstructured data thanks to its flexible document structure. However, as anyone who’s used it knows, its computing capabilities leave much to be desired.&lt;/p&gt;

&lt;p&gt;The trade-off of NoSQL is sacrificing the simplicity of SQL. Let’s take the example of finding the top 10 customers by order amount. With SQL, you can accomplish the task with a single SELECT TOP 10 statement.&lt;/p&gt;

&lt;p&gt;However, with MongoDB, you’ll need to use these three operators: $group, $sort, and $limit, and implementing cross-collection joins requires painstakingly chaining multiple $lookup stages – akin to manually assembling building blocks step by step. What’s even more frustrating is that advanced operations like window functions simply can’t be implemented using MongoDB’s native syntax.&lt;/p&gt;

&lt;p&gt;Clearly, it wouldn’t be appropriate to import the data into MySQL for calculation; otherwise, why take the trouble to use MongoDB? If you were to brute-force the calculations in Java at the application layer, that would be incredibly stressful and not worth the effort.&lt;/p&gt;

&lt;p&gt;With esProc SPL, it becomes much simpler as SPL can directly perform SQL-style computations on MongoDB data. For example, for a multi-layer nested order structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
{
  "_id": ObjectId("..."),
  "order_id": "12345",
  "customer": "C001",
  "order_date": ISODate("2025-02-12T00:00:00Z"),
  "order_details": [
    {
      "product_id": "P001",
      "product_name": "Laptop",
      "quantity": 2,
      "price": 150.00,
      "total": 300.00
    },
    ...
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To query the top 10 largest customers and their order amounts, the SPL code would be:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbdfz67m8y7vjtn2z05x3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbdfz67m8y7vjtn2z05x3.png" alt="Image description" width="800" height="211"&gt;&lt;/a&gt;&lt;br&gt;
The key is just one line in A3, which is as concise as SQL.&lt;br&gt;
esProc SPL even allows the direct use of SQL syntax:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$select TOP 10 Customer, SUM(Amount) AS TotalAmount from {mongo_shell@d(mongo_open("mongodb://127.0.0.1:27017/mongo"),"orders.find()")}
group by Customer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;SPL has a natural affinity for handling nested JSON data. In MongoDB, where documents with three levels of nesting are common, using SPL’s dot operator A.B.C allows direct traversal to the deepest level. When dealing with arrays, SPL can also use conj() for expansion and groups() for grouping and statistical analysis. This harmonious balance – ‘you ensure efficient storage; I guarantee rapid computation’ – is a genuine partnership.&lt;/p&gt;
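&lt;p&gt;To illustrate the shape of the computation (plain Python rather than SPL’s conj()/groups(); the documents below are invented), expanding the nested order_details array and totaling per customer looks like:&lt;/p&gt;

```python
from collections import defaultdict

# Invented documents mirroring the nested order structure above.
orders = [
    {"customer": "C001", "order_details": [{"total": 300.0}, {"total": 50.0}]},
    {"customer": "C002", "order_details": [{"total": 120.0}]},
    {"customer": "C001", "order_details": [{"total": 80.0}]},
]

totals = defaultdict(float)
for doc in orders:
    for line in doc["order_details"]:   # expand the nested array
        totals[doc["customer"]] += line["total"]

# Largest customers by order amount (top 2 here, for brevity).
top = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:2]
```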

&lt;p&gt;This is even better news for Java developers. Previously, manipulating MongoDB required creating a bunch of BasicDBObject objects. Now, simply adding a few jars allows you to execute SPL statements (and the aforementioned SQL statements), and even use a script file as a stored procedure. This essentially transforms MongoDB’s interface to something resembling a relational database, reducing code volume by 90% – shifting from cumbersome pipeline operations to the clean style of Class.forName(…).execute(…).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Class.forName("com.esproc.jdbc.InternalDriver");
Connection con = DriverManager.getConnection("jdbc:esproc:local://");
Statement st = con.prepareCall("call SPL_Mongo_Example()");
st.execute();
ResultSet rs = st.getResultSet();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, SPL_Mongo_Example is the name of the SPL script (SPL_Mongo_Example.splx). Just like calling a stored procedure, the code no longer involves data-processing details and stays simple and clear.&lt;/p&gt;

&lt;p&gt;The fly in the ointment is that esProc SPL is developed in Java. Calling it from other languages like Python/C++ requires an HTTP interface, unlike Java applications, which can integrate it seamlessly. But considering its ability to bring SQL-style computing directly to MongoDB, this minor flaw is entirely forgivable. After all, in this data-driven era, a tool that saves you from writing 200 lines of aggregation code deserves the compliment “That’s awesome.”&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/SPLWare/esProc" rel="noopener noreferrer"&gt;esProc SPL Github page.&lt;/a&gt;&lt;/p&gt;

</description>
      <category>mongodb</category>
      <category>programming</category>
      <category>devops</category>
      <category>esproc</category>
    </item>
    <item>
      <title>Analyzing DuckDB’s Performance Optimization through TOPN and COUNT DISTINCT Operations</title>
      <dc:creator>Serina</dc:creator>
      <pubDate>Tue, 20 May 2025 07:49:51 +0000</pubDate>
      <link>https://dev.to/serina_8340/analyzing-duckdbs-performance-optimization-through-topn-and-count-distinct-operations-218p</link>
      <guid>https://dev.to/serina_8340/analyzing-duckdbs-performance-optimization-through-topn-and-count-distinct-operations-218p</guid>
      <description>&lt;p&gt;In recent years, DuckDB has emerged as a popular choice for numerous data analysis scenarios. Its lightweight nature, ease of use, and simple integration also make it well-suited for programmers performing local analysis. The direct use of SQL provides both convenience and efficiency.&lt;/p&gt;

&lt;p&gt;However, ease of writing code isn’t the only consideration; fast execution and intelligent optimization are also key to the user experience.&lt;/p&gt;

&lt;p&gt;We will use TOPN and COUNT DISTINCT operations as examples to analyze DuckDB’s performance optimization.&lt;/p&gt;

&lt;p&gt;Test environment and data preparation&lt;br&gt;
Test hardware:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy3nxviyygwdf90gstehl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy3nxviyygwdf90gstehl.png" alt="Image description" width="800" height="181"&gt;&lt;/a&gt;&lt;br&gt;
This is not a high-end configuration; rather, it’s closer to the environment developers use daily, making the test results more relevant.&lt;/p&gt;

&lt;p&gt;Test data:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1x0d9dxzmb5qc8migacg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1x0d9dxzmb5qc8migacg.png" alt="Image description" width="800" height="266"&gt;&lt;/a&gt;&lt;br&gt;
Both tables exceed available memory to ensure realistic simulation of big data processing scenarios and avoid unfair advantages from caching.&lt;/p&gt;

&lt;p&gt;Entire-set TOPN: Excellent performance&lt;br&gt;
First, let’s look at a classic TOPN operation: retrieving the top 100 rows by amount field:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select * from topn order by amount desc limit 100;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the SQL is interpreted literally, the database would first sort the entire table and then select the top 100 rows. This results in an algorithm complexity of n*logn. If the data volume exceeds available memory, performance drops significantly.&lt;/p&gt;

&lt;p&gt;However, test results show DuckDB performs quite well in this scenario. After importing the 1 billion rows of raw text data (22GB), it compresses the data into an 8GB database file. The first query takes approximately 20 seconds (largely due to initial loading and caching), with subsequent queries stabilizing around 4 seconds.&lt;/p&gt;

&lt;p&gt;During the execution, no temporary files were generated, indicating that DuckDB’s optimizer avoids full-table sorting and instead uses an optimized algorithm: maintain a small set of size N and traverse the data only once. This strategy reduces the complexity from n*logn to n*logN, a substantial improvement when N is small.&lt;/p&gt;
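&lt;p&gt;The bounded-set strategy described above is what Python’s standard library exposes as heapq.nlargest; a small sketch of the same idea over a synthetic stream:&lt;/p&gt;

```python
import heapq
import random

# Synthetic one-pass stream standing in for a large table scan.
random.seed(42)
stream = (random.randint(0, 10**9) for _ in range(100_000))

# A size-100 heap is maintained during a single traversal:
# O(n * log N) instead of the O(n * log n) of a full sort.
top100 = heapq.nlargest(100, stream)
```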

&lt;p&gt;In this test, DuckDB performs excellently, demonstrating both intelligence and efficiency.&lt;/p&gt;

&lt;p&gt;Grouped TOPN: Shortcomings exposed&lt;br&gt;
Next, let’s look at a slightly more complex operation: grouped TOPN.&lt;/p&gt;

&lt;p&gt;The task is to group the table by the first letter of the id field and select the top 100 rows with the largest amount values in each group. The SQL is roughly as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select * from (
  select left(id,1) as gid, amount,
         row_number() over (partition by left(id,1) order by amount desc) rn
  from topn
) where rn &amp;lt;= 100;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In theory, this scenario is similar to regular TOPN. Since each group contains only tens of thousands of rows, maintaining a small top-N set per group and traversing the data once should avoid a full sort as well.&lt;/p&gt;

&lt;p&gt;However, in actual running, DuckDB seems to be “faithfully executing the SQL semantics”: fully sorting each group and then filtering using row_number. This strategy resulted in extensive read/write operations between memory and external storage, causing temporary files to swell to over 40GB during the execution. Even after 10 minutes, the query remained incomplete.&lt;/p&gt;

&lt;p&gt;In other words, although optimization is possible logically, DuckDB doesn’t actively recognize ‘grouped TOPN’ as a ‘group aggregation’ scenario and instead rigidly performs sorting. In this case, users are left to optimize the query themselves, either by rewriting the SQL to avoid sorting or by switching to a different tool.&lt;/p&gt;
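&lt;p&gt;The missed optimization can be sketched in plain Python: one bounded min-heap per group and a single traversal, with invented rows and the group key being the first letter of id, as in the query above:&lt;/p&gt;

```python
import heapq
from collections import defaultdict

# Invented rows: (id, amount); group key is the first letter of id.
rows = [("a1", 5), ("a2", 9), ("a3", 7), ("b1", 4), ("b2", 8)]
N = 2  # top N per group

heaps = defaultdict(list)  # group key -> min-heap of at most N amounts
for rid, amount in rows:
    h = heaps[rid[0]]
    if len(h) == N:
        if amount > h[0]:
            heapq.heapreplace(h, amount)  # evict the smallest of the top N
    else:
        heapq.heappush(h, amount)

# One traversal, no full sort: n*logN rather than n*logn.
top_per_group = {g: sorted(h, reverse=True) for g, h in heaps.items()}
```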

&lt;p&gt;For example, esProc SPL directly supports using groups in combination with the top function, treating TOPN as an aggregation operation. It obtains the result by traversing the data only once, and takes just 31 seconds to complete the same query, demonstrating a considerable performance advantage.&lt;/p&gt;

&lt;p&gt;SPL script – Entire-set TOPN:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F95vm3awacvsugnzi1vi0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F95vm3awacvsugnzi1vi0.png" alt="Image description" width="800" height="122"&gt;&lt;/a&gt;&lt;br&gt;
SPL script – Grouped TOPN:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqitphaqt19mbt73jwwd4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqitphaqt19mbt73jwwd4.png" alt="Image description" width="800" height="119"&gt;&lt;/a&gt;&lt;br&gt;
Summary of test results (Unit: seconds)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa8xza4jlk05kz11z5gvd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa8xza4jlk05kz11z5gvd.png" alt="Image description" width="800" height="123"&gt;&lt;/a&gt;&lt;br&gt;
COUNT DISTINCT: Unable to leverage ordered data&lt;br&gt;
Let’s now discuss another common operation: COUNT DISTINCT.&lt;/p&gt;

&lt;p&gt;In SQL, counting the number of distinct order IDs that meet a specific condition is typically expressed as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select count( distinct orderid ) from orderdetail where amount&amp;gt;50;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In DuckDB, the logic for this operation is simple and brute-force: loop through all data, collect the order IDs that meet the condition into a DISTINCT set, compare each new value with the current DISTINCT set to determine whether to add it, and finally count the size of the DISTINCT set.&lt;/p&gt;

&lt;p&gt;If the data is unordered, this approach is fine and consistent with the practices of most databases.&lt;/p&gt;

&lt;p&gt;However, in this test, we intentionally inserted the data in ascending order by orderid. In this case, an algorithm that leveraged this ordered characteristic could significantly improve performance: adjacent rows with the same orderid value can be skipped directly, eliminating the need to maintain a DISTINCT set.&lt;/p&gt;
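&lt;p&gt;The ordered-data shortcut can be sketched in a few lines of Python (invented rows, sorted by orderid): a value is new exactly when orderid changes, so no DISTINCT set is kept:&lt;/p&gt;

```python
def count_distinct_ordered(rows, cond):
    """COUNT(DISTINCT orderid) over rows already sorted by orderid.

    rows: iterable of (orderid, amount); cond: filter on amount.
    Tracks only the last counted orderid instead of a full set.
    """
    count, prev = 0, object()  # sentinel unequal to any orderid
    for orderid, amount in rows:
        if cond(amount) and orderid != prev:
            count += 1
            prev = orderid
    return count

# Invented data, ascending by orderid, as in the test setup.
rows = [(1, 60), (1, 30), (2, 70), (2, 80), (3, 10)]
n = count_distinct_ordered(rows, lambda a: a > 50)
```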

&lt;p&gt;Regrettably, DuckDB lacks the ability to actively recognize and leverage the ordered nature of the data. This is confirmed by the test results: it took over 200 seconds to process 2.4 billion rows using a single thread.&lt;/p&gt;

&lt;p&gt;In contrast, using esProc SPL and its icount@o function to perform COUNT DISTINCT specifically for ordered data, the test completed in approximately 50 seconds—about 4 times faster than DuckDB.&lt;/p&gt;

&lt;p&gt;For more complex operations, such as calculating two COUNT DISTINCT values with different conditions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select count( distinct case when amount&amp;gt;50 then orderid else null end ),
       count( distinct case when amount&amp;gt;100 then orderid else null end )
from orderdetail;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;DuckDB repeatedly builds DISTINCT sets, causing the execution time to skyrocket to over 800 seconds. Conversely, esProc SPL reuses the scan, so its execution time barely changes.&lt;/p&gt;
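&lt;p&gt;Reusing a single scan for several conditional COUNT DISTINCTs, as described above, can be sketched in Python by carrying one counter per condition through one pass over the (invented) ordered data:&lt;/p&gt;

```python
# Invented rows, ascending by orderid; one pass yields both counts.
rows = [(1, 60), (1, 120), (2, 70), (3, 150), (3, 20)]

conds = [lambda a: a > 50, lambda a: a > 100]
counts = [0] * len(conds)
prevs = [object() for _ in conds]  # last counted orderid per condition

for orderid, amount in rows:
    for i, cond in enumerate(conds):
        if cond(amount) and orderid != prevs[i]:
            counts[i] += 1
            prevs[i] = orderid
```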

&lt;p&gt;The root cause of this gap lies in SQL’s inability to express “ordered” information, leaving the optimizer no means to leverage such information. In contrast, languages like esProc SPL—which can explicitly express orderedness and utilize dedicated function calls—hold a natural advantage in such scenarios.&lt;/p&gt;

&lt;p&gt;SPL script - One count distinct query:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh6pm13wk7e05gk9egfti.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh6pm13wk7e05gk9egfti.png" alt="Image description" width="800" height="125"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;SPL script - Two count distinct queries:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F30wrd60fi43isujtah2w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F30wrd60fi43isujtah2w.png" alt="Image description" width="800" height="127"&gt;&lt;/a&gt;&lt;br&gt;
Summary of test results (Unit: seconds):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9y172spbc7xqbkgnuibh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9y172spbc7xqbkgnuibh.png" alt="Image description" width="800" height="120"&gt;&lt;/a&gt;&lt;br&gt;
As a lightweight analytical database, DuckDB boasts excellent usability and particularly excels at optimizing entire-set TOPN queries, delivering a great daily user experience.&lt;/p&gt;

&lt;p&gt;However, when faced with slightly more complex problems, such as grouped TOPN or COUNT DISTINCT on inherently ordered data, the “clumsiness” of SQL semantics is exposed: the optimizer cannot recognize the underlying logic, which leads to a significant performance gap.&lt;/p&gt;

&lt;p&gt;In these scenarios, using tools like esProc SPL, which can describe computational details and support the expression of ordered characteristics, can greatly improve performance.&lt;/p&gt;

&lt;p&gt;In conclusion, DuckDB is a good tool, provided the scenario is simple and gives the optimizer free rein. However, if the problem involves ordered characteristics or special aggregation logic, fast execution requires tools that can describe “smarter” computational methods.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/SPLWare/esProc" rel="noopener noreferrer"&gt;esProc SPL Github page.&lt;/a&gt;&lt;/p&gt;

</description>
      <category>duckdb</category>
      <category>devops</category>
      <category>programming</category>
      <category>esproc</category>
    </item>
    <item>
      <title>With SPL, It Seems We Don’t Need ORM Anymore</title>
      <dc:creator>Serina</dc:creator>
      <pubDate>Fri, 16 May 2025 07:10:27 +0000</pubDate>
      <link>https://dev.to/serina_8340/with-spl-it-seems-we-dont-need-orm-anymore-m31</link>
      <guid>https://dev.to/serina_8340/with-spl-it-seems-we-dont-need-orm-anymore-m31</guid>
      <description>&lt;p&gt;ORM technology indeed simplifies basic CRUD operations, but it also has many limitations when dealing with complex calculations. Hibernate’s HQL is notably inadequate, making it difficult to implement dynamic column calculations and multi-layer associations. While JOOQ improves flexibility through its DSL, its grouping calculations require multi-layer nesting, leading to more verbose code than native SQL.&lt;/p&gt;

&lt;p&gt;esProc SPL is like a “plug-in” for data computation! To write a multi-layer JOIN based on dynamic conditions with JOOQ, you could spend half an hour on chained calls in Java. Now you can implement this in just a few lines of SPL script, with syntax more intuitive than SQL. For example, calculating ‘top 3 sales by department’, which requires nesting even in SQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT dept, name FROM (
  SELECT dept, name, RANK() OVER (PARTITION BY dept ORDER BY sales DESC) as rank 
  FROM employee
) WHERE rank &amp;lt;=3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;can be implemented in SPL with just one line of code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;employee.groups(dept; top(-3, sales))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;SPL also supports dynamic data structures. It eliminates the need to predefine entity classes, and you can dynamically add fields in the script at any time: Orders.derive(Amount*0.1:tax, Amount+tax:total_amount). This is in contrast to JOOQ’s need for predefinition. Performing calculations in SPL is as simple as in SQL. For example, filtering conditions can be directly written as Orders.select(amount&amp;gt;1000 &amp;amp;&amp;amp; like(client,"*s*")), with field names used directly without object prefixes. In contrast, JOOQ’s approach, such as ORDERS.AMOUNT.gt(1000), pales in comparison.&lt;/p&gt;

&lt;p&gt;When performing cross-database analysis (e.g., MySQL user data + Elasticsearch logs), you used to have to write ETL scripts to extract and load the data. With SPL, you can process data across databases directly:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgqy3p8duagokox32r291.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgqy3p8duagokox32r291.png" alt="Image description" width="800" height="229"&gt;&lt;/a&gt;&lt;br&gt;
No need for extracting data, no need for creating intermediate tables. In this scenario, ORM is left helpless!&lt;/p&gt;

&lt;p&gt;ORM is more suitable for handling simple tasks, while complex calculations, cross-source operations, and dynamic logic can all be handed over to SPL. Whether it’s real-time risk control, dynamic reporting, or IoT stream processing, SPL handles them all with ease. Additionally, SPL’s cursor mechanism allows reading and computing simultaneously without exhausting memory, and its syntax is concise and flexible. Try processing Kafka stream data with JOOQ; the threading model in Java alone can drive you crazy. In contrast, SPL can perform real-time aggregation directly with kafka_open().kafka_poll@c().groups(hour(time);avg(value)). The gap is like the difference between a wagon and a Tesla.&lt;/p&gt;

&lt;p&gt;Similar to ORM, SPL is also developed purely in Java and can be seamlessly integrated into Java applications for deployment and distribution. Unlike ORM, however, using SPL for computations typically means writing the business logic in scripts, which Java then calls through JDBC.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Class.forName("com.esproc.jdbc.InternalDriver");
Connection con = DriverManager.getConnection("jdbc:esproc:local://");
Statement st = con.prepareCall("call SplScript()"); //SPL script name
st.execute();
ResultSet rs = st.getResultSet();

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This separates the calculation code from the Java code, which differs from ORM’s tight integration into Java applications, so ORM programmers may find it unfamiliar at first. In fact, SPL has full support for flow control, such as if and for, making business logic easier to implement than in Java.&lt;/p&gt;

&lt;p&gt;The advantage of independent SPL scripts is hot updating. SPL scripts are interpreted at runtime, so if the statistical logic changes, you can leisurely modify the SPL script and upload it directly to the server, and the change takes effect within seconds without even restarting the business system. With tools like JOOQ, by contrast, you have to recompile and redeploy after modifying the Java code, which makes for a poor experience.&lt;/p&gt;

&lt;p&gt;Essentially, SPL does not objectify data tables; rather, it directly manipulates the database using SPL. While this approach might be less convenient than MyBatis for simple single-table CRUD operations, SPL will absolutely rescue you from the ORM quagmire when facing the three major challenges: complex calculations, heterogeneous data, and frequently changing requirements. Programmers shouldn’t make things harder on themselves. Let ORM handle what it’s good at – object mapping – and leave computations to the specialized SPL. Isn’t that far more elegant than wrestling with SQL in Java?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/SPLWare/esProc" rel="noopener noreferrer"&gt;esProc SPL Github page.&lt;/a&gt;&lt;/p&gt;

</description>
      <category>programming</category>
      <category>devops</category>
      <category>opensource</category>
      <category>news</category>
    </item>
    <item>
      <title>Besides True Parallelism and Big Data, esProc SPL’s Conciseness Leaves Python in the Dust</title>
      <dc:creator>Serina</dc:creator>
      <pubDate>Thu, 15 May 2025 06:47:36 +0000</pubDate>
      <link>https://dev.to/serina_8340/besides-true-parallelism-and-big-data-esproc-spls-conciseness-leaves-python-in-the-dust-3dlo</link>
      <guid>https://dev.to/serina_8340/besides-true-parallelism-and-big-data-esproc-spls-conciseness-leaves-python-in-the-dust-3dlo</guid>
      <description>&lt;p&gt;Python, with its concise syntax and rich libraries, provides a significantly simpler way to data computation than Java, even surpassing SQL in convenience, which explains its immense popularity in the field of data analysis.&lt;/p&gt;

&lt;p&gt;However, the emergence of esProc SPL may disrupt this ranking.&lt;/p&gt;

&lt;p&gt;A typical example is big data processing. When memory cannot hold the full dataset, even a common aggregation or filtering operation takes roughly ten lines of Python code. Sorting or grouping is worse, with the code ballooning to hundreds of lines of functions, methods, and temporary file handling, a level of complexity beyond most data analysts. The root cause is that Python provides no native cursor: when data exceeds memory capacity, programmers must split the data themselves, producing verbose, messy code written with great effort.&lt;/p&gt;
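
&lt;p&gt;To make the contrast concrete, here is a minimal sketch of the manual, cursor-less style described above, using only the Python standard library (the file name and columns are hypothetical, and a tiny file stands in for one that would not fit in memory):&lt;/p&gt;

```python
import csv, os, tempfile

# build a tiny stand-in for "huge.txt"; the real file would not fit in memory
path = os.path.join(tempfile.mkdtemp(), "huge.txt")
with open(path, "w", newline="") as f:
    w = csv.writer(f, delimiter="\t")
    w.writerow(["area", "amount"])
    w.writerows([["east", 100], ["west", 250], ["east", 50]])

# manual "cursor": stream the file row by row and aggregate as we go
total = 0.0
with open(path, newline="") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        total += float(row["amount"])
print(total)  # 400.0
```

&lt;p&gt;Filtering works the same streaming way; sorting or grouping beyond memory would additionally require writing sorted runs to temporary files and merging them, which is where the hundreds of lines come from.&lt;/p&gt;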

&lt;p&gt;In contrast, SPL is much simpler. SPL has a built-in cursor data type, so aggregation and filtering can each be done in a single line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;file(“huge.txt”).cursor@t().total(sum(amount))
file(“huge.txt”).cursor@t().select(amount&amp;gt;=1000)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Even complex operations like sorting and grouping that are difficult to implement in Python can still be done with SPL in just one line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;file(“huge.txt”).cursor@t().sortx(area)
file(“huge.txt”).cursor@t().groups(area,amount)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;SPL cursors are not limited to loading data in batches; they also incorporate optimization techniques such as indexing and blocking to keep computation efficient even at large data volumes.&lt;/p&gt;

&lt;p&gt;Parallel computing is essential for big data processing. Python offers no true multi-thread parallelism (the GIL leaves multi-core CPUs largely idle), and writing multi-process programs instead is cumbersome. In contrast, SPL reduces parallel computing to a single option: just add @m to enable it automatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;file(“huge.txt”).cursor@tm().groups(area;sum(amount))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;SPL offers true parallel computing. Under the hood, it automatically splits the task by the number of CPU cores, lets each core independently process its portion of the data, and then merges the results, fully exploiting multi-core hardware. The entire process is transparent to users and requires no manual intervention from programmers.&lt;/p&gt;
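
&lt;p&gt;For comparison, the multi-process boilerplate Python needs for even a simple parallel sum looks roughly like this (a minimal sketch over hypothetical in-memory data; splitting, worker functions, and merging are all manual):&lt;/p&gt;

```python
from multiprocessing import Pool

def part_sum(chunk):
    # each worker process sums its own slice independently
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(100_000))
    # manual task splitting: one slice per worker
    slices = [data[i::4] for i in range(4)]
    # a process pool sidesteps the GIL, at the cost of all this ceremony
    with Pool(4) as pool:
        total = sum(pool.map(part_sum, slices))
    print(total)  # 4999950000
```

&lt;p&gt;Note the required top-level worker function and the __main__ guard; SPL's @m option hides all of this.&lt;/p&gt;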

&lt;p&gt;SPL also offers its own high-performance storage, employing mechanisms like binary format, compression, and columnar storage to significantly improve data read efficiency. Moreover, the storage can be flexibly designed based on computation goals, such as ordered storage by specified fields and appropriate redundancy. This ensures both high efficiency and flexibility.&lt;/p&gt;

&lt;p&gt;With SPL’s cursor and the rich set of computations it supports, data analysts can confidently tackle big data challenges. This alone puts SPL several steps ahead of Python; combined with SPL’s simple parallelism and its own high-performance storage, it leaves Python far behind.&lt;/p&gt;

&lt;p&gt;Even for non-big-data scenarios, SPL remains more concise than Python.&lt;br&gt;
For common, simple calculations, SPL and Python are similar, with no significant difference. For example, to get the top three salaries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;employee.top(-3;salary)  //SPL
employee.nlargest(3, 'salary')  //Python
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, once slightly more complex scenarios are involved, the difference becomes obvious. For example, to calculate the top three within each group, SPL retains its concise style:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;employee.groups(department;top(-3;salary))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;SPL treats topN as an aggregation operation, so it can be computed directly while grouping.&lt;/p&gt;

&lt;p&gt;In contrast, Python is considerably more complex:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;employee.groupby('department').apply(lambda group: group.nlargest(3, 'salary'))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not only does it require combining apply with lambda, but the style is also inconsistent with that of the whole-set calculation.&lt;/p&gt;

&lt;p&gt;Similar inconsistencies in syntax and style are common in Python, which adds extra memorization cost. For example, both login.groupby('user')['time'].min() and login.groupby('user').agg({'time': 'min'}) calculate the minimum value within each group, but they return completely different objects, and the subsequent supported operations also vary, requiring extra caution during use.&lt;/p&gt;
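
&lt;p&gt;A quick check of that inconsistency (the column values here are illustrative):&lt;/p&gt;

```python
import pandas as pd

login = pd.DataFrame({"user": ["a", "a", "b"],
                      "time": [3, 1, 2]})

s = login.groupby("user")["time"].min()         # returns a Series
d = login.groupby("user").agg({"time": "min"})  # returns a DataFrame

print(type(s).__name__, type(d).__name__)  # Series DataFrame
```

&lt;p&gt;The same logical calculation yields two different object types, so the operations you can chain afterwards differ as well.&lt;/p&gt;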

&lt;p&gt;Differences also arise in position-based calculations. For example, to extract the data for the 5th, 10th, 15th, … trading days of 2025, SPL directly uses # to represent position, making it simple and intuitive:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;stock.select(year(date)==2025).select(# % 5 == 0)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Python, however, requires a workaround: first filtering, then using reset_index() to re-number, and finally retrieving values based on position:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;stock_2025 = stock[stock['date'].dt.year == 2025].reset_index(drop=True)
selected_stock = stock_2025[stock_2025.index % 5 == 4]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The reset_index(drop=True) step alone is easy to forget, and omitting it silently selects the wrong rows because the original index is kept. SPL, on the other hand, natively supports sequence numbers, so there’s no need to worry about these details.&lt;/p&gt;

&lt;p&gt;In addition, when calculating growth rates or moving averages, it’s often necessary to reference adjacent records. SPL directly uses [-1] to retrieve the previous record, making the code natural. For example, calculate the maximum monthly sales growth:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sales.(if(#&amp;gt;1,~-~[-1],0)).max()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Python requires shift() or rolling() to generate a new Series before performing the calculation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sales.rolling(window=2).apply(lambda x: x[1] - x[0], raw=True).max()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To calculate if the three-month moving average is increasing, SPL remains straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sales.(~[-2,0].pselect(~&amp;lt;=~[-1])==null)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Python is much more cumbersome, requiring both rolling and lambda:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sales.rolling(window=3).apply(lambda x: (x[0] &amp;lt; x[1] and x[1] &amp;lt; x[2]), raw=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not only is the code complex, but the fixed rolling window is also inflexible.&lt;/p&gt;

&lt;p&gt;Both Python and SPL support lambda syntax, but SPL is simpler and more direct. For example, to label managers with salaries over 5000, the Python code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;employee['flag'] = employee.apply(lambda row: 'yes' if row['position'] == 'manager' and row['salary'] &amp;gt; 5000 else 'no', axis=1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;SPL doesn’t require the lambda keyword; its lambda syntax is implicit, enabling direct coding:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;employee.derive(if(position == "manager" &amp;amp;&amp;amp; salary &amp;gt; 5000, "yes", "no"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Code simplicity is sometimes even more important than performance. In this aspect, SPL leaves Python even further behind.&lt;/p&gt;

&lt;p&gt;With its cursor calculations, true parallel processing, and proprietary high-performance storage, SPL effortlessly handles massive data without struggling with complex code. SPL truly deserves the praise “It’s awesome!”&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Comparison of esProc SPL and DuckDB in Data Storage</title>
      <dc:creator>Serina</dc:creator>
      <pubDate>Mon, 12 May 2025 08:42:10 +0000</pubDate>
      <link>https://dev.to/serina_8340/comparison-of-esproc-spl-and-duckdb-in-data-storage-k22</link>
      <guid>https://dev.to/serina_8340/comparison-of-esproc-spl-and-duckdb-in-data-storage-k22</guid>
      <description>&lt;p&gt;Data storage is essentially about striking a balance between flexibility, performance, and ease of use. Both DuckDB and esProc SPL offer their own binary storage formats, but they differ significantly in data organization.&lt;/p&gt;

&lt;p&gt;DuckDB still employs a conventional database mechanism: data organization is logically holistic, data under a specific subject forms a database, and a set of metadata describes the structure of and relationships among the data in the database. A database is logically a whole, with clear distinctions between data inside and outside it and explicit import and export actions, a property often referred to as closedness. Closedness provides better manageability, but it also implies less freedom in data organization. While DuckDB can also process data outside the database, this falls under its multi-data-source functionality and often still requires mapping to tables.&lt;/p&gt;

&lt;p&gt;esProc is completely different; it is not a database. Its data organization is logically fragmented. It lacks the concept of a subject, and consequently, has no metadata. Of course, it makes no distinction between data inside and outside the database, nor does it involve import or export actions. Any data can participate in calculations as long as it’s accessible. The only difference lies in the access performance from different data sources. esProc designs a specialized binary data storage format for high performance (columnar storage, compression, etc.). However, from a logical standpoint, this storage format is treated the same as a text file or data extracted from other databases. esProc’s storage approach is strongly characterized by its openness, and data organization is more flexible and unconstrained, but this comes at the cost of data integrity and manageability.&lt;/p&gt;

&lt;p&gt;Now, let’s take a look at the binary format differences between DuckDB and esProc SPL.&lt;/p&gt;

&lt;p&gt;DuckDB’s .duckdb files adopt pure columnar storage, where all data is compressed in blocks by column. Columnar storage is well-suited for analytical queries (such as calculating total sales), as it only needs to read a single column of data, and the compressed data results in small file sizes and fast read speeds.&lt;/p&gt;

&lt;p&gt;esProc offers two storage formats to address different scenarios:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;btx files (row-based binary format): Data is stored by row, with a structure similar to CSV, but saved in binary format for faster read/write speed. This format is suitable for small-scale data or temporary storage (such as intermediate calculation results), as it requires no pre-defined structure and can be written to and used immediately.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;ctx files (columnar composite tables): Data is stored by column and blocked in order based on primary keys (such as time or ID). This format is suitable for large-scale data analysis, allowing it to directly skip blocks that do not meet the conditions, reducing IO consumption and significantly improving speed. The ctx file also offers the option of row-based storage or no compression, allowing for selection based on different computational scenarios.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This fit-for-purpose design enables esProc to strike a balance between convenience and performance: btx is convenient for small files, while ctx is efficient for big data.&lt;/p&gt;

&lt;p&gt;During big data analysis, esProc’s ctx, in addition to possessing the advantages of DuckDB’s columnar storage, also specifically supports ordered computation (as opposed to SQL’s unordered nature). ctx can also be designed according to the computation. For example, storing data in a specified order allows for the application of order-related algorithms like binary search to improve computational performance. Redundancy can also be implemented – after all, it’s just a matter of having an additional copy of a file.&lt;/p&gt;

&lt;p&gt;The flexibility of esProc storage is also manifested in its ability to allow the storage of different types of data within the same field. This greatly increases flexibility but sacrifices performance, requiring trade-offs based on specific needs.&lt;/p&gt;

&lt;p&gt;Moreover, the value of an esProc field can be another record or table, thereby better supporting multi-layered data storage and usage. DuckDB now also offers good support for JSON data, making the two systems comparable in this regard.&lt;/p&gt;

&lt;p&gt;esProc SPL and DuckDB each have their own distinctive characteristics in data management and storage. DuckDB employs a conventional database model, using metadata to uniformly manage structured data. Its closed mechanism provides strong manageability but limits flexibility. In contrast, esProc adopts fragmented data organization, with no metadata constraints, supporting mixed computations of data from any source, but requiring developers to manage the data themselves. In terms of storage, DuckDB only supports a pure columnar format, while esProc offers a dual mode of btx row-based storage and ctx columnar composite table, supporting multiple data types within the same field and multi-layer nested data, allowing storage strategies to be flexibly chosen according to the scenario.&lt;/p&gt;

</description>
      <category>duckdb</category>
      <category>programming</category>
      <category>devops</category>
      <category>discuss</category>
    </item>
    <item>
      <title>esProc SPL vs DuckDB: Which is more Lightweight for In-Application Computation</title>
      <dc:creator>Serina</dc:creator>
      <pubDate>Fri, 09 May 2025 07:58:03 +0000</pubDate>
      <link>https://dev.to/serina_8340/esproc-spl-vs-duckdb-which-is-more-lightweight-for-in-application-computation-25m2</link>
      <guid>https://dev.to/serina_8340/esproc-spl-vs-duckdb-which-is-more-lightweight-for-in-application-computation-25m2</guid>
      <description>&lt;p&gt;Both DuckDB and esProc SPL can be embedded in applications as computing engines. This article will compare which is more lightweight. The term “lightweight” refers not only to size, but also to simplicity in development and maintenance.&lt;/p&gt;

&lt;p&gt;DuckDB is indeed convenient to use; you can directly import it in Python and get started, and it integrates smoothly with the Java ecosystem via JDBC. esProc primarily targets the Java ecosystem – its 15MB jar can simply be dropped into a project and invoked directly, while non-Java programs call it through an HTTP interface. Both installation packages are quite small, exhibiting lightweight characteristics.&lt;/p&gt;

&lt;p&gt;esProc scripts are interpreted and support hot deployment, enabling computational logic modifications without service restart. In this aspect, esProc is on par with DuckDB.&lt;/p&gt;

&lt;p&gt;Differences in their lightweight nature are particularly evident in cross-data-source mixed computation scenarios. Although DuckDB supports common file formats such as CSV and Parquet, as well as some databases like MySQL, each data source requires a deeply customized connector developed separately. Consequently, mainstream relational databases like Oracle and SQL Server remain unsupported, and NoSQL databases like MongoDB are even harder to support. When users need to perform cross-source computations between MySQL and Oracle, the lack of official connectors typically forces them to fall back on Python to move the data. Such “glue code” not only complicates the technology stack but, more critically, burdens the system architecture, violating the lightweight principle.&lt;/p&gt;

&lt;p&gt;In contrast, esProc SPL employs a “native interface + light encapsulation” approach, achieving natural compatibility with all relational databases through JDBC and enabling access to unstructured data sources, such as MongoDB and Kafka, with only a shallow encapsulation based on native interfaces. This standardized extension mechanism enables support for dozens of data source types, covering all scenarios like files, databases, API interfaces, and message queues. Moreover, users can rapidly extend through reserved extension interfaces, truly realizing a lightweight experience of “connect and compute immediately”.&lt;/p&gt;

&lt;p&gt;In addition, DuckDB has a critical flaw when handling complex computations: SQL inherently lacks flow control capabilities. Basic functionalities like for/if are unavoidable for even moderately complex business logic. However, since SQL cannot handle these operations, and DuckDB provides no supplemental mechanisms like stored procedures, users have to resort to external languages with flow control capabilities, such as Python, to brute-force solutions when encountering such requirements. This not only results in fragmented and verbose code but also necessitates maintaining two distinct technology stacks. It’s akin to building a crane just to move bricks—far from being lightweight and agile.&lt;/p&gt;
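
&lt;p&gt;The hybrid pattern described above — SQL doing the queries while a host language supplies the loops and branches — looks roughly like this (sketched with the standard library’s sqlite3 standing in for DuckDB; the table and threshold are made up):&lt;/p&gt;

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales(region TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [("east", 120), ("west", 80), ("east", 40)])

# flow control lives in Python, not SQL: branch per region on the query result
labels = {}
for region, total in con.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"):
    labels[region] = "high" if total > 100 else "low"
print(labels)
```

&lt;p&gt;The query logic and the branching logic sit in two different languages with two different debugging experiences, which is exactly the split being described.&lt;/p&gt;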

&lt;p&gt;esProc’s SPL directly integrates flow control into the data processing language, including loops, conditionals, and exception handling. It can handle familiar SQL-style queries while also replacing Python for flow control, making it a complete language solution. Programmers no longer need to juggle SQL and Python, the technology stack is simplified, and the whole setup is more lightweight.&lt;/p&gt;

&lt;p&gt;Being lightweight isn’t just about having the smallest installation package; it’s like when moving—you can’t just look at the suitcase size, you need to see if one suitcase can hold all your belongings. DuckDB may appear small and nimble, but when it comes to scenarios requiring cross-data source association computations or writing business logic with loops and conditionals, it still needs external assistance. It’s like a rice cooker that promises one-touch cooking, but if you want to steam buns, you still need to connect an external steamer.&lt;/p&gt;

&lt;p&gt;The cleverness of esProc lies in its ability to handle complex tasks on its own. Whether it’s a database or an API, as long as it can connect, it can compute—even perform mixed computations. Whether it’s simple statistics or complex rules, a single set of script syntax handles it all. It’s like a transforming toolbox—it looks the size of a screwdriver, but when opened, it reveals wrenches, pliers, and drills. Most importantly, there’s no need to search everywhere for accessories, which is what truly makes it lightweight.&lt;/p&gt;

</description>
      <category>duckdb</category>
      <category>programming</category>
      <category>devops</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Comparison of esProc SPL and DuckDB in Multi-Data Source Capabilities</title>
      <dc:creator>Serina</dc:creator>
      <pubDate>Wed, 07 May 2025 05:57:07 +0000</pubDate>
      <link>https://dev.to/serina_8340/comparison-of-esproc-spl-and-duckdb-in-multi-data-source-capabilities-32op</link>
      <guid>https://dev.to/serina_8340/comparison-of-esproc-spl-and-duckdb-in-multi-data-source-capabilities-32op</guid>
      <description>&lt;p&gt;Both DuckDB and esProc SPL support diverse data sources, and this article will compare the differences between them.&lt;/p&gt;

&lt;p&gt;Types of supported data sources&lt;/p&gt;

&lt;p&gt;DuckDB supports a wide range of data source types, covering common file formats (such as CSV, Parquet, JSON, Excel), cloud storage (such as AWS S3, Azure Blob Storage), and relational databases (such as MySQL, PostgreSQL, SQLite). It can also access web data via https. Additionally, DuckDB supports some emerging data lake formats (such as Delta Lake, Iceberg).&lt;/p&gt;

&lt;p&gt;esProc supports a wider range of data source types, covering more local files, databases, and remote data sources. Here are some data sources supported by SPL:&lt;/p&gt;

&lt;p&gt;• Local files: CSV, Excel, JSON, XML, Parquet, ORC, etc.&lt;/p&gt;

&lt;p&gt;• All relational databases: MySQL, PostgreSQL, Oracle, SQL Server, etc. (via JDBC)&lt;/p&gt;

&lt;p&gt;• NoSQL databases: MongoDB, Cassandra, Redis, etc.&lt;/p&gt;

&lt;p&gt;• Cloud storage: HDFS, AWS S3, GCS, etc.&lt;/p&gt;

&lt;p&gt;• Remote data sources: RESTful API, WebService, FTP/SFTP, etc.&lt;/p&gt;

&lt;p&gt;• Others: Kafka, ElasticSearch, etc.&lt;/p&gt;

&lt;p&gt;In terms of the number of data sources, esProc supports more types of data sources, especially in non-relational databases (such as MongoDB, Redis) and support for Kafka, ES, etc., where esProc has a significant advantage.&lt;/p&gt;

&lt;p&gt;From a deeper perspective, DuckDB’s data source access relies on dedicated connectors that must be developed separately for each source, which is complex work, and it is very difficult for users to extend the open-source code themselves. As a result, the number of available connectors is limited, and even the most common relational databases are not fully covered. Currently, DuckDB supports MySQL, PostgreSQL, and SQLite, but not other common databases such as Oracle and SQL Server, which makes mixed queries across multiple data sources difficult. For example, when performing mixed calculations between MySQL and Oracle, if there is no suitable connector, users can only fall back on Python as a workaround.&lt;/p&gt;

&lt;p&gt;esProc utilizes the native interface of data sources. All relational databases can be connected via JDBC, which is naturally supported. Other data sources such as MongoDB and Kafka can also be simply encapsulated based on the native interface, resulting in high development speed and thus providing a richer connector library. Users can easily add their own connectors by implementing the reserved extension interface.&lt;/p&gt;

&lt;p&gt;With this rich support and extension capability, it is very easy to implement mixed calculations across multiple data sources in esProc: MySQL+Oracle can be computed directly, and extending to unsupported data sources is also simple.&lt;/p&gt;

&lt;p&gt;There is no obvious superiority or inferiority between DuckDB’s dedicated connector and esProc’s simple encapsulation using the native interface. The former can provide deeper support and optimization, achieving a certain level of transparency; the latter is more flexible, supporting a wide range of data sources and offering flexible extension. The specific preference depends on actual needs.&lt;/p&gt;

&lt;p&gt;Data type processing&lt;/p&gt;

&lt;p&gt;DuckDB has very mature support for CSV and Parquet files, enabling efficient reading and querying of these files. For example, DuckDB can directly load CSV files and execute SQL queries, making the operation straightforward and simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT * FROM 'data.csv' WHERE column_a &amp;gt; 100;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;esProc also makes it simple to process CSV files using SPL syntax:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;T("data.csv").select(column_a &amp;gt; 100)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In addition to SPL syntax, esProc also provides SQL syntax:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$SELECT * FROM data.csv WHERE column_a &amp;gt; 100;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use SQL for simple scenarios and SPL for complex ones. They can also be used in combination.&lt;/p&gt;

&lt;p&gt;Due to the limitations of SQL, many complex calculations are not easy to implement. DuckDB integrates well with Python, allowing complex requirements to be met with Python’s assistance; however, writing and debugging then span two different systems, which creates a strong sense of fragmentation. esProc provides SQL plus the more powerful SPL: anything SQL cannot handle can be implemented in SPL, often more simply, and performing all calculations within a single system keeps the work coherent.&lt;/p&gt;

&lt;p&gt;Another significant difference lies in JSON processing. esProc can better handle complex calculations and scenarios that require preserving JSON’s hierarchical structure. When performing multi-level structure calculations, SPL can directly access sublevel data using dots (.), which is very intuitive. There is no need to rely on UNNEST to unfold layer by layer or nested queries to preserve the integrity of the data structure, as in DuckDB. The support for multi-level data calculations is very thorough.&lt;/p&gt;

&lt;p&gt;Multi-layer and multi-condition data filtering in SPL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;json(file("orders.json").read()). select(order_details.product.category=="Electronics" &amp;amp;&amp;amp; order_details.sum(price*quantity)&amp;gt;200)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Compared to DuckDB, esProc supports a richer variety of data sources and is easier to extend, enabling mixed calculations across most data sources. In terms of data processing, esProc not only supports SQL syntax but also SPL, which can handle more complex scenarios within a single system, eliminating the sense of split between SQL and Python systems. Especially for processing multi-layered JSON data, SPL is simpler and more intuitive.&lt;/p&gt;

</description>
      <category>duckdb</category>
      <category>discuss</category>
      <category>devops</category>
      <category>esproc</category>
    </item>
    <item>
      <title>SPL Operates Multi-layer JSON Data Much More Conveniently than DuckDB</title>
      <dc:creator>Serina</dc:creator>
      <pubDate>Tue, 29 Apr 2025 06:29:35 +0000</pubDate>
      <link>https://dev.to/serina_8340/spl-operates-multi-layer-json-data-much-more-conveniently-than-duckdb-5536</link>
      <guid>https://dev.to/serina_8340/spl-operates-multi-layer-json-data-much-more-conveniently-than-duckdb-5536</guid>
      <description>&lt;p&gt;esProc SPL is much more convenient than DuckDB in operating multi-layer JSON data, particularly when preserving JSON hierarchy and performing complex calculations are required.&lt;/p&gt;

&lt;p&gt;DuckDB’s ability to operate JSON is quite good. The read_json_auto() function can directly parse JSON to a table structure, allowing you to operate on multi-layer data directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT order_id, order_date, json_extract(customer, '$.name') AS cusName,json_extract(customer, '$.city') AS cusCity FROM read_json_auto('orders.json')

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;SPL is simpler for such basic operations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;json(file("orders.json").read()).new(order_id, order_date,customer.name:cusname,customer.city:cuscity)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using the dot (.) to directly access sublevel data is very intuitive.&lt;/p&gt;

&lt;p&gt;For slightly complex calculations, such as determining the sales amount for the Electronics category in the order data, DuckDB requires expanding order_details, filtering for category='Electronics', and finally calculating SUM(price*quantity).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT sum(od.quantity*od.price) amount
FROM read_json_auto('orders.json') AS o,
LATERAL UNNEST(o.order_details) AS t(od),
LATERAL UNNEST([od.product]) AS u(p)
WHERE p.category = 'Electronics'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To implement this calculation, SQL needs to associate the sub table with the primary table using an inner join for filtering. This is a bit roundabout, but not too complicated.&lt;/p&gt;

&lt;p&gt;SPL, in contrast, can directly treat the sub table as a set for calculations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;json(file("order3.json").read()).conj(order_details).select(product.category=="Electronics").sum(quantity*price)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is just a single statement with no joins and simpler logic, so the advantage over DuckDB is clear.&lt;/p&gt;
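
&lt;p&gt;For reference, the same flatten-filter-sum expressed in plain Python over a hypothetical nested orders structure shows the set-oriented logic SPL applies directly (stdlib json only):&lt;/p&gt;

```python
import json

# hypothetical orders.json content, matching the structure the article assumes
orders = json.loads("""
[
  {"order_id": 1, "order_details": [
      {"product": {"category": "Electronics"}, "price": 100, "quantity": 2},
      {"product": {"category": "Books"}, "price": 20, "quantity": 1}
  ]},
  {"order_id": 2, "order_details": [
      {"product": {"category": "Electronics"}, "price": 50, "quantity": 1}
  ]}
]
""")

# flatten all detail rows, keep Electronics, then sum price*quantity
amount = sum(
    d["price"] * d["quantity"]
    for o in orders
    for d in o["order_details"]
    if d["product"]["category"] == "Electronics"
)
print(amount)  # 250
```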

&lt;p&gt;In more complex scenarios, such as filtering order details for the ‘Electronics’ category and excluding orders with amounts below $200, DuckDB SQL becomes difficult to write. You first need to expand the order_details and aggregate order amounts, then filter for eligible orders based on the aggregation result, and finally resort to nested queries or CTEs in order to preserve the integrity of the data structure. As the SQL becomes lengthy, it becomes less user-friendly for debugging. Using Lambda syntax can be simpler, but it is quite different from traditional SQL form.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    o.order_id, 
    LIST_FILTER(o.order_details, x -&amp;gt; x.product.category = 'Electronics') AS order_details
FROM read_json_auto('orders.json') AS o
WHERE 
    ARRAY_LENGTH(LIST_FILTER(o.order_details, x -&amp;gt; x.product.category = 'Electronics')) &amp;gt; 0
    AND list_sum(
        list_transform(
            LIST_FILTER(o.order_details, x -&amp;gt; x.product.category = 'Electronics'),
            x -&amp;gt; x.price * x.quantity
        )
    ) &amp;gt; 200;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The SPL code is still natural:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;=A2.select(order_details.select@1(product.category=="Electronics") &amp;amp;&amp;amp; order_details.sum(price*quantity)&amp;gt;200)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Still a single statement: simply treating the sub table as a set is enough. There are no complex subqueries or Lambda syntax; direct referencing, filtering, and aggregation work regardless of the number of layers. Moreover, SPL preserves the multi-layer structure of the JSON without requiring SQL constructs like GROUP BY and LATERAL UNNEST.&lt;/p&gt;
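&lt;p&gt;As a rough plain-Python sketch of the same filtering logic (the field names, data layout, and the $200 threshold follow the scenario above; everything else is assumed for illustration):&lt;/p&gt;

```python
def filter_orders(orders):
    """Keep orders whose Electronics detail lines total more than 200,
    preserving each order but retaining only its Electronics lines.
    (A sketch of the filtering described above; field names are assumed.)"""
    result = []
    for o in orders:
        elec = [d for d in o["order_details"]
                if d["product"]["category"] == "Electronics"]
        amount = sum(d["price"] * d["quantity"] for d in elec)
        if elec and amount > 200:
            result.append({**o, "order_details": elec})
    return result
```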

&lt;p&gt;While DuckDB does handle JSON well, it still requires UNNEST and similar SQL constructs, which become cumbersome as nesting deepens. In contrast, SPL operates directly on multi-layer JSON structures, making filtering and aggregation convenient while preserving the original data hierarchy. It is clearly better suited to complex JSON computation scenarios.&lt;/p&gt;

&lt;p&gt;It's free: &lt;a href="https://www.esproc.com/download-esproc/" rel="noopener noreferrer"&gt;download esProc&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>duckdb</category>
      <category>programming</category>
      <category>coding</category>
      <category>esproc</category>
    </item>
    <item>
      <title>Local Data Analysis: DuckDB or esProc SPL?</title>
      <dc:creator>Serina</dc:creator>
      <pubDate>Mon, 28 Apr 2025 03:42:05 +0000</pubDate>
      <link>https://dev.to/serina_8340/local-data-analysis-duckdb-or-esproc-spl-4462</link>
      <guid>https://dev.to/serina_8340/local-data-analysis-duckdb-or-esproc-spl-4462</guid>
<description>&lt;p&gt;DuckDB can directly read common files such as CSV, Parquet, and JSON. With just a single SQL statement, it can load a file and query it, such as SELECT * FROM 'data.csv' WHERE price &amp;gt; 100. For users accustomed to SQL, this “file-as-table” experience is very friendly, enabling rapid implementation of simple filtering and aggregation calculations.&lt;/p&gt;
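&lt;p&gt;For comparison, the same “file-as-table” filter can be imitated with Python's standard csv module (a sketch only; the file contents and the price column are assumptions):&lt;/p&gt;

```python
import csv
import io

# A small in-memory stand-in for data.csv (assumed columns).
data_csv = "item,price\napple,50\nlaptop,1200\nphone,800\n"

def rows_over(text, threshold):
    """Return rows whose price exceeds threshold,
    like SELECT * FROM 'data.csv' WHERE price > threshold."""
    reader = csv.DictReader(io.StringIO(text))
    return [row for row in reader if float(row["price"]) > threshold]
```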

&lt;p&gt;However, when faced with complex scenarios—such as cross-file iterative calculations, processing unstructured logs, or implementing dynamic conditional branches—relying solely on SQL often falls short. In such cases, you have to resort to Python for writing loops or UDF. This hybrid programming approach leads to a noticeable sense of split: You need to constantly switch between SQL’s logical thinking and Python’s procedural thinking when programming, and separately handle SQL snippets and Python variables when debugging, which is quite cumbersome.&lt;/p&gt;

&lt;p&gt;esProc SPL is a significantly better alternative. It supports over 20 file formats, including CSV and Excel, and can also parse semi-structured data such as RESTful and NoSQL, providing broader data source support than DuckDB. Moreover, esProc provides both SQL and SPL syntax, allowing you to use SQL for simple queries and SPL for complex tasks. All tasks can be handled within the same interface.&lt;/p&gt;

&lt;p&gt;For example, when calculating the number of consecutive rising days for a stock, SPL's ordered grouping makes the task easy, and it can be combined with SQL to accomplish it:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6ux85jgn2ir4cz8xwgi7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6ux85jgn2ir4cz8xwgi7.png" alt="Image description" width="800" height="161"&gt;&lt;/a&gt;&lt;br&gt;
This concise syntax based on sequence numbers is more intuitive and easier to understand than SQL’s multi-layer nested subqueries.&lt;/p&gt;
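&lt;p&gt;The underlying calculation can be sketched in plain Python (a minimal illustration of counting consecutive rises, not the SPL implementation):&lt;/p&gt;

```python
def max_rising_streak(prices):
    """Longest run of consecutive day-over-day rises in a price series."""
    best = cur = 0
    for prev, nxt in zip(prices, prices[1:]):
        cur = cur + 1 if nxt > prev else 0  # extend or reset the current run
        best = max(best, cur)
    return best
```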

&lt;p&gt;With esProc SPL’s IDE, developers can view the results of each step in real time, offering higher interactivity and far better debugging efficiency than DuckDB, and it is also more intuitive than Python’s IDEs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzv0kvmkzjc6kur075djz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzv0kvmkzjc6kur075djz.png" alt="Image description" width="800" height="484"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F657ilcgartjy5sf59u8n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F657ilcgartjy5sf59u8n.png" alt="Image description" width="800" height="483"&gt;&lt;/a&gt;&lt;br&gt;
With the support for diverse data sources and procedural computation, SPL fully covers the entire workflow from data loading to result output. For example, in e-commerce user behavior analysis, reading JSON logs, associating CSV product tables, calculating page dwell time, and generating funnel results can all be done with just a single script, without the need to switch interfaces.&lt;/p&gt;

&lt;p&gt;If the amount of data is larger, SPL can use cursors and parallel processing to handle it. Tests reveal that SPL’s multi-thread segmented loading is more than 3 times faster than DuckDB when grouping and aggregating a 100GB CSV file.&lt;/p&gt;
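&lt;p&gt;The segmented-parallel idea can be imitated with Python's standard library (a sketch over assumed in-memory rows, not esProc's actual mechanism): split the rows into segments, aggregate each segment in a worker, then merge the partial results.&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

def seg_sum(rows):
    # Aggregate one segment: per-group running totals.
    acc = {}
    for group, amount in rows:
        acc[group] = acc.get(group, 0) + amount
    return acc

def parallel_group_sum(rows, workers=4):
    """Split rows into segments, aggregate them in parallel, merge the parts."""
    step = max(1, len(rows) // workers)
    segments = [rows[i:i + step] for i in range(0, len(rows), step)]
    total = {}
    with ThreadPoolExecutor(max_workers=workers) as ex:
        for part in ex.map(seg_sum, segments):
            for k, v in part.items():
                total[k] = total.get(k, 0) + v
    return total
```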

&lt;p&gt;DuckDB is suitable for simple SQL file query scenarios, while esProc SPL is more appropriate for complex calculations or big data processing. It is a single system that combines SQL's simplicity, procedural computing power beyond Python's, and a more interactive IDE: truly an all-in-one tool.&lt;/p&gt;

</description>
      <category>duckdb</category>
      <category>programming</category>
      <category>devops</category>
      <category>esproc</category>
    </item>
    <item>
      <title>esProc SPL: Combining the Strengths of DuckDB and Python</title>
      <dc:creator>Serina</dc:creator>
      <pubDate>Wed, 19 Mar 2025 08:30:46 +0000</pubDate>
      <link>https://dev.to/serina_8340/esproc-spl-combining-the-strengths-of-duckdb-and-python-26cd</link>
      <guid>https://dev.to/serina_8340/esproc-spl-combining-the-strengths-of-duckdb-and-python-26cd</guid>
<description>&lt;p&gt;That DuckDB is gaining more and more attention is no accident. As a rising star in desktop analytics, it handles SQL with ease: CTE recursive queries, multi-layered window functions, and complex JOINs are all effortless. It can even run aggregations on datasets with hundreds of millions of rows: just issue a SELECT AVG(revenue) FROM terabyte_table GROUP BY region, and it’s done. Such capability is truly commendable. But even the strongest warriors have their Achilles’ heel, and DuckDB’s shortcomings lie in those areas where SQL falls short but business demands persist.&lt;/p&gt;

&lt;p&gt;Encountering “unconventional requirements”&lt;/p&gt;

&lt;p&gt;For example, the boss wants to calculate “the moving average of each customer’s last 3 order amounts, limited to weekend orders.” Want to do this in SQL? Be ready for a three-layer nested query: first, filter weekend orders; then, group and sort by customer; finally, calculate the moving average using a window function.&lt;/p&gt;
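&lt;p&gt;As a plain-Python sketch of that requirement (the input fields, weekend rule, and sample values are assumptions for illustration): filter to weekend orders per customer, then average the last 3 amounts.&lt;/p&gt;

```python
from datetime import date

def weekend_moving_avg(orders):
    """Per customer, average of the last 3 weekend-order amounts.
    `orders` is assumed to be (customer, order_date, amount) tuples."""
    weekend = {}
    for cust, d, amount in sorted(orders):
        if d.weekday() >= 5:  # Saturday = 5, Sunday = 6
            weekend.setdefault(cust, []).append(amount)
    return {c: sum(v[-3:]) / len(v[-3:]) for c, v in weekend.items()}
```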

&lt;p&gt;Of course, you can write it with some effort, but this isn’t a one-time job. Similar requirements are common, and putting in so much effort every time isn’t practical in the long run.&lt;/p&gt;

&lt;p&gt;And there’s an even more frustrating issue: flow control. Want to create a loop in SQL or dynamically adjust calculation logic based on conditions? For example, “If today’s sales increase exceeds 10%, then run promotional calculations; otherwise, skip.” Unfortunately, SQL’s IF and LOOP statements are too limited to be practical. The convoluted code you manage to write becomes incomprehensible even to you after just three days.&lt;/p&gt;

&lt;p&gt;Resorting to Python?&lt;/p&gt;

&lt;p&gt;Of course, DuckDB has a backup plan: “For tasks that SQL struggles with, hand them off to Python!” There’s nothing wrong with this plan. DuckDB’s Python API is indeed easy to use; a simple conn.sql().to_df() lets you seamlessly switch to pandas.&lt;/p&gt;

&lt;p&gt;For example, to calculate “double reward points when a client’s consecutive purchase days exceed 5,” SQL’s window functions can calculate consecutive dates, but handling dynamic conditions is cumbersome. In comparison, Python is more straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import duckdb
import pandas as pd

df = duckdb.sql("SELECT client, order_date FROM orders ORDER BY client, order_date").df()
df['order_date'] = pd.to_datetime(df['order_date'])
results = []
for client, group in df.groupby("client"):
    streak = 1
    prev_date = None
    for date in group["order_date"]:
        if prev_date is not None and (date - prev_date).days == 1:
            streak += 1
        else:
            streak = 1
        if streak == 5:  # record each client once, when the streak first reaches 5
            results.append({"client": client, "bonus": "Doubling"})
        prev_date = date
print(results)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code works, but the data must be exported from DuckDB to a DataFrame. A logically coherent business requirement has to be split into SQL preprocessing and Python post-processing, forcing you to keep switching between the two. This approach is not only awkward to develop and debug but also wastes significant time on data transfer, driving you crazy.&lt;/p&gt;

&lt;p&gt;Moreover, Python doesn’t excel at everything: it lacks native big-data processing and parallel computing support for large datasets, making it inferior to DuckDB in this regard.&lt;/p&gt;

&lt;p&gt;SPL is a good solution&lt;/p&gt;

&lt;p&gt;With esProc SPL, these problems vanish. It’s like a combined-evolved version of ‘DuckDB + Python’, handling everything in one system, making it simpler and more efficient.&lt;/p&gt;

&lt;p&gt;Like DuckDB, esProc SPL offers SQL support. SQL can be run directly on common files such as CSV and Excel. For example, query sales data:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$SELECT region, SUM(amount) FROM sales.csv GROUP BY region
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Such lightweight operations are a breeze for esProc. Moreover, it offers a binary file format with good compression ratio and fast data I/O – capabilities on par with DuckDB.&lt;/p&gt;

&lt;p&gt;Furthermore, for complex requirements, esProc provides native SPL syntax as a backup. For example, for the consecutive purchase reward task mentioned earlier, written in SPL:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbqgoqd75zjxawzmawl61.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbqgoqd75zjxawzmawl61.png" alt="Image description" width="800" height="170"&gt;&lt;/a&gt;&lt;br&gt;
This code not only identifies eligible clients for double rewards but also retrieves their purchase details (A3). Doesn’t it feel like you’re getting ahead of the business needs?&lt;/p&gt;

&lt;p&gt;SPL also provides far better JSON support compared to DuckDB, allowing you to directly navigate through nested data with simple dot notation: example.contact(1).metadata.verified, which is much cleaner than DuckDB’s json_extract(contact[1], '$.metadata.verified').&lt;/p&gt;
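&lt;p&gt;The dot-notation idea resembles plain nested indexing, sketched here in Python on a hypothetical record (note that SPL's contact(1) is 1-based, while Python lists are 0-based):&lt;/p&gt;

```python
import json

# Hypothetical nested record, mirroring the example path above.
record = json.loads("""
{"name": "example",
 "contact": [{"metadata": {"verified": true}},
             {"metadata": {"verified": false}}]}
""")

# Reads much like SPL's example.contact(1).metadata.verified.
verified = record["contact"][0]["metadata"]["verified"]
```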

&lt;p&gt;SPL offers robust support to address Python’s weakness in handling big data. For large datasets, SPL’s cursor mechanism will show you how easy it is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;=file("huge.log").cursor@t()
=A1.groups(;sum(amount):total, count(~):rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Data is read in a stream, with only the current batch held in memory. This allows files of hundreds of gigabytes to run smoothly. Additionally, it supports parallel and segmented processing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;=file("huge.log").cursor@tm(;4)  //4-thread parallel processing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
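&lt;p&gt;A cursor-like stream can be imitated in plain Python with a generator that yields fixed-size batches (a sketch of the idea, not esProc's actual mechanism):&lt;/p&gt;

```python
def batches(lines, size):
    """Yield successive batches so only `size` records are in memory at once."""
    batch = []
    for line in lines:
        batch.append(line)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

def streamed_sum(lines, size=1000):
    # Aggregate batch by batch, never materializing the whole input.
    total = 0.0
    rows = 0
    for batch in batches(lines, size):
        total += sum(float(x) for x in batch)
        rows += len(batch)
    return total, rows
```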



&lt;p&gt;DuckDB + Python is like using chopsticks for a steak dinner – each works well individually but the combination feels awkward. What about esProc SPL? Think of it as a full-stack kitchen suite: it combines SQL’s rigor with Python’s flexibility, enhanced with a universal toolkit for multi-source data mixed computation + big data processing—all seamlessly integrated into one system. Isn’t the ultimate dream of data engineers to write less code and slack off? SPL might just be the closest shortcut to achieving that.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.esproc.com/download-Desktop/" rel="noopener noreferrer"&gt;Free download esProc&lt;/a&gt;&lt;/p&gt;

</description>
      <category>duckdb</category>
      <category>programming</category>
      <category>coding</category>
      <category>esproc</category>
    </item>
    <item>
      <title>I tried out a new programming language for data analysts</title>
      <dc:creator>Serina</dc:creator>
      <pubDate>Thu, 23 Jan 2025 08:09:08 +0000</pubDate>
      <link>https://dev.to/serina_8340/i-tried-out-a-new-programming-language-for-data-analysts-2m52</link>
      <guid>https://dev.to/serina_8340/i-tried-out-a-new-programming-language-for-data-analysts-2m52</guid>
      <description>&lt;p&gt;The programming language esProc SPL recommended for data analysts is definitely worth a try.&lt;/p&gt;

&lt;p&gt;Let’s talk about its advantages and disadvantages. For some traits, it’s hard to say whether they are advantages or disadvantages; they are more like characteristics.&lt;/p&gt;

&lt;p&gt;Firstly, the usage threshold is very low: it can be used immediately after installation, without the need for database support. It can directly process files like CSV, reading and processing data in one go, which makes it easier to operate than other tools.&lt;/p&gt;

&lt;p&gt;Group and aggregate CSV files in one line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;T("order.csv").groups(area;sum(amount))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;SPL has an interesting characteristic: &lt;strong&gt;grid code&lt;/strong&gt;. Its code is written in a grid, similar to Excel. Users accustomed to SQL and Python may find it unfamiliar at first, since this writing style is rarely seen elsewhere. It may seem unconventional, but it is actually quite convenient.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffe45jl3xldlrpo49ajrt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffe45jl3xldlrpo49ajrt.png" alt="Image description" width="800" height="474"&gt;&lt;/a&gt;&lt;br&gt;
Besides being neat, this grid code has two advantages:&lt;br&gt;
One is that there is no need to define temporary variables: subsequent code can reference the results of earlier cells through cell names such as A2 and A3, which is very convenient. If rows or columns are added or deleted, the cell names adjust automatically, so there are no reference errors, much as in Excel.&lt;/p&gt;

&lt;p&gt;The second is good interactivity. A result panel on the right shows the calculation result of each cell after the code runs, with no manual output needed. The results are intuitive, and mistakes can be quickly spotted and corrected.&lt;/p&gt;

&lt;p&gt;The syntax of SPL is its own, different from SQL, but it has all the necessary functions such as grouping, filtering, and join. Specifically, SPL has made significant improvements in grouping and ordered operations, making it noticeably more concise than SQL and Python in handling complex analysis tasks.&lt;/p&gt;

&lt;p&gt;You can get a deeper feel for this through an official example: find the player who scores three consecutive times within one minute.&lt;/p&gt;

&lt;p&gt;SPL&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhf05knvl3b4fae9fuegv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhf05knvl3b4fae9fuegv.png" alt="Image description" width="800" height="133"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;SQL&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WITH numbered_scores AS (
    SELECT team, player, play_time, score,
           ROW_NUMBER() OVER (ORDER BY play_time) AS rn
    FROM ball_game)
SELECT DISTINCT s1.player
FROM numbered_scores s1
JOIN numbered_scores s2 ON s1.player = s2.player AND s1.rn = s2.rn - 1
JOIN numbered_scores s3 ON s1.player = s3.player AND s1.rn = s3.rn - 2
WHERE s3.play_time - s1.play_time &amp;lt; 60;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Python&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

df = pd.read_csv("../ball_game.csv")
df["play_time"] = pd.to_datetime(df["play_time"])
result_players = []
player = None
consecutive_scores = 0
for i in range(len(df)):
    current = df.iloc[i]
    if player != current["player"]:
        player = current["player"]
        consecutive_scores = 1
    else:
        consecutive_scores += 1
    if consecutive_scores &amp;gt;= 3:
        last2 = df.iloc[i - 2]  # first score of the current 3-score window
        if (current["play_time"] - last2["play_time"]).seconds &amp;lt; 60:
            result_players.append(player)
result_players = list(set(result_players))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;SPL uses some symbols, such as ~ and #, which may seem a bit strange at first glance. It’s not really difficult; look at a few more examples and you’ll be proficient.&lt;/p&gt;

&lt;p&gt;What I want to complain about is that code completion in the SPL development environment is not good or intelligent enough; this is an aspect that should be improved.&lt;/p&gt;

&lt;p&gt;And the graphical interface of the editor looks outdated. Was it developed with Swing? That toolkit is rarely used nowadays. Although the IDE has everything needed, its appearance doesn’t quite meet modern aesthetic preferences.&lt;/p&gt;

&lt;p&gt;The saved code files are not plain text, which makes version control and code review difficult outside the development environment. It also seems they cannot be opened in VS Code.&lt;/p&gt;

&lt;p&gt;That’s all for now. Overall, it’s pretty good: there are no major functional issues, it’s easy to get started with, and it’s convenient to use. It’s a worthwhile addition to a data analysis toolbox, and you may find it particularly useful at times. Of course, if it supported text-format code and modernized the graphical interface and editor, it would likely be more popular.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
