Comparison of esProc SPL and DuckDB in Multi-Data Source Capabilities

Both DuckDB and esProc SPL support diverse data sources, and this article will compare the differences between them.

Types of supported data sources

DuckDB supports a wide range of data source types, covering common file formats (such as CSV, Parquet, JSON, Excel), cloud storage (such as AWS S3, Azure Blob Storage), and relational databases (such as MySQL, PostgreSQL, SQLite). It can also access web data via https. Additionally, DuckDB supports some emerging data lake formats (such as Delta Lake, Iceberg).

esProc supports a wider range of data source types, covering more local files, databases, and remote data sources. Here are some data sources supported by SPL:

• Local files: CSV, Excel, JSON, XML, Parquet, ORC, etc.

• All relational databases: MySQL, PostgreSQL, Oracle, SQL Server, etc. (via JDBC)

• NoSQL databases: MongoDB, Cassandra, Redis, etc.

• Cloud storage: HDFS, AWSS3, GCS, etc.

• Remote data sources: RESTfulAPI, WebService, FTP/SFTP, etc.

• Others: Kafka, ElasticSearch, etc.

In terms of the number of data sources, esProc supports more types of data sources, especially in non-relational databases (such as MongoDB, Redis) and support for Kafka, ES, etc., where esProc has a significant advantage.

From a deeper perspective, DuckDB’s data source access relies on dedicated connectors, which need to be developed separately for each data source, resulting in high complexity. It is also very difficult for users to further develop based on the open-source code. As a result, the number of available connectors is significantly limited, and even the most common relational databases are not fully supported. Currently, DuckDB supports MySQL, PG, and SQLite, but does not support other common databases such as Oracle and MSSQL, which will make it difficult to perform mixed queries across multiple data sources. For example, when performing mixed calculations between MySQL and Oracle, if there is no suitable connector, users can only resort to Python as a workaround.

esProc utilizes the native interface of data sources. All relational databases can be connected via JDBC, which is naturally supported. Other data sources such as MongoDB and Kafka can also be simply encapsulated based on the native interface, resulting in high development speed and thus providing a richer connector library. Users can easily add their own connectors by implementing the reserved extension interface.

With these rich support and data source extension capabilities, it is very easy to use esProc to implement mixed calculations across multiple data sources. MySQL+Oracle can be calculated directly, and it is also simple to extend to unsupported data sources.

There is no obvious superiority or inferiority between DuckDB’s dedicated connector and esProc’s simple encapsulation using the native interface. The former can provide deeper support and optimization, achieving a certain level of transparency; the latter is more flexible, supporting a wide range of data sources and offering flexible extension. The specific preference depends on actual needs.

Data type processing

DuckDB has very mature support for CSV and Parquet files, enabling efficient reading and querying of these files. For example, DuckDB can directly load CSV files and execute SQL queries, making the operation straightforward and simple:

SELECT * FROM 'data.csv' WHERE column_a > 100;

esProc also makes it simple to process CSV files using SPL syntax:

T("data.csv").select(column_a > 100)

In addition to SPL syntax, esProc also provides SQL syntax:

$SELECT * FROM data.csv WHERE column_a > 100;

Use SQL for simple scenarios and SPL for complex ones. They can also be used in combination.

Due to the limitations of SQL, many complex calculations are not easy to implement. DuckDB integrates well with Python, allowing complex requirements to be met with Python’s assistance. However, the writing and debugging of these two systems are different, which will create a strong sense of split. esProc provides SQL and the more powerful SPL. Operations that SQL cannot handle can all be implemented with SPL, often in a simpler way. Performing calculations within a single system enhances overall coherence.

Another significant difference lies in JSON processing. esProc can better handle complex calculations and scenarios that require preserving JSON’s hierarchical structure. When performing multi-level structure calculations, SPL can directly access sublevel data using dots (.), which is very intuitive. There is no need to rely on UNNEST to unfold layer by layer or nested queries to preserve the integrity of the data structure, as in DuckDB. The support for multi-level data calculations is very thorough.

Multi-layer and multi-condition data filtering in SPL:

json(file("orders.json").read()). select(order_details.product.category=="Electronics" && order_details.sum(price*quantity)>200)

Compared to DuckDB, esProc supports a richer variety of data sources and is easier to extend, enabling mixed calculations across most data sources. In terms of data processing, esProc not only supports SQL syntax but also SPL, which can handle more complex scenarios within a single system, eliminating the sense of split between SQL and Python systems. Especially for processing multi-layered JSON data, SPL is simpler and more intuitive.

Free download esProc

DEV Community

Comparison of esProc SPL and DuckDB in Multi-Data Source Capabilities

Top comments (0)