DEV Community

SQLFlash

LLMSQL & Arabic WikiTableQA & Payment-SQL: SQL Datasets (2025-10)

This month, the NL2SQL field saw several noteworthy new datasets and benchmarks. Moving beyond the field's earlier emphasis on raw "task difficulty," recent research has clearly shifted towards "data quality and applicability" (e.g., LLMSQL's systematic cleaning of an older dataset), "real-world deployment in specific domains" (e.g., Payment-SQL's industrial-grade complex scenarios), and "multilingual support" (e.g., Arabic WikiTableQA).

LLMSQL

LLMSQL is a systematic reconstruction and upgrade of the classic WikiSQL dataset, reformatted into standard SQL. It aims to address adaptation issues for generative tasks in the Large Language Model (LLM) era.

Paper Intent

Although the original WikiSQL dataset is an early classic in the field, it suffers from significant "standard aging" and "evaluation fragmentation" issues as technology has evolved. Early research often used varied preprocessing scripts and evaluation logic, making it difficult to fairly compare model performance. Furthermore, the original annotations contained numerous token indices designed for older Pointer Networks, a format not well-suited for modern generative LLMs.

The core intention of LLMSQL is to establish an "LLM-native" standardized benchmark. It not only aims to clean data errors but also to tackle the reproducibility crisis in evaluation. By providing cleaned, plain-text SQL and a unified evaluation framework, the project seeks to eliminate bias from data noise, allowing researchers to precisely quantify LLMs' instruction-following and basic reasoning capabilities in a standardized single-table query environment.

Dataset Analysis

The dataset retains approximately 80,330 sample pairs. Its core value lies in building an automated, closed-loop cleaning pipeline that specifically addresses the following technical challenges:

  • Variant Exhaustion and Execution Verification: To tackle queries returning empty results due to case sensitivity mismatches in the original data, the team designed automated scripts. The system generates various case variants (e.g., 'New York', 'new york') for each query entity and performs exhaustive testing.
  • Execution-Based Trial-and-Error Correction: Variants are substituted into the SQL for actual database execution. Once a non-empty result is obtained, the system updates the entity spelling in both the natural language question and the SQL statement, ensuring data consistency and validity.
  • Format Reconstruction: It completely abandons the complex Pointer-based slot annotations designed for Pointer Networks in the original WikiSQL, restructuring all data into standard, directly executable plain-text SQL.
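Concretely, the first two steps amount to a small execute-and-retry loop: try spelling variants against the live table until one returns rows. The sketch below assumes a single string-valued entity and a one-placeholder SQL template; the function names are illustrative, not the project's actual scripts.

```python
import sqlite3

def case_variants(entity):
    """Return plausible case variants of an entity string, in order."""
    variants = []
    for v in (entity, entity.lower(), entity.upper(), entity.title()):
        if v not in variants:
            variants.append(v)
    return variants

def repair_entity(conn, sql_template, question, entity):
    """Try each case variant in the SQL; accept the first spelling whose
    execution returns a non-empty result, and propagate it to the
    natural language question so both sides stay consistent."""
    for variant in case_variants(entity):
        sql = sql_template.format(value=variant.replace("'", "''"))
        rows = conn.execute(sql).fetchall()
        if rows:  # non-empty result: this spelling exists in the table
            return sql, question.replace(entity, variant)
    return None, question  # no variant matched; leave the pair flagged
```

Note that SQLite's `=` comparison is case-sensitive for text, which is exactly the mismatch this loop repairs.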

Summary

LLMSQL modernizes an existing large-scale dataset, solving the applicability problems of classic datasets in the LLM era.

  • Its main advantage is providing a more rigorous and reproducible testing environment compared to the original WikiSQL.
  • For research teams needing to fine-tune lightweight models or evaluate the basic instruction-following capabilities of large models, LLMSQL lowers the barrier to entry. It currently serves as a reliable benchmark for assessing fundamental single-table query capabilities (e.g., WHERE clause logic, basic aggregation).
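Because LLMSQL's queries are plain, executable SQL, scoring can follow the standard execution-accuracy recipe: run the gold and predicted queries against the table and compare result sets. A minimal sketch of that style of metric (an assumed harness, not the benchmark's published evaluator):

```python
import sqlite3

def execution_match(conn, gold_sql, pred_sql):
    """Execution accuracy: a predicted query counts as correct when it
    returns the same result set as the gold query (order-insensitive).
    Unexecutable predictions simply count as wrong."""
    try:
        pred_rows = conn.execute(pred_sql).fetchall()
    except sqlite3.Error:
        return False
    gold_rows = conn.execute(gold_sql).fetchall()
    return sorted(pred_rows) == sorted(gold_rows)
```

Comparing executed results rather than SQL strings is what lets logically equivalent but textually different predictions score as correct.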

Arabic WikiTableQA

Arabic WikiTableQA is the first large-scale Arabic non-SQL Table Question Answering benchmark, filling a gap for non-English languages in the TableQA domain.

Paper Intent

Existing Arabic datasets (like Ar-Spider) primarily focus on Semantic Parsing, i.e., "converting natural language to SQL." However, modern application scenarios often require obtaining answers directly from semi-structured tables (e.g., web pages, documents) rather than generating query code. This paper aims to build a benchmark for open-domain Direct TableQA, evaluating LLMs' reasoning abilities on unstructured/semi-structured data, and exploring the use of Knowledge Graph (KG) techniques to address challenges like Arabic's complex morphology and long-table retrieval.

Dataset Analysis

This dataset is an Arabic adaptation of the English WikiTableQuestions (containing 2,108 tables and 22,000+ QA pairs). Its construction process highlights the complexity of cross-lingual transfer:

  • Standardized Cleaning and Translation: GPT-4o was used to translate content into Modern Standard Arabic, while also handling the cleaning of Tatweel elongation characters and diacritics, and unifying date/number formats. This addresses common data consistency issues in low-resource languages.
  • Non-SQL Direct QA Paradigm: Unlike tasks reliant on database schemas, this dataset focuses on open-domain direct table QA, testing model comprehension capabilities without explicit schema constraints.
  • Dynamic Graph Enhancement: A semantic graph is dynamically generated for each table, extracting node entities and relation subgraphs. This design aims to assist models in handling complex logical reasoning, compensating for the limitations of pure text retrieval.
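The Tatweel and diacritics cleaning mentioned in the first bullet is easy to illustrate: both are cosmetic Unicode characters that fragment otherwise identical surface forms. A minimal normalizer (my sketch, not the authors' exact pipeline, which also unifies dates and numbers):

```python
import re

# Arabic harakat (short-vowel diacritics) occupy U+064B..U+0652;
# U+0640 is the Tatweel elongation character, used only for justification.
DIACRITICS = re.compile("[\u064B-\u0652]")
TATWEEL = "\u0640"

def normalize_arabic(text):
    """Strip Tatweel elongation and diacritical marks so that visually
    stretched or vocalized spellings collapse to one canonical form."""
    return DIACRITICS.sub("", text.replace(TATWEEL, ""))
```

Without this step, a table cell and a question could refer to the same entity yet never string-match.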

Summary

Arabic WikiTableQA fills a void in Arabic TableQA. Its "translation + cleaning + graph enhancement" construction paradigm provides a reference for building datasets in other low-resource languages. This dataset is particularly suitable for evaluating models' multilingual reasoning abilities in unstructured query scenarios and for researching the performance of RAG techniques when processing languages with complex morphology.


Payment-SQL

Payment-SQL is an industrial-grade dataset derived from the real-world financial payments domain, released as part of the SQLGovernor project. It is specifically designed for evaluating LLMs' ability to handle high-complexity OLAP (Online Analytical Processing) queries.

Paper Intent

Mainstream academic datasets (like Spider, BIRD) primarily focus on "logical correctness." However, in industrial settings, especially financial OLAP scenarios, SQL execution efficiency is equally critical. In practice, DBAs often face logically complex yet inefficient redundant queries. The SQLGovernor paper aims to demonstrate LLMs' potential in code governance and performance optimization—acting as an "intelligent DBA" by performing syntax correction, logical rewriting, and performance tuning on long, complex SQL. This helps reduce database load and shorten report generation times.

Dataset Analysis

Payment-SQL contains 50 high-difficulty OLAP queries. Its construction exhibits typical "production environment extraction" characteristics, reflecting an extremely high level of task complexity:

  • Real Industrial Source & Schema Complexity: Queries were curated by human experts based on real execution logs, sourced from a massive industrial schema containing 74 tables and thousands of columns. On average, each query involves 2 tables and 11 columns, realistically reflecting the depth of business associations.
  • Exceptional Query Length: The average query is 421 tokens long, far exceeding the 107 tokens of the "Challenger" category in the BIRD dataset. The shortest query (173 tokens) already meets the traditional definition of "difficult," and the longest reaches 1,169 tokens. This sets a stringent standard for evaluating models' long-context processing and logical robustness.
  • Dedicated Benchmark for Rewriting Systems: The dataset is specifically designed for evaluating SQL rewriting systems, measuring effectiveness by comparing execution times before and after rewriting in the same environment. This metric, based on real performance gains, holds more practical significance than mere syntax matching.
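The before/after execution-time metric in the last bullet can be sketched as a tiny timing harness (illustrative only; the paper's environment is an industrial engine, not SQLite, and the helper name is mine):

```python
import sqlite3
import time

def rewrite_speedup(conn, original_sql, rewritten_sql, runs=5):
    """Compare execution times of a query before and after rewriting in
    the same environment; returns original/rewritten time ratio, where
    values above 1 mean the rewrite is faster."""
    def best_time(sql):
        samples = []
        for _ in range(runs):
            start = time.perf_counter()
            conn.execute(sql).fetchall()
            samples.append(time.perf_counter() - start)
        return min(samples)  # best-of-n damps scheduling noise
    # A real harness would also verify the two queries return the same
    # result set before trusting the speedup number.
    return best_time(original_sql) / best_time(rewritten_sql)
```

Grounding the score in measured runtime, rather than syntax similarity, is what makes the benchmark meaningful for performance-oriented rewriting systems.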

Summary

Payment-SQL is a highly representative dataset for enterprise-level practical applications. Its core strength lies in moving beyond the traditional NL2SQL focus solely on generation, shifting towards the generation and optimization of long-context, highly complex analytical logic. For research teams dedicated to solving complex reporting and database performance optimization problems in fields like finance and e-commerce, this dataset provides an extremely valuable reference and testing environment.
