SQLFlash

SQLStorm & CogniSQL: An AI-Augmented SQL Dataset(202508)

Several new datasets have been released in the NL2SQL field over the past two months. According to publicly available materials, four papers (SQLStorm, CogniSQL, RubikSQL, and FinStat2SQL) mention dataset releases. Of these, RubikSQL has not yet open-sourced its code or dataset, and FinStat2SQL has not stated whether its dataset will be made public.

Therefore, this article will focus on introducing the currently accessible SQLStorm and CogniSQL datasets.

SQLStorm

SQLStorm v1.0 is a large-scale benchmark built on real-world data, spanning three database sizes (1 GB, 12 GB, and 220 GB) and more than 18,000 queries. It pioneers the use of AI to generate query workloads, producing a large volume of SQL text (22 MB) that closely mimics real-world scenarios at very low cost (about $15).

It significantly expands the coverage of SQL functionality and query structures. In contrast, traditional manually crafted benchmarks like TPC-H, TPC-DS, and JOB fall short in query diversity and complexity.

SQLStorm can be used in the following scenarios:

  • Enhance SQL compatibility between different systems.
  • Improve system quality by identifying and fixing system crashes or errors.
  • Leverage query execution trends to discover opportunities for optimizing cardinality estimation and query optimizers.
  • Comprehensively evaluate system performance, including execution speed and robustness (a test-harness sketch follows this list).
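To make the robustness scenario concrete, here is a minimal sketch of a harness that replays a directory of SQLStorm queries against PostgreSQL and tallies each outcome. The DSN, file layout, and 30-second timeout are illustrative assumptions, not part of the SQLStorm release.

```python
# Replay a workload and bucket outcomes (ok / timeout / error by class).
import glob
from collections import Counter

import psycopg2  # pip install psycopg2-binary

conn = psycopg2.connect("dbname=stackoverflow_dba")  # hypothetical DSN
conn.autocommit = True  # a failed statement must not poison later ones
outcomes = Counter()

for path in sorted(glob.glob("sqlstorm_queries/*.sql")):  # hypothetical layout
    with open(path) as f, conn.cursor() as cur:
        try:
            cur.execute("SET statement_timeout = '30s'")  # per-query budget
            cur.execute(f.read())
            if cur.description:   # fetch only if the statement returned rows
                cur.fetchall()
            outcomes["ok"] += 1
        except psycopg2.errors.QueryCanceled:
            outcomes["timeout"] += 1
        except psycopg2.Error as exc:
            outcomes["error:" + type(exc).__name__] += 1

print(outcomes.most_common())  # a rough Crash/Error/Timeout-style tally
```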

Dataset Analysis

SQLStorm employs large language models (LLMs) to generate SQL statements for database performance testing, aiming to address the narrow SQL-feature coverage of traditional datasets like TPC-H. The dataset is compatible with mainstream database systems such as PostgreSQL, Umbra, and DuckDB. The data is based on real databases provided by StackOverflow, consisting of a schema and data at three sizes:

  • StackOverflow DBA (1 GB)
  • StackOverflow Math (12 GB)
  • StackOverflow Full (222 GB)

The query generation process is as follows:

  1. Use large language models to generate approximately 35,000 SQL statements spanning simple, medium, and complex queries.
  2. Screen the candidates, again with a model, ultimately retaining about 18,000 high-quality queries (a sketch of this generate-then-screen pipeline follows).
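The sketch below shows the overall shape of such a pipeline. The paper screens candidates with a model; for brevity this sketch substitutes a simpler executability filter that keeps only queries PostgreSQL can plan. The model name, prompt, schema file, and DSN are illustrative assumptions.

```python
# Generate-then-screen sketch; EXPLAIN stands in for model-based screening.
import psycopg2
from openai import OpenAI

client = OpenAI()
SCHEMA = open("stackoverflow_schema.sql").read()  # hypothetical schema dump

def generate_queries(complexity: str, n: int) -> list[str]:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption; the paper's models may differ
        messages=[{"role": "user", "content":
                   f"Schema:\n{SCHEMA}\n\nWrite {n} {complexity} analytical "
                   "SQL queries for this schema, separated by semicolons."}],
    )
    text = resp.choices[0].message.content
    # Crude split on ';' is enough for a sketch.
    return [q.strip() + ";" for q in text.split(";") if q.strip()]

def screen(queries: list[str], conn) -> list[str]:
    kept = []
    for q in queries:
        try:
            with conn.cursor() as cur:
                cur.execute("EXPLAIN " + q)  # plans, but does not run, the query
            kept.append(q)
        except psycopg2.Error:
            conn.rollback()  # drop queries the system cannot even plan
    return kept

conn = psycopg2.connect("dbname=stackoverflow_dba")  # hypothetical DSN
candidates = [q for c in ("simple", "medium", "complex")
              for q in generate_queries(c, n=100)]
print(len(screen(candidates, conn)), "queries retained")
```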

Influence

SQLStorm can generate diverse SQL queries at low cost, effectively revealing performance bottlenecks and errors in database systems. For example, after adopting SQLStorm, Umbra quickly improved its Crash, Error, and Timeout + OOM counts on the benchmark by fixing the issues it surfaced. SQLStorm also exposed performance regressions that TPC-H had not revealed, and these were likewise identified and resolved. (Note: Umbra and SQLStorm appear to come from the same team.)

Summary

SQLStorm can also be used to evaluate tasks such as SQL optimization: because it ships complete databases along with sample workloads, it can measure the performance difference before and after a query is rewritten, as sketched below. The same databases can likewise be used to evaluate the runtime quality of SQL generated by different NL2SQL systems.
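Here is a minimal before/after comparison for a single rewrite. The DSN is hypothetical, the table and column names follow a StackOverflow-style schema and are assumptions, and the two queries are only equivalent if every post row has a non-NULL owner that exists in `users`.

```python
# A/B-test one rewrite on a SQLStorm database: check result equivalence,
# then compare wall-clock time.
import time
import psycopg2

def run(conn, sql):
    with conn.cursor() as cur:
        start = time.perf_counter()
        cur.execute(sql)
        rows = cur.fetchall()
    return rows, time.perf_counter() - start

conn = psycopg2.connect("dbname=stackoverflow_math")  # hypothetical DSN
original = """
    SELECT u.id FROM users u
    WHERE (SELECT count(*) FROM posts p WHERE p.owneruserid = u.id) > 100
"""
optimized = """
    SELECT p.owneruserid FROM posts p
    GROUP BY p.owneruserid HAVING count(*) > 100
"""

rows_a, t_a = run(conn, original)
rows_b, t_b = run(conn, optimized)
# Set comparison ignores row order; fine here since the ids are unique.
assert set(rows_a) == set(rows_b), "rewrite changed the result set"
print(f"speedup: {t_a / t_b:.2f}x")
```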


CogniSQL

CogniSQL has released two curated datasets that advance research on execution-aligned, scalable text-to-SQL generation. By open-sourcing these resources, the authors enable the community to directly leverage high-precision SQL samples and explicit reasoning paths for lightweight reinforcement learning and for training reasoning-enhanced text-to-SQL models under limited computational resources.

Dataset Composition

  1. Reasoning Traces (5,024 entries):

    Generated by Qwen QWQ 32B, this dataset provides explicit step-by-step reasoning paths, enhancing interpretability and enabling transparent observation of the SQL generation process.

  2. Positive Sample Corpus (36,356 entries):

    Generated by Qwen2.5-7B-Coder: each original training example yields six distinct training instances comprising both the reasoning process and the outcome, thereby expanding the diversity of reasoning paths (a construction sketch follows this list).
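To make that construction concrete, here is a minimal sketch of assembling six execution-verified instances per source example. `generate_sql` (model sampling) and `is_correct` (execution check against the gold answer) are hypothetical stand-ins for Qwen2.5-7B-Coder sampling and verification, and the field names are assumptions.

```python
# Build up to k distinct, execution-verified (reasoning, SQL) instances
# per training example; give up after max_tries samples.
def build_positives(example: dict, generate_sql, is_correct,
                    k: int = 6, max_tries: int = 64) -> list[dict]:
    kept, seen = [], set()
    for _ in range(max_tries):
        if len(kept) == k:
            break
        reasoning, sql = generate_sql(example["question"], example["schema"])
        if sql in seen or not is_correct(example, sql):
            continue  # keep only novel, execution-correct candidates
        seen.add(sql)
        kept.append({"question": example["question"],
                     "reasoning": reasoning,
                     "sql": sql})
    return kept
```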

Dataset Analysis

These datasets were used in the supervised fine-tuning (SFT) phase of the study:

  • Phase 1: SFT on the Reasoning Traces (5,024 entries from Qwen QWQ 32B) dropped execution accuracy on BIRD-dev from ≈52.0% (baseline) to ≈46.0%.
  • Phase 2: SFT with the self-generated Positive Sample Corpus recovered performance, reaching ≈57.3% execution accuracy and surpassing the original baseline. These results indicate that high-precision, in-distribution samples (even from the model itself) can benefit SFT; the execution-accuracy metric behind these numbers is sketched below.
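Since both phases are scored by execution accuracy, a minimal sketch of that metric helps anchor the numbers: a prediction counts as correct only if it returns the same result set as the gold SQL on the corresponding database. BIRD ships its databases as SQLite files; the paths and field names in the comments are assumptions.

```python
# Execution-accuracy check for one prediction against the gold SQL.
import sqlite3

def execution_match(db_path: str, pred_sql: str, gold_sql: str) -> bool:
    conn = sqlite3.connect(db_path)
    try:
        pred = set(conn.execute(pred_sql).fetchall())
        gold = set(conn.execute(gold_sql).fetchall())
    except sqlite3.Error:
        return False  # a prediction that fails to execute counts as wrong
    finally:
        conn.close()
    return pred == gold

# Accuracy over a dev split (items assumed to carry these fields):
# acc = sum(execution_match(x["db"], x["pred"], x["gold"]) for x in dev) / len(dev)
```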

Influence

Beyond the datasets, the paper’s core contribution is the CogniSQL-R1-Zero framework. After experimenting with long agentic workflows, SFT, and reinforcement learning, the authors settled on GRPO-based reinforcement learning to improve NL2SQL performance. Experiments showed that CogniSQL-R1-Zero, built on Qwen2.5-7B-Coder, achieved 59% execution accuracy on BIRD-dev, outperforming much larger baselines such as DeepSeek-Coder (236B) and Mistral (123B).
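For intuition about what GRPO optimizes here, the sketch below computes group-relative advantages from a binary execution reward, reusing `execution_match` from the earlier sketch. The binary reward is a simplifying assumption; the paper’s actual reward shaping may include additional terms (e.g., output-format checks).

```python
# Group-relative advantages over a batch of sampled SQL completions.
import statistics

def group_advantages(db_path: str, gold_sql: str,
                     completions: list[str]) -> list[float]:
    # Binary execution reward per sampled completion in the group.
    rewards = [1.0 if execution_match(db_path, sql, gold_sql) else 0.0
               for sql in completions]
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard the all-equal case
    # GRPO replaces a learned critic with this group-relative baseline.
    return [(r - mean) / std for r in rewards]
```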

Summary

CogniSQL’s key contribution lies in generating high-quality, small-model-aligned reasoning traces and positive samples through a generative approach, addressing the prior lack of datasets tailored to small-model reasoning logic. This enables effective generalization for small models in low-resource environments via SFT. The paper also summarizes critical insights for large-model training:

  1. Zero-Shot Chain-of-Thought (CoT) is limited: While LLaMA 3.1 8B generated partially coherent reasoning steps, <20% of SQL queries executed correctly, highlighting a significant gap between CoT and SQL generation.
  2. Multi-agent workflows are costly: Despite achieving 85% accuracy, computational overhead, latency, and system complexity make them impractical for large-scale deployment.
  3. Supervised Fine-Tuning (SFT) can degrade performance: SFT with distilled reasoning data caused overfitting in Qwen-7B, reducing accuracy from 52.0% to 46.0%.
  4. Self-generated data aids SFT recovery: Using self-generated, high-accuracy SQL samples for SFT restored accuracy to 57.3%, emphasizing the importance of in-distribution, high-precision data.
  5. Hybrid methods (SFT+RL) are stable but suboptimal: While cold-start followed by RL training ensured stable convergence (≈58.0% accuracy), pure RL achieved superior peak performance and workflow simplicity.
