Rebooter.S

GeoSQL-Eval: Finally, a PostGIS Benchmark That Doesn’t Make Me Scream

GeoSQL-Eval / GeoSQL-Bench

Finally—a PostGIS test that doesn’t make me want to throw my laptop. GeoSQL-Eval checks if LLMs actually get spatial queries, not just vomit syntactically valid but useless SQL. They dropped GeoSQL-Bench: 14,178 real tasks, 340 PostGIS functions covered, 82 legit spatial DBs (land use, transport networks—you name it).

Paper Intent

Let’s be real: old NL2SQL benchmarks skip the messy spatial stuff—geometry types, CRS, PostGIS quirks. So models hallucinate ST_Buffer when they need ST_Distance. GeoSQL-Bench + GeoSQL-Eval fix that. Built with spatial DB folks, not just theorists. Tests if models handle real client queries, not textbook examples.
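
To make that concrete, here's a toy version of the classic slip-up (the `parcels`/`roads` tables are my own invention, not from the benchmark):

```sql
-- Hypothetical tables: parcels(id, geom), roads(id, geom),
-- both stored as lon/lat geometries in EPSG:4326.
--
-- A model that reaches for ST_Buffer tends to produce something like:
--   SELECT p.id FROM parcels p, roads r
--   WHERE ST_Intersects(p.geom, ST_Buffer(r.geom, 500));  -- 500 DEGREES, not metres
--
-- "Parcels within 500 m of a road" really wants a distance predicate
-- evaluated in metres, e.g. via the geography type:
SELECT DISTINCT p.id
FROM parcels p
JOIN roads r
  ON ST_DWithin(p.geom::geography, r.geom::geography, 500);  -- 500 metres
```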

Dataset Analysis

  • 2,380 multiple-choice and true/false questions: pulled straight from the PostGIS 3.5 docs; they test whether models know what functions actually do, not just the syntax.
  • 3,744 SQL generation tasks: a mix of clear prompts ("add column age") and vague ones ("add a field") that force the model to guess types (VARCHAR? INT? You decide); see the sketch after this list.
  • 2,155 schema tasks: Built on UN GGIM + ISO 19115 databases. Models must navigate actual table relationships. All GPT-4o drafted → triple-checked by human spatial experts. No lazy labeling.
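
Here's roughly what that type-guessing bullet looks like in practice, using a made-up `residents` table:

```sql
-- Placeholder table `residents` (mine, not from the benchmark).

-- Clear prompt ("add column age"): the column name is given and integer is the obvious type.
ALTER TABLE residents ADD COLUMN age integer;

-- Vague prompt ("add a field"): both name and type are left to the model,
-- so several different answers look equally plausible:
ALTER TABLE residents ADD COLUMN age smallint;
ALTER TABLE residents ADD COLUMN notes varchar(255);
```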

Summary

Tested 24 models. GPT-5/o4-mini crushed geometry-heavy queries. But 70% of errors? Still function misuse. Schema tasks (multi-table joins) = hardest. This isn’t "another benchmark"—it’s the first real test for spatial SQL. Period.
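
To see why schema tasks bite, here's the flavour of multi-table spatial query they demand. Tables and columns below are my stand-ins, loosely inspired by the land-use theme, not the benchmark's actual schema:

```sql
-- Stand-in tables: admin_regions(id, name, geom), land_use(id, category, geom),
-- both lon/lat in EPSG:4326.
-- "Residential area per region" needs a spatial join, a clip, and an aggregate,
-- which is exactly the multi-step reasoning the schema tasks probe.
SELECT a.name,
       SUM(ST_Area(ST_Intersection(l.geom, a.geom)::geography)) AS residential_m2
FROM admin_regions a
JOIN land_use l
  ON ST_Intersects(l.geom, a.geom)
WHERE l.category = 'residential'
GROUP BY a.name;
```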

DeKeyNLU

DeKeyNLU fixes the quiet killer in NL2SQL: LLMs failing to break down "Show me Q3 sales in APAC" into actual DB steps. They built a dataset where humans actually verified task splits and keywords—then baked it into DeKeySQL’s pipeline.

Paper Intent

RAG/CoT pipelines keep choking on task decomposition and keyword extraction. Existing datasets? Fragmented or missing domain keywords ("fiscal year," "student cohort"). DeKeyNLU drops a clean fix: a new dataset + DeKeySQL’s 3-module flow—user question understanding → entity retrieval → SQL generation. They fine-tuned only the understanding module... and accuracy jumped hard.
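
A hedged sketch of what each stage might hand to the next for the "Show me Q3 sales in APAC" example from above; the `sales` table is invented for illustration, not taken from the paper:

```sql
-- Toy walk-through for "Show me Q3 sales in APAC"; the sales(order_id, amount,
-- order_date, region) table is made up for this example.
--
-- 1. Question understanding -> main task: total the sales
--                              sub-tasks: restrict to Q3, restrict to APAC
--                              keywords: objects = sales, region; implementation = SUM, date filter
-- 2. Entity retrieval       -> matches table sales and columns amount, order_date, region
-- 3. SQL generation         ->
SELECT SUM(amount) AS q3_apac_sales
FROM sales
WHERE region = 'APAC'
  AND order_date >= DATE '2024-07-01'   -- Q3 assumed to mean July through September
  AND order_date <  DATE '2024-10-01';
```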

Dataset Analysis

  • 1,500 QA pairs, pulled from BIRD benchmark (finance, education, real DB scenarios).
  • Split 7:2:1—train/val/test, no weird ratios.
  • Workflow: GPT-4o drafted task splits (main/sub) + keywords (objects/implementation) → three experts cross-checked *three times*. Painstaking? Yes. Worth it? Absolutely.

Summary

Fine-tuning "user question understanding" with DeKeyNLU pushed BIRD dev accuracy from 62.31% → 69.10%, Spider from 84.2% → 88.7%. Plot twist? Entity retrieval is the make-or-break step (not understanding), followed by question parsing. Proves: targeted dataset design + smart pipeline tweaks > throwing more data at the problem. Finally—NL2SQL that gets what you mean.
