๐ช Intro
Ever wanted to get that kind on constraint on a table :
CREATE TABLE sensor_data(
ts TIMESTAMP,
measurement INTEGER,
...,
SORTED(ts)
);
Well, it turns out this topic has a dedicated discussion:
For now it's not implemented...yet. But we have a lot of use cases out there.
Let's see how to achieve this in pure sql
... and why it would be so useful.
๐ฟ Demo
๐ค (A bit of) Context
We are currently using duckdb
on GitHub Actions (see opt-nc/setup-duckdb-action
) as it's a very convenient and efficient way to check data quality with sql
as part of our CI... with very very few efforts, see below:
๐ฆ Effortless Data Quality w/duckdb on GitHub โพ๏ธ
adriens for opt-nc ใป Jul 25 '23
โ๏ธ Recently, we also felt the need to keep a csv
file sorted according to a given column to keep it as clean as possible and letting people make Pull Requests... and, most important : delegating the CI the role to explain the end user why the data he wants to put cannot be merged.
We did not want any human in the loop to moderate contributors and explain how the data should be provided to get the PR merged. In particular in the context of we use GH auto-merge to
"[...] increase development velocity by enabling auto-merge for a pull request so that the pull request will merge automatically when all merge requirements are met."
๐ This productivity hack is all about achieving that in pure sql
... and take profit of:
-
duckdb
columnar storage format (see Exploring DuckDB and the Columnar Advantage for more) CHECK
constraint
โน๏ธ Notice that this hack can be applied on very large volumes of data.
๐ช sql
tricks
-----------------------------------------------------------
--
-- Check if a table column is sorted w/ integrity check
--
-----------------------------------------------------------
-- Create a table with a column that is not sorted
-- Feed some random stuff
-- The real target table
create or replace table demo_sort (text varchar);
insert into demo_sort values ('DuckDb');
insert into demo_sort values ('duckdb');
insert into demo_sort values ('Duckdb');
insert into demo_sort values ('DUCKDB');
insert into demo_sort values ('duckDB');
insert into demo_sort values ('DUCKdb');
insert into demo_sort values ('DuckDB');
-- Check the resulting table
from demo_sort;
-- Prepare test environment
CREATE SEQUENCE seq_original START 1;
CREATE SEQUENCE seq_sorted START 1;
create or replace temp table orig_table as
select nextval('seq_original') as index,
text from demo_sort;
create or replace temp table sorted_table as
select nextval('seq_sorted') as index,
text
from (select text from demo_sort order by text);
-- Check the resulting tables
from orig_table;
from sorted_table;
-- Create the table that compares the sorted and original tables columns
create or replace temp table test_table(orig_text varchar,
orig_index integer,
sorted_index integer
-- the magic part XD
check(orig_index = sorted_index)
);
-- Populate the comparison table
insert into test_table
select
orig_table.text as orig_text,
orig_table.index as orig_index,
sorted_table.index as sorted_index,
from
orig_table,
sorted_table
where
orig_table.text = sorted_table.text
order by orig_table.index;
-- Enjoy the resulting "Constraint Error: CHECK constraint failed: test_table"
Top comments (2)
doc(ressources) : How to keep a column sorted #95
cf dev.to/adriens/duckdb-sql-hack-get...