💭 CI, duckdb
et and data protection
To efficiently yet effortlessly manage data quality, we created a GitHub Action to install duckdb
:


🦆 Effortless Data Quality w/duckdb on GitHub ♾️
adriens for opt-nc ・ Jul 25 '23
... but recently I had to face an another challenge : as part of our CI, I had the need to validate data... that were relying on web resources.
I needed to be sure that a GitHub Account was really existing (for example to avoid typos) as part of our CI.
In this very short article, I'll show how to use DuckDB with the http_client
extension to verify GitHub handles stored in a table, for example to lint data as part of a CI pipeline thanks to GitHub Duckdb Action... and do the job with a very simple SQL
script and CHECK
constraints.
🍿 For impatients
🔖 Resources
- https://github.com/Query-farm/duckdb-extension-httpclient
- https://duckdb.org/community_extensions/extensions/http_client.html
- https://github.com/marketplace/actions/duckdb-setup
🕹️ Do the job with SQL
First install extension:
INSTALL http_client FROM community;
LOAD http_client;
Then create a table with some example data:
create or replace table person_gh_member
(
sam_accountname varchar primary key,
gh_member varchar not null
);
-- insert rows
INSERT INTO person_gh_member (sam_accountname, gh_member)
values
('adriens', 'adriens'),
('jdoe', 'johndoe'),
('asmith', 'annasmithRRRRR');
Finally, run a query to check the status of each GitHub handle:
create or replace view v_person_gh_status as
select
sam_accountname,
gh_member,
'https://github.com/' || gh_member as gh_url,
cast(http_get(gh_url).status as integer) as http_gh_status
from person_gh_member;
--where gh_status <> 200;
from v_person_gh_status;
Now, to lint, we can add a check to see if any status is not 200:
from v_person_gh_status
where http_gh_status <> 200;
create or replace table lint_gh_handle(
gh_handle varchar primary key,
gh_url varchar not null unique check (gh_url like 'https://github.com/%'),
gh_status integer check (gh_status = 200)
);
insert into lint_gh_handle(
gh_handle,
gh_url,
gh_status)
select
gh_member,
'https://github.com/' || gh_member as gh_url,
cast(http_get(gh_url).status as integer) as gh_status
from person_gh_member;
Top comments (4)
Very cool use of DuckDB..
duckdb
is such a productivity tool 🤩Hi guys,
we are maintaining ℹ️ Setup Duckdb Github Action... and we recently had the need to protect some data on our CI. I discovered your extension... which did a great job 😃
So I decided to document this use case, so here is below some feedback :
Thanks a lot for the great work, it will save us lot of efforts 🦥