DEV Community

adriens

Posted on • Originally published at kaggle.com

🦆 From API to scheduled offline copies with DuckDB on Kaggle ♾️

❔ About

While I was working on endoflife.date integrations, the need for an offline copy started to arise:

Offline copy of data #2530

I really like the idea, but to avoid repeated calls of the API for every product I would like data on, I would like to be able to maintain a local copy of the data and then only download updates each time I start my application (or after a particular time period, e.g. only request updates once every 24 hours).

Ideally, I would be able to get the data in JSON format which I can then manage locally.

An alternative would be to call the API for every product to get its data. But this would also require that I know all of the products in the first place, which, given the dynamic nature of the data, isn't very attractive.

After various attempts, I finally found a Kaggle-based solution.

I wanted the data to:

  • 👍 Be easy to share
  • ✅ Rely on the official API
  • 🔁 Be up to date (without any effort)
  • 🔗 Be easy to integrate with third-party products
  • 🧑‍🔬 Be deployed on a data-centric, data science platform
  • 🤓 Show its source code (open source)
  • 🚀 Be easily extensible

Therefore, I created a Notebook that does the following things once a week:

  1. Queries the API
  2. Loads and stores the data in a DuckDB database
  3. Exports the resulting database as SQL and CSV
  4. Exports the database as Apache Parquet files
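The four steps above could be sketched roughly as follows. This is a minimal illustration, not the Notebook's actual code: the `https://endoflife.date/api/all.json` endpoint, the `products` table, the file and directory names, and every helper function are assumptions made for the example. DuckDB is imported inside `build_offline_copy` so the small helpers stay usable without it installed.

```python
import json
import urllib.request
from pathlib import Path

# Assumption: the endoflife.date endpoint that lists all product names.
API_URL = "https://endoflife.date/api/all.json"


def fetch_product_names(url: str = API_URL) -> list[str]:
    """Step 1: query the API and return the list of product names."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def to_records(names: list[str]) -> list[dict[str, str]]:
    """Wrap each name in an object so DuckDB infers a `product` column."""
    return [{"product": name} for name in names]


def sql_quote(value: str) -> str:
    """Quote a string as a SQL literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def export_statements(out_dir: Path) -> list[str]:
    """Steps 3 and 4: one EXPORT DATABASE statement per output format."""
    return [
        f"EXPORT DATABASE {sql_quote(str(out_dir / 'csv'))} (FORMAT CSV)",
        f"EXPORT DATABASE {sql_quote(str(out_dir / 'parquet'))} (FORMAT PARQUET)",
    ]


def build_offline_copy(names: list[str], out_dir: Path) -> None:
    """Steps 2 to 4: load the data into DuckDB, then export it."""
    import duckdb  # third-party package: pip install duckdb

    out_dir.mkdir(parents=True, exist_ok=True)
    json_path = out_dir / "products.json"
    json_path.write_text(json.dumps(to_records(names)), encoding="utf-8")

    con = duckdb.connect(str(out_dir / "endoflife.duckdb"))
    try:
        # Step 2: let DuckDB infer the schema from the JSON file.
        con.execute(
            "CREATE OR REPLACE TABLE products AS "
            f"SELECT * FROM read_json_auto({sql_quote(str(json_path))})"
        )
        # Steps 3 and 4: the CSV export also writes schema.sql and load.sql.
        for statement in export_statements(out_dir):
            con.execute(statement)
    finally:
        con.close()
```

Calling `build_offline_copy(fetch_product_names(), Path("endoflife_export"))` would run the whole chain once. The weekly schedule itself is not part of the code: it comes from scheduling the Notebook on Kaggle.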

🧰 Tools

All you need is Python and DuckDB's JSON functions:

JSON - DuckDB

DuckDB is an in-process database management system focused on analytical query processing. It is designed to be easy to install and easy to use. DuckDB has no external dependencies. DuckDB has bindings for C/C++, Python and R.

duckdb.org
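As a small illustration of those JSON functions, the sketch below uses DuckDB's `json_extract_string` to pull fields out of a JSON string from Python. The sample document is made up for the example (it is not real endoflife.date output), and `build_extract_query` and `extract_fields` are hypothetical helper names.

```python
import json

# Made-up document shaped like one release cycle; not real API output.
SAMPLE_CYCLE = {"cycle": "1.0", "eol": "2025-01-01"}


def build_extract_query(fields: list[str]) -> str:
    """Build a SELECT that pulls each field out of a JSON string parameter.

    Field names are interpolated into the SQL, so they are assumed to be
    simple, trusted identifiers.
    """
    columns = ", ".join(
        f"json_extract_string(doc, '$.{name}') AS {name}" for name in fields
    )
    return f"SELECT {columns} FROM (SELECT CAST(? AS VARCHAR) AS doc)"


def extract_fields(document: dict, fields: list[str]) -> tuple:
    """Run the generated query against an in-memory DuckDB database."""
    import duckdb  # third-party package: pip install duckdb

    con = duckdb.connect()  # no path given, so the database lives in memory
    try:
        query = build_extract_query(fields)
        return con.execute(query, [json.dumps(document)]).fetchone()
    finally:
        con.close()
```

For example, `extract_fields(SAMPLE_CYCLE, ["cycle", "eol"])` returns one row with the two extracted values as strings. For whole files, `read_json_auto` is usually the more convenient entry point, since it infers the columns directly.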

🎯 Result

For now, the only input of the Notebook is the API, while each run produces fresh output files.

🗣️ Conclusion

Finally, I delivered the following solution to the community:

🎁 Weekly Scheduled offline exports on Kaggle ♾️ #2633

❔ About

An easy-to-use offline copy of endoflife.date would be very convenient for producing data analyses.

👉 This issue is about using the endoflife.date API to get an automated offline copy of the data.

🎁 The Notebook

The Notebook's outputs (SQL, CSV and Apache Parquet files) are very portable.

💰 Benefits

Weekly:

🔖 Related resources
