DecDEPO

Distributing an open dataset across 12 platforms in one day: a playbook

We shipped the Swedish Construction FAQ dataset (503 bilingual Q&A pairs, CC BY 4.0, DOI 10.5281/zenodo.19630803) yesterday. By end of day it was on twelve platforms.

This is the checklist, in order. Each step takes 10-30 minutes. Most are scriptable.

The 12 platforms

| # | Platform | What it gives you | Effort |
|---|----------|-------------------|--------|
| 1 | GitHub | Canonical source, issues, Pages base | |
| 2 | Zenodo | DOI, Scholar indexing, archival | 10 min |
| 3 | Hugging Face Datasets | ML audience, load_dataset() UX | 15 min |
| 4 | PyPI | pip install for Python users | 20 min |
| 5 | Kaggle | Data-science audience | 10 min |
| 6 | Wikidata | Knowledge graph, LLM indexing | 30 min |
| 7 | GitHub Pages | Static landing page, SEO | 5 min |
| 8 | Colab notebook | One-click "try it" UX | 15 min |
| 9 | Hugging Face Space | Live interactive demo | 20 min |
| 10 | Awesome-list PRs | Organic discovery | 15 min/list |
| 11 | Dev.to / Mastodon / HN | Human announcement reach | 10 min each |
| 12 | LinkedIn / Medium | Longer-form professional reach | 15 min each |

What to ship before you announce

Before posting anywhere, finalize these files in the repo:

  • [x] README.md with badges (license, DOI, language, format)
  • [x] LICENSE (CC BY 4.0 for data, MIT for any code)
  • [x] CITATION.cff with the DOI and a preferred-citation block
  • [x] .zenodo.json so Zenodo picks up rich metadata on release
  • [x] llms.txt at repo root — tells LLMs where to find the canonical metadata
  • [x] Data files in multiple formats: JSON, JSONL, CSV, Alpaca, ShareGPT

If you skip these, every downstream platform will have thin, inconsistent metadata.
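The llms.txt file from the checklist follows the llms.txt convention: an H1 with the project name, a one-line blockquote summary, then H2 sections of annotated links. A minimal sketch, with placeholder org/repo paths:

```markdown
# My Example Dataset

> 503 bilingual Q&A pairs about Swedish construction, CC BY 4.0.

## Data

- [Canonical JSON](https://raw.githubusercontent.com/ORG/REPO/main/data/faq.json): full dataset
- [Citation metadata](https://raw.githubusercontent.com/ORG/REPO/main/CITATION.cff): DOI and authors
```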

The scripting order

1. GitHub first

Create the repo, push everything, tag a release. Zenodo listens for GitHub releases via webhook.

```shell
gh repo create myorg/mydataset --public --source=.
git push -u origin main
gh release create v1.0.0 --title "Initial release" --notes "..."
```

2. Zenodo — before anything else downstream

The DOI becomes part of every downstream card, so mint it before you mirror anywhere. Don't mirror first and then retrofit the DOI everywhere.

  • Link your GitHub account to Zenodo
  • Flip the switch on your repo
  • Cut a release → Zenodo mints a DOI automatically
  • .zenodo.json in the repo controls metadata (creators, keywords, related_identifiers)
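A sketch of what that .zenodo.json can look like. The field names follow Zenodo's deposit metadata schema; the title, creator, and repo URL are placeholders:

```json
{
  "title": "My Example Dataset",
  "upload_type": "dataset",
  "description": "503 bilingual Q&A pairs, CC BY 4.0.",
  "creators": [{"name": "Doe, Jane", "affiliation": "Example AB"}],
  "license": "cc-by-4.0",
  "keywords": ["construction", "faq", "swedish"],
  "related_identifiers": [
    {"identifier": "https://github.com/ORG/REPO", "relation": "isSupplementTo"}
  ]
}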

3. Hugging Face Datasets

```shell
# Create the dataset repo via API
curl -X POST https://huggingface.co/api/repos/create \
  -H "Authorization: Bearer $HF_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"type":"dataset","name":"mydataset","organization":"MYORG"}'

# Then it's just git
git clone https://USER:$HF_TOKEN@huggingface.co/datasets/MYORG/mydataset
# Add your data/, add README.md with the YAML card, push
```

The dataset card YAML matters. Use configs: with split paths — it unlocks the in-browser dataset viewer.
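A sketch of the card frontmatter with that configs: block; the split and path values are placeholders for your own file layout:

```yaml
---
license: cc-by-4.0
language:
  - sv
  - en
configs:
  - config_name: default
    data_files:
      - split: train
        path: data/faq.jsonl
---
```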

4. PyPI wrapper

Ship a tiny Python package that just bundles the data and exposes a loader:

```python
# pyproject.toml + one module that does:
import json, importlib.resources as resources

def load():
    # Read faq.json bundled inside the installed package
    return json.loads(resources.files(__package__).joinpath("faq.json").read_text("utf-8"))
```

Publish with python -m build && twine upload dist/*. Users get pip install your-dataset-name.
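A minimal pyproject.toml sketch to go with that loader. The package name and data-file entry are placeholders; the package-data table is what makes setuptools ship faq.json inside the wheel:

```toml
[build-system]
requires = ["setuptools>=61"]
build-backend = "setuptools.build_meta"

[project]
name = "my-dataset-name"
version = "1.0.0"
description = "Loader for the bundled FAQ dataset"
requires-python = ">=3.9"

[tool.setuptools.package-data]
"my_dataset_name" = ["faq.json"]
```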

5. Kaggle mirror

One-time manual upload via the web form, then the Kaggle API for updates:

```shell
kaggle datasets version -p . -m "Update to v1.0.0"
```

Tag generously. Kaggle's search weights tags heavily.
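The Kaggle CLI reads a dataset-metadata.json next to the data files. A minimal sketch with a placeholder id and title; check the license name against Kaggle's accepted identifiers, since only a fixed list is allowed:

```json
{
  "id": "myorg/mydataset",
  "title": "My Example Dataset",
  "licenses": [{"name": "CC-BY-4.0"}]
}
```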

6. Wikidata

This is where most datasets stop — and why their metadata never gets into LLM training pipelines. The MediaWiki API is a bit arcane but the full flow is ~30 lines of Node. I wrote the how-to here.

Create at minimum:

  • The dataset entity (P31 = Q1172284 "dataset")
  • The publisher entity (your company / yourself)
  • P123 (publisher) + P170 (creator) links between them
  • P275 (license), P356 (DOI), P407 (language), P856 (official website)
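The entity-creation call is action=wbeditentity with new=item and a JSON data payload. A minimal Python sketch of building that payload for the dataset item, using the property IDs listed above; the label, DOI, and URL are placeholders, and the actual POST (which needs a logged-in session and CSRF token) is only indicated in a comment:

```python
import json

def item_claim(prop, qid):
    # Claim whose value is another Wikidata item (P31, P275, P407, ...)
    return {"mainsnak": {"snaktype": "value", "property": prop,
                         "datavalue": {"type": "wikibase-entityid",
                                       "value": {"entity-type": "item", "id": qid}}},
            "type": "statement", "rank": "normal"}

def string_claim(prop, value):
    # Claim whose value is a plain string (P356 DOI, P856 URL, ...)
    return {"mainsnak": {"snaktype": "value", "property": prop,
                         "datavalue": {"type": "string", "value": value}},
            "type": "statement", "rank": "normal"}

def dataset_entity_payload(label_en, doi, website):
    """Build the `data` JSON for a wbeditentity new=item call."""
    return {
        "labels": {"en": {"language": "en", "value": label_en}},
        "claims": {
            "P31": [item_claim("P31", "Q1172284")],   # instance of: dataset
            "P356": [string_claim("P356", doi)],      # DOI
            "P856": [string_claim("P856", website)],  # official website
        },
    }

payload = dataset_entity_payload(
    "My Example Dataset",             # hypothetical label
    "10.5281/zenodo.0000000",         # placeholder DOI
    "https://example.org/mydataset",  # placeholder website
)
# POST to https://www.wikidata.org/w/api.php with
# action=wbeditentity&new=item&token=<csrf>&data=<json.dumps(payload)>
```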

7. GitHub Pages

A one-page static landing site helps human discoverability and SEO. Enable it on the repo's main branch.

8. Colab notebook

Drop a notebooks/quickstart.ipynb that loads the raw JSON from GitHub and does three example queries. Add the Colab badge to README:

```markdown
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ORG/REPO/blob/main/notebooks/quickstart.ipynb)
```

People click it; it dramatically lowers the "try it" barrier.

9. Hugging Face Space (Static)

A static Space is HTML + JS + CSS, no backend. Three files:

  • index.html — the UI
  • app.js — fetch dataset from GitHub raw, render
  • README.md with sdk: static in the YAML frontmatter

Cost: zero. Traffic: HF Spaces is indexed, browsable, trending. Ours is at DecDEPO/swedish-construction-faq-search.
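A sketch of the Space's README.md frontmatter; the title and emoji are placeholders, sdk: static is what tells Hugging Face to serve the files as-is:

```yaml
---
title: Dataset Search
emoji: 🔍
sdk: static
pinned: false
---
```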

10. Awesome-list PRs

awesome-X repos are still the organic-discovery hack of 2026. Find the relevant ones:

```shell
gh search repos "awesome-YOUR-TOPIC" --json fullName,stargazersCount --limit 50
```

Fork, add your entry under the matching section, open a PR. Keep the line format aligned with existing entries — maintainers reject inconsistent PRs fast.

11. Dev.to + Mastodon + HN

Three short announcements, roughly same copy, different lengths:

  • Mastodon: 500 chars, one link, 3-5 hashtags. Warning: em-dash (—) triggers HTTP 400 on the API. Use hyphens.
  • HN: Show HN: <name> — <one-line>. Wait for karma before spamming.
  • Dev.to: a proper blog post with frontmatter. The API accepts the whole markdown file in one POST. Use canonical_url if the same content lives elsewhere.
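For the Dev.to post, the whole markdown file (frontmatter included) goes in one POST to /api/articles with an api-key header. A minimal Python sketch that builds the request body; the title, tags, and canonical URL are placeholders, and the actual POST is only indicated in a comment:

```python
import json

def devto_article_body(markdown_text, canonical_url=None):
    """Build the JSON body for POST https://dev.to/api/articles.

    body_markdown may carry its own frontmatter (title, tags);
    canonical_url points search engines at the original copy.
    """
    article = {"body_markdown": markdown_text, "published": True}
    if canonical_url:
        article["canonical_url"] = canonical_url
    return json.dumps({"article": article})

body = devto_article_body(
    "---\ntitle: Shipping a dataset\ntags: opendata, datasets\n---\n\nHello.",
    canonical_url="https://example.org/blog/shipping",  # placeholder
)
# curl -X POST https://dev.to/api/articles -H "api-key: $DEVTO_KEY" \
#      -H "Content-Type: application/json" -d "$body"
```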

12. LinkedIn / Medium

Last, because they're manual. Write once for LinkedIn (short), once for Medium (long-form tutorial). Crosslink both back to the canonical GitHub.

Lessons from this particular rollout

  • The DOI is the glue. Once the DOI exists, every platform accepts the same metadata block. Mint it first, copy-paste everywhere.
  • Wikidata moves the needle more than expected. Google, Siri, Perplexity, and LLM training pipelines all read from Wikidata. A single QID gets your dataset into places you can't even list.
  • Write a tutorial article mid-rollout. We wrote the Wikidata how-to between rolling out to other platforms. It became its own distribution channel (and ranked for "wikidata open dataset tutorial" in under 24 hours).
  • Static Spaces > Gradio Spaces for demos. No build step, no dependencies, no cold-start. Instant load.
  • Don't forget CITATION.cff for companion datasets. We did it for the flagship only. Had to go back and enrich four companions a day later. Do them all at once.

The scorecard

From our own rollout:

  • 17 GitHub repos, all topic-tagged
  • 1 flagship + 5 companion datasets on Hugging Face
  • 6 Wikidata entities in a connected graph
  • 1 DOI, auto-updated on each release
  • 1 Colab notebook, 1 Static Space demo
  • 8 awesome-list PRs open
  • 4 Mastodon toots, 3 Dev.to posts, 1 HN submission, 1 LinkedIn post

All of it in 36 hours, mostly automated. Not a single paid channel.

The one step everyone skips

Cross-linking. After you've shipped to 12 platforms, go back and add the other 11 links to each one's metadata. Wikidata's P856, HF's homepage:, PyPI's project_urls, Kaggle's description. This is what makes the graph a graph.

That's it. One day, twelve platforms, zero spend. Go ship.


Swedish Construction FAQ is maintained by Zaragoza AB, Helsingborg. All datasets CC BY 4.0. Try the live demo: Space.
