DecDEPO

Distributing an open dataset across 12 platforms in one day: a playbook

We shipped the Swedish Construction FAQ dataset (503 bilingual Q&A pairs, CC BY 4.0, DOI 10.5281/zenodo.19630803) yesterday. By end of day it was on twelve platforms.

This is the checklist, in order. Each step takes 10-30 minutes. Most are scriptable.

The 12 platforms

| # | Platform | What it gives you | Effort |
|---|----------|-------------------|--------|
| 1 | GitHub | Canonical source, issues, Pages base | |
| 2 | Zenodo | DOI, Scholar indexing, archival | 10 min |
| 3 | Hugging Face Datasets | ML audience, load_dataset() UX | 15 min |
| 4 | PyPI | pip install for Python users | 20 min |
| 5 | Kaggle | Data-science audience | 10 min |
| 6 | Wikidata | Knowledge graph, LLM indexing | 30 min |
| 7 | GitHub Pages | Static landing page, SEO | 5 min |
| 8 | Colab notebook | One-click "try it" UX | 15 min |
| 9 | Hugging Face Space | Live interactive demo | 20 min |
| 10 | Awesome-list PRs | Organic discovery | 15 min/list |
| 11 | Dev.to / Mastodon / HN | Human announcement reach | 10 min each |
| 12 | LinkedIn / Medium | Longer-form professional reach | 15 min each |

What to ship before you announce

Before posting anywhere, finalize these files in the repo:

  • [x] README.md with badges (license, DOI, language, format)
  • [x] LICENSE (CC BY 4.0 for data, MIT for any code)
  • [x] CITATION.cff with the DOI and a preferred-citation block
  • [x] .zenodo.json so Zenodo picks up rich metadata on release
  • [x] llms.txt at repo root — tells LLMs where to find the canonical metadata
  • [x] Data files in multiple formats: JSON, JSONL, CSV, Alpaca, ShareGPT

If you skip these, every downstream platform will have thin, inconsistent metadata.
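The llms.txt file from the checklist follows the llms.txt convention: an H1 with the project name, a one-line blockquote summary, then H2 sections of annotated links. A minimal sketch, with placeholder org/repo paths:

```markdown
# My Example Dataset

> 503 bilingual Q&A pairs about Swedish construction, CC BY 4.0.

## Data

- [Canonical JSON](https://raw.githubusercontent.com/ORG/REPO/main/data/faq.json): full dataset
- [Citation metadata](https://raw.githubusercontent.com/ORG/REPO/main/CITATION.cff): DOI and authors
```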

The scripting order

1. GitHub first

Create the repo, push everything, tag a release. Zenodo listens for GitHub releases via webhook.

```shell
gh repo create myorg/mydataset --public --source=.
git push -u origin main
gh release create v1.0.0 --title "Initial release" --notes "..."
```

2. Zenodo — before anything else downstream

The DOI becomes part of every downstream card, so mint it before you mirror anywhere. Don't mirror first and then retrofit the DOI everywhere.

  • Link your GitHub account to Zenodo
  • Flip the switch on your repo
  • Cut a release → Zenodo mints a DOI automatically
  • .zenodo.json in the repo controls metadata (creators, keywords, related_identifiers)
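A sketch of what that .zenodo.json can look like. The field names follow Zenodo's deposit metadata schema; the title, creator, and repo URL are placeholders:

```json
{
  "title": "My Example Dataset",
  "upload_type": "dataset",
  "description": "503 bilingual Q&A pairs, CC BY 4.0.",
  "creators": [{"name": "Doe, Jane", "affiliation": "Example AB"}],
  "license": "cc-by-4.0",
  "keywords": ["construction", "faq", "swedish"],
  "related_identifiers": [
    {"identifier": "https://github.com/ORG/REPO", "relation": "isSupplementTo"}
  ]
}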

3. Hugging Face Datasets

```shell
# Create the dataset repo via API
curl -X POST https://huggingface.co/api/repos/create \
  -H "Authorization: Bearer $HF_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"type":"dataset","name":"mydataset","organization":"MYORG"}'

# Then it's just git
git clone https://USER:$HF_TOKEN@huggingface.co/datasets/MYORG/mydataset
# Add your data/, add README.md with the YAML card, push
```

The dataset card YAML matters. Use configs: with split paths — it unlocks the in-browser dataset viewer.
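A sketch of the card frontmatter with that configs: block; the split and path values are placeholders for your own file layout:

```yaml
---
license: cc-by-4.0
language:
  - sv
  - en
configs:
  - config_name: default
    data_files:
      - split: train
        path: data/faq.jsonl
---
```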

4. PyPI wrapper

Ship a tiny Python package that just bundles the data and exposes a loader:

```python
# pyproject.toml + one module that does:
import json, importlib.resources as resources

def load():
    # Read faq.json bundled inside the installed package
    return json.loads(resources.files(__package__).joinpath("faq.json").read_text("utf-8"))
```

Publish with python -m build && twine upload dist/*. Users get pip install your-dataset-name.
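A minimal pyproject.toml sketch to go with that loader. The package name and data-file entry are placeholders; the package-data table is what makes setuptools ship faq.json inside the wheel:

```toml
[build-system]
requires = ["setuptools>=61"]
build-backend = "setuptools.build_meta"

[project]
name = "my-dataset-name"
version = "1.0.0"
description = "Loader for the bundled FAQ dataset"
requires-python = ">=3.9"

[tool.setuptools.package-data]
"my_dataset_name" = ["faq.json"]
```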

5. Kaggle mirror

One-time manual upload via the web form, then the Kaggle API for updates:

```shell
kaggle datasets version -p . -m "Update to v1.0.0"
```

Tag generously. Kaggle's search weights tags heavily.
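The Kaggle CLI reads a dataset-metadata.json next to the data files. A minimal sketch with a placeholder id and title; check the license name against Kaggle's accepted identifiers, since only a fixed list is allowed:

```json
{
  "id": "myorg/mydataset",
  "title": "My Example Dataset",
  "licenses": [{"name": "CC-BY-4.0"}]
}
```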

6. Wikidata

This is where most datasets stop — and why their metadata never gets into LLM training pipelines. The MediaWiki API is a bit arcane but the full flow is ~30 lines of Node. I wrote the how-to here.

Create at minimum:

  • The dataset entity (P31 = Q1172284 "dataset")
  • The publisher entity (your company / yourself)
  • P123 (publisher) + P170 (creator) links between them
  • P275 (license), P356 (DOI), P407 (language), P856 (official website)
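The entity-creation call is action=wbeditentity with new=item and a JSON data payload. A minimal Python sketch of building that payload for the dataset item, using the property IDs listed above; the label, DOI, and URL are placeholders, and the actual POST (which needs a logged-in session and CSRF token) is only indicated in a comment:

```python
import json

def item_claim(prop, qid):
    # Claim whose value is another Wikidata item (P31, P275, P407, ...)
    return {"mainsnak": {"snaktype": "value", "property": prop,
                         "datavalue": {"type": "wikibase-entityid",
                                       "value": {"entity-type": "item", "id": qid}}},
            "type": "statement", "rank": "normal"}

def string_claim(prop, value):
    # Claim whose value is a plain string (P356 DOI, P856 URL, ...)
    return {"mainsnak": {"snaktype": "value", "property": prop,
                         "datavalue": {"type": "string", "value": value}},
            "type": "statement", "rank": "normal"}

def dataset_entity_payload(label_en, doi, website):
    """Build the `data` JSON for a wbeditentity new=item call."""
    return {
        "labels": {"en": {"language": "en", "value": label_en}},
        "claims": {
            "P31": [item_claim("P31", "Q1172284")],   # instance of: dataset
            "P356": [string_claim("P356", doi)],      # DOI
            "P856": [string_claim("P856", website)],  # official website
        },
    }

payload = dataset_entity_payload(
    "My Example Dataset",             # hypothetical label
    "10.5281/zenodo.0000000",         # placeholder DOI
    "https://example.org/mydataset",  # placeholder website
)
# POST to https://www.wikidata.org/w/api.php with
# action=wbeditentity&new=item&token=<csrf>&data=<json.dumps(payload)>
```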

7. GitHub Pages

A one-page static landing site helps human discoverability and SEO. Enable it on the repo's main branch.

8. Colab notebook

Drop a notebooks/quickstart.ipynb that loads the raw JSON from GitHub and does three example queries. Add the Colab badge to README:

```markdown
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ORG/REPO/blob/main/notebooks/quickstart.ipynb)
```

People click it; it dramatically lowers the "try it" barrier.

9. Hugging Face Space (Static)

A static Space is HTML + JS + CSS, no backend. Three files:

  • index.html — the UI
  • app.js — fetch dataset from GitHub raw, render
  • README.md with sdk: static in the YAML frontmatter

Cost: zero. Traffic: HF Spaces is indexed, browsable, trending. Ours is at DecDEPO/swedish-construction-faq-search.
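A sketch of the Space's README.md frontmatter; the title and emoji are placeholders, sdk: static is what tells Hugging Face to serve the files as-is:

```yaml
---
title: Dataset Search
emoji: 🔍
sdk: static
pinned: false
---
```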

10. Awesome-list PRs

awesome-X repos are still the organic-discovery hack of 2026. Find the relevant ones:

```shell
gh search repos "awesome-YOUR-TOPIC" --json fullName,stargazersCount --limit 50
```

Fork, add your entry under the matching section, open a PR. Keep the line format aligned with existing entries — maintainers reject inconsistent PRs fast.

11. Dev.to + Mastodon + HN

Three short announcements, roughly same copy, different lengths:

  • Mastodon: 500 chars, one link, 3-5 hashtags. Warning: em-dash (—) triggers HTTP 400 on the API. Use hyphens.
  • HN: Show HN: <name> — <one-line>. Wait for karma before spamming.
  • Dev.to: a proper blog post with frontmatter. The API accepts the whole markdown file in one POST. Use canonical_url if the same content lives elsewhere.
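For the Dev.to post, the whole markdown file (frontmatter included) goes in one POST to /api/articles with an api-key header. A minimal Python sketch that builds the request body; the title, tags, and canonical URL are placeholders, and the actual POST is only indicated in a comment:

```python
import json

def devto_article_body(markdown_text, canonical_url=None):
    """Build the JSON body for POST https://dev.to/api/articles.

    body_markdown may carry its own frontmatter (title, tags);
    canonical_url points search engines at the original copy.
    """
    article = {"body_markdown": markdown_text, "published": True}
    if canonical_url:
        article["canonical_url"] = canonical_url
    return json.dumps({"article": article})

body = devto_article_body(
    "---\ntitle: Shipping a dataset\ntags: opendata, datasets\n---\n\nHello.",
    canonical_url="https://example.org/blog/shipping",  # placeholder
)
# curl -X POST https://dev.to/api/articles -H "api-key: $DEVTO_KEY" \
#      -H "Content-Type: application/json" -d "$body"
```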

12. LinkedIn / Medium

Last, because they're manual. Write once for LinkedIn (short), once for Medium (long-form tutorial). Crosslink both back to the canonical GitHub.

Lessons from this particular rollout

  • The DOI is the glue. Once the DOI exists, every platform accepts the same metadata block. Mint it first, copy-paste everywhere.
  • Wikidata moves the needle more than expected. Google, Siri, Perplexity, and LLM training pipelines all read from Wikidata. A single QID gets your dataset into places you can't even list.
  • Write a tutorial article mid-rollout. We wrote the Wikidata how-to between rolling out to other platforms. It became its own distribution channel (and ranked for "wikidata open dataset tutorial" in under 24 hours).
  • Static Spaces > Gradio Spaces for demos. No build step, no dependencies, no cold-start. Instant load.
  • Don't forget CITATION.cff for companion datasets. We did it for the flagship only. Had to go back and enrich four companions a day later. Do them all at once.

The scorecard

From our own rollout:

  • 17 GitHub repos, all topic-tagged
  • 1 flagship + 5 companion datasets on Hugging Face
  • 6 Wikidata entities in a connected graph
  • 1 DOI, auto-updated on each release
  • 1 Colab notebook, 1 Static Space demo
  • 8 awesome-list PRs open
  • 4 Mastodon toots, 3 Dev.to posts, 1 HN submission, 1 LinkedIn post

All of it in 36 hours, mostly automated. Not a single paid channel.

The one step everyone skips

Cross-linking. After you've shipped to 12 platforms, go back and add the other 11 links to each one's metadata. Wikidata's P856, HF's homepage:, PyPI's project_urls, Kaggle's description. This is what makes the graph a graph.

That's it. One day, twelve platforms, zero spend. Go ship.


Swedish Construction FAQ is maintained by Zaragoza AB, Helsingborg. All datasets CC BY 4.0. Try the live demo: Space.
