We shipped the Swedish Construction FAQ dataset (503 bilingual Q&A pairs, CC BY 4.0, DOI 10.5281/zenodo.19630803) yesterday. By end of day it was on twelve platforms.
This is the checklist, in order. Each step takes 10-30 minutes. Most are scriptable.
## The 12 platforms
| # | Platform | What it gives you | Effort |
|---|---|---|---|
| 1 | GitHub | Canonical source, issues, Pages | base |
| 2 | Zenodo | DOI, Scholar indexing, archival | 10 min |
| 3 | Hugging Face Datasets | ML audience, `load_dataset()` UX | 15 min |
| 4 | PyPI | `pip install` for Python users | 20 min |
| 5 | Kaggle | Data-science audience | 10 min |
| 6 | Wikidata | Knowledge graph, LLM indexing | 30 min |
| 7 | GitHub Pages | Static landing page, SEO | 5 min |
| 8 | Colab notebook | One-click "try it" UX | 15 min |
| 9 | Hugging Face Space | Live interactive demo | 20 min |
| 10 | Awesome-list PRs | Organic discovery | 15 min/list |
| 11 | Dev.to / Mastodon / HN | Human announcement reach | 10 min each |
| 12 | LinkedIn / Medium | Longer-form professional reach | 15 min each |
## What to ship before you announce
Before posting anywhere, finalize these files in the repo:
- [x] `README.md` with badges (license, DOI, language, format)
- [x] `LICENSE` (CC BY 4.0 for data, MIT for any code)
- [x] `CITATION.cff` with the DOI and a `preferred-citation` block
- [x] `.zenodo.json` so Zenodo picks up rich metadata on release
- [x] `llms.txt` at repo root — tells LLMs where to find the canonical metadata
- [x] Data files in multiple formats: JSON, JSONL, CSV, Alpaca, ShareGPT
If you skip these, every downstream platform will have thin, inconsistent metadata.
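For reference, a minimal `CITATION.cff` can look like the sketch below. All names, titles, and the DOI are placeholders, and I'm assuming `data` as the reference type (that's the CFF 1.2.0 schema's type for datasets):

```yaml
# CITATION.cff -- minimal sketch; every value here is a placeholder
cff-version: 1.2.0
message: "If you use this dataset, please cite it."
title: "My Dataset"
authors:
  - family-names: Doe
    given-names: Jane
doi: 10.5281/zenodo.0000000
license: CC-BY-4.0
preferred-citation:
  type: data
  title: "My Dataset"
  authors:
    - family-names: Doe
      given-names: Jane
  doi: 10.5281/zenodo.0000000
```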
## The scripting order

### 1. GitHub first
Create the repo, push everything, tag a release. Zenodo listens for GitHub releases via webhook.
```bash
gh repo create myorg/mydataset --public --source=.
git push -u origin main
gh release create v1.0.0 --title "Initial release" --notes "..."
```
### 2. Zenodo — before anything else downstream
Because the DOI becomes part of every downstream card. Don't mirror first and then retrofit the DOI everywhere.
- Link your GitHub account to Zenodo
- Flip the switch on your repo
- Cut a release → Zenodo mints a DOI automatically
- `.zenodo.json` in the repo controls metadata (creators, keywords, related_identifiers)
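A minimal `.zenodo.json` along those lines might look like this (all values are placeholders; check the Zenodo deposit-metadata docs for the full field list):

```json
{
  "title": "My Dataset",
  "upload_type": "dataset",
  "license": "cc-by-4.0",
  "creators": [{"name": "Doe, Jane", "affiliation": "MyOrg"}],
  "keywords": ["faq", "bilingual", "construction"],
  "related_identifiers": [
    {
      "identifier": "https://github.com/myorg/mydataset",
      "relation": "isSupplementTo",
      "scheme": "url"
    }
  ]
}
```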
### 3. Hugging Face Datasets
```bash
# Create the dataset repo via API
curl -X POST https://huggingface.co/api/repos/create \
  -H "Authorization: Bearer $HF_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"type":"dataset","name":"mydataset","organization":"MYORG"}'

# Then it's just git
git clone https://USER:$HF_TOKEN@huggingface.co/datasets/MYORG/mydataset
# Add your data/, add README.md with the YAML card, push
```
The dataset card YAML matters. Use `configs:` with split paths — it unlocks the in-browser dataset viewer.
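A sketch of that card frontmatter, assuming a single JSONL file; the config name and path are placeholders:

```yaml
# README.md YAML frontmatter on the HF dataset repo
license: cc-by-4.0
language:
  - sv
  - en
configs:
  - config_name: default
    data_files:
      - split: train
        path: data/faq.jsonl
```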
### 4. PyPI wrapper
Ship a tiny Python package that just bundles the data and exposes a loader:
```python
# pyproject.toml + one module that does:
import json
import importlib.resources as r

def load():
    return json.loads(
        r.files(__package__).joinpath('faq.json').read_text('utf-8')
    )
```
Publish with `python -m build && twine upload dist/*`. Users get `pip install your-dataset-name`.
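The packaging side can be as small as this `pyproject.toml` sketch (setuptools assumed; names are placeholders, and `faq.json` must sit inside the package directory so `package-data` picks it up):

```toml
[build-system]
requires = ["setuptools>=61"]
build-backend = "setuptools.build_meta"

[project]
name = "your-dataset-name"
version = "1.0.0"
description = "Loader for My Dataset"
requires-python = ">=3.9"

[tool.setuptools.package-data]
your_dataset_name = ["faq.json"]
```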
### 5. Kaggle mirror
One-time manual upload via the web form, then the Kaggle API for updates:
```bash
kaggle datasets version -p . -m "Update to v1.0.0"
```
Tag generously. Kaggle's search weights tags heavily.
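For the API to work, a `dataset-metadata.json` has to sit next to the data. A sketch — the `id` slug is a placeholder, and the exact license identifier string Kaggle expects is an assumption worth double-checking against their docs:

```json
{
  "title": "My Dataset",
  "id": "myuser/mydataset",
  "licenses": [{"name": "CC-BY-4.0"}]
}
```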
### 6. Wikidata
This is where most datasets stop — and why their metadata never gets into LLM training pipelines. The MediaWiki API is a bit arcane but the full flow is ~30 lines of Node. I wrote the how-to here.
Create at minimum:
- The dataset entity (`P31` → `Q1172284` "dataset")
- The publisher entity (your company / yourself)
- `P123` (publisher) + `P170` (creator) links between them
- `P275` (license), `P356` (DOI), `P407` (language), `P856` (official website)
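The original how-to used Node; here's a hedged Python sketch of just the payload-building half, using the property IDs from the list above. `Q4115189` is the Wikidata sandbox item, standing in for a real publisher QID, and the DOI is a placeholder. The actual `wbeditentity` POST also needs a login session and CSRF token, which this omits:

```python
import json

def item_claim(prop, qid):
    """Claim whose value is another Wikidata item (e.g. P31 -> Q1172284)."""
    return {"mainsnak": {"snaktype": "value", "property": prop,
                         "datavalue": {"type": "wikibase-entityid",
                                       "value": {"entity-type": "item",
                                                 "id": qid}}},
            "type": "statement", "rank": "normal"}

def string_claim(prop, value):
    """Claim whose value is a plain string (DOI, URL)."""
    return {"mainsnak": {"snaktype": "value", "property": prop,
                         "datavalue": {"type": "string", "value": value}},
            "type": "statement", "rank": "normal"}

def dataset_entity(label, doi, website, publisher_qid):
    """Build the `data` argument for MediaWiki's wbeditentity action."""
    return {
        "labels": {"en": {"language": "en", "value": label}},
        "claims": {
            "P31": [item_claim("P31", "Q1172284")],       # instance of: dataset
            "P123": [item_claim("P123", publisher_qid)],  # publisher
            "P356": [string_claim("P356", doi)],          # DOI
            "P856": [string_claim("P856", website)],      # official website
        },
    }

payload = dataset_entity("My Dataset", "10.5281/zenodo.0000000",
                         "https://example.com", "Q4115189")
data = json.dumps(payload)  # goes into the `data` form field of wbeditentity
```

The POST itself goes to `https://www.wikidata.org/w/api.php` with `action=wbeditentity&new=item`.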
### 7. GitHub Pages
A one-page static landing site helps human discoverability and SEO. Enable it on the repo's main branch.
### 8. Colab notebook
Drop a `notebooks/quickstart.ipynb` that loads the raw JSON from GitHub and does three example queries. Add the Colab badge to README:

```markdown
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ORG/REPO/blob/main/notebooks/quickstart.ipynb)
```
People click it. Dramatically lowers the "try it" barrier.
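The notebook itself can be a few cells along these lines. The raw URL and the `question`/`answer` field names are assumptions — adjust to the repo's actual layout:

```python
import json
from urllib.request import urlopen

# Hypothetical raw-file URL; replace with the real path in the repo.
RAW_URL = "https://raw.githubusercontent.com/ORG/REPO/main/data/faq.json"

def load_faq(url=RAW_URL):
    """Fetch the dataset straight from GitHub raw (no pip install needed)."""
    with urlopen(url) as resp:
        return json.load(resp)

def search(pairs, keyword):
    """Case-insensitive keyword match over questions."""
    kw = keyword.lower()
    return [p for p in pairs if kw in p["question"].lower()]

# Inline sample so the cell runs offline; the notebook would use load_faq().
pairs = [
    {"question": "What is a building permit?", "answer": "..."},
    {"question": "Do I need insurance?", "answer": "..."},
]
matches = search(pairs, "permit")
```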
### 9. Hugging Face Space (Static)
A static Space is HTML + JS + CSS, no backend. Three files:
- `index.html` — the UI
- `app.js` — fetch dataset from GitHub raw, render
- `README.md` with `sdk: static` in the YAML frontmatter
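The Space's README frontmatter can be as small as this (title is a placeholder):

```yaml
---
title: My Dataset Search
sdk: static
pinned: false
---
```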
Cost: zero. Traffic: HF Spaces is indexed, browsable, trending. Ours is at `DecDEPO/swedish-construction-faq-search`.
### 10. Awesome-list PRs
awesome-X repos are still the organic-discovery hack of 2026. Find the relevant ones:
```bash
gh search repos "awesome-YOUR-TOPIC" --json fullName,stargazersCount --limit 50
```
Fork, add your entry under the matching section, open a PR. Keep the line format aligned with existing entries — maintainers reject inconsistent PRs fast.
### 11. Dev.to + Mastodon + HN
Three short announcements, roughly same copy, different lengths:
- Mastodon: 500 chars, one link, 3-5 hashtags. Warning: em-dash (—) triggers HTTP 400 on the API. Use hyphens.
- HN: `Show HN: <name> — <one-line>`. Wait for karma before spamming.
- Dev.to: a proper blog post with frontmatter. The API accepts the whole markdown file in one POST. Use `canonical_url` if the same content lives elsewhere.
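Given the em-dash gotcha above, it's worth sanitizing programmatically. A hedged Python sketch that builds the form payload for Mastodon's real `POST /api/v1/statuses` endpoint (the link, text, and hashtags are placeholders; the actual request would add an `Authorization: Bearer` header):

```python
def mastodon_status(text, link, hashtags, limit=500):
    """Build the payload for POST /api/v1/statuses, replacing em-dashes
    with hyphens (per the warning above, em-dashes can trigger HTTP 400)."""
    body = text.replace("\u2014", "-")
    tags = " ".join("#" + t for t in hashtags)
    status = f"{body}\n\n{link}\n{tags}"
    if len(status) > limit:
        raise ValueError(f"status is {len(status)} chars, over {limit}")
    return {"status": status, "visibility": "public"}

payload = mastodon_status(
    "New dataset \u2014 503 bilingual Q&A pairs",
    "https://github.com/ORG/REPO",
    ["dataset", "opendata", "nlp"],
)
```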
### 12. LinkedIn / Medium
Last, because they're manual. Write once for LinkedIn (short), once for Medium (long-form tutorial). Crosslink both back to the canonical GitHub.
## Lessons from this particular rollout
- The DOI is the glue. Once the DOI exists, every platform accepts the same metadata block. Mint it first, copy-paste everywhere.
- Wikidata moves the needle more than expected. Google, Siri, Perplexity, and LLM training pipelines all read from Wikidata. A single QID gets your dataset into places you can't even list.
- Write a tutorial article mid-rollout. We wrote the Wikidata how-to between rolling out to other platforms. It became its own distribution channel (and ranked for "wikidata open dataset tutorial" in under 24 hours).
- Static Spaces > gradio Spaces for demos. No build step, no dependencies, no cold-start. Instant load.
- Don't forget `CITATION.cff` for companion datasets. We did it for the flagship only and had to go back and enrich four companions a day later. Do them all at once.
## The scorecard
From our own rollout:
- 17 GitHub repos, all topic-tagged
- 1 flagship + 5 companion datasets on Hugging Face
- 6 Wikidata entities in a connected graph
- 1 DOI, auto-updated on each release
- 1 Colab notebook, 1 Static Space demo
- 8 awesome-list PRs open
- 4 Mastodon toots, 3 Dev.to posts, 1 HN submission, 1 LinkedIn post
All of it in 36 hours, mostly automated. Not a single paid channel.
## The one step everyone skips
Cross-linking. After you've shipped to 12 platforms, go back and add the other 11 links to each one's metadata: Wikidata's `P856`, HF's `homepage:`, PyPI's `project_urls`, Kaggle's description. This is what makes the graph a graph.
That's it. One day, twelve platforms, zero spend. Go ship.
Swedish Construction FAQ is maintained by Zaragoza AB, Helsingborg. All datasets CC BY 4.0. Try the live demo: Space.