Last week we released the Swedish Construction FAQ — 503 bilingual Q&A pairs, CC BY 4.0, DOI 10.5281/zenodo.19630803. The dataset got onto Hugging Face, Zenodo, Kaggle, PyPI — the usual distribution stack.
But one step moved the needle more than the others: putting it on Wikidata as a first-class entity.
This post is a practical walkthrough. No philosophy, just what we did and why.
What a Wikidata entity gets you
- A stable identifier — `Q139393633` is the QID for our dataset. It'll outlive our GitHub account, our domain, our company.
- Machine-readable citability — Wikidata is the knowledge graph that Google, Siri, Alexa, OpenAI, Anthropic, Perplexity, and basically every LLM training pipeline reads from.
- A place for the DOI and license to live together — P275 (license) + P356 (DOI) + P407 (language) + P1476 (title) as proper RDF triples.
- Free cross-linking — every related entity (the company, the companion datasets, the subject matter) shows up in the `Special:WhatLinksHere` graph.
- It costs nothing — no API key, no app registration, no fees. Just a SUL account.
The entities we created
For the FAQ dataset alone, we ended up with six connected entities:
| QID | Entity | What it is |
|---|---|---|
| `Q139393633` | Swedish Construction FAQ | The flagship dataset |
| `Q139393658` | Zaragoza AB | The publisher/creator (our company) |
| `Q139393817` | Construction Terminology Glossary | Companion trilingual glossary |
| `Q139393818` | Building Materials Specifications | Companion spec dataset |
| `Q139393819` | Construction Inspection Templates | Companion template dataset |
| `Q139393821` | Renovation Timeline Reference | Companion timeline dataset |
Each companion dataset has P123 (publisher) and P170 (creator) pointing to Q139393658, which means the company's entity automatically aggregates everything we publish.
The minimum viable dataset entity
Here's the shape we used — it's the minimum that actually gets you into the graph in a way that indexers respect.
{
"labels": {
"en": { "language": "en", "value": "Swedish Construction FAQ" },
"sv": { "language": "sv", "value": "Svensk byggbransch FAQ" }
},
"descriptions": {
"en": { "language": "en", "value": "open bilingual Q&A dataset on Swedish construction" }
},
"claims": [
{"property": "P31", "value": "Q1172284"}, // instance of → dataset
{"property": "P275", "value": "Q20007257"}, // license → CC BY 4.0
{"property": "P356", "value": "10.5281/zenodo.19630803"}, // DOI
{"property": "P407", "value": "Q9027"}, // language → Swedish
{"property": "P407", "value": "Q1860"}, // language → English
{"property": "P123", "value": "Q139393658"}, // publisher → Zaragoza AB
{"property": "P170", "value": "Q139393658"}, // creator → Zaragoza AB
{"property": "P856", "value": "github.com/..."}, // official website
{"property": "P1324", "value": "github.com/..."},// source repo
{"property": "P577", "value": "+2026-04-17"} // publication date
]
}
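A caveat on that shape: it's our shorthand, not the literal `wbeditentity` payload. Wikibase expects each claim as a full snak (`mainsnak`, `snaktype`, `datavalue`). A minimal converter sketch — the `ITEM_PROPS` set is an assumption covering only the properties used above, and real properties also use datatypes like `time` and `url` that this skips:

```javascript
// Which of our shorthand properties take item (QID) values;
// everything else here is sent as a plain string. A sketch, not
// a complete property-to-datatype map.
const ITEM_PROPS = new Set(['P31', 'P275', 'P407', 'P123', 'P170']);

// Build one Wikibase snak from a shorthand property/value pair.
function toSnak(property, value) {
  const datavalue = ITEM_PROPS.has(property)
    ? { type: 'wikibase-entityid',
        value: { 'entity-type': 'item', 'numeric-id': Number(value.slice(1)) } }
    : { type: 'string', value };
  return { snaktype: 'value', property, datavalue };
}

// Group shorthand claims into the object-keyed-by-property shape
// that wbeditentity actually accepts.
function toClaims(shorthand) {
  const claims = {};
  for (const { property, value } of shorthand) {
    (claims[property] ||= []).push({
      mainsnak: toSnak(property, value),
      type: 'statement',
      rank: 'normal',
    });
  }
  return claims;
}
```

So `toClaims([{ property: 'P31', value: 'Q1172284' }])` produces a `claims` object you can drop into the entity JSON before stringifying it for the API.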
Calling the MediaWiki API from Node
Authenticating takes three requests: get a login token, log in, get a CSRF token. Then you can call `wbeditentity`.
const API = 'https://www.wikidata.org/w/api.php';
const jar = new Map();
const cookie = () => [...jar].map(([k,v]) => `${k}=${v}`).join('; ');
async function call(params, method = 'POST') {
const body = new URLSearchParams({ ...params, format: 'json' });
const opts = {
method,
headers: { 'User-Agent': 'your-app/1.0', 'Cookie': cookie() },
};
if (method === 'POST') {
opts.headers['Content-Type'] = 'application/x-www-form-urlencoded';
opts.body = body;
}
const res = await fetch(method === 'GET' ? `${API}?${body}` : API, opts);
// Parse cookies into the jar so subsequent calls are authenticated
(res.headers.get('set-cookie') || '').split(/,(?=[^;]+=)/).forEach(c => {
const [kv] = c.split(';'); const [k,v] = kv.split('=');
if (k && v) jar.set(k.trim(), v.trim());
});
return res.json();
}
// 1. Login token
const lt = await call({ action: 'query', meta: 'tokens', type: 'login' }, 'GET');
// 2. Login (SUL credentials work — no bot password required on Wikidata)
await call({
action: 'login',
lgname: process.env.WIKI_USER,
lgpassword: process.env.WIKI_PASS,
lgtoken: lt.query.tokens.logintoken,
});
// 3. CSRF token for the edit
const ct = await call({ action: 'query', meta: 'tokens', type: 'csrf' }, 'GET');
// 4. Create the entity
const r = await call({
action: 'wbeditentity',
new: 'item',
data: JSON.stringify(entity),
token: ct.query.tokens.csrftoken,
summary: 'Create entity for Swedish Construction FAQ (CC BY 4.0)',
});
console.log(r.entity.id); // → "Q139393633"
That's the whole thing. About 30 lines of code, one HTTP session, and your dataset has a permanent identity in the global knowledge graph.
Gotchas we hit
- Abuse filter 11 ("badwords") will flag entities with the word "Awesome" in the label. We had to rename one repo's entity description. It doesn't matter what the actual word is — if the filter warns, your write fails with `abusefilter-warning` and you have to retry with different wording.
- CAPTCHAs appear on newer accounts when you add external links. On `sv.wikipedia.org`, we had to strip URLs from our first edit. On `wikidata.org`, none of our edits hit a CAPTCHA — the Wikibase API is more permissive here.
- SUL (Single User Login) isn't automatic across all projects. Our account worked on `wikidata.org` from day one but not on `sv.wikipedia.org` (login failed with "wrongpassword"). If you want to edit multiple projects, expect one manual login per project to "attach" your SUL.
- Rate limits are generous. We created six entities in under five minutes with a 2-second sleep between them. No throttling.
Connecting the graph
Once the entities exist, the real value comes from linking them. We made the dataset a "work by" the company entity:
- `Q139393633` (dataset) — has `P123` and `P170` → `Q139393658` (company)
- All companion datasets do the same
- The company entity itself has `P31` (aktiebolag, a Swedish limited company), `P17` (Sweden), `P159` (Helsingborg), `P452` (construction industry)
Run SPARQL on query.wikidata.org and the whole graph falls out:
SELECT ?dataset ?datasetLabel WHERE {
?dataset wdt:P123 wd:Q139393658 .
?dataset wdt:P31 wd:Q1172284 .
SERVICE wikibase:label { bd:serviceParam wikibase:language "en,sv" }
}
That's how external tools find everything we've published — one query.
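Running that query programmatically is one GET against the public endpoint, which speaks the standard SPARQL 1.1 protocol (`?query=...&format=json`). A small sketch — the helper names are ours, and the network call itself is shown only as a comment so the snippet stays self-contained:

```javascript
// Build a request URL for the Wikidata SPARQL endpoint.
const ENDPOINT = 'https://query.wikidata.org/sparql';

function sparqlUrl(query) {
  return `${ENDPOINT}?${new URLSearchParams({ query, format: 'json' })}`;
}

// Pull one variable's values out of a SPARQL JSON result set
// ({ results: { bindings: [ { var: { value } } ] } }).
function bindings(result, variable) {
  return result.results.bindings.map(b => b[variable].value);
}

// Usage (requires network; send a descriptive User-Agent per
// Wikimedia's policy):
//   const res = await fetch(sparqlUrl(query), {
//     headers: { 'User-Agent': 'your-app/1.0' },
//   });
//   const labels = bindings(await res.json(), 'datasetLabel');
```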
What changed after we did this
Small, obvious things:
- The Hugging Face dataset card now links to six Wikidata QIDs — readers click through.
- Our Zenodo record has `related_identifiers` pointing at the Wikidata URI — so when someone cites the DOI, the graph picks up the citation.
- Wikidata's own "What links here" page now serves as a free backlink-monitoring tool for us.
None of this is rocket science. But it's the one layer most open datasets skip.
Go do it
Your dataset already has a DOI, probably a GitHub repo, probably a Hugging Face mirror. Putting it on Wikidata takes ~30 lines of Node and one afternoon.
The entity you create today will outlive your current URLs. That's worth the afternoon.
Swedish Construction FAQ is maintained by Zaragoza AB, Helsingborg. All datasets CC BY 4.0. Explore the graph: Q139393633 · Q139393658.