DEV Community

DecDEPO

How I put my open dataset on the Wikidata knowledge graph (and why you should too)

Last week we released the Swedish Construction FAQ — 503 bilingual Q&A pairs, CC BY 4.0, DOI 10.5281/zenodo.19630803. The dataset got onto Hugging Face, Zenodo, Kaggle, PyPI — the usual distribution stack.

But one step moved the needle more than the others: putting it on Wikidata as a first-class entity.

This post is a practical walkthrough. No philosophy, just what we did and why.

What a Wikidata entity gets you

  1. A stable identifier — Q139393633 is the QID for our dataset. It'll outlive our GitHub account, our domain, our company.
  2. Machine-readable citability — Wikidata is the knowledge graph that Google, Siri, Alexa, OpenAI, Anthropic, Perplexity, and basically every LLM training pipeline reads from.
  3. A place for the DOI and license to live together — P275 (license) + P356 (DOI) + P407 (language) + P1476 (title) as proper RDF triples.
  4. Free cross-linking — every related entity (the company, the companion datasets, the subject matter) shows up in the Special:WhatLinksHere graph.
  5. It costs nothing — no API key, no app registration, no fees. Just a SUL account.

The entities we created

For the FAQ dataset alone, we ended up with six connected entities:

| QID | Entity | What it is |
| --- | --- | --- |
| Q139393633 | Swedish Construction FAQ | The flagship dataset |
| Q139393658 | Zaragoza AB | The publisher/creator (our company) |
| Q139393817 | Construction Terminology Glossary | Companion trilingual glossary |
| Q139393818 | Building Materials Specifications | Companion spec dataset |
| Q139393819 | Construction Inspection Templates | Companion template dataset |
| Q139393821 | Renovation Timeline Reference | Companion timeline dataset |

Each companion dataset has P123 (publisher) and P170 (creator) pointing to Q139393658, which means the company's entity automatically aggregates everything we publish.

The minimum viable dataset entity

Here's the shape we used — it's the minimum that actually gets you into the graph in a way that indexers respect.

{
  "labels": {
    "en": { "language": "en", "value": "Swedish Construction FAQ" },
    "sv": { "language": "sv", "value": "Svensk byggbransch FAQ" }
  },
  "descriptions": {
    "en": { "language": "en", "value": "open bilingual Q&A dataset on Swedish construction" }
  },
  "claims": [
    {"property": "P31",  "value": "Q1172284"},       // instance of → dataset
    {"property": "P275", "value": "Q20007257"},      // license → CC BY 4.0
    {"property": "P356", "value": "10.5281/zenodo.19630803"}, // DOI
    {"property": "P407", "value": "Q9027"},          // language → Swedish
    {"property": "P407", "value": "Q1860"},          // language → English
    {"property": "P123", "value": "Q139393658"},     // publisher → Zaragoza AB
    {"property": "P170", "value": "Q139393658"},     // creator → Zaragoza AB
    {"property": "P856", "value": "github.com/..."}, // official website
    {"property": "P1324", "value": "github.com/..."},// source code repository
    {"property": "P577", "value": "+2026-04-17"}     // publication date
  ]
}
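One caveat: the `claims` array above is shorthand for readability. What `wbeditentity` actually accepts is a map of property IDs to full statement objects with "snaks" inside. Here's a sketch of a converter — `toClaim` is our hypothetical helper, and the value-type detection is heuristic (it guesses the datatype from the value's shape, which worked for our properties but isn't general):

```javascript
// Hypothetical helper: expand simplified {property, value} pairs into the
// statement structure that wbeditentity expects. Datatype detection here is
// a heuristic sketch — check each property's actual datatype on Wikidata.
function toClaim(property, value) {
  let datavalue;
  if (/^Q\d+$/.test(value)) {
    // Item-valued properties (P31, P275, P407, P123, P170)
    datavalue = {
      type: 'wikibase-entityid',
      value: { 'entity-type': 'item', 'numeric-id': Number(value.slice(1)) },
    };
  } else if (/^\+\d{4}-\d{2}-\d{2}$/.test(value)) {
    // Time-valued properties (P577); precision 11 = day
    datavalue = {
      type: 'time',
      value: {
        time: `${value}T00:00:00Z`,
        timezone: 0, before: 0, after: 0, precision: 11,
        calendarmodel: 'http://www.wikidata.org/entity/Q1985727',
      },
    };
  } else {
    // String- and URL-valued properties (P356, P856, P1324)
    datavalue = { type: 'string', value };
  }
  return {
    type: 'statement',
    rank: 'normal',
    mainsnak: { snaktype: 'value', property, datavalue },
  };
}

// In the `data` payload, claims are keyed by property ID:
const claims = {};
for (const { property, value } of [
  { property: 'P31', value: 'Q1172284' },
  { property: 'P356', value: '10.5281/zenodo.19630803' },
]) {
  (claims[property] ??= []).push(toClaim(property, value));
}
```

Labels and descriptions, by contrast, go through exactly as written in the shape above.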

Calling the MediaWiki API from Node

The API takes three requests: get a login token, log in, get a CSRF token. Then you can call wbeditentity.

const API = 'https://www.wikidata.org/w/api.php';
const jar = new Map();
const cookie = () => [...jar].map(([k,v]) => `${k}=${v}`).join('; ');

async function call(params, method = 'POST') {
  const body = new URLSearchParams({ ...params, format: 'json' });
  const opts = {
    method,
    headers: { 'User-Agent': 'your-app/1.0', 'Cookie': cookie() },
  };
  if (method === 'POST') {
    opts.headers['Content-Type'] = 'application/x-www-form-urlencoded';
    opts.body = body;
  }
  const res = await fetch(method === 'GET' ? `${API}?${body}` : API, opts);
  // Parse cookies into the jar so subsequent calls are authenticated
  (res.headers.get('set-cookie') || '').split(/,(?=[^;]+=)/).forEach(c => {
    const [kv] = c.split(';'); const [k,v] = kv.split('=');
    if (k && v) jar.set(k.trim(), v.trim());
  });
  return res.json();
}

// 1. Login token
const lt = await call({ action: 'query', meta: 'tokens', type: 'login' }, 'GET');

// 2. Login (SUL credentials work — no bot password required on Wikidata)
await call({
  action: 'login',
  lgname: process.env.WIKI_USER,
  lgpassword: process.env.WIKI_PASS,
  lgtoken: lt.query.tokens.logintoken,
});

// 3. CSRF token for the edit
const ct = await call({ action: 'query', meta: 'tokens', type: 'csrf' }, 'GET');

// 4. Create the entity
const r = await call({
  action: 'wbeditentity',
  new: 'item',
  data: JSON.stringify(entity),
  token: ct.query.tokens.csrftoken,
  summary: 'Create entity for Swedish Construction FAQ (CC BY 4.0)',
});

console.log(r.entity.id); // → "Q139393633"

That's the whole thing. About 30 lines of code, one HTTP session, and your dataset has a permanent identity in the global knowledge graph.
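A quick way to verify the write landed: every entity is served as JSON through the public Special:EntityData endpoint, no auth needed. `fetchEntity` here is a hypothetical helper, not part of the script above:

```javascript
// Read a newly created entity back through the public JSON endpoint.
const entityUrl = (qid) =>
  `https://www.wikidata.org/wiki/Special:EntityData/${qid}.json`;

async function fetchEntity(qid) {
  const res = await fetch(entityUrl(qid), {
    headers: { 'User-Agent': 'your-app/1.0' },
  });
  const json = await res.json();
  return json.entities[qid]; // labels, descriptions, claims — all public
}

// fetchEntity('Q139393633').then(e => console.log(e.labels.en.value));
```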

Gotchas we hit

  • Abuse filter 11 ("badwords") flagged the word "Awesome" in one of our entity descriptions, and we had to reword it. The specific word doesn't matter — if the filter warns, your write fails with abusefilter-warning and you have to retry with different wording.
  • CAPTCHAs appear on newer accounts when you add external links. On sv.wikipedia.org, we had to strip URLs from our first edit. On wikidata.org, none of our edits hit a CAPTCHA — the Wikibase API is more permissive here.
  • SUL (Single User Login) isn't automatic across all projects. Our account worked on wikidata.org from day one but not on sv.wikipedia.org (login failed with "wrongpassword"). If you want to edit multiple projects, expect one manual login per project to "attach" your SUL.
  • Rate limits are generous. We created six entities in under five minutes with a 2-second sleep between them. No throttling.
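Putting the last two gotchas together, our write loop amounted to a small throttle-and-retry wrapper. This is a sketch: `apiCall` stands in for the `call` helper from the earlier snippet, and `reword` is whatever wording fallback you choose:

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// One retry when the abuse filter warns: the warning is advisory, so
// resubmitting with reworded text (label/description/summary) can pass.
async function createWithRetry(apiCall, params, reword) {
  const r = await apiCall(params);
  if (r.error?.code === 'abusefilter-warning') {
    return apiCall(reword(params));
  }
  return r;
}

// Usage sketch:
// for (const params of entityParamsList) {
//   await createWithRetry(call, params, rewordFn);
//   await sleep(2000); // polite spacing; we saw no throttling at this rate
// }
```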

Connecting the graph

Once the entities exist, the real value comes from linking them. We made the dataset a "work by" the company entity:

  • Q139393633 (dataset) — has P123 and P170 → Q139393658 (company)
  • All companion datasets do the same
  • The company entity itself has P31 (aktiebolag — the Swedish limited-company form), P17 (Sweden), P159 (Helsingborg), P452 (construction industry)

Run SPARQL on query.wikidata.org and the whole graph falls out:

SELECT ?dataset ?datasetLabel WHERE {
  ?dataset wdt:P123 wd:Q139393658 .
  ?dataset wdt:P31 wd:Q1172284 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,sv" }
}

That's how external tools find everything we've published — one query.
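The endpoint is scriptable, too. Here's the same query from Node — a sketch that assumes only a descriptive User-Agent, which the query service asks clients to send:

```javascript
const SPARQL_ENDPOINT = 'https://query.wikidata.org/sparql';

// List all datasets published by a given entity, as en/sv labels.
async function listDatasets(publisherQid) {
  const query = `
    SELECT ?dataset ?datasetLabel WHERE {
      ?dataset wdt:P123 wd:${publisherQid} ;
               wdt:P31 wd:Q1172284 .
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en,sv" }
    }`;
  const url = `${SPARQL_ENDPOINT}?query=${encodeURIComponent(query)}&format=json`;
  const res = await fetch(url, { headers: { 'User-Agent': 'your-app/1.0' } });
  const { results } = await res.json();
  return results.bindings.map((b) => b.datasetLabel.value);
}

// listDatasets('Q139393658').then(console.log);
```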

What changed after we did this

Small, obvious things:

  • The Hugging Face dataset card now links to six Wikidata QIDs — readers click through.
  • Our Zenodo record has related_identifiers pointing at the Wikidata URI — so when someone cites the DOI, the graph picks up the citation.
  • Wikidata's own "What links here" page now serves as a free backlink-monitoring tool for us.
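For reference, the Zenodo metadata fragment behind that second bullet looks roughly like this — the relation type is our choice; Zenodo accepts the standard DataCite relation vocabulary:

```javascript
// Sketch of the Zenodo deposit-metadata fragment linking DOI → Wikidata.
const zenodoMetadata = {
  related_identifiers: [
    {
      identifier: 'https://www.wikidata.org/wiki/Q139393633',
      relation: 'isDescribedBy',
    },
  ],
};
```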

None of this is rocket science. But it's the one layer most open datasets skip.

Go do it

Your dataset already has a DOI, probably a GitHub repo, probably a Hugging Face mirror. Putting it on Wikidata takes ~30 lines of Node and one afternoon.

The entity you create today will outlive your current URLs. That's worth the afternoon.


Swedish Construction FAQ is maintained by Zaragoza AB, Helsingborg. All datasets CC BY 4.0. Explore the graph: Q139393633 · Q139393658.
