Can Yılmaz

Posted on • Originally published at apify.com

How I scraped Welcome to the Jungle Jobs and what the dataset actually looks like

I spent an afternoon pulling a sample of Welcome to the Jungle Jobs into a structured dataset so I could see what was actually available before committing to a real pipeline. This post walks through the schema, a couple of representative rows, and the trade-offs I bumped into.

The problem

If you have ever tried to track hiring trends, build talent pipelines, benchmark salaries, or gather competitive recruiting intelligence from Welcome to the Jungle (WTTJ) Jobs, you know the source is rarely friendly to programmatic access. The first question I wanted to answer was: what shape does the data actually arrive in, and which fields are reliable versus optional?

The schema I ended up with

After a single sample run I got back 25 rows with a consistent set of fields:

  • jobId -- opaque job identifier (UUID-like string)
  • slug -- URL slug for the listing
  • title -- job title
  • url -- canonical listing URL
  • contractType -- contract type (e.g. full_time)
  • remote -- remote policy (e.g. partial)
  • experienceMin -- minimum experience required (often null)
  • language -- listing language code (e.g. en)
  • publishedAt -- publish timestamp, ISO-8601 UTC
  • salaryYearlyMin -- yearly salary lower bound (often null)
  • salaryYearlyMax -- yearly salary upper bound (often null)
  • salaryCurrency -- salary currency
  • salaryPeriod -- salary period
  • hasSalary -- whether the listing discloses a salary
  • professionName -- profession name
  • professionCategory -- profession category
  • professionSubCat -- profession sub-category
  • companyId -- opaque company identifier
  • companyName -- company name
  • companySlug -- company URL slug
  • companyUrl -- company profile URL on WTTJ
  • companyLogoUrl -- company logo URL
  • companySize -- company size
  • companyFunding -- company funding
  • companyDescription -- company description (long free text)
  • companyWebsite -- company website
  • offices -- office locations
  • description -- full job description (long free text)
  • scrapedAt -- extraction timestamp, ISO-8601 UTC

A few notes from skimming the output. Identifier fields like the ones above are typically strings even when they look numeric -- treat them as opaque tokens, not integers. Timestamp-like fields are emitted as ISO-8601 UTC, which makes downstream joining painless. Free-text fields can be long; trim before indexing.
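To make those conventions concrete, here is a minimal normalisation sketch in Python. It assumes the rows arrive as plain dicts parsed from the JSON export; the function name and the exact field handling are mine, not anything the scraper prescribes.

from datetime import datetime

def normalise_row(row: dict) -> dict:
    # Work on a copy so the raw record stays untouched.
    out = dict(row)
    # Identifier fields stay opaque strings, never integers.
    out["jobId"] = str(out["jobId"])
    # Timestamp fields are ISO-8601 UTC; parse into timezone-aware datetimes.
    for key in ("publishedAt", "scrapedAt"):
        if out.get(key):
            out[key] = datetime.fromisoformat(out[key].replace("Z", "+00:00"))
    # Free text can be long; trim whitespace before indexing.
    if out.get("description"):
        out["description"] = out["description"].strip()
    return out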

A couple of real rows

Here are two records from the sample, lightly trimmed for readability:

{
  "jobId": "64ca8ccf-6193-416a-9b0b-ff909bd046ed",
  "slug": "strategic-partnerships-manager_paris_ALMA_Wd73Q60",
  "title": "Strategic Partnerships Manager",
  "url": "https://www.welcometothejungle.com/en/companies/alma/jobs/strategic-partnerships-manager_paris_ALMA_Wd73Q60",
  "contractType": "full_time",
  "remote": "partial",
  "experienceMin": null,
  "language": "en",
  "publishedAt": "2026-05-14T04:00:14Z",
  "salaryYearlyMin": null
}

{
  "jobId": "2771d62d-09bf-41b1-90c3-ed6eb2cd8285",
  "slug": "account-executive-smb-belgium_brussels",
  "title": "Account Executive Mid Market - Benelux",
  "url": "https://www.welcometothejungle.com/en/companies/alma/jobs/account-executive-smb-belgium_brussels",
  "contractType": "full_time",
  "remote": "partial",
  "experienceMin": null,
  "language": "en",
  "publishedAt": "2026-05-14T04:00:14Z",
  "salaryYearlyMin": null
}

You can see the shape is flat -- which is great if your destination is a warehouse table or a Pandas DataFrame, less great if you want graph-style joins. For most recruiters, talent-intel analysts and job-market researchers, flat is the right default.
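As a quick illustration of why flat helps, here is a sketch that loads the export into pandas. The filename is hypothetical, and I am assuming hasSalary is a boolean, which matches its name but is not confirmed by the two sample rows above.

import json
import pandas as pd

# Load the exported JSON (hypothetical filename) into a flat DataFrame.
with open("wttj_jobs.json", encoding="utf-8") as f:
    df = pd.DataFrame(json.load(f))

# Timestamps are ISO-8601 UTC, so parsing is painless.
df["publishedAt"] = pd.to_datetime(df["publishedAt"], utc=True)

# One flat table means simple group-bys answer questions directly,
# e.g. what fraction of each company's listings disclose a salary.
print(df.groupby("companyName")["hasSalary"].mean().sort_values(ascending=False).head())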

Edge cases worth flagging

A few gotchas I noticed while inspecting the records:

  • Some optional fields are absent rather than null. If your loader assumes a fixed key set, normalise with .get(k, None) instead of row[k] (see the sketch after this list).
  • Long text fields contain newlines and the occasional non-ASCII character; UTF-8 everywhere or you will have a bad time.
  • The source rate-limits aggressively if you hammer it from a single IP, so paging and rotating proxies matter for any non-trivial volume.
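Putting the first two bullets together, here is a minimal normalisation pass. It is a sketch under assumptions: the input filename is hypothetical, and EXPECTED_KEYS is trimmed to the job-level fields for brevity.

import json

EXPECTED_KEYS = [
    "jobId", "slug", "title", "url", "contractType", "remote",
    "experienceMin", "language", "publishedAt", "salaryYearlyMin",
    "salaryYearlyMax", "salaryCurrency", "salaryPeriod", "hasSalary",
]

def fill_missing(row: dict) -> dict:
    # Absent keys become explicit None so every row has the same key set.
    return {k: row.get(k, None) for k in EXPECTED_KEYS}

with open("wttj_jobs.json", encoding="utf-8") as f:
    rows = json.load(f)

# Write UTF-8 explicitly; descriptions contain newlines and non-ASCII text.
with open("normalised.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(fill_missing(row), ensure_ascii=False) + "\n")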

Who would actually use this

The obvious audience is recruiters, talent-intel analysts and job-market researchers. Concrete use-cases I can see for the dataset: tracking hiring trends, building talent pipelines, salary benchmarking and competitive recruiting intelligence. Even if you just need a one-off snapshot rather than a recurring feed, having the data in a clean JSON/CSV shape saves an afternoon of HTML parsing.

Tooling

I ran the extraction on Apify because their infra handles the proxy rotation and scheduling for me. The actor name and store link are at the bottom for anyone who wants to reproduce the exact pull. The wider tooling choices are roughly: extraction with a managed runner so I do not have to babysit proxies, staging on object storage so I can replay history without re-scraping, transformation with whatever SQL flavour my warehouse speaks, and visualisation with whatever notebook or dashboard tool happens to be installed.

Where this goes next

Realistic next steps if I were building on top of this dataset for a real project: extend the scrape window from a single afternoon sample to a multi-week recurring feed, add a small data-quality assertion layer that runs after each scrape, and wire up a notification on assertion failures so schema drift gets caught the day it happens. Past that, the work is mostly downstream -- modelling the data into views that answer specific business questions and putting those views behind dashboards or API endpoints. None of it is glamorous, but all of it is the difference between "interesting one-off pull" and "feed my team actually uses".
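For the assertion layer, something this small would already catch the drift I care about. It is a sketch, not a framework: check_batch, the checked fields, and the input filename are all my own choices, and in a real pipeline the SystemExit would be a notification instead.

import json

def check_batch(rows: list[dict]) -> list[str]:
    """Cheap post-scrape assertions; returns human-readable failures."""
    failures = []
    if not rows:
        return ["empty batch"]
    missing_ids = sum(1 for r in rows if not r.get("jobId"))
    if missing_ids:
        failures.append(f"{missing_ids} rows missing jobId")
    # Schema drift usually shows up as a field vanishing from every row.
    for key in ("title", "url", "publishedAt", "companyName"):
        if all(key not in r for r in rows):
            failures.append(f"field {key} absent from entire batch")
    return failures

with open("wttj_jobs.json", encoding="utf-8") as f:
    problems = check_batch(json.load(f))
if problems:
    # Placeholder alert; wire this to Slack/email in a recurring feed.
    raise SystemExit(f"data-quality failures: {problems}")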


For live, customizable extractions of this data, the actor that produced the dataset shown above is published on the Apify Store: logiover/welcome-to-the-jungle-jobs-scraper. It supports JSON, CSV and Excel exports and runs on a schedule.
