When you are evaluating a new data source, the first thing you want is not the marketing pitch; it is the schema. Here is a field-by-field walkthrough of what Blocket.se actually returns, based on a sample I pulled while researching the source.
What this dataset is
The source is the "Blocket.se Scraper -- Cars, Electronics & Marketplace Listings" actor, which scrapes listings from Blocket.se, Sweden's largest classified-ads platform. In practice that means each record is one logical entity -- here, one listing -- with all of the fields you would expect plus a few metadata columns added by the scraper.
The fields
- id
- url
- title
- price
- location
- category
- sellerName
- sellerType
- shipping
- brand
- imageUrl
- imageUrls
- tradeType
- publishedAt
- lat
- lng
- make
- model
- modelSpec
- year
- mileage
- fuel
- gearbox
- regno
- scrapedAt
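If you prefer the schema in code form, here is a minimal sketch of one record as a Python TypedDict. The types are my own inference from the two sample rows further down, not a published schema -- in particular the lat/lng/year/mileage types are guesses:

```python
from typing import List, Optional, TypedDict

class BlocketListing(TypedDict, total=False):
    # Types inferred from sample rows, not a published schema.
    id: str                      # string even though it looks numeric
    url: str
    title: str
    price: Optional[str]         # display string in the sample, e.g. "500 kr"
    location: Optional[str]
    category: Optional[str]      # e.g. "BAP_ALL"
    sellerName: Optional[str]
    sellerType: Optional[str]    # e.g. "Privat"
    shipping: Optional[str]      # e.g. "Kan skickas"
    brand: Optional[str]
    imageUrl: Optional[str]
    imageUrls: Optional[List[str]]
    tradeType: Optional[str]
    publishedAt: Optional[str]   # ISO-8601 string
    lat: Optional[float]         # assumed float; verify on first run
    lng: Optional[float]
    make: Optional[str]          # car-specific fields, null for other categories
    model: Optional[str]
    modelSpec: Optional[str]
    year: Optional[int]          # may arrive as a string; verify on first run
    mileage: Optional[int]
    fuel: Optional[str]
    gearbox: Optional[str]
    regno: Optional[str]
    scrapedAt: str               # provenance column added by the scraper
```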
A quick read on each category:
- Identifiers are stable across re-scrapes and safe to use as natural keys. They are strings even if they look numeric.
- Content fields are the actual payload. Expect free-form text, some HTML residue if the source had any, and the occasional non-ASCII character.
- Numeric fields (counts, prices, scores) tend to arrive already coerced to int or float -- but always double-check the first run, because some sources emit them as strings. Here, for instance, price comes back as a display string like "500 kr" (see the coercion sketch after this list).
- Timestamps come back as ISO-8601 UTC, which is the right default.
- Provenance fields like scrapedAt or the source URL tell you when and where the row came from. Keep them in your warehouse for audit purposes.
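Here is the coercion sketch referenced above. parse_price and parse_ts are hypothetical helpers of mine, and the scrapedAt value is illustrative, not from the sample:

```python
import re
from datetime import datetime

def parse_price(raw):
    """Coerce a Blocket price string like '500 kr' to an int, or None."""
    if raw is None:
        return None
    digits = re.sub(r"[^\d]", "", str(raw))  # strip 'kr', spaces, separators
    return int(digits) if digits else None

def parse_ts(raw):
    """Parse an ISO-8601 timestamp, tolerating a trailing 'Z'."""
    if raw is None:
        return None
    return datetime.fromisoformat(str(raw).replace("Z", "+00:00"))

row = {"id": "22162018", "price": "500 kr",
       "scrapedAt": "2024-01-01T12:00:00Z"}  # illustrative timestamp
print(parse_price(row.get("price")))   # 500
print(parse_ts(row.get("scrapedAt")))  # 2024-01-01 12:00:00+00:00
```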
Two real rows
Here is what two trimmed records look like:
{
"id": "22162018",
"url": "https://www.blocket.se/recommerce/forsale/item/22162018",
"title": "AL-KO Razor Cut 38,1 HM Comfort cylindergräsklippare 38 cm",
"price": "500 kr",
"location": "Kolsva",
"category": "BAP_ALL",
"sellerName": null,
"sellerType": "Privat",
"shipping": null,
"brand": null
}
{
"id": "18134186",
"url": "https://www.blocket.se/recommerce/forsale/item/18134186",
"title": "Cykelbarnstol 1940-tal Blå.",
"price": "725 kr",
"location": "Väddö",
"category": "BAP_ALL",
"sellerName": null,
"sellerType": "Privat",
"shipping": "Kan skickas",
"brand": null
}
Edge cases to plan for
Three patterns I saw that you should pre-empt in your loader -- all three are handled in the sketch after the list:
- Missing optional keys. Some rows have a field that other rows do not. Always use .get() semantics, never positional access.
- Encoding artefacts in text columns. Keep UTF-8 throughout the pipeline. If you have a Windows-1252 layer anywhere, expect smart quotes and Swedish characters like ä and ö to break it.
- Duplicate rows across overlapping runs. If you scrape every six hours you will see overlap. Dedup on the natural identifier.
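A minimal version of all three defences, assuming the export is a UTF-8 JSON array on disk (the file path and the normalise helper are placeholders of mine):

```python
import json

def normalise(row: dict) -> dict:
    """Defensive read of one raw listing: .get() for every optional key."""
    return {
        "id": row.get("id"),              # natural key
        "title": row.get("title"),
        "price": row.get("price"),
        "shipping": row.get("shipping"),  # often null or absent
    }

def load(path: str) -> list:
    # UTF-8 end to end, so "Väddö" and "cylindergräsklippare" survive intact.
    with open(path, encoding="utf-8") as f:
        rows = [normalise(r) for r in json.load(f)]
    # Dedup across overlapping runs, keeping the last occurrence per id.
    return list({r["id"]: r for r in rows}.values())
```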
How I would model it in a warehouse
The natural shape for a destination table is one row per source entity, with the identifier promoted to a primary key and the timestamp columns cast to TIMESTAMP. Free-text columns go into a TEXT/VARCHAR(MAX) and any list-shaped values either get exploded into a child table or stored as a JSON column depending on whether you need to query the elements individually.
A typical loader for this shape might look like: read the raw JSON into a DataFrame with pd.json_normalize, apply a small column-rename map, write to a staging table with to_sql or your warehouse's bulk loader, then run a MERGE statement keyed on the natural identifier into the curated table. The whole pipeline is comfortably under a hundred lines of code if you do not over-engineer it.
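To make that concrete, here is a hedged sketch of the loader. The DSN, file path, and table names are placeholders, and the MERGE shown is generic ANSI syntax -- your warehouse's dialect may differ:

```python
import json
import pandas as pd
from sqlalchemy import create_engine, text

RENAMES = {"sellerName": "seller_name", "sellerType": "seller_type",
           "publishedAt": "published_at", "scrapedAt": "scraped_at"}

engine = create_engine("postgresql://user:pass@host/db")  # placeholder DSN

# Flatten the raw JSON export and apply the column-rename map.
with open("blocket_export.json", encoding="utf-8") as f:  # placeholder path
    df = pd.json_normalize(json.load(f)).rename(columns=RENAMES)

# Land in staging, then merge into the curated table on the natural key.
df.to_sql("stg_blocket_listings", engine, if_exists="replace", index=False)

merge_sql = """
MERGE INTO blocket_listings AS t
USING stg_blocket_listings AS s ON t.id = s.id
WHEN MATCHED THEN UPDATE SET title = s.title, price = s.price,
                             scraped_at = s.scraped_at
WHEN NOT MATCHED THEN INSERT (id, url, title, price, scraped_at)
     VALUES (s.id, s.url, s.title, s.price, s.scraped_at);
"""
with engine.begin() as conn:
    conn.execute(text(merge_sql))
```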
Who this is for
Merchandise analysts, competitor-intel teams and price trackers are the natural audience. The dataset is rich enough to support real analytical questions but flat enough to land in a warehouse with one statement. If you are evaluating sources for a new project, this is the kind of dataset where the cost-benefit is firmly on the "just use it" side -- the engineering work to integrate is small relative to the analytical value you get out.
For live, customizable extractions of this data, the actor that produced the dataset shown above is published on the Apify Store: logiover/blocket-se-scraper. It supports JSON, CSV and Excel exports and runs on a schedule.
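If you want the live route in code, the standard Apify Python client pattern is short. The run_input keys below are illustrative only -- check the actor's input schema on its Store page:

```python
from apify_client import ApifyClient

client = ApifyClient("<APIFY_TOKEN>")  # your API token

# Input keys are illustrative; consult the actor's documented input schema.
run = client.actor("logiover/blocket-se-scraper").call(
    run_input={"searchUrl": "https://www.blocket.se/bilar"}
)

# Stream the run's default dataset items straight out of Apify storage.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item.get("id"), item.get("title"), item.get("price"))
```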