Can Yılmaz

Posted on • Originally published at apify.com

Comparing approaches to extracting Finn.no data

There is more than one way to get Finn.no listings into a structured dataset, and the right answer depends a lot on how often you need fresh data, how much volume you are after, and how much engineering time you want to spend on the plumbing. Here is the trade-off matrix I worked through before settling on an approach.

What the data looks like (regardless of approach)

Finn.no is Norway's largest classifieds platform, covering real estate, cars, jobs, and marketplace listings, and whichever approach you choose, the goal is the same: structured listing data you can export to JSON, CSV, or Excel. The end-state schema is more or less fixed by the source:

  • finnkode
  • url
  • adType
  • title
  • location
  • localAreaName
  • price
  • totalPrice
  • monthlyFee
  • size
  • plotSize
  • ownershipType
  • propertyType
  • bedrooms
  • viewingDate
  • agent
  • agentLogoUrl
  • imageUrl
  • imageUrls
  • lat
  • lng
  • scrapedAt

The differences between approaches are not really about schema -- they are about reliability, maintenance burden, and total cost of ownership.
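To make the shared schema concrete, here is a sketch of one record as a Python dataclass. The field names come from the dataset columns above; the types are my assumptions (the raw export keeps prices and sizes as display strings), and the class name is mine, not part of any library.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class FinnListing:
    """One Finn.no listing as it lands in the dataset.

    Types are assumptions: prices, fees and sizes stay as display
    strings in the raw export (e.g. "4 300 000 kr", "70 m²").
    """
    finnkode: str
    url: str
    adType: str
    title: str
    location: Optional[str] = None
    localAreaName: Optional[str] = None
    price: Optional[str] = None
    totalPrice: Optional[str] = None
    monthlyFee: Optional[str] = None
    size: Optional[str] = None
    plotSize: Optional[str] = None
    ownershipType: Optional[str] = None
    propertyType: Optional[str] = None
    bedrooms: Optional[int] = None
    viewingDate: Optional[str] = None
    agent: Optional[str] = None
    agentLogoUrl: Optional[str] = None
    imageUrl: Optional[str] = None
    imageUrls: list = field(default_factory=list)
    lat: Optional[float] = None
    lng: Optional[float] = None
    scrapedAt: Optional[str] = None
```

Whichever approach produces the records, validating them against one shared shape like this makes it cheap to swap approaches later.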

Approach 1: Roll your own scraper

The DIY path. Pros: total control, no third-party dependency, very cheap on small volumes. Cons: you own the proxy rotation, the rate-limit handling, the retry logic, the schema-drift detection, the scheduling, the monitoring, and the bug pager.

If you have one engineer who has done this kind of work before and you only need one source, this is fine. If you need ten sources, the maintenance burden compounds faster than you would expect.
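The plumbing you own on the DIY path (proxy rotation, retries, backoff) looks roughly like this. A minimal stdlib sketch, with the proxy list and user agent as placeholders you would supply yourself:

```python
import random
import time
import urllib.request

def backoff_delay(attempt: int, base: float = 1.0, jitter: float = 0.5) -> float:
    """Exponential backoff with jitter: ~1s, ~2s, ~4s, ~8s, ..."""
    return base * (2 ** attempt) + random.uniform(0, jitter)

def fetch_with_retries(url: str, proxies=None, max_retries: int = 4) -> bytes:
    """Fetch a URL, rotating through optional proxies and backing off on failure."""
    last_error = None
    for attempt in range(max_retries):
        # Pick a proxy at random for each attempt (the simplest rotation policy).
        proxy = random.choice(proxies) if proxies else None
        handler = urllib.request.ProxyHandler({"https": proxy} if proxy else {})
        opener = urllib.request.build_opener(handler)
        request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
        try:
            with opener.open(request, timeout=20) as response:
                return response.read()
        except Exception as exc:  # HTTP errors, timeouts, proxy failures
            last_error = exc
            time.sleep(backoff_delay(attempt))
    raise RuntimeError(f"giving up on {url}: {last_error}")
```

And this is only the fetch layer; scheduling, monitoring, and schema-drift detection are separate pieces of work on top.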

Approach 2: Generic crawl framework + custom selectors

The middle path. Use Scrapy or Playwright with your own parsing logic. Pros: less boilerplate, decent observability for free. Cons: you still own the proxy and rate-limit story, plus you are now coupled to a framework that has its own learning curve.

This is a sensible choice for multi-source projects where you want one mental model across all the scrapers.
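The "custom selectors" part of this approach is the parsing callback you plug into Scrapy or Playwright while the framework handles fetching. A hypothetical sketch: the `<script>` id and JSON keys below are invented for illustration, not Finn.no's actual page structure.

```python
import json
import re

# Hypothetical: assume the listing page embeds its data as a JSON blob
# in a script tag. The id "listing-data" is an assumption, not the real markup.
SCRIPT_RE = re.compile(
    r'<script type="application/json" id="listing-data">(.*?)</script>',
    re.DOTALL,
)

def parse_listing(html: str):
    """Extraction logic you would call from a Scrapy parse() method or a
    Playwright page handler; returns a partial record or None."""
    match = SCRIPT_RE.search(html)
    if not match:
        return None  # schema drift or a blocked page; surface this to monitoring
    data = json.loads(match.group(1))
    return {
        "finnkode": str(data.get("finnkode", "")),
        "title": data.get("title", ""),
        "price": data.get("price", ""),
    }
```

Keeping the parsing pure (HTML in, dict out) like this is what makes one mental model work across ten sources: only this function changes per site.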

Approach 3: Managed scraping infrastructure

Use a hosted runner that handles proxies, scheduling and storage. Pros: minimal engineering time, predictable cost, very fast to get a first run out the door. Cons: cost scales with volume, less control over edge cases.

For one-off explorations and steady-state recurring pipelines under a few million records per month, this is what I keep ending up on.
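With a hosted runner, your integration shrinks to one API call that starts a run. A sketch against Apify's v2 REST API (`POST /v2/acts/{actorId}/runs`); the token is yours, and the `maxItems` input key is a placeholder for whatever inputs the actor actually accepts:

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"

def build_run_request(actor_id: str, token: str, run_input: dict) -> urllib.request.Request:
    """Build the POST request that starts an actor run on Apify.

    Separated from the send step so the request shape is easy to inspect
    and test without touching the network.
    """
    url = f"{API_BASE}/acts/{actor_id}/runs?token={token}"
    body = json.dumps(run_input).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def start_run(actor_id: str, token: str, run_input: dict) -> dict:
    """Start the run and return the API's JSON response."""
    with urllib.request.urlopen(build_run_request(actor_id, token, run_input)) as resp:
        return json.load(resp)
```

Everything else — proxies, retries, scheduling, storage — lives on the platform side, which is exactly the trade: less control, much less plumbing.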

Two sample records (for context)

What the eventual output looks like, regardless of how you got there:

```json
{
  "finnkode": "463621591",
  "url": "https://www.finn.no/realestate/homes/ad.html?finnkode=463621591",
  "adType": "realestate",
  "title": "Innbydende og oppgradert 3-roms leilighet | V.v & fyring inkl. | Epoq kjøkken | Innglasset balkong | Ingen forkjøpsrett!",
  "location": "Nordtvetbakken 2, Oslo",
  "localAreaName": "KALBAKKEN",
  "price": "4 300 000 kr",
  "totalPrice": "4 403 798 kr",
  "monthlyFee": "6 884 kr",
  "size": "70 m²"
}
```
```json
{
  "finnkode": "463301345",
  "url": "https://www.finn.no/realestate/homes/ad.html?finnkode=463301345",
  "adType": "realestate",
  "title": "Strøken 3-roms hjørneleilighet fra 2023 med sørvestvendt innglasset balkong og eget vaskerom | P-plass i kjeller og heis",
  "location": "Melhustunet 24B, Melhus",
  "localAreaName": "MELHUS SENTRUM",
  "price": "5 990 000 kr",
  "totalPrice": "6 140 840 kr",
  "monthlyFee": "2 680 kr",
  "size": "83 m²"
}
```
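Note that prices in these records are display strings ("4 300 000 kr"), which you will usually want as numbers downstream. A small normaliser, assuming the separators are ordinary or non-breaking spaces:

```python
import re
from typing import Optional

def parse_nok(value: str) -> Optional[int]:
    """Convert a display price like '4 300 000 kr' to an integer number
    of kroner. Stripping all non-digit characters also handles narrow
    no-break and non-breaking spaces; returns None for empty values."""
    digits = re.sub(r"[^\d]", "", value or "")
    return int(digits) if digits else None
```

The same trick works for `monthlyFee` and `totalPrice`; `size` ("70 m²") needs the unit handled separately.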

How I would pick

A rough decision tree:

  • One-off exploration: managed approach. The setup-cost of DIY is not worth it for a single run.
  • Steady recurring feed, single source, modest volume: managed approach unless cost becomes prohibitive.
  • Multiple sources, large volume, dedicated team: framework + custom selectors. The unit economics flip.
  • Adversarial source with active anti-bot defences: probably a specialist provider or a custom build with serious proxy budget.

Verdict

For Finn.no specifically, the volume and update-frequency profile is moderate, so a managed runner is the most defensible default. The dataset shape above is the same either way.


For live, customizable extractions of this data, the actor that produced the dataset shown above is published on the Apify Store: logiover/finn-no-scraper. It supports JSON, CSV and Excel exports and runs on a schedule.
