How I scraped the CQC Care Register without hitting the API auth wall

#javascript #typescript #node #webdev

The Care Quality Commission regulates 56,000+ healthcare and social care locations in England — care homes, GP surgeries, hospitals, dental practices, home care agencies. If you work in care sector tech, you've probably needed this data at some point.

There's a CQC REST API, and I was planning to wrap it. Then I hit the auth wall.

The API is now authenticated

CQC migrated their API to api.service.cqc.org.uk and added bearer token authentication. You need to register at their developer portal, create an application, and include an Authorization: Bearer <token> header on every request. That's not a dealbreaker for enterprise use cases, but it creates friction for a data product — it means requiring users to register with CQC before they can run your actor.

I checked the old API base (api.cqc.org.uk/public/v1) as a fallback. HTTP 403. Fully blocked.

The open-data file rescue

CQC publishes a monthly open-data file called HSCA_Active_Locations.ods. It's a 23 MB OpenDocument Spreadsheet with every active regulated location in England — all 56,000 of them. Free, no auth, Open Government Licence. The URL is date-stamped and changes each month, but the transparency page always links to the current version.

The approach: scrape the transparency page to find the current ODS URL, download the file, parse it, filter rows, push results. No API. No auth wall.

The ODS parsing challenge

ODS files (.ods) are ZIP archives containing XML. The standard tool for parsing them in Node.js is SheetJS (xlsx package, v0.18.5 — the last Apache 2.0 release).

The first surprise: the workbook has three sheets — README, HSCA_Active_Locations, and Dual_Registration_Locations. SheetJS defaults to the first sheet, which is the README with 34 rows. I added logic to find the sheet with the most rows.

for (const name of workbook.SheetNames) {
    const probe = XLSX.utils.sheet_to_json(sheet, { header: 1 });
    if (probe.length > bestCount) {
        bestCount = probe.length;
        bestSheet = sheet;
    }
}

Second surprise: the column names are Title Case with spaces (Location Local Authority, Location Latest Overall Rating), not the snake_case shown in third-party docs. I built a col() helper that tries multiple column name variants to stay resilient against future renames.

Third: dates come back as JS Date objects when you pass { cellDates: true } to XLSX.read(). Coordinates come back as strings. Both handled in normalisation.

Bonus: service type and user band extraction

The ODS has 122 columns. About 60 of them are boolean flags (Y/null) for specific service types and service user bands — things like Service type - Care home service with nursing and Service user band - Dementia.

Rather than mapping 60 individual columns, I extract the active ones into arrays:

function extractTagged(row: RawOdsRow, prefix: string): string[] {
    return Object.entries(row)
        .filter(([key, val]) => key.startsWith(prefix) && (val === 'Y' || val === true))
        .map(([key]) => key.slice(prefix.length));
}

Output: "serviceTypes": ["Care home service without nursing"] and "serviceUserBands": ["Dementia", "Older People"]. Much cleaner than 60 separate boolean fields.

The result

The actor supports two modes: searchLocations (filter by area, rating, service type) and lookupProviders (fetch all locations for specific provider IDs). It loads and parses the 23 MB ODS on every run — about 10–15 seconds — then filters in memory. No proxy needed, no browser automation, no anti-bot handling.

Available on Apify Store: CQC Care Register Scraper

The whole build took a few hours. The messiest part was the ODS column discovery — definitely worth logging the full column list on first run before writing any normalisation code.