De' Clerke

Posted on Jun 2

Web Scraping Kenyan Data Sources: What's Available, What Fights Back, and the Patterns That Keep Pipelines Running

#webscraping #beautifulsoup #playwright #requests

I've scraped property listings from BuyRentKenya (1,338 of them), parliament bills from the National Assembly site (319 PDFs), job postings from six Kenyan job boards, forex rates from CBK, and news articles from Business Daily and The Standard. Each of those sources has its own quirks, and a few of them actively fight back. This article covers what I learned building five production scrapers that ran on schedule without getting blocked.

Before writing a single line of scraping code, check the Network tab in your browser's DevTools. Many sites that look like they need scraping are actually hitting an undocumented JSON API in the background. Interact with the page, watch the XHR/Fetch tab, and look for the request that returns the data. If you find one, you're writing a two-line requests.get() call instead of parsing HTML. The NSE equities page works exactly this way.

Pick Your Tool Before You Start

Three tools cover almost every case. Knowing which one to reach for first saves a lot of wasted effort.

requests + BeautifulSoup for static HTML. This is the right starting point for most Kenyan government and news sites. Fast, lightweight, no browser overhead.

Playwright when requests returns empty content or a "loading..." placeholder. The page requires JavaScript to render its content. You need a real browser.

Official API when one exists. data.go.ke runs CKAN, CBK publishes exchange rates, EIA has a full API. Always check before scraping.

The test that tells you which one applies:

import requests

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/124.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}

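url = "https://www.centralbank.go.ke/forex/"  # swap in the page you want to test
r = requests.get(url, headers=HEADERS, timeout=15)

if "Just a moment" in r.text or "cf-browser-verification" in r.text:
    print("Cloudflare challenge detected. Use Playwright-stealth.")
elif len(r.text) < 500:
    print("Near-empty response. Probably JavaScript-rendered. Use Playwright.")
else:
    print(f"Static HTML. Got {len(r.text)} characters.")

A body under 500 characters usually means you got a challenge page or an empty JavaScript shell, not the actual content. "Just a moment" is the title of Cloudflare's challenge page, and cf-browser-verification is a marker that appears in its challenge markup. Any of these signals means plain requests will not work here.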

Kenyan Data Sources: What Works for Each

data.go.ke: CKAN API

The Kenya Open Data portal runs CKAN, which has a documented REST API. You can list datasets, fetch metadata, and download resources programmatically without touching HTML.

CKAN_BASE = "https://www.opendata.go.ke/api/3/action"

def list_datasets() -> list[str]:
    r = requests.get(f"{CKAN_BASE}/package_list", timeout=15)
    r.raise_for_status()
    return r.json()["result"]

def get_dataset(dataset_id: str) -> dict:
    r = requests.get(f"{CKAN_BASE}/package_show", params={"id": dataset_id}, timeout=15)
    r.raise_for_status()
    return r.json()["result"]

Each dataset record contains a resources list with download URLs. Datasets cover agriculture, health, education, and population. Most are downloadable as CSV directly. Use the API to discover what's available; use a plain requests.get() to download the file.
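
As a rough sketch of that discover-then-download split, the helper below pulls CSV download URLs out of a package_show result. It assumes each resource dict carries the usual CKAN format and url fields. The function name csv_resource_urls and the sample record are made up for illustration, not taken from the portal.

```python
def csv_resource_urls(dataset: dict) -> list[str]:
    """Return download URLs for every CSV resource in a CKAN dataset record."""
    urls = []
    for resource in dataset.get("resources", []):
        # CKAN stores the format as free text, so normalise case and whitespace.
        fmt = (resource.get("format") or "").strip().lower()
        url = resource.get("url")
        if fmt == "csv" and url:
            urls.append(url)
    return urls


# Hypothetical record shaped like the "result" of package_show.
sample_dataset = {
    "name": "example-dataset",
    "resources": [
        {"format": "CSV", "url": "https://example.org/data/a.csv"},
        {"format": "PDF", "url": "https://example.org/data/a.pdf"},
        {"format": " csv ", "url": "https://example.org/data/b.csv"},
        {"format": None, "url": "https://example.org/data/c"},
    ],
}

print(csv_resource_urls(sample_dataset))
```

In a real run you would pass the dict returned by get_dataset() and hand each URL to a plain requests.get().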

CBK Forex Rates

The Central Bank publishes daily exchange rates at centralbank.go.ke/forex/. The page renders as a standard HTML table. pandas handles it in four lines.

from io import StringIO

import pandas as pd
import requests

def scrape_cbk_forex() -> pd.DataFrame:
    url = "https://www.centralbank.go.ke/forex/"
    r = requests.get(url, headers=HEADERS, timeout=15)
    r.raise_for_status()
    # pandas 2.1+ deprecates passing a raw HTML string, so wrap it in StringIO.
    tables = pd.read_html(StringIO(r.text))
    df = tables[0]
    df.columns = df.columns.str.strip()
    df["fetched_at"] = pd.Timestamp.now()
    return df

pd.read_html() finds all HTML tables on the page and returns them as a list of DataFrames (it needs lxml, or BeautifulSoup with html5lib, installed). The forex table is the first one. This is quicker to write than parsing with BeautifulSoup when the data you want is already in a <table> tag.

NSE Equities

The NSE live prices page makes a background API call. Open the Network tab, click XHR/Fetch, reload the page, and you will see requests to endpoints under nse.co.ke/api/. The equities endpoint returns JSON.

NSE_HEADERS = {
    "User-Agent": "Mozilla/5.0",
    "X-Requested-With": "XMLHttpRequest",
    "Referer": "https://www.nse.co.ke/",
}

def fetch_nse_prices() -> list[dict]:
    r = requests.get(
        "https://www.nse.co.ke/api/equity/prices",
        headers=NSE_HEADERS,
        timeout=10
    )
    r.raise_for_status()
    return r.json()

The X-Requested-With: XMLHttpRequest header is required. Without it, the endpoint returns a 403. This is a common pattern on Kenyan sites that use jQuery Ajax internally.

Historical price data is a different story. It is locked behind PDF market reports, which means PDF parsing. See the parliament bills section below for the pattern.

Kenya National Assembly Bills

The parliament site lists bills at parliament.go.ke/the-national-assembly/bills. Each bill links to a PDF hosted on the same domain. The scraping is two steps: parse the HTML table to get the list, then download and parse each PDF.

import os
import tempfile
from urllib.parse import urljoin

import pdfplumber
import requests
from bs4 import BeautifulSoup

def scrape_national_assembly_bills() -> list[dict]:
    url = "http://parliament.go.ke/the-national-assembly/bills"
    r = requests.get(url, headers=HEADERS, timeout=20)
    r.raise_for_status()
    soup = BeautifulSoup(r.text, "lxml")
    bills = []
    for row in soup.select("table.bills-table tbody tr"):
        cols = row.find_all("td")
        if len(cols) < 3:
            continue  # skip header or spacer rows
        link = cols[0].find("a")
        bills.append({
            "title":   cols[0].get_text(strip=True),
            "status":  cols[1].get_text(strip=True),
            "date":    cols[2].get_text(strip=True),
            # hrefs may be relative, so resolve them against the page URL.
            "pdf_url": urljoin(url, link["href"]) if link and link.get("href") else None,
        })
    return bills

def download_and_parse_bill(pdf_url: str) -> str:
    r = requests.get(pdf_url, headers=HEADERS, timeout=30)
    r.raise_for_status()
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as f:
        f.write(r.content)
        tmp_path = f.name
    try:
        full_text = []
        with pdfplumber.open(tmp_path) as pdf:
            for page in pdf.pages:
                text = page.extract_text()
                if text:
                    full_text.append(text)
        return "\n".join(full_text)
    finally:
        os.unlink(tmp_path)

Write to a tempfile, parse, then delete it. Do not save the PDFs to disk permanently unless you specifically need them. In BungeWatch, I parsed 223 bills this way and stored only the extracted text and keyword counts.
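
The keyword-count step is not shown above, so here is a minimal sketch of one way to do it on the extracted text. The function name count_keywords, the keyword list, and the sample sentence are all assumptions for illustration. It only handles single-word keywords.

```python
import re
from collections import Counter


def count_keywords(text: str, keywords: list[str]) -> dict[str, int]:
    """Count case-insensitive whole-word occurrences of each keyword."""
    # Tokenise once into lowercase words, then look each keyword up.
    words = Counter(re.findall(r"[a-z']+", text.lower()))
    return {kw: words[kw.lower()] for kw in keywords}


# Made-up text standing in for the output of download_and_parse_bill().
sample_text = "The Finance Bill amends the tax code. Tax relief applies; taxation rules follow."
print(count_keywords(sample_text, ["tax", "finance", "health"]))
```

Because matching is on whole words, "taxation" does not count towards "tax". Whether that is what you want depends on how you plan to query the counts.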

For PDF tables, pdfplumber has a page.extract_tables() method. For bulk text extraction where speed matters more than precision, PyMuPDF (fitz) is faster.

import fitz  # PyMuPDF

def extract_text_fast(pdf_path: str) -> str:
    # The context manager closes the document once the text is extracted.
    with fitz.open(pdf_path) as doc:
        return "\n".join(page.get_text() for page in doc)

Use pdfplumber when the PDF has structured tables. Use fitz when you just need the raw text quickly.

Kenyan Job Boards

I pulled from six sources for JobSense. The table below covers five of them:

Source	Tool	Notes
BrighterMonday	requests + BS4	Static HTML, pagination via `?page=N`
JobWeb Kenya	requests + BS4	Static HTML, some JavaScript on detail pages
MyJobMag	requests + BS4	Standard pagination
LinkedIn	Playwright + login	Requires authentication, JS rendering
Indeed Kenya	Playwright	Dynamic content

For LinkedIn, Playwright with a logged-in session is the only reliable approach. Store the login cookies after the first session and reuse them. Without cookies, LinkedIn redirects you to the login page after a few pages.

For BrighterMonday and JobWeb, a standard session with headers handles pagination cleanly:

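from urllib.parse import urlencode

def scrape_brightermonday(keyword: str, max_pages: int = 10) -> list[dict]:
    base_url = "https://www.brightermonday.co.ke/jobs"
    session = build_session()  # defined in the session builder section
    all_jobs = []

    for page in range(1, max_pages + 1):
        # urlencode escapes spaces and symbols in multi-word keywords.
        url = f"{base_url}?{urlencode({'q': keyword, 'page': page})}"
        soup = get_page(url, session)
        jobs = parse_job_listings(soup)  # site-specific parser, not shown here
        if not jobs:
            break  # an empty page means we ran past the last one
        all_jobs.extend(jobs)

    return all_jobs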

BuyRentKenya Property Listings

This was the Kenya Real Estate Pipeline. BuyRentKenya uses ?page=N pagination, static HTML, and responds cleanly to a browser User-Agent. 67 pages at about 20 listings each gives you the full dataset.

def scrape_buyrentkenya(max_pages: int = 70) -> list[dict]:
    base_url = "https://www.buyrentkenya.com/property-for-sale"
    session = build_session()
    all_listings = []

    for page in range(1, max_pages + 1):
        soup = get_page(f"{base_url}?page={page}", session)
        listings = parse_listings(soup)
        if not listings:
            break
        all_listings.extend(listings)

    return all_listings

The final run produced 1,338 listings. The site does not block scrapers as long as you include a realistic User-Agent and keep delays between requests.

Kenyan News Sites

Check for RSS before scraping HTML. Business Daily, The Standard, and The Star all publish RSS feeds. RSS gives you structured data (title, link, published date, description) with no parsing required.

import feedparser

def fetch_rss(feed_url: str) -> list[dict]:
    feed = feedparser.parse(feed_url)
    return [
        {
            "title":     e.title,
            "link":      e.link,
            "published": e.get("published"),
            "summary":   e.get("summary"),
        }
        for e in feed.entries
    ]

articles = fetch_rss("https://businessdailyafrica.com/rss/feed")

RSS only gives you headlines and summaries. For full article text, follow the links and scrape the article pages with requests + BS4. Most Kenyan news sites are static HTML, and the usual targets are <article> tags or containers with class names like article-body or post-content.
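
The published field from fetch_rss() comes back as a plain string. If you want to sort or deduplicate by time, a small helper like this sketch can convert it, assuming the feed uses the RFC 822 date format that RSS 2.0 specifies. The name parse_published is hypothetical, and feeds with other date formats will simply return None here.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_published(value: Optional[str]) -> Optional[datetime]:
    """Convert an RFC 822 date string to a timezone-aware UTC datetime."""
    if not value:
        return None
    try:
        parsed = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Not an RFC 822 date; leave it unparsed rather than guess.
        return None
    if parsed.tzinfo is None:
        # A "-0000" offset yields a naive datetime; treat it as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


print(parse_published("Mon, 02 Jun 2025 08:30:00 +0300"))
```

Normalising to UTC before storing keeps timestamps comparable across feeds that publish in different offsets.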

The Session Builder Pattern

Building a session once and reusing it across all pages is more efficient than creating a new connection for every request. It also lets you configure retry behavior in one place.

import random
import time

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

ua = UserAgent()

FULL_HEADERS = {
    "Accept":          "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    # Only add "br" if the brotli package is installed; requests cannot decode it otherwise.
    "Accept-Encoding": "gzip, deflate",
    "Connection":      "keep-alive",
    "Referer":         "https://www.google.com/",
}

def build_session() -> requests.Session:
    session = requests.Session()
    retry = Retry(
        total=3,
        backoff_factor=2,
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["GET"],
        # Return the last response instead of raising once retries run out,
        # so get_page() can still inspect a final 429.
        raise_on_status=False,
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session

def get_page(url: str, session: requests.Session) -> BeautifulSoup:
    headers = {**FULL_HEADERS, "User-Agent": ua.random}
    r = session.get(url, headers=headers, timeout=20)
    if r.status_code == 429:
        # Retry-After may be missing or an HTTP date; fall back to 30 seconds.
        retry_after = r.headers.get("Retry-After", "")
        wait = int(retry_after) if retry_after.isdigit() else 30
        print(f"Rate limited. Waiting {wait}s...")
        time.sleep(wait)
        r = session.get(url, headers={**FULL_HEADERS, "User-Agent": ua.random}, timeout=20)
    r.raise_for_status()
    # Polite delay between pages.
    time.sleep(random.uniform(1.5, 3.5))
    return BeautifulSoup(r.text, "lxml")

Rotating the User-Agent on every request with ua.random keeps your requests looking like they come from different browsers. With backoff_factor=2 on the retry adapter, urllib3 sends the first retry immediately, then waits about 4 seconds before the second and 8 before the third (the delay is backoff_factor * 2 ** (retry number - 1), and it is skipped for the first retry). That covers most transient server errors without hammering the site.
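
To sanity-check a retry configuration before deploying it, you can compute the delays by hand. The sketch below follows my reading of urllib3's documented backoff behaviour: no sleep before the first retry, then backoff_factor * 2 ** (retry number - 1), capped at a maximum that I am assuming defaults to 120 seconds. The function backoff_schedule is a hypothetical helper, not part of urllib3, and it ignores Retry-After headers, so check it against your installed version.

```python
def backoff_schedule(backoff_factor: float, retries: int, backoff_max: float = 120.0) -> list[float]:
    """Sleep (in seconds) before each retry, per urllib3's documented formula."""
    delays = []
    for attempt in range(1, retries + 1):
        if attempt == 1:
            # The first retry is sent without waiting.
            delays.append(0.0)
        else:
            delays.append(min(backoff_max, float(backoff_factor) * 2 ** (attempt - 1)))
    return delays


print(backoff_schedule(2, 3))
```

With total=3 and backoff_factor=2 the worst case adds about 12 seconds of waiting per page, which is worth knowing when you size the Airflow task timeout.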

Checkpoint and Resume for Long Scrapes

A scrape that covers 70 pages and fails on page 45 should resume from page 45, not page 1. Save a checkpoint after every successful page.

import json
import os

CHECKPOINT_FILE = "scrape_checkpoint.json"

def load_checkpoint() -> dict:
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return json.load(f)
    return {"last_page": 0, "scraped_ids": []}

def save_checkpoint(state: dict):
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump(state, f)

def scrape_with_checkpoint(base_url: str, max_pages: int = 100) -> list[dict]:
    state = load_checkpoint()
    start_page = state.get("last_page", 0) + 1
    scraped_ids = set(state.get("scraped_ids", []))
    all_results = []  # this run only; store each page as you go if you need earlier pages
    session = build_session()
    finished = True

    for page in range(start_page, max_pages + 1):
        try:
            soup = get_page(f"{base_url}?page={page}", session)
            items = parse_listings(soup)
        except Exception as e:
            # Stop here so the next run resumes from this page instead of skipping it.
            print(f"Page {page} failed: {e}")
            finished = False
            break

        if not items:
            break

        new_items = [i for i in items if i.get("url") and i["url"] not in scraped_ids]
        all_results.extend(new_items)
        scraped_ids.update(i["url"] for i in new_items)

        save_checkpoint({"last_page": page, "scraped_ids": list(scraped_ids)})
        print(f"Page {page}: +{len(new_items)} items")

    if finished and os.path.exists(CHECKPOINT_FILE):
        # Clean finish: remove the checkpoint so the next scheduled run starts at page 1.
        os.remove(CHECKPOINT_FILE)

    return all_results

Delete the checkpoint file after the scrape completes. If you leave it in place, the next scheduled run will start from the last saved page instead of the beginning, which breaks incremental scraping.
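
One failure mode worth guarding against: if the process dies while the checkpoint is being written, the file can be left truncated and the next load will fail to parse it. A possible variant, sketched here with the hypothetical name save_checkpoint_atomic, writes to a temporary file in the same directory and then swaps it into place.

```python
import json
import os
import tempfile


def save_checkpoint_atomic(state: dict, path: str = "scrape_checkpoint.json") -> None:
    """Write the checkpoint to a temp file, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
        # os.replace is atomic when source and target are on the same filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the partial temp file and let the error propagate.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Readers of the checkpoint then see either the previous complete file or the new complete file, never a half-written one.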

JavaScript-Rendered Pages with Playwright

When requests gives you empty HTML or generic loading content, the page depends on JavaScript to render. Playwright runs a real Chromium browser.

from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

def scrape_with_playwright(url: str) -> BeautifulSoup:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/124.0",
            viewport={"width": 1920, "height": 1080},
        )
        page = context.new_page()
        page.goto(url, timeout=30000, wait_until="domcontentloaded")
        page.wait_for_selector("div.listings", timeout=10000)
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(1500)
        content = page.content()
        browser.close()
    return BeautifulSoup(content, "lxml")

Wait for a specific element with wait_for_selector rather than using wait_for_load_state("networkidle"). Waiting for networkidle means waiting until no network requests have fired for 500ms. On pages with telemetry, ads, or analytics loading in the background, that can take 10+ seconds or never settle. Waiting for the element you actually need is faster and more reliable.

For pages with a "Load More" button, click through until the button disappears:

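load_more = page.locator("button.load-more")
for _ in range(50):  # hard cap so a button that never disappears cannot loop forever
    if not load_more.is_visible():
        break
    load_more.click()
    page.wait_for_timeout(1500)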

When debugging a Playwright scraper, take a screenshot at the failure point:

try:
    page.wait_for_selector("div.listings", timeout=5000)
except Exception:
    page.screenshot(path="error_screenshot.png", full_page=True)
    raise

The screenshot shows you exactly what the browser rendered, which is almost always more useful than any error message.

Network Interception: When the Site Calls Its Own API

Some sites build their frontend as a JavaScript app that calls an internal API. Playwright can intercept those calls and grab the JSON directly, skipping HTML parsing entirely.

api_responses = []

def handle_response(response):
    if "api/jobs" in response.url and response.status == 200:
        try:
            api_responses.append(response.json())
        except Exception:
            pass

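# Register the handler before navigating so early API calls are not missed.
page.on("response", handle_response)
page.goto(url)
# networkidle is used deliberately here: the goal is to let background API calls finish.
page.wait_for_load_state("networkidle")
print(f"Captured {len(api_responses)} API responses")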

I used this on LinkedIn during the JobSense project. The page renders job cards, but underneath it is calling api.linkedin.com/graphql with a structured query. Intercepting those calls gives you structured JSON with no HTML parsing required.

BeautifulSoup Gotchas

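The hyphen-class problem. On SBT Japan, every field on a car listing card uses class names like card-mileage, card-year, card-price. You would expect card.find(class_="card-mileage") to work. In my scraper it did not. Hyphens themselves are not the cause: BeautifulSoup matches class_ against each whitespace-separated class token exactly, and soup.select(".card-mileage") is a valid selector. When an exact lookup like this comes back empty, the class in the fetched HTML usually differs from what DevTools shows (a generated prefix, for example), or the element is added later by JavaScript. soup.select() only raises a malformed selector error when the class is not a valid CSS identifier, such as one that starts with a digit.

A workaround that tolerates varying prefixes is to iterate over the descendants and match on part of the class name: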

def find_by_hyphen_class(container, suffix: str):
    for el in container.find_all(True):
        classes = el.get("class", [])
        if any(suffix in cls for cls in classes):
            return el
    return None

mileage_el = find_by_hyphen_class(card, "-mileage")
year_el    = find_by_hyphen_class(card, "-year")

The dual-element selector problem. BE FORWARD renders listings as <div class="stocklist-row"> in desktop view and <tr class="stocklist-row"> in table view. A selector for one tag type misses the other. Pass a list to find_all:

cards = soup.find_all(["div", "tr"], class_="stocklist-row")

Finding the right selector. When DevTools gives you a long CSS path that does not work in BeautifulSoup, dump all classes on the page and look for the right one:

classes = set()
for tag in soup.find_all(True):
    if tag.get("class"):
        classes.add(f"{tag.name}.{'.'.join(tag['class'])}")
for c in sorted(classes):
    print(c)

Scan the output for the class names that match the element you are targeting. This is faster than guessing and re-running the scraper.

Storing Data: Idempotent Upserts

A scraper that runs on a schedule will re-encounter listings it has already stored. Your storage layer needs to handle duplicates without creating them. Use a UNIQUE constraint on the natural key and ON CONFLICT DO UPDATE on writes.

CREATE TABLE IF NOT EXISTS listings (
    id          SERIAL PRIMARY KEY,
    url         TEXT UNIQUE,
    title       TEXT,
    price       NUMERIC(14, 2),
    location    VARCHAR(200),
    source      VARCHAR(50),
    scraped_at  TIMESTAMPTZ DEFAULT NOW()
);

from sqlalchemy.dialects.postgresql import insert as pg_insert

# engine and Listing (the SQLAlchemy model for the listings table) are defined elsewhere.
def upsert_listings(records: list[dict]):
    if not records:
        return
    with engine.begin() as conn:
        stmt = pg_insert(Listing).values(records)
        # On a duplicate url, refresh the fields that change between runs.
        stmt = stmt.on_conflict_do_update(
            index_elements=["url"],
            set_={"price": stmt.excluded.price, "scraped_at": stmt.excluded.scraped_at},
        )
        conn.execute(stmt)
    print(f"Upserted {len(records)} rows")

The ON CONFLICT DO UPDATE updates price and scraped_at when a URL is already in the database. This means your table always has the latest price for each listing, not a growing pile of duplicates.
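
To see the idempotency without a PostgreSQL instance, here is a small self-contained sketch that swaps in SQLite, which supports the same ON CONFLICT ... DO UPDATE clause from version 3.24. It is not the pipeline's storage layer. The simplified table and the example.com listing are made up for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE listings (
        id         INTEGER PRIMARY KEY,
        url        TEXT UNIQUE,
        title      TEXT,
        price      REAL,
        scraped_at TEXT
    )
    """
)

UPSERT_SQL = """
INSERT INTO listings (url, title, price, scraped_at)
VALUES (:url, :title, :price, :scraped_at)
ON CONFLICT(url) DO UPDATE SET
    price = excluded.price,
    scraped_at = excluded.scraped_at
"""


def upsert(records: list[dict]) -> None:
    with conn:  # commits on success, rolls back on error
        conn.executemany(UPSERT_SQL, records)


# Two scheduled runs see the same (made-up) listing at different prices.
upsert([{"url": "https://example.com/listing/1", "title": "3BR apartment",
         "price": 12500000, "scraped_at": "2025-06-01"}])
upsert([{"url": "https://example.com/listing/1", "title": "3BR apartment",
         "price": 11900000, "scraped_at": "2025-06-02"}])

rows = conn.execute("SELECT url, price, scraped_at FROM listings").fetchall()
print(rows)
```

After both runs the table holds a single row carrying the second run's price and date, which is the behaviour the scheduled pipeline relies on.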

Scheduling with Airflow

Once the scraper works, wrapping it in an Airflow DAG gives you scheduled runs, retries, and failure alerts without building any of that yourself.

from airflow.decorators import dag, task
from datetime import datetime, timedelta

@dag(
    schedule="0 6 * * *",
    start_date=datetime(2025, 1, 1),
    catchup=False,
    max_active_runs=1,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=10)},
    tags=["scraping", "real-estate"],
)
def kenya_real_estate_dag():

    @task
    def scrape() -> list[dict]:
        from scraper import scrape_all_pages
        return scrape_all_pages(
            "https://www.buyrentkenya.com/property-for-sale",
            max_pages=70
        )

    @task
    def store(records: list[dict]) -> int:
        upsert_listings(records)
        return len(records)

    @task
    def report(count: int):
        print(f"Scrape complete: {count} listings loaded")

    records = scrape()
    count = store(records)
    report(count)

kenya_real_estate_dag()

Two things worth noting here. First, max_active_runs=1 prevents overlap if a run takes longer than the schedule interval. A scrape of 70 pages with polite delays can take 5 to 10 minutes. Without this setting, a second run can start before the first one finishes, both writing to the same table at the same time. Second, retries: 2 with a 10-minute retry delay covers transient network failures without hammering the site immediately.

What the Kenyan Ecosystem Actually Looks Like

After building scrapers for most of these sources, here is the honest picture:

What is well-structured: data.go.ke (CKAN API), CBK forex (clean HTML table), job boards (mostly static HTML with predictable pagination), news RSS feeds.

What requires more work: NSE historical data (PDF reports), parliament bills (HTML table plus PDF per bill), BuyRentKenya (static HTML but needs checkpoint for 70 pages), news article full text (follow links after RSS).

What actively resists scraping: LinkedIn (requires login, JS rendering), CarFromJapan and similar Cloudflare-protected sites (need Playwright-stealth or a proxy). For Cloudflare sites, playwright-stealth patches the browser context to remove the JavaScript signals that identify it as a headless browser.

The general principle that holds across all of them: check the Network tab before writing any code, respect robots.txt, use polite delays, and build idempotent storage from the start. A scraper that breaks after 30 minutes because of a duplicate key error is not a production scraper.
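
For the robots.txt part, Python's standard library already has a parser. The sketch below feeds it a hypothetical robots.txt body so it runs offline. The rules, the my-scraper agent name, and the example.com URLs are all made up.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real run would fetch it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 5
"""


def build_robot_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a reusable rules object."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = build_robot_rules(ROBOTS_TXT)
allowed_listing = rules.can_fetch("my-scraper", "https://example.com/property-for-sale?page=2")
allowed_admin = rules.can_fetch("my-scraper", "https://example.com/admin/users")
delay = rules.crawl_delay("my-scraper")
print(allowed_listing, allowed_admin, delay)
```

Against a live site you would call set_url() with the site's robots.txt address and then read() instead of parse(), and use the crawl delay, when one is set, as the floor for your sleep between requests.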

These patterns come from five data engineering projects that scrape Kenyan sources on a schedule. The code for Kenya Real Estate, BungeWatch, JobSense, and BizPulse Kenya is on my GitHub.

Follow me on dev.to for more articles on data pipelines, dbt, and Airflow.

Top comments (1)

Nico Morris • Jun 18

Interesting breakdown of the Kenyan data landscape.

From my experience, the real challenge isn't whether a site can be scraped—it's whether the job can still finish reliably once Cloudflare and other anti-bot systems start intervening.

For regional data collection projects, we've found that proxy rotation strategy and recovery from blocked sessions often matter more than raw scraping speed. Residential proxies with stable session management can help maintain continuity across long-running tasks, while intelligent rotation reduces the impact of IP reputation issues when collecting data from region-specific websites.

The scraping logic is usually the easy part. Keeping pipelines running consistently despite challenges, bans, and changing defenses is where most of the engineering effort goes.