De' Clerke

Web Scraping Kenyan Data Sources: What's Available, What Fights Back, and the Patterns That Keep Pipelines Running

I've scraped property listings from BuyRentKenya (1,338 of them), parliament bills from the National Assembly site (319 PDFs), job postings from six Kenyan job boards, forex rates from CBK, and news articles from Business Daily and The Standard. Each of those sources has its own quirks, and a few of them actively fight back. This article covers what I learned building five production scrapers that ran on schedule without getting blocked.

Before writing a single line of scraping code, check the Network tab in your browser's DevTools. Many sites that look like they need scraping are actually hitting an undocumented JSON API in the background. Interact with the page, watch the XHR/Fetch tab, and look for the request that returns the data. If you find one, you're writing a two-line requests.get() call instead of parsing HTML. The NSE equities page works exactly this way.


Pick Your Tool Before You Start

Three tools cover almost every case. Knowing which one to reach for first saves a lot of wasted effort.

requests + BeautifulSoup for static HTML. This is the right starting point for most Kenyan government and news sites. Fast, lightweight, no browser overhead.

Playwright when requests returns empty content or a "loading..." placeholder. The page requires JavaScript to render its content. You need a real browser.

Official API when one exists. data.go.ke runs CKAN, CBK publishes exchange rates, EIA has a full API. Always check before scraping.

The test that tells you which one applies:

import requests

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/124.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}

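url = "https://example.com/listings"  # placeholder: the page you want to test

r = requests.get(url, headers=HEADERS, timeout=15)

if "Just a moment" in r.text or "cf-browser-verification" in r.text:
    print("Cloudflare challenge detected. Use Playwright-stealth.")
elif len(r.text) < 500:
    print("Near-empty response. Content is probably rendered by JavaScript. Use Playwright.")
else:
    print(f"Static HTML. Got {len(r.text)} characters.")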

A response under 500 characters usually means you got an empty shell or a challenge page rather than the actual content. "Just a moment" is the title of Cloudflare's interstitial challenge page. Either way, plain requests will not work here.


Kenyan Data Sources: What Works for Each

data.go.ke: CKAN API

The Kenya Open Data portal runs CKAN, which has a documented REST API. You can list datasets, fetch metadata, and download resources programmatically without touching HTML.

CKAN_BASE = "https://www.opendata.go.ke/api/3/action"

def list_datasets() -> list[str]:
    r = requests.get(f"{CKAN_BASE}/package_list", timeout=15)
    r.raise_for_status()
    return r.json()["result"]

def get_dataset(dataset_id: str) -> dict:
    r = requests.get(f"{CKAN_BASE}/package_show", params={"id": dataset_id}, timeout=15)
    r.raise_for_status()
    return r.json()["result"]

Each dataset record contains a resources list with download URLs. Datasets cover agriculture, health, education, and population. Most are downloadable as CSV directly. Use the API to discover what's available; use a plain requests.get() to download the file.
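Picking the CSV links out of a dataset record can look something like this. It is a minimal sketch that assumes the portal follows the standard CKAN schema, where each entry in resources carries format and url fields; the sample record is made up for illustration.

```python
def csv_resource_urls(dataset: dict) -> list[str]:
    """Collect download URLs for CSV resources from a CKAN package_show result."""
    urls = []
    for resource in dataset.get("resources", []):
        # CKAN stores the format as free text, so normalise case and whitespace.
        fmt = (resource.get("format") or "").strip().lower()
        url = resource.get("url")
        if fmt == "csv" and url:
            urls.append(url)
    return urls

# Hypothetical record, trimmed to the fields used here.
sample = {
    "name": "example-dataset",
    "resources": [
        {"name": "Data", "format": "CSV", "url": "https://example.org/data.csv"},
        {"name": "Notes", "format": "PDF", "url": "https://example.org/notes.pdf"},
    ],
}

print(csv_resource_urls(sample))
```

In a real run you would pass the dict returned by get_dataset() instead of the sample.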

CBK Forex Rates

The Central Bank publishes daily exchange rates at centralbank.go.ke/forex/. The page renders as a standard HTML table. pandas handles it in four lines.

from io import StringIO

import pandas as pd
import requests

def scrape_cbk_forex() -> pd.DataFrame:
    url = "https://www.centralbank.go.ke/forex/"
    r = requests.get(url, headers=HEADERS, timeout=15)
    r.raise_for_status()
    # pandas 2.1+ deprecates passing a raw HTML string, so wrap it in StringIO.
    tables = pd.read_html(StringIO(r.text))
    df = tables[0]
    df.columns = df.columns.str.strip()
    df["fetched_at"] = pd.Timestamp.now()
    return df

pd.read_html() finds all HTML tables on the page and returns them as a list of DataFrames. The forex table is the first one. This is faster than parsing with BeautifulSoup when the data you want is already in a <table> tag.

NSE Equities

The NSE live prices page makes a background API call. Open the Network tab, click XHR/Fetch, reload the page, and you will see requests to endpoints under nse.co.ke/api/. The equities endpoint returns JSON.

NSE_HEADERS = {
    "User-Agent": "Mozilla/5.0",
    "X-Requested-With": "XMLHttpRequest",
    "Referer": "https://www.nse.co.ke/",
}

def fetch_nse_prices() -> list[dict]:
    r = requests.get(
        "https://www.nse.co.ke/api/equity/prices",
        headers=NSE_HEADERS,
        timeout=10
    )
    r.raise_for_status()
    return r.json()

The X-Requested-With: XMLHttpRequest header is required. Without it, the endpoint returns a 403. This is a common pattern on Kenyan sites that use jQuery Ajax internally.

Historical price data is a different story. It is locked behind PDF market reports, which means PDF parsing. See the parliament bills section below for the pattern.

Kenya National Assembly Bills

The parliament site lists bills at parliament.go.ke/the-national-assembly/bills. Each bill links to a PDF hosted on the same domain. The scraping is two steps: parse the HTML table to get the list, then download and parse each PDF.

import os
import tempfile
from urllib.parse import urljoin

import pdfplumber
import requests
from bs4 import BeautifulSoup

def scrape_national_assembly_bills() -> list[dict]:
    url = "http://parliament.go.ke/the-national-assembly/bills"
    r = requests.get(url, headers=HEADERS, timeout=20)
    r.raise_for_status()
    soup = BeautifulSoup(r.text, "lxml")
    bills = []
    for row in soup.select("table.bills-table tbody tr"):
        cols = row.find_all("td")
        if len(cols) < 3:
            continue  # skip header or spacer rows
        link = cols[0].find("a")
        bills.append({
            "title":   cols[0].get_text(strip=True),
            "status":  cols[1].get_text(strip=True),
            "date":    cols[2].get_text(strip=True),
            # Resolve relative hrefs against the page URL so the link is downloadable.
            "pdf_url": urljoin(url, link["href"]) if link and link.get("href") else None,
        })
    return bills

def download_and_parse_bill(pdf_url: str) -> str:
    r = requests.get(pdf_url, headers=HEADERS, timeout=30)
    r.raise_for_status()
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as f:
        f.write(r.content)
        tmp_path = f.name
    try:
        full_text = []
        with pdfplumber.open(tmp_path) as pdf:
            for page in pdf.pages:
                text = page.extract_text()
                if text:
                    full_text.append(text)
        return "\n".join(full_text)
    finally:
        os.unlink(tmp_path)

Write to a tempfile, parse, then delete it. Do not save the PDFs to disk permanently unless you specifically need them. In BungeWatch, I parsed 223 bills this way and stored only the extracted text and keyword counts.
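The keyword-count step can be as small as the sketch below. It is not the BungeWatch code, just one way to do it with the standard library, and it assumes single-word keywords matched case-insensitively; the sample sentence and keyword list are invented.

```python
import re
from collections import Counter

def count_keywords(text: str, keywords: list[str]) -> dict[str, int]:
    """Count whole-word, case-insensitive occurrences of each keyword in text."""
    # Tokenise once, then look each keyword up in the word counts.
    words = Counter(re.findall(r"[a-z']+", text.lower()))
    return {kw: words[kw.lower()] for kw in keywords}

sample = "The Finance Bill amends the tax code. Tax on fuel rises."
print(count_keywords(sample, ["tax", "fuel", "health"]))
```

Storing a dict like this next to the extracted text keeps the row small while still letting you filter bills by topic later.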

For PDF tables, pdfplumber has a page.extract_tables() method. For bulk text extraction where speed matters more than precision, PyMuPDF (fitz) is faster.

import fitz  # PyMuPDF

def extract_text_fast(pdf_path: str) -> str:
    # The context manager closes the document once the text is extracted.
    with fitz.open(pdf_path) as doc:
        return "\n".join(page.get_text() for page in doc)

Use pdfplumber when the PDF has structured tables. Use fitz when you just need the raw text quickly.

Kenyan Job Boards

I pulled from six sources for JobSense. The breakdown:

Source         | Tool               | Notes
---------------|--------------------|---------------------------------------------
BrighterMonday | requests + BS4     | Static HTML, pagination via ?page=N
JobWeb Kenya   | requests + BS4     | Static HTML, some JavaScript on detail pages
MyJobMag       | requests + BS4     | Standard pagination
LinkedIn       | Playwright + login | Requires authentication, JS rendering
Indeed Kenya   | Playwright         | Dynamic content

For LinkedIn, Playwright with a logged-in session is the only reliable approach. Store the login cookies after the first session and reuse them. Without cookies, LinkedIn redirects you to the login page after a few pages.
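One way to store and reuse the cookies is Playwright's storage state, which saves cookies and local storage to a JSON file. The sketch below assumes you log in by hand in a visible browser window on the first run; the file name and both function names are hypothetical.

```python
from pathlib import Path

STATE_FILE = Path("linkedin_state.json")  # hypothetical filename

def save_login_state(login_url: str, state_file: Path = STATE_FILE) -> None:
    """Open a visible browser, let a human log in, then persist the session."""
    from playwright.sync_api import sync_playwright  # imported lazily

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        context = browser.new_context()
        page = context.new_page()
        page.goto(login_url)
        input("Log in in the browser window, then press Enter here...")
        # Writes cookies and localStorage for this context to disk.
        context.storage_state(path=str(state_file))
        browser.close()

def open_logged_in_page(p, state_file: Path = STATE_FILE):
    """Return (browser, page) with the saved session loaded. The caller closes the browser."""
    if not state_file.exists():
        raise FileNotFoundError(f"No saved session at {state_file}; run save_login_state first")
    browser = p.chromium.launch(headless=True)
    context = browser.new_context(storage_state=str(state_file))
    return browser, context.new_page()
```

Call open_logged_in_page(p) inside a with sync_playwright() as p: block. Sessions expire, so expect to rerun save_login_state whenever the scraper starts landing on the login page again.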

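For BrighterMonday and JobWeb, a standard session with headers handles pagination cleanly. build_session() and get_page() come from the Session Builder Pattern section below; parse_job_listings() is the site-specific parser.

from urllib.parse import quote_plus

def scrape_brightermonday(keyword: str, max_pages: int = 10) -> list[dict]:
    # URL-encode the keyword so multi-word searches like "data engineer" work.
    base_url = f"https://www.brightermonday.co.ke/jobs?q={quote_plus(keyword)}"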
    session = build_session()
    all_jobs = []

    for page in range(1, max_pages + 1):
        url = f"{base_url}&page={page}"
        soup = get_page(url, session)
        jobs = parse_job_listings(soup)
        if not jobs:
            break
        all_jobs.extend(jobs)

    return all_jobs

BuyRentKenya Property Listings

This was the Kenya Real Estate Pipeline. BuyRentKenya uses ?page=N pagination, static HTML, and responds cleanly to a browser User-Agent. 67 pages at about 20 listings each gives you the full dataset.

def scrape_buyrentkenya(max_pages: int = 70) -> list[dict]:
    base_url = "https://www.buyrentkenya.com/property-for-sale"
    session = build_session()
    all_listings = []

    for page in range(1, max_pages + 1):
        soup = get_page(f"{base_url}?page={page}", session)
        listings = parse_listings(soup)
        if not listings:
            break
        all_listings.extend(listings)

    return all_listings

The final run produced 1,338 listings. The site does not block scrapers as long as you include a realistic User-Agent and keep delays between requests.

Kenyan News Sites

Check for RSS before scraping HTML. Business Daily, The Standard, and The Star all publish RSS feeds. RSS gives you structured data (title, link, published date, description) with no parsing required.

import feedparser

def fetch_rss(feed_url: str) -> list[dict]:
    feed = feedparser.parse(feed_url)
    return [
        {
            "title":     e.title,
            "link":      e.link,
            "published": e.get("published"),
            "summary":   e.get("summary"),
        }
        for e in feed.entries
    ]

articles = fetch_rss("https://businessdailyafrica.com/rss/feed")

RSS only gives you headlines and summaries. For full article text, follow the links and scrape the article pages with requests + BS4. Most Kenyan news sites are static HTML. article tags with class names like article-body or post-content are the usual targets.


The Session Builder Pattern

Building a session once and reusing it across all pages is more efficient than creating a new connection for every request. It also lets you configure retry behavior in one place.

import random
import time

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

ua = UserAgent()

FULL_HEADERS = {
    "Accept":          "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    # Advertise "br" only if the brotli package is installed; otherwise requests
    # cannot decode a Brotli-compressed body.
    "Accept-Encoding": "gzip, deflate",
    "Connection":      "keep-alive",
    "Referer":         "https://www.google.com/",
}

def build_session() -> requests.Session:
    session = requests.Session()
    retry = Retry(
        total=3,
        backoff_factor=2,
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["GET"],
    )
    session.mount("https://", HTTPAdapter(max_retries=retry))
    session.mount("http://",  HTTPAdapter(max_retries=retry))
    return session

def get_page(url: str, session: requests.Session) -> BeautifulSoup:
    headers = {**FULL_HEADERS, "User-Agent": ua.random}
    r = session.get(url, headers=headers, timeout=20)
    if r.status_code == 429:
        wait = int(r.headers.get("Retry-After", 30))
        print(f"Rate limited. Waiting {wait}s...")
        time.sleep(wait)
        r = session.get(url, headers={**FULL_HEADERS, "User-Agent": ua.random}, timeout=20)
    r.raise_for_status()
    time.sleep(random.uniform(1.5, 3.5))
    return BeautifulSoup(r.text, "lxml")

Rotating the User-Agent on every request with ua.random keeps your requests looking like they come from different browsers. With backoff_factor=2, urllib3 sleeps backoff_factor * 2 ** (n - 1) seconds before retry n, except that the first retry fires immediately: no wait before the first retry, 4 seconds before the second, 8 before the third. That covers most transient server errors without hammering the site.
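Two details of the retry logic are easier to see in code. The sketch below mirrors the backoff formula from urllib3's Retry documentation (first retry immediate, later ones doubling, capped at 120 seconds by default) and parses a Retry-After header that may hold either a number of seconds or an HTTP date, which a bare int() call cannot handle. Both function names are hypothetical helpers, not part of requests or urllib3.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def backoff_schedule(backoff_factor: float, retries: int, backoff_max: float = 120.0) -> list[float]:
    """Seconds slept before each retry, following urllib3's documented formula."""
    return [
        0.0 if n == 1 else min(backoff_max, backoff_factor * 2 ** (n - 1))
        for n in range(1, retries + 1)
    ]

def parse_retry_after(value, default: float = 30.0, now=None) -> float:
    """Seconds to wait for a Retry-After value given as delta-seconds or an HTTP date."""
    if not value:
        return default
    value = value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default  # unparseable header: fall back to the default wait
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())

print(backoff_schedule(2, 3))
print(parse_retry_after("120"))
```

In get_page(), time.sleep(parse_retry_after(r.headers.get("Retry-After"))) would be a drop-in replacement for the int() conversion.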


Checkpoint and Resume for Long Scrapes

A scrape that covers 70 pages and fails on page 45 should resume from page 45, not page 1. Save a checkpoint after every successful page.

import json
import os

CHECKPOINT_FILE = "scrape_checkpoint.json"

def load_checkpoint() -> dict:
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return json.load(f)
    return {"last_page": 0, "scraped_ids": []}

def save_checkpoint(state: dict):
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump(state, f)

def scrape_with_checkpoint(base_url: str, max_pages: int = 100) -> list[dict]:
    state = load_checkpoint()
    start_page = state.get("last_page", 0) + 1
    scraped_ids = set(state.get("scraped_ids", []))
    all_results = []
    session = build_session()

    for page in range(start_page, max_pages + 1):
        try:
            soup = get_page(f"{base_url}?page={page}", session)
            items = parse_listings(soup)
            if not items:
                break

            new_items = [i for i in items if i.get("url") not in scraped_ids]
            all_results.extend(new_items)
            scraped_ids.update(i["url"] for i in new_items)

            save_checkpoint({"last_page": page, "scraped_ids": list(scraped_ids)})
            print(f"Page {page}: +{len(new_items)} items")

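        except Exception as e:
            # Stop here so the next run resumes at this page. Continuing would
            # let a later page overwrite last_page and skip the failed one.
            print(f"Page {page} failed: {e}")
            break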

    return all_results

Delete the checkpoint file after the scrape completes. If you leave it in place, the next scheduled run will start from the last saved page instead of the beginning, which breaks incremental scraping.
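The cleanup can be a small helper that only deletes the file when the run actually reached the end. This is a sketch: the completed flag is an assumption, since your scrape function has to report somehow whether it finished or stopped early.

```python
import os

def clear_checkpoint(checkpoint_file: str, completed: bool) -> bool:
    """Delete the checkpoint after a complete run. Returns True if a file was removed."""
    if not completed:
        return False  # keep the file so the next run resumes where this one stopped
    try:
        os.remove(checkpoint_file)
        return True
    except FileNotFoundError:
        return False  # nothing to clean up
```

Call it as the last step of the job, for example clear_checkpoint("scrape_checkpoint.json", completed=True) once the loop has run out of pages.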


JavaScript-Rendered Pages with Playwright

When requests gives you empty HTML or generic loading content, the page depends on JavaScript to render. Playwright runs a real Chromium browser.

from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

def scrape_with_playwright(url: str) -> BeautifulSoup:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/124.0",
            viewport={"width": 1920, "height": 1080},
        )
        page = context.new_page()
        page.goto(url, timeout=30000, wait_until="domcontentloaded")
        page.wait_for_selector("div.listings", timeout=10000)
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(1500)
        content = page.content()
        browser.close()
    return BeautifulSoup(content, "lxml")

Wait for a specific element with wait_for_selector rather than using wait_for_load_state("networkidle"). Waiting for networkidle means waiting until no network requests have fired for 500ms. On pages with telemetry, ads, or analytics loading in the background, that can take 10+ seconds or never settle. Waiting for the element you actually need is faster and more reliable.

For pages with a "Load More" button, click through until the button disappears:

while page.locator("button.load-more").is_visible():
    page.click("button.load-more")
    page.wait_for_timeout(1500)

When debugging a Playwright scraper, take a screenshot at the failure point:

try:
    page.wait_for_selector("div.listings", timeout=5000)
except Exception:
    page.screenshot(path="error_screenshot.png", full_page=True)
    raise

The screenshot shows you exactly what the browser rendered, which is almost always more useful than any error message.


Network Interception: When the Site Calls Its Own API

Some sites build their frontend as a JavaScript app that calls an internal API. Playwright can intercept those calls and grab the JSON directly, skipping HTML parsing entirely.

api_responses = []

def handle_response(response):
    if "api/jobs" in response.url and response.status == 200:
        try:
            api_responses.append(response.json())
        except Exception:
            pass

page.on("response", handle_response)
page.goto(url)
page.wait_for_load_state("networkidle")
print(f"Captured {len(api_responses)} API responses")

I used this on LinkedIn during the JobSense project. The page renders job cards, but underneath it is calling api.linkedin.com/graphql with a structured query. Intercepting those calls gives you structured JSON with no HTML parsing required.


BeautifulSoup Gotchas

The hyphen-class problem. On SBT Japan, every field on a car listing card uses class names like card-mileage, card-year, card-price. You would expect card.find(class_="card-mileage") to work. For me it returned nothing. Hyphens are not the culprit in themselves: BeautifulSoup's class_ filter and CSS selectors like soup.select(".card-mileage") both match complete class tokens, hyphens included, so they only succeed when the element carries exactly that token in the HTML that requests received. If the served markup uses a longer or differently prefixed class than the one DevTools shows, the exact lookup misses.

The fix is to walk every descendant and match on a fragment of the class name instead of the whole token:

def find_by_hyphen_class(container, suffix: str):
    for el in container.find_all(True):
        classes = el.get("class", [])
        if any(suffix in cls for cls in classes):
            return el
    return None

mileage_el = find_by_hyphen_class(card, "-mileage")
year_el    = find_by_hyphen_class(card, "-year")

The dual-element selector problem. BE FORWARD renders listings as <div class="stocklist-row"> in desktop view and <tr class="stocklist-row"> in table view. A selector for one tag type misses the other. Pass a list to find_all:

cards = soup.find_all(["div", "tr"], class_="stocklist-row")

Finding the right selector. When DevTools gives you a long CSS path that does not work in BeautifulSoup, dump all classes on the page and look for the right one:

classes = set()
for tag in soup.find_all(True):
    if tag.get("class"):
        classes.add(f"{tag.name}.{'.'.join(tag['class'])}")
for c in sorted(classes):
    print(c)

Scan the output for the class names that match the element you are targeting. This is faster than guessing and re-running the scraper.


Storing Data: Idempotent Upserts

A scraper that runs on a schedule will re-encounter listings it has already stored. Your storage layer needs to handle duplicates without creating them. Use a UNIQUE constraint on the natural key and ON CONFLICT DO UPDATE on writes.

CREATE TABLE IF NOT EXISTS listings (
    id          SERIAL PRIMARY KEY,
    url         TEXT UNIQUE,
    title       TEXT,
    price       NUMERIC(14, 2),
    location    VARCHAR(200),
    source      VARCHAR(50),
    scraped_at  TIMESTAMPTZ DEFAULT NOW()
);
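
The write side uses SQLAlchemy's PostgreSQL insert construct. engine and Listing (the mapped table for the schema above) are assumed to be defined elsewhere in the project.

from sqlalchemy.dialects.postgresql import insert as pg_insert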

def upsert_listings(records: list[dict]):
    if not records:
        return
    with engine.begin() as conn:
        stmt = pg_insert(Listing).values(records)
        stmt = stmt.on_conflict_do_update(
            index_elements=["url"],
            set_={"price": stmt.excluded.price, "scraped_at": stmt.excluded.scraped_at},
        )
        conn.execute(stmt)
    print(f"Upserted {len(records)} rows")

The ON CONFLICT DO UPDATE updates price and scraped_at when a URL is already in the database. This means your table always has the latest price for each listing, not a growing pile of duplicates.
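To watch the upsert behave without a PostgreSQL server, here is a self-contained sketch that swaps in SQLite, which supports the same ON CONFLICT ... DO UPDATE clause (SQLite 3.24 or newer). The table is trimmed down and the listing is made up; it shows the pattern, not the pipeline's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE listings (
        url        TEXT PRIMARY KEY,
        title      TEXT,
        price      REAL,
        scraped_at TEXT
    )
""")

UPSERT = """
    INSERT INTO listings (url, title, price, scraped_at)
    VALUES (:url, :title, :price, :scraped_at)
    ON CONFLICT(url) DO UPDATE SET
        price = excluded.price,
        scraped_at = excluded.scraped_at
"""

def upsert(records: list[dict]) -> None:
    # "with conn" commits on success and rolls back on error.
    with conn:
        conn.executemany(UPSERT, records)

# The same listing scraped on two days, with a price drop in between.
upsert([{"url": "https://example.com/a", "title": "Plot A", "price": 5_000_000, "scraped_at": "2025-01-01"}])
upsert([{"url": "https://example.com/a", "title": "Plot A", "price": 4_800_000, "scraped_at": "2025-01-02"}])

rows = conn.execute("SELECT url, price, scraped_at FROM listings").fetchall()
print(rows)  # one row, carrying the second run's price and date
```

Running the second upsert does not raise a duplicate key error and does not add a row. It only refreshes the columns named in DO UPDATE SET.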


Scheduling with Airflow

Once the scraper works, wrapping it in an Airflow DAG gives you scheduled runs, retries, and failure alerts without building any of that yourself, assuming you already have an Airflow deployment to run it on.

from airflow.decorators import dag, task
from datetime import datetime, timedelta

@dag(
    schedule="0 6 * * *",
    start_date=datetime(2025, 1, 1),
    catchup=False,
    max_active_runs=1,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=10)},
    tags=["scraping", "real-estate"],
)
def kenya_real_estate_dag():

    @task
    def scrape() -> list[dict]:
        from scraper import scrape_all_pages
        return scrape_all_pages(
            "https://www.buyrentkenya.com/property-for-sale",
            max_pages=70
        )

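    @task
    def store(records: list[dict]) -> int:
        # upsert_listings is the function from the storage section. Import it
        # from wherever it lives in your project.
        upsert_listings(records)
        return len(records)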

    @task
    def report(count: int):
        print(f"Scrape complete: {count} listings loaded")

    records = scrape()
    count = store(records)
    report(count)

kenya_real_estate_dag()

Two things worth noting here. First, max_active_runs=1 prevents overlap if a run takes longer than the schedule interval. A scrape of 70 pages with polite delays can take 5 to 10 minutes. Without this setting, a second run can start before the first one finishes, both writing to the same table at the same time. Second, retries: 2 with a 10-minute retry delay covers transient network failures without hammering the site immediately.
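A quick back-of-the-envelope check on run time helps when picking the schedule interval. The sketch below multiplies the page count by the mean polite delay (the 1.5 to 3.5 second range from get_page) plus an average request time. The 3-second request time is an assumption, so measure your own.

```python
def estimate_run_minutes(
    pages: int,
    delay_range: tuple[float, float],
    avg_request_seconds: float,
) -> float:
    """Expected wall-clock minutes: pages x (mean polite delay + mean request time)."""
    mean_delay = sum(delay_range) / 2
    return pages * (mean_delay + avg_request_seconds) / 60

minutes = estimate_run_minutes(70, (1.5, 3.5), avg_request_seconds=3.0)
print(f"{minutes:.1f} minutes")
```

Retries and slow responses push the real number up, which is why max_active_runs=1 is worth setting even when the estimate sits well inside the interval.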


What the Kenyan Ecosystem Actually Looks Like

After building scrapers for most of these sources, here is the honest picture:

What is well-structured: data.go.ke (CKAN API), CBK forex (clean HTML table), job boards (mostly static HTML with predictable pagination), news RSS feeds.

What requires more work: NSE historical data (PDF reports), parliament bills (HTML table plus PDF per bill), BuyRentKenya (static HTML but needs checkpoint for 70 pages), news article full text (follow links after RSS).

What actively resists scraping: LinkedIn (requires login, JS rendering), CarFromJapan and similar Cloudflare-protected sites (need Playwright-stealth or a proxy). For Cloudflare sites, playwright-stealth patches the browser context to remove the JavaScript signals that identify it as a headless browser.

The general principle that holds across all of them: check the Network tab before writing any code, respect robots.txt, use polite delays, and build idempotent storage from the start. A scraper that breaks after 30 minutes because of a duplicate key error is not a production scraper.
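Checking robots.txt takes only the standard library. The sketch below parses a made-up robots.txt from a string so that it runs offline; the user agent name and URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

def build_robot_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules

# Hypothetical robots.txt content for illustration.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 5
"""

rules = build_robot_rules(SAMPLE_ROBOTS)
allowed_listing = rules.can_fetch("my-scraper", "https://example.com/property-for-sale?page=2")
allowed_admin = rules.can_fetch("my-scraper", "https://example.com/admin/users")
delay = rules.crawl_delay("my-scraper")
print(allowed_listing, allowed_admin, delay)
```

Against a live site, call rules.set_url("https://<site>/robots.txt") followed by rules.read() instead of parsing a string, and use the crawl delay, when the site sets one, as the floor for the sleep between requests.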


These patterns come from five data engineering projects that scrape Kenyan sources on a schedule. The code for Kenya Real Estate, BungeWatch, JobSense, and BizPulse Kenya is on my GitHub.

Follow me on dev.to for more articles on data pipelines, dbt, and Airflow.
