Getting data from modern websites is not the same as it used to be. Today, most websites render their data dynamically, making it hard for traditional web scraping tools to obtain any data.
In this project, the Africa Energy project, we use different tools to obtain data about energy indicators across 54 African countries for the years 2000–2022 from the Africa Energy Portal.
The project features a web scraper that extracts JSON data from the portal's API network responses.
Project Overview
The Africa Energy Portal is a dynamic webpage that contains information about energy indicators across 54 African countries. The indicators cover energy access, energy supply, and technical aspects of the energy sector.
The indicators are further broken down into sub-indicators such as 'Population access to electricity-National', which shows the percentage of people with access to electricity at the national level.
The Process
The project uses the following technologies:
a. Python for developing the web scraping logic
b. Selenium for browser automation and capturing network responses
c. Pandas for data handling and manipulation
d. MongoDB, a NoSQL database, for storing and querying the data (a minimal storage/query sketch follows this list)
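For context on the storage side, here is a minimal sketch of storing and querying one record with pymongo. It assumes a local MongoDB instance, and the database, collection, field names, and values are illustrative placeholders rather than the project's actual schema:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumes a local MongoDB instance
collection = client["africa_energy"]["indicators"]

# Store one flattened record per country-year-indicator (placeholder values only).
collection.insert_one({
    "name": "Population access to electricity-National",
    "region_name": "Kenya",
    "year": 2020,
    "score": None,     # placeholder; the scraper fills in the real value
    "unit": "%",
})

# Query the stored data, e.g. every indicator scraped for Kenya in 2020.
for doc in collection.find({"region_name": "Kenya", "year": 2020}):
    print(doc["name"], doc["score"])
```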
The scraper uses Selenium to automate browser interactions such as loading the page and selecting all required themes, years, and countries for precise data extraction.
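As a rough illustration of the response-capture part, the sketch below shows one common way to read the JSON that a dynamic page fetches in the background, using Selenium with Chrome's performance (DevTools) log. The portal URL, the JSON mime-type filter, and the variable names are assumptions for illustration, not the project's exact code; the theme/year/country selections themselves would be driven by ordinary Selenium clicks before the log is read:

```python
import json

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")
# Keep Chrome's performance (DevTools) log so network activity can be
# inspected after the page has loaded.
options.set_capability("goog:loggingPrefs", {"performance": "ALL"})

driver = webdriver.Chrome(options=options)
driver.get("https://africa-energy-portal.org")  # placeholder URL for the portal

# ... drive the theme, year, and country selectors with Selenium clicks here ...

api_payloads = []
for entry in driver.get_log("performance"):
    message = json.loads(entry["message"])["message"]
    if message.get("method") != "Network.responseReceived":
        continue
    response = message["params"]["response"]
    # Keep only JSON responses, which is where the indicator data comes back.
    if "application/json" not in response.get("mimeType", ""):
        continue
    try:
        body = driver.execute_cdp_cmd(
            "Network.getResponseBody",
            {"requestId": message["params"]["requestId"]},
        )
        if not body.get("base64Encoded"):
            api_payloads.append(json.loads(body["body"]))
    except Exception:
        # Some response bodies are no longer available by the time we ask.
        pass

driver.quit()
```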
The Results
The scraper obtains all the selected fields of the data, i.e.:
- id, name, score, unit, region name, indicator topic, indicator source, indicator name, indicator group, year, url
The data is extracted in JSON format and appended to a list before being flattened and converted to CSV format.
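A minimal sketch of that flattening step, assuming each captured payload is a list of record dictionaries with the fields above (the variable and file names are illustrative):

```python
import pandas as pd

records = []
for payload in api_payloads:      # api_payloads: the JSON responses captured earlier
    records.extend(payload)       # append every record to one flat list

# Flatten any nested objects into columns and write the result to CSV.
df = pd.json_normalize(records)
df.to_csv("africa_energy_indicators.csv", index=False)
```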
Working on this project has been nothing short of a learning experience, from understanding the problem, to exploring different ways to execute it, to the final implementation.
You can check out the project on GitHub, and feel free to reach out for inquiries or collaboration!
Top comments (1)
Complete coverage means every indicator has a value or explicit null for all 54 countries from 2000–2022, aiming for 95% overall and capturing units/source/version. Success = fast throughput, high completeness, tight accuracy, and freshness under 24–72 h, keeping it snappy and up-to-date. We normalize to ISO-3166, validate with JSON Schema/pydantic, flatten via pandas.json_normalize, track lineage, and stash docs in MongoDB per country-year-indicator with compound indexes, upserts, and unit conversions. To keep things smooth, we throttle, run bounded parallelism, cache with ETag/If-Modified-Since, log like pros, retry smartly, play nice with ToS/robots.txt, ship via Docker, and schedule idempotent delta runs with Airflow/Prefect.
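As a rough sketch of the "compound indexes and upserts" idea this comment describes (not the project's actual code), assuming cleaned record dictionaries keyed by country, year, and indicator name, with all names and values as placeholders:

```python
from pymongo import ASCENDING, MongoClient, UpdateOne

client = MongoClient("mongodb://localhost:27017")
collection = client["africa_energy"]["indicators"]

# One document per country-year-indicator, enforced with a unique compound index.
collection.create_index(
    [("region_name", ASCENDING), ("year", ASCENDING), ("name", ASCENDING)],
    unique=True,
)

cleaned_records = [
    # placeholder record; a real run would pass the validated scraper output here
    {"region_name": "Kenya", "year": 2020,
     "name": "Population access to electricity-National", "score": None, "unit": "%"},
]

# Upserting keys each write on the compound index, so re-running a delta load
# updates existing documents instead of inserting duplicates.
ops = [
    UpdateOne(
        {"region_name": r["region_name"], "year": r["year"], "name": r["name"]},
        {"$set": r},
        upsert=True,
    )
    for r in cleaned_records
]
if ops:
    collection.bulk_write(ops)
```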