<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Hillary Onyango</title>
    <description>The latest articles on DEV Community by Hillary Onyango (@amolo_hillary).</description>
    <link>https://dev.to/amolo_hillary</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F308055%2F00f31c48-e2ec-4daf-82a8-fa493804fb5a.png</url>
      <title>DEV Community: Hillary Onyango</title>
      <link>https://dev.to/amolo_hillary</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/amolo_hillary"/>
    <language>en</language>
    <item>
<title>Working with Apache Airflow to Automate Collection of Weather Data for Kenya’s Major Agricultural Areas</title>
      <dc:creator>Hillary Onyango</dc:creator>
      <pubDate>Sat, 26 Jul 2025 22:46:14 +0000</pubDate>
      <link>https://dev.to/amolo_hillary/working-with-apache-to-automate-collection-of-weather-data-for-kenyas-major-agricultural-areas-4ee8</link>
      <guid>https://dev.to/amolo_hillary/working-with-apache-to-automate-collection-of-weather-data-for-kenyas-major-agricultural-areas-4ee8</guid>
<description>&lt;p&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;br&gt;
One of the most valuable aspects of Apache Airflow is that it turns messy data scattered all over into a seamless flow of clean data that analytics teams and data scientists can use to do what they do best. This project extracts daily weather data from the OpenWeatherMap API, transforms it into a structured format, and loads it into a PostgreSQL database. Access to reliable weather data is essential for farmers to optimize crop yields, manage irrigation, and mitigate climate risks. The Kenya Agricultural Regions Weather ETL Pipeline addresses this need by automating the collection, processing, and storage of weather data for 17 key agricultural regions. Built with &lt;em&gt;Apache Airflow&lt;/em&gt;, integrated with the &lt;em&gt;OpenWeatherMap API&lt;/em&gt; and PostgreSQL, this project delivers a robust, scalable solution for data-driven agriculture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Project Overview&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Kenya’s agricultural sector is a cornerstone of its economy, supporting millions of livelihoods and contributing significantly to national GDP. This data engineering project makes available the valuable data that analysts and data scientists can use to make important decisions to improve agriculture’s contribution to the nation’s economy and food security in general. The ETL pipeline is an automated system that extracts daily weather data from the OpenWeatherMap API, transforms it into a structured format, and loads it into a PostgreSQL database. The pipeline captures critical metrics such as temperature, humidity, pressure, wind speed, rainfall, and weather descriptions in major agricultural regions like &lt;em&gt;Eldoret, Nakuru, Kitale, Embu,&lt;/em&gt; and others. By leveraging Apache Airflow, the system ensures reliable scheduling, error handling, and monitoring, with email notifications for operational transparency.&lt;br&gt;
&lt;strong&gt;Key Features&lt;/strong&gt;&lt;br&gt;
The pipeline is designed for efficiency and reliability, offering:&lt;br&gt;
• &lt;strong&gt;Daily Weather Updates&lt;/strong&gt;: Collects data for 17 agricultural regions in Kenya.&lt;br&gt;
• &lt;strong&gt;Automated Workflow&lt;/strong&gt;: Uses Apache Airflow to orchestrate the ETL process.&lt;br&gt;
• &lt;strong&gt;PostgreSQL Integration&lt;/strong&gt;: Stores data in a structured, queryable database.&lt;br&gt;
• &lt;strong&gt;Comprehensive Error Handling&lt;/strong&gt;: Manages API failures, database issues, and data validation errors.&lt;br&gt;
• &lt;strong&gt;Email Notifications&lt;/strong&gt;: Sends alerts for pipeline failures and weekly success reports.&lt;br&gt;
• &lt;strong&gt;Secure Configuration&lt;/strong&gt;: Uses environment variables for API keys, database credentials, and SMTP settings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Technical Design&lt;/strong&gt;&lt;br&gt;
The pipeline follows a standard ETL architecture:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Extract&lt;/strong&gt;: Fetches raw weather data from the OpenWeatherMap API using Python’s requests library, targeting coordinates for each of the 17 regions.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Transform&lt;/strong&gt;: Cleans and processes data with pandas, converting units (e.g., Kelvin to Celsius), handling missing values, and structuring data for storage.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Load&lt;/strong&gt;: Inserts transformed data into a PostgreSQL database using sqlalchemy and psycopg2-binary.
The database schema is designed for simplicity and scalability:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def create_weather_table():
    """Create weather_data table if it doesn't exist"""
    create_table_sql = """
    CREATE TABLE IF NOT EXISTS weather_data (
        id SERIAL PRIMARY KEY,
        region VARCHAR(50) NOT NULL,
        latitude DECIMAL(10, 6) NOT NULL,
        longitude DECIMAL(10, 6) NOT NULL,
        temperature DECIMAL(5, 2),
        feels_like DECIMAL(5, 2),
        temp_min DECIMAL(5, 2),
        temp_max DECIMAL(5, 2),
        pressure INTEGER,
        humidity INTEGER,
        visibility INTEGER,
        wind_speed DECIMAL(5, 2),
        wind_direction INTEGER,
        cloudiness INTEGER,
        weather_main VARCHAR(50),
        weather_description VARCHAR(100),
        rainfall_1h DECIMAL(8, 2) DEFAULT 0,
        rainfall_3h DECIMAL(8, 2) DEFAULT 0,
        sunrise TIMESTAMP,
        sunset TIMESTAMP,
        data_timestamp TIMESTAMP NOT NULL,
        extraction_timestamp TIMESTAMP NOT NULL,
        heat_index DECIMAL(5, 2),
        dew_point DECIMAL(5, 2),
        is_favorable_temp BOOLEAN,
        is_high_humidity BOOLEAN,
        rainfall_category VARCHAR(20),
        date DATE,
        hour INTEGER,
        month INTEGER,
        year INTEGER,
        created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
        UNIQUE(region, data_timestamp)
    );

    -- Create indexes for better query performance
    CREATE INDEX IF NOT EXISTS idx_weather_region ON weather_data(region);
    CREATE INDEX IF NOT EXISTS idx_weather_date ON weather_data(date);
    CREATE INDEX IF NOT EXISTS idx_weather_timestamp ON weather_data(data_timestamp);
    CREATE INDEX IF NOT EXISTS idx_weather_region_date ON weather_data(region, date);
    """

    conn = get_postgres_connection()
    try:
        cursor = conn.cursor()
        cursor.execute(create_table_sql)
        conn.commit()
        logging.info("Weather table created successfully")
        cursor.close()
    except psycopg2.Error as e:
        logging.error(f"Error creating table: {e}")
        conn.rollback()
        raise
    finally:
        conn.close()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
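&lt;p&gt;To make the extract and transform steps above concrete, here is a minimal sketch of the kind of logic involved. The function names, the region mapping, and the exact fields pulled out are illustrative (the full implementation lives in the repository); the endpoint and response fields are those of the standard OpenWeatherMap current-weather API:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os

import requests

# Illustrative subset of the 17 regions; coordinates are approximate.
KENYA_REGIONS = {
    "Eldoret": (0.5143, 35.2698),
    "Nakuru": (-0.3031, 36.0800),
}

API_URL = "https://api.openweathermap.org/data/2.5/weather"

def extract_weather(lat, lon):
    """Fetch raw weather data for one set of coordinates."""
    params = {"lat": lat, "lon": lon, "appid": os.environ["OPENWEATHER_API_KEY"]}
    response = requests.get(API_URL, params=params, timeout=30)
    response.raise_for_status()
    return response.json()

def transform_weather(region, raw):
    """Flatten the API response and convert units (Kelvin to Celsius)."""
    return {
        "region": region,
        "temperature": round(raw["main"]["temp"] - 273.15, 2),
        "humidity": raw["main"]["humidity"],
        "pressure": raw["main"]["pressure"],
        "wind_speed": raw["wind"]["speed"],
        "rainfall_1h": raw.get("rain", {}).get("1h", 0),  # missing rain means 0 mm
        "weather_description": raw["weather"][0]["description"],
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;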



&lt;p&gt;&lt;strong&gt;Implementation&lt;/strong&gt;&lt;br&gt;
The project is built with &lt;em&gt;Python 3.8+ and Apache Airflow 2.0+, with dependencies including pandas, sqlalchemy, psycopg2-binary, requests, and python-dotenv&lt;/em&gt;. Below are the key components:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Project Structure&lt;/strong&gt;&lt;br&gt;
The GitHub repository includes:&lt;br&gt;
• daily_weather_etl_kenya.py: Defines the Airflow DAG for the ETL workflow.&lt;br&gt;
• .env: Stores sensitive configurations (API keys, database credentials, SMTP settings).&lt;br&gt;
• README.md: Provides setup and usage instructions.&lt;br&gt;
• .gitignore: Excludes sensitive files from version control.&lt;br&gt;
Setup involves cloning the repository, creating a virtual environment, installing dependencies, and configuring the .env file.&lt;/p&gt;
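&lt;p&gt;A .env file for this kind of setup might look like the following sketch; the variable names here are placeholders, so use the exact keys listed in the README:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# OpenWeatherMap
OPENWEATHER_API_KEY=your_api_key_here

# PostgreSQL
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_DB=weather_db
POSTGRES_USER=airflow
POSTGRES_PASSWORD=your_password_here

# SMTP (e.g., Gmail) for email notifications
SMTP_HOST=smtp.gmail.com
SMTP_PORT=587
SMTP_USER=you@example.com
SMTP_PASSWORD=your_app_password_here
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;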

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0muo9oqluj3zqga2ivs3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0muo9oqluj3zqga2ivs3.png" alt="Stracture" width="277" height="478"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Airflow DAG&lt;/strong&gt;&lt;br&gt;
The DAG is scheduled daily with:&lt;br&gt;
• Retries: 2 attempts for failed tasks.&lt;br&gt;
• Retry Delay: 5 minutes.&lt;br&gt;
• Email Notifications: Enabled for failures and weekly success summaries via SMTP (e.g., Gmail).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo16jzgmgr091hnqozu0p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo16jzgmgr091hnqozu0p.png" alt="SMTP sent the email successfully" width="800" height="47"&gt;&lt;/a&gt;&lt;br&gt;
Tasks include API data extraction, data transformation, and database loading, all orchestrated through Airflow’s intuitive interface.&lt;/p&gt;
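&lt;p&gt;Here is a minimal sketch of how such a DAG can be declared; the task IDs and placeholder callables are illustrative stand-ins for the ones in daily_weather_etl_kenya.py, while the retry, delay, and email settings mirror the configuration described above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_task(**context):
    """Placeholder for the API extraction step."""

def transform_task(**context):
    """Placeholder for the pandas transformation step."""

def load_task(**context):
    """Placeholder for the PostgreSQL load step."""

default_args = {
    "owner": "airflow",
    "retries": 2,                         # 2 attempts for failed tasks
    "retry_delay": timedelta(minutes=5),  # 5-minute retry delay
    "email_on_failure": True,             # alert via SMTP on failure
    "email": ["you@example.com"],
}

with DAG(
    dag_id="daily_weather_etl_kenya",
    default_args=default_args,
    schedule_interval="@daily",           # run once per day
    start_date=datetime(2025, 1, 1),
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_weather_data", python_callable=extract_task)
    transform = PythonOperator(task_id="transform_weather_data", python_callable=transform_task)
    load = PythonOperator(task_id="load_weather_data", python_callable=load_task)

    extract &gt;&gt; transform &gt;&gt; load          # extract, then transform, then load
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;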

&lt;p&gt;&lt;strong&gt;3. Error Handling&lt;/strong&gt;&lt;br&gt;
The pipeline handles:&lt;br&gt;
• API Issues: Timeouts and invalid responses are managed with retry logic (see the sketch below).&lt;br&gt;
• Database Errors: Connection failures are caught and logged.&lt;br&gt;
• Data Validation: Ensures data integrity before loading.&lt;br&gt;
• Logging: Airflow logs provide detailed execution insights.&lt;/p&gt;
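&lt;p&gt;As a sketch of what the API-side retry logic can look like, here is a bounded retry loop with a timeout; the retry count and backoff values are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import logging
import time

import requests

def fetch_with_retries(url, params, max_retries=3, backoff_seconds=10):
    """Call the API with a timeout, retrying transient failures."""
    for attempt in range(1, max_retries + 1):
        try:
            response = requests.get(url, params=params, timeout=30)
            response.raise_for_status()  # turn HTTP 4xx/5xx into exceptions
            return response.json()
        except (requests.Timeout, requests.ConnectionError, requests.HTTPError) as exc:
            logging.warning("Attempt %d/%d failed: %s", attempt, max_retries, exc)
            if attempt == max_retries:
                raise  # give up and let Airflow's own task retries take over
            time.sleep(backoff_seconds)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;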
&lt;p&gt;&lt;strong&gt;4. Deployment&lt;/strong&gt;&lt;br&gt;
Users set up Airflow by initializing the database, creating an admin user, and running the webserver (&lt;a href="http://localhost:8080" rel="noopener noreferrer"&gt;http://localhost:8080&lt;/a&gt;) and scheduler. The Airflow UI enables monitoring of DAG runs and task statuses.&lt;/p&gt;
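&lt;p&gt;For a standalone Airflow 2.x setup, those steps map to commands along these lines (the user details are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Initialize Airflow's metadata database
airflow db init

# Create an admin user for the web UI
airflow users create \
    --username admin --password admin \
    --firstname Admin --lastname User \
    --role Admin --email you@example.com

# Run the webserver (http://localhost:8080) and the scheduler
airflow webserver --port 8080
airflow scheduler
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;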
&lt;p&gt;&lt;strong&gt;Impact&lt;/strong&gt;&lt;br&gt;
This pipeline empowers stakeholders in Kenya’s agricultural sector:&lt;br&gt;
• &lt;strong&gt;Farmers:&lt;/strong&gt; Use weather data to plan planting, irrigation, and harvesting.&lt;br&gt;
• &lt;strong&gt;Researchers:&lt;/strong&gt; Analyze historical data for climate and crop studies.&lt;br&gt;
• &lt;strong&gt;Policymakers:&lt;/strong&gt; Leverage insights for agricultural planning and disaster response.&lt;br&gt;
By automating data collection, the system saves time and ensures consistent, high-quality data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Challenges and Solutions&lt;/strong&gt;&lt;br&gt;
Key challenges included:&lt;br&gt;
• &lt;strong&gt;API Rate Limits&lt;/strong&gt;: Addressed by optimizing API calls and implementing retries.&lt;br&gt;
• &lt;strong&gt;Data Quality&lt;/strong&gt;: Ensured through validation and standardization.&lt;br&gt;
• &lt;strong&gt;Configuration Security&lt;/strong&gt;: Managed with python-dotenv for environment variables.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Future Enhancements&lt;/strong&gt;&lt;br&gt;
Potential improvements include:&lt;br&gt;
• Real-time data streaming for more frequent updates.&lt;br&gt;
• Integration with additional data sources (e.g., soil moisture sensors).&lt;br&gt;
• A visualization dashboard for end users.&lt;br&gt;
• Expansion to cover more Kenyan regions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Acknowledgments&lt;/strong&gt;&lt;br&gt;
• &lt;strong&gt;OpenWeatherMap API&lt;/strong&gt;: For providing accessible weather data.&lt;br&gt;
• &lt;strong&gt;Apache Airflow:&lt;/strong&gt; For robust orchestration.&lt;br&gt;
• &lt;strong&gt;PostgreSQL&lt;/strong&gt;: For reliable storage.&lt;/p&gt;

&lt;p&gt;Here is the GitHub link for more about the project as I keep improving it:&lt;br&gt;
(&lt;a href="https://github.com/HillaryOnyango/Kenya-Agricultural-Regions-Weather-ETL-Pipeline" rel="noopener noreferrer"&gt;https://github.com/HillaryOnyango/Kenya-Agricultural-Regions-Weather-ETL-Pipeline&lt;/a&gt;)&lt;br&gt;
&lt;strong&gt;Report on the Status of the Database&lt;/strong&gt;&lt;br&gt;
The pipeline has been up and running, ingesting data into the database without issues.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5imjqsdxuk5v3ofp61n9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5imjqsdxuk5v3ofp61n9.png" alt="Pipeline DAF" width="800" height="170"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The data collected over the weekend shows that nearly all major regions of the country experienced little to no rain, with most regions covered by broken clouds.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy2x54fc3pslw8irr3r95.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy2x54fc3pslw8irr3r95.png" alt="Weather update" width="800" height="372"&gt;&lt;/a&gt;&lt;br&gt;
In terms of extreme weather conditions across the various regions of the country, here is what the data revealed:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9m54o668fxo3q69z36ut.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9m54o668fxo3q69z36ut.png" alt=" " width="800" height="415"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj4plzzgw22kf4mz0ft4v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj4plzzgw22kf4mz0ft4v.png" alt=" " width="534" height="803"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>dataanalytics</category>
      <category>apacheairflow</category>
      <category>luxdev</category>
    </item>
    <item>
      <title>Working with PostgreSQL through DBeaver: A tutorial</title>
      <dc:creator>Hillary Onyango</dc:creator>
      <pubDate>Thu, 15 May 2025 00:47:00 +0000</pubDate>
      <link>https://dev.to/amolo_hillary/working-with-postgresql-through-dbeaver-a-tutorial-54li</link>
      <guid>https://dev.to/amolo_hillary/working-with-postgresql-through-dbeaver-a-tutorial-54li</guid>
<description>&lt;p&gt;PostgreSQL is one of the most advanced open-source databases, making it a popular database among data analysts and data engineers. To get PostgreSQL, go to &lt;a href="https://www.postgresql.org/download/" rel="noopener noreferrer"&gt;https://www.postgresql.org/download/&lt;/a&gt;, select the suitable installer, download it, and follow the installation instructions.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fckzo845kdm6v5s2wgfeu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fckzo845kdm6v5s2wgfeu.png" alt="View of PostgreSQL interface" width="468" height="258"&gt;&lt;/a&gt;&lt;br&gt;
DBeaver is a free database tool that supports any database with a JDBC driver. Software developers, SQL writers, database administrators, and data analysts leverage its excellent functionality for interacting with databases. This tutorial covers using DBeaver with PostgreSQL, from connecting to the database to working with the data.&lt;br&gt;
&lt;strong&gt;Why DBeaver?&lt;/strong&gt;&lt;br&gt;
DBeaver is popular with many developers and data gurus because it provides an intuitive user interface for connecting to a multitude of databases, including MySQL, MariaDB, PostgreSQL, SQLite, and many others, and allows them to perform essential operations on the data.&lt;br&gt;
&lt;strong&gt;DBeaver Installation and Connecting to PostgreSQL&lt;/strong&gt;&lt;br&gt;
To install DBeaver, visit the download page at the following URL:&lt;br&gt;
&lt;a href="https://dbeaver.io/download/" rel="noopener noreferrer"&gt;https://dbeaver.io/download/&lt;/a&gt;&lt;br&gt;
Select the version of DBeaver for your OS and download the installer. Then, open the installer and follow the instructions to complete the DBeaver installation.&lt;br&gt;
Once installed, the next phase is connecting to PostgreSQL. To connect to a database in DBeaver, open the utility and click Database in the top menu. Then, click the New Database Connection option or the plug icon located within the ribbon of icons at the top right portion of the user interface.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbsymu8q22mk5wmyvvd9w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbsymu8q22mk5wmyvvd9w.png" alt="Creating New Connection" width="642" height="201"&gt;&lt;/a&gt;&lt;br&gt;
Then select your database (which in this case is PostgreSQL) and hit Next to proceed to the connection section.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh0feawukb0mcuwik23sa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh0feawukb0mcuwik23sa.png" alt="Selecting type of DB" width="800" height="747"&gt;&lt;/a&gt;&lt;br&gt;
The final stage before our database is ready is creating the connection, which is critical. Here is a snippet of what the connection section looks like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqf06v64gd4xnimzyi99n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqf06v64gd4xnimzyi99n.png" alt="_Connection Settings_" width="800" height="679"&gt;&lt;/a&gt;&lt;br&gt;
I will explain using an example where the host is localhost and the database name is Jumia.&lt;/p&gt;
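&lt;p&gt;For that local example, the key connection settings would look roughly like this, assuming PostgreSQL’s default port and superuser:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Host:     localhost
Port:     5432        # PostgreSQL's default port
Database: Jumia
Username: postgres    # default superuser; use your own role if you created one
Password: (the password you set during PostgreSQL installation)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;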

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg4p3rasjijkjo7ejqayl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg4p3rasjijkjo7ejqayl.png" alt="Creating a new connection named _Jumia_" width="800" height="648"&gt;&lt;/a&gt;&lt;br&gt;
Once the above details have been keyed in, it is important to test the connection before hitting the “Finish” button, to ensure that the connection is successful. Here is the output if the connection is successful.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fplw1p3h7k6mjuuq93h0e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fplw1p3h7k6mjuuq93h0e.png" alt="_Successful connection_" width="800" height="574"&gt;&lt;/a&gt;&lt;br&gt;
For non-local users, here is a snippet of how yours will look (basically the same, with just a variation in the host used).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft6rx9tynrglbkzki4663.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft6rx9tynrglbkzki4663.png" alt="Successful connection for " width="800" height="574"&gt;&lt;/a&gt;&lt;br&gt;
For those using cloud services such as Aiven, it is much the same, but it looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn817q7qru9zxvxsm2dzy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn817q7qru9zxvxsm2dzy.png" alt="_Successful Connection in Aiven_" width="800" height="685"&gt;&lt;/a&gt;&lt;br&gt;
Upon successful connection, we can perform normal operations in DBeaver just as we would in PostgreSQL, as shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6peq534mxw0uj611cgnx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6peq534mxw0uj611cgnx.png" alt="_Query Data inside DBeaver_" width="800" height="327"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Fun fact: the data is also available in PostgreSQL on your computer, and you can access it using the same database name and tables, like the one below:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F034ewevheo3ooc2455hn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F034ewevheo3ooc2455hn.png" alt="_The Data inside Postgres_" width="800" height="328"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
<title>The Growing Importance of Data Literacy in a Digital World</title>
      <dc:creator>Hillary Onyango</dc:creator>
      <pubDate>Wed, 12 Mar 2025 01:49:26 +0000</pubDate>
      <link>https://dev.to/amolo_hillary/the-growing-importance-of-data-literacy-in-a-digital-worldintroduction-26c5</link>
      <guid>https://dev.to/amolo_hillary/the-growing-importance-of-data-literacy-in-a-digital-worldintroduction-26c5</guid>
<description>&lt;h2&gt;Introduction&lt;/h2&gt;

&lt;p&gt;The world has experienced an exponential technological revolution in the last 40 years, and this era has ushered in a massive influx of information, with data now forming the backbone of decision-making across industries. An estimated 2.5 quintillion bytes of data are created every day, and with the rise of automation, machine learning, and artificial intelligence, this number is only set to increase. In this era of AI, IoT, and ML, data is like the new ‘oil’ required to keep these systems and engines running. From healthcare to finance, education to e-commerce, to mention just a few, data is gathered, analyzed, and used to make decisions that influence every aspect of our lives. However, the ability to harness and make sense of this information effectively requires a crucial skill: data literacy.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpbnk3l3fwus254jjwagi.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpbnk3l3fwus254jjwagi.jpg" alt="Data Analysis" width="624" height="312"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Understanding Data Literacy&lt;/strong&gt;&lt;br&gt;
In a world overflowing with data, our ability to understand, interpret, and utilize it effectively has never been more critical. Data literacy can be defined as the capacity to read, analyze, and communicate data insights effectively. It empowers individuals and organizations to make informed decisions, identify trends, and ultimately enhance business outcomes. At the top of the list of benefits of making data-driven decisions is that the decision is based on factual insights rather than relying solely on intuition. This approach often results in improved efficiency and stronger business strategies, giving data-driven companies a competitive edge over those that do not prioritize data in their decision-making.&lt;br&gt;
However, simply having access to data is not enough to guarantee success. Businesses also need skilled professionals who can analyze and interpret data effectively. As a result, there is a growing demand in the job market for professionals with strong data literacy skills.&lt;/p&gt;

&lt;h2&gt;Key Data Literacy Skills&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Data Analytics&lt;/strong&gt;&lt;br&gt;
Data analysis involves examining and interpreting data to extract meaningful insights. It can be as basic as reviewing patterns and drawing conclusions, or as complex as applying sophisticated techniques that use statistical models and machine learning algorithms. There are several types of data analysis, with four primary approaches:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Descriptive analysis&lt;/em&gt; – Explains what has happened based on historical data.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Diagnostic analysis&lt;/em&gt; – Identifies reasons behind specific outcomes.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Predictive analysis&lt;/em&gt; – Forecasts potential future trends based on patterns.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Prescriptive analysis&lt;/em&gt; – Recommends actions to achieve desired results.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Developing analytical skills is crucial for making informed business decisions and identifying opportunities for growth.&lt;br&gt;
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fahu3lalanc5gc249d1ks.png" alt="Data Analytics" width="700" height="400"&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Data Cleaning (Wrangling)&lt;/strong&gt;&lt;br&gt;
Data cleaning is the process of converting raw data into a usable format. This often entails removing inconsistencies, filling in missing information, and organizing data for analysis.&lt;br&gt;
Clean data reduces errors in decision-making and ensures accuracy in reporting. While many businesses automate this process using algorithms, employees involved in data collection and entry must also uphold data integrity standards. Popular tools used in this stage include Power BI and Microsoft Excel.&lt;/p&gt;
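&lt;p&gt;As a small illustration of what these steps can look like in code, here is a pandas sketch; the file and column names are hypothetical:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

# Load raw data (hypothetical file and columns)
df = pd.read_csv("sales_raw.csv")

# Remove duplicate rows and standardize inconsistent text values
df = df.drop_duplicates()
df["region"] = df["region"].str.strip().str.title()

# Fill missing numeric values and drop rows missing a key field
df["quantity"] = df["quantity"].fillna(0)
df = df.dropna(subset=["order_id"])

df.to_csv("sales_clean.csv", index=False)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;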

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzn93g1s3r20hem4r6fuf.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzn93g1s3r20hem4r6fuf.jpg" alt="Data Cleaning" width="525" height="362"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Data Visualization&lt;/strong&gt;&lt;br&gt;
Data visualization is the practice of transforming raw numbers into visual formats such as graphs, charts, and infographics. It plays a crucial role in making complex data more understandable, especially for stakeholders who may not have advanced data skills. Common visualization techniques include:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fth3azfo85d2aqvpm9g3u.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fth3azfo85d2aqvpm9g3u.jpg" alt="Data Visualization" width="474" height="284"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Charts and tables for presenting trends&lt;/li&gt;
&lt;li&gt;Maps for geographic data representation&lt;/li&gt;
&lt;li&gt;Infographics for summarizing key insights&lt;/li&gt;
&lt;li&gt;Interactive dashboards for real-time data monitoring.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Data specialists use popular tools like Microsoft Excel, Google Charts, Tableau, and Power BI to create compelling visualizations that drive better decision-making.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. The Data Ecosystem&lt;/strong&gt;&lt;br&gt;
A data ecosystem refers to the interconnected components that support an organization’s data management, including:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;- Physical infrastructure&lt;/em&gt; – Servers, databases, and cloud storage solutions&lt;br&gt;
&lt;em&gt;- Software and tools&lt;/em&gt; – Data analysis platforms, programming languages, and AI-driven applications&lt;br&gt;
&lt;em&gt;- Data sources&lt;/em&gt; – Internal and external datasets that fuel insights.&lt;br&gt;&lt;br&gt;
Understanding an organization's data ecosystem helps professionals optimize workflows, improve efficiency, and ensure data is used effectively.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F87mya59vr4pad1d16uo1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F87mya59vr4pad1d16uo1.png" alt="The Data Ecosystem" width="330" height="226"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Data Governance&lt;/strong&gt;&lt;br&gt;
Data governance is the framework that defines how an organization manages its data assets. It ensures data is accurate, secure, and used responsibly. Key areas of data governance include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Quality&lt;/em&gt; – Maintaining data accuracy, consistency, and completeness.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Security&lt;/em&gt; – Protecting data from unauthorized access and breaches.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Privacy&lt;/em&gt; – Safeguarding sensitive information, such as customer and employee records.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Stewardship&lt;/em&gt; – Enforcing policies and best practices for ethical data handling.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn7pzgfv3vuv4khonrnwq.png" alt="Data Governance" width="513" height="171"&gt;&lt;br&gt;
Many companies have formal data policies that outline rules for employees, ensuring compliance with legal and industry standards.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. The Data Team&lt;/strong&gt;&lt;br&gt;
Understanding the roles within a data team can help professionals collaborate effectively. Most organizations have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Data Scientists&lt;/em&gt; – Experts in advanced analytics, machine learning, and predictive modeling.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Data Engineers&lt;/em&gt; – Developers who build and maintain data infrastructures, ensuring seamless data processing.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Data Analysts&lt;/em&gt; – Specialists who conduct analyses, generate reports, and provide actionable insights.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of these roles contributes to turning raw data into valuable business intelligence, helping organizations remain competitive in a data-driven world.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
