Hillary Onyango


Working with Apache Airflow to Automate Collection of Weather Data for Kenya’s Major Agricultural Areas

Introduction
One of the most valuable aspects of Apache Airflow is that it turns messy data scattered across sources into a seamless flow of clean data that analytics teams and data scientists can put to work. This project extracts daily weather data from the OpenWeatherMap API, transforms it into a structured format, and loads it into a PostgreSQL database. Access to reliable weather data is essential for farmers to optimize crop yields, manage irrigation, and mitigate climate risks. The Kenya Agricultural Regions Weather ETL Pipeline addresses this need by automating the collection, processing, and storage of weather data for 17 key agricultural regions. Built with Apache Airflow and integrated with the OpenWeatherMap API and PostgreSQL, this project delivers a robust, scalable solution for data-driven agriculture.

Project Overview

Kenya’s agricultural sector is a cornerstone of its economy, supporting millions of livelihoods and contributing significantly to national GDP. This data engineering project makes available the data that analysts and data scientists need to make decisions that strengthen agriculture’s contribution to the nation’s economy and to food security in general. The ETL pipeline is an automated system that extracts daily weather data from the OpenWeatherMap API, transforms it into a structured format, and loads it into a PostgreSQL database. The pipeline captures critical metrics such as temperature, humidity, pressure, wind speed, rainfall, and weather descriptions in major agricultural regions like Eldoret, Nakuru, Kitale, and Embu. By leveraging Apache Airflow, the system ensures reliable scheduling, error handling, and monitoring, with email notifications for operational transparency.

Key Features
The pipeline is designed for efficiency and reliability, offering:
• Daily Weather Updates: Collects data for 17 agricultural regions in Kenya.
• Automated Workflow: Uses Apache Airflow to orchestrate the ETL process.
• PostgreSQL Integration: Stores data in a structured, queryable database.
• Comprehensive Error Handling: Manages API failures, database issues, and data validation errors.
• Email Notifications: Sends alerts for pipeline failures and weekly success reports.
• Secure Configuration: Uses environment variables for API keys, database credentials, and SMTP settings.

Technical Design
The pipeline follows a standard ETL architecture:

  1. Extract: Fetches raw weather data from the OpenWeatherMap API using Python’s requests library, targeting coordinates for each of the 17 regions.
  2. Transform: Cleans and processes data with pandas, converting units (e.g., Kelvin to Celsius), handling missing values, and structuring data for storage.
  3. Load: Inserts transformed data into a PostgreSQL database using sqlalchemy and psycopg2-binary; a sketch of all three steps follows the schema code below. The database schema is designed for simplicity and scalability:
import logging

import psycopg2


def create_weather_table():
    """Create weather_data table if it doesn't exist"""
    create_table_sql = """
    CREATE TABLE IF NOT EXISTS weather_data (
        id SERIAL PRIMARY KEY,
        region VARCHAR(50) NOT NULL,
        latitude DECIMAL(10, 6) NOT NULL,
        longitude DECIMAL(10, 6) NOT NULL,
        temperature DECIMAL(5, 2),
        feels_like DECIMAL(5, 2),
        temp_min DECIMAL(5, 2),
        temp_max DECIMAL(5, 2),
        pressure INTEGER,
        humidity INTEGER,
        visibility INTEGER,
        wind_speed DECIMAL(5, 2),
        wind_direction INTEGER,
        cloudiness INTEGER,
        weather_main VARCHAR(50),
        weather_description VARCHAR(100),
        rainfall_1h DECIMAL(8, 2) DEFAULT 0,
        rainfall_3h DECIMAL(8, 2) DEFAULT 0,
        sunrise TIMESTAMP,
        sunset TIMESTAMP,
        data_timestamp TIMESTAMP NOT NULL,
        extraction_timestamp TIMESTAMP NOT NULL,
        heat_index DECIMAL(5, 2),
        dew_point DECIMAL(5, 2),
        is_favorable_temp BOOLEAN,
        is_high_humidity BOOLEAN,
        rainfall_category VARCHAR(20),
        date DATE,
        hour INTEGER,
        month INTEGER,
        year INTEGER,
        created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
        UNIQUE(region, data_timestamp)
    );

    -- Create indexes for better query performance
    CREATE INDEX IF NOT EXISTS idx_weather_region ON weather_data(region);
    CREATE INDEX IF NOT EXISTS idx_weather_date ON weather_data(date);
    CREATE INDEX IF NOT EXISTS idx_weather_timestamp ON weather_data(data_timestamp);
    CREATE INDEX IF NOT EXISTS idx_weather_region_date ON weather_data(region, date);
    """

    conn = get_postgres_connection()  # project helper that opens a psycopg2 connection
    try:
        cursor = conn.cursor()
        cursor.execute(create_table_sql)
        conn.commit()
        logging.info("Weather table created successfully")
        cursor.close()
    except psycopg2.Error as e:
        logging.error(f"Error creating table: {e}")
        conn.rollback()
        raise
    finally:
        conn.close()
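To make the three stages concrete, here is a minimal sketch of what the extract, transform, and load steps might look like. The region coordinates, environment variable names, and function names below are illustrative assumptions, not the project’s exact code:

import os

import pandas as pd
import requests
from sqlalchemy import create_engine

# Illustrative subset of the 17 regions: name -> (latitude, longitude).
REGIONS = {
    "Eldoret": (0.5143, 35.2698),
    "Nakuru": (-0.3031, 36.0800),
    "Kitale": (1.0157, 35.0062),
}

API_URL = "https://api.openweathermap.org/data/2.5/weather"


def extract_weather():
    """Fetch raw current-weather JSON from OpenWeatherMap for each region."""
    api_key = os.environ["OPENWEATHER_API_KEY"]  # assumed variable name
    records = []
    for region, (lat, lon) in REGIONS.items():
        resp = requests.get(
            API_URL,
            params={"lat": lat, "lon": lon, "appid": api_key},
            timeout=30,
        )
        resp.raise_for_status()  # surface API failures so Airflow can retry
        payload = resp.json()
        payload["region"] = region
        records.append(payload)
    return records


def transform_weather(records):
    """Flatten the raw JSON and convert Kelvin temperatures to Celsius."""
    rows = []
    for r in records:
        rows.append({
            "region": r["region"],
            "temperature": round(r["main"]["temp"] - 273.15, 2),  # Kelvin -> Celsius
            "humidity": r["main"]["humidity"],
            "pressure": r["main"]["pressure"],
            "wind_speed": r.get("wind", {}).get("speed"),
            "weather_description": r["weather"][0]["description"],
            "rainfall_1h": r.get("rain", {}).get("1h", 0),
        })
    return pd.DataFrame(rows)


def load_weather(df):
    """Append transformed rows to the weather_data table."""
    # Note: to_sql appends blindly; the real pipeline must also respect the
    # UNIQUE(region, data_timestamp) constraint (e.g., via upserts).
    engine = create_engine(os.environ["DATABASE_URL"])  # assumed variable name
    df.to_sql("weather_data", engine, if_exists="append", index=False)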

Implementation
The project is built with Python 3.8+ and Apache Airflow 2.0+, with dependencies including pandas, sqlalchemy, psycopg2-binary, requests, and python-dotenv. Below are the key components:

1. Project Structure
The GitHub repository includes:
• daily_weather_etl_kenya.py: Defines the Airflow DAG for the ETL workflow.
• .env: Stores sensitive configurations (API keys, database credentials, SMTP settings).
• README.md: Provides setup and usage instructions.
• .gitignore: Excludes sensitive files from version control.
Setup involves cloning the repository, creating a virtual environment, installing dependencies, and configuring the .env file.
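For reference, a hypothetical .env might contain entries like these (every name and value below is a placeholder, not the project’s actual keys):

OPENWEATHER_API_KEY=your_api_key_here
DATABASE_URL=postgresql://user:password@localhost:5432/weather
SMTP_HOST=smtp.gmail.com
SMTP_PORT=587
SMTP_USER=you@example.com
SMTP_PASSWORD=your_app_password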

Project structure

2. Airflow DAG
The DAG is scheduled daily with:
• Retries: 2 attempts for failed tasks.
• Retry Delay: 5 minutes.
• Email Notifications: Enabled for failures and weekly success summaries via SMTP (e.g., Gmail).

SMTP sent the email successfully
Tasks include API data extraction, data transformation, and database loading, all orchestrated through Airflow’s intuitive interface.
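As a minimal sketch of how such a DAG might be declared (the stub callables and alert address are placeholders; the real daily_weather_etl_kenya.py differs in detail):

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


# Stub callables for illustration; the real tasks run the extract,
# transform, and load logic and hand data between tasks (e.g., via XCom).
def extract_task():
    pass


def transform_task():
    pass


def load_task():
    pass


default_args = {
    "owner": "airflow",
    "retries": 2,                         # 2 attempts for failed tasks
    "retry_delay": timedelta(minutes=5),  # 5-minute retry delay
    "email_on_failure": True,
    "email": ["alerts@example.com"],      # placeholder address
}

with DAG(
    dag_id="daily_weather_etl_kenya",
    default_args=default_args,
    schedule_interval="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_weather", python_callable=extract_task)
    transform = PythonOperator(task_id="transform_weather", python_callable=transform_task)
    load = PythonOperator(task_id="load_weather", python_callable=load_task)

    extract >> transform >> load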

3. Error Handling
The pipeline handles:
• API Issues: Timeouts and invalid responses are managed with retry logic.
• Database Errors: Connection failures are caught and logged.
• Data Validation: Ensures data integrity before loading (see the sketch after this list).
• Logging: Airflow logs provide detailed execution insights.
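As an illustration of the validation idea, a check before loading might look like this (the thresholds and column names are assumptions, not the project’s exact rules):

import pandas as pd


def validate_weather(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows that fail basic sanity checks before loading."""
    df = df.dropna(subset=["region", "temperature"])  # required fields
    df = df[df["humidity"].between(0, 100)]           # humidity is a percentage
    df = df[df["temperature"].between(-10, 60)]       # plausible Celsius range for Kenya
    return df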
4. Deployment
Users set up Airflow by initializing the database, creating an admin user, and running the webserver (http://localhost:8080) and scheduler. The Airflow UI enables monitoring of DAG runs and task statuses.
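As a rough sketch, the standard Airflow 2.x commands for these steps look like this (the credentials are placeholders):

# Initialize Airflow's metadata database
airflow db init

# Create an admin user for the web UI
airflow users create \
    --username admin --password admin \
    --firstname Admin --lastname User \
    --role Admin --email admin@example.com

# Start the webserver (http://localhost:8080) and the scheduler
airflow webserver --port 8080
airflow scheduler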
Impact
This pipeline empowers stakeholders in Kenya’s agricultural sector:
• Farmers: Use weather data to plan planting, irrigation, and harvesting.
• Researchers: Analyze historical data for climate and crop studies.
• Policymakers: Leverage insights for agricultural planning and disaster response.
By automating data collection, the system saves time and ensures consistent, high-quality data.
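For example, using the schema above, a researcher could pull average temperature and total rainfall per region and day straight from the weather_data table (assuming temperatures are stored in Celsius and rainfall in millimetres, per the transform step):

SELECT region,
       date,
       AVG(temperature) AS avg_temp_c,
       SUM(rainfall_1h) AS total_rain_mm
FROM weather_data
GROUP BY region, date
ORDER BY region, date;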
Challenges and Solutions
Key challenges included:
• API Rate Limits: Addressed by optimizing API calls and implementing retries.
• Data Quality: Ensured through validation and standardization.
• Configuration Security: Managed with python-dotenv for environment variables (see the sketch below).
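For example, loading those settings with python-dotenv might look like this (the variable names are illustrative):

import os

from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from .env into the process environment

API_KEY = os.getenv("OPENWEATHER_API_KEY")  # assumed variable name
DB_URL = os.getenv("DATABASE_URL")          # assumed variable name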
Future Enhancements
Potential improvements include:
• Real-time data streaming for more frequent updates.
• Integration with additional data sources (e.g., soil moisture sensors).
• A visualization dashboard for end users.
• Expansion to cover more Kenyan regions.

Acknowledgments
• OpenWeatherMap API: For providing accessible weather data.
• Apache Airflow: For robust orchestration.
• PostgreSQL: For reliable storage.

Here is the GitHub link for more about the project as I continue improving it:
https://github.com/HillaryOnyango/Kenya-Agricultural-Regions-Weather-ETL-Pipeline

Report on the Status of the Database
The pipeline has been up and running, ingesting data into the database as expected.
Pipeline DAG

The data collected over the weekend shows that nearly all major regions of the country experienced little to no rain, with most regions under broken clouds.
Weather update
In terms of the extreme weather conditions recorded across various regions of the country, here is what the data revealed:
