Rahul Soni

Building an ETL Job with Python and DuckDB: Analyzing Electric Vehicle Data


In this blog post, we will explore how to build an ETL (Extract, Transform, Load) job using Python and DuckDB to analyze electric vehicle data. We will walk through the code step by step, explaining each part and highlighting key concepts along the way. By the end of this post, you will have a solid understanding of how to process and analyze datasets efficiently using Python and DuckDB.

Prerequisites

Before we begin, make sure you have the following prerequisites:

  1. Python installed on your machine
  2. DuckDB Python library installed (pip install duckdb)

Introduction

As the popularity of electric vehicles continues to rise, there is a growing need to analyze and extract insights from electric vehicle data. This blog post aims to demonstrate how to build an ETL job to process electric vehicle data using Python and DuckDB.

Setting Up DuckDB and Python

To get started, we need to import the necessary libraries and establish a connection to DuckDB. Open your favorite Python editor and create a new Python script. Let's begin by importing the required libraries:

import duckdb
from duckdb import DuckDBPyConnection
import os
import logging

We also need to configure logging for better visibility into the execution of our ETL job. Add the following code to configure the logging module:

logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger(__name__)

Creating the Database Table

The first step in our ETL process is to create a table in DuckDB to store the electric vehicle data. We'll define the schema of the table and execute the SQL statements to create it. Here's the code for creating the "vehicles" table:

CREATE_TABLE_VEHICLES = """
        DROP TABLE IF EXISTS vehicles CASCADE;
        CREATE TABLE vehicles (
            VIN VARCHAR(10),
            County VARCHAR,
            City VARCHAR,
            State VARCHAR(2),
            Postal_Code VARCHAR(10),
            Model_Year INTEGER,
            Make VARCHAR,
            Model VARCHAR,
            Electric_Vehicle_Type VARCHAR,
            CAFV_Eligibility VARCHAR,
            Electric_Range INTEGER,
            Base_MSRP DECIMAL,
            Legislative_District INTEGER,
            DOL_Vehicle_ID VARCHAR(15),
            Vehicle_Location VARCHAR,
            Electric_Utility VARCHAR,
            Census_Tract VARCHAR(11)
    );
"""

conn = duckdb.connect(database="electric_vehicles_data")
cursor = conn.cursor()

# Create vehicles table
try:
    cursor.execute(CREATE_TABLE_VEHICLES)
    logger.info("Vehicles table created.")
except Exception as e:
    logger.error(f"Failed to create vehicles table: {str(e)}")
    conn.close()
    raise SystemExit(1)  # "return" is only valid inside a function; exit the script instead

In the code above, we define the SQL statement CREATE_TABLE_VEHICLES as a multi-line string. It drops the existing "vehicles" table if it exists and creates a new table with the specified columns and data types.

We establish a connection to DuckDB using duckdb.connect() and create a cursor to execute SQL statements. Then, we execute the CREATE TABLE statement using the cursor. If any exception occurs, we log an error message, close the connection, and exit the script.

Extracting and Loading Data

Now that we have our table structure in place, let's proceed to extract data from a CSV file and load it into our "vehicles" table. We'll define a function called insert_data() to handle this task:

def insert_data(cursor: DuckDBPyConnection, csv_file_path: str):
    try:
        cursor.execute(
            f"INSERT INTO vehicles SELECT * FROM read_csv_auto('{csv_file_path}');"
        )
        cursor.commit()
        logger.info("Data insertion successful.")
    except Exception as e:
        logger.error(f"Failed to insert data: {str(e)}")

The insert_data() function takes two parameters: the cursor object and the path to the CSV file containing the data. Inside the function, we use the read_csv_auto() function provided by DuckDB, which infers the column types automatically, to read the CSV file and insert the data into the "vehicles" table with an INSERT INTO ... SELECT statement.

To call this function and load the data, add the following code to your script:

csv_file_path = "data/Electric_Vehicle_Population_Data.csv"
insert_data(cursor, csv_file_path)

Make sure to replace "data/Electric_Vehicle_Population_Data.csv" with the actual path to your CSV file.

Data Analysis Queries

With the data loaded into our "vehicles" table, we can now perform various data analysis queries. Let's define a few functions to run these queries and retrieve the results.

Counting Cars per City

Our first query will count the number of cars per city. Define a function called count_cars_per_city():

def count_cars_per_city(cursor: DuckDBPyConnection):
    cars_per_city_query = """
    SELECT city, COUNT(VIN) AS cars_in_city 
    FROM vehicles 
    GROUP BY city
    """

    try:
        cursor.execute(cars_per_city_query)
        results = cursor.fetchall()
        return results
    except Exception as e:
        logger.error(f"Failed to count cars per city: {str(e)}")
        return []

The count_cars_per_city() function takes the cursor object as a parameter. It executes the SQL query to count the cars per city and fetches the results. The results are returned as a list of tuples, where each tuple contains the city name and the corresponding car count.

To retrieve the cars per city, add the following code to your script:

cars_per_city = count_cars_per_city(cursor)
logger.info(f"Cars per city: {cars_per_city}")

Finding the Top Three Popular Vehicles

The find_top_three_popular_vehicles function executes an SQL query that groups the data by make and model, counts the number of vehicles for each combination, and retrieves the top three most popular vehicles based on the vehicle count. This information helps us identify the most sought-after electric vehicles.
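The original post doesn't show the body of this function, so here is a minimal sketch of how it could look, following the same pattern as count_cars_per_city(); the exact SQL in the accompanying repo may differ:

def find_top_three_popular_vehicles(cursor: DuckDBPyConnection):
    top_vehicles_query = """
    SELECT Make, Model, COUNT(VIN) AS vehicle_count
    FROM vehicles
    GROUP BY Make, Model
    ORDER BY vehicle_count DESC
    LIMIT 3
    """

    try:
        cursor.execute(top_vehicles_query)
        return cursor.fetchall()
    except Exception as e:
        logger.error(f"Failed to find top three popular vehicles: {str(e)}")
        return []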

Determining the Most Popular Vehicle by Postal Code

The find_most_popular_vehicle_by_postal_code function executes an SQL query that groups the data by postal code, model, and make. It then calculates the vehicle count for each combination and selects the most popular vehicle for each postal code. This analysis reveals the preferred electric vehicle models in different areas.
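Again, the function body isn't included in the post; one possible sketch ranks the make/model combinations within each postal code using a window function (the repo's query may be written differently):

def find_most_popular_vehicle_by_postal_code(cursor: DuckDBPyConnection):
    popular_by_postal_query = """
    SELECT Postal_Code, Make, Model, vehicle_count
    FROM (
        SELECT Postal_Code, Make, Model,
               COUNT(VIN) AS vehicle_count,
               -- Rank make/model combinations within each postal code
               ROW_NUMBER() OVER (
                   PARTITION BY Postal_Code
                   ORDER BY COUNT(VIN) DESC
               ) AS rn
        FROM vehicles
        GROUP BY Postal_Code, Make, Model
    ) AS ranked
    WHERE rn = 1
    """

    try:
        cursor.execute(popular_by_postal_query)
        return cursor.fetchall()
    except Exception as e:
        logger.error(f"Failed to find most popular vehicle by postal code: {str(e)}")
        return []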

Counting Cars by Model Year

The count_cars_by_model_year function executes an SQL query to count the number of cars for each model year. This analysis provides an overview of the electric vehicle market's growth over time.
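A minimal sketch of this function, ordered by year so the growth trend is easy to read; the results it returns are used in the next section:

def count_cars_by_model_year(cursor: DuckDBPyConnection):
    cars_by_year_query = """
    SELECT Model_Year, COUNT(VIN) AS car_count
    FROM vehicles
    GROUP BY Model_Year
    ORDER BY Model_Year
    """

    try:
        cursor.execute(cars_by_year_query)
        return cursor.fetchall()
    except Exception as e:
        logger.error(f"Failed to count cars by model year: {str(e)}")
        return []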

Saving Results

As we generate the results of our data analysis queries, it's important to save them for future reference. Let's define a directory structure to store the results and write the results to files.

output_dir = "data/results"
os.makedirs(output_dir, exist_ok=True)

In the code above, we define the output_dir variable as the path to the directory where the results will be saved. We use os.makedirs() to create the directory if it doesn't exist.

Next, we'll save the results of the "count_cars_by_model_year" query to separate Parquet files for each year. Add the following code after the count_cars_by_model_year() function:

rows = count_cars_by_model_year(cursor)

for row in rows:
    year, count = row

    year_dir = os.path.join(output_dir, str(year))
    os.makedirs(year_dir, exist_ok=True)

    file_path = os.path.join(year_dir, f"{year}.parquet")

    try:
        # Use DuckDB's COPY ... (FORMAT PARQUET) so the output is a real
        # Parquet file; writing plain text to a ".parquet" path would
        # produce a file no Parquet reader could open.
        cursor.execute(
            f"COPY (SELECT {year} AS model_year, {count} AS car_count) "
            f"TO '{file_path}' (FORMAT PARQUET)"
        )
        logger.info(f"File saved: {file_path}")
    except Exception as e:
        logger.error(f"Failed to save file: {str(e)}")

In the code above, we fetch the rows returned by the count_cars_by_model_year() function and iterate over them. For each row, we extract the year and count values, create a subdirectory for that year within the output_dir, and use DuckDB's COPY statement to write a Parquet file named "{year}.parquet" containing the year and its car count.

Conclusion

Congratulations! You have successfully built an ETL job using Python and DuckDB to analyze electric vehicle data. We covered the steps involved in creating a database table, loading data from a CSV file, running data analysis queries, and saving the results.

Python and DuckDB provide powerful tools for processing and analyzing large datasets efficiently. With the knowledge gained from this blog post, you can now apply these concepts to other data analysis tasks and gain valuable insights from your data.

Feel free to experiment with different queries and analysis techniques to further explore the electric vehicle data. Happy coding!

Note: Don't forget to close the database connection at the end of your script: conn.close().

Github Repo

Additional Resources
