Vanny Durby

From Sunlight to Insights: A Python-Powered Dive into Solar Energy Data

The sun showers our planet with more energy in a single hour than all of humanity consumes in an entire year. As developers, we're uniquely positioned to harness not just this energy, but the vast amounts of data it generates. From IoT sensors on solar farms to meteorological stations, the world of renewable energy is a goldmine of fascinating time-series data.

Today, we're embarking on a data adventure. We'll use Python and its powerful data science libraries to explore a rich dataset of solar measurements. You'll learn how to take raw CSV files, clean them up, and start asking questions to reveal the hidden patterns of the sun, the weather, and their intricate dance. This isn't just an academic exercise; it's a hands-on introduction to the kind of exploratory data analysis (EDA) that forms the foundation of machine learning models for energy forecasting and grid optimization.

This article is a deep-dive rewrite of an excellent tutorial originally published on iunera.com's blog, titled "Solar Data Analysis with Python (Part 1: Introduction to the Solar Dataset)". We'll expand on its concepts to create a comprehensive guide for the dev.to community.

Setting Up Your Digital Sundial 🛠️

Before we can start crunching numbers, we need to get our environment ready. This tutorial assumes you have a basic understanding of Python. Here’s the toolkit we'll be using:

  • Python (3.8+): The language of choice for data science.
  • Jupyter Notebook: An interactive environment perfect for EDA. If you're new to it, iunera has a great beginner's tutorial on Jupyter Notebooks.
  • Pandas: The cornerstone library for data manipulation and analysis in Python.
  • NumPy: For numerical operations, often used under the hood by Pandas.
  • Matplotlib: The classic library for creating static, animated, and interactive visualizations.

You can install these libraries using pip:

pip install jupyter pandas numpy matplotlib

The Dataset: A Solar Goldmine

The data we'll be working with comes from a project by the Government of Pakistan, NREL (National Renewable Energy Laboratory), and USAID. It provides high-resolution, time-series measurements for solar radiation, temperature, and other meteorological metrics for several cities in Pakistan. You can access and download the data from this link.

Pakistan is an excellent case study due to its high solar potential, receiving abundant sunlight throughout the year. This makes the data particularly interesting for analyzing the viability of solar power generation.

Step 1: Data Acquisition and Assembly

A common task for any data professional is gathering data from multiple sources and combining them. Let's start by downloading the data for a couple of cities, say Hyderabad and Islamabad, and then merge them into a single, unified DataFrame.

After downloading the CSV files, we can load one to see what we're working with.

import pandas as pd

# Load a single file to inspect it
df_hyd = pd.read_csv('solar-measurementspakistanhyderabadwb-esmapqc.csv')
print(df_hyd.head())

Now, let's write a small script to read all the city CSVs in a directory, add a city column to each one (so we know where the data came from), and concatenate them into a single master DataFrame.

import glob
import os

import pandas as pd

# Path to the directory containing your CSV files
path = 'your_data_directory/'
all_files = glob.glob(os.path.join(path, '*.csv'))

frames = []

for filename in all_files:
    df = pd.read_csv(filename, header=0)
    # Pull the city name out of the filename, e.g.
    # 'solar-measurementspakistanhyderabadwb-esmapqc.csv' -> 'hyderabad'
    city_name = os.path.basename(filename).split('pakistan')[-1].split('wb-esmapqc.csv')[0]
    df['city'] = city_name
    frames.append(df)

# Concatenate all per-city DataFrames into one master DataFrame
master_df = pd.concat(frames, axis=0, ignore_index=True)

print(master_df.info())
print(master_df['city'].value_counts())

This gives us a single, powerful DataFrame to work with, containing data from multiple locations.

Step 2: Decoding the Solar Jargon ☀️

Before we can analyze the data, we need to understand what the columns mean. Here are the key players:

Column                 Detail
time                   Date and time (yyyy-mm-dd HH:MM)
ghi_pyr                Global Horizontal Irradiance (W/m²) from a pyranometer
dni                    Direct Normal Irradiance (W/m²)
dhi                    Diffuse Horizontal Irradiance (W/m²)
air_temperature        Air temperature (°C)
relative_humidity      Relative humidity (%)
barometric_pressure    Ambient air pressure (Pa)
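
With those columns in mind, a quick describe() is a handy sanity check that the values and units look plausible. The column names below are taken from the table above; adjust them if your download differs:

# Inspect the ranges of the key measurement columns
cols = ['ghi_pyr', 'dni', 'dhi', 'air_temperature',
        'relative_humidity', 'barometric_pressure']
print(master_df[cols].describe())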

Let's break down the most important solar terms:

  • Global Horizontal Irradiance (GHI): This is the total solar radiation received by a horizontal surface on the ground. It includes both direct sunlight and scattered light from the sky. Think of it as the energy a solar panel would get if it were lying flat on your roof.
  • Direct Normal Irradiance (DNI): This measures only the solar radiation coming in a straight, direct line from the sun. To capture this energy effectively, you'd need a solar panel that perfectly tracks the sun's position across the sky. High DNI is crucial for concentrating solar power plants.
  • Diffuse Horizontal Irradiance (DHI): This is the sunlight that's been scattered by clouds, dust, and molecules in the atmosphere. It's why you can still see during the day even when it's completely overcast. GHI is essentially the sum of DHI and the horizontal component of DNI, as the quick sanity check below illustrates.
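
The closure relation is GHI ≈ DHI + DNI × cos(solar zenith angle). Here is a minimal numeric sketch; the dataset itself has no zenith column, so the angle and irradiance values below are made-up illustrative numbers:

import numpy as np

# GHI ≈ DHI + DNI * cos(solar zenith angle)
# Illustrative values only; the dataset does not include a zenith column.
zenith_deg = 45.0   # hypothetical solar zenith angle
dni = 700.0         # W/m², direct normal irradiance
dhi = 120.0         # W/m², diffuse horizontal irradiance

ghi_estimate = dhi + dni * np.cos(np.deg2rad(zenith_deg))
print(f"Estimated GHI: {ghi_estimate:.0f} W/m²")  # ~615 W/m²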

Step 3: The Data Janitor's Work 🧼

No real-world dataset is perfect. It's often messy, with missing values and inconsistencies. Data cleaning is a critical, albeit unglamorous, part of the job.

Let's start by checking for missing values in our master_df.

import matplotlib.pyplot as plt

# Calculate the percentage of missing values for each column
missing_percentage = master_df.isnull().sum() * 100 / len(master_df)

# Plot the results as a simple Matplotlib bar chart
plt.figure(figsize=(10, 6))
plt.bar(missing_percentage.index, missing_percentage.values)
plt.xticks(rotation=90)
plt.ylabel('Percentage of Missing Values (%)')
plt.title('Missing Value Analysis')
plt.tight_layout()
plt.show()

Right away, we see that columns like comments and ghi_rsi have a significant number of NaN values. Let's investigate the comments column. What does it say when it's not empty?

print(master_df['comments'].dropna().unique())

# Output might include: ['Tracking device not operational' 'Power supply failure']

This is illuminating! The comment Tracking device not operational likely explains why dni and dhi values are missing in those rows: the instrument that measures them wasn't working. For now, we'll keep these rows, since they still contain valuable data like ghi_pyr and air_temperature.
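
If you want to verify that hunch, a quick cross-check works well. This sketch assumes the comment text matches the sample output above; swap in whatever strings your data actually contains:

# How often are dni/dhi missing in rows where the tracker was reported down?
tracker_down = master_df['comments'] == 'Tracking device not operational'
print(master_df.loc[tracker_down, ['dni', 'dhi']].isnull().mean())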

Step 4: EDA - Uncovering Daily Rhythms 📈

Now for the fun part: visualization! Let's plot some of these metrics over a single day to understand their natural cycles.

The Daily Temperature Cycle

Let's pick a single day for one city and plot its temperature.


# First, convert the 'time' column to a proper datetime object
master_df['time'] = pd.to_datetime(master_df['time'])

# Filter for a single day in Hyderabad
one_day_df = master_df[(master_df['city'] == 'hyderabad') & (master_df['time'].dt.date == pd.to_datetime('2016-06-01').date())]

plt.figure(figsize=(12, 6))
plt.plot(one_day_df['time'], one_day_df['air_temperature'])
plt.title('Air Temperature on June 1, 2016 in Hyderabad')
plt.xlabel('Time of Day')
plt.ylabel('Temperature (°C)')
plt.grid(True)
plt.show()

The plot reveals a familiar pattern: the temperature is lowest just before dawn, rises sharply in the morning, peaks in the mid-afternoon (around 2-4 PM), and then gradually cools down after sunset.

The Dance of Temperature and Humidity

If we plot relative humidity on the same day, we see an interesting inverse relationship.


# Plot relative humidity for the same day (mirrors the temperature plot above)
plt.figure(figsize=(12, 6))
plt.plot(one_day_df['time'], one_day_df['relative_humidity'])
plt.title('Relative Humidity on June 1, 2016 in Hyderabad')
plt.xlabel('Time of Day')
plt.ylabel('Relative Humidity (%)')
plt.grid(True)
plt.show()

As the air temperature rises, its capacity to hold water vapor increases. This means that even if the absolute amount of moisture in the air stays the same, the relative humidity drops. This is why the humidity is highest in the cool early morning and lowest during the hot afternoon.
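
A tiny numeric illustration makes this concrete. Using the Magnus approximation for saturation vapour pressure (a standard meteorological formula, not something taken from this dataset), the same absolute moisture yields very different relative humidity at different temperatures:

import numpy as np

def saturation_vapor_pressure(temp_c):
    """Magnus approximation for saturation vapour pressure in hPa."""
    return 6.112 * np.exp(17.62 * temp_c / (243.12 + temp_c))

# Fix the absolute moisture at 60% RH for a 20 °C morning,
# then see what that same moisture means on a 35 °C afternoon.
e_actual = 0.6 * saturation_vapor_pressure(20.0)
for temp in (20.0, 35.0):
    rh = 100 * e_actual / saturation_vapor_pressure(temp)
    print(f"{temp:.0f} °C -> relative humidity {rh:.0f}%")
# 20 °C -> 60%, 35 °C -> roughly 25%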

Seeing a Clear Day vs. a Cloudy Day

The GHI and DNI plots are fantastic for diagnosing the weather. On a perfectly clear day, the GHI plot will be a smooth, symmetrical bell curve, peaking at solar noon.

[Figure: clear-sky GHI, a smooth, symmetrical bell curve peaking at solar noon]

However, on a day with passing clouds, the chart will look spiky and irregular. Each dip in the curve represents a cloud temporarily blocking the sun.

[Figure: cloudy-day GHI, spiky and irregular as passing clouds block the sun]

This ability to distinguish weather conditions from the data is incredibly powerful for solar energy forecasting.
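
If you want to reproduce this diagnostic yourself, a sketch like the following works, reusing master_df and the datetime conversion from earlier. The two dates are placeholders; pick any dates present in your download:

# Overlay GHI profiles for two days: a smooth curve suggests clear sky,
# a spiky one suggests passing clouds. Dates below are placeholders.
for day in ['2016-06-01', '2016-06-05']:
    mask = (master_df['city'] == 'hyderabad') & \
           (master_df['time'].dt.date == pd.to_datetime(day).date())
    day_df = master_df[mask]
    hours = day_df['time'].dt.hour + day_df['time'].dt.minute / 60
    plt.plot(hours, day_df['ghi_pyr'], label=day)

plt.title('GHI Profiles by Day')
plt.xlabel('Hour of Day')
plt.ylabel('GHI (W/m²)')
plt.legend()
plt.show()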

Step 5: Zooming Out - A Year in the Sun

Plotting a single day is insightful, but what about seasonal trends? To do this, we need to aggregate our high-frequency data (every few minutes) into daily summaries. We can use Pandas' resample() and agg() methods to calculate the daily minimum, maximum, and mean for our metrics.


# Set 'time' as the index for resampling
hyd_df = master_df[master_df['city'] == 'hyderabad'].set_index('time')

# Resample to daily frequency, calculating min, mean, and max
daily_temp = hyd_df['air_temperature'].resample('D').agg(['min', 'mean', 'max'])

# Drop days with no data
daily_temp.dropna(inplace=True)

# Plot the results
plt.figure(figsize=(15, 7))
plt.plot(daily_temp.index, daily_temp['max'], label='Daily Max Temp', color='red')
plt.plot(daily_temp.index, daily_temp['mean'], label='Daily Mean Temp', color='orange')
plt.plot(daily_temp.index, daily_temp['min'], label='Daily Min Temp', color='blue')
plt.fill_between(daily_temp.index, daily_temp['min'], daily_temp['max'], color='gray', alpha=0.2)
plt.title('Daily Temperature Variation in Hyderabad (2015-2016)')
plt.xlabel('Date')
plt.ylabel('Temperature (°C)')
plt.legend()
plt.show()

This aggregated view clearly shows the seasons. We can see the brutally hot summer months leading up to June, when maximum daily temperatures soar, and the milder winter period from November to February.
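
As a natural follow-up, the same resampling trick summarizes solar potential by month. This sketch reuses hyd_df from the previous step and the ghi_pyr column from the table earlier; on pandas 2.2+ you would write 'ME' instead of 'M':

# Monthly mean GHI: a compact view of seasonal solar potential
monthly_ghi = hyd_df['ghi_pyr'].resample('M').mean().dropna()

plt.figure(figsize=(12, 5))
plt.bar(monthly_ghi.index.strftime('%Y-%m'), monthly_ghi.values)
plt.xticks(rotation=45)
plt.ylabel('Mean GHI (W/m²)')
plt.title('Monthly Mean GHI in Hyderabad')
plt.tight_layout()
plt.show()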

Beyond Pandas: Scaling Up for the Real World

While Pandas is fantastic for datasets that fit in memory, imagine analyzing this kind of data in real-time from thousands of sensors across an entire continent. This is a classic time-series data challenge. At that scale, you need specialized, high-performance analytical databases.

This is where technologies like Apache Druid come in. Druid is designed for sub-second queries on massive, streaming datasets. Managing such a system in production is a complex task, often involving containerization with Kubernetes and requiring deep expertise in cluster tuning and query optimization to avoid performance bottlenecks.

Conclusion and Next Steps 🚀

We've taken a raw collection of solar energy data and transformed it into meaningful insights. We've learned how to:

  • Combine multiple datasets into one using Pandas.
  • Understand and interpret key meteorological variables.
  • Perform basic data cleaning to handle missing values.
  • Visualize daily and seasonal patterns using Matplotlib.

This is just the beginning of the journey. From here, you could explore forecasting solar output with machine learning, comparing the solar potential between different cities, or detecting anomalies in sensor readings.

The world of data is vast, and the tools to explore it are more accessible than ever. So go ahead, find a dataset that interests you, and start discovering the stories hidden within the numbers.

For businesses looking to implement large-scale, real-time data analytics solutions, specialized expertise is key. If you're tackling challenges with massive time-series data, exploring services like Apache Druid AI Consulting in Europe or looking into advanced platforms like Enterprise MCP Server Development can provide the support needed to build robust, scalable systems.
