Abstract
Air quality has become a significant concern in many cities globally due to increased pollution levels. In this short article, we'll see how to calculate the Air Quality Index (AQI), store air quality data in SingleStoreDB and quickly visualise the AQI.
The notebook file used in this article is available from GitHub.
Introduction
In this article, we'll obtain air quality data from the Department for Environment, Food and Rural Affairs (Defra) in the UK and store it in SingleStoreDB. SingleStoreDB integrates with many external tools and libraries.
Create a SingleStoreDB Cloud account
A previous article showed the steps required to create a free SingleStoreDB Cloud account. We'll use the Free Shared Tier and take the default names for the Workspace and Database.
Obtain Air Quality Data
We'll acknowledge Defra and uk-air.defra.gov.uk as the source of information as well as the licence:
© Crown 2023 copyright Defra via uk-air.defra.gov.uk, licenced under the Open Government Licence (OGL).
We'll use the following code to obtain the metadata:
metadata_url = "http://uk-air.defra.gov.uk/openair/R_data/AURN_metadata.RData"
with urlopen(metadata_url) as metadataset:
metadata = rdata.read_rda(metadataset)
metadata_df = metadata["AURN_metadata"]
metadata_df.info()
Example output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2836 entries, 1 to 2836
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 site_id 2836 non-null string
1 site_name 2836 non-null string
2 location_type 2836 non-null string
3 latitude 2836 non-null float64
4 longitude 2836 non-null float64
5 parameter 2836 non-null string
6 Parameter_name 2836 non-null string
7 start_date 2836 non-null string
8 end_date 2836 non-null string
9 ratified_to 2836 non-null string
10 zone 2836 non-null string
11 agglomeration 1656 non-null object
12 local_authority 2815 non-null object
dtypes: float64(2), object(2), string(9)
memory usage: 288.2+ KB
Next, let's search for "Kensington" in local_authority
and retrieve the site_id
values:
# Filter the DataFrame for rows where 'Kensington' is included in 'local_authority' (case-insensitive)
kensington_sites = metadata_df[
metadata_df["local_authority"].str.contains("Kensington", case = False, na = False)
]
kensington_site_ids = kensington_sites["site_id"].unique()
kensington_site_ids
Example output:
<StringArray>
['CRD', 'CRD2', 'KC1', 'WL']
Length: 4, dtype: string
We'll use the value KC1 and the year 2022 and download the data and store it in a Dataframe:
site_id = kensington_site_ids[2]
year = 2022
dataset_url = f"https://uk-air.defra.gov.uk/openair/R_data/{site_id}_{year}.RData"
with urlopen(dataset_url) as dataset:
data = rdata.read_rda(dataset)
aqi_df = data[f"{site_id}_{year}"]
We'll ensure that the date is in the correct format:
aqi_df["date"] = pd.to_datetime(aqi_df["date"], unit = "s")
Let's take a quick look at the data:
aqi_df.head()
Example output:
date O3 NO NO2 NOXasNO2 SO2 CO PM10 PM2.5 wd ws temp site code
1 2022-01-01 00:00:00 39.11572 0.24946 9.94500 10.51875 0.26609 0.209556 14.9 12.076 212.0 3.0 10.4 London N. Kensington KC1
2 2022-01-01 01:00:00 38.71658 0.12473 7.84125 8.22375 -0.53218 0.197914 16.3 12.642 214.8 3.7 10.5 London N. Kensington KC1
3 2022-01-01 02:00:00 41.51056 0.12473 6.88500 7.07625 0.26609 0.221198 15.4 11.604 212.9 3.6 10.2 London N. Kensington KC1
4 2022-01-01 03:00:00 45.70153 0.12473 4.97250 5.16375 0.26609 0.186272 13.6 10.189 208.8 3.9 10.6 London N. Kensington KC1
5 2022-01-01 04:00:00 48.49551 0.12473 4.59000 4.59000 0.26609 0.186272 10.9 8.302 206.7 4.0 11.3 London N. Kensington KC1
We can see the range of values:
aqi_df.describe()
Example output:
date O3 NO NO2 NOXasNO2 SO2 CO PM10 PM2.5 wd ws temp
count 8760 8617.000000 8593.000000 8593.000000 8593.000000 7572.000000 7737.000000 8622.000000 8622.000000 8664.000000 8664.000000 8664.000000
mean 2022-07-02 11:30:00 50.337057 4.756186 18.884452 26.177035 0.557447 0.179206 14.600139 8.870449 193.805794 3.286034 10.889578
min 2022-01-01 00:00:00 -0.399140 -0.124730 0.382500 0.382500 -1.330450 0.034926 0.700000 0.377000 0.100000 0.000000 -5.900000
25% 2022-04-02 05:45:00 33.527760 0.249460 7.650000 8.606250 0.266090 0.104778 8.200000 4.151000 107.475000 2.200000 6.500000
50% 2022-07-02 11:30:00 51.688630 0.748380 13.196250 14.535000 0.532180 0.139704 11.900000 6.226000 219.100000 2.900000 10.700000
75% 2022-10-01 17:15:00 66.855950 1.870950 24.480000 27.157500 0.532180 0.197914 17.400000 10.283000 265.000000 4.100000 15.300000
max 2022-12-31 23:00:00 166.441380 368.701880 108.438750 652.927500 10.377510 1.723016 110.400000 84.623000 360.000000 12.900000 31.900000
std NaN 26.469337 18.191091 16.328962 39.461979 0.677925 0.141622 10.499228 8.222276 98.460865 1.567511 6.410335
We can see data for the polluting gases:
- Ozone (O3)
- Nitrogen monoxide (NO)
- Nitrogen dioxide (NO2)
- Sulphur dioxide (SO2)
- Carbon monoxide (CO)
- Particulates (PM10 and PM2.5)
We'll check for missing data:
aqi_df.isna().sum()
Example output:
date 0
O3 143
NO 167
NO2 167
NOXasNO2 167
SO2 1188
CO 1023
PM10 138
PM2.5 138
wd 96
ws 96
temp 96
site 0
code 0
For our initial analysis, we'll remove the rows with missing data:
aqi_df.dropna(inplace = True)
Calculate Air Quality Index (AQI)
Next, we'll use the python-aqi
library to calculate the AQI. This library uses algorithms from:
- The United States Environmental Protection Agency (EPA)
- The China Ministry of Environmental Protection (MEP)
We'll initially use the particulates (PM10 and PM2.5) to calculate the AQI using the EPA algorithm and create a new column in the Pandas Dataframe to store the result:
aqi_df["aqi"] = aqi_df.apply(
lambda row: aqi.to_aqi([
(aqi.POLLUTANT_PM25, row["PM2.5"]),
(aqi.POLLUTANT_PM10, row["PM10"])
]),
axis = 1
)
Now we'll store the Pandas Dataframe in SingleStoreDB.
Store data in SingleStoreDB
We'll now create a connection to SingleStoreDB:
from sqlalchemy import *
db_connection = create_engine(connection_url)
Now we'll create a new table in SingleStoreDB using the Pandas Dataframe:
aqi_df.to_sql(
"aqi",
con = db_connection,
if_exists = "replace",
index = False,
chunksize = 1000
)
We'll check some data in the table:
SELECT * FROM aqi LIMIT 5;
Example output:
+---------------------+---------+---------+---------+----------+---------+----------+------+--------+-------+------+------+----------------------+------+------+
| date | O3 | NO | NO2 | NOXasNO2 | SO2 | CO | PM10 | PM2.5 | wd | ws | temp | site | code | aqi |
+---------------------+---------+---------+---------+----------+---------+----------+------+--------+-------+------+------+----------------------+------+------+
| 2022-01-05 02:00:00 | 55.0813 | 0.12473 | 8.22375 | 8.415 | 0.53218 | 0.11642 | 7.7 | 4.151 | 299.3 | 5.2 | 1.2 | London N. Kensington | KC1 | 17 |
| 2022-01-09 11:00:00 | 53.4848 | 3.7419 | 22.185 | 27.9225 | 0.79827 | 0.17463 | 11.1 | 5.755 | 296.9 | 3.5 | 5.8 | London N. Kensington | KC1 | 24 |
| 2022-01-09 19:00:00 | 3.39269 | 9.10529 | 69.615 | 83.5762 | 0.79827 | 0.512248 | 28.4 | 22.264 | 271.5 | 2.5 | 1.9 | London N. Kensington | KC1 | 72 |
| 2022-01-10 11:00:00 | 5.38839 | 29.3116 | 70.1888 | 115.132 | 1.33045 | 0.46568 | 24.2 | 11.698 | 131.5 | 1.5 | 1.2 | London N. Kensington | KC1 | 48 |
| 2022-01-11 09:00:00 | 30.9333 | 1.87095 | 26.3925 | 29.2612 | 0 | 0.197914 | 6.8 | 5.472 | 208.4 | 2.1 | 8.1 | London N. Kensington | KC1 | 22 |
+---------------------+---------+---------+---------+----------+---------+----------+------+--------+-------+------+------+----------------------+------+------+
A variety of tools could now be used with SingleStoreDB to perform analytics on the data. For example, we have previously discussed the ease with which we can use analytics and visualisation tools with SingleStoreDB, such as Apache Superset, Metabase and Looker.
Bonus: Create visualisation
From our Python environment, we can also create a quick visualisation using Plotly:
# Define the ranges for each AQI category
aqi_bounds = [0, 51, 101, 151, 201, 301, 500]
# Define the names of each AQI category
aqi_categories = [
"Good",
"Moderate",
"Unhealthy for Sensitive Groups",
"Unhealthy",
"Very Unhealthy",
"Hazardous",
"Very Hazardous"
]
# Define the colours for each AQI category
aqi_colors = [
[0.0, "green" ],
[0.1, "green" ],
[0.102, "yellow"],
[0.2, "yellow"],
[0.202, "orange"],
[0.3, "orange"],
[0.302, "red" ],
[0.4, "red" ],
[0.402, "purple"],
[0.6, "purple"],
[0.602, "maroon"],
[1.0, "maroon"]
]
# Define the tick values and labels for the colorbar
tickvals = [(aqi_bounds[i] + aqi_bounds[i + 1]) / 2 for i in range(len(aqi_bounds)-1)]
ticktext = aqi_categories[:-1]
# Create a new figure object with a heatmap trace
fig = go.Figure(data = go.Heatmap(
x = aqi_df["date"].to_numpy(),
y = aqi_df["site"],
z = aqi_df["aqi"],
colorscale = aqi_colors,
zmin = aqi_bounds[0],
zmax = aqi_bounds[-1],
colorbar = dict(title = "Air Quality Index",
tickmode = "array",
ticktext = ticktext,
tickvals = tickvals
)
)
)
# Customise the layout of the figure
fig.update_layout(
xaxis_title = "Date",
yaxis_title = "Location",
title = "Air Quality Index by Location and Date",
width = 700,
height = 600,
yaxis = dict(tickangle = -90)
)
# Show the figure
fig.show()
This produces the heatmap shown in Figure 1.
The overall quality of air throughout 2022 was Good. The next steps would be to drill down and further analyse the data, check the data for the other polluting gases, and so on. We could also make the heatmap more interesting by adding additional weather data.
Summary
In this short article, we have quickly analysed some air quality data from Defra in the UK, using the US EPA algorithm with data for particulates (PM10 and PM2.5). However, we note that many countries use different algorithms to calculate their AQI, although the approach described in this article could be easily adapted for other countries.
Top comments (0)