DEV Community

Cover image for Quick tip: Analysing Air Quality Data with SingleStoreDB
Akmal Chaudhri for SingleStore

Posted on • Updated on

Quick tip: Analysing Air Quality Data with SingleStoreDB

Abstract

Air quality has become a significant concern in many cities globally due to increased pollution levels. In this short article, we'll see how to calculate the Air Quality Index (AQI), store air quality data in SingleStoreDB and quickly visualise the AQI.

The notebook file used in this article is available from GitHub.

Introduction

In this article, we'll obtain air quality data from the Department for Environment, Food and Rural Affairs (Defra) in the UK and store it in SingleStoreDB. SingleStoreDB integrates with many external tools and libraries.

Create a SingleStoreDB Cloud account

A previous article showed the steps required to create a free SingleStoreDB Cloud account. We'll use the Free Shared Tier and take the default names for the Workspace and Database.

Obtain Air Quality Data

We'll acknowledge Defra and uk-air.defra.gov.uk as the source of information as well as the licence:

© Crown 2023 copyright Defra via uk-air.defra.gov.uk, licenced under the Open Government Licence (OGL).

We'll use the following code to obtain the metadata:

metadata_url = "http://uk-air.defra.gov.uk/openair/R_data/AURN_metadata.RData"

with urlopen(metadata_url) as metadataset:
    metadata = rdata.read_rda(metadataset)

metadata_df = metadata["AURN_metadata"]

metadata_df.info()
Enter fullscreen mode Exit fullscreen mode

Example output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2836 entries, 1 to 2836
Data columns (total 13 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   site_id          2836 non-null   string 
 1   site_name        2836 non-null   string 
 2   location_type    2836 non-null   string 
 3   latitude         2836 non-null   float64
 4   longitude        2836 non-null   float64
 5   parameter        2836 non-null   string 
 6   Parameter_name   2836 non-null   string 
 7   start_date       2836 non-null   string 
 8   end_date         2836 non-null   string 
 9   ratified_to      2836 non-null   string 
 10  zone             2836 non-null   string 
 11  agglomeration    1656 non-null   object 
 12  local_authority  2815 non-null   object 
dtypes: float64(2), object(2), string(9)
memory usage: 288.2+ KB
Enter fullscreen mode Exit fullscreen mode

Next, let's search for "Kensington" in local_authority and retrieve the site_id values:

# Filter the DataFrame for rows where 'Kensington' is included in 'local_authority' (case-insensitive)
kensington_sites = metadata_df[
    metadata_df["local_authority"].str.contains("Kensington", case = False, na = False)
]

kensington_site_ids = kensington_sites["site_id"].unique()

kensington_site_ids
Enter fullscreen mode Exit fullscreen mode

Example output:

<StringArray>
['CRD', 'CRD2', 'KC1', 'WL']
Length: 4, dtype: string
Enter fullscreen mode Exit fullscreen mode

We'll use the value KC1 and the year 2022 and download the data and store it in a Dataframe:

site_id = kensington_site_ids[2]
year = 2022

dataset_url = f"https://uk-air.defra.gov.uk/openair/R_data/{site_id}_{year}.RData"

with urlopen(dataset_url) as dataset:
    data = rdata.read_rda(dataset)

aqi_df = data[f"{site_id}_{year}"]
Enter fullscreen mode Exit fullscreen mode

We'll ensure that the date is in the correct format:

aqi_df["date"] = pd.to_datetime(aqi_df["date"], unit = "s")
Enter fullscreen mode Exit fullscreen mode

Let's take a quick look at the data:

aqi_df.head()
Enter fullscreen mode Exit fullscreen mode

Example output:

          date            O3       NO       NO2    NOXasNO2    SO2       CO     PM10   PM2.5   wd    ws   temp          site         code
1 2022-01-01 00:00:00  39.11572  0.24946  9.94500  10.51875  0.26609  0.209556  14.9  12.076  212.0  3.0  10.4  London N. Kensington  KC1
2 2022-01-01 01:00:00  38.71658  0.12473  7.84125   8.22375 -0.53218  0.197914  16.3  12.642  214.8  3.7  10.5  London N. Kensington  KC1
3 2022-01-01 02:00:00  41.51056  0.12473  6.88500   7.07625  0.26609  0.221198  15.4  11.604  212.9  3.6  10.2  London N. Kensington  KC1
4 2022-01-01 03:00:00  45.70153  0.12473  4.97250   5.16375  0.26609  0.186272  13.6  10.189  208.8  3.9  10.6  London N. Kensington  KC1
5 2022-01-01 04:00:00  48.49551  0.12473  4.59000   4.59000  0.26609  0.186272  10.9   8.302  206.7  4.0  11.3  London N. Kensington  KC1
Enter fullscreen mode Exit fullscreen mode

We can see the range of values:

aqi_df.describe()
Enter fullscreen mode Exit fullscreen mode

Example output:

              date              O3           NO           NO2       NOXasNO2        SO2          CO          PM10         PM2.5         wd           ws          temp    
count                 8760  8617.000000  8593.000000  8593.000000  8593.000000  7572.000000  7737.000000  8622.000000  8622.000000  8664.000000  8664.000000  8664.000000
mean   2022-07-02 11:30:00    50.337057     4.756186    18.884452    26.177035     0.557447     0.179206    14.600139     8.870449   193.805794     3.286034    10.889578
min    2022-01-01 00:00:00    -0.399140    -0.124730     0.382500     0.382500    -1.330450     0.034926     0.700000     0.377000     0.100000     0.000000    -5.900000
25%    2022-04-02 05:45:00    33.527760     0.249460     7.650000     8.606250     0.266090     0.104778     8.200000     4.151000   107.475000     2.200000     6.500000
50%    2022-07-02 11:30:00    51.688630     0.748380    13.196250    14.535000     0.532180     0.139704    11.900000     6.226000   219.100000     2.900000    10.700000
75%    2022-10-01 17:15:00    66.855950     1.870950    24.480000    27.157500     0.532180     0.197914    17.400000    10.283000   265.000000     4.100000    15.300000
max    2022-12-31 23:00:00   166.441380   368.701880   108.438750   652.927500    10.377510     1.723016   110.400000    84.623000   360.000000    12.900000    31.900000
std                    NaN    26.469337    18.191091    16.328962    39.461979     0.677925     0.141622    10.499228     8.222276    98.460865     1.567511     6.410335
Enter fullscreen mode Exit fullscreen mode

We can see data for the polluting gases:

  • Ozone (O3)
  • Nitrogen monoxide (NO)
  • Nitrogen dioxide (NO2)
  • Sulphur dioxide (SO2)
  • Carbon monoxide (CO)
  • Particulates (PM10 and PM2.5)

We'll check for missing data:

aqi_df.isna().sum()
Enter fullscreen mode Exit fullscreen mode

Example output:

date           0
O3           143
NO           167
NO2          167
NOXasNO2     167
SO2         1188
CO          1023
PM10         138
PM2.5        138
wd            96
ws            96
temp          96
site           0
code           0
Enter fullscreen mode Exit fullscreen mode

For our initial analysis, we'll remove the rows with missing data:

aqi_df.dropna(inplace = True)
Enter fullscreen mode Exit fullscreen mode

Calculate Air Quality Index (AQI)

Next, we'll use the python-aqi library to calculate the AQI. This library uses algorithms from:

  • The United States Environmental Protection Agency (EPA)
  • The China Ministry of Environmental Protection (MEP)

We'll initially use the particulates (PM10 and PM2.5) to calculate the AQI using the EPA algorithm and create a new column in the Pandas Dataframe to store the result:

aqi_df["aqi"] = aqi_df.apply(
    lambda row: aqi.to_aqi([
        (aqi.POLLUTANT_PM25, row["PM2.5"]),
        (aqi.POLLUTANT_PM10, row["PM10"])
    ]),
    axis = 1
)
Enter fullscreen mode Exit fullscreen mode

Now we'll store the Pandas Dataframe in SingleStoreDB.

Store data in SingleStoreDB

We'll now create a connection to SingleStoreDB:

from sqlalchemy import *

db_connection = create_engine(connection_url)
Enter fullscreen mode Exit fullscreen mode

Now we'll create a new table in SingleStoreDB using the Pandas Dataframe:

aqi_df.to_sql(
    "aqi",
    con = db_connection,
    if_exists = "replace",
    index = False,
    chunksize = 1000
)
Enter fullscreen mode Exit fullscreen mode

We'll check some data in the table:

SELECT * FROM aqi LIMIT 5;
Enter fullscreen mode Exit fullscreen mode

Example output:

+---------------------+---------+---------+---------+----------+---------+----------+------+--------+-------+------+------+----------------------+------+------+
| date                | O3      | NO      | NO2     | NOXasNO2 | SO2     | CO       | PM10 | PM2.5  | wd    | ws   | temp | site                 | code | aqi  |
+---------------------+---------+---------+---------+----------+---------+----------+------+--------+-------+------+------+----------------------+------+------+
| 2022-01-05 02:00:00 | 55.0813 | 0.12473 | 8.22375 |    8.415 | 0.53218 |  0.11642 |  7.7 |  4.151 | 299.3 |  5.2 |  1.2 | London N. Kensington | KC1  | 17   |
| 2022-01-09 11:00:00 | 53.4848 |  3.7419 |  22.185 |  27.9225 | 0.79827 |  0.17463 | 11.1 |  5.755 | 296.9 |  3.5 |  5.8 | London N. Kensington | KC1  | 24   |
| 2022-01-09 19:00:00 | 3.39269 | 9.10529 |  69.615 |  83.5762 | 0.79827 | 0.512248 | 28.4 | 22.264 | 271.5 |  2.5 |  1.9 | London N. Kensington | KC1  | 72   |
| 2022-01-10 11:00:00 | 5.38839 | 29.3116 | 70.1888 |  115.132 | 1.33045 |  0.46568 | 24.2 | 11.698 | 131.5 |  1.5 |  1.2 | London N. Kensington | KC1  | 48   |
| 2022-01-11 09:00:00 | 30.9333 | 1.87095 | 26.3925 |  29.2612 |       0 | 0.197914 |  6.8 |  5.472 | 208.4 |  2.1 |  8.1 | London N. Kensington | KC1  | 22   |
+---------------------+---------+---------+---------+----------+---------+----------+------+--------+-------+------+------+----------------------+------+------+
Enter fullscreen mode Exit fullscreen mode

A variety of tools could now be used with SingleStoreDB to perform analytics on the data. For example, we have previously discussed the ease with which we can use analytics and visualisation tools with SingleStoreDB, such as Apache Superset, Metabase and Looker.

Bonus: Create visualisation

From our Python environment, we can also create a quick visualisation using Plotly:

# Define the ranges for each AQI category
aqi_bounds = [0, 51, 101, 151, 201, 301, 500]

# Define the names of each AQI category
aqi_categories = [
    "Good",
    "Moderate",
    "Unhealthy for Sensitive Groups",
    "Unhealthy",
    "Very Unhealthy",
    "Hazardous",
    "Very Hazardous"
]

# Define the colours for each AQI category
aqi_colors = [
    [0.0,   "green" ],
    [0.1,   "green" ],
    [0.102, "yellow"],
    [0.2,   "yellow"],
    [0.202, "orange"],
    [0.3,   "orange"],
    [0.302, "red"   ],
    [0.4,   "red"   ],
    [0.402, "purple"],
    [0.6,   "purple"],
    [0.602, "maroon"],
    [1.0,   "maroon"]
]

# Define the tick values and labels for the colorbar
tickvals = [(aqi_bounds[i] + aqi_bounds[i + 1]) / 2 for i in range(len(aqi_bounds)-1)]
ticktext = aqi_categories[:-1]

# Create a new figure object with a heatmap trace
fig = go.Figure(data = go.Heatmap(
    x = aqi_df["date"].to_numpy(),
    y = aqi_df["site"],
    z = aqi_df["aqi"],
    colorscale = aqi_colors,
    zmin = aqi_bounds[0],
    zmax = aqi_bounds[-1],
    colorbar = dict(title = "Air Quality Index",
                    tickmode = "array",
                    ticktext = ticktext,
                    tickvals = tickvals
                )
    )
)

# Customise the layout of the figure
fig.update_layout(
    xaxis_title = "Date",
    yaxis_title = "Location",
    title = "Air Quality Index by Location and Date",
    width = 700,
    height = 600,
    yaxis = dict(tickangle = -90)
)

# Show the figure
fig.show()
Enter fullscreen mode Exit fullscreen mode

This produces the heatmap shown in Figure 1.

Figure 1. AQI for KC1.

Figure 1. AQI for KC1.

The overall quality of air throughout 2022 was Good. The next steps would be to drill down and further analyse the data, check the data for the other polluting gases, and so on. We could also make the heatmap more interesting by adding additional weather data.

Summary

In this short article, we have quickly analysed some air quality data from Defra in the UK, using the US EPA algorithm with data for particulates (PM10 and PM2.5). However, we note that many countries use different algorithms to calculate their AQI, although the approach described in this article could be easily adapted for other countries.

Top comments (0)