agenthustler
Scraping Real-Time Air Quality and Pollution Data with Python

Air quality data drives decisions in public health, urban planning, and environmental research. While some APIs exist, many local monitoring stations only publish data on websites without structured access.

Here's how to build a real-time air quality scraper in Python.

The Data Landscape

Air Quality Index (AQI) readings come from thousands of monitoring stations worldwide. Sources include:

  • Government environmental agencies
  • Community sensor networks (PurpleAir, OpenAQ)
  • Weather services with pollution overlays

Setting Up

import requests
from bs4 import BeautifulSoup
import json
import time
from datetime import datetime

PROXY_URL = "https://api.scraperapi.com"
API_KEY = "YOUR_SCRAPERAPI_KEY"

For real-time data that updates hourly, a rotating-proxy service like ScraperAPI reduces the chance of being blocked during continuous monitoring.
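Even through a proxy, hourly scrapes occasionally fail. A small retry wrapper with exponential backoff (a sketch of a generic pattern, not a ScraperAPI feature) keeps a long-running monitor resilient:

```python
import time

def fetch_with_retry(fetch, retries=3, backoff_seconds=2):
    """Call fetch() up to `retries` times with exponential backoff.

    Returns the first successful result, or None if every attempt fails.
    """
    for attempt in range(retries):
        try:
            return fetch()
        except Exception as exc:
            wait = backoff_seconds * (2 ** attempt)
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {wait}s")
            time.sleep(wait)
    return None
```

You'd wrap any scrape call in it, e.g. `reading = fetch_with_retry(lambda: get_city_aqi("delhi"))`.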

Scraping AQI from AQICN

def get_city_aqi(city):
    params = {
        "api_key": API_KEY,
        "url": f"https://aqicn.org/city/{city}/"
    }
    response = requests.get(PROXY_URL, params=params, timeout=60)
    soup = BeautifulSoup(response.text, "html.parser")

    aqi_element = soup.select_one("#aqiwgtvalue")
    # AQICN shows "-" when a station has no current data, so guard the int() call
    if aqi_element is None or not aqi_element.text.strip().isdigit():
        return None
    aqi_value = int(aqi_element.text.strip())

    pollutants = {}
    for row in soup.select("#cur_pol tr"):
        cells = row.select("td")
        if len(cells) >= 2:
            name = cells[0].text.strip()
            value = cells[1].text.strip()
            pollutants[name] = value

    return {
        "city": city,
        "aqi": aqi_value,
        "pollutants": pollutants,
        "timestamp": datetime.now().isoformat(),
        "category": classify_aqi(aqi_value)
    }

def classify_aqi(value):
    if value <= 50: return "Good"
    elif value <= 100: return "Moderate"
    elif value <= 150: return "Unhealthy for Sensitive Groups"
    elif value <= 200: return "Unhealthy"
    elif value <= 300: return "Very Unhealthy"
    else: return "Hazardous"
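The selectors above (`#aqiwgtvalue`, `#cur_pol`) depend on AQICN's current markup, which can change. You can verify the parsing logic offline against a saved snippet; the HTML below is a minimal stand-in mimicking the widget's structure, not a real page:

```python
from bs4 import BeautifulSoup

# Minimal snippet mimicking AQICN's widget markup (real pages differ in detail).
sample_html = """
<div id="aqiwgtvalue">87</div>
<table id="cur_pol">
  <tr><td>PM2.5</td><td>87</td></tr>
  <tr><td>O3</td><td>21</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
aqi = int(soup.select_one("#aqiwgtvalue").text.strip())
pollutants = {
    row.select("td")[0].text.strip(): row.select("td")[1].text.strip()
    for row in soup.select("#cur_pol tr")
    if len(row.select("td")) >= 2
}
```

Running the same extraction against a fixture like this makes selector breakage obvious the moment AQICN changes its layout.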

Multi-City Monitoring

def monitor_cities(cities, interval_minutes=60):
    # One polling pass over all cities; schedule it every interval_minutes
    # (e.g. via cron, or an outer loop with time.sleep(interval_minutes * 60)).
    all_readings = []
    for city in cities:
        reading = get_city_aqi(city)
        if reading:
            all_readings.append(reading)
            print(f"{city}: AQI {reading['aqi']} ({reading['category']})")

            if reading["aqi"] > 150:
                send_alert(city, reading)  # your notification hook (email, webhook, etc.)
        time.sleep(3)  # brief pause between cities to stay polite

    return all_readings

cities = ["beijing", "delhi", "london", "los-angeles", "tokyo"]
readings = monitor_cities(cities)
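`monitor_cities` calls `send_alert`, which isn't defined above. A minimal sketch might log locally and optionally POST to a webhook; the webhook URL and JSON shape here are placeholder assumptions you'd adapt to your notification service:

```python
import json
import urllib.request

def send_alert(city, reading, webhook_url=None):
    """Log an unhealthy-air alert; optionally POST it to a webhook (e.g. Slack)."""
    message = f"ALERT: {city} AQI {reading['aqi']} ({reading['category']})"
    print(message)
    if webhook_url:
        payload = json.dumps({"text": message}).encode("utf-8")
        req = urllib.request.Request(
            webhook_url,
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req, timeout=10)
```

With no `webhook_url`, it just prints, which is enough for a terminal dashboard; pass a real endpoint to push notifications.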

Scraping Sensor Network Data

def scrape_purple_air_region(nwlat, nwlng, selat, selng):
    params = {
        "api_key": API_KEY,
        "url": f"https://www.purpleair.com/json?nwlat={nwlat}&nwlng={nwlng}&selat={selat}&selng={selng}"
    }
    response = requests.get(PROXY_URL, params=params, timeout=60)
    data = response.json()

    sensors = []
    for result in data.get("results", []):
        sensors.append({
            "name": result.get("Label", ""),
            "lat": result.get("Lat"),
            "lon": result.get("Lon"),
            "pm25": result.get("PM2_5Value"),
            "temp": result.get("temp_f"),
            "humidity": result.get("humidity"),
            "last_seen": result.get("LastSeen")
        })
    return sensors

# San Francisco Bay Area
bay_area = scrape_purple_air_region(37.9, -122.6, 37.3, -122.0)
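Note that PurpleAir sensors report raw PM2.5 concentrations (µg/m³), not AQI values. To put them on the same scale as the AQICN readings, you can apply the US EPA's piecewise-linear conversion; the breakpoints below are the pre-2024 EPA PM2.5 table, so check for updates before relying on them:

```python
def pm25_to_aqi(concentration):
    """Convert a PM2.5 concentration (µg/m³) to US AQI by linear interpolation.

    Breakpoints: (conc_low, conc_high, aqi_low, aqi_high), pre-2024 EPA table.
    """
    breakpoints = [
        (0.0, 12.0, 0, 50),
        (12.1, 35.4, 51, 100),
        (35.5, 55.4, 101, 150),
        (55.5, 150.4, 151, 200),
        (150.5, 250.4, 201, 300),
        (250.5, 350.4, 301, 400),
        (350.5, 500.4, 401, 500),
    ]
    c = round(concentration, 1)  # EPA rounds PM2.5 to 0.1 µg/m³ first
    for c_lo, c_hi, i_lo, i_hi in breakpoints:
        if c_lo <= c <= c_hi:
            return round((i_hi - i_lo) / (c_hi - c_lo) * (c - c_lo) + i_lo)
    return 500  # above the scale

print(pm25_to_aqi(12.0))  # 50 (top of "Good")
```

This lets you feed `sensor["pm25"]` through `pm25_to_aqi` and reuse `classify_aqi` from earlier on sensor data.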

Storing Historical Data

import sqlite3

def init_db():
    conn = sqlite3.connect("air_quality.db")
    conn.execute('''CREATE TABLE IF NOT EXISTS readings (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        city TEXT, aqi INTEGER, category TEXT,
        pollutants TEXT, timestamp TEXT
    )''')
    conn.commit()
    return conn

def store_readings(conn, readings):
    for r in readings:
        conn.execute(
            "INSERT INTO readings (city, aqi, category, pollutants, timestamp) VALUES (?, ?, ?, ?, ?)",
            (r["city"], r["aqi"], r["category"], json.dumps(r["pollutants"]), r["timestamp"])
        )
    conn.commit()

def get_trend(conn, city, days=7):
    # Assumes roughly one reading per hour, so days * 24 rows span the window.
    cursor = conn.execute(
        "SELECT aqi, timestamp FROM readings WHERE city=? ORDER BY timestamp DESC LIMIT ?",
        (city, days * 24)
    )
    return cursor.fetchall()
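Once readings accumulate, summarizing the rows `get_trend` returns shows how a city is moving week to week. Here's a self-contained sketch using an in-memory database with sample values (the numbers are illustrative, not real readings):

```python
import sqlite3
from statistics import mean

def summarize_trend(rows):
    """Average AQI over a list of (aqi, timestamp) rows from get_trend()."""
    if not rows:
        return None
    return round(mean(aqi for aqi, _ in rows), 1)

# Demo against an in-memory database with a few sample readings.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE readings (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    city TEXT, aqi INTEGER, category TEXT,
    pollutants TEXT, timestamp TEXT)""")
sample = [("delhi", 180, "2024-01-01T00:00:00"),
          ("delhi", 160, "2024-01-01T01:00:00"),
          ("delhi", 200, "2024-01-01T02:00:00")]
conn.executemany(
    "INSERT INTO readings (city, aqi, timestamp) VALUES (?, ?, ?)", sample)
rows = conn.execute(
    "SELECT aqi, timestamp FROM readings WHERE city=? ORDER BY timestamp DESC",
    ("delhi",)).fetchall()
print(summarize_trend(rows))  # 180.0
```

The same `summarize_trend` works unchanged on the output of `get_trend(conn, city)` against the real `air_quality.db`.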

Infrastructure for Continuous Scraping

Running a 24/7 air quality monitor requires reliable proxy rotation:

  • ScraperAPI — automatic retries and IP rotation for consistent hourly scrapes
  • ThorData — residential proxies when government sites block datacenter IPs
  • ScrapeOps — monitor your scraping jobs and get alerts on failures

Conclusion

Real-time air quality scraping enables environmental monitoring at any scale. From personal dashboards to city-wide pollution tracking, this pipeline handles continuous data collection reliably. Store readings over time, alert on spikes, and you have a production-ready environmental monitoring system.
