# Scraping Real-Time Air Quality and Pollution Data with Python
Air quality data drives decisions in public health, urban planning, and environmental research. While some APIs exist, many local monitoring stations only publish data on websites without structured access.
Here's how to build a real-time air quality scraper in Python.
## The Data Landscape

Air quality index (AQI) values come from thousands of monitoring stations worldwide. Sources include:

- Government environmental agencies
- Community sensor networks (PurpleAir, OpenAQ)
- Weather services with pollution overlays
## Setting Up

```python
import requests
from bs4 import BeautifulSoup
import json
import time
from datetime import datetime

PROXY_URL = "https://api.scraperapi.com"
API_KEY = "YOUR_SCRAPERAPI_KEY"
```
For real-time data that updates every hour, routing requests through ScraperAPI reduces the chance of getting blocked during continuous monitoring.
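As a sketch, every scrape in this article can go through a small helper that routes requests via the proxy endpoint and retries transient failures. The `fetch_via_proxy` name and its backoff values are illustrative, not part of any official API:

```python
import time
import requests

PROXY_URL = "https://api.scraperapi.com"
API_KEY = "YOUR_SCRAPERAPI_KEY"

def fetch_via_proxy(target_url, max_retries=3):
    """Fetch target_url through the proxy endpoint, retrying transient failures."""
    for attempt in range(max_retries):
        try:
            response = requests.get(
                PROXY_URL,
                params={"api_key": API_KEY, "url": target_url},
                timeout=60,
            )
            response.raise_for_status()
            return response
        except requests.RequestException:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # simple exponential backoff: 1s, 2s, ...
```

A timeout plus `raise_for_status()` turns slow or failed proxy calls into retryable exceptions instead of silent bad data.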
## Scraping AQI from AQICN

```python
def get_city_aqi(city):
    """Scrape the current AQI and pollutant readings for a city from AQICN."""
    params = {
        "api_key": API_KEY,
        "url": f"https://aqicn.org/city/{city}/"
    }
    response = requests.get(PROXY_URL, params=params)
    soup = BeautifulSoup(response.text, "html.parser")

    aqi_element = soup.select_one("#aqiwgtvalue")
    if aqi_element is None:
        return None
    try:
        aqi_value = int(aqi_element.text.strip())
    except ValueError:  # stations sometimes show "-" when no data is available
        return None

    # Pollutant breakdown (PM2.5, PM10, O3, ...) from the current-pollutants table
    pollutants = {}
    for row in soup.select("#cur_pol tr"):
        cells = row.select("td")
        if len(cells) >= 2:
            name = cells[0].text.strip()
            value = cells[1].text.strip()
            pollutants[name] = value

    return {
        "city": city,
        "aqi": aqi_value,
        "pollutants": pollutants,
        "timestamp": datetime.now().isoformat(),
        "category": classify_aqi(aqi_value)
    }

def classify_aqi(value):
    """Map an AQI value to its US EPA category."""
    if value <= 50: return "Good"
    elif value <= 100: return "Moderate"
    elif value <= 150: return "Unhealthy for Sensitive Groups"
    elif value <= 200: return "Unhealthy"
    elif value <= 300: return "Very Unhealthy"
    else: return "Hazardous"
```
## Multi-City Monitoring

```python
def send_alert(city, reading):
    """Alert hook stub: replace with email, SMS, or webhook delivery."""
    print(f"ALERT: {city} AQI {reading['aqi']} ({reading['category']})")

def monitor_cities(cities):
    """Scrape each city once, alerting when AQI crosses the Unhealthy threshold."""
    all_readings = []
    for city in cities:
        reading = get_city_aqi(city)
        if reading:
            all_readings.append(reading)
            print(f"{city}: AQI {reading['aqi']} ({reading['category']})")
            if reading["aqi"] > 150:
                send_alert(city, reading)
        time.sleep(3)  # pause between cities to avoid hammering the site
    return all_readings

cities = ["beijing", "delhi", "london", "los-angeles", "tokyo"]
readings = monitor_cities(cities)
```
## Scraping Sensor Network Data

```python
def scrape_purple_air_region(nwlat, nwlng, selat, selng):
    """Fetch all PurpleAir sensors inside a bounding box (NW and SE corners)."""
    params = {
        "api_key": API_KEY,
        "url": f"https://www.purpleair.com/json?nwlat={nwlat}&nwlng={nwlng}&selat={selat}&selng={selng}"
    }
    response = requests.get(PROXY_URL, params=params)
    data = response.json()

    sensors = []
    for result in data.get("results", []):
        sensors.append({
            "name": result.get("Label", ""),
            "lat": result.get("Lat"),
            "lon": result.get("Lon"),
            "pm25": result.get("PM2_5Value"),
            "temp": result.get("temp_f"),
            "humidity": result.get("humidity"),
            "last_seen": result.get("LastSeen")
        })
    return sensors

# San Francisco Bay Area
bay_area = scrape_purple_air_region(37.9, -122.6, 37.3, -122.0)
```

Note that PurpleAir has since moved to an authenticated API at api.purpleair.com, so the legacy JSON endpoint above may no longer respond; the parsing logic stays the same either way.
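PurpleAir sensors report raw PM2.5 concentrations (µg/m³) rather than AQI values. To compare sensor readings against the AQICN numbers above, you can convert them with the US EPA's piecewise-linear formula. The `pm25_to_aqi` helper below is a sketch: the breakpoints follow the EPA's 2012 PM2.5 standard (the formal procedure truncates concentrations rather than rounding, and the breakpoints have since been revised), so verify against the current tables before relying on them:

```python
# US EPA PM2.5 breakpoints (2012 standard): (conc_lo, conc_hi, aqi_lo, aqi_hi)
PM25_BREAKPOINTS = [
    (0.0, 12.0, 0, 50),
    (12.1, 35.4, 51, 100),
    (35.5, 55.4, 101, 150),
    (55.5, 150.4, 151, 200),
    (150.5, 250.4, 201, 300),
    (250.5, 500.4, 301, 500),
]

def pm25_to_aqi(concentration):
    """Convert a PM2.5 concentration (µg/m³) to a US AQI value."""
    c = round(float(concentration), 1)  # one decimal place (EPA formally truncates)
    for c_lo, c_hi, a_lo, a_hi in PM25_BREAKPOINTS:
        if c_lo <= c <= c_hi:
            # Linear interpolation within the matching breakpoint band
            return round((a_hi - a_lo) / (c_hi - c_lo) * (c - c_lo) + a_lo)
    return 500  # above the highest breakpoint: cap at maximum AQI

print(pm25_to_aqi(35.4))  # 100
```

With this, each sensor dict's `pm25` field can be classified with the same `classify_aqi` categories used for the AQICN data.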
## Storing Historical Data

```python
import sqlite3

def init_db():
    """Create the SQLite database and readings table if they don't exist."""
    conn = sqlite3.connect("air_quality.db")
    conn.execute('''CREATE TABLE IF NOT EXISTS readings (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        city TEXT, aqi INTEGER, category TEXT,
        pollutants TEXT, timestamp TEXT
    )''')
    conn.commit()
    return conn

def store_readings(conn, readings):
    """Insert a batch of readings; the pollutants dict is serialized as JSON."""
    for r in readings:
        conn.execute(
            "INSERT INTO readings (city, aqi, category, pollutants, timestamp) VALUES (?, ?, ?, ?, ?)",
            (r["city"], r["aqi"], r["category"], json.dumps(r["pollutants"]), r["timestamp"])
        )
    conn.commit()

def get_trend(conn, city, days=7):
    """Return the most recent (aqi, timestamp) rows for a city.

    Assumes roughly one reading per hour, so days * 24 rows cover the window.
    """
    cursor = conn.execute(
        "SELECT aqi, timestamp FROM readings WHERE city=? ORDER BY timestamp DESC LIMIT ?",
        (city, days * 24)
    )
    return cursor.fetchall()
```
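Once readings accumulate, simple aggregation turns the raw rows into a trend. This sketch (the `daily_averages` helper is illustrative) groups the `(aqi, timestamp)` rows returned by `get_trend` into per-day averages using only the standard library:

```python
from collections import defaultdict

def daily_averages(rows):
    """Average AQI per calendar day from (aqi, iso_timestamp) rows."""
    by_day = defaultdict(list)
    for aqi, timestamp in rows:
        day = timestamp[:10]  # ISO 8601 timestamps start with YYYY-MM-DD
        by_day[day].append(aqi)
    return {day: sum(vals) / len(vals) for day, vals in sorted(by_day.items())}

rows = [(160, "2024-03-01T08:00:00"), (140, "2024-03-01T20:00:00"),
        (90, "2024-03-02T08:00:00")]
print(daily_averages(rows))  # {'2024-03-01': 150.0, '2024-03-02': 90.0}
```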
## Infrastructure for Continuous Scraping
Running a 24/7 air quality monitor requires reliable proxy rotation:
- ScraperAPI — automatic retries and IP rotation for consistent hourly scrapes
- ThorData — residential proxies when government sites block datacenter IPs
- ScrapeOps — monitor your scraping jobs and get alerts on failures
## Conclusion

Real-time air quality scraping unlocks environmental monitoring at any scale. From personal dashboards to city-wide pollution tracking, this pipeline handles continuous data collection. Store readings historically, alert on spikes, and you have a working environmental monitoring system.