Web scraping geospatial data opens up powerful possibilities for mapping, logistics, and urban planning applications. In this guide, I'll show you how to build a complete pipeline that collects, processes, and stores location-based data.
## Why Geospatial Data Matters
Location data powers everything from ride-sharing apps to real estate analytics. Public sources like OpenStreetMap, government portals, and business directories contain rich geospatial datasets — but they rarely offer clean API access.
## Setting Up the Pipeline
First, install the dependencies (`schedule` is used later for the daily runs):

```bash
pip install requests beautifulsoup4 geopandas shapely schedule
```
## Step 1: Scraping Location Data
Here's a scraper that collects business locations with coordinates:
```python
import requests
from bs4 import BeautifulSoup

def scrape_locations(city, category):
    # Use ScraperAPI to handle anti-bot protections
    api_url = "https://api.scraperapi.com"
    params = {
        "api_key": "YOUR_SCRAPERAPI_KEY",
        "url": f"https://example-directory.com/{city}/{category}",
        "render": "true",
    }
    response = requests.get(api_url, params=params)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    locations = []
    for listing in soup.select(".business-listing"):
        lat = listing.get("data-lat")
        lng = listing.get("data-lng")
        if lat is None or lng is None:
            continue  # skip listings without embedded coordinates
        name = listing.select_one(".name").text.strip()
        address = listing.select_one(".address").text.strip()
        locations.append({
            "name": name,
            "latitude": float(lat),
            "longitude": float(lng),
            "address": address,
            "city": city,
            "category": category,
        })
    return locations

data = scrape_locations("new-york", "restaurants")
print(f"Collected {len(data)} locations")
```
## Step 2: Geocoding and Validation
Raw scraped coordinates often need validation. Use GeoPandas to clean the data:
```python
import geopandas as gpd
from shapely.geometry import Point

def validate_coordinates(locations):
    valid = []
    for loc in locations:
        lat, lng = loc["latitude"], loc["longitude"]
        # Basic range check: latitude in [-90, 90], longitude in [-180, 180]
        if -90 <= lat <= 90 and -180 <= lng <= 180:
            loc["geometry"] = Point(lng, lat)  # shapely points are (x, y) = (lng, lat)
            valid.append(loc)
        else:
            print(f"Invalid coords for {loc['name']}: {lat}, {lng}")
    return gpd.GeoDataFrame(valid, geometry="geometry", crs="EPSG:4326")

geo_data = validate_coordinates(data)
print(f"Valid locations: {len(geo_data)}")
```
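Directory sites also tend to list the same business more than once with slightly different coordinates. Here's a minimal dedup sketch (the `haversine_m` and `dedupe_locations` helpers are my own, not part of GeoPandas) that drops any listing within ~50 meters of one already kept:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lng1, lat2, lng2):
    # Great-circle distance in meters between two WGS84 points
    lat1, lng1, lat2, lng2 = map(radians, (lat1, lng1, lat2, lng2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2
    return 2 * 6371000 * asin(sqrt(a))

def dedupe_locations(locations, threshold_m=50):
    # Keep the first listing seen; drop later ones within threshold_m of any kept point.
    # O(n^2) - fine for a few thousand listings, use a spatial index beyond that.
    kept = []
    for loc in locations:
        if all(haversine_m(loc["latitude"], loc["longitude"],
                           k["latitude"], k["longitude"]) > threshold_m
               for k in kept):
            kept.append(loc)
    return kept
```

Run this on the raw list before `validate_coordinates` so duplicates never reach the GeoDataFrame.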
## Step 3: Storage and Export
Store your geospatial data in formats that GIS tools understand:
```python
def export_pipeline(gdf, output_name):
    # GeoJSON for web maps
    gdf.to_file(f"{output_name}.geojson", driver="GeoJSON")
    # Shapefile for desktop GIS (note: field names get truncated to 10 chars)
    gdf.to_file(f"{output_name}.shp")
    # CSV with WKT geometry for databases
    gdf_copy = gdf.copy()
    gdf_copy["wkt"] = gdf_copy.geometry.to_wkt()
    gdf_copy.drop(columns=["geometry"]).to_csv(f"{output_name}.csv", index=False)
    print(f"Exported {len(gdf)} records in 3 formats")

export_pipeline(geo_data, "nyc_restaurants")
```
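If you want to sanity-check the CSV layout without pulling in GeoPandas, here's a stdlib-only sketch (the `locations_to_wkt_csv` helper and its column set are my own illustration) that produces the same WKT column, ready for PostGIS's `ST_GeomFromText`:

```python
import csv
import io

def locations_to_wkt_csv(locations):
    # Mirror the export above: attribute columns plus a WKT geometry column.
    # WKT point order is (x y) = (longitude latitude) - easy to get backwards.
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["name", "latitude", "longitude", "wkt"])
    writer.writeheader()
    for loc in locations:
        writer.writerow({
            "name": loc["name"],
            "latitude": loc["latitude"],
            "longitude": loc["longitude"],
            "wkt": f"POINT ({loc['longitude']} {loc['latitude']})",
        })
    return buf.getvalue()
```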
## Step 4: Scheduling Regular Updates
Use a simple scheduler to keep your pipeline fresh:
```python
import time
import schedule

def daily_pipeline():
    cities = ["new-york", "los-angeles", "chicago"]
    categories = ["restaurants", "hotels", "retail"]
    for city in cities:
        for cat in categories:
            locations = scrape_locations(city, cat)
            gdf = validate_coordinates(locations)
            export_pipeline(gdf, f"{city}_{cat}")
            time.sleep(2)  # respectful rate limiting between requests

schedule.every().day.at("02:00").do(daily_pipeline)

while True:
    schedule.run_pending()
    time.sleep(60)
```
## Scaling Tips
- Use ScraperAPI to rotate proxies and handle JavaScript-rendered pages automatically
- For high-volume residential proxies, ThorData provides reliable geo-targeted IPs
- Monitor your scraping success rates with ScrapeOps dashboards
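A fixed `time.sleep(2)` works at small scale, but for retries against rate-limited sources an exponential backoff schedule is more robust. A minimal sketch (the `backoff_delays` helper is my own; wire the actual HTTP retry loop into `scrape_locations` yourself):

```python
import random

def backoff_delays(max_retries=5, base=1.0, cap=60.0, jitter=0.0):
    # Exponential backoff: 1s, 2s, 4s, ... capped at `cap` seconds.
    # `jitter` adds up to that many seconds of randomness so parallel
    # workers don't retry in lockstep.
    delays = []
    for attempt in range(max_retries):
        delay = min(cap, base * (2 ** attempt))
        delays.append(delay + random.uniform(0, jitter))
    return delays
```

Sleep through these delays between failed attempts, and give up once the list is exhausted.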
## Conclusion
Building a geospatial data pipeline with web scraping gives you access to location intelligence that would cost thousands from commercial providers. Start small with one city and category, validate your coordinate quality, and scale from there.
The key is combining reliable scraping infrastructure with proper geospatial tooling — and always respecting the source websites' terms of service and rate limits.