DEV Community

agenthustler

How to Build a Geospatial Data Pipeline with Web Scraping

Web scraping geospatial data opens up powerful possibilities for mapping, logistics, and urban planning applications. In this guide, I'll show you how to build a complete pipeline that collects, processes, and stores location-based data.

Why Geospatial Data Matters

Location data powers everything from ride-sharing apps to real estate analytics. Public sources like OpenStreetMap, government portals, and business directories contain rich geospatial datasets — but they rarely offer clean API access.

Setting Up the Pipeline

First, install the dependencies:

pip install requests beautifulsoup4 geopandas shapely

Step 1: Scraping Location Data

Here's a scraper that collects business locations with coordinates:

import requests
from bs4 import BeautifulSoup

def scrape_locations(city, category):
    # Use ScraperAPI to handle anti-bot protections
    api_url = "https://api.scraperapi.com"
    params = {
        "api_key": "YOUR_SCRAPERAPI_KEY",
        "url": f"https://example-directory.com/{city}/{category}",
        "render": "true"
    }

    response = requests.get(api_url, params=params)
    response.raise_for_status()  # Surface blocked or failed requests early
    soup = BeautifulSoup(response.text, "html.parser")

    locations = []
    for listing in soup.select(".business-listing"):
        name_el = listing.select_one(".name")
        address_el = listing.select_one(".address")
        lat = listing.get("data-lat")
        lng = listing.get("data-lng")

        # Skip listings that are missing a name or coordinates
        if not (name_el and lat and lng):
            continue

        locations.append({
            "name": name_el.text.strip(),
            "latitude": float(lat),
            "longitude": float(lng),
            "address": address_el.text.strip() if address_el else "",
            "city": city,
            "category": category
        })

    return locations

data = scrape_locations("new-york", "restaurants")
print(f"Collected {len(data)} locations")

Step 2: Geocoding and Validation

Raw scraped coordinates often need validation. Use GeoPandas to clean the data:

import geopandas as gpd
from shapely.geometry import Point

def validate_coordinates(locations):
    valid = []
    for loc in locations:
        lat, lng = loc["latitude"], loc["longitude"]

        # Basic coordinate validation
        if -90 <= lat <= 90 and -180 <= lng <= 180:
            loc["geometry"] = Point(lng, lat)
            valid.append(loc)
        else:
            print(f"Invalid coords for {loc['name']}: {lat}, {lng}")

    gdf = gpd.GeoDataFrame(valid, geometry="geometry", crs="EPSG:4326")
    return gdf

geo_data = validate_coordinates(data)
print(f"Valid locations: {len(geo_data)}")

Step 3: Storage and Export

Store your geospatial data in formats that GIS tools understand:

def export_pipeline(gdf, output_name):
    # GeoJSON for web maps
    gdf.to_file(f"{output_name}.geojson", driver="GeoJSON")

    # Shapefile for desktop GIS
    gdf.to_file(f"{output_name}.shp")

    # CSV with WKT geometry for databases
    gdf_copy = gdf.copy()
    gdf_copy["wkt"] = gdf_copy.geometry.to_wkt()
    gdf_copy.drop(columns=["geometry"]).to_csv(f"{output_name}.csv", index=False)

    print(f"Exported {len(gdf)} records in 3 formats")

export_pipeline(geo_data, "nyc_restaurants")

Step 4: Scheduling Regular Updates

Use a simple scheduler to keep your data fresh:

import schedule
import time

def daily_pipeline():
    cities = ["new-york", "los-angeles", "chicago"]
    categories = ["restaurants", "hotels", "retail"]

    for city in cities:
        for cat in categories:
            locations = scrape_locations(city, cat)
            gdf = validate_coordinates(locations)
            export_pipeline(gdf, f"{city}_{cat}")
            time.sleep(2)  # Respectful rate limiting

schedule.every().day.at("02:00").do(daily_pipeline)

while True:
    schedule.run_pending()
    time.sleep(60)

Scaling Tips

  • Use ScraperAPI to rotate proxies and handle JavaScript-rendered pages automatically
  • For high-volume residential proxies, ThorData provides reliable geo-targeted IPs
  • Monitor your scraping success rates with ScrapeOps dashboards
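Whichever provider you use, transient blocks and timeouts are inevitable at volume, so it's worth wrapping requests in a retry loop with exponential backoff. A minimal sketch — the attempt count and delays are illustrative, not tuned:

```python
import time
import requests

def fetch_with_retries(url, params=None, max_attempts=4, base_delay=2.0):
    """GET a URL, backing off exponentially on errors or non-200 responses."""
    for attempt in range(max_attempts):
        try:
            response = requests.get(url, params=params, timeout=30)
            if response.status_code == 200:
                return response
            print(f"Attempt {attempt + 1} failed: HTTP {response.status_code}")
        except requests.RequestException as exc:
            print(f"Attempt {attempt + 1} failed: {exc}")
        time.sleep(base_delay * (2 ** attempt))  # 2s, 4s, 8s, ...
    raise RuntimeError(f"Giving up on {url} after {max_attempts} attempts")
```

Dropping this in place of the bare `requests.get` in the scraper keeps one flaky page from aborting a whole nightly run.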

Conclusion

Building a geospatial data pipeline with web scraping gives you access to location intelligence that would cost thousands from commercial providers. Start small with one city and category, validate your coordinate quality, and scale from there.

The key is combining reliable scraping infrastructure with proper geospatial tooling — and always respecting the source websites' terms of service and rate limits.
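A lightweight way to honor robots.txt in code is Python's built-in parser. A sketch — fetching the robots.txt body is left to the caller, and the user-agent string is a placeholder:

```python
from urllib import robotparser

def robots_allows(robots_txt, url, user_agent="geo-pipeline-bot"):
    """Check whether a robots.txt body permits fetching a given URL."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)
```

Running this check once per domain before a scraping run is cheap insurance against crawling pages a site has explicitly asked bots to skip.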
