Bypassing Scraper Latency: Building a Real-Time Economic Indicator (REI) Tracker with Python

#python #webscraping #dataengineering #economics

Official economic metrics such as the Consumer Price Index (CPI) are lagging indicators by construction. By the time government agencies collect, clean, and publish inflation data, the market has already moved.

As developers and data analysts, we don't have to wait. We can build our own high-frequency, bottom-up economic indicators by tapping into live digital shelf prices.

In this article, I will share the architectural pattern and a production-ready Python implementation for a Real-Time Economic Indicator (REI) Tracker focused on daily essentials in the Osaka, Japan metropolitan area.

1. The Engineering Challenge: Anti-Bot Barriers

Extracting consistent, localized pricing data from search engines and shopping platforms is notoriously difficult. Anti-bot protections, CAPTCHAs, and shifting DOM structures turn scrapers built on standard tools (like BeautifulSoup or Selenium) into a maintenance nightmare.

To stay focused on the data analysis rather than on infrastructure maintenance, I used SearchApi (specifically its Google Shopping engine). It abstracts away proxy rotation and browser rendering and returns structured JSON, which makes it a practical data source for high-frequency tracking.

2. Technical Implementation (`rei_tracker.py`)

Here is the complete implementation. It uses dataclasses for clean configuration, requests.Session for connection pooling, and pandas to persist the time series to CSV with built-in deduplication.


```python
import logging
import os
import re
import statistics
from dataclasses import dataclass, field
from datetime import date, datetime
from pathlib import Path
from typing import Dict, List, Optional

import pandas as pd
import requests
from dotenv import load_dotenv

# ====================== Configuration ======================
load_dotenv()

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s | %(levelname)s | %(message)s',
    datefmt='%H:%M:%S'
)
logger = logging.getLogger(__name__)

API_KEY = os.getenv("SEARCHAPI_API_KEY")
if not API_KEY:
    raise ValueError("SEARCHAPI_API_KEY not found in the environment or .env file!")

@dataclass
class Config:
    # default_factory gives every Config its own list instead of a None default
    target_items: List[str] = field(default_factory=list)
    location: str = "Osaka, Osaka, Japan"
    gl: str = "jp"
    hl: str = "ja"
    num_results: int = 20
    min_samples: int = 3

# Default tracking items for monitoring economic temperature
default_config = Config(
    target_items=[
        "Egg 10-pack",
        "Rice 5kg",
        "Tissue paper 5-pack",
        "Gasoline price",
        "iPhone 15 128GB",
    ]
)

class EconomicIndicatorTracker:
    def __init__(self, api_key: str, config: Config):
        self.api_key = api_key
        self.config = config
        self.endpoint = "https://www.searchapi.io/api/v1/search"
        # One Session for all queries so connections are reused
        self.session = requests.Session()

    @staticmethod
    def parse_price(price_str: str) -> Optional[int]:
        """Safely convert localized price strings into clean integers"""
        if not price_str:
            return None
        # Keep digits, dots, and commas; this drops '円', '¥', spaces, and other text
        cleaned = re.sub(r'[^\d.,]', '', str(price_str))
        # Commas are treated as thousands separators
        cleaned = cleaned.replace(',', '')
        try:
            return int(float(cleaned))
        except ValueError:
            return None

    def get_market_price(self, query: str) -> Optional[Dict]:
        """Retrieve structured price data from Google Shopping via SearchApi"""
        params = {
            "engine": "google_shopping",
            "q": query,
            "location": self.config.location,
            "api_key": self.api_key,
            "gl": self.config.gl,
            "hl": self.config.hl,
            "num": self.config.num_results,
        }

        try:
            response = self.session.get(self.endpoint, params=params, timeout=20)
            response.raise_for_status()
            data = response.json()

            prices = []
            for item in data.get("shopping_results", []):
                # Fall back to the numeric field when the display string is missing
                price_str = item.get("price") or item.get("extracted_price")
                if price_str:
                    parsed = self.parse_price(price_str)
                    if parsed is not None and parsed > 0:
                        prices.append(parsed)

            if len(prices) < self.config.min_samples:
                logger.warning(f"Too few samples for {query} ({len(prices)} items)")
                return None

            # Take one clock reading so 'date' and 'timestamp' always agree
            now = datetime.now()
            return {
                "item": query,
                "date": now.date().isoformat(),
                "timestamp": now.isoformat(),
                "median_price": round(statistics.median(prices)),
                "sample_count": len(prices),
                "min_price": min(prices),
                "max_price": max(prices)
            }
        except Exception as e:
            # Log and skip this item so one failure does not abort the whole run
            logger.error(f"Error fetching {query}: {e}")
            return None

def main():
    tracker = EconomicIndicatorTracker(API_KEY, default_config)
    results = []

    print(f"--- Starting Real-time Economic Investigation ({date.today().isoformat()}) ---")

    for item in default_config.target_items:
        logger.info(f"Fetching data for: {item}...")
        stats = tracker.get_market_price(item)
        if stats:
            results.append(stats)
            logger.info(f"   → Median: ¥{stats['median_price']:,} ({stats['sample_count']} samples)")

    if results:
        df_new = pd.DataFrame(results)
        csv_file = f"economic_indicator_{datetime.now().strftime('%Y%m')}.csv"

        # ==================== Time-Series Persistence & Deduplication ====================
        if Path(csv_file).exists():
            df_existing = pd.read_csv(csv_file)
            # Keep the newest row per (date, item): a same-day rerun replaces the
            # items it re-fetched without discarding items that failed this time
            df_combined = pd.concat([df_existing, df_new], ignore_index=True)
            df_combined = df_combined.drop_duplicates(subset=['date', 'item'], keep='last')
        else:
            df_combined = df_new

        df_combined.to_csv(csv_file, index=False)
        logger.info(f"💾 Saved to time-series ledger: {csv_file}")

        print("\n" + "="*50)
        print("### Market Price Summary ###")
        print(df_new[['item', 'median_price', 'sample_count']].to_string(index=False))
        print("="*50)
    else:
        logger.error("No data collected.")

if __name__ == "__main__":
    main()
```
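As a quick sanity check on the parsing step, the snippet below restates the parsing logic as a standalone function and runs it on a few inputs. The sample strings are made up for illustration and are not actual API responses. The function assumes yen prices where commas are thousands separators.

```python
import re
from typing import Optional

def parse_price(price_str) -> Optional[int]:
    """Standalone copy of the tracker's parsing logic, for illustration only."""
    if not price_str:
        return None
    # Keep digits, dots, and commas; drop currency marks, spaces, and other text
    cleaned = re.sub(r"[^\d.,]", "", str(price_str))
    # Treat commas as thousands separators
    cleaned = cleaned.replace(",", "")
    try:
        return int(float(cleaned))
    except ValueError:
        return None

# Hypothetical inputs: display strings, a numeric value, and two unusable cases
samples = ["¥1,280", "298円", "2,480 円", 1980.0, "", "お問い合わせ"]
for s in samples:
    print(repr(s), "->", parse_price(s))
```

The first four inputs parse to 1280, 298, 2480, and 1980. The empty string and the text-only string both return None, so they never reach the median calculation.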


3. Statistical Honesty: Why Median Pricing?

When mining web data, raw lists of numbers are full of noise. A simple mean (average) can easily be skewed by a single luxury item, a wholesale bundle, or a data entry error.

To stay statistically honest, this system reports the median price. Taking the middle value of the sorted prices keeps a handful of extreme listings from moving the result, so the figure stays close to "the market center" for local consumers. The median is robust rather than immune: if more than half of the results are off-target, it shifts too, so the recorded minimum, maximum, and sample count are worth checking whenever a value looks odd.
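A small worked example makes the difference concrete. The prices below are hypothetical yen values chosen for illustration (six ordinary listings plus one bulk case), not collected data.

```python
import statistics

# Hypothetical yen prices for one query: six ordinary listings and one bulk case
prices = [238, 248, 258, 268, 278, 298, 2980]

mean_price = statistics.mean(prices)
median_price = statistics.median(prices)

print(f"mean:   {mean_price:.0f}")   # mean:   653
print(f"median: {median_price}")     # median: 268
```

One bulk listing pulls the mean to more than double the typical shelf price, while the median stays inside the cluster of ordinary listings.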

4. Architectural Highlights

Connection Reuse: requests.Session() keeps the underlying connection alive between calls, so the script does not pay for a new TCP and TLS handshake on every tracked item.

Data Safety: The persistence layer is idempotent. Rerunning the script on the same day replaces that day's rows instead of appending duplicates, so the time series stays at one record per item per day.

High-Fidelity Localization: The location, gl, and hl parameters request results as served to a Japanese-language user in Osaka, so the sampled prices reflect the listings shown to local shoppers rather than a global average.
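The idempotency property from the Data Safety point can be illustrated without pandas. The sketch below is not the script's persistence code: `upsert_rows` is a hypothetical helper, keying rows on (date, item) is an assumption, and the ledger values are invented for the example.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]

def upsert_rows(existing: List[Row], new: List[Row]) -> List[Row]:
    """Merge rows keyed by (date, item); a later row replaces an earlier one."""
    merged: Dict[Tuple[str, str], Row] = {}
    for row in existing + new:
        merged[(str(row["date"]), str(row["item"]))] = row
    return list(merged.values())

# Hypothetical ledger contents and a same-day rerun
ledger = [
    {"date": "2025-01-14", "item": "Rice 5kg", "median_price": 4100},
    {"date": "2025-01-15", "item": "Rice 5kg", "median_price": 4120},
]
rerun = [{"date": "2025-01-15", "item": "Rice 5kg", "median_price": 4150}]

once = upsert_rows(ledger, rerun)
twice = upsert_rows(once, rerun)

print(len(once))       # 2
print(once == twice)   # True
```

Applying the same rerun a second time leaves the ledger unchanged, which is the property that makes repeated daily runs safe.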

5. Next Phase: Multi-Dimensional Spatial Analysis

This architecture is meant as a foundational module. By moving from the Shopping engine to the Google Maps API, we can map service-sector business density, local service rates, and regional price dispersion into interactive spatial heatmaps.

🌐 Open for Collaborations & Engineering Roles

I specialize in building robust data pipelines, automation systems, and high-fidelity scrapers that convert unstructured web data into actionable economic and business insights.

If your team is looking for a Data Engineer / Backend Developer to design reliable scrapers, automate workflows, or write high-quality technical content:

📩 Contact Me: webmaster.kazu@gmail.com

💻 GitHub Repository: https://github.com/kobayashikazu/rei-tracker-osaka

🏢 Website: laboratory.kazuuu.net

Special thanks to SearchApi.io for supporting this research environment.