Browser extensions can passively collect pricing data, track inventory changes, and monitor competitor updates as you browse. Here's how to build one that collects structured data from any website you visit.
Use Cases for Passive Collection
- Price monitoring: Automatically log prices as you browse e-commerce sites
- Job market research: Capture salary data from job postings you view
- Competitive analysis: Track competitor feature changes over time
- Research aggregation: Collect data points across academic sources
Extension Architecture
Our extension has three parts:
- Content script — Runs on every page, extracts structured data
- Background service worker — Manages storage and API sync
- Popup UI — Shows collection status and stats
The Manifest (manifest.json)
{
  "manifest_version": 3,
  "name": "Data Collector",
  "version": "1.0",
  "permissions": ["storage", "activeTab"],
  "host_permissions": ["https://your-api.com/*"],
  "content_scripts": [
    {
      "matches": ["<all_urls>"],
      "js": ["content.js"]
    }
  ],
  "background": {
    "service_worker": "background.js"
  },
  "action": {
    "default_popup": "popup.html"
  }
}
The host_permissions entry is what lets the background service worker POST collected data to your sync endpoint without hitting CORS errors.
Content Script (content.js)
The content script extracts structured data based on the site you're visiting:
// Site-specific extractors, keyed by a hostname (or hostname + path) fragment
const extractors = {
  "amazon.com": () => {
    const title = document.querySelector("#productTitle")?.textContent?.trim();
    const price = document.querySelector(".a-price-whole")?.textContent;
    const rating = document.querySelector("#acrPopover")?.title;
    return title ? { type: "product", title, price, rating } : null;
  },
  "linkedin.com/jobs": () => {
    const title = document.querySelector(".job-details-jobs-unified-top-card__job-title")?.textContent?.trim();
    const company = document.querySelector(".job-details-jobs-unified-top-card__company-name")?.textContent?.trim();
    const salary = document.querySelector(".salary-main-rail__current-range")?.textContent?.trim();
    return title ? { type: "job", title, company, salary } : null;
  },
  "zillow.com": () => {
    const price = document.querySelector("[data-testid='price']")?.textContent;
    const address = document.querySelector("[data-testid='fs-chip-container']")?.textContent;
    const beds = document.querySelector("[data-testid='bed-bath-item']")?.textContent;
    return price ? { type: "listing", price, address, beds } : null;
  }
};

// Run extraction on page load
function extract() {
  // Match against hostname + path so keys like "linkedin.com/jobs" can
  // ever match — location.hostname alone never contains a path
  const page = window.location.hostname + window.location.pathname;
  for (const [site, extractor] of Object.entries(extractors)) {
    if (page.includes(site)) {
      const data = extractor();
      if (data) {
        data.url = window.location.href;
        data.timestamp = new Date().toISOString();
        chrome.runtime.sendMessage({ action: "save", data });
      }
      break;
    }
  }
}

// Run after the page has fully loaded
if (document.readyState === "complete") extract();
else window.addEventListener("load", extract);
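The dispatch inside extract() is easy to check in isolation: the first extractor whose key is a substring of hostname + path wins. A minimal Python sketch of the same rule (the function and argument names are illustrative, not part of the extension):

```python
def pick_extractor(page, extractors):
    """Return the first extractor whose site key appears in `page`.

    `page` is hostname + path (e.g. "www.linkedin.com/jobs/view/42"),
    mirroring the substring dispatch used by extract() in content.js.
    """
    for site, extractor in extractors.items():
        if site in page:
            return extractor
    return None

# Keys with a path ("linkedin.com/jobs") only match because the path is
# included in `page`; a bare hostname would never contain "/jobs".
extractors = {"amazon.com": "product", "linkedin.com/jobs": "job"}
print(pick_extractor("www.linkedin.com/jobs/view/42", extractors))  # job
```

Because iteration order is insertion order, put more specific keys (hostname + path) before broader ones if they can overlap.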
Background Service Worker (background.js)
// Store collected data
chrome.runtime.onMessage.addListener((msg) => {
  if (msg.action === "save") {
    chrome.storage.local.get(["collected"], (result) => {
      const collected = result.collected || [];
      collected.push(msg.data);
      chrome.storage.local.set({ collected });
      // Sync to the server every 50 items
      if (collected.length % 50 === 0) {
        syncToServer(collected);
      }
    });
  }
});

async function syncToServer(data) {
  try {
    await fetch("https://your-api.com/collect", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(data)
    });
    // Drop only the items we just synced, so anything saved while the
    // request was in flight survives until the next batch
    chrome.storage.local.get(["collected"], (result) => {
      const remaining = (result.collected || []).slice(data.length);
      chrome.storage.local.set({ collected: remaining });
    });
  } catch (e) {
    console.error("Sync failed:", e);
  }
}
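The worker's batching policy — buffer locally, hand off a batch of 50, keep anything that arrives mid-sync — can be sketched as a plain buffer class, independent of any Chrome API. A Python sketch with hypothetical names:

```python
class SyncBuffer:
    """Accumulate items and flush a fixed-size batch to `sink`.

    Mirrors the background worker's policy: only the flushed batch is
    dropped, so items added while a sync is in flight stay queued.
    """
    def __init__(self, sink, flush_every=50):
        self.sink = sink              # callable taking a list, e.g. an HTTP POST
        self.flush_every = flush_every
        self.items = []

    def add(self, item):
        self.items.append(item)
        if len(self.items) >= self.flush_every:
            batch = self.items[:self.flush_every]
            self.items = self.items[self.flush_every:]
            self.sink(batch)

batches = []
buf = SyncBuffer(batches.append, flush_every=3)
for i in range(7):
    buf.add(i)
print(batches)  # [[0, 1, 2], [3, 4, 5]]
```

The key design point is slicing off exactly one batch instead of clearing the whole buffer: that is what prevents data loss when new items land during a slow network call.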
Python Backend to Process Collected Data
from fastapi import FastAPI
from pydantic import BaseModel, ConfigDict
from typing import List
import os
import pandas as pd

app = FastAPI()

class DataPoint(BaseModel):
    # Pydantic v2: let site-specific fields (title, price, salary, ...)
    # pass validation instead of being silently dropped
    model_config = ConfigDict(extra="allow")
    type: str
    url: str
    timestamp: str

@app.post("/collect")
async def collect(data: List[DataPoint]):
    """Receive collected data and append it to a CSV file."""
    # Caveat: batches carrying different extra fields write differing
    # columns; use one file per data type if you need a strict schema
    df = pd.DataFrame([d.model_dump() for d in data])
    df.to_csv(
        "collected_data.csv",
        mode="a",
        header=not os.path.exists("collected_data.csv"),
        index=False
    )
    return {"stored": len(data)}
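The header-once append pattern the endpoint relies on is worth seeing without pandas. A stdlib-only sketch (the function name is hypothetical):

```python
import csv
import os

def append_rows(path, rows, fieldnames):
    """Append dict rows to a CSV file, writing the header only once,
    when the file does not exist yet."""
    is_new = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if is_new:
            writer.writeheader()
        writer.writerows(rows)
```

Calling it twice yields one header line followed by all rows, which is exactly the layout the /stats endpoint reads back.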
@app.get("/stats")
async def stats():
    """Get collection statistics."""
    try:
        df = pd.read_csv("collected_data.csv")
    except FileNotFoundError:
        # Nothing collected yet
        return {"total": 0, "by_type": {}, "latest": []}
    return {
        "total": len(df),
        "by_type": df["type"].value_counts().to_dict(),
        "latest": df.tail(5).to_dict(orient="records")
    }
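The by_type roll-up that /stats delegates to pandas is just a frequency count; a stdlib equivalent (helper name hypothetical) makes the shape of the result obvious:

```python
from collections import Counter

def type_counts(rows):
    """Count collected items by their 'type' field, like value_counts()."""
    return dict(Counter(row["type"] for row in rows))

rows = [{"type": "product"}, {"type": "job"}, {"type": "product"}]
print(type_counts(rows))  # {'product': 2, 'job': 1}
```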
Analyzing Collected Data
Once you've collected data over days or weeks, the analysis gets interesting:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("collected_data.csv")
df["timestamp"] = pd.to_datetime(df["timestamp"])

# Price trends for products you've viewed
# (.copy() avoids pandas' SettingWithCopyWarning on the slice)
products = df[df["type"] == "product"].copy()
products["price_num"] = (
    products["price"]
    .str.replace(",", "")                       # "1,299.99" -> "1299.99"
    .str.extract(r"(\d+\.?\d*)", expand=False)  # Series, not DataFrame
    .astype(float)
)

# Group by product and plot price over time
for title in products["title"].unique()[:5]:
    subset = products[products["title"] == title].sort_values("timestamp")
    plt.plot(subset["timestamp"], subset["price_num"], label=title[:30])

plt.legend()
plt.title("Price Tracking from Browsing")
plt.savefig("price_trends.png")
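Real scraped price strings are messy ("$1,299.99", "From 24.99", empty cells), so it pays to centralize the parsing in one helper rather than scattering regexes. A sketch for US-style formats (the function name is hypothetical, and decimal-comma locales are deliberately out of scope):

```python
import re

def parse_price(text):
    """Parse a US-formatted price string like '$1,299.99' into a float.

    Returns None for missing or non-numeric input; European formats
    such as '1.299,99' are not handled by this sketch.
    """
    if not isinstance(text, str):
        return None
    match = re.search(r"(\d[\d,]*(?:\.\d+)?)", text)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))

print(parse_price("$1,299.99"))  # 1299.99
```

Returning None instead of raising keeps the analysis pipeline running when a site changes its markup and an extractor starts capturing junk.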
Scaling with Proxy Services
When you need to supplement passive collection with active scraping, tools like ScraperAPI handle the heavy lifting. For large-scale data collection across multiple sites, ThorData provides reliable residential proxies, and ScrapeOps monitors your collection pipeline health.
Privacy and Ethics
- Only collect publicly visible data
- Never collect personal information from other users
- Be transparent about what your extension collects
- Include a clear privacy policy
- Comply with GDPR and CCPA if applicable
Conclusion
A passive data collection extension turns your regular browsing into a structured dataset. Start with one site extractor, validate the data quality, then expand to more sites. The key is building good extractors that handle site variations gracefully.