Browser extensions can passively collect pricing data, track inventory changes, and monitor competitor updates as you browse. Here's how to build one that collects structured data from any website you visit.
Use Cases for Passive Collection
- Price monitoring: Automatically log prices as you browse e-commerce sites
- Job market research: Capture salary data from job postings you view
- Competitive analysis: Track competitor feature changes over time
- Research aggregation: Collect data points across academic sources
Extension Architecture
Our extension has three parts:
- Content script — Runs on every page, extracts structured data
- Background service worker — Manages storage and API sync
- Popup UI — Shows collection status and stats
The Manifest (manifest.json)
{
  "manifest_version": 3,
  "name": "Data Collector",
  "version": "1.0",
  "permissions": ["storage", "activeTab"],
  "host_permissions": ["https://your-api.com/*"],
  "content_scripts": [
    {
      "matches": ["<all_urls>"],
      "js": ["content.js"]
    }
  ],
  "background": {
    "service_worker": "background.js"
  },
  "action": {
    "default_popup": "popup.html"
  }
}
The host_permissions entry is what lets the background service worker POST collected data to your sync endpoint without hitting CORS errors.
Content Script (content.js)
The content script extracts structured data based on the site you're visiting:
// Site-specific extractors, keyed by a hostname (or hostname + path) fragment
const extractors = {
  "amazon.com": () => {
    const title = document.querySelector("#productTitle")?.textContent?.trim();
    const price = document.querySelector(".a-price-whole")?.textContent;
    const rating = document.querySelector("#acrPopover")?.title;
    return title ? { type: "product", title, price, rating } : null;
  },
  "linkedin.com/jobs": () => {
    const title = document.querySelector(".job-details-jobs-unified-top-card__job-title")?.textContent?.trim();
    const company = document.querySelector(".job-details-jobs-unified-top-card__company-name")?.textContent?.trim();
    const salary = document.querySelector(".salary-main-rail__current-range")?.textContent?.trim();
    return title ? { type: "job", title, company, salary } : null;
  },
  "zillow.com": () => {
    const price = document.querySelector("[data-testid='price']")?.textContent;
    const address = document.querySelector("[data-testid='fs-chip-container']")?.textContent;
    const beds = document.querySelector("[data-testid='bed-bath-item']")?.textContent;
    return price ? { type: "listing", price, address, beds } : null;
  }
};

// Run extraction on page load
function extract() {
  // Match against hostname + path so keys like "linkedin.com/jobs" can
  // ever match — location.hostname alone never contains a path
  const page = window.location.hostname + window.location.pathname;
  for (const [site, extractor] of Object.entries(extractors)) {
    if (page.includes(site)) {
      const data = extractor();
      if (data) {
        data.url = window.location.href;
        data.timestamp = new Date().toISOString();
        chrome.runtime.sendMessage({ action: "save", data });
      }
      break;
    }
  }
}

// Run after the page has fully loaded
if (document.readyState === "complete") extract();
else window.addEventListener("load", extract);
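The dispatch inside extract() is easy to check in isolation: the first extractor whose key is a substring of hostname + path wins. A minimal Python sketch of the same rule (the function and argument names are illustrative, not part of the extension):

```python
def pick_extractor(page, extractors):
    """Return the first extractor whose site key appears in `page`.

    `page` is hostname + path (e.g. "www.linkedin.com/jobs/view/42"),
    mirroring the substring dispatch used by extract() in content.js.
    """
    for site, extractor in extractors.items():
        if site in page:
            return extractor
    return None

# Keys with a path ("linkedin.com/jobs") only match because the path is
# included in `page`; a bare hostname would never contain "/jobs".
extractors = {"amazon.com": "product", "linkedin.com/jobs": "job"}
print(pick_extractor("www.linkedin.com/jobs/view/42", extractors))  # job
```

Because iteration order is insertion order, put more specific keys (hostname + path) before broader ones if they can overlap.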
Background Service Worker (background.js)
// Store collected data
chrome.runtime.onMessage.addListener((msg) => {
  if (msg.action === "save") {
    chrome.storage.local.get(["collected"], (result) => {
      const collected = result.collected || [];
      collected.push(msg.data);
      chrome.storage.local.set({ collected });
      // Sync to the server every 50 items
      if (collected.length % 50 === 0) {
        syncToServer(collected);
      }
    });
  }
});

async function syncToServer(data) {
  try {
    await fetch("https://your-api.com/collect", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(data)
    });
    // Drop only the items we just synced, so anything saved while the
    // request was in flight survives until the next batch
    chrome.storage.local.get(["collected"], (result) => {
      const remaining = (result.collected || []).slice(data.length);
      chrome.storage.local.set({ collected: remaining });
    });
  } catch (e) {
    console.error("Sync failed:", e);
  }
}
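The worker's batching policy — buffer locally, hand off a batch of 50, keep anything that arrives mid-sync — can be sketched as a plain buffer class, independent of any Chrome API. A Python sketch with hypothetical names:

```python
class SyncBuffer:
    """Accumulate items and flush a fixed-size batch to `sink`.

    Mirrors the background worker's policy: only the flushed batch is
    dropped, so items added while a sync is in flight stay queued.
    """
    def __init__(self, sink, flush_every=50):
        self.sink = sink              # callable taking a list, e.g. an HTTP POST
        self.flush_every = flush_every
        self.items = []

    def add(self, item):
        self.items.append(item)
        if len(self.items) >= self.flush_every:
            batch = self.items[:self.flush_every]
            self.items = self.items[self.flush_every:]
            self.sink(batch)

batches = []
buf = SyncBuffer(batches.append, flush_every=3)
for i in range(7):
    buf.add(i)
print(batches)  # [[0, 1, 2], [3, 4, 5]]
```

The key design point is slicing off exactly one batch instead of clearing the whole buffer: that is what prevents data loss when new items land during a slow network call.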
Python Backend to Process Collected Data
from fastapi import FastAPI
from pydantic import BaseModel, ConfigDict
from typing import List
import os
import pandas as pd

app = FastAPI()

class DataPoint(BaseModel):
    # Pydantic v2: let site-specific fields (title, price, salary, ...)
    # pass validation instead of being silently dropped
    model_config = ConfigDict(extra="allow")
    type: str
    url: str
    timestamp: str

@app.post("/collect")
async def collect(data: List[DataPoint]):
    """Receive collected data and append it to a CSV file."""
    # Caveat: batches carrying different extra fields write differing
    # columns; use one file per data type if you need a strict schema
    df = pd.DataFrame([d.model_dump() for d in data])
    df.to_csv(
        "collected_data.csv",
        mode="a",
        header=not os.path.exists("collected_data.csv"),
        index=False
    )
    return {"stored": len(data)}
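The header-once append pattern the endpoint relies on is worth seeing without pandas. A stdlib-only sketch (the function name is hypothetical):

```python
import csv
import os

def append_rows(path, rows, fieldnames):
    """Append dict rows to a CSV file, writing the header only once,
    when the file does not exist yet."""
    is_new = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if is_new:
            writer.writeheader()
        writer.writerows(rows)
```

Calling it twice yields one header line followed by all rows, which is exactly the layout the /stats endpoint reads back.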
@app.get("/stats")
async def stats():
    """Get collection statistics."""
    try:
        df = pd.read_csv("collected_data.csv")
    except FileNotFoundError:
        # Nothing collected yet
        return {"total": 0, "by_type": {}, "latest": []}
    return {
        "total": len(df),
        "by_type": df["type"].value_counts().to_dict(),
        "latest": df.tail(5).to_dict(orient="records")
    }
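The by_type roll-up that /stats delegates to pandas is just a frequency count; a stdlib equivalent (helper name hypothetical) makes the shape of the result obvious:

```python
from collections import Counter

def type_counts(rows):
    """Count collected items by their 'type' field, like value_counts()."""
    return dict(Counter(row["type"] for row in rows))

rows = [{"type": "product"}, {"type": "job"}, {"type": "product"}]
print(type_counts(rows))  # {'product': 2, 'job': 1}
```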
Analyzing Collected Data
Once you've collected data over days or weeks, the analysis gets interesting:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("collected_data.csv")
df["timestamp"] = pd.to_datetime(df["timestamp"])

# Price trends for products you've viewed
# (.copy() avoids pandas' SettingWithCopyWarning on the slice)
products = df[df["type"] == "product"].copy()
products["price_num"] = (
    products["price"]
    .str.replace(",", "")                       # "1,299.99" -> "1299.99"
    .str.extract(r"(\d+\.?\d*)", expand=False)  # Series, not DataFrame
    .astype(float)
)

# Group by product and plot price over time
for title in products["title"].unique()[:5]:
    subset = products[products["title"] == title].sort_values("timestamp")
    plt.plot(subset["timestamp"], subset["price_num"], label=title[:30])

plt.legend()
plt.title("Price Tracking from Browsing")
plt.savefig("price_trends.png")
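Real scraped price strings are messy ("$1,299.99", "From 24.99", empty cells), so it pays to centralize the parsing in one helper rather than scattering regexes. A sketch for US-style formats (the function name is hypothetical, and decimal-comma locales are deliberately out of scope):

```python
import re

def parse_price(text):
    """Parse a US-formatted price string like '$1,299.99' into a float.

    Returns None for missing or non-numeric input; European formats
    such as '1.299,99' are not handled by this sketch.
    """
    if not isinstance(text, str):
        return None
    match = re.search(r"(\d[\d,]*(?:\.\d+)?)", text)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))

print(parse_price("$1,299.99"))  # 1299.99
```

Returning None instead of raising keeps the analysis pipeline running when a site changes its markup and an extractor starts capturing junk.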
Scaling with Proxy Services
When you need to supplement passive collection with active scraping, tools like ScraperAPI handle the heavy lifting. For large-scale data collection across multiple sites, ThorData provides reliable residential proxies, and ScrapeOps monitors your collection pipeline health.
Privacy and Ethics
- Only collect publicly visible data
- Never collect personal information from other users
- Be transparent about what your extension collects
- Include a clear privacy policy
- Comply with GDPR and CCPA if applicable
Conclusion
A passive data collection extension turns your regular browsing into a structured dataset. Start with one site extractor, validate the data quality, then expand to more sites. The key is building good extractors that handle site variations gracefully.