<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: M.Amin Mashayekhan</title>
    <description>The latest articles on DEV Community by M.Amin Mashayekhan (@amin_mashayekhan).</description>
    <link>https://dev.to/amin_mashayekhan</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1719024%2Fdc334111-524b-4305-86a9-3b33b0b15de7.jpg</url>
      <title>DEV Community: M.Amin Mashayekhan</title>
      <link>https://dev.to/amin_mashayekhan</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/amin_mashayekhan"/>
    <language>en</language>
    <item>
      <title>Designing a Scalable Logging System for Web Scrapers: How to Prevent a Database Meltdown</title>
      <dc:creator>M.Amin Mashayekhan</dc:creator>
      <pubDate>Mon, 09 Jun 2025 11:56:39 +0000</pubDate>
      <link>https://dev.to/amin_mashayekhan/designing-a-scalable-logging-system-for-web-scrapers-how-to-prevent-a-database-meltdown-4h4m</link>
      <guid>https://dev.to/amin_mashayekhan/designing-a-scalable-logging-system-for-web-scrapers-how-to-prevent-a-database-meltdown-4h4m</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;A web crawler is only as useful as its stability.&lt;/p&gt;

&lt;p&gt;When a scraper sends a log every 7 seconds — and dozens or even hundreds of users are using it simultaneously — things can go wrong, fast.&lt;/p&gt;

&lt;p&gt;How do you stop:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The database from exploding?&lt;/li&gt;
&lt;li&gt;The system from slowing to a crawl?&lt;/li&gt;
&lt;li&gt;The support team from drowning in unmanageable logs?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To prevent this, you need a logging architecture that &lt;strong&gt;scales&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This article walks through a real-world implementation of a logging system purpose-built for a scalable web scraper, focusing on &lt;strong&gt;performance&lt;/strong&gt;, &lt;strong&gt;durability&lt;/strong&gt;, and &lt;strong&gt;developer experience&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Two Hidden Enemies of Real-Time Logging Systems
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1- Unbounded Data Growth
&lt;/h3&gt;

&lt;p&gt;Without structure and filtering, logs quickly saturate the database and make analytics almost impossible.&lt;/p&gt;

&lt;h3&gt;
  
  
  2- Performance Degradation
&lt;/h3&gt;

&lt;p&gt;Poorly designed logging directly impacts frontend responsiveness and backend throughput.&lt;/p&gt;




&lt;h2&gt;
  
  
  Frontend Logging Strategy (React): Store Less, Show More
&lt;/h2&gt;

&lt;p&gt;In the React frontend, I chose to keep only the latest 50 logs in memory and display them in the UI.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const MAX_LOGS = 50;

const handleLog = useCallback((log: ScrapLog) =&amp;gt; {
  setLogs((prev) =&amp;gt; [log, ...prev.slice(0, MAX_LOGS - 1)]);
}, []);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why this works:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Users see only fresh, relevant logs&lt;/li&gt;
&lt;li&gt;DOM and memory stay lightweight&lt;/li&gt;
&lt;li&gt;Logs can be exported as CSV for support teams&lt;/li&gt;
&lt;/ul&gt;
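&lt;p&gt;The CSV export can be a small pure function over the same in-memory array — a minimal sketch, assuming each log carries &lt;code&gt;timestamp&lt;/code&gt;, &lt;code&gt;level&lt;/code&gt;, and &lt;code&gt;message&lt;/code&gt; fields (the field names are illustrative, not the project's actual schema):&lt;/p&gt;

```javascript
// Hypothetical log shape: { timestamp, level, message } — adjust to your ScrapLog type.
function toCsv(logs) {
  const header = 'timestamp,level,message';
  // RFC 4180-style quoting so commas and quotes inside messages survive
  const escapeField = (value) => '"' + String(value).replace(/"/g, '""') + '"';
  const rows = logs.map((log) =>
    [log.timestamp, log.level, log.message].map(escapeField).join(',')
  );
  return [header, ...rows].join('\n');
}

// Offer the CSV as a browser download for the support team
function downloadCsv(logs, filename = 'scrap-logs.csv') {
  const blob = new Blob([toCsv(logs)], { type: 'text/csv' });
  const url = URL.createObjectURL(blob);
  const a = document.createElement('a');
  a.href = url;
  a.download = filename;
  a.click();
  URL.revokeObjectURL(url);
}
```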




&lt;h2&gt;
  
  
  Backend Architecture: Separate the Signals from the Noise
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1- Summary Logs
&lt;/h3&gt;

&lt;p&gt;Each scraping session generates &lt;strong&gt;one summary record&lt;/strong&gt; containing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Number of list pages scraped (pagination)&lt;/li&gt;
&lt;li&gt;Number of product items extracted&lt;/li&gt;
&lt;li&gt;URL of the first and last list pages scraped (pagination)&lt;/li&gt;
&lt;li&gt;First and last items extracted&lt;/li&gt;
&lt;li&gt;Final status (success or failure)&lt;/li&gt;
&lt;li&gt;Total execution time&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Retention&lt;/strong&gt;: Permanent&lt;br&gt;
&lt;strong&gt;Use case&lt;/strong&gt;: Dashboard analytics and long-term monitoring&lt;/p&gt;
&lt;/blockquote&gt;
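&lt;p&gt;As a rough sketch, the one-per-session summary could be assembled like this before being posted to the backend (the &lt;code&gt;session&lt;/code&gt; shape and field names are my own illustration, not the article's actual schema):&lt;/p&gt;

```javascript
// session is a hypothetical in-memory record of one scraping run:
// { pages: string[], items: string[], error, startedAt, finishedAt }
function buildSummary(session) {
  return {
    pagesScraped: session.pages.length,
    itemsExtracted: session.items.length,
    firstPageUrl: session.pages[0],
    lastPageUrl: session.pages[session.pages.length - 1],
    firstItem: session.items[0],
    lastItem: session.items[session.items.length - 1],
    status: session.error ? 'failure' : 'success',
    durationMs: session.finishedAt - session.startedAt,
  };
}
```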




&lt;h3&gt;
  
  
  2- Fine-Grained Logs (Triggered by Unexpected Errors)
&lt;/h3&gt;

&lt;p&gt;If the scraper encounters an &lt;strong&gt;unexpected error&lt;/strong&gt; (i.e., a type of error not seen in the past 24 hours), the frontend sends the &lt;strong&gt;last 50 logs&lt;/strong&gt; to the server. These include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;URLs of visited pages&lt;/li&gt;
&lt;li&gt;Actions performed&lt;/li&gt;
&lt;li&gt;Any captured errors&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Format&lt;/strong&gt;: Lightweight, structured JSON&lt;br&gt;
&lt;strong&gt;Retention&lt;/strong&gt;: 7 days (configurable)&lt;br&gt;
&lt;strong&gt;Cleanup&lt;/strong&gt;: Automatically via scheduled job&lt;/p&gt;
&lt;/blockquote&gt;
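&lt;p&gt;The "unexpected error" gate can be sketched as a dedup check keyed by error type — the storage (an in-memory &lt;code&gt;Map&lt;/code&gt;) and function names here are assumptions for illustration, not the project's actual code:&lt;/p&gt;

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;
const lastSeen = new Map(); // errorType -> timestamp of last occurrence

// True only if this error type was not seen in the last 24 hours
function shouldUploadLogs(errorType, now = Date.now()) {
  const previous = lastSeen.get(errorType);
  lastSeen.set(errorType, now);
  return previous === undefined || now - previous > DAY_MS;
}

function onUnexpectedError(errorType, recentLogs) {
  if (shouldUploadLogs(errorType)) {
    // Send only the last 50 in-memory logs, as structured JSON
    const payload = recentLogs.slice(-50);
    // fetch('/api/fine-grained-logs', { method: 'POST', body: JSON.stringify(payload) });
  }
}
```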




&lt;h2&gt;
  
  
  Smart Cleanup in Laravel + MySQL
&lt;/h2&gt;

&lt;p&gt;Efficient storage isn't enough: &lt;strong&gt;you must clean up intelligently.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  🔍 Why Indexing Matters
&lt;/h3&gt;

&lt;p&gt;To speed up deletion of old logs, we index the &lt;code&gt;created_at&lt;/code&gt; field. This drastically improves the performance of time-based queries.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE INDEX idx_created_at ON scrap_logs (created_at);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  🧹 Scheduled Cleanup Job in Laravel
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// app/Console/Commands/DeleteOldLogs.php
namespace App\Console\Commands;

use App\Models\ScrapLog;
use Illuminate\Console\Command;

class DeleteOldLogs extends Command {
    protected $signature = 'logs:cleanup';
    protected $description = 'Delete fine-grained logs past the retention window';

    public function handle() {
        // Time-range delete; fast thanks to the index on created_at
        ScrapLog::where('created_at', '&amp;lt;', now()-&amp;gt;subDays(7))-&amp;gt;delete();
        $this-&amp;gt;info('Old logs cleaned successfully!');
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And register it in the scheduler:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;protected function schedule(Schedule $schedule) {
    $schedule-&amp;gt;command('logs:cleanup')-&amp;gt;dailyAt('01:00');
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Why This Architecture Works
&lt;/h2&gt;

&lt;p&gt;✅ Only essential logs are stored&lt;br&gt;
✅ Tables stay clean and queryable&lt;br&gt;
✅ Debugging becomes painless&lt;br&gt;
✅ Data analysis remains performant&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;A well-designed logging system isn't just for debugging; it's a &lt;strong&gt;critical survival mechanism&lt;/strong&gt;. With a scalable, performance-conscious architecture, your system can remain:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stable under load&lt;/li&gt;
&lt;li&gt;Transparent when things go wrong&lt;/li&gt;
&lt;li&gt;Insightful for business and technical teams&lt;/li&gt;
&lt;li&gt;User-friendly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Since implementing this system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Debugging is fast&lt;/li&gt;
&lt;li&gt;User behavior is easy to analyze&lt;/li&gt;
&lt;li&gt;We retain full traceability when incidents occur&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I hope these insights help you on your journey.&lt;/p&gt;




&lt;h3&gt;
  
  
  📣 Let's Talk
&lt;/h3&gt;

&lt;p&gt;How do you handle logging in production systems?&lt;br&gt;
Share your thoughts in the comments - I'd love to hear your approach. If this article helped, consider clapping 👏 and following for more insights on web development, browser automation, and software engineering.&lt;/p&gt;




&lt;h3&gt;
  
  
  📬 Get in Touch
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Email: &lt;a href="mailto:amin.mashayekhan@gmail.com"&gt;amin.mashayekhan@gmail.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Book a Quick Tech Call: &lt;a href="https://calendly.com/amin-mashayekhan/15min-tech-call" rel="noopener noreferrer"&gt;https://calendly.com/amin-mashayekhan/15min-tech-call&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's build better tools, faster!&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>laravel</category>
      <category>performance</category>
      <category>logging</category>
    </item>
    <item>
      <title>6 Lessons Learned from Building a Production-Grade Chrome Extension with Web Scraping</title>
      <dc:creator>M.Amin Mashayekhan</dc:creator>
      <pubDate>Thu, 22 May 2025 13:25:34 +0000</pubDate>
      <link>https://dev.to/amin_mashayekhan/6-lessons-learned-from-building-a-production-grade-chrome-extension-with-web-scraping-2jmn</link>
      <guid>https://dev.to/amin_mashayekhan/6-lessons-learned-from-building-a-production-grade-chrome-extension-with-web-scraping-2jmn</guid>
      <description>&lt;h2&gt;
  
  
  Introduction: Why Chrome Extensions Matter
&lt;/h2&gt;

&lt;p&gt;If you’ve ever tried to build a browser extension that interacts with the real world — not just toy examples — you’ll know how complex things can get, fast.&lt;/p&gt;

&lt;p&gt;Recently, I built a &lt;strong&gt;production-grade Chrome Extension&lt;/strong&gt; that performs deep web scraping with advanced resume mechanisms, survives browser shutdowns, and optimizes performance using smart techniques like &lt;strong&gt;link prefetching, batching&lt;/strong&gt;, and &lt;strong&gt;caching&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In this article, I’ll share the &lt;strong&gt;key takeaways&lt;/strong&gt; from building a production-grade Chrome Extension for web scraping — from crawling logic and performance hacks to resilience patterns that survive browser crashes.&lt;/p&gt;

&lt;p&gt;Whether you’re a beginner looking to build your first extension or a seasoned developer aiming to scale your architecture, these insights will help you navigate the complexities of browser automation and web scraping. Let’s dive in!&lt;/p&gt;


&lt;h2&gt;
  
  
  The Project: A Chrome Extension for Web Scraping
&lt;/h2&gt;

&lt;p&gt;The goal of my project was to create a Chrome Extension that could crawl websites, scrape data, and store it for later analysis. The extension needed to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scrape data from web pages (e.g., links, text, tables).&lt;/li&gt;
&lt;li&gt;Support resumable crawling (e.g., handle pagination and session state).&lt;/li&gt;
&lt;li&gt;Optimize performance with techniques like link prefetching and state caching.&lt;/li&gt;
&lt;li&gt;Survive browser restarts without losing progress.&lt;/li&gt;
&lt;li&gt;Respect ethical web scraping practices, such as adhering to robots.txt.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I built the extension using &lt;strong&gt;React&lt;/strong&gt; and &lt;strong&gt;TypeScript&lt;/strong&gt; for the frontend, with a &lt;strong&gt;background script&lt;/strong&gt; (background.js) handling the crawling logic. The extension also communicated with a Laravel backend to manage sites and crawl sessions. Here’s a high-level overview of the architecture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Popup UI (popup.tsx)&lt;/strong&gt;: A React-based interface for users to start, stop, and monitor crawls.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Background Script (background.js)&lt;/strong&gt;: Handles crawling logic, tab management, and data extraction.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Content Script (content.js)&lt;/strong&gt;: Injects into web pages to scrape data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backend (Laravel)&lt;/strong&gt;: Manages site configurations and crawl logs via APIs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now, let’s break down the key lessons I learned during this project, complete with code examples and practical advice.&lt;/p&gt;


&lt;h2&gt;
  
  
  Lesson 1: Master the Interaction Between Scripts for Maintainability
&lt;/h2&gt;

&lt;p&gt;Chrome Extensions operate in a unique environment with three main types of scripts: &lt;strong&gt;content scripts&lt;/strong&gt;, &lt;strong&gt;background scripts&lt;/strong&gt;, and &lt;strong&gt;popup scripts&lt;/strong&gt;. Understanding how these scripts interact is crucial for building a maintainable extension.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Content Scripts&lt;/strong&gt; run in the context of the web page and can access the DOM directly. They’re ideal for scraping data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Background Scripts&lt;/strong&gt; run persistently in the background and handle long-running tasks like crawling or tab management.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Popup Scripts&lt;/strong&gt; (or UI scripts) manage the user interface, typically built with frameworks like React.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In my project, the content script (content.js) scraped data from pages, the background script (background.js) orchestrated the crawl, and the popup UI (popup.tsx) displayed the status. Communication between these scripts is handled via Chrome’s message-passing API (chrome.runtime.sendMessage and chrome.runtime.onMessage).&lt;/p&gt;

&lt;p&gt;Here’s an example of how I set up message passing between the popup and background script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// popup.tsx (Sending a message to start a crawl)
const handleStartCrawl = () =&amp;gt; {
  chrome.runtime.sendMessage({
    action: 'START_CRAWL',
    siteId: selectedSite,
    startUrl: startUrl,
  });
};

// background.js (Listening for messages)
chrome.runtime.onMessage.addListener((request, sender, sendResponse) =&amp;gt; {
  if (request.action === 'START_CRAWL') {
    startCrawling(request.siteId, request.startUrl);
  }
});

async function startCrawling(siteId, startUrl) {
  const tab = await chrome.tabs.create({ url: startUrl });
  // Crawling logic...
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Takeaway&lt;/strong&gt;: Keep the responsibilities of each script clear. Use message passing to decouple components, making your codebase modular and easier to debug. For example, if the popup UI crashes, the background script can continue crawling uninterrupted.&lt;/p&gt;




&lt;h2&gt;
  
  
  Lesson 2: Use Message Passing Wisely for Scalability
&lt;/h2&gt;

&lt;p&gt;Message passing is the backbone of Chrome Extension communication, but it can become a bottleneck if not used carefully. In my project, I initially overused message passing, sending frequent updates between the popup and background script. This led to performance issues, especially during long crawls with hundreds of pages.&lt;/p&gt;

&lt;p&gt;To address this, I adopted a more modular approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Batch Updates&lt;/strong&gt;: Instead of sending a message for every scraped page, I batched updates and sent them periodically.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;State Management&lt;/strong&gt;: I stored crawl state (e.g., pages crawled, logs) in chrome.storage.local to reduce message frequency.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here’s how to implement batched updates in background.js:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;let crawlLogs = [];
let pagesCrawled = 0;

function updateCrawlStatus() {
  chrome.runtime.sendMessage({
    action: 'CRAWL_UPDATE',
    data: { pagesCrawled, logs: crawlLogs.slice(-10) }, // Send last 10 logs
  });
}

chrome.runtime.onMessage.addListener((request, sender, sendResponse) =&amp;gt; {
  if (request.action === 'START_CRAWL') {
    // Crawling logic...
    setInterval(updateCrawlStatus, 5000); // Update every 5 seconds
  }
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Takeaway&lt;/strong&gt;: Use message passing sparingly and batch updates where possible. For persistent state, leverage chrome.storage to ensure scalability, especially for long-running tasks like web scraping.&lt;/p&gt;




&lt;h2&gt;
  
  
  Lesson 3: Performance Optimization Is Non-Negotiable
&lt;/h2&gt;

&lt;p&gt;Performance is critical in Chrome Extensions, especially for web scraping tasks that involve frequent network requests and DOM manipulation. I implemented several optimizations to make my crawler efficient:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1- Link Prefetching&lt;/strong&gt;: I used &lt;code&gt;&amp;lt;link rel="prefetch"&amp;gt;&lt;/code&gt; to preload the next page’s resources while scraping the current page, reducing load times.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// content.js
function prefetchNextLink(nextUrl) {
  const link = document.createElement('link');
  link.rel = 'prefetch';
  link.href = nextUrl;
  document.head.appendChild(link);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2- Batching Network Calls&lt;/strong&gt;: Instead of making an API call for every scraped page, I batched data and sent it to the backend in chunks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// background.js
let scrapedData = [];
async function saveDataToBackend() {
  if (scrapedData.length &amp;gt;= 10) { // Batch size of 10
    await fetch('https://api.example.com/save', {
      method: 'POST',
      body: JSON.stringify(scrapedData),
      headers: { 'Content-Type': 'application/json' },
    });
    scrapedData = []; // Clear after sending
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3- Caching State&lt;/strong&gt;: I cached the crawl state in chrome.storage.local to avoid recomputing it on browser restarts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// background.js
function saveState() {
  chrome.storage.local.set({
    crawlState: { pagesCrawled, logs: crawlLogs, lastUrl: currentUrl },
  });
}

function loadState() {
  chrome.storage.local.get(['crawlState'], (result) =&amp;gt; {
    if (result.crawlState) {
      pagesCrawled = result.crawlState.pagesCrawled;
      crawlLogs = result.crawlState.logs;
      currentUrl = result.crawlState.lastUrl;
    }
  });
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Takeaway&lt;/strong&gt;: Optimize performance at every layer — network, DOM, and state management. Techniques like prefetching, batching, and caching can significantly improve the user experience, especially for resource-intensive tasks like web scraping.&lt;/p&gt;




&lt;h2&gt;
  
  
  Lesson 4: Design Resumable Crawlers for Robustness
&lt;/h2&gt;

&lt;p&gt;A key feature of my extension was its ability to resume crawling after interruptions, such as browser restarts. This required careful handling of pagination and session state.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pagination Handling&lt;/strong&gt;: I stored the current page number and total pages in &lt;code&gt;chrome.storage.local&lt;/code&gt;. If the crawl stopped, it could resume from the last page.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Session State&lt;/strong&gt;: I saved the crawl session (e.g., site ID, start URL, scraped data) to the backend, ensuring persistence across browser sessions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here’s how I implemented the resume mechanism:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// background.js
let currentPage = 1;
let totalPages = 10; // Example

async function crawlPage(tabId) {
  const tab = await chrome.tabs.get(tabId); // needed for tab.url below
  const results = await chrome.scripting.executeScript({
    target: { tabId },
    func: scrapePage, // Manifest V3 expects `func`, not `function`
  });
  const { links, data } = results[0].result;
  crawlLogs.push({ url: tab.url, data });
  currentPage++;

  // Save state
  chrome.storage.local.set({ currentPage, crawlLogs });

  if (currentPage &amp;lt;= totalPages) {
    const nextUrl = links[0]; // Simplified
    chrome.tabs.update(tabId, { url: nextUrl });
  }
}

function resumeCrawl() {
  chrome.storage.local.get(['currentPage', 'crawlLogs'], (result) =&amp;gt; {
    currentPage = result.currentPage || 1;
    crawlLogs = result.crawlLogs || [];
    // Restart crawl from last known state
  });
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Takeaway&lt;/strong&gt;: Design your crawler to handle interruptions gracefully. Store state persistently (locally or on a backend) and implement logic to resume from the last known point, ensuring a seamless user experience.&lt;/p&gt;




&lt;h2&gt;
  
  
  Lesson 5: Respect &lt;code&gt;robots.txt&lt;/code&gt; for Ethical Scraping
&lt;/h2&gt;

&lt;p&gt;Web scraping comes with ethical and legal responsibilities. One of the biggest mistakes I made early on was ignoring the target website’s &lt;code&gt;robots.txt&lt;/code&gt; file. This file specifies which parts of a site can be crawled, and ignoring it can lead to broken trust, guideline violations, or even IP bans.&lt;/p&gt;
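&lt;p&gt;A crawler can honor &lt;code&gt;robots.txt&lt;/code&gt; with a simple check before visiting a page. This is a minimal sketch only: it handles &lt;code&gt;Disallow&lt;/code&gt; rules under &lt;code&gt;User-agent: *&lt;/code&gt; with prefix matching, and does not cover the full spec (wildcards, &lt;code&gt;Allow&lt;/code&gt; precedence, crawl-delay):&lt;/p&gt;

```javascript
// Extract Disallow prefixes that apply to all crawlers (User-agent: *)
function parseDisallows(robotsTxt) {
  const disallows = [];
  let applies = false;
  for (const rawLine of robotsTxt.split('\n')) {
    const line = rawLine.split('#')[0].trim(); // strip comments
    const [field, ...rest] = line.split(':');
    const value = rest.join(':').trim();
    if (/^user-agent$/i.test(field.trim())) {
      applies = value === '*';
    } else if (applies && /^disallow$/i.test(field.trim()) && value) {
      disallows.push(value);
    }
  }
  return disallows;
}

// Prefix match: a path is allowed unless some Disallow rule covers it
function isAllowed(path, disallows) {
  return !disallows.some((prefix) => path.startsWith(prefix));
}
```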

&lt;p&gt;&lt;strong&gt;Takeaway&lt;/strong&gt;: Always respect &lt;code&gt;robots.txt&lt;/code&gt; and other ethical guidelines. Not only does this prevent legal issues, but it also builds trust with website owners and users.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lesson 6: The DOM Is Never as Consistent as You’d Hope
&lt;/h2&gt;

&lt;p&gt;One of the biggest challenges in web scraping is dealing with inconsistent DOM structures. I assumed that most websites would have predictable HTML, but I quickly learned otherwise. Some sites used dynamic rendering (e.g., React apps with lazy-loaded content), while others had broken or nested HTML.&lt;/p&gt;

&lt;p&gt;To handle this, I made my scraping logic more robust:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// content.js
function scrapePage() {
  const links = Array.from(document.querySelectorAll('a[href]'))
    .map((a) =&amp;gt; a.href)
    .filter((href) =&amp;gt; href.startsWith('http')); // Filter invalid links
  const data = document.querySelector('.content')?.innerText || document.body.innerText; // Fallback
  return { links, data };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Takeaway&lt;/strong&gt;: Build resilient scraping logic with fallbacks and error handling. Test your crawler on a variety of websites to ensure it can handle inconsistent DOM structures.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Building a Chrome Extension with web scraping capabilities was a challenging but rewarding experience. The lessons I’ve shared — mastering script interactions, optimizing performance, designing resumable crawlers, respecting ethical guidelines, and handling inconsistent DOMs — have made me a better developer. Whether you’re just starting out or looking to scale your extension architecture, I hope these insights help you on your journey.&lt;/p&gt;




&lt;p&gt;What challenges have you faced while building Chrome Extensions? Have you worked on web scraping projects before? Share your experiences in the comments — I’d love to hear from you! If you found this article helpful, consider clapping 👏 and following for more insights on web development, browser automation, and software engineering.&lt;/p&gt;




&lt;p&gt;I’m open to connecting, exchanging ideas, or collaborating on exciting projects. Feel free to reach out:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Email: &lt;a href="mailto:amin.mashayekhan@gmail.com"&gt;amin.mashayekhan@gmail.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Book a Quick Tech Call: &lt;a href="https://calendly.com/amin-mashayekhan/15min-tech-call" rel="noopener noreferrer"&gt;https://calendly.com/amin-mashayekhan/15min-tech-call&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s build better tools, faster!&lt;/p&gt;

</description>
      <category>extensions</category>
      <category>webscraping</category>
      <category>react</category>
      <category>softwareengineering</category>
    </item>
  </channel>
</rss>
