M.Amin Mashayekhan

Originally published at Medium

6 Lessons Learned from Building a Production-Grade Chrome Extension with Web Scraping

Introduction: Why Chrome Extensions Matter

If you’ve ever tried to build a browser extension that interacts with the real world — not just toy examples — you’ll know how complex things can get fast.

Recently, I built a production-grade Chrome Extension that performs deep web scraping with advanced resume mechanisms, survives browser shutdowns, and optimizes performance using smart techniques like link prefetching, batching, and caching.

In this article, I’ll share the key takeaways from building a production-grade Chrome Extension for web scraping — from crawling logic and performance hacks to resilience patterns that survive browser crashes.

Whether you’re a beginner looking to build your first extension or a seasoned developer aiming to scale your architecture, these insights will help you navigate the complexities of browser automation and web scraping. Let’s dive in!


The Project: A Chrome Extension for Web Scraping

The goal of my project was to create a Chrome Extension that could crawl websites, scrape data, and store it for later analysis. The extension needed to:

  • Scrape data from web pages (e.g., links, text, tables).
  • Support resumable crawling (e.g., handle pagination and session state).
  • Optimize performance with techniques like link prefetching and state caching.
  • Survive browser restarts without losing progress.
  • Respect ethical web scraping practices, such as adhering to robots.txt.

I built the extension using React and TypeScript for the frontend, with a background script (background.js) handling the crawling logic. The extension also communicated with a Laravel backend to manage sites and crawl sessions. Here’s a high-level overview of the architecture (a minimal manifest sketch follows the list):

  • Popup UI (popup.tsx): A React-based interface for users to start, stop, and monitor crawls.
  • Background Script (background.js): Handles crawling logic, tab management, and data extraction.
  • Content Script (content.js): Injects into web pages to scrape data.
  • Backend (Laravel): Manages site configurations and crawl logs via APIs.
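
For context, here is a minimal Manifest V3 sketch of how these pieces get wired together; the match pattern and permission list are illustrative rather than an exact copy of my manifest:

{
  "manifest_version": 3,
  "name": "Web Scraper Extension",
  "version": "1.0.0",
  "action": { "default_popup": "popup.html" },
  "background": { "service_worker": "background.js" },
  "content_scripts": [
    { "matches": ["<all_urls>"], "js": ["content.js"] }
  ],
  "permissions": ["storage", "scripting", "tabs"],
  "host_permissions": ["<all_urls>"]
}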

Now, let’s break down the key lessons I learned during this project, complete with code examples and practical advice.


Lesson 1: Master the Interaction Between Scripts for Maintainability

Chrome Extensions operate in a unique environment with three main types of scripts: content scripts, background scripts, and popup scripts. Understanding how these scripts interact is crucial for building a maintainable extension.

  • Content Scripts run in the context of the web page and can access the DOM directly. They’re ideal for scraping data.
  • Background Scripts run persistently in the background and handle long-running tasks like crawling or tab management.
  • Popup Scripts (or UI scripts) manage the user interface, typically built with frameworks like React.

In my project, the content script (content.js) scraped data from pages, the background script (background.js) orchestrated the crawl, and the popup UI (popup.tsx) displayed the status. Communication between these scripts is handled via Chrome’s message-passing API (chrome.runtime.sendMessage and chrome.runtime.onMessage).

Here’s an example of how I set up message passing between the popup and background script:

// popup.tsx (Sending a message to start a crawl)
const handleStartCrawl = () => {
  chrome.runtime.sendMessage({
    action: 'START_CRAWL',
    siteId: selectedSite,
    startUrl: startUrl,
  });
};

// background.js (Listening for messages)
chrome.runtime.onMessage.addListener((request, sender, sendResponse) => {
  if (request.action === 'START_CRAWL') {
    startCrawling(request.siteId, request.startUrl);
  }
});

async function startCrawling(siteId, startUrl) {
  const tab = await chrome.tabs.create({ url: startUrl });
  // Crawling logic...
}
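
The background script can also address a specific tab’s content script with chrome.tabs.sendMessage. Here’s a small sketch of that direction of communication; the SCRAPE_PAGE action name is illustrative rather than the exact protocol I used:

// background.js (asking a tab's content script to scrape, then awaiting the result)
async function requestScrape(tabId) {
  const response = await chrome.tabs.sendMessage(tabId, { action: 'SCRAPE_PAGE' });
  return response; // e.g., { links, data }
}

// content.js (answering with data scraped from the DOM)
chrome.runtime.onMessage.addListener((request, sender, sendResponse) => {
  if (request.action === 'SCRAPE_PAGE') {
    const links = Array.from(document.querySelectorAll('a[href]')).map((a) => a.href);
    sendResponse({ links, data: document.body.innerText });
  }
});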

Takeaway: Keep the responsibilities of each script clear. Use message passing to decouple components, making your codebase modular and easier to debug. For example, if the popup UI crashes, the background script can continue crawling uninterrupted.


Lesson 2: Use Message Passing Wisely for Scalability

Message passing is the backbone of Chrome Extension communication, but it can become a bottleneck if not used carefully. In my project, I initially overused message passing, sending frequent updates between the popup and background script. This led to performance issues, especially during long crawls with hundreds of pages.

To address this, I adopted a more modular approach:

  • Batch Updates: Instead of sending a message for every scraped page, I batched updates and sent them periodically.
  • State Management: I stored crawl state (e.g., pages crawled, logs) in chrome.storage.local to reduce message frequency.

Here’s how I implemented batched updates in background.js:

let crawlLogs = [];
let pagesCrawled = 0;

function updateCrawlStatus() {
  chrome.runtime.sendMessage({
    action: 'CRAWL_UPDATE',
    data: { pagesCrawled, logs: crawlLogs.slice(-10) }, // Send last 10 logs
  });
}

chrome.runtime.onMessage.addListener((request, sender, sendResponse) => {
  if (request.action === 'START_CRAWL') {
    // Crawling logic...
    setInterval(updateCrawlStatus, 5000); // Update every 5 seconds
  }
});
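
On the popup side, the cached state can be read straight from chrome.storage.local and kept fresh with chrome.storage.onChanged instead of polling the background script. A sketch (setPagesCrawled and setLogs are assumed React state setters, and crawlState matches the shape saved in Lesson 3):

// popup.tsx (reading cached crawl state instead of requesting it over messages)
const loadCachedState = () => {
  chrome.storage.local.get(['crawlState'], (result) => {
    if (result.crawlState) {
      setPagesCrawled(result.crawlState.pagesCrawled); // assumed React state setters
      setLogs(result.crawlState.logs);
    }
  });
};

// React to updates written by the background script
chrome.storage.onChanged.addListener((changes, areaName) => {
  if (areaName === 'local' && changes.crawlState) {
    setPagesCrawled(changes.crawlState.newValue.pagesCrawled);
    setLogs(changes.crawlState.newValue.logs);
  }
});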

Takeaway: Use message passing sparingly and batch updates where possible. For persistent state, leverage chrome.storage to ensure scalability, especially for long-running tasks like web scraping.


Lesson 3: Performance Optimization Is Non-Negotiable

Performance is critical in Chrome Extensions, especially for web scraping tasks that involve frequent network requests and DOM manipulation. I implemented several optimizations to make my crawler efficient:

1- Link Prefetching: I used <link rel="prefetch"> to preload the next page’s resources while scraping the current page, reducing load times.

// content.js
function prefetchNextLink(nextUrl) {
  const link = document.createElement('link');
  link.rel = 'prefetch';
  link.href = nextUrl;
  document.head.appendChild(link);
}

2- Batching Network Calls: Instead of making an API call for every scraped page, I batched data and sent it to the backend in chunks.

// background.js
let scrapedData = [];
async function saveDataToBackend() {
  if (scrapedData.length >= 10) { // Batch size of 10
    await fetch('https://api.example.com/save', {
      method: 'POST',
      body: JSON.stringify(scrapedData),
      headers: { 'Content-Type': 'application/json' },
    });
    scrapedData = []; // Clear after sending
  }
}
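
Since saveDataToBackend only sends full batches of 10, anything left over when the crawl finishes still needs a final flush. A small sketch (flushRemainingData is a hypothetical helper using the same placeholder endpoint):

// background.js (hypothetical final flush when the crawl completes)
async function flushRemainingData() {
  if (scrapedData.length > 0) {
    await fetch('https://api.example.com/save', {
      method: 'POST',
      body: JSON.stringify(scrapedData),
      headers: { 'Content-Type': 'application/json' },
    });
    scrapedData = []; // Clear after the final send
  }
}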

3- Caching State: I cached the crawl state in chrome.storage.local to avoid recomputing it on browser restarts.

// background.js
function saveState() {
  chrome.storage.local.set({
    crawlState: { pagesCrawled, logs: crawlLogs, lastUrl: currentUrl },
  });
}

function loadState() {
  chrome.storage.local.get(['crawlState'], (result) => {
    if (result.crawlState) {
      pagesCrawled = result.crawlState.pagesCrawled;
      crawlLogs = result.crawlState.logs;
      currentUrl = result.crawlState.lastUrl;
    }
  });
}
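
loadState still needs something to call it. One option is to run it whenever the background service worker starts up or the extension is installed; a sketch using standard chrome.runtime events:

// background.js (restore cached state when the service worker starts)
chrome.runtime.onStartup.addListener(() => loadState());
chrome.runtime.onInstalled.addListener(() => loadState());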

Takeaway: Optimize performance at every layer — network, DOM, and state management. Techniques like prefetching, batching, and caching can significantly improve the user experience, especially for resource-intensive tasks like web scraping.


Lesson 4: Design Resumable Crawlers for Robustness

A key feature of my extension was its ability to resume crawling after interruptions, such as browser restarts. This required careful handling of pagination and session state.

  • Pagination Handling: I stored the current page number and total pages in chrome.storage.local. If the crawl stopped, it could resume from the last page.
  • Session State: I saved the crawl session (e.g., site ID, start URL, scraped data) to the backend, ensuring persistence across browser sessions.

Here’s how I implemented the resume mechanism:

// background.js
let currentPage = 1;
let totalPages = 10; // Example

async function crawlPage(tabId) {
  const results = await chrome.scripting.executeScript({
    target: { tabId },
    func: scrapePage,
  });
  const { links, data } = results[0].result;
  const tab = await chrome.tabs.get(tabId); // Look up the tab to record its URL
  crawlLogs.push({ url: tab.url, data });
  currentPage++;

  // Save state
  chrome.storage.local.set({ currentPage, crawlLogs });

  if (currentPage <= totalPages) {
    const nextUrl = links[0]; // Simplified
    chrome.tabs.update(tabId, { url: nextUrl });
  }
}

function resumeCrawl() {
  chrome.storage.local.get(['currentPage', 'crawlLogs'], (result) => {
    currentPage = result.currentPage || 1;
    crawlLogs = result.crawlLogs || [];
    // Restart crawl from last known state
  });
}
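
To fill in that last comment, one way to restart from the saved state is to reopen a tab at the last known URL and hand it back to crawlPage. This sketch assumes the last URL is persisted alongside the page counter (as in the saveState example from Lesson 3):

// background.js (hypothetical resume that reopens the last page and continues crawling)
function resumeCrawlFromStorage() {
  chrome.storage.local.get(['currentPage', 'crawlLogs', 'lastUrl'], async (result) => {
    currentPage = result.currentPage || 1;
    crawlLogs = result.crawlLogs || [];
    if (result.lastUrl && currentPage <= totalPages) {
      const tab = await chrome.tabs.create({ url: result.lastUrl });
      // In practice, wait for the tab to finish loading before scraping
      crawlPage(tab.id);
    }
  });
}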

Takeaway: Design your crawler to handle interruptions gracefully. Store state persistently (locally or on a backend) and implement logic to resume from the last known point, ensuring a seamless user experience.


Lesson 5: Respect robots.txt for Ethical Scraping

Web scraping comes with ethical and legal responsibilities. One of the biggest mistakes I made early on was ignoring the target website’s robots.txt file. This file specifies which parts of a site can be crawled, and ignoring it can lead to broken trust, guideline violations, or even IP bans.
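
A full robots.txt parser is beyond the scope of this article, but even a rough check is far better than none. Here’s a minimal sketch that fetches the file and tests the current path against the Disallow rules for the wildcard user agent; a production crawler should rely on a proper parser:

// background.js (rough robots.txt check, not a complete parser)
async function isAllowedByRobots(url) {
  const { origin, pathname } = new URL(url);
  try {
    const response = await fetch(`${origin}/robots.txt`);
    if (!response.ok) return true; // No robots.txt found: assume allowed
    const text = await response.text();
    let appliesToAll = false;
    const disallowed = [];
    for (const line of text.split('\n')) {
      const [field, ...rest] = line.split(':');
      const value = rest.join(':').trim();
      if (field.trim().toLowerCase() === 'user-agent') appliesToAll = value === '*';
      if (appliesToAll && field.trim().toLowerCase() === 'disallow' && value) {
        disallowed.push(value);
      }
    }
    return !disallowed.some((rule) => pathname.startsWith(rule));
  } catch {
    return true; // Fetch failed: fail open here, though failing closed is also defensible
  }
}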

Takeaway: Always respect robots.txt and other ethical guidelines. Not only does this prevent legal issues, but it also builds trust with website owners and users.

Lesson 6: The DOM Is Never as Consistent as You’d Hope

One of the biggest challenges in web scraping is dealing with inconsistent DOM structures. I assumed that most websites would have predictable HTML, but I quickly learned otherwise. Some sites used dynamic rendering (e.g., React apps with lazy-loaded content), while others had broken or nested HTML.

To handle this, I made my scraping logic more robust:

// content.js
function scrapePage() {
  const links = Array.from(document.querySelectorAll('a[href]'))
    .map((a) => a.href)
    .filter((href) => href.startsWith('http')); // Filter invalid links
  const data = document.querySelector('.content')?.innerText || document.body.innerText; // Fallback
  return { links, data };
}
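
For the dynamic-rendering case (React apps with lazy-loaded content), the content script can wait for the target element to appear before scraping. Here’s a sketch using a MutationObserver with a timeout; the '.content' selector and 5-second limit are assumptions:

// content.js (wait for a selector to appear before scraping dynamic pages)
function waitForElement(selector, timeoutMs = 5000) {
  return new Promise((resolve) => {
    const existing = document.querySelector(selector);
    if (existing) return resolve(existing);

    const observer = new MutationObserver(() => {
      const el = document.querySelector(selector);
      if (el) {
        observer.disconnect();
        resolve(el);
      }
    });
    observer.observe(document.body, { childList: true, subtree: true });

    // Give up after the timeout and fall back to whatever is already on the page
    setTimeout(() => {
      observer.disconnect();
      resolve(null);
    }, timeoutMs);
  });
}

// Usage: wait for the main content, then run the same scraping logic
waitForElement('.content').then(() => scrapePage());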

Takeaway: Build resilient scraping logic with fallbacks and error handling. Test your crawler on a variety of websites to ensure it can handle inconsistent DOM structures.

Conclusion

Building a Chrome Extension with web scraping capabilities was a challenging but rewarding experience. The lessons I’ve shared — mastering script interactions, optimizing performance, designing resumable crawlers, respecting ethical guidelines, and handling inconsistent DOMs — have made me a better developer. Whether you’re just starting out or looking to scale your extension architecture, I hope these insights help you on your journey.


What challenges have you faced while building Chrome Extensions? Have you worked on web scraping projects before? Share your experiences in the comments — I’d love to hear from you! If you found this article helpful, consider clapping 👏 and following for more insights on web development, browser automation, and software engineering.


I’m open to connecting, exchanging ideas, or collaborating on exciting projects. Feel free to reach out:
