🚀 Introducing Page Replica: Web Scraping and Caching Tool

What is Page Replica?

Page Replica is a web scraping and caching tool built with Node.js, Express, and Puppeteer. It prerenders the pages of JavaScript web apps (React, Angular, Vue, etc.) into cached HTML, which Nginx can then serve to search engine bots for SEO or to any other client that needs fully rendered markup.

Key Features:

Scrape Individual Pages or Entire Sitemaps: Easily scrape and cache individual web pages or entire sitemaps through an API.
Remove JavaScript: Optionally strip JavaScript from the scraped pages so that search engine bots receive plain, already rendered HTML.
Nginx Configuration: Serve cached pages optimally using our sample Nginx configuration, managing both user and search engine bot traffic.

Why Use Page Replica?

SEO Optimization: Improve your website's SEO by serving prerendered pages to search engine bots.
Caching for Speed: Cache pages to improve load times for your users and reduce server load.
Ease of Use: With our new web service, you can start scraping and caching pages without any installation.

Getting Started

Installation (for Self-Hosted Users)

If you prefer to run Page Replica locally, follow these steps:

Clone the Repository:

   git clone https://github.com/html5-ninja/page-replica.git
   cd page-replica

Install Dependencies:

   npm install

Configure Settings: Update index.js with your desired configuration:

   const CONFIG = {
     baseUrl: "https://example.com",
     removeJS: true,
     addBaseURL: true,
     cacheFolder: "path_to_cache_folder",
   }

Start the API:

   npm start
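
To make the removeJS option above more concrete, here is a minimal sketch of what stripping scripts from rendered HTML can look like. It is not taken from Page Replica's source: stripScripts is a hypothetical helper, and the regex approach is an assumption chosen for brevity.

```javascript
// Hypothetical helper, not part of Page Replica's code base.
// Removes every <script>...</script> element from an HTML string.
// A regex is enough for a sketch; removing the nodes from the DOM
// inside the headless browser would be more robust on unusual markup.
function stripScripts(html) {
  const scriptPattern = /<script\b[^>]*>[\s\S]*?<\/script\s*>/gi;
  return html.replace(scriptPattern, "");
}

const rendered =
  '<html><head><script src="app.js"></script></head>' +
  "<body><p>Hello</p><script>init()</script></body></html>";

console.log(stripScripts(rendered));
```

Only script elements are removed here. Inline event handlers such as onclick attributes are left in place.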

Usage

Scraping Individual Pages

To scrape a single page, make a GET request to /page with the url query parameter:

   curl "http://localhost:8080/page?url=https://example.com"
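
If you call the API from code rather than from curl, the target URL should be percent-encoded. Otherwise a target that carries its own query string (with & or ?) would be split into separate parameters. The sketch below assumes Node.js 18+ (for the global fetch) and the same local address as the curl example. buildScrapeUrl and requestScrape are hypothetical helper names. The response format is not documented here, so the body is simply returned as text.

```javascript
// Assumption: the API listens on the address used in the curl example.
const API_BASE = "http://localhost:8080";

// Build the request URL; searchParams.set percent-encodes the target URL.
function buildScrapeUrl(endpoint, targetUrl, apiBase = API_BASE) {
  const requestUrl = new URL(endpoint, apiBase);
  requestUrl.searchParams.set("url", targetUrl);
  return requestUrl.toString();
}

// Send the GET request and return the raw response body.
async function requestScrape(endpoint, targetUrl) {
  const response = await fetch(buildScrapeUrl(endpoint, targetUrl));
  if (!response.ok) {
    throw new Error(`Scrape request failed with status ${response.status}`);
  }
  return response.text();
}

console.log(buildScrapeUrl("/page", "https://example.com/search?q=a&b=c"));
```

With the API running, requestScrape("/page", "https://example.com") sends the same request as the curl command.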

Scraping Sitemaps

To scrape pages from a sitemap, make a GET request to /sitemap with the url query parameter:

   curl "http://localhost:8080/sitemap?url=https://example.com/sitemap.xml"
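
To preview which pages a sitemap request would cover, you can list the page URLs in the sitemap yourself. The sketch below is an illustration, not Page Replica's own parsing logic: extractSitemapUrls is a hypothetical helper that reads the loc entries defined by the sitemap protocol with a regex. It does not follow nested sitemap index files or handle CDATA sections.

```javascript
// Hypothetical helper: collect the contents of every <loc> element.
function extractSitemapUrls(xml) {
  const locPattern = /<loc>\s*([^<]+?)\s*<\/loc>/gi;
  const urls = [];
  for (const match of xml.matchAll(locPattern)) {
    // Sitemaps are XML, so "&" in a URL is written as "&amp;".
    urls.push(match[1].replace(/&amp;/g, "&"));
  }
  return urls;
}

const sampleSitemap = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc> https://example.com/list?page=2&amp;sort=asc </loc></url>
</urlset>`;

console.log(extractSitemapUrls(sampleSitemap));
```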

Serve Cached Pages with Nginx

Our sample Nginx configuration in nginx_config_sample/example.com.conf helps you efficiently manage traffic:

Users: Regular users are routed to the main application server.
Bots: Search engine bots are redirected to a dedicated server block for cached HTML delivery.
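
The user/bot split described above comes down to a decision on the request's User-Agent header. The sketch below shows that decision in JavaScript for illustration only. The real routing lives in the sample Nginx configuration, which is not reproduced here. The bot list and the chooseUpstream name are assumptions, not values taken from that file.

```javascript
// Assumed list of crawler tokens; adjust it to the bots you care about.
const BOT_PATTERN = /googlebot|bingbot|duckduckbot|baiduspider|yandex/i;

// Return "cache" for search engine bots and "app" for everyone else.
function chooseUpstream(userAgent) {
  return BOT_PATTERN.test(userAgent || "") ? "cache" : "app";
}

console.log(
  chooseUpstream("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)")
);
console.log(
  chooseUpstream("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/120.0 Safari/537.36")
);
```

Requests without a User-Agent header are treated as regular users in this sketch.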

Need Assistance?

If you have any questions or need support, we're here to help! Join our GitHub Discussion to get in touch with us.

Folder Structure

nginx_config_sample: Sample Nginx configuration for redirecting bot traffic to the cached content server.
api.js: Express application handling web scraping requests.
index.js: Core web scraping logic using Puppeteer.
package.json: Node.js project configuration.

Thank you for choosing Page Replica. We look forward to providing you with the best possible service. Happy scraping! 🕷️