What is Page Replica?
"Page Replica" is a versatile web scraping and caching tool built with Node.js, Express, and Puppeteer. It helps prerender web app pages (React, Angular, Vue, etc.), which can be served via Nginx for SEO or other purposes.
Key Features:
- Scrape Individual Pages or Entire Sitemaps: Easily scrape and cache individual web pages or entire sitemaps through an API.
- Remove JavaScript: Optionally remove JavaScript from the scraped pages for better SEO performance.
- Nginx Configuration: Serve cached pages optimally using our sample Nginx configuration, managing both user and search engine bot traffic.
Why Use Page Replica?
- SEO Optimization: Improve your website's SEO by serving prerendered pages to search engine bots.
- Caching for Speed: Cache pages to improve load times for your users and reduce server load.
- Ease of Use: With our new web service, you can start scraping and caching pages without any installation.
Getting Started
Installation (for Self-Hosted Users)
If you prefer to run Page Replica locally, follow these steps:
- Clone the Repository:
git clone https://github.com/html5-ninja/page-replica.git
cd page-replica
- Install Dependencies:
npm install
- Configure Settings: Update index.js with your desired configuration:
const CONFIG = {
  baseUrl: "https://example.com",
  removeJS: true,
  addBaseURL: true,
  cacheFolder: "path_to_cache_folder",
};
- Start the API:
npm start
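To make the removeJS and addBaseURL options from the configuration step more concrete, here is a minimal sketch of the kind of post-processing they suggest. It is not taken from the Page Replica source: the helper names stripScripts, addBaseTag, and postProcess are hypothetical, and the assumption that addBaseURL inserts a `<base>` tag pointing at baseUrl is a guess based on the option name. The regular expressions are a simplification; a real implementation would more likely edit the DOM inside Puppeteer.

```javascript
// Hypothetical helpers illustrating what the removeJS and addBaseURL
// options might do. Not taken from the Page Replica source.

// Remove <script>...</script> blocks from an HTML string.
function stripScripts(html) {
  return html.replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, "");
}

// Insert <base href="..."> right after the opening <head> tag so that
// relative links and assets resolve against the original site (assumed behavior).
function addBaseTag(html, baseUrl) {
  return html.replace(/<head(\s[^>]*)?>/i, (match) => `${match}<base href="${baseUrl}">`);
}

// Apply the steps that are enabled in the config object.
function postProcess(html, config) {
  let out = html;
  if (config.removeJS) out = stripScripts(out);
  if (config.addBaseURL) out = addBaseTag(out, config.baseUrl);
  return out;
}

const sample =
  '<html><head><title>t</title></head><body><p>Hi</p><script src="/app.js"></script></body></html>';
const result = postProcess(sample, {
  baseUrl: "https://example.com",
  removeJS: true,
  addBaseURL: true,
});
console.log(result);
```

With both options enabled, the script tag is dropped and a `<base>` tag appears at the top of the head; with both set to false the HTML passes through unchanged.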
Usage
Scraping Individual Pages
To scrape a single page, make a GET request to /page with the url query parameter:
curl "http://localhost:8080/page?url=https://example.com"
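If the target address has its own query string, it should be percent-encoded before it is passed as the url parameter; otherwise a `&` inside it would be read as a separator in the API request. The sketch below shows one way to build the request from Node.js. The base address comes from the curl example above, the function names buildRequestUrl and scrapePage are hypothetical, and since the response format is not described here, only the HTTP status is returned.

```javascript
// Base address taken from the curl example above; adjust to your setup.
const API_BASE = "http://localhost:8080";

// Build the request URL for an endpoint ("/page" or "/sitemap"),
// percent-encoding the target so its own query string survives intact.
function buildRequestUrl(endpoint, targetUrl) {
  const requestUrl = new URL(endpoint, API_BASE);
  requestUrl.searchParams.set("url", targetUrl);
  return requestUrl.toString();
}

// Ask the API to scrape one page and return the HTTP status code.
// Requires Node.js 18+ for the global fetch function.
async function scrapePage(targetUrl) {
  const response = await fetch(buildRequestUrl("/page", targetUrl));
  return response.status;
}

const encoded = buildRequestUrl("/page", "https://example.com/products?id=42&lang=en");
console.log(encoded);
// http://localhost:8080/page?url=https%3A%2F%2Fexample.com%2Fproducts%3Fid%3D42%26lang%3Den
```

With the API running locally, `scrapePage("https://example.com").then(console.log)` would print the status code of the scrape request.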
Scraping Sitemaps
To scrape pages from a sitemap, make a GET request to /sitemap with the url query parameter:
curl "http://localhost:8080/sitemap?url=https://example.com/sitemap.xml"
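Before handing a sitemap to the API, it can help to preview which pages it lists. A standard sitemap puts each page address in a `<loc>` element, so a rough preview can be produced as below. This is an illustrative sketch, not Page Replica's own parser: extractSitemapUrls is a hypothetical helper, and it only handles plain `<loc>` entries and the `&amp;` entity, not CDATA sections or nested sitemap indexes.

```javascript
// Hypothetical helper: list the page URLs found in a sitemap XML string.
function extractSitemapUrls(xml) {
  const urls = [];
  const locPattern = /<loc>\s*([^<\s]+)\s*<\/loc>/gi;
  for (const match of xml.matchAll(locPattern)) {
    // Sitemaps escape "&" as "&amp;"; undo that for the plain URL.
    urls.push(match[1].replace(/&amp;/g, "&"));
  }
  return urls;
}

const sampleSitemap = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>`;

const pages = extractSitemapUrls(sampleSitemap);
console.log(pages);
```

Each address in the resulting array corresponds to one page that a sitemap scrape would be expected to visit.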
Serve Cached Pages with Nginx
Our sample Nginx configuration in nginx_config_sample/example.com.conf helps you efficiently manage traffic:
- Users: Regular users are routed to the main application server.
- Bots: Search engine bots are redirected to a dedicated server block for cached HTML delivery.
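The user/bot split above comes down to a decision based on the request's User-Agent header. The sketch below expresses that decision in JavaScript purely for illustration; the actual routing happens in the Nginx configuration, and the crawler list, the function names, and the labels "cached-html" and "main-app" are assumptions rather than values from the sample config.

```javascript
// Illustrative list of crawler user-agent fragments; the sample Nginx
// config may match a different set.
const BOT_PATTERN = /googlebot|bingbot|duckduckbot|baiduspider|yandex/i;

// Return true when the User-Agent string looks like a search engine crawler.
function isSearchBot(userAgent) {
  return BOT_PATTERN.test(userAgent || "");
}

// Pick the destination: cached HTML for bots, the main application for users.
function chooseUpstream(userAgent) {
  return isSearchBot(userAgent) ? "cached-html" : "main-app";
}

const botChoice = chooseUpstream(
  "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
);
const userChoice = chooseUpstream(
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0"
);
console.log(botChoice, userChoice);
```

Matching on a short list of known crawler names keeps regular visitors on the live application while search engines receive the prerendered HTML.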
Need Assistance?
If you have any questions or need support, we're here to help! Join our GitHub Discussion to get in touch with us.
Folder Structure
- nginx_config_sample: Sample Nginx configuration for redirecting bot traffic to the cached content server.
- api.js: Express application handling web scraping requests.
- index.js: Core web scraping logic using Puppeteer.
- package.json: Node.js project configuration.
Thank you for choosing Page Replica. We look forward to providing you with the best possible service. Happy scraping! 🕷️