DEV Community

Casualwriter
A portable, lightweight web crawler using Powerpage

I just coded a portable, lightweight web crawler using Powerpage. Powerpage Web Crawler is a portable JavaScript application that runs on Powerpage. It is written in vanilla JavaScript, in about 350 lines of code, without any dependencies.

Powerpage Web Crawler is a portable program: simply download it and run powerpage.exe. It is a powerful, easy-to-use web crawler suited to crawling blog sites and reading them offline.

Simply define the following settings, for example:

  • base-url := https://dev.to/casualwriter // the home page of your favorite blog site
  • index-pattern := none // RegExp for the URL pattern of category pages
  • page-pattern := /casualwriter/[a-z] // RegExp for the URL pattern of content pages
  • content-css := #main-title h1, #article-body // CSS selector for the blog content
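
To illustrate how settings like these can drive a crawler, here is a minimal sketch in plain JavaScript. It is not the actual Powerpage Web Crawler code: the `site` object, its property names, and the `classifyUrl` helper are hypothetical, and matching the pattern against the URL path (rather than the full URL) is an assumption.

```javascript
// Hypothetical settings object mirroring the four settings above.
const site = {
  baseUrl: "https://dev.to/casualwriter",
  indexPattern: null,                      // "none": no category pages to crawl
  pagePattern: "/casualwriter/[a-z]",      // RegExp source for content pages
  contentCss: "#main-title h1, #article-body",
};

// Classify a link as a content page, an index (category) page, or neither.
// Relative links are resolved against the base URL first.
function classifyUrl(url, settings) {
  const path = new URL(url, settings.baseUrl).pathname;
  if (settings.pagePattern && new RegExp(settings.pagePattern).test(path)) {
    return "page";
  }
  if (settings.indexPattern && new RegExp(settings.indexPattern).test(path)) {
    return "index";
  }
  return "other";
}
```

With these settings, `classifyUrl("/casualwriter/some-post", site)` returns `"page"`, while the blog home page itself (`https://dev.to/casualwriter`) returns `"other"`, because the pattern requires a letter after `/casualwriter/`.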

The program will:

  • crawl all category pages.
  • find the URLs of all content pages.
  • crawl the content of one page, or of all pages.
  • save settings and links to a database (multiple sites are supported).
  • save content pages to local files.
  • allow offline reading from the local files.
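
The "find the URLs of all content pages" step could be sketched roughly as below. This is an illustration under assumptions, not the program's real implementation: `extractLinks` and `findContentPages` are hypothetical helper names, links are pulled out with a simple regular expression rather than a DOM parser, and the database and file-saving steps are left out because they depend on Powerpage's own API.

```javascript
// Pull every href value out of an HTML string and resolve it against baseUrl.
function extractLinks(html, baseUrl) {
  const links = [];
  const hrefRe = /href\s*=\s*["']([^"']+)["']/gi;
  let m;
  while ((m = hrefRe.exec(html)) !== null) {
    try {
      const u = new URL(m[1], baseUrl);
      u.hash = "";                 // ignore in-page anchors
      links.push(u.href);
    } catch (e) {
      // skip hrefs that cannot be parsed as URLs
    }
  }
  return links;
}

// Keep only links whose path matches the content-page pattern, without duplicates.
function findContentPages(html, baseUrl, pagePattern) {
  const re = new RegExp(pagePattern);
  const seen = new Set();
  for (const link of extractLinks(html, baseUrl)) {
    if (re.test(new URL(link).pathname)) seen.add(link);
  }
  return [...seen];
}
```

A crawler built this way would call `findContentPages` on each category page (or on the home page when there are none) and queue the returned URLs for content extraction.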

About Powerpage

Powerpage Web Crawler runs on Powerpage, a lightweight web browser with DB capability and Windows accessibility, built for quick development of JavaScript/HTML/CSS applications.

For the source code of Powerpage, please visit https://github.com/casualwriter/powerpage/tree/main/source/src

By the way, sorry for the beginner coding style and the rough screen layout (kept that way to avoid dependencies).

Enjoy,
