DEV Community

Erikafu

How to build a web crawler as a beginner?

Using a Programming Language (Example: Python)

For non-coders who want to build a web scraper with a programming language, Python is probably the easiest one to start with compared to PHP, Java, or C/C++. Python's syntax is simple and readable for anyone who reads English.

Here is a simple example of a web crawler written in Python.


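from queue import Queue

# store() and extract_urls() are placeholders that you need to supply:
# store(url) saves a page, extract_urls(url) returns the links found on it.

initial_page = "http://www.renminribao.com"

url_queue = Queue()   # URLs waiting to be visited
seen = set()          # URLs already discovered

seen.add(initial_page)
url_queue.put(initial_page)

# Keep going until there are no more URLs to visit.
while not url_queue.empty():
    current_url = url_queue.get()
    store(current_url)
    for next_url in extract_urls(current_url):
        if next_url not in seen:
            seen.add(next_url)
            url_queue.put(next_url)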
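The loop above leaves store and extract_urls undefined. As a rough sketch of how they could be filled in using only the Python 3 standard library, the version below parses links with html.parser and keeps stored pages in a dictionary. The names LinkParser, fetch, crawl, fetch_page and max_pages are assumptions made for this illustration, not part of the original example, and the sketch ignores robots.txt, rate limiting and retries.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_urls(base_url, html):
    """Return absolute http(s) URLs linked from the given HTML."""
    parser = LinkParser()
    parser.feed(html)
    urls = []
    for href in parser.links:
        # Resolve relative links and drop "#fragment" parts.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if absolute.startswith(("http://", "https://")):
            urls.append(absolute)
    return urls


def fetch(url):
    """Download a page as text (hypothetical helper, no retries)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(initial_page, fetch_page=fetch, max_pages=50):
    """Breadth-first crawl; returns {url: html} for the stored pages."""
    url_queue = deque([initial_page])
    seen = {initial_page}
    pages = {}
    while url_queue and len(pages) < max_pages:
        current_url = url_queue.popleft()
        try:
            html = fetch_page(current_url)
        except Exception:
            continue  # skip pages that fail to download
        pages[current_url] = html  # this is the "store" step
        for next_url in extract_urls(current_url, html):
            if next_url not in seen:
                seen.add(next_url)
                url_queue.append(next_url)
    return pages
```

Calling crawl("http://www.renminribao.com", max_pages=10) would try to download up to ten pages starting from that address. The max_pages limit is there because, without it, a crawler that follows every link may never stop on a large site.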

Beginners who do not know how to program would need to spend time and energy learning Python before they could write a web crawler themselves. The whole learning process might take several months.

Using a Web Scraping Tool (Example: Octoparse)

When a beginner wants to build a web crawler within a reasonable time, visual web scraping software like Octoparse is a good option to consider. It is a coding-free web scraping tool that comes with a free version. Compared with other web scraping tools, Octoparse can be a cost-efficient solution for anyone looking to quickly scrape some data off a website. See: Top 5 Web Scraping Tools Comparison.

How to “build a web crawler” in Octoparse

1. Wizard Mode for easy scraping

Wizard Mode guides users step by step through scraping data in Octoparse and provides three pre-built templates: “List or Table”, “List and Detail” and “Single Page”. Provided the pre-built templates meet our needs, we can build a “web crawler” in Octoparse within a few clicks after downloading it.

2. Advanced Mode for complex web scraping

Because some websites are built with complex structures, Wizard Mode cannot always scrape all the data we want. In those cases it is better to use Advanced Mode, which is more powerful and flexible for scraping data.

Here is an example of how to build a web crawler using Octoparse. Video: Scrape product information from Amazon (Octoparse 7.X)
