What is web scraping?
Web scraping is a way to extract specific data from the large amount of data on a website and export it in formats such as JSON, CSV, or Excel sheets, depending on the application or framework used. The purpose is usually to analyze that data and draw conclusions and comparisons from it.
How does web scraping work?
- First, the web scraper takes one or more website URLs
- Then the scraper loads the HTML page; a more advanced scraper will render the entire page, including CSS and JavaScript
- Next, the scraper extracts either all of the page data or only specific elements, based on what we need
- Finally, it exports the data as CSV, Excel, JSON, or any other supported format
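The steps above can be sketched with nothing but the Python standard library. This is a minimal illustration, not how any particular framework works: the `<h2 class="title">` markup, the `SAMPLE_HTML` string, and the helper names (`fetch`, `extract_titles`, `to_json`, `to_csv`) are all assumptions made up for the example. It does not render JavaScript; that would need a browser-based tool.

```python
import csv
import io
import json
from html.parser import HTMLParser
from urllib.request import urlopen


class TitleParser(HTMLParser):
    """Collects the text of every <h2 class="title"> element (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "h2" and ("class", "title") in attrs:
            self._in_title = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.titles[-1] += data


def fetch(url):
    """Steps 1 and 2: take a URL and load the raw HTML (no JavaScript rendering)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_titles(html):
    """Step 3: pull out only the elements we care about."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return [title.strip() for title in parser.titles]


def to_json(titles):
    """Step 4a: export as JSON."""
    return json.dumps([{"title": title} for title in titles])


def to_csv(titles):
    """Step 4b: export as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["title"])
    for title in titles:
        writer.writerow([title])
    return buffer.getvalue()


# Stand-in for a downloaded page, so the sketch runs without network access.
SAMPLE_HTML = """
<html><body>
  <h2 class="title">Blue Mug</h2>
  <p>Holds coffee.</p>
  <h2 class="title">Red Mug</h2>
  <h2 class="other">Ignored heading</h2>
</body></html>
"""

if __name__ == "__main__":
    titles = extract_titles(SAMPLE_HTML)
    print(to_json(titles))
    print(to_csv(titles))
```

With a real site you would call `fetch(url)` and pass its result to `extract_titles`. The parser class would need to be adapted to whatever markup that site actually uses.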
What are the uses of web scraping?
- Scraping data from websites to generate leads
- Scraping product data from sites like Amazon for competitor analysis
- Scraping product details for comparison shopping
- Scraping financial data for market insights and research
- Scraping job websites to find the most appropriate listings for clients
- There are many other uses, depending on the needs of the person doing the scraping
What do I need as a programmer to learn it?
- Basic knowledge of a programming language like Python or JavaScript
- Basic knowledge of a scraping framework or tool; some examples for Python are Scrapy, PySpider, and Selenium
- Basic HTML knowledge, so you can recognize the types of elements on the target website that you want to scrape
- Basic knowledge of CSS selectors or XPath (the XML path language), which the framework's tools use to select HTML elements from the website
- (Optional) Basic knowledge of regular expressions, to search for patterns in the website's HTML
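To make the selector and regular-expression items above more concrete, here is a small sketch that selects elements by path and then cleans a value with a regex. It uses the standard library's `xml.etree.ElementTree`, which supports only a limited subset of XPath and needs well-formed markup. Real pages are rarely that tidy, which is why frameworks such as Scrapy ship their own CSS and XPath selectors. The `SAMPLE` markup, the class names, and `select_products` are assumptions invented for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for part of a product page.
SAMPLE = """
<div id="catalog">
  <div class="product"><span class="name">Blue Mug</span><span class="price">$12.50</span></div>
  <div class="product"><span class="name">Red Mug</span><span class="price">$9</span></div>
  <div class="banner">Free shipping this week</div>
</div>
"""

# Matches an integer or decimal number, e.g. "12.50" inside "$12.50".
PRICE_PATTERN = re.compile(r"\d+(?:\.\d+)?")


def select_products(markup):
    """Select product nodes by path, then use a regex to extract the numeric price."""
    root = ET.fromstring(markup.strip())
    products = []
    # ".//div[@class='product']" means: any descendant <div> whose class is exactly "product".
    for node in root.findall(".//div[@class='product']"):
        name = node.find("span[@class='name']").text
        price_text = node.find("span[@class='price']").text
        match = PRICE_PATTERN.search(price_text)
        price = float(match.group()) if match else None
        products.append({"name": name, "price": price})
    return products


if __name__ == "__main__":
    for product in select_products(SAMPLE):
        print(product)
```

The path expression finds the elements, and the regex only tidies up the text inside them. Using a regex to locate the elements themselves tends to break as soon as the markup changes slightly.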
Conclusion:
In the end, web scraping is an important topic that is easy to learn. With some basic knowledge, you can begin working in this niche.
Top comments (6)
So ultimately you're using CSS classes and IDs to pull the data from an HTML element and save it?
What happens with React-generated elements that don't have consistent CSS classes?
You can find any element via some selector. The only difference is the robustness of the solution. There is, however, no fully robust solution, as everything (the DOM hierarchy, the CSS classes, and the IDs used) may be changed by the site owner.
Just open your dev tools, click in the elements tab on some DOM node and select "copy selector".
I have been seeing "XPath" everywhere. Can it not be used as an absolute path? I am interested in web scraping.
P.S - Great article
You can use CSS selectors for the target HTML elements, but I advise you to use XPath for web scraping; it has a lot of advantages.
👏
Thank you for your article.
As a newbie, I am using e-scraper.com to scrape the eCommerce data I need.