Note: Would you like a Spanish version of this article? Let me know in the comments and I'll write one!
How to monitor thousands of prices and find the real deals?
In this post I will explain how to capture prices from any page, clean that data, and merge it day by day to build a price history and discover amazing insights. Let's go...
The first step is to explore our target page. All major browsers have a magic tool for this, usually named "Inspect element".
It lets you view the bones of the page: HTML (the text of the page) and CSS (the styles of the content). I recommend reviewing some basic concepts of these languages.
However, I will show you the process in five simple steps:
- Select the cursor (element picker) option in the inspector.
- Hover over the price you want and make sure it is highlighted.
- Verify the price in the HTML code.
- Note the price class of the highlighted item (a span in this case).
- Move up the tree and find the class that contains the complete article (including the previously selected price). In my case it is .item.
Clone my scraper project from GitHub.
It contains four files:
- config.yaml: Here you can put the URLs to be scraped. The file lets you organize them by retail site, category, and query.
You can also specify the HTML class that contains each article, in this case: .item.
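To make the idea concrete, a config.yaml for this kind of scraper might look like the sketch below. The field names and URL are illustrative assumptions, not the actual contents of the project's file:

```yaml
# Illustrative structure only; check the repository for the real keys.
alkosto:
  base_url: "https://www.alkosto.com"   # hypothetical URL
  item_class: ".item"                   # CSS class wrapping each article
  categories:
    televisores: "/televisores?page="
    computadores-tablets: "/computadores-tablets?page="
```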
- common.py: A simple Python module that imports and parses the previous .yaml file.
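A module like common.py can be sketched in a few lines with PyYAML. This is a minimal sketch, assuming the config is a nested mapping as shown above; the function names here are mine, not necessarily the project's:

```python
# Minimal sketch of a config loader (assumes PyYAML is installed).
import yaml

def parse_config(text):
    """Parse YAML text into a nested Python dict."""
    return yaml.safe_load(text)

def load_config(path="config.yaml"):
    """Read and parse the config file from disk."""
    with open(path) as f:
        return parse_config(f.read())
```

Once loaded, the rest of the code can simply index into the returned dict, e.g. `cfg["alkosto"]["item_class"]`.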
- item_page_object.py: A Python class that reads the page and provides a method to extract all the articles on it.
In this file, a for loop in lines 23 to 37 iterates over each item. Lines 33 to 35 capture the title, price, and image.
Note line 34: ...find("span","price").span.string
This line captures the desired prices.
You can learn more about this statement in the BeautifulSoup documentation here.
The constructor requires: the base URL (without the page number), the category of products to be extracted, and the total number of pages.
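The extraction pattern from line 34 can be seen in isolation with a tiny example. The HTML below is an illustrative stand-in for the real markup, but the BeautifulSoup calls are the same shape as in item_page_object.py:

```python
# Sketch of the per-item extraction step; the HTML is illustrative.
from bs4 import BeautifulSoup

html = """
<div class="item">
  <h3 class="title">Smart TV 55"</h3>
  <span class="price"><span>$ 1.599.900</span></span>
  <img src="/images/tv55.jpg">
</div>
"""

soup = BeautifulSoup(html, "html.parser")
for item in soup.find_all("div", "item"):
    title = item.find("h3", "title").string
    # Same pattern as line 34: the price text lives in a nested <span>
    # inside the <span class="price"> wrapper.
    price = item.find("span", "price").span.string
    image = item.find("img")["src"]
```

`find("span", "price")` matches the outer span by tag name and class, `.span` descends to its first nested span, and `.string` pulls out the text.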
- prices-scraper.py: The main script, which takes 3 arguments:
- The retail site to be scraped (only alkosto is supported at the moment).
- The product category (twelve categories are implemented at the moment).
- The number of pages to scrape in the selected category.
python3 prices_scraper.py alkosto televisores 3
python3 prices_scraper.py alkosto computadores-tablets 6
The code uses the Homepage class inside a for loop to collect all the data of the selected category and saves it to a CSV file.
Logging messages inform the user about the progress, reporting the total number of articles found at the end.
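The loop-plus-CSV-plus-logging structure can be sketched as follows. This is not the project's actual code: `scrape` and the `fetch_page` callable stand in for the real page-object class, and the output filename format is an assumption:

```python
# Rough sketch of the main scraping loop (names are illustrative).
import csv
import logging
from datetime import date

logging.basicConfig(level=logging.INFO)

def scrape(pages, fetch_page):
    """Collect articles from each page and write them to a CSV file."""
    articles = []
    for page in range(1, pages + 1):
        items = fetch_page(page)          # stand-in for the page object
        logging.info("page %d: %d articles", page, len(items))
        articles.extend(items)
    filename = f"articles_{date.today():%Y_%m_%d}.csv"
    with open(filename, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["title", "price", "image"])
        writer.writerows(articles)
    logging.info("total articles found: %d", len(articles))
    return filename

# Usage with a fake fetcher standing in for the real scraper:
fake = lambda page: [("TV", "$ 999.900", "tv.jpg")]
out = scrape(2, fake)
```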
Finally, an example of the exported data is provided.
Prices are obtained as strings containing dots, commas, and often spaces. This stage transforms these values into numbers so they can be analyzed or stored numerically.
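The core of such a cleaning step can be a single helper that strips everything non-numeric. This is a sketch of the idea, not the project's exact implementation; the -1 fallback mirrors the placeholder used later in the load stage:

```python
# Possible cleaning helper: "$ 1.599.900" -> 1599900
import re

def clean_price(raw):
    """Drop currency symbols, spaces, dots, and commas; return an int,
    or -1 when nothing numeric remains."""
    digits = re.sub(r"[^\d]", "", raw or "")
    return int(digits) if digits else -1

print(clean_price("$ 1.599.900"))   # -> 1599900
```

Note this treats dots and commas purely as thousands separators, which matches prices quoted in Colombian pesos without decimal cents.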
The cleaner Python script is provided here.
Note: The previously generated CSV contains a column named with the extraction date; this script requires that date to clean the values.
python3 prices_cleaner.py camaras_2020_07_05_articles.csv 05_07_20
The load stage is commonly performed on a database. However, in this project a consolidated .csv is generated.
The load script requires two .csv files to merge.
If you perform the scraping task for more than two days, each new day's data is merged into a _db.csv file.
A new column is added each day with that day's prices. Empty values are filled with -1.
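The merge logic can be sketched in plain Python. This assumes each row is keyed by its title and each daily file contributes one date-named price column; the function and column names are illustrative, not the project's:

```python
# Sketch of the day-by-day merge; missing prices become -1.
def merge_day(db_rows, day_rows, day_col):
    """Add day_col from day_rows into db_rows, matching rows by title."""
    day_prices = {r["title"]: r[day_col] for r in day_rows}
    merged = []
    for row in db_rows:
        row = dict(row)
        row[day_col] = day_prices.pop(row["title"], -1)
        merged.append(row)
    # Articles that appear for the first time today get -1 for past days.
    for title, price in day_prices.items():
        new_row = {col: -1 for col in (db_rows[0] if db_rows else {})}
        new_row.update({"title": title, day_col: price})
        merged.append(new_row)
    return merged

db = [{"title": "TV", "05_07_20": 999900}]
day = [{"title": "TV", "06_07_20": 949900},
       {"title": "Tablet", "06_07_20": 499900}]
result = merge_day(db, day, "06_07_20")
```

After the merge, the TV row carries both days' prices, while the newly seen Tablet gets -1 for the day it was missing.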
The project is available here
At this stage you might say: that's a lot of stages and a lot of work.
However, now we will automate the entire process of extracting, cleaning, and loading with a single command.
This process is known as a pipeline.
The pipeline chains the previous stages together.
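Conceptually, the pipeline is just the three stages called in sequence, each feeding its output file to the next. Here is a minimal sketch with stand-in stage functions (the names and filenames are illustrative, not the repository's actual API):

```python
# Minimal pipeline sketch: one call runs extract -> clean -> load.
def run_pipeline(site, category, pages):
    raw_csv = extract(site, category, pages)   # scrape -> raw CSV
    clean_csv = clean(raw_csv)                 # prices -> numbers
    return load(clean_csv)                     # merge into the _db CSV

# Stand-in stages so the sketch runs on its own:
def extract(site, category, pages):
    return f"{category}_articles.csv"

def clean(path):
    return path.replace("articles", "clean")

def load(path):
    return path.replace("clean", "db")

result = run_pipeline("alkosto", "televisores", 3)
print(result)  # -> televisores_db.csv
```

In the real project each stage would do actual work, but the shape is the same: one command, three chained steps.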
The repository and the instructions can be found here.
I hope you enjoy this journey of collecting prices from any website.
Let me know if I can help you with any stage of the process.