<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Guillermo Sanchez</title>
    <description>The latest articles on DEV Community by Guillermo Sanchez (@datacloudgui).</description>
    <link>https://dev.to/datacloudgui</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F435850%2F2a130603-a7bc-4ac2-995f-0e65f1541a55.jpg</url>
      <title>DEV Community: Guillermo Sanchez</title>
      <link>https://dev.to/datacloudgui</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/datacloudgui"/>
    <language>en</language>
    <item>
      <title>Web scraper of prices in a colombian page</title>
      <dc:creator>Guillermo Sanchez</dc:creator>
      <pubDate>Tue, 04 Aug 2020 02:59:13 +0000</pubDate>
      <link>https://dev.to/datacloudgui/web-scraper-of-prices-in-a-colombian-page-3e2h</link>
      <guid>https://dev.to/datacloudgui/web-scraper-of-prices-in-a-colombian-page-3e2h</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--y3-hK10y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/5fs8iyz4npchtmtu8rj8.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--y3-hK10y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/5fs8iyz4npchtmtu8rj8.JPG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Would you like a Spanish version of this article? Let me know in the comments and I will write one!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How can you monitor thousands of prices and find the real deals?&lt;/strong&gt;&lt;br&gt;
In this post I will explain how to capture prices from any page, clean the data, and merge it day by day to build a price history and discover useful insights. Let's go...&lt;/p&gt;
&lt;h1&gt;
  
  
  The website
&lt;/h1&gt;

&lt;p&gt;The first step is to explore the target page. All major browsers include a handy tool for this, usually called "Inspect" or "Inspect element".&lt;/p&gt;

&lt;p&gt;It shows the skeleton of the page: HTML (the content of the page) and CSS (the styling of that content). I recommend reviewing the basics of these languages.&lt;/p&gt;

&lt;p&gt;In any case, I will show you the process in five simple steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Select the element picker (the cursor icon) in the inspector.&lt;/li&gt;
&lt;li&gt;Hover over the price you want and make sure it is highlighted.&lt;/li&gt;
&lt;li&gt;Check that the price appears in the HTML code.&lt;/li&gt;
&lt;li&gt;Note the class of that element, &lt;strong&gt;price&lt;/strong&gt; (a span in this case).&lt;/li&gt;
&lt;li&gt;Move up the tree and find the class of the element that contains the whole article (including the price you just selected). In my case it is &lt;strong&gt;.item&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Pd4_EKs0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/p109qnw0uxqtevm3n4ju.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Pd4_EKs0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/p109qnw0uxqtevm3n4ju.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Scraper project
&lt;/h2&gt;

&lt;p&gt;Clone my scraper project from &lt;a href="https://github.com/datacloudgui/prices_scraper"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The project contains four files:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;config.yaml:&lt;/strong&gt; Holds the URLs to be scraped. The file lets you organize them by retail site, category, and queries.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;a href="https://www.alkosto.com/electro"&gt;https://www.alkosto.com/electro&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can also set the HTML class used to select the element that contains each article, in this case: &lt;strong&gt;.item&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;common.py:&lt;/strong&gt; A small Python module that loads and parses the .yaml file above.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;item_page_object.py:&lt;/strong&gt; A Python class that reads a page and provides a method to extract all the articles on it.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this file, a for loop in lines 23 to 37 iterates over each item. In lines 33 to 35 the code captures the title, price, and image.&lt;br&gt;
&lt;strong&gt;Note line 34: ...find("span","price").span.string&lt;/strong&gt;&lt;br&gt;
This line captures the desired price.&lt;/p&gt;

&lt;p&gt;You can learn more about this call in the Beautiful Soup documentation &lt;a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/#"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The constructor requires the base URL (without the page number), the category of products to extract, and the total number of pages.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;prices_scraper.py:&lt;/strong&gt; The main script, which takes 3 arguments:&lt;/li&gt;
&lt;/ul&gt;

&lt;ol&gt;
&lt;li&gt;Retail site to be scraped (only alkosto is supported at the moment).&lt;/li&gt;
&lt;li&gt;Category of the product (twelve categories are implemented at the moment).&lt;/li&gt;
&lt;li&gt;Number of pages to scrape in the selected category.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;python3 prices_scraper.py alkosto televisores 3&lt;/code&gt;&lt;br&gt;
&lt;code&gt;python3 prices_scraper.py alkosto computadores-tablets 6&lt;/code&gt;&lt;/p&gt;
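
&lt;p&gt;To make the three arguments concrete, here is a minimal sketch of how a script could read them with argparse. It is only an illustration: the argument names and help texts are my assumptions and may not match the code in the repository.&lt;/p&gt;

```python
import argparse


def build_parser():
    # Hypothetical parser: the argument names are illustrative,
    # not taken from the repository.
    parser = argparse.ArgumentParser(description="Scrape prices from a retail site")
    parser.add_argument("site", choices=["alkosto"], help="retail site to scrape")
    parser.add_argument("category", help="product category, e.g. televisores")
    parser.add_argument("pages", type=int, help="number of pages to scrape")
    return parser


# Parse the same values used in the first example command.
args = build_parser().parse_args(["alkosto", "televisores", "3"])
print(args.site, args.category, args.pages)
```

&lt;p&gt;Declaring the page count with type=int means a non-numeric value is rejected before any request is sent.&lt;/p&gt;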

&lt;p&gt;The script uses the Homepage class inside a for loop to collect all the data for the selected category and saves it to a CSV file.&lt;/p&gt;

&lt;p&gt;Logging messages report the progress and, at the end, the total number of articles found.&lt;/p&gt;

&lt;p&gt;Finally, an example of the exported data is provided.&lt;/p&gt;
&lt;h2&gt;
  
  
  Clean data
&lt;/h2&gt;

&lt;p&gt;Prices are scraped as strings containing dots, commas, and often spaces. This stage converts those values to numbers so they can be analyzed or stored.&lt;/p&gt;
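
&lt;p&gt;As a rough sketch of the idea (not the code of the cleaner script itself), the conversion can be as simple as dropping every character that is not a digit. This assumes that dots and commas are thousands separators and that prices have no decimal part; the function name clean_price and the sample values are made up for illustration.&lt;/p&gt;

```python
import re


def clean_price(raw):
    """Turn a scraped price string such as '$ 1.299.900' into an int.

    Assumption: dots and commas are thousands separators and prices
    have no decimal part, so every non-digit character is dropped.
    """
    digits = re.sub(r"\D", "", raw)
    # An empty result means the text held no price at all.
    return int(digits) if digits else None


print(clean_price("$ 1.299.900"))
print(clean_price(" 849,900 "))
```

&lt;p&gt;If the site ever shows decimals, this shortcut would inflate the values, so check the raw strings first.&lt;/p&gt;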

&lt;p&gt;The cleaner Python script is available &lt;a href="https://github.com/datacloudgui/prices_cleaner"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note:&lt;/em&gt; The CSV obtained in the previous stage contains a column named after the extraction date; the script requires this date to clean the values.&lt;/p&gt;

&lt;p&gt;Usage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;python3&lt;/span&gt; &lt;span class="n"&gt;prices_cleaner&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt; &lt;span class="n"&gt;camaras_2020_07_05_articles&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;csv&lt;/span&gt; &lt;span class="mi"&gt;05_07_20&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h2&gt;
  
  
  Load data
&lt;/h2&gt;

&lt;p&gt;The load stage usually targets a database. In this project, however, the data is compiled into a single .csv file.&lt;/p&gt;

&lt;p&gt;The load script takes two .csv files and merges them.&lt;br&gt;
If you run the scraper on more than two days, each new day's data is merged into a _db.csv file.&lt;/p&gt;

&lt;p&gt;A new column is added each day with the prices of that day. Empty values are filled with -1.&lt;/p&gt;
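
&lt;p&gt;The merge rule can be sketched with plain dictionaries. This is a simplified illustration, not the load script itself: the function merge_day, the article names, and the prices are hypothetical, and I assume articles are matched by their title.&lt;/p&gt;

```python
def merge_day(db_rows, day_rows, day_label, previous_labels):
    """Add one day's prices as a new column, filling gaps with -1.

    db_rows: {title: {date_label: price}} accumulated so far.
    day_rows: {title: price} scraped on the new day.
    """
    merged = {title: dict(prices) for title, prices in db_rows.items()}
    for title, price in day_rows.items():
        # Articles first seen today get -1 for every earlier day.
        row = merged.setdefault(title, {label: -1 for label in previous_labels})
        row[day_label] = price
    for row in merged.values():
        # Articles missing from today's scrape get -1 for today.
        row.setdefault(day_label, -1)
    return merged


# Hypothetical data: one known article and one that appears on day two.
db = {"TV 50in": {"05_07_20": 1299900}}
today = {"TV 50in": 1249900, "Camera X": 849900}
result = merge_day(db, today, "06_07_20", ["05_07_20"])
print(result)
```

&lt;p&gt;Using -1 as the filler keeps every column numeric, but remember to exclude it before computing averages or minimums.&lt;/p&gt;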

&lt;p&gt;The project is available &lt;a href="https://github.com/datacloudgui/prices_load"&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Final step (When the magic happens)
&lt;/h2&gt;

&lt;p&gt;At this point you might be thinking: that's a lot of stages and a lot of work.&lt;/p&gt;

&lt;p&gt;However, in this stage we will automate the entire process so that extract, clean, and load run with a single command.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This process is known as a pipeline&lt;/strong&gt;&lt;br&gt;
The pipeline chains the previous stages together.&lt;/p&gt;
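
&lt;p&gt;Here is a minimal sketch of what such a pipeline could look like: one function builds the three commands and another runs them in order. Treat it as an assumption-heavy illustration. The script name prices_load.py, the clean_ prefix, and the _db.csv file name are hypothetical, so check each repository for the real names and arguments.&lt;/p&gt;

```python
import subprocess


def build_pipeline(site, category, pages, csv_name, date_label):
    """Return the extract, clean and load commands in order.

    Assumptions: prices_load.py, the clean_ prefix and the _db.csv
    name are hypothetical; check each repository for the real names.
    """
    cleaned = "clean_" + csv_name
    return [
        ["python3", "prices_scraper.py", site, category, str(pages)],
        ["python3", "prices_cleaner.py", csv_name, date_label],
        ["python3", "prices_load.py", category + "_db.csv", cleaned],
    ]


def run_pipeline(commands, runner=subprocess.run):
    # check=True stops the pipeline as soon as one stage fails.
    for command in commands:
        runner(command, check=True)


steps = build_pipeline("alkosto", "camaras", 3,
                       "camaras_2020_07_05_articles.csv", "05_07_20")
for step in steps:
    print(" ".join(step))
```

&lt;p&gt;Calling run_pipeline(steps) would execute the stages one after another; here the commands are only printed so the sketch runs without the scripts being present.&lt;/p&gt;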

&lt;p&gt;The repository and the instructions can be found &lt;a href="https://github.com/datacloudgui/prices_pipeline"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I hope you enjoyed this journey of obtaining prices from any website.&lt;/p&gt;

&lt;p&gt;Let me know if I can help you with any stage of the process.&lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
