<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Diretnan Domnan</title>
    <description>The latest articles on DEV Community by Diretnan Domnan (@deven96).</description>
    <link>https://dev.to/deven96</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F130490%2F7d1d5c93-5acc-434e-ac36-c84b81092905.jpg</url>
      <title>DEV Community: Diretnan Domnan</title>
      <link>https://dev.to/deven96</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/deven96"/>
    <language>en</language>
    <item>
      <title>Waihona</title>
      <dc:creator>Diretnan Domnan</dc:creator>
      <pubDate>Thu, 16 Sep 2021 01:03:21 +0000</pubDate>
      <link>https://dev.to/deven96/waihona-33j8</link>
      <guid>https://dev.to/deven96/waihona-33j8</guid>
      <description>&lt;h2&gt;
  
  
  What is &lt;a href="https://github.com/bisoncorps/waihona"&gt;Waihona&lt;/a&gt;?
&lt;/h2&gt;

&lt;p&gt;It aims to be a Rust crate for performing simple cloudstorage functions across popular cloud providers such as AWS, GCP and Azure.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is my personal motivation?
&lt;/h2&gt;

&lt;p&gt;I find Rust particularly hard to adapt to especially coming from a dynamic language background with Python and Javascript. There is so much to unpack and learn with regards to the borrow checker, lifetimes, generics with regards to traits e.t.c. I tackle this the only way I know how, by writing incrementally difficult projects tailored to help broaden my understanding. Still not where I am comfortable with yet, but this project helped get me onboard and comfortable with most of these concepts and also with the package management lifecycle in the Rust community&lt;/p&gt;

&lt;p&gt;Viva Rust!&lt;/p&gt;

</description>
      <category>github</category>
      <category>rust</category>
    </item>
    <item>
      <title>Scrape Everything! Building Web scrapers with Python and Go</title>
      <dc:creator>Diretnan Domnan</dc:creator>
      <pubDate>Fri, 14 Feb 2020 12:11:59 +0000</pubDate>
      <link>https://dev.to/deven96/scrape-everything-building-web-scrapers-with-python-and-go-34i7</link>
      <guid>https://dev.to/deven96/scrape-everything-building-web-scrapers-with-python-and-go-34i7</guid>
      <description>&lt;h3&gt;
  
  
  Table of Content
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Introduction&lt;/li&gt;
&lt;li&gt;Recognizing web scraping opportunities&lt;/li&gt;
&lt;li&gt;
Key components of a scraping project

&lt;ul&gt;
&lt;li&gt;Manual Inspection&lt;/li&gt;
&lt;li&gt;Web Requests&lt;/li&gt;
&lt;li&gt;Response Parsing&lt;/li&gt;
&lt;li&gt;Data cleaning and transformation&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
Packaging and Deploying your scraper

&lt;ul&gt;
&lt;li&gt;Packaging&lt;/li&gt;
&lt;li&gt;Deployment&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
Example Scraping Project with Python

&lt;ul&gt;
&lt;li&gt;Search Engine Parser&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
Example Scraping Project with Go

&lt;ul&gt;
&lt;li&gt;Gophie&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;The web we all know and love is a huge pool of unstructured data, usually presented in the form of pages that can be accessed through a web browser. &lt;a href="https://en.wikipedia.org/wiki/Web_scraping"&gt;Web scraping&lt;/a&gt; seeks to harness the power of this data by automating it's extraction using well selected tools of choice, to save us time and effort. In this article, I will guide you through some of the steps involved in writing web scrapers to tame the beast. The concepts explained are language agnostic but since I am transitioning from the dynamically typed joyous Python syntax to the rigid static Go syntax, this article will explain with examples from both sides. It then rounds off by giving a brief description of two projects written in &lt;a href="https://developers.google.com/edu/python/set-up"&gt;Python&lt;/a&gt; and &lt;a href="https://golang.org/doc/install"&gt;Go&lt;/a&gt; respectively. The projects in question achieve very different goals but are all built on this awesome concept of web scraping.&lt;/p&gt;

&lt;p&gt;&lt;b&gt; You will learn the following: &lt;/b&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Recognizing web scraping opportunities&lt;/li&gt;
&lt;li&gt;Key components of a scraping project&lt;/li&gt;
&lt;li&gt;Packaging and Deploying your scraper&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🚩 &lt;i&gt; A lot of the code snippets shown in this article will not compile on their own, as most times I will be making reference to a variable declared in a previous snippet or using an example url, e.t.c &lt;/i&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Recognizing web scraping opportunities
&lt;/h3&gt;

&lt;p&gt;There is tremendous data online in the form of web pages. This data can be transformed from unstructured to structured data to be stored in a database, used to power a dashboard, e.t.c.&lt;br&gt;
The trick to recognizing web scraping opportunities is as simple as having a project that requires data which is available on web pages, but not easily accessible through standardized libraries or &lt;a href="https://en.wikipedia.org/wiki/Web_API"&gt;web APIs&lt;/a&gt;. &lt;br&gt;
Another indicator is the volatility of the data i.e how frequently the data to be scraped updates or changes on the site.&lt;br&gt;
A strong web scraping opportunity is usually consolidation. &lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LLw2fMlD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/vgr8x1b0uwzjtbapm1bw.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LLw2fMlD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/vgr8x1b0uwzjtbapm1bw.jpeg" alt="Jigsaw Pieces visualizing consolidation"&gt;&lt;/a&gt;&lt;br&gt;
Consolidation refers to the combination of two or more entities. In our case refers to the merging of data obtained from different sources into one single endpoint. Take for instance, a project to build a dashboard for displaying major disease statistics in West Africa. This project is a major undertaking and  will usually involve several processes, including but not limited to manual data collection. Let us assume however that we have access to multiple websites, each having some small part of the data we need. We can scrape all the websites in question and consolidate the data into one single dashboard while saving time and effort. &lt;br&gt;
Some caveats exist however, one being that some websites have &lt;a href="https://consent.yahoo.com/collectConsent?sessionId=3_cc-session_aaba83a8-aaf1-425b-8d2b-9e2f77695b43&amp;amp;lang=en-gb&amp;amp;inline=false"&gt;terms and conditions&lt;/a&gt; barring scraping, especially for the purpose of commercial use. Some sites resort to blacklisting suspicious IP addresses that send too many requests at too rapid a rate as it may &lt;a href="https://en.wikipedia.org/wiki/Denial-of-service_attack"&gt;break their servers&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Key components of a scraping project
&lt;/h3&gt;

&lt;p&gt;There are some persistent components in every web scraping project:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Manual Inspection&lt;/li&gt;
&lt;li&gt;Web Request&lt;/li&gt;
&lt;li&gt;Response Parsing&lt;/li&gt;
&lt;li&gt;Data Cleaning and Transformation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HH3Oh9v8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/jr8k1c3rs18iz9ok0o0d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HH3Oh9v8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/jr8k1c3rs18iz9ok0o0d.png" alt="Generic Components of a web scraper"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Manual Inspection
&lt;/h4&gt;

&lt;p&gt;To successfully create a web scraper, it is important to understand the concepts of &lt;a href="https://www.w3schools.com/html/"&gt;HTML&lt;/a&gt; pages. A website is a soup containing multiple moving parts of code. In order to extract relevant information, we have to inspect the web page for the pieces of relevance. Manual inspection is first carried out before anything just to get a general feel of the information on the page.&lt;br&gt;
To manually inspect a page, right click on the web page and then select "Inspect". This allows you peer into the soul or source code of the website.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wcJMxQx8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/54o9eqp3tt87eqiycb16.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wcJMxQx8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/54o9eqp3tt87eqiycb16.png" alt="Right click menu"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On clicking "Inspect", a console should pop up. This console lets you navigate and see the corresponding HTML tag for every information displayed on the page.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7jO9O1Tb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/iiizmt4rv7q3b517no3a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7jO9O1Tb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/iiizmt4rv7q3b517no3a.png" alt="web dev console"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Web Request
&lt;/h4&gt;

&lt;p&gt;This involves sending a request to a website and receiving the response. The request being can be configured to receive html pages, get files, set retry policy, e.t.c.  The corresponding response received contains a lot of information we might want to use such as status code, content length, content type and response body. These are all important information that we then proceed to parse during the Response Parsing stage. There are libraries to achieve that in Python (e.g &lt;a href="https://2.python-requests.org/en/master/"&gt;requests&lt;/a&gt;) and Go (e.g &lt;a href="https://golang.org/pkg/net/http/"&gt;http&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;Here is an implementation in Python&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="s"&gt;"""
Web Requests using Python Requests
"""&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;requests.adapters&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HTTPAdapter&lt;/span&gt;

&lt;span class="c1"&gt;# simple get request
&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"http://example.com/"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# posting form data
&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'username'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;'niceusername'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s"&gt;'password'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;'123456'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Session&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"http://example.com/form"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# setting retry policy to 5 max retries
&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mount&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"http://stackoverflow.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;HTTPAdapter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Here is the corresponding implementation in Go&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;/** 
Web Requests using Go net/http 
**/&lt;/span&gt;
&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="s"&gt;"net/http"&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(){&lt;/span&gt;
    &lt;span class="c"&gt;// simple get request&lt;/span&gt;
    &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"http://example.com/"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c"&gt;// posting form data&lt;/span&gt;
    &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PostForm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"http://example.com/form"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Values&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"key"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"Value"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="s"&gt;"id"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"123"&lt;/span&gt;&lt;span class="p"&gt;}})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Unfortunately, net/http does not provide retries, but we can make use of third party libraries e.g &lt;a href="https://github.com/BoseCorp/pester"&gt;pester&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;/** Retry Policy with pester **/&lt;/span&gt;
&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="s"&gt;"github.com/BoseCorp/pester"&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;pester&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="c"&gt;// set retry policy to 5 max retries&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MaxRetries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
    &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"http://stackoverflow.com"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h4&gt;
  
  
  Response Parsing
&lt;/h4&gt;

&lt;p&gt;This involves extracting information from the web page and still goes hand in hand with manual inspection mentioned earlier. As the name suggests, we parse the response by making use of the attributes provided by the response e.g the response-body, status-code, content-length, e.t.c. of which HTML which is contained in the response-body is usually the most laborious to parse, assuming we were to do it ourselves. Thankfully, there are structured stable libraries that help us parse HTML easily.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="c"&gt;&amp;lt;!--- Assume this is our html --&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;html&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;body&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;p&amp;gt;&lt;/span&gt;
    This is some
     &lt;span class="nt"&gt;&amp;lt;b&amp;gt;&lt;/span&gt;
     regular
     &lt;span class="nt"&gt;&amp;lt;i&amp;gt;&lt;/span&gt;HTML&lt;span class="nt"&gt;&amp;lt;/i&amp;gt;&lt;/span&gt;
     &lt;span class="nt"&gt;&amp;lt;/b&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/p&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;table&lt;/span&gt; &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;"important-data"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="nt"&gt;&amp;lt;tr&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;th&amp;gt;&lt;/span&gt;Name&lt;span class="nt"&gt;&amp;lt;/th&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;th&amp;gt;&lt;/span&gt;Country&lt;span class="nt"&gt;&amp;lt;/th&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;th&amp;gt;&lt;/span&gt;Weight(kg)&lt;span class="nt"&gt;&amp;lt;/th&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;th&amp;gt;&lt;/span&gt;Height(cm)&lt;span class="nt"&gt;&amp;lt;/th&amp;gt;&lt;/span&gt;
      &lt;span class="nt"&gt;&amp;lt;/tr&amp;gt;&lt;/span&gt;
      &lt;span class="nt"&gt;&amp;lt;tr&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;td&amp;gt;&lt;/span&gt;Smith&lt;span class="nt"&gt;&amp;lt;/td&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;td&amp;gt;&lt;/span&gt;Nigeria&lt;span class="nt"&gt;&amp;lt;/td&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;td&amp;gt;&lt;/span&gt;42&lt;span class="nt"&gt;&amp;lt;/td&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;td&amp;gt;&lt;/span&gt;160&lt;span class="nt"&gt;&amp;lt;/td&amp;gt;&lt;/span&gt;
      &lt;span class="nt"&gt;&amp;lt;/tr&amp;gt;&lt;/span&gt;
      &lt;span class="nt"&gt;&amp;lt;tr&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;td&amp;gt;&lt;/span&gt;Eve&lt;span class="nt"&gt;&amp;lt;/td&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;td&amp;gt;&lt;/span&gt;Nigeria&lt;span class="nt"&gt;&amp;lt;/td&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;td&amp;gt;&lt;/span&gt;49&lt;span class="nt"&gt;&amp;lt;/td&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;td&amp;gt;&lt;/span&gt;180&lt;span class="nt"&gt;&amp;lt;/td&amp;gt;&lt;/span&gt;
      &lt;span class="nt"&gt;&amp;lt;/tr&amp;gt;&lt;/span&gt;
      &lt;span class="nt"&gt;&amp;lt;tr&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;td&amp;gt;&lt;/span&gt;Tunde&lt;span class="nt"&gt;&amp;lt;/td&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;td&amp;gt;&lt;/span&gt;Nigeria&lt;span class="nt"&gt;&amp;lt;/td&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;td&amp;gt;&lt;/span&gt;65&lt;span class="nt"&gt;&amp;lt;/td&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;td&amp;gt;&lt;/span&gt;175&lt;span class="nt"&gt;&amp;lt;/td&amp;gt;&lt;/span&gt;
      &lt;span class="nt"&gt;&amp;lt;/tr&amp;gt;&lt;/span&gt;
      &lt;span class="nt"&gt;&amp;lt;tr&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;td&amp;gt;&lt;/span&gt;Koffi&lt;span class="nt"&gt;&amp;lt;/td&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;td&amp;gt;&lt;/span&gt;Ghana&lt;span class="nt"&gt;&amp;lt;/td&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;td&amp;gt;&lt;/span&gt;79&lt;span class="nt"&gt;&amp;lt;/td&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;td&amp;gt;&lt;/span&gt;154&lt;span class="nt"&gt;&amp;lt;/td&amp;gt;&lt;/span&gt;
      &lt;span class="nt"&gt;&amp;lt;/tr&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/table&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/body&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/html&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;For Python, the undeniable king of the parsers is &lt;a href="https://pypi.org/project/beautifulsoup4/"&gt;BeautifulSoup4&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;collections&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;namedtuple&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;bs4&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoup&lt;/span&gt;

&lt;span class="n"&gt;TableElement&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;namedtuple&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'TableElement'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'Name Country Weight Height'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="n"&gt;request_body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'http://www.example.com'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;
&lt;span class="c1"&gt;# using beautiful soup with the 'lxml' parser
&lt;/span&gt;&lt;span class="n"&gt;soup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request_body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"lxml"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# extract the table
&lt;/span&gt;&lt;span class="n"&gt;tb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"table"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"important-data"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="c1"&gt;# find each row
&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;find_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"tr"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;tds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;find_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"td"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# tds would be empty for the row with only table heads (tr)
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;tds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;tds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tds&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;table_element&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;TableElement&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;tds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table_element&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# print first person name       
&lt;/span&gt;&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;For Go, there is no undisputed king of parsing in my opinion, but my immediate favourite has come to be &lt;a href="https://github.com/gocolly/colly"&gt;Colly&lt;/a&gt;. Colly combines the work of net/http into it's library, so it can perform both web request and parsing&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"strconv"&lt;/span&gt;
    &lt;span class="s"&gt;"fmt"&lt;/span&gt;
    &lt;span class="s"&gt;"github.com/gocolly/colly"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;TableElement&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;Name&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
  &lt;span class="n"&gt;Country&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
  &lt;span class="n"&gt;Weight&lt;/span&gt; &lt;span class="kt"&gt;float64&lt;/span&gt;
  &lt;span class="n"&gt;Height&lt;/span&gt; &lt;span class="kt"&gt;float64&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
       &lt;span class="c"&gt;// table array to hold data&lt;/span&gt;
       &lt;span class="n"&gt;table&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;TableElement&lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;
       &lt;span class="c"&gt;// instantiate collector&lt;/span&gt;
        &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;colly&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewCollector&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
       &lt;span class="c"&gt;// set up rule for table with important-data id&lt;/span&gt;
       &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OnHTML&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"table[id=important-data]"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tab&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;colly&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HTMLElement&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="n"&gt;tab&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ForEach&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"tr"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tr&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;colly&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HTMLElement&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
              &lt;span class="n"&gt;tds&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;tr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ChildTexts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"td"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
              &lt;span class="n"&gt;new_row&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;TableElement&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="n"&gt;Country&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="n"&gt;Weight&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;strconv&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ParseFloat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="m"&gt;64&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="n"&gt;Height&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;strconv&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ParseFloat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="m"&gt;64&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                  &lt;span class="p"&gt;}&lt;/span&gt;
              &lt;span class="c"&gt;// append every row to table&lt;/span&gt;
              &lt;span class="n"&gt;table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;new_row&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;

        &lt;span class="c"&gt;// assuming the same html as above&lt;/span&gt;
        &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Visit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"http://example.com"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c"&gt;// print first individual name&lt;/span&gt;
        &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Println&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h4&gt;
  
  
  Data Cleaning and Transformation
&lt;/h4&gt;

&lt;p&gt;If somehow scraping the html tags gave you the final form of the data you are looking for then 🎉 , your work is majorly done. If it is not, then the next part is usually very integral and involves rolling up your sleeves and doing some cleaning and transformation. Text coming from within tags is usually messy, unnecessarily spaced and might need some regular expression magic to further extract reasonable data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="s"&gt;"""
Extracting Content Filename from Content-Disposition using Python regex
"""&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;re&lt;/span&gt;
&lt;span class="c1"&gt;# assume we have a response already
&lt;/span&gt;&lt;span class="n"&gt;content_disp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"Header"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s"&gt;"Content-Disposition"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;filename_re&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'filename="(.*)"'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;filename&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename_re&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content_disp&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Avengers (2019).mp4
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;





&lt;div class="highlight"&gt;&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;/** 
Extracting Content Filename from Content-Disposition using Go regex
**/&lt;/span&gt;
&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
   &lt;span class="s"&gt;"regexp"&lt;/span&gt;
   &lt;span class="s"&gt;"fmt"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;re&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;regexp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MustCompile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;`filename="(.*)"`&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Header&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"Content-Disposition"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;filename&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FindStringSubmatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Println&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c"&gt;// Avengers (2019).mp4&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Furthermore, to clean data while easily melting, pivoting and manipulating its rows and columns, one could make use of &lt;a href="https://pandas.pydata.org/"&gt;Pandas&lt;/a&gt; in Python and it's Go equivalent &lt;a href="https://github.com/go-gota/gota"&gt;Gota&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="c1"&gt;# using the previous tables variable which is a list of TableElement namedtuples
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tables&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="s"&gt;"""  Console output
     Name       Country      Weight        Height
0    Smith      Nigeria       42            160
1    Eve        Nigeria       49            180
2    Tunde      Nigeria       65            175
3    Koffi      Ghana         79            154
"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;





&lt;div class="highlight"&gt;&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
     &lt;span class="s"&gt;"github.com/go-gota/gota"&lt;/span&gt;
     &lt;span class="s"&gt;"fmt"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(){&lt;/span&gt;
    &lt;span class="c"&gt;// using the previous tables variable which is a slice of TableElement structs&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;dataframe&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LoadStructs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tables&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Println&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;/** Console output
     Name       Country      Weight        Height
0    Smith      Nigeria       42            160
1    Eve        Nigeria       49            180
2    Tunde      Nigeria       65            175
3    Koffi      Ghana         79            154
**/&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Refer to this awesome article on &lt;a href="https://realpython.com/python-data-cleaning-numpy-pandas/"&gt;data cleaning with Pandas&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Packaging and Deploying your scraper
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Packaging
&lt;/h4&gt;

&lt;p&gt;This is all dependent on the use case of your scraper. You can determine your use case by asking questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Should the scraper stream data to another endpoint or should it remain dormant until polled?&lt;/li&gt;
&lt;li&gt;Are you storing the data for usage at a latter time?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They will guide you into selecting the proper packaging for your scraper.&lt;br&gt;
The first question helps you consider the scraper as either a webservice or a utility. If it is not a streaming scraper, you should consider using it as a library or Command Line Interface(CLI). Optionally you can decide to build a Graphic User Interface(GUI) if you are into that sort of thing. For CLI, I would suggest Python's &lt;a href="https://docs.python.org/3/library/argparse.html"&gt;argparse&lt;/a&gt; and &lt;a href="https://www.github.com/abiosoft/ishell"&gt;ishell&lt;/a&gt; for Golang. For creating cross platform desktop GUI, &lt;a href="https://pypi.org/project/PyQt5/"&gt;PyQt5&lt;/a&gt; for Python and &lt;a href="https://github.com/fyne-io/fyne/"&gt;Fyne&lt;/a&gt; for Golang should suffice&lt;/p&gt;

&lt;p&gt;NOTE 📝 : If you are setting up a streaming type scraper especially, you may have to apply a few tricks such as rate limiting, ip and header rotation, e.t.c so as not to get your scraper blacklisted.&lt;/p&gt;

&lt;p&gt;The second question helps decide whether to add a database into the mix or not. View Python &lt;a href="https://www.psycopg.org/"&gt;psycopg&lt;/a&gt; and Go &lt;a href="https://godoc.org/github.com/lib/pq"&gt;pq&lt;/a&gt; to be able to connect to a &lt;a href="https://www.postgresql.org/"&gt;PostgresQL&lt;/a&gt; database&lt;/p&gt;

&lt;p&gt;I have been able to rely on &lt;a href="https://www.fullstackpython.com/flask.html"&gt;Flask&lt;/a&gt; for spawning simple Python servers and using them to deploy scrapers isn't much of a hassle either.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;flask&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Flask&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;flask&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;jsonify&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Flask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;scraping_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="s"&gt;""" scraping function to extract data """&lt;/span&gt;
    &lt;span class="o"&gt;...&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"Name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;jsonify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'/'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;home&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;username&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'category'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# dynamically set the url to be scraped based on the category received
&lt;/span&gt;    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"https://example.com/?category={category}"&lt;/span&gt;
    &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;json_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scraping_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json_response&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="s"&gt;'__main__'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;9000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;For Golang, net/http is still the way to go for spawning small servers to interface with logic.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; 
   &lt;span class="s"&gt;"net/http"&lt;/span&gt;
   &lt;span class="s"&gt;"encoding/json"&lt;/span&gt;
   &lt;span class="s"&gt;"os"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;Example&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Name&lt;/span&gt;          &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;Description&lt;/span&gt;   &lt;span class="kt"&gt;string&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;scraping_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Example&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// scraping function to extract data&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;Example&lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResponseWriter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;category&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;URL&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Query&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"category argument is missing in url"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StatusForbidden&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; 
        &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="s"&gt;"https://example.com/?category="&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;
        &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;scraping_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scraping_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c"&gt;// dump results&lt;/span&gt;
    &lt;span class="n"&gt;json_output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Marshal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scraping&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Println&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"failed to serialize response:"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Header&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Content-Type"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"application/json"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json_output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;


&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(){&lt;/span&gt;
    &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HandleFunc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c"&gt;// check if port variable has been set&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"PORT"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ListenAndServe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;":9000"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ListenAndServe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;":"&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"PORT"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;To interact with the server, just use &lt;a href="https://curl.haxx.se/"&gt;curl&lt;/a&gt;. If it is running on port 9000, the command to test both servers is shown as&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="s1"&gt;'http://127.0.0.1:9000/?category=cats'&lt;/span&gt;

&lt;span class="o"&gt;{&lt;/span&gt;
 &lt;span class="s2"&gt;"Name"&lt;/span&gt;: &lt;span class="s2"&gt;""&lt;/span&gt;,
 &lt;span class="s2"&gt;"Description"&lt;/span&gt;: &lt;span class="s2"&gt;""&lt;/span&gt;,
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h4&gt;
  
  
  Deployment
&lt;/h4&gt;

&lt;p&gt;Your scraper web application was never meant to live on your system. It was built to win, to conquer, to prosper, to fly (okay enough of the cringey Fly - Nicki Minaj motivational detour). &lt;a href="https://en.wikipedia.org/wiki/Platform_as_a_service"&gt;PaaS&lt;/a&gt; such as &lt;a href="http://herokuapp.com/"&gt;Heroku&lt;/a&gt; makes it easy to deploy both Python and Golang web applications by just writing a simple Procfile. For flask applications, it is much better to use a production ready server such as &lt;a href="https://gunicorn.org/"&gt;Gunicorn&lt;/a&gt;. Just add it to your &lt;code&gt;requirements.txt&lt;/code&gt; file at the root of your application.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Flask Procfile&lt;/span&gt;
web: gunicorn scraper:app
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;File layout for flask scraper&lt;/p&gt;

&lt;p&gt;|-scraper.py&lt;br&gt;
|&lt;br&gt;
|-requirements.txt&lt;br&gt;
|&lt;br&gt;
|-Procfile&lt;/p&gt;

&lt;p&gt;Read &lt;a href="https://stackabuse.com/deploying-a-flask-application-to-heroku/"&gt;Deploying a Flask application to Heroku&lt;/a&gt; for a deeper look into the deployment process.&lt;/p&gt;

&lt;p&gt;For the golang application, make sure to have a &lt;a href="https://github.com/golang/go/wiki/Modules"&gt;go.mod&lt;/a&gt; file at the root of the project and a Procfile as well. Using your &lt;code&gt;go.mod&lt;/code&gt;, Heroku knows to generate a binary located in &lt;code&gt;bin/&lt;/code&gt; named after your main package scipt&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Go Procfile&lt;/span&gt;
web: ./bin/scraper 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;File layout for net/http scraper&lt;/p&gt;

&lt;p&gt;|-scraper.go&lt;br&gt;
|&lt;br&gt;
|-go.mod&lt;br&gt;
|&lt;br&gt;
|-Procfile&lt;/p&gt;

&lt;p&gt;Read &lt;a href="https://devcenter.heroku.com/articles/getting-started-with-go"&gt;Getting Started on Heroku with Go&lt;/a&gt; for more information&lt;/p&gt;

&lt;h3&gt;
  
  
  Example Scraping Project with Python
&lt;/h3&gt;

&lt;p&gt;I was playing around with creating a &lt;a href="https://github.com/bisoncorps/imageq"&gt;reverse image search engine&lt;/a&gt; using &lt;a href="https://github.com/keras-team/keras-applications/blob/master/keras_applications/resnet50.py"&gt;Keras's ResNet50 and Imagenet weights&lt;/a&gt; and made a small search page for it. The problem however, was that whenever I got the image class in question, I would have to redirect to Google.I want to display the results on my own web application, I cried. I tried using Google's API but at the same time I wanted the option of switching the search engine to perhaps something else like Bing. The corresponding script grew into it's own open source project and ended up involving multiple contributors and numerous scraped engines. It is available as a python library and has been used by over 600 other github repositories, mostly bots that need search engine capability within their code.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XI4BkJQV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/1wmhooxfglwjql3qzfli.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XI4BkJQV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/1wmhooxfglwjql3qzfli.gif" alt="SEP logo"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;View Search Engine Parser on &lt;a href="https://github.com/bisoncorps/search-engine-parser"&gt;Github&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Example Scraping Project with Go
&lt;/h3&gt;

&lt;p&gt;I was visiting my favourite low data movie download site &lt;a href="https://netnaija.com/"&gt;netnaija&lt;/a&gt;, but that day was particularly annoying. I moved to click and I got redirected to pictures of b**bs. I just want free movies, I cried.&lt;br&gt;
I proceeded to create &lt;a href="https://github.com/bisoncorps/gophie"&gt;Gophie&lt;/a&gt;. It is a CLI tool which enables me to be the cheapskate I am without getting guilt tripped. Search and download netnaija movies from the comfort of my terminal. Scraping 1, Netnaija 0.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Pqbve4Mv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/7kktu9lrceiomgseq8lw.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Pqbve4Mv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/7kktu9lrceiomgseq8lw.jpeg" alt="Gophie logo"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;View Gophie on &lt;a href="https://github.com/bisoncorps/gophie"&gt;Github&lt;/a&gt; &lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this article, you've come across some of the ideas that go into creating web scrapers for data retrieval. Some of this information, with a little more research on your part can go into creating Industry-standard scrapers. I tried to make the article as end to end as possible however this is my first attempt at writing a technical article, so feel free to call my attention to anything crucial you feel I might have missed or leave questions in the comments below. Now go and scrape everything!&lt;/p&gt;

&lt;p&gt;PS ✏️ : I am not sure if it is legal to scrape Google, but Google scrapes everybody. Thank you&lt;/p&gt;

</description>
      <category>python</category>
      <category>webscraping</category>
      <category>go</category>
      <category>github</category>
    </item>
    <item>
      <title>Python Bluetooth and Wifi?</title>
      <dc:creator>Diretnan Domnan</dc:creator>
      <pubDate>Fri, 30 Aug 2019 10:31:39 +0000</pubDate>
      <link>https://dev.to/deven96/python-bluetooth-and-wifi-25ga</link>
      <guid>https://dev.to/deven96/python-bluetooth-and-wifi-25ga</guid>
      <description>&lt;p&gt;I started working on a desktop application for a lecturer of mine a  little while back. The application was to get both bluetooth and wifi devices graphed with their rssi values. I decided to try my hand at using python (majorly because I had seen pybluez)&lt;/p&gt;

&lt;p&gt;This quickly spiralled into wanting to create a python package that has these inbuilt functionalities and voila &lt;a href="https://github.com/bisoncorps/signalum"&gt;Signalum&lt;/a&gt; was born. My logic was that if I could create a library then the PyQT application would simply plug into it.&lt;/p&gt;

&lt;p&gt;I only have very limited experience in using python so I'm certain that I'm doing countless things wrong so I would appreciate pull requests, suggestions, etc.&lt;/p&gt;

&lt;p&gt;Please be nice ☺&lt;/p&gt;

</description>
      <category>python</category>
      <category>bluetooth</category>
      <category>wifi</category>
    </item>
  </channel>
</rss>
