<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Nitish Kushwaha</title>
    <description>The latest articles on DEV Community by Nitish Kushwaha (@nitish-kushwaha).</description>
    <link>https://dev.to/nitish-kushwaha</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1348662%2F1bc7ee70-02ed-49a1-a5bf-95a191273142.jpeg</url>
      <title>DEV Community: Nitish Kushwaha</title>
      <link>https://dev.to/nitish-kushwaha</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/nitish-kushwaha"/>
    <language>en</language>
    <item>
      <title>Label Encoding in ML</title>
      <dc:creator>Nitish Kushwaha</dc:creator>
      <pubDate>Wed, 21 Aug 2024 15:04:50 +0000</pubDate>
      <link>https://dev.to/nitish-kushwaha/label-encoding-in-ml-1426</link>
      <guid>https://dev.to/nitish-kushwaha/label-encoding-in-ml-1426</guid>
      <description>&lt;p&gt;&lt;strong&gt;Label Encoding&lt;/strong&gt; is one of the most widely used preprocessing techniques in machine learning. It converts categorical data into numerical form so that the data can be fitted to a model.&lt;/p&gt;

&lt;p&gt;Let us understand why we use &lt;em&gt;Label Encoding&lt;/em&gt;. Imagine a dataset whose essential columns are stored as &lt;em&gt;strings&lt;/em&gt;. You cannot fit this data to a model directly, because most models only work on numerical data. So what do we do? This is where &lt;em&gt;Label Encoding&lt;/em&gt; comes in: it is applied at the preprocessing step, when we prepare the data for fitting.&lt;/p&gt;
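
&lt;p&gt;As a rough illustration of the problem, the sketch below builds a small, made-up pandas DataFrame (the column names and values are assumptions, not a real dataset) and uses scikit-learn's &lt;code&gt;LabelEncoder&lt;/code&gt; to turn a string column into integer codes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical data: a string column that a model cannot use directly.
df = pd.DataFrame({
    "petal_length": [1.4, 4.7, 6.0, 1.3],
    "species": ["setosa", "versicolor", "virginica", "setosa"],
})

encoder = LabelEncoder()
# fit_transform learns the sorted unique labels and returns their integer codes.
df["species_code"] = encoder.fit_transform(df["species"])

print(df["species_code"].tolist())  # [0, 1, 2, 0]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Note that scikit-learn documents &lt;code&gt;LabelEncoder&lt;/code&gt; as intended for target values; for feature columns, its &lt;code&gt;OrdinalEncoder&lt;/code&gt; is the documented counterpart.&lt;/p&gt;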

&lt;p&gt;We will use the &lt;em&gt;iris&lt;/em&gt; dataset from the &lt;em&gt;Scikit-Learn&lt;/em&gt; library to understand how the Label Encoder works. Make sure you have the following libraries installed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;pandas
scikit-learn
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To install these libraries, run the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-U&lt;/span&gt; pandas scikit-learn
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now open a Google Colab notebook, and let's dive into the code and learn how the Label Encoder works.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let's Code
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Start with importing the following libraries:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;preprocessing&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Import the &lt;em&gt;iris&lt;/em&gt; dataset and load it:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.datasets&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_iris&lt;/span&gt;
&lt;span class="n"&gt;iris&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_iris&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Now we need to select the data that we want to &lt;em&gt;encode&lt;/em&gt;. We will be encoding the &lt;em&gt;species&lt;/em&gt; names of the irises.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;species&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;iris&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target_names&lt;/span&gt;
&lt;span class="n"&gt;species&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;array&lt;span class="o"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;'setosa'&lt;/span&gt;, &lt;span class="s1"&gt;'versicolor'&lt;/span&gt;, &lt;span class="s1"&gt;'virginica'&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;, &lt;span class="nv"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'&amp;lt;U10'&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Let's instantiate the class &lt;em&gt;LabelEncoder&lt;/em&gt; from &lt;em&gt;preprocessing&lt;/em&gt;:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;label_encoder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;preprocessing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;LabelEncoder&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Now, we are ready to fit the data using the &lt;em&gt;label encoder&lt;/em&gt;:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;label_encoder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;species&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You will see output similar to this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr7iwn07edl8p495zz7es.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr7iwn07edl8p495zz7es.png" alt="Label encoder fit output" width="198" height="64"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you get this output, you have successfully fitted the encoder. The question now is how to find out which value is assigned to each species, and in which order.&lt;/p&gt;

&lt;p&gt;The order in which the &lt;em&gt;Label Encoder&lt;/em&gt; assigns values is stored in its &lt;code&gt;classes_&lt;/code&gt; attribute. The encoded values run from &lt;code&gt;0&lt;/code&gt; to &lt;code&gt;n_classes-1&lt;/code&gt;, where &lt;code&gt;n_classes&lt;/code&gt; is the number of distinct labels.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;label_encoder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;classes_&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;array&lt;span class="o"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;'setosa'&lt;/span&gt;, &lt;span class="s1"&gt;'versicolor'&lt;/span&gt;, &lt;span class="s1"&gt;'virginica'&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;, &lt;span class="nv"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'&amp;lt;U10'&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The label encoder automatically sorts the unique labels and numbers them from left to right, starting at 0. Here:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;setosa -&amp;gt; 0
versicolor -&amp;gt; 1
virginica -&amp;gt; 2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
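
&lt;p&gt;One way to inspect this mapping programmatically is to pair each entry of &lt;code&gt;classes_&lt;/code&gt; with its position. The sketch below fits a fresh encoder on deliberately unsorted labels (the variable names are illustrative) to show that the codes follow sorted order, not input order.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sklearn.preprocessing import LabelEncoder

# Labels given out of order, with a repeat, to show that fit() sorts and deduplicates.
labels = ["virginica", "setosa", "versicolor", "setosa"]

encoder = LabelEncoder()
encoder.fit(labels)

# The position of a label in classes_ is its encoded value.
mapping = {str(name): code for code, name in enumerate(encoder.classes_)}
print(mapping)  # {'setosa': 0, 'versicolor': 1, 'virginica': 2}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;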



&lt;ul&gt;
&lt;li&gt;Now, let's test the fitted encoder. We will transform the iris species &lt;code&gt;setosa&lt;/code&gt;.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;label_encoder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;setosa&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output: array([0])&lt;/p&gt;

&lt;p&gt;Similarly, transform the species &lt;code&gt;virginica&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;label_encoder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;virginica&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output: array([2])&lt;/p&gt;

&lt;p&gt;You can also pass a list of species, such as &lt;code&gt;["setosa", "virginica"]&lt;/code&gt;, to encode several values in one call.&lt;/p&gt;
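
&lt;p&gt;A minimal sketch of encoding a list, assuming a freshly fitted encoder (named &lt;code&gt;encoder&lt;/code&gt; here so the block runs on its own). It also uses &lt;code&gt;inverse_transform&lt;/code&gt;, which maps the codes back to the original labels.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sklearn.datasets import load_iris
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(load_iris().target_names)

# Encode several species in one call.
codes = encoder.transform(["setosa", "virginica"])
print(codes.tolist())  # [0, 2]

# inverse_transform maps the codes back to the original labels.
names = encoder.inverse_transform(codes)
print(names.tolist())  # ['setosa', 'virginica']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Passing a label that the encoder did not see during &lt;code&gt;fit&lt;/code&gt; raises a &lt;code&gt;ValueError&lt;/code&gt;.&lt;/p&gt;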

&lt;p&gt;&lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html#sklearn.preprocessing.LabelEncoder.fit" rel="noopener noreferrer"&gt;Scikit Learn documentation for label encoder &amp;gt;&amp;gt;&amp;gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Data Analysis 1: Scraping web pages</title>
      <dc:creator>Nitish Kushwaha</dc:creator>
      <pubDate>Thu, 18 Apr 2024 20:30:34 +0000</pubDate>
      <link>https://dev.to/nitish-kushwaha/data-analysis-1-scraping-web-pages-27ji</link>
      <guid>https://dev.to/nitish-kushwaha/data-analysis-1-scraping-web-pages-27ji</guid>
      <description>&lt;p&gt;There is no shortage of excellent datasets on the internet, but you might want to show prospective employers that you're able to find and scrape your own data as well. Plus, knowing how to scrape the web means you can find and use datasets that match your interests, regardless of whether they've already been compiled.&lt;/p&gt;

&lt;p&gt;Scraping your own data also lets you build custom datasets for testing as well as for large projects.&lt;/p&gt;

&lt;p&gt;Today we're going to scrape the well-known news site "Times of India", find all the link tags (&lt;code&gt;a&lt;/code&gt; elements) on the page, verify their URLs, and process them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Setting up
&lt;/h3&gt;

&lt;p&gt;We will start by importing the required libraries and saving the URL we want to scrape in the &lt;em&gt;URL&lt;/em&gt; variable.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bs4&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoup&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;validators&lt;/span&gt;

&lt;span class="n"&gt;URL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://timesofindia.indiatimes.com/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Request Page
&lt;/h3&gt;

&lt;p&gt;The first thing we are going to do is request the page using the &lt;strong&gt;requests&lt;/strong&gt; library. We send a &lt;em&gt;get&lt;/em&gt; request and receive a response containing the page content. In other words, we have downloaded the page.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="bp"&gt;...&lt;/span&gt;
&lt;span class="n"&gt;soup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;span class="n"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;URL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, we save the response in the &lt;em&gt;res&lt;/em&gt; variable and then check whether the request was successful by looking at its &lt;em&gt;status_code&lt;/em&gt;. If &lt;em&gt;status_code&lt;/em&gt; is 200, we parse the page with &lt;em&gt;beautiful soup&lt;/em&gt; to extract information.&lt;/p&gt;
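
&lt;p&gt;The request-and-check step can also be wrapped in a small helper. This is only a sketch: the function name &lt;code&gt;fetch_page&lt;/code&gt;, the 10-second timeout and the injectable &lt;code&gt;get&lt;/code&gt; parameter (there so the helper can be tried without a network connection) are assumptions, not part of the original code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import requests


def fetch_page(url, timeout=10, get=requests.get):
    """Return the page body as bytes, or None if the request did not succeed."""
    try:
        res = get(url, timeout=timeout)
    except requests.RequestException:
        # Connection errors, timeouts and similar failures end up here.
        return None
    if res.status_code != 200:
        return None
    return res.content
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Calling &lt;code&gt;fetch_page(URL)&lt;/code&gt; would then return the downloaded bytes, or &lt;code&gt;None&lt;/code&gt; when the site could not be reached or answered with a status other than 200.&lt;/p&gt;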

&lt;h3&gt;
  
  
  Parsing Content
&lt;/h3&gt;

&lt;p&gt;In the above code, we have created the &lt;em&gt;soup&lt;/em&gt; variable, which will store the parsed page.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="bp"&gt;...&lt;/span&gt;
&lt;span class="n"&gt;soup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;soup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BeautifulSoup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;html.parser&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Filtering link Tags
&lt;/h3&gt;

&lt;p&gt;We have successfully requested and parsed the web page. Now it's time to filter all the link tags from the page.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="bp"&gt;...&lt;/span&gt;
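&lt;span class="c1"&gt;# href=True skips link tags without an href, which would raise a KeyError below&lt;/span&gt;
&lt;span class="n"&gt;allLinkTag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;href&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;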
&lt;span class="n"&gt;unverified&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;link&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;allLinkTag&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;unverified&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;link&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;attrs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;href&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, we have collected all the link tags in the list &lt;em&gt;allLinkTag&lt;/em&gt; and created an &lt;em&gt;unverified&lt;/em&gt; list to hold the URLs found in those tags. We fill it by iterating over the link tags and extracting the &lt;em&gt;href&lt;/em&gt; attribute from each one.&lt;/p&gt;
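
&lt;p&gt;The &lt;em&gt;href&lt;/em&gt; values on a page are often relative paths rather than full URLs. As a sketch of one way to handle that, the standard library's &lt;code&gt;urljoin&lt;/code&gt; can resolve them against the page address. The href values below are made up for illustration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from urllib.parse import urljoin

BASE_URL = "https://timesofindia.indiatimes.com/"

# Made-up href values of the kinds commonly found in link tags.
hrefs = ["/india", "sports/cricket", "https://example.com/page"]

# urljoin leaves absolute URLs alone and resolves relative ones against BASE_URL.
absolute = [urljoin(BASE_URL, href) for href in hrefs]

for url in absolute:
    print(url)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;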

&lt;h3&gt;
  
  
  Validating URLs
&lt;/h3&gt;

&lt;p&gt;Now that we have all the URLs from the page, it is time to validate them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="bp"&gt;...&lt;/span&gt;
&lt;span class="n"&gt;validUrls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="n"&gt;inValidUrls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;unverified&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;validators&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;validUrls&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;inValidUrls&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We have iterated over all the unverified URLs and validated them using the &lt;strong&gt;url()&lt;/strong&gt; function of the &lt;strong&gt;validators&lt;/strong&gt; library. If a URL is valid, we append it to the &lt;em&gt;validUrls&lt;/em&gt; list; otherwise, we append it to the &lt;em&gt;inValidUrls&lt;/em&gt; list.&lt;/p&gt;
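
&lt;p&gt;If you would rather not depend on &lt;strong&gt;validators&lt;/strong&gt;, a rougher check can be built from the standard library. The sketch below is not how &lt;strong&gt;validators&lt;/strong&gt; works; it is a simpler stand-in that only tests for an http or https scheme and a host. The function name and sample values are hypothetical.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from urllib.parse import urlparse


def looks_like_web_url(candidate):
    """Rough check: http(s) scheme plus a non-empty host. Not a full validator."""
    parts = urlparse(candidate)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


# Made-up sample of href values.
unverified = ["https://dev.to/", "/india", "javascript:void(0)", "mailto:someone@example.com"]

valid_urls = [u for u in unverified if looks_like_web_url(u)]
invalid_urls = [u for u in unverified if not looks_like_web_url(u)]

print(valid_urls)  # ['https://dev.to/']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;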

&lt;p&gt;With that, we have separated the valid and invalid URLs from the page using web scraping.&lt;/p&gt;

&lt;p&gt;Using the above procedure, you can scrape as many websites as you want and build your custom dataset for testing or for your project.&lt;/p&gt;
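
&lt;p&gt;To keep the scraped URLs as a dataset, one simple option is to write them to a CSV file. The helper name, the file name and the sample URLs below are assumptions for illustration; in practice you would pass in your own list of valid URLs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import csv


def save_urls(urls, path):
    """Write one URL per row under a single 'url' header."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["url"])
        for url in urls:
            writer.writerow([url])


# Hypothetical file name and sample data.
save_urls(["https://dev.to/", "https://example.com/page"], "valid_urls.csv")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;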

&lt;p&gt;Check out the full code on GitHub:&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/deaxparadox/ArtOfAIProject/tree/main/101-web-scraping-using-bs4/blog" rel="noopener noreferrer"&gt;Web Scraping 1&lt;/a&gt;&lt;/p&gt;

</description>
      <category>scraping</category>
      <category>beautifulsoup</category>
      <category>datascience</category>
      <category>python</category>
    </item>
  </channel>
</rss>
