<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Umberto Calice</title>
    <description>The latest articles on DEV Community by Umberto Calice (@insidbyte).</description>
    <link>https://dev.to/insidbyte</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1078565%2F905b8099-c1b0-4764-b00d-06e26dc390ff.png</url>
      <title>DEV Community: Umberto Calice</title>
      <link>https://dev.to/insidbyte</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/insidbyte"/>
    <language>en</language>
    <item>
      <title>Analysis And Generation Model ML</title>
      <dc:creator>Umberto Calice</dc:creator>
      <pubDate>Mon, 08 May 2023 10:37:44 +0000</pubDate>
      <link>https://dev.to/insidbyte/analysisandgenerationmodelml-15l</link>
      <guid>https://dev.to/insidbyte/analysisandgenerationmodelml-15l</guid>
      <description>&lt;h1&gt;
  
  
  Analysis_And_Generation_Model_ML
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;BEFORE READING THIS REPOSITORY IT IS RECOMMENDED TO START FROM:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/insidbyte/Analysis_and_processing"&gt;https://github.com/insidbyte/Analysis_and_processing&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;I decided to generate a custom vocabulary to train the model, so it is worth reviewing that repository's code first.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;




&lt;h1&gt;
  
  
  SEE THIS REPOSITORY AT: &lt;a href="https://github.com/insidbyte/Analysis_And_Generation_Model_ML"&gt;https://github.com/insidbyte/Analysis_And_Generation_Model_ML&lt;/a&gt;
&lt;/h1&gt;




&lt;p&gt;&lt;strong&gt;&lt;em&gt;OPTIONS:&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1)-GENERATE MODEL

&lt;p&gt;2)-TEST WITH HYPERPARAMETER TUNING&lt;/p&gt;

&lt;p&gt;3)-PLOT WITH TFIDF VECTORIZER AND SVD TRUNCATED REDUCTION&lt;br&gt;
&lt;/p&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  Menu
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;When ModelsGenerator.py is launched from the terminal, the following menu appears:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WuzYzeS5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/aj6wwy60xud3raf0iapv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WuzYzeS5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/aj6wwy60xud3raf0iapv.png" alt="Image description" width="679" height="102"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;&lt;em&gt;OPTION 1:&lt;/em&gt;&lt;/strong&gt;
&lt;/h1&gt;

&lt;h3&gt;
  
  
  Model generation:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;I decided to use TF-IDF and a support vector machine because they are well suited to text processing, and an SVM with a linear kernel works particularly well for two-class classification, as in our case: positive and negative.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8QbYEZHU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hdebr7viz6shxl7c26jr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8QbYEZHU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hdebr7viz6shxl7c26jr.png" alt="Image description" width="792" height="1045"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Kaggle IMDb dataset example:
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KzEDf_FS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/c7mng52u3jt45hubyj6j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KzEDf_FS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/c7mng52u3jt45hubyj6j.png" alt="Image description" width="800" height="657"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;&lt;em&gt;I created a Client in Angular to send requests to a Python Server&lt;/em&gt;&lt;/strong&gt;
&lt;/h3&gt;

&lt;h1&gt;
  
  
  CLIENT:
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8EPpHNiU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pio900ekqiafdeq0swlx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8EPpHNiU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pio900ekqiafdeq0swlx.png" alt="Image description" width="800" height="510"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  SERVER:
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--k4s-cov0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/epsndnypy8cwo2i2zs0v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--k4s-cov0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/epsndnypy8cwo2i2zs0v.png" alt="Image description" width="800" height="73"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  RESPONSE FROM THE SERVER:
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BPjGKgva--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2t652130c0ykq10akvhs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BPjGKgva--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2t652130c0ykq10akvhs.png" alt="Image description" width="800" height="409"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  ANOTHER EXAMPLE:
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ksBDiWPP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/k7lfdcjizo2f6fftyw1e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ksBDiWPP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/k7lfdcjizo2f6fftyw1e.png" alt="Image description" width="793" height="481"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--N8ZdJp0v--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0csg4713mmmvoi4wa6ln.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--N8ZdJp0v--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0csg4713mmmvoi4wa6ln.png" alt="Image description" width="800" height="374"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;&lt;em&gt;OPTION 2:&lt;/em&gt;&lt;/strong&gt;
&lt;/h1&gt;

&lt;h3&gt;
  
  
  Test hyperparameters with GridSearchCV and the TF-IDF vectorizer:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;A good way to automate the test phase and save time searching for the parameters that yield the most accurate model is to use GridSearchCV, made available by scikit-learn. The code in ModelsGenerator.py must be customized for the dataset to be analyzed.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
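&lt;p&gt;&lt;em&gt;As a hedged sketch (the parameter grid and toy data here are illustrative assumptions, not the configuration in ModelsGenerator.py), a grid search over a TF-IDF + linear SVM pipeline looks like this:&lt;/em&gt;&lt;/p&gt;

```python
# Illustrative GridSearchCV sketch; grid values and data are assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("svm", LinearSVC())])
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],  # unigrams vs. unigrams+bigrams
    "svm__C": [0.1, 1.0, 10.0],              # regularization strength
}
search = GridSearchCV(pipeline, param_grid, cv=2)

texts = ["great movie", "loved it", "fine acting",
         "terrible plot", "boring film", "awful pacing"]
labels = [1, 1, 1, 0, 0, 0]
search.fit(texts, labels)
print(search.best_params_)
```

&lt;p&gt;&lt;em&gt;Every combination in the grid is cross-validated, which is why an over-large grid can run practically forever.&lt;/em&gt;&lt;/p&gt;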

&lt;h3&gt;
  &lt;strong&gt;&lt;em&gt;WARNING!&lt;/em&gt;&lt;/strong&gt;
&lt;/h3&gt;

&lt;h3&gt;
  &lt;strong&gt;&lt;em&gt;If we do not study the scikit-learn documentation, we could start a search that runs practically forever, so it is always advisable to know what we are doing.&lt;/em&gt;&lt;/strong&gt;
&lt;/h3&gt;

&lt;h3&gt;
  
  
  Link scikit-learn: &lt;a href="https://scikit-learn.org/"&gt;https://scikit-learn.org/&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Input:&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BdW57Lbe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/32fx2hqqbhhh3kjux3ht.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BdW57Lbe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/32fx2hqqbhhh3kjux3ht.png" alt="Image description" width="800" height="443"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Output:&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rtmfpQXC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/j5bjmxa681tr17fxctt7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rtmfpQXC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/j5bjmxa681tr17fxctt7.png" alt="Image description" width="800" height="37"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;&lt;em&gt;OPTION 3:&lt;/em&gt;&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Input:&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gt652dJn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xd4wbnxqf93prap424wv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gt652dJn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xd4wbnxqf93prap424wv.png" alt="Image description" width="800" height="304"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;This option is experimental: the reduction is not applied to model training because it yields too few components, and the 8 GB of RAM in my PC is not enough to generate more, even though the results are interesting!&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
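&lt;p&gt;&lt;em&gt;A minimal sketch of the idea, assuming the goal is a 2-D scatter plot of the documents (the toy data is illustrative):&lt;/em&gt;&lt;/p&gt;

```python
# Reduce a sparse TF-IDF matrix to 2 components with TruncatedSVD so each
# document becomes an (x, y) point that can be plotted. Toy data only.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["great movie, loved it", "terrible plot, boring",
         "wonderful acting", "awful and dull"]
matrix = TfidfVectorizer().fit_transform(texts)
coords = TruncatedSVD(n_components=2, random_state=0).fit_transform(matrix)
print(coords.shape)  # one (x, y) pair per document
```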

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Output:&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--syisDYZl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jqnh9lgb9jonha1gze9q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--syisDYZl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jqnh9lgb9jonha1gze9q.png" alt="Image description" width="800" height="413"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  CONCLUSION:
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;We obtained satisfactory results and generated a fairly accurate model. This repository will be updated over time.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;For info or collaborations contact me at: &lt;a href="mailto:u.calice@studenti.poliba.it"&gt;u.calice@studenti.poliba.it&lt;/a&gt;&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>machinelearning</category>
      <category>programming</category>
      <category>opensource</category>
    </item>
    <item>
      <title>DATASET Analysis and processing</title>
      <dc:creator>Umberto Calice</dc:creator>
      <pubDate>Mon, 08 May 2023 10:30:08 +0000</pubDate>
      <link>https://dev.to/insidbyte/dataset-analysis-and-processing-npn</link>
      <guid>https://dev.to/insidbyte/dataset-analysis-and-processing-npn</guid>
      <description>&lt;h1&gt;
  
  
  SEE THIS REPOSITORY AT: &lt;a href="https://github.com/insidbyte/Analysis_and_processing"&gt;https://github.com/insidbyte/Analysis_and_processing&lt;/a&gt;
&lt;/h1&gt;




&lt;p&gt;1)-Install Python with a version &amp;gt;= 3.9.*&lt;/p&gt;

&lt;p&gt;2)-Install virtualenv via pip:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install virtualenv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;3)-Access the folder where you want to create the virtual environment and type the command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;virtualenv --python C:\Path\To\Python\python.exe name_of_new_venv_folder
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;4)-Access the created folder with the command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd name_of_new_venv_folder
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;5)-Activate the virtual environment with the command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.\Sripts\activate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;6)-The following command creates a file called requirements.txt which enumerates the installed packages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip freeze &amp;gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;7)-This file can then be used by contributors to update virtual environments using the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install -r requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;8)-To return to normal system settings, use the command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;deactivate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Analysis_and_processing
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Performs a statistical analysis of the dataset and offers several options:&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;1)-Merges two datasets.&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;2)-Performs a first cleanup of the dataset.&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;3)-Analyzes and optionally eliminates stop words.&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;4)-Lemmatizes.&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;5)-Corrects the text.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;All the options present in this tool, with the exception of number 3, use multiprocessing.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The script starts as many processes as the machine has cores, so it is advisable to run it from the terminal and close any other activity running on the machine.&lt;br&gt;
The steps must be performed in order, otherwise the output dataset will not be reliable!&lt;/em&gt;&lt;/p&gt;
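&lt;p&gt;&lt;em&gt;The multiprocessing scheme can be sketched as follows (a simplified illustration, not the repository's actual code):&lt;/em&gt;&lt;/p&gt;

```python
# Simplified sketch: split the dataset into one chunk per CPU core and
# clean the chunks in parallel, then concatenate the results.
import multiprocessing as mp

def clean_chunk(rows):
    # Placeholder cleanup: strip whitespace and lowercase each row.
    return [row.strip().lower() for row in rows]

def parallel_clean(rows):
    cores = mp.cpu_count()
    size = max(1, len(rows) // cores)
    chunks = [rows[i:i + size] for i in range(0, len(rows), size)]
    with mp.Pool(cores) as pool:
        cleaned = pool.map(clean_chunk, chunks)
    # Concatenate the cleaned chunks back into one dataset.
    return [row for chunk in cleaned for row in chunk]

if __name__ == "__main__":
    print(parallel_clean(["  Great MOVIE  ", "  Bad PLOT  "]))
```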

&lt;h2&gt;
  
  
  First step:
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;We will remove special characters, websites, e-mail addresses, HTML code, and all the contractions of the English language. First we open the first file and write True on its first line.&lt;/em&gt;&lt;/p&gt;
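&lt;p&gt;&lt;em&gt;The cleanup pass can be sketched like this (the patterns and the tiny contraction table are illustrative assumptions, not the tool's exact rules):&lt;/em&gt;&lt;/p&gt;

```python
# Hedged sketch of the first cleanup pass: drop websites, e-mail
# addresses, and special characters, and expand a few contractions.
import re

CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is"}

def clean(text):
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # websites
    text = re.sub(r"\S+@\S+", " ", text)                # e-mail addresses
    text = re.sub(r"[^a-z\s]", " ", text)               # special characters
    return re.sub(r"\s+", " ", text).strip()

print(clean("It's GREAT!!! visit www.example.com or mail me@site.com"))
```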

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--y5dxxauF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bzvju59kyzu1saovijob.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--y5dxxauF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bzvju59kyzu1saovijob.png" alt="Image description" width="800" height="274"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Next we launch main.py.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--X7b81Qnt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pnnq6s5fm8hdes59lqck.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--X7b81Qnt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pnnq6s5fm8hdes59lqck.png" alt="Image description" width="180" height="40"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YVGHO1FA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/diwcp3ywuqv4esmd81wn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YVGHO1FA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/diwcp3ywuqv4esmd81wn.png" alt="Image description" width="462" height="220"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Then we insert the following Input:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6LRHLd8s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/q6cy3j86c0439hi6nnsm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6LRHLd8s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/q6cy3j86c0439hi6nnsm.png" alt="Image description" width="800" height="321"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Output:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--e4c3GIoH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/70ipmvzb9y2rtmk22nke.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--e4c3GIoH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/70ipmvzb9y2rtmk22nke.png" alt="Image description" width="800" height="788"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Second step:
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;This option corrects the supplied text, dividing it into 8 datasets and concatenating them to return the requested dataset. With 8 cores it took 9 hours for 60 MB of dataset! It is highly expensive in terms of CPU, memory, and execution time; I recommend doing this only if necessary.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Input:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CFh98c0Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0tm1mi7hn3kkotk143eh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CFh98c0Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0tm1mi7hn3kkotk143eh.png" alt="Image description" width="754" height="433"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Output:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YhSAta1l--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/67onvw665f05fnwsxls8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YhSAta1l--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/67onvw665f05fnwsxls8.png" alt="Image description" width="800" height="618"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Third step:
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Let's lemmatize, trimming the dataset a bit by replacing each inflected word with its root form (lemma).&lt;/em&gt;&lt;/p&gt;
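&lt;p&gt;&lt;em&gt;Conceptually, lemmatization maps each word to its root form. A toy sketch with a tiny hand-made lemma table (the real tool presumably relies on a full NLP library, so this table is purely an assumption):&lt;/em&gt;&lt;/p&gt;

```python
# Toy illustration of lemmatization: each word is looked up in a small
# lemma table and replaced by its root form; unknown words pass through.
LEMMAS = {"movies": "movie", "actors": "actor", "acted": "act"}

def lemmatize(text):
    return " ".join(LEMMAS.get(word, word) for word in text.split())

print(lemmatize("the actors acted in two movies"))
```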

&lt;p&gt;&lt;em&gt;Input:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZaLN8rwE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/chpn2ubxyb127fwkevcv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZaLN8rwE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/chpn2ubxyb127fwkevcv.png" alt="Image description" width="790" height="459"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Output&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--meTmE2ox--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7wedwuqg1jr08vuse3q8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--meTmE2ox--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7wedwuqg1jr08vuse3q8.png" alt="Image description" width="800" height="746"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Fourth step:
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;In this case it is not necessary, but if we have cleaned and lemmatized the positive and negative reviews separately, we need to merge them back into one dataset before proceeding to the analysis phase.&lt;/em&gt;&lt;/p&gt;
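&lt;p&gt;&lt;em&gt;The merge itself is a simple concatenation; a sketch with pandas, where the column names "review" and "sentiment" are hypothetical:&lt;/em&gt;&lt;/p&gt;

```python
# Sketch of the merge step: concatenate the separately processed positive
# and negative review frames into one dataset. Column names are assumed.
import pandas as pd

positive = pd.DataFrame({"review": ["loved it"], "sentiment": ["positive"]})
negative = pd.DataFrame({"review": ["hated it"], "sentiment": ["negative"]})
merged = pd.concat([positive, negative], ignore_index=True)
print(len(merged))
```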

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--valS6bEp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tpry5uut4xenos6ptaxj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--valS6bEp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tpry5uut4xenos6ptaxj.png" alt="Image description" width="657" height="649"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;As we can see, the merged dataset weighs less because the stop words were eliminated separately for the positive and negative subsets.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_cZ74TSu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/g4666vrugdxpv67w1muc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_cZ74TSu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/g4666vrugdxpv67w1muc.png" alt="Image description" width="730" height="159"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;After several tests we noticed that the merged dataset is less effective for model generation.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Fifth step:
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;This is the most important step because it greatly lightens the lemmatized, clean dataset. To add new stop words beyond those already present in the repository, just add the words to the text files:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uWMbVOMP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8v09ek0mtwu406v236lh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uWMbVOMP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8v09ek0mtwu406v236lh.png" alt="Image description" width="648" height="336"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Input:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--050nf-BU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/089ce2z4bhhe8ruitxjh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--050nf-BU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/089ce2z4bhhe8ruitxjh.png" alt="Image description" width="685" height="703"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Output:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UPRsbBVU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8w2gz3w0ofj6sm355w1p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UPRsbBVU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8w2gz3w0ofj6sm355w1p.png" alt="Image description" width="800" height="329"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7A6H6pSs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jh6g8ia6lve3xsxl3flo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7A6H6pSs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jh6g8ia6lve3xsxl3flo.png" alt="Image description" width="800" height="123"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;We can see how many positive and negative reviews the dataset has and perform word-cloud or n-gram analyses. Below are some images that show the effectiveness of the previous phases, along with some invaluable information for building personalized word lists.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Positive and negative review count:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FHaFFjBj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9pj1yquvqttai6u36gw1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FHaFFjBj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9pj1yquvqttai6u36gw1.png" alt="Image description" width="600" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Most Meaningful Words for Word Cloud:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Negative:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IcfZpstw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dzkigljao8qdknfwj3tw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IcfZpstw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dzkigljao8qdknfwj3tw.png" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Positive:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2hFQSCjM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/29su4k0hh3guem5lxcgt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2hFQSCjM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/29su4k0hh3guem5lxcgt.png" alt="Image description" width="800" height="460"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Most common words in the dataset:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Positive:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ANd0P4sZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/r5od3m4oley3kvo7p15o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ANd0P4sZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/r5od3m4oley3kvo7p15o.png" alt="Image description" width="800" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Negative:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---O1tuJtA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rcdvtbeyrzteqoqxjv8i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---O1tuJtA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rcdvtbeyrzteqoqxjv8i.png" alt="Image description" width="800" height="410"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Most common words in the dataset with NGRAMS 2:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--m9-dUnQ3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6lbdf7pfz0rqx37rdabs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--m9-dUnQ3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6lbdf7pfz0rqx37rdabs.png" alt="Image description" width="800" height="470"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h1&gt;
  
  
  CONCLUSIONS:
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;At this point we can say that we have created a complete tool that allows us to analyze and modify the dataset. In another repository I will show another useful tool for vectorization and hyperparameter tuning with GridSearchCV.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Link:&lt;/strong&gt; &lt;a href="https://github.com/insidbyte/Analysis_And_Generation_Model_ML"&gt;https://github.com/insidbyte/Analysis_And_Generation_Model_ML&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Documentation link for this repository&lt;/strong&gt;: &lt;a href="https://scikit-learn.org"&gt;https://scikit-learn.org&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>machinelearning</category>
      <category>ai</category>
      <category>datascience</category>
    </item>
    <item>
      <title>POTMY WEB-APP</title>
      <dc:creator>Umberto Calice</dc:creator>
      <pubDate>Sun, 07 May 2023 21:03:30 +0000</pubDate>
      <link>https://dev.to/insidbyte/potmy-web-app-mj6</link>
      <guid>https://dev.to/insidbyte/potmy-web-app-mj6</guid>
<description>&lt;p&gt;A web application written in JavaScript that uses the Spotify REST API to search for and listen to music, building a playlist from the searched tracks.&lt;/p&gt;

&lt;p&gt;Search modes:&lt;br&gt;
1)-&lt;strong&gt;Random&lt;/strong&gt;:&lt;br&gt;
based on an artist.&lt;br&gt;
2)-&lt;strong&gt;Specific&lt;/strong&gt;:&lt;br&gt;
based on title and artist.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.tourl"&gt;https://github.com/insidbyte/Potimy_App&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>javascript</category>
      <category>api</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
