<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mustafa A. Elghrib</title>
    <description>The latest articles on DEV Community by Mustafa A. Elghrib (@maelghrib).</description>
    <link>https://dev.to/maelghrib</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1249560%2Fb8678190-1688-44ba-b540-5ce3092d7f52.png</url>
      <title>DEV Community: Mustafa A. Elghrib</title>
      <link>https://dev.to/maelghrib</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/maelghrib"/>
    <language>en</language>
    <item>
      <title>How We Generated a 10K Dataset Using LLM to Fine-Tune Another LLM</title>
      <dc:creator>Mustafa A. Elghrib</dc:creator>
      <pubDate>Fri, 07 Jun 2024 11:40:33 +0000</pubDate>
      <link>https://dev.to/maelghrib/how-we-generated-a-10k-dataset-using-llm-to-fine-tune-another-llm-3nkh</link>
      <guid>https://dev.to/maelghrib/how-we-generated-a-10k-dataset-using-llm-to-fine-tune-another-llm-3nkh</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;After &lt;a href="https://dev.to/maelghrib/three-levels-of-scrapping-data-from-basic-to-advanced-to-pro-2i6p"&gt;scraping the data&lt;/a&gt; that would be used as input for the LLM, we needed to create a dataset to fine-tune an LLM for Tarwiiga AdGen, a Google Ads generator using AI developed at Tarwiiga. The tool takes an input and produces a JSON output. Since we were already relying on LLMs like OpenAI's GPT, Google's Gemini, and Anthropic's Claude to generate ads with special prompts, using LangChain parsers to get JSON, we wanted to use the same approach to generate the dataset. Here, I discuss our approach to generating a 10K dataset.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnjdjujrxkx97a3p5ex2o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnjdjujrxkx97a3p5ex2o.png" alt="Generating a 10K dataset using LLM" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Old Approach
&lt;/h2&gt;

&lt;p&gt;But before that, I want to mention that I first tried to make the LLM generate everything, from inputs to outputs. I asked it for a list of 10 inputs, then looped through that list to generate the JSON outputs and saved them to a CSV file. However, I found that each time I requested a list of inputs, it generated many duplicates. I think this happened because the LLM's API was caching the responses. While this issue could be worked around to reduce the number of duplicates, I decided to work with real data of the kind I expect the tool to receive in the future. Besides, generating all the inputs before even starting on the outputs was taking too long. A sketch of that first attempt follows below.&lt;/p&gt;
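
&lt;p&gt;To make that concrete, here is a minimal sketch of the old approach, assuming the OpenAI Python SDK; the prompts and model name are illustrative, not the actual AdGen prompts:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch of the old approach: ask the LLM for a batch of inputs,
# then generate a JSON ad for each one (illustrative prompts only).
import csv
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(prompt):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Step 1: ask for 10 inputs -- in practice this returned many duplicates.
inputs = json.loads(ask("Return a JSON array of 10 short product queries."))

# Step 2: generate a JSON ad per input and save input/output pairs to CSV.
with open("dataset.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["input", "output"])
    for text in inputs:
        ad = ask(f"Generate a Google Ad for '{text}' as a JSON object.")
        writer.writerow([text, ad])
&lt;/code&gt;&lt;/pre&gt;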

&lt;h2&gt;
  
  
  Scraping Data
&lt;/h2&gt;

&lt;p&gt;That's why I &lt;a href="https://dev.to/maelghrib/three-levels-of-scrapping-data-from-basic-to-advanced-to-pro-2i6p"&gt;scraped data&lt;/a&gt; to use as input. With the approach I followed, as mentioned in the article, I was able to scrape millions of data points. Specifically, I scraped data from 12 categories, with each category containing 5,000 pages. Each page had about 20 inputs, giving a total of 12 * 5,000 * 20 = 1,200,000 inputs. In reality, some pages contained more than 20 inputs, so I ended up with 1,239,232 data points. There were a lot of duplicate inputs - 1,173,847 to be exact - leaving me with 65,385 unique data points. While this approach didn't completely eliminate duplicate inputs, it was much faster to get inputs from another source than to rely on the LLM. Now the LLM can focus solely on generating outputs.&lt;/p&gt;
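
&lt;p&gt;For reference, the duplicate counts above can be reproduced with a few lines of pandas, assuming the scraped inputs live in a CSV with an "input" column (the file name is hypothetical):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Counting duplicates in the scraped inputs (hypothetical file name).
import pandas as pd

df = pd.read_csv("scraped_inputs.csv")

total = len(df)                 # 1,239,232 rows in my case
unique = df["input"].nunique()  # 65,385 unique inputs
duplicates = total - unique     # 1,173,847 duplicates

print(total, unique, duplicates)
&lt;/code&gt;&lt;/pre&gt;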

&lt;h2&gt;
  
  
  Quick Overview
&lt;/h2&gt;

&lt;p&gt;As I was sending requests to LLM APIs, I needed a way to manage the generation process efficiently. I started with one category and looped through 200 pages, with each page including around 20 inputs, sometimes a bit more or less. This process allowed me to generate around 3,859 data points for the first category. For another category, I generated around 3,899 data points, and for a third, 2,171 data points. In total, that amounted to 3,859 + 3,899 + 2,171 = 9,929 data points, which is approximately a 10K dataset.&lt;/p&gt;

&lt;p&gt;During the generation process, I was able to fine-tune Google's Gemma 2B on a 1K dataset, which yielded very good results. I will discuss fine-tuning in a future post, but for now, I want to focus on how I handled the generation process.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Generation Process
&lt;/h2&gt;

&lt;p&gt;The process is basic, and I didn't do any optimization initially; I just wanted to start and see how things would go. To understand it, let's start from the bottom up. First, there's the AdGen code, which takes an input and generates a JSON output representing the Google Ad elements. It's built around a special prompt, with parsers to extract the JSON.&lt;/p&gt;
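
&lt;p&gt;As a rough sketch of that bottom layer, assuming LangChain with an OpenAI chat model (the prompt and JSON keys here are illustrative, not the actual AdGen prompt), the input-to-JSON step looks something like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Prompt template, chat model, and JSON parser chained together so the
# model's answer comes back as a Python dict (illustrative prompt only).
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template(
    "Generate a Google Ad for '{query}'. Return only a JSON object "
    "with the keys: headline, description, keywords."
)
llm = ChatOpenAI(model="gpt-3.5-turbo")
parser = JsonOutputParser()

chain = prompt | llm | parser  # prompt, then model, then JSON parser

ad = chain.invoke({"query": "running shoes"})
print(ad["headline"])
&lt;/code&gt;&lt;/pre&gt;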

&lt;p&gt;With around 20 inputs per page, I divided them into chunks of size 5. Above this sits a loop that goes through pages to collect inputs. I made it loop through 10 pages, taking the roughly 20 inputs from each page and dividing them into chunks of 5. For each input, a request was sent to the LLM, and the output was saved to a CSV file. The result was a category folder with 200 page subfolders, each containing 4 dataset CSV files, as sketched below.&lt;/p&gt;
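
&lt;p&gt;Here is a sketch of that orchestration loop; get_page_inputs and generate_ad are hypothetical placeholders for the scraper and the LLM call:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Loop over a range of pages, split each page's inputs into chunks of 5,
# and save one CSV per chunk (get_page_inputs/generate_ad are placeholders).
import csv
import os

CHUNK_SIZE = 5

def chunked(items, size):
    for i in range(0, len(items), size):
        yield items[i:i + size]

def run(category, start_page, end_page):
    for page in range(start_page, end_page):
        inputs = get_page_inputs(category, page)  # around 20 inputs per page
        folder = os.path.join(category, f"page_{page}")
        os.makedirs(folder, exist_ok=True)
        # 20 inputs in chunks of 5 gives 4 dataset CSV files per page.
        for n, chunk in enumerate(chunked(inputs, CHUNK_SIZE)):
            path = os.path.join(folder, f"dataset_{n}.csv")
            with open(path, "w", newline="") as f:
                writer = csv.writer(f)
                writer.writerow(["input", "output"])
                for text in chunk:
                    writer.writerow([text, generate_ad(text)])
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Calling run("category_1", 0, 10), then run("category_1", 10, 20), and so on is how the separate scripts described next split the work by page range.&lt;/p&gt;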

&lt;p&gt;This process took a long time with some LLMs, like GPT-4, and was faster with others, like GPT-3.5 and Gemini Pro 1.5. I think GPT-4 was slower because it was busy with other users' requests, though I'm not certain. There were also some issues with Gemini requiring a lot of retries. I ended up running the same script multiple times, changing the range of pages each time: the first script from page 0 to 10, the second from page 10 to 20, and so on.&lt;/p&gt;

&lt;p&gt;While I think this approach could be optimized and improved, my goal was to quickly generate a dataset for fine-tuning. With this approach, I was able to generate a 10K dataset, which is very good for fine-tuning any LLM, though it contains duplicate inputs. The unique inputs, as mentioned above, were around 65K. Generating a 65K dataset would require optimizing the code to make it faster, but that's not necessary for now; it can be done later.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;I hope this article was helpful to you. Please don't hesitate to ask me any questions, and you can reach me on &lt;a href="https://twitter.com/maelghrib" rel="noopener noreferrer"&gt;Twitter (X)&lt;/a&gt; and &lt;a href="https://www.linkedin.com/in/maelghrib" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>llm</category>
      <category>ai</category>
    </item>
    <item>
      <title>Three Levels of Scraping Data: From Basic to Advanced to Pro</title>
      <dc:creator>Mustafa A. Elghrib</dc:creator>
      <pubDate>Thu, 06 Jun 2024 20:05:36 +0000</pubDate>
      <link>https://dev.to/maelghrib/three-levels-of-scrapping-data-from-basic-to-advanced-to-pro-2i6p</link>
      <guid>https://dev.to/maelghrib/three-levels-of-scrapping-data-from-basic-to-advanced-to-pro-2i6p</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;When I was working on Tarwiiga AdGen at &lt;a href="https://tarwiiga.com" rel="noopener noreferrer"&gt;Tarwiiga&lt;/a&gt;, I needed to fine-tune an LLM for Google Ads generation, but no suitable dataset existed, so I had to create one from scratch. The tool takes an input of two or three words and returns a JSON output, so we needed a dataset containing different inputs paired with JSON outputs. Getting the inputs was a problem: I tried making the LLM suggest inputs, but the results contained lots of duplicates, so I decided to scrape online data to use as input. I started with regular programming to scrape the data, then improved the process with AI, and finally found another killer approach that helped me scrape millions of data points in no time. Here, I walk through those three levels of scraping data for anyone who may be interested.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvogokf05ke5pwad9lre0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvogokf05ke5pwad9lre0.png" alt="Post Cover" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Basic Level: Scraping with regular programming
&lt;/h2&gt;

&lt;p&gt;At the basic level, I used regular Python scripts with Selenium and BeautifulSoup: Selenium simulates user behavior in the browser, while BeautifulSoup with Requests handles fetching HTML and extracting text. This worked for basic cases where the data is public or sits on the first page, without requiring you to scrape endless feeds. But when a site required login or had an endless feed, it became a nightmare, and I couldn't manage to fix all of the bugs!&lt;/p&gt;
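
&lt;p&gt;A minimal example of this level, with an illustrative URL and CSS selector: Selenium drives the browser, and BeautifulSoup parses the rendered HTML:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Basic level: browser automation plus HTML parsing (illustrative targets).
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com/products")

soup = BeautifulSoup(driver.page_source, "html.parser")
titles = [tag.get_text(strip=True) for tag in soup.select(".product-title")]

driver.quit()
print(titles)
&lt;/code&gt;&lt;/pre&gt;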

&lt;h2&gt;
  
  
  Advanced Level: Scraping with AI
&lt;/h2&gt;

&lt;p&gt;At the advanced level, while this didn't solve the problems of the basic level, it introduced a new way of scraping. Instead of manually writing Python code that parses HTML and extracts data into JSON, I shifted this task to the LLM: just give it the HTML and tell it to extract the content you want and return it as JSON. Looping through a list of HTML documents gives you a list of JSON objects that you can save to a CSV file or a database. However, this raised new problems: the LLM's limited token length, and the fact that the LLM sometimes won't give accurate results.&lt;/p&gt;
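
&lt;p&gt;A sketch of this level, assuming the OpenAI SDK (the extraction prompt and fields are illustrative): hand each HTML document to the LLM and parse the JSON it returns:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Advanced level: let the LLM extract structured JSON from raw HTML.
# Long pages may exceed the model's context window (the token-length issue).
import json

from openai import OpenAI

client = OpenAI()

def extract(html):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": "Extract every product name and price from this HTML "
                       "and return them as a JSON array of objects:\n" + html,
        }],
    )
    return json.loads(response.choices[0].message.content)

# html_documents would be your scraped pages; each call yields JSON objects.
records = [extract(html) for html in html_documents]
&lt;/code&gt;&lt;/pre&gt;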

&lt;h2&gt;
  
  
  Pro Level: Scraping with the request URL
&lt;/h2&gt;

&lt;p&gt;At the pro level, you don't need any of the above. You can just take the request URL, especially if it is an endless feed, and get the JSON response directly. To do this, open the browser's inspector, go to the Network tab, refresh the page, then scroll and track the requests until you find the one that fetches the data. Take that URL and make a request to it, and it will give you the response in JSON. This approach was very powerful for me: I managed to scrape millions of data points in no time, I didn't face any problems with endless feeds or login, and I didn't even need to depend on AI and its context-length limit.&lt;/p&gt;

&lt;p&gt;Here is a detailed guide on how to get data from the Twitter (X) feed using the request URL.&lt;/p&gt;

&lt;p&gt;First, open &lt;a href="https://x.com" rel="noopener noreferrer"&gt;https://x.com&lt;/a&gt;, then open the inspector and go to the Network tab. Refresh, and start scrolling until you find the HomeTimeline request. Click on it, and it will show you all the details of the request and response.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fckp5t1bib54kv8jxvatz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fckp5t1bib54kv8jxvatz.png" alt="Inspector Network Tab Preview" width="800" height="364"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then go to the Preview tab to see a preview of the response.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa7r877fq0ls5g3qvvtg0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa7r877fq0ls5g3qvvtg0.png" alt="Response Preview" width="800" height="546"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And here is the payload:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw4h1ay8omb6sxuba44kw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw4h1ay8omb6sxuba44kw.png" alt="Payload Preview" width="800" height="546"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Take the request URL and the payload and put them in Postman.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feosut95ogpqzgqmcoxdm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feosut95ogpqzgqmcoxdm.png" alt="Postman Preview" width="800" height="596"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Add any headers that come with the request, then click Send, and you will get the response.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa5keuklersaxpzyslkl1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa5keuklersaxpzyslkl1.png" alt="Headers Preview" width="800" height="606"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can click the code icon in Postman to get the code for this request in your language. Here is the code in Python.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftcirmgi6x8h08iboyt34.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftcirmgi6x8h08iboyt34.png" alt="Python Code Preview" width="800" height="626"&gt;&lt;/a&gt;&lt;/p&gt;
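
&lt;p&gt;The generated code is roughly this shape; the query ID, payload, and header values below are placeholders, so copy the real ones from the inspector or Postman:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Pro level: replay the captured request URL directly (placeholder values).
import requests

url = "https://x.com/i/api/graphql/SOME_QUERY_ID/HomeTimeline"

payload = {"variables": {"count": 20}, "features": {}}
headers = {
    "authorization": "Bearer TOKEN_COPIED_FROM_THE_REQUEST",
    "content-type": "application/json",
    "cookie": "COPIED_FROM_THE_BROWSER",
}

response = requests.post(url, json=payload, headers=headers)
data = response.json()  # the same JSON you saw in the Preview tab
&lt;/code&gt;&lt;/pre&gt;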

&lt;p&gt;And that is it! This way, you can get millions of data points, as I did on other websites specific to our use cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;I hope this article was helpful to you. Please don't hesitate to ask me any questions, and you can reach me on &lt;a href="https://twitter.com/maelghrib" rel="noopener noreferrer"&gt;Twitter (X)&lt;/a&gt; and &lt;a href="https://www.linkedin.com/in/maelghrib" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>python</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
