<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mustafa A. Elghrib</title>
    <description>The latest articles on DEV Community by Mustafa A. Elghrib (@maelghrib).</description>
    <link>https://dev.to/maelghrib</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1249560%2Fb8678190-1688-44ba-b540-5ce3092d7f52.png</url>
      <title>DEV Community: Mustafa A. Elghrib</title>
      <link>https://dev.to/maelghrib</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/maelghrib"/>
    <language>en</language>
    <item>
      <title>How We Generated a 10K Dataset Using LLM to Fine-Tune Another LLM</title>
      <dc:creator>Mustafa A. Elghrib</dc:creator>
      <pubDate>Fri, 07 Jun 2024 11:40:33 +0000</pubDate>
      <link>https://dev.to/maelghrib/how-we-generated-a-10k-dataset-using-llm-to-fine-tune-another-llm-3nkh</link>
      <guid>https://dev.to/maelghrib/how-we-generated-a-10k-dataset-using-llm-to-fine-tune-another-llm-3nkh</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;After &lt;a href="https://dev.to/maelghrib/three-levels-of-scrapping-data-from-basic-to-advanced-to-pro-2i6p"&gt;scraping the data&lt;/a&gt; that would be used as input for the LLM, we needed to create a dataset to fine-tune an LLM for Tarwiiga AdGen, a Google Ads generator using AI developed at Tarwiiga. The tool takes an input and produces a JSON output. Since we were already relying on LLMs like OpenAI's GPT, Google's Gemini, and Anthropic's Claude to generate ads with special prompts, using LangChain parsers to get JSON, we wanted to use the same approach to generate the dataset. Here, I discuss our approach to generating a 10K dataset.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnjdjujrxkx97a3p5ex2o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnjdjujrxkx97a3p5ex2o.png" alt="Generating a 10K dataset using LLM" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Old Approach
&lt;/h2&gt;

&lt;p&gt;But before that, I want to mention that I first tried to make the LLM generate everything, from inputs to outputs. I asked it for a list of 10 inputs, then looped through that list to generate the JSON outputs and saved them to a CSV file. However, I found that each time I requested a list of inputs, it generated many duplicates. I think this happened because the LLM's API was caching the responses. While this issue could be worked around to reduce the number of duplicates, I decided to work with real data of the kind I expect the tool to receive in the future. Besides, generating all the inputs before even starting on the outputs was taking too long. A sketch of that first attempt follows below.&lt;/p&gt;
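
&lt;p&gt;To make that concrete, here is a minimal sketch of the old approach, assuming the OpenAI Python SDK; the prompts and model name are illustrative, not the actual AdGen prompts:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch of the old approach: ask the LLM for a batch of inputs,
# then generate a JSON ad for each one (illustrative prompts only).
import csv
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(prompt):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Step 1: ask for 10 inputs -- in practice this returned many duplicates.
inputs = json.loads(ask("Return a JSON array of 10 short product queries."))

# Step 2: generate a JSON ad per input and save input/output pairs to CSV.
with open("dataset.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["input", "output"])
    for text in inputs:
        ad = ask(f"Generate a Google Ad for '{text}' as a JSON object.")
        writer.writerow([text, ad])
&lt;/code&gt;&lt;/pre&gt;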

&lt;h2&gt;
  
  
  Scraping Data
&lt;/h2&gt;

&lt;p&gt;That's why I &lt;a href="https://dev.to/maelghrib/three-levels-of-scrapping-data-from-basic-to-advanced-to-pro-2i6p"&gt;scraped data&lt;/a&gt; to use as input. With the approach I followed, as mentioned in the article, I was able to scrape millions of data points. Specifically, I scraped data from 12 categories, with each category containing 5,000 pages. Each page had about 20 inputs, giving a total of 12 * 5,000 * 20 = 1,200,000 inputs. In reality, some pages contained more than 20 inputs, so I ended up with 1,239,232 data points. There were a lot of duplicate inputs - 1,173,847 to be exact - leaving me with 65,385 unique data points. While this approach didn't completely eliminate duplicate inputs, it was much faster to get inputs from another source than to rely on the LLM. Now the LLM can focus solely on generating outputs.&lt;/p&gt;
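
&lt;p&gt;For reference, the duplicate counts above can be reproduced with a few lines of pandas, assuming the scraped inputs live in a CSV with an "input" column (the file name is hypothetical):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Counting duplicates in the scraped inputs (hypothetical file name).
import pandas as pd

df = pd.read_csv("scraped_inputs.csv")

total = len(df)                 # 1,239,232 rows in my case
unique = df["input"].nunique()  # 65,385 unique inputs
duplicates = total - unique     # 1,173,847 duplicates

print(total, unique, duplicates)
&lt;/code&gt;&lt;/pre&gt;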

&lt;h2&gt;
  
  
  Quick Overview
&lt;/h2&gt;

&lt;p&gt;As I was sending requests to LLM APIs, I needed a way to manage the generation process efficiently. I started with one category and looped through 200 pages, with each page including around 20 inputs, sometimes a bit more or less. This process allowed me to generate around 3,859 data points for the first category. For another category, I generated around 3,899 data points, and for a third, 2,171 data points. In total, that amounted to 3,859 + 3,899 + 2,171 = 9,929 data points, which is approximately a 10K dataset.&lt;/p&gt;

&lt;p&gt;During the generation process, I was able to fine-tune Google's Gemma 2B on a 1K dataset, which yielded very good results. I will discuss fine-tuning in a future post, but for now, I want to focus on how I handled the generation process.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Generation Process
&lt;/h2&gt;

&lt;p&gt;The process is basic, and I didn't do any optimization initially; I just wanted to start and see how things would go. To understand it, let's start from the bottom up. First, there's the AdGen code, which takes an input and generates a JSON output representing the Google Ad elements. It's built around a special prompt, with parsers to extract the JSON.&lt;/p&gt;
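
&lt;p&gt;As a rough sketch of that bottom layer, assuming LangChain with an OpenAI chat model (the prompt and JSON keys here are illustrative, not the actual AdGen prompt), the input-to-JSON step looks something like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Prompt template, chat model, and JSON parser chained together so the
# model's answer comes back as a Python dict (illustrative prompt only).
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template(
    "Generate a Google Ad for '{query}'. Return only a JSON object "
    "with the keys: headline, description, keywords."
)
llm = ChatOpenAI(model="gpt-3.5-turbo")
parser = JsonOutputParser()

chain = prompt | llm | parser  # prompt, then model, then JSON parser

ad = chain.invoke({"query": "running shoes"})
print(ad["headline"])
&lt;/code&gt;&lt;/pre&gt;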

&lt;p&gt;With around 20 inputs per page, I divided them into chunks of size 5. Above this sits a loop that goes through pages to collect inputs. I made it loop through 10 pages, taking the roughly 20 inputs from each page and dividing them into chunks of 5. For each input, a request was sent to the LLM, and the output was saved to a CSV file. The result was a category folder with 200 page subfolders, each containing 4 dataset CSV files, as sketched below.&lt;/p&gt;
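
&lt;p&gt;Here is a sketch of that orchestration loop; get_page_inputs and generate_ad are hypothetical placeholders for the scraper and the LLM call:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Loop over a range of pages, split each page's inputs into chunks of 5,
# and save one CSV per chunk (get_page_inputs/generate_ad are placeholders).
import csv
import os

CHUNK_SIZE = 5

def chunked(items, size):
    for i in range(0, len(items), size):
        yield items[i:i + size]

def run(category, start_page, end_page):
    for page in range(start_page, end_page):
        inputs = get_page_inputs(category, page)  # around 20 inputs per page
        folder = os.path.join(category, f"page_{page}")
        os.makedirs(folder, exist_ok=True)
        # 20 inputs in chunks of 5 gives 4 dataset CSV files per page.
        for n, chunk in enumerate(chunked(inputs, CHUNK_SIZE)):
            path = os.path.join(folder, f"dataset_{n}.csv")
            with open(path, "w", newline="") as f:
                writer = csv.writer(f)
                writer.writerow(["input", "output"])
                for text in chunk:
                    writer.writerow([text, generate_ad(text)])
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Calling run("category_1", 0, 10), then run("category_1", 10, 20), and so on is how the separate scripts described next split the work by page range.&lt;/p&gt;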

&lt;p&gt;This process took a long time with some LLMs, like GPT-4, and was faster with others, like GPT-3.5 and Gemini Pro 1.5. I think GPT-4 was slower because it was busy with other users' requests, though I'm not certain. There were also some issues with Gemini requiring a lot of retries. I ended up running the same script multiple times, changing the range of pages each time: the first script from page 0 to 10, the second from page 10 to 20, and so on.&lt;/p&gt;

&lt;p&gt;While I think this approach could be optimized and improved, my goal was to quickly generate a dataset for fine-tuning. With this approach, I was able to generate a 10K dataset, which is very good for fine-tuning any LLM, though it contains duplicate inputs. The unique inputs, as mentioned above, were around 65K. Generating a 65K dataset would require optimizing the code to make it faster, but that's not necessary for now; it can be done later.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;I hope this article was helpful to you. Please don't hesitate to ask me any questions, and you can reach me on &lt;a href="https://twitter.com/maelghrib" rel="noopener noreferrer"&gt;Twitter (X)&lt;/a&gt; and &lt;a href="https://www.linkedin.com/in/maelghrib" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>llm</category>
      <category>ai</category>
    </item>
    <item>
      <title>Three Levels of Scraping Data: From Basic to Advanced to Pro</title>
      <dc:creator>Mustafa A. Elghrib</dc:creator>
      <pubDate>Thu, 06 Jun 2024 20:05:36 +0000</pubDate>
      <link>https://dev.to/maelghrib/three-levels-of-scrapping-data-from-basic-to-advanced-to-pro-2i6p</link>
      <guid>https://dev.to/maelghrib/three-levels-of-scrapping-data-from-basic-to-advanced-to-pro-2i6p</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;When I was working on Tarwiiga AdGen at &lt;a href="https://tarwiiga.com" rel="noopener noreferrer"&gt;Tarwiiga&lt;/a&gt;, I needed to fine-tune an LLM for Google Ads generation, but no suitable dataset existed, so I had to create one from scratch. The tool takes an input of two or three words and returns a JSON output, so we needed a dataset containing different inputs paired with JSON outputs. Getting the inputs was a problem: I tried making the LLM suggest inputs, but the results contained lots of duplicates, so I decided to scrape online data to use as input. I started with regular programming to scrape the data, then improved the process with AI, and finally found another killer approach that helped me scrape millions of data points in no time. Here, I walk through those three levels of scraping data for anyone who may be interested.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvogokf05ke5pwad9lre0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvogokf05ke5pwad9lre0.png" alt="Post Cover" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Basic Level: Scraping with regular programming
&lt;/h2&gt;

&lt;p&gt;At the basic level, I used regular Python scripts with Selenium and BeautifulSoup: Selenium simulates user behavior in the browser, while BeautifulSoup with Requests handles fetching HTML and extracting text. This worked for basic cases where the data is public or sits on the first page, without requiring you to scrape endless feeds. But when a site required login or had an endless feed, it became a nightmare, and I couldn't manage to fix all of the bugs!&lt;/p&gt;
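
&lt;p&gt;A minimal example of this level, with an illustrative URL and CSS selector: Selenium drives the browser, and BeautifulSoup parses the rendered HTML:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Basic level: browser automation plus HTML parsing (illustrative targets).
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com/products")

soup = BeautifulSoup(driver.page_source, "html.parser")
titles = [tag.get_text(strip=True) for tag in soup.select(".product-title")]

driver.quit()
print(titles)
&lt;/code&gt;&lt;/pre&gt;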

&lt;h2&gt;
  
  
  Advanced Level: Scraping with AI
&lt;/h2&gt;

&lt;p&gt;At the advanced level, while this didn't solve the problems of the basic level, it introduced a new way of scraping. Instead of manually writing Python code that parses HTML and extracts data into JSON, I shifted this task to the LLM: just give it the HTML and tell it to extract the content you want and return it as JSON. Looping through a list of HTML documents gives you a list of JSON objects that you can save to a CSV file or a database. However, this raised new problems: the LLM's limited token length, and the fact that the LLM sometimes won't give accurate results.&lt;/p&gt;
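
&lt;p&gt;A sketch of this level, assuming the OpenAI SDK (the extraction prompt and fields are illustrative): hand each HTML document to the LLM and parse the JSON it returns:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Advanced level: let the LLM extract structured JSON from raw HTML.
# Long pages may exceed the model's context window (the token-length issue).
import json

from openai import OpenAI

client = OpenAI()

def extract(html):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": "Extract every product name and price from this HTML "
                       "and return them as a JSON array of objects:\n" + html,
        }],
    )
    return json.loads(response.choices[0].message.content)

# html_documents would be your scraped pages; each call yields JSON objects.
records = [extract(html) for html in html_documents]
&lt;/code&gt;&lt;/pre&gt;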

&lt;h2&gt;
  
  
  Pro Level: Scraping with the request URL
&lt;/h2&gt;

&lt;p&gt;At the pro level, you don't need any of the above. You can just take the request URL, especially if it is an endless feed, and get the JSON response directly. To do this, open the browser's inspector, go to the Network tab, refresh the page, then scroll and track the requests until you find the one that fetches the data. Take that URL and make a request to it, and it will give you the response in JSON. This approach was very powerful for me: I managed to scrape millions of data points in no time, I didn't face any problems with endless feeds or login, and I didn't even need to depend on AI and its context-length limit.&lt;/p&gt;

&lt;p&gt;Here is a detailed guide on how to get data from the Twitter (X) feed using the request URL.&lt;/p&gt;

&lt;p&gt;First, open &lt;a href="https://x.com" rel="noopener noreferrer"&gt;https://x.com&lt;/a&gt;, then open the inspector and go to the Network tab. Refresh, and start scrolling until you find the HomeTimeline request. Click on it, and it will show you all the details of the request and response.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fckp5t1bib54kv8jxvatz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fckp5t1bib54kv8jxvatz.png" alt="Inspector Network Tab Preview" width="800" height="364"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then go to the Preview tab to see a preview of the response.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa7r877fq0ls5g3qvvtg0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa7r877fq0ls5g3qvvtg0.png" alt="Response Preview" width="800" height="546"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And here is the payload:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw4h1ay8omb6sxuba44kw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw4h1ay8omb6sxuba44kw.png" alt="Payload Preview" width="800" height="546"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Take the request URL and the payload and put them in Postman.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feosut95ogpqzgqmcoxdm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feosut95ogpqzgqmcoxdm.png" alt="Postman Preview" width="800" height="596"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Add any headers that come with the request, then click Send, and you will get the response.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa5keuklersaxpzyslkl1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa5keuklersaxpzyslkl1.png" alt="Headers Preview" width="800" height="606"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can click the code icon in Postman to get the code for this request in your language. Here is the code in Python.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftcirmgi6x8h08iboyt34.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftcirmgi6x8h08iboyt34.png" alt="Python Code Preview" width="800" height="626"&gt;&lt;/a&gt;&lt;/p&gt;
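
&lt;p&gt;The generated code is roughly this shape; the query ID, payload, and header values below are placeholders, so copy the real ones from the inspector or Postman:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Pro level: replay the captured request URL directly (placeholder values).
import requests

url = "https://x.com/i/api/graphql/SOME_QUERY_ID/HomeTimeline"

payload = {"variables": {"count": 20}, "features": {}}
headers = {
    "authorization": "Bearer TOKEN_COPIED_FROM_THE_REQUEST",
    "content-type": "application/json",
    "cookie": "COPIED_FROM_THE_BROWSER",
}

response = requests.post(url, json=payload, headers=headers)
data = response.json()  # the same JSON you saw in the Preview tab
&lt;/code&gt;&lt;/pre&gt;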

&lt;p&gt;And that is it! This way, you can get millions of data points, as I did on other websites specific to our use cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;I hope this article was helpful to you. Please don't hesitate to ask me any questions, and you can reach me on &lt;a href="https://twitter.com/maelghrib" rel="noopener noreferrer"&gt;Twitter (X)&lt;/a&gt; and &lt;a href="https://www.linkedin.com/in/maelghrib" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>python</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
