Natalia D

My machine-learning pet project. Part 2. Preparing my dataset

My pet project is about food recognition. More info here.


Data sets in tutorials vs data sets in the wild. From Towards AI on Twitter

The first thing that came to my mind was to scrape posts from some Instagram account. Have you seen how many recipes there are??? Millions. And they have descriptions, from which I could extract labels. I thought it would be easy. I managed to scrape about 10 posts using this:

instaloader profile nytcooking --no-videos --no-metadata-json --slide 1 --post-filter='date_utc >= datetime(2012, 5, 21)' --sanitize-paths

So far so good. But then I tried to scrape a bit more, e.g. posts for 3 months, and started getting 429 Too Many Requests errors. Creating new Instagram profiles didn't help. How could I, alone, beat an army of well-paid developers? I needed another approach.

I chose one of the recipe websites that I used to visit. It has good photos and descriptions, and it's easy to scrape. I picked Scrapy to do the job. It's actively supported (last commit: 5 days ago), has good documentation and readable code.

I saved one sample webpage to my desktop and launched scrapy shell:

scrapy shell ../Desktop/test.html

It helped me prepare a bunch of selectors like this:

recipe.xpath('./p[@class="material-anons__des"]//text()').get()
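For context, the recipe variable in that selector is a parent selector I picked first, so the relative ./p query stays inside one recipe card. Inside the shell the session looked roughly like this (the container class below is just an illustration; only material-anons__des comes from the real page):

# inside `scrapy shell ../Desktop/test.html`
# grab every recipe card first (container class here is a guess, for illustration)
recipes = response.xpath('//div[@class="material-anons"]')

for recipe in recipes:
    # relative XPath: the leading "./" keeps the query inside this card
    title = recipe.xpath('.//h3//text()').get()
    description = recipe.xpath('./p[@class="material-anons__des"]//text()').get()
    print(title, description)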

Then I created a project template using a command like this:

scrapy startproject myproject [project_dir]

I took a look at their example repo on GitHub.
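The spider itself ended up only a bit longer than those selectors. Here is a stripped-down sketch of the idea (the spider name, start URL and most of the XPaths are placeholders, not the real site):

import scrapy


class RecipesSpider(scrapy.Spider):
    name = 'recipes'
    # placeholder URL; the real spider starts from the site's recipe listing
    start_urls = ['https://example.com/recipes']

    def parse(self, response):
        for recipe in response.xpath('//div[@class="material-anons"]'):
            yield {
                'title': recipe.xpath('.//h3//text()').get(),
                'description': recipe.xpath(
                    './p[@class="material-anons__des"]//text()').get(),
                # ImagesPipeline downloads whatever ends up in this field
                'image_urls': [
                    response.urljoin(url)
                    for url in recipe.xpath('.//img/@src').getall()
                ],
            }

        # follow pagination if there is a "next" link (placeholder selector)
        next_page = response.xpath('//a[@rel="next"]/@href').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)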

The only thing I struggled with for some time was item pipelines. At one point I copied some stuff from Stack Overflow into settings.py, and Scrapy started to complain, something like "file A has an error now". I went to the Scrapy GitHub repo and found a nice little comment in that file about what should really be in the settings file now. So that's what I added to settings.py:

FILES_STORE = './csv'
# where ImagesPipeline saves the downloaded images
IMAGES_STORE = './images'
ITEM_PIPELINES = {
    # built-in pipeline that downloads everything listed in an item's image_urls
    'scrapy.pipelines.images.ImagesPipeline': 1
}
FEEDS = {
    # export parsed items to CSV, one file per batch
    './csv/items-%(batch_id)d': {
        'format': 'csv'
    }
}

These settings allowed me to download images and produce a *.csv file with parsed titles, descriptions and image paths without writing any extra lines of code. In my opinion, Scrapy is a very powerful instrument, so I totally recommend using it when you want to scrape some data (not from Instagram).
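One detail worth knowing: by default ImagesPipeline looks for an image_urls field on each item and writes the download results into an images field, so the spider has to yield one of the two. Plain dicts with an image_urls key are enough, and so is a tiny Item class like this (the other field names are just my examples):

import scrapy


class RecipeItem(scrapy.Item):
    title = scrapy.Field()
    description = scrapy.Field()
    # required by the default ImagesPipeline
    image_urls = scrapy.Field()
    # filled in by the pipeline after the downloads finish
    images = scrapy.Field()

After that, a single scrapy crawl <spider name> run downloads the photos into ./images and writes the CSV defined in FEEDS.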

The next post is going to be about brushing up scraped data and labelling it.
