DEV Community

Natalia D

Posted on • Edited on


My machine-learning pet project. Part 2. Preparing my dataset

My pet-project is about food recognition. More info here.


Data sets in tutorials vs data sets in the wild. From Towards AI on Twitter

The first thing that came to mind was to scrape stuff from some Instagram account. Have you seen how many recipes are there??? Millions. And they have descriptions, from which I could extract labels. I thought it would be easy. I managed to scrape about 10 posts using this:

instaloader profile nytcooking --no-videos --no-metadata-json --slide 1 --post-filter='date_utc >= datetime(2012, 5, 21)' --sanitize-paths

So far so good. But then I tried to scrape a bit more, e.g. three months of posts, and started getting 429 Too Many Requests errors. Creating new Instagram profiles didn't help. How could I, alone, beat an army of well-paid developers? I needed another approach.
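Backoff alone wouldn't have beaten Instagram's limits, but for completeness: the usual first response to a 429 is to retry with exponentially growing delays. A minimal sketch with a stand-in fetch function (not instaloader's API):

```python
import time

def fetch_with_backoff(fetch, url, max_retries=5, base_delay=1.0):
    """Retry `fetch(url)` with exponential backoff while it returns HTTP 429."""
    for attempt in range(max_retries):
        status, body = fetch(url)
        if status != 429:
            return body
        # Wait 1s, 2s, 4s, ... before the next attempt.
        time.sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"still rate-limited after {max_retries} tries")

# Fake fetch that rate-limits the first two calls, for demonstration only.
calls = {"n": 0}
def fake_fetch(url):
    calls["n"] += 1
    return (429, "") if calls["n"] < 3 else (200, "ok")

print(fetch_with_backoff(fake_fetch, "https://example.com", base_delay=0.01))  # prints "ok"
```

Real scrapers would also respect the Retry-After header when the server sends one.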

I chose one of the recipe websites that I used to visit. It has good photos and descriptions, and it's easy to scrape. I picked Scrapy for the job. It's actively maintained (last commit: 5 days ago), has good documentation and readable code.

I saved one sample webpage to my desktop and launched scrapy shell:

scrapy shell ../Desktop/test.html

This helped me prepare a bunch of selectors like this one:

recipe.xpath('./p[@class="material-anons__des"]//text()').get()
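scrapy shell is just for prototyping; for a quick sanity check outside Scrapy, the same XPath idea can be approximated with the standard library (the material-anons__des class comes from the site's markup, but the sample HTML below is made up):

```python
import xml.etree.ElementTree as ET

# Made-up snippet standing in for one recipe card from the saved page.
html = """<div>
  <p class="material-anons__des">A quick <b>weeknight</b> pasta.</p>
</div>"""

root = ET.fromstring(html)
# Rough equivalent of recipe.xpath('./p[@class="material-anons__des"]//text()'):
# ElementTree can match the element, then itertext() gathers the nested text nodes.
node = root.find("./p[@class='material-anons__des']")
text = "".join(node.itertext())
print(text)  # A quick weeknight pasta.
```

Scrapy's own selectors (built on parsel/lxml) are far more forgiving with real-world HTML; ElementTree only handles well-formed markup, so this is strictly a prototyping trick.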

Then I created a project template with a command like this:

scrapy startproject myproject [project_dir]

I also took a look at their example repo on GitHub.

The only thing I struggled with for a while was item pipelines. At one point I copied some stuff from Stack Overflow into settings.py, and Scrapy started complaining, something like "file A has an error now". I went to the Scrapy GitHub repo and found a nice little comment in file A about what really should be in the settings file now. So this is what I added to settings.py:

FILES_STORE = './csv'
IMAGES_STORE = './images'
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1
}
FEEDS = {
    './csv/items-%(batch_id)d': {
        'format': 'csv'
    }
}
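With ImagesPipeline enabled like this, each item the spider yields needs an image_urls list; the pipeline downloads those files into IMAGES_STORE and records the results under an images key. A hypothetical helper showing the item shape (build_item and the field values are my own, not from the project):

```python
def build_item(title, description, image_url):
    """Build a plain-dict item in the shape ImagesPipeline expects."""
    return {
        "title": title.strip(),
        "description": description.strip(),
        # ImagesPipeline reads this list, downloads each URL,
        # and adds an 'images' key with download results.
        "image_urls": [image_url],
    }

item = build_item("  Borscht ", "Classic beet soup.", "https://example.com/borscht.jpg")
print(item["image_urls"])  # ['https://example.com/borscht.jpg']
```

Plain dicts work because Scrapy accepts them as items; a scrapy.Item subclass with the same field names would do the same job.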

These settings allowed me to download images and produce a *.csv file with the parsed titles, descriptions and image URLs without writing any extra lines of code. In my opinion, Scrapy is a very powerful instrument, and I totally recommend using it when you want to scrape some data (just not from Instagram).

The next post is going to be about cleaning up the scraped data and labelling it.
