DEV Community

Condé Nast Italy

When Food meets AI: the Smart Recipe Project

Part 1: Cleaning and manipulating food data

Recipes, cooking videos and food photos are everywhere on the web, which is today the largest archive of food-related content.
But what happens when this huge amount of data meets Artificial Intelligence? In the Smart Recipe Project, we answered that question by developing systems able to interpret and extract information from food recipes.
Are you wondering how?

The project step-by-step:

  1. using NLP techniques, we enriched data, labeling entities and adding entity-specific information;
  2. exploiting state-of-the-art ML and DL models, we developed services able to automatically extract information from recipes;
  3. adopting the Amazon Neptune technology, we built graph databases to store and navigate relationships among data.

But first... we collected and cleaned the data.

Data Extraction

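Using Python and its text manipulation libraries, we extracted the recipes from TSV databases:

import pandas as pd

def data_extractor(df_content, df_ingredients, df_steps, start, dim):
   # Collect (row index, recipe text) pairs for rows start .. start+dim-1.
   # Assumes the three columns share a default 0-based integer index.
   list_cell = []
   for n, cell in enumerate(df_content[start:start+dim]):
      if pd.notna(cell):
         list_cell.append((start+n, cell))
      else:
         # No full text: fall back to the ingredients and steps of the same row.
         list_cell.append((start+n, df_ingredients[start+n] + '\n' + df_steps[start+n]))
   return list_cell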
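
As a rough illustration of the fallback logic above, the same idea can be written with pandas column operations. This is only a sketch: the column names `content`, `ingredients` and `steps`, the helper name `load_recipes` and the inline sample rows are assumptions made for the example, not the project's actual schema or data.

```python
import io

import pandas as pd

# Hypothetical TSV with made-up rows; the real column names are not given in the article.
SAMPLE_TSV = (
    "content\tingredients\tsteps\n"
    "Boil the pasta and add the sauce.\tpasta, sauce\tBoil. Mix.\n"
    "\t2 eggs, salt\tBeat the eggs. Fry them.\n"
)


def load_recipes(tsv_source):
    """Read a TSV and return one text per row, using ingredients + steps when content is missing."""
    df = pd.read_csv(tsv_source, sep="\t")
    fallback = df["ingredients"] + "\n" + df["steps"]
    # fillna aligns on the index, so each gap is filled from the same row.
    return df["content"].fillna(fallback)


recipes = load_recipes(io.StringIO(SAMPLE_TSV))
for idx, text in recipes.items():
    print(idx, repr(text))
```

Passing a file path instead of the `io.StringIO` object should work the same way, since `pd.read_csv` accepts both.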

Data Cleaning

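Then we cleaned them with a few regular expressions:

import re

def clean_recipe(recipe, regex_list):
   # regex_list is assumed to hold (pattern, replacement) pairs, applied in order.
   recipe = recipe.lower()
   for pattern, replacement in regex_list:
      recipe = re.sub(pattern, replacement, recipe)
   return recipe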
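
To make the cleaning step more concrete, here is a small sketch of what a rule list might look like. The three rules in `REGEX_LIST` and the sample text are invented for illustration; the article does not list the rules the project actually used. The function is restated so the block runs on its own.

```python
import re

# Hypothetical (pattern, replacement) pairs; the project's real rules are not shown in the article.
REGEX_LIST = [
    (r"<[^>]+>", " "),  # drop leftover HTML tags
    (r"&nbsp;", " "),   # replace a common HTML entity
    (r"\s+", " "),      # collapse runs of whitespace, newlines included
]


def clean_recipe(recipe, regex_list):
    """Lowercase the text, then apply each (pattern, replacement) pair in order."""
    recipe = recipe.lower()
    for pattern, replacement in regex_list:
        recipe = re.sub(pattern, replacement, recipe)
    return recipe


raw = "<p>Chop the  ONIONS&nbsp;finely.</p>\n<p>Fry for 5 minutes.</p>"
cleaned = clean_recipe(raw, REGEX_LIST).strip()
print(cleaned)  # chop the onions finely. fry for 5 minutes.
```

Rule order matters here: the whitespace rule comes last so that it also collapses the gaps left by the earlier substitutions.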

Data preprocessing

Finally, we 1) tokenized and 2) POS-tagged the data with NLTK:

import nltk

def tokenize(recipe):
   # Split into sentences, then into word tokens (requires NLTK's "punkt" data).
   # A multi-word-expression tokenizer (nltk.MWETokenizer) could be applied on top of these tokens.
   sentences = nltk.sent_tokenize(recipe, language="english")
   return [nltk.word_tokenize(sentence, language="english") for sentence in sentences]

def pos_tagger(recipe, regex_list):
   # Returns one list per sentence, each item being a "token<TAB>tag" string.
   tokenized_text = tokenize(clean_recipe(recipe, regex_list))
   tagged_tokens = [[str(token).lower() + "\t" + str(tag)
                     for token, tag in nltk.pos_tag(tokens)]
                    for tokens in tokenized_text]
   return tagged_tokens
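
The tagger above yields, for each sentence, strings of the form token, tab, tag. One common way to store data of this kind is one token per line with a blank line between sentences. The sketch below shows that layout; whether the project stored its data exactly this way is an assumption, the helper name `to_token_per_line` is hypothetical, and the tagged pairs are written by hand in the shape `nltk.pos_tag` returns rather than produced by a real tagger run.

```python
# Hand-written (token, tag) pairs using Penn Treebank style tags.
# They are illustrative only, not real tagger output.
TAGGED_SENTENCES = [
    [("chop", "VB"), ("the", "DT"), ("onions", "NNS"), (".", ".")],
    [("fry", "VB"), ("for", "IN"), ("5", "CD"), ("minutes", "NNS"), (".", ".")],
]


def to_token_per_line(tagged_sentences):
    """Render each sentence as 'token<TAB>tag' lines, with a blank line between sentences."""
    blocks = [
        "\n".join(f"{token.lower()}\t{tag}" for token, tag in sentence)
        for sentence in tagged_sentences
    ]
    return "\n\n".join(blocks) + "\n"


output = to_token_per_line(TAGGED_SENTENCES)
print(output)
```

Keeping sentence boundaries as blank lines makes it easy to read the data back sentence by sentence later on.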

Curious about the output? Head over to Medium to read the complete article and find out more about the most appetizing stages of our work.


When Food meets AI: the Smart Recipe Project
a series of 6 amazing articles

Table of contents

Part 1: Cleaning and manipulating food data
Part 1: A smart method for tagging your datasets
Part 2: NER for all tastes: extracting information from cooking recipes
Part 2: Neither fish nor fowl? Classify it with the Smart Ingredient Classifier
Part 3: FoodGraph: a graph database to connect recipes and food data
Part 3: FoodGraph: Loading data and Querying the graph with SPARQL
