When Food meets AI: the Smart Recipe Project

#machinelearning #artificialintelligence #food #python

Part 1: Cleaning and manipulating food data

Recipes, cooking videos, and food photos are everywhere on the web, which is today the largest archive of food-related content.
But what happens when this large amount of data meets Artificial Intelligence? In the Smart Recipe Project, we answered that question by developing systems able to interpret and extract information from food recipes.
Are you wondering how?

The project, step by step:

using NLP techniques, we enriched the data by labeling entities and adding entity-specific information;
exploiting state-of-the-art ML and DL models, we developed services that automatically extract information from recipes;
adopting Amazon Neptune, we built graph databases to store and navigate the relationships in the data.

But first... we collected and cleaned the data.

Data Extraction

Using Python and its data manipulation libraries, we extracted the recipes from TSV (tab-separated values) files:

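import pandas as pd

def data_extractor(df_content, df_ingredients, df_steps, start, dim):
   # Collect (row index, recipe text) pairs for rows start .. start+dim-1.
   list_cell = []
   for n, cell in enumerate(df_content[start:start+dim]):
      if str(cell) != 'nan':
         list_cell.append((start+n, cell))
      else:
         # No full text available: rebuild the recipe from ingredients and steps.
         list_cell.append((start+n, df_ingredients[start+n] + '\n' + df_steps[start+n]))
   return list_cell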
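
To see the fallback logic in action, here is a minimal, self-contained sketch that uses the standard-library csv module instead of pandas. The column names (content, ingredients, steps), the sample rows, and the function name extract_recipes are assumptions made for illustration; they are not the project's actual schema.

```python
import csv
import io

# Hypothetical TSV sample: the second row has no full-text content.
SAMPLE_TSV = (
    "content\tingredients\tsteps\n"
    "Boil the pasta for ten minutes.\tpasta, salt\tBoil water. Add pasta.\n"
    "\t2 eggs, butter\tBeat the eggs. Fry in butter.\n"
)

def extract_recipes(tsv_text, start=0, dim=None):
    """Return (row index, recipe text) pairs, falling back to
    ingredients + steps when the content cell is empty."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    end = len(rows) if dim is None else start + dim
    recipes = []
    for index, row in enumerate(rows[start:end], start=start):
        text = row["content"].strip()
        if not text:
            # Same fallback as above: join ingredients and steps.
            text = row["ingredients"] + "\n" + row["steps"]
        recipes.append((index, text))
    return recipes

recipes = extract_recipes(SAMPLE_TSV)
for index, text in recipes:
    print(index, repr(text))
```

Keeping the original row index next to each text makes it possible to trace a cleaned recipe back to its source row later on.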

Data Cleaning

Then we cleaned them with a couple of regular expressions:

import re

def clean_recipe(recipe, regex_list):
   # Each entry starts with a (pattern, replacement) pair; extra fields are ignored.
   for regex1, regex2, *_ in regex_list:
      recipe = re.sub(regex1, regex2, recipe.lower())
   return recipe
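
As an illustration, a list of substitutions could look like the sketch below. The patterns are hypothetical examples of typical clean-up rules (stripping HTML tags, collapsing whitespace), not the ones used in the project, and clean_text is a simplified stand-in for the function above.

```python
import re

# Hypothetical (pattern, replacement) pairs, applied in order.
EXAMPLE_REGEX_LIST = [
    (r"<[^>]+>", " "),   # drop HTML tags
    (r"&amp;", "and"),   # decode a common HTML entity
    (r"\s+", " "),       # collapse runs of whitespace
]

def clean_text(text, regex_list):
    """Lowercase the text and apply each substitution in turn."""
    text = text.lower()
    for pattern, replacement in regex_list:
        text = re.sub(pattern, replacement, text)
    return text.strip()

raw = "<p>Mix  Flour &amp; Sugar</p>\n<p>Bake for 20 minutes</p>"
cleaned = clean_text(raw, EXAMPLE_REGEX_LIST)
print(cleaned)
```

Order matters here: the whitespace rule runs last so that it also tidies the gaps left by the removed tags.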

Data Preprocessing

Finally, we 1) tokenized and 2) POS-tagged the data with NLTK:

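import nltk

def tokenize(recipe):
   # Split the recipe into sentences, then each sentence into word tokens.
   sentences = nltk.sent_tokenize(recipe, language="english")
   # word_tokenize is used here: MWETokenizer expects a list of
   # multi-word expressions, not a sentence.
   tokens = [nltk.word_tokenize(sentence, language="english")
             for sentence in sentences]
   return tokens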

def pos_tagger(recipe):
   # regex_list is expected to be defined at module level.
   tokenized_text = tokenize(clean_recipe(recipe, regex_list))
   # One list per sentence; each item is "token<TAB>tag".
   tagged_tokens = [[str(token).lower() + "\t" + str(tag)
                     for token, tag in nltk.pos_tag(tokens)]
                    for tokens in tokenized_text]
   return tagged_tokens
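
Each tagged token is a string of the form token, tab, tag, with one list per sentence. The sketch below shows one way such output could be written out and read back, one token per line with a blank line between sentences. The tagged sentences are hand-written for illustration (they are not actual NLTK output), and the function names to_lines and from_lines are hypothetical.

```python
# Hand-written example in the "token\ttag" format; not real tagger output.
tagged_sentences = [
    ["mix\tVB", "the\tDT", "flour\tNN"],
    ["bake\tVB", "for\tIN", "20\tCD", "minutes\tNNS"],
]

def to_lines(tagged_sentences):
    """Serialize sentences one token per line, blank line between sentences."""
    return "\n\n".join("\n".join(sentence) for sentence in tagged_sentences)

def from_lines(text):
    """Parse the serialized text back into lists of (token, tag) pairs."""
    sentences = []
    for block in text.split("\n\n"):
        pairs = [tuple(line.split("\t")) for line in block.split("\n") if line]
        if pairs:
            sentences.append(pairs)
    return sentences

serialized = to_lines(tagged_sentences)
parsed = from_lines(serialized)
print(serialized)
```

A tab-separated, one-token-per-line layout is easy to inspect by eye and to load again for later annotation steps.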