Part 1: Cleaning and manipulating food data
Recipes, cooking videos, and food photos are everywhere on the web, which today is the largest archive of food-related content.
But what happens when this large amount of data meets Artificial Intelligence? In the Smart Recipe Project, we answered that question by developing systems able to interpret and extract information from food recipes.
Are you wondering how?
The project step-by-step:
- using NLP techniques, we enriched data, labeling entities and adding entity-specific information;
- exploiting state-of-the-art ML and DL models, we developed services able to automatically extract information from recipes;
- adopting the Amazon Neptune technology, we built graph databases to store and navigate relationships among data.
But first... we collected and cleaned the data.
Data Extraction
Using Python and its data manipulation libraries, we extracted the recipes from TSV files:
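import pandas as pd

def data_extractor(df_content, df_ingredients, df_steps, start, dim):
    """Collect (row_index, recipe_text) pairs for rows start..start+dim-1."""
    list_cell = []
    for n, cell in enumerate(df_content[start:start + dim]):
        row = start + n
        if pd.notna(cell):
            # Use the full recipe text when it is available.
            list_cell.append((row, cell))
        else:
            # Otherwise rebuild the recipe from its ingredients and steps.
            list_cell.append((row, df_ingredients[row] + '\n' + df_steps[row]))
    return list_cell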
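As a minimal sketch of how the extraction step might be called, the example below builds a tiny in-memory TSV and falls back to ingredients plus steps when the full recipe text is missing. The column names (`content`, `ingredients`, `steps`) and the sample rows are assumptions made for illustration; the layout of the original files is not shown in this article. The sketch also uses `.iloc` so that rows are always addressed by position.

```python
import io
import pandas as pd

# Hypothetical TSV layout: the column names and rows are illustrative only.
SAMPLE_TSV = (
    "content\tingredients\tsteps\n"
    "Boil the pasta. Add the sauce.\tpasta, sauce\tBoil. Mix.\n"
    "\teggs, flour\tWhisk the eggs. Fold in the flour.\n"
)

def data_extractor(df_content, df_ingredients, df_steps, start, dim):
    """Return (row_index, recipe_text) pairs for rows start..start+dim-1."""
    rows = []
    for n, cell in enumerate(df_content.iloc[start:start + dim]):
        row = start + n
        if pd.notna(cell):
            # The full recipe text is present: keep it as is.
            rows.append((row, cell))
        else:
            # Missing text: rebuild it from the ingredients and steps columns.
            rows.append((row, df_ingredients.iloc[row] + "\n" + df_steps.iloc[row]))
    return rows

# Empty fields in the TSV are read as NaN, which triggers the fallback branch.
df = pd.read_csv(io.StringIO(SAMPLE_TSV), sep="\t")
recipes = data_extractor(df["content"], df["ingredients"], df["steps"], 0, 2)
for index, text in recipes:
    print(index, repr(text))
```

The second sample row has no `content` value, so its text is assembled from the other two columns.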
Data Cleaning
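Then we cleaned them with a set of regular expressions:
import re

def clean_recipe(recipe, regex_list):
    # regex_list is assumed to hold (pattern, replacement) pairs, applied in order.
    recipe = recipe.lower()
    for pattern, replacement in regex_list:
        recipe = re.sub(pattern, replacement, recipe)
    return recipe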
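To make the cleaning step concrete, here is a small sketch with a hypothetical `regex_list`. The three patterns below (HTML tags, bracketed notes, repeated whitespace) are assumptions chosen for illustration, not the project's actual rules, and the final `strip()` is an addition of this sketch.

```python
import re

# Hypothetical (pattern, replacement) pairs; the project's real rules are not shown.
regex_list = [
    (r"<[^>]+>", " "),      # drop HTML tags
    (r"\[[^\]]*\]", " "),   # drop bracketed editorial notes
    (r"\s+", " "),          # collapse runs of whitespace, including newlines
]

def clean_recipe(recipe, regex_list):
    """Lowercase the recipe, then apply each substitution in order."""
    recipe = recipe.lower()
    for pattern, replacement in regex_list:
        recipe = re.sub(pattern, replacement, recipe)
    # Trim the leading/trailing space left behind by the substitutions.
    return recipe.strip()

raw = "<p>Preheat  the Oven.</p> [see note]\nChop 2 Onions."
cleaned = clean_recipe(raw, regex_list)
print(cleaned)
```

The order of the pairs matters: the whitespace rule runs last so that it also collapses the gaps left by the earlier substitutions.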
Data Preprocessing
Finally, we tokenized and POS-tagged the data with NLTK:
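import nltk

def tokenize(recipe):
    # Split the recipe into sentences, then each sentence into word tokens.
    # nltk.MWETokenizer can be applied to these tokens afterwards to merge
    # multi-word expressions; it does not split raw sentences by itself.
    sentences = nltk.sent_tokenize(recipe, language="english")
    return [nltk.word_tokenize(sentence, language="english")
            for sentence in sentences]

def pos_tagger(recipe):
    # regex_list is expected to be defined at module level.
    tokenized_text = tokenize(clean_recipe(recipe, regex_list))
    # Build one "token<TAB>tag" string per token, grouped by sentence.
    tagged_tokens = [[str(token).lower() + "\t" + str(tag)
                      for token, tag in nltk.pos_tag(tokens)]
                     for tokens in tokenized_text]
    return tagged_tokens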
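NLTK's `MWETokenizer` works on text that is already split into tokens and merges known multi-word expressions into single tokens. The sketch below shows that usage on a pre-split sentence; the expression list (`olive oil`, `baking powder`) is a made-up example, and whether the project used the class this way is an assumption.

```python
from nltk.tokenize import MWETokenizer

# Hypothetical multi-word ingredients to keep together as single tokens.
mwe_tokenizer = MWETokenizer(
    [("olive", "oil"), ("baking", "powder")],
    separator="_",
)

# MWETokenizer expects a list of tokens, not a raw string.
tokens = "add the olive oil and baking powder".split()
merged = mwe_tokenizer.tokenize(tokens)
print(merged)
```

Keeping an ingredient such as "olive oil" as one token can make later tagging steps treat it as a single unit.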
Curious about the output? Head over to Medium to read the complete article and find out more about the most appetizing stages of our work.
When Food meets AI: the Smart Recipe Project
a series of 6 amazing articles
Table of contents
Part 1: Cleaning and manipulating food data
Part 1: A smart method for tagging your datasets
Part 2: NER for all tastes: extracting information from cooking recipes
Part 2: Neither fish nor fowl? Classify it with the Smart Ingredient Classifier
Part 3: FoodGraph: a graph database to connect recipes and food data
Part 3: FoodGraph: Loading data and Querying the graph with SPARQL