<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Condé Nast Italy</title>
    <description>The latest articles on DEV Community by Condé Nast Italy (@condenastitaly).</description>
    <link>https://dev.to/condenastitaly</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F420455%2F572c6877-6639-4610-8fb0-ba7c23c46b23.png</url>
      <title>DEV Community: Condé Nast Italy</title>
      <link>https://dev.to/condenastitaly</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/condenastitaly"/>
    <language>en</language>
    <item>
      <title>When Food meets AI: the Smart Recipe Project</title>
      <dc:creator>Condé Nast Italy</dc:creator>
      <pubDate>Fri, 07 Aug 2020 10:05:49 +0000</pubDate>
      <link>https://dev.to/condenastitaly/when-food-meets-ai-the-smart-recipe-project-b3g</link>
      <guid>https://dev.to/condenastitaly/when-food-meets-ai-the-smart-recipe-project-b3g</guid>
      <description>&lt;h1&gt;
  
  
  Part 3. FoodGraph: Loading data and Querying the graph with SPARQL
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vEn8HCUn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/12kth3n5kah1hgrmy0hl.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vEn8HCUn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/12kth3n5kah1hgrmy0hl.jpg" alt="Alt Text" width="880" height="550"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Did you ever try a Maritozzo?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In the previous post, we converted the recipe data, stored in JSON files, into RDF triples. In this post, we show you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;how we loaded this data on &lt;a href="https://aws.amazon.com/neptune/"&gt;Amazon Neptune&lt;/a&gt;;&lt;/li&gt;
&lt;li&gt;how we integrated the output of the extractor and classifier systems in FoodGraph;&lt;/li&gt;
&lt;li&gt;how we can query the graph to extract useful and connected information.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To query the graph, we use &lt;a href="https://www.w3.org/TR/rdf-sparql-query/"&gt;SPARQL&lt;/a&gt;. SPARQL is an RDF query language, namely a semantic query language for databases, able to retrieve and manipulate data stored or viewed in the RDF format.&lt;/p&gt;

&lt;h2&gt;
  
  
  Loading data on Amazon Neptune
&lt;/h2&gt;

&lt;p&gt;We followed the &lt;a href="https://aws.amazon.com/neptune/developer-resources/"&gt;documented procedure&lt;/a&gt; to load the RDF triples on the Amazon Neptune service.&lt;br&gt;
We used Amazon Simple Storage Service (Amazon S3): first we created an S3 bucket, then we uploaded the data to it. In this first phase, we loaded the RDF data that builds the first level of the graph (see the previous article).&lt;/p&gt;
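Concretely, a Neptune bulk load is started with an HTTP POST to the cluster's loader endpoint. A sketch of the request body follows (the bucket name, IAM role ARN, and region are placeholders, not our actual values):

```json
{
  "source": "s3://my-recipe-bucket/triples/",
  "format": "turtle",
  "iamRoleArn": "arn:aws:iam::123456789012:role/NeptuneLoadFromS3",
  "region": "us-east-1",
  "failOnError": "FALSE"
}
```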

&lt;p&gt;If we want to add only a few recipes at a time, we can alternatively use the SPARQL INSERT DATA statement:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--yElg-2q4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/k4b7beukmtjunnrsjm3v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--yElg-2q4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/k4b7beukmtjunnrsjm3v.png" alt="Alt Text" width="880" height="224"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Integrating the extractor and classifier services within the graph
&lt;/h2&gt;

&lt;p&gt;Once the recipes had been loaded, we checked whether any recipes had not yet been processed by the extractor and classifier services. This means checking which recipes do not have: &lt;br&gt;
i) food entity chunks extracted (the bnodes in the graph, see the previous article); &lt;br&gt;
ii) ingredients classified. &lt;/p&gt;

&lt;p&gt;This is the SPARQL query that checks whether bnodes exist in the graph (through the FILTER NOT EXISTS statement), which is equivalent to saying “return all the recipes without bnodes”:  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ATXPChk---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/xeg6ulusngn2cwyamn0o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ATXPChk---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/xeg6ulusngn2cwyamn0o.png" alt="Alt Text" width="880" height="350"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Extracting knowledge from the graph via SPARQL
&lt;/h2&gt;

&lt;p&gt;Now the graph is on Amazon Neptune. Let’s have fun with these connections, extracting knowledge from the graph: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FvBJsf3a--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/s478sowc3sj8qfcme72b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FvBJsf3a--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/s478sowc3sj8qfcme72b.png" alt="Alt Text" width="880" height="372"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With the above query we interrogate the graph to find out 1) whether there are recipes containing the ingredient “butter” and 2) which recipes they are. The WHERE statement navigates the graph following the pattern described in the triples to arrive at the query result. In this case, the output is the ids of the recipes that contain the ingredient “butter”.&lt;br&gt;
We can also query the graph to return recipes containing more than one ingredient, or all the recipes containing some ingredients and not others:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--a5yJVeIf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/9f1z17q9e1udjyfg4u1j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--a5yJVeIf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/9f1z17q9e1udjyfg4u1j.png" alt="Alt Text" width="880" height="466"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Smart Recipe Project: what has been done, what can be done
&lt;/h2&gt;

&lt;p&gt;With this last article, we conclude our overview of the main stages of the Smart Recipe Project, an innovative project involving, on one side, the global company &lt;a href="https://www.condenast.it"&gt;Condé Nast&lt;/a&gt; and, on the other, the IT company &lt;a href="https://www.res-group.com/en/"&gt;RES&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;We have in mind some interesting applications for the resources we developed under the Smart Recipe Project, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;personalization of content, personalized recipe search, newsletters;&lt;/li&gt;
&lt;li&gt;recommendation systems for food items, recipes, and menus, which integrate, where needed, dietary restrictions;&lt;/li&gt;
&lt;li&gt;virtual assistants, able to guide you in planning and cooking meals;&lt;/li&gt;
&lt;li&gt;smart cooking devices, and much more.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As always, &lt;a href="https://medium.com/@condenastitaly/when-food-meets-ai-the-smart-recipe-project-2bf0d20aaf79"&gt;go on Medium&lt;/a&gt; to read the complete article.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;When Food meets AI: the Smart Recipe Project&lt;/strong&gt;&lt;br&gt;
a series of 6 amazing articles&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Table of contents&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/condenastitaly/when-food-meets-ai-the-smart-recipe-project-mg2"&gt;Part 1: Cleaning and manipulating food data&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/condenastitaly/when-food-meets-ai-the-smart-recipe-project-29bf"&gt;Part 1: A smart method for tagging your datasets&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/condenastitaly/when-food-meets-ai-the-smart-recipe-project-2d6e"&gt;Part 2: NER for all tastes: extracting information from cooking recipes&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/condenastitaly/when-food-meets-ai-the-smart-recipe-project-21pg"&gt;Part 2: Neither fish nor fowl? Classify it with the Smart Ingredient Classifier&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/condenastitaly/when-food-meets-ai-the-smart-recipe-project-27pm"&gt;Part 3: FoodGraph: a graph database to connect recipes and food data&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/condenastitaly/when-food-meets-ai-the-smart-recipe-project-b3g"&gt;Part 3. FoodGraph: Loading data and Querying the graph with SPARQL&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>ai</category>
      <category>neptune</category>
      <category>sparql</category>
    </item>
    <item>
      <title>When Food meets AI: the Smart Recipe Project</title>
      <dc:creator>Condé Nast Italy</dc:creator>
      <pubDate>Mon, 03 Aug 2020 11:20:00 +0000</pubDate>
      <link>https://dev.to/condenastitaly/when-food-meets-ai-the-smart-recipe-project-27pm</link>
      <guid>https://dev.to/condenastitaly/when-food-meets-ai-the-smart-recipe-project-27pm</guid>
      <description>&lt;h1&gt;
  
  
  Part 3: FoodGraph: a graph database to connect recipes and food data
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZJ2mioZk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/jxgpn8kk4txg8zbgd8xj.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZJ2mioZk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/jxgpn8kk4txg8zbgd8xj.jpg" alt="Cuttlefish"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Delicious Cuttlefish&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After enriching data and developing ML and DL models to extract and classify elements from recipes, we moved a step further in the Smart Recipe Project, connecting the data and the outputs of the services (see the previous posts) in a graph database architecture. The goal is the creation of a knowledge base, named FoodGraph, where the different pieces of recipe information are connected to form a deep net of knowledge.&lt;/p&gt;

&lt;p&gt;In this two-section post:&lt;br&gt;
(SECTION 1): we give you some insights into the concepts and technologies used in designing a graph database;&lt;/p&gt;

&lt;p&gt;(SECTION 2): we show you our method for converting JSON files, containing the recipe data, into RDF triples, the data model we chose for constructing the graph.&lt;/p&gt;
&lt;h2&gt;
  
  
  Graph database: key concepts
&lt;/h2&gt;

&lt;p&gt;Graph databases are a NoSQL way to store and process data and the relationships among it, where relationships are as important as the data itself. In contrast to other approaches, graph databases are designed from the start to incorporate relationships, since they store connections alongside the data in the model.&lt;/p&gt;

&lt;p&gt;The building blocks of a graph database are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Nodes (or vertices) → the constructs standing for real-world entities participating in relationships.&lt;/li&gt;
&lt;li&gt;Edges (or links) → they represent connections and relationships among nodes and express the properties existing between the entities.&lt;/li&gt;
&lt;/ul&gt;
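As a toy illustration (plain Python, not a real graph database), these building blocks can be thought of as a set of (node, relationship, node) connections that you traverse by pattern:

```python
# A tiny in-memory "graph": each edge is (node, relationship, node).
edges = [
    ("recipe_1", "hasIngredient", "butter"),
    ("recipe_1", "hasIngredient", "flour"),
    ("butter", "belongsTo", "dairy"),
]

def neighbors(graph, node, relationship):
    """Follow edges of a given type starting from a node."""
    return [o for s, p, o in graph if s == node and p == relationship]

print(neighbors(edges, "recipe_1", "hasIngredient"))  # ['butter', 'flour']
```

A real graph database indexes these connections so traversals stay fast at scale, but the mental model is the same.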

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--s1vZfCFM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/gl4y6goqz8e9aa2wx05k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--s1vZfCFM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/gl4y6goqz8e9aa2wx05k.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;a href="https://www.w3.org/RDF/"&gt;RDF&lt;/a&gt;​:​ a data model to build the graph database.
&lt;/h2&gt;

&lt;p&gt;RDF stands for Resource Description Framework and is a data model that describes the semantics, or meaning, of information. The core structure of an RDF model is a set of triples, each consisting of a subject, a predicate, and an object, which together form an RDF graph, or triple store. Each RDF statement states a single thing about its subject by linking it to an object by means of a predicate, the property.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;lt;http://www.w3.org/TR/rdf-syntax-grammar&amp;gt; &amp;lt;http://purl.org/dc/elements/1.1/title&amp;gt; "RDF/XML Syntax Specification" .&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;In the example above, the triple states “The technical report on RDF syntax and grammar has&lt;br&gt;
the title RDF/XML Syntax Specification.”&lt;/p&gt;
&lt;h3&gt;
  
  
  Ontologies and Vocabularies.
&lt;/h3&gt;

&lt;p&gt;An ontology represents a formal description of a knowledge domain as a set of concepts and the relationships that hold between them. To enable such a description, we need to formally specify components such as individuals (instances of objects), classes, attributes, and relations, as well as restrictions, rules, and axioms. As a result, ontologies not only provide a sharable and reusable knowledge representation but can also add new knowledge about the domain and help with data integration when data comes from different datasets.&lt;/p&gt;
&lt;h3&gt;
  
  
  Logic and Inferences.
&lt;/h3&gt;

&lt;p&gt;Another important component of linked data is the possibility to perform inferences (or reasoning) on data through rules defined with the data itself. Inference means that automatic procedures performed by inference engines (or “reasoners”) can generate new relationships based on the data and some additional information in the form of an ontology. Thus the database can be used not only to retrieve information but also to deduce new information from facts in the data.&lt;/p&gt;
&lt;h3&gt;
  
  
  SPARQL.
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.w3.org/TR/rdf-sparql-query/"&gt;SPARQL&lt;/a&gt; ​is an RDF query language, namely a semantic query language for databases, able to retrieve and manipulate data stored in RDF format. The results of SPARQL queries return the resources for all triples that match the specified patterns and can be result sets or RDF graphs.&lt;/p&gt;
&lt;h3&gt;
  
  
  Amazon Neptune.
&lt;/h3&gt;

&lt;p&gt;Amazon Neptune is a graph database service that simplifies the construction and the integration of applications working with highly connected datasets. Its engine is able to store billions of relationships which can be speedily navigated and queried.&lt;/p&gt;
&lt;h2&gt;
  
  
  Convert JSON file to RDF
&lt;/h2&gt;

&lt;p&gt;The first step in building the graph database consists of converting the JSON files, containing the recipe data, into RDF triples.&lt;br&gt;
With a few lines of code, we extracted data from the JSON file (using the Python library &lt;a href="https://docs.python.org/3/library/json.html"&gt;json&lt;/a&gt;) and converted it into RDF triples (in &lt;a href="https://www.w3.org/TR/turtle/"&gt;Turtle&lt;/a&gt; format), manually writing the RDF structure. This approach fits our task well, since the number of data types to convert is relatively small.&lt;br&gt;
The procedure to build the RDF triples consists, in general, of three steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prefix declaration&lt;/strong&gt; → The prefixes identify the ontologies/vocabularies describing the properties, classes, entities, and attributes used to build the graph. These elements can be referenced in a triple via a full URI or via a namespace prefix. In Turtle format, the prefixes are introduced by a “@” and stand at the beginning of the Turtle document.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data extraction&lt;/strong&gt; → Using the Python library json, we extract the data contained in the JSON array. This data represents the nodes of the RDF graph.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Writing RDF triples&lt;/strong&gt; → Using the extracted data and the declared ontologies, we manually write the RDF triples to a Turtle file. This data will then be loaded on Amazon Neptune.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is, for example, the JSON file containing 1) the output of the extractor service (see the previous post) and 2) other technical information about the NER model within the service:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sICTE5oI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/7f7aaimqamvivlfgpy4k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sICTE5oI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/7f7aaimqamvivlfgpy4k.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the code we used to convert the JSON to RDF triples:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;uuid&lt;/span&gt;

&lt;span class="n"&gt;lang_dict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s"&gt;"it"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"italian"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"wiki:Q652"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="s"&gt;"en"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"english"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"wiki:Q1860"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lev2_rdfgraph&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json_array&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lang_dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"path.ttl"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'w'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'utf-8'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;lev2_rdfgraph&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;#prefix declaration
&lt;/span&gt;        &lt;span class="n"&gt;lev2_rdfgraph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"""@prefix recipe:&amp;lt;http//www.example.com/recipe/&amp;gt;.
        @prefix example:&amp;lt;http://www.example.com/&amp;gt;.
        @prefix schema:&amp;lt;https://schema.org/&amp;gt;.
        @prefix rdf:&amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&amp;gt;.
        @prefix xs:&amp;lt;http://www.w3.org/2001/XMLSchema#&amp;gt;.
        @prefix dcterms:&amp;lt;http://purl.org/dc/terms/&amp;gt;.
        @prefix wiki:&amp;lt;http://wikidata.org/wiki/&amp;gt;.”””)   

#data extraction
        for js in json_array:
            id_recipe = js['id']
            model_date = js['info_services']['model_date']
            language = js['language']
        #write rdf triples              
        rdf_file.write("recipe:"+id_recipe+"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;schema:dateModified&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;" + str(model_date)+".&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;")
        #data extraction
       for chunk in js['intervals']:
           ingr, ingr_id  = "", ""
           for token in chunk['ingr']:
                ingr = str(ingr+token+" ")
                ingr_id = str(uuid.uuid4().hex()
                bnode_name = str(uuid.uuid4())
                #write rdf triples                
                lev2_rdfgraph.write("recipe:"+id_recipe+"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;schema:material&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;_:"+bnode_name+".&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;")                   
                if 'unit' in chunk.keys():
                    lev2_rdfgraph.write(“_:”+bnode_name+”&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;schema:materialExtent&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;\”+str(chunk[“unit”])+”\”.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;”)
                if “value” in chunk.keys():   
                lev2_rdfgraph.write("_:"+bnode_name+"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;rdf:value&lt;/span&gt;&lt;span class="se"&gt;\n\"&lt;/span&gt;&lt;span class="s"&gt;"+str(chunk['value]')+“\”.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;)
           lev2_rdfgraph.write("_:"+bnode_name+"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;schema:recipeIngredient&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;example:"+ingr_id+".&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;")                                                 
           lev2_rdfgraph.write("example:"+ingred_id+"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;xs:string&lt;/span&gt;&lt;span class="se"&gt;\n\"&lt;/span&gt;&lt;span class="s"&gt;+ingr.rstrip()+"&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;")
     lev2_rdfgraph.write("recipe:" + id_recipe + "&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;" + "dcterms:language" + "&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;" +   lang_dict[language][1] + ".&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;"                                                                                       
     lev2_rdfgraph.write(lang_dict[language][1] + '&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;xs:string&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"' + lang_dict[language][0] + '".&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;'

lev2_rdfgraph(json_array, lang_dict)


&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This is a graphic visualization of this piece of the graph. The nodes represent the subjects and the objects of the graph, while the edges represent the predicates (for clarity, in the figure the properties are in their extended form and not called via prefix).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5DI744Y5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/w580nen2bydunhnk0l8t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5DI744Y5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/w580nen2bydunhnk0l8t.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;FoodGraph is a three-level-deep graph. &lt;a href="https://medium.com/@condenastitaly/when-food-meets-ai-the-smart-recipe-project-eea259f53ed2"&gt;Discover the other levels of knowledge by reading the Medium article&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;When Food meets AI: the Smart Recipe Project&lt;/strong&gt;&lt;br&gt;
a series of 6 amazing articles&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Table of contents&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/condenastitaly/when-food-meets-ai-the-smart-recipe-project-mg2"&gt;Part 1: Cleaning and manipulating food data&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/condenastitaly/when-food-meets-ai-the-smart-recipe-project-29bf"&gt;Part 1: A smart method for tagging your datasets&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/condenastitaly/when-food-meets-ai-the-smart-recipe-project-2d6e"&gt;Part 2: NER for all tastes: extracting information from cooking recipes&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/condenastitaly/when-food-meets-ai-the-smart-recipe-project-21pg"&gt;Part 2: Neither fish nor fowl? Classify it with the Smart Ingredient Classifier&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/condenastitaly/when-food-meets-ai-the-smart-recipe-project-27pm"&gt;Part 3: FoodGraph: a graph database to connect recipes and food data&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/condenastitaly/when-food-meets-ai-the-smart-recipe-project-b3g"&gt;Part 3. FoodGraph: Loading data and Querying the graph with SPARQL&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>When food meets AI: the Smart Recipe Project</title>
      <dc:creator>Condé Nast Italy</dc:creator>
      <pubDate>Mon, 27 Jul 2020 06:35:35 +0000</pubDate>
      <link>https://dev.to/condenastitaly/when-food-meets-ai-the-smart-recipe-project-21pg</link>
      <guid>https://dev.to/condenastitaly/when-food-meets-ai-the-smart-recipe-project-21pg</guid>
      <description>&lt;h1&gt;
  
  
  Part 2. Neither fish nor fowl? Classify it with the Smart Ingredient Classifier
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cfDHdxFP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/p8mmrmathp81btcyjdyc.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cfDHdxFP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/p8mmrmathp81btcyjdyc.jpg" alt="Musssel soup" width="880" height="440"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Mussel soup&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In the previous article, we extracted food entities (ingredients, quantities, and units of measurement) from recipes. In this post, we classify each ingredient's taxonomic class using the &lt;a href="https://arxiv.org/abs/1810.04805"&gt;BERT&lt;/a&gt; model. In plain words, this means classifying Emmental as a cheese, an orange as a fruit, peas as vegetables, and so on for each ingredient in the recipes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sJ624p3L--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/sncd6jfcx7u78pgar5p6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sJ624p3L--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/sncd6jfcx7u78pgar5p6.png" alt="Alt Text" width="880" height="883"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  BERT in five points
&lt;/h2&gt;

&lt;p&gt;Since its release in late 2018, BERT has changed for the better the way we approach NLP tasks, solving many challenging problems in the field.&lt;br&gt;
One of the main problems in NLP is the lack of training data. To cope with this lack, the idea is to exploit a large amount of unannotated data to train general-purpose &lt;a href="https://en.wikipedia.org/wiki/Language_model"&gt;language representation models&lt;/a&gt;, a process known as pre-training, and then fine-tune these models on a smaller task-specific dataset.&lt;br&gt;
Though this technique is not new (see &lt;a href="https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa"&gt;word2vec&lt;/a&gt; and &lt;a href="https://nlp.stanford.edu/projects/glove/"&gt;GloVe&lt;/a&gt; embeddings), we can say BERT exploits it better. Why? Let’s find out in five points:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;It is built on a &lt;a href="https://medium.com/inside-machine-learning/what-is-a-transformer-d07dd1fbec04"&gt;​Transformer&lt;/a&gt; architecture, a powerful state-of-the-art architecture, which applies an attention mechanism to understand relationships between tokens in a sentence.&lt;/li&gt;
&lt;li&gt;It is deeply bidirectional since it takes into account the left and right contexts at the same time.&lt;/li&gt;
&lt;li&gt;BERT is pre-trained on a large corpus of unlabeled text, which allows it to pick up a deeper understanding of how language works.&lt;/li&gt;
&lt;li&gt;BERT can be fine-tuned for different tasks by adding a few additional output layers. &lt;/li&gt;
&lt;li&gt;BERT is trained to perform two tasks:
&lt;ul&gt;
&lt;li&gt;Masked Language Modelling: BERT has to predict randomly masked words.&lt;/li&gt;
&lt;li&gt;Next sentence prediction: BERT tries to predict the next sentence in a sequence of sentences.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;

&lt;h2&gt;
  
  
  The Smart Recipe Project Taxonomy
&lt;/h2&gt;

&lt;p&gt;To carry out the task, we designed a taxonomy, a model of classification for defining macro-categories and classifying the ingredients within them:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mbTwHCvk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/ki0hjielgwkmx16a45ag.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mbTwHCvk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/ki0hjielgwkmx16a45ag.png" alt="Alt Text" width="880" height="983"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Such categorization is then used to tag the dataset on which the model is trained.&lt;/p&gt;

&lt;h2&gt;
  
  
  BERT for ingredient taxonomic classification
&lt;/h2&gt;

&lt;p&gt;For our task (ingredient taxonomic classification), the pre-trained BERT models perform very well. We chose the &lt;a href="https://huggingface.co/bert-base-multilingual-cased"&gt;bert-base-multilingual-cased&lt;/a&gt; model and divided the classifier into two modules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;A training module&lt;/strong&gt;. We used &lt;a href="https://huggingface.co/transformers/model_doc/bert.html"&gt;BertForSequenceClassification&lt;/a&gt;, a basic BERT model with a single linear classification layer on top. Both the pre-trained model and the untrained layer were trained on our data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;An applying module&lt;/strong&gt;. The applier takes the trained model and uses it to determine the taxonomic class of the ingredient in the recipe.&lt;br&gt;
&lt;a href="https://medium.com/@condenastitaly/when-food-meets-ai-the-smart-recipe-project-6aa28b30e248?sk=201511e1d46f4b26e8fc24e0a21a6f7f"&gt;You can find a more detailed version of the post on Medium.&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
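A hypothetical sketch of the two modules, using the Hugging Face transformers library (the taxonomy labels and function names here are illustrative, and the fine-tuning loop on our labeled data is omitted):

```python
# Illustrative taxonomy labels; the real taxonomy is larger (see the figure above).
TAXONOMY = ["cheese", "fruit", "vegetable", "meat", "fish"]
label2id = {label: i for i, label in enumerate(TAXONOMY)}

def build_classifier(num_labels):
    """Training module: pre-trained BERT body + one untrained linear layer on top."""
    from transformers import BertForSequenceClassification, BertTokenizer
    tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
    model = BertForSequenceClassification.from_pretrained(
        "bert-base-multilingual-cased", num_labels=num_labels
    )
    return tokenizer, model  # both are then fine-tuned on the tagged dataset

def classify(ingredient, tokenizer, model):
    """Applying module: run the fine-tuned model on one ingredient string."""
    import torch
    inputs = tokenizer(ingredient, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return TAXONOMY[int(logits.argmax(dim=-1))]
```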




&lt;p&gt;&lt;strong&gt;When Food meets AI: the Smart Recipe Project&lt;/strong&gt;&lt;br&gt;
a series of 6 amazing articles&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Table of contents&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/condenastitaly/when-food-meets-ai-the-smart-recipe-project-mg2"&gt;Part 1: Cleaning and manipulating food data&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/condenastitaly/when-food-meets-ai-the-smart-recipe-project-29bf"&gt;Part 1: A smart method for tagging your datasets&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/condenastitaly/when-food-meets-ai-the-smart-recipe-project-2d6e"&gt;Part 2: NER for all tastes: extracting information from cooking recipes&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/condenastitaly/when-food-meets-ai-the-smart-recipe-project-21pg"&gt;Part 2: Neither fish nor fowl? Classify it with the Smart Ingredient Classifier&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/condenastitaly/when-food-meets-ai-the-smart-recipe-project-27pm"&gt;Part 3: FoodGraph: a graph database to connect recipes and food data&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/condenastitaly/when-food-meets-ai-the-smart-recipe-project-b3g"&gt;Part 3. FoodGraph: Loading data and Querying the graph with SPARQL&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>ai</category>
      <category>python</category>
      <category>recipes</category>
    </item>
    <item>
      <title>When Food meets AI: the Smart Recipe Project</title>
      <dc:creator>Condé Nast Italy</dc:creator>
      <pubDate>Mon, 20 Jul 2020 08:11:59 +0000</pubDate>
      <link>https://dev.to/condenastitaly/when-food-meets-ai-the-smart-recipe-project-2d6e</link>
      <guid>https://dev.to/condenastitaly/when-food-meets-ai-the-smart-recipe-project-2d6e</guid>
      <description>&lt;h2&gt;
  
  
  Part 2. NER for all tastes: extracting information from cooking recipes
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ePdbG0n9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/hobax8otfu2ct9yeci7a.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ePdbG0n9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/hobax8otfu2ct9yeci7a.jpg" alt="Alt Text" width="880" height="440"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the previous articles, we constructed two labeled datasets to train machine learning models and develop systems able to interpret cooking recipes.&lt;/p&gt;

&lt;p&gt;This post dives into the extractor system, a system able to extract ingredients, quantities, time of preparation, and other useful information from recipes. To develop the service, we tried different Named Entity Recognition (NER) approaches.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hold on! What is NER?
&lt;/h3&gt;

&lt;p&gt;NER is a two-step process consisting of a) identifying entities (a token or a group of tokens) in documents and b) categorizing them into predetermined categories such as Person, City, Company... For our task, we created our own categories: INGREDIENT, QUANTIFIER and UNIT.&lt;/p&gt;

&lt;p&gt;NER is a very useful NLP application for grouping and categorizing large amounts of data that share similarities and relevance. As such, it can be applied to many business use cases, like &lt;em&gt;Human resources&lt;/em&gt;, &lt;em&gt;Customer support&lt;/em&gt;, &lt;em&gt;Search and recommendation engines&lt;/em&gt;, &lt;em&gt;Content classification&lt;/em&gt;, and much more.&lt;/p&gt;
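&lt;p&gt;To make the two steps concrete, here is a toy dictionary-and-regex tagger for a recipe line. It is a deliberately naive illustration, not one of the trained models, and its lexicons are made up:&lt;/p&gt;

```python
import re

# Toy two-step NER: 1) identify candidate entities, 2) categorize them.
# These tiny lexicons are illustrative, not the project's real resources.
INGREDIENTS = {"flour", "sugar", "eggs", "butter"}
UNITS = {"g", "oz", "cups", "tbsp"}

def toy_ner(sentence):
    entities = []
    for token in sentence.lower().replace(",", "").split():
        if token in INGREDIENTS:
            entities.append((token, "INGREDIENT"))
        elif token in UNITS:
            entities.append((token, "UNIT"))
        elif re.fullmatch(r"\d+([/.]\d+)?", token):  # 2, 200, 3/4, 1.5
            entities.append((token, "QUANTIFIER"))
    return entities

print(toy_ner("Mix 200 g flour with 2 eggs"))
# [('200', 'QUANTIFIER'), ('g', 'UNIT'), ('flour', 'INGREDIENT'),
#  ('2', 'QUANTIFIER'), ('eggs', 'INGREDIENT')]
```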

&lt;h3&gt;
  
  
  NER for the Smart Recipe Project
&lt;/h3&gt;

&lt;p&gt;For the Smart Recipe Project, we trained four models: a CRF model, a BiLSTM model, a combination of the previous two (BiLSTM-CRF), and the Flair NLP NER model.&lt;/p&gt;

&lt;h3&gt;
  
  
  CRF model
&lt;/h3&gt;

&lt;p&gt;Linear-chain Conditional Random Fields (&lt;a href="https://medium.com/ml2vec/overview-of-conditional-random-fields-68a2a20fa541"&gt;CRFs&lt;/a&gt;) are a very popular way to model sequence prediction. CRFs are &lt;a href="https://medium.com/@mlengineer/generative-and-discriminative-models-af5637a66a3%23:~:text=In%20General,%20A%20Discriminative%20model,actual%20distribution%20of%20each%20class.&amp;amp;text=A%20Generative%20Model%20%E2%80%8Clearns%20the,the%20help%20of%20Bayes%20Theorem."&gt;discriminative models&lt;/a&gt; able to solve some shortcomings of their &lt;a href="https://medium.com/@mlengineer/generative-and-discriminative-models-af5637a66a3%23:~:text=In%20General,%20A%20Discriminative%20model,actual%20distribution%20of%20each%20class.&amp;amp;text=A%20Generative%20Model%20%E2%80%8Clearns%20the,the%20help%20of%20Bayes%20Theorem."&gt;generative&lt;/a&gt; counterparts. Indeed, while an HMM models the output on the &lt;a href="https://medium.com/@mlengineer/joint-probability-vs-conditional-probability-fa2d47d95c4a"&gt;joint probability&lt;/a&gt; distribution, a CRF computes it on the &lt;a href="https://medium.com/@mlengineer/joint-probability-vs-conditional-probability-fa2d47d95c4a"&gt;conditional probability&lt;/a&gt; distribution.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Put simply: while a generative classifier tries to learn how the data was generated, a discriminative one models the target directly from the observed data.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In addition to this, CRFs take into account the features of the current and previous labels in sequence. This increases the amount of information the model can rely on to make a good prediction.&lt;/p&gt;
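&lt;p&gt;A minimal sketch of what such features can look like for a recipe token (the feature template below is illustrative, in the style of linear-chain CRF toolkits, not the exact features we used):&lt;/p&gt;

```python
def token_features(tokens, i):
    # Illustrative feature template: features of the current token plus
    # context from the previous token, as linear-chain CRFs exploit.
    token = tokens[i]
    feats = {
        "word.lower": token.lower(),
        "word.isdigit": token.isdigit(),
        "word.istitle": token.istitle(),
    }
    if i > 0:
        feats["prev.word.lower"] = tokens[i - 1].lower()
    else:
        feats["BOS"] = True  # beginning-of-sequence marker
    return feats

tokens = "Add 2 cups flour".split()
print(token_features(tokens, 1))
# {'word.lower': '2', 'word.isdigit': True, 'word.istitle': False,
#  'prev.word.lower': 'add'}
```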

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--K349pVcm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/jaflrpqxlw3tlpxbygar.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--K349pVcm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/jaflrpqxlw3tlpxbygar.png" alt="Fig.1 CRF Network" width="880" height="357"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Fig.1 CRF Network&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;For the task, we used the &lt;a href="https://nlp.stanford.edu/software/CRF-NER.html"&gt;Stanford NER algorithm&lt;/a&gt;, which is an implementation of a CRF classifier. This model outperforms the others in accuracy, though it cannot exploit the context of forward labels (a pivotal feature for sequential tasks like NER) and requires extra feature engineering.&lt;/p&gt;

&lt;h3&gt;
  
  
  BiLSTM with character embeddings
&lt;/h3&gt;

&lt;p&gt;Going neural... we trained a &lt;a href="https://colah.github.io/posts/2015-08-Understanding-LSTMs/"&gt;Long Short-Term Memory (LSTM)&lt;/a&gt; model. LSTM networks are a type of Recurrent Neural Network (RNN) in which the hidden-layer updates are replaced by purpose-built memory cells. As a result, they are better at finding and exploiting long-range dependencies in the data.&lt;/p&gt;

&lt;p&gt;To benefit from both past and future context, we used a bidirectional LSTM model (BiLSTM), which processes the text in two directions: both forward (left to right) and backward (right to left). This allows the model to uncover more patterns as the amount of input information increases.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uso6ysp9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/yftym38rakrrakv88529.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uso6ysp9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/yftym38rakrrakv88529.png" alt="BiLSTM architecture" width="880" height="765"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Fig.2 BiLSTM architecture&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Moreover, we incorporated character-based word representations as input to the model. Character-level representations exploit explicit sub-word-level information, infer features for unseen words, and share information about morpheme-level regularities.&lt;/p&gt;

&lt;h3&gt;
  
  
  NER Flair NLP
&lt;/h3&gt;

&lt;p&gt;This model belongs to the &lt;a href="https://github.com/flairNLP/flair"&gt;Flair&lt;/a&gt; NLP library, developed and open-sourced by &lt;a href="https://research.zalando.com/"&gt;Zalando Research&lt;/a&gt;. The strength of the model lies in a) the use of state-of-the-art character, word and contextual string embeddings (like &lt;a href="https://nlp.stanford.edu/projects/glove/"&gt;GloVe&lt;/a&gt;, &lt;a href="https://arxiv.org/abs/1810.04805"&gt;BERT&lt;/a&gt;, &lt;a href="https://arxiv.org/pdf/1802.05365.pdf"&gt;ELMo&lt;/a&gt;...), and b) the ease with which these embeddings can be combined.&lt;/p&gt;

&lt;p&gt;In particular, &lt;a href="https://www.aclweb.org/anthology/C18-1139/"&gt;contextual string embeddings&lt;/a&gt; help to contextualize words, producing different embeddings for polysemous words (same word, different meanings):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KEOVQ7L7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/jdzl773vmu6xgk0vbroh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KEOVQ7L7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/jdzl773vmu6xgk0vbroh.png" alt="Context String Embedding network" width="880" height="452"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Fig.3 Context String Embedding network&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  BiLSTM-CRF
&lt;/h3&gt;

&lt;p&gt;Last but not least, we tried a hybrid approach: we added a CRF layer to a BiLSTM model. The advantages (well explained &lt;a href="https://arxiv.org/pdf/1508.01991.pdf"&gt;here&lt;/a&gt;) of such a combination are that the model can efficiently use both 1) past and future input features, thanks to the bidirectional LSTM component, and 2) sentence-level tag information, thanks to the CRF layer, which imposes additional constraints on the final output.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ovzpsX1x--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/cubohnsakmp8085pzixj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ovzpsX1x--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/cubohnsakmp8085pzixj.png" alt="BiLSTM-CRF: general architecture" width="880" height="386"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Fig. 4 BiLSTM-CRF: general architecture&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What about performance?
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://medium.com/@condenastitaly/when-food-meets-ai-the-smart-recipe-project-8dd1f5e727b5"&gt;Read the complete article on Medium&lt;/a&gt; to discover that and more about this step of the Smart Recipe Project.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;When Food meets AI: the Smart Recipe Project&lt;/strong&gt;&lt;br&gt;
a series of 6 amazing articles&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Table of content&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/condenastitaly/when-food-meets-ai-the-smart-recipe-project-mg2"&gt;Part 1: Cleaning and manipulating food data&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/condenastitaly/when-food-meets-ai-the-smart-recipe-project-29bf"&gt;Part 1: A smart method for tagging your datasets&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/condenastitaly/when-food-meets-ai-the-smart-recipe-project-2d6e"&gt;Part 2: NER for all tastes: extracting information from cooking recipes&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/condenastitaly/when-food-meets-ai-the-smart-recipe-project-21pg"&gt;Part 2: Neither fish nor fowl? Classify it with the Smart Ingredient Classifier&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/condenastitaly/when-food-meets-ai-the-smart-recipe-project-27pm"&gt;Part 3: FoodGraph: a graph database to connect recipes and food data&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/condenastitaly/when-food-meets-ai-the-smart-recipe-project-b3g"&gt;Part 3. FoodGraph: Loading data and Querying the graph with SPARQL&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>ai</category>
      <category>python</category>
      <category>recipes</category>
    </item>
    <item>
      <title>When Food meets AI: the Smart Recipe Project</title>
      <dc:creator>Condé Nast Italy</dc:creator>
      <pubDate>Mon, 13 Jul 2020 07:25:28 +0000</pubDate>
      <link>https://dev.to/condenastitaly/when-food-meets-ai-the-smart-recipe-project-29bf</link>
      <guid>https://dev.to/condenastitaly/when-food-meets-ai-the-smart-recipe-project-29bf</guid>
      <description>&lt;h3&gt;
  
  
  Part 1: A smart method for tagging your datasets
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Foeccqcsxpk9sxacmes9o.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Foeccqcsxpk9sxacmes9o.jpg" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;A classic Tiramisù&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Raise your hand if you have never come across the “lack of data” problem while working on ML projects.&lt;/p&gt;

&lt;p&gt;The unavailability or scarcity of training data is indeed one of the most serious challenges in ML and specifically in NLP. A problem that gets harder when the data you need has to be labeled. When no other &lt;a href="https://www.kdnuggets.com/2019/06/5-ways-lack-data-machine-learning.html" rel="noopener noreferrer"&gt;shortcut&lt;/a&gt; works for you, the only alternative is to tag your data... At this point, we imagine the enthusiasm on your face! &lt;/p&gt;

&lt;p&gt;But don’t let that put you off! Read the post and discover how we dramatically reduced the time and cost of the tagging process. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DISCLAIMER&lt;/strong&gt;: &lt;br&gt;
We worked within the food context, but the approach can be easily extended to many different cases.&lt;/p&gt;
&lt;h3&gt;
  
  
  But first… What to tag?
&lt;/h3&gt;

&lt;p&gt;The entities we want to tag are:&lt;br&gt;
&lt;strong&gt;INGREDIENT&lt;/strong&gt;: apples, cheese, yogurt, hot peppers…&lt;br&gt;
&lt;strong&gt;QUANTIFIER&lt;/strong&gt;: one, 2, ¾, a couple of….&lt;br&gt;
&lt;strong&gt;UNIT&lt;/strong&gt; of measurement: oz, g, lb, liter, cups, tbsp... &lt;/p&gt;

&lt;p&gt;We used a variant of the &lt;a href="https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)" rel="noopener noreferrer"&gt;IOB schema&lt;/a&gt; to tag the entities, where B-, I- tags indicate the beginning and intermediate positions of entities. O is the default tag.&lt;/p&gt;
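&lt;p&gt;As a toy illustration of the scheme (not our tagging pipeline): a multi-token entity receives a B- tag on its first token and I- tags on the following ones, while all other tokens stay O.&lt;/p&gt;

```python
def iob_tags(tokens, entity, label):
    # Tag one known multi-token entity (given lowercased) with B-/I- prefixes;
    # every other token keeps the default O tag. Toy illustration only.
    tags = ["O"] * len(tokens)
    n = len(entity)
    for i in range(len(tokens) - n + 1):
        if [t.lower() for t in tokens[i:i + n]] == entity:
            tags[i] = "B-" + label
            for j in range(i + 1, i + n):
                tags[j] = "I-" + label
    return tags

tokens = "Chop two hot peppers finely".split()
print(iob_tags(tokens, ["hot", "peppers"], "INGREDIENT"))
# ['O', 'O', 'B-INGREDIENT', 'I-INGREDIENT', 'O']
```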
&lt;h3&gt;
  
  
  Let’s tag!
&lt;/h3&gt;

&lt;p&gt;We sped up the ingredient tagging process with TagINGR, a semi-automatic tool which works by:&lt;br&gt;
    1. matching items in the recipes against those in a list of ingredients;&lt;br&gt;
    2. adding the INGREDIENT tag when an item appears both in the list and in the recipe.&lt;/p&gt;
&lt;h3&gt;
  
  
  Here is the code:
&lt;/h3&gt;

&lt;p&gt;In part 1, the recipe_tagger function tokenizes the ingredient list and initializes some variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;recipe_tagger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lang&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;desc_ingr_list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;recipe&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="c1"&gt;# Part 1
&lt;/span&gt;    &lt;span class="n"&gt;tokenized_ingr_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;tokenize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lang&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;el&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;el&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;desc_ingr_list&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ingr_token_list&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tokenized_ingr_list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;ingr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tag_ingr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ingr_token&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;ingr_token_list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ingr_token&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;ingr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ingr_tag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;\t[A-Z]+\tO\n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; B-INGREDIENT&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                     &lt;span class="n"&gt;ingr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ingr_tag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ingr&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;\t[A-Z]+\tO\n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ingr_tag&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; I-INGREDIENT&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In part 2, it tags the ingredients:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ingr_tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;      &lt;span class="c1"&gt;#Part 2
&lt;/span&gt;                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ingr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;recipe&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;\t[NN][A-Z]*\tO&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ingr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;recipe&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;group&lt;/span&gt;&lt;span class="p"&gt;()):&lt;/span&gt;
                    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ingr&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ingr_tag&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ingr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;recipe&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ingr&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ingr_tag&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;recipe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;\n(.*)\t.*\t(.*)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\\&lt;/span&gt;&lt;span class="s"&gt;1 &lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;recipe&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;recipe&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What about other entities?
&lt;/h3&gt;

&lt;p&gt;Once the ingredients are tagged, we can easily tag quantities and units. We first identified some entity patterns and then tagged them using a set of regexes:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fa5sgn0h12b18350xc7k8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fa5sgn0h12b18350xc7k8.png" alt="entities patterns"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;All very well, but… how did we build the list? What assures us it is complete? What does NN mean in the code? These and other questions are answered in the &lt;a href="https://medium.com/@condenastitaly/when-food-meets-ai-the-smart-recipe-project-12bfaa3acfd7" rel="noopener noreferrer"&gt;Medium article&lt;/a&gt;. Go read it!&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;When Food meets AI: the Smart Recipe Project&lt;/strong&gt;&lt;br&gt;
a series of 6 amazing articles&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Table of content&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/condenastitaly/when-food-meets-ai-the-smart-recipe-project-mg2"&gt;Part 1: Cleaning and manipulating food data&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/condenastitaly/when-food-meets-ai-the-smart-recipe-project-29bf"&gt;Part 1: A smart method for tagging your datasets&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/condenastitaly/when-food-meets-ai-the-smart-recipe-project-2d6e"&gt;Part 2: NER for all tastes: extracting information from cooking recipes&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/condenastitaly/when-food-meets-ai-the-smart-recipe-project-21pg"&gt;Part 2: Neither fish nor fowl? Classify it with the Smart Ingredient Classifier&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/condenastitaly/when-food-meets-ai-the-smart-recipe-project-27pm"&gt;Part 3: FoodGraph: a graph database to connect recipes and food data&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/condenastitaly/when-food-meets-ai-the-smart-recipe-project-b3g"&gt;Part 3. FoodGraph: Loading data and Querying the graph with SPARQL&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>ai</category>
      <category>food</category>
    </item>
    <item>
      <title>When Food meets AI: the Smart Recipe Project</title>
      <dc:creator>Condé Nast Italy</dc:creator>
      <pubDate>Thu, 02 Jul 2020 10:37:09 +0000</pubDate>
      <link>https://dev.to/condenastitaly/when-food-meets-ai-the-smart-recipe-project-mg2</link>
      <guid>https://dev.to/condenastitaly/when-food-meets-ai-the-smart-recipe-project-mg2</guid>
      <description>&lt;h3&gt;
  
  
  Part 1: Cleaning and manipulating food data
&lt;/h3&gt;

&lt;p&gt;Cooking recipes, videos, and photos are everywhere on the web, which is today the greatest archive of food-related content.&lt;br&gt;
But what happens when this big amount of data meets Artificial Intelligence? In the Smart Recipe Project, we answered that question by developing systems able to interpret and extract information from food recipes.&lt;br&gt;
Are you wondering how?&lt;/p&gt;
&lt;h3&gt;
  
  
  The project step-by-step:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;using NLP techniques, we enriched data, labeling entities and adding entity-specific information;&lt;/li&gt;
&lt;li&gt;exploiting state-of-the-art ML and DL models, we developed services able to automatically extract information from recipes;&lt;/li&gt;
&lt;li&gt;adopting the Amazon Neptune technology, we built graph databases to store and navigate relationships among data.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;But first... we collected and cleaned the data.&lt;/p&gt;
&lt;h3&gt;
  
  
  Data Extraction
&lt;/h3&gt;

&lt;p&gt;Using Python and its text-manipulation libraries, we extracted recipes from TSV databases:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;data_extractor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;df_ingredients&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;df_steps&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="n"&gt;list_cell&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
   &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cell&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt; 
      &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cell&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s"&gt;'nan'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
         &lt;span class="n"&gt;list_cell&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cell&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; 
      &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
         &lt;span class="n"&gt;list_cell&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;df_ingredients&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;df_steps&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt; 
   &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;list_cell&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
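&lt;p&gt;To make the fallback explicit: when a recipe's full text is missing (NaN), we rebuild it by joining its ingredients and steps. A minimal sketch with plain Python lists (the sample data below is hypothetical, not taken from our dataset):&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;import math

content = ["Pasta with tomato sauce. Boil, drain, serve.", math.nan]
ingredients = ["pasta, tomato", "flour, water"]
steps = ["boil; drain; serve", "mix; knead; bake"]

texts = []
for n, cell in enumerate(content):
    if isinstance(cell, str):
        texts.append((n, cell))
    else:
        # NaN cell: rebuild the text from ingredients + steps
        texts.append((n, ingredients[n] + '\n' + steps[n]))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;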



&lt;h3&gt;
  
  
  Data Cleaning
&lt;/h3&gt;

&lt;p&gt;Then we cleaned the data with a couple of regular expressions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;import re

def clean_recipe(recipe, regex_list):
   recipe = recipe.lower()
   for (pattern, replacement) in regex_list:
      recipe = re.sub(pattern, replacement, recipe)
   return recipe
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
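&lt;p&gt;Here regex_list is a sequence of (pattern, replacement) pairs applied in order. The pairs below are illustrative, not the exact list we used:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;import re

regex_list = [
    (r'\d+\s*(g|kg|ml|cl)\b', ''),   # strip quantities such as "250 g"
    (r'\([^)]*\)', ''),              # drop parenthetical notes
    (r'\s{2,}', ' '),                # collapse repeated whitespace
]

recipe = "Mix 250 g flour (sifted) with  water.".lower()
for pattern, replacement in regex_list:
    recipe = re.sub(pattern, replacement, recipe)

print(recipe)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;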



&lt;h3&gt;
  
  
  Data Preprocessing

&lt;p&gt;Finally, we 1) tokenized and 2) POS-tagged the data with NLTK:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;import nltk

def tokenize(recipe):
   sentences = nltk.sent_tokenize(recipe, language="english")
   # word-level tokenization, one token list per sentence
   tokens = [nltk.word_tokenize(sentence, language="english") for sentence in sentences]
   return tokens
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;





&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;def pos_tagger(recipe):
   tokenized_text = tokenize(clean_recipe(recipe, regex_list))
   tagged_tokens = [[str(tag_token[0]).lower() + "\t" + str(tag_token[1])
                     for tag_token in nltk.pos_tag(tokens)]
                    for tokens in tokenized_text]
   return tagged_tokens
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Curious about the output? Go on &lt;a href="https://medium.com/@condenastitaly/when-food-meets-ai-the-smart-recipe-project-2cf5ecc8c2be?sk=bfd64dbe34693083897a18b1ae7c07e0"&gt;Medium to read the complete article&lt;/a&gt; and find out more about the most appetizing stages of our work.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;When Food meets AI: the Smart Recipe Project&lt;/strong&gt;&lt;br&gt;
a series of 6 amazing articles&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Table of contents&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/condenastitaly/when-food-meets-ai-the-smart-recipe-project-mg2"&gt;Part 1: Cleaning and manipulating food data&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/condenastitaly/when-food-meets-ai-the-smart-recipe-project-29bf"&gt;Part 1: A smart method for tagging your datasets&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/condenastitaly/when-food-meets-ai-the-smart-recipe-project-2d6e"&gt;Part 2: NER for all tastes: extracting information from cooking recipes&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/condenastitaly/when-food-meets-ai-the-smart-recipe-project-21pg"&gt;Part 2: Neither fish nor fowl? Classify it with the Smart Ingredient Classifier&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/condenastitaly/when-food-meets-ai-the-smart-recipe-project-27pm"&gt;Part 3: FoodGraph: a graph database to connect recipes and food data&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/condenastitaly/when-food-meets-ai-the-smart-recipe-project-b3g"&gt;Part 3. FoodGraph: Loading data and Querying the graph with SPARQL&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>artificialintelligence</category>
      <category>food</category>
      <category>python</category>
    </item>
  </channel>
</rss>
