<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Rahul Gupta</title>
    <description>The latest articles on DEV Community by Rahul Gupta (@rahul1990gupta).</description>
    <link>https://dev.to/rahul1990gupta</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F454666%2F188cfdae-5246-4a1e-868a-f4957d4d3377.jpeg</url>
      <title>DEV Community: Rahul Gupta</title>
      <link>https://dev.to/rahul1990gupta</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/rahul1990gupta"/>
    <language>en</language>
    <item>
      <title>How to get a free GPU and train a spaCy model ?</title>
      <dc:creator>Rahul Gupta</dc:creator>
      <pubDate>Tue, 25 Aug 2020 09:44:46 +0000</pubDate>
      <link>https://dev.to/rahul1990gupta/how-to-get-a-free-gpu-and-train-a-spacy-model-ba9</link>
      <guid>https://dev.to/rahul1990gupta/how-to-get-a-free-gpu-and-train-a-spacy-model-ba9</guid>
      <description>&lt;p&gt;We all have been there. I have an interesting dataset that we want to train our shiny new model on. Unfortunately, I don't have a dedicated GPU on my Macbook 2015 model. Unless, you are somebody who uses graphical intensive application such as games, numerical processing software regularly, It will not make sense for you to buy a dedicated GPU. &lt;/p&gt;

&lt;p&gt;Luckily, there are plenty of remote GPU options available. Depending on your use case, you can choose the one that fits your needs. &lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Option&lt;/th&gt;
&lt;th&gt;Pros&lt;/th&gt;
&lt;th&gt;Cons&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cloud Provider(Gcloud, AWS, Azure)&lt;/td&gt;
&lt;td&gt;Flexibility, persistent storage for your data&lt;/td&gt;
&lt;td&gt;Higher ramp-up time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Colaboratory notebook&lt;/td&gt;
&lt;td&gt;Good documentation&lt;/td&gt;
&lt;td&gt;Short runtimes, slow GPU, not good for long training jobs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jupyter hub&lt;/td&gt;
&lt;td&gt;Open-source, multiple language support&lt;/td&gt;
&lt;td&gt;no free GPU support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kaggle Notebooks&lt;/td&gt;
&lt;td&gt;free 43 hours of GPU computing&lt;/td&gt;
&lt;td&gt;Data I/O to the machine is a little inconvenient&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;So, today we will talk about how to use a GPU on Kaggle to train a spaCy model for the Hindi language. The biggest challenge in training a model is getting clean data that accurately represents your machine-learning problem. Let's do a quick search to get a list of the available datasets. &lt;/p&gt;

&lt;p&gt;A quick search on Github with "Hindi tagger" yields these results&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nd1BOhFH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/3n81lj3f8rz01u4wj0sn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nd1BOhFH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/3n81lj3f8rz01u4wj0sn.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After browsing through these datasets, you will notice that most of them are relatively small and follow inconsistent tagging schemes that are incompatible with spaCy's input data format. Luckily, there is another dataset we can use here, from the CoNLL shared tasks: the Universal Dependencies Hindi treebank.&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vJ70wriM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://practicaldev-herokuapp-com.freetls.fastly.net/assets/github-logo-ba8488d21cd8ee1fee097b8410db9deaa41d0ca30b004c0c63de0a479114156f.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/UniversalDependencies"&gt;
        UniversalDependencies
      &lt;/a&gt; / &lt;a href="https://github.com/UniversalDependencies/UD_Hindi-HDTB"&gt;
        UD_Hindi-HDTB
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;h1&gt;
Summary&lt;/h1&gt;
&lt;p&gt;The Hindi UD treebank is based on the Hindi Dependency Treebank (HDTB)
created at IIIT Hyderabad, India.&lt;/p&gt;
&lt;h1&gt;
Introduction&lt;/h1&gt;
&lt;p&gt;The Hindi Universal Dependency Treebank was automatically converted from Hindi Dependency Treebank (HDTB) which is part of an ongoing effort of creating multi-layered treebanks for Hindi and Urdu. HDTB is developed at IIIT-H India.&lt;/p&gt;
&lt;h1&gt;
Acknowledgments&lt;/h1&gt;
&lt;p&gt;The project is supported by NSF Grant (Award Number: CNS 0751202; CFDA Number: 47.070).&lt;/p&gt;
&lt;p&gt;Any publication reporting the work done using this data should cite the following references:&lt;/p&gt;
&lt;p&gt;Riyaz Ahmad Bhat, Rajesh Bhatt, Annahita Farudi, Prescott Klassen, Bhuvana Narasimhan, Martha Palmer, Owen Rambow, Dipti Misra Sharma, Ashwini Vaidya, Sri Ramagurumurthy Vishnu, and Fei Xia. The Hindi/Urdu Treebank Project. In the Handbook of Linguistic Annotation (edited by Nancy Ide and James Pustejovsky), Springer Press&lt;/p&gt;
&lt;pre&gt;@InCollection{bhathindi
  Title                    = {The Hindi/Urdu Treebank Project}
  Author                   = {Bhat, Riyaz Ahmad and Bhatt, Rajesh and Farudi, Annahita and Klassen, Prescott and Narasimhan,&lt;/pre&gt;…&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/UniversalDependencies/UD_Hindi-HDTB"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


&lt;p&gt;Browsing the stats.xml file gives us an overview of the different POS tags available in the dataset.&lt;/p&gt;

&lt;p&gt;Let's open the notebook and enable the GPU for the session from the three dots &amp;gt; Accelerator &amp;gt; GPU. Note that there is a TPU option as well, but TPUs can only be used with Keras and TensorFlow models. spaCy uses neither; it has its own custom neural-network library, Thinc.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7QW_T2Sj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/5u7r0vwuifsw7yrd2ir3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7QW_T2Sj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/5u7r0vwuifsw7yrd2ir3.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's clone this repository using the command below in the Kaggle notebook. This will download the data from the repo into the working directory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;!&lt;/span&gt; git clone https://github.com/UniversalDependencies/UD_Hindi-HDTB
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Let's quickly check that we have access to a GPU.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;tensorflow&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt; 
&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gpu_device_name&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
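
&lt;p&gt;Since spaCy itself doesn't go through TensorFlow, it is also worth confirming that spaCy (via its Thinc backend) can see the GPU. A minimal sketch, assuming the GPU extras (cupy) are present on the Kaggle image:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;import spacy

# Returns True if a GPU was found and activated, False otherwise
print(spacy.prefer_gpu())
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;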



&lt;p&gt;spaCy expects the training input data to be in its JSON format, but our downloaded data is in .conllu format. So, we will use &lt;code&gt;spacy convert&lt;/code&gt; to convert it to JSON.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;! mkdir data
! spacy convert UD_Hindi-HDTB/hi_hdtb-ud-dev.conllu data
! spacy convert UD_Hindi-HDTB/hi_hdtb-ud-train.conllu data
! spacy convert UD_Hindi-HDTB/hi_hdtb-ud-test.conllu data
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
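
&lt;p&gt;If you want to sanity-check the conversion, the output is plain JSON, so a quick look with the standard library is enough. This is just a sketch, assuming the usual spaCy 2.x training-JSON layout (documents containing paragraphs, sentences and tokens):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;import json

with open("data/hi_hdtb-ud-train.json") as f:
    docs = json.load(f)

# Count sentences across all documents and paragraphs
n_sents = sum(len(para["sentences"]) for doc in docs for para in doc["paragraphs"])
print(len(docs), "documents,", n_sents, "sentences")
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;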



&lt;p&gt;Now, we are all set up to start training the model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;! spacy train hi model_dir data/hi_hdtb-ud-train.json data/hi_hdtb-ud-dev.json  -g 0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
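
&lt;p&gt;Since we also converted the test split, we can score the trained model against it once training completes. A sketch using spaCy 2.x's &lt;code&gt;spacy evaluate&lt;/code&gt; CLI (flags may differ slightly between versions):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;! spacy evaluate model_dir/model-best data/hi_hdtb-ud-test.json -g 0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;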



&lt;p&gt;Don't forget to pass the argument &lt;code&gt;-g 0&lt;/code&gt; to the train command to enable GPU usage for training. It will save the trained model in the &lt;code&gt;model_dir&lt;/code&gt; directory. Training runs about 6x faster on the GPU than on my local machine. There are probably ways to make it run faster still, as the job on the Kaggle notebook was CPU-constrained. Anyway, the whole job finished in about half an hour on the Kaggle notebook.&lt;br&gt;&lt;br&gt;
Let's load the model and run some inferences.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;spacy.lang.hi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Hindi&lt;/span&gt; 
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;spacy.gold&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;docs_to_json&lt;/span&gt;
&lt;span class="n"&gt;nlp_hi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Hindi&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="n"&gt;nlp_hi&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_pipe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nlp_hi&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;create_pipe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'tagger'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;nlp_hi&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_pipe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nlp_hi&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;create_pipe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'parser'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;nlp_hi&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_pipe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nlp_hi&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;create_pipe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'ner'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;nlp_hi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nlp_hi&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_disk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"model_dir/model-best/"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="n"&gt;sentence&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"मैं खाना खा रहा हूँ।"&lt;/span&gt;
&lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nlp_hi&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;docs_to_json&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;span class="c1"&gt;# ...
# {'id': 0, 'orth': 'मैं', 'tag': 'PRP', 'head': 2, 'dep': 'nsubj', 'ner': 'O'}
# ...
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
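
&lt;p&gt;If you would rather inspect the predictions token by token instead of dumping the whole document to JSON, the standard Doc/Token attributes work as usual. A small sketch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;for token in doc:
    # tag_ is the fine-grained POS tag, dep_ the dependency label,
    # ent_type_ the named-entity label (empty string if none)
    print(token.text, token.tag_, token.dep_, token.ent_type_)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;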



&lt;p&gt;After the training finishes, let's gzip the model and download it locally from the file-viewer pane on the right in the Kaggle notebook.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;!&lt;/span&gt; &lt;span class="nb"&gt;tar&lt;/span&gt; &lt;span class="nt"&gt;-cvzf&lt;/span&gt; model.tgz model_dir/model-best 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
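
&lt;p&gt;Once the archive is on your machine, unpacking it and loading the model locally is straightforward. A sketch, assuming the same spaCy 2.x version is installed locally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;tar -xvzf model.tgz
python -c "import spacy; nlp = spacy.load('model_dir/model-best'); print([t.tag_ for t in nlp('मैं खाना खा रहा हूँ।')])"
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;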



&lt;p&gt;Hurray!&lt;/p&gt;

&lt;p&gt;Here is the Kaggle notebook link, if you want to play around. &lt;br&gt;
&lt;a href="https://www.kaggle.com/rahul1990gupta/training-a-spacy-hindi-model?scriptVersionId=41283884"&gt;https://www.kaggle.com/rahul1990gupta/training-a-spacy-hindi-model?scriptVersionId=41283884&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
    </item>
    <item>
      <title>Processing Hindi text with spaCy(2): Finding Synonyms</title>
      <dc:creator>Rahul Gupta</dc:creator>
      <pubDate>Fri, 21 Aug 2020 16:31:50 +0000</pubDate>
      <link>https://dev.to/rahul1990gupta/processing-hindi-text-with-spacy-2-finding-synonyms-4on4</link>
      <guid>https://dev.to/rahul1990gupta/processing-hindi-text-with-spacy-2-finding-synonyms-4on4</guid>
      <description>&lt;p&gt;In this post, we will explore word embedding and how can we used them to determine similarities for words, sentences and documents.&lt;/p&gt;

&lt;p&gt;So, let's use spaCy to convert raw text into spaCy docs/tokens and look at the vector embeddings.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;spacy.lang.hi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Hindi&lt;/span&gt; 
&lt;span class="n"&gt;nlp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Hindi&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;sent1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;मुझे भोजन पसंद है।&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;nlp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sent1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;
&lt;span class="c1"&gt;# array([], dtype=float32)
&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Oops! There is no vector corresponding to the token: the blank model ships with no word embeddings for Hindi. Luckily, pretrained Hindi word embeddings are available online from Facebook's fastText project. So, we will download them and load them into spaCy.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt; 
&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.hi.300.vec.gz&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;allow_redirects&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;fpath&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fpath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;fw&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="n"&gt;fw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The word-vector file is about 1 GB in size, so it will take some time to download. &lt;br&gt;
Let's see how we can use external word embeddings in spaCy.&lt;br&gt;
Here is a link to the spaCy documentation on how to do this: &lt;a href="https://spacy.io/usage/vectors-similarity#converting" rel="noopener noreferrer"&gt;https://spacy.io/usage/vectors-similarity#converting&lt;/a&gt;&lt;/p&gt;
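
&lt;p&gt;Note that &lt;code&gt;r.content&lt;/code&gt; buffers the whole ~1 GB file in memory before writing it out. If that is a concern, a streaming variant is a gentler sketch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import requests

url = "https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.hi.300.vec.gz"
fpath = url.split("/")[-1]

# stream=True fetches the file in chunks instead of loading it all into memory
with requests.get(url, stream=True) as r:
    r.raise_for_status()
    with open(fpath, "wb") as fw:
        for chunk in r.iter_content(chunk_size=1024 * 1024):  # 1 MB chunks
            fw.write(chunk)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;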

&lt;p&gt;Once the word vectors are downloaded, let's load them into a spaCy model on the command line.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python -m spacy init-model hi ./hi_vectors_wiki_lg --vectors-loc cc.hi.300.vec.gz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's load the model now in spaCy to do some work.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;spacy&lt;/span&gt;
&lt;span class="n"&gt;nlp_hi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spacy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./hi_vectors_wiki_lg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;nlp_hi&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sent1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
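
&lt;p&gt;As a quick sanity check that the fastText vectors were actually imported, we can look at the size of the model's vector table. A sketch; the exact row count depends on the fastText file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# (number of vectors, vector dimensionality); the cc.hi.300 vectors are 300-dimensional
print(nlp_hi.vocab.vectors.shape)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;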



&lt;p&gt;Now we see that the vector is available to use in spaCy. Let's use these embeddings to compare two very similar sentences and compute their similarity.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;sent2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;मैं ऐसे भोजन की सराहना करता हूं जिसका स्वाद अच्छा हो।&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="n"&gt;doc1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;nlp_hi&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sent1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;doc2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;nlp_hi&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sent2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Both the sent1 and sent2 are very similar, so, we expect their similarity score to be high
&lt;/span&gt;&lt;span class="n"&gt;doc1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;similarity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# prints 0.86
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
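
&lt;p&gt;The same &lt;code&gt;similarity&lt;/code&gt; API also works at the token level. As a sketch, comparing two words that both mean food (the exact score depends on the vectors):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;w1 = nlp_hi("भोजन")[0]  # food
w2 = nlp_hi("खाना")[0]  # food / meal
# Near-synonyms, so we expect a relatively high cosine similarity
print(w1.similarity(w2))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;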



&lt;p&gt;Now, let's use these embeddings to find synonyms of a word.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def get_similar_words(word):
  vector = word.vector
  # most_similar returns the keys of the nearest entries in the vector table
  # (only the single nearest neighbour by default)
  keys = nlp_hi.vocab.vectors.most_similar(vector.reshape(1, 300))[0]

  ret = []
  for key in keys[0]:
    try:
      ret.append(nlp_hi.vocab[key].text)
    except KeyError:
      pass
  return ret

get_similar_words(doc[1]) # prints ['भोजन']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's not very useful.&lt;br&gt;
Maybe the word vectors are very sparse and trained on a very small vocabulary. &lt;br&gt;
Let's look into the NLTK library to see if we can use a Hindi WordNet to find similar words for a word. However, the NLTK documentation mentions that it doesn't support the &lt;code&gt;hin&lt;/code&gt; language yet. So, the search continues. &lt;/p&gt;

&lt;p&gt;After a bit of googling, I found out that a research group at IITB has been developing WordNets for Indian languages for quite a while. &lt;br&gt;
Check out &lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/http%3A%2F%2Fcompling.hss.ntu.edu.sg%2Fevents%2F2018-gwc%2Fpdfs%2FGWC2018_paper_40.pdf" rel="noopener noreferrer"&gt;this&lt;/a&gt; link for more details. &lt;br&gt;
They published a Python library, &lt;code&gt;pyiwn&lt;/code&gt;, for easy access. It hasn't been integrated into NLTK yet because the coverage of Hindi synsets isn't large enough.&lt;br&gt;
With that, let's install this library.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install pyiwn 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pyiwn&lt;/span&gt; 
&lt;span class="n"&gt;iwn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pyiwn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;IndoWordNet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lang&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pyiwn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Language&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HINDI&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;aam_all_synsets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;iwn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;synsets&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;आम&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Mango
&lt;/span&gt;&lt;span class="n"&gt;aam_all_synsets&lt;/span&gt;

&lt;span class="c1"&gt;# [Synset('कच्चा.adjective.2283'),
# Synset('अधपका.adjective.2697'),
# Synset('आम.noun.3462'),
# Synset('आम.noun.3463'),
# Synset('सामान्य.adjective.3468'),
# Synset('सामूहिक.adjective.3469'),
# Synset('आँव.noun.6253'),
# Synset('आँव.noun.8446'),
# Synset('आम.adjective.39736')]
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's very interesting to see that the synsets for the word include both of its meanings: mango and common. Let's pick one synset and look at the different synonyms in it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;aam&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;aam_all_synsets&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Let's look at the definition 
&lt;/span&gt;&lt;span class="n"&gt;aam&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gloss&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="c1"&gt;# prints 'एक फल जो खाया या चूसा जाता है'
&lt;/span&gt;
&lt;span class="c1"&gt;# This will print examples where the word is being used
&lt;/span&gt;&lt;span class="n"&gt;aam&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;examples&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="c1"&gt;# ['तोता पेड़ पर बैठकर आम खा रहा है ।',
# 'शास्त्रों ने आम को इंद्रासनी फल की संज्ञा दी है ।']
&lt;/span&gt;
&lt;span class="c1"&gt;# Now, let's look at the synonyms for the word 
&lt;/span&gt;&lt;span class="n"&gt;aam&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lemma_names&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="c1"&gt;# ['आम',
# 'आम्र',
# 'अंब',
# 'अम्ब',
# 'आँब',
# 'आंब',
# 'रसाल',
# 'च्यूत',
# 'प्रियांबु',
# 'प्रियाम्बु',
# 'केशवायुध',
# 'कामायुध',
# 'कामशर',
# 'कामांग']
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's print some hyponyms for our synset. &lt;br&gt;
A is a hyponym of B if A is a type of B. For example, a pigeon is a bird, so "pigeon" is a hyponym of "bird".&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;iwn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;synset_relation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;aam&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pyiwn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SynsetRelations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HYPONYMY&lt;/span&gt;&lt;span class="p"&gt;)[:&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="c1"&gt;# [Synset('सफेदा.noun.1294'),
# Synset('अंबिया.noun.2888'),
# Synset('सिंदूरिया.noun.8636'),
# Synset('जरदालू.noun.4724'),
# Synset('तोतापरी.noun.6892')]
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Now that we have played around with WordNet for a while, let's recap what a WordNet is. A WordNet aims to store the meanings of words along with the relationships between words. So, in a sense, WordNet = language dictionary + thesaurus + hierarchical IS-A relationships for nouns + more. &lt;/p&gt;

&lt;p&gt;Note: If you want to play around with the notebooks, you can click the link below&lt;/p&gt;

&lt;p&gt;&lt;a href="https://colab.research.google.com/github/rahul1990gupta/indic-nlp-datasets/blob/master/examples/word_embeddings_and_similarity.ipynb" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcolab.research.google.com%2Fassets%2Fcolab-badge.svg" alt="Open word-embeddings-with-spacy in Colab"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://colab.research.google.com/github/rahul1990gupta/indic-nlp-datasets/blob/master/examples/synonyms_in_hindi.ipynb" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcolab.research.google.com%2Fassets%2Fcolab-badge.svg" alt="Open synonyms-with-pyiwn in Colab"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
    </item>
    <item>
      <title>Processing Hindi text with SpaCy</title>
      <dc:creator>Rahul Gupta</dc:creator>
      <pubDate>Fri, 21 Aug 2020 06:43:44 +0000</pubDate>
      <link>https://dev.to/rahul1990gupta/processing-hindi-text-with-spacy-12de</link>
      <guid>https://dev.to/rahul1990gupta/processing-hindi-text-with-spacy-12de</guid>
      <description>&lt;p&gt;Note: I understand that this post can be hard to follow for non-Hindi readers, so I have included English translation of those words after the Hindi words.&lt;/p&gt;

&lt;p&gt;Tons of resources are available for processing English (and most Roman-script languages) text, but not so much for other languages. In this post, we will explore how we can use spaCy to process Hindi text. &lt;/p&gt;

&lt;p&gt;Here we will use the spaCy module for processing and indic-nlp-datasets for getting the data. We will use text from the novel Devdas by Sharat Chandra to demonstrate common NLP tasks.&lt;/p&gt;

&lt;p&gt;Let's install these two libraries.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;spacy 
pip &lt;span class="nb"&gt;install &lt;/span&gt;indic-nlp-datasets
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;idatasets.devdas&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_devdas&lt;/span&gt;

&lt;span class="n"&gt;devdas&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_devdas&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="c1"&gt;# devdas.data is a generator of paragraphs
&lt;/span&gt;&lt;span class="n"&gt;paragraphs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;devdas&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;paragraphs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;words&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So, &lt;code&gt;words&lt;/code&gt; is a list of all the words in the novel.&lt;br&gt;
&lt;/p&gt;
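
&lt;p&gt;As an aside, the split above is purely whitespace-based. spaCy's blank Hindi pipeline can tokenize the text as well, which handles punctuation more gracefully; a quick sketch (for the rest of the post we will keep using the whitespace-split &lt;code&gt;words&lt;/code&gt; list):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from spacy.lang.hi import Hindi

nlp = Hindi()  # blank Hindi pipeline: tokenizer and stop words, no trained components
doc = nlp(paragraphs[0])
print([token.text for token in doc][:10])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;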

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;collections&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Counter&lt;/span&gt; 
&lt;span class="n"&gt;cnt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;words&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;cnt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;most_common&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# print 
# [('के', 696), // of
#  ('ने', 676), 
#  ('नही', 672), // not
#  ('से', 626), // to 
#  ('मे', 562), // in 
#  ('की', 480), // 
#  ('है', 444), // is 
#  ('देवदास', 437),// Devdas
#  ('को', 336), // 's
#  ('पार्वती', 332)] // Parvati
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What we see is that the top words are not especially meaningful; they are mostly connectors and other function words. Let's use spaCy's Hindi stop-word list to get rid of those.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;spacy.lang.hi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;STOP_WORDS&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;STOP_WORDS_HI&lt;/span&gt;
&lt;span class="n"&gt;non_stop_words&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;words&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;STOP_WORDS_HI&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;non_stop_cnt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;non_stop_words&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;non_stop_cnt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;most_common&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# prints 
# [('नही', 782), // not
#  ('देवदास', 472), // Devdas 
#  ('कहा-', 390), // said
#  ('पार्वती', 345), // Parvati
#  ('क्या', 237), // what 
#  ('दिन', 187), // day 
#  ('बात', 168),// Talk 
#  ('तुम', 168), // you
#  ('मै', 160), // I 
#  ('चन्द्रमुखी', 154)] // Chadramukhi
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we see more interesting words appearing among the common words. Three of these 10 most common words (namely 'देवदास', 'पार्वती', 'चन्द्रमुखी') [Devdas, Parvati, Chandramukhi] correspond to the three main characters around whom the whole love-triangle story revolves.&lt;/p&gt;

&lt;p&gt;Printing the most common words is great, but it isn't enough to justify a cushy data-scientist job. :D So, let's make it prettier using WordCloud.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;wordcloud&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;WordCloud&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;

&lt;span class="n"&gt;wordcloud&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;WordCloud&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;height&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_font_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;max_words&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;background_color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;white&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;stopwords&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;STOP_WORDS_HI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;imshow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wordcloud&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;interpolation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bilinear&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;axis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;off&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives us the plot below.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ftsrckuk9zw28og4t9q28.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ftsrckuk9zw28og4t9q28.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
Wait, where have all the words gone?&lt;/p&gt;

&lt;p&gt;After googling a bit, I found the GitHub issue below, which explains that we need a Devanagari font to render the image correctly. &lt;br&gt;
&lt;a href="https://github.com/amueller/word_cloud/issues/70" rel="noopener noreferrer"&gt;https://github.com/amueller/word_cloud/issues/70&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So, we modify the code to pass a custom font file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="n"&gt;font&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gargi.ttf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;wordcloud&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;WordCloud&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;height&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_font_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;max_words&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;background_color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;white&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;stopwords&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;STOP_WORDS_HI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;font_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;font&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;imshow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wordcloud&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;interpolation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bilinear&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;axis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;off&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This yields the image below &lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F0yf7vr1xo98qzj0j5i21.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F0yf7vr1xo98qzj0j5i21.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
You may notice that the WordCloud now renders Hindi letters, but it doesn't contain the most frequent words we saw before. Also, it doesn't show any of the vowel signs ("मात्रा"). So, what's happening here?&lt;/p&gt;

&lt;p&gt;The issue below talks about how the "\w+" regex pattern doesn't work as expected for languages other than English. An easy workaround is to pass our own regex, which matches all Hindi letters, including vowel signs. &lt;br&gt;
&lt;a href="https://github.com/amueller/word_cloud/issues/272" rel="noopener noreferrer"&gt;https://github.com/amueller/word_cloud/issues/272&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So, let's fix that&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="n"&gt;wordcloud&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;WordCloud&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;height&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_font_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;max_words&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;background_color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;white&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;stopwords&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;STOP_WORDS_HI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;regexp&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[\u0900-\u097F]+&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;font_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;font&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;imshow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wordcloud&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;interpolation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bilinear&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;axis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;off&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This yields the image below. &lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fc46fi2ikr3xi6oilh2as.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fc46fi2ikr3xi6oilh2as.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This looks alright. A few things to note here: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Names of all the prominent characters show up in the word cloud.&lt;/li&gt;
&lt;li&gt;"नहीं" (not) appears a lot, which signals that the characters are often not in agreement with each other.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Next up, we will talk about how you can do other tasks such as part-of-speech analysis and automatically finding the names of characters/cities/organizations in a sentence.  &lt;/p&gt;

&lt;p&gt;Hope you enjoyed reading it.&lt;br&gt;
If you want to play around with it in colab, checkout the link below. &lt;br&gt;
&lt;a href="https://colab.research.google.com/github/rahul1990gupta/indic-nlp-datasets/blob/master/examples/Getting_started_with_processing_hindi_text.ipynb" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcolab.research.google.com%2Fassets%2Fcolab-badge.svg" alt="Open In Colab"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
    </item>
  </channel>
</rss>
