DEV Community: Samuel Nkopuruk

Every Christian Developer Should Do This!

Samuel Nkopuruk — Fri, 17 Jun 2022 11:23:40 +0000

This is a call to all Christian developers: wake up from your stupor! Christians were meant to be warriors; we are warriors on foreign land, behind enemy lines (Ephesians 6: 11 - 17). So let us live like it. Don’t you know that the day you gave your life to Christ was the day you revoked your citizenship in your country and of the earth? You became a citizen of heaven, and as such you will be hated, despised, scorned and treated the same way as, or worse than, an illegal immigrant is treated (John 17: 14 - 19; 15: 18 - 21).

We are warrior ambassadors sent forth by our Captain, Jesus the Lord of Hosts, with duties to snatch, liberate and defend souls and to bring down the strongholds of the enemy (Ephesians 6: 19 - 20). We are no longer in the Last Days, we are in the Last Day, and we are warriors of the Last Day Church, with a cloud of witnesses beholding us (Hebrews 12: 1 - 2), so let us give it our all.

The Bible says “My people are destroyed for lack of knowledge” (Hosea 4: 6), and because we lack knowledge we are ignorant of the devices of the Devil, which is contrary to God's Word (2 Cor 2: 11). This is why the sword God says we should arm ourselves with is the Word of God, and with the Word of God come Wisdom (Ps 119: 98), Knowledge (Isa 28: 23 - 29) and Understanding (Job 32: 7 - 9), with which we shall parry every deceitful work of the Devil and bring the truth to his captives, for the Word of God is Truth (John 17: 17).

Don’t you know that, just as in the physical realm, ignorance of the law is no defense in the spiritual realm? An offence committed in ignorance is still a punishable offence. Satan takes huge advantage of our ignorance and makes us worship him without our ever knowing that we do!

When you raise your hand in adoration, in praise and fanatical dedication to your favourite artist, don’t you know that you commit idolatry? Or when you hang his/her poster on your walls and place his/her item in a special place, don’t you know that you have raised altars [high places] unto Satan?! Paul put it expressly in Romans 6: 16 - 21:

Know ye not, that to whom ye yield yourselves servants to obey, his servants ye are to whom ye obey; whether of sin unto death, or of obedience unto righteousness?…

Don’t you know that the things that are exalted here on earth are despised by God, because God knows the hearts of those whom you exalt (Luke 16: 15)? Even God’s children are guilty of this (1 Sam 16: 6 - 7). So if Satan can take advantage of the ignorance of people and make them sin against God, so also can we take advantage of their ignorance and make them worship and praise God, because in the spiritual realm ignorance is no defense!

And who better to achieve this than we developers? Of all professions, we are the most perfectly suited for the work of an evangelist in this modern age of technology, because our applications can reach millions of people. Imagine making millions of people worship God through the use of your applications, and bringing untold numbers to our Lord and Captain Jesus Christ!

Don’t you know that when musicians who have dedicated their lives to the Devil make their music, devilish incantations are added to the music and played at a subliminal level, so that our conscious mind won’t be able to detect them but our subconscious mind can, thereby inviting demons into our lives when we listen to them? Some are even bold enough to add incantations that can summon demons into their lyrics, so that anybody who sings or says them can unknowingly summon the demon, who will then torment them!

My brethren, please do not be deceived into underestimating the Devil just because you are in Christ, the Devil is Cunning! (2 Corinthians 11: 3, 10: 12). He has been around since before the creation of the world, so don’t think you can outsmart him, there is nothing we can do that can outsmart him, we are powerless against him. But luckily we have someone in us who is I AM THAT I AM, there was no beginning before Him, neither does He have an end. Only through His power can we overcome Satan.

Rev 12: 11 tells us exactly how to overcome Satan, in order of importance:

1. Having Jesus in you (the most important! Without this, having the rest is futile, and your case will be like those in Acts 19: 15)

2. The Word of God

3. Loving not your life

Having these three in you is the only way you can confidently overcome Satan; when one of them is missing in you, you stand a high chance of being deceived by the cunning of Satan.

Just as they [musicians] can affect the lives of millions through their songs, so also can we developers affect the lives of millions through the apps we build, by making sure that every keypress event prints out praises to God (either into the console, or to a random file in their systems); the faster they type, the more they praise God! And by making every mouse click event a prayer to God asking Him to bring their souls into His loving embrace; the more they click their mouse, the more fervent their prayers!

My favorite prayers are Ephesians 1: 17 - 18, James 1: 21, and that Jesus should do unto them what He did to Paul: give them an unavoidable chance encounter with Him. These are printed into CMD [though they can’t see it] and also into a special folder prepared for it, which contains no more than 20 files at any time.

Christian developers should pray in tongues while building their applications, placing angels of God on every app that is downloaded, either legally or illegally, beseeching God to hear the prayers that users will make while using the app, and to guide their steps into an encounter with Him. And anytime they encounter His Word, any hindrance to His Word should be brought low (Isaiah 45: 2; Luke 3: 5) and they should receive with meekness the engrafted word which is able to save their souls (James 1: 21).

Christian musicians can also invite pastors to pray in tongues while they are making their music, and play it at a subliminal level in their music, placing the angels of God onto every track and tasking them to lead anyone listening to the music into unavoidable circumstances where they have an encounter with God.

Please, when you do this, know that you will be going on the offensive against Satan. He wants you to be a lukewarm Christian (Revelation 3: 14 - 17) so that you will be no threat to him. As such you will be actively attacked by him; he will make you feel that what you are doing is silly [writing code and silently praying in tongues, especially when your colleagues are around you] and insignificant. But let me remind you that anything done in faith is powerful, and it is the power that can overcome the world (1 John 5: 4), and anything done in faith is pleasing to God (Hebrews 11: 6).

Satan might also say in your mind, that what you are doing is fruitless! and that the people using your applications are mostly unbelievers! What Nonsense! It is written in James 2: 18 that faith is shown through works! Their act of using your application and typing and clicking with it shows their faith through their actions, also Paul in Romans 2: 26 said that if an unbeliever does that which is according to the law of God, should he not be counted as a believer? [read the whole chapter to truly understand this].

Do not be deceived, people: there is a great spiritual battle being fought every day while you sleep at your duty post. It is time for you, as a warrior, to be alert and stand at your duty post. We are meant to be Jesus' hidden intercessors. Don’t be beguiled by the powerful ministers that God is using openly to bring many to Christ; don’t you know that for every one of those ministers there are hundreds of hidden intercessors upholding him by the power of their prayers, who serve as a shield without which those ministers would be unable to stand as long as they have? Why do you think apostle Paul, in every one of his letters, beseeched people to pray for him, and in 1 Timothy 2: 2 specifically asked them to pray for those who are in authority?

For every open warfare, hundreds of hidden wars have already been fought.

Come on, brethren [developers], we have been given a powerful tool with which we can wage war in secret, and the enemy is already taking full advantage of it. Please, there is no pacifism in Christianity! Nowhere in the Bible did Jesus say we would live a peaceful life here on earth; instead He always exhorts us to be on guard, for we will face various tribulations (2 Timothy 3: 12, John 16: 32 - 33) if we pick up our cross, follow Him and become His disciples. So pick up that Cross!

Become a Dev4Christ now

Developers who meet deadlines should be given the greatest respect

Samuel Nkopuruk — Wed, 15 Jun 2022 15:07:50 +0000

I have reached the stage in my programming path where I truly hold in high respect those developers who meet deadlines, most especially when those deadlines are set by non-programmers based on their “idea” of how long a project ought to take.

As a voluntary project for my church, I am building a multi-functional application that combines a text editor, an audio transcriber and a search interface capable of both keyword and semantic search (and, by His Grace, a text summarization capability soon to be added).

In a bid to spread the Truth of the Gospel of our Lord Jesus Christ, which has been almost overshadowed by false doctrines as forewarned by apostle Paul in Galatians 1: 6–9; 3: 1–3 [the whole book of Galatians was written as a warning to the Church against believing people who preach false doctrines with fair speeches; see Romans 16: 17–18], we are working on transcribing audio/video messages so that they can be shared with more people.

Manually transcribing an audio/video message is tasking and extremely time consuming, even when the Google Gboard speech-to-text functionality is used [an hour of audio/video can take over 6 hours, or for most people a whole day, to transcribe, and there are messages that are over 10 hours long; try imagining how long that will take].

In an attempt to make the process easier, I made my developer skills and intentions known. I was tasked with building the application, but to focus only on implementing the audio/video transcriber function, as that is what is sorely needed, and I was given 2 weeks to complete it. I had it completed in 3 weeks, though I have recurring nightmares of hidden bugs in the application.

Though most of the application's buttons are unresponsive, as their functionalities have not been implemented, I am seriously excited about my achievement: the application was able to transcribe a 10 hour audio in 2 hours, leaving to others the easy task of just punctuating the transcripts [something I am planning to automate using the interesting punctuation restoration models].

I would like to thank Wanderson M. Pimenta for making their wonderful PyQt GUI interface publicly available, without which I would definitely have been unable to build the interface quickly enough, and Martin Fitzpatrick for his amazing books on PyQt5, which served as excellent reference guides when implementing the transcriber functionality.

Finally, I thank Google and Houndify, because the application makes use of both the Google and Houndify speech-to-text APIs. I discovered that, while being two times slower than Google, Houndify tends to be more accurate. As a goal, though, I am aiming to make the application work completely offline, using deep learning models to power the speech-to-text conversion.

The application is truly amazing [at least from my perspective]! It converts any audio/video to the .flac format, and since neither Google nor Houndify will let you dump a 10 hour or a 5 hour video on their API to transcribe at once, I had to loop through the audio file, sending 100 seconds of audio to the API at a time and saving the returned text.
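The chunking loop described above might look roughly like the sketch below. It is not the application's actual code: `chunk_bounds` and `transcribe_in_chunks` are hypothetical names, and `transcribe_chunk` is an assumed callable standing in for whatever sends one slice of audio to the Google or Houndify API and returns its text.

```python
from typing import Callable, Iterator, List, Tuple


def chunk_bounds(total_seconds: float,
                 chunk_seconds: float = 100.0) -> Iterator[Tuple[float, float]]:
    """Yield (start, end) offsets in seconds that cover the whole recording."""
    start = 0.0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        yield start, end
        start = end


def transcribe_in_chunks(total_seconds: float,
                         transcribe_chunk: Callable[[float, float], str],
                         chunk_seconds: float = 100.0) -> str:
    """Call the (hypothetical) transcribe_chunk on each slice and join the text."""
    pieces: List[str] = []
    for start, end in chunk_bounds(total_seconds, chunk_seconds):
        pieces.append(transcribe_chunk(start, end))
    # skip slices for which the API returned no text
    return " ".join(piece for piece in pieces if piece)


# a fake transcriber that only reports which slice it was given
print(transcribe_in_chunks(250, lambda s, e: f"[{int(s)}-{int(e)}s]"))
```

Keeping the boundary arithmetic separate from the API call makes it easy to change the chunk length or swap one speech-to-text service for another.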

Okay, I fibbed a little unintentionally: the application won’t just transcribe your 10 hour audio/video for you either. It will let you dump it, but it won’t transcribe it. You have to break it into multiple 1 hour audio/video files before it will transcribe them. This is for memory concerns, as audio in .flac format can become extremely large and might crash systems that don’t have enough memory.

Whew, the issues I encountered! If not for Stack Overflow (Praise Be Upon It, PBUI) this application would not exist. The most stressful part was trying to compile my scripts to .exe, and that was because of the FFmpeg binary I was using to convert audio/video files to the standard format. I solved this by modifying a third party package I was using to convert the files programmatically.

Although the GUI leaves much to be desired, the primary goal is accomplished, and that is what matters! Now I am further tasked with building and hosting a database, plus a backup offline database [I will use PostgreSQL, it is the only DBMS I know], that will contain the transcripts, and with implementing a search functionality that will be able to search the database using keyword/semantic search [now, I have no idea how to do that!]. I have been given another 2 weeks.

Hopefully, by the Grace of God, I will post about my progress in the next two weeks. Now let me get back to crazily coding and stalking Stack Overflow (PBUI).

How to Build a Generative Style Inference Model

Samuel Nkopuruk — Thu, 19 May 2022 23:11:35 +0000

We have all read books by authors who write in an eccentric style, twisting English words to create their own unique pattern (Shakespeare, Steven Erikson, Patrick Rothfuss, etc.), or who even developed their own languages (J. R. R. Tolkien, George R. R. Martin, George Orwell, etc.). As fellow book lovers, we all wish we could emulate the writing style of our favorite author, but such an endeavour would likely take months or years of intense study.

To this purpose we are going to build a machine learning model that will do that for us. It will learn the way the author spells and punctuates, his style of grammar, his choice of words and his tone. To do that, we have to show the model books written exclusively by that author to learn from.

Showing the model books written by different authors makes it hard for the model to detect the tone, word choice and punctuation style of any one of them, as authors have diverse styles and personalities, and finding common ground across them would be extremely difficult.

The books that will be used to train the model are those written by the greatest fantasy writer of all time, Steven Erikson (The Malazan Book of the Fallen). It is a 10 book series with a total of over 10,000 pages. You can download the books here; convert the file[s] to txt so that they can be easily opened in Python.

This is a unique project in that multiple sub models have to be built in order to develop this style inference model. So before going further, pause and go through the process of building the sub models before continuing:

import os
import sys
import glob
import itertools
import numpy as np
import pandas as pd

from nltk.tokenize import WordPunctTokenizer
from nltk.stem import WordNetLemmatizer

from gensim.models.phrases import Phrases
from gensim.models.word2vec import Word2Vec

import tensorflow as tf
from tensorflow import keras

# you should be familiar with this
%run utils.py

Setting up prerequisite models

If you are at this point, you must have gone through the prerequisite projects. So permit me to rush through the boring and already familiar process of loading up our models needed to build this model.

def load_phrase_detector_model(fname, reduce_size=False):
    phrases = Phrases.load(fname)

    print(f"Loading complete")
    return phrases.freeze() if reduce_size else phrases

# load a phrase detector model
phrase_model_path = "malaz_phrase_detector"
phrases = load_phrase_detector_model(phrase_model_path, reduce_size=True)

sentences_iterator = CustomPathLineSentences('Books', include_phrase=True,
                                             phrase_model=phrases)

# load word2vec model
word2vec_path = "malaz_word2vec.bin"
word2vec = Word2Vec.load(word2vec_path)
word2vec = word2vec.wv

print("Setup complete")

Preprocessing texts

Remember our aim is to build a model that can write with the same style, tempo and grammar as the author who wrote the texts. We are going to preprocess our text in a completely different way than we did with the previous projects, though an element of each of them might be present.

First we are going to load all the text files into memory and join them into one big lump of text so that we can perform some calculations, which we will see soon.

def preprocess_texts(sentences_iterator):
    text = []
    for sentence in sentences_iterator:
        # remember each sentence is a list of tokens
        # punctuation included
        text.extend(sentence)

    print(f"Total number of words (Phrases included) {len(text)}")

    return text

text = preprocess_texts(sentences_iterator)

The preprocess_texts function iterates through the text files, tokenizes them, detects and combines phrases, and saves each token (word) into a list.

vocab = sorted(list(set(text)))
word_indices = {word: idx for idx, word in enumerate(vocab)}
indices_word = {idx: word for idx, word in enumerate(vocab)}

print(f"Total number of unique words: {len(vocab)}")

The vocab contains the unique words present in the text. word_indices is a dictionary mapping each word to its index, for reference when building the integer-encoded target, and indices_word is the reverse dictionary that serves as a lookup when interpreting the index the model predicts.
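To make the two lookup tables concrete, here is a small sketch on a toy token list. The `build_vocab` helper and the sample tokens are made up for illustration; the tutorial builds the same three objects inline on the full text.

```python
def build_vocab(tokens):
    """Return the sorted vocabulary plus word->index and index->word lookups."""
    vocab = sorted(set(tokens))
    word_indices = {word: idx for idx, word in enumerate(vocab)}
    indices_word = {idx: word for word, idx in word_indices.items()}
    return vocab, word_indices, indices_word


toy_tokens = ["Hood", "take", "you", ",", "Hood", "take", "us", "all"]
vocab, word_indices, indices_word = build_vocab(toy_tokens)

print(vocab)                  # unique tokens, punctuation included
print(word_indices["Hood"])   # the integer the model is trained to predict
print(indices_word[word_indices["Hood"]])  # and back to the word again
```

Because one dictionary is the exact inverse of the other, any index the model predicts can always be turned back into a word.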

total_words = len(text)
total_sentences = len(sentences_iterator)
avg_word_sentences = total_words / total_sentences

print(f"Avg word per sentence: {avg_word_sentences}")

We calculated the average number of words in each sentence; depending on the text files you use, it might differ [I got a value of 11.73]. This value gives us a minimum baseline to use when choosing the number of words the model will look at before predicting what the next word will be.

Creating dataset

maxlen = int(avg_word_sentences + 40)
step = 3
sentences = []
next_word_target = []

for idx in range(0, len(text) - maxlen, step):
    sentences.append(text[idx: idx + maxlen])
    next_word_target.append(text[idx + maxlen])

# no longer need
del text
print(len(sentences), len(next_word_target))

A lot is being done in the above code block. The maxlen variable holds the number of words that the model will look at before it predicts what the next word will be. step indicates the number of words to skip between the start of one window and the start of the next (for example, take 50 words from the beginning of the text, move forward three words, take another 50, move forward three more, and so on).

Executing the code will create a list of word lists, each maxlen long, and a list of their corresponding targets, both of which have the same length. Our model will look at the words in each entry of the sentences variable and try to predict the corresponding word in next_word_target. With a sufficiently powerful model, it will be able to learn the author's style of writing, punctuation, tone, grammar and even word arrangement (i.e. it will be able to learn that if the author writes T’lan, it is always followed by Imass): a perfect imitator. Scary and exciting, huh? Let's dig in!

Hold up! You are familiar enough with machine learning to know that we can’t just feed the deep learning algorithm strings of text; it wouldn’t know what to do with them. So we have to convert the strings to palatable numbers for proper consumption by the model.
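A tiny worked example can make the windowing easier to picture. The sketch below wraps the same loop in a hypothetical `make_windows` helper and runs it on a made-up ten-word sentence, with a much smaller maxlen than the real one.

```python
def make_windows(tokens, maxlen, step):
    """Slide a window of maxlen tokens over the text, moving step tokens each time."""
    sentences, next_word_target = [], []
    for idx in range(0, len(tokens) - maxlen, step):
        sentences.append(tokens[idx: idx + maxlen])
        next_word_target.append(tokens[idx + maxlen])
    return sentences, next_word_target


toy_text = "the T'lan Imass rose from the dust of the plain".split()
windows, targets = make_windows(toy_text, maxlen=4, step=3)

for window, target in zip(windows, targets):
    print(window, "->", target)
```

With maxlen=4 and step=3 the ten tokens produce two training pairs: the first window ends at "rose" and must predict "from", the second starts three tokens later and must predict "of".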

Turning water to wine

def cached_dataset(sentences, next_word_target, word2vec, word_indices):
    """A factory function, it returns a generator function
    that has various variables cached in it scope that will
    be unaffected by the outer scope.

    Doing this will enable the iterator function to be called
    without the need for arguments and without the function
    to request the variables from the outer scopes, because
    they are stored in it internals."""

    def generator():
        """An iterator function, it simply iterates through our earlier
        created dataset, changing the words in each sentence to its
        word vector representatives"""


        # words not available in the word vector
        # will be given a default word vectors where
        # every dimension equals to zero
        unknown_word = np.zeros(shape=(word2vec.vector_size,),
                                dtype=np.float32)

        for sentence, target_word in itertools.zip_longest(sentences, next_word_target):
            # create an dummy array of shape (len(sentence), embed_dim)
            # this is the word vector representation
            data = np.zeros(shape=(maxlen, word2vec.vector_size),
                            dtype=np.float32)

            # fills in the dummy array with the real word vector
            # values.
            for idx, word in enumerate(itertools.islice(sentence, None)):
                if word in word2vec:
                    data[idx] = word2vec[word]
                else:
                    data[idx] = unknown_word

            # create the target array
            target = np.array([word_indices[target_word]], dtype=np.int32)

            yield (tf.convert_to_tensor(data, dtype=tf.float32),
                   tf.convert_to_tensor(target, dtype=tf.int32))
    return generator

gen = cached_dataset(sentences, next_word_target, word2vec, word_indices)

# create a tensor dataset generator using 
# the Dataset API 
dataset_generator = tf.data.Dataset.from_generator(
    gen, 
    output_signature=(tf.TensorSpec(shape=(maxlen, word2vec.vector_size),
                                    dtype=tf.float32),
                      tf.TensorSpec(shape=(1,), dtype=tf.int32)))

The magic of this function is that it converts the strings to their word vector representations on the fly, without the intermediate step of saving them to a file. It uses the TensorFlow Dataset API to feed the numerical data directly to the model during training [as you will soon see].

The Dataset API also lets us preprocess the next batch of sentences while the model is training on the previous batch, both at the same time, which can speed up training dramatically.
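Stripped of the TensorFlow plumbing, the conversion step inside the generator comes down to a lookup with a zero-vector fallback. The sketch below shows that idea with plain Python lists standing in for NumPy arrays; `sentence_to_matrix` and the three-dimensional toy vectors are made up for illustration.

```python
def sentence_to_matrix(sentence, vectors, vector_size, maxlen):
    """Turn a list of tokens into maxlen rows of floats.

    Tokens missing from `vectors`, and any unused rows, stay all-zero.
    """
    matrix = [[0.0] * vector_size for _ in range(maxlen)]
    for idx, word in enumerate(sentence[:maxlen]):
        if word in vectors:
            matrix[idx] = list(vectors[word])
    return matrix


# toy stand-in for the trained word vectors
toy_vectors = {"High_Fist": [0.1, 0.2, 0.3], "rose": [0.4, 0.5, 0.6]}
matrix = sentence_to_matrix(["High_Fist", "Gothos", "rose"], toy_vectors,
                            vector_size=3, maxlen=4)

for row in matrix:
    print(row)
```

Here "Gothos" is not in the toy vocabulary, so its row is all zeros, just as unknown words receive the zero vector in the generator above.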

Building model

The moment we have been waiting for, and also dreading: building and training the model. In building the model, we are going to use the LSTM architecture in the layers (research LSTMs to learn how they work).

num_neuron = 500
model = keras.models.Sequential()
model.add(keras.layers.Input(shape=(maxlen, word2vec.vector_size)))
model.add(keras.layers.LSTM(num_neuron, return_sequences=True))
model.add(keras.layers.LSTM(num_neuron, return_sequences=True))
model.add(keras.layers.LSTM(num_neuron))

model.add(keras.layers.Dense(len(vocab), activation='softmax'))

optimizer = keras.optimizers.Adam(learning_rate=0.001)
model.compile(loss='sparse_categorical_crossentropy',
              optimizer=optimizer)
model.summary()

# define a checkpoint callback [you will really need this]
model_cb = keras.callbacks.ModelCheckpoint("model.h5", monitor='loss')

# GET READY TO RECEIVE THE GREATEST SHOCK
# OF YOUR LIFE WHEN YOU RUN THIS CODE

# Tell me what you see when you execute the code
# what did it display (You will know what I am
# talking about when you see it)
epochs = 100
batch_size = 32
# the dataset is already batched, so batch_size must not be passed
# to fit() again; prefetching is handled by the tf.data pipeline
model.fit(dataset_generator.repeat().batch(batch_size).prefetch(5),
          steps_per_epoch=len(sentences) // batch_size,
          epochs=epochs, callbacks=[model_cb])

We built our model and trained it. You may have noticed that we didn’t try to limit the complexity of the model by adding regularizing layers (dropout, batch normalization, etc.). Doing so would have hindered our purpose of imitating our writer; for that to happen, our model has to overfit on the data (this is one of those rare instances where overfitting the dataset becomes a good thing). So the more complex you can make your model, the greater its ability to imitate the writer.

You will be much better off training this model on Kaggle or Google Colab than on your local computer, as training this model is highly intensive and may take an extraordinary amount of time [Think days].

Putting it to work

def sample(preds, temperature=1.0):
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)

    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)

    return np.argmax(probas)

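The sample function reweights the predicted probabilities by the temperature and then draws the index of the next word at random from the resulting distribution; the lower the temperature, the more often the most probable word is chosen.¹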
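To see what the temperature does to the probabilities, here is a small standard-library sketch of the same reweighting. The names `apply_temperature` and `sample_index`, and the clamp used to avoid log(0), are choices made for this illustration, not part of the tutorial's code.

```python
import math
import random


def apply_temperature(probs, temperature=1.0):
    """Rescale a distribution: each p becomes p ** (1 / temperature), renormalized."""
    # clamp to avoid log(0); the floor value is an arbitrary choice for this sketch
    logits = [math.log(max(p, 1e-12)) / temperature for p in probs]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(value - peak) for value in logits]
    total = sum(weights)
    return [w / total for w in weights]


def sample_index(probs, temperature=1.0, rng=random):
    """Draw one index at random from the temperature-adjusted distribution."""
    adjusted = apply_temperature(probs, temperature)
    return rng.choices(range(len(adjusted)), weights=adjusted, k=1)[0]


probs = [0.5, 0.3, 0.2]
sharp = apply_temperature(probs, 0.2)  # mass concentrates on the likeliest word
flat = apply_temperature(probs, 5.0)   # distribution moves toward uniform

print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
print(sample_index(probs, temperature=0.3))
```

A low temperature makes the generated text safer and more repetitive, while a high temperature makes it more surprising and more error-prone, which is why a few values are worth trying.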

text = """
it was not of Jaghut construction, that it had arisen beside the three Jaghut towers of its own accord, in answer to a law unfathomable to god and mortal alike. Arisen to await the coming of those whom it would imprison for eternity. Creatures of deadly power.
"""
text = sentences_iterator.clean_token_words(text)
text = text[:maxlen]
generated = []
generated.extend(text)
sys.stdout.write(" ".join(generated))

# zero vector used for words missing from the word2vec vocabulary
unknown_word = np.zeros(shape=(word2vec.vector_size,),
                        dtype=np.float32)

for temp in np.arange(0.1, 0.5, 0.1):
    for i in range(400):
        data = np.zeros(shape=(1, maxlen, word2vec.vector_size),
                        dtype=np.float32)

        for t, word in enumerate(text):
            if word in word2vec:
                data[0, t] = word2vec[word]
            else:
                data[0, t] = unknown_word

        preds = model.predict(data)
        # temperature: optimal value 0.2 to 0.4, test
        next_index = sample(preds.ravel(), temp)
        next_word = indices_word[next_index]
        generated.append(next_word)

        # slide the context window forward by one word
        text = text[1:] + [next_word]
        sys.stdout.write(" " + next_word)
        sys.stdout.flush()

You can see that the model was able to some extent imitate the writing style of the writer, with the level of imitation getting higher depending on the complexities of your model. Not bad for a simple model, and it is something we can definitely have fun with, or you can go on ahead to imitate the author you have always admired.

God loves you!

¹ Natural Language in Action

How to Build Special Case Word Vectors

Samuel Nkopuruk — Thu, 19 May 2022 23:05:12 +0000

Using pretrained word vectors is all well and good, and they are of huge benefit to your models. For example, the Google word2vec model contains 3 million unique words (vocabulary) and was trained on billions of texts. Those word vectors are amazingly fine-tuned and represent the semantic meaning of words to an extremely high accuracy. I mean, they are state of the art, so why should we ever think of building our own word vectors?

Why you should build your special word vectors

Visualize this scenario

You are a hardcore fantasy fan, and there is this particular novel series [Malazan Book of the Fallen] that you and your fellow hardcores are so crazy about that you all can’t wait to express to the author [Steven Erikson] how you feel about the books. In a classic bid to impress the other hardcore fans and show that you are the books' number one fan, you decide to build a sentiment model that will be able to analyze their reviews, accurately recognize their state of emotion, and then translate it into one of five succinct labels (Very Good, Good, Ok, Bad, Very Bad).

In your enthusiasm you decide to use Google word2vec (I mean, it has 3 million words), and then you discover a very obvious problem. You and your fellow hardcore fans use a lot of words in your reviews that are not found in the Google word2vec 3 million word vocabulary (what!!! but the books were written in English), words like Bugg, Gothos, Seguleh etc. (I guess the over 1 billion texts that Google word2vec was trained on didn’t contain those words). Even though those cryptic words carry rich sentimental value (e.g. Hood’s breath may mean wow, amazing, shock etc. depending on the context), you still think you will be able to build a fairly accurate model with the remaining words that can be found in the Google word2vec vocabulary.

Bolstered by that insight, you go ahead with building your model. With further analysis, you then discover an error of the most insidious kind, the kind that is extremely hard to detect because it does not trip any alarm (silent or loud). Such errors pass so silently that most times you do not notice them until it is very late, and the damage they deal is so extreme that they have a huge impact on the performance of your model.

Ok, enough with trying to scare you, but what kind of errors are these? In this NLP context they are what I will call mis-concepted errors. They occur when the meaning of a word changes drastically when it is placed in a different world. You will understand much better with an example. Take the word “burn”: in reality it means that something was ignited with fire, or is about to be, but in the Malazan world this word takes on a totally unrelated meaning; it’s the name of a god. Below are some examples of words with their different meanings.

Word	Meaning in the Real World	Meaning in the Malazan World
Hood	The metal cover over a car engine, or a cloth head covering	The name of the god of death.
High Fist	Taken together they mean nothing, but taken apart they mean different things	They are mostly taken as a single word, which means a rank in a military structure.
Kindly	In a kind manner	The name of a soldier (who isn’t even kind).
Divers	People who go deep under water	A shape shifter; it changes from a single entity to multiple entities.
Curdle	Milk going bad	The name of a long dead Soletaken (a dragon).

You can now see how misconceptions like these can introduce terrible errors into your model. Not only that, there are also words that on their own mean nothing, or mean something completely different, in the Malazan world, and must be combined with another word before their meaning is revealed, e.g. T'lan Imass, High Mage, High Fist, Ampelas Rooted etc.

Ok, coming out of that visualization exercise, you can see why having your own custom word vectors for specific cases is imperative. That does not mean Google word2vec should suddenly become useless to you, and you wouldn’t want to be building a word2vec model for every situation you encounter.

You get optimal results when you combine both models, with your custom word2vec vectors acting as a supplement. You use the Google word2vec vectors for the more common words found in everyday English, and your custom word2vec vectors for those rare words or phrases that are only found in the niche you are building your model for.
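One way to picture that combination is a lookup that tries the general vectors first and falls back to the custom ones. The sketch below uses plain dictionaries in place of real word2vec models; `lookup_vector` and the `prefer_custom` set (for words like "burn" whose niche meaning should win) are assumptions made for this illustration. Bear in mind that vectors from two separately trained models do not share a coordinate space, so a downstream model has to be able to cope with both.

```python
def lookup_vector(word, general_vectors, custom_vectors, prefer_custom=frozenset()):
    """Return a vector for `word`, or None if neither source knows it.

    Words listed in `prefer_custom` take their niche meaning from the
    custom vectors; everything else uses the general vectors first.
    """
    if word in prefer_custom and word in custom_vectors:
        return custom_vectors[word]
    if word in general_vectors:
        return general_vectors[word]
    return custom_vectors.get(word)


# toy stand-ins for the pretrained and the custom word vectors
general = {"burn": [1.0, 0.0], "fire": [0.9, 0.1]}
custom = {"burn": [0.0, 1.0], "High_Fist": [0.2, 0.8]}

print(lookup_vector("fire", general, custom))       # common word: general vectors
print(lookup_vector("High_Fist", general, custom))  # niche phrase: custom vectors
print(lookup_vector("burn", general, custom, prefer_custom={"burn"}))
```

The `prefer_custom` set is where the misconception words from the table above would go, so that "Hood" or "burn" carry their Malazan meaning instead of their everyday one.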

Building Word2Vec Vectors

We are now done with the blah blah part of building a word2vec model, so let us dive into the coding part. Let us import all the modules needed to build the model.

This project requires that you have gone through building a phrase detector model, as it is a prerequisite.

import glob
import numpy as np

from gensim.models import callbacks
from gensim.models.word2vec import Word2Vec
from gensim.models.callbacks import CallbackAny2Vec
from gensim.models.phrases import Phrases

%run utils.py

There is nothing new in the above code except the little instruction on line 9, %run utils.py. This instruction executes the code found in the utils.py file in the current namespace. The file contains a very useful generic tool that feeds data to our model during training without having to load all the text files into memory; it was defined when we built the phrase detector model. Defining it here again would take up unnecessary space and distract us from our goal, so head here to see how it was defined.

def load_phrase_detector_model(fname, reduce_size=False):
    """Load a phrase detector model from disk.

    Parameters:
    fname: str
        path to the pretrained phrase detector model

    reduce_size: bool
        set to True only if the full sized model was saved when the
        phrase detector was trained; the model is then reduced in
        size (frozen) after loading. Otherwise it should be False

        will raise an AttributeError if set to True and the phrase 
        detector model is not a full sized model
    """
    phrases = Phrases.load(fname)

    print("Loading complete")
    return phrases.freeze() if reduce_size else phrases

# load a phrase detector model
phrase_model_path = "malaz_phrase_detector"
phrases = load_phrase_detector_model(phrase_model_path, reduce_size=True)

sentences_iterator = CustomPathLineSentences('Books', include_phrase=True,
                                             phrase_model=phrases)

If a phrase detector model is passed to the custom iterator, then as it iterates through the sentences it joins any run of words that the model detects as a phrase into a single token and returns the modified sentence (for example, “High Fist” becomes “High_Fist”).
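To make the joining step concrete, here is a small stand-alone sketch. It is not how gensim does it internally; the function merge_phrases and the hand-written set of known phrases are assumptions used only to show the effect on a tokenized sentence.

```python
from typing import List, Set, Tuple


def merge_phrases(tokens: List[str],
                  known_phrases: Set[Tuple[str, str]],
                  delimiter: str = "_") -> List[str]:
    """Join every adjacent pair of tokens that is a known phrase."""
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in known_phrases:
            merged.append(delimiter.join(pair))
            i += 2  # both words were consumed by the phrase
        else:
            merged.append(tokens[i])
            i += 1
    return merged


sentence = ["The", "High", "Fist", "summoned", "the", "High", "Mage"]
phrases_found = {("High", "Fist"), ("High", "Mage")}
print(merge_phrases(sentence, phrases_found))
```

In the real pipeline the set of phrases is not written by hand; it is whatever the trained phrase detector model learned from the books.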

It didn’t do this when we were building the phrase detector model; it simply returned the split words. That is why I said the class is generic and a very useful tool to have in your toolbelt.

The phrase detector model solves the problem of several words together constituting a single word; this helps the word2vec model find the word vector that best represents the semantic meaning of the combined word.

The text files must have the same structure as explained in the phrase detector project.

model_path = "malaz_word2vec.bin"

class CustomCallback(CallbackAny2Vec):
    """Create a custom callback that save the model 
    at the end of each epoch and at the end of training,
    while also reporting the current epoch value."""

    def __init__(self):
        self.__epoch_trained = 0

    def on_epoch_end(self, model):
        model.save(model_path)
        self.__epoch_trained += 1
        print(self.__epoch_trained, end=' | ')

    def on_train_end(self, model):
        model.save(model_path)

epochs = 1000
vector_size = 300
min_count = 3
num_workers = 4
window_size = 5
subsampling = 1e-3

model = Word2Vec(workers=num_workers,
                 vector_size=vector_size,
                 min_count=min_count,
                 window=window_size,
                 sample=subsampling)

# build word vector vocabulary
model.build_vocab(sentences_iterator)

# training word2vec
model.train(sentences_iterator, total_examples=model.corpus_count,
            epochs=epochs, compute_loss=True,
            callbacks=[CustomCallback()])

# just for precaution sake
model.save(model_path)

It might take minutes or hours depending on your computer; I fell asleep after 4 hours and it was still at 300 epochs. That is why I save the model after every epoch, again when training finally completes [probably while I was asleep], and once more afterwards (to be doubly sure it was saved). With these simple word vectors alone, you can build a fairly complex sentiment analysis model centered on this author; expect a drastic drop in accuracy if another author is added to the mix.

God loves you!

How to Build a Malazan Empire Phrase Detector Model

Samuel Nkopuruk — Thu, 19 May 2022 23:01:15 +0000

Now, I’m not huge on big words. I believe in explaining things according to my own understanding, and I have discovered that people (newbies) tend to understand a concept better and faster when a fellow newbie explains it the way he understands it, thereby breaking it down to their level. Those who want the big technical definitions can go and research; they are all over the internet.

What is a Phrase Detector

A phrase detector is an algorithm for finding words that belong together. Several techniques exist, the most popular being collocation counting (the technique used in gensim). This technique identifies words that occur together at least a predefined minimum number of times and, in gensim, whose score exceeds a threshold; the higher the threshold value, the stricter the selection process.
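As a rough illustration of how counts turn into a score, the sketch below follows the formula that, as far as I understand it, gensim's default scorer uses. The function name bigram_score and all the counts are made up for the example.

```python
def bigram_score(worda_count: int, wordb_count: int, bigram_count: int,
                 vocab_size: int, min_count: int = 5) -> float:
    """Higher when the pair occurs together far more often than the
    individual word counts would suggest."""
    return (bigram_count - min_count) * vocab_size / (worda_count * wordb_count)


threshold = 400  # the value used for training later in this post

# hypothetical counts: two rare words that almost always appear together
rare_pair = bigram_score(worda_count=40, wordb_count=45, bigram_count=40,
                         vocab_size=30000)
# hypothetical counts: two very common words that sometimes sit side by side
common_pair = bigram_score(worda_count=5000, wordb_count=4000, bigram_count=50,
                           vocab_size=30000)

print(rare_pair > threshold)    # this pair would be kept as a phrase
print(common_pair > threshold)  # this pair would be rejected
```

Raising the threshold only lets through pairs whose words rarely appear apart, which is why a higher value makes the selection stricter.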

This model is used to identify phrases (bigrams) that are present in your texts, and is especially useful when building word vectors, because it substantially reduces the computational complexity by reducing the size of the vocabulary in your word vectors. It does this by combining words that frequently occur together into a single word.

import os
import re
import string
from typing import List
from itertools import islice

from gensim import utils
from gensim.models.phrases import Phrases, ENGLISH_CONNECTOR_WORDS
from nltk.tokenize import NLTKWordTokenizer, PunktSentenceTokenizer

This model is trained on the whole Malazan Book of the Fallen series. Not only is it an excellent resource for training your model, it is an even more excellent read. You can also use any [or all] of the books presently on your computer. The books should be in [or converted to] the .txt format for easy loading, and they should all be in a folder called Books, with no subfolders containing more text files.
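Before training, it can help to check that the folder really has that layout. The helper below is only a sketch; the name list_book_files is made up, and the demo builds a throwaway folder so it runs anywhere. Point it at your own Books folder instead.

```python
import tempfile
from pathlib import Path
from typing import List


def list_book_files(folder: str) -> List[str]:
    """Return the sorted .txt files in `folder`, refusing nested folders."""
    root = Path(folder)
    if not root.is_dir():
        raise ValueError(f"{folder!r} is not a folder")
    if any(entry.is_dir() for entry in root.iterdir()):
        raise ValueError(f"{folder!r} must not contain subfolders")
    return sorted(str(entry) for entry in root.iterdir()
                  if entry.suffix.lower() == ".txt")


# demo on a temporary folder with two books and one stray image
with tempfile.TemporaryDirectory() as tmp:
    books = Path(tmp) / "Books"
    books.mkdir()
    (books / "book_two.txt").write_text("Deadhouse Gates", encoding="utf-8")
    (books / "book_one.txt").write_text("Gardens of the Moon", encoding="utf-8")
    (books / "cover.jpg").write_bytes(b"")
    found = [Path(name).name for name in list_book_files(str(books))]

print(found)
```

Anything that is not a .txt file is silently skipped here, while a subfolder raises an error so that you notice it before training starts.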

class CustomPathLineSentences:
    """Custom implementation of gensim.models.word2vec.PathLineSentences

    It differs from gensim implementation in that it replaces the default
    tokenizer with a more powerful tokenizer, while also adding more
    functionalities to it.

    Functionalities
    1) Break the block of text into sentences using PunktSentenceTokenizer()
       as each text is split on line breaks
    2) For each sentence
        a) Split the sentence into tokens on whitespace
            i) Clean each token

        b) Join words that constitute phrases into a single word if a 
           phrase detector model is passed as argument

        c) yield up the preprocessed tokens for further processing

    Parameters

    source: str
        File path of the folder containing the text file
    limit: int
        The maximum number of lines to read from each text file
    include_phrase: bool
        If True group words that constitute a phrase into a single word, 
        this should only be set to True if a phrase detector model has
        been trained
    phrase_model: phrase detector model
        The model used in detecting phrases in text
        if include_phrase is True and phrase_model is None, a ValueError
        is raised
    """
    def __init__(self, source, limit=None, 
                 include_phrase=False, phrase_model=None):
        self.source = source
        self.limit = limit
        self.include_phrase = include_phrase
        self.word_tokenizer = NLTKWordTokenizer()
        self.sentence_tokenizer = PunktSentenceTokenizer()

        if self.include_phrase and phrase_model is not None:
            self.phrase_model = phrase_model
        elif self.include_phrase and phrase_model is None:
            raise ValueError("phrase model detector not provided")

        if os.path.isfile(self.source):
            print('This is a file, use a folder next time')
            self.input_files = [self.source]
        elif os.path.isdir(self.source):
            self.source = os.path.join(self.source, '')
            self.input_files = os.listdir(self.source)
            self.input_files = [self.source + fname 
                                    for fname in self.input_files]
            self.input_files.sort()
        else:
            raise ValueError('input is neither a file or a directory')

    def __word_cleaner(self, word, cleaned_word_tokens, punctuation) -> List[str]:
        """For each word if any punctuation is still found in the 
        beginning and ending, further split them, ignore any 
        punctuation found in between the alphabet

        """
        beginning_punc = None
        ending_punc = None

        if len(word) > 1:
            if word[0] in punctuation:
                beginning_punc = word[0]
                word = word[1:]
            if word[-1] in punctuation:
                ending_punc = word[-1]
                word = word[:-1]

        if beginning_punc is not None:
            cleaned_word_tokens.append(beginning_punc)

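        # The recursive version below is disabled: a token that is a single
        # punctuation mark never gets shorter, so the recursion never ends,
        # which is most likely what kept restarting the Jupyter notebook

#         if word[0] in punctuation or word[-1] in punctuation:
#             cleaned_word_tokens = self.__word_cleaner(word, cleaned_word_tokens, 
#                                                       punctuation)
#         else:
#             cleaned_word_tokens.append(word)

        # skip the word if stripping punctuation left nothing behind
        if word:
            cleaned_word_tokens.append(word)
        if ending_punc is not None:
            cleaned_word_tokens.append(ending_punc)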

        return cleaned_word_tokens

    def clean_token_words(self, sentence) -> List[str]:
        """Split a sentence into tokens for further preprocessing"""
        word_tokens: list = sentence.split()
        cleaned_word_tokens = []
        punctuation = string.punctuation + "’" + "‘"

        for word in word_tokens:
            if not self.include_phrase:
                cleaned_word_tokens.append(word.strip(punctuation))
            else:
                self.__word_cleaner(word, cleaned_word_tokens, punctuation)

        return cleaned_word_tokens          

    def __iter__(self):
        """Iterate through the files"""
        pattern = re.compile("[‘’]")

        for fname in self.input_files:
            with utils.open(fname, 'rb') as fin:
                # iterate through the text using the inbuilt
                # readline function
                for line in islice(fin, self.limit):
                    line = utils.to_unicode(line).strip()
                    if line:
                        # text broken at the line break point may contain
                        # many sentences in it, use a sentence segmenter
                        # to further break them into sentences
                        sentences = self.sentence_tokenizer.tokenize(line)

                        # for each of those sentences break them into tokens
                        for sentence in sentences:
                            sentence = pattern.sub("'", sentence)
                            word_tokens = self.clean_token_words(sentence)
                            if not self.include_phrase:
                                yield word_tokens
                            else:
                                # combine detected words that consitutes phrases
                                # into a single word
                                generator = self.phrase_model.analyze_sentence(word_tokens)
                                yield [word[0] for word in generator]

    def __len__(self):
        counts = 0
        for sentences in self.__iter__():
            counts += 1
        return counts

OK, that was fun. You just created a handy tool and added it to your tool belt. It will be useful to us in several scenarios [especially when building word vectors or a style inference model], so save it in a Python file and name the file utils.py.

I cheated a little: I built the CustomPathLineSentences class to be generic, with use cases beyond this project, as you will see when I use it to build word vectors and train a style inference model.

In summary, once instantiated this class becomes an iterator that walks through our text files and preprocesses them at the same time, leaving us the more difficult task of training a phrase detector model, which we are going to do below:
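One reason for writing this as a class with an __iter__ method, rather than a plain generator, is that training makes more than one pass over the data, and a generator is exhausted after its first pass. The stripped-down sketch below (the class name LineTokens and the sample text are made up) shows that an object like this can be iterated again and again while only ever holding one line in memory.

```python
import tempfile
from pathlib import Path


class LineTokens:
    """Stream whitespace tokens line by line; restartable because every
    call to __iter__ reopens the file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as fin:
            for line in fin:
                tokens = line.split()
                if tokens:  # skip empty lines
                    yield tokens


# demo on a temporary file so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.txt"
    path.write_text("Coltaine rode north\n\nThe Wickans followed\n",
                    encoding="utf-8")
    corpus = LineTokens(path)
    first_pass = list(corpus)
    second_pass = list(corpus)

print(first_pass)
print(first_pass == second_pass)
```

CustomPathLineSentences follows the same idea, with sentence splitting, cleaning and phrase joining added on top.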

# train a phrase detector
def train_phrase_detector(*, threshold=400, reduce_model_memory_size=False):
    sentences_iterator = CustomPathLineSentences('Books')
    print("List of files that will be analyzed for word phrases (bigrams)")
    for file in sentences_iterator.input_files:
        print(file)

    phrases = Phrases(sentences_iterator, threshold=threshold, 
                     connector_words=ENGLISH_CONNECTOR_WORDS)

    print("Training completed")
    # freeze() returns a smaller model that can no longer be trained further
    if reduce_model_memory_size:
        return phrases.freeze(), sentences_iterator
    return phrases, sentences_iterator

We defined a function that handles the task of training the model for us using the preset and default parameters. Next, we train the model by executing the function.

threshold = 400
reduce_model_memory_size = False
phrase_model, sentences_iterator = train_phrase_detector(
    threshold=threshold,
    reduce_model_memory_size=reduce_model_memory_size)
# saving the trained model
fname = "malaz_phrase_detector"
# save() returns None, so do not assign its result back to phrase_model
phrase_model.save(fname)

Good, we have finished training the model and saved it to disk for later use, when we build word vectors and a style inference model. Let us see what the model learned and test whether it can detect phrases in text.

# print how many phrases the model detected in the training text
# (assuming gensim 4.x, where export_phrases() returns a dict of phrases)
print(f"Total number of phrases (bigrams) detected: {len(phrase_model.export_phrases())}")
text = """The Foolish Dog Clan will join your companies on the other side,' Coltaine said. 'You and the Weasel Clan shall guard this side while the wounded and the refugees cross"""
# preprocess the text in the same way the training
# text was preprocessed
text_cleaned = sentences_iterator.clean_token_words(text)
# detect phrases (bigrams) in text; analyze_sentence yields
# (token, score) pairs and the score is None for ordinary words
phrases_detected = phrase_model.analyze_sentence(text_cleaned)
print("Detected phrases")
for token, score in phrases_detected:
    if score is not None:
        print(token)

We have successfully built a phrase detector model and tested it on a piece of text, where it successfully detected the phrases in the text.

This is important: the model is general enough to detect phrases in text that follows the pattern of the training text [for example, the next series of an author's novels if the model was trained on the previous series]. But if the model is made to detect phrases in a completely different kind of text, one that does not even share the training text's vocabulary, it will fail woefully.
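One quick way to guess whether a new text is close enough to the training text is to measure how much of its vocabulary the model has already seen. This is only a rough sanity check, not part of gensim; the function vocabulary_overlap and the tiny vocabularies are made up for illustration.

```python
from typing import Iterable, Set


def vocabulary_overlap(new_tokens: Iterable[str],
                       training_vocab: Set[str]) -> float:
    """Fraction of the distinct words in the new text that also
    appeared in the training text."""
    unique = set(new_tokens)
    if not unique:
        return 0.0
    return len(unique & training_vocab) / len(unique)


# hypothetical vocabulary collected from the training books
training_vocab = {"Coltaine", "Wickans", "the", "Fist", "High", "rode"}

same_series = "the High Fist rode with Coltaine".split()
other_text = "the spaceship entered orbit".split()

same_series_overlap = vocabulary_overlap(same_series, training_vocab)
other_text_overlap = vocabulary_overlap(other_text, training_vocab)

print(round(same_series_overlap, 2))
print(round(other_text_overlap, 2))
```

A low overlap suggests the phrase detector has little to work with on that text, and training a new one is probably the better choice.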

God loves you!