Samuel Nkopuruk

How to Build a Generative Style Inference Model

We have all read books by authors with eccentric styles, writers who twist English into unique patterns of their own (Shakespeare, Steven Erikson, Patrick Rothfuss, etc.) or who even invent entire languages (J. R. R. Tolkien, George R. R. Martin, George Orwell, etc.). As fellow book lovers, many of us wish we could emulate the writing style of our favorite author, but such an endeavour would likely take months or years of intense study.

To that end, we are going to build a machine learning model that does it for us: it will learn how the author spells and punctuates, his style of grammar, his choice of words and his tone. For that to work, we have to train the model on books written exclusively by that author.

Showing the model books written by different authors would make it hard for it to pin down tone, word choice and punctuation style, because authors have diverse styles and personalities, and finding common ground across them is extremely difficult.

The books used to train the model are those written by the greatest fantasy writer of all time, Steven Erikson (The Malazan Book of the Fallen). It is a 10-book series totalling over 10,000 pages. You can download the books here; convert the files to .txt so that they can be easily opened in Python.

This is a unique project in that multiple sub-models have to be built before we can develop the style inference model. So before going further, pause and work through building the sub-models:

  1. Building A Malazan Empire Common Phrases Detector Model
  2. Building a Malazan Empire Word Vector
import os
import sys
import glob
import itertools
import numpy as np
import pandas as pd

from nltk.tokenize import WordPunctTokenizer
from nltk.stem import WordNetLemmatizer

from gensim.models.phrases import Phrases
from gensim.models.word2vec import Word2Vec

import tensorflow as tf
from tensorflow import keras

# utils.py holds the helper classes from the prerequisite projects
# (e.g. CustomPathLineSentences); you should be familiar with it
%run utils.py

Setting up prerequisite models

If you have reached this point, you must have gone through the prerequisite projects, so permit me to rush through the boring and already familiar process of loading the models we need.

def load_phrase_detector_model(fname, reduce_size=False):
    phrases = Phrases.load(fname)

    print(f"Loading complete")
    return phrases.freeze() if reduce_size else phrases

# load a phrase detector model
phrase_model_path = "malaz_phrase_detector"
phrases = load_phrase_detector_model(phrase_model_path, reduce_size=True)

# iterate over sentences in the 'Books' folder, merging detected phrases
sentences_iterator = CustomPathLineSentences('Books', include_phrase=True,
                                             phrase_model=phrases)

# load word2vec model
word2vec_path = "malaz_word2vec.bin"
word2vec = Word2Vec.load(word2vec_path)
word2vec = word2vec.wv

print("Setup complete")

Preprocessing texts

Remember, our aim is to build a model that can write with the same style, tempo and grammar as the author who wrote the texts. We are going to preprocess our text in a completely different way than we did in the previous projects, though elements of each might still be present.

First we are going to load all the text files into memory and join them into one big lump of text, which will let us perform some calculations that we will see shortly.

def preprocess_texts(sentences_iterator):
    text = []
    for sentence in sentences_iterator:
        # remember each sentence is a list of tokens
        # punctuation included
        text.extend(sentence)

    print(f"Total number of words (Phrases included) {len(text)}")

    return text

text = preprocess_texts(sentences_iterator)

The preprocess_texts function iterates through the text files, tokenizes them, detects and combines phrases, and saves each token (word) into a list.

vocab = sorted(list(set(text)))
word_indices = {word: idx for idx, word in enumerate(vocab)}
indices_word = {idx: word for idx, word in enumerate(vocab)}

print(f"Total number of unique words: {len(vocab)}")

vocab contains the unique words present in the text; word_indices is a dictionary mapping each word to an index, used when building the encoded target; and indices_word is the reverse dictionary, used as a lookup when turning a predicted index back into a word.
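As a quick sanity check, here is a minimal sketch of the round trip between the two dictionaries, using a toy vocabulary (the real indices depend entirely on your corpus):

# a toy vocabulary purely for illustration; the real `vocab` is built
# from the corpus above, so the actual indices will differ
toy_vocab = sorted(set(['the', 'jaghut', 'towers', 'arisen']))
toy_word_indices = {word: idx for idx, word in enumerate(toy_vocab)}
toy_indices_word = {idx: word for idx, word in enumerate(toy_vocab)}

idx = toy_word_indices['jaghut']          # word -> integer target
assert toy_indices_word[idx] == 'jaghut'  # predicted index -> word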

total_words = len(text)
total_sentences = len(sentences_iterator)
avg_word_sentences = total_words / total_sentences

print(f"Avg word per sentence: {avg_word_sentences}")

We calculated the average number of words per sentence; depending on the text files you use, yours might differ [I got a value of 11.73]. This value gives us a minimum baseline for choosing how many words the model gets to look at before predicting what the next word will be.

Creating dataset

maxlen = int(avg_word_sentences + 40)
step = 3
sentences = []
next_word_target = []

for idx in range(0, len(text) - maxlen, step):
    sentences.append(text[idx: idx + maxlen])
    next_word_target.append(text[idx + maxlen])

# the full token list is no longer needed; free the memory
del text
print(len(sentences), len(next_word_target))

A lot is being done in the code block above. The maxlen variable holds the number of words the model will look at before predicting what the next word will be, while step is the stride between windows: take maxlen words starting at the beginning of the text, then move forward three words and take another maxlen, then another three, and so on, as the short sketch below illustrates.
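If the windowing logic still feels abstract, here is a minimal sketch on a toy token list, with a window of 4 and a step of 3 chosen purely to keep the output readable:

# toy illustration of the sliding window used above
toy_text = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
toy_maxlen, toy_step = 4, 3

for idx in range(0, len(toy_text) - toy_maxlen, toy_step):
    print(toy_text[idx: idx + toy_maxlen], '->', toy_text[idx + toy_maxlen])

# ['a', 'b', 'c', 'd'] -> e
# ['d', 'e', 'f', 'g'] -> h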

Executing the code creates a list of word windows, each maxlen long, along with its corresponding target; both lists have the same length. The model will look at the words in the sentences variable and try to predict the corresponding word in next_word_target. With a sufficiently powerful model, it will learn the author's style of writing, punctuation, tone, grammar and even word arrangement (for example, it will learn that when the author writes T’lan it is always followed by Imass), a perfect imitator. Scary and exciting, huh? Let's dig in!

Hold up! You are familiar enough with machine learning to know that we can't just feed a deep learning algorithm strings of text; it wouldn't know what to do with them. We have to convert the strings into numbers the model can actually consume.

Turning water to wine

def cached_dataset(sentences, next_word_target, word2vec, word_indices):
    """A factory function: it returns a generator function
    that has the variables it needs cached in its enclosing scope,
    unaffected by the outer scope.

    Doing this lets the generator be called without arguments and
    without reaching out to the outer scope for its variables,
    because they are stored in its closure."""

    def generator():
        """An iterator function: it simply iterates through the
        dataset created earlier, replacing the words in each sentence
        with their word vector representations."""

        # words not present in the word vector model are given a
        # default vector where every dimension equals zero
        unknown_word = np.zeros(shape=(word2vec.vector_size,),
                                dtype=np.float32)

        for sentence, target_word in itertools.zip_longest(sentences, next_word_target):
            # create a dummy array of shape (maxlen, embed_dim);
            # this holds the word vector representation of the sentence
            data = np.zeros(shape=(maxlen, word2vec.vector_size),
                            dtype=np.float32)

            # fill in the dummy array with the real word vector values
            for idx, word in enumerate(sentence):
                if word in word2vec:
                    data[idx] = word2vec[word]
                else:
                    data[idx] = unknown_word

            # the target is the index of the next word in the vocabulary
            target = np.array([word_indices[target_word]], dtype=np.int32)

            yield (tf.convert_to_tensor(data, dtype=tf.float32),
                   tf.convert_to_tensor(target, dtype=tf.int32))
    return generator

gen = cached_dataset(sentences, next_word_target, word2vec, word_indices)

# create a tensor dataset generator using 
# the Dataset API 
dataset_generator = tf.data.Dataset.from_generator(
    gen, 
    output_signature=(tf.TensorSpec(shape=(maxlen, word2vec.vector_size),
                                    dtype=tf.float32),
                      tf.TensorSpec(shape=(1,), dtype=tf.int32)))

The magic of this function is that it converts the strings into their word vector representations on the fly, without the intermediate step of saving them to a file. It uses the TensorFlow Dataset API to feed the numerical data directly to the model during training [as you will soon see].

The Dataset API also lets us preprocess the next batch of sentences while the model is still training on the previous one, with both happening at the same time, which dramatically speeds up training.
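As a hedged illustration of how this pipeline will be consumed (the real batch size and prefetch depth used for training appear in the next section), you can batch and prefetch the generator and pull a single batch to verify the shapes:

# a minimal sketch: the batch size of 32 here is just for illustration
sample_pipeline = dataset_generator.batch(32).prefetch(tf.data.AUTOTUNE)

for batch_data, batch_target in sample_pipeline.take(1):
    print(batch_data.shape)    # (32, maxlen, word2vec.vector_size)
    print(batch_target.shape)  # (32, 1)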

Building model

The moment we have been waiting for, and also dreading: building and training the model. We are going to use LSTM layers in the architecture (read up on them to learn how they work).

num_neuron = 500
model = keras.models.Sequential()
model.add(keras.layers.Input(shape=(maxlen, word2vec.vector_size)))
model.add(keras.layers.LSTM(num_neuron, return_sequences=True))
model.add(keras.layers.LSTM(num_neuron, return_sequences=True))
model.add(keras.layers.LSTM(num_neuron))

model.add(keras.layers.Dense(len(vocab), activation='softmax'))

optimizer = keras.optimizers.Adam(learning_rate=0.001)
model.compile(loss='sparse_categorical_crossentropy',
              optimizer=optimizer)
model.summary()
# define a checkpoint callback [you will really need this]
model_cb = keras.callbacks.ModelCheckpoint("model.h5", monitor='loss')

# GET READY TO RECEIVE THE GREATEST SHOCK
# OF YOUR LIFE WHEN YOU RUN THIS CODE

# Tell me what you see when you execute the code
# what did it display (You will know what I am
# talking about when you see it)
epochs = 100
batch_size = 32

# the dataset handles batching itself, so batch_size is not passed to fit()
model.fit(dataset_generator.repeat(-1).batch(batch_size).prefetch(5),
          steps_per_epoch=len(sentences) // batch_size,
          epochs=epochs, callbacks=[model_cb])

We built our model and trained it. You may have noticed that we didn't try to limit the complexity of the model with regularizing layers (dropout, batch normalization, etc.). Doing so would have hindered our goal of imitating the writer: for that to happen, the model has to overfit on the data (this is one of those rare instances where overfitting the dataset is a good thing). So the more complex you can make your model, the better it will imitate the writer.

You will be much better off training this model on Kaggle or Google Colab than on your local computer, as training this model is highly intensive and may take an extraordinary amount of time [Think days].
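If your Kaggle or Colab session is interrupted, the ModelCheckpoint callback means you don't have to start from scratch. Here is a minimal sketch of resuming, assuming "model.h5" is the path the callback wrote to:

# reload the last checkpoint and continue training from where it stopped
model = keras.models.load_model("model.h5")
model.fit(dataset_generator.repeat(-1).batch(batch_size).prefetch(5),
          steps_per_epoch=len(sentences) // batch_size,
          epochs=epochs, callbacks=[model_cb])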

Putting it to work

def sample(preds, temperature=1.0):
    # rescale the predicted probabilities by the temperature:
    # a low temperature sharpens the distribution, a high one flattens it
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)

    # renormalize and draw a single word index from the distribution
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)

    return np.argmax(probas)

The sample function picks the next word by sampling from the predicted probability distribution, reweighted by the temperature parameter¹, rather than always taking the single most likely word.
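To see what the temperature does, here is a small hedged demo on a made-up probability vector (in the real loop preds comes from model.predict): low temperatures sharpen the distribution towards the most likely word, while high temperatures flatten it towards a random choice.

# toy demonstration of temperature sampling; the distribution is made up
toy_preds = np.array([0.6, 0.25, 0.1, 0.05])

for temp in (0.2, 1.0, 2.0):
    picks = [sample(toy_preds, temperature=temp) for _ in range(1000)]
    print(f"temperature={temp}:", np.bincount(picks, minlength=len(toy_preds)))
# at 0.2 index 0 dominates almost completely; at 2.0 the counts spread out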

text = """
it was not of Jaghut construction, that it had arisen beside the three Jaghut towers of its own accord, in answer to a law unfathomable to god and mortal alike. Arisen to await the coming of those whom it would imprison for eternity. Creatures of deadly power.
"""
text = sentences_iterator.clean_token_words(text)
text = text[:maxlen]
generated = []
generated.extend(text)
sys.stdout.write(generate)

unknown_word = np.zeros(shape=(word_vector.vector_size,),
                        dtype=np.float32)

for temp in np.arange(0.1, 0.5, 0.1):
    for i in range(400):
        data = np.zeros(shape=(1, maxlen, word2vec.vector_size),
                        dtype=np.float32)

        for t, word in enumerate(text):
            if word in word2vec:
                data[0, t] = word2vec[word]
            else:
                data[0, t] = unknown_word

        preds = model.predict(data)
        next_index = sample(preds.ravel(), t)  # optimal value 0.2 to 0.4, test
        next_word = indices_word[next_index]
        generated += next_word

        text[1:].append(next_word)
        sys.stdout.write(next_word)
        sys.stdout.flush()

You can see that the model was able, to some extent, to imitate the writing style of the author, and the level of imitation improves with the complexity of your model. Not bad for a simple model; it is something we can definitely have fun with, or you can go ahead and imitate the author you have always admired.

God loves you!


  1. Natural Language Processing in Action
