We have all read authors who write in an eccentric style, twisting English into a pattern that is uniquely theirs (Shakespeare, Steven Erikson, Patrick Rothfuss, etc.), or who even developed languages of their own (J. R. R. Tolkien, George R. R. Martin, George Orwell, etc.). As fellow book lovers, many of us wish we could emulate the writing style of our favorite author, but such an endeavour would likely take months or years of intense study.
For that purpose we are going to build a machine learning model that does it for us: it will learn the way the author spells and punctuates, his style of grammar, his choice of words and his tone. To do that, we have to show the model books written exclusively by that author so it can learn from them.
Showing the model books written by different authors makes it hard for it to detect tone, word choice and punctuation style, because authors have diverse styles and personalities, and finding common ground across them would be extremely difficult.
The books used to train the model are those written by the greatest fantasy writer of all time, Steven Erikson (The Malazan Book of the Fallen). The series runs to ten books with a total of over 10,000 pages. You can download the books here; convert the files to txt so that they can be easily opened in Python.
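If the downloads come as ebooks rather than plain text, one way (among many) to batch-convert them is Calibre's ebook-convert command line tool. A rough sketch, assuming the files are EPUBs sitting in the Books folder:
# convert every EPUB in the Books folder to a .txt file alongside it
# (requires Calibre's ebook-convert to be installed and on your PATH)
import glob
import os
import subprocess

for epub_path in glob.glob("Books/*.epub"):
    txt_path = os.path.splitext(epub_path)[0] + ".txt"
    subprocess.run(["ebook-convert", epub_path, txt_path], check=True)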
This is a unique project in that multiple sub-models have to be built in order to develop this style-inference model. So before going further, pause and go through the process of building the sub-models before continuing:
import os
import sys
import glob
import itertools
import numpy as np
import pandas as pd
from nltk.tokenize import WordPunctTokenizer
from nltk.stem import WordNetLemmatizer
from gensim.models.phrases import Phrases
from gensim.models.word2vec import Word2Vec
import tensorflow as tf
from tensorflow import keras
# you should be familiar with this
%run utils.py
Setting up prerequisite models
If you are at this point, you must have gone through the prerequisite projects, so permit me to rush through the boring and already familiar process of loading up the models we need.
def load_phrase_detector_model(fname, reduce_size=False):
    phrases = Phrases.load(fname)
    print("Loading complete")
    return phrases.freeze() if reduce_size else phrases
# load a phrase detector model
phrase_model_path = "malaz_phrase_detector"
phrases = load_phrase_detector_model(phrase_model_path, reduce_size=True)
sentences_iterator = CustomPathLineSentences('Books', include_phrase=True,
phrase_model=phrases)
# load word2vec model
word2vec_path = "malaz_word2vec.bin"
word2vec = Word2Vec.load(word2vec_path)
word2vec = word2vec.wv
print("Setup complete")
Preprocessing texts
Remember, our aim is to build a model that can write with the same style, tempo and grammar as the author of the texts. We are going to preprocess our text in a completely different way from the previous projects, though an element of each of them might still be present.
First we are going to load all the text files into memory and join them into one big lump of text, which will let us perform some calculations that we will see soon.
def preprocess_texts(sentences_iterator):
    text = []
    for sentence in sentences_iterator:
        # remember each sentence is a list of tokens,
        # punctuation included
        text.extend(sentence)
    print(f"Total number of words (phrases included): {len(text)}")
    return text
text = preprocess_texts(sentences_iterator)
The preprocess_texts function iterates through the text files, tokenizes them, detects and combines phrases, and saves each token (word) into a list.
vocab = sorted(list(set(text)))
word_indices = {word: idx for idx, word in enumerate(vocab)}
indices_word = {idx: word for idx, word in enumerate(vocab)}
print(f"Total number of unique words: {len(vocab)}")
The vocab list contains the unique words present in the text; word_indices is a dictionary mapping each word to an index, used when encoding the target word; and indices_word is the reverse dictionary, which serves as a lookup when turning the model's predicted index back into a word.
total_words = len(text)
total_sentences = len(sentences_iterator)
avg_word_sentences = total_words / total_sentences
print(f"Avg word per sentence: {avg_word_sentences}")
We calculated the average number of words per sentence; depending on the text files you use, it might differ [I got a value of 11.73]. This value gives us a minimum baseline for choosing the number of words the model will look at before predicting what the next word will be.
Creating dataset
maxlen = int(avg_word_sentences + 40)
step = 3
sentences = []
next_word_target = []
for idx in range(0, len(text) - maxlen, step):
    sentences.append(text[idx: idx + maxlen])
    next_word_target.append(text[idx + maxlen])
# no longer needed
del text
print(len(sentences), len(next_word_target))
A lot happens in the code block above. The maxlen variable holds the number of words the model will look at before predicting what the next word will be, and step is how far the window slides each time: take maxlen words from the start of the text, record the word that follows as the target, move three words forward, take the next maxlen words, and so on.
Executing the code creates a list of word lists, each maxlen long, together with the corresponding targets; both lists have the same length. What our model is going to do is look at the words in the sentences variable and try to predict the corresponding word found in next_word_target. With a sufficiently powerful model, it will be able to learn the author's style of writing, punctuation, tone, grammar and even word arrangement (e.g. it will learn that whenever the author writes T’lan it is always followed by Imass), a perfect imitator. Scary and exciting, huh? Let's dig in!
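To make the windowing concrete, here is the same loop run on a short made-up token list (a toy example, with maxlen shrunk to 4 and placeholder words rather than text from the books):
# toy illustration of the sliding window: maxlen=4, step=3
toy_text = ["the", "soldier", "drew", "his", "sword",
            "and", "the", "crows", "fell", "silent"]
for idx in range(0, len(toy_text) - 4, 3):
    print(toy_text[idx: idx + 4], "->", toy_text[idx + 4])
# ['the', 'soldier', 'drew', 'his'] -> sword
# ['his', 'sword', 'and', 'the'] -> crows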
Hold up! You are familiar enough with machine learning to know that we can't just feed a deep learning algorithm strings of text; it wouldn't know what to do with them. We have to convert the strings into palatable numbers for proper consumption by the model.
Turning water to wine
def cached_dataset(sentences, next_word_target, word2vec, word_indices):
    """A factory function: it returns a generator function that has
    the variables it needs cached in its scope, so it can be called
    without arguments and without reaching out to the outer scopes,
    because everything is stored in its closure."""
    def generator():
        """An iterator function: it simply iterates through our earlier
        created dataset, changing the words in each sentence to their
        word vector representations."""
        # words not available in the word vector model
        # are given a default word vector in which
        # every dimension equals zero
        unknown_word = np.zeros(shape=(word2vec.vector_size,),
                                dtype=np.float32)
        for sentence, target_word in itertools.zip_longest(sentences, next_word_target):
            # create a dummy array of shape (maxlen, embed_dim);
            # this will hold the word vector representation
            data = np.zeros(shape=(maxlen, word2vec.vector_size),
                            dtype=np.float32)
            # fill in the dummy array with the real word vector
            # values
            for idx, word in enumerate(sentence):
                if word in word2vec:
                    data[idx] = word2vec[word]
                else:
                    data[idx] = unknown_word
            # create the target array
            target = np.array([word_indices[target_word]], dtype=np.int32)
            yield (tf.convert_to_tensor(data, dtype=tf.float32),
                   tf.convert_to_tensor(target, dtype=tf.int32))
    return generator
gen = cached_dataset(sentences, next_word_target, word2vec, word_indices)
# create a tensor dataset generator using
# the Dataset API
dataset_generator = tf.data.Dataset.from_generator(
gen,
output_signature=(tf.TensorSpec(shape=(maxlen, word2vec.vector_size),
dtype=tf.float32),
tf.TensorSpec(shape=(1,), dtype=tf.int32)))
The magic of this function is that it converts the strings to their word vector representation on the fly, without the intermediate step of saving them to a file, and uses the TensorFlow Dataset API to feed the numerical data directly to the model during training [as you will soon see].
The Dataset API also lets us preprocess the next batch of sentences while the model is still training on the previous one, which dramatically speeds up training.
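Before committing to a long training run, it can be worth pulling a single batch out of the dataset to confirm the shapes line up; a quick sanity check, not part of the original pipeline:
# take one batch of 32 samples and inspect its shapes:
# inputs should be (32, maxlen, vector_size), targets (32, 1)
for batch_data, batch_target in dataset_generator.batch(32).take(1):
    print(batch_data.shape, batch_target.shape)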
Building model
The moment we have been waiting for, and also dreading: building the model and training it on the dataset. We are going to build the model with LSTM layers (read up on them to learn how they work).
num_neuron = 500
model = keras.models.Sequential()
model.add(keras.layers.Input(shape=(maxlen, word2vec.vector_size)))
model.add(keras.layers.LSTM(num_neuron, return_sequences=True))
model.add(keras.layers.LSTM(num_neuron, return_sequences=True))
model.add(keras.layers.LSTM(num_neuron))
model.add(keras.layers.Dense(len(vocab), activation='softmax'))
optimizer = keras.optimizers.Adam(learning_rate=0.001)
model.compile(loss='sparse_categorical_crossentropy',
optimizer=optimizer)
model.summary()
# define a checkpoint callback [you will really need this]
model_cb = keras.callbacks.ModelCheckpoint("model.h5", monitor='loss')
# GET READY TO RECEIVE THE GREATEST SHOCK
# OF YOUR LIFE WHEN YOU RUN THIS CODE
# Tell me what you see when you execute the code
# what did it display (You will know what I am
# talking about when you see it)
epochs = 100
batch_size = 32
model.fit(dataset_generator.repeat(-1).batch(batch_size).prefetch(5),
          steps_per_epoch=len(sentences) // batch_size,
          epochs=epochs, callbacks=[model_cb])
We built our model and trained it. You may have noticed that we didn't try to limit the complexity of the model with regularizing layers (dropout, batch normalization, etc.). Doing so would have hindered our purpose of imitating our writer: for that to happen, the model has to overfit on the data (this is one of those rare instances where overfitting the dataset is a good thing). So the more complex you can make your model, the better it will be able to imitate the writer.
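If your hardware can take it, making the network heavier is as simple as widening the LSTM layers or stacking more of them. Here is a sketch of one possible bigger variant (the extra layer and the size of 768 are arbitrary choices, not tuned values):
# a heavier variant: wider LSTM layers plus one extra layer,
# purely to illustrate how you might scale the model up
bigger_model = keras.models.Sequential([
    keras.layers.Input(shape=(maxlen, word2vec.vector_size)),
    keras.layers.LSTM(768, return_sequences=True),
    keras.layers.LSTM(768, return_sequences=True),
    keras.layers.LSTM(768, return_sequences=True),
    keras.layers.LSTM(768),
    keras.layers.Dense(len(vocab), activation='softmax'),
])
bigger_model.compile(loss='sparse_categorical_crossentropy',
                     optimizer=keras.optimizers.Adam(learning_rate=0.001))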
You will be much better off training this model on Kaggle or Google Colab than on your local computer, as training this model is highly intensive and may take an extraordinary amount of time [Think days].
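If the Colab or Kaggle session dies partway through (it will, eventually), you can reload the checkpoint written by the ModelCheckpoint callback and keep going; a minimal sketch, assuming model.h5 is the file saved above:
# resume training from the last saved checkpoint
model = keras.models.load_model("model.h5")
model.fit(dataset_generator.repeat(-1).batch(batch_size).prefetch(5),
          steps_per_epoch=len(sentences) // batch_size,
          epochs=10, callbacks=[model_cb])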
Putting it to work
def sample(preds, temperature=1.0):
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
The sample function draws the index of the next word from the model's predicted probabilities, with the temperature controlling how conservative or adventurous the draw is.1
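To get a feel for what the temperature does, you can call sample on a made-up probability distribution and watch how the picks spread out as the temperature rises (toy numbers, purely for illustration):
# low temperature: almost always index 0 (the most probable word);
# temperature 1.0: a mix of 0, 1 and 2 roughly matching the distribution
toy_preds = np.array([0.6, 0.3, 0.1])
for temp in (0.2, 1.0):
    print(temp, [sample(toy_preds, temp) for _ in range(10)])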
text = """
it was not of Jaghut construction, that it had arisen beside the three Jaghut towers of its own accord, in answer to a law unfathomable to god and mortal alike. Arisen to await the coming of those whom it would imprison for eternity. Creatures of deadly power.
"""
text = sentences_iterator.clean_token_words(text)
text = text[:maxlen]
generated = []
generated.extend(text)
sys.stdout.write(' '.join(generated))
unknown_word = np.zeros(shape=(word2vec.vector_size,),
                        dtype=np.float32)
for temp in np.arange(0.1, 0.5, 0.1):
    for i in range(400):
        data = np.zeros(shape=(1, maxlen, word2vec.vector_size),
                        dtype=np.float32)
        for t, word in enumerate(text):
            if word in word2vec:
                data[0, t] = word2vec[word]
            else:
                data[0, t] = unknown_word
        preds = model.predict(data)
        # optimal temperature values are around 0.2 to 0.4, experiment
        next_index = sample(preds.ravel(), temp)
        next_word = indices_word[next_index]
        generated.append(next_word)
        # slide the window: drop the first word and append the new one
        text = text[1:] + [next_word]
        sys.stdout.write(" " + next_word)
        sys.stdout.flush()
You can see that the model was able, to some extent, to imitate the writing style of the author, with the level of imitation rising with the complexity of your model. Not bad for a simple model, and it is something we can definitely have fun with, or you can go on ahead and imitate the author you have always admired.
God loves you!
1. Natural Language in Action ↩