DEV Community

Creating an LLM for testing with TensorFlow in Python


I want to test a small LLM program, and I decided to do it with TensorFlow.

My source code is available in

I - Requirements

You need to install TensorFlow and NumPy:

pip install 'numpy<2'
pip install tensorflow
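To confirm the install worked, a quick check of the installed versions can help. This is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name, not something from the original source code.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


for name in ("numpy", "tensorflow"):
    print(name, installed_version(name) or "not installed")
```

If NumPy reports a version starting with 2, the `numpy<2` pin above was not applied.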

II - Create Dataset

You need an array of strings to hold a small dataset. For example, I created:

data = [
    "Salut comment ca va",
    "Je suis en train de coder",
    "Le machine learning est une branche de l'intelligence artificielle",
    "Le deep learning est une branche du machine learning",
]

You can find a dataset on Kaggle if you're short on inspiration.
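To get a feel for how small this dataset is, the sketch below builds a word-to-index vocabulary from it in plain Python. It is a simplified stand-in for what a tokenizer does (lowercase, split on spaces, number the words by frequency). `build_word_index` is a hypothetical helper for illustration, not part of TensorFlow.

```python
from collections import Counter

data = [
    "Salut comment ca va",
    "Je suis en train de coder",
    "Le machine learning est une branche de l'intelligence artificielle",
    "Le deep learning est une branche du machine learning",
]


def build_word_index(sentences):
    """Map each word to an integer id, most frequent words first (ids start at 1)."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    # most_common() keeps first-seen order among words with equal counts
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


word_index = build_word_index(data)
total_words = len(word_index) + 1  # +1 because id 0 is reserved for padding
print(total_words)
print(word_index)
```

With only a few sentences the vocabulary stays tiny, so the model can only ever produce words it has seen here.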

III - Build model and train it

To do this, I create a small LLM class with one method per step: tokenizing, building the input sequences, creating the model, and training.

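import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Dense, Embedding, LSTM
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer


class LLM:

    def __init__(self):
        self.model = None
        self.max_sequence_length = None
        self.input_sequences = None
        self.total_words = None
        self.tokenizer = None
        # Run every step in order: the model must be trained before test() can use it
        self.tokenize()
        self.create_input_sequences()
        self.create_model()
        self.train()
        test_sentence = "Pour moi le machine learning est"
        print(self.test(test_sentence, 10))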

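    def tokenize(self):
        self.tokenizer = Tokenizer()
        # Build the vocabulary from the dataset before reading word_index
        self.tokenizer.fit_on_texts(data)
        # +1 because index 0 is reserved for padding
        self.total_words = len(self.tokenizer.word_index) + 1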

    def create_input_sequences(self):
        self.input_sequences = []
        for line in data:
            token_list = self.tokenizer.texts_to_sequences([line])[0]
            # Keep every prefix of the sentence: [w1, w2], [w1, w2, w3], ...
            for i in range(1, len(token_list)):
                n_gram_sequence = token_list[:i + 1]
                self.input_sequences.append(n_gram_sequence)

        # Left-pad all prefixes with zeros so they share the same length
        self.max_sequence_length = max([len(x) for x in self.input_sequences])
        self.input_sequences = pad_sequences(self.input_sequences, maxlen=self.max_sequence_length, padding='pre')

    def create_model(self):
        self.model = Sequential()
        # input_length is deprecated in Keras 3; the layer infers the sequence length from the data
        self.model.add(Embedding(self.total_words, 100))
        # Return only the last LSTM output so the model predicts one next word per sequence
        self.model.add(LSTM(150))
        self.model.add(Dense(self.total_words, activation='softmax'))

    def train(self):
        self.model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

        # Inputs are all ids but the last; the target is the last id, one-hot encoded
        X, y = self.input_sequences[:, :-1], self.input_sequences[:, -1]
        y = tf.keras.utils.to_categorical(y, num_classes=self.total_words)
        self.model.fit(X, y, epochs=200, verbose=1)
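The step that turns each sentence into training examples is easier to see on a tiny input. The sketch below reproduces the idea with NumPy only: every prefix of a sentence becomes one example, prefixes are left-padded with zeros, and the last id of each row is the word to predict. The token ids and the vocabulary size of 10 are made up for illustration, and `prefix_sequences` and `pad_pre` are hypothetical helpers that only mimic `padding='pre'`.

```python
import numpy as np


def prefix_sequences(token_list):
    """Return every prefix of length >= 2: [t0, t1], [t0, t1, t2], ..."""
    return [token_list[: i + 1] for i in range(1, len(token_list))]


def pad_pre(sequences, maxlen):
    """Left-pad each sequence with zeros up to maxlen (like padding='pre')."""
    padded = np.zeros((len(sequences), maxlen), dtype=int)
    for row, seq in enumerate(sequences):
        padded[row, maxlen - len(seq):] = seq
    return padded


tokens = [4, 7, 2, 9]  # hypothetical ids for a four-word sentence
sequences = prefix_sequences(tokens)
padded = pad_pre(sequences, maxlen=max(len(s) for s in sequences))

# Inputs: all ids but the last one; target: the last id of each row
X, y = padded[:, :-1], padded[:, -1]
# One-hot targets, assuming a vocabulary of 10 ids
y_one_hot = np.eye(10)[y]

print(X)
print(y)
```

Each row of `X` is a context and the matching entry of `y` is the next word, which is why the final `Dense` layer has one output per vocabulary id.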

IV - Test

Finally, I test the model with a test method, which is called in the constructor of my class.

Warning: this test function stops generating as soon as the generated word is identical to the previous one.

    def test(self, sentence: str, nb_word_to_generate: int):
        last_word = ""
        for _ in range(nb_word_to_generate):
            # Encode the current sentence and pad it to the model's input length
            token_list = self.tokenizer.texts_to_sequences([sentence])[0]
            token_list = pad_sequences([token_list], maxlen=self.max_sequence_length - 1, padding='pre')
            # Pick the id of the most probable next word
            predicted = int(np.argmax(self.model.predict(token_list, verbose=0), axis=-1)[0])
            # Map the id back to its word (empty string for the padding id 0)
            output_word = self.tokenizer.index_word.get(predicted, "")

            # Stop early if the model repeats the previous word
            if last_word == output_word:
                return sentence

            sentence += " " + output_word
            last_word = output_word

        return sentence
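The generation loop can be tried without training anything. In the sketch below the model is replaced by a fixed lookup table, so only the loop logic is shown: predict a word, stop if it repeats the previous one, otherwise append it. `generate`, `toy_predict` and the `NEXT_WORD` table are hypothetical names for illustration.

```python
def generate(sentence, nb_words, predict_next):
    """Greedy generation: append the predicted word until it repeats or the budget runs out."""
    last_word = ""
    for _ in range(nb_words):
        word = predict_next(sentence)
        if word == last_word:  # same word twice in a row: stop early
            break
        sentence += " " + word
        last_word = word
    return sentence


# Hypothetical stand-in for the model: looks up the last word in a fixed table
NEXT_WORD = {"est": "une", "une": "branche", "branche": "branche"}


def toy_predict(sentence):
    return NEXT_WORD.get(sentence.split()[-1], "")


result = generate("le machine learning est", 10, toy_predict)
print(result)
```

The repeat check is a simple guard: a model trained on so little data tends to get stuck on one word, and stopping there keeps the output readable.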
