Davide Santangelo

Building a Tiny Language Model (LLM) in Ruby: A Step-by-Step Guide - V1

Introduction

Large Language Models (LLMs) have revolutionized the field of natural language processing (NLP), enabling machines to understand, generate, and even engage in meaningful dialogue with humans. They are the backbone of applications such as chatbots, machine translation, content generation, and more. While Python has become the dominant language for LLM development due to its extensive ecosystem of libraries like TensorFlow and PyTorch, Ruby provides a unique and refreshing opportunity to dive into the foundational concepts behind these models.

Ruby’s elegance and readability make it an excellent language for experimenting with the inner workings of language models. By focusing on the basics, Ruby allows developers to demystify the complexities of NLP and gain a deeper understanding of how these models operate under the hood. Moreover, Ruby’s vibrant community and simple syntax make it accessible even to those without a deep background in machine learning.

This guide will take you step by step through the process of building a simple yet functional LLM using Ruby. We’ll explore everything from preprocessing text data to implementing an N-gram model, training it on a dataset, and testing its ability to generate predictions. By the end, you’ll not only have a working implementation but also the knowledge to expand and optimize it further.

Whether you’re a Ruby enthusiast looking to explore the realm of LLMs or an NLP learner eager to try something new, this guide will empower you to embark on your journey into language modeling.


Table of Contents

  1. Understanding Language Models
  2. Setting Up the Environment
  3. Building the Dataset
  4. Implementing the Language Model
  5. Training the Model
  6. Testing and Using the Model
  7. Hardware Requirements and Performance
  8. Advanced Section
  9. Conclusion

Understanding Language Models

What is a Language Model?

A language model is a foundational component of natural language processing (NLP) systems. It predicts the likelihood of a word or sequence of words based on the context provided by preceding words. This ability to model the probability of word sequences is what allows machines to "understand" and generate human-like text.

The Core Idea

At its essence, a language model calculates the probability of a sequence of words:

P(w_1, w_2, \dots, w_n) = \prod_{i=1}^{n} P(w_i | w_1, w_2, \dots, w_{i-1})

Here:

  • P(w_1, w_2, \dots, w_n) is the overall probability of the sentence.
  • P(w_i | w_1, w_2, \dots, w_{i-1}) is the conditional probability of the word w_i given the previous words in the sequence.

By assigning probabilities to word combinations, the model can determine which sequences are more "natural" or likely.
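
For example, with the three-word sentence "the cat sits" from the dataset we will use later, the chain rule expands to:

P(\text{the}, \text{cat}, \text{sits}) = P(\text{the}) \cdot P(\text{cat} | \text{the}) \cdot P(\text{sits} | \text{the}, \text{cat})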


Types of Language Models

  1. Statistical Language Models (SLMs):

    • These models rely on statistical techniques to estimate probabilities.
    • Examples include:
      • N-gram Models: Simplify probability calculations by only considering a fixed number ( N-1 ) of preceding words (see the worked example after this list):
        P(w_i | w_1, w_2, \dots, w_{i-1}) \approx P(w_i | w_{i-N+1}, \dots, w_{i-1})
      • Hidden Markov Models (HMMs): Use probabilistic transitions between states to generate text or recognize patterns.
  2. Neural Language Models (NLMs):

    • Use neural networks to capture more complex and long-range dependencies between words.
    • Examples include:
      • Recurrent Neural Networks (RNNs): Process sequences of varying lengths but struggle with long-term dependencies.
      • Transformers: Use self-attention mechanisms to model relationships across entire sequences, forming the backbone of modern LLMs like GPT and BERT.
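
As a worked example of the N-gram approximation mentioned above: under a bigram model (N = 2), each conditional probability from the chain rule collapses so that a word depends only on its immediate predecessor:

P(\text{sits} | \text{the}, \text{cat}) \approx P(\text{sits} | \text{cat})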

Applications of Language Models

  1. Text Generation:

    • Language models can generate coherent sentences, paragraphs, or even entire articles by predicting one word at a time.
  2. Speech Recognition:

    • Convert spoken words into text by identifying the most likely sequence of words from audio input.
  3. Machine Translation:

    • Translate text from one language to another by understanding context and grammar.
  4. Autocompletion and Autocorrect:

    • Predict or correct words as users type, enhancing productivity and accuracy.
  5. Chatbots and Virtual Assistants:

    • Enable conversational AI by understanding user input and generating relevant responses.

Challenges in Language Modeling

  1. Data Sparsity:

    • Human language is vast, and it’s difficult to have enough data to cover all possible word combinations.
  2. Long-Range Dependencies:

    • Capturing relationships between words that are far apart in a sentence or paragraph is computationally challenging.
  3. Ambiguity:

    • Many words and phrases have multiple meanings depending on context.
  4. Resource Requirements:

    • Training and deploying large-scale models require significant computational resources.

Why are Language Models Important?

Language models form the backbone of many AI systems, enabling machines to process and generate text in a way that feels natural to humans. By predicting what comes next in a sequence, they provide the structure needed for a wide range of applications, from predictive text to automated content creation. Their development has pushed the boundaries of what machines can achieve, making NLP one of the most exciting fields in artificial intelligence.

By building a language model from scratch, as we will in this guide, you'll gain a deeper appreciation for the techniques and challenges involved in teaching machines to understand and generate language.

Why Use Ruby?

Ruby’s simplicity and elegance make it a great choice for learning and experimentation. Its machine-learning ecosystem is far smaller than Python’s, and pure-Ruby code won’t match the speed of C-accelerated libraries, but Ruby handles simpler models effectively and is an excellent option for educational purposes or rapid prototyping.


Setting Up the Environment

Before diving into code, set up your development environment.

Install Required Gems

We’ll use the following gems:

  • numo-narray for numerical computations.
  • pstore for saving and loading models.

Install them using:

gem install numo-narray
gem install pstore
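
If you prefer Bundler, a minimal Gemfile with the same gems works just as well (a sketch; adjust versions as needed):

# Gemfile
source "https://rubygems.org"

gem "numo-narray"
gem "pstore"

Then run bundle install instead of the individual gem install commands.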

Initialize the Project

Create a directory for your project:

mkdir ruby_llm
cd ruby_llm

Building the Dataset

Language models require text data. For simplicity, we’ll use a small dataset of sentences.

Example Dataset

Save the following text in a file called dataset.txt:

the cat sits on the mat
the dog barks at the moon
the bird sings in the tree

Preprocess the Data

Create a script preprocess.rb to tokenize and clean the text:

def preprocess(file)
  data = File.read(file).downcase
  sentences = data.split("\n").map { |line| line.split } # tokenize each line into words
  vocabulary = sentences.flatten.uniq                    # unique words across the corpus
  { sentences: sentences, vocabulary: vocabulary }
end

data = preprocess('dataset.txt')
# Serialize the tokenized data with Marshal so the training script can load it later
File.open('data.bin', 'wb') { |f| Marshal.dump(data, f) }

Run the script:

ruby preprocess.rb
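
For the three-sentence dataset above, the serialized output is small enough to inspect directly. Loading it back (for example in irb) should show something like this:

data = Marshal.load(File.binread('data.bin'))
data[:sentences]
# => [["the", "cat", "sits", "on", "the", "mat"],
#     ["the", "dog", "barks", "at", "the", "moon"],
#     ["the", "bird", "sings", "in", "the", "tree"]]
data[:vocabulary]
# => ["the", "cat", "sits", "on", "mat", "dog", "barks", "at", "moon", "bird", "sings", "in", "tree"]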

Implementing the Language Model

We’ll implement a basic N-gram Language Model.

Define the Model

Create a file language_model.rb:

require 'pstore'
require 'numo/narray'

class LanguageModel
  attr_reader :vocabulary, :ngrams

  def initialize(n = 2)
    @n = n
    @ngrams = Hash.new(0)
    @vocabulary = []
  end

  def train(sentences)
    sentences.each do |sentence|
      (0..sentence.length - @n).each do |i|
        ngram = sentence[i, @n]
        @ngrams[ngram] += 1
      end
    end
    normalize
  end

  def normalize
    @ngrams.transform_values! { |count| count.to_f / @ngrams.values.sum }
  end

  def predict(context)
    candidates = @ngrams.select { |ngram, _| ngram[0...-1] == context }
    candidates.max_by { |_, probability| probability }&.first&.last
  end

  def save_model(file)
    store = PStore.new(file)
    store.transaction do
      store[:ngrams] = @ngrams
      store[:vocabulary] = @vocabulary
    end
  end

  def load_model(file)
    store = PStore.new(file)
    store.transaction do
      @ngrams = store[:ngrams]
      @vocabulary = store[:vocabulary]
    end
  end
end

Training the Model

Create a script train.rb:

require_relative 'language_model'

# Load the tokenized sentences produced by preprocess.rb
data = Marshal.load(File.binread('data.bin'))
sentences = data[:sentences]

model = LanguageModel.new(2)
model.train(sentences)
model.save_model('model.pstore')

puts "Model trained and saved!"

Run the script:

ruby train.rb
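
To get a feel for what was just saved: with n = 2, train slides a two-word window over every sentence, so the keys of the model's ngrams hash are the fifteen bigrams of our tiny corpus (each of which occurs exactly once before normalization):

model.ngrams.keys
# => [["the", "cat"], ["cat", "sits"], ["sits", "on"], ["on", "the"], ["the", "mat"],
#     ["the", "dog"], ["dog", "barks"], ["barks", "at"], ["at", "the"], ["the", "moon"],
#     ["the", "bird"], ["bird", "sings"], ["sings", "in"], ["in", "the"], ["the", "tree"]]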

Testing and Using the Model

Create a script test_model.rb:

require_relative 'language_model'

model = LanguageModel.new
model.load_model('model.pstore')

loop do
  puts "Enter a word (or 'exit' to quit):"
  input = gets&.chomp
  break if input.nil? || input == 'exit'

  prediction = model.predict([input])
  if prediction
    puts "Next word prediction: #{prediction}"
  else
    puts "No prediction available."
  end
end

Run the script and test predictions:

ruby test_model.rb
Enter a word (or 'exit' to quit):
the
Next word prediction: cat

Enter a word (or 'exit' to quit):
cat
Next word prediction: sits

Enter a word (or 'exit' to quit):
dog
Next word prediction: barks

Enter a word (or 'exit' to quit):
bird
Next word prediction: sings

Enter a word (or 'exit' to quit):
tree
No prediction available.

Enter a word (or 'exit' to quit):
exit

Hardware Requirements and Performance

Hardware Recommendations

  • Development: Any modern computer with 4GB+ RAM.
  • Training Larger Models:
    • 8GB+ RAM for larger datasets.
    • SSD storage for faster data access.

Performance Considerations

  1. Dataset Size: Larger datasets improve accuracy but require more memory and processing power.
  2. N-gram Size: Higher n values capture more context but increase computational complexity.
  3. Optimizations:
    • Use Numo::NArray for faster numerical operations (see the sketch after this list).
    • Parallelize training using Ruby threads (for advanced users).
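
As a rough sketch of the Numo::NArray idea (ngram_counts here is a stand-in for any hash of raw counts, such as the one train builds before normalization):

require 'numo/narray'

# A stand-in counts hash, purely for illustration
ngram_counts = { %w[the cat] => 1, %w[cat sits] => 1, %w[the dog] => 1 }

counts = Numo::DFloat[*ngram_counts.values]  # counts as a numeric vector
probs  = counts / counts.sum                 # one vectorized division instead of a Ruby loop
normalized = ngram_counts.keys.zip(probs.to_a).to_h
# => each bigram maps to 1.0/3 here, since all counts are equal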

Advanced Section

In this advanced section, we will explore how to enhance your Ruby-based language model. We'll dive into more sophisticated algorithms, optimization techniques, and integrations with external libraries to take your model to the next level.

Implementing N-gram Models

While a simple model might use bigrams (n=2), increasing the value of n lets the model condition on more context and can improve its predictions, provided your corpus is large enough that the longer n-grams don't become too sparse.

def build_n_gram_model(corpus, n)
  n_grams = Hash.new { |hash, key| hash[key] = [] }
  tokens = corpus.split
  tokens.each_cons(n) do |gram|
    key = gram[0...-1].join(' ')
    value = gram[-1]
    n_grams[key] << value
  end
  n_grams
end
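
For example, feeding the first sentence of our dataset through this helper with n = 2 groups every word with the words observed immediately after it:

model = build_n_gram_model("the cat sits on the mat", 2)
# => {"the"=>["cat", "mat"], "cat"=>["sits"], "sits"=>["on"], "on"=>["the"]}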

Smoothing Techniques

To handle zero probabilities in your n-gram model, apply smoothing techniques like Laplace smoothing.

def predict_next_word(model, context)
  vocabulary_size = model.values.flatten.uniq.size
  # The n-gram model maps each context to the list of words observed after it;
  # tally that list into counts so we can apply Laplace (add-one) smoothing.
  word_counts = model.fetch(context, []).tally
  return nil if word_counts.empty?

  total = word_counts.values.sum + vocabulary_size
  probabilities = Hash.new(1.0 / total) # unseen words fall back to the smoothed baseline

  word_counts.each do |word, count|
    probabilities[word] = (count + 1).to_f / total
  end

  probabilities.max_by { |_, prob| prob }[0]
end
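
Using the bigram model built in the previous example, the smoothed predictor picks the most probable continuation, and a context it has never seen simply returns nil:

predict_next_word(model, "cat")     # => "sits"
predict_next_word(model, "sits")    # => "on"
predict_next_word(model, "banana")  # => nil (unseen context)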

Integrating with Machine Learning Libraries

Leverage Ruby gems like torch.rb (installed with gem install torch-rb) to integrate deep learning capabilities into your model.

require 'torch' # provided by the torch-rb gem

# A simple LSTM-based neural language model.
# Named NeuralLanguageModel to avoid clashing with the N-gram LanguageModel class above.
class NeuralLanguageModel < Torch::NN::Module
  def initialize(vocab_size, embedding_dim, hidden_dim)
    super()
    @embeddings = Torch::NN::Embedding.new(vocab_size, embedding_dim)
    @lstm = Torch::NN::LSTM.new(embedding_dim, hidden_dim)
    @linear = Torch::NN::Linear.new(hidden_dim, vocab_size)
  end

  def forward(input)
    embeds = @embeddings.call(input)
    lstm_out, _ = @lstm.call(embeds) # the LSTM returns [output, hidden_state]
    @linear.call(lstm_out[-1])       # score the last time step against the vocabulary
  end
end

Parallelizing with Multithreading

Improve throughput by processing data concurrently with Ruby threads. This helps most with I/O-bound work (such as reading and tokenizing many files); on CRuby, the global VM lock limits true CPU parallelism, so consider processes or Ractors for CPU-bound workloads.

# Queue and Thread are built into modern Ruby, so no extra require is needed.
def process_corpus_in_parallel(corpus_chunks)
  queue = Queue.new
  corpus_chunks.each { |chunk| queue << chunk }

  threads = Array.new(4) do
    Thread.new do
      loop do
        # Non-blocking pop raises ThreadError once the queue is drained
        chunk = begin
          queue.pop(true)
        rescue ThreadError
          break
        end
        process_chunk(chunk) # your per-chunk processing routine
      end
    end
  end
  threads.each(&:join)
end
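
A minimal way to exercise this helper, assuming a placeholder process_chunk (here it just counts words per chunk, purely for illustration) and chunks built with each_slice:

# Placeholder per-chunk work, purely for illustration
def process_chunk(chunk)
  chunk.sum { |line| line.split.size }
end

corpus_chunks = File.readlines('dataset.txt').each_slice(100).to_a
process_corpus_in_parallel(corpus_chunks)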

By incorporating these advanced techniques, you can significantly enhance the functionality and efficiency of your Ruby-based language model. Experiment with different methods to find the optimal combination for your specific use case.

Conclusion

Congratulations! You’ve built a functional N-gram Language Model in Ruby. While this is a basic implementation, it provides a strong foundation for understanding language models. You can extend this by:

  • Using larger datasets.
  • Implementing advanced models like LSTMs or Transformers.
  • Exploring Ruby bindings for libraries like TensorFlow or PyTorch.

Happy coding!



Top comments (5)

Pekka

Davide, thank you for the interesting material. I can't understand right away why the total sum is recalculated at each step during normalization. It seems to introduce some distortions. Or do I not understand at all?
@ngrams.transform_values! { |count| count.to_f / @ngrams.values.sum }

Davide Santangelo

Hi Pekka great question! Let me clarify why recalculating the sum at each step isn’t ideal but doesn’t introduce distortions in this specific code:

1. Why It Works (No Distortion)

In the code @ngrams.transform_values! { |count| count.to_f / @ngrams.values.sum }, the transform_values! block iterates over the original @ngrams hash. Even though the method updates values in-place, @ngrams.values.sum is computed using the original counts for every division. This means all values are normalized against the same initial total, ensuring consistency. Distortion would only occur if the sum were recalculated after some values had already been updated (which isn’t the case here).

2. Why It’s Inefficient

Recalculating @ngrams.values.sum for every n-gram results in O(n²) time complexity (summing all values n times). For small datasets, this is negligible, but for larger models, it’s computationally wasteful.

A more efficient approach computes the total once before normalization:

total = @ngrams.values.sum.to_f
@ngrams.transform_values! { |count| count / total }

This reduces the complexity to O(n) and avoids redundant calculations, while preserving correctness.

TL;DR
The original code isn’t “wrong” (no distortion), but it’s inefficient. Precomputing the total is cleaner and faster! 🚀

Let me know if you’d like further clarification! 😊

Pekka

Thank you very much for such a quick and detailed answer. Maybe I expressed myself a little confusingly. If I'm not mistaken, in @ngrams.transform_values! { |count| count.to_f / @ngrams.values.sum } the hash is updated at each iteration, and so, accordingly, is the sum of all its values. It turns out that in long sentences the distortion accumulates more towards the end. This problem does not exist in the case of total = @ngrams.values.sum.to_f, where we fix the total sum at the very beginning.

Davide Santangelo

Thanks for the follow-up question – it's a really insightful observation.

Your concern is valid: if, at each iteration, @ngrams.transform_values! { |count| count.to_f / @ngrams.values.sum } were to recalculate the sum based on values that have already been updated, then yes, the normalization could indeed "drift" over time, especially in longer sequences. In that scenario, later values would be divided by a sum that has already been partially normalized, which could lead to cumulative distortions.

However, here's what actually happens in Ruby:

  • The transform_values! method in Ruby effectively takes a "snapshot" of the hash’s key-value pairs before beginning the iteration. This means that for each key, the block is applied using the original value (the count from before any normalization has occurred). As a result, every division uses the same total sum computed from the original counts, avoiding the issue of cumulative distortion.

  • Why Precomputing Is Still Better:

    Even though Ruby's implementation prevents the distortion in this case, recalculating @ngrams.values.sum on every iteration is inefficient—it leads to an O(n²) operation for n values.

    A more efficient and explicit approach is to compute the total once:

  total = @ngrams.values.sum.to_f
  @ngrams.transform_values! { |count| count / total }

This approach not only avoids any potential ambiguity about when the total is calculated but also improves performance, especially with larger datasets.

In summary:

  • Actual Behavior: Ruby's transform_values! iterates over a snapshot of the original key-value pairs, so the total sum is based on the original counts for every normalization step. This prevents the kind of distortion you described.
  • Best Practice: Despite the safe behavior in this specific case, it's better to precompute the total sum before transforming the values. This ensures clarity, improves efficiency, and avoids potential issues in different contexts or with future changes to how the method might work.

I hope this clarifies your concern! Let me know if you have any further questions.

Pekka

I was very curious to check. Here's what happened with Ruby 3.2.2

=> h = {:a=>1, :b=>2, :c=>3, :d=>4}
=> h.transform_values{|v| v/h.values.sum.to_f}
=> {:a=>0.1, :b=>0.2, :c=>0.3, :d=>0.4}
=> h.transform_values!{|v| v/h.values.sum.to_f}
=> {:a=>0.1, :b=>0.21978021978021978, :c=>0.40984837111544814, :d=>0.8457323705501586}
