Author: Davide Santangelo www.davidesantangelo.com
The landscape of Artificial Intelligence (AI) is increasingly dominated by Large Language Models (LLMs), powerful systems capable of understanding and generating human-like text. While Python has traditionally been the dominant language in the Machine Learning (ML) sphere [1], the versatility and developer-centric nature of Ruby, particularly within the Ruby on Rails framework [3], raise intriguing possibilities. Can one train their own LLM using Ruby? This report provides an expert-level deep dive into the process, exploring the necessary steps, the Ruby ecosystem's capabilities and limitations, and the practical considerations involved in undertaking such a project.
1. Introduction: Ruby in the Age of LLMs
Ruby, renowned for its elegant syntax and focus on developer happiness [3], has carved a strong niche in web development, powering major platforms like Shopify, GitHub, and Airbnb [3]. Ruby on Rails, its flagship framework, emphasizes "convention over configuration," enabling rapid development and prototyping, making it ideal for startups and building Minimum Viable Products (MVPs) [1]. The framework provides robust backend capabilities, handling business logic, databases (like PostgreSQL, MySQL, and improved SQLite in Rails 8), user authentication, and increasingly sophisticated frontend interactions via tools like Hotwire [1].
However, the domain of ML, especially the computationally intensive task of training LLMs, presents different challenges. Python's dominance stems from its rich ecosystem of mature, highly optimized libraries like TensorFlow, PyTorch, and Scikit-learn [1]. While Ruby possesses strengths in text processing and web integration [5], its native ML ecosystem is less extensive. Libraries exist, but often rely on bindings to underlying C/C++ libraries or necessitate integration with Python tools [2].
This reliance on external libraries or integration strategies forms a central theme when considering LLM training with Ruby. While pure Ruby might be suitable for educational N-gram models [10] or smaller tasks, building a state-of-the-art LLM requires leveraging tools that bridge the gap to high-performance computing environments, often originating from the C++ or Python ecosystems. This introduces complexities related to dependencies and potential performance overheads compared to a purely Python-based approach, but also unlocks significant capabilities otherwise unavailable within Ruby alone.
This report will navigate the LLM training lifecycle, examining each phase through the lens of a Ruby developer, highlighting available tools, potential roadblocks, and pragmatic strategies.
2. Setting the Stage: Environment and Core Libraries
Before embarking on LLM training, establishing a robust development environment and understanding the core Ruby libraries available for numerical computation and machine learning is crucial.
2.1. Essential Ruby Gems for ML Tasks
While not as expansive as Python's, Ruby's ML ecosystem offers several key gems:
- Numo::NArray: The cornerstone for numerical computation in Ruby, analogous to Python's NumPy [8]. It provides efficient N-dimensional arrays and mathematical operations, forming the foundation for many other scientific Ruby libraries [11]. It's actively maintained and offers interfaces to libraries like GSL (GNU Scientific Library) and LAPACK/BLAS for linear algebra via Numo::Linalg [8]. Examples and documentation, including a Ruby version of "100 NumPy exercises," help users get started [14].
- Rumale: A comprehensive ML toolkit providing algorithms with interfaces similar to Python's Scikit-Learn [8]. It reached version 1.0.0 in early 2025, indicating maturity and active development [8]. Rumale supports a wide range of algorithms including SVM, Logistic Regression, Random Forests, K-Means, GMM, DBSCAN, and even Multi-layer Perceptrons (MLPs) [8]. It relies on Numo::NArray for its underlying computations [15].
- Torch.rb: Ruby bindings for LibTorch, the C++ backend of PyTorch [8]. This gem is crucial for deep learning tasks, allowing Ruby developers to define and train neural networks using PyTorch's computational graph and autograd capabilities, potentially leveraging GPU acceleration if LibTorch is compiled with CUDA support [17]. It requires installing LibTorch separately and involves a compilation step for the gem's C++ extension [17]. The API aims to be Ruby-like (e.g., `add!` instead of `add_`) [18]. Related gems like TorchVision, TorchText, and TorchAudio extend its capabilities [17].
- tensorflow-ruby: Ruby bindings for TensorFlow, utilizing its C API [8]. It allows defining constants, variables, mathematical operations, and potentially building/running computation graphs, mimicking the Python API and supporting eager execution by default [22]. Like Torch.rb, it requires a separate TensorFlow C library installation and enables GPU usage if the library is built accordingly [22]. It supports features like gradient calculation (autodiff) and basic optimizers like Gradient Descent [22].
- NLP Libraries: For text processing, Ruby offers gems like Pragmatic Segmenter for sentence boundary detection [27], various tokenizers [27], stemmers [27], and libraries for specific tasks like Named Entity Recognition [29] or interacting with NLP APIs [5]. The Text gem provides a collection of text algorithms [28].
- LLM Interface Gems: Gems like RubyLLM, Aoororachain, Boxcars, and LangChain.rb provide higher-level interfaces for interacting with LLMs (both proprietary APIs like OpenAI and open-source models), often focusing on building applications using LLMs rather than training them from scratch [8]. `ruby-openai` is a direct wrapper for the OpenAI API [9].
Table 2.1: Key Ruby Libraries for LLM-Related Tasks
Library | Primary Function | Key Features/Notes | Python Equivalent (Conceptual) | Snippets |
---|---|---|---|---|
Numo::NArray | Numerical Computation | N-dimensional arrays, math operations, base for other libraries | NumPy | [8] |
Rumale | General Machine Learning | Scikit-Learn-like interface, various algorithms (SVM, Trees, Clustering, MLP), uses Numo::NArray, v1.0.0 in 2025 | Scikit-Learn | [8] |
Torch.rb | Deep Learning | Bindings to LibTorch (PyTorch C++ backend), GPU support (via LibTorch), autograd, NN modules | PyTorch | [8] |
tensorflow-ruby | Deep Learning | Bindings to TensorFlow C API, GPU support (via TF), eager execution, graph building, autograd | TensorFlow | [8] |
NLP Gems | Natural Language Processing | Tokenization, stemming, NER, text processing utilities (Pragmatic Segmenter, ruby-stemmer, etc.) | NLTK, spaCy, HF Tokenizers | [7], [28] |
LLM Interface Gems | Interacting with Pre-trained LLMs/APIs | RubyLLM, LangChain.rb, Boxcars, ruby-openai. Building applications using LLMs. | LangChain (Python), OpenAI SDK | [8] |
PyCall | Python Interoperability | Call Python code/libraries directly from Ruby | (Bridge library) | [8], [34] |
transformers-ruby | Using Transformer Models | Hugging Face transformers-like pipeline API, built on Torch.rb, for inference/fine-tuning pre-built models | Hugging Face transformers | [35] |
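To ground the table, here is a minimal, illustrative sketch of the Numo::NArray + Rumale combination on a toy classification task. The dataset is synthetic and the estimator options are assumptions; consult each gem's documentation for current APIs.

```ruby
# Gemfile: gem "numo-narray", gem "rumale"
require "numo/narray"
require "rumale"

# Toy dataset: two shifted clusters of 2-D points built with Numo::NArray.
x_pos = Numo::DFloat.new(50, 2).rand + 1.5   # class 1, shifted up-right
x_neg = Numo::DFloat.new(50, 2).rand - 1.5   # class 0, shifted down-left
x = x_pos.concatenate(x_neg, axis: 0)
y = Numo::Int32.cast(Array.new(50, 1) + Array.new(50, 0))

# Scikit-Learn-style estimator API from Rumale.
model = Rumale::LinearModel::LogisticRegression.new
model.fit(x, y)

predictions = model.predict(x)
accuracy = predictions.eq(y).count_true.to_f / y.size
puts "training accuracy: #{accuracy.round(3)}"
```

Nothing here approaches an LLM, but the same array-in, array-out pattern underpins the deep learning bindings discussed below.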
2.2. Python Interoperability: PyCall and Alternatives
Given Python's dominance, interacting with Python libraries from Ruby is often necessary.
- PyCall: This gem allows Ruby code to import and call Python modules and functions directly [33]. It handles basic type conversions (numerics, strings, booleans, lists/hashes) [37]. PyCall is essential for accessing libraries unavailable in Ruby, such as Hugging Face's transformers or datasets, or specialized tools like Scikit-learn [8]. Setup involves installing the gem and ensuring the correct Python environment (with necessary libraries like NumPy, Scikit-learn, etc.) is accessible [33]. One can specify the Python version via environment variables or `PyCall.init` [33]. Note that PyCall may not officially support multi-threaded use due to complexities with Python's Global Interpreter Lock (GIL) management [38].
- Alternatives to Direct Calling: Instead of PyCall, one can use Inter-Process Communication (IPC) mechanisms:
  - Shelling Out: Running Python scripts as separate processes using Ruby's backticks, `system`, or the `Open3` module [42]. `Open3` provides more control over stdin, stdout, and stderr [42]. Data exchange typically happens via standard output (requiring parsing in Ruby, often using formats like JSON) or temporary files [46]. This is simpler for one-off tasks but less efficient for frequent interaction (sketched below).
  - Message Queues: Systems like RabbitMQ or Redis can act as intermediaries. Ruby processes can enqueue tasks or data, and Python workers can pick them up, process them, and potentially return results via another queue [48]. This decouples the processes effectively.
  - APIs (REST/gRPC): The Python ML component can be exposed as a microservice with a REST or gRPC API. The Ruby application then interacts with this service via network calls [2]. This is a common pattern for integrating ML models into Rails applications [2].
  - RPC (Remote Procedure Call): Frameworks like Apache Thrift allow defining services and data structures that can be used across different languages, enabling Ruby to call functions in a Python process more formally than simple IPC [52]. XML-RPC is another option, though potentially with higher overhead [52].
Choosing the right integration method depends on the frequency of interaction, the complexity of data exchange, and performance requirements. For deep integration needed during training (e.g., accessing specific library functions repeatedly), PyCall might be necessary despite its complexities. For deploying a trained model or orchestrating distinct pipeline steps, APIs or message queues often provide better decoupling and scalability [2].
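As an illustration of the shelling-out option, the following sketch runs a hypothetical Python script (`tokenize_corpus.py`, a placeholder name) as a subprocess via `Open3` and exchanges JSON over stdin/stdout. The script and its JSON contract are assumptions.

```ruby
require "open3"
require "json"

# Run a Python script as a subprocess, sending JSON on stdin and
# parsing JSON from stdout. Raises if the process exits non-zero.
def run_python_step(script, payload)
  stdout, stderr, status = Open3.capture3(
    "python3", script,
    stdin_data: JSON.generate(payload)
  )
  raise "python step failed: #{stderr}" unless status.success?

  JSON.parse(stdout)
end

result = run_python_step("tokenize_corpus.py", { "path" => "data/corpus.txt" })
puts result["num_tokens"]
```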
2.3. The Ruby Ecosystem Context
It's important to set realistic expectations. While gems like Torch.rb and tensorflow-ruby provide powerful capabilities, they are essentially wrappers around complex external libraries. This means encountering potential compilation issues [17], needing to manage non-Ruby dependencies [22], and possibly facing an API that lags slightly behind its Python counterpart or has Ruby-specific idioms [18]. Furthermore, the surrounding ecosystem for MLOps (specialized tools for deployment, monitoring, experiment tracking) is less mature within Ruby compared to Python, which boasts tools like MLflow, Metaflow, and deep integration with platforms like Kubeflow [54]. Therefore, a Ruby-based LLM project will likely involve more "glue" code, potentially leveraging PyCall or IPC to access Python tools, or relying more heavily on general-purpose Ruby tools (like Rails, Sidekiq) for orchestration [2].
3. Phase 1: Data Preparation - Fueling the Model
LLMs are data-hungry. The quality and quantity of the training data are paramount to the model's final performance. This phase involves acquiring, cleaning, processing, and tokenizing vast amounts of text.
3.1. Acquiring Training Data
- Scale: LLM pre-training requires enormous datasets, often measured in terabytes (TB) or even petabytes (PB) of text, translating to trillions of tokens [56]. For instance, Llama 3 was trained on over 15 trillion tokens (estimated 60 TB), while Llama 2 used 2 trillion tokens (~8 TB) [56]. Even older models like GPT-3 used hundreds of billions of tokens [56].
- Sources:
- Web Crawls: Common Crawl is a massive, publicly available web archive (over 9.5 PB, >250 billion pages) frequently used as a base [59]. Datasets like C4, ROOTS (for BLOOM), Dolma, and RedPajama are derived and filtered from Common Crawl [56].
- Books: Large collections like Project Gutenberg or private book corpora.
- Code Repositories: GitHub, Stack Overflow, etc., especially for code-generation models [56].
- Wikipedia: A multilingual, structured source covering diverse topics [58].
- Academic Papers: arXiv is a common source for scientific text [63].
- Domain-Specific Data: For fine-tuning, organizations might use internal documents, emails, chat logs, support tickets, wikis, etc. [64].
- Diversity and Quality: The dataset should be diverse, covering multiple domains, languages (if needed), and styles. Quality is critical; noisy, irrelevant, or biased data can significantly harm model performance and safety [56].
3.2. Cleaning and Preprocessing Pipeline
Raw data, especially from web crawls, is messy. A rigorous cleaning pipeline is essential [7]. Typical steps include:
- Filtering: Removing boilerplate (menus, ads), low-quality content (short pages, non-textual content, excessive punctuation), potentially harmful or biased content, and documents in unwanted languages [56]. Heuristics, language detection models, and sometimes even other LLMs are used for filtering [56].
- Deduplication: Removing duplicate or near-duplicate documents is crucial, as excessive repetition can harm model performance [56]. Techniques include:
- Exact Deduplication: Using hash functions (e.g., SHA256) to find identical documents [56].
- Fuzzy Deduplication: Using methods like MinHashLSH (used in GPT-3 processing with Spark [56]) or n-gram overlap to find highly similar but not identical documents [56].
- Normalization: Converting text to a consistent format (e.g., lowercase, handling unicode characters, removing extra whitespace) [5].
- Data Splitting: Dividing the data into training, validation, and test sets [65].
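A minimal sketch of the normalization and exact-deduplication steps in plain Ruby, assuming a placeholder directory of text files. Fuzzy deduplication (e.g., MinHashLSH) would still require specialized tooling.

```ruby
require "digest"

# Normalize each document and drop exact duplicates by hashing the
# normalized text with SHA-256 (exact deduplication only).
def normalize(text)
  text.unicode_normalize(:nfkc)
      .downcase
      .gsub(/\s+/, " ")
      .strip
end

seen    = {}
deduped = []

Dir.glob("data/raw/*.txt").each do |path|   # placeholder corpus location
  doc = normalize(File.read(path, encoding: "UTF-8"))
  next if doc.empty?

  digest = Digest::SHA256.hexdigest(doc)
  next if seen[digest]

  seen[digest] = true
  deduped << doc
end

puts "kept #{deduped.size} unique documents"
```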
3.3. Tokenization: Breaking Down Language
LLMs operate on numerical representations of text, requiring tokenization [58].
- What are Tokens?: Sub-units of text, which can be words, subwords (like `token`, `ization`), or characters [58]. Subword tokenization (e.g., BPE, WordPiece, SentencePiece) is common for LLMs as it balances vocabulary size and the ability to handle rare words or out-of-vocabulary terms [66].
- Process:
  - Train a tokenizer on a representative corpus to build a vocabulary and learn merge rules (for subword tokenizers) [66].
  - Apply the trained tokenizer to the entire dataset, converting text sequences into sequences of integer IDs [68].
  - Add special tokens (e.g., `[PAD]`, `[CLS]`) as required by the specific model architecture [68].
- Tokenization in Ruby:
  - Basic Ruby Tools: Simple tokenization can be done with `String#split` [5] or regex-based tokenizers like TactfulTokenizer [27].
  - Dedicated Gems: More sophisticated options include Pragmatic Tokenizer (multilingual rule-based) [27], `nlp-pure` [27], or `textoken` [27].
  - Compatibility Challenge: The critical aspect is using the exact same tokenizer during training as the one used during pre-training (if fine-tuning) or the one the chosen model architecture expects. Mismatched tokenization will lead to poor results.
  - Practical Approach: Often, the most reliable way is to use the tokenizer associated with the target model architecture. This usually means using Python tokenizers (like those from Hugging Face's tokenizers library) via PyCall [8]. The `transformers-ruby` gem likely wraps these tokenizers internally when using its pipeline API [35].
3.4. Data Pipeline Implementation in Ruby
Implementing the entire large-scale data preparation pipeline (TBs/PBs) natively in Ruby presents significant challenges. While Ruby excels at text manipulation for smaller datasets [5] and can orchestrate external tools [55], the sheer scale and computational intensity of steps like fuzzy deduplication across billions of documents often necessitate specialized, distributed frameworks.
Consider the scale: processing terabytes or petabytes of data requires efficient, parallelized operations [56]. Pipelines like CCNet shard data for parallel processing [64], and the GPT-3 paper explicitly mentions using Apache Spark for tasks like MinHashLSH deduplication [56]. The Ruby ecosystem lacks widely adopted, mature frameworks directly comparable to Spark or Dask for this specific type of large-scale, distributed ML data processing.
Therefore, a pragmatic approach using Ruby might involve:
- Orchestration: Using Ruby scripts or a framework like Rails to manage the overall pipeline flow, potentially triggering processing steps implemented in other languages/tools.
- Smaller-Scale Processing: Handling specific tasks like normalization or filtering on smaller data chunks using Ruby's text processing gems [5].
- Calling External Tools: Using `Open3` or similar to invoke specialized command-line tools (potentially written in C++ or Python) for heavy lifting like deduplication or complex filtering [42].
- Leveraging Python via PyCall: Executing Python scripts or libraries directly for steps where mature Python implementations exist (e.g., using Hugging Face datasets for loading and processing, or specific deduplication libraries) [40]; a small sketch follows at the end of this subsection.
- Using APIs/Services: Offloading data processing tasks to dedicated cloud services or data platforms.
Building and executing the data preparation phase for a large LLM entirely or primarily in Ruby could introduce performance bottlenecks or require substantial custom engineering effort compared to leveraging the more established Python data science stack for these specific, large-scale tasks.
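As an example of the PyCall route, the sketch below drives a Hugging Face tokenizer from Ruby. It assumes the `pycall` gem and a Python environment with `transformers` installed, and uses `bert-base-uncased` purely as a placeholder checkpoint.

```ruby
# Gemfile: gem "pycall"; the Python environment must have `transformers`.
require "pycall/import"
include PyCall::Import

pyimport :transformers

# Placeholder checkpoint; any Hugging Face tokenizer name would do,
# as long as it matches the model you intend to fine-tune.
tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-uncased")

ids = tokenizer.encode("Training an LLM with Ruby").to_a
puts ids.inspect          # integer token IDs
puts tokenizer.decode(ids) # round-trips back to text (with special tokens)
```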
4. Phase 2: Model Architecture - Building the Brain
The heart of an LLM is its architecture, defining how information flows and is processed. The Transformer architecture has become the standard foundation for most modern LLMs.
4.1. Understanding the Transformer Architecture
Introduced in the seminal paper "Attention Is All You Need" [70], the Transformer architecture revolutionized NLP by relying entirely on attention mechanisms, abandoning the recurrent structures (like LSTMs) common in earlier models [70]. This allows for greater parallelization during training [70]. Key components include:
- Embeddings: Input tokens are converted into dense vector representations (embeddings). The dimensionality of these vectors is the model's hidden size (`d_model`) [70].
- Positional Encoding: Since Transformers process tokens in parallel, they lack inherent sequence awareness. Positional encodings (either fixed sinusoidal functions or learned embeddings) are added to the token embeddings to inject information about token position [70]. Alternatives like Relative Position Encodings (e.g., ALiBi, Toeplitz matrices) have also emerged [70].
- Multi-Head Self-Attention: The core mechanism. For each token, attention calculates a weighted sum of all other tokens' representations in the sequence (within a context window). It learns to "attend" to relevant tokens when processing the current one; the standard scaled dot-product form is written out after this list.
- Query, Key, Value (QKV): Each token's embedding is projected into three vectors: Query (what I'm looking for), Key (what I contain), and Value (what I offer). The attention score between two tokens is calculated based on the similarity (dot product) of the Query of the receiving token and the Key of the sending token. These scores are scaled, passed through softmax, and used to weight the Value vectors, which are then summed up [70].
- Multi-Head: Instead of one set of QKV projections, the model uses multiple "heads," each learning different attention patterns in parallel. The results are concatenated and projected back [70].
- Masking: In decoder-only models, masked self-attention prevents tokens from attending to future tokens during training, preserving the auto-regressive property (predicting the next token based only on past ones) [63].
- Feed-Forward Networks (FFN): Each Transformer layer contains identical, independent feed-forward networks applied to each position. These typically consist of linear transformations with an activation function (e.g., ReLU, GELU, SwiGLU) [70].
- Layer Normalization & Residual Connections: Applied around the attention and FFN modules to stabilize training and improve gradient flow [70]. RMSNorm is a common alternative to LayerNorm [70].
- Output Layer (Un-embedding): A final linear layer followed by a softmax function converts the decoder's output vectors back into probability distributions over the vocabulary, predicting the next token [68]. Weight tying (sharing weights between the input embedding and output un-embedding layers) is sometimes used [70].
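For reference, the scaled dot-product attention and its multi-head extension, as defined in "Attention Is All You Need" [70], are:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V

\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O},
\qquad \mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q},\; K W_i^{K},\; V W_i^{V})
```

where d_k is the per-head key dimension and the W matrices are learned projections.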
4.2. Transformer Variants
Different configurations of these components lead to common LLM archetypes [63]:
- Encoder-Decoder (Original Transformer, BART, T5): Contains both an encoder stack (processing the input sequence) and a decoder stack (generating the output sequence, attending to both its own previous outputs and the encoder's output). Suitable for sequence-to-sequence tasks like translation or summarization [63].
- Decoder-Only (GPT-like, Llama, PaLM): Uses only a decoder stack with masked self-attention. Auto-regressive: predicts the next token based on preceding ones. Excellent for text generation [63].
- Encoder-Only (BERT-like, RoBERTa): Uses only an encoder stack. Bidirectional: processes the entire input sequence at once (using techniques like Masked Language Modeling during pre-training). Excels at understanding context for tasks like classification, named entity recognition, or question answering [63].
4.3. Implementing Transformers in Ruby
Implementing a full Transformer from scratch in pure Ruby, while possible for educational purposes (like the simpler N-gram model in [10]), is highly impractical for building a competitive LLM due to performance limitations [2]. The viable paths involve leveraging libraries with high-performance backends:
- Option 1: Using Torch.rb: This is likely the most promising route for building custom deep learning models in Ruby. Torch.rb provides access to LibTorch's building blocks [8]. A developer could define a Transformer layer or model as a `Torch::NN::Module`, composing it from the pieces below (a minimal sketch appears at the end of this section):
  - `Torch::NN::Embedding` for token embeddings.
  - `Torch::NN::Linear` for QKV projections and FFN layers.
  - Implementing the attention mechanism using tensor operations provided by Torch.rb (matrix multiplications, softmax, masking). LibTorch itself contains optimized attention implementations accessible via its C++ API, which Torch.rb might expose directly or indirectly.
  - `Torch::NN::LayerNorm` or implementing RMSNorm.
  - Adding positional encodings (calculating sinusoidal values or creating a learnable `Torch::NN::Embedding`).
  - GPU acceleration is handled by the underlying LibTorch library, provided it was installed with CUDA support [17].
- Option 2: Using tensorflow-ruby: Similarly, one could use tensorflow-ruby to define the model structure using TensorFlow's C API operations [22]. This might involve defining constants, variables, and chaining mathematical operations (`Tf::Math`) to construct the layers [22]. Defining complex layers like multi-head attention might be less direct than with Torch.rb's module system, potentially requiring more explicit graph construction or reliance on pre-defined operations exposed by the C API [26]. GPU acceleration depends on the linked TensorFlow library [22].
- Option 3: Using Rumale Components: Rumale offers MLPs and linear models [8]. While insufficient for a full Transformer, these could be used to implement the FFN layers or simpler components within a custom neural network architecture built in Ruby, primarily for learning or smaller-scale experiments, as Rumale lacks native GPU acceleration.
- Option 4: Leveraging Pre-built Models: Instead of building from scratch, one can use existing Transformer implementations:
  - `transformers-ruby` gem: Built on Torch.rb, this gem provides a high-level API similar to Hugging Face's Python library [35]. It allows loading and running pre-trained Transformer models (BERT, DistilBERT, MPNet, XLM-RoBERTa, etc.) for tasks like embedding generation, text classification, NER, and question answering [35]. This is primarily geared towards inference or fine-tuning existing architectures rather than building novel ones from the ground up.
  - PyCall: Directly import and use Python's Hugging Face transformers library within Ruby [8]. This gives access to the vast range of models and functionalities available in the Python ecosystem.
  - API-based Gems: Use gems like RubyLLM or `ruby-openai` to interact with models hosted externally (e.g., OpenAI API, Anthropic, Google) [8]. This avoids local training/implementation entirely.

The use of bindings like Torch.rb and tensorflow-ruby is essential for performance but introduces a layer of complexity. These gems depend on external C/C++ libraries (LibTorch, the TensorFlow C library) that must be installed correctly, potentially involving platform-specific steps or compilation challenges [17]. The Ruby API provided by these bindings, while aiming for similarity, might have subtle differences or lag behind the features available in the primary Python interfaces [18]. This trade-off grants Ruby access to high-performance computing necessary for deep learning but requires developers to manage external dependencies and potentially navigate a less mature API surface compared to the Python ecosystem.
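To make Option 1 concrete, here is a hedged sketch of a single masked self-attention block as a `Torch::NN::Module`. Tensor methods are assumed to mirror their PyTorch counterparts (`matmul`, `transpose`, `reshape`, `softmax`, `masked_fill`, `tril`); exact names and signatures may differ between Torch.rb versions.

```ruby
require "torch"

# One causal (masked) self-attention block, assuming PyTorch-style tensor
# methods are exposed by Torch.rb. Not a full Transformer layer: residual
# connections, LayerNorm/RMSNorm, and the FFN are omitted for brevity.
class MaskedSelfAttention < Torch::NN::Module
  def initialize(d_model, n_heads)
    super()
    @n_heads = n_heads
    @d_head  = d_model / n_heads
    @query = Torch::NN::Linear.new(d_model, d_model)
    @key   = Torch::NN::Linear.new(d_model, d_model)
    @value = Torch::NN::Linear.new(d_model, d_model)
    @out   = Torch::NN::Linear.new(d_model, d_model)
  end

  def forward(x)
    batch, seq, d_model = x.shape

    # Project and split into heads: (batch, heads, seq, d_head)
    q, k, v = [@query, @key, @value].map do |proj|
      proj.call(x).reshape(batch, seq, @n_heads, @d_head).transpose(1, 2)
    end

    # Scaled dot-product scores with a causal (lower-triangular) mask
    scores = q.matmul(k.transpose(-2, -1)) / Math.sqrt(@d_head)
    causal = Torch.ones(seq, seq).tril.eq(0)
    scores = scores.masked_fill(causal, -1e9)

    weights = scores.softmax(-1)
    context = weights.matmul(v).transpose(1, 2).reshape(batch, seq, d_model)
    @out.call(context)
  end
end

# attn = MaskedSelfAttention.new(512, 8)
# y = attn.call(Torch.randn(2, 16, 512))   # (batch=2, seq=16, d_model=512)
```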
4.4. Pre-training vs. Fine-tuning: Choosing Your Path
A critical decision is whether to pre-train a model from scratch or fine-tune an existing one.
- Pre-training:
- Goal: To learn general language understanding from vast, unlabeled datasets (trillions of tokens) [63].
- Process: Starts with randomly initialized weights and trains the entire model for weeks or months on massive compute infrastructure (hundreds or thousands of GPUs/TPUs) [56].
- Output: A foundational model with broad capabilities [74].
- Feasibility in Ruby: Extremely challenging and resource-intensive. While technically possible using Torch.rb or tensorflow-ruby on sufficient hardware, the ecosystem support, tooling maturity, and sheer scale make it impractical for most projects compared to using established Python frameworks or cloud platforms [74].
- Fine-tuning:
- Goal: To adapt a pre-trained model for a specific task or domain [58].
- Process: Starts with weights from a pre-trained model and continues training on a smaller, often labeled, task-specific dataset (thousands to millions of examples) [64]. Can range from hours to days on fewer GPUs [77].
- Techniques:
- Full Fine-tuning: Updates all model weights. Requires significant memory/compute, similar to the final stages of pre-training but on less data [73].
- Parameter-Efficient Fine-Tuning (PEFT): Updates only a small fraction of parameters (e.g., using adapters like LoRA). Drastically reduces memory and compute requirements, making fine-tuning accessible on less hardware [73].
- Output: A specialized model optimized for the target task [73].
- Feasibility in Ruby: Much more practical. One can load a pre-trained model using Torch.rb (potentially via `transformers-ruby`) or tensorflow-ruby, prepare the task-specific dataset using Ruby tools or PyCall, and implement the fine-tuning loop [76]. PEFT methods, if implemented or accessible via bindings, further increase feasibility.
Table 4.1: Pre-training vs. Fine-tuning LLMs
Aspect | Pre-training | Fine-tuning |
---|---|---|
Objective | Learn general language understanding | Adapt model for specific task/domain |
Starting Point | Random weights | Pre-trained model weights |
Dataset | Massive (TBs/PBs), unlabeled (e.g., web crawl) | Smaller (MBs/GBs), often labeled, task-specific |
Compute Needs | Very High (Thousands of GPU/TPU-weeks/months) | Moderate to Low (Hours/days on fewer GPUs, esp. with PEFT) |
Time Required | Weeks to Months | Hours to Days |
Typical Output | Foundational Model (e.g., GPT-3, Llama) | Specialized Model (e.g., chatbot, classifier, summarizer) |
Feasibility in Ruby | Extremely difficult/impractical due to scale/tools | More feasible, especially with PEFT & leveraging bindings/PyCall |
Key Snippets | [56], [74] | [58], [74] |
Given the immense resources required for pre-training and the relative maturity of Ruby's ML ecosystem, fine-tuning an existing, high-quality pre-trained model presents a far more achievable and resource-efficient path for developers looking to leverage LLMs within Ruby applications [76].
5. Phase 3: The Training Loop - Teaching the Model
The training loop is where the model learns from data. It's an iterative process involving feeding data to the model, calculating how wrong its predictions are, and adjusting its internal parameters (weights) to improve.
5.1. Setting up the Training Environment: Hardware Considerations
Training LLMs is computationally demanding, primarily constrained by GPU resources, especially VRAM (Video RAM) [79].
- GPU Memory Breakdown: Total VRAM usage comprises several components [81] (a rough estimator sketch follows Table 5.1):
- Model Weights: The largest component, directly proportional to the number of parameters. Using 16-bit precision (FP16 or BF16) is standard, requiring ~2 bytes per parameter [79]. A 7-billion parameter model needs ~14 GB, while a 70B model needs ~140 GB just for weights [79].
- Optimizer States: Optimizers like Adam/AdamW store momentum and variance values, often requiring 2-3 times the memory of the model weights if training in full precision (though techniques like 8-bit optimizers exist) [76].
- Gradients: Need space equal to the model parameters during backpropagation (typically in FP16 or FP32).
- Activations: Intermediate results stored during the forward pass, needed for gradient calculation in the backward pass. Memory usage depends on batch size, sequence length, and model architecture. Techniques like gradient checkpointing trade compute (recalculation) for memory savings [78].
- KV Cache (During Inference/Generation): Stores keys and values for attention layers. Size depends on batch size, sequence length, number of layers, and hidden size. Can become significant, especially with long sequences or high concurrency [81]. While primarily an inference concern, it’s relevant if performing generation tasks during evaluation or specific training regimes.
- Temporary Buffers & Fragmentation: Overhead from libraries and memory allocation inefficiencies [81].
- Hardware Requirements:
- GPUs: High-VRAM GPUs are essential. NVIDIA is dominant (A100 40GB/80GB, H100, RTX 6000 Ada 48GB, L40S 48GB, consumer RTX 4090/3090 24GB) [79]. AMD GPUs (Instinct series) with ROCm support are emerging alternatives [80].
- Multi-GPU Setups: Training large models necessitates distributing across multiple GPUs. High-speed interconnects (NVLink for intra-node, InfiniBand or high-speed Ethernet for inter-node) are critical to avoid communication bottlenecks [75].
- CPU: Server-grade CPUs (Intel Xeon, AMD EPYC/Threadripper PRO) are recommended for platforms supporting multiple GPUs, high memory capacity, and PCIe lanes [79]. Needed for data loading/preprocessing [79].
- System RAM: Needs to be substantial, often recommended to be at least 2x the total GPU VRAM [80]. Estimates range from 32GB-64GB+ for fine-tuning smaller models to 128GB-512GB+ for training/fine-tuning larger ones [79].
- Storage: Fast NVMe SSDs are preferred [80]. Capacity needs range from ~1 TB for fine-tuning smaller models to 10-20TB+ for pre-training datasets and checkpoints [79].
- TPUs: Google’s Tensor Processing Units offer an alternative, especially within Google Cloud [83]. They excel at matrix operations and can be power-efficient, with high memory bandwidth [83]. However, they typically have lower on-chip memory capacity than high-end GPUs, a less mature/flexible ecosystem primarily tied to TensorFlow/JAX, and potentially higher hourly costs [83].
Table 5.1: Estimated Hardware for LLM Training/Fine-tuning (Illustrative)
Model Size | Task | Min Total VRAM (FP16) | Example GPU Config (Min/Ideal) | Est. System RAM | Est. Storage (Data/Chkpt) | Key Snippets |
---|---|---|---|---|---|---|
~7B | Fine-tune | ~16-28 GB+ | 1x RTX 3090/4090/A6000 (24-48 GB) / 2x RTX 3080 Ti+ | 32-64 GB+ | ~1 TB+ | [79] |
~7B | Pre-train | ~50-100 GB+ | 4x A100 40GB / 8x RTX 4090 | 128 GB+ | 1-5 TB / 500 GB+ | [79] |
~70B | Fine-tune | ~140-280 GB+ | 4x A100 80GB / 8x A100 40GB / 8x RTX 6000 Ada | 256-512 GB+ | ~8 TB+ | [79] |
~70B | Pre-train | ~1-2 TB+ | 16-32+ A100/H100 | 512 GB-1 TB+ | 10-20 TB / 2 TB+ | [79] |
Note: These are rough estimates. Actual requirements depend heavily on batch size, sequence length, optimizer choice, use of PEFT, gradient checkpointing, and software stack efficiency.
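A back-of-the-envelope estimator for the weights/gradients/optimizer-state portion of that budget, using the rough multipliers above (about 2 bytes per parameter in FP16, gradients roughly equal to the weights, Adam-style optimizer states around 2x the weights). Activations and the KV cache are ignored, so treat the output as a lower bound.

```ruby
# Rough VRAM lower bound (in GB) for full fine-tuning in 16-bit precision.
# The multipliers are the rules of thumb from Section 5.1, not exact figures.
def rough_training_vram_gb(params_billion, bytes_per_param: 2.0,
                           grad_factor: 1.0, optimizer_factor: 2.0)
  weights_gb = params_billion * bytes_per_param        # e.g. 7B -> ~14 GB
  weights_gb * (1.0 + grad_factor + optimizer_factor)
end

puts rough_training_vram_gb(7).round(1)    # => ~56 GB before activations
puts rough_training_vram_gb(70).round(1)   # => ~560 GB before activations
```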
5.2. The Core Training Loop Steps
The training process iterates through the dataset multiple times (epochs), processing data in batches [26]:
1. Data Batching: Load a batch of data samples (e.g., text sequences) from the training dataset. Convert them into tensors (e.g., `Numo::NArray` or `Torch::Tensor`).
2. Forward Pass: Feed the input batch through the model's layers. The model computes its predictions (e.g., logits representing probabilities for the next token) [68].
3. Loss Calculation: Compare the model's predictions with the actual target values (ground truth) from the batch using a loss function (e.g., Cross-Entropy Loss for classification/language modeling). The loss quantifies the model's error [68].
4. Backward Pass (Backpropagation): Calculate the gradient of the loss with respect to each model parameter. This indicates how much each parameter contributed to the error. This relies on the framework's automatic differentiation (autograd) capabilities [18]. Gradient checkpointing can be used here to save memory by recomputing activations during the backward pass instead of storing them all [78].
5. Optimizer Step: Adjust the model's parameters (weights) based on the calculated gradients, aiming to reduce the loss on future predictions. The optimizer (e.g., Adam, AdamW, SGD) uses the gradients and a learning rate to determine the magnitude of the weight updates [26].
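Under Torch.rb, these five steps map onto a loop roughly like the sketch below. `model`, `batches`, `epochs`, and `vocab_size` are assumed to be defined elsewhere; the loss, optimizer, and gradient calls mirror the Torch.rb usage discussed in Section 5.3.

```ruby
require "torch"

# Minimal language-modeling training loop sketch with Torch.rb.
# `model` is a Torch::NN::Module mapping token IDs to vocabulary logits;
# `batches` yields [input_ids, target_ids] tensor pairs (assumptions).
criterion = Torch::NN::CrossEntropyLoss.new
optimizer = Torch::Optim::AdamW.new(model.parameters, lr: 3e-4)

epochs.times do |epoch|
  running_loss = 0.0

  batches.each do |input_ids, target_ids|
    optimizer.zero_grad                                  # reset accumulated gradients

    logits = model.call(input_ids)                       # 2. forward pass
    # Flatten (batch, seq, vocab) logits against (batch * seq) targets
    loss = criterion.call(logits.reshape(-1, vocab_size), target_ids.reshape(-1))

    loss.backward                                        # 4. backpropagation
    optimizer.step                                       # 5. weight update
    running_loss += loss.item
  end

  puts format("epoch %d: mean loss %.4f", epoch, running_loss / batches.size)
end
```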
5.3. Implementing the Loop in Ruby
- Using Torch.rb: This likely offers the most Python-like experience.
  - Data Loading: Use Ruby's standard library or gems like `csv` for simple cases. For larger datasets, consider PyCall to use Hugging Face datasets or PyTorch DataLoader. Convert data to `Torch::Tensor`.
  - Iteration: Loop through epochs and batches.
  - Forward Pass: `output = model.call(input)` (or `model.forward(...)`).
  - Loss Calculation: `loss = Torch::NN::CrossEntropyLoss.new.call(output, batch_labels)`.
  - Backward Pass: `loss.backward()`. Requires tensors to have `requires_grad: true` [18].
  - Optimizer Step: Define an optimizer, e.g. `optimizer = Torch::Optim::AdamW.new(model.parameters, lr: learning_rate)`. Call `optimizer.zero_grad()` before the backward pass (or after the step) and `optimizer.step()` after `loss.backward()`.
- Using tensorflow-ruby: The implementation depends on whether using eager execution or graph mode.
  - Data Loading: Similar to Torch.rb, potentially using `Tf::Data::Dataset` if bindings are available [22].
  - Eager Execution: Operations run immediately. Use `Tf::GradientTape` (if available) to record operations for gradient calculation: `tape.gradient(loss, model.trainable_variables)`. Apply gradients using an optimizer (e.g., `Tf::Train::GradientDescentOptimizer` mentioned in [26]).
  - Graph Mode (older style, potentially relevant for the C API): Define the computation graph first, including forward pass, loss, and optimizer operations. Execute the training step within a session: `session.run([train_op, loss], feed_dict: {input: batch_input, labels: batch_labels})` [26]. Autodiff is handled by graph construction [26].
5.4. Monitoring Training Progress
Tracking metrics during training is crucial for understanding performance and diagnosing issues [77].
- Key Metrics:
- Training Loss: Should generally decrease over time. Monitor per batch or per epoch [77].
- Validation Loss: Loss calculated on a separate validation set (not used for weight updates). Helps detect overfitting (when training loss decreases but validation loss plateaus or increases) [77].
- Perplexity: Specific to language models, measures how well the model predicts the next token. Lower is better [77].
- Task-Specific Metrics: During fine-tuning, track metrics relevant to the task (e.g., Accuracy, F1 score, BLEU) on the validation set [58].
- Tools in Ruby:
- Basic Logging: Print metrics to the console or log files using Ruby’s standard library.
- TensorBoard: If using tensorflow-ruby, bindings might exist to log data compatible with TensorBoard [22].
- External Tools via API/PyCall: Integrate with platforms like Weights & Biases [82], MLflow [54], or Neptune.ai by sending metrics via their APIs or using their Python clients through PyCall.
5.5. Estimating Training Time
Predicting training duration helps in planning and resource allocation.
- Core Formula: Training time is roughly proportional to the total computational work (FLOPs) divided by the hardware’s sustained throughput (FLOPS) [75].
- FLOPs Estimation: A common rule of thumb for Transformer pre-training is Total FLOPs ≈ 6 * N * D, where N is the number of parameters and D is the number of tokens in the dataset. The factor of 6 accounts for the forward pass (2ND) and backward pass (4ND) [75].
- Throughput (FLOPS):
- Theoretical Peak: Obtainable from GPU specifications (e.g., NVIDIA H100 specs) [75].
- Sustained Throughput: Real-world performance is lower due to bottlenecks (memory, network). Model FLOPS Utilization (MFU) = Sustained FLOPS / Theoretical Peak FLOPS. MFU is often below 50% and decreases as the number of GPUs increases [75]. Llama 3 training reported ~38-40% MFU on 16,000 H100s, achieving ~400 TFLOPS per GPU [75].
- Calculation: Effective FLOPS = GPU_Peak_FLOPS * MFU * Num_GPUs.
- Total Time: Time ≈ (6 * N * D) / Effective_FLOPS [75]. Convert seconds to hours or days.
- Example (Llama 3 405B): N = 405B, D = 15.6T tokens. Total FLOPs ≈ 6 * 405e9 * 15.6e12 ≈ 3.8e25 FLOPs. Using 16,000 H100s at 400 TFLOPS/GPU gives Effective FLOPS = 16000 * 400e12 = 6.4e18 FLOPS. Time ≈ 3.8e25 / 6.4e18 ≈ 5.9e6 seconds ≈ 69 days [75].
- Fine-tuning Estimation: For fine-tuning, a simpler approach is often practical: time one epoch on a small data subset and linearly extrapolate based on the full dataset size and number of epochs [87].
- Scaling Laws (Chinchilla): Research suggests an optimal balance between model size (N) and dataset size (D) for a fixed compute budget (proportional to ND). The Chinchilla paper proposed D ≈ 20 * N [88]. However, recent work incorporating inference costs suggests that "overtraining" smaller models (D >> 20 * N) can be more cost-effective overall, as smaller models are cheaper to deploy [88].
The efficiency of the underlying libraries and the ease of setting up optimized distributed training significantly impact the achievable MFU. While the core calculation is universal, the Ruby ecosystem’s relative immaturity in large-scale, distributed ML frameworks compared to Python might make achieving high MFU more challenging. Using less optimized bindings or requiring more manual configuration for distributed setups could lead to lower sustained FLOPS, effectively increasing training time and cost on the same hardware compared to a highly tuned Python environment. This reinforces the idea that while Ruby can execute the training loop via bindings, the practical efficiency at massive scale might favor Python for the core computation, with Ruby potentially playing a stronger role in orchestration.
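The arithmetic above is easy to script. The sketch below reproduces the Llama 3 405B estimate, assuming an H100 dense BF16 peak of roughly 989 TFLOPS so that 40% MFU lands near the reported ~400 TFLOPS per GPU.

```ruby
# Rough training-time estimate from the 6 * N * D rule of thumb above.
# All inputs are illustrative; MFU and peak FLOPS vary widely in practice.
def training_days(params:, tokens:, gpus:, peak_flops_per_gpu:, mfu:)
  total_flops     = 6.0 * params * tokens
  effective_flops = gpus * peak_flops_per_gpu * mfu
  total_flops / effective_flops / 86_400.0        # seconds -> days
end

# Llama 3 405B example: ~70 days on 16,000 H100s at ~40% MFU.
puts training_days(
  params: 405e9, tokens: 15.6e12,
  gpus: 16_000, peak_flops_per_gpu: 989e12, mfu: 0.4
).round(1)
```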
6. Phase 4: Evaluation and Iteration
Once a model is trained (or fine-tuned), evaluating its performance and iterating based on the results is crucial.
6.1. Choosing the Right Metrics
The metrics used depend on the training phase and the specific task [77]:
- Pre-training Evaluation:
- Loss: Cross-entropy loss on a held-out validation set indicates how well the model is learning the data distribution [77].
- Perplexity: Measures the model’s uncertainty in predicting the next token; lower perplexity generally indicates better language modeling capabilities [77].
- Fine-tuning Evaluation (Task-Specific):
- Classification Tasks (e.g., sentiment analysis, topic classification): Accuracy, Precision, Recall, F1 Score (especially important for imbalanced datasets) [77].
- Sequence Generation Tasks (e.g., translation, summarization):
- BLEU Score: Measures n-gram overlap between generated and reference translations [77].
- ROUGE Score: Measures overlap (recall-oriented) between generated and reference summaries [77].
- Question Answering: Exact Match (EM), F1 Score over tokens.
- General Quality & Safety:
- Human Evaluation: Subjective assessment by humans on criteria like coherence, relevance, helpfulness, harmlessness, and adherence to instructions [65].
- Bias and Toxicity Detection: Using specific benchmarks or classifiers to measure harmful content generation [65].
6.2. Building an Evaluation Pipeline in Ruby
Implementing evaluation in Ruby follows a similar pattern to training:
1. Load Model: Load the trained model weights using the same library (Torch.rb, tensorflow-ruby, or via PyCall) used for training. Ensure the model is set to evaluation mode (e.g., `model.eval()` in Torch.rb) to disable dropout and batch normalization updates.
2. Load Evaluation Data: Load the validation or test dataset. Apply the exact same preprocessing and tokenization steps used during training.
3. Generate Predictions: Iterate through the evaluation dataset (batching is common for efficiency). For each batch, perform a forward pass through the model to get predictions (e.g., logits, generated sequences). Disable gradient calculation (e.g., within a `Torch.no_grad` block in Torch.rb) to save memory and computation.
4. Calculate Metrics:
   - Ruby Implementation: Simple metrics like accuracy, precision, and recall can be implemented directly in Ruby by comparing predictions to ground truth labels.
   - Using Gems: Check for Ruby gems that might implement more complex metrics (e.g., specific NLP metrics).
   - Using PyCall: This is often the most practical approach for standardized metrics like BLEU, ROUGE, or perplexity calculations using established Python libraries (e.g., Hugging Face `evaluate`, NLTK, SacreBLEU). Pass the predictions and references from Ruby to the Python functions via PyCall.
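A hedged sketch of the accuracy path (step 4, Ruby implementation) with Torch.rb, assuming `model` and `validation_batches` already exist; tensor method names mirror PyTorch and may differ slightly across Torch.rb versions.

```ruby
# Accuracy over a validation set with Torch.rb (classification-style labels).
model.eval                      # disable dropout / batch-norm updates
correct = 0
total   = 0

Torch.no_grad do                # skip gradient bookkeeping during inference
  validation_batches.each do |input_ids, labels|
    logits      = model.call(input_ids)
    predictions = logits.argmax(1)
    correct += predictions.eq(labels).sum.item
    total   += labels.shape[0]
  end
end

puts format("validation accuracy: %.3f", correct.to_f / total)
```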
6.3. Debugging and Improving Your Model
Training LLMs rarely works perfectly on the first try. Debugging involves identifying and fixing issues ranging from code errors to suboptimal performance or undesirable model behavior [65].
- Common Problems:
- Training Errors: Crashes due to incompatible data types, incorrect tensor shapes, out-of-memory errors, bugs in custom model code [90].
- Poor Performance: High loss, low accuracy/metric scores, nonsensical outputs. Could be due to bugs, poor hyperparameters, insufficient data, or architecture issues [90].
- Overfitting: Model performs well on training data but poorly on validation data. Indicates the model memorized the training set instead of generalizing [77].
- Underfitting: Model performs poorly on both training and validation data. Indicates the model is too simple or hasn’t trained long enough.
- Bias/Toxicity: Model generates harmful, biased, or unfair outputs, often reflecting biases in the training data [65].
- Debugging Strategies in Ruby Context (adapting general advice [90]):
  - Verify Data: Manually inspect samples from your training and validation sets. Use the tokenizer to decode `input_ids` back to text. Are the inputs sensible? Do the labels look correct? Is the preprocessing pipeline working as expected? [90]
  - Check Model Inputs/Outputs: Ensure the data format (tensor shapes, types) being fed into the model matches what the Torch.rb or tensorflow-ruby model layers expect. Examine the model's raw outputs (logits) before the final activation/prediction step.
  - Simplify: Train on a tiny subset of the data (even just one batch) and try to make the model overfit it. If it can't, there's likely a fundamental bug in the model or training loop [90]. Start with a simpler model architecture or fewer layers.
  - Isolate Components: Test data loading, preprocessing, model forward pass, loss calculation, and backward pass individually if possible. Manually step through the training loop logic for a single batch [90].
  - Check Environment: Ensure compatible versions of Ruby, Python (if using PyCall), core ML libraries (Torch.rb/LibTorch, tensorflow-ruby/TensorFlow), and GPU drivers (CUDA).
  - Hyperparameter Tuning: Experiment with learning rate, batch size, optimizer settings, and regularization techniques [58].
  - Data Augmentation/Improvement: If performance is poor due to limited data, consider data augmentation techniques (e.g., back-translation, synonym replacement) or acquiring more high-quality data [64].
  - Architectural Changes: If underfitting persists, consider a larger model or architectural modifications.
Iteration is key. Based on evaluation results and debugging insights, adjust the data, model architecture, or training process and repeat the training and evaluation cycle until satisfactory performance is achieved.
7. Beyond Training: Deployment and Orchestration
Training is only part of the LLM lifecycle. Making the model available for use (deployment) and managing the overall workflow (orchestration) are critical next steps, and areas where Ruby can shine.
7.1. Saving and Loading Models
Persisting trained model weights is essential for deployment and resuming training.
- Using Torch.rb: PyTorch (and thus likely Torch.rb) typically uses a `state_dict` (a dictionary mapping layer names to parameter tensors) for saving and loading model weights. Methods like `model.state_dict()` and `model.load_state_dict(state_dict)` would be expected. Saving the entire model object might also be possible.
- Using tensorflow-ruby: TensorFlow has various saving formats (SavedModel, checkpoints). The capabilities of tensorflow-ruby depend on the functions exposed by the C API [22]. Saving/loading graph definitions and associated variable checkpoints is the likely mechanism.
- Using PyCall: The simplest approach might be to use PyCall to invoke the standard saving/loading functions from the Python library (e.g., `torch.save`, `torch.load`, `model.save_pretrained` in Hugging Face transformers).
- ONNX (Open Neural Network Exchange): Converting the trained model to the ONNX format provides a standardized way to represent models for inference across different frameworks and hardware [8]. Tools exist (usually in Python) to convert PyTorch or TensorFlow models to ONNX. Inference can then be performed using ONNX Runtime, which has bindings for various languages, potentially including Ruby or accessible via PyCall or FFI. This is a recommended path for deploying TensorFlow models trained via bindings [24].
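A minimal checkpointing sketch along the Torch.rb lines described above; the `MyModel` constructor is hypothetical, and the exact save/load methods should be confirmed against the gem's documentation.

```ruby
# Save only the weights (state_dict-style), not the whole model object.
Torch.save(model.state_dict, "checkpoints/fine_tuned.pth")

# ...later, recreate the same architecture and restore the weights:
restored = MyModel.new(vocab_size, d_model)        # hypothetical constructor
restored.load_state_dict(Torch.load("checkpoints/fine_tuned.pth"))
restored.eval                                      # switch to inference mode
```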
7.2. Serving Models with Ruby APIs
Ruby’s web frameworks are well-suited for exposing trained LLMs via APIs.
- Frameworks: Ruby on Rails [2] or lighter frameworks like Sinatra can be used to build RESTful or GraphQL APIs.
- Inference Workflow:
- API endpoint receives input text (e.g., via POST request).
- Input is preprocessed and tokenized (using the same tokenizer as training).
- The loaded model performs inference (forward pass) on the tokenized input (using Torch.rb, tensorflow-ruby, PyCall, or an ONNX runtime).
- Model output (e.g., generated text, classification label) is postprocessed.
- Result is returned in the API response (e.g., JSON format).
- Handling Latency: LLM inference can be time-consuming. For non-interactive tasks or to avoid blocking web requests, use background job frameworks like Sidekiq or Resque (common in Rails) [4]. The API endpoint enqueues an inference job, and the result can be retrieved later or pushed back to the client asynchronously (e.g., via WebSockets) [1].
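A minimal Sinatra sketch of that workflow. `TOKENIZER` and `MODEL` are placeholders for whatever objects were loaded at boot (Torch.rb, ONNX Runtime, or PyCall-backed); for slow generations you would enqueue a Sidekiq/Resque job instead of running inference inline.

```ruby
require "sinatra"
require "json"

# POST /generate with {"prompt": "..."} returns {"completion": "..."}.
# TOKENIZER and MODEL are assumed to be initialized at application boot.
post "/generate" do
  content_type :json
  payload = JSON.parse(request.body.read)

  input_ids = TOKENIZER.encode(payload.fetch("prompt"))  # same tokenizer as training
  output    = MODEL.call(input_ids)                      # forward pass / generation (tensor handling omitted)
  text      = TOKENIZER.decode(output)

  { completion: text }.to_json
end
```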
7.3. Orchestrating ML Pipelines with Ruby
While Ruby might not be the optimal choice for executing every step of a large-scale ML pipeline (especially heavy computation), it excels at orchestrating the entire workflow [2]. The LLM lifecycle involves multiple stages: data gathering, preprocessing, training, evaluation, deployment, monitoring [69]. Ruby can act as the central controller:
- Defining Workflows: Ruby’s expressiveness and DSL capabilities can be used to define complex pipelines, specifying dependencies between steps [55].
- Triggering Components: Ruby scripts or applications can initiate different stages:
  - Running Python scripts for data processing or training using `Open3` or backticks [42].
  - Making API calls to specialized microservices (e.g., a Python-based training service).
  - Using PyCall to execute specific Python functions within a larger Ruby-managed workflow [40].
  - Enqueuing tasks for background workers (potentially written in Python or Ruby) using job queues [4].
- Interacting with MLOps Tools: Many MLOps platforms (MLflow, Kubeflow, Airflow, Flyte, Kedro, DVC, CML) provide APIs or CLIs for managing experiments, data versioning, model registries, and deployments [54]. Ruby can interact with these tools to automate and manage the lifecycle [69].
- Building Interfaces: Rails or Sinatra can provide dashboards for monitoring pipeline status, viewing experiment results, or managing deployments.
This orchestration role plays directly to Ruby’s strengths. Considering the previously discussed challenges in Ruby’s native ML performance and the complexities of large-scale data processing, using Ruby as the "conductor" becomes a highly practical strategy. It allows developers to leverage Ruby’s excellent web development capabilities, clear syntax, and scripting prowess to manage the overall process, while delegating the most computationally intensive tasks (like distributed training or massive data processing) to specialized tools or services, often implemented in Python or C++. This approach provides a robust, maintainable, and developer-friendly control plane for complex LLM workflows [2].
8. Challenges and Future Directions
Training LLMs is inherently challenging, and doing so within the Ruby ecosystem introduces specific considerations.
8.1. General LLM Training Challenges
Regardless of the language, developers face significant hurdles [65]:
- Data: Acquiring, cleaning, and validating massive, diverse, high-quality, and unbiased datasets is a monumental task. Issues like "unfathomable datasets" (too large to manually inspect) and benchmark contamination are persistent problems [56].
- Compute: The sheer cost and availability of the required computational resources (thousands of high-end GPUs/TPUs, high-speed interconnects, massive RAM/storage) are major barriers [75].
- Modeling & Training: Choosing the right architecture, optimizing hyperparameters, achieving efficient distributed training (high MFU), avoiding overfitting/underfitting, and mitigating catastrophic forgetting during fine-tuning are complex engineering problems [71]. Developing robust reasoning capabilities remains an active research area [65].
- Tokenization: Balancing vocabulary size, efficiency, and handling multiple languages effectively is non-trivial [66].
- Alignment & Safety: Ensuring models are helpful, honest, and harmless requires careful techniques like Reinforcement Learning from Human Feedback (RLHF), instruction tuning, and ongoing monitoring for bias and toxicity [65].
- Deployment & Maintenance: Efficiently serving large models, managing latency, ensuring robustness, monitoring for performance drift, and maintaining the model over time are ongoing operational challenges [89].
8.2. Specific Challenges for Ruby
While Ruby can participate in the LLM lifecycle, its ecosystem presents unique challenges compared to Python:
- Ecosystem Maturity: The collection of dedicated, optimized libraries and tools specifically for large-scale ML and deep learning is smaller and less mature in Ruby [1]. While core bindings like Torch.rb exist, the surrounding tooling for data processing, distributed training, MLOps, and specialized algorithms is less developed [54].
- Performance: Pure Ruby performance can be a bottleneck for CPU-intensive preprocessing tasks. While GPU-bound training relies on C++ backends via bindings, the efficiency of these bindings and the ease of implementing highly optimized distributed training strategies might lag behind native Python frameworks [2].
- Community & Resources: The ML-focused segment of the Ruby community is smaller than Python’s. This translates to fewer readily available tutorials, pre-built solutions, shared model weights adapted for Ruby bindings, and experienced practitioners specifically for advanced LLM training tasks within Ruby [2]. Developers might need to rely more on general programming skills, documentation of underlying C/Python libraries, or porting examples.
- Dependency Management: Relying on bindings (Torch.rb, tensorflow-ruby, PyCall) introduces dependencies on external C/C++/Python libraries, adding complexity to environment setup and deployment compared to more self-contained Ruby applications [17].
8.3. Future Directions for Ruby in AI/ML
Despite the challenges, the role of Ruby in the AI/ML space is evolving:
- Library Maturation: Continued development and refinement of key gems like Torch.rb, tensorflow-ruby, Rumale, and Numo::NArray will improve capabilities and ease of use [6]. New gems focused on LLM interaction (RubyLLM, LangChain.rb) are emerging [8].
- Improved Python Integration: Enhancements to PyCall or the development of alternative, robust IPC/RPC mechanisms could streamline interaction with the Python ecosystem.
- Focus on Orchestration & Integration: Ruby’s strengths in web development (Rails), scripting, and DSLs position it well to be a preferred language for building MLOps tools, orchestrating complex pipelines involving components in multiple languages, and integrating LLMs into user-facing applications [2].
- Ruby Language Evolution: Ongoing improvements to Ruby’s performance (e.g., JIT compilation, Ractor for concurrency) might benefit certain aspects of ML workflows, although they are unlikely to supplant optimized C++/GPU computation for core training in the near future [3].
9. Conclusion: Your Journey with Ruby and LLMs
Embarking on training a Large Language Model using Ruby is an ambitious but increasingly feasible endeavor, albeit with important caveats. While the Ruby ecosystem, powered by gems like Numo::NArray, Rumale, Torch.rb, and tensorflow-ruby, provides the foundational tools for numerical computation and deep learning [8], it does not yet match the breadth, maturity, or performance optimization of the Python ecosystem for massive-scale pre-training [1].
Pre-training an LLM from scratch, involving trillions of tokens and vast computational resources [56], remains largely impractical within a purely Ruby-centric workflow due to limitations in native large-scale data processing frameworks and potentially less optimized distributed training support compared to Python [55]. However, fine-tuning pre-trained models presents a much more accessible and impactful path for Ruby developers [73]. By leveraging Torch.rb or tensorflow-ruby to load existing models, potentially utilizing Parameter-Efficient Fine-Tuning (PEFT) techniques, developers can adapt powerful LLMs to specific tasks or domains using smaller datasets and more modest hardware [73]. Integration with Python libraries via PyCall remains a crucial strategy for accessing state-of-the-art tokenizers, evaluation metrics, or specific model implementations [8].
Furthermore, Ruby’s core strengths align exceptionally well with the orchestration and integration aspects of the LLM lifecycle [2]. Using Ruby on Rails or other frameworks to build APIs, manage data workflows, trigger training/inference jobs (potentially running Python code), and monitor performance allows developers to create sophisticated AI-powered applications while leveraging Ruby’s productivity and maintainability [3].
The journey of training your own LLM with Ruby requires a pragmatic approach: understanding the ecosystem’s limitations, strategically leveraging bindings and Python interoperability for computationally intensive tasks, and capitalizing on Ruby’s strengths for building the surrounding application logic and orchestration layers. As the Ruby AI/ML ecosystem continues to evolve [8], the possibilities for developers will undoubtedly expand, making it an exciting space to watch and contribute to.
References
1. Ruby on Rails Trends 2025: Key Updates and Insights - Rubyroid Labs
2. Ruby on Rails in Machine Learning and Artificial Intelligence: Guide - RORBits
3. Ruby on Rails Future - Relevance in 2025 & Web Framework Trends
4. Ruby on Rails for AI Chatbot Development: Why it is Ideal Choice in 2025?
5. Ruby for Natural Language Processing: Text Analytics and Sentiment Analysis - CloudDevs
6. PyTorch vs TensorFlow: Comparative Guide of AI Frameworks 2025 - OpenCV
7. Text Processing with Ruby: Extract Value from the Data That Surrounds You by Rob Miller
8. Ruby Machine Learning Gems in 2025: State of the Ecosystem - DEV Community
9. arbox/machine-learning-with-ruby: Curated list - GitHub
10. Building a Tiny Language Model (LLM) in Ruby: A Step-by-Step Guide - V1