We’ve all seen the magic of Large Language Models. You type a prompt, and it finishes your sentence. But beneath the billions of parameters and massive GPU clusters, there is a fundamental mathematical heartbeat: The N-Gram.
Today, we’re going to look under the hood of a neural N-Gram generator built from scratch in Rust. No PyTorch. No hidden abstractions. Just pure logic, traits, and tensors.
1. The Core Idea: "What comes next?"
At its simplest, a language model is just a professional guesser. If I give you the letters r-u-s-, your brain immediately screams t.
Our model does exactly this using a sliding window. We take a word like r-u-s-t, and break it into training pairs:
- Context: `...` -> Target: `r`
- Context: `..r` -> Target: `u`
- Context: `.ru` -> Target: `s`
In our Rust implementation, we define the size of this context window (the "N" in N-Gram) via a multiplier. We then slide the window across each padded name, pushing the context characters into the input vector and the following character into the target vector:
```rust
for name in name_list {
    let full_name = format!("{}{}.", pad_str, name);
    let chars_vec: Vec<char> = full_name.chars().collect();
    for window in chars_vec.windows(multiplier as usize + 1) {
        for i in 0..multiplier {
            inputs.push(stoi[&window[i as usize]] as u32);
        }
        targets.push(stoi[&window[multiplier as usize]] as u32);
    }
}
```
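To see the windowing in isolation, here is a minimal, self-contained sketch of the same idea for a single padded name. The `make_pairs` helper is hypothetical (it is not part of the repo); it simply materializes the (context, target) pairs described above with a 3-character context:

```rust
/// Slide a (context_size + 1)-wide window over a padded name,
/// producing (context, target) training pairs.
fn make_pairs(padded: &str, context_size: usize) -> Vec<(String, char)> {
    let chars: Vec<char> = padded.chars().collect();
    chars
        .windows(context_size + 1)
        .map(|w| (w[..context_size].iter().collect(), w[context_size]))
        .collect()
}

fn main() {
    // "rust" padded with three start markers and one end marker.
    let pairs = make_pairs("...rust.", 3);
    for (ctx, tgt) in &pairs {
        println!("Context: {:?} -> Target: {:?}", ctx, tgt);
    }
}
```

Running this prints five pairs, from `"..."` predicting `'r'` down to `"ust"` predicting the end token `'.'`.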
2. The First "Aha!" Moment: Characters aren't Numbers
Computers can't read the letter 'a'. We have to translate it into math. We use One-Hot Encoding. If our alphabet has 27 characters (a-z and a special "start/end" token), the letter 'a' becomes a vector of length 27 with a 1.0 at index 1 and 0.0 everywhere else.
```rust
// A snippet from our one-hot utility
pub fn one_hot_encode<T: Tensor<D>, D: Numeric>(
    labels: &[u32],
    num_classes: u32,
    labels_per_sample: u32,
) -> Result<T, String> {
    let num_samples = (labels.len() as u32) / labels_per_sample;
    let row_width = (num_classes * labels_per_sample) as usize;
    let mut data = vec![D::zero(); (num_samples as usize) * row_width];
    for (i, sample_labels) in labels.chunks(labels_per_sample as usize).enumerate() {
        for (j, &label_idx) in sample_labels.iter().enumerate() {
            if label_idx >= num_classes {
                return Err(format!(
                    "Label index {} exceeds num_classes {}",
                    label_idx, num_classes
                ));
            }
            let index = (i * row_width) + (j * num_classes as usize) + label_idx as usize;
            data[index] = D::one();
        }
    }
    T::new(vec![num_samples, row_width as u32], data)
}
```
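If the generic `Tensor` and `Numeric` machinery obscures the idea, here is a trait-free sketch of the same encoding over a plain `Vec<f64>`. The `one_hot` helper is a simplification for illustration, not code from the repo:

```rust
/// One-hot encode label indices into a flat, row-major buffer.
/// Each sample contributes `labels_per_sample * num_classes` values.
fn one_hot(labels: &[u32], num_classes: usize, labels_per_sample: usize) -> Vec<f64> {
    let row_width = num_classes * labels_per_sample;
    let num_samples = labels.len() / labels_per_sample;
    let mut data = vec![0.0; num_samples * row_width];
    for (i, sample) in labels.chunks(labels_per_sample).enumerate() {
        for (j, &idx) in sample.iter().enumerate() {
            data[i * row_width + j * num_classes + idx as usize] = 1.0;
        }
    }
    data
}

fn main() {
    // One sample, two context characters, a tiny 4-character "alphabet":
    // labels [1, 3] -> [0,1,0,0, 0,0,0,1]
    println!("{:?}", one_hot(&[1, 3], 4, 2));
}
```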
The Insight: By doing this, we turn a linguistics problem into a geometry problem. Every character is now a coordinate in high-dimensional space.
3. The Rush to Find the Approximation
Once we figure out how to convert language text (a list of names, in our case) into a suitable input/target combination for supervised learning, we just let the neural network take care of the rest:
```rust
nn.fit(&x_train, &y_train, &x_val, &y_val, config, hook_config)?;
```
4. The "Creative" Moment: Temperature Scaling
After training, our network doesn't output letters; it outputs logits—raw, unnormalized scores for every character in our vocabulary. To turn these scores into a "choice," we need a probability distribution. This is where we introduce Temperature.
Think of Temperature as a "confidence dial." Mathematically, we modify the standard Softmax function by dividing our logits by the temperature T before exponentiating them: p_i = exp(z_i / T) / Σ_j exp(z_j / T).
Low Temperature: The "Safe Bet." Dividing by a small number makes high scores much higher and low scores much lower. The distribution becomes "peaky," and the model becomes highly confident and conservative. It will likely only generate the most common names from your dataset.
High Temperature: The "Risk Taker." Dividing by a larger number flattens the differences between scores. The distribution becomes "uniform," making rare character transitions almost as likely as common ones. This is where the model gets "creative," inventing names that feel phonetically plausible but entirely new.
In our Rust implementation, we apply this directly during the generation loop to influence the weights used for random sampling:
```rust
// From n_gram.rs: Applying temperature to the raw tensor output
let mut weights: Vec<f64> = data
    .iter()
    .map(|val| (val.f64() / temparature).exp())
    .collect();
```
The Insight: By simply adjusting a single denominator (T), we shift the model's behavior from a rigid database lookup to a creative linguistic engine. We use this in the generator to find the "sweet spot" where names are fresh and innovative without devolving into unpronounceable gibberish.
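To make the "confidence dial" concrete, here is a standalone sketch of temperature-scaled softmax (a simplified stand-in for the repo's sampling code, with made-up logits):

```rust
/// Convert raw logits into a probability distribution at a given temperature.
fn softmax_with_temperature(logits: &[f64], temperature: f64) -> Vec<f64> {
    let exps: Vec<f64> = logits.iter().map(|z| (z / temperature).exp()).collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

fn main() {
    let logits = [2.0, 1.0, 0.0];
    let cold = softmax_with_temperature(&logits, 0.5); // peaky: the "Safe Bet"
    let hot = softmax_with_temperature(&logits, 2.0);  // flat: the "Risk Taker"
    // Lower temperature concentrates probability mass on the top logit.
    assert!(cold[0] > hot[0]);
    println!("T=0.5: {:?}", cold);
    println!("T=2.0: {:?}", hot);
}
```

With these logits, T = 0.5 pushes roughly 87% of the mass onto the top character, while T = 2.0 leaves it closer to 50%, spreading the rest across rarer transitions.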
5. Dealing with the Noise: Label Smoothing
Neural networks are prone to overconfidence. They want to be 100% sure that 'q' is followed by 'u'. But in a small dataset, this leads to overfitting.
We implement Label Smoothing. Instead of targeting a 1.0 probability, we target 0.9 and spread the remaining 0.1 across all other letters. This forces the model to stay "curious" and keeps the logits from growing unboundedly large in pursuit of impossible 100% confidence.
```rust
// Add Label Smoothing
let epsilon = D::from_f64(0.1); // The "smoothing" factor
let num_classes = D::from_u32(vocab_size);
let y_train_data = y_train.get_data();
let mut smooth_data = vec![];
for val in y_train_data {
    // Standard one-hot is [0, 1, 0]
    // Smoothed becomes roughly [0.003, 0.903, 0.003] with a 27-class vocabulary
    smooth_data.push(val * (D::one() - epsilon) + (epsilon / num_classes));
}
let y_train = T::new(y_train.get_shape().to_vec(), smooth_data)?;
```
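A quick numeric sanity check of the same formula, stripped of the generic types (the `smooth` helper is illustrative, not from the repo):

```rust
/// Smooth a one-hot row: keep (1 - eps) on the true class
/// and spread eps uniformly across all classes.
fn smooth(one_hot: &[f64], eps: f64) -> Vec<f64> {
    let k = one_hot.len() as f64;
    one_hot.iter().map(|v| v * (1.0 - eps) + eps / k).collect()
}

fn main() {
    let smoothed = smooth(&[0.0, 1.0, 0.0], 0.1);
    // The row still sums to 1, but no entry is exactly 0 or 1 anymore.
    let sum: f64 = smoothed.iter().sum();
    assert!((sum - 1.0).abs() < 1e-12);
    println!("{:?}", smoothed);
}
```

With 3 classes and eps = 0.1, the true class lands at about 0.933 and the others at about 0.033, so the gradient never fully vanishes for "wrong" letters.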
6. The Result: Artificial Life
When you run the generator, you see the "Innovation Rate." Our code checks the generated name against the training set. If the model outputs "Alara" and that name wasn't in the original list, we've successfully taught a machine the concept of a name without it just memorizing a list.
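A minimal sketch of how such a novelty check can work, with made-up names (the `innovation_rate` helper is hypothetical, not the repo's exact implementation):

```rust
use std::collections::HashSet;

/// Fraction of generated names that do not appear in the training set.
fn innovation_rate(generated: &[&str], training: &HashSet<&str>) -> f64 {
    let new = generated.iter().filter(|n| !training.contains(*n)).count();
    new as f64 / generated.len() as f64
}

fn main() {
    let training: HashSet<&str> = ["arjun", "mira"].into_iter().collect();
    let generated = ["alara", "mira", "naru", "jasha"];
    // 3 of the 4 generated names are unseen -> 75% innovation rate.
    println!("{:.2}%", innovation_rate(&generated, &training) * 100.0);
}
```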
Here are a few interesting ones my machine invented after training on 1084 Bengali names:
- ✓ 'manvi ' NEW | Innovation Rate: 100.00%
- ✓ 'jasha ' NEW | Innovation Rate: 85.11%
- ✓ 'naru ' NEW | Innovation Rate: 46.08%
Where to Find the Whole Project
- Download or clone this repo - https://github.com/Palash90/iron_learn
- Build it following the instructions mentioned in the `README` and run the following command:
```shell
target/release/iron_learn -n 5-gram -x n-gram --n-gram-size 5 -d data/names.txt -m 5 -e 20 -l 0.1
```