DEV Community

Cover image for Typed integers in Rust for safer Python bytecode compilation
Jones Beach
Jones Beach

Posted on • Edited on • Originally published at fromscratchcode.com

Typed integers in Rust for safer Python bytecode compilation

Shortly after I shared my previous post, a helpful redditor pointed out that the typed integers I alluded to is known as the newtype pattern in Rust, a design that allows you to define types purely for compile-time safety and with no runtime hit.

And here I thought I invented it! Alas. 😂

I wanted to share a few details about what challenge I encountered and how I solved it.

The problem

Like I mentioned in the previous post, bytecode is a lower-level representation of your code created for efficient evaluation. One of the optimizations is that variable names are converted into indices during bytecode compilation, which supports faster lookup at runtime when the VM pumps through your bytecode.

At one point while implementing this myself, I was debugging a sequence that felt like this. (I’m using the wishy-washy “felt like” here because I have no idea what the actual bytecode looked like.)

LOAD_GLOBAL 0
LOAD_FAST 0
LOAD_CONST 0
Enter fullscreen mode Exit fullscreen mode

Okay, so we’re loading the first global, the first local, and the first constant. Given the ongoing evolution of my VM design and implementation, the data structures I’m using to communicate the mappings of these indices to their symbol names and/or values between the compiler and VM have been in a steady flux.

I’d sit down to implement the execution of a given opcode in the VM, see that I was handed the value 0, and I’d think, “What do they want me to do with this?!” (“they” in the case being myself from 5 minutes prior hacking on the compiler stage.)

After interpreting a 0 wrong for the forth or fifth time, I decided THERE HAS TO BE A BETTER WAY. I asked myself, “Is there a way to get the compiler to tell me when I’m using a 0 incorrectly?” In addition to compile-time checking, with rust-analyzer configured in my Neovim, I should get a type error as soon as I made the mistake.

Could I add types to my integers?!

The CPython bytecode docs even hint at these distinctions, but incorporating them into my own implementation wasn’t immediately clear until I experienced the pain firsthand. For my three opcodes above, the respective documentation is:

LOAD_GLOBAL(namei)
Loads the global named co_names[namei>>1] onto the stack.

LOAD_FAST(var_num)
Pushes a reference to the local co_varnames[var_num] onto the stack.

LOAD_CONST(consti)
Pushes co_consts[consti] onto the stack.
Enter fullscreen mode Exit fullscreen mode

co_names, co_varnames, and co_consts are three fields on a CPython code object, which is an immutable piece of compiled bytecode. In my interpreter, I’m using names and varnames to show my originality.

(Note to self: I treat constants separately at the moments, but I should probably unify it as I solidify my understanding of code objects.)

(Another note to self: how curious that they are shifting namei to the right during LOAD_GLOBAL. I hope to one day know why.)

The solution

I added typed integers to keep myself sane. They offered an additional benefit of providing me an opportunity to gain more experience with generics in Rust!

Here is a subset of my Opcode implementation.

pub enum Opcode {
    /// Push the value found at the specified index in the constant pool onto the stack.
    LoadConst(ConstantIndex),
    /// Read the local variable indicated by the specified index and push the value onto the stack.
    LoadFast(LocalIndex),
    /// Read the global variable indicated by the specified index and push the value onto the stack.
    LoadGlobal(NonlocalIndex),
}
Enter fullscreen mode Exit fullscreen mode

My hope here is these new types, ConstantIndex, LocalIndex, and NonlocalIndex, would semantically illustrate the behavior of each opcode while providing type safety.

Generics entered the scene next as I was hoping to implement this with minimal code reuse.

Here is the next layer of the onion.

pub type ConstantIndex = Index<ConstantMarker>;
pub type LocalIndex = Index<LocalMarker>;
pub type NonlocalIndex = Index<NonlocalMarker>;
Enter fullscreen mode Exit fullscreen mode

We’re beginning to see some code reuse–awesome!

What are these marker types? They are truly that: a type which is literally just a marker with no other data. These empty types are enforced at compile-time and do nothing at runtime.

pub struct ConstantMarker;
pub struct LocalMarker;
pub struct NonlocalMarker;
Enter fullscreen mode Exit fullscreen mode

In Rust, PhantomData<T> from std::marker allows you to include a type in a struct purely for compile-time checks without affecting runtime performance—exactly the type safety for which we’re seeking! I’m using usize for the value because my use-case was indices, which should never be negative.

/// An unsigned integer wrapper which provides type safety. This is particularly useful when
/// dealing with indices used across the bytecode compiler and the VM as common integer values such
/// as 0, 1, etc, can be interpreted many different ways.
#[derive(Copy, Clone, PartialEq, Hash, Eq)]
pub struct Index<T> {
    value: usize,
    _marker: PhantomData<T>,
}

impl<T> Index<T> {
    pub fn new(value: usize) -> Self {
        Self {
            value,
            _marker: PhantomData,
        }
    }
}

impl<T> Deref for Index<T> {
    type Target = usize;

    fn deref(&self) -> &Self::Target {
        &self.value
    }
}

impl<T> Display for Index<T> {
    fn fmt(&self, f: &mut Formatter) -> Result<(), Error> {
        write!(f, "{}", self.value)
    }
}
Enter fullscreen mode Exit fullscreen mode

The last piece of magic is impl<T> Deref for Index<T>, which allows us to treat instances of our new type as integers when dereferenced.

As a result, a piece of code like this would pass with flying colors.

#[test]
fn test_dereference() {
    let index: LocalIndex = Index::new(4);
    assert_eq!(*index, 4)
}
Enter fullscreen mode Exit fullscreen mode

By this point, we are free to use these in the bytecode compiler! Most places in the code don’t require specifying the generic because it will be inferred by the function signature, like in this example.

fn get_or_set_local_index(&mut self, name: &str) -> LocalIndex {
    if let Some(index) = self.get_local_index(name) {
        index
    } else {
        let code = self.ensure_code_object_mut();
        let new_index = code.varnames.len();
        code.varnames.push(name.to_string());
        Index::new(new_index)
    }
}
Enter fullscreen mode Exit fullscreen mode

We initialize a new index with Index::new(new_index), which the compiler knows to treat as an Index<LocalMarker>, which we have aliased to LocalIndex to allow us to be blissfully ignorant VM developers.

The end

This is a small but powerful example of how I’m falling head-over-heels for Rust’s type system. I love writing expressive code to implement core features from scratch in a way which manages complexity and makes people smile.

My mentor at my first big-boy job was a wizard at this and I credit him for showing me what is possible: spending more time thinking about what you’re trying to build rather than being bogged down by how you are trying to build it.

Have you used Rust’s type system in any fun and creative ways? Or done something similar in another language? I’d love to hear your thoughts in the comments. Be well!


Subscribe & Save [on nothing]

If you’d like to get more posts like this directly to your inbox, you can subscribe here!

Work With Me

I mentor software engineers to navigate technical challenges and career growth in a supportive sometimes silly environment. If you’re interested, you can book a session.

Elsewhere

In addition to mentoring, I also write about my experience navigating self-employment and late-diagnosed autism. Less code and the same number of jokes.

Top comments (0)