Ashton Scott Snapp

Writing an Assembler in Rust, and How I'm Reworking the Lexer (again)

Hello people! It's been a bit. Two months or so, to be specific. But I'm back, and I'm still working on this project. Obligatory GitHub reference:

AshtonSnapp / hasm

The Official Cellia Cross-Assembler for Modern Computers

hasm

The Homebrew Assembler. Currently supporting the 16-bit Cellia architecture and the 8-bit ROCKET88 architecture.

Each architecture supported by hasm will be separated into its own module, although every architecture's assembler code will have the same general structure: you have a lexer which takes in files of assembly code and outputs streams of tokens which are fed into a parser which structures those tokens into a file syntax tree. Then the syntax trees are fed into a linker which tries to combine all of these trees into a single program tree, which is finally fed into a binary generator which does exactly what you think.

Right now I'm still trying to implement the assemblers for the two architectures I mentioned earlier, and I've only just now gotten to the parser. It's going to be a pain to write anything that's actually decently capable, but it'll be worth…

And now for something you've probably heard before: I'm reworking the lexer. I'm not switching away from the logos crate, don't worry. I just realized that there's a better way to implement some of the token variants and their callbacks, and also I need to figure out how to handle in-assembly operators.

First, an example: addresses. The way addresses used to work in the lexer is that there was an AddressInfo struct that contained the address type, the number base, and the value as a signed 32-bit integer. However, I realized two things. First, we don't need to remember what base the number was written in once it has been parsed. Second, we can pair the number type with the address type, so relative addresses can use an i16 while absolute addresses can use a u32 (because there's no u24 type).

This has resulted in the creation of the AddressType enum, where each variant carries an integer whose width depends on the address type. This is what it looks like in the code:

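pub enum AddressType {
    // Absolute addresses are stored in a u32 because there's no u24 type.
    Absolute(u32),
    IndirectAbsolute(u32),
    ZeroBank(u16),
    IndirectZeroBank(u16),
    DirectPage(u8),
    Port(u16),
    // Signed offsets, relative to IP and SP respectively.
    Relative(i16),
    StackRelative(i16),
}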
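
To show what pairing the number type with the address type buys us, here's a minimal sketch. The helper names direct_page and absolute are made up for this post and aren't in hasm, and the 24-bit limit on absolute addresses is my reading of the "no u24 type" remark rather than something the repo confirms. The idea is that an out-of-range value gets rejected at the point where the variant is built, instead of sitting unchecked in one big i32.

```rust
use std::convert::TryFrom;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AddressType {
    Absolute(u32),
    IndirectAbsolute(u32),
    ZeroBank(u16),
    IndirectZeroBank(u16),
    DirectPage(u8),
    Port(u16),
    Relative(i16),
    StackRelative(i16),
}

/// Hypothetical helper (not part of hasm): build a direct-page address
/// from an already-parsed value, rejecting anything that won't fit in a u8.
pub fn direct_page(value: u32) -> Option<AddressType> {
    u8::try_from(value).ok().map(AddressType::DirectPage)
}

/// Hypothetical helper (not part of hasm): build an absolute address.
/// Assumption: absolute addresses are 24 bits wide, so the top byte of
/// the u32 has to be zero.
pub fn absolute(value: u32) -> Option<AddressType> {
    if value <= 0x00FF_FFFF {
        Some(AddressType::Absolute(value))
    } else {
        None
    }
}
```

With these, direct_page(0x100) returns None because 0x100 doesn't fit in a u8, while direct_page(0x7F) gives back a DirectPage variant.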

Next we have the work-in-progress that is the address callback. Since learning that logos hands the callback only the text that matched the token, I've been able to make the code a lot simpler. Also, strip_prefix and strip_suffix are now in stable Rust, so I don't have to use replacen. Yay!
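
Here's a rough comparison of the two approaches, using the parentheses around an indirect address as the example. Both function names are invented for illustration, and the replacen version is only my guess at what such code tends to look like, not the old hasm code. The str methods themselves (strip_prefix, strip_suffix, replacen) are standard library.

```rust
/// Hypothetical helper: strip one pair of surrounding parentheses.
/// Returns None unless the slice both starts with '(' and ends with ')'.
/// The result borrows from the input, so nothing is allocated.
pub fn strip_parens(slice: &str) -> Option<&str> {
    slice.strip_prefix('(')?.strip_suffix(')')
}

/// Hypothetical replacen-based version for comparison. It removes the
/// first '(' and the first ')' wherever they appear, allocates a new
/// String, and can't report whether the parentheses were there at all.
pub fn strip_parens_replacen(slice: &str) -> String {
    slice.replacen('(', "", 1).replacen(')', "", 1)
}
```

The strip_* version doubles as the check for "is this an indirect address?", since it returns None when the parentheses are missing or misplaced.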

The callback works quite simply. It takes the token slice and checks it for certain starting or ending characters. A pair of parentheses indicates an indirect address, for example. If it starts with either IP or SP, that's a relative address with the letter before the P indicating what pointer it's relative to. An ending p indicates a port address, and an ending d indicates a direct page address. Then, a $ indicates hexadecimal and a % indicates decimal. Simple.
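
The rules above can be sketched as a plain function over the token slice. This is not the hasm callback, just an illustration under several assumptions of mine: the names parse_address, parse_number, and parse_offset are made up, relative addresses are assumed to look like IP+$10 or SP-%4, hex digits are assumed to be uppercase so a trailing lowercase p or d can't be mistaken for a digit, absolute addresses are assumed to be 24-bit, and zero-bank addresses are left out because the post doesn't say how they're written. In a real logos callback the slice would come from the lexer's slice() method.

```rust
use std::convert::TryFrom;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AddressType {
    Absolute(u32),
    IndirectAbsolute(u32),
    ZeroBank(u16),
    IndirectZeroBank(u16),
    DirectPage(u8),
    Port(u16),
    Relative(i16),
    StackRelative(i16),
}

/// Parse the numeric part: `$` marks hexadecimal and `%` marks decimal,
/// following the description in the post. Anything else is rejected.
fn parse_number(text: &str) -> Option<u32> {
    if let Some(hex) = text.strip_prefix('$') {
        // Assumption: hex digits are uppercase only.
        if hex.chars().any(|c| c.is_ascii_lowercase()) {
            return None;
        }
        u32::from_str_radix(hex, 16).ok()
    } else if let Some(dec) = text.strip_prefix('%') {
        dec.parse().ok()
    } else {
        None
    }
}

/// Parse a signed offset such as `+$10` or `-%4` (assumed syntax).
fn parse_offset(text: &str) -> Option<i16> {
    let (negative, digits) = match text.strip_prefix('-') {
        Some(rest) => (true, rest),
        None => (false, text.strip_prefix('+').unwrap_or(text)),
    };
    let magnitude = i32::try_from(parse_number(digits)?).ok()?;
    let value = if negative { -magnitude } else { magnitude };
    i16::try_from(value).ok()
}

/// Hypothetical stand-in for the address callback.
pub fn parse_address(slice: &str) -> Option<AddressType> {
    // IP or SP at the start means a relative address.
    if let Some(rest) = slice.strip_prefix("IP") {
        return parse_offset(rest).map(AddressType::Relative);
    }
    if let Some(rest) = slice.strip_prefix("SP") {
        return parse_offset(rest).map(AddressType::StackRelative);
    }
    // A pair of parentheses means an indirect address.
    if let Some(inner) = slice.strip_prefix('(').and_then(|s| s.strip_suffix(')')) {
        return parse_number(inner)
            .filter(|v| *v <= 0x00FF_FFFF)
            .map(AddressType::IndirectAbsolute);
    }
    // A trailing `p` means a port address, a trailing `d` a direct page address.
    if let Some(body) = slice.strip_suffix('p') {
        return parse_number(body)
            .and_then(|v| u16::try_from(v).ok())
            .map(AddressType::Port);
    }
    if let Some(body) = slice.strip_suffix('d') {
        return parse_number(body)
            .and_then(|v| u8::try_from(v).ok())
            .map(AddressType::DirectPage);
    }
    // Everything else is treated as a plain absolute address (assumed 24-bit).
    parse_number(slice)
        .filter(|v| *v <= 0x00FF_FFFF)
        .map(AddressType::Absolute)
}
```

Under those assumptions, parse_address("$3F8p") yields a Port and parse_address("IP-%4") yields a Relative with a negative offset, while a value too large for its variant, such as "$100d", comes back as None.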

This code is still majorly a work in progress, but I plan to commit to the GitHub repo once I get it to a certain point. Until then, y'all have an awesome day!