dBASE: Parsing a Binary File Format With Raku

Luis Uceta — Mon, 26 Jun 2023 00:03:35 +0000

Working with binary data is kind of like solving a puzzle. You’re given clues by reading a specification of what the data means and then you have to go out and turn that data into something usable in your application.

— Young and Harter’s NodeJS in Practice

Text files, images, videos and anything you can store in a computer have a thing in common: all of them are stored as binary data. We make sense of what a sequence of binary data refers to purely based on how we interpret the binary data, and whether it conforms to what we expect it to be. If some chunk of binary data holds any meaning, we can tell it apart from another chunk by using a binary format specification, which describes how some binary data ought to be interpreted. For example, in the dBASE specification the first 32 bytes make up the header, which contains information such as date of last update, number of records in the database file, etc. Every binary file format you can imagine has a specification, and when such a specification isn’t available to someone interested on decoding a binary file format, then they must reverse engineer it.

What’s binary data?

Binary data is made up of bytes, and each byte has 8 bits which maps to an unsigned integer in the range 0-255, that is, one of 256 values (2^8). For example, 01111010 is a byte, which is the decimal number 122 and we can represent it as the hexadecimal 0x7A by converting each 4 bits (known as a nibble) to its hexadecimal (just hex from here onward) equivalent. Thus, 0111 is 7 and 1010 is A. The prefix 0x isn't part of the number and simply denotes we’re dealing with a hex number. Also the hex characters A-F don’t need to be uppercase, 0x7A and 0x7a represent the same number.

NOTE: Even though a byte usually refers to a sequence of 8 bits, it’s context dependent. In this article, it's also 8 bits and no some other amount of bits. Whenever you see the term byte you can replace it with the term octet which specifically refers to 8 bits.

Storing text starts with a character encoding, which is a scheme that maps bytes to characters; a particular byte in an encoding is known as a codepoint. A popular character encoding is ASCII which maps a number in the range 0-128 to a specific character. For example, codepoint 122 (or 0x7A) maps to the lowercase letter z in ASCII. Because ASCII was originally based on modern English and due to its short range, it only encodes both lowercase and uppercase letters from a to z, decimal digits, punctuation symbols, and some non-printable characters such as ESC (escape). There are other character encodings, such as UTF-8, that encode characters over a wider range, and thus cover languages other than English. For example, the character Ǣ has a decimal codepoint 482 and hexadecimal codepoint 0x1E2. A file with only ASCII characters has been traditionally known as a plain text file, however in principle, it can be in any character encoding. With UTF-8 and UTF-16 becoming more ubiquitous, a plain text file nowadays contain more than ASCII characters.

Text file vs binary file

A text file is technically a binary file, i.e., it's made up of 0s and 1s. However, technicality aside, a text file is a file whose content consists of an encoded sequence of Unicode codepoints, and thus can be correctly interpreted via the character encoding in effect (e.g. ASCII, UTF8, UTF16, etc). For example, a text file consisting of the string Raku is o-fun followed by 4 bytes that represents a binary integer wouldn’t be considered a text file. Instead it’s a binary file because it contains bytes that cannot be decoded with a character encoding. Thus a binary file is any file that isn't a text file.

While subtle, this distinction is quite important because text is considered a universal interface, meaning we can do many things with it without manipulating 0s and 1s directly: "You can strip it, cut it, transform it, send it to other places. Humans can read it, programs can read it, your printer can output it. It can be sent to web APIs, it can be stored anywhere. It's compressible, can be colored and can be copy-pasted and is infinitely extendable. Thousands of protocols run over it." For example, the Unix philosophy is based on it:

This is the Unix philosophy: write programs that do one thing and do it well. Write programs to work together. Write programs that handle text streams, because that is a universal interface.

— Douglas McIlroy

It’s not surprising then that many file formats programmers use in a daily basis are formats built on top of the handy abstraction text files are. From JSON to Markdown to your favorite programming language's source code, you’re most likely dealing with text files which is a big deal (or at least I think it is): you don't need to deal directly with 0s and 1s in order to structure data in a meaningful and human-readable way. Instead of focusing on how it's represented, you can focus on the content itself. Text is all about human readability. However readability doesn’t necessarily lend itself to be efficiently stored. For example, storing an unsigned 4 digits number encoded as UTF-8 would take you 4 bytes. On the other hand, storing it as a binary number would take you 1 byte.

In contrast to text files, binary files are optimized for efficiently storing information in a ready-to-process format. Thus human readability isn't a goal, and unlike a text-based data format, simply looking at a binary file won’t give you any hints about what its contents are. Like I stated above, to even begin to understand a binary file, you need to read its format specification. This is the motivation for this article! In this article we will decode the dBASE file format, specifically the Level 5 DOS dBASE version, using the Raku programming language. Raku has incredible support for working and parsing binary data and on top of it, it’s a fun language to work with.

Reading binary data

We’ll jump into decoding the dBASE file format in a bit but first a bit of context. The dBASE file format is the underlying file format for dBASE, “one of the first database management systems for microcomputers and the most successful in its day”. This file format “is widely used in applications needing a simple format to store structured data.” More in Wikipedia about dBASE itself and the .dfb file format.

I won’t duplicate the file format’s layout info here, and instead will simply refer to it. Thus I advise you to skim the specification on Wikipedia. A dBASE file has extension .dbf so from here onward, I'll use the term DBF to refer to a file with dBASE data.

First, we’ll need to read some data from a .dbf file. We’ll use world.dbf, a file that contains a database of countries with their latitudes, longitudes, etc., which you can find here. In Raku, the easiest way to get a file handle is by using the open routine. It has both subroutine and method versions, and I’ll be using the latter.

my $fh = "./world.dbf".IO.open: :r, :bin;

Notice the :r and :bin arguments. They’re named arguments, shorthand for r => True and bin => True respectively, and they tell Raku to open the handle only for reading and in binary mode, in contrast to text mode which is the default.

The next step is to get hold of some data from the file, and for this we’ll use the read method which we’ll allow us to read up an n number of bytes from the handle and return them as a Buf. In the Rakudo compiler, read returns 65536 bytes by default.

my $buffer = $fh.read: 32;

A Buf is simply a mutable buffer of binary data; its immutable counterpart is a Blob. Here we’re reading the first 32 bytes from the handle, which results in the following buffer with 32 bytes.

Buf[uint8]:0x<03 6D 0B 14 4F 08 00 00 E1 00 4A 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 57 00 00>

Each value is an 8-bit unsigned integer but Raku prints it out in hexadecimal. For example, 6D is simply hex for 109 in decimal like I explained above.

It’s worth mentioning that whenever we read n bytes from the file handle with read, the file pointer advances n bytes. For example, another $fh.read: 32; operation will return the next 32 bytes. You can call tell on the file handle to get the file pointer’s current position in bytes.

NOTE: In Raku, you can forgo parentheses whenever it's unambiguous to do so. Here, $fh.read: 32; is the same as $fh.read(32); as in more traditional languages. TMTOWTDI!

Decoding binary data

Now that we have some data we can start decoding it. We know the first 32 bytes in a DBF file contain information about the file such as the date of last update, number of records in the database, number of bytes in the header, number of bytes per record, etc. These first 32 bytes constitute the file header.

File Header

From the specification, byte 0:

bits 0–2 indicate version number (WWW)
bit 3 indicates the presence of a dBASE for DOS memo file (X)
bits 4–6 indicate the presence of a SQL table (YYY)
bit 7 indicates the presence of any memo file (either dBASE m PLUS or dBASE for DOS) (Z).

Thus byte 0 can be represented as ZYYYXWWW, and we must isolate each set of bits and extract them. We can extract those bits by performing a right shift and then applying a bitmask where appropriate. For example, to extract the bit X we right shift the byte 3 bits, which results in 000ZYYYX, and then apply the bitmask 0x01 using the bitwise AND operator +& in order to obtain the bit 0000000X.

my UInt:D $version       = $buffer[0] +& 0x03;
my Bool:D $has-dos-memo  = ($buffer[0] +> 3) +& 0x01 == 1;
my Bool:D $has-sql-table = ($buffer[0] +> 4) +& 0x03;
my Bool:D $has-any-memo  = ($buffer[0] +> 7) +& 0x01 == 1;

For the world.dbf file, its version’s number is 3 and it doesn’t have DOS memo file, SQL table, or any kind of memo. Basically, that byte value is 0x03 and since we’re parsing a “dBASE III” (see Wikipedia section), then all this mean is it’s a "modern dBASE III without memo (and SQL table)” to be precise. I was interested on the other version’s description, and I found this site which lists some combinations for that byte. Then I stumbled on this more complete table that matches that byte’s value to the version’s description. I’m including it here for reference more than anything else, after all we’re dealing with only one DBF version.

Byte	Description
0x02	FoxPro
0x03	dBase III without memo file
0x04	dBase IV without memo file
0x05	dBase V without memo file
0x07	Visual Objects 1.x
0x30	Visual FoxPro
0x31	Visual FoxPro with AutoIncrement field
0x43	dBASE IV SQL table files, no memo
0x63	dBASE IV SQL system files, no memo
0x7B	dBase IV with memo file
0x83	dBase III with memo file
0x87	Visual Objects 1.x with memo file
0x8B	dBase IV with memo file
0x8E	dBase IV with SQL table
0xCB	dBASE IV SQL table files, with memo
0XF5	FoxPro with memo file
0xFB	FoxPro without memo file

Looking at that table, it’s clear the version’s description for world.dbf is indeed “dBase III without memo file”. Thus, we didn’t need all that bit twiddling, however it was perfect to showcase bitwise operations in Raku.

Now we’ll determine the file’s last update date.

my UInt:D $year  = $buffer[1];
my UInt:D $month = $buffer[2];
my UInt:D $day   = $buffer[3];

my Str:D $last-update = Date.new(
  :year($year + 1900),
  :$month,
  :$day,
  :formatter({ "%04d-%02d-%02d".sprintf: .year, .month, .day }),
).Str;

As described by the specification file, bytes 1-3 (3 bytes) store the date of last update, where byte 1 stores the year, byte 2 stores the month, and byte 3 stores the day. The year is the number of years since 1900 so we must add 1900 to $year. We extract those bytes and then create a Date object with a formatter YYYY-MM-DD, on which we call .Str to get the string representation, i.e., 1995-07-26. Thus, this file was last updated on July 26th, 1995.

Next we’ll decode bytes 4-7 in order to determine the number of records in the database file. Unlike in the previous decoding, these four bytes are a single unit, namely a 32-bit number. Another bit of information (pun intended!) we’ve about this 32-bit number is that it must be read in little endian. At the risk of digressing, endianness simply refers to the order in which bytes in a multi-byte word are stored in computer memory. A system that stores the least-significant byte at the smallest address is known as little endian, in contrast to big endian. This word (i.e., 32-bit number) was stored as little endian, hence we must read it as such.

my UInt:D $records-count = $buffer.read-uint32: 4, LittleEndian;

The read-uint32 method returns the value for the four bytes starting at the given position. In other words, it reads 32 bits from byte 4 and return its value. We also specify the endianness using the Endian enum. For world.dbf, we get 246 which is the number of records in this database file.

Bytes 8-9 give us the number of bytes in the header. Here we’re also dealing with a single unit, namely a 16-bit number that must be read in little endian as well.

my UInt:D $header-length = $buffer.read-uint16: 8, LittleEndian;

The read-uint16 method returns the value for the two bytes starting at the given position. We get 385, which means the length of this file’s header is 385 bytes long. Since we know the header has 32 bytes for metadata, then 32 bytes for each field until we find the field descriptor terminator (i.e., the byte 0x0D), we can do some quick math to determine the number of fields in this file ahead of time: (385 - 32 - 1) / 32 = 352 / 32 = 11. So this file has 11 fields, each of which is 32 bytes long. The last field is immediately followed by the field descriptor terminator which is in turn followed by the first record.

Bytes 12-13 are reserved and filled with 0 so we will skip it. Byte 14 is a flag that indicates an incomplete transaction: if it’s set to 1, then the transaction didn’t complete. For a single byte, we can either index the buffer or use read-uint8 which reads a single byte from the given position.

my Bool:D $transaction-complete = $buffer[14] != 1;

For this file, $transaction-complete is True which means there’s no incomplete transaction.

Byte 15 is a flag that indicates the database is encrypted if set to 1. Although we’re not writing back to the file, it’s worth mentioning that switching this flag to 0 doesn’t decrypt the database.

my Bool:D $db-encrypted = $buffer[15] == 1;

For this file, $db-encrypted is False which means the database is not encrypted.

Bytes 16-27 (12 bytes) are reserved for dBASE for DOS in a multi-user environment, and thus we’ll skip them. Byte 28 is the production .mdx file flag which is set to 1 if there’s a .mdx file in the database.

my Bool:D $prod-mdx = $buffer[28] == 1;

For this file, $prod-mdx is False which means the database has no .mdx file.

Byte 29 is the language driver ID.

my $lang-driver-id = $buffer[29];

Bytes 30-31 are reserved and zero filled, so we’ll skip them.

Next bytes describe the array of field descriptors until we find the field descriptor terminator, i.e., 0x0D. This is still documented as part of the header but we’ll tackle it in the following section and make the decision to encapsulate it separately from the metadata we’ve collected thus far, which we now encapsulate in a FileHeader class.

class FileHeader {
    has IO::Handle:D $.fh is required('cannot read data without file handle');

    has Str  $.version              is built(False);
    has Str  $.last-updated         is built(False);
    has UInt $.record-count         is built(False);
    has UInt $.header-length        is built(False);
    has UInt $.record-length        is built(False);
    has Bool $.transaction-complete is built(False);
    has Bool $.db-encrypted         is built(False);
    has Bool $.prod-mdx             is built(False);
    has UInt $.lang-driver          is built(False);

    method TWEAK {
        my $data = $!fh.read: 32;
        self!read-metadata($data);
    }

    method !read-metadata($data) {
        my $version = $data[0];

        unless $version == 0x03 {
            die "Only dBase III without memo file supported";
        }
        $!version = 'dBase III without memo file';

        my $year  = $data[1];
        my $month = $data[2];
        my $day   = $data[3];
        $!last-updated = Date.new(
            :year($year + 1900),
            :$month,
            :$day,
            :formatter({ "%04d-%02d-%02d".sprintf(.year, .month, .day) })
        ).Str;

       $!record-count         = $data.read-uint32(4, LittleEndian);
       $!header-length        = $data.read-uint16(8, LittleEndian);
       $!record-length        = $data.read-uint16(10, LittleEndian);
       $!transaction-complete = $data[14] != 1;
       $!db-encrypted         = $data[15] == 1;
       $!prod-mdx             = $data[28] == 1;
       $!lang-driver          = $data[29];
    }
}

Field Descriptor Array

Now we’ll need to figure out the field names and any associated information. Remember that quick math we did in the previous section? We’ll use it here. Here we'll be loading the chunk of data that make the field descriptor array all at once into memory, so this might not be as memory efficient as simply reading each of the field’s parts at a time. However we’re only loading 11 x 32 bytes into memory. For simplicity’s sake, we’ll do it this way.

We start by writing down a few constants:

constant $FIELD-TERMINATOR = 0x0D;
constant $FIELD-TERMINATOR-LENGTH = 1;
constant $METADATA-LENGTH = 32;
constant $FIELD-LENGTH = 32;

Using the header’s length, we can determine how many of those bytes make up the field descriptor array:

my $TOTAL-FIELD-BYTES = $!header-length - $METADATA-LENGTH - $FIELD-TERMINATOR-LENGTH;

Here $!header-length is an attribute that must be passed to initialize a FieldDescriptorArray object. Same for $!fh down below.

Next we determine the fields count:

my $FIELDS-COUNT = $TOTAL-FIELD-BYTES div $FIELD-LENGTH;

Now we can read the correct number of bytes from the file handle:

my $buffer = $!fh.read: $TOTAL-FIELD-BYTES;

Keep in mind read advances the file pointer. Assuming we’ve the correct number of fields, then the next byte should be the field descriptor array terminator. We use this fact to determine if the program should fail fatally:

my $field-terminator = $!fh.read(1)[0];
unless $field-terminator == $FIELD-TERMINATOR {
    die "Wrong number of bytes for field descriptor array"
}

Lastly we only need to extract each individual field and its associated information from the field descriptor array buffer:

loop (my $i = 0; $i < $FIELD-COUNTS; $i++) {
    my Buf:D $field = $buffer.subbuf($FIELD-LENGTH * $i, $FIELD-LENGTH);
    my Str:D $name = $field.subbuf(0, 10).decode('ascii').subst(/\x[00]+/, '');
    my Str:D $type = $field.subbuf(11, 1).decode('ascii');
    my UInt:D $length = $field[16];
    my UInt:D $decimal-places = $field[17];
    @!fields.push: Field.new(:$name, :$type, :$length, :$decimal-places);
}

We extract $FIELD-LENGTH bytes from the current position $FIELD-LENGTH * $i, i.e., we extract 32 bytes from the current position. From this 32 bytes, we extract the first 10 bytes which make up the field’s name, decode it as ASCII, and remove null bytes if any since it might be padded with null characters (0x00). Byte 11 is a single character that denotes the field type. All the field types in “dBase level 5” are C for a string of characters, D for a date, F for a floating point, L for a logical value, and N for numeric. Bytes 12-15 (4 bytes) are reserved so we can skip them. Byte 16 is the field length in bytes, for which the maximum is 254 (0xFE).

Byte 17 is the number of decimal places. Bytes 18-19 (2 bytes) is the work area ID, and we’ll skip them. Similarly, bytes 20-31: Apparently byte 20 is the “Example” but it’s unclear to me what that refers to and I don’t think it’s that important anyways. Bytes 21-30 (10 bytes) are reserved so it’s clear why we’d like to skip them. As for byte 31, it’s the production MDX field flag, we skipped this as well for the header so we’re also skipping it here.

Putting everything we did in this section together:

class Field {
    has Str $.name            is required('field must have a name');
    has Str $.type            is required('field must have a type');
    has Int $.length          is required('field must have a length');
    has Int $.decimal-places;
}

class FieldDescriptorArray {
    has $.fh            is required('cannot read data without file handle');
    has $.header-length is required('cannot read records without knowing where to start');

    has @.fields is built(False);

    method TWEAK {
        self!read-fields;
    }

    method !read-fields {
        constant $FIELD-TERMINATOR = 0x0D;
        constant $FIELD-TERMINATOR-LENGTH = 1;
        constant $METADATA-LENGTH = 32;
        constant $FIELD-LENGTH = 32;

        my $TOTAL-FIELD-BYTES = $!header-length - $METADATA-LENGTH - $FIELD-TERMINATOR-LENGTH;
        my $FIELDS-COUNT = $TOTAL-FIELD-BYTES / $FIELD-LENGTH;
        my $buffer = $!fh.read($TOTAL-FIELD-BYTES);

        my $field-terminator = $!fh.read(1);
        unless $field-terminator[0] == $FIELD-TERMINATOR {
        die 'Wrong number of bytes for fields'
    }

        loop (my $i = 0; $i < $FIELDS-COUNT; $i++) {
        my $field = $buffer.subbuf($FIELD-LENGTH * $i, $FIELD-LENGTH);
        my $name = $field.subbuf(0, 10).decode('ascii').subst(/\x[00]+/, '');
        my $type = $field.subbuf(11, 1).decode('ascii');
        my $length = $field[16];
        my $decimal-places = $field[17];
        @!fields.push: Field.new(:$name, :$type, :$length, :$decimal-places);
        }
    }
}

Records

Now that we’ve read the file header and the field descriptors from the DBF file, the final step involves reading the database records using the information we’ve gathered thus far, namely the field descriptors, the record count, and the record length.

We start by declaring these two constants:

constant $DELETION-FLAG = 0x2A;
constant $HEADER-LENGTH = 32;

The $DELETION-FLAG constant is used to indicate if a record has been marked as “deleted”. It’s up to the reader to decide how to handle deleted records, in our case we simply set the deleted key in each record to indicate a record’s deletion status which could be true or false.

We’re using the record count and the record length in order to figure how many times to loop and how many bytes to read from the file handle:

constant $DELETION-FLAG = 0x2A;
constant $HEADER-LENGTH = 32;
loop (my $i = 0; $i < $!record-count; $i++) {
    # ...
}

The contents of a DBF varies from file to file so we cannot create a class that represents a record ahead of time, for example. For this reason, we use a hash to store a record’s contents:

constant $DELETION-FLAG = 0x2A;
constant $HEADER-LENGTH = 32;
loop (my $i = 0; $i < $!record-count; $i++) {
    my %record;
    my $buffer = $!fh.read($!record-length);
    %record{'deleted'} = $buffer[0] == $DELETION-FLAG;

    my $record-offset = 1;
}

Here we read $!record-length bytes from the file handle, and set %record{'deleted'} to the record’s deletion status. Because the first byte in each record represents its deletion status, we’re using $record-offset to determine the position to read the record’s data from over each loop.

Next we loop over each field in order to determine how many bytes each field occupies within $buffer, as well as the field’s type:

constant $DELETION-FLAG = 0x2A;
constant $HEADER-LENGTH = 32;
loop (my $i = 0; $i < $!record-count; $i++) {
    my %record;
    my $buffer = $!fh.read($!record-length);
    %record{'deleted'} = $buffer[0] == $DELETION-FLAG;

    my $record-offset = 1;
    for $!fields.fields -> $field {
        my $buf = $buffer.subbuf($record-offset, $field.length);
        my $value = do given $field.type {
            when 'C' { $buf.decode('utf8-c8').trim }
            when 'N' { $buf.decode('ascii').Num }
            when 'L' {
                my $flag = $buf.decode('ascii').trim;
                'YyTt'.contains($flag) ?? True !! 'NnFf'.contains($flag) ?? False !! Bool;
            }
            when 'D' {
                my $date = $buf.decode('ascii');
                my ($year, $month, $day) = .substr(0, 4), .substr(4, 2), .substr(6, 2) given $date;
                Date.new: :$year, :$month, :$day;
            }
            when 'F' { $buf.decode('ascii').Num }
        }
        %record{$field.name} = $value;
        $record-offset += $field.length;
    }
    @!records.push: %record;
}

Depending on the field’s type, we do some processing. For this specific dBASE version, we only deal with string (C), numeric (N), logical (L), date (D), and float (F). In all the cases, we decode the buffer and then do some manipulation where necessary. For example, dates as stored as the string YYYYMMDD so we extract the date parts and then create a Date object using them. For each field, we map the field’s name to the field’s value, as well as incrementing the $record-offset mentioned above.

Finally, whenever we’re done with a record we add it to the list of records in @!records.push: %record;.

Putting everything together, we end up with the RecordsDB class:

class RecordsDB {
    has IO::Handle:D                $.fh            is required('cannot read data without file handle');
    has UInt:D                      $.record-count  is required('must know number of records');
    has UInt:D                      $.record-length is required("must know each record's length");
    has FieldDescriptorArray:D      $.fields        is required('must have fields');

    has @.records is built(False);

    submethod TWEAK {
        self!read-records;
    }

    method !read-records {
        constant $DELETION-FLAG = 0x2A;
        constant $HEADER-LENGTH = 32;
        loop (my $i = 0; $i < $!record-count; $i++) {
            my %record;

            my $buffer = $!fh.read($!record-length);
            %record{'deleted'} = $buffer[0] == $DELETION-FLAG;

            my $record-offset = 1;
            for $!fields.fields -> $field {
                my $buf = $buffer.subbuf($record-offset, $field.length);
                my $value = do given $field.type {
                    when 'C' { $buf.decode('utf8-c8').trim }
                    when 'N' { $buf.decode('ascii').Num }
                    when 'L' {
                        my $flag = $buf.decode('ascii').trim;
                        'YyTt'.contains($flag) ?? True !! 'NnFf'.contains($flag) ?? False !! Bool;
                    }
                    when 'D' {
                        my $date = $buf.decode('ascii');
                        my ($year, $month, $day) = .substr(0, 4), .substr(4, 2), .substr(6, 2) given $date;
                        Date.new: :$year, :$month, :$day;
                    }
                    when 'F' { $buf.decode('ascii').Num }
                }
                %record{$field.name} = $value;
                $record-offset += $field.length;
            }
            @!records.push: %record;
        }
    }
}

Conclusion

In this article, I’ve explained the difference between binary and textual data, and their pros and cons when storing information. I also showed how to read and decode binary data, specifically the dBASE III binary format, using Raku.

The source code for this article can be found in this repo https://github.com/uzluisf/raku-dbf-reader-art. Note there might be some slight variations with the code snippets here, mainly the fact I organized the code snippets into a Raku module in the repo, however that shouldn’t make any difference.

References

About text as the universal interface: https://news.ycombinator.com/item?id=8437038
Working with binary files: https://www.visuality.pl/posts/cs-lessons-001-working-with-binary-files

DEV Community: Luis Uceta