<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Luis Uceta</title>
    <description>The latest articles on DEV Community by Luis Uceta (@uzluisf).</description>
    <link>https://dev.to/uzluisf</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1104577%2Fc1375214-ac7c-4206-8ab4-30aa39bafe66.png</url>
      <title>DEV Community: Luis Uceta</title>
      <link>https://dev.to/uzluisf</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/uzluisf"/>
    <language>en</language>
    <item>
      <title>dBASE: Parsing a Binary File Format With Raku</title>
      <dc:creator>Luis Uceta</dc:creator>
      <pubDate>Mon, 26 Jun 2023 00:03:35 +0000</pubDate>
      <link>https://dev.to/uzluisf/dbase-parsing-a-binary-file-format-with-raku-2fm6</link>
      <guid>https://dev.to/uzluisf/dbase-parsing-a-binary-file-format-with-raku-2fm6</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Working with binary data is kind of like solving a puzzle. You’re given clues by reading a specification of what the data means and then you have to go out and turn that data into something usable in your application.&lt;/p&gt;

&lt;p&gt;— Young and Harter’s &lt;em&gt;NodeJS in Practice&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Text files, images, videos and anything you can store in a computer have one thing in common: all of them are stored as &lt;strong&gt;binary data&lt;/strong&gt;. We make sense of what a sequence of binary data refers to purely based on how we interpret the binary data, and whether it conforms to what we expect it to be. If some chunk of binary data holds any meaning, we can tell it apart from another chunk by using a &lt;strong&gt;binary format specification&lt;/strong&gt;, which describes how some binary data ought to be interpreted. For example, in the dBASE specification the first 32 bytes make up the header, which contains information such as date of last update, number of records in the database file, etc. Every binary file format you can imagine has a specification, and when such a specification isn’t available to someone interested in decoding a binary file format, they must reverse engineer it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s binary data?
&lt;/h2&gt;

&lt;p&gt;Binary data is made up of &lt;strong&gt;bytes&lt;/strong&gt;, and each byte has 8 bits, which map to an unsigned integer in the range 0-255, that is, one of 256 values (2^8). For example, &lt;code&gt;01111010&lt;/code&gt; is a byte, which is the decimal number 122, and we can represent it as the hexadecimal &lt;code&gt;0x7A&lt;/code&gt; by converting each group of 4 bits (known as a &lt;strong&gt;nibble&lt;/strong&gt;) to its hexadecimal (just &lt;strong&gt;hex&lt;/strong&gt; from here onward) equivalent. Thus, &lt;code&gt;0111&lt;/code&gt; is &lt;code&gt;7&lt;/code&gt; and &lt;code&gt;1010&lt;/code&gt; is &lt;code&gt;A&lt;/code&gt;. The prefix &lt;code&gt;0x&lt;/code&gt; isn't part of the number and simply denotes we’re dealing with a hex number. Also, the hex characters &lt;code&gt;A&lt;/code&gt;-&lt;code&gt;F&lt;/code&gt; don’t need to be uppercase: &lt;code&gt;0x7A&lt;/code&gt; and &lt;code&gt;0x7a&lt;/code&gt; represent the same number.&lt;/p&gt;
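
&lt;p&gt;As a quick check of these conversions, here’s a minimal sketch in Raku. The variable name &lt;code&gt;$byte&lt;/code&gt; is only for illustration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;my $byte = 0b01111010;     # binary literal for the byte discussed above

say $byte;                 # 122
say $byte.base(16);        # 7A
say $byte.fmt('%08b');     # 01111010 (zero-padded to 8 bits)
say $byte +&amp;gt; 4;            # 7  (high nibble, 0111)
say $byte +&amp;amp; 0x0F;         # 10 (low nibble, 1010, i.e. hex A)
say 0x7A == 0x7a;          # True
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;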

&lt;p&gt;&lt;strong&gt;NOTE:&lt;/strong&gt; Even though a byte usually refers to a sequence of 8 bits, it’s context dependent. In this article, a byte is always 8 bits and not some other number of bits. Whenever you see the term byte you can replace it with the term &lt;strong&gt;octet&lt;/strong&gt;, which specifically refers to 8 bits.&lt;/p&gt;

&lt;p&gt;Storing text starts with a &lt;strong&gt;character encoding&lt;/strong&gt;, which is a scheme that maps numbers to characters; the number assigned to a character is known as its &lt;strong&gt;codepoint&lt;/strong&gt;. A popular character encoding is ASCII, which maps each number in the range 0-127 to a specific character. For example, codepoint 122 (or &lt;code&gt;0x7A&lt;/code&gt;) maps to the lowercase letter &lt;em&gt;&lt;code&gt;z&lt;/code&gt;&lt;/em&gt; in ASCII. Because ASCII was originally based on modern English and due to its short range, it only encodes both lowercase and uppercase letters from &lt;em&gt;&lt;code&gt;a&lt;/code&gt;&lt;/em&gt; to &lt;em&gt;&lt;code&gt;z&lt;/code&gt;&lt;/em&gt;, decimal digits, punctuation symbols, and some non-printable characters such as &lt;code&gt;ESC&lt;/code&gt; (escape). There are other character encodings, such as UTF-8, that encode characters over a much wider range, and thus cover languages other than English. For example, the character &lt;code&gt;Ǣ&lt;/code&gt; has the decimal codepoint 482 (hexadecimal &lt;code&gt;0x1E2&lt;/code&gt;), which doesn’t fit in a single byte. A file with only ASCII characters has traditionally been known as a &lt;strong&gt;plain text&lt;/strong&gt; file; in principle, however, it can be in any character encoding. With UTF-8 and UTF-16 becoming more ubiquitous, a plain text file nowadays often contains more than ASCII characters.&lt;/p&gt;
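
&lt;p&gt;To make the relationship between characters, codepoints and bytes a bit more concrete, here’s a small sketch using Raku’s &lt;code&gt;ord&lt;/code&gt;, &lt;code&gt;chr&lt;/code&gt; and &lt;code&gt;encode&lt;/code&gt;. It shows that &lt;code&gt;Ǣ&lt;/code&gt; needs two bytes once encoded as UTF-8.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;say 'z'.ord;                      # 122
say 122.chr;                      # z
say 'z'.encode('ascii').bytes;    # 1

say 'Ǣ'.ord;                      # 482
say 'Ǣ'.ord.base(16);             # 1E2
say 'Ǣ'.encode('utf8').list;      # (199 162), i.e. 0xC7 0xA2
say 'Ǣ'.encode('utf8').bytes;     # 2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;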

&lt;h2&gt;
  
  
  Text file vs binary file
&lt;/h2&gt;

&lt;p&gt;A text file is technically a binary file, i.e., it's made up of 0s and 1s. However, technicality aside, a text file is a file whose content consists of an encoded sequence of Unicode codepoints, and thus can be correctly interpreted via the character encoding in effect (e.g., ASCII, UTF-8, UTF-16, etc.). For example, a file consisting of the string &lt;code&gt;Raku is o-fun&lt;/code&gt; followed by 4 bytes that represent a binary integer wouldn’t be considered a text file. Instead it’s a binary file because it contains bytes that cannot be decoded with a character encoding. Thus a binary file is any file that isn't a text file.&lt;/p&gt;

&lt;p&gt;While subtle, this distinction is quite important because text is considered a &lt;strong&gt;universal interface&lt;/strong&gt;, meaning we can do many things with it without manipulating 0s and 1s directly: &lt;a href="https://news.ycombinator.com/item?id=8437038"&gt;"You can strip it, cut it, transform it, send it to other places. Humans can read it, programs can read it, your printer can output it. It can be sent to web APIs, it can be stored anywhere. It's compressible, can be colored and can be copy-pasted and is infinitely extendable. Thousands of protocols run over it."&lt;/a&gt; For example, the Unix philosophy is based on it:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This is the Unix philosophy: write programs that do one thing and do it well. Write programs to work together. Write programs that handle text streams, because that is a universal interface.&lt;/p&gt;

&lt;p&gt;— Douglas McIlroy&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It’s not surprising then that many file formats programmers use on a daily basis are built on top of the handy abstraction that text files are. From JSON to Markdown to your favorite programming language's source code, you’re most likely dealing with text files, which is a big deal (or at least I think it is): you don't need to deal directly with 0s and 1s in order to structure data in a meaningful and human-readable way. Instead of focusing on how it's represented, you can focus on the content itself. Text is all about human readability. However, readability doesn’t necessarily lend itself to efficient storage. For example, storing an unsigned 4-digit number encoded as UTF-8 takes 4 bytes, one per digit. Storing it as a binary number, on the other hand, takes only 2 bytes.&lt;/p&gt;
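
&lt;p&gt;Here’s a rough sketch of that size difference, using 9999 as an arbitrary 4-digit number. A single byte tops out at 255, so the binary form needs two bytes; the little-endian layout used below is just one possible choice.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;my $number = 9999;

# As text: one byte per digit.
say $number.Str.encode('utf8').bytes;            # 4

# As binary: low byte first, then high byte.
my $binary = Buf.new($number +&amp;amp; 0xFF, ($number +&amp;gt; 8) +&amp;amp; 0xFF);
say $binary.bytes;                               # 2
say $binary.read-uint16(0, LittleEndian);        # 9999
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;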

&lt;p&gt;In contrast to text files, binary files are optimized for efficiently storing information in a ready-to-process format. Thus human readability isn't a goal, and unlike a text-based data format, simply looking at a binary file won’t give you any hints about what its contents are. As I stated above, to even begin to understand a binary file, you need to read its format specification. This is the motivation for this article! In this article we will decode the dBASE file format, specifically the &lt;a href="https://en.wikipedia.org/wiki/.dbf#File_format_of_Level_5_DOS_dBASE"&gt;Level 5 DOS dBASE&lt;/a&gt; version, using the &lt;a href="http://raku.org"&gt;Raku&lt;/a&gt; programming language. Raku has incredible support for working with and parsing binary data, and on top of that, it’s a fun language to work with.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reading binary data
&lt;/h2&gt;

&lt;p&gt;We’ll jump into decoding the dBASE file format in a bit, but first some context. The dBASE file format is the underlying file format for dBASE, “one of the first database management systems for microcomputers and the most successful in its day”. This file format “is widely used in applications needing a simple format to store structured data.” There’s more on Wikipedia about &lt;a href="https://en.wikipedia.org/wiki/DBase"&gt;dBASE&lt;/a&gt; itself and the &lt;a href="https://en.wikipedia.org/wiki/.dbf#Database_records"&gt;.dbf file format&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I won’t duplicate the file format’s layout info here, and instead will simply refer to it. Thus I advise you to skim the specification on Wikipedia. A dBASE file has extension &lt;code&gt;.dbf&lt;/code&gt; so from here onward, I'll use the term DBF to refer to a file with dBASE data.&lt;/p&gt;

&lt;p&gt;First, we’ll need to read some data from a &lt;code&gt;.dbf&lt;/code&gt; file. We’ll use &lt;code&gt;world.dbf&lt;/code&gt;, a file that contains a database of countries with their latitudes, longitudes, etc., which you can find &lt;a href="https://github.com/uzluisf/raku-dbf-reader-art/blob/main/data/world.dbf"&gt;here&lt;/a&gt;. In Raku, the easiest way to get a file handle is by using the &lt;code&gt;open&lt;/code&gt; routine. It has both subroutine and method versions, and I’ll be using the latter.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;my $fh = "./world.dbf".IO.open: :r, :bin;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice the &lt;code&gt;:r&lt;/code&gt; and &lt;code&gt;:bin&lt;/code&gt; arguments. They’re named arguments, shorthand for &lt;code&gt;r =&amp;gt; True&lt;/code&gt; and &lt;code&gt;bin =&amp;gt; True&lt;/code&gt; respectively, and they tell Raku to open the handle only for reading and in binary mode, in contrast to text mode which is the default.&lt;/p&gt;
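
&lt;p&gt;The colon shorthand isn’t specific to &lt;code&gt;open&lt;/code&gt;. As a small illustration, the hypothetical sub &lt;code&gt;mode&lt;/code&gt; below (not part of the reader we’re building) receives the same values either way:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Hypothetical sub, only here to show how named arguments arrive.
sub mode(:$r = False, :$bin = False) { "r=$r bin=$bin" }

say mode(:r, :bin);                     # r=True bin=True
say mode(r =&amp;gt; True, bin =&amp;gt; True);       # r=True bin=True
say mode(:r);                           # r=True bin=False
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;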

&lt;p&gt;The next step is to get hold of some data from the file, and for this we’ll use the &lt;code&gt;read&lt;/code&gt; method, which allows us to read up to &lt;code&gt;n&lt;/code&gt; bytes from the handle and returns them as a &lt;code&gt;Buf&lt;/code&gt;. In the Rakudo compiler, &lt;code&gt;read&lt;/code&gt; reads up to 65536 bytes by default.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;my $buffer = $fh.read: 32; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A &lt;code&gt;Buf&lt;/code&gt; is simply a mutable buffer of binary data; its immutable counterpart is a &lt;code&gt;Blob&lt;/code&gt;. Here we’re reading the first 32 bytes from the handle, which results in the following buffer with 32 bytes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Buf[uint8]:0x&amp;lt;03 6D 0B 14 4F 08 00 00 E1 00 4A 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 57 00 00&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each value is an 8-bit unsigned integer but Raku prints it out in hexadecimal. For example, &lt;code&gt;6D&lt;/code&gt; is simply hex for 109 in decimal like I explained above.&lt;/p&gt;

&lt;p&gt;It’s worth mentioning that whenever we read &lt;code&gt;n&lt;/code&gt; bytes from the file handle with &lt;code&gt;read&lt;/code&gt;, the file pointer advances &lt;code&gt;n&lt;/code&gt; bytes. For example, another &lt;code&gt;$fh.read: 32;&lt;/code&gt; operation will return the next 32 bytes. You can call &lt;code&gt;tell&lt;/code&gt; on the file handle to get the file pointer’s current position in bytes.&lt;/p&gt;
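
&lt;p&gt;To see the file pointer move without needing &lt;code&gt;world.dbf&lt;/code&gt; at hand, here’s a sketch that writes 64 throwaway bytes to a temporary file (the file name is made up) and reads them back in two chunks:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Temporary file holding the bytes 0..63; the name is hypothetical.
my $path = $*TMPDIR.add("read-demo-{$*PID}.bin");
$path.spurt: Buf.new(|(0..63));

my $fh = $path.open: :r, :bin;
say $fh.tell;                 # 0
my $first = $fh.read: 32;
say $fh.tell;                 # 32
my $second = $fh.read: 32;
say $fh.tell;                 # 64
say $second[0];               # 32 (the second read continued where the first stopped)

$fh.close;
$path.unlink;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;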

&lt;p&gt;&lt;strong&gt;NOTE:&lt;/strong&gt; In Raku, you can forgo parentheses whenever it's unambiguous to do so. Here, &lt;code&gt;$fh.read: 32;&lt;/code&gt; is the same as &lt;code&gt;$fh.read(32);&lt;/code&gt; as in more traditional languages. TMTOWTDI!&lt;/p&gt;

&lt;h2&gt;
  
  
  Decoding binary data
&lt;/h2&gt;

&lt;p&gt;Now that we have some data we can start decoding it. We know the first 32 bytes in a DBF file contain information about the file such as the date of last update, number of records in the database, number of bytes in the header, number of bytes per record, etc. These first 32 bytes constitute the file header.&lt;/p&gt;

&lt;h3&gt;
  
  
  File Header
&lt;/h3&gt;

&lt;p&gt;From the specification, byte 0:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;bits 0–2 indicate version number (&lt;code&gt;WWW&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;bit 3 indicates the presence of a dBASE for DOS memo file (&lt;code&gt;X&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;bits 4–6 indicate the presence of a SQL table (&lt;code&gt;YYY&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;bit 7 indicates the presence of any memo file (either dBASE III PLUS or dBASE for DOS) (&lt;code&gt;Z&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Thus byte 0 can be represented as &lt;code&gt;ZYYYXWWW&lt;/code&gt;, and we must isolate each set of bits and extract them. We can extract those bits by performing a right shift and then applying a bitmask where appropriate. For example, to extract the bit &lt;code&gt;X&lt;/code&gt; we right shift the byte 3 bits, which results in &lt;code&gt;000ZYYYX&lt;/code&gt;, and then apply the bitmask &lt;code&gt;0x01&lt;/code&gt; using the bitwise AND operator &lt;code&gt;+&amp;amp;&lt;/code&gt; in order to obtain the bit &lt;code&gt;0000000X&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Three-bit groups (WWW and YYY) need the mask 0x07; single bits need 0x01.
my UInt:D $version       = $buffer[0] +&amp;amp; 0x07;
my Bool:D $has-dos-memo  = ($buffer[0] +&amp;gt; 3) +&amp;amp; 0x01 == 1;
my Bool:D $has-sql-table = ($buffer[0] +&amp;gt; 4) +&amp;amp; 0x07 != 0;
my Bool:D $has-any-memo  = ($buffer[0] +&amp;gt; 7) +&amp;amp; 0x01 == 1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For the &lt;code&gt;world.dbf&lt;/code&gt; file, the version number is 3 and it doesn’t have a DOS memo file, a SQL table, or any kind of memo. Basically, that byte’s value is &lt;code&gt;0x03&lt;/code&gt; and since we’re parsing a “dBASE III” file (see the Wikipedia section), all this means is that it’s a “modern dBASE III without memo (and SQL table)” to be precise. I was interested in the descriptions of the other versions, and I found this &lt;a href="https://www.dbf2002.com/dbf-file-format.html"&gt;site&lt;/a&gt; which lists some combinations for that byte. Then I stumbled on this more &lt;a href="https://github.com/yellowfeather/DbfDataReader/blob/main/src/DbfDataReader/DbfHeader.cs"&gt;complete table&lt;/a&gt; that matches that byte’s value to the version’s description. I’m including it here for reference more than anything else; after all, we’re dealing with only one DBF version.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Byte&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0x02&lt;/td&gt;
&lt;td&gt;FoxPro&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x03&lt;/td&gt;
&lt;td&gt;dBase III without memo file&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x04&lt;/td&gt;
&lt;td&gt;dBase IV without memo file&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x05&lt;/td&gt;
&lt;td&gt;dBase V without memo file&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x07&lt;/td&gt;
&lt;td&gt;Visual Objects 1.x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x30&lt;/td&gt;
&lt;td&gt;Visual FoxPro&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x31&lt;/td&gt;
&lt;td&gt;Visual FoxPro with AutoIncrement field&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x43&lt;/td&gt;
&lt;td&gt;dBASE IV SQL table files, no memo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x63&lt;/td&gt;
&lt;td&gt;dBASE IV SQL system files, no memo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x7B&lt;/td&gt;
&lt;td&gt;dBase IV with memo file&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x83&lt;/td&gt;
&lt;td&gt;dBase III with memo file&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x87&lt;/td&gt;
&lt;td&gt;Visual Objects 1.x with memo file&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x8B&lt;/td&gt;
&lt;td&gt;dBase IV with memo file&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0x8E&lt;/td&gt;
&lt;td&gt;dBase IV with SQL table&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0xCB&lt;/td&gt;
&lt;td&gt;dBASE IV SQL table files, with memo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0xF5&lt;/td&gt;
&lt;td&gt;FoxPro with memo file&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0xFB&lt;/td&gt;
&lt;td&gt;FoxPro without memo file&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Looking at that table, it’s clear the version’s description for &lt;code&gt;world.dbf&lt;/code&gt; is indeed “dBase III without memo file”. Thus, we didn’t need all that bit twiddling, however it was perfect to showcase bitwise operations in Raku.&lt;/p&gt;
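
&lt;p&gt;To see the shifts and masks produce something other than zeros, here’s a sketch that applies them to &lt;code&gt;0x8B&lt;/code&gt;, which the table lists as “dBase IV with memo file”. This byte doesn’t come from &lt;code&gt;world.dbf&lt;/code&gt;; it’s only an illustration. The three-bit groups are masked with &lt;code&gt;0x07&lt;/code&gt; and the single bits with &lt;code&gt;0x01&lt;/code&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;my $byte = 0x8B;                                  # ZYYYXWWW = 10001011

my $version       = $byte +&amp;amp; 0x07;                # WWW: bits 0-2
my $has-dos-memo  = ($byte +&amp;gt; 3) +&amp;amp; 0x01 == 1;    # X:   bit 3
my $has-sql-table = ($byte +&amp;gt; 4) +&amp;amp; 0x07 != 0;    # YYY: bits 4-6
my $has-any-memo  = ($byte +&amp;gt; 7) +&amp;amp; 0x01 == 1;    # Z:   bit 7

say $byte.fmt('%08b');    # 10001011
say $version;             # 3
say $has-dos-memo;        # True
say $has-sql-table;       # False
say $has-any-memo;        # True
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;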

&lt;p&gt;Now we’ll determine the file’s last update date.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;my UInt:D $year  = $buffer[1];
my UInt:D $month = $buffer[2];
my UInt:D $day   = $buffer[3];

my Str:D $last-update = Date.new(
  :year($year + 1900),
  :$month,
  :$day,
  :formatter({ "%04d-%02d-%02d".sprintf: .year, .month, .day }),
).Str;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As described by the specification file, bytes 1-3 (3 bytes) store the date of last update, where byte 1 stores the year, byte 2 stores the month, and byte 3 stores the day. The year is the number of years since 1900 so we must add &lt;code&gt;1900&lt;/code&gt; to &lt;code&gt;$year&lt;/code&gt;. We extract those bytes and then create a &lt;code&gt;Date&lt;/code&gt; object with a formatter &lt;code&gt;YYYY-MM-DD&lt;/code&gt;, on which we call &lt;code&gt;.Str&lt;/code&gt; to get the string representation, i.e., &lt;code&gt;1995-07-26&lt;/code&gt;. Thus, this file was last updated on July 26th, 1995.&lt;/p&gt;
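
&lt;p&gt;As an aside, a &lt;code&gt;Date&lt;/code&gt; already stringifies as &lt;code&gt;YYYY-MM-DD&lt;/code&gt; by default, so the custom formatter mostly serves to make the format explicit. A minimal sketch, with the three byte values written out by hand (95, 7 and 26, matching the date above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Hand-written stand-ins for bytes 1-3 of the header.
my ($year, $month, $day) = 95, 7, 26;

my $date = Date.new(:year($year + 1900), :$month, :$day);
say $date.Str;     # 1995-07-26
say $date.year;    # 1995
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;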

&lt;p&gt;Next we’ll decode bytes 4-7 in order to determine the number of records in the database file. Unlike in the previous decoding, these four bytes are a single unit, namely a 32-bit number. Another bit of information (pun intended!) we have about this 32-bit number is that it must be read as little endian. At the risk of digressing, &lt;strong&gt;endianness&lt;/strong&gt; simply refers to the order in which the bytes of a multi-byte word are stored in computer memory. A system that stores the least-significant byte at the smallest address is known as &lt;strong&gt;little endian&lt;/strong&gt;, in contrast to &lt;strong&gt;big endian&lt;/strong&gt;, which stores the most-significant byte first. This word (i.e., 32-bit number) was stored as little endian, hence we must read it as such.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;my UInt:D $records-count = $buffer.read-uint32: 4, LittleEndian;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;read-uint32&lt;/code&gt; method returns the value for the four bytes starting at the given position. In other words, it reads 32 bits starting at byte 4 and returns their value. We also specify the endianness using the &lt;code&gt;Endian&lt;/code&gt; enum. For &lt;code&gt;world.dbf&lt;/code&gt;, we get 246, which is the number of records in this database file.&lt;/p&gt;
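
&lt;p&gt;To get a feel for why the byte order matters, here’s a sketch with four hand-written bytes that encode 246 as a little-endian 32-bit number. These bytes are an assumption for illustration, not copied from the file. Reading them as big endian gives a very different value.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;my $bytes = Buf.new(0xF6, 0x00, 0x00, 0x00);

say $bytes.read-uint32(0, LittleEndian);    # 246
say $bytes.read-uint32(0, BigEndian);       # 4127195136

# The little-endian value assembled by hand: byte 0 is the least significant.
my $manual = $bytes[0]
           + ($bytes[1] +&amp;lt; 8)
           + ($bytes[2] +&amp;lt; 16)
           + ($bytes[3] +&amp;lt; 24);
say $manual;                                # 246
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;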

&lt;p&gt;Bytes 8-9 give us the number of bytes in the header. Here we’re also dealing with a single unit, namely a 16-bit number that must be read in little endian as well.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;my UInt:D $header-length = $buffer.read-uint16: 8, LittleEndian;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;read-uint16&lt;/code&gt; method returns the value for the two bytes starting at the given position. We get 385, which means this file’s header is 385 bytes long. Since we know the header has 32 bytes of metadata, followed by 32 bytes for each field until we find the field descriptor terminator (i.e., the byte &lt;code&gt;0x0D&lt;/code&gt;), we can do some quick math to determine the number of fields in this file ahead of time: &lt;code&gt;(385 - 32 - 1) / 32 = 352 / 32 = 11&lt;/code&gt;. So this file has 11 fields, each of which is described by 32 bytes. The last field descriptor is immediately followed by the field descriptor terminator, which is in turn followed by the first record.&lt;/p&gt;
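
&lt;p&gt;The same arithmetic as a small Raku sketch, with the header length hard-coded to the value we just read. The &lt;code&gt;%%&lt;/code&gt; check confirms the remaining bytes split evenly into 32-byte descriptors.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;my $header-length = 385;
my $field-bytes   = $header-length - 32 - 1;    # drop the metadata and the terminator

say $field-bytes;           # 352
say $field-bytes %% 32;     # True
say $field-bytes div 32;    # 11
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;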

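&lt;p&gt;Bytes 10-11 hold the number of bytes per record, another 16-bit little-endian number that can be read with &lt;code&gt;read-uint16&lt;/code&gt; at position 10; we’ll need it later to read the records. Bytes 12-13 are reserved and filled with 0, so we’ll skip them. Byte 14 is a flag that indicates an incomplete transaction: if it’s set to 1, then the transaction didn’t complete. For a single byte, we can either index the buffer or use &lt;code&gt;read-uint8&lt;/code&gt;, which reads a single byte from the given position.&lt;br&gt;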
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;my Bool:D $transaction-complete = $buffer[14] != 1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For this file, &lt;code&gt;$transaction-complete&lt;/code&gt; is &lt;code&gt;True&lt;/code&gt; which means there’s no incomplete transaction.&lt;/p&gt;

&lt;p&gt;Byte 15 is a flag that indicates the database is encrypted if set to 1. Although we’re not writing back to the file, it’s worth mentioning that switching this flag to 0 doesn’t decrypt the database.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;my Bool:D $db-encrypted = $buffer[15] == 1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For this file, &lt;code&gt;$db-encrypted&lt;/code&gt; is &lt;code&gt;False&lt;/code&gt; which means the database is not encrypted.&lt;/p&gt;

&lt;p&gt;Bytes 16-27 (12 bytes) are reserved for dBASE for DOS in a multi-user environment, and thus we’ll skip them. Byte 28 is the production &lt;code&gt;.mdx&lt;/code&gt; file flag which is set to 1 if there’s a &lt;code&gt;.mdx&lt;/code&gt; file in the database.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;my Bool:D $prod-mdx = $buffer[28] == 1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For this file, &lt;code&gt;$prod-mdx&lt;/code&gt; is &lt;code&gt;False&lt;/code&gt; which means the database has no &lt;code&gt;.mdx&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;Byte 29 is the language driver ID.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;my $lang-driver-id = $buffer[29];
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Bytes 30-31 are reserved and zero filled, so we’ll skip them. &lt;/p&gt;

&lt;p&gt;Next bytes describe the array of field descriptors until we find the field descriptor terminator, i.e., &lt;code&gt;0x0D&lt;/code&gt;. This is still documented as part of the header but we’ll tackle it in the following section and make the decision to encapsulate it separately from the metadata we’ve collected thus far, which we now encapsulate in a &lt;code&gt;FileHeader&lt;/code&gt; class.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class FileHeader {
    has IO::Handle:D $.fh is required('cannot read data without file handle');

    has Str  $.version              is built(False);
    has Str  $.last-updated         is built(False);
    has UInt $.record-count         is built(False);
    has UInt $.header-length        is built(False);
    has UInt $.record-length        is built(False);
    has Bool $.transaction-complete is built(False);
    has Bool $.db-encrypted         is built(False);
    has Bool $.prod-mdx             is built(False);
    has UInt $.lang-driver          is built(False);

    method TWEAK {
        my $data = $!fh.read: 32;
        self!read-metadata($data);
    }

    method !read-metadata($data) {
        my $version = $data[0];

        unless $version == 0x03 {
            die "Only dBase III without memo file supported";
        }
        $!version = 'dBase III without memo file';

        my $year  = $data[1];
        my $month = $data[2];
        my $day   = $data[3];
        $!last-updated = Date.new(
            :year($year + 1900),
            :$month,
            :$day,
            :formatter({ "%04d-%02d-%02d".sprintf(.year, .month, .day) })
        ).Str;

        $!record-count         = $data.read-uint32(4, LittleEndian);
        $!header-length        = $data.read-uint16(8, LittleEndian);
        $!record-length        = $data.read-uint16(10, LittleEndian);
        $!transaction-complete = $data[14] != 1;
        $!db-encrypted         = $data[15] == 1;
        $!prod-mdx             = $data[28] == 1;
        $!lang-driver          = $data[29];
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Field Descriptor Array
&lt;/h3&gt;

&lt;p&gt;Now we’ll need to figure out the field names and any associated information. Remember that quick math we did in the previous section? We’ll use it here. We’ll load the chunk of data that makes up the field descriptor array into memory all at once, so this might not be as memory efficient as reading one field descriptor at a time. However, we’re only loading &lt;code&gt;11 x 32 bytes&lt;/code&gt; into memory, so for simplicity’s sake we’ll do it this way.&lt;/p&gt;

&lt;p&gt;We start by writing down a few constants:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;constant $FIELD-TERMINATOR = 0x0D;
constant $FIELD-TERMINATOR-LENGTH = 1;
constant $METADATA-LENGTH = 32;
constant $FIELD-LENGTH = 32;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using the header’s length, we can determine how many of those bytes make up the field descriptor array:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;my $TOTAL-FIELD-BYTES = $!header-length - $METADATA-LENGTH - $FIELD-TERMINATOR-LENGTH;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here &lt;code&gt;$!header-length&lt;/code&gt; is an attribute that must be passed to initialize a &lt;code&gt;FieldDescriptorArray&lt;/code&gt; object. Same for &lt;code&gt;$!fh&lt;/code&gt; down below.&lt;/p&gt;

&lt;p&gt;Next we determine the fields count:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;my $FIELDS-COUNT = $TOTAL-FIELD-BYTES div $FIELD-LENGTH;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we can read the correct number of bytes from the file handle:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;my $buffer = $!fh.read: $TOTAL-FIELD-BYTES;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Keep in mind &lt;code&gt;read&lt;/code&gt; advances the file pointer. Assuming we have the correct number of fields, the next byte should be the field descriptor array terminator. We use this fact to determine if the program should fail fatally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;my $field-terminator = $!fh.read(1)[0];
unless $field-terminator == $FIELD-TERMINATOR {
    die "Wrong number of bytes for field descriptor array"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Lastly we only need to extract each individual field and its associated information from the field descriptor array buffer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;loop (my $i = 0; $i &amp;lt; $FIELDS-COUNT; $i++) {
    my Buf:D $field = $buffer.subbuf($FIELD-LENGTH * $i, $FIELD-LENGTH);
    my Str:D $name = $field.subbuf(0, 10).decode('ascii').subst(/\x[00]+/, '');
    my Str:D $type = $field.subbuf(11, 1).decode('ascii');
    my UInt:D $length = $field[16];
    my UInt:D $decimal-places = $field[17];
    @!fields.push: Field.new(:$name, :$type, :$length, :$decimal-places);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We extract &lt;code&gt;$FIELD-LENGTH&lt;/code&gt; bytes from the current position &lt;code&gt;$FIELD-LENGTH * $i&lt;/code&gt;, i.e., we extract 32 bytes from the current position. From these 32 bytes, we extract the first 10 bytes, which make up the field’s name, decode them as ASCII, and remove any null bytes since the name might be padded with null characters (&lt;code&gt;0x00&lt;/code&gt;). Byte 11 is a single character that denotes the field type. The field types in “dBase level 5” are &lt;code&gt;C&lt;/code&gt; for a string of characters, &lt;code&gt;D&lt;/code&gt; for a date, &lt;code&gt;F&lt;/code&gt; for a floating point, &lt;code&gt;L&lt;/code&gt; for a logical value, and &lt;code&gt;N&lt;/code&gt; for numeric. Bytes 12-15 (4 bytes) are reserved so we can skip them. Byte 16 is the field length in bytes, for which the maximum is 254 (&lt;code&gt;0xFE&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;Byte 17 is the number of decimal places. Bytes 18-19 (2 bytes) hold the work area ID, and we’ll skip them. We’ll skip bytes 20-31 as well: apparently byte 20 is the “Example” but it’s unclear to me what that refers to and I don’t think it’s that important anyway. Bytes 21-30 (10 bytes) are reserved so it’s clear why we’d like to skip them. As for byte 31, it’s the production MDX field flag, which we don’t need here, so we’re skipping it too.&lt;/p&gt;
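
&lt;p&gt;To try the extraction without a DBF file, here’s a sketch that builds a made-up 32-byte field descriptor by hand (name &lt;code&gt;NAME&lt;/code&gt;, type &lt;code&gt;C&lt;/code&gt;, length 40, no decimals) and pulls the same four pieces of information out of it. The descriptor’s contents are assumptions for illustration only.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# A hypothetical field descriptor, laid out byte by byte.
my $field = Buf.new(
    |('NAME'.encode('ascii').list), |(0 xx 7),   # bytes 0-10: zero-padded name
    'C'.ord,                                     # byte 11: field type
    |(0 xx 4),                                   # bytes 12-15: reserved
    40, 0,                                       # bytes 16-17: length, decimal places
    |(0 xx 14),                                  # bytes 18-31: skipped
);

say $field.elems;                                                   # 32
say $field.subbuf(0, 10).decode('ascii').subst(/\x[00]+/, '');      # NAME
say $field.subbuf(11, 1).decode('ascii');                           # C
say $field[16];                                                     # 40
say $field[17];                                                     # 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;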

&lt;p&gt;Putting everything we did in this section together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class Field {
    has Str $.name            is required('field must have a name');
    has Str $.type            is required('field must have a type');
    has Int $.length          is required('field must have a length');
    has Int $.decimal-places;
}

class FieldDescriptorArray {
    has $.fh            is required('cannot read data without file handle');
    has $.header-length is required('cannot read records without knowing where to start');

    has @.fields is built(False);

    method TWEAK {
        self!read-fields;
    }

    method !read-fields {
        constant $FIELD-TERMINATOR = 0x0D;
        constant $FIELD-TERMINATOR-LENGTH = 1;
        constant $METADATA-LENGTH = 32;
        constant $FIELD-LENGTH = 32;

        my $TOTAL-FIELD-BYTES = $!header-length - $METADATA-LENGTH - $FIELD-TERMINATOR-LENGTH;
        # Integer division keeps the count an Int, as in the walkthrough above.
        my $FIELDS-COUNT = $TOTAL-FIELD-BYTES div $FIELD-LENGTH;
        my $buffer = $!fh.read($TOTAL-FIELD-BYTES);

        my $field-terminator = $!fh.read(1);
        unless $field-terminator[0] == $FIELD-TERMINATOR {
            die 'Wrong number of bytes for fields'
        }

        loop (my $i = 0; $i &amp;lt; $FIELDS-COUNT; $i++) {
            my $field = $buffer.subbuf($FIELD-LENGTH * $i, $FIELD-LENGTH);
            my $name = $field.subbuf(0, 10).decode('ascii').subst(/\x[00]+/, '');
            my $type = $field.subbuf(11, 1).decode('ascii');
            my $length = $field[16];
            my $decimal-places = $field[17];
            @!fields.push: Field.new(:$name, :$type, :$length, :$decimal-places);
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Records
&lt;/h3&gt;

&lt;p&gt;Now that we’ve read the file header and the field descriptors from the DBF file, the final step involves reading the database records using the information we’ve gathered thus far, namely the field descriptors, the record count, and the record length.&lt;/p&gt;

&lt;p&gt;We start by declaring these two constants:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;constant $DELETION-FLAG = 0x2A;
constant $HEADER-LENGTH = 32;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;$DELETION-FLAG&lt;/code&gt; constant (&lt;code&gt;0x2A&lt;/code&gt;, an ASCII asterisk) indicates that a record has been marked as “deleted”. It’s up to the reader to decide how to handle deleted records; in our case, we simply set the &lt;code&gt;deleted&lt;/code&gt; key in each record to true or false to reflect its deletion status.&lt;/p&gt;
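
&lt;p&gt;To make the check concrete, here’s a small sketch that uses two made-up records held in memory. In the dBASE format an active record starts with a space (&lt;code&gt;0x20&lt;/code&gt;) instead of the flag; the record contents themselves are hypothetical.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;constant $DELETION-FLAG = 0x2A;

# Two hypothetical records: the first byte of each is the deletion flag.
my $active  = ' Alice'.encode('ascii');
my $deleted = '*Bob  '.encode('ascii');

say $active[0]  == $DELETION-FLAG;    # False
say $deleted[0] == $DELETION-FLAG;    # True
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;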

&lt;p&gt;We’re using the record count and the record length to figure out how many times to loop and how many bytes to read from the file handle on each iteration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;constant $DELETION-FLAG = 0x2A;
constant $HEADER-LENGTH = 32;
loop (my $i = 0; $i &amp;lt; $!record-count; $i++) {
    # ...
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The contents of a DBF vary from file to file, so we cannot define a class that represents a record ahead of time. For this reason, we use a hash to store a record’s contents:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;constant $DELETION-FLAG = 0x2A;
constant $HEADER-LENGTH = 32;
loop (my $i = 0; $i &amp;lt; $!record-count; $i++) {
    my %record;
    my $buffer = $!fh.read($!record-length);
    %record{'deleted'} = $buffer[0] == $DELETION-FLAG;

    my $record-offset = 1;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here we read &lt;code&gt;$!record-length&lt;/code&gt; bytes from the file handle, and set &lt;code&gt;%record{'deleted'}&lt;/code&gt; to the record’s deletion status. Because the first byte in each record holds its deletion status, the field data starts at byte 1, so we initialize &lt;code&gt;$record-offset&lt;/code&gt; to 1 and use it to track where the next field’s data begins within the record.&lt;/p&gt;

&lt;p&gt;Next we loop over each field in order to determine how many bytes each field occupies within &lt;code&gt;$buffer&lt;/code&gt;, as well as the field’s type:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;constant $DELETION-FLAG = 0x2A;
constant $HEADER-LENGTH = 32;
loop (my $i = 0; $i &amp;lt; $!record-count; $i++) {
    my %record;
    my $buffer = $!fh.read($!record-length);
    %record{'deleted'} = $buffer[0] == $DELETION-FLAG;

    my $record-offset = 1;
    for $!fields.fields -&amp;gt; $field {
        my $buf = $buffer.subbuf($record-offset, $field.length);
        my $value = do given $field.type {
            when 'C' { $buf.decode('utf8-c8').trim }
            when 'N' { $buf.decode('ascii').Num }
            when 'L' {
                my $flag = $buf.decode('ascii').trim;
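                # Compare against each letter: 'YyTt'.contains('') is True, so a blank flag would be misread.
                $flag eq any('Y', 'y', 'T', 't') ?? True !! $flag eq any('N', 'n', 'F', 'f') ?? False !! Bool;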
            }
            when 'D' {
                my $date = $buf.decode('ascii');
                my ($year, $month, $day) = .substr(0, 4), .substr(4, 2), .substr(6, 2) given $date;
                Date.new: :$year, :$month, :$day;
            }
            when 'F' { $buf.decode('ascii').Num }
        }
        %record{$field.name} = $value;
        $record-offset += $field.length;
    }
    @!records.push: %record;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Depending on the field’s type, we do some processing. For this specific dBASE version, we only deal with string (&lt;code&gt;C&lt;/code&gt;), numeric (&lt;code&gt;N&lt;/code&gt;), logical (&lt;code&gt;L&lt;/code&gt;), date (&lt;code&gt;D&lt;/code&gt;), and float (&lt;code&gt;F&lt;/code&gt;). In every case, we decode the buffer and then do some manipulation where necessary. For example, dates are stored as the string &lt;code&gt;YYYYMMDD&lt;/code&gt;, so we extract the date parts and create a &lt;code&gt;Date&lt;/code&gt; object from them. For each field, we map the field’s name to its value and advance the &lt;code&gt;$record-offset&lt;/code&gt; mentioned above by the field’s length.&lt;/p&gt;

&lt;p&gt;Finally, once we’re done with a record, we add it to the list of records with &lt;code&gt;@!records.push: %record;&lt;/code&gt;.&lt;/p&gt;
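
&lt;p&gt;The offset bookkeeping is easier to see on a tiny in-memory record. The following is only a sketch, not the code from the module: the field layout (&lt;code&gt;NAME&lt;/code&gt;, &lt;code&gt;AGE&lt;/code&gt;, &lt;code&gt;BORN&lt;/code&gt;) and the record bytes are invented for illustration, and plain hashes stand in for &lt;code&gt;Field&lt;/code&gt; objects.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Hypothetical field layout; plain hashes stand in for Field objects.
my @fields =
    %(:name('NAME'), :type('C'), :length(6)),
    %(:name('AGE'),  :type('N'), :length(3)),
    %(:name('BORN'), :type('D'), :length(8));

# One made-up record: a deletion flag byte followed by 6 + 3 + 8 bytes of data.
my $buffer = (' ', 'Alice ', ' 42', '19810305').join.encode('ascii');

my %record;
my $record-offset = 1;    # byte 0 is the deletion flag
for @fields -&amp;gt; $field {
    my $text = $buffer.subbuf($record-offset, $field{'length'}).decode('ascii');
    %record{$field{'name'}} = do given $field{'type'} {
        when 'C' { $text.trim }
        when 'N' { $text.Num }
        when 'D' {
            Date.new: :year($text.substr(0, 4).Int),
                      :month($text.substr(4, 2).Int),
                      :day($text.substr(6, 2).Int);
        }
    };
    # Move past this field so the next subbuf starts at the right byte.
    $record-offset += $field{'length'};
}

# Print the decoded values in a stable (sorted) key order.
for %record.keys.sort -&amp;gt; $key {
    say $key, ': ', %record{$key};
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;After the loop, &lt;code&gt;$record-offset&lt;/code&gt; equals the record’s total length (18 bytes here), which is a handy check that the field lengths add up.&lt;/p&gt;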

&lt;p&gt;Putting everything together, we end up with the &lt;code&gt;RecordsDB&lt;/code&gt; class:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class RecordsDB {
    has IO::Handle:D                $.fh            is required('cannot read data without file handle');
    has UInt:D                      $.record-count  is required('must know number of records');
    has UInt:D                      $.record-length is required("must know each record's length");
    has FieldDescriptorArray:D      $.fields        is required('must have fields');

    has @.records is built(False);

    submethod TWEAK {
        self!read-records;
    }

    method !read-records {
        constant $DELETION-FLAG = 0x2A;
        constant $HEADER-LENGTH = 32;
        loop (my $i = 0; $i &amp;lt; $!record-count; $i++) {
            my %record;

            my $buffer = $!fh.read($!record-length);
            %record{'deleted'} = $buffer[0] == $DELETION-FLAG;

            my $record-offset = 1;
            for $!fields.fields -&amp;gt; $field {
                my $buf = $buffer.subbuf($record-offset, $field.length);
                my $value = do given $field.type {
                    when 'C' { $buf.decode('utf8-c8').trim }
                    when 'N' { $buf.decode('ascii').Num }
                    when 'L' {
                        my $flag = $buf.decode('ascii').trim;
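                        # Compare against each letter: 'YyTt'.contains('') is True, so a blank flag would be misread.
                        $flag eq any('Y', 'y', 'T', 't') ?? True !! $flag eq any('N', 'n', 'F', 'f') ?? False !! Bool;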
                    }
                    when 'D' {
                        my $date = $buf.decode('ascii');
                        my ($year, $month, $day) = .substr(0, 4), .substr(4, 2), .substr(6, 2) given $date;
                        Date.new: :$year, :$month, :$day;
                    }
                    when 'F' { $buf.decode('ascii').Num }
                }
                %record{$field.name} = $value;
                $record-offset += $field.length;
            }
            @!records.push: %record;
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
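
&lt;p&gt;Since each record is a plain hash with a &lt;code&gt;deleted&lt;/code&gt; key, filtering out deleted rows afterwards is straightforward. Here’s a minimal sketch, assuming a hypothetical &lt;code&gt;@records&lt;/code&gt; list shaped like the one &lt;code&gt;RecordsDB&lt;/code&gt; builds (the rows below are made up):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Made-up rows in the same shape as the hashes pushed onto @!records.
my @records =
    %(:deleted(False), :NAME('Alice'), :AGE(42)),
    %(:deleted(True),  :NAME('Bob'),   :AGE(37));

# Keep only the records that are not marked as deleted.
my @active = @records.grep({ not $_{'deleted'} });
say @active.elems;         # 1
say @active[0]{'NAME'};    # Alice
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;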



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this article, I’ve explained the difference between binary and textual data, along with their pros and cons for storing information. I also showed how to read and decode binary data, specifically the dBASE III binary format, using Raku.&lt;/p&gt;

&lt;p&gt;The source code for this article can be found in this repo: &lt;a href="https://github.com/uzluisf/raku-dbf-reader-art"&gt;https://github.com/uzluisf/raku-dbf-reader-art&lt;/a&gt;. Note that there might be slight variations from the code snippets shown here, mainly because I organized the code into a Raku module in the repo, but that shouldn’t make any difference.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;About text as the universal interface: &lt;a href="https://news.ycombinator.com/item?id=8437038"&gt;https://news.ycombinator.com/item?id=8437038&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Working with binary files: &lt;a href="https://www.visuality.pl/posts/cs-lessons-001-working-with-binary-files"&gt;https://www.visuality.pl/posts/cs-lessons-001-working-with-binary-files&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>raku</category>
      <category>dbf</category>
      <category>binary</category>
    </item>
  </channel>
</rss>
