Marcio Endo

Posted on Jul 17, 2022 • Originally published at objectos.com.br

A JDK 17+ alternative to using binary files in your Java tests

#java #testing #git

What do you do when you require binary data to write a test?

Say you are developing a pure Java Git implementation. You will need the bits that make up a Git repository: blob, tree or commit objects. Or perhaps you want to understand what's in a Java class file. In this case you will need the data in a Java class.

One solution is to use regular files. If you are using Maven you put them in your src/test/resources directory. You can then access their contents using, for example, the Class.getResourceAsStream method.
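For comparison, the file-based approach can be sketched as below. The TestResources class and its readAllBytes helper are hypothetical names used only for illustration; the sketch assumes the binary file is on the test classpath.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of the original examples.
final class TestResources {
  private TestResources() {}

  // Reads a classpath resource fully into memory.
  // A leading '/' in the name means "from the root of the classpath".
  static byte[] readAllBytes(String resourceName) throws IOException {
    try (InputStream in = TestResources.class.getResourceAsStream(resourceName)) {
      if (in == null) {
        // getResourceAsStream returns null when the resource does not exist
        throw new FileNotFoundException(resourceName);
      }

      return in.readAllBytes();
    }
  }
}
```

With Maven, a file stored at src/test/resources/blob.bin (a made-up name) would be read with TestResources.readAllBytes("/blob.bin").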

The files solution works fine. However, there is an alternative if:

you are using JDK 17 or later; and
your test data is relatively small.

It involves using two features:

the java.util.HexFormat class introduced in JDK 17; and
Text Blocks, delivered as a permanent feature in JDK 15.

In this blog post I will show how we can use them together as a source of binary data.

Using `xxd` to get a textual representation of the binary data

Suppose we are writing a Git blob (loose) object reader in Java. Put simply, a Git blob object stores the contents of a single file under Git. Further details on Git internals are beyond the scope of this post. If you want to learn more you can refer to the Pro Git book chapter on Git internals.

To test our reader we will need data. Let's quickly create a test repository:

$ git init test
$ cd test/
$ cat > README.md <<EOF
# Our test project

This will be our test blob.
Let's see if we can read it from our test.
EOF
$ git add README.md
$ git commit -m "Add README.md file"
$ find .git/objects/ -type f
.git/objects/fe/210da9f7dc83fefa49ef54ba73f74e55e453e6
.git/objects/75/a8b365ca1a5e731f49d3624960b314d0480ca3
.git/objects/18/acf6a96e6e43829c703ec8a8b6092b98829422
$ git cat-file -p 75a8b365ca1a5e731f49d3624960b314d0480ca3
# Our test project

This will be our test blob.
Let's see if we can read it from our test.

So the Git computed hash of the contents of our README file is:

75a8b365ca1a5e731f49d3624960b314d0480ca3

Let's use the xxd tool to obtain a hex dump of it:

$ xxd -plain .git/objects/75/a8b365ca1a5e731f49d3624960b314d0480ca3
78013dcb310a80300c4661e79ee20707b782a377105cbc40532356aa9126
d2ebeba2f3fb1e65210c7dd362ba0b8cd57015d9399a73f3961435e50c62
c897e93dbc1bd93a853223ada88c184e140e0b92612d72fcdebb07525820
c6

Great! We now have a textual representation of our binary data.
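If xxd is not available, a similar dump can be produced with Java itself. The following is a minimal sketch, assuming 30 bytes per line to mirror the xxd output above; the HexDump class name is made up for this example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HexFormat;

// Hypothetical helper that mimics the layout of `xxd -plain`.
final class HexDump {
  // the dump above has 60 hex digits, that is, 30 bytes per line
  private static final int BYTES_PER_LINE = 30;

  private HexDump() {}

  static String plain(byte[] bytes) {
    HexFormat hex = HexFormat.of();
    StringBuilder sb = new StringBuilder();

    for (int from = 0; from < bytes.length; from += BYTES_PER_LINE) {
      int to = Math.min(from + BYTES_PER_LINE, bytes.length);

      // formatHex(bytes, from, to) formats the range [from, to)
      sb.append(hex.formatHex(bytes, from, to)).append('\n');
    }

    return sb.toString();
  }

  // usage: java HexDump.java <path-to-file>
  public static void main(String[] args) throws IOException {
    System.out.print(plain(Files.readAllBytes(Path.of(args[0]))));
  }
}
```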

Using a text block to store our hex dump

A text block is a java.lang.String literal suited for multi-line strings. Even though the output of the xxd tool is multi-line, the line terminators are not part of the actual data. Therefore, before using it, we must strip the string of those characters. To do it, I can think of two options:

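before consuming the string, call replace("\n", "") on it (the compiler normalizes the line terminators of a text block to \n, so System.lineSeparator() would not match on Windows); or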
use the \<line-terminator> escape sequence.

Both options are fine. In this blog post we will use the latter:

public class BlobReaderTest {
  private static final String README = """
      78013dcb310a80300c4661e79ee20707b782a377105cbc40532356aa9126\
      d2ebeba2f3fb1e65210c7dd362ba0b8cd57015d9399a73f3961435e50c62\
      c897e93dbc1bd93a853223ada88c184e140e0b92612d72fcdebb07525820\
      c6\
      """;
}

Notice that each "line" of the text block ends with a \ (backslash) character. It tells the Java compiler to suppress the line terminator from the resulting string value. For more information you can refer to the Programmer's Guide to Text Blocks.
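To see both options side by side, here is a small sketch using a shorter, made-up hex string. The TextBlockJoin class and its constants are illustrative names only.

```java
// Hypothetical class comparing the two ways of dropping line terminators.
final class TextBlockJoin {
  private TextBlockJoin() {}

  // Option 1: keep the line terminators in the literal and strip them later.
  // Text blocks always use \n as line terminator, whatever the platform.
  static final String STRIPPED = """
      cafe
      babe
      """.replace("\n", "");

  // Option 2: escape each line terminator so it never reaches the string.
  static final String ESCAPED = """
      cafe\
      babe\
      """;
}
```

Both constants hold the same eight characters. The second one does not need a method call when the class is initialized, which is why it reads a bit cleaner in a test.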

Nice. We now have the blob data available in our Java source code.

Using `java.util.HexFormat` to obtain our `byte` array

The Javadoc for the java.util.HexFormat class states:

HexFormat converts between bytes and chars and hex-encoded strings which may include additional formatting markup such as prefixes, suffixes, and delimiters.

In our case we want to convert from a hex-encoded string to an array of bytes. Converting the output provided by the xxd tool using the HexFormat class is straightforward:

@Test
public void readme() {
  var hexFormat = HexFormat.of();

  byte[] bytes = hexFormat.parseHex(README);

  // consume bytes
}

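We first obtained an instance of the HexFormat class. We used the of() factory, which returns a formatter with no delimiter, prefix or suffix. This matches our xxd output once the line terminators are out of the way.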

Next, we invoked the parseHex method with the README string of the previous section. It returns the blob data as a byte[].
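The behavior of parseHex can be explored in isolation. The following sketch uses a made-up HexRoundTrip class; it shows that parsing accepts both upper and lower case digits, that formatting produces lower case, and that a stray line terminator is rejected.

```java
import java.util.HexFormat;

// Hypothetical class, for illustration only.
final class HexRoundTrip {
  private HexRoundTrip() {}

  // Throws IllegalArgumentException if the string has an odd length
  // or contains anything that is not a hexadecimal digit.
  static byte[] parse(String hex) {
    return HexFormat.of().parseHex(hex);
  }

  // Two lower case hex digits per byte, no delimiter.
  static String format(byte[] bytes) {
    return HexFormat.of().formatHex(bytes);
  }
}
```

The rejection of line terminators is the reason we took care of them in the text block of the previous section.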

Great. We are now ready to consume our data and test the blob reader.

Consuming the binary data

How we consume our data depends on the API we are testing. Suppose our BlobReader provides a read method that takes a java.io.InputStream like so:

Blob readInputStream(InputStream inputStream) throws IOException;

In this case we need to wrap our byte array in a ByteArrayInputStream. The full version of the test is listed below:

@Test
public void readme() throws IOException {
  var hexFormat = HexFormat.of();

  var bytes = hexFormat.parseHex(README);

  try (var inputStream = new ByteArrayInputStream(bytes)) {
    var reader = new BlobReader();

    var blob = reader.readInputStream(inputStream);

    assertEquals(
      blob.text(),

      """
      # Our test project

      This will be our test blob.
      Let's see if we can read it from our test.
      """
    );
  }
}

ByteArrayInputStream is an in-memory InputStream. By this I mean that it does not do any actual I/O. In other words, neither its read nor its close method will complete abruptly with an IOException. Regardless, we use a try-with-resources statement.

Next, we create our BlobReader instance and invoke it with our InputStream.

Finally, we verify that the blob contents match the expected value.

Writing the data to a temporary file

At times you are not in control of the API you are testing or using in your tests. Suppose our blob reader does not provide a method that takes an InputStream. Instead it takes a file. And a java.io.File, no less:

Blob readFile(File file) throws IOException;

We have to write our data to a temporary file prior to invoking the method we are testing. The full version of the test is listed below:

@Test
public void readmeWithFile() throws IOException {
  var hexFormat = HexFormat.of();

  var bytes = hexFormat.parseHex(README);

  var file = File.createTempFile("blob-", ".tmp");

  file.deleteOnExit();

  try (var out = new FileOutputStream(file)) {
    out.write(bytes);
  }

  var reader = new BlobReader();

  var blob = reader.readFile(file);

  assertEquals(
    blob.text(),

    """
    # Our test project

    This will be our test blob.
    Let's see if we can read it from our test.
    """
  );
}

We create a temporary file using the File.createTempFile static method. We immediately call the deleteOnExit method: we want the file to be deleted once we are done testing, that is, when the JVM terminates.

Next, we write our bytes to the file via a FileOutputStream.

Finally, we read the file with our BlobReader and verify that the returned blob has the expected contents.
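As a variation, the temporary file can also be written with the java.nio.file API and then converted to a java.io.File. This is only a sketch; the TempFiles class and its write method are names made up for this example.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, for illustration only.
final class TempFiles {
  private TempFiles() {}

  // Writes the bytes to a new temporary file and returns it as a java.io.File.
  static File write(byte[] bytes) throws IOException {
    Path path = Files.createTempFile("blob-", ".tmp");

    // Files.write opens, writes and closes the file in a single call
    Files.write(path, bytes);

    File file = path.toFile();

    // the file is removed when the JVM terminates
    file.deleteOnExit();

    return file;
  }
}
```

In the test above, the four statements that create and fill the file would become a single call: var file = TempFiles.write(bytes).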

Manually editing our data

Our test data is in Java source code. So, if required, we can manually edit the data. Of course, you can also edit binary files. But I find that text files are easier to edit; it is possible to do it directly in the Java editor.

Let's put this into practice. We will modify our blob hex dump so that we edit the README contents.

In Git, loose objects are compressed with zlib, which uses the DEFLATE algorithm. So we can get the uncompressed hex dump like so:

$ zlib-flate -uncompress \
    < .git/objects/75/a8b365ca1a5e731f49d3624960b314d0480ca3 \
    | xxd -plain
626c6f622039310023204f757220746573742070726f6a6563740a0a5468
69732077696c6c206265206f7572207465737420626c6f622e0a4c657427
73207365652069662077652063616e20726561642069742066726f6d206f
757220746573742e0a
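If the zlib-flate tool is not at hand, the same step can be sketched in Java with the java.util.zip classes. The Zlib class below is a made-up helper, not something taken from the examples repository.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical helper, for illustration only.
final class Zlib {
  private Zlib() {}

  // Uncompresses zlib data, such as the bytes of a Git loose object file.
  static byte[] inflate(byte[] compressed) throws IOException {
    try (var in = new InflaterInputStream(new ByteArrayInputStream(compressed))) {
      return in.readAllBytes();
    }
  }

  // Compresses the bytes using the zlib format (the Deflater default).
  static byte[] deflate(byte[] raw) throws IOException {
    var out = new ByteArrayOutputStream();

    try (var deflater = new DeflaterOutputStream(out)) {
      deflater.write(raw);
    }

    return out.toByteArray();
  }
}
```

Passing the result of inflate to HexFormat.of().formatHex would then give the hex digits in a single line.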

The following listing is an interpretation of the uncompressed data. To understand it, you should know this:

every two characters represent a single byte
Git blob (loose) objects have the following format: blob {size in ascii}\0{contents}
it helps to have an ASCII table at hand

626c6f62 -- 'blob' in ASCII/UTF-8
20       -- SPACE
3931     -- object size in ASCII/UTF-8. size=91 bytes
00       -- NUL
23       -- first char of the contents: c='#'
-- rest of the contents
204f757220746573742070726f6a6563740a0a5468
69732077696c6c206265206f7572207465737420626c6f622e0a4c657427
73207365652069662077652063616e20726561642069742066726f6d206f
757220746573742e0a
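The layout described above can also be expressed as code. The following is a minimal sketch of a parser for the uncompressed data; the LooseBlob class is hypothetical and it is not the BlobReader used in the tests of this post.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical parser for: blob {size in ascii}\0{contents}
final class LooseBlob {
  private LooseBlob() {}

  // Returns the contents of an uncompressed blob (loose) object.
  static byte[] contents(byte[] uncompressed) {
    // the header ends at the first NUL byte
    int nul = 0;
    while (nul < uncompressed.length && uncompressed[nul] != 0) {
      nul++;
    }
    if (nul == uncompressed.length) {
      throw new IllegalArgumentException("missing NUL after the header");
    }

    String header = new String(uncompressed, 0, nul, StandardCharsets.US_ASCII);
    if (!header.startsWith("blob ")) {
      throw new IllegalArgumentException("not a blob: " + header);
    }

    // the size is written as decimal digits in ASCII
    int size = Integer.parseInt(header.substring("blob ".length()));

    byte[] contents = Arrays.copyOfRange(uncompressed, nul + 1, uncompressed.length);
    if (contents.length != size) {
      throw new IllegalArgumentException(
          "header says " + size + " bytes but found " + contents.length);
    }

    return contents;
  }
}
```

Note the size check at the end: when we edit the hex dump by hand, the number of content bytes must still agree with the size stored in the header.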

Let's change the first character of our README from '#' to '='. The equals sign character has the hex code 0x3d. The following test passes:

public class UncompressedTest {
  private static final String README = """
      626c6f6220393100\
      3d\
      204f757220746573742070726f6a6563740a0a5468\
      69732077696c6c206265206f7572207465737420626c6f622e0a4c657427\
      73207365652069662077652063616e20726561642069742066726f6d206f\
      757220746573742e0a\
      """;

  @Test
  public void readme() throws IOException {
    var out = new ByteArrayOutputStream();

    try (var outputStream = new DeflaterOutputStream(out)) {
      var hexFormat = HexFormat.of();

      outputStream.write(hexFormat.parseHex(README));
    }

    var bytes = out.toByteArray();

    try (var inputStream = new ByteArrayInputStream(bytes)) {
      var reader = new BlobReader();

      var blob = reader.readInputStream(inputStream);

      assertEquals(
        blob.text(),

        """
        = Our test project

        This will be our test blob.
        Let's see if we can read it from our test.
        """
      );
    }
  }
}

We have successfully edited the blob data.

A variation using `java.util.Base64`

For the example in this blog post, using java.util.Base64 would be mostly the same. In fact, it has a few advantages:

java.util.Base64 has been available since JDK 8
there is no need to escape the line terminator in the string literal

The following is a snippet of our running example using Base64:

private static final String README = """
    eAE9yzEKgDAMRmHnnuIHB7eCo3cQXLxAUyNWqpEm0uvrovP7HmUhDH3TYroLjNVwFdk5mnPzlhQ1
    5QxiyJfpPbwb2TqFMiOtqIwYThQOC5JhLXL83rsHUlggxg==
    """;

@Test
public void readme() throws IOException {
  var decoder = Base64.getMimeDecoder();

  var bytes = decoder.decode(README);

  // consume the bytes
}
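The choice of the MIME decoder matters here. The sketch below, built around a made-up Base64Lines class and a shorter string, contrasts it with the basic decoder when the text block keeps its line terminators.

```java
import java.util.Base64;

// Hypothetical class, for illustration only.
final class Base64Lines {
  private Base64Lines() {}

  // "blob 9" encoded with Base64 and split in two lines
  static final String DATA = """
      Ymxv
      YiA5
      """;

  // The MIME decoder ignores characters outside of the Base64 alphabet,
  // which includes the line terminators of the text block.
  static byte[] decodeMime() {
    return Base64.getMimeDecoder().decode(DATA);
  }

  // The basic decoder rejects the line terminators
  // with an IllegalArgumentException.
  static byte[] decodeBasic() {
    return Base64.getDecoder().decode(DATA);
  }
}
```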

The Base64 variant has a (possible) drawback though: it makes it harder to manually edit the data.

Doing something similar to the previous section using Base64 would not be as simple. The uncompressed data encoded with Base64 is the following:

$ zlib-flate -uncompress \
    < .git/objects/75/a8b365ca1a5e731f49d3624960b314d0480ca3 \
    | base64
YmxvYiA5MQAjIE91ciB0ZXN0IHByb2plY3QKClRoaXMgd2lsbCBiZSBvdXIgdGVzdCBibG9iLgpM
ZXQncyBzZWUgaWYgd2UgY2FuIHJlYWQgaXQgZnJvbSBvdXIgdGVzdC4K

Every character represents 6 bits of information, and those 6 bits do not line up with byte boundaries. So editing a single character of the Base64 data can change up to two bytes of our blob.
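The effect can be checked with a few lines of code. In the sketch below, Base64Edit is a made-up class that counts how many decoded bytes differ between two Base64 strings of the same length.

```java
import java.util.Base64;

// Hypothetical class, for illustration only.
final class Base64Edit {
  private Base64Edit() {}

  // Counts the decoded bytes that differ between the two Base64 strings.
  static int changedBytes(String before, String after) {
    byte[] a = Base64.getDecoder().decode(before);
    byte[] b = Base64.getDecoder().decode(after);

    if (a.length != b.length) {
      throw new IllegalArgumentException("decoded lengths differ");
    }

    int count = 0;
    for (int i = 0; i < a.length; i++) {
      if (a[i] != b[i]) {
        count++;
      }
    }

    return count;
  }
}
```

For example, "Ymxv" are the first four characters of the dump above and decode to the bytes of "blo". Replacing the 'm' with an 'X' gives "YXxv", and changedBytes("Ymxv", "YXxv") reports two changed bytes. Some edits are kinder: turning '#' into '=' only requires changing "MQAj" into "MQA9", which touches a single byte. The point is that we have to work at the bit level to know which is which.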

Conclusion

In this blog post we saw a way to store binary data in Java source code using text blocks. We used the java.util.HexFormat class to convert the string to an array of bytes.

We focused on using this data for testing. But, if needed, it is also possible to use this technique in production code.

Storing the data in text format makes it easier to edit it. This assumes the data:

has a defined format; and
its binary format allows for manipulation with some ease.

Additionally, since the data is in Java source code, edits can be visualized in Git diffs.

As mentioned, this technique is better suited for data that is relatively small.

You can find the source code for all of the examples in this GitHub repository. It includes the source code of the BlobReader.

Originally published at the Objectos Software Blog on July 11th, 2022.

Follow me on twitter.

Top comments (2)

Shai Almog • Jul 18 '22

Interesting take!

I usually go for the Base64 approach when faced with that. If editing I just use one of the free Base64 online converters. I wasn't familiar with HexFormat but we didn't really migrate to 17 yet...

Marcio Endo • Jul 18 '22

Hi there!

I was using the Base64 approach as well. It was not until recently I wanted an easier way to edit the data. By the way, at the time I did not think about the Base64 online converters. That's a great idea, thanks!

As for HexFormat I am not sure anymore where I learned about it. But most likely a blog post or perhaps Twitter.