What do you do when you require binary data to write a test?
Say you are developing a pure Java Git implementation. You will need the bits that make up a Git repository: blob, tree or commit objects. Or perhaps you want to understand what's in a Java class file. In this case you will need the data in a Java class.
One solution is to use regular files. If you are using Maven, you put them in your src/test/resources directory. You can then access their contents using, for example, the Class.getResourceAsStream method.
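As a rough sketch of the files approach, the helper below reads a classpath resource fully into a byte array. The class name TestResources and the resource name in the usage note are made up for this example; only Class.getResourceAsStream and InputStream.readAllBytes are actual JDK APIs.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical helper, not part of the original post.
final class TestResources {
  private TestResources() {}

  // Reads the named classpath resource fully into memory.
  static byte[] read(Class<?> anchor, String name) {
    try (InputStream in = anchor.getResourceAsStream(name)) {
      if (in == null) {
        // getResourceAsStream signals a missing resource by returning null
        throw new IllegalArgumentException("Resource not found: " + name);
      }
      return in.readAllBytes();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

In a test you would call something like TestResources.read(BlobReaderTest.class, "/readme-blob.bin"), where readme-blob.bin is an assumed file name under src/test/resources.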
The files solution works fine. However, there is an alternative if:
- you are using JDK 17 or later; and
- your test data is relatively small.
It involves using two features:
- the java.util.HexFormat class, introduced in JDK 17; and
- Text Blocks, delivered as a permanent feature in JDK 15.
In this blog post I will show how we can use them together as a source of binary data.
Using xxd to get a textual representation of the binary data
Suppose we are writing a Git blob (loose) object reader in Java. Put simply, a Git blob object stores the contents of a single file under Git. Further details on Git internals are beyond the scope of this post. If you want to learn more you can refer to the Pro Git book chapter on Git internals.
To test our reader we will need data. Let's quickly create a test repository:
$ git init test
$ cd test/
$ cat > README.md <<EOF
# Our test project

This will be our test blob.
Let's see if we can read it from our test.
EOF
$ git add README.md
$ git commit -m "Add README.md file"
$ find .git/objects/ -type f
.git/objects/fe/210da9f7dc83fefa49ef54ba73f74e55e453e6
.git/objects/75/a8b365ca1a5e731f49d3624960b314d0480ca3
.git/objects/18/acf6a96e6e43829c703ec8a8b6092b98829422
$ git cat-file -p 75a8b365ca1a5e731f49d3624960b314d0480ca3
# Our test project

This will be our test blob.
Let's see if we can read it from our test.
So the Git computed hash of the contents of our README file is:
75a8b365ca1a5e731f49d3624960b314d0480ca3
Let's use the xxd tool to obtain a hex dump of it:
$ xxd -plain .git/objects/75/a8b365ca1a5e731f49d3624960b314d0480ca3
78013dcb310a80300c4661e79ee20707b782a377105cbc40532356aa9126
d2ebeba2f3fb1e65210c7dd362ba0b8cd57015d9399a73f3961435e50c62
c897e93dbc1bd93a853223ada88c184e140e0b92612d72fcdebb07525820
c6
Great! We now have a textual representation of our binary data.
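If xxd is not available, a similar dump can probably be produced with HexFormat itself. The following is a minimal sketch, assuming we want the same layout as the output above (30 bytes, that is 60 hex digits, per line). The class name PlainHexDump is hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HexFormat;

// Hypothetical helper that mimics the layout of the plain xxd dump.
final class PlainHexDump {
  private static final int BYTES_PER_LINE = 30; // 60 hex digits per line

  static String dump(byte[] bytes) {
    HexFormat hex = HexFormat.of(); // lowercase digits, no delimiter
    StringBuilder sb = new StringBuilder();
    for (int from = 0; from < bytes.length; from += BYTES_PER_LINE) {
      int to = Math.min(from + BYTES_PER_LINE, bytes.length);
      sb.append(hex.formatHex(bytes, from, to)).append('\n');
    }
    return sb.toString();
  }

  public static void main(String[] args) throws IOException {
    // args[0] is the path of the file to dump
    System.out.print(dump(Files.readAllBytes(Path.of(args[0]))));
  }
}
```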
Using a text block to store our hex dump
A text block is a java.lang.String literal suited for multi-line strings. Even though the output of the xxd tool is multi-line, the line terminators are not part of the actual data. Therefore, before using it, we must strip the string of those characters. To do it, I can think of two options:
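- before consuming the string, call replace("\n", "") on it (the compiler normalizes the line terminators of a text block to \n on every platform, so System.lineSeparator() is not the right thing to match); or
- use the \<line-terminator> escape sequence.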
Both options are fine. In this blog post we will use the latter:
public class BlobReaderTest {
private static final String README = """
78013dcb310a80300c4661e79ee20707b782a377105cbc40532356aa9126\
d2ebeba2f3fb1e65210c7dd362ba0b8cd57015d9399a73f3961435e50c62\
c897e93dbc1bd93a853223ada88c184e140e0b92612d72fcdebb07525820\
c6\
""";
}
Notice that each "line" of the text block ends with a \ (backslash) character. It tells the Java compiler to suppress the line terminator from the resulting string value. For more information you can refer to the Programmer's Guide to Text Blocks.
Nice. We now have the blob data available in our Java source code.
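To convince ourselves that both options give the same string, here is a small sketch that declares the dump twice, once per option, and compares the results. The class and constant names are made up for this example. Note that line terminators inside a text block are always normalized to \n, so that is the character the first option has to remove.

```java
final class TextBlockJoin {
  // Option 1: keep the line terminators in the literal and strip them afterwards.
  static final String STRIPPED = """
      78013dcb310a80300c4661e79ee20707b782a377105cbc40532356aa9126
      d2ebeba2f3fb1e65210c7dd362ba0b8cd57015d9399a73f3961435e50c62
      c897e93dbc1bd93a853223ada88c184e140e0b92612d72fcdebb07525820
      c6
      """.replace("\n", "");

  // Option 2: suppress each line terminator with a trailing backslash.
  static final String ESCAPED = """
      78013dcb310a80300c4661e79ee20707b782a377105cbc40532356aa9126\
      d2ebeba2f3fb1e65210c7dd362ba0b8cd57015d9399a73f3961435e50c62\
      c897e93dbc1bd93a853223ada88c184e140e0b92612d72fcdebb07525820\
      c6\
      """;

  public static void main(String[] args) {
    System.out.println(STRIPPED.equals(ESCAPED));
    System.out.println(ESCAPED.length());
  }
}
```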
Using java.util.HexFormat to obtain our byte array
The Javadoc for the java.util.HexFormat class states:
HexFormat converts between bytes and chars and hex-encoded strings which may include additional formatting markup such as prefixes, suffixes, and delimiters.
In our case we want to convert from a hex-encoded string to an array of bytes. Converting the output provided by the xxd tool using the HexFormat class is straightforward:
@Test
public void readme() {
var hexFormat = HexFormat.of();
byte[] bytes = hexFormat.parseHex(README);
// consume bytes
}
We first obtained an instance of the HexFormat class. We used the of() factory, which is suited for our xxd output.
Next, we invoked the parseHex method with the README string of the previous section. It returns the blob data as a byte[].
Great. We are now ready to consume our data and test the blob reader.
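One detail worth keeping in mind: the format returned by HexFormat.of() has no delimiter, so parseHex expects nothing but pairs of hex digits. A leftover line terminator or an odd number of digits makes it throw an IllegalArgumentException. The sketch below illustrates this; the class name is hypothetical.

```java
import java.util.HexFormat;

final class ParseHexStrictness {
  // Returns true when HexFormat.of() accepts the string as plain hex digits.
  static boolean parses(String s) {
    try {
      HexFormat.of().parseHex(s);
      return true;
    } catch (IllegalArgumentException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println(parses("7801c6"));      // clean digits
    System.out.println(parses("7801\nc6\n"));  // line terminators left in
    System.out.println(parses("7801c"));       // odd number of digits
  }
}
```

This is why stripping the line terminators from the text block matters before calling parseHex.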
Consuming the binary data
How we consume our data depends on the API we are testing. Suppose our BlobReader provides a read method that takes a java.io.InputStream like so:
Blob readInputStream(InputStream inputStream) throws IOException;
In this case we need to wrap our byte array in a ByteArrayInputStream. The full version of the test is listed below:
@Test
public void readme() throws IOException {
var hexFormat = HexFormat.of();
var bytes = hexFormat.parseHex(README);
try (var inputStream = new ByteArrayInputStream(bytes)) {
var reader = new BlobReader();
var blob = reader.readInputStream(inputStream);
assertEquals(
blob.text(),
"""
# Our test project

This will be our test blob.
Let's see if we can read it from our test.
"""
);
}
}
ByteArrayInputStream is an in-memory InputStream. By this I mean that it does not do any actual I/O. In other words, neither its read nor its close method will return abruptly with an IOException. Regardless, we use a try-with-resources statement.
Next, we create our BlobReader instance and invoke it with our InputStream.
Finally, we verify that the blob contents match the expected value.
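The test above relies on a BlobReader and a Blob type that are not shown in this post. To make the example easier to follow, here is a hypothetical minimal version of both. It is only a sketch under the assumption that a loose blob is the zlib-compressed bytes of "blob <size>\0<contents>"; the actual implementation may well differ.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical stand-in for the real BlobReader; an assumption, not the author's code.
final class BlobReader {
  Blob readInputStream(InputStream inputStream) throws IOException {
    byte[] data;
    try (var inflated = new InflaterInputStream(inputStream)) {
      data = inflated.readAllBytes(); // inflate the whole object in memory
    }
    int nul = 0;
    while (nul < data.length && data[nul] != 0) {
      nul++; // the header ends at the first NUL byte
    }
    if (nul == data.length) {
      throw new IOException("Missing NUL byte after the header");
    }
    String header = new String(data, 0, nul, StandardCharsets.US_ASCII);
    int contentLength = data.length - nul - 1;
    // Checks the object type and the declared size in one go
    if (!header.equals("blob " + contentLength)) {
      throw new IOException("Unexpected header: " + header);
    }
    return new Blob(new String(data, nul + 1, contentLength, StandardCharsets.UTF_8));
  }

  public static void main(String[] args) throws IOException {
    var sink = new ByteArrayOutputStream();
    try (var out = new DeflaterOutputStream(sink)) {
      out.write("blob 6\0hello\n".getBytes(StandardCharsets.UTF_8));
    }
    var blob = new BlobReader().readInputStream(new ByteArrayInputStream(sink.toByteArray()));
    System.out.print(blob.text());
  }
}

// Hypothetical value type holding the decoded contents.
record Blob(String text) {}
```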
Writing the data to a temporary file
At times you are not in control of the API you are testing or using in your tests. Suppose our blob reader does not provide a method that takes an InputStream. Instead it takes a file. And a java.io.File, no less:
Blob readFile(File file) throws IOException;
We have to write our data to a temporary file prior to invoking the method we are testing. The full version of the test is listed below:
@Test
public void readmeWithFile() throws IOException {
var hexFormat = HexFormat.of();
var bytes = hexFormat.parseHex(README);
var file = File.createTempFile("blob-", ".tmp");
file.deleteOnExit();
try (var out = new FileOutputStream(file)) {
out.write(bytes);
}
var reader = new BlobReader();
var blob = reader.readFile(file);
assertEquals(
blob.text(),
"""
# Our test project

This will be our test blob.
Let's see if we can read it from our test.
"""
);
}
We create a temporary file using the File.createTempFile static method. We immediately call the deleteOnExit method: we want the file to be deleted once we are done testing, that is, when the JVM exits.
Next, we write our bytes to the file via a FileOutputStream.
Finally, we read the file with our BlobReader and verify that the returned blob has the expected contents.
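As a variation, the same setup can be written with the java.nio.file API and converted to a java.io.File at the very end. This is just a sketch; the class name TempFiles is made up for this example.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of the original post.
final class TempFiles {
  private TempFiles() {}

  // Writes the bytes to a fresh temporary file and returns it as a java.io.File.
  static File write(byte[] bytes) throws IOException {
    Path path = Files.createTempFile("blob-", ".tmp");
    Files.write(path, bytes); // creates or truncates, then writes all bytes
    File file = path.toFile();
    file.deleteOnExit(); // removed when the JVM terminates normally
    return file;
  }

  public static void main(String[] args) throws IOException {
    File file = write(new byte[] {0x78, 0x01});
    System.out.println(file.length());
  }
}
```

Whether this reads better than the FileOutputStream version is mostly a matter of taste; it saves the explicit try-with-resources block.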
Manually editing our data
Our test data is in Java source code. So, if required, we can manually edit the data. Of course, you can also edit binary files. But I find that text files are easier to edit; it is possible to do it directly in the Java editor.
Let's put this into practice. We will modify the hex dump of our blob in order to change the README contents.
In Git, loose objects are compressed using DEFLATE. So we can get the uncompressed hex dump like so:
$ zlib-flate -uncompress \
< .git/objects/75/a8b365ca1a5e731f49d3624960b314d0480ca3 \
| xxd -plain
626c6f622039310023204f757220746573742070726f6a6563740a0a5468
69732077696c6c206265206f7572207465737420626c6f622e0a4c657427
73207365652069662077652063616e20726561642069742066726f6d206f
757220746573742e0a
The following listing is an interpretation of the uncompressed data. To understand it, you should know this:
- every two characters represent a single byte
- Git blob (loose) objects have the following format:
blob {size in ascii}\0{contents}
- it helps to have an ASCII table at hand
626c6f62 -- 'blob' in ASCII/UTF-8
20 -- SPACE
3931 -- object size in ASCII/UTF-8. size=91 bytes
00 -- NUL
23 -- first char of the contents: c='#'
-- rest of the contents
204f757220746573742070726f6a6563740a0a5468
69732077696c6c206265206f7572207465737420626c6f622e0a4c657427
73207365652069662077652063616e20726561642069742066726f6d206f
757220746573742e0a
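The listing above can also be checked mechanically. The sketch below decodes the uncompressed dump and splits it into the three fields of the format: type, size and contents. The class, record and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.HexFormat;

// Hypothetical helper for inspecting the uncompressed dump.
final class LooseObjectFields {
  static final String UNCOMPRESSED = """
      626c6f622039310023204f757220746573742070726f6a6563740a0a5468\
      69732077696c6c206265206f7572207465737420626c6f622e0a4c657427\
      73207365652069662077652063616e20726561642069742066726f6d206f\
      757220746573742e0a\
      """;

  record Fields(String type, int size, String contents) {}

  static Fields split(byte[] data) {
    int nul = 0;
    while (nul < data.length && data[nul] != 0) {
      nul++; // the header ends at the first NUL byte
    }
    if (nul == data.length) {
      throw new IllegalArgumentException("No NUL byte found");
    }
    String header = new String(data, 0, nul, StandardCharsets.US_ASCII);
    int space = header.indexOf(' ');
    if (space < 0) {
      throw new IllegalArgumentException("No space in header: " + header);
    }
    String type = header.substring(0, space);
    int size = Integer.parseInt(header.substring(space + 1));
    String contents = new String(data, nul + 1, data.length - nul - 1, StandardCharsets.UTF_8);
    return new Fields(type, size, contents);
  }

  public static void main(String[] args) {
    Fields f = split(HexFormat.of().parseHex(UNCOMPRESSED));
    System.out.println(f.type() + " " + f.size());
    System.out.print(f.contents());
  }
}
```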
Let's change the first character of our README from '#' to '='. The equals sign character has the hex code 0x3d. The following test passes:
public class UncompressedTest {
private static final String README = """
626c6f6220393100\
3d\
204f757220746573742070726f6a6563740a0a5468\
69732077696c6c206265206f7572207465737420626c6f622e0a4c657427\
73207365652069662077652063616e20726561642069742066726f6d206f\
757220746573742e0a\
""";
@Test
public void readme() throws IOException {
var out = new ByteArrayOutputStream();
try (var outputStream = new DeflaterOutputStream(out)) {
var hexFormat = HexFormat.of();
outputStream.write(hexFormat.parseHex(README));
}
var bytes = out.toByteArray();
try (var inputStream = new ByteArrayInputStream(bytes)) {
var reader = new BlobReader();
var blob = reader.readInputStream(inputStream);
assertEquals(
blob.text(),
"""
= Our test project

This will be our test blob.
Let's see if we can read it from our test.
"""
);
}
}
}
We have successfully edited the blob data.
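The edit above swaps one byte for another, so the size stored in the header stays at 91. An edit that adds or removes bytes also has to update that size, otherwise a reader that validates the header would likely reject the object. As a sketch (the class and method names are made up for this example), we could let Java compute the header and produce the hex for us:

```java
import java.nio.charset.StandardCharsets;
import java.util.HexFormat;

// Hypothetical helper, not part of the original post.
final class LooseBlobHex {
  private LooseBlobHex() {}

  // Returns the hex of the uncompressed loose object: "blob <size>\0<contents>".
  static String of(String contents) {
    byte[] body = contents.getBytes(StandardCharsets.UTF_8);
    byte[] header = ("blob " + body.length + "\0").getBytes(StandardCharsets.US_ASCII);
    HexFormat hex = HexFormat.of();
    return hex.formatHex(header) + hex.formatHex(body);
  }

  public static void main(String[] args) {
    // A longer first line: the size in the header is recomputed automatically
    System.out.println(of("== Our test project\n"));
  }
}
```

The resulting string can then be pasted into the text block, split into lines ending with a backslash as before.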
A variation using java.util.Base64
For the example in this blog post, using java.util.Base64 would be mostly the same. In fact, it has a few advantages:
- java.util.Base64 has been available since JDK 8
- there is no need to escape the line terminator in the string literal
The following is a snippet of our running example using Base64:
private static final String README = """
eAE9yzEKgDAMRmHnnuIHB7eCo3cQXLxAUyNWqpEm0uvrovP7HmUhDH3TYroLjNVwFdk5mnPzlhQ1
5QxiyJfpPbwb2TqFMiOtqIwYThQOC5JhLXL83rsHUlggxg==
""";
@Test
public void readme() throws IOException {
var decoder = Base64.getMimeDecoder();
var bytes = decoder.decode(README);
// consume the bytes
}
It has a (possible) drawback though. It makes it harder to manually edit the data.
Doing something similar to the previous section using Base64 would not be as simple. The uncompressed data encoded with Base64 is the following:
$ zlib-flate -uncompress \
< .git/objects/75/a8b365ca1a5e731f49d3624960b314d0480ca3 \
| base64
YmxvYiA5MQAjIE91ciB0ZXN0IHByb2plY3QKClRoaXMgd2lsbCBiZSBvdXIgdGVzdCBibG9iLgpM
ZXQncyBzZWUgaWYgd2UgY2FuIHJlYWQgaXQgZnJvbSBvdXIgdGVzdC4K
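Every character represents 6 bits of information, so the Base64 characters do not line up with byte boundaries: each byte of our blob is spread over two characters, and a single character may carry bits of two adjacent bytes. Editing one byte therefore means working out the new value of up to two characters.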
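To get a feeling for this, the sketch below overwrites a single byte and counts how many Base64 characters end up different. The class and method names are hypothetical, and the sample data is just the beginning of our uncompressed blob.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, not part of the original post.
final class Base64EditCost {
  private Base64EditCost() {}

  // Counts the Base64 characters that differ after overwriting one byte.
  static int changedChars(byte[] data, int index, byte replacement) {
    byte[] edited = data.clone();
    edited[index] = replacement;
    String before = Base64.getEncoder().encodeToString(data);
    String after = Base64.getEncoder().encodeToString(edited);
    int count = 0;
    for (int i = 0; i < before.length(); i++) { // same length: only one byte changed
      if (before.charAt(i) != after.charAt(i)) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    byte[] data = "blob 91\0# Our test project\n".getBytes(StandardCharsets.UTF_8);
    // Index 8 is the '#' character right after the NUL byte
    System.out.println(changedChars(data, 8, (byte) '='));
    // Index 7 is the NUL byte, which sits in the middle of a 3-byte group
    System.out.println(changedChars(data, 7, (byte) 0xff));
  }
}
```

Replacing '#' with '=' happens to touch a single character (the group MQAj becomes MQA9), because the two bytes share their top two bits. In general a one-byte edit can change two characters, and in either case we would have to do the bit arithmetic by hand.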
Conclusion
In this blog post we saw a way to store binary data in Java source code using text blocks. We used the java.util.HexFormat class to convert the string to an array of bytes.
We focused on using this data for testing. But, if needed, it is also possible to use this technique in production code.
Storing the data in text format makes it easier to edit. This assumes the data:
- has a defined format; and
- its binary format allows for manipulation with some ease.
Additionally, since the data is in Java source code, edits can be visualized in Git diffs.
As mentioned, this technique is better suited for data that is relatively small.
You can find the source code for all of the examples in this GitHub repository. It includes the source code of the BlobReader.
Originally published at the Objectos Software Blog on July 11th, 2022.
Top comments (2)
Interesting take!
I usually go for the Base64 approach when faced with that. If editing I just use one of the free Base64 online converters. I wasn't familiar with HexFormat but we didn't really migrate to 17 yet...
Hi there!
I was using the Base64 approach as well. It was not until recently I wanted an easier way to edit the data. By the way, at the time I did not think about the Base64 online converters. That's a great idea, thanks!
As for HexFormat I am not sure anymore where I learned about it. But most likely a blog post or perhaps Twitter.