Revision: 001
Last Update: 2026-05-12
One topic that rarely gets enough attention in typical programming conversations is working with binary data. It often seems as though there is a ready-to-use library for whatever data type we need, but that illusion breaks whenever we want to do something new: interface with a less common format, work with legacy data, or build high-performance applications around custom data formats.
Designing a custom data format that fits the application at hand is an important skill. In this post, however, we will focus less on the philosophy and more on the operational details.
Binary formats are everywhere: images, meshes, save files, network packets, custom asset bundles, compressed streams, hardware protocols, game data, and many old file formats that are still useful decades later. Yet in application code, binary parsing often becomes messy very quickly. You start with a byte[], then an offset integer, then a few helper methods, then nested offset calculations. Before long, the parser has two responsibilities at once: understanding the file format and manually managing where the cursor is.
File container formats (image: Wikimedia Commons)
Design Motivation
At a higher abstraction level, the goal of Divooka is to move cursor-management logic into something closer to Kaitai Struct or DFDL, the Data Format Description Language. This makes it easier to work with standardized, well-documented formats such as RIFF.
Instead of writing code like this:
int offset = 0;
string magic = Encoding.ASCII.GetString(bytes, offset, 4);
offset += 4;
int version = BinaryPrimitives.ReadInt32LittleEndian(bytes.AsSpan(offset, 4));
offset += 4;
int count = BinaryPrimitives.ReadInt32LittleEndian(bytes.AsSpan(offset, 4));
offset += 4;
we want this:
var remaining = ImmutableByteArrayReader.FromFile("Data.bin")
.ReadString(4, out string magic)
.ReadInt(out int version)
.ReadInt(out int count);
This is similar in spirit to a stream-based API:
using var reader = new BinaryReader(stream, Encoding.ASCII, leaveOpen: true);
string magic = Encoding.ASCII.GetString(reader.ReadBytes(4));
int version = reader.ReadInt32();
int count = reader.ReadInt32();
The involved types here are ImmutableByteArrayReader, MutableByteArrayWriter, and ImmutableByteArrayWriter, all exposed through a fluent API. The parser reads like the binary format itself.
The writer mirrors that idea:
var data = new MutableByteArrayWriter()
.WriteStringFixedSize("DATA")
.WriteInt(version)
.WriteInt(count);
The idea is simple: binary data should be represented as a sequence of explicit read and write operations, not as scattered offset arithmetic.
Reading Binary Data
Binary Buffer Reader: ImmutableByteArrayReader
ImmutableByteArrayReader is a reader-oriented abstraction over binary data.
It is immutable. Reading from it does not mutate the current object. Instead, read operations return the remaining unread portion of the buffer.
The main API shape is:
public ImmutableByteArrayReader ReadInt(out int value);
public int ReadInt();
public int ReadIntWithOffset(int byteIndex);
These three forms cover three common use cases.
The out form is for fluent parsing:
var remaining = data
.ReadString(4, out string magic)
.ReadInt(out int width)
.ReadInt(out int height);
The direct-return form is for reading a value from the start of a buffer when the remaining data is not needed:
int width = data.ReadInt();
The offset form is for random access without consuming anything:
int width = data.ReadIntWithOffset(4);
This gives the type a clear contract: sequential reads consume data by returning a new remaining buffer, while offset reads consume nothing. Because the reader is immutable, slicing can be cheap: reading a subrange can return another reader over the same underlying data instead of copying bytes.
That means this kind of operation is cheap:
var remaining = data.ReadBytes(32, out ImmutableByteArrayReader header);
Both remaining and header can be views over the same internal byte array. Since neither object can mutate the backing storage, this remains safe.
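For readers who want the zero-copy idea in concrete terms, Python's memoryview demonstrates the same trick in a language-neutral sketch (this is only an illustration of shared views, not part of the API above):

```python
data = bytes(range(16))          # pretend this is a file's contents
view = memoryview(data)

header = view[:4]                # a view over the first 4 bytes, no copy made
remaining = view[4:]             # a view over the rest, also no copy

# Both views reference the same backing bytes object.
assert header.obj is data and remaining.obj is data
assert header.tobytes() == b"\x00\x01\x02\x03"
assert remaining[0] == 4
```

Because `data` is immutable, handing out any number of overlapping views is safe, which is exactly the property the immutable reader relies on.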
Reading blocks
Binary formats often have fixed-size blocks. For example, a file might contain a 32-byte header, followed by 32 bytes of padding, followed by records.
The API shape for bytes is:
public ImmutableByteArrayReader ReadBytes(int count, out ImmutableByteArrayReader value);
public ImmutableByteArrayReader ReadBytes(int count);
public ImmutableByteArrayReader ReadBytesWithOffset(int byteIndex, int count);
public ImmutableByteArrayReader ReadBytes(int count, out ImmutableByteArrayReader value, Action<ImmutableByteArrayReader> continuation);
The continuation form is useful when a sub-block should be parsed separately:
uint versionNumber = 0;
var remaining = ImmutableByteArrayReader.FromFile("MyData.bin")
.ReadString(4, out string magic)
.ReadBytes(32, out ImmutableByteArrayReader _, header => header.ReadUInt(out versionNumber))
.ReadBytes(32, out _)
.ReadInt(out int count);
The main stream advances by 32 bytes, while the continuation reads from the isolated header block.
This is especially useful for formats where blocks have fixed sizes, but only some fields inside the block are currently relevant.
Strings
Strings appear in binary formats in two common forms.
The first is fixed-size:
public ImmutableByteArrayReader ReadString(int size, out string value);
public string ReadString(int size);
public string ReadStringWithOffset(int byteIndex, int size);
For example, a file magic value:
data.ReadString(4, out string magic);
The second is zero-terminated:
public ImmutableByteArrayReader ReadString(out string value);
public string ReadString();
public string ReadStringWithOffset(int byteIndex);
This reads until the first 0x00 byte.
data.ReadString(out string name);
Fixed-size strings are common in headers. Zero-terminated strings are common in older binary formats, mesh files, embedded metadata, and C-style structures.
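As a language-neutral illustration of what a zero-terminated read does under the hood, here is the raw byte operation in Python (standard library only; this is not the API above):

```python
buffer = b"hello\x00world\x00rest"

# Find the first 0x00 byte, decode everything before it,
# and keep the remainder after the terminator.
end = buffer.index(0)
name = buffer[:end].decode("ascii")
remaining = buffer[end + 1:]     # skip the terminator itself

assert name == "hello"
assert remaining == b"world\x00rest"
```

Note that the terminator byte is consumed but not part of the decoded string, which is the usual C-string convention.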
Writing Binary Data
Binary Buffer Writer: MutableByteArrayWriter
MutableByteArrayWriter is the mutable writer.
It is a fluent append-only byte builder:
var current = new MutableByteArrayWriter()
.WriteStringFixedSize("CBA1")
.WriteInt(1)
.WriteDouble(3.14159);
It mutates in place: each call appends bytes to the same underlying buffer and returns this. This makes it the practical default for producing binary files.
The API deliberately uses explicit method names:
WriteInt(...)
WriteUInt(...)
WriteDouble(...)
WriteStringFixedSize(...)
WriteStringZeroTerminated(...)
WriteBytes(...)
This avoids the ambiguity of a single overloaded Write(...) method. In binary code, explicitness is usually worth the extra characters.
Immutable writer: ImmutableByteArrayWriter
ImmutableByteArrayWriter mirrors the writer API, but every write returns a new object.
var baseFile = new ImmutableByteArrayWriter()
.WriteStringFixedSize("TEST")
.WriteInt(1);
var pathA = baseFile.WriteStringZeroTerminated("Variant A");
var pathB = baseFile.WriteStringZeroTerminated("Variant B");
This is useful when experimenting with different downstream write paths.
For example, you may have a common header, then several possible payload layouts:
var common = new ImmutableByteArrayWriter()
.WriteStringFixedSize("MESH")
.WriteInt(version);
var triangleMesh = common
.WriteStringZeroTerminated("triangles")
.WriteInt(triangleCount);
var pointCloud = common
.WriteStringZeroTerminated("points")
.WriteInt(pointCount);
The mutable writer is faster and more memory efficient. The immutable writer is better for branching, testing, and functional construction patterns.
Writing fixed-size blocks
Writers also support fixed-size subblocks:
public MutableByteArrayWriter WriteBytes(int count);
public MutableByteArrayWriter WriteBytes(int count, Action<MutableByteArrayWriter> subblock);
public MutableByteArrayWriter WriteBytes(MutableByteArrayWriter source);
For the mutable version:
int versionNumber = 1;
int count = 15;
MutableByteArrayWriter current = new MutableByteArrayWriter()
.WriteStringFixedSize("CBA1")
.WriteBytes(32, con => con.WriteUInt((uint)versionNumber))
.WriteBytes(32)
.WriteInt(count);
for (int i = 0; i < count; i++)
current = current.WriteBytes(128);
current.Save("Copy1.bin");
current.Save("Copy2.bin");
current.Dispose();
The block continuation writes into a temporary sub-buffer. If it writes fewer than 32 bytes, the result is automatically padded with zeroes. If it writes more than 32 bytes, the writer should throw.
This makes fixed-size binary structures much easier to express.
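The pad-or-throw rule for fixed-size blocks can be sketched language-independently (Python; the 32-byte size and zero padding follow the description above, and the function name is invented for illustration):

```python
import struct

def write_fixed_block(payload: bytes, size: int) -> bytes:
    """Pad a sub-buffer with zeroes up to `size`, or fail if it overflows."""
    if len(payload) > size:
        raise ValueError(f"sub-block wrote {len(payload)} bytes, limit is {size}")
    return payload.ljust(size, b"\x00")

# A 4-byte little-endian uint placed inside a 32-byte block.
block = write_fixed_block(struct.pack("<I", 1), 32)
assert len(block) == 32
assert block[:4] == b"\x01\x00\x00\x00"
assert block[4:] == bytes(28)        # the remaining 28 bytes are zero padding
```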
The immutable version uses a functional continuation instead:
ImmutableByteArrayWriter current = new ImmutableByteArrayWriter()
.WriteStringFixedSize("CBA1")
.WriteBytes(32, con => con.WriteUInt((uint)versionNumber))
.WriteBytes(32)
.WriteInt(count);
Internally, that continuation should be shaped as:
Func<ImmutableByteArrayWriter, ImmutableByteArrayWriter>
because immutable writes return new objects.
Examples
Example: reading a PPM image
PPM is a simple image format from the Netpbm family. It is useful as a teaching format because the structure is small and direct. The examples below intentionally implement a minimal binary P6 reader and writer, rather than a complete Netpbm parser.
A binary PPM file starts with an ASCII header:
P6
width height
maxValue
followed by a single whitespace byte and then the binary RGB pixel data.
For example:
P6
256 256
255
<binary RGB bytes>
The header is text, but the pixel payload is binary. This makes PPM a good example of mixed text and binary parsing. In the common 8-bit case where maxValue is 255, the payload is width * height * 3 bytes. PPM also allows comments in the header and larger sample values, but this example deliberately supports only simple comment-free P6 files with maxValue == 255.
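To make the layout concrete, a minimal P6 file can be assembled byte by byte in Python (this sketches the format itself and the payload-size rule, independent of the API in this post):

```python
width, height = 2, 2

# ASCII header: magic, dimensions, max sample value, each newline-terminated.
header = f"P6\n{width} {height}\n255\n".encode("ascii")

# Binary payload: 3 bytes (R, G, B) per pixel; here four red pixels.
pixels = bytes([255, 0, 0] * (width * height))
assert len(pixels) == width * height * 3

ppm = header + pixels
assert ppm.startswith(b"P6\n")
```

The final `\n` of the header is the single whitespace byte that separates the text portion from the raster data.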
A simple PPM image type might look like this:
public sealed class PpmImage
{
public int Width { get; }
public int Height { get; }
public byte[] Pixels { get; }
public PpmImage(int width, int height, byte[] pixels)
{
Width = width;
Height = height;
Pixels = pixels;
}
}
Using ImmutableByteArrayReader, a basic parser for this restricted form (8-bit RGB, comment-free headers) could be written as:
public static PpmImage ReadPpm(string path)
{
ImmutableByteArrayReader data = ImmutableByteArrayReader.FromFile(path);
data = ReadAsciiToken(data, out string magic);
if (magic != "P6")
throw new InvalidDataException("Only binary P6 PPM files are supported.");
data = ReadAsciiToken(data, out string widthText);
data = ReadAsciiToken(data, out string heightText);
data = ReadAsciiToken(data, out string maxValueText);
int width = int.Parse(widthText);
int height = int.Parse(heightText);
int maxValue = int.Parse(maxValueText);
data = ConsumeSingleAsciiWhitespace(data); // Skip delimiter whitespace
if (maxValue != 255)
throw new InvalidDataException("Only 8-bit PPM files with max value 255 are supported.");
int pixelByteCount = width * height * 3;
data.ReadBytes(pixelByteCount, out ImmutableByteArrayReader pixelData);
return new PpmImage(width, height, pixelData.ToArray());
}
private static ImmutableByteArrayReader ConsumeSingleAsciiWhitespace(ImmutableByteArrayReader data)
{
if (data.Length == 0 || !IsAsciiWhitespace(data[0]))
throw new InvalidDataException("Expected whitespace before PPM raster data.");
return data.Skip(1);
}
The helper token reader can be implemented as a small routine over the immutable buffer:
private static ImmutableByteArrayReader ReadAsciiToken(ImmutableByteArrayReader data, out string token)
{
data = SkipAsciiWhitespace(data);
int count = 0;
while (count < data.Length && !IsAsciiWhitespace(data[count]))
count++;
token = Encoding.ASCII.GetString(data.AsSpan(0, count));
return data.Skip(count);
}
private static ImmutableByteArrayReader SkipAsciiWhitespace(ImmutableByteArrayReader data)
{
int count = 0;
while (count < data.Length && IsAsciiWhitespace(data[count]))
count++;
return data.Skip(count);
}
private static bool IsAsciiWhitespace(byte value)
{
return value == (byte)' ' || value == (byte)'\t' || value == (byte)'\r' || value == (byte)'\n';
}
This parser is intentionally minimal: it demonstrates the cursor-management pattern without implementing every detail of the Netpbm format. It does not support comments, alternate sample sizes, or multiple concatenated images in one file.
The parser is not fighting offsets. Each helper receives a buffer and returns the remaining buffer. That is the central pattern.
Example: writing a PPM image
Writing a binary PPM file is straightforward with MutableByteArrayWriter.
public static void WritePpm(string path, int width, int height, ReadOnlySpan<byte> rgbPixels)
{
int expectedLength = width * height * 3;
if (rgbPixels.Length != expectedLength)
throw new ArgumentException($"Expected {expectedLength} RGB byte(s).", nameof(rgbPixels));
new MutableByteArrayWriter()
.WriteStringFixedSize("P6\n")
.WriteStringFixedSize(width.ToString())
.WriteByte((byte)' ')
.WriteStringFixedSize(height.ToString())
.WriteByte((byte)'\n')
.WriteStringFixedSize("255\n")
.WriteBytes(rgbPixels)
.Save(path)
.Dispose();
}
This writes:
P6
width height
255
then appends raw RGB bytes.
The string methods are still explicit. We are not pretending the whole file is text. We are building a binary byte sequence with some ASCII sections.
The resulting .ppm file can be opened directly in image editors such as GIMP.
Contrast this with a traditional stream-based version:
public static void WritePpm(string path, int width, int height, ReadOnlySpan<byte> rgbPixels)
{
int expectedLength = width * height * 3;
if (rgbPixels.Length != expectedLength)
throw new ArgumentException($"Expected {expectedLength} RGB byte(s).", nameof(rgbPixels));
using FileStream stream = File.Create(path);
using BinaryWriter writer = new BinaryWriter(stream, Encoding.ASCII, leaveOpen: false);
writer.Write(Encoding.ASCII.GetBytes("P6\n"));
writer.Write(Encoding.ASCII.GetBytes(width.ToString()));
writer.Write((byte)' ');
writer.Write(Encoding.ASCII.GetBytes(height.ToString()));
writer.Write((byte)'\n');
writer.Write(Encoding.ASCII.GetBytes("255\n"));
writer.Write(rgbPixels);
}
The code is command-oriented, whereas the fluent writer makes the data structure visually continuous.
Example: reading a PLY file
PLY is a common polygon mesh format. It is especially visible today because of its use in workflows such as Gaussian Splatting. It can be ASCII or binary. A binary little-endian PLY file starts with an ASCII header, followed by binary element data.
The important detail is that PLY is property-driven. The header declares the elements, their counts, and the properties stored for each element. The binary body then stores those declared properties in order. A complete PLY reader should therefore parse the element/property layout and read or skip fields according to the header.
For this article, we will use a deliberately small subset of binary little-endian PLY: vertices containing only float x, float y, and float z, followed by faces containing only property list uchar int vertex_indices.
A simplified binary PLY header for that subset might look like this:
ply
format binary_little_endian 1.0
element vertex 3
property float x
property float y
property float z
element face 1
property list uchar int vertex_indices
end_header
Then the binary body follows.
For that header, the body contains three vertices, each with three 32-bit floats:
x y z
x y z
x y z
Then one face:
uchar count
int index0
int index1
int index2
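For the subset above, the binary body sizes can be verified with Python's struct module (an illustration of the layout only, not of the reader API):

```python
import struct

# Each vertex is three little-endian 32-bit floats: 12 bytes.
vertex = struct.pack("<3f", 0.0, 1.0, 0.5)
assert len(vertex) == 12

# Each triangle face is a uchar count followed by three 32-bit ints: 13 bytes.
face = struct.pack("<B3i", 3, 0, 1, 2)
assert len(face) == 13

# Decoding reverses the layout exactly.
count, i0, i1, i2 = struct.unpack("<B3i", face)
assert (count, i0, i1, i2) == (3, 0, 1, 2)
```

The `<` prefix requests little-endian byte order with no alignment padding, matching the binary_little_endian declaration in the header.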
A small model type:
public readonly record struct Vertex(float X, float Y, float Z);
public sealed class PlyMesh
{
public List<Vertex> Vertices { get; } = new List<Vertex>();
public List<int[]> Faces { get; } = new List<int[]>();
}
A simplified binary PLY reader for exactly this layout can be written as:
public static PlyMesh ReadBinaryLittleEndianPly(string path)
{
ImmutableByteArrayReader data = ImmutableByteArrayReader.FromFile(path, ByteArrayEndianness.LittleEndian, Encoding.ASCII);
data = ReadPlyHeader(data, out int vertexCount, out int faceCount);
var mesh = new PlyMesh();
for (int i = 0; i < vertexCount; i++)
{
data = data
.ReadFloat(out float x)
.ReadFloat(out float y)
.ReadFloat(out float z);
mesh.Vertices.Add(new Vertex(x, y, z));
}
for (int i = 0; i < faceCount; i++)
{
data = data.ReadByte(out byte indexCount);
int[] indices = new int[indexCount];
for (int j = 0; j < indexCount; j++)
data = data.ReadInt(out indices[j]);
mesh.Faces.Add(indices);
}
return mesh;
}
The header reader can remain text-oriented:
private static ImmutableByteArrayReader ReadPlyHeader(ImmutableByteArrayReader data, out int vertexCount, out int faceCount)
{
vertexCount = 0;
faceCount = 0;
var lines = new List<string>();
while (true)
{
data = ReadAsciiLine(data, out string line);
lines.Add(line);
if (line == "end_header")
break;
}
foreach (string line in lines)
{
string[] parts = line.Split(' ', StringSplitOptions.RemoveEmptyEntries);
if (parts.Length == 3 && parts[0] == "element" && parts[1] == "vertex")
vertexCount = int.Parse(parts[2]);
if (parts.Length == 3 && parts[0] == "element" && parts[1] == "face")
faceCount = int.Parse(parts[2]);
}
return data;
}
private static ImmutableByteArrayReader ReadAsciiLine(ImmutableByteArrayReader data, out string line)
{
int count = 0;
while (count < data.Length && data[count] != (byte)'\n')
count++;
line = Encoding.ASCII.GetString(data.AsSpan(0, count)).TrimEnd('\r');
if (count < data.Length)
count++;
return data.Skip(count);
}
Again, the binary body parser is clean because the cursor is implicit in the returned ImmutableByteArrayReader.
This reader is not a general-purpose PLY parser. It assumes the exact layout shown above: three float properties for each vertex and one uchar/int list property for each face. Real PLY files often include additional vertex properties such as normals, colors, texture coordinates, opacity, or Gaussian Splatting attributes. They may also use different scalar names or different element layouts. A complete reader would parse the full header schema, then read or skip each declared property in order.
Example: writing a binary PLY file
Writing a simple triangle mesh to binary little-endian PLY is similarly direct. This writer emits the same restricted layout used by the reader: each vertex has three 32-bit floats, and each face has a uchar count followed by 32-bit integer vertex indices.
public static void WriteBinaryLittleEndianPly(string path, IReadOnlyList<Vertex> vertices, IReadOnlyList<int[]> faces)
{
var current = new MutableByteArrayWriter(ByteArrayEndianness.LittleEndian, Encoding.ASCII)
.WriteStringFixedSize("ply\n")
.WriteStringFixedSize("format binary_little_endian 1.0\n")
.WriteStringFixedSize($"element vertex {vertices.Count}\n")
.WriteStringFixedSize("property float x\n")
.WriteStringFixedSize("property float y\n")
.WriteStringFixedSize("property float z\n")
.WriteStringFixedSize($"element face {faces.Count}\n")
.WriteStringFixedSize("property list uchar int vertex_indices\n")
.WriteStringFixedSize("end_header\n");
foreach (Vertex vertex in vertices)
{
current = current
.WriteFloat(vertex.X)
.WriteFloat(vertex.Y)
.WriteFloat(vertex.Z);
}
foreach (int[] face in faces)
{
if (face.Length > byte.MaxValue)
throw new InvalidDataException("PLY face has too many indices for uchar list count.");
current = current.WriteByte((byte)face.Length);
for (int i = 0; i < face.Length; i++)
current = current.WriteInt(face[i]);
}
current.Save(path);
current.Dispose();
}
A triangle can then be written like this:
var vertices = new[]
{
new Vertex(0, 0, 0),
new Vertex(1, 0, 0),
new Vertex(0, 1, 0),
};
var faces = new[]
{
new[] { 0, 1, 2 },
};
WriteBinaryLittleEndianPly("Triangle.ply", vertices, faces);
The code reads as the layout of the file, not as a sequence of stream or array mutations.
Below is what the code would look like with traditional BinaryWriter:
public static void WriteBinaryLittleEndianPly(string path, IReadOnlyList<Vertex> vertices, IReadOnlyList<int[]> faces)
{
using FileStream stream = File.Create(path);
using BinaryWriter writer = new BinaryWriter(stream, Encoding.ASCII, leaveOpen: false);
writer.Write(Encoding.ASCII.GetBytes("ply\n"));
writer.Write(Encoding.ASCII.GetBytes("format binary_little_endian 1.0\n"));
writer.Write(Encoding.ASCII.GetBytes($"element vertex {vertices.Count}\n"));
writer.Write(Encoding.ASCII.GetBytes("property float x\n"));
writer.Write(Encoding.ASCII.GetBytes("property float y\n"));
writer.Write(Encoding.ASCII.GetBytes("property float z\n"));
writer.Write(Encoding.ASCII.GetBytes($"element face {faces.Count}\n"));
writer.Write(Encoding.ASCII.GetBytes("property list uchar int vertex_indices\n"));
writer.Write(Encoding.ASCII.GetBytes("end_header\n"));
foreach (Vertex vertex in vertices)
{
writer.Write(vertex.X);
writer.Write(vertex.Y);
writer.Write(vertex.Z);
}
foreach (int[] face in faces)
{
if (face.Length > byte.MaxValue)
throw new InvalidDataException("PLY face has too many indices for uchar list count.");
writer.Write((byte)face.Length);
for (int i = 0; i < face.Length; i++)
writer.Write(face[i]);
}
}
The PLY writer is still fairly readable, but it becomes more procedural: the stream owns the cursor, and the code is expressed as side effects against that stream.
A Reflection on the Design
The reader and writer APIs are intentionally symmetrical.
Reader:
ReadInt(out int value)
ReadFloat(out float value)
ReadString(...)
ReadBytes(...)
Writer:
WriteInt(int value)
WriteFloat(float value)
WriteString...
WriteBytes...
The reader consumes and returns remaining data. The writer appends and returns the current or new builder.
The result is a high-level binary workflow:
var data = ImmutableByteArrayReader.FromFile("Input.bin");
data = data
.ReadString(4, out string magic)
.ReadInt(out int version)
.ReadBytes(32, out ImmutableByteArrayReader header);
and:
var output = new MutableByteArrayWriter()
.WriteStringFixedSize(magic)
.WriteInt(version)
.WriteBytes(header);
This makes binary code easier to inspect, easier to test, and easier to refactor.
Procedural vs Dataflow Use: Mutable vs Immutable Construction
There are two writer types because they solve different problems.
MutableByteArrayWriter is for ordinary construction. It is mutable, append-only, efficient, and can be saved multiple times.
var data = new MutableByteArrayWriter()
.WriteStringFixedSize("FILE")
.WriteInt(1);
data.Save("A.bin");
data.Save("B.bin");
data.Dispose();
ImmutableByteArrayWriter is for branching construction.
var common = new ImmutableByteArrayWriter()
.WriteStringFixedSize("FILE")
.WriteInt(1);
var a = common.WriteStringZeroTerminated("A");
var b = common.WriteStringZeroTerminated("B");
The immutable writer is not meant to replace the mutable one for high-throughput output. It exists for safety, experimentation, and forkable construction.
Future Improvements
The API can be improved in several directions.
A schema layer could sit on top of these primitives. Instead of manually writing every field, a format could define named fields, block sizes, validations, and version-specific layouts.
The reader could gain richer parsing helpers, such as ReadAsciiToken, ReadAsciiLine, ReadUntil, TryRead, and Peek variants. These would make mixed text/binary formats like PPM and PLY even cleaner.
The writer could support reservation and backpatching. Many binary formats write a length or offset before the final value is known. A controlled API for placeholders would make this safe:
var lengthPosition = current.ReserveInt();
...
current.PatchInt(lengthPosition, payloadLength);
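The reserve-and-patch pattern can be sketched with a plain growable buffer (Python; ReserveInt and PatchInt above are proposed names, and this shows only the underlying mechanic):

```python
import struct

buffer = bytearray()
buffer += b"DATA"

# Reserve 4 bytes for a length we do not yet know.
length_position = len(buffer)
buffer += bytes(4)

payload = b"hello world"
buffer += payload

# Backpatch the reserved slot once the payload length is known.
struct.pack_into("<i", buffer, length_position, len(payload))

assert bytes(buffer[:4]) == b"DATA"
assert struct.unpack_from("<i", buffer, 4)[0] == 11
```

The key safety property a real API should enforce is that every reserved slot is eventually patched exactly once before the buffer is saved.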
For very large files, both reader and writer could have stream-backed variants. ImmutableByteArrayReader is ideal for in-memory data, but some workflows need lazy reading, memory-mapped files, or chunked output.
The immutable writer could also use a rope-like internal representation instead of copying the whole byte array on every append. That would preserve functional semantics while making branching construction much more efficient.
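As a rough illustration of the rope idea, here is a minimal Python sketch in which appends share every earlier chunk instead of copying the accumulated bytes (hypothetical and heavily simplified; a real rope would also balance its structure):

```python
class RopeWriter:
    """Immutable append-only writer: each append shares all earlier chunks."""

    def __init__(self, chunks=()):
        self._chunks = chunks                 # tuple of bytes objects, never copied

    def write(self, data: bytes) -> "RopeWriter":
        # Cost scales with the number of chunks, not total bytes written so far.
        return RopeWriter(self._chunks + (data,))

    def to_bytes(self) -> bytes:
        return b"".join(self._chunks)         # materialize only when needed

common = RopeWriter().write(b"FILE").write(b"\x01\x00\x00\x00")
variant_a = common.write(b"A\x00")
variant_b = common.write(b"B\x00")
assert variant_a.to_bytes() == b"FILE\x01\x00\x00\x00A\x00"
assert variant_b.to_bytes() == b"FILE\x01\x00\x00\x00B\x00"
```

Both variants reuse the same chunk objects for the common prefix, which is exactly the branching-construction pattern the immutable writer targets.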
The core idea, though, should stay the same: binary data should be worked with as a fluent sequence of explicit operations. The API should make the structure of the format visible, while hiding the repetitive mechanics of offset movement, byte ordering, padding, and raw byte appending.