
Dmitrii Zatona

Posted on • Originally published at atl-protocol.org

98 Bytes That Prove Your Document Existed

A checkpoint in ATL Protocol is a signed snapshot of the transparency log state. It captures the root hash of the Merkle tree at a specific tree size, at a specific moment in time, from a specific log instance. If you have a checkpoint and the corresponding signature verifies, you know the exact state of the log at that point.

The entire wire format is 98 bytes. Fixed size. No variable-length fields. No parser required.

I want to walk through why it looks the way it does, byte by byte.

The Layout

Offset  Size   Field
0       18     Magic bytes: "ATL-Protocol-v1-CP" (ASCII)
18      32     Origin ID (SHA256 of instance UUID)
50      8      Tree size (u64 little-endian)
58      8      Timestamp (u64 little-endian, Unix nanoseconds)
66      32     Root hash (SHA256)
---
Total: 98 bytes
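The layout can be captured as a handful of constants. This is an illustrative sketch: CHECKPOINT_MAGIC and CHECKPOINT_BLOB_SIZE appear in the serialization code later in this post, but the offset names here are my assumptions, not necessarily what atl-core uses.

```rust
// Illustrative layout constants for the 98-byte checkpoint blob.
// Offset names are assumptions; CHECKPOINT_MAGIC and CHECKPOINT_BLOB_SIZE
// match identifiers used in the serialization code shown later.
pub const CHECKPOINT_MAGIC: &[u8; 18] = b"ATL-Protocol-v1-CP";
pub const ORIGIN_OFFSET: usize = 18;
pub const TREE_SIZE_OFFSET: usize = 50;
pub const TIMESTAMP_OFFSET: usize = 58;
pub const ROOT_HASH_OFFSET: usize = 66;
pub const CHECKPOINT_BLOB_SIZE: usize = 98;
```

Because every field has a fixed size, each offset is simply the previous offset plus the previous field's size, and the total must come out to exactly 98.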

That is the complete wire format. The Ed25519 signature (64 bytes) and the key ID (32 bytes) are stored separately -- they are not part of the 98-byte blob. This separation is a deliberate design decision that I will explain below.

Fixed Size, No Parser

Everyone reaches for JSON first. Or Protobuf. Or CBOR. Or MessagePack. These are fine formats for APIs, configuration, and data interchange. They are not fine for the thing you cryptographically sign.

The problem with variable-length encodings in a signed context is ambiguity. JSON has key ordering issues -- the same logical object can produce different byte sequences depending on which keys come first. This is well-known enough that RFC 8785 (JSON Canonicalization Scheme) exists specifically to address it. Protobuf has field ordering that is technically undefined in the spec; different implementations can serialize the same message into different bytes. CBOR has multiple valid encodings for the same value.

When I sign a checkpoint, I need to know exactly what I signed. Not "a checkpoint with these fields" but "these exact 98 bytes." If the serialization is ambiguous, the verification is ambiguous. If two implementations serialize the same checkpoint differently, one of them will fail to verify a signature produced by the other.

A fixed 98-byte blob has zero ambiguity. Any implementation in any language reads it the same way:

  1. Read 18 bytes of magic.
  2. Read 32 bytes of origin ID.
  3. Read 8 bytes, interpret as u64 little-endian -- that is the tree size.
  4. Read 8 bytes, interpret as u64 little-endian -- that is the timestamp.
  5. Read 32 bytes of root hash.

Done. No length prefixes to parse. No delimiters to scan for. No TLV encoding to decode. No canonical form debates, because the bytes ARE the canonical form.

The serialization in Rust:

pub fn to_bytes(&self) -> [u8; CHECKPOINT_BLOB_SIZE] {
    let mut blob = [0u8; CHECKPOINT_BLOB_SIZE];
    blob[0..18].copy_from_slice(CHECKPOINT_MAGIC);
    blob[18..50].copy_from_slice(&self.origin);
    blob[50..58].copy_from_slice(&self.tree_size.to_le_bytes());
    blob[58..66].copy_from_slice(&self.timestamp.to_le_bytes());
    blob[66..98].copy_from_slice(&self.root_hash);
    blob
}

No allocations. No error paths. The return type is [u8; 98], not Vec<u8> or Result<Vec<u8>>. It cannot fail.
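The reverse direction can be sketched just as simply. This is a hypothetical from_bytes mirroring the five read steps above; the struct fields match the wire format, but the function signature and error handling are my assumptions, not the actual atl-core API.

```rust
// A hypothetical from_bytes sketch mirroring the five read steps above.
// The error type and naming are assumptions, not the atl-core API.
const CHECKPOINT_MAGIC: &[u8; 18] = b"ATL-Protocol-v1-CP";

pub struct Checkpoint {
    pub origin: [u8; 32],
    pub tree_size: u64,
    pub timestamp: u64,
    pub root_hash: [u8; 32],
}

pub fn from_bytes(blob: &[u8; 98]) -> Result<Checkpoint, &'static str> {
    // Step 1: check the 18-byte magic; wrong format fails immediately.
    if &blob[0..18] != CHECKPOINT_MAGIC {
        return Err("invalid checkpoint magic");
    }
    // Steps 2-5: fixed offsets throughout -- no length prefixes, no delimiters.
    let mut origin = [0u8; 32];
    origin.copy_from_slice(&blob[18..50]);
    let tree_size = u64::from_le_bytes(blob[50..58].try_into().unwrap());
    let timestamp = u64::from_le_bytes(blob[58..66].try_into().unwrap());
    let mut root_hash = [0u8; 32];
    root_hash.copy_from_slice(&blob[66..98]);
    Ok(Checkpoint { origin, tree_size, timestamp, root_hash })
}
```

Note that the input type is &[u8; 98], not &[u8]: the caller proves the length at the type level, so the only runtime failure mode left is a magic mismatch.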

The Signature Lives Outside

This is the design decision that seems obvious in hindsight, yet one that many implementations get wrong.

The Ed25519 signature and the key ID are stored alongside the checkpoint, but they are not part of the 98-byte signed blob. The method to_bytes() always produces the exact data that was signed. Always. No exceptions.

Why does this matter? Consider the alternative: include the signature in the serialized format. Now the signature signs... what? The data plus the signature itself? That is a chicken-and-egg problem. You cannot compute a signature over data that includes the signature, because you do not have the signature yet when you need to hash the data.

The common workaround is to define a "signing input" that differs from the "stored format" -- serialize the checkpoint without the signature for signing, then serialize it again with the signature for storage. Now you have two serialization formats for the same logical object, and every implementation must know which one to use when. Bugs creep in. Someone verifies the stored format instead of the signing input. The signature fails. Hours of debugging follow.

I avoided this entirely. to_bytes() returns what was signed. The signature is a separate field. Verification reconstructs the blob and checks:

pub fn verify(&self, verifier: &CheckpointVerifier) -> AtlResult<()> {
    if self.key_id != verifier.key_id {
        return Err(AtlError::InvalidSignature(format!(
            "key_id mismatch: checkpoint has {}, verifier has {}",
            hex::encode(self.key_id), hex::encode(verifier.key_id)
        )));
    }
    let blob = self.to_bytes();
    verifier.verify(&blob, &self.signature)
}

Notice the key ID check before the signature verification. key_id is SHA256(public_key), and it is checked first. Ed25519 verification is not free -- it involves elliptic curve operations. If the key ID does not match, I reject immediately without touching the expensive crypto. This is a fast-rejection pattern: filter out obviously wrong keys with a cheap hash comparison before doing real work.
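The fast-rejection pattern can be isolated into a tiny sketch. Everything here is illustrative: the closure stands in for Ed25519 verification, and none of the names come from atl-core.

```rust
// Sketch of the fast-rejection pattern: compare a cheap 32-byte key ID
// before running expensive signature verification. `expensive_verify`
// stands in for Ed25519; all names here are illustrative.
pub fn verify_with_fast_reject(
    checkpoint_key_id: &[u8; 32],
    verifier_key_id: &[u8; 32],
    expensive_verify: impl FnOnce() -> bool,
) -> Result<(), &'static str> {
    // Cheap fixed-size comparison first: an obviously wrong key is
    // rejected without touching any elliptic curve math.
    if checkpoint_key_id != verifier_key_id {
        return Err("key_id mismatch");
    }
    if expensive_verify() {
        Ok(())
    } else {
        Err("signature verification failed")
    }
}
```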

Magic Bytes With a Version Baked In

The first 18 bytes of every checkpoint are the ASCII string "ATL-Protocol-v1-CP". This serves two purposes.

First, it identifies the format. If someone accidentally feeds a JPEG, a Protobuf message, or a random 98-byte buffer to a checkpoint parser, the magic bytes will not match. The error is InvalidCheckpointMagic, not a confusing downstream failure from misinterpreted fields.

Second, the v1 in the magic string bakes the wire format version into the data itself. If I ever need to change the checkpoint format -- add fields, change sizes, use a different hash function -- the new format gets different magic bytes (say, "ATL-Protocol-v2-CP"). A v1 parser encountering a v2 checkpoint cleanly rejects it instead of silently misinterpreting bytes 18-50 as an origin ID when v2 might use those bytes for something else entirely.

Eighteen bytes is generous for magic bytes. I chose a human-readable string over a shorter binary magic number because checkpoint blobs end up in hex dumps, log files, and debugging sessions. Seeing 41544C2D50726F746F636F6C2D76312D4350 in a hex dump is recognizable. Seeing 89504E47 requires you to remember that those are PNG magic bytes.

Nanosecond Timestamps

The timestamp field is a u64 encoding Unix nanoseconds. Not seconds, not milliseconds -- nanoseconds.

The range of a u64 in nanoseconds covers from 1970 to approximately the year 2554. That is sufficient.

I chose nanosecond precision because a transparency log can process multiple entries within the same millisecond. If two checkpoints have the same millisecond-resolution timestamp, their ordering becomes ambiguous. With nanosecond resolution, even entries processed microseconds apart have distinct timestamps. The timestamp is generated by current_timestamp_nanos() with clamping to u64::MAX to handle the theoretical case where system time exceeds the representable range.
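A clamping timestamp helper can be sketched with the standard library. The function name matches the one mentioned above, but the body is my assumption about how such a helper would look, not the atl-core implementation.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

// Sketch of a nanosecond timestamp helper. The name matches the article's
// current_timestamp_nanos; the body is an assumed implementation.
pub fn current_timestamp_nanos() -> u64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| {
            // as_nanos() returns u128; clamp to u64::MAX for the
            // theoretical case where the value exceeds the representable
            // range (roughly the year 2554).
            u64::try_from(d.as_nanos()).unwrap_or(u64::MAX)
        })
        // A clock set before the Unix epoch degrades to 0 rather than panicking.
        .unwrap_or(0)
}
```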

Little-Endian Encoding

The two u64 fields (tree size and timestamp) are encoded as little-endian. This is an explicit choice, not a default.

Little-endian matches the native byte order of modern hardware: x86, ARM (in its default configuration), and RISC-V. On these architectures, encoding a u64 as little-endian is a no-op -- the in-memory representation is already little-endian. This eliminates an entire class of byte-swapping bugs on the most common platforms.

I wrote an explicit test that verifies the encoding:

#[test]
fn test_endianness() {
    assert_eq!(
        0x0102_0304_0506_0708u64.to_le_bytes(),
        [0x08, 0x07, 0x06, 0x05, 0x04, 0x03, 0x02, 0x01]
    );
}

The test name is test_endianness. It exists because byte order is the kind of thing that is trivially correct on one platform and silently wrong on another. The test makes the encoding a documented, verified property of the format, not an assumption.

JSON Roundtrip, but Not for Crypto

Checkpoints need a human-readable representation for APIs, debugging, and storage in systems that do not handle raw binary well. I implemented a CheckpointJson struct with string encodings:

  • Hashes are encoded as "sha256:<64 hex characters>"
  • Signatures are encoded as "base64:<base64 string>"

The to_json() and from_json() methods convert between the binary checkpoint and this JSON representation. But -- and this is important -- cryptographic operations always operate on the 98-byte binary blob, never on JSON. The Ed25519 signature is computed over to_bytes(), not over serde_json::to_string().

To enforce this, the main Checkpoint struct does not derive Serialize or Deserialize. There is no #[derive(Serialize)] on it. This is intentional. If someone accidentally tries to JSON-serialize a Checkpoint directly, the compiler stops them. They must go through to_json() explicitly, which forces them to think about which representation they are working with.
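The "sha256:<hex>" field encoding is simple enough to sketch without a dependency. The helper name below is mine; a real implementation would likely lean on the hex crate rather than formatting byte by byte.

```rust
// Sketch of the hash string encoding used by CheckpointJson:
// "sha256:" followed by 64 lowercase hex characters. The helper name
// is illustrative, not the atl-core code.
pub fn encode_hash_field(hash: &[u8; 32]) -> String {
    let mut s = String::with_capacity(7 + 64);
    s.push_str("sha256:");
    for byte in hash {
        // Manual lowercase hex; production code would use the hex crate.
        s.push_str(&format!("{:02x}", byte));
    }
    s
}
```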

Trust Model: Signature as Integrity Check

Per ATL Protocol v2.0, the Ed25519 signature on a checkpoint is an integrity check, not a trust anchor. This is a deliberate departure from PKI-style trust.

The signature proves: "this checkpoint was issued by the holder of this private key." It does not prove: "you should trust this key." The distinction matters.

Trust in ATL Protocol comes from external anchors: RFC 3161 TSA timestamps (a trusted third-party timestamping authority attests to when the checkpoint existed) and Bitcoin OTS (the checkpoint hash is anchored in the Bitcoin blockchain, providing a timestamp that no single party can forge). The Ed25519 signature binds the checkpoint to a specific log instance. The external anchors prove the checkpoint existed at a specific point in time.

This design means that compromising the Ed25519 signing key does not retroactively compromise past checkpoints. If a checkpoint was anchored via RFC 3161 or Bitcoin OTS before the key was compromised, that anchor remains valid regardless of what happens to the key afterwards. The key is not the trust root -- it is a consistency mechanism.

The Test Suite

The checkpoint implementation has a test suite that covers the wire format exhaustively:

  • test_checkpoint_blob_size -- verifies the blob is exactly 98 bytes.
  • test_magic_bytes -- verifies the first 18 bytes are "ATL-Protocol-v1-CP".
  • test_endianness -- verifies little-endian encoding of u64 fields.
  • test_wire_format_layout -- verifies every field is at the correct byte offset.
  • test_sign_and_verify -- round-trip: create checkpoint, sign, verify.
  • test_verify_wrong_key_fails -- signature from key A does not verify with key B.
  • test_verify_tampered_data_fails -- flip a byte in the checkpoint, verification fails.
  • test_verify_tampered_signature_fails -- flip a byte in the signature, verification fails.
  • test_json_roundtrip -- binary to JSON to binary produces identical bytes.
  • test_empty_tree_checkpoint -- a checkpoint for an empty tree (tree_size = 0) is valid.

Each test name documents a specific property of the format. If a future change breaks any of these properties, the test name tells you exactly what broke and why it matters.
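To illustrate the style, a layout test in this spirit might look like the sketch below. The serializer here is a stand-in with free-function arguments, not the actual to_bytes method or test from atl-core.

```rust
// Sketch of a test_wire_format_layout-style check. The stand-in
// serializer and all names are assumptions, not the atl-core code.
const CHECKPOINT_MAGIC: &[u8; 18] = b"ATL-Protocol-v1-CP";

fn to_bytes(origin: [u8; 32], tree_size: u64, timestamp: u64, root: [u8; 32]) -> [u8; 98] {
    let mut blob = [0u8; 98];
    blob[0..18].copy_from_slice(CHECKPOINT_MAGIC);
    blob[18..50].copy_from_slice(&origin);
    blob[50..58].copy_from_slice(&tree_size.to_le_bytes());
    blob[58..66].copy_from_slice(&timestamp.to_le_bytes());
    blob[66..98].copy_from_slice(&root);
    blob
}

fn check_wire_format_layout() {
    let blob = to_bytes([0xAA; 32], 7, 9, [0xBB; 32]);
    // Every field must land at its documented offset.
    assert_eq!(&blob[0..18], CHECKPOINT_MAGIC);
    assert_eq!(blob[18..50], [0xAA; 32]);
    assert_eq!(u64::from_le_bytes(blob[50..58].try_into().unwrap()), 7);
    assert_eq!(u64::from_le_bytes(blob[58..66].try_into().unwrap()), 9);
    assert_eq!(blob[66..98], [0xBB; 32]);
}
```

A test like this pins each offset independently, so a future refactor that shifts any field by even one byte fails with a pointer to the exact offset that moved.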

Why 98 Bytes

The number 98 is not arbitrary. It is the sum of the minimum fields needed to identify a unique log state:

  • 18 bytes to identify the format and version.
  • 32 bytes to identify which log instance (origin).
  • 8 bytes to identify how many entries are in the log (tree size).
  • 8 bytes to identify when this snapshot was taken (timestamp).
  • 32 bytes to cryptographically commit to the entire log contents (root hash).

There is nothing to remove. The magic bytes prevent misidentification. The origin prevents cross-log confusion. The tree size and timestamp locate the snapshot in the log's history. The root hash commits to every entry ever written. Remove any field and the checkpoint becomes ambiguous or forgeable.

There is nothing to add, either. Anything beyond these fields -- the signature, the key ID, metadata, annotations -- belongs outside the signed blob. The 98 bytes are the statement. Everything else is commentary on the statement.


The full implementation is open source: github.com/evidentum-io/atl-core (Apache-2.0)

The file discussed in this post:

  • src/core/checkpoint.rs -- checkpoint wire format, serialization, signing, verification (1080 lines)
  • Ed25519 signatures via ed25519-dalek crate
