Peter Harrison

Posted on Jan 30

SHARD: Deniable File Distribution Through XOR-Based Sharding

#crypto #cybersecurity

The Problem: Protecting Information Sources

In 2012, I developed SHARD to address a fundamental challenge in information security: how do you distribute sensitive information in a way that does not identify its source?

Traditional encryption doesn't solve this problem. An encrypted file is still evidence of something. If you're found in possession of secret_document.gpg, you have a file that clearly contains information, even if investigators can't decrypt it. For whistleblowers, journalists, and activists operating under authoritarian regimes, mere possession of encrypted files can be incriminating.

The requirement was different. I wanted a system where:

Information can be distributed through normal channels (FTP, HTTP, file sharing)
Individual components are meaningless - they provide no evidence of what information they might contain
Source protection is cryptographic - not just operational security, but mathematically provable deniability
Reconstruction is possible for intended recipients with the proper instructions

SHARD achieves this through an elegant application of XOR operations and a separation of concerns: bulk data (shards) travels through one channel, while reconstruction metadata (recipes) travels through another.

How SHARD Works: The Concept

SHARD splits files into components called "shards" using XOR operations. The critical innovation is that shards are not simply encrypted fragments. Instead, each shard can be a required component of many different files - potentially hundreds. This makes it cryptographically impossible to associate any single shard with a particular source file.

Here's the process:

A pool of random "seed" shards is created - these are just random data
When you shard a file, each 1MB section is XORed with 3 randomly selected existing shards
The result is written as a new shard added to the pool
A small "recipe" file records which shards to XOR together to reconstruct the original

As multiple users shard their files using a shared pool, the deniability compounds. A shard in your possession could be part of:

Your own files
Files sharded by other users
Nothing at all (just a random seed shard)
Multiple files simultaneously

Without the recipe, there's no way to determine what any shard contains.

Network Effects and Collaborative Use

The system becomes more powerful when multiple users share a shard pool. If Alice, Bob, and Carol all use the same collection of shards:

Alice shards her sensitive document, creating new shards A1, A2, A3
Bob shards his document, potentially using A1, A2 in his XOR operations, creating B1, B2
Carol does the same, creating C1, C2, C3

Now shard A1 is a component of Alice's file AND Bob's file. There's no way to prove which file A1 "belongs to" - it's genuinely part of both. As the pool grows and more users participate, this ambiguity grows with it: the number of possible three-shard combinations rises roughly with the cube of the pool size.
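One rough way to put numbers on this is to count how many distinct sets of 3 shards could be drawn from a pool of a given size. This is an illustration only, not part of the SHARD scripts, and the function name is hypothetical.

```python
import math


def three_shard_combinations(pool_size):
    """Number of distinct unordered sets of 3 shards in a pool of this size."""
    return math.comb(pool_size, 3)


for size in (10, 100, 1000):
    print(size, three_shard_combinations(size))
```

This only counts the candidate key sets for a single section. It says nothing about how an observer might test them.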

Practical Usage

SHARD consists of three Python scripts. Let's walk through using them.

Setup

First, create a directory for shards and generate an initial pool of random seed shards:

mkdir shards
python random_shards.py

This creates 10 random 1MB files in the shards/ directory with names like:

shard-a3f5e9c2b1d4f8e7c6b5a4f3e2d1c0b9
shard-7c8d2e1f4a5b6c9d0e3f1a2b8c4d5e6f
...

These filenames are not random - they're the BLAKE2b hash (128-bit) of the shard contents. This makes the shard store content-addressable: the filename is a direct cryptographic function of the data it contains. This becomes important for integrity verification during reconstruction.

These seed shards provide the initial XOR key material for the system.
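A minimal sketch of what seed generation could look like, assuming the naming scheme described above. It is not the actual random_shards.py; the function name and its parameters are hypothetical.

```python
import hashlib
import os

SHARD_SIZE = 1024 * 1024  # 1MB, the shard size used throughout this article


def create_seed_shards(directory, count=10, size=SHARD_SIZE):
    """Write `count` random shards into `directory`, each named by its content hash."""
    os.makedirs(directory, exist_ok=True)
    names = []
    for _ in range(count):
        data = os.urandom(size)  # random bytes from the operating system
        name = "shard-" + hashlib.blake2b(data, digest_size=16).hexdigest()
        with open(os.path.join(directory, name), "wb") as f:
            f.write(data)
        names.append(name)
    return names
```

Calling create_seed_shards("shards") would produce ten 1MB files with 32-hex-character names like the ones listed above.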

Sharding a File

To shard a file:

python shard.py secret_document.pdf

The script:

Reads secret_document.pdf
Processes it in 1MB sections
For each section:
- Randomly selects 3 existing shards from the pool
- XORs the section with each of the 3 shards
- Writes the result as a new shard
Creates secret_document.pdf.recipe containing the reconstruction instructions

The recipe file is small - little more than a list of shard filenames. For a 10MB file, the recipe might be around 1-2KB.
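The steps above can be sketched in memory as follows. This is a simplified illustration, not shard.py itself: the pool is a plain dict from shard name to bytes, the helper names are hypothetical, and the sketch pads the final section before XORing, which is an assumption about ordering.

```python
import hashlib
import os
import secrets

SECTION_SIZE = 1024 * 1024  # 1MB sections, as in the article


def xor_bytes(a, b):
    """XOR two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return (int.from_bytes(a, "big") ^ int.from_bytes(b, "big")).to_bytes(len(a), "big")


def shard_name(data):
    """Content-derived name: 'shard-' plus a 128-bit BLAKE2b hex digest."""
    return "shard-" + hashlib.blake2b(data, digest_size=16).hexdigest()


def shard_bytes(data, pool, section_size=SECTION_SIZE):
    """Shard `data` against `pool` (dict: name -> bytes) and return the recipe rows.

    Each row lists the new shard first, then the 3 pool shards it was XORed with.
    New shards are added to `pool`, so later sections may reuse them.
    """
    rng = secrets.SystemRandom()
    recipe = []
    for offset in range(0, len(data), section_size):
        section = data[offset:offset + section_size]
        # Pad a short final section with random bytes so every shard has the same size.
        section += os.urandom(section_size - len(section))
        chosen = rng.sample(sorted(pool), 3)  # raises ValueError if the pool has < 3 shards
        for name in chosen:
            section = xor_bytes(section, pool[name])
        new_name = shard_name(section)
        pool[new_name] = section
        recipe.append([new_name] + chosen)
    return recipe
```

The returned rows are what a recipe would need to record. Reconstruction also needs the original length so the random padding can be discarded.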

Reconstructing a File

To reconstruct the original file:

python unshard.py secret_document.pdf.recipe

The script:

Reads the recipe file
For each shard referenced in the recipe:
- Reads the shard file
- Computes its BLAKE2b hash
- Verifies the hash matches the filename
- Exits with an error if any shard fails verification
For each 1MB section, XORs the 4 verified shards together
Writes the reconstructed data to the original filename

The output file is identical to the input - bit-for-bit perfect reconstruction.

Integrity verification is automatic. If any shard has been corrupted, modified, or is simply the wrong file, its BLAKE2b hash won't match its filename and reconstruction will fail immediately. This prevents producing a corrupted output file from bad input shards.
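A minimal in-memory sketch of the verify-then-XOR flow described above. It is not unshard.py: the pool is a dict from shard name to bytes, and the exception class, function names, and the original_size argument are all hypothetical.

```python
import hashlib


class ShardVerificationError(Exception):
    """Raised when a shard's contents do not match the hash in its name."""


def xor_bytes(a, b):
    """XOR two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return (int.from_bytes(a, "big") ^ int.from_bytes(b, "big")).to_bytes(len(a), "big")


def load_verified(pool, name):
    """Return the shard bytes only if they hash to the value embedded in `name`."""
    data = pool[name]
    expected = name[len("shard-"):]
    actual = hashlib.blake2b(data, digest_size=16).hexdigest()
    if actual != expected:
        raise ShardVerificationError("shard failed verification: " + name)
    return data


def unshard_bytes(recipe, pool, original_size):
    """XOR the 4 verified shards of each recipe row and join the sections."""
    out = bytearray()
    for row in recipe:
        shards = [load_verified(pool, name) for name in row]  # fails before any XOR
        section = shards[0]
        for other in shards[1:]:
            section = xor_bytes(section, other)
        out += section
    return bytes(out[:original_size])  # drop padding from the final section
```

Because verification happens before any XOR for a row, a tampered shard raises an error instead of producing silently corrupted output.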

Distribution Strategy

The power of SHARD comes from separating the distribution channels:

Shards (bulk data):

Can be hosted publicly on any file server
Uploaded to cloud storage
Distributed via BitTorrent
Shared on FTP sites
No risk in possession - they're just random-looking data

Recipes (reconstruction metadata):

Much smaller - can be transmitted via secure channels
Can be printed (for small files)
Transmitted via encrypted messaging
Read over phone/radio for very small files
Hand-delivered on USB sticks

A whistleblower could host shards on a public website under their own name with no legal risk. The recipe travels separately through secure channels to intended recipients.

Technical Details: The Cryptography

XOR Operations

SHARD uses XOR (exclusive OR) as its core cryptographic primitive. XOR has a critical property: A XOR B XOR B = A. This means XORing a value with the same key twice returns the original value.

For each 1MB file section, the sharding process:

section = file_data[offset:offset+1MB]

# XOR with 3 randomly selected shards
section = section XOR shard1
section = section XOR shard2  
section = section XOR shard3

# Write the result as new_shard
write(new_shard, section)

To reconstruct:

# Start with the output shard
data = read(new_shard)

# XOR with the same 3 shards
data = data XOR shard1
data = data XOR shard2
data = data XOR shard3

# Result is the original section
# Because: (section XOR shard1 XOR shard2 XOR shard3) XOR shard1 XOR shard2 XOR shard3 = section
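The pseudocode above can be checked with a few lines of runnable Python. This is a small demonstration on a short byte string rather than a 1MB section, and xor_bytes is a hypothetical helper name.

```python
import os


def xor_bytes(a, b):
    """XOR two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


section = b"example file section"
keys = [os.urandom(len(section)) for _ in range(3)]  # stand-ins for shard1..shard3

# Sharding: XOR the section with each of the 3 shards.
new_shard = section
for key in keys:
    new_shard = xor_bytes(new_shard, key)

# Reconstruction: XOR with the same 3 shards. Order does not matter,
# because XOR is commutative and associative.
recovered = new_shard
for key in reversed(keys):
    recovered = xor_bytes(recovered, key)

assert recovered == section
```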

Information-Theoretic Deniability

The security comes from the properties of XOR operations:

One-time pad property: When you XOR data with truly random bytes, the output is indistinguishable from random data
No information leakage: Without the recipe, there's no way to determine what shards contribute to what files
Collision-free reconstruction: Because we track exactly which shards were used, reconstruction is deterministic

Each shard is effectively random data. Even if an attacker has:

All the shards
Knowledge that certain shards exist
Suspicions about what might be sharded

Without the recipe, they cannot:

Determine what any shard contains
Prove any shard is part of a particular file
Reconstruct any file

Content-Addressable Storage and Collision Resistance

Shards are named using the BLAKE2b hash of their contents:

import hashlib

def create_shard_name(data):
    h = hashlib.blake2b(data, digest_size=16)  # 16 bytes = 128 bits
    return 'shard-' + h.hexdigest()  # Returns 32 hex characters

This creates content-addressable storage where the filename is a cryptographic function of the file's contents. A shard named shard-a3f5e9c2b1d4f8e7c6b5a4f3e2d1c0b9 will always contain exactly the data that produces that specific BLAKE2b hash.

Benefits of this approach:

Built-in integrity verification: To verify a shard, simply hash its contents and check if the result matches its filename. No separate checksums needed.
Automatic deduplication: If two sharding operations produce identical data, they generate the same hash and thus the same filename. Only one copy is stored.
Collision resistance: BLAKE2b with 128 bits provides 2^128 possible hash values. The probability of collision is negligible - you'd need to generate about 2^64 (18 quintillion) shards before having a 50% chance of a single collision.
Performance: BLAKE2b is one of the fastest cryptographic hash functions in pure software, typically achieving 1-3 GB/s on modern CPUs - faster than SHA-256 on machines without hardware SHA acceleration.
Recipe simplicity: The recipe file just lists shard names. Those names are also the verification hashes. No additional metadata needed.
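The collision figure quoted above comes from the birthday bound. A short sketch of the standard approximation, assuming hash outputs behave like uniformly random values (the function name is hypothetical):

```python
import math


def hashes_for_collision_probability(bits, probability=0.5):
    """Approximate count of random `bits`-bit hashes needed to reach the
    given probability of at least one collision (birthday approximation)."""
    space = 2 ** bits
    return math.sqrt(2 * space * math.log(1 / (1 - probability)))


n = hashes_for_collision_probability(128)
print(f"{n:.3e} shards, about {n / 2**64:.2f} x 2^64")
```

The result is a little above 2^64, which is why 2^64 is the usual round figure for a 128-bit hash.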

Padding for Uniform Size

All shards are exactly 1MB, regardless of the actual data they contain:

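import os

# Write the XORed section, then pad the file with random bytes up to 1MB
with open('shards/' + newShard, 'wb') as newShardFile:
    newShardFile.write(section)
    bufSize = (1024 * 1024) - len(section)
    if bufSize > 0:
        # Only the final section of a file is shorter than 1MB
        newShardFile.write(os.urandom(bufSize))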

This prevents information leakage through file sizes. The last section of a file is padded with random data to reach exactly 1MB, making all shards uniform and indistinguishable.
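Padding has a counterpart on the reconstruction side: the random tail has to be dropped again. A minimal sketch, assuming the original file length is known at reconstruction time (for example from the recipe). Both function names are hypothetical.

```python
import os

SHARD_SIZE = 1024 * 1024  # 1MB


def pad_section(section, size=SHARD_SIZE):
    """Extend a short section with random bytes so it is exactly `size` long."""
    if len(section) > size:
        raise ValueError("section is larger than the shard size")
    return section + os.urandom(size - len(section))


def strip_padding(reconstructed, original_size):
    """Discard everything past the original file length."""
    return reconstructed[:original_size]
```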

Integrity Verification Through Content-Addressable Storage

SHARD uses content-addressable storage where each shard's filename is derived from its contents:

import hashlib
import os

def hash_shard(filepath):
    # Hash the full shard contents with a 128-bit BLAKE2b digest
    h = hashlib.blake2b(digest_size=16)
    with open(filepath, 'rb') as f:
        h.update(f.read())
    return h.hexdigest()

def verify_shard(filepath):
    # Use only the filename, so a 'shard-' in a directory name cannot confuse the split
    expected_hash = os.path.basename(filepath).split('shard-', 1)[1]
    actual_hash = hash_shard(filepath)
    return expected_hash == actual_hash

During reconstruction, unshard.py automatically verifies every shard before using it:

Read the shard file from disk
Compute its BLAKE2b hash
Compare against the hash embedded in the filename
If mismatch: exit with error message identifying the corrupted shard
If match: proceed to use the shard in XOR operations

This provides fail-fast integrity checking. If you've downloaded shards from an untrusted source, or if transmission errors have corrupted files, you'll know immediately before attempting reconstruction. The system won't produce a corrupted output file - it either succeeds completely with verified shards or fails cleanly with an error message.

Why BLAKE2b?

BLAKE2b was chosen for several technical reasons:

Speed: 2-3x faster than SHA-256 in software (CPUs with hardware SHA extensions narrow the gap), which matters when verifying many large files
Security: Provides cryptographic-strength collision resistance
Standard library: Available in Python's hashlib since Python 3.6
Appropriate size: 128-bit output provides the right balance between collision resistance (2^64 shards before 50% collision probability) and compact filenames (32 hex characters)

The integrity verification does more than detect accidental corruption - it also prevents attacks where someone might substitute malicious shards. To pass verification, a substitute shard would need different content that produces the same BLAKE2b hash (a second preimage), which is computationally infeasible to find.

Random Shard Selection

When sharding a file, 3 shards are randomly selected from the pool:

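import os
import secrets

def getShardFiles():
    # List every shard currently in the pool
    dirList = os.listdir("shards")
    if len(dirList) < 3:
        # Without this check the loop below would never terminate
        raise RuntimeError("shard pool needs at least 3 shards")
    fileList = []
    while len(fileList) < 3:
        # secrets.randbelow avoids the slight modulo bias of os.urandom(4) % n
        consider = dirList[secrets.randbelow(len(dirList))]
        if consider not in fileList:
            fileList.append(consider)
    return fileList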

This ensures:

Different files use different shard combinations
The mapping between shards and files is unpredictable
The pool's ambiguity grows with each use

Recipe Files: The Weak Point and the Strength

The recipe file is both the vulnerability and the key to SHARD's security model.

A recipe contains:

original_filename.pdf
10485760
shard-f81d4fae7dec11d0a76500a0c91e6bf6
shard-a3d5e8c24b9111ec9f2400a0c91e6bf6
shard-b7c3f1d95a8211ec8e3500a0c91e6bf6
shard-c9e4a2b86c7311ec9d4600a0c91e6bf6
...

For a 100MB file, the recipe is roughly 16KB - small enough to:

Print on a few pages
Transmit via low-bandwidth channels
Store on a USB stick hidden physically
Encode in images or other steganographic techniques
Memorize in chunks (for very small files)
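A sketch of how a recipe in the layout shown above could be read back. It assumes the first line is the original filename, the second is the file size in bytes, and the remaining lines are shard names grouped 4 per section. That grouping is an assumption, and parse_recipe is a hypothetical name, not the parser in unshard.py.

```python
def parse_recipe(text, shards_per_section=4):
    """Split recipe text into (filename, size, rows of shard names)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) < 2:
        raise ValueError("recipe needs a filename line and a size line")
    filename, size = lines[0], int(lines[1])
    names = lines[2:]
    if len(names) % shards_per_section != 0:
        raise ValueError("shard count is not a multiple of the section group size")
    rows = [names[i:i + shards_per_section]
            for i in range(0, len(names), shards_per_section)]
    return filename, size, rows
```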

The security trade-off:

Shards alone: Completely safe to possess, host, or distribute. Provide zero information.
Recipe alone: Useless without access to the shard pool.
Shards + Recipe: Full reconstruction capability.

This separation enables a powerful distribution strategy: shards move through monitored, high-bandwidth channels where possession means nothing. Recipes move through secure, potentially lower-bandwidth channels where small size is an advantage.

Limitations and Considerations

Recipe Size for Large Files

While recipes are small relative to file size (~0.016% overhead), they grow linearly with file size at 4 shard names per megabyte. A 1GB file needs a recipe listing ~4,000 shards (roughly 160KB with 32-character hash-based shard names).

This crosses the threshold from "easily non-digital transmission" into "needs digital channels anyway" for very large files. The system works best for documents, images, and small datasets rather than multi-gigabyte video files.
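The recipe sizes quoted in this article can be reproduced with a small estimate. This sketch assumes one shard name per line, 4 names per 1MB section, and ignores the filename and size header lines. The constant and function names are hypothetical.

```python
import math

SHARD_SIZE = 1024 * 1024               # 1MB sections
NAME_BYTES = len("shard-") + 32 + 1    # prefix + 32 hex characters + newline


def estimate_recipe_bytes(file_size, shards_per_section=4):
    """Approximate recipe size in bytes for a file of `file_size` bytes."""
    sections = math.ceil(file_size / SHARD_SIZE)
    return sections * shards_per_section * NAME_BYTES


print(estimate_recipe_bytes(100 * 1024 * 1024))  # 100MB file
print(estimate_recipe_bytes(1024 ** 3))          # 1GB file
```

Under these assumptions a 100MB file needs about 15.6KB of recipe and a 1GB file about 160KB, in line with the figures above.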

Trust and Distribution

SHARD provides technical deniability, but practical deployment requires:

Trusted channels for recipe distribution
Confidence that shard pools haven't been compromised
Understanding that recipe holders can reconstruct files

The recipe is the single point of failure. If intercepted, and the attacker has access to the shard pool, reconstruction is trivial.

Shard Pool Management

As the pool grows, managing thousands of shard files becomes a practical concern. The system has no built-in mechanisms for:

Shard garbage collection (removing unused shards)
Versioning or tracking which shards are still needed
Synchronizing shard pools across multiple users

These would need to be handled at the operational level.

Comparison to Modern Approaches

Since 2012, various systems have emerged with related goals:

IPFS and Distributed Hash Tables: IPFS also uses content-addressed storage with cryptographic hashes as identifiers. However, IPFS content hashes uniquely identify files - there's no deniability. Each file has one hash. SHARD is different: each shard can be a component of multiple files, creating genuine ambiguity about what any shard contains.

Blockchain-based storage: Systems like Filecoin or Storj distribute encrypted fragments. But they require massive computational overhead, cryptocurrency mechanisms, and energy consumption far beyond SHARD's simple XOR operations. They're solutions in search of a problem, optimizing for decentralization rather than deniability.

Steganography: Hiding data in innocent-looking files. SHARD is different - shards look like random data, not innocent files, and the deniability comes from mathematical ambiguity rather than hiding.

Secret Sharing (Shamir's): Splits secrets so N of M shares are needed for reconstruction. SHARD is different - it's about creating ambiguity about which shards belong to which files, not threshold reconstruction.

SHARD remains unique in its specific approach: XOR-based sharding with collaborative pool sharing for cryptographic deniability, combined with content-addressable storage for automatic integrity verification.

Conclusions

SHARD represents appropriate technology - using the simplest cryptographic primitives that solve the problem. XOR operations for deniability, BLAKE2b hashing for integrity verification. No complex protocols, no distributed consensus, no cryptocurrency, no massive energy consumption. Just elegant mathematics and file management.

The content-addressable storage design means shards are self-verifying - the filename is the checksum. This eliminates an entire class of problems around shard corruption and verification without adding complexity to the recipe files.

The separation of bulk data (shards) from reconstruction metadata (recipes) creates genuine plausible deniability. Individual shards provide no evidence of what they contain or contribute to. As shared pools grow through collaborative use, the ambiguity compounds mathematically.

While SHARD was never adopted in practice, it demonstrates an elegant approach to a real problem: how do you enable information distribution while cryptographically protecting sources? The technical solution works. The deployment challenges - user experience, trust models, operational security - proved harder than the mathematics.

The code is available under GPL v3 at: https://bitbucket.org/cheetah100/shard/

For those interested in source protection, deniable storage, or just elegant applications of XOR cryptography and content-addressable storage, SHARD remains a useful proof of concept and learning tool.

Peter Harrison has been working in software development for over 30 years and founded the New Zealand Open Source Society in 2002. This article describes SHARD, developed in 2012 as a proof of concept for deniable file distribution.
