TheLinuxMan

Posted on Mar 9

I Built a Corrupt Archive Recovery Engine in Rust — Because Every Tool I Tried Just Gave Up

#rust #dataengineering

It started with a 2 GB backup archive and a familiar error:

$ unzip backup.zip
error [backup.zip]:  reported length of central directory is
  -14 bytes too long. Aborting.

$ 7z x backup.7z
ERROR: Can not open output file : Data Error

$ tar xzf data.tar.gz
gzip: data.tar.gz: unexpected end of file
tar: Child returned status 1

Three tools. Three failures. Every single file — gone.

Except they weren't gone. The archive metadata was damaged, but the actual file data — photos, documents, source code — was almost entirely intact. The tools just didn't care enough to find it.

So I built Helix Salvager — a corrupt archive recovery engine in Rust that uses a fail-forward architecture to save every file it possibly can.

The Core Idea: Stop Aborting, Start Skipping

Traditional tools treat an archive like a chain. One broken link = entire chain fails.

Helix Salvager treats each file inside an archive as independent. If file #3 of 10 is corrupt, you still get #1, #2, #4 through #10.

Standard:  File1 ✓ → File2 ✓ → File3 ✗ → ABORT  (0 files saved)
Helix:     File1 ✓ → File2 ✓ → File3 ✗ → File4 ✓ → ... → File10 ✓  (9 files saved)

That's the philosophy. The implementation is where it gets interesting.

Three Engines. One Pipeline.

┌──────────────────────────────────────────────────────────┐
│                  CORRUPT ARCHIVE INPUT                   │
│          (ZIP, 7z, gzip, bzip2, xz, tar, raw)           │
└──────────────────────┬───────────────────────────────────┘
                       │
         ┌─────────────┼─────────────┐
         ▼             ▼             ▼
   ┌───────────┐ ┌───────────┐ ┌───────────┐
   │  Engine A  │ │  Engine B  │ │  Engine C  │
   │ Fail-Fwd  │ │  Zombie   │ │  Magic    │
   │ Extractor │ │  LZMA     │ │  Carver   │
   └─────┬─────┘ └─────┬─────┘ └─────┬─────┘
         └─────────────┼─────────────┘
                       ▼
         ┌─────────────────────────┐
         │     RECOVERED FILES     │
         │  SHA-256 deduplication  │
         │  Per-type breakdown     │
         └─────────────────────────┘

Engine A — Fail-Forward Extraction

The ZIP format stores each file entry with its own local header. My extractor wraps every decompression attempt in catch_unwind isolation — if decompressing one entry panics or errors, the engine logs it and continues to the next.

CRC mismatches are also bypassed entirely. Yes, you might get a file with a few corrupt bytes. But you get the file — which is infinitely better than nothing.

Engine B — Zombie LZMA Decoder

7z archives are the hardest to recover because LZMA decompression is stateful — corruption early in the stream can make everything after it undecodable.

My solution: Shannon entropy validation + chunked retry.

Entropy Classification:
  H < 1.5    →  Padding/Empty      (discard)
  1.5–7.85   →  Structured Data    (keep — this is your content)
  H > 7.85   →  Random Noise       (mark as tainted)

When full-stream decompression fails, the Zombie decoder slides forward byte-by-byte, trying to find the next decodable region. Every output byte gets a taint flag — clean vs. reconstructed — so you know exactly which parts of your recovered file to trust.

Engine C — Magic Header Carver

When archive metadata is completely destroyed, Engine C goes raw. It uses an Aho-Corasick multi-pattern automaton to scan every byte of the input for 29 known file signatures — in a single O(n) pass.

Category	Types
Images	JPEG, PNG, GIF, BMP, WebP, TIFF, ICO, PSD
Documents	PDF
Audio	WAV, MP3, FLAC, OGG
Video	MP4, AVI
Executables	ELF, PE/EXE, WASM
Archives	ZIP, RAR, 7z, TAR

Every match goes through structure validation (PNG chunk integrity, JPEG marker sequences, ELF header fields) and SHA-256 deduplication. The same JPEG bytes matching at multiple offsets only gets extracted once.

Benchmarks Against Standard Tools

Tested on intentionally corrupted archives:

Scenario	`unzip`	`7z`	Helix Salvager
1 dead sector in 7-file ZIP	ABORT	ABORT	6/7 files ✓
20% sectors zeroed	ABORT	ABORT	3/7 files ✓
Central directory destroyed	ABORT	0 files	5/7 files ✓
Truncated at 50%	ABORT	ABORT	4/7 files ✓
Heavy bitrot (100 bit flips)	ABORT	ABORT	2/7 files ✓

Not perfect — but something is always better than nothing.

The Test Suite

179 tests, 0 failures, 0 clippy warnings

──────────────────────────────────────────────────
  29  unit tests (core engine)
  35  integration tests (real-life archive files)
  21  real-world corruption simulation tests
  12  stress tests (edge cases, zip bombs, 100MB)
  42  regression tests
   1  doc test
──────────────────────────────────────────────────

I tested against real corrupt files, not synthetic ones — including archives from corkami/pocs (Ange Albertini's legendary file format PoCs), real GNU hello tarballs, 7-Zip's official XZ distribution, and W3C PDFs and PNGs.

Corruption patterns: sector death, bitrot, USB transfer errors, NAND flash degradation, truncation, header destruction, TCP packet reordering, power loss mid-write, and combinations of all of the above.

Using It

CLI:

git clone https://github.com/vedLinuxian/helix-salvager.git
cd helix-salvager
cargo build --release

# Recover to directory
./target/release/salvager recover broken.zip -o ./recovered/

# Recover as ZIP
./target/release/salvager recover damaged.7z -o recovered.zip --zip

# Inspect without extracting
./target/release/salvager inspect suspicious_file.bin

# JSON output for scripting
./target/release/salvager recover data.zip -o ./out/ --json --quiet

Web UI:

./start.sh
# Drag-and-drop at http://localhost:3000

Rust library:

use salvager_core::SalvageEngine;

let data = std::fs::read("corrupt_archive.zip")?;
let engine = SalvageEngine::new();

let report = engine.salvage(&data, Some(&|phase, pct| {
    println!("[{pct}%] {phase}");
}));

println!("Recovered {} files ({} bytes)",
    report.files_salvaged, report.bytes_recovered);

Docker:

docker build -t helix-salvager .
docker run -p 3000:3000 helix-salvager

What's Next

The roadmap I'm most excited about:

WASM build — Run the entire recovery engine in the browser, no server needed
Streaming/mmap mode — Process multi-gigabyte files without loading into RAM
RAR native extraction — True RAR v4/v5 header parsing (not just magic carving)
Confidence scoring — Per-file recovery percentage so you know what to trust
Python bindings — pip install helix-salvager

Why Rust?

Short answer: because this is exactly the domain Rust was made for.

Long answer: corrupt data recovery means touching malformed bytes, invalid lengths, and decompressors that were never designed to handle what you're throwing at them. Rust's ownership model and catch_unwind isolation let me do aggressive fault-tolerance without risking undefined behavior in the recovery engine itself. Zero unsafe blocks in the core library. 0 clippy warnings. Production-grade from day one.

Links

GitHub: github.com/vedLinuxian/helix-salvager
Website: vedlinuxian.github.io/helix-salvager
License: MIT / Apache-2.0 (dual)

If this has ever happened to you — an archive that every tool refused to open — give it a try. And if you find a corruption pattern it can't handle, open an issue. That's exactly how the test suite grows.

Built with 🧬 by Ved Prakash Pandey

DEV Community