It started with a 2 GB backup archive and a familiar error:
```
$ unzip backup.zip
error [backup.zip]: reported length of central directory is
-14 bytes too long. Aborting.

$ 7z x backup.7z
ERROR: Can not open output file : Data Error

$ tar xzf data.tar.gz
gzip: data.tar.gz: unexpected end of file
tar: Child returned status 1
```
Three tools. Three failures. Every single file — gone.
Except they weren't gone. The archive metadata was damaged, but the actual file data — photos, documents, source code — was almost entirely intact. The tools just didn't care enough to find it.
So I built Helix Salvager — a corrupt archive recovery engine in Rust that uses a fail-forward architecture to save every file it possibly can.
The Core Idea: Stop Aborting, Start Skipping
Traditional tools treat an archive like a chain. One broken link = entire chain fails.
Helix Salvager treats each file inside an archive as independent. If file #3 of 10 is corrupt, you still get #1, #2, #4 through #10.
```
Standard: File1 ✓ → File2 ✓ → File3 ✗ → ABORT                    (0 files saved)
Helix:    File1 ✓ → File2 ✓ → File3 ✗ → File4 ✓ → … → File10 ✓   (9 files saved)
```
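The skip-don't-abort loop can be sketched in a few lines. Everything here (`Entry`, `decompress`, `salvage`) is an illustrative stand-in for the real engine's types, not its actual API:

```rust
// Hypothetical sketch of the fail-forward loop: every entry is tried
// independently, and a failure is logged instead of aborting the run.
struct Entry {
    name: &'static str,
    data: Vec<u8>,
}

// Stand-in decoder: treats an empty payload as "corrupt".
fn decompress(entry: &Entry) -> Result<Vec<u8>, String> {
    if entry.data.is_empty() {
        Err(format!("{}: truncated stream", entry.name))
    } else {
        Ok(entry.data.clone())
    }
}

fn salvage(entries: &[Entry]) -> (Vec<Vec<u8>>, Vec<String>) {
    let mut recovered = Vec::new();
    let mut errors = Vec::new();
    for entry in entries {
        match decompress(entry) {
            Ok(bytes) => recovered.push(bytes), // keep what works
            Err(e) => errors.push(e),           // log it and move on
        }
    }
    (recovered, errors)
}
```

The point is structural: the error path is data, not control flow, so one bad entry can never terminate the walk over the others.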
That's the philosophy. The implementation is where it gets interesting.
Three Engines. One Pipeline.
```
┌──────────────────────────────────────────────────────────┐
│                  CORRUPT ARCHIVE INPUT                   │
│          (ZIP, 7z, gzip, bzip2, xz, tar, raw)            │
└──────────────────────┬───────────────────────────────────┘
                       │
         ┌─────────────┼─────────────┐
         ▼             ▼             ▼
   ┌───────────┐ ┌───────────┐ ┌───────────┐
   │ Engine A  │ │ Engine B  │ │ Engine C  │
   │ Fail-Fwd  │ │  Zombie   │ │  Magic    │
   │ Extractor │ │   LZMA    │ │  Carver   │
   └─────┬─────┘ └─────┬─────┘ └─────┬─────┘
         └─────────────┼─────────────┘
                       ▼
         ┌─────────────────────────┐
         │    RECOVERED FILES      │
         │  SHA-256 deduplication  │
         │   Per-type breakdown    │
         └─────────────────────────┘
```
Engine A — Fail-Forward Extraction
The ZIP format stores each file entry with its own local header. My extractor wraps every decompression attempt in catch_unwind isolation — if decompressing one entry panics or errors, the engine logs it and continues to the next.
CRC mismatches are also bypassed entirely. Yes, you might get a file with a few corrupt bytes. But you get the file — which is infinitely better than nothing.
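Panic isolation with `std::panic::catch_unwind` looks roughly like this. `risky_decode` is a made-up stand-in for a decompressor that blows up on malformed input; the real engine's per-entry codecs are more involved:

```rust
use std::panic::{self, AssertUnwindSafe};

// Stand-in for a decoder that panics on a malformed length prefix:
// it reads the first byte as a length and slices that many bytes.
fn risky_decode(data: &[u8]) -> Vec<u8> {
    let len = data[0] as usize;
    data[1..1 + len].to_vec() // panics if the slice runs past the end
}

// The isolation wrapper: a panic in the decoder becomes None,
// and the caller simply moves on to the next archive entry.
fn decode_isolated(data: &[u8]) -> Option<Vec<u8>> {
    panic::catch_unwind(AssertUnwindSafe(|| risky_decode(data))).ok()
}
```

`AssertUnwindSafe` is needed because the closure borrows the input slice; the trade-off is that you promise the borrowed state is still consistent after an unwind, which holds here since the decoder only reads.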
Engine B — Zombie LZMA Decoder
7z archives are the hardest to recover because LZMA decompression is stateful — corruption early in the stream can make everything after it undecodable.
My solution: Shannon entropy validation + chunked retry.
Entropy Classification:
```
H < 1.5        → Padding / empty   (discard)
1.5 ≤ H ≤ 7.85 → Structured data   (keep — this is your content)
H > 7.85       → Random noise      (mark as tainted)
```
When full-stream decompression fails, the Zombie decoder slides forward byte-by-byte, trying to find the next decodable region. Every output byte gets a taint flag — clean vs. reconstructed — so you know exactly which parts of your recovered file to trust.
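The entropy gate itself is a small, testable function. A sketch, using the thresholds from the classification above (the function names are mine, not the engine's API):

```rust
// Shannon entropy in bits per byte: H = -Σ p(b) · log2 p(b)
// over the byte-value histogram of the buffer.
fn shannon_entropy(data: &[u8]) -> f64 {
    let mut counts = [0u64; 256];
    for &b in data {
        counts[b as usize] += 1;
    }
    let n = data.len() as f64;
    counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / n;
            -p * p.log2()
        })
        .sum()
}

// Classify a candidate region using the article's thresholds.
fn classify(data: &[u8]) -> &'static str {
    let h = shannon_entropy(data);
    if h < 1.5 {
        "padding"
    } else if h <= 7.85 {
        "structured"
    } else {
        "noise"
    }
}
```

Zeroed sectors score near 0 bits, English text and most file formats land in the middle band, and a uniformly random buffer approaches the 8-bit ceiling, which is why the 7.85 cutoff flags undecodable garbage.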
Engine C — Magic Header Carver
When archive metadata is completely destroyed, Engine C goes raw. It uses an Aho-Corasick multi-pattern automaton to scan every byte of the input for 29 known file signatures — in a single O(n) pass.
| Category | Types |
|---|---|
| Images | JPEG, PNG, GIF, BMP, WebP, TIFF, ICO, PSD |
| Documents | PDF, among others |
| Audio | WAV, MP3, FLAC, OGG |
| Video | MP4, AVI |
| Executables | ELF, PE/EXE, WASM |
| Archives | ZIP, RAR, 7z, TAR |
Every match goes through structure validation (PNG chunk integrity, JPEG marker sequences, ELF header fields) and SHA-256 deduplication. The same JPEG bytes matched at multiple offsets are extracted only once.
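The carve-and-dedup idea can be shown with a deliberately naive scan. This sketch walks every offset against a small signature table (the real engine uses an Aho-Corasick automaton for a single O(n) pass) and dedups on a hash of a window after each match (std's `DefaultHasher` here, standing in for SHA-256):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

// A handful of well-known magic numbers (the engine knows 29).
const SIGNATURES: &[(&str, &[u8])] = &[
    ("png", b"\x89PNG\r\n\x1a\n"),
    ("jpeg", &[0xFF, 0xD8, 0xFF]),
    ("zip", b"PK\x03\x04"),
    ("elf", b"\x7fELF"),
];

// Scan for signatures; report each (offset, type) hit, skipping
// duplicates whose post-match window hashes identically.
fn carve(data: &[u8]) -> Vec<(usize, &'static str)> {
    let mut hits = Vec::new();
    let mut seen = HashSet::new();
    for i in 0..data.len() {
        for &(name, magic) in SIGNATURES {
            if data[i..].starts_with(magic) {
                let end = (i + 64).min(data.len());
                let mut h = DefaultHasher::new();
                data[i..end].hash(&mut h);
                if seen.insert(h.finish()) {
                    hits.push((i, name));
                }
            }
        }
    }
    hits
}
```

The naive inner loop is O(n · patterns); swapping it for an Aho-Corasick automaton is what makes the 29-signature scan a single linear pass over the input.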
Benchmarks Against Standard Tools
Tested on intentionally corrupted archives:
| Scenario | unzip | 7z | Helix Salvager |
|---|---|---|---|
| 1 dead sector in 7-file ZIP | ABORT | ABORT | 6/7 files ✓ |
| 20% sectors zeroed | ABORT | ABORT | 3/7 files ✓ |
| Central directory destroyed | ABORT | 0 files | 5/7 files ✓ |
| Truncated at 50% | ABORT | ABORT | 4/7 files ✓ |
| Heavy bitrot (100 bit flips) | ABORT | ABORT | 2/7 files ✓ |
Not perfect — but something is always better than nothing.
The Test Suite
```
179 tests, 0 failures, 0 clippy warnings
──────────────────────────────────────────────────
 29 unit tests (core engine)
 35 integration tests (real-life archive files)
 21 real-world corruption simulation tests
 12 stress tests (edge cases, zip bombs, 100MB)
 42 regression tests
  1 doc test
──────────────────────────────────────────────────
```
I tested against real corrupt files, not synthetic ones — including archives from corkami/pocs (Ange Albertini's legendary file format PoCs), real GNU hello tarballs, 7-Zip's official XZ distribution, and W3C PDFs and PNGs.
Corruption patterns: sector death, bitrot, USB transfer errors, NAND flash degradation, truncation, header destruction, TCP packet reordering, power loss mid-write, and combinations of all of the above.
Using It
CLI:
```bash
git clone https://github.com/vedLinuxian/helix-salvager.git
cd helix-salvager
cargo build --release

# Recover to directory
./target/release/salvager recover broken.zip -o ./recovered/

# Recover as ZIP
./target/release/salvager recover damaged.7z -o recovered.zip --zip

# Inspect without extracting
./target/release/salvager inspect suspicious_file.bin

# JSON output for scripting
./target/release/salvager recover data.zip -o ./out/ --json --quiet
```
Web UI:
```bash
./start.sh
# Drag-and-drop at http://localhost:3000
```
Rust library:
```rust
use salvager_core::SalvageEngine;

let data = std::fs::read("corrupt_archive.zip")?;
let engine = SalvageEngine::new();
let report = engine.salvage(&data, Some(&|phase, pct| {
    println!("[{pct}%] {phase}");
}));

println!("Recovered {} files ({} bytes)",
    report.files_salvaged, report.bytes_recovered);
```
Docker:
```bash
docker build -t helix-salvager .
docker run -p 3000:3000 helix-salvager
```
What's Next
The roadmap I'm most excited about:
- WASM build — Run the entire recovery engine in the browser, no server needed
- Streaming/mmap mode — Process multi-gigabyte files without loading into RAM
- RAR native extraction — True RAR v4/v5 header parsing (not just magic carving)
- Confidence scoring — Per-file recovery percentage so you know what to trust
- Python bindings — `pip install helix-salvager`
Why Rust?
Short answer: because this is exactly the domain Rust was made for.
Long answer: corrupt data recovery means touching malformed bytes, invalid lengths, and decompressors that were never designed to handle what you're throwing at them. Rust's ownership model and catch_unwind isolation let me do aggressive fault-tolerance without risking undefined behavior in the recovery engine itself. Zero unsafe blocks in the core library. 0 clippy warnings. Production-grade from day one.
Links
- GitHub: github.com/vedLinuxian/helix-salvager
- Website: vedlinuxian.github.io/helix-salvager
- License: MIT / Apache-2.0 (dual)
If this has ever happened to you — an archive that every tool refused to open — give it a try. And if you find a corruption pattern it can't handle, open an issue. That's exactly how the test suite grows.
Built with 🧬 by Ved Prakash Pandey