How to Verify a Downloaded File Has Not Been Tampered With

#security #programming #devops #beginners

You download a Linux ISO from what appears to be an official mirror. The file is 4.7 GB. How do you know that every single byte is exactly what the developers intended? How do you know nobody injected malware into the mirror, swapped a dependency in transit, or that your download did not silently corrupt at byte 3,200,417?

The answer is file hashing, and it is one of those fundamental security practices that every developer should understand and routinely use.

What a hash function does

A cryptographic hash function takes an input of any size and produces a fixed-size output, called a digest or checksum. SHA-256, for example, always produces a 64-character hexadecimal string regardless of whether the input is a 1-byte text file or a 50 GB database backup.

Key properties:

Deterministic: The same input always produces the same hash.
Avalanche effect: Changing a single bit in the input produces a completely different hash. On average about half of the output bits flip, and nothing about the input change tells you which ones.
One-way: Given a hash, you cannot reconstruct the original file. The only general approach is to guess candidate inputs and hash each one, which is computationally infeasible for anything but tiny or highly predictable inputs.
Collision-resistant: It is practically impossible to find two different inputs that produce the same hash.
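
The first two properties are easy to see for yourself. Here is a minimal sketch using Python's standard hashlib module; the helper names sha256_hex and differing_bits are my own, chosen for illustration. It hashes two inputs that differ by exactly one bit and counts how many output bits changed.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def differing_bits(a_hex: str, b_hex: str) -> int:
    """Count the bit positions where two hex digests differ."""
    return bin(int(a_hex, 16) ^ int(b_hex, 16)).count("1")


# 'd' is 0x64 and 'e' is 0x65, so these inputs differ by a single bit.
first = sha256_hex(b"hello world")
second = sha256_hex(b"hello worle")

print(len(first))                            # 64 hex characters = 256 bits
print(first == sha256_hex(b"hello world"))   # True: same input, same digest
print(differing_bits(first, second), "of 256 output bits changed")
```

If you run it, expect the last number to land somewhere near 128, which is what "about half the bits" looks like in practice.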

Common hash algorithms

MD5 (128-bit): Fast but cryptographically broken since 2004, when researchers published the first practical collisions. Do not use MD5 for security-critical verification. It is still acceptable for non-adversarial integrity checks like detecting accidental corruption.

SHA-1 (160-bit): Theoretical weaknesses were published in 2005, and it has been practically broken since 2017, when researchers at CWI Amsterdam and Google produced the first public collision (the SHAttered attack). Deprecated by major browsers and certificate authorities. Avoid for anything security-related.

SHA-256 (256-bit): The current standard for most verification purposes. Part of the SHA-2 family. No known practical attacks. This is what you should use by default.

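SHA-512 (512-bit): Also part of the SHA-2 family, with a longer digest. It is often faster than SHA-256 in software on 64-bit processors because it operates on 64-bit words natively, although CPUs with dedicated SHA-256 instructions can reverse that. The extra hash length provides more headroom against future attacks, but SHA-256 is already far beyond any known practical attack.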
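
The bit sizes above map directly to the length of the hex string you see on a download page, which is a quick way to guess which algorithm a published checksum uses. A small sketch, assuming a Python build where all four algorithms are available (some FIPS-restricted builds disable MD5); DIGEST_BITS is just an illustrative name.

```python
import hashlib

ALGORITHMS = ("md5", "sha1", "sha256", "sha512")

# digest_size is reported in bytes; each byte prints as two hex characters.
DIGEST_BITS = {name: hashlib.new(name).digest_size * 8 for name in ALGORITHMS}

for name, bits in DIGEST_BITS.items():
    print(f"{name}: {bits} bits, {bits // 4} hex characters")
```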

Practical verification workflow

When a software project provides a checksum, the verification process is straightforward:

On macOS:

shasum -a 256 downloaded-file.iso

On Linux:

sha256sum downloaded-file.iso

On Windows (PowerShell):

Get-FileHash downloaded-file.iso -Algorithm SHA256

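Compare the output string against the checksum published on the official website. Letter case does not matter (PowerShell prints uppercase hex), but every character must match. One character difference means the file is not what you expected. Also make sure the checksum itself comes from a source you trust: a hash fetched from the same compromised mirror as the file proves nothing.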
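
If you would rather script the comparison than eyeball 64 characters, the same check is a few lines of Python. This is a minimal sketch, not a hardened tool; file_digest_hex and matches_published are hypothetical helper names. It reads the file in chunks so a multi-gigabyte ISO never has to fit in memory, and it normalizes case and whitespace before comparing.

```python
import hashlib


def file_digest_hex(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Hash a file in fixed-size chunks and return the hex digest."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def matches_published(path, published_hex, algorithm="sha256"):
    """Compare a file's digest with a published checksum string."""
    expected = published_hex.strip().lower()
    return file_digest_hex(path, algorithm) == expected
```

Usage would look like matches_published("downloaded-file.iso", "<checksum copied from the official site>"). On Python 3.11 and later, hashlib.file_digest can replace the manual chunk loop.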

When this actually matters

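Supply chain attacks are not theoretical. The 2020 SolarWinds attack injected malware into a legitimate software update that was distributed to about 18,000 organizations. That case also shows the limits of checksums: the malicious code was inserted during the vendor's own build process, so the signed update matched what the vendor published. Hash verification catches tampering that happens after publication, such as a compromised mirror or a download modified in transit. It cannot vouch for what went into the build.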

Dependency verification in package managers often relies on hash checking. When npm installs a package, it verifies the tarball hash against the registry's published integrity hash. Go modules use a checksum database. Docker image digests are content-addressable hashes.
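
To make the npm case concrete: the integrity value in a lockfile is a Subresource Integrity string, which is the algorithm name, a hyphen, and the base64-encoded digest. Here is a rough sketch of producing and checking one in Python. The function names are mine, and the check is simplified: it handles a single hash per string, while real integrity strings may list several.

```python
import base64
import hashlib

SUPPORTED = {"sha256", "sha384", "sha512"}


def sri_string(data: bytes, algorithm: str = "sha512") -> str:
    """Build an SRI-style string such as 'sha512-<base64 digest>'."""
    if algorithm not in SUPPORTED:
        raise ValueError(f"unsupported algorithm: {algorithm}")
    digest = hashlib.new(algorithm, data).digest()
    return f"{algorithm}-{base64.b64encode(digest).decode('ascii')}"


def matches_integrity(data: bytes, integrity: str) -> bool:
    """Check data against a single-hash SRI-style integrity string."""
    algorithm, _, _ = integrity.partition("-")
    return sri_string(data, algorithm) == integrity
```

Note that the digest is base64 here rather than hex, so these strings will not look like the output of sha256sum even when the underlying hash is the same.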

Backup validation is another critical use case. I have seen teams discover months-old backup corruption because nobody ever verified that the backup files were intact. Hashing your backups immediately after creation and periodically re-checking the hashes catches silent corruption before it matters.
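
One way to put this into practice is a manifest: hash every file right after the backup finishes, store the result somewhere other than the backup volume, and re-check on a schedule. The following is a minimal sketch under those assumptions. build_manifest and find_problems are hypothetical names, and it only detects missing or changed files, not unexpected new ones.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash one file in chunks and return the hex digest."""
    hasher = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def build_manifest(root: Path) -> dict:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): sha256_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def find_problems(root: Path, manifest: dict) -> dict:
    """Return {relative_path: 'missing' | 'changed'} for files that fail."""
    problems = {}
    for rel_path, expected in manifest.items():
        path = root / rel_path
        if not path.is_file():
            problems[rel_path] = "missing"
        elif sha256_file(path) != expected:
            problems[rel_path] = "changed"
    return problems
```

The manifest is a plain dict, so it can be saved with the json module. An empty result from find_problems means every file recorded at backup time is still byte-for-byte identical.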

Forensic integrity in legal and audit contexts requires proving that evidence has not been modified. Chain-of-custody documentation typically includes file hashes computed at the time of collection.

Beyond simple file comparison

Hash functions enable more sophisticated integrity systems:

Merkle trees (used in Git, Bitcoin, and IPFS) hash data in a tree structure where each leaf is a data block hash and each node is the hash of its children. This lets you verify any individual block against the trusted root hash using only the sibling hashes along its path, without downloading the entire dataset.
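
A toy version makes the idea clearer. The sketch below is illustrative only and does not match the exact format of Git, Bitcoin, or IPFS. It assumes SHA-256, pairs nodes left to right, and duplicates the last node on odd-sized levels (a convention borrowed from Bitcoin). Production designs also tag leaves and inner nodes differently to prevent certain forgeries, which is skipped here. All function names are my own.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def _levels(blocks):
    """Return every tree level, from leaf hashes up to the single root."""
    if not blocks:
        raise ValueError("need at least one block")
    level = [_h(b) for b in blocks]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level = level + [level[-1]]  # duplicate last node on odd levels
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def merkle_root(blocks) -> bytes:
    return _levels(blocks)[-1][0]


def merkle_proof(blocks, index):
    """Collect (sibling_hash, sibling_is_left) pairs from leaf to root."""
    proof = []
    for level in _levels(blocks)[:-1]:
        if len(level) % 2 == 1:
            level = level + [level[-1]]
        sibling = index ^ 1  # the neighbor within the same pair
        proof.append((level[sibling], sibling < index))
        index //= 2
    return proof


def verify_proof(block: bytes, proof, root: bytes) -> bool:
    """Recompute the root from one block plus its sibling hashes."""
    node = _h(block)
    for sibling, sibling_is_left in proof:
        node = _h(sibling + node) if sibling_is_left else _h(node + sibling)
    return node == root
```

The verifier needs only the block, the root, and one hash per tree level, so the proof grows with the logarithm of the number of blocks rather than with the size of the data.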

HMAC (Hash-based Message Authentication Code) combines a hash with a secret key to create an authentication tag. This proves not just integrity but also that the sender possessed the key.
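
Python ships an hmac module for exactly this. A short sketch, assuming both sides already share a secret key through some other channel; make_tag and verify_tag are illustrative names, not part of any library.

```python
import hashlib
import hmac


def make_tag(key: bytes, message: bytes) -> str:
    """Compute an HMAC-SHA256 tag for message, returned as hex."""
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify_tag(key: bytes, message: bytes, tag_hex: str) -> bool:
    """Recompute the tag and compare it in constant time."""
    return hmac.compare_digest(make_tag(key, message), tag_hex)
```

Unlike a plain checksum, an attacker who alters the message cannot simply recompute a matching tag without the key. The constant-time comparison avoids leaking how many leading characters of a guessed tag were correct.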

Content-addressable storage uses the hash as the filename or key. If two files have the same hash, they are treated as the same file, which is safe only as long as the hash function is collision-resistant. Git's object store works this way: every blob, tree, and commit is identified by its hash, which is SHA-1 by default.
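
You can reproduce a Git blob ID by hand: for a SHA-1 repository, Git hashes a short header ("blob", the content length, and a null byte) followed by the file content. The sketch below does that and adds a toy in-memory store to show the deduplication that falls out of content addressing. ContentStore is a made-up class for illustration and uses SHA-256 keys rather than Git's format.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Compute the object ID Git assigns to a blob in a SHA-1 repository."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


class ContentStore:
    """Toy content-addressable store: the key is the SHA-256 of the value."""

    def __init__(self):
        self._objects = {}

    def put(self, content: bytes) -> str:
        key = hashlib.sha256(content).hexdigest()
        self._objects[key] = content  # identical content reuses the same key
        return key

    def get(self, key: str) -> bytes:
        content = self._objects[key]
        # Re-hashing on read detects corruption of the stored bytes.
        if hashlib.sha256(content).hexdigest() != key:
            raise ValueError("stored object does not match its key")
        return content

    def __len__(self):
        return len(self._objects)
```

Running git hash-object on a file should print the same value git_blob_id returns for that file's bytes.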

The tool

I built a file hash checker at zovo.one/free-tools/file-hash-checker that computes MD5, SHA-1, SHA-256, and SHA-512 hashes for any file directly in your browser. The file never leaves your machine -- all computation happens client-side using the Web Crypto API. Drop a file in, get all four hashes instantly, and compare against published checksums without installing any command-line tools.

I'm Michael Lip. I build free developer tools at zovo.one. 500+ tools, all private, all free.