AnimeForLife191

Posted on Mar 25

Building a Malware Scanner in Rust That Scans 1.4 Million Files per Minute

#rust #cybersecurity #programming #learning

For the past few months, I've been building my own cybersecurity tools in Rust. I wanted to learn how these tools work behind the scenes, and hopefully help anyone who wants to do the same.

This is my Malware Scanner named Takeri.

Why I'm Building This

I'm a 21-year-old cybersecurity student about to graduate with my associate's degree, and along the way I found an interest in programming, mainly in Rust. I wanted to find a way to combine both fields, not just to build something meaningful, but hopefully to turn it into a career and help others understand how these tools actually work.

The problem was… I realized I didn’t fully understand how a lot of these cybersecurity tools worked behind the scenes either.

So instead of just using them, I decided to start building my own.

What Takeri Is

Takeri is a malware scanner focused on detecting known threats using signature-based detection. It works by comparing file hashes (MD5 and SHA256) against a large database of known malicious signatures.

By leveraging ClamAV signature databases (.hdb and .hsb), Takeri has access to over 500,000 known signatures.

In addition to hash matching, it also performs magic byte analysis to verify that files are actually what they claim to be, helping detect suspicious files that may be disguised with misleading extensions.

How It Works

Takeri starts by downloading the main.cvd and daily.cvd databases from ClamAV. From these, it extracts the .hdb and .hsb signature files and loads them into memory.
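
To make the extraction step a bit more concrete, here is a minimal sketch (not Takeri's actual code) of the first thing that has to happen to a downloaded .cvd file. It assumes the layout described in the ClamAV documentation: a 512-byte text header starting with "ClamAV-VDB:", followed by a compressed tar archive that holds the individual database files. The function name split_cvd is made up for this example, and the download and decompression steps are left out to keep it dependency-free.

```rust
/// Size of the plain-text header at the start of a .cvd file
/// (assumption taken from the ClamAV documentation).
const CVD_HEADER_LEN: usize = 512;

/// Splits raw .cvd bytes into (header text, compressed payload).
/// Returns None if the data is too short or lacks the expected prefix.
/// `split_cvd` is a hypothetical helper, not part of Takeri.
fn split_cvd(data: &[u8]) -> Option<(String, &[u8])> {
    if data.len() < CVD_HEADER_LEN || !data.starts_with(b"ClamAV-VDB:") {
        return None;
    }
    let (header, payload) = data.split_at(CVD_HEADER_LEN);
    // The header is padded out to 512 bytes, so strip trailing padding.
    let header = String::from_utf8_lossy(header)
        .trim_end_matches(|c: char| c.is_whitespace() || c == '\0')
        .to_string();
    Some((header, payload))
}
```

The payload returned here would then go to a gzip and tar reader, which is where the .hdb and .hsb files come out.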

Each signature includes:

The hash (MD5 or SHA256)
The expected file size
The malware name

These are stored using HashMap and HashSet for fast lookups:

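use std::collections::{HashMap, HashSet};

// Metadata stored for each known hash.
pub struct SignatureInfo {
    pub size: SignatureSize, // expected file size(s) for this hash
    pub name: String,        // malware name from the signature
}

pub struct SignatureDb {
    // Raw digest bytes -> signature metadata.
    pub md5_signatures: HashMap<[u8; 16], SignatureInfo>,
    pub sha256_signatures: HashMap<[u8; 32], SignatureInfo>,

    // File sizes that appear in each signature set (used for size filtering).
    pub md5_sizes: HashSet<u64>,
    pub sha256_sizes: HashSet<u64>,

    // Every file size that appears in any signature.
    pub all_sizes: HashSet<u64>,
}

pub enum SignatureSize {
    // The hash is only valid for these exact file sizes.
    Specific { size: HashSet<u64> },
    // The signature does not pin down a file size.
    Wildcard,
}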
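
For anyone curious how a signature line turns into these fields, here is a rough sketch of a parser. It assumes the ClamAV hash signature layout of HashString:FileSize:MalwareName, where the size may be * for a wildcard. The types ParsedHash and ParsedSignature and the function names are invented for this example and are not Takeri's real code. Hash lengths other than MD5 (16 bytes) and SHA256 (32 bytes) are simply skipped.

```rust
use std::convert::TryInto;

/// Hypothetical types for this sketch only.
#[derive(Debug, PartialEq)]
enum ParsedHash {
    Md5([u8; 16]),
    Sha256([u8; 32]),
}

#[derive(Debug, PartialEq)]
struct ParsedSignature {
    hash: ParsedHash,
    size: Option<u64>, // None means the size field was the wildcard "*"
    name: String,
}

/// Decodes a hex string into bytes without any external crate.
fn decode_hex(s: &str) -> Option<Vec<u8>> {
    if s.len() % 2 != 0 || !s.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    (0..s.len())
        .step_by(2)
        .map(|i| u8::from_str_radix(&s[i..i + 2], 16).ok())
        .collect()
}

/// Parses one "hash:size:name" line; any extra trailing fields are ignored.
fn parse_signature_line(line: &str) -> Option<ParsedSignature> {
    let mut fields = line.trim().split(':');
    let hash_hex = fields.next()?;
    let size_field = fields.next()?;
    let name = fields.next()?.to_string();

    let size = match size_field {
        "*" => None,
        n => Some(n.parse::<u64>().ok()?),
    };

    let bytes = decode_hex(hash_hex)?;
    let hash = match bytes.len() {
        16 => ParsedHash::Md5(bytes.try_into().ok()?),
        32 => ParsedHash::Sha256(bytes.try_into().ok()?),
        _ => return None, // other digest lengths are out of scope here
    };
    Some(ParsedSignature { hash, size, name })
}
```

Storing the digest as raw bytes instead of a hex string is what lets it be used directly as a [u8; 16] or [u8; 32] key in the maps above.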

Scanning Process

File Size Filtering
Before doing any expensive work, it checks if the file size matches any known signature sizes.

If it doesn’t match, the file is skipped entirely.

This avoids unnecessary hashing and is one of the biggest performance optimizations in the scanner.
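
A minimal sketch of that size check, assuming the set of known sizes has already been built. The function name file_passes_size_filter is made up for this example, and it ignores wildcard-size signatures, which would need separate handling because they can match a file of any size.

```rust
use std::collections::HashSet;
use std::fs;
use std::io;
use std::path::Path;

/// Returns Ok(true) if the file's size appears in `known_sizes`,
/// meaning it is worth hashing. Hypothetical helper, not Takeri's code.
fn file_passes_size_filter(path: &Path, known_sizes: &HashSet<u64>) -> io::Result<bool> {
    // Reading metadata is cheap compared to reading and hashing the whole file.
    let len = fs::metadata(path)?.len();
    Ok(known_sizes.contains(&len))
}
```

The check costs one metadata call and one set lookup per file, which is why it pays off so well when most files on a disk have sizes that no signature mentions.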

Magic Byte Analysis
Next, Takeri reads the file’s magic bytes and compares them to its extension.

If the file claims to be one type but the actual format doesn’t match, it gets flagged as suspicious.

(Side note: while writing this, I realized that suspicious files currently skip signature matching entirely, which is something I plan to fix.)
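
Here is a small sketch of what a magic byte check can look like. The list of formats and allowed extensions is a tiny, hand-picked assumption for illustration, and names like sniff, Verdict, and check_extension are invented here, so treat it as the general idea and not as Takeri's implementation.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

#[derive(Debug, PartialEq)]
enum Verdict {
    Consistent,                        // extension fits the detected format
    Mismatch { detected: &'static str }, // extension does not fit
    Unknown,                           // format not in our small table
}

/// Identifies a handful of formats from their leading bytes.
fn sniff(header: &[u8]) -> Option<&'static str> {
    if header.starts_with(b"\x89PNG\r\n\x1a\n") {
        Some("png")
    } else if header.starts_with(b"%PDF-") {
        Some("pdf")
    } else if header.starts_with(b"PK\x03\x04") {
        Some("zip")
    } else if header.starts_with(b"MZ") {
        Some("exe")
    } else {
        None
    }
}

/// Extensions considered normal for each detected format (illustrative only).
fn allowed_extensions(kind: &str) -> &'static [&'static str] {
    match kind {
        "png" => &["png"],
        "pdf" => &["pdf"],
        "zip" => &["zip", "jar", "docx", "xlsx"],
        "exe" => &["exe", "dll"],
        _ => &[],
    }
}

/// Compares a file extension against the format detected from the header.
fn check_extension(ext: &str, header: &[u8]) -> Verdict {
    let ext = ext.to_ascii_lowercase();
    match sniff(header) {
        None => Verdict::Unknown,
        Some(kind) if allowed_extensions(kind).iter().any(|e| *e == ext) => Verdict::Consistent,
        Some(kind) => Verdict::Mismatch { detected: kind },
    }
}

/// Reads up to 8 leading bytes of a file and checks them against its extension.
/// A single read may return fewer bytes than requested; that is fine for a sketch.
fn check_file(path: &Path) -> io::Result<Verdict> {
    let mut header = [0u8; 8];
    let n = File::open(path)?.read(&mut header)?;
    let ext = path.extension().and_then(|e| e.to_str()).unwrap_or("");
    Ok(check_extension(ext, &header[..n]))
}
```

For example, a file named invoice.pdf whose first two bytes are MZ comes back as a Mismatch with "exe" as the detected format, which is exactly the kind of disguise this step is meant to catch.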

Hashing and Signature Matching
Finally, if the file passes the earlier checks, Takeri hashes it (MD5 and/or SHA256 depending on the case) and compares it against the loaded signature database.

If a match is found, the file is flagged as infected.
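
The lookup itself can be sketched like this. The two types are copied from the structs shown earlier, while match_md5 is a name invented for this example. It assumes the MD5 digest has already been computed elsewhere (for instance with a hashing crate), which keeps the sketch free of dependencies. A SHA256 lookup would look the same with a [u8; 32] key.

```rust
use std::collections::{HashMap, HashSet};

pub enum SignatureSize {
    Specific { size: HashSet<u64> },
    Wildcard,
}

pub struct SignatureInfo {
    pub size: SignatureSize,
    pub name: String,
}

/// Returns the malware name if `digest` is a known MD5 signature and the
/// file size is acceptable for that signature. Hypothetical helper.
fn match_md5<'a>(
    sigs: &'a HashMap<[u8; 16], SignatureInfo>,
    digest: &[u8; 16],
    file_size: u64,
) -> Option<&'a str> {
    let info = sigs.get(digest)?;
    let size_ok = match &info.size {
        // Wildcard signatures accept any file size.
        SignatureSize::Wildcard => true,
        // Otherwise the size must be one of the sizes listed for this hash.
        SignatureSize::Specific { size } => size.contains(&file_size),
    };
    if size_ok {
        Some(info.name.as_str())
    } else {
        None
    }
}
```

Checking the size again after the hash matches is cheap, and it guards against the rare case where a hash is known but only for a different file size.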

Parallel Processing
To improve performance, Takeri uses Rayon to scan files in parallel across multiple threads.

This allows the scanner to fully utilize available CPU cores and significantly increases throughput.
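
To show the shape of the idea without pulling in a crate, here is a sketch that uses the standard library's scoped threads in place of Rayon. With Rayon, the same thing is usually a one-line change from iter() to par_iter(), and Rayon also balances work across threads instead of using fixed chunks. The function scan_parallel and its scan_one callback are hypothetical names for this example.

```rust
use std::path::PathBuf;
use std::thread;

/// Runs `scan_one` over all paths using up to `workers` threads and
/// returns (path, flagged) pairs in the original order.
fn scan_parallel<F>(paths: &[PathBuf], workers: usize, scan_one: F) -> Vec<(PathBuf, bool)>
where
    F: Fn(&PathBuf) -> bool + Sync,
{
    let workers = workers.max(1);
    // Ceiling division so every path lands in exactly one chunk.
    let chunk_len = ((paths.len() + workers - 1) / workers).max(1);
    let scan_one = &scan_one;

    thread::scope(|s| {
        // Spawn one thread per chunk of paths.
        let handles: Vec<_> = paths
            .chunks(chunk_len)
            .map(|chunk| {
                s.spawn(move || {
                    chunk
                        .iter()
                        .map(|p| (p.clone(), scan_one(p)))
                        .collect::<Vec<_>>()
                })
            })
            .collect();

        // Join in spawn order, which keeps results in input order.
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker thread panicked"))
            .collect()
    })
}
```

Fixed chunks are the simplest way to split the work, but they can leave threads idle when one chunk happens to hold the large files, which is one reason a work-stealing library like Rayon is a good fit for scanning.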

Performance

Now for the interesting part...PERFORMANCE.
Linux
On a machine running Arch Linux with an Intel i9-13900KF and an NVMe SSD, Takeri was able to scan the root directory and process 1.4 million files in about 45 seconds on a cold run.

When files are cached by the system, that time drops even further, often cutting the scan time in half or more.

Windows
On a machine running Windows 11 with a two-core CPU at 1.1 GHz and an unknown storage device holding only 60 GB, it can scan C:/ at 50,000 files.........every 10 minutes. Lightning speeds.
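
To put both results on the same scale, here is the arithmetic as a tiny sketch. It uses only the rough figures from the two runs above (1.4 million files in about 45 seconds, and 50,000 files in 10 minutes), so the output is an estimate, not a benchmark.

```rust
/// Average throughput in files per second.
fn files_per_second(files: u64, seconds: u64) -> f64 {
    files as f64 / seconds as f64
}

/// Prints the comparison using the approximate figures reported above.
fn print_comparison() {
    let linux = files_per_second(1_400_000, 45); // cold run on Linux
    let windows = files_per_second(50_000, 600); // 10 minutes on Windows
    println!("Linux:   {:.0} files/s", linux);
    println!("Windows: {:.0} files/s", windows);
    println!("Ratio:   {:.0}x", linux / windows);
}
```

That works out to roughly 31,000 files per second on the Linux machine against about 83 on the Windows one, a gap of around 370x.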

Why the Difference?

At the moment, I don’t have access to higher-end Windows hardware to fully compare results, but there are a few likely factors:

Hardware limitations (most significant)
Disk speed differences
OS-level file system performance
Thread scheduling differences between Linux and Windows

Testing in a VM produced similar or worse results, which further points to hardware and I/O constraints as the primary bottleneck.

Takeri performs extremely well on modern hardware, but like most scanners, its performance is heavily dependent on disk speed and system resources.

What's Next

Takeri is still very early, and there’s a lot I want to improve and build on.

Right now, the focus is on making the scanner smarter and more efficient. While signature-based detection works well for known threats, it has limitations, so I want to start expanding beyond that.

Some of the next steps include:

Smarter file selection
Improving how files are chosen for scanning to reduce unnecessary work and improve performance on lower-end systems.
Heuristic scanning
Adding basic behavior and pattern-based detection to catch suspicious files that don’t match known signatures.
YARA rule support
Integrating custom rule-based detection to allow more flexible and advanced scanning.
Archive scanning
Being able to scan inside compressed files like .zip and .tar, which is where malware often hides.
Better scan modes
Introducing options like quick scans and more configurable behavior.
Improved output and reporting
Making results easier to read, export, and actually useful for users.

Beyond Takeri

Takeri is just one part of a larger project I’m working on called Shuhari CyberForge.

The goal is to build a small suite of cybersecurity tools that are:

Open source
Transparent
Educational

Right now, that includes:

Shugo - a Windows security auditor
Takeri - the malware scanner

And eventually:

A network security tool
A password manager
And more (still figuring that part out)

I’m still learning a lot as I build this, and that’s honestly the main goal.

Even if this never becomes a full antivirus or a widely used tool, it’s already been a huge learning experience, and if other people can learn something from it too, that’s a win. If you're interested in this at all, please go star the repo; it would mean a lot.