1. Introduction & Motivation
When I began thinking about what to do for my Master’s thesis, one question kept resurfacing: How do people actually classify malware? I had always been curious about the internal logic behind malware categorization, not just at a high level, but at the level of processes, features, and decision-making.
In the end, the thesis became more of a means to an end: a structured excuse to finally build something I’d wanted for years, my own static malware analyser.
To do that, I needed a system that was:
- Reproducible, so others could follow the same steps
- Interpretable, so each decision had a clear explanation
- Automated, so large numbers of samples could be processed
- Modular, so rules, enrichment, or extraction could evolve over time
This article describes how I designed the baseline analysis pipeline, what I learned from it, and why building it was the most effective way to understand how malware works (see survey: ResearchGate).
Why Static Analysis and Not Dynamic Analysis (or Both)?
I chose static analysis because it’s the simplest, safest way to make progress fast. You can point mature tools like Ghidra at a binary and immediately get structure, imports and strings—no sandbox to provision, no risk of executing the sample, and results that are easy to trace back to rules. That makes static ideal for batch triage and for learning: it’s repeatable, quick, and interpretable.
Of course, static has blind spots. Dynamic analysis shows what the program actually does at runtime—process creation, network I/O, registry and file changes—and it can expose unpacking or decryption that static won’t see. The trade‑off is overhead and fragility: running malware safely requires instrumentation and isolation, it’s slower per sample, and many families try to evade sandboxes. My approach was to start with static to build a clear baseline, then layer enrichment (and later, hybrid methods) where deeper behaviour visibility is needed.
2. High-Level Architecture of the Baseline Pipeline
The baseline pipeline is intentionally simple. It follows a straight, modular workflow:
- Feature Extraction – gather structural and semantic information from the PE file.
- Heuristic Evaluation – apply rule-based checks to detect suspicious patterns.
- Optional Data Enrichment – pull external intelligence (e.g., VirusTotal) for reference.
- Decision Fusion – combine heuristic signals with enrichment (if available).
- Reporting – output structured evidence, classification, and metadata.
Each component has a narrow purpose and produces structured data that the next stage consumes. This keeps the design predictable and transparent.
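To make the flow concrete, here is a minimal orchestration sketch of those five stages. The function names and return shapes are illustrative stubs, not the analyser's real API; they only show how each stage's structured output feeds the next.

```python
# Hypothetical orchestration sketch of the five baseline stages. Every stage is a
# stub so the control flow runs on its own; names and shapes are illustrative.
from typing import Any, Dict, Optional


def extract_features(path: str) -> Dict[str, Any]:
    # 1. Feature extraction (in the real pipeline: the Ghidra/PyGhidra decompile stage)
    return {"program": {"path": path, "sha256": "<sha256>"}, "imports": [], "strings": []}


def evaluate_heuristics(features: Dict[str, Any]) -> Dict[str, Any]:
    # 2. Heuristic evaluation: rule hits become structured evidence plus a raw score
    return {"score": 0.0, "evidence": []}


def enrich(sha256: str, api_key: str) -> Optional[Dict[str, Any]]:
    # 3. Optional enrichment (e.g., a VirusTotal hash lookup)
    return None


def fuse(heuristics: Dict[str, Any], enrichment: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    # 4. Decision fusion: combine heuristic and enrichment signals into a label
    return {"label": "unknown", "score": heuristics["score"]}


def analyse_sample(path: str, vt_api_key: Optional[str] = None) -> Dict[str, Any]:
    features = extract_features(path)
    heuristics = evaluate_heuristics(features)
    enrichment = enrich(features["program"]["sha256"], vt_api_key) if vt_api_key else None
    verdict = fuse(heuristics, enrichment)
    # 5. Reporting: everything the run produced, in one auditable structure
    return {"features": features, "heuristics": heuristics,
            "enrichment": enrichment, "verdict": verdict}
```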
On the optional enrichment step
The enrichment layer is intentionally optional. In theory, it makes the classification stronger because the heuristic output can be cross-checked against external intelligence.
But enrichment also introduces an unexpected trade-off:
- If the heuristic analysis is roughly aligned with the enrichment data, the result improves.
- If the heuristic analysis is far off from the enrichment (e.g., near-random heuristics), the fusion process can skew the final label in unhelpful ways.
So enrichment is useful, but only when the baseline heuristics are not too noisy. This became a recurring theme in the project.
3. Extracting Features from Malware Samples
Static analysis begins with extraction: gathering every meaningful property of a file without running it (overview: IJRASET). This includes:
- PE metadata
- Section layout
- Import tables
- Strings
- Function signatures and decompiler output
- Embedded resources
- Other structural features
In the baseline, the decompiler stage writes a per-sample features JSON you can reuse downstream. Typical fields include program (name, format, language, compiler, image_base, size, sha256), functions, imports, sections, strings, and optional decompiled function records. For runs, artifacts are written under a run folder (e.g., decompile-<RUN_ID>/<sha256>.features.json).
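For orientation, this is roughly the shape of such a features file once loaded. The field names follow the description above; the values (and the inner structure of each function record) are illustrative placeholders, not real output.

```python
# Illustrative shape of a per-sample features JSON; values are placeholders.
example_features = {
    "program": {
        "name": "sample.exe",
        "format": "pe",
        "language": "x86",
        "compiler": "windows",
        "image_base": "0x00400000",
        "size": 204800,
        "sha256": "<sha256>",
    },
    "functions": [{"name": "FUN_00401000", "entry": "0x00401000"}],
    "imports": ["CreateFileA", "GetProcAddress", "InternetOpenA"],
    "sections": [{"name": ".text", "size": 4096, "flags": ["exec", "read"]}],
    "strings": ["http://example.com", "SOFTWARE\\Microsoft\\Windows\\CurrentVersion\\Run"],
    "decompiled": [{"name": "FUN_00401000", "code": "..."}],  # optional decompiled records
}
```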
Why only PE binaries (and how to adapt)
For the experiments in this article, I focused on PE (Portable Executable) binaries (.exe, .dll, .sys). Two practical reasons guided this decision:
- PE is the most widespread format in desktop malware telemetry (Windows dominance in consumer endpoints).
- Tooling and ecosystem maturity are strongest around PE (Ghidra processors, import table conventions, common packers/obfuscators), which reduces ambiguity when building a baseline.
That focus simplified feature extraction (e.g., sections, imports, entry points) and made heuristic authoring more reliable (static vs dynamic context: SJSU ScholarWorks).
Adapting to other formats is feasible with incremental changes:
- ELF (Linux): switch language/processor in Ghidra, adjust extractors for ELF sections/segments, symbol tables, and libc/syscall imports; re-map heuristics to Linux TTPs (e.g., ptrace, /proc tampering, init/systemd persistence).
- Mach-O (macOS): use Mach-O program metadata, dyld imports, code signatures/entitlements; adapt persistence/networking rules to macOS paths and launch agents.
- Android APK/Dex: pivot to bytecode/decompiled Java/Kotlin; extract manifest, permissions, receivers/services; heuristics on exfil domains, trackers, sensitive API calls.
- Scripted binaries (e.g., .NET, Python, JS packagers): add language-aware parsers, focus on runtime loaders, reflection/dynamic resolution, embedded payloads.
Concretely, the baseline needs:
- A format detector in the orchestrator to choose the right decompiler/extractor path.
- Format-specific feature schemas with a shared core (program info, strings, imports/exports, sections), plus optional blocks per format.
- A heuristic ruleset per format (or parametric rules) and a tagging map aligned to each platform’s taxonomy.
With these adaptations, the same pipeline (decompile → heuristics → optional enrichment → fusion → report) extends beyond PE with minimal structural changes.
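As an example of the first adaptation point, the format detector in the orchestrator can be as simple as a magic-byte check. The sketch below is illustrative, not the project's actual code; the function name and return tags are made up for the example.

```python
# Hypothetical format detector sketch: pick a decompiler/extractor path from magic bytes.
from pathlib import Path


def detect_format(path: Path) -> str:
    """Return a coarse format tag used to route the sample to the right extractor."""
    magic = path.read_bytes()[:4]
    if magic[:2] == b"MZ":
        return "pe"          # Windows PE (.exe/.dll/.sys)
    if magic == b"\x7fELF":
        return "elf"         # Linux ELF
    if magic in (b"\xcf\xfa\xed\xfe", b"\xce\xfa\xed\xfe", b"\xca\xfe\xba\xbe"):
        return "macho"       # Mach-O (64-bit, 32-bit, or fat binary)
    if magic[:2] == b"PK":
        return "apk_or_zip"  # APKs are ZIP containers; inspect the manifest next
    return "unknown"
```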
Why Ghidra (and not radare2, IDA, or other specialized tools)?
A few people ask why I didn't use IDA, radare2, or another specialized tool. The answer is simple:
- Ghidra is fully open source
- It provides Python bindings through PyGhidra
- It integrates a powerful decompiler
- It can be automated in a pipeline without licensing issues
That said, Ghidra’s Python bindings are not trivial to use. Because Python and Java operate differently (different memory models, threading assumptions, and API expectations), interacting with Ghidra programmatically can become clunky. But it remained the most practical option.
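To give a feel for what that interaction looks like, here is a minimal PyGhidra sketch (assuming Ghidra is installed and GHIDRA_INSTALL_DIR points at it). It is not the pipeline's actual extraction code, just the general pattern of driving Ghidra's Java API from Python.

```python
# Minimal PyGhidra sketch: open a binary, auto-analyse it, and walk its functions.
import pyghidra

pyghidra.start()  # starts the JVM and initialises Ghidra headlessly

with pyghidra.open_program("sample.exe") as flat_api:
    program = flat_api.getCurrentProgram()
    print(program.getExecutableFormat(), program.getLanguageID())
    # Iterating Java-side collections from Python is where the clunkiness shows.
    for func in program.getFunctionManager().getFunctions(True):
        print(func.getName(), func.getEntryPoint())
```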
Limits of the extraction approach
Because this is a lightweight baseline pipeline, the extraction steps are intentionally simple. This leads to a major limitation:
The analysis depends heavily on readable strings and predictable patterns. If the malware is encrypted, packed, or obfuscated, the extracted data becomes almost useless.
This constraint shapes everything downstream in the pipeline.
4. The Heuristics Engine: How the Rules Work
The heuristics engine is the simplest component of the pipeline by design. A heuristic rule is just:
- A pure function
- That examines extracted features
- And returns structured evidence if a condition is met
The logic behind the rules is intentionally basic. Most rules rely on simple string-matching or pattern detection, such as:
- Suspicious API calls
- Writable/executable sections
- Unusual import patterns
- Indicators in metadata or strings
A double limitation
Because rules depend on literal string matching:
- The input must closely match what the rule expects, or the rule will not fire.
- Encrypted, packed, or obfuscated malware evades the heuristics almost completely.
The upside is interpretability: every rule hit produces clear evidence.
The downside is coverage: many modern malware families will not match at all.
Rule shape and evidence contract
Rules are pure functions that take extracted features and return either Evidence or a miss reason. In REXIS they follow a signature like:
```python
def rule_example(features: dict, rule_score: float = 0.2, params: dict = {}):
    # return (Evidence, "reason") on hit, or (None, "miss reason") on miss
    ...
```
Evidence is structured with id, title, detail, severity (info|warn|error) and a raw score in [0,1]. The analyser attaches a reason and per-evidence categories (derived from a tagging map) to aid traceability.
Tuning is externalized: a rules config (YAML/JSON) can reweight rules, pass per‑rule params via rule_args, filter by allow_rules/deny_rules, and define label_overrides for strong signals. Tag inference (e.g., ransomware, stealer, backdoor) is computed from evidence via a configurable tagging section.
Tip: Always return a miss reason; it surfaces in rule_misses and makes rule calibration easier.
Authoring and wiring a rule (concrete example)
Here is a simplified example that flags mutex creation APIs, showing the recommended return contract and tunable rule_score:
```python
from typing import Any, Dict, Optional, Tuple

from rexis.utils.types import Evidence
from rexis.tools.heuristics_analyser.utils import get_imports_set


def rule_suspicious_mutex_creation(
    features: Dict[str, Any], rule_score: float = 0.10, params: Dict[str, Any] = {}
) -> Tuple[Optional[Evidence], Optional[str]]:
    imps = get_imports_set(features)
    mutex_apis = {"createmutexa", "createmutexw", "openmutexa", "openmutexw"}
    hits = imps & mutex_apis
    if not hits:
        return None, "no mutex-related imports found"
    return (
        Evidence(
            id="suspicious_mutex_creation",
            title="Mutex creation/manipulation",
            detail=f"Imports include: {', '.join(sorted(hits))}",
            severity="info",
            score=float(rule_score),
        ),
        f"matched mutex imports: {', '.join(sorted(hits))}",
    )
```
To wire it, register the function with a stable id in the analyser’s ruleset and add a default weight in the config. At runtime, you can raise/lower its impact via weights.suspicious_mutex_creation and pass parameters through rule_args.
Tuning via config (weights, thresholds, tags)
In your rules YAML/JSON you can control:
- scoring.combine (weighted_sum|max) and label_thresholds.{malicious,suspicious}
- weights: per-rule caps; contribution is min(1.0, ev.score * weight)
- allow_rules/deny_rules: enable/disable subsets
- label_overrides: force a label if a rule fires
- rule_args: (rule_score, params) per rule
- tagging: map evidence to tags (e.g., ransomware, stealer, backdoor) with tag_weights, threshold, top_k
This keeps rule code simple while giving you environment‑specific control.
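As a concrete illustration, a rules config touching these knobs could look roughly like the following, shown here as the equivalent Python dict rather than YAML. The key names mirror the options listed above; the rule ids and values are invented for the example, and the authoritative schema is the one in the repository's example configs.

```python
# Hypothetical rules config (the Python equivalent of a YAML/JSON file).
# Key names mirror the knobs described above; rule ids and values are illustrative.
example_rules_config = {
    "scoring": {
        "combine": "weighted_sum",  # or "max"
        "label_thresholds": {"malicious": 0.70, "suspicious": 0.40},
    },
    "weights": {"suspicious_mutex_creation": 0.5},  # per-rule cap on contribution
    "allow_rules": [],                              # empty = all rules enabled
    "deny_rules": ["noisy_rule_id"],
    "label_overrides": {"ransom_note_strings": "malicious"},
    "rule_args": {"suspicious_mutex_creation": {"rule_score": 0.15, "params": {}}},
    "tagging": {
        "threshold": 0.3,
        "top_k": 3,
        "tag_weights": {"ransomware": 1.0, "stealer": 1.0, "backdoor": 1.0},
    },
}
```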
Testing a rule quickly (ad‑hoc)
```python
from rexis.tools.heuristics_analyser.main import heuristic_classify

features = {
    "program": {"name": "sample.exe", "size": 200_000, "sha256": "...", "format": "pe", "language": "x86"},
    "imports": ["CreateMutexA", "GetProcAddress"],
    "sections": [{"name": ".text", "size": 3500, "flags": ["exec", "write"]}],
    "strings": ["http://example.com", "VirtualBox"],
}

result = heuristic_classify(features)
print(result["score"], result["label"])   # inspect overall score/label
print(result.get("evidence", []))         # list of evidence with reasons
print(result.get("tags", []))             # tag candidates with scores
print(result.get("rule_misses", []))      # why a rule didn't fire
```
5. Enrichment Through External Intelligence (Optional Step)
Enrichment was added only after early experiments revealed a problem:
The heuristics alone generated output that was “too weak” to stand on its own.
Not because the system was flawed, but because simple static heuristics have very limited visibility into modern malware. To counter that, enrichment allows the analyser to pull external data, such as (background: VirusTotal docs):
- Hash reputation
- Threat vendor classifications
- Historical submissions
- Known malicious families associated with a SHA-256
- Community tags or detection ratios
This creates a baseline to compare the heuristic output against. But enrichment was never meant to override the heuristics, only to contextualize them (what enrichment adds and caveats: Wiz Academy).
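For reference, this is roughly what the hash-reputation part of such a lookup can look like against VirusTotal's public v3 API. The files endpoint and x-apikey header follow VirusTotal's documented interface; the helper name, the return shape, and the error handling are my own illustrative choices, not the analyser's real enrichment module.

```python
# Hypothetical enrichment helper: hash reputation via VirusTotal's v3 files endpoint.
from typing import Any, Dict, Optional

import requests

VT_FILES_URL = "https://www.virustotal.com/api/v3/files/{sha256}"


def lookup_hash_reputation(sha256: str, api_key: str, timeout: float = 30.0) -> Optional[Dict[str, Any]]:
    """Return a compact reputation summary for a SHA-256, or None if VT has never seen it."""
    resp = requests.get(VT_FILES_URL.format(sha256=sha256),
                        headers={"x-apikey": api_key}, timeout=timeout)
    if resp.status_code == 404:
        return None  # hash unknown to VirusTotal
    resp.raise_for_status()
    attrs = resp.json()["data"]["attributes"]
    stats = attrs.get("last_analysis_stats", {})
    return {
        "detections": stats.get("malicious", 0),
        "total_engines": sum(stats.values()) or None,
        "suggested_label": attrs.get("popular_threat_classification", {})
                                .get("suggested_threat_label"),
        "tags": attrs.get("tags", []),
    }
```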
Why enrichment is useful but imperfect
- It improves confidence when heuristics are directionally correct.
- It destabilizes classification when heuristics are very noisy.
- It introduces dependency on an external service (API, rate limiting, coverage gaps).
Despite its imperfections, enrichment helped ground the pipeline’s outputs and made the entire system more meaningful.
6. Decision Fusion: Combining All Signals
Once both the heuristic engine and the optional enrichment layer produce their outputs, the pipeline needs a final step that decides:
What is the most reasonable label for this sample?
The decision fusion module combines the available signals:
- Heuristic evidence (rule hits, counts, weights)
- Optional enrichment (external reputation, vendor labels, known families)
The fusion logic uses a simple, weighted approach:
- If heuristics show strong, consistent evidence → they carry more weight.
- If heuristics are weak but enrichment is strong → enrichment influences the decision more.
- If both are weak → the sample defaults to suspicious or unknown.
- If they strongly disagree → the system emits a warning, and the final label is conservative.
This prevents the analyser from being “overconfident,” which is a real risk when combining noisy static heuristics with external reputation data.
Confidence-weighted fusion (with disagreement penalty)
The reconciler computes a final score using per-source confidences and weights:
S_final = clip_01( w_h C_h S_h + w_vt C_vt S_vt − penalty(|S_h − S_vt|) )
- S_h, S_vt: heuristics and VT scores in [0, 1]
- C_h, C_vt: confidences in [0, 1]
- w_h, w_vt: relative weights
- penalty(...): applied when both signals exist and disagree beyond a policy threshold
When both sources are high‑confidence yet strongly disagree, a conservative hard‑override can force a mid score and an abstain/suspicious label. Final labels are then chosen via calibrated thresholds (e.g., T_mal=0.70, T_susp=0.40).
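In code, the same policy is only a few lines. This is a minimal sketch that mirrors the formula above; the linear penalty, the override thresholds, and the default weights are illustrative assumptions, not the exact values used by the reconciler.

```python
# Minimal sketch of confidence-weighted fusion with a disagreement penalty.
# Variable names mirror the formula; the concrete constants are illustrative.
from typing import Optional, Tuple


def fuse_scores(s_h: float, c_h: float, s_vt: Optional[float], c_vt: float = 0.0,
                w_h: float = 0.6, w_vt: float = 0.4,
                disagree_threshold: float = 0.4, penalty_scale: float = 0.25,
                t_mal: float = 0.70, t_susp: float = 0.40) -> Tuple[float, str]:
    clip = lambda x: max(0.0, min(1.0, x))
    if s_vt is None:  # no enrichment available: heuristics carry the decision alone
        s_final = clip(c_h * s_h)
    else:
        gap = abs(s_h - s_vt)
        penalty = penalty_scale * gap if gap > disagree_threshold else 0.0
        s_final = clip(w_h * c_h * s_h + w_vt * c_vt * s_vt - penalty)
        if gap > 0.6 and c_h > 0.8 and c_vt > 0.8:
            return 0.5, "suspicious"  # conservative hard-override on strong disagreement
    label = "malicious" if s_final >= t_mal else "suspicious" if s_final >= t_susp else "unknown"
    return s_final, label
```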
The core idea
The fusion layer isn’t meant to be clever, just balanced.
It ensures that neither heuristics nor enrichment dominate blindly, and that the final classification reflects the overall confidence of the system rather than any individual signal.
7. Output, Reporting, and Traceability
Every run of the baseline pipeline produces structured output that makes the analysis reproducible and auditable. For each sample, the system stores:
- Extracted features
- All heuristic rule evidence
- Optional enrichment results
- The fused classification label
- Metadata (hashes, timestamps, config parameters)
- A JSON report representing the entire reasoning chain
This traceability was crucial for the thesis.
It allowed me to re-run experiments, refine rules, compare outputs, and understand how every decision was made. When you are building an analyser from scratch, having visibility into why something happened is as important as the result itself.
Why reporting matters
- It makes the pipeline reproducible.
- It allows for manual inspection when results are unclear.
- It provides ground truth for later LLM/RAG experiments.
- It helps identify weak rules, noisy features, or misaligned fusion logic.
The reporting layer ended up being one of the most valuable parts of the pipeline, even though it was initially treated as a simple output function.
Concrete artifact paths
For a run directory like baseline-analysis-<RUN_ID>/, you’ll typically see:
- Features: decompile-<RUN_ID>/<sha256>.features.json
- Heuristics: <sha256>.baseline.json
- Final report (fusion): <sha256>.report.json
- Batch runs: baseline_summary.json plus a per-run baseline-analysis-<RUN_ID>.report.json
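A quick way to sanity-check a finished run is to load the fused reports and print their headline fields. The field names used below (label, score, evidence) follow the report contents described earlier, but treat the exact keys as defined by the repository rather than by this sketch.

```python
# Illustrative post-run inspection of fused reports; adjust keys to the actual schema.
import json
from pathlib import Path

run_dir = Path("baseline-analysis-<RUN_ID>")  # placeholder run directory

for report_path in sorted(run_dir.glob("*.report.json")):
    report = json.loads(report_path.read_text())
    print(report_path.name,
          report.get("label"),
          round(report.get("score", 0.0), 2),
          len(report.get("evidence", [])), "evidence items")
```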
8. Lessons Learned from Building a Static Malware Analyser
Building a malware analyser, even a simple baseline one, teaches you a lot about both malware and tooling. A few reflections stood out.
What worked well
- The architecture was clear, modular, and easy to extend.
- The rule engine was transparent and interpretable.
- The pipeline could analyse large sets of files quickly.
- It established a solid foundation for later ML and LLM-based experiments.
What didn’t work as well
- Static analysis alone struggles with packed or encrypted malware (see recent studies: ScienceDirect, MDPI).
- The heuristic engine is only as good as the extracted strings, and they often aren't enough.
- Simple string matching has obvious limits in modern malware ecosystems.
- Enrichment, while useful, can distort results when heuristics are too weak.
What surprised me
- How quickly the heuristics break when input patterns change.
- How hard it is to design “general” rules that work across many families.
- How often malware authors rely on simple tricks that defeat static inspection.
How this shaped the next phase of the thesis
These lessons directly informed the development of the LLM + RAG-enhanced pipeline (which will be covered in a dedicated article).
Static heuristics gave me structure, data, and understanding. But not enough depth.
The next logical step was to use LLMs to interpret extracted features more flexibly, grounded by retrieval to avoid hallucinations.
The baseline pipeline provided the scaffolding needed to move forward.
Analysis Results & Repository Structure
The complete artefacts from my experiments live in the repository under analysis/. It has two main branches of outputs and a simple aggregate:
Note: The LLM + RAG pipeline is only referenced here for structure and comparison; I’ll cover its design, prompts, retrieval strategy, and results in a dedicated follow‑up article.
- analysis/baseline/: results from the baseline static pipeline (with and without VirusTotal enrichment) (link)
- analysis/llmrag/: results from the LLM + RAG pipeline (link)
- analysis/aggregation-output.json and analysis/aggregation-report.csv: quick roll-ups of the per-run outputs (link)
Directory layout (overview)
- analysis/baseline/baseline-analysis-<family>-run-2508/: baseline runs per family (e.g., botnet, ransomware, rootkit, trojan) (examples)
- analysis/baseline/baseline-analysis-<family>-run-vt-2508/: the same families with VirusTotal enrichment enabled (examples)
- analysis/llmrag/llmrag-analysis-<family>-run-2508/: LLM + RAG runs per family (examples)
Inside each run directory you’ll find the per‑sample artefacts described earlier:
- decompile-<RUN_ID>/<sha256>.features.json: extracted features
- <sha256>.baseline.json: heuristics output
- <sha256>.report.json: fused final report (label, score, trace)
- baseline-analysis-<RUN_ID>.report.json: batch-level summary for the run
- baseline_summary.json: compact summary across all processed samples
Baseline analysis: what the runs show
Across the baseline folders, you can inspect how the simple heuristics behave for different malware families and how optional VirusTotal enrichment shifts confidence and labels:
- Without enrichment (baseline-analysis-<family>-run-2508/), evidence is driven purely by structural/string-based signals; many samples land in suspicious or unknown when strings are sparse or obfuscated.
- With enrichment (baseline-analysis-<family>-run-vt-2508/), labels tend to stabilize when external reputation aligns with the heuristics; disagreement cases are explicitly noted in the fused reports via the reconciliation policy.
For a quick, high‑level view across runs, open analysis/aggregation-report.csv or the machine‑readable analysis/aggregation-output.json. These aggregate files summarize per‑run counts and label distributions without having to traverse each directory.
If you want to reproduce similar outputs, run the commands in Section 9 and point -o to a top‑level analysis/ directory; the pipeline will create run‑specific folders and the same artefact structure.
9. How to Run It Yourself
The analyser is open-source and can be run with only a few prerequisites:
Requirements
- Python environment (follow the setup instructions in the repository's README.md)
- Ghidra + PyGhidra (Ghidra installed at /opt/ghidra on Linux). If you need a fast, distro-agnostic setup, follow my guide: Ghidra on Linux: Zero Fuss Install
- A directory of PE files
- (Optional) VirusTotal API key for enrichment (set [baseline].virus_total_api_key in config/settings.toml)
Basic usage
Once installed, running the baseline pipeline is straightforward (Typer CLI):
```bash
pdm run rexis analyse baseline -i ./data/samples/<file>.exe -o ./data/analysis
```
or for batch mode:
```bash
pdm run rexis analyse baseline -i ./data/samples -o ./data/analysis --parallel 4
```
Common options:
- -i, --input: file or directory to analyse (required)
- -o, --out-dir: output directory (defaults to CWD)
- -r, --run-name: logical run name (default: UUID)
- -y, --overwrite: overwrite existing artifacts
- -p, --parallel: workers for directory mode
- --rules: path to heuristics rules config (YAML/JSON)
- -m, --min-severity: filter returned evidence (info|warn|error)
- --vt: enable VirusTotal enrichment (requires API key in config/settings.toml)
- --vt-timeout, --vt-qpm: timeout and queries-per-minute budget
Rule customization
Users can:
- Add new heuristic rules
- Tune weights and thresholds
- Enable or disable individual rules
- Adjust fusion parameters
- Add their own enrichment sources
Where to start
All documentation is available in the repository:
- Baseline pipeline guide: https://github.com/andremmfaria/rexis/blob/main/guides/BaselinePipeline.md
- Heuristic rule‑writing guide: https://github.com/andremmfaria/rexis/blob/main/guides/WritingHeuristicRules.md
- Reconciliation (fusion) details: https://github.com/andremmfaria/rexis/blob/main/guides/Reconciliation.md
- Example configurations and sample reports in the repo
This makes it easy to experiment, modify, or build your own extensions.
10. Conclusion: Why Building Tools Is the Best Way to Learn
I started this project because I wanted to understand how malware classification works.
Building my own analyser forced me to confront all the assumptions, shortcuts, limitations, and edge cases that textbooks and blog posts never mention.
What I gained was not just a working pipeline, but a practical understanding of:
- how static analysis actually behaves
- where heuristics break
- why enrichment matters
- how evidence should be combined
- and how analysts think about classification
The baseline pipeline is not perfect. It was never meant to be.
But it gave me the foundation I needed to build more advanced approaches, including the LLM + RAG pipeline that became the core of the second half of my thesis. This will be covered in a future article.
Most importantly, it taught me this:
If you want to learn how something works, build a tool that does it.
You’ll understand the entire problem far more deeply.