1. Introduction & Motivation
When I began thinking about what to do for my Master’s thesis, one question kept resurfacing: How do people actually classify malware? I had always been curious about the internal logic behind malware categorization, not just at a high level, but at the level of processes, features, and decision-making.
In the end, the thesis became more of a means to an end: a structured excuse to finally build something I’d wanted for years, my own static malware analyser.
To do that, I needed a system that was:
- Reproducible, so others could follow the same steps
- Interpretable, so each decision had a clear explanation
- Automated, so large numbers of samples could be processed
- Modular, so rules, enrichment, or extraction could evolve over time
This article describes how I designed the baseline analysis pipeline, what I learned from it, and why building it was the most effective way to understand how malware works (see survey: ResearchGate).
Why Static Analysis and Not Dynamic Analysis (or Both)?
I chose static analysis because it’s the simplest, safest way to make progress fast. You can point mature tools like Ghidra at a binary and immediately get structure, imports and strings—no sandbox to provision, no risk of executing the sample, and results that are easy to trace back to rules. That makes static ideal for batch triage and for learning: it’s repeatable, quick, and interpretable.
Of course, static has blind spots. Dynamic analysis shows what the program actually does at runtime—process creation, network I/O, registry and file changes—and it can expose unpacking or decryption that static won’t see. The trade‑off is overhead and fragility: running malware safely requires instrumentation and isolation, it’s slower per sample, and many families try to evade sandboxes. My approach was to start with static to build a clear baseline, then layer enrichment (and later, hybrid methods) where deeper behaviour visibility is needed.
2. High-Level Architecture of the Baseline Pipeline
The baseline pipeline is intentionally simple. It follows a straight, modular workflow:
- Feature Extraction – gather structural and semantic information from the PE file.
- Heuristic Evaluation – apply rule-based checks to detect suspicious patterns.
- Optional Data Enrichment – pull external intelligence (e.g., VirusTotal) for reference.
- Decision Fusion – combine heuristic signals with enrichment (if available).
- Reporting – output structured evidence, classification, and metadata.
Each component has a narrow purpose and produces structured data that the next stage consumes. This keeps the design predictable and transparent.
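To make the flow concrete, here is a minimal orchestration sketch of those five stages. The function names and return shapes are illustrative stubs, not the analyser's real API; they only show how each stage's structured output feeds the next.

```python
# Hypothetical orchestration sketch of the five baseline stages. Every stage is a
# stub so the control flow runs on its own; names and shapes are illustrative.
from typing import Any, Dict, Optional


def extract_features(path: str) -> Dict[str, Any]:
    # 1. Feature extraction (in the real pipeline: the Ghidra/PyGhidra decompile stage)
    return {"program": {"path": path, "sha256": "<sha256>"}, "imports": [], "strings": []}


def evaluate_heuristics(features: Dict[str, Any]) -> Dict[str, Any]:
    # 2. Heuristic evaluation: rule hits become structured evidence plus a raw score
    return {"score": 0.0, "evidence": []}


def enrich(sha256: str, api_key: str) -> Optional[Dict[str, Any]]:
    # 3. Optional enrichment (e.g., a VirusTotal hash lookup)
    return None


def fuse(heuristics: Dict[str, Any], enrichment: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    # 4. Decision fusion: combine heuristic and enrichment signals into a label
    return {"label": "unknown", "score": heuristics["score"]}


def analyse_sample(path: str, vt_api_key: Optional[str] = None) -> Dict[str, Any]:
    features = extract_features(path)
    heuristics = evaluate_heuristics(features)
    enrichment = enrich(features["program"]["sha256"], vt_api_key) if vt_api_key else None
    verdict = fuse(heuristics, enrichment)
    # 5. Reporting: everything the run produced, in one auditable structure
    return {"features": features, "heuristics": heuristics,
            "enrichment": enrichment, "verdict": verdict}
```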
On the optional enrichment step
The enrichment layer is intentionally optional. In theory, it makes the classification stronger because the heuristic output can be cross-checked against external intelligence.
But enrichment also introduces an unexpected trade-off:
- If the heuristic analysis is roughly aligned with the enrichment data, the result improves.
- If the heuristic analysis is far off from the enrichment (e.g., near-random heuristics), the fusion process can skew the final label in unhelpful ways.
So enrichment is useful, but only when the baseline heuristics are not too noisy. This became a recurring theme in the project.
3. Extracting Features from Malware Samples
Static analysis begins with extraction: gathering every meaningful property of a file without running it (overview: IJRASET). This includes:
- PE metadata
- Section layout
- Import tables
- Strings
- Function signatures and decompiler output
- Embedded resources
- Other structural features
In the baseline, the decompiler stage writes a per-sample features JSON you can reuse downstream. Typical fields include program (name, format, language, compiler, image_base, size, sha256), functions, imports, sections, strings, and optional decompiled function records. For runs, artifacts are written under a run folder (e.g., decompile-<RUN_ID>/<sha256>.features.json).
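For orientation, this is roughly the shape of such a features file once loaded. The field names follow the description above; the values (and the inner structure of each function record) are illustrative placeholders, not real output.

```python
# Illustrative shape of a per-sample features JSON; values are placeholders.
example_features = {
    "program": {
        "name": "sample.exe",
        "format": "pe",
        "language": "x86",
        "compiler": "windows",
        "image_base": "0x00400000",
        "size": 204800,
        "sha256": "<sha256>",
    },
    "functions": [{"name": "FUN_00401000", "entry": "0x00401000"}],
    "imports": ["CreateFileA", "GetProcAddress", "InternetOpenA"],
    "sections": [{"name": ".text", "size": 4096, "flags": ["exec", "read"]}],
    "strings": ["http://example.com", "SOFTWARE\\Microsoft\\Windows\\CurrentVersion\\Run"],
    "decompiled": [{"name": "FUN_00401000", "code": "..."}],  # optional decompiled records
}
```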
Why only PE binaries (and how to adapt)
For the experiments in this article, I focused on PE (Portable Executable) binaries (.exe, .dll, .sys). Two practical reasons guided this decision:
- PE is the most widespread format in desktop malware telemetry (Windows dominance in consumer endpoints).
- Tooling and ecosystem maturity are strongest around PE (Ghidra processors, import table conventions, common packers/obfuscators), which reduces ambiguity when building a baseline.
That focus simplified feature extraction (e.g., sections, imports, entry points) and made heuristic authoring more reliable (static vs dynamic context: SJSU ScholarWorks).
Adapting to other formats is feasible with incremental changes:
- ELF (Linux): switch language/processor in Ghidra, adjust extractors for ELF sections/segments, symbol tables, and libc/syscall imports; re-map heuristics to Linux TTPs (e.g., ptrace, /proc tampering, init/systemd persistence).
- Mach-O (macOS): use Mach-O program metadata, dyld imports, code signatures/entitlements; adapt persistence/networking rules to macOS paths and launch agents.
- Android APK/Dex: pivot to bytecode/decompiled Java/Kotlin; extract manifest, permissions, receivers/services; heuristics on exfil domains, trackers, sensitive API calls.
- Scripted binaries (e.g., .NET, Python, JS packagers): add language-aware parsers, focus on runtime loaders, reflection/dynamic resolution, embedded payloads.
Concretely, the baseline needs:
- A format detector in the orchestrator to choose the right decompiler/extractor path.
- Format-specific feature schemas with a shared core (program info, strings, imports/exports, sections), plus optional blocks per format.
- A heuristic ruleset per format (or parametric rules) and a tagging map aligned to each platform’s taxonomy.
With these adaptations, the same pipeline (decompile → heuristics → optional enrichment → fusion → report) extends beyond PE with minimal structural changes.
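As an example of the first adaptation point, the format detector in the orchestrator can be as simple as a magic-byte check. The sketch below is illustrative, not the project's actual code; the function name and return tags are made up for the example.

```python
# Hypothetical format detector sketch: pick a decompiler/extractor path from magic bytes.
from pathlib import Path


def detect_format(path: Path) -> str:
    """Return a coarse format tag used to route the sample to the right extractor."""
    magic = path.read_bytes()[:4]
    if magic[:2] == b"MZ":
        return "pe"          # Windows PE (.exe/.dll/.sys)
    if magic == b"\x7fELF":
        return "elf"         # Linux ELF
    if magic in (b"\xcf\xfa\xed\xfe", b"\xce\xfa\xed\xfe", b"\xca\xfe\xba\xbe"):
        return "macho"       # Mach-O (64-bit, 32-bit, or fat binary)
    if magic[:2] == b"PK":
        return "apk_or_zip"  # APKs are ZIP containers; inspect the manifest next
    return "unknown"
```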
Why Ghidra (and not radare2, IDA, or other specialized tools)?
A few people ask why I didn't use IDA, radare2, or another specialized tool. The answer is simple:
- Ghidra is fully open source
- It provides Python bindings through PyGhidra
- It integrates a powerful decompiler
- It can be automated in a pipeline without licensing issues
That said, Ghidra’s Python bindings are not trivial to use. Because Python and Java operate differently (different memory models, threading assumptions, and API expectations), interacting with Ghidra programmatically can become clunky. But it remained the most practical option.
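To give a feel for what that interaction looks like, here is a minimal PyGhidra sketch (assuming Ghidra is installed and GHIDRA_INSTALL_DIR points at it). It is not the pipeline's actual extraction code, just the general pattern of driving Ghidra's Java API from Python.

```python
# Minimal PyGhidra sketch: open a binary, auto-analyse it, and walk its functions.
import pyghidra

pyghidra.start()  # starts the JVM and initialises Ghidra headlessly

with pyghidra.open_program("sample.exe") as flat_api:
    program = flat_api.getCurrentProgram()
    print(program.getExecutableFormat(), program.getLanguageID())
    # Iterating Java-side collections from Python is where the clunkiness shows.
    for func in program.getFunctionManager().getFunctions(True):
        print(func.getName(), func.getEntryPoint())
```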
Limits of the extraction approach
Because this is a lightweight baseline pipeline, the extraction steps are intentionally simple. This leads to a major limitation:
The analysis depends heavily on readable strings and predictable patterns. If the malware is encrypted, packed, or obfuscated, the extracted data becomes almost useless.
This constraint shapes everything downstream in the pipeline.
4. The Heuristics Engine: How the Rules Work
The heuristics engine is the simplest component of the pipeline by design. A heuristic rule is just:
- A pure function
- That examines extracted features
- And returns structured evidence if a condition is met
The logic behind the rules is intentionally basic. Most rules rely on simple string-matching or pattern detection, such as:
- Suspicious API calls
- Writable/executable sections
- Unusual import patterns
- Indicators in metadata or strings
A double limitation
Because rules depend on literal string matching:
- The input must closely match what the rule expects, or the rule will not fire.
- Encrypted, packed, or obfuscated malware evades the heuristics almost completely.
The upside is interpretability: every rule hit produces clear evidence.
The downside is coverage: many modern malware families will not match at all.
Rule shape and evidence contract
Rules are pure functions that take extracted features and return either Evidence or a miss reason. In REXIS they follow a signature like:
```python
def rule_example(features: dict, rule_score: float = 0.2, params: dict = {}):
    # return (Evidence, "reason") on hit, or (None, "miss reason") on miss
    ...
```
Evidence is structured with id, title, detail, severity (info|warn|error) and a raw score in [0,1]. The analyser attaches a reason and per-evidence categories (derived from a tagging map) to aid traceability.
Tuning is externalized: a rules config (YAML/JSON) can reweight rules, pass per‑rule params via rule_args, filter by allow_rules/deny_rules, and define label_overrides for strong signals. Tag inference (e.g., ransomware, stealer, backdoor) is computed from evidence via a configurable tagging section.
Tip: Always return a miss reason; it surfaces in rule_misses and makes rule calibration easier.
Authoring and wiring a rule (concrete example)
Here is a simplified example that flags mutex creation APIs, showing the recommended return contract and tunable rule_score:
```python
from typing import Any, Dict, Optional, Tuple

from rexis.utils.types import Evidence
from rexis.tools.heuristics_analyser.utils import get_imports_set


def rule_suspicious_mutex_creation(
    features: Dict[str, Any], rule_score: float = 0.10, params: Dict[str, Any] = {}
) -> Tuple[Optional[Evidence], Optional[str]]:
    imps = get_imports_set(features)
    mutex_apis = {"createmutexa", "createmutexw", "openmutexa", "openmutexw"}
    hits = imps & mutex_apis
    if not hits:
        return None, "no mutex-related imports found"
    return (
        Evidence(
            id="suspicious_mutex_creation",
            title="Mutex creation/manipulation",
            detail=f"Imports include: {', '.join(sorted(hits))}",
            severity="info",
            score=float(rule_score),
        ),
        f"matched mutex imports: {', '.join(sorted(hits))}",
    )
```
To wire it, register the function with a stable id in the analyser’s ruleset and add a default weight in the config. At runtime, you can raise/lower its impact via weights.suspicious_mutex_creation and pass parameters through rule_args.
Tuning via config (weights, thresholds, tags)
In your rules YAML/JSON you can control:
- scoring.combine (weighted_sum|max) and label_thresholds.{malicious,suspicious}
- weights: per-rule caps; contribution is min(1.0, ev.score * weight)
- allow_rules/deny_rules: enable/disable subsets
- label_overrides: force a label if a rule fires
- rule_args: (rule_score, params) per rule
- tagging: map evidence to tags (e.g., ransomware, stealer, backdoor) with tag_weights, threshold, top_k
This keeps rule code simple while giving you environment‑specific control.
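As a concrete illustration, a rules config touching these knobs could look roughly like the following, shown here as the equivalent Python dict rather than YAML. The key names mirror the options listed above; the rule ids and values are invented for the example, and the authoritative schema is the one in the repository's example configs.

```python
# Hypothetical rules config (the Python equivalent of a YAML/JSON file).
# Key names mirror the knobs described above; rule ids and values are illustrative.
example_rules_config = {
    "scoring": {
        "combine": "weighted_sum",  # or "max"
        "label_thresholds": {"malicious": 0.70, "suspicious": 0.40},
    },
    "weights": {"suspicious_mutex_creation": 0.5},  # per-rule cap on contribution
    "allow_rules": [],                              # empty = all rules enabled
    "deny_rules": ["noisy_rule_id"],
    "label_overrides": {"ransom_note_strings": "malicious"},
    "rule_args": {"suspicious_mutex_creation": {"rule_score": 0.15, "params": {}}},
    "tagging": {
        "threshold": 0.3,
        "top_k": 3,
        "tag_weights": {"ransomware": 1.0, "stealer": 1.0, "backdoor": 1.0},
    },
}
```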
Testing a rule quickly (ad‑hoc)
```python
from rexis.tools.heuristics_analyser.main import heuristic_classify

features = {
    "program": {"name": "sample.exe", "size": 200_000, "sha256": "...", "format": "pe", "language": "x86"},
    "imports": ["CreateMutexA", "GetProcAddress"],
    "sections": [{"name": ".text", "size": 3500, "flags": ["exec", "write"]}],
    "strings": ["http://example.com", "VirtualBox"],
}

result = heuristic_classify(features)
print(result["score"], result["label"])   # inspect overall score/label
print(result.get("evidence", []))         # list of evidence with reasons
print(result.get("tags", []))             # tag candidates with scores
print(result.get("rule_misses", []))      # why a rule didn't fire
```
5. Enrichment Through External Intelligence (Optional Step)
Enrichment was added only after early experiments revealed a problem:
The heuristics alone generated output that was “too weak” to stand on its own.
Not because the system was flawed, but because simple static heuristics have very limited visibility into modern malware. To counter that, enrichment allows the analyser to pull external data, such as (background: VirusTotal docs):
- Hash reputation
- Threat vendor classifications
- Historical submissions
- Known malicious families associated with a SHA-256
- Community tags or detection ratios
This creates a baseline to compare the heuristic output against. But enrichment was never meant to override the heuristics, only to contextualize them (what enrichment adds and caveats: Wiz Academy).
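For reference, this is roughly what the hash-reputation part of such a lookup can look like against VirusTotal's public v3 API. The files endpoint and x-apikey header follow VirusTotal's documented interface; the helper name, the return shape, and the error handling are my own illustrative choices, not the analyser's real enrichment module.

```python
# Hypothetical enrichment helper: hash reputation via VirusTotal's v3 files endpoint.
from typing import Any, Dict, Optional

import requests

VT_FILES_URL = "https://www.virustotal.com/api/v3/files/{sha256}"


def lookup_hash_reputation(sha256: str, api_key: str, timeout: float = 30.0) -> Optional[Dict[str, Any]]:
    """Return a compact reputation summary for a SHA-256, or None if VT has never seen it."""
    resp = requests.get(VT_FILES_URL.format(sha256=sha256),
                        headers={"x-apikey": api_key}, timeout=timeout)
    if resp.status_code == 404:
        return None  # hash unknown to VirusTotal
    resp.raise_for_status()
    attrs = resp.json()["data"]["attributes"]
    stats = attrs.get("last_analysis_stats", {})
    return {
        "detections": stats.get("malicious", 0),
        "total_engines": sum(stats.values()) or None,
        "suggested_label": attrs.get("popular_threat_classification", {})
                                .get("suggested_threat_label"),
        "tags": attrs.get("tags", []),
    }
```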
Why enrichment is useful but imperfect
- It improves confidence when heuristics are directionally correct.
- It destabilizes classification when heuristics are very noisy.
- It introduces dependency on an external service (API, rate limiting, coverage gaps).
Despite its imperfections, enrichment helped ground the pipeline’s outputs and made the entire system more meaningful.
6. Decision Fusion: Combining All Signals
Once both the heuristic engine and the optional enrichment layer produce their outputs, the pipeline needs a final step that decides:
What is the most reasonable label for this sample?
The decision fusion module combines the available signals:
- Heuristic evidence (rule hits, counts, weights)
- Optional enrichment (external reputation, vendor labels, known families)
The fusion logic uses a simple, weighted approach:
- If heuristics show strong, consistent evidence → they carry more weight.
- If heuristics are weak but enrichment is strong → enrichment influences the decision more.
- If both are weak → the sample defaults to suspicious or unknown.
- If they strongly disagree → the system emits a warning, and the final label is conservative.
This prevents the analyser from being “overconfident,” which is a real risk when combining noisy static heuristics with external reputation data.
Confidence-weighted fusion (with disagreement penalty)
The reconciler computes a final score using per-source confidences and weights:
S_final = clip_01( w_h C_h S_h + w_vt C_vt S_vt − penalty(|S_h − S_vt|) )
- S_h, S_vt: heuristics and VT scores in [0, 1]
- C_h, C_vt: confidences in [0, 1]
- w_h, w_vt: relative weights
- penalty(...): applied when both signals exist and disagree beyond a policy threshold
When both sources are high‑confidence yet strongly disagree, a conservative hard‑override can force a mid score and an abstain/suspicious label. Final labels are then chosen via calibrated thresholds (e.g., T_mal=0.70, T_susp=0.40).
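In code, the same policy is only a few lines. This is a minimal sketch that mirrors the formula above; the linear penalty, the override thresholds, and the default weights are illustrative assumptions, not the exact values used by the reconciler.

```python
# Minimal sketch of confidence-weighted fusion with a disagreement penalty.
# Variable names mirror the formula; the concrete constants are illustrative.
from typing import Optional, Tuple


def fuse_scores(s_h: float, c_h: float, s_vt: Optional[float], c_vt: float = 0.0,
                w_h: float = 0.6, w_vt: float = 0.4,
                disagree_threshold: float = 0.4, penalty_scale: float = 0.25,
                t_mal: float = 0.70, t_susp: float = 0.40) -> Tuple[float, str]:
    clip = lambda x: max(0.0, min(1.0, x))
    if s_vt is None:  # no enrichment available: heuristics carry the decision alone
        s_final = clip(c_h * s_h)
    else:
        gap = abs(s_h - s_vt)
        penalty = penalty_scale * gap if gap > disagree_threshold else 0.0
        s_final = clip(w_h * c_h * s_h + w_vt * c_vt * s_vt - penalty)
        if gap > 0.6 and c_h > 0.8 and c_vt > 0.8:
            return 0.5, "suspicious"  # conservative hard-override on strong disagreement
    label = "malicious" if s_final >= t_mal else "suspicious" if s_final >= t_susp else "unknown"
    return s_final, label
```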
The core idea
The fusion layer isn’t meant to be clever, just balanced.
It ensures that neither heuristics nor enrichment dominate blindly, and that the final classification reflects the overall confidence of the system rather than any individual signal.
7. Output, Reporting, and Traceability
Every run of the baseline pipeline produces structured output that makes the analysis reproducible and auditable. For each sample, the system stores:
- Extracted features
- All heuristic rule evidence
- Optional enrichment results
- The fused classification label
- Metadata (hashes, timestamps, config parameters)
- A JSON report representing the entire reasoning chain
This traceability was crucial for the thesis.
It allowed me to re-run experiments, refine rules, compare outputs, and understand how every decision was made. When you are building an analyser from scratch, having visibility into why something happened is as important as the result itself.
Why reporting matters
- It makes the pipeline reproducible.
- It allows for manual inspection when results are unclear.
- It provides ground truth for later LLM/RAG experiments.
- It helps identify weak rules, noisy features, or misaligned fusion logic.
The reporting layer ended up being one of the most valuable parts of the pipeline, even though it was initially treated as a simple output function.
Concrete artifact paths
For a run directory like baseline-analysis-<RUN_ID>/, you’ll typically see:
- Features: decompile-<RUN_ID>/<sha256>.features.json
- Heuristics: <sha256>.baseline.json
- Final report (fusion): <sha256>.report.json
- Batch runs: baseline_summary.json plus a per-run baseline-analysis-<RUN_ID>.report.json
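A quick way to sanity-check a finished run is to load the fused reports and print their headline fields. The field names used below (label, score, evidence) follow the report contents described earlier, but treat the exact keys as defined by the repository rather than by this sketch.

```python
# Illustrative post-run inspection of fused reports; adjust keys to the actual schema.
import json
from pathlib import Path

run_dir = Path("baseline-analysis-<RUN_ID>")  # placeholder run directory

for report_path in sorted(run_dir.glob("*.report.json")):
    report = json.loads(report_path.read_text())
    print(report_path.name,
          report.get("label"),
          round(report.get("score", 0.0), 2),
          len(report.get("evidence", [])), "evidence items")
```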
8. Lessons Learned from Building a Static Malware Analyser
Building a malware analyser, even a simple baseline one, teaches you a lot about both malware and tooling. A few reflections stood out.
What worked well
- The architecture was clear, modular, and easy to extend.
- The rule engine was transparent and interpretable.
- The pipeline could analyse large sets of files quickly.
- It established a solid foundation for later ML and LLM-based experiments.
What didn’t work as well
- Static analysis alone struggles with packed or encrypted malware (see recent studies: ScienceDirect, MDPI).
- The heuristic engine is only as good as the extracted strings, and they often aren't enough.
- Simple string matching has obvious limits in modern malware ecosystems.
- Enrichment, while useful, can distort results when heuristics are too weak.
What surprised me
- How quickly the heuristics break when input patterns change.
- How hard it is to design “general” rules that work across many families.
- How often malware authors rely on simple tricks that defeat static inspection.
How this shaped the next phase of the thesis
These lessons directly informed the development of the LLM + RAG-enhanced pipeline (which will be covered in a dedicated article).
Static heuristics gave me structure, data, and understanding. But not enough depth.
The next logical step was to use LLMs to interpret extracted features more flexibly, grounded by retrieval to avoid hallucinations.
The baseline pipeline provided the scaffolding needed to move forward.
Analysis Results & Repository Structure
The complete artefacts from my experiments live in the repository under analysis/. It has two main branches of outputs and a simple aggregate:
Note: The LLM + RAG pipeline is only referenced here for structure and comparison; I’ll cover its design, prompts, retrieval strategy, and results in a dedicated follow‑up article.
- analysis/baseline/: results from the baseline static pipeline (with and without VirusTotal enrichment) (link)
- analysis/llmrag/: results from the LLM + RAG pipeline (link)
- analysis/aggregation-output.json and analysis/aggregation-report.csv: quick roll-ups of the per-run outputs (link)
Directory layout (overview)
- analysis/baseline/baseline-analysis-<family>-run-2508/: baseline runs per family (e.g., botnet, ransomware, rootkit, trojan) (examples)
- analysis/baseline/baseline-analysis-<family>-run-vt-2508/: the same families with VirusTotal enrichment enabled (examples)
- analysis/llmrag/llmrag-analysis-<family>-run-2508/: LLM + RAG runs per family (examples)
Inside each run directory you’ll find the per‑sample artefacts described earlier:
- decompile-<RUN_ID>/<sha256>.features.json: extracted features
- <sha256>.baseline.json: heuristics output
- <sha256>.report.json: fused final report (label, score, trace)
- baseline-analysis-<RUN_ID>.report.json: batch-level summary for the run
- baseline_summary.json: compact summary across all processed samples
Baseline analysis: what the runs show
Across the baseline folders, you can inspect how the simple heuristics behave for different malware families and how optional VirusTotal enrichment shifts confidence and labels:
- Without enrichment (baseline-analysis-<family>-run-2508/), evidence is driven purely by structural/string-based signals; many samples land in suspicious or unknown when strings are sparse or obfuscated.
- With enrichment (baseline-analysis-<family>-run-vt-2508/), labels tend to stabilize when external reputation aligns with the heuristics; disagreement cases are explicitly noted in the fused reports via the reconciliation policy.
For a quick, high‑level view across runs, open analysis/aggregation-report.csv or the machine‑readable analysis/aggregation-output.json. These aggregate files summarize per‑run counts and label distributions without having to traverse each directory.
If you want to reproduce similar outputs, run the commands in Section 9 and point -o to a top‑level analysis/ directory; the pipeline will create run‑specific folders and the same artefact structure.
9. How to Run It Yourself
The analyser is open-source and can be run with only a few prerequisites:
Requirements
- Python environment (follow the setup instructions in the repository's README.md)
- Ghidra + PyGhidra (Ghidra installed at /opt/ghidra on Linux). If you need a fast, distro-agnostic setup, follow my guide: Ghidra on Linux: Zero Fuss Install
- A directory of PE files
- (Optional) VirusTotal API key for enrichment (set [baseline].virus_total_api_key in config/settings.toml)
Basic usage
Once installed, running the baseline pipeline is straightforward (Typer CLI):
```bash
pdm run rexis analyse baseline -i ./data/samples/<file>.exe -o ./data/analysis
```
or for batch mode:
```bash
pdm run rexis analyse baseline -i ./data/samples -o ./data/analysis --parallel 4
```
Common options:
- -i, --input: file or directory to analyse (required)
- -o, --out-dir: output directory (defaults to CWD)
- -r, --run-name: logical run name (default: UUID)
- -y, --overwrite: overwrite existing artifacts
- -p, --parallel: workers for directory mode
- --rules: path to heuristics rules config (YAML/JSON)
- -m, --min-severity: filter returned evidence (info|warn|error)
- --vt: enable VirusTotal enrichment (requires API key in config/settings.toml)
- --vt-timeout, --vt-qpm: timeout and queries-per-minute budget
Rule customization
Users can:
- Add new heuristic rules
- Tune weights and thresholds
- Enable or disable individual rules
- Adjust fusion parameters
- Add their own enrichment sources
Where to start
All documentation is available in the repository:
- Baseline pipeline guide: https://github.com/andremmfaria/rexis/blob/main/guides/BaselinePipeline.md
- Heuristic rule‑writing guide: https://github.com/andremmfaria/rexis/blob/main/guides/WritingHeuristicRules.md
- Reconciliation (fusion) details: https://github.com/andremmfaria/rexis/blob/main/guides/Reconciliation.md
- Example configurations and sample reports in the repo
This makes it easy to experiment, modify, or build your own extensions.
10. Conclusion: Why Building Tools Is the Best Way to Learn
I started this project because I wanted to understand how malware classification works.
Building my own analyser forced me to confront all the assumptions, shortcuts, limitations, and edge cases that textbooks and blog posts never mention.
What I gained was not just a working pipeline, but a practical understanding of:
- how static analysis actually behaves
- where heuristics break
- why enrichment matters
- how evidence should be combined
- and how analysts think about classification
The baseline pipeline is not perfect. It was never meant to be.
But it gave me the foundation I needed to build more advanced approaches, including the LLM + RAG pipeline that became the core of the second half of my thesis. This will be covered in a future article.
Most importantly, it taught me this:
If you want to learn how something works, build a tool that does it.
You’ll understand the entire problem far more deeply.