Every time I mention that my secrets detector uses a Random Forest classifier, someone asks the same question.
"Why not a neural network?"
It's a reasonable question. Deep learning dominates ML benchmarks. Transformers have redefined what's possible in natural language understanding. If you're building a tool that reads code — which is text — shouldn't you be using the most powerful text understanding architecture available?
The answer is no. And the reasoning reveals something important about how to match ML approaches to real-world engineering constraints.
This article is the full argument: why Random Forest was the right choice for this specific problem, what I give up by not going deep, and when the calculus would flip.
The Five Constraints That Shaped the Decision
Before choosing a model architecture, I defined the constraints the tool had to satisfy. These weren't aspirational — they were hard requirements that would determine whether the tool was actually useful in practice.
Constraint 1: Must run locally with zero infrastructure.
The tool needs to work in a pre-commit hook, on a developer's laptop, without internet access. No API calls, no GPU, no Docker Compose stack with a model server. A developer running git commit should not experience meaningful latency.
Constraint 2: Must ship as a self-contained package.
The model file needs to be small enough to live in the repository alongside the code. Teams shouldn't need to download a separate model artifact or manage model versioning separately from tool versioning.
Constraint 3: Must be retrainable by a non-ML engineer.
When a team encounters false positives specific to their codebase, they should be able to add examples and retrain without ML expertise, a GPU, or more than a few minutes of compute time.
Constraint 4: Must explain its decisions.
When the tool flags a finding, an engineer should be able to understand why. "The model said so" is not an acceptable answer in a security context where false positives erode trust.
Constraint 5: Must generalise from a small training set.
I'm training on synthetically generated data, not millions of real examples. The architecture needs to perform well with thousands of samples, not billions.
Every significant architectural decision in this tool flows from these five constraints. Let me show you how.
Why Deep Learning Fails Each Constraint
Constraint 1: Local execution without infrastructure
A production-quality transformer model for code understanding — something like CodeBERT or GraphCodeBERT — weighs in at hundreds of megabytes to several gigabytes. Running inference on a CPU is possible but slow: several seconds per scan on a typical laptop. In a pre-commit hook, where the developer is waiting at the terminal, that's unacceptable friction.
The Random Forest model runs inference in milliseconds on CPU. Scanning a 10,000-line codebase takes under two seconds on a five-year-old laptop. There's no perceptible delay between git commit and the hook completing.
Constraint 2: Self-contained package
The trained Random Forest model serialises to approximately 1MB as a pickle file. It lives in the model/ directory, ships with the tool, and requires no separate download or version management.
A fine-tuned transformer model would be 400MB–2GB depending on architecture. That's not viable as a repository artifact. It requires separate model hosting, download scripts, and version coordination — none of which a team setting up a pre-commit hook wants to manage.
Constraint 3: Retrainable by non-ML engineers
Retraining the Random Forest on 6,000 samples takes approximately eight seconds on a standard laptop CPU. Scaling to 50,000 samples takes about ninety seconds. The entire workflow is:
```shell
# Edit trainer.py to add your examples
# Then:
python main.py train --samples 10000
# Done. New model.pkl in model/
```
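A rough sketch of what that training step does under the hood — fit a scikit-learn RandomForestClassifier on a pre-computed feature matrix and pickle it next to the code. The feature names, sample generator, and file path here are illustrative, not the tool's actual API:

```python
# Sketch of the retraining step: fit a Random Forest on pre-computed
# feature vectors and serialise it as a small repository artifact.
# Feature layout and paths are illustrative, not the tool's real ones.
import pickle
import random

from sklearn.ensemble import RandomForestClassifier

random.seed(0)

def make_samples(n):
    """Toy feature vectors: secrets get high entropy and key-name risk."""
    X, y = [], []
    for _ in range(n):
        is_secret = random.random() < 0.5
        entropy = random.uniform(4.5, 6.0) if is_secret else random.uniform(1.0, 3.5)
        key_name_risk = random.uniform(0.6, 1.0) if is_secret else random.uniform(0.0, 0.4)
        X.append([entropy, key_name_risk])
        y.append(int(is_secret))
    return X, y

X, y = make_samples(6000)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)                      # seconds, not hours, on a laptop CPU

with open("model.pkl", "wb") as f:   # ~1MB artifact that ships with the tool
    pickle.dump(model, f)
```

Because the model learns relationships between a handful of engineered features rather than raw token representations, the whole fit completes in seconds on a CPU — which is what makes the "retrain it yourself" workflow viable.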
Retraining a transformer requires GPU infrastructure, hours of compute time, careful learning rate scheduling to avoid catastrophic forgetting, and validation that fine-tuning didn't degrade performance on the base cases. A team without an ML engineer cannot do this.
Constraint 4: Explainable decisions
This is where the gap between Random Forest and deep learning is most significant for a security tool.
Random Forest gives you feature importances globally and, with one additional step, per-prediction explanations. When the tool flags a finding, it can tell you exactly which features drove the decision:
```
Finding: api_key = "sk-proj-abc123XYZ789..."
Confidence: 96%
Contributing features:
  key_name_risk:      0.90 (HIGH — 'api_key' matches sensitive vocabulary)
  shannon_entropy:    5.82 (HIGH — consistent with cryptographic secret)
  pattern_openai_key: 1.00 (MATCH — matches OpenAI key format sk-proj-*)
  repetition_ratio:   0.94 (HIGH — low character repetition, high randomness)
```
An engineer reading this knows immediately why the finding was generated. They can evaluate whether the reasoning is sound. They can make an informed decision about whether to fix or suppress.
A neural network produces a probability: 0.96. No more. You can apply techniques like SHAP or LIME to approximate explanations, but these add complexity, latency, and approximation error. For a pre-commit hook that needs to explain itself to a developer in real time, "here are the features that drove this" is vastly better than "the attention mechanism focused on these tokens (approximately)."
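A minimal sketch of how a feature-level explanation like the one above can be assembled: pair the forest's global feature importances with the flagged value's feature vector and report the top contributors. The real tool may use a finer per-tree decomposition; every name and number below is illustrative:

```python
# Crude per-finding explanation: rank features by the forest's global
# importances and show this finding's value for each. (A per-tree
# contribution decomposition would be more precise; this is the
# simplest version. All numbers below are illustrative.)

def explain(importances, features, top_k=4):
    """Return the top_k features by global importance with this finding's values."""
    ranked = sorted(importances, key=importances.get, reverse=True)
    return [(name, features[name], importances[name]) for name in ranked[:top_k]]

importances = {            # global feature_importances_ from the trained forest
    "key_name_risk": 0.28,
    "shannon_entropy": 0.14,
    "pattern_openai_key": 0.09,
    "repetition_ratio": 0.08,
    "hex_ratio": 0.07,
}
finding = {                # feature vector for one flagged value
    "key_name_risk": 0.90,
    "shannon_entropy": 5.82,
    "pattern_openai_key": 1.00,
    "repetition_ratio": 0.94,
    "hex_ratio": 0.31,
}

for name, value, weight in explain(importances, finding):
    print(f"{name}: {value} (global importance {weight:.2f})")
```

Because everything here is a dictionary lookup over pre-computed numbers, the explanation adds effectively zero latency to the hook — unlike SHAP or LIME, which need many model evaluations per explanation.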
Constraint 5: Generalisation from small data
Transformer models are data-hungry. They're pre-trained on billions of tokens and fine-tuned on millions of examples. Their power comes from the scale of pre-training, which means fine-tuning on thousands of synthetic examples carries real risk of the model not generalising well to patterns it hasn't seen.
Random Forest with well-engineered features generalises effectively from thousands of examples. The feature engineering does the heavy lifting — entropy, character ratios, key name scoring, pattern flags. The model only needs to learn the relationships between these pre-computed features, which is a much simpler learning problem than learning representations from raw text.
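The features doing that heavy lifting are simple enough to sketch with the standard library alone. The exact formulas the tool uses may differ slightly; these are the textbook definitions:

```python
# Three of the engineered features, in their textbook form. The tool's
# exact formulas may vary, but the intuition is identical.
import math
import string

def shannon_entropy(s: str) -> float:
    """Bits per character; random keys score ~4.5-6, English words ~2-3."""
    if not s:
        return 0.0
    n = len(s)
    counts = {c: s.count(c) for c in set(s)}
    return -sum((k / n) * math.log2(k / n) for k in counts.values())

def hex_ratio(s: str) -> float:
    """Fraction of characters that are valid hexadecimal digits."""
    return sum(c in string.hexdigits for c in s) / max(len(s), 1)

def repetition_ratio(s: str) -> float:
    """Distinct characters over total length: higher means more randomness."""
    return len(set(s)) / max(len(s), 1)

print(shannon_entropy("hello world"))        # low: natural language
print(shannon_entropy("g7Xp2qLm9ZkR4tWv"))   # high: looks like a key
```

Each feature compresses a hypothesis about what secrets look like into one number, so the forest only has to learn how those numbers interact — a far easier problem than learning the hypotheses themselves from raw text.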
What I Give Up
Intellectual honesty requires being clear about the tradeoffs.
Peak accuracy ceiling. A well-fine-tuned code understanding model operating on token sequences would almost certainly achieve higher peak accuracy than my feature-engineered Random Forest. It would learn representations I haven't thought to engineer explicitly. It would capture multi-token context — the fact that password appears three lines before = "..." rather than directly adjacent, for instance.
Novel format generalisation. When a new cloud provider launches with a distinctive key format, my tool catches it only if I add a pattern match flag. A neural network trained on diverse secret formats might generalise to novel formats by recognising that they "look like secrets" in ways the feature vector doesn't capture. My tool requires an explicit pattern update.
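For concreteness, here is roughly what "an explicit pattern update" amounts to: the pattern flags are a table of compiled regexes, and supporting a new provider means adding one entry. The AWS and OpenAI prefixes are the well-known public formats; `pattern_acme_key` is a made-up example of a new provider:

```python
# Pattern flags as a regex table: one 0/1 feature per known secret
# format. "acme" is a hypothetical new provider, shown to illustrate
# that supporting a novel format is a one-line change.
import re

PATTERNS = {
    "pattern_openai_key": re.compile(r"sk-proj-[A-Za-z0-9]{20,}"),
    "pattern_aws_access_key": re.compile(r"AKIA[0-9A-Z]{16}"),
    # New provider launches a distinctive key format? Add one line:
    "pattern_acme_key": re.compile(r"acme_live_[A-Za-z0-9]{32}"),
}

def pattern_flags(value: str) -> dict:
    """Binary feature per known format; fed into the feature vector."""
    return {name: int(bool(rx.search(value))) for name, rx in PATTERNS.items()}
```

The cost is that the update has to happen at all — a model that had learned "what secrets look like" from raw text might have caught the new format with no change.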
Code context understanding. The feature vector sees one value at a time. A transformer scanning the whole file could understand that a value is being loaded from an environment variable rather than being hardcoded, that it's inside a test mock, or that it's in a comment rather than executable code. My tool handles some of these through pre-processing (only scanning string literals in executable code), but the context window is fundamentally narrower.
Cross-line data flow. If a secret is assembled across multiple lines — partial string concatenation, format strings, bytes operations — the feature vector sees fragments rather than the complete secret. A model with broader context could potentially catch these.
The Accuracy Numbers
On my test set of 1,200 labeled samples (a held-out 20% of the 6,000 training samples), the Random Forest achieves:
| Metric | Score |
|---|---|
| Accuracy | 94.2% |
| Precision | 93.8% |
| Recall | 94.7% |
| F1 Score | 94.2% |
| False Positive Rate | 5.8% |
| False Negative Rate | 5.3% |
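These metrics all derive from the four confusion-matrix counts, so it's worth seeing how they relate. The counts below are illustrative values chosen to roughly match the table, not the tool's actual confusion matrix:

```python
# Standard classification metrics from confusion-matrix counts.
# The counts below are illustrative, not the tool's actual numbers.

def metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # recall = 1 - false negative rate
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
    }

# e.g. 1,200 test samples split into hypothetical counts:
m = metrics(tp=568, fp=38, tn=562, fn=32)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

The useful reading for a security tool: precision governs how often developers are interrupted by false alarms, while recall governs how many real secrets slip through — and the table shows neither was sacrificed for the other.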
For context: TruffleHog v3 (regex + entropy) reports false positive rates in the 10–15% range on typical codebases according to published evaluations. The ML approach achieves meaningfully better precision without sacrificing recall.
I don't have a head-to-head comparison against a fine-tuned transformer on this specific task — that would require the transformer, the training infrastructure, and a larger labeled dataset than I have. What I can say is that the Random Forest achieves accuracy that's competitive with existing tools, meets all five operational constraints, and does so at a fraction of the complexity.
The Decision Framework: When Would I Choose Deep Learning?
Given all of the above, there are scenarios where I would choose a different architecture.
If I were building a cloud-hosted scanning service, the infrastructure constraint disappears. GPU inference is available. Model size doesn't matter. Latency can be managed with caching and batching. In that scenario, a transformer-based approach becomes viable and the accuracy ceiling argument gets stronger.
If I had a large labeled dataset of real secrets, the data constraint relaxes. Fine-tuning on tens of thousands of real examples would likely push accuracy significantly higher than what synthetic data training achieves. The question then becomes whether the accuracy gain justifies the operational complexity.
If the primary use case were batch scanning rather than pre-commit hooks, the latency constraint loosens. Scanning a repository's entire history overnight can tolerate seconds or minutes per file. The pre-commit use case is what drives the millisecond inference requirement.
If cross-file context mattered, a graph neural network operating on the code's data flow graph might be more appropriate than either approach. Understanding that secret = get_secret_from_vault() is safe and secret = "hardcoded" is dangerous requires understanding function call semantics — something Random Forest on string features cannot do.
The right architecture is always determined by the constraints of the deployment context, not by what achieves the best benchmark score.
The Broader Principle: Fit for Purpose Over State of the Art
The machine learning community has a bias toward the most powerful available architecture. More parameters, more data, more compute — these are treated as virtues in research contexts where they often are virtues.
Production engineering has different values. A tool that actually gets used — because it's fast, explainable, maintainable, and deployable without infrastructure — delivers more security value than a theoretically superior tool that sits unused because it's too slow, too opaque, or too complex to operate.
This is an instance of a general principle I keep encountering in AppSec: the best security control is the one that gets implemented and maintained, not the one that provides the strongest theoretical protection.
A Random Forest secrets detector running in every developer's pre-commit hook, catching 94% of secrets before they reach the repository, is more valuable than a transformer-based detector achieving 98% accuracy that nobody bothered to deploy because the setup was too complicated.
The 4% accuracy difference is real. The deployment difference is everything.
What the Feature Importances Reveal About the Problem
One thing Random Forest gives you that deep learning doesn't: a clear picture of what the problem actually is.
Here are the top 10 feature importances from the trained model:
| Rank | Feature | Importance |
|---|---|---|
| 1 | key_name_risk | 0.28 |
| 2 | shannon_entropy | 0.14 |
| 3 | pattern_aws_access_key | 0.09 |
| 4 | repetition_ratio | 0.08 |
| 5 | hex_ratio | 0.07 |
| 6 | pattern_github_pat | 0.06 |
| 7 | base64_ratio | 0.05 |
| 8 | log_length | 0.04 |
| 9 | pattern_private_key_header | 0.04 |
| 10 | uppercase_ratio | 0.03 |
This table is a map of the secrets detection problem. It tells you that variable naming context is more predictive than any statistical property of the string itself. It tells you that entropy matters but not as much as everyone assumes. It tells you that AWS and GitHub keys are important enough that their specific pattern flags appear in the top ten even though there are 16 pattern flags spread across the remaining importance budget.
A neural network would learn similar underlying structure — it would attend more to variable names than to arbitrary string characters — but it wouldn't show you that structure explicitly. The interpretability of Random Forest turns model training into a research exercise as well as an engineering one.
That visibility into what the problem actually is informed every design decision in this tool. It's one of the most valuable things the architecture choice gave me.
The trainer code, feature importances, and model evaluation scripts are all in the repository at github.com/pgmpofu/secrets-detector.
Next up: the ethical and practical challenge of training a security ML model without using real leaked credentials — why synthetic data, how I generated it, and what the tradeoffs are.