When my secrets detector evaluates a candidate string, it doesn't see code.
It sees a vector of 26 numbers.
That vector is the bridge between human intuition — "this looks like a secret" — and machine classification. Every insight a security engineer uses when reading code to spot exposed credentials has been translated into a numerical feature that the Random Forest classifier can reason about.
This article is a complete walkthrough of those 26 features: what each one measures, why it matters, what it catches, and what it misses. By the end, you'll understand exactly what the model sees when it evaluates any candidate value — and why the combination of features catches things that no single signal could.
How Feature Extraction Works
Before the classifier sees anything, every candidate string goes through a feature extraction pipeline in features.py. The pipeline takes two inputs: the string value itself, and the name of the variable holding it.
```python
def extract_features(value: str, key_name: str) -> np.ndarray:
    features = []

    # Entropy features
    features.append(shannon_entropy(value))
    features.append(math.log(len(value) + 1))
    features.append(repetition_ratio(value))
    features.append(longest_run_normalized(value))

    # Character distribution features (8 features)
    features.extend(character_ratios(value))

    # Key name context
    features.append(key_name_risk_score(key_name))

    # Pattern match flags (16 features)
    features.extend(pattern_match_flags(value))

    return np.array(features)
```
The output is a fixed-length array of 26 floating point numbers. The classifier never sees the original string — only this vector. That's both a strength (the model generalises across different string formats) and a limitation (some context that a human would use is deliberately excluded).
Let me walk through each group.
Group 1: Entropy Features (4 features)
These four features capture the statistical "randomness" of the string — the property that real secrets share with random data.
Feature 1: Shannon Entropy
Shannon entropy measures the unpredictability of the character sequence. For a string of length n with character frequencies p_i:
H = -Σ p_i × log₂(p_i)
A perfectly random string of alphanumeric characters has entropy around 5.7–6.0 bits. Common English words have entropy around 3.5–4.5 bits. Cryptographically generated secrets cluster at the high end.
```
# High entropy — likely a secret or hash
"sk-proj-abc123XYZ789..." → entropy: 5.82

# Low entropy — likely a password or human-chosen value
"Winter2019!" → entropy: 3.46

# Very low entropy — definitely not a secret
"aaaaaaaaaaaaa" → entropy: 0.00
```
Entropy alone is a weak classifier — UUIDs, SHA-256 hashes, and base64 image data all have high entropy but are not secrets. That's why it's one of 26 features, not the only feature.
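The entropy computation itself is short. Here is a minimal sketch of a `shannon_entropy` helper matching the formula above (the actual implementation in features.py may differ in details):

```python
import math
from collections import Counter

def shannon_entropy(value: str) -> float:
    """Shannon entropy of a string, in bits per character."""
    if not value:
        return 0.0
    n = len(value)
    # H = -sum(p_i * log2(p_i)) over observed character frequencies
    return -sum((c / n) * math.log2(c / n) for c in Counter(value).values())

print(shannon_entropy("Winter2019!"))  # ≈ 3.46 — all 11 characters unique, so H = log2(11)
```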
Feature 2: Log-Scaled Length
Raw string length would give too much weight to very long strings. Log-scaling (math.log(len(value) + 1)) compresses the range so that the difference between a 32-character key and a 64-character key has roughly the same weight as the difference between a 4-character and 8-character string.
Secrets tend to fall in predictable length ranges: AWS access keys are 20 characters, GitHub PATs are 40, JWT tokens are variable but typically 200+. Length contributes signal, but it's a soft signal — there's no length that definitively indicates "secret."
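A quick numerical check of that compression claim (this snippet is illustrative, not from features.py):

```python
import math

def log_length(value: str) -> float:
    return math.log(len(value) + 1)

# The 4 → 8 character gap and the 32 → 64 character gap come out
# comparable on the log scale, even though the raw gaps differ 8x:
short_gap = log_length("x" * 8) - log_length("x" * 4)    # log(9/5)  ≈ 0.59
long_gap  = log_length("x" * 64) - log_length("x" * 32)  # log(65/33) ≈ 0.68
```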
Feature 3: Repetition Ratio
```python
repetition_ratio = len(set(value)) / len(value)
```

This is the proportion of unique characters to total characters. A random 32-character string drawn from a large alphabet repeats few characters, so its ratio stays high (typically 0.8–1.0). A string like "aababcababc" has only three unique characters across eleven positions (ratio ≈ 0.27).
Low repetition ratio is a strong signal that a string is not a secret — real secrets don't repeat characters predictably. High repetition ratio is a necessary but not sufficient condition for being a secret.
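As a runnable helper (the empty-string guard is my assumption, not confirmed from features.py):

```python
def repetition_ratio(value: str) -> float:
    """Unique characters divided by total length; 0.0 for empty input."""
    if not value:
        return 0.0
    return len(set(value)) / len(value)

print(repetition_ratio("aababcababc"))  # ≈ 0.27 — three unique characters over eleven
print(repetition_ratio("abcd"))         # 1.0 — no repeats at all
```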
Feature 4: Longest Run (Normalised)
The length of the longest consecutive run of the same character, divided by string length:

```python
longest_run = max(len(list(g)) for _, g in itertools.groupby(value))
longest_run_normalized = longest_run / len(value)
```
"aaabbbccc" has a longest run of 3 out of 9 characters — normalised run of 0.33.
"sk-abc123XYZ789def456" has a longest run of 1 out of 21 characters — normalised run of 0.05.
Long runs of repeated characters are a strong signal of non-random data. No cryptographically generated secret will have a long run. Human-readable strings often will.
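Wrapped into a self-contained helper (again assuming the empty string maps to 0.0), the two examples above check out:

```python
import itertools

def longest_run_normalized(value: str) -> float:
    """Longest same-character run as a fraction of string length."""
    if not value:
        return 0.0
    longest = max(len(list(g)) for _, g in itertools.groupby(value))
    return longest / len(value)

print(longest_run_normalized("aaabbbccc"))              # ≈ 0.33 — run of 3 over 9 chars
print(longest_run_normalized("sk-abc123XYZ789def456"))  # ≈ 0.05 — run of 1 over 21 chars
```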
Group 2: Character Distribution Features (8 features)
These eight features describe the composition of the string across character classes. Together they capture the "shape" of the character set that a human eye uses to distinguish secrets from benign strings.
| Feature | What It Measures |
|---|---|
| `uppercase_ratio` | Proportion of A–Z characters |
| `lowercase_ratio` | Proportion of a–z characters |
| `digit_ratio` | Proportion of 0–9 characters |
| `special_ratio` | Proportion of non-alphanumeric characters |
| `hex_ratio` | Proportion of valid hexadecimal characters (0–9, a–f, A–F) |
| `base64_ratio` | Proportion of base64-safe characters (alphanumeric plus `+`, `/`, `=`) |
| `printable_ratio` | Proportion of printable ASCII characters |
| `whitespace_ratio` | Proportion of whitespace characters |
Why these specific ratios matter:
hex_ratio is particularly useful for distinguishing hash values from secrets. A SHA-256 hash has a hex_ratio of 1.0 — every character is a valid hex digit. An AWS access key has a hex_ratio of approximately 0.6 (uppercase letters reduce it). A JWT token has a hex_ratio near 0.0 (it's base64url-encoded, using characters outside the hex alphabet).
special_ratio catches secrets that include special characters — a strong signal for human-chosen passwords ("P@ssw0rd!") versus machine-generated tokens (which typically avoid special characters for compatibility reasons).
base64_ratio is the mirror of hex_ratio for base64-encoded content. Base64-encoded image data has a base64_ratio near 1.0. An API key that uses only alphanumeric characters has a high base64_ratio too — which is where the key name and other features need to disambiguate.
The classifier learns the interaction between these ratios. A string with high entropy, high hex_ratio, and a key name that scores 0.0 is almost certainly a hash. A string with high entropy, mixed character ratios, and a key name that scores 1.0 is almost certainly a secret.
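A sketch of the eight-ratio extractor, in the table's order (the real `character_ratios` in features.py may handle edge cases like the empty string differently):

```python
import string

HEX_CHARS = set(string.hexdigits)                              # 0-9 a-f A-F
B64_CHARS = set(string.ascii_letters + string.digits + "+/=")  # base64-safe
PRINTABLE = set(string.printable)

def character_ratios(value: str) -> list[float]:
    n = len(value) or 1  # avoid division by zero on empty input
    return [
        sum(c.isupper() for c in value) / n,      # uppercase_ratio
        sum(c.islower() for c in value) / n,      # lowercase_ratio
        sum(c.isdigit() for c in value) / n,      # digit_ratio
        sum(not c.isalnum() for c in value) / n,  # special_ratio
        sum(c in HEX_CHARS for c in value) / n,   # hex_ratio
        sum(c in B64_CHARS for c in value) / n,   # base64_ratio
        sum(c in PRINTABLE for c in value) / n,   # printable_ratio
        sum(c.isspace() for c in value) / n,      # whitespace_ratio
    ]

# An MD5-style digest is pure lowercase hex: hex_ratio 1.0, special_ratio 0.0
ratios = character_ratios("d8e8fca2dc0f896fd7cb4cb0031ba249")
```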
Group 3: Key Name Risk Score (1 feature)
This is the single most important feature in the model — feature importance 0.28, more than the entropy and character features combined.
```python
KEY_NAME_RISK = {
    # Score 1.0 — unambiguously sensitive
    "password": 1.0, "passwd": 1.0, "secret": 1.0,
    "private_key": 1.0, "privkey": 1.0,
    # Score 0.9 — very likely sensitive
    "api_key": 0.9, "apikey": 0.9, "token": 0.9,
    "credential": 0.9, "auth_token": 0.9,
    # Score 0.85 — likely sensitive
    "access_key": 0.85, "client_secret": 0.85,
    "bearer": 0.85, "authorization": 0.85,
    # Score 0.7 — possibly sensitive
    "key": 0.7, "auth": 0.7, "login": 0.7,
    # Score 0.2 — unlikely sensitive
    "config": 0.2, "setting": 0.2, "value": 0.1,
    # Score 0.0 — not sensitive
    "checksum": 0.0, "hash": 0.0, "version": 0.0,
    "id": 0.0, "uuid": 0.0, "color": 0.0
}

def key_name_risk_score(key_name: str) -> float:
    normalised = key_name.lower().strip("_")
    for keyword, score in KEY_NAME_RISK.items():
        if keyword in normalised:
            return score
    return 0.3  # Unknown key names get a moderate default
The scoring function does substring matching, so DB_PASSWORD, database_password, and user_passwd all score 1.0. API_KEY_V2 and service_api_key both score 0.9.
Unknown variable names — ones that don't contain any recognised keyword — get a default score of 0.3. This is deliberately moderate: an unknown variable name is mild evidence that the string might not be sensitive (if it were, it would likely have a recognisable name), but it's not strong evidence either way.
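Restating the function as a runnable check, the substring matching behaves as described — note that it relies on the dictionary's insertion order (preserved in Python 3.7+), so `api_key` is tested before the more generic `key`:

```python
KEY_NAME_RISK = {
    "password": 1.0, "passwd": 1.0, "secret": 1.0,
    "private_key": 1.0, "privkey": 1.0,
    "api_key": 0.9, "apikey": 0.9, "token": 0.9,
    "credential": 0.9, "auth_token": 0.9,
    "access_key": 0.85, "client_secret": 0.85,
    "bearer": 0.85, "authorization": 0.85,
    "key": 0.7, "auth": 0.7, "login": 0.7,
    "config": 0.2, "setting": 0.2, "value": 0.1,
    "checksum": 0.0, "hash": 0.0, "version": 0.0,
    "id": 0.0, "uuid": 0.0, "color": 0.0,
}

def key_name_risk_score(key_name: str) -> float:
    normalised = key_name.lower().strip("_")
    for keyword, score in KEY_NAME_RISK.items():
        if keyword in normalised:
            return score
    return 0.3

assert key_name_risk_score("DB_PASSWORD") == 1.0      # "password" matches first
assert key_name_risk_score("service_api_key") == 0.9  # "api_key" checked before "key"
assert key_name_risk_score("checksum") == 0.0
assert key_name_risk_score("frobnicate") == 0.3       # no keyword → moderate default
```

(`frobnicate` is a made-up variable name standing in for anything the dictionary doesn't recognise.)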
The impact of this feature on classification decisions is substantial:
```python
# Same value, wildly different classifications
password = "d8e8fca2dc0f896fd7cb4cb0031ba249"  # → flagged at 94% confidence
checksum = "d8e8fca2dc0f896fd7cb4cb0031ba249"  # → passed at 8% confidence
```
Without the key name feature, these two lines are identical to the classifier. With it, they're completely distinguishable.
Group 4: Pattern Match Flags (16 features)
These are binary features — 0 or 1 — indicating whether the value matches any of 16 known secret format patterns.
| Flag | Pattern |
|---|---|
| `pattern_aws_access_key` | `AKIA[0-9A-Z]{16}` |
| `pattern_github_pat` | `gh[pousr]_[A-Za-z0-9]{36}` |
| `pattern_github_fine_grained` | `github_pat_[A-Za-z0-9]{82}` |
| `pattern_jwt` | `eyJ[A-Za-z0-9-_]+\.[A-Za-z0-9-_]+\.[A-Za-z0-9-_]+` |
| `pattern_openai_key` | `sk-[A-Za-z0-9]{48}` |
| `pattern_slack_token` | `xox[baprs]-[A-Za-z0-9-]+` |
| `pattern_stripe_secret` | `sk_live_[A-Za-z0-9]{24}` |
| `pattern_stripe_publishable` | `pk_live_[A-Za-z0-9]{24}` |
| `pattern_google_api` | `AIza[0-9A-Za-z-_]{35}` |
| `pattern_heroku_api` | `[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}` |
| `pattern_private_key_header` | `-----BEGIN (RSA\|…` (PEM private key header) |
| `pattern_db_connection` | `(postgresql\|…` (database connection string) |
| `pattern_basic_auth` | `[A-Za-z0-9+/]{20,}={0,2}` (base64 basic auth) |
| `pattern_bearer_token` | `Bearer [A-Za-z0-9-._~+/]+=*` |
| `pattern_hex_key_32` | `[0-9a-f]{32}` (32-char hex — common key length) |
| `pattern_hex_key_64` | `[0-9a-f]{64}` (64-char hex — SHA-256 length) |
When any of these flags fire, the classifier has strong prior evidence that the value is a known secret format. A value that matches pattern_aws_access_key will be classified as a secret at very high confidence regardless of what the other features say.
The last two flags — pattern_hex_key_32 and pattern_hex_key_64 — deserve special mention. These match the lengths of common cryptographic keys but also match MD5 and SHA-256 hashes, which are not secrets. This is where the key name feature does critical disambiguation work: a 32-character hex string with key name checksum has pattern_hex_key_32 = 1 but key_name_risk = 0.0, and the classifier correctly passes it. The same string with key name encryption_key gets flagged.
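The flag extractor might look like this, using four of the sixteen patterns from the table. Treating the hex patterns as full-string matches is my assumption (an unanchored 32-char hex pattern would also fire inside a 64-char digest); the others fire on any occurrence within the value:

```python
import re

# A subset of the sixteen format patterns; the full list lives in the table above.
SEARCH_PATTERNS = {
    "aws_access_key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "github_pat": re.compile(r"gh[pousr]_[A-Za-z0-9]{36}"),
    "jwt": re.compile(r"eyJ[A-Za-z0-9-_]+\.[A-Za-z0-9-_]+\.[A-Za-z0-9-_]+"),
}
FULLMATCH_PATTERNS = {
    "hex_key_32": re.compile(r"[0-9a-f]{32}"),
}

def pattern_match_flags(value: str) -> list[int]:
    """One binary flag per known secret format, in a fixed order."""
    flags = [1 if p.search(value) else 0 for p in SEARCH_PATTERNS.values()]
    flags += [1 if p.fullmatch(value) else 0 for p in FULLMATCH_PATTERNS.values()]
    return flags

print(pattern_match_flags("AKIAIOSFODNN7EXAMPLE"))
# → [1, 0, 0, 0]: only the AWS flag fires
print(pattern_match_flags("d8e8fca2dc0f896fd7cb4cb0031ba249"))
# → [0, 0, 0, 1]: an MD5 hash trips the 32-char hex flag
```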
How the Features Interact: Three Case Studies
Understanding individual features is useful. Understanding how they interact is where the real insight lives.
Case Study 1: The Human-Chosen Password
```python
SMTP_PASSWORD = "Winter2019!"
```

| Feature | Value | Signal |
|---|---|---|
| `shannon_entropy` | 3.46 | Weak — below the threshold for "looks random" |
| `repetition_ratio` | 1.0 | Neutral |
| `special_ratio` | 0.09 | Slightly elevated |
| `key_name_risk` | 1.0 | Very strong — "password" scores maximum |
| `pattern_*` flags | All 0 | No known format match |
| **Classification** | **Secret — 91% confidence** | |
The low entropy would cause a pure entropy scanner to miss this entirely. The key name saves it.
Case Study 2: The UUID False Positive
```python
session_correlation_id = "550e8400-e29b-41d4-a716-446655440000"
```

| Feature | Value | Signal |
|---|---|---|
| `shannon_entropy` | 3.4 | Moderate — looks somewhat random |
| `hex_ratio` | 0.89 | Very high — almost all hex characters |
| `special_ratio` | 0.11 | Low — only the four hyphens |
| `key_name_risk` | 0.0 | Minimal — "id" scores 0.0 |
| `pattern_heroku_api` | 1 | Fires — Heroku API keys are UUID-format |
| **Classification** | **Benign — 23% confidence** | |
The pattern flag fires (Heroku API keys look like UUIDs), but the key name score is 0.0 and the classifier correctly suppresses the finding. A regex scanner using only the Heroku pattern would flag this. The ML classifier does not.
Case Study 3: The Ambiguous High-Entropy String
```python
encryption_key = "a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6"
```

| Feature | Value | Signal |
|---|---|---|
| `shannon_entropy` | 3.9 | Moderate |
| `hex_ratio` | 1.0 | Maximum — pure hex |
| `key_name_risk` | 0.7 | High — "key" is the matched keyword |
| `pattern_hex_key_32` | 1 | Fires — 32-char hex matches |
| `repetition_ratio` | 0.5 | Low — repeating pattern visible |
| **Classification** | **Secret — 78% confidence** | |
The repetition ratio is low (the value has a repeating a1b2c3... pattern that reduces uniqueness), which pulls the confidence down from what it would be for a truly random key. But the key name and pattern flag are strong enough to push it above the reporting threshold. The finding would be reported at MEDIUM confidence — a prompt for human review rather than a guaranteed finding.
What the Feature Vector Cannot See
Intellectual honesty requires being clear about the limits.
Cross-variable context. The feature vector sees one value at a time. It can't see that key = config["encryption_key"] is loading the key from a config object rather than hardcoding it. A human engineer would immediately see that's not a hardcoded secret; the feature vector has no way to represent that.
File context. The feature vector doesn't know it's in a test file, a mock object, or a README code example. TEST_API_KEY = "fake-key-for-testing" might have a high key name risk score despite being explicitly for testing. The inline suppression annotation (# secrets-ignore) is the escape hatch for this case.
Semantic intent. version = "1.0.0" will correctly score a low key name risk. But release_token = "1.0.0-beta" might score higher because "token" is a high-risk keyword — even though in context this is clearly a version string. The feature vector sees the word "token" in the variable name without understanding that it's used semantically differently here.
These limitations are why the classifier is a signal generator rather than an oracle. Every finding above the confidence threshold warrants a human review. The classifier reduces the review burden dramatically — from "look at every high-entropy string in the codebase" to "look at these 20 high-confidence findings" — but it doesn't eliminate the need for human judgment.
Retraining on Your Own Data
The feature vector approach makes retraining practical in a way that deep learning approaches don't. Because the features are hand-engineered and interpretable, adding new training samples has predictable effects.
If your codebase has a pattern of false positives — say, your internal logging library uses variable names like log_token that consistently score high key name risk despite being benign — you can add synthetic examples of that pattern to the benign training set and retrain in seconds:
# Add your custom generators to trainer.py, then:
python main.py train --samples 5000
The retrained model immediately incorporates your organisation-specific context. That's a capability that's practically unavailable with regex-based tools (you'd have to modify pattern files and accept increased miss rates) and theoretically possible but operationally impractical with deep learning (retraining takes hours and requires ML expertise).
The complete feature extraction code is in secrets_detector/features.py at github.com/pgmpofu/secrets-detector.
Next up: why the variable name is the single most important feature in secrets detection — and what that tells us about how developers accidentally expose credentials.