Before I wrote a single line of model training code, I made a decision that constrained everything that followed.
I would not train on real leaked credentials.
The alternative was straightforward. GitHub's public commit history contains millions of accidentally committed secrets — API keys, passwords, connection strings, private keys — that have been scraped, indexed, and catalogued by security researchers. Datasets of this material exist. Using them as positive training examples would produce a model trained on exactly the kind of data it needs to recognise.
I chose not to do that. And the reasoning is more nuanced than "it felt wrong."
This article is about why I made that choice, how I built synthetic training data that avoids the problem, what the tradeoffs are, and what the broader principle is for anyone building ML security tooling.
Why Real Leaked Credentials Are a Problematic Training Source
The obvious objection to using real leaked secrets is ethical: those credentials belong to real people and organisations. Even if the data is technically public — visible in a GitHub commit, indexed by search engines — using it for commercial or portfolio purposes raises questions about consent and purpose.
But the ethical argument alone isn't the strongest one. The stronger arguments are practical.
Legal Ambiguity
The legal status of scraping and using publicly accessible but unintentionally published credentials is genuinely unclear across jurisdictions. In some interpretations of computer fraud and data protection law, accessing and storing leaked credentials — even for research purposes — could constitute unauthorised access to data or improper processing of personal information.
The GDPR position on this is particularly murky. Credentials are often linked to personal accounts. Processing personal data, even publicly accessible personal data, requires a lawful basis. "I needed it to train my model" is not a lawful basis.
I'm not a lawyer and this isn't legal advice. But I am someone building a tool I intend to put on GitHub with my name on it. "The legal status of my training data is unclear" is not a position I wanted to be in.
Data Quality Problems
Leaked credential datasets have severe quality problems that make them worse training data than they might appear.
Temporal distribution shift. Key formats change over time. GitHub PATs changed format in 2021 from a 40-character hex string to a structured format with a ghp_ prefix. AWS has introduced new key formats. An older leaked-credentials dataset would train the model on formats that are no longer issued while underrepresenting the ones in current use.
Survivorship bias. The credentials that get scraped and catalogued are the ones that were detected and revoked. Harder-to-detect secrets — generically named variables, low-entropy human-chosen passwords — are systematically underrepresented in public leaked credential datasets precisely because they're harder to find.
Label noise. Not every string in a "leaked credentials" dataset is actually a sensitive credential. Test keys, example values, documentation snippets, and deliberately fake keys appear throughout. Cleaning a scraped dataset to get reliable labels is a substantial manual effort.
Negative example scarcity. A dataset of leaked credentials is purely positive examples. You still need high-quality negative examples — high-entropy strings that aren't secrets — to train a classifier that distinguishes secrets from benign values. These need to be generated separately anyway.
Synthetic data generation, done carefully, avoids all of these problems. You control the format distribution, the label quality, and the class balance precisely.
Reusability and Sharing
A tool trained on synthetic data can be shared freely. The training code can be published. The data generation methodology can be documented. Other researchers can reproduce, audit, and improve the approach.
A tool trained on scraped real credentials has a provenance problem the moment someone asks "where did your training data come from?" Publishing that training data would mean republishing the leaked credentials. Not publishing it means the model can't be fully reproduced or audited.
Reproducibility matters in security tooling specifically because trust matters. A secrets detector that you can't audit end-to-end is a secrets detector you're taking on faith.
How I Generated Synthetic Training Data
The synthetic data generator in trainer.py produces two classes of examples: secrets (label=1) and benign high-entropy strings (label=0).
Generating Positive Examples (Secrets)
For known-format secrets, I generate values that match the structural properties of real secrets without being real secrets:
import base64
import json
import random
import string

def generate_aws_access_key():
    """Generate synthetic AWS access key format"""
    chars = string.ascii_uppercase + string.digits
    suffix = ''.join(random.choices(chars, k=16))
    return f"AKIA{suffix}"

def generate_github_pat():
    """Generate synthetic GitHub PAT (new format)"""
    chars = string.ascii_letters + string.digits
    suffix = ''.join(random.choices(chars, k=36))
    prefix = random.choice(['ghp', 'gho', 'ghu', 'ghs', 'ghr'])
    return f"{prefix}_{suffix}"

def generate_jwt():
    """Generate syntactically valid JWT structure"""
    header = base64.urlsafe_b64encode(
        json.dumps({"alg": "HS256", "typ": "JWT"}).encode()
    ).rstrip(b'=').decode()
    payload = base64.urlsafe_b64encode(
        json.dumps({"sub": "1234567890", "iat": 1516239022}).encode()
    ).rstrip(b'=').decode()
    signature = ''.join(random.choices(
        string.ascii_letters + string.digits + '-_', k=43
    ))
    return f"{header}.{payload}.{signature}"
For generic hardcoded credentials — the human-chosen passwords and internal tokens that no regex would catch — I generate values following common human password patterns:
def generate_human_chosen_password():
    """Generate realistic human-chosen passwords"""
    patterns = [
        # Word + year + special
        lambda: f"{random.choice(COMMON_WORDS)}{random.randint(2015, 2024)}{random.choice('!@#$%')}",
        # Capitalised word + number
        lambda: f"{random.choice(COMMON_WORDS).capitalize()}{random.randint(1, 999)}",
        # Two words concatenated
        lambda: f"{random.choice(COMMON_WORDS)}{random.choice(COMMON_WORDS)}",
        # Word + special pattern
        lambda: f"{random.choice(COMMON_WORDS).upper()}_{random.randint(100, 999)}",
    ]
    return random.choice(patterns)()

COMMON_WORDS = [
    "winter", "summer", "spring", "autumn", "admin", "secure",
    "company", "service", "backend", "system", "master", "main",
    "deploy", "production", "staging", "develop", "internal",
]
Each generated secret is paired with a realistic variable name drawn from the high-risk vocabulary:
SECRET_VARIABLE_NAMES = [
    "API_KEY", "api_key", "apiKey",
    "SECRET_KEY", "secret_key", "secretKey",
    "PASSWORD", "password", "passwd", "pwd",
    "ACCESS_TOKEN", "access_token", "accessToken",
    "DATABASE_URL", "database_url", "db_url", "DB_URL",
    "PRIVATE_KEY", "private_key", "privateKey",
    # ... 40+ more
]
The (variable_name, value) pairs that go into training represent the full context the feature extractor sees.
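A minimal sketch of how these pairs come together (the helper names here, like make_secret_example and SECRET_GENERATORS, are illustrative rather than the exact identifiers in trainer.py):

SECRET_GENERATORS = [
    generate_aws_access_key,
    generate_github_pat,
    generate_jwt,
    generate_human_chosen_password,
]

def make_secret_example():
    """Pair a synthetic secret value with a realistic variable name (label=1)."""
    variable_name = random.choice(SECRET_VARIABLE_NAMES)
    value = random.choice(SECRET_GENERATORS)()
    return (variable_name, value, 1)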
Generating Negative Examples (Benign High-Entropy Strings)
The negative class is where most secrets detectors fail — they don't have enough high-quality negative examples, so the model learns "high entropy = secret" rather than "high entropy in a credential context = secret."
I generate several categories of benign high-entropy strings:
import hashlib
import uuid

def generate_uuid():
    return str(uuid.uuid4())

def generate_sha256_hash():
    content = ''.join(random.choices(string.printable, k=random.randint(10, 100)))
    return hashlib.sha256(content.encode()).hexdigest()

def generate_md5_hash():
    content = ''.join(random.choices(string.printable, k=random.randint(10, 100)))
    return hashlib.md5(content.encode()).hexdigest()

def generate_base64_data():
    """Simulate base64-encoded image or binary data fragments"""
    data = bytes(random.randint(0, 255) for _ in range(random.randint(20, 60)))
    return base64.b64encode(data).decode()

def generate_package_integrity_hash():
    """npm/yarn integrity hash format"""
    data = bytes(random.randint(0, 255) for _ in range(48))
    hash_val = base64.b64encode(data).decode()
    return f"sha512-{hash_val}"

def generate_hex_color():
    return ''.join(random.choices('0123456789abcdef', k=6))

def generate_version_string():
    major = random.randint(0, 10)
    minor = random.randint(0, 99)
    patch = random.randint(0, 999)
    return f"{major}.{minor}.{patch}"
Each negative example is paired with a low-risk variable name:
BENIGN_VARIABLE_NAMES = [
    "checksum", "hash", "digest", "fingerprint",
    "uuid", "guid", "id", "identifier", "correlation_id",
    "version", "release", "build_number",
    "color", "colour", "hex_color",
    "integrity", "content_hash",
    # ... 30+ more
]
Class Balance and Distribution
The training set uses a 50/50 class balance — equal numbers of secrets and benign strings. This is a deliberate choice.
Real codebases have far fewer secrets than benign strings — maybe 1% of high-entropy strings are actual secrets in a typical codebase. Training on a 1% positive class would produce a classifier that learns to say "not a secret" almost all the time and achieves 99% accuracy by doing so — completely useless.
A 50/50 balance forces the model to actually learn to distinguish the classes. The resulting classifier has higher false positive rates on real codebases than the training accuracy suggests, which is why the confidence threshold (default 0.7) and the key name feature do so much work in production.
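A rough sketch of how that balance could be assembled; make_benign_example and build_dataset are illustrative names rather than the exact trainer.py API:

BENIGN_GENERATORS = [
    generate_uuid, generate_sha256_hash, generate_md5_hash,
    generate_base64_data, generate_package_integrity_hash,
    generate_hex_color, generate_version_string,
]

def make_benign_example():
    """Pair a benign high-entropy value with a low-risk variable name (label=0)."""
    variable_name = random.choice(BENIGN_VARIABLE_NAMES)
    value = random.choice(BENIGN_GENERATORS)()
    return (variable_name, value, 0)

def build_dataset(n_samples=6000):
    """Generate an even split of secret and benign examples, then shuffle."""
    half = n_samples // 2
    examples = [make_secret_example() for _ in range(half)]
    examples += [make_benign_example() for _ in range(half)]
    random.shuffle(examples)
    return examples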
The threshold can be adjusted to trade precision for recall:
# Higher threshold — fewer false positives, more false negatives
python main.py scan ./src --threshold 0.85
# Lower threshold — more findings, more false positives
python main.py scan ./src --threshold 0.55
Validating Synthetic Data Quality
The risk of synthetic data is that it doesn't reflect the distribution of real data. A model trained on synthetic examples might perform well on the test set (also synthetic) and poorly on real codebases.
I validated against three real-world test cases:
Test 1: Known public secret patterns. I collected public documentation examples of secret formats — the example values shown in AWS, GitHub, and OpenAI documentation. These are not real secrets; they're deliberately fake values used in documentation. The model should classify them as secrets (since they match real formats) and does so at >95% confidence.
Test 2: Known benign high-entropy strings. I collected package-lock.json integrity hashes, UUID values from public test suites, and SHA-256 checksums from public software distributions. The model should classify these as benign and does so at <10% confidence in the vast majority of cases.
Test 3: Edge cases from my own code. I scanned my own development projects — including the secrets detector itself — and manually reviewed every finding above 0.5 confidence. This is where real-world calibration happens. The findings from this scan informed several adjustments to the key name vocabulary and confidence thresholds.
The synthetic approach doesn't eliminate the need for this kind of real-world validation. It just means the real-world validation is about calibration rather than about whether the model has learned anything at all.
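To make Tests 1 and 2 concrete, here is roughly what those spot checks reduce to; score_value is a stand-in for the trained model's predict interface, and the example values are public documentation and test values, not real credentials:

# Stand-in spot checks for Tests 1 and 2. score_value() is a placeholder
# for the real classifier's interface; thresholds mirror the figures above.
DOC_EXAMPLE_SECRETS = [
    ("AWS_ACCESS_KEY_ID", "AKIAIOSFODNN7EXAMPLE"),  # AWS documentation example key
]

KNOWN_BENIGN = [
    ("correlation_id", "550e8400-e29b-41d4-a716-446655440000"),  # widely used example UUID
]

def spot_check(score_value):
    """Documentation-format secrets should score high; known-benign values should score low."""
    for name, value in DOC_EXAMPLE_SECRETS:
        assert score_value(name, value) > 0.95, f"missed documented format: {name}"
    for name, value in KNOWN_BENIGN:
        assert score_value(name, value) < 0.10, f"false positive on: {name}"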
The Ongoing Data Problem: Concept Drift
Secret formats change. New services launch with new key formats. Existing services rotate their key structures for security reasons. The synthetic data that was representative in 2023 may underrepresent the formats that matter in 2025.
This is the secrets detection equivalent of the vulnerability scanner coverage gap problem — there will always be a lag between a new format appearing in the wild and the tool being updated to detect it.
The response to this is the same as it is for signature-based detection: a clear update process. When a new cloud service launches with a distinctive key format, the update is:
- Add a generator function for the new format to trainer.py
- Add a pattern match flag to the feature extractor
- Retrain: python main.py train --samples 6000
- The new format is now detected

The synthetic data approach makes this update cycle fast and low-risk. Adding new training examples doesn't require finding or curating real examples of the new format — just implementing its generation logic. Retraining takes seconds. The update can ship as a minor version bump.
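For example, if a hypothetical provider launched keys with a distinctive xkey_ prefix, the whole change would be one new generator plus a retrain (the format below is invented purely for illustration):

def generate_xkey_token():
    """Synthetic token for a hypothetical provider format: 'xkey_' + 32 base62 chars."""
    chars = string.ascii_letters + string.digits
    return "xkey_" + ''.join(random.choices(chars, k=32))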
Synthetic Data as a Security Research Methodology
Stepping back from this specific tool: the synthetic data approach is applicable to a much broader class of security ML problems.
Phishing email detection can be trained on algorithmically generated phishing templates rather than real phishing emails, which carry real malicious links and attachments.
Malware classification researchers face the same problem I faced — real malware samples are dangerous to handle and distribute. Synthetic malware features derived from known behavioural signatures can substitute for actual samples in feature-level classifiers.
Log anomaly detection for security can use synthetic attack log patterns derived from published attack techniques rather than actual attack logs from production systems.
The common thread: real security data is often sensitive, legally ambiguous, dangerous to handle, or has quality problems that make it worse than it appears. Carefully generated synthetic data, validated against real-world examples without incorporating them into training, is frequently the more practical path.
The tradeoff is always the same: you give up the naturalness of real data distribution in exchange for control, safety, reproducibility, and shareability. For security tooling specifically — where trust and auditability matter — that tradeoff is often worth making.
What I'd Do Differently
If I were building this tool for a commercial security product rather than a portfolio project, I'd approach training data differently in two ways.
Structured negative mining from real codebases. Rather than generating synthetic negative examples, I'd mine real open source repositories for high-entropy strings that are demonstrably not secrets — package hashes, checksums in test suites, example values in documentation. These are safe to use (no real credentials), have the right distribution (they appear in real code as developers write it), and don't require synthetic generation. The labeling work is the constraint, not the data availability.
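At its simplest, that mining could look like walking repositories for npm lockfiles and harvesting their integrity hashes, which are high-entropy but never secrets. This sketch is not part of the current tool, and lockfile structure varies by npm version:

import json
from pathlib import Path

def mine_integrity_hashes(repo_root):
    """Collect npm integrity hashes from lockfiles as labeled benign negatives."""
    negatives = []
    for lock_file in Path(repo_root).rglob("package-lock.json"):
        lock = json.loads(lock_file.read_text())
        for pkg in lock.get("packages", {}).values():
            if "integrity" in pkg:
                negatives.append(("integrity", pkg["integrity"], 0))
    return negatives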
A small labeled set of real format examples. Not real credentials — but real format examples. The example values in service provider documentation (AWS's AKIAIOSFODNN7EXAMPLE, GitHub's documented PAT format examples) are designed to look like real keys without being real keys. A small set of these, clearly labeled, would improve the model's calibration on the exact formats that matter most.
The synthetic approach I built is the right choice given the constraint of a solo portfolio project with no data labeling resources. A team building a production tool would have access to more options.
The data generation code is in trainer.py at github.com/pgmpofu/secrets-detector. All generators are clearly documented and the entire training pipeline is reproducible from scratch with a single command.
Next up: building the pre-commit hook — blocking secrets before they ever reach the repository, and the UX considerations that determine whether developers actually leave it enabled.