Saint Zero Day

Posted on May 30

My test suite was green. My software was lying to me.

#go #security #testing #devops

My CI was green. 1,885 tests, 66 packages, zero failures. go vet clean. The build was a single self-contained binary. By every signal a Go project gives you, it worked.

Then I pointed it at something real, and watched it lie to my face.

This is the story of six bugs I found in my own security platform — ZDS Core — by refusing to trust a green checkmark. Five of the six belonged to the same scary family: the code reported success and stored nothing. No error. No stack trace. A 200 OK and an empty database.

If you ship anything that ingests data from the outside world, you have at least one of these right now. Let me show you what they look like.

The setup: test against reality, not fixtures

ZDS Core is a security platform written in Go — vulnerability scanning, an endpoint agent, EDR integrations, vuln-data feeds, compliance exports, the works. My unit tests were thorough. They were also, I realized, all talking to fixtures and in-memory SQLite. They proved my logic. They proved nothing about what happens when a real Wazuh server, a real OpenSearch cluster, or a real CVE feed shows up with data shaped slightly differently than I assumed.

So I spent a weekend wiring up the real things:

podman containers for an nginx target, a Wazuh 4.7 manager, an OpenSearch node
nmap for actual scanning
Live feeds: CISA KEV, FIRST.org EPSS, Google OSV, NVD
The endpoint agent running on my actual Fedora box

Rule for the weekend: "it ran" is not a pass. Data has to land, and it has to be correct. Then I went looking for lies.

Bug #1 — The endpoint agent that threw everything away

The agent registered with the server. It collected 2,343 software packages and a pile of open ports off my machine. The server logged results accepted for every batch. The agent was happy. I was happy.

The software table had zero rows.

failed to insert port finding 38810/udp: FOREIGN KEY constraint failed
failed to insert port finding 40319/udp: FOREIGN KEY constraint failed
...

The handler looked up the agent's asset by IP to get a parent asset_id. But nothing ever created that asset. So agentAssetID was 0, every insert violated the foreign key, and the error was logged-and-swallowed inside a loop while the endpoint cheerfully returned:

{ "accepted": true, "message": "accepted software_inventory results" }

The fix was to create the asset on first sight (idempotently), and — embarrassingly — to actually set AssetID on the port finding, which the original code never did. After that: 2,343 packages and 96 findings landed. The entire endpoint telemetry pipeline had been a no-op, and the API was reporting victory the whole time.

Lesson: an accepted: true you control is worth nothing. The only proof is the row.

Bug #2 — "Synced 1" / stored 0

Same family, different corner. My EDR alert sync said alerts synced: 1. The findings table was empty.

This one was a landmine hiding in a helper:

func (db DB) UpsertFinding(f model.Finding) (int64, bool, error) {
    ...
    if f.FingerprintHash == "" {
        return 0, false, nil   // <-- no error. no insert. nothing.
    }
    ...
}

A finding with no fingerprint returns (0, false, nil). No error. The caller increments synced++, logs success, moves on. Four different EDR code paths built findings without ever setting a fingerprint. Every alert and every vulnerability from those paths evaporated, silently, with a success message.

If you take one thing from this post: a function that returns nil for "I did nothing" is a trap. At minimum it should be loud. Ideally the type system shouldn't let you build the thing without the required field.

Bug #3 — The CVE that wasn't there

The OSV feed connector queried Google's OSV API for a CVE and reported Loaded 0 vulnerabilities. The API was returning 200 with a full record. I checked by hand:

$ curl -s https://api.osv.dev/v1/vulns/CVE-2021-44228 | jq '{id, aliases}'
{
  "id": "CVE-2021-44228",
  "aliases": ["GHSA-jfh8-c2jp-5v3q"]
}

The parser only scanned aliases for a CVE- prefix. But when you query OSV by CVE, the CVE is the record's id, and the alias is the GHSA. So the parser found no CVE, mapped to an empty ID, and dropped the result. Every CVE-keyed OSV sync had been a quiet no-op. One-line shape mismatch, total data loss.
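
The corrected lookup can be sketched as: check the record's id first, then fall back to the aliases. The osvRecord struct and the cveFromRecord name are assumptions for this example. Only the id and aliases fields shown in the curl output above are modeled.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// osvRecord holds only the two fields this sketch needs from an OSV response.
type osvRecord struct {
	ID      string   `json:"id"`
	Aliases []string `json:"aliases"`
}

// cveFromRecord returns the CVE identifier whether it appears as the record's
// own id (queried by CVE) or among its aliases (queried by GHSA or similar).
func cveFromRecord(r osvRecord) (string, bool) {
	if strings.HasPrefix(r.ID, "CVE-") {
		return r.ID, true
	}
	for _, alias := range r.Aliases {
		if strings.HasPrefix(alias, "CVE-") {
			return alias, true
		}
	}
	return "", false
}

func main() {
	raw := []byte(`{"id": "CVE-2021-44228", "aliases": ["GHSA-jfh8-c2jp-5v3q"]}`)
	var rec osvRecord
	if err := json.Unmarshal(raw, &rec); err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	id, ok := cveFromRecord(rec)
	fmt.Println(id, ok)
}
```

Returning an explicit ok flag means the caller has to decide what to do with a record that has no CVE, rather than silently mapping it to an empty ID.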

Bug #4 — A removed endpoint nobody noticed

edr sync against a real Wazuh 4.7 manager: agents and vulns synced fine, alerts 404'd.

wazuh api GET /alerts?limit=500&offset=0 returned status 404

The code's own comment admitted the truth — "Wazuh 4.x alerts come from the indexer API" — and then queried the Manager API's /alerts endpoint anyway. That endpoint existed in Wazuh 3.x and was removed in 4.x; alerts now live in the Indexer (OpenSearch). The integration could never have worked against any modern Wazuh.

Fixing it meant actually implementing the Indexer query path (wazuh-alerts-*/_search), wiring an --indexer-url, and skipping gracefully when it isn't set instead of faceplanting on a dead URL. Then I stood up a real OpenSearch node, seeded an alert, and watched it flow all the way through to a stored finding:

scanner=wazuh  category=alert  severity=high
title=[5710] sshd: Attempt to login using a non-existent user

That end-to-end run is the only reason I believe it works. A unit test with a mocked response would never have caught that the endpoint itself was gone.

Bug #5 — The connector with no front door

I went to test the Nessus/Nuclei/OpenVAS importer and discovered there was no way to call it. The parsers existed, were unit-tested, and were registered in a nice little plugin registry — and absolutely nothing in the CLI or API ever invoked them. A whole feature with no door.

So I added one (connector import --tool nuclei --file ...), ran a real Nuclei scan against my nginx target, imported it — and immediately found:

Bug #6 — Nuclei findings, minus the CVE

Nuclei tags its CVE-template matches with a classification block:

"classification": { "cve-id": ["CVE-2021-44228"], "cwe-id": ["cwe-502"], "cvss-score": 10.0 }

My parser's struct didn't have a classification field, so encoding/json quietly dropped it. Every CVE that Nuclei detected lost its CVE on import — no correlation, no KEV matching, no CSAF export. The findings landed, but stripped of the one field that makes a vuln finding useful.
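
A sketch of the shape the parser needs. The field names come from the classification fragment above. Nesting it under an info object reflects my understanding of Nuclei's JSON output and should be treated as an assumption, as should the nucleiResult and parseNucleiLine names and the sample template-id and name values.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// classification mirrors the block Nuclei attaches to CVE-template matches.
type classification struct {
	CVEIDs    []string `json:"cve-id"`
	CWEIDs    []string `json:"cwe-id"`
	CVSSScore float64  `json:"cvss-score"`
}

// nucleiResult declares only the fields this sketch cares about. Anything
// not declared here is ignored by encoding/json without an error.
type nucleiResult struct {
	TemplateID string `json:"template-id"`
	Info       struct {
		Name           string         `json:"name"`
		Classification classification `json:"classification"`
	} `json:"info"`
}

// parseNucleiLine decodes one JSON result line.
func parseNucleiLine(line []byte) (nucleiResult, error) {
	var r nucleiResult
	if err := json.Unmarshal(line, &r); err != nil {
		return nucleiResult{}, fmt.Errorf("decode nuclei result: %w", err)
	}
	return r, nil
}

func main() {
	// Illustrative input: only the classification values come from the post.
	line := []byte(`{"template-id":"example-template","info":{"name":"example","classification":{"cve-id":["CVE-2021-44228"],"cwe-id":["cwe-502"],"cvss-score":10.0}}}`)
	r, err := parseNucleiLine(line)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(r.Info.Classification.CVEIDs, r.Info.Classification.CVSSScore)
}
```

A regression test for this bug can then assert that a known CVE-template result still carries its CVE after parsing, which is the check the original fixtures never made.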

After the fix, the full chain finally worked: Nuclei scan → import → CVE finding → OSV enrich → correlate → CSAF-VEX 2.0 advisory. Six tools, one pipeline, and it took running every link to trust the chain.

What I actually learned

Green tests are a floor, not a ceiling. Mine proved the logic and hid every integration assumption I'd gotten wrong.
Silent success is the most dangerous failure mode. A crash gets fixed in an hour. A 200 OK that stores nothing survives to production and costs you a customer's trust.
"Accepted" is not "stored." Verify the row, not the response.
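Field-shape mismatches are invisible in Go by default. encoding/json ignores any field your struct doesn't declare unless you opt in with Decoder.DisallowUnknownFields, which is how the Nuclei classification block vanished. The CVE-in-id bug was the mirror image: the field decoded fine, but the parser looked for the value in the wrong place. Decode strictly, or assert on the parsed result.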
Run the whole chain against the real thing at least once. Every one of these six was only catchable with a live target.

All six are fixed, each with a regression test, and I wrote the evidence down — actual commands, actual output — in a VERIFICATION.md in the repo. If I'm going to ask people to trust this thing with their security, "trust me, the tests pass" isn't good enough. "Here's exactly what I ran and what it returned" is the bar.

The other thing that happened this week

In between bug hunts, I got the company site live: szdsecurity.com. Zero Day Security is the company I'm building this under — security and AI-adoption work, with the platform you just read about as the backbone. It's early. I'm pre-customer and building in the open, which is exactly why I'd rather publish the weekend I spent finding my own bugs than a glossy feature list.

If you build tools that ingest data, go try to break one this weekend. Point it at something real. I promise it's lying to you about something — the only question is whether you find out before your users do.

Building Zero Day Security. More war stories to come.

Top comments (2)

Harjot Singh • May 31

"A 200 OK and an empty database" is the most dangerous failure in software because it's invisible by design, success was reported, nothing was stored, and there's no error to alert on. That whole family of bugs exists because we test that the code ran, not that the world changed, and a green checkmark only proves the former. The fix you landed on (test against reality, not mocks) is the uncomfortable but correct one: mocks verify your assumptions about the boundary, and these bugs live precisely in the gap between your assumptions and the real system's behavior. The deeper lesson generalizes way past Go, it's the same reason "the LLM returned a confident answer" and "the LLM returned a correct answer" are different claims: a clean exit is not evidence of a correct effect. The discipline I keep is verify the outcome, not the call, assert the row exists, don't trust the 200. That's the core of how I build Moonshift's verify layer. Of the six, was the worst one a swallowed error, or a success path that never actually wrote, the silent-no-op is the one that keeps me up?

xulingfeng • May 30

Great breakdown of testing strategies! I've been working on AI-driven test automation and the "cost of test maintenance" point you raised really resonates — it's often the hidden bottleneck teams don't account for when adopting new frameworks. Curious, have you found any particular patterns that reduce flakiness in CI? 👀