Tudor Brad

Posted on • Originally published at betterqa.co

Fuzz testing found bugs in our API that unit tests never would

I used to think our test suites were solid. We had unit tests, integration tests, contract tests for the API layer. Good coverage numbers. The kind of setup that makes you feel safe when you merge to main on a Friday afternoon.

Then we ran a fuzzer against the same API and watched it fall apart in under an hour.

Fourteen crashes. Server panics on malformed JSON. A file upload endpoint that accepted literally anything as long as you set the right Content-Type header. An input field on a form that crashed the entire backend process when it received a float instead of an integer.

None of these showed up in our existing tests. Not one.

That was the day I stopped treating fuzzing as a "nice to have" and started treating it as the part of security testing that actually finds the bugs hiding between your test cases.

What fuzzing actually does

Fuzzing is simple in concept. You throw garbage at your software and see what breaks.

More precisely: you take valid inputs, mutate them in thousands of ways (wrong types, oversized strings, null bytes, nested objects 500 levels deep, unicode edge cases, truncated payloads), and send them at your application as fast as you can. Then you watch for crashes, hangs, memory leaks, unexpected error codes, and data that leaks out in error messages.
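To make the mutation idea concrete, here is a minimal sketch of that process in Python. This is illustrative only; real fuzzers generate orders of magnitude more variants and track coverage, but the core move is the same: start from a valid payload and break one assumption at a time.

```python
def mutations(valid: dict):
    """Yield malformed variants of a valid payload (an illustrative subset)."""
    for key, value in valid.items():
        yield {**valid, key: None}             # null where a value is expected
        yield {**valid, key: "A" * 50_000}     # oversized string
        yield {**valid, key: [value] * 200}    # unexpected array
        if isinstance(value, str):
            yield {**valid, key: 2**31 - 1}    # wrong type: int where a string goes
            yield {**valid, key: value + "\x00"}  # trailing null byte
    # deeply nested object where a flat one is expected
    nested = valid
    for _ in range(500):
        nested = {"payload": nested}
    yield nested

seed = {"name": "alice", "phone": "555-0100"}
variants = list(mutations(seed))
```

Each variant then gets sent at the target endpoint while you record status codes, bodies, and timing.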

The OWASP fuzzing page describes the technique well if you want the textbook version. But here is what it looks like in practice: you point a tool at an endpoint, go make coffee, and come back to a list of inputs that made your software do something it should not have done.

The reason this works so well is that developers test for what they expect. You write a test that sends valid JSON and checks the response. Maybe you write a test that sends empty JSON and checks for a 400 error. But you probably do not write a test that sends JSON with a key that is 50,000 characters long, or a nested array 200 levels deep, or a number where a string should be with a trailing null byte.

Fuzzers do not have expectations. They just try things. And software has a lot of assumptions baked into it that only surface when those assumptions get violated.

The bugs fuzzing catches that nothing else does

Let me walk through the actual categories of failures we find during fuzz testing engagements. These are real patterns from real projects.

Input type confusion. A registration form expects a string for the phone number field. The API handler parses it and passes it to a validation function that calls .match() on it. Send an integer instead of a string and the backend throws an unhandled TypeError. The server returns a 500 with a stack trace that includes the file path and line number. Now an attacker knows your framework, your file structure, and exactly where to probe next.

Unit tests rarely cover this because the developer wrote the test with the same mental model they used to write the code. They send a string because that is what the field is for.
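The handler above calls .match(), which suggests JavaScript, but the same failure mode exists almost verbatim in Python, where re.match raises a TypeError on a non-string. A sketch of both the crash and the defensive fix:

```python
import re

PHONE_RE = re.compile(r"^\+?[\d\-\s]{7,15}$")

def validate_phone_unsafe(value):
    # assumes value is a string; an int from the JSON body raises TypeError,
    # which bubbles up as an unhandled 500
    return bool(PHONE_RE.match(value))

def validate_phone_safe(value):
    # reject wrong types explicitly instead of crashing
    if not isinstance(value, str):
        return False
    return bool(PHONE_RE.match(value))
```

The fix is one isinstance check, but nobody writes it until something forces the wrong type through.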

Malformed JSON handling. We see this constantly. APIs that parse JSON request bodies without validating the structure first. Send {"user": {"name": {"name": {"name": ...}}}} nested 100 times and the server either runs out of memory or hits a recursion limit and crashes. Send JSON with a trailing comma (technically invalid) and some parsers accept it while others throw. Send a 10MB payload to an endpoint that expects 200 bytes and there is no size limit enforced.

These are not exotic attacks. They are basic robustness issues that every public-facing API should handle. Fuzzers find them in minutes.
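A defensive parse that closes the nesting and size gaps looks roughly like this (the limits are illustrative; pick values that fit your actual payloads):

```python
import json

MAX_BODY_BYTES = 64 * 1024
MAX_DEPTH = 20

def parse_body(raw: bytes):
    """Parse a JSON request body with size and nesting limits enforced up front."""
    if len(raw) > MAX_BODY_BYTES:
        raise ValueError("payload too large")
    obj = json.loads(raw)
    _check_depth(obj, MAX_DEPTH)
    return obj

def _check_depth(obj, remaining):
    # walk the parsed structure and refuse anything nested past the limit
    if remaining <= 0:
        raise ValueError("nesting too deep")
    if isinstance(obj, dict):
        for v in obj.values():
            _check_depth(v, remaining - 1)
    elif isinstance(obj, list):
        for v in obj:
            _check_depth(v, remaining - 1)
```

In production you would enforce the size limit at the web server or reverse proxy layer before the body ever reaches application code.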

File upload validation gaps. This one is a classic. An endpoint says it accepts PNG files. It checks the Content-Type header. It does not check the actual file content. So you can upload a PHP script, a shell script, or an SVG containing embedded JavaScript, and the server happily stores it. Depending on the server configuration, that file might be directly executable.

We tested a client's document upload feature and found that it validated the file extension in the filename but not the actual bytes. Renaming malicious.php to malicious.php.png got it straight through.
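Checking the actual bytes is cheap. Every PNG starts with the same fixed eight-byte signature, so a content check is one comparison:

```python
# the PNG file signature, fixed by the PNG specification
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"

def looks_like_png(data: bytes) -> bool:
    # validate the first eight bytes of the upload,
    # not the filename extension or the Content-Type header
    return data[:8] == PNG_MAGIC
```

A signature check is not a full defense on its own (polyglot files exist), but it kills the trivial rename-and-upload bypass outright.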

Error message information leakage. When software crashes on unexpected input, the error messages often contain information that should never reach the client. Database connection strings, internal IP addresses, full stack traces with dependency versions, SQL query fragments. Fuzzers trigger these crashes systematically, and each crash response becomes a reconnaissance opportunity for an attacker.
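The standard fix is a catch-all handler that keeps the detail server-side and hands the client only an opaque correlation ID. A framework-agnostic sketch (the response shape here is an assumption, not any particular framework's API):

```python
import logging
import uuid

log = logging.getLogger("api")

def safe_error_response(exc: Exception):
    """Log full exception detail server-side; return only an opaque ID to the client."""
    error_id = uuid.uuid4().hex
    # type, message, and trace stay in the server logs, keyed by the ID
    log.error("unhandled %s (id=%s): %s", type(exc).__name__, error_id, exc)
    return {"error": "internal server error", "id": error_id}, 500
```

Support can look up the ID in the logs; an attacker probing the endpoint learns nothing about your stack.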

Integer overflows and boundary values. We worked on a payment processing system where fuzz testing found an integer overflow in the transaction amount field. The field was a 32-bit signed integer. Send a value just past 2,147,483,647 and the system wrapped around to a negative number. In a payment context, that could mean a credit instead of a debit. Standard tests sent amounts like 100, 500, 10000. Nobody tested what happens at the boundary of the data type itself.
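Python integers do not overflow, so to see the failure mode you have to simulate the 32-bit signed storage explicitly. This is what the wraparound looks like:

```python
INT32_MAX = 2_147_483_647

def wrap_int32(n: int) -> int:
    """Simulate storing n in a 32-bit signed integer (two's complement)."""
    n &= 0xFFFFFFFF
    return n - 0x100000000 if n > INT32_MAX else n
```

One past the maximum wraps to the minimum: a transaction amount of 2,147,483,648 becomes -2,147,483,648, which is exactly the credit-instead-of-debit scenario above.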

Why your existing tests miss these

Your unit tests are written by the same people who wrote the code. They share the same assumptions about what valid input looks like. They test the happy path and a handful of known error cases.

Your integration tests verify that components work together correctly when given correct data. They rarely test what happens when component A sends garbage to component B.

Your end-to-end tests simulate real user behavior. Real users do not typically paste 50,000 characters into a phone number field or send raw bytes to a JSON endpoint. Attackers do.

Fuzzing fills the gap between "does it work correctly?" and "does it fail safely?" Those are two very different questions, and most test suites only answer the first one.

How we actually run fuzz tests

At BetterQA, fuzzing is part of our DAST (Dynamic Application Security Testing) work. We built an AI Security Toolkit with over 30 scanners, and fuzzing is integrated into the dynamic analysis pipeline.

Here is how a typical engagement works:

1. Map the attack surface. Before we fuzz anything, we need to know what exists. We crawl the application, identify all endpoints, document the expected input formats, and note which endpoints handle sensitive data (auth, payments, file uploads, admin functions).

2. Seed the fuzzer with valid inputs. Good fuzzing starts with valid data. We capture real requests from the application (with test accounts, never production data), and the fuzzer uses these as templates. It knows what a valid request looks like, so it can make targeted mutations rather than purely random noise.

3. Run mutation-based fuzzing. The fuzzer takes each valid input and generates thousands of variants. Wrong types, boundary values, encoding tricks, oversized payloads, special characters, null bytes, format string patterns. Each variant gets sent to the endpoint, and we capture the response code, response body, response time, and any server-side logs.

4. Triage the findings. Not every crash is a security vulnerability. Some are just robustness issues (the server returns a 500 but recovers cleanly). Some are actual security holes (the server leaks data, accepts the malformed input as valid, or enters an inconsistent state). We classify each finding by severity and exploitability.

5. Verify and document. Every finding gets manually verified. We reproduce the crash, confirm the root cause, and write up the fix. No false positives in the final report.

For web applications, we often use OWASP ZAP as one of the tools in this pipeline. For APIs, we combine custom fuzzing scripts with tools like Burp Suite's Intruder or purpose-built API fuzzers. For projects with unusual protocols (IoT devices, custom binary formats), we write targeted fuzzers from scratch.
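The seed-mutate-triage loop from the steps above can be sketched as a small harness. Everything here is hypothetical scaffolding: `send` stands in for whatever HTTP client posts to the target, and the triage heuristics are deliberately simple (real triage also watches timing, memory, and server-side logs):

```python
import copy

def fuzz_endpoint(send, seed: dict, mutators):
    """Send mutated variants of a valid seed payload; collect suspicious responses.

    send     -- callable taking a payload dict, returning (status_code, body_text)
    mutators -- list of callables, each mapping dict -> mutated dict
    """
    findings = []
    for mutate in mutators:
        payload = mutate(copy.deepcopy(seed))
        status, body = send(payload)
        # crude triage signals: server errors, or internals leaking into the body
        if status >= 500 or "Traceback" in body:
            findings.append({"payload": payload, "status": status, "body": body[:200]})
    return findings
```

Swapping `send` for a real HTTP call (or a test double, as below) keeps the loop itself testable offline.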

When to fuzz (and when not to)

Fuzzing works best when:

  • You have a public-facing API that accepts user input
  • You process file uploads
  • You handle payment or financial data
  • You parse complex data formats (JSON, XML, CSV, binary protocols)
  • You have already done basic security testing and want to go deeper

Fuzzing is less useful when:

  • The application has no external input surface (purely internal batch processing)
  • You have not done basic input validation yet (fix the obvious stuff first, then fuzz)
  • The codebase changes so frequently that findings become stale before they are fixed

The best time to start fuzzing is after your first round of functional testing is stable but before you go to production. That is when the cost of fixing issues is lowest and the risk of missing something is highest.

The security testing reality in 2024

As Tudor Brad, BetterQA's founder, puts it: "It's a good versus evil game right now." AI is accelerating development speed, which means more code ships faster, which means more potential vulnerabilities reach production faster. Features that used to take months now take days. The testing has to keep pace.

Fuzzing is one of the few techniques that scales with code output. You do not need to manually write a test case for every possible malformed input. The fuzzer generates them. You just need to point it at the right targets and have someone who knows what they are looking at to triage the results.

If you have never run a fuzzer against your application, I would strongly suggest trying it on a staging environment. The results will probably surprise you. We have yet to fuzz a non-trivial application and find zero issues. Every single engagement has turned up something the existing test suite missed.

The question is never "does my software have these bugs?" The question is "do I find them before someone else does?"

More on security testing and QA practices on the BetterQA blog.
