137Foundry

How to Build a CI Pipeline That Catches Common AI Code Issues

A CI pipeline designed for human-written code will catch some AI code problems and miss others. The ones it misses are not random. They follow a consistent pattern: issues that look fine visually but fail under specific conditions, dependency assumptions that only surface at runtime, and security gaps in input handling that well-formed test inputs do not expose.

Catching these failure modes does not require replacing your existing pipeline. It requires adding a few targeted stages that address the specific ways AI output differs from human-written code.

This guide walks through each stage.

Step 1: Add Static Analysis as the First Gate

Static analysis runs without executing the code. It reads the source and identifies patterns that match configurable rules. For AI-generated code, static analysis catches two categories of problems that would otherwise reach human review: unsafe coding patterns and security-relevant input handling issues.

For JavaScript and TypeScript, ESLint is the standard tool. Configure it to enforce strict rules on error handling (no silent catches), type safety (no implicit any), and null safety (no unchecked property access on potentially null values). These are patterns that AI models introduce frequently because they represent the "common enough" code the model learned from training data, not code written to match your team's specific standards.
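
As a sketch, an ESLint flat config enforcing those three categories might look like the following. The rule names come from ESLint core and typescript-eslint; the implicit-any and null-access checks ultimately come from TypeScript's strict compiler options, which a config like this complements rather than replaces.

// eslint.config.mjs — a minimal sketch assuming the typescript-eslint package; extend with your own plugin set
import tseslint from 'typescript-eslint';

export default tseslint.config(
  ...tseslint.configs.recommended,
  {
    rules: {
      // No silent catches: an empty catch block fails the lint run
      'no-empty': ['error', { allowEmptyCatch: false }],
      // Type safety: an explicit `any` fails the lint run
      '@typescript-eslint/no-explicit-any': 'error',
      // Null safety: forbid the `!` assertion that bypasses strictNullChecks
      '@typescript-eslint/no-non-null-assertion': 'error',
    },
  },
);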

For security-specific patterns across multiple languages, Semgrep is the right addition. Its pre-built rule library covers injection patterns, insecure deserialization, hardcoded secrets, and other classes of issue that AI-generated code handling user input tends to introduce. Running Semgrep on pull requests takes around ninety seconds on a typical codebase and catches a class of issues that general-purpose lint tools miss.

Configure both to fail the pipeline on any match rather than reporting warnings. If the rules produce too many false positives, tune the rules rather than degrading them to warnings. A warning that does not block a merge is rarely read.

Step 2: Run Parameterized Unit Tests Covering Edge Cases

Standard unit tests catch logic errors on the inputs you thought to test. The gap in AI-generated code is that you did not write the code, so you may not have thought to test the inputs that would have made you nervous while implementing it.

Parameterized tests address this gap. Instead of writing separate tests for each input, you define the test structure once and run it against a data table covering the full range of relevant inputs: valid inputs at the boundaries, invalid inputs at the boundaries, empty inputs, null inputs, and a representative set of malformed inputs.

In Jest, the pattern looks like this:

// Hypothetical module path for the function under test; adjust to your project
const { processValue } = require('./processValue');

test.each([
  [0, 'error: value must be positive'],
  [1, 'valid'],
  [100, 'valid'],
  [101, 'error: value exceeds maximum'],
  [null, 'error: value required'],
  ['abc', 'error: value must be numeric'],
])('processValue(%s) returns %s', (input, expected) => {
  expect(processValue(input)).toBe(expected);
});

The test structure is the same for all six cases. The data table is where the edge case coverage lives. Adding a new case is adding one row to the table.

For Python, pytest's @pytest.mark.parametrize provides the same pattern with the same efficiency benefits.

Configure the CI pipeline to fail if test coverage for new files falls below a threshold. The specific percentage matters less than the requirement: every AI-generated function shipped to production should have test coverage on its non-trivial edge cases, not just its happy path.
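
In Jest, for example, a coverage threshold can be enforced directly in the test configuration so the test run itself fails the pipeline when coverage drops below the bar. The numbers below are illustrative, not a recommendation, and a per-path or diff-based threshold gets closer to the "new files" requirement than a global one.

// jest.config.js — illustrative thresholds; pick numbers that match your team standard
module.exports = {
  collectCoverage: true,
  coverageThreshold: {
    global: {
      branches: 80,
      functions: 80,
      lines: 80,
      statements: 80,
    },
  },
};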

Step 3: Add Explicit Failure Tests for Error Handling

AI-generated code has a consistent error handling pattern: it catches exceptions and returns a default value rather than propagating the error or converting it to a domain-specific type. The code runs without throwing. It silently returns an incorrect result that is difficult to trace back to the original failure.

CI pipelines typically test the success path. To catch this pattern, add explicit tests that inject failures into external dependencies and verify the error behavior.

For a function that calls a database, write a test where the database call throws. Check what the function actually returns. If it returns null or an empty object when it should propagate the error, that is a silent failure that will be hard to debug in production.

import pytest  # the mocker fixture comes from the pytest-mock plugin
from service import get_user, ServiceError  # hypothetical module paths; adjust to your project
from db import DatabaseConnectionError

def test_get_user_propagates_db_error(mocker):
    mocker.patch('db.query', side_effect=DatabaseConnectionError("timeout"))
    with pytest.raises(ServiceError):
        get_user(user_id=42)

This test fails if the function swallows the database error rather than converting it to a ServiceError. Without this test, the silent catch would pass all other CI checks.
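
The same check in Jest might look like the sketch below. The module paths, the mocked data layer, and the ServiceError type are placeholders for whatever your service actually uses.

// getUser.test.js — hypothetical module paths; the data layer is mocked to fail
jest.mock('./db', () => ({ query: jest.fn() }));
const db = require('./db');
const { getUser, ServiceError } = require('./userService');

test('getUser converts a database failure into a ServiceError', async () => {
  db.query.mockRejectedValue(new Error('connection timeout'));
  // If getUser swallows the error and returns null instead, this assertion fails
  await expect(getUser(42)).rejects.toThrow(ServiceError);
});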

Step 4: Run Security Tests Against Input Handling Functions

Any function that accepts data from outside the trusted perimeter of the system needs security-specific tests. AI-generated input handling tends to use the right data types but miss the specific sanitization your system requires.

OWASP maintains a comprehensive testing guide (the Web Security Testing Guide) documenting standard test inputs for each vulnerability category. The categories most relevant to AI-generated code are SQL injection, command injection, path traversal, and cross-site scripting. Each has a set of standard test payloads that should produce a predictable error or sanitized output rather than being processed as valid data.

Write a parameterized test for each input-handling function that runs it against the OWASP test payloads for relevant categories. The test should assert that the function rejects or sanitizes each payload rather than processing it.
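
As a sketch, the shape of such a test in Jest could look like the following. The sanitizeFilename function and its reject-by-throwing behavior are assumptions for illustration, and the payloads are a few representative examples rather than the full OWASP lists.

// security.test.js — hypothetical sanitizeFilename under test; payloads are representative, not exhaustive
const { sanitizeFilename } = require('./sanitize'); // adjust path to your input-handling module

test.each([
  '../../etc/passwd',              // path traversal
  '..\\..\\windows\\system32',     // path traversal with Windows separators
  'report.pdf%00.exe',             // encoded null byte
  '<script>alert(1)</script>',     // markup smuggled into a filename
])('sanitizeFilename rejects %s', (payload) => {
  // Whether your function throws or returns a cleaned value, assert the exact behavior you expect
  expect(() => sanitizeFilename(payload)).toThrow();
});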

These tests run fast in isolation. Add them to the same test suite as your unit tests so they run on every push.

Step 5: Add Dependency Scanning

When AI models generate code that imports packages, they draw on their training data. A package or version that was common and safe at training time may have known vulnerabilities by the time the generated code reaches your repository.

Dependency scanning is a one-step addition to any CI pipeline. Run it against your dependency manifest (package.json, requirements.txt, or equivalent) and fail the pipeline on high-severity vulnerabilities in direct dependencies.

Snyk and GitHub's native dependency review both integrate into standard CI pipelines in under an hour. The scan adds about fifteen seconds to pipeline runtime and catches a class of risk that code review does not address.

Step 6: Run Integration Tests Against Realistic State

Unit tests with mocked dependencies verify that the function's logic is correct given the assumed behavior of those dependencies. Integration tests verify that the logic is correct given the actual behavior of those dependencies in your system.

For AI-generated code that interacts with external systems, this distinction matters more than for human-written code. The model's assumptions about how a library behaves or what shape a database response takes may not match your specific setup. Unit tests with well-crafted mocks will pass. Integration tests against realistic system state will fail on the mismatch.

Add integration tests that run against a representative database, a real API endpoint (or a realistic stub), or actual file system paths. Run these in a dedicated CI stage after unit tests pass, using a seeded test database or a sandbox environment.
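
A sketch of that stage with Jest and supertest, where the app, routes, and seed helpers are placeholders but the database behind them is a real, seeded test instance rather than a mock:

// users.integration.test.js — hypothetical app and seed helpers; nothing in the data layer is mocked
const request = require('supertest');
const app = require('./app'); // the real application, wired to the seeded test database
const { resetTestDb, seedTestDb } = require('./test/seed'); // hypothetical fixture helpers

beforeEach(async () => {
  await resetTestDb();
  await seedTestDb({ users: [{ id: 42, name: 'Ada' }] });
});

test('GET /users/42 returns the seeded user', async () => {
  const res = await request(app).get('/users/42');
  expect(res.status).toBe(200);
  expect(res.body.name).toBe('Ada');
});

test('GET /users/999 returns 404 for a user not in the seed data', async () => {
  const res = await request(app).get('/users/999');
  expect(res.status).toBe(404);
});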

The integration test stage takes longer than unit tests. That is expected. Run it on every pull request for code that touches external systems, and on every commit to main or release branches regardless of what was changed.

Step 7: Report Coverage Trends Over Time

A single coverage report tells you where you are today. Coverage trends tell you whether AI-assisted development is accumulating untested code over time.

Configure your CI pipeline to report coverage metrics to a central store and alert when coverage for new code falls below the team standard. SonarCloud provides this at the project level with a free tier that covers public repositories. For private repositories, the same principle applies using whatever coverage reporting your CI platform supports.

The goal is visibility. If AI-generated code is shipping at volume and test coverage is declining, that is a signal that the review workflow is under pressure. Catching that trend early is easier than catching it after a quarter of untested code has accumulated.


For a complete framework on applying these tests systematically across the review workflow, the guide on testing AI-generated code before shipping at 137Foundry covers the decision points and integrations in more detail.

The pipeline described here adds meaningful protection without significantly slowing down the development workflow. Static analysis runs in under two minutes. Parameterized unit tests run in under five. Security tests for input handling functions add seconds per function. The overhead is real but small, and it is far smaller than the cost of fixing the issues it prevents after they reach production.

