DEV Community

Dmitry Turmyshev

Posted on • Originally published at bitdive.io

Quality Assurance in AI Assisted Software Development: Risks and Implications

"We're now cooperating with AIs and usually they are doing the generation and we as humans are doing the verification. It is in our interest to make this loop go as fast as possible. So, we're getting a lot of work done."

— Andrej Karpathy: Software Is Changing (Again)

This quote describes a shift that is already visible in many teams. Code generation has accelerated. Verification and validation increasingly become the bottleneck.

With AI tools, writing code is often not the limiting factor anymore. The hard part is proving that what was generated is correct, safe, and maintainable.

Code Volume Growth and Test Review Challenges

To understand QA challenges, we should look at how code is produced. Testing is not isolated. It reflects development speed and development habits. If development accelerates, QA pressure grows too.

The Main Shift: Writing Has Become Cheap, Verification Has Become Expensive

A common side effect of AI coding is rapid codebase growth without matching growth in quality. This is often described as "code bloat".

Some data points:

  • Explosive Growth in Code Volume (Greptile, 2025): The "State of AI Coding 2025" report recorded a 76% increase in output per developer. At the same time, the average size of a Pull Request (PR) increased by 33%. In practical terms, far more material arrives for verification and review than a person can process carefully.

[Chart: 76% increase in code volume per developer and a 33% increase in Pull Request sizes in 2025]

  • Code quality signals degrade (GitClear, 2025): A study covering 211 million lines of changed code (2020 to 2024) reports that the share of refactoring and code movement fell from about 25% to under 10% by 2024, while copy and paste style changes increased.

  • Delivery stability can suffer (DORA, v2025.2): In "Impact of Generative AI in Software Development", DORA reports an estimated association: for every 25% increase in AI adoption, delivery throughput decreases by about 1.5% and delivery stability decreases by about 7.2%.

Anti-Patterns and "Review Fatigue"

Code generated by AI often contains structural security flaws and architectural workarounds ("crutches") that an experienced engineer would be unlikely to write. Errors become subtler and harder to detect, because AI produces syntactically correct but logically vulnerable code.

As the volume of generated code grows, human capacity for critical analysis decreases. The phenomenon of "Review Fatigue" sets in.

Engineers tend to trust correctly formatted code by default. The "Looks good to me" (LGTM) effect kicks in: the reviewer's attention is dulled by the visual polish of the generated solution.

When Tests Become Unmaintainable: The Knowledge Gap Problem

When test code grows faster than shared understanding, teams accumulate a critical knowledge gap.

Before AI tools, writing an automated test was a cognitive process. The engineer had to study requirements and formulate verification conditions. The test served as documentation of this understanding.

With AI assistants, hundreds of lines of test code can be generated in seconds, skipping this cognitive stage. When an AI-generated test fails, the engineer faces code they are seeing for the first time. The team's collective knowledge about what exactly these tests verify approaches zero.

The "Safe Path" Trap and Mocking Hell

AI models, being probabilistic, strive to minimize the risk of syntax errors. The safest path for the model is to write a test that calls a function but checks minimal conditions.
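The "safe path" is easy to show with a small, hypothetical example (the function and both tests below are illustrative, not from any cited report). The first test is the kind an assistant often produces: it executes the code but asserts almost nothing, so nearly any implementation would pass. The second pins down real behavior, including edge cases:

```python
def apply_discount(price: float, percent: float) -> float:
    """Apply a percentage discount, clamped to the 0-100 range."""
    percent = max(0.0, min(100.0, percent))
    return round(price * (1 - percent / 100), 2)

# "Safe path" test: runs the code, checks a trivial condition.
# Almost any return value passes, so almost any bug survives.
def test_apply_discount_weak():
    result = apply_discount(100.0, 20.0)
    assert result is not None

# A meaningful test pins the actual behavior, including the
# clamping edge cases the weak test never touches.
def test_apply_discount_behavior():
    assert apply_discount(100.0, 20.0) == 80.0
    assert apply_discount(100.0, 150.0) == 0.0   # clamped to 100%
    assert apply_discount(100.0, -5.0) == 100.0  # clamped to 0%
```

Both tests contribute identically to a coverage percentage, which is exactly why coverage alone says so little here.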

This gets worse with heavy mocking. AI often produces verbose tests with many mocks. Such tests can lock onto implementation details instead of behavior. Then refactors break tests even when user-visible behavior stays the same.
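A minimal sketch of the mock-coupling problem (class and method names are hypothetical): the first test pins the exact method, arguments, and call count, so any internal refactor breaks it; the second checks only the observable outcome.

```python
from unittest.mock import MagicMock

class UserService:
    def __init__(self, email_client):
        self.email_client = email_client

    def register(self, address: str) -> bool:
        # Implementation detail: one send() call per registration.
        self.email_client.send(address, subject="Welcome")
        return True

# Mock-heavy, AI-style test: locked onto the call shape. Rename
# send() or batch the emails, and this fails even though the
# user-visible behavior (registration succeeds) is unchanged.
def test_register_implementation_coupled():
    client = MagicMock()
    service = UserService(client)
    assert service.register("a@example.com") is True
    client.send.assert_called_once_with("a@example.com", subject="Welcome")

# Behavior-oriented alternative: a small fake records outcomes,
# not call shapes, so internal refactors keep it green.
class FakeEmailClient:
    def __init__(self):
        self.delivered = []

    def send(self, address, subject):
        self.delivered.append(address)

def test_register_behavior():
    client = FakeEmailClient()
    assert UserService(client).register("a@example.com") is True
    assert "a@example.com" in client.delivered
```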

Test Maintenance Costs with AI-Generated Code

Generating code is cheap, maintaining it is not. AI tests are often tightly coupled to the internal structure of the code. When internals change, these tests fail.

In the World Quality Report 2025-26, 50% of QA leaders using AI in test case automation cite "Maintenance burden & flaky scripts" as a key challenge. As the report puts it, "Resources are continually being depleted by maintenance."

How AI Changes Testing Practices: TDD and the Testing Pyramid

The essence of TDD has always been not just about verification, but about design. Writing a test before code forced the engineer to think through the architecture. When AI generates implementation in seconds, this "thinking stage" is skipped.

Transformation of the Testing Pyramid

The classic "Testing Pyramid" (many cheap Unit tests, few expensive E2E) is rapidly losing relevance.

[Diagram: the traditional testing pyramid with a wide unit test base versus the AI-inverted pyramid focused on E2E and integration tests]

AI inverts the classic testing pyramid. Since AI writes code faster than humans, the bottleneck becomes checking intent. The emphasis shifts to integration and end-to-end (E2E) tests, where autonomous agents verify that the entire system actually works.

In an economy where coding approaches zero cost, value shifts from writing lines to Intent Validation. Achieving an 80-90% coverage figure has become trivial, but the correlation between high coverage and actual product quality has largely disappeared. This calls for updated test-to-code ratio standards for AI-assisted development.

Integration and Contract Testing in Microservices

Contract testing is becoming a critically important standard for microservices and API-First architecture.

The key difference: integration testing exercises real interaction between running services, while contract testing validates compliance with agreements (contracts) in isolation.
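The core mechanic of contract testing can be sketched without a framework like Pact (the contract and field names below are hypothetical): the consumer's expectations are written down as a contract, and a provider's response is validated against it in isolation, with a canned response standing in for the live service.

```python
# Hypothetical contract for GET /users/{id}: required fields and types.
CONTRACT = {
    "id": int,
    "email": str,
    "active": bool,
}

def validate_against_contract(response: dict, contract: dict) -> list:
    """Return a list of violations; an empty list means the contract holds."""
    violations = []
    for field, expected_type in contract.items():
        if field not in response:
            violations.append(f"missing field: {field}")
        elif not isinstance(response[field], expected_type):
            violations.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(response[field]).__name__}"
            )
    return violations

# Provider-side check: canned responses stand in for the real service.
good = {"id": 7, "email": "a@example.com", "active": True}
bad = {"id": "7", "email": "a@example.com"}

assert validate_against_contract(good, CONTRACT) == []
assert validate_against_contract(bad, CONTRACT) == [
    "id: expected int, got str",
    "missing field: active",
]
```

Real contract-testing tools add contract sharing, versioning, and provider verification on top of this basic idea.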

From Static Contracts to Behavior Validation

The next evolutionary step is the transition from static, manually maintained contracts to dynamic behavior-based validation.

Modern integration testing increasingly uses containerization. Libraries like Testcontainers allow spinning up disposable instances of databases (PostgreSQL, Redis, Kafka) for each test run. This enables deep integration testing with real dependencies, approaching E2E-level confidence while retaining much of the speed and isolation of unit tests.

Self-Healing Tests and E2E

A key trend in E2E is using AI to combat locator fragility. If a developer changes a button ID, tools with "self-healing" find the element by other features (text, position, neighbors).
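The self-healing idea reduces to a fallback chain of locator strategies. Here is a toy illustration in pure Python (the page model and helper are hypothetical; real tools such as Healenium do this against a live DOM): when the primary locator, an element ID, stops matching, the lookup falls back to visible text.

```python
def find_element(page, element_id, text):
    """Locate an element by ID, falling back to visible text ("self-healing")."""
    # Primary strategy: exact ID match.
    for el in page:
        if el.get("id") == element_id:
            return el
    # Self-healing fallback: match by visible text instead.
    for el in page:
        if el.get("text") == text:
            return el
    return None

page_before = [{"id": "submit-btn", "text": "Submit", "tag": "button"}]
page_after = [{"id": "btn-primary-1", "text": "Submit", "tag": "button"}]

# ID still valid: found directly.
assert find_element(page_before, "submit-btn", "Submit")["tag"] == "button"
# A developer renamed the ID: the text fallback "heals" the locator.
assert find_element(page_after, "submit-btn", "Submit")["id"] == "btn-primary-1"
```

Production tools use richer secondary signals (position, neighbors, attributes) and log each healed locator so the suite can be updated deliberately rather than silently.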

However, adoption lags. The World Quality Report puts self-healing test adoption at 49% and warns that self-healing scripts remain underused, leaving teams with fragile pipelines and rising maintenance costs.

Practical Recommendations for QA Teams

1. Reconsider Metrics

Shift focus from coverage percentage to Mutation Testing and Requirements Coverage. What matters is whether the test fails if you actually break the logic.
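A hand-rolled illustration of the mutation testing idea (real tools such as mutmut or Stryker automate this; the mutant below is introduced by hand): deliberately break the logic, then check whether the suite notices. A suite with full line coverage can still let the mutant survive if it never tests the boundary.

```python
def is_adult(age: int) -> bool:
    return age >= 18

def is_adult_mutant(age: int) -> bool:
    return age > 18  # mutation: >= replaced with >

def weak_suite(fn) -> bool:
    # 100% line coverage, but no boundary value: the mutant survives.
    return fn(30) is True and fn(5) is False

def strong_suite(fn) -> bool:
    # Includes the boundary value 18, so the mutant is killed.
    return fn(30) is True and fn(5) is False and fn(18) is True

assert weak_suite(is_adult) and weak_suite(is_adult_mutant)          # mutant survives
assert strong_suite(is_adult) and not strong_suite(is_adult_mutant)  # mutant killed
```

The mutation score (fraction of mutants killed) is a far stronger quality signal than coverage, precisely because it measures whether tests fail when the logic is actually broken.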

2. Invest in Verification Infrastructure

Implement Ephemeral Environments. For each Pull Request, an isolated environment with real dependencies (via Testcontainers) should be automatically spun up.

3. Give AI Access to Context

Integrate agents with runtime. AI should get access to container logs, traces, and test execution results to analyze failure causes effectively.

Conclusion

The main challenge of 2026 is learning to validate code faster than AI can generate it. We are moving away from a model where a human writes both code and tests, to a model where a human defines intentions and AI implements them under supervision.
