Denis Stetskov

Originally published at techtrenches.substack.com

The State of AI in Software Engineering: Q3 2025

On September 29, 2025, Anthropic released Claude Sonnet 4.5. I spent that Monday running it through real production tasks from my systems—not demos, not benchmarks, actual engineering work.

I use AI tools daily. Documentation, tests, UI implementation, API scaffolding—it saves me 2+ hours of repetitive work. AI has become an essential part of my workflow.

Nine months ago, several tech CEOs announced that AI would transform engineering roles. Salesforce paused engineering hires. Meta announced AI agent initiatives. Google reported 30% of new code is AI-generated.

I wanted to understand the current state: Where does AI excel? Where does it struggle? What works in production versus what's still experimental?

Here's what I found after 18 months of tracking industry data, daily AI usage, and systematic testing with the latest models.

What The CEOs Promise

Sam Altman (OpenAI), March 2025:

"At some point, yeah, maybe we do need less software engineers."

Mark Zuckerberg (Meta), January 2025:

"In 2025, we're going to have an AI that can effectively be a mid-level engineer."

Marc Benioff (Salesforce), February 2025:

"We're not going to hire any new engineers this year."

Then he fired 4,000 support staff.

Sundar Pichai (Google), October 2024:

"More than a quarter of all new code at Google is generated by AI."

By April 2025: 30%.

The pattern: freeze hiring, lay off thousands, hype AI productivity, and justify over $364 billion in AI infrastructure spending.

Here's what they're not telling you.

Current Industry Metrics: Adoption and Productivity Data

Based on 18 months of tracking AI adoption across teams (survey panels of 18 CTOs, >500 engineers across startups and enterprise):

  • 84% of developers now use or plan to use AI coding tools.
  • Developers self-report being +20% faster.
  • Measured productivity shows a 19% net decrease on average.

This gap between perception and measurement deserves closer examination.

Production Incidents and Code Quality

  • Common pattern: "Code functions correctly in testing but fails under edge cases."
  • Security reviews found +322% more vulnerabilities in AI-generated code samples.
  • 45% of AI-generated snippets contained at least one security issue.
  • Junior developers using AI tools showed 4x higher defect rates.
  • 70% of hiring managers report higher confidence in AI output than junior developer code.

Notable incident: the Replit case, where an AI agent deleted a production database containing records for 1,206 executives and 1,196 companies, then created fake profiles to mask the deletion (detailed analysis).

Return on Investment Metrics

  • Less than 30% of AI initiative leaders report executive satisfaction with ROI.
  • Stack Overflow 2025 developer survey shows decreased confidence in AI tool reliability.
  • Tool vendors report strong growth (Cursor estimated at $500M ARR) while enterprise customers struggle to quantify benefits.
  • Industry consensus: AI tools provide value as assistants but require significant oversight.

September 29, 2025: Testing the Frontier Model

On the day Claude Sonnet 4.5 launched, I tested it with real production tasks.

Case Study #1: Over-Engineering Without Constraints

Task: Add multi-tenant support to a B2B platform for data isolation between corporate clients.

AI's Proposal:

  • DB schema changes + migrations
  • client_id columns & indexes
  • Complex ClientMetadata interface
  • 15–20 files to modify

My Simplification:

  • 10 files only
  • No DB changes (metadata in vectors only)
  • Simple Record type
  • 1-hour timeline

Outcome: 5 minutes of architectural thinking prevented weeks of unnecessary complexity.

Pattern: AI optimizes for "theoretically possible," not "actually needed."

Case Study #2: 1:1 With Human—Because I Wrote the Spec

Task: Implement the simplified spec from Case #1.

Setup: I wrote a detailed 60-minute specification (file list, examples, constraints, "what not to do").

AI's Output:

  • 10+ files modified in 1 hour (missed doc updates)
  • 100% spec compliance
  • Zero type or lint errors
  • Consistent patterns

But:

  • Missed a microservice dependency
  • Wrong assumption about JWT endpoints
  • Started coding without analysis (had to force "ultrathink")
  • Delegated to a sub-agent without verifying

The Math:

  • Human alone: ~2 hours.
  • AI path: 1h spec + 1h AI coding + 30m review = ~2.5 hours.
  • Net: 1:1 parity in wall time. Only benefit: AI typed faster, but I still had to think.

Case Study #3: Integration Hell

Task: Mock BullMQ queue in Vitest tests for a Next.js app to avoid Redis connections.

Expectation: 30–60 min for experienced dev.

What Happened:

  • 11 failed attempts: vi.mock, hoisted mocks, module resets, mocks dirs, env checks, etc.
  • Root cause: Vitest mocks can't intercept modules imported by the Next.js dev server.
  • AI never pivoted—just kept retrying variants of the same broken approach.
  • Proposed polluting prod code with test logic.
  • Time: ~4h wasted (2h AI thrashing, 2h human triage).

Human path: After 30m, realize it's architectural. Pivot to HTTP-level mocking or Testcontainers. Solution in ~1–2h.

Outcome: Complete failure.
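For contrast, here is a minimal sketch of the human pivot: spin up a real Redis with Testcontainers instead of fighting module mocks. The container image, queue name, and payload below are illustrative, not the actual app's code.

```ts
import { afterAll, beforeAll, expect, test } from "vitest";
import { GenericContainer, type StartedTestContainer } from "testcontainers";
import { Queue } from "bullmq";

// Real Redis in a throwaway container instead of module-level mocks.
let redis: StartedTestContainer;
let queue: Queue;

beforeAll(async () => {
  redis = await new GenericContainer("redis:7-alpine")
    .withExposedPorts(6379)
    .start();

  queue = new Queue("documents", {
    connection: { host: redis.getHost(), port: redis.getMappedPort(6379) },
  });
}, 60_000);

afterAll(async () => {
  await queue.close();
  await redis.stop();
});

test("enqueues a processing job", async () => {
  await queue.add("process-document", { documentId: "doc_123" });

  // The job lands in the real queue; no mocking of BullMQ internals needed.
  const counts = await queue.getJobCounts("waiting", "delayed");
  expect(counts.waiting + counts.delayed).toBeGreaterThan(0);
});
```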

The Pattern Across Cases

  • Case #1: AI over-engineered by ~90% without constraints.
  • Case #2: With spec + oversight, outcome matched human baseline. No net time saved.
  • Case #3: Integration problems = total failure.

Synthesis:

  • AI can implement if you specify perfectly.
  • AI cannot decide what to build.
  • AI fails on integration.
  • AI proposes bad solutions instead of admitting limits.

Understanding the Results: Tool versus Autonomous System

These test results highlight a crucial distinction between AI as an assistant and AI as an autonomous agent.

Successful daily AI applications in my workflow:

  • Converting Figma designs to React components (Figma MCP)
  • Generating test suites for existing, working code
  • Creating API documentation and Swagger specifications
  • Developing database migrations and type definitions
  • Refactoring code to established patterns
  • Writing Playwright E2E tests (Playwright MCP)

The consistent factor: human oversight and validation at every step. I provide context, catch errors, reject suboptimal patterns, and guide the overall architecture.

Key observation: AI tools with expert supervision multiply productivity. Without supervision, they multiply problems.

Perception vs Reality

| Scenario | Net Effect |
| --- | --- |
| Developers' self-report | +20% faster |
| Aggregate (no process) | −19% slower |
| Senior on routine tasks | ~30% faster |
| Spec → AI → Review loop | Parity to +50% |
| Integration tasks | ~100% fail rate |

Impact on Engineering Career Pipeline

Current hiring trends show significant shifts:

  • Big Tech new graduate hiring: 7% (down from 14% pre-2023)
  • 74% of developers report difficulty finding positions
  • CS graduates unemployment rate: 6.1%
  • Industry-wide engineering layoffs since 2022: 300,000+
  • Entry-level positions commonly require 2-5 years experience
  • New graduates report submitting 150+ applications on average

(Full analysis: Talent Pipeline Trends)

This creates a potential issue for future talent development:

Year 3 projection: Shortage of mid-level engineers

Year 5 projection: Senior engineer compensation increases, shorter tenure

Year 10 projection: Large legacy codebases with limited expertise

The pattern suggests a gap between junior training and senior expertise requirements.

Analysis: Gap Between Executive Expectations and Technical Reality

Measurement Methodology Differences

Executive teams often measure lines of code generated, while engineering teams measure complete delivery including debugging, security, and maintenance. This creates different success metrics.

Integration Complexity

My Case #3 demonstrates that AI tools struggle with system integration, which makes up roughly 80% of production software work; isolated tasks account for the remaining 20%.

Infrastructure Investment Context

Major technology companies have committed significant capital to AI infrastructure:

  • Microsoft: $89B
  • Amazon: $100B
  • Meta: $72B
  • Google: $85B

This represents 30% of revenue (compared to historical 12.5% for infrastructure), while cloud revenue growth rates have decreased (Infrastructure Analysis).

Technical Constraints

  • Energy consumption: ChatGPT queries require 0.3-0.43 watt-hours versus 0.04 for traditional search
  • Data center electricity use: currently 4.4% of global consumption, projected to reach 12% by 2028
  • Power availability: 40% of data centers expected to face constraints by 2027

Talent Development Considerations

AI tools amplify existing expertise but don't create it independently. Experienced developers report 30% productivity gains on routine tasks, while junior developers show increased defect rates (detailed metrics).

Where AI Actually Shines: My Daily Workflow

I'm not an AI skeptic. I use AI every single day. Here's where it genuinely transforms my productivity:

Documentation & Tests

  • Code documentation: AI writes better JSDoc comments than most developers. Feed it a function, get comprehensive docs in seconds.
  • Test coverage: Give AI working code, it generates comprehensive test suites. I've increased coverage from 40% to 85% in hours, not days.
  • Edge cases: AI finds test scenarios I wouldn't think of. "What if the date is February 29th in a non-leap year?"
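To make that edge-case point concrete, here is a minimal Vitest sketch. The isValidDate helper is hypothetical; in practice I feed the AI existing, working code and it generates the suite.

```ts
import { describe, expect, it } from "vitest";

// Hypothetical helper under test, included only so the sketch is self-contained.
function isValidDate(year: number, month: number, day: number): boolean {
  const d = new Date(Date.UTC(year, month - 1, day));
  return (
    d.getUTCFullYear() === year &&
    d.getUTCMonth() === month - 1 &&
    d.getUTCDate() === day
  );
}

describe("isValidDate edge cases", () => {
  it("rejects February 29th in a non-leap year", () => {
    expect(isValidDate(2025, 2, 29)).toBe(false);
  });

  it("accepts February 29th in a leap year", () => {
    expect(isValidDate(2024, 2, 29)).toBe(true);
  });
});
```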

UI Implementation

  • Figma to Code: Using Figma MCP, I go from design to React components in minutes. The code needs tweaking, but the structure is solid.
  • Playwright validation: Playwright MCP writes E2E tests while I'm building. Real-time validation as I code.
  • Component variations: "Give me this button in 5 different states" - done in seconds.
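For illustration, this is the kind of check Playwright MCP drafts while I build. The route and test ids below are placeholders, not the real application.

```ts
import { expect, test } from "@playwright/test";

// Placeholder route and test ids: this mirrors the shape of the generated
// checks, not an actual application under test.
const states = ["default", "hover", "active", "disabled", "loading"];

test("button component exposes all five states", async ({ page }) => {
  await page.goto("http://localhost:3000/components/button");

  for (const state of states) {
    await expect(page.getByTestId(`button-${state}`)).toBeVisible();
  }
});
```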

API & Schema Work

  • Route definitions: AI generates Express/FastAPI routes with proper validators, error handling, and Swagger metadata. No logic yet, but perfect scaffolding.
  • Type definitions: From JSON to TypeScript interfaces with proper optionals and unions.
  • Migration scripts: Database schema changes with up/down migrations, foreign keys, and indexes.
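To show the scaffolding level I mean, here is a hedged Express sketch with a zod validator. The schema, route, and handler stub are hypothetical, not project code.

```ts
import express from "express";
import { z } from "zod";

// Hypothetical payload schema: the shape is illustrative only.
const createUserSchema = z.object({
  email: z.string().email(),
  name: z.string().min(1),
});

const app = express();
app.use(express.json());

// Scaffolding only: validation and error handling in place,
// business logic deliberately left as a stub.
app.post("/api/v1/users", (req, res) => {
  const parsed = createUserSchema.safeParse(req.body);
  if (!parsed.success) {
    return res.status(400).json({ errors: parsed.error.flatten() });
  }
  // TODO: persist the user
  return res.status(201).json({ user: parsed.data });
});

app.listen(3000);
```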

The Daily Reality

My actual workflow:

  1. Morning standup prep: AI summarizes my commits and PRs from yesterday
  2. Code review: AI pre-reviews my code for obvious issues before human review
  3. Refactoring: "Make this more functional" or "Add proper error handling" - AI handles the tedious parts
  4. Documentation: AI documents while I code the next feature

Time saved: ~2 hours daily on repetitive tasks.

But here's the critical point: I review every single line. I know what good looks like. I catch when AI hallucinates an API that doesn't exist or suggests an O(n²) solution.

What AI Actually Can / Cannot Do

AI as a TOOL excels at:

  • ✅ Documentation generation
  • ✅ Test suite creation from working code
  • ✅ UI component scaffolding
  • ✅ API route boilerplate with validators
  • ✅ Type definitions and schemas
  • ✅ Code refactoring patterns
  • ✅ SQL queries from natural language
  • ✅ Regex patterns that actually work

AI as a REPLACEMENT fails at:

  • ❌ Making architecture decisions
  • ❌ Understanding business context
  • ❌ Debugging complex state issues
  • ❌ System integration
  • ❌ Performance optimization choices
  • ❌ Security architecture
  • ❌ Data model design
  • ❌ Working without supervision

Can AI Replace Developers?

No. Not in 2025, not in 2030. Under current constraints, replacement at scale is not feasible.

AI is a fast intern. It amplifies expertise, but doesn't create it.

Key Questions for Evaluation

When assessing AI's role in software development, consider:

  • If AI can replace engineers, why do major tech companies maintain large engineering teams?
  • Why did my test cases require continuous human oversight?
  • Why does adoption reach 84% but code generation remains at 30%?
  • Why do 16 of 18 CTOs report production issues from AI-generated code?
  • What explains the gap between self-reported productivity (+20%) and measured output (-19%)?

Implementation Strategies for Organizations

Organizations seeing success with AI tools share common approaches:

Effective patterns:

  • Deploy AI as productivity enhancement for experienced developers
  • Maintain investment in junior developer training programs
  • Focus on measurable outcomes rather than activity metrics
  • Implement review processes for all AI-generated code

Risk areas to monitor:

  • Over-reliance on AI for architectural decisions
  • Reduced investment in talent development
  • Assumption that code generation equals completed features

Conclusions from Testing and Daily Usage

After 18 months of data collection, production testing, and daily AI tool usage:

AI tools excel as productivity enhancers. I save 2+ hours daily on documentation, test generation, UI scaffolding, and API routes. For experienced developers who can validate output, AI tools provide measurable value.

Clear patterns emerge by experience level:

  • Senior engineers (5+ years): 30-50% productivity increase on routine tasks
  • Junior engineers (<2 years): Higher output velocity but 4x defect rate
  • Unsupervised AI: Consistent failure on integration and architectural tasks

Current state assessment:

  • AI tools have become essential for repetitive development tasks
  • Human oversight remains critical for quality and security
  • The gap between marketing claims and production reality remains significant
  • Infrastructure investments may not align with actual capability improvements

Industry implications:
The combination of reduced junior hiring, increased reliance on AI tools, and significant infrastructure investment creates potential long-term challenges for talent development and code maintainability.

Organizations should view AI as a powerful tool that amplifies existing expertise rather than as a replacement for engineering capability. The most successful teams will be those that effectively combine AI efficiency with human judgment and continue investing in developing engineering talent at all levels.

These patterns aren't new. We've been documenting the systematic destruction of software quality, the talent pipeline crisis, and Big Tech's tendency to throw money at problems instead of solving them across multiple investigations.


P.S. All three case studies are available with full technical details. Email me if you'd like the raw transcripts of AI runs and human corrections.

P.P.S. The spec from Case #1 is included below for transparency.

Multi-Tenant Implementation Requirements

What We Need

B2B platform needs proper data isolation between corporate clients:

  • Client A can’t see Client B’s data (obviously)
  • Each client gets their own knowledge base
  • Dynamic metadata per industry (law firms need different fields than hospitals)
  • Manual onboarding (not self-serve)

How It Works Now

Current flow:

```
Web App → Queue → Processor → Vector DB → Search
```

Right now everything filters by userId. That works fine for the consumer use case, but B2B needs another layer: client-level isolation on top of user-level.

The Solution: Just Add Metadata

Instead of rebuilding everything with clientId columns and migrations (don’t do this), just pass dynamic metadata through the existing flow.

The idea:

  • Shared API accepts optional metadata
  • Queue passes it along
  • Processor stores it in vectors
  • Search filters by userId + metadata

That’s it. No DB changes.

Metadata Examples

```jsonc
// Law firm
{
  "client": "lawfirm_abc",
  "department": "corporate",
  "confidentiality": "high"
}

// Hospital
{
  "customer": "hospital_xyz",
  "ward": "cardiology",
  "sensitive": "true"
}

// Finance
{
  "tenant": "finance_corp",
  "division": "accounting",
  "year": "2024"
}
```

Just Record<string, string>. Call the fields whatever makes sense for the client.
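In TypeScript terms, a rough sketch of the shape (the interface name is mine, not from the codebase):

```ts
// Illustrative only: the interface name is not from the actual codebase.
interface SharedDataRequest {
  data: string;
  // Optional, free-form, flat key/value pairs; callers name the fields.
  metadata?: Record<string, string>;
}
```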

API Changes

Upload with metadata:

```
POST /api/shared/v1/data
{
  "data": "document content...",
  "metadata": {
    "client": "lawfirm_abc",
    "department": "corporate"
  }
}
```

Search with same metadata:

```
POST /api/shared/v1/ask
{
  "question": "What is the policy?",
  "metadata": {
    "client": "lawfirm_abc",
    "department": "corporate"
  }
}
```

Implementation

web-app (5 files)

Add metadata?: Record<string, string> to:

  • data/route.ts
  • ask/route.ts
  • chat/route.ts
  • queue.ts (pass it through)
  • upstash/index.ts (filter by it)
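A hedged sketch of what the filtering change in upstash/index.ts could look like: combine the mandatory userId clause with whatever metadata the caller sent. The string-based filter grammar is an assumption about Upstash Vector and should be checked against the SDK docs; the helper name is mine.

```ts
// Sketch only: assumes Upstash Vector accepts a SQL-like metadata filter
// string. A real implementation should also escape/sanitize values.
function buildVectorFilter(
  userId: string,
  metadata?: Record<string, string>
): string {
  const clauses = [`userId = '${userId}'`];
  for (const [key, value] of Object.entries(metadata ?? {})) {
    clauses.push(`${key} = '${value}'`);
  }
  return clauses.join(" AND ");
}

// buildVectorFilter("user_1", { client: "lawfirm_abc", department: "corporate" })
// => "userId = 'user_1' AND client = 'lawfirm_abc' AND department = 'corporate'"
```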

processor (2 files)

  • processor.ts: add metadata field
  • vector-storage.ts: spread it into vector metadata

```ts
const vectorMetadata = {
  documentId,
  userId,
  fileName,
  // ... system stuff
  ...(customMetadata || {}), // spread user fields (safe)
};
```

What Stays The Same

  • No database migrations
  • No schema changes
  • Existing vectors work fine
  • Old code keeps working
  • If no metadata provided → search everything for that user
  • If metadata provided → automatic isolation

Files to Touch

web-app:

  • src/lib/server/utils/queue.ts
  • src/lib/server/external_api/upstash/index.ts
  • src/app/api/shared/v1/data/route.ts
  • src/app/api/shared/v1/ask/route.ts
  • src/app/api/shared/v1/chat/route.ts

processor:

  • src/types/processor.ts
  • src/services/vector-storage.ts
