Umair Bilal

Posted on Jul 2 • Originally published at buildzn.com

AI coding actual productivity: My 20% slower truth

#aicoding #developerproductivity #aiagents #softwareengineering


Everyone talks about how AI makes you a 10x developer. I used to believe it too. For months, building stuff like FarahGPT and NexusOS, I felt like I was flying through boilerplate and initial feature setups, especially with Flutter UIs and Node.js API routing. My productivity with AI coding felt through the roof.

Turns out, that feeling was a lie. A comfortable, seductive lie that cost me time, sanity, and often, more code than I started with. I'm talking a solid 15-25% increase in total time to ship, not a decrease. My perceived speed was just pushing the actual work downstream.

Why Actual AI Coding Productivity Breaks Down in Complex Systems

Here's the thing: AI is fantastic for getting started. It spits out code fast. For simple, isolated components, it's a net positive. But when you're building something like Prax-Agent, a multi-agent system where agents interact, make tool calls, and manage complex state, the illusion shatters.

My experience building these sophisticated AI agents showed a pattern:

Initial Velocity: Getting a first draft of a Flutter widget or a Node.js route felt 20% faster. AI nails the syntax, common patterns, and basic structure. Great for raw output speed.
Hidden Debt: The code often had subtle, AI-induced context errors. It'd work 90% of the time, then fail spectacularly on edge cases or specific interaction sequences. The LLM would make assumptions about existing state, API contracts, or even user intent that were just slightly off. Any honest measurement of AI coding speed has to account for this.
Debugging Nightmare: Tracing these subtle context errors was brutal. They weren't syntax errors; they were logic errors stemming from an incomplete or slightly misaligned understanding of the broader system. I spent 15-25% more time debugging and refining these AI-generated segments than if I'd written them from scratch, because I implicitly trusted the AI's "completeness." This is where the AI productivity claim debunked itself for me.
Refactoring Overhead: Often, the AI-generated solution, while technically functional, wasn't idiomatic, performant, or scalable within my existing codebase. It required significant refactoring, adding to the overhead of AI code generation.

It was a classic perception vs. reality gap. My internal "AI productivity gauge" was broken. I'd feel good after generating a big chunk of code, then hit a wall of subtle bugs.

The 'AI-Aware Testing' Pattern That Delivered Real Gains

I was stuck in this loop until I radically changed my approach to testing AI-generated code. Standard unit tests weren't enough. I needed what I now call AI-aware testing: a strategy focused on validating not just the output of my functions, but specifically how my system handled malformed, incomplete, or subtly incorrect inputs generated by an LLM.

This isn't about testing the LLM itself. It's about testing my code's resilience to the LLM's unpredictable output. No matter how good your prompt, Claude and OpenAI's models will occasionally deviate from the format you asked for.

Here’s the core idea: treat every AI-generated output as potentially adversarial data.

Step-by-step Implementation: Validating LLM Tool Calls

Take a common scenario: an AI agent making a tool call, returning JSON. Here’s how I structured my tests for this in Node.js (similar principles apply to Flutter with dart:convert):

First, a typical tool call handler in Node.js (using express and a tool service):

// agents/prax-agent/tool-handler.js
import { z } from 'zod'; // For robust schema validation
import { getProductDetails, createOrder } from './tool-service';

const toolSchema = z.object({
  tool_name: z.string(),
  parameters: z.record(z.any()), // Allow dynamic parameters, validate later
});

async function handleToolCall(req, res) {
  const { toolCallString } = req.body; // Assume agent sends a string that needs parsing

  try {
    // LLM often prefixes JSON with "Okay, here's the tool call:" or similar.
    // This is a common AI-induced context error that breaks JSON.parse.
    const jsonMatch = toolCallString.match(/```json\n(.*)\n```/s);
    let toolCallObject;

    if (jsonMatch && jsonMatch[1]) {
      toolCallObject = JSON.parse(jsonMatch[1]);
    } else {
      toolCallObject = JSON.parse(toolCallString); // Fallback for raw JSON
    }

    const validatedToolCall = toolSchema.parse(toolCallObject);
    const { tool_name, parameters } = validatedToolCall;

    console.log(`Executing tool: ${tool_name} with params:`, parameters);

    let result;
    switch (tool_name) {
      // Braces give each case its own block scope for its const declarations.
      case 'getProductDetails': {
        const productDetailsSchema = z.object({ productId: z.string().uuid() });
        const validatedParams = productDetailsSchema.parse(parameters);
        result = await getProductDetails(validatedParams.productId);
        break;
      }
      case 'createOrder': {
        const orderSchema = z.object({
          items: z.array(z.object({ productId: z.string().uuid(), quantity: z.number().int().min(1) })),
          shippingAddress: z.string().min(5),
        });
        const validatedOrderParams = orderSchema.parse(parameters);
        result = await createOrder(validatedOrderParams.items, validatedOrderParams.shippingAddress);
        break;
      }
      default:
        return res.status(400).json({ error: `Unknown tool: ${tool_name}` });
    }

    res.json({ success: true, result });

  } catch (error) {
    if (error.name === 'ZodError') {
      console.error("Tool call validation failed:", error.errors);
      return res.status(400).json({ error: 'Invalid tool call schema', details: error.errors });
    }
    // JSON.parse throws a SyntaxError when the LLM prefixes the JSON with
    // "Okay, here's..." or truncates the payload. The message wording differs
    // between Node.js versions (for example "Unexpected token ..." versus
    // "Unexpected end of JSON input"), so check the error type, not the text.
    if (error instanceof SyntaxError) {
       console.error("JSON parsing error, likely malformed LLM output:", error.message);
       return res.status(400).json({ error: "Malformed tool call JSON from AI", originalError: error.message });
    }
    console.error("Error handling tool call:", error);
    res.status(500).json({ error: 'Internal server error' });
  }
}

export default handleToolCall;
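
The regex above only rescues output where the JSON sits inside a fenced block. A slightly more tolerant extraction step could be sketched as follows. This is a minimal sketch, not the handler's actual code: `extractJsonObject` is a hypothetical helper name, it only handles a top-level JSON object (not arrays), and it assumes the first `{` in the reply starts the payload.

```javascript
// Hypothetical helper (not part of the original handler): pull the first
// JSON object out of a chatty LLM reply.
function extractJsonObject(text) {
  if (typeof text !== 'string') {
    throw new TypeError('LLM output must be a string');
  }

  // 1. Prefer an explicit fenced block, with or without a "json" tag.
  //    The fence is built with repeat() so this snippet contains no literal fence.
  const fence = '`'.repeat(3);
  const fenced = text.match(new RegExp(fence + '(?:json)?\\s*([\\s\\S]*?)' + fence));
  const candidate = fenced ? fenced[1] : text;

  // 2. Scan from the first "{" to its matching "}", ignoring braces that
  //    appear inside JSON strings.
  const start = candidate.indexOf('{');
  if (start === -1) {
    throw new SyntaxError('No JSON object found in LLM output');
  }
  let depth = 0;
  let inString = false;
  let escaped = false;
  for (let i = start; i < candidate.length; i++) {
    const ch = candidate[i];
    if (inString) {
      if (escaped) escaped = false;
      else if (ch === '\\') escaped = true;
      else if (ch === '"') inString = false;
    } else if (ch === '"') {
      inString = true;
    } else if (ch === '{') {
      depth++;
    } else if (ch === '}') {
      depth--;
      if (depth === 0) {
        // JSON.parse still throws a SyntaxError if the slice is not valid JSON.
        return JSON.parse(candidate.slice(start, i + 1));
      }
    }
  }
  throw new SyntaxError('Unbalanced JSON object in LLM output');
}
```

Because every failure surfaces as a `SyntaxError`, a helper like this could sit in front of the schema validation without changing how the handler reports malformed output.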

Now, the AI-aware test. Instead of just passing perfect JSON, I crafted inputs that mimicked common LLM "hallucinations" or formatting quirks.

// tests/prax-agent/tool-handler.test.js
import request from 'supertest';
import express from 'express';
import handleToolCall from '../../agents/prax-agent/tool-handler';

// Mock dependencies
jest.mock('../../agents/prax-agent/tool-service', () => ({
  getProductDetails: jest.fn(async (productId) => ({ id: productId, name: 'Mock Product' })),
  createOrder: jest.fn(async (items, address) => ({ orderId: 'mock-order-123', items, address })),
}));

const app = express();
app.use(express.json());
app.post('/tool', handleToolCall);

describe('Prax-Agent Tool Call Handler (AI-Aware Testing)', () => {

  // --- LLM formatting quirks: conversational preamble around a fenced JSON block ---
  test('should gracefully handle LLM output prefixed with conversational text', async () => {
    const malformedJson = `Okay, here's the tool call:
\`\`\`json
{
  "tool_name": "getProductDetails",
  "parameters": {
    "productId": "a1b2c3d4-e5f6-7890-1234-567890abcdef"
  }
}
\`\`\``; // Common LLM behavior

    const res = await request(app)
      .post('/tool')
      .send({ toolCallString: malformedJson });

    // Expect a successful parse and execution, *not* a JSON parsing error.
    // The handler should strip the conversational preamble.
    expect(res.statusCode).toEqual(200);
    expect(res.body.success).toBe(true);
    expect(res.body.result).toEqual({ id: 'a1b2c3d4-e5f6-7890-1234-567890abcdef', name: 'Mock Product' });
  });

  test('should reject malformed JSON that cannot be rescued', async () => {
    const trulyBrokenJson = `{"tool_name": "getProductDetails", "parameters": {"productId": "abc"`; // Incomplete JSON

    const res = await request(app)
      .post('/tool')
      .send({ toolCallString: trulyBrokenJson });

    expect(res.statusCode).toEqual(400);
    expect(res.body.error).toBe("Malformed tool call JSON from AI");
    expect(typeof res.body.originalError).toBe('string'); // JSON.parse message text varies by Node.js version
  });

  test('should reject tool calls with incorrect schema (e.g., product ID not UUID)', async () => {
    const invalidProductId = {
      tool_name: "getProductDetails",
      parameters: {
        "productId": "not-a-uuid"
      }
    };

    const res = await request(app)
      .post('/tool')
      .send({ toolCallString: JSON.stringify(invalidProductId) });

    expect(res.statusCode).toEqual(400);
    expect(res.body.error).toBe("Invalid tool call schema");
    expect(res.body.details[0].message).toContain("Invalid uuid");
  });

  test('should handle valid `createOrder` tool call', async () => {
    const validOrder = {
      tool_name: "createOrder",
      parameters: {
        items: [{ productId: "a1b2c3d4-e5f6-7890-1234-567890abcdef", quantity: 2 }],
        shippingAddress: "123 Dev Street, Codeville"
      }
    };

    const res = await request(app)
      .post('/tool')
      .send({ toolCallString: JSON.stringify(validOrder) });

    expect(res.statusCode).toEqual(200);
    expect(res.body.success).toBe(true);
    expect(res.body.result.orderId).toBe('mock-order-123');
  });

  // More tests for edge cases: missing parameters, extra unexpected parameters, etc.
});

This is a critical shift. Instead of assuming the LLM will always return perfect JSON (which it won't), I'm actively testing my parser and validation against common failure modes. This drastically improved my real efficiency by catching issues before they hit production. Writing these tests felt slower, but it made my measurements of AI coding speed actually mean something, because the final code was robust.
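
One way to keep these cases from being written by hand each time is to generate them from a single known-good payload. The sketch below is an illustration under assumptions: `adversarialVariants` and the `recoverable` flag are hypothetical names, the payload is assumed to be a JSON object, and the noise patterns are just the ones described in this article.

```javascript
// Hypothetical fixture builder: wraps one known-good payload in the kinds of
// noise an LLM tends to add, and marks whether a parser should recover it.
function adversarialVariants(validPayload) {
  const json = JSON.stringify(validPayload, null, 2);
  const fence = '`'.repeat(3); // built with repeat() to avoid a literal fence here
  return [
    { name: 'raw JSON', input: json, recoverable: true },
    {
      name: 'conversational preamble',
      input: `Okay, here's the tool call:\n${json}`,
      recoverable: true,
    },
    {
      name: 'fenced block',
      input: `${fence}json\n${json}\n${fence}`,
      recoverable: true,
    },
    {
      name: 'preamble, fence, and sign-off',
      input: `Sure!\n${fence}json\n${json}\n${fence}\nLet me know if you need anything else.`,
      recoverable: true,
    },
    {
      // Any strict prefix of a JSON object is missing its closing brace.
      name: 'truncated mid-object',
      input: json.slice(0, Math.floor(json.length / 2)),
      recoverable: false,
    },
    { name: 'empty reply', input: '', recoverable: false },
    { name: 'refusal with no JSON', input: "I'm sorry, I can't help with that.", recoverable: false },
  ];
}
```

With Jest, a list like this can be fed to `test.each`, asserting a 200 for the recoverable variants and a 400 for the rest. Whether a given variant really counts as recoverable depends on how tolerant your own extraction step is.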

What I Got Wrong First

My initial mistake was treating AI-generated code like any other code. I'd quickly integrate it, maybe write a basic unit test for its happy path, and move on. This was a huge source of overhead.

Trusting the "Good Enough": The AI output was often 80-90% correct. I'd patch the obvious errors and assume the rest was solid. This led to subtle bugs popping up weeks later, requiring way more context switching and re-debugging than if I'd been thorough from the start.
Focusing on Syntax, Not Semantics: My early checks focused on linting and basic type-checking. I missed logical inconsistencies, implicit assumptions the AI made about global state or external API responses, and edge cases it never considered. For example, Claude 3.5 Sonnet had a tendency to omit await keywords in nested asynchronous calls if not explicitly prompted, leading to silent failures in Node.js apps. This was a particular pain point in an early version of NexusOS.
Lack of Adversarial Testing: I didn't proactively test my parsers or decision-making logic against malformed or unexpected AI outputs. I learned the hard way that an LLM saying "Okay, here's your tool call:" before the JSON block makes JSON.parse throw every single time, leaving a SyntaxError along the lines of Unexpected token 'O' in my logs (the exact wording depends on the Node.js version). I don't get why LLMs do this; it's honestly infuriating and a massive productivity blocker. I had to explicitly write a regex to extract the JSON.
Over-reliance on AI for Entire Features: For FarahGPT, I tried to get the AI to generate entire trading strategies. The initial drafts were quick, but the hidden bugs in the market analysis logic, position sizing, and risk management were devastating. It wasn't just about syntax; it was about nuanced domain understanding the AI simply didn't possess. I spent weeks fixing and rewriting logic that felt "fast" to generate.

My fix was to flip the script. Instead of asking "Is this AI-generated code correct?", I started asking "How can this AI-generated code break my system, and how can I guard against it?" That's the core of real AI coding productivity.
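
The missing-await failure mentioned above is worth seeing in isolation, because nothing throws. The sketch below is a toy reproduction, not code from NexusOS: `fetchBalance`, `canAffordBuggy`, and `canAfford` are hypothetical names chosen for illustration.

```javascript
// Hypothetical stand-in for an async data-access call.
async function fetchBalance(accountId) {
  return accountId === 'acct-1' ? 250 : 0;
}

// Bug: without `await`, `balance` is a Promise. Comparing a Promise with a
// number coerces it to NaN, so the check is false for every account and no
// error is ever raised.
async function canAffordBuggy(accountId, price) {
  const balance = fetchBalance(accountId); // missing await
  return balance >= price;
}

// Fixed version: the comparison runs on the resolved number.
async function canAfford(accountId, price) {
  const balance = await fetchBalance(accountId);
  return balance >= price;
}

Promise.all([canAffordBuggy('acct-1', 100), canAfford('acct-1', 100)])
  .then(([buggy, fixed]) => console.log({ buggy, fixed })); // { buggy: false, fixed: true }
```

Linting and a happy-path smoke test can both miss this, since the buggy function still returns a boolean. A test that asserts on the resolved value for an account that should pass is what exposes it.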

Optimizing for Real AI Developer Efficiency

Once I shifted to AI-aware testing, I started seeing measurable improvements. Here are a few optimizations and gotchas:

Schema-First Prompting: Before letting the AI generate anything, I define the exact input/output schemas (using Zod or similar) for tool calls, API responses, or data structures. Then, I prompt the AI to adhere strictly to these schemas. This significantly reduces malformed outputs.
Layered Validation: Implement validation at every boundary. Don't just validate raw LLM output; validate after parsing, before business logic execution, and even before persisting data. This keeps one bad output from slipping through and turning into debugging overhead later.
Fine-tuning for Reliability, Not Creativity: For domain-specific tasks or critical tool calls, if you're using a smaller model, prioritize fine-tuning for consistent, predictable output rather than creative or conversational responses. Sometimes less "smart" means more reliable.
Automated Regression Suites: Integrate your AI-aware tests into your CI/CD pipeline. Every time a new LLM version rolls out, or you tweak your prompts, these tests should run. This prevents regressions from subtle changes in LLM behavior.
Human Review of AI-Generated Tests: Ironically, while AI can generate tests, it often misses the adversarial cases. I've found AI-generated tests are great for happy paths but usually fail to create the 'AI-aware' tests you actually need to guard against its own typical failure modes.
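
The schema-first idea from the list above can be sketched as a single spec that drives both the prompt text and the validation, so the two cannot drift apart. In a real project Zod (as used in the handler earlier) would play the validator role; the hand-rolled version below is only there to keep the example dependency-free. `getProductDetailsSpec`, `renderPromptContract`, and `validateToolCall` are hypothetical names, and the spec format is an assumption.

```javascript
// Hypothetical single source of truth for one tool's parameters.
const getProductDetailsSpec = {
  tool_name: 'getProductDetails',
  parameters: {
    productId: { type: 'string', description: 'UUID of the product' },
  },
};

// Render the spec into prompt text so the model sees the same contract the
// validator enforces.
function renderPromptContract(spec) {
  const fields = Object.entries(spec.parameters)
    .map(([name, rule]) => `- ${name} (${rule.type}): ${rule.description}`)
    .join('\n');
  return (
    `Reply with ONLY a JSON object of the form ` +
    `{"tool_name": "${spec.tool_name}", "parameters": {...}}.\n` +
    `Parameters:\n${fields}`
  );
}

// Validate a parsed tool call against the same spec.
// Returns a list of problems; an empty list means the call is valid.
function validateToolCall(spec, call) {
  if (call === null || typeof call !== 'object' || Array.isArray(call)) {
    return ['tool call must be an object'];
  }
  const problems = [];
  if (call.tool_name !== spec.tool_name) {
    problems.push(`unexpected tool_name: ${call.tool_name}`);
  }
  const params = call.parameters;
  if (params === null || typeof params !== 'object' || Array.isArray(params)) {
    problems.push('parameters must be an object');
    return problems;
  }
  for (const [name, rule] of Object.entries(spec.parameters)) {
    if (!(name in params)) problems.push(`missing parameter: ${name}`);
    else if (typeof params[name] !== rule.type) problems.push(`${name} must be a ${rule.type}`);
  }
  // Extra keys are a common sign the model invented a parameter.
  for (const name of Object.keys(params)) {
    if (!(name in spec.parameters)) problems.push(`unexpected parameter: ${name}`);
  }
  return problems;
}
```

Running the same validator again just before business logic and before persistence is one way to apply the layered validation point: the check is cheap, and each layer catches what an earlier one was bypassed on.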

FAQs

How does AI coding actually impact developer productivity?

AI coding can provide a significant boost in initial velocity for boilerplate or well-defined, isolated tasks. However, it often introduces hidden costs in debugging, refinement, and ensuring logical correctness for complex systems, potentially leading to a net slowdown if not managed with robust validation and testing strategies like AI-aware testing.

What are common pitfalls of relying on AI for code generation?

Common pitfalls include subtle context errors, malformed outputs (especially JSON), implicit assumptions by the AI about system state or external APIs, and the generation of non-idiomatic or less performant code. Over-reliance can lead to increased debugging time and significant refactoring overhead.

How can I measure true AI coding efficiency?

Measure true efficiency by tracking total time-to-production for a feature, including initial generation, debugging, refactoring, and integration testing, rather than just initial code generation speed. Implement metrics for bug density in AI-generated modules and the time spent on post-generation fixes to get a realistic measure of AI coding speed.
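
As a rough illustration of that bookkeeping, the sketch below sums per-phase hours for one feature and compares the total with a hand-written baseline. The function names and the phase breakdown are assumptions, and the numbers are made up for the example; they are not measurements from my projects.

```javascript
// Sum the hours logged against each phase of a feature.
function totalHours(phases) {
  return Object.values(phases).reduce((sum, hours) => sum + hours, 0);
}

// Relative change in time to ship versus a baseline estimate.
// Positive means the AI-assisted path was slower overall.
function relativeChange(baselineHours, aiAssistedPhases) {
  return (totalHours(aiAssistedPhases) - baselineHours) / baselineHours;
}

// Illustrative numbers only.
const aiAssisted = { generation: 2, debugging: 5, refactoring: 3, integrationTesting: 2 };
const handWrittenBaseline = 10;

const change = relativeChange(handWrittenBaseline, aiAssisted);
console.log(`${(change * 100).toFixed(0)}% change in time to ship`);
```

The point of splitting the phases is that generation alone looks fast here (2 hours), while the total does not. The baseline is the hard part in practice, since you rarely build the same feature twice, so treat it as an estimate.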

AI code generation isn't a silver bullet. It's a powerful tool, but like any tool, it can cut you if you're not careful. My journey with FarahGPT, NexusOS, and Prax-Agent taught me that perceived AI coding productivity is often a mirage. Real gains come not from how fast AI generates code, but from how effectively your engineering practices guard against its inherent unpredictability. Build for resilience, assume the AI will mess up, and test for those specific failure modes. That's how you actually go faster. If you're building complex AI systems and want to talk about how to implement these patterns effectively, hit me up. Book a call via buildzn.com.
