
Karl Weinmeister for Google AI


Performance shouldn’t be an afterthought: Hardening the AI-Assisted SDLC


It’s amazing how quickly you can now build a working application with AI assistance. It’s even more amazing how easily you can harden your application for production. But that’s a step that’s often left out of the “vibe coding” software development lifecycle, or SDLC. I hope to change that.

Why does it matter? High latency costs you users, and excess memory usage costs you budget.

Study after study shows that your application’s latency directly correlates with user satisfaction, a key ingredient for business success. Meanwhile, your application’s memory usage drives your cloud infrastructure cost: Cloud Run, for example, offers memory limits at tiers ranging from 512 MiB to 32 GiB. And if you underprovision memory, your application’s reliability will suffer.

In this post, I’ll walk through the steps I recommend to ensure your application is hardened for production. I’ll use Google Antigravity to build an application, with the sample application code available on GitHub.

Discovery and Tool Selection

If you aren’t an expert in the tooling ecosystem for your application’s language, use AI to bridge the gap. Avoid guessing and ask for industry standards. For example, you can ask:

“I need to profile a Python application for both CPU execution time and memory leaks. What are the most modern, low-overhead tools available? I know about cProfile, but are there better options with visualization (like flame graphs)?”

What modern stack might your AI assistant suggest? scalene is a high-performance profiler whose standout capability is separating time spent in Python versus native code. To dig into memory details, memray can track allocations in native extensions and generate flame graphs that make it easy to spot areas for improvement. Finally, pytest-benchmark is a useful plugin that handles warm-up rounds and statistical analysis automatically.
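
For example, memray’s Python API can wrap just the code path you care about. Here’s a minimal sketch, assuming a calculate_perplexity function under test (the module, function, and file names are illustrative); the capture can then be rendered as a flame graph with memray’s flamegraph command:

from memray import Tracker

from my_app import calculate_perplexity  # hypothetical function under test

SAMPLE_TEXT = "The quick brown fox jumps over the lazy dog. " * 100

# Record every allocation made while this block runs; render it afterwards
# with `memray flamegraph perplexity_profile.bin`.
with Tracker("perplexity_profile.bin"):
    calculate_perplexity(SAMPLE_TEXT)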

If you’re writing code in other languages, the same strategy applies. You might discover pprof for Go, clinic.js for Node.js, and other useful tools.

Establish a Baseline

My use case is calculating the perplexity of a given text, a metric that’s helpful for AI-generated-text detection among other applications. Perplexity is the exponential of the mean negative log-likelihood the model assigns to each token, so lower values mean the text is more predictable to the model. The initial implementation started with a naïve algorithm that processes one token at a time, which isn’t uncommon when you simply ask for a solution.

for i in range(seq_len - 1):
    current_token = input_ids_int64[i]

    # 1. Construct single-token input
    inputs = {"input_ids": np.array([[current_token]], dtype=np.int64)}
    inputs.update(past_key_values)

    # 2. Run inference for just this token
    outputs = session.run(None, inputs)

    # 3. Refresh past_key_values from outputs and accumulate this
    #    token's negative log-likelihood (omitted for brevity)

Optimize for Speed

While this code works, it’s slow. With the tools selected during the discovery phase, we can ask our AI agent to benchmark the baseline code.

“Generate a Python script using pytest-benchmark to benchmark my perplexity function against a baseline. Create a mock dataset to simulate load.”
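
The generated script might look something like this minimal sketch, assuming the baseline routine is exposed as calculate_perplexity (the module, function, and mock text are illustrative):

# test_perplexity_benchmark.py
from my_app import calculate_perplexity  # hypothetical function under test

SAMPLE_TEXT = "The quick brown fox jumps over the lazy dog. " * 200  # mock load

def test_perplexity_throughput(benchmark):
    # pytest-benchmark's `benchmark` fixture runs warm-up rounds and
    # collects timing statistics automatically.
    result = benchmark(calculate_perplexity, SAMPLE_TEXT)
    assert result > 0

Running pytest against this file reports min, max, and mean timings, giving us the baseline numbers to beat.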

Once we have a benchmark, we can then ask our AI agent to optimize it:

“Profile this baseline code and suggest an optimized routine. Focus on throughput.”

A standard engineering strategy to address loop overhead is vectorization. The revised approach feeds the entire sequence to the model in one go:

def calculate_perplexity_batch(context, text):
    # 1. Encode entire text at once
    input_ids = tokenizer.encode(text)
    inputs = {"input_ids": np.array([input_ids], dtype=np.int64)}

    # 2. Single inference call for the whole sequence
    outputs = session.run(None, inputs)
    logits = outputs[0]  # Shape: [1, SeqLen, Vocab]

    # 3. Vectorized loss calculation (no loops)
    # ... numpy vector operations producing mean_nll ...
    return float(np.exp(mean_nll))

In my test environment, this change led to an overall 2.5x speed improvement over the naïve loop.

Optimize Memory Usage

Unfortunately, this speed came at a cost. By loading all logits for the entire sequence into memory at once, I created an unbounded memory situation. Long documents would cause peak memory usage to spike uncontrollably. I had solved for latency, but in doing so, I had broken cost constraints.

How could I prompt Antigravity to help?

“Analyze my optimized perplexity routine. The target environment is Google Cloud Run with a strict 2GB memory limit. Identify the peak memory usage and refactor the code to stay under this limit without reverting to the slow loop.”

The solution balanced speed and memory, processing data in batches large enough to achieve high throughput but small enough to manage peak memory:

chunk_size = 128
logits_list = []
targets_list = []

for i in range(len(input_ids) - 1):
    append_tokens(input_ids[i : i + 1])
    logits_list.append(get_logits()[0, 0, :])
    targets_list.append(input_ids[i + 1])  # the next token is the prediction target

    if len(logits_list) >= chunk_size:
        # Process this chunk
        _process_logits_chunk(logits_list, targets_list)

        # Free memory immediately to clip the peak
        logits_list = []
        targets_list = []

# Flush any remaining partial chunk after the loop
if logits_list:
    _process_logits_chunk(logits_list, targets_list)
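
It’s worth verifying the claim independently. The standard library’s tracemalloc (a lighter-weight alternative to a full memray run) can report peak Python-level allocations; here’s a minimal sketch, assuming the chunked routine is exposed as calculate_perplexity_chunked:

import tracemalloc

from my_app import calculate_perplexity_chunked  # hypothetical chunked routine

LONG_TEXT = "word " * 50_000  # simulate a long document

tracemalloc.start()
calculate_perplexity_chunked(LONG_TEXT)
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"Peak traced memory: {peak / (1024 * 1024):.1f} MiB")

Keep in mind that tracemalloc only sees allocations made through Python’s allocator; native buffers (such as those inside ONNX Runtime) still require memray or container-level metrics to observe.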

Final Thoughts

Before unleashing this process across your codebase, let’s be clear that performance engineering is a rigorous discipline that goes beyond optimizing individual functions. Industry veteran Brendan Gregg famously warns against the Streetlight Anti-Method: looking for performance problems where it’s easiest to look, rather than where the problems actually exist.

Providing your AI assistant with the broader context of your application is key, and it’s easy to overlook important details in your prompting. An AI assistant doesn’t know that your production workload is 10 million rows, not the 100 rows in your test script. It can’t see that your database is missing an index or that your network bandwidth is saturated. Most importantly, an AI assistant doesn’t know your intent: if you steer it towards speeding up a query, it will focus on what you asked for, but it likely won’t ask why that data isn’t cached in the first place.

With those considerations in mind, using AI as a final check is a low-risk, high-reward step. It takes minutes and often catches low-hanging fruit that would otherwise be overlooked. The next step is maintaining your application’s performance: consider leveraging continuous application monitoring tools to catch regressions and ensure reliability in a live environment.

I’d love to hear how you’re innovating with your software development lifecycle. Connect with me on LinkedIn, X, or Bluesky!
