DEV Community

Ale Santini


AI Is Creating a New Kind of Technical Debt — And Most Teams Don't See It Yet


You're shipping AI features faster than you ever have. The prompts work. The model responds. Users are happy. Your sprint velocity looks incredible.

Six months from now, your AI system will be a maintenance nightmare.

Not because AI is fundamentally different, but because teams treat it like magic instead of infrastructure. You wouldn't ship a database query without tests, without monitoring, without versioning. But somehow, a 500-character string that controls your model's behavior? That lives in a .py file with no version control, no A/B testing, no audit trail.

This is AI technical debt. It's different from code debt. It's worse because it's invisible until it breaks production.

1. Prompt Debt: The Hardcoded Time Bomb

Bad Pattern:

def generate_summary(text):
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": "You are a helpful assistant that summarizes text concisely."
            },
            {"role": "user", "content": text}
        ],
        temperature=0.7
    )
    return response['choices'][0]['message']['content']

This prompt exists nowhere else. It's not versioned. You changed it last Tuesday to add "concisely" and forgot. Nobody knows when it changed or why. Your data team runs an analysis and finds summaries degraded 3% last week. They have no way to correlate it to your prompt tweak.

Fix:

# prompts.py - version controlled, tagged
PROMPTS = {
    "summarize_v1": {
        "created": "2024-01-15",
        "modified": "2024-01-20",
        "system": "You are a helpful assistant that summarizes text concisely.",
        "temperature": 0.7,
        "tags": ["production", "active"]
    },
    "summarize_v2": {
        "created": "2024-02-01",
        "modified": "2024-02-01",
        "system": "Summarize the following text in 1-2 sentences. Focus on actionable insights.",
        "temperature": 0.5,
        "tags": ["staging"]
    }
}

def generate_summary(text, prompt_version="summarize_v1"):
    prompt_config = PROMPTS[prompt_version]
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": prompt_config["system"]},
            {"role": "user", "content": text}
        ],
        temperature=prompt_config["temperature"]
    )
    return response['choices'][0]['message']['content']

Now your prompts are:

  • Versioned in git with commit history
  • Tagged for production/staging/experiment
  • Auditable (who changed what, when)
  • A/B testable (compare v1 vs v2 systematically)

Store this in version control. Treat it like database schema migrations. Because it is.
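The registry also makes A/B testing nearly free. A hypothetical sketch of a traffic splitter (the function name and the 10% rollout fraction are my own choices, not from any library):

```python
import random

def pick_prompt_version(rollout_fraction: float = 0.1) -> str:
    """Route a fraction of requests to the staging prompt so
    summarize_v1 and summarize_v2 can be compared on live traffic."""
    if random.random() < rollout_fraction:
        return "summarize_v2"  # staging candidate
    return "summarize_v1"      # current production prompt
```

Log the chosen version alongside each output, so when quality shifts, your data team can attribute it to a specific prompt.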

2. Model Coupling: The Vendor Lock-In Trap

Bad Pattern:

class AIAssistant:
    def __init__(self):
        self.client = openai.OpenAI()

    def get_response(self, prompt):
        response = self.client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7
        )
        return response.choices[0].message.content

You're locked to OpenAI's API. Your code is full of OpenAI-specific parameters. Claude's API is slightly different. Anthropic's pricing is better. You want to switch? You're rewriting everything.

Your CEO sees Claude's 200K context window and wants to migrate. Your team says "maybe next quarter" because it's entangled everywhere. That's model coupling.

Fix:

from abc import ABC, abstractmethod

class LLMProvider(ABC):
    @abstractmethod
    def complete(self, messages: list[dict], temperature: float) -> str:
        pass

class OpenAIProvider(LLMProvider):
    def __init__(self):
        self.client = openai.OpenAI()

    def complete(self, messages, temperature):
        response = self.client.chat.completions.create(
            model="gpt-4",
            messages=messages,
            temperature=temperature
        )
        return response.choices[0].message.content

class ClaudeProvider(LLMProvider):
    def __init__(self):
        self.client = anthropic.Anthropic()

    def complete(self, messages, temperature):
        response = self.client.messages.create(
            model="claude-3-opus-20240229",
            max_tokens=2048,
            messages=messages,
            temperature=temperature
        )
        return response.content[0].text

class AIAssistant:
    def __init__(self, provider: LLMProvider):
        self.provider = provider

    def get_response(self, prompt):
        messages = [{"role": "user", "content": prompt}]
        return self.provider.complete(messages, temperature=0.7)

# Usage
# assistant = AIAssistant(OpenAIProvider())
# or
# assistant = AIAssistant(ClaudeProvider())

Now switching models is a configuration change, not a rewrite. You can A/B test different providers. You can fall back gracefully if one API goes down.
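The same interface is what makes graceful fallback possible. A minimal sketch (the `FallbackProvider` name is my own; the ABC is restated from above so the snippet stands alone):

```python
from abc import ABC, abstractmethod

# Interface restated from the example above so this snippet is self-contained.
class LLMProvider(ABC):
    @abstractmethod
    def complete(self, messages: list[dict], temperature: float) -> str:
        ...

class FallbackProvider(LLMProvider):
    """Try providers in order; return the first successful response."""
    def __init__(self, providers: list[LLMProvider]):
        self.providers = providers

    def complete(self, messages, temperature):
        last_error = None
        for provider in self.providers:
            try:
                return provider.complete(messages, temperature)
            except Exception as exc:  # rate limit, outage, timeout...
                last_error = exc
        raise RuntimeError("All LLM providers failed") from last_error
```

Construct it as `FallbackProvider([OpenAIProvider(), ClaudeProvider()])` and one vendor outage stops being a production incident.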

3. Evaluation Desert: The Silent Degradation

Bad Pattern:

def generate_title(article_text):
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": "Generate a catchy title for this article"
            },
            {"role": "user", "content": article_text}
        ]
    )
    return response['choices'][0]['message']['content']

# Ship it
# No tests
# No metrics
# No benchmarks

Three months later, your titles are garbage. But you don't know when it started. Was it the model update? The prompt change? Something in your data pipeline? You have no baseline to compare against.

Fix:

from dataclasses import dataclass
from datetime import datetime

@dataclass
class EvalResult:
    prompt_version: str
    model: str
    score: float
    timestamp: str
    sample_size: int

class TitleGenerator:
    def __init__(self, prompt_version="v1", model="gpt-4"):
        self.prompt_version = prompt_version
        self.model = model

    def generate_title(self, article_text):
        response = openai.ChatCompletion.create(
            model=self.model,
            messages=[
                {"role": "system", "content": PROMPTS[self.prompt_version]},
                {"role": "user", "content": article_text}
            ]
        )
        return response['choices'][0]['message']['content']

class TitleEvaluator:
    def __init__(self, generator: TitleGenerator):
        self.generator = generator
        self.eval_results = []

    def evaluate(self, test_articles: list[dict]) -> EvalResult:
        """
        test_articles: [{"text": "...", "human_rating": 8}, ...]
        """
        scores = []

        for article in test_articles:
            title = self.generator.generate_title(article["text"])
            # Your evaluation metric (could be LLM-based, human, or heuristic)
            score = self._score_title(title, article.get("expected_style"))
            scores.append(score)

        avg_score = sum(scores) / len(scores)
        result = EvalResult(
            prompt_version=self.generator.prompt_version,
            model=self.generator.model,
            score=avg_score,
            timestamp=datetime.now().isoformat(),
            sample_size=len(test_articles)
        )

        self.eval_results.append(result)
        return result

    def _score_title(self, title, expected_style):
        # This could be:
        # - Length check (5-10 words)
        # - Sentiment analysis
        # - LLM-based scoring
        # - Human feedback loop
        if len(title.split()) < 5 or len(title.split()) > 10:
            return 0.5
        return 0.8  # simplified

    def regression_check(self, threshold=0.85):
        if len(self.eval_results) < 2:
            return True

        latest = self.eval_results[-1].score
        previous = self.eval_results[-2].score

        if latest < threshold:
            print(f"⚠️ Quality below threshold: {latest}")
            return False

        if latest < previous * 0.95:  # 5% drop
            print(f"⚠️ Regression detected: {previous} → {latest}")
            return False

        return True

# Usage in CI/CD
evaluator = TitleEvaluator(TitleGenerator("v1", "gpt-4"))
test_data = load_eval_dataset()  # Fixed test set
result = evaluator.evaluate(test_data)

if not evaluator.regression_check():
    exit(1)  # Fail the build

Now you have:

  • A baseline to compare against
  • Automated regression detection
  • Visibility into when quality changes
  • Data to debug what broke
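One gap worth closing: `eval_results` lives in memory, so every CI run starts from scratch. A hedged sketch of JSONL persistence (the file name and helper names are my own; `EvalResult` is restated so the snippet runs on its own) that lets regression checks compare against previous runs:

```python
import json
from dataclasses import dataclass, asdict

# EvalResult restated from the example above so this snippet is self-contained.
@dataclass
class EvalResult:
    prompt_version: str
    model: str
    score: float
    timestamp: str
    sample_size: int

def append_result(result: EvalResult, path: str = "eval_history.jsonl") -> None:
    """Append one eval run as a JSON line."""
    with open(path, "a") as f:
        f.write(json.dumps(asdict(result)) + "\n")

def load_history(path: str = "eval_history.jsonl") -> list[EvalResult]:
    """Load all previous eval runs; empty list on the first run."""
    try:
        with open(path) as f:
            return [EvalResult(**json.loads(line)) for line in f]
    except FileNotFoundError:
        return []
```

Check `eval_history.jsonl` into an artifact store (or the repo itself) and your regression check gains a memory that survives the build.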

4. Context Window Inflation: The Expensive Slide

Bad Pattern:

def answer_question(question, user_history):
    # Just dump everything into the context
    context = "\n".join([
        f"Previous message {i}: {msg}"
        for i, msg in enumerate(user_history[-100:])  # Last 100 messages
    ])

    prompt = f"""
    {context}

    New question: {question}
    """

    response = openai.ChatCompletion.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response['choices'][0]['message']['content']

Your tokens per request: 8,000. Your monthly bill: $50K.

Six months ago, your context was 2,000 tokens. You kept adding "useful" context. Now every request costs 4x as much. Nobody noticed because it was gradual.

Fix:


from typing import Optional

class ContextManager:
    def __init__(self, max_tokens: int = 2000, model: str = "gpt-4-turbo"):
        self.max_tokens = max_tokens
        self.model = model
        self.token_counter = TokenCounter()  # Use tiktoken

    def build_context(
        self,
        question: str,
        user_history: list[str],
        max_history_messages: int = 5
    ) -> tuple[str, int]:
        """Returns (context_string, token_count)"""

        # Start with question
        context_parts = [f"Question: {question}"]
        token_count = self.token_counter.count(context_parts[0])

        # Add history incrementally, stop when we hit budget
        for msg in reversed(user_history[-max_history_messages:]):
            msg_tokens = self.token_counter.count(msg)
            if token_count + msg_tokens > self.max_tokens * 0.8:  # Leave 20% buffer
                break
            context_parts.insert(1, f"Previous: {msg}")
            token_count += msg_tokens

        context = "\n".join(context_parts)
        return context, token_count

    def answer_question(self, question: str, user_history: list[str]) -> str:
        context, token_count = self.build_context(question, user_history)

        # Log token usage for monitoring
        log_metric("llm_tokens", token_count)

        response = openai.ChatCompletion.create(
            model=self.model,
            messages=[{"role": "user", "content": context}],
        )
        return response['choices'][0]['message']['content']
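The `TokenCounter` referenced above is left undefined. A rough sketch: for exact counts you would wrap tiktoken's `encoding_for_model`, but even a dependency-free heuristic (my own approximation, shown here) is enough to enforce a budget:

```python
class TokenCounter:
    """Approximate token counter.

    For exact counts, swap the heuristic for tiktoken:
    len(tiktoken.encoding_for_model(model).encode(text)).
    """
    def count(self, text: str) -> int:
        # ~4 characters per token is a common rule of thumb for English text
        return max(1, len(text) // 4)
```

The point isn't precision; it's that context growth becomes a number you watch instead of a bill you discover.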

---
**Need an AI system for your business?**
I'm Alessandro Trimarco, AI engineer behind a 6-module AI stack for a 14-location restaurant chain (236 users, ~88k EUR/month processed).
Email: **alevibecoding@gmail.com** | [Portfolio](https://alessandrotrimarco.github.io) | [Case study](https://github.com/AlessandroTrimarco/aires-burger-case-study)
