AI Is Creating a New Kind of Technical Debt — And Most Teams Don't See It Yet
You're shipping AI features faster than you ever have. The prompts work. The model responds. Users are happy. Your sprint velocity looks incredible.
Six months from now, your AI system will be a maintenance nightmare.
Not because AI is fundamentally different, but because teams treat it like magic instead of infrastructure. You wouldn't ship a database query without tests, without monitoring, without versioning. But somehow, a 500-character string that controls your model's behavior? That lives in a .py file with no version control, no A/B testing, no audit trail.
This is AI technical debt. It's different from code debt, and worse: it stays invisible until it breaks production.
1. Prompt Debt: The Hardcoded Time Bomb
Bad Pattern:
```python
def generate_summary(text):
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a helpful assistant that summarizes text concisely."},
            {"role": "user", "content": text}
        ],
        temperature=0.7
    )
    return response['choices'][0]['message']['content']
```
This prompt exists nowhere else. It's not versioned. You changed it last Tuesday to add "concisely" and forgot. Nobody knows when it changed or why. Your data team runs an analysis and finds summaries degraded 3% last week. They have no way to correlate it to your prompt tweak.
Fix:
```python
# prompts.py - version controlled, tagged
PROMPTS = {
    "summarize_v1": {
        "created": "2024-01-15",
        "modified": "2024-01-20",
        "system": "You are a helpful assistant that summarizes text concisely.",
        "temperature": 0.7,
        "tags": ["production", "active"]
    },
    "summarize_v2": {
        "created": "2024-02-01",
        "modified": "2024-02-01",
        "system": "Summarize the following text in 1-2 sentences. Focus on actionable insights.",
        "temperature": 0.5,
        "tags": ["staging"]
    }
}

def generate_summary(text, prompt_version="summarize_v1"):
    prompt_config = PROMPTS[prompt_version]
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": prompt_config["system"]},
            {"role": "user", "content": text}
        ],
        temperature=prompt_config["temperature"]
    )
    return response['choices'][0]['message']['content']
```
Now your prompts are:
- Versioned in git with commit history
- Tagged for production/staging/experiment
- Auditable (who changed what, when)
- A/B testable (compare v1 vs v2 systematically)
Store this in version control. Treat it like database schema migrations. Because it is.
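The tags also make controlled rollouts easy. Here's a minimal sketch of deterministic A/B routing between the two versions above (the hashing scheme and `pick_prompt_version` are illustrative assumptions, not a library API):

```python
import hashlib

def pick_prompt_version(user_id: str, experiment_split: float = 0.1) -> str:
    """Deterministically route a slice of users to the candidate prompt.

    The same user always lands in the same bucket, so A/B comparisons
    between summarize_v1 and summarize_v2 stay stable across requests.
    """
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    if bucket < experiment_split * 100:
        return "summarize_v2"  # staging candidate
    return "summarize_v1"      # production default
```

Because routing is a pure function of the user ID, you can replay any past request against the exact prompt version that served it.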
2. Model Coupling: The Vendor Lock-In Trap
Bad Pattern:
```python
class AIAssistant:
    def __init__(self):
        self.client = openai.OpenAI()

    def get_response(self, prompt):
        response = self.client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7
        )
        return response.choices[0].message.content
```
You're locked to OpenAI's API. Your code is full of OpenAI-specific parameters. Claude's API is slightly different. Anthropic's pricing is better. You want to switch? You're rewriting everything.
Your CEO sees Claude's 200K context window and wants to migrate. Your team says "maybe next quarter" because it's entangled everywhere. That's model coupling.
Fix:
```python
from abc import ABC, abstractmethod

class LLMProvider(ABC):
    @abstractmethod
    def complete(self, messages: list[dict], temperature: float) -> str:
        pass

class OpenAIProvider(LLMProvider):
    def __init__(self):
        self.client = openai.OpenAI()

    def complete(self, messages, temperature):
        response = self.client.chat.completions.create(
            model="gpt-4",
            messages=messages,
            temperature=temperature
        )
        return response.choices[0].message.content

class ClaudeProvider(LLMProvider):
    def __init__(self):
        self.client = anthropic.Anthropic()

    def complete(self, messages, temperature):
        response = self.client.messages.create(
            model="claude-3-opus-20240229",
            max_tokens=2048,
            messages=messages,
            temperature=temperature
        )
        return response.content[0].text

class AIAssistant:
    def __init__(self, provider: LLMProvider):
        self.provider = provider

    def get_response(self, prompt):
        messages = [{"role": "user", "content": prompt}]
        return self.provider.complete(messages, temperature=0.7)

# Usage
# assistant = AIAssistant(OpenAIProvider())
# or
# assistant = AIAssistant(ClaudeProvider())
```
Now switching models is a configuration change, not a rewrite. You can A/B test different providers. You can fall back gracefully if one API goes down.
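The graceful fallback can be layered onto the same interface. A sketch (the `FallbackProvider` name is an assumption, and in real code you'd narrow the `except` to the SDK's actual error types instead of bare `Exception`):

```python
from abc import ABC, abstractmethod

class LLMProvider(ABC):
    @abstractmethod
    def complete(self, messages: list[dict], temperature: float) -> str:
        pass

class FallbackProvider(LLMProvider):
    """Tries providers in order; moves to the next one on any failure."""

    def __init__(self, providers: list[LLMProvider]):
        self.providers = providers

    def complete(self, messages, temperature):
        last_error = None
        for provider in self.providers:
            try:
                return provider.complete(messages, temperature)
            except Exception as exc:  # narrow to the SDK's error classes
                last_error = exc
        raise RuntimeError("All LLM providers failed") from last_error
```

Callers never learn which vendor answered; an outage becomes a logged failover instead of an incident.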
3. Evaluation Desert: The Silent Degradation
Bad Pattern:
```python
def generate_title(article_text):
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Generate a catchy title for this article"},
            {"role": "user", "content": article_text}
        ]
    )
    return response['choices'][0]['message']['content']

# Ship it
# No tests
# No metrics
# No benchmarks
```
Three months later, your titles are garbage. But you don't know when it started. Was it the model update? The prompt change? Something in your data pipeline? You have no baseline to compare against.
Fix:
```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class EvalResult:
    prompt_version: str
    model: str
    score: float
    timestamp: str
    sample_size: int

class TitleGenerator:
    def __init__(self, prompt_version="v1", model="gpt-4"):
        self.prompt_version = prompt_version
        self.model = model

    def generate_title(self, article_text):
        response = openai.ChatCompletion.create(
            model=self.model,
            messages=[
                {"role": "system", "content": PROMPTS[self.prompt_version]},
                {"role": "user", "content": article_text}
            ]
        )
        return response['choices'][0]['message']['content']

class TitleEvaluator:
    def __init__(self, generator: TitleGenerator):
        self.generator = generator
        self.eval_results = []

    def evaluate(self, test_articles: list[dict]) -> EvalResult:
        """
        test_articles: [{"text": "...", "human_rating": 8}, ...]
        """
        scores = []
        for article in test_articles:
            title = self.generator.generate_title(article["text"])
            # Your evaluation metric (could be LLM-based, human, or heuristic)
            score = self._score_title(title, article.get("expected_style"))
            scores.append(score)
        avg_score = sum(scores) / len(scores)
        result = EvalResult(
            prompt_version=self.generator.prompt_version,
            model=self.generator.model,
            score=avg_score,
            timestamp=datetime.now().isoformat(),
            sample_size=len(test_articles)
        )
        self.eval_results.append(result)
        return result

    def _score_title(self, title, expected_style):
        # This could be:
        # - Length check (5-10 words)
        # - Sentiment analysis
        # - LLM-based scoring
        # - Human feedback loop
        if len(title.split()) < 5 or len(title.split()) > 10:
            return 0.5
        return 0.8  # simplified

    def regression_check(self, threshold=0.85):
        if len(self.eval_results) < 2:
            return True
        latest = self.eval_results[-1].score
        previous = self.eval_results[-2].score
        if latest < threshold:
            print(f"⚠️ Quality below threshold: {latest}")
            return False
        if latest < previous * 0.95:  # 5% drop
            print(f"⚠️ Regression detected: {previous} → {latest}")
            return False
        return True

# Usage in CI/CD
evaluator = TitleEvaluator(TitleGenerator("v1", "gpt-4"))
test_data = load_eval_dataset()  # Fixed test set
result = evaluator.evaluate(test_data)
if not evaluator.regression_check():
    exit(1)  # Fail the build
```
Now you have:
- A baseline to compare against
- Automated regression detection
- Visibility into when quality changes
- Data to debug what broke
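To keep that visibility across runs, persist each `EvalResult` instead of holding it in memory. A sketch of a JSON-lines history file that dashboards or later CI runs can read back (the file path and helper names are assumptions):

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class EvalResult:
    prompt_version: str
    model: str
    score: float
    timestamp: str
    sample_size: int

def append_result(result: EvalResult, path: str = "eval_history.jsonl") -> None:
    """Append one eval run as a JSON line, building a queryable history."""
    with open(path, "a") as f:
        f.write(json.dumps(asdict(result)) + "\n")

def load_history(path: str = "eval_history.jsonl") -> list[EvalResult]:
    """Read the full history back for trend charts or regression checks."""
    with open(path) as f:
        return [EvalResult(**json.loads(line)) for line in f]
```

Check the file into an artifact store (not git) and the "when did quality change?" question becomes a one-liner over the history.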
4. Context Window Inflation: The Expensive Slide
Bad Pattern:
```python
def answer_question(question, user_history):
    # Just dump everything into the context
    context = "\n".join([
        f"Previous message {i}: {msg}"
        for i, msg in enumerate(user_history[-100:])  # Last 100 messages
    ])
    prompt = f"""
{context}

New question: {question}
"""
    response = openai.ChatCompletion.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response['choices'][0]['message']['content']
```
Your tokens per request: 8,000. Your monthly bill: $50K.
Six months ago, your context was 2,000 tokens. You kept adding "useful" context. Now every request costs 4x as much. Nobody noticed because it was gradual.
Fix:
```python
class ContextManager:
    def __init__(self, max_tokens: int = 2000, model: str = "gpt-4-turbo"):
        self.max_tokens = max_tokens
        self.model = model
        self.token_counter = TokenCounter()  # Use tiktoken

    def build_context(
        self,
        question: str,
        user_history: list[str],
        max_history_messages: int = 5
    ) -> tuple[str, int]:
        """Returns (context_string, token_count)"""
        # Start with question
        context_parts = [f"Question: {question}"]
        token_count = self.token_counter.count(context_parts[0])
        # Add history incrementally, stop when we hit budget
        for msg in reversed(user_history[-max_history_messages:]):
            msg_tokens = self.token_counter.count(msg)
            if token_count + msg_tokens > self.max_tokens * 0.8:  # Leave 20% buffer
                break
            context_parts.insert(1, f"Previous: {msg}")
            token_count += msg_tokens
        context = "\n".join(context_parts)
        return context, token_count

    def answer_question(self, question: str, user_history: list[str]) -> str:
        context, token_count = self.build_context(question, user_history)
        # Log token usage for monitoring
        log_metric("llm_tokens", token_count)
        response = openai.ChatCompletion.create(
            model=self.model,
            messages=[{"role": "user", "content": context}],
        )
        return response['choices'][0]['message']['content']
```
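The `TokenCounter` used above isn't defined in this article; a minimal sketch might look like this (an assumption: it prefers tiktoken when installed and falls back to a rough chars/4 estimate, which is close enough for budgeting):

```python
class TokenCounter:
    """Counts tokens with tiktoken when available; otherwise estimates."""

    def __init__(self, model: str = "gpt-4-turbo"):
        try:
            import tiktoken
            self._enc = tiktoken.encoding_for_model(model)
        except Exception:
            # tiktoken missing or model unknown: fall back to a heuristic
            self._enc = None

    def count(self, text: str) -> int:
        if self._enc is not None:
            return len(self._enc.encode(text))
        # Rough rule of thumb: ~4 characters per token for English text
        return max(1, len(text) // 4)
```

Either path is cheap enough to run on every request, which is what makes the budget check in `build_context` practical.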
---
**Need an AI system for your business?**
I'm Alessandro Trimarco, AI engineer behind a 6-module AI stack for a 14-location restaurant chain (236 users, ~88k EUR/month processed).
Email: **alevibecoding@gmail.com** | [Portfolio](https://alessandrotrimarco.github.io) | [Case study](https://github.com/AlessandroTrimarco/aires-burger-case-study)