Mustafa ERBAY

Posted on • Originally published at mustafaerbay.com.tr

AI's Silent Mistakes: Hours Lost in My Side Project

Introduction to AI Projects: Excitement and Hidden Costs

Over the past few years, AI has started playing a very central role in both my main projects and my side projects. What began as a journey where I initially thought "wow, I'm getting results so fast," gradually led me to realize some "silent mistakes." These errors don't directly throw a 500 Internal Server Error or crash the system, but they slowly erode my valuable time, and sometimes even money from my pocket. It was as if a faucet had been left running in the background, and I hadn't noticed. I encountered many such issues, especially when integrating AI into my side project where I developed financial calculators, or in my Android spam application. In this post, I want to talk about these hidden traps and how I learned from them.

This situation made me see the "everything has a cost" principle, which I've learned over the years from systems and networking, in a different dimension within the AI world. Just as incorrectly calculated IP ranges when doing VLAN segmentation caused me headaches later, a wrong assumption in AI can consume hours of my time. In one of my side projects, when doing production planning with AI, everything seemed fine at first. However, when faced with real-world data, the AI's lack of "common sense" or its failure to grasp subtle details led to weeks of debugging. In short, while AI can give you a quick start, overlooking the devils in the details can cost you much more in the long run.

Understanding and Controlling API Costs

When I started using AI model APIs, I underestimated token costs. They seemed negligible in small experiments, but when I developed a module that analyzes financial data in my side project, the API bills quickly skyrocketed. As prompts and model outputs grew longer, token usage ballooned. I was genuinely surprised by the bill at the end of one month. It reminded me of a misconfigured Redis instance in a client project that kept burning CPU because of a bad OOM eviction policy, quietly inflating the bill: everything appeared to be working, but resources were being wasted in the background.

To get this situation under control, my first step was to log API calls and token usage in detail. By recording the number of incoming and outgoing tokens for each call, I tried to understand which scenarios were more costly. Then, I started optimizing prompts. I reduced token usage by methods such as removing unnecessary words, using shorter and more concise phrases, and narrowing down the expected output from the model as much as possible. For example, in a complex financial calculation scenario, I initially sent all financial texts to the model, but later significantly saved tokens by sending only the necessary data points and formulas.

import tiktoken

def count_tokens(text: str, model: str = "gpt-4") -> int:
    """Counts the number of tokens in a text."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

def analyze_cost(prompt: str, response: str, model: str = "gpt-4") -> float:
    """Estimates the cost of a prompt/response pair."""
    input_tokens = count_tokens(prompt, model)
    output_tokens = count_tokens(response, model)
    # Example rates (actual prices vary by API provider)
    # These are 2024 figures; always check current pricing!
    cost_per_input_k_tokens = 0.01   # $0.01 / 1K input tokens
    cost_per_output_k_tokens = 0.03  # $0.03 / 1K output tokens

    total_cost = (input_tokens / 1000 * cost_per_input_k_tokens) + \
                 (output_tokens / 1000 * cost_per_output_k_tokens)
    print(f"Input Tokens: {input_tokens}, Output Tokens: {output_tokens}")
    print(f"Estimated cost: ${total_cost:.4f}")
    return total_cost

# Example usage

my_prompt = "Provide a detailed analysis of the expected inflation rate in Türkiye for 2025 and summarize it in no more than 500 words."
my_response = "..."  # The model's response

# analyze_cost(my_prompt, my_response, model="gpt-3.5-turbo")

This type of cost analysis allowed me to clearly see which prompts and workflows were more expensive. As I mentioned in my [related: PostgreSQL performance tuning] post, it's the same logic as optimizing database queries: eliminating unnecessary load.
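The per-call logging I described can be as simple as appending each call's token counts to a CSV file. This is a minimal sketch; the function name, the `scenario` label, and the file path are my own illustrative choices, not part of any API:

```python
import csv
import datetime

def log_usage(scenario: str, input_tokens: int, output_tokens: int,
              path: str = "token_usage.csv") -> None:
    """Appends one API call's token counts to a CSV log for later analysis."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow(
            [datetime.datetime.now().isoformat(), scenario,
             input_tokens, output_tokens]
        )

# log_usage("financial-analysis", 120, 45)
```

Grouping the log by scenario later (even with a spreadsheet) is enough to show which workflows dominate the bill.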

Effective Prompt Engineering: Ways to Shorten the Experimentation Process

Prompt engineering, while seemingly simple at first, can be one of the most insidious costs of AI projects due to the time spent trying to get the exact output you want. I know I've spent hours trying to get the same output with different prompts while testing AI production planning models in a manufacturing company's ERP. Sometimes I saw that even a single word, or even a punctuation mark, could completely change the model's behavior. This was similar to the effort I put into defining specific ports instead of ANY in a firewall policy; a small detail makes a big difference. Many times in my side project, I worked on a prompt for 3-4 hours, only to then doubt, "Would another prompt have been better?"

To shorten this experimentation process, I adopted a few principles. First, I write the prompt as clearly and concisely as possible: I specify upfront what I expect from the model, the format I want, and the constraints it must respect. Second, I use an iterative approach. I make small changes and immediately check the output, improving the prompt step by step instead of rewriting it from scratch. Finally, I version my prompts. I save every change so I can roll back a poorly performing prompt or compare variants. This is very similar to code versioning in [related: CI/CD reliability] processes.
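Prompt versioning doesn't need heavy tooling. A sketch of what I mean, with the class name and JSON file layout being illustrative assumptions rather than any standard:

```python
import datetime
import hashlib
import json

class PromptStore:
    """Keeps every prompt variant with a short hash ID so runs can be
    compared or rolled back later."""
    def __init__(self, path: str = "prompts.json"):
        self.path = path
        try:
            with open(path) as f:
                self.versions = json.load(f)
        except FileNotFoundError:
            self.versions = []

    def save(self, prompt: str, note: str = "") -> str:
        version_id = hashlib.sha256(prompt.encode()).hexdigest()[:8]
        self.versions.append({
            "id": version_id,
            "prompt": prompt,
            "note": note,
            "saved_at": datetime.datetime.now().isoformat(),
        })
        with open(self.path, "w") as f:
            json.dump(self.versions, f, indent=2)
        return version_id

    def get(self, version_id: str) -> str:
        return next(v["prompt"] for v in self.versions if v["id"] == version_id)
```

Storing the file in git gives you diffs between prompt variants for free.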

💡 Prompt Tips

When writing prompts, always ask these questions:

- What role do I expect the model to play? (e.g., "You are an experienced financial analyst...")
- What information am I providing? (Clearly state the context)
- What output format do I want? (JSON, bullet points, plain text)
- What constraints are there? (Word count, language, topics to include/exclude)

This helped me speed up the experimentation process by 30%.
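The four checklist questions map naturally onto a small template builder. A sketch, with all names and the template wording being my own illustration:

```python
def build_prompt(role: str, context: str, output_format: str,
                 constraints: str) -> str:
    """Assembles a prompt from the four checklist questions:
    role, context, output format, constraints."""
    return (
        f"You are {role}.\n"
        f"Context:\n{context}\n"
        f"Respond in the following format: {output_format}.\n"
        f"Constraints: {constraints}"
    )

prompt = build_prompt(
    role="an experienced financial analyst",
    context="Expected inflation data for Türkiye in 2025.",
    output_format="JSON with keys 'summary' and 'risks'",
    constraints="no more than 500 words, in English",
)
```

Filling the template forces you to answer every question before the first API call, instead of discovering a missing constraint after several paid iterations.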

Especially in RAG (Retrieval-Augmented Generation) based systems, the quality of the prompt is vital for synthesizing the retrieved information correctly. A wrong or incomplete prompt can cause the model to "hallucinate" even if it retrieves the correct information.

Ensuring Data Integrity and Freshness in RAG Systems

Retrieval-Augmented Generation (RAG) architectures are a great way to provide AI models with access to up-to-date and specific data. However, when using these systems in my side project, I experienced serious problems with data freshness and integrity. Especially in my financial calculators, the freshness of information retrieved from the database is crucial. One day, I noticed that a user's financial calculations were producing incorrect outputs. Upon investigation, I found that the data in the vector database used by RAG was 48 hours behind the latest changes in the main database. This was like reports showing incorrect data due to replication delays in a bank's internal platform; data inconsistency directly affects reliability.

To solve this problem, I re-evaluated my data synchronization strategies. Initially, I used a simple cron job for daily synchronization, but I switched to a more event-driven approach. I set up a mechanism that instantly indexes or updates the relevant data in the vector database when a critical change occurs in the main database. This was a CDC (Change Data Capture)-like approach. Additionally, I also added a mechanism to check the "age" of the data retrieved by RAG. If the retrieved data was older than a certain threshold, I instructed the model to indicate this or search for a more up-to-date source.
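The CDC-like mechanism can be reduced to a publish/subscribe shape: the main database notifies subscribers on every write, and one subscriber updates the vector index immediately. A minimal sketch, where the class names are mine and the vector database is faked with a dict:

```python
from typing import Callable, Dict

class ChangeNotifier:
    """Minimal stand-in for a CDC pipeline: the main DB calls notify() on
    every write, and subscribers react immediately."""
    def __init__(self):
        self.subscribers = []

    def subscribe(self, handler: Callable[[str, dict], None]) -> None:
        self.subscribers.append(handler)

    def notify(self, record_id: str, record: dict) -> None:
        for handler in self.subscribers:
            handler(record_id, record)

vector_index: Dict[str, dict] = {}  # stand-in for the real vector database

def reindex(record_id: str, record: dict) -> None:
    # In a real system this would embed the record and upsert it
    # into the vector DB instead of writing to a dict.
    vector_index[record_id] = record

notifier = ChangeNotifier()
notifier.subscribe(reindex)
notifier.notify("finansal_bilgi_123", {"rate": 0.42})
```

The point of the pattern is that index freshness is bounded by write latency rather than by a daily cron interval.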

# Example: a simple data freshness check
import datetime

def get_data_last_updated(record_id: str) -> datetime.datetime:
    """Fetches the record's last-updated timestamp from the database."""
    # In a real application this would be a DB query.
    # Return fixed offsets for the example.
    if record_id == "finansal_bilgi_123":
        return datetime.datetime.now() - datetime.timedelta(hours=2)  # 2 hours ago
    return datetime.datetime.now() - datetime.timedelta(days=3)  # 3 days ago

def check_data_freshness(record_id: str, max_stale_hours: int = 24) -> bool:
    """Checks whether the data is within a given freshness threshold."""
    last_updated = get_data_last_updated(record_id)
    if not last_updated:
        return False  # Data not found

    current_time = datetime.datetime.now()
    stale_duration = current_time - last_updated

    if stale_duration.total_seconds() / 3600 > max_stale_hours:
        print(f"WARNING: {record_id} was last updated {stale_duration.days} days "
              f"{stale_duration.seconds // 3600} hours ago. Threshold: {max_stale_hours} hours.")
        return False
    print(f"{record_id} is fresh.")
    return True

# check_data_freshness("finansal_bilgi_123", max_stale_hours=12)
# check_data_freshness("eski_finansal_bilgi_456", max_stale_hours=12)

Such controls play a critical role in increasing the reliability of RAG systems. Otherwise, no matter how good the model is, it will inevitably "hallucinate" from incorrect or outdated information. This is like requests being sent to old IP addresses because of stale DNS cache entries on the network side: problems in the underlying data layer affect the entire system.

Ensuring Stability in Agent Patterns

AI agent patterns have always excited me with their potential to perform autonomous tasks. In my own task management application, I tried to develop an agent that understands natural language input from the user, automatically creates subtasks, and attempts to complete them in a specific order. Initial experiments were great; it successfully performed simple tasks. However, in more complex scenarios, I noticed that the agent could never reach a conclusion, repeatedly tried the same steps, and eventually entered an "infinite loop." This would drive the CPU to 100% and consume system resources. This was like a workflow in a manufacturing ERP constantly repeating the same step and getting stuck due to a misdefined condition; losing control of the flow leads to serious performance issues.

To prevent such infinite loops, I added some control mechanisms to the agent design. First, I started logging every agent step to see which steps were being repeated. Second, I implemented a "step counter" and "timeout" mechanism. If the agent exceeded a certain number of steps (e.g., 100 steps) or failed to reach a conclusion within a specific time (e.g., 5 minutes), I made it cancel the task and return an error message to the user. Third, I optimized the agent's "memory." By deleting unnecessary or old information from its memory, I made its decision-making mechanism more focused.


import time

class AIAgent:
    def __init__(self, max_steps: int = 50, timeout_seconds: int = 300):
        self.step_count = 0
        self.start_time = time.time()
        self.max_steps = max_steps
        self.timeout_seconds = timeout_seconds
        self.history = []  # The agent's memory

    def execute_step(self, task_description: str) -> str:
        self.step_count += 1
        elapsed_time = time.time() - self.start_time

        if self.step_count > self.max_steps:
            print(f"ERROR: Agent exceeded {self.max_steps} steps, cancelling the task.")
            return "CANCELLED: step limit exceeded"
        if elapsed_time > self.timeout_seconds:
            print(f"ERROR: Agent exceeded the {self.timeout_seconds}s timeout, cancelling the task.")
            return "CANCELLED: timeout"

        # In a real agent, the model would plan and act here.
        result = f"step {self.step_count}: working on '{task_description}'"
        self.history.append(result)
        return result
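The third measure, keeping the agent's memory focused, can be sketched as a simple cap on history length. This is a deliberate simplification: a real agent would summarize old entries rather than drop them, and the function name is my own:

```python
def prune_memory(history: list, max_items: int = 20) -> list:
    """Keeps only the most recent entries so the agent's context stays focused."""
    return history[-max_items:]

history = [f"step {i}" for i in range(100)]
history = prune_memory(history, max_items=20)
# history now holds only the entries for steps 80 through 99
```

Even this crude cap helped in my case: with less stale context in each call, the agent was less likely to re-plan the same step it had already tried.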
