Mustafa ERBAY · Originally published at mustafaerbay.com.tr

Retries and Idempotency in AI Pipelines: A Guide to Error Handling

AI-based systems, especially pipelines running in production, constantly carry the risk of errors. I frequently encounter situations like network outages, API timeouts, delayed model responses, data inconsistencies, or temporary unavailability of external services. These issues directly impact the reliability of the AI pipeline and can disrupt workflows.

In this post, I will discuss two fundamental strategies I use to manage errors in AI pipelines: retry and idempotency mechanisms. I'll share how I design them, when and how I implement them, and the lessons I've learned from real-world scenarios. My goal is to ensure my systems run more resiliently and predictably with these approaches.

The Nature of AI Pipeline Errors and My Initial Observations

AI pipelines typically consist of multiple stages: data ingestion, preprocessing, model inference, and storing results or transferring them to another system. Each stage carries its own error potential, and external dependencies and distributed architecture in particular make error scenarios more complex.

Last month, I ran into exactly this with an AI-based production planning engine I integrated into a manufacturing company's ERP. While pulling production data from an external API, an average of 15 out of every 1000 calls failed due to network timeouts or slow API responses. This led the planning engine to operate with incomplete data, resulting in incorrect production decisions. The root of the problem wasn't the API itself, but momentary network fluctuations.

ℹ️ Error Observation

Errors in AI pipelines are often transient and caused by external factors. Even if the model itself works correctly, momentary interruptions or delays in the data flow can halt the entire process. Therefore, understanding the nature of errors is critical for choosing the right solution strategy.

In such scenarios, distinguishing whether an error is permanent or transient is important. For permanent errors (e.g., invalid data format or authorization issues), retrying is pointless and only wastes resources. However, for transient errors, retry mechanisms allow the system to self-heal, reducing operational burden. In my experience, over 80% of the errors I encounter are transient.
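To make that distinction concrete, here is a minimal sketch of how such a check might look with `requests`; the `is_transient` helper and the exact status-code split are my own assumptions, not a universal rule:

```python
import requests

def is_transient(exc: Exception) -> bool:
    """Heuristic: retry only errors that are likely to resolve on their own."""
    # Network-level problems (DNS failures, resets, timeouts) are usually transient.
    if isinstance(exc, (requests.exceptions.ConnectionError,
                        requests.exceptions.Timeout)):
        return True
    # HTTP errors: 429 and 5xx are worth retrying; other 4xx (bad input,
    # authorization) are permanent, and retrying them only wastes resources.
    if isinstance(exc, requests.exceptions.HTTPError) and exc.response is not None:
        code = exc.response.status_code
        return code == 429 or code >= 500
    return False
```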

Retries: When and How to Use Them?

Retry mechanisms ensure that when a transient error occurs, an operation is retried a limited number of times, with a delay between attempts. This is vital, especially for remote service calls, database connections, or network operations. However, indiscriminately adding retries everywhere can lead to undesirable side effects in the system.

My most common retry strategy is "exponential backoff with jitter." In this method, the waiting time between failed attempts grows exponentially, and a random delay (jitter) is added on top. Jitter prevents multiple operations that fail simultaneously from retrying at the exact same moment, thereby avoiding sudden load spikes on the target service.

```python
import time
import random
import requests

def retry_with_backoff(func, retries=5, delay=1, backoff_factor=2, jitter=0.1):
    """
    Retries a given function with exponential backoff and jitter.
    """
    for i in range(retries):
        try:
            return func()
        except requests.exceptions.RequestException as e:
            print(f"Attempt {i+1} failed: {e}")
            if i < retries - 1:
                sleep_time = delay * (backoff_factor ** i)
                sleep_time = sleep_time * (1 + random.uniform(-jitter, jitter))
                print(f"Waiting {sleep_time:.2f} seconds...")
                time.sleep(sleep_time)
            else:
                print("All attempts failed.")
                raise

def call_ai_service():
    """
    An example AI service call. Fails occasionally.
    """
    response = requests.get("http://ai-service-endpoint.com/predict", timeout=2)
    response.raise_for_status()  # Raises an exception for HTTP 4xx/5xx errors
    return response.json()

# Usage example
# try:
#     result = retry_with_backoff(call_ai_service, retries=3, delay=0.5)
#     print(f"Successful result: {result}")
# except Exception as e:
#     print(f"Operation ultimately failed: {e}")
```

In the Android application of my side product, I generally use 3 retries with exponential backoff up to 2 seconds for calls to backend services. This proves quite effective for momentary fluctuations in mobile network conditions and significantly improves the user experience. However, there's a trade-off to consider here: retries can extend the total processing time and add extra load to the target service. Therefore, correctly setting the number of retries and wait times is critical. Too many retries can lead to resource waste and longer delays, while too few can make the system fragile.
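In that setup the backoff is capped, which the helper above doesn't do; here is a minimal sketch of the delay calculation with a cap (the `capped_backoff_delay` name, parameters, and values are illustrative):

```python
import random

def capped_backoff_delay(attempt, base_delay=0.5, backoff_factor=2,
                         max_delay=2.0, jitter=0.1):
    """Compute the wait before the next attempt, never exceeding max_delay."""
    delay = min(base_delay * (backoff_factor ** attempt), max_delay)
    return delay * (1 + random.uniform(-jitter, jitter))
```

Capping matters on mobile: without it, a user could be left staring at a spinner for tens of seconds while the backoff keeps doubling.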

Idempotency: Why It's Critical and How to Ensure It?

One of the most important issues not to be overlooked when implementing retry mechanisms is idempotency. An operation is idempotent if, when performed multiple times, the system's state remains the same as it was after the first execution. This is vital in distributed systems and especially where retry mechanisms are used. Otherwise, the same operation could be processed multiple times, leading to data inconsistencies or undesirable side effects.

For example, consider an AI service that updates the status of an order in a production ERP. If a network error occurs during this update and the retry mechanism is activated, the same update request might reach the server multiple times. If idempotency is not ensured, the order status could be updated twice or other side effects might occur.

⚠️ Retry Risks Without Idempotency

Retrying a non-idempotent operation can lead to unexpected data corruption, duplicate records, or serious issues like double payments in financial systems. Therefore, when implementing retries, it is imperative to ensure that the related operations are idempotent or to make them so.

There are several ways to ensure idempotency:

  1. Unique Request IDs (Idempotency Keys): A unique ID (like a UUID) is added to each request. The server uses this ID to check if the same request has been processed before. If it has, it returns the same result and does not re-process the operation.
  2. Conditional Updates: When performing database updates, only execute the operation if a specific condition (e.g., a version field) is met.
  3. UPSERT (INSERT ON CONFLICT): At the database level, insert a record if it doesn't exist, otherwise update it. In PostgreSQL, the INSERT ... ON CONFLICT (id) DO UPDATE ... construct provides this.
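To make options 2 and 3 concrete, here is a minimal sketch using SQLite's `INSERT ... ON CONFLICT` (the same construct mentioned above for PostgreSQL; requires SQLite ≥ 3.24). The table and values are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT, version INTEGER)")
conn.execute("INSERT INTO orders VALUES ('order-123', 'PENDING', 1)")

# UPSERT: safe to run any number of times -- the end state is identical.
conn.execute(
    "INSERT INTO orders (id, status, version) VALUES (?, ?, ?) "
    "ON CONFLICT (id) DO UPDATE SET status = excluded.status",
    ("order-123", "SHIPPED", 1),
)

# Conditional update: only applies while the expected version is current,
# so a duplicate or stale retry becomes a no-op instead of a second write.
cur = conn.execute(
    "UPDATE orders SET status = 'SHIPPED', version = version + 1 "
    "WHERE id = ? AND version = ?",
    ("order-123", 1),
)
print("rows affected:", cur.rowcount)  # 0 when the same attempt is replayed
```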

In a bank's internal platform, I always used a transaction_id or request_uuid when keeping transaction records. This ensured that even after a network outage, when we re-sent the same operation, the system processed it only once. This is an indispensable approach, especially in critical areas like financial transactions. I also use a similar X-Request-ID header for requests coming to my financial calculators.

```python
# FastAPI example
from fastapi import FastAPI, Header, HTTPException
from typing import Optional
from uuid import uuid4

app = FastAPI()

processed_requests = {}  # In production, use a database or Redis instead

@app.post("/process_ai_task")
async def process_ai_task(
    data: dict,
    x_idempotency_key: Optional[str] = Header(None)
):
    if not x_idempotency_key:
        raise HTTPException(status_code=400, detail="X-Idempotency-Key is required")

    if x_idempotency_key in processed_requests:
        # Already processed; return the same result without re-running the work
        print(f"Idempotency key {x_idempotency_key} already processed, returning the same result.")
        return processed_requests[x_idempotency_key]

    # Perform the operation
    print(f"Starting a new operation with idempotency key {x_idempotency_key}.")
    result = {"status": "processed", "data": data, "processed_id": str(uuid4())}
    processed_requests[x_idempotency_key] = result

    # In a real application, persist this once the operation completes,
    # e.g., save to the database and record the idempotency_key in a committed_transactions table

    return result

# Usage example (with curl)
# curl -X POST "http://localhost:8000/process_ai_task" \
#      -H "X-Idempotency-Key: some-unique-key-123" \
#      -H "Content-Type: application/json" \
#      -d '{"prompt": "Generate a summary"}'
```

In this example, the processed_requests dictionary acts as a simple in-memory store. In a real production environment, this should be managed with a fast key-value store like Redis or a dedicated table in a database. The key's lifecycle and cleanup mechanisms should also be considered.
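For the Redis variant, one minimal approach is to claim the key atomically with `SET ... NX EX`, so that two concurrent retries cannot both start the work and the TTL doubles as the cleanup mechanism; the key prefix and TTL here are assumptions:

```python
import redis

r = redis.Redis(host="localhost", port=6379)

def try_claim(idempotency_key: str, ttl_seconds: int = 86400) -> bool:
    """Atomically claim an idempotency key; False means a duplicate request."""
    # SET with nx=True (only if absent) and ex=ttl (expiry) is a single
    # atomic Redis operation, so there is no check-then-set race window.
    return bool(r.set(f"idem:{idempotency_key}", "in_progress",
                      nx=True, ex=ttl_seconds))
```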

Designing Retry and Idempotency Together

Retry and idempotency are two powerful mechanisms that complement each other. Retries increase the chances of an operation completing by overcoming transient errors, while idempotency ensures these retries do not create unintended duplicates in the system. Designing this duo correctly greatly increases the reliability of distributed AI pipelines.

In a client project, we needed to notify an AI model when an order's shipment status changed in the data stream coming from a production ERP. If a network outage occurred during this notification and a retry was performed, without idempotency, the model could attempt to process the same order's shipment status twice. This could lead to incorrect production planning or unnecessary resource allocation. For such scenarios, I typically use the "transaction outbox pattern" or message queues (Kafka, RabbitMQ) along with unique message IDs.

💡 Transaction Outbox Pattern

The Transaction Outbox Pattern ensures that a database change (e.g., updating a record) and the event it triggers are recorded atomically in the same transaction. This guarantees that if the record is updated, the corresponding message will eventually be sent at least once; combined with unique message IDs for consumer-side deduplication, it pairs naturally with retry mechanisms.

In this pattern, when an operation is performed in the database, an event is also written to an "outbox" table within the same transaction. Once the transaction is successful, a separate service or process listens to this outbox table and sends the events to a message queue. If an error occurs and the operation is retried, duplicate sending of the same event is prevented thanks to the unique event IDs (acting as idempotency keys) in the outbox table. This approach is frequently used, especially in event-sourcing architectures or inter-microservice communication.

```sql
-- Example outbox table schema (PostgreSQL)
CREATE TABLE outbox_messages (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    aggregate_type VARCHAR(255) NOT NULL,
    aggregate_id VARCHAR(255) NOT NULL,
    event_type VARCHAR(255) NOT NULL,
    payload JSONB NOT NULL,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    processed_at TIMESTAMPTZ,
    status VARCHAR(50) DEFAULT 'PENDING' -- PENDING, SENT, FAILED
);

-- Writing a business change and its outbox record atomically (pseudocode)
-- BEGIN TRANSACTION;
--     UPDATE orders SET status = 'SHIPPED' WHERE id = 'order-123';
--     INSERT INTO outbox_messages (aggregate_type, aggregate_id, event_type, payload)
--     VALUES ('Order', 'order-123', 'OrderShipped', '{"order_id": "order-123", "new_status": "SHIPPED"}');
-- COMMIT TRANSACTION;
```
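A minimal relay sketch over this schema might look like the following; the polling loop, the `publish` placeholder, and `FOR UPDATE SKIP LOCKED` (which lets several relay workers run safely) are assumptions, not a production-ready dispatcher:

```python
import psycopg2  # assumption: PostgreSQL accessed via psycopg2

def publish(message_id, event_type, payload):
    """Placeholder: push to Kafka/RabbitMQ, using message_id as the dedup key."""
    print(f"publishing {event_type} ({message_id})")

def relay_outbox(conn, batch_size=100):
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, event_type, payload FROM outbox_messages "
            "WHERE status = 'PENDING' ORDER BY created_at "
            "LIMIT %s FOR UPDATE SKIP LOCKED",
            (batch_size,),
        )
        for msg_id, event_type, payload in cur.fetchall():
            publish(str(msg_id), event_type, payload)
            cur.execute(
                "UPDATE outbox_messages "
                "SET status = 'SENT', processed_at = NOW() WHERE id = %s",
                (msg_id,),
            )
    conn.commit()

# Run periodically, e.g.:
# conn = psycopg2.connect("dbname=erp")
# while True: relay_outbox(conn); time.sleep(1)
```

If the relay crashes after publishing but before marking a row SENT, the message goes out twice on the next pass; this is exactly why consumer-side deduplication on the message ID matters.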

This structure significantly mitigates transaction consistency issues in distributed systems. In my experience, in systems where I have implemented this pattern, the consistency of data sent to AI pipelines and the repeatability of operations have significantly increased. If you'd like to learn more about transaction management in distributed systems, you can check out an article I wrote on this topic previously.

Monitoring and Alerting Mechanisms: Detecting Failures

No matter how robust retry and idempotency mechanisms are, monitoring errors and their effects in the system is indispensable. Metrics such as pipeline performance, error rates, number of retry attempts, and success rate of retries provide valuable insights into the system's health. Without observability, it's difficult to understand how effectively these mechanisms are working or where improvements are needed.

For me, monitoring involves detecting not only errors but also deviations from the system's normal operating conditions. Some key monitoring metrics I use include:

  • Error Rate: The percentage of failed operations within the total number of operations.
  • Retry Count: How many times an operation has been retried. High retry counts can indicate a transient issue that keeps recurring in a dependency.
  • Latency: The time it takes for operations to complete, especially the 95th or 99th percentile values. Since retries increase latency, it's necessary to check if this is within expected limits.
  • Successful Retry Rate: The percentage of operations that succeed as a result of retries. This metric indicates how effectively the retry mechanism is working.
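To collect these, I expose counters from the pipeline process itself. Here is a minimal sketch with the `prometheus_client` library, reusing the metric names that appear in the alert rules below (the label and the wrapper function are assumptions):

```python
from prometheus_client import Counter, start_http_server

REQUESTS = Counter("ai_pipeline_requests_total", "Total requests", ["pipeline_name"])
ERRORS = Counter("ai_pipeline_errors_total", "Requests that ultimately failed", ["pipeline_name"])
RETRIES = Counter("ai_pipeline_retries_total", "Retry attempts", ["pipeline_name"])

def run_with_metrics(pipeline_name, func, max_retries=3):
    REQUESTS.labels(pipeline_name).inc()
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception:
            if attempt == max_retries:
                ERRORS.labels(pipeline_name).inc()  # out of attempts
                raise
            RETRIES.labels(pipeline_name).inc()  # about to retry

# start_http_server(8000)  # expose /metrics for Prometheus to scrape
```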

Last year, in a client project, we set up metrics to trigger automatic alerts when the 95th percentile latency of the AI prediction service exceeded 500ms. This usually indicated that the underlying model was experiencing resource constraints or an external dependency was slowing down. When such metrics are collected with tools like Prometheus and visualized with dashboards like Grafana, I can quickly spot anomalies.

```yaml
# Example alert rules for Prometheus
groups:
- name: ai_pipeline_alerts
  rules:
  - alert: AIPipelineHighErrorRate
    expr: sum(rate(ai_pipeline_errors_total[5m])) by (pipeline_name) / sum(rate(ai_pipeline_requests_total[5m])) by (pipeline_name) > 0.05
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "AI pipeline {{ $labels.pipeline_name }} error rate is too high"
      description: "The error rate of the {{ $labels.pipeline_name }} pipeline exceeded 5% in the last 5 minutes. Retries may not be keeping up."

  - alert: AIPipelineExcessiveRetries
    expr: sum(rate(ai_pipeline_retries_total[5m])) by (pipeline_name) / sum(rate(ai_pipeline_requests_total[5m])) by (pipeline_name) > 0.10
    for: 10m
    labels:
      severity: info
    annotations:
      summary: "AI pipeline {{ $labels.pipeline_name }} is retrying excessively"
      description: "More than 10% of the {{ $labels.pipeline_name }} pipeline's requests were retried in the last 10 minutes. There may be an underlying issue."
```

On the logging side, I meticulously record every retry attempt, why it failed, and whether an idempotency key was used. Collecting these logs in a central location using journald or a similar log management system and being able to search by keywords significantly speeds up troubleshooting. On my own VPS, while monitoring Redis's OOM evictions, I set up alarms that caught OOM-kill messages and cgroup memory.high breaches in the journald records. This allowed me to understand why the system was becoming unstable.
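As a minimal sketch of the structured retry log line I have in mind (the field names are illustrative, not a fixed schema):

```python
import json
import logging
from typing import Optional

logger = logging.getLogger("ai_pipeline")

def log_retry(attempt: int, error: Exception, idempotency_key: Optional[str]):
    # One JSON object per line keeps the log greppable in journald
    # or any centralized log aggregator.
    logger.warning(json.dumps({
        "event": "retry_attempt",
        "attempt": attempt,
        "error": repr(error),
        "idempotency_key": idempotency_key,
    }))
```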

Lessons from Real-World Scenarios and Trade-offs

One of the most important lessons I've learned when implementing retry and idempotency in AI pipelines is the need to adopt a "good enough" and pragmatic approach, rather than always seeking the "perfect" solution. Every system has its unique constraints and priorities.

In an AI-powered fraud detection system I developed for a bank's internal platform, we knew that every second and every operation was critical. Therefore, instead of making retry strategies very aggressive, we focused more on idempotency and rapid error notification, because a delay in processing could mean missing a potential fraud event. The trade-off here was determining an acceptable latency for high reliability.

🔥 Avoiding Over-engineering

Trying to apply the most complex solution to every problem often means unnecessary costs, increased complexity, and more potential for errors. In my experience, simple and understandable solutions are generally more long-lived and manageable.

In my smaller projects or the backend of my side products, sometimes simply triggering failed AI tasks again with a cronjob can be more practical than building a complex event-sourcing architecture. For example, in a text summarization service of mine, a simple mechanism that automatically retries a request an hour later if it fails is sufficient. The trade-off here was sacrificing a bit of immediate consistency to focus on development speed.
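As a sketch of that cronjob approach (the table name, schema, and one-hour window are assumptions from my setup, not a general recipe):

```python
import sqlite3
import time

# Run hourly from cron. Re-queues summarization tasks that failed more than
# an hour ago; the normal task runner then picks them up and retries.
conn = sqlite3.connect("tasks.db")
cutoff = time.time() - 3600
conn.execute(
    "UPDATE tasks SET status = 'PENDING' "
    "WHERE status = 'FAILED' AND failed_at < ?",
    (cutoff,),
)
conn.commit()
```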

In conclusion, when designing retry and idempotency mechanisms, it's important to ask the following questions:

  • Is this error transient or permanent?
  • How much delay will retrying cause?
  • Is the operation idempotent? If not, how can I make it idempotent?
  • Is the complexity of these mechanisms worth the benefits they provide?

The answers to these questions will vary for each project and context. However, by keeping these principles in mind, I can generally build more robust and sustainable AI pipelines.

Conclusion

Error management in AI pipelines is a fundamental requirement for the reliability and continuity of systems. Retry and idempotency are two of the most powerful tools I use in this process. While I overcome transient errors with retry mechanisms, I prevent data inconsistencies by ensuring the repeatability of operations with idempotency.

When designing these mechanisms, I always consider real-world scenarios, the criticality level of the system, and available resources. Continuously monitoring the system's health with observability tools and setting up alarms for anomalies enhances the effectiveness of these strategies. Managing complexity and making the right trade-offs have been among the most valuable lessons I've learned in my 20 years of experience in this field.

I hope this guide helps you develop error management strategies for your own AI pipelines. My next post will cover AI model versioning strategies and rollback processes.
