DevUnionX

Graceful API Failure 101 for Data Scientists: A Modern Approach to Robust Error Handling

In modern data workflows, APIs are everywhere — powering everything from model inference to data extraction. Yet, handling API failures gracefully is often neglected by data scientists, who tend to focus on analysis and modeling while treating fault tolerance as a “software engineering problem.”

However, failure handling is not just an engineering nicety — it’s what makes a data pipeline resilient, automated, and production-ready.
In this article, we’ll explore how to design clean, reusable error-handling patterns for APIs, using the Google Gemini API as a practical example.

Why API Failure Handling Matters

When your data pipeline processes hundreds or thousands of files via APIs, a single timeout or upload error can halt the entire process.
API failure handling is uniquely challenging because:

You depend on external systems — their uptime, latency, and error messages are outside your control.

Failures aren’t binary — some require retries, others should be skipped or gracefully degraded.

Without a structured strategy, you risk brittle pipelines that break unpredictably.

A Practical Example: Google Gemini API for PDF Extraction

Imagine you’re using Google Gemini to extract data from PDF files:

from google.genai import types

class GeminiClient:
    # self._client (a genai.Client) and self._model are set up in __init__, omitted here.
    def generate(self, attached_files, prompt, max_tokens=None, system_prompt=None):
        response = self._client.models.generate_content(
            model=self._model.value,
            contents=attached_files + [prompt],
            config=types.GenerateContentConfig(
                max_output_tokens=max_tokens,
                system_instruction=system_prompt or None,
            ),
        )
        return response

When processing many PDFs, you might hit:

Timeouts due to server overload

Upload failures from large files

Context limit errors (too much data for the model)

Without proper handling, one error can stop the entire pipeline. Let’s fix that.

Building a Robust Failure-Handling Strategy

A reliable API call mechanism should:

Identify error types — transient vs. fatal.

Retry transient errors with backoff delays.

Gracefully skip oversized inputs.

Log everything for observability.

Ensure idempotency — retries shouldn’t cause duplicate actions.

Instead of a tangled try-except jungle, we’ll use Python decorators for clean and modular control.

Setting Up Logging

Before anything else, let’s add structured logging:

import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
    handlers=[logging.StreamHandler()]
)

logger = logging.getLogger(__name__)

This ensures retries, skips, and failures are all tracked with timestamps — critical for debugging production pipelines.

Retry with Backoff Decorator

This decorator retries failed API calls after configurable delays. It logs every attempt, makes one final attempt after the last backoff has elapsed, and re-raises immediately once that final attempt fails.

import functools
import time
from typing import Callable, List

def retry_with_backoff(backoffs: List[int], when: Callable[[Exception], bool]):
    """
    Retry a method with increasing backoff intervals.
    Makes len(backoffs) + 1 attempts in total; the final attempt
    runs after the last backoff has elapsed.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(self, *args, **kwargs):
            for i, backoff in enumerate(backoffs):
                try:
                    return func(self, *args, **kwargs)
                except Exception as e:
                    if not when(e):
                        raise
                    logger.warning(f"[Retry {i + 1}/{len(backoffs)}] {e}. Retrying in {backoff}s...")
                    time.sleep(backoff)

            # Final attempt after all backoffs are exhausted
            try:
                return func(self, *args, **kwargs)
            except Exception as e:
                logger.error(f"Final retry failed: {e}")
                raise
        return wrapper
    return decorator

This pattern helps recover from temporary network or server issues without manual intervention.
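
To see the flow end to end, here is a minimal sketch with a toy client (FlakyClient and its predicate are purely illustrative) that times out twice and then succeeds on the final attempt:

class FlakyClient:
    """Toy client: fails twice with a timeout, then succeeds."""
    def __init__(self):
        self._calls = 0

    @retry_with_backoff([1, 2], when=lambda e: isinstance(e, TimeoutError))
    def fetch(self):
        self._calls += 1
        if self._calls < 3:
            raise TimeoutError("server overloaded")
        return "payload"

client = FlakyClient()
print(client.fetch())  # logs two retry warnings, then prints "payload"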

Skip Silently Decorator

If a file exceeds the model’s context limit or cannot be processed, we want to skip it gracefully — not crash the pipeline.

Instead of returning strings, we’ll raise a custom exception for clarity and logging.

class SkippedFileError(Exception):
    """Raised when a file is too large or unprocessable."""
    pass

def skip_silently(when: Callable[[Exception], bool]):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception as e:
                if not when(e):
                    raise
                logger.warning(f"Skipping file due to size/context issue: {e}")
                raise SkippedFileError("File skipped due to size constraints.") from e
        return wrapper
    return decorator
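The payoff comes in the calling loop: the pipeline catches the typed exception and keeps going. A minimal sketch, assuming the generate() signature from the earlier snippet (pdf_paths and the prompt are illustrative):

results, skipped = {}, []

for path in pdf_paths:
    try:
        results[path] = client.generate(
            attached_files=[path],
            prompt="Extract the key fields from this PDF.",
            max_tokens=1024,
        )
    except SkippedFileError:
        skipped.append(path)  # record it and move on; the pipeline survives

logger.info(f"Processed {len(results)} files, skipped {len(skipped)}.")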

Applying the Decorators

Now, we’ll apply both decorators to the Gemini client. The two predicates, _is_retryable and _is_file_size_exceeded, are defined under “Safer Exception Filtering” below.

class GeminiClient:
    @retry_with_backoff([30, 60], when=_is_retryable)
    @skip_silently(when=_is_file_size_exceeded)
    def generate(self, attached_files, prompt, max_tokens=None, system_prompt=None):
        response = self._client.models.generate_content(
            model=self._model.value,
            contents=attached_files + [prompt],
            config=types.GenerateContentConfig(
                max_output_tokens=max_tokens,
                system_instruction=system_prompt or None,
            ),
        )
        return response

Decorator Order Matters

Python applies decorators from the bottom up.
That means in this case:

@retry_with_backoff(...)
@skip_silently(...)
def generate(...):

skip_silently wraps generate() directly.

retry_with_backoff wraps that entire wrapper.

At call time the retry wrapper is entered first, but every attempt passes through the skip wrapper. Oversized inputs are therefore converted to SkippedFileError on the very first attempt, and as long as _is_retryable returns False for that exception, they are never retried. If you reverse the order, the retry logic runs inside the skip logic, and an oversized file whose error also looks retryable would be retried pointlessly before it is finally skipped.
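
In other words, the stacked @ syntax is just nested function application:

# Equivalent to stacking the two decorators on generate():
generate = retry_with_backoff([30, 60], when=_is_retryable)(
    skip_silently(when=_is_file_size_exceeded)(generate)
)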

Safer Exception Filtering

Avoid fragile string matching when detecting errors. Instead, use structured exception attributes where possible.

def _is_file_size_exceeded(e: Exception):
    if hasattr(e, "code") and e.code == 413:  # HTTP 413 Payload Too Large
        return True
    if hasattr(e, "message") and "context window" in e.message.lower():
        return True
    return False
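The _is_retryable predicate referenced earlier follows the same principle. Here is one plausible version; the exact attributes depend on your client library, so treat the status codes and checks as a starting point rather than a definitive list:

RETRYABLE_STATUS_CODES = {429, 500, 502, 503, 504}  # rate limits and transient server errors

def _is_retryable(e: Exception) -> bool:
    # Never retry a file we deliberately skipped.
    if isinstance(e, SkippedFileError):
        return False
    if hasattr(e, "code") and e.code in RETRYABLE_STATUS_CODES:
        return True
    # Network-level failures are usually worth a retry too.
    return isinstance(e, (TimeoutError, ConnectionError))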

Idempotency: The Hidden Gotcha

Retries are only safe if your operation is idempotent — meaning running it multiple times yields the same result.

For example:

✅ Safe: Extracting text from a file

❌ Unsafe: Uploading a record or charging a customer

If your API is not idempotent, add unique request tokens or deduplication logic to prevent duplicates.
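
One lightweight approach is to derive a deterministic key for each request and short-circuit when a result already exists. A sketch with a local file cache, assuming the response exposes a .text attribute as the Gemini SDK's does (the hashing scheme and cache layout are just one option):

import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".cache")
CACHE_DIR.mkdir(exist_ok=True)

def idempotent_extract(client, path: str, prompt: str) -> str:
    # Same file bytes + same prompt -> same key, so retries and re-runs reuse the result.
    key = hashlib.sha256(Path(path).read_bytes() + prompt.encode()).hexdigest()
    cache_file = CACHE_DIR / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())

    response = client.generate(attached_files=[path], prompt=prompt, max_tokens=1024)
    cache_file.write_text(json.dumps(response.text))
    return response.text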

Going Further: Production-Ready Enhancements

For large-scale, fault-tolerant systems, you can extend this pattern with:

tenacity for configurable retries, jitter, and custom stop conditions (see the sketch after this list).

Circuit breaker pattern using pybreaker to prevent hammering failing APIs.

Monitoring integration (Prometheus, OpenTelemetry) for retry metrics.

Async support with asyncio and async decorators for high concurrency.
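
For instance, the hand-rolled retry decorator maps almost directly onto tenacity. A sketch reusing the _is_retryable predicate from earlier:

from tenacity import retry, retry_if_exception, stop_after_attempt, wait_random_exponential

class GeminiClient:
    @retry(
        retry=retry_if_exception(_is_retryable),             # same predicate as before
        wait=wait_random_exponential(multiplier=1, max=60),  # exponential backoff with jitter
        stop=stop_after_attempt(5),
        reraise=True,  # surface the original exception instead of tenacity's RetryError
    )
    def generate(self, attached_files, prompt, max_tokens=None, system_prompt=None):
        ...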

Final Thoughts

Graceful failure handling isn’t just about preventing crashes — it’s about designing systems that expect failure and recover automatically.

With just a few well-structured decorators and logging hooks, you can transform brittle scripts into resilient, production-grade pipelines.

Key takeaway:

Don’t treat API errors as surprises — treat them as part of the system design.
