In modern data workflows, APIs are everywhere — powering everything from model inference to data extraction. Yet, handling API failures gracefully is often neglected by data scientists, who tend to focus on analysis and modeling while treating fault tolerance as a “software engineering problem.”
However, failure handling is not just an engineering nicety — it’s what makes a data pipeline resilient, automated, and production-ready.
In this article, we’ll explore how to design clean, reusable error-handling patterns for APIs, using the Google Gemini API as a practical example.
Why API Failure Handling Matters
When your data pipeline processes hundreds or thousands of files via APIs, a single timeout or upload error can halt the entire process.
API failure handling is uniquely challenging because:
You depend on external systems — their uptime, latency, and error messages are outside your control.
Failures aren’t binary — some require retries, others should be skipped or gracefully degraded.
Without a structured strategy, you risk brittle pipelines that break unpredictably.
A Practical Example: Google Gemini API for PDF Extraction
Imagine you’re using Google Gemini to extract data from PDF files:
class GeminiClient:
    def generate(self, ...):
        response = self._client.models.generate_content(
            model=self._model.value,
            contents=attached_files + [prompt],
            config=types.GenerateContentConfig(
                max_output_tokens=max_tokens,
                system_instruction=system_prompt or None,
            ),
        )
        return response
When processing many PDFs, you might hit:
Timeouts due to server overload
Upload failures from large files
Context limit errors (too much data for the model)
Without proper handling, one error can stop the entire pipeline. Let’s fix that.
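To make the failure mode concrete, here is roughly what the fragile baseline looks like: a bare loop where the first unhandled exception aborts the whole batch. The client construction and the generate(...) keyword arguments are illustrative assumptions, since the real signature is elided above.

# Fragile baseline (illustrative): the first timeout or oversized file aborts everything.
client = GeminiClient()  # assumes the client class shown above
results = []
for pdf in ["report_01.pdf", "report_02.pdf", "report_03.pdf"]:
    results.append(client.generate(prompt="Extract key figures.", attached_files=[pdf]))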
Building a Robust Failure-Handling Strategy
A reliable API call mechanism should:
Identify error types — transient vs. fatal.
Retry transient errors with backoff delays.
Gracefully skip oversized inputs.
Log everything for observability.
Ensure idempotency — retries shouldn’t cause duplicate actions.
Instead of a tangled try-except jungle, we’ll use Python decorators for clean and modular control.
Setting Up Logging
Before anything else, let’s add structured logging:
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
    handlers=[logging.StreamHandler()],
)
logger = logging.getLogger(__name__)
This ensures retries, skips, and failures are all tracked with timestamps — critical for debugging production pipelines.
Retry with Backoff Decorator
This decorator retries failed API calls after configurable delays: it logs every attempt, sleeps for the corresponding backoff between attempts, and makes one final attempt once all backoff intervals are exhausted.
import functools
import time
from typing import Callable, List


def retry_with_backoff(backoffs: List[int], when: Callable[[Exception], bool]):
    """
    Retry a function with increasing backoff intervals.
    Performs a final attempt after all backoffs are exhausted.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(self, *args, **kwargs):
            last_exception = None
            for i, backoff in enumerate(backoffs):
                try:
                    return func(self, *args, **kwargs)
                except Exception as e:
                    if not when(e):
                        raise  # non-retryable errors propagate immediately
                    last_exception = e
                    logger.warning(f"[Retry {i + 1}/{len(backoffs)}] {e}. Retrying in {backoff}s...")
                    time.sleep(backoff)
            # Final attempt after all backoff delays have been used
            try:
                return func(self, *args, **kwargs)
            except Exception as e:
                logger.error(f"Final retry failed: {e}")
                raise last_exception or e
        return wrapper
    return decorator
This pattern helps recover from temporary network or server issues without manual intervention.
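As a quick standalone check, the decorator can be exercised outside the Gemini client. The demo class and predicate below are purely illustrative; note that the wrapper expects a method, since it forwards self as the first argument.

# Illustrative self-test: FlakyService simulates two transient failures, then succeeds.
def _is_transient(e: Exception) -> bool:
    return isinstance(e, TimeoutError)

class FlakyService:
    def __init__(self):
        self.calls = 0

    @retry_with_backoff([1, 2], when=_is_transient)
    def fetch(self):
        self.calls += 1
        if self.calls < 3:
            raise TimeoutError("simulated transient failure")
        return "ok"

print(FlakyService().fetch())  # succeeds on the third attempt, after 1s + 2s of backoff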
Skip Silently Decorator
If a file exceeds the model’s context limit or cannot be processed, we want to skip it gracefully — not crash the pipeline.
Instead of returning strings, we’ll raise a custom exception for clarity and logging.
class SkippedFileError(Exception):
    """Raised when a file is too large or unprocessable."""
    pass


def skip_silently(when: Callable[[Exception], bool]):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception as e:
                if not when(e):
                    raise  # unrelated errors propagate to the retry layer
                logger.warning(f"Skipping file due to size/context issue: {e}")
                raise SkippedFileError("File skipped due to size constraints.") from e
        return wrapper
    return decorator
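Because the decorator raises rather than returning a sentinel value, the caller decides what "skip" means. A minimal sketch of the call site, assuming a client instance and keyword arguments that are illustrative only:

# Illustrative call site; the generate(...) keyword arguments are assumptions.
try:
    text = client.generate(prompt="Extract key figures.", attached_files=[pdf_ref])
except SkippedFileError as e:
    logger.info(f"Skipped without failing the pipeline: {e}")
    text = None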
Applying the Decorators
Now, we’ll apply both decorators to the Gemini client.
class GeminiClient:
    @retry_with_backoff([30, 60], when=_is_retryable)
    @skip_silently(when=_is_file_size_exceeded)
    def generate(self, ...):
        response = self._client.models.generate_content(
            model=self._model.value,
            contents=attached_files + [prompt],
            config=types.GenerateContentConfig(
                max_output_tokens=max_tokens,
                system_instruction=system_prompt or None,
            ),
        )
        return response
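Putting both layers together, a batch run can now survive individual failures. The file list, prompt, and generate(...) arguments below are illustrative assumptions, not part of the original client.

# Illustrative batch loop: transient errors are retried inside generate(),
# oversized files are skipped, and anything else is logged per file.
results = {}
for pdf in ["report_01.pdf", "report_02.pdf", "report_03.pdf"]:
    try:
        results[pdf] = client.generate(prompt="Extract key figures.", attached_files=[pdf])
    except SkippedFileError:
        logger.info(f"{pdf} skipped: too large for the model's context window.")
    except Exception as e:
        logger.error(f"{pdf} failed after all retries: {e}")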
Decorator Order Matters
Python applies decorators from the bottom up.
That means in this case:
@retry_with_backoff(...)
@skip_silently(...)
def generate(...):
skip_silently wraps generate()
retry_with_backoff wraps that entire wrapper
So when generate() raises, the skip logic evaluates the exception first: oversized inputs are converted to SkippedFileError before the retry logic ever sees them, and only the errors that pass through reach retry_with_backoff, which then decides whether to retry.
If you reverse them, skipped errors could get retried unnecessarily.
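Because the bottom decorator is applied first, the stacking above is equivalent to this manual composition, which makes the order explicit:

# Equivalent manual composition: skip_silently is applied first (innermost),
# then retry_with_backoff wraps the result (outermost).
generate = retry_with_backoff([30, 60], when=_is_retryable)(
    skip_silently(when=_is_file_size_exceeded)(generate)
)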
Safer Exception Filtering
Avoid fragile string matching when detecting errors. Instead, use structured exception attributes where possible.
def _is_file_size_exceeded(e: Exception) -> bool:
    if hasattr(e, "code") and e.code == 413:  # HTTP 413 Payload Too Large
        return True
    if hasattr(e, "message") and "context window" in e.message.lower():
        return True
    return False
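The client above also references _is_retryable, which is not shown. Here is a minimal sketch under the assumption that the SDK exposes HTTP-style status codes on its exceptions; the exact attributes depend on the library you use.

# Assumed predicate for transient failures; attribute names depend on the actual SDK exceptions.
def _is_retryable(e: Exception) -> bool:
    transient_codes = {408, 429, 500, 502, 503, 504}  # timeouts, rate limits, server errors
    if isinstance(e, SkippedFileError):
        return False  # already handled by skip_silently
    if hasattr(e, "code") and e.code in transient_codes:
        return True
    return isinstance(e, (TimeoutError, ConnectionError))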
Idempotency: The Hidden Gotcha
Retries are only safe if your operation is idempotent — meaning running it multiple times yields the same result.
For example:
✅ Safe: Extracting text from a file
❌ Unsafe: Uploading a record or charging a customer
If your API is not idempotent, add unique request tokens or deduplication logic to prevent duplicates.
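One lightweight way to get that safety is a deduplication cache keyed by a request token. The sketch below is in-memory and purely illustrative; a real pipeline would persist the cache (for example in a database or object store).

import hashlib

# Illustrative in-memory deduplication; a production pipeline would persist this cache.
_completed = {}

def run_once(request_token: str, action, *args, **kwargs):
    """Execute `action` only if this request token has not already succeeded."""
    if request_token in _completed:
        return _completed[request_token]
    result = action(*args, **kwargs)
    _completed[request_token] = result
    return result

# Derive a stable token from the inputs so a retried job reuses the cached result.
pdf_path, prompt = "reports/q3.pdf", "Extract the invoice total."
token = hashlib.sha256(f"{pdf_path}:{prompt}".encode()).hexdigest()
result = run_once(token, lambda: f"extracted text from {pdf_path}")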
Going Further: Production-Ready Enhancements
For large-scale, fault-tolerant systems, you can extend this pattern with:
tenacity for configurable retries, jitter, and custom stop conditions (a minimal sketch follows this list).
Circuit breaker pattern using pybreaker to prevent hammering failing APIs.
Monitoring integration (Prometheus, OpenTelemetry) for retry metrics.
Async support with asyncio and async decorators for high concurrency.
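For example, the hand-rolled retry decorator could be swapped for tenacity's primitives. This is a minimal sketch, assuming the same _is_retryable predicate from above and illustrative generate(...) arguments, not a drop-in replica of the custom version.

from tenacity import retry, retry_if_exception, stop_after_attempt, wait_exponential

# Roughly equivalent policy: up to 3 attempts, exponential backoff capped at 60s,
# retrying only when _is_retryable(e) returns True.
@retry(
    retry=retry_if_exception(_is_retryable),
    wait=wait_exponential(multiplier=1, max=60),
    stop=stop_after_attempt(3),
    reraise=True,
)
def generate_with_tenacity(client, prompt, attached_files):
    return client.generate(prompt=prompt, attached_files=attached_files)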
Final Thoughts
Graceful failure handling isn’t just about preventing crashes — it’s about designing systems that expect failure and recover automatically.
With just a few well-structured decorators and logging hooks, you can transform brittle scripts into resilient, production-grade pipelines.
Key takeaway:
Don’t treat API errors as surprises — treat them as part of the system design.