Introduction
When you're building applications that rely on third-party APIs, one of the certainties is that those APIs will, at some point, fail.
Network issues, transient server errors, or rate limiting can all lead to failed requests. A robust application needs to anticipate these failures and handle them gracefully.
In this tutorial, we'll walk through a real-world scenario I recently encountered in one of my Rails projects.
My app uses the `ruby-openai` gem to interact with the OpenAI API, and I noticed that the background job responsible for generating LLM responses was intermittently failing with a `Faraday::ServerError`.
We'll look at how I diagnosed the problem and used Rails' built-in features to make my background jobs more resilient.
The Problem: A Failing Background Job
The issue started with jobs landing in my "failed" queue. The error was always the same: `Faraday::ServerError: the server responded with status 500`.
Here's a snippet of the stack trace:
```
/usr/local/bundle/ruby/3.3.0/gems/faraday-2.13.1/lib/faraday/response/raise_error.rb:38:in `on_complete'
...
/rails/app/services/llm/assistant_response_service.rb:22:in `generate_response'
/rails/app/jobs/llm/assistant_response_job.rb:13:in `perform'
...
```
Here's the `generate_response` method responsible for making the API call to OpenAI:
```ruby
# app/services/llm/assistant_response_service.rb
class Llm::AssistantResponseService < Llm::BaseOpenAiService
  # ...

  def generate_response
    parameters = {
      model: DEFAULT_MODEL,
      input: @input_messages,
      tools: @tool_registry&.registered_tools || [],
      previous_response_id: chat.previous_response_id,
      text: {
        verbosity: "low"
      }
    }

    response = client.responses.create(parameters: parameters)

    handle_response(response)
  end

  # ...
end
```
And here's the background job that was calling it:
```ruby
# app/jobs/llm/assistant_response_job.rb
class Llm::AssistantResponseJob < ApplicationJob
  queue_as :default

  def perform(message_id)
    message = Message.includes(chat: :chatbot).find(message_id)
    chat = message.chat
    chatbot = chat.chatbot

    Llm::AssistantResponseService.new(
      input_message: message.content,
      chat: chat,
      chatbot: chatbot,
    ).generate_response
  end
end
```
This wasn't an error in my code, but an issue on OpenAI's end. However, my app wasn't handling it well. The job would try once, fail, and give up.
The problem was that there was no handling for `Faraday::ServerError`: the job simply failed and was moved to the dead-letter queue, requiring manual intervention to retry.
The Solution: Automatic Retries with Active Job
The best way to handle transient errors like a 500 status is to simply try again after a short delay. Fortunately, Rails makes this trivial with the `retry_on` feature.
Step 1: Add Retries to the Job
The first and most important change is to tell our job to retry when it encounters a `Faraday::ServerError`.
I modified `app/jobs/llm/assistant_response_job.rb` like this:
```ruby
# app/jobs/llm/assistant_response_job.rb
class Llm::AssistantResponseJob < ApplicationJob
  queue_as :default

  # **********************
  # ADD THIS NEXT LINE ⬇️
  # **********************
  retry_on Faraday::ServerError, wait: :polynomially_longer, attempts: 3

  def perform(message_id)
    message = Message.includes(chat: :chatbot).find(message_id)
    chat = message.chat
    chatbot = chat.chatbot

    Llm::AssistantResponseService.new(
      input_message: message.content,
      chat: chat,
      chatbot: chatbot,
    ).generate_response
  end
end
```
With this single line, the job will now:
- Catch any `Faraday::ServerError` that occurs during its execution.
- Automatically re-enqueue itself to be run again later.
- Wait for a polynomially increasing amount of time between retries (`:polynomially_longer`). This is a built-in backoff strategy for retries in Rails; it makes the wait time between retries grow using a formula based on the number of attempts so far: `wait_time = (executions ** 4) + random_jitter + 2`. For example:
  - **First retry:** about **3 seconds**
  - **Second retry:** about **18 seconds**
  - **Third retry:** about **83 seconds**
  - **Fourth retry:** much longer, and so on.

  The idea is to give the system more and more time to recover before trying again, instead of hammering the failing API at a fixed interval (the short sketch after this list makes the growth concrete).
- Attempt this up to 3 times before finally giving up and moving to the failed jobs queue.
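To make that growth concrete, here's a tiny self-contained sketch of the approximate wait times. It ignores the random jitter, so it illustrates the shape of the curve rather than Rails' exact numbers:

```ruby
# Approximate backoff used by `wait: :polynomially_longer`, ignoring jitter.
# `executions` counts how many times the job has already been attempted.
(1..3).each do |executions|
  wait_seconds = (executions**4) + 2
  puts "retry ##{executions}: ~#{wait_seconds}s"
end
# retry #1: ~3s
# retry #2: ~18s
# retry #3: ~83s
```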
This immediately makes our job much more robust.
Step 2: Improve Error Logging
While retrying is great, we still want to know when these errors are happening. To get that visibility, we rescue the error in the service, log it, and then re-raise it so that the job's `retry_on` handler can still catch it.
Here's the updated `generate_response` method in `app/services/llm/assistant_response_service.rb`:
```ruby
# app/services/llm/assistant_response_service.rb
class Llm::AssistantResponseService < Llm::BaseOpenAiService
  # ...

  def generate_response
    # ...
    response = client.responses.create(parameters: parameters)

    handle_response(response)
  rescue Faraday::ServerError => e # <-- Add this rescue block
    log_error(e, parameters)       # <-- Log the error
    raise e                        # <-- Re-raise the exception
  end

  private

  def log_error(error, parameters = {})
    # Log the error to a monitoring and error tracking service, e.g. Sentry
  end
end
```
The key here is `raise e`. If we just rescued the exception without re-raising it, the job would never know that an error occurred, and it wouldn't retry. By rescuing, logging, and re-raising, we get the best of both worlds: visibility into the errors and automatic retries.
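As for the `log_error` helper, its exact implementation depends on your monitoring stack. As a rough sketch, if you happen to use the `sentry-ruby` gem, it could look something like this (the `extra` payload is just an illustration, not part of the original service):

```ruby
# app/services/llm/assistant_response_service.rb
# A minimal log_error sketch, assuming the sentry-ruby gem is configured.
def log_error(error, parameters = {})
  Rails.logger.error("[Llm::AssistantResponseService] #{error.class}: #{error.message}")

  Sentry.capture_exception(
    error,
    extra: {
      model: parameters[:model],
      previous_response_id: parameters[:previous_response_id]
    }
  )
end
```

Whatever tool you use, the important part is that the error gets recorded somewhere visible before it's re-raised.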
Conclusion
By combining Active Job's `retry_on` with specific error handling and logging, we just built a resilient background job.
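If you want to lock this behavior in, a small job test can check that a server error makes the job re-enqueue itself instead of blowing up. Here's a rough sketch; the `messages(:one)` fixture and the way the service is stubbed out are hypothetical and will need adapting to your own test setup:

```ruby
# test/jobs/llm/assistant_response_job_test.rb
require "test_helper"
require "minitest/mock"

class Llm::AssistantResponseJobTest < ActiveJob::TestCase
  test "re-enqueues itself when OpenAI responds with a server error" do
    message = messages(:one) # hypothetical fixture with its chat and chatbot

    # Stand-in service whose generate_response always raises a 500-style error
    failing_service = Object.new
    def failing_service.generate_response
      raise Faraday::ServerError, "the server responded with status 500"
    end

    Llm::AssistantResponseService.stub :new, failing_service do
      # retry_on should rescue the error and schedule another attempt
      assert_enqueued_with(job: Llm::AssistantResponseJob) do
        Llm::AssistantResponseJob.perform_now(message.id)
      end
    end
  end
end
```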
Implementing this is incredibly effective for dealing with unreliable network requests to third-party services: your users get a smoother experience, and you'll spend less time manually retrying failed jobs.
Next time you're working with an external API, remember to ask yourself: "What happens if this fails?" and build in a resilient error-handling strategy from the start.
If you enjoyed this tutorial, here's where to find more of my work: