Andres Urdaneta

Posted on • Originally published at untaught.dev

How to Fix Random OpenAI 500 Errors in Rails Background Jobs Using retry_on

Introduction

When you're building applications that rely on third-party APIs, one of the certainties is that those APIs will, at some point, fail.

Network issues, transient server errors, or rate limiting can all lead to failed requests. A robust application needs to anticipate these failures and handle them gracefully.

In this tutorial, we'll walk through a real-world scenario I recently encountered in one of my Rails projects.

My app uses the ruby-openai gem to interact with the OpenAI API, and I noticed that the background job responsible for generating LLM responses was intermittently failing with a Faraday::ServerError.

We'll look at how I diagnosed the problem and used Rails' built-in features to make my background jobs more resilient.

The Problem: A Failing Background Job

The issue started with jobs landing in my "failed" queue. The error was always the same: Faraday::ServerError: the server responded with status 500.

Here's a snippet of the stack trace:

/usr/local/bundle/ruby/3.3.0/gems/faraday-2.13.1/lib/faraday/response/raise_error.rb:38:in `on_complete'
...
/rails/app/services/llm/assistant_response_service.rb:22:in `generate_response'
/rails/app/jobs/llm/assistant_response_job.rb:13:in `perform'
...

Here's the generate_response method responsible for making the API call to OpenAI:

# `app/services/llm/assistant_response_service.rb`

class Llm::AssistantResponseService < Llm::BaseOpenAiService

    # ...

    def generate_response
        parameters = {
            model: DEFAULT_MODEL,
            input: @input_messages,
            tools: @tool_registry&.registered_tools || [],
            previous_response_id: chat.previous_response_id,
            text: {
                verbosity: "low"
            }
        }

        response = client.responses.create(parameters: parameters)
        handle_response(response)
    end

    # ...

end

And here's the background job that was calling it:

# `app/jobs/llm/assistant_response_job.rb`
class Llm::AssistantResponseJob < ApplicationJob
    queue_as :default

    def perform(message_id)
        message = Message.includes(chat: :chatbot).find(message_id)
        chat = message.chat
        chatbot = chat.chatbot

        Llm::AssistantResponseService.new(
            input_message: message.content,
            chat: chat,
            chatbot: chatbot,
        ).generate_response
    end
end

This wasn't an error in my code, but an issue on OpenAI's end. However, my app wasn't handling it well. The job would try once, fail, and give up.

The problem was that there was no handling for Faraday::ServerError. The job would simply fail and be moved to the dead-letter queue, requiring manual intervention to retry.

The Solution: Automatic Retries with Active Job

The best way to handle transient errors like a 500 status is to simply try again after a short delay. Fortunately, Rails makes this trivial with the retry_on feature.

Step 1: Add Retries to the Job

The first and most important change is to tell our job to retry when it encounters a Faraday::ServerError.

I modified app/jobs/llm/assistant_response_job.rb like this:

# app/jobs/llm/assistant_response_job.rb

class Llm::AssistantResponseJob < ApplicationJob
    queue_as :default

    # **********************
    # ADD THIS NEXT LINE ⬇️
    # **********************
    retry_on Faraday::ServerError, wait: :polynomially_longer, attempts: 3

    def perform(message_id)
        message = Message.includes(chat: :chatbot).find(message_id)
        chat = message.chat
        chatbot = chat.chatbot

        Llm::AssistantResponseService.new(
            input_message: message.content,
            chat: chat,
            chatbot: chatbot,
        ).generate_response
    end
end

With this single line, the job will now:

  1. Catch any Faraday::ServerError that occurs during its execution.
  2. Automatically re-enqueue itself to be run again later.
  3. Wait for a polynomially increasing amount of time between retries (:polynomially_longer).

    :polynomially_longer is a built-in backoff strategy for retries in Rails. It makes the wait time between retries grow based on the number of attempts so far, roughly: wait_time = (executions ** 4) + (random_jitter) + 2. E.g:

    - **First retry:** about **3 seconds**
    - **Second retry:** about **18 seconds**
    - **Third retry:** about **83 seconds**
    - **Fourth retry:** much longer, and so on.

    The idea is to give the system more and more time to recover before trying again, instead of hammering the failing API at a fixed interval (see the quick sketch after this list).
  4. Attempt this up to 3 times before finally giving up and moving to the failed jobs queue.
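
To see what that backoff looks like in practice, here's a quick sketch that prints the approximate wait times using the simplified formula above (the real Rails implementation adds a random jitter term, which is ignored here):

# Approximate :polynomially_longer wait times, ignoring the random jitter term
(1..3).each do |executions|
  wait = (executions ** 4) + 2
  puts "Retry ##{executions}: ~#{wait} seconds"
end
# Prints:
#   Retry #1: ~3 seconds
#   Retry #2: ~18 seconds
#   Retry #3: ~83 seconds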

This immediately makes our job much more robust.
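
If you also want to run custom logic once all the attempts are exhausted (instead of letting the exception bubble up to the failed jobs queue), retry_on accepts an optional block. Here's a quick sketch of that variation; it's not something this particular job strictly needs:

# Optional variation: the block runs only after all retry attempts are exhausted,
# instead of re-raising the error.
retry_on Faraday::ServerError, wait: :polynomially_longer, attempts: 3 do |job, error|
  Rails.logger.error("OpenAI still failing after #{job.executions} attempts: #{error.message}")
end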

Step 2: Improve Error Logging

While retrying is great, we still want to know when these errors are happening. To get that visibility, we rescue the error in the service, log it, and then re-raise it so that the job's retry_on handler can catch it.

Here's the updated generate_response method in app/services/llm/assistant_response_service.rb:

# app/services/llm/assistant_response_service.rb

class Llm::AssistantResponseService < Llm::BaseOpenAiService

    # ...

    def generate_response
        # ...
        response = client.responses.create(parameters: parameters)
        handle_response(response)
    rescue Faraday::ServerError => e # <-- Add this rescue block
        log_error(e, parameters) # <-- Log the error
        raise e # <-- Re-raise the exception
    end

    private

        def log_error(error, parameters = {})
            # Log the error to a monitoring and error tracking service, e.g: Sentry
        end
end


The key here is raise e. If we just rescued the exception without re-raising it, the job would never know that an error occurred, and it wouldn't retry. By rescuing, logging, and re-raising, we get the best of both worlds: visibility into the errors and automatic retries.
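
For completeness, here's one way the log_error stub could be filled in. This is a hypothetical sketch that assumes the sentry-ruby gem is installed; a plain Rails.logger call on its own works too:

        def log_error(error, parameters = {})
            # Hypothetical sketch: write to the Rails log and, if Sentry is available,
            # report the exception there too along with the request parameters
            Rails.logger.error("[OpenAI] #{error.class}: #{error.message}")
            Sentry.capture_exception(error, extra: { parameters: parameters }) if defined?(Sentry)
        end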

Conclusion

By combining Active Job's retry_on with specific error handling and logging, we just built a resilient background job.

This pattern is incredibly effective for dealing with unreliable network requests to third-party services: your users get a smoother experience, and you spend less time manually retrying failed jobs.

Next time you're working with an external API, remember to ask yourself: "What happens if this fails?" and build in a resilient error-handling strategy from the start.

If you enjoyed this tutorial, here's where to find more of my work:

Untaught Blog
Read Article Here
Follow me on X
