DEV Community

AgentQ
The Complete AI Rails Stack - Full Architecture on Your Own Infrastructure

If you are building an AI product in Rails, the hard part is usually not calling the model API. The hard part is fitting model calls into a real application that has users, permissions, jobs, caching, search, and failure handling.

In this post, we’ll put the pieces together into a complete Rails architecture you can run on your own VPS stack.

The goal is simple:

  • Rails handles HTTP, auth, admin screens, and business logic
  • PostgreSQL stores app data and vectors
  • Redis handles caching and job coordination
  • Background jobs run slow AI tasks
  • Model providers stay behind service objects

The moving parts

A practical Rails AI stack looks like this:

Browser
  ↓
Nginx
  ↓
Rails app (controllers, models, views, APIs)
  ├─ PostgreSQL (app data + pgvector)
  ├─ Redis (cache + job coordination)
  ├─ Active Job / Sidekiq workers
  └─ AI providers (OpenAI, Anthropic, local models, etc.)

Do not let controllers talk directly to model APIs. Keep AI work behind service classes and jobs so the app stays testable.
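To make that boundary concrete, here is a minimal sketch (the names are illustrative, not from any library): because the service only depends on an object that responds to `ask`, a test can inject a canned double instead of hitting a real API.

```ruby
# Illustrative service: depends only on an object that responds to #ask.
class AnswerService
  def initialize(client:)
    @client = client
  end

  def answer(question)
    @client.ask("Q: #{question}")
  end
end

# In a test, a tiny double stands in for the provider client.
fake = Object.new
def fake.ask(prompt) = "echo: #{prompt}"

AnswerService.new(client: fake).answer("hi")
# => "echo: Q: hi"
```

No network, no API key, no mocking library required.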

Start with a normal Rails app

Create the app the usual way:

rails new ai_stack --database=postgresql
cd ai_stack
bundle add sidekiq neighbor ruby-openai
bin/rails generate model Document title:string content:text embedding:vector{1536}
bin/rails db:create db:migrate

neighbor gives Rails-friendly vector search support on top of pgvector.

Enable the extension in a migration. This must run before the migration that adds the vector column, so generate it first (or timestamp it earlier):

class EnablePgvector < ActiveRecord::Migration[8.0]
  def change
    enable_extension "vector"
  end
end

And define the model:

class Document < ApplicationRecord
  has_neighbors :embedding
  validates :title, :content, presence: true
end

Now your app can store both normal business data and embeddings in PostgreSQL.

Keep model access behind a service object

Wrap provider calls in one place.

# app/services/ai/chat_client.rb
module Ai
  class ChatClient
    def initialize(client: OpenAI::Client.new(access_token: ENV.fetch("OPENAI_API_KEY")))
      @client = client
    end

    def ask(prompt)
      response = @client.chat(
        parameters: {
          model: "gpt-4.1-mini",
          messages: [{ role: "user", content: prompt }]
        }
      )

      response.dig("choices", 0, "message", "content")
    end
  end
end

Your controller should not care which provider is underneath.

class ChatsController < ApplicationController
  def create
    answer = Ai::ChatClient.new.ask(params[:prompt])
    render json: { answer: answer }
  end
end

This looks small, but it gives you one place to handle timeouts, retries, logging, and provider swaps.
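As one hedged sketch of what that can look like, a small wrapper can retry transient failures with backoff. The retry policy, attempt count, and bare `StandardError` rescue here are illustrative assumptions, not part of ruby-openai; in practice you would rescue your client library's specific timeout and rate-limit errors.

```ruby
module Ai
  # Retries a failed provider call with exponential backoff.
  # The sleeper lambda is injected so tests do not actually sleep.
  class ResilientChat
    MAX_ATTEMPTS = 3

    def initialize(client:, sleeper: ->(seconds) { sleep(seconds) })
      @client = client
      @sleeper = sleeper
    end

    def ask(prompt)
      attempts = 0
      begin
        attempts += 1
        @client.ask(prompt)
      rescue StandardError
        raise if attempts >= MAX_ATTEMPTS
        @sleeper.call(2**attempts) # 2s, then 4s
        retry
      end
    end
  end
end
```

Because the client is injected, a flaky fake that fails twice and then succeeds can verify the retry behavior in a unit test.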

Move slow AI work into jobs

Embeddings, summarization, document parsing, and image generation should usually happen outside the request cycle.

# app/jobs/embed_document_job.rb
class EmbedDocumentJob < ApplicationJob
  queue_as :default

  def perform(document_id)
    document = Document.find(document_id)

    response = OpenAI::Client.new(access_token: ENV.fetch("OPENAI_API_KEY")).embeddings(
      parameters: {
        model: "text-embedding-3-small",
        input: document.content
      }
    )

    vector = response.dig("data", 0, "embedding")
    document.update!(embedding: vector)
  end
end

Kick the job off after create:

class Document < ApplicationRecord
  has_neighbors :embedding
  after_commit :enqueue_embedding, on: :create

  private

  def enqueue_embedding
    EmbedDocumentJob.perform_later(id)
  end
end

This keeps the user-facing request fast.

Build retrieval in one query path

Once documents have embeddings, retrieval stays inside your app.

# app/services/ai/retriever.rb
module Ai
  class Retriever
    def initialize(client: OpenAI::Client.new(access_token: ENV.fetch("OPENAI_API_KEY")))
      @client = client
    end

    def search(query)
      embedding = @client.embeddings(
        parameters: {
          model: "text-embedding-3-small",
          input: query
        }
      ).dig("data", 0, "embedding")

      Document.nearest_neighbors(:embedding, embedding, distance: "cosine").first(5)
    end
  end
end

Now a RAG-style answer can combine retrieved records with a chat prompt.

module Ai
  class AnswerQuestion
    def call(question)
      docs = Retriever.new.search(question)
      context = docs.map(&:content).join("\n\n")

      ChatClient.new.ask(<<~PROMPT)
        Answer the question using the context below.
        If the context is missing the answer, say you do not know.

        Context:
        #{context}

        Question:
        #{question}
      PROMPT
    end
  end
end

That is the core AI loop: ingest, embed, retrieve, answer.
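For intuition about the "cosine" distance the retriever asks pgvector to use, here is a toy pure-Ruby version. This is only illustrative; in the real query path pgvector computes the distance in SQL.

```ruby
# Cosine distance: 1 - cos(angle between vectors).
# Vectors pointing the same way give 0.0; orthogonal vectors give 1.0.
def cosine_distance(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  mag = ->(v) { Math.sqrt(v.sum { |x| x * x }) }
  1.0 - dot / (mag.call(a) * mag.call(b))
end

cosine_distance([1.0, 0.0], [1.0, 0.0]) # => 0.0
cosine_distance([1.0, 0.0], [0.0, 1.0]) # => 1.0
```

Nearest neighbors by this metric are the documents whose embeddings point in the most similar direction to the query embedding.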

Separate web and worker processes

On a VPS, run at least two process types:

bundle exec puma -C config/puma.rb
bundle exec sidekiq

The web process serves users.
The worker process handles embeddings, summarization, retries, and fan-out jobs.

A basic docker-compose.yml on your own server could look like this:

services:
  web:
    build: .
    command: bundle exec puma -C config/puma.rb
    env_file: .env
    depends_on: [db, redis]
  worker:
    build: .
    command: bundle exec sidekiq
    env_file: .env
    depends_on: [db, redis]
  db:
    image: pgvector/pgvector:pg16
    environment:
      POSTGRES_PASSWORD: postgres
    volumes:
      - pg_data:/var/lib/postgresql/data
  redis:
    image: redis:7

volumes:
  pg_data:

No platform magic. Just predictable services.

Add caching and observability early

AI features get expensive fast. Cache what you can.

Rails.cache.fetch(["summary", article.cache_key_with_version], expires_in: 6.hours) do
  Ai::ChatClient.new.ask("Summarize: #{article.body}")
end

Also log provider timing and failures.

Rails.logger.info({ event: "ai.request", feature: "chat", model: "gpt-4.1-mini" }.to_json)

You want answers to these questions in production:

  • which feature is using tokens
  • which jobs fail most often
  • which prompts are slow
  • which users trigger the highest cost
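One way to start answering them is a decorator that wraps the chat client and emits structured log lines with timing and failure information. This is a sketch: the field names and the injected `logger` callable are assumptions to shape to your own log pipeline.

```ruby
require "json"

module Ai
  # Wraps any client that responds to #ask; logs duration and errors.
  class Instrumented
    def initialize(client:, logger:, feature:)
      @client = client
      @logger = logger
      @feature = feature
    end

    def ask(prompt)
      started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
      answer = @client.ask(prompt)
      elapsed_ms = ((Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000).round
      @logger.call({ event: "ai.request", feature: @feature, duration_ms: elapsed_ms }.to_json)
      answer
    rescue StandardError => e
      @logger.call({ event: "ai.error", feature: @feature, error: e.class.name }.to_json)
      raise
    end
  end
end
```

In the app, the logger could simply be `->(line) { Rails.logger.info(line) }`, and the wrapper composes with the same service objects you already have.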

Keep the boundaries clean

A maintainable AI Rails app usually has these boundaries:

  • controllers: accept input, authorize, render output
  • models: persistence and domain rules
  • services: talk to providers and compose AI flows
  • jobs: run slow or retryable work
  • PostgreSQL: source of truth, including vectors
  • Redis: cache and queue support

If you blur those layers, the app gets hard to debug fast.

What to remember

The complete AI Rails stack is not one gem or one API call. It is a set of clean responsibilities:

  • Rails owns the product
  • Jobs own long-running AI work
  • PostgreSQL owns structured data and vectors
  • Redis takes pressure off the app
  • Service objects isolate provider complexity

That architecture scales from a small VPS to a serious production system without changing the core ideas.

Next time, we’ll close the series with what’s next in the Ruby AI ecosystem and which tools are worth watching.
