DEV Community

Nishkarsh Sahu
Nishkarsh Sahu

Posted on

Building a Rails-Native AI Abstraction Layer for Local and Hosted LLMs

Recently I’ve been experimenting with integrating local AI runtimes into Rails applications using tools like Ollama and LM Studio.

At first, the integration looked straightforward:
make an HTTP request, stream the response, and return the generated text.

But after experimenting with multiple providers, I realized the actual challenge wasn’t calling the APIs - it was normalizing the differences between providers cleanly.

The Problem

Every AI provider behaves slightly differently.

Some providers:

stream using SSE
stream newline-delimited JSON
return partial JSON chunks
expose different finish signals
structure responses differently
implement retries/errors differently

Even providers claiming OpenAI compatibility often differ subtly in:

chunk formatting
streaming behavior
error payloads
lifecycle handling

This becomes painful when trying to build reusable Rails infrastructure.

You quickly end up writing:

provider-specific parsing
provider-specific retry handling
provider-specific response normalization
provider-specific streaming logic

inside application code.

What I Wanted Instead

I wanted a Rails-native abstraction layer where application code could stay provider-independent.

Something conceptually similar to how ActiveRecord abstracts databases.

The goal became:

response = AiModels.chat(
provider: :ollama,
model: "llama3.2",
messages: [
{
role: "user",
content: "Explain ActiveRecord associations"
}
]
)

puts response.content

without the application caring about:

SSE parsing
chunk formats
provider-specific APIs
retry lifecycle behavior
Streaming Was the Hardest Part

The most interesting challenge turned out to be streaming.

Different providers stream differently:

SSE chunks
JSON lines
partial JSON payloads
token-by-token deltas
different completion signals

Normalizing these cleanly required:

provider adapters
shared streaming parsers
unified response objects
lifecycle hooks
retry boundaries

I ended up implementing:

callback-based streaming
Enumerator-based streaming
normalized chunk responses
provider-independent lifecycle hooks

Example:

AiModels.chat_stream(
provider: :lm_studio,
model: "tinyllama-1.1b-chat-v1.0",
messages: [
{
role: "user",
content: "Explain belongs_to vs has_many"
}
]
) do |chunk|
print chunk.content
end
Why Local AI Matters

One thing I found particularly interesting was how useful local AI becomes during development.

Running models through:

Ollama
LM Studio
LocalAI

gives:

faster experimentation
offline development
lower costs
more privacy
easier debugging

without depending entirely on hosted APIs.

Rails developers are already used to running infrastructure locally:

PostgreSQL
Redis
Sidekiq
Elasticsearch

Local AI runtimes fit naturally into that workflow.

Architecture Approach

The structure I ended up with looks roughly like this:

Rails App

AiModels.chat

Client

Provider Registry

Provider Adapter

Ollama / LM Studio / DeepSeek / OpenAI-compatible APIs

Key ideas:

provider isolation
normalized response objects
reusable streaming lifecycle
provider-independent retries/hooks
Rails-native configuration
Current State

The project currently supports:

Ollama
LM Studio
DeepSeek
OpenAI-compatible providers
streaming
retries/hooks
Rails integration

The next area I’m exploring is embeddings support for:

semantic search
RAG pipelines
vector databases
AI memory systems
Final Thoughts

One thing I’ve learned while building AI integrations:
the hard part usually isn’t the model call itself.

The difficult part is building stable infrastructure around:

streaming
retries
provider abstraction
observability
lifecycle management

especially once multiple providers enter the picture.

I’m curious how other Ruby/Rails developers are approaching:

local AI runtimes
provider abstractions
streaming APIs
embeddings/RAG infrastructure
Rails AI architecture in general

GitHub:
https://github.com/nishkarshh013/ai_models

Top comments (0)