Nishkarsh Sahu

Posted on May 19

Building a Rails-Native AI Abstraction Layer for Local and Hosted LLMs

#ai #architecture #llm #rails

Recently I’ve been experimenting with integrating local AI runtimes into Rails applications using tools like Ollama and LM Studio.

At first, the integration looked straightforward:
make an HTTP request, stream the response, and return the generated text.

But after experimenting with multiple providers, I realized the actual challenge wasn’t calling the APIs - it was normalizing the differences between providers cleanly.

The Problem

Every AI provider behaves slightly differently.

Some providers:

stream using SSE
stream newline-delimited JSON
return partial JSON chunks
expose different finish signals
structure responses differently
implement retries/errors differently

Even providers claiming OpenAI compatibility often differ subtly in:

chunk formatting
streaming behavior
error payloads
lifecycle handling

This becomes painful when trying to build reusable Rails infrastructure.

You quickly end up writing:

provider-specific parsing
provider-specific retry handling
provider-specific response normalization
provider-specific streaming logic

inside application code.

What I Wanted Instead

I wanted a Rails-native abstraction layer where application code could stay provider-independent.

Something conceptually similar to how ActiveRecord abstracts databases.

The goal became:

response = AiModels.chat(
provider: :ollama,
model: "llama3.2",
messages: [
{
role: "user",
content: "Explain ActiveRecord associations"
}
]
)

puts response.content

without the application caring about:

SSE parsing
chunk formats
provider-specific APIs
retry lifecycle behavior
Streaming Was the Hardest Part

The most interesting challenge turned out to be streaming.

Different providers stream differently:

SSE chunks
JSON lines
partial JSON payloads
token-by-token deltas
different completion signals

Normalizing these cleanly required:

provider adapters
shared streaming parsers
unified response objects
lifecycle hooks
retry boundaries

I ended up implementing:

callback-based streaming
Enumerator-based streaming
normalized chunk responses
provider-independent lifecycle hooks

Example:

AiModels.chat_stream(
provider: :lm_studio,
model: "tinyllama-1.1b-chat-v1.0",
messages: [
{
role: "user",
content: "Explain belongs_to vs has_many"
}
]
) do |chunk|
print chunk.content
end
Why Local AI Matters

One thing I found particularly interesting was how useful local AI becomes during development.

Running models through:

Ollama
LM Studio
LocalAI

gives:

faster experimentation
offline development
lower costs
more privacy
easier debugging

without depending entirely on hosted APIs.

Rails developers are already used to running infrastructure locally:

PostgreSQL
Redis
Sidekiq
Elasticsearch

Local AI runtimes fit naturally into that workflow.

Architecture Approach

The structure I ended up with looks roughly like this:

Rails App
↓
AiModels.chat
↓
Client
↓
Provider Registry
↓
Provider Adapter
↓
Ollama / LM Studio / DeepSeek / OpenAI-compatible APIs

Key ideas:

provider isolation
normalized response objects
reusable streaming lifecycle
provider-independent retries/hooks
Rails-native configuration
Current State

The project currently supports:

Ollama
LM Studio
DeepSeek
OpenAI-compatible providers
streaming
retries/hooks
Rails integration

The next area I’m exploring is embeddings support for:

semantic search
RAG pipelines
vector databases
AI memory systems
Final Thoughts

One thing I’ve learned while building AI integrations:
the hard part usually isn’t the model call itself.

The difficult part is building stable infrastructure around:

streaming
retries
provider abstraction
observability
lifecycle management

especially once multiple providers enter the picture.

I’m curious how other Ruby/Rails developers are approaching:

local AI runtimes
provider abstractions
streaming APIs
embeddings/RAG infrastructure
Rails AI architecture in general

GitHub:
https://github.com/nishkarshh013/ai_models