Recently I’ve been experimenting with integrating local AI runtimes into Rails applications using tools like Ollama and LM Studio.
At first, the integration looked straightforward:
make an HTTP request, stream the response, and return the generated text.
But after experimenting with multiple providers, I realized the actual challenge wasn’t calling the APIs - it was normalizing the differences between providers cleanly.
The Problem
Every AI provider behaves slightly differently.
Some providers:
stream using SSE
stream newline-delimited JSON
return partial JSON chunks
expose different finish signals
structure responses differently
implement retries/errors differently
Even providers claiming OpenAI compatibility often differ subtly in:
chunk formatting
streaming behavior
error payloads
lifecycle handling
This becomes painful when trying to build reusable Rails infrastructure.
You quickly end up writing:
provider-specific parsing
provider-specific retry handling
provider-specific response normalization
provider-specific streaming logic
inside application code.
What I Wanted Instead
I wanted a Rails-native abstraction layer where application code could stay provider-independent.
Something conceptually similar to how ActiveRecord abstracts databases.
The goal became:
response = AiModels.chat(
provider: :ollama,
model: "llama3.2",
messages: [
{
role: "user",
content: "Explain ActiveRecord associations"
}
]
)
puts response.content
without the application caring about:
SSE parsing
chunk formats
provider-specific APIs
retry lifecycle behavior
Streaming Was the Hardest Part
The most interesting challenge turned out to be streaming.
Different providers stream differently:
SSE chunks
JSON lines
partial JSON payloads
token-by-token deltas
different completion signals
Normalizing these cleanly required:
provider adapters
shared streaming parsers
unified response objects
lifecycle hooks
retry boundaries
I ended up implementing:
callback-based streaming
Enumerator-based streaming
normalized chunk responses
provider-independent lifecycle hooks
Example:
AiModels.chat_stream(
provider: :lm_studio,
model: "tinyllama-1.1b-chat-v1.0",
messages: [
{
role: "user",
content: "Explain belongs_to vs has_many"
}
]
) do |chunk|
print chunk.content
end
Why Local AI Matters
One thing I found particularly interesting was how useful local AI becomes during development.
Running models through:
Ollama
LM Studio
LocalAI
gives:
faster experimentation
offline development
lower costs
more privacy
easier debugging
without depending entirely on hosted APIs.
Rails developers are already used to running infrastructure locally:
PostgreSQL
Redis
Sidekiq
Elasticsearch
Local AI runtimes fit naturally into that workflow.
Architecture Approach
The structure I ended up with looks roughly like this:
Rails App
↓
AiModels.chat
↓
Client
↓
Provider Registry
↓
Provider Adapter
↓
Ollama / LM Studio / DeepSeek / OpenAI-compatible APIs
Key ideas:
provider isolation
normalized response objects
reusable streaming lifecycle
provider-independent retries/hooks
Rails-native configuration
Current State
The project currently supports:
Ollama
LM Studio
DeepSeek
OpenAI-compatible providers
streaming
retries/hooks
Rails integration
The next area I’m exploring is embeddings support for:
semantic search
RAG pipelines
vector databases
AI memory systems
Final Thoughts
One thing I’ve learned while building AI integrations:
the hard part usually isn’t the model call itself.
The difficult part is building stable infrastructure around:
streaming
retries
provider abstraction
observability
lifecycle management
especially once multiple providers enter the picture.
I’m curious how other Ruby/Rails developers are approaching:
local AI runtimes
provider abstractions
streaming APIs
embeddings/RAG infrastructure
Rails AI architecture in general
Top comments (0)