Stokry

Semantic Caching for RubyLLM: Cut Your AI Costs by 70%

If you're using RubyLLM to build AI-powered applications in Ruby, you're already enjoying a clean, unified API across multiple providers. But there's one problem RubyLLM doesn't solve: every API call costs money.

Enter SemanticCache — a semantic caching layer that integrates seamlessly with RubyLLM to dramatically reduce your AI costs.

The Problem: Redundant API Calls

Consider a typical chatbot or AI assistant. Users often ask variations of the same question:

  • "What's the capital of France?"
  • "What is France's capital city?"
  • "Tell me the capital of France"
  • "Capital of France?"

Without caching, each of these triggers a separate API call. With traditional caching, you'd need an exact string match — which almost never happens with natural language.

Semantic caching solves this. It understands that these questions are semantically identical and returns the cached response instantly.
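
To see why this works, here's a minimal sketch: embed both phrasings and compare the vectors with cosine similarity. Semantically equivalent questions land close together even though the strings differ. It assumes RubyLLM is already configured and that RubyLLM.embed(...).vectors returns the raw embedding array in your version (check the embedding docs for the release you're on):

def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  dot / (Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x }))
end

v1 = RubyLLM.embed("What's the capital of France?").vectors
v2 = RubyLLM.embed("Tell me the capital of France").vectors

cosine_similarity(v1, v2)
# => typically well above a 0.85-style threshold, so both phrasings hit the same cache entry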

Why SemanticCache + RubyLLM?

1. Provider Flexibility

RubyLLM already gives you the freedom to switch between OpenAI, Gemini, Mistral, Ollama, Bedrock, and more. SemanticCache extends this flexibility to embeddings:

# Use Gemini for embeddings (cheaper) while using GPT-4o for completions
RubyLLM.configure do |config|
  config.gemini_api_key = ENV["GEMINI_API_KEY"]
  config.openai_api_key = ENV["OPENAI_API_KEY"]
end

SemanticCache.configure do |c|
  c.embedding_adapter = :ruby_llm
  c.embedding_model = "text-embedding-004"  # Gemini's embedding model
end

# Now cache expensive GPT-4o calls using cheap Gemini embeddings
cache = SemanticCache.new
response = cache.fetch("Explain quantum computing") do
  RubyLLM.chat(model: "gpt-4o").ask("Explain quantum computing")
end

2. Local Embeddings with Ollama

Running Ollama locally? You can generate embeddings without any API costs:

RubyLLM.configure do |config|
  config.ollama_api_base = "http://localhost:11434/v1"
end

SemanticCache.configure do |c|
  c.embedding_adapter = :ruby_llm
  c.embedding_model = "nomic-embed-text"  # Free, local embeddings
end

This is perfect for:

  • Development environments (no API costs while testing)
  • Privacy-sensitive applications (embeddings never leave your server)
  • High-volume applications where embedding costs add up
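
A common pattern is to choose the embedding model per environment, so development stays free and local while production uses a hosted model. A minimal sketch, assuming a Rails app and the model names used elsewhere in this post:

# config/initializers/semantic_cache.rb
SemanticCache.configure do |c|
  c.embedding_adapter = :ruby_llm
  c.embedding_model =
    if Rails.env.production?
      "text-embedding-3-small"  # hosted embeddings in production
    else
      "nomic-embed-text"        # free, local embeddings via Ollama in dev/test
    end
end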

3. Enterprise-Ready with AWS Bedrock

For enterprise deployments, you might need to keep everything within AWS:

RubyLLM.configure do |config|
  config.bedrock_region = "us-east-1"
end

SemanticCache.configure do |c|
  c.embedding_adapter = :ruby_llm
  c.embedding_model = "amazon.titan-embed-text-v1"
end

No data leaves your AWS environment, and you get the compliance benefits of Bedrock.
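
In practice you'll also need to give RubyLLM AWS credentials. Recent RubyLLM versions expose Bedrock credential options along the lines of the sketch below; treat the exact option names as an assumption and verify them against the configuration docs for the version you're running:

RubyLLM.configure do |config|
  config.bedrock_region        = "us-east-1"
  # Option names assumed from RubyLLM's Bedrock configuration; verify for your version
  config.bedrock_api_key       = ENV["AWS_ACCESS_KEY_ID"]
  config.bedrock_secret_key    = ENV["AWS_SECRET_ACCESS_KEY"]
  config.bedrock_session_token = ENV["AWS_SESSION_TOKEN"]  # only for temporary credentials
end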

Real-World Impact

Let's do some quick math. Assume:

  • 10,000 queries per day
  • Average query: 50 tokens
  • GPT-4o pricing: ~$5/1M input tokens, ~$15/1M output tokens
  • Average response: 200 tokens

Without caching:

  • Daily cost: ~$32.50 (0.5M input tokens ≈ $2.50 + 2M output tokens ≈ $30)
  • Monthly cost: ~$975

With SemanticCache (assuming a 70% hit rate):

  • Daily cost: ~$9.75 plus minimal embedding costs
  • Monthly cost: ~$295
  • Savings: roughly $680/month

And that's a conservative estimate. Applications with repetitive queries (FAQ bots, customer support, documentation assistants) often see 80-90% hit rates.
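
If you want to sanity-check those numbers or plug in your own traffic profile, the estimate is a few lines of Ruby (using the same assumptions listed above):

# Back-of-envelope estimate using the assumptions above (adjust to your traffic)
queries_per_day = 10_000
input_tokens    = 50      # per query
output_tokens   = 200     # per response
input_price     = 5.0 / 1_000_000   # $ per input token
output_price    = 15.0 / 1_000_000  # $ per output token
hit_rate        = 0.70

daily_cost  = queries_per_day * (input_tokens * input_price + output_tokens * output_price)
cached_cost = daily_cost * (1 - hit_rate)

puts "Without caching: ~$#{daily_cost.round(2)}/day"   # => ~$32.5/day
puts "With caching:    ~$#{cached_cost.round(2)}/day"  # => ~$9.75/day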

How It Works

  1. Query comes in → SemanticCache generates an embedding vector using your configured RubyLLM provider
  2. Similarity search → The cache searches for stored entries with high cosine similarity
  3. Cache hit? → If similarity exceeds your threshold (default 0.85), return the cached response
  4. Cache miss? → Execute the LLM call, store the result with its embedding for future queries
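
Conceptually, fetch boils down to something like the sketch below. It's a simplified illustration of that flow, not SemanticCache's actual source; embed, store.nearest, and store.add are hypothetical stand-ins for the real embedding and storage layers:

# Simplified illustration of the caching flow (not the gem's real internals).
# `embed`, `store.nearest`, and `store.add` are hypothetical helpers.
def fetch(query, threshold: 0.85)
  embedding = embed(query)                 # 1. embed via the configured RubyLLM provider
  hit = store.nearest(embedding)           # 2. similarity search over stored entries
  if hit && hit.similarity >= threshold    # 3. hit: similarity clears the threshold
    hit.response
  else                                     # 4. miss: call the LLM, remember the result
    yield.tap { |response| store.add(embedding, response) }
  end
end

From the caller's side, the same flow looks like this:
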
cache = SemanticCache.new(similarity_threshold: 0.85)

# First call - cache miss, calls the API
response = cache.fetch("What is Ruby?") do
  RubyLLM.chat.ask("What is Ruby?")
end

# Second call - cache hit! Returns instantly
response = cache.fetch("Tell me about the Ruby programming language") do
  RubyLLM.chat.ask("Tell me about Ruby")
end
# => Returns cached response, no API call made

Getting Started

Add both gems to your Gemfile:

gem "ruby_llm"
gem "semantic-cache"
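
Then run bundle install to pull both in.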

Configure your providers:

# config/initializers/ruby_llm.rb
RubyLLM.configure do |config|
  config.openai_api_key = ENV["OPENAI_API_KEY"]
end

# config/initializers/semantic_cache.rb
SemanticCache.configure do |c|
  c.embedding_adapter = :ruby_llm
  c.embedding_model = "text-embedding-3-small"
  c.similarity_threshold = 0.85
  c.store = :redis  # For production
  c.store_options = { url: ENV["REDIS_URL"] }
end

Start caching:

cache = SemanticCache.new

response = cache.fetch(user_message, model: "gpt-4o") do
  RubyLLM.chat(model: "gpt-4o").ask(user_message)
end

# Check your savings
puts cache.savings_report
# => Total saved: $23.45 (156 cached calls)
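
In a Rails app you'll usually want one shared cache instance rather than a new one per request. A small wrapper keeps that in one place (CachedChat is a hypothetical class of your own; it only uses the calls shown above):

# app/services/cached_chat.rb (hypothetical wrapper, not part of either gem)
class CachedChat
  CACHE = SemanticCache.new

  def self.ask(question, model: "gpt-4o")
    CACHE.fetch(question, model: model) do
      RubyLLM.chat(model: model).ask(question)
    end
  end
end

CachedChat.ask("Explain quantum computing")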

Advanced Patterns

Tag-Based Invalidation

Group related cache entries for bulk invalidation:

# Cache with tags
cache.fetch("Ruby version?", tags: [:ruby, :versions]) do
  RubyLLM.chat.ask("What's the latest Ruby version?")
end

# When Ruby 3.4 releases, invalidate all version-related caches
cache.invalidate(tags: [:versions])

TTL for Time-Sensitive Data

# News summaries should expire quickly
cache.fetch("Latest tech news?", ttl: 3600) do  # 1 hour
  RubyLLM.chat.ask("Summarize today's tech news")
end

Per-User Namespacing

# Each user gets their own cache namespace
SemanticCache.with_cache(namespace: "user_#{current_user.id}") do
  # Cached responses are isolated per user
end

Conclusion

SemanticCache and RubyLLM are a natural fit. RubyLLM gives you provider flexibility for LLM calls; SemanticCache extends that flexibility to embeddings while dramatically cutting your costs.

The combination is particularly powerful because:

  1. No vendor lock-in — Switch embedding providers without changing your caching logic
  2. Cost optimization — Use cheaper providers for embeddings, premium providers for completions
  3. Local development — Use Ollama for free local embeddings during development
  4. Enterprise compliance — Keep everything within AWS using Bedrock

Stop paying for the same answers twice. Add SemanticCache to your RubyLLM project today.

