<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Carmine Paolino</title>
    <description>The latest articles on DEV Community by Carmine Paolino (@crmne).</description>
    <link>https://dev.to/crmne</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1861137%2Ff47aebaa-8631-4352-a600-ce151703cc9f.jpg</url>
      <title>DEV Community: Carmine Paolino</title>
      <link>https://dev.to/crmne</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/crmne"/>
    <language>en</language>
    <item>
      <title>Async Ruby is the Future of AI Apps (And It's Already Here)</title>
      <dc:creator>Carmine Paolino</dc:creator>
      <pubDate>Wed, 09 Jul 2025 12:11:37 +0000</pubDate>
      <link>https://dev.to/crmne/async-ruby-is-the-future-of-ai-apps-and-its-already-here-5g28</link>
      <guid>https://dev.to/crmne/async-ruby-is-the-future-of-ai-apps-and-its-already-here-5g28</guid>
      <description>&lt;p&gt;After a decade as an ML engineer/scientist immersed in Python's async ecosystem, returning to Ruby felt like stepping back in time. Where was the async revolution? Why was everyone still using threads for everything? SolidQueue, Sidekiq, GoodJob -- all thread-based. Even newer solutions defaulted to the same concurrency model.&lt;/p&gt;

&lt;p&gt;Coming from Python, where the entire community had reorganized around &lt;code&gt;asyncio&lt;/code&gt;, this seemed bizarre. FastAPI replaced Flask. Every library spawned an async twin. The transformation was total and necessary.&lt;/p&gt;

&lt;p&gt;Then, building &lt;a href="https://rubyllm.com" rel="noopener noreferrer"&gt;RubyLLM&lt;/a&gt; and &lt;a href="https://chatwithwork.com" rel="noopener noreferrer"&gt;Chat with Work&lt;/a&gt;, I noticed that &lt;em&gt;LLM communication is async Ruby's killer app&lt;/em&gt;. The unique demands of streaming AI responses -- long-lived connections, token-by-token delivery, thousands of concurrent conversations -- expose exactly why async matters.&lt;/p&gt;

&lt;p&gt;Here's the exciting bit: once I understood Ruby's approach to async, I realized it's actually &lt;em&gt;superior&lt;/em&gt; to Python's. While Python forced everyone to rewrite their entire stack, Ruby quietly built something better. Your existing code just works. No syntax changes. No library migrations. Just better performance when you need it.&lt;/p&gt;

&lt;p&gt;The async ecosystem that &lt;a href="https://github.com/ioquatix" rel="noopener noreferrer"&gt;Samuel Williams&lt;/a&gt; and others have been building for years suddenly makes perfect sense. We just needed the right use case to see it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why LLM Communication Breaks Everything
&lt;/h2&gt;

&lt;p&gt;LLM applications create a perfect storm of challenges that expose every weakness in thread-based concurrency:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Slot Starvation
&lt;/h3&gt;

&lt;p&gt;Configure any thread-based job queue with 25 workers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;StreamAIResponseJob&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;ApplicationJob&lt;/span&gt;
  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;perform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# This job occupies 1 of your 25 slots...&lt;/span&gt;
    &lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
      &lt;span class="c1"&gt;# ...for the ENTIRE streaming duration (30-60 seconds)&lt;/span&gt;
      &lt;span class="n"&gt;broadcast_chunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="c1"&gt;# Thread is 99% idle, just waiting for tokens&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;
    &lt;span class="c1"&gt;# Slot only freed here, after full response&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your 26th user? They're waiting in line. Not because your server is busy, but because all your workers are occupied by jobs waiting for tokens.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Resource Multiplication
&lt;/h3&gt;

&lt;p&gt;Each thread needs its own:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Database connection (25 threads = 25 connections minimum)&lt;/li&gt;
&lt;li&gt;Stack memory allocation&lt;/li&gt;
&lt;li&gt;OS thread management overhead&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For 1000 concurrent conversations, you'd need 1000 threads. Each thread needs its own database connection. That's 1000 database connections for threads that are 99% idle.&lt;/p&gt;
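
&lt;p&gt;You can watch the pool run dry in a few lines (a minimal sketch using ActiveRecord's pool API; a pool size of 25 and the 5-second checkout timeout are assumed defaults):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;# Sketch: 26 "streaming" jobs against a pool of 25 connections
threads = 26.times.map do
  Thread.new do
    ActiveRecord::Base.connection_pool.with_connection do
      sleep 30  # stand-in for a 30-second streaming response holding the connection
    end
  end
end
threads.each(&amp;amp;:join)
# Thread 26 waits for a free connection and raises
# ActiveRecord::ConnectionTimeoutError once the checkout timeout expires
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;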

&lt;h3&gt;
  
  
  3. Performance Overhead
&lt;/h3&gt;

&lt;p&gt;Real benchmarks show&lt;sup id="fnref1"&gt;1&lt;/sup&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Creating a thread: ~80μs&lt;/li&gt;
&lt;li&gt;Thread context switch: ~1.3μs&lt;/li&gt;
&lt;li&gt;Maximum throughput: ~5,000 requests/second&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When you're handling thousands of streaming connections, these microseconds add up to real latency.&lt;/p&gt;
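
&lt;p&gt;You can get a feel for these numbers yourself (a rough micro-benchmark sketch, not the cited benchmark's methodology; results vary by machine and Ruby version):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;require 'benchmark'

n = 10_000
thread_time = Benchmark.realtime { n.times { Thread.new {}.join } }
fiber_time  = Benchmark.realtime { n.times { Fiber.new {}.resume } }

puts "thread create+join:  #{(thread_time / n * 1_000_000).round(2)}μs"
puts "fiber create+resume: #{(fiber_time / n * 1_000_000).round(2)}μs"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;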

&lt;h3&gt;
  
  
  4. Scalability Challenges
&lt;/h3&gt;

&lt;p&gt;Try creating 10,000 threads and the OS scheduler starts to struggle. The overhead becomes crushing. Yet modern AI apps need to handle thousands of concurrent conversations.&lt;/p&gt;

&lt;p&gt;These aren't separate issues -- they're all symptoms of the same architectural mismatch. LLM communication is fundamentally different from traditional background jobs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding Concurrency: Threads vs Async
&lt;/h2&gt;

&lt;p&gt;To understand why LLM applications are async's perfect use case -- and why Ruby's implementation is so elegant -- we need to build up from first principles.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Hierarchy: Processes, Threads, and Fibers
&lt;/h3&gt;

&lt;p&gt;Think of your computer as an office building:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Processes&lt;/strong&gt; are like separate offices -- each with its own locked door, furniture, and files. They can't see into each other's spaces (memory isolation).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Threads&lt;/strong&gt; are like workers sharing the same office -- they can access the same filing cabinets (shared memory) but need to coordinate to avoid collisions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fibers&lt;/strong&gt; are like multiple tasks juggled by one worker at their desk -- switching between them manually when waiting for something (like a phone call). The sketch below shows that manual switching in plain Ruby.&lt;/li&gt;
&lt;/ul&gt;
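
&lt;p&gt;Ruby exposes that juggling directly through the core &lt;code&gt;Fiber&lt;/code&gt; class (a minimal sketch, no gems required):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;fiber = Fiber.new do
  puts "step 1"
  Fiber.yield        # voluntarily hand control back to the caller
  puts "step 2"
end

fiber.resume         # prints "step 1", pauses at Fiber.yield
puts "caller does other work"
fiber.resume         # prints "step 2", fiber finishes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;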

&lt;h3&gt;
  
  
  Scheduling: The Core Difference
&lt;/h3&gt;

&lt;p&gt;The fundamental question in concurrency is: who decides when to switch between tasks?&lt;/p&gt;

&lt;h4&gt;
  
  
  Threads: Preemptive Multitasking
&lt;/h4&gt;

&lt;p&gt;With threads, the operating system is the boss. It forcibly interrupts running threads to give others a turn:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="c1"&gt;# You start threads, but the OS controls them&lt;/span&gt;
&lt;span class="n"&gt;threads&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;times&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
  &lt;span class="no"&gt;Thread&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
    &lt;span class="c1"&gt;# This might be interrupted at ANY point&lt;/span&gt;
    &lt;span class="n"&gt;expensive_calculation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;fetch_from_api&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Each thread blocks individually here&lt;/span&gt;
    &lt;span class="n"&gt;process_result&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each thread:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gets scheduled by the OS kernel&lt;/li&gt;
&lt;li&gt;Can be interrupted mid-execution (in Ruby, after 100ms)&lt;/li&gt;
&lt;li&gt;Blocks individually on I/O operations&lt;/li&gt;
&lt;li&gt;Requires OS resources and kernel data structures&lt;/li&gt;
&lt;li&gt;Needs its own resources (like database connections)&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Fibers: Cooperative Concurrency
&lt;/h4&gt;

&lt;p&gt;With fibers, switching is voluntary -- they only yield at I/O boundaries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Fibers yield control cooperatively&lt;/span&gt;
&lt;span class="no"&gt;Async&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
  &lt;span class="n"&gt;fibers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;times&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
    &lt;span class="no"&gt;Async&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
      &lt;span class="n"&gt;expensive_calculation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Runs to completion&lt;/span&gt;
      &lt;span class="n"&gt;fetch_from_api&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;         &lt;span class="c1"&gt;# Yields here, other fibers run&lt;/span&gt;
      &lt;span class="n"&gt;process_result&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;         &lt;span class="c1"&gt;# Continues after I/O completes&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each fiber:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Schedules itself by yielding during I/O&lt;/li&gt;
&lt;li&gt;Never gets interrupted mid-calculation&lt;/li&gt;
&lt;li&gt;Is managed entirely in userspace (no kernel involvement)&lt;/li&gt;
&lt;li&gt;Shares resources through the event loop&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Ruby's GVL: Why Fibers Make Even More Sense
&lt;/h3&gt;

&lt;p&gt;Ruby's Global VM Lock (GVL) means only one thread can execute Ruby code at a time. Threads are preempted after a 100ms time quantum.&lt;/p&gt;

&lt;p&gt;This creates an interesting dynamic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="c1"&gt;# CPU work: Threads don't help much due to GVL&lt;/span&gt;
&lt;span class="n"&gt;threads&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;times&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
  &lt;span class="no"&gt;Thread&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;calculate_fibonacci&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="c1"&gt;# Takes about the same time as sequential execution!&lt;/span&gt;

&lt;span class="c1"&gt;# I/O work: Threads do parallelize (GVL released during I/O)&lt;/span&gt;
&lt;span class="n"&gt;threads&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;times&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
  &lt;span class="no"&gt;Thread&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="no"&gt;Net&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;HTTP&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uri&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="c1"&gt;# Takes 1/4 the time of sequential execution&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But here's the thing: if threads only help with I/O anyway, &lt;em&gt;why pay their overhead&lt;/em&gt;?&lt;/p&gt;

&lt;h3&gt;
  
  
  The I/O Multiplexing Advantage
&lt;/h3&gt;

&lt;p&gt;This is where fibers truly shine. Threads use a "one thread, one I/O operation" model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Traditional threading approach&lt;/span&gt;
&lt;span class="n"&gt;thread1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;Thread&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;socket1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;  &lt;span class="c1"&gt;# Blocks this thread&lt;/span&gt;
&lt;span class="n"&gt;thread2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;Thread&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;socket2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;  &lt;span class="c1"&gt;# Blocks this thread&lt;/span&gt;
&lt;span class="n"&gt;thread3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;Thread&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;socket3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;  &lt;span class="c1"&gt;# Blocks this thread&lt;/span&gt;
&lt;span class="c1"&gt;# Need 3 threads for 3 concurrent I/O operations&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Fibers use I/O multiplexing -- one thread monitors &lt;em&gt;all&lt;/em&gt; I/O:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Async's approach (simplified)&lt;/span&gt;
&lt;span class="no"&gt;Async&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
  &lt;span class="c1"&gt;# One thread, many I/O operations&lt;/span&gt;
  &lt;span class="n"&gt;task1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;Async&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;socket1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;  &lt;span class="c1"&gt;# Registers with selector&lt;/span&gt;
  &lt;span class="n"&gt;task2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;Async&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;socket2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;  &lt;span class="c1"&gt;# Registers with selector&lt;/span&gt;
  &lt;span class="n"&gt;task3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;Async&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;socket3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;  &lt;span class="c1"&gt;# Registers with selector&lt;/span&gt;

  &lt;span class="c1"&gt;# Event loop uses epoll/kqueue to monitor ALL sockets&lt;/span&gt;
  &lt;span class="c1"&gt;# Resumes fibers as data becomes available&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The kernel (via &lt;code&gt;epoll&lt;/code&gt;, &lt;code&gt;kqueue&lt;/code&gt;, or &lt;code&gt;io_uring&lt;/code&gt;) can monitor thousands of file descriptors with a single system call. No thread-per-connection needed.&lt;/p&gt;
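
&lt;p&gt;Ruby's portable expression of this idea is &lt;code&gt;IO.select&lt;/code&gt;; async's event loop does the same job through those more scalable kernel interfaces (a sketch -- the sockets and handler are hypothetical):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;# One call asks the kernel: "wake me when ANY of these has data"
# socket1..socket3 are hypothetical connected sockets
readable, _writable, _errored = IO.select([socket1, socket2, socket3])

readable.each do |io|
  handle_data(io.read_nonblock(4096))  # handle_data is hypothetical
end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;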

&lt;h3&gt;
  
  
  Why Fibers Win: The Complete Picture
&lt;/h3&gt;

&lt;p&gt;Let's look at real benchmark data comparing fibers to threads&lt;sup&gt;1&lt;/sup&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance Advantages (Ruby 3.4 data)&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;20x faster allocation&lt;/strong&gt;: Creating a fiber takes ~3μs vs ~80μs for a thread&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;10x faster context switching&lt;/strong&gt;: Fiber switches in ~0.1μs vs ~1.3μs for threads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;15x higher throughput&lt;/strong&gt;: ~80,000 vs ~5,000 requests/second&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But the real advantage is &lt;strong&gt;scalability&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Fewer OS Resources&lt;/strong&gt;: Fibers are managed in userspace, avoiding kernel overhead&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Efficient Scheduling&lt;/strong&gt;: No kernel involvement means less overhead&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;I/O Multiplexing&lt;/strong&gt;: One thread monitors thousands of I/O operations via &lt;code&gt;epoll&lt;/code&gt;/&lt;code&gt;kqueue&lt;/code&gt;/&lt;code&gt;io_uring&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GVL-Friendly&lt;/strong&gt;: Cooperative scheduling works naturally with Ruby's concurrency model&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource Sharing&lt;/strong&gt;: Database connections and memory pools are naturally shared&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;While memory usage is comparable between fibers and threads, fibers don't consume kernel resources. You can create vastly more fibers than threads, switch between them faster, and manage them more efficiently while monitoring thousands of connections -- all from userspace.&lt;/p&gt;
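
&lt;p&gt;The difference is easy to demonstrate (a minimal sketch; under the async scheduler even &lt;code&gt;Kernel#sleep&lt;/code&gt; yields instead of blocking):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;require 'async'

Async do
  10_000.times.map do
    Async { sleep 1 }   # each fiber parks in the event loop, costing almost nothing
  end.each(&amp;amp;:wait)
end
# Finishes in roughly 1 second total -- one thread, 10,000 concurrent waits
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;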

&lt;h2&gt;
  
  
  How Async Solves Every LLM Challenge
&lt;/h2&gt;

&lt;p&gt;Remember those four problems? Here's how async addresses each one (a concrete sketch follows the list):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;No More Slot Starvation&lt;/strong&gt;: Fibers are created on-demand and destroyed immediately. No fixed worker pools.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared Resources&lt;/strong&gt;: One process with a few pooled database connections can handle thousands of conversations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Improved Performance&lt;/strong&gt;: 20x faster to create, 10x faster to switch, 15x higher throughput (a synthetic upper bound).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Massively Improved Scalability&lt;/strong&gt;: 10,000+ concurrent fibers? No problem. The OS doesn't even know they exist.&lt;/li&gt;
&lt;/ol&gt;
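
&lt;p&gt;Here's the streaming job from the opening, no longer pinning a worker slot (a sketch reusing that example's &lt;code&gt;chat&lt;/code&gt;, &lt;code&gt;message&lt;/code&gt;, and &lt;code&gt;broadcast_chunk&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;Async do
  # The fiber yields while waiting for each token, so the same thread
  # is free to run thousands of other conversations in between
  chat.ask(message) do |chunk|
    broadcast_chunk(chunk)
  end
end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;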

&lt;h2&gt;
  
  
  Ruby's Async Ecosystem
&lt;/h2&gt;

&lt;p&gt;The beauty of Ruby's &lt;a href="https://github.com/socketry/async" rel="noopener noreferrer"&gt;async&lt;/a&gt; lies in its transparency. Unlike Python, which requires &lt;code&gt;async&lt;/code&gt;/&lt;code&gt;await&lt;/code&gt; everywhere, Ruby code just works:&lt;/p&gt;

&lt;h3&gt;
  
  
  The Foundation: The &lt;a href="https://github.com/socketry/async" rel="noopener noreferrer"&gt;async&lt;/a&gt; Gem
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="nb"&gt;require&lt;/span&gt; &lt;span class="s1"&gt;'async'&lt;/span&gt;
&lt;span class="nb"&gt;require&lt;/span&gt; &lt;span class="s1"&gt;'net/http'&lt;/span&gt;

&lt;span class="c1"&gt;# This code handles 1000 concurrent requests&lt;/span&gt;
&lt;span class="c1"&gt;# Using ONE thread and minimal memory&lt;/span&gt;
&lt;span class="no"&gt;Async&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
  &lt;span class="n"&gt;responses&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;times&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
    &lt;span class="no"&gt;Async&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
      &lt;span class="n"&gt;uri&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;URI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"https://api.openai.com/v1/chat/completions"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="c1"&gt;# Net::HTTP automatically yields during I/O&lt;/span&gt;
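      &lt;span class="c1"&gt;# (data and headers are assumed defined elsewhere: request payload and auth headers)&lt;/span&gt;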
      &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;Net&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;HTTP&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uri&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="no"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;body&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="ss"&gt;:wait&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="c1"&gt;# All 1000 requests complete concurrently&lt;/span&gt;
  &lt;span class="n"&gt;process_responses&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No callbacks. No promises. No async/await keywords. Just Ruby code that scales.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why RubyLLM Just Works™
&lt;/h3&gt;

&lt;p&gt;Here's the thing that made me smile when I discovered it: &lt;a href="https://rubyllm.com" rel="noopener noreferrer"&gt;RubyLLM&lt;/a&gt; gets async performance &lt;em&gt;for free&lt;/em&gt;. No special RubyLLM-async version needed. No code changes to the library. No configuration. Nothing.&lt;/p&gt;

&lt;p&gt;Why? Because RubyLLM uses &lt;code&gt;Net::HTTP&lt;/code&gt; under the hood. When you wrap RubyLLM calls in an Async block, &lt;code&gt;Net::HTTP&lt;/code&gt; automatically yields during network I/O, allowing thousands of concurrent LLM conversations to happen on a single thread.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="c1"&gt;# This is all you need for concurrent LLM calls&lt;/span&gt;
&lt;span class="no"&gt;Async&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
  &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;times&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
    &lt;span class="no"&gt;Async&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
      &lt;span class="c1"&gt;# RubyLLM automatically becomes non-blocking&lt;/span&gt;
      &lt;span class="c1"&gt;# because Net::HTTP knows how to yield to fibers&lt;/span&gt;
      &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;RubyLLM&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ask&lt;/span&gt; &lt;span class="s2"&gt;"Explain quantum computing"&lt;/span&gt;
      &lt;span class="nb"&gt;puts&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;content&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="ss"&gt;:wait&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is Ruby at its best. Libraries that follow conventions get superpowers without even trying. It just works because it was built on solid foundations.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Rest of the Ecosystem
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/socketry/falcon" rel="noopener noreferrer"&gt;Falcon&lt;/a&gt;&lt;/strong&gt;: Multi-process, multi-fiber web server built for streaming&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/socketry/async-job" rel="noopener noreferrer"&gt;async-job&lt;/a&gt;&lt;/strong&gt;: Background job processing using fibers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/socketry/async-cable" rel="noopener noreferrer"&gt;async-cable&lt;/a&gt;&lt;/strong&gt;: ActionCable replacement with fiber-based concurrency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/socketry/async-http" rel="noopener noreferrer"&gt;async-http&lt;/a&gt;&lt;/strong&gt;: Full-featured HTTP client with streaming support&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;... and many more available from &lt;a href="https://github.com/orgs/socketry/repositories" rel="noopener noreferrer"&gt;Socketry&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Migrate Your Rails App to Async
&lt;/h2&gt;

&lt;p&gt;The migration requires almost no code changes:&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Update Your Gemfile
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Gemfile&lt;/span&gt;
&lt;span class="c1"&gt;# Comment out thread-based gems&lt;/span&gt;
&lt;span class="c1"&gt;# gem "puma"&lt;/span&gt;
&lt;span class="c1"&gt;# gem "sidekiq" / "good_job" / "solid_queue"&lt;/span&gt;
&lt;span class="c1"&gt;# gem "solid_cable"&lt;/span&gt;

&lt;span class="c1"&gt;# Add async gems&lt;/span&gt;
&lt;span class="n"&gt;gem&lt;/span&gt; &lt;span class="s2"&gt;"falcon"&lt;/span&gt;
&lt;span class="n"&gt;gem&lt;/span&gt; &lt;span class="s2"&gt;"async-job-adapter-active_job"&lt;/span&gt;
&lt;span class="n"&gt;gem&lt;/span&gt; &lt;span class="s2"&gt;"async-cable"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: One Configuration Line
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="c1"&gt;# config/application.rb&lt;/span&gt;
&lt;span class="nb"&gt;require&lt;/span&gt; &lt;span class="s2"&gt;"async/cable"&lt;/span&gt;

&lt;span class="c1"&gt;# config/environments/production.rb&lt;/span&gt;
&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;active_job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;queue_adapter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="ss"&gt;:async_job&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: There's No Step 3!
&lt;/h3&gt;

&lt;p&gt;Your existing jobs work unchanged. Your channels don't need updates.&lt;/p&gt;

&lt;p&gt;Just deploy and watch. You'll get better performance, more capacity, and faster response times.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Use What
&lt;/h2&gt;

&lt;p&gt;Let's be practical -- async isn't always the answer, and the two models combine well (see the sketch after these lists):&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use threads for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CPU-intensive work&lt;/li&gt;
&lt;li&gt;Tasks needing true isolation&lt;/li&gt;
&lt;li&gt;Legacy C extensions that aren't fiber-safe&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use async for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I/O-bound operations&lt;/li&gt;
&lt;li&gt;API calls&lt;/li&gt;
&lt;li&gt;WebSockets, SSE, and other forms of streaming&lt;/li&gt;
&lt;li&gt;LLM applications&lt;/li&gt;
&lt;/ul&gt;
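
&lt;p&gt;From inside an async task you can hand CPU-heavy work to a real thread so the event loop stays responsive (a minimal sketch; &lt;code&gt;expensive_calculation&lt;/code&gt; and &lt;code&gt;use_results&lt;/code&gt; are hypothetical):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;require 'async'
require 'net/http'

Async do
  # I/O-bound work stays on fibers
  page = Async { Net::HTTP.get(URI("https://example.com")) }

  # CPU-bound work goes to a thread so it can't starve the event loop
  crunched = Thread.new { expensive_calculation }

  use_results(page.wait, crunched.value)
end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;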

&lt;h2&gt;
  
  
  A New Chapter for Ruby
&lt;/h2&gt;

&lt;p&gt;After years in Python's async world, I've seen what happens when a language forces a syntax change on its community as the price of async concurrency. Libraries fragment. Codebases split. Developers struggle with new syntax and concepts.&lt;/p&gt;

&lt;p&gt;Ruby chose a different path -- and it's the right one.&lt;/p&gt;

&lt;p&gt;We're witnessing Ruby's next evolution. Not through breaking changes or ecosystem splits, but through thoughtful additions that make our existing code better. The async ecosystem that seemed unnecessary when compared to traditional threading suddenly becomes essential when you hit the right use case.&lt;/p&gt;

&lt;p&gt;LLM applications are that use case. The combination of long-lived connections, streaming responses, and massive concurrency creates the perfect storm where async's benefits become undeniable.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/ioquatix" rel="noopener noreferrer"&gt;Samuel Williams&lt;/a&gt; and the &lt;a href="https://github.com/socketry/async" rel="noopener noreferrer"&gt;async&lt;/a&gt; community have given us incredible tools. Unlike Python, you don't have to rewrite everything to use it.&lt;/p&gt;

&lt;p&gt;For those building the next generation of AI-powered applications, &lt;a href="https://github.com/socketry/async" rel="noopener noreferrer"&gt;async&lt;/a&gt; Ruby isn't just an option -- it's a competitive advantage. Lower costs, better performance, simpler operations, and you keep your existing codebase.&lt;/p&gt;

&lt;p&gt;The future is concurrent. The future is streaming. The future is &lt;a href="https://github.com/socketry/async" rel="noopener noreferrer"&gt;async&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;And in Ruby, that future works with the code you already have.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;&lt;a href="https://rubyllm.com" rel="noopener noreferrer"&gt;RubyLLM&lt;/a&gt; powers &lt;a href="https://chatwithwork.com" rel="noopener noreferrer"&gt;Chat with Work&lt;/a&gt; in production with thousands of concurrent AI conversations using &lt;a href="https://github.com/socketry/async" rel="noopener noreferrer"&gt;async&lt;/a&gt;. Want elegant AI integration in Ruby? Check out &lt;a href="https://rubyllm.com" rel="noopener noreferrer"&gt;RubyLLM&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Special thanks to &lt;a href="https://github.com/ioquatix" rel="noopener noreferrer"&gt;Samuel Williams&lt;/a&gt; for reviewing this post and providing the &lt;a href="https://github.com/socketry/performance/tree/adfd780c6b4842b9534edfa15e383e5dfd4b4137/fiber-vs-thread" rel="noopener noreferrer"&gt;fiber-vs-thread benchmarks&lt;/a&gt; that substantiate these performance claims.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Join the conversation:&lt;/strong&gt; I'll be speaking about async Ruby and AI at &lt;a href="https://2025.euruko.org/" rel="noopener noreferrer"&gt;EuRuKo 2025&lt;/a&gt;, &lt;a href="https://sfruby.com/" rel="noopener noreferrer"&gt;San Francisco Ruby Conference 2025&lt;/a&gt;, and &lt;a href="https://rubyconfth.com/" rel="noopener noreferrer"&gt;RubyConf Thailand 2026&lt;/a&gt;. Let's build the future together.&lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;&lt;a href="https://github.com/ioquatix" rel="noopener noreferrer"&gt;Samuel Williams&lt;/a&gt;' &lt;a href="https://github.com/socketry/performance/tree/adfd780c6b4842b9534edfa15e383e5dfd4b4137/fiber-vs-thread" rel="noopener noreferrer"&gt;fiber-vs-thread performance comparison&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>ruby</category>
      <category>rails</category>
      <category>ai</category>
      <category>performance</category>
    </item>
    <item>
      <title>RubyLLM 1.3.0: Just When You Thought the Developer Experience Couldn't Get Any Better 🎉</title>
      <dc:creator>Carmine Paolino</dc:creator>
      <pubDate>Tue, 03 Jun 2025 16:03:37 +0000</pubDate>
      <link>https://dev.to/crmne/rubyllm-130-just-when-you-thought-the-developer-experience-couldnt-get-any-better-1b98</link>
      <guid>https://dev.to/crmne/rubyllm-130-just-when-you-thought-the-developer-experience-couldnt-get-any-better-1b98</guid>
      <description>&lt;p&gt;RubyLLM 1.3.0 is here, and just when you thought the developer experience couldn't get any better, we've made attachments ridiculously simple, added isolated configuration contexts, and officially ended the era of manual model tracking.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Attachment Revolution: From Complex to Magical
&lt;/h2&gt;

&lt;p&gt;The biggest transformation in 1.3.0 is how stupidly simple attachments have become. Before, you had to categorize every file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="c1"&gt;# The old way (still works, but why would you?)&lt;/span&gt;
&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ask&lt;/span&gt; &lt;span class="s2"&gt;"What's in this image?"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;with: &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="ss"&gt;image: &lt;/span&gt;&lt;span class="s2"&gt;"diagram.png"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ask&lt;/span&gt; &lt;span class="s2"&gt;"Describe this meeting"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;with: &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="ss"&gt;audio: &lt;/span&gt;&lt;span class="s2"&gt;"meeting.wav"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ask&lt;/span&gt; &lt;span class="s2"&gt;"Summarize this document"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;with: &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="ss"&gt;pdf: &lt;/span&gt;&lt;span class="s2"&gt;"contract.pdf"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now? Just throw files at it and RubyLLM figures out the rest:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="c1"&gt;# The new way - pure magic ✨&lt;/span&gt;
&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ask&lt;/span&gt; &lt;span class="s2"&gt;"What's in this file?"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;with: &lt;/span&gt;&lt;span class="s2"&gt;"diagram.png"&lt;/span&gt;
&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ask&lt;/span&gt; &lt;span class="s2"&gt;"Describe this meeting"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;with: &lt;/span&gt;&lt;span class="s2"&gt;"meeting.wav"&lt;/span&gt;
&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ask&lt;/span&gt; &lt;span class="s2"&gt;"Summarize this document"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;with: &lt;/span&gt;&lt;span class="s2"&gt;"contract.pdf"&lt;/span&gt;

&lt;span class="c1"&gt;# Multiple files? Mix and match without thinking&lt;/span&gt;
&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ask&lt;/span&gt; &lt;span class="s2"&gt;"Analyze these files"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;with: &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="s2"&gt;"quarterly_report.pdf"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s2"&gt;"sales_chart.jpg"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s2"&gt;"customer_interview.wav"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s2"&gt;"meeting_notes.txt"&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# URLs work too&lt;/span&gt;
&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ask&lt;/span&gt; &lt;span class="s2"&gt;"What's in this image?"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;with: &lt;/span&gt;&lt;span class="s2"&gt;"https://example.com/chart.png"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is what the Ruby way looks like: you shouldn't have to think about file types when the computer can figure it out for you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Configuration Contexts: Multi-Tenancy Made Trivial
&lt;/h2&gt;

&lt;p&gt;The global configuration pattern works beautifully for simple applications. But the moment you need different configurations for different customers, environments, or features, that simplicity becomes a liability.&lt;/p&gt;

&lt;p&gt;We could have forced everyone to pass configuration objects around. We could have built some complex dependency injection system. Instead, we built contexts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Each tenant gets their own isolated configuration&lt;/span&gt;
&lt;span class="n"&gt;tenant_context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;RubyLLM&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;context&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
  &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;openai_api_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tenant&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;openai_key&lt;/span&gt;
  &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;anthropic_api_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tenant&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;anthropic_key&lt;/span&gt;
  &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;request_timeout&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;180&lt;/span&gt; &lt;span class="c1"&gt;# This tenant needs more time&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="c1"&gt;# Use it without polluting the global namespace&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tenant_context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"Process this customer request..."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Global configuration remains untouched&lt;/span&gt;
&lt;span class="no"&gt;RubyLLM&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"This still uses your default settings"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Simple, elegant, Ruby-like. Your multi-tenant application doesn't need architectural gymnastics. Each context is isolated, thread-safe, and garbage-collected when you're done with it.&lt;/p&gt;

&lt;p&gt;Perfect for multi-tenancy, A/B testing different providers, environment targeting, or any situation where you need temporary configuration changes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Local Models with Ollama
&lt;/h2&gt;

&lt;p&gt;Your development machine shouldn't need to phone home to OpenAI every time you want to test something:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="no"&gt;RubyLLM&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;configure&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
  &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ollama_api_base&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'http://localhost:11434/v1'&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="c1"&gt;# Same API, different model&lt;/span&gt;
&lt;span class="n"&gt;chat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;RubyLLM&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;model: &lt;/span&gt;&lt;span class="s1"&gt;'mistral'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;provider: &lt;/span&gt;&lt;span class="s1"&gt;'ollama'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"Explain Ruby's eigenclass"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Perfect for privacy-sensitive applications, offline development, or just experimenting with local models. It matters for testing, for compliance, for costs. Sometimes the best model is the one running on your own hardware.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hundreds of Models via OpenRouter
&lt;/h2&gt;

&lt;p&gt;Access models from dozens of providers through a single API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="no"&gt;RubyLLM&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;configure&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
  &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;openrouter_api_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;ENV&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'OPENROUTER_API_KEY'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="c1"&gt;# Access any model through OpenRouter&lt;/span&gt;
&lt;span class="n"&gt;chat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;RubyLLM&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;model: &lt;/span&gt;&lt;span class="s1"&gt;'anthropic/claude-3.5-sonnet'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;provider: &lt;/span&gt;&lt;span class="s1"&gt;'openrouter'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One API key, hundreds of models. Simple.&lt;/p&gt;

&lt;h2&gt;
  
  
  The End of Manual Model Tracking
&lt;/h2&gt;

&lt;p&gt;Here's where things get revolutionary. We've partnered with &lt;a href="https://parsera.org" rel="noopener noreferrer"&gt;Parsera&lt;/a&gt; to create a single source of truth for LLM capabilities and pricing. When you run &lt;code&gt;RubyLLM.models.refresh!&lt;/code&gt;, you're now pulling from the &lt;a href="https://api.parsera.org/v1/llm-specs" rel="noopener noreferrer"&gt;Parsera API&lt;/a&gt; - a continuously updated registry that scrapes model information directly from provider documentation.&lt;/p&gt;

&lt;p&gt;No more manually updating capabilities files every time OpenAI changes its pricing. No more hunting through documentation: context windows, pricing, capabilities, supported modalities - it's all there, always current.&lt;/p&gt;

&lt;p&gt;However, providers don't always document everything perfectly. We discovered plenty of older models still available through their APIs but missing from official docs. That's why we kept our &lt;code&gt;capabilities.rb&lt;/code&gt; files - they fill in the gaps for models the Parsera API doesn't cover yet. Between the two sources, we support virtually every model worth using.&lt;/p&gt;
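
&lt;p&gt;In practice it's a single call (&lt;code&gt;refresh!&lt;/code&gt; is the real API mentioned above; the lookup and attribute names below are hypothetical illustrations of inspecting the result):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;# Pull the current registry (Parsera API plus capabilities.rb gap-fills)
RubyLLM.models.refresh!

# Hypothetical inspection of what comes back
model = RubyLLM.models.find('gpt-4o')
puts model.context_window
puts model.pricing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;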

&lt;p&gt;&lt;a href="https://dev.to/standard-api-llm-capabilities-pricing-live/"&gt;Read more about this revolution in my previous blog post&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rails Integration That Finally Feels Like Rails
&lt;/h2&gt;

&lt;p&gt;The Rails integration now works seamlessly with ActiveStorage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Enable attachment support in your Message model&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Message&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;ApplicationRecord&lt;/span&gt;
  &lt;span class="n"&gt;acts_as_message&lt;/span&gt;
  &lt;span class="n"&gt;has_many_attached&lt;/span&gt; &lt;span class="ss"&gt;:attachments&lt;/span&gt; &lt;span class="c1"&gt;# Add this line&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="c1"&gt;# Handle file uploads directly from forms&lt;/span&gt;
&lt;span class="n"&gt;chat_record&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"Analyze this upload"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;with: &lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="ss"&gt;:uploaded_file&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# Work with existing ActiveStorage attachments&lt;/span&gt;
&lt;span class="n"&gt;chat_record&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"What's in my document?"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;with: &lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;profile_document&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Process multiple uploads at once&lt;/span&gt;
&lt;span class="n"&gt;chat_record&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"Review these files"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;with: &lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="ss"&gt;:files&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We've brought the Rails attachment handling to complete parity with the plain Ruby implementation. No more "it works in Ruby but not in Rails" friction.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fine-Tuned Embeddings
&lt;/h2&gt;

&lt;p&gt;Custom embedding dimensions let you optimize for your specific use case:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Generate compact embeddings for memory-constrained environments&lt;/span&gt;
&lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;RubyLLM&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="s2"&gt;"Ruby is a programmer's best friend"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="ss"&gt;model: &lt;/span&gt;&lt;span class="s2"&gt;"text-embedding-3-small"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="ss"&gt;dimensions: &lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;  &lt;span class="c1"&gt;# Instead of the default 1536&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Enterprise OpenAI Support
&lt;/h2&gt;

&lt;p&gt;Organization and project IDs are now supported for enterprise deployments:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="no"&gt;RubyLLM&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;configure&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
  &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;openai_api_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;ENV&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'OPENAI_API_KEY'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;openai_organization_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;ENV&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'OPENAI_ORG_ID'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;openai_project_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;ENV&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'OPENAI_PROJECT_ID'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Rock-Solid Foundation
&lt;/h2&gt;

&lt;p&gt;We now officially support and test against:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Ruby 3.1 to 3.4&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Rails 7.1 to 8.0&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your favorite Ruby version is covered.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ship It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="n"&gt;gem&lt;/span&gt; &lt;span class="s1"&gt;'ruby_llm'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'1.3.0'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As always, we've maintained full backward compatibility. Your existing code continues to work exactly as before, but now with magical attachment handling and powerful new capabilities.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Growing Community
&lt;/h2&gt;

&lt;p&gt;This release includes contributions from 13 new contributors, with merged PRs covering everything from foreign key improvements to HTTP proxy support. The Ruby community continues to amaze me with its thoughtfulness and attention to detail.&lt;/p&gt;

&lt;p&gt;Special thanks to &lt;a class="mentioned-user" href="https://dev.to/papgmez"&gt;@papgmez&lt;/a&gt;, @timaro, @rhys117, @bborn, @xymbol, &lt;a class="mentioned-user" href="https://dev.to/roelbondoc"&gt;@roelbondoc&lt;/a&gt;, @max-power, @itstheraj, &lt;a class="mentioned-user" href="https://dev.to/stadia"&gt;@stadia&lt;/a&gt;, &lt;a class="mentioned-user" href="https://dev.to/tpaulshippy"&gt;@tpaulshippy&lt;/a&gt;, @Sami-Tanquary, and @seemiller for making this release possible. [mentions are based on GitHub handles and may not be accurate on dev.to]&lt;/p&gt;

&lt;h2&gt;
  
  
  This Is Just The Beginning
&lt;/h2&gt;

&lt;p&gt;Want to shape RubyLLM's future? &lt;a href="https://github.com/crmne/ruby_llm" rel="noopener noreferrer"&gt;Join us on GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The future of AI development in Ruby has never been brighter. ✨&lt;/p&gt;

</description>
      <category>ruby</category>
      <category>rails</category>
      <category>ai</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Introducing RubyLLM 1.0: A Beautiful Way to Work with AI</title>
      <dc:creator>Carmine Paolino</dc:creator>
      <pubDate>Tue, 11 Mar 2025 10:19:48 +0000</pubDate>
      <link>https://dev.to/crmne/introducing-rubyllm-10-a-beautiful-way-to-work-with-ai-5p0</link>
      <guid>https://dev.to/crmne/introducing-rubyllm-10-a-beautiful-way-to-work-with-ai-5p0</guid>
      <description>&lt;p&gt;I released &lt;a href="https://rubyllm.com" rel="noopener noreferrer"&gt;RubyLLM&lt;/a&gt; 1.0 today. It's a library that makes working with AI in Ruby feel natural, elegant, and enjoyable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;AI should be accessible to Ruby developers without ceremony or complexity. When building &lt;a href="https://chatwithwork.com" rel="noopener noreferrer"&gt;Chat with Work&lt;/a&gt;, I wanted to simply write:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="n"&gt;chat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;RubyLLM&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;
&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ask&lt;/span&gt; &lt;span class="s2"&gt;"What's the best way to learn Ruby?"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And have it work - regardless of which model I'm using, whether I'm streaming responses, or which provider I've chosen. The API should get out of the way and let me focus on building my product.&lt;/p&gt;

&lt;h2&gt;
  
  
  The RubyLLM Philosophy
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Beautiful interfaces matter.&lt;/strong&gt; Ruby has always been about developer happiness. Your AI code should reflect that same elegance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Global methods for core operations - simple and expressive&lt;/span&gt;
&lt;span class="n"&gt;chat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;RubyLLM&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;
&lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;RubyLLM&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"Ruby is elegant"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;RubyLLM&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;paint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"a sunset over mountains"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Method chaining that reads like English&lt;/span&gt;
&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'gpt-4o-mini'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_temperature&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"What's your favorite gem?"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Convention over configuration.&lt;/strong&gt; You shouldn't need to think about providers or remember multiple APIs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Don't care which model? We'll use a sensible default&lt;/span&gt;
&lt;span class="n"&gt;chat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;RubyLLM&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;

&lt;span class="c1"&gt;# Want a specific model? Just say so&lt;/span&gt;
&lt;span class="n"&gt;chat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;RubyLLM&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;model: &lt;/span&gt;&lt;span class="s1"&gt;'claude-3-5-sonnet'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Switch to GPT mid-conversation? Just as easy&lt;/span&gt;
&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'gpt-4o-mini'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Practical tools for real work.&lt;/strong&gt; Function calling should be Ruby-like, not JSON Schema gymnastics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Search&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;RubyLLM&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;Tool&lt;/span&gt;
  &lt;span class="n"&gt;description&lt;/span&gt; &lt;span class="s2"&gt;"Searches our knowledge base"&lt;/span&gt;
  &lt;span class="n"&gt;param&lt;/span&gt; &lt;span class="ss"&gt;:query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;desc: &lt;/span&gt;&lt;span class="s2"&gt;"Search query"&lt;/span&gt;
  &lt;span class="n"&gt;param&lt;/span&gt; &lt;span class="ss"&gt;:limit&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;type: :integer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;desc: &lt;/span&gt;&lt;span class="s2"&gt;"Max results"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;required: &lt;/span&gt;&lt;span class="kp"&gt;false&lt;/span&gt;

  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:,&lt;/span&gt; &lt;span class="ss"&gt;limit: &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="no"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="ss"&gt;:title&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="c1"&gt;# Clean, practical, Ruby-like&lt;/span&gt;
&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="no"&gt;Search&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;ask&lt;/span&gt; &lt;span class="s2"&gt;"Find our product documentation"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Streaming done right.&lt;/strong&gt; No need to parse different formats for different providers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ask&lt;/span&gt; &lt;span class="s2"&gt;"Write a story about Ruby"&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
  &lt;span class="c1"&gt;# No provider-specific parsing - we handle that for you&lt;/span&gt;
  &lt;span class="nb"&gt;print&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;content&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Token tracking by default.&lt;/strong&gt; Cost management should be built-in:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ask&lt;/span&gt; &lt;span class="s2"&gt;"Explain Ruby modules"&lt;/span&gt;
&lt;span class="nb"&gt;puts&lt;/span&gt; &lt;span class="s2"&gt;"This cost &lt;/span&gt;&lt;span class="si"&gt;#{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;input_tokens&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;output_tokens&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; tokens"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Meaningful error handling.&lt;/strong&gt; Production apps need proper error types:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="k"&gt;begin&lt;/span&gt;
  &lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ask&lt;/span&gt; &lt;span class="s2"&gt;"Question"&lt;/span&gt;
&lt;span class="k"&gt;rescue&lt;/span&gt; &lt;span class="no"&gt;RubyLLM&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;RateLimitError&lt;/span&gt;
  &lt;span class="nb"&gt;puts&lt;/span&gt; &lt;span class="s2"&gt;"Rate limited - backing off"&lt;/span&gt;
&lt;span class="k"&gt;rescue&lt;/span&gt; &lt;span class="no"&gt;RubyLLM&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;UnauthorizedError&lt;/span&gt;
  &lt;span class="nb"&gt;puts&lt;/span&gt; &lt;span class="s2"&gt;"API key issue - check configuration"&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Rails as a first-class citizen.&lt;/strong&gt; Because most of us are building Rails apps:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Chat&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;ApplicationRecord&lt;/span&gt;
  &lt;span class="n"&gt;acts_as_chat&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="n"&gt;chat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;Chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;model_id: &lt;/span&gt;&lt;span class="s1"&gt;'gemini-2.0-flash'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ask&lt;/span&gt; &lt;span class="s2"&gt;"Hello"&lt;/span&gt;  &lt;span class="c1"&gt;# Everything persisted automatically&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
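

&lt;p&gt;Under the hood, &lt;code&gt;acts_as_chat&lt;/code&gt; pairs with companion message and tool call models. A minimal sketch of that wiring, following the Rails integration's default model names (your schema may differ):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;class Message &amp;lt; ApplicationRecord
  acts_as_message  # belongs to a chat; stores role, content, and token counts
end

class ToolCall &amp;lt; ApplicationRecord
  acts_as_tool_call  # records tool invocations made during the conversation
end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;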



&lt;h2&gt;
  
  
  Built for Real Applications
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://rubyllm.com" rel="noopener noreferrer"&gt;RubyLLM&lt;/a&gt; supports the features you actually need in production:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Vision&lt;/span&gt;
&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ask&lt;/span&gt; &lt;span class="s2"&gt;"What's in this image?"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;with: &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="ss"&gt;image: &lt;/span&gt;&lt;span class="s2"&gt;"photo.jpg"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# PDFs&lt;/span&gt;
&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ask&lt;/span&gt; &lt;span class="s2"&gt;"Summarize this document"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;with: &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="ss"&gt;pdf: &lt;/span&gt;&lt;span class="s2"&gt;"contract.pdf"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Audio&lt;/span&gt;
&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ask&lt;/span&gt; &lt;span class="s2"&gt;"Transcribe this recording"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;with: &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="ss"&gt;audio: &lt;/span&gt;&lt;span class="s2"&gt;"meeting.wav"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Multiple files&lt;/span&gt;
&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ask&lt;/span&gt; &lt;span class="s2"&gt;"Compare these diagrams"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;with: &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="ss"&gt;image: &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"chart1.png"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"chart2.png"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Minimal Dependencies
&lt;/h2&gt;

&lt;p&gt;Just Faraday, Zeitwerk, and a tiny event parser. No dependency hell.&lt;/p&gt;

&lt;h2&gt;
  
  
  Used in Production Today
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://rubyllm.com" rel="noopener noreferrer"&gt;RubyLLM&lt;/a&gt; powers &lt;a href="https://chatwithwork.com" rel="noopener noreferrer"&gt;Chat with Work&lt;/a&gt; in production. It's battle-tested with real-world AI integrations and built for serious applications.&lt;/p&gt;

&lt;p&gt;Give it a try today: &lt;code&gt;gem install ruby_llm&lt;/code&gt;&lt;/p&gt;
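
&lt;p&gt;From there, setup is a few lines. A minimal sketch, assuming you keep an OpenAI key in your environment (other providers configure the same way):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;require 'ruby_llm'

RubyLLM.configure do |config|
  config.openai_api_key = ENV.fetch('OPENAI_API_KEY')
end

chat = RubyLLM.chat
puts chat.ask("Say hello in one sentence.").content
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;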

&lt;p&gt;More details at &lt;a href="https://rubyllm.com" rel="noopener noreferrer"&gt;rubyllm.com&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ruby</category>
      <category>rails</category>
      <category>ai</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
