DEV Community: Tracy Atteberry

Cognitive debt: A personal story and practical advice

Tracy Atteberry — Fri, 27 Mar 2026 21:31:37 +0000

A few days ago I built a tool called JoyConf: a real-time audience feedback system that lets speakers see emoji reactions floating up in the corner of their presentation while they're talking. It was a fun, simple idea and I was pretty excited about it.

I built it in Elixir and Phoenix LiveView, which was a deliberate choice. I mostly write Ruby these days, but this project felt like a good excuse to dig into Elixir and LiveView. Learn something new, build something useful. Two birds, one stone.

I drove the overall design and implementation planning, did active code review, and contributed everywhere I could. But for the Elixir and LiveView specifics, I leaned heavily on Claude. The syntax, architecture decisions, and debugging were Claude's domain, because I simply didn't know enough yet to own them. The tool worked and it seemed to work well. But when I got to the end and looked at the codebase, I realized I didn't really understand the parts Claude had built. I had reviewed the code as carefully as I could, but reviewing code in a language you don't know, implementing an unfamiliar architecture, only gets you so far. The understanding of those pieces had mostly stayed with Claude.

That's cognitive debt. And LLMs are very good at generating it.

What cognitive debt actually is

Cognitive debt accumulates when you defer the thinking that should happen now. It's different from technical debt, which is about the code itself (shortcuts taken, tests skipped, abstractions that didn't quite work out). Cognitive debt is about the reasoning that never happened. The mental model that never got built. The decision that got made without being understood.

Like financial debt, it doesn't feel like much at first. You're moving fast, things are working, you're shipping. The bill comes later, when you need to debug something you can't reason about, extend a system you don't understand, or explain a decision you never actually made. And to be clear, cognitive debt was around long before LLMs. LLMs just magnify the problem.

LLMs make this disturbingly easy

LLM-generated code is mostly right. That's what makes it dangerous.

If the code were obviously wrong, you'd catch it. You'd dig in, figure out what went wrong, learn something in the process. But LLM output is usually plausible, often correct, and just coherent enough that it passes the vibe check. You run the tests. They pass. You move on. The mental model of how it works never gets built, because you never needed it... until you do.

There's a specific failure mode worth naming here. Using an LLM to move faster on things you understand is leverage. Using it to skip understanding altogether is debt. Those feel identical in the short term. Both result in code getting written. One leaves you with understanding you can build on; the other leaves you with output you're stuck with.

And it catches everyone. Junior developers accept LLM output because they don't know enough to question it. Senior developers accept it because they had a hundred PRs today and the code looks fine, so they assume it is fine. Both skip the reasoning step. The result is a codebase full of decisions nobody on the team can actually defend.

Back to JoyConf

When I realized I'd built something I didn't fully understand, I asked Claude to write me an explainer document. Not a summary, but an actual explanation of the architecture, the key concepts, why certain decisions were made, how the pieces fit together. Something I could read, learn from, and come back to later.

It wasn't a magic pill. I was starting from near zero with Elixir and LiveView, so one document didn't make me an expert. But it meaningfully closed the gap. I understood the code better than I did before. I had something to refer back to. And I started to feel like the codebase was actually mine.

That experience shaped how I think about using LLMs for coding. The tool works fine. How you engage with it makes all the difference.

Practical ways to keep the debt in check

Ask for explanations before you accept the code. Don't just run it. Ask the LLM to walk you through what it did and why. This takes an extra minute and catches a surprising number of cases where the code is technically correct but built on assumptions you don't share.

Ask for an explainer document for bigger decisions. Architecture choices, non-obvious patterns, anything you're going to need to live with for a while: ask the LLM to write it up in plain language. Keep it in the repo. Future you will thank present you.

Use Simon Willison's "showboat" approach to document what was built. The showboat tool "creates executable demo documents that show and prove an agent's work" (kind of like a Jupyter notebook, but just markdown). The LLM walks through its output with explanation and context. It's a good way to produce living documentation that captures not just what the code does, but why it was written that way. It isn't suitable for every use case, though.

Read the LLM's thinking, especially when debugging. Many LLMs can expose their reasoning process. When you're stuck on a bug or trying to understand a decision, asking the LLM to think out loud before answering is one of the fastest ways to build genuine understanding rather than just getting an answer.

Write the tests yourself. Even if you let the LLM write the implementation, writing the tests forces you to reason about the behavior you actually want. It's one of the best ways to make sure the mental model gets built. Of course, it takes more time and it's not always possible, like with JoyConf where I didn't know enough about the Elixir environment to write effective tests. But when you can, it's a great way to stay in the driver's seat.

Slow down at decision points. LLMs are fast. That's the point. But speed can accelerate debt. When you hit a fork in the road (an architectural choice, a tradeoff, a "there are a few ways to do this" moment) pause and do the reasoning yourself, even if you use the LLM to help you think it through.

The goal isn't to use LLMs less

LLMs are genuinely useful and I don't plan to stop using them. The goal is to stay in the driver's seat mentally, using them for leverage rather than as a substitute for thinking.

A healthy LLM workflow and a debt-generating one can look identical from the outside. The difference shows up later, when you need to understand, maintain, or extend what you built. If you finish each session understanding what you built and why, you're using the tool well. If you don't, you're taking out a loan.

And like financial debt, cognitive debt is a lot easier to avoid than to pay off.

JoyConf: a live emoji reaction app for presentations

Tracy Atteberry — Thu, 26 Mar 2026 16:45:18 +0000

I have a talk coming up in a few weeks. It got rescheduled once, which gave me extra time to prepare. The presentation itself is finished, so I used that time to build an application that I think will make the talk more fun and engaging for the audience.

I wanted to add some live audience interaction. The usual options, Slido, Mentimeter, Poll Everywhere, are fine, but they're designed around Q&A and polls. What I actually wanted was simpler and more visual: live emoji reactions that float up as an overlay on my slides while I'm presenting. None of the existing tools seem to do that. So I built my own, and I called it JoyConf.

It also gave me a reason to finally learn some Elixir.

What it does

The flow is simple:

You create a talk in the admin panel and get a QR code
You put the QR code on your title slide
Attendees scan it and land on a page with a pre-defined set of emoji buttons: ❤️ 😂 🙋🏻 👏 🤯, etc.
They tap a button, and the emoji floats up on their screen in real time
A Chrome extension running on your laptop picks up the same broadcast and overlays the emoji directly in the lower right corner of your Google Slides presentation

That's it. No accounts for attendees, no app to install, no configuration to fiddle with. Scan, tap, react.

Why Elixir and Phoenix

I write Ruby day to day, so the syntax was familiar enough. But Elixir runs on the BEAM, the Erlang virtual machine, which was built for soft real-time systems with lots of concurrent connections. Phoenix LiveView lets you build interactive, server-rendered UIs over WebSockets without writing much JavaScript. And Phoenix PubSub gives you a message bus that lets any process broadcast to any other, regardless of what type of process it is.

All three of those things are important for a system where dozens of phones are sending events to a server that needs to fan them out to a slide presentation and all the phones (for their own consolidated live view) in under a second.

LiveView is pretty magical, by the way. You define your UI as a function of state, and Phoenix handles keeping the browser in sync. You get WebSocket-backed interactivity without writing a single-page app.

How it works under the hood

There are three clients talking to one Phoenix server.

Attendee phones connect via Phoenix LiveView, which manages the WebSocket lifecycle automatically. When an attendee taps an emoji, it fires a phx-click event to the server. The server checks a rate limiter (one reaction per session every 3 seconds) and, if allowed, broadcasts the event to a PubSub topic keyed by the talk slug: "reactions:my-talk".

The Chrome extension connects via a Phoenix Channel, a lower-level WebSocket primitive. The extension can't use LiveView because it's not a web page; it just needs to receive events. The ReactionChannel is subscribed to the same PubSub topic, so when the attendee's broadcast lands, it gets forwarded to the extension automatically.

The admin browser is a standard LiveView protected by HTTP Basic Auth. You create a talk, the server generates a slug from the title, and you get a QR code pointing at the attendee URL.

The PubSub layer is what makes the architecture clean. TalkLive doesn't need to know the Chrome extension exists. It just broadcasts on the topic, and PubSub delivers it to everyone subscribed, whether that's a LiveView process, a Channel process, or both.

Here's the entire broadcast path in code:

def handle_event("react", %{"emoji" => emoji}, socket) do
  if RateLimiter.allow?(socket.id) do
    Endpoint.broadcast!("reactions:#{socket.assigns.talk.slug}", "new_reaction", %{emoji: emoji})
  end
  {:noreply, socket}
end

That's the whole thing. Five lines. PubSub does the fan-out.

Rate limiting with ETS

The rate limiter is a GenServer that owns an ETS table. ETS is an in-memory key/value store built into the BEAM runtime. It's extremely fast and, crucially, you can configure it for concurrent reads without going through the GenServer process. This matters because with many attendees tapping at once, you don't want all those requests queuing up behind a single process.

The table stores {session_id, last_reaction_at}. The allow?/1 function looks up the session, checks if enough time has passed, and updates the timestamp atomically. No database round-trip, no lock contention.
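For readers more at home in Ruby, here is a rough analogue of that check-and-update step. It is a sketch, not JoyConf's Elixir code: the class name, the injectable clock, and the Mutex (standing in for the atomic update) are assumptions made for illustration. Only the 3-second window and the session-to-timestamp mapping come from the description above.

```ruby
# Illustrative Ruby analogue of an allow?-style rate limiter.
# Not JoyConf's code: the names and structure here are hypothetical.
class ReactionRateLimiter
  WINDOW_SECONDS = 3 # one reaction per session every 3 seconds

  # The clock is injectable so the behaviour can be checked without sleeping.
  def initialize(clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @clock = clock
    @last_reaction_at = {} # session_id => time of the last allowed reaction
    @lock = Mutex.new      # stands in for the atomic check-and-update
  end

  def allow?(session_id)
    now = @clock.call
    @lock.synchronize do
      last = @last_reaction_at[session_id]
      return false if last && (now - last) < WINDOW_SECONDS

      @last_reaction_at[session_id] = now
      true
    end
  end
end

fake_now = 100.0
limiter = ReactionRateLimiter.new(clock: -> { fake_now })

results = []
results << limiter.allow?("session-a") # first tap
fake_now += 1
results << limiter.allow?("session-a") # 1 second later: too soon
results << limiter.allow?("session-b") # another session has its own window
fake_now += 3
results << limiter.allow?("session-a") # 4 seconds after the first tap
p results # => [true, false, true, true]
```

Note that a rejected tap does not reset the window: only allowed reactions update the timestamp.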

There's also a client-side rate limit in JavaScript: buttons are disabled for 3 seconds with a visible countdown timer. That's just UX; the real enforcement is on the server.

The Chrome extension

The extension has two parts: a popup where you enter the talk slug once, and a content script injected into Google Slides pages that handles the actual connection and overlay.

The content script connects a Phoenix WebSocket client to the server, joins the reactions:${slug} channel, and listens for new_reaction messages. When one arrives, it spawns a floating <span> element with a CSS animation that drifts up and fades out.

It never works on the first try

Double emojis on slides. Early on, every reaction appeared twice on the speaker's screen. The bug was that the Chrome extension was subscribing to PubSub directly as well as receiving the Channel push. Two subscriptions, two deliveries. The fix was removing the redundant subscription and letting the Channel handle delivery exclusively.

Fullscreen mode swallows the overlay. When you go fullscreen in Google Slides, the browser creates a new stacking context. Any position: fixed element on <body> becomes invisible. The fix was listening for fullscreenchange events and re-parenting the overlay <div> into document.fullscreenElement when it fires:

document.addEventListener("fullscreenchange", () => {
  const overlay = document.getElementById("joyconf-overlay");
  if (document.fullscreenElement) {
    document.fullscreenElement.appendChild(overlay);
  } else {
    document.body.appendChild(overlay);
  }
});

Not obvious, but straightforward once you know it.

Button clicks were being swallowed. The initial implementation disabled the emoji buttons immediately on click. That blocked the phx-click handler from firing, so the server never received the event. The fix was deferring the button disable to a setTimeout(..., 0), which lets the click event propagate before the buttons get disabled.

The Chrome extension's origin. Chrome extensions run from a chrome-extension:// origin, which Phoenix's check_origin protection rejects by default. One line in endpoint.ex fixes it:

socket "/socket", JoyconfWeb.UserSocket,
  websocket: [check_origin: false]

The tech stack, briefly

Concern	Choice
Language / framework	Elixir / Phoenix LiveView
Real-time	Phoenix PubSub + Channels
Database	PostgreSQL (Fly.io managed)
Rate limiting	ETS-backed GenServer
QR codes	`eqrcode` hex package
Deployment	Fly.io
Extension	Chrome Manifest V3

The database has one table: talks. Reactions are ephemeral and never persisted. If the server restarts, in-flight reactions are lost, which is fine.

What's next

JoyConf is an MVP. It does one thing and it's ready for me to use at my talk. It's not productized, it's not super-polished, and it's likely not easy for the average non-technical user to deploy.

Things that might be nice to add in the future:

Ease of deployment for less technical users
Reaction analytics tied to slides so you can see which parts of your talk landed
Support for other presentation tools beyond Google Slides

For now, I'm just going to go use it.

The source code is on GitHub. If you want to run your own instance, the README should have everything you need.

Building MockOpenAI: a weekend MVP story

Tracy Atteberry — Thu, 26 Mar 2026 16:25:36 +0000

Last weekend I built and published a Ruby gem. From idea to published thing in about four days. Here's how it went, including the part where I had to reconsider whether I'd built something useful at all.

Friday: 20 ideas, one bet

I'm between jobs right now. Good position to be in if you like building things, terrible position to be in if you like eating. So I've been running a little experiment: each weekend, pick one small idea and see if I can ship it.

Friday's job was to generate and pick an idea. I sat down with an AI and brainstormed 20 candidates. Developer tools. Content products. Micro-SaaS. I narrowed it down to one: a local mock server for testing OpenAI-compatible APIs.

The pitch to myself was simple. I write a lot of Ruby. I write a lot of tests. Testing code that talks to LLMs is a bit annoying because, while the happy path is easy to mock, some of the failure modes and edge cases can be more of a pain. There had to be a better way.

By end of day Friday I had a repo, a gemspec, and a clear plan.

Saturday: build day

The core idea for MockOpenAI is that it's a real HTTP server running on localhost, not a mock object or a stub. Your application code talks to it exactly the way it would talk to the LLM provider in production. You just point your client at http://localhost:4000 instead of the usual API endpoint.

That distinction makes a difference, I think. With a real HTTP server you can test things that object-level mocking can't touch: actual TCP timeouts, truncated streaming responses, retry headers. The kind of failure modes that bite you in production but never show up in your test suite because you stubbed them away.
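To make that concrete, here is a small stdlib-only sketch of a client hitting a socket that accepts the connection and then says nothing. This is not MockOpenAI's implementation; `start_silent_server` and `request_outcome` are hypothetical helpers, and the timings are arbitrary. The point is that the timeout comes from the real network stack rather than from a stubbed exception.

```ruby
require "net/http"
require "socket"

# Hypothetical stand-in for a mock server's timeout behaviour:
# accept the connection, send nothing, then hang up.
def start_silent_server(hold_seconds: 1.0)
  server = TCPServer.new("127.0.0.1", 0) # port 0 lets the OS pick a free port
  thread = Thread.new do
    client = server.accept
    sleep hold_seconds
    client.close
  rescue IOError, SystemCallError
    # The listener was closed while we were waiting; nothing to clean up.
  end
  [server, thread]
end

def request_outcome(port, read_timeout:)
  http = Net::HTTP.new("127.0.0.1", port, nil) # nil: ignore any proxy settings
  http.open_timeout = 1
  http.read_timeout = read_timeout
  http.max_retries = 0 # Net::HTTP retries idempotent requests once by default
  http.get("/v1/chat/completions")
  :responded
rescue Net::ReadTimeout
  :read_timeout
end

server, thread = start_silent_server
outcome = request_outcome(server.addr[1], read_timeout: 0.2)
server.close
thread.kill
puts outcome # prints "read_timeout"
```

Net::ReadTimeout is a subclass of Timeout::Error, so a spec that expects Timeout::Error would also match this failure.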

The architecture is deliberately simple. A Rack server reads a shared JSON state file on every request. Your tests write rules to that file. The server is stateless. No client wrapping, no monkey-patching, no magic.
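A minimal sketch of that shape, assuming a rule format of `match` and `response` keys like the one in the examples in this post. The class name `FileBackedMock` and the 404 fallback are made up for illustration, and the real gem surely does more. It follows the Rack calling convention (`call(env)` returning status, headers, body) without requiring the rack gem.

```ruby
require "json"
require "stringio"
require "tempfile"

# Hypothetical sketch of a file-backed mock endpoint; not MockOpenAI's code.
class FileBackedMock
  def initialize(state_path)
    @state_path = state_path
  end

  def call(env)
    rules = JSON.parse(File.read(@state_path)) # re-read on every request
    request = JSON.parse(env["rack.input"].read)
    prompt = request.fetch("messages", []).map { |m| m["content"].to_s }.join("\n")

    rule = rules.find { |r| prompt.include?(r["match"]) }
    return json_response(404, { "error" => "no rule matched" }) unless rule

    json_response(200, {
      "choices" => [{ "message" => { "role" => "assistant", "content" => rule["response"] } }]
    })
  end

  private

  def json_response(status, payload)
    [status, { "content-type" => "application/json" }, [JSON.generate(payload)]]
  end
end

# The test side writes rules to the shared file...
state = Tempfile.new(["mock_state", ".json"])
state.write(JSON.generate([{ "match" => "Step 1", "response" => "OK" }]))
state.flush

# ...and the server side picks them up on the next request.
app = FileBackedMock.new(state.path)
body = JSON.generate({ "messages" => [{ "role" => "user", "content" => "Run Step 1" }] })
status, _headers, chunks = app.call({ "rack.input" => StringIO.new(body) })
reply = JSON.parse(chunks.join).dig("choices", 0, "message", "content")
puts "#{status} #{reply}" # prints "200 OK"
```

Because the server keeps nothing in memory between requests, a test can change behaviour simply by rewriting the file.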

Here's what using it looks like:

it "handles a rate limit", :mock_openai_rate_limit do
  expect { MyService.call_llm("Hello") }.to raise_error(RubyLLM::RateLimitError)
end

it "handles mixed outcomes", :mock_openai do
  MockOpenAI.set_responses([
    { match: "Step 1", response: "OK" },
    { match: "Step 2", failure_mode: :timeout },
    { match: "Step 3", response: "Done" }
  ])

  expect(MyService.step1).to eq("OK")
  expect { MyService.step2 }.to raise_error(Timeout::Error)
  expect(MyService.step3).to eq("Done")
end

The failure modes are:

Mode	What it does
`:timeout`	Sleeps then closes the connection without responding
`:rate_limit`	Returns HTTP 429 with an OpenAI-format error body
`:malformed_json`	Returns truncated JSON that causes a parse error in your client
`:internal_error`	Returns HTTP 500
`:truncated_stream`	Sends partial SSE chunks then closes the connection

Saturday was productive. By end of day I had all the core classes written TDD-style: Config, State, Matcher, ResponseBuilder, TemplateRenderer, all five failure mode classes. The code was written, the tests passed, and life was good.

Sunday: documentation and shipping

Sunday was docs day. I set up a Jekyll site and wrote the README. I added an Anthropic endpoint too, because my personal projects use both.

I also migrated the first personal project to use MockOpenAI. That went smoothly. The HTTP-level fidelity made a few tests a little more honest than they'd been with simple stubs at the client level.

Monday: the uncomfortable question

Monday I migrated a second personal project. This one used a helper module I'd written a while back to stub LLM calls. Just a few lines of code. It worked fine for that project.

I stared at that code for a while. Here it is:

module RubyLLMMocks
  def mock_ruby_llm_chat(content: nil, error: nil)
    if error
      allow(RubyLLM).to receive(:chat).and_raise(error)
    else
      mock_response = instance_double(
        RubyLLM::Message,
        content: content,
        inspect: "RubyLLM::Message(content: #{content.inspect})"
      )
      mock_chat = instance_double(RubyLLM::Chat, ask: mock_response)
      allow(RubyLLM).to receive(:chat).and_return(mock_chat)
    end
  end
end

# Example:
it "handles general ruby_llm errors gracefully" do
  error = RubyLLM::Error.new(nil, "Unexpected error")
  mock_ruby_llm_chat(error: error)

  generator = described_class.new(options)

  expect { generator.generate }
    .to output(/Error.*Unexpected error/m).to_stdout
    .and raise_error(SystemExit)
end

That's it. Fifteen lines, no gem dependency, works perfectly for a project that uses RubyLLM as a wrapper. The error case is handled with and_raise. Clean.

So the question had to be asked: did I just build a solution in search of a problem?

After sitting with it, I don't think so. But I did have to sharpen my thinking about when MockOpenAI actually earns its place versus when a helper method is the right call.

The short version: if you're using a wrapper library like RubyLLM for all your LLM calls, and you only need happy-path responses and exception simulation in unit tests, the 15-line helper is probably the right answer. It's less to maintain, has no extra dependencies, and does the job.

MockOpenAI is the right call when you need the actual HTTP layer in the picture. When you're using the raw OpenAI or Anthropic client directly. When you're running integration or system tests that make real HTTP connections. When you need to test what happens when TCP actually times out, or when a streaming response gets cut off halfway through, or when your retry logic processes a Retry-After header.

Those are real problems. They're just not every project's problems. I added a "when not to use" page to the docs to make the tradeoffs explicit.
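As an example of the kind of client code that only gets exercised when real headers are in play, here is a sketch of a Retry-After parser. The helper name and the 1-second default are assumptions for illustration. The header itself can carry either a number of seconds or an HTTP-date.

```ruby
require "time"

# Hypothetical helper: how long should a client wait before retrying?
# Retry-After may be a number of seconds or an HTTP-date.
def retry_delay_seconds(retry_after, now: Time.now, default: 1.0)
  return default if retry_after.nil? || retry_after.strip.empty?

  value = retry_after.strip
  return value.to_i.to_f if value.match?(/\A\d+\z/)

  # A date in the past means "retry now", so never return a negative delay.
  [Time.httpdate(value) - now, 0.0].max
rescue ArgumentError
  default # unparseable header: fall back rather than crash
end

now = Time.utc(2026, 1, 1, 0, 0, 0)
p retry_delay_seconds("5")                                       # => 5.0
p retry_delay_seconds("Thu, 01 Jan 2026 00:00:30 GMT", now: now) # => 30.0
p retry_delay_seconds("soon")                                    # => 1.0
```

A helper stubbed at the client level never sees this header, which is why a branch like the HTTP-date one tends to go untested.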

What I'd do differently

One thing I'd change: I'd research the problem space a bit more to make sure I had a better understanding of the problem scope and existing solutions. (Especially ones I wrote myself!) The tool is solid, but I made some assumptions about the breadth of problems it would solve. That's a classic weekend MVP trap, I suppose. You're so focused on building that you skip the bit of due diligence you think you don't need.

The gem is published, the docs are live, and it works. The scope is narrower than I originally thought, but the use case is real. That feels like an honest result for a long weekend.

MockOpenAI is a Ruby gem for testing OpenAI-compatible and Anthropic APIs locally. References:

Source: github.com/grymoire7/mockopenai.
Docs: grymoire7.github.io/mockopenai.
Landing page: tracyatteberry.com/mockopenai.

Documentation search: When your Rails app needs which approach

Tracy Atteberry — Mon, 16 Mar 2026 21:10:43 +0000

Imagine spending an afternoon watching a developer tear out a perfectly functional search feature. They replace their solid Postgres full-text search with a vector database and RAG pipeline because, well, that's what you're supposed to do now, right? The new system is slower, costs them $200 a month in OpenAI API calls, and returns worse results for their specific use case.

This keeps happening. The AI hype cycle has convinced developers that every search problem needs embeddings, vector databases, and agentic loops. Sometimes that's true. Often it's not.

Let's build the same feature four different ways and see what each approach actually costs you.

The use case: searching Ruby gem documentation

We're building a search feature for a documentation site that indexes about 5,000 Ruby gems. Each gem has README content, API documentation, and code examples. Users ask questions like "How do I upload files to S3?" or "What's the best gem for handling webhooks?"

This is a realistic scale for most Rails apps. Not Google-sized, not trivial. Just normal business software that needs to help users find information.

I'll show you four implementations, each adding a layer of complexity. We'll look at the code, measure the actual costs, and figure out when the added complexity pays for itself.

Approach 1: Traditional search with AI summarization

Start with what works. Postgres full-text search has been solving search problems since before your junior devs were born.

class Documentation < ApplicationRecord
  include PgSearch::Model

  pg_search_scope :search_content,
    against: {
      title: 'A',
      content: 'B',
      code_examples: 'C'
    },
    using: {
      tsearch: { prefix: true },
      trigram: { threshold: 0.3 }
    }
end

class DocumentationSearcher
  def initialize(query)
    @query = query
  end

  def search
    results = Documentation.search_content(@query).limit(10)

    {
      results: results,
      summary: summarize_results(results)
    }
  end

  private

  def summarize_results(results)
    return nil if results.empty?

    prompt = <<~PROMPT
      User question: #{@query}

      Here are the top search results:
      #{format_results(results)}

      Provide a concise answer to the user's question based on these results.
      If the results don't contain relevant information, say so.
    PROMPT

    client = OpenAI::Client.new
    response = client.chat(
      parameters: {
        model: "gpt-4o-mini",
        messages: [{ role: "user", content: prompt }],
        temperature: 0.3
      }
    )

    # Assuming the ruby-openai gem: responses are parsed JSON with string keys.
    response.dig("choices", 0, "message", "content")
  end

  def format_results(results)
    results.map.with_index do |doc, i|
      "#{i + 1}. #{doc.title}\n#{doc.content.truncate(500)}"
    end.join("\n\n")
  end
end

This approach does one database query and one API call. The search uses proven Postgres features: full-text search with ranking, trigram matching for typos, and weighted fields. Then we send the top results to GPT-4o mini to generate a summary.

Cost per query:

Database: ~5ms
OpenAI API: ~$0.002 (about 1,000 input tokens, 200 output tokens)
Total latency: ~800ms

When this fails:

User queries are conceptually different from how docs are written ("async jobs" versus "background processing")
Important information is buried in the middle of long documents
You need to combine information from multiple sources

The failure mode is subtle. Traditional search ranks by keyword matching and field weights. When users phrase questions differently than your documentation uses terminology, they get poor results. You can't fix this with better prompt engineering because the LLM never sees the relevant documents.

Approach 2: Basic RAG with vector embeddings

RAG (Retrieval Augmented Generation) means embedding your documents as vectors, embedding the user's query as a vector, and finding documents with similar embeddings. This solves the terminology mismatch problem.

# We need to store embeddings
class AddEmbeddingsToDocumentation < ActiveRecord::Migration[7.1]
  def change
    add_column :documentations, :embedding, :vector, limit: 1536
    add_index :documentations, :embedding, using: :hnsw, opclass: :vector_cosine_ops
  end
end

class Documentation < ApplicationRecord
  has_neighbors :embedding

  # In after_save, content_changed? is already false; check the saved change instead.
  after_save :generate_embedding, if: :saved_change_to_content?

  private

  def generate_embedding
    text = "#{title}\n\n#{content}"

    client = OpenAI::Client.new
    response = client.embeddings(
      parameters: {
        model: "text-embedding-3-small",
        input: text
      }
    )

    # Assuming the ruby-openai gem: responses are parsed JSON with string keys.
    update_column(:embedding, response.dig("data", 0, "embedding"))
  end
end

class RagSearcher
  def initialize(query, client: OpenAI::Client.new)
    @query = query
    @client = client
  end

  def search
    query_embedding = generate_embedding(@query)
    results = Documentation.nearest_neighbors(:embedding, query_embedding, distance: "cosine").limit(5)

    {
      results: results,
      answer: generate_answer(results)
    }
  end

  private

  def generate_embedding(text)
    response = @client.embeddings(
      parameters: {
        model: "text-embedding-3-small",
        input: text
      }
    )

    # Assuming the ruby-openai gem: responses are parsed JSON with string keys.
    response.dig("data", 0, "embedding")
  end

  def generate_answer(results)
    context = results.map { |doc| "#{doc.title}\n#{doc.content}" }.join("\n\n---\n\n")

    prompt = <<~PROMPT
      Answer the user's question based only on the following documentation:

      #{context}

      Question: #{@query}

      If you cannot answer based on the provided documentation, say so clearly.
    PROMPT

    response = @client.chat(
      parameters: {
        model: "gpt-4o-mini",
        messages: [{ role: "user", content: prompt }],
        temperature: 0.3
      }
    )

    response.dig("choices", 0, "message", "content")
  end
end

Now we're making two API calls per search: one to embed the query, one to generate the answer. We're also using pgvector with HNSW indexing for fast similarity search.

Cost per query:

Database: ~15ms (vector similarity search)
OpenAI embeddings API: ~$0.00001 (negligible)
OpenAI chat API: ~$0.003
Total latency: ~1,200ms

When this works better:
The semantic matching is noticeably better. A query about "background jobs" will match documents about "async processing" and "delayed tasks" because the concepts are similar in vector space. This is a real improvement over keyword search.
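If "similar in vector space" sounds abstract, this toy sketch shows the ranking idea in plain Ruby. The three-dimensional vectors are made up for illustration. Real embeddings come from the model and have far more dimensions (1,536 in the migration above), and pgvector does this math inside the database.

```ruby
# Toy illustration of cosine-similarity ranking. All vectors are invented.
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm = ->(v) { Math.sqrt(v.sum { |x| x * x }) }
  dot / (norm.call(a) * norm.call(b))
end

DOCS = {
  "Async processing with workers" => [0.9, 0.1, 0.0],
  "Delayed tasks and scheduling"  => [0.8, 0.3, 0.1],
  "Uploading files to S3"         => [0.0, 0.2, 0.9]
}.freeze

# Returns the titles of the `limit` documents closest to the query vector.
def nearest(query_vector, docs, limit:)
  docs.max_by(limit) { |_title, vector| cosine_similarity(query_vector, vector) }
      .map(&:first)
end

query = [1.0, 0.2, 0.0] # pretend embedding for "background jobs"
p nearest(query, DOCS, limit: 2)
# => ["Async processing with workers", "Delayed tasks and scheduling"]
```

None of the titles share a word with "background jobs", yet the two job-related documents rank first because their vectors point the same way as the query's.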

When this still fails:

Complex questions requiring information from many documents
Multi-step reasoning ("compare these two approaches")
Questions where the first retrieval doesn't get the right context

Here's a concrete failure case I hit: A user asks "What's the difference between Sidekiq and Good Job?" The vector search retrieves five documents, but three are about Sidekiq and two are about Good Job. The LLM tries to compare them but doesn't have complete information about both systems. It hedges and gives a vague answer.

Approach 3: Agentic RAG with adaptive retrieval

This is where we let the LLM decide if it needs more information. Instead of one retrieve-then-generate pass, we give the LLM tools to search again, rephrase queries, or combine results.

class AgenticRagSearcher
  MAX_ITERATIONS = 3

  def initialize(query, client: OpenAI::Client.new)
    @query = query
    @client = client
    @conversation_history = []
    @retrieved_docs = []
  end

  def search
    initial_prompt = <<~PROMPT
      You are a helpful assistant that searches Ruby gem documentation.

      User question: #{@query}

      You have access to these tools:
      - search_docs(query): Search documentation with a semantic query
      - get_related(doc_id): Get documents related to a specific document

      Think step by step. You can search multiple times with different queries
      to gather complete information before answering.
    PROMPT

    @conversation_history << { role: "user", content: initial_prompt }

    MAX_ITERATIONS.times do
      response = call_llm_with_tools

      break if response[:finish_reason] == "stop"

      if response[:tool_calls]
        handle_tool_calls(response[:tool_calls])
      end
    end

    {
      results: @retrieved_docs.uniq,
      # The final assistant message comes from the API response, so its keys are strings.
      answer: @conversation_history.last["content"],
      iterations: @conversation_history.length
    }
  end

  private

  def call_llm_with_tools
    response = @client.chat(
      parameters: {
        model: "gpt-4o",
        messages: @conversation_history,
        tools: tool_definitions,
        temperature: 0.3
      }
    )

    # Assuming the ruby-openai gem: responses are parsed JSON with string keys.
    message = response.dig("choices", 0, "message")
    @conversation_history << message

    {
      finish_reason: response.dig("choices", 0, "finish_reason"),
      tool_calls: message["tool_calls"]
    }
  end

  def tool_definitions
    [
      {
        type: "function",
        function: {
          name: "search_docs",
          description: "Search documentation using semantic search",
          parameters: {
            type: "object",
            properties: {
              query: {
                type: "string",
                description: "The search query"
              }
            },
            required: ["query"]
          }
        }
      },
      {
        type: "function",
        function: {
          name: "get_related",
          description: "Get documents related to a specific document",
          parameters: {
            type: "object",
            properties: {
              doc_id: {
                type: "integer",
                description: "The ID of the document"
              }
            },
            required: ["doc_id"]
          }
        }
      }
    ]
  end

  def handle_tool_calls(tool_calls)
    results = tool_calls.map do |tool_call|
      # Tool calls arrive as parsed JSON, so they use string keys.
      function_name = tool_call.dig("function", "name")
      arguments = JSON.parse(tool_call.dig("function", "arguments") || "{}")

      result = case function_name
      when "search_docs"
        search_docs(arguments["query"])
      when "get_related"
        get_related_docs(arguments["doc_id"])
      else
        [] # unknown tool name: return no documents instead of nil
      end

      @retrieved_docs.concat(result)

      {
        role: "tool",
        tool_call_id: tool_call["id"],
        content: format_docs_for_llm(result)
      }
    end

    @conversation_history.concat(results)
  end

  def search_docs(query)
    embedding = generate_embedding(query)
    Documentation.nearest_neighbors(:embedding, embedding, distance: "cosine").limit(3)
  end

  def get_related_docs(doc_id)
    doc = Documentation.find(doc_id)
    Documentation
      .nearest_neighbors(:embedding, doc.embedding, distance: "cosine")
      .where.not(id: doc_id)
      .limit(3)
  end

  def generate_embedding(text)
    response = @client.embeddings(
      parameters: {
        model: "text-embedding-3-small",
        input: text
      }
    )

    response.dig(:data, 0, :embedding)
  end

  def format_docs_for_llm(docs)
    docs.map do |doc|
      {
        id: doc.id,
        title: doc.title,
        content: doc.content.truncate(1000)
      }
    end.to_json
  end
end

This is a real step up in complexity. We're now orchestrating multiple LLM calls with tool use. The LLM can search multiple times, explore related documents, and build up context before answering.

Cost per query:

Database: 30-90ms (multiple vector searches)
OpenAI embeddings API: $0.00002-0.00006 (2-6 embedding calls)
OpenAI chat API: $0.015-0.045 (3-5 LLM calls with larger context)
Total latency: 3-8 seconds

Notice the variance. Some queries get answered in one iteration. Complex ones might do three searches with different phrasings, explore related documents, and make five total LLM calls.

When this works better:
That comparison query from before ("What's the difference between Sidekiq and Good Job?") now works great. The LLM searches for "Sidekiq background jobs", gets those docs, then searches for "Good Job background jobs", gets those docs, then synthesizes a real comparison.

Multi-part questions work too. "How do I set up Stripe payments and handle webhooks?" triggers two separate searches that gather comprehensive information.

When this gets expensive:
Every query where the LLM decides it needs more information costs you 3-5x more. If your users ask a lot of complex questions, your API bill climbs fast.

The latency is also noticeable. Eight seconds feels slow in a web UI. You need to stream responses or show progress indicators.
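
Here's a rough sketch of what streaming could look like. It assumes the ruby-openai client, whose chat call accepts a `stream:` proc that receives each parsed chunk, and it assumes chunks arrive with string keys. `StreamBuffer` is a name I made up for illustration, and the wiring to the UI is left as a callback.

```ruby
# Collects streamed text deltas and forwards each one to a callback
# (for example, something that pushes text to the browser).
class StreamBuffer
  attr_reader :text

  def initialize(&on_delta)
    @text = +""
    @on_delta = on_delta
  end

  # Returns a proc in the shape the `stream:` parameter is assumed to expect:
  # it is called with each parsed chunk and its byte size.
  def to_proc
    proc do |chunk, _bytesize|
      delta = chunk.dig("choices", 0, "delta", "content")
      next if delta.nil? || delta.empty?

      @text << delta
      @on_delta&.call(delta)
    end
  end
end

# Hypothetical usage inside a searcher class:
#   buffer = StreamBuffer.new { |delta| print delta }
#   @client.chat(parameters: { model: "gpt-4o", messages: messages, stream: buffer.to_proc })
```

Users see the first words while the rest of the answer is still being generated, and `buffer.text` holds the full answer once the stream ends.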

Approach 4: Full conversational agent with external tools

Now we're building a real agent that can search your documentation, browse external sites, and maintain conversation context across multiple turns.

class DocumentationAgent
  def initialize(session_id, client: OpenAI::Client.new)
    @session_id = session_id
    @client = client
    @conversation_history = load_conversation_history
  end

  def chat(message)
    @conversation_history << { role: "user", content: message }

    loop do
      response = call_llm_with_tools

      break if response[:finish_reason] == "stop"

      if response[:tool_calls]
        handle_tool_calls(response[:tool_calls])
      else
        break
      end
    end

    save_conversation_history

    {
      response: @conversation_history.last[:content],
      sources: extract_sources
    }
  end

  private

  def call_llm_with_tools
    response = @client.chat(
      parameters: {
        model: "gpt-4o",
        messages: @conversation_history,
        tools: tool_definitions,
        temperature: 0.3
      }
    )

    message = response.dig(:choices, 0, :message)
    @conversation_history << message

    {
      finish_reason: response.dig(:choices, 0, :finish_reason),
      tool_calls: message[:tool_calls]
    }
  end

  def generate_embedding(text)
    response = @client.embeddings(
      parameters: {
        model: "text-embedding-3-small",
        input: text
      }
    )

    response.dig(:data, 0, :embedding)
  end

  def tool_definitions
    [
      {
        type: "function",
        function: {
          name: "search_internal_docs",
          description: "Search our Ruby gem documentation",
          parameters: {
            type: "object",
            properties: {
              query: { type: "string" }
            },
            required: ["query"]
          }
        }
      },
      {
        type: "function",
        function: {
          name: "fetch_external_url",
          description: "Fetch content from an external URL like GitHub or RubyGems.org",
          parameters: {
            type: "object",
            properties: {
              url: { type: "string" }
            },
            required: ["url"]
          }
        }
      },
      {
        type: "function",
        function: {
          name: "search_github",
          description: "Search GitHub repositories and code",
          parameters: {
            type: "object",
            properties: {
              query: { type: "string" }
            },
            required: ["query"]
          }
        }
      }
    ]
  end

  def handle_tool_calls(tool_calls)
    results = tool_calls.map do |tool_call|
      function_name = tool_call.dig(:function, :name)
      arguments = JSON.parse(tool_call.dig(:function, :arguments) || "{}")

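      result = case function_name
      when "search_internal_docs"
        search_internal_docs(arguments["query"])
      when "fetch_external_url"
        fetch_external_url(arguments["url"])
      when "search_github"
        search_github(arguments["query"])
      else
        # Tell the model the tool doesn't exist instead of returning nil
        { error: "Unknown tool: #{function_name}" }
      end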

      {
        role: "tool",
        tool_call_id: tool_call[:id],
        content: result.to_json
      }
    end

    @conversation_history.concat(results)
  end

  def search_internal_docs(query)
    embedding = generate_embedding(query)
    docs = Documentation.nearest_neighbors(:embedding, embedding, distance: "cosine").limit(5)

    docs.map { |d| { title: d.title, content: d.content.truncate(800), source: "internal", id: d.id } }
  end

  def fetch_external_url(url)
    # In production, use a proper HTTP client with timeouts and error handling
    response = HTTP.timeout(5).get(url)

    {
      url: url,
      content: extract_main_content(response.body.to_s).truncate(2000),
      source: "external"
    }
  rescue HTTP::Error => e
    { error: "Failed to fetch URL: #{e.message}" }
  end

  def search_github(query)
    # Use Octokit or similar
    client = Octokit::Client.new(access_token: ENV['GITHUB_TOKEN'])
    results = client.search_code(query, language: "ruby")

    results.items.first(3).map do |item|
      {
        name: item.name,
        repo: item.repository.full_name,
        url: item.html_url,
        source: "github"
      }
    end
  rescue Octokit::Error => e
    { error: "GitHub search failed: #{e.message}" }
  end

  def load_conversation_history
    cache_key = "agent_conversation:#{@session_id}"
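    # Symbolize keys so restored messages match the msg[:role] lookups used elsewhere
    JSON.parse(Rails.cache.read(cache_key) || "[]", symbolize_names: true)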
  end

  def save_conversation_history
    cache_key = "agent_conversation:#{@session_id}"
    # Keep last 10 messages to control context size
    trimmed_history = @conversation_history.last(10)
    Rails.cache.write(cache_key, trimmed_history.to_json, expires_in: 1.hour)
  end

  def extract_sources
    @conversation_history
      .select { |msg| msg[:role] == "tool" }
      .flat_map { |msg| JSON.parse(msg[:content] || "[]") }
      .select { |item| item.is_a?(Hash) && item["source"] }
      .uniq { |item| item["id"] || item["url"] || item["title"] }
  end
end

This is a different beast. We're maintaining conversation state, hitting external APIs, and letting the LLM orchestrate complex research tasks.
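
One thing worth sketching here: the `loop do` in `chat` runs until the model stops asking for tools, so a confused model can keep looping and keep billing you. Here's a minimal version with an iteration cap. `run_bounded`, `step`, and `on_tool_calls` are names I'm inventing for illustration; in the class above they would map to the loop, `call_llm_with_tools`, and `handle_tool_calls`.

```ruby
# Runs an agent step repeatedly, but never more than max_iterations times.
# `step` is any callable returning a hash with :finish_reason and :tool_calls.
# `on_tool_calls` is any callable that executes the requested tools.
def run_bounded(step:, on_tool_calls:, max_iterations: 5)
  max_iterations.times do |i|
    response = step.call
    return { status: :done, iterations: i + 1 } if response[:finish_reason] == "stop"
    return { status: :done, iterations: i + 1 } unless response[:tool_calls]

    on_tool_calls.call(response[:tool_calls])
  end

  # The model kept asking for tools; give up and let the caller decide what to show.
  { status: :limit_reached, iterations: max_iterations }
end
```

When the limit is reached you can return a partial answer or an apology. Either way the cost of a single turn stays bounded.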

Cost per conversation turn:

Database: 15-50ms
External API calls: 200-2000ms (GitHub, external sites)
OpenAI embeddings: $0.00001-0.00005
OpenAI chat: $0.02-0.15 (larger context windows, multiple turns)
Total latency: 5-15 seconds

When this is worth it:
You're building a research assistant or technical support bot where users have complex, multi-turn conversations. They ask follow-up questions, need you to check external sources, and expect the system to remember context.

A user might ask "What's the best gem for image processing?", then follow up with "Show me examples from the ImageMagick wrapper", then "Is there a more modern alternative?" The agent maintains context and can search different sources for each question.
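
Keeping that context has a catch. `save_conversation_history` keeps the last 10 messages, and a blind cut can leave the saved history starting with a tool result whose matching assistant tool call was trimmed away. As far as I know, the chat API rejects a tool message that doesn't follow the assistant message that requested it. A minimal sketch of a safer trim, where `trim_history` is a hypothetical helper that starts the window at a user message:

```ruby
# Keeps roughly the last `limit` messages, but always starts the window
# at a user message so tool results are never separated from their calls.
def trim_history(history, limit: 10)
  start = [history.length - limit, 0].max

  # Move forward to the first user message inside the window...
  first_user = (start...history.length).find { |i| history[i][:role] == "user" }
  # ...or, if the window has none, back to the most recent user message.
  first_user ||= history.rindex { |msg| msg[:role] == "user" } || 0

  history.drop(first_user)
end
```

The window can end up a little shorter or longer than `limit`. I'd take that trade over a request that fails on the next turn.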

When this is overkill:
Most search features. If users are doing one-off queries and moving on, you're paying for conversational capabilities they don't need.

Choosing your approach

I've built all four of these systems in production. Here's how I decide which to use.

Start with enhanced traditional search if:

You have well-written documentation with consistent terminology
Queries are mostly straightforward lookup tasks
You need predictable costs and latency
Your document corpus is under 10,000 items

The cost difference matters. At 1,000 queries per day, enhanced traditional search costs you $2/day. Basic RAG costs $3/day. Agentic RAG costs $15-45/day. A full agent costs $100-300/day.
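
To show where the agentic RAG figure comes from, here's the arithmetic as a small sketch. It uses only the per-query chat cost estimated earlier ($0.015-0.045) and ignores the embedding cost, which is a rounding error at this scale. The method name is mine.

```ruby
# Per-query chat cost range for agentic RAG, in dollars (estimate from earlier).
AGENTIC_CHAT_COST = (0.015..0.045)

# Scales a per-query cost range to a daily range for a given query volume.
def daily_cost_range(per_query_range, queries_per_day)
  low = (per_query_range.begin * queries_per_day).round(2)
  high = (per_query_range.end * queries_per_day).round(2)
  [low, high]
end

daily_cost_range(AGENTIC_CHAT_COST, 1_000) # => [15.0, 45.0]
```

Plug in your own query volume before you commit to an approach. The ranking of the four options doesn't change, but the absolute numbers might change your mind.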

Move to basic RAG when:

Users phrase questions differently than your docs
Keyword search returns poor matches for valid queries
You have good quality source documents
Your corpus is large enough that keyword search becomes unwieldy (50,000+ documents)

You'll know you need this when users complain that search doesn't work, and you look at their queries and think "we have docs about that, but they're using different words."

Move to agentic RAG when:

Users ask complex questions requiring multiple sources
You see patterns of users doing multiple searches in sequence
Simple RAG returns incomplete answers
You have budget for 3-5x higher API costs

Watch your analytics. If users do three searches in a row and then give up, they're manually doing what an agentic system would do automatically.
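
A rough way to spot that pattern in your logs, assuming you record a session id and a clicked flag per search. `SearchEvent` and `abandoned_sessions` are hypothetical names, and grouping by session is a simplification of "three searches in a row".

```ruby
SearchEvent = Struct.new(:session_id, :query, :clicked, keyword_init: true)

# Returns the ids of sessions with at least `min_searches` searches
# and not a single clicked result.
def abandoned_sessions(events, min_searches: 3)
  events
    .group_by(&:session_id)
    .select { |_id, evts| evts.length >= min_searches && evts.none?(&:clicked) }
    .keys
end
```

If a meaningful share of sessions shows up here, read the queries in them. They tell you what an agentic system would have to handle.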

Build a full agent when:

You're building a product feature, not just search
Users need multi-turn conversations with context
You need to integrate external data sources
You have engineering resources for proper tool integration and error handling

The engineering complexity here is significant. You need proper timeout handling, retry logic, conversation state management, and graceful degradation when external APIs fail. This is a feature, not a quick enhancement.
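
As a taste of what that involves, here's a minimal retry helper with exponential backoff and a fallback value for graceful degradation. `with_retries` and its parameters are my own illustration, not part of any library. The `sleeper` argument exists so the delay can be swapped out in tests.

```ruby
# Retries the block on the listed errors, waiting base_delay, then
# base_delay * 2, then base_delay * 4, and so on between attempts.
# Returns `fallback` once all attempts are used up.
def with_retries(attempts: 3, base_delay: 0.5, on: [StandardError],
                 fallback: nil, sleeper: ->(seconds) { sleep(seconds) })
  tries = 0
  begin
    tries += 1
    yield
  rescue *on
    return fallback if tries >= attempts

    sleeper.call(base_delay * (2**(tries - 1)))
    retry
  end
end

# Hypothetical usage around an external call:
#   with_retries(on: [HTTP::Error], fallback: { error: "source unavailable" }) do
#     fetch_external_url(url)
#   end
```

Returning a fallback hash instead of raising lets the LLM see that a source was unavailable and answer from what it already has.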

The implementation details that matter

Some practical considerations that aren't obvious from the code samples.

Chunking strategy for vector search:
Don't just embed entire documents. Break them into logical chunks. For documentation, consider chunking by section with overlap:

class DocumentationChunker
  CHUNK_SIZE = 1000 # characters
  OVERLAP = 200

  def chunk(document)
    sections = document.content.split(/^##\s+/)

    sections.flat_map do |section|
      break_into_overlapping_chunks(section)
    end
  end

  private

  def break_into_overlapping_chunks(text)
    chunks = []
    start = 0

    while start < text.length
      chunk_end = start + CHUNK_SIZE
      chunks << text[start...chunk_end]
      start += CHUNK_SIZE - OVERLAP
    end

    chunks
  end
end

This means each document generates multiple rows in your database with different embeddings. Your vector search returns chunks, not whole documents.
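
To make the overlap concrete, here's a stripped-down sketch that returns only the character ranges the chunker would cut, using the same 1000/200 settings. `chunk_ranges` is a hypothetical helper for illustration.

```ruby
CHUNK_SIZE = 1000 # characters
OVERLAP = 200

# Returns the character ranges a text of the given length is cut into.
def chunk_ranges(length)
  ranges = []
  start = 0

  while start < length
    ranges << (start...[start + CHUNK_SIZE, length].min)
    start += CHUNK_SIZE - OVERLAP
  end

  ranges
end

chunk_ranges(2400) # => [0...1000, 800...1800, 1600...2400]
```

Each chunk shares its first 200 characters with the end of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk. Note that a text of exactly 1000 characters yields a second chunk, 800...1000, that is fully contained in the first. You may want to drop tail chunks like that.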

Hybrid search combines the best of both:

def hybrid_search(query)
  # Keyword search results
  keyword_results = Documentation.search_content(query).limit(20)

  # Vector search results
  query_embedding = generate_embedding(query)
  vector_results = Documentation.nearest_neighbors(:embedding, query_embedding, distance: "cosine").limit(20)

  # Combine with reciprocal rank fusion
  combine_results(keyword_results, vector_results)
end

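def combine_results(keyword_results, vector_results)
  # Reciprocal rank fusion: each list contributes 1 / (k + rank),
  # with k = 60 and rank starting at 1.
  scores = Hash.new(0)

  keyword_results.each_with_index do |doc, i|
    scores[doc.id] += 1.0 / (60 + i + 1)
  end

  vector_results.each_with_index do |doc, i|
    scores[doc.id] += 1.0 / (60 + i + 1)
  end

  # Reload the documents and order them by fused score, best first
  Documentation.where(id: scores.keys).sort_by { |doc| -scores[doc.id] }
end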

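This gives you semantic understanding from vectors plus precise keyword matching. Reciprocal rank fusion scores each document by summing 1 / (k + rank) across the result lists, with k = 60 as the usual constant, so it depends only on rank positions. That matters because keyword relevance scores and vector distances live on different scales, which is why it works better than naive score addition.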
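
Here's a tiny worked example with plain arrays of document ids instead of ActiveRecord results, so you can see the fusion in isolation. It uses 1-based ranks with k = 60. `rrf_order` is a hypothetical helper name.

```ruby
RRF_K = 60

# Fuses any number of ranked id lists and returns the ids, best first.
def rrf_order(*ranked_lists)
  scores = Hash.new(0.0)

  ranked_lists.each do |ids|
    ids.each_with_index do |id, index|
      scores[id] += 1.0 / (RRF_K + index + 1) # index is 0-based, rank is 1-based
    end
  end

  scores.sort_by { |_id, score| -score }.map(&:first)
end

keyword_ids = [7, 3, 9]
vector_ids  = [3, 5, 7]
rrf_order(keyword_ids, vector_ids) # => [3, 7, 5, 9]
```

Documents 3 and 7 appear in both lists, so they rise to the top. Document 3 wins because its ranks (2nd and 1st) beat document 7's (1st and 3rd).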

Caching saves you money:

class CachedRagSearcher
  def search(query)
    cache_key = "rag_search:#{Digest::MD5.hexdigest(query)}"

    Rails.cache.fetch(cache_key, expires_in: 1.hour) do
      perform_search(query)
    end
  end
end

Popular queries get asked repeatedly. Cache the embeddings and the LLM responses. This cuts your API costs dramatically for common questions.

The above caching method uses the raw query as the cache key. If users ask the same question with slightly different wording, they won't hit the cache. You might want to normalize queries before hashing them for the cache key. For example, you could lowercase the query, remove stop words, etc.
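
A minimal sketch of that normalization. The stop word list here is a made-up placeholder; pick one that fits your domain. `normalized_cache_key` is a hypothetical helper.

```ruby
require "digest"

# Hypothetical stop word list, for illustration only.
STOP_WORDS = %w[a an the do i how to is what].freeze

# Builds a cache key from a lowercased, punctuation-free, stop-word-free query.
def normalized_cache_key(query)
  words = query.downcase.gsub(/[^[:alnum:]\s]/, " ").split
  kept = words.reject { |word| STOP_WORDS.include?(word) }
  kept = words if kept.empty? # don't collapse an all-stop-word query to nothing

  "rag_search:#{Digest::MD5.hexdigest(kept.join(' '))}"
end

normalized_cache_key("How do I set up Stripe?") ==
  normalized_cache_key("  how to set up stripe ") # => true
```

Be careful how aggressive you get. Every word you strip raises the hit rate, but it also raises the chance that two different questions share one cached answer.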

Monitor your failure modes:

class SearchMetrics
  # Assumes results is a hash such as { sources: [...], latency: 120 }
  def self.track(query, approach, results)
    SearchLog.create!(
      query: query,
      approach: approach,
      result_count: results[:sources].length,
      latency_ms: results[:latency],
      cost_cents: calculate_cost(results),
      user_clicked: false # updated when user clicks a result
    )
  end
end

Track which results users actually click. If they click the first result, your search works. If they reformulate their query three times, it doesn't. This data tells you whether to upgrade your approach.

What I actually recommend

Build the simplest thing first. Most Rails apps should start with Postgres full-text search plus GPT summarization. It costs almost nothing, has predictable latency, and works fine for straightforward queries.

Add instrumentation immediately. Track user behavior, measure latency, and watch your API costs. You need this data to know if upgrading is worth it.

When you see concrete evidence that simple search fails for your use case, add vector embeddings. This is a real improvement for semantic search. The pgvector extension makes this straightforward in Postgres. You don't need a separate vector database until you have millions of documents.

Only add agentic features when you can point to specific query patterns that need them. "Users ask comparison questions and we don't have comparison docs" is a good reason. "Agents are cool and I want to try them" is not.

Save full conversational agents for when you're building a product feature that needs conversations. This is engineering work, not just adding a library. Budget for it appropriately.

The hype cycle pushes developers toward complexity. Resist it. Your users don't care about your architecture. They care about getting answers quickly and cheaply. Often the simplest approach gives them exactly that.

But does the code work? See this repo for an implementation of all four approaches.

Disclaimer: The code samples in this post are simplified for clarity. In production, you need proper error handling, timeouts, retries, and security considerations (e.g., sanitizing user input before embedding). The costs and latencies mentioned are estimates based on recent OpenAI pricing and typical response times; your actual costs may vary based on usage patterns and model choices. Always monitor your API usage and costs when deploying AI features. Void where prohibited.