<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Roger Oriol</title>
    <description>The latest articles on DEV Community by Roger Oriol (@rogiia).</description>
    <link>https://dev.to/rogiia</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F905788%2F8cf647ea-956c-4ca9-aa00-2cd63540878f.png</url>
      <title>DEV Community: Roger Oriol</title>
      <link>https://dev.to/rogiia</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/rogiia"/>
    <language>en</language>
    <item>
      <title>How to Write a Good index.html File</title>
      <dc:creator>Roger Oriol</dc:creator>
      <pubDate>Sat, 22 Mar 2025 22:02:36 +0000</pubDate>
      <link>https://dev.to/rogiia/how-to-write-a-good-indexhtml-file-1ajd</link>
      <guid>https://dev.to/rogiia/how-to-write-a-good-indexhtml-file-1ajd</guid>
      <description>&lt;p&gt;Every web developer has been there: you're starting a new project and staring at an empty file called &lt;code&gt;index.html&lt;/code&gt;. You try to remember: which tags go in the &lt;code&gt;&amp;lt;head&amp;gt;&lt;/code&gt; again? Which meta tags are best practice, and which ones are deprecated?&lt;/p&gt;

&lt;p&gt;Recently, I found myself in this exact situation. My first instinct was to copy the head section from a previous project, but as I reviewed the code, I realized some tags were outdated or simply didn't apply to my new project. What followed was a deep dive into HTML head tags – which ones are essential, which are optional, and which are just cluttering my code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you're in a hurry and just want the template:&lt;/strong&gt; You can find my complete starter template on &lt;a href="https://github.com/rogiia/html-starter-template" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. The repository contains two main files:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;index.html&lt;/code&gt;: A clean, minimalist template with just what you need and no unnecessary extras.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;index-commented.html&lt;/code&gt;: The same template but with detailed comments explaining how and why you should use each tag.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This article is essentially a deep dive into the comments from the &lt;code&gt;index-commented.html&lt;/code&gt; file, providing more context and explanations for each decision made in the template.&lt;/p&gt;

&lt;p&gt;This template represents my opinionated approach after researching current best practices. It aims to be a solid foundation for most web projects while maintaining good performance, accessibility, and search engine optimization.&lt;/p&gt;

&lt;p&gt;Let's dive into the essential components of a well-structured HTML head.&lt;/p&gt;

&lt;h3&gt;The tags you must include&lt;/h3&gt;

&lt;p&gt;These tags should be present in virtually every HTML document you create. They're essential for proper rendering, SEO, and accessibility.&lt;/p&gt;

&lt;h4&gt;&lt;code&gt;&amp;lt;!DOCTYPE html&amp;gt;&lt;/code&gt; and &lt;code&gt;lang="en"&lt;/code&gt;: Setting the document type and language&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="cp"&gt;&amp;lt;!DOCTYPE html&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;html&lt;/span&gt; &lt;span class="na"&gt;lang=&lt;/span&gt;&lt;span class="s"&gt;"en"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Always begin your HTML document with the doctype declaration. This tells browsers which version of HTML you're using (in this case, HTML5) and helps ensure consistent rendering. The &lt;code&gt;lang&lt;/code&gt; attribute on the &lt;code&gt;&amp;lt;html&amp;gt;&lt;/code&gt; tag specifies the language of your page - this is crucial for screen readers, search engines, and browsers. If your content is in a different language, change the code accordingly (e.g., &lt;code&gt;lang="es"&lt;/code&gt; for Spanish).&lt;/p&gt;

&lt;h4&gt;&lt;code&gt;&amp;lt;title&amp;gt;&lt;/code&gt;: The page title&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;title&amp;gt;&lt;/span&gt;Hello world!&lt;span class="nt"&gt;&amp;lt;/title&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every HTML document must have a title tag. This text appears in browser tabs, bookmarks, and search engine results. Make your titles descriptive yet concise, ideally under 60 characters. A good title both informs users about your page content and includes relevant keywords for SEO.&lt;/p&gt;

&lt;h4&gt;&lt;code&gt;&amp;lt;meta name="viewport"&amp;gt;&lt;/code&gt;: Configuring viewport for responsive design&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;meta&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"viewport"&lt;/span&gt; &lt;span class="na"&gt;content=&lt;/span&gt;&lt;span class="s"&gt;"viewport-fit=cover, width=device-width, initial-scale=1.0, minimum-scale=1.0, maximum-scale=5.0"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In today's mobile-first world, this meta tag is non-negotiable. It tells mobile browsers how to control the page's dimensions and scaling. Let's break down what each parameter does:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;viewport-fit=cover&lt;/code&gt;: Ensures content extends to the edge of the display (especially important for notched phones)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;width=device-width&lt;/code&gt;: Sets the width of the page to follow the screen width of the device&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;initial-scale=1.0&lt;/code&gt;: Sets the initial zoom level when the page is first loaded&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;minimum-scale=1.0&lt;/code&gt;: Prevents users from zooming out too much&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;maximum-scale=5.0&lt;/code&gt;: Allows users to zoom in up to 5x (limiting this completely would harm accessibility)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without this tag, mobile devices will render pages at a typical desktop screen width and then scale them down, resulting in tiny, unreadable text and forcing users to zoom and pan.&lt;/p&gt;

&lt;h4&gt;&lt;code&gt;&amp;lt;meta name="description"&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;meta name="keywords"&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;meta name="author"&amp;gt;&lt;/code&gt;: Essential meta information&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;meta&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"description"&lt;/span&gt; &lt;span class="na"&gt;content=&lt;/span&gt;&lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;meta&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"keywords"&lt;/span&gt; &lt;span class="na"&gt;content=&lt;/span&gt;&lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;meta&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"author"&lt;/span&gt; &lt;span class="na"&gt;content=&lt;/span&gt;&lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These meta tags provide important information about your page:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;description&lt;/code&gt;: A concise summary of your page content (ideally 150-160 characters). This often appears in search engine results below your title.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;keywords&lt;/code&gt;: Relevant keywords for your page content. While less important for Google these days, other search engines and crawlers may still use this information.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;author&lt;/code&gt;: The name of the individual or organization that created the page.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While these tags don't directly affect page rendering, they're valuable for SEO and content categorization.&lt;/p&gt;

&lt;h4&gt;&lt;code&gt;&amp;lt;link rel="canonical"&amp;gt;&lt;/code&gt;: Avoiding duplicate indexation&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;link&lt;/span&gt; &lt;span class="na"&gt;rel=&lt;/span&gt;&lt;span class="s"&gt;"canonical"&lt;/span&gt; &lt;span class="na"&gt;href=&lt;/span&gt;&lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This tag helps search engines avoid indexing the same content multiple times when it's accessible via different URLs. For example, if your page is accessible via multiple URLs (like &lt;code&gt;example.com/page&lt;/code&gt; and &lt;code&gt;example.com/page/index.html&lt;/code&gt;), the canonical tag tells search engines which URL is the "official" version to index, preventing duplicate indexation which can harm your search rankings.&lt;/p&gt;

&lt;p&gt;Fill in the &lt;code&gt;href&lt;/code&gt; attribute with the primary URL you want search engines to associate with this content.&lt;/p&gt;

&lt;h4&gt;CSS loading strategies: Critical inline CSS vs. external stylesheets&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;style&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;body&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nl"&gt;background&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;#fefefe&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nl"&gt;color&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;#222&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nl"&gt;font-family&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;'Roboto'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;sans-serif&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nl"&gt;padding&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1rem&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nl"&gt;line-height&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1.8&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/style&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;link&lt;/span&gt; &lt;span class="na"&gt;rel=&lt;/span&gt;&lt;span class="s"&gt;"preload"&lt;/span&gt; &lt;span class="na"&gt;href=&lt;/span&gt;&lt;span class="s"&gt;"main.css"&lt;/span&gt; &lt;span class="na"&gt;as=&lt;/span&gt;&lt;span class="s"&gt;"style"&lt;/span&gt; &lt;span class="na"&gt;onload=&lt;/span&gt;&lt;span class="s"&gt;"this.onload=null;this.rel='stylesheet'"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;noscript&amp;gt;&amp;lt;link&lt;/span&gt; &lt;span class="na"&gt;rel=&lt;/span&gt;&lt;span class="s"&gt;"stylesheet"&lt;/span&gt; &lt;span class="na"&gt;href=&lt;/span&gt;&lt;span class="s"&gt;"main.css"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/noscript&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This strategy for CSS loading optimizes performance in two ways:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Inline critical CSS&lt;/strong&gt;: By embedding essential styles directly in the HTML document, you avoid making an additional network request that would block rendering. This makes critical styles load blazingly fast and prevents the Flash of Unstyled Content (FOUC) that can occur when styles load after content.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Asynchronous loading for non-critical CSS&lt;/strong&gt;: The preload technique with the onload handler allows the main stylesheet to load without blocking rendering. This means your page can start displaying while the rest of the styles are still loading, creating a better user experience. The noscript tag provides a fallback for users with JavaScript disabled.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Alternatively, if your site doesn't have significant styling needs above the fold, you can use a simpler approach:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;link&lt;/span&gt; &lt;span class="na"&gt;href=&lt;/span&gt;&lt;span class="s"&gt;"main.css"&lt;/span&gt; &lt;span class="na"&gt;rel=&lt;/span&gt;&lt;span class="s"&gt;"stylesheet"&lt;/span&gt; &lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is more straightforward but can slow initial rendering as the browser must download and parse the CSS before displaying content.&lt;/p&gt;

&lt;h4&gt;&lt;code&gt;&amp;lt;script&amp;gt;&lt;/code&gt;: Script loading best practices&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;script &lt;/span&gt;&lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"module"&lt;/span&gt; &lt;span class="na"&gt;src=&lt;/span&gt;&lt;span class="s"&gt;"app.js"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/script&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For JavaScript, the &lt;code&gt;type="module"&lt;/code&gt; attribute offers several advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deferred by default, so the script executes only after the document has been fully parsed&lt;/li&gt;
&lt;li&gt;Enables ECMAScript modules for better code organization&lt;/li&gt;
&lt;li&gt;Runs in strict mode by default&lt;/li&gt;
&lt;li&gt;Allows for cleaner dependency management&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For scripts that don't depend on DOM elements and should run as soon as possible, consider adding the &lt;code&gt;async&lt;/code&gt; attribute:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;script &lt;/span&gt;&lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"module"&lt;/span&gt; &lt;span class="na"&gt;async&lt;/span&gt; &lt;span class="na"&gt;src=&lt;/span&gt;&lt;span class="s"&gt;"analytics.js"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/script&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Additionally, it's good practice to register a service worker for offline capabilities:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;script&amp;gt;&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;serviceWorker&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;navigator&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nb"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addEventListener&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;load&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nb"&gt;navigator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;serviceWorker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;register&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/service-worker.js&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/script&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This script is placed without &lt;code&gt;defer&lt;/code&gt; or &lt;code&gt;async&lt;/code&gt; attributes so it loads and executes as soon as possible, enabling offline capabilities early in the page load process. The service worker runs in the background, independent of your web page, allowing it to handle network requests and caching even when the user navigates away from your site.&lt;/p&gt;

&lt;h3&gt;The tags you maybe should include&lt;/h3&gt;

&lt;p&gt;These tags aren't necessary for every project but can be valuable in specific situations. Include them based on your project's needs.&lt;/p&gt;

&lt;h4&gt;&lt;code&gt;&amp;lt;meta charset="utf-8"&amp;gt;&lt;/code&gt;: Character encoding&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;meta&lt;/span&gt; &lt;span class="na"&gt;charset=&lt;/span&gt;&lt;span class="s"&gt;"utf-8"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This meta tag specifies the character encoding for your HTML document. HTML5 expects documents to be UTF-8 encoded, and servers often declare the encoding in the &lt;code&gt;Content-Type&lt;/code&gt; header, so this tag isn't strictly necessary in many cases. However, including it explicitly ensures consistency across all browsers and prevents potential character rendering issues, especially with special characters or non-Latin alphabets.&lt;/p&gt;

&lt;h4&gt;&lt;code&gt;&amp;lt;base href="/"&amp;gt;&lt;/code&gt;: Defining a base URL&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;base&lt;/span&gt; &lt;span class="na"&gt;href=&lt;/span&gt;&lt;span class="s"&gt;"/"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The base tag specifies the base URL for all relative URLs in a document. If all your site's URLs are already relative to the root path ("/"), you don't need to include this tag. It's primarily useful when your site is hosted in a subdirectory but you want paths to be relative to the domain root, or when developing single-page applications with client-side routing.&lt;/p&gt;

&lt;h4&gt;&lt;code&gt;&amp;lt;meta name="application-name"&amp;gt;&lt;/code&gt;: Application details&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;meta&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"application-name"&lt;/span&gt; &lt;span class="na"&gt;content=&lt;/span&gt;&lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your Progressive Web App (PWA) should have a different name than what's specified in your title tag, use this meta tag. It defines the name that will appear when your web application is installed on a device or pinned to a user's start menu or taskbar.&lt;/p&gt;

&lt;h4&gt;&lt;code&gt;&amp;lt;meta name="theme-color"&amp;gt;&lt;/code&gt;: Browser UI theme color&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;meta&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"theme-color"&lt;/span&gt; &lt;span class="na"&gt;content=&lt;/span&gt;&lt;span class="s"&gt;"#33d"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This meta tag defines the color used by the user agent in the UI elements around your page, such as the browser's address bar in mobile browsers or the title bar in some desktop browsers. Choose a color that reflects your brand identity to create a more integrated visual experience.&lt;/p&gt;

&lt;h4&gt;&lt;code&gt;&amp;lt;meta name="color-scheme"&amp;gt;&lt;/code&gt;: Light and dark mode support&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;meta&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"color-scheme"&lt;/span&gt; &lt;span class="na"&gt;content=&lt;/span&gt;&lt;span class="s"&gt;"light dark"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This tag informs the browser if your site supports light mode, dark mode, or both. The value &lt;code&gt;"light dark"&lt;/code&gt; means that both schemes are supported, with light being preferred. This helps browsers render form controls, scrollbars, and other UI elements in the appropriate color scheme, creating a better user experience that respects system preferences.&lt;/p&gt;

&lt;h4&gt;&lt;code&gt;&amp;lt;meta property="og:"&amp;gt;&lt;/code&gt;: Social media integration with Open Graph&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;meta&lt;/span&gt; &lt;span class="na"&gt;property=&lt;/span&gt;&lt;span class="s"&gt;"og:title"&lt;/span&gt; &lt;span class="na"&gt;content=&lt;/span&gt;&lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;meta&lt;/span&gt; &lt;span class="na"&gt;property=&lt;/span&gt;&lt;span class="s"&gt;"og:type"&lt;/span&gt; &lt;span class="na"&gt;content=&lt;/span&gt;&lt;span class="s"&gt;"website"&lt;/span&gt; &lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;meta&lt;/span&gt; &lt;span class="na"&gt;property=&lt;/span&gt;&lt;span class="s"&gt;"og:url"&lt;/span&gt; &lt;span class="na"&gt;content=&lt;/span&gt;&lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;meta&lt;/span&gt; &lt;span class="na"&gt;property=&lt;/span&gt;&lt;span class="s"&gt;"og:image"&lt;/span&gt; &lt;span class="na"&gt;content=&lt;/span&gt;&lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open Graph meta tags optimize how your content appears when shared on social media platforms like Facebook, LinkedIn, and X (formerly Twitter). While not essential for basic functionality, they significantly improve the appearance and engagement of your content when shared.&lt;/p&gt;

&lt;p&gt;Key Open Graph tags include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;og:title&lt;/code&gt;: The title of your page/content (can differ from your HTML title)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;og:type&lt;/code&gt;: The type of content (website, article, product, etc.)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;og:url&lt;/code&gt;: The canonical URL of your page&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;og:image&lt;/code&gt;: The URL to an image representing your content&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your site's content is likely to be shared on social platforms, filling these tags with appropriate values can dramatically improve click-through rates and engagement.&lt;/p&gt;

&lt;h4&gt;&lt;code&gt;&amp;lt;link rel="manifest"&amp;gt;&lt;/code&gt; and &lt;code&gt;&amp;lt;link rel="icon"&amp;gt;&lt;/code&gt;: PWA support and favicons&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;link&lt;/span&gt; &lt;span class="na"&gt;rel=&lt;/span&gt;&lt;span class="s"&gt;"manifest"&lt;/span&gt; &lt;span class="na"&gt;href=&lt;/span&gt;&lt;span class="s"&gt;"manifest.json"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;link&lt;/span&gt; &lt;span class="na"&gt;rel=&lt;/span&gt;&lt;span class="s"&gt;"icon"&lt;/span&gt; &lt;span class="na"&gt;href=&lt;/span&gt;&lt;span class="s"&gt;"/favicon.ico"&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"image/x-icon"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For Progressive Web Apps, use the manifest link to point to a JSON file that contains your app's metadata, including icons. The manifest should include your app's icons in various sizes for different devices and contexts.&lt;/p&gt;
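
&lt;p&gt;As an illustration, a minimal &lt;code&gt;manifest.json&lt;/code&gt; might look something like this (the name, colors and icon paths are placeholders to adapt to your project):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "name": "My App",
  "short_name": "App",
  "start_url": "/",
  "display": "standalone",
  "background_color": "#fefefe",
  "theme_color": "#33d",
  "icons": [
    { "src": "/icons/icon-192.png", "sizes": "192x192", "type": "image/png" },
    { "src": "/icons/icon-512.png", "sizes": "512x512", "type": "image/png" }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;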

&lt;p&gt;For regular websites that aren't PWAs, use the icon link to define your site's favicon. While technically optional, a favicon helps with brand recognition and user experience, so most sites should include one.&lt;/p&gt;

&lt;h4&gt;&lt;code&gt;&amp;lt;link rel="alternate"&amp;gt;&lt;/code&gt;: Alternate Content Types for your site&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;link&lt;/span&gt; &lt;span class="na"&gt;rel=&lt;/span&gt;&lt;span class="s"&gt;"alternate"&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"application/rss+xml"&lt;/span&gt; &lt;span class="na"&gt;href=&lt;/span&gt;&lt;span class="s"&gt;"/feed.xml"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;link&lt;/span&gt; &lt;span class="na"&gt;rel=&lt;/span&gt;&lt;span class="s"&gt;"alternate"&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"text/markdown"&lt;/span&gt; &lt;span class="na"&gt;href=&lt;/span&gt;&lt;span class="s"&gt;"/llms.txt"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This link tag serves multiple purposes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;It helps RSS readers and other feed aggregators discover your site's RSS feed. Include it if your site provides an RSS feed (common for blogs, news sites, or regularly updated content collections).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It can also be used to specify an &lt;code&gt;llms.txt&lt;/code&gt; file for your site as a &lt;code&gt;type="text/markdown"&lt;/code&gt; alternate link. This file provides your site's content in an easy-to-digest format for Large Language Models scanning your site. &lt;strong&gt;Note:&lt;/strong&gt; I stole that idea from &lt;a href="https://www.gilesthomas.com/2025/03/llmstxt" rel="noopener noreferrer"&gt;Giles Thomas&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If your site doesn't offer these features, you can safely omit these tags.&lt;/p&gt;

&lt;h4&gt;&lt;code&gt;&amp;lt;link rel="preload"&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;link rel="preconnect"&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;link rel="prefetch"&amp;gt;&lt;/code&gt;: Resource optimization&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;link&lt;/span&gt; &lt;span class="na"&gt;rel=&lt;/span&gt;&lt;span class="s"&gt;"preload"&lt;/span&gt; &lt;span class="na"&gt;href=&lt;/span&gt;&lt;span class="s"&gt;"https://fonts.googleapis.com/css2?family=Roboto:wght@400;700&amp;amp;display=swap"&lt;/span&gt; &lt;span class="na"&gt;as=&lt;/span&gt;&lt;span class="s"&gt;"style"&lt;/span&gt; &lt;span class="na"&gt;onload=&lt;/span&gt;&lt;span class="s"&gt;"this.onload=null;this.rel='stylesheet'"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;link&lt;/span&gt; &lt;span class="na"&gt;rel=&lt;/span&gt;&lt;span class="s"&gt;"preconnect"&lt;/span&gt; &lt;span class="na"&gt;href=&lt;/span&gt;&lt;span class="s"&gt;"https://fonts.gstatic.com"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;link&lt;/span&gt; &lt;span class="na"&gt;rel=&lt;/span&gt;&lt;span class="s"&gt;"prefetch"&lt;/span&gt; &lt;span class="na"&gt;href=&lt;/span&gt;&lt;span class="s"&gt;"/next-page.html"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These link tags help optimize resource loading:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;preload&lt;/code&gt;: Tells the browser to download and cache a resource as soon as possible. Useful for critical resources needed early in the rendering process, like fonts or important images.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;preconnect&lt;/code&gt;: Establishes an early connection to external domains from which you'll fetch resources later. This saves time by having the connection ready when needed.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;prefetch&lt;/code&gt;: Suggests to the browser that a resource might be needed for future navigation. The browser will download it when idle, making subsequent page loads faster.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use these selectively based on your performance needs. Over-using them can waste bandwidth, so focus on truly critical resources.&lt;/p&gt;

&lt;h3&gt;Conclusion&lt;/h3&gt;

&lt;p&gt;This template gives you a solid starting point for any web project. Of course, the template is opinionated, and the best setup for your needs may vary. If I left out some common tags, it's probably because they aren't needed in most cases.&lt;/p&gt;

&lt;p&gt;I welcome your feedback! If you think I've missed something important, please open an issue or submit a pull request on the &lt;a href="https://github.com/rogiia/html-starter-template" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Also, if you want a starter template that goes further than just the index.html file, you might want to check out &lt;a href="https://html5boilerplate.com/" rel="noopener noreferrer"&gt;HTML5 Boilerplate&lt;/a&gt;. It's a great resource for getting up and running really fast when building a Progressive Web App.&lt;/p&gt;

&lt;h3&gt;Resources&lt;/h3&gt;

&lt;p&gt;If you want to read more about HTML head tags, I suggest starting here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://developer.mozilla.org/en-US/docs/Learn_web_development/Core/Structuring_content/Webpage_metadata" rel="noopener noreferrer"&gt;Mozilla Web Documentation: Metadata in HTML&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://web.dev/learn/html/metadata" rel="noopener noreferrer"&gt;web.dev by Google&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ogp.me/" rel="noopener noreferrer"&gt;The Open Graph Protocol&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://w3c.github.io/manifest/" rel="noopener noreferrer"&gt;Web App Manifest specification&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And once more, you can find my index.html template on &lt;a href="https://github.com/rogiia/html-starter-template" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>css</category>
      <category>frontend</category>
    </item>
    <item>
      <title>About the Dead Internet Theory and AI</title>
      <dc:creator>Roger Oriol</dc:creator>
      <pubDate>Tue, 25 Feb 2025 20:43:36 +0000</pubDate>
      <link>https://dev.to/rogiia/about-the-dead-internet-theory-and-ai-35g</link>
      <guid>https://dev.to/rogiia/about-the-dead-internet-theory-and-ai-35g</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.ruxu.dev%2Fassets%2Fimages%2Fdead-internet-theory%2FInternet_dog.jpg" class="article-body-image-wrapper"&gt;&lt;img alt="Peter Steiner's 1993 cartoon, as published in The New Yorker" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.ruxu.dev%2Fassets%2Fimages%2Fdead-internet-theory%2FInternet_dog.jpg" width="299" height="334"&gt;&lt;/a&gt;&lt;/p&gt;
On the Internet nobody knows you are a dog... or an LLM



&lt;p&gt;The Dead Internet Theory is a thought that has gained a lot of traction recently. I have to admit, the first time it was explained to me, I felt an eerie realization. Like I had already been experiencing it, but hadn't paid much attention to it. In that first moment, I felt scared for the future and nostalgic for the past. But that's an old man's attitude, to be so resistant to change. After that moment, I started reflecting. How did we get here? And how can we prevent the Internet from continuing down this concerning path?&lt;/p&gt;

&lt;h3&gt;What Is The "Dead Internet Theory"?&lt;/h3&gt;

&lt;p&gt;The main premise of the Dead Internet Theory is an outrageous statement that challenges our view of the state of the Internet:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The Internet feels empty and devoid of people.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This theory originated in a &lt;a href="https://i.4pcdn.org/x/1587947548944.png" rel="noopener noreferrer"&gt;4chan post&lt;/a&gt; around 2019. This anonymous 4chan user describes his recent experience of the current state of the Internet. While his tone is pretty outrageous and paranoid (as is expected of a 4chan board), he raises some valid concerns that resonate with many other Internet users. This has led to the hypothesis gaining a lot of traction online, with others sharing their own experiences and thoughts on other Internet boards.&lt;/p&gt;

&lt;p&gt;So what do they mean by empty and devoid of people? There are plenty of people on the Internet, right? I believe there are two factors to it.&lt;/p&gt;

&lt;p&gt;First, we are &lt;strong&gt;interacting less and less directly with other humans&lt;/strong&gt;. Now we are talking to an "audience". But really we are talking to the recommendation algorithm, our god, so that it graces our posts with engagement. When we post on &lt;em&gt;social&lt;/em&gt; networks, we don't expect our friends and relatives to see and interact with it. We expect strangers we don't know or care about to like it. This impersonal relationship is making the Internet less social and more of a hustle.&lt;/p&gt;

&lt;p&gt;Second, an &lt;strong&gt;increasing number of Internet users are not people&lt;/strong&gt;. They are bots. Fake profiles that algorithmically play the viral content game to gain influence on social platforms. Advertising and scam emails, messages, and even calls, scaled out massively thanks to bots. &lt;a href="https://hallofdreams.org/posts/physicsforums/" rel="noopener noreferrer"&gt;Generative AI answering questions on behalf of actual people on PhysicsForums&lt;/a&gt; and other forums like StackOverflow.&lt;/p&gt;

&lt;p&gt;The Internet was supposed to connect people all around the world. And it did that, beautifully. But in recent years, it's been going in the opposite direction. It's driving us apart, isolating us from other people and keeping us content with experiencing an Internet massively filled with content, without the need to interact with anyone.&lt;/p&gt;

&lt;h3&gt;Is It Really That Bad?&lt;/h3&gt;

&lt;p&gt;To be fair, it's not like nobody uses the Internet to communicate anymore. Messaging apps are very prevalent and their users are mostly human. We talk with our friends, make plans, catch up with people who are not as present in our lives as they used to be. But this is a &lt;strong&gt;private&lt;/strong&gt; Internet, not an open Internet. A private chat is not meant to be shared or discovered by people interested in what you are talking about, and it doesn't let you build a little garden in a corner of the Internet that's just yours.&lt;/p&gt;

&lt;p&gt;Also, &lt;strong&gt;people are not interacting with the Internet like they used to&lt;/strong&gt;. I believe the rise of smartphones has contributed to that greatly. Smartphones have been heavily optimized for consuming content, and especially for &lt;strong&gt;infinite, mindless scrolling&lt;/strong&gt;. It's much more lucrative for an app to have its users trapped in an infinite scroll, consuming content (and ads) by the ton, than to have them create and share. And to achieve that, they provide you with (or rather, impose on you) an amazing recommendation algorithm that plays on your psychology to keep you engaged. But if people are only consuming, they are not interacting, discussing, or building. You see a lot of people on your feed, sure, but in such large numbers and in such an impersonal style that it doesn't feel like there's anyone on the other side.&lt;/p&gt;

&lt;p&gt;Paradoxically, recommendation algorithms also make us more likely to interact with content that arouses anger in us than with content we agree with, since anger drives more &lt;strong&gt;engagement&lt;/strong&gt;. &lt;a href="https://www.youtube.com/watch?v=fuFlMtZmvY0" rel="noopener noreferrer"&gt;This amazing video by Kurzgesagt&lt;/a&gt; talks about the psychology of social interactions on an open, algorithmic Internet. It comes to the conclusion that the small, year-2000 forum-like communities of the early Internet were the closest to the social mechanisms our brains are accustomed to and thrive in.&lt;/p&gt;

&lt;p&gt;And it has gotten worse recently, thanks to Generative AI and Large Language Models. Bots were already popping up everywhere, but with the power of recent Generative AI models, they have gotten so much better at &lt;strong&gt;deceiving people&lt;/strong&gt;. And some content posted by people is not actually thought through and written by those people. They just asked an LLM to write it for them and copy-pasted the output into their post. A lot of news, or articles like this one, are written almost instantly (and without any thought behind them) thanks to Generative AI. Yes, I have also used it to write, but I try to keep it as a proofreading/editing/brainstorming/idea-drafting aid. I don't post anything I haven't thought about deeply and that didn't originate from me.&lt;/p&gt;

&lt;p&gt;StackOverflow is a board for asking and answering programming-related questions. Since ChatGPT's release, it has seen a lot of answers generated by LLMs. I would argue that if you post a question there, you would like an answer from a person with experience in that topic. If you wanted an answer from ChatGPT, you would ask it yourself. And sometimes you might want to, but ChatGPT generally cannot solve problems it has never seen the way an expert would, yet it will still be happy to answer. This makes a lot of the answers confidently wrong and void of human expertise. This is why StackOverflow has decided to &lt;a href="https://meta.stackoverflow.com/questions/421831/policy-generative-ai-e-g-chatgpt-is-banned" rel="noopener noreferrer"&gt;ban all use of Generative AI in content posted to the platform&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The wave of content generated by AI and unapologetically unreviewed by humans has been so strong that a term has been coined for it: &lt;strong&gt;&lt;a href="https://x.com/deepfates/status/1787472784106639418" rel="noopener noreferrer"&gt;AI Slop&lt;/a&gt;&lt;/strong&gt;. &lt;a href="https://benjamincongdon.me/blog/2025/01/25/AI-Slop-Suspicion-and-Writing-Back/" rel="noopener noreferrer"&gt;Slop is usually defined as the AI-generated equivalent of spam&lt;/a&gt;. I would define it as &lt;strong&gt;low-quality content created by a Generative AI model without any human review or even human thought behind it&lt;/strong&gt;. AI Slop threatens to flood the Internet with useless posts, algorithm-pleasing content, and mass AI-generated pictures and videos at an unprecedented pace, since no human being is needed to create them.&lt;/p&gt;

&lt;p&gt;As if this wasn't bad enough, &lt;strong&gt;Meta is now experimenting with not only allowing AI bots on the platform, but using those bots themselves to drive up engagement&lt;/strong&gt;. In an &lt;a href="https://www.ft.com/content/91183cbb-50f9-464a-9d2e-96063825bfcf" rel="noopener noreferrer"&gt;interview with the Financial Times&lt;/a&gt;, Meta's Connor Hayes said:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;They’ll have bios and profile pictures and be able to generate and share content powered by AI on the platform ... that’s where we see all of this going.&lt;/p&gt;

&lt;p&gt;[...] make our apps more entertaining and engaging.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Basically, if you post something on Instagram and nobody likes or comments on it, it might dissuade you from interacting with the app. But if your post suddenly got a lot of likes, comments and messages, even from AI profiles, you would spend more time on the app. Never actually interacting with anyone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If we continue down this path, soon 99% of all content on the Internet will be AI generated, and the Internet will have turned from a place to share and communicate with other people into a place to consume endless slop. It will be lonelier than ever: an Internet without people.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;My Final Thoughts + Possible Solutions&lt;/h3&gt;

&lt;p&gt;In my opinion, the epicenter of the problem is not AI as a whole but recommendation algorithms and infinite scrolling. The problem is a business model built on driving up engagement and keeping users on the platform at any cost. And it's a titanic problem to tackle, since the whole modern Internet is built around this concept.&lt;/p&gt;

&lt;p&gt;AI-generated content can also be a problem, but I believe it's more of a &lt;strong&gt;misuse problem&lt;/strong&gt; than a fundamental problem with the technology. Generative AI models should be our helpers and assistants, but not take over our personas.&lt;/p&gt;

&lt;p&gt;Some people advocate making AI-generated content easier to spot through &lt;strong&gt;watermarking&lt;/strong&gt;. I don't believe this is the way, not only because it's difficult to do for all the kinds of output LLMs can produce, but also because it's impossible to enforce.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://simonwillison.net/2023/Aug/27/wordcamp-llms/#personal-ai-ethics" rel="noopener noreferrer"&gt;Simon Willson has a great oath on Personal AI ethics&lt;/a&gt;. While he admits to using AI as his writing assistant, he promises to not post anything that takes longer to read than it took him to write. &lt;strong&gt;And I promise to do the same as well&lt;/strong&gt;. Because, just like Simon, I think it's rude to publish text that you haven't even read yourself.&lt;/p&gt;

&lt;p&gt;As for possible solutions to make the Internet &lt;em&gt;less dead&lt;/em&gt;, and taking into account that deleting the Internet and starting from scratch is not possible, I've come up with a few actions that I will be applying:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Build social circles in online social vehicles that don't feature recommendation algorithms / infinite scrolling and have a limited number of people.&lt;/strong&gt; Think Discord/Slack groups or online forums. The lack of recommendation algorithms will allow you to more directly explore the content and people you are interested in. The lack of infinite scrolling will keep you from endlessly consuming content and encourage you to interact / build more. And the limited number of people will allow for easier connection building with other members.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Also interact outside that circle&lt;/strong&gt; to get content and world views beyond your bubble, but be very critical of what you read and see. Keep in mind that it might very well be AI generated. Don't let this take up most of your time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Favor subscription feeds like RSS over algorithmic recommendation apps and web pages.&lt;/strong&gt; Subscribe to your favorite blogs (wink), newsletters and podcasts via an RSS reader or similar. Explore the personal pages of other people. Of all the content you consume, this is the content you are most likely to enjoy, and it won't keep you stuck infinitely scrolling. When you are done, you are done.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the end, the goal of making your Internet less dead is not to spend more time online, but to be better connected to the things that matter to you.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>socialmedia</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>The Rise Of Reasoner Models: Scaling Test-Time Compute</title>
      <dc:creator>Roger Oriol</dc:creator>
      <pubDate>Sun, 12 Jan 2025 13:52:30 +0000</pubDate>
      <link>https://dev.to/rogiia/the-rise-of-reasoner-models-scaling-test-time-compute-33e3</link>
      <guid>https://dev.to/rogiia/the-rise-of-reasoner-models-scaling-test-time-compute-33e3</guid>
      <description>&lt;h2&gt;Introduction&lt;/h2&gt;

&lt;p&gt;A new kind of LLM has recently been popping up everywhere: Reasoner models. Kickstarted by OpenAI's o1 and o3, these models are a bit different from the rest. They particularly shine when dealing with mathematical problems and coding challenges, where success depends on following precise, logical steps to reach a correct solution. On the other hand, they take much longer to answer than any conventional model.&lt;/p&gt;

&lt;p&gt;The approach to problem-solving used by these models mirrors a well-known distinction in human cognition: System 1 versus System 2 thinking. Traditional LLMs operate much like System 1 thinking - quick, intuitive, and based on pattern recognition. They generate responses rapidly based on their trained neural networks. In contrast, Reasoner models embody System 2 thinking - deliberate, methodical, and self-correcting. They can pause, reflect on their reasoning, and even backtrack when they detect potential errors in their logic.&lt;/p&gt;

&lt;p&gt;The key innovation enabling these capabilities isn't found in revolutionary architectures or training methods, but rather in a different approach to computation: scaling test-time compute.&lt;/p&gt;

&lt;h2&gt;What is Test-time Compute?&lt;/h2&gt;

&lt;p&gt;At its core, test-time compute represents a fundamental shift in how we allocate computational resources in AI systems. While traditional models focus on scaling training time and data (train-time compute) to improve, Reasoner models invest those computational resources during the actual problem-solving phase (test-time compute). This approach essentially gives the model more time to "think" about its answers.&lt;/p&gt;

&lt;p&gt;The concept of "thinking longer" might seem similar to existing techniques like Chain-of-Thought (CoT) prompting, but there's a crucial difference. CoT prompting encourages models to spell out its reasoning and train of thought, but the intermediate steps are not validated or weighted against alternatives. Therefore, even if most steps are spot on, any error in the intermediate reasoning steps will compound, leading to incorrect final answers. Test-time compute solves this by enabling models to actively verify and correct their reasoning process.&lt;/p&gt;

&lt;h2&gt;How Does Test-time Compute Work?&lt;/h2&gt;

&lt;p&gt;The simplest method to implement test-time compute is &lt;strong&gt;iterative self-refinement&lt;/strong&gt;. In this method, the model first outputs its reasoning and thoughts to solve a problem. Then, all the previous model outputs are passed through the same model again, making it review its reasoning and try to find and correct errors. After sufficient iterations, the quality of the response should be much higher than the first output. This is, however, a naïve approach that doesn’t allow the model to explore creative alternatives on its way to the right solution.&lt;/p&gt;
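
&lt;p&gt;As a rough illustration, here is a minimal sketch of an iterative self-refinement loop. This is not from any specific library: &lt;code&gt;callModel&lt;/code&gt; is a hypothetical wrapper around whatever LLM API you use, taking a prompt and resolving to the model’s text output.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Hypothetical sketch: callModel(prompt) wraps an LLM API call and
// resolves to the model's text output.
async function selfRefine(problem, callModel, iterations = 3) {
  // First pass: ask the model to reason through the problem.
  let draft = await callModel("Solve step by step:\n" + problem);
  for (let i = 0; i &amp;lt; iterations; i++) {
    // Feed the previous output back in and ask the model to critique
    // its own reasoning and produce a corrected solution.
    draft = await callModel(
      "Problem:\n" + problem +
      "\n\nPrevious attempt:\n" + draft +
      "\n\nFind any errors in the reasoning above and write a corrected solution."
    );
  }
  return draft;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Each pass spends additional inference compute on the same problem; the model itself never changes.&lt;/p&gt;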

&lt;p&gt;A better approach is &lt;strong&gt;Verifier-guided search&lt;/strong&gt;. In this approach, the model generates multiple different answers and a verifier selects the best one. A high temperature is often used to encourage the model to be creative and explore many different solutions. The model used to generate the answers can also act as the verifier; however, it’s usually better to train a smaller, separate model for that role.&lt;/p&gt;

&lt;p&gt;The verifier can score the correctness of the overall solution (&lt;strong&gt;ORM - Outcome Reward Model&lt;/strong&gt;) or score each step of the solution separately (&lt;strong&gt;PRM - Process Reward Model&lt;/strong&gt;). A PRM is much more expensive to run, since it evaluates every step, but it is the better approach: an ORM will mark a solution that has all steps correct except one as simply incorrect, while a PRM will mark the correct steps as correct and encourage the model to rethink only the incorrect step to get to the right solution.&lt;/p&gt;

&lt;p&gt;Since the PRM needs to evaluate many different approaches to the solution, which branch at each proposed step, we need efficient search strategies to find the path to the correct solution. The following are a few different search strategies for finding the correct solution path (a minimal code sketch follows the list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Best of N&lt;/strong&gt;: Generate N independent solutions and score each step using the reward model. Select the solution with the highest score in all its steps.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Best of N Weighted&lt;/strong&gt;: The implementation is the same as Best of N; however, identical responses are aggregated together, so the most common solutions end up with higher scores.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Beam Search&lt;/strong&gt;: Generate N candidate first steps. All candidates are scored, and the top N/M are selected for further exploration. For each of the selected candidates, M different possible next steps are generated, so N steps are again graded at the next level. This continues until reaching the final solution.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DVTS (Diverse Verifier Tree Search)&lt;/strong&gt;: Very similar to Beam Search, but we start with N/M different subtrees and we select only the best step for each of them. Then, for each of the best steps, we generate M new next steps and we score and select the best of them until we get to the final answer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lookahead search&lt;/strong&gt;: Also similar to Beam Search, but in order to grade each step, we generate the next step for that path and we also grade that next step. We use the score from the next step to score the previous step. Using this information, we end up selecting the paths that will be further explored and the ones that will be discarded. This method is also similar to the Monte Carlo Tree Search algorithm.&lt;/li&gt;
&lt;/ul&gt;
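
&lt;p&gt;To make the verifier-guided idea concrete, here is a minimal sketch of Best of N Weighted. The &lt;code&gt;generate&lt;/code&gt; and &lt;code&gt;score&lt;/code&gt; functions are hypothetical: the former samples one candidate answer at high temperature, the latter returns a verifier (e.g., ORM) score for it.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Hypothetical sketch: generate(prompt) samples one candidate answer,
// score(prompt, answer) returns a verifier score for it.
async function bestOfNWeighted(prompt, n, generate, score) {
  const totals = new Map(); // answer text mapped to its aggregated score
  for (let i = 0; i &amp;lt; n; i++) {
    const answer = await generate(prompt); // independent, high-temperature sample
    const s = await score(prompt, answer); // verifier judgment
    // Identical answers pool their scores, so common solutions rank higher
    // (in practice, answers are canonicalized first, e.g. by extracting
    // the final numeric result).
    totals.set(answer, (totals.get(answer) || 0) + s);
  }
  // Return the answer with the highest aggregated score.
  let best = null;
  for (const [answer, total] of totals) {
    if (best === null || total &amp;gt; best.total) best = { answer, total };
  }
  return best.answer;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Beam Search and its derivatives follow the same generate-and-verify pattern, but score and prune with the PRM at every intermediate step rather than only on the final answer.&lt;/p&gt;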

&lt;p&gt;So, which of these strategies is the best? It depends. For simpler problems and lower compute budgets, Best of N Weighted seems to be best. But for harder problems and higher compute budgets, Beam Search and its derivatives are definitely better. To achieve the best performance across both simple and complex problems, a &lt;strong&gt;“compute-optimal”&lt;/strong&gt; strategy can be chosen: estimate the difficulty of the problem using a model and choose the search strategy accordingly.&lt;/p&gt;

&lt;p&gt;When correct solutions to hard problems are finally found through search, the reasoning traces of those solutions can then be used to further improve both the reasoning model and the reward model via reinforcement learning. This strengthens the model’s reasoning abilities and lets it reach the correct answer faster in subsequent iterations.&lt;/p&gt;

&lt;h2&gt;Performance Improvements Through Test-time Compute&lt;/h2&gt;

&lt;p&gt;When we evaluate models that use test-time compute on math and coding benchmarks, we find remarkable improvements. In an article from HuggingFace, &lt;em&gt;“Scaling Test-time Compute with Open Models”&lt;/em&gt;, Llama-3.2 3B using 256-iteration test-time compute was found to be better than Llama-3.1 70B, &lt;strong&gt;a model over 20 times larger&lt;/strong&gt;. The paper &lt;em&gt;“Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters”&lt;/em&gt; also finds that a small PaLM 2-S model can &lt;strong&gt;outperform a 14 times larger&lt;/strong&gt; model using test-time compute.&lt;/p&gt;

&lt;p&gt;These findings suggest that some problems, especially reasoning-heavy ones like math and coding, can be solved not by using bigger models, but by making the model “think longer”.&lt;/p&gt;

&lt;h2&gt;Limitations of Test-time Compute&lt;/h2&gt;

&lt;p&gt;Does the improvement that reasoning models show mean that, to get better models, we should always let them “think for longer” instead of pretraining bigger models with more data? The already mentioned paper &lt;em&gt;“Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters”&lt;/em&gt; shows that this does not seem to be the case. The authors tested which was more effective for solving harder problems, pretraining or scaling test-time compute, and their conclusion is the following:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Test-time and pretraining compute are not 1-to-1 “exchangeable”. On easy and medium questions, which are within a model’s capabilities, or in settings with small inference requirement, test-time compute can easily cover up for additional pretraining. However, on challenging questions which are outside a given base model’s capabilities or under higher inference requirement, pretraining is likely more effective for improving performance.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Therefore, although scaling test-time compute can help models get to the correct answer even if they struggle at first, for test-time compute to work, the solution has to be within the model’s capabilities and knowledge to begin with. Scaling test-time compute is not a solution that can solve any problem just by being given more time.&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Reasoner models like o1 and o3 are indeed impressive, and this leads many people to wonder if they are close to achieving Artificial General Intelligence (AGI). Of course, reasoning through test-time compute is not AGI at all. Test-time compute, while powerful, is not sufficient on its own to bridge the gap to AGI and, as we have established, has many shortcomings. Plus, it only seems to work well on tasks whose steps must be objectively correct to reach the right solution, namely math and coding tasks.&lt;/p&gt;

&lt;p&gt;However, this step of achieving strong reasoning with smaller models does seem to fit OpenAI’s 5-step plan to AGI. Step 1 is conversational AI and step 2 is reasoning AI, which seems to have been achieved. The third step will be autonomous AI, which is already in the crosshairs of the many agentic models and agentic systems currently being developed.&lt;/p&gt;

&lt;p&gt;So, should you use Reasoner models instead of regular ones? The decision should depend on your specific use case. These models excel at tasks requiring careful reasoning and verification, particularly in domains like mathematics, coding, and logical problem-solving. However, for tasks that require quick responses or deal with more subjective matters, traditional LLMs might still be the better choice.&lt;/p&gt;

&lt;h2&gt;References&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Scaling Test-Time Compute with Open Models. Edward Beeching, Lewis Tunstall, Sasha Rush (HuggingFace). URL: &lt;a href="https://huggingface.co/spaces/HuggingFaceH4/blogpost-scaling-test-time-compute" rel="noopener noreferrer"&gt;https://huggingface.co/spaces/HuggingFaceH4/blogpost-scaling-test-time-compute&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Scaling of Search and Learning: A Roadmap to Reproduce o1 from Reinforcement Learning Perspective. Zhiyuan Zeng, Qinyuan Cheng, Zhangyue Yin, Bo Wang, Shimin Li, Yunhua Zhou, Qipeng Guo, Xuanjing Huang, Xipeng Qiu. URL: &lt;a href="https://arxiv.org/abs/2412.14135" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2412.14135&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters. Charlie Snell, Jaehoon Lee, Kelvin Xu, Aviral Kumar. URL: &lt;a href="https://huggingface.co/papers/2408.03314" rel="noopener noreferrer"&gt;https://huggingface.co/papers/2408.03314&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
    <item>
      <title>AI in 2024: Year in Review and Predictions for 2025</title>
      <dc:creator>Roger Oriol</dc:creator>
      <pubDate>Mon, 23 Dec 2024 19:25:27 +0000</pubDate>
      <link>https://dev.to/rogiia/ai-in-2024-year-in-review-and-predictions-for-2025-6h2</link>
      <guid>https://dev.to/rogiia/ai-in-2024-year-in-review-and-predictions-for-2025-6h2</guid>
      <description>&lt;p&gt;The past year has been transformative for artificial intelligence, marked by breakthrough innovations, emerging regulations, and a shift toward practical AI tools that enhance productivity. As we look ahead to 2025, let's review the major developments of 2024 and explore what the future might hold.&lt;/p&gt;

&lt;h2&gt;
  
  
  Part I: 2024 Year in Review
&lt;/h2&gt;

&lt;h3&gt;
  
  
  OpenAI in the front
&lt;/h3&gt;

&lt;p&gt;OpenAI maintained its position at the forefront of AI innovation in 2024. The release of GPT-4o in May marked a significant milestone as the first frontier multimodal model, capable of understanding and generating content across different forms of media. This was followed by o1 in September, along with its lighter counterpart o1-mini, and the announcement of o3 in December.&lt;/p&gt;

&lt;p&gt;Perhaps most notably, OpenAI's Sora project revolutionized video generation. Announced in February and enhanced with Sora Turbo in December, this technology demonstrated unprecedented capabilities in creating realistic video content from text descriptions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Agentic AI and AI Assistants
&lt;/h3&gt;

&lt;p&gt;Rather than replacing jobs, AI in 2024 focused on enhancing human productivity through innovative tools and assistants. Google introduced several groundbreaking projects: Astra, an AI assistant for phones and smart glasses, and Mariner, a Chrome extension that enables Gemini to interact with your browser. These tools represent a shift from simple chat interfaces to interactive agents that can understand and manipulate our digital environment.&lt;/p&gt;

&lt;p&gt;Anthropic joined this trend by enabling Claude to use computers the way humans do—viewing screens, moving cursors, and interacting with interfaces. This development opened new possibilities for AI assistance in everyday computer tasks.&lt;/p&gt;

&lt;p&gt;The developer community saw particularly exciting advances. GitHub Copilot and Cursor AI have been very well received by developers. On the other hand, autonomous coding agents like Devin and GitHub Workspace are still rough around the edges. These tools aren’t replacing developers; instead, they’re supercharging their productivity by handling routine tasks and suggesting improvements.&lt;/p&gt;

&lt;h3&gt;
  
  
  AI Regulation
&lt;/h3&gt;

&lt;p&gt;2024 saw significant progress in AI regulation, particularly with the EU AI Act. This groundbreaking legislation set the first comprehensive framework for AI regulation, establishing rules for AI system development and deployment while balancing innovation with safety and ethical concerns.&lt;/p&gt;

&lt;h2&gt;
  
  
  Part II: Looking Ahead to 2025
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Small, specialized models
&lt;/h3&gt;

&lt;p&gt;While decoder-only models like GPT and Claude have dominated headlines, in 2025 encoder-only models will remain an important piece of AI applications. Just one week before the end of the year, ModernBERT was released, representing a significant and long-overdue advance in this space. These models are particularly crucial for Retrieval-Augmented Generation (RAG) setups, where they excel at information retrieval and classification at a significantly lower running cost.&lt;/p&gt;

&lt;p&gt;In a similar fashion, the current trend of agentic systems will favor small, specialized models. These models will be fine-tuned to match frontier models at very specific tasks and, used together in an agentic workflow, will enable powerful and resource-efficient AI systems.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multimodal models and Test-time compute
&lt;/h3&gt;

&lt;p&gt;We can expect significant advances in multimodal capabilities, with AI systems becoming more natural at processing and responding to various forms of input—text, voice, images, and video—simultaneously. These improvements will make AI interactions feel more natural and contextually aware.&lt;/p&gt;

&lt;p&gt;The concept of "test-time compute"—giving models more time to think—emerged as a game-changer in 2024 and will likely become more prominent in 2025. Frontier models like OpenAI’s o3 are crushing reasoning, math and coding benchmarks. They are consistently beating humans at really complex, deep reasoning tasks, making us wonder how far Artificial General Intelligence really is.&lt;/p&gt;

&lt;p&gt;Test-time compute is not only showing remarkable results for frontier models. Applied to small edge models, it lets them surpass the performance of much larger models when given more processing time. For instance, Llama 3.2 3B outperformed Llama 3.1 70B on the Math-500 benchmark using a test-time compute strategy of 256 iterations.&lt;/p&gt;

&lt;h3&gt;
  
  
  The evolution of AI regulation
&lt;/h3&gt;

&lt;p&gt;Following the EU's lead, we're likely to see more regions implement AI regulations in 2025. Interestingly, major tech companies are actively calling for regulation, potentially as a strategy to shape the regulatory environment and raise barriers to entry for competitors.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;2024 has been a year of remarkable progress in AI. We have seen many new ways to use AI in our day-to-day lives, as well as new methods to improve models that are making giant strides closer to AGI. As we move into 2025, the focus appears to be shifting toward more specialized and efficient AI systems, improved reasoning capabilities through techniques like test-time compute, and AI tools that boost productivity.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>discuss</category>
      <category>chatgpt</category>
    </item>
    <item>
      <title>Is the EU Falling Behind in the AI Race?</title>
      <dc:creator>Roger Oriol</dc:creator>
      <pubDate>Fri, 29 Nov 2024 16:47:23 +0000</pubDate>
      <link>https://dev.to/rogiia/is-the-eu-falling-behind-in-the-ai-race-3o5b</link>
      <guid>https://dev.to/rogiia/is-the-eu-falling-behind-in-the-ai-race-3o5b</guid>
      <description>&lt;p&gt;The recent announcement that Meta's &lt;strong&gt;Llama 3.2 Vision models&lt;/strong&gt; won't be available in the European Union has reignited discussions about the impact of EU regulations on AI innovation and accessibility. This development joins a growing list of AI technologies from major tech companies that are currently unavailable to EU users, including &lt;strong&gt;ChatGPT's Advanced Voice mode&lt;/strong&gt; and &lt;strong&gt;Apple Intelligence&lt;/strong&gt;, raising concerns about whether the EU might be falling behind in the global AI race.&lt;/p&gt;

&lt;h3&gt;
  
  
  The EU AI Act and Its Impact
&lt;/h3&gt;

&lt;p&gt;In April 2021, the European Commission proposed the &lt;strong&gt;EU AI Act&lt;/strong&gt;. This act classifies AI systems according to the risk they pose to users; higher risk levels entail stricter regulation. Furthermore, it sets some rules for &lt;strong&gt;General Purpose AI systems&lt;/strong&gt;, like Meta’s Llama or ChatGPT. Model developers must provide technical documentation and instructions, comply with the Copyright Directive, and publish a summary of the content used for training. If the model is open, like Llama, and presents a &lt;strong&gt;systemic risk&lt;/strong&gt;, which Llama does, its developers must also conduct model evaluations, perform adversarial testing, track and report serious incidents, and ensure cybersecurity protections.&lt;/p&gt;

&lt;p&gt;According to the study by Stanford University “Do Foundation Model Providers Comply with the Draft EU AI Act?”, as of June 2023, &lt;strong&gt;no foundation models fully comply with the EU AI Act's requirements&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Meta and EU regulation
&lt;/h3&gt;

&lt;p&gt;The situation becomes more complex when examining Meta's challenges with training their models in the EU. In June 2024, Meta faced a setback when the &lt;strong&gt;Irish Data Protection Commission&lt;/strong&gt; requested a delay in training their large language models using &lt;strong&gt;public content from adult Facebook and Instagram users in the EU&lt;/strong&gt;. Meta expressed disappointment with this decision, arguing it would hinder European innovation.&lt;/p&gt;

&lt;p&gt;It seems like these requirements influenced Meta's decision not to release Llama 3.2 Vision models in the EU. However, this also looks like a retaliation tactic to pressure the EU to let Meta use private user data to train its models.&lt;/p&gt;

&lt;p&gt;As a side note, according to Llama 3.2’s Use Policy, the restriction of use of these models specifically applies to companies and individuals based in the EU who wish to use and build services using these models directly - end users in Europe that use services built on these models are not affected.&lt;/p&gt;

&lt;h3&gt;
  
  
  EU Needs AI
&lt;/h3&gt;

&lt;p&gt;A significant countermovement has emerged in response to these regulatory challenges. The &lt;strong&gt;"EU Needs AI"&lt;/strong&gt; initiative, supported by prominent figures including &lt;strong&gt;Meta's Chief AI Scientist Yann LeCun&lt;/strong&gt;, argues that fragmented regulation threatens Europe's competitive position in AI development. Their position statement emphasizes that "Europe has become less competitive and less innovative compared to other regions and it now risks falling further behind in the AI era due to inconsistent regulatory decision making."&lt;/p&gt;

&lt;h3&gt;
  
  
  What Does the Future Hold for AI in the EU?
&lt;/h3&gt;

&lt;p&gt;In my opinion, the pushback against the regulation is probably overblown. I believe that, even with these regulations in place, the EU will still have access to some amazing models, and its citizens will enjoy more ethical AI practices.&lt;/p&gt;

&lt;p&gt;While companies from Europe will need to comply with more requirements than companies from the rest of the world, those requirements are not out of place or unnecessary. Many of the requirements the EU AI Act describes for General Purpose AI systems, systemic risk or not, reflect practices that are already widely adopted and considered hallmarks of reliable models. Also, citizens in the EU will enjoy more privacy and security than citizens anywhere else in the world. Their private data, including pictures, won’t be used for training models. Abusive AI systems like social scoring and manipulative AI won’t be a problem for them.&lt;/p&gt;

&lt;p&gt;In conclusion, the EU is not falling behind in the AI race. It will remain competitive, in part thanks to its more ethical practices. And AI regulation is not a blocker but a necessity; more countries should follow suit.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>machinelearning</category>
      <category>news</category>
    </item>
    <item>
      <title>Build an Advanced RAG App: Query Routing</title>
      <dc:creator>Roger Oriol</dc:creator>
      <pubDate>Thu, 12 Sep 2024 11:07:55 +0000</pubDate>
      <link>https://dev.to/rogiia/build-an-advanced-rag-app-query-routing-cn1</link>
      <guid>https://dev.to/rogiia/build-an-advanced-rag-app-query-routing-cn1</guid>
      <description>&lt;p&gt;In previous articles, we built a basic RAG application. We also learned to introduce more advanced techniques to improve a RAG application. Today, we will explore how to tie those advanced techniques together. Those techniques might do different, sometimes opposite, things. Still, sometimes we need to use all of them, to cover all possibilities. So let's see how we can link different techniques together. In this article we will take a look at a technique called Query Routing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem with Advanced RAG Applications
&lt;/h2&gt;

&lt;p&gt;When our Generative AI application receives a query, we have to decide what to do with it. For simple Generative AI applications, we send the query directly to the LLM. For simple RAG applications, we use the query to retrieve context from a single data source and then query the LLM. But if our case is more complex, we can have multiple data sources, or different queries may need different types of context. So do we build a one-size-fits-all solution, or do we make the application adapt and take different actions depending on the query?&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Query Routing?
&lt;/h2&gt;

&lt;p&gt;Query Routing is about giving our RAG app the power of decision-making. Query Routing is a technique that takes the query from the user and uses it to make a decision on the next action to take, from a list of predefined choices.&lt;/p&gt;

&lt;p&gt;Query Routing is a module in our Advanced RAG architecture. It is usually found after any query rewriting or guardrails. It analyzes the input query and decides the best tool to use from a list of predefined actions. The actions are usually retrieving context from one or many data sources. It could also decide to use a different index for a data source (like parent-child retrieval). Or it could even decide to search for context on the Internet.&lt;/p&gt;

&lt;h2&gt;
  
  
  Which are the choices for the Query Router?
&lt;/h2&gt;

&lt;p&gt;We have to define the choices that the Query Router can take beforehand. We must first implement each of the different strategies and accompany each one with a clear description. It is very important that the description explains in detail what each strategy does, since the router will base its decision on it.&lt;/p&gt;

&lt;p&gt;The choices a Query Router takes can be the following:&lt;/p&gt;

&lt;h3&gt;
  
  
  Retrieval from different data sources
&lt;/h3&gt;

&lt;p&gt;We can catalog multiple data sources that contain information on different topics. We might have a data source that contains information about a product the user has questions about, and another data source with information about our return policies, etc. Instead of looking for the answers to the user’s questions in all data sources, the query router can decide which data source to use based on the user query and the data source description.&lt;/p&gt;

&lt;p&gt;Data sources can be text stored in vector databases, regular databases, graph databases, etc.&lt;/p&gt;

&lt;h3&gt;
  
  
  Retrieval from different indexes
&lt;/h3&gt;

&lt;p&gt;Query Routers can also choose to use a different index for the same data source.&lt;/p&gt;

&lt;p&gt;For example, we could have an index for keyword based search and another for semantic search using vector embeddings. The Query Router can decide which of the two is best for getting the relevant context for answering the question, or maybe use both of them at the same time and combine the contexts from both.&lt;/p&gt;

&lt;p&gt;We could also have different indexes for different retrieval strategies. For example, we could have a retrieval strategy based on summaries, or a sentence window retrieval strategy, or a parent-child retrieval strategy. The Query Router can analyze the specificity of the question and decide which strategy is best to use to get the best context.&lt;/p&gt;

&lt;h3&gt;
  
  
  Other data sources
&lt;/h3&gt;

&lt;p&gt;The decision that the Query Router takes is not limited to databases and indexes. It can also decide to use a tool to look for the information elsewhere. For example, it can decide to use a tool to look for the answer online using a search engine. Or it can also use an API from a specific service (for example, weather forecasting) to get the data it needs to get the relevant context.&lt;/p&gt;

&lt;h2&gt;
  
  
  Types of Query Routers
&lt;/h2&gt;

&lt;p&gt;An important part of our Query Router is how it makes the decision to choose one path or another. The decision mechanism varies across the different types of Query Routers. The following are a few of the most used Query Router types:&lt;/p&gt;

&lt;h3&gt;
  
  
  LLM Selector Router
&lt;/h3&gt;

&lt;p&gt;This solution prompts an LLM with all the different choices, each with its description, together with the input query. The LLM completes the prompt with its selection of the right choice, and that response is parsed programmatically to decide which path to take.&lt;/p&gt;
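
&lt;p&gt;As a rough idea of what this looks like in code, here is a minimal sketch, assuming a generic &lt;code&gt;call_llm&lt;/code&gt; helper and made-up choice descriptions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of an LLM selector router. `call_llm` is a placeholder for
# whatever client you use to query your LLM; the choices are made up.
CHOICES = {
    "1": "Product documentation: questions about product features and usage.",
    "2": "Return policies: questions about returning or exchanging a product.",
    "3": "Web search: any other question.",
}

def route(query: str) -&gt; str:
    options = "\n".join(f"{key}. {desc}" for key, desc in CHOICES.items())
    prompt = (
        "Choose the best data source to answer the user query.\n"
        f"Options:\n{options}\n"
        f"Query: {query}\n"
        "Answer only with the option number."
    )
    # Parse the completion programmatically to decide the path.
    return call_llm(prompt).strip()[0]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;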

&lt;h3&gt;
  
  
  LLM Function Calling Router
&lt;/h3&gt;

&lt;p&gt;This solution leverages the function calling (or tool use) capabilities of LLMs. Some LLMs have been trained to decide to use tools to get to an answer when those tools are provided in the prompt. Using this capability, each of the different choices is phrased as a tool, prompting the LLM to choose which of the provided tools is best for retrieving the right context to answer the query.&lt;/p&gt;
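
&lt;p&gt;Here is a sketch of the idea using the OpenAI SDK's tool-calling interface; the tool names, descriptions and model are assumptions for illustration, not a fixed recipe:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of a function calling router: each routing choice is exposed
# as a tool, and the tool the model picks is the path we run.
from openai import OpenAI

client = OpenAI()

tools = [
    {
        "type": "function",
        "function": {
            "name": "product_docs",  # made-up tool name
            "description": "Retrieve context about product features and usage.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "return_policies",  # made-up tool name
            "description": "Retrieve context about returns and refunds.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    },
]

response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumption: any tool-calling model works here
    messages=[{"role": "user", "content": "How do I return a broken item?"}],
    tools=tools,
)
# Assumes the model picked a tool; its name tells us which path to run.
chosen_path = response.choices[0].message.tool_calls[0].function.name
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;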

&lt;h3&gt;
  
  
  Semantic Router
&lt;/h3&gt;

&lt;p&gt;This solution uses similarity search on the vector embedding representation of the user query. For each choice, we write a few examples of queries that should be routed to that path. When a user query arrives, an embeddings model converts it to a vector representation, which is compared to the example queries for each router choice. The choice whose example is nearest to the user query in vector space is the path the router takes.&lt;/p&gt;
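
&lt;p&gt;A minimal sketch of a semantic router, assuming the &lt;code&gt;sentence-transformers&lt;/code&gt; library and made-up routes and example queries:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of a semantic router: the query embedding is compared against
# example queries for each route. Routes and examples are placeholders.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-small-en-v1.5")

ROUTES = {
    "product_docs": [
        "What does the pro plan include?",
        "How do I set up the device?",
    ],
    "return_policies": [
        "How do I return my order?",
        "Can I get a refund?",
    ],
}

def route(query: str) -&gt; str:
    query_emb = model.encode(query, convert_to_tensor=True)
    best_route, best_score = None, -1.0
    for name, examples in ROUTES.items():
        example_embs = model.encode(examples, convert_to_tensor=True)
        # Keep the route whose closest example is nearest to the query.
        score = util.cos_sim(query_emb, example_embs).max().item()
        if score &gt; best_score:
            best_route, best_score = name, score
    return best_route
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;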

&lt;h3&gt;
  
  
  Zero-shot classification Router
&lt;/h3&gt;

&lt;p&gt;For this type of router, a small LLM is selected to act as a router. This LLM is finetuned on a dataset of example user queries and the correct routing for each of them. The finetuned LLM’s sole purpose becomes classifying user queries. Small LLMs are more cost-effective and more than good enough for a simple classification task.&lt;/p&gt;

&lt;h3&gt;
  
  
  Language Classification Router
&lt;/h3&gt;

&lt;p&gt;In some cases, the purpose of the Query Router will be to redirect the query to a specific database or model depending on the language the user wrote the query in. Language can be detected in many ways, like using an ML classification model or a Generative AI LLM with a specific prompt.&lt;/p&gt;

&lt;h3&gt;
  
  
  Keyword router
&lt;/h3&gt;

&lt;p&gt;Sometimes the use case is extremely simple. In that case, the solution could be to route one way or another depending on whether certain keywords are present in the user query. For example, if the query contains the word “return” we could use a data source with useful information about how to return a product. For this solution, a simple code implementation is enough, and therefore no expensive model is needed, as the sketch below shows.&lt;/p&gt;
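
&lt;p&gt;A minimal sketch, with made-up keywords and route names:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# A keyword router is plain code; no model needed. The keywords and
# route names here are made-up examples.
KEYWORD_ROUTES = {
    "return": "return_policies",
    "refund": "return_policies",
    "recipe": "recipes_source",
}

def route(query: str, default: str = "web_search") -&gt; str:
    lowered = query.lower()
    for keyword, destination in KEYWORD_ROUTES.items():
        if keyword in lowered:
            return destination
    return default
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;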

&lt;h2&gt;
  
  
  Single choice routing vs Multiple choice routing
&lt;/h2&gt;

&lt;p&gt;Depending on the use case, it will make sense for the router to just choose one path and run it. However, in some cases it can also make sense to use more than one choice to answer the same query. To answer a question that spans many topics, the application needs to retrieve information from many data sources, or the response might differ depending on the data source. In that case, we can use all of them to answer the question and consolidate the results into a single final answer.&lt;/p&gt;

&lt;p&gt;We have to design the router taking these possibilities into account.&lt;/p&gt;

&lt;h2&gt;
  
  
  Example implementation of a Query Router
&lt;/h2&gt;

&lt;p&gt;Let’s get into the implementation of a Query Router within a RAG application. You can follow the implementation step by step and run it yourself in the &lt;a href="https://colab.research.google.com/drive/1B1rGvGriKIVe7PMClrMC0z3wMBbsLIYW?usp=sharing" rel="noopener noreferrer"&gt;Google Colab notebook&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For this example, we will showcase a RAG application with a query router. The application can decide to answer questions based on two documents. The first document is a paper about RAG and the second a recipe for chicken gyros. Also, the application can decide to answer based on a Google search. We will implement a single-source Query Router using an LLM function calling router.&lt;/p&gt;

&lt;h3&gt;
  
  
  Load the paper
&lt;/h3&gt;

&lt;p&gt;First, we will prepare the two documents for retrieval. Let's first load the paper about RAG:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxd4ha0xazzpftro4jiol.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxd4ha0xazzpftro4jiol.png" alt="Load the RAG paper" width="800" height="249"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Load the recipe
&lt;/h3&gt;

&lt;p&gt;We will also load the recipe for chicken gyros. This recipe from Mike Price is hosted on tasty.co. We will use a simple web page reader to read the page and store it as text.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgnojj5f9kcgf2ru224hz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgnojj5f9kcgf2ru224hz.png" alt="Load chicken gyros recipe" width="800" height="189"&gt;&lt;/a&gt;&lt;/p&gt;
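
&lt;p&gt;If you are reading along without the notebook, these two loading steps could look roughly like this with LlamaIndex; the file name and URL are stand-ins for the actual paper and recipe:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: load the RAG paper from a local PDF and the recipe from the web.
from llama_index.core import SimpleDirectoryReader
from llama_index.readers.web import SimpleWebPageReader

rag_paper_docs = SimpleDirectoryReader(input_files=["rag_paper.pdf"]).load_data()
recipe_docs = SimpleWebPageReader(html_to_text=True).load_data(
    ["https://tasty.co/recipe/chicken-gyros"]  # stand-in URL
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;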

&lt;h3&gt;
  
  
  Save the documents in a vector store
&lt;/h3&gt;

&lt;p&gt;After getting the two documents we will use for our RAG application, we will split them into chunks and convert them to embeddings using BGE small, an open-source embeddings model. We will store those embeddings in two vector stores, ready to be queried.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqii1tl3bw585byl6z17a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqii1tl3bw585byl6z17a.png" alt="Create Vector Stores" width="800" height="517"&gt;&lt;/a&gt;&lt;/p&gt;
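
&lt;p&gt;A sketch of this indexing step, reusing the documents loaded above and LlamaIndex's default chunking settings:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: chunk both documents and index them with BGE-small embeddings.
from llama_index.core import Settings, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

rag_paper_index = VectorStoreIndex.from_documents(rag_paper_docs)
recipe_index = VectorStoreIndex.from_documents(recipe_docs)

rag_paper_engine = rag_paper_index.as_query_engine()
recipe_engine = recipe_index.as_query_engine()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;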

&lt;h3&gt;
  
  
  Search engine tool
&lt;/h3&gt;

&lt;p&gt;Besides the two documents, the third option for our router will be to search for information using a Google Search. For this example I have created my own Google Search API keys. If you want this part to work, you should use your own API keys.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flo7cagk0qfdfnifxd24r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flo7cagk0qfdfnifxd24r.png" alt="Define Google Search Tool" width="800" height="769"&gt;&lt;/a&gt;&lt;/p&gt;
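
&lt;p&gt;Roughly, this step amounts to the following sketch; the key and engine ID are placeholders for your own credentials:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: expose Google Search through LlamaIndex's tool spec.
# Replace the key and engine ID with your own credentials.
from llama_index.tools.google import GoogleSearchToolSpec

google_spec = GoogleSearchToolSpec(key="YOUR_API_KEY", engine="YOUR_ENGINE_ID")
search_tools = google_spec.to_tool_list()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;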

&lt;h3&gt;
  
  
  Create the Query Router
&lt;/h3&gt;

&lt;p&gt;Next, using the LlamaIndex library, we create a Query Engine Tool for each of the three options that the router will choose between. We provide a description for each of the tools, explaining what it is useful for. This description is very important, since it is the basis on which the Query Router decides which path to choose.&lt;/p&gt;

&lt;p&gt;Finally, we create a Router Query Engine, also with LlamaIndex, and give it the three query engine tools. We also define the selector, the component that makes the choice of which tool to use. For this example, we are using an LLM Selector. It is also a single selector, meaning it will choose only one tool, never more than one, to answer the query.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4uokrd0dnzjxddceyv0q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4uokrd0dnzjxddceyv0q.png" alt="Create the query router" width="800" height="461"&gt;&lt;/a&gt;&lt;/p&gt;
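
&lt;p&gt;Put together, the router could be assembled roughly like this, reusing the query engines from the previous steps (&lt;code&gt;search_engine&lt;/code&gt; stands in for a query engine wrapping the Google Search tool):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of the router assembly with LlamaIndex. `rag_paper_engine`,
# `recipe_engine` and `search_engine` are the query engines built in
# the previous steps.
from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector
from llama_index.core.tools import QueryEngineTool

tools = [
    QueryEngineTool.from_defaults(
        query_engine=rag_paper_engine,
        description="Useful for questions about RAG techniques.",
    ),
    QueryEngineTool.from_defaults(
        query_engine=recipe_engine,
        description="Useful for questions about the chicken gyros recipe.",
    ),
    QueryEngineTool.from_defaults(
        query_engine=search_engine,
        description="Useful for any other question, answered with Google Search.",
    ),
]

router = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),  # single choice, LLM-based
    query_engine_tools=tools,
)

response = router.query("What are the main components of a RAG pipeline?")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;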

&lt;h3&gt;
  
  
  Run our RAG application!
&lt;/h3&gt;

&lt;p&gt;Our Query Router is now ready. Let's test it with a question about RAG. We provided a vector store loaded with information from a paper on RAG techniques. The Query Router should choose to retrieve context from that vector store in order to answer the question. Let's see what happens:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftxqga818ykkxehsj9h1v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftxqga818ykkxehsj9h1v.png" alt="Question the app about RAG" width="800" height="350"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our RAG application answers correctly. Along with the answer, we can see that it provides the sources it got the information from. As we expected, it used the vector store with the RAG paper.&lt;/p&gt;

&lt;p&gt;We can also see an attribute "selector_result" in the result. In this attribute we can inspect which one of the tools the Query Router chose, as well as the reason that the LLM gave to choose that option.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwwdhkf7e61oenxpxxnf1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwwdhkf7e61oenxpxxnf1.png" alt="Selector result for the RAG question" width="800" height="143"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now let's ask it a culinary question. The recipe used to create the second vector store is for chicken gyros. Our application should be able to answer which ingredients are needed for that recipe based on that source.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0sapj6t5ekwkfcmsd0j3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0sapj6t5ekwkfcmsd0j3.png" alt="Question the app about the recipe" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As we can see, the chicken gyros recipe vector store was correctly chosen to answer that question.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffewfsx07u2o5gw3axmgr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffewfsx07u2o5gw3axmgr.png" alt="Selector result for the recipe question" width="800" height="106"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finally, let's ask it a question that can be answered with a Google Search.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnm4f9h3w92402j4ot0xs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnm4f9h3w92402j4ot0xs.png" alt="Question for Google Search" width="800" height="315"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In conclusion, Query Routing is a great step towards a more advanced RAG application. It allows us to set up a base for a more complex system, where our app can better plan how to answer questions. Also, Query Routing can be the glue that ties together other advanced techniques for your RAG application and makes them work together as a whole system.&lt;/p&gt;

&lt;p&gt;However, the road to better RAG systems doesn't end with Query Routing. Query Routing is just the first stepping stone for orchestration within RAG applications. The next stepping stone toward RAG applications that can better reason, decide and take actions based on the needs of the users is Agents. In later articles, we will dive deeper into how Agents work within RAG and Generative AI applications in general.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>machinelearning</category>
      <category>learning</category>
    </item>
    <item>
      <title>Build an Advanced RAG App: Query Rewriting</title>
      <dc:creator>Roger Oriol</dc:creator>
      <pubDate>Sun, 30 Jun 2024 17:02:54 +0000</pubDate>
      <link>https://dev.to/rogiia/build-an-advanced-rag-app-query-rewriting-h3p</link>
      <guid>https://dev.to/rogiia/build-an-advanced-rag-app-query-rewriting-h3p</guid>
      <description>&lt;p&gt;In the last article, I established the basic architecture for a basic RAG app. In case you missed that, I recommend to first read that article over here. That will set the base from which we can improve our RAG system. Also in that last article, I listed some common pitfalls that RAG applications tend to fail on. We will be tackling some of them with some advanced techniques in this article.&lt;/p&gt;

&lt;p&gt;To recap, a basic RAG app uses a separate knowledge base that aids the LLM to answer the user’s questions by providing it with more context. This is also called a retrieve-then-read approach.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;To answer the user’s question, our RAG app will retrieve appropriate context based on the query itself. It will find chunks of text in the vector DB with content similar to whatever the user is asking. Other knowledge bases (search engines, etc.) work the same way. The problem is, the chunk of information where the answer lies might not be similar to what the user is asking. The question can be badly written, or expressed differently from what we expect. And if our RAG app can’t find the information needed to answer the question, it won’t answer correctly.&lt;/p&gt;

&lt;p&gt;There are many ways to solve this problem, but for this article, we will look at query rewriting.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Query Rewriting?
&lt;/h2&gt;

&lt;p&gt;Simply put, query rewriting means rewriting the user query into words that our RAG app knows best how to answer. Instead of just doing retrieve-then-read, our app will follow a rewrite-retrieve-read approach.&lt;/p&gt;

&lt;p&gt;We use a Generative AI model to rewrite the question. This can be a large model, like (or the same as) the one we use to answer the question in the final step, or a smaller model specially trained to perform this task.&lt;/p&gt;

&lt;p&gt;Also, query rewriting can take many different forms depending on the needs of the app. Most of the time, basic query rewriting will be enough. But, depending on the complexity of the questions we need to answer, we might need more advanced techniques like HyDE, multi-querying or step-back questions. More information on those in the following section.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why does it work?
&lt;/h2&gt;

&lt;p&gt;Query Rewriting usually gives better performance in any RAG app that is knowledge intensive. This is because RAG applications are sensitive to the phrasing and specific keywords of the query. Paraphrasing this query is helpful in the following scenarios:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;It restructures oddly written questions so they can be better understood by our system.&lt;/li&gt;
&lt;li&gt;It erases context given by the user which is irrelevant to the query.&lt;/li&gt;
&lt;li&gt;It can introduce common keywords, which will give it a better chance of matching up with the correct context.&lt;/li&gt;
&lt;li&gt;It can split complex questions into different sub-questions, which can be answered more easily on their own, each with its corresponding context.&lt;/li&gt;
&lt;li&gt;It can tackle questions that require multiple levels of thinking by generating a step-back question, a higher-level question about the concept behind the user’s query. It then uses both the original and the step-back question to retrieve context.&lt;/li&gt;
&lt;li&gt;It can use more advanced query rewriting techniques like HyDE to generate hypothetical documents to answer the question. These hypothetical documents will better capture the intent of the question and match up with the embeddings that contain the answer in the vector DB.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  How to implement Query Rewriting
&lt;/h2&gt;

&lt;p&gt;We have established that there are different Query Rewriting strategies depending on the complexity of the questions. We will briefly visit how to implement each of them. Afterwards, we will look at a real example comparing the results of a basic RAG app versus a RAG app with Query Rewriting. You can also follow all the examples in &lt;a href="https://colab.research.google.com/drive/1-NT0_mmyoSnaDQJ1Zuo0XX613TG5lzjZ?usp=sharing" rel="noopener noreferrer"&gt;the article’s Google Colab notebook&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Zero-shot Query Rewriting
&lt;/h3&gt;

&lt;p&gt;This is the simplest form of query rewriting. Zero-shot refers to the prompt engineering technique of giving the LLM examples of the task; in this case, we give none.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0i5am6hvebowyzecjgwc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0i5am6hvebowyzecjgwc.png" alt="Zero-shot Query Rewriting"&gt;&lt;/a&gt;&lt;/p&gt;
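
&lt;p&gt;Stripped down to its essence, a zero-shot rewrite can be as simple as this sketch, where &lt;code&gt;llm&lt;/code&gt; is a placeholder for any chat model call that returns a string:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Zero-shot rewrite: no examples, just the instruction. `llm` is a
# placeholder for any chat model call that returns a string.
def rewrite_query(query: str) -&gt; str:
    prompt = (
        "Rewrite the following question so it is clear, specific and "
        "well suited for retrieving relevant documents. Return only "
        "the rewritten question.\n"
        f"Question: {query}"
    )
    return llm(prompt)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;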

&lt;h3&gt;
  
  
  Few-shot Query Rewriting
&lt;/h3&gt;

&lt;p&gt;For a slightly better result at the cost of using a few more tokens per rewrite, we can give some examples of how we want the rewrite to be done.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fimekxtzo3q156vlhxlgw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fimekxtzo3q156vlhxlgw.png" alt="Few-shot Query Rewriting"&gt;&lt;/a&gt;&lt;/p&gt;
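
&lt;p&gt;A sketch of the few-shot variant, with made-up examples and the same placeholder &lt;code&gt;llm&lt;/code&gt; call:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Few-shot variant: the same idea, but with worked examples prepended.
# The examples here are made up for illustration.
FEW_SHOT_EXAMPLES = (
    "Question: stuff about returns??\n"
    "Rewritten: What is the process for returning a product?\n"
    "Question: app crash why\n"
    "Rewritten: What are common causes of the application crashing?\n"
)

def rewrite_query_few_shot(query: str) -&gt; str:
    prompt = (
        "Rewrite the question for document retrieval, following the "
        "examples.\n"
        f"{FEW_SHOT_EXAMPLES}"
        f"Question: {query}\n"
        "Rewritten:"
    )
    return llm(prompt)  # same placeholder LLM call as above
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;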

&lt;h3&gt;
  
  
  Trainable rewriter
&lt;/h3&gt;

&lt;p&gt;We can fine-tune a pre-trained model to perform the query rewriting task. Instead of relying on examples, we can teach it how query rewriting should be done to achieve the best results in context retrieval. Also, we can further train it with Reinforcement Learning so it learns to recognize problematic queries and avoid toxic and harmful phrases. Alternatively, we can use an open-source model that somebody else has already trained on the task of query rewriting.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sub-queries
&lt;/h3&gt;

&lt;p&gt;If the user query contains multiple questions, context retrieval can get tricky. Each question probably needs different information, and we are not going to get all of it using the full query as the basis for retrieval. To solve this problem, we can decompose the input into multiple sub-queries and perform retrieval for each of them (see the sketch after the diagram).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu9vd5qvm1stg4rbzsa9p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu9vd5qvm1stg4rbzsa9p.png" alt="Sub-queries Query Rewriting"&gt;&lt;/a&gt;&lt;/p&gt;
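
&lt;p&gt;As a sketch, the decomposition is just one extra LLM call before retrieval; &lt;code&gt;llm&lt;/code&gt; and &lt;code&gt;retrieve&lt;/code&gt; are placeholders here:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: split a compound query into sub-queries, then retrieve
# context for each one. `llm` and `retrieve` are placeholders.
def decompose(query: str) -&gt; list[str]:
    prompt = (
        "Split the following question into independent sub-questions, "
        "one per line:\n"
        f"{query}"
    )
    return [line.strip() for line in llm(prompt).splitlines() if line.strip()]

def retrieve_for_sub_queries(query: str) -&gt; list:
    contexts = []
    for sub_query in decompose(query):
        contexts.extend(retrieve(sub_query))  # one retrieval pass each
    return contexts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;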

&lt;h3&gt;
  
  
  Step-back prompt
&lt;/h3&gt;

&lt;p&gt;Many questions can be a bit too complex for the RAG pipeline’s retrieval to grasp the multiple levels of information needed to answer them. For these cases, it can be helpful to generate multiple additional queries to use for retrieval. These queries will be more generic than the original query. This will enable the RAG pipeline to retrieve relevant information on multiple levels.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs5kyuh25pltxn7n1i7g8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs5kyuh25pltxn7n1i7g8.png" alt="Step-back prompt Query Rewriting"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  HyDE
&lt;/h3&gt;

&lt;p&gt;Another method to improve how queries are matched with context chunks is Hypothetical Document Embeddings, or HyDE. Sometimes, questions and answers are not that semantically similar, which can cause the RAG pipeline to miss critical context chunks in the retrieval stage. However, even if the query is semantically different, a response to the query should be semantically similar to another response to the same query. The HyDE method consists of creating hypothetical context chunks that answer the query and using them to match the real context that will help the LLM answer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuq421s3my99dsao07yom.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuq421s3my99dsao07yom.png" alt="HyDE Query Rewriting"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuyli3bxsc9uhzsuclrx8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuyli3bxsc9uhzsuclrx8.png" alt="HyDE Query Rewriting comparison"&gt;&lt;/a&gt;&lt;/p&gt;
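
&lt;p&gt;A minimal HyDE sketch, assuming a vector store with a LangChain-style &lt;code&gt;similarity_search&lt;/code&gt; method and the same placeholder &lt;code&gt;llm&lt;/code&gt; call:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of HyDE: retrieve with a hypothetical answer instead of the
# raw query.
def hyde_retrieve(query: str, vector_store, k: int = 5):
    hypothetical_doc = llm(f"Write a short passage that answers: {query}")
    # The hypothetical answer should sit closer in embedding space to
    # the real answer chunks than the question itself does.
    return vector_store.similarity_search(hypothetical_doc, k=k)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;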

&lt;h2&gt;
  
  
  Example: RAG with vs without Query Rewriting
&lt;/h2&gt;

&lt;p&gt;Taking the RAG pipeline from the last article, “How to build a basic RAG app”, we will introduce Query Rewriting into it. We will ask it a slightly more advanced question than last time and observe whether the response improves with Query Rewriting compared to without it. First, let’s build the same RAG pipeline. This time, though, I’ll use only the top document returned from the vector database, to be less forgiving of missed documents.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F02iuxjn1iwulbtx2q8c2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F02iuxjn1iwulbtx2q8c2.png" alt="Example without Query Rewriting"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The response is good and based on the context, but the model got caught up in me asking about evaluation and missed that I was specifically asking for tools. As a result, the context used does have information on some benchmarks, but it misses the next chunk of information, which talks about tools.&lt;/p&gt;

&lt;p&gt;Now, let’s implement the same RAG pipeline, but this time with Query Rewriting. As well as the query rewriting prompts we have already seen in the previous examples, I’ll be using a Pydantic parser to extract and iterate over the generated alternative queries.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcwfgwuvqfdhtoxfpv1ca.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcwfgwuvqfdhtoxfpv1ca.png" alt="Example with Query Rewriting"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The new query now matches the chunk of information I wanted the answer to come from, giving the LLM a much better chance of generating a good response to my question.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;We have taken our first step out of basic RAG pipelines and into Advanced RAG. Query Rewriting is a very simple Advanced RAG technique but a powerful one for improving the results of a RAG pipeline. We have gone over different ways to implement it depending on what kind of questions we need to improve. In future articles we will go over other Advanced RAG techniques that can tackle different RAG issues than those seen in this article.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>How to build a basic RAG app</title>
      <dc:creator>Roger Oriol</dc:creator>
      <pubDate>Wed, 29 May 2024 19:50:28 +0000</pubDate>
      <link>https://dev.to/rogiia/how-to-build-a-basic-rag-app-h9p</link>
      <guid>https://dev.to/rogiia/how-to-build-a-basic-rag-app-h9p</guid>
      <description>&lt;p&gt;The dawn of Generative AI makes possible new kinds of capabilities for the applications we build. LLMs can answer the user’s questions with an incredible skill. So, why not use them as part of our systems. If the user needs help getting around the app, we can put a chat function where the LLM will answer all the user’s questions. If our app has blog posts explaining important concepts, instead of making the user read all of them to get the knowledge it needs, it could just ask and get an immediate response.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why RAG?
&lt;/h2&gt;

&lt;p&gt;We decide to integrate an LLM into our app to bring these features to our users. However, we soon find that the model can’t answer the users’ questions. It doesn’t have any information about our application! If the information needed to answer is not in the LLM’s training data, it can’t answer. Even worse, if it doesn’t know the answer, it might hallucinate a completely wrong fact! This is bad, so how do we fix it? LLMs with the Transformer architecture have shown great in-context learning capabilities. So, we just have to pass all the facts the model needs in the prompt, together with the question! Uh oh, it will definitely be expensive to stuff all the data in every prompt. So, how do we do it?&lt;/p&gt;

&lt;h2&gt;
  
  
  What is RAG?
&lt;/h2&gt;

&lt;p&gt;RAG stands for &lt;strong&gt;Retrieval Augmented Generation&lt;/strong&gt;. RAG was born together with Transformers. Initially, it was used to augment the pre-training data of LLMs with additional facts. Once Transformers’ in-context learning capabilities became obvious, augmenting the prompt during inference also became common practice.&lt;/p&gt;

&lt;p&gt;A basic RAG pipeline consists of three steps: indexing, retrieval and generation. All the information that the LLM needs to answer is indexed in a vector database. When the user asks a question, we can retrieve the relevant parts of the information from that vector database. Finally, together with just the relevant information and the user’s question, we can prompt the LLM to give an answer based on the information we give it as a context. Let’s look in more detail how to achieve this.&lt;/p&gt;

&lt;h3&gt;
  
  
  Indexing
&lt;/h3&gt;

&lt;p&gt;First, we extract the information that the model needs from wherever it is. Generative models work with plain text (some models can also work with images or other formats, which can also be indexed, but this is a topic for another time). If the information is already in plain text, we are in luck. But it might also be in PDF documents, Word documents, Excel, Markdown, etc. We must convert this data to plain text and clean it so it can be usable for the model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9rbz60uizoiswi8lng0k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9rbz60uizoiswi8lng0k.png" alt="Indexing process"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the information is in text format, we can store it in a vector database. The vector database will store the embeddings representation of that text. That will allow us to search for parts of the text that have a similar embedding representation as another text, therefore they are about a similar concept. We will divide the whole text into smaller parts or chunks, calculate the embeddings representation for each of them, and finally store them in the vector database.&lt;/p&gt;

&lt;h3&gt;
  
  
  Retrieval
&lt;/h3&gt;

&lt;p&gt;When the user asks us a question, we can convert that question into a vector representation, using the same embeddings model we used to index the data. With that vector representation, we will calculate the similarity factor between the question and each of the chunks stored in the vector database. We will select the top K chunks most similar to the query; their contents are about the same concept as the question and therefore might contain the answer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fisms443c1ytjcazo5t17.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fisms443c1ytjcazo5t17.png" alt="Retrieval process"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Generation
&lt;/h3&gt;

&lt;p&gt;A prompt is built, putting together the user’s question and the relevant contexts to help the LLM answer. We might also include previous messages from the conversation between the user and the AI assistant. The LLM generates an answer for the user based on the context, instead of its previously learned pre-training data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx9ih4417ubpht7kyskav.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx9ih4417ubpht7kyskav.png" alt="Retrieval process"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Example
&lt;/h2&gt;

&lt;p&gt;For this example, we will ingest a paper called “Retrieval-Augmented Generation for Large Language Models: A Survey”. We will query the LLM using the information contained in this paper, so it can answer the user’s questions on its contents. You can follow this example in &lt;a href="https://colab.research.google.com/drive/1mFmPN0GBHpS-kMDMuU8EDrWu1KENy69e?usp=sharing" rel="noopener noreferrer"&gt;the Google Colab notebook provided for this article&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;First, we will load the PDF document and parse it using LangChain’s PyPDF connector.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmggsh8vxc1i6aknze50x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmggsh8vxc1i6aknze50x.png" alt="Load document using pypdf"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once we have the text from the document, we have to split it into smaller chunks. We can use one of LangChain’s available splitters, like RecursiveCharacterTextSplitter in this case:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F92h7gf78bv699oup9xfc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F92h7gf78bv699oup9xfc.png" alt="Split document into chunks"&gt;&lt;/a&gt;&lt;/p&gt;
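
&lt;p&gt;In code, these two steps look roughly like this; the file name, chunk size and overlap are assumptions, not the notebook's exact values:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of the loading and splitting steps with LangChain.
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

docs = PyPDFLoader("rag_survey.pdf").load()  # stand-in file name
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(docs)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;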

&lt;p&gt;We will be using BGE-small, an open-source embeddings model. We will download it from the HuggingFace Hub and run it on all chunks to calculate their vector representations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg9qoe5p4b0t37gooh4ix.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg9qoe5p4b0t37gooh4ix.png" alt="Calculate embeddings"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once we have the vector representations for all chunks, we can create an in-memory vector database and store all vectors in it. For this example, we will be using a FAISS database.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkvw8o3f9hwtafr3olord.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkvw8o3f9hwtafr3olord.png" alt="Load embeddings into the vector db"&gt;&lt;/a&gt;&lt;/p&gt;
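
&lt;p&gt;A sketch of the embedding and indexing steps, assuming the &lt;code&gt;chunks&lt;/code&gt; produced by the splitter above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: embed every chunk with BGE-small and store the vectors in an
# in-memory FAISS index.
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-small-en-v1.5")
vector_db = FAISS.from_documents(chunks, embeddings)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;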

&lt;p&gt;The database is now set up. Now, we will take queries from the user about this information. In this case, the user asks what the drawbacks of Naive RAG are. We encode this query using the same embeddings model as before. Then, we retrieve the top 5 chunks most similar to that query.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn3euuftp1j1edlvj8oau.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn3euuftp1j1edlvj8oau.png" alt="Retrieve documents similar to the query from the vectordb"&gt;&lt;/a&gt;&lt;/p&gt;
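
&lt;p&gt;Sketched out, the retrieval step is a single call on the vector store:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: the query is embedded with the same model and the 5 most
# similar chunks are returned.
query = "What are the drawbacks of Naive RAG?"
relevant_chunks = vector_db.similarity_search(query, k=5)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;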

&lt;p&gt;After retrieving the relevant context, we build a prompt using this information and the user’s original query. We will use Claude 3 Haiku as the LLM for this example:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwdl7s7gownp37psg9084.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwdl7s7gownp37psg9084.png" alt="Use the context and the query to generate an answer"&gt;&lt;/a&gt;&lt;/p&gt;
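
&lt;p&gt;And a sketch of this final generation step, with the prompt wording as an assumption:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: stuff the retrieved chunks into the prompt and ask Claude 3
# Haiku through LangChain.
from langchain_anthropic import ChatAnthropic

context = "\n\n".join(chunk.page_content for chunk in relevant_chunks)
prompt = (
    "Answer the question using only the context below.\n"
    f"Context:\n{context}\n\n"
    f"Question: {query}"
)
llm = ChatAnthropic(model="claude-3-haiku-20240307")
answer = llm.invoke(prompt).content
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;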

&lt;h2&gt;
  
  
  Common problems and pitfalls
&lt;/h2&gt;

&lt;p&gt;As the title implies, this solution is a basic or naïve RAG implementation. It will empower your application to make the most out of the LLM it’s using and your data. But it won’t work for all cases. These are just some of the most common problems with RAG:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Retrieve irrelevant information.&lt;/strong&gt; If the retriever gets data from the vector database that is not relevant to the question, it will confuse the model trying to answer it. This might lead to either not using the context to answer the question, or answering something different from what was asked.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Miss important information.&lt;/strong&gt; Maybe the information it needs to answer the question is not in the database. Maybe the retrieval mechanism fails to find the relevant chunks. We must find ways to help the retriever find the information it needs easily and more reliably.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generate responses not supported by the context.&lt;/strong&gt; If the context has the information the model needs, but it doesn’t use it and instead relies on its own pre-training data, all this was for nothing. The information from the pre-training data might be outdated or wrong. We must push the model to always use the context to answer, or to answer “I don’t know” if the context doesn’t contain the answer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Irrelevant response to the query.&lt;/strong&gt; The LLM might use all the information you give it to generate a response, but that doesn’t mean it answers the user’s question. It’s important that the model sticks to the user’s original question, instead of getting lost in a ton of information.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Redundant response caused by similar contexts.&lt;/strong&gt; When we ingest multiple documents with similar information, there’s a chance that the retriever will get multiple chunks of information that say almost the same thing. This might cause the LLM to repeat the same information more than once in its response.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How to avoid these problems?
&lt;/h2&gt;

&lt;p&gt;To avoid these problems, a naïve RAG pipeline might not be enough; we will need to set up a more advanced RAG system. Proven techniques exist to solve the problems laid out above, and we can incorporate them into our pipeline to improve the application’s performance.&lt;/p&gt;

&lt;p&gt;Another important point to address is that, to improve your RAG application, you will need to be able to measure and evaluate the whole process. You can’t improve what you can’t measure. Plus, when you evaluate, you might find that a basic RAG setup is enough for your use case and you don’t need to overcomplicate it. After all, even a very basic RAG implementation can improve your LLM-powered application enormously.&lt;/p&gt;

&lt;p&gt;In future articles, I will explain in more detail the advanced RAG techniques that will help us avoid common problems and bring our RAG applications to the next level.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>rag</category>
    </item>
    <item>
      <title>How to use LLMs: Summarize long documents</title>
      <dc:creator>Roger Oriol</dc:creator>
      <pubDate>Wed, 01 May 2024 20:54:48 +0000</pubDate>
      <link>https://dev.to/rogiia/how-to-use-llms-summarize-long-documents-4ee1</link>
      <guid>https://dev.to/rogiia/how-to-use-llms-summarize-long-documents-4ee1</guid>
      <description>&lt;p&gt;One of the tasks Generative AI models are good at is summarizing texts. However, if you want to summarize a text, you need the whole text to fit in the context window, plus the command to summarize it. The following is an example prompt (you can test it yourself using the &lt;a href="https://colab.research.google.com/drive/1z5wwBLOag7TK5ylimSuosEQQdjkBswYu?usp=sharing" rel="noopener noreferrer"&gt;Google Colab notebook for this article&lt;/a&gt;):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

Write a concise summary of the following text:
The Turing test, originally called the imitation game by Alan Turing in 1950, is a test of a machine's ability to exhibit intelligent behaviour equivalent to, or indistinguishable from, that of a human. Turing proposed that a human evaluator would judge natural language conversations between a human and a machine designed to generate human-like responses. The evaluator would be aware that one of the two partners in conversation was a machine, and all participants would be separated from one another. The conversation would be limited to a text-only channel, such as a computer keyboard and screen, so the result would not depend on the machine's ability to render words as speech. If the evaluator could not reliably tell the machine from the human, the machine would be said to have passed the test. The test results would not depend on the machine's ability to give correct answers to questions, only on how closely its answers resembled those a human would give. Since the Turing test is a test of indistinguishability in performance capacity, the verbal version generalizes naturally to all of human performance capacity, verbal as well as nonverbal (robotic).

Summary: 


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fakgsu22rygorbp8s2bud.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fakgsu22rygorbp8s2bud.png" alt="Summarization with direct use"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This might get tricky if the text to summarize is too long. There are LLMs with very large context windows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPT-4 has a variant with a context window of 32k tokens.&lt;/li&gt;
&lt;li&gt;Claude 3 Opus has a context window of 200k tokens.&lt;/li&gt;
&lt;li&gt;Gemini 1.5 Pro can have a context window of up to 1 million tokens.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, sometimes using an LLM with such a large context window is off the table. There are many factors that can make this the case:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Those models might be too expensive.&lt;/li&gt;
&lt;li&gt;Models sometimes have a hard time using all the information in very long prompts.&lt;/li&gt;
&lt;li&gt;Restrictions, like only being able to use open-source models, might rule out all the models with large context windows.&lt;/li&gt;
&lt;li&gt;Your text might be longer than the context window of any available model (yes, even the 1M-token ones).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Luckily, there exists a technique that can get an LLM to summarize a document longer than its context window size. The technique is called MapReduce. It’s based on dividing the text into a collection of smaller texts that do fit in the context window and then summarizing each part separately. The steps to perform a MapReduce summary are the following (again, you can execute and see for yourself the whole process in the &lt;a href="https://colab.research.google.com/drive/1z5wwBLOag7TK5ylimSuosEQQdjkBswYu?usp=sharing" rel="noopener noreferrer"&gt;Google Colab notebook for this article&lt;/a&gt;):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First, the long document is divided into chunks using a text splitter. The strategy we use to divide the text can be different depending on the type of document.

&lt;ul&gt;
&lt;li&gt;If the text has sections and all sections are smaller than the context window, we could divide it by those sections.&lt;/li&gt;
&lt;li&gt;If the text has no clear sections or the sections are too large, the text can be divided into equal-sized chunks by character count. This approach has the problem of separating related sentences into different chunks. To avoid this, we can have an overlap between chunks: the last N characters of a chunk are repeated as the first N characters of the next chunk, so context is not lost (see the splitter sketch after the diagram below).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6isaw2usolr8imz4stnm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6isaw2usolr8imz4stnm.png" alt="Chunks"&gt;&lt;/a&gt;&lt;/p&gt;
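&lt;p&gt;As a minimal sketch of this splitting step (the chunk size and overlap values are illustrative assumptions):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def split_text(text, chunk_size=1000, overlap=100):
    """Split text into equal-sized character chunks with overlap."""
    chunks = []
    start = 0
    while start &lt; len(text):
        chunks.append(text[start:start + chunk_size])
        # The last `overlap` characters of this chunk are repeated at the
        # start of the next one, so related sentences keep their context
        start += chunk_size - overlap
    return chunks
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;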

&lt;ul&gt;
&lt;li&gt;Second, all chunks are summarized separately using the LLM. Remember to tune your summarization prompt to help the model understand what kind of document it is and how to summarize it properly. For example, you might want the summary to be a bullet point list of the main points of the text, or just a few sentences (a sketch of this map step follows the diagram below).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8t72ugt7q8y60y89git5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8t72ugt7q8y60y89git5.png" alt="Map: Summarization of all chunks"&gt;&lt;/a&gt;&lt;/p&gt;
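&lt;p&gt;A sketch of the map step, where &lt;code&gt;call_llm&lt;/code&gt; is a placeholder for whichever model client you use (the linked notebook has a real implementation):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Tune this prompt to the kind of document and summary you want,
# e.g. a bullet point list vs. a few sentences
MAP_PROMPT = "Write a concise summary of the following text:\n{chunk}\n\nSummary:"

def map_summaries(chunks, call_llm):
    """Summarize every chunk independently."""
    return [call_llm(MAP_PROMPT.format(chunk=chunk)) for chunk in chunks]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;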

&lt;ul&gt;
&lt;li&gt;The third step is optional. If the combination of all summaries does not fit into the context window, we won’t be able to ask for a consolidated summary. Instead, we need to make the combination of all summaries smaller.

&lt;ul&gt;
&lt;li&gt;We will group the summaries in groups that fit into the context window.&lt;/li&gt;
&lt;li&gt;Then, we will write a prompt that combines all the summaries into a single summary with the key ideas.&lt;/li&gt;
&lt;li&gt;After reducing all the summary groups, if the resulting combined summaries still don't fit in the context window, this process is executed again.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F10t3iq6za6c4tvqz1p4z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F10t3iq6za6c4tvqz1p4z.png" alt="Combine: Consolidate summaries"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lastly, when all summaries fit into the context window, we can finally combine or reduce them into the final summary.

&lt;ul&gt;
&lt;li&gt;We will write a prompt that combines all the summaries into a single final summary with all the key ideas. This prompt is usually the same as the one in step 3, since the idea is basically the same: combine a list of summaries into one shorter summary. However, you might want the freedom to let the two prompts diverge to accommodate specific kinds of documents or summaries (a combined sketch of steps three and four follows the diagram below).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdde8ngi6r0qba4gjidqb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdde8ngi6r0qba4gjidqb.png" alt="Reduce: consolidate all summaries into a single final summary"&gt;&lt;/a&gt;&lt;/p&gt;
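&lt;p&gt;Putting steps three and four together, here is a minimal collapse-and-reduce sketch, where &lt;code&gt;fits&lt;/code&gt; is a placeholder that checks whether a list of summaries fits in the context window and &lt;code&gt;call_llm&lt;/code&gt; is your model client (both are assumptions for illustration):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;REDUCE_PROMPT = (
    "Combine the following summaries into a single summary "
    "with the key ideas:\n{summaries}\n\nSummary:"
)

def reduce_summaries(summaries, call_llm, fits):
    """Repeatedly combine groups of summaries until one summary remains."""
    while len(summaries) &gt; 1:
        # Group summaries so that each group fits in the context window
        groups, current = [], []
        for summary in summaries:
            if current and not fits(current + [summary]):
                groups.append(current)
                current = []
            current.append(summary)
        groups.append(current)
        # Combine each group into one shorter summary; if several groups
        # remain, the loop runs again (the optional step three)
        summaries = [
            call_llm(REDUCE_PROMPT.format(summaries="\n".join(group)))
            for group in groups
        ]
    return summaries[0]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;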

&lt;p&gt;And that’s it! You now have a short summary of the most important points of a large document. But before you start processing your whole documentation, there are a few important notes to consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This MapReduce method might not be less expensive than using an LLM with a large context window, especially if your cost is per token (as it is in all Model-as-a-service LLMs). Using this method, you will still use all the tokens in the document, plus the tokens of the intermediate summaries (as output and then as input of intermediate prompts). I recommend you study first which method will be more costly for your use case.&lt;/li&gt;
&lt;li&gt;While better models will produce better summaries, the summarization capabilities of not-so-good models will be good enough for most cases. The result after many layers of summarization will be of similar quality, so you might be able to get away with using cheaper models if you decide to use this method.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>ai</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Understanding LLMs: Mixture of Experts</title>
      <dc:creator>Roger Oriol</dc:creator>
      <pubDate>Mon, 01 Apr 2024 18:16:18 +0000</pubDate>
      <link>https://dev.to/rogiia/understanding-llms-mixture-of-experts-jbm</link>
      <guid>https://dev.to/rogiia/understanding-llms-mixture-of-experts-jbm</guid>
<description>&lt;p&gt;Unlike the Transformers architecture, Mixture of Experts is not a new idea. Still, it is the latest hot topic in Large Language Model architecture. This architecture has been rumored to power OpenAI's GPT-4 (and maybe GPT-3.5-turbo) and is the backbone of Mistral's Mixtral 8x7B, Grok-1 and Databricks' DBRX, which rival or even surpass GPT-3.5 with a relatively smaller size. Follow along to learn more about how this kind of architecture works and why it leads to such great results for LLMs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;A Mixture of Experts is a model with a sparse layer and a router. The experts reside in the sparse layer; they are independent models with no connections between them. Each expert specializes in a specific task. The router is a gating mechanism that learns and decides which expert is best equipped to deal with the input. The simplicity of this concept allows the architecture to work with any type of model. In this article we will focus on Transformers where the experts are feed-forward networks, but they might as well be RNNs, SVMs or even Linear Regression models. Another possibility is hierarchical experts, which use multiple routers at different levels.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1hxxkoshlwhx6pmoytwr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1hxxkoshlwhx6pmoytwr.png" alt="Mixture of Experts Architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The big advantage of this kind of architecture is conditional computation. Every single inference doesn’t need to use all the model’s weights. The gating mechanism is trained to choose the top k experts and route the input only to those. This choice also has a degree of random noise, which prevents overloading the most popular experts and ensures that other experts are also trained on all kinds of data.&lt;/p&gt;
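&lt;p&gt;As an illustration of conditional computation, here is a toy sketch of top-k routing in PyTorch (the sizes, noise scheme and dispatch loop are simplified assumptions, not any production model’s exact design):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoE(nn.Module):
    """Sparse layer: a linear gate routes each token to its top-k experts."""
    def __init__(self, d_model, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (tokens, d_model)
        logits = self.gate(x)
        if self.training:
            # Random noise spreads the load so popular experts don't dominate
            logits = logits + torch.randn_like(logits)
        weights, indices = logits.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        # Only the chosen experts run for each token: conditional computation
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;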

&lt;h2&gt;
  
  
  History
&lt;/h2&gt;

&lt;p&gt;The first sentence of this article stated that Mixture of Experts is not a recent idea. In fact, it was first proposed in 1991 in the paper Adaptive Mixture of Local Experts. In that paper, the authors proposed that when a model had to perform different tasks, it was beneficial to have different experts with decoupled weights, so that no expert was affected by the others fitting their weights to their own tasks.&lt;/p&gt;

&lt;p&gt;Even though the idea is old, the Mixture of Experts architecture benefits a lot from today’s computing power and horizontal scaling. MoE models can easily be distributed between multiple devices. Since not all weights of the model activate on each inference, each expert can be placed on a different device, which frees up the devices holding other experts to handle other inputs in parallel.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9tw1leb8zuyfk8vbxv36.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9tw1leb8zuyfk8vbxv36.png" alt="Mixture of Experts Communication"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How many experts should a model have?
&lt;/h2&gt;

&lt;p&gt;When we train a Mixture of Experts model, we expect each expert to learn and become proficient at specific tasks. Experts do seem to specialize in handling specific inputs. For example, in a language model, experts tend to divide their expertise between handling nouns, verbs, punctuation, numbers and counting, etc. However, they don’t specialize along other lines that we would consider obvious. When we train a MoE model on a multilingual corpus, different experts don’t learn different languages; they all seem to try to learn all of them.&lt;/p&gt;

&lt;p&gt;A crucial decision when designing a Mixture of Experts model is the number of experts it will have. Normally, more experts mean more efficiency, since a smaller part of the whole model will need to be used for each inference. However, there are some caveats. The advantages of adding another expert diminish the more experts we have; 4 to 16 experts seem to be a sweet spot. Also, even though it doesn’t use all weights for every inference, reducing computing time, it still must always hold all the weights in VRAM. Looking at some popular models, DBRX has 16 experts (4 activate at any inference), while Mixtral and Grok have 8 (2 activate).&lt;/p&gt;

&lt;h2&gt;
  
  
  Fine-tuning MoE
&lt;/h2&gt;

&lt;p&gt;A particular problem with Mixture of Experts models is that they are hard to fine-tune. MoEs are very prone to overfitting: after fine-tuning, they are bad at reasoning tasks but still good at knowledge tasks. A way to mitigate this is to reduce the number of experts, as fewer experts lead to better fine-tuning. Also, a recent study has offered some hope for MoE fine-tuning. It had great success fine-tuning a Flan MoE, suggesting that MoEs might benefit from instruction fine-tuning.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scaling MoE
&lt;/h2&gt;

&lt;p&gt;On the other hand, Mixture of Experts models are great for high-throughput scenarios, as opposed to dense models. MoEs can be scaled with many techniques.&lt;/p&gt;

&lt;p&gt;A paper by Google named GShard explored solving device underutilization to successfully scale a MoE horizontally across many devices. They replicated all non-MoE layers on all devices, but gave each device a different expert for the MoE layers. They also introduced the concept of expert capacity: the maximum number of tokens an expert can take before it is considered overflowed, after which the next expert in line takes over (see the sketch after the diagram below).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbd52icf3cghyhr44hlbp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbd52icf3cghyhr44hlbp.png" alt="Expert Capacity Example"&gt;&lt;/a&gt;&lt;/p&gt;
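&lt;p&gt;A toy sketch of expert capacity: the overflow-to-the-next-expert rule follows the description above, while dropping tokens once every expert is full is an assumption for illustration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def assign_with_capacity(preferred_experts, n_experts, capacity):
    """Route each token to its preferred expert, spilling over when full."""
    load = [0] * n_experts
    assignment = []
    for expert in preferred_experts:  # preferred expert index per token
        e = expert
        tried = 0
        while load[e] &gt;= capacity and tried &lt; n_experts:
            e = (e + 1) % n_experts   # overflowed: next expert in line takes over
            tried += 1
        if tried == n_experts:        # every expert is full: drop the token
            assignment.append(None)
        else:
            load[e] += 1
            assignment.append(e)
    return assignment

# e.g. 6 tokens all preferring expert 0, with capacity 2 per expert
print(assign_with_capacity([0, 0, 0, 0, 0, 0], n_experts=2, capacity=2))
# -&gt; [0, 0, 1, 1, None, None]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;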

&lt;p&gt;Another paper, named Switch Transformers, looked at techniques to reduce communication costs between devices and reduce training instabilities. To optimize parallelism, they proposed routing each token to a single expert and reducing the capacity factor so that tokens are divided almost equally between the experts (with some small wiggle room for favoring a specific expert). Switch Transformers also proposed using bfloat16 precision only for the expert layers and full precision for the other layers. This stabilizes training, as layers like the router need higher precision due to an exponentiating function, while still reducing communication costs between experts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Optimizing MoE
&lt;/h2&gt;

&lt;p&gt;Mixture of Experts models can also be optimized through different means. Distillation of a sparse model into a dense model keeps 30% of the sparsity gains while being much smaller in total model size. Another technique is Aggregation of MoE, which merges the weights of all experts into one and still performs very well on all tasks. Also, QMoE is a quantization technique that can store 1.6 trillion parameters in less than 160GB (0.8 bits per parameter!).&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In conclusion, given today’s need for models that perform a multitude of different tasks for millions of people (think ChatGPT or similar products), MoE’s excellence in high-throughput, distributed scenarios shines. Being efficient at training and inference also means lower costs and faster innovation. Of course, not everything is great; there are some drawbacks. Being hard to fine-tune is a problem, as is needing a lot of VRAM to operate. What is certain is that in the future we will keep seeing better techniques to optimize sparse models, and that will lead to better LLMs.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>learning</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>What To Expect For AI in 2024?</title>
      <dc:creator>Roger Oriol</dc:creator>
      <pubDate>Sun, 28 Jan 2024 20:57:26 +0000</pubDate>
      <link>https://dev.to/rogiia/what-to-expect-for-ai-in-2024-4h45</link>
      <guid>https://dev.to/rogiia/what-to-expect-for-ai-in-2024-4h45</guid>
<description>&lt;p&gt;2023 was a great year for AI. Large Language Models were already in the spotlight for both users and businesses. ChatGPT had just been released in late 2022 and was taking the world by storm. Still, 2023 brought more rapid change in the field than we could have imagined. This last year we got the newest version of OpenAI’s model, GPT-4. We also got a ton of open-source models competing with OpenAI, like LLaMa, Falcon, and Mistral. Google didn’t want to miss the party and stepped up its game unveiling Gemini, the successor to Google’s earlier model, PaLM. Anthropic launched Claude, AI21 Labs launched Jurassic-2 and Amazon announced its Generative AI service, Bedrock, as well as its own LLM, Titan. Some LLMs gained a ton of new functionality by going multimodal, as well as embracing agents, which allow them to have up-to-date information and interact with the world around them.&lt;/p&gt;

&lt;p&gt;Also, regulators are beginning to catch up to the rapid evolution of this new technology. The United States of America passed an Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence, and the European Union approved the EU AI Act, the first actual regulation on artificial intelligence.&lt;/p&gt;

&lt;p&gt;So, what’s next for 2024? Given the unpredictably fast rise of new AI technology in the last 14 months, any prediction that we make on its evolution in the future will probably be ridiculed by reality. However, an educated guess can be made given the current direction of AI. I invite you to hear about my take on AI in 2024 and of course, take it with a grain of salt.&lt;/p&gt;

&lt;h2&gt;
  
  
  AGI
&lt;/h2&gt;

&lt;p&gt;As new LLMs begin to approach the trillion-parameter mark, new architecture ideas improve on the transformer architecture, and AI’s reasoning abilities continue to improve, I predict that we will begin to hear some Artificial General Intelligence claims. Artificial General Intelligence is the highly anticipated milestone when AI will be able to perform any intellectual task that a human could do. While this kind of artificial intelligence is still very far from being a reality, it seems inevitable. With current AIs improving exponentially, it's just a matter of time until we begin to hear claims that a top LLM has reached AGI level. While this does not mean that we are there yet, it’s a clear sign that we are on the path to that very realistic future.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open-source models
&lt;/h2&gt;

&lt;p&gt;HuggingFace is a platform that makes it easy to share and contribute to models and datasets. It’s an incredible resource for anyone who wants to start using AI models right away, without having to invest time and money in training their own model from scratch. HuggingFace also provides a ton of resources for working with models and easily fine-tuning them to adapt any model to the user’s needs. In my opinion, these kinds of resources are priceless to both individuals and enterprises that want to get into using AI. And given the fact that every day more and more individuals and businesses introduce AI as a part of their workflows and products, it is likely that this kind of platform will explode further than it already has, providing tons of new open-source models and AI resources.&lt;/p&gt;

&lt;h2&gt;
  
  
  Multimodality
&lt;/h2&gt;

&lt;p&gt;Some LLMs like ChatGPT and Google’s Gemini have already unveiled multimodal capabilities. Multimodal models are Large Language Models that can understand and interact with a combination of different types of inputs, like text, audio, images, and video. Since LLMs convert text and images to a vector representation to interpret them correctly, it's quite challenging to find a vector representation that works for both text and images, so this is an incredible feat. LLMs having multimodal capabilities broaden the list of tasks they can perform by a large margin. This year, I expect even more models to embrace multimodality and the capabilities of AI agents to increase significantly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Responsible AI
&lt;/h2&gt;

&lt;p&gt;As regulators start to intensify legislation on artificial intelligence, Responsible AI will not just be a choice in developing new AI models, but a must. Responsible AI is an approach to training AI models with ethics and safety in mind. Generative AI models should be trained to be fair and unbiased, as well as to avoid harmful responses and avoid endangering human life, property, or the environment. I expect not only to see an improvement in Responsible AI frameworks and expectations but also to witness the first legal consequences for enterprises not following those new rulings.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI use in enterprises
&lt;/h2&gt;

&lt;p&gt;As using LLMs becomes easier and more democratized, the barrier of entry for tons of enterprises that are not primarily technologically focused will begin to fall. More and more kinds of companies will begin to use AI to aid in their daily operations. We will see more AI B2B products to address these new necessities. Some companies may just start using AI for the hype of it but will fail to implement it in a meaningful manner for their bottom line. On the other hand, companies that succeed in introducing AI in their daily operations will see an unprecedented boost in productivity. Sadly, this also might mean that we will see some companies lay off an important part of their workforce, as employees are quickly replaced by AI systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI race
&lt;/h2&gt;

&lt;p&gt;The potential of artificial intelligence is no longer a secret. Many countries around the world will not only begin to legislate for AI but also invest heavily in it, as being at the forefront of AI innovation can give them an edge in many areas, including military, scientific, economic, and informational. Even if one specific part of the world advocates for AI ethics and halts AI progress to ensure it does not become a threat to humanity, other parts of the world might take this opportunity to put themselves at the front of innovation, which would inevitably make the former’s efforts worthless. We might begin to witness a new AI “space race”, as countries put pressure on each other to never stop investing resources in developing new AI, under the threat of falling behind in the new future that AI is shaping.&lt;/p&gt;

&lt;p&gt;Even if my predictions don’t end up becoming a reality, what is certain is that AI will continue to be a major topic in 2024. I’m excited to see how AI will evolve during the following year and what new capabilities it will bring. Certainly, there are many concerns to be had with the rapid evolution of AI, but I’m hopeful that we will be responsible enough to address them before they do become a problem. As a personal note, this year I will be testing how to use AI in many of my daily activities to make them easier or even completely automate them.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>How To Supercharge Your LLM with LangChain Agents</title>
      <dc:creator>Roger Oriol</dc:creator>
      <pubDate>Sun, 13 Aug 2023 11:48:19 +0000</pubDate>
      <link>https://dev.to/rogiia/how-to-supercharge-your-llm-with-langchain-agents-3fl6</link>
      <guid>https://dev.to/rogiia/how-to-supercharge-your-llm-with-langchain-agents-3fl6</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8s0bzr1t908e21a0zj9t.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8s0bzr1t908e21a0zj9t.jpg" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem with LLMs
&lt;/h2&gt;

&lt;p&gt;LLMs are very capable of performing a lot of feats that seem incredible to us, but &lt;strong&gt;they are bound by the borders of what Generative AI was originally meant to do: generate text based on the data it has been trained on.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;They cannot access any services on the web to give us more accurate and recent answers; instead, they must answer using the training data their creators gathered, possibly years ago. They cannot access tools as simple as a calculator, which would let them compute mathematical answers instead of having to memorize the result of every mathematical operation. They rely on us to perform the actions they suggest in the real world and to report back with the results.&lt;/p&gt;

&lt;p&gt;To give our LLMs the powers they are missing to be truly powerful for us, we can use &lt;strong&gt;Agents&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are Agents?
&lt;/h2&gt;

&lt;p&gt;An Agent is an LLM prompted to &lt;strong&gt;reason about the actions needed to complete a request&lt;/strong&gt;, using a set of &lt;strong&gt;tools&lt;/strong&gt; that it has been provided with. An agent can be used alongside &lt;strong&gt;any LLM&lt;/strong&gt;. It is only a layer on top of it that builds the prompts indicating to the Agent the &lt;strong&gt;context, its personality and the strategies it must use&lt;/strong&gt; to complete requests.&lt;/p&gt;

&lt;p&gt;Obviously, the Agent by itself is not enough. We also need to create or reuse tools for the Agent to work with. The &lt;strong&gt;Agent Executor&lt;/strong&gt; is the runtime that executes both the Agent and the tools that it uses. &lt;strong&gt;Tools&lt;/strong&gt; can be standalone functions or they can come in collections of tools called &lt;strong&gt;toolkits&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.ruxu.dev%2Fassets%2Fimages%2FLangChainAgents.svg" class="article-body-image-wrapper"&gt;&lt;img alt="Agent Executor, Agents and Tools diagram" src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.ruxu.dev%2Fassets%2Fimages%2FLangChainAgents.svg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Tools and toolkits
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Tools are functions&lt;/strong&gt; that perform actions on behalf of the LLM. An agent gets a list of tools and will request to use one, several, or none of them. The Agent Executor will execute the required tools and feed the results back to the Agent. An example of a tool is the Google Search function, which allows LLMs to look up information they don't have using a Google search. &lt;/p&gt;

&lt;p&gt;For the Agent to be able to choose correctly which tools it needs, &lt;strong&gt;those tools have to exist, be correctly implemented for the purpose that the Agent needs them for, and be in the list of tools for the Agent&lt;/strong&gt;. But more importantly, these tools &lt;strong&gt;must be thoroughly described&lt;/strong&gt; so Agents can easily decide if they need them and what they will be helpful for.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Toolkits are just a set of tools&lt;/strong&gt; that are usually useful together. The tools might be used together to achieve a multi-step goal, or they might be grouped because they perform similar actions or actions in the same domain. An example of a toolkit is the Gmail Toolkit, which allows LLMs to read emails, draft new emails and delete them. &lt;/p&gt;

&lt;h2&gt;
  
  
  What strategies do Agents use?
&lt;/h2&gt;

&lt;p&gt;After a set of tools is provided to the Agent, how does it know which one to use? Agents can use a variety of prompt engineering strategies to make the LLM reason and decide about the actions it has to take. Some popular prompt engineering practices were discussed in the previous article: Prompt Engineering.&lt;/p&gt;

&lt;p&gt;The most popular strategy for agents is the &lt;strong&gt;ReAct&lt;/strong&gt; method. ReAct uses few-shot learning together with some Chain-of-Thought reasoning examples. These examples contain:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Thoughts&lt;/strong&gt; transcribed from the reasoning strategies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Actions&lt;/strong&gt; that let the LLM interact with its environment in a verbal manner.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observations&lt;/strong&gt; gained after taking the actions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The LLM then learns to act in this manner, interacting with its tools, and can apply the pattern to real interactions. An example of a ReAct prompt would be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Question: What is the current temperature at the city where John Cena was born in Celcius?
Thought: I need to search for the city John Cena was born in, then find the current temperature for that city, then convert the temperature to Celsius.
Action: Search[City where John Cena was born]
Observation: Cena was born in West Newbury, Massachusetts.
Thought: I have to find the current temperature in West Newbury, Massachusetts.
Action: WeatherLookup[West Newbury, Massachusetts]
Observation: 81°F. Sunshine to start, then a few afternoon clouds. High 81F. Winds WNW at 5 to 10 mph. 45% humidity.
Thought: I have to convert 81°F to Celsius.
Action: UnitConversion[FahrenheitToCelcius, 81]
Observation: 27.2
Thought: The answer is 27.2 degrees Celsius.
Action: Finish[27.2 degrees Celsius]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  How to use Langchain Agents
&lt;/h2&gt;

&lt;p&gt;Langchain is a Python library (as well as JS/TS) that is very useful for rapidly getting started on integrating LLMs in your applications. It has many prompting strategies available out of the box and also many Agents.&lt;/p&gt;

&lt;p&gt;You can get started writing tools of your own using Langchain’s &lt;code&gt;tool&lt;/code&gt; decorator, like this (&lt;strong&gt;VERY IMPORTANT: give the function a docstring description; it will be used by the agent to choose which tool to use&lt;/strong&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tool&lt;/span&gt;

&lt;span class="nd"&gt;@tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;unit_conversion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;units&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Converts a given unit to another unit. To convert from Fahrenfeit to Celcius, give FahrenheitToCelcius as first parameter, then the value as second parameter&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
   &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;units&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FahrenheitToCelcius&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
     &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
     &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Invalid type of unit conversion&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Afterwards, use this tool along with your agent of choice, also with Langchain:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.llms&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;initialize_agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AgentType&lt;/span&gt;

&lt;span class="c1"&gt;# We will use OpenAI's GPT as the LLM
&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# We will create the Agent
&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;initialize_agent&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;unit_conversion&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;AgentType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ZERO_SHOT_REACT_DESCRIPTION&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is 81 degrees Fahrenheit in Celsius?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can learn more about Langchain and its tools by visiting &lt;a href="https://python.langchain.com/docs/get_started" rel="noopener noreferrer"&gt;its documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It looks like Agents are going to bring a lot more capabilities to LLMs in the near future. As these new capabilities roll out to applications and tools, more outstanding things will become possible. I’m very curious and looking forward to the new possibilities.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>chatgpt</category>
      <category>langchain</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
