DEV Community: Akshay Ballal

Utility is all you need

Akshay Ballal — Tue, 07 Apr 2026 13:54:49 +0000

Closing the Agent Learning Loop with Utility-Ranked Memory

Your agent failed on the same edge case last Tuesday. Someone noticed, tweaked the prompt, redeployed. This Tuesday it failed again: different input, same root cause. The prompt fix was too narrow. The trace was right there in your observability stack, but nothing connected that signal back to the agent's actual behavior.

This is where most production agent systems sit today. Teams have traces. They have evals. What they don't have is a mechanism that turns failure signals into better retrieval on the next run.

We built Reflect to close that gap. It's an outcome-informed memory layer: it stores reflections from past runs, scores them by how useful they actually were, and re-ranks retrieval so your agent improves with every reviewed trace, not just every prompt rewrite.

The gap between evals and action

Production agent stacks generally have three pieces: observability (traces), evaluation (pass/fail judgments), and the agent itself. The problem is these layers don't talk to each other.

Your observability stack captures every tool call, every LLM completion, every exception. Your eval suite judges whether the final output was correct. But the agent that runs tomorrow? It starts from a blank slate. It has no access to why yesterday's runs passed or failed. The eval signal just dies in a dashboard.

What's missing is a system that consumes traces, absorbs eval outcomes, and converts both into retrievable guidance for future runs.

Reflect sits between your evals and your agent. It treats traces not as passive audit logs but as training signal. When a review marks a trace as failed, Reflect doesn't just file it. It extracts a reflection, a compressed lesson about what went wrong, and stores it as a memory tied to that task type. When a similar task shows up, that reflection surfaces. The agent acts differently because it now has context about what not to do.

The key idea is that memories become outcome-addressable. You don't retrieve by keyword. You retrieve by semantic similarity to the current task, weighted by whether those memories have historically helped or hurt. The eval outcome becomes a first-class signal in retrieval ranking, not just a post-mortem footnote.

The manual improvement tax

After talking to dozens of enterprise agent teams, we keep seeing the same pattern. They can tell when something went wrong. They can measure pass/fail rates. But improvement still happens outside the data loop:

A developer reviews a failing trace, rewrites part of the prompt, redeploys.
The team swaps to a more capable (and more expensive) model.
Someone restructures the tool-call harness.
In extreme cases, the team fine-tunes a custom model.

These are all valid moves. But they share a weakness: they're coarse, slow, and global. A prompt change shifts behavior for every future request, whether or not the change was relevant. A model upgrade costs more across the board, even for tasks the cheaper model handled fine.

None of these approaches directly answer the question every agent team actually cares about: what from past experience helped this task succeed, and what made it fail?

Automated prompt optimization can reduce costs dramatically. Databricks showed up to 90x savings while matching or exceeding hand-tuned performance. But even automated optimization treats the prompt as the unit of improvement. We think the unit should be smaller: the individual memory retrieved at inference time.

Why current agent memory systems stall

Memory was supposed to solve this. In practice, most agent memory frameworks have focused on user continuity: preferences, profile facts, conversation history, personalization. That's useful for chat experiences, but it doesn't create a self-improving system for production agents.

Here's a rough breakdown of the dominant approaches and where they fall short for outcome-based learning:

Approach	What it stores	Retrieval signal	Learns from outcomes?
Conversation buffer (LangChain, most frameworks)	Raw chat history	Recency	No
Semantic memory (Mem0, LangMem)	Extracted facts, preferences	Embedding similarity	No
Temporal knowledge graph (Zep/Graphiti)	Entity relationships over time	Graph traversal + recency	No
Episodic + Reflexion	Verbal self-reflections	Similarity to current task	Partially: reflections capture lessons, but retrieval isn't ranked by outcome
Reflect	Task-linked reflections with utility scores	Similarity weighted by outcome-derived utility	Yes: utility scores update after every reviewed run

Systems like Mem0 get strong results on conversational benchmarks (26% improvement over OpenAI Memory on LoCoMo). But their retrieval is driven by semantic similarity and recency, not by whether a memory actually helped produce a good outcome. Reflexion (Shinn et al., NeurIPS 2023) introduced the important idea that agents can learn from verbal self-reflection, hitting 91% pass@1 on HumanEval. But Reflexion's retrieval doesn't incorporate a learned utility signal from downstream results.

What production agents need is memory that's tied to tasks (not just users), updated by outcomes (not just recency), shaped by failure (not only similarity), and grounded in reflections about what actually happened.

Without that, memory is a static index. It retrieves things that sound similar, not things that have a track record of helping.

Utility: a credit score for memories

Think of utility like a credit score. A credit score doesn't just record that you had a loan; it tracks whether you paid it back. Utility doesn't just record that a memory was retrieved; it tracks whether the run it participated in succeeded or failed.

In Reflect, a memory carries:

The original task the agent was solving
A reflection about what worked or failed
The review outcome (pass, fail, or detailed feedback)
A utility score that changes every time the memory is retrieved and the run is reviewed
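The fields above could be modelled roughly as follows. This is only a sketch, not Reflect's actual schema: the class name Memory, the field names, and the neutral starting utility of 0.5 are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Memory:
    """Hypothetical record for one stored reflection (not Reflect's real schema)."""

    task: str                       # the original task the agent was solving
    reflection: str                 # lesson distilled from the trace and review
    outcome: str                    # review result, e.g. "pass" or "fail"
    feedback: Optional[str] = None  # optional detailed reviewer feedback
    utility: float = 0.5            # assumed neutral starting score in [0, 1]
    times_retrieved: int = 0        # reviewed runs that retrieved this memory


memory = Memory(
    task="Handle a duplicate charge complaint",
    reflection="Check settlement status before initiating any refund.",
    outcome="fail",
    feedback="Did not check existing settlement state before issuing refund.",
)
print(memory.utility)  # 0.5
```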

The ranking formula combines both signals:

score = (1 - λ) × similarity + λ × utility

Similarity tells you whether a memory is about the same kind of problem. Utility tells you whether that memory has historically helped. You need both. At λ = 0 retrieval is pure semantic search; at λ = 1 it ranks entirely by past outcomes.
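To make the blend concrete, here is a small sketch of the formula applied to two candidate memories. The function name rank_memories, the tuple layout, and the example numbers are made up for illustration, and it assumes similarity and utility are both on a 0 to 1 scale.

```python
def rank_memories(candidates, lam=0.5):
    """Sort (name, similarity, utility) tuples by the blended score, best first."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    scored = [
        (name, (1 - lam) * similarity + lam * utility)
        for name, similarity, utility in candidates
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical candidates: (name, similarity, utility), both assumed in [0, 1].
candidates = [
    ("issue an immediate refund", 0.90, 0.20),
    ("check settlement status first", 0.85, 0.50),
]

print(rank_memories(candidates, lam=0.0)[0][0])  # issue an immediate refund
print(rank_memories(candidates, lam=0.5)[0][0])  # check settlement status first
```

With λ = 0 the slightly more similar memory wins. At λ = 0.5 its poor track record drags it below the alternative.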

If a memory was retrieved and the run passed, its utility rises. If it was retrieved and the run failed, its utility falls. Over repeated runs, Reflect starts preferring memories that repeatedly contribute to successful outcomes and down-ranking the ones that keep showing up in failures.
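The exact update rule is not spelled out here, so the following is only one plausible sketch: an exponential moving average that nudges utility toward 1 after a pass and toward 0 after a fail. The function name update_utility, the learning rate of 0.2, and the starting value of 0.5 are assumptions.

```python
def update_utility(utility, passed, learning_rate=0.2):
    """Move utility toward 1.0 after a pass and toward 0.0 after a fail."""
    target = 1.0 if passed else 0.0
    return utility + learning_rate * (target - utility)


utility = 0.5  # assumed neutral starting value
history = []
for passed in [False, False, True]:
    utility = update_utility(utility, passed)
    history.append(round(utility, 3))

print(history)  # [0.4, 0.32, 0.456]
```

A moving average keeps the score bounded and lets recent reviews count more than old ones, which is one reason to prefer it over a raw pass/fail counter.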

That's how the loop closes. The signal isn't going to a dashboard for a human to interpret. It feeds directly into retrieval ranking.

Storing reasoning, not just facts

Most memory systems store facts: user preferences, conversation history, document chunks. These are static artifacts. Reflect stores something different: reasoning about outcomes.

A fact memory says: "The user prefers dark mode." A reasoning memory says: "When processing a duplicate charge, check settlement status before issuing refunds because last time we skipped that check, the customer got two refunds."

Reflect generates these by combining two inputs: the trace (what the agent actually did) and the review (whether it worked). The trace gives you the trajectory: tool calls, LLM completions, which memories influenced decisions. The review gives you the outcome signal.

From those two inputs, Reflect constructs a reflection. Not a human-written note, but an LLM-generated compression of the lesson in that trace-review pair. What the agent tried, what went wrong, what should happen next time. The reflection gets embedded and stored as a memory, linked to the original task type.

When your agent queries Reflect before a run, it doesn't get document snippets. It gets distilled experience: "Last time you saw a task like this, you retrieved memory X, took action Y, and it failed. Consider action Z instead."

Over many runs, the memory store accumulates a layer of learned reasoning. Each memory carries not just the reflection text but a utility score tracking whether following that advice led to success. Retrieval becomes experience-weighted reasoning: the agent surfaces advice that has survived the test of multiple reviewed runs.

Where utility changes behavior: two examples

Customer support: learning which resolution strategies work

A support agent handles refund requests. Early on, Reflect retrieves a memory: "When a customer reports a duplicate charge, issue an immediate refund." The agent follows this and processes a refund, but the customer already had a pending settlement. The run gets reviewed as a failure: "Did not check existing settlement state before issuing refund."

Two things happen. The utility of that "issue an immediate refund" memory drops. And Reflect stores a new reflection: "For duplicate charge complaints, check settlement status before initiating any refund." Next time a similar ticket comes in, the new reflection ranks higher because its utility starts neutral while the old one has been penalized. The agent now checks settlement state first.

No prompt rewrite. No model swap. The retrieval layer learned from the outcome.

Code review: surfacing advice developers actually accept

A code review agent suggests changes on PRs. Accepted suggestions count as a pass; dismissed ones as a fail.

Over hundreds of PRs, Reflect learns which categories of advice are consistently useful. Memories about "add null checks before database lookups" accumulate high utility because reviewers almost always accept them. Memories about "rename this variable for clarity" accumulate lower utility because reviewers frequently dismiss them as noise.

After a few weeks, the agent's suggestions shift. It still retrieves both types, but the ranking now favors the high-acceptance patterns. The agent naturally focuses on feedback that developers actually value.

Integration

If you're using the Python SDK, the simplest way to wire in the loop is with client.trace(). It retrieves memories on entry and records which memories were used when the trace is submitted.

from reflect_sdk import ReflectClient

client = ReflectClient(
    base_url="https://api.starlight-search.com",
    api_key="rf_live_...",
    project_id="support-agent",
)

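# Memories are retrieved on entry; ctx.augmented_task is the task with them injected.
with client.trace("Handle a duplicate refund request for an enterprise customer") as ctx:
    messages = [
        {"role": "user", "content": ctx.augmented_task},
    ]
    # my_llm is a placeholder for your own model call.
    response = my_llm(ctx.augmented_task)
    messages.append({"role": "assistant", "content": response})

    # Attach the trajectory and the review result so utility can be updated.
    ctx.set_output(
        trajectory=messages,
        final_response=response,
        result="pass",
        model="gpt-5.4-mini",
    )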

That block handles the core work: query memories before the run, inject them into the task, submit the trace, attach the review result, and preserve the retrieved memory IDs so utility can update correctly.

What happens after a review

Once a review comes in, Reflect does three things:

Creates a reflection from the task, trajectory, result, and feedback.
Stores that reflection as a new memory with an initial utility score.
Updates the utility of the memories that were retrieved for that run.

Every reviewed run leaves behind two kinds of value: a new reflection for future similar tasks and a better ordering of existing memories. This is what makes utility different from plain semantic search. Semantic search tells you what looks similar. Utility tells you what has earned trust.

Limitations and open questions

Some things we haven't solved yet, and want to be upfront about.

Cold start. A new project has no memories and no utility signal. The first N runs are pure semantic retrieval until enough reviews accumulate. We're exploring seeding strategies (importing reflections from related projects, bootstrapping from existing evals), but this is still active work.

Utility drift. Task distributions change. A memory that was highly useful six months ago may be irrelevant or harmful today. We apply time-decay to utility scores, but the right decay schedule is task-dependent and hard to set in advance.
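As an illustration of what time-decay could look like, here is a sketch that pulls a utility score back toward a neutral value with an exponential half-life. The decay form, the 90-day half-life, the neutral value of 0.5, and the name decayed_utility are all assumptions, not Reflect's actual schedule.

```python
import math


def decayed_utility(utility, age_days, half_life_days=90.0, neutral=0.5):
    """Pull a utility score back toward neutral as the evidence behind it ages."""
    weight = math.exp(-math.log(2) * age_days / half_life_days)
    return neutral + (utility - neutral) * weight


print(round(decayed_utility(0.9, age_days=0), 3))    # 0.9
print(round(decayed_utility(0.9, age_days=90), 3))   # 0.7
print(round(decayed_utility(0.9, age_days=180), 3))  # 0.6
```

Decaying toward neutral rather than toward zero means an old memory is treated as unproven, not as harmful.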

Review quality. The whole loop depends on accurate reviews. Noisy pass/fail labels mean noisy utility scores. Automated review via judge models helps with throughput but introduces its own failure modes.

Lambda tuning. The similarity-vs-utility balance (λ) is a hyperparameter. We default to 0.5, but the right value depends on how much outcome signal your project has. We plan to make this adaptive.

Wrapping up

The bottleneck in most agent systems isn't missing traces or missing evals. It's the missing bridge between observation and action. Utility is that bridge. It turns outcomes into a learned ranking signal so your agent retrieves what has worked, not just what sounds relevant.

Once you have that, memory stops being a static retrieval layer and starts acting like experience.

We're building Reflect to make this the default. Try it at reflect.starlight-search.com and check out the full API reference in our docs.

References

Databricks Engineering Blog, "Building State-of-the-Art Enterprise Agents 90x Cheaper with Automated Prompt Optimization," 2025.
N. Shinn et al., "Reflexion: Language Agents with Verbal Reinforcement Learning," NeurIPS 2023. arXiv:2303.11366
P. Chhikara et al., "Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory," ECAI 2025. arXiv:2504.19413

Infinite Compute Glitch - Why Local AI Matters

Akshay Ballal — Fri, 28 Mar 2025 13:13:05 +0000

The Infinite Compute Glitch: Why Local AI Matters

In our headlong rush toward AI-powered everything, we've created an interesting paradox. Companies and individuals alike have begun to operate as if compute resources are infinite—just fire off another API call to OpenAI, Claude, or Gemini. But this mindset is creating a significant blind spot in how we approach technology, with consequences that extend far beyond our screens.

I wasn't just creating another tool when I was building Starlight, a desktop app that indexes and enables semantic search across your files using local models. I was making a statement about how we should approach AI in our daily lives.

Starlight lets you chat with your data and find information across your files without ever sending that data to external services. Everything happens on your machine, using your compute resources. This blog explores why apps like Starlight matter and why we should not ignore local AI.

The Infinite Compute Glitch

I've started calling our current approach to AI "the infinite compute glitch"—a collective delusion that we can just keep scaling up models and data centers without consequences.

Let's look at the hard numbers:

A single ChatGPT query consumes approximately 10-15x more energy than a Google search (RW Digital)
Training GPT-4 is estimated to have used over 62 million kilowatt-hours of electricity
AI data centers can use between 3-5 million gallons of water per day for cooling
The carbon footprint of training a single large language model can equal that of five cars over their entire lifetimes

A 2019 study from the University of Massachusetts Amherst found that training a single transformer-based AI model can emit as much carbon as five cars over their entire lifetimes. Meanwhile, Microsoft had to limit Azure AI services in some regions due to capacity constraints, showing that even the cloud giants are hitting physical limits.

Why Local First Matters

Your Data Stays Home

The most obvious benefit is privacy. When your files never leave your computer, you maintain complete control over your information. No terms of service changes, no worrying about how your data might be used to train future models.

Consider this: in 2023 alone, major AI companies updated their terms of service multiple times, often expanding their rights to use user data. Samsung even banned employees from using generative AI tools after sensitive code was uploaded to ChatGPT. With local AI, these scenarios become impossible.

At ASML, we are not allowed to use AI tools because the company holds a lot of sensitive information that is also of national interest. I understand the concern, but when other, less sensitive industries use AI to make their employees more productive, we feel stuck in a past era. This can easily be solved by empowering employees with local AI, either on-device or on-prem.

Distributed Computing Is Resilient Computing

We're creating a more distributed system by leveraging the computing power already sitting on our desks and in our laps. Instead of centralizing all AI compute in massive data centers owned by a handful of companies, we're spreading the load across millions of devices.

The average modern laptop now has more computing power than what was used to train early versions of BERT, one of the breakthrough language models from 2018. A typical gaming PC with a decent GPU can run models with billions of parameters. We're sitting on a vast, untapped ocean of compute power.

Speed Without the Wait

For many tasks, local processing is actually faster. No upload time, no API latency, no waiting for your request to be queued behind thousands of others. The results appear as quickly as your computer can process them.

In our internal testing, Starlight's local search returned results from a 1GB document collection in under 200ms, while the equivalent cloud API call took over 2 seconds when accounting for network latency and server processing time. That's a 10x difference in speed for common queries.

Right-Sized Solutions

Here's something that might surprise you: more than 80% of the AI tasks the average person needs day-to-day can be handled by smaller, local models. We don't need GPT-4 to summarize a document or help draft an email. By matching the task to the appropriate model size, we save enormous resources.

Take document summarization: a 1.5 billion parameter model running locally can produce summaries nearly indistinguishable from those generated by models 100x larger for most business documents. The difference? The smaller model uses a fraction of the energy and requires no data transmission.

What is Propelling Local AI Forward

Here's what's making local AI happen:

1. Hardware: Increased Processing Power

It's pretty amazing how much more powerful our devices have become! Laptops, smartphones, and even smaller embedded systems now pack a serious punch with their processors and memory. This means they're not just for basic tasks anymore; they've got the muscle to handle sophisticated AI models directly. Think about it: your phone today has more computing power than entire computers from not too long ago. This increase in processing power is a huge factor in enabling local AI. With the recent release of ARM-based processors for PCs, local AI has become even more plausible. The key advancements have been unified memory shared between the CPU and GPU, which lets you fit even larger models on the GPU, and highly efficient CPU and GPU cores. This is only going to improve, so we need to keep pushing AI to the edge to make use of this untapped compute power.

2. Model Architecture and Data

There is no doubt that LLMs have become ever more intelligent. Every day a new SOTA model is released. But even more important is that the models are getting smaller and more efficient. While companies like OpenAI are pushing toward ever larger models, which I believe is the wrong direction, companies like Google, Mistral and Meta continue to surprise us with small models that are as powerful as the largest models from last year. For example, Gemma3 27b, an open-source 27-billion-parameter model from Google, is as good as their own closed-source Gemini-1.5-pro, which is believed to have over 300 billion parameters. We can see from the plot below that both Mistral Small 3.1 and Gemma-3 are better than GPT-4o-mini, which must be a very large model at this point. Even though these small models are good, they only work well on high-end computers. Users with a medium-spec computer have to settle for even smaller versions of these models. But this is a start nonetheless.

Several innovations have made this possible. KV caching, Flash Attention, and Rotary Position Embedding (RoPE) are just some of the advances that have reduced the gap between the performance of a small local model and a large cloud model. The quality of the training data has also improved over the years. Instead of training on raw data from the internet, which can be very noisy, a lot of clever preprocessing is now done to allow smaller models to learn just as well.

3. Model Serving: Open-Source Models and Tools

Lastly, the final piece is how these local AI models are distributed. Here, the open-source community has been quite active, and projects like Ollama, LlamaCPP, Candle, and vLLM are making it easier to run these open-source models efficiently on consumer-grade hardware. These projects are the interface between the model weights and the actual hardware. Making this interaction efficient is the key to making the best models run on the least amount of compute.

But just serving these models isn’t enough. Tools like ChatGPT provide additional user experiences such as Canvas, Web Search, and Voice to Text, which keep people on their services. Even if we made the best models available locally, people would still gravitate towards OpenAI, Claude, etc. for the user experience. That’s where products like Starlight come in. New-age software has to be as local-first as possible and give users the same experience they get on ChatGPT or similar platforms, using their own local AI models. At this point, it is more about breaking the habit. Let’s say I want to summarize a research paper. The default habit that people have developed over the last three years is to upload the paper to ChatGPT and ask for a summary. But this is a very easy task for local models. So how can we push software to break this habit and exploit use cases where local AI can outperform cloud AI? This is a big challenge, which we at Starlight strive to solve by building features that make you gravitate towards local AI rather than cloud models for most of your tasks.

Innovating Within Constraints

At Starlight, we're embracing constraints rather than ignoring them. We're finding clever ways to make smaller models more effective through better indexing, retrieval, and context management.

For example, we've developed a hybrid approach that uses local embedding models to index documents and then employs efficient retrieval algorithms to find the most relevant content. This approach reduces the context window needed for generation tasks, allowing even small local models to produce high-quality results on par with much larger cloud models. Moreover, we are implementing methods like recursive summarization and graph traversal to compress large amounts of information into the constrained context of local models. Also, it's not just about context length: even if a local model supports a long context, the hardware often cannot process it, because the context might not fit in memory or the computation simply takes too long.
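To illustrate the idea of retrieving only what fits, here is a toy sketch that ranks chunks by cosine similarity and greedily packs the best ones into a token budget. It is not Starlight's implementation: the function names, the tuple layout, and the 2-D vectors standing in for real local embeddings are all assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_context(query_vec, chunks, token_budget):
    """Greedily pick the most similar chunks that fit within token_budget.

    chunks is a list of (text, embedding, token_count) tuples.
    """
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    selected, used = [], 0
    for text, _, tokens in ranked:
        if used + tokens <= token_budget:
            selected.append(text)
            used += tokens
    return selected


# Toy 2-D embeddings standing in for a real local embedding model.
chunks = [
    ("Positional encoding adds order information.", [1.0, 0.1], 40),
    ("The optimizer uses a warmup schedule.", [0.1, 1.0], 40),
    ("Sinusoids of different frequencies are used.", [0.9, 0.3], 40),
]
print(select_context([1.0, 0.0], chunks, token_budget=80))
```

Only the two chunks closest to the query are kept, so the local model sees 80 tokens of context instead of the whole collection.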

This approach forces us to be more thoughtful about how we use AI. Rather than throwing compute at every problem, we ask: What's the most efficient way to solve this? How can we deliver value while minimizing resource use?

A More Balanced Future: The Hybrid Approach

I'm not suggesting we abandon cloud AI entirely—there are absolutely use cases where larger models are necessary. But by being intentional about when and how we use these resources, we can create a more balanced, sustainable approach to AI.

Imagine a world where:

Your personal documents, emails, and notes are processed entirely on your devices, maintaining your privacy
Basic creative tasks like drafting emails or summarizing articles happen locally, with instant response times
Only specialized tasks that truly require massive models—like complex code generation or scientific research—call out to cloud services
Organizations maintain their own small, specialized models trained on their data, running on local infrastructure

This isn't science fiction—all the technology exists today. What's missing is the mindset shift.

Taking Action

So what can you do to be part of this shift?

Try local-first AI tools like Starlight for your personal and professional needs
Be mindful of your API usage when you do use cloud services
Support open-source AI projects that are making models more accessible and efficient

The future isn't about unlimited compute—it's about smart compute. It's about knowing when a local model is sufficient and when you truly need something more powerful.

So next time you're about to send that API call, ask yourself: Could this happen locally instead? Your privacy, your wallet, and our planet might thank you for it.

Optimize VLM Tokens with EmbedAnything x ColPali

Akshay Ballal — Sun, 12 Jan 2025 12:38:49 +0000

ColPali, a late-interaction vision model, enables text searches within images. This means you can pinpoint the exact pages in a PDF containing relevant text, even if the text exists only as part of an image. For example, if you have hundreds of pages in a PDF, or even hundreds of PDFs, ColPali can identify the specific pages matching a query, which greatly streamlines information retrieval. This approach has come to be widely known as Vision RAG.

However, due to its computational demands, running the ColPali model directly on a local machine might not always be feasible. To address this, I developed an ONNX version of ColPali that can be quantized to different precisions. Quantization reduces the precision of the model's weights, significantly lowering computational and memory requirements. Despite this optimization, the quantized model maintains performance nearly equivalent to the original. In this article, we will look at how to use ColPali for Vision RAG with the EmbedAnything library, which I have been developing for the last few months. You can read more about EmbedAnything here
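To give a feel for what quantization does to weights, here is a generic sketch of symmetric int8 quantization in NumPy. It is not how the ColPali ONNX model was actually quantized; the function names and the per-tensor scale are assumptions chosen to keep the example short.

```python
import numpy as np


def quantize_int8(weights):
    """Symmetric per-tensor quantization of float32 weights to int8."""
    scale = float(np.abs(weights).max()) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero tensor: any scale reproduces it exactly
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q, scale):
    """Map int8 values back to approximate float32 weights."""
    return q.astype(np.float32) * scale


rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 8)).astype(np.float32)
q, scale = quantize_int8(weights)
error = float(np.abs(weights - dequantize(q, scale)).max())

print(q.nbytes, weights.nbytes)   # 32 128
print(error <= scale / 2 + 1e-6)  # True
```

The int8 tensor takes a quarter of the memory, and the worst-case error per weight is about half a quantization step.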

What is Vision RAG?

Let’s look a bit deeper into what Vision RAG is. Traditional RAG methods use text throughout the pipeline. They store text chunks and their embeddings in a vector database and then retrieve these chunks for downstream tasks. The simplest (naive) RAG attaches these chunks as context to the original query to give the model more information.

There are two problems here. The first is that extracting text from many data sources may not be possible. Think about scanned PDFs or documents with many graphics, such as design pamphlets. Traditional RAG falls apart if any of the documents you work with are like this. A band-aid is to use an OCR engine to extract the text, but this adds moving parts to the process, and OCR engines are pretty fragile. The second problem, even if you manage to get the text, is chunking. How do you decide what the chunk size and the overlap should be? Even if you find optimal parameters for a few documents, will they hold for new ones? All these parameters add to the design space, and RAG performance needs to be continuously re-evaluated against these design choices.

Vision RAG tries to solve this by removing the chunking process altogether and instead storing each page image as a multi-vector embedding in the database. At query time, a Late Interaction Score (LIS), analogous to classical cosine similarity but defined for multi-vector embeddings, is computed between the query and each page, and the database returns the pages with the highest scores. These pages can then be sent to a Vision Language Model (VLM) along with the original query to answer the question.
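The Late Interaction Score can be sketched in a few lines of NumPy. For each query token we take its best-matching page patch, then sum those maxima over the query tokens. The shapes and random vectors below are toy values rather than real ColPali embeddings, and the function name is my own.

```python
import numpy as np


def late_interaction_score(query_vecs, page_vecs):
    """MaxSim: for each query token take its best-matching patch, then sum."""
    sims = query_vecs @ page_vecs.T  # shape: (query_tokens, page_patches)
    return float(sims.max(axis=1).sum())


rng = np.random.default_rng(0)
query = rng.normal(size=(5, 16))      # 5 query tokens, 16-dim vectors
pages = rng.normal(size=(3, 20, 16))  # 3 pages, 20 patches each

# One page at a time.
loop_scores = np.array([late_interaction_score(query, page) for page in pages])

# The same computation for all pages at once with einsum.
batched = np.einsum("nd,csd->cns", query, pages).max(axis=2).sum(axis=1)

print(np.allclose(loop_scores, batched))  # True
print(int(np.argmax(batched)))            # index of the best-matching page
```

Unlike single-vector cosine similarity, every query token gets to pick its own best patch, which is what lets a short query match a small region of a dense page.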

The image below shows this process from start to end. Since vision language models are more expensive than text models, Vision RAG is even more important because you don’t have to send complete PDFs to the model. You are just sending the relevant pages. This can save a lot of costs. The document embedding generation happens offline and is taken care of by EmbedAnything. One drawback with this approach is that not all vector databases today support storing multi-vectors. A few that support these are Qdrant and Vespa.

Let us look at how you can use ColPali models with EmbedAnything to convert PDFs into multi-vector embeddings. In this example, we will not use a vector database; instead, we compute the late interaction score of the query against all the pages.

Step 1: Install the dependencies

EmbedAnything requires poppler to convert PDFs into images, so make sure you have it installed (the poppler-utils package on Linux).

For Linux:

apt install poppler-utils

For Mac:

brew install poppler

For Windows:

Download the binary from the release page below, unzip it, and add the bin folder to your system path.
https://github.com/oschwartz10612/poppler-windows/releases/tag/v24.08.0-0

Using the GPU version of EmbedAnything is highly recommended because ColPali is based on PaliGemma and needs as much computation as any other small language model.

pip install embed-anything-gpu tabulate openai

Let’s import EmbedAnything and the other dependencies:

import base64
from embed_anything import EmbedData, ColpaliModel
import numpy as np
from tabulate import tabulate
from pathlib import Path
from PIL import Image
import io
import matplotlib.pyplot as plt
import openai
import os

Step 2: Get the files that need to be indexed

For this demo, we will clone the EmbedAnything repo, which has some test PDFs, including “Attention is all you need” and a Mistral paper.

if not os.path.exists("EmbedAnything"):
  !git clone https://github.com/StarlightSearch/EmbedAnything.git

Step 3: Load the ColPali ONNX Model

Use ColpaliModel.from_pretrained_onnx to load the ColPali ONNX model from the specified model ID. This initializes the model for embedding tasks. If you are using a Python notebook, this can take some time because the model is being downloaded. Unfortunately, the progress bar is not visible in a notebook. You can also load the original ColPali model instead of the ONNX model using the from_pretrained_hf function.

model: ColpaliModel = ColpaliModel.from_pretrained_onnx("starlight-ai/colpali-v1.2-merged-onnx", None)

Step 4: Load the files and embed them.

Now, we just load all the files from the directory with a PDF extension. Then, for each file, we run the embed_file function with a batch_size of 1. You can increase the batch size if you have higher VRAM, but one works well.

directory = Path("EmbedAnything/test_files")
files = list(directory.glob("*.pdf"))
file_embed_data: list[EmbedData] = []
for file in files:
    try:
        embedding: list[EmbedData] = model.embed_file(str(file), batch_size=1)
        file_embed_data.extend(embedding)
    except Exception as e:
        print(f"Error embedding file {file}: {e}")
file_embeddings = np.array([e.embedding for e in file_embed_data])
print("Embedded Files: ", files)

file_embed_data is a list of EmbedData objects, each of which carries the embedding along with metadata such as the page number, the file name, and the image of the page as a base64 string. You can now store these embeddings in a vector database of your choice.

Step 5: Process the query

We do the same for the query using the embed_query function.

query = "What is positional encoding?"
query_embedding = model.embed_query(query)
query_embeddings = np.array([e.embedding for e in query_embedding])

Step 6: Compute Similarity Scores

We can calculate the Late Interaction Score between the query and file embeddings using NumPy's Einstein summation function (np.einsum): for every query token we take the maximum similarity over all page tokens and then sum these maxima over the query tokens. The pages with the highest scores are the most relevant, and we extract the top 3 for further processing. We also take the image field from the metadata of each EmbedData object. This is a base64 string representation of the page image that we will send to GPT.

def score(query_embeddings, file_embed_data):
    file_embeddings = np.array([e.embedding for e in file_embed_data])
    scores = np.einsum("bnd,csd->bcns", query_embeddings, file_embeddings).max(axis=3).sum(axis=2).squeeze()

    # Get top pages
    top_pages = np.argsort(scores)[::-1][:3]

    # Extract file names and page numbers
    table = [
        [file_embed_data[page].metadata["file_path"].split("/")[-1], file_embed_data[page].metadata["page_number"]]
        for page in top_pages
    ]

    # Print the results in a table
    print(tabulate(table, headers=["File Name", "Page Number"], tablefmt="grid"))
    results_str = tabulate(table, headers=["File Name", "Page Number"], tablefmt="grid")

    # Base64 strings of the top pages (sent to the LLM later) and their decoded PIL images
    images_str = [file_embed_data[page].metadata["image"] for page in top_pages]
    images_pil = [Image.open(io.BytesIO(base64.b64decode(image))) for image in images_str]
    return images_pil, results_str, images_str

The result will look something like this:

+----------------------------------------+---------------+
| File Name                              |   Page Number |
+========================================+===============+
| EmbedAnything/test_files/attention.pdf |             6 |
+----------------------------------------+---------------+
| EmbedAnything/test_files/attention.pdf |             9 |
+----------------------------------------+---------------+
| EmbedAnything/test_files/linear.pdf    |            34 |
+----------------------------------------+---------------+
| EmbedAnything/test_files/attention.pdf |             3 |
+----------------------------------------+---------------+
| EmbedAnything/test_files/attention.pdf |            15 |
+----------------------------------------+---------------+
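The einsum in the score function can be hard to read at first. Below is a minimal pure-Python sketch of the same late interaction (MaxSim) idea for a single query. The function name late_interaction_score and the toy 2-dimensional vectors are made up for illustration and are not part of EmbedAnything.

```python
def late_interaction_score(query_tokens, page_tokens):
    """Sum, over query tokens, of the best dot product against any page token."""
    total = 0.0
    for q in query_tokens:
        best = max(sum(qi * pi for qi, pi in zip(q, p)) for p in page_tokens)
        total += best
    return total


# Toy data (made-up numbers): 2 query tokens, two "pages" with 3 patch vectors each.
query = [[1.0, 0.0], [0.0, 1.0]]
pages = [
    [[0.9, 0.1], [0.2, 0.8], [0.0, 0.0]],  # page 0 matches both query tokens well
    [[0.5, 0.5], [0.1, 0.1], [0.3, 0.0]],  # page 1 is a weaker match
]

scores = [late_interaction_score(query, page) for page in pages]
# Rank pages from highest to lowest score, like np.argsort(scores)[::-1]
ranking = sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)
print(scores, ranking)
```

Each query token only contributes its single best match on the page, which is why a page scores well if every part of the question is answered somewhere on it.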

We can visualize the top 3 pages using this command

Step 7: Send these images to OpenAI

Now we can send these top 3 retrieved images to OpenAI gpt-4o-mini model along with the original query. You can add further instructions for the model here as per your needs. Don’t forget to add your OpenAI key to the client.

from openai import OpenAI

client = OpenAI(api_key="<openai-key>")  # replace the placeholder with your OpenAI key

image_contents = [
    {
        "type": "image_url",
        "image_url": {"url": f"data:image/jpeg;base64,{image_str}"}
    }
    for image_str in images_str
]

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": query},
            ] + image_contents,
        }
    ],

)

The output looks like this


Positional encoding is a critical concept in transformer models, which addresses the inherent limitation of self-attention mechanisms: they do not consider the order of input tokens. Since transformers process all tokens simultaneously, they require a way to encode the order of tokens in a sequence to maintain their relative positions.

### Key Aspects of Positional Encoding:

1. **Purpose**: It helps the model understand the sequence of data since transformers lack recurrence or convolution that traditionally encode this information.

2. **Method**: 
   - Positional encodings are added to the input embeddings of tokens.
   - A common approach is to use sine and cosine functions of different frequencies, defined mathematically as:

     \[
     PE(pos, 2i) = \sin\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right)
     \]
     \[
     PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right)
     \]

   - Here, \( pos \) is the position of the token, \( i \) is the dimension, and \( d_{model} \) is the dimensionality of the embedding.

3. **Frequency**: The functions allow for various wavelengths, making it possible to learn relationships at different scales, which enables the model to understand both short-range and long-range dependencies in the sequence.

4. **Alternatives**: While sinusoidal encodings are widely used, learned positional embeddings can also be employed, which allows the model to learn the optimal way to encode positions during training.

In summary, positional encoding is vital for allowing transformer models to grasp the order of tokens in sequences, facilitating effective learning from sequential data.

This response used a total of about 2,500 tokens, which translates to $0.006. If we had sent the entire 15-page PDF to the model without retrieval, it would have used about 12,500 tokens, five times as many. And this assumes we know which PDF to send. The response may also be less accurate because the model has to filter out a lot of unnecessary information.

I hope that this blog was useful. Vision RAG is going to be a staple building block in future retrieval systems, so it's good to start using it to make your LLM pipelines more efficient.

Check out the demo notebook at


Build ReAct Agents using SLMs from Scratch

Akshay Ballal — Sun, 29 Sep 2024 10:00:52 +0000

In this post, I will demonstrate how to create a function-calling agent using Small Language Models (SLMs). Leveraging SLMs offers a range of benefits, especially when paired with tools like LoRA adapters for efficient fine-tuning and execution. While Large Language Models (LLMs) are powerful, they can be resource-intensive and slow. On the other hand, SLMs are more lightweight, making them ideal for environments with limited hardware resources or specific use cases where lower latency is critical.

By using SLMs with LoRA adapters, we can separate reasoning and function execution tasks to optimize performance. For instance, the model can execute complex function calls using the adapter and handle reasoning or thinking tasks without it, thus conserving memory and improving speed. This flexibility is perfect for building applications like function-calling agents without needing the infrastructure required for larger models.

Moreover, SLMs can be easily scaled to run on devices with limited computational power, making them ideal for production environments where cost and efficiency are prioritized. In this example, we'll use a custom model trained on the Salesforce/xlam-function-calling-60k dataset via Unsloth, demonstrating how you can utilize SLMs to create high-performance, low-resource AI applications.

Additionally, the approach discussed here can be scaled to more powerful models, such as LLaMA 3.1-8B, which have in-built function-calling capabilities, offering a smooth transition when larger models are necessary.

1. Initiate the Model and Tokenizer with Unsloth

We’ll first set up the model and tokenizer using Unsloth. Here, we define a max sequence length of 2048, though this can be adjusted. We also enable 4-bit quantization to reduce memory usage, ideal for running models on lower-memory hardware.

from unsloth import FastLanguageModel
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "akshayballal/phi-3.5-mini-xlam-function-calling",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

FastLanguageModel.for_inference(model);

2. Implement Stopping Criteria for Controlled Generation

To ensure that the agent pauses execution after function calls, we define a stopping criterion. It halts generation when the model outputs the keyword "PAUSE", allowing the agent to fetch the result of the function call.

from transformers import StoppingCriteria, StoppingCriteriaList
import torch

class KeywordsStoppingCriteria(StoppingCriteria):
    def __init__(self, keywords_ids:list):
        self.keywords = keywords_ids

    def __call__(self, input_ids: torch.LongTensor, _: torch.FloatTensor, **kwargs) -> bool:
        if input_ids[0][-1] in self.keywords:
            return True
        return False

stop_ids = [17171]
stop_criteria = KeywordsStoppingCriteria(stop_ids)
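To see what this check does without loading a model, here is a small pure-Python sketch of the same logic. The helper names should_stop and fake_generate and the token IDs in the stream are made up for illustration; only the stop ID 17171 comes from the code above.

```python
def should_stop(generated_ids, stop_ids):
    """Same test as the criterion above: stop when the newest token is a stop token."""
    return len(generated_ids) > 0 and generated_ids[-1] in stop_ids


def fake_generate(token_stream, stop_ids, max_new_tokens=128):
    """Consume tokens one by one, the way a decoding loop would."""
    generated = []
    for token_id in token_stream:
        generated.append(token_id)
        if should_stop(generated, stop_ids) or len(generated) >= max_new_tokens:
            break
    return generated


stop_ids = [17171]
stream = [101, 2054, 17171, 999, 1000]  # made-up token IDs
out = fake_generate(stream, stop_ids)
print(out)  # [101, 2054, 17171]
```

If you swap in a different model, the stop ID will most likely change. One way to look it up is tokenizer.encode("PAUSE", add_special_tokens=False); keep in mind that a keyword may be split into several tokens, in which case checking only the last token is not enough.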

3. Define the Tools for Function Calling

Next, we define the functions the agent will use during execution. These Python functions will act as "tools" that the agent can call. The return type must be clear, and the function should include a descriptive docstring, as the agent will rely on this to choose the correct tool.

def add_numbers(a: int, b: int) -> int:
    """
    This function takes two integers and returns their sum.

    Parameters:
    a (int): The first integer to add.
    b (int): The second integer to add.
    """
    return a + b 

def square_number(a: int) -> int:
    """
    This function takes an integer and returns its square.

    Parameters:
    a (int): The integer to be squared.
    """
    return a * a

def square_root_number(a: int) -> float:
    """
    This function takes an integer and returns its square root.

    Parameters:
    a (int): The integer to calculate the square root of.
    """
    return a ** 0.5

4. Generate Tool Descriptions for the Agent

These function descriptions will be structured into a list of dictionaries. The agent will use these to understand the available tools and their parameters.

tools = [add_numbers, square_number, square_root_number]  # the functions defined above

tool_descriptions = []
for tool in tools:
    spec = {
        "name": tool.__name__,
        "description": tool.__doc__.strip(),
        "parameters": [
            {
                "name": param,
                "type": arg.__name__ if hasattr(arg, '__name__') else str(arg),
            } for param, arg in tool.__annotations__.items() if param != 'return'
        ]
    }
    tool_descriptions.append(spec)
tool_descriptions

This is what the output looks like:

[{'name': 'add_numbers',
  'description': 'This function takes two integers and returns their sum.\n\n    Parameters:\n    a (int): The first integer to add.\n    b (int): The second integer to add.',
  'parameters': [{'name': 'a', 'type': 'int'}, {'name': 'b', 'type': 'int'}]},
 {'name': 'square_number',
  'description': 'This function takes an integer and returns its square.\n\n    Parameters:\n    a (int): The integer to be squared.',
  'parameters': [{'name': 'a', 'type': 'int'}]},
 {'name': 'square_root_number',
  'description': 'This function takes an integer and returns its square root.\n\n    Parameters:\n    a (int): The integer to calculate the square root of.',
  'parameters': [{'name': 'a', 'type': 'int'}]}]
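The loop above reads __annotations__ directly and drops the return type. As one possible alternative, the sketch below uses the standard library's inspect module to build the same kind of spec and also record what the tool returns. The helper describe_tool and the "returns" key are my own additions for illustration, not something the fine-tuned model was trained to expect.

```python
import inspect


def describe_tool(tool):
    """Build a tool spec that also records the return type (hypothetical helper)."""
    signature = inspect.signature(tool)

    def type_name(annotation):
        # Unannotated parameters or return values are reported as "any"
        if annotation is inspect.Signature.empty:
            return "any"
        return getattr(annotation, "__name__", str(annotation))

    return {
        "name": tool.__name__,
        "description": inspect.getdoc(tool) or "",
        "parameters": [
            {"name": name, "type": type_name(param.annotation)}
            for name, param in signature.parameters.items()
        ],
        "returns": type_name(signature.return_annotation),
    }


def add_numbers(a: int, b: int) -> int:
    """Return the sum of two integers."""
    return a + b


spec = describe_tool(add_numbers)
print(spec)
```

inspect.getdoc also strips the docstring indentation, so the description no longer carries the leading spaces visible in the output above.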

5. Create the Agent Class

We then create the agent class, which takes the system prompt, the function-calling prompt, and the tools as input and keeps the running list of messages. It has four methods:

__call__ is invoked when the agent is called with a message. It appends the message to the message list, generates a reply with execute, stores that reply, and returns it.
execute generates the agent's reasoning step. It runs the model with the LoRA adapter disabled and stops at the PAUSE token.
function_call turns an action into a tool call. It prompts the model with the adapter enabled, parses the generated call into a dictionary, and runs the matching tool.
run_tool looks up a tool by name and calls it with the parsed arguments.

import ast

class Agent:
    def __init__(
        self, system: str = "", function_calling_prompt: str = "", tools=[]
    ) -> None:
        self.system = system
        self.tools = tools
        self.function_calling_prompt = function_calling_prompt
        self.messages: list = []
        if self.system:
            self.messages.append({"role": "system", "content": system})

    def __call__(self, message=""):
        if message:
            self.messages.append({"role": "user", "content": message})
        result = self.execute()
        self.messages.append({"role": "assistant", "content": result})
        return result

    def execute(self):
        with model.disable_adapter():  # disable the adapter for thinking and reasoning
            inputs = tokenizer.apply_chat_template(
                self.messages,
                tokenize=True,
                add_generation_prompt=True,
                return_tensors="pt",
            )
            output = model.generate(
                input_ids=inputs,
                max_new_tokens=128,
                stopping_criteria=StoppingCriteriaList([stop_criteria]),
            )
            return tokenizer.decode(
                output[0][inputs.shape[-1] :], skip_special_tokens=True
            )

    def function_call(self, message):
        inputs = tokenizer.apply_chat_template(
            [
                {
                    "role": "user",
                    "content": self.function_calling_prompt.format(
                        tool_descriptions=tool_descriptions, query=message
                    ),
                }
            ],
            tokenize=True,
            add_generation_prompt=True,
            return_tensors="pt",
        )
        output = model.generate(input_ids=inputs, max_new_tokens=128, do_sample=False)  # greedy decoding keeps the function call deterministic
        prompt_length = inputs.shape[-1]

        answer = ast.literal_eval(
            tokenizer.decode(output[0][prompt_length:], skip_special_tokens=True)
        )[
            0
        ]  # get the output of the function call model as a dictionary
        print(answer)
        tool_output = self.run_tool(answer["name"], **answer["arguments"])
        return tool_output

    def run_tool(self, name, *args, **kwargs):
        for tool in self.tools:
            if tool.__name__ == name:
                return tool(*args, **kwargs)
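The parsing step in function_call is the most fragile part of the agent, so here it is in isolation. This is a minimal sketch that assumes the function-calling model emits a Python-literal list of dictionaries with "name" and "arguments" keys, which is what the indexing in the class above implies. The multiply tool, the TOOLS registry, and the error message are hypothetical.

```python
import ast


def multiply(a: int, b: int) -> int:
    """Hypothetical tool used only for this illustration."""
    return a * b


# Registry mapping tool names to callables
TOOLS = {tool.__name__: tool for tool in [multiply]}


def dispatch(raw_output: str):
    """Parse the model's text output and run the named tool with its arguments."""
    try:
        call = ast.literal_eval(raw_output.strip())[0]
        tool = TOOLS[call["name"]]
        arguments = call["arguments"]
    except (ValueError, SyntaxError, KeyError, IndexError, TypeError) as err:
        # Malformed output or unknown tool: report it instead of crashing the loop
        return f"tool call failed: {err!r}"
    return tool(**arguments)


print(dispatch("[{'name': 'multiply', 'arguments': {'a': 6, 'b': 7}}]"))  # 42
print(dispatch("not a valid call"))
```

Returning an error string instead of raising lets the ReAct loop feed the failure back to the model as an observation, which small models need more often than large ones.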

6. Define System and Function-Calling Prompts

We now define two key prompts:

System Prompt: The core logic for the agent's reasoning and tool use, following the ReAct pattern.
Function-Calling Prompt: This enables function calling by passing the relevant tool descriptions and queries.

system_prompt = f"""
You run in a loop of Thought, Action, PAUSE, Observation.
At the end of the loop you output an Answer
Use Thought to describe your thoughts about the question you have been asked.
Use Action to run one of the actions available to you - then return PAUSE.
Observation will be the result of running those actions. Stop when you have the Answer. 
Your available actions are:

{tools}

Example session:

Question: What is the mass of Earth times 2?
Thought: I need to find the mass of Earth
Action: get_planet_mass: Earth
PAUSE 

Observation: 5.972e24

Thought: I need to multiply this by 2
Action: calculate: 5.972e24 * 2
PAUSE

Observation: 1.1944e25

If you have the answer, output it as the Answer.

Answer: \\{{1.1944e25\\}}.
PAUSE
Now it's your turn:
""".strip()

function_calling_prompt = """
You are a helpful assistant. Below are the tools that you have access to.  \n\n### Tools: \n{tool_descriptions} \n\n### Query: \n{query} \n
"""

7. Implement the ReAct Loop

Finally, we define the loop that enables the agent to interact with the user, execute the necessary function calls, and return the correct answers.

import re

def loop_agent(agent: Agent, question, max_iterations=5):

    next_prompt = question
    i = 0
    while i < max_iterations:
        result = agent(next_prompt)
        print(result)
        if "Answer:" in result:
            return result

        action = re.findall(r"Action: (.*)", result)
        if action:
            tool_output = agent.function_call(action[0])  # re.findall returns a list; pass the first match
            next_prompt = f"Observation: {tool_output}"
            print(next_prompt)
        else:
            next_prompt = "Observation: tool not found"
        i += 1
    return result

agent = Agent( system=system_prompt, function_calling_prompt=function_calling_prompt, tools=tools)

loop_agent(agent, "what is the square root of the difference between 32^2 and 54");
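If you want to check the control flow of the loop without a GPU, you can replay canned model outputs. The sketch below does that with a stand-in class; ScriptedAgent, run_loop, and the scripted replies are made up for illustration, and the loop body follows the same Thought, Action, PAUSE, Observation cycle as above.

```python
import re


class ScriptedAgent:
    """Stand-in for the Agent class that replays canned replies (illustration only)."""

    def __init__(self, replies, tool_results):
        self.replies = iter(replies)
        self.tool_results = tool_results
        self.seen_prompts = []

    def __call__(self, message=""):
        self.seen_prompts.append(message)
        return next(self.replies)

    def function_call(self, action):
        return self.tool_results[action]


def run_loop(agent, question, max_iterations=5):
    next_prompt = question
    result = ""
    for _ in range(max_iterations):
        result = agent(next_prompt)
        if "Answer:" in result:  # the agent is done
            return result
        actions = re.findall(r"Action: (.*)", result)
        if actions:
            next_prompt = f"Observation: {agent.function_call(actions[0])}"
        else:
            next_prompt = "Observation: tool not found"
    return result


agent = ScriptedAgent(
    replies=[
        "Thought: I need to square 4\nAction: square_number: 4\nPAUSE",
        "Thought: I have the result\nAnswer: 16",
    ],
    tool_results={"square_number: 4": 16},
)
final = run_loop(agent, "What is 4 squared?")
print(final)
print(agent.seen_prompts)
```

The second prompt the agent sees is the observation, not the original question, which is how the tool result gets back into the conversation.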

Check out the complete notebook on Colab here.

Conclusion

By following this step-by-step guide, you can create a function-calling agent using a custom model trained with Unsloth and LoRA adapters. This approach ensures efficient memory use while maintaining robust reasoning and function execution capabilities.

Explore further by extending this method to larger models or customizing the functions available to the agent.

Vector Streaming with EmbedAnything

Akshay Ballal — Thu, 12 Sep 2024 19:05:46 +0000

In my previous article Introducing EmbedAnything, I discussed the idea behind EmbedAnything and how it makes creating embeddings from multiple modalities easy. In this article, I want to introduce a new feature of EmbedAnything called vector streaming and see how it works with Weaviate Vector Database.

What is the problem?

First, let's examine the current problem with creating embeddings, especially for large-scale documents. Current embedding frameworks operate in two steps: chunking and embedding. First, the text is extracted from all the files and chunks/nodes are created. Then, these chunks are fed to an embedding model in batches of a specific size. While this happens, all the chunks and embeddings stay in system memory. This is not a problem when the files and the embedding dimensions are small. It becomes one when there are many files and you are working with large models or, even worse, multi-vector embeddings, because a large amount of RAM is then required to hold everything. Also, if this is done synchronously, a lot of time is wasted while the chunks are being created, even though chunking is not a compute-heavy operation. It would be more efficient to pass the chunks to the embedding model as they are being made.

Our Solution

The solution is to run chunking and embedding as asynchronous tasks. Using Rust's concurrency patterns and thread safety, we can spawn threads to handle them. Messages are passed between the threads with Rust's MPSC (multi-producer, single-consumer) channels. This creates a stream of chunks that is passed into the embedding thread through a buffer. Once the buffer is full, the thread embeds the chunks and sends the embeddings back to the main thread, which sends them to the vector database. This ensures that no time is wasted waiting on a single operation and that there are no bottlenecks. Moreover, only the chunks and embeddings in the buffer are held in system memory, and they are erased from memory once they have been moved to the vector database.
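To make the pattern concrete, here is a rough Python analogue using threading and queue.Queue in place of Rust's MPSC channel. It only mirrors the idea; EmbedAnything's actual implementation is in Rust, and fake_embed, the chunk size, and the list standing in for the vector database are all placeholders.

```python
import queue
import threading


def chunk_producer(documents, chunk_size, out_queue):
    """Split each document into fixed-size chunks and stream them out."""
    for doc in documents:
        for start in range(0, len(doc), chunk_size):
            out_queue.put(doc[start:start + chunk_size])
    out_queue.put(None)  # sentinel: no more chunks


def fake_embed(chunk):
    """Placeholder embedding: NOT a real model, just a deterministic 2-d vector."""
    return [float(len(chunk)), float(sum(map(ord, chunk)) % 97)]


def embed_consumer(in_queue, buffer_size, sink):
    """Embed chunks one buffer at a time and flush each buffer to the sink."""
    buffer = []
    while True:
        chunk = in_queue.get()
        if chunk is not None:
            buffer.append(chunk)
        # Flush when the buffer is full, or when the stream ends with leftovers
        if buffer and (chunk is None or len(buffer) == buffer_size):
            sink.append([(c, fake_embed(c)) for c in buffer])
            buffer = []  # release the flushed chunks
        if chunk is None:
            break


documents = ["a" * 25, "b" * 12]
chunks = queue.Queue(maxsize=8)  # bounded, so the producer cannot run far ahead
vector_db = []                   # stands in for the vector database
producer = threading.Thread(target=chunk_producer, args=(documents, 10, chunks))
consumer = threading.Thread(target=embed_consumer, args=(chunks, 2, vector_db))
producer.start()
consumer.start()
producer.join()
consumer.join()
print([len(batch) for batch in vector_db])  # [2, 2, 1]
```

The bounded queue and the small buffer are what keep memory flat: at no point does the program hold more than a handful of chunks, no matter how many documents are streamed.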

Example Use Case

Now, let's see this feature in action.

With EmbedAnything, streaming the vectors from a directory of files to the vector database is a simple three-step process.

Create an adapter for your vector database: This is a wrapper around the database's functions that allows you to create an index, convert metadata from EmbedAnything's format to the format required by the database, and the function to insert the embeddings in the index. Adapters for the prominent databases are already created and present here:
Initiate an embedding model of your choice: You can choose from different local models or even cloud models. The configuration can also be determined to set the chunk size and buffer size for how many embeddings need to be streamed at once. Ideally, this should be as high as possible, but the system RAM limits this.
Call the embedding function from EmbedAnything: Just pass the directory path to be embedded, the embedding model, the adapter, and the configuration.

In this example, we will embed a directory of images and send it to the vector databases.

Step 1: Create the Adapter

In EmbedAnything, the adapters are created outside so as to not make the library heavy and you get to choose which database you want to work with. Here is a simple adapter for Weaviate.



from typing import List

import weaviate
import weaviate.classes as wvc
from embed_anything import EmbedData
from embed_anything.vectordb import Adapter

class WeaviateAdapter(Adapter):
    def __init__(self, api_key, url):
        super().__init__(api_key)
        self.client = weaviate.connect_to_weaviate_cloud(
            cluster_url=url, auth_credentials=wvc.init.Auth.api_key(api_key)
        )
        if self.client.is_ready():
            print("Weaviate is ready")

    def create_index(self, index_name: str):
        self.index_name = index_name
        self.collection = self.client.collections.create(
            index_name, vectorizer_config=wvc.config.Configure.Vectorizer.none()
        )
        return self.collection

    def convert(self, embeddings: List[EmbedData]):
        data = []
        for embedding in embeddings:
            property = embedding.metadata
            property["text"] = embedding.text
            data.append(
                wvc.data.DataObject(properties=property, vector=embedding.embedding)
            )
        return data

    def upsert(self, embeddings):
        data = self.convert(embeddings)
        self.client.collections.get(self.index_name).data.insert_many(data)

    def delete_index(self, index_name: str):
        self.client.collections.delete(index_name)

### Start the client and index

URL = "your-weaviate-url"
API_KEY = "your-weaviate-api-key"
weaviate_adapter = WeaviateAdapter(API_KEY, URL)

index_name = "Test_index"
if index_name in weaviate_adapter.client.collections.list_all():
    weaviate_adapter.delete_index(index_name)
weaviate_adapter.create_index("Test_index")
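If you want to write an adapter for another database, it helps to see the contract stripped of any client library. The sketch below keeps everything in a dictionary. FakeEmbedData and InMemoryAdapter are hypothetical names; the class does not subclass EmbedAnything's Adapter and is meant only to show which four methods carry the logic.

```python
from dataclasses import dataclass, field


@dataclass
class FakeEmbedData:
    """Stand-in for EmbedData with only the fields the adapter above reads."""
    embedding: list
    text: str
    metadata: dict = field(default_factory=dict)


class InMemoryAdapter:
    """Hypothetical adapter that keeps records in a dict instead of a database."""

    def __init__(self):
        self.indexes = {}
        self.index_name = None

    def create_index(self, index_name):
        self.index_name = index_name
        self.indexes[index_name] = []

    def convert(self, embeddings):
        # Merge the text into the metadata, as the Weaviate adapter does
        return [
            {"properties": {**e.metadata, "text": e.text}, "vector": e.embedding}
            for e in embeddings
        ]

    def upsert(self, embeddings):
        self.indexes[self.index_name].extend(self.convert(embeddings))

    def delete_index(self, index_name):
        self.indexes.pop(index_name, None)


adapter = InMemoryAdapter()
adapter.create_index("Test_index")
adapter.upsert([FakeEmbedData([0.1, 0.2], "a cat", {"file_name": "cat.png"})])
print(adapter.indexes["Test_index"])
```

During streaming, upsert is the method that gets called once per flushed buffer, so it is the one worth optimizing for your database's bulk-insert API.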

Step 2: Create the Embedding Model

Here, since we are embedding images, we can use the CLIP model.



import embed_anything
from embed_anything import WhichModel

model = embed_anything.EmbeddingModel.from_pretrained_cloud(
    embed_anything.WhichModel.Clip, model_id=""
)

Step 3: Embed the Image Directory




data = embed_anything.embed_image_directory(
    "\image_directory",
    embeder=model,
    adapter=weaviate_adapter,
    config=embed_anything.ImageEmbedConfig(buffer_size=100),
)

Step 4: Embed the Query



query_vector = embed_anything.embed_query(["image of a cat"], embeder=model)[
    0
].embedding

Step 5: Query the Vector Database



response = weaviate_adapter.collection.query.near_vector(
    near_vector=query_vector,
    limit=2,
    return_metadata=wvc.query.MetadataQuery(certainty=True),
)

Check the response;

You can check out the notebook here in Colab

We think it’s one of the features that will empower many engineers to opt for a more optimized and no-tech debt solution. Instead of using bulky frameworks on the cloud, you can use a lightweight streaming option. Please don't forget to give us a ⭐ on our GitHub repo over here: https://github.com/StarlightSearch/EmbedAnything

Reinforcement Learning from Scratch - Part 3 - REINFORCE Algorithm

Akshay Ballal — Tue, 06 Aug 2024 14:01:08 +0000

Hey all! Welcome to the third part of the Reinforcement Learning Series, where I explain the different RL algorithms and introduce you to some tips and tricks that RL practitioners use to make them work. So far, we have looked at the Tabular Q Learning Method and the Deep Q Learning Method. If you haven’t checked these out yet, look at the links below for motivation as to why we are still investigating more algorithms.

The main problem with these algorithms was that they did not allow us to use continuous action space environments, and we always had to discretize the action space, which may not always be desirable. So, in this part, we will explore a new RL variant that is different from the Q-learning variant we have looked at until now. These methods are called Policy Gradient methods. Let’s look at some details about Policy Learning or Policy Gradient Methods.

With Q-learning, our neural network model learns the Q function at each state, from which we derive the policy. The policy was simple: take the epsilon-greedy action over the Q values or select the action with the largest Q value. But do we actually need to learn the Q-values when all we want is the policy? The answer is a resounding no. We can learn the policy directly, and the REINFORCE algorithm we are looking at in this part is a straightforward algorithm that does just that. While it may not be the most suitable algorithm for all environments, it serves as an excellent introduction to policy learning or policy gradient algorithms.

Benefits of Policy Gradient Methods

Stochastic Policies: Policy gradient methods enable learning state-dependent stochastic policies. Unlike Q-learning, where exploration is introduced through the epsilon-greedy method, policy gradient methods allow for state-specific stochastic behavior. This approach is particularly valuable for partially observable states, where actions need to be sampled from a probability distribution rather than determined deterministically. Importantly, this method doesn't preclude deterministic policies for certain states. The result is a flexible mix of stochastic and deterministic actions, tailored to the confidence level in each state.
Simpler objective: Learning the policy directly is a more straightforward objective for many environments. Learning the value function can be more challenging and may not even be required. The quote below sums it up.

When solving a problem of interest, do not solve a more general problem as an intermediate step

Vladimir Vapnik

This means that if our objective is to get the policy and we are not interested in the values, we don’t need to solve for the values to get to the policy.

Policy Gradient methods also show better convergence properties as the action probabilities change smoothly. Value-based techniques are prone to oscillations because the policy can change drastically throughout training.

Deriving Policy Gradient

Let's explore how we can derive the policy. This theory could be a little dense if you are new to the concept. But trust me, if you can understand the idea of policy gradient, many algorithms like A2C and PPO become pretty easy to understand. So it’s OK to spend some time here. You can check out the references at the end of this article to some sources that elaborate on this topic. First, we need an objective to optimize. In Q-learning, we aimed to minimize the loss between predicted and target values. Specifically, our goal was to match the actual action-value function of a given policy. We parameterized a value function and minimized the mean squared error between predicted and target values. Note that we didn't have true target values; instead, we used actual returns in Monte Carlo methods or predicted returns in bootstrapping methods.

In policy-based methods, however, the objective is to maximize the performance of a parameterized policy. We're running gradient ascent (or regular gradient descent on the negative performance). An agent’s performance is the expected total discounted reward from the initial state—equivalent to the expected state-value function from all initial states of a given policy.

Let’s understand these equations step by step.

1. The objective function L(θ) is defined as the expected value of the initial state s_0
2. The expected value of the initial state is nothing but the expected total return from the trajectory.
3. The expectation can be shown as the weighted sum of the expected total return across all the different possible trajectories weighted by the probability of a trajectory given the weights θ of the policy network.
4. We take the gradient of the objective as it is required for the gradient ascent. 
5. We then use the gradient of a log trick to refactor the gradient of the probability of the trajectory. 
6. Finally, we can pack the whole thing back into an expectation.

7. We can see that the probability of a trajectory is the product of the likelihood of the state transition and the probability of taking the action (π).
8. We can compactly write the above part.
9. We take the log and the gradient. The log operations make the product a sum. 
10. Only the last term depends on theta as π depends on theta.

11. Plugging 10 into 6, we get the final (almost) equation of the policy loss / objective gradient. We usually calculate the value without the gradient and then use the PyTorch backward function to get the gradient.
12. One final modification we make is that instead of using the return of the complete trajectory we calculate the return from the time step t. This is because the action at time step t cannot influence the previous rewards and can only influence the future rewards.
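For reference, the twelve steps above can be written out as follows. This is a reconstruction from the step descriptions in standard notation. Symbols such as p(s_0) for the initial state distribution, the horizon T, and the exact split between steps 7 and 8 are assumptions.

```latex
\begin{align}
L(\theta) &= \mathbb{E}_{s_0}\left[ v_{\pi_\theta}(s_0) \right] \tag{1} \\
&= \mathbb{E}_{\tau \sim \pi_\theta}\left[ G(\tau) \right] \tag{2} \\
&= \sum_{\tau} P(\tau \mid \theta)\, G(\tau) \tag{3} \\
\nabla_\theta L(\theta) &= \sum_{\tau} \nabla_\theta P(\tau \mid \theta)\, G(\tau) \tag{4} \\
&= \sum_{\tau} P(\tau \mid \theta)\, \nabla_\theta \log P(\tau \mid \theta)\, G(\tau) \tag{5} \\
&= \mathbb{E}_{\tau \sim \pi_\theta}\left[ \nabla_\theta \log P(\tau \mid \theta)\, G(\tau) \right] \tag{6} \\
P(\tau \mid \theta) &= p(s_0)\, \pi_\theta(a_0 \mid s_0)\, P(s_1 \mid s_0, a_0)\, \pi_\theta(a_1 \mid s_1)\, P(s_2 \mid s_1, a_1) \cdots \tag{7} \\
&= p(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t) \tag{8} \\
\nabla_\theta \log P(\tau \mid \theta) &= \nabla_\theta \left[ \log p(s_0) + \sum_{t=0}^{T-1} \log P(s_{t+1} \mid s_t, a_t) + \sum_{t=0}^{T-1} \log \pi_\theta(a_t \mid s_t) \right] \tag{9} \\
&= \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \tag{10} \\
\nabla_\theta L(\theta) &= \mathbb{E}_{\tau \sim \pi_\theta}\left[ G(\tau) \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right] \tag{11} \\
\nabla_\theta L(\theta) &= \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T-1} G_t(\tau)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right] \tag{12}
\end{align}
```

Step 5 uses the identity \nabla_\theta P = P \nabla_\theta \log P, and step 10 follows because neither the initial state distribution nor the transition probabilities depend on \theta.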

This is the so-called vanilla REINFORCE equation. But as you will see later in the results of this algorithm, there are some issues with this vanilla approach. Since the return is computed from a sampled trajectory, its value can vary a lot from one trajectory to the next. Thus, to get a reasonable estimate of the gradient, we need to sample a lot of trajectories. We can reduce this variance at the cost of introducing some bias. This is the typical bias-variance trade-off we see everywhere in machine learning. One of the simplest ways to do this is by introducing the value function estimate and a new term called the advantage. We define the advantage as follows.

A_{t} (s) = G_{t} (τ) - V_{t} (s)

This says how good the actual return is compared to the estimated return for that state. If the advantage is positive, that means the action that the agent took at time step t was good compared to the other actions, and if the advantage is negative, that means the action was bad and made the agent perform worse subsequently.

We then modify equation 12 as follows:

This is called REINFORCE with Baseline, or the Vanilla Policy Gradient (VPG) algorithm. What this equation does is the following: if the advantage is positive for a given action, it increases that action's log probability, log π_θ(a_t ∣ s_t), and with it the likelihood of taking this action. If the advantage is negative, it decreases the probability of taking that action.
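In symbols, replacing the return in step 12 with the advantage gives the following gradient. This is a reconstruction in standard notation, with the advantage as defined above evaluated at the state s_t visited at time step t.

```latex
\nabla_\theta L(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T-1} A_t(s_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]
```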

In addition to the policy loss, in the policy gradient methods, we also have the entropy loss. Entropy defines how much randomness is there in the system. More entropy corresponds to more randomness. It is good to maintain some entropy in the agent so that the agent does not become completely deterministic.

L_{ent} (θ) = \sum_{a} π_{θ} (a ∣ s) \log π_{θ} (a ∣ s)

Finally the total loss is given as:

The value loss is given by the mean square error between the actual returns and the estimated value from the value network. The parameters of the value network are given by η.

L_{value} (η) = \frac{1}{T} \sum_{t=0}^{T-1} (G_{t} (τ) - V (s_{t}))^{2}

That’s it. To use this equation, all we need to do is the following:

Create two neural networks: one for the policy and another for the value function.
Use the network to get a trajectory by sampling and performing the actions until the termination state.
We also store the actions' log probability and each state's value estimate.
Calculate the return $G_t$ for the trajectory at each time step.
Using the log probability, $G_t$ , and Values, we use equation 13 by calculating the advantage to get the policy objective.
We can take the negative of this to get the loss that can be minimized because PyTorch optimizers support minimization by default.
The value loss is calculated.
We can also calculate the entropy of the policy using 13.
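Before moving to PyTorch, it can help to run the numbers for one short episode by hand. The sketch below computes the returns, the advantages, the policy loss, and the value loss in plain Python. The rewards, value estimates, and log probabilities are made-up numbers, and the policy loss is summed rather than averaged over time steps, which is a choice on my part.

```python
import math


def discounted_returns(rewards, gamma):
    """G_t = r_t + gamma * G_{t+1}, computed backwards over one episode."""
    returns, running = [], 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    return returns[::-1]


def reinforce_losses(log_probs, values, rewards, gamma):
    """Policy loss (negated objective) and value loss for one trajectory."""
    returns = discounted_returns(rewards, gamma)
    advantages = [g - v for g, v in zip(returns, values)]
    # The advantage is treated as a constant here; in PyTorch one would detach it.
    policy_loss = -sum(lp * adv for lp, adv in zip(log_probs, advantages))
    value_loss = sum((g - v) ** 2 for g, v in zip(returns, values)) / len(returns)
    return returns, advantages, policy_loss, value_loss


rewards = [1.0, 1.0, 1.0]
values = [2.0, 2.0, 0.5]         # made-up value estimates V(s_t)
log_probs = [math.log(0.5)] * 3  # made-up log pi(a_t | s_t)
returns, advantages, policy_loss, value_loss = reinforce_losses(
    log_probs, values, rewards, gamma=0.5
)
print(returns)     # [1.75, 1.5, 1.0]
print(advantages)  # [-0.25, -0.5, 0.5]
print(value_loss)  # 0.1875
```

The first two actions did worse than the value network expected (negative advantage), so minimizing the policy loss pushes their probabilities down, while the last action did better and is pushed up.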

Disadvantages of REINFORCE Algorithm

Even though REINFORCE algorithm is a simple algorithm to get started with Policy Gradient methods there is a disadvantage. As this algorithm relies on complete episodes for the learning we need to have a terminal state so that an episode can be terminated. That’s why we cannot use our Pendulum environment here because it does not have a terminal state. We are going to use the Cartpole environment as that gives us a termination condition. But don’t worry, in the next part we will look into how to solve this problem. But if you understand this algorithm rest of the algorithms will only be minor adjustments.

Also, learning is slow because the variance is high, even with a baseline.

Enough with the theory. Now let’s look at the code implementation first without the Value Estimate or Baseline.

Code Implementation: REINFORCE w/o Baseline

You can check out the GitHub Repo here:
https://github.com/akshayballal95/rl-blog-series/tree/reinforce

Import the dependencies:

import gymnasium as gym
import numpy as np
from tqdm import tqdm
import torch
from torch import nn
from torch.optim import AdamW
from torch.utils.tensorboard import SummaryWriter
import torch.nn.functional as F

Create the Policy Network

We create a simple network with the number of features as the input dimension and the number of available actions as the output dimension. The torch.distributions.Categorical class helps us create a distribution from the output logits. It’s very similar to using SoftMax, but we get some handy methods like sample(), entropy() and log_prob().

class PolicyNet(nn.Module):
    def __init__(self, nvec_s: int, nvec_u: int):
        super(PolicyNet, self).__init__()
        self.fc1 = nn.Linear(nvec_s, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, nvec_u)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        dist = torch.distributions.Categorical(logits=x)
        action = dist.sample()
        entropy = dist.entropy()
        log_prob = dist.log_prob(action)
        return action, log_prob, entropy

The REINFORCE Agent Class

In our REINFORCE class we have the __init__ function that initializes the attributes that we need. The main hyperparameters here are gamma, the learning rate, and the number of steps. You can see that we no longer have epsilon because exploration is intrinsically baked into the algorithm: actions are sampled from the policy distribution. We are using the AdamW optimizer as usual.

class Reinforce:
    def __init__(self, env:gym.Env, lr, gamma, n_steps):

        self.env = env
        self.lr = lr
        self.gamma = gamma
        self.n_steps = n_steps

        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.policy_net = PolicyNet(env.observation_space.shape[0], env.action_space.n).to(self.device)
        self.optimizer_policy = AdamW(self.policy_net.parameters(), lr=lr)

        self.total_steps = 0

        # stats
        self.episodes = 0
        self.total_rewards = 0
        self.mean_episode_reward = 0

Next is the rollout function. Here a rollout is nothing but one episode with a fixed policy. We get the action, log_prob and entropy using the policy network. After taking a step in the environment using the action, we append the reward, log_prob and entropy to their lists. We also write the mean episodic reward to TensorBoard every 100 episodes.

    def rollout(self):

        state, info = self.env.reset()
        terminated = False
        truncated = False
        self.log_probs = []
        self.rewards = []
        self.entropies = []

        while True:

            action, log_prob, entropy = self.policy_net(torch.from_numpy(state).float().to(self.device))
            next_state, reward, terminated, truncated, _ = self.env.step(action.item())

            self.rewards.append(reward)
            self.log_probs.append(log_prob)
            self.entropies.append(entropy)

            state = next_state

            self.total_rewards += reward
            self.total_steps += 1
            self.pbar.update(1)
            if terminated or truncated:
                self.episodes += 1

                if self.episodes % 100 ==0:

                    self.mean_episode_reward = self.total_rewards / self.episodes
                    self.pbar.set_description(f"Reward: {self.mean_episode_reward :.3f}")
                    self.writer.add_scalar("Reward", self.mean_episode_reward, self.total_steps)
                    self.episodes =0
                    self.total_rewards = 0

                break

This handy method allows us to calculate the returns at every time step of the trajectory. Notice that we do it in reverse, starting from the last step, but we could just as easily implement it in the forward direction. I just find this implementation neater.

    def calculate_returns(self):

        next_returns = 0
        returns = np.zeros_like(self.rewards, dtype=np.float32)
        for i in reversed(range(len(self.rewards))):
            next_returns = self.rewards[i] + self.gamma * next_returns
            returns[i] = next_returns   

        return torch.tensor(returns, dtype = torch.float32).to(self.device)
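To make the reverse trick concrete, here is a small standalone sketch in plain Python. The function names are made up for illustration. It computes the discounted returns both ways on a three-step toy episode and shows that they agree.

```python
def returns_reverse(rewards, gamma):
    # Walk backwards: G_t = r_t + gamma * G_{t+1}, one pass over the episode
    returns = [0.0] * len(rewards)
    running = 0.0
    for i in reversed(range(len(rewards))):
        running = rewards[i] + gamma * running
        returns[i] = running
    return returns

def returns_forward(rewards, gamma):
    # Direct definition: G_t = sum_k gamma^(k - t) * r_k, quadratic in episode length
    return [
        sum(gamma ** (k - t) * rewards[k] for k in range(t, len(rewards)))
        for t in range(len(rewards))
    ]

rewards = [1.0, 1.0, 1.0]
backward = returns_reverse(rewards, gamma=0.5)
forward = returns_forward(rewards, gamma=0.5)
```

Both give [1.75, 1.5, 1.0] here. The reverse version reuses the return of the next step, so it needs only one pass over the episode.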

Now we can write our learn() function. Here we just retrieve everything that is required to be plugged in equation 15. We weight the entropy loss by 0.001 to keep some entropy for exploration.

    def learn(self):

        self.log_probs = torch.stack(self.log_probs)
        self.entropies = torch.stack(self.entropies) 

        returns = self.calculate_returns()

        advantages = returns.squeeze() 

        policy_loss = -torch.mean(advantages.detach() * self.log_probs)

        entropy_loss = -torch.mean(self.entropies)
        policy_loss = policy_loss + 0.001 * entropy_loss

        self.optimizer_policy.zero_grad()
        policy_loss.backward()
        torch.nn.utils.clip_grad_norm_(self.policy_net.parameters(), 1)
        self.optimizer_policy.step()

Finally we have the train function which just calls the rollout and the learn function until the number of steps is completed.

    def train(self):
        self.writer = SummaryWriter(log_dir="runs/reinforce_logs/REINFORCE_NO_BASELINE")

        self.pbar = tqdm(total=self.n_steps, position=0, leave=True)

        while self.total_steps < self.n_steps:

            self.rollout()
            self.learn()

Results

We can run this agent as we did in the earlier parts:

env = gym.make("CartPole-v1",
               render_mode='human'
               )

n_episodes = 100
for _ in range(n_episodes):
    obs, info = env.reset()
    terminated = False
    truncated = False
    while not terminated and not truncated:
        with torch.no_grad():
            action = agent.policy_net(torch.from_numpy(obs).float().to(agent.device))[0].item()
            obs, reward, terminated,  truncated, info = env.step(action)
            env.render()

We can see that the vanilla REINFORCE algorithm struggles to learn reliably. The learning is very unstable due to the high variance. The maximum reward in the CartPole environment is 500 and our agent is able to achieve about 300, so there is scope for improvement. Now let’s see how a baseline can help.

REINFORCE with Baseline (Vanilla Policy Gradient - VPG)

Code Implementation: REINFORCE with Baseline

This code is almost the same. The only difference is that we have a value network and store its value estimates during the rollout. We use these values in the learn function to compute the advantage.

Creating the Value Network

class ValueNet(nn.Module):
    def __init__(self, n_features, n_hidden):
        super(ValueNet, self).__init__()
        self.fc1 = nn.Linear(n_features, 256)
        self.fc2 = nn.Linear(256,128)
        self.fc3 = nn.Linear(128, 1)

    def forward(self, x) -> torch.Tensor:
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)

Then we just introduce the values in the agent class. I am not writing the whole code here as most of the other things are same.

    def __init__(self, env:gym.Env, lr, gamma, n_steps):

        #<SAME AS BEFORE>

        # Value network and its optimizer (created after self.device is set)
        self.value_net = ValueNet(env.observation_space.shape[0], 128).to(self.device)
        self.optimizer_value = AdamW(self.value_net.parameters(), lr=lr)

    def rollout(self):

        #<SAME AS BEFORE>
        self.values = []

        while True:

            # Store the value estimate of the current state before stepping
            self.values.append(self.value_net(torch.from_numpy(state).float().to(self.device)))

            #<SAME AS BEFORE>

            if terminated or truncated:
                #<SAME AS BEFORE>
                break

    def learn(self):

        #<SAME AS BEFORE> (stack log_probs and entropies, compute returns)

        self.values = torch.cat(self.values)
        value_loss = F.mse_loss(self.values, returns)

        #<SAME AS BEFORE> (policy loss and policy optimizer step,
        # now with advantages = returns - self.values.detach())

        self.optimizer_value.zero_grad()
        value_loss.backward()
        torch.nn.utils.clip_grad_norm_(self.value_net.parameters(), float('inf'))
        self.optimizer_value.step()
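Since the snippet above skips the parts that stay the same, here is a minimal standalone sketch of how the two losses could fit together. The function `baseline_losses` and the toy tensors are made up for illustration. I am assuming the advantage is the return minus the detached value estimate, with the same 0.001 entropy weight as before.

```python
import torch
import torch.nn.functional as F

def baseline_losses(log_probs, values, returns, entropies, entropy_coef=0.001):
    # Advantage: how much better the sampled return was than the value estimate.
    # detach() stops the policy loss from back-propagating into the value network.
    advantages = returns - values.detach()
    policy_loss = -torch.mean(advantages * log_probs)
    entropy_loss = -torch.mean(entropies)
    # The value network is trained separately to regress onto the returns
    value_loss = F.mse_loss(values, returns)
    return policy_loss + entropy_coef * entropy_loss, value_loss

# Toy tensors standing in for one 3-step rollout
log_probs = torch.tensor([-0.7, -0.6, -0.9], requires_grad=True)
values = torch.tensor([2.0, 1.5, 1.0], requires_grad=True)
returns = torch.tensor([2.5, 1.5, 0.5])
entropies = torch.tensor([0.69, 0.65, 0.60])

policy_loss, value_loss = baseline_losses(log_probs, values, returns, entropies)
```

Because the values are detached inside the advantage, calling backward() on the policy loss leaves the value network untouched. Each network is updated only by its own loss and optimizer.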

Results

We can run this agent just like before.

The baseline version shows amazing improvements. The learning is stable, and the episodic reward grows consistently. So, if you ever have an episodic environment, REINFORCE with Baseline can be a viable choice.

Conclusion

In this part, we learned about the policy gradient method and wrote our first algorithm, REINFORCE. We saw that although this algorithm is powerful, it has some limitations. Namely, it cannot work well with non-episodic tasks, and it has high variance. Thus, we must implement bootstrapping to reduce our dependence on complete episodic returns. We will look at this in the next part, where we will see our first actor-critic method: Advantage Actor Critic (A2C). With that we will also see our first continuous action space implementation. Things are going to get exciting. Trust me!

Want to connect?
😸GitHub Repo
🌍My Website

🐦My Twitter

👨My LinkedIn

Using YOLO with CLIP to improve Retrieval

Akshay Ballal — Fri, 02 Aug 2024 14:10:15 +0000

In this article we are going to see how we can use object detection models like YOLO along with multimodal embedding models like CLIP to make image retrieval better.

Here is the idea. CLIP image retrieval works as follows: we embed the images we have using a CLIP model and store the embeddings somewhere, such as a vector database. Then, during inference, we embed a query image or a text prompt and retrieve the stored images whose embeddings are closest to it. The problem arises when an image contains many objects, or when the object we care about sits in the background, and we still want our system to retrieve that image. This happens because CLIP embeds the image as a whole. Think of it as the difference between a sentence embedding model and a word embedding model: we want to be able to search for the "words", which here are the individual objects in an image. So, the solution is to decompose the image into its objects using an object detection model, embed these crops, and link each crop to its parent image. This allows us to retrieve a crop and then get the parent image from which it originated. Let’s see how it works.

Install the Dependencies and import them



!pip install -q ultralytics torch matplotlib numpy pillow zipfile36 transformers

from ultralytics import YOLO
import matplotlib.pyplot as plt
from PIL import Image
import os
from zipfile import ZipFile, BadZipFile
import torch
from transformers import CLIPProcessor, CLIPModel, CLIPVisionModelWithProjection, CLIPTextModelWithProjection

Download the COCO Dataset and unzip



!wget http://images.cocodataset.org/zips/val2017.zip -O coco_val2017.zip

def extract_zip_file(extract_path):
    try:
        with ZipFile(extract_path+".zip") as zfile:
            zfile.extractall(extract_path)
        # remove zipfile
        zfileTOremove=f"{extract_path}"+".zip"
        if os.path.isfile(zfileTOremove):
            os.remove(zfileTOremove)
        else:
            print("Error: %s file not found" % zfileTOremove)
    except BadZipFile as e:
        print("Error:", e)

extract_val_path = "./coco_val2017"
extract_zip_file(extract_val_path)

We can then take some of the images and create a list of examples.



source = [
    'coco_val2017/val2017/000000000139.jpg',
    'coco_val2017/val2017/000000000632.jpg',
    'coco_val2017/val2017/000000000776.jpg',
    'coco_val2017/val2017/000000001503.jpg',
    'coco_val2017/val2017/000000001353.jpg',
    'coco_val2017/val2017/000000003661.jpg',
]

Initiate the YOLO model and the CLIP Model

In this example we are going to use the Ultralytics YOLOv10x model along with OpenAI’s clip-vit-base-patch32.



device = "cuda"

# YOLO Model
model = YOLO('yolov10x.pt')

# Clip model
model_id = "openai/clip-vit-base-patch32"
image_model = CLIPVisionModelWithProjection.from_pretrained(model_id, device_map = device)
text_model = CLIPTextModelWithProjection.from_pretrained(model_id, device_map = device)
processor = CLIPProcessor.from_pretrained(model_id)

Running the detection model



results = model(source=source, device = "cuda")

Let’s visualize the results with this code snippet:



# Visualize the results
fig, ax = plt.subplots(2, 3, figsize=(15, 10))

for i, r in enumerate(results):
    # Plot results image
    im_bgr = r.plot()  # BGR-order numpy array
    im_rgb = Image.fromarray(im_bgr[..., ::-1])  # RGB-order PIL image

    ax[i%2, i//2].imshow(im_rgb)
    ax[i%2, i//2].set_title(f"Image {i+1}")

So we can see that the YOLO model works quite well in detecting the objects in the images. It does make some mistakes where it has tagged the monitor as TV. But that is fine. The actual classes that YOLO assigns are not that essential because we are going to use CLIP to do the inference.

Defining some helper Classes



class CroppedImage:

  def __init__(self, parent, box, cls):

    self.parent = parent
    self.box = box
    self.cls = cls

  def display(self, ax = None):
    im_rgb = Image.open(self.parent)
    cropped_image = im_rgb.crop(self.box)

    if ax is not None:
      ax.imshow(cropped_image)
      ax.set_title(self.cls)
    else:
      plt.figure(figsize=(10, 10))
      plt.imshow(cropped_image)
      plt.title(self.cls)
      plt.show()

  def get_cropped_image(self):
    im_rgb = Image.open(self.parent)
    cropped_image = im_rgb.crop(self.box)
    return cropped_image

  def __str__(self):
    return f"CroppedImage(parent={self.parent}, boxes={self.box}, cls={self.cls})"

  def __repr__(self):
    return self.__str__()

class YOLOImage:
  def __init__(self, image_path, cropped_images):
    self.image_path = str(image_path)
    self.cropped_images = cropped_images

  def get_image(self):
    return Image.open(self.image_path)

  def get_caption(self):
    labels = []
    for cropped_image in self.cropped_images:
      labels.append(cropped_image.cls)

    # Count how often each class label occurs among the crops
    unique_labels = set(labels)
    count_cls = {label: labels.count(label) for label in unique_labels}

    count_string = " ".join(f"{count} {label}," for label, count in count_cls.items())
    return "this image contains " + count_string

  def __str__(self):
    return self.__repr__()

  def __repr__(self):
    cls  =[]
    for cropped_image in self.cropped_images:
      cls.append(cropped_image.cls)

    return f"YOLOImage(image={self.image_path}, cropped_images={cls})"

class ImageEmbedding:
  def __init__(self, image_path, embedding, cropped_image = None):
    self.image_path = image_path
    self.cropped_image = cropped_image
    self.embedding = embedding

CroppedImage Class

The CroppedImage class represents a portion of an image cropped from a larger parent image. It is initialized with the path to the parent image, the bounding box defining the crop area, and a class label (e.g., "cat" or "dog"). This class includes methods to display the cropped image and to retrieve it as an image object. The display method allows for visualizing the cropped portion either on a provided axis or by creating a new figure, making it versatile for different use cases. Additionally, __str__ and __repr__ methods are implemented for easy and informative string representation of the object.

YOLOImage Class

The YOLOImage class holds an image processed by the YOLO object detection model. It takes the path to the original image and a list of CroppedImage instances that represent the detected objects. It provides a method to open the full image and a method to generate a caption that counts the unique class labels among the crops, giving a concise description of the image contents.

ImageEmbedding Class

The ImageEmbedding class stores an image path together with its embedding, a numerical representation of the image's features. It is initialized with the path to the image, the embedding vector, and optionally a CroppedImage instance if the embedding corresponds to a specific crop. This gives us a structured way to keep each embedding alongside the image (and crop) it came from, which is what we need for retrieval.

Crop each image and create a list of YOLOImage Objects



yolo_images: list[YOLOImage]= []

names= model.names

for i, r in enumerate(results):
  crops:list[CroppedImage] = []
  boxes = r.boxes
  classes = r.boxes.cls
  for j, box in enumerate(r.boxes):
    box = tuple(box.xyxy.flatten().cpu().numpy())
    cropped_image = CroppedImage(parent = r.path, box = box, cls = names[classes[j].int().item()])
    crops.append(cropped_image)
  yolo_images.append(YOLOImage(image_path=r.path, cropped_images=crops))

Embed Images using CLIP



image_embeddings = []

for image in yolo_images:
  input = processor.image_processor(images= image.get_image(), return_tensors = 'pt')
  input.to(device)
  embeddings = image_model(pixel_values = input.pixel_values).image_embeds
  embeddings = embeddings/embeddings.norm(p=2, dim = -1, keepdim = True) # Normalize the embeddings
  image_embedding = ImageEmbedding(image_path = image.image_path, embedding = embeddings)
  image_embeddings.append(image_embedding)

  for cropped_image in image.cropped_images:
    input = processor.image_processor(images= cropped_image.get_cropped_image(), return_tensors = 'pt')
    input.to(device)
    embeddings = image_model(pixel_values = input.pixel_values).image_embeds
    embeddings = embeddings/embeddings.norm(p=2, dim = -1, keepdim = True) # Normalize the embeddings

    image_embedding = ImageEmbedding(image_path = image.image_path, embedding = embeddings, cropped_image = cropped_image)
    image_embeddings.append(image_embedding)

image_embeddings_tensor = torch.stack([image_embedding.embedding for image_embedding in image_embeddings]).squeeze()

We can now take these image embeddings and store them in a vector database if we want to. But in this example we will just use the dot product between the embeddings to measure similarity and retrieve the images.

Retrieval



query = "image of a flowerpot"

text_embedding = processor.tokenizer(query, return_tensors="pt").to(device)
text_embedding = text_model(**text_embedding).text_embeds

similarities = (torch.matmul(text_embedding, image_embeddings_tensor.T)).flatten().detach().cpu().numpy()

# get the top 5 similar images, most similar first
k = 5
top_k_indices = similarities.argsort()[-k:][::-1]

# Display the top 5 results
fig, ax = plt.subplots(2, 5, figsize=(20, 5))
for i, index in enumerate(top_k_indices):
  if image_embeddings[index].cropped_image is not None:
    image_embeddings[index].cropped_image.display(ax = ax[0][i])
  else:
    ax[0][i].imshow(Image.open(image_embeddings[index].image_path))
  ax[1][i].imshow(Image.open(image_embeddings[index].image_path))
  ax[0][i].axis('off')
  ax[1][i].axis('off')
  ax[1][i].set_title("Original Image")
plt.show()

You can see that we are able to retrieve even small plants that are hidden away in the background. Sometimes it also pulls the original image as the result, because we embed the full images as well.
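Since several crops (and the full image itself) can point to the same parent, you may want each parent to show up only once in the results. Here is a minimal sketch of that idea in plain Python. The function `top_k_parents` and the toy scores are made up for illustration and are not part of the notebook.

```python
def top_k_parents(similarities, parent_paths, k=5):
    # Rank all embeddings by similarity, best first
    order = sorted(range(len(similarities)), key=lambda i: similarities[i], reverse=True)
    seen = set()
    results = []
    for i in order:
        parent = parent_paths[i]
        if parent in seen:
            continue  # a better-scoring crop of this parent is already in the results
        seen.add(parent)
        results.append((parent, similarities[i]))
        if len(results) == k:
            break
    return results

# Toy data: four embeddings that belong to three parent images
scores = [0.1, 0.9, 0.8, 0.3]
parents = ["a.jpg", "b.jpg", "b.jpg", "c.jpg"]
best = top_k_parents(scores, parents, k=2)
```

Here the two best scores both belong to b.jpg, so the second result falls through to c.jpg instead of repeating the same parent.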

This can be a very powerful technique. You can also finetune both the models for detection and embedding for your own images and improve the performance even more.

One downside is that we have to run the CLIP model on all the objects detected. One way to mitigate this is by limiting the number of boxes that YOLO produces.
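As a rough sketch of that mitigation, the helper below keeps only the most confident detections before they are sent to CLIP. The function `keep_top_boxes`, its thresholds, and the (box, label, confidence) tuple layout are assumptions made for illustration.

```python
def keep_top_boxes(detections, min_conf=0.5, max_boxes=10):
    # detections: list of (box, label, confidence) tuples
    confident = [d for d in detections if d[2] >= min_conf]
    # Highest confidence first, then cut off at max_boxes
    confident.sort(key=lambda d: d[2], reverse=True)
    return confident[:max_boxes]

detections = [
    ((10, 10, 50, 50), "potted plant", 0.91),
    ((0, 0, 20, 20), "vase", 0.32),
    ((30, 40, 90, 120), "chair", 0.77),
    ((5, 60, 25, 80), "book", 0.55),
]
kept = keep_top_boxes(detections, min_conf=0.5, max_boxes=2)
```

The Ultralytics predict call also accepts `conf` and `max_det` arguments, which should let you apply the same kind of limit at detection time. The trade-off is that very small or low-confidence objects may no longer be retrievable.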

You can check out the code on Colab at this link.

Reinforcement Learning from Scratch - Part 2 - Deep Q Learning

Akshay Ballal — Tue, 30 Jul 2024 05:46:17 +0000

Hey everyone. This is the second part of my Reinforcement Learning Series, where we look at the different RL algorithms and discuss their implementation details. In the last part, we saw a basic RL algorithm that uses tabular Q learning. One disadvantage of that technique was that it required us to discretize the observation and action spaces. This increases the size of the observation space quite a bit and makes learning harder and slower.

To overcome this problem, at least partially, in this part, we look at Deep Q Learning (DQN), a prominent technique that brought Reinforcement learning into the mainstream. We can use the continuous observation space in this technique but still need a discrete action space. First, we will put together a simple DQN algorithm and then see how we can use LSTM architecture as the network in DQN. Let’s dive in.

DQN Theory

If you are not interested in the theory of DQN, that is fine. The code implementation below explains a lot. However, I am including some of the main driving equations below for completeness.

From the previous part, we know that the q value of a state action pair is the expected total return from that state if a certain action is taken. Using the Bellman Equation, we can write it out as:

If we are following the policy π we can say that:

From this, we can say that the q-value of a state-action pair is equal to the average (expectation) of the sum of the reward and γ times the maximum q-value of the next state. This is called bootstrapping.
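In symbols, the statement above is usually written in the following standard form. The symbols $r$, $s'$ and $a'$ for the reward, next state and next action are my choice of notation here:

$$Q(s, a) = \mathbb{E}\left[\, r + \gamma \max_{a'} Q(s', a') \,\right]$$

The right-hand side uses the current estimate of $Q$ at the next state to update the estimate at the current state, which is exactly the bootstrapping idea.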

For example, if we have a transition (s, 2, r, s’), we take action 2. Then, we do the following.

Pass the state through the q-network and get the q-values of all the actions.
Pick the q-value of the action taken, which is 2 here.
Pass the next state s’ through the target q-network and get q-values of all the actions of s’.
Pick the maximum q-value.

We can then take a Mean Square Error Loss between the target and the prediction and perform backward propagation.
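The four steps above can be sketched with made-up numbers in plain Python. The q-values, reward and the helper `td_target` are hypothetical and only serve as a worked example for the transition (s, 2, r, s’).

```python
def td_target(reward, next_q_values, gamma, terminated):
    # No bootstrapping from the next state if the episode terminated
    bootstrap = 0.0 if terminated else gamma * max(next_q_values)
    return reward + bootstrap

q_values = [0.1, -0.3, 0.4]       # q-network output for state s
next_q_values = [0.2, 0.5, -0.1]  # target q-network output for state s'
action, reward, gamma = 2, -0.2, 0.99

prediction = q_values[action]  # q-value of the action that was taken
target = td_target(reward, next_q_values, gamma, terminated=False)
squared_error = (target - prediction) ** 2
```

Here the prediction is 0.4 and the target is -0.2 + 0.99 * 0.5 = 0.295, so the loss pushes the q-value of action 2 in state s down towards the target.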

The below image can make this more clear

The Algorithm

Below you can see the pseudocode for the Deep Q Learning. It looks more complex than it actually is. You can follow along with the code to get a better understanding.

Code Implementation

We will again use the Pendulum environment we used in the last part. We will also log some metrics to TensorBoard to see how the learning is going.

You can check out the code at: GitHub Repo

Let’s start with importing the dependencies



import gymnasium as gym
import numpy as np
from tqdm import tqdm
import torch
from torch import nn
from torch.optim import AdamW
from copy import deepcopy
from torch.utils.tensorboard import SummaryWriter
import torch.nn.functional as F

Pendulum Environment Wrapper

We define the Gaussian Function that I introduced in Part 1, which creates the set of possible actions.



def gaussian(x, mu, sigma, max_amp):
    return max_amp * np.exp(-0.5 * ((x - mu) / sigma)**2)

Then, we create the wrapper. This is very similar to the previous wrapper, and in fact simpler, because we can use the observation space from the Pendulum environment as it is and don't have to do any discretization.



class PendulumDiscreteStateAction(gym.Wrapper):

    def __init__(self, env: gym.Env, nvec_u: int, sigma: float):

        super(PendulumDiscreteStateAction, self).__init__(env)

        self.env = env
        self.nvec_u = nvec_u

        # Create a Discrete action space
        self.action_space = gym.spaces.Discrete(nvec_u)

        kernel = gaussian(np.linspace(0, 1, nvec_u//2), 0, sigma, 
                                               self.env.action_space.high[0])
        self.actions = (-kernel).tolist() + [0] + np.flip(kernel).tolist()

    def step(self, action: int) -> tuple[np.ndarray[float], float, bool, bool, dict]:

        action = self.actions[action]
        obs, reward, terminated, truncated, info = self.env.step([action])
        reward = reward/16.2736044 # normalize the reward between -1 and 0
        obs: np.ndarray[float] = obs/self.env.observation_space.high # normalize the observation between -1 and 1
        return obs, reward, terminated, truncated, info

    def reset(self) -> tuple[np.ndarray[float], dict]:
        """
        Resets the environment.

        Returns:
        - The initial normalized observation and additional information.
        """
        obs, info = self.env.reset()
        obs: np.ndarray[float] = obs/self.env.observation_space.high
        return obs, info
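To see what the action set looks like, here is a small standalone sketch in plain Python that mirrors the kernel construction in the wrapper. The helper `build_actions` is made up for illustration, and the maximum torque of 2.0 is my assumption for the Pendulum environment.

```python
import math

def gaussian(x, mu, sigma, max_amp):
    return max_amp * math.exp(-0.5 * ((x - mu) / sigma) ** 2)

def build_actions(nvec_u, sigma, max_torque=2.0):
    half = nvec_u // 2
    # Evenly spaced points in [0, 1], like np.linspace(0, 1, half)
    xs = [0.0] if half == 1 else [i / (half - 1) for i in range(half)]
    kernel = [gaussian(x, 0.0, sigma, max_torque) for x in xs]
    # Negative torques, then zero, then the mirrored positive torques
    return [-k for k in kernel] + [0.0] + kernel[::-1]

actions = build_actions(nvec_u=5, sigma=0.5)
```

With these settings the actions run from -2.0 to 2.0 with finer steps near zero. Note that the construction always yields an odd number of actions, so an odd nvec_u keeps the list in sync with Discrete(nvec_u).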

Notable Implementation Details

Reward Scaling: The rewards are scaled, in the spirit of a MinMax scaler, to make the learning smoother. This is done by dividing the returned reward by the magnitude of the minimum possible reward: rscaled = r/|rmin|, which maps the reward into the range -1 to 0. The value of rmin is given in the documentation of the Pendulum environment on Gymnasium. The intuition behind this is that the network does not have to output very large q values, and the weights can remain smaller, making learning smoother.
Scaled Observations: The observations are scaled by dividing them by the maximum value of the observation space.

Define the Q-Network

Here we define a simple Fully Connected Network which maps the state to the q-values of all possible actions.



class QNetwork(nn.Module):
    def __init__(self, nvec_s, nvec_u):
        super(QNetwork, self).__init__()

        self.fc1 = nn.Linear(nvec_s, 64)
        self.fc2 = nn.Linear(64, 64)
        self.fc3 = nn.Linear(64, nvec_u)

    def forward(self, x:torch.Tensor) -> torch.Tensor:
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x

Create ReplayMemory Class



class ReplayMemory:
    def __init__(self, capacity, env: gym.Env, device: torch.device):

        self.position = 0
        self.size = 0
        self.capacity = capacity
        self.device = device

        self.n_states = env.observation_space.shape  # Number of dimensions in the state space
        self.n_actions = env.action_space.n  # Number of discrete actions

        # Initialize arrays to store the replay memory
        self.states = np.zeros((capacity, *self.n_states))
        self.actions = np.zeros((capacity))
        self.rewards = np.zeros(capacity)
        self.next_states = np.zeros((capacity, *self.n_states))
        self.terminated = np.zeros(capacity)
        self.truncated = np.zeros(capacity)

    def push(self, state:np.ndarray, action:int, next_state:np.ndarray, reward:float, terminated: bool, truncated:bool):

        self.states[self.position] = state.flatten()
        self.actions[self.position] = action
        self.next_states[self.position] = next_state.flatten()
        self.rewards[self.position] = reward
        self.terminated[self.position] = terminated
        self.truncated[self.position] = truncated

        self.position = (self.position + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size):

        indices = np.random.choice(self.size, batch_size, replace=False)

        states = torch.tensor(self.states[indices], dtype = torch.float32, device=self.device)
        actions = torch.tensor(self.actions[indices], dtype = torch.int64, device=self.device)
        next_states = torch.tensor(self.next_states[indices], dtype = torch.float32, device=self.device)
        rewards = torch.tensor(self.rewards[indices], dtype = torch.float32, device=self.device)
        terminated = torch.tensor(self.terminated[indices], dtype = torch.float32, device=self.device)
        truncated = torch.tensor(self.truncated[indices], dtype = torch.float32, device=self.device)

        return states, actions, next_states, rewards, terminated, truncated

    def __len__(self):
        return self.size

Create Agent Class

Initialization and some helper functions

Here we define the agent initialization function that initializes the various variables and hyperparameters. We also have a get_action function that takes a random action with a probability of ϵ and otherwise passes the state through the q-network and takes the argmax, i.e., selects the action with the highest q-value.



class Agent:
    def __init__(
        self,
        env: gym.Env,
        gamma=0.99,
        alpha=0.0003,
        initial_epsilon=1,
        min_epsilon=0.1,
        decay_rate=0.9999,
        batch_size=64,
        n_rollouts=2000,
        capacity=100000,
        device: torch.device = torch.device("cpu"),
    ):
        self.env = env  # Environment
        self.device = device  # Computation device (CPU or GPU)
        self.gamma = gamma  # Discount factor
        self.alpha = alpha  # Learning rate
        self.epsilon = initial_epsilon  # Initial epsilon value for exploration
        self.batch_size = batch_size  # Batch size for training
        self.n_rollouts = n_rollouts  # Number of rollouts to collect

        self.min_epsilon = min_epsilon  # Minimum epsilon value
        self.decay_rate = decay_rate  # Epsilon decay rate

        # Replay memory to store experiences
        self.replay_memory = ReplayMemory(capacity, env, device)
        # Q-network and target network for Q-learning
        self.q_network = QNetwork(
            env.observation_space.shape[0], env.action_space.n
        ).to(device)
        self.target_network = deepcopy(self.q_network)
        # Optimizer for Q-network
        self.optimizer = AdamW(self.q_network.parameters(), lr=alpha)

        # Number of dimensions in the state space
        self.n_states = env.observation_space.shape[0]

        # For metrics
        self.n_time_steps = 0  # Number of time steps
        self.episodes = 0  # Number of episodes
        self.n_updates = 0  # Number of gradient updates
        self.best_reward = -np.inf  # Best reward seen so far

    def get_action(self, obs, greedy=False):

        if not greedy and np.random.rand() < self.epsilon:  # Epsilon-greedy exploration
            return np.random.randint(self.env.action_space.n)  # Random action
        obs = torch.tensor(obs, dtype=torch.float32, device=self.device).unsqueeze(0)  # Convert observation to tensor
        self.q_network.eval()  # Set Q-network to evaluation mode
        with torch.no_grad():
            q_values: torch.Tensor = self.q_network(obs)  # Get Q-values for the observation
            return q_values.argmax().item()  # Return action with highest Q-value

    def sample_experience(self):

        return self.replay_memory.sample(self.batch_size)  # Sample experiences from replay memory

    def update_target(self):

        self.target_network.load_state_dict(self.q_network.state_dict())  # Update target network

Collect Rollouts:

The collect_rollouts function collects transitions by simulating the environment with the epsilon-greedy policy that we defined in the get_action function. Each transition is stored in the replay memory, and we decay epsilon after every step to reduce exploration over time.



def collect_rollouts(self):
    obs, info = self.env.reset()  # Reset environment
    terminated = False
    truncated = False
    rewards = 0  # Total rewards
    episodes = 0  # Total episodes
    for _ in range(self.n_rollouts):
        action = self.get_action(obs, greedy=False)  # Get action
        next_obs, reward, terminated, truncated, _ = self.env.step(action)  # Step environment
        self.replay_memory.push(
            obs, action, next_obs, reward, terminated, truncated
        )  # Save the transition
        obs = next_obs  # Update observation
        rewards += reward  # Accumulate reward
        self.n_time_steps += 1  # Increment time steps
        if terminated or truncated:  # Check if episode ended
            episodes += 1
            self.episodes += 1
            obs, info = self.env.reset()  # Reset environment

        self.epsilon = max(
            self.min_epsilon, self.decay_rate * self.epsilon
        )  # Decrease epsilon

    return rewards / max(episodes, 1)  # Average reward per episode (guards against zero finished episodes)
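To get a feel for how quickly exploration fades under the multiplicative decay used above, the following sketch counts how many decay steps it takes for epsilon to hit its floor. The helper names and the values chosen for decay_rate and min_epsilon are illustrative assumptions, not the settings used for the results in this post.

```python
import math

def steps_until_min_epsilon(initial_epsilon: float, min_epsilon: float, decay_rate: float) -> int:
    """Number of multiplicative decay steps before epsilon is clamped at min_epsilon."""
    if not (0 < decay_rate < 1) or not (0 < min_epsilon <= initial_epsilon):
        raise ValueError("expected 0 < decay_rate < 1 and 0 < min_epsilon <= initial_epsilon")
    # Smallest integer n with initial_epsilon * decay_rate**n <= min_epsilon
    return math.ceil(math.log(min_epsilon / initial_epsilon) / math.log(decay_rate))

def decay(epsilon: float, min_epsilon: float, decay_rate: float) -> float:
    """One decay step, in the same form as the update in collect_rollouts."""
    return max(min_epsilon, decay_rate * epsilon)

# Hypothetical settings: start fully random, floor at 10% random actions
n = steps_until_min_epsilon(initial_epsilon=1.0, min_epsilon=0.1, decay_rate=0.999)
print(f"epsilon reaches its minimum after {n} steps")
```

A calculation like this helps when choosing decay_rate: if the floor is reached long before training ends, most of the data is collected by a nearly greedy policy.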

Learn

The learn function samples a batch of batch_size experiences from the replay memory, predicts the Q-values, and computes the target values. We then compute the loss and backpropagate it. We repeat this for the given number of epochs.



def learn(self, epochs):

    self.q_network.train()  # Set Q-network to training mode

    average_loss = 0
    for i in range(epochs):
        obs, action, next_obs, reward, terminated, truncated = (
            self.sample_experience()
        )  # Sample a batch of experiences
        q_values: torch.Tensor = self.q_network(obs)  # Get Q-values for the batch
        with torch.no_grad():  # The target network is not trained, so skip gradient tracking
            next_q_values = self.target_network(next_obs)  # Get Q-values for the next states

        q_value = q_values.gather(1, action.unsqueeze(1)).squeeze(1)  # Gather Q-values for taken actions
        next_q_value = next_q_values.max(1).values  # Get max Q-value for next states
        target = reward + self.gamma * next_q_value * (1 - terminated) * (
            1 - truncated
        )  # Compute target Q-value

        loss = F.smooth_l1_loss(q_value, target)  # Compute loss

        self.optimizer.zero_grad()  # Zero gradients
        loss.backward()  # Backpropagate
        torch.nn.utils.clip_grad_norm_(self.q_network.parameters(), 10)  # Clip gradients
        self.optimizer.step()  # Update Q-network

        average_loss += (loss.item() - average_loss) / (i + 1)  # Update average loss
        self.n_updates += 1  # Increment update count

        if self.n_updates % 1000 == 0:  # Update target network periodically
            self.update_target()

    return average_loss  # Return average loss

Some Implementation Details

Smooth L1 Loss: This was inspired by the Stable Baselines implementation of DQN. Smooth L1 loss behaves like L2 loss for small errors and like L1 loss for large ones, which keeps outliers from producing exploding gradients. This is useful in off-policy algorithms because transitions sampled from the replay buffer can come from an old, very different policy, and it is not desirable to fit to these "outliers."
Gradient Clipping: Before performing the optimizer step, the L2 norm of the gradients is clipped to 10 to further counteract the influence of outliers. This prevents exploding gradients by keeping the overall gradient norm at or below 10.
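Both ideas can be illustrated on plain numbers. The sketch below is a simplified, framework-free version: smooth_l1 follows the piecewise form that PyTorch documents for smooth_l1_loss with its default threshold of 1.0, and clip_by_global_norm (a hypothetical helper) only mimics the rescaling idea behind clip_grad_norm_, without its small numerical-stability term.

```python
import math

def smooth_l1(error: float, beta: float = 1.0) -> float:
    """Quadratic for |error| < beta, linear beyond it."""
    abs_error = abs(error)
    if abs_error < beta:
        return 0.5 * abs_error ** 2 / beta
    return abs_error - 0.5 * beta

def clip_by_global_norm(grads: list[float], max_norm: float) -> list[float]:
    """Rescale all gradients together so that their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)  # already small enough, leave untouched
    scale = max_norm / total_norm
    return [g * scale for g in grads]

small = smooth_l1(0.5)    # 0.125, identical to 0.5 * error**2
outlier = smooth_l1(3.0)  # 2.5, whereas 0.5 * error**2 would give 4.5
clipped = clip_by_global_norm([30.0, 40.0], max_norm=10.0)  # norm 50 is rescaled to norm 10
```

Because the loss grows only linearly for large errors, a single stale transition contributes a bounded gradient, and the norm clipping then caps whatever is left.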

Evaluate the current Policy

According to Zhenyi Wang, in many cases DQN starts to forget and then relearns once it has come close to the maximum reward. This can also be seen in the figure below, where the agent sometimes drops to a low reward. This can happen because, even when the agent has achieved an optimal greedy policy, the rollouts are still collected with the ϵ-greedy policy. The agent can therefore suddenly take a sub-optimal action, which can cause a large change in the network and make the agent unlearn. This can sometimes be beneficial, since it makes the agent more robust to perturbations. Nevertheless, the policy is evaluated after every learning cycle, and the best model is saved. This ensures that, in the end, the best-performing model is retrievable.



def evaluate(self, n_steps):
    self.q_network.eval()  # Set Q-network to evaluation mode
    rewards = 0  # Total rewards
    episodes = 0  # Total episodes
    with torch.no_grad():
        obs, info = self.env.reset()  # Reset environment
        for _ in range(n_steps):
            action = self.get_action(obs, greedy=True)  # Get action
            obs, reward, terminated, truncated, _ = self.env.step(action)  # Step environment
            rewards += reward  # Accumulate reward

            if terminated or truncated:  # Check if episode ended
                episodes += 1
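                obs, info = self.env.reset()  # Reset environment and continue from the fresh observation

    rewards /= max(episodes, 1)  # Compute average reward per episode (guards against zero finished episodes)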
    if rewards > self.best_reward:  # Save best model if improved
        self.best_reward = rewards
        torch.save(self.q_network.state_dict(), "dqn_best_model.pth")
        print("New best model saved!")
    return rewards

Train

Finally, we have the train function, which runs the collect_rollouts, learn, and evaluate functions in a loop for a number of epochs.



    def train(self, epochs):
        self.writer = SummaryWriter(log_dir="dqn_logs/DQN_2")  # TensorBoard writer

        pbar = tqdm(range(epochs))  # Progress bar
        for i in pbar:
            rewards = self.collect_rollouts()  # Collect rollouts
            loss = self.learn(int(self.n_rollouts/2))  # Perform learning
            eval_reward = self.evaluate(1000)  # Evaluate agent

            pbar.set_description(
                f"Iteration {i+1} || Reward: {rewards:.3f} || Eval Reward: {eval_reward :.3f} || Loss: {loss:.3f} || Epsilon: {self.epsilon:.2f} || Time steps: {self.n_time_steps} || N updates: {self.n_updates}"
            )  # Update progress bar
            self.writer.add_scalar("Training/Loss", loss, self.n_updates)  # Log training loss
            self.writer.add_scalar(
                "Training/Rollout: Mean Episode Reward", rewards, self.n_updates
            )  # Log mean episode reward during training
            self.writer.add_scalar(
                "Evaluation/Mean Episode Reward", eval_reward, self.n_updates
            )  # Log mean episode reward during evaluation

Hyperparameters:

For learning and exploration: gamma, alpha, initial_epsilon, min_epsilon, decay_rate
For experience replay and training: batch_size, n_rollouts, capacity

Start Training



env = gym.make("Pendulum-v1", max_episode_steps=500)  # Truncate episodes after 500 steps
env = PendulumDiscreteStateAction(env, 11, 0.4)

agent = Agent(env)
agent.train(30)

Testing



import matplotlib.pyplot as plt

env = gym.make(
    "Pendulum-v1",
    max_episode_steps=500,  # Truncate episodes after 500 steps
    # render_mode="human",
)
env = PendulumDiscreteStateAction(env, 11, 0.4)

agent = Agent(env)
agent.q_network.load_state_dict(torch.load("models/deep_q_learning/dqn_best_model.pth"))

rewards = []
thetas = []
tot_reward = 0
n_episodes = 100
n_steps= 0
for _ in range(n_episodes):
    obs, info = env.reset()
    terminated = False
    truncated = False
    while not terminated and not truncated:
        with torch.no_grad():
            action = agent.get_action(obs, greedy = True)
            # print(action)
            obs, reward, terminated,  truncated, info = env.step(action)
            x = obs[0] * env.observation_space.high[0]
            y = obs[1] * env.observation_space.high[1]
            theta = np.arctan2(y, x)
            thetas.append(theta)
            rewards.append(reward)
            tot_reward += reward
            n_steps+=1
            # env.render()
print("Average episodic reward: ", tot_reward / n_episodes)

Results

Below, we can see the results of our trained agent. The agent brings the pendulum to an angle of zero no matter where it starts, and once it is balanced there, the per-step reward is close to zero.

The average episodic reward across 100 episodes is 9.8

LSTM Implementation

The transitions we collect are temporal. That is, there is a correlation between nearby transitions. To capture this temporal relation, we can change the state of the environment to be a history of past (state, action) pairs. To do this, we append the previous action to the observation and keep the last Nhist such observations. Let’s see if this improves the agent, since it now has more information and a more expressive network.

In the figure below, you can see the LSTM network architecture. There are several ways to implement this. Here, the outputs of all time steps are averaged before the final linear layer; another option is to pass only the output of the last time step on to the linear layers. But let’s proceed with this for now.



# Qnetwork with LSTM
class QNetwork(nn.Module):
    def __init__(self, nvec_s, nvec_u):
        super(QNetwork, self).__init__()

        self.lstm = nn.LSTM(nvec_s, 64, 1, batch_first=True)
        self.fc1 = nn.Linear(64, 64)
        self.fc2 = nn.Linear(64, nvec_u)

    def forward(self, x:torch.Tensor) -> torch.Tensor:
        x, _ = self.lstm(x)
        x = F.relu(self.fc1(x))
        x = torch.mean(x, dim=1)
        x = self.fc2(x)
        return x

Also, we need to change the wrapper to get the history of observations.



from collections import deque

class PendulumDiscreteStateAction(gym.Wrapper):

    def __init__(self, env: gym.Env, nvec_u: int, sigma: float, nhist: int):

        super(PendulumDiscreteStateAction, self).__init__(env)

        self.env = env
        self.nvec_u = nvec_u

        # Check if the observation space is of type Box
        assert isinstance(
            env.observation_space, gym.spaces.Box
        ), "Error: observation space is not of type Box"

        # Create a Discrete action space
        self.action_space = gym.spaces.Discrete(nvec_u)


        # Define the possible actions
        kernel:np.ndarray = gaussian(np.linspace(0, 1, 5), 0, sigma, 2)
        self.actions = (-kernel).tolist() + [0] + np.flip(kernel).tolist()

        # Bounds for each history entry: the original observation plus the previous action
        low = np.array([list(self.env.observation_space.low) + [min(self.actions)] for _ in range(nhist)])
        high = np.array([list(self.env.observation_space.high) + [max(self.actions)] for _ in range(nhist)])


        self.observation_space = gym.spaces.Box(
            low=low,
            high=high,
            dtype=np.float32
        )

        self.prev_action = None
        self.history = deque(maxlen = nhist)

        # Initialize the history with zeros (observation dimensions plus the previous action)
        for i in range(nhist):
            self.history.append(np.zeros(low.shape[1]))

    def step(self, action: int) -> tuple[np.ndarray, float, bool, bool, dict]:

        action = self.actions[action]
        obs, reward, terminated, truncated, info = self.env.step([action])
        reward = reward / 16.2736044  # normalize the reward to between -1 and 0
        obs = obs / self.env.observation_space.high  # normalize the observation to between -1 and 1
        obs = np.append(obs, action / 2)  # append the action, scaled to between -1 and 1
        self.history.append(obs)
        obs = np.array(list(self.history))
        return obs, reward, terminated, truncated, info

    def reset(self) -> tuple[np.ndarray[float], dict]:

        obs, info = self.env.reset()
        obs: np.ndarray[float] = obs/self.env.observation_space.high
        obs=np.append(obs, 0)

        # Initialize the history with the same observation
        for i in range(len(self.history)):
            self.history.append(obs)
        obs = np.array(list(self.history))
        return obs, info

Here, you can see that we are using a deque to keep a fixed-size history of the last Nhist observations. We also added the action to the observation. Now, we can train the agent as usual. Below, we can see how the training compares to the FCN. It is pretty similar, which means the LSTM does not have much of an effect here. But in more complex environments, this can be useful and worth trying.
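The sliding-window behaviour of the deque can be seen in isolation with a few made-up observations. This is only a sketch of the mechanism; the history length and the observation values below are hypothetical.

```python
from collections import deque

nhist = 3  # hypothetical history length
history = deque(maxlen=nhist)

# On reset: fill the whole window with the first observation
first_obs = (1.0, 0.0, 0.0, 0.0)  # made-up (x, y, angular velocity, previous action)
for _ in range(nhist):
    history.append(first_obs)

# On each step: appending to a full deque drops the oldest entry automatically
history.append((0.9, 0.1, 0.5, 1.0))
history.append((0.8, 0.2, 0.7, -1.0))

window = list(history)  # oldest entry first, newest last
print(window)
```

Since maxlen takes care of evicting old entries, the wrapper never has to manage indices or shift arrays by hand.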

That's all for this part. We saw how to implement a deep Q-learning algorithm from scratch and how to use any network as the Q-network, not just an FCN. We used an LSTM, but you can do the same with transformers for an even longer context history. In the next part, we will look into how to overcome the limitation of a discrete action space by implementing the REINFORCE algorithm.

Want to connect?

🌍My Website

🐦My Twitter

👨My LinkedIn

Reinforcement Learning from Scratch - Part 1 - Tabular Q Learning

Akshay Ballal — Fri, 26 Jul 2024 05:36:08 +0000

Reinforcement Learning is becoming the new trend. From controlling robots to optimizing logistics to tuning language models, reinforcement learning is the go-to strategy. However, newcomers to the field face a fragmented landscape and heavy reliance on implementation details. Even if you find a suitable RL algorithm from the many available, implementing it requires attention to fine details that aren’t part of the main RL algorithm, and getting these details right is challenging. While this is often problem-dependent, this series aims to show examples of implementation details across different RL problems and algorithms.

In this first edition, we explore the most basic form of Q-learning: Vanilla or Tabular Q-learning. You’ll see that even with the most basic algorithm, there are several tricks we can use to improve and adapt it to various problems.

This RL series is geared towards practitioners who already have basic knowledge of RL and have trained some basic RL agents. For this reason, I may not go into complete theory details and will mainly focus on finer implementation tricks. Nonetheless, let’s begin with some required basics. Also, in practice, one would use libraries like stable-baselines, but they are black boxes and do not provide a good understanding of how things work under the hood.

What is the agent’s objective in RL?

Every RL agent is trained for one simple objective: to maximize the total rewards accumulated. To do this, we need the agent to learn to take the most optimal decision in every state it can face. The equation below shows the total reward accumulated by the agent starting at time step t.
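The same quantity can also be computed numerically. The sketch below assumes the standard discounted form, in which a reward received k steps after t is weighted by γ^k (γ being the reward discounting factor used throughout this post); the function name is only illustrative.

```python
def discounted_return(rewards: list[float], gamma: float) -> float:
    """Sum of gamma**k * rewards[k], accumulated from the last reward backwards."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

# Three rewards of 1.0 with gamma = 0.5: 1 + 0.5 + 0.25
g = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
print(g)
```

With gamma close to 1 the agent values distant rewards almost as much as immediate ones, while a small gamma makes it short-sighted.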

Action-value function

The action-value function, often denoted as Q(s,a), is a fundamental concept in reinforcement learning. It represents the expected return (total accumulated reward) starting from a state s, taking an action a, and subsequently following a policy π. Formally, it can be expressed as:
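As a reference, and assuming the standard discounted-return convention (rewards indexed by the step at which the action is taken, with γ as the reward discounting factor), the return and the action-value function are commonly written as follows. This is the textbook form, given here as a sketch rather than as the exact notation of the original figure.

```latex
G_t = \sum_{k=0}^{\infty} \gamma^{k} \, r_{t+k},
\qquad
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ G_t \mid s_t = s,\; a_t = a \right]
```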

This is all the theory we need to implement our first Q-learning algorithm: Tabular Q-learning. Initially, we'll use Tabular Q-Learning to solve the Frozen Lake Environment, ensuring our algorithm setup is correct. Following this, we will tackle the Pendulum Environment. This environment is chosen for several reasons: it has a small state vector and only one action dimension. The states and actions are also continuous, meaning they can take any real value within specified limits. For Tabular Q-Learning, we require discrete states and actions. This scenario will demonstrate how we can modify the environment to fit our algorithm.

Pseudocode for Tabular Q Learning

Initialize Parameters: Set the initial values for all the parameters required for the algorithm. These include the exploration rate (ϵ), learning rate (α), reward discounting factor (γ), and the total number of steps the algorithm should run.
Initialize Q-matrix and Step Counter: The Q-matrix ($Qmat$) stores the Q-values for state-action pairs. Initially, it is empty. The step counter ($k$) keeps track of the number of steps executed.
Get first observation: The environment is reset to its initial state to start the learning process. The initial observation ($x_{k}$) is obtained from the environment.
Action Selection: At each step, the algorithm decides whether to explore or exploit. With probability 1−ϵ, it exploits by choosing the action that has the highest Q-value for the current state. With probability ϵ, it explores by choosing a random action.
Step Environment: The selected action is executed in the environment, which returns the new state ($x_{k+1}$) and the reward ($r_{k}$).
Check if Terminal State: If the new state is terminal (end of an episode), the Q-value for the state-action pair is updated using the reward received, and the environment is reset.
Non-Terminal State: If the new state is not terminal, the Q-value is updated considering the reward received and the maximum Q-value for the next state.
Decay ϵ: The exploration rate (ϵ) is decayed to balance exploration and exploitation over time.
Increment Step Counter: The step counter is incremented to proceed to the next step.
End While Loop: The process continues until the specified number of steps ($nsteps$) is reached.
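The two update cases in the steps above (terminal and non-terminal) can be sketched for a single transition as follows. This is a minimal illustration with made-up states and rewards; q_update is a hypothetical helper and not part of the implementation in this post.

```python
from collections import defaultdict

def q_update(Q, state, action, reward, next_state, n_actions, terminal, alpha=0.2, gamma=0.99):
    """Apply one tabular Q-learning update in place and return the new Q-value."""
    if terminal:
        target = reward  # no bootstrapping from a terminal state
    else:
        # Bootstrap from the best action available in the next state
        target = reward + gamma * max(Q[next_state, a] for a in range(n_actions))
    Q[state, action] += alpha * (target - Q[state, action])
    return Q[state, action]

Q = defaultdict(float)  # unseen (state, action) pairs default to 0.0
Q[1, 0] = 1.0           # pretend state 1 already has a known good action

# Non-terminal step: target = 0 + 0.99 * 1.0, so Q[0, 1] moves 20% of the way towards 0.99
new_value = q_update(Q, state=0, action=1, reward=0.0, next_state=1, n_actions=2, terminal=False)
print(new_value)
```

The learning rate alpha controls how far each update moves the stored value towards the target, which is why the value above ends up at roughly a fifth of the target after one step.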

Code Implementation for FrozenLake Environment

First, let’s import the required dependencies. Here, we use Gymnasium (imported as gym), the maintained fork of the gym library originally developed by OpenAI, which has several environments for RL.



import gymnasium as gym
import numpy as np
from collections import defaultdict
from tqdm import tqdm
from copy import deepcopy
import matplotlib.pyplot as plt
from typing import Optional

Next, we create a method for building the environment. For the FrozenLake environment, we build the environment as such without any modification, but in this method, we can wrap the environment in different modifications. We will see these modifications for the Pendulum Environment.



def get_env(
    env_name: str,
    render_mode="rgb_array",
) -> gym.Env:
    env = gym.make(env_name, render_mode=render_mode)

    return env

We then define a custom argmax function to choose the action with the maximum q value. But if multiple actions have the same q value, this function chooses uniformly between the ties. The standard np.argmax function would just pick the first action and we don’t want that.



def argmax(a):
    # random argmax
    a = np.array(a)
    return np.random.choice(np.arange(len(a), dtype=int)[a == np.max(a)])
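The effect of the tie-breaking is easiest to see on a row of identical Q-values, which is exactly what an untrained table looks like. The snippet below is a standard-library analogue of the function above (random_argmax is an illustrative name), compared against always taking the first maximum.

```python
import random
from collections import Counter

def random_argmax(values):
    """Index of the maximum, chosen uniformly at random among ties."""
    best = max(values)
    return random.choice([i for i, v in enumerate(values) if v == best])

q_values = [0.0, 0.0, 0.0, 0.0]  # e.g. an untrained Q-table row

first_index = q_values.index(max(q_values))  # always picks action 0
counts = Counter(random_argmax(q_values) for _ in range(1000))  # spread across all four actions
print(first_index, dict(counts))
```

Without this, a greedy agent with an all-zero table would keep repeating the first action, which biases early exploitation towards one arbitrary direction.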

Now we implement the main algorithm



def Qlearn(
    build_env,
    alpha=0.2,
    gamma=0.99,
    min_epsilon=0.1,
    nsteps=800000,
    Qmat=None,
    callback_freq=5000,
    callback=None,
):
    if Qmat is None:
        Qmat = defaultdict(float)

    episode_reward = 0
    mean_episodic_reward = 0
    n_episodes = 0

    epsilon = 1.0

    env: gym.Env = build_env()
    obs, info = env.reset()

    pbar = tqdm(range(nsteps), colour="green")
    for i in pbar:
        if np.random.rand() < epsilon:
            # Exploration: choose a random action
            action = env.action_space.sample()
        else:
            # Exploitation: choose the action with the highest Q-value
            action = argmax([Qmat[obs, a] for a in range(env.action_space.n)])

        next_obs, reward, terminated, truncated, info = env.step(action)
        episode_reward += reward

        if not terminated and not truncated:

            Q_next = max(
                Qmat[next_obs, action_next] for action_next in range(env.action_space.n)
            )  # Get the maximum Q-value for the next state

            # Update Q-value using Q-learning update rule
            Qmat[obs, action] += alpha * (reward + gamma * Q_next - Qmat[obs, action])

            obs = next_obs
        else:
            # Episode ends, update mean episodic reward and reset environment
            Qmat[obs, action] += alpha * (reward - Qmat[obs, action])
            n_episodes += 1
            mean_episodic_reward += (episode_reward - mean_episodic_reward) / n_episodes
            episode_reward = 0
            obs, info = env.reset()

        epsilon = max(min_epsilon, 0.99995 * epsilon)

        if i % callback_freq == 0:
            if callback is not None:
                # Execute callback function if provided
                callback(build_env, Qmat)
        pbar.set_description(
            f"Mean episodic reward: {mean_episodic_reward:.2f} | Epsilon: {epsilon:.2f} | Best reward: {best_reward:.2f}"
        )

    return Qbest

You can see here a few fine details:

We initialize ϵ at 1.0 and then decay it at every step by multiplying it by 0.99995 until it reaches the minimum epsilon. This decay rate becomes one of the hyperparameters.
We are keeping track of the mean episodic reward.
We have a callback function which we haven't talked about yet. This is the test function that we define next. It is used for a couple of reasons. First, we need to know how the greedy policy is performing because, in the end, we are going to use just the greedy policy. This is similar to a validation set in machine learning. Second, it saves the best-performing policy every time the test is performed. This accounts for the fact that RL algorithms suffer from catastrophic forgetting: the agent learns a good policy, but then an update suddenly puts it on the wrong track, and it forgets what it has learned and has to relearn it. This is quite common with many RL algorithms, and Q-learning is one of them. That's why we save the best policy, so that if the agent is in a forgetting state at the end of training, we can still retrieve the best policy. Here is the test code.



def test(build_env, Qmat, test_steps=1000):
    global Qbest, best_reward

    env: gym.Env = build_env()
    n_episodes = 0
    tot_rewards = 0

    obs, info = env.reset()
    for _ in range(test_steps):
        # Choose the action with the highest Q-value
        action = argmax([Qmat[obs, i] for i in range(env.action_space.n)])
        next_obs, reward, terminated, truncated, info = env.step(action)

        tot_rewards += reward
        obs = next_obs

        if terminated or truncated:
            n_episodes += 1
            obs, info = env.reset()

    if best_reward < tot_rewards / n_episodes:
        # Update the best reward and Q-best if a better reward is achieved
        best_reward = tot_rewards / n_episodes
        Qbest = deepcopy(Qmat)

    return tot_rewards / n_episodes

You can see that we check whether the mean episodic reward during testing is better than the best reward so far; if it is, we update the best reward and save a copy of the current Qmat as Qbest.

Next, we just run the algorithm, first initializing Qbest as None and best_reward as -np.inf.



Qbest = None
best_reward = -np.inf
build_env = lambda: get_env("FrozenLake-v1")

Qbest = Qlearn(build_env, alpha=0.1, callback=test, nsteps=100000)

After training, we can just test the agent using this code



env = get_env("FrozenLake-v1", render_mode="human") # Slippery Frozen Lake is Default
obs, info = env.reset()
n_episodes = 3

for i in range(n_episodes):
    terminated = False
    truncated = False
    obs, info = env.reset()
    while not terminated and not truncated:
        action = argmax([Qbest[obs, i] for i in range(env.action_space.n)])
        obs, reward, terminated, truncated, info = env.step(action)

A new window opens, and we can see our little guy walking around. You will see that even though the ground is slippery the agent takes "safe" actions to both avoid the holes and still reach the goal.

Code Implementation for the Pendulum Environment

Alright, now that we know that our algorithm works for the Frozen Lake Environment, we are ready to switch to a slightly more complex environment. For the Pendulum Environment, we need to make a few modifications because our algorithm supports only discrete observation and action spaces, whereas the Pendulum environment has continuous observation and action spaces. So, we need to discretize the environment's spaces. We will be creating a wrapper for this. But first, let’s see what this environment contains.

So for the action space, we need to discretize the actions between -2 and 2. We can just choose a list of actions like [-2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2] , but this is manual and may not scale very well to other environments. Also, we need more actions that are smaller, because we need to have more fine control once the pendulum learns to swing up. Swinging up can be done easily using some large actions. For this, we can create a Gaussian kernel. A Gaussian kernel looks like this:

y = A \cdot e^{-\frac{(x - \mu)^{2}}{2 \sigma^{2}}}

Here, A is the maximum value of the action. In our case, we have μ = 0. You can see that changing the sigma changes the way our actions are distributed. Thus σ becomes one of the hyperparameters along with the number of actions available. Let’s define a function to create the Gaussian kernel.



def gaussian(x, mu, sigma, max_amp):
    return max_amp * np.exp(-0.5 * ((x - mu) / sigma)**2)

The above plot is created using this code:



x = np.linspace(-1, 1, 20)

# Plot the scaled Gaussian kernel centered at 0 for several standard deviations
mu = 0
plt.plot(x, gaussian(x, mu, sigma=0.2, max_amp=2), label='Sigma = 0.2')
plt.plot(x, gaussian(x, mu, sigma=0.4, max_amp=2), label='Sigma = 0.4')
plt.plot(x, gaussian(x, mu, sigma=0.6, max_amp=2), label='Sigma = 0.6')

plt.xlabel('x')
plt.ylabel('Amplitude')
plt.title('1D Scaled Gaussian Kernel')
plt.legend()
plt.grid(True)
plt.show()
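To make the resulting action set concrete, the sketch below builds a symmetric list of torques from the kernel: the negated kernel, zero, and the kernel reversed. It is a standard-library analogue of the NumPy code in this post; build_actions, n_side, and the example value σ = 0.5 are illustrative assumptions.

```python
import math

def gaussian(x: float, mu: float, sigma: float, max_amp: float) -> float:
    return max_amp * math.exp(-0.5 * ((x - mu) / sigma) ** 2)

def build_actions(n_side: int = 5, sigma: float = 0.5, max_amp: float = 2.0) -> list[float]:
    """Symmetric action set: negated kernel, zero, then the kernel reversed."""
    xs = [i / (n_side - 1) for i in range(n_side)]           # evenly spaced points in [0, 1]
    kernel = [gaussian(x, 0.0, sigma, max_amp) for x in xs]  # decreases from max_amp
    return [-k for k in kernel] + [0.0] + kernel[::-1]

actions = build_actions()
print([round(a, 3) for a in actions])
```

Printing the list for a few values of sigma is a quick way to check how the spacing between the available torques changes before committing to a setting.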

We are going to use $\sigma$ = 0.5 for the rest of the code. To discretize the state we use the formula,

x_{discrete} = \frac{x - x_{min}}{x_{max} - x_{min}} \cdot nvec_{s}

Where nvec_s is the number of discrete states that we want per dimension. For this implementation, we are going to use [10, 10, 10], which means that the x value, the y value, and the angular velocity of the pendulum are each discretized into 10 values. For example, the continuous state [-1, -1, -8] maps to the discrete state [0, 0, 0], and the continuous state [1, 1, 8] maps to [10, 10, 10]. Here is the code to do all this. The wrapper inherits from gym.Wrapper.



class PendulumDiscreteStateAction(gym.Wrapper):

    def __init__(self, env: gym.Env, nvec_s: list[int], nvec_u: int, sigma: float):

        super(PendulumDiscreteStateAction, self).__init__(env)

        self.env = env
        self.nvec_s = nvec_s
        self.nvec_u = nvec_u

        # Check if the observation space is of type Box
        assert isinstance(
            env.observation_space, gym.spaces.Box
        ), "Error: observation space is not of type Box"

        # Check if the length of nvec_s matches the shape of the observation space
        assert (
            len(nvec_s) == env.observation_space.shape[0]
        ), "Error: nvec_s does not match the shape of the observation space"

        # Create a MultiDiscrete observation space and action space
        self.observation_space = gym.spaces.MultiDiscrete(nvec_s)
        self.action_space = gym.spaces.Discrete(nvec_u)

        # Define the possible actions

        kernel = gaussian(np.linspace(0, 1, 5), 0, sigma, 2)
        self.actions = (-kernel).tolist() + [0] + np.flip(kernel).tolist()
        # self.actions = [-2.0, -1.0, -0.5, -0.25, -0.15, 0.0, 0.15, 0.25, 0.5, 1.0, 2.0]

    def _discretize_observation(self, obs: np.ndarray[int | float]) -> np.ndarray[int]:

        # Discretize each dimension of the observation
        for i in range(len(obs)):
            obs[i] = int(
            (obs[i] - self.env.observation_space.low[i])
            / (
                self.env.observation_space.high[i]
                - self.env.observation_space.low[i]
            )
            * self.nvec_s[i]
            ) # equation: (x - min) / (max - min) * nvec
        return obs.astype(int)

    def step(self, action: int) -> tuple[np.ndarray[int], float, bool, dict]:

        action = self.actions[action]
        obs, reward, terminated, truncated, info = self.env.step([action])
        obs = self._discretize_observation(obs)
        return tuple(obs), reward, terminated, truncated, info

    def reset(self) -> tuple[np.ndarray[int], dict]:

        obs, info = self.env.reset()
        obs = self._discretize_observation(obs)
        return tuple(obs), info
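The discretization formula used in _discretize_observation can be checked by hand on a single dimension. The following sketch applies the same expression to a value range of [-1, 1] with 10 bins; the standalone discretize function is only for illustration.

```python
def discretize(x: float, x_min: float, x_max: float, n_bins: int) -> int:
    """Map a continuous value to a bin index via (x - min) / (max - min) * n_bins."""
    return int((x - x_min) / (x_max - x_min) * n_bins)

lowest = discretize(-1.0, -1.0, 1.0, 10)   # the lower bound lands in bin 0
middle = discretize(0.0, -1.0, 1.0, 10)    # the midpoint lands in bin 5
highest = discretize(1.0, -1.0, 1.0, 10)   # the exact upper bound lands in bin 10
print(lowest, middle, highest)
```

Note that a value sitting exactly on the upper bound gets index n_bins rather than n_bins - 1. With a dictionary-backed Q-table this extra index is harmless, but if an array-backed table were used instead, one would probably want to clamp the result to n_bins - 1.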

Now we can modify our method to build the environment to include this wrapper.



def get_env(
    env_name: str,
    nvec_s: Optional[list[int]] = None,
    nvec_u: Optional[int] =None,
    time_limit: Optional[int] = None,
    sigma: Optional[float] =  None,
    render_mode="rgb_array",
) -> gym.Env:
    env = gym.make(env_name, render_mode=render_mode)

    if nvec_s is not None and nvec_u is not None:
        env = PendulumDiscreteStateAction(env, nvec_s, nvec_u, sigma=sigma)

    if time_limit is not None:
        env = gym.wrappers.TimeLimit(env, max_episode_steps=time_limit)

    return env

We can also provide a different time limit to end the episode within a certain number of steps. Now, we can just call the Qlearn method again to train the agent on the new environment.



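Qbest = None
best_reward = -np.inf  # Reset so the FrozenLake best score does not mask Pendulum's negative rewards
build_env = lambda: get_env("Pendulum-v1", [10, 10, 10], 11, 200, sigma=0.5)
Qbest = Qlearn(build_env, alpha=0.2, callback=test, nsteps=1000000)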

As you can see, the agent learns to swing up pretty well, but keeping the pendulum centered at the top is quite hard. This shows that Tabular Q Learning is not the best method for complex environments. Discretizing each of the three state dimensions into just 10 bins already yields 10 x 10 x 10 = 1,000 discrete states, and this number grows rapidly with more state dimensions and finer discretization. Thus we need a better method. In the next part, we will look into Deep Q Learning, which uses a neural network to map the states to the Q values.
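The growth of the table is simple arithmetic, sketched below for the setting used here and for a hypothetical finer grid. The function name and the 50-bin grid are assumptions for illustration only.

```python
import math

def q_table_size(nvec_s: list[int], n_actions: int) -> int:
    """Upper bound on the number of (state, action) entries in the Q-table."""
    return math.prod(nvec_s) * n_actions

coarse = q_table_size([10, 10, 10], 11)  # 1,000 states x 11 actions
finer = q_table_size([50, 50, 50], 11)   # hypothetical finer grid: 125,000 states x 11 actions
print(coarse, finer)
```

Every one of these entries has to be visited several times before its value is reliable, so a finer grid costs far more than just memory.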

You can find the code for this part at the following GitHub Repo:

akshayballal95/rl-blog-series at tabular (github.com)


Supercharge your Tests with CodiumAI Cover-Agent

Akshay Ballal — Mon, 20 May 2024 18:02:13 +0000

Writing great tests is a skill. Achieving comprehensive test coverage across a codebase is even more challenging. In this article, I will introduce a new tool to boost your test coverage and ensure that all your functions are tested thoroughly. Cover-Agent, an open-source tool by CodiumAI, leverages AI to generate tests aimed at increasing test coverage. Previously, I wrote about another tool from Codium, the PR-Agent, which you can read here.

Understanding Test Coverage

Before diving into CodiumAI Cover-Agent, let’s discuss what test coverage entails, particularly code coverage. Code coverage measures how much of the code is executed by tests. Key aspects of code coverage include:

Function Coverage: The percentage of functions or methods called by the test suite.
Statement Coverage: The percentage of executable statements run by the test suite.
Branch Coverage: The percentage of branches (like if-else and switch-case) executed.
Condition Coverage: The percentage of Boolean expressions evaluated as both true and false.

Benefits of Test Coverage

Identifying Gaps in Testing: Helps developers spot untested parts of the code.
Improving Software Quality: Comprehensive tests catch bugs and issues early.
Maintaining Code Reliability: Regularly measuring test coverage maintains and enhances code reliability.

However, high test coverage alone does not guarantee high-quality tests. It is possible to have high coverage with tests that do not thoroughly validate the code's correctness. Therefore, while aiming for high test coverage, it’s crucial to ensure that the tests are meaningful and well-designed. Codium Cover-Agent addresses this by not only aiming for coverage but also generating quality test cases that test edge cases, invalid inputs, and errors.

A Simple Example with CodiumAI Cover-Agent

To illustrate how CodiumAI Cover-Agent works, let’s start with a basic example. We will create a simple calculator.py file with functions for addition, subtraction, multiplication, and division.

# calculator.py
def add(a, b):
    return a + b

def subtract(a, b):
    return a - b

def multiply(a, b):
    return a * b

def divide(a, b):
    if b == 0:
        raise ValueError("Cannot divide by zero")
    return a / b

Next, we write a test file test_calculator.py and place it in the tests folder.

# tests/test_calculator.py
from calculator import add, subtract, multiply, divide

class TestCalculator:

    def test_add(self):
        assert add(2, 3) == 5

To see the test coverage, we need to install pytest-cov, a pytest extension for coverage reporting.

pip install pytest-cov

Run the coverage analysis with:

pytest --cov=calculator

The output shows:

Name            Stmts   Miss  Cover
-----------------------------------
calculator.py      10      5    50%
-----------------------------------
TOTAL              10      5    50%

This indicates that 5 of the 10 statements in calculator.py are not executed, resulting in 50% code coverage: the four def lines and the body of add run, while the bodies of subtract, multiply, and divide do not. For simple examples like this, adding more tests manually is straightforward. But let's see how Codium Cover-Agent can enhance this process.

Setting Up CodiumAI Cover-Agent

To set up Codium Cover-Agent, follow these steps:

Install Cover-Agent:

pip install git+https://github.com/Codium-ai/cover-agent.git

Ensure OPENAI_API_KEY is set in your environment variables, as it is required for the OpenAI API.

Create the command to start generating tests:

cover-agent \
--source-file-path "calculator.py" \
--test-file-path "tests/test_calculator.py" \
--code-coverage-report-path "coverage.xml" \
--test-command "pytest --cov=. --cov-report=xml --cov-report=term" \
--test-command-dir "./" \
--coverage-type "cobertura" \
--desired-coverage 80 \
--max-iterations 3 \
--openai-model "gpt-4o" \
--additional-instructions "Since I am using a test class, each line of code (including the first line) needs to be prepended with 4 whitespaces. This is extremely important to ensure that every line returned contains that 4 whitespace indent; otherwise, my code will not run."

Command Arguments Explained

source-file-path: Path of the file containing the functions for which tests need to be generated.
test-file-path: Path of the file where the tests will be written by the agent. It’s best to create a skeleton of this file with at least one test and the necessary import statements.
code-coverage-report-path: Path where the code coverage report is saved.
test-command: Command to run the tests (e.g., pytest).
test-command-dir: Directory where the test command should run. Set this to the root or the location of your main file to avoid issues with relative imports.
coverage-type: Type of coverage to use. Cobertura is a good default.
desired-coverage: Coverage goal. Higher is better, though 100% is often impractical.
max-iterations: Number of times the agent should retry to generate test code. More iterations may lead to higher OpenAI token usage.
additional-instructions: Prompts to ensure the code is written in a specific way. For example, here we specify that the code should be formatted to work within a test class.
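If you prefer to launch the agent from a script, the same arguments can be assembled as a list. This is only a sketch: build_cover_agent_command is a hypothetical helper, and the flags are copied from the command shown above rather than taken from the tool's reference documentation.

```python
import shlex


def build_cover_agent_command(source_file, test_file, desired_coverage=80, max_iterations=3):
    """Assemble the cover-agent invocation as an argument list."""
    return [
        "cover-agent",
        "--source-file-path", source_file,
        "--test-file-path", test_file,
        "--code-coverage-report-path", "coverage.xml",
        "--test-command", "pytest --cov=. --cov-report=xml --cov-report=term",
        "--test-command-dir", "./",
        "--coverage-type", "cobertura",
        "--desired-coverage", str(desired_coverage),
        "--max-iterations", str(max_iterations),
        "--openai-model", "gpt-4o",
    ]


command = build_cover_agent_command("calculator.py", "tests/test_calculator.py")
print(shlex.join(command))

# To actually run it (requires cover-agent installed and OPENAI_API_KEY set):
# import subprocess
# subprocess.run(command, check=True)
```

Passing a list to subprocess.run avoids shell quoting problems with the long --test-command string.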

On running the command, the agent starts writing and iterating on the tests.

This is the code it generates:

import pytest
from calculator import add, subtract, multiply, divide

class TestCalculator:

    def test_add(self):
        assert add(2, 3) == 5

    def test_subtract(self):
        """
        Test subtracting two numbers.
        """
        assert subtract(5, 3) == 2
        assert subtract(3, 5) == -2

    def test_multiply(self):
        """
        Test multiplying two numbers.
        """
        assert multiply(2, 3) == 6
        assert multiply(-2, 3) == -6
        assert multiply(2, -3) == -6
        assert multiply(-2, -3) == 6

    def test_divide(self):
        """
        Test dividing two numbers.
        """
        assert divide(6, 3) == 2
        assert divide(-6, 3) == -2
        assert divide(6, -3) == -2
        assert divide(-6, -3) == 2

    def test_divide_by_zero(self):
        """
        Test dividing by zero, should raise ValueError.
        """
        with pytest.raises(ValueError, match="Cannot divide by zero"):
            divide(5, 0)

Notice that the agent also wrote tests for error handling and edge cases, such as division by zero.

We can now check the coverage again:

pytest --cov=calculator

Name            Stmts   Miss  Cover
-----------------------------------
calculator.py      10      0   100%
-----------------------------------
TOTAL              10      0   100%

In this simple example we reached 100% test coverage.

Trying Cover-Agent in a real Codebase

The previous example was simple, and even a beginner can completely cover the codebase. Let’s see how the agent works for a more real-world application. I have a codebase with more than 15 files with approximately 1000 lines of code. This is the coverage of a certain module of the application.

Name                                      Stmts   Miss  Cover
-------------------------------------------------------------
maintenance\__init__.py                       0      0   100%
maintenance\events\degradation_event.py       6      2    67%
maintenance\events\event.py                  22      8    64%
maintenance\events\reach_machine.py           4      1    75%
maintenance\events\repaired_event.py          4      1    75%
maintenance\future_event_set.py              24     14    42%
maintenance\models\__init__.py                0      0   100%
maintenance\models\action.py                 12      0   100%
maintenance\models\engineer.py               15      5    67%
maintenance\models\factory.py                63     36    43%
maintenance\models\machine.py                79      1    99%
maintenance\models\state.py                  14      0   100%
maintenance\policies\policy.py               12      1    92%
maintenance\policies\reactive_policy.py      27      8    70%
maintenance\simResults.py                    65     65     0%
maintenance\simulation.py                    43     43     0%
maintenance\tests\__init__.py                 0      0   100%
maintenance\tests\conftest.py                25      0   100%
maintenance\tests\factory_test.py            12      0   100%
maintenance\tests\machine_test.py            63      0   100%
-------------------------------------------------------------
TOTAL                                       490    185    62%

The goal is to increase the coverage by writing tests for the factory.py file. Let’s do it.

As before, we create a skeleton for the test file. I have written one simple test here.

import pytest
from maintenance.models.factory import Factory
from maintenance.tests.conftest import factory
from maintenance.models.action import Action, ActionType
from maintenance.policies.reactive_policy import ReactivePolicy
from maintenance.events.event import Event, EventType
import numpy as np

class TestFactory:

    def test_execute_action(self,factory:Factory):
        factory.execute_action(factory.engineers[0], 0, factory.policy.determine_action(factory.state)[0])
        assert factory.state.fse_available[0] == True
        assert factory.state.fse_location[0] == 1

Then we run the following command:

cover-agent \
 --source-file-path "maintenance/models/factory.py" \
 --test-file-path "maintenance/tests/factory_test.py" \
 --code-coverage-report-path "coverage.xml" \
 --test-command "pytest --cov=maintenance --cov-report=xml --cov-report=term" \
 --test-command-dir "./" \
 --coverage-type "cobertura" \
 --desired-coverage 80 \
 --max-iterations 3 \
 --openai-model "gpt-4o" \
 --additional-instructions "Since I am using a test class, each line of code (including the first line) needs to be prepended with 4 whitespaces. This is extremely important to ensure that every line returned contains that 4 whitespace indent; otherwise, my code will not run."

The agent starts writing the code. This takes about a minute to finish. Once the code is generated, the updated coverage report can now be checked.

Name                                      Stmts   Miss  Cover
-------------------------------------------------------------
maintenance\__init__.py                       0      0   100%
maintenance\events\degradation_event.py       6      0   100%
maintenance\events\event.py                  22      3    86%
maintenance\events\reach_machine.py           4      1    75%
maintenance\events\repaired_event.py          4      0   100%
maintenance\future_event_set.py              24     13    46%
maintenance\models\__init__.py                0      0   100%
maintenance\models\action.py                 12      0   100%
maintenance\models\engineer.py               15      3    80%
maintenance\models\factory.py                63     10    84%
maintenance\models\machine.py                79      0   100%
maintenance\models\state.py                  14      0   100%
maintenance\policies\policy.py               12      1    92%
maintenance\policies\reactive_policy.py      27      2    93%
maintenance\simResults.py                    65     65     0%
maintenance\simulation.py                    43     43     0%
maintenance\tests\__init__.py                 0      0   100%
maintenance\tests\conftest.py                25      0   100%
maintenance\tests\factory_test.py            45      0   100%
maintenance\tests\machine_test.py            63      0   100%
-------------------------------------------------------------
TOTAL                                       523    141    73%

We have increased the total coverage for the module by 11 percentage points (from 62% to 73%) and the coverage of factory.py from 43% to 84%. This is amazing, given almost no effort from our side. More edge cases can still be tested; we can add them manually or instruct the agent to handle specific ones. This is the test code written by the agent:

import pytest
from maintenance.models.factory import Factory
from maintenance.tests.conftest import factory
from maintenance.models.action import Action, ActionType
from maintenance.policies.reactive_policy import ReactivePolicy
from maintenance.events.event import Event, EventType
from maintenance.models.engineer import Engineer
from maintenance.models.machine import Machine
import numpy as np

class TestFactory:

    def test_execute_action(self,factory:Factory):
        factory.execute_action(factory.engineers[0], 0, factory.policy.determine_action(factory.state)[0])
        assert factory.state.fse_available[0] == True
        assert factory.state.fse_location[0] == 1

    def test_machine_from_id_invalid_id(self, factory: Factory):
        """
        Test the machine_from_id method with an invalid machine ID.
        Ensures that the method returns None for an invalid machine ID.
        """
        machine = factory.machine_from_id(-1)  # Invalid machine ID
        assert machine is None

    def test_process_event_degradation_machine_under_repair(self, factory: Factory):
        """
        Test the process_event method for the DEGRADATION event type when the machine is under repair.
        Ensures that the degradation event is ignored.
        """
        machine = factory.machines[0]
        machine.under_repair = True
        event = Event(EventType.DEGRADATION, machine.id, 0, intensity=1)
        factory.process_event(event)
        assert machine.degradation == 0  # Degradation should not change

    def test_process_event_degradation_machine_failed(self, factory: Factory):
        """
        Test the process_event method for the DEGRADATION event type when the machine has failed.
        Ensures that the degradation event is ignored.
        """
        machine = factory.machines[0]
        machine.degradation = machine.failure_theshold
        event = Event(EventType.DEGRADATION, machine.id, 0, intensity=1)
        factory.process_event(event)
        assert machine.degradation == machine.failure_theshold  # Degradation should not change

    def test_process_event_degradation_threshold_reached(self, factory: Factory):
        """
        Test the process_event method for the DEGRADATION event type when the degradation reaches the failure threshold.
        Ensures that the machine handles failure correctly.
        """
        machine = factory.machines[0]
        event = Event(EventType.DEGRADATION, machine.id, 0, intensity=machine.failure_theshold)
        factory.process_event(event)
        assert machine.degradation == machine.failure_theshold
        assert machine.has_failed()

    def test_process_event_repaired_machine_not_under_repair(self, factory: Factory):
        """
        Test the process_event method for the REPAIRED event type when the machine is not under repair.
        Ensures that the method handles this scenario gracefully.
        """
        machine = factory.machines[0]
        machine.under_repair = False
        event = Event(EventType.REPAIRED, machine.id, 0)
        factory.process_event(event)
        assert not machine.under_repair  # Machine should still not be under repair

    def test_get_results_zero_sim_time(self, factory: Factory):
        """
        Test the get_results method with zero simulation time.
        Ensures that the method handles division by zero gracefully.
        """
        sim_time = 0
        with pytest.raises(ZeroDivisionError):
            factory.get_results(sim_time)

Conclusion

CodiumAI Cover-Agent simplifies achieving high test coverage and generating meaningful, high-quality tests. By using AI to identify and address gaps in testing, it ensures robust software development practices. Try CodiumAI Cover-Agent on your projects to see the difference it can make in your testing workflow.

You can check out the open-source repository for the cover agent at https://github.com/Codium-ai/cover-agent.

Want to connect?

🌍My Website

🐦My Twitter

👨My LinkedIn

Introducing EmbedAnything

Akshay Ballal — Thu, 18 Apr 2024 10:01:21 +0000

Motivation

Transformer models have become increasingly popular in recent times. One crucial requirement for all transformer models is a latent representation of the input data in the form of embeddings. Word embeddings are used in language models, while vision models rely on patch embeddings. However, there is currently no lightweight solution for extracting embeddings and custom metadata from various file formats. LangChain offers some options, but it is a bulky package, extracting only the embedding data is not easy, and it is not very suitable for vision-related tasks. Embeddings are useful not only for language models but also for models trained on other tasks, such as semantic segmentation and object detection.

This is where EmbedAnything comes in. It is a lightweight library that allows you to generate embeddings from different file formats and modalities. Currently, EmbedAnything supports PDFs and images, with many more formats in the pipeline. The idea is to provide an end-to-end solution where you can give the file and get the embeddings with the appropriate metadata.

Development of EmbedAnything started with these goals in mind:

Compatibility with Local and Cloud Models: Seamless integration with local and cloud-based embedding models.
High-Speed Performance: Fast processing to meet demanding application requirements.
Multimodal Capability: Flexibility to handle various modalities.
CPU and GPU Compatibility: Performance optimization for both CPU and GPU environments.
Lightweight Design: Minimized footprint for efficient resource utilization.

In this blog, we will see how we achieve these goals and what more must be done to improve EmbedAnything. We will also see why EmbedAnything is packaged the way it is with Rust as a backend and a Python interface.

Keeping it Local

While cloud-based embedding services like OpenAI, Jina, and Mistral offer convenience, many users require the flexibility and control of local embedding models. Here's why local models are crucial for some use cases:

Cost-Effectiveness: Cloud services often charge per API call or model usage. Running embeddings locally on your own hardware can significantly reduce costs, especially for projects with frequent or high-volume embedding needs.
Data Privacy: Certain data, like medical records or financial documents, might be too sensitive to upload to the cloud. Local embedding keeps your data confidential and under your control.
Offline Functionality: An internet connection isn't always guaranteed. Local models ensure your embedding tasks can run uninterrupted even without an internet connection.

Performance

EmbedAnything is built with Rust. This makes it faster and provides type safety and a much better development experience. But why is speed so crucial in this process?

The Need for Speed

Creating embeddings from files involves two steps that demand significant computational power:

Extracting Text from Files, Especially PDFs: Text can exist in different formats such as markdown, PDFs, and Word documents. However, extracting text from PDFs can be challenging and often causes slowdowns. It is especially difficult to extract text in manageable batches as embedding models have a context limit. Breaking the text into paragraphs containing focused information can help.
Inferencing on the Transformer Embedding Model: The transformer model is usually at the core of the embedding process, but it is known for being computationally expensive. To address this, EmbedAnything utilizes the Candle Framework by Hugging Face, a machine-learning framework built entirely in Rust for optimized performance.
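The idea of breaking text into paragraph-sized pieces that fit a context limit can be sketched as follows. This is an illustration only, not EmbedAnything's actual implementation: chunk_paragraphs and max_chars are hypothetical, and characters are used as a rough stand-in for model tokens.

```python
def chunk_paragraphs(text, max_chars=500):
    """Group blank-line-separated paragraphs into chunks of roughly max_chars.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars or not current:
            # Still fits (or this is the first paragraph of a new chunk)
            current = candidate
        else:
            # Adding this paragraph would overflow: close the chunk and start a new one
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


text = "A" * 300 + "\n\n" + "B" * 300 + "\n\n" + "C" * 100
chunks = chunk_paragraphs(text, max_chars=500)
print([len(chunk) for chunk in chunks])  # prints: [300, 402]
```

Each chunk can then be embedded separately, keeping every request within the model's limit.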

The Benefit of Rust for Speed

By using Rust for its core functionalities, EmbedAnything offers significant speed advantages:

Rust is Compiled: Unlike Python, Rust compiles directly to machine code, resulting in faster execution.
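Memory Management: Rust enforces memory safety at compile time through its ownership rules, preventing many of the memory errors and crashes that can plague other languages.
True Multithreading: Rust threads run in parallel without a global interpreter lock, unlike standard Python.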

What does Candle bring to the table?

Running language models or embedding models locally can be difficult, especially when you want to deploy a product that utilizes these models. If you use the transformers library from Hugging Face in Python, you will depend on PyTorch for tensor operations. This, in turn, has a dependency on Libtorch, which means that you will need to include the entire Libtorch library with your product. Candle, being written in Rust, avoids this dependency, which helps keep the deployed product lightweight. Candle also allows inference on CUDA-enabled GPUs right out of the box. We will soon post on how we use Candle to increase the performance and decrease the memory usage of EmbedAnything.

Multimodality

Finally, let's see how EmbedAnything handles multimodality. When a directory is passed for embedding to EmbedAnything, the file extension is checked to see if it is text or image and a suitable embedding model is used to generate the embeddings.
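The extension check described above can be pictured with a short sketch. The function names and the extension sets here are assumptions made for illustration; they are not EmbedAnything's API.

```python
from pathlib import Path

# Hypothetical extension sets; the library's real list may differ
TEXT_EXTENSIONS = {".pdf", ".txt", ".md"}
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def pick_modality(path):
    """Return 'text', 'image', or None based on the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix in TEXT_EXTENSIONS:
        return "text"
    if suffix in IMAGE_EXTENSIONS:
        return "image"
    return None


def group_by_modality(paths):
    """Split file paths into the groups that each embedding model should handle."""
    groups = {"text": [], "image": []}
    for path in paths:
        modality = pick_modality(path)
        if modality is not None:
            groups[modality].append(path)
    return groups


print(group_by_modality(["report.pdf", "cat.PNG", "setup.exe"]))
# prints: {'text': ['report.pdf'], 'image': ['cat.PNG']}
```

Each group would then be handed to the embedding model suited to that modality, and unsupported files are skipped.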

Check out an example of an image search on this Google Colab Notebook:

You can start using EmbedAnything with

pip install embed_anything

You can view and contribute to the project on GitHub at:
EmbedAnything

my website: http://www.akshaymakes.com/

linkedin: https://www.linkedin.com/in/akshay-ballal/

twitter: https://twitter.com/akshayballal95/

Building an AI Game Bot 🤖 Using Imitation Learning and 3D Convolution ResNet

Akshay Ballal — Tue, 02 Jan 2024 19:40:33 +0000

Introduction

Hey guys, this article is going to be fun, and I am sure you will get to learn a lot about using AI in a fun way. Even though this article is about making an AI engine to play a game, the learnings can be used in more serious applications because the process is more or less the same. The good thing about learning to build AI with games, as seen in my previous article, is that you get to experiment a lot without consequences, and getting the data is cheap. Normally, for games, one would use reinforcement learning to train an agent. This can be done but requires a lot more effort to set up in regards to building your own simulator. But with this one, we are going to build an AI agent that plays the Google Snake game. You don’t have to build the game yourself.

Broadly, this is what we are going to do. We are going to use the technique of imitation learning, where the agent learns from the human counterpart about how to make decisions. This means that we as humans are first going to play the game for a while and collect the data. And then, the AI uses this data to train itself on how to play the game and pick up patterns from the human counterpart. This is known as imitation learning (not to be confused with transfer learning).

Getting into the technicalities, we are going to train a 3D convolution ResNet model. This is because we want to capture motion to know which direction the snake is heading. For this reason, we are going to feed our model 4 frames of the game at a time to infer motion. You could try just to use one frame and use a standard convolution model like EfficientNet, but it is not going to work that well without the motion information.
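Feeding the model four frames at a time means keeping a sliding window over the most recent frames. A minimal sketch of that window, using integers as stand-ins for captured images (FrameStack is a hypothetical helper, not code from this project):

```python
from collections import deque

STACK_SIZE = 4


class FrameStack:
    """Keeps the most recent STACK_SIZE frames in chronological order."""

    def __init__(self, size=STACK_SIZE):
        self.frames = deque(maxlen=size)  # the oldest frame is dropped automatically

    def push(self, frame):
        self.frames.append(frame)

    def ready(self):
        """True once enough frames have been collected to infer motion."""
        return len(self.frames) == self.frames.maxlen

    def as_list(self):
        return list(self.frames)


stack = FrameStack()
for frame_id in range(6):  # stand-ins for six captured frames
    stack.push(frame_id)

print(stack.ready(), stack.as_list())  # prints: True [2, 3, 4, 5]
```

In real use each entry would be an image tensor, and the four entries would be stacked along a time dimension before being passed to the 3D convolution model.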

Without further ado, let’s get started. First, we will see how to collect the data.

Data Collection

There are two methods to gather data. The first involves using a Python script to capture the game screen and label the images with the keyboard command. The second method, which we will be using, relies on selenium, a Python tool that automates browser navigation and control. This is the code to use selenium and save the screen captures as images with the appropriate labels.

We can create a python script capture.py to collect the screenshots.

# Import the required modules
import base64
import io
import cv2
from PIL import Image
import numpy as np
import keyboard
import os
from datetime import datetime
from selenium import webdriver

from selenium.webdriver.common.by import By

# Check if the captures folder exists and delete any existing files in it
isExist = os.path.exists("captures")

if isExist:
    dir = "captures"
    for f in os.listdir(dir):
        os.remove(os.path.join(dir, f))

else:
    os.mkdir("captures")

current_key = "1"
buffer = []

# Define a function to record the keyboard inputs
def keyboardCallBack(key: keyboard.KeyboardEvent):
    global current_key

    # If a key is pressed and not in the buffer, add it to the buffer
    if key.event_type == "down" and key.name not in buffer:
        buffer.append(key.name)

    # If a key is released, remove it from the buffer (if it was recorded)
    if key.event_type == "up" and key.name in buffer:
        buffer.remove(key.name)

    # Sort the buffer and join the keys with spaces
    buffer.sort()
    current_key = " ".join(buffer)

# Hook the function to the keyboard events
keyboard.hook(callback=keyboardCallBack)

# Create a webdriver instance using Firefox
driver = webdriver.Firefox()
# Navigate to the Google Snake game website
driver.get("https://www.google.com/fbx?fbx=snake_arcade")

# Loop indefinitely
while True:
    # Find the canvas element on the webpage
    canvas = driver.find_element(By.CSS_SELECTOR, "canvas")

    # Get the base64 encoded image data from the canvas,
    # stripping the 22-character "data:image/png;base64," prefix
    canvas_base64 = driver.execute_script(
        "return arguments[0].toDataURL('image/png').substring(22);", canvas
    )
    # Decode the base64 data to get the PNG image
    canvas_png = base64.b64decode(canvas_base64)

    # Convert the PNG image to a grayscale numpy array
    image = cv2.cvtColor(
        np.array(Image.open(io.BytesIO(canvas_png))), cv2.COLOR_BGR2GRAY
    )

    # Save the image to the captures folder with the current timestamp and keyboard inputs as the file name
    if len(buffer) != 0:
        cv2.imwrite(
            "captures/"
            + str(datetime.now()).replace("-", "_").replace(":", "_").replace(" ", "_")
            + " "
            + current_key
            + ".png",
            image,
        )
    else:
        cv2.imwrite(
            "captures/"
            + str(datetime.now()).replace("-", "_").replace(":", "_").replace(" ", "_")
            + " n"
            + ".png",
            image,
        )

Once you run this script, you can start playing the game. In the background, the script will keep saving screenshots of the game screen and name the images with a unique timestamp and the key being pressed right now. When no key is pressed, it is marked as n.
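The naming scheme is easier to see in isolation. The sketch below rebuilds a file name the same way the script does and then recovers the key from it. capture_name and key_from_name are hypothetical helpers written for this illustration.

```python
from datetime import datetime


def capture_name(timestamp, keys):
    """Build '<timestamp> <keys>.png', using 'n' when no key is held."""
    stamp = str(timestamp).replace("-", "_").replace(":", "_").replace(" ", "_")
    label = " ".join(sorted(keys)) if keys else "n"
    return f"{stamp} {label}.png"


def key_from_name(file_name):
    """Recover the key label: the text after the last space, without the extension."""
    return file_name.rsplit(".", 1)[0].rsplit(" ", 1)[1]


name = capture_name(datetime(2024, 1, 2, 19, 40, 33, 123456), ["left"])
print(name)                 # prints: 2024_01_02_19_40_33.123456 left.png
print(key_from_name(name))  # prints: left
```

Note that when two keys are held at once (for example "left up"), splitting on the last space recovers only the final key.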

This is how the directory will look after the images are saved.

Convert Image Folder to CSV File

Now we can convert these images into a csv file with the file names and the corresponding actions. We do this in a new file process.ipynb.

import csv
import os

labels = []
capture_dir = 'captures'
file_path = "data/labels_snake.csv"

# Create the output folder if it does not exist yet
os.makedirs('data', exist_ok=True)

# Map the key stored in each file name to a class index
key_to_class = {"n": 0, "left": 1, "up": 2, "right": 3, "down": 4}

# File names start with a timestamp, so sorting keeps the frames in chronological order
for f in sorted(os.listdir(capture_dir)):
    key = f.rsplit('.',1)[0].rsplit(" ",1)[1]

    # Frames labeled with any other key are skipped
    if key in key_to_class:
        labels.append({'file_name': f, 'class': key_to_class[key]})

field_names = ['file_name', 'class']

# newline='' prevents blank rows in the CSV on Windows
with open(file_path, 'w', newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=field_names)
    writer.writeheader()
    writer.writerows(labels)

In this process, we are essentially creating a dataset of images with their corresponding labels, where each label represents the direction in which a key was pressed. Here are the steps involved:

Importing dependencies: We need certain libraries and modules to read and process the image files and write the labels to a CSV file. These are imported at the beginning of the process.
Creating a directory: We create a directory named "data" to save the CSV file that will contain the labels and image filenames.
Reading the file name: We extract the key pressed from the file name of each image. The key pressed can be either left, right, up, down, or no key at all.
Classifying the images: Based on the key pressed, we classify each image into one of five classes: 0 for no key being pressed, 1 for left, 2 for up, 3 for right, and 4 for down. This information is stored along with the filename of the image in a list called "labels".
Writing the labels: Finally, we write the labels to the CSV file. This dataset can now be used to train a machine learning model to recognize the direction in which a key is pressed, given an image.

Now, we create a new file snake_resnet.ipynb.

Import Dependencies

from torch.utils.data import Dataset, DataLoader, WeightedRandomSampler
import os
from PIL import Image
import torch
from sklearn.model_selection import train_test_split
import pandas as pd
from torchvision.transforms import transforms, Compose, ToTensor, Resize, Normalize, CenterCrop, Grayscale
from torch import nn
from tqdm import tqdm
from torchinfo import summary
import numpy as np
import math
from torchvision.models.video import r3d_18, R3D_18_Weights, mc3_18, MC3_18_Weights

Create Dataset

To get started, we need to create a custom dataset object using PyTorch. Our dataset consists of stacks of four images arranged in chronological order. Each item drawn from the dataset is a sequence of four consecutive frames, labeled with the key press recorded for the frame that immediately follows the stack. Essentially, this dataset captures motion through four frames and associates it with a key press.

class SnakeDataSet(Dataset):
    def __init__(self, dataframe, root_dir, stack_size, transform = None):
        self.stack_size = stack_size
        self.key_frame = dataframe
        self.root_dir = root_dir
        self.transform = transform

    def __len__(self):
        return len(self.key_frame) - self.stack_size *3

    def _load_stack(self, start):
        # Load stack_size consecutive frames beginning at row `start`
        img_names = [os.path.join(self.root_dir, self.key_frame.iloc[start + i, 0]) for i in range(self.stack_size)]
        images = [Image.open(img_name) for img_name in img_names]
        # The label comes from the row right after the stacked frames
        label = torch.tensor(self.key_frame.iloc[start + self.stack_size, 1])
        if self.transform:
            images = [self.transform(image) for image in images]
        return images, label

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()
        try:
            images, label = self._load_stack(idx)
        except Exception:
            # Fall back to the first sequence if this one cannot be loaded
            images, label = self._load_stack(0)
        return torch.stack(images,dim = 1).squeeze(), label

Let's break down the code:

Initialization (__init__ method):
- Parameters:
  - dataframe: The dataset containing information about images and labels.
  - root_dir: The root directory where the image files are located.
  - stack_size: The number of images to be stacked together as a single data point.
  - transform: An optional parameter for image transformations (e.g., data augmentation).
Length method (__len__ method):
- Returns the length of the dataset, which is the total number of data points. The length is calculated as the length of key_frame minus three times the stack_size. Each item reads rows idx through idx + stack_size, so a margin of stack_size rows would be enough; subtracting three times that leaves a conservative buffer at the end of the dataframe.
Get item method (__getitem__ method):
- Takes an index idx and returns the corresponding data point.
- The code attempts to load a sequence of images starting from the index idx. It constructs a list of image file paths (img_names) based on the specified stack_size and root_dir.
- It then opens each image using the Python Imaging Library (PIL) and stores the resulting image objects in a list (images).
- The label for the data point is extracted from the key_frame dataframe.
- If an image transformation function (transform) is provided, it applies the transformation to each image in the sequence.
- The images are stacked along a new dimension (dimension 1) using torch.stack(images, dim=1), and then the singleton dimension is removed using squeeze(). This results in a tensor representing the stacked images.
- The stacked images tensor and the label tensor are returned as a tuple.
- Note: There's a try-except block that catches potential errors, and if an error occurs, it falls back to using the first sequence in the dataset. This is a basic form of error handling, but it's often better to handle errors more explicitly depending on the use case.
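The index arithmetic in __getitem__ and __len__ can be checked without loading any images. This is a small sketch with hypothetical helper names that mirrors the row offsets used by the dataset class:

```python
def stack_indices(idx, stack_size):
    """Rows used for one item: the frame rows and the row that supplies the label."""
    frame_rows = list(range(idx, idx + stack_size))
    label_row = idx + stack_size
    return frame_rows, label_row


def dataset_length(num_rows, stack_size):
    """Same formula as __len__: total rows minus three times the stack size."""
    return num_rows - stack_size * 3


num_rows, stack_size = 100, 4
last_idx = dataset_length(num_rows, stack_size) - 1

print(stack_indices(0, stack_size))         # prints: ([0, 1, 2, 3], 4)
print(stack_indices(last_idx, stack_size))  # prints: ([87, 88, 89, 90], 91)
```

Even the last valid index stays well inside the 100 rows, so indices produced within the dataset's own length never read past the end of the dataframe.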

Balancing the Dataset and creating Dataloader

If you analyze the dataset closely, you will be able to see that there is a severe class imbalance in the dataset. This is because most of the time, while playing the game, the user is not pressing any key. Thus, most of the dataset items belong to class 0.

We fix this by using a PyTorch WeightedRandomSampler that takes as input the weights of each sample. We calculate these weights using the code below.

STACK_SIZE = 4
BATCH_SIZE = 32

train, test = train_test_split(pd.read_csv("data/labels_snake.csv"), test_size=0.2, shuffle=False)
classes = ["n", "left", "up", "right", "down"]

# Inverse-frequency weight per class, keyed by label so a missing class cannot shift the lookup
labels_unique, counts = np.unique(train["class"], return_counts=True)
class_weights = {l: sum(counts)/c for l, c in zip(labels_unique, counts)}
example_weights = np.array([class_weights[l] for l in train['class']])
# Shift left by STACK_SIZE: the item at index i is labeled by row i + STACK_SIZE
example_weights = np.roll(example_weights, -STACK_SIZE)
sampler = WeightedRandomSampler(example_weights, len(train))

# Same procedure for the test split
labels_unique, counts = np.unique(test["class"], return_counts=True)
class_weights = {l: sum(counts)/c for l, c in zip(labels_unique, counts)}
test_example_weights = np.array([class_weights[l] for l in test['class']])
test_example_weights = np.roll(test_example_weights, -STACK_SIZE)
test_sampler = WeightedRandomSampler(test_example_weights, len(test))

Weighting each sample is a common way to balance a dataset so that each class contributes equally to the learning process. When a dataset is imbalanced, that is, when some classes have more samples than others, the learning algorithm may focus more on the majority class and ignore the minority class, leading to poor performance.

To calculate the weight of each sample, we first split the dataset csv into train and test using the train_test_split method from the scikit-learn (sklearn) library. After that, we get the unique labels and the count of each label. This helps us to understand the distribution of classes in the dataset. We then calculate the weight of each class by dividing the sum of all counts (size of the dataset) by the number of times that class occurs. This gives rarer classes proportionally larger weights.

The next step is to get the weight of each example by assigning the class weight to the example. This is done by iterating through the dataset and assigning the weight to each sample based on its class label. We roll the example weights by the stack size because the label associated with a certain image is actually the label of the index of that image + STACK_SIZE. This ensures that each sample is given the correct weight based on its class label.

In summary, we split the dataset into train and test sets, count the occurrences of each label, compute a weight for each class, and assign that weight to every sample of the class. This balances the dataset so that each class contributes equally to learning.

Finally, we can create the dataloaders for both the training and test datasets. The transformer used here is defined in the next section.

dataset = SnakeDataSet(root_dir="captures", dataframe=train, stack_size=STACK_SIZE, transform=transformer)
dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, sampler=sampler, drop_last=True)
test_dataset = SnakeDataSet(root_dir="captures", dataframe=test, stack_size=STACK_SIZE, transform=transformer)
test_dataloader = DataLoader(test_dataset, batch_size=BATCH_SIZE, sampler=test_sampler, drop_last=True)
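WeightedRandomSampler draws indices with probability proportional to their weights, with replacement by default. The sketch below imitates that behaviour with random.choices from the standard library on made-up labels, to show why inverse-frequency weights balance the classes. The helper inverse_frequency_weights is hypothetical and only stands in for the weight computation above.

```python
import random
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each example by total / count of its class (illustrative helper)."""
    counts = Counter(labels)
    total = len(labels)
    return [total / counts[label] for label in labels]


# Made-up, heavily imbalanced labels: 90% class 0, 10% class 1.
labels = [0] * 90 + [1] * 10
weights = inverse_frequency_weights(labels)

# Sampling with replacement, proportional to the weights.
rng = random.Random(0)
drawn = rng.choices(labels, weights=weights, k=10_000)
share_of_rare_class = drawn.count(1) / len(drawn)
print(round(share_of_rare_class, 2))
```

Each class ends up with the same total weight, so the rare class is drawn about as often as the common one, even though it makes up only 10% of the data.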

Create Transforms

The model requires a few transforms on the data. First, we need to compute the per-channel mean and standard deviation of the dataset images, which we can do with the following function.

def compute_mean_std(dataloader):
    '''
    We assume that the images of the dataloader have the same height and width
    source: https://github.com/aladdinpersson/Machine-Learning-Collection/blob/master/ML/Pytorch/Basics/pytorch_std_mean.py
    '''
    # var[X] = E[X**2] - E[X]**2
    channels_sum, channels_sqrd_sum, num_batches = 0, 0, 0

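    for batch_images, labels in tqdm(dataloader):  # (B, C, T, H, W)
        # Move the channel axis last so the mean over dims 0-3 is per channel
        batch_images = batch_images.permute(0, 3, 4, 2, 1)  # -> (B, H, W, T, C)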
        channels_sum += torch.mean(batch_images, dim=[0, 1, 2, 3])
        channels_sqrd_sum += torch.mean(batch_images ** 2, dim=[0, 1, 2,3])
        num_batches += 1

    mean = channels_sum / num_batches
    std = (channels_sqrd_sum / num_batches - mean ** 2) ** 0.5

    return mean, std

compute_mean_std(dataloader)

This will return the mean and standard deviation, which we can plug into the transformation below.

transformer = Compose([
    Resize((84,84), antialias=True),
    CenterCrop(84),
    ToTensor(),
    Normalize(mean =[ -0.7138, -2.9883,  1.5832], std =[0.2253, 0.2192, 0.2149]) 
])

This resizes the image to 84x84, center-crops it (a no-op at this size), converts it to a tensor, and normalizes it.
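The statistics passed to Normalize rest on the identity var[X] = E[X**2] - E[X]**2, accumulated batch by batch. Here is a minimal single-channel sketch in plain Python with made-up pixel values; the helper name channel_mean_std is hypothetical. Averaging per-batch means equals the global mean only when every batch has the same size, which drop_last=True guarantees in the dataloader.

```python
import math
import statistics


def channel_mean_std(batches):
    """Mean and std of one channel from equally sized batches.

    Illustrative helper: averages E[X] and E[X**2] over batches, then
    applies var[X] = E[X**2] - E[X]**2.
    """
    sum_of_means, sum_of_sq_means, num_batches = 0.0, 0.0, 0
    for batch in batches:
        sum_of_means += sum(batch) / len(batch)
        sum_of_sq_means += sum(x * x for x in batch) / len(batch)
        num_batches += 1
    mean = sum_of_means / num_batches
    std = math.sqrt(sum_of_sq_means / num_batches - mean ** 2)
    return mean, std


# Two made-up batches of pixel values for a single channel.
batches = [[1.0, 2.0], [3.0, 4.0]]
mean, std = channel_mean_std(batches)

# Cross-check against the population statistics of all values at once.
all_values = [x for batch in batches for x in batch]
print(mean, statistics.fmean(all_values))
print(std, statistics.pstdev(all_values))
```

Normalize then maps each pixel x of a channel to (x - mean) / std using that channel's values.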

Creating the Model

As discussed in the introduction, we are going to use the r3d_18 model provided by PyTorch. This is a 3D convolutional model built on the ResNet architecture.

model = r3d_18(weights = R3D_18_Weights.DEFAULT)
model.fc = nn.Linear(in_features=512, out_features=5, bias=True)
summary(model, (32,3,4,84,84))

We load the default pretrained weights and replace the last fully connected layer so that it predicts the 5 output classes our application needs. The summary call prints the layer-by-layer structure of the network, which has about 33 million trainable parameters.

Training the Model

Now, we can finally train the model. We first create the optimizer and the loss criterion. Here, we use a learning rate of 10e-5 (that is, 1e-4) and a weight decay of 0.1.

num_epochs = 2
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
optimizer = torch.optim.AdamW(model.parameters(), 10e-5, weight_decay=0.1)
model.to(device)
criterion = nn.CrossEntropyLoss()

Now, we create the training loop.

steps = 0  # running count of training batches across all epochs

for epoch in range(num_epochs):
    total_loss = 0.0
    correct_predictions = 0
    total_samples = 0

    val_loss = 0.0
    val_correct_predictions = 0
    val_total_samples = 0

    # Set model to training mode
    model.train()

    # tqdm bar for progress visualization
    pbar = tqdm(dataloader, desc=f'Epoch {epoch + 1}/{num_epochs}', leave=True)
    for inputs, labels in pbar:
        inputs, labels = inputs.to(device), labels.to(device)

        outputs = model(inputs.to(device))
        loss = criterion(outputs, labels)

        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Update statistics
        total_loss += loss.item() * labels.size(0)  # criterion returns the batch mean
        _, predicted = torch.max(torch.softmax(outputs,1), 1)
        correct_predictions += (predicted == labels).sum().item()
        total_samples += labels.size(0)

        # Update tqdm bar with current loss and accuracy
        pbar.set_postfix({'Loss': total_loss / total_samples, 'Accuracy': correct_predictions / total_samples})
        steps = steps + 1

    model.eval()
    with torch.inference_mode():
        for inputs, labels in test_dataloader:
            inputs, labels = inputs.to(device), labels.to(device)

            outputs = model(inputs.to(device))
            loss = criterion(outputs, labels)

            # Update statistics
            val_loss += loss.item() * labels.size(0)  # criterion returns the batch mean
            _, predicted = torch.max(torch.softmax(outputs,1), 1)
            val_correct_predictions += (predicted == labels).sum().item()
            val_total_samples += labels.size(0)

    # Calculate and print epoch-level accuracy and loss for validation
    epoch_loss = val_loss / val_total_samples
    epoch_accuracy = val_correct_predictions / val_total_samples
    print(f'Epoch {epoch + 1}/{num_epochs}, Val Loss: {epoch_loss:.4f}, Val Accuracy: {epoch_accuracy:.4f}')
    torch.save(model.state_dict(), "model_r3d.pth")

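This trains the network for num_epochs epochs, evaluates it on the test set after every epoch, and saves the weights to model_r3d.pth.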
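One detail matters when reporting a per-sample loss: nn.CrossEntropyLoss returns the mean over the batch by default, so each batch loss has to be multiplied by its batch size before dividing by the total sample count. A minimal sketch in plain Python with made-up numbers, using the hypothetical helper per_sample_average_loss:

```python
def per_sample_average_loss(batch_mean_losses, batch_sizes):
    """Average loss per sample from per-batch mean losses (illustrative helper)."""
    total_loss, total_samples = 0.0, 0
    for mean_loss, size in zip(batch_mean_losses, batch_sizes):
        total_loss += mean_loss * size  # undo the per-batch averaging
        total_samples += size
    return total_loss / total_samples


# Made-up values: two batches of different sizes.
average = per_sample_average_loss([1.0, 4.0], [3, 1])
print(average)
```

Summing the batch means directly and dividing by the sample count would instead shrink the reported loss by roughly a factor of the batch size.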

Play the game

Create a new Python script, play.py. In this script, we load the model and reuse the approach from capture.py to open the game and collect screenshots. We stack 4 screenshots and pass them through our network.
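The frame stacking can be sketched in isolation with a bounded deque: once it holds 4 frames, every new frame pushes out the oldest one, so each prediction sees the 4 most recent screenshots. The helper sliding_stacks and the integer "frames" are made up for illustration.

```python
from collections import deque


def sliding_stacks(frames, stack_size=4):
    """Return every full window of the most recent stack_size frames."""
    stack = deque(maxlen=stack_size)
    windows = []
    for frame in frames:
        stack.append(frame)  # drops the oldest frame once the deque is full
        if len(stack) == stack_size:
            windows.append(tuple(stack))
    return windows


# Integers stand in for screenshots.
stacks = sliding_stacks(range(6))
print(stacks)
```

No window is produced for the first three frames, which is why the script only predicts once the stack is full.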

import base64
import io
import torch
import cv2
import keyboard
from PIL import Image
import numpy as np
from torchvision.transforms import Compose, Resize, CenterCrop, ToTensor, Normalize
from torch import nn
from collections import deque
from torchvision.models.video import r3d_18
from selenium import webdriver
from selenium.webdriver.common.by import By

label_keys= {
    0: "",
    1 :"left",
    2: "up",
    3: "right",
    4: "down"
}

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = r3d_18(weights = None)
model.fc = nn.Linear(in_features=512, out_features=5, bias=True)

# Load the weights saved by the training loop
model.load_state_dict(torch.load("model_r3d.pth", map_location=device))
model.to(device)
model.eval()

transformer = Compose([
    Resize((84,84), antialias=True),
    CenterCrop(84),
    ToTensor(),
    Normalize(mean =[ -0.7138, -2.9883,  1.5832], std =[0.2253, 0.2192, 0.2149] )
])

# Create a webdriver instance using Firefox
driver = webdriver.Firefox()
# Navigate to the Google Snake game website
driver.get("https://www.google.com/fbx?fbx=snake_arcade")

frame_stack = deque(maxlen=4)

while True:
    # Find the canvas element on the webpage
    canvas = driver.find_element(By.CSS_SELECTOR, "canvas")

    # Get the base64 encoded image data from the canvas.
    # The "data:image/png;base64," prefix is 22 characters long, so skip all of it.
    canvas_base64 = driver.execute_script(
        "return arguments[0].toDataURL('image/png').substring(22);", canvas
    )
    # Decode the base64 data to get the PNG image
    canvas_png = base64.b64decode(canvas_base64)

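    # Decode the PNG into a numpy array; cvtColor swaps the red and blue channels
    image = cv2.cvtColor(
        np.array(Image.open(io.BytesIO(canvas_png))), cv2.COLOR_BGR2RGB
    )

    # Resize expects a PIL image or a tensor, so wrap the array back into a PIL image
    frame_stack.append(transformer(Image.fromarray(image)))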
    input = torch.stack([*frame_stack],dim = 1).to(device).squeeze().unsqueeze(0)

    if len(frame_stack) == 4:
        with torch.inference_mode():
            outputs = model(input).to(device)
            preds = torch.softmax(outputs, dim=1).argmax(dim = 1)

            if preds.item() != 0:
                keyboard.press_and_release(label_keys[preds.item()])

Once you run this script, you can watch the model play Snake.
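The canvas is read through a data URL, which consists of the header data:image/png;base64 followed by a comma and the base64 payload. Here is a minimal sketch of decoding such a URL in plain Python, assuming that format; the helper decode_data_url and the sample bytes are made up for illustration.

```python
import base64


def decode_data_url(data_url):
    """Return the raw bytes of a base64 data URL (illustrative helper)."""
    header, _, payload = data_url.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("expected a base64 data URL")
    return base64.b64decode(payload)


PREFIX = "data:image/png;base64,"
# Made-up bytes standing in for real PNG data.
sample_url = PREFIX + base64.b64encode(b"\x89PNG fake bytes").decode("ascii")
raw = decode_data_url(sample_url)
print(len(PREFIX), raw)
```

Splitting on the first comma avoids hard-coding the prefix length, which is 22 characters for PNG data URLs.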

Conclusion

In this post, we built an AI game bot for Google Snake using imitation learning and a 3D convolutional ResNet model. Beyond the fun of game development, the same skills apply to more serious AI scenarios.

We began by understanding the significance of imitation learning, focusing on training our AI agent through human gameplay. The use of a 3D Convolution ResNet model allowed us to accurately capture motion by stacking four frames of the game.

Practical details covered data collection using Selenium, creating labeled datasets, and transforming them into a PyTorch-friendly format. We highlighted the importance of addressing class imbalance through WeightedRandomSampler.

Implementation involved constructing a custom dataset class, selecting an appropriate model architecture, and training the model using essential transforms. The training loop and evaluation on a test set provided insights into the model's performance.

To complete the journey, we showcased how the trained model can be used to play the game in real-time. The provided Python script demonstrated capturing frames, making predictions, and controlling the snake's movements.

In short, this guide walks through imitation learning for games with a 3D convolutional ResNet model, and the same workflow carries over to broader AI applications.

Git Repo: https://github.com/akshayballal95/autodrive-snake/tree/blog
