DEV Community: Youdiowei Eteimorde

Gemma 4 Soft Tokens: The Rise and Fall of 16x16 Words ⚡👀

Youdiowei Eteimorde — Sun, 24 May 2026 22:54:37 +0000

This is a submission for the Gemma 4 Challenge: Write About Gemma 4

The road to vision capabilities in the Gemma family has been an interesting one. The first and second generations of Gemma models did not include native vision support. Instead, multimodal capabilities were introduced through the PaliGemma models.

It wasn’t until Gemma 3 that we began to see native vision capabilities integrated directly into the Gemma family. Even then, those capabilities were reserved for the larger variants. With Gemma 4, that changes. Every variant in the series can now see.

Gemma 4’s vision system is also a significant step forward from Gemma 3. It introduces several new ideas and challenges the common approach of representing images as fixed 16×16 words. Instead, Gemma 4 processes images using 48×48 soft tokens, a design that fundamentally changes how visual information is represented within the model.

In this article, we’ll take a deep dive into how Gemma 4’s vision capabilities work. Along the way, we’ll explore the architectural decisions behind the model and build an intuition for why the Google DeepMind team made certain design choices for this release of Gemma.

The Rise of 16 × 16 Words

To fully understand the changes introduced in Gemma 4’s vision system, we need to take a short trip down memory lane and look at how the transformer architecture, originally designed for language, came to dominate computer vision as well.

In its early days, the transformer struggled to achieve state of the art performance on vision tasks. That changed with the release of the paper “An Image is Worth 16 × 16 Words”, which introduced the Vision Transformer (ViT).

The authors showed that transformers can be applied to images in much the same way they are applied to text. They did this by dividing an image into 16 × 16 pixel patches.

These patches are then arranged in sequence and passed into the transformer as input.

For example, consider a 48 × 48 image. Dividing it into 16 × 16 patches produces a 3 × 3 grid, giving us a total of 9 patches.

Each patch is then processed by taking its raw pixel values, flattening them into a vector, and projecting them through a linear layer. This produces a learned embedding for each patch.
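To make the patching step concrete, here is a minimal pure-Python sketch. It assumes a single-channel image stored as a list of rows, and the function name patchify is my own, not something from the ViT paper or any library. The learned linear projection is left out; only the split-and-flatten part is shown.

```python
def patchify(image, patch=16):
    """Split an H x W image (a list of rows) into flattened patches, row-major."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("height and width must be multiples of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Flatten the patch's pixels into one vector.
            flat = [
                image[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ]
            patches.append(flat)
    return patches


# A 48 x 48 toy image whose pixel value encodes its own coordinates.
image = [[row * 48 + col for col in range(48)] for row in range(48)]
patches = patchify(image)
print(len(patches), len(patches[0]))  # 9 patches, 256 values each
```

With an RGB image each flattened patch would hold 16 × 16 × 3 = 768 values instead of 256.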

These patch embeddings play the same role as word embeddings in a language model. We can think of each patch as a “word” in a sentence.

Also, just like words in a standard transformer, patches do not inherently contain positional information. To address this, positional embeddings are added so the model can understand where each patch comes from in the original image.

This simple idea turned out to be extremely powerful. ViT models not only competed with convolutional neural networks but eventually surpassed them. More importantly, they opened the door to combining vision and language within a single transformer-based framework.

16 × 16 Words Were a Window to Multimodality

Large Language Models are unimodal by nature. They can only work with text, hence the name. But we’ve been able to push them into multimodal territory by representing images as 16 × 16 patches, turning visual information into a sequence that transformers can understand.

Let’s explore how we make LLMs “see” by combining them with Vision Transformers.

Consider an image with a height and width of 144 × 144 pixels. We divide it into 16 × 16 patches. This gives us a total of 81 patches arranged in a 9 × 9 grid.

These patches cannot be fed directly into a transformer-based language model. So what we do is pass them through a vision transformer encoder. This encoder processes the image and produces embeddings for each patch. These embeddings are what we call visual tokens.

The language model still can’t use these visual tokens directly. The representation space is different. So we introduce a projection layer that maps the visual tokens into the embedding space of the language model.

Once this is done, the language model can now process both text and visual tokens in a unified space. That is what gives us a multimodal model.
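The projection step can be sketched with toy numbers. Everything here is illustrative: the dimensions are tiny, the weights are random rather than learned, and the names linear, VISION_DIM and TEXT_DIM are assumptions for this example only.

```python
import random


def linear(vector, weights):
    """Multiply a vector of length d_in by a d_in x d_out weight matrix."""
    d_out = len(weights[0])
    return [sum(v * row[j] for v, row in zip(vector, weights)) for j in range(d_out)]


random.seed(0)
VISION_DIM, TEXT_DIM = 4, 6  # toy sizes, chosen only for illustration

# Stand-ins for the encoder output (3 visual tokens) and 2 text embeddings.
visual_tokens = [[random.gauss(0, 1) for _ in range(VISION_DIM)] for _ in range(3)]
text_embeddings = [[random.gauss(0, 1) for _ in range(TEXT_DIM)] for _ in range(2)]

# The projection layer. In a real model these weights are learned.
projection = [[random.gauss(0, 1) for _ in range(TEXT_DIM)] for _ in range(VISION_DIM)]

projected = [linear(token, projection) for token in visual_tokens]
sequence = projected + text_embeddings  # one sequence in a shared space
print(len(sequence), len(sequence[0]))  # 5 tokens, each of size 6
```

After projection, visual and text tokens have the same width, so the language model can treat them as one sequence.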

Models like PaliGemma use this approach. PaliGemma combines a ViT-based SigLIP vision encoder with Gemma, giving early Gemma systems vision capability. LLaVA follows a similar idea by combining a CLIP-based ViT encoder with LLaMA.

While this method of building multimodal models works, it comes with a few inefficiencies. Passing all those visual tokens into the language model increases the amount of computation the model has to perform.

As resolution increases, the number of tokens grows quickly:

Image size	Patch grid (16 × 16)	Number of tokens
144 × 144	9 × 9	81
256 × 256	16 × 16	256
448 × 448	28 × 28	784
896 × 896	56 × 56	3136

In our earlier example, a 144 × 144 image produces 81 tokens. Scaling up to 256 × 256 gives 256 tokens, while 448 × 448 jumps to 784 tokens. At 896 × 896, we are already at 3136 tokens.
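The numbers in the table follow directly from integer division by the patch size. A small sketch, where the helper name patch_tokens is my own:

```python
def patch_tokens(height, width, patch=16):
    """Return (grid rows, grid columns, total patch tokens) for an image."""
    rows, cols = height // patch, width // patch
    return rows, cols, rows * cols


for side in (144, 256, 448, 896):
    rows, cols, total = patch_tokens(side, side)
    print(f"{side} x {side}: {rows} x {cols} grid, {total} tokens")
```

Doubling the side length quadruples the token count, which is why high-resolution inputs get expensive so quickly.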

Now put yourself in the shoes of the team working on Gemma 4. The goal is to deliver a vision-capable model that can run on edge devices and even smartphones. At that scale, this level of token overhead becomes expensive, not just in compute, but also in context length usage.

This is why the DeepMind team had to move away from fixed 16 × 16 patching and instead embrace soft tokens.

An Image is Worth 48 × 48 Soft Tokens

So what are soft tokens?

Soft tokens are a way of reducing the computational and context overhead that comes from working directly with 16 × 16 patch tokens. In Gemma 4, a soft token is formed by grouping a 3 × 3 block of patches, which corresponds to a 48 × 48 region of the image.

For a 144 × 144 image, this means we can group the 81 patch tokens into 9 soft tokens.

Soft tokens are created by performing a pooling operation over the patch embeddings inside each group. In simple terms, the model takes the embeddings of the 9 patches in a group and averages them to produce a single representation.

So instead of passing all 81 visual tokens directly into the language model, we compress them into just 9 soft tokens, where each one represents a larger, more meaningful region of the image.

This leads to a significant efficiency gain. Going from our earlier setup, the language model no longer processes 81 tokens. It now only processes 9 soft tokens.

This design works well because the vision encoder still extracts rich representations at the patch level, but the language model does not need to operate at that full resolution. The pooling step combines the information into a smaller representation, reducing both compute cost and context length while preserving the essential visual structure of the image.
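Here is a minimal sketch of the pooling step as described above, using plain averaging over each 3 × 3 block. The function name pool_soft_tokens and the toy 2-dimensional embeddings are assumptions for illustration, not the actual Gemma 4 implementation.

```python
def pool_soft_tokens(grid, kernel=3):
    """Average each kernel x kernel block of patch embeddings into one soft token.

    `grid` is a 2D list (rows x cols) of embedding vectors.
    """
    rows, cols = len(grid), len(grid[0])
    if rows % kernel or cols % kernel:
        raise ValueError("grid size must be a multiple of the pooling kernel")
    dim = len(grid[0][0])
    soft = []
    for top in range(0, rows, kernel):
        soft_row = []
        for left in range(0, cols, kernel):
            block = [
                grid[r][c]
                for r in range(top, top + kernel)
                for c in range(left, left + kernel)
            ]
            # Element-wise mean over the 9 patch embeddings in the block.
            soft_row.append([sum(vec[d] for vec in block) / len(block) for d in range(dim)])
        soft.append(soft_row)
    return soft


# A 9 x 9 patch grid (a 144 x 144 image) with toy embeddings [row, column].
grid = [[[float(r), float(c)] for c in range(9)] for r in range(9)]
soft = pool_soft_tokens(grid)
print(len(soft) * len(soft[0]))  # 9 soft tokens from 81 patches
print(soft[0][0])                # [1.0, 1.0], the mean of rows 0-2 and columns 0-2
```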

Soft tokens aren’t new to Gemma 4. They have existed for quite a while, but Gemma 4 fully embraces the idea and pushes it beyond simple information compression.

Budget-Friendly Vision Models with Soft Tokens

In Gemma 3, every image is represented using a fixed 256 soft tokens, regardless of whether the image is small or large. This fixed design helped keep computation predictable, but it also meant that smaller images still used the same amount of compute. You couldn’t allocate more tokens for larger images either.

In Gemma 4, this changes with the introduction of a token budget. With a token budget, you can control how many soft tokens are passed into the language model. Instead of being fixed, the number of soft tokens now depends on the image size and the available budget.

Let’s say we set a token budget of 9. If we take a smaller version of our earlier image, say 96 × 96, and split it into patches and soft token groups, we end up with just 4 soft tokens.

These 4 soft tokens are what get passed into the language model. The remaining 5 slots are filled with zeros and do not take part in computation.

If the image is larger than the token budget allows, it is resized so that it fits within the limit.
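The bookkeeping in this example can be written out as a short sketch. The helper soft_token_slots is hypothetical and only counts slots; it does not resize anything.

```python
def soft_token_slots(height, width, budget, patch=16, kernel=3):
    """Return (soft tokens used, padded slots) for an image that already fits."""
    region = patch * kernel  # 48 pixels per soft-token side
    if height % region or width % region:
        raise ValueError("height and width must be divisible by 48")
    used = (height // region) * (width // region)
    if used > budget:
        raise ValueError("image exceeds the budget and must be resized first")
    return used, budget - used


print(soft_token_slots(96, 96, budget=9))    # (4, 5)
print(soft_token_slots(144, 144, budget=9))  # (9, 0)
```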

Gemma 4 supports multiple token budgets, such as 70, 140, 280, 560, and 1120.

With token budgets, you can explicitly control the maximum number of visual tokens allocated to an image. This makes the system more flexible and efficient, since compute scales with need rather than being fixed.

The Model Who Resized Images

One drawback that most vision models have faced is their inability to work with images of varying sizes. Many vision systems only operate on square images. So when an image comes in a different shape, it is typically resized into a square before being processed. This has been the case for most vision models in the Gemma family, until Gemma 4.

Gemma 4 can work with images of any aspect ratio. It does not force images into a square shape. Instead, it preserves the original aspect ratio, and it achieves this using, you guessed it, soft tokens.

Here is how it works. When Gemma 4 receives an image, its image processor resizes the image. It follows two rules:

Rule 1: Both height and width must be divisible by 48
Rule 2: The image must fit the chosen token budget

Making the height and width divisible by 48 ensures that the image can be cleanly divided into soft token regions without leftover pixels.

In the image above, we start with a 1280 × 720 image and a token budget of 70. After resizing, the image becomes 528 × 288. Both dimensions are divisible by 48 so Rule 1 is satisfied. When we compute the soft tokens, we get 66 tokens, which also satisfies Rule 2, since it stays within the budget.

Now Gemma can process the image without breaking the constraints. Let’s look at another example with a different aspect ratio.

This time we start with a 976 × 1150 image, again using a token budget of 70. As before, the image is resized to meet the divisibility rule, becoming 336 × 432. The resulting number of soft tokens is 63, which also fits within the budget.

The ability of Gemma 4 to handle images of varying sizes while preserving aspect ratios is a major shift in the Gemma family. It means the model can perceive images more naturally, without forcing them into distorted square forms just to make them fit.

No matter what shape an image comes in, Gemma 4 can handle it. The image processor adjusts it to satisfy both rules while preserving the structure of the visual content.
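One plausible way to satisfy both rules is to scale the image so its area matches what the budget can cover, then round each side down to a multiple of 48. The sketch below is an assumption on my part, not the actual Gemma 4 image processor, and the name fit_to_budget is hypothetical. It does reproduce the two examples in this section.

```python
import math


def fit_to_budget(height, width, budget, patch=16, kernel=3):
    """Return (new height, new width, soft tokens) under the two resize rules.

    Assumed strategy: uniform scaling to the budget's pixel area, then
    rounding each side down to a multiple of 48.
    """
    region = patch * kernel                   # 48 pixels per soft-token side
    target_pixels = budget * region * region  # pixel area the budget can cover
    scale = math.sqrt(target_pixels / (height * width))
    rows = math.floor(height * scale / region)
    cols = math.floor(width * scale / region)
    if rows == 0 or cols == 0:
        raise ValueError("extreme aspect ratios are not handled in this sketch")
    return rows * region, cols * region, rows * cols


print(fit_to_budget(720, 1280, budget=70))  # (288, 528, 66)
print(fit_to_budget(1150, 976, budget=70))  # (432, 336, 63)
```

Because both sides are rounded down, the token count can never exceed the budget, but the aspect ratio is only approximately preserved.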

Tracing Patches in Gemma 4

In the transformer architecture, tokens have no inherent sense of position. They only carry meaning. To address this, positional encoding was introduced so each token can also carry information about where it appears in a sequence.

In Vision Transformers, positional information becomes even more important because each image patch must know where it is located in the image. Adding positional information is straightforward when images have a fixed size, but it becomes more complex when a model needs to support images of varying sizes.

Gemma 4 uses two positional encoding schemes to keep track of patch locations. Let’s go through each of them.

Note: Gemma 4 encodes position at the patch level, not at the soft token level.

Patch Embedding Table

The first positional scheme used in Gemma 4 is the patch embedding table. It is implemented using learned embedding tables that store positional vectors for each possible patch location.

Even though Gemma 4 resizes its image inputs, two things remain consistent: each 16 × 16 patch is assigned a unique position ID, and the total number of patches never exceeds 10,080.

The number 10,080 comes from multiplying the maximum token budget (1,120) by 3², which represents the number of patches in a soft token.
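The same arithmetic gives the patch limit for every token budget the article lists. A tiny sketch, with max_patches as an illustrative helper name:

```python
BUDGETS = (70, 140, 280, 560, 1120)
POOL = 3  # each soft token pools a 3 x 3 block of patches


def max_patches(budget, kernel=POOL):
    """Maximum number of 16 x 16 patches the vision encoder sees for a budget."""
    return budget * kernel * kernel


for budget in BUDGETS:
    print(budget, max_patches(budget))
```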

With these constraints, we can construct an embedding table that stores positional information for all possible patch locations. When a patch is processed, its position ID is used to retrieve the corresponding positional embedding.

However, a single table is not sufficient because image patches exist on a two dimensional grid. Instead of using one embedding table, Gemma 4 uses two separate tables: one for the x-axis (width position) and one for the y-axis (height position).

Each embedding table contains 10,240 vectors, each of size 768, where 768 is the hidden dimension of the model and matches the vision encoder’s embedding space. The table size is large enough to cover all possible patch positions, since the maximum number of patches (10,080) fits within it.

The patch embedding sits at the input of the vision encoder. When an image is processed, the image processor assigns a position ID to every patch. These IDs are then used to look up both the x and y positional embeddings from their respective tables. The two vectors are added together to form the final positional embedding for that patch, which is then added to the patch’s visual token.

This process is repeated for every patch in the image.
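The lookup-and-add step can be sketched with toy tables. The sizes here are deliberately tiny (the article cites 10,240 entries of size 768), the values are random rather than learned, and treating x as the column index and y as the row index is my assumption.

```python
import random

random.seed(0)
TABLE_SIZE, DIM = 32, 4  # toy sizes for illustration only


def make_table(size, dim):
    """Stand-in for a learned embedding table."""
    return [[random.gauss(0, 0.02) for _ in range(dim)] for _ in range(size)]


x_table = make_table(TABLE_SIZE, DIM)  # one vector per position along the width
y_table = make_table(TABLE_SIZE, DIM)  # one vector per position along the height


def add_position(patch_embedding, x, y):
    """Look up the x and y vectors, sum them, and add the result to the patch."""
    position = [a + b for a, b in zip(x_table[x], y_table[y])]
    return [p + e for p, e in zip(patch_embedding, position)]


patch = [1.0, 2.0, 3.0, 4.0]
print(add_position(patch, x=2, y=5))
```

Two small tables are enough to give every (x, y) combination its own positional vector, which is cheaper than storing one vector per grid cell.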

This patch-level positional encoding allows each patch to know where it is in the image. However, it does not explicitly model how patches relate to each other spatially.

A Quick Recap on RoPE

Before we move to the next positional encoding scheme used in Gemma 4 for images, let’s take a brief detour to understand Rotary Position Embedding (RoPE).

RoPE is one of the most popular positional encoding schemes used in AI today. Unlike traditional positional encoding methods that add positional embeddings to token embeddings, RoPE takes a different approach. Instead of adding positional information, it rotates the token vectors themselves, specifically the Q and K vectors used in attention.

This rotation allows tokens to encode positional information directly inside their representations. More importantly, because the vectors are rotated at different intervals, tokens can also understand relative position. Tokens that are close to each other end up with similar rotations, while tokens that are farther apart have more different rotations.

If we apply RoPE to an image, each visual token would be rotated to encode its position. I like to think of RoPE as a clock with only a minute hand, where each patch is rotated by a certain number of minutes to represent its position.

In the image above, the first patch has a rotation of 0 minutes, while the patch to its right has rotated by 13 minutes.

One interesting detail about how RoPE works is that it takes a vector and groups its values into pairs before applying rotation.

In the image above, we start with a vector of length 16. Grouping the values into pairs gives us 8 pairs, and rotation is applied to each pair independently. The combined rotations across all pairs determine the total rotation of that patch.

In this example, the first pair rotates by 5 minutes, the next pair rotates by 2 minutes, and the remaining 6 pairs rotate by 1 minute each. Together, these rotations produce the equivalent of a 13 minute rotation for that patch.
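To see the pairing and the relative-position property in code, here is a minimal sketch of 1D RoPE in pure Python. It follows the original formulation with consecutive pairs and a base of 10000; real implementations often pair values differently, and the vectors q and k below are arbitrary toy values.

```python
import math


def rope(vector, position, base=10000.0):
    """Rotate consecutive pairs of `vector` by position-dependent angles."""
    dim = len(vector)
    out = []
    for i in range(0, dim, 2):
        angle = position * base ** (-i / dim)  # each pair has its own frequency
        cos, sin = math.cos(angle), math.sin(angle)
        a, b = vector[i], vector[i + 1]
        out += [a * cos - b * sin, a * sin + b * cos]
    return out


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


q = [0.5, -1.0, 2.0, 0.25, 1.5, -0.5, 0.0, 1.0]
k = [1.0, 0.5, -0.75, 2.0, 0.25, -1.5, 1.0, 0.5]

# The same offset (2 positions apart) at two different absolute positions.
near_start = dot(rope(q, 3), rope(k, 5))
further_on = dot(rope(q, 10), rope(k, 12))
print(math.isclose(near_start, further_on, abs_tol=1e-9))  # True
```

The attention score depends only on how far apart the two tokens are, not on where they sit in the sequence.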

So was standard RoPE used in Gemma 4? The answer is no.

RoPE works well for one dimensional sequences like text, where tokens exist along a single axis. Images are different because they exist in two dimensions. They have both width and height. Applying a one dimensional rotation is not enough to properly capture that structure.

To encode relative position correctly in images, we need a variation of RoPE that works in two dimensions.

2D RoPE

The second positional encoding scheme used by Gemma 4 is 2D-RoPE. If we think of standard RoPE as a clock with a single hand, then 2D-RoPE is like a clock with both a minute hand and an hour hand.

The minute hand encodes movement along the x-axis, while the hour hand encodes movement along the y-axis. Each patch ends up with its own unique “time,” and because the rotations follow a consistent interval, patches can determine where other patches are located relative to them.

Along the x-axis, each patch rotates by 10 minutes. Moving along the y-axis is like moving into a new hour. For example in the image above, the right eye, which sits at 3:20, knows the left eye is nearby because it is at 3:30. It can also tell that the lips are below it because they are at 4:20 and 4:30.

This is the core intuition behind 2D-RoPE. Patches do not just know their own positions, they can also understand the positions of other patches relative to them.

Just like 1D RoPE, 2D-RoPE encodes position by grouping vector values into pairs and rotating them. However, it introduces one important difference. The vector is split into two halves: one half encodes movement along the x-axis, while the other half encodes movement along the y-axis.

In the image above, the selected patch sits at 10 minutes past 12. We can see how the individual vector pairs contribute to that rotation. For the minute hand, the first pair rotates by 7 minutes, while the remaining 3 pairs rotate by 1 minute each, giving a total rotation of 10 minutes. For the hour hand, there is barely any movement because there is no change along the y-axis.
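The split into halves can be sketched as follows. This is an illustration under assumptions: which half carries x and which carries y, the base of 10000, and the names rotate_pairs and rope_2d are all my own choices rather than details confirmed for Gemma 4.

```python
import math


def rotate_pairs(values, position, base=10000.0):
    """1D RoPE: rotate consecutive pairs by position-dependent angles."""
    dim = len(values)
    out = []
    for i in range(0, dim, 2):
        angle = position * base ** (-i / dim)
        cos, sin = math.cos(angle), math.sin(angle)
        a, b = values[i], values[i + 1]
        out += [a * cos - b * sin, a * sin + b * cos]
    return out


def rope_2d(vector, x, y):
    """Rotate the first half by the x position and the second half by y."""
    half = len(vector) // 2
    return rotate_pairs(vector[:half], x) + rotate_pairs(vector[half:], y)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Toy vectors of length 16: 4 pairs for x and 4 pairs for y.
q = [math.sin(i + 1) for i in range(16)]
k = [math.cos(2 * i) for i in range(16)]

# Same relative offset (+2 in x, +1 in y) at two different places in the grid.
here = dot(rope_2d(q, 1, 2), rope_2d(k, 3, 3))
there = dot(rope_2d(q, 6, 7), rope_2d(k, 8, 8))
print(math.isclose(here, there, abs_tol=1e-9))  # True
```

Moving a patch along x alone leaves the y half of its vector untouched, which mirrors the hour hand barely moving in the clock analogy.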

Together, Gemma 4’s two positional encoding schemes create a powerful way for patches to understand position. The patch embedding tables allow each patch to know its absolute location, while 2D-RoPE adds relative positional awareness, allowing patches to understand where other patches are located in relation to them.

A Better Glimpse at Images

We’ve talked about token budgeting as a way of managing compute and context length efficiently, but token budgeting can also be seen as a way of giving Gemma 4 a better glimpse of the image it is processing.

One way to think about token budgeting is as a form of image resolution control. In traditional images, resolution determines how many pixels we can see. With Gemma 4, token budgets determine how many soft tokens can represent an image.

Let’s walk through the different token budgets available in Gemma 4 and see when each one is useful.

Token Budget of 70

This is the smallest token budget available in Gemma 4. It limits the maximum number of patches processed by the vision encoder to 630, while the maximum number of soft tokens passed to the language model is 70.

Property	Value
Patch grid	18 × 33
Resized height	288
Resized width	528
Original height	549
Original width	976
Total patches	594
Soft tokens after pooling	66

The image above originally had a height of 549 and a width of 976. Since it did not fit within the token budget, it was resized to 288 × 528, satisfying the divisibility by 48 rule we discussed earlier.

After patchification, the image produces 594 patches, which are then pooled into 66 soft tokens.

When to Use a Token Budget of 70

A token budget of 70 is ideal for lightweight vision tasks such as:

image classification
captioning
visual question answering (VQA)

It is also useful when running Gemma 4 on edge devices like smartphones or laptops, where compute resources are limited.

This budget is especially good for video workloads. Since videos are made up of many image frames, using higher token budgets for every frame can quickly become computationally expensive.

Token Budget of 140

The next token budget after 70 is 140. The maximum number of patches for this budget is 1,260, and the maximum number of soft tokens passed to the language model is 140.

Property	Value
Patch grid	30 × 39
Resized height	480
Resized width	624
Original height	672
Original width	896
Total patches	1,170
Soft tokens after pooling	130

The image above was originally 672 × 896. After setting a token budget of 140, it is resized to 480 × 624. This ensures it satisfies the divisibility rule and fits within the token budget. After patching and pooling, the image produces 130 soft tokens.

When to Use a Token Budget of 140

A token budget of 140 is useful when the 70 token budget starts to lose detail, but you still want to keep computation relatively low.

A good way to think about it is this: if the 70 token budget is not giving you enough visual detail, you can increase to 140 to improve performance while still maintaining efficiency.

However, if you are getting similar results from 140 as you are from 70, it is better to stick with 70 and save compute.

Token Budget of 280

This is a mid range token budget. It performs significantly better than the lower budgets while still keeping computational overhead at a reasonable level for most tasks. This budget allows the vision encoder to process up to 2,520 patches, which are then compressed into 280 soft tokens for the language model.

Property	Value
Patch grid	51 × 45
Resized height	816
Resized width	720
Original height	1107
Original width	1000
Total patches	2,295
Soft tokens after pooling	255

Let’s take a moment to appreciate the efficiency of soft tokens. The image above produces 2,295 patches, but after pooling, we end up with just 255 soft tokens. That is a 9× reduction, since each soft token pools 9 patches.

When to Use a Token Budget of 280

This token budget is the default choice for Gemma 4, and for good reason. It is well suited for the following tasks:

Object detection
Multimodal chat
Graph and chart reasoning
Document understanding
UI reasoning

This is the budget you want when you need a capable vision model without pushing compute too far.

Token Budget of 560

We are in the big leagues now. This token budget allows a maximum of 5,040 patches for the vision encoder and sends up to 560 soft tokens to the language model.

Property	Value
Patch grid	51 × 90
Resized height	816
Resized width	1440
Original height	1080
Original width	1920
Total patches	4,590
Soft tokens after pooling	510

Just by looking at the image, we can see the sheer number of 16 × 16 patches present. But thanks to soft tokens, the language model only has to work with 510 soft tokens, instead of thousands of raw visual patches.

When to Use a Token Budget of 560

This budget should be used when you have enough compute available and you want to get the best performance out of the vision model.

It covers everything the 280 token budget can handle, but with better detail and robustness. It also works well for more complex inputs like PDFs.

Token Budget of 1120

This is the maximum setting for soft tokens. The vision encoder can process up to 10,080 visual patches, while the language model receives up to 1,120 soft tokens.

Property	Value
Patch grid	75 × 132
Resized height	1200
Resized width	2112
Original height	1080
Original width	1920
Total patches	9,900
Soft tokens after pooling	1,100

The image example above is the same image we used for 560, but you can see the level of detail is on another level. This time, instead of the image being resized to be smaller, it was upscaled, adding more detail to work with.

The number of patches in this image is 9,900. If the language model had to work with them directly, they would quickly fill up the context. Instead, we get to work with 1,100 soft tokens.

When to Use a Token Budget of 1120

This token budget should be used when you need maximum visual fidelity and you have enough compute available.

It is especially useful for tasks like:

document parsing
handwriting recognition
reading small or dense text
fine-grained object detection

It can also be useful for segmentation tasks or detecting very small objects in complex scenes.

Overall, the flexibility Gemma 4 provides with token budgets is significant. Depending on the task, you can choose the right level of visual detail instead of being locked into a single fixed representation.

Working With Soft Tokens in Code

Now that we have a grasp of soft tokens in Gemma 4, let’s see how to use them in code.

To set your token budget in Gemma 4, you need to use the Gemma 4 image processor to resize the image and extract all the patches from it.

from transformers import AutoProcessor, AutoModelForMultimodalLM

MODEL_ID = "google/gemma-4-E2B-it"

# Load model
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForMultimodalLM.from_pretrained(
    MODEL_ID,
    dtype="auto",
    device_map="auto"
)

img_url = "https://www.themarginalian.org/wp-content/uploads/2013/10/davidbowie1.jpg"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": img_url},
            {"type": "text", "text": "Who is in the image"}
        ]
    }
]

# Set token budget
processor.image_processor.max_soft_tokens = 140

# Process input
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)

input_len = inputs["input_ids"].shape[-1]

# Generate output
outputs = model.generate(**inputs, max_new_tokens=512)

response = processor.decode(
    outputs[0][input_len:],
    skip_special_tokens=True
)

print(response)

The example above uses the Gemma 4 E2B model, and we set a token budget of 140 for the image processor. The processor automatically handles resizing the image, extracting patches and making sure the image fits within the selected token budget.

Further Resources

We’ve covered a lot in this article, and if we decided to go even deeper, this article probably wouldn’t end anytime soon. So instead, I’ll leave you with a couple of resources that will help you better understand and experiment with Gemma 4 vision capabilities.

If you want to visualize how an image is split into patches and soft tokens, try out my visualization tool:

I also authored a couple of Colab notebooks that explore different parts of Gemma 4 vision in more detail:

Notebook	Description	Open
Image Resizing in Gemma 4	Demonstrates how the Gemma 4 image processor resizes images while preserving aspect ratio and fitting token budgets.	Open In Colab
Working With Token Budgets in Gemma 4	Shows how to configure and experiment with token budgets using Hugging Face Transformers.	Open In Colab
Patch Embedding Table	Breaks down the working principles behind the patch embedding tables in Gemma 4.	Open In Colab

If you are interested in learning more about the Gemma 4 vision encoder, check out the section on vision encoders in Maarten Grootendorst’s blog: A Visual Guide to Gemma 4

Recap

Over the course of this article, we learned the following:

Gemma 4 does not directly work with 16 × 16 patches
Instead, it works with soft tokens, where each soft token consists of 9 patches and covers a 48 × 48 region
Soft tokens help reduce both compute and context overhead
Gemma 4 allows users to control their token budget when working with images
There are five token budgets in Gemma 4: 70, 140, 280, 560, and 1120
Gemma 4 can work with images of different aspect ratios
Gemma 4 uses two positional encoding methods: patch embedding tables and 2D-RoPE

I had a lot of fun researching Gemma 4, and I’ll definitely continue exploring the models and sharing what I learn.

I’m particularly interested in the smaller “Effective” models that use Per Layer Embeddings (PLE), mostly because of their compact size and support for audio processing. I also made a video explaining PLE, check it out:

If you found this article informative, please consider showing your appreciation with a reaction: 💖 🦄 🤯 🙌 🔥

If you have any questions as well, feel free to drop them in the comments section.

Building a Self-Improving Plugin for OpenClaw

Youdiowei Eteimorde — Mon, 27 Apr 2026 06:42:19 +0000

This is a submission for the OpenClaw Writing Challenge

The creator of OpenClaw, Peter Steinberger, shared an “aha” moment he had with the agent. In the early days, OpenClaw only supported text input, but one day he sent a voice note without thinking. The agent wasn’t supposed to understand audio, yet it replied.

His reaction was simple: “How the f*ck did it do that?”

The agent explained that it received a file it didn’t recognize, inspected the file header, and discovered it was an Opus audio file. It then used ffmpeg to convert it to WAV, found Peter’s OpenAI API key, and sent the audio to Whisper using curl to transcribe it.

Peter had not built any transcription capability into the system. The agent figured it out.

This is what a self-improving system looks like. An AI agent that can extend its own capabilities without being explicitly programmed to do so. In this article, we’ll take that idea further by building a self-improving plugin for OpenClaw.

The Idea Behind Self-Improving Agents

The holy grail of software engineering is simple: software that can improve itself. If a user wants a new feature, the software adds it. If there’s a bug, it fixes it. If there’s a security vulnerability, it finds and patches it.

This is the idea behind self-improvement, also known as Recursive Self-Improvement (RSI).

AI agents could make this possible. Today, developers already use agents to write code, debug systems, and automate workflows. The next step is obvious: embedding these agents inside software systems so they can improve themselves autonomously.

This is the philosophy behind OpenClaw. OpenClaw has access to its own source code, its documentation, and the environment it runs in. In many ways, it has the context needed to reason about itself.

That means, in theory, it can modify and extend its own capabilities based on user interactions.

Plugins as a Path to Self-Improvement

The idea of OpenClaw improving itself is compelling. But anyone familiar with the current state of AI agents knows that letting an agent directly modify its own source code is a risky move.

So we take a different approach.

Instead of having the agent update its core codebase, we let it operate through plugins.

OpenClaw is built around a plugin architecture, where new features can be added as extensions. Rather than rewriting itself, the agent can create a plugin and iteratively improve it over time, updating the plugin’s code based on user requests.

Building a Plugin

Before we demonstrate self-improvement, let’s first build a simple plugin manually. Our plugin will live inside the agent’s workspace, located at ~/.openclaw/workspace.

Start by creating a directory for the plugin:

cd ~/.openclaw/workspace
mkdir self_improve_plugin
cd self_improve_plugin

Initialize it as a Node.js project:

npm init -y

Install the required dependencies:

npm install typescript ts-node --save-dev
npm install openclaw @sinclair/typebox

Initialize the TypeScript configuration:

npx tsc --init

Next, update your package.json. Set the project to use ES modules and add the OpenClaw configuration:

{
  "name": "self_improve_plugin",
  "version": "1.0.0",
  "type": "module",
  "devDependencies": {
    "ts-node": "^10.9.2",
    "typescript": "^6.0.3"
  },
  "dependencies": {
    "@sinclair/typebox": "^0.34.49",
    "openclaw": "^2026.4.24"
  },
  "openclaw": {
    "extensions": ["./index.ts"],
    "compat": {
      "pluginApi": ">=2026.3.24-beta.2",
      "minGatewayVersion": "2026.3.24-beta.2"
    },
    "build": {
      "openclawVersion": "2026.3.24-beta.2",
      "pluginSdkVersion": "2026.3.24-beta.2"
    }
  }
}

Now define the plugin manifest in openclaw.plugin.json:

{
  "id": "self_improve_plugin",
  "name": "Self Improve Plugin",
  "description": "A plugin that the OpenClaw agent can use to improve itself.",
  "configSchema": {
    "type": "object",
    "additionalProperties": false
  }
}

Next, implement the plugin logic in index.ts:

import { definePluginEntry } from "openclaw/plugin-sdk/plugin-entry";
import { Type } from "@sinclair/typebox";

export default definePluginEntry({
  id: "self_improve_plugin",
  name: "Self Improve Plugin",
  description: "A plugin that the OpenClaw agent can use to improve itself.",
  register(api) {
    api.registerTool({
      name: "reverse_text",
      description: "Reverse a text string.",
      parameters: Type.Object({ text: Type.String() }),
      async execute(_id, params) {
        const reversed = params.text.split("").reverse().join("");
        return { content: [{ type: "text", text: reversed }] };
      },
    });
  },
});

We’ve created a simple plugin that registers a tool called reverse_text, which reverses a string.

Install the plugin:

openclaw plugins install .

Then restart the gateway:

openclaw gateway

OpenClaw installs plugins into ~/.openclaw/extensions.

Improving the Plugin with OpenClaw

Now let’s see how the agent can improve this plugin.

First, we ask the agent to locate the plugin in the workspace:

It successfully finds the source code. Next, we test the reverse_text tool to confirm it works:

Now we push it further.

We ask the agent to add a new tool that counts how many times a character appears in a string:

The agent updates the plugin:

api.registerTool({
  name: "count_character",
  description: "Count how many times a character appears in a string.",
  parameters: Type.Object({
    text: Type.String(),
    character: Type.String({ minLength: 1, maxLength: 1 }),
  }),
  async execute(_id, params) {
    const count = [...params.text].filter(
      (char) => char === params.character
    ).length;
    return { content: [{ type: "text", text: String(count) }] };
  },
});
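Before reinstalling, the counting logic can be checked in isolation. The sketch below pulls it out into plain functions; countCharacter and toToolResult are illustrative names, and the result shape simply mirrors the object returned by the tool above.

```typescript
// Count how many times a single character occurs, iterating by code point.
function countCharacter(text: string, character: string): number {
  return Array.from(text).filter((char) => char === character).length;
}

// Wrap a count in the same result shape the tool's execute function returns.
function toToolResult(count: number): { content: { type: "text"; text: string }[] } {
  return { content: [{ type: "text", text: String(count) }] };
}

console.log(countCharacter("banana", "a")); // 3
console.log(toToolResult(countCharacter("banana", "n")).content[0].text); // 2
```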

Next, we instruct the agent to install the updated plugin:

Even when a mistake is made in the command, the agent inspects the terminal output, identifies the issue, and suggests the correct command.

The installation fails again because the plugin already exists in ~/.openclaw/extensions.

We then instruct the agent to remove the existing version:

After that, we ask it to install the updated version and restart the gateway:

Once everything is up, we can instruct the agent to use the new tool.

At this point, we’ve demonstrated a self-improving plugin. The agent can iteratively extend it by adding new tools, refining existing ones, or even introducing more advanced capabilities like model providers, hooks, or new interaction channels.

The key idea is this: the agent isn’t modifying its core system. It’s evolving through a controlled extension point.

That constraint is what makes this approach powerful without breaking the system.

Let’s Make This a Skill

So far, everything we’ve done has been manual. We created the plugin, installed it, and updated it step by step, but this entire workflow is repeatable. That means we can take it one step further by turning it into a skill, where we encode these steps once and let the agent handle the rest.

With this skill, the agent can set up everything for us, starting from generating the boilerplate in the workspace to installing and updating the plugin whenever needed.

You can find the skill here.

We then create a directory for the skill:

mkdir -p ~/.openclaw/workspace/skills/self-improving-plugin-skill

Within this directory, we add the SKILL.md file, and after restarting the gateway, the agent will have access to the skill and can execute the entire workflow on its own.

Conclusion

The idea of self-improving software has always sounded powerful but difficult to control, especially if it involves directly modifying a system’s core code. What this approach shows is that self-improvement does not have to be risky to be useful. By constraining the agent to operate through plugins, we give it a safe and structured way to evolve, and by packaging the entire workflow into a skill, we make that evolution repeatable.

At that point, the interaction changes from manually building features to simply describing what you want. The agent handles the implementation, installation, and iteration behind the scenes, turning OpenClaw into a system that doesn’t just respond to requests but actively builds and improves itself based on them.

Understanding OpenClaw’s Hook: The Key to Evaluating Agents Properly

Youdiowei Eteimorde — Wed, 22 Apr 2026 12:41:26 +0000

Working with a tool like OpenClaw can sometimes feel like a black box, with many moving parts that make it hard to understand what is happening under the hood. However, OpenClaw’s Hook functionality changes that by giving you clear visibility into your agent’s behavior.

OpenClaw Hooks let you monitor events as they occur in real time, such as when a message is received or a tool call is made, providing detailed insight into your agent’s execution flow. When combined with an LLM observability tool, they make it easier to debug issues, understand decision-making, and optimize performance.

In this article, we’ll break down how Hooks work and then explore how to integrate them with Arize for monitoring and improving your agent.

Get your agent to set up the code for this article using the Agent’s skill

Hooks in OpenClaw

Within OpenClaw, whenever an action occurs, an event is fired to signal that the action has taken place. For example, when a user sends a message, an event is triggered to indicate this interaction. Similarly, when the agent responds or uses a tool, corresponding events are fired.

OpenClaw Hooks allow you to attach additional logic to these events as they occur. In essence, hooks let you tap into OpenClaw’s event-driven system, giving you the ability to customize and extend the system’s behavior.

OpenClaw hooks can be divided into the following categories:

Agent lifecycle: Hooks triggered during the agent’s execution.
Message flow: Hooks related to messages entering and leaving the system.
Tool execution: Hooks triggered before and after tool execution.
Subagent coordination: Hooks triggered when subagents are invoked.
Gateway lifecycle: Hooks triggered during the OpenClaw gateway’s operation.

With all these hooks at our disposal, we gain visibility into a system that would otherwise be a black box.
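To build some intuition for the pattern, here is a toy event bus. It is only a sketch of the general idea of attaching handlers to named events, not OpenClaw's implementation. HookBus, its methods, and the read_file tool name are all hypothetical; the two event names and their fields are taken from the hooks used later in this article.

```typescript
// A toy event bus that illustrates the hook pattern. This is NOT OpenClaw's
// implementation; HookBus and its methods are hypothetical names.
type HookEvents = {
  llm_input: { prompt: string; model: string; provider: string };
  before_tool_call: { toolName: string; toolCallId?: string };
};

class HookBus {
  private handlers = new Map<string, Array<(event: any) => void>>();

  // Attach extra logic to a named event.
  on<K extends keyof HookEvents>(name: K, handler: (event: HookEvents[K]) => void): void {
    const list = this.handlers.get(name) ?? [];
    list.push(handler);
    this.handlers.set(name, list);
  }

  // Fire an event: every handler registered under that name runs in order.
  emit<K extends keyof HookEvents>(name: K, event: HookEvents[K]): void {
    for (const handler of this.handlers.get(name) ?? []) {
      handler(event);
    }
  }
}

const bus = new HookBus();
const seen: string[] = [];

bus.on("before_tool_call", (event) => {
  seen.push(`tool:${event.toolName}`);
});

// "read_file" is a made-up tool name used only for this demo.
bus.emit("before_tool_call", { toolName: "read_file" });
bus.emit("llm_input", { prompt: "hi", model: "demo-model", provider: "demo" });

console.log(seen); // only the tool event had a handler attached
```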

The need for Observability in Agents

Observability is essential for understanding how an AI agent or LLM behaves and why it performs tasks in a certain way. Without it, you’re effectively running your agent blind, left to guess what went wrong.

OpenClaw is no exception. In fact, it requires strong observability for a few key reasons:

Sensitive operations: It often runs directly on a user’s machine and may handle sensitive tasks, making visibility critical.
Autonomous behavior: It can operate proactively through features like heartbeats or cron jobs, executing tasks without direct user input.
System complexity: A single interaction may involve multiple tool calls and internal steps, making it harder to track what’s happening under the hood.

To address these challenges, observability becomes a necessity. Observing a system like OpenClaw requires a dedicated observability tool, and this is where Arize comes in.

Arize is a leading platform for LLM and agent observability, enabling developers to instrument their applications and capture every input and output flowing through the system.

This captured data is then sent to Arize using the OpenTelemetry protocol, where it can be monitored and analyzed in detail.

Instrumentation is relatively straightforward when building applications from scratch, as Arize provides integrations for many LLM and agent frameworks. However, it becomes more challenging when working with an existing system like OpenClaw.

Fortunately, OpenClaw’s hooks provide a solution. By leveraging hooks, we can instrument the agent and gain the visibility needed to properly monitor and evaluate its behavior.

Creating Hooks in OpenClaw

OpenClaw provides two ways to create hooks: directly within the project or through plugins.

The direct method involves creating a hook directory containing two key files:

HOOK.md – defines the hook’s metadata and documentation.
handler.ts – implements the hook’s logic.

The second method is to create a plugin that utilizes hooks. In this article, we will focus on plugin hooks because they offer greater flexibility and are much easier to distribute and reuse.

To learn more about creating hooks directly, refer to the OpenClaw documentation.

Building a Simple Plugin Hook

Let’s create a simple plugin that leverages hooks.

1. Initialize the Project

Start by creating a TypeScript npm project named arize_openclaw_hook. Then install OpenClaw as a dependency:

npm install openclaw

Ensure that OpenClaw is already installed and properly configured on your system or in the environment where it is running.

Below is an example of what your package.json might look like:

{
  "name": "arize_openclaw_hook",
  "version": "1.0.0",
  "description": "",
  "main": "index.ts",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "openclaw": {
    "extensions": [
      "./index.ts"
    ],
    "compat": {
      "pluginApi": ">=2026.3.24-beta.2",
      "minGatewayVersion": "2026.3.24-beta.2"
    },
    "build": {
      "openclawVersion": "2026.3.24-beta.2",
      "pluginSdkVersion": "2026.3.24-beta.2"
    }
  },
  "keywords": [],
  "author": "",
  "license": "ISC",
  "type": "module",
  "dependencies": {
    "openclaw": "^2026.3.28"
  }
}

We’ve also included additional metadata such as the OpenClaw version and Plugin SDK version.

2. Create the Plugin Manifest

All OpenClaw plugins require a manifest file named openclaw.plugin.json, which contains metadata about the plugin:

{
  "id": "arize_openclaw_hook",
  "name": "Arize OpenClaw Hook",
  "description": "Observability plugin for OpenClaw with Arize.",
  "configSchema": {
    "type": "object",
    "additionalProperties": false
  }
}

Note: The plugin id should match the name of the npm project.

3. Define the Plugin Entry

Next, create an index.ts file to define the plugin:

import { definePluginEntry } from "openclaw/plugin-sdk/plugin-entry";

export default definePluginEntry({
    id: "arize_openclaw_hook",
    name: "Arize OpenClaw Hook",
    description: "Observability plugin for OpenClaw with Arize.",
    register(api) {
        api.registerService(arizeService(api));
    },
});

The definePluginEntry function accepts an object with the plugin’s metadata and a register method. This method is used to register the plugin’s functionality with OpenClaw.

4. Implement the Plugin Service

We will register our plugin as a background service using api.registerService. A service must return an OpenClawPluginService object:

export type OpenClawPluginService = {
    id: string;
    start: (ctx: OpenClawPluginServiceContext) => void | Promise<void>;
    stop?: (ctx: OpenClawPluginServiceContext) => void | Promise<void>;
};

id: Uniquely identifies the service.
start: Executed when the plugin starts.
stop: Optional method executed during shutdown for cleanup.

Here is a minimal implementation:

function arizeService(api: OpenClawPluginApi): OpenClawPluginService {

    api.logger.info("Initializing Arize OpenClaw Hook service...");

    return {
        id: "arize-openclaw",
        async start(ctx: OpenClawPluginServiceContext) {
            const log = { info: ctx.logger.info.bind(ctx.logger) };

            log.info("Arize OpenClaw Hook service is starting...");
        },
        async stop(ctx: OpenClawPluginServiceContext) {
            const log = { info: ctx.logger.info.bind(ctx.logger) };

            log.info("Arize OpenClaw Hook service is stopping...");
        }
    };
}

This service simply logs messages using the OpenClaw logger. The api.logger is available during registration, while ctx.logger is used within the start and stop lifecycle methods.
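To see the lifecycle without running the gateway, a service of this shape can be driven by hand. The sketch below uses local stand-in types (Logger, ServiceContext, PluginService) that copy only the fields shown above, so it runs without OpenClaw installed. demoService and runLifecycle are hypothetical helpers, and the real SDK types may carry more fields.

```typescript
// Local stand-ins for the SDK types shown above, so the sketch runs on its own.
type Logger = { info: (message: string) => void };
type ServiceContext = { logger: Logger };
type PluginService = {
  id: string;
  start: (ctx: ServiceContext) => void | Promise<void>;
  stop?: (ctx: ServiceContext) => void | Promise<void>;
};

// A service with the same structure as the one in the article.
function demoService(): PluginService {
  return {
    id: "demo-service",
    async start(ctx) {
      ctx.logger.info("starting");
    },
    async stop(ctx) {
      ctx.logger.info("stopping");
    },
  };
}

// Drive the lifecycle manually: start, then stop if the service defines it.
async function runLifecycle(service: PluginService, ctx: ServiceContext): Promise<void> {
  await service.start(ctx);
  await service.stop?.(ctx);
}

// A fake context whose logger records messages instead of printing them.
const lines: string[] = [];
const fakeCtx: ServiceContext = {
  logger: {
    info: (message: string) => {
      lines.push(message);
    },
  },
};

runLifecycle(demoService(), fakeCtx).then(() => console.log(lines.join(" -> ")));
```

A fake logger like this also makes it easy to assert on what a service logs during start and stop.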

5. Adding Hooks to the Service

Next, we attach hooks to listen for specific OpenClaw events:

llm_input – Triggered when the LLM receives a prompt.
llm_output – Triggered when the LLM generates a response.
before_tool_call – Triggered before the agent invokes a tool.

import { definePluginEntry } from "openclaw/plugin-sdk/plugin-entry";
import type { OpenClawPluginApi, OpenClawPluginService, OpenClawPluginServiceContext } from "openclaw/plugin-sdk";


function arizeService(api: OpenClawPluginApi): OpenClawPluginService {

    api.logger.info("Initializing Arize OpenClaw Hook service...");

    api.on("llm_input", (event) => {
       api.logger.info(`Prompt: ${event.prompt}`);
       api.logger.info(`Model name: ${event.model}`);
       api.logger.info(`Model provider: ${event.provider}`);
    });

    api.on("llm_output", (event) => {
        api.logger.info(`Assistant Texts: ${event.assistantTexts.toString()}`);
    });

    api.on("before_tool_call", (event) => {
        api.logger.info(`Tool Name: ${event.toolName}`);
        api.logger.info(`Tool Call ID: ${event.toolCallId}`);
    });

    return {
        id: "arize-openclaw",
        async start(ctx: OpenClawPluginServiceContext) {
            const log = { info: ctx.logger.info.bind(ctx.logger) };
            log.info("Arize OpenClaw Hook service is starting...");
        },
        async stop(ctx: OpenClawPluginServiceContext) {
            const log = { info: ctx.logger.info.bind(ctx.logger) };
            log.info("Arize OpenClaw Hook service is stopping...");
        }
    };
}

export default definePluginEntry({
    id: "arize_openclaw_hook",
    name: "Arize OpenClaw Hook",
    description: "Observability plugin for OpenClaw with Arize.",
    register(api) {
        api.registerService(arizeService(api));
    },
});

Event Data Overview

llm_input: Contains the prompt, model name, and provider.
llm_output: Includes the assistant’s generated response.
before_tool_call: Provides details about the tool being invoked, such as its name and call ID.
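Prompts can get long, so it may help to shorten them before they hit the logs. The sketch below models only the event fields used in this article and formats an llm_input event into log lines. The type names, truncate, formatLlmInput, and the 80-character default are all assumptions made for illustration.

```typescript
// Illustrative payload shapes based only on the fields used in this article;
// the real SDK event types may carry more fields.
type LlmInputEvent = { prompt: string; model: string; provider: string };

// Shorten long text so a single log line stays readable.
function truncate(text: string, maxLength: number): string {
  return text.length <= maxLength ? text : text.slice(0, maxLength) + "...";
}

// Turn an llm_input event into the log lines emitted by the hook.
function formatLlmInput(event: LlmInputEvent, maxPromptLength = 80): string[] {
  return [
    `Prompt: ${truncate(event.prompt, maxPromptLength)}`,
    `Model name: ${event.model}`,
    `Model provider: ${event.provider}`,
  ];
}

const demoLines = formatLlmInput(
  { prompt: "Summarize my unread messages", model: "demo-model", provider: "demo" },
  10
);
console.log(demoLines.join("\n"));
```

Inside the hook, each returned line could be passed to api.logger.info in turn.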

6. Installing and Testing the Plugin

Once the implementation is complete, install the plugin locally:

openclaw plugins install .

After installation, restart the OpenClaw gateway to activate the plugin:

openclaw gateway restart

Next, we need to open the OpenClaw dashboard to communicate with the agent. We can do this using the following command:

openclaw dashboard

Once the dashboard is open, you can send a message to the agent from the Chat section in the sidebar.

Next, go to Logs in the sidebar and verify that the data was logged.

From the image above, we can see that the data was successfully logged, confirming that our plugin works as expected. Next, we need to add instrumentation so that instead of being logged locally, the data is sent to the Arize server.

Hooks + Observability = Better Agent

In this section, we’ll extend the plugin from the previous section and add instrumentation so we can observe the agent’s behavior in Arize instead of relying only on local logs.

Installing Instrumentation Dependencies

First, install the required OpenTelemetry and Arize instrumentation packages:

npm install @opentelemetry/api @opentelemetry/exporter-trace-otlp-proto @opentelemetry/resources @opentelemetry/sdk-trace-base @opentelemetry/sdk-trace-node @opentelemetry/semantic-conventions @arizeai/openinference-semantic-conventions openai

Configuring Arize Credentials

Next, we need to configure the credentials required to send traces to Arize:

ARIZE_API_KEY: Your Arize API key
ARIZE_SPACE_ID: Your Arize space ID
ARIZE_PROJECT_NAME: The name of your project

Normally, these would be stored in an environment file. However, since we are working within OpenClaw, the better approach is to expose them as configurable plugin values via the manifest file.

We update openclaw.plugin.json to include these configuration values:

{
  "id": "arize_openclaw_hook",
  "name": "Arize OpenClaw Hook",
  "description": "Observability plugin for OpenClaw with Arize.",
  "configSchema": {
    "type": "object",
    "additionalProperties": false,
    "properties": {
      "spaceID": {
        "type": "string"
      },
      "apiKey": {
        "type": "string"
      },
      "projectName": {
        "type": "string"
      }
    }
  },
  "uiHints": {
    "apiKey": {
      "label": "Arize API Key",
      "placeholder": "your-api-key",
      "sensitive": true
    },
    "spaceID": {
      "label": "Arize Space ID",
      "placeholder": "your-space-id",
      "help": "The ID of the Arize space to use."
    },
    "projectName": {
      "label": "Project Name",
      "placeholder": "openclaw"
    }
  }
}

These values can now be set either in openclaw.json or directly from the dashboard thanks to the UI hints defined above.
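One way to combine these configured values with the environment variables listed earlier is a small resolver that prefers the plugin config and falls back to the environment. This is a sketch under assumptions: how OpenClaw hands the config object to a plugin is not shown in this article, so both sources are passed in as plain objects, and resolveArizeSettings is a hypothetical helper. The "openclaw" default for the project name is borrowed from the placeholder in the UI hints.

```typescript
// Shapes for the three values defined in the manifest's configSchema.
type ArizePluginConfig = { spaceID?: string; apiKey?: string; projectName?: string };
type Env = Record<string, string | undefined>;
type ArizeSettings = { spaceID: string; apiKey: string; projectName: string };

// Hypothetical helper: plugin config wins, environment variables are the fallback.
function resolveArizeSettings(config: ArizePluginConfig, env: Env): ArizeSettings {
  const settings: ArizeSettings = {
    spaceID: config.spaceID ?? env["ARIZE_SPACE_ID"] ?? "",
    apiKey: config.apiKey ?? env["ARIZE_API_KEY"] ?? "",
    // Assumed default, taken from the placeholder in uiHints.
    projectName: config.projectName ?? env["ARIZE_PROJECT_NAME"] ?? "openclaw",
  };

  // Fail early instead of sending traces with empty credentials.
  const missing = (["spaceID", "apiKey"] as const).filter((key) => settings[key] === "");
  if (missing.length > 0) {
    throw new Error(`Missing Arize settings: ${missing.join(", ")}`);
  }
  return settings;
}

const resolved = resolveArizeSettings(
  { spaceID: "space-from-config" },
  { ARIZE_SPACE_ID: "space-from-env", ARIZE_API_KEY: "key-from-env" }
);
console.log(resolved.spaceID, resolved.projectName);
```

Failing early on missing credentials keeps a misconfigured plugin from silently exporting nothing.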

Adding Instrumentation

Next, we implement instrumentation to connect OpenClaw to Arize. This involves initializing an OpenTelemetry tracer that sends spans to the Arize server.

function arizeService(api: OpenClawPluginApi): OpenClawPluginService {

    const COLLECTOR_ENDPOINT = "https://otlp.arize.com";
    const SERVICE_NAME = process.env.ARIZE_PROJECT_NAME || "";

    api.logger.info(`Initializing Arize OpenClaw Hook with service name: ${SERVICE_NAME}`);

    const provider = new NodeTracerProvider({
        resource: resourceFromAttributes({
            [ATTR_SERVICE_NAME]: SERVICE_NAME,
            [SEMRESATTRS_PROJECT_NAME]: SERVICE_NAME,
        }),
        spanProcessors: [
            new SimpleSpanProcessor(
                new OTLPTraceExporter({
                    url: `${COLLECTOR_ENDPOINT}/v1/traces`,
                    headers: {
                        'space_id': process.env.ARIZE_SPACE_ID || "",
                        'api_key': process.env.ARIZE_API_KEY || "",
                    },
                })
            ),
        ],
    });

    provider.register();

    const tracer = trace.getTracer(SERVICE_NAME); 

    return {
        id: "arize-openclaw",
        async start(ctx: OpenClawPluginServiceContext) {
            const log = { info: ctx.logger.info.bind(ctx.logger) };

            log.info("Arize OpenClaw Hook service is starting...");
        },
        async stop(ctx: OpenClawPluginServiceContext) {
            const log = { info: ctx.logger.info.bind(ctx.logger) };
            log.info("Arize OpenClaw Hook service is stopping...");
        }
    };
}

At this point, we have instrumentation configured, but no hooks attached yet.

Connecting Hooks to Traces

In the code below, we add our hooks as before, but this time we use the tracer to create spans that instrument all calls and send them to Arize.

function arizeService(api: OpenClawPluginApi): OpenClawPluginService {

    // Remaining code above

    api.on("llm_input", (event) => {
        tracer.startActiveSpan("call-llm", async (span) => {
            span.setAttribute(SemanticConventions.OPENINFERENCE_SPAN_KIND, OpenInferenceSpanKind.CHAIN);
            span.setAttribute(INPUT_VALUE, event.prompt);
            span.setAttribute(LLM_MODEL_NAME, event.model);
            span.setAttribute(LLM_PROVIDER, event.provider);


            api.on("llm_output", (event) => {
                span.setAttribute(OUTPUT_VALUE, event.assistantTexts.toString());
                span.setStatus({ code: SpanStatusCode.OK });
                span.end();
            });
        });
    });

    api.on("before_tool_call", (event) => {
        const span = tracer.startSpan("tool-call");
        span.setAttribute(SemanticConventions.OPENINFERENCE_SPAN_KIND, OpenInferenceSpanKind.TOOL);
        span.setAttribute(TOOL_NAME, event.toolName);
        span.setAttribute(TOOL_CALL_ID, event.toolCallId || "unknown");
        span.setStatus({ code: SpanStatusCode.OK });
        span.end();
    });

    // Remaining code below
}
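One thing to be aware of in the snippet above: api.on("llm_output", ...) is called inside the llm_input handler, so a new output listener is registered for every prompt. A possible alternative is to register each hook once and pair inputs with outputs through a queue. The sketch below assumes outputs arrive in the same order as their inputs; if the events carry a run or session identifier, keying a Map on that would be more robust. SpanLike, TracerLike, and LlmSpanPairer are hypothetical names that mirror only the subset of the OpenTelemetry API used above, so the sketch runs without OpenTelemetry installed.

```typescript
// Minimal span and tracer shapes, mirroring the calls used in the article.
interface SpanLike {
  setAttribute(key: string, value: string): void;
  end(): void;
}
interface TracerLike {
  startSpan(name: string): SpanLike;
}

// Pairs each llm_input with the next llm_output (first in, first out).
class LlmSpanPairer {
  private tracer: TracerLike;
  private open: SpanLike[] = [];

  constructor(tracer: TracerLike) {
    this.tracer = tracer;
  }

  onInput(event: { prompt: string }): void {
    const span = this.tracer.startSpan("call-llm");
    // Attribute keys are plain strings here; the plugin uses the imported constants.
    span.setAttribute("input.value", event.prompt);
    this.open.push(span);
  }

  onOutput(event: { assistantTexts: string[] }): void {
    const span = this.open.shift();
    if (!span) return; // an output with no matching input is ignored
    span.setAttribute("output.value", event.assistantTexts.toString());
    span.end();
  }
}

// A fake tracer that records finished spans, for demonstration only.
const finished: string[] = [];
const fakeTracer: TracerLike = {
  startSpan(name: string): SpanLike {
    const attrs: Record<string, string> = {};
    return {
      setAttribute(key: string, value: string): void {
        attrs[key] = value;
      },
      end(): void {
        finished.push(`${name}: ${attrs["input.value"]} -> ${attrs["output.value"]}`);
      },
    };
  },
};

const pairer = new LlmSpanPairer(fakeTracer);
pairer.onInput({ prompt: "q1" });
pairer.onInput({ prompt: "q2" });
pairer.onOutput({ assistantTexts: ["a1"] });
pairer.onOutput({ assistantTexts: ["a2"] });
console.log(finished);
```

In the plugin, the two methods would be wired up once, for example api.on("llm_input", (event) => pairer.onInput(event)) and likewise for llm_output.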

Here is the full implementation:

import { definePluginEntry } from "openclaw/plugin-sdk/plugin-entry";
import type { OpenClawPluginApi, OpenClawPluginService, OpenClawPluginServiceContext } from "openclaw/plugin-sdk";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-proto";
import { resourceFromAttributes } from "@opentelemetry/resources";
import { SimpleSpanProcessor } from "@opentelemetry/sdk-trace-base";
import { NodeTracerProvider } from "@opentelemetry/sdk-trace-node";
import { ATTR_SERVICE_NAME } from "@opentelemetry/semantic-conventions";
import { SEMRESATTRS_PROJECT_NAME } from "@arizeai/openinference-semantic-conventions";


import { trace, SpanStatusCode } from "@opentelemetry/api";
import {
    INPUT_VALUE, OUTPUT_VALUE, LLM_MODEL_NAME, LLM_PROVIDER, SemanticConventions, TOOL_NAME, TOOL_PARAMETERS, TOOL_CALL_ID,
    OpenInferenceSpanKind
} from "@arizeai/openinference-semantic-conventions";


function arizeService(api: OpenClawPluginApi): OpenClawPluginService {

    const COLLECTOR_ENDPOINT = "https://otlp.arize.com";
    const SERVICE_NAME = process.env.ARIZE_PROJECT_NAME || "";

    api.logger.info(`Initializing Arize OpenClaw Hook with service name: ${SERVICE_NAME}`);

    const provider = new NodeTracerProvider({
        resource: resourceFromAttributes({
            [ATTR_SERVICE_NAME]: SERVICE_NAME,
            [SEMRESATTRS_PROJECT_NAME]: SERVICE_NAME,
        }),
        spanProcessors: [
            new SimpleSpanProcessor(
                new OTLPTraceExporter({
                    url: `${COLLECTOR_ENDPOINT}/v1/traces`,
                    headers: {
                        'space_id': process.env.ARIZE_SPACE_ID || "",
                        'api_key': process.env.ARIZE_API_KEY || "",
                    },
                })
            ),
        ],
    });

    provider.register();

    const tracer = trace.getTracer(SERVICE_NAME);


    api.on("llm_input", (event) => {
        tracer.startActiveSpan("call-llm", async (span) => {
            span.setAttribute(SemanticConventions.OPENINFERENCE_SPAN_KIND, OpenInferenceSpanKind.CHAIN);
            span.setAttribute(INPUT_VALUE, event.prompt);
            span.setAttribute(LLM_MODEL_NAME, event.model);
            span.setAttribute(LLM_PROVIDER, event.provider);


            api.on("llm_output", (event) => {
                span.setAttribute(OUTPUT_VALUE, event.assistantTexts.toString());
                span.setStatus({ code: SpanStatusCode.OK });
                span.end();
            });
        });
    });

    api.on("before_tool_call", (event) => {
        const span = tracer.startSpan("tool-call");
        span.setAttribute(SemanticConventions.OPENINFERENCE_SPAN_KIND, OpenInferenceSpanKind.TOOL);
        span.setAttribute(TOOL_NAME, event.toolName);
        span.setAttribute(TOOL_CALL_ID, event.toolCallId || "unknown");
        span.setStatus({ code: SpanStatusCode.OK });
        span.end();
    });


    return {
        id: "arize-openclaw",
        async start(ctx: OpenClawPluginServiceContext) {
            const log = { info: ctx.logger.info.bind(ctx.logger) };

            log.info("Arize OpenClaw Hook service is starting...");
        },
        async stop(ctx: OpenClawPluginServiceContext) {
            const log = { info: ctx.logger.info.bind(ctx.logger) };
            log.info("Arize OpenClaw Hook service is stopping...");
        }
    };
}

export default definePluginEntry({
    id: "arize_openclaw_hook",
    name: "Arize OpenClaw Hook",
    description: "Observability plugin for OpenClaw with Arize.",
    register(api) {
        api.registerService(arizeService(api));
    },
});

Testing the Plugin

Now that we have observability in place, let’s test the plugin again.

In the OpenClaw dashboard, I created a new chat session. When a session starts, OpenClaw loads memory and other contextual data using internal tools.

Switching to the Arize dashboard, we can see that all tool calls and the new session events are being sent successfully.

If we inspect the session creation event, we can see:

Input message captured
Output recorded
Additional metadata such as latency and execution details

You can get the full source code implementation here on GitHub.

Conclusion

OpenClaw’s hooks are a powerful feature that let you tap directly into the system’s internal events. By wrapping hooks inside a plugin, you can fully customize your OpenClaw agent and extend its capabilities in meaningful ways.

In this article, we used that flexibility to add observability to our agent using Arize. Observability is critical when working with systems like OpenClaw, where agents are complex, autonomous, and often operate behind multiple layers of abstraction. With the right instrumentation in place, you move from guessing what your agent is doing to clearly understanding its behavior in real time.

Building A Voice AI Agent with OpenClaw and AssemblyAI

Youdiowei Eteimorde — Fri, 17 Apr 2026 05:37:29 +0000

OpenClaw went viral this year because of its simplicity in allowing users to communicate with AI agents running on their own hardware. It wasn’t a new model, a new architecture, or a new agentic protocol. Instead, it demonstrated a new way people could work with AI agents using technology they were already familiar with.

OpenClaw allows users to communicate with their agents through chat apps such as Telegram and WhatsApp. A user can simply pick up their device and send a text message to the AI agent. This ease of use is what attracted many users to OpenClaw.

But can this interaction be simplified even further? Yes, it can. One way to do this is by turning your OpenClaw agent into a voice AI agent. Since OpenClaw already works through chat apps, and most of these apps support voice notes, voice interaction becomes a natural extension of the system.

In this article, we will show you how to set up OpenClaw as a voice AI agent. We will also demonstrate how to bring your own speech-to-text model and integrate it with OpenClaw. The model we will use is Universal-3 Pro, and we will explore how its prompting capabilities can be used to create a more customized voice interaction experience.

OpenClaw in a nutshell

There is a lot of confusion surrounding OpenClaw. Even the name itself creates confusion. So in this section, we will do a quick breakdown of what OpenClaw is and how it works.

What is OpenClaw?

You can think of OpenClaw as a gateway between your chat app and your AI agent.

The chat app can be Telegram, WhatsApp, or Slack. The AI agent can be powered by cloud-based Large Language Models (LLMs) such as those provided by Anthropic or OpenAI, or even by a locally hosted model.

The AI agent also has access to a computer system. This could be your personal computer (though this is not advisable), a Mac Mini, a Raspberry Pi, or a cloud server.

The OpenClaw setup consists of the following:

The chat application serves as the user interface
OpenClaw acts as the orchestrator
The AI agent is the brain
The computer serves as the universal tool

What makes OpenClaw different?

Agents are not a new concept. Agents that have access to a computer are not new either, and chatting with AI is certainly not new.

However, two things make OpenClaw stand out.

The first is the medium of communication. Unlike many chatbots that require you to use a separate app or a dedicated website, OpenClaw allows you to communicate with your agent through the chat apps you already use.

The second difference is that the OpenClaw agent is more proactive. It is not just another chat session. The agent can maintain memory, send reminders about tasks it is working on, and interact with the computer it has access to.

Since the agent has access to the system, it can perform actions such as reading files, editing files, and running commands. In many ways, OpenClaw feels like giving a personal computer to an AI assistant.

Setting up OpenClaw

There are several ways to install OpenClaw. The easiest is to install it on your personal computer, but this is not advisable: the AI agent will have full control over the machine, and a number of security experts have warned of several security vulnerabilities in OpenClaw.

That said, running it on your personal computer is the fastest way to experiment with it. Another option is to use a dedicated computer for OpenClaw, such as a Mac Mini or a Raspberry Pi. A third option is to run OpenClaw inside a Docker container so that it is sandboxed.

If you’re on Mac or Linux, you can install OpenClaw with this one-liner:

curl -fsSL https://openclaw.ai/install.sh | bash

Then set it up by running:

openclaw onboard --install-daemon

This will prompt you to configure your model. The --install-daemon flag sets up OpenClaw as a background service, so it runs automatically whenever your device starts.

Once setup is complete, you can confirm everything is running with:

openclaw gateway status

For other installation methods, refer to the official OpenClaw installation guide.

Setting up a channel for communication

Once OpenClaw is installed, the next step is to set up a communication channel. This is essentially the chat app you want to use to talk to OpenClaw.

All channels support communication via text, but since we are working with a voice agent, we want one that also supports other media types, such as audio. Telegram is perfect for this because it is the easiest channel to set up, and you can send voice notes to OpenClaw through it.

Follow the Telegram setup guide in the OpenClaw documentation.

OpenClaw’s media understanding capabilities

OpenClaw’s media understanding capabilities allow it to process more than just text. When it receives a media file, such as an image or audio, it can use one of its model providers to transform it into a format the agent can understand.

For example, if OpenClaw receives a voice note from a channel like Telegram, it will use a speech-to-text (STT) model to convert the audio into text before passing it to the LLM. Similarly, if it receives an image, it can summarize the content using an image model and send that information to the agent. In this article, we are focusing on the audio understanding aspect of OpenClaw.

By default, OpenClaw supports a limited set of STT providers, including OpenAI, Mistral Voxtral, and Deepgram. In this guide, we’ll go a step further by integrating a custom STT model, extending OpenClaw beyond its built-in options.

Bring your own Speech To Text model

There are several ways to extend OpenClaw’s capabilities, one of which is via plugins. While writing a plugin to perform media understanding is possible, it is often overkill. OpenClaw already provides a built-in way to extend media understanding using a custom script.

With a custom script, you simply tell OpenClaw that whenever it receives an audio file, it should run the script. The script processes the audio and returns the transcribed text. All the heavy lifting is handled by OpenClaw. You just need to write the script and configure the openclaw.json.

Since we get to write the script, we can choose any STT model provider. In this guide, we will use AssemblyAI.

Step 1: Set Up Your Environment

It is best to create a dedicated Python environment first. Then, install the AssemblyAI SDK:

pip install assemblyai

Next, create an AssemblyAI API key and store it in an environment variable:

export ASSEMBLYAI_API_KEY="your_api_key_here"

For global access, it is recommended to add this line to your .bashrc or .zshrc file.

Step 2: Create the Transcription Script

Create a Python file called main.py and add the following:

import argparse
import os
import sys
import assemblyai as aai


def main():
    # 1. Set up the argument parser
    parser = argparse.ArgumentParser(
        description="Transcribe an audio file using AssemblyAI."
    )

    # Add positional argument for the audio file path
    parser.add_argument(
        "audio_file", 
        type=str, 
        help="Path to the audio file you want to transcribe (e.g., ./voice_note.ogg)"
    )

    # Add optional argument for the API key
    parser.add_argument(
        "--api-key", 
        type=str, 
        help="Your AssemblyAI API key (can also be set via ASSEMBLYAI_API_KEY env variable)",
        default=None
    )

    args = parser.parse_args()

    # 2. Configure API Key
    api_key = args.api_key or os.environ.get("ASSEMBLYAI_API_KEY")
    if not api_key:
        # Errors go to stderr so that stdout only ever carries the transcript
        print("Error: API key is missing.", file=sys.stderr)
        print("Please set the ASSEMBLYAI_API_KEY environment variable or pass it via --api-key.", file=sys.stderr)
        sys.exit(1)

    aai.settings.api_key = api_key

    # 3. Configure and run the transcription
    config = aai.TranscriptionConfig(
        speech_models=["universal-3-pro"],
        language_detection=True,
        prompt="Transcribe the audio. Make sure to include fillers and stutters in the transcript.",
    )

    # Progress messages also go to stderr, on the assumption that OpenClaw
    # treats whatever the script writes to stdout as the transcript
    print(f"Transcribing '{args.audio_file}'... Please wait.", file=sys.stderr)

    try:
        transcript = aai.Transcriber(config=config).transcribe(args.audio_file)

        if transcript.status == "error":
            raise RuntimeError(f"Transcription failed: {transcript.error}")

        # Only the transcript itself is written to stdout
        print(transcript.text)

    except Exception as e:
        print(f"\nAn error occurred: {e}", file=sys.stderr)
        sys.exit(1)

if __name__ == "__main__":
    main()

This script takes the path to an audio file, transcribes it using AssemblyAI, and prints the result.

Step 3: Configure OpenClaw

Next, integrate the script with OpenClaw by editing openclaw.json:

"tools":{
    "media": {
      "audio": {
        "enabled": true,
        "models": [
          {
            "type": "cli",
            "command": "python",
            "args": ["/PATH/TO/SCRIPT/main.py", "{{MediaPath}}"]
          }
        ]
      }
    }
  }

This configuration tells OpenClaw to enable audio understanding, but instead of using a default provider, it will run your custom script.
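Conceptually, a cli entry like this amounts to substituting the audio file's path for the {{MediaPath}} placeholder, running the command, and treating what the script prints as the transcript. The sketch below only illustrates that idea. It is not OpenClaw's actual implementation, and run_cli_model is a hypothetical helper name.

```python
import subprocess


def run_cli_model(command, args, media_path):
    """Illustrative only: fill in the {{MediaPath}} placeholder and run the command.

    This mimics what the config entry describes; it is not OpenClaw's code.
    """
    # Replace the placeholder in every argument with the real file path
    resolved = [arg.replace("{{MediaPath}}", media_path) for arg in args]

    # Run the script and capture what it prints to standard output
    result = subprocess.run(
        [command, *resolved], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()
```

Seen this way, everything the script writes to standard output is a candidate for the transcript, so it is worth keeping the script's output as clean as possible.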

Tip: If you are using a Python virtual environment, set the command to the full path of your environment’s Python binary. You can find it with:

whereis python

Step 4: Restart OpenClaw

After setup, restart OpenClaw to apply the changes:

openclaw daemon restart

With this setup, you now have full control over OpenClaw’s audio understanding capabilities.

You can find the complete implementation in the GitHub repository.

Why Use Universal-3 Pro with OpenClaw?

A common question when working with OpenClaw’s media capabilities is: why switch to a different STT model? After all, don’t all STT models just convert speech to text?

The answer is no. Different STT models have different strengths and trade-offs, for example:

Speed: Some models prioritize fast transcription, making them suitable for real-time applications.
Accuracy (WER): Others focus on achieving a low Word Error Rate, improving transcription quality.
Domain specialization: Certain models are optimized for specific areas such as medicine, legal, or customer support.
Customization: Some models allow fine-tuning or prompting to handle unique names, jargon, or phrases.
Deployment preference: Developers may prefer local models for privacy, control, or cost reasons.

In this article, we use AssemblyAI’s Universal-3 Pro because of its powerful prompting capabilities. For example, my name is Eteimorde. It is not an English name and rarely appears in standard datasets.

While building my personal voice AI agent with OpenClaw, I noticed that default STT models consistently misheard my name. To solve this, I used Universal-3 Pro’s keyterm prompting feature to explicitly define my name as an important term:

config = aai.TranscriptionConfig(
    speech_models=["universal-3-pro"], 
    language_detection=True, 
    keyterms_prompt=["Eteimorde"]
)

Additional Capabilities of Universal-3 Pro via Prompting

Universal-3 Pro provides advanced features that can be easily leveraged through prompting. You can customize the behavior of the model by updating the prompt in the transcription configuration:

config = aai.TranscriptionConfig(
    speech_models=["universal-3-pro"], 
    language_detection=True, 
    prompt="YOUR_PROMPT_GOES_HERE"
)

Using prompting, the model can perform the following tasks:

Verbatim transcription and disfluencies
Preserve natural speech patterns such as filler words, repetitions, and self-corrections.
Audio event tagging
Mark non-speech sounds like laughter, music, applause, or background noise.
Crosstalk labeling
Identify overlapping speech, interruptions, and multiple speakers talking at once.
Numbers and measurements formatting
Control how numbers, percentages, and measurements are represented.
Context-aware clues
Improve transcription for domain-specific terms, names, and jargon by providing relevant hints in the prompt.
Speaker attribution
Detect and label different speakers in a conversation.
PII redaction
Tag personally identifiable information such as names, addresses, and contact details, which is useful for limiting what the agent can access.

By using prompting, these capabilities allow your OpenClaw voice agent to become more accurate, context-aware, and personalized, going beyond the default transcription behavior.

Conclusion

OpenClaw makes it easy to run AI agents through chat apps you already use, and adding voice capabilities takes the interaction to a whole new level. By integrating your own speech-to-text models, such as Universal-3 Pro, you unlock features beyond OpenClaw’s built-in media understanding.

Its prompting capabilities allow users to customize how the model transcribes audio, accurately recognize custom keyterms, and leverage features like verbatim transcription to preserve natural speech and audio event tagging to capture non-speech context such as background noise or laughter.

With this setup, your OpenClaw agent behaves more like a true personal assistant. It can remember context, send proactive reminders, and leverage system tools to perform tasks. Voice interaction, combined with Universal-3 Pro’s advanced prompting features, transforms the agent from a simple chat companion into a more robust, seamless, and highly personalized experience.

How AI generates images from noise

Youdiowei Eteimorde — Sun, 28 Jan 2024 14:38:51 +0000

Generative AI models like DALL-E and Stable Diffusion generate jaw-dropping and mind-boggling images. While the results may seem magical, understanding how these models work often involves delving into mathematical concepts. In this article, I will break down the workings of these models for non-technical readers, aiming to demystify the complexities behind their magic.

Diffusion models

Models like DALL-E belong to a class of machine learning models called diffusion models. They are currently among the best performing models when it comes to generating realistic images.

Diffusion models were inspired by the physical phenomenon known as diffusion. In physics, diffusion is the process by which particles move from an area of higher concentration to an area of lower concentration.

For instance, when you spray paint on a wall, it moves from the paint container where there's a high amount of paint to the wall where there isn't any.

In machine learning, diffusion models aren't so different. Their task involves adding noise to an image incrementally at different time steps until the image becomes barely visible.

This process is called forward diffusion. In the picture above, at time step 1, the image is clean, but starting from time step 2, we begin incrementally adding noise. This is similar to spraying paint on a wall, which is a form of diffusion in physics. In our case, we are spraying the "paint" on the image.
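For readers who would like to see the idea in code, here is a minimal sketch of forward diffusion on a tiny "image" represented as a list of pixel values. The article does not give a formula, so this assumes a commonly used update in which each step slightly shrinks the image and mixes in Gaussian noise. The names add_noise and forward_diffusion and the value of beta are illustrative choices, not part of any particular model.

```python
import math
import random


def add_noise(pixels, beta, rng):
    """One forward-diffusion step: shrink the image slightly and mix in Gaussian noise."""
    keep = math.sqrt(1.0 - beta)  # how much of the current image survives
    mix = math.sqrt(beta)         # how much fresh noise is added
    return [keep * p + mix * rng.gauss(0.0, 1.0) for p in pixels]


def forward_diffusion(pixels, steps, beta=0.3, seed=0):
    """Return the image at every time step, starting with the clean image."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    history = [list(pixels)]
    for _ in range(steps):
        history.append(add_noise(history[-1], beta, rng))
    return history
```

Calling forward_diffusion([1.0, -1.0, 1.0, -1.0], steps=4) returns five images: the clean one followed by four progressively noisier versions.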

You might be wondering how this principle helps AI generate new images. Well, forward diffusion itself doesn't, but the reverse process does.

Reverse diffusion

Reverse diffusion is the process of taking a noisy image and reverting it back to the prior time step.

The image above reverses the noise from time step 4 back to 3. So, how is this helpful in generating new images? Well, let's walk through one scenario.

A friend presents us with the image above and challenges us to restore the image to its original form. We look at the image. Since parts of it are still visible, it shouldn't be too hard to infer the missing parts. So, we accept their challenge.

Here's how we are going to fill in the gaps. We start by breaking the noise removal process into time steps, and at each time step, we decide what part of the image to fill. Let's divide the time steps into 5. The initial noisy image is at time step 5, and time step 1 is when we have removed all the noise present in the image.

In our initial noisy image, which is at time step 5, we notice the stick figure is holding something in its left arm. We can make the assumption that it is a suitcase or a box, so we can remove the noise by completing the drawing. This will result in a new image, now at time step 4.

At time step 4, we have less noise. Let's remove the noise at the stick figure's foot. Since we can't see the direction of his right foot, we can make the assumption it is facing the same direction as his other foot. So, we remove the noise and draw his new foot again based on our assumptions. This will give us the image at time step 3.

At time step 3, we have to decide how to go about drawing the stick figure's second arm. What direction should we place his arm? Let's look at the noise. From the shape of the noise, we can assume the figure's arm is raised. The figure is apparently holding a suitcase, so that means the figure might be a businessman. What do businessmen tend to do? They tend to make calls a lot. So, we can assume this figure is making a phone call with its other arm. We draw this, clean off the noise, and move on to time step 2.

At time step 2, our diagram is now pretty clear, with just a little noise left that we can easily remove. This will give us our final image at time step 1 with no noise at all.

How were we able to fill in the gaps? Well, it came from our knowledge of the world. Over the course of our lives, we have seen a lot of stick figures, and we have an understanding of how stick figures should appear. We made assumptions based on our past experiences in life.

Our friend who gave us this challenge provides us with the original image and suggests that we compare it with what we had generated through noise removal.

Oops, we were wrong; the stick figure wasn't making a call with his other hand. But would you look at that? While trying to fill in the gaps in the noise, we created a totally new image. In theory, that's how diffusion models generate new data.

If we want a diffusion model to solve this task as we did, we will present it with a lot of image data of stick figures. By a lot, I mean a significant amount, preferably in the thousands or even millions.

By providing the model access to such a dataset, it will develop an understanding of how stick figures should look. So, when we give it an image with noise, it will behave similarly to us and think, based on its training data, what is most likely supposed to fill that noise.

The Need for Time Steps

While generating our image, we divided the process into time steps. Why? Breaking the task into time steps enables the model to gradually think through the task, much like we did. This is what makes diffusion models proficient at generating images compared to other models. At every time step, the model only looks at the image from the prior time step, and uses it to generate the image for the next one. A process with this property is known as a Markov chain.
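The step-by-step idea can be sketched as a simple loop. This is a deterministic toy, not how a real diffusion model is implemented: predict_clean is a hypothetical stand-in for a trained model's guess, and real models also add some randomness at each step. What it does show is that each step uses only the current image and the step number.

```python
def denoise_step(image, t, predict_clean):
    """Move the image one time step closer to what the model believes is clean.

    Only the current image and the step number are used: the Markov property.
    """
    guess = predict_clean(image)
    # With t steps left, close 1/t of the remaining gap to the guess
    return [x + (g - x) / t for x, g in zip(image, guess)]


def reverse_diffusion(noisy_image, steps, predict_clean):
    """Run the denoising steps from the noisiest time step down to the last one."""
    image = list(noisy_image)
    for t in range(steps, 0, -1):  # e.g. 5, 4, 3, 2, 1
        image = denoise_step(image, t, predict_clean)
    return image
```

With a toy "model" that always guesses the same picture, such as lambda image: [1.0, 0.0, 1.0], the loop gradually turns any starting noise into that picture.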

Image Synthesis

In our last task, we focused on restoring an already existing image that was just noisy. Now, let's explore the challenge of creating an image without any preexisting image. The process of generating images from scratch is called image synthesis. This process isn't new; several ML algorithms are capable of performing it. However, diffusion models are currently considered the best at generating realistic images.

So, how can we get our model to make images out of thin air, or should I say, out of thin noise? Before we delve into that, let's consider how humans would achieve this task.

Our friend from before presents us with a highly noisy image, asking us to clean the image once again.

We take a look at the image and tell them it is indeed very noisy. They reply and say they know, but we should examine the image and utilize the patterns in the noise to recreate the image.

Like the previous example, we will divide our noise removal process into time steps. From the noisy image above, I can see a couple of black dots. Let's try to connect these dots and see what we can come up with.

After connecting the dots, we obtain two lines for this current time step. We then move on to the next time step.

In this time step, we connect our other two dots to the currently existing lines. I can definitely see a stick figure now. We can argue that our stick figure has limbs now, but then there's still one dot to connect.

We connect our existing lines to that dot, and we have what appears to be a stick figure with no head. Now, we have to carefully consider where to place its head.

Let's place the head at the bottom of the image. Although it is unlikely for a stick figure to be upside down, it is even more unlikely for us to have a stick figure with longer arms than legs. We can say the stick figure is somersaulting.

We then remove the remaining noise and present the figure to our friend, declaring that we are done. We then ask for the original image so we can compare it with our generated image. However, our friend replies that there was never an original image; all they gave us was noise.

That's unexpected; I could have sworn I saw a stick figure somersaulting in that noise. Well, this is what we humans do. We sometimes see patterns where there are none, and it can lead to some of the most artistic results. Take the ancient Greeks, for instance.

They saw these dots in the night sky and envisioned a hunter holding a club and a shield. This interpretation gave birth to the Orion constellation, even though there wasn't a hunter in the sky; it was just noise in the form of stars.

Over the course of generations (time steps), various artists have reimagined the Orion constellation, making it look more realistic.

This is exactly what diffusion models do. They are initially given noise, but thanks to their vast training data, they start seeing patterns that are clearly not there. These apparent patterns are enough for them to create novel and realistic images.

You are probably asking, 'When I use DALL-E, I provide a prompt, and then it generates an image. How does this work?' Well, it is a text-guided diffusion model. That will be the subject of the next article.

If you found this article interesting or useful, please drop a ❤️. Share it with anyone who is curious about Generative AI. Feel free to drop a comment if you have any questions.

"Everything is a file" Explained

Youdiowei Eteimorde — Fri, 08 Dec 2023 09:52:51 +0000

If you have ever used Linux or any other Unix-like operating system, you have probably come across the saying 'Everything is a file.' But what does that mean? Is everything in these systems actually a file? Well, that is not the case. To understand the saying, we will walk through various definitions that try to answer the question: What is a file?

What Is a File in the Real World

In the world we live in, a file is simply a container for documents. The document contains useful information, usually in the form of pages. There are four fundamental actions you can perform on a file in the real world:

open
read
write
close

The process of working with a file involves these four operations. When given a file, you first open it, then read through it by flipping through the pages of the file. If you have the right permissions, you can write to the file. When you are done, you can then close the file.

Computer files on storage devices

In computer systems, one way to define a file is as any piece of data stored on a storage device.

The same four fundamental operations that can be performed on real-world files can also be performed on computer files. The only difference is how the data is stored. Computer files are stored as streams of data.

They are called streams because instead of storing your file in one big location, the file is subdivided into smaller chunks. These chunks are similar to pages in a real-world file, and the process of streaming through those chunks is analogous to flipping through a file.

When you open a file on a computer, the operating system goes to your storage device and fetches the location of the file. Once its location has been fetched, you can either choose to read from the stream or write to the stream or perform both actions. When you are done with the file, you close it. These operations are exactly the same as the operations performed on files in the real world.

When working with files, the only thing that ever changes is the data source. Your file might be on a hard disk, CD-ROM, or flash drive, but you will still perform the same four fundamental operations, and its content is always streams.

Everything isn't a file; rather, everything is file-like

The heading above is what the saying should have been, but it is a mouthful and isn't catchy. In Unix-like operating systems, everything is said to be a file because you can perform the same set of operations described for files stored on storage devices on other resources. These resources are commonly referred to as file-like.

You can open a particular file-like resource, read from it, write to it, and close it when you are done. The data from that resource is also streamed. The only thing that ever changes is the source of the data.

For instance, when you are typing in your terminal, you are actually writing to a resource called standard input. When data gets displayed to your terminal using the echo command, your operating system is writing to the standard output resource.

Another example of a file-like entity is a network connection to a server. When you make a connection to a server, you are opening a line of communication. You can send network requests to the server, which is equivalent to writing, and when the server responds, you are then reading. After the communication is over, you can then close the line of communication. The data is also in the form of a stream.

This same process applies to the majority of other resources in Unix-like operating systems, like when two processes are communicating; their means of communicating is file-like.
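As a rough illustration, here is a small Python sketch using a pipe, one common mechanism processes use to talk to each other on Unix-like systems. To keep it short, both ends of the pipe live in a single process, and pipe_round_trip is just an illustrative name. The point is that the calls are the same ones used for ordinary files: write, read, and close on file descriptors.

```python
import os


def pipe_round_trip(message: bytes) -> bytes:
    """Send bytes through a pipe using the same calls used for ordinary files."""
    read_fd, write_fd = os.pipe()  # "open": returns two file descriptors
    try:
        # "write" into one end (a short message fits in the pipe's buffer)
        os.write(write_fd, message)
        os.close(write_fd)  # closing the write end signals end of stream
        write_fd = None

        chunks = []
        while True:
            chunk = os.read(read_fd, 1024)  # "read" the stream chunk by chunk
            if not chunk:  # empty result means the stream has ended
                break
            chunks.append(chunk)
        return b"".join(chunks)
    finally:
        if write_fd is not None:
            os.close(write_fd)
        os.close(read_fd)
```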

Reason why everything is a file

Seeing everything as a file is actually a clever design decision made by the creators of Unix. This allows the same set of APIs to be used across different tasks. When you open a file-like entity, it returns a file descriptor. A file descriptor is a unique integer that represents an open file. Standard input has the file descriptor of 0, while standard output has the file descriptor of 1. Several system calls in Unix work with file descriptors.

The example below uses the write system call to write Hello, world! to standard output.



#include <unistd.h>

int main() {
    write(1, "Hello, world!\n", 14);
    return 0;
}

The write system call can also be used to write to a file on disk.



#include <fcntl.h>
#include <unistd.h>

int main() {
    // Get the file descriptor
    int fd = open("output.txt", O_WRONLY | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);

    // write to the file
    write(fd, "Hello, World!\n", 14);

    // close the file
    close(fd);

    return 0;
}

The program above opens a file called output.txt, writes Hello, World! to it, and then closes the file using the close system call. As you can see, both writing to standard output and writing to a file use the same function.

The code snippet below shows a C function that makes a request to a server. It does this by writing to the client socket, which is file-like because creating it returns a file descriptor.



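#include <stdio.h>   // snprintf
#include <string.h>  // strlen
#include <unistd.h>  // write

#define MAX_BUFFER_SIZE 4096  // assumed value; the original snippet does not show it

void send_http_request(int client_socket, const char *host, const char *path){
    // Create the HTTP request
    char request[MAX_BUFFER_SIZE];
    snprintf(request, sizeof(request), "GET %s HTTP/1.1\r\nHost: %s\r\n\r\n", path, host);

    // Send the HTTP request to the server
    write(client_socket, request, strlen(request));
}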

The function below reads from the socket. The loop is used to receive the streams of data as they come.



void receive_http_response(int client_socket){
    // Receive and print the HTTP response from the server
    char response[MAX_BUFFER_SIZE];
    ssize_t bytes_received;

    // Read response streams
    while ((bytes_received = read(client_socket, response, sizeof(response) -  1)) > 0 ) {
        // Add string termination
        response[bytes_received] = '\0';
        printf("%s", response);
    }
}
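The same idea can be shown compactly in Python, whose os module exposes these system calls fairly directly. In this sketch, which assumes a Unix-like system, write_greeting and demo are illustrative names. One function writes to whatever file descriptor it is handed, first a regular file and then a socket, without knowing which is which.

```python
import os
import socket
import tempfile


def write_greeting(fd: int) -> int:
    """Write to any file descriptor; the function does not know what is behind it."""
    return os.write(fd, b"Hello, world!\n")


def demo():
    results = {}

    # 1. A regular file on disk
    path = os.path.join(tempfile.mkdtemp(), "output.txt")
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    write_greeting(fd)
    os.close(fd)
    with open(path, "rb") as f:
        results["file"] = f.read()

    # 2. A connected pair of sockets (available on Unix-like systems)
    left, right = socket.socketpair()
    write_greeting(left.fileno())
    results["socket"] = right.recv(1024)
    left.close()
    right.close()

    return results
```

Both entries in the returned dictionary hold the same bytes, even though one travelled through the file system and the other through a socket.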

So, to return to the question I asked earlier: what is a file? A file is any resource that streams its data content. This content can be read from or written to. A lot of resources on Unix-like operating systems follow this pattern, and that is why everything is a file.

How Python's Multithreading differs from other languages

Youdiowei Eteimorde — Mon, 13 Nov 2023 05:14:39 +0000

Python, like most programming languages, features multithreading. However, unlike many languages, Python's multithreading isn't capable of fully harnessing the potential of the system it operates in. In this post, I will explain why this is the case and how the Global Interpreter Lock (GIL) plays a role in this limitation.

Before we delve deeper into why Python's multithreading has limitations, let's get a couple of key concepts out of the way. This will help us have a better understanding of multithreading in Python.

What is a process?

As programmers, when we write programs and execute them, they become processes running on our CPU. A process has access to its own memory and is isolated from other running processes. At any given time, a single CPU core can run only one process. The image above represents a CPU with four running processes.

So, you might be wondering: if only one process can run at a time, how is it possible for most computers to handle multiple tasks simultaneously? Well, computers have the ability to multitask, and they employ a clever trick that fools us humans into believing they are doing multiple things at the same time. This trick is known as context switching. This technique involves the operating system quickly switching between the currently running processes so fast that we humans perceive it as happening at the same time.

A process can be single-threaded or multithreaded. All the processes running in our image above are single-threaded.

Threads explained

A process usually performs one task at a time. For instance, consider the program below, which calculates the square of natural numbers from 1 to 100.



def calculate_square():
    for i in range(1, 101):
        square = i ** 2
        print(f"The square of {i} is: {square}")

if __name__ == "__main__":
    calculate_square()

When we execute this program, it becomes a process that is performing one task, so we say it is single-threaded. What if we also want to calculate the cube of the numbers at the same time?

Well, that is when we employ multithreading. It allows one process to perform multiple tasks. The program below uses Python's threading library to calculate both the square and cube of the numbers.



import threading

def calculate_square():
    for i in range(1, 101):
        square = i ** 2
        print(f"The square of {i} is: {square}")

def calculate_cube():
    for i in range(1, 101):
        cube = i ** 3
        print(f"The cube of {i} is: {cube}")

if __name__ == "__main__":
    square_thread = threading.Thread(target=calculate_square)
    cube_thread = threading.Thread(target=calculate_cube)

    square_thread.start()
    cube_thread.start()

    # Wait for both threads to complete
    square_thread.join()
    cube_thread.join()

In the picture below, we can see five running processes, but the green process occurs twice. This shows that it is utilizing threading.

One thread is performing the square task, and the other is performing the cube task. Both threads that belong to the green process can share memory and other resources.

As mentioned before, only one process can run on the CPU. Even if the green process is multi-threaded, only one of its threads can run at a time, and it is left for the operating system to select which one.

You are probably thinking it would be so cool if two threads could run at the same time. Well, that is possible if the CPU has multiple cores.

Making threads run in parallel

Modern computers have multiple cores, and each core can run its own individual process. So, a CPU with two cores can handle two tasks simultaneously. In theory, this means our multithreaded program, as written above, can run on two separate cores simultaneously. This is called parallelism, and it is one of the benefits that come with multithreading.

Now, our CPU has two cores, and our green process still has two threads, with each thread on a different core. One core will be performing the square task, and the other will be performing the cube task, all at the same time.

This is awesome; our program is now more efficient. Well, not quite. If you wrote the program in a language like Java or C++, that would be the case, but not in Python. Threads in Python can't run in parallel because of the GIL.

What exactly is the GIL?

It is the Global Interpreter Lock. Basically, what the GIL does is simple; it tells any running Python process that only one of its threads can run at a time, even if it is on a multi-core CPU.

This is how our multithreaded program will actually work in Python. When one of our green process's threads is running, even if the other thread has the opportunity to run on a different core, it would have to wait for the running thread.

This is because the running thread basically grabs the interpreter and locks it, making it impossible for the other thread in the process to have access to the interpreter. The other thread just has to wait for its chance so it can get the interpreter and lock it.

Why the GIL?

Python has the GIL because it was not designed to be thread safe. Multithreading is a very complex feat to achieve. Consider this scenario: you have two threads, thread1 and thread2, each running on separate cores. They both have access to a list, my_list = [1, 2, 3].

If thread1 grabs the list and empties it (my_list = []), and thread2 then tries to access the first index of the now-empty list, thinking it still has values within it, that will be a problem. This is an example of a race condition.

The Python list and other objects in Python weren't designed to be used in a multithreaded environment; that's what it means not to be thread-safe. That is why the Python team decided to lock everything so errors like these won't occur.
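It is worth noting that the GIL protects the interpreter's own internals. It does not make a "check, then use" sequence in your own code safe, so a scenario like the one above is usually handled with an explicit lock. Here is a minimal sketch using threading.Lock, where empty_list and read_first are illustrative names.

```python
import threading

my_list = [1, 2, 3]
list_lock = threading.Lock()


def empty_list():
    """What thread1 does: empty the shared list."""
    with list_lock:
        my_list.clear()


def read_first():
    """What thread2 does: read the first item if there is one."""
    # Checking and indexing happen under one lock, so the list
    # cannot be emptied between the two operations.
    with list_lock:
        return my_list[0] if my_list else None
```

Whichever thread gets the lock first, read_first returns either 1 or None and never raises an IndexError.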

How to escape the GIL

There are several ways of bypassing the GIL. First of all, the GIL is specific to certain implementations of Python. CPython, the C implementation, has one, and so does PyPy, but implementations like Jython and IronPython don't. Additionally, Python provides the multiprocessing library, which allows for parallelism in your Python program.

Also, PEP 703 proposes making the GIL optional in CPython at build time, and CPython 3.13 introduced an experimental free-threaded build based on it.
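As a sketch of the multi-process route, the example below uses ProcessPoolExecutor from the standard library's concurrent.futures module, which is built on top of multiprocessing. Each worker is a separate process with its own interpreter and its own GIL, so the two computations can run on different cores. The function names are illustrative.

```python
from concurrent.futures import ProcessPoolExecutor


def sum_of_squares(n):
    """CPU-bound task: add up the squares of 1..n."""
    return sum(i ** 2 for i in range(1, n + 1))


def sum_of_cubes(n):
    """CPU-bound task: add up the cubes of 1..n."""
    return sum(i ** 3 for i in range(1, n + 1))


def run_in_parallel(n=100):
    """Run both computations in separate processes, each with its own GIL."""
    with ProcessPoolExecutor(max_workers=2) as pool:
        squares = pool.submit(sum_of_squares, n)
        cubes = pool.submit(sum_of_cubes, n)
        return squares.result(), cubes.result()
```

When you try this in a script, call run_in_parallel() from under an if __name__ == "__main__": guard, since worker processes may re-import the main module on some platforms.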

Is multithreading useless in Python?

So, what's the point of having multithreading in Python, you might ask. Multithreading isn't completely useless in Python because it is useful when working on I/O-bound tasks. Imagine if you have two threads: one is performing a massive computation, and another is waiting for the user's input. The thread waiting for the user's input can yield control to the computation thread while the user is providing their input. The same applies when a thread is trying to access the file system, a remote database, or a network. This is where threading shines in Python.
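Here is a small sketch of that I/O-bound case. A real download or database call is replaced with time.sleep, which releases the GIL while waiting in much the same way blocking I/O does. fake_download and run_downloads are illustrative names, and the timings are only meant to show the pattern.

```python
import threading
import time


def fake_download(name, seconds, results):
    """Stand-in for an I/O-bound task; the thread just waits, as it would on a network."""
    time.sleep(seconds)
    results.append(name)


def run_downloads(count=3, seconds=0.2):
    """Start all the waiting tasks at once and measure the total time taken."""
    results = []
    threads = [
        threading.Thread(target=fake_download, args=(f"file-{i}", seconds, results))
        for i in range(count)
    ]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = time.perf_counter() - start
    return sorted(results), elapsed
```

Three tasks that each wait 0.2 seconds finish in roughly 0.2 seconds in total, rather than the 0.6 seconds they would take one after another, because the threads spend their time waiting instead of competing for the interpreter.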

Running Large Language Models on the CPU

Youdiowei Eteimorde — Sun, 03 Sep 2023 23:56:54 +0000

At some point while using ChatGPT, you might encounter the error shown above. The reason is that running a large language model like ChatGPT takes substantial computational resources. That computational power is typically provided by specialized hardware known as Graphics Processing Units (GPUs).

Unlike regular programs, large language models cannot efficiently run on Central Processing Units (CPUs). The cost associated with operating such massive language models is substantial. It's an expense that usually only large corporations like Google and Microsoft can afford due to the substantial resources involved.

There are two computationally intensive tasks that a language model must consider:

Training
Inference

Training is the process of instructing a language model on how to perform its intended task. It stands as the more computationally demanding process between the two. Training an LLM consumes both time and monetary resources.

Inference, on the other hand, is the use of a trained large language model. Whenever you engage with ChatGPT, you're using it in inference mode. Although less computationally intensive than training, running inference on LLMs remains relatively expensive because of the GPUs it requires, especially at the scale of ChatGPT.

Considering all of this, you might be pondering whether it's feasible to run Large Language Models on a CPU. The answer is yes, at least for inference. In this article, we will delve into the recent advancements, techniques, and technology that have enabled LLMs to operate using nothing more than regular CPUs.

Introducing LLaMa

LLaMa, or Large Language Model Meta AI, is a lightweight and efficient open-source Large Language Model developed by Meta. It is designed to deliver performance similar to models that are ten times its size.

An essential aspect of Large Language Models (LLMs) is their parameters, which play a crucial role in determining their performance. LLaMa introduces various versions with differing parameter counts, including a variant with 7 billion parameters and another boasting around 65 billion parameters.

Historically, enhancing LLM performance involved augmenting the number of parameters. This adheres to the scaling law, a strategy that has proven successful for models like the GPT series.

However, this strategy poses a challenge due to the inherent relationship between parameter increase and computational demands. As parameter count grows, so does the computational workload, presenting a trade-off between model complexity and processing efficiency.

Recent research has indicated a shift in the approach to enhancing LLM performance. Instead of solely increasing the number of parameters, an effective alternative involves augmenting the training data. LLaMa adopts this strategy, relying on data enrichment to achieve its improved performance.

LLaMa marked a groundbreaking achievement in the realm of LLMs. LLaMa was trained with 2048 A100 GPUs. The big breakthrough in LLaMa wasn't its training, but rather its inference step. The LLaMa model can be run on a single GPU during inference, a distinct advantage compared to other LLMs that demand multiple GPUs for operation.

Although this single-GPU capability was remarkable, it is still a far cry from running on the CPU. To enable a lightweight LLM like LLaMa to run on the CPU, a clever technique known as quantization comes into play. But before we dive into the concept of quantization, let's first understand how LLMs store their parameters.

How are LLM parameters stored?

The parameters of a Large Language Model (LLM) are commonly stored as floating-point numbers. The majority of LLMs utilize 32-bit floating-point numbers, also known as single precision. However, certain layers of the model may employ different numerical precision levels, such as 16-bit.

Let's perform a simple mathematical calculation:

8 \text{ bits} = 1 \text{ byte}\quad 32 \text{ bits} = 4 \text{ bytes}

There are 8 bits in one byte, which means 32 bits equals 4 bytes. Consider a LLaMa model with 7 billion parameters, where these parameters are stored as 32-bit floats. To calculate the total memory required, we can multiply 4 bytes by 7 billion:

4 \, \text{bytes} \times 7 \, \text{billion} = 28 \, \text{billion bytes}

28 billion bytes is equivalent to 28 gigabytes (GB).

This implies that just loading the model would require 28 gigabytes (GB) of memory. However, consumer-grade devices typically do not have that much memory available. We need to reduce the memory consumption of the LLM, and this is where quantization becomes relevant.

What is Quantization?

Quantization is the process of reducing the precision of the parameters of an LLM. In this context, it involves taking the existing parameters, which are typically stored as 32-bit floating-point numbers, and converting them to 4-bit integers. This reduction in precision significantly decreases the memory needed to store the model. There may be a minor degradation in the model's performance, but the impact is generally small and often unnoticeable.

Consider an array of values in floating-point representation:

[0.333, 0.98, 0.234]

When we apply quantization, these values are converted to integers:

[2, 15, 0]

While the two sets of numbers are different, they retain their meaning. For instance, the second value remains the largest in both arrays, and the third value is still the smallest.

Quantization can be likened to expressing values in a manner similar to using percentages. For example, someone might express a need for "3200 out of the 10000 cans". This expression can be simplified by stating the need as "32 percent of the cans". Although the specific numerical values differ, the conveyed information remains identical. This analogy shows how quantization simplifies value representation while retaining their inherent meaning.
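To make the array example concrete, here is a minimal sketch of one way to map floats onto 4-bit integers. It assumes simple min-max scaling over the whole array, which is my choice for illustration. The article does not say which scheme produced the numbers above, and real quantization formats differ in detail. The function names are hypothetical.

```python
def quantize_4bit(values):
    """Map floats onto the 16 integer levels 0..15 using min-max scaling."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15  # width of one integer step (assumes hi > lo)
    quantized = [round((v - lo) / scale) for v in values]
    return quantized, scale, lo


def dequantize(quantized, scale, lo):
    """Recover approximate floats from the stored integers."""
    return [q * scale + lo for q in quantized]


weights = [0.333, 0.98, 0.234]
q, scale, lo = quantize_4bit(weights)
print(q)  # [2, 15, 0]
print(dequantize(q, scale, lo))
```

Under this scheme each recovered value is off by at most half a step (`scale / 2`), which is the price paid for storing 4 bits instead of 32.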

Applying 4-bit quantization to LLaMa 7B

Let's apply 4-bit quantization to the LLaMa model, which has 7 billion parameters, assuming that all parameters are originally stored as 32-bit floats.

8 \text{ bits} = 1 \text{ byte}\quad 4 \text{ bits} = 0.5 \text{ bytes}

Given that 1 byte equals 8 bits, 4 bits are equivalent to 0.5 bytes. Therefore, to determine the memory consumption after applying 4-bit quantization to a Large Language Model with 7 billion parameters, we multiply 0.5 bytes by 7 billion.

7,000,000,000 \times 0.5 \, \text{bytes} = 3,500,000,000 \, \text{bytes}

After performing our calculation, we find that the memory consumption is 3.5 billion bytes, equivalent to 3.5 GB. This indicates that it's possible to run the LLaMa 7 billion parameter model on a device with more than 3.5 GB of available RAM.
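The two calculations above follow the same pattern, so they can be sketched as one small helper. This is only the back-of-the-envelope arithmetic from the article: it counts parameter storage and nothing else, so it ignores any extra memory a runtime needs while generating text. The function name is hypothetical.

```python
def model_size_gb(num_params, bits_per_param):
    """Memory needed to hold the parameters, in gigabytes (1 GB = 10**9 bytes)."""
    bytes_per_param = bits_per_param / 8
    return num_params * bytes_per_param / 1e9


params_7b = 7_000_000_000
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {model_size_gb(params_7b, bits):.1f} GB")
```

Running it for the 7 billion parameter model gives 28.0 GB at 32-bit and 3.5 GB at 4-bit, matching the figures worked out by hand.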

The GGML ecosystem

Another key technology that has contributed to running LLMs on CPUs is GGML, which stands for Georgi Gerganov Machine Learning. It is named after its creator.

GGML is a machine learning library that implements tensor operations in C. Due to its implementation in C, it can facilitate the execution of ML models on diverse platforms. Additionally, GGML offers support for model quantization.

GGML was initially used to develop Whisper.cpp, a project designed to run OpenAI's Whisper model on compact devices such as smartphones. This laid the foundation for a subsequent project named llama.cpp, which runs LLaMa models on similarly compact devices.

Furthermore, GGML serves as a file format that enables the storage of model information within a single file.

Introduction to Llama.cpp

Llama.cpp is a runtime for LLaMa-based models that enables inference to be performed on the CPU, provided that the device has sufficient memory to load the model. It is written in C++ and utilizes the GGML library to execute tensor operations and carry out quantization processes.

Despite being written in C++, there exist bindings for several programming languages. The Python binding is referred to as llama-cpp-python.

Working with llama-cpp-python

Let's put into practice everything we've learned so far. We will utilize llama-cpp-python to execute a LLaMa 2-7B model. You can run this example locally if you have approximately 6.71 GB of available memory. If you don't, you can use the link below to run the code in Google Colab.

Firstly, we need to obtain our model, convert it into GGML format, and then proceed to quantize it. However, there's no need to be concerned about this step since we can readily acquire an already quantized model from HuggingFace. A user named TheBloke has quantized numerous models.



wget https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/resolve/main/llama-2-7b-chat.ggmlv3.q4_1.bin

I used the command-line tool wget to download the model. This specific variant of LLaMa has undergone fine-tuning similar to ChatGPT.



pip install llama-cpp-python

Next, we proceed to install llama-cpp-python. Once the installation is complete, you have the option to either create a Python file to contain your code or utilize the Python interpreter for execution.



from llama_cpp import Llama

LLM = Llama(model_path="./llama-2-7b-chat.ggmlv3.q4_1.bin")

We load our model and subsequently instantiate it. Following that, we input our prompt to the model.



prompt = "Tell me about the Python programming language? "

output = LLM(prompt)

Llama-cpp offers an API that is similar to the one provided by OpenAI. Here's the response generated by our provided prompt:



{'id': 'cmpl-4c10c54e-f6a1-4d80-87af-f63f19ce96c2',
 'object': 'text_completion',
 'created': 1692538402,
 'model': './llama-2-7b-chat.ggmlv3.q4_1.bin',
 'choices': [{'text': '\npython programming language\nThe Python programming language is a popular, high-level programming language that is used for a wide range of applications, such as web development, data analysis, artificial intelligence, scientific computing, and more. Here are some key features and benefits of Python:\n\n1. **Easy to learn**: Python has a simple syntax and is relatively easy to learn, making it a great language for beginners.\n2. **High-level language**: Python is a high-level language, meaning it abstracts away many low-level details, allowing developers to focus on the logic of their code rather than',
   'index': 0,
   'logprobs': None,
   'finish_reason': 'length'}],
 'usage': {'prompt_tokens': 10, 'completion_tokens': 128, 'total_tokens': 138}}
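Since the response is a plain Python dictionary, pulling out the generated text is ordinary dictionary access. The sketch below uses a hand-written dictionary shaped like the response above (with the text shortened) so it runs without the model; the helper name is hypothetical. It also checks `finish_reason`: a value of `'length'` indicates the output stopped because it hit the token limit, which is why the text above ends mid-sentence.

```python
# A response shaped like the one above, with the text shortened for brevity.
response = {
    "object": "text_completion",
    "choices": [
        {
            "text": "\npython programming language\nThe Python programming language is ...",
            "index": 0,
            "logprobs": None,
            "finish_reason": "length",
        }
    ],
    "usage": {"prompt_tokens": 10, "completion_tokens": 128, "total_tokens": 138},
}


def first_completion(resp):
    """Return the first generated text and whether it was cut off by the token limit."""
    choice = resp["choices"][0]
    truncated = choice["finish_reason"] == "length"
    return choice["text"].strip(), truncated


text, truncated = first_completion(response)
print(text)
print("Truncated:", truncated)
```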

The text returned from the model is this:

python programming language
The Python programming language is a popular, high-level programming language that is used for a wide range of applications, such as web development, data analysis, artificial intelligence, scientific computing, and more. Here are some key features and benefits of Python:

Easy to learn: Python has a simple syntax and is relatively easy to learn, making it a great language for beginners.

High-level language: Python is a high-level language, meaning it abstracts away many low-level details, allowing developers to focus on the logic of their code rather than

The response received is actually quite satisfactory. However, it's essential to bear in mind that this particular model isn't equivalent to ChatGPT in terms of capabilities. For better results, consider a model quantized at higher precision, such as the 8-bit version. For even better results, the 13-billion-parameter models could be beneficial. Prompt engineering techniques can also improve the quality of the responses.

The ability to run Large Language Models on the CPU represents a significant breakthrough in the field. This advancement paves the way for various applications, benefiting small businesses, researchers, hobbyists, and individuals who prefer not to share their data with third-party organizations. This development is set for continuous growth, and in the upcoming years, if not months, we can expect more accessibility to LLMs.

Here are a few useful resources to help expand your knowledge on running LLMs on CPU.

The blog post and paper that introduced LLaMa.
Check out the ggml website and read their manifesto to learn more about the project's philosophy.
Watch this video to have an understanding of the hardware ChatGPT runs on.
To get a more in-depth overview of quantization, watch this video.
To understand the GGML file format read this.

Understanding LangChain's RecursiveCharacterTextSplitter

Youdiowei Eteimorde — Sat, 12 Aug 2023 01:55:59 +0000

Large language models are powerful tools with extensive capabilities; nonetheless, they grapple with a distinct limitation known as the context window. This context window defines the boundaries within which these models can proficiently process text. Take, for example, gpt-3.5-turbo, which operates within a context length of 4,096 tokens, approximately corresponding to 3,500 words.

But what occurs when you present these models with a document that exceeds their context window? This is where a clever strategy known as "chunking" comes into play. Chunking involves dividing the document into smaller, more manageable sections that fit comfortably within the context window of the large language model.

LangChain provides users with a range of chunking techniques to choose from. However, among these options, the RecursiveCharacterTextSplitter emerges as the favored and strongly recommended method.

Quick overview

The RecursiveCharacterTextSplitter takes a large text and splits it based on a specified chunk size. It does this by using a set of characters. The default characters provided to it are ["\n\n", "\n", " ", ""].

It takes in the large text then tries to split it by the first character \n\n. If the first split by \n\n is still large then it moves to the next character which is \n and tries to split by it. If it is still larger than our specified chunk size it moves to the next character in the set until we get a split that is less than our specified chunk size.

Code implementation

What I Worked On

February 2021

Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.

The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.

The text above is extracted from an article written by Paul Graham, titled: What I Worked On. Let's utilize the RecursiveCharacterTextSplitter to break it into small chunks, each with a maximum size of 100 characters.

First we import it from langchain:



from langchain.text_splitter import RecursiveCharacterTextSplitter

Let's load the text we wish to create chunks from into a variable called text.



text = """What I Worked On

February 2021

Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.

The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.
"""

Next we create a RecursiveCharacterTextSplitter instance, configuring it with a chunk_size of 100 and a chunk_overlap value of zero. Our approach involves using the length function to measure each chunk based on its character count.



text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 100,
    chunk_overlap  = 0,
    length_function = len,
)

The RecursiveCharacterTextSplitter offers several methods for performing splits. In our case, we will utilize the split_text method. This method requires a string input representing the text and returns an array of strings, each representing a chunk after the splitting process.



texts = text_splitter.split_text(text)
print(len(texts)) # 11
print(texts[0]) # 'What I Worked On\n\nFebruary 2021'

Upon performing the split our text was successfully divided into a total of 11 separate chunks.

In-Depth Explanation

Just as its name suggests, the RecursiveCharacterTextSplitter employs recursion as the core mechanism to accomplish text splitting. Now, let's take a detailed journey through the process of how our earlier code was capable of achieving this feat.

For our walkthrough, we'll utilize the same text and parameters that we employed during the code implementation. This involves a segment from Paul Graham's essay, and we'll consider a chunk size of 100 characters. The characters we use for splitting will be ['\n\n', '\n', ' ', '' ].

Let's begin with our initial text. Currently presented in human-readable form, our next step involves transforming it into a format that computers can readily comprehend.

Now, the new lines have been converted to \n, which is precisely what we need in order to carry out our splitting process.

Let's select our text. This can be likened to invoking the split_text method on our text.

As mentioned earlier, the RecursiveCharacterTextSplitter attempts to initiate splits using a predefined set of characters. Its first attempt involves the \n\n character, which serves as a means to split by paragraphs. Let's now identify all occurrences of this character within our text.

Once we've located all instances of the \n\n characters, the subsequent step involves executing a split using this character as our designated separator.

Presently, we have four splits. Our next step involves assessing each split to check whether they meet the condition of being smaller than our specified chunk size, which is set at 100 characters.

The first two splits satisfy this condition, thus earning them the label of good splits. Since both segments consist of fewer than 100 characters, we can combine them to create our initial chunk.

Proceeding to the second split, we find ourselves in a situation where further reduction isn't achievable using the \n\n character. Therefore, we proceed to the next character: \n. Our objective here is to execute a split using the \n character and determine if we can achieve a reduction in the split's size.

This operation is akin to invoking the split_text on the second split text, but with the inclusion of the \n character. This is where the concept of recursion comes into play.

Upon executing the split using the \n character, we end up with two splits. The first split qualifies as a good split, given that it contains only one character. However, the second split surpasses our designated chunk size.

Consequently, we need to invoke the split_text method on this particular split once again. However, this time we'll employ a split using the next character in our character list, which happens to be the ' ' character.

Finally, we have successfully decreased the split size. Now, we proceed to iterate through each split in order to perform a merge. The guiding principle for these merges is that no resulting merged split should exceed our designated chunk size of 100 characters.

Following the merge, we end up with four chunks, each adhering to our condition that a chunk should not surpass 100 characters.

Now, let's revisit the original text splits and identify which split remains to be processed.

We still have one split that is greater than our chunk size. We repeat the same procedures again.

We initiate the split using the new line character as the separator.

We perform a split using spaces as the separators.

Next, we proceed with a merge, ensuring that no merged segments exceed the defined chunk size.

After going through the entire process, we arrive at generating eleven individual chunks. Each of these eleven chunks successfully adheres to the 100-character limit. This outcome aligns precisely with what we achieved programmatically.
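The walkthrough above can be condensed into a short, simplified sketch of the split, recurse, and merge idea. To be clear, this is not LangChain's implementation: it ignores chunk overlap and other options, the function name is hypothetical, and it may not reproduce LangChain's chunks exactly on every input. It assumes the separator list ends with the empty string, as the default list does.

```python
def recursive_split(text, chunk_size=100, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters (chunk_size >= 1)."""
    if len(text) <= chunk_size:
        return [text] if text else []

    separator, remaining = separators[0], separators[1:]
    # An empty separator means "split into individual characters".
    pieces = list(text) if separator == "" else text.split(separator)

    chunks, current = [], ""
    for piece in pieces:
        if not piece:
            continue  # skip empty strings produced by adjacent separators
        if len(piece) > chunk_size:
            # Too big for this separator: flush what we have, then recurse
            # with the next separator in the list.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, chunk_size, remaining))
            continue
        # Merge step: glue good splits together while they still fit.
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = (
    "What I Worked On\n\nFebruary 2021\n\n"
    "Before college the two main things I worked on, outside of school, "
    "were writing and programming. I didn't write essays."
)
chunks = recursive_split(sample, chunk_size=100)
print(chunks[0])
print(all(len(chunk) <= 100 for chunk in chunks))
```

As in the walkthrough, the title and the date are both good splits and get merged into the first chunk, while the long paragraph falls through to the space separator before being merged back into pieces that fit.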

And there we have it. We've delved into the inner workings of LangChain's RecursiveCharacterTextSplitter. For those who are intrigued, you can explore the source code here. If you found this article informative, please consider showing your appreciation with a reaction: 💖 🦄 🤯 🙌 🔥

A quick introduction to language models

Youdiowei Eteimorde — Fri, 04 Aug 2023 10:20:29 +0000

A language model is a computer program capable of understanding text in the form of natural language. This understanding enables it to achieve tasks such as predicting the sentiment of the text, performing grammatical corrections, translating languages, or generating new text. The generation aspect is currently capturing the interest of everyone.

With text generation, we can still perform the other tasks. The performance of a language model depends on its size: the larger the model, the better its performance. We have always had language models; examples include autocomplete on our phones and spam filters. However, these models were basic and could only perform one task at a time.

The models that are currently astonishing the world are large, hence the name Large Language Models.

Prompting

A compiler and a language model aren't so different; they both work with text. However, a compiler takes in structured text in the form of computer programs, while a language model works with unstructured text in the form of human language. The text it takes in is called a prompt.

The output of most language models is text. Let's work on a simple example using the following prompt:

What is the capital of Nigeria?

The Language Model will analyze the prompt and generate the response that best completes it. The output of this prompt is:

The capital of Nigeria is Abuja.

The output essentially completes the input prompt. We can demonstrate this by providing an incomplete prompt:

Nigeria is a ....

This will produce the output:

Nigeria is a country located in West Africa.

OpenAI Completion API

OpenAI offers an API endpoint that allows us to interact with their language model, appropriately named the Completion API. It is an HTTP endpoint, so all we need is an HTTP client and an API key.

curl https://api.openai.com/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "text-davinci-003",
    "prompt": "What is the capital of Nigeria?"
  }'

I utilized curl as my HTTP client and stored my API key in the environment variable $OPENAI_API_KEY. I then passed it as a Bearer token in the header. The response to this request should resemble the following:

{
  "id": "cmpl-7jjcu1cmQqUZy7qawoQ3rlUBgmrqh",
  "object": "text_completion",
  "created": 1691134504,
  "model": "text-davinci-003",
  "choices": [
    {
      "text": "\n\nThe capital of Nigeria is Abuja.",
      "index": 0,
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 7,
    "completion_tokens": 10,
    "total_tokens": 17
  }
}

The response contains several pieces of helpful information, but we are primarily interested in the choices field within our response.

"choices": [
  {
    "text": "\n\nThe capital of Nigeria is Abuja.",
    "index": 0,
    "logprobs": null,
    "finish_reason": "stop"
  }
]

The choices field contains an array of possible responses but we have only one response. We can obtain our completed text from the text field.
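For readers who prefer Python over curl, here is a rough sketch of the same request built with the standard library. The URL, model name, and prompt are taken from the curl example; the function names and the placeholder key are my own, and the model name may no longer be served by OpenAI. The request is only constructed here, not sent, and the response is parsed from a hand-written sample string.

```python
import json
import urllib.request

API_URL = "https://api.openai.com/v1/completions"


def build_completion_request(prompt, api_key, model="text-davinci-003"):
    """Build (but do not send) a POST request equivalent to the curl command."""
    body = json.dumps({"model": model, "prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def extract_text(response_json):
    """Read the completed text out of the first entry in the choices array."""
    return json.loads(response_json)["choices"][0]["text"].strip()


request = build_completion_request("What is the capital of Nigeria?", api_key="YOUR_API_KEY")
print(request.get_method(), request.full_url)

# Sending it would look like: urllib.request.urlopen(request).read()
sample = '{"choices": [{"text": "\\n\\nThe capital of Nigeria is Abuja.", "index": 0}]}'
print(extract_text(sample))
```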

This was a concise introduction to language models. Their primary function is to complete the preceding text they receive. If you're keen on delving deeper into the concept of prompting, you can explore this site. Additionally, for detailed information about OpenAI's completions endpoint, take a look at their documentation.

Building a Machine Learning Model with tensorflow.js

Youdiowei Eteimorde — Thu, 15 Jun 2023 06:48:52 +0000

The most important thing in all of machine learning is the model. A model is a computer program that is trained using a dataset to recognize patterns and perform a given task. It emphasizes the training aspect, as opposed to regular programs that are manually programmed by humans. Instead, a machine learning model learns from data using an algorithm. In this article, we will be building a machine learning model with TensorFlow.js.

The Task

Before we begin building our model, we need to define the task at hand.

The image above displays a graph that plots a series of dots. Our goal is to find a straight line that can best fit these dots. This problem is known as linear regression. The line we are searching for can be considered as our model. So, how do we find this line?

We humans can easily draw a line that fits the data. So the goal is to train our computer program to do the same.

We will explore two approaches: one using statistical methods without machine learning, which utilizes mathematical equations to find the best fit line, and the other using machine learning to find the best fit line for us.

The dataset

First, we need to create a dataset. We are going to use TensorFlow.js operations to create an artificial dataset. We will use an equation to generate the data. Here's the equation:

y = m x + c

This equation represents a straight line, where m is the slope and c is the intercept. x is the input and y is the output. Let's randomly select values for m and c. For m, let's choose 2, and for c, let's choose 1. This gives us the equation:

y = 2 x + 1

However, this equation will only give us a straight line. To make the data entries in our dataset unique, we need to add noise. Let's modify the equation:

y = 2 x + 1 + \text{noise}

For our x values, we are going to generate one hundred values between 0 and 1. The value of y depends on x. We can use the equation to create new values of y. The equation says that 2 times x plus 1 plus random noise will give us y.

Now let's generate x values in code:



const x_data = tf.linspace(0, 1, 100);

The code above uses the linspace function to generate 100 values between 0 and 1. Let's convert the equation $y = 2 x + 1 + \text{noise}$ into code using TensorFlow.js:



const y_data = tf.add(tf.mul(2, x_data), 1).add(tf.randomNormal([100], 0, 0.1));

These operations multiply x by 2, add 1, and then add random noise to generate the corresponding y values for the dataset.

Ordinary Least squares

The mathematical technique we are going to use is called ordinary least squares. This technique finds the line that best fits the dataset by minimizing the sum of the squared differences between the predicted values and the actual values. Here is the equation of the line:

y_i = \beta_0 + \beta_1x_i

$y_i$ represents our dependent variable or the y-coordinate in our graph. $x_i$ represents our independent variable or the x-coordinate in the graph. $\beta_0$ represents the intercept, and $\beta_1$ represents the slope. From our dataset, we obtain values for $x_i$ and $y_i$. To calculate the values of $\beta_0$ and $\beta_1$, we use the following equations:

\beta_1 = \frac{\sum((x_i - \bar{x})(y_i - \bar{y}))}{\sum((x_i - \bar{x})^2)}

\beta_0 = \bar{y} - \beta_1 \bar{x}

Let's convert the equations above into code. $\bar{x}$ and $\bar{y}$ are the mean values of $x$ and $y$. In TensorFlow.js, we can represent them as:



const xMean = tf.mean(x);
const yMean = tf.mean(y);

The numerator of the first equation, $\sum((x_i - \bar{x})(y_i - \bar{y}))$, can be represented as:



const numerator = tf.sum(tf.mul(tf.sub(x, xMean), tf.sub(y, yMean)));

While the denominator, $\sum((x_i - \bar{x})^2)$, can be represented as:



const denominator = tf.sum(tf.square(tf.sub(x, xMean)));

Now we can obtain our slope ($\beta_1$) and intercept ($\beta_0$):



const slope = numerator.div(denominator);
const intercept = yMean.sub(slope.mul(xMean));

We can combine everything into a single function:



function calculateInterceptAndSlope(x, y) {

  const xMean = tf.mean(x);
  const yMean = tf.mean(y);

  const numerator = tf.sum(tf.mul(tf.sub(x, xMean), tf.sub(y, yMean)));
  const denominator = tf.sum(tf.square(tf.sub(x, xMean)));

  const slope = numerator.div(denominator);
  const intercept = yMean.sub(slope.mul(xMean));

  return { intercept: intercept.dataSync()[0], slope: slope.dataSync()[0] };
}

Let's provide the function with the data we defined earlier:



const { intercept, slope } = calculateInterceptAndSlope(x_data, y_data);

console.log('Intercept:', intercept); // Intercept: 0.980187177658081

console.log('Slope:', slope); // Slope: 2.0238144397735596

Our calculateInterceptAndSlope function calculates the intercept as 0.980187177658081 and the slope as 2.0238144397735596, which is close to the values we defined. This approach of using mathematical equations is effective, but it requires us to specify the relationship between the data variables in the form of an equation.

However, what we truly desire is a method that can automatically discern the relationship within the data without relying on a predefined equation.
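As a cross-check of the ordinary least squares formulas, here is the same calculation in plain Python rather than TensorFlow.js. It is a sketch under my own assumptions: the data is regenerated with Python's random module and a fixed seed, so the exact numbers will differ from the TensorFlow.js run above, and the function name is hypothetical.

```python
import random


def ordinary_least_squares(xs, ys):
    """Return (intercept, slope) of the best-fit line through the points."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    denominator = sum((x - x_mean) ** 2 for x in xs)
    slope = numerator / denominator
    intercept = y_mean - slope * x_mean
    return intercept, slope


# 100 evenly spaced x values in [0, 1], and y = 2x + 1 plus Gaussian noise.
rng = random.Random(0)
xs = [i / 99 for i in range(100)]
ys = [2 * x + 1 + rng.gauss(0, 0.1) for x in xs]

intercept, slope = ordinary_least_squares(xs, ys)
print(f"Intercept: {intercept:.3f}, Slope: {slope:.3f}")
```

On noise-free points such as (0, 1), (1, 3), (2, 5), the function recovers the intercept 1 and the slope 2 exactly, which is a quick way to convince yourself the formulas are wired up correctly.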

Machine Learning Model

A machine learning model is a program that has the ability to identify and discover patterns within data without being manually programmed. Instead, it undergoes a training process where it learns from the data. Let's explore the training process of a machine learning model in more detail.

Training process of a model

The training process of a model involves three essential components:

The model
Loss function
Optimizer

Here's the process: The model takes an input value x and produces a prediction. This prediction is then compared to the actual value using the loss function. The loss function measures the performance of the model during training and returns a loss value. A high loss value indicates poor performance, while a low loss value indicates good performance. The optimizer is used to adjust the model's parameters in order to minimize the loss value.

This process is repeated multiple times, with each iteration referred to as an epoch. The number of epochs is determined by the creator of the model. For example, if the model is trained for 50 epochs, it means the training data is passed through the model, loss function, and optimizer 50 times.
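To see the three components working together before reaching for a library, here is a bare-bones training loop in plain Python. It is a sketch, not what TensorFlow.js does internally: it updates the parameters once per epoch using the whole dataset, and the starting values, learning rate, and number of epochs are my own choices for illustration.

```python
import random

# Synthetic data: y = 2x + 1 plus a little Gaussian noise.
rng = random.Random(0)
xs = [i / 99 for i in range(100)]
ys = [2 * x + 1 + rng.gauss(0, 0.1) for x in xs]

weight, bias = 0.0, 0.0   # the model's parameters, starting from zero
learning_rate = 0.1
losses = []

for epoch in range(1000):
    # 1. The model makes a prediction for every input.
    predictions = [weight * x + bias for x in xs]

    # 2. The loss function (mean squared error) scores those predictions.
    errors = [p - y for p, y in zip(predictions, ys)]
    losses.append(sum(e * e for e in errors) / len(errors))

    # 3. The optimizer nudges the parameters against the gradient of the loss.
    grad_weight = 2 * sum(e * x for e, x in zip(errors, xs)) / len(xs)
    grad_bias = 2 * sum(errors) / len(xs)
    weight -= learning_rate * grad_weight
    bias -= learning_rate * grad_bias

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
print(f"weight: {weight:.3f}, bias: {bias:.3f}")
```

The loss drops from epoch to epoch, and the learned weight and bias end up close to the 2 and 1 used to generate the data.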

Building a model

To create a model in TensorFlow.js, we can utilize the tf.sequential function:



const model = tf.sequential();

A model consists of layers, and these layers have sets of values called weights attached to them. The weights are what are adjusted during the training process. Upon creation of the model, the weights are randomly initialized. Data flows from the input through each corresponding layer before reaching the output.

Let's create a layer for our model:



tf.layers.dense({ units: 1, inputShape: [1] })

Here, we have defined what is known as a dense layer. There are different types of layers, but for now, we will focus on dense layers. The dense layer takes in a JavaScript object { units: 1, inputShape: [1] }, which contains two keys: units and inputShape. The units parameter defines the number of weights our layer will have, and in this case, it is set to one.

The weight in a dense layer is analogous to the slope in the ordinary least squares method. It can be any number of our choosing and doesn't have to be limited to one. Additionally, each layer has an associated bias value, which is automatically incorporated by TensorFlow.js. We can add the layer to our model using the add method provided by the defined model.

The inputShape key specifies the shape of a single input example. Here it is [1] because each input to our model is a single scalar value.



model.add(tf.layers.dense({ units: 1, inputShape: [1] }));

With this, we have defined our model. Here is a diagram illustrating our model:

Let's experiment with different model configurations:



const model = tf.sequential();
model.add(tf.layers.dense({ units: 3, inputShape: [1] }));

The code above creates a model with one layer that takes one value as input. This layer has 3 units, and each unit gets its own bias value, which is automatically added for us.

Let's go deeper and create a model with two layers.



const model = tf.sequential();
model.add(tf.layers.dense({ units: 3, inputShape: [1] }));
model.add(tf.layers.dense({ units: 2 }));

The second layer has two units and a bias attached to it. Here's the corresponding diagram:

All the units from Layer 1 have a connection to individual units of the next layer, which is why it is called a dense layer.
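One way to build intuition for "every unit connects to every unit" is to count the trainable values. In a dense layer there is one weight for each input and unit pair, plus one bias per unit. The sketch below is plain Python arithmetic with a hypothetical helper name; it does not call TensorFlow.js.

```python
def dense_param_count(inputs, units):
    """One weight per (input, unit) pair, plus one bias per unit."""
    return inputs * units + units


# The two-layer model above: 1 input -> 3 units -> 2 units.
layer_sizes = [1, 3, 2]
total = 0
for inputs, units in zip(layer_sizes, layer_sizes[1:]):
    count = dense_param_count(inputs, units)
    total += count
    print(f"{inputs} -> {units}: {count} parameters")
print("total:", total)
```

For the two-layer configuration this gives 6 parameters in the first layer and 8 in the second, while the original single-unit model has just 2: one weight and one bias, the counterparts of the slope and intercept.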

Let's get back to our task and place our original model configuration into a function for simplicity.



function createModel() {
  const model = tf.sequential();
  model.add(tf.layers.dense({ units: 1, inputShape: [1] }));
  return model;
}

This function returns a model. Currently, the model has randomly initialized weights and is not yet trained. The goal is to train the model to fit our data, and that's where the training process comes in. To train our model, we need two things: a loss function and an optimizer.

Loss function

The loss function, also known as the cost function, is a way for the model to measure its performance. It quantifies the difference between the predicted values and the actual values. TensorFlow.js provides several pre-defined loss functions, but for a better understanding, let's build one from scratch. We will use the mean squared error (MSE) as our loss function.

\text{MSE}(\text{predicted}, \text{actual}) = \frac{1}{n} \sum_{i=1}^{n} (\text{predicted}_i - \text{actual}_i)^2

The mean squared error (MSE) measures the difference between the predicted values and the actual values by squaring the difference to avoid negative values. The mean of these squared differences is then calculated. Here's the TensorFlow.js code equivalent for the MSE loss function:



function loss(predicted, actual) {
  return predicted.sub(actual).square().mean();
}
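A tiny worked example helps confirm what the formula does. This is the same mean squared error written in plain Python with made-up numbers, so the arithmetic can be followed by hand; the function name is hypothetical.

```python
def mean_squared_error(predicted, actual):
    """Average of the squared differences between predictions and targets."""
    squared_diffs = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return sum(squared_diffs) / len(squared_diffs)


# Differences are 1, 0 and -2, so the squares are 1, 0 and 4.
print(mean_squared_error([2.0, 4.0, 6.0], [1.0, 4.0, 8.0]))  # (1 + 0 + 4) / 3
```

Because the differences are squared, the prediction that is off by 2 contributes four times as much to the loss as the one that is off by 1.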

The optimizer

While the loss function is used to evaluate the model's performance, the optimizer is responsible for updating the weights of the model based on the information provided by the loss function. There are several algorithms that can be used to optimize the weights, and in this case, we will use an algorithm called Stochastic Gradient Descent (SGD). TensorFlow.js provides an implementation of SGD that we can use.

The tf.train.sgd function implements SGD and takes one argument, which we will call the training step (it is more commonly known as the learning rate). The process of optimizing the weights of a model is a search problem.

The training step determines the size of the jumps taken during the optimization process. A higher training step value may lead to faster convergence to the optimal solution, but there's a risk of overshooting it. On the other hand, a lower training step will take longer but can provide a more thorough search for the optimal solution.



tf.train.sgd(0.1) // SGD with a training step of 0.1
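The trade-off described above is easy to see on a toy problem. The sketch below runs plain gradient descent on the function w squared, whose minimum is at zero, with three different training steps. The function and the specific values are my own choices for illustration, not something TensorFlow.js uses.

```python
def descend(learning_rate, steps=20, start=1.0):
    """Minimize f(w) = w**2 by gradient descent and return the final w."""
    w = start
    for _ in range(steps):
        gradient = 2 * w              # derivative of w**2
        w -= learning_rate * gradient
    return w


for rate in (0.01, 0.1, 1.1):
    print(f"training step {rate}: w after 20 steps = {descend(rate):.4f}")
```

With 0.01 the value creeps toward zero slowly, with 0.1 it gets much closer in the same number of steps, and with 1.1 every jump overshoots the minimum by more than the last, so the value grows instead of shrinking.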

The training step

Now let's put everything together and train our model. First, we need to create our model using the createModel function we defined earlier. Then, we add the loss function we created and the optimizer to our model using the compile method.



const model = createModel();
model.compile({ optimizer: tf.train.sgd(0.1), loss });

To train our model, we use the fit method. This method takes in our inputs and their corresponding outputs, which in our case are the x_data and y_data variables. Additionally, we need to provide additional information as an object.



model.fit(x_data, y_data, {
  epochs: 50,
  callbacks: {
    onEpochEnd: (epoch, logs) => {
      console.log(`Epoch ${epoch + 1}: Loss = ${logs.loss}`);
    },
  },
});

The third argument is an object that contains the epochs and callbacks keys. The epochs value determines how many times our dataset will go through the entire training process. The callbacks value is an object that defines functions to be executed at specific times during the training process. In our case, we have the onEpochEnd callback, which runs at the end of every epoch.

Because model.fit is asynchronous and returns a promise, we have to await it inside an asynchronous function to make sure training has finished before we use the model.



async function trainModel() {
  const history = await model.fit(x_data, y_data, {
    epochs: 50,
    callbacks: {
      onEpochEnd: (epoch, logs) => {
        console.log(`Epoch ${epoch + 1}: Loss = ${logs.loss}`);
      },
    },
  });
}

The process of using a model after training is called inference. Let's create a function for performing inference with our trained model.



async function inference(){
  // start the training process
  await trainModel();

  // Get the weights of the model
  const [ unit, bias ] = model.getWeights();

  console.log('Unit:', unit.dataSync()[0]);
  console.log('Bias:', bias.dataSync()[0])

  // predict a singular value
  const y = model.predict(tf.tensor([2]));
  console.log(y.dataSync()[0]);
}

This function calls the trainModel function to initiate the training process. After training, we obtain the trained weights from the model's getWeights method. Then, we use the model's predict method to make actual predictions on our data values.

The dataSync() method allows us to get the tensor data values in a synchronous manner. This is a blocking operation, but there's an asynchronous equivalent called data.

Let's summarize everything we have done so far:



const tf = require('@tensorflow/tfjs');

// Define a linear regression model
function createModel() {
  const model = tf.sequential();
  model.add(tf.layers.dense({ units: 1, inputShape: [1] }));
  return model;
}

// Prepare the data
const x_data = tf.linspace(0, 1, 100);
const y_data = tf.add(tf.mul(2, x_data), 1).add(tf.randomNormal([100], 0, 0.1));

// Define the loss function
function loss(predicted, actual) {
  return predicted.sub(actual).square().mean();
}

// Compile the model
const model = createModel();
model.compile({ optimizer: tf.train.sgd(0.1), loss });

// Define the function to train the model
async function trainModel() {
  const history = await model.fit(x_data, y_data, {
    epochs: 50,
    callbacks: {
      onEpochEnd: (epoch, logs) => {
        console.log(`Epoch ${epoch + 1}: Loss = ${logs.loss}`);
      },
    },
  });
}

// Define the inference function
async function inference(){
  // start the training process
  await trainModel();

  // Get the weights of the model
  const [ unit, bias ] = model.getWeights();

  console.log('Unit:', unit.dataSync()[0]);
  console.log('Bias:', bias.dataSync()[0])

  // predict a singular value
  const y = model.predict(tf.tensor([2]));
  console.log(y.dataSync()[0]);
}

// Call the inference function
inference();

The code above should work as it is, and here's the log message after the function has been called.



Epoch 1: Loss = 3.0514166355133057
Epoch 2: Loss = 0.3914150893688202
Epoch 3: Loss = 0.16473960876464844
Epoch 4: Loss = 0.12166999280452728
Epoch 5: Loss = 0.11165788769721985
Epoch 6: Loss = 0.1014869213104248
Epoch 7: Loss = 0.09044851362705231
Epoch 8: Loss = 0.0814569965004921
Epoch 9: Loss = 0.07538492977619171
Epoch 10: Loss = 0.0688215121626854
Epoch 11: Loss = 0.06399987637996674
Epoch 12: Loss = 0.05861040949821472
Epoch 13: Loss = 0.054225534200668335
Epoch 14: Loss = 0.04796575382351875
.
.
.
.
Epoch 40: Loss = 0.012772820889949799
Epoch 41: Loss = 0.01268770918250084
Epoch 42: Loss = 0.012648873031139374
Epoch 43: Loss = 0.012167338281869888
Epoch 44: Loss = 0.012036673724651337
Epoch 45: Loss = 0.011740590445697308
Epoch 46: Loss = 0.011509907431900501
Epoch 47: Loss = 0.01139025203883648
Epoch 48: Loss = 0.011547082103788853
Epoch 49: Loss = 0.011259474791586399
Epoch 50: Loss = 0.011122853495180607
Unit: 1.9403504133224487
Bias: 1.0415139198303223
4.922214508056641

From the log message, we can observe that during the first epoch, our loss value is high, but it subsequently reduces. This indicates that our model is improving with each epoch.

By using the model's getWeights method, we obtained the values of the unit and bias as 1.9403504133224487 and 1.0415139198303223 respectively. These values are approximately the same as the ones we obtained using the OLS method. When we provided the value of 2 to our model, it predicted 4.922214508056641.

Comparing both methods

The goal of linear regression is to find a straight line that best represents our data. Our original equation was $y = 2x + 1 + \text{noise}$. The ML model came up with an equation of:



const slope = 1.9403504133224487; // our unit's value from the model's weights
const intercept = 1.0415139198303223; // our bias's value from the model's weights
const x = 2; // input value
const y = x * slope + intercept;
console.log(y); // 4.92221474647522

The solution using the Ordinary Least Squares (OLS) method can be expressed as:



const slope = 2.02381443977355967;
const intercept = 0.980187177658081;
const x = 2; // input value
const y = x * slope + intercept;
console.log(y); // 5.0278160572052

Both of them are converging towards a slope of 2 and an intercept of 1.
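For reference, the standard closed-form OLS solution for a single input variable can be sketched in plain JavaScript as follows. The function name fitOLS is an assumption, and this may differ in detail from the implementation used earlier in the article; it is shown on noise-free data so the result is easy to check by hand.

```javascript
// Closed-form OLS for one input variable:
//   slope     = sum((x - meanX) * (y - meanY)) / sum((x - meanX)^2)
//   intercept = meanY - slope * meanX
// fitOLS is a hypothetical helper name.
function fitOLS(xs, ys) {
  const n = xs.length;
  const meanX = xs.reduce((sum, x) => sum + x, 0) / n;
  const meanY = ys.reduce((sum, y) => sum + y, 0) / n;
  let covariance = 0;
  let variance = 0;
  for (let i = 0; i < n; i++) {
    covariance += (xs[i] - meanX) * (ys[i] - meanY);
    variance += (xs[i] - meanX) ** 2;
  }
  const slope = covariance / variance;
  const intercept = meanY - slope * meanX;
  return { slope, intercept };
}

// Noise-free points on the line y = 2x + 1
const lineX = [0, 1, 2, 3, 4];
const lineY = lineX.map((x) => 2 * x + 1);
console.log(fitOLS(lineX, lineY)); // { slope: 2, intercept: 1 }
```

Unlike the trained model, this computes the answer in one pass with no epochs or training step, which is why it only applies to the linear case.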

When comparing the Ordinary Least Squares (OLS) method to the ML model, it is important to consider their generality. The OLS method provides a specific solution for linear regression and assumes a linear relationship between the variables. If the data doesn't follow a linear pattern, the OLS method may not yield accurate results, and finding a different equation or using alternative regression techniques becomes necessary.

On the other hand, the ML model offers more flexibility and generalizability. By adjusting the model's configuration, such as adding more layers, we can potentially capture more complex patterns in the data beyond simple linear relationships. This adaptability makes the ML model well-suited for handling a wider range of data patterns and potentially achieving better performance in different scenarios.

In this article, we were introduced to the concept of a machine learning model, specifically the linear model. However, it's important to note that there are other types of models available as well. One such example is the polynomial model, which is used when the data does not follow a straight line trend but exhibits a more complex pattern. Additionally, there are models specifically designed for natural language processing tasks, such as language models, which aim to understand and generate human language by capturing its underlying patterns and relationships. These different types of models allow us to tackle a wide range of problems and data patterns effectively.

All models, regardless of their specific architecture or purpose, share a common structure consisting of layers and weights. This allows them to process and learn from data in a structured manner. For instance, a language model like GPT-3 (Generative Pre-trained Transformer 3) is referred to as "large" due to its extensive scale. GPT-3 is composed of 96 layers and a staggering 175 billion weights, which enables it to capture complex patterns and relationships in language data. The size and complexity of models like GPT-3 contribute to their impressive performance on various natural language processing tasks.

Useful resources

Tutorial by TensorFlow that walks you through building a model using TensorFlow.js
A Guide that introduces the model and layers in TensorFlow.js.
Video introducing the concept of regression by the YouTube channel CrashCourse (Statistics).
Video explaining the Ordinary Least Squares (OLS) method by the YouTuber The Organic Chemistry Tutor.

Introduction to Tensorflow.js

Youdiowei Eteimorde — Sat, 18 Mar 2023 18:27:34 +0000

TensorFlow.js is a library that enables machine learning in web applications. It is part of the TensorFlow ecosystem; TensorFlow itself is a deep learning framework written in C++ that provides a Python API.

Tensorflow.js was written from scratch to take advantage of web technologies. It runs on WebGL and has an API that is similar to that of the actual Tensorflow library. It can also run on the server via Node.js. Additionally, Tensorflow has bindings for other languages such as Java, C++, and Rust.

Why tensorflow.js?

Unlike most ML libraries, TensorFlow.js runs directly in the browser. Traditionally, ML models are trained and run on the server, and the client makes a request to the server when it is time to use the model. With TensorFlow.js, machine learning models can run directly in the user's browser, eliminating the need for server requests and enabling real-time, interactive applications.

Tensorflow.js is not designed to train large models, but rather it is built for inference. Machine learning models trained in Python can be loaded into Tensorflow.js and used for inference in the user's browser.

Training machine learning models can require a large amount of computational resources. However, once a model has been trained, it can be reused multiple times for inference without requiring as much computation.

Setting up Tensorflow.js

To use Tensorflow.js we can either use it via the script tag:

<script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs@latest/dist/tf.min.js"></script>

or via npm:

npm install @tensorflow/tfjs

or via yarn:

yarn add @tensorflow/tfjs

The packages above are designed to be used in the web browser. To use TensorFlow.js on Node.js, check out the @tensorflow/tfjs-node package.

The `Tensor` class

The main class in Tensorflow.js is the Tensor. It represents the tensor data structure. The tensor data structure is immutable. Tensorflow.js contains various methods to work with and manipulate this data structure.

Creating Tensors in tensorflow.js

Tensorflow.js provides several functions to help create tensors.

tf.tensor

The tf.tensor function is a general function for creating tensors. It can create tensors of any shape and data type.

tf.tensor(1); // creating a scalar 
tf.tensor([1,2,3,4]); // creating a vector
tf.tensor([[1,2],[3,4]]); // creating a matrix

tf.scalar

The tf.scalar function is more specialized. Its job is to create scalars (rank 0 tensors), which can be of any type.

tf.scalar(2)

It's encouraged to use this rather than tf.tensor when creating scalars. It helps with code readability.

tf.tensor1d

The tf.tensor1d function is used to create vectors (rank 1 tensors). Like tf.scalar, it is more specialized and helps with code readability.

tf.tensor1d([1,2,3,4]); // creating a vector

tf.tensor2d

The tf.tensor2d is used to create matrices(rank 2 tensors).

tf.tensor2d([[1,2],[3,4]]); // creating a matrix

tf.tensor3d

The tf.tensor3d is used to create rank 3 tensors.

tf.tensor3d([[[1], [2]], [[3], [4]]]); // creating a rank 3 tensor

Tensorflow.js has tensor creation functions up to rank 6. Check out their documentation page.

tf.Variable

The Tensor data structure is immutable but mutability is essential to solve common problems. Tensorflow.js provides the tf.Variable class which is a mutable tensor.

The tf.variable function is used to create tf.Variable objects.

const x = tf.variable(tf.tensor([1, 2, 3]));

In order to mutate them, we use the assign method of the tf.Variable class.

x.assign(tf.tensor([4, 5, 6]));

Math operations

Basic mathematical operations like addition, subtraction, multiplication, and division can be performed on tensors.
Operators like +, -, /, and * don't work with tensors; methods such as add, sub, div, and mul are used instead.

const a = tf.tensor1d([1, 2, 3, 4]);
const b = tf.tensor1d([5, 6, 7, 8]);

a.add(b) // tf.tensor1d([6, 8, 10, 12]) ✅
tf.add(a, b); // This also works: tf.tensor1d([6, 8, 10, 12]) ✅

a + b // This doesn't work ❌

To perform element-wise addition between tensors, they typically need to have the same shape. However, broadcasting can be used to add together tensors of different shapes.

Broadcasting is when the smaller tensor is stretched so that its shape matches that of the larger tensor.

const s = tf.scalar(2);
const a = tf.tensor1d([1, 2, 3]);

s.add(a); // tf.tensor1d([3, 4, 5]);

In the code above, s has the smaller shape, so it is stretched to fit the shape of a. The result has the same shape as a.

const s = tf.scalar(3);
const a = tf.tensor2d([[1, 2, 3, 4], [5, 6, 7, 8]]);

s.add(a); // tf.tensor2d([[4, 5, 6 , 7 ], [8, 9, 10, 11]]);
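The scalar case of broadcasting can be imitated in plain JavaScript on nested arrays. This is only a sketch of the idea, not how TensorFlow.js implements it, and addScalar is a hypothetical helper name.

```javascript
// Add a scalar to a nested array of any depth by "stretching"
// the scalar across every element. addScalar is a hypothetical helper.
function addScalar(scalar, values) {
  return values.map((v) =>
    Array.isArray(v) ? addScalar(scalar, v) : v + scalar
  );
}

console.log(addScalar(2, [1, 2, 3]));
// [ 3, 4, 5 ]
console.log(addScalar(3, [[1, 2, 3, 4], [5, 6, 7, 8]]));
// [ [ 4, 5, 6, 7 ], [ 8, 9, 10, 11 ] ]
```

In both calls the result has the shape of the larger operand, matching the two tensor examples above.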

The same rules of addition apply to other mathematical operations:

// subtraction
const a = tf.tensor1d([1, 2, 3, 4]);
const b = tf.tensor1d([5, 6, 7, 8]);


b.sub(a) // tf.tensor1d([4, 4, 4, 4]) ✅
tf.sub(b, a); // This also works: tf.tensor1d([4, 4, 4, 4]) ✅

a - b // This doesn't work ❌

// subtraction broadcasting
const s = tf.scalar(3);
const a = tf.tensor1d([1, 2, 3, 4]);

s.sub(a); // tf.tensor1d([2, 1, 0, -1])

// multiplication
const a = tf.tensor1d([1, 2, 3, 4]);
const b = tf.tensor1d([5, 6, 7, 8]);


b.mul(a) // tf.tensor1d([5, 12, 21, 32]) ✅
tf.mul(b, a); // This also works: tf.tensor1d([5, 12, 21, 32]) ✅

a * b // This doesn't work ❌

// multiplication broadcasting
const s = tf.scalar(3);
const a = tf.tensor1d([1, 2, 3, 4]);

s.mul(a); // tf.tensor1d([3, 6, 9, 12]);

// division
const a = tf.tensor1d([1, 2, 3, 4]);
const b = tf.tensor1d([5, 6, 7, 8]);


b.div(a) // tf.tensor1d([5, 3, 2.3333335, 2]) ✅
tf.div(b, a); // This also works: tf.tensor1d([5, 3, 2.3333335, 2]) ✅

a / b // This doesn't work ❌

// division broadcasting
const s = tf.scalar(3);
const a = tf.tensor1d([1, 2, 3, 4]);

s.div(a); // tf.tensor1d([3, 1.5, 1, 0.75]);

TensorFlow.js also supports other mathematical operations, such as floor division (tf.floorDiv), exponentiation (tf.pow), and modulus (tf.mod).

JavaScript has a built-in Math object for working with numbers. TensorFlow.js provides equivalent functions for tensors.

Javascript	Tensorflow.js
Math.abs	tf.abs
Math.acos	tf.acos
Math.acosh	tf.acosh
Math.ceil	tf.ceil
Math.cos	tf.cos
Math.exp	tf.exp
Math.floor	tf.floor
Math.log	tf.log
Math.sin	tf.sin
Math.tan	tf.tan

TensorFlow.js also has operations that are not present in the Math object, such as relu and sigmoid.

Linear algebra

Linear algebra is a branch of mathematics that deals with the study of vectors and matrices and the transformations that can be performed on them. It is used heavily in the field of machine learning. TensorFlow.js has several methods for linear algebra.

The dot product

The dot product, also known as the scalar product, is an operation that takes two vectors and produces a scalar.
It is represented mathematically as:

$$\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n$$

In Tensorflow.js, we have the tf.dot function for computing the dot product between two tensors.

const a = tf.tensor1d([1, 2, 3, 4]);
const b = tf.tensor1d([5, 6, 7, 8]);

tf.dot(a, b); // 70

tf.Tensor objects also have a dot method to perform dot products.

a.dot(b);
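To see what the formula computes, here is the same sum written out in plain JavaScript on ordinary arrays. The function below is a hand-written sketch for illustration, not part of TensorFlow.js.

```javascript
// Dot product of two equal-length arrays: multiply element-wise, then sum.
function dot(a, b) {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same length');
  }
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    sum += a[i] * b[i];
  }
  return sum;
}

console.log(dot([1, 2, 3, 4], [5, 6, 7, 8])); // 5 + 12 + 21 + 32 = 70
```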

The Norm

The norm, also known as the magnitude, measures the size or length of a vector. It is expressed mathematically as:

$$|x| = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}$$

Tensorflow.js provides the method euclideanNorm to calculate the norm of a tensor.

const x = tf.tensor1d([1, 2, 3, 4]);

x.euclideanNorm() // 5.477225303649902
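The same calculation can be written by hand in plain JavaScript. This sketch is for illustration only; the local function below simply borrows the name euclideanNorm and works on ordinary arrays.

```javascript
// Square every component, add them up, and take the square root.
function euclideanNorm(x) {
  return Math.sqrt(x.reduce((sum, v) => sum + v * v, 0));
}

console.log(euclideanNorm([3, 4]));       // 5
console.log(euclideanNorm([1, 2, 3, 4])); // roughly 5.4772, the square root of 30
```

The last digits differ slightly from the tensor result above because JavaScript numbers are 64-bit floats, while TensorFlow.js tensors default to 32-bit floats.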

Matrix multiplication

In linear algebra, there are several ways of performing multiplication, each with its own use. One of them is matrix multiplication, which defines how two matrices are multiplied together.

In tensorflow.js, we can perform matrix multiplication using the matMul function.

const a = tf.tensor2d([[1, 2],[3, 4]]);
const b = tf.tensor2d([[2, 4, 6],[8, 10, 12]]);

a.matMul(b); // [[18, 24, 30], [38, 52, 66]]
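Each entry of the result is the dot product of a row of the first matrix with a column of the second. A plain JavaScript sketch on nested arrays makes this explicit; the function below is hand-written for illustration and only shares its name with the TensorFlow.js method.

```javascript
// Multiply an (m x k) matrix by a (k x n) matrix, both given as nested arrays.
function matMul(a, b) {
  if (a[0].length !== b.length) {
    throw new Error('Columns of a must equal rows of b');
  }
  return a.map((row) =>
    b[0].map((_, col) =>
      // Dot product of one row of a with one column of b
      row.reduce((sum, value, k) => sum + value * b[k][col], 0)
    )
  );
}

console.log(matMul([[1, 2], [3, 4]], [[2, 4, 6], [8, 10, 12]]));
// [ [ 18, 24, 30 ], [ 38, 52, 66 ] ]
```

A 2 by 2 matrix times a 2 by 3 matrix gives a 2 by 3 result, which is why the inner dimensions must match.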

Transpose

Transposing a matrix is an operation that involves reshaping it by swapping its rows with its columns.

In TensorFlow.js, we have the tf.transpose function for transposing a matrix.

const a = tf.tensor2d([[1, 2, 3], [4, 5, 6]] );
tf.transpose(a); // result: [[1, 4],[2, 5],[3, 6]]
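As a plain JavaScript sketch of the same operation on nested arrays (a hand-written helper for illustration, not the TensorFlow.js implementation):

```javascript
// Column j of the input becomes row j of the output.
function transpose(matrix) {
  return matrix[0].map((_, col) => matrix.map((row) => row[col]));
}

console.log(transpose([[1, 2, 3], [4, 5, 6]]));
// [ [ 1, 4 ], [ 2, 5 ], [ 3, 6 ] ]
```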

Statistics

Tensorflow.js has several functions that can be used to perform statistical operations. One important concept in statistics is distribution, which determines how the data in a dataset is arranged. Tensorflow.js has functions for working with different types of distributions.

Normal distribution

The normal distribution is a probability distribution that models the way many natural phenomena are distributed. It is characterized by a bell-shaped curve, which indicates that the mean (average) value is more likely to occur than any other value. Tensorflow.js has the mean function for checking the mean of a tensor.

const x = tf.tensor1d([1, 2, 3, 4]);

x.mean().print();  // result: 2.5

In addition to the mean, the concept of standard deviation is also important in statistics. It is a value that measures how spread out data points are relative to the mean.

For example, in a room full of people, the average height is more likely to occur than either very tall or very short heights. Tensorflow.js has the randomNormal function that can be used to generate random values that are normally distributed.

tf.randomNormal([2, 2]); // [[0.0635663, 1.2073363],[-1.681078, 0.1550148]]

The code above generates a tensor with normally distributed values of shape 2 by 2.
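The mean and standard deviation described above can also be computed by hand. The sketch below uses plain JavaScript arrays and the population standard deviation (dividing by the number of values); the function names are assumptions for illustration.

```javascript
// Average of a list of numbers
function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Population standard deviation: the square root of the
// average squared distance from the mean.
function standardDeviation(values) {
  const m = mean(values);
  const variance = mean(values.map((v) => (v - m) ** 2));
  return Math.sqrt(variance);
}

console.log(mean([1, 2, 3, 4]));              // 2.5
console.log(standardDeviation([1, 2, 3, 4])); // roughly 1.118, the square root of 1.25
```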

Uniform distribution

In a uniform distribution, no value is more likely to occur than any other; all values are equally likely.
For example, when flipping a fair coin there is an equal chance of getting heads or tails, so the outcome is uniformly distributed. The same applies to rolling a die.

The randomUniform function of tensorflow.js can be used to generate random values that are uniformly distributed.

tf.randomUniform([2, 2]); // [[0.7101336, 0.6085116],[0.5631702, 0.1648723]]

Automatic differentiation

Automatic differentiation is one of the most important features of TensorFlow.js. Differentiation is the process of finding the rate of change of a function with respect to one of its variables. The derivative of a function describes how its output changes as its input variable changes.

For example, consider the graph above, which plots two functions: the red one is the sigmoid function and the green one is its derivative. The green curve shows how quickly the value of the sigmoid function changes as its input changes.

The graph plots the function $x^2$ in blue. We can easily calculate its derivative by using the power rule:

$$\frac{d}{dx}x^n = nx^{n-1}$$

Let's convert the equation above to JavaScript

// Define the function y = x^n
function powerFunction(x, n) {
  return Math.pow(x, n);
}

// Define the derivative of y = x^n using the power rule
function powerRuleDerivative(x, n) {
  return n * Math.pow(x, n-1);
}

// Test the power rule for n = 2 and x = 3
let x = 3;
let n = 2;

let y = powerFunction(x, n);
console.log(y) // 9

// calculate its derivative 
let dydx = powerRuleDerivative(x, n);
console.log(dydx) // result: 6

Applying the power rule with $n = 2$ gives a derivative of $2x$, which is 6 at $x = 3$. To get this result, we had to know the formula and apply it by hand. TensorFlow.js can calculate the derivative of a function without being given such formulas; it uses automatic differentiation instead.

The tf.grad function takes the derivative of a function, provided that function takes a tensor and returns a tensor.


// f(x) = x ^ 2
const f = x => x.square();

// calculate the derivative of the function
const g = tf.grad(f);

const x = tf.tensor1d([3]);
g(x); // result 6

Automatic differentiation is a key component of the backpropagation algorithm used to train neural networks. It allows for the calculation of gradients efficiently without having to manually derive them using formulas.
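To give a feel for how a derivative can be computed without a symbolic formula, here is a small sketch of forward-mode automatic differentiation using dual numbers in plain JavaScript. Note the assumptions: TensorFlow.js records operations and differentiates in reverse, so this is a different variant of the technique, and the Dual class and the local grad function are hypothetical names written only for this illustration.

```javascript
// A dual number carries a value together with the derivative of that
// value with respect to the input. Every operation updates both.
class Dual {
  constructor(value, derivative) {
    this.value = value;
    this.derivative = derivative;
  }
  add(other) {
    // Sum rule: (u + v)' = u' + v'
    return new Dual(this.value + other.value, this.derivative + other.derivative);
  }
  mul(other) {
    // Product rule: (u * v)' = u' * v + u * v'
    return new Dual(
      this.value * other.value,
      this.derivative * other.value + this.value * other.derivative
    );
  }
  square() {
    return this.mul(this);
  }
}

// Returns a function that evaluates df/dx at a plain number x.
function grad(f) {
  return (x) => f(new Dual(x, 1)).derivative; // dx/dx = 1
}

const f = (x) => x.square();               // f(x) = x^2
console.log(grad(f)(3)); // 6

const h = (x) => x.square().mul(x).add(x); // h(x) = x^3 + x
console.log(grad(h)(2)); // 13
```

No power rule appears anywhere in the code; the derivative falls out of applying the sum and product rules to each operation as it runs.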

In this article, we got introduced to the Tensorflow.js library that brings machine learning to the web. We explored its APIs and discovered it is capable of all sorts of computations.

Tensors are great, but they are fairly low level, and ML practitioners rarely work with them directly; most work at a higher level of abstraction. In the next article, we will build our first ML model using TensorFlow.js.

Useful resources

Check out Coding train video on Tensorflow.js operations here.
Tensorflow.js API docs.
Check out this youtube playlist on linear algebra.
Check out this youtube video on differentiation.
To learn more about distributions, watch this.
To learn more about the normal distribution, watch this.